chore(deps): bump datasets from 2.15.0 to 2.16.0 #304

dependabot · 2023-12-24T18:06:54Z

Bumps datasets from 2.15.0 to 2.16.0.

Release notes

2.16.0

Security features

Add trust_remote_code argument by @lhoestq in huggingface/datasets#6429

Some Hugging Face datasets contain custom code which must be executed to correctly load the dataset. The code can be inspected in the repository content at https://hf.co/datasets/<repo_id>. A warning is shown to let the user know about the custom code, and they can avoid this message in future by passing the argument trust_remote_code=True.

Passing trust_remote_code=True will be mandatory to load these datasets from the next major release of datasets.

By setting the environment variable HF_DATASETS_TRUST_REMOTE_CODE=0, you can already disable custom code by default without waiting for the next release of datasets.
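A minimal sketch of how the two switches described above might be combined in a script. The helper name `load_with_explicit_trust` is hypothetical, and setting the variable before the import assumes that datasets reads it at import time; `load_dataset` and its `trust_remote_code` argument are the ones named in the release notes.

```python
import os

# Assumption: the variable is read when `datasets` is imported, so set it first.
# setdefault keeps any value already exported in the shell.
os.environ.setdefault("HF_DATASETS_TRUST_REMOTE_CODE", "0")


def load_with_explicit_trust(repo_id: str, trust: bool = False):
    """Hypothetical helper: load a dataset, stating explicitly whether custom code may run."""
    # Imported lazily so the module can be imported without the library installed.
    from datasets import load_dataset  # needs datasets >= 2.16.0

    return load_dataset(repo_id, trust_remote_code=trust)
```

Passing `trust=True` would only make sense after inspecting the dataset's code at https://hf.co/datasets/<repo_id>.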

Use parquet export if possible by @lhoestq in huggingface/datasets#6448

This allows loading most old datasets based on custom code by downloading the Parquet export provided by Hugging Face, without running that code.

You can see a dataset's Parquet export at https://hf.co/datasets/<repo_id>/tree/refs%2Fconvert%2Fparquet
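As an illustration, the URL pattern above can be built from a repository id. The function name `parquet_export_url` is hypothetical; the only facts taken from the release notes are the URL layout and the `refs/convert/parquet` ref, which is percent-encoded in the path.

```python
from urllib.parse import quote

PARQUET_REF = "refs/convert/parquet"


def parquet_export_url(repo_id: str) -> str:
    """Hypothetical helper: return the browser URL of a dataset's Parquet export."""
    # safe="" forces "/" inside the ref to be encoded as %2F.
    return f"https://hf.co/datasets/{repo_id}/tree/{quote(PARQUET_REF, safe='')}"
```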

Features

Webdataset dataset builder by @lhoestq in huggingface/datasets#6391

Implement get dataset default config name by @albertvillanova in huggingface/datasets#6511

Lazy data files resolution and offline cache reload by @lhoestq in huggingface/datasets#6493

This speeds up the load_dataset step that lists the data files of big repositories (by up to 100x) but requires huggingface_hub 0.20 or newer.

Fix load_dataset that used to reload data from cache even if the dataset was updated on Hugging Face

Reload a dataset from your cache even if you don't have internet connection

New cache directory scheme for no-script datasets: ~/.cache/huggingface/datasets/username___dataset_name/config_name/version/commit_sha

Backward compatibility: cached datasets from datasets 2.15 (using the old scheme) are still reloaded from cache.
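The new cache directory scheme can be sketched as a path builder. This only mirrors the layout quoted above; the function name and the `cache_root` parameter are hypothetical, and the library itself may normalize names or honor a different cache root.

```python
from pathlib import Path
from typing import Optional


def no_script_cache_dir(
    username: str,
    dataset_name: str,
    config_name: str,
    version: str,
    commit_sha: str,
    cache_root: Optional[str] = None,
) -> Path:
    """Hypothetical helper: build the cache path described in the release notes."""
    root = (
        Path(cache_root)
        if cache_root is not None
        else Path.home() / ".cache" / "huggingface" / "datasets"
    )
    # Layout: username___dataset_name/config_name/version/commit_sha
    return root / f"{username}___{dataset_name}" / config_name / version / commit_sha
```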

General improvements and bug fixes

Remove unused argument in _get_data_files_patterns by @lhoestq in huggingface/datasets#6343

Set usedforsecurity=False in hashlib methods (FIPS compliance) by @Wauplin in huggingface/datasets#6414

Use ruff for formatting by @mariosasko in huggingface/datasets#6434

Create DatasetNotFoundError and DataFilesNotFoundError by @albertvillanova in huggingface/datasets#6431

Fix multi gpu map example by @lhoestq in huggingface/datasets#6415

Better tqdm wrapper by @mariosasko in huggingface/datasets#6433

Remove Table.__getstate__ and Table.__setstate__ by @LZHgrla in huggingface/datasets#6444

Use filelock package for file locking by @mariosasko in huggingface/datasets#6445

Fix metadata file resolution when inferred pattern is ** by @mariosasko in huggingface/datasets#6449

Update hub-docs reference by @mishig25 in huggingface/datasets#6453

Refactor dill logic by @mariosasko in huggingface/datasets#6454

Don't require trust_remote_code in inspect_dataset by @lhoestq in huggingface/datasets#6456

[docs] troubleshooting guide by @MKhalusova in huggingface/datasets#6424

Missing DatasetNotFoundError by @lhoestq in huggingface/datasets#6462

Disable benchmarks in PRs by @lhoestq in huggingface/datasets#6463

More robust temporary directory deletion by @mariosasko in huggingface/datasets#6426

Fix shard retry mechanism in push_to_hub by @mariosasko in huggingface/datasets#6461

Use auth to get parquet export by @lhoestq in huggingface/datasets#6468

Remove delete doc CI by @lhoestq in huggingface/datasets#6471

Fix CI quality by @albertvillanova in huggingface/datasets#6473

Fix PermissionError on Windows CI by @albertvillanova in huggingface/datasets#6477

More robust preupload retry mechanism by @mariosasko in huggingface/datasets#6479

Add IterableDataset __repr__ by @lhoestq in huggingface/datasets#6480

Fix max lock length on unix by @lhoestq in huggingface/datasets#6482

Fix ArrayXD YAML conversion by @mariosasko in huggingface/datasets#6168

Fix docs phrasing about supported formats when sharing a dataset by @albertvillanova in huggingface/datasets#6486

... (truncated)

Commits

a85fb52 Release: 2.16.0 (#6527)
7b5fc58 Preserve order of configs and splits when using Parquet exports (#6526)
2afbf78 Cache backward compatibility with 2.15.0 (#6514)
e1b82ea fix tests (#6523)
ef3b5dd Lazy data files resolution and offline cache reload (#6493)
cf71653 Fix metrics dead link (#6491)
2246d31 fix get_metadata_patterns function args error (#6518)
0b2147a Support commit_description parameter in push_to_hub (#6520)
8b04288 Implement get dataset default config name (#6511)
a887ee7 Support push_to_hub canonical datasets (#6519)
Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

@dependabot rebase will rebase this PR
@dependabot recreate will recreate this PR, overwriting any edits that have been made to it
@dependabot merge will merge this PR after your CI passes on it
@dependabot squash and merge will squash and merge this PR after your CI passes on it
@dependabot cancel merge will cancel a previously requested merge and block automerging
@dependabot reopen will reopen this PR if it is closed
@dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
@dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
@dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
@dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
@dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

dependabot · 2023-12-24T18:06:55Z

The following labels could not be found: :game_die: dependencies, :robot: bot.

Bumps [datasets](https://github.com/huggingface/datasets) from 2.15.0 to 2.16.0.
- [Release notes](https://github.com/huggingface/datasets/releases)
- [Commits](huggingface/datasets@2.15.0...2.16.0)

---
updated-dependencies:
- dependency-name: datasets
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

pull-request-quantifier-deprecated · 2023-12-28T20:21:55Z

This PR has 15 quantified lines of changes. In general, a change size of up to 200 lines is ideal for the best PR experience!

Quantification details

Label      : Extra Small
Size       : +8 -7
Percentile : 6%

Total files changed: 2

Change summary by file extension:
.lock : +7 -6
.toml : +1 -1

Change counts above are quantified counts, based on the PullRequestQuantifier customizations.

Why proper sizing of changes matters

Optimal pull request sizes drive a more predictable PR flow, as they strike a
balance between PR complexity and PR review overhead. PRs within the
optimal size (typically small or medium-sized PRs) mean:

Fast and predictable releases to production:
- Optimal size changes are more likely to be reviewed faster with fewer
  iterations.
- Similarity in low PR complexity drives similar review times.
Review quality is likely higher as complexity is lower:
- Bugs are more likely to be detected.
- Code inconsistencies are more likely to be detected.
Knowledge sharing is improved within the participants:
- Small portions can be assimilated better.
Better engineering practices are exercised:
- Solving big problems by dividing them into well-contained, smaller problems.
- Exercising separation of concerns within the code changes.

What can I do to optimize my changes

Use the PullRequestQuantifier to quantify your PR accurately
- Create a context profile for your repo using the context generator
- Exclude files that are not necessary to be reviewed or do not increase the review complexity. Example: Autogenerated code, docs, project IDE setting files, binaries, etc. Check out the Excluded section from your prquantifier.yaml context profile.
- Understand your typical change complexity, drive towards the desired complexity by adjusting the label mapping in your prquantifier.yaml context profile.
- Only use the labels that matter to you, see context specification to customize your prquantifier.yaml context profile.
Change your engineering behaviors
- For PRs that fall outside of the desired spectrum, review the details and check if:
  - Your PR could be split into smaller, self-contained PRs instead
  - Your PR only solves one particular issue. (For example, don't refactor and code new features in the same PR).

How to interpret the change counts in git diff output

One line was added: +1 -0
One line was deleted: +0 -1
One line was modified: +1 -1 (git diff doesn't know about modifications, so it
interprets that line as one addition plus one deletion)
Change percentiles: Change characteristics (addition, deletion, modification)
of this PR in relation to all other PRs within the repository.
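The counting rule above can be reproduced with a small sketch using Python's standard `difflib`. This is not how git or PullRequestQuantifier compute their numbers; `count_changes` is a hypothetical helper that only illustrates why a modified line shows up as +1 -1.

```python
import difflib
from typing import List, Tuple


def count_changes(old: List[str], new: List[str]) -> Tuple[int, int]:
    """Return (added, deleted) line counts between two versions of a file."""
    added = deleted = 0
    for line in difflib.ndiff(old, new):
        # ndiff prefixes: "+ " added, "- " deleted, "  " unchanged, "? " hint lines.
        if line.startswith("+ "):
            added += 1
        elif line.startswith("- "):
            deleted += 1
    return added, deleted
```

For example, `count_changes(["a", "b"], ["a", "B"])` counts the changed second line once as an addition and once as a deletion.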


pull-request-quantifier-deprecated bot added the Extra Small label Dec 24, 2023

zube bot added the [zube]: Inbox label Dec 24, 2023

dependabot bot force-pushed the dependabot/pip/datasets-2.16.0 branch from fc93118 to 81ae45e Compare December 28, 2023 20:21

entelecheia self-requested a review December 28, 2023 20:22

entelecheia approved these changes Dec 28, 2023


entelecheia merged commit 4f20f88 into main Dec 28, 2023
2 checks passed

zube bot added [zube]: Done and removed [zube]: Inbox labels Dec 28, 2023

entelecheia deleted the dependabot/pip/datasets-2.16.0 branch December 28, 2023 20:22

zube bot removed the [zube]: Done label Mar 28, 2024
