This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Add HuggingfaceDatasetReader for using Huggingface datasets #5095

Open

ghost wants to merge 67 commits into main from datasets_feature

Conversation

@ghost commented Apr 4, 2021

Added a new reader that reads Hugging Face datasets as AllenNLP Instances
Mapped a limited set of datasets.features to allennlp.data.fields

Verified against selected datasets and/or dataset configurations for the training split, as listed in the reader's documentation comments.

datasets==1.5.0

Joint work with @divijbajaj @annajung @prajaktakini-vmware @agururajvais

Signed-off-by: Abhishek P (VMware) pab@vmware.com

Fixes #4962

Changes proposed in this pull request:
Introduce a new reader that wraps Hugging Face datasets and provides Instances for a chosen split of the dataset, with an optional dataset configuration.
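
For illustration only, here is a minimal sketch of how such a reader might be used; the import path, constructor arguments, and read() signature are assumptions, not necessarily the final API:

```python
# Hypothetical usage sketch; HuggingfaceDatasetReader's import path and
# argument names are assumptions for illustration, not the merged API.
from allennlp.data.dataset_readers import HuggingfaceDatasetReader

# Wrap the "cola" configuration of the "glue" dataset.
reader = HuggingfaceDatasetReader(dataset_name="glue", config_name="cola")

# Read the training split and get AllenNLP Instances built from datasets.features.
for instance in reader.read("train"):
    print(instance)
```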

@ghost ghost marked this pull request as draft April 4, 2021 17:14
@ghost ghost force-pushed the datasets_feature branch from aa3aefa to 78a1487 Compare April 4, 2021 17:54
@ghost (Author) commented Apr 5, 2021

Looking at the lint and format issues. Will address them hopefully today.

@ghost ghost force-pushed the datasets_feature branch 2 times, most recently from ebdf5fa to 990388f Compare April 5, 2021 17:16
@ghost (Author) commented Apr 5, 2021

The new tests pass locally:
tests/data/dataset_readers/huggingface_datasets_test.py ........ . [ 25%]

But since they download data, it is quite possible they will fail in CI. I will try to pinpoint the cause, but this should be okay for the draft.

@dirkgr (Member) commented Apr 5, 2021

It's OK for tests to do downloads, but it's better if the downloads are small. Is there a tiny dataset we can use? Maybe we upload one just for this purpose?

@dirkgr (Member) left a review comment

I'm really excited to have this. I've been wanting to do this since HF came out with the datasets project.

setup.py (review thread: outdated, resolved)
@albertvillanova commented

Thanks for this PR, @pab-vmware, @divijbajaj, @annajung, @pkini-vmware and @agururajvais.

And thanks @dirkgr for your thorough review.

Just let me know if there is anything we can do on Hugging Face Datasets to help you with the integration. 🤗

@ghost (Author) commented Apr 6, 2021

> It's OK for tests to do downloads, but it's better if the downloads are small. Is there a tiny dataset we can use? Maybe we upload one just for this purpose?

Nice idea. We can create a custom dataset and build tests around it to validate the mapping of datasets.features to allennlp.data.fields.

In the meantime I noticed that glue is small, so I will change the unit tests to use only that.

We can probably add sanity scripts that test against specific important datasets to validate support for them, independent of the unit tests.

Let me create one such dataset as part of this work. I will ask where to place it (my repo or somewhere else) once I have it.
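
As a rough illustration of what such a glue-based unit test could look like (the reader name and its read() signature are assumptions here; only the datasets.load_dataset call is the real Hugging Face API):

```python
# Sketch of a small download-based unit test; HuggingfaceDatasetReader's import
# path and read() signature are assumptions. Only datasets.load_dataset is real API.
import datasets
from allennlp.data.dataset_readers import HuggingfaceDatasetReader  # hypothetical path

def test_read_glue_cola_train():
    expected = datasets.load_dataset("glue", "cola", split="train")
    reader = HuggingfaceDatasetReader(dataset_name="glue", config_name="cola")
    instances = list(reader.read("train"))
    # The reader should yield one AllenNLP Instance per dataset row.
    assert len(instances) == len(expected)
```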

@epwalsh (Member) left a review comment

This looks really good so far!

@ghost ghost force-pushed the datasets_feature branch 2 times, most recently from ad78b69 to 977c0b2 Compare April 7, 2021 16:17
@ghost ghost force-pushed the datasets_feature branch 2 times, most recently from 6b3e402 to f77cfa3 Compare April 7, 2021 16:57
@ghost (Author) commented Apr 7, 2021

Looks like I need to make more changes to get this working even for glue. I will report back once it is done.

@ghost (Author) commented Apr 8, 2021

This is now working for three datasets, with a couple of different feature types.

Type-check failures still need to be addressed.

Samples from two of the datasets follow; xnli is a little too big for a comment.

glue-cola:
From:{'idx': 0, 'label': 1, 'sentence': "Our friends won't buy this analysis, let alone the next one we propose."}
To:
Instance with fields:
 	 sentence: TextField of length 1 with text: 
 		[Our friends won't buy this analysis, let alone the next one we propose.]
 
 	 label: LabelField with label: 1 in namespace: 'label'. 
 	 idx: LabelField with label: 0 in namespace: 'idx'. 


_universal_dependencies - en_lines:_
From:
{'deprel': ['mark', 'det', 'nsubj', 'advcl', 'case', 'det', 'compound', 'flat', 'obl', 'case', 'det', 'compound', 'flat', 'nmod', 'punct', 'det', 'nsubj:pass', 'aux:pass', 'root', 'case', 'det', 'compound', 'flat', 'compound', 'obl', 'punct'], 'deps': ['None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None'], 'feats': ["{'Case': 'Nom'}", "{'Definite': 'Ind', 'PronType': 'Art'}", "{'Number': 'Sing'}", "{'Mood': 'Ind', 'Number': 'Sing', 'Person': '3', 'Tense': 'Pres', 'VerbForm': 'Fin'}", 'None', "{'Definite': 'Def', 'PronType': 'Art'}", "{'Number': 'Sing'}", "{'Number': 'Sing'}", "{'Number': 'Sing'}", 'None', "{'Definite': 'Ind', 'PronType': 'Art'}", "{'Number': 'Sing'}", "{'Case': 'Nom'}", "{'Number': 'Sing'}", 'None', "{'Definite': 'Def', 'PronType': 'Art'}", "{'Number': 'Sing'}", "{'Mood': 'Ind', 'Number': 'Sing', 'Person': '3', 'Tense': 'Pres', 'VerbForm': 'Fin'}", "{'Tense': 'Past', 'VerbForm': 'Part', 'Voice': 'Pass'}", 'None', "{'Definite': 'Ind', 'PronType': 'Art'}", "{'Case': 'Nom'}", "{'Case': 'Nom'}", "{'Number': 'Sing'}", "{'Number': 'Sing'}", 'None'], 'head': ['4', '3', '4', '19', '9', '9', '9', '7', '4', '14', '14', '14', '12', '9', '4', '17', '19', '19', '0', '25', '25', '25', '22', '25', '19', '19'], 'idx': 'en_lines-ud-dev-doc1-3177', 'lemmas': ['when', 'a', 'user', 'connect', 'to', 'the', 'SQL', 'server', 'database', 'through', 'a', 'Microsoft', 'Access', 'project', ',', 'the', 'connection', 'be', 'enable', 'through', 'a', 'Windows', 'NT', 'user', 'account', '.'], 'misc': ['None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', "{'SpaceAfter': 'No'}", 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', "{'SpaceAfter': 'No'}", 'None'], 'text': 'When a user connects to the SQL Server database through a Microsoft Access project, the connection is enabled through a Windows NT user account.', 'tokens': ['When', 'a', 'user', 'connects', 'to', 'the', 'SQL', 'Server', 'database', 'through', 'a', 'Microsoft', 'Access', 'project', ',', 'the', 'connection', 'is', 'enabled', 'through', 'a', 'Windows', 'NT', 'user', 'account', '.'], 'upos': [5, 8, 0, 16, 2, 8, 10, 0, 0, 2, 8, 10, 10, 0, 1, 8, 0, 17, 16, 2, 8, 10, 10, 0, 0, 1], 'xpos': [None, 'IND-SG', 'SG-NOM', 'PRES', None, 'DEF', 'SG-NOM', 'SG-NOM', 'SG-NOM', None, 'IND-SG', 'SG-NOM', 'SG-NOM', 'SG-NOM', 'Comma', 'DEF', 'SG-NOM', 'PRES-AUX', 'PASS', None, 'IND-SG', 'SG-NOM', 'SG-NOM', 'SG-NOM', 'SG-NOM', 'Period']}

To:
{'idx': <allennlp.data.fields.text_field.TextField object at 0x147f4b0c0>, 'text': <allennlp.data.fields.text_field.TextField object at 0x147c7a740>, 'tokens': <allennlp.data.fields.list_field.ListField object at 0x145b8c040>, 'lemmas': <allennlp.data.fields.list_field.ListField object at 0x145b8c820>, 'upos': <allennlp.data.fields.list_field.ListField object at 0x145b8c1f0>, 'xpos': <allennlp.data.fields.list_field.ListField object at 0x145b8cd60>, 'feats': <allennlp.data.fields.list_field.ListField object at 0x147ed3a00>, 'head': <allennlp.data.fields.list_field.ListField object at 0x147ed30d0>, 'deprel': <allennlp.data.fields.list_field.ListField object at 0x147ed39d0>, 'deps': <allennlp.data.fields.list_field.ListField object at 0x147ed3a30>, 'misc': <allennlp.data.fields.list_field.ListField object at 0x147ed3af0>}
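
To make the mapping shown above concrete, here is a rough sketch of how datasets.features could be translated into allennlp.data.fields; the helper function and the whitespace tokenization are illustrative assumptions, not this PR's actual implementation:

```python
# Illustrative sketch of a features -> fields mapping; not the PR's actual code.
from datasets.features import ClassLabel, Sequence, Value
from allennlp.data.fields import Field, LabelField, ListField, TextField
from allennlp.data.tokenizers import Token

def map_feature(value, feature) -> Field:
    if isinstance(feature, ClassLabel):
        # Integer class ids map naturally onto LabelField.
        return LabelField(value, skip_indexing=True)
    if isinstance(feature, Value) and feature.dtype == "string":
        # Tokenization strategy is an open question; whitespace splitting is a placeholder.
        return TextField([Token(t) for t in str(value).split()])
    if isinstance(feature, Sequence):
        # Sequences become ListFields of recursively mapped inner fields.
        return ListField([map_feature(v, feature.feature) for v in value])
    raise ValueError(f"Unsupported feature type: {feature}")
```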

@ghost ghost changed the title Add HuggingfaceDatasetSplitReader for using Huggingface datasets Add HuggingfaceDatasetReader for using Huggingface datasets Apr 10, 2021
@ghost ghost force-pushed the datasets_feature branch 3 times, most recently from 55c0124 to d082e55 Compare April 10, 2021 16:44
@ghost (Author) commented Apr 10, 2021

I have a working implementation with a limited features -> fields mapping. Let me work on getting the code coverage up for the diff, even if I have to use a slightly heavier dataset at this point.
I will also look at adding a custom HF dataset so we can use it for testing purposes.

I will try to increase the number of features we map, but is the current change (once coverage is up) sufficient to go in?

@epwalsh (Member) commented Apr 12, 2021

> I will also look at adding a custom HF dataset so we can use it for testing purposes.

That would be great!

@dirkgr (Member) commented Apr 13, 2021

What's the status of this? Is it close? Anything you need? I'd love to have this in, maybe even for the 2.3 release!

@ghost (Author) commented Apr 15, 2021

Hi Dirk, now that the coverage is up to 90%, I think this PR can go in, even though the implementation is still slightly raw.
Let me re-check the known supported datasets/configurations and update the list.

There are things that still need to be done; for most of them I have added TODOs, and I will pick them up one by one in the coming weeks.

  • Clean up the code so we have a mapper class/method for each of the features
  • Support nested features (linked to the above)
  • Add a custom lightweight dataset and configs and use them in the unit tests
  • Add a test that compares the new reader against the existing conll2003 reader; this would validate the reader in a practical manner (see the sketch below)
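
A rough sketch of that last comparison test, assuming the hypothetical reader API from above; the import paths, split names, and file path are assumptions, not verified APIs:

```python
# Hypothetical sanity check comparing the HF-backed reader with the existing
# CoNLL-2003 reader; both import paths below are assumptions for illustration.
from allennlp.data.dataset_readers import HuggingfaceDatasetReader  # hypothetical path
from allennlp_models.tagging import Conll2003DatasetReader  # assumed re-export

def compare_against_conll2003(conll_validation_file: str) -> None:
    hf_instances = list(HuggingfaceDatasetReader(dataset_name="conll2003").read("validation"))
    reference_instances = list(Conll2003DatasetReader().read(conll_validation_file))
    # At a minimum, both readers should produce the same number of instances.
    assert len(hf_instances) == len(reference_instances)
```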

@ghost ghost marked this pull request as ready for review April 15, 2021 03:13
@ghost (Author) commented Apr 15, 2021

I have updated the datasets list and added validation tests that are skipped for now. I squashed all changes into a single commit. Please take a look when time permits.

I will work on cleaning up the code and adding nested-feature support.
If the dataset downloads during tests become problematic, I will prioritize creating the custom dataset.

@dirkgr (Member) commented Apr 19, 2021

> Squashed all changes into a single commit.

Don't worry about doing that for our sake. It'll get squashed when we merge anyway. In fact, now I have no way of seeing just the changes you made since the last time I reviewed this.

epwalsh and others added 24 commits August 11, 2021 09:07
* fix race condition when extracting files with cached_path

* add warning when directory already exists
Bumps [checklist](https://github.com/marcotcr/checklist) from 0.0.10 to 0.0.11.
- [Release notes](https://github.com/marcotcr/checklist/releases)
- [Commits](https://github.com/marcotcr/checklist/commits)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Akshita Bhagia <akshita23bhagia@gmail.com>
Co-authored-by: Dirk Groeneveld <dirkg@allenai.org>
…5221)

* ADD: add from_pretrained method for vocab

* MOD: test format

* MOD: format file

* MOD: update changelog

* MOD: fix bug

* MOD: fix bug

* MOD: fix typo

* MOD: make the mothod in class

* MOD: fix bug

* MOD: change to instance method

* MOD: fix typo

* MOD: fix bug

* MOD: change oov to avoid bug

* Update allennlp/data/vocabulary.py

* Update allennlp/data/vocabulary.py

Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com>

* Update allennlp/data/vocabulary.py

Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com>

* Update allennlp/data/vocabulary.py

Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com>

* MOD: fix formate

* MOD: add test case

* Update CHANGELOG.md

* MOD: fix worker info bug

* ADD: update changelog

* MOD: fix format

* Update allennlp/data/data_loaders/multitask_data_loader.py

Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com>

* Update CHANGELOG.md

Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com>

* MOD: add demo code

* MOD: align code

* MOD: fix bug

* MOD: fix bug

* MOD: fix bug

* MOD: formate code

* Update allennlp/data/data_loaders/data_collator.py

Co-authored-by: Pete <epwalsh10@gmail.com>

* fix error

* MOD: add test code

* mod: change tokenizer

* mod: fix tokenizer

* MOD: fix bug

* MOD: fix bug

* MOD: fix bug

* Update allennlp/data/data_loaders/data_collator.py

Co-authored-by: Dirk Groeneveld <groeneveld@gmail.com>

* MOD: update changelog

* MOD: update change log

* Update allennlp/data/data_loaders/data_collator.py

We should be using underscores for everything.

* Formatting

Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com>
Co-authored-by: Dirk Groeneveld <dirkg@allenai.org>
Co-authored-by: Dirk Groeneveld <groeneveld@gmail.com>
Adding support for inputs to the backbone with more than 3 dimensions
* Removes unused variable

* Formatting

* Make sure we always restore the model's weights properly

* Give TrainerCallbacks the ability to save and load state dicts

* Give MovingAverage the ability to save and load state dicts

* Do not set gradients to None

* Typo

* Remove unused variable

* Typo

* Entirely new checkpointing code

* Formatting

* Make mypy happy

lol

* Makes the no-op trainer work with the new checkpointer

* Mark epochs as completed when they're skipped

* Changelog

* Fixes how we get the best weights after a training run

* Mypy is annoying

* Callback fixes

* Fix the no op trainer

* Simplify

* Assorted checkpointer fixes

* Mypy is now happy

* Fixed all the tests except for one

* Removed unused variable

* Fix trainer restore logic

* Fix test for trainer restore logic

* Check the Checkpointing branch of the models repo

* Help mypy along

* Fixed finalizing logic

* More mypy stuff

* Update allennlp/training/checkpointer.py

Co-authored-by: Pete <petew@allenai.org>

* Make weaker claims

Co-authored-by: Pete <petew@allenai.org>
* Implementing blocking repeated ngrams

* Adding comment

* Adding unit tests for the end to end beam search

* Renaming class

* Adding comment about  function

* Simplifying indexing to variable

* Refactoring the state copying into the  class

* Reformatting

* Editing changelog

* fix line too long

* comments

* doc updates

Co-authored-by: Pete <petew@allenai.org>
Co-authored-by: epwalsh <epwalsh10@gmail.com>
* Make BeamSearch Registrable

* Update changelog

* Remove unused import

* Update CHANGELOG.md

Co-authored-by: Pete <petew@allenai.org>
Co-authored-by: Pete <epwalsh10@gmail.com>
* initial commit

* general self attn

* fixing bugs, adding tests, adding docs

* updating other modules

* refactor

* bug fix

* update changelog

* fix shape

* fix format

* address feedback

* small doc fix

* Update allennlp/modules/transformer/transformer_stack.py

Co-authored-by: Pete <petew@allenai.org>

* remove old file

Co-authored-by: epwalsh <epwalsh10@gmail.com>
Co-authored-by: Pete <petew@allenai.org>
* Fix tqdm logging into multiple files with allennlp-optuna

* Update changelog

* Add unittest for resetting tqdm logger handlers

Co-authored-by: Pete <petew@allenai.org>
* bug fix

* common lexicons

* update changelog

* Update CHANGELOG.md
* added linear and hard debiasers

* worked on documentation

* committing changes before branch switch

* committing changes before switching branch

* finished bias direction, linear and hard debiasers, need to write tests

* finished bias direction test

* Commiting changes before switching branch

* finished hard and linear debiasers

* finished OSCaR

* bias mitigators tests and bias metrics remaining

* added bias mitigator tests

* added bias mitigator tests

* finished tests for bias mitigation methods

* fixed gpu issues

* fixed gpu issues

* fixed gpu issues

* resolve issue with count_nonzero not being differentiable

* added more references

* fairness during finetuning

* finished bias mitigator wrapper

* added reference

* updated CHANGELOG and fixed minor docs issues

* move id tensors to embedding device

* fixed to use predetermined bias direction

* fixed minor doc errors

* snli reader registration issue

* fixed _pretrained from params issue

* fixed device issues

* evaluate bias mitigation initial commit

* finished evaluate bias mitigation

* handles multiline prediction files

* fixed minor bugs

* fixed minor bugs

* improved prediction diff JSON format

* forgot to resolve a conflict

* Refactored evaluate bias mitigation to use NLI metric

* Added SNLIPredictionsDiff class

* ensured dataloader is same for bias mitigated and baseline models

* finished evaluate bias mitigation

* Update CHANGELOG.md

* Replaced local data files with github raw content links

* Update allennlp/fairness/bias_mitigator_applicator.py

Co-authored-by: Pete <petew@allenai.org>

* deleted evaluate_bias_mitigation from git tracking

* removed evaluate-bias-mitigation instances from rest of repo

* addressed Akshita's comments

* moved bias mitigator applicator test to allennlp-models

* removed unnecessary files

Co-authored-by: Arjun Subramonian <arjuns@Arjuns-MacBook-Pro.local>
Co-authored-by: Arjun Subramonian <arjuns@ip-192-168-0-106.us-west-2.compute.internal>
Co-authored-by: Arjun Subramonian <arjuns@ip-192-168-0-108.us-west-2.compute.internal>
Co-authored-by: Arjun Subramonian <arjuns@ip-192-168-1-108.us-west-2.compute.internal>
Co-authored-by: Akshita Bhagia <akshita23bhagia@gmail.com>
Co-authored-by: Pete <petew@allenai.org>
Bumps [black](https://github.com/psf/black) from 21.5b1 to 21.5b2.
- [Release notes](https://github.com/psf/black/releases)
- [Changelog](https://github.com/psf/black/blob/main/CHANGES.md)
- [Commits](https://github.com/psf/black/commits)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…gins()` (allenai#5246)

* ensure allennlp is a default plugin

* fix logging issue

* fixes

* actually fix
* added BackwardCallback

* finished tests

* fixed linting issue

* revised design per Dirk's suggestion

* added OnBackwardException, changed loss to batch_ouputs, etc.

Co-authored-by: Arjun Subramonian <arjuns@Arjuns-MacBook-Pro.local>
@wolhandlerdeb commented

Hi,
I would be glad to know whether you are going to add this code to the allennlp library. It would be very helpful.
Thanks

@Abhishek-P commented

> Hi, I would be glad to know if you add this code to the allennlp library. It would be very helpful. Thanks

Hi @wolhandlerdeb,
I am a little strapped for bandwidth, so this has been at the bottom of my personal queue.
It still needs some work before it is up to the standards for check-in described in #5095 (comment).

Please feel free to clone the changes and continue the work.
I plan to get back to it at some point, although I am not sure exactly when.


Successfully merging this pull request may close these issues.

Integration with Huggingface datasets