This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Add HuggingfaceDatasetReader for using Huggingface datasets #5095

Open

ghost wants to merge 67 commits into main from datasets_feature

Conversation

@ghost commented Apr 4, 2021

Added a new reader that reads Hugging Face datasets as AllenNLP Instances
Mapped a limited set of datasets.features to allennlp.data.fields

Verified against selected datasets and/or dataset configurations for the training split, as listed in the reader's documentation comments.

datasets==1.5.0

Joint work with @divijbajaj @annajung @prajaktakini-vmware @agururajvais

Signed-off-by: Abhishek P (VMware) pab@vmware.com

Fixes #4962

Changes proposed in this pull request:
Introduce a new reader that wraps Hugging Face datasets and provides Instances for a chosen split of the dataset, with an optional dataset configuration.
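
For illustration only, here is a minimal sketch of how such a reader might be used; the import path, constructor arguments, and read() signature are assumptions, not necessarily the final API:

```python
# Hypothetical usage sketch; HuggingfaceDatasetReader's import path and
# argument names are assumptions for illustration, not the merged API.
from allennlp.data.dataset_readers import HuggingfaceDatasetReader

# Wrap the "cola" configuration of the "glue" dataset.
reader = HuggingfaceDatasetReader(dataset_name="glue", config_name="cola")

# Read the training split and get AllenNLP Instances built from datasets.features.
for instance in reader.read("train"):
    print(instance)
```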

@ghost ghost marked this pull request as draft April 4, 2021 17:14
@ghost ghost force-pushed the datasets_feature branch from aa3aefa to 78a1487 Compare April 4, 2021 17:54
@ghost (Author) commented Apr 5, 2021

Looking at the lint and format issues. Will address them hopefully today.

@ghost ghost force-pushed the datasets_feature branch 2 times, most recently from ebdf5fa to 990388f Compare April 5, 2021 17:16
@ghost (Author) commented Apr 5, 2021

The new tests pass locally:
tests/data/dataset_readers/huggingface_datasets_test.py ........ . [ 25%]

But since they download data, it is quite possible they will fail in CI. I will try to pinpoint the cause, but this should be okay for the draft.

@dirkgr (Member) commented Apr 5, 2021

It's OK for tests to do downloads, but it's better if the downloads are small. Is there a tiny dataset we can use? Maybe we upload one just for this purpose?

@dirkgr (Member) left a review comment

I'm really excited to have this. I've been wanting to do this since HF came out with the datasets project.

setup.py (review thread: outdated, resolved)
@albertvillanova commented

Thanks for this PR, @pab-vmware, @divijbajaj, @annajung, @pkini-vmware and @agururajvais.

And thanks @dirkgr for your thorough review.

Just let me know if there is anything we can do on Hugging Face Datasets to help you with the integration. 🤗

@ghost (Author) commented Apr 6, 2021

> It's OK for tests to do downloads, but it's better if the downloads are small. Is there a tiny dataset we can use? Maybe we upload one just for this purpose?

Nice idea. We can create a custom dataset and build tests around it to validate the mapping of datasets.features to allennlp.data.fields.

In the meantime I noticed that glue is small, so I will change the unit tests to use only that.

We can probably add sanity scripts that test against specific important datasets to validate support for them, independent of the unit tests.

Let me create one such dataset as part of this work. I will ask where to place it (my repo or somewhere else) once I have it.
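
As a rough illustration of what such a glue-based unit test could look like (the reader name and its read() signature are assumptions here; only the datasets.load_dataset call is the real Hugging Face API):

```python
# Sketch of a small download-based unit test; HuggingfaceDatasetReader's import
# path and read() signature are assumptions. Only datasets.load_dataset is real API.
import datasets
from allennlp.data.dataset_readers import HuggingfaceDatasetReader  # hypothetical path

def test_read_glue_cola_train():
    expected = datasets.load_dataset("glue", "cola", split="train")
    reader = HuggingfaceDatasetReader(dataset_name="glue", config_name="cola")
    instances = list(reader.read("train"))
    # The reader should yield one AllenNLP Instance per dataset row.
    assert len(instances) == len(expected)
```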

@epwalsh (Member) left a review comment

This looks really good so far!

@ghost ghost force-pushed the datasets_feature branch 2 times, most recently from ad78b69 to 977c0b2 Compare April 7, 2021 16:17
@ghost ghost force-pushed the datasets_feature branch 2 times, most recently from 6b3e402 to f77cfa3 Compare April 7, 2021 16:57
@ghost (Author) commented Apr 7, 2021

Looks like I need to make more changes to get this working even for glue. I will report back once it is done.

@ghost (Author) commented Apr 8, 2021

This is now working for three datasets, with a couple of different feature types.

Type-check failures still need to be addressed.

Samples from two of the datasets follow; xnli is a little too big for a comment.

glue-cola:
From:{'idx': 0, 'label': 1, 'sentence': "Our friends won't buy this analysis, let alone the next one we propose."}
To:
Instance with fields:
 	 sentence: TextField of length 1 with text: 
 		[Our friends won't buy this analysis, let alone the next one we propose.]
 
 	 label: LabelField with label: 1 in namespace: 'label'. 
 	 idx: LabelField with label: 0 in namespace: 'idx'. 


_universal_dependencies - en_lines:_
From:
{'deprel': ['mark', 'det', 'nsubj', 'advcl', 'case', 'det', 'compound', 'flat', 'obl', 'case', 'det', 'compound', 'flat', 'nmod', 'punct', 'det', 'nsubj:pass', 'aux:pass', 'root', 'case', 'det', 'compound', 'flat', 'compound', 'obl', 'punct'], 'deps': ['None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None'], 'feats': ["{'Case': 'Nom'}", "{'Definite': 'Ind', 'PronType': 'Art'}", "{'Number': 'Sing'}", "{'Mood': 'Ind', 'Number': 'Sing', 'Person': '3', 'Tense': 'Pres', 'VerbForm': 'Fin'}", 'None', "{'Definite': 'Def', 'PronType': 'Art'}", "{'Number': 'Sing'}", "{'Number': 'Sing'}", "{'Number': 'Sing'}", 'None', "{'Definite': 'Ind', 'PronType': 'Art'}", "{'Number': 'Sing'}", "{'Case': 'Nom'}", "{'Number': 'Sing'}", 'None', "{'Definite': 'Def', 'PronType': 'Art'}", "{'Number': 'Sing'}", "{'Mood': 'Ind', 'Number': 'Sing', 'Person': '3', 'Tense': 'Pres', 'VerbForm': 'Fin'}", "{'Tense': 'Past', 'VerbForm': 'Part', 'Voice': 'Pass'}", 'None', "{'Definite': 'Ind', 'PronType': 'Art'}", "{'Case': 'Nom'}", "{'Case': 'Nom'}", "{'Number': 'Sing'}", "{'Number': 'Sing'}", 'None'], 'head': ['4', '3', '4', '19', '9', '9', '9', '7', '4', '14', '14', '14', '12', '9', '4', '17', '19', '19', '0', '25', '25', '25', '22', '25', '19', '19'], 'idx': 'en_lines-ud-dev-doc1-3177', 'lemmas': ['when', 'a', 'user', 'connect', 'to', 'the', 'SQL', 'server', 'database', 'through', 'a', 'Microsoft', 'Access', 'project', ',', 'the', 'connection', 'be', 'enable', 'through', 'a', 'Windows', 'NT', 'user', 'account', '.'], 'misc': ['None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', "{'SpaceAfter': 'No'}", 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', "{'SpaceAfter': 'No'}", 'None'], 'text': 'When a user connects to the SQL Server database through a Microsoft Access project, the connection is enabled through a Windows NT user account.', 'tokens': ['When', 'a', 'user', 'connects', 'to', 'the', 'SQL', 'Server', 'database', 'through', 'a', 'Microsoft', 'Access', 'project', ',', 'the', 'connection', 'is', 'enabled', 'through', 'a', 'Windows', 'NT', 'user', 'account', '.'], 'upos': [5, 8, 0, 16, 2, 8, 10, 0, 0, 2, 8, 10, 10, 0, 1, 8, 0, 17, 16, 2, 8, 10, 10, 0, 0, 1], 'xpos': [None, 'IND-SG', 'SG-NOM', 'PRES', None, 'DEF', 'SG-NOM', 'SG-NOM', 'SG-NOM', None, 'IND-SG', 'SG-NOM', 'SG-NOM', 'SG-NOM', 'Comma', 'DEF', 'SG-NOM', 'PRES-AUX', 'PASS', None, 'IND-SG', 'SG-NOM', 'SG-NOM', 'SG-NOM', 'SG-NOM', 'Period']}

To:
{'idx': <allennlp.data.fields.text_field.TextField object at 0x147f4b0c0>, 'text': <allennlp.data.fields.text_field.TextField object at 0x147c7a740>, 'tokens': <allennlp.data.fields.list_field.ListField object at 0x145b8c040>, 'lemmas': <allennlp.data.fields.list_field.ListField object at 0x145b8c820>, 'upos': <allennlp.data.fields.list_field.ListField object at 0x145b8c1f0>, 'xpos': <allennlp.data.fields.list_field.ListField object at 0x145b8cd60>, 'feats': <allennlp.data.fields.list_field.ListField object at 0x147ed3a00>, 'head': <allennlp.data.fields.list_field.ListField object at 0x147ed30d0>, 'deprel': <allennlp.data.fields.list_field.ListField object at 0x147ed39d0>, 'deps': <allennlp.data.fields.list_field.ListField object at 0x147ed3a30>, 'misc': <allennlp.data.fields.list_field.ListField object at 0x147ed3af0>}
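
To make the mapping shown above concrete, here is a rough sketch of how datasets.features could be translated into allennlp.data.fields; the helper function and the whitespace tokenization are illustrative assumptions, not this PR's actual implementation:

```python
# Illustrative sketch of a features -> fields mapping; not the PR's actual code.
from datasets.features import ClassLabel, Sequence, Value
from allennlp.data.fields import Field, LabelField, ListField, TextField
from allennlp.data.tokenizers import Token

def map_feature(value, feature) -> Field:
    if isinstance(feature, ClassLabel):
        # Integer class ids map naturally onto LabelField.
        return LabelField(value, skip_indexing=True)
    if isinstance(feature, Value) and feature.dtype == "string":
        # Tokenization strategy is an open question; whitespace splitting is a placeholder.
        return TextField([Token(t) for t in str(value).split()])
    if isinstance(feature, Sequence):
        # Sequences become ListFields of recursively mapped inner fields.
        return ListField([map_feature(v, feature.feature) for v in value])
    raise ValueError(f"Unsupported feature type: {feature}")
```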

@ghost ghost changed the title Add HuggingfaceDatasetSplitReader for using Huggingface datasets Add HuggingfaceDatasetReader for using Huggingface datasets Apr 10, 2021
@ghost ghost force-pushed the datasets_feature branch 3 times, most recently from 55c0124 to d082e55 Compare April 10, 2021 16:44
@ghost (Author) commented Apr 10, 2021

I have a working implementation with a limited features -> fields mapping. Let me work on getting the code coverage up for the diff, even if I have to use a slightly heavier dataset at this point.
I will also look at adding a custom HF dataset so we can use it for testing purposes.

I will try to increase the number of features we map, but is the current change (once coverage is up) sufficient to go in?

@epwalsh (Member) commented Apr 12, 2021

> I will also look at adding a custom HF dataset so we can use it for testing purposes.

That would be great!

@dirkgr (Member) commented Apr 13, 2021

What's the status of this? Is it close? Anything you need? I'd love to have this in, maybe even for the 2.3 release!

@ghost (Author) commented Apr 15, 2021

Hi Dirk, now that the coverage is up to 90%, I think this PR can go in, even though the implementation is still slightly raw.
Let me re-check the known supported datasets/configurations and update the list.

There are things that still need to be done; for most of them I have added TODOs, and I will pick them up one by one in the coming weeks.

  • Clean up the code so we have a mapper class/method for each of the features
  • Support nested features (linked to the above)
  • Add a custom lightweight dataset and configs and use them in the unit tests
  • Add a test that compares the new reader against the existing conll2003 reader; this would validate the reader in a practical manner (see the sketch below)
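
A rough sketch of that last comparison test, assuming the hypothetical reader API from above; the import paths, split names, and file path are assumptions, not verified APIs:

```python
# Hypothetical sanity check comparing the HF-backed reader with the existing
# CoNLL-2003 reader; both import paths below are assumptions for illustration.
from allennlp.data.dataset_readers import HuggingfaceDatasetReader  # hypothetical path
from allennlp_models.tagging import Conll2003DatasetReader  # assumed re-export

def compare_against_conll2003(conll_validation_file: str) -> None:
    hf_instances = list(HuggingfaceDatasetReader(dataset_name="conll2003").read("validation"))
    reference_instances = list(Conll2003DatasetReader().read(conll_validation_file))
    # At a minimum, both readers should produce the same number of instances.
    assert len(hf_instances) == len(reference_instances)
```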

@ghost ghost marked this pull request as ready for review April 15, 2021 03:13
@ghost (Author) commented Apr 15, 2021

I have updated the datasets list and added validation tests that are skipped for now. I squashed all changes into a single commit. Please take a look when time permits.

I will work on cleaning up the code and adding nested-feature support.
If the dataset downloads during tests become problematic, I will prioritize creating the custom dataset.

@dirkgr (Member) commented Apr 19, 2021

> Squashed all changes into a single commit.

Don't worry about doing that for our sake. It'll get squashed when we merge anyway. In fact, now I have no way of seeing just the changes you made since the last time I reviewed this.

epwalsh and others added 24 commits August 11, 2021 09:07
* fix race condition when extracting files with cached_path

* add warning when directory already exists
Bumps [checklist](https://github.com/marcotcr/checklist) from 0.0.10 to 0.0.11.
- [Release notes](https://github.com/marcotcr/checklist/releases)
- [Commits](https://github.com/marcotcr/checklist/commits)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Akshita Bhagia <akshita23bhagia@gmail.com>
Co-authored-by: Dirk Groeneveld <dirkg@allenai.org>
…5221)

* ADD: add from_pretrained method for vocab

* MOD: test format

* MOD: format file

* MOD: update changelog

* MOD: fix bug

* MOD: fix bug

* MOD: fix typo

* MOD: make the mothod in class

* MOD: fix bug

* MOD: change to instance method

* MOD: fix typo

* MOD: fix bug

* MOD: change oov to avoid bug

* Update allennlp/data/vocabulary.py

* Update allennlp/data/vocabulary.py

Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com>

* Update allennlp/data/vocabulary.py

Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com>

* Update allennlp/data/vocabulary.py

Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com>

* MOD: fix formate

* MOD: add test case

* Update CHANGELOG.md

* MOD: fix worker info bug

* ADD: update changelog

* MOD: fix format

* Update allennlp/data/data_loaders/multitask_data_loader.py

Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com>

* Update CHANGELOG.md

Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com>

* MOD: add demo code

* MOD: align code

* MOD: fix bug

* MOD: fix bug

* MOD: fix bug

* MOD: formate code

* Update allennlp/data/data_loaders/data_collator.py

Co-authored-by: Pete <epwalsh10@gmail.com>

* fix error

* MOD: add test code

* mod: change tokenizer

* mod: fix tokenizer

* MOD: fix bug

* MOD: fix bug

* MOD: fix bug

* Update allennlp/data/data_loaders/data_collator.py

Co-authored-by: Dirk Groeneveld <groeneveld@gmail.com>

* MOD: update changelog

* MOD: update change log

* Update allennlp/data/data_loaders/data_collator.py

We should be using underscores for everything.

* Formatting

Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com>
Co-authored-by: Dirk Groeneveld <dirkg@allenai.org>
Co-authored-by: Dirk Groeneveld <groeneveld@gmail.com>
Adding support for inputs to the backbone with more than 3 dimensions
* Removes unused variable

* Formatting

* Make sure we always restore the model's weights properly

* Give TrainerCallbacks the ability to save and load state dicts

* Give MovingAverage the ability to save and load state dicts

* Do not set gradients to None

* Typo

* Remove unused variable

* Typo

* Entirely new checkpointing code

* Formatting

* Make mypy happy

lol

* Makes the no-op trainer work with the new checkpointer

* Mark epochs as completed when they're skipped

* Changelog

* Fixes how we get the best weights after a training run

* Mypy is annoying

* Callback fixes

* Fix the no op trainer

* Simplify

* Assorted checkpointer fixes

* Mypy is now happy

* Fixed all the tests except for one

* Removed unused variable

* Fix trainer restore logic

* Fix test for trainer restore logic

* Check the Checkpointing branch of the models repo

* Help mypy along

* Fixed finalizing logic

* More mypy stuff

* Update allennlp/training/checkpointer.py

Co-authored-by: Pete <petew@allenai.org>

* Make weaker claims

Co-authored-by: Pete <petew@allenai.org>
* Implementing blocking repeated ngrams

* Adding comment

* Adding unit tests for the end to end beam search

* Renaming class

* Adding comment about  function

* Simplifying indexing to variable

* Refactoring the state copying into the  class

* Reformatting

* Editing changelog

* fix line too long

* comments

* doc updates

Co-authored-by: Pete <petew@allenai.org>
Co-authored-by: epwalsh <epwalsh10@gmail.com>
* Make BeamSearch Registrable

* Update changelog

* Remove unused import

* Update CHANGELOG.md

Co-authored-by: Pete <petew@allenai.org>
Co-authored-by: Pete <epwalsh10@gmail.com>
* initial commit

* general self attn

* fixing bugs, adding tests, adding docs

* updating other modules

* refactor

* bug fix

* update changelog

* fix shape

* fix format

* address feedback

* small doc fix

* Update allennlp/modules/transformer/transformer_stack.py

Co-authored-by: Pete <petew@allenai.org>

* remove old file

Co-authored-by: epwalsh <epwalsh10@gmail.com>
Co-authored-by: Pete <petew@allenai.org>
* Fix tqdm logging into multiple files with allennlp-optuna

* Update changelog

* Add unittest for resetting tqdm logger handlers

Co-authored-by: Pete <petew@allenai.org>
* bug fix

* common lexicons

* update changelog

* Update CHANGELOG.md
* added linear and hard debiasers

* worked on documentation

* committing changes before branch switch

* committing changes before switching branch

* finished bias direction, linear and hard debiasers, need to write tests

* finished bias direction test

* Commiting changes before switching branch

* finished hard and linear debiasers

* finished OSCaR

* bias mitigators tests and bias metrics remaining

* added bias mitigator tests

* added bias mitigator tests

* finished tests for bias mitigation methods

* fixed gpu issues

* fixed gpu issues

* fixed gpu issues

* resolve issue with count_nonzero not being differentiable

* added more references

* fairness during finetuning

* finished bias mitigator wrapper

* added reference

* updated CHANGELOG and fixed minor docs issues

* move id tensors to embedding device

* fixed to use predetermined bias direction

* fixed minor doc errors

* snli reader registration issue

* fixed _pretrained from params issue

* fixed device issues

* evaluate bias mitigation initial commit

* finished evaluate bias mitigation

* handles multiline prediction files

* fixed minor bugs

* fixed minor bugs

* improved prediction diff JSON format

* forgot to resolve a conflict

* Refactored evaluate bias mitigation to use NLI metric

* Added SNLIPredictionsDiff class

* ensured dataloader is same for bias mitigated and baseline models

* finished evaluate bias mitigation

* Update CHANGELOG.md

* Replaced local data files with github raw content links

* Update allennlp/fairness/bias_mitigator_applicator.py

Co-authored-by: Pete <petew@allenai.org>

* deleted evaluate_bias_mitigation from git tracking

* removed evaluate-bias-mitigation instances from rest of repo

* addressed Akshita's comments

* moved bias mitigator applicator test to allennlp-models

* removed unnecessary files

Co-authored-by: Arjun Subramonian <arjuns@Arjuns-MacBook-Pro.local>
Co-authored-by: Arjun Subramonian <arjuns@ip-192-168-0-106.us-west-2.compute.internal>
Co-authored-by: Arjun Subramonian <arjuns@ip-192-168-0-108.us-west-2.compute.internal>
Co-authored-by: Arjun Subramonian <arjuns@ip-192-168-1-108.us-west-2.compute.internal>
Co-authored-by: Akshita Bhagia <akshita23bhagia@gmail.com>
Co-authored-by: Pete <petew@allenai.org>
Bumps [black](https://github.com/psf/black) from 21.5b1 to 21.5b2.
- [Release notes](https://github.com/psf/black/releases)
- [Changelog](https://github.com/psf/black/blob/main/CHANGES.md)
- [Commits](https://github.com/psf/black/commits)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…gins()` (allenai#5246)

* ensure allennlp is a default plugin

* fix logging issue

* fixes

* actually fix
* added BackwardCallback

* finished tests

* fixed linting issue

* revised design per Dirk's suggestion

* added OnBackwardException, changed loss to batch_ouputs, etc.

Co-authored-by: Arjun Subramonian <arjuns@Arjuns-MacBook-Pro.local>
@wolhandlerdeb commented

Hi,
I would be glad to know whether you are going to add this code to the allennlp library. It would be very helpful.
Thanks

@Abhishek-P commented

> Hi, I would be glad to know if you add this code to the allennlp library. It would be very helpful. Thanks

Hi @wolhandlerdeb,
I am a little strapped for bandwidth, so this has been at the bottom of my personal queue.
It still needs some work before it is up to the standards for check-in described in #5095 (comment).

Please feel free to clone the changes and continue the work.
I plan to get back to it at some point, although I am not sure exactly when.


Successfully merging this pull request may close these issues.

Integration with Huggingface datasets