Skip to content
This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Added DataCollator for dynamic operations for each batch. #5221

Merged
merged 56 commits into from
May 27, 2021
Merged

Added DataCollator for dynamic operations for each batch. #5221

merged 56 commits into from
May 27, 2021

Conversation

wlhgtc
Copy link
Contributor

@wlhgtc wlhgtc commented May 25, 2021

related #5218

Copy link
Member

@epwalsh epwalsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Could you fix the failing test, and also add a test for the LanguageModelingDataCollator?

allennlp/data/data_loaders/data_collator.py Outdated Show resolved Hide resolved
CHANGELOG.md Outdated
@@ -6,6 +6,8 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## Unreleased
### Added
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's already an "Added" section below.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wlhgtc, can you move this to the "Added" section below?

Copy link
Member

@dirkgr dirkgr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The collator seems like a really weird place to do things like this, but if that's how huggingface does it, we shouldn't be different just for the sake of being different.

I'll leave final approval to @epwalsh.

@wlhgtc
Copy link
Contributor Author

wlhgtc commented May 26, 2021

The collator seems like a really weird place to do things like this, but if that's how huggingface does it, we shouldn't be different just for the sake of being different.

I'll leave final approval to @epwalsh.

It's the only way that could do some dynamic operations for different batch.
We cached each instance in Dataloader, so we can not do it before instance.

@dirkgr
Copy link
Member

dirkgr commented May 26, 2021

It's the only way that could do some dynamic operations for different batch.

It would make more sense to me to write a data loader that has the ability to modify instances before they are being batched. We could give the data loader a callback to do just that. That makes more sense than abusing a function that's meant to just construct batches. But like I said, let's just stick to what huggingface does, so people aren't surprised when they encounter our API.

Copy link
Member

@dirkgr dirkgr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thanks for doing it! I just have some minor requests about how to use Registrable.

CHANGELOG.md Outdated
@@ -6,6 +6,8 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## Unreleased
### Added
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wlhgtc, can you move this to the "Added" section below?

allennlp/data/data_loaders/data_collator.py Show resolved Hide resolved
allennlp/data/data_loaders/data_collator.py Show resolved Hide resolved
allennlp/data/data_loaders/data_collator.py Outdated Show resolved Hide resolved
@dirkgr dirkgr self-assigned this May 27, 2021
Copy link
Member

@dirkgr dirkgr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@dirkgr dirkgr enabled auto-merge (squash) May 27, 2021 18:20
@dirkgr dirkgr merged commit babc450 into allenai:main May 27, 2021
Abhishek-P pushed a commit to Abhishek-P/allennlp that referenced this pull request Aug 11, 2021
…5221)

* ADD: add from_pretrained method for vocab

* MOD: test format

* MOD: format file

* MOD: update changelog

* MOD: fix bug

* MOD: fix bug

* MOD: fix typo

* MOD: make the mothod in class

* MOD: fix bug

* MOD: change to instance method

* MOD: fix typo

* MOD: fix bug

* MOD: change oov to avoid bug

* Update allennlp/data/vocabulary.py

* Update allennlp/data/vocabulary.py

Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com>

* Update allennlp/data/vocabulary.py

Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com>

* Update allennlp/data/vocabulary.py

Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com>

* MOD: fix formate

* MOD: add test case

* Update CHANGELOG.md

* MOD: fix worker info bug

* ADD: update changelog

* MOD: fix format

* Update allennlp/data/data_loaders/multitask_data_loader.py

Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com>

* Update CHANGELOG.md

Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com>

* MOD: add demo code

* MOD: align code

* MOD: fix bug

* MOD: fix bug

* MOD: fix bug

* MOD: formate code

* Update allennlp/data/data_loaders/data_collator.py

Co-authored-by: Pete <epwalsh10@gmail.com>

* fix error

* MOD: add test code

* mod: change tokenizer

* mod: fix tokenizer

* MOD: fix bug

* MOD: fix bug

* MOD: fix bug

* Update allennlp/data/data_loaders/data_collator.py

Co-authored-by: Dirk Groeneveld <groeneveld@gmail.com>

* MOD: update changelog

* MOD: update change log

* Update allennlp/data/data_loaders/data_collator.py

We should be using underscores for everything.

* Formatting

Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com>
Co-authored-by: Dirk Groeneveld <dirkg@allenai.org>
Co-authored-by: Dirk Groeneveld <groeneveld@gmail.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants