
Add sample method to data_pipeline #20

Merged · 16 commits into main from feature_sampling_data_source on Sep 29, 2023

Conversation

@najielhachem (Contributor) commented Sep 7, 2023

What does this PR do? Please describe:
Add a sample data source operator that can sample from multiple data sources given per-source weights.

Does your PR introduce any breaking changes? If yes, please list them:
List of all backwards-incompatible changes.

Check list:

  • Was the content of this PR discussed and approved via a GitHub issue? (no need for typos or documentation improvements)
  • Did you read the contributor guideline?
  • Did you make sure that your PR does only one thing instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (no need for typos, documentation, or minor internal changes)

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 7, 2023
@najielhachem najielhachem marked this pull request as ready for review September 7, 2023 13:45
@gwenzek (Contributor) left a comment

The code LGTM, but I wanted to discuss the use case and the handling of the end of the pipeline.

In this implementation, as soon as one child data source is done, the sampled_data_source stops.
In round_robin_data_source, on the contrary, I ensure that all data sources have been exhausted at least once before stopping.

  • do we need both behaviors?
  • if yes, I think they should live in the same data source with a stop_on_shortest flag
  • if yes, I think we could merge sampled_data_source and round_robin_data_source by adding a weights parameter to round_robin
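For concreteness, the two stopping behaviors under discussion can be sketched with plain Python generators (illustrative names only, not the fairseq2 API; the actual operators are implemented in C++):

```python
import random
from typing import Iterable, Iterator, Sequence


def sample_stop_on_shortest(
    sources: Sequence[Iterable], weights: Sequence[float], seed: int = 0
) -> Iterator:
    """Yield items by weighted random choice among the children;
    stop as soon as any child is exhausted (sampled_data_source behavior)."""
    rng = random.Random(seed)
    iters = [iter(s) for s in sources]
    while True:
        idx = rng.choices(range(len(iters)), weights=weights)[0]
        try:
            yield next(iters[idx])
        except StopIteration:
            return  # one child is done -> the whole pipeline stops


def round_robin_exhaust_all(sources: Sequence[Iterable]) -> Iterator:
    """Cycle through the children deterministically, skipping exhausted ones,
    until every child has been exhausted at least once."""
    iters = [iter(s) for s in sources]
    done = [False] * len(iters)
    while not all(done):
        for i, it in enumerate(iters):
            if done[i]:
                continue
            try:
                yield next(it)
            except StopIteration:
                done[i] = True
```

With two children of lengths 3 and 2, the round-robin variant yields all five items, while the sampling variant stops early the first time it draws from an exhausted child.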

tests/unit/data/data_pipeline/test_sample.py (review thread resolved)
pipelines: Sequence["DataPipeline"],
weights: Optional[Sequence[float]] = None,
) -> "DataPipelineBuilder":
"""Extract examples from ``pipelines`` by sampling based on ``weights``.

could you also document the behavior around the end of the pipeline? When is this pipeline considered exhausted?

@cbalioglu (Contributor)

@gwenzek although they are technically similar in that they both iterate through various data pipelines in each iteration, I think semantically they are fundamentally two different operations. For sampling you never want to keep reading from a data pipeline once it is exhausted, doing that would defeat the whole purpose of sampling. So to me stop_on_shortest feels like a redundant/confusing parameter for sampling.

Maybe they can internally share some implementation logic, but I think they should be exposed as two separate operators.

@cbalioglu (Contributor) left a comment

Thanks again for your work, Naji! And sorry for the late review. I left one important comment that should be easy to fix; otherwise this looks good to me.

Resolved review comments (outdated) on:
  • fairseq2n/src/fairseq2n/data/sample_data_source.cc
  • fairseq2n/src/fairseq2n/data/sample_data_source.h
  • fairseq2n/src/fairseq2n/data/data_pipeline.cc
  • fairseq2n/src/fairseq2n/utils/tensor.h
@gwenzek (Contributor) commented Sep 22, 2023

For sampling you never want to keep reading from a data pipeline once it is exhausted, doing that would defeat the whole purpose of sampling

Imagine you're training a multilingual model. You have English, French, and Zulu data: 100 GB of English, 10 GB of French, 1 GB of Zulu.
You want to upsample the Zulu and downsample the English so that your model is trained on 50% English, 40% French, and 10% Zulu. If you use sample([0.5, 0.4, 0.1]), your epoch will stop after reading only the first 5 GB of English, which is just 5% of the original training set. If you then reset the dataloader and start a new epoch, you'll reread the first 5 GB of English over and over.
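As a back-of-the-envelope check of the numbers above (this assumes sampling stops exactly when the smallest source, Zulu, runs out; each source i is consumed at rate weights[i]):

```python
def expected_read_before_stop(sizes, weights, limiting):
    """Estimate how much of each source has been read when source
    `limiting` exhausts first under weighted sampling. Each source i
    is read at rate weights[i], so when the limiting source has been
    fully consumed, source i has been read about
    sizes[limiting] * weights[i] / weights[limiting] units."""
    scale = sizes[limiting] / weights[limiting]
    return [min(size, w * scale) for size, w in zip(sizes, weights)]


sizes = [100, 10, 1]            # GB of English, French, Zulu
weights = [0.5, 0.4, 0.1]       # desired sampling mix
reads = expected_read_before_stop(sizes, weights, limiting=2)
# roughly 5 GB of English (5% of 100 GB), 4 GB of French, all 1 GB of Zulu
```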

I think semantically they are fundamentally two different operations

Round robin looks a lot like .sample([1/n, 1/n, ..., 1/n]), minus the randomization part.

@cbalioglu (Contributor)

You want to upsample the Zulu and downsample the English so that your model is trained on 50% English, 40% French, and 10% Zulu. If you use sample([0.5, 0.4, 0.1]), your epoch will stop after reading only the first 5 GB of English, which is just 5% of the original training set. If you then reset the dataloader and start a new epoch, you'll reread the first 5 GB of English over and over.

Yeah, that makes sense. I hadn't really been focusing on up/downsampling, but with that in mind, stop_on_shortest sounds reasonable. Thanks for the example.

I still think that round-robin and sampling should be two separate ops, though. Round-robin by its nature has no randomness; it is all about iterating over the pipelines one by one. It feels weird to me to use an op called round_robin() for sampling, or the other way round.

@najielhachem (Contributor, Author)

I agree with @cbalioglu about having two operators, since semantically we don't expect similar behaviour from round_robin and sample. But as mentioned by @gwenzek, the buffering seems useful, so I will extract that part into separate code and use it in both operators.
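One possible factoring of the shared exhaustion bookkeeping into a helper used by both operators (a sketch with invented names; the actual fairseq2n implementation is in C++ and may differ):

```python
import random
from typing import Iterable, Iterator, Sequence


class _Children:
    """Hypothetical shared helper: tracks a group of child iterators
    and which of them are exhausted."""

    def __init__(self, sources: Sequence[Iterable]):
        self.iters = [iter(s) for s in sources]
        self.done = [False] * len(self.iters)

    def next_from(self, i):
        """Return the next item of child i, or None once it is exhausted.
        (Uses None as a sentinel, so sources must not yield None.)"""
        if self.done[i]:
            return None
        try:
            return next(self.iters[i])
        except StopIteration:
            self.done[i] = True
            return None


def sample(sources, weights, seed=0) -> Iterator:
    """Weighted sampling; stops when the first child is exhausted."""
    rng = random.Random(seed)
    c = _Children(sources)
    while True:
        i = rng.choices(range(len(c.iters)), weights=weights)[0]
        item = c.next_from(i)
        if item is None:
            return
        yield item


def round_robin(sources) -> Iterator:
    """Deterministic cycling; stops once every child has been exhausted."""
    c = _Children(sources)
    while not all(c.done):
        for i in range(len(c.iters)):
            item = c.next_from(i)
            if item is not None:
                yield item
```

The two operators stay separate at the API surface, as agreed above, while the iterator/exhaustion state lives in one place.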

@cbalioglu (Contributor) left a comment

LGTM!

@cbalioglu cbalioglu merged commit af213ff into main Sep 29, 2023
18 checks passed
@cbalioglu cbalioglu deleted the feature_sampling_data_source branch September 29, 2023 19:49