
Add sample method to data_pipeline #20

Merged · 16 commits into main from feature_sampling_data_source on Sep 29, 2023

Conversation

@najielhachem (Contributor) commented Sep 7, 2023

What does this PR do? Please describe:
Add a sample data source operator that can sample from multiple data sources given per-source weights.

Does your PR introduce any breaking changes? If yes, please list them:
List of all backwards-incompatible changes.

Check list:

  • Was the content of this PR discussed and approved via a GitHub issue? (no need for typos or documentation improvements)
  • Did you read the contributor guideline?
  • Did you make sure that your PR does only one thing instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (no need for typos, documentation, or minor internal changes)

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 7, 2023
@najielhachem najielhachem marked this pull request as ready for review September 7, 2023 13:45
@gwenzek (Contributor) left a comment

The code LGTM, but I wanted to discuss the use case and the handling of the end of the pipeline.

In this implementation, as soon as one child data source is done, the sampled_data_source stops.
In round_robin_data_source, on the contrary, I ensure that all data sources have been exhausted at least once before stopping.

  • do we need both behaviors?
  • if yes, I think they should live in the same data source with a stop_on_shortest flag
  • if yes, I think we could merge sampled_data_source and round_robin_data_source by adding a weights parameter to round_robin
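For concreteness, the two stopping behaviors under discussion can be sketched with plain Python generators (illustrative names only, not the fairseq2 API; the actual operators are implemented in C++):

```python
import random
from typing import Iterable, Iterator, Sequence


def sample_stop_on_shortest(
    sources: Sequence[Iterable], weights: Sequence[float], seed: int = 0
) -> Iterator:
    """Yield items by weighted random choice among the children;
    stop as soon as any child is exhausted (sampled_data_source behavior)."""
    rng = random.Random(seed)
    iters = [iter(s) for s in sources]
    while True:
        idx = rng.choices(range(len(iters)), weights=weights)[0]
        try:
            yield next(iters[idx])
        except StopIteration:
            return  # one child is done -> the whole pipeline stops


def round_robin_exhaust_all(sources: Sequence[Iterable]) -> Iterator:
    """Cycle through the children deterministically, skipping exhausted ones,
    until every child has been exhausted at least once."""
    iters = [iter(s) for s in sources]
    done = [False] * len(iters)
    while not all(done):
        for i, it in enumerate(iters):
            if done[i]:
                continue
            try:
                yield next(it)
            except StopIteration:
                done[i] = True
```

With two children of lengths 3 and 2, the round-robin variant yields all five items, while the sampling variant stops early the first time it draws from an exhausted child.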

tests/unit/data/data_pipeline/test_sample.py (review thread resolved)
pipelines: Sequence["DataPipeline"],
weights: Optional[Sequence[float]] = None,
) -> "DataPipelineBuilder":
"""Extract examples from ``pipelines`` by sampling based on ``weights``.

could you also document the behavior around the end of the pipeline? When is this pipeline considered exhausted?

@cbalioglu (Contributor)

@gwenzek although they are technically similar in that they both iterate through various data pipelines in each iteration, I think semantically they are fundamentally two different operations. For sampling you never want to keep reading from a data pipeline once it is exhausted, doing that would defeat the whole purpose of sampling. So to me stop_on_shortest feels like a redundant/confusing parameter for sampling.

Maybe they can internally share some implementation logic, but I think they should be exposed as two separate operators.

@cbalioglu (Contributor) left a comment

Thanks again for your work, Naji! And sorry for the late review. I left one important comment that should be easy to fix; otherwise this looks good to me.

Resolved review comments (outdated) on:
  • fairseq2n/src/fairseq2n/data/sample_data_source.cc
  • fairseq2n/src/fairseq2n/data/sample_data_source.h
  • fairseq2n/src/fairseq2n/data/data_pipeline.cc
  • fairseq2n/src/fairseq2n/utils/tensor.h
@gwenzek (Contributor) commented Sep 22, 2023

For sampling you never want to keep reading from a data pipeline once it is exhausted, doing that would defeat the whole purpose of sampling

Imagine you're training a multilingual model. You have English, French, and Zulu data: 100 GB of English, 10 GB of French, 1 GB of Zulu.
You want to upsample the Zulu and downsample the English so that your model is trained on 50% English, 40% French, and 10% Zulu. If you use sample([0.5, 0.4, 0.1]), your epoch will stop after reading only the first 5 GB of English, which is just 5% of the original training set. If you then reset the dataloader and start a new epoch, you'll reread the first 5 GB of English over and over.
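As a back-of-the-envelope check of the numbers above (this assumes sampling stops exactly when the smallest source, Zulu, runs out; each source i is consumed at rate weights[i]):

```python
def expected_read_before_stop(sizes, weights, limiting):
    """Estimate how much of each source has been read when source
    `limiting` exhausts first under weighted sampling. Each source i
    is read at rate weights[i], so when the limiting source has been
    fully consumed, source i has been read about
    sizes[limiting] * weights[i] / weights[limiting] units."""
    scale = sizes[limiting] / weights[limiting]
    return [min(size, w * scale) for size, w in zip(sizes, weights)]


sizes = [100, 10, 1]            # GB of English, French, Zulu
weights = [0.5, 0.4, 0.1]       # desired sampling mix
reads = expected_read_before_stop(sizes, weights, limiting=2)
# roughly 5 GB of English (5% of 100 GB), 4 GB of French, all 1 GB of Zulu
```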

I think semantically they are fundamentally two different operations

Round robin looks a lot like .sample([1/n, 1/n, ..., 1/n]), minus the randomization part.

@cbalioglu (Contributor)

You want to upsample the Zulu and downsample the English so that your model is trained on 50% English, 40% French, and 10% Zulu. If you use sample([0.5, 0.4, 0.1]), your epoch will stop after reading only the first 5 GB of English, which is just 5% of the original training set. If you then reset the dataloader and start a new epoch, you'll reread the first 5 GB of English over and over.

Yeah, that makes sense. I hadn't really been focusing on up/downsampling, but with that in mind, stop_on_shortest sounds reasonable. Thanks for the example.

I still think that round-robin and sampling should be two separate ops, though. Round-robin by its nature has no randomness; it is all about iterating over the pipelines one by one. It feels weird to me to use an op called round_robin() for sampling, or the other way round.

@najielhachem (Contributor, Author)

I agree with @cbalioglu about having two operators, since semantically we don't expect similar behaviour from round_robin and sample. But as mentioned by @gwenzek, the buffering seems useful, so I will extract that part into separate code and use it in both operators.
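One possible factoring of the shared exhaustion bookkeeping into a helper used by both operators (a sketch with invented names; the actual fairseq2n implementation is in C++ and may differ):

```python
import random
from typing import Iterable, Iterator, Sequence


class _Children:
    """Hypothetical shared helper: tracks a group of child iterators
    and which of them are exhausted."""

    def __init__(self, sources: Sequence[Iterable]):
        self.iters = [iter(s) for s in sources]
        self.done = [False] * len(self.iters)

    def next_from(self, i):
        """Return the next item of child i, or None once it is exhausted.
        (Uses None as a sentinel, so sources must not yield None.)"""
        if self.done[i]:
            return None
        try:
            return next(self.iters[i])
        except StopIteration:
            self.done[i] = True
            return None


def sample(sources, weights, seed=0) -> Iterator:
    """Weighted sampling; stops when the first child is exhausted."""
    rng = random.Random(seed)
    c = _Children(sources)
    while True:
        i = rng.choices(range(len(c.iters)), weights=weights)[0]
        item = c.next_from(i)
        if item is None:
            return
        yield item


def round_robin(sources) -> Iterator:
    """Deterministic cycling; stops once every child has been exhausted."""
    c = _Children(sources)
    while not all(c.done):
        for i in range(len(c.iters)):
            item = c.next_from(i)
            if item is not None:
                yield item
```

The two operators stay separate at the API surface, as agreed above, while the iterator/exhaustion state lives in one place.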

@cbalioglu (Contributor) left a comment

LGTM!

@cbalioglu cbalioglu merged commit af213ff into main Sep 29, 2023
18 checks passed
@cbalioglu cbalioglu deleted the feature_sampling_data_source branch September 29, 2023 19:49