Introduce stop_at_shortest in sample and round_robin #76

najielhachem · 2023-10-02T08:49:35Z

What does this PR do? Please describe:

Following this PR review we decided to do the following:

Extract the common parts of sample and round_robin into a separate class.
Introduce circular logic for the sample operator, which is useful for up-sampling.
Add a stop_at_shortest flag to both operators to give users more flexibility over how the operators behave.
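
To make the intent of the flag concrete, here is a minimal pure-Python sketch. It does not use the fairseq2 API: `round_robin_sketch`, `sample_sketch`, and `_interleave` are hypothetical names. The semantics are an assumption drawn from the description above: with stop_at_shortest=True, iteration ends as soon as any input is exhausted; with stop_at_shortest=False, exhausted inputs restart from the beginning (the circular logic) until every input has been exhausted at least once.

```python
import itertools
import random
from typing import Iterable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def _interleave(
    sources: Sequence[Sequence[T]], pick: Iterable[int], stop_at_shortest: bool
) -> Iterator[T]:
    """Yield items from `sources`, visiting them in the order given by `pick`."""
    if not sources or any(len(s) == 0 for s in sources):
        return  # nothing sensible to yield from empty inputs
    positions = [0] * len(sources)
    exhausted = [False] * len(sources)
    for i in pick:
        if positions[i] == len(sources[i]):
            exhausted[i] = True
            if stop_at_shortest or all(exhausted):
                return
            positions[i] = 0  # circular logic: restart the exhausted source
        yield sources[i][positions[i]]
        positions[i] += 1


def round_robin_sketch(
    sources: Sequence[Sequence[T]], stop_at_shortest: bool = False
) -> List[T]:
    """Visit the sources in a fixed repeating order."""
    order = itertools.cycle(range(len(sources)))
    return list(_interleave(sources, order, stop_at_shortest))


def sample_sketch(
    sources: Sequence[Sequence[T]],
    weights: Sequence[float],
    stop_at_shortest: bool = False,
    seed: int = 0,
) -> List[T]:
    """Visit the sources in a weighted random order (weights must be positive)."""
    rng = random.Random(seed)
    indices = range(len(sources))

    def draws() -> Iterator[int]:
        while True:
            yield rng.choices(indices, weights=weights)[0]

    return list(_interleave(sources, draws(), stop_at_shortest))


print(round_robin_sketch([[1, 2, 3], [10, 20]], stop_at_shortest=True))   # [1, 10, 2, 20, 3]
print(round_robin_sketch([[1, 2, 3], [10, 20]], stop_at_shortest=False))  # [1, 10, 2, 20, 3, 10]
```

In the second call the shorter input wraps around and contributes 10 a second time. Under the same assumption, this wrap-around is what lets the sample operator up-sample a small dataset.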

Does your PR introduce any breaking changes? If yes, please list them:
List of all backwards-incompatible changes.

Check list:

Was the content of this PR discussed and approved via a GitHub issue? (no need for typos or documentation improvements)
Did you read the contributor guideline?
Did you make sure that your PR does only one thing instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests?
Did you verify new and existing tests pass locally with your changes?
Did you update the CHANGELOG? (no need for typos, documentation, or minor internal changes)


gwenzek

LGTM.

If I had to nitpick, I'd say that I find it weird to have the logic spread out across 6 files.
sample_data_source and round_robin_data_source may not need to be their own classes anymore.
But it's also consistent with the rest of the codebase, where we have two files per operator.

tests/unit/data/data_pipeline/test_sample.py

gwenzek · 2023-10-03T16:13:19Z

tests/unit/data/data_pipeline/test_round_robin.py

@@ -76,6 +77,28 @@ def test_op_works_when_pipelines_have_different_lengths(self) -> None:

            pipeline.reset()

+    @pytest.mark.skipif(
+        python_devel_only(),
+        reason="New fairseq2n API in Python-only installation. Skipping till v0.2.",


Not for this PR:

I'm not fond of the behavior of this python_devel_only check.
First, it seems to be about not disturbing Python-only developers who didn't build fairseq2n locally, so fairseq2n_dev_only would be clearer.
Second, I'd prefer it to be an annotation, @fairseq2n_dev_only, with a standard reason.
Third, when are we supposed to remove those annotations? Will we have to do this manually?

I don't really see who would want to run our unit tests from the GitHub main branch but can't build fairseq2n.
Can someone explain the use case to me?

> First, it seems to be about not disturbing Python-only developers who didn't build fairseq2n locally, so fairseq2n_dev_only would be clearer.

I think if you convert it to a standalone annotation, a name like @fairseq2n_dev_only makes sense, but as a parameter to skip_if, fairseq2n_dev_only does not sound right, because it is just the opposite: you want to skip the test if this is a python-only installation. I am open to having a standalone annotation though.
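
As a rough illustration of the standalone annotation being discussed, the sketch below builds a reusable skip decorator with a single shared reason. It uses the standard library's `unittest.skipIf` so that it runs without extra dependencies; with pytest, the analogous object would be a `pytest.mark.skipif(condition, reason=...)` marker stored under a name. `make_fairseq2n_dev_only` and `STANDARD_REASON` are hypothetical names, and the reason text is only an example.

```python
import unittest
from typing import Callable, TypeVar

F = TypeVar("F", bound=Callable[..., object])

# Hypothetical wording; the point is that every use shares one reason string.
STANDARD_REASON = (
    "Requires a locally built fairseq2n; skipped in Python-only installations."
)


def make_fairseq2n_dev_only(is_python_only_install: bool) -> Callable[[F], F]:
    """Return a decorator that skips the test in a Python-only installation."""
    return unittest.skipIf(is_python_only_install, STANDARD_REASON)


# In the real code base the flag would come from the existing
# python_devel_only() helper; a literal keeps this sketch self-contained.
fairseq2n_dev_only = make_fairseq2n_dev_only(is_python_only_install=True)


@fairseq2n_dev_only
def test_uses_new_native_api() -> None:
    raise AssertionError("should not run in a Python-only installation")
```

Read this way, the name describes who the test is for (developers with a local fairseq2n build), while the skip condition itself stays the inverse, which matches the naming concern raised above.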

> Third, when are we supposed to remove those annotations? Will we have to do this manually?

There are two ways to resolve this issue. We can either have a version range check, like @skipif(fairseq2n.__version__ < 0.2), which would not require us to remove the annotation in the future. The disadvantage is that those checks would no longer be relevant after a stable release and would pile up over time for no good reason. The other alternative is what we have right now: just checking whether the versions differ, @skipif(fairseq2.__version__ != fairseq2n.__version__). The advantage is that this forces you to remove the annotations before a stable release (otherwise the tests won't run); the disadvantage is that it is manual work that needs to be done as part of the version bump. Having said that, I think we won't need this annotation that often in the future once the native library becomes more mature. The manual removal is not ideal, but we might add some extra measures, like forcibly failing any test that has this annotation if the latest version is already past the one in the annotation.
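
The two options, plus the suggested extra measure, can be sketched as plain predicates. This is only an illustration under stated assumptions: the function names are hypothetical, version strings are compared by their leading numeric components only (pre-release suffixes such as ".dev0" are ignored), and "past the one in the annotation" is read as "greater than or equal to".

```python
import re
from typing import Tuple


def release_tuple(version: str) -> Tuple[int, ...]:
    """Parse leading numeric components, e.g. '0.2.0.dev1' -> (0, 2)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    parts = [int(p) for p in match.group().split(".")]
    # Drop trailing zeros so that '0.2' and '0.2.0' compare equal.
    while len(parts) > 1 and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def skip_by_version_range(native_version: str, first_supported: str) -> bool:
    """Option 1: skip while the installed native library predates the new API."""
    return release_tuple(native_version) < release_tuple(first_supported)


def skip_by_version_mismatch(python_version: str, native_version: str) -> bool:
    """Option 2 (the current approach): skip whenever the two versions differ."""
    return python_version != native_version


def annotation_is_stale(current_version: str, skip_until: str) -> bool:
    """Extra measure: True once the annotation should have been removed."""
    return release_tuple(current_version) >= release_tuple(skip_until)


# A Python-only checkout at 0.2.0.dev0 running against a pre-built 0.1.1 binary:
print(skip_by_version_range("0.1.1", "0.2"))              # True
print(skip_by_version_mismatch("0.2.0.dev0", "0.1.1"))    # True
print(annotation_is_stale("0.2.0", "0.2"))                # True
```

A test harness could turn a True result from `annotation_is_stale` into a hard failure, which would make forgotten annotations visible at the version bump instead of silently skipping tests.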

> Can someone explain the use case to me?

Pretty much anyone who wants to contribute to the Python code base of fairseq2 but does not want to deal with (or does not have experience with) C++. The whole reason why we ended up with fairseq2n at the last minute was that people had quite a bit of trouble building the library (e.g. while working on seq-gen and UnitY upstreaming). The original feedback was to move the C++ parts into a separate repo, but we then decided to keep them here and offer pre-built binaries so that people who want to contribute to fairseq2 won't have to build the native parts if they don't want to. By the way, I got this approach pretty much from jax, where they similarly separate the library into two parts: jax (Python only) and jaxlib (pre-built).

Ok, just gave it a try in #83. Let me know what you think.

Thanks, let's follow up on that PR and merge this one.

cbalioglu

Solid work! Just left a few nit comments. There is only one line where we should leverage move semantics for performance; otherwise most of my comments are optional suggestions. Let me know once you have addressed at least that performance-related issue, and I can stamp it right away.

fairseq2n/src/fairseq2n/data/multi_data_source.cc

fairseq2n/src/fairseq2n/data/data_pipeline.h

fairseq2n/src/fairseq2n/data/multi_data_source.cc

fairseq2n/src/fairseq2n/data/round_robin_data_source.cc

tests/unit/data/data_pipeline/test_round_robin.py

fairseq2n/src/fairseq2n/data/sample_data_source.cc

fairseq2n/src/fairseq2n/data/multi_data_source.h

fairseq2n/src/fairseq2n/data/multi_data_source.cc

najielhachem · 2023-10-05T09:26:17Z

@gwenzek I agree that there are too many files now, but we still need separate classes for round_robin and sample since they have different logic for the save/reload and reset methods.

najielhachem · 2023-10-05T09:26:53Z

@cbalioglu Thanks for the detailed review! I made all the requested changes, and the PR should be ready now.

cbalioglu

I left two very minor nit comments. Thanks again for this work! The implementation has become much cleaner! Feel free to merge it whenever you want.

src/fairseq2/data/data_pipeline.py

fairseq2n/src/fairseq2n/data/composite_data_source.cc

najielhachem added 17 commits September 7, 2023 14:19

Add tensor.h to utils

c137b46

Add sample_data_source class

204ffa7

Add sample method to data_pipeline

f544f9e

Add sample binding to data_pipeline

f99e870

Add test_sample

07a9d65

Fix build errors

894dbc1

fix python lint errors

8ce33db

Add empty lines at end of c++ files

134c585

remove \n from runtime error regex match

08a3a65

Merge remote-tracking branch 'origin/main' into feature_sampling_data_source

8869aef

Use float32 instead of float

a75141d

Resolve nit comments

8b06f85

Remove set_seed from sampler

a10b5b4

Fix document error

0773e89

Handle generator state out of sample operator

1662c11

Add circular data source

9805525

Merge branch 'main' into up_sampling

bd642cd

facebook-github-bot added the CLA Signed label Oct 2, 2023

najielhachem added 12 commits October 2, 2023 18:28

Use circular data_soruce in round_robin

4d774dd

fix circular data source

73e348f

Add stop_on_shortest flag

a24baa0

Add stop_at_shortest flag to round_robin

97b6aca

Add stop_at_shortest round_robin test

563ba05

Fix circular data source stop_at_shortest flag

b0d8e54

Use circular data_source in sample op

6abfb53

Add up_sampling test

48f6cb5

Rename circular_data_source to multi_data_source

7164e1f

Improve sample and round_robin doc

cdc35f0

lint code

394b9bb

Merge remote-tracking branch 'origin/main' into up_sampling

74202b8

najielhachem changed the title ~~Up sampling~~ Introduce stop_at_shortest in sample and round_robin Oct 3, 2023

najielhachem marked this pull request as ready for review October 3, 2023 09:18

najielhachem requested a review from cbalioglu as a code owner October 3, 2023 09:18

najielhachem requested a review from gwenzek October 3, 2023 09:19

skip test that uses new API

006bf59

gwenzek approved these changes Oct 3, 2023

cbalioglu mentioned this pull request Oct 4, 2023

Introduce fairseq2n pytest marker #83

Closed

cbalioglu reviewed Oct 4, 2023

najielhachem added 4 commits October 5, 2023 09:52

Merge remote-tracking branch 'origin/main' into up_sampling

218f69b

nit changes

996801e

rename multi_data_source to composite_data_source

c535d02

use std::exchange instead of =

a6e5a58

najielhachem requested a review from cbalioglu October 5, 2023 09:29

cbalioglu approved these changes Oct 5, 2023

resolve last nit comment

e2dcab6

cbalioglu approved these changes Oct 5, 2023

najielhachem merged commit 52b05c7 into main Oct 5, 2023
19 checks passed

najielhachem deleted the up_sampling branch October 5, 2023 14:19
