Add sample method to data_pipeline #20
Conversation
The code LGTM, but I wanted to talk about the use case and the handling of the end of the pipeline.

In this implementation, as soon as one child data source is done, the `sampled_data_source` stops. In `round_robin_data_source`, on the contrary, I ensure that all data sources have been exhausted at least once before stopping.

- Do we need the two logics?
- If yes, I think they should be in the same data source with a `stop_on_shortest` flag.
- If yes, I think we could merge `sampled_data_source` and `round_robin_data_source` by adding a `weights` parameter to `round_robin`.
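The two stopping behaviors under discussion can be sketched with plain Python iterators. This is a minimal illustration, not the actual fairseq2 implementation; the function name and the `stop_on_shortest` flag are hypothetical, taken from the suggestion above.

```python
import random
from typing import Iterator, List, Sequence


def sample_iterators(
    sources: Sequence[Iterator[int]],
    weights: Sequence[float],
    stop_on_shortest: bool = True,
    seed: int = 0,
) -> Iterator[int]:
    # Pick the next source at random according to `weights`.
    # With stop_on_shortest=True, iteration ends as soon as any source
    # is exhausted (the sampled_data_source behavior described above);
    # otherwise exhausted sources are dropped and iteration continues
    # until every source has been drained at least once
    # (the round_robin_data_source-style stopping).
    rng = random.Random(seed)
    alive: List[int] = list(range(len(sources)))
    while alive:
        idx = rng.choices(alive, weights=[weights[i] for i in alive])[0]
        try:
            yield next(sources[idx])
        except StopIteration:
            if stop_on_shortest:
                return
            alive.remove(idx)
```

With `stop_on_shortest=False` every element of every source is eventually yielded; with `stop_on_shortest=True` the output is cut short at the first exhausted source, which is the behavior this comment is questioning.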
pipelines: Sequence["DataPipeline"],
weights: Optional[Sequence[float]] = None,
) -> "DataPipelineBuilder":
    """Extract examples from ``pipelines`` by sampling based on ``weights``.
Could you also document the behavior around the end of the pipeline? When is this pipeline considered exhausted?
@gwenzek although they are technically similar in that they both iterate through various data pipelines in each iteration, I think semantically they are fundamentally two different operations. For sampling you never want to keep reading from a data pipeline once it is exhausted; doing that would defeat the whole purpose of sampling. Maybe they can internally share some implementation logic, but I think they should be exposed as two separate operators.
Thanks again for your work Naji! And sorry for the late review. I just left an important comment that should be easy to fix, otherwise looks good to me.
Imagine you're training a multilingual model. You have English, French and Zulu data. Volume is 100 GB of English, 10 GB of French, 1 GB of Zulu.
Round robin looks a lot like …
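For a setup like the multilingual one above, a common way to choose the `weights` argument is temperature-based smoothing of the corpus sizes. This is a widely used multilingual training recipe, not something prescribed by this PR; the function name is hypothetical.

```python
from typing import List, Sequence


def temperature_weights(sizes: Sequence[float], temperature: float = 5.0) -> List[float]:
    # Sample source i with probability proportional to size_i ** (1 / T).
    # T = 1 keeps the natural data proportions; larger T flattens the
    # distribution, i.e. upsamples low-resource sources such as Zulu.
    probs = [s ** (1.0 / temperature) for s in sizes]
    total = sum(probs)
    return [p / total for p in probs]


# 100 GB English, 10 GB French, 1 GB Zulu (sizes from the comment above)
natural = temperature_weights([100, 10, 1], temperature=1.0)
smoothed = temperature_weights([100, 10, 1], temperature=5.0)
```

With `temperature=1.0` Zulu gets under 1% of the samples; with `temperature=5.0` it gets roughly 20%, while English still dominates.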
Yeah, that makes sense. I haven't really been focusing on up/down sampling, but even considering that, I still think that round-robin and sampling should be two separate ops. Round-robin by its nature does not really have any randomness; it is all about iterating through each pipeline one by one. It feels weird to me to use an op called …
I agree with @cbalioglu about having two operators, since semantically we don't expect a similar behaviour from round_robin and sample. But as mentioned by @gwenzek, the buffering seems useful, so I will extract that part into separate code and use it in both of our operators.
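For contrast, the deterministic operator being kept separate can be sketched as follows. Again a hypothetical helper under assumed names, not the actual fairseq2 code: it cycles sources in a fixed order, skipping exhausted ones until all have been drained at least once, with no randomness involved.

```python
from typing import Iterator, List, Sequence


def round_robin_iterators(sources: Sequence[Iterator[int]]) -> Iterator[int]:
    # Visit sources in a fixed cyclic order. An exhausted source is
    # removed from the rotation; iteration stops only once every
    # source has been drained at least once.
    alive: List[int] = list(range(len(sources)))
    while alive:
        # Iterate over a copy so we can remove indices mid-loop.
        for idx in list(alive):
            try:
                yield next(sources[idx])
            except StopIteration:
                alive.remove(idx)
```

Because the visiting order is fixed, the output for given inputs is fully deterministic, which is the semantic distinction from the weighted sampler discussed in this thread.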
LGTM!
What does this PR do? Please describe:
Adds a sample data source operator that can sample from multiple data sources given the provided weights.
Does your PR introduce any breaking changes? If yes, please list them:
List of all backwards-incompatible changes.
Check list: