Add more flexible sampler types through Range #2758

Draft: wants to merge 1 commit into base: dev

lostella (Contributor) commented Mar 24, 2023

Description of changes: This proposes a new kind of sampler that selects (not necessarily random) instances to construct training or validation batches. The main addition over the existing samplers is that sampling ranges can be configured with timestamps (pd.Period) as well as integer indices; this is done through the Range class.

This should allow composing batches as required in this discussion. In addition, it improves the code compared to existing samplers, I believe.

In summary, Range represents a "partially specified" range: the concrete range can only be constructed once we know which sequence we intend to range over (where in time it starts, and how long it is). Once the sequence is known, a regular Python range object is constructed, and samplers take (not necessarily random) elements from it according to their own strategy.
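
To make the resolution step concrete, here is a minimal sketch of the idea, not the code in this PR. The name `PartialRange`, the `resolve` method, the half-open treatment of the stop bound, and the use of `datetime.date` as a dependency-free stand-in for a daily `pd.Period` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Union

Bound = Union[int, date, None]


@dataclass(frozen=True)
class PartialRange:
    """Hypothetical stand-in for the proposed Range class."""

    start: Bound = None
    stop: Bound = None
    step: int = 1

    @staticmethod
    def _to_index(bound: Bound, seq_start: date) -> Optional[int]:
        # Dates become integer offsets (in days) from the start of the
        # sequence; integers and None pass through unchanged.
        if isinstance(bound, date):
            return (bound - seq_start).days
        return bound

    def resolve(self, seq_start: date, length: int) -> range:
        bounds = slice(
            self._to_index(self.start, seq_start),
            self._to_index(self.stop, seq_start),
            self.step,
        )
        # slice.indices resolves negative and missing bounds against the
        # sequence length and clips them to [0, length].
        return range(*bounds.indices(length))


by_index = PartialRange(-30, -10, step=4).resolve(date(2023, 1, 1), 90)
by_date = PartialRange(date(2023, 3, 10), date(2023, 3, 20)).resolve(date(2023, 1, 1), 90)
print(list(by_index))  # [60, 64, 68, 72, 76]
print(by_date)         # range(68, 78)
```

Delegating to `slice.indices` keeps negative and open-ended bounds consistent with ordinary Python slicing, which is one possible design choice among several.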

Examples:

# consider observations between the 30th-last and the 10th-last, in steps of 4
rge = Range(-30, -10, step=4)

# let's just get all of them
sample = SampleAll(rge)

sample(start=pd.Period("2023-01-01", freq="D"), length=90)
# [60, 64, 68, 72, 76]

# consider observations between these dates
rge = Range(pd.Period("2023-03-10"), pd.Period("2023-03-20"))

# sample one on average
sample = SampleOnAverage(rge, 1)

sample(start=pd.Period("2023-01-01", freq="D"), length=90)
# repeated calls return different samples, for example:
# [68, 74]
# [76]
# [71]
# []
# [70, 75, 76]
# [76]
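
The two strategies used above could be sketched as plain functions over an already resolved Python range. This is only one plausible reading of the behaviour shown in the examples: the names `sample_all` and `sample_on_average` are hypothetical, and keeping each index independently with probability `num / len(indices)` is an assumption, not necessarily what the PR implements.

```python
import random
from typing import Callable, List, Optional

Sampler = Callable[[range], List[int]]


def sample_all(indices: range) -> List[int]:
    """Deterministic strategy: keep every index in the range."""
    return list(indices)


def sample_on_average(num: float, seed: Optional[int] = None) -> Sampler:
    """Random strategy: keep each index independently with probability
    num / len(indices), so that about `num` indices are kept on average."""
    gen = random.Random(seed)

    def sampler(indices: range) -> List[int]:
        if len(indices) == 0:
            return []
        prob = min(1.0, num / len(indices))
        return [i for i in indices if gen.random() < prob]

    return sampler


print(sample_all(range(60, 80, 4)))  # [60, 64, 68, 72, 76]
picked = sample_on_average(1)(range(68, 78))
print(picked)  # varies from call to call, possibly empty
```

Under this assumption a call can return no indices at all or several, which matches the variety of outputs listed in the second example.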

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Please tag this PR with at least one of these labels to make our release process faster: BREAKING, new feature, bug fix, other change, dev setup

jaheba (Contributor) commented Mar 27, 2023

I like the idea of splitting (heh) the tasks of defining windows where splits should happen and selecting the concrete instances. Still, I think splitting a dataset into train/val/test might be something different.

Say I want to incrementally train an existing model using new data. I can define a train/test split first, and then restrict the training windows to a recent range.
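
As a rough illustration of that two-step workflow, using plain Python ranges over the indices of a single series: the function name and the parameters `test_size` and `recent` are hypothetical, and holding out the last observations as the test set is an assumption.

```python
from typing import Tuple


def split_then_restrict(length: int, test_size: int, recent: int) -> Tuple[range, range]:
    # Step 1: train/test split, holding out the last `test_size` observations.
    train = range(0, length - test_size)
    test = range(length - test_size, length)
    # Step 2: only the last `recent` training indices are eligible for sampling.
    recent_train = train[max(len(train) - recent, 0):]
    return recent_train, test


recent_train, test = split_then_restrict(length=90, test_size=10, recent=30)
print(recent_train)  # range(50, 80)
print(test)          # range(80, 90)
```

The split and the choice of sampling window stay separate here: the first step fixes what counts as training data, and the second only narrows where training windows may be drawn from.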
