Data split for time series #1441
Conversation
Codecov Report
@@            Coverage Diff             @@
##              main    #1441     +/-  ##
=========================================
+ Coverage    100.0%   100.0%   +0.1%
=========================================
  Files          220      222      +2
  Lines        14742    14813     +71
=========================================
+ Hits         14735    14806     +71
  Misses           7        7
Continue to review full report at Codecov.
Force-pushed from 0a30055 to 879f9cb
max_test_index = min(last_test + 1 + self.gap, max_index)
new_train = np.arange(train[0], last_train + 1 + self.gap)
new_test = np.arange(first_test - self.max_delay, max_test_index)
yield new_train, new_test
why yield instead of returning at the end?
@jeremyliweishih using yield creates a generator, i.e. a one-time iterable! It's a good pattern to use for expensive computations which are going to be iterated over.
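A minimal sketch of the difference (hypothetical toy functions, not the PR's code):

import numpy as np

def splits_as_generator(n):
    # yield: each train/test pair is computed lazily, only when iterated
    for i in range(1, n):
        yield np.arange(i), np.array([i])

def splits_as_list(n):
    # return: every pair is computed up front and held in memory at once
    return [(np.arange(i), np.array([i])) for i in range(1, n)]

gen = splits_as_generator(3)
print(list(gen))  # [(array([0]), array([1])), (array([0, 1]), array([2]))]
print(list(gen))  # [] -- a generator is exhausted after one pass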
this looks great to me! I don't really have any other comments, other than that it would be helpful to include links to Delayed Feature Transformer and TimeSeriesSplit, and also an explanation of what ROCV (rolling origin cross-validation) is. It took me a while to understand the logic of everything and it only clicked once I read through all the context.
@freddyaboulton looks good!
I left some questions about the impl, as well as a suggested simplification (don't use the sklearn splitter). I also left a question about why the algo we appear to be using isn't dividing the data into equally-sized sections. That said, the impl makes sense to me and these questions aren't blocking merge; we can certainly file something to revisit our timeseries splitting if you prefer to merge and address this later. Your call.
if self._check_if_empty(X) and self._check_if_empty(y):
    raise ValueError("Both X and y cannot be None or empty in TimeSeriesSplit.split")
elif self._check_if_empty(X) and not self._check_if_empty(y):
    split_kwargs = dict(X=y, groups=groups)
Creative solution! I suppose you could also create an empty dataframe with the same index as y:

X_temp = pd.DataFrame(index=y.index)
split_kwargs = {'X': X_temp}

but I think the only advantage of doing so is code readability; it should be functionally equivalent. I.e., I like your solution :)
last_test = test[-1]
first_test = test[0]
max_test_index = min(last_test + 1 + self.gap, max_index)
new_train = np.arange(train[0], last_train + 1 + self.gap)
I think this should start at 0 rather than at train[0], right?
Also, will there ever be a case where last_train + 1 + self.gap is larger than len(train)? Do we need to do min(last_train + 1 + self.gap, len(train)) for safety?
Yep, train[0] == 0, so I'll make that change. I'll also make sure the train doesn't extend past X!
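A sketch of what that bound could look like (same names as the quoted snippet; the actual fix in the PR may differ):

# start at 0 and clamp so the extended train window never runs past the data
new_train = np.arange(0, min(last_train + 1 + self.gap, max_index))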
from sklearn.model_selection._split import BaseCrossValidator


class TimeSeriesSplit(BaseCrossValidator):
@freddyaboulton so the argument for defining this class, vs simply using sklearn TimeSeriesSplit, is that this splitter takes the gap and max_delay into account, to make sure the right rows are provided to the pipeline in each case according to what we expect?
I'm just playing devil's advocate here and trying to justify why we want to add code :)
Yes that's the reason!
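To make that concrete, here is the quoted index arithmetic applied to made-up numbers (illustrative values, not from the PR's tests):

import numpy as np

train, test = np.arange(0, 10), np.arange(10, 17)  # a base sklearn split
gap, max_delay, max_index = 2, 3, 31

max_test_index = min(test[-1] + 1 + gap, max_index)        # min(19, 31) = 19
new_train = np.arange(train[0], train[-1] + 1 + gap)       # [0 .. 11]: train absorbs the gap rows
new_test = np.arange(test[0] - max_delay, max_test_index)  # [7 .. 18]: test reaches back max_delay rows for lagged features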
last_train = train[-1]
last_test = test[-1]
first_test = test[0]
max_test_index = min(last_test + 1 + self.gap, max_index)
Yep, nice, this looks great.
first_test = test[0]
max_test_index = min(last_test + 1 + self.gap, max_index)
new_train = np.arange(train[0], last_train + 1 + self.gap)
new_test = np.arange(first_test - self.max_delay, max_test_index)
Same question here as above. Will these indices always fit into the bounds of test?
Yes they will, but we make sure they never extend past the indices of X!
split_kwargs = dict(X=X, y=y, groups=groups)
max_index = X.shape[0]

for train, test in self._splitter.split(**split_kwargs):
I think it's fine for us to build off of the sklearn timeseries splitter. But... would the impl be easier to read/modify/test/debug if we built it from scratch instead?

max_index = len(X)
for fold in range(self.n_folds):
    boundary = int(round(max_index * (fold + 1) / float(self.n_folds)))
    train = np.arange(0, boundary)
    test = np.arange(boundary, max_index)
    yield train, test
I'm not so sure. Sklearn already has a tested implementation of the rolling origin cv algorithm we set forth in the design phase, so the only thing we need to take care of is that the splits are aware of gap and max_delay. I prefer to delegate the part we can delegate to one of our dependencies, so that the code living in evalml only needs to worry about evalml requirements. Besides, I feel like implementing rolling-origin cv ourselves would amount to something very similar to sklearn's implementation, which isn't very DRY.
So I vote we keep the implementation as is for now, and then we can revisit later if we decide to move away from rolling origin cv!
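A rough sketch of the delegation pattern being described, assembled from the snippets quoted in this thread (the constructor shape and names like n_splits are assumptions, not necessarily the PR's exact code):

import numpy as np
from sklearn.model_selection import TimeSeriesSplit as SkTimeSeriesSplit
from sklearn.model_selection._split import BaseCrossValidator

class TimeSeriesSplit(BaseCrossValidator):
    """Rolling-origin CV whose splits are aware of gap and max_delay."""

    def __init__(self, max_delay=0, gap=0, n_splits=3):
        self.max_delay = max_delay
        self.gap = gap
        self._splitter = SkTimeSeriesSplit(n_splits=n_splits)

    def get_n_splits(self, X=None, y=None, groups=None):
        return self._splitter.n_splits

    def split(self, X, y=None, groups=None):
        max_index = X.shape[0]
        # sklearn supplies the rolling-origin boundaries; we only widen
        # each window so lagged (max_delay) and future (gap) rows exist
        for train, test in self._splitter.split(X, y, groups):
            max_test_index = min(test[-1] + 1 + self.gap, max_index)
            new_train = np.arange(train[0], min(train[-1] + 1 + self.gap, max_index))
            new_test = np.arange(test[0] - self.max_delay, max_test_index)
            yield new_train, new_test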
_ = list(ts_split.split(X=None, y=None))

with pytest.raises(ValueError, match="Both X and y cannot be None or empty in TimeSeriesSplit.split"):
    _ = list(ts_split.split(X=pd.DataFrame(), y=pd.Series([])))
👍
I believe this should still work if both are "empty" but have valid indices defined
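For instance, a frame with rows but no columns reports as empty while still carrying a usable index (a quick illustration; whether split accepts it depends on how _check_if_empty is written):

import pandas as pd

X = pd.DataFrame(index=pd.date_range("2020-10-01", "2020-10-31"))
print(X.empty)       # True: the columns axis has length 0
print(len(X.index))  # 31: the index is still fully defined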
answer = [(pd.date_range("2020-10-01", f"2020-10-{10 + gap}"), pd.date_range(f"2020-10-{11 - max_delay}", f"2020-10-{17 + gap}")),
          (pd.date_range("2020-10-01", f"2020-10-{17 + gap}"), pd.date_range(f"2020-10-{18 - max_delay}", f"2020-10-{24 + gap}")),
          (pd.date_range("2020-10-01", f"2020-10-{24 + gap}"), pd.date_range(f"2020-10-{25 - max_delay}", "2020-10-31"))]
Ah neat, I didn't expect the answers here to be this cleanly defined.
Considering the case when max_delay=0 and gap=0 for simplicity, why does the first training split range from 10-01 to 10-10 excluding the last day (i.e. 9 days), but the test split always has 6 days, and each subsequent update increases by 7 days? I guess my question is, why aren't all these splits the same size?
I think the simplest algo to follow for this would be if we divide the data into n_splits equal sections, and then each split in the generator uses the i-th ordered section as the test data, and the concatenation of all previous ordered sections as the training data. It appears the algo we've currently got matches this in the ordering, but the section sizes are a bit different. How come?
I think the reason here is that the number of datapoints (31) is not divisible by the number of folds (4, because n_splits=3 creates 4 chunks of data). Splitting the data equally into four would result in four 7-day chunks of data with three days left over. We just add the first three days to the first chunk of data.
I think if the data isn't evenly divisible by the number of splits, you'll always end up in a situation where one of the splits is slightly larger (unless you throw out data, which I don't think we should do).
If I make a dataframe with 32 rows instead of 31 and use n_splits=3 with max_delay/gap = 0, the first train/test are 8 days long and each subsequent train set increases by 8 days, which matches our expectation!
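That matches sklearn's arithmetic: by default the test size is n_samples // (n_splits + 1), and the remainder is folded into the first training chunk. A quick check against sklearn directly:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.zeros((31, 1))
for train, test in TimeSeriesSplit(n_splits=3).split(X):
    print(len(train), len(test))
# 10 7
# 17 7
# 24 7  -> tests are 31 // 4 = 7 days; the 3 leftover days pad the first train set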
@freddyaboulton got it, thanks for the explanation. I wanted to make sure there wasn't a bug lurking in there somewhere. I would've guessed the underlying sklearn splits would be as even as possible but I guess not! It'll be interesting to learn how this behaves on other datasets.
Force-pushed from 879f9cb to f9b1011
Force-pushed from f9b1011 to d0313df
Pull Request Description
Fixes #1381.
After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:123.