
Data split for time series #1441

Merged: 8 commits merged into main from 1381-timeseries-data-split on Nov 23, 2020
Conversation

@freddyaboulton (Contributor) commented Nov 16, 2020

Pull Request Description

Fixes #1381.



@codecov (bot) commented Nov 16, 2020

Codecov Report

Merging #1441 (d0313df) into main (5d1ea1f) will increase coverage by 0.1%.
The diff coverage is 100.0%.


@@            Coverage Diff            @@
##             main    #1441     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         220      222      +2     
  Lines       14742    14813     +71     
=========================================
+ Hits        14735    14806     +71     
  Misses          7        7             
Impacted Files Coverage Δ
evalml/automl/data_splitters/__init__.py 100.0% <100.0%> (ø)
evalml/automl/data_splitters/time_series_split.py 100.0% <100.0%> (ø)
...valml/tests/automl_tests/test_time_series_split.py 100.0% <100.0%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 5d1ea1f...d0313df.

@freddyaboulton freddyaboulton force-pushed the 1381-timeseries-data-split branch 2 times, most recently from 0a30055 to 879f9cb Compare November 17, 2020 15:06
@freddyaboulton freddyaboulton marked this pull request as ready for review November 17, 2020 16:07
max_test_index = min(last_test + 1 + self.gap, max_index)
new_train = np.arange(train[0], last_train + 1 + self.gap)
new_test = np.arange(first_test - self.max_delay, max_test_index)
yield new_train, new_test
jeremyliweishih (Collaborator):

why yield instead of returning at the end?

Contributor:

@jeremyliweishih using yield creates a generator, i.e. a one-time iterable! It's a good pattern for expensive computations that are going to be iterated over.
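A minimal sketch of the pattern (illustrative only, not code from this PR):

def lazy_splits(n):
    # Each pair is computed only when the caller advances the iterator,
    # so no work happens up front.
    for i in range(1, n):
        yield list(range(i)), [i]

gen = lazy_splits(4)
print(next(gen))   # ([0], [1]) -- computed on demand
print(list(gen))   # remaining pairs; the generator is now exhausted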

@jeremyliweishih (Collaborator) left a comment:

This looks great to me! My only other comment is that it would be helpful to include links to Delayed Feature Transformer and TimeSeriesSplit, and an explanation of what ROCV (rolling-origin cross-validation) is. It took me a while to understand the logic, and it only clicked once I read through all the context.

@dsherry (Contributor) left a comment:

@freddyaboulton looks good!

I left some questions about the impl, as well as a suggested simplification (don't use the sklearn splitter). I also left a question about why the algo we appear to be using isn't dividing the data into equally-sized sections. That said, the impl makes sense to me and these questions aren't blocking merge; we can certainly file something to revisit our timeseries splitting if you prefer to merge and address this later. Your call.


if self._check_if_empty(X) and self._check_if_empty(y):
    raise ValueError("Both X and y cannot be None or empty in TimeSeriesSplit.split")
elif self._check_if_empty(X) and not self._check_if_empty(y):
    split_kwargs = dict(X=y, groups=groups)
Contributor:

Creative solution!

I suppose you could also create an empty dataframe with the same index as y

X_temp = pd.DataFrame(index=y.index)
split_kwargs = {'X': X_temp}

but I think the only advantage of doing so is code readability; they should be functionally equivalent. I.e., I like your solution :)

last_test = test[-1]
first_test = test[0]
max_test_index = min(last_test + 1 + self.gap, max_index)
new_train = np.arange(train[0], last_train + 1 + self.gap)
Contributor:

I think this should start at 0 rather than at train[0], right?

Contributor:

Also, will there ever be a case where last_train + 1 + self.gap is larger than len(train)? Do we need to do min(last_train + 1 + self.gap, len(train)) for safety?

@freddyaboulton (author):

Yep, train[0] == 0, so I'll make that change. I'll also make sure the train indices don't extend past X!

from sklearn.model_selection._split import BaseCrossValidator


class TimeSeriesSplit(BaseCrossValidator):
Contributor:

@freddyaboulton so the argument for defining this class, vs simply using sklearn TimeSeriesSplit, is that this splitter takes the gap and max_delay into account, to make sure the right rows are provided to the pipeline in each case according to what we expect?

I'm just playing devil's advocate here and trying to justify why we want to add code :)

@freddyaboulton (author) commented Nov 20, 2020:

Yes that's the reason!
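For context, a rough sketch (assumed names, not the PR's exact code) of how a gap- and max_delay-aware splitter can wrap sklearn's rolling-origin splits, following the snippets discussed above:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit as SkTimeSeriesSplit

def gap_aware_splits(X, n_splits=3, gap=0, max_delay=0):
    # Hypothetical helper: extend each training fold forward by `gap` rows
    # (their targets are needed as labels) and reach each test fold backward
    # by `max_delay` rows (lagged features need them), clamped to X's bounds.
    max_index = X.shape[0]
    for train, test in SkTimeSeriesSplit(n_splits=n_splits).split(X):
        new_train = np.arange(0, min(train[-1] + 1 + gap, max_index))
        new_test = np.arange(max(test[0] - max_delay, 0),
                             min(test[-1] + 1 + gap, max_index))
        yield new_train, new_test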

last_train = train[-1]
last_test = test[-1]
first_test = test[0]
max_test_index = min(last_test + 1 + self.gap, max_index)
Contributor:

Yep, nice, this looks great.

first_test = test[0]
max_test_index = min(last_test + 1 + self.gap, max_index)
new_train = np.arange(train[0], last_train + 1 + self.gap)
new_test = np.arange(first_test - self.max_delay, max_test_index)
Contributor:

Same question here as above. Will these indices always fit into the bounds of test?

@freddyaboulton (author):

Yes, they will, but we make sure they never extend past the indices of X!

split_kwargs = dict(X=X, y=y, groups=groups)
max_index = X.shape[0]

for train, test in self._splitter.split(**split_kwargs):
Contributor:

I think it's fine for us to build off of the sklearn timeseries splitter. But... would the impl be easier to read/modify/test/debug if we built it from scratch instead?

max_index = len(X)
for fold in range(self.n_folds):
    boundary = int(round(max_index * (fold + 1) / float(self.n_folds)))
    train = np.arange(0, boundary)
    test = np.arange(boundary, max_index)
    yield train, test

@freddyaboulton (author):

I'm not so sure. Sklearn already has a tested implementation of the rolling-origin CV algorithm we set forth in the design phase, so the only thing we need to take care of is making the splits aware of gap and max_delay. I prefer to delegate what we can to one of our dependencies, so that the code living in evalml only needs to worry about evalml requirements. Besides, I feel like implementing rolling-origin CV ourselves would amount to something very similar to sklearn's implementation, which isn't very DRY.

So I vote we keep the implementation as is for now, and we can revisit later if we decide to move away from rolling-origin CV!
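For illustration (not part of this PR), sklearn's TimeSeriesSplit already produces exactly this rolling-origin behavior:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# The training window always starts at index 0 and grows;
# the test window rolls forward -- rolling-origin CV.
for train, test in TimeSeriesSplit(n_splits=3).split(np.zeros((8, 1))):
    print(train, test)
# [0 1] [2 3]
# [0 1 2 3] [4 5]
# [0 1 2 3 4 5] [6 7]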

_ = list(ts_split.split(X=None, y=None))

with pytest.raises(ValueError, match="Both X and y cannot be None or empty in TimeSeriesSplit.split"):
    _ = list(ts_split.split(X=pd.DataFrame(), y=pd.Series([])))
Contributor:

👍

I believe this should still work if both are "empty" but have valid indices defined
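For reference, a sketch of what "empty but with a valid index" could look like; note that pandas still reports .empty as True for a column-less frame, so whether this passes depends on how _check_if_empty is written:

import pandas as pd

# No columns, but a real index the splitter could in principle use.
X_hollow = pd.DataFrame(index=pd.date_range("2020-10-01", "2020-10-31"))
print(X_hollow.empty)  # True -- pandas treats a column-less frame as empty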


answer = [(pd.date_range("2020-10-01", f"2020-10-{10 + gap}"), pd.date_range(f"2020-10-{11 - max_delay}", f"2020-10-{17 + gap}")),
          (pd.date_range("2020-10-01", f"2020-10-{17 + gap}"), pd.date_range(f"2020-10-{18 - max_delay}", f"2020-10-{24 + gap}")),
          (pd.date_range("2020-10-01", f"2020-10-{24 + gap}"), pd.date_range(f"2020-10-{25 - max_delay}", "2020-10-31"))]
Contributor:

Ah neat, I didn't expect the answers here to be this cleanly defined.

Considering the case when max_delay=0 and gap=0 for simplicity, why does the first training split range from 10-01 to 10-10 excluding the last day (i.e. 9 days), but the test split always has 6 days, and each subsequent update increases by 7 days? I guess my question is, why aren't all these splits the same size?

I think the simplest algo to follow for this would be if we divide the data into n_splits equal sections, and then each split in the generator uses the i-th ordered section as the test data, and the concatenation of all previous ordered sections as the training data. It appears the algo we've currently got matches this in the ordering, but the section sizes are a bit different. How come?

@freddyaboulton (author) commented Nov 20, 2020:

I think the reason here is that the number of datapoints (31) is not divisible by the number of folds (4 because n_splits=3 creates 4 chunks of data).

Splitting the data equally into four would result in four 7-day chunks of data with three days leftover. We just add the first three days to the first chunk of data.

I think if the data isn't evenly divisible by the number of splits, you'll always end up in a situation where one of the splits is slightly larger (unless you throw out data, which I don't think we should do).

If I make a dataframe with 32 rows instead of 31 and use n_splits=3 with max_delay/gap = 0, the first train/test are 8 days long and each subsequent train set increases by 8 days, which matches our expectation!
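A quick sketch reproducing those sizes with sklearn's splitter directly (illustrative, not test code from this PR):

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 31 rows, n_splits=3 -> 4 chunks; 31 // 4 = 7-row test folds, and the
# 3 leftover rows are absorbed into the first training fold.
for train, test in TimeSeriesSplit(n_splits=3).split(np.zeros((31, 1))):
    print(len(train), len(test))   # 10 7, then 17 7, then 24 7

# With 32 rows, 32 // 4 = 8, so every chunk is the same size.
for train, test in TimeSeriesSplit(n_splits=3).split(np.zeros((32, 1))):
    print(len(train), len(test))   # 8 8, then 16 8, then 24 8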

Contributor:

@freddyaboulton got it, thanks for the explanation. I wanted to make sure there wasn't a bug lurking in there somewhere. I would've guessed the underlying sklearn splits would be as even as possible but I guess not! It'll be interesting to learn how this behaves on other datasets.

@freddyaboulton freddyaboulton force-pushed the 1381-timeseries-data-split branch from 879f9cb to f9b1011 Compare November 20, 2020 19:28
@freddyaboulton freddyaboulton force-pushed the 1381-timeseries-data-split branch from f9b1011 to d0313df Compare November 20, 2020 20:26
@freddyaboulton freddyaboulton self-assigned this Nov 20, 2020
@freddyaboulton freddyaboulton merged commit e0f65c1 into main Nov 23, 2020
@freddyaboulton freddyaboulton deleted the 1381-timeseries-data-split branch November 23, 2020 15:25
@dsherry dsherry mentioned this pull request Nov 24, 2020