Gap separation #3208

Merged
merged 56 commits into main from Gap_Separation on Jan 18, 2022
Changes from 40 commits
Commits
56 commits
51ef066
Initial commit
ParthivNaresh Dec 20, 2021
b2f2ca4
Merge branch 'main' into Gap_Separated_Training_Test
ParthivNaresh Dec 20, 2021
58d4dcd
release notes
ParthivNaresh Dec 20, 2021
ad6eb04
Update tests and pin min woodwork to 0.10.0
ParthivNaresh Dec 20, 2021
21a9e4d
update weather demo to fill in missing values
ParthivNaresh Dec 20, 2021
56bb4d4
update demo test
ParthivNaresh Dec 21, 2021
052fc8b
add nltk==3.6.5
ParthivNaresh Dec 21, 2021
e9f4bd9
nltk==3.6.5
ParthivNaresh Dec 21, 2021
b9ea305
plotly and nltk
ParthivNaresh Dec 21, 2021
9b98f74
no message
ParthivNaresh Dec 21, 2021
528da11
no message
ParthivNaresh Dec 21, 2021
d4d98c8
add ValueError for _are_datasets_separated_by_gap_time_index if train…
ParthivNaresh Dec 21, 2021
95e0bc5
no message
ParthivNaresh Dec 22, 2021
6e8d3e3
no message
ParthivNaresh Dec 22, 2021
95e08ea
Merge branch 'main' into Gap_Separated_Training_Test
ParthivNaresh Dec 22, 2021
b9fdd28
no message
ParthivNaresh Dec 22, 2021
7efde41
no message
ParthivNaresh Dec 22, 2021
599ce89
no message
ParthivNaresh Dec 22, 2021
81a8fc9
no message
ParthivNaresh Dec 22, 2021
8e42bf3
Trigger Build
ParthivNaresh Dec 22, 2021
b25d550
Merge branch 'main' into Gap_Separated_Training_Test
ParthivNaresh Jan 4, 2022
b105535
enhance tests
ParthivNaresh Jan 4, 2022
358bd2a
lint fix
ParthivNaresh Jan 4, 2022
8ccd12f
Merge branch 'main' into Gap_Separated_Training_Test
ParthivNaresh Jan 5, 2022
853fcf2
Merge branch 'main' into Gap_Separation
ParthivNaresh Jan 10, 2022
099bf32
Merge branch 'main' into Gap_Separation
ParthivNaresh Jan 11, 2022
035d9a6
no message
ParthivNaresh Jan 11, 2022
fc4c67a
Merge branch 'main' into Gap_Separation
ParthivNaresh Jan 11, 2022
98bcb4e
Merge branch 'Gap_Separation' of https://github.com/alteryx/evalml in…
ParthivNaresh Jan 11, 2022
d6932aa
Merge branch 'main' into Gap_Separation
ParthivNaresh Jan 12, 2022
e4ce8c9
Merge branch 'main' into Gap_Separation
ParthivNaresh Jan 12, 2022
53546f2
circular import
ParthivNaresh Jan 12, 2022
6efc6db
lint
ParthivNaresh Jan 12, 2022
0ad9ed6
circular
ParthivNaresh Jan 12, 2022
2765e42
no message
ParthivNaresh Jan 13, 2022
6628bfc
no message
ParthivNaresh Jan 13, 2022
49f1e29
fix test
ParthivNaresh Jan 13, 2022
5d57abd
no message
ParthivNaresh Jan 13, 2022
361624b
Merge branch 'main' into Gap_Separation
ParthivNaresh Jan 13, 2022
6e13b71
release notes
ParthivNaresh Jan 13, 2022
ec682b0
Merge branch 'main' into Gap_Separation
ParthivNaresh Jan 14, 2022
1fd8a1d
update to not use partialdependence code
ParthivNaresh Jan 14, 2022
352e6ae
Merge branch 'Gap_Separation' of https://github.com/alteryx/evalml in…
ParthivNaresh Jan 14, 2022
69e4ca5
lint
ParthivNaresh Jan 14, 2022
7c6204c
no message
ParthivNaresh Jan 14, 2022
e985fae
no message
ParthivNaresh Jan 14, 2022
e04d628
move from pipeline utils to gen utils
ParthivNaresh Jan 14, 2022
3f3375b
no message
ParthivNaresh Jan 14, 2022
1db24ed
no message
ParthivNaresh Jan 14, 2022
3bb4b90
Merge branch 'main' into Gap_Separation
ParthivNaresh Jan 14, 2022
5b7974e
no message
ParthivNaresh Jan 14, 2022
7412f48
Merge branch 'main' into Gap_Separation
ParthivNaresh Jan 17, 2022
af1b7b9
Merge branch 'main' into Gap_Separation
ParthivNaresh Jan 18, 2022
adf67bf
test change
ParthivNaresh Jan 18, 2022
2f6c4d2
test fix
ParthivNaresh Jan 18, 2022
b176569
no message
ParthivNaresh Jan 18, 2022
2 changes: 2 additions & 0 deletions docs/source/release_notes.rst
@@ -2,6 +2,7 @@ Release Notes
-------------
**Future Releases**
* Enhancements
* Required the separation of training and test data by ``gap`` + 1 units to be verified by ``time_index`` for time series problems :pr:`3208`
* Added support for boolean features for ``ARIMARegressor`` :pr:`3187`
* Updated dependency bot workflow to remove outdated description and add new configuration to delete branches automatically :pr:`3212`
* Fixes
@@ -12,6 +13,7 @@ Release Notes
* Changes
* Changed the default objective to ``MedianAE`` from ``R2`` for time series regression :pr:`3205`
* Removed all-nan Unknown to Double logical conversion in ``infer_feature_types`` :pr:`3196`
* Checking the validity of holdout data for time series problems can be performed by calling ``pipelines.utils.validate_holdout_datasets`` prior to calling ``predict`` :pr:`3208`
* Documentation Changes
* Testing Changes

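As a quick illustration of the ``pipelines.utils.validate_holdout_datasets`` entry above, a minimal sketch of calling it before ``predict`` might look like this. The parameter values and data frames are illustrative assumptions, and the raising behavior reflects this diff rather than any later revision:

```python
import pandas as pd

from evalml.pipelines.utils import validate_holdout_datasets

# Illustrative time series parameters; a real pipeline supplies its own.
params = {"gap": 1, "max_delay": 1, "forecast_horizon": 2, "time_index": "date"}

X_train = pd.DataFrame({"date": pd.date_range("2022-01-01", periods=10, freq="D")})
# With gap=1, the holdout data must start gap + 1 = 2 days after the last
# training date (2022-01-10) and contain at most forecast_horizon rows.
X_holdout = pd.DataFrame({"date": pd.date_range("2022-01-12", periods=2, freq="D")})

# As written in this diff, an invalid holdout set raises a PartialDependenceError
# with code INVALID_HOLDOUT_SET; a valid one passes silently.
validate_holdout_datasets(X_holdout, X_train, params)
```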
17 changes: 17 additions & 0 deletions evalml/demos/weather.py
@@ -1,6 +1,9 @@
"""The Australian daily-min-termperatures weather dataset."""
import pandas as pd

import evalml
from evalml.preprocessing import load_data
from evalml.utils import infer_feature_types


def load_weather():
@@ -15,4 +18,18 @@ def load_weather():
+ evalml.__version__
)
X, y = load_data(filename, index=None, target="Temp")

missing_date_1 = pd.DataFrame([pd.to_datetime("1984-12-31")], columns=["Date"])
missing_date_2 = pd.DataFrame([pd.to_datetime("1988-12-31")], columns=["Date"])
missing_y_1 = pd.Series([14.5], name="Temp")
missing_y_2 = pd.Series([14.5], name="Temp")

X = pd.concat([X.iloc[:1460], missing_date_1, X.iloc[1460:]]).reset_index(drop=True)
X = pd.concat([X.iloc[:2921], missing_date_2, X.iloc[2921:]]).reset_index(drop=True)
y = pd.concat([y.iloc[:1460], missing_y_1, y.iloc[1460:]]).reset_index(drop=True)
y = pd.concat([y.iloc[:2921], missing_y_2, y.iloc[2921:]]).reset_index(drop=True)

X = infer_feature_types(X)
y = infer_feature_types(y)

return X, y
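The filled-in dates above make the demo's daily frequency inferable. A small sanity-check sketch, assuming ``load_weather`` is exported from ``evalml.demos`` and the dataset downloads successfully:

```python
import pandas as pd

from evalml.demos import load_weather

X, y = load_weather()
# With the two missing dates (1984-12-31 and 1988-12-31) filled in, the Date
# column should form an unbroken daily series that pandas can infer.
print(pd.infer_freq(pd.DatetimeIndex(X["Date"])))  # expected: "D"
```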
2 changes: 2 additions & 0 deletions evalml/exceptions/exceptions.py
@@ -124,6 +124,8 @@ class PartialDependenceErrorCode(Enum):
"""ice_plot_requested_for_two_way_partial_dependence_plot"""
INVALID_CLASS_LABEL = "invalid_class_label_requested_for_plot"
"""invalid_class_label_requested_for_plot"""
INVALID_HOLDOUT_SET = "invalid_holdout_set"
"""invalid_holdout_set"""
ALL_OTHER_ERRORS = "all_other_errors"
"""all_other_errors"""

1 change: 0 additions & 1 deletion evalml/pipelines/time_series_classification_pipelines.py
@@ -119,7 +119,6 @@ def predict_proba(self, X, X_train=None, y_train=None):
X.index = self._move_index_forward(
X_train.index[-X.shape[0] :], self.gap + X.shape[0]
)
self._validate_holdout_datasets(X, X_train)
y_holdout = self._create_empty_series(y_train, X.shape[0])
y_holdout = infer_feature_types(y_holdout)
y_holdout.index = X.index
56 changes: 10 additions & 46 deletions evalml/pipelines/time_series_pipeline_base.py
@@ -36,11 +36,11 @@ def __init__(
"time_index, gap, max_delay, and forecast_horizon parameters cannot be omitted from the parameters dict. "
"Please specify them as a dictionary with the key 'pipeline'."
)
pipeline_params = parameters["pipeline"]
self.gap = pipeline_params["gap"]
self.max_delay = pipeline_params["max_delay"]
self.forecast_horizon = pipeline_params["forecast_horizon"]
self.time_index = pipeline_params["time_index"]
self.pipeline_params = parameters["pipeline"]
self.gap = self.pipeline_params["gap"]
self.max_delay = self.pipeline_params["max_delay"]
self.forecast_horizon = self.pipeline_params["forecast_horizon"]
self.time_index = self.pipeline_params["time_index"]
if self.time_index is None:
raise ValueError("Parameter time_index cannot be None!")
super().__init__(
@@ -66,55 +66,20 @@ def _move_index_forward(index, gap):
else:
return index + gap

@staticmethod
def _are_datasets_separated_by_gap(train_index, test_index, gap):
"""Determine if the train and test datasets are separated by gap number of units.

This will be true when users are predicting on unseen data but not during cross
validation since the target is known.
"""
gap_difference = gap + 1
index_difference = test_index[0] - train_index[-1]
if isinstance(
train_index, (pd.DatetimeIndex, pd.PeriodIndex, pd.TimedeltaIndex)
):
gap_difference *= test_index.freq
return index_difference == gap_difference

def _validate_holdout_datasets(self, X, X_train):
"""Validate the holdout datasets match out expectations.

Args:
X (pd.DataFrame): Data of shape [n_samples, n_features].
X_train (pd.DataFrame): Training data.

Raises:
ValueError: If holdout data does not have forecast_horizon entries or if datasets
are not separated by gap.
"""
right_length = len(X) <= self.forecast_horizon
X_separated_by_gap = self._are_datasets_separated_by_gap(
X_train.index, X.index, self.gap
)
if not (right_length and X_separated_by_gap):
raise ValueError(
f"Holdout data X must have {self.forecast_horizon} rows (value of forecast horizon) "
"and its index needs to "
f"start {self.gap + 1} values ahead of the training index. "
f"Data received - Length X: {len(X)}, "
f"X index start: {X.index[0]}, X_train index end {X_train.index[-1]}."
)

def _add_training_data_to_X_Y(self, X, y, X_train, y_train):
"""Append the training data to the holdout data.

Need to do this so that we have all the data we need to compute lagged features on the holdout set.
"""
from evalml.pipelines.utils import (
Contributor:

Can we move the import to top of file?

Contributor Author:

We end up running into a circular dependency issue unfortunately

Contributor:

Gotcha. Let's move it to gen_utils then? That's where are_ts_parameters_valid_for_split lives, so I think it's sensible to include it there.

are_datasets_separated_by_gap_time_index,
)

last_row_of_training = self.forecast_horizon + self.max_delay + self.gap
gap_features = pd.DataFrame()
gap_target = pd.Series()
if (
self._are_datasets_separated_by_gap(X_train.index, X.index, self.gap)
are_datasets_separated_by_gap_time_index(X_train, X, self.pipeline_params)
and self.gap
):
# The training data does not have the gap dates so don't need to include them
@@ -235,7 +200,6 @@ def predict(self, X, objective=None, X_train=None, y_train=None):
X.index = self._move_index_forward(
X_train.index[-X.shape[0] :], self.gap + X.shape[0]
)
self._validate_holdout_datasets(X, X_train)
Contributor Author:

Based on the third point of this: Rather than raising a ValueError in predict, let's refactor that logic into a helper function that could be used before calling predict. predict should no longer raise exceptions if the data violates our constraints.

y_holdout = self._create_empty_series(y_train, X.shape[0])
y_holdout = infer_feature_types(y_holdout)
y_holdout.index = X.index
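The comment above asks for the validation to move into a helper that runs before ``predict``; the rule it enforces is the same ``gap`` + 1 separation. A small, self-contained illustration of that arithmetic in plain pandas (no evalml required):

```python
import pandas as pd

# With a daily frequency and gap=2, the first holdout timestamp must fall
# gap + 1 = 3 days after the last training timestamp.
train_dates = pd.date_range("2022-01-01", periods=10, freq="D")
gap = 2
expected_holdout_start = train_dates[-1] + (gap + 1) * train_dates.freq
print(expected_holdout_start)  # 2022-01-13 00:00:00
```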
70 changes: 70 additions & 0 deletions evalml/pipelines/utils.py
@@ -4,6 +4,7 @@

from woodwork import logical_types

from ..exceptions import PartialDependenceError, PartialDependenceErrorCode
from . import (
TimeSeriesBinaryClassificationPipeline,
TimeSeriesMulticlassClassificationPipeline,
@@ -815,6 +816,75 @@ def make_timeseries_baseline_pipeline(problem_type, gap, forecast_horizon, time_
return baseline


def are_datasets_separated_by_gap_time_index(train, test, pipeline_params):
"""Determine if the train and test datasets are separated by gap number of units using the time_index.

This will be true when users are predicting on unseen data but not during cross
validation since the target is known.

Args:
train (pd.DataFrame): Training data.
test (pd.DataFrame): Data of shape [n_samples, n_features].
pipeline_params (dict): Dictionary of time series parameters.

Returns:
bool: True if the difference in time units is equal to gap + 1.

"""
gap_difference = pipeline_params["gap"] + 1

train_copy = train.copy()
test_copy = test.copy()
train_copy.ww.init(time_index=pipeline_params["time_index"])
test_copy.ww.init(time_index=pipeline_params["time_index"])

X_frequency_dict = train_copy.ww.infer_temporal_frequencies(
temporal_columns=[train_copy.ww.time_index]
)
freq = X_frequency_dict[test_copy.ww.time_index]
if freq is None:
return True
Contributor Author:

Based on the third point of this: If the training data does not have an inferable frequency, let's assume the datasets are correctly separated by the gap for now.


first_testing_date = test_copy[test_copy.ww.time_index].iloc[0]
last_training_date = train_copy[train_copy.ww.time_index].iloc[-1]
dt_difference = first_testing_date - last_training_date

try:
units_difference = dt_difference / freq
except ValueError:
units_difference = dt_difference / ("1" + freq)
return units_difference == gap_difference


def validate_holdout_datasets(X, X_train, pipeline_params):
"""Validate the holdout datasets match out expectations.

Args:
X (pd.DataFrame): Data of shape [n_samples, n_features].
X_train (pd.DataFrame): Training data.
pipeline_params (dict): Dictionary of time series parameters.

Raises:
PartialDependenceError: If holdout data does not have forecast_horizon entries or if datasets are not separated by gap.
"""
forecast_horizon = pipeline_params["forecast_horizon"]
gap = pipeline_params["gap"]
time_index = pipeline_params["time_index"]
right_length = len(X) <= forecast_horizon
X_separated_by_gap = are_datasets_separated_by_gap_time_index(
X_train, X, pipeline_params
)
if not (right_length and X_separated_by_gap):
raise PartialDependenceError(
Contributor Author:

Based on the second point of this: This helper function will return "something" if the test data violates our constraints. Users can then use this "something" to display warning messages prior to calling predict.

Contributor:

Maybe it'll be better if, rather than raising an exception, we return a tuple of bool and List[ValidationErrorCode]?

  • If the dataset is valid, return True, []
  • If the dataset does not have the right length but is separated by gap, return False, [NotRightLength]
  • If the dataset has the right length but is not separated by gap, return False, [NotSeparatedByGap]
  • If the dataset is not the right length and not separated by gap, return False, [NotRightLength, NotSeparatedByGap]

If we do it this way, it might be easier to communicate which of the two criteria was not met.
What do you think? FYI @fjlanasa

Contributor:

I think that makes sense.

f"Holdout data X must have {forecast_horizon} rows (value of forecast horizon) "
f"and the first value indicated by the column {time_index} needs to "
f"start {gap + 1} units ahead of the training data. "
f"Data received - Length X: {len(X)}, "
f"X value start: {X[time_index].iloc[0]}, X_train value end {X_train[time_index].iloc[-1]}.",
PartialDependenceErrorCode.INVALID_HOLDOUT_SET,
)


def rows_of_interest(
pipeline, X, y=None, threshold=None, epsilon=0.1, sort_values=True, types="all"
):
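For reference, a rough sketch of the reviewer's tuple-return suggestion from the thread above. The enum values, function name, and signature are hypothetical; they only illustrate the proposal, not the merged implementation:

```python
from enum import Enum
from typing import List, Tuple

import pandas as pd


class ValidationErrorCode(Enum):
    """Hypothetical codes mirroring the two criteria discussed above."""

    NOT_RIGHT_LENGTH = "not_right_length"
    NOT_SEPARATED_BY_GAP = "not_separated_by_gap"


def validate_holdout_datasets_sketch(
    X: pd.DataFrame, X_train: pd.DataFrame, pipeline_params: dict
) -> Tuple[bool, List[ValidationErrorCode]]:
    """Report which criteria failed instead of raising an exception."""
    # Imported lazily only to keep this standalone sketch self-contained;
    # inside evalml/pipelines/utils.py the helper is defined just above.
    from evalml.pipelines.utils import are_datasets_separated_by_gap_time_index

    errors = []
    if len(X) > pipeline_params["forecast_horizon"]:
        errors.append(ValidationErrorCode.NOT_RIGHT_LENGTH)
    if not are_datasets_separated_by_gap_time_index(X_train, X, pipeline_params):
        errors.append(ValidationErrorCode.NOT_SEPARATED_BY_GAP)
    return len(errors) == 0, errors
```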
6 changes: 3 additions & 3 deletions evalml/tests/automl_tests/test_engine_base.py
@@ -135,13 +135,13 @@ def test_train_pipeline_trains_and_tunes_threshold(

def test_train_pipeline_trains_and_tunes_threshold_ts(
ts_data,
dummy_ts_binary_linear_classifier_pipeline_class,
dummy_ts_binary_tree_classifier_pipeline_class,
):
X = pd.DataFrame([i for i in range(32)])
X = pd.DataFrame(pd.date_range("1/1/21", periods=32), columns=["date"])
y = pd.Series([0, 1, 0, 1] * 8)

params = {"gap": 1, "max_delay": 1, "forecast_horizon": 1, "time_index": "date"}
ts_binary = dummy_ts_binary_linear_classifier_pipeline_class(
ts_binary = dummy_ts_binary_tree_classifier_pipeline_class(
parameters={"pipeline": params}
)
assert ts_binary.threshold is None
6 changes: 3 additions & 3 deletions evalml/tests/conftest.py
@@ -826,11 +826,11 @@ def __init__(


@pytest.fixture
def dummy_ts_binary_linear_classifier_pipeline_class():
log_reg_classifier = LogisticRegressionClassifier
def dummy_ts_binary_tree_classifier_pipeline_class():
dec_tree_classifier = DecisionTreeClassifier

class MockBinaryClassificationPipeline(TimeSeriesBinaryClassificationPipeline):
estimator = log_reg_classifier
estimator = dec_tree_classifier
component_graph = [estimator]

def __init__(
26 changes: 26 additions & 0 deletions evalml/tests/demo_tests/test_datasets.py
@@ -86,5 +86,31 @@ def test_datasets(dataset_name, expected_shape, local_datasets):
def test_datasets_match_local(dataset_name, demo_method, local_datasets):
X, y = demo_method
X_local, y_local = local_datasets[dataset_name]

if dataset_name == "daily_temp":
missing_date_1 = pd.DataFrame([pd.to_datetime("1984-12-31")], columns=["Date"])
missing_date_2 = pd.DataFrame([pd.to_datetime("1988-12-31")], columns=["Date"])
missing_y_1 = pd.Series([14.5], name="Temp")
missing_y_2 = pd.Series([14.5], name="Temp")

X_local = pd.concat(
[
X_local.iloc[:1460],
missing_date_1,
X_local.iloc[1460:2920],
missing_date_2,
X_local.iloc[2920:],
]
).reset_index(drop=True)
y_local = pd.concat(
[
y_local.iloc[:1460],
missing_y_1,
y_local.iloc[1460:2920],
missing_y_2,
y_local.iloc[2920:],
]
).reset_index(drop=True)

pd.testing.assert_frame_equal(X, X_local)
pd.testing.assert_series_equal(y, y_local)