
Woodwork 0.17.2 compatibility :( #3626

Merged
merged 109 commits into main from ww_0.17.0_compatibility
Aug 10, 2022
Conversation

chukarsten
Contributor

@chukarsten chukarsten commented Jul 25, 2022

All the work required to get EvalML compatible with Woodwork 0.17.2.

@codecov

codecov bot commented Jul 27, 2022

Codecov Report

Merging #3626 (69156a1) into main (987cdcc) will decrease coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #3626     +/-   ##
=======================================
- Coverage   99.7%   99.7%   -0.0%     
=======================================
  Files        335     335             
  Lines      33750   33839     +89     
=======================================
+ Hits       33627   33714     +87     
- Misses       123     125      +2     
Impacted Files Coverage Δ
.../tests/component_tests/test_datetime_featurizer.py 100.0% <ø> (ø)
.../component_tests/test_drop_nan_rows_transformer.py 100.0% <ø> (ø)
evalml/tests/data_checks_tests/test_data_checks.py 100.0% <ø> (ø)
evalml/tests/pipeline_tests/test_pipeline_utils.py 99.8% <ø> (ø)
evalml/model_understanding/metrics.py 100.0% <100.0%> (ø)
evalml/pipelines/classification_pipeline.py 100.0% <100.0%> (ø)
...omponents/estimators/regressors/arima_regressor.py 100.0% <100.0%> (ø)
...onents/estimators/regressors/catboost_regressor.py 100.0% <100.0%> (ø)
...onents/estimators/regressors/lightgbm_regressor.py 100.0% <100.0%> (ø)
...elines/components/transformers/imputers/imputer.py 100.0% <100.0%> (ø)
... and 28 more


@chukarsten chukarsten changed the title Ww 0.17.0 compatibility Woodwork 0.17.0 compatibility :( Jul 27, 2022
@ParthivNaresh ParthivNaresh changed the title Woodwork 0.17.0 compatibility :( Woodwork 0.17.2 compatibility :( Aug 6, 2022
Contributor

@ParthivNaresh ParthivNaresh left a comment


Awesome work man, this was truly next level finessing. I think some outstanding questions/issues are:

  • Why is the ClassImbalanceDataCheck potentially casting 2.0 to 2?
  • Issue has been filed to keep pseudo floats like 2.0, 12.0, -4.0, etc as Double instead of Integer or IntegerNullable.
  • Issue has been filed for support of multiple column assignments
  • Issue has been filed for no exception when casting to a Boolean.
  • Issue has been filed to match ww.init() behaviour across DataFrames and Series.
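The Double-vs-Integer narrowing described in the second bullet can be reproduced with plain pandas, independent of Woodwork (a minimal sketch with hypothetical data):

```python
import pandas as pd

# Whole-number floats like 2.0, 12.0, -4.0 get narrowed to the nullable
# Int64 dtype by pandas' type inference, dropping the float64/Double type.
s = pd.Series([2.0, 12.0, -4.0])
print(s.dtype)                   # float64
print(s.convert_dtypes().dtype)  # Int64 -- the surprising narrowing
```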

@@ -161,7 +161,8 @@ def transform(self, X, y=None):
if self._numeric_cols is not None and len(self._numeric_cols) > 0:
X_numeric = X.ww[self._numeric_cols.tolist()]
imputed = self._numeric_imputer.transform(X_numeric)
X_no_all_null[X_numeric.columns] = imputed
for numeric_col in X_numeric.columns:
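The per-column loop above can be sketched in plain pandas (hypothetical data; assigning each imputed column back individually rather than all at once, which is the pattern the multiple-column-assignment issue refers to):

```python
import pandas as pd

X = pd.DataFrame({"a": [1.0, None, 3.0], "b": [None, 5.0, 6.0]})
imputed = X.fillna(X.mean())  # stand-in for the numeric imputer

# Assign each imputed column back one at a time, mirroring the diff above
for col in imputed.columns:
    X[col] = imputed[col]
```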

I've filed it

imputed.bfill(inplace=True) # Fill in the first value, if missing
X_not_all_null[X_interpolate.columns] = imputed
X_not_all_null.ww.init(schema=X_schema)
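The interpolate-then-backfill step in the snippet above can be illustrated with plain pandas (hypothetical data): linear interpolation cannot fill a leading missing value, so `bfill` handles the first row.

```python
import pandas as pd

s = pd.Series([None, 2.0, None, 4.0])
imputed = s.interpolate()  # linear interpolation; the leading NaN remains
imputed = imputed.bfill()  # fill in the first value, if missing
```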

Reinitializes the dataframe with the original schema, excluding IntegerNullable and BooleanNullable types, so that those columns can be re-inferred post-imputation

y_imputed.bfill(inplace=True)
y_imputed.ww.init(schema=y.ww.schema)

X_not_all_null.ww.init(schema=X_schema)

Covered as part of test_numeric_only_input and test_imputer_bool_dtype_object

@@ -311,5 +312,8 @@ def transform(self, X, y=None):

if cleaned_y is not None:
cleaned_y = cleaned_y["target"]
cleaned_y = ww.init_series(cleaned_y)

cleaned_x.ww.init()

Introduction of nulls makes initialization necessary here

@@ -380,3 +381,27 @@ def make_balancing_dictionary(y, sampling_ratio):
# this class is already larger than the ratio, don't change
class_dic[index] = value_counts[index]
return class_dic


def downcast_int_nullable_to_double(X):

A helper function for components that don't accept an IntegerArray or can't cast values from a float to an int
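A plausible sketch of what such a helper might look like (hypothetical implementation, not the PR's actual code):

```python
import pandas as pd

def downcast_int_nullable_to_double(X):
    """Hypothetical sketch: cast nullable Int64 columns to plain float64."""
    for col in X.columns:
        if str(X[col].dtype) == "Int64":
            # pd.NA becomes NaN, and estimators see a plain float array
            X[col] = X[col].astype("float64")
    return X

X = pd.DataFrame({"a": pd.array([1, None, 3], dtype="Int64")})
X = downcast_int_nullable_to_double(X)
```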

@@ -88,14 +90,14 @@ def test_numeric_only_input(imputer_test_data):
expected = pd.DataFrame(
{
"int col": [0, 1, 2, 0, 3] * 4,
"float col": [0.0, 1.0, 0.0, -2.0, 5.0] * 4,
"float col": [0.1, 1.0, 0.0, -2.0, 5.0] * 4,

Filed this

@@ -1982,7 +1982,7 @@ def imputer_test_data():
),
"int col": [0, 1, 2, 0, 3] * 4,
"object col": ["b", "b", "a", "c", "d"] * 4,
"float col": [0.0, 1.0, 0.0, -2.0, 5.0] * 4,
"float col": [0.1, 1.0, 0.0, -2.0, 5.0] * 4,

Filed this

@ParthivNaresh ParthivNaresh marked this pull request as ready for review August 7, 2022 16:12
Contributor

@eccabay eccabay left a comment


LGTM! Just left some small efficiency and clarification questions.

@@ -4016,7 +4016,7 @@ def test_automl_baseline_pipeline_predictions_and_scores_time_series(problem_typ
expected_predictions = pd.Series(expected_predictions, name="target_delay_1")

preds = baseline.predict(X_validation, None, X_train, y_train)
pd.testing.assert_series_equal(expected_predictions, preds)
pd.testing.assert_series_equal(expected_predictions, preds, check_dtype=False)

This worries me slightly - are there any scenarios where this would cause us issues down the road?
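The kind of dtype mismatch that `check_dtype=False` papers over can be illustrated with a minimal sketch (hypothetical data): the values match, only the inferred dtypes differ.

```python
import pandas as pd

expected = pd.Series([1.0, 2.0, 3.0], name="target_delay_1")  # float64
preds = pd.Series([1, 2, 3], name="target_delay_1")           # e.g. int64

# Without the flag this raises on the dtype difference alone;
# with check_dtype=False only the values are compared.
pd.testing.assert_series_equal(expected, preds, check_dtype=False)
```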

evalml/pipelines/components/utils.py (review thread resolved)
@@ -288,6 +288,7 @@ def transform(self, X, y=None):
delayed_features = self._compute_delays(X_ww, y)
rolling_means = self._compute_rolling_transforms(X_ww, y, original_features)
features = ww.concat_columns([delayed_features, rolling_means])
features.ww.init()

Can we reuse any part of the initial schema or use what we know about the dtypes of these features here to reduce the amount of type reinference this might introduce?

evalml/pipelines/classification_pipeline.py (review thread resolved)
@@ -66,7 +66,16 @@ def fit(self, X, y):
)

self._fit(X, y)
self._classes_ = list(ww.init_series(np.unique(y)))

# TODO: Added this in because numpy's unique() does not support pandas.NA

If there's a workaround for this error, why do we start off by attempting to use numpy? Are there downsides to just using y.unique() in all cases instead?
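The np.unique-vs-pandas.NA issue behind this TODO can be sketched with hypothetical data: `np.unique` tries to sort, and comparisons involving `pd.NA` typically raise, while the pandas-native path handles missing values directly.

```python
import numpy as np
import pandas as pd

y = pd.Series([1, 2, 2, None], dtype="Int64")

# np.unique typically fails here: sorting compares against pd.NA,
# whose truth value is ambiguous
try:
    np.unique(y)
except TypeError:
    pass

# The pandas-native path sidesteps the problem
classes = list(y.dropna().unique())
```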

Contributor

@jeremyliweishih jeremyliweishih left a comment


@chukarsten this looks good to me! Apologies for all the "did we file an issue" comments. Just wanted to make sure we're keeping track of removing these fixes once the appropriate WW changes are in. 😄

@@ -66,7 +66,16 @@ def fit(self, X, y):
)

self._fit(X, y)
self._classes_ = list(ww.init_series(np.unique(y)))

# TODO: Added this in because numpy's unique() does not support pandas.NA

do we have an issue filed to resolve this?

@@ -165,7 +166,10 @@ def transform(self, X, y=None):

if self._interpolate_cols is not None:
X_interpolate = X.ww[self._interpolate_cols]
imputed = X_interpolate.interpolate()
# TODO: Revert when pandas introduces Float64 dtype

do we have an issue filed to track this?

@@ -4016,7 +4016,7 @@ def test_automl_baseline_pipeline_predictions_and_scores_time_series(problem_typ
expected_predictions = pd.Series(expected_predictions, name="target_delay_1")

preds = baseline.predict(X_validation, None, X_train, y_train)
pd.testing.assert_series_equal(expected_predictions, preds)
pd.testing.assert_series_equal(expected_predictions, preds, check_dtype=False)

A little confused here - are preds coming out as integers, and if so, why?

@@ -328,6 +343,7 @@ def test_delayed_feature_extractor_numpy(mock_roll, delayed_features_data):
"target_delay_11": y_answer.shift(11),
},
)
answer_only_y.ww.init()

was the new ww.init call in TimeSeriesFeaturizer in response to this? If not should we file an issue to cover this?

@chukarsten chukarsten merged commit 12ec98e into main Aug 10, 2022
@chukarsten chukarsten deleted the ww_0.17.0_compatibility branch August 10, 2022 03:37
@chukarsten chukarsten mentioned this pull request Aug 10, 2022
chukarsten added a commit that referenced this pull request Aug 15, 2022
* Revert "Woodwork 0.17.2 compatibility :( (#3626)"

This reverts commit 12ec98e.

* Updated the release and pandas version.

Co-authored-by: Karsten Chu <karsten.chu@alteryx.com>
@ParthivNaresh ParthivNaresh restored the ww_0.17.0_compatibility branch August 22, 2022 16:03