Statically set woodwork typing in tests #3697
…licitly set types
Codecov Report
@@           Coverage Diff           @@
##            main   #3697    +/-   ##
=======================================
- Coverage   99.7%   99.7%   -0.0%
=======================================
  Files        339     339
  Lines      34465   34386     -79
=======================================
- Hits       34338   34254     -84
- Misses       127     132      +5
evalml/pipelines/components/transformers/imputers/time_series_imputer.py
Moving back to draft
- logical_types={col: "Double" for col in cols_derived_from_categoricals},
+ logical_types={col: "Double" for col in lagged_features.columns},
I'm not confident this change is correct: I'm unsure how we handle all types here and whether everything does in fact become a Double. If anyone knows better, please let me know.
Nice job Becca. Really huge amount of work and very impressive that you lowered the sensitivity of the codebase to inference by so much!!
if problem_type == ProblemTypes.TIME_SERIES_MULTICLASS:
    X, y = ts_data_multi
elif problem_type == ProblemTypes.TIME_SERIES_BINARY:
    X, y = ts_data_binary
else:
    X, y = ts_data
Nice refactoring 👍
@@ -568,7 +566,7 @@ def test_simple_imputer_ignores_natural_language(
     ans = X_df.mode().iloc[0, :]
     ans["natural language col"] = pd.NA
     X_df.iloc[-1, :] = ans
-    assert_frame_equal(result, X_df)
+    assert_frame_equal(result, X_df, check_dtype=False)
Classic
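For readers less familiar with the `check_dtype=False` change above, here is a minimal pandas-only sketch of what it relaxes. The two frames are made up for illustration and are not the test's data; the only assumption is that `result` and the expected frame hold equal values under different dtypes.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical frames: same values, different dtypes (int64 vs float64).
expected = pd.DataFrame({"a": [1, 2, 3]})
result = pd.DataFrame({"a": [1.0, 2.0, 3.0]})

# The default comparison also checks dtypes, so it raises AssertionError here.
try:
    assert_frame_equal(result, expected)
    strict_passed = True
except AssertionError:
    strict_passed = False

# With check_dtype=False only the values (and labels) are compared.
assert_frame_equal(result, expected, check_dtype=False)
```

The trade-off is that the assertion no longer catches an unintended dtype change, which is presumably acceptable when the dtype depends on inference rather than on the imputer's logic.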
Great work on making the tests less reliant on inference. Just had a comment about NaN behavior and downcasting for Series. Lmk what you think!
@@ -297,7 +297,22 @@ def test_schema_is_equal_fraud(fraud_100):
     assert _schema_is_equal(X.ww.schema, X2.ww.schema)
 
 
-def test_test_downcast_nullable_types_can_handle_no_schema():
+def test_downcast_nullable_types_series():
Do you want to add a test case with NaNs as well?
X_bool_nullable_cols = X.ww.select("BooleanNullable")
X_int_nullable_cols = X.ww.select(["IntegerNullable", "AgeNullable"])
if isinstance(data, pd.Series):
If the Series contains null values, should we still downcast into Boolean or Double? I think we should follow what we already have and use ignore_null_cols to differentiate.
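To make the question concrete, here is a rough pandas-only sketch of how an `ignore_null_cols`-style flag could gate Series downcasting. The helper `downcast_nullable_series` is hypothetical and is not evalml's implementation; in particular, casting to `float64` when nulls are present and not ignored is an assumption made for illustration.

```python
import pandas as pd


def downcast_nullable_series(series: pd.Series, ignore_null_cols: bool = True) -> pd.Series:
    """Hypothetical helper: downcast pandas nullable dtypes on a single Series."""
    is_nullable_bool = isinstance(series.dtype, pd.BooleanDtype)
    is_nullable_int = isinstance(series.dtype, pd.Int64Dtype)
    if not (is_nullable_bool or is_nullable_int):
        # Nothing to downcast.
        return series

    has_nulls = bool(series.isna().any())
    if has_nulls and ignore_null_cols:
        # Leave Series with nulls untouched, mirroring the flag's name.
        return series
    if has_nulls:
        # Assumption: nulls force a float representation (pd.NA becomes NaN).
        return series.astype("float64")
    # No nulls: the non-nullable counterpart can hold every value.
    return series.astype("bool" if is_nullable_bool else "int64")


ints = pd.Series([1, 2, 3], dtype="Int64")
ints_with_null = pd.Series([1, None], dtype="Int64")
downcast_ints = downcast_nullable_series(ints)
kept = downcast_nullable_series(ints_with_null)
as_float = downcast_nullable_series(ints_with_null, ignore_null_cols=False)
```

Whatever the real behavior ends up being, a test case with NaNs would pin down which of these branches a Series with nulls takes.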
Closes #3651
This is not a woodwork 0.17.2 upgrade. The goal was simply to add explicit woodwork typing throughout the tests to make them more resilient to woodwork inference changes. The end result of this does have fewer tests failing in 0.17.2 since I used the upgrade as a check for general stability.
This does include some setting of woodwork types in components though, related to my previous work about being explicit about woodwork typing to reduce the number of times we re-infer within components (plus it makes tests pass 😁).
Changes in this PR:
- Combines the `ts_data` and `get_ts_X_y` pytest test fixtures into one fixture, named `ts_data` but with data from `get_ts_X_y`.
- Updates `ts_data`, `imputer_test_data`, `X_y_binary`, `X_y_multi`, `X_y_regression`, `X_y_categorical_classification`, and `X_y_categorical_regression` to explicitly set woodwork types.
- `downcast_nullable_types` now works with either a DataFrame or a Series as input data (instead of just DataFrames), and there's now a test for it.
- Changes how `ww.init()` is handled in `infer_feature_types` in the case where we already have a valid schema for the input data. This has performance implications, and there are performance test results for this branch.

Before vs after woodwork upgrade comparison:
Main/0.16.4 -> 0 failing tests
Main/0.17.2 -> 287 failing tests
Main/0.18.0 -> 291 failing tests
Branch/0.16.4 -> 0 failing tests
Branch/0.17.2 -> 23 failing tests
Branch/0.18.0 -> 22 failing tests
I apologize in advance for the length of this PR 😬