-
Notifications
You must be signed in to change notification settings - Fork 83
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Component specific nullable types handling across EvalML #4043
Conversation
Codecov Report
@@ Coverage Diff @@
## main #4043 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 349 349
Lines 37661 37719 +58
=======================================
+ Hits 37542 37602 +60
+ Misses 119 117 -2
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
494c4a5
to
80be9be
Compare
1130690
to
39e2826
Compare
9541e99
to
07b89dc
Compare
07b89dc
to
63f12da
Compare
2696587
to
d89feee
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤩 I am in awe of this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉 🎉 🎉
docs/source/release_notes.rst
Outdated
@@ -3,11 +3,16 @@ Release Notes | |||
**Future Releases** | |||
* Enhancements | |||
* Updated `pipeline.get_prediction_intervals()` to add trend prediction interval information from STL decomposer :pr:`4093` | |||
* Add component-specific nullable type handling :pr:`4043` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An alternative (and I'm not sure what's better here so please feel free to ignore if you disagree) would be to simply list this PR as a second PR for all the changes - i.e. * Handled nullable type incompatibility in ``Decomposer`` :pr:
4105, :pr:
4043` - there is some precedent for this method earlier in these notes!
I will also continue to be the release notes stickler, can all these comments be past tense? 😁
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that! Would be better at helping people find the PRs where the discussions happened if need be!
(also sorry about making you be a broken record on the release notes thing. To be fair, I apparently cannot remember, but I also think I chose the wrong note in a couple of rebases, undoing your hard work from the past)
y_ww = infer_feature_types(y) | ||
X_d, y_d = self._handle_nullable_types(X_ww, y_ww) | ||
X_t, y_t = super().transform(X_d, y_d) | ||
X_t.ww.init(schema=original_schema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I must have missed this or forgotten it from a previous PR - why do we need to reinit woodwork here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's because we change the types of X and y before passing them into super().transform(X_d, y_d)
, so it doesn't matter that BaseSampler.transform
maintains the schema of the data passed in, because we don't want that schema! So I'm re-initing with the original schema here to maintain the original types (which should be possible because the data itself shouldn't be changing in a way that would invalidate the original schema)!
But I'm now seeing that it looks like I forgot to do the same thing for y. It won't because there' no y incompatibility, but I'm gonna change the call to _handle_nullable_types
to make that explicit.
I can also add a comment to clarify why this is necessary
* Stop using woodwork describe to get nan info in time series imputer * remove logic that's no longer needed and return if all bool dtype * remove unnecessary logic from target imputer * remove unused utils * remove logic to convert dfs features to categorical logical type * fix email featureizer test * Revert changes to transfomr prim components for testing * Revert "Revert changes to transfomr prim components for testing" This reverts commit 57dda43. * Fix bugs with ww not imputed and string categories * Add release note * Handle case where nans are present at transform in previously all bool dtype data * Stop truncating ints in target imputer * clean up * Fix tests * Keep ltype integer for most frequent impute tpye in target imputer * refactor knn imputer to use new logic * Fix list bug * remove comment * Update release note to mention nullable types * remove outdated comment * Conver all bool dfs to boolean nullable instead of refitting * lint fix * PR comments * add second bool col to imputer fixture
… methods (#4046) * Remove existing nullable type handling from oversampler and use _handle_nullable_types instead * Add handle call for lgbm regressor and remove existing handling * Add handle call for lgbm classifier * temp broken exp smoothing tests * lint fix * add release note * Fix broken tests by initting woodwork on y in lgbm classifier * Update tests * Call handle in arima * call handle from ts imputer y ltype is downcasted value * remove unnecessary comments * Fix time series guide * lint fix * Only call handle_nullable_types when necessary in methods * Remove remaining unnecessary handle calls * resolve remaining comments * Add y ww init to ts imputer back in to fix tests * Copy X in testing nullable types to stop hiding potential incompatibilities in methods * use X_d in lgbm predict proba * remove nullable type handling after sklearn upgrade fixed incompatibilities * use common util to determine type for time series imputed integers * Add comments around why we copy X * remove _prepare_data from samplers * PR comments * remove tests to check if handle method is called * remove nullable types from imputed data because of regularizer * fix typo * fix docstrings * fix codecov issues * PR comments * Revert "Fix time series guide" This reverts commit 964622a. * return unchanged ltype in nullabl;e type utils * add back ts imputer incompatibility test * use dict get return value * call handle nullable types in oversampler and check schema equality
* Update tests * remove unnecessary comments * resolve remaining comments * Copy X in testing nullable types to stop hiding potential incompatibilities in methods * Handle new oversampler nullable type incompatibility and add tests * Remove existing nullable type handling from oversampler and use _handle_nullable_types instead * Add comments to self * Add handle call for lgbm classifier * lint fix * Only call handle_nullable_types when necessary in methods * Remove remaining unnecessary handle calls * resolve remaining comments * Copy X in testing nullable types to stop hiding potential incompatibilities in methods * remove nullable type handling after sklearn upgrade fixed incompatibilities * Remove downcast call from imputer.py and fix tests * remove downcast call from catboost regressor * stop adding ohe if bool null present * remove nullable type handling from knn imputer * remove nullable type handling from targer imputer * use util for ltype deciding in simple imputer * Fix broken target imputer test * Remove replace nullable types transformer from automl search and fix tests * Handle nullable types in lgbm classifier predict_proba * fix after rebase * move util to imputer utils and other fixes * remove duplicate util * Change how imputer reinits woodwork types * Expand tests * use automl test env * fix broken test * Clean up * Add release note * Further clean up and note incompatibility * Continue clean up * fix failing docstring * remove remaining comments * fix invali target data check docstring * let knn and simple imputers use same flow of logic for readability * Move _get_new_logical_types_for_imputed_data to nullable type utils file * PR comment
* Handle decomposer incompatibility * fix comment * Stop scaling the values up * Add release note * PR Comments * Create incompatibility for testing more similarly to how it shows up in the decomposer
ddb42fc
to
946ff0b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work, Tamar :)
closes #3706
This is the PR into which all intermediate PRs will be merged for the implementation of nullable type handling.