Component specific nullable types handling across EvalML #4043

tamargrey · 2023-03-01T21:41:39Z

This is the PR into which all intermediate PRs will be merged for the implementation of nullable type handling.

codecov · 2023-03-01T21:48:30Z

Codecov Report

Merging #4043 (85ee013) into main (b43b1bb) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #4043     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        349     349             
  Lines      37661   37719     +58     
=======================================
+ Hits       37542   37602     +60     
+ Misses       119     117      -2

Impacted Files	Coverage Δ
evalml/data_checks/invalid_target_data_check.py	`100.0% <ø> (ø)`
...alml/data_checks/target_distribution_data_check.py	`100.0% <ø> (ø)`
...onents/estimators/regressors/catboost_regressor.py	`100.0% <ø> (ø)`
evalml/pipelines/components/utils.py	`96.3% <ø> (-0.2%)`	⬇️
evalml/pipelines/utils.py	`99.6% <ø> (-<0.1%)`	⬇️
...valml/tests/automl_tests/test_default_algorithm.py	`100.0% <ø> (ø)`
evalml/tests/component_tests/test_utils.py	`99.2% <ø> (-0.1%)`	⬇️
evalml/tests/conftest.py	`98.3% <ø> (ø)`
evalml/tests/data_checks_tests/test_data_checks.py	`100.0% <ø> (ø)`
...hecks_tests/test_target_distribution_data_check.py	`100.0% <ø> (ø)`
... and 40 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

jeremyliweishih

🤩 I am in awe of this PR

eccabay

🎉 🎉 🎉

eccabay · 2023-03-24T19:32:07Z

docs/source/release_notes.rst

@@ -3,11 +3,16 @@ Release Notes
 **Future Releases**
    * Enhancements
        * Updated `pipeline.get_prediction_intervals()` to add trend prediction interval information from STL decomposer :pr:`4093`
+        * Add component-specific nullable type handling :pr:`4043`


An alternative (and I'm not sure what's better here so please feel free to ignore if you disagree) would be to simply list this PR as a second PR for all the changes - i.e. * Handled nullable type incompatibility in ``Decomposer`` :pr:4105, :pr:4043` - there is some precedent for this method earlier in these notes!

I will also continue to be the release notes stickler, can all these comments be past tense? 😁

I like that! Would be better at helping people find the PRs where the discussions happened if need be!

(also sorry about making you be a broken record on the release notes thing. To be fair, I apparently cannot remember, but I also think I chose the wrong note in a couple of rebases, undoing your hard work from the past)

eccabay · 2023-03-24T19:37:26Z

evalml/pipelines/components/transformers/samplers/oversampler.py

+        y_ww = infer_feature_types(y)
+        X_d, y_d = self._handle_nullable_types(X_ww, y_ww)
+        X_t, y_t = super().transform(X_d, y_d)
+        X_t.ww.init(schema=original_schema)


I must have missed this or forgotten it from a previous PR - why do we need to reinit woodwork here?

It's because we change the types of X and y before passing them into super().transform(X_d, y_d), so it doesn't matter that BaseSampler.transform maintains the schema of the data passed in, because we don't want that schema! So I'm re-initing with the original schema here to maintain the original types (which should be possible because the data itself shouldn't be changing in a way that would invalidate the original schema)!

But I'm now seeing that it looks like I forgot to do the same thing for y. It won't because there' no y incompatibility, but I'm gonna change the call to _handle_nullable_types to make that explicit.

I can also add a comment to clarify why this is necessary

* Stop using woodwork describe to get nan info in time series imputer * remove logic that's no longer needed and return if all bool dtype * remove unnecessary logic from target imputer * remove unused utils * remove logic to convert dfs features to categorical logical type * fix email featureizer test * Revert changes to transfomr prim components for testing * Revert "Revert changes to transfomr prim components for testing" This reverts commit 57dda43. * Fix bugs with ww not imputed and string categories * Add release note * Handle case where nans are present at transform in previously all bool dtype data * Stop truncating ints in target imputer * clean up * Fix tests * Keep ltype integer for most frequent impute tpye in target imputer * refactor knn imputer to use new logic * Fix list bug * remove comment * Update release note to mention nullable types * remove outdated comment * Conver all bool dfs to boolean nullable instead of refitting * lint fix * PR comments * add second bool col to imputer fixture

… methods (#4046) * Remove existing nullable type handling from oversampler and use _handle_nullable_types instead * Add handle call for lgbm regressor and remove existing handling * Add handle call for lgbm classifier * temp broken exp smoothing tests * lint fix * add release note * Fix broken tests by initting woodwork on y in lgbm classifier * Update tests * Call handle in arima * call handle from ts imputer y ltype is downcasted value * remove unnecessary comments * Fix time series guide * lint fix * Only call handle_nullable_types when necessary in methods * Remove remaining unnecessary handle calls * resolve remaining comments * Add y ww init to ts imputer back in to fix tests * Copy X in testing nullable types to stop hiding potential incompatibilities in methods * use X_d in lgbm predict proba * remove nullable type handling after sklearn upgrade fixed incompatibilities * use common util to determine type for time series imputed integers * Add comments around why we copy X * remove _prepare_data from samplers * PR comments * remove tests to check if handle method is called * remove nullable types from imputed data because of regularizer * fix typo * fix docstrings * fix codecov issues * PR comments * Revert "Fix time series guide" This reverts commit 964622a. * return unchanged ltype in nullabl;e type utils * add back ts imputer incompatibility test * use dict get return value * call handle nullable types in oversampler and check schema equality

* Update tests * remove unnecessary comments * resolve remaining comments * Copy X in testing nullable types to stop hiding potential incompatibilities in methods * Handle new oversampler nullable type incompatibility and add tests * Remove existing nullable type handling from oversampler and use _handle_nullable_types instead * Add comments to self * Add handle call for lgbm classifier * lint fix * Only call handle_nullable_types when necessary in methods * Remove remaining unnecessary handle calls * resolve remaining comments * Copy X in testing nullable types to stop hiding potential incompatibilities in methods * remove nullable type handling after sklearn upgrade fixed incompatibilities * Remove downcast call from imputer.py and fix tests * remove downcast call from catboost regressor * stop adding ohe if bool null present * remove nullable type handling from knn imputer * remove nullable type handling from targer imputer * use util for ltype deciding in simple imputer * Fix broken target imputer test * Remove replace nullable types transformer from automl search and fix tests * Handle nullable types in lgbm classifier predict_proba * fix after rebase * move util to imputer utils and other fixes * remove duplicate util * Change how imputer reinits woodwork types * Expand tests * use automl test env * fix broken test * Clean up * Add release note * Further clean up and note incompatibility * Continue clean up * fix failing docstring * remove remaining comments * fix invali target data check docstring * let knn and simple imputers use same flow of logic for readability * Move _get_new_logical_types_for_imputed_data to nullable type utils file * PR comment

* Handle decomposer incompatibility * fix comment * Stop scaling the values up * Add release note * PR Comments * Create incompatibility for testing more similarly to how it shows up in the decomposer

chukarsten

Great work, Tamar :)

tamargrey force-pushed the component-specific-nullable-types-handling branch 2 times, most recently from 494c4a5 to 80be9be Compare March 6, 2023 15:44

tamargrey force-pushed the component-specific-nullable-types-handling branch 3 times, most recently from 1130690 to 39e2826 Compare March 16, 2023 15:26

tamargrey force-pushed the component-specific-nullable-types-handling branch 2 times, most recently from 9541e99 to 07b89dc Compare March 20, 2023 16:48

gsheni assigned tamargrey Mar 21, 2023

tamargrey force-pushed the component-specific-nullable-types-handling branch from 07b89dc to 63f12da Compare March 22, 2023 14:42

tamargrey marked this pull request as ready for review March 24, 2023 14:13

tamargrey requested review from eccabay, ParthivNaresh, christopherbunn, chukarsten and jeremyliweishih March 24, 2023 14:13

tamargrey force-pushed the component-specific-nullable-types-handling branch from 2696587 to d89feee Compare March 24, 2023 14:57

jeremyliweishih approved these changes Mar 24, 2023

View reviewed changes

eccabay approved these changes Mar 24, 2023

View reviewed changes

Tamar Grey and others added 9 commits March 27, 2023 16:18

initial commit to make the PR

6ee0dfc

Handle decomposer incompatibility (#4105)

e48b502

* Handle decomposer incompatibility * fix comment * Stop scaling the values up * Add release note * PR Comments * Create incompatibility for testing more similarly to how it shows up in the decomposer

Add release note

6553164

lint fix

38462d8

pr notes

c474572

fix after rebase

946ff0b

tamargrey force-pushed the component-specific-nullable-types-handling branch from ddb42fc to 946ff0b Compare March 27, 2023 20:20

chukarsten approved these changes Mar 29, 2023

View reviewed changes

Merge branch 'main' into component-specific-nullable-types-handling

85ee013

tamargrey enabled auto-merge (squash) March 29, 2023 13:26

tamargrey merged commit 65f5402 into main Mar 29, 2023

tamargrey deleted the component-specific-nullable-types-handling branch March 29, 2023 13:41

chukarsten mentioned this pull request Apr 3, 2023

Release v0.73.0. #4122

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Component specific nullable types handling across EvalML #4043

Component specific nullable types handling across EvalML #4043

tamargrey commented Mar 1, 2023 •

edited

Loading

codecov bot commented Mar 1, 2023 •

edited

Loading

jeremyliweishih left a comment

eccabay left a comment

eccabay Mar 24, 2023 •

edited

Loading

tamargrey Mar 24, 2023

eccabay Mar 24, 2023

tamargrey Mar 24, 2023

chukarsten left a comment

Component specific nullable types handling across EvalML #4043

Component specific nullable types handling across EvalML #4043

Conversation

tamargrey commented Mar 1, 2023 • edited Loading

codecov bot commented Mar 1, 2023 • edited Loading

Codecov Report

jeremyliweishih left a comment

Choose a reason for hiding this comment

eccabay left a comment

Choose a reason for hiding this comment

eccabay Mar 24, 2023 • edited Loading

Choose a reason for hiding this comment

tamargrey Mar 24, 2023

Choose a reason for hiding this comment

eccabay Mar 24, 2023

Choose a reason for hiding this comment

tamargrey Mar 24, 2023

Choose a reason for hiding this comment

chukarsten left a comment

Choose a reason for hiding this comment

tamargrey commented Mar 1, 2023 •

edited

Loading

codecov bot commented Mar 1, 2023 •

edited

Loading

eccabay Mar 24, 2023 •

edited

Loading