Component specific nullable types handling across EvalML #4043

Merged
merged 10 commits into from
Mar 29, 2023

Conversation

@tamargrey tamargrey commented Mar 1, 2023

closes #3706

This is the PR into which all intermediate PRs will be merged for the implementation of nullable type handling.

codecov bot commented Mar 1, 2023

Codecov Report

Merging #4043 (85ee013) into main (b43b1bb) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #4043     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        349     349             
  Lines      37661   37719     +58     
=======================================
+ Hits       37542   37602     +60     
+ Misses       119     117      -2     
Impacted Files Coverage Δ
evalml/data_checks/invalid_target_data_check.py 100.0% <ø> (ø)
...alml/data_checks/target_distribution_data_check.py 100.0% <ø> (ø)
...onents/estimators/regressors/catboost_regressor.py 100.0% <ø> (ø)
evalml/pipelines/components/utils.py 96.3% <ø> (-0.2%) ⬇️
evalml/pipelines/utils.py 99.6% <ø> (-<0.1%) ⬇️
...valml/tests/automl_tests/test_default_algorithm.py 100.0% <ø> (ø)
evalml/tests/component_tests/test_utils.py 99.2% <ø> (-0.1%) ⬇️
evalml/tests/conftest.py 98.3% <ø> (ø)
evalml/tests/data_checks_tests/test_data_checks.py 100.0% <ø> (ø)
...hecks_tests/test_target_distribution_data_check.py 100.0% <ø> (ø)
... and 40 more

@tamargrey tamargrey force-pushed the component-specific-nullable-types-handling branch 2 times, most recently from 494c4a5 to 80be9be Compare March 6, 2023 15:44
@tamargrey tamargrey force-pushed the component-specific-nullable-types-handling branch 3 times, most recently from 1130690 to 39e2826 Compare March 16, 2023 15:26
@tamargrey tamargrey force-pushed the component-specific-nullable-types-handling branch 2 times, most recently from 9541e99 to 07b89dc Compare March 20, 2023 16:48
@tamargrey tamargrey force-pushed the component-specific-nullable-types-handling branch from 07b89dc to 63f12da Compare March 22, 2023 14:42
@tamargrey tamargrey marked this pull request as ready for review March 24, 2023 14:13
@tamargrey tamargrey force-pushed the component-specific-nullable-types-handling branch from 2696587 to d89feee Compare March 24, 2023 14:57
@jeremyliweishih jeremyliweishih left a comment

🤩 I am in awe of this PR

@eccabay eccabay left a comment

🎉 🎉 🎉

@@ -3,11 +3,16 @@ Release Notes
**Future Releases**
* Enhancements
* Updated `pipeline.get_prediction_intervals()` to add trend prediction interval information from STL decomposer :pr:`4093`
* Add component-specific nullable type handling :pr:`4043`

@eccabay eccabay Mar 24, 2023

An alternative (and I'm not sure what's better here so please feel free to ignore if you disagree) would be to simply list this PR as a second PR for all the changes - i.e. * Handled nullable type incompatibility in ``Decomposer`` :pr:`4105`, :pr:`4043` - there is some precedent for this method earlier in these notes!

I will also continue to be the release notes stickler, can all these comments be past tense? 😁

Contributor Author

I like that! Would be better at helping people find the PRs where the discussions happened if need be!

(also sorry about making you be a broken record on the release notes thing. To be fair, I apparently cannot remember, but I also think I chose the wrong note in a couple of rebases, undoing your hard work from the past)

y_ww = infer_feature_types(y)
X_d, y_d = self._handle_nullable_types(X_ww, y_ww)
X_t, y_t = super().transform(X_d, y_d)
X_t.ww.init(schema=original_schema)
Contributor

I must have missed this or forgotten it from a previous PR - why do we need to reinit woodwork here?

Contributor Author

It's because we change the types of X and y before passing them into super().transform(X_d, y_d), so it doesn't matter that BaseSampler.transform maintains the schema of the data passed in, because we don't want that schema! So I'm re-initing with the original schema here to maintain the original types (which should be possible because the data itself shouldn't be changing in a way that would invalidate the original schema)!

But I'm now seeing that it looks like I forgot to do the same thing for y. It won't matter because there's no y incompatibility, but I'm gonna change the call to _handle_nullable_types to make that explicit.

I can also add a comment to clarify why this is necessary
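
The downcast-then-restore pattern described in this reply can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: it uses plain pandas dtypes as a stand-in for a Woodwork schema, and `downcast_nullable` and `transform_preserving_types` are hypothetical names invented here. It also assumes the data has no missing values, so the downcast is lossless.

```python
import pandas as pd

# Assumed mapping from pandas nullable dtypes to their NumPy counterparts.
_NULLABLE_TO_NUMPY = {"Int64": "int64", "boolean": "bool", "Float64": "float64"}


def downcast_nullable(X: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for a nullable type handler.

    Returns a copy of X whose nullable columns use NumPy dtypes.
    Assumes no missing values are present.
    """
    new_dtypes = {
        col: _NULLABLE_TO_NUMPY[str(dtype)]
        for col, dtype in X.dtypes.items()
        if str(dtype) in _NULLABLE_TO_NUMPY
    }
    return X.astype(new_dtypes)


def transform_preserving_types(X: pd.DataFrame, transform) -> pd.DataFrame:
    """Run `transform` on downcast data, then restore the original dtypes."""
    original_dtypes = X.dtypes.to_dict()  # remember the types we want back
    X_d = downcast_nullable(X)            # types the component can handle
    X_t = transform(X_d)                  # output carries the downcast types
    return X_t.astype(original_dtypes)    # re-apply the original types
```

The last step is the analogue of re-initializing with the original schema: the transform faithfully preserves the types it was given, but those are the downcast types, so the original ones have to be re-applied explicitly.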

Tamar Grey and others added 9 commits March 27, 2023 16:18
* Stop using woodwork describe to get nan info in time series imputer

* remove logic that's no longer needed and return if all bool dtype

* remove unnecessary logic from target imputer

* remove unused utils

* remove logic to convert dfs features to categorical logical type

* fix email featurizer test

* Revert changes to transform prim components for testing

* Revert "Revert changes to transform prim components for testing"

This reverts commit 57dda43.

* Fix bugs with ww not imputed and string categories

* Add release note

* Handle case where nans are present at transform in previously all bool dtype data

* Stop truncating ints in target imputer

* clean up

* Fix tests

* Keep ltype integer for most frequent impute type in target imputer

* refactor knn imputer to use new logic

* Fix list bug

* remove comment

* Update release note to mention nullable types

* remove outdated comment

* Convert all bool dfs to boolean nullable instead of refitting

* lint fix

* PR comments

* add second bool col to imputer fixture
… methods (#4046)

* Remove existing nullable type handling from oversampler and use _handle_nullable_types instead

* Add handle call for lgbm regressor and remove existing handling

* Add handle call for lgbm classifier

* temp broken exp smoothing tests

* lint fix

* add release note

* Fix broken tests by initting woodwork on y in lgbm classifier

* Update tests

* Call handle in arima

* call handle from ts imputer y ltype is downcasted value

* remove unnecessary comments

* Fix time series guide

* lint fix

* Only call handle_nullable_types when necessary in methods

* Remove remaining unnecessary handle calls

* resolve remaining comments

* Add y ww init to ts imputer back in to fix tests

* Copy X in testing nullable types to stop hiding potential incompatibilities in methods

* use X_d in lgbm predict proba

* remove nullable type handling after sklearn upgrade fixed incompatibilities

* use common util to determine type for time series imputed integers

* Add comments around why we copy X

* remove _prepare_data from samplers

* PR comments

* remove tests to check if handle method is called

* remove nullable types from imputed data because of regularizer

* fix typo

* fix docstrings

* fix codecov issues

* PR comments

* Revert "Fix time series guide"

This reverts commit 964622a.

* return unchanged ltype in nullable type utils

* add back ts imputer incompatibility test

* use dict get return value

* call handle nullable types in oversampler and check schema equality
* Update tests

* remove unnecessary comments

* resolve remaining comments

* Copy X in testing nullable types to stop hiding potential incompatibilities in methods

* Handle new oversampler nullable type incompatibility and add tests

* Remove existing nullable type handling from oversampler and use _handle_nullable_types instead

* Add comments to self

* Add handle call for lgbm classifier

* lint fix

* Only call handle_nullable_types when necessary in methods

* Remove remaining unnecessary handle calls

* resolve remaining comments

* Copy X in testing nullable types to stop hiding potential incompatibilities in methods

* remove nullable type handling after sklearn upgrade fixed incompatibilities

* Remove downcast call from imputer.py and fix tests

* remove downcast call from catboost regressor

* stop adding ohe if bool null present

* remove nullable type handling from knn imputer

* remove nullable type handling from target imputer

* use util for ltype deciding in simple imputer

* Fix broken target imputer test

* Remove replace nullable types transformer from automl search and fix tests

* Handle nullable types in lgbm classifier predict_proba

* fix after rebase

* move util to imputer utils and other fixes

* remove duplicate util

* Change how imputer reinits woodwork types

* Expand tests

* use automl test env

* fix broken test

* Clean up

* Add release note

* Further clean up and note incompatibility

* Continue clean up

* fix failing docstring

* remove remaining comments

* fix invalid target data check docstring

* let knn and simple imputers use same flow of logic for readability

* Move _get_new_logical_types_for_imputed_data to nullable type utils file

* PR comment
* Handle decomposer incompatibility

* fix comment

* Stop scaling the values up

* Add release note

* PR Comments

* Create incompatibility for testing more similarly to how it shows up in the decomposer
@tamargrey tamargrey force-pushed the component-specific-nullable-types-handling branch from ddb42fc to 946ff0b Compare March 27, 2023 20:20
Copy link
Contributor

@chukarsten chukarsten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, Tamar :)

@tamargrey tamargrey enabled auto-merge (squash) March 29, 2023 13:26
@tamargrey tamargrey merged commit 65f5402 into main Mar 29, 2023
@tamargrey tamargrey deleted the component-specific-nullable-types-handling branch March 29, 2023 13:41
@chukarsten chukarsten mentioned this pull request Apr 3, 2023
Development

Successfully merging this pull request may close these issues.

Clean up downcast functions
4 participants