Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle decomposer incompatibility #4105

Conversation

tamargrey
Copy link
Contributor

@tamargrey tamargrey commented Mar 22, 2023

closes #4103

This handles a small incompatibility in the Decomposer's determine_periodicity method where float64 data containing nans when subtracted from Int64 dtype data produces Float64 dtype data with a np.NaN values that are not recognized by dropna, resulting in a periodicity of None. If we used Float64 as a dtype in Woodwork, we'd also need to handle that here. boolean isn't an issue, because this is only used for regression problems.

@tamargrey tamargrey changed the base branch from main to component-specific-nullable-types-handling March 22, 2023 23:01
@codecov
Copy link

codecov bot commented Mar 22, 2023

Codecov Report

❗ No coverage uploaded for pull request base (component-specific-nullable-types-handling@75a4714). Click here to learn what that means.
The diff coverage is n/a.

@@                             Coverage Diff                              @@
##             component-specific-nullable-types-handling   #4105   +/-   ##
============================================================================
  Coverage                                              ?   99.7%           
============================================================================
  Files                                                 ?     349           
  Lines                                                 ?   37628           
  Branches                                              ?       0           
============================================================================
  Hits                                                  ?   37511           
  Misses                                                ?     117           
  Partials                                              ?       0           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@@ -462,6 +468,7 @@ def test_decomposer_determine_periodicity(
period,
trend_degree=trend_degree,
)
y = ww.init_series(y.astype(int), logical_type=y_logical_type)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recognize that it feels very wrong that we can just truncate y by converting to int and still have the tests work.

Originally, I was scaling up by a factor of 1000 or so to try and maintain the information necessary to achieve periodicity, but that ended up with the tests where the period is set to None and the trend_degree is 1 failing. I don't really know why that would cause those tests to fail but assumed it might have something to do with larger values interacting with the acf_threshold? Oddly, I found that changing the scaling factor to 10 caused more tests to fail but just truncating the data allowed all of them to pass.

This is rubbing me the wrong way, but I don't really understand what's going on in determine_periodicity well enough to find a smarter way around this. I'm open to any thoughts anyone may have!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tamargrey which logical types fail if just using the original float information?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"IntegerNullable" and "AgeNullable" are the logical types that can produce this bug (they both use Int64 underlyingly). The goal with this fix was to allow the same behavior we see with Age and Integer (which both use int64).

If I try to pass the original float information into the init_series call with the nullable logical types, I get an error Error converting datatype for Temp from type float64 to type Int64. since pandas doesn't allow conversion from float64 to Int64 directly. int64 is allowed, but it just truncates the data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm OK with this - an alternative would be to run the tests with the integer logical types with integer y by changing generate_seasonal_data but it would serve the same purpose. Wonder what @eccabay and @chukarsten think!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I'm quite following why this is necessary. Is this just because we get errors when trying to convert the output of generate_seasonal_data into the correct logical type, or is some part of the actual function failing when we have non-integer data? With the former I'm unconcerned, but the latter would worry me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first one - the output of generate_seasonal_data is floats, which we can't convert to Int64 because pandas decided it will no longer quietly truncate your data for you as it does if the intended dtype is int64.

I just needed a way to test Int64 data with determine_periodicity, so rather than reinvent the wheel, I figured I could turn the output of generate_seasonal_data into something I could use.

And to be clear: the incompatibility with Int64 in determine_periodicity is due only to nullable type incompatibilities. It can handle int64 data just fine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, this feels fine to me. Thanks for the clarification!

@tamargrey tamargrey force-pushed the handle-decomposer-incompatibility branch from 2ba48ca to 2aa24d3 Compare March 23, 2023 13:50
@tamargrey tamargrey marked this pull request as ready for review March 23, 2023 14:26

# Multiply by 1000 so we can convert to Integer, truncating the rest of the value
# while still mainlining trend, period, etc
y = y * 1000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this scaling if we don't do so above?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call! I can remove this


subtracted_floats = y - numpy_float_data

dropped_nans = subtracted_floats.dropna()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you explain why we're tracking this here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

7b71388 - let me know if this makes sense to you!

Copy link
Contributor

@jeremyliweishih jeremyliweishih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this LGTM but will leave the final call to the decomp experts!

@@ -462,6 +468,7 @@ def test_decomposer_determine_periodicity(
period,
trend_degree=trend_degree,
)
y = ww.init_series(y.astype(int), logical_type=y_logical_type)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, this feels fine to me. Thanks for the clarification!

@tamargrey tamargrey merged commit 2696587 into component-specific-nullable-types-handling Mar 24, 2023
@tamargrey tamargrey deleted the handle-decomposer-incompatibility branch March 24, 2023 14:02
tamargrey added a commit that referenced this pull request Mar 24, 2023
* Handle decomposer incompatibility

* fix comment

* Stop scaling the values up

* Add release note

* PR Comments

* Create incompatibility for testing more similarly to how it shows up in the decomposer
tamargrey added a commit that referenced this pull request Mar 27, 2023
* Handle decomposer incompatibility

* fix comment

* Stop scaling the values up

* Add release note

* PR Comments

* Create incompatibility for testing more similarly to how it shows up in the decomposer
tamargrey added a commit that referenced this pull request Mar 29, 2023
* initial commit to make the PR

* Refactor imputer components to remove unnecessary logic  (#4038)

* Stop using woodwork describe to get nan info in time series imputer

* remove logic that's no longer needed and return if all bool dtype

* remove unnecessary logic from target imputer

* remove unused utils

* remove logic to convert dfs features to categorical logical type

* fix email featureizer test

* Revert changes to transfomr prim components for testing

* Revert "Revert changes to transfomr prim components for testing"

This reverts commit 57dda43.

* Fix bugs with ww not imputed and string categories

* Add release note

* Handle case where nans are present at transform in previously all bool dtype data

* Stop truncating ints in target imputer

* clean up

* Fix tests

* Keep ltype integer for most frequent impute tpye in target imputer

* refactor knn imputer to use new logic

* Fix list bug

* remove comment

* Update release note to mention nullable types

* remove outdated comment

* Conver all bool dfs to boolean nullable instead of refitting

* lint fix

* PR comments

* add second bool col to imputer fixture

* Use nullable type handling in components' fit, transform, and predict methods (#4046)

* Remove existing nullable type handling from oversampler and use _handle_nullable_types instead

* Add handle call for lgbm regressor and remove existing handling

* Add handle call for lgbm classifier

* temp broken exp smoothing tests

* lint fix

* add release note

* Fix broken tests by initting woodwork on y in lgbm classifier

* Update tests

* Call handle in arima

* call handle from ts imputer y ltype is downcasted value

* remove unnecessary comments

* Fix time series guide

* lint fix

* Only call handle_nullable_types when necessary in methods

* Remove remaining unnecessary handle calls

* resolve remaining comments

* Add y ww init to ts imputer back in to fix tests

* Copy X in testing nullable types to stop hiding potential incompatibilities in methods

* use X_d in lgbm predict proba

* remove nullable type handling after sklearn upgrade fixed incompatibilities

* use common util to determine type for time series imputed integers

* Add comments around why we copy X

* remove _prepare_data from samplers

* PR comments

* remove tests to check if handle method is called

* remove nullable types from imputed data because of regularizer

* fix typo

* fix docstrings

* fix codecov issues

* PR comments

* Revert "Fix time series guide"

This reverts commit 964622a.

* return unchanged ltype in nullabl;e type utils

* add back ts imputer incompatibility test

* use dict get return value

* call handle nullable types in oversampler and check schema equality

* Remove other nullable handling across AutoMLSearch (#4085)

* Update tests

* remove unnecessary comments

* resolve remaining comments

* Copy X in testing nullable types to stop hiding potential incompatibilities in methods

* Handle new oversampler nullable type incompatibility and add tests

* Remove existing nullable type handling from oversampler and use _handle_nullable_types instead

* Add comments to self

* Add handle call for lgbm classifier

* lint fix

* Only call handle_nullable_types when necessary in methods

* Remove remaining unnecessary handle calls

* resolve remaining comments

* Copy X in testing nullable types to stop hiding potential incompatibilities in methods

* remove nullable type handling after sklearn upgrade fixed incompatibilities

* Remove downcast call from imputer.py and fix tests

* remove downcast call from catboost regressor

* stop adding ohe if bool null present

* remove nullable type handling from knn imputer

* remove nullable type handling from targer imputer

* use util for ltype deciding in simple imputer

* Fix broken target imputer test

* Remove replace nullable types transformer from automl search and fix tests

* Handle nullable types in lgbm classifier predict_proba

* fix after rebase

* move util to imputer utils and other fixes

* remove duplicate util

* Change how imputer reinits woodwork types

* Expand tests

* use automl test env

* fix broken test

* Clean up

* Add release note

* Further clean up and note incompatibility

* Continue clean up

* fix failing docstring

* remove remaining comments

* fix invali target data check docstring

* let knn and simple imputers use same flow of logic for readability

* Move _get_new_logical_types_for_imputed_data to nullable type utils file

* PR comment

* Handle decomposer incompatibility (#4105)

* Handle decomposer incompatibility

* fix comment

* Stop scaling the values up

* Add release note

* PR Comments

* Create incompatibility for testing more similarly to how it shows up in the decomposer

* Add release note

* lint fix

* pr notes

* fix after rebase

---------

Co-authored-by: chukarsten <64713315+chukarsten@users.noreply.github.com>
@chukarsten chukarsten mentioned this pull request Apr 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Decomposer: IntegerNulllable incompatibility in determine_periodicity
3 participants