Handle decomposer incompatibility #4105
Conversation
Codecov Report
@@                        Coverage Diff                         @@
##    component-specific-nullable-types-handling    #4105    +/-   ##
=====================================================================
  Coverage     ?    99.7%
=====================================================================
  Files        ?      349
  Lines        ?    37628
  Branches     ?        0
=====================================================================
  Hits         ?    37511
  Misses       ?      117
  Partials     ?        0
@@ -462,6 +468,7 @@ def test_decomposer_determine_periodicity(
        period,
        trend_degree=trend_degree,
    )
+   y = ww.init_series(y.astype(int), logical_type=y_logical_type)
I recognize that it feels very wrong that we can just truncate y by converting to int and still have the tests work.

Originally, I was scaling up by a factor of 1000 or so to try to maintain the information necessary to achieve periodicity, but that ended up with the tests where the period is set to None and the trend_degree is 1 failing. I don't really know why that would cause those tests to fail, but assumed it might have something to do with larger values interacting with the acf_threshold? Oddly, I found that changing the scaling factor to 10 caused more tests to fail, but just truncating the data allowed all of them to pass.

This is rubbing me the wrong way, but I don't really understand what's going on in determine_periodicity
@tamargrey which logical types fail if just using the original float information?
"IntegerNullable" and "AgeNullable" are the logical types that can produce this bug (they both use Int64 under the hood). The goal with this fix was to allow the same behavior we see with Age and Integer (which both use int64).

If I try to pass the original float information into the init_series call with the nullable logical types, I get the error `Error converting datatype for Temp from type float64 to type Int64.`, since pandas doesn't allow conversion from float64 to Int64 directly. Conversion to int64 is allowed, but it just truncates the data.
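As a standalone illustration of the pandas behavior being described (a sketch only; the exact exception and message vary by pandas version):

```python
import pandas as pd

floats = pd.Series([1.7, 2.3, 3.9], dtype="float64")

# Casting to the NumPy-backed int64 dtype silently truncates toward zero.
truncated = floats.astype("int64")
print(truncated.tolist())  # [1, 2, 3]

# Casting to the nullable Int64 dtype refuses the lossy conversion instead.
try:
    floats.astype("Int64")
except (TypeError, ValueError) as err:
    print(f"Int64 cast rejected: {err}")
```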
I think I'm OK with this - an alternative would be to run the tests with the integer logical types with integer y by changing generate_seasonal_data, but it would serve the same purpose. Wonder what @eccabay and @chukarsten think!
I don't think I'm quite following why this is necessary. Is this just because we get errors when trying to convert the output of generate_seasonal_data into the correct logical type, or is some part of the actual function failing when we have non-integer data? With the former I'm unconcerned, but the latter would worry me.
The first one - the output of generate_seasonal_data is floats, which we can't convert to Int64, because pandas decided it will no longer quietly truncate your data for you the way it does when the intended dtype is int64.

I just needed a way to test Int64 data with determine_periodicity, so rather than reinvent the wheel, I figured I could turn the output of generate_seasonal_data into something I could use.

And to be clear: the incompatibility with Int64 in determine_periodicity is due only to nullable type incompatibilities. It can handle int64 data just fine.
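The fix in the diff above follows exactly that path: truncate to plain ints first (which pandas permits), then move to the nullable dtype. A minimal pandas-only sketch of the two-step conversion, with a hypothetical series standing in for generate_seasonal_data's float output:

```python
import pandas as pd

# Stand-in for the float series returned by generate_seasonal_data.
y = pd.Series([10.6, 11.2, 12.9, 13.4])

# Step 1: float64 -> int64 is permitted, truncating the fractional part.
# Step 2: int64 -> Int64 is lossless, so pandas allows it.
y_nullable = y.astype("int64").astype("Int64")
print(y_nullable.dtype)      # Int64
print(y_nullable.tolist())   # [10, 11, 12, 13]
```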
In that case, this feels fine to me. Thanks for the clarification!
Force-pushed from 2ba48ca to 2aa24d3
# Multiply by 1000 so we can convert to Integer, truncating the rest of the value
# while still maintaining trend, period, etc
y = y * 1000
do we need this scaling if we don't do so above?
good call! I can remove this
subtracted_floats = y - numpy_float_data

dropped_nans = subtracted_floats.dropna()
can you explain why we're tracking this here?
7b71388 - let me know if this makes sense to you!
I think this LGTM but will leave the final call to the decomp experts!
* Handle decomposer incompatibility
* fix comment
* Stop scaling the values up
* Add release note
* PR Comments
* Create incompatibility for testing more similarly to how it shows up in the decomposer
* initial commit to make the PR
* Refactor imputer components to remove unnecessary logic (#4038)
* Stop using woodwork describe to get nan info in time series imputer
* remove logic that's no longer needed and return if all bool dtype
* remove unnecessary logic from target imputer
* remove unused utils
* remove logic to convert dfs features to categorical logical type
* fix email featureizer test
* Revert changes to transfomr prim components for testing
* Revert "Revert changes to transfomr prim components for testing" This reverts commit 57dda43.
* Fix bugs with ww not imputed and string categories
* Add release note
* Handle case where nans are present at transform in previously all bool dtype data
* Stop truncating ints in target imputer
* clean up
* Fix tests
* Keep ltype integer for most frequent impute tpye in target imputer
* refactor knn imputer to use new logic
* Fix list bug
* remove comment
* Update release note to mention nullable types
* remove outdated comment
* Conver all bool dfs to boolean nullable instead of refitting
* lint fix
* PR comments
* add second bool col to imputer fixture
* Use nullable type handling in components' fit, transform, and predict methods (#4046)
* Remove existing nullable type handling from oversampler and use _handle_nullable_types instead
* Add handle call for lgbm regressor and remove existing handling
* Add handle call for lgbm classifier
* temp broken exp smoothing tests
* lint fix
* add release note
* Fix broken tests by initting woodwork on y in lgbm classifier
* Update tests
* Call handle in arima
* call handle from ts imputer y ltype is downcasted value
* remove unnecessary comments
* Fix time series guide
* lint fix
* Only call handle_nullable_types when necessary in methods
* Remove remaining unnecessary handle calls
* resolve remaining comments
* Add y ww init to ts imputer back in to fix tests
* Copy X in testing nullable types to stop hiding potential incompatibilities in methods
* use X_d in lgbm predict proba
* remove nullable type handling after sklearn upgrade fixed incompatibilities
* use common util to determine type for time series imputed integers
* Add comments around why we copy X
* remove _prepare_data from samplers
* PR comments
* remove tests to check if handle method is called
* remove nullable types from imputed data because of regularizer
* fix typo
* fix docstrings
* fix codecov issues
* PR comments
* Revert "Fix time series guide" This reverts commit 964622a.
* return unchanged ltype in nullable type utils
* add back ts imputer incompatibility test
* use dict get return value
* call handle nullable types in oversampler and check schema equality
* Remove other nullable handling across AutoMLSearch (#4085)
* Update tests
* remove unnecessary comments
* resolve remaining comments
* Copy X in testing nullable types to stop hiding potential incompatibilities in methods
* Handle new oversampler nullable type incompatibility and add tests
* Remove existing nullable type handling from oversampler and use _handle_nullable_types instead
* Add comments to self
* Add handle call for lgbm classifier
* lint fix
* Only call handle_nullable_types when necessary in methods
* Remove remaining unnecessary handle calls
* resolve remaining comments
* Copy X in testing nullable types to stop hiding potential incompatibilities in methods
* remove nullable type handling after sklearn upgrade fixed incompatibilities
* Remove downcast call from imputer.py and fix tests
* remove downcast call from catboost regressor
* stop adding ohe if bool null present
* remove nullable type handling from knn imputer
* remove nullable type handling from target imputer
* use util for ltype deciding in simple imputer
* Fix broken target imputer test
* Remove replace nullable types transformer from automl search and fix tests
* Handle nullable types in lgbm classifier predict_proba
* fix after rebase
* move util to imputer utils and other fixes
* remove duplicate util
* Change how imputer reinits woodwork types
* Expand tests
* use automl test env
* fix broken test
* Clean up
* Add release note
* Further clean up and note incompatibility
* Continue clean up
* fix failing docstring
* remove remaining comments
* fix invali target data check docstring
* let knn and simple imputers use same flow of logic for readability
* Move _get_new_logical_types_for_imputed_data to nullable type utils file
* PR comment
* Handle decomposer incompatibility (#4105)
* Handle decomposer incompatibility
* fix comment
* Stop scaling the values up
* Add release note
* PR Comments
* Create incompatibility for testing more similarly to how it shows up in the decomposer
* Add release note
* lint fix
* pr notes
* fix after rebase

Co-authored-by: chukarsten <64713315+chukarsten@users.noreply.github.com>
closes #4103

This handles a small incompatibility in the Decomposer's determine_periodicity method: when float64 data containing nans is subtracted from Int64 dtype data, the result is Float64 dtype data with np.NaN values that are not recognized by dropna, resulting in a periodicity of None. If we used Float64 as a dtype in Woodwork, we'd also need to handle that here. boolean isn't an issue, because this is only used for regression problems.
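A rough sketch of the interaction being described (how np.NaN vs. pd.NA is stored inside nullable arrays, and whether dropna catches it, varies across pandas versions, so treat this as illustrative rather than a reproduction of the bug):

```python
import numpy as np
import pandas as pd

nullable_ints = pd.Series([10, 20, 30], dtype="Int64")
floats_with_nan = pd.Series([0.5, np.nan, 1.5], dtype="float64")

# Subtracting a float64 series from an Int64 series yields nullable float
# data; the missing entry propagates into the result.
diff = nullable_ints - floats_with_nan
print(diff.iloc[0])           # 9.5
print(pd.isna(diff.iloc[1]))  # True

# On pandas versions where that missing entry is stored as np.NaN rather than
# pd.NA inside the nullable array, dropna() could miss it -- which is the
# incompatibility this PR works around in determine_periodicity.
```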