Handle decomposer incompatibility #4105
Conversation
Codecov Report
@@                        Coverage Diff                         @@
##    component-specific-nullable-types-handling    #4105    +/-   ##
=====================================================================
  Coverage     ?    99.7%
=====================================================================
  Files        ?      349
  Lines        ?    37628
  Branches     ?        0
=====================================================================
  Hits         ?    37511
  Misses       ?      117
  Partials     ?        0
@@ -462,6 +468,7 @@ def test_decomposer_determine_periodicity(
        period,
        trend_degree=trend_degree,
    )
+   y = ww.init_series(y.astype(int), logical_type=y_logical_type)
I recognize that it feels very wrong that we can just truncate y by converting to int and still have the tests work.

Originally, I was scaling up by a factor of 1000 or so to try to maintain the information necessary to achieve periodicity, but that ended up with the tests where the period is set to None and the trend_degree is 1 failing. I don't really know why that would cause those tests to fail, but assumed it might have something to do with larger values interacting with the acf_threshold? Oddly, I found that changing the scaling factor to 10 caused more tests to fail, but just truncating the data allowed all of them to pass.

This is rubbing me the wrong way, but I don't really understand what's going on in determine_periodicity
@tamargrey which logical types fail if just using the original float information?
"IntegerNullable" and "AgeNullable" are the logical types that can produce this bug (they both use Int64 under the hood). The goal with this fix was to allow the same behavior we see with Age and Integer (which both use int64).

If I try to pass the original float information into the init_series call with the nullable logical types, I get the error `Error converting datatype for Temp from type float64 to type Int64.`, since pandas doesn't allow conversion from float64 to Int64 directly. Conversion to int64 is allowed, but it just truncates the data.
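As a standalone illustration of the pandas behavior being described (a sketch only; the exact exception and message vary by pandas version):

```python
import pandas as pd

floats = pd.Series([1.7, 2.3, 3.9], dtype="float64")

# Casting to the NumPy-backed int64 dtype silently truncates toward zero.
truncated = floats.astype("int64")
print(truncated.tolist())  # [1, 2, 3]

# Casting to the nullable Int64 dtype refuses the lossy conversion instead.
try:
    floats.astype("Int64")
except (TypeError, ValueError) as err:
    print(f"Int64 cast rejected: {err}")
```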
I think I'm OK with this - an alternative would be to run the tests with the integer logical types with integer y by changing generate_seasonal_data, but it would serve the same purpose. Wonder what @eccabay and @chukarsten think!
I don't think I'm quite following why this is necessary. Is this just because we get errors when trying to convert the output of generate_seasonal_data into the correct logical type, or is some part of the actual function failing when we have non-integer data? With the former I'm unconcerned, but the latter would worry me.
The first one - the output of generate_seasonal_data is floats, which we can't convert to Int64, because pandas decided it will no longer quietly truncate your data for you the way it does when the intended dtype is int64.

I just needed a way to test Int64 data with determine_periodicity, so rather than reinvent the wheel, I figured I could turn the output of generate_seasonal_data into something I could use.

And to be clear: the incompatibility with Int64 in determine_periodicity is due only to nullable type incompatibilities. It can handle int64 data just fine.
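The fix in the diff above follows exactly that path: truncate to plain ints first (which pandas permits), then move to the nullable dtype. A minimal pandas-only sketch of the two-step conversion, with a hypothetical series standing in for generate_seasonal_data's float output:

```python
import pandas as pd

# Stand-in for the float series returned by generate_seasonal_data.
y = pd.Series([10.6, 11.2, 12.9, 13.4])

# Step 1: float64 -> int64 is permitted, truncating the fractional part.
# Step 2: int64 -> Int64 is lossless, so pandas allows it.
y_nullable = y.astype("int64").astype("Int64")
print(y_nullable.dtype)      # Int64
print(y_nullable.tolist())   # [10, 11, 12, 13]
```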
In that case, this feels fine to me. Thanks for the clarification!
Force-pushed from 2ba48ca to 2aa24d3
# Multiply by 1000 so we can convert to Integer, truncating the rest of the value
# while still maintaining trend, period, etc
y = y * 1000
do we need this scaling if we don't do so above?
good call! I can remove this
subtracted_floats = y - numpy_float_data

dropped_nans = subtracted_floats.dropna()
can you explain why we're tracking this here?
7b71388 - let me know if this makes sense to you!
I think this LGTM but will leave the final call to the decomp experts!
* Handle decomposer incompatibility
* fix comment
* Stop scaling the values up
* Add release note
* PR Comments
* Create incompatibility for testing more similarly to how it shows up in the decomposer
* initial commit to make the PR
* Refactor imputer components to remove unnecessary logic (#4038)
* Stop using woodwork describe to get nan info in time series imputer
* remove logic that's no longer needed and return if all bool dtype
* remove unnecessary logic from target imputer
* remove unused utils
* remove logic to convert dfs features to categorical logical type
* fix email featureizer test
* Revert changes to transfomr prim components for testing
* Revert "Revert changes to transfomr prim components for testing" This reverts commit 57dda43.
* Fix bugs with ww not imputed and string categories
* Add release note
* Handle case where nans are present at transform in previously all bool dtype data
* Stop truncating ints in target imputer
* clean up
* Fix tests
* Keep ltype integer for most frequent impute tpye in target imputer
* refactor knn imputer to use new logic
* Fix list bug
* remove comment
* Update release note to mention nullable types
* remove outdated comment
* Conver all bool dfs to boolean nullable instead of refitting
* lint fix
* PR comments
* add second bool col to imputer fixture
* Use nullable type handling in components' fit, transform, and predict methods (#4046)
* Remove existing nullable type handling from oversampler and use _handle_nullable_types instead
* Add handle call for lgbm regressor and remove existing handling
* Add handle call for lgbm classifier
* temp broken exp smoothing tests
* lint fix
* add release note
* Fix broken tests by initting woodwork on y in lgbm classifier
* Update tests
* Call handle in arima
* call handle from ts imputer y ltype is downcasted value
* remove unnecessary comments
* Fix time series guide
* lint fix
* Only call handle_nullable_types when necessary in methods
* Remove remaining unnecessary handle calls
* resolve remaining comments
* Add y ww init to ts imputer back in to fix tests
* Copy X in testing nullable types to stop hiding potential incompatibilities in methods
* use X_d in lgbm predict proba
* remove nullable type handling after sklearn upgrade fixed incompatibilities
* use common util to determine type for time series imputed integers
* Add comments around why we copy X
* remove _prepare_data from samplers
* PR comments
* remove tests to check if handle method is called
* remove nullable types from imputed data because of regularizer
* fix typo
* fix docstrings
* fix codecov issues
* PR comments
* Revert "Fix time series guide" This reverts commit 964622a.
* return unchanged ltype in nullable type utils
* add back ts imputer incompatibility test
* use dict get return value
* call handle nullable types in oversampler and check schema equality
* Remove other nullable handling across AutoMLSearch (#4085)
* Update tests
* remove unnecessary comments
* resolve remaining comments
* Copy X in testing nullable types to stop hiding potential incompatibilities in methods
* Handle new oversampler nullable type incompatibility and add tests
* Remove existing nullable type handling from oversampler and use _handle_nullable_types instead
* Add comments to self
* Add handle call for lgbm classifier
* lint fix
* Only call handle_nullable_types when necessary in methods
* Remove remaining unnecessary handle calls
* resolve remaining comments
* Copy X in testing nullable types to stop hiding potential incompatibilities in methods
* remove nullable type handling after sklearn upgrade fixed incompatibilities
* Remove downcast call from imputer.py and fix tests
* remove downcast call from catboost regressor
* stop adding ohe if bool null present
* remove nullable type handling from knn imputer
* remove nullable type handling from target imputer
* use util for ltype deciding in simple imputer
* Fix broken target imputer test
* Remove replace nullable types transformer from automl search and fix tests
* Handle nullable types in lgbm classifier predict_proba
* fix after rebase
* move util to imputer utils and other fixes
* remove duplicate util
* Change how imputer reinits woodwork types
* Expand tests
* use automl test env
* fix broken test
* Clean up
* Add release note
* Further clean up and note incompatibility
* Continue clean up
* fix failing docstring
* remove remaining comments
* fix invali target data check docstring
* let knn and simple imputers use same flow of logic for readability
* Move _get_new_logical_types_for_imputed_data to nullable type utils file
* PR comment
* Handle decomposer incompatibility (#4105)
* Handle decomposer incompatibility
* fix comment
* Stop scaling the values up
* Add release note
* PR Comments
* Create incompatibility for testing more similarly to how it shows up in the decomposer
* Add release note
* lint fix
* pr notes
* fix after rebase

Co-authored-by: chukarsten <64713315+chukarsten@users.noreply.github.com>
closes #4103

This handles a small incompatibility in the Decomposer's determine_periodicity method: when float64 data containing nans is subtracted from Int64 dtype data, the result is Float64 dtype data with np.NaN values that are not recognized by dropna, resulting in a periodicity of None. If we used Float64 as a dtype in Woodwork, we'd also need to handle that here. boolean isn't an issue, because this is only used for regression problems.
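A rough sketch of the interaction being described (how np.NaN vs. pd.NA is stored inside nullable arrays, and whether dropna catches it, varies across pandas versions, so treat this as illustrative rather than a reproduction of the bug):

```python
import numpy as np
import pandas as pd

nullable_ints = pd.Series([10, 20, 30], dtype="Int64")
floats_with_nan = pd.Series([0.5, np.nan, 1.5], dtype="float64")

# Subtracting a float64 series from an Int64 series yields nullable float
# data; the missing entry propagates into the result.
diff = nullable_ints - floats_with_nan
print(diff.iloc[0])           # 9.5
print(pd.isna(diff.iloc[1]))  # True

# On pandas versions where that missing entry is stored as np.NaN rather than
# pd.NA inside the nullable array, dropna() could miss it -- which is the
# incompatibility this PR works around in determine_periodicity.
```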