-
Notifications
You must be signed in to change notification settings - Fork 83
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle decomposer incompatibility #4105
Changes from 4 commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,6 +5,7 @@ | |
import numpy as np | ||
import pandas as pd | ||
import pytest | ||
import woodwork as ww | ||
|
||
import evalml.exceptions | ||
from evalml.pipelines.components.transformers.preprocessing import ( | ||
|
@@ -438,6 +439,10 @@ def test_decomposer_set_period(decomposer_child_class, period, generate_seasonal | |
assert dec.parameters["period"] == dec.period | ||
|
||
|
||
@pytest.mark.parametrize( | ||
"y_logical_type", | ||
["Double", "Integer", "Age", "IntegerNullable", "AgeNullable"], | ||
) | ||
@pytest.mark.parametrize( | ||
"decomposer_child_class", | ||
decomposer_list, | ||
|
@@ -451,6 +456,7 @@ def test_decomposer_set_period(decomposer_child_class, period, generate_seasonal | |
], | ||
) | ||
def test_decomposer_determine_periodicity( | ||
y_logical_type, | ||
decomposer_child_class, | ||
period, | ||
trend_degree, | ||
|
@@ -462,6 +468,7 @@ def test_decomposer_determine_periodicity( | |
period, | ||
trend_degree=trend_degree, | ||
) | ||
y = ww.init_series(y.astype(int), logical_type=y_logical_type) | ||
|
||
# Test that the seasonality can be determined if trend guess isn't spot on. | ||
if not decomposer_picked_correct_degree: | ||
|
@@ -476,6 +483,60 @@ def test_decomposer_determine_periodicity( | |
assert 0.95 * period <= ac <= 1.05 * period | ||
|
||
|
||
@pytest.mark.parametrize( | ||
"decomposer_child_class", | ||
decomposer_list, | ||
) | ||
@pytest.mark.parametrize( | ||
"nullable_ltype", | ||
["IntegerNullable", "AgeNullable"], | ||
) | ||
@pytest.mark.parametrize( | ||
"handle_incompatibility", | ||
[ | ||
True, | ||
pytest.param( | ||
False, | ||
marks=pytest.mark.xfail(strict=True, raises=AssertionError), | ||
), | ||
], | ||
) | ||
def test_decomposer_determine_periodicity_nullable_type_incompatibility( | ||
decomposer_child_class, | ||
handle_incompatibility, | ||
nullable_ltype, | ||
generate_seasonal_data, | ||
): | ||
"""Testing that the nullable type incompatibility that caused us to add handling for the Decomposer | ||
is still present in pandas. If this test is causing the test suite to fail | ||
because the code below no longer raises the expected AssertionError, we should confirm that the nullable | ||
types now work for our use case and remove the nullable type handling logic from Decomposer.determine_periodicity. | ||
""" | ||
trend_degree = 2 | ||
period = 7 | ||
X, y = generate_seasonal_data(real_or_synthetic="synthetic")( | ||
period, | ||
trend_degree=trend_degree, | ||
) | ||
|
||
# Multiply by 1000 so we can convert to Integer, truncating the rest of the value | ||
# while still mainlining trend, period, etc | ||
y = y * 1000 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. do we need this scaling if we don't do so above? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. good call! I can remove this |
||
y = ww.init_series(y.astype(int), logical_type=nullable_ltype) | ||
|
||
if handle_incompatibility: | ||
dec = decomposer_child_class(degree=trend_degree, period=period) | ||
X, y = dec._handle_nullable_types(X, y) | ||
|
||
numpy_float_data = pd.Series(range(len(y)), dtype="float64") | ||
numpy_float_data.iloc[-1] = np.nan | ||
|
||
subtracted_floats = y - numpy_float_data | ||
|
||
dropped_nans = subtracted_floats.dropna() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can you explain why we're tracking this here? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 7b71388 - let me know if this makes sense to you! |
||
assert len(dropped_nans) == len(y) - 1 | ||
|
||
|
||
@pytest.mark.parametrize( | ||
"decomposer_child_class", | ||
decomposer_list, | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recognize that it feels very wrong that we can just truncate y by converting to int and still have the tests work.
Originally, I was scaling up by a factor of 1000 or so to try and maintain the information necessary to achieve periodicity, but that ended up with the tests where the period is set to None and the trend_degree is 1 failing. I don't really know why that would cause those tests to fail but assumed it might have something to do with larger values interacting with the
acf_threshold
? Oddly, I found that changing the scaling factor to10
caused more tests to fail but just truncating the data allowed all of them to pass.This is rubbing me the wrong way, but I don't really understand what's going on in
determine_periodicity
well enough to find a smarter way around this. I'm open to any thoughts anyone may have!There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tamargrey which logical types fail if just using the original float information?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"IntegerNullable"
and"AgeNullable"
are the logical types that can produce this bug (they both useInt64
underlyingly). The goal with this fix was to allow the same behavior we see withAge
andInteger
(which both useint64
).If I try to pass the original float information into the
init_series
call with the nullable logical types, I get an errorError converting datatype for Temp from type float64 to type Int64.
since pandas doesn't allow conversion fromfloat64
toInt64
directly.int64
is allowed, but it just truncates the data.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'm OK with this - an alternative would be to run the tests with the integer logical types with integer y by changing
generate_seasonal_data
but it would serve the same purpose. Wonder what @eccabay and @chukarsten think!There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I'm quite following why this is necessary. Is this just because we get errors when trying to convert the output of
generate_seasonal_data
into the correct logical type, or is some part of the actual function failing when we have non-integer data? With the former I'm unconcerned, but the latter would worry me.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first one - the output of
generate_seasonal_data
is floats, which we can't convert toInt64
because pandas decided it will no longer quietly truncate your data for you as it does if the intended dtype isint64
.I just needed a way to test
Int64
data withdetermine_periodicity
, so rather than reinvent the wheel, I figured I could turn the output ofgenerate_seasonal_data
into something I could use.And to be clear: the incompatibility with
Int64
indetermine_periodicity
is due only to nullable type incompatibilities. It can handleint64
data just fine.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, this feels fine to me. Thanks for the clarification!