Handle decomposer incompatibility #4105

Merged
1 change: 1 addition & 0 deletions docs/source/release_notes.rst
@@ -8,6 +8,7 @@ Release Notes
* Remove unnecessary logic from imputer components prior to nullable type handling :pr:`4038`
* Added calls to ``_handle_nullable_types`` in component fit, transform, and predict methods when needed :pr:`4046`
* Remove existing nullable type handling across AutoMLSearch to just use new handling :pr:`4085`
* Handle nullable type incompatibility in ``Decomposer`` :pr:`4105`
* Documentation Changes
* Testing Changes
* Updated graphviz installation in GitHub workflows to fix windows nightlies :pr:`4088`
@@ -34,6 +34,9 @@ class Decomposer(Transformer):
modifies_target = True
needs_fitting = True
invalid_frequencies = []
# Incompatibility: https://github.com/alteryx/evalml/issues/4103
# TODO: Remove when support is added https://github.com/pandas-dev/pandas/issues/52127
_integer_nullable_incompatibilities = ["y"]

def __init__(
self,
@@ -148,6 +151,7 @@ def determine_periodicity(
period is detected, returns None.

"""
X, y = cls._handle_nullable_types(cls, X, y)

def _get_rel_max_from_acf(y):
"""Determines the relative maxima of the target's autocorrelation."""
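
For readers unfamiliar with the nullable type handling machinery, here is a minimal sketch of the kind of downcasting _handle_nullable_types performs for this component. The helper name and the class attribute appear in the diff above, but the body below is only an illustration of the idea (downcast a nullable Int64 target to numpy-backed float64), not evalml's actual implementation:

import pandas as pd

class _NullableTypeHandlingSketch:
    # Mirrors the class attribute added in this PR: only y is
    # incompatible with nullable integer (Int64) data.
    _integer_nullable_incompatibilities = ["y"]

    def _handle_nullable_types(self, X, y):
        # Illustrative sketch: downcast a nullable Int64 target to
        # numpy-backed float64 so that downstream numpy/statsmodels
        # code sees np.nan instead of pd.NA.
        if (
            "y" in self._integer_nullable_incompatibilities
            and y is not None
            and isinstance(y.dtype, pd.Int64Dtype)
        ):
            y = y.astype("float64")
        return X, y
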
62 changes: 62 additions & 0 deletions evalml/tests/component_tests/decomposer_tests/test_decomposer.py
@@ -5,6 +5,7 @@
import numpy as np
import pandas as pd
import pytest
import woodwork as ww

import evalml.exceptions
from evalml.pipelines.components.transformers.preprocessing import (
@@ -438,6 +439,10 @@ def test_decomposer_set_period(decomposer_child_class, period, generate_seasonal
assert dec.parameters["period"] == dec.period


@pytest.mark.parametrize(
"y_logical_type",
["Double", "Integer", "Age", "IntegerNullable", "AgeNullable"],
)
@pytest.mark.parametrize(
"decomposer_child_class",
decomposer_list,
@@ -451,6 +456,7 @@
],
)
def test_decomposer_determine_periodicity(
y_logical_type,
decomposer_child_class,
period,
trend_degree,
@@ -462,6 +468,7 @@
period,
trend_degree=trend_degree,
)
y = ww.init_series(y.astype(int), logical_type=y_logical_type)

Contributor Author:

I recognize that it feels very wrong that we can just truncate y by converting to int and still have the tests work.

Originally, I was scaling up by a factor of 1000 or so to try to maintain the information necessary to achieve periodicity, but that caused the tests where the period is set to None and the trend_degree is 1 to fail. I don't really know why that would cause those tests to fail, but I assumed it might have something to do with larger values interacting with the acf_threshold. Oddly, changing the scaling factor to 10 caused more tests to fail, while just truncating the data allowed all of them to pass.

This is rubbing me the wrong way, but I don't understand what's going on in determine_periodicity well enough to find a smarter way around this. I'm open to any thoughts anyone may have!

Contributor:

@tamargrey, which logical types fail if we just use the original float information?

Contributor Author:

"IntegerNullable" and "AgeNullable" are the logical types that can produce this bug (they both use Int64 underlyingly). The goal with this fix was to allow the same behavior we see with Age and Integer (which both use int64).

If I try to pass the original float information into the init_series call with the nullable logical types, I get an error Error converting datatype for Temp from type float64 to type Int64. since pandas doesn't allow conversion from float64 to Int64 directly. int64 is allowed, but it just truncates the data.
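
A quick illustration of the casting behavior described above (a sketch of stock pandas behavior; the exact error message may vary by version):

import pandas as pd

floats = pd.Series([1.5, 2.7, 3.0])

# numpy-backed int64 quietly truncates the fractional part
print(floats.astype("int64").tolist())  # [1, 2, 3]

# nullable Int64 refuses the lossy cast instead of truncating
try:
    floats.astype("Int64")
except TypeError as err:
    print(err)  # e.g. "cannot safely cast non-equivalent float64 to int64"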

Contributor:

I think I'm OK with this; an alternative would be to run the tests with the integer logical types on an integer y by changing generate_seasonal_data, but it would serve the same purpose. Wonder what @eccabay and @chukarsten think!

Contributor:

I don't think I'm quite following why this is necessary. Is this just because we get errors when trying to convert the output of generate_seasonal_data into the correct logical type, or is some part of the actual function failing when we have non-integer data? With the former I'm unconcerned, but the latter would worry me.

Contributor Author:

The first one: the output of generate_seasonal_data is floats, which we can't convert to Int64 because pandas decided it will no longer quietly truncate your data for you the way it does when the intended dtype is int64.

I just needed a way to test Int64 data with determine_periodicity, so rather than reinvent the wheel, I figured I could turn the output of generate_seasonal_data into something I could use.

And to be clear: the incompatibility with Int64 in determine_periodicity is due only to nullable type incompatibilities. It can handle int64 data just fine.

Contributor:

In that case, this feels fine to me. Thanks for the clarification!


# Test that the seasonality can be determined if trend guess isn't spot on.
if not decomposer_picked_correct_degree:
@@ -476,6 +483,61 @@
assert 0.95 * period <= ac <= 1.05 * period


@pytest.mark.parametrize(
"decomposer_child_class",
decomposer_list,
)
@pytest.mark.parametrize(
"nullable_ltype",
["IntegerNullable", "AgeNullable"],
)
@pytest.mark.parametrize(
"handle_incompatibility",
[
True,
pytest.param(
False,
marks=pytest.mark.xfail(strict=True, raises=AssertionError),
),
],
)
def test_decomposer_determine_periodicity_nullable_type_incompatibility(
decomposer_child_class,
handle_incompatibility,
nullable_ltype,
generate_seasonal_data,
):
"""Testing that the nullable type incompatibility that caused us to add handling for the Decomposer
is still present in pandas. If this test is causing the test suite to fail
because the code below no longer raises the expected AssertionError, we should confirm that the nullable
types now work for our use case and remove the nullable type handling logic from Decomposer.determine_periodicity.
"""
trend_degree = 2
period = 7
X, y = generate_seasonal_data(real_or_synthetic="synthetic")(
period,
trend_degree=trend_degree,
)

# Convert to integers, truncating the fractional part of each value
y = ww.init_series(y.astype(int), logical_type=nullable_ltype)

if handle_incompatibility:
dec = decomposer_child_class(degree=trend_degree, period=period)
X, y = dec._handle_nullable_types(X, y)

numpy_float_data = pd.Series(range(len(y)), dtype="float64")
numpy_float_data.iloc[-1] = np.nan

subtracted_floats = y - numpy_float_data

# Pandas will not recognize the np.nan value in the Float64 series
# subtracted_floats and will not drop that null value, so calling
# _handle_nullable_types ensures that we stay in float64 and properly
# drop the null value
dropped_nans = subtracted_floats.dropna()

Contributor:

Can you explain why we're tracking this here?

Contributor Author:

7b71388 - let me know if this makes sense to you!

assert len(dropped_nans) == len(y) - 1

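For reference, the incompatibility this test guards against can be reproduced in isolation. A condensed sketch, assuming the behavior tracked in pandas-dev/pandas#52127 is still present:

import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3], dtype="Int64")
floats = pd.Series([1.0, 2.0, np.nan], dtype="float64")

# Int64 - float64 yields nullable Float64; the np.nan lands in the
# result's data without being flagged in its NA mask, so dropna()
# does not remove it.
masked = ints - floats
print(len(masked.dropna()))  # 3 -- the null row is kept

# After downcasting to plain float64, the same subtraction leaves an
# np.nan that dropna() removes as expected.
plain = ints.astype("float64") - floats
print(len(plain.dropna()))  # 2 -- the null row is dropped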

@pytest.mark.parametrize(
"decomposer_child_class",
decomposer_list,