-
Notifications
You must be signed in to change notification settings - Fork 86
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pandas 1.4.0 Support #3324
Pandas 1.4.0 Support #3324
Conversation
Codecov Report
@@ Coverage Diff @@
## main #3324 +/- ##
=======================================
- Coverage 99.7% 99.6% -0.0%
=======================================
Files 329 329
Lines 31912 31977 +65
=======================================
+ Hits 31788 31847 +59
- Misses 124 130 +6
Continue to review full report at Codecov.
|
4975994
to
4327631
Compare
@@ -97,6 +97,7 @@ def test_xgboost_predict_all_boolean_columns(): | |||
assert not preds.isna().any() | |||
|
|||
|
|||
@pytest.mark.xfail |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Marking this xfail so that we can explicitly remove it when XGBoost releases the fixing change! The related issue is #3331 so we don't forget.
} | ||
|
||
# Categorical -> Boolean fails in infer_feature_types | ||
if data == "categorical_data" and logical_type == Boolean: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simplified some of the logic here. The only failure that seems to occur between the override types and the dataframe that's going in is going from categorical data to Boolean. This seems to trigger a Woodwork conversion failure.
def test_simple_imputer_supports_natural_language_constant(): | ||
@pytest.mark.parametrize("na_type", ["python_none", "numpy_nan", "pandas_na"]) | ||
@pytest.mark.parametrize("data_type", ["Categorical", "NaturalLanguage"]) | ||
def test_simple_imputer_supports_natural_language_and_categorical_constant( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since these tests failed, I just refactored them to be more descriptive per the parameters in case they should fail again.
["free-form text", "will", "be imputed", "placeholder"], dtype="string" | ||
"Categorical": pd.Series(["a", "b", "a", "placeholder"], dtype="category"), | ||
"NaturalLanguage": pd.Series( | ||
["free-form text", "will", "be imputed", pd.NA], dtype="string" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be a little redundant, but the imputer now ignores the natural language columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it only ignores natural language columns if the only column in the dataframe is a natural language column?
from evalml.pipelines.components import SimpleImputer
import pandas as pd
import woodwork as ww
import pytest
X = pd.DataFrame({"a": ["foo", "bar", "baz"], "b": [1, 2, 3]})
X.ww.init(logical_types={"a": "NaturalLanguage"})
si = SimpleImputer()
with pytest.raises(ValueError):
si.fit_transform(X)
I think it might be better to raise an error if non-imputable types are present?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@freddyaboulton I'm more of a fan of ignoring the columns if our imputer explicitly (we should not this in the docstring if we pursue this path) does not support natural language columns and that's the design decision I made with all of the select column transformers. If we raise an error it would mean that we would error out during search if the input contains a natural language column with nulls (but I'm unsure how the NaturalLanguageFeaturizer
deals with nulls). Maybe raising a warning is a happy medium?
@chukarsten if we want to keep with the ignoring path - we would need to add the natural language columns back in after transform
-ing as well.
y = pd.Series([1, 2, 1]) | ||
override_types = [Integer, Double, Categorical, NaturalLanguage, Boolean] | ||
for logical_type in override_types: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactored this test so that all the parameters are listed in the test failure sections and it's easier to understand what's failing, specifically which override type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chukarsten Looks good to me! I just want to synch up on the expected behavior of natural language in the simple imputer prior to merge.
evalml/pipelines/components/transformers/imputers/simple_imputer.py
Outdated
Show resolved
Hide resolved
["free-form text", "will", "be imputed", "placeholder"], dtype="string" | ||
"Categorical": pd.Series(["a", "b", "a", "placeholder"], dtype="category"), | ||
"NaturalLanguage": pd.Series( | ||
["free-form text", "will", "be imputed", pd.NA], dtype="string" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it only ignores natural language columns if the only column in the dataframe is a natural language column?
from evalml.pipelines.components import SimpleImputer
import pandas as pd
import woodwork as ww
import pytest
X = pd.DataFrame({"a": ["foo", "bar", "baz"], "b": [1, 2, 3]})
X.ww.init(logical_types={"a": "NaturalLanguage"})
si = SimpleImputer()
with pytest.raises(ValueError):
si.fit_transform(X)
I think it might be better to raise an error if non-imputable types are present?
@@ -225,10 +225,23 @@ def test_component_fit_transform( | |||
component.fit(data) | |||
new_X = component.transform(data) | |||
|
|||
pd.testing.assert_frame_equal(new_X, expected) | |||
# TODO: Understand why the "EMAIL_ADDRESS_TO_DOMAIN(email)" engineered feature |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the new change in behavior with pandas 1.4.0 is that the categories
attribute of the categorical dtype now includes "original" feature values. Since the physical type of URL
/EmailAddress
logical types is "string" (nullable), that's why the categories have nullable values.
I think we can get around this behavior if we update the make_answer
functions but I think what we have here is fine too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know how to do that? I pinged the woodwork team for how we can add in that specific column with the "string" type, but I can't seem to do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not blocking though. The current equality logic in the test is sufficient.
# Not using select because we just need column names, not a new dataframe | ||
natural_language_columns = self._get_columns_of_type(X, NaturalLanguage) | ||
if natural_language_columns: | ||
X = X.ww.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need to copy here and in _set_boolean_columns_to_categorical
?
["free-form text", "will", "be imputed", "placeholder"], dtype="string" | ||
"Categorical": pd.Series(["a", "b", "a", "placeholder"], dtype="category"), | ||
"NaturalLanguage": pd.Series( | ||
["free-form text", "will", "be imputed", pd.NA], dtype="string" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@freddyaboulton I'm more of a fan of ignoring the columns if our imputer explicitly (we should not this in the docstring if we pursue this path) does not support natural language columns and that's the design decision I made with all of the select column transformers. If we raise an error it would mean that we would error out during search if the input contains a natural language column with nulls (but I'm unsure how the NaturalLanguageFeaturizer
deals with nulls). Maybe raising a warning is a happy medium?
@chukarsten if we want to keep with the ignoring path - we would need to add the natural language columns back in after transform
-ing as well.
evalml/pipelines/components/transformers/imputers/simple_imputer.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes are looking good! Agreed about moving some of the dfs to the conftest.py
file, as well as left a few questions.
As per Freddy's comment here, it could be useful to add a non-categorical/NL column to the test_simple_imputer_ignores_natural_language
in test_simple_imputer
to ensure that this continues to work as we want.
evalml/pipelines/components/transformers/imputers/simple_imputer.py
Outdated
Show resolved
Hide resolved
ccefefb
to
416d296
Compare
…data fixture. Moved the imputer_test_data fixture up into conftest.py so that test_simple_imputer could borrow it.
…ce of multiple types.
…re detected during SimpleImputer fit AND NaturalLanguage is one of them.
975b762
to
686c25f
Compare
evalml/pipelines/components/transformers/imputers/simple_imputer.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you @chukarsten !
Fixes: #3275
Like the issue says, there are basically three areas to address: