Conversation
Codecov Report
@@ Coverage Diff @@
## main #3324 +/- ##
=======================================
- Coverage 99.7% 99.6% -0.0%
=======================================
Files 329 329
Lines 31912 31977 +65
=======================================
+ Hits 31788 31847 +59
- Misses 124 130 +6
Continue to review full report at Codecov.
|
4975994 to
4327631
Compare
| assert not preds.isna().any() | ||
|
|
||
|
|
||
| @pytest.mark.xfail |
There was a problem hiding this comment.
Marking this xfail so that we can explicitly remove it when XGBoost releases the fixing change! The related issue is #3331 so we don't forget.
| } | ||
|
|
||
| # Categorical -> Boolean fails in infer_feature_types | ||
| if data == "categorical_data" and logical_type == Boolean: |
There was a problem hiding this comment.
Simplified some of the logic here. The only failure that seems to occur between the override types and the dataframe that's going in is going from categorical data to Boolean. This seems to trigger a Woodwork conversion failure.
| def test_simple_imputer_supports_natural_language_constant(): | ||
| @pytest.mark.parametrize("na_type", ["python_none", "numpy_nan", "pandas_na"]) | ||
| @pytest.mark.parametrize("data_type", ["Categorical", "NaturalLanguage"]) | ||
| def test_simple_imputer_supports_natural_language_and_categorical_constant( |
There was a problem hiding this comment.
Since these tests failed, I just refactored them to be more descriptive per the parameters in case they should fail again.
| ["free-form text", "will", "be imputed", "placeholder"], dtype="string" | ||
| "Categorical": pd.Series(["a", "b", "a", "placeholder"], dtype="category"), | ||
| "NaturalLanguage": pd.Series( | ||
| ["free-form text", "will", "be imputed", pd.NA], dtype="string" |
There was a problem hiding this comment.
Might be a little redundant, but the imputer now ignores the natural language columns.
There was a problem hiding this comment.
I think it only ignores natural language columns if the only column in the dataframe is a natural language column?
from evalml.pipelines.components import SimpleImputer
import pandas as pd
import woodwork as ww
import pytest
X = pd.DataFrame({"a": ["foo", "bar", "baz"], "b": [1, 2, 3]})
X.ww.init(logical_types={"a": "NaturalLanguage"})
si = SimpleImputer()
with pytest.raises(ValueError):
si.fit_transform(X)I think it might be better to raise an error if non-imputable types are present?
There was a problem hiding this comment.
@freddyaboulton I'm more of a fan of ignoring the columns if our imputer explicitly (we should not this in the docstring if we pursue this path) does not support natural language columns and that's the design decision I made with all of the select column transformers. If we raise an error it would mean that we would error out during search if the input contains a natural language column with nulls (but I'm unsure how the NaturalLanguageFeaturizer deals with nulls). Maybe raising a warning is a happy medium?
@chukarsten if we want to keep with the ignoring path - we would need to add the natural language columns back in after transform-ing as well.
| }[logical_type] | ||
| y = pd.Series([1, 2, 1]) | ||
| override_types = [Integer, Double, Categorical, NaturalLanguage, Boolean] | ||
| for logical_type in override_types: |
There was a problem hiding this comment.
Refactored this test so that all the parameters are listed in the test failure sections and it's easier to understand what's failing, specifically which override type.
freddyaboulton
left a comment
There was a problem hiding this comment.
@chukarsten Looks good to me! I just want to synch up on the expected behavior of natural language in the simple imputer prior to merge.
evalml/pipelines/components/transformers/imputers/simple_imputer.py
Outdated
Show resolved
Hide resolved
| ["free-form text", "will", "be imputed", "placeholder"], dtype="string" | ||
| "Categorical": pd.Series(["a", "b", "a", "placeholder"], dtype="category"), | ||
| "NaturalLanguage": pd.Series( | ||
| ["free-form text", "will", "be imputed", pd.NA], dtype="string" |
There was a problem hiding this comment.
I think it only ignores natural language columns if the only column in the dataframe is a natural language column?
from evalml.pipelines.components import SimpleImputer
import pandas as pd
import woodwork as ww
import pytest
X = pd.DataFrame({"a": ["foo", "bar", "baz"], "b": [1, 2, 3]})
X.ww.init(logical_types={"a": "NaturalLanguage"})
si = SimpleImputer()
with pytest.raises(ValueError):
si.fit_transform(X)I think it might be better to raise an error if non-imputable types are present?
| new_X = component.transform(data) | ||
|
|
||
| pd.testing.assert_frame_equal(new_X, expected) | ||
| # TODO: Understand why the "EMAIL_ADDRESS_TO_DOMAIN(email)" engineered feature |
There was a problem hiding this comment.
I think the new change in behavior with pandas 1.4.0 is that the categories attribute of the categorical dtype now includes "original" feature values. Since the physical type of URL/EmailAddress logical types is "string" (nullable), that's why the categories have nullable values.
I think we can get around this behavior if we update the make_answer functions but I think what we have here is fine too.
There was a problem hiding this comment.
Do you know how to do that? I pinged the woodwork team for how we can add in that specific column with the "string" type, but I can't seem to do it.
There was a problem hiding this comment.
Not blocking though. The current equality logic in the test is sufficient.
| # Not using select because we just need column names, not a new dataframe | ||
| natural_language_columns = self._get_columns_of_type(X, NaturalLanguage) | ||
| if natural_language_columns: | ||
| X = X.ww.copy() |
There was a problem hiding this comment.
why do we need to copy here and in _set_boolean_columns_to_categorical?
| ["free-form text", "will", "be imputed", "placeholder"], dtype="string" | ||
| "Categorical": pd.Series(["a", "b", "a", "placeholder"], dtype="category"), | ||
| "NaturalLanguage": pd.Series( | ||
| ["free-form text", "will", "be imputed", pd.NA], dtype="string" |
There was a problem hiding this comment.
@freddyaboulton I'm more of a fan of ignoring the columns if our imputer explicitly (we should not this in the docstring if we pursue this path) does not support natural language columns and that's the design decision I made with all of the select column transformers. If we raise an error it would mean that we would error out during search if the input contains a natural language column with nulls (but I'm unsure how the NaturalLanguageFeaturizer deals with nulls). Maybe raising a warning is a happy medium?
@chukarsten if we want to keep with the ignoring path - we would need to add the natural language columns back in after transform-ing as well.
evalml/pipelines/components/transformers/imputers/simple_imputer.py
Outdated
Show resolved
Hide resolved
bchen1116
left a comment
There was a problem hiding this comment.
These changes are looking good! Agreed about moving some of the dfs to the conftest.py file, as well as left a few questions.
As per Freddy's comment here, it could be useful to add a non-categorical/NL column to the test_simple_imputer_ignores_natural_language in test_simple_imputer to ensure that this continues to work as we want.
evalml/pipelines/components/transformers/imputers/simple_imputer.py
Outdated
Show resolved
Hide resolved
ccefefb to
416d296
Compare
…data fixture. Moved the imputer_test_data fixture up into conftest.py so that test_simple_imputer could borrow it.
…ce of multiple types.
…re detected during SimpleImputer fit AND NaturalLanguage is one of them.
975b762 to
686c25f
Compare
evalml/pipelines/components/transformers/imputers/simple_imputer.py
Outdated
Show resolved
Hide resolved
freddyaboulton
left a comment
There was a problem hiding this comment.
thank you @chukarsten !

Fixes: #3275
Like the issue says, there are basically three areas to address: