Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pandas 1.4.0 Support #3324

Merged
merged 23 commits into from
Feb 18, 2022
Merged

Pandas 1.4.0 Support #3324

merged 23 commits into from
Feb 18, 2022

Conversation

chukarsten
Copy link
Contributor

@chukarsten chukarsten commented Feb 14, 2022

Fixes: #3275

Like the issue says, there are basically three areas to address:

  1. XGBoost future warning: This is handled by expecting this test to fail on this PR and then, when XGBoost does their next release, fixing the auto dependency update as the test will no longer fail as there will be no more warning.
  2. Results of email featurization: The underlying problem is that featuretools is returning a CategoricalDtype with the dtype of "string" when we, in our test, build an expected dataframe with CategoricalDtype with dtype of "object." I'm not entirely sure why this is the case as I've stepped through both the woodwork accessor and featuretools trying to find out how this happens. I've also tried to reproduce making a CategoricalDtype with categories that have dtype "string" but to no avail. I've rewritten the equality check here to check the values, column names, indices and logical types separately as a stop-gap until this behavior is understood.
  3. Imputer/SimpleImputer failures: We decided that the NaturalLanguage should not be imputed, so it is imputed, no longer.

@codecov
Copy link

codecov bot commented Feb 14, 2022

Codecov Report

Merging #3324 (8779828) into main (6220044) will decrease coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@           Coverage Diff           @@
##            main   #3324     +/-   ##
=======================================
- Coverage   99.7%   99.6%   -0.0%     
=======================================
  Files        329     329             
  Lines      31912   31977     +65     
=======================================
+ Hits       31788   31847     +59     
- Misses       124     130      +6     
Impacted Files Coverage Δ
...components/transformers/imputers/simple_imputer.py 100.0% <100.0%> (ø)
...nt_tests/test_ft_transform_primitive_components.py 100.0% <100.0%> (ø)
evalml/tests/component_tests/test_imputer.py 100.0% <100.0%> (ø)
...valml/tests/component_tests/test_simple_imputer.py 100.0% <100.0%> (ø)
...l/tests/component_tests/test_xgboost_classifier.py 93.1% <100.0%> (-6.9%) ⬇️
evalml/tests/conftest.py 97.1% <100.0%> (+0.1%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6220044...8779828. Read the comment docs.

@@ -97,6 +97,7 @@ def test_xgboost_predict_all_boolean_columns():
assert not preds.isna().any()


@pytest.mark.xfail
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marking this xfail so that we can explicitly remove it when XGBoost releases the fixing change! The related issue is #3331 so we don't forget.

}

# Categorical -> Boolean fails in infer_feature_types
if data == "categorical_data" and logical_type == Boolean:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplified some of the logic here. The only failure that seems to occur between the override types and the dataframe that's going in is going from categorical data to Boolean. This seems to trigger a Woodwork conversion failure.

def test_simple_imputer_supports_natural_language_constant():
@pytest.mark.parametrize("na_type", ["python_none", "numpy_nan", "pandas_na"])
@pytest.mark.parametrize("data_type", ["Categorical", "NaturalLanguage"])
def test_simple_imputer_supports_natural_language_and_categorical_constant(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since these tests failed, I just refactored them to be more descriptive per the parameters in case they should fail again.

["free-form text", "will", "be imputed", "placeholder"], dtype="string"
"Categorical": pd.Series(["a", "b", "a", "placeholder"], dtype="category"),
"NaturalLanguage": pd.Series(
["free-form text", "will", "be imputed", pd.NA], dtype="string"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be a little redundant, but the imputer now ignores the natural language columns.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it only ignores natural language columns if the only column in the dataframe is a natural language column?

from evalml.pipelines.components import SimpleImputer
import pandas as pd
import woodwork as ww
import pytest

X = pd.DataFrame({"a": ["foo", "bar", "baz"], "b": [1, 2, 3]})
X.ww.init(logical_types={"a": "NaturalLanguage"})
si = SimpleImputer()

with pytest.raises(ValueError):
    si.fit_transform(X)

I think it might be better to raise an error if non-imputable types are present?

Copy link
Collaborator

@jeremyliweishih jeremyliweishih Feb 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@freddyaboulton I'm more of a fan of ignoring the columns if our imputer explicitly (we should not this in the docstring if we pursue this path) does not support natural language columns and that's the design decision I made with all of the select column transformers. If we raise an error it would mean that we would error out during search if the input contains a natural language column with nulls (but I'm unsure how the NaturalLanguageFeaturizer deals with nulls). Maybe raising a warning is a happy medium?

@chukarsten if we want to keep with the ignoring path - we would need to add the natural language columns back in after transform-ing as well.

y = pd.Series([1, 2, 1])
override_types = [Integer, Double, Categorical, NaturalLanguage, Boolean]
for logical_type in override_types:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored this test so that all the parameters are listed in the test failure sections and it's easier to understand what's failing, specifically which override type.

Copy link
Contributor

@freddyaboulton freddyaboulton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chukarsten Looks good to me! I just want to synch up on the expected behavior of natural language in the simple imputer prior to merge.

["free-form text", "will", "be imputed", "placeholder"], dtype="string"
"Categorical": pd.Series(["a", "b", "a", "placeholder"], dtype="category"),
"NaturalLanguage": pd.Series(
["free-form text", "will", "be imputed", pd.NA], dtype="string"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it only ignores natural language columns if the only column in the dataframe is a natural language column?

from evalml.pipelines.components import SimpleImputer
import pandas as pd
import woodwork as ww
import pytest

X = pd.DataFrame({"a": ["foo", "bar", "baz"], "b": [1, 2, 3]})
X.ww.init(logical_types={"a": "NaturalLanguage"})
si = SimpleImputer()

with pytest.raises(ValueError):
    si.fit_transform(X)

I think it might be better to raise an error if non-imputable types are present?

@@ -225,10 +225,23 @@ def test_component_fit_transform(
component.fit(data)
new_X = component.transform(data)

pd.testing.assert_frame_equal(new_X, expected)
# TODO: Understand why the "EMAIL_ADDRESS_TO_DOMAIN(email)" engineered feature
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the new change in behavior with pandas 1.4.0 is that the categories attribute of the categorical dtype now includes "original" feature values. Since the physical type of URL/EmailAddress logical types is "string" (nullable), that's why the categories have nullable values.

I think we can get around this behavior if we update the make_answer functions but I think what we have here is fine too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know how to do that? I pinged the woodwork team for how we can add in that specific column with the "string" type, but I can't seem to do it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think like this:

image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not blocking though. The current equality logic in the test is sufficient.

# Not using select because we just need column names, not a new dataframe
natural_language_columns = self._get_columns_of_type(X, NaturalLanguage)
if natural_language_columns:
X = X.ww.copy()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to copy here and in _set_boolean_columns_to_categorical?

["free-form text", "will", "be imputed", "placeholder"], dtype="string"
"Categorical": pd.Series(["a", "b", "a", "placeholder"], dtype="category"),
"NaturalLanguage": pd.Series(
["free-form text", "will", "be imputed", pd.NA], dtype="string"
Copy link
Collaborator

@jeremyliweishih jeremyliweishih Feb 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@freddyaboulton I'm more of a fan of ignoring the columns if our imputer explicitly (we should not this in the docstring if we pursue this path) does not support natural language columns and that's the design decision I made with all of the select column transformers. If we raise an error it would mean that we would error out during search if the input contains a natural language column with nulls (but I'm unsure how the NaturalLanguageFeaturizer deals with nulls). Maybe raising a warning is a happy medium?

@chukarsten if we want to keep with the ignoring path - we would need to add the natural language columns back in after transform-ing as well.

Copy link
Contributor

@bchen1116 bchen1116 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes are looking good! Agreed about moving some of the dfs to the conftest.py file, as well as left a few questions.

As per Freddy's comment here, it could be useful to add a non-categorical/NL column to the test_simple_imputer_ignores_natural_language in test_simple_imputer to ensure that this continues to work as we want.

@chukarsten chukarsten force-pushed the pandas_1.4.0_support_take_2 branch from ccefefb to 416d296 Compare February 17, 2022 17:20
@chukarsten chukarsten force-pushed the pandas_1.4.0_support_take_2 branch from 975b762 to 686c25f Compare February 17, 2022 19:02
Copy link
Contributor

@freddyaboulton freddyaboulton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you @chukarsten !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for pandas 1.4.0
5 participants