Skip to content

Sklearn 1.1.1 Upgrade #3525

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 45 commits into from
Jun 21, 2022
Merged

Sklearn 1.1.1 Upgrade #3525

merged 45 commits into from
Jun 21, 2022

Conversation

chukarsten
Copy link
Contributor

@chukarsten chukarsten commented May 17, 2022

Upgrades EvalML to scikit-learn 1.1.1!

@codecov
Copy link

codecov bot commented May 17, 2022

Codecov Report

Merging #3525 (291f2bd) into main (d6462a8) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #3525     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        335     335             
  Lines      33260   33350     +90     
=======================================
+ Hits       33131   33221     +90     
  Misses       129     129             
Impacted Files Coverage Δ
evalml/pipelines/component_graph.py 99.8% <ø> (ø)
evalml/tests/component_tests/test_components.py 99.1% <ø> (ø)
...ation_pipeline_tests/test_binary_classification.py 100.0% <ø> (ø)
...n_pipeline_tests/test_multiclass_classification.py 100.0% <ø> (ø)
...tests/regression_pipeline_tests/test_regression.py 100.0% <ø> (ø)
...valml/tests/pipeline_tests/test_component_graph.py 99.9% <ø> (ø)
evalml/tests/pipeline_tests/test_pipeline_utils.py 99.8% <ø> (ø)
evalml/tests/pipeline_tests/test_pipelines.py 99.8% <ø> (ø)
...elines/components/transformers/imputers/imputer.py 100.0% <100.0%> (ø)
...components/transformers/imputers/simple_imputer.py 100.0% <100.0%> (ø)
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d6462a8...291f2bd. Read the comment docs.

@gsheni gsheni changed the title Sklearn 11 upgrade Sklearn 1.1 upgrade May 17, 2022
@chukarsten chukarsten force-pushed the sklearn_11_upgrade branch 3 times, most recently from 4c7e676 to 3523f29 Compare June 6, 2022 21:31
@chukarsten chukarsten changed the title Sklearn 1.1 upgrade Sklearn 1.1.1 Upgrade Jun 7, 2022
@chukarsten chukarsten force-pushed the sklearn_11_upgrade branch from 9f89b0c to 3dba63e Compare June 9, 2022 22:00
@chukarsten chukarsten marked this pull request as ready for review June 9, 2022 23:23
@@ -216,9 +216,9 @@ def test_precision_recall_curve(data_type, make_data_type):
recall = precision_recall_curve_data.get("recall")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All changes in these files result from this sklearn PR: scikit-learn/scikit-learn#23214.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change was also merged into scikit-learn 1.1.1, the post 1.1.0 fix release and is partially why we have to move the lower limit of sklearn to 1.1.1 instead of 1.1.0.

@@ -72,7 +72,7 @@
"X.ww.init(logical_types={\"provider\": \"Categorical\", \"region\": \"Categorical\",\n",
" \"currency\": \"Categorical\", \"expiration_date\": \"Categorical\"})\n",
"objective = F1()\n",
"pipeline = BinaryClassificationPipeline(component_graph=['Simple Imputer', 'DateTime Featurizer', 'One Hot Encoder', 'Random Forest Classifier'])\n",
"pipeline = BinaryClassificationPipeline(component_graph=['Imputer', 'DateTime Featurizer', 'One Hot Encoder', 'Random Forest Classifier'])\n",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had to change this as the SimpleImputer now struggles handling both boolean and categorical data within the same dataframe. Made the choice to follow the paradigm of the Imputer spinning up SimpleImputers for numeric, categorical and boolean data separately.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense

random_seed=0,
**kwargs,
):
if categorical_impute_strategy not in self._valid_categorical_impute_strategies:
raise ValueError(
f"{categorical_impute_strategy} is an invalid parameter. Valid categorical impute strategies are {', '.join(self._valid_numeric_impute_strategies)}"
)
elif numeric_impute_strategy not in self._valid_numeric_impute_strategies:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like it's possible for neither, both, or either of these strategies to be invalid, so removed elif.

return self

def transform(self, X, y=None):
"""Transforms data X by imputing missing values. 'None' values are converted to np.nan before imputation and are treated as the same.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this because I don't think this is true anymore.

@@ -458,28 +573,35 @@ def test_imputer_int_preserved():
assert {k: type(v) for k, v in transformed.ww.logical_types.items()} == {0: Integer}


def test_imputer_bool_preserved():
X = pd.DataFrame(pd.Series([True, False, True, np.nan] * 4))
@pytest.mark.parametrize("null_type", ["pandas_na", "numpy_nan", "python_none"])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expanded coverage a bit and made it so the failed tests are more obvious.

{"int with None": [1, 0, 5, 2], "float with None": [0.1, 0.0, 0.5, 0.2]}
)
assert_frame_equal(expected, transformed, check_dtype=False)
X = pd.DataFrame(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

["categorical col", "float col"],
],
)
def test_simple_imputer_errors_with_bool_and_categorical_columns(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this to cover the SimpleImputer moving over to not covering mixed cat/bool cols.

@@ -830,3 +831,17 @@ def test_year_start_separated_by_gap():
assert are_datasets_separated_by_gap_time_index(
train, test, {"time_index": "time_index", "gap": 2}
)


def test_is_categorical_actually_boolean():
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this when #3556 is added.

@@ -49,6 +49,25 @@ def import_or_raise(library, error_msg=None, warning=False):
raise Exception(msg)


def is_categorical_actually_boolean(df, df_col):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this when #3556 is added.

@@ -1,8 +1,8 @@
numpy>=1.21.0
pandas>=1.4.0
scipy>=1.5.0
scikit-learn>=0.24.0,<1.1.0
scikit-optimize>=0.8.1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this change (scikit-optimize) also be reflected in meta.yaml?

@freddyaboulton
Copy link
Contributor

freddyaboulton commented Jun 10, 2022

@chukarsten Thank you for putting this together. Might be worth running the release perf tests to see if we're getting some bugs?

Copy link
Contributor

@freddyaboulton freddyaboulton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chukarsten Thank you for this! There are some small comments regarding target imputer and the requirements files but other than that I think this is good to go.

@@ -120,6 +121,14 @@ def transform(self, X, y):

transformed = self._component_obj.transform(y_df)
y_t = pd.Series(transformed[:, 0], index=y_ww.index)

# TODO: Fix this after WW adds inference of object type booleans to BooleanNullable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this? This changes output type from categorical to boolean but having the output type be categorical until the ww + nullable type work is ok since the target will be the same for the model

@@ -72,7 +72,7 @@ outputs:
- lightgbm >=2.3.1
- lime >=0.2.0.1
- python >=3.8.*
- imbalanced-learn >=0.8.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we also need to bump imbalanced learn in all the other files? And scitkit-optimize here?

imputer = SkImputer(
strategy=impute_strategy,
fill_value=fill_value,
missing_values=pd.NA,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using np.nan here makes my local tests pass for the imputer/simple_imputer. Any idea what's up?

… BooleanNullable rather than categorical and removed the SimpleImputer's changing of BooleanNullable to categorical.
…emed there was some interaction between columns and generating the errors.
…xpected output of BooleanNullable to be BooleanNullable.
…nd updated the new separated test to include this.
…o the testing. Added some comments and removed the old imputer test.
chukarsten and others added 21 commits June 17, 2022 12:21
…ying to call Masked Array which no longer exists in sklearn.utils.fixes.
…out some of the bug fixes introduced in 1.1.1.
…instead of the Imputer on data that had both categorical and boolean data in the same frame. Added a test to catch this in the SimpleImputer.
Removed ipykernel
@chukarsten chukarsten force-pushed the sklearn_11_upgrade branch from e83d082 to 7ce31d7 Compare June 17, 2022 18:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants