-
Notifications
You must be signed in to change notification settings - Fork 89
Sklearn 1.1.1 Upgrade #3525
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sklearn 1.1.1 Upgrade #3525
Conversation
Codecov Report
@@ Coverage Diff @@
## main #3525 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 335 335
Lines 33260 33350 +90
=======================================
+ Hits 33131 33221 +90
Misses 129 129
Continue to review full report at Codecov.
|
4c7e676
to
3523f29
Compare
9f89b0c
to
3dba63e
Compare
@@ -216,9 +216,9 @@ def test_precision_recall_curve(data_type, make_data_type): | |||
recall = precision_recall_curve_data.get("recall") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All changes in these files result from this sklearn PR: scikit-learn/scikit-learn#23214.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change was also merged into scikit-learn 1.1.1, the post 1.1.0 fix release and is partially why we have to move the lower limit of sklearn to 1.1.1 instead of 1.1.0.
@@ -72,7 +72,7 @@ | |||
"X.ww.init(logical_types={\"provider\": \"Categorical\", \"region\": \"Categorical\",\n", | |||
" \"currency\": \"Categorical\", \"expiration_date\": \"Categorical\"})\n", | |||
"objective = F1()\n", | |||
"pipeline = BinaryClassificationPipeline(component_graph=['Simple Imputer', 'DateTime Featurizer', 'One Hot Encoder', 'Random Forest Classifier'])\n", | |||
"pipeline = BinaryClassificationPipeline(component_graph=['Imputer', 'DateTime Featurizer', 'One Hot Encoder', 'Random Forest Classifier'])\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had to change this as the SimpleImputer now struggles handling both boolean and categorical data within the same dataframe. Made the choice to follow the paradigm of the Imputer spinning up SimpleImputers for numeric, categorical and boolean data separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense
random_seed=0, | ||
**kwargs, | ||
): | ||
if categorical_impute_strategy not in self._valid_categorical_impute_strategies: | ||
raise ValueError( | ||
f"{categorical_impute_strategy} is an invalid parameter. Valid categorical impute strategies are {', '.join(self._valid_numeric_impute_strategies)}" | ||
) | ||
elif numeric_impute_strategy not in self._valid_numeric_impute_strategies: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like it's possible for neither, both, or either of these strategies to be invalid, so removed elif.
return self | ||
|
||
def transform(self, X, y=None): | ||
"""Transforms data X by imputing missing values. 'None' values are converted to np.nan before imputation and are treated as the same. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed this because I don't think this is true anymore.
@@ -458,28 +573,35 @@ def test_imputer_int_preserved(): | |||
assert {k: type(v) for k, v in transformed.ww.logical_types.items()} == {0: Integer} | |||
|
|||
|
|||
def test_imputer_bool_preserved(): | |||
X = pd.DataFrame(pd.Series([True, False, True, np.nan] * 4)) | |||
@pytest.mark.parametrize("null_type", ["pandas_na", "numpy_nan", "python_none"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Expanded coverage a bit and made it so the failed tests are more obvious.
{"int with None": [1, 0, 5, 2], "float with None": [0.1, 0.0, 0.5, 0.2]} | ||
) | ||
assert_frame_equal(expected, transformed, check_dtype=False) | ||
X = pd.DataFrame( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
["categorical col", "float col"], | ||
], | ||
) | ||
def test_simple_imputer_errors_with_bool_and_categorical_columns( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added this to cover the SimpleImputer moving over to not covering mixed cat/bool cols.
@@ -830,3 +831,17 @@ def test_year_start_separated_by_gap(): | |||
assert are_datasets_separated_by_gap_time_index( | |||
train, test, {"time_index": "time_index", "gap": 2} | |||
) | |||
|
|||
|
|||
def test_is_categorical_actually_boolean(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this when #3556 is added.
@@ -49,6 +49,25 @@ def import_or_raise(library, error_msg=None, warning=False): | |||
raise Exception(msg) | |||
|
|||
|
|||
def is_categorical_actually_boolean(df, df_col): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this when #3556 is added.
@@ -1,8 +1,8 @@ | |||
numpy>=1.21.0 | |||
pandas>=1.4.0 | |||
scipy>=1.5.0 | |||
scikit-learn>=0.24.0,<1.1.0 | |||
scikit-optimize>=0.8.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this change (scikit-optimize) also be reflected in meta.yaml?
@chukarsten Thank you for putting this together. Might be worth running the release perf tests to see if we're getting some bugs? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chukarsten Thank you for this! There are some small comments regarding target imputer and the requirements files but other than that I think this is good to go.
@@ -120,6 +121,14 @@ def transform(self, X, y): | |||
|
|||
transformed = self._component_obj.transform(y_df) | |||
y_t = pd.Series(transformed[:, 0], index=y_ww.index) | |||
|
|||
# TODO: Fix this after WW adds inference of object type booleans to BooleanNullable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this? This changes output type from categorical to boolean but having the output type be categorical until the ww + nullable type work is ok since the target will be the same for the model
@@ -72,7 +72,7 @@ outputs: | |||
- lightgbm >=2.3.1 | |||
- lime >=0.2.0.1 | |||
- python >=3.8.* | |||
- imbalanced-learn >=0.8.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we also need to bump imbalanced learn in all the other files? And scitkit-optimize here?
imputer = SkImputer( | ||
strategy=impute_strategy, | ||
fill_value=fill_value, | ||
missing_values=pd.NA, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using np.nan
here makes my local tests pass for the imputer/simple_imputer. Any idea what's up?
evalml/pipelines/components/transformers/imputers/simple_imputer.py
Outdated
Show resolved
Hide resolved
… BooleanNullable rather than categorical and removed the SimpleImputer's changing of BooleanNullable to categorical.
…emed there was some interaction between columns and generating the errors.
…xpected output of BooleanNullable to be BooleanNullable.
…nd updated the new separated test to include this.
…o the testing. Added some comments and removed the old imputer test.
…ying to call Masked Array which no longer exists in sklearn.utils.fixes.
…out some of the bug fixes introduced in 1.1.1.
…instead of the Imputer on data that had both categorical and boolean data in the same frame. Added a test to catch this in the SimpleImputer.
Removed ipykernel
e83d082
to
7ce31d7
Compare
Upgrades EvalML to scikit-learn 1.1.1!