Add support for scikit-learn v0.24.0 #1733

angela97lin · 2021-01-25T18:26:33Z

Closes #1593.

There are two items that changed for scikit-learn 0.24.0:

scikit-learn now raises a warning in partial_dependence: https://github.com/scikit-learn/scikit-learn/blob/7c2f928cbc8d5261e5b85b5e63d549a794116c3b/sklearn/inspection/_partial_dependence.py#L504
scikit-learn raises a ValueError ("Setting a random_state has no effect since shuffle is False. You should leave random_state to its default (None), or set shuffle=True.") when random_state is set and shuffle is False.

I've written about my proposed solutions below for both of these issues, but always welcome more discussion 😁

Item 1: scikit-learn now raises a warning in partial_dependence: https://github.com/scikit-learn/scikit-learn/blob/7c2f928cbc8d5261e5b85b5e63d549a794116c3b/sklearn/inspection/_partial_dependence.py#L504

In order to support this version, we could:

Update our test_jupyter_graph_check test to check that the number of warnings returned == 1, not 0 as before.
Update our partial dependence implementation to set kind=legacy. Then, no warnings will be raised. However, we would have to set our dependencies for scikit-learn to be scikit-learn>=0.24.0 since kind is not a valid parameter in previous versions.
Update our partial dependence implementation to set kind=average and to handle the new output (Bunch, not ndarrays). Then, no warnings will be raised. However, we would again have to set our dependencies for scikit-learn to be scikit-learn>=0.24.0 since kind is not a valid parameter in previous versions.

For this PR, I'm choosing to just stick with option #1. This is a change that will be enforced in scikit-learn 1.1, and we will have to revisit then, but for now, option #1 makes the most sense and allows us to support older versions.
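
For readers following along, option #1 amounts to counting the warnings a call emits. Below is a minimal, standard-library-only sketch of that pattern. `partial_dependence_stub` is a hypothetical stand-in (it is not scikit-learn's function, and the warning text is illustrative), and `count_warnings` is a helper name invented here.

```python
import warnings


def partial_dependence_stub(kind=None):
    """Hypothetical stand-in for a library call that warns when `kind` is left unset."""
    if kind is None:
        warnings.warn("the default of `kind` will change in a future release", FutureWarning)
    return [0.1, 0.2, 0.3]


def count_warnings(func, *args, **kwargs):
    """Run `func` and return how many warnings it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, even repeated ones
        func(*args, **kwargs)
    return len(caught)


# Leaving `kind` unset triggers the warning; passing it explicitly does not.
n_default = count_warnings(partial_dependence_stub)
n_explicit = count_warnings(partial_dependence_stub, kind="average")
```

A test in the style of test_jupyter_graph_check would then assert on the recorded count (1 under the new behavior) rather than expecting no warnings at all.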

Item 2: scikit-learn raises a ValueError ("Setting a random_state has no effect since shuffle is False. You should leave random_state to its default (None), or set shuffle=True.") when random_state is set and shuffle is False.

In order to support this version, we could:

Update our codebase to check whether random_state is not None and shuffle=False, and raise an error. This is nice in that the exception would still come from evalml.
Leave as is, and add a test to check that this error is raised as expected for scikit-learn data splitters. Since we require data splitters to be a child class of the scikit-learn BaseCrossValidator, this might be a valid thing to do, and it could be better to make it clear that this is a scikit-learn limitation.

For this PR, I chose to stick with option #2. The reasoning is that this error is specific to some scikit-learn splitters, while our TimeSeriesSplit works with these arguments (even though its base class is the scikit-learn BaseCrossValidator). So I don't think it's something we need to disallow entirely.
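
To make the rule in Item 2 concrete, here is a small standard-library sketch of which argument combinations trigger the error. `StubKFold` is a hypothetical stand-in that mirrors the check described above (it is not scikit-learn's class), and `raises_expected_error` is a helper invented for illustration. The pattern match uses `re.search`, as `pytest.raises(..., match=...)` does.

```python
import re

ERROR_MSG = (
    "Setting a random_state has no effect since shuffle is False. "
    "You should leave random_state to its default (None), or set shuffle=True."
)


class StubKFold:
    """Hypothetical stand-in mirroring the check described above; not scikit-learn's class."""

    def __init__(self, n_splits=3, shuffle=False, random_state=None):
        # The error only applies when a seed is given but shuffling is off.
        if not shuffle and random_state is not None:
            raise ValueError(ERROR_MSG)
        self.n_splits = n_splits
        self.shuffle = shuffle
        self.random_state = random_state


def raises_expected_error(**kwargs):
    """Return True if constructing the splitter raises the expected ValueError."""
    try:
        StubKFold(**kwargs)
    except ValueError as err:
        pattern = "Setting a random_state has no effect since shuffle is False."
        return re.search(pattern, str(err)) is not None
    return False


results = {
    "shuffle_off_seeded": raises_expected_error(shuffle=False, random_state=0),
    "shuffle_on_seeded": raises_expected_error(shuffle=True, random_state=0),
    "shuffle_off_unseeded": raises_expected_error(shuffle=False, random_state=None),
}
```

Only the first combination (seed set, shuffle off) is rejected; the other two construct normally.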

codecov · 2021-01-25T21:23:52Z

Codecov Report

Merging #1733 (9c531dd) into main (4b3bfe4) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@            Coverage Diff            @@
##             main    #1733     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         243      243             
  Lines       19430    19435      +5     
=========================================
+ Hits        19421    19426      +5     
  Misses          9        9

Impacted Files	Coverage Δ
...essing/data_splitters/training_validation_split.py	`100.0% <ø> (ø)`
evalml/automl/utils.py	`100.0% <100.0%> (ø)`
evalml/tests/automl_tests/test_automl_utils.py	`100.0% <100.0%> (ø)`
...lml/tests/model_understanding_tests/test_graphs.py	`100.0% <100.0%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4b3bfe4...9c531dd.

angela97lin · 2021-01-26T00:47:27Z

evalml/tests/automl_tests/test_automl_utils.py

@@ -75,30 +75,28 @@ def test_make_data_splitter_default(problem_type, large_data):
        assert data_splitter.max_delay == 7


-def test_make_data_splitter_parameters():
+@pytest.mark.parametrize("problem_type, expected_data_splitter", [(ProblemTypes.REGRESSION, KFold),


Cleanup by using parametrize 😁

freddyaboulton

@angela97lin This looks great! I agree with the plan for the two problems you mentioned. Just making sure, but the only unit tests that would fail if someone were to run 0.23.2 are test_make_data_splitter_error_shuffle_random_state and test_jupyter_graph_check, correct?

freddyaboulton · 2021-01-26T19:04:35Z

evalml/tests/automl_tests/test_automl_utils.py

+                      'target': list(range(n))})
+    y = X.pop('target')
+
+    with pytest.raises(ValueError, match="Setting a random_state has no effect since shuffle is False."):


Does our TrainingValidationSplit also raise this exception?

Nope, I don't think so! I parametrized the test for the different problem types and for large data sizes, though, to make this more clear 😁

bchen1116

LGTM! Just left one comment, but I think going with the options you chose is good!

evalml/automl/utils.py

chukarsten

This looks good. I'm not sure I get the changes to make_data_splitter() in evalml/automl/utils.py. But I think it's just stylistic, so no big deal.

chukarsten · 2021-01-27T15:42:11Z

docs/source/release_notes.rst

@@ -17,14 +17,15 @@ Release Notes
        * Added time series support for ``make_pipeline`` :pr:`1566`
        * Added target name for output of pipeline ``predict`` method :pr:`1578`
        * Added multiclass check to ``InvalidTargetDataCheck`` for two examples per class :pr:`1596`
-        * Support graphviz 0.16 :pr:`1657`
+        * Added support for ``graphviz`` ``v0.16`` :pr:`1657`


Good for you.

angela97lin · 2021-01-27T16:29:41Z

@chukarsten Partially stylistic! I switched to returning early because the previous code set the data_splitter based on the problem type and then, at the very end, replaced it with TrainingValidationSplit if the dataset was too large. However, scikit-learn 0.24.0 raises an error if shuffle=False and random_state is not None for the binary/multiclass/regression case. Say a user has a very large binary dataset and specifies a random state with shuffle=False: we want to give them TrainingValidationSplit, but without the refactor we would hit the error when first constructing StratifiedKFold/KFold, before ever switching to TrainingValidationSplit. Now we return TrainingValidationSplit directly and never run into that error :P
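
The ordering issue described in this comment can be sketched as follows. This is a simplified illustration, not the actual evalml code: the stub classes, the function names `make_data_splitter_old` / `make_data_splitter_new`, and the row threshold value are all assumptions made for the example.

```python
LARGE_DATA_ROW_THRESHOLD = 100_000  # hypothetical cutoff for "too large"


class StubKFold:
    """Stand-in splitter that rejects a seed when shuffle is off (mirrors the new check)."""

    def __init__(self, shuffle=False, random_state=None):
        if not shuffle and random_state is not None:
            raise ValueError("Setting a random_state has no effect since shuffle is False.")
        self.shuffle = shuffle
        self.random_state = random_state


class StubTrainingValidationSplit:
    """Stand-in splitter that accepts any shuffle/random_state combination."""

    def __init__(self, shuffle=False, random_state=None):
        self.shuffle = shuffle
        self.random_state = random_state


def make_data_splitter_old(n_rows, shuffle, random_state):
    # Old flow: always build the K-fold splitter first, then override for large data.
    splitter = StubKFold(shuffle=shuffle, random_state=random_state)  # may raise here
    if n_rows > LARGE_DATA_ROW_THRESHOLD:
        splitter = StubTrainingValidationSplit(shuffle=shuffle, random_state=random_state)
    return splitter


def make_data_splitter_new(n_rows, shuffle, random_state):
    # New flow: return early for large data, so the K-fold splitter is never constructed.
    if n_rows > LARGE_DATA_ROW_THRESHOLD:
        return StubTrainingValidationSplit(shuffle=shuffle, random_state=random_state)
    return StubKFold(shuffle=shuffle, random_state=random_state)
```

With a large dataset, shuffle=False, and a seed, the old flow fails while constructing the K-fold stand-in, whereas the new flow returns the training/validation split without touching it.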

angela97lin · 2021-01-27T20:39:34Z

docs/source/release_notes.rst

    * Fixes
    * Changes
+        * Updated components and pipelines to return ``Woodwork`` data structures :pr:`1668`


Oops, sneaking this in here. This should have been moved :P

init

e157ace

angela97lin self-assigned this Jan 25, 2021

init

4906e47

angela97lin added 3 commits January 25, 2021 17:49

release notes, update latest dependencies

7cfea22

merge and fix

b1f6cd6

fix test

b8007aa

angela97lin commented Jan 26, 2021

Merge branch 'main' into 1593_scikit_learn_v0.24.0

1ba4c1d

angela97lin marked this pull request as ready for review January 26, 2021 18:53

angela97lin requested review from dsherry, freddyaboulton, bchen1116, chukarsten, jeremyliweishih and ParthivNaresh January 26, 2021 18:53

freddyaboulton approved these changes Jan 26, 2021

angela97lin added 4 commits January 26, 2021 20:29

Merge branch 'main' into 1593_scikit_learn_v0.24.0

5ce6485

parametrize test

bd660d1

update test for large data

55fcaed

Merge branch 'main' into 1593_scikit_learn_v0.24.0

5e4d1cb

bchen1116 approved these changes Jan 27, 2021

chukarsten approved these changes Jan 27, 2021

Merge branch 'main' into 1593_scikit_learn_v0.24.0

ff43367

angela97lin added 3 commits January 27, 2021 11:34

minor cleanup

9749d18

remove unreachable

d554853

Merge branch 'main' into 1593_scikit_learn_v0.24.0

9c531dd

angela97lin commented Jan 27, 2021

angela97lin merged commit 476c49b into main Jan 27, 2021

angela97lin deleted the 1593_scikit_learn_v0.24.0 branch January 27, 2021 20:46

angela97lin mentioned this pull request Jan 28, 2021

Add support for scipy v1.6.0 #1752

Merged

chukarsten mentioned this pull request Feb 1, 2021

Release v0.18.1 #1774

Merged
