Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for scikit-learn v0.24.0 #1733

Merged
merged 14 commits into from Jan 27, 2021
Merged

Conversation

angela97lin
Copy link
Contributor

@angela97lin angela97lin commented Jan 25, 2021

Closes #1593.

There are two items that changed for scikit-learn 0.24.0:

  1. scikit-learn now raises a warning in permutation_importance: https://github.com/scikit-learn/scikit-learn/blob/7c2f928cbc8d5261e5b85b5e63d549a794116c3b/sklearn/inspection/_partial_dependence.py#L504
  2. scikit-learn throws Setting a random_state has no effect since shuffle is False. You should leave random_state to its default (None), or set shuffle=True. when random state is set and shuffle is False.

I've written about my proposed solutions below for both of these issues, but always welcome more discussion 馃榿

Item 1: scikit-learn now raises a warning in permutation_importance: https://github.com/scikit-learn/scikit-learn/blob/7c2f928cbc8d5261e5b85b5e63d549a794116c3b/sklearn/inspection/_partial_dependence.py#L504

In order to support this version, we could:

  • Update our test_jupyter_graph_check test to check that the number of warnings returned == 1, not 0 as before.
  • Update our partial dependency impl to set kind=legacy. Then, no warnings will be raised. However, we would have to set our dependencies for scikit-learn to be scikit-learn>=0.24.0 since kind is not a valid parameter in previous versions.- Update our partial dependency impl to set kind=average. Update our partial dependency impl to handle the new output (Bunch, not ndarrays). Then, no warnings will be raised. However, we would have to set our dependencies for scikit-learn to be scikit-learn>=0.24.0 since kind is not a valid parameter in previous versions.

For this PR, I'm choosing to just stick with option #1. This is a change that will be enforced in scikit-learn 1.1, and we will have to revisit then, but for now, option #1 makes the most sense and allows us to support older versions.


Item 2: scikit-learn throws Setting a random_state has no effect since shuffle is False. You should leave random_state to its default (None), or set shuffle=True. when random state is set and shuffle is False.

In order to support this version, we could:

  • Update our codebase to check if random_state is None and shuffle=False and throw. This is nice in that the exception is still coming from evalml
  • Leave as is, and add test to check that this error is thrown as expected for scikit-learn data splitters..? I guess since we're requiring data splitters to be a child class of the scikit-learn BaseCrossValidator, this might be a valid thing to do, but it could just be better to make it clear that this is a scikit-learn limitation.

For this PR, I chose to stick with option #2. Reasoning is, this is specific for some scikit-learn splitters, but works for our TimeSeriesSplit (even though its base class is the scikit-learn BaseCrossValidator). Still, its not something that we need to disallow entirely then?

@angela97lin angela97lin self-assigned this Jan 25, 2021
@codecov
Copy link

codecov bot commented Jan 25, 2021

Codecov Report

Merging #1733 (9c531dd) into main (4b3bfe4) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@            Coverage Diff            @@
##             main    #1733     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         243      243             
  Lines       19430    19435      +5     
=========================================
+ Hits        19421    19426      +5     
  Misses          9        9             
Impacted Files Coverage 螖
...essing/data_splitters/training_validation_split.py 100.0% <酶> (酶)
evalml/automl/utils.py 100.0% <100.0%> (酶)
evalml/tests/automl_tests/test_automl_utils.py 100.0% <100.0%> (酶)
...lml/tests/model_understanding_tests/test_graphs.py 100.0% <100.0%> (酶)

Continue to review full report at Codecov.

Legend - Click here to learn more
螖 = absolute <relative> (impact), 酶 = not affected, ? = missing data
Powered by Codecov. Last update 4b3bfe4...9c531dd. Read the comment docs.

@@ -75,30 +75,28 @@ def test_make_data_splitter_default(problem_type, large_data):
assert data_splitter.max_delay == 7


def test_make_data_splitter_parameters():
@pytest.mark.parametrize("problem_type, expected_data_splitter", [(ProblemTypes.REGRESSION, KFold),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleanup by using parametrize 馃榿

Copy link
Contributor

@freddyaboulton freddyaboulton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@angela97lin This looks great! I agree with the plan for the two problems you mentioned. Just making sure, but the only unit tests that would fail if someone were to run 0.23.2 are test_make_data_splitter_error_shuffle_random_state and test_jupyter_graph_check, correct?

'target': list(range(n))})
y = X.pop('target')

with pytest.raises(ValueError, match="Setting a random_state has no effect since shuffle is False."):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does our TrainingValidationSplit also raise this exception?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, I don't think so! I parametrized the test though for the different data types / large datasize to make this more clear 馃榿

Copy link
Contributor

@bchen1116 bchen1116 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Just left one comment, but I think going with the options you chose are good!

evalml/automl/utils.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@chukarsten chukarsten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. I'm not sure I get the changes to EvalML/automl/utils.py make_data_split(). But I think it's just stylistic, so no big deal.

@@ -17,14 +17,15 @@ Release Notes
* Added time series support for ``make_pipeline`` :pr:`1566`
* Added target name for output of pipeline ``predict`` method :pr:`1578`
* Added multiclass check to ``InvalidTargetDataCheck`` for two examples per class :pr:`1596`
* Support graphviz 0.16 :pr:`1657`
* Added support for ``graphviz`` ``v0.16`` :pr:`1657`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good for you.

@angela97lin
Copy link
Contributor Author

@chukarsten Partially stylistic! The reason why I shifted and returned rather than what was previously there was because in the previous code, we would set the data_splitter based on the problem type, and then at the very end if the dataset was too large, we'd set it to TrainingValidationSplit regardless. However, scikit-learn 0.24.0 breaks if shuffle=False and random_state!=None for the binary/multiclass/regression case. So if the user has a super large binary dataset, we want to give them TrainingValidationSplit. Say they want to specify a random state and shuffle=False. Without the refactor, we would run into errors when it tries the initial set to StratifiedKFold/KFold before we set it to TrainingValidationSplit. Now, we just set to TrainingValidationSplit and don't run into that error :P

* Fixes
* Changes
* Updated components and pipelines to return ``Woodwork`` data structures :pr:`1668`
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, sneaking this in here. This should have been moved :P

@angela97lin angela97lin merged commit 476c49b into main Jan 27, 2021
2 checks passed
@angela97lin angela97lin deleted the 1593_scikit_learn_v0.24.0 branch January 27, 2021 20:46
@chukarsten chukarsten mentioned this pull request Feb 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support sklearn 0.24.0
4 participants