Add support for scikit-learn v0.24.0 #1733
Conversation
Codecov Report
@@            Coverage Diff            @@
##             main    #1733    +/-   ##
=========================================
+ Coverage   100.0%   100.0%    +0.1%
=========================================
  Files         243      243
  Lines       19430    19435       +5
=========================================
+ Hits        19421    19426       +5
  Misses          9        9
Continue to review full report at Codecov.
@@ -75,30 +75,28 @@ def test_make_data_splitter_default(problem_type, large_data):
     assert data_splitter.max_delay == 7


-def test_make_data_splitter_parameters():
+@pytest.mark.parametrize("problem_type, expected_data_splitter", [(ProblemTypes.REGRESSION, KFold),
Cleanup by using parametrize
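For context, a minimal sketch of the parametrize pattern this cleanup uses; the splitter classes are scikit-learn's, but `make_splitter` is an illustrative stand-in, not evalml's actual `make_data_splitter`:

```python
import pytest
from sklearn.model_selection import KFold, StratifiedKFold


def make_splitter(problem_type):
    # Illustrative stand-in for the function under test.
    return KFold() if problem_type == "regression" else StratifiedKFold()


# One parametrized test replaces a family of near-duplicate test functions.
@pytest.mark.parametrize("problem_type, expected_data_splitter",
                         [("regression", KFold),
                          ("binary", StratifiedKFold)])
def test_make_splitter(problem_type, expected_data_splitter):
    assert isinstance(make_splitter(problem_type), expected_data_splitter)
```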
@angela97lin This looks great! I agree with the plan for the two problems you mentioned. Just making sure, but the only unit tests that would fail if someone were to run 0.23.2 are `test_make_data_splitter_error_shuffle_random_state` and `test_jupyter_graph_check`, correct?
    'target': list(range(n))})
y = X.pop('target')

with pytest.raises(ValueError, match="Setting a random_state has no effect since shuffle is False."):
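As a standalone illustration of what this hunk asserts, assuming scikit-learn >= 0.24 (where this combination became an error at construction time; under 0.23.x it did not raise, which is why the test is version-sensitive):

```python
import pytest
from sklearn.model_selection import KFold


def test_kfold_random_state_without_shuffle():
    # scikit-learn >= 0.24 rejects random_state with shuffle=False outright.
    with pytest.raises(ValueError,
                       match="Setting a random_state has no effect since shuffle is False"):
        KFold(n_splits=3, shuffle=False, random_state=0)
```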
Does our `TrainingValidationSplit` also raise this exception?
Nope, I don't think so! I parametrized the test, though, over the different data types and large data sizes to make this clearer.
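A sketch of what that parametrization might look like; the threshold, helper, and `TrainingValidationSplit` stub below are hypothetical stand-ins grounded only in this thread's discussion of the large-data path:

```python
import numpy as np
import pytest
from sklearn.model_selection import KFold


class TrainingValidationSplit:
    """Hypothetical stub for evalml's single train/validation splitter."""


_LARGE_DATA_ROWS = 100_000  # illustrative threshold, not evalml's actual value


def make_data_splitter(X):
    # Stand-in: large datasets get a single split, small ones get KFold.
    return TrainingValidationSplit() if len(X) >= _LARGE_DATA_ROWS else KFold()


@pytest.mark.parametrize("n_rows, expected_splitter",
                         [(100, KFold),
                          (_LARGE_DATA_ROWS, TrainingValidationSplit)])
def test_splitter_for_data_size(n_rows, expected_splitter):
    X = np.zeros((n_rows, 2))
    assert isinstance(make_data_splitter(X), expected_splitter)
```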
LGTM! Just left one comment, but I think the options you chose are good!
This looks good. I'm not sure I get the changes to `make_data_splitter()` in `evalml/automl/utils.py`, but I think it's just stylistic, so no big deal.
@@ -17,14 +17,15 @@ Release Notes
     * Added time series support for ``make_pipeline`` :pr:`1566`
     * Added target name for output of pipeline ``predict`` method :pr:`1578`
     * Added multiclass check to ``InvalidTargetDataCheck`` for two examples per class :pr:`1596`
-    * Support graphviz 0.16 :pr:`1657`
+    * Added support for ``graphviz`` ``v0.16`` :pr:`1657`
Good for you.
@chukarsten Partially stylistic! The reason I shifted to returning early rather than keeping what was there before is that in the previous code, we would set the data_splitter based on the problem type, and then at the very end, if the dataset was too large, we'd override it with `TrainingValidationSplit`.
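To illustrate the difference being described, a schematic sketch; the branch conditions and splitter names are illustrative stand-ins for the real logic in `evalml/automl/utils.py`:

```python
# Before: pick a splitter by problem type, then silently override it
# at the very end for the large-data case.
def make_data_splitter_before(problem_type, is_large_data):
    if problem_type == "regression":
        data_splitter = "KFold"
    else:
        data_splitter = "StratifiedKFold"
    if is_large_data:
        data_splitter = "TrainingValidationSplit"  # overrides everything above
    return data_splitter


# After: handle the large-data case up front and return early, so each
# branch owns exactly one outcome and nothing is overridden later.
def make_data_splitter_after(problem_type, is_large_data):
    if is_large_data:
        return "TrainingValidationSplit"
    if problem_type == "regression":
        return "KFold"
    return "StratifiedKFold"
```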
    * Fixes
    * Changes
        * Updated components and pipelines to return ``Woodwork`` data structures :pr:`1668`
Oops, sneaking this in here. This should have been moved :P
Closes #1593.
There are two items that changed for scikit-learn 0.24.0:

1. scikit-learn now raises a warning in `partial_dependence`.
2. scikit-learn throws `Setting a random_state has no effect since shuffle is False. You should leave random_state to its default (None), or set shuffle=True.` when a random state is set and shuffle is False.

I've written about my proposed solutions below for both of these issues, but always welcome more discussion 😁
Item 1: scikit-learn now raises a warning in `partial_dependence`: https://github.com/scikit-learn/scikit-learn/blob/7c2f928cbc8d5261e5b85b5e63d549a794116c3b/sklearn/inspection/_partial_dependence.py#L504

In order to support this version, we could:

1. Update our `test_jupyter_graph_check` test to check that the number of warnings returned == 1, not 0 as before.
2. Update our partial dependence impl to set `kind=legacy`. Then, no warnings will be raised. However, we would have to set our scikit-learn dependency to `scikit-learn>=0.24.0`, since `kind` is not a valid parameter in previous versions.
3. Update our partial dependence impl to set `kind=average` and to handle the new output (a `Bunch`, not ndarrays). Then, no warnings will be raised. However, we would have to set our scikit-learn dependency to `scikit-learn>=0.24.0`, since `kind` is not a valid parameter in previous versions.

For this PR, I'm choosing to stick with option #1. This change will be enforced in scikit-learn 1.1 and we will have to revisit it then, but for now, option #1 makes the most sense and allows us to support older versions.
Item 2: scikit-learn throws `Setting a random_state has no effect since shuffle is False. You should leave random_state to its default (None), or set shuffle=True.` when a random state is set and shuffle is False.

In order to support this version, we could:
For this PR, I chose to stick with option #2. My reasoning: this restriction is specific to some scikit-learn splitters, but setting a random_state still works for our `TimeSeriesSplit` (even though its base class is scikit-learn's `BaseCrossValidator`), so it's not something we need to disallow entirely.
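To make that reasoning concrete, a minimal hypothetical splitter (names and split logic are illustrative, not evalml's actual `TimeSeriesSplit`) whose base class is scikit-learn's `BaseCrossValidator`, yet which accepts a `random_state` without complaint because it never shuffles:

```python
import numpy as np
from sklearn.model_selection import BaseCrossValidator


class ToyTimeSeriesSplit(BaseCrossValidator):
    """Hypothetical time-ordered splitter: accepting random_state here is
    harmless because the folds are always taken in chronological order."""

    def __init__(self, n_splits=3, random_state=None):
        self.n_splits = n_splits
        self.random_state = random_state  # accepted for API symmetry; unused

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

    def _iter_test_indices(self, X=None, y=None, groups=None):
        # Each test fold is a contiguous, chronologically ordered block.
        fold_size = len(X) // (self.n_splits + 1)
        for i in range(1, self.n_splits + 1):
            yield np.arange(i * fold_size, (i + 1) * fold_size)


# No error or warning here, unlike KFold(shuffle=False, random_state=0)
# under scikit-learn >= 0.24.
cv = ToyTimeSeriesSplit(n_splits=3, random_state=0)
X = np.arange(40).reshape(-1, 1)
print(sum(1 for _ in cv.split(X)))  # 3
```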