Sklearn diabetes dataset update + test update #3591
Codecov Report
@@ Coverage Diff @@
## main #3591 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 335 335
Lines 33375 33381 +6
=======================================
+ Hits 33246 33252 +6
Misses 129 129
Didn't mean to close this oops.
Don't need it since it's on the featurelabs/datasets website now
Don't need it since it's on the featurelabs/dataset website
eccabay left a comment
Looks great, just left one code cleanliness comment!
evalml/demos/diabetes.py
Outdated
    numpy_of_X = X.to_numpy()
    numpy_of_X = scale(numpy_of_X, copy=False)
You can simplify these two lines to just X_np = scale(X)!
(I'd personally rename numpy_of_X to X_np for conciseness' sake)
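A minimal sketch of that suggestion, assuming X is an all-numeric pandas DataFrame (the toy columns and values below are made up for illustration and are not the real diabetes data). `sklearn.preprocessing.scale` accepts the DataFrame directly and returns a NumPy array, so the separate `to_numpy()` step should not be needed:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import scale

# Hypothetical toy frame standing in for the diabetes features.
X = pd.DataFrame({"age": [59, 48, 72, 24], "bmi": [32.1, 21.6, 30.5, 25.3]})

# One-step version: scale() converts the DataFrame to an ndarray itself.
X_np = scale(X)

# Converting first, as in the original diff, gives the same values.
two_step = scale(X.to_numpy())

print(type(X_np))                   # a NumPy ndarray, not a DataFrame
print(np.allclose(X_np, two_step))  # True
```

The one-step form drops `copy=False`, so it trades a possible extra copy for brevity; on a dataset this small that is unlikely to matter.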
jeremyliweishih left a comment
Left some questions but overall looks good! Just blocking on the fix to release notes.
docs/source/release_notes.rst
Outdated
    * Fixes
        * Updated the Imputer and SimpleImputer to work with scikit-learn 1.1.1. :pr:`3525`
        * Bumped the minimum versions of scikit-learn to 1.1.1 and imbalanced-learn to 0.9.1. :pr:`3525`
        * Updated the `load_diabetes()` method to account for scikit-learn 1.1.1 changes to the dataset :pr:`3584`
hey - why are there two mentions here? Let's fix this before merge and put it into Future Releases
@MichaelFu512 sometimes for changes like this, you can use the "edit file" button in the upper right of this window panel (the ... button) and add a change commit directly to the branch. Little life hack to make addressing comments like this easier.
Hey thanks Karsten for the tip!
Also thanks Jeremy, honestly I didn't notice it mentioned twice oops.
evalml/demos/diabetes.py
Outdated
    X, y = load_data(filename, index=None, target="target")
    y.name = None

    X = X.astype(float)
you can also skip these X astype calls and just use the dtype parameter in to_numpy
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_numpy.html#pandas.DataFrame.to_numpy
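A small sketch of what the reviewer describes, using a made-up integer-valued frame (not the real dataset). `DataFrame.to_numpy` takes a `dtype` argument, so the cast and the conversion can happen in one call:

```python
import pandas as pd

# Hypothetical integer-valued features, like the unscaled CSV values.
X = pd.DataFrame({"age": [59, 48, 72], "bp": [101, 87, 93]})

# Cast to float while converting, instead of X.astype(float).to_numpy().
X_np = X.to_numpy(dtype=float)

print(X_np.dtype)  # float64
```

Note that this only changes the array; X itself keeps its integer dtype, so whether the separate `astype` call can be dropped depends on what the function returns to the caller.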
evalml/demos/diabetes.py
Outdated
    y = y.astype(float)
    numpy_of_X = X.to_numpy()
    numpy_of_X = scale(numpy_of_X, copy=False)
    numpy_of_X /= numpy_of_X.shape[0] ** 0.5
@MichaelFu512 - think I'm a little lost here, why do we need to divide by the square root of the number of rows?
To be honest, I simply copied what sklearn does in its own code and don't really know why the scaling is done this way.
While I don't understand the math behind the division, my understanding is that evalml's load_diabetes previously loaded the dataset much as sklearn did. When scikit-learn updated to 1.1.1, its own code started applying this scaling to the diabetes dataset before returning it, rather than returning the file contents as-is like before.
This happened because the diabetes.csv file is no longer pre-scaled and now holds the raw whole numbers (so the values are numbers like 116 or 81 rather than 0.00116 or 0.00081). Therefore, to pass the test, I decided to imitate what sklearn now does with its diabetes dataset.
Here's a screenshot of the load_diabetes() method from scikit-learn which shows the difference between v1.0.2 and v1.1.1.
[screenshot]
They also added this note to sklearn's load_diabetes() docstring:
[screenshot]
And they also changed the function parameters so that scaled=True by default:
[screenshot]
Looking at the screenshot of the docstring, I guess they wanted to scale the numbers by the standard deviation * sqrt(n_samples). Dunno why, but after doing that the numbers we get from diabetes.csv are similar (but not quite the same) to the numbers that diabetes.csv had previously.
got it - thanks for explaining! If we were using the scaled version before let's keep it that way. I would add a comment explaining the scaling we're doing and perhaps leave a link to the load_diabetes doc to show the scaling.
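For what it's worth, the arithmetic behind the division can be checked directly. Standardizing a column with the population standard deviation (which, as far as I know, is what `scale` uses by default) leaves its squared values summing to n_samples, so dividing by sqrt(n_samples) makes each column's sum of squares equal 1. A rough sketch with NumPy only, on random made-up data rather than the real diabetes.csv:

```python
import numpy as np

# Made-up whole-number features; stand-in for the unscaled CSV values.
rng = np.random.default_rng(0)
raw = rng.integers(20, 200, size=(50, 3)).astype(float)

# Mean-center and divide by the population std (ddof=0),
# mirroring the scale(...) line in the diff.
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# Then divide by sqrt(n_samples), mirroring the `/= shape[0] ** 0.5` line.
scaled /= raw.shape[0] ** 0.5

col_means = scaled.mean(axis=0)
col_sum_sq = (scaled ** 2).sum(axis=0)
print(np.allclose(col_means, 0.0))   # True: columns are centered
print(np.allclose(col_sum_sq, 1.0))  # True: each column has unit sum of squares
```

This is only an illustration of the normalization, not a claim about why sklearn chose it; a sentence like "each column is mean-centered and has unit sum of squares" might be enough for the code comment.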
chukarsten left a comment
I'll let the team handle overall approval of this!

Pull Request Description
Update diabetes to use the new diabetes.csv file that sklearn 1.1.1 uses. Also reverted line 90 in test_dataset.
I had to use the scale function because without it, a lot of tests didn't pass (seen in my commit history).
After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:`123`.