Skip to content

Sklearn diabetes dataset update + test update#3591

Merged
MichaelFu512 merged 23 commits intomainfrom
sklearn-dataset-test
Jul 5, 2022
Merged

Sklearn diabetes dataset update + test update#3591
MichaelFu512 merged 23 commits intomainfrom
sklearn-dataset-test

Conversation

@MichaelFu512
Copy link
Contributor

@MichaelFu512 MichaelFu512 commented Jun 28, 2022

Pull Request Description

Update diabetes to use the new diabetes.csv file that sklearn 1.1.1 uses. Also reverted line 90 in test_dataset.

I had to use the scale function because without it, a lot of tests didn't pass (seen in my commit history).


After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:123.

@codecov
Copy link

codecov bot commented Jun 28, 2022

Codecov Report

Merging #3591 (489dbf9) into main (90bacb1) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #3591     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        335     335             
  Lines      33375   33381      +6     
=======================================
+ Hits       33246   33252      +6     
  Misses       129     129             
Impacted Files Coverage Δ
evalml/demos/diabetes.py 100.0% <100.0%> (ø)
evalml/tests/demo_tests/test_datasets.py 100.0% <100.0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 90bacb1...489dbf9. Read the comment docs.

@MichaelFu512 MichaelFu512 marked this pull request as ready for review June 29, 2022 16:59
@MichaelFu512 MichaelFu512 changed the title DRAFT - Sklearn dataset test Sklearn diabetes dataset update + test update Jun 29, 2022
@MichaelFu512
Copy link
Contributor Author

Didn't mean to close this oops.

@MichaelFu512 MichaelFu512 reopened this Jun 29, 2022
@MichaelFu512 MichaelFu512 requested a review from chukarsten June 29, 2022 17:09
Copy link
Contributor

@eccabay eccabay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, just left one code cleanliness comment!

Comment on lines +24 to +25
numpy_of_X = X.to_numpy()
numpy_of_X = scale(numpy_of_X, copy=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can simplify these two lines to just X_np = scale(X)!

(I'd personally rename numpy_of_X to X_np for conciseness' sake)

Copy link
Collaborator

@jeremyliweishih jeremyliweishih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some questions but overall looks good! Just blocking on the fix to release notes.

* Fixes
* Updated the Imputer and SimpleImputer to work with scikit-learn 1.1.1. :pr:`3525`
* Bumped the minimum versions of scikit-learn to 1.1.1 and imbalanced-learn to 0.9.1. :pr:`3525`
* Updated the `load_diabetes()` method to account for scikit-learn 1.1.1 changes to the dataset :pr:`3584`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey - why are there two mentions here? Let's fix this before merge and put it into Future Releases

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MichaelFu512 sometimes for changes like this, you can use the "edit file" button in the upper right of this window panel (the ... button) and add a change commit directly to the branch. Little life hack to make addressing comments like this easier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey thanks Karsten for the tip!

Also thanks Jeremy, honestly I didn't notice it mentioned twice oops.

X, y = load_data(filename, index=None, target="target")
y.name = None

X = X.astype(float)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can also skip these X astype calls and just use the dtype parameter in to_numpy
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_numpy.html#pandas.DataFrame.to_numpy

y = y.astype(float)
numpy_of_X = X.to_numpy()
numpy_of_X = scale(numpy_of_X, copy=False)
numpy_of_X /= numpy_of_X.shape[0] ** 0.5
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MichaelFu512 - think I'm a little lost here, why do we need to divide by the square root of number of rows?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I simply copied what sklearn does inside their own code and don't really know why the regularization does this.

While I don't understand the math or reasoning behind the division, from my understanding of the methods, load_diabetes from evalml before simply loaded the dataset in a similar manner to how sklearn did it. When scikit learn updated to 1.1.1, inside of sklearn's own code, they added this regularization to the diabetes dataset before they returned, rather than just simply returning the dataset like they had previously.

This happened because now the diabetes.csv file is no longer pre-regularized, and is actually the whole numbers (so the values are now numbers like 116 or 81 rather than stuff like 0.00116 or 0.00081). Therefore, to pass the test, I thought to just imitate what sklearn now does with its diabetes dataset.

Here's a screenshot of the load_diabetes() method from scikit learn which shows the difference between v1.0.2 and v1.1.1.

scikit v1.0.2:
Screen Shot 2022-07-05 at 9 00 49 AM

scikit v1.1.1
Screen Shot 2022-07-05 at 9 01 17 AM

They also added this note to the sklearn's load_diabetes() method docstring:

Screen Shot 2022-07-05 at 9 05 01 AM

And they also changed the function parameters so that default, scaled = True:
Screen Shot 2022-07-05 at 9 12 09 AM

Upon looking at the screenshot of the docstring, I guess they wanted to scale the numbers by the standard deviation * sqrt(n_samples). Dunno why, but by doing that the numbers that diabetes.csv output is similar (but not quite the same) as the numbers that diabetes.csv had previously.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it - thanks for explaining! If we were using the scaled version before let's keep it that way. I would add a comment explaining the scaling we're doing and perhaps leaving a link to the load_diabetes doc to show the scaling.

Copy link
Contributor

@chukarsten chukarsten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll let the team handle overall approval of this!

* Fixes
* Updated the Imputer and SimpleImputer to work with scikit-learn 1.1.1. :pr:`3525`
* Bumped the minimum versions of scikit-learn to 1.1.1 and imbalanced-learn to 0.9.1. :pr:`3525`
* Updated the `load_diabetes()` method to account for scikit-learn 1.1.1 changes to the dataset :pr:`3584`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MichaelFu512 sometimes for changes like this, you can use the "edit file" button in the upper right of this window panel (the ... button) and add a change commit directly to the branch. Little life hack to make addressing comments like this easier.

Copy link
Collaborator

@jeremyliweishih jeremyliweishih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@MichaelFu512 MichaelFu512 merged commit cea3aff into main Jul 5, 2022
@MichaelFu512 MichaelFu512 deleted the sklearn-dataset-test branch July 5, 2022 19:05
@chukarsten chukarsten mentioned this pull request Jul 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants