Remove all nan to double conversion in infer_feature_types #3196
Codecov Report
@@           Coverage Diff            @@
##            main    #3196     +/-   ##
========================================
+ Coverage    67.9%    99.7%   +31.9%
========================================
  Files         326      326
  Lines       31392    31131     -261
========================================
+ Hits        21305    31033    +9728
+ Misses      10087       98    -9989
Force-pushed from 151500d to 12185a5.
Force-pushed from d694d91 to 31b9458.
bchen1116
left a comment
  {
      "all_nan": [np.nan, np.nan, np.nan],
-     "some_nan": [np.nan, 1, 0],
+     "some_nan": [0.0, 1.0, 0.0],
Why did this need to change? Also, we should rename this column since there are no longer any NaNs in it.
Sorry, I thought I had answered this yesterday.
An all-NaN column can only be passed to the imputer if it is not typed as Unknown, and Unknown is what the default Woodwork inference returns for such a column. So we need to init Woodwork on the dataframe. However, because of #2055, the imputer will modify the input data.
As a result, the some_nan column in the input will have its NaNs imputed. We can't rename the column because the intention is to compare the input data before and after running it through the imputer, and the column name doesn't change after going through the imputer.
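A minimal sketch of the in-place behaviour described in this reply, assuming pandas and NumPy. `impute_in_place` is a hypothetical stand-in, not EvalML's Imputer, and filling with the most frequent value is chosen only for illustration. The point is that a frame handed to a mutating transformer no longer holds its original NaNs, so any "before" data must be captured separately.

```python
import numpy as np
import pandas as pd


def impute_in_place(df):
    """Hypothetical stand-in for an imputer that mutates its input.

    Fills NaNs in each partially-missing column with that column's most
    frequent value, writing the result back into the same DataFrame.
    """
    for name in df.columns:
        col = df[name]
        # Skip columns with nothing to fill and columns that are entirely NaN.
        if col.isna().any() and col.notna().any():
            df[name] = col.fillna(col.mode()[0])
    return df


X = pd.DataFrame({"some_nan": [np.nan, 1, 0]})
X_before = X.copy()  # independent snapshot taken before imputation

impute_in_place(X)

# The caller's frame no longer contains the NaN it was built with...
mutated_values = X["some_nan"].tolist()
# ...while the snapshot still does.
snapshot_has_nan = bool(X_before["some_nan"].isna().any())
```

Under this assumption, the test fixture in the diff lists the already-imputed values for `some_nan` because the same object is inspected after the imputer has run.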
  {
      "all_nan": [np.nan, np.nan, np.nan],
-     "some_nan": [np.nan, 1, 0],
+     "some_nan": [0.0, 1.0, 0.0],
same here
Same comment as above ^
      "column_with_nan_not_included": [np.nan, 1, 0],
-     "column_with_nan_included": [0, 1, np.nan],
+     # Because of https://github.com/alteryx/evalml/issues/2055
+     "column_with_nan_included": [0.0, 1.0, 0.0],
rename
Force-pushed from 31b9458 to 63e36bc.
angela97lin
left a comment
Looks good, thank you @freddyaboulton!
      else []
  )
- drop_null = [DropNullColumns] if "all_null" in column_names else []
+ drop_null = [DropColumns] if "all_null" in column_names else []
This change is because all_null will now be considered an unknown column, rather than double, right?
Yes!
@bchen1116 I think model understanding is faster on this feature branch! See datasets like nyc taxi and regress.csv. The particular dataset you've pointed out is slower, but the difference is 0.1 seconds, which I don't think is meaningful.
Force-pushed from 63e36bc to c530d4a.

Pull Request Description
Fixes #3194
Perf tests:
nan-to-double-reports.zip
New branch seems to be faster on large datasets for search and model understanding (nyc-taxi, regress, kddcup)
After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:123.