
Conversation

@freddyaboulton
Contributor

@freddyaboulton freddyaboulton commented Jan 7, 2022

Pull Request Description

Fixes #3194

Perf tests:
nan-to-double-reports.zip
The new branch seems to be faster on large datasets for search and model understanding (nyc-taxi, regress, kddcup).



@codecov

codecov bot commented Jan 7, 2022

Codecov Report

Merging #3196 (d694d91) into main (aa0ec23) will increase coverage by 31.9%.
The diff coverage is 100.0%.

❗ Current head d694d91 differs from pull request most recent head c530d4a. Consider uploading reports for the commit c530d4a to get more accurate results
Impacted file tree graph

@@           Coverage Diff            @@
##            main   #3196      +/-   ##
========================================
+ Coverage   67.9%   99.7%   +31.9%     
========================================
  Files        326     326              
  Lines      31392   31131     -261     
========================================
+ Hits       21305   31033    +9728     
+ Misses     10087      98    -9989     
Impacted Files Coverage Δ
...onents/transformers/imputers/per_column_imputer.py 100.0% <100.0%> (ø)
evalml/pipelines/utils.py 99.4% <100.0%> (+0.1%) ⬆️
...l/tests/component_tests/test_per_column_imputer.py 100.0% <100.0%> (ø)
...valml/tests/component_tests/test_simple_imputer.py 100.0% <100.0%> (ø)
...del_understanding_tests/test_partial_dependence.py 100.0% <100.0%> (+100.0%) ⬆️
evalml/tests/pipeline_tests/test_pipeline_utils.py 99.7% <100.0%> (+1.3%) ⬆️
evalml/tests/utils_tests/test_woodwork_utils.py 100.0% <100.0%> (ø)
evalml/utils/woodwork_utils.py 100.0% <100.0%> (ø)
evalml/data_checks/outliers_data_check.py 98.1% <0.0%> (-1.9%) ⬇️
... and 76 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update aa0ec23...c530d4a. Read the comment docs.

@freddyaboulton freddyaboulton force-pushed the remove-all-nan-to-double-conversion branch 4 times, most recently from 151500d to 12185a5 Compare January 10, 2022 22:01
@freddyaboulton freddyaboulton marked this pull request as ready for review January 10, 2022 22:48
@freddyaboulton freddyaboulton force-pushed the remove-all-nan-to-double-conversion branch from d694d91 to 31b9458 Compare January 11, 2022 15:15
Contributor

@bchen1116 bchen1116 left a comment

This looks good to me! I left a few comments on updating column names.
As for the perf tests, do you have a hypothesis on why some model understanding tests got a fair bit slower?
[screenshot: perf-test timing comparison]

Not blocking, but curious as to these slight regressions

{
    "all_nan": [np.nan, np.nan, np.nan],
-   "some_nan": [np.nan, 1, 0],
+   "some_nan": [0.0, 1.0, 0.0],
Contributor

Why did this need to change? Also, we should rename this column, since there are no longer NaNs in it.

Contributor Author

Sorry I thought I had answered this yesterday.

The only way to get an all-NaN column into the imputer is for it not to be typed as Unknown, which is what the default woodwork inference would return, so we need to init woodwork on the dataframe ourselves. However, because of #2055, the imputer will modify the input data.

So the some_nan column will have its NaNs imputed. We can't rename the column, because the test compares the input data before and after it runs through the imputer, and the imputer does not change column names.
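To make the in-place behavior concrete, here is a minimal pandas sketch. `impute_most_frequent_inplace` is a hypothetical stand-in for the per-column imputer, not evalml's actual code; the only property it shares with the real component is the mutation described in alteryx/evalml#2055:

```python
import numpy as np
import pandas as pd

def impute_most_frequent_inplace(df):
    # Toy stand-in for the imputer discussed above. Like the behavior
    # described in alteryx/evalml#2055, it mutates the caller's
    # DataFrame rather than working on a copy.
    for col in df.columns:
        if df[col].isna().any():
            df[col] = df[col].fillna(df[col].mode()[0])
    return df

X = pd.DataFrame({"some_nan": [np.nan, 1, 0]})
impute_most_frequent_inplace(X)
# X["some_nan"] now holds [0.0, 1.0, 0.0] under the unchanged column
# name, which is why the expected test data keeps the name "some_nan".
```

This is why the test's expected dictionary reuses the original column name with imputed values rather than introducing a renamed column.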

{
    "all_nan": [np.nan, np.nan, np.nan],
-   "some_nan": [np.nan, 1, 0],
+   "some_nan": [0.0, 1.0, 0.0],
Contributor

same here

Contributor Author

Same comment as above ^

    "column_with_nan_not_included": [np.nan, 1, 0],
-   "column_with_nan_included": [0, 1, np.nan],
+   # Because of https://github.com/alteryx/evalml/issues/2055
+   "column_with_nan_included": [0.0, 1.0, 0.0],
Contributor

rename

@freddyaboulton freddyaboulton force-pushed the remove-all-nan-to-double-conversion branch from 31b9458 to 63e36bc Compare January 11, 2022 16:51
Contributor

@angela97lin angela97lin left a comment

Looks good, thank you @freddyaboulton!

    else []
)
-   drop_null = [DropNullColumns] if "all_null" in column_names else []
+   drop_null = [DropColumns] if "all_null" in column_names else []
Contributor

This change is because all_null will be considered an Unknown column now rather than Double, right?

Contributor Author

Yes!
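The component-selection change confirmed above can be sketched in plain pandas. `drop_components_for` is a hypothetical simplification, not evalml's actual pipeline-utils code: the point is only that an all-NaN column is now identified up front and dropped by name (DropColumns-style) rather than being cast to Double and handled by DropNullColumns:

```python
import numpy as np
import pandas as pd

def drop_components_for(X):
    # Hypothetical sketch of the selection logic discussed above: an
    # all-NaN column is inferred as Unknown rather than cast to Double,
    # so the pipeline drops it by name instead of relying on a
    # null-count-based component.
    all_null = [c for c in X.columns if X[c].isna().all()]
    return ["DropColumns"] if all_null else []

X = pd.DataFrame({"all_null": [np.nan] * 3, "feature": [1, 2, 3]})
drop_components_for(X)
```

With the `all_null` column present this returns `["DropColumns"]`; without any all-NaN column it returns an empty list, matching the conditional in the diff above.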

@freddyaboulton
Contributor Author

freddyaboulton commented Jan 12, 2022

@bchen1116 I think model understanding is actually faster on this feature branch! See datasets like nyc-taxi and regress.csv. The particular test you've pointed out is slower, but the difference is 0.1 seconds, which I don't think is meaningful.

@freddyaboulton freddyaboulton force-pushed the remove-all-nan-to-double-conversion branch from 63e36bc to c530d4a Compare January 12, 2022 17:16
@freddyaboulton freddyaboulton enabled auto-merge (squash) January 12, 2022 17:26
@freddyaboulton freddyaboulton merged commit 119c8c7 into main Jan 12, 2022
@freddyaboulton freddyaboulton deleted the remove-all-nan-to-double-conversion branch January 12, 2022 18:19
@chukarsten chukarsten mentioned this pull request Jan 18, 2022

Development

Successfully merging this pull request may close these issues.

Can't predict on single-row dataframes with Unknown features
