
Remove all nan to double conversion in infer_feature_types #3196

Merged: 4 commits merged into main on Jan 12, 2022

Conversation

@freddyaboulton (Contributor) commented Jan 7, 2022

Pull Request Description

Fixes #3194

Perf tests:
nan-to-double-reports.zip
The new branch seems to be faster on large datasets (nyc-taxi, regress, kddcup) for both search and model understanding.


After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:123.
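For reference, here is a minimal sketch of the behavior change under test, assuming evalml's public infer_feature_types helper and woodwork's .ww accessor (the toy column names are illustrative): before this PR a fully-NaN column was force-converted to Double, while with this change it should keep woodwork's default Unknown inference.

import numpy as np
import pandas as pd
from evalml.utils import infer_feature_types

# Toy frame: one fully-NaN column and one ordinary numeric column.
X = pd.DataFrame({
    "all_nan": [np.nan, np.nan, np.nan],
    "num": [1.0, 2.0, 3.0],
})

X = infer_feature_types(X)
# With the NaN-to-double conversion removed, "all_nan" is expected to keep
# woodwork's default inference (Unknown) instead of being typed as Double.
print(X.ww.logical_types)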

codecov bot commented Jan 7, 2022

Codecov Report

Merging #3196 (d694d91) into main (aa0ec23) will increase coverage by 31.9%.
The diff coverage is 100.0%.

❗ Current head d694d91 differs from the pull request's most recent head c530d4a. Consider uploading reports for the commit c530d4a to get more accurate results.
Impacted file tree graph

@@           Coverage Diff            @@
##            main   #3196      +/-   ##
========================================
+ Coverage   67.9%   99.7%   +31.9%     
========================================
  Files        326     326              
  Lines      31392   31131     -261     
========================================
+ Hits       21305   31033    +9728     
+ Misses     10087      98    -9989     
Impacted Files Coverage Δ
...onents/transformers/imputers/per_column_imputer.py 100.0% <100.0%> (ø)
evalml/pipelines/utils.py 99.4% <100.0%> (+0.1%) ⬆️
...l/tests/component_tests/test_per_column_imputer.py 100.0% <100.0%> (ø)
...valml/tests/component_tests/test_simple_imputer.py 100.0% <100.0%> (ø)
...del_understanding_tests/test_partial_dependence.py 100.0% <100.0%> (+100.0%) ⬆️
evalml/tests/pipeline_tests/test_pipeline_utils.py 99.7% <100.0%> (+1.3%) ⬆️
evalml/tests/utils_tests/test_woodwork_utils.py 100.0% <100.0%> (ø)
evalml/utils/woodwork_utils.py 100.0% <100.0%> (ø)
evalml/data_checks/outliers_data_check.py 98.1% <0.0%> (-1.9%) ⬇️
... and 76 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update aa0ec23...c530d4a.

@freddyaboulton force-pushed the remove-all-nan-to-double-conversion branch 4 times, most recently from 151500d to 12185a5 on January 10, 2022 22:01
@freddyaboulton marked this pull request as ready for review January 10, 2022 22:48
@freddyaboulton force-pushed the remove-all-nan-to-double-conversion branch from d694d91 to 31b9458 on January 11, 2022 15:15
@bchen1116 (Contributor) left a comment:

This looks good to me! I left a few comments on updating column names.
As for the perf tests, do you have a hypothesis on why some model understanding tests increased a fair bit in time?
[screenshot: perf test timings for the model understanding tests]

Not blocking, but curious about these slight regressions.

@@ -212,7 +213,7 @@ def test_fit_transform_drop_all_nan_columns():
     pd.DataFrame(
         {
             "all_nan": [np.nan, np.nan, np.nan],
-            "some_nan": [np.nan, 1, 0],
+            "some_nan": [0.0, 1.0, 0.0],
Contributor:

Why did this need to change? Also, we should rename this column since there are no longer nans

Contributor Author (@freddyaboulton):

Sorry, I thought I had answered this yesterday.

The only way to pass an all-NaN column to the imputer is if it's not typed as Unknown, which is what the default woodwork inference returns, so we need to init woodwork on the dataframe ourselves. However, because of #2055, the imputer will modify the input data.

So the some_nan column will have its NaNs imputed. We can't rename the column because the intention is to compare the input data before and after it runs through the imputer, and the column name doesn't change in the process.
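A small sketch of that setup, assuming woodwork's .ww DataFrame accessor; the column names mirror the test, but the snippet itself is only illustrative:

import numpy as np
import pandas as pd
import woodwork  # noqa: F401 -- importing woodwork registers the .ww accessor

X = pd.DataFrame({
    "all_nan": [np.nan, np.nan, np.nan],
    "some_nan": [np.nan, 1, 0],
})
# Woodwork's default inference would type "all_nan" as Unknown, so we set
# Double explicitly to make sure the imputer actually receives the column.
X.ww.init(logical_types={"all_nan": "Double", "some_nan": "Double"})
# Because of alteryx/evalml#2055 the imputer mutates X in place, which is why
# the expected "some_nan" values in the assertion are the already-imputed
# [0.0, 1.0, 0.0] rather than the original [np.nan, 1, 0].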

@@ -243,7 +245,7 @@ def test_transform_drop_all_nan_columns():
     pd.DataFrame(
         {
             "all_nan": [np.nan, np.nan, np.nan],
-            "some_nan": [np.nan, 1, 0],
+            "some_nan": [0.0, 1.0, 0.0],
Contributor:

same here

Contributor Author (@freddyaboulton):

Same comment as above ^

@@ -344,7 +350,8 @@ def test_per_column_imputer_impute_all_is_false():
             "all_nan_not_included": [np.nan, np.nan, np.nan],
             "all_nan_included": [np.nan, np.nan, np.nan],
             "column_with_nan_not_included": [np.nan, 1, 0],
-            "column_with_nan_included": [0, 1, np.nan],
+            # Because of https://github.com/alteryx/evalml/issues/2055
+            "column_with_nan_included": [0.0, 1.0, 0.0],
Contributor:

rename

@freddyaboulton force-pushed the remove-all-nan-to-double-conversion branch from 31b9458 to 63e36bc on January 11, 2022 16:51
@angela97lin (Contributor) left a comment:

Looks good, thank you @freddyaboulton!

@@ -131,7 +130,7 @@ def test_make_pipeline(
         if estimator_class.model_family == ModelFamily.LINEAR_MODEL
         else []
     )
-    drop_null = [DropNullColumns] if "all_null" in column_names else []
+    drop_null = [DropColumns] if "all_null" in column_names else []
Contributor:

This change is because all_null will be considered an Unknown column now, rather than Double, right?

Contributor Author (@freddyaboulton):

Yes!
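To make the component swap concrete, here is a hedged sketch of the two transformers involved (the threshold and column name below are illustrative, not taken from the PR): DropNullColumns removes columns by their fraction of null values, which is how the previously Double-typed all-null column was dropped, while DropColumns removes columns by name, which is what make_pipeline now reaches for once the column is typed Unknown.

from evalml.pipelines.components import DropColumns, DropNullColumns

# Old behavior: the all-NaN column was converted to Double and then removed
# based on its fraction of nulls.
drop_by_null_fraction = DropNullColumns(pct_null_threshold=1.0)

# New behavior: the all-NaN column keeps the Unknown type, so the pipeline
# drops it by name instead ("all_null" mirrors the test fixture).
drop_by_name = DropColumns(columns=["all_null"])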

@freddyaboulton (Contributor Author) commented Jan 12, 2022

@bchen1116 I think model understanding is actually faster on this feature branch! See datasets like nyc-taxi and regress.csv. The particular dataset you've pointed out is slower, but the difference is 0.1 seconds, which I don't think is meaningful.

@freddyaboulton force-pushed the remove-all-nan-to-double-conversion branch from 63e36bc to c530d4a on January 12, 2022 17:16
@freddyaboulton enabled auto-merge (squash) January 12, 2022 17:26
@freddyaboulton merged commit 119c8c7 into main on Jan 12, 2022
@freddyaboulton deleted the remove-all-nan-to-double-conversion branch January 12, 2022 18:19
@chukarsten mentioned this pull request Jan 18, 2022
Development

Successfully merging this pull request may close these issues.

Can't predict on single-row dataframes with Unknown features
3 participants