Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update cfm to use nullable types when needed #1810

Merged
merged 10 commits into from Jan 7, 2022
Merged

Conversation

thehomebrewnerd
Copy link
Contributor

Update cfm to use nullable types when needed

Closes #1692

Updates the CFM process to use Woodwork nullable logical types when needed, instead of the non nullable types. This is important for situations where the cutoff time settings cause null values to be introduced into the feature matrix for columns that previously did not have null values present.

@thehomebrewnerd thehomebrewnerd marked this pull request as draft December 16, 2021 22:06
@thehomebrewnerd thehomebrewnerd changed the title Update cfm to use nullable types when needed [WIP] Update cfm to use nullable types when needed Dec 16, 2021
@thehomebrewnerd thehomebrewnerd changed the title [WIP] Update cfm to use nullable types when needed Update cfm to use nullable types when needed Jan 5, 2022
@@ -55,7 +55,7 @@ def test_single_table_dask_entityset():

# Use the same columns and make sure both indexes are sorted the same
dask_computed_fm = dask_fm.compute().set_index('id').loc[fm.index][fm.columns]
pd.testing.assert_frame_equal(fm, dask_computed_fm)
pd.testing.assert_frame_equal(fm, dask_computed_fm, check_dtype=False)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated these test checks to ignore the dtypes, because this update means we can no longer expect the dask and koalas feature matrices to have the same dtypes for all the columns. With pandas, the logical types would only change when they need to, but with Dask and Koalas they would always end up using the nullable types in the feature matrix.

@thehomebrewnerd thehomebrewnerd marked this pull request as ready for review January 5, 2022 22:20
@codecov
Copy link

codecov bot commented Jan 5, 2022

Codecov Report

Merging #1810 (200a0bd) into main (8383c16) will increase coverage by 0.42%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1810      +/-   ##
==========================================
+ Coverage   98.36%   98.79%   +0.42%     
==========================================
  Files         142      142              
  Lines       15820    15850      +30     
==========================================
+ Hits        15562    15659      +97     
+ Misses        258      191      -67     
Impacted Files Coverage Δ
...computational_backends/calculate_feature_matrix.py 100.00% <100.00%> (+2.88%) ⬆️
...utational_backend/test_calculate_feature_matrix.py 100.00% <100.00%> (+3.86%) ⬆️
featuretools/tests/synthesis/test_dask_dfs.py 100.00% <100.00%> (ø)
featuretools/tests/synthesis/test_koalas_dfs.py 100.00% <100.00%> (ø)
featuretools/tests/synthesis/test_dfs_method.py 99.53% <0.00%> (+10.23%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8383c16...200a0bd. Read the comment docs.

@rwedge
Copy link
Contributor

rwedge commented Jan 5, 2022

Have we done a performance test run on the branch?

@thehomebrewnerd
Copy link
Contributor Author

@rwedge Just now launched a run to compare commit d01e7c7 against the latest release.

@thehomebrewnerd
Copy link
Contributor Author

@rwedge Just now launched a run to compare commit d01e7c7 against the latest release.

Pandas performance didn't change much with this update.
d01e7c7c3_v1.3.0__2022-01-06T14.16.37.733426.html.zip

tamargrey
tamargrey previously approved these changes Jan 7, 2022
Copy link
Contributor

@tamargrey tamargrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Also I confirmed these changes allow us to use a Lag primitive where nans are introduced on columns that have non nullable logical types!

@@ -1,7 +1,7 @@
name: Auto Approve Dependency PRs
on:
schedule:
- cron: '*/30 * * * *'
- cron: '*/5 * * * *'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I asked Nate for this change (rather than putting up a 1 line PR). I want this workflow to run more often.

@thehomebrewnerd
Copy link
Contributor Author

LGTM! Also I confirmed these changes allow us to use a Lag primitive where nans are introduced on columns that have non nullable logical types!

@tamargrey Nice - thanks for checking that out!

@thehomebrewnerd thehomebrewnerd merged commit 808783a into main Jan 7, 2022
@thehomebrewnerd thehomebrewnerd deleted the cfm-type-conv-error branch January 7, 2022 16:16
@tamargrey tamargrey mentioned this pull request Feb 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants