
Accept serialized features into DFSTransformer #3106

Merged
merged 30 commits into main on Dec 14, 2021

Conversation

@jeremyliweishih jeremyliweishih commented Nov 30, 2021

Fixes #2917. This PR allows DFSTransformer to accept features and skips computation when needed. This should work in tandem with #2919.


codecov bot commented Nov 30, 2021

Codecov Report

Merging #3106 (7c3d0ce) into main (d6d1562) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@           Coverage Diff           @@
##            main   #3106     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        318     318             
  Lines      30692   30816    +124     
=======================================
+ Hits       30588   30712    +124     
  Misses       104     104             
Impacted Files                                          Coverage Δ
...ponents/transformers/preprocessing/featuretools.py   100.0% <100.0%> (ø)
evalml/tests/component_tests/test_featuretools.py       100.0% <100.0%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d6d1562...7c3d0ce. Read the comment docs.

@@ -70,6 +82,9 @@ def transform(self, X, y=None):
Returns:
pd.DataFrame: Feature matrix
"""
if self._passed_in_features and self._should_skip_transform(X):
Collaborator Author:

If we detect that the input contains all of the features generated by the serialized features, we skip transforming and return the input.
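A minimal sketch of that check, with a hypothetical helper that compares the features' generated column names against the input columns (the actual `_should_skip_transform` internals may differ):

```python
def should_skip_transform(input_columns, feature_output_names):
    """Return True when every column the serialized features would
    produce is already present in the input, so the transformer can
    skip recomputation and return X unchanged."""
    return set(feature_output_names).issubset(input_columns)

# The input already holds the generated "YEAR(Date)" column, so skip.
assert should_skip_transform(["Date", "YEAR(Date)"], ["YEAR(Date)"])
# The generated column is absent, so the transform must run.
assert not should_skip_transform(["Date"], ["YEAR(Date)"])
```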



@patch("evalml.pipelines.components.transformers.preprocessing.featuretools.dfs")
def test_dfs_with_serialized_features(mock_dfs, X_y_binary):
Collaborator Author:

This test checks that DFS is not called when features are serialized, and that the feature matrix is computed correctly using the serialized features.
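The mocking pattern can be shown with a self-contained sketch using plain MagicMock stand-ins (not the real evalml or featuretools functions):

```python
from unittest.mock import MagicMock

# Stand-ins for the functions patched in the real test.
dfs = MagicMock(name="dfs")
calculate_feature_matrix = MagicMock(
    name="calculate_feature_matrix", return_value="matrix-from-features"
)

def transform(X, serialized_features=None):
    # Mirrors the described behavior: with serialized features, only the
    # matrix is recalculated; DFS feature discovery never runs.
    if serialized_features is not None:
        return calculate_feature_matrix(features=serialized_features)
    matrix, _ = dfs(X)
    return matrix

result = transform("X", serialized_features=["YEAR(Date)"])
assert not dfs.called                     # DFS was skipped
assert result == "matrix-from-features"   # matrix came from the features
```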

@patch(
"evalml.pipelines.components.transformers.preprocessing.featuretools.calculate_feature_matrix"
)
def test_dfs_skip_transform(mock_calculate_feature_matrix, mock_dfs, X_y_binary):
Collaborator Author:

Given serialized features and an input that contains the transformed columns, this test checks that neither DFS nor calculate_feature_matrix is called.

@bchen1116 (Contributor) left a comment:

Looking good! I think it would be nice to add a test for features where the associated feature names aren't in the dataframe. I also left a question for my own understanding.

@jeremyliweishih (Collaborator Author):

@freddyaboulton @chukarsten @dsherry re our discussion:

Something I didn't realize was that calculate_feature_matrix errors out if the original columns used by the features don't exist. In the case of this notebook, I cannot recalculate a feature matrix unless "Date" exists (which makes sense, since it needs a "Date" column to calculate year etc.). In this case I can do the following:

  1. Check if the columns used by the features exist.
  2. If they don't, don't use these features in DFSTransformer.
  3. If they do, check if the feature names exist; if they do, skip these features as well.
  4. Lastly, if the feature names don't exist (and the original columns exist), use these features for transform.
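The steps above could be sketched roughly as follows; FeatureStub is a hypothetical stand-in for a featuretools feature, carrying the columns it reads and the columns it produces:

```python
from dataclasses import dataclass

@dataclass
class FeatureStub:
    input_column_names: list  # columns the feature needs to be computed
    output_names: list        # columns the feature produces

def select_features_to_use(input_columns, features):
    cols = set(input_columns)
    to_use = []
    for f in features:
        if not set(f.input_column_names) <= cols:
            continue  # step 2: source columns missing, drop the feature
        if set(f.output_names) <= cols:
            continue  # step 3: outputs already present, skip recomputation
        to_use.append(f)  # step 4: inputs present, outputs absent, compute
    return to_use

year = FeatureStub(["Date"], ["YEAR(Date)"])
orphan = FeatureStub(["Missing"], ["YEAR(Missing)"])
selected = select_features_to_use(["Date", "amount"], [year, orphan])
assert selected == [year]  # orphan is dropped, year is computed
```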

I’ll attach the notebook here as well.

skip_computation_experiments.ipynb.zip

Some other use cases that @thehomebrewnerd brought up to me:

  • Some primitives use multiple columns as inputs, so for features generated from those primitives you will need to confirm all the input columns are present.
  • Things will get tricky in a multi-table setup when aggregations are present because you would need to start checking for columns in multiple dataframes - but maybe that is out of scope here for now.

I will definitely address the first point in unit testing, and I'll file an issue to track the multi-table changes for when we want to support multi-table setups.

Let me know what you guys think!

es = es.add_dataframe(
dataframe_name="X", dataframe=X_pd, index="index", make_index=True
)
feature_matrix, features = ft.dfs(
Contributor:

I'm not seeing the same behavior with the year primitive on the fraud dataset:

from evalml.demos import load_fraud
import featuretools as ft
from evalml.pipelines.components import DFSTransformer
from featuretools.feature_base import IdentityFeature
import pandas as pd
import pytest


X, y = load_fraud(1000)
del X.ww

X_fit = X.iloc[:X.shape[0]//2]
X_transform = X.iloc[X.shape[0]//2:]


es = ft.EntitySet()
es = es.add_dataframe(
    dataframe_name="X", dataframe=X_fit, index="index", make_index=True
)
feature_matrix, features = ft.dfs(
    entityset=es, target_dataframe_name="X", trans_primitives=["year"]
)
features = list(filter(lambda f: not isinstance(f, IdentityFeature), features))

dfs = DFSTransformer(features=features)
dfs.fit(X_fit)
X_t = dfs.transform(X_transform)

with pytest.raises(AssertionError):
    pd.testing.assert_frame_equal(X_t, feature_matrix)
    
print(X_t.columns)

For this repro I would expect the same behavior as in the test: X_t should have all the original features plus YEAR (and therefore be equal to feature_matrix).

Contributor:

This was my mistake heh. We need to pass the identity features in this case for X_t to match feature_matrix.


feature_matrix = calculate_feature_matrix(features=self.features, entityset=es)
features_to_use = (
Contributor:

Not sure about this, but would it make sense to do the filtering during fit instead of here, so that after calling fit, self.features would always contain the features that were actually used by the transformer? You could change self._passed_in_features to retain the original list of features that were provided, if you need to hold on to those; if that list is present or contains values, you know the user passed in features, and otherwise you know they were generated from DFS.

Collaborator Author:

good points! Will edit 😄

Collaborator Author:

@thehomebrewnerd thought about this a little more and decided to keep the filtering on the transform end. My main reasoning is that we ultimately need to look at the columns passed into transform to decide which features to use. I could add logic verifying that the columns are the same between the dataframes used for fit and transform, but I'd rather keep it simple: keep fit a no-op and do the filtering in transform. Let me know if that makes sense or if you have other suggestions!

Contributor:

Ah, I guess I wasn't thinking about the set of columns used for fit being different from the set of columns passed to transform. I don't have any strong feelings one way or the other here, so I'm fine if you keep the filtering in transform.

@thehomebrewnerd (Contributor) left a comment:

This looks pretty good to me, but you might want to consider adding a test case for a multi-input transform primitive (such as divide_numeric), both to confirm that the feature is calculated when both inputs are present and to confirm that no errors occur when one of the two input columns is missing.
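Both cases could be exercised with a small self-contained sketch (safe_divide_feature is a hypothetical helper, not evalml or featuretools code): the feature is computed when both input columns are present, and dropped without raising when one is missing.

```python
import pandas as pd

def safe_divide_feature(df, numerator, denominator):
    # Multi-input primitive analogue of divide_numeric: only add the
    # feature when every input column is present in the dataframe.
    if {numerator, denominator}.issubset(df.columns):
        df = df.copy()
        df[f"{numerator} / {denominator}"] = df[numerator] / df[denominator]
    return df

X = pd.DataFrame({"a": [2.0, 4.0], "b": [1.0, 2.0]})
both_present = safe_divide_feature(X, "a", "b")   # "a / b" column added
one_missing = safe_divide_feature(X, "a", "c")    # no error, no new column
```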

@freddyaboulton (Contributor) left a comment:

Looks good to me @jeremyliweishih ! Thank you for making the changes.

@jeremyliweishih jeremyliweishih merged commit 2ea5644 into main Dec 14, 2021
@angela97lin angela97lin mentioned this pull request Dec 22, 2021
@freddyaboulton freddyaboulton deleted the js_2917_accept_features branch May 13, 2022 15:23
Successfully merging this pull request may close these issues.

Accept features for DFSTransformer and add logic to skip computation if features are already present