Accept serialized features into DFSTransformer #3106
Codecov Report
@@           Coverage Diff           @@
##            main   #3106     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%
=======================================
  Files        318     318
  Lines      30692   30816    +124
=======================================
+ Hits       30588   30712    +124
  Misses       104     104
Returns:
    pd.DataFrame: Feature matrix
"""
if self._passed_in_features and self._should_skip_transform(X):
If we detect that the input already contains every column generated by the serialized features, we skip the transform and return the input unchanged.
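The skip check described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `should_skip_transform` is a hypothetical standalone helper, features are represented only by their output column names, and the input is represented by its list of column names rather than a DataFrame.

```python
from typing import Iterable, Sequence


def should_skip_transform(
    feature_names: Sequence[str], input_columns: Iterable[str]
) -> bool:
    """Return True when every serialized feature name is already an input column.

    Hypothetical helper for illustration; the real check lives inside the component.
    """
    present = set(input_columns)
    # With no features there is nothing to compare against, so do not skip.
    return bool(feature_names) and all(name in present for name in feature_names)


# Hypothetical feature names, chosen only for this example.
feature_names = ["amount", "YEAR(datetime)"]

# All feature columns are present: the transform can be skipped.
print(should_skip_transform(feature_names, ["amount", "datetime", "YEAR(datetime)"]))
# "YEAR(datetime)" is missing: the feature matrix still has to be computed.
print(should_skip_transform(feature_names, ["amount", "datetime"]))
```

Comparing by column name keeps the check cheap, at the cost of trusting that a column with a matching name really holds the computed feature.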
@patch("evalml.pipelines.components.transformers.preprocessing.featuretools.dfs")
def test_dfs_with_serialized_features(mock_dfs, X_y_binary):
This test checks that DFS is not called when serialized features are provided, and that the feature matrix is computed correctly from those serialized features.
@patch(
    "evalml.pipelines.components.transformers.preprocessing.featuretools.calculate_feature_matrix"
)
def test_dfs_skip_transform(mock_calculate_feature_matrix, mock_dfs, X_y_binary):
Given serialized features and an input that already contains the transformed columns, this test checks that neither DFS nor calculate_feature_matrix is called.
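The testing pattern described in these two comments (assert that an expensive computation is not invoked) can be illustrated with a self-contained sketch. Everything here is hypothetical: `ToyTransformer` is a stand-in rather than `DFSTransformer`, rows are plain dicts rather than a DataFrame, and the computation is injected as a `MagicMock` instead of being replaced with `@patch` as in the PR's tests.

```python
from unittest.mock import MagicMock


class ToyTransformer:
    """Hypothetical stand-in for a transformer that can reuse precomputed features."""

    def __init__(self, feature_names, compute):
        self.feature_names = list(feature_names)
        self._compute = compute  # injected so a test can substitute a mock

    def transform(self, rows):
        # rows is a list of dicts mapping column name -> value.
        columns = set(rows[0]) if rows else set()
        if self.feature_names and set(self.feature_names) <= columns:
            return rows  # every feature column is present, so skip computation
        return self._compute(rows)


def test_skip_transform():
    compute = MagicMock()
    transformer = ToyTransformer(["a", "DOUBLE(a)"], compute)
    rows = [{"a": 1, "DOUBLE(a)": 2}]
    # The input is returned untouched and the computation is never invoked.
    assert transformer.transform(rows) is rows
    compute.assert_not_called()


def test_compute_when_feature_missing():
    expected = [{"a": 1, "DOUBLE(a)": 2}]
    compute = MagicMock(return_value=expected)
    transformer = ToyTransformer(["a", "DOUBLE(a)"], compute)
    # "DOUBLE(a)" is absent from the input, so the computation must run once.
    assert transformer.transform([{"a": 1}]) == expected
    compute.assert_called_once()


test_skip_transform()
test_compute_when_feature_missing()
```

Injecting the mock keeps the sketch free of import paths; patching a module-level name, as the PR's tests do, achieves the same thing when the function under test imports its dependency directly.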
bchen1116 left a comment
Looking good! I think it would be nice to add a test for features where the associated feature names aren't in the dataframe. I also left a question for my own understanding.
evalml/pipelines/components/transformers/preprocessing/featuretools.py
Outdated
@freddyaboulton @chukarsten @dsherry re our discussion: Something I didn’t realize was that
I’ll attach the notebook here as well. skip_computation_experiments.ipynb.zip Some other use cases that @thehomebrewnerd brought up to me:
I will definitely address the first point in unit testing, and I will file an issue to track the multi-table changes for when we want to support multi-table as well. Let me know what you guys think!
…nto js_2917_accept_features
…tity and identity features exist in input
es = es.add_dataframe(
    dataframe_name="X", dataframe=X_pd, index="index", make_index=True
)
feature_matrix, features = ft.dfs(
I'm not seeing the same behavior with the year primitive on the fraud dataset:
from evalml.demos import load_fraud
import featuretools as ft
from evalml.pipelines.components import DFSTransformer
from featuretools.feature_base import IdentityFeature
import pandas as pd
import pytest
X, y = load_fraud(1000)
del X.ww
X_fit = X.iloc[:X.shape[0]//2]
X_transform = X.iloc[X.shape[0]//2:]
es = ft.EntitySet()
es = es.add_dataframe(
    dataframe_name="X", dataframe=X_fit, index="index", make_index=True
)
feature_matrix, features = ft.dfs(
    entityset=es, target_dataframe_name="X", trans_primitives=["year"]
)
features = list(filter(lambda f: not isinstance(f, IdentityFeature), features))
dfs = DFSTransformer(features=features)
dfs.fit(X_fit)
X_t = dfs.transform(X_transform)
with pytest.raises(AssertionError):
    pd.testing.assert_frame_equal(X_t, feature_matrix)
print(X_t.columns)

I would expect the same behavior in this test for this repro: X_t has all the original features plus YEAR (and is therefore equal to feature_matrix).
This was my mistake heh. We need to pass the identity features in this case for X_t to match feature_matrix.
es = self._make_entity_set(X_ww)

feature_matrix = calculate_feature_matrix(features=self.features, entityset=es)
features_to_use = (
Not sure about this, but would it make sense to do the filtering during fit instead of here, so that after calling fit, self.features would always contain the features that were actually used by the transformer? You could change self._passed_in_features to retain the original list of features that were provided if you need to hold on to those, and if the list is present or contains values you know the user passed in features, otherwise you know they were generated from DFS.
Good points! Will edit 😄
@thehomebrewnerd I thought about this a little more and decided to keep the filtering on the transform end. The main reasoning is that ultimately we need to look at the columns that are passed into transform to decide which features to use. I could add logic verifying that the columns of the dataframes passed to fit and transform are the same, but I'd rather keep it simple: keep fit a no-op and do the filtering in transform. Let me know if that makes sense or if you have other suggestions!
Ah, I guess I wasn't thinking about the set of columns used for fit being different from the set of columns passed to transform. I don't have any strong feelings one way or the other here, so I'm fine if you keep the filtering in transform.
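To make the transform-time filtering discussed in this thread concrete, here is one possible sketch. It rests on assumptions that go beyond what the thread states: `FeatureSpec` is a hypothetical simplified record (not a featuretools class), each feature is assumed to know the names of its input columns, and the rule shown (drop features that are already present or that are missing an input) is only one plausible policy, not necessarily the one the PR implements.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class FeatureSpec:
    """Hypothetical, simplified feature description (not a featuretools class)."""

    name: str
    inputs: Tuple[str, ...]


def select_features(
    features: Iterable[FeatureSpec], input_columns: Iterable[str]
) -> List[FeatureSpec]:
    """Pick the features worth computing for the columns seen at transform time."""
    present = set(input_columns)
    selected = []
    for feature in features:
        if feature.name in present:
            continue  # already computed upstream, nothing to do
        if not set(feature.inputs) <= present:
            continue  # at least one input column is missing, cannot compute
        selected.append(feature)
    return selected


# Hypothetical features: one single-input and one two-input feature.
features = [
    FeatureSpec("YEAR(datetime)", ("datetime",)),
    FeatureSpec("amount / lat", ("amount", "lat")),
]

# Both features can be computed.
print([f.name for f in select_features(features, ["datetime", "amount", "lat"])])
# "lat" is missing, so only the single-input feature survives.
print([f.name for f in select_features(features, ["datetime", "amount"])])
# "YEAR(datetime)" is already present, so only the two-input feature is computed.
print([f.name for f in select_features(features, ["datetime", "amount", "lat", "YEAR(datetime)"])])
```

Because the decision depends only on the columns passed in, it can be made per call to transform without requiring fit and transform to see the same set of columns.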
thehomebrewnerd left a comment
This looks pretty good to me, but you might want to consider adding a test case for a multi-input transform primitive (such as divide_numeric), both to confirm that the feature is calculated when both inputs are present and to confirm that no errors occur when one of the two input features is missing.
freddyaboulton left a comment
Looks good to me @jeremyliweishih! Thank you for making the changes.
Fixes #2917. This PR allows DFSTransformer to accept features and to skip computation when needed. This should work in tandem with #2919.