Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove transform stacking from DFS #1119

Merged
merged 16 commits into from Aug 28, 2020
Merged

Remove transform stacking from DFS #1119

merged 16 commits into from Aug 28, 2020

Conversation

tamargrey
Copy link
Contributor

@tamargrey tamargrey commented Aug 21, 2020

Remove transform stacking from DFS

Dfs has been returning a different set of features depending on the order of the trans_primitives input list. Our goal with this PR is to make it such that both the set of features we get from dfs and their order are the same independent of the order of the input list.

ToDo:

  • Change dfs so it doesn't create features that contain transform primitives stacked directly on one another
  • When building the DeepFeatureSynthesis object, sort all the primitive lists according to their names (this touches all types of primitives and not just transform primitives)

Testing:

  • Fix the tests that broke from these changes
    • test_groupby_multi_output_stacking
    • test_make_transform_multiple_output_features
    • test_stacks_multioutput_features
    • test_transform_consistency
    • test_deserialize_features_s3 - Needs to have s3 file that's referenced updated
  • Add tests to check the set of features and their orders
    • test_primitive_ordering
    • test_no_transform_stacking

@tamargrey tamargrey changed the title [WIP] Remove transform stacking from DFS Remove transform stacking from DFS Aug 24, 2020
@tamargrey tamargrey requested a review from rwedge August 24, 2020 19:56
@tamargrey tamargrey requested a review from rwedge August 25, 2020 16:29
**Breaking Changes**

* ``ft.dfs`` will no longer build features from transform primitives where one of the
inputs was also built with a transform primitve. This will make some
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would "where on of the inputs was a Transform feature, a GroupByTransform feature, or a Direct Feature of a Transform / GroupByTransform feature" be an accurate description?

Copy link
Contributor Author

@tamargrey tamargrey Aug 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that would be the most descriptive way to say this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ended up adding this in

docs/source/changelog.rst Outdated Show resolved Hide resolved
@rwedge
Copy link
Contributor

rwedge commented Aug 25, 2020

The code changes look in a good spot, let's get it up to date with the main branch and see what's left

@codecov
Copy link

codecov bot commented Aug 25, 2020

Codecov Report

Merging #1119 into main will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #1119   +/-   ##
=======================================
  Coverage   98.34%   98.35%           
=======================================
  Files         126      126           
  Lines       13268    13309   +41     
=======================================
+ Hits        13049    13090   +41     
  Misses        219      219           
Impacted Files Coverage Δ
featuretools/primitives/base/primitive_base.py 100.00% <100.00%> (ø)
featuretools/synthesis/deep_feature_synthesis.py 97.43% <100.00%> (+0.08%) ⬆️
...ests/primitive_tests/test_feature_serialization.py 99.34% <100.00%> (ø)
...imitive_tests/test_groupby_transform_primitives.py 100.00% <100.00%> (ø)
...s/tests/primitive_tests/test_transform_features.py 98.81% <100.00%> (ø)
...ols/tests/synthesis/test_deep_feature_synthesis.py 99.26% <100.00%> (+0.02%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 38ac1f6...f74a562. Read the comment docs.

docs/source/changelog.rst Outdated Show resolved Hide resolved
@@ -242,7 +246,7 @@ def __init__(self,
self.ignore_entities,
self.ignore_variables,
self.es)
self.seed_features = seed_features or []
self.seed_features = sorted(seed_features or [], key=lambda f: f.primitive.name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not sort by unique name, for features?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point--changed

featuretools/synthesis/deep_feature_synthesis.py Outdated Show resolved Hide resolved
featuretools/synthesis/deep_feature_synthesis.py Outdated Show resolved Hide resolved
Copy link
Contributor

@rwedge rwedge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@amin-nejad
Copy link

Hi @tamargrey, is there any way to manually replicate the transform stacking behaviour that was removed in this PR? For instance if I just have one table and I want to have AddNumeric and DivideNumeric primitives with a depth of 2, what can be done now? Currently the resultant feature matrix will just come back with a depth of 1 (i.e. no stacking). If the the two primitives are applied sequentially with separate dfs calls, we instead get a maximum depth of 3 which is not ideal either. Any help would be appreciated - thanks

@tamargrey
Copy link
Contributor Author

tamargrey commented Feb 16, 2021

Hi @amin-nejad, your best bet is going to be to create those depth 2 features as seed features to dfs.

You could make the features directly and pass them into dfs if you wanted to be able to build off of them (with the example below, the Mean aggregate primitive will be applied to the seed feature):

import featuretools as ft
from featuretools.primitives import (
    AddNumeric,
    DivideNumeric,
)
from featuretools.variable_types import Numeric

es = ft.demo.load_retail()

# Create the features directly
depth_one_feature = ft.Feature([es['order_products']['quantity'], es['order_products']['unit_price']], 
                               primitive=AddNumeric)
depth_two_feature = ft.Feature([depth_one_feature, es['order_products']['total']], 
                               primitive=DivideNumeric)

feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_entity="order_products",
                                      agg_primitives=['mean'],
                                      trans_primitives=[],
                                      seed_features=[depth_two_feature]) # Just this one feature has been added
feature_defs

Another option could be to take the results of dfs with just AddNumeric, for example, and then and get the depth-2 Features manually:

feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_entity="order_products",
                                      agg_primitives=['mean'],
                                      trans_primitives=[AddNumeric]) # Just this one feature has been added

# Can also get stacking on the results of dfs by creating the Features directly
[ft.Feature([feat, es['order_products']['unit_price']], primitive=DivideNumeric) for feat in feature_defs if feat.variable_type == Numeric]

Let me know if that helps!

@amin-nejad
Copy link

Thanks very much @tamargrey and sorry for the late reply. I followed the latter approach which works well for me. MWE for anyone interested. I would suggest adding a small note to the docs about this as I found it a little confusing that transform primitives do not stack and are not subject to the max_depth argument of dfs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants