Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generate transform features of direct features #623

Merged
merged 27 commits into from Jul 25, 2019

Conversation

@CJStadler
Copy link
Contributor

commented Jun 24, 2019

But we do not create transforms of single direct features, or when all inputs are direct features with the same relationship path. This would be redundant as WEEKDAY(customers.signup_date) is equivalent to customers.WEEKDAY(signup_date), which should already have been calculated.

This is based on #123

Resolves #119
Resolves #81

args_set = frozenset({arg1.unique_name(), arg2.unique_name()})
unordered_args.add(args_set)

assert len(add_feats) == len(unordered_args)

This comment has been minimized.

Copy link
@CJStadler

CJStadler Jun 24, 2019

Author Contributor

Checking the count doesn't communicate why it should have that value. I changed this to instead test that no features are created which have the same two inputs.

@codecov

This comment has been minimized.

Copy link

commented Jun 24, 2019

Codecov Report

Merging #623 into master will increase coverage by 0.02%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #623      +/-   ##
==========================================
+ Coverage   97.47%   97.49%   +0.02%     
==========================================
  Files         118      118              
  Lines        9671     9713      +42     
==========================================
+ Hits         9427     9470      +43     
+ Misses        244      243       -1
Impacted Files Coverage Δ
featuretools/synthesis/deep_feature_synthesis.py 96.25% <100%> (+0.18%) ⬆️
...ols/tests/synthesis/test_deep_feature_synthesis.py 100% <100%> (ø) ⬆️
featuretools/entityset/relationship.py 98.71% <100%> (+0.03%) ⬆️
...s/computational_backends/feature_set_calculator.py 98.42% <0%> (+0.31%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update eabad43...7dba0a8. Read the comment docs.

CJStadler and others added 6 commits Jun 25, 2019
Implement "!=" for RelationshipPath
This is necessary for Python 2 because it does not default to the
negation of "==".
@kmax12
Copy link
Member

left a comment

Will do a bit more testing, but implementation looks good.

Left one question

self._build_transform_features(
all_features, entity, max_depth=max_depth)
# add seed features, if any, for dfs to build on top of
for f in self.seed_features:

This comment has been minimized.

Copy link
@kmax12

kmax12 Jul 8, 2019

Member

any reason this can't stay inside of _add_identity_features?

This comment has been minimized.

Copy link
@CJStadler

CJStadler Jul 8, 2019

Author Contributor

I think I extracted this because if seed features are added before building aggregations then this test fails:

def test_seed_features_added_with_identity_features(es):
count_sessions = ft.Feature(es['sessions']["id"], parent_entity=es['customers'], primitive=Count)
dfs_obj = DeepFeatureSynthesis(target_entity_id='customers',
entityset=es,
agg_primitives=[Last],
trans_primitives=[],
max_depth=2,
seed_features=[count_sessions])
features = dfs_obj.build_features()
# this feature is meaningless because customers.COUNT(sessions) is already defined on
# the customers entity
assert not feature_with_name(features, 'LAST(sessions.customers.COUNT(sessions))')

But I'm realizing now that you could write a similar test without seed features and it would fail with the current implementation because I moved _add_identity_features before building aggregations:

# We don't want this feature because it will always be equal to 'age'
assert not feature_with_name(features, 'LAST(sessions.customers.age)')

I need to think some more about how to solve this.

This comment has been minimized.

Copy link
@CJStadler

CJStadler Jul 9, 2019

Author Contributor

I resolved this by adding identity and seed features first, and then making sure to not build aggregations of direct features of the target. For example, if we are building features for customers we do not use direct features such as customers.age as the base for aggregations.

CJStadler added 7 commits Jul 9, 2019
Build transforms in two steps
First on identity and aggregations, then on direct features.

This allows features to be constructed which are direct features of
aggregations of transforms on the target.
E.g. "log: sessions.MEAN(log.ABSOLUTE(value))"

Previously this was not being built because the transform was not built
when the aggregation was being built.
@CJStadler

This comment has been minimized.

Copy link
Contributor Author

commented Jul 23, 2019

@kmax12 the entityset tests revealed that this was no longer building direct features of aggregations of transform. For example, log: sessions.MEAN(log.ABSOLUTE(value)) would not have been built because the ABSOLUTE transform feature was not built until after we recursed into sessions.

To solve this I made two calls to build_transform_features: one in the original location to build on identity and aggregation features, and one at the end to build on direct features. To avoid making duplicate features in the second call only sets of inputs which contain at least one direct feature will be used.

@kmax12
Copy link
Member

left a comment

looks good. just some comments on how to structure.

@CJStadler CJStadler requested a review from kmax12 Jul 24, 2019

CJStadler added 2 commits Jul 24, 2019
@kmax12
kmax12 approved these changes Jul 24, 2019
Copy link
Member

left a comment

LGTM

@CJStadler CJStadler merged commit c583ff1 into master Jul 25, 2019

4 checks passed

codecov/patch 100% of diff hit (target 97.47%)
Details
codecov/project 97.49% (+0.02%) compared to eabad43
Details
license/cla Contributor License Agreement is signed.
Details
test_all_python_versions Workflow: test_all_python_versions
Details

@CJStadler CJStadler deleted the dfs-trans-after-direct branch Jul 25, 2019

@rwedge rwedge referenced this pull request Aug 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.