Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not calculate dependencies of precalculated features #667

Merged
merged 11 commits into from Jul 18, 2019

Conversation

@CJStadler
Copy link
Contributor

commented Jul 12, 2019

Currently if there is a precalculated feature "A(B)" (and no other feature depends on B) we will still re-calculate B even though the resulting feature vector will never be used.

This commit solves this by making FeatureSet aware of which features will be precalculated, so that it can exclude them and their dependencies from the feature trie.

Script showing this:

import pandas as pd
import featuretools as ft

from datetime import datetime
from featuretools.tests.testing_utils import make_ecommerce_entityset

es = make_ecommerce_entityset()


class Hello(ft.primitives.AggregationPrimitive):
    name = "hello"
    input_types = [ft.variable_types.Numeric]
    return_type = ft.variable_types.Numeric

    def get_function(self):
        def hello(s):
            print('Hello')
            return 0
        return hello

class World(ft.primitives.AggregationPrimitive):
    name = "world"
    input_types = [ft.variable_types.Numeric]
    return_type = ft.variable_types.Numeric

    def get_function(self):
        def hello(s):
            print('World')
            return 0
        return hello

agg_feat = ft.Feature(es['log']['value'], parent_entity=es['sessions'], primitive=Hello)
agg_feat2 = ft.Feature(agg_feat, parent_entity=es['customers'], primitive=World)
dfeat = ft.DirectFeature(agg_feat2, es['sessions'])
cutoff_time = pd.DataFrame({
    'time': [datetime(2013, 4, 9, 10, 31, 19), datetime(2013, 4, 9, 11, 0, 0)],
    'instance_id': [0, 2]
})
feature_matrix = ft.calculate_feature_matrix([dfeat],
                                             es,
                                             cutoff_time_in_index=False,
                                             approximate=ft.Timedelta(12, 's'),
                                             cutoff_time=cutoff_time,
                                             chunk_size='cutoff time')

On master this outputs

Hello
Hello
Hello
World
Hello
Hello
Hello
Hello
Hello
Hello
World
Hello
Hello
Hello

The 3 Hellos after each World are generating feature vectors which are never used (this is only obvious for the last 3 though).

On this branch we instead get

Hello
Hello
Hello
World
Hello
Hello
Hello
World

@CJStadler CJStadler force-pushed the ignore-approx-dependencies branch from 2cc1943 to 030ac82 Jul 12, 2019

@codecov

This comment has been minimized.

Copy link

commented Jul 12, 2019

Codecov Report

Merging #667 into master will increase coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #667      +/-   ##
==========================================
+ Coverage   97.42%   97.44%   +0.01%     
==========================================
  Files         118      118              
  Lines        9567     9618      +51     
==========================================
+ Hits         9321     9372      +51     
  Misses        246      246
Impacted Files Coverage Δ
...s/computational_backends/feature_set_calculator.py 98.1% <ø> (-0.03%) ⬇️
featuretools/entityset/relationship.py 98.68% <ø> (ø) ⬆️
featuretools/computational_backends/utils.py 95.78% <100%> (-0.06%) ⬇️
featuretools/computational_backends/feature_set.py 100% <100%> (ø) ⬆️
...computational_backends/calculate_feature_matrix.py 98.31% <100%> (ø) ⬆️
...ls/tests/computational_backend/test_feature_set.py 100% <100%> (ø) ⬆️
...mputational_backend/test_feature_set_calculator.py 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e0e9802...4edf465. Read the comment docs.

@CJStadler
Copy link
Contributor Author

left a comment

This is missing tests for FeatureSet.

[resolved]

CJStadler added 4 commits Jul 12, 2019
Do not calculate dependencies of precalculated features
Currently if there is a precalculated feature "A(B)" (and no other
feature depends on B) we will still re-calculate B even though the
resulting feature vector will never be used.

This commit solves this by making FeatureSet aware of which features
will be precalculated, so that it can exclude them and their
dependencies from the feature trie.

@CJStadler CJStadler force-pushed the ignore-approx-dependencies branch from 030ac82 to 00b02fb Jul 17, 2019

CJStadler added 3 commits Jul 17, 2019
Test that error is raised if not ignored
It was implicit that if these features were calculated (not ignored)
then an error would be raised. Now this is explicitly tested.
@kmax12
Copy link
Member

left a comment

few comments

@@ -15,11 +15,16 @@


class FeatureSet(object):
def __init__(self, features):
def __init__(self, features, ignored_feature_trie=None):

This comment has been minimized.

Copy link
@kmax12

kmax12 Jul 18, 2019

Member

add doc about ignored_feature_trie

self._feature_trie = None

@property
def feature_trie(self):

This comment has been minimized.

Copy link
@kmax12

kmax12 Jul 18, 2019

Member

add note that this is optimization

if self._relationships_with_direction:
path = '%s.%s' % (next(self.entities()), self.name)
else:
path = 'empty'

This comment has been minimized.

Copy link
@kmax12

kmax12 Jul 18, 2019

Member

to discuss: is this the best string to use?

This comment has been minimized.

Copy link
@CJStadler

CJStadler Jul 18, 2019

Author Contributor

Options:

  1. <RelationshipPath empty> (current)
  2. <RelationshipPath>
  3. <RelationshipPath None>
  4. <RelationshipPath []>

I'm leaning towards number 4 because RelationshipPath generally acts like a list so [] makes it clear that this path is like an empty list. Since the normal format is <RelationshipPath entity1.entity2>, <RelationshipPath empty> could potentially look like empty is the name of an entity.

This comment has been minimized.

Copy link
@kmax12

kmax12 Jul 18, 2019

Member

i like [] too

feature_list = approximate_feature_trie.get_node(path).value
feature_list.append(base_feature)
node_feature_set = approximate_feature_trie.get_node(path).value
node_feature_set.add(base_feature.unique_name())
approximate_feature_set.add(base_feature.unique_name())

return approximate_feature_trie, approximate_feature_set

This comment has been minimized.

Copy link
@kmax12

kmax12 Jul 18, 2019

Member

refactor to not return approximate_feature_set

@@ -375,6 +379,7 @@ def calc_results(time_last, ids, precalculated_features=None, training_window=No


def approximate_features(feature_set, cutoff_time, window, entityset,
approximate_feature_trie,

This comment has been minimized.

Copy link
@kmax12

kmax12 Jul 18, 2019

Member

get this value from feature_set. likely renamed ignored_feature_trie in feature_set as well

@kmax12
kmax12 approved these changes Jul 18, 2019
Copy link
Member

left a comment

LGTM

@CJStadler CJStadler merged commit d191c64 into master Jul 18, 2019

4 checks passed

codecov/patch 100% of diff hit (target 97.42%)
Details
codecov/project 97.44% (+0.01%) compared to e0e9802
Details
license/cla Contributor License Agreement is signed.
Details
test_all_python_versions Workflow: test_all_python_versions
Details

@CJStadler CJStadler deleted the ignore-approx-dependencies branch Jul 18, 2019

@rwedge rwedge referenced this pull request Aug 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.