Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not calculate dependencies of precalculated features #667

Merged
merged 11 commits into from Jul 18, 2019

Conversation

CJStadler
Copy link
Contributor

@CJStadler CJStadler commented Jul 12, 2019

Currently if there is a precalculated feature "A(B)" (and no other feature depends on B) we will still re-calculate B even though the resulting feature vector will never be used.

This commit solves this by making FeatureSet aware of which features will be precalculated, so that it can exclude them and their dependencies from the feature trie.

Script showing this:

import pandas as pd
import featuretools as ft

from datetime import datetime
from featuretools.tests.testing_utils import make_ecommerce_entityset

es = make_ecommerce_entityset()


class Hello(ft.primitives.AggregationPrimitive):
    name = "hello"
    input_types = [ft.variable_types.Numeric]
    return_type = ft.variable_types.Numeric

    def get_function(self):
        def hello(s):
            print('Hello')
            return 0
        return hello

class World(ft.primitives.AggregationPrimitive):
    name = "world"
    input_types = [ft.variable_types.Numeric]
    return_type = ft.variable_types.Numeric

    def get_function(self):
        def hello(s):
            print('World')
            return 0
        return hello

agg_feat = ft.Feature(es['log']['value'], parent_entity=es['sessions'], primitive=Hello)
agg_feat2 = ft.Feature(agg_feat, parent_entity=es['customers'], primitive=World)
dfeat = ft.DirectFeature(agg_feat2, es['sessions'])
cutoff_time = pd.DataFrame({
    'time': [datetime(2013, 4, 9, 10, 31, 19), datetime(2013, 4, 9, 11, 0, 0)],
    'instance_id': [0, 2]
})
feature_matrix = ft.calculate_feature_matrix([dfeat],
                                             es,
                                             cutoff_time_in_index=False,
                                             approximate=ft.Timedelta(12, 's'),
                                             cutoff_time=cutoff_time,
                                             chunk_size='cutoff time')

On master this outputs

Hello
Hello
Hello
World
Hello
Hello
Hello
Hello
Hello
Hello
World
Hello
Hello
Hello

The 3 Hellos after each World are generating feature vectors which are never used (this is only obvious for the last 3 though).

On this branch we instead get

Hello
Hello
Hello
World
Hello
Hello
Hello
World

@codecov
Copy link

codecov bot commented Jul 12, 2019

Codecov Report

Merging #667 into master will increase coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #667      +/-   ##
==========================================
+ Coverage   97.42%   97.44%   +0.01%     
==========================================
  Files         118      118              
  Lines        9567     9618      +51     
==========================================
+ Hits         9321     9372      +51     
  Misses        246      246
Impacted Files Coverage Δ
...s/computational_backends/feature_set_calculator.py 98.1% <ø> (-0.03%) ⬇️
featuretools/entityset/relationship.py 98.68% <ø> (ø) ⬆️
featuretools/computational_backends/utils.py 95.78% <100%> (-0.06%) ⬇️
featuretools/computational_backends/feature_set.py 100% <100%> (ø) ⬆️
...computational_backends/calculate_feature_matrix.py 98.31% <100%> (ø) ⬆️
...ls/tests/computational_backend/test_feature_set.py 100% <100%> (ø) ⬆️
...mputational_backend/test_feature_set_calculator.py 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e0e9802...4edf465. Read the comment docs.

Copy link
Contributor Author

@CJStadler CJStadler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is missing tests for FeatureSet.

[resolved]

Currently if there is a precalculated feature "A(B)" (and no other
feature depends on B) we will still re-calculate B even though the
resulting feature vector will never be used.

This commit solves this by making FeatureSet aware of which features
will be precalculated, so that it can exclude them and their
dependencies from the feature trie.
It was implicit that if these features were calculated (not ignored)
then an error would be raised. Now this is explicitly tested.
Copy link
Contributor

@kmax12 kmax12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

few comments

@@ -15,11 +15,16 @@


class FeatureSet(object):
def __init__(self, features):
def __init__(self, features, ignored_feature_trie=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add doc about ignored_feature_trie

self._feature_trie = None

@property
def feature_trie(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add note that this is optimization

if self._relationships_with_direction:
path = '%s.%s' % (next(self.entities()), self.name)
else:
path = 'empty'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to discuss: is this the best string to use?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Options:

  1. <RelationshipPath empty> (current)
  2. <RelationshipPath>
  3. <RelationshipPath None>
  4. <RelationshipPath []>

I'm leaning towards number 4 because RelationshipPath generally acts like a list so [] makes it clear that this path is like an empty list. Since the normal format is <RelationshipPath entity1.entity2>, <RelationshipPath empty> could potentially look like empty is the name of an entity.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i like [] too

feature_list = approximate_feature_trie.get_node(path).value
feature_list.append(base_feature)
node_feature_set = approximate_feature_trie.get_node(path).value
node_feature_set.add(base_feature.unique_name())
approximate_feature_set.add(base_feature.unique_name())

return approximate_feature_trie, approximate_feature_set
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactor to not return approximate_feature_set

@@ -375,6 +379,7 @@ def calc_results(time_last, ids, precalculated_features=None, training_window=No


def approximate_features(feature_set, cutoff_time, window, entityset,
approximate_feature_trie,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get this value from feature_set. likely renamed ignored_feature_trie in feature_set as well

Copy link
Contributor

@kmax12 kmax12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@CJStadler CJStadler merged commit d191c64 into master Jul 18, 2019
@CJStadler CJStadler deleted the ignore-approx-dependencies branch July 18, 2019 20:36
@rwedge rwedge mentioned this pull request Aug 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants