Handle Transform features that need access to all values of entity #91

Merged
merged 22 commits into master on Mar 6, 2018

@bschreck
Contributor

bschreck commented Feb 13, 2018

Adds logic to pass in all values of an entity before cutoff to features that need them.

The problem:
Features like Percentile and CumMean/CumMax/etc. need access to all the values of the entity before the cutoff time, rather than only the specified instance ids. Normal Transform features are one-to-one mappings, where the output for a specific instance depends only on a single input value for that instance. In contrast, features like Percentile use all the available data to determine the rank of a particular value. Currently, they are only provided with a subset of the data: the subset linked to the specified instance ids on the target entity.
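
For illustration, a minimal standalone sketch (not code from this PR) of the difference in pandas: a one-to-one transform gives the same answer on the requested rows alone, while a rank-based transform changes depending on how much of the entity's data it sees.

import pandas as pd

values = pd.Series([10, 20, 30, 40, 50])  # all values of the entity before the cutoff time
requested = values.loc[[1, 3]]            # only the requested instance ids

# One-to-one transform: identical result either way
assert (requested * 2).equals((values * 2).loc[[1, 3]])

# Percentile-style transform: result depends on how much data is visible
print(requested.rank(pct=True))           # ranks within the 2-row subset: 0.5, 1.0
print(values.rank(pct=True).loc[[1, 3]])  # ranks within the full entity: 0.4, 0.8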

This PR creates an additional entity_frames object inside of pandas_backend.py, which contains all the data before the cutoff time. It adds logic to determine whether to pass this object or the normal entity_frames object to each feature calculation function, and it also decides where to place the output of these functions.

Lastly, it makes sure to only extract the relevant columns from the entityset using EntitySet.get_pandas_data_slice. Previously, we simply used all the columns (which is a waste of memory).

@Seth-Rothschild

Contributor

Seth-Rothschild commented Feb 14, 2018

Trying to use this with custom primitives now: what's the expected workflow? I thought that you might be able to set use_previous by doing

import pandas as pd
from featuretools import variable_types as vtypes
from featuretools.primitives import make_trans_primitive

def primitive1(value):
    numeric = pd.Series(value)
    return numeric - numeric.shift(1)

Prim1 = make_trans_primitive(primitive1,
                             input_types=[vtypes.Numeric],
                             cls_attributes={'use_previous': True},
                             return_type=vtypes.Numeric)

but that doesn't seem to be the way to enable calculating with the additional data.

bschreck and others added some commits Feb 14, 2018

lint needs-all-values-transform (#92)
* lint

* Sort imports
@codecov-io

codecov-io commented Feb 14, 2018

Codecov Report

Merging #91 into master will increase coverage by 0.53%.
The diff coverage is 98.76%.

@@            Coverage Diff             @@
##           master      #91      +/-   ##
==========================================
+ Coverage   87.65%   88.18%   +0.53%     
==========================================
  Files          73       73              
  Lines        7192     7392     +200     
==========================================
+ Hits         6304     6519     +215     
+ Misses        888      873      -15
Impacted Files Coverage Δ
...computational_backends/calculate_feature_matrix.py 96.87% <100%> (+0.02%) ⬆️
featuretools/primitives/transform_primitive.py 97.76% <100%> (ø) ⬆️
...eaturetools/computational_backends/feature_tree.py 99.28% <100%> (+6.42%) ⬆️
.../feature_function_tests/test_transform_features.py 87.64% <100%> (+1.51%) ⬆️
...aturetools/tests/entityset_tests/test_pandas_es.py 99.72% <100%> (ø) ⬆️
featuretools/entityset/entityset.py 89.8% <100%> (+0.04%) ⬆️
...utational_backend/test_calculate_feature_matrix.py 98.79% <100%> (+0.18%) ⬆️
featuretools/primitives/cum_transform_feature.py 97.63% <100%> (+0.01%) ⬆️
featuretools/primitives/primitive_base.py 77.92% <100%> (+0.74%) ⬆️
...turetools/computational_backends/pandas_backend.py 89.24% <92.85%> (+2.04%) ⬆️
... and 4 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3be6fb3...861f22b. Read the comment docs.

@bschreck

Contributor

bschreck commented Feb 16, 2018

@Seth-Rothschild the way to do it would be cls_attributes={'needs_all_values': True}. You just have the wrong variable name.
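
For reference, a minimal sketch of the corrected call (same hypothetical imports and primitive1 function as the snippet above; the only change is the attribute name):

Prim1 = make_trans_primitive(primitive1,
                             input_types=[vtypes.Numeric],
                             cls_attributes={'needs_all_values': True},
                             return_type=vtypes.Numeric)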

@@ -95,14 +96,28 @@ def calculate_all_features(self, instance_ids, time_last,
ordered_entities = FeatureTree(self.entityset, self.features, ignored=ignored).ordered_entities
else:
ordered_entities = self.feature_tree.ordered_entities
necessary_columns = self._find_necessary_columns(only_if_needs_all_values=False)

@kmax12

kmax12 Feb 20, 2018

Member

is there a way to refactor to not need this parameter? not clear to me what it does

@@ -210,6 +243,72 @@ def generate_default_df(self, instance_ids, extra_columns=None):
default_df[c] = [np.nan] * len(instance_ids)
return default_df
def _find_necessary_columns(self, only_if_needs_all_values=False):

@kmax12

kmax12 Feb 20, 2018

Member

any reason not to make this a method of featuretree?

handler = self._feature_type_handler(test_feature)
result_frame = handler(group, input_frames)
self._place_in_output_frames(result_frame,

@kmax12

kmax12 Feb 20, 2018

Member

looks like this is a function that is only called once. any reason you chose to make it a function rather than include the code inline?

@@ -256,6 +272,7 @@ def query_entity_by_values(self, entity_id, instance_vals, variable_id=None,
variable_id=variable_id,
columns=columns,
time_last=time_last,
training_window=training_window,

@kmax12

kmax12 Feb 20, 2018

Member

can you explain this addition? how were things working previously without this?

@bschreck

bschreck Feb 20, 2018

Contributor

I think they weren't

@bschreck

bschreck Feb 20, 2018

Contributor

should write a test to confirm

@bschreck

bschreck Feb 21, 2018

Contributor

We actually never use this function except in tests... We always call the one on the Entity itself (Entity.query_by_values). I think we should just delete this function and change those tests to the one we actually use in EntitySet

@bschreck

bschreck Feb 21, 2018

Contributor

Changed in latest commit. Let me know what you think

@kmax12

kmax12 Feb 23, 2018

Member

yep, let's remove if it's unused

self.all_features = list(all_features.values())
self.feature_deps = feature_deps
self.feature_dependents = self._build_dependents()

@kmax12

kmax12 Feb 20, 2018

Member

i think this is the inverse of self.feature_deps, but the variable name doesn't make that very clear since the two names are so similar

@bschreck

bschreck Feb 20, 2018

Contributor

i didn't want to change feature_deps when I was first doing it for fear it was used elsewhere, but now that everything works it would be better to name it feature_dependencies

self.all_features = list(all_features.values())
self.feature_deps = feature_deps
self.feature_dependents = self._build_dependents()

@kmax12

kmax12 Feb 20, 2018

Member

why break this out into a different function? can it be combined into the logic above that calculates self.feature_deps?

@bschreck

bschreck Feb 20, 2018

Contributor

we would have to change the above logic to accommodate it, since we rely on get_deep_dependencies() but we don't have an equivalent get_deep_dependents() method. It would just have to be a bit more complicated as a queue/search/while loop instead of just the two for loops. But yeah, it might make it look more uniform.
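
A minimal standalone sketch of the queue-based search being described (hypothetical names, not the PR's code): assuming a map of each feature's direct dependents (which could itself be built by inverting direct dependencies), a breadth-first walk collects every transitive dependent.

from collections import deque

def deep_dependents(feature_hash, direct_dependents):
    # direct_dependents maps each feature hash to the hashes of features that
    # depend on it directly; BFS collects all transitive dependents.
    seen = set()
    queue = deque(direct_dependents.get(feature_hash, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(direct_dependents.get(current, []))
    return seen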

@bschreck

bschreck Feb 21, 2018

Contributor

Nevermind i think i can make it work

@@ -151,15 +169,30 @@ def calculate_all_features(self, instance_ids, time_last,
if not self.entityset.find_backward_path(start_entity_id=filter_eid,
goal_entity_id=eid):
entity_frames[eid] = eframes_by_filter[eid][eid]
if (large_eframes_by_filter is not None and
eid in large_eframes_by_filter and eid in large_eframes_by_filter[eid]):
large_entity_frames[eid] = large_eframes_by_filter[eid][eid]

@kmax12

kmax12 Feb 20, 2018

Member

what is this line about? seems a bit weird

test_feature = group[0]
input_frames, output_frames = \
self._determine_input_output_frames(

@kmax12

kmax12 Feb 20, 2018

Member

seems like this logic can be done in a simpler way to avoid implementing a function that only gets called once.

input_frames = large_entity_frames
if self.feature_tree.needs_all_values_differentiator(test_feature) == 'normal_no_dependent':
    input_frames = entity_frames

then you can put the output frame logic in the existing _place_in_output_frames function you've defined.

@@ -210,6 +243,72 @@ def generate_default_df(self, instance_ids, extra_columns=None):
default_df[c] = [np.nan] * len(instance_ids)
return default_df
def _find_necessary_columns(self, only_if_needs_all_values=False):

@kmax12

kmax12 Feb 20, 2018

Member

is only_if_needs_all_values an optimization or is it required for this to work? a bit hard to follow its role. in general, I find it confusing when a function has a single boolean parameter that controls behavior like this unless it is very obvious what it does. as far as I can tell, there is only one place in the code where we set it to true. if that is the case, I'd prefer if the optimization happens more explicitly outside the function. that makes the function itself more readable and maintainable.
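
A rough sketch of what that could look like (hypothetical signature; the actual refactor is up to the author): have the function accept the features to consider, so the call site makes the needs_all_values restriction explicit instead of toggling it with a flag.

# hypothetical: _find_necessary_columns takes the feature list explicitly
all_features = self.feature_tree.all_features
all_values_features = [f for f in all_features if f.needs_all_values]

necessary_columns = self._find_necessary_columns(all_features)
large_necessary_columns = self._find_necessary_columns(all_values_features)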

bschreck added some commits Feb 21, 2018

@rwedge

Contributor

rwedge commented Feb 23, 2018

Codecov shows feature_tree.get_all_features not being hit by the test cases, and it doesn't look like it's called by other parts of featuretools. Is this a method we want to keep?

@@ -19,6 +19,7 @@ class CumFeature(TransformPrimitive):
allow_where = True
agg_feature = None
rolling_function = True
needs_all_values = True

@kmax12

kmax12 Feb 23, 2018

Member

thinking through other names. do any of these stick out as better or trigger other ideas?

uses_all_values
uses_all_data
uses_full_data

@bschreck

bschreck Feb 23, 2018

Contributor

What about something like "cross_entity", or "spans_entity"

@bschreck

bschreck Feb 23, 2018

Contributor

"full_entity_transform"

@kmax12

kmax12 Feb 26, 2018

Member

historical_transform

@kmax12

kmax12 Feb 26, 2018

Member

I like the idea of full_entity_transform. any thoughts on other versions that don't repeat transform?

@bschreck

bschreck Feb 27, 2018

Contributor

applied_to_full_entity
applied_to_full_data
?

@kmax12

kmax12 Feb 28, 2018

Member

uses_full_entity

# These functions are used for sorting and grouping features
def needs_all_values_differentiator(self, f):
if self._dependent_needs_all_values(f) and self._is_output_feature(f):

@kmax12

kmax12 Feb 23, 2018

Member

i think this is an example of where extra one-time-use functions don't necessarily help.

is_output = f.hash() in self.feature_hashes

then use the variable in your conditionals below

@bschreck

bschreck Feb 23, 2018

Contributor

yeah i agree

@bschreck

bschreck Feb 23, 2018

Contributor

i think they're more to help me think while I go about solving the problem, and it makes sense to review them in the final solution

def _needs_all_values(self, feature):
if feature.needs_all_values:
return True
for d in self.feature_dependents[feature.hash()]:

@kmax12

kmax12 Feb 23, 2018

Member

can this be replaced with a call to self._dependent_needs_all_values? at first glance, the code looks copy and pasted

# These functions are used for sorting and grouping features
def needs_all_values_differentiator(self, f):

@kmax12

kmax12 Feb 23, 2018

Member

can you briefly document what each of the 4 cases means for someone looking at this code in the future?

large_eframes_by_filter = None
if any([f.needs_all_values for f in self.feature_tree.all_features]):
large_necessary_columns = self.feature_tree.necessary_columns_for_all_values_features
all_instances = self.entityset[self.target_eid].get_all_instances()

@kmax12

kmax12 Feb 23, 2018

Member

is it a bit weird to get all instances here when you'd likely have to filter some of them out by time later?

@bschreck

bschreck Feb 23, 2018

Contributor

workaround for issue #93. can merge with master and remove

result_frame = handler(group, input_frames)
output_frames = []
if needs_all_values_type.startswith('dependent'):

@kmax12

kmax12 Feb 23, 2018

Member

I'd make this more explicit like the other conditionals around it

if needs_all_values_type in ["dependent", "dependent_and_output"]:
    pass
# turn values which were hashes of features into the features themselves
# (note that they were hashes to prevent checking equality of feature objects,
# which is not currently an allowed operation)
self.feature_dependents = {fhash: [all_features[dhash] for dhash in feature_dependents[fhash]]

@kmax12

kmax12 Feb 23, 2018

Member

why do you iterate over the keys in all_features rather than feature_dependents?

@bschreck

bschreck Feb 23, 2018

Contributor

If a feature has no dependencies, it won't be put in feature_dependents by the above code. But, looking at it now, I can just explicitly add those in the for loop.

@bschreck

bschreck Feb 23, 2018

Contributor

hmm actually that doesn't make much sense because it's a defaultdict, so a feature with no deps will get treated as an empty value in the defaultdict anyway.

all_features[dep.hash()] = dep
feature_deps[dep.hash()] = dep.get_deep_dependencies(ignored=ignored)
subdeps = dep.get_deep_dependencies(

@kmax12

kmax12 Feb 26, 2018

Member

@bschreck i see we had this before, but can you explain why we need to do the sub dependency call here? like why only once rather than go to sub sub dependencies?

@bschreck

bschreck Feb 27, 2018

Contributor

get_deep_dependencies() gets all the dependencies recursively, so the top-level features plus each one's deep dependencies cover every feature. To build the dependents dict we need, for each one of those, all of its dependents, not its dependencies. deps is every dependency of the top-level features, and subdeps is every dependency of all the other ones. So deps also contains all the dependents of the subdeps, and the top-level features account for all the remaining dependents. Kind of tricky to wrap your head around.
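
A minimal standalone sketch of the inversion being described (not the PR's actual code): given a mapping from each feature hash to the hashes of its deep dependencies, the dependents mapping is just the transpose.

from collections import defaultdict

def build_dependents(feature_deps):
    # Invert {feature_hash: [dependency_hashes]} into {dependency_hash: {dependent_hashes}}.
    feature_dependents = defaultdict(set)
    for fhash, dep_hashes in feature_deps.items():
        feature_dependents[fhash]  # ensure features with no dependents still get an entry
        for dhash in dep_hashes:
            feature_dependents[dhash].add(fhash)
    return feature_dependents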

@@ -153,9 +198,53 @@ def _get_feature_depths(self, entity_id):
return list(features.values()), out
def _build_dependents(self):

@kmax12

kmax12 Feb 26, 2018

Member

is this used anywhere?

# to calculate_feature_matrix), then we also need
# to subselect the output based on the desired instance ids
# and place in the return data frame.
if self._dependent_needs_all_values(f) and is_output:

@kmax12

kmax12 Feb 26, 2018

Member

i think you can refactor this to get rid of extra functions

dependent_needs_all_values = False
for d in self.feature_dependents[f.hash()]:
    if d.needs_all_values:
        dependent_needs_all_values = True
needs_all_values = f.needs_all_values or dependent_needs_all_values

then use variables below

@@ -10,9 +10,9 @@
import numpy as np
import pandas as pd
# progress bar

@kmax12

kmax12 Feb 26, 2018

Member

delete?

from featuretools.utils.gen_utils import make_tqdm_iterator
# featuretools

@kmax12

kmax12 Feb 26, 2018

Member

delete?

@@ -151,15 +170,48 @@ def calculate_all_features(self, instance_ids, time_last,
if not self.entityset.find_backward_path(start_entity_id=filter_eid,
goal_entity_id=eid):
entity_frames[eid] = eframes_by_filter[eid][eid]
# precalculated features will only be placed in entity_frames,

@kmax12

kmax12 Feb 26, 2018

Member

is it possible to have precalculated a needs_all_values feature? do we need to add code to handle that case?

@bschreck

bschreck Feb 27, 2018

Contributor

i thought that I had considered all cases, but I did miss the case where you have a needs_all_values feature that depends on an approximated/precalculated feature. This will need to be reworked.

@kmax12

kmax12 Feb 28, 2018

Member

can we remove the TODO?

for frames in output_frames:
index = frames[entity_id].index
frames[entity_id] = result_frame.loc[index]

@kmax12

kmax12 Feb 26, 2018

Member

you might want to use result_frame.reindex() here. you'll want it if some of the index values aren't going to be in result_frame.
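
A small standalone illustration of the difference (hypothetical data, not this PR's frames): .reindex() fills labels missing from the frame with NaN, while .loc raises a KeyError for missing labels on current pandas (older versions reindexed with a warning).

import pandas as pd

result_frame = pd.DataFrame({'val': [1.0, 2.0]}, index=[0, 1])
index = [0, 1, 2]  # label 2 is not in result_frame

print(result_frame.reindex(index))  # row for label 2 is filled with NaN
# result_frame.loc[index]           # KeyError on current pandas versions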

@kmax12

kmax12 approved these changes Mar 6, 2018

@kmax12

Member

kmax12 commented Mar 6, 2018

Looks great. Thanks for the work on this!

@kmax12 kmax12 merged commit 4189b0b into master Mar 6, 2018

2 checks passed

ci/circleci Your tests passed on CircleCI!
license/cla Contributor License Agreement is signed.

@rwedge rwedge referenced this pull request Mar 21, 2018

Merged

Release v0.1.19 #114
