Handle Transform features that need access to all values of entity #91
Conversation
Trying to use this with custom primitives now: what's the expected workflow? I thought that you might be able to set
it, but that doesn't seem to be the way to calculate using additional data.
Codecov Report

```
@@            Coverage Diff             @@
##           master      #91      +/-   ##
==========================================
+ Coverage   87.65%   88.18%   +0.53%
==========================================
  Files          73       73
  Lines        7192     7392     +200
==========================================
+ Hits         6304     6519     +215
+ Misses        888      873      -15
==========================================
```

Continue to review full report at Codecov.
@Seth-Rothschild the way to do it would be
```python
@@ -95,14 +96,28 @@ def calculate_all_features(self, instance_ids, time_last,
            ordered_entities = FeatureTree(self.entityset, self.features, ignored=ignored).ordered_entities
        else:
            ordered_entities = self.feature_tree.ordered_entities

        necessary_columns = self._find_necessary_columns(only_if_needs_all_values=False)
```
is there a way to refactor to not need this parameter? not clear to me what it does
```python
@@ -210,6 +243,72 @@ def generate_default_df(self, instance_ids, extra_columns=None):
            default_df[c] = [np.nan] * len(instance_ids)
        return default_df

    def _find_necessary_columns(self, only_if_needs_all_values=False):
```
any reason not to make this a method of featuretree?
```python
        handler = self._feature_type_handler(test_feature)
        result_frame = handler(group, input_frames)

        self._place_in_output_frames(result_frame,
```
looks like this is a function that is only called once. any reason you chose to make it a function rather than include the code inline?
featuretools/entityset/entityset.py (Outdated)
```python
@@ -256,6 +272,7 @@ def query_entity_by_values(self, entity_id, instance_vals, variable_id=None,
                                 variable_id=variable_id,
                                 columns=columns,
                                 time_last=time_last,
                                 training_window=training_window,
```
can you explain this addition? how were things working previously without this?
I think they weren't
should write a test to confirm
We actually never use this function except in tests... We always call the one on the Entity itself (`Entity.query_by_values`). I think we should just delete this function and change those tests to use the one we actually use in EntitySet
Changed in latest commit. Let me know what you think
yep, let's remove if it's unused
```python
        self.all_features = list(all_features.values())
        self.feature_deps = feature_deps
        self.feature_dependents = self._build_dependents()
```
i think this is the inverse of `self.feature_deps`, but the variable name doesn't make it very clear since it is nearly the same
i didn't want to change feature_deps when I was first doing it for fear it was used elsewhere, but now that everything works it would be better to name it feature_dependencies
```python
        self.all_features = list(all_features.values())
        self.feature_deps = feature_deps
        self.feature_dependents = self._build_dependents()
```
why break this out into a different function? can it be combined into the logic above that calculates `self.feature_deps`?
we would have to change the above logic to accommodate it, since we rely on get_deep_dependencies() but we don't have an equivalent get_deep_dependents() method. It would just have to be a bit more complicated as a queue/search/while loop instead of just the two for loops. But yeah might make it look more uniform
Nevermind i think i can make it work
```python
@@ -151,15 +169,30 @@ def calculate_all_features(self, instance_ids, time_last,
            if not self.entityset.find_backward_path(start_entity_id=filter_eid,
                                                     goal_entity_id=eid):
                entity_frames[eid] = eframes_by_filter[eid][eid]
                if (large_eframes_by_filter is not None and
                        eid in large_eframes_by_filter and eid in large_eframes_by_filter[eid]):
                    large_entity_frames[eid] = large_eframes_by_filter[eid][eid]
```
what is this line about? seems a bit weird
```python
        test_feature = group[0]

        input_frames, output_frames = \
            self._determine_input_output_frames(
```
seems like this logic can be done in a simpler way to avoid implementing a function that only gets called once:

```python
input_frames = large_entity_frames
if self.feature_tree.needs_all_values_differentiator(test_feature) == 'normal_no_dependent':
    input_frames = entity_frames
```

then you can put the output frame logic in the existing `_place_in_output_frames` function you've defined.
```python
@@ -210,6 +243,72 @@ def generate_default_df(self, instance_ids, extra_columns=None):
            default_df[c] = [np.nan] * len(instance_ids)
        return default_df

    def _find_necessary_columns(self, only_if_needs_all_values=False):
```
is `only_if_needs_all_values` an optimization or is it required for this to work? a bit hard to follow its role. in general, I find it confusing when a function has a single boolean parameter that controls behavior like this unless it is very obvious what it does. as far as I can tell, there is only one place in the code where we set it to true. if that is the case, I'd prefer the optimization happen more explicitly outside the function. that makes the function itself more readable and maintainable.
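To make the refactor suggested above concrete, here is a minimal sketch of moving the optimization outside the function. Note this is illustrative only: the feature "objects" are plain dicts, and the names (`find_necessary_columns`, `needs_all_values`) are assumptions standing in for the real internals.

```python
# Hypothetical sketch: the boolean flag is replaced by filtering at the
# call site, so the function itself stays unconditional and easy to read.

def find_necessary_columns(features):
    """Map each feature name to the columns it needs -- no boolean flag."""
    return {f["name"]: f["columns"] for f in features}

features = [
    {"name": "percentile", "columns": ["amount"], "needs_all_values": True},
    {"name": "is_weekend", "columns": ["date"], "needs_all_values": False},
]

# The former only_if_needs_all_values=True call becomes an explicit filter:
all_values_features = [f for f in features if f["needs_all_values"]]
columns_for_all_values = find_necessary_columns(all_values_features)

# The former only_if_needs_all_values=False call just passes everything:
all_columns = find_necessary_columns(features)
```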
```python
@@ -19,6 +19,7 @@ class CumFeature(TransformPrimitive):
    allow_where = True
    agg_feature = None
    rolling_function = True
    needs_all_values = True
```
thinking through other names. do any of these stick out as better or trigger other ideas?
- `uses_all_values`
- `uses_all_data`
- `uses_full_data`
What about something like "cross_entity", or "spans_entity"
"full_entity_transform"
historical_transform
I like the idea of `full_entity_transform`. any thoughts on other versions that don't repeat transform?
`applied_to_full_entity` or `applied_to_full_data`?
uses_full_entity
```python
    # These functions are used for sorting and grouping features

    def needs_all_values_differentiator(self, f):
        if self._dependent_needs_all_values(f) and self._is_output_feature(f):
```
i think this is an example of where extra one-time-use functions don't necessarily help.

```python
is_output = f.hash() in self.feature_hashes
```

then use the variable in your conditionals below
yeah i agree
i think they're more to help me think while I go about solving the problem, and it makes sense to review them in the final solution
```python
    def _needs_all_values(self, feature):
        if feature.needs_all_values:
            return True
        for d in self.feature_dependents[feature.hash()]:
```
can this be replaced with a call to `self._dependent_needs_all_values`? at first glance, the code looks copy and pasted
```python
    # These functions are used for sorting and grouping features

    def needs_all_values_differentiator(self, f):
```
can you briefly document what each of the 4 cases mean for someone looking at this code in the future?
```python
        large_eframes_by_filter = None
        if any([f.needs_all_values for f in self.feature_tree.all_features]):
            large_necessary_columns = self.feature_tree.necessary_columns_for_all_values_features
            all_instances = self.entityset[self.target_eid].get_all_instances()
```
is it a bit weird to get all instances here when you'd likely have to filter some of them out by time later?
workaround for issue #93. can merge with master and remove
```python
        result_frame = handler(group, input_frames)

        output_frames = []
        if needs_all_values_type.startswith('dependent'):
```
I'd make this more explicit like the other conditionals around it

```python
if needs_all_values_type in ["dependent", "dependent_and_output"]:
    pass
```
```python
        # turn values which were hashes of features into the features themselves
        # (note that they were hashes to prevent checking equality of feature objects,
        # which is not currently an allowed operation)
        self.feature_dependents = {fhash: [all_features[dhash] for dhash in feature_dependents[fhash]]
```
why do you iterate over the keys in `all_features` rather than `feature_dependents`?
If a feature has no dependencies, it won't be put in feature_dependents by the above code. But, looking at it now, I can just explicitly add those in the for loop.
hmm actually that doesn't make much sense because it's a defaultdict, so a feature with no deps will get treated as an empty value in the defaultdict anyway.
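A minimal illustration of the defaultdict point above (the feature names here are made up): a key that was never written still yields an empty list rather than a KeyError.

```python
from collections import defaultdict

feature_dependents = defaultdict(list)
feature_dependents["amount"].append("percentile_of_amount")

# "percentile_of_amount" has no dependents and was never inserted as a key,
# but looking it up works and returns an empty list:
leaf = feature_dependents["percentile_of_amount"]
```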
```python
            all_features[dep.hash()] = dep
            feature_deps[dep.hash()] = dep.get_deep_dependencies(ignored=ignored)
            subdeps = dep.get_deep_dependencies(
```
@bschreck i see we had this before, but can you explain why we need to do the sub-dependency call here? like why only once rather than going on to sub-sub-dependencies?
get_deep_dependencies() gets all the dependencies recursively, so the top-level features plus each one's deep dependencies cover every feature. To build the dependents dict, for each one of those we need all of its dependents, not its dependencies. deps is every dependency of the top-level features, and subdeps is every dependency of all the other ones. So deps also contains all the dependents of all the subdeps, and the top-level features are all the other dependents for the rest. kind of tricky to wrap your head around
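The dependencies-to-dependents inversion being discussed can be sketched like this. This is a simplified model, not the actual FeatureTree code: plain feature names stand in for feature hashes, and the input mapping is assumed to already be fully expanded (each feature mapped to all of its deep dependencies).

```python
from collections import defaultdict

def build_dependents(feature_deps):
    """Invert a {feature: [dependencies]} mapping into {feature: [dependents]}."""
    dependents = defaultdict(list)
    for feature, deps in feature_deps.items():
        dependents.setdefault(feature, [])  # keep features that have no dependents
        for dep in deps:
            dependents[dep].append(feature)
    return dict(dependents)

feature_deps = {
    "percentile_of_amount": ["amount"],
    "cum_mean_amount": ["amount"],
    "amount": [],
}
feature_dependents = build_dependents(feature_deps)
# "amount" is a dependency of both transform features, so both appear
# as its dependents; the transform features themselves have none.
```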
```python
@@ -153,9 +198,53 @@ def _get_feature_depths(self, entity_id):

        return list(features.values()), out

    def _build_dependents(self):
```
is this used anywhere?
```python
            # to calculate_feature_matrix), then we also need
            # to subselect the output based on the desired instance ids
            # and place in the return data frame.
            if self._dependent_needs_all_values(f) and is_output:
```
i think you can refactor this to get rid of extra functions

```python
dependent_needs_all_values = False
for d in self.feature_dependents[f.hash()]:
    if d.needs_all_values:
        dependent_needs_all_values = True
needs_all_values = f.needs_all_values or dependent_needs_all_values
```

then use the variables below
```python
@@ -10,9 +10,9 @@

import numpy as np
import pandas as pd
# progress bar
```
delete?
```python
from featuretools.utils.gen_utils import make_tqdm_iterator

# featuretools
```
delete?
```python
@@ -151,15 +170,48 @@ def calculate_all_features(self, instance_ids, time_last,
            if not self.entityset.find_backward_path(start_entity_id=filter_eid,
                                                     goal_entity_id=eid):
                entity_frames[eid] = eframes_by_filter[eid][eid]
                # precalculated features will only be placed in entity_frames,
```
is it possible to have precalculated a `needs_all_values` feature? do we need to add code to handle that case?
i thought that I considered all cases before, but I did miss the case where you have a `needs_all_values` feature that depends on an approximated/precalculated feature. This will need to be reworked.
can we remove the TODO?
```python
        for frames in output_frames:
            index = frames[entity_id].index
            frames[entity_id] = result_frame.loc[index]
```
you might want to use `result_frame.reindex()` here. you'll want it if some of the index values aren't going to be in result_frame.
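To illustrate the distinction in that suggestion with a toy frame (not the PR's actual data): in recent pandas, `.loc` with a list of labels raises a KeyError when any label is missing, while `.reindex` returns NaN rows for the missing labels.

```python
import pandas as pd

result_frame = pd.DataFrame({"value": [10.0, 20.0]}, index=[1, 2])

# index label 3 does not exist in result_frame.
# result_frame.loc[[1, 2, 3]] would raise KeyError in recent pandas;
# reindex instead produces a NaN-filled row for the missing label:
subset = result_frame.reindex([1, 2, 3])
```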
```python
@@ -19,6 +19,7 @@ class CumFeature(TransformPrimitive):
    allow_where = True
    agg_feature = None
    rolling_function = True
    needs_all_values = True
```
historical_transform
Looks great. Thanks for the work on this!
Adds logic to pass in all values of an entity before the cutoff to features that need them.

The problem:
Features like Percentile and CumMean/CumMax/etc. need access to all the values of the entity before the cutoff time, rather than only the specified instance ids. Normal Transform features are one-to-one mappings, where the output for a specific instance depends only on a single input value for that instance. By contrast, features like Percentile use all the available data to determine the rank of a particular value. Currently, they are only provided with a subset of the data: the subset linked to the set of specified instance ids on the target entity.

This PR creates an additional `entity_frames` object inside of `pandas_backend.py`, which contains all the data before the cutoff time. It adds logic to determine whether to pass this object or the normal `entity_frames` object to each feature calculation function. It also decides where to place the output of these functions. Lastly, it makes sure to only extract the relevant columns from the entityset using `EntitySet.get_pandas_data_slice`. Previously, we simply used all the columns (which is a waste of memory).
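The Percentile problem described above can be demonstrated with plain pandas (toy data, not the featuretools API): the percentile rank of a value changes depending on whether it is ranked against the full entity or only the requested instances.

```python
import pandas as pd

all_values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])  # the whole entity
requested = all_values.loc[[0, 9]]                        # only two requested instance ids

full_rank = all_values.rank(pct=True)    # rank against all values before the cutoff
subset_rank = requested.rank(pct=True)   # rank against only the requested subset

# value 1 is the 10th percentile of the full data but the 50th of the
# two-row subset, so slicing before the transform changes the answer
```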