Update calculating features to handle multiple paths #572
Conversation
This is just a refactor; it does not change existing functionality.

`calculate_feature_matrix` can now calculate features on entitysets where there are multiple paths between two entities ("diamond graphs").

The basic algorithm:

1. Construct a trie where the edges are relationships and each node contains a set of features. Also construct a trie with the same structure for storing dataframes.
2. Traverse the trie using depth-first search. For each node:
   1. Get a dataframe for the entity (of instances related to the parent).
   2. Add variables linking the entity to its ancestors.
   3. Recurse on children. After this is done, all feature dependencies will have been calculated and stored in the dataframe trie.
   4. Group the features for this node in the trie.
   5. Calculate the features of each group.

Other changes:

- Add `Trie` class.
- Make `Relationship` hashable.
- Rename `FeatureTree` to `FeatureSet`.
- Use `feature.unique_name` instead of hash in dicts. Features should not be used in dictionaries and sets because they do not support equality: the equality operator is overloaded to produce a new feature instead of comparing features. We had instead been using feature hash values, but this could potentially lead to bugs due to hash collisions. This commit uses the feature's `unique_name` where the hash had been used. The unique name contains the same information that is used to generate the hash, so it should not lead to any collisions.
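The trie described above can be sketched roughly as follows. This is a minimal illustration, not the actual featuretools `Trie` (which, for example, takes a path constructor); the names, defaults, and iteration order here are assumptions:

```python
from collections import OrderedDict


class Trie(object):
    """Toy trie: edges are keyed by (hashable) relationships, and each
    node holds a value such as a set of features or a dataframe."""

    def __init__(self, default=dict):
        self.value = default()
        self._children = OrderedDict()
        self._default = default

    def get_node(self, path):
        """Return the node at `path`, creating intermediate nodes as needed."""
        node = self
        for edge in path:
            if edge not in node._children:
                node._children[edge] = Trie(default=node._default)
            node = node._children[edge]
        return node

    def __iter__(self):
        """Depth-first traversal yielding (path, value) pairs."""
        yield (), self.value
        for edge, child in self._children.items():
            for path, value in child:
                yield (edge,) + path, value


# One trie stores features per relationship path; a second trie with the
# same shape would store the corresponding dataframes.
feature_trie = Trie(default=set)
feature_trie.get_node(('sessions', 'transactions')).value.add('SUM(amount)')
```

Because the dataframe trie mirrors the feature trie's structure, the depth-first traversal can look up both the features to compute and the dataframe to compute them on using the same path.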
Instead of a path, since we currently only support paths of length 1. Updates features JSON SCHEMA_VERSION.
This currently fails because we do not detect that paths with cycles are not unique. This means that two features with different-length paths through the same cycle will be given the same name, and will be treated as the same feature in sets and dictionaries.
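The equality problem described in the summary can be shown with a toy stand-in (the `Feature` class below is hypothetical, not the real featuretools class): since `==` is overloaded to build a new feature, objects cannot be deduplicated directly, so collections are keyed by the string `unique_name` instead.

```python
class Feature(object):
    """Toy stand-in: `==` constructs a new comparison feature instead of
    comparing, so Feature objects are unusable as dict/set keys."""

    def __init__(self, unique_name):
        self.unique_name = unique_name

    def __eq__(self, other):
        # Overloaded to build a new feature, not to return a bool.
        # (Defining __eq__ also makes instances unhashable in Python 3.)
        return Feature('%s = %s' % (self.unique_name, other.unique_name))


f1 = Feature('MEAN(transactions.amount)')
f2 = Feature('MEAN(transactions.amount)')

# Key the dict by the string unique_name, which deduplicates reliably
# and avoids any reliance on hash values.
features_by_name = {f.unique_name: f for f in [f1, f2]}
```

Here `features_by_name` ends up with a single entry, which is the intended deduplication; using the objects themselves as keys would raise a `TypeError`.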
featuretools/tests/computational_backend/test_pandas_backend.py
Instead of accessing it through the entity of the relationship, as this may have been deserialized and so may only have the entityset "metadata" and no actual data.
Instead of storing them ahead of time in a trie.
We are not going to support this (for now).
So that time_last, training_window, etc. can be shared across recursive calls without passing them.
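The shared-state pattern this commit describes might look like the sketch below (class and attribute names are assumptions, not the actual featuretools API): settings are stored once on a calculator object so recursive helpers do not need them in every signature.

```python
class FeatureCalculator(object):
    """Hypothetical sketch: per-calculation settings are stored once so
    recursive calls can read them without threading them as arguments."""

    def __init__(self, time_last=None, training_window=None):
        self.time_last = time_last
        self.training_window = training_window

    def calculate(self, node, depth=0):
        # self.time_last / self.training_window are visible here and in
        # every recursive call without appearing in the signature.
        results = [(node['name'], depth, self.time_last)]
        for child in node.get('children', []):
            results.extend(self.calculate(child, depth + 1))
        return results


calc = FeatureCalculator(time_last='2019-01-01', training_window='30 days')
tree = {'name': 'customers', 'children': [{'name': 'transactions'}]}
out = calc.calculate(tree)
```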
I'm not sure how this was created (in addition to the normal .DS_Store)
This is no longer necessary since we use full frames for an entity if any features require it.
This way it is clear where the trie gets updated.
@kmax12 this is ready to be re-reviewed. There are three things I know of left to do, but I wanted to check with you whether it makes sense to do them here or in separate PRs:
Remove profile. Since it is technically public API.
Let's remove it. It's only for developers, so no need to worry that it is technically public
Remove PandasBackend. Since we might want to discuss this more.
I still think we should do this, but we can discuss before doing
Refactor approximate_features to not use get_pandas_data_frame.
i'd say let's do it here or in a PR that branches off of this branch.
Instead of entity id. Changes precalculated_features to be a Trie. Updates Trie iterator to yield RelationshipPaths instead of lists.
Remove an extra column, removing the need for suffixes in the merge.
I've updated the approximate logic to handle entitysets with multiple paths. I have also started removing
One test needed to be updated which attempted to make a direct feature of a grandparent.
If there is an aggregation of a `uses_full_entity` feature, the aggregation should not be calculated on the full entity. Previously this was not possible because we only had one set of dataframes. Now we always store the "filtered" df in `df_trie`, and if necessary also store a full df in `large_df_trie`. The values of `feature_trie` are now a 3-tuple of (needs full entity, needs-full-entity features, rest of features).

The main issue this solves is that a node without any features might still need the full entity. For example: if the only target feature is "PERCENTILE(MEAN(customers.transactions))" there are no features on "customers", but we need the full entity so that every row in "transactions" can get its relationship variable to the target entity.

If any features at a given node require a full df, we calculate them first. Then we extract the filtered df from the result and use it to calculate any remaining features.
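The two-phase "full then filtered" calculation can be sketched with plain dicts standing in for dataframes (everything below is a simplified illustration; the function and feature names are hypothetical):

```python
def calculate_node(all_rows, filtered_ids, full_feats, other_feats):
    """Sketch: full-entity features are computed on every row first, then
    the result is filtered back down before computing remaining features.
    Rows are dicts; features are (name, fn) pairs over the 'value' column."""
    # Phase 1: features like PERCENTILE are only correct over the full entity.
    full_df = [dict(row) for row in all_rows]
    for name, fn in full_feats:
        values = fn([row['value'] for row in full_df])
        for row, v in zip(full_df, values):
            row[name] = v

    # Phase 2: extract the filtered df from the full result, then compute
    # the remaining features on just those rows.
    filtered_df = [row for row in full_df if row['id'] in filtered_ids]
    for name, fn in other_feats:
        values = fn([row['value'] for row in filtered_df])
        for row, v in zip(filtered_df, values):
            row[name] = v
    return filtered_df


def percentile(values):
    # Rank-based percentile over all rows it is given (assumes distinct values).
    order = sorted(values)
    return [(order.index(v) + 1) / len(values) for v in values]


rows = [{'id': i, 'value': v} for i, v in enumerate([10, 20, 30, 40])]
result = calculate_node(rows, filtered_ids={0, 3},
                        full_feats=[('pct', percentile)],
                        other_feats=[('double', lambda vs: [v * 2 for v in vs])])
```

Note that the percentile of row 0 is 0.25 because it was computed over all four rows; computing it over only the two filtered rows would wrongly give 0.5.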
Instead, calculate_feature_matrix constructs a FeatureSet when first called, and then uses FeaturesCalculator for sets of instance_ids and cutoffs.
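The build-once / reuse-per-chunk split might look like this simplified sketch (the real classes hold entityset state and cutoff-time handling; the signatures below are assumptions):

```python
class FeatureSet(object):
    """Immutable description of what to compute, built once per call to
    calculate_feature_matrix in this sketch."""

    def __init__(self, features):
        self.features = list(features)


class FeaturesCalculator(object):
    """Applies a FeatureSet to one chunk of instance ids."""

    def __init__(self, feature_set):
        self.feature_set = feature_set

    def run(self, instance_ids):
        return {i: [fn(i) for fn in self.feature_set.features]
                for i in instance_ids}


def calculate_feature_matrix(features, chunks):
    # Build the FeatureSet once...
    feature_set = FeatureSet(features)
    matrix = {}
    # ...then run a calculator for each chunk of instance ids.
    for instance_ids in chunks:
        matrix.update(FeaturesCalculator(feature_set).run(instance_ids))
    return matrix


m = calculate_feature_matrix([lambda i: i * 2, lambda i: i + 1],
                             chunks=[[1, 2], [3]])
```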
Left a few minor comments. Otherwise, LGTM
    other sequences.

    Examples:
        .. code-block:: python
I believe this docstring needs to be updated to use `.value`.
Is there a way to have this block executed when the docs are built?
yep, if you want it to be a test case, look at the syntax we use for any of the primitives' example sections. In that case, building the docs will run the code and error if the outputs don't match.
if you want to just run it, look at how we do encode_features
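A doctest-style example block is one way such a docstring snippet can be executed and checked automatically (the `encode` function below is made up for illustration and is not featuretools' `encode_features`):

```python
import doctest


def encode(values):
    """Map each distinct value to a small integer code.

    Examples:
        >>> encode(['a', 'b', 'a'])
        [0, 1, 0]
    """
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in values]


# Find and run the examples embedded in the docstring; a mismatch between
# the documented output and the real output counts as a failure.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner()
for test in finder.find(encode, 'encode'):
    runner.run(test)
failures = runner.failures
```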
featuretools/utils/trie.py
        self._path_constructor = path_constructor

    @property
    def value(self):
Is there any reason to have these defined using `@property`? I think we can just have `self.value` defined.
    assert i in result['log']['id'].values


def test_get_pandas_slice_secondary_index(es):
I think this might be the only place in the tests where we explicitly tested that the secondary time index correctly filtered data. A quick fix would be to add a test call to `Entity.query_by_values` that replicates this test.
LGTM
See #543.