Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update calculating features to handle multiple paths #572

Merged
merged 52 commits into from Jun 17, 2019

Conversation

Projects
None yet
2 participants
@CJStadler
Copy link
Contributor

commented May 31, 2019

calculate_feature_matrix can now calculate features on entitysets where
there are multiple paths between two entities ("diamond graphs").

The basic algorithm:

  1. Construct a trie where the edges are relationships and each node
    contains a set of features. Also construct a trie with the same
    structure for storing dataframes.
  2. Traverse the trie using depth first search. For each node:
    1. Get a dataframe for the entity (of instances related to the
      parent).
    2. Add variables linking the entity to its ancestors.
    3. Recurse on children. After this is done all feature dependencies
      will have been calculated and stored in the dataframe trie.
    4. Group the features for this node in the trie.
    5. Calculate the features of each group.
  • Add Trie class.
  • Make Relationship hashable.
  • Rename FeatureTree to FeatureSet
  • Use feature.unique_name instead of hash in dicts. Features should not
    be used in dictionaries and sets because they do not support equality.
    The equality operator is overloaded to produce a new feature instead
    of comparing features. We have instead been using feature hash values,
    but this could potentially lead to bugs due to hash collisions. This
    commit instead uses the feature's unique_name where the hash had been
    used. The unique name contains the same information that is used to
    generate the hash, so it should not lead to any collisions.

See #543.

CJStadler added some commits May 21, 2019

Refactor PandasBackend and FeatureTree to use paths
This is just a refactor, it does not change existing functionality.
Update calculating features to handle multiple paths
calculate_feature_matrix can now calculate features on entitysets where
there are multiple paths between two entities ("diamond graphs").

The basic algorithm:

1. Construct a trie where the edges are relationships and each node
  contains a set of features. Also construct a trie with the same
  structure for storing dataframes.
2. Traverse the trie using depth first search. For each node:
  1. Get a dataframe for the entity (of instances related to the
    parent).
  2. Add variables linking the entity to its ancestors.
  3. Recurse on children. After this is done all feature dependencies
    will have been calculated and stored in the dataframe trie.
  4. Group the features for this node in the trie.
  5. Calculate the features of each group.

- Add Trie class.
- Make Relationship hashable.
- Rename FeatureTree to FeatureSet
- Use feature.unique_name instead of hash in dicts. Features should not
  be used in dictionaries and sets because they do not support equality.
  The equality operator is overloaded to produce a new feature instead
  of comparing features. We have instead been using feature hash values,
  but this could potentially lead to bugs due to hash collisions. This
  commit instead uses the feature's unique_name where the hash had been
  used. The unique name contains the same information that is used to
  generate the hash, so it should not lead to any collisions.
Change DirectFeature to take a single relationship
Instead of a path, since we currently only support paths of length 1.

Updates features JSON SCHEMA_VERSION.
Test calculate features on es with cycle
This currently fails because we do not detect that paths with cycles are
not unique. This means that two features with different length paths
through the same cycle will be given the same name, and will be treated
as the same feature in sets and dictionaries.

@CJStadler CJStadler requested a review from kmax12 May 31, 2019

CJStadler added some commits Jun 3, 2019

Show resolved Hide resolved featuretools/feature_base/feature_base.py Outdated
Show resolved Hide resolved featuretools/primitives/base/primitive_base.py Outdated
Show resolved Hide resolved featuretools/utils/trie.py
Show resolved Hide resolved featuretools/utils/trie.py Outdated
Show resolved Hide resolved featuretools/utils/trie.py Outdated
Show resolved Hide resolved featuretools/computational_backends/pandas_backend.py Outdated
Show resolved Hide resolved featuretools/computational_backends/pandas_backend.py Outdated
Show resolved Hide resolved featuretools/computational_backends/pandas_backend.py Outdated
Show resolved Hide resolved featuretools/computational_backends/pandas_backend.py Outdated
Show resolved Hide resolved featuretools/computational_backends/pandas_backend.py Outdated

CJStadler added some commits Jun 4, 2019

Use dataframe from given entityset in PandasBackend
Instead of accessing it through the entity of the relationship, as this
may have been deserialized and so may only have the entityset "metadata"
and no actual data.
Calculate necessary columns for entity on demand
Instead of storing them ahead of time in a trie.
Remove test with loop
We are not going to support this (for now).
Add _FeaturesCalculator class
So that time_last, training_window, etc. can be shared across recursive
calls without passing them.
Remove DS_Store file
I'm not sure how this was created (in addition to the normal .DS_Store)
@codecov

This comment was marked as outdated.

Copy link

commented Jun 5, 2019

Codecov Report

Merging #572 into master will decrease coverage by <.01%.
The diff coverage is 99.76%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #572      +/-   ##
==========================================
- Coverage   96.43%   96.43%   -0.01%     
==========================================
  Files         117      120       +3     
  Lines        9518     9590      +72     
==========================================
+ Hits         9179     9248      +69     
- Misses        339      342       +3
Impacted Files Coverage Δ
...etools/primitives/base/transform_primitive_base.py 100% <ø> (ø) ⬆️
featuretools/utils/gen_utils.py 83.87% <ø> (-4.77%) ⬇️
featuretools/computational_backends/utils.py 95.18% <100%> (ø) ⬆️
featuretools/entityset/entityset.py 94.56% <100%> (-0.42%) ⬇️
.../tests/primitive_tests/test_features_serializer.py 100% <100%> (ø) ⬆️
featuretools/feature_base/feature_base.py 97.32% <100%> (+0.02%) ⬆️
featuretools/primitives/base/primitive_base.py 100% <100%> (ø) ⬆️
featuretools/entityset/relationship.py 98.07% <100%> (+0.4%) ⬆️
featuretools/tests/conftest.py 100% <100%> (ø) ⬆️
...ools/tests/primitive_tests/test_direct_features.py 100% <100%> (ø) ⬆️
... and 18 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8f75909...e28a3cb. Read the comment docs.

CJStadler added some commits Jun 5, 2019

Remove feature grouping by output frame type
This is no longer necessary since we use full frames for an entity if
any features require it.
Only store df in trie after calculation
This way it is clear where the trie gets updated.
@kmax12
Copy link
Member

left a comment

Remove profile. Since it is technically public API.

Let's remove it. It's only for developers, so no need to worry that it is technically public

Remove PandasBackend. Since we might want to discuss this more.

I still think we should do this, but we can discuss before doing

Refactor approximate_features to not use get_pandas_data_frame.

i'd say let's do it here or in a PR that branches off of this branch.

@CJStadler CJStadler force-pushed the feature-trie branch from d8851e3 to 7f33e19 Jun 11, 2019

CJStadler added some commits Jun 11, 2019

Track approximated features by path
Instead of entity id. Changes precalculated_features to be a Trie.

Updates Trie iterator to yield RelationshipPaths instead of lists.
Simplify some of approx
Remove an extra column, removing the need for suffixes in the merge.
@CJStadler

This comment has been minimized.

Copy link
Contributor Author

commented Jun 11, 2019

I've updated the approximate logic to handle entitysets with multiple paths. I have also started removing PandasBackend, although we might still want to discuss that.

CJStadler added some commits Jun 12, 2019

CJStadler added some commits Jun 14, 2019

Merge branch 'master' into feature-trie
One test needed to be updated which attempted to make a direct feature
of a grandparent.
Store full entity dfs in separate trie (#598)
If there is an aggregation of a uses_full_entity feature the aggregation
should not be calculated on the full entity. Previously because we only
had one set of dfs this was not possible. Now, we always store the 
"filtered" df in df_trie, and if necessary also store a full df in large_df_trie.

The values of feature_trie are now a 3-tuple of
(needs full entity, needs full entity features, rest of features).

The main issue this solves is when there is a node without any features
we might still need the full entity. For example: if the only target
feature is "PERCENTILE(MEAN(customers.transactions))" there are no
features on "customers", but we need the full entity so that the every
row in "transactions" can get its relationship variable to the target
entity.

If any features at a given node require a full df we first calculate
them. Then we extract the filtered df from the result and use this to
calculate any remaining features.
Remove PandasBackend (#594)
Instead, calculate_feature_matrix constructs a FeatureSet when first
calls, and then uses FeaturesCalculator for sets of instance_ids and
cutoffs.
@kmax12
Copy link
Member

left a comment

Left a few minor comments. Otherwise, LGTM

other sequences.
Examples:
.. code-block:: python

This comment has been minimized.

Copy link
@kmax12

kmax12 Jun 17, 2019

Member

i believe this doc string needs to be updated using .value

This comment has been minimized.

Copy link
@CJStadler

CJStadler Jun 17, 2019

Author Contributor

Is there a way to have this block executed when the docs are built?

This comment has been minimized.

Copy link
@kmax12

kmax12 Jun 17, 2019

Member

yep, if you want it to be a test case, look at the syntax we use for any of the primitives' example sections. In that case, building the docs will run the code and error if the outputs don't match.

if you want to just run it, look at how we do encode_features

self._path_constructor = path_constructor

@property
def value(self):

This comment has been minimized.

Copy link
@kmax12

kmax12 Jun 17, 2019

Member

is there any reason to have these defined using @property? I think we can just have self.value defined

assert i in result['log']['id'].values


def test_get_pandas_slice_secondary_index(es):

This comment has been minimized.

Copy link
@kmax12

kmax12 Jun 17, 2019

Member

i think this might be the only place in the tests we explicitly tested that secondary time index correctly filtered data.

a quick fix would be to add a test call to Entity.query_by_values that replicates this test.

@kmax12

kmax12 approved these changes Jun 17, 2019

Copy link
Member

left a comment

LGTM

@CJStadler CJStadler merged commit 3b84b66 into master Jun 17, 2019

5 checks passed

changelog_updated Workflow: changelog_updated
Details
codecov/patch 99.38% of diff hit (target 96.48%)
Details
codecov/project Absolute coverage decreased by -0.08% but relative coverage increased by +2.9% compared to 8f07396
Details
license/cla Contributor License Agreement is signed.
Details
test_all_python_versions Workflow: test_all_python_versions
Details

@rwedge rwedge referenced this pull request Jun 19, 2019

Merged

v0.9.0 #610

@CJStadler CJStadler deleted the feature-trie branch Jun 21, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.