Update calculating features to handle multiple paths #572

Merged 52 commits on Jun 17, 2019

Commits on May 31, 2019

  1. Refactor PandasBackend and FeatureTree to use paths

    This is just a refactor; it does not change existing functionality.
    CJStadler committed May 31, 2019
    d59ab4a
  2. Update calculating features to handle multiple paths

    calculate_feature_matrix can now calculate features on entitysets where
    there are multiple paths between two entities ("diamond graphs").
    
    The basic algorithm:
    
    1. Construct a trie where the edges are relationships and each node
      contains a set of features. Also construct a trie with the same
      structure for storing dataframes.
    2. Traverse the trie using depth first search. For each node:
      1. Get a dataframe for the entity (of instances related to the
        parent).
      2. Add variables linking the entity to its ancestors.
      3. Recurse on children. After this is done all feature dependencies
        will have been calculated and stored in the dataframe trie.
      4. Group the features for this node in the trie.
      5. Calculate the features of each group.
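
The traversal order described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code from this PR: `TrieNode`, `calculate`, and the string edge labels such as `"a->b"` are hypothetical stand-ins for the real trie, relationships, and dataframes, and "calculating" a node only records its feature names. Fetching dataframes and adding link variables (steps 1 and 2) are not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    """One trie node: a set of features plus children keyed by edge label."""
    features: set = field(default_factory=set)
    children: dict = field(default_factory=dict)  # edge label -> TrieNode

    def get_node(self, path):
        """Follow `path` from this node, creating missing nodes on the way."""
        node = self
        for edge in path:
            node = node.children.setdefault(edge, TrieNode())
        return node


def calculate(node, path=(), results=None, order=None):
    """Depth-first traversal: children are finished before their parent."""
    results = {} if results is None else results
    order = [] if order is None else order
    for edge, child in node.children.items():
        calculate(child, path + (edge,), results, order)
    # Every dependency on a deeper path is already stored in `results`.
    results[path] = sorted(node.features)
    order.append(path)
    return results, order


# A "diamond": entity d is reachable from a through b and through c,
# so it appears at two different nodes of the trie.
root = TrieNode()
root.features.add("SUM(b.MEAN(d.value))")
root.get_node(["a->b"]).features.add("MEAN(d.value)")
root.get_node(["a->b", "b->d"]).features.add("value")
root.get_node(["a->c", "c->d"]).features.add("value")

results, order = calculate(root)
print(order)
```

Because results are keyed by the full path rather than by entity, the two occurrences of `d` are kept apart, which is the property a diamond graph needs.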
    
    - Add Trie class.
    - Make Relationship hashable.
    - Rename FeatureTree to FeatureSet
    - Use feature.unique_name instead of hash in dicts. Features should not
      be used in dictionaries and sets because they do not support equality.
      The equality operator is overloaded to produce a new feature instead
      of comparing features. We have instead been using feature hash values,
      but this could potentially lead to bugs due to hash collisions. This
      commit instead uses the feature's unique_name where the hash had been
      used. The unique name contains the same information that is used to
      generate the hash, so it should not lead to any collisions.
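
The equality problem described above can be reproduced with a small stand-in class. This is a sketch, not the library's feature class: `Feature` here is hypothetical, and modelling `unique_name` as a method is an assumption made for the example.

```python
class Feature:
    """Hypothetical feature whose == builds a new feature instead of comparing."""

    def __init__(self, name):
        self.name = name

    def unique_name(self):
        return self.name

    def __eq__(self, other):
        # Overloaded operator: returns an "equals" feature, not a bool.
        other_name = other.unique_name() if isinstance(other, Feature) else repr(other)
        return Feature(f"{self.name} = {other_name}")

    def __hash__(self):
        # Defining __eq__ removes the default __hash__, so restore one.
        return hash(self.name)


a = Feature("MEAN(transactions.amount)")
b = Feature("MEAN(transactions.amount)")
c = Feature("SUM(transactions.amount)")

# `a == c` is a Feature object and therefore truthy, even though a and c differ.
comparison = a == c
print(type(comparison).__name__, bool(comparison))

# Keying by unique name sidesteps both the overloaded == and hash collisions:
# the keys are plain strings with ordinary equality.
by_name = {}
for feature in (a, b, c):
    by_name.setdefault(feature.unique_name(), feature)
print(sorted(by_name))
```

Since a dict falls back on `==` whenever two hashes match, a collision between two different features would make this class look "equal" to the dict, which is why string keys are the safer choice in the sketch.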
    CJStadler committed May 31, 2019
    ba94b98
  3. Change DirectFeature to take a single relationship

    Instead of a path, since we currently only support paths of length 1.
    
    Updates features JSON SCHEMA_VERSION.
    CJStadler committed May 31, 2019
    c558c4a
  4. Test calculate features on es with cycle

    This currently fails because we do not detect that paths with cycles are
    not unique. This means that two features with different length paths
    through the same cycle will be given the same name, and will be treated
    as the same feature in sets and dictionaries.
    CJStadler committed May 31, 2019
    68f2132
  5. 88b1822

Commits on Jun 3, 2019

  1. 7cb6145
  2. 6a061e4

Commits on Jun 4, 2019

  1. Add examples to Trie docs.

    CJStadler committed Jun 4, 2019
    53debce
  2. Use dataframe from given entityset in PandasBackend

    Instead of accessing it through the entity of the relationship, as this
    may have been deserialized and so may only have the entityset "metadata"
    and no actual data.
    CJStadler committed Jun 4, 2019
    1a67817
  3. comments and naming

    CJStadler committed Jun 4, 2019
    eb7a8dd
  4. minor refactor

    CJStadler committed Jun 4, 2019
    71005e9
  5. Calculate necessary columns for entity on demand

    Instead of storing them ahead of time in a trie.
    CJStadler committed Jun 4, 2019
    c937fc3
  6. dc4adb1
  7. Remove test with loop

    We are not going to support this (for now).
    CJStadler committed Jun 4, 2019
    749dc0e

Commits on Jun 5, 2019

  1. Comments

    CJStadler committed Jun 5, 2019
    3130fe1
  2. d3bfe8a
  3. fix imports

    CJStadler committed Jun 5, 2019
    a200234
  4. Add _FeaturesCalculator class

    So that time_last, training_window, etc. can be shared across recursive
    calls without passing them.
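
The motivation above, keeping shared settings on an object so that recursive calls do not have to pass them along, might look roughly like this. The class name `FeaturesCalculatorSketch`, its `run` method, and the nested-dict tree are hypothetical; only `time_last` and `training_window` come from the commit message.

```python
class FeaturesCalculatorSketch:
    """Holds settings that every recursive call needs (hypothetical sketch)."""

    def __init__(self, time_last=None, training_window=None):
        self.time_last = time_last
        self.training_window = training_window

    def run(self, tree):
        """`tree` is a nested dict: {"name": str, "children": [tree, ...]}."""
        visited = []
        self._calculate(tree, visited)
        return visited

    def _calculate(self, node, visited):
        for child in node.get("children", []):
            # Only the node changes between calls; settings live on self.
            self._calculate(child, visited)
        visited.append((node["name"], self.time_last, self.training_window))


tree = {
    "name": "customers",
    "children": [
        {"name": "sessions", "children": [{"name": "transactions"}]},
    ],
}
calc = FeaturesCalculatorSketch(time_last="2019-06-05", training_window="30 days")
visited = calc.run(tree)
print(visited)
```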
    CJStadler committed Jun 5, 2019
    7084d37
  5. Remove DS_Store file

    I'm not sure how this was created (in addition to the normal .DS_Store)
    CJStadler committed Jun 5, 2019
    7972130
  6. Remove feature grouping by output frame type

    This is no longer necessary since we use full frames for an entity if
    any features require it.
    CJStadler committed Jun 5, 2019
    54173ca
  7. 13b6db1
  8. Remove unused functions

    CJStadler committed Jun 5, 2019
    cd15bb3
  9. 0caadc2
  10. Only store df in trie after calculation

    This way it is clear where the trie gets updated.
    CJStadler committed Jun 5, 2019
    258f59c
  11. 2c9ff0a
  12. Remove unused method

    CJStadler committed Jun 5, 2019
    e28a3cb
  13. b7a8d72
  14. bab1c6d
  15. 2562d0a
  16. Update changelog

    CJStadler committed Jun 5, 2019
    f10f8f1

Commits on Jun 6, 2019

  1. Get rid of FeatureTrie[] methods

    Instead use .get_node and .value
    
    Hopefully this is more readable than trie[[]]
    CJStadler committed Jun 6, 2019
    1447953
  2. 2c1e805
  3. Replace 3 params with parent_data tuple

    Since these will all either be present or not.
    CJStadler committed Jun 6, 2019
    7600541
  4. f1544be
  5. Don't call get_node with empty path

    Because it will just return the object it was called on.
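
Why an empty path is a no-op can be seen in a minimal trie with `get_node` and `value`. The class below is a hypothetical sketch written for this note; its constructor argument `default` and its internals are assumptions, not the PR's Trie.

```python
class Trie:
    """Minimal trie holding one value per node (hypothetical sketch)."""

    def __init__(self, default=None):
        self._default = default
        self.value = default() if default is not None else None
        self._children = {}

    def get_node(self, path):
        """Return the node reached by following `path`, creating nodes as needed."""
        node = self
        for edge in path:
            if edge not in node._children:
                node._children[edge] = Trie(default=node._default)
            node = node._children[edge]
        return node


trie = Trie(default=set)
trie.get_node(["sessions", "transactions"]).value.add("SUM(amount)")

# An empty path walks no edges, so the call returns the trie it was called on.
same = trie.get_node([]) is trie
print(same)

# The root's value can therefore be read directly via .value.
print(trie.value)
```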
    CJStadler committed Jun 6, 2019
    6c91beb
  6. Add RelationshipPath

    Represents a series of relationships, and the directions in which they
    are traversed.
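
One way to picture a path of relationships with directions is a tuple of `(is_forward, relationship)` steps. The sketch below is illustrative only: the field names, the `entities` helper, and the convention that "forward" means child to parent are assumptions, not taken from this PR. Frozen dataclasses are used so that both classes are hashable and can serve as dictionary keys.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Relationship:
    """A child variable that points at a parent variable (hypothetical sketch)."""
    child_entity: str
    child_variable: str
    parent_entity: str
    parent_variable: str


@dataclass(frozen=True)
class RelationshipPath:
    """A sequence of (is_forward, relationship) steps."""
    steps: Tuple[Tuple[bool, Relationship], ...] = ()

    def entities(self):
        """List the entities visited, in traversal order."""
        names = []
        for is_forward, rel in self.steps:
            # Assumption: forward means child -> parent, backward the reverse.
            if is_forward:
                start, end = rel.child_entity, rel.parent_entity
            else:
                start, end = rel.parent_entity, rel.child_entity
            if not names:
                names.append(start)
            names.append(end)
        return names


sessions_customers = Relationship("sessions", "customer_id", "customers", "id")
transactions_sessions = Relationship("transactions", "session_id", "sessions", "id")

# From customers down to transactions: both relationships traversed backward.
path = RelationshipPath(((False, sessions_customers), (False, transactions_sessions)))
print(path.entities())

# Equal paths hash equally, so a path can key a dictionary.
lookup = {path: "dataframe for transactions via sessions"}
print(lookup[RelationshipPath(path.steps)])
```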
    CJStadler committed Jun 6, 2019
    3207889
  7. 9f8e68e
  8. Remove unused test helper

    CJStadler committed Jun 6, 2019
    e005509

Commits on Jun 11, 2019

  1. 7f33e19
  2. 076db4e
  3. Track approximated features by path

    Instead of entity id. Changes precalculated_features to be a Trie.
    
    Updates Trie iterator to yield RelationshipPaths instead of lists.
    CJStadler committed Jun 11, 2019
    576a24a
  4. 3e6a4c5
  5. b22e5ee
  6. Simplify some of approx

    Remove an extra column, removing the need for suffixes in the merge.
    CJStadler committed Jun 11, 2019
    4d1d82c

Commits on Jun 12, 2019

  1. e1b48a1
  2. fix imports

    CJStadler committed Jun 12, 2019
    e9cae56
  3. c646c9a

Commits on Jun 14, 2019

  1. Merge branch 'master' into feature-trie

    One test needed to be updated which attempted to make a direct feature
    of a grandparent.
    CJStadler committed Jun 14, 2019
    4e15b5e
  2. Store full entity dfs in separate trie (#598)

    If there is an aggregation of a uses_full_entity feature, the aggregation
    should not be calculated on the full entity. Previously this was not
    possible because we only had one set of dfs. Now we always store the
    "filtered" df in df_trie and, if necessary, also store a full df in
    large_df_trie.
    
    The values of feature_trie are now a 3-tuple of
    (needs full entity, needs full entity features, rest of features).
    
    The main issue this solves is that a node without any features might
    still need the full entity. For example, if the only target feature is
    "PERCENTILE(MEAN(customers.transactions))", there are no features on
    "customers", but we need the full entity so that every row in
    "transactions" can get its relationship variable to the target entity.
    
    If any features at a given node require a full df we first calculate
    them. Then we extract the filtered df from the result and use this to
    calculate any remaining features.
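
The "calculate on the full df, then extract the filtered df" order matters because a feature such as a percentile depends on rows outside the requested instances. A small sketch with plain dictionaries standing in for dataframes shows the difference; the `percentile` helper and the sample values are made up for illustration.

```python
def percentile(values):
    """Rank each value as a fraction of all values (1.0 = largest).

    Assumes the values are distinct, which keeps the sketch short.
    """
    ordered = sorted(values.values())
    n = len(ordered)
    return {key: (ordered.index(v) + 1) / n for key, v in values.items()}


# A per-customer value for every customer in the entity (the "full" frame).
full = {"c1": 10.0, "c2": 40.0, "c3": 20.0, "c4": 30.0}
# Only these instances were requested (the "filtered" frame).
instance_ids = ["c1", "c2"]

# 1. Features that need the full entity are calculated on the full frame.
full_result = percentile(full)
# 2. The filtered frame is then extracted from that result.
filtered_result = {key: full_result[key] for key in instance_ids}

# Calculating on the filtered frame alone gives a different answer for c1.
wrong = percentile({key: full[key] for key in instance_ids})

print(filtered_result)  # {'c1': 0.25, 'c2': 1.0}
print(wrong)            # {'c1': 0.5, 'c2': 1.0}
```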
    CJStadler committed Jun 14, 2019
    4ad6203
  3. Remove PandasBackend (#594)

    Instead, calculate_feature_matrix constructs a FeatureSet when first
    called, and then uses FeaturesCalculator for sets of instance_ids and
    cutoffs.
    CJStadler committed Jun 14, 2019
    a2b335a

Commits on Jun 17, 2019

  1. Update Trie docs

    CJStadler committed Jun 17, 2019
    a018611
  2. fb44f9f