Primitive refactor #364

Merged
merged 36 commits into from
Jan 18, 2019

Conversation

kmax12
Contributor

@kmax12 kmax12 commented Jan 7, 2019

This PR separates the concept of a Primitive from a Feature. The internals of Featuretools change, but it has minimal impact on the external API.

Compared to before, a primitive is now only aware of the data it takes in. A feature is then defined by input variables and/or features, as well as the primitive that will be applied. Put another way, a feature takes the specific entities and variables of an entityset and the primitive to be applied as its input, so the primitive doesn't have to worry about them.

This has several advantages:

  1. It is easier to unit test primitives. There is no need to have an entity set to test a primitive.
  2. Primitive definitions become more reusable since they are not tied to the concept of an entity set.
  3. Conceptually, the user defining a primitive only has to think about input and output data, which are just numpy arrays.
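
To illustrate the first and third points, here is a minimal sketch of testing a primitive with no entity set involved. `CountStringSketch` and its `get_function` method are hypothetical stand-ins written for this example, not the Featuretools classes, and plain Python lists stand in for numpy arrays.

```python
# Hypothetical stand-in for a parameterized primitive; not the Featuretools class.
class CountStringSketch:
    name = "count_string"

    def __init__(self, string):
        self.string = string

    def get_function(self):
        # Return a plain function that maps a sequence of strings
        # to the number of occurrences of self.string in each one.
        def count_string(values):
            return [value.count(self.string) for value in values]
        return count_string


# The "unit test": no entity set, just input data and expected output.
primitive = CountStringSketch(string="coffee")
result = primitive.get_function()(["coffee and more coffee", "tea", ""])
assert result == [2, 0, 0]
```

Because the primitive only sees data, the test needs no fixtures beyond the input values themselves.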

To give an example, here is how a feature is currently defined using the count primitive:

from featuretools.primitives import Count
f = Count(es["logs"]["value"], parent_entity=es["customers"])

Now, you define the inputs to the feature and pass the primitive as an argument:

import featuretools as ft
from featuretools.primitives import Count
f = ft.Feature(es["logs"]["value"], parent_entity=es["customers"], primitive=Count)

If a primitive has parameters, pass an initialized instance instead of the class:

f = ft.Feature(es["logs"]["comment"], parent_entity=es["customers"], primitive=CountString(string="coffee"))
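
The two calls above pass the primitive either as a class (`primitive=Count`) or as a configured instance (`primitive=CountString(string="coffee")`). One way a feature could accept both is sketched below. `FeatureSketch` and `CountSketch` are hypothetical names for illustration, and this is an assumption about the general approach, not the Featuretools implementation.

```python
import inspect


class CountSketch:
    """Hypothetical primitive with no parameters."""
    name = "count"

    def get_function(self):
        return len


class FeatureSketch:
    """Hypothetical feature: binds named inputs to a primitive."""

    def __init__(self, base, primitive, parent_entity=None):
        # Accept a primitive class (primitive=CountSketch) or an
        # already-configured instance (primitive=CountSketch()).
        if inspect.isclass(primitive):
            primitive = primitive()  # initialize with no arguments
        self.base = base
        self.parent_entity = parent_entity
        self.primitive = primitive


from_class = FeatureSketch("logs.value", primitive=CountSketch, parent_entity="customers")
from_instance = FeatureSketch("logs.value", primitive=CountSketch(), parent_entity="customers")
assert isinstance(from_class.primitive, CountSketch)
assert isinstance(from_instance.primitive, CountSketch)
```

Either way, the feature ends up holding an instance, so downstream code only has to deal with one case.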

The API for calling DFS doesn't change, except that primitives can now be passed with arguments:

ft.dfs(target_entity="customers",
       entityset=es,
       agg_primitives=["count", Sum],
       trans_primitives=[CountString(string="coffee")])
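
The call above mixes a primitive name, a primitive class, and a configured instance. A minimal sketch of how such a mixed list could be normalized into instances follows. The registry, the `resolve_primitives` helper, and the `*Sketch` classes are all hypothetical and exist only for this example; they are not Featuretools functions.

```python
import inspect


class CountSketch:
    name = "count"


class SumSketch:
    name = "sum"


class CountStringSketch:
    name = "count_string"

    def __init__(self, string="the"):
        self.string = string


# Hypothetical name -> class registry used to resolve string entries.
REGISTRY = {cls.name: cls for cls in (CountSketch, SumSketch, CountStringSketch)}


def resolve_primitives(primitives):
    """Turn names, classes, and instances into primitive instances."""
    resolved = []
    for p in primitives:
        if isinstance(p, str):
            if p not in REGISTRY:
                raise ValueError("Unknown primitive name: %r" % p)
            p = REGISTRY[p]
        if inspect.isclass(p):
            p = p()  # fall back to the primitive's default arguments
        resolved.append(p)
    return resolved


agg = resolve_primitives(["count", SumSketch])
trans = resolve_primitives([CountStringSketch(string="coffee")])
assert [type(p).__name__ for p in agg] == ["CountSketch", "SumSketch"]
assert trans[0].string == "coffee"
```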

@codecov

codecov bot commented Jan 7, 2019

Codecov Report

Merging #364 into master will increase coverage by 0.2%.
The diff coverage is 97.48%.

@@            Coverage Diff            @@
##           master     #364     +/-   ##
=========================================
+ Coverage   95.33%   95.53%   +0.2%     
=========================================
  Files          86       89      +3     
  Lines        8032     7555    -477     
=========================================
- Hits         7657     7218    -439     
+ Misses        375      337     -38
Impacted Files                                          Coverage Δ
featuretools/primitives/api.py                          100% <ø> (ø) ⬆️
featuretools/wrappers/sklearn.py                        95.65% <ø> (ø) ⬆️
featuretools/synthesis/dfs.py                           100% <ø> (ø) ⬆️
featuretools/utils/pickle_utils.py                      100% <ø> (ø) ⬆️
featuretools/selection/variance_selection.py            0% <ø> (ø) ⬆️
featuretools/synthesis/encode_features.py               98.03% <ø> (ø) ⬆️
featuretools/selection/selection.py                     100% <ø> (ø) ⬆️
featuretools/synthesis/deep_feature_synthesis.py        93.52% <100%> (+0.06%) ⬆️
featuretools/feature_base/api.py                        100% <100%> (ø)
...aturetools/tests/entityset_tests/test_timedelta.py   100% <100%> (ø) ⬆️
... and 32 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ffb8081...40946ba.

kmax12 and others added 2 commits January 7, 2019 18:53
* remove incorrect commutative attributes

* handle rsub override and test reverse overrides

* rename weekend primitive is_weekend

* updated weekend to is_weekend in docs

* test values for scalar_subtract_numeric

* rename subtract_numeric and scalar_subtract_numeric to subtract_numeric_feature and scalar_subtract_numeric_feature

* revert subtract_numeric_feature to subtract_numeric
@kmax12 kmax12 changed the title from "[WIP] Primitive refactor" to "Primitive refactor" Jan 16, 2019
@kmax12 kmax12 requested a review from rwedge January 16, 2019 21:56
Args:
    entity (Entity): entity this feature is being calculated for
    base_features (list[FeatureBase]): list of base features for primitive
    primitive (): primitive to calculate. if not initialized when passed, gets initialized with no arguments
Contributor

missing the type information for primitive

# if self.use_previous and self.use_previous.is_absolute():
#     entity = self.entity
#     time_var = IdentityFeature(entity[entity.time_index])
#     deps += [time_var]
Contributor

This has been commented out for a while, I think we should either remove it or make an issue about it.

Contributor Author

the time index gets added automatically, so we don't need to add the time index feature as a dependency. removed


base_entity = set([f.entity for f in base_features])
assert len(base_entity) == 1, \
    "More than one entity for base features"
Contributor

There are potentially two checks in this init for whether the base features share the same entity.

Contributor Author

@kmax12 kmax12 Jan 18, 2019

good catch. fixed
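
For reference, the same-entity check discussed in this thread could live in a single helper, roughly as sketched here. `BaseFeatureSketch` and `single_entity` are hypothetical names for illustration, and this is not the code that was merged.

```python
from collections import namedtuple

# Hypothetical stand-in for a base feature; only the entity matters here.
BaseFeatureSketch = namedtuple("BaseFeatureSketch", ["name", "entity"])


def single_entity(base_features):
    """Return the one entity shared by all base features, or raise."""
    entities = {f.entity for f in base_features}
    if len(entities) != 1:
        raise ValueError("Base features must share exactly one entity")
    return entities.pop()


features = [BaseFeatureSketch("value", "logs"), BaseFeatureSketch("comment", "logs")]
assert single_entity(features) == "logs"
```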

featuretools/feature_base/feature_base.py Outdated
seed_feature_sessions = Count(es['log']["id"], es['sessions']) > 2
seed_feature_log = Hour(es['log']['datetime'])
session_agg = Last(seed_feature_log, es['sessions'])
seed_feature_sessions = ft.Feature(es['log']["id"], parent_entity=es['sessions'], primitive=Count)
Contributor

I don't think it changes the test but seed_feature_sessions is missing the > 2

Contributor Author

ya, i noticed that. i dont think adding the >2 improved the test so i simplified it

@rwedge rwedge merged commit 36ce3c3 into master Jan 18, 2019
@rwedge rwedge mentioned this pull request Jan 30, 2019
@rwedge rwedge deleted the primitive-refactor branch February 19, 2021 21:24