New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple output column features with primitive refactor #376

Merged
merged 61 commits into from Jan 19, 2019

Conversation

Projects
None yet
2 participants
@rwedge
Copy link
Contributor

rwedge commented Jan 16, 2019

This pull request updates the API for handling features that output multiple columns of values. In the past these were referred to in the code as "expanding" features. A brief rundown of the changes:

  • Multiple columns in the feature matrix can now correspond to the same feature. Where before a feature with multiple outputs would have a list of values per instance, these features now get one column per value.

  • Column names do not always match a feature name. With features having multiple columns it's necessary to modify the column names to distinguish between the various output columns of a feature.

  • feature.expanding (bool) replaced with feature.primitive.number_output_features (int). Instead of a feature either being expanding or not, it has an integer corresponding to the number of output features (columns) it has.

  • feature.get_feature_names() is the correct way to get the names of columns in a feature matrix

  • feature.default_value is a singular value that becomes the default for all output features. Instead of a list of default values, one for each output feature, each output features shares the same default value.

  • new keyword argument for make_primitive functions

    • number_output_features (int): sets the number of output features for a custom primitive
  • features with multiple outputs can now be used as direct features

kmax12 and others added some commits Dec 31, 2018

Primitive refactor updates (#365)
* remove incorrect commutative attributes

* handle rsub override and test reverse overrides

* rename weekend primitive is_weekend

* updated weekend to is_weekend in docs

* test values for scalar_subtract_numeric

* rename subtract_numeric and scalar_subtract_numeric to subtract_numeric_feature and scalar_subtract_numeric_feature

* revert subtract_numeric_feature to subtract_numeric
@codecov

This comment has been minimized.

Copy link

codecov bot commented Jan 16, 2019

Codecov Report

Merging #376 into master will increase coverage by 0.21%.
The diff coverage is 98.8%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #376      +/-   ##
==========================================
+ Coverage   95.56%   95.78%   +0.21%     
==========================================
  Files          89       89              
  Lines        7555     7730     +175     
==========================================
+ Hits         7220     7404     +184     
+ Misses        335      326       -9
Impacted Files Coverage Δ
featuretools/synthesis/deep_feature_synthesis.py 94.49% <100%> (+0.96%) ⬆️
...aturetools/tests/primitive_tests/test_agg_feats.py 99.11% <100%> (+0.07%) ⬆️
...etools/primitives/base/transform_primitive_base.py 100% <100%> (ø) ⬆️
featuretools/feature_base/feature_base.py 95.81% <100%> (+0.77%) ⬆️
...utational_backend/test_calculate_feature_matrix.py 99.3% <100%> (ø) ⬆️
featuretools/primitives/base/primitive_base.py 100% <100%> (ø) ⬆️
...ols/tests/dfs_tests/test_deep_feature_synthesis.py 98.32% <100%> (+0.05%) ⬆️
...ools/primitives/base/aggregation_primitive_base.py 100% <100%> (ø) ⬆️
...ests/computational_backend/test_encode_features.py 100% <100%> (ø) ⬆️
...ools/primitives/standard/aggregation_primitives.py 95.45% <100%> (+0.34%) ⬆️
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 36ce3c3...e5d253c. Read the comment docs.

self.base_feature = base_feature
self.primitive = base_feature.primitive

This comment has been minimized.

@kmax12

kmax12 Jan 16, 2019

Member

hmm, not sure about this. i would this our code shouldn't ever try access the primitive property of a direct feature

return np.nan

def get_function(self):
def pd_topn(x, n=self.number_output_features):

This comment has been minimized.

@kmax12

kmax12 Jan 16, 2019

Member

i think we can rename pd_topn to just n_most_common

self.number_output_features = n

@property
def default_value(self):

This comment has been minimized.

@kmax12

kmax12 Jan 16, 2019

Member

do we need this? i think primitive base already has default value as np.nan

name = "n_most_common"
input_types = [Discrete]
return_type = Discrete
stack_on = []

This comment has been minimized.

@kmax12

kmax12 Jan 16, 2019

Member

i think we want to let NMostCommon stack

def pd_topn(x, n=self.number_output_features):
array = np.array(x.value_counts()[:n].index)
if len(array) < n:
filler = np.full(n - len(array), self.default_value)

This comment has been minimized.

@kmax12

kmax12 Jan 16, 2019

Member

let's not use self.default_value here

if f.primitive.number_output_features > 1:
names = f.get_feature_names()
assert len(names) == len(values)
d.update(dict(zip(names, values)))

This comment has been minimized.

@kmax12

kmax12 Jan 16, 2019

Member

any reason use use DataFrame.update here instead of the loop approach from _calculate_transform_features?

for name, value in zip(names, values):
    frame[name] = strip_values_if_series(value)
Show resolved Hide resolved featuretools/synthesis/deep_feature_synthesis.py
@@ -499,3 +524,8 @@ def test_empty_child_dataframe():
fm2 = ft.calculate_feature_matrix(entityset=es, features=[count_where, trend_where], cutoff_time=pd.Timestamp("1/4/2018"))
names = [count_where.get_name(), trend_where.get_name()]
assert_array_equal(fm2[names], [[0, np.nan]])


def test_computes_direct_features_of_multi_ouput_features():

This comment has been minimized.

@kmax12

kmax12 Jan 16, 2019

Member

let's create an issue for this

This comment has been minimized.

@rwedge

rwedge Jan 17, 2019

Author Contributor

this got implemented in test_direct_features.py so I'll remove it here

@@ -44,3 +58,66 @@ def test_direct_rename(es):
assert feat.get_name() != copy_feat.get_name()
assert feat.base_features[0].generate_name() == copy_feat.base_features[0].generate_name()
assert feat.entity == copy_feat.entity


def test_direct_of_multi_output_transform_feat(es):

This comment has been minimized.

@kmax12

kmax12 Jan 16, 2019

Member

maybe add that this is just testing naming

fm, fl = dfs(
entityset=es,
target_entity="sessions",
trans_primitives=[TestTime, Year, Month, Day, Hour, Minute, Second])

This comment has been minimized.

@kmax12

kmax12 Jan 16, 2019

Member

i think you can do features_only=True here. same for test case below

@rwedge rwedge changed the base branch from primitive-refactor to master Jan 18, 2019

@rwedge rwedge merged commit f80b5ae into master Jan 19, 2019

3 checks passed

codecov/patch 98.8% of diff hit (target 95.56%)
Details
codecov/project 95.78% (+0.21%) compared to 36ce3c3
Details
license/cla Contributor License Agreement is signed.
Details

@rwedge rwedge referenced this pull request Jan 30, 2019

Merged

v0.6.0 #387

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment