Skip to content

Multiple output column features with primitive refactor #376

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 61 commits into from
Jan 19, 2019

Conversation

rwedge
Copy link
Contributor

@rwedge rwedge commented Jan 16, 2019

This pull request updates the API for handling features that output multiple columns of values. In the past these were referred to in the code as "expanding" features. A brief rundown of the changes:

  • Multiple columns in the feature matrix can now correspond to the same feature. Where before a feature with multiple outputs would have a list of values per instance, these features now get one column per value.

  • Column names do not always match a feature name. With features having multiple columns it's necessary to modify the column names to distinguish between the various output columns of a feature.

  • feature.expanding (bool) replaced with feature.primitive.number_output_features (int). Instead of a feature either being expanding or not, it has an integer corresponding to the number of output features (columns) it has.

  • feature.get_feature_names() is the correct way to get the names of columns in a feature matrix

  • feature.default_value is a singular value that becomes the default for all output features. Instead of a list of default values, one for each output feature, each output features shares the same default value.

  • new keyword argument for make_primitive functions

    • number_output_features (int): sets the number of output features for a custom primitive
  • features with multiple outputs can now be used as direct features

@codecov
Copy link

codecov bot commented Jan 16, 2019

Codecov Report

Merging #376 into master will increase coverage by 0.21%.
The diff coverage is 98.8%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #376      +/-   ##
==========================================
+ Coverage   95.56%   95.78%   +0.21%     
==========================================
  Files          89       89              
  Lines        7555     7730     +175     
==========================================
+ Hits         7220     7404     +184     
+ Misses        335      326       -9
Impacted Files Coverage Δ
featuretools/synthesis/deep_feature_synthesis.py 94.49% <100%> (+0.96%) ⬆️
...aturetools/tests/primitive_tests/test_agg_feats.py 99.11% <100%> (+0.07%) ⬆️
...etools/primitives/base/transform_primitive_base.py 100% <100%> (ø) ⬆️
featuretools/feature_base/feature_base.py 95.81% <100%> (+0.77%) ⬆️
...utational_backend/test_calculate_feature_matrix.py 99.3% <100%> (ø) ⬆️
featuretools/primitives/base/primitive_base.py 100% <100%> (ø) ⬆️
...ols/tests/dfs_tests/test_deep_feature_synthesis.py 98.32% <100%> (+0.05%) ⬆️
...ools/primitives/base/aggregation_primitive_base.py 100% <100%> (ø) ⬆️
...ests/computational_backend/test_encode_features.py 100% <100%> (ø) ⬆️
...ools/primitives/standard/aggregation_primitives.py 95.45% <100%> (+0.34%) ⬆️
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 36ce3c3...e5d253c. Read the comment docs.

self.base_feature = base_feature
self.primitive = base_feature.primitive
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, not sure about this. i would this our code shouldn't ever try access the primitive property of a direct feature

return np.nan

def get_function(self):
def pd_topn(x, n=self.number_output_features):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we can rename pd_topn to just n_most_common

self.number_output_features = n

@property
def default_value(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this? i think primitive base already has default value as np.nan

name = "n_most_common"
input_types = [Discrete]
return_type = Discrete
stack_on = []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we want to let NMostCommon stack

def pd_topn(x, n=self.number_output_features):
array = np.array(x.value_counts()[:n].index)
if len(array) < n:
filler = np.full(n - len(array), self.default_value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's not use self.default_value here

if f.primitive.number_output_features > 1:
names = f.get_feature_names()
assert len(names) == len(values)
d.update(dict(zip(names, values)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason use use DataFrame.update here instead of the loop approach from _calculate_transform_features?

for name, value in zip(names, values):
    frame[name] = strip_values_if_series(value)

@@ -499,3 +524,8 @@ def test_empty_child_dataframe():
fm2 = ft.calculate_feature_matrix(entityset=es, features=[count_where, trend_where], cutoff_time=pd.Timestamp("1/4/2018"))
names = [count_where.get_name(), trend_where.get_name()]
assert_array_equal(fm2[names], [[0, np.nan]])


def test_computes_direct_features_of_multi_ouput_features():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's create an issue for this

Copy link
Contributor Author

@rwedge rwedge Jan 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this got implemented in test_direct_features.py so I'll remove it here

@@ -44,3 +58,66 @@ def test_direct_rename(es):
assert feat.get_name() != copy_feat.get_name()
assert feat.base_features[0].generate_name() == copy_feat.base_features[0].generate_name()
assert feat.entity == copy_feat.entity


def test_direct_of_multi_output_transform_feat(es):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add that this is just testing naming

fm, fl = dfs(
entityset=es,
target_entity="sessions",
trans_primitives=[TestTime, Year, Month, Day, Hour, Minute, Second])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think you can do features_only=True here. same for test case below

@rwedge rwedge changed the base branch from primitive-refactor to master January 18, 2019 23:39
@rwedge rwedge merged commit f80b5ae into master Jan 19, 2019
@rwedge rwedge mentioned this pull request Jan 30, 2019
@rwedge rwedge deleted the multi-output-primitive-refactor branch February 19, 2021 21:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants