-
Notifications
You must be signed in to change notification settings - Fork 896
Multiple output column features with primitive refactor #376
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
* remove incorrect commutative attributes * handle rsub override and test reverse overrides * rename weekend primitive is_weekend * updated weekend to is_weekend in docs * test values for scalar_subtract_numeric * rename subtract_numeric and scalar_subtract_numeric to subtract_numeric_feature and scalar_subtract_numeric_feature * revert subtract_numeric_feature to subtract_numeric
Codecov Report
@@ Coverage Diff @@
## master #376 +/- ##
==========================================
+ Coverage 95.56% 95.78% +0.21%
==========================================
Files 89 89
Lines 7555 7730 +175
==========================================
+ Hits 7220 7404 +184
+ Misses 335 326 -9
Continue to review full report at Codecov.
|
self.base_feature = base_feature | ||
self.primitive = base_feature.primitive |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, not sure about this. i would this our code shouldn't ever try access the primitive property of a direct feature
return np.nan | ||
|
||
def get_function(self): | ||
def pd_topn(x, n=self.number_output_features): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we can rename pd_topn
to just n_most_common
self.number_output_features = n | ||
|
||
@property | ||
def default_value(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need this? i think primitive base already has default value as np.nan
name = "n_most_common" | ||
input_types = [Discrete] | ||
return_type = Discrete | ||
stack_on = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we want to let NMostCommon stack
def pd_topn(x, n=self.number_output_features): | ||
array = np.array(x.value_counts()[:n].index) | ||
if len(array) < n: | ||
filler = np.full(n - len(array), self.default_value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's not use self.default_value here
if f.primitive.number_output_features > 1: | ||
names = f.get_feature_names() | ||
assert len(names) == len(values) | ||
d.update(dict(zip(names, values))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason use use DataFrame.update here instead of the loop approach from _calculate_transform_features
?
for name, value in zip(names, values):
frame[name] = strip_values_if_series(value)
@@ -499,3 +524,8 @@ def test_empty_child_dataframe(): | |||
fm2 = ft.calculate_feature_matrix(entityset=es, features=[count_where, trend_where], cutoff_time=pd.Timestamp("1/4/2018")) | |||
names = [count_where.get_name(), trend_where.get_name()] | |||
assert_array_equal(fm2[names], [[0, np.nan]]) | |||
|
|||
|
|||
def test_computes_direct_features_of_multi_ouput_features(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's create an issue for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this got implemented in test_direct_features.py so I'll remove it here
@@ -44,3 +58,66 @@ def test_direct_rename(es): | |||
assert feat.get_name() != copy_feat.get_name() | |||
assert feat.base_features[0].generate_name() == copy_feat.base_features[0].generate_name() | |||
assert feat.entity == copy_feat.entity | |||
|
|||
|
|||
def test_direct_of_multi_output_transform_feat(es): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add that this is just testing naming
fm, fl = dfs( | ||
entityset=es, | ||
target_entity="sessions", | ||
trans_primitives=[TestTime, Year, Month, Day, Hour, Minute, Second]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think you can do features_only=True
here. same for test case below
This pull request updates the API for handling features that output multiple columns of values. In the past these were referred to in the code as "expanding" features. A brief rundown of the changes:
Multiple columns in the feature matrix can now correspond to the same feature. Where before a feature with multiple outputs would have a list of values per instance, these features now get one column per value.
Column names do not always match a feature name. With features having multiple columns it's necessary to modify the column names to distinguish between the various output columns of a feature.
feature.expanding (bool) replaced with feature.primitive.number_output_features (int). Instead of a feature either being expanding or not, it has an integer corresponding to the number of output features (columns) it has.
feature.get_feature_names() is the correct way to get the names of columns in a feature matrix
feature.default_value is a singular value that becomes the default for all output features. Instead of a list of default values, one for each output feature, each output features shares the same default value.
new keyword argument for make_primitive functions
features with multiple outputs can now be used as direct features