Groupby transform feature #455
Conversation
Codecov Report
@@ Coverage Diff @@
## master #455 +/- ##
==========================================
+ Coverage 96.24% 96.29% +0.05%
==========================================
Files 102 103 +1
Lines 8724 8845 +121
==========================================
+ Hits 8396 8517 +121
Misses 328 328
Continue to review full report at Codecov.
@@ -0,0 +1,292 @@
import numpy as np
let's name this file test_groupby_transform_primitives.py
        set_default_column(frame, f)
        continue

    groupby = f.groupby.get_name()
if we make the change to the feature tree, we can ensure that all features have the same groupby once they get to this step. This means we can do the iteration of for index, group in frame_data.groupby(groupby) only once per group and calculate all the features at once
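The suggested restructuring could look roughly like this. This is a hedged sketch, not the real backend code: frame_data, the feature list, and the "value" column name are stand-ins, and each feature is represented by a (name, function) pair.

```python
import pandas as pd

def calculate_grouped_features(frame_data, features, groupby):
    # Sketch: iterate over each group exactly once and apply every
    # feature function to that group, instead of re-grouping per feature.
    results = {name: [] for name, _ in features}
    for _, group in frame_data.groupby(groupby):
        for name, func in features:
            results[name].append(func(group["value"]))
    # Reassemble each feature column, restoring the original row order
    combined = {name: pd.concat(parts) for name, parts in results.items()}
    return pd.DataFrame(combined).sort_index()

df = pd.DataFrame({"customer": ["a", "a", "b"], "value": [1, 2, 3]})
out = calculate_grouped_features(df, [("cum_sum", lambda s: s.cumsum())], "customer")
```

Because each per-group result keeps the original index, concatenating and sorting restores row alignment with the input frame.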
@@ -277,7 +280,9 @@ def generate_default_df(self, instance_ids, extra_columns=None):
        return default_df

    def _feature_type_handler(self, f):
        if isinstance(f, TransformFeature):
        if isinstance(f, GroupByTransformFeature):
might be worth leaving a comment saying the GroupByTransformFeature check has to come first since it is a subclass of TransformFeature, unless we change it as suggested above
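The ordering matters because isinstance on a subclass instance also matches the parent class. A minimal illustration with stand-in classes (not the real featuretools ones):

```python
class TransformFeature:
    pass

class GroupByTransformFeature(TransformFeature):
    pass

def feature_type_handler(f):
    # The subclass check must come first: a GroupByTransformFeature is
    # also a TransformFeature, so checking the parent class first would
    # shadow the more specific branch.
    if isinstance(f, GroupByTransformFeature):
        return "groupby_transform"
    if isinstance(f, TransformFeature):
        return "transform"
    return "other"
```

Swapping the two checks would route every GroupByTransformFeature through the plain-transform branch.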
        self.groupby = groupby
        super(GroupByTransformFeature, self).__init__(base_features,
                                                      primitive=primitive)
        self.base_features.append(groupby)
why not add this to base_features before calling the super method?
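The reviewer's suggestion, sketched with simplified stand-in classes (the real __init__ signatures have more arguments):

```python
class TransformFeature:
    def __init__(self, base_features, primitive=None):
        # The parent may derive state (e.g. a generated name) from
        # base_features, so it should see the final, complete list.
        self.base_features = list(base_features)
        self.primitive = primitive

class GroupByTransformFeature(TransformFeature):
    def __init__(self, base_features, primitive, groupby):
        self.groupby = groupby
        # Append groupby before calling super so the parent initializer
        # sees the complete base feature list, instead of mutating
        # self.base_features afterwards.
        super().__init__(list(base_features) + [groupby],
                         primitive=primitive)
```

This keeps the invariant that base_features is final once the parent constructor runs.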
also, not sure it is worth subclassing TransformFeature. Doesn't seem like they actually share much.
@@ -31,67 +29,51 @@ class CumCount(TransformPrimitive):

    def get_function(self):
        def cum_count(values):
            return values.groupby(values).cumcount() + 1
            return pd.Series(range(1, len(values) + 1), index=values.index)
another option that is easier to read would be the following, though not sure which is more performant:
values.iloc[:] = range(1, len(values) + 1)
in case anyone's curious, I ran some benchmarks
In [1]: import pandas as pd
...: values = pd.Series(range(100000))
In [2]: %timeit pd.Series(range(1, len(values) + 1), index=values.index)
118 µs ± 1.43 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [3]: %timeit values.iloc[:] = range(1, len(values) + 1)
54 ms ± 916 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [4]: %timeit np.arange(1, len(values))
50.1 µs ± 555 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [5]: %timeit pd.Series(np.arange(1, len(values) + 1), index=values.index)
107 µs ± 835 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
i think we should just do the 3rd option since pandas backend can handle the conversion to a series for us
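The third option from the benchmark, applied to the cum_count function (a sketch; the surrounding primitive class is omitted, and returning a bare ndarray assumes the backend wraps results in a Series as the comment suggests):

```python
import numpy as np

def cum_count(values):
    # Return a bare ndarray of 1..n; per the benchmarks above this is
    # the fastest option, and the pandas backend can handle the
    # conversion to a Series with the right index.
    return np.arange(1, len(values) + 1)
```

Note the benchmark line for np.arange omitted the + 1 on the upper bound; the version here keeps the same length and values as the Series-based implementations.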
        if not isinstance(values, pd.Series):
            values = pd.Series(values, index=variable_data[0].index)
        group_values[f.hash()].append(values)
let's try doing just
frame[f.get_name()].update(values)
then we can remove null handling below
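The suggested update call relies on pandas aligning on the index, which is why it can replace the explicit null handling. A minimal illustration (the column values and indices are hypothetical):

```python
import pandas as pd

# Feature column pre-filled with defaults, indexed by instance id
feat = pd.Series([0.0, 0.0, 0.0], index=[0, 1, 2])

# Values computed for one group, indexed by the original rows
values = pd.Series([10.0, 30.0], index=[0, 2])

# update aligns on index and only overwrites matching rows,
# leaving unmatched rows (here index 1) untouched
feat.update(values)
```

Since update fills in only the rows it has values for, rows outside the group keep their defaults and no separate null pass is needed.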
        else:
            values = feature_func(*variable_data)

        if not isinstance(values, pd.Series):
        # make sure index is aligned for update call below
        if isinstance(values, pd.Series):
            values.index = variable_data[0].index
        else:
            values = pd.Series(values, index=variable_data[0].index)
        group_values[f.hash()].append(values)
                                                      primitive=primitive)

    def copy(self):
        return GroupByTransformFeature(self.base_features[:-1],
add comment about excluding groupby
                                       self.groupby)

    def generate_name(self):
        base_names = [bf.get_name() for bf in self.base_features[:-1]]
add comment about excluding groupby
…ng all base features
Looks good to me
This PR creates a new feature class: GroupByTransformFeature. The idea behind this new class of features is to simplify writing primitives like CumulativeSum by handling the grouping necessary for the calculation elsewhere. In order to calculate the cumulative sum of a customer's purchases, you would first need to group the purchases by customer. With this new GroupByTransformFeature class, the primitive function no longer needs to account for this. It can assume the data passed to it has already been grouped.
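The motivating example, sketched with pandas directly (the column names are illustrative, not the PR's API):

```python
import pandas as pd

purchases = pd.DataFrame({
    "customer": ["a", "b", "a", "a", "b"],
    "amount": [5, 2, 3, 1, 4],
})

# Without GroupByTransformFeature, the primitive would have to do the
# grouping itself; with it, the data arrives pre-grouped and the
# primitive body reduces to a plain cumulative sum per group.
purchases["cum_amount"] = purchases.groupby("customer")["amount"].cumsum()
```

Each customer's running total advances independently while the original row order is preserved.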