Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Grouby transform feature #455

Merged
merged 34 commits into from
Mar 25, 2019
Merged

Grouby transform feature #455

merged 34 commits into from
Mar 25, 2019

Conversation

rwedge
Copy link
Contributor

@rwedge rwedge commented Mar 6, 2019

This PR creates a new feature class: GroupByTransformFeature

The idea behind this new class of features is to simplify writing primitives like CumulativeSum by handling the grouping necessary for the calculation elsewhere. In order to calculate the cumulative sum of a customer's purchases, you would first need to group the purchases by customer. With this new GroupByTransformFeature class, the primitive function no longer needs to account for this. It can assume the data passed to it has already been grouped.

@codecov
Copy link

codecov bot commented Mar 6, 2019

Codecov Report

Merging #455 into master will increase coverage by 0.05%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #455      +/-   ##
==========================================
+ Coverage   96.24%   96.29%   +0.05%     
==========================================
  Files         102      103       +1     
  Lines        8724     8845     +121     
==========================================
+ Hits         8396     8517     +121     
  Misses        328      328
Impacted Files Coverage Δ
featuretools/feature_base/api.py 100% <ø> (ø) ⬆️
...s/tests/primitive_tests/test_transform_features.py 98.08% <ø> (-0.45%) ⬇️
...turetools/computational_backends/pandas_backend.py 98.36% <100%> (+0.16%) ⬆️
...imitive_tests/test_groupby_transform_primitives.py 100% <100%> (ø)
...tools/primitives/standard/cum_transform_feature.py 100% <100%> (ø) ⬆️
featuretools/feature_base/feature_base.py 96.46% <100%> (+0.24%) ⬆️
...eaturetools/computational_backends/feature_tree.py 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0e2e5de...49aa411. Read the comment docs.

@rwedge rwedge requested a review from kmax12 March 7, 2019 18:49
@@ -0,0 +1,292 @@
import numpy as np
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's name this file test_groupby_transform_primitives.py

set_default_column(frame, f)
continue

groupby = f.groupby.get_name()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we make the change to feature tree, we can ensure that all features have the same groupby once they get to this step. this means we can do the iterations of for index, group in frame_data.groupby(groupby) only once per group and calculate all the features at once

@@ -277,7 +280,9 @@ def generate_default_df(self, instance_ids, extra_columns=None):
return default_df

def _feature_type_handler(self, f):
if isinstance(f, TransformFeature):
if isinstance(f, GroupByTransformFeature):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be worth leaving a comment saying the GroupByTransformFeature has to come first since it is a subclass of TransformFeature, unless we change it as suggested above

self.groupby = groupby
super(GroupByTransformFeature, self).__init__(base_features,
primitive=primitive)
self.base_features.append(groupby)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not add this to base_features before calling the super method?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, not sure it is worth subclassing TransformFeature. Doesn't seem like they actually share much.

@@ -31,67 +29,51 @@ class CumCount(TransformPrimitive):

def get_function(self):
def cum_count(values):
return values.groupby(values).cumcount() + 1
return pd.Series(range(1, len(values) + 1), index=values.index)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another option that is easier to read would be. not sure which is more performant though

values.iloc[:] = range(1, len(values) + 1)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in case anyone's curious, I ran some benchmarks

In [1]: import pandas as pd 
   ...: values = pd.Series(range(100000))                                                                                                                                                              

In [2]: %timeit pd.Series(range(1, len(values) + 1), index=values.index)                                                                                                                               
118 µs ± 1.43 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [3]: %timeit values.iloc[:] = range(1, len(values) + 1)                                                                                                                                             
54 ms ± 916 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit np.arange(1, len(values))                                                                                                                                                                                                  
50.1 µs ± 555 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [5]: %timeit pd.Series(np.arange(1, len(values) + 1), index=values.index)                                                                                                                                                           
107 µs ± 835 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

i think we should just do the 3rd option since pandas backend can handle the conversion to a series for us

@@ -31,67 +29,51 @@ class CumCount(TransformPrimitive):

def get_function(self):
def cum_count(values):
return values.groupby(values).cumcount() + 1
return pd.Series(range(1, len(values) + 1), index=values.index)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in case anyone's curious, I ran some benchmarks

In [1]: import pandas as pd 
   ...: values = pd.Series(range(100000))                                                                                                                                                              

In [2]: %timeit pd.Series(range(1, len(values) + 1), index=values.index)                                                                                                                               
118 µs ± 1.43 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [3]: %timeit values.iloc[:] = range(1, len(values) + 1)                                                                                                                                             
54 ms ± 916 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit np.arange(1, len(values))                                                                                                                                                                                                  
50.1 µs ± 555 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [5]: %timeit pd.Series(np.arange(1, len(values) + 1), index=values.index)                                                                                                                                                           
107 µs ± 835 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

i think we should just do the 3rd option since pandas backend can handle the conversion to a series for us


if not isinstance(values, pd.Series):
values = pd.Series(values, index=variable_data[0].index)
group_values[f.hash()].append(values)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's try doing just

frame[f.get_name()].update(values)

then we can remove null handling below

else:
values = feature_func(*variable_data)

if not isinstance(values, pd.Series):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# make sure index is aligned for update call below
if isinstance(values, pd.Series):
    values.index = variable_data[0].index
else:
    values = pd.Series(values, index=variable_data[0].index)
    group_values[f.hash()].append(values)

primitive=primitive)

def copy(self):
return GroupByTransformFeature(self.base_features[:-1],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add comment about excluding groupby

self.groupby)

def generate_name(self):
base_names = [bf.get_name() for bf in self.base_features[:-1]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add comment about excluding groupby

@rwedge rwedge mentioned this pull request Mar 22, 2019
Copy link
Contributor

@kmax12 kmax12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

@rwedge rwedge merged commit a45d285 into master Mar 25, 2019
@rwedge rwedge mentioned this pull request Mar 29, 2019
@rwedge rwedge deleted the grouby-transform-feature branch February 19, 2021 21:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants