Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Grouby transform feature #455

Merged
merged 34 commits into from Mar 25, 2019

Conversation

Projects
None yet
2 participants
@rwedge
Copy link
Contributor

commented Mar 6, 2019

This PR creates a new feature class: GroupByTransformFeature

The idea behind this new class of features is to simplify writing primitives like CumulativeSum by handling the grouping necessary for the calculation elsewhere. In order to calculate the cumulative sum of a customer's purchases, you would first need to group the purchases by customer. With this new GroupByTransformFeature class, the primitive function no longer needs to account for this. It can assume the data passed to it has already been grouped.

@codecov

This comment has been minimized.

Copy link

commented Mar 6, 2019

Codecov Report

Merging #455 into master will increase coverage by 0.05%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #455      +/-   ##
==========================================
+ Coverage   96.24%   96.29%   +0.05%     
==========================================
  Files         102      103       +1     
  Lines        8724     8845     +121     
==========================================
+ Hits         8396     8517     +121     
  Misses        328      328
Impacted Files Coverage Δ
featuretools/feature_base/api.py 100% <ø> (ø) ⬆️
...s/tests/primitive_tests/test_transform_features.py 98.08% <ø> (-0.45%) ⬇️
...turetools/computational_backends/pandas_backend.py 98.36% <100%> (+0.16%) ⬆️
...imitive_tests/test_groupby_transform_primitives.py 100% <100%> (ø)
...tools/primitives/standard/cum_transform_feature.py 100% <100%> (ø) ⬆️
featuretools/feature_base/feature_base.py 96.46% <100%> (+0.24%) ⬆️
...eaturetools/computational_backends/feature_tree.py 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0e2e5de...49aa411. Read the comment docs.

@rwedge rwedge requested a review from kmax12 Mar 7, 2019

@@ -0,0 +1,292 @@
import numpy as np

This comment has been minimized.

Copy link
@kmax12

kmax12 Mar 14, 2019

Member

let's name this file test_groupby_transform_primitives.py

Show resolved Hide resolved featuretools/computational_backends/feature_tree.py
set_default_column(frame, f)
continue

groupby = f.groupby.get_name()

This comment has been minimized.

Copy link
@kmax12

kmax12 Mar 14, 2019

Member

if we make the change to feature tree, we can ensure that all features have the same groupby once they get to this step. this means we can do the iterations of for index, group in frame_data.groupby(groupby) only once per group and calculate all the features at once

@@ -277,7 +280,9 @@ def generate_default_df(self, instance_ids, extra_columns=None):
return default_df

def _feature_type_handler(self, f):
if isinstance(f, TransformFeature):
if isinstance(f, GroupByTransformFeature):

This comment has been minimized.

Copy link
@kmax12

kmax12 Mar 14, 2019

Member

might be worth leaving a comment saying the GroupByTransformFeature has to come first since it is a subclass of TransformFeature, unless we change it as suggested above

self.groupby = groupby
super(GroupByTransformFeature, self).__init__(base_features,
primitive=primitive)
self.base_features.append(groupby)

This comment has been minimized.

Copy link
@kmax12

kmax12 Mar 14, 2019

Member

why not add this to base_features before calling the super method?

This comment has been minimized.

Copy link
@kmax12

kmax12 Mar 14, 2019

Member

also, not sure it is worth subclassing TransformFeature. Doesn't seem like they actually share much.

@@ -31,67 +29,51 @@ class CumCount(TransformPrimitive):

def get_function(self):
def cum_count(values):
return values.groupby(values).cumcount() + 1
return pd.Series(range(1, len(values) + 1), index=values.index)

This comment has been minimized.

Copy link
@kmax12

kmax12 Mar 14, 2019

Member

another option that is easier to read would be. not sure which is more performant though

values.iloc[:] = range(1, len(values) + 1)

This comment has been minimized.

Copy link
@kmax12

kmax12 Mar 22, 2019

Member

in case anyone's curious, I ran some benchmarks

In [1]: import pandas as pd 
   ...: values = pd.Series(range(100000))                                                                                                                                                              

In [2]: %timeit pd.Series(range(1, len(values) + 1), index=values.index)                                                                                                                               
118 µs ± 1.43 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [3]: %timeit values.iloc[:] = range(1, len(values) + 1)                                                                                                                                             
54 ms ± 916 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit np.arange(1, len(values))                                                                                                                                                                                                  
50.1 µs ± 555 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [5]: %timeit pd.Series(np.arange(1, len(values) + 1), index=values.index)                                                                                                                                                           
107 µs ± 835 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

i think we should just do the 3rd option since pandas backend can handle the conversion to a series for us

@@ -31,67 +29,51 @@ class CumCount(TransformPrimitive):

def get_function(self):
def cum_count(values):
return values.groupby(values).cumcount() + 1
return pd.Series(range(1, len(values) + 1), index=values.index)

This comment has been minimized.

Copy link
@kmax12

kmax12 Mar 22, 2019

Member

in case anyone's curious, I ran some benchmarks

In [1]: import pandas as pd 
   ...: values = pd.Series(range(100000))                                                                                                                                                              

In [2]: %timeit pd.Series(range(1, len(values) + 1), index=values.index)                                                                                                                               
118 µs ± 1.43 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [3]: %timeit values.iloc[:] = range(1, len(values) + 1)                                                                                                                                             
54 ms ± 916 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit np.arange(1, len(values))                                                                                                                                                                                                  
50.1 µs ± 555 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [5]: %timeit pd.Series(np.arange(1, len(values) + 1), index=values.index)                                                                                                                                                           
107 µs ± 835 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

i think we should just do the 3rd option since pandas backend can handle the conversion to a series for us

Show resolved Hide resolved featuretools/computational_backends/pandas_backend.py

if not isinstance(values, pd.Series):
values = pd.Series(values, index=variable_data[0].index)
group_values[f.hash()].append(values)

This comment has been minimized.

Copy link
@kmax12

kmax12 Mar 22, 2019

Member

let's try doing just

frame[f.get_name()].update(values)

then we can remove null handling below

else:
values = feature_func(*variable_data)

if not isinstance(values, pd.Series):

This comment has been minimized.

Copy link
@kmax12

kmax12 Mar 22, 2019

Member
# make sure index is aligned for update call below
if isinstance(values, pd.Series):
    values.index = variable_data[0].index
else:
    values = pd.Series(values, index=variable_data[0].index)
    group_values[f.hash()].append(values)
primitive=primitive)

def copy(self):
return GroupByTransformFeature(self.base_features[:-1],

This comment has been minimized.

Copy link
@kmax12

kmax12 Mar 22, 2019

Member

add comment about excluding groupby

self.groupby)

def generate_name(self):
base_names = [bf.get_name() for bf in self.base_features[:-1]]

This comment has been minimized.

Copy link
@kmax12

kmax12 Mar 22, 2019

Member

add comment about excluding groupby

@rwedge rwedge referenced this pull request Mar 22, 2019

Merged

Groupby dfs #472

kmax12 added some commits Mar 25, 2019

@kmax12

kmax12 approved these changes Mar 25, 2019

Copy link
Member

left a comment

Looks good to me

@rwedge rwedge merged commit a45d285 into master Mar 25, 2019

3 checks passed

codecov/patch 100% of diff hit (target 96.24%)
Details
codecov/project 96.29% (+0.05%) compared to 0e2e5de
Details
license/cla Contributor License Agreement is signed.
Details

@rwedge rwedge referenced this pull request Mar 29, 2019

Merged

v0.7.0 #477

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.