# Groupby
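* often we want to split a dataset into groups and work with each group separately
* pandas lets us iterate over the rows and columns of a dataframe, but row-by-row iteration is slow compared with vectorized operations
* pandas also supports `groupby()`, which follows a split-apply-combine pattern: split the data into groups, apply a function to each group, and combine the results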
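
To make the three steps concrete, here is a minimal sketch that performs split, apply, and combine by hand and then repeats the work with `groupby()`. The `toy` frame and its column names are invented for illustration and are not part of the datasets used below.

```python
import pandas as pd

# Toy data; the frame and column names are made up for this illustration.
toy = pd.DataFrame({
    "state": ["A", "A", "B", "B", "B"],
    "pop": [10, 30, 3, 6, 9],
})

# Split: one sub-frame per distinct key.
pieces = {key: toy[toy["state"] == key] for key in toy["state"].unique()}

# Apply: reduce each piece to a single number.
means = {key: piece["pop"].mean() for key, piece in pieces.items()}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(means, name="pop").rename_axis("state")

# groupby() performs the same three steps in one call.
grouped = toy.groupby("state")["pop"].mean()

print(manual)
print(grouped)
```

Both Series hold one average per state; the `groupby()` version simply hides the bookkeeping.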

## Splitting
* Let's get motivated first

In [None]:
import pandas as pd
import numpy as np
df = pd.read_csv('datasets/census.csv')
df = df[df['SUMLEV']==50]
df.head()

In [None]:
%%timeit -n 3
for state in df['STNAME'].unique():
    avg = np.average(df.where(df['STNAME']==state).dropna()['CENSUS2010POP'])
    print('Counties in state ' + state + ' have an average population of ' + str(avg))

In [None]:
%%timeit -n 3
for group, frame in df.groupby('STNAME'):
    avg = np.average(frame['CENSUS2010POP'])
    print('Counties in state ' + group + ' have an average population of ' + str(avg))
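
The loop above is only needed for the printing. As a minimal sketch, the per-state averages can also be computed in a single call by selecting the column on the grouped object and asking for its mean. The `counties` frame here is an invented stand-in for the census data that reuses its column names.

```python
import pandas as pd

# Hypothetical stand-in for the census frame: same column names, invented values.
counties = pd.DataFrame({
    "STNAME": ["Alabama", "Alabama", "Alaska", "Alaska"],
    "CENSUS2010POP": [100, 300, 50, 150],
})

# Split by state, average the selected column, combine into a Series.
state_means = counties.groupby("STNAME")["CENSUS2010POP"].mean()

for state, avg in state_means.items():
    print(f"Counties in state {state} have an average population of {avg}")
```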

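* ok, so groupby rocks: compare the two timings, the `groupby()` version should be noticeably faster
* usually you'll group by the values in a column, but you can also pass a function to `groupby()`. The function is called on each index label, and rows whose labels return the same value end up in the same group.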

In [None]:
df = df.set_index('STNAME')

def set_batch_number(item):
    if item[0]<'M':
        return 0
    if item[0]<'Q':
        return 1
    return 2

for group, frame in df.groupby(set_batch_number):
    print('There are ' + str(len(frame)) + ' records in group ' + str(group) + ' for processing.')

* we can also group by multiple columns

In [None]:
#airbnb data
df=pd.read_csv("datasets/listings.csv")
df.head()

In [None]:
df=df.set_index(["cancellation_policy","review_scores_value"])

# When we have a multiindex we need to pass in the levels we are interested in grouping by
for group, frame in df.groupby(level=(0,1)):
    print(group)

In [None]:
# We can also do this with a function; with a multiindex, each index entry is passed to it as a tuple
def grouping_fun(item):
    if item[1] == 10.0:
        return (item[0],"10.0")
    else:
        return (item[0],"not 10.0")

for group, frame in df.groupby(by=grouping_fun):
    print(group)
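
Index levels can be referred to by name as well as by position, which reads more clearly once there are several levels. A small sketch, assuming a toy frame (`listings`, with invented values) that mirrors the two index columns used above:

```python
import pandas as pd

# Invented rows; only the column names follow the listings data above.
listings = pd.DataFrame({
    "cancellation_policy": ["flexible", "flexible", "strict", "strict"],
    "review_scores_value": [10.0, 9.0, 10.0, 10.0],
    "price": [80, 120, 200, 150],
}).set_index(["cancellation_policy", "review_scores_value"])

# Grouping on level names instead of the positions 0 and 1.
counts = listings.groupby(
    level=["cancellation_policy", "review_scores_value"]
).size()

print(counts)
```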

## Applying
* so far we have only looked at splitting the data into groups
* there are three broad kinds of operation we can apply to those groups: aggregation, transformation, and filtering.

### Aggregation

In [None]:
# We should just be able to aggregate by calling .agg
df=df.reset_index()
df.groupby("cancellation_policy").agg({"review_scores_value":np.average})

In [None]:
# That didn't seem to work at all, we got NaN!
# The issue is in the function that we sent to agg: np.average does not ignore NaNs, but np.nanmean does.
df.groupby("cancellation_policy").agg({"review_scores_value":np.nanmean})
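
The difference between the two functions is easy to check on a handful of values. This is a sketch on invented scores, not the listings data; it also shows that pandas' own `"mean"` aggregation skips missing values by default.

```python
import numpy as np
import pandas as pd

# Invented scores with one missing value.
scores = pd.Series([8.0, 10.0, np.nan])

plain = np.average(scores.to_numpy())    # NaN propagates through the average
skipping = np.nanmean(scores.to_numpy()) # NaN is ignored
pandas_mean = scores.mean()              # pandas skips NaN by default

print(plain, skipping, pandas_mean)

# The same idea inside a groupby, using the string name of the aggregation.
toy = pd.DataFrame({
    "cancellation_policy": ["flexible", "flexible", "strict"],
    "review_scores_value": [8.0, np.nan, 10.0],
})
by_policy = toy.groupby("cancellation_policy").agg({"review_scores_value": "mean"})
print(by_policy)
```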

In [None]:
# We can just extend this dictionary to aggregate by multiple functions or multiple columns.
df.groupby("cancellation_policy").agg({"review_scores_value":(np.nanmean,np.nanstd),
                                      "reviews_per_month":np.nanmean})

### Transformation
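* Transformation applies the function you supply to each group and broadcasts the result back over the rows of that group. Instead of one row per group, you get back a new dataframe with the same index as the original.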

In [None]:
cols=['cancellation_policy','review_scores_value']
transform_df=df[cols].groupby('cancellation_policy').transform(np.nanmean)
transform_df.head()
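
The contrast with aggregation shows up in the shape of the result. A minimal sketch on an invented four-row frame: `agg` returns one value per group, while `transform` returns one value per original row.

```python
import pandas as pd

# Invented rows; only the column names follow the listings data.
listings = pd.DataFrame({
    "cancellation_policy": ["flexible", "strict", "flexible", "strict"],
    "review_scores_value": [10.0, 8.0, 9.0, 9.0],
})
grouped = listings.groupby("cancellation_policy")["review_scores_value"]

per_group = grouped.agg("mean")       # indexed by group: 2 rows
per_row = grouped.transform("mean")   # indexed like `listings`: 4 rows

print(per_group)
print(per_row)

# Because the index matches, the result can be assigned as a new column.
listings["mean_review_scores"] = per_row
```

Since the transformed result lines up with the original index, direct assignment like this is one alternative to merging on the index.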

In [None]:
# let's clean this up
transform_df.rename({'review_scores_value':'mean_review_scores'}, axis='columns', inplace=True)
# and merge back to our original dataframe
df=df.merge(transform_df, left_index=True, right_index=True)
df.head()

In [None]:
# Now we can compute the absolute difference between each row and its group (the cancellation policy) mean.
df['mean_diff']=np.absolute(df['review_scores_value']-df['mean_review_scores'])
df['mean_diff'].head()

### Filtering
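* You can also use `filter()` to drop data. It works a bit like `where()`, except that the function you pass returns a single True or False per group, and every row of a group that returns False is dropped.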

In [None]:
df.groupby('cancellation_policy').filter(lambda x: np.nanmean(x['review_scores_value'])>9.2)
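
A small sketch of that behavior on invented data: the function sees one group at a time, and a group is kept or dropped as a whole.

```python
import pandas as pd

# Invented rows; only the column names follow the listings data.
listings = pd.DataFrame({
    "cancellation_policy": ["flexible", "flexible", "strict", "strict"],
    "review_scores_value": [9.0, 10.0, 8.0, 10.0],
})

# flexible averages 9.5 (kept); strict averages 9.0 (all of its rows are dropped).
kept = listings.groupby("cancellation_policy").filter(
    lambda g: g["review_scores_value"].mean() > 9.2
)

print(kept)
```

Note that the strict row scoring 10.0 is dropped too, because the test is applied to the group and not to individual rows.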

### Applying
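* `apply()` is the most general option: it hands each group to your function as a dataframe and combines whatever the function returns
* this is 95% of what I actually do with groups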

In [None]:
df=pd.read_csv("datasets/listings.csv")
df=df[['cancellation_policy','review_scores_value']]
df.head()

In [None]:
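def calc_mean_review_scores(group):
    # work on a copy so the data passed in is not modified in place
    group=group.copy()
    # we can treat this as the complete dataframe
    avg=np.nanmean(group["review_scores_value"])
    # now broadcast our formula and create a new column; despite its name,
    # it holds each row's absolute difference from the group mean
    group["review_scores_mean"]=np.abs(avg-group["review_scores_value"])
    return group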

# Now just apply this to the groups
df.groupby('cancellation_policy').apply(calc_mean_review_scores).head()
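
The function given to `apply()` does not have to return a dataframe. As a sketch on invented data, the hypothetical helper `score_range` below returns a single number per group, and `apply()` combines those numbers into a Series indexed by group. Selecting the column first means the function receives a Series and never sees the grouping column.

```python
import pandas as pd

# Invented rows; only the column names follow the listings data.
listings = pd.DataFrame({
    "cancellation_policy": ["flexible", "flexible", "strict", "strict"],
    "review_scores_value": [9.0, 10.0, 7.0, 10.0],
})

def score_range(scores):
    # `scores` is the Series of review scores for one group.
    return scores.max() - scores.min()

ranges = (
    listings.groupby("cancellation_policy")["review_scores_value"]
    .apply(score_range)
)

print(ranges)
```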