<a href="https://colab.research.google.com/github/brunofbpaula/DataScience-UM-Coursera/blob/main/Pandas/DataFrame/GroupyBy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Group By

Pandas lets us iterate over every row in a DataFrame, but doing so is generally very slow. Fortunately, pandas provides a groupby() function that speeds up this kind of task.

The idea behind the groupby() function is that it takes a DataFrame, splits it into chunks based on some key values, applies a computation to each chunk, and then combines the results into another DataFrame. In pandas this is referred to as the split-apply-combine pattern.
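To make the three steps concrete, here is a minimal sketch on a small made-up DataFrame (the `state` and `pop` columns and their values are assumptions for illustration, not the census data used below). It performs split, apply and combine by hand and then compares the result with a single groupby call.

```python
import pandas as pd

# Hypothetical toy data, not the census file used in this notebook
df = pd.DataFrame({
    'state': ['A', 'A', 'B', 'B', 'B'],
    'pop': [10, 30, 5, 15, 70],
})

# Split: one sub-frame per key value
chunks = {key: df[df['state'] == key] for key in df['state'].unique()}

# Apply: compute something on each chunk
means = {key: chunk['pop'].mean() for key, chunk in chunks.items()}

# Combine: assemble the per-chunk results into one object
manual = pd.Series(means, name='pop')

# groupby performs the same three steps in a single call
grouped = df.groupby('state')['pop'].mean()

print(manual.to_dict())
print(grouped.to_dict())
```

Both print statements should show the same per-state means, which is the point of the pattern: groupby handles the bookkeeping of splitting and recombining.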

In [2]:
import pandas as pd
import numpy as np

In [46]:
census = pd.read_csv('census.csv')
# Exclude state-level summary rows; keep only rows where SUMLEV is 50
census = census[census['SUMLEV'] == 50]

## Splitting

In [47]:
census.head()

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861
5,50,3,6,1,9,Alabama,Blount County,57322,57322,57373,...,1.807375,-1.177622,-1.748766,-2.062535,-1.36997,1.859511,-0.84858,-1.402476,-1.577232,-0.884411


In [48]:
# First, we get a list of the unique states
# Then, iterate over all the states
# and for each one, we reduce the DataFrame
# and finally calculate the average

In [49]:
%%timeit -n 3


for state in census['STNAME'].unique():
  avg = np.average(census.where(census['STNAME']==state).dropna()['CENSUS2010POP'])
  print(f'Counties in the state {state} have an average population of {avg}')

Counties in the state Alabama have an average population of 71339.34328358209
Counties in the state Alaska have an average population of 24490.724137931036
Counties in the state Arizona have an average population of 426134.4666666667
Counties in the state Arkansas have an average population of 38878.90666666667
Counties in the state California have an average population of 642309.5862068966
Counties in the state Colorado have an average population of 78581.1875
Counties in the state Connecticut have an average population of 446762.125
Counties in the state Delaware have an average population of 299311.3333333333
Counties in the state District of Columbia have an average population of 601723.0
Counties in the state Florida have an average population of 280616.5671641791
Counties in the state Georgia have an average population of 60928.63522012578
Counties in the state Hawaii have an average population of 272060.2
Counties in the state Idaho have an average population of 35626.8636363636

Now let's see how long the same task takes using the groupby function.

In [50]:
%%timeit -n 3

for group, frame in census.groupby('STNAME'):
  avg = np.average(frame['CENSUS2010POP'])
  print(f'Counties in the state {group} have an average population of {avg}')

Counties in the state Alabama have an average population of 71339.34328358209
Counties in the state Alaska have an average population of 24490.724137931036
Counties in the state Arizona have an average population of 426134.4666666667
Counties in the state Arkansas have an average population of 38878.90666666667
Counties in the state California have an average population of 642309.5862068966
Counties in the state Colorado have an average population of 78581.1875
Counties in the state Connecticut have an average population of 446762.125
Counties in the state Delaware have an average population of 299311.3333333333
Counties in the state District of Columbia have an average population of 601723.0
Counties in the state Florida have an average population of 280616.5671641791
Counties in the state Georgia have an average population of 60928.63522012578
Counties in the state Hawaii have an average population of 272060.2
Counties in the state Idaho have an average population of 35626.8636363636

The groupby version is much faster, since it does not re-filter the whole DataFrame once per state.

This is a somewhat contrived example, but let's assume we need to work on only a third or so of the states at a given time. We can write a function that returns a number between zero and two based on the first character of the state name, and then tell groupby to use this function to split up the DataFrame.

In [51]:
census = census.set_index('STNAME')

def set_batch_number(item):
  if item[0] < 'M':
    return 0
  if item[0] < 'Q':
    return 1
  return 2

# This time we pass no column name to groupby, only the function.
# When groupby is given a function, pandas calls it on each value
# of the DataFrame's index (here STNAME) and groups by the result.
for group, frame in census.groupby(set_batch_number):
  print(f'There are {len(frame)} records in group {group+1} for processing')

There are 1177 records in group 1 for processing
There are 1134 records in group 2 for processing
There are 831 records in group 3 for processing
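The same mechanism can be checked on a tiny frame where the expected batches are easy to verify by eye. This is only a sketch: the four state names, the `pop` values and the helper name `batch_of` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame indexed by a few state names
toy = pd.DataFrame(
    {'pop': [1, 2, 3, 4]},
    index=['Alabama', 'Maine', 'Ohio', 'Texas'],
)

def batch_of(label):
    # groupby calls this once per index label
    if label[0] < 'M':
        return 0
    if label[0] < 'Q':
        return 1
    return 2

# Number of rows that landed in each batch
sizes = toy.groupby(batch_of).size()
print(sizes.to_dict())
```

Alabama falls in batch 0, Maine and Ohio in batch 1, and Texas in batch 2, because the function only ever sees the index labels.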


## Listing

A dataset of Airbnb housing listings.

In [89]:
data = pd.read_csv('listings.csv')
data.head(3)

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
0,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",none,"Roslindale is quiet, convenient and friendly. ...",...,,f,,,f,moderate,f,f,1,
1,3075044,https://www.airbnb.com/rooms/3075044,20160906204935,2016-09-07,Charming room in pet friendly apt,Charming and quiet room in a second floor 1910...,Small but cozy and quite room with a full size...,Charming and quiet room in a second floor 1910...,none,"The room is in Roslindale, a diverse and prima...",...,9.0,f,,,t,moderate,f,f,1,1.3
2,6976,https://www.airbnb.com/rooms/6976,20160906204935,2016-09-07,Mexican Folk Art Haven in Boston,"Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...",none,The LOCATION: Roslindale is a safe and diverse...,...,10.0,f,,,f,moderate,t,f,1,0.47


In [90]:
# Grouping by columns of interest
data = data.set_index(['cancellation_policy', 'review_scores_value'])

# When we have a multi-index, we need to pass in the levels we are interested in grouping by
for group, frame in data.groupby(level=(0,1)):
  print(group)

('flexible', 2.0)
('flexible', 4.0)
('flexible', 5.0)
('flexible', 6.0)
('flexible', 7.0)
('flexible', 8.0)
('flexible', 9.0)
('flexible', 10.0)
('moderate', 2.0)
('moderate', 4.0)
('moderate', 6.0)
('moderate', 7.0)
('moderate', 8.0)
('moderate', 9.0)
('moderate', 10.0)
('strict', 2.0)
('strict', 3.0)
('strict', 4.0)
('strict', 5.0)
('strict', 6.0)
('strict', 7.0)
('strict', 8.0)
('strict', 9.0)
('strict', 10.0)
('super_strict_30', 6.0)
('super_strict_30', 7.0)
('super_strict_30', 8.0)
('super_strict_30', 9.0)
('super_strict_30', 10.0)
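As a small illustration of the `level` argument, the sketch below builds a two-level index on made-up data (the `policy`, `score` and `price` names and values are assumptions) and groups first by both levels, then by a single level referred to by name.

```python
import pandas as pd

# Hypothetical toy listings with a two-level index
toy = pd.DataFrame({
    'policy': ['flexible', 'flexible', 'strict', 'strict'],
    'score': [9.0, 10.0, 9.0, 9.0],
    'price': [50, 80, 120, 100],
}).set_index(['policy', 'score'])

# Group by both index levels, given by position
by_both = toy.groupby(level=[0, 1])['price'].sum()

# Group by one level only, given by name
by_policy = toy.groupby(level='policy')['price'].sum()

print(by_both.to_dict())
print(by_policy.to_dict())
```

Grouping by both levels yields one group per distinct (policy, score) pair, while grouping by a single level collapses the other one.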


In [91]:
# Separating out all the 10's
def grouping_fun(item):
  if item[1] == 10.0:
    return (item[0], '10.0')
  else:
    return (item[0], 'not 10.0')

for group, frame in data.groupby(by=grouping_fun):
  print(group)

('flexible', '10.0')
('flexible', 'not 10.0')
('moderate', '10.0')
('moderate', 'not 10.0')
('strict', '10.0')
('strict', 'not 10.0')
('super_strict_30', '10.0')
('super_strict_30', 'not 10.0')


The pandas developers distinguish three broad categories of data processing that can happen during the apply step: aggregation of group data, transformation of group data, and filtration of group data.

## Aggregation

The most straightforward apply step is the aggregation of data. This uses a method called agg on the groupby object. Thus far, we've only iterated through the groupby object, unpacking it into a label, the group name, and a DataFrame. But with agg, we can pass in a dictionary of the columns we are interested in aggregating along with the function that we're looking to apply.

In [92]:
# Resetting index
data = data.reset_index()

# Now let's group by the cancellation_policy and find the average review_scores_value by group
data.groupby('cancellation_policy').agg({'review_scores_value': np.average})

Unnamed: 0_level_0,review_scores_value
cancellation_policy,Unnamed: 1_level_1
flexible,
moderate,
strict,
super_strict_30,


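It didn't work well: every group came back as NaN. The reason is that numpy's average function does not ignore NaNs, so a single missing review score turns the whole group average into NaN.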
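The NaN behaviour is easy to reproduce outside of groupby. The following sketch uses a made-up array of three scores, one of them missing, and compares `np.average`, `np.nanmean` and the pandas `Series.mean` method, which skips NaN by default.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with one missing value
scores = np.array([9.0, 10.0, np.nan])

plain = np.average(scores)           # NaN propagates through the average
skipping = np.nanmean(scores)        # NaN entries are ignored
pandas_mean = pd.Series(scores).mean()  # pandas skips NaN by default

print(plain, skipping, pandas_mean)
```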

In [93]:
# Then, we can use instead
data.groupby('cancellation_policy').agg({'review_scores_value': np.nanmean})

Unnamed: 0_level_0,review_scores_value
cancellation_policy,Unnamed: 1_level_1
flexible,9.237421
moderate,9.307398
strict,9.081441
super_strict_30,8.537313


In [94]:
# And we can just extend this dictionary to aggregate by multiple functions or multiple columns
data.groupby('cancellation_policy').agg({'review_scores_value': (np.nanmean, np.nanstd),
                                         'reviews_per_month': np.nanmean})

Unnamed: 0_level_0,review_scores_value,review_scores_value,reviews_per_month
Unnamed: 0_level_1,nanmean,nanstd,nanmean
cancellation_policy,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
flexible,9.237421,1.096271,1.82921
moderate,9.307398,0.859859,2.391922
strict,9.081441,1.040531,1.873467
super_strict_30,8.537313,0.840785,0.340143


First, we group the DataFrame by the column "cancellation_policy", which creates a new GroupBy object. Then we invoke the agg function on that object, which applies one or more specified functions to each group's DataFrame and returns a single row per group.
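The shape of an agg result can be inspected on a small made-up frame (column names and values here are assumptions for illustration). This sketch passes function names as strings, which pandas also accepts in the dictionary, and shows that the result has one row per group and a two-level column index when a column gets more than one function.

```python
import pandas as pd

# Hypothetical toy listings
toy = pd.DataFrame({
    'policy': ['flexible', 'flexible', 'strict', 'strict'],
    'score': [9.0, 7.0, 8.0, 10.0],
    'reviews': [1.0, 3.0, 2.0, 4.0],
})

# Two functions for 'score', one for 'reviews'
summary = toy.groupby('policy').agg({
    'score': ['mean', 'max'],
    'reviews': 'mean',
})

print(summary.shape)
print(summary.columns.tolist())
```

Each entry in the column index is a (column, function) pair, so a single value can be read with something like `summary.loc['flexible', ('score', 'mean')]`.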

In [95]:
print('Aggregation')

Aggregation


## Transformation

Transformation is different from aggregation. Where agg returns a single value per column, and so one row per group, transform returns an object that is the same size as the group. Essentially, it broadcasts the function you supply over the group DataFrame, returning a new DataFrame. This makes combining data later quite easy.
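The size difference between agg and transform can be seen directly on a made-up frame of five rows in two groups (names and values are illustrative assumptions).

```python
import pandas as pd

# Hypothetical toy listings
toy = pd.DataFrame({
    'policy': ['flexible', 'strict', 'flexible', 'strict', 'strict'],
    'score': [8.0, 9.0, 10.0, 6.0, 9.0],
})

grouped = toy.groupby('policy')['score']
per_group = grouped.agg('mean')        # one value per group
per_row = grouped.transform('mean')    # one value per original row

print(len(per_group), len(per_row))
print(per_row.tolist())
```

Because the transformed result shares the index of the original frame, it can be assigned or joined back row by row.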

In [96]:
# For instance, let's say we want to include the average
# rating value for each cancellation policy group, but
# preserve the DataFrame shape so that we can compute the
# difference between an individual observation and the group mean

# Defining some subset of columns of interest
cols = ['cancellation_policy', 'review_scores_value']

# Transforming it and storing in its own DataFrame
transform_data = data[cols].groupby('cancellation_policy').transform(np.nanmean)
transform_data.head()

Unnamed: 0,review_scores_value
0,9.307398
1,9.307398
2,9.307398
3,9.307398
4,9.237421


In [98]:
# The index is the same as the original's. We can join it in.

transform_data.rename({'review_scores_value': 'mean_review_scores'}, axis='columns', inplace=True)
data = data.merge(transform_data['mean_review_scores'], left_index=True, right_index=True)
data.head()

Unnamed: 0,cancellation_policy,review_scores_value,id,listing_url,scrape_id,last_scraped,name,summary,space,description,...,review_scores_location,requires_license,license,jurisdiction_names,instant_bookable,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month,mean_review_scores
0,moderate,,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",...,,f,,,f,f,f,1,,9.307398
1,moderate,9.0,3075044,https://www.airbnb.com/rooms/3075044,20160906204935,2016-09-07,Charming room in pet friendly apt,Charming and quiet room in a second floor 1910...,Small but cozy and quite room with a full size...,Charming and quiet room in a second floor 1910...,...,9.0,f,,,t,f,f,1,1.3,9.307398
2,moderate,10.0,6976,https://www.airbnb.com/rooms/6976,20160906204935,2016-09-07,Mexican Folk Art Haven in Boston,"Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...",...,9.0,f,,,f,t,f,1,0.47,9.307398
3,moderate,10.0,1436513,https://www.airbnb.com/rooms/1436513,20160906204935,2016-09-07,Spacious Sunny Bedroom Suite in Historic Home,Come experience the comforts of home away from...,Most places you find in Boston are small howev...,Come experience the comforts of home away from...,...,10.0,f,,,f,f,f,1,1.0,9.307398
4,flexible,10.0,7651065,https://www.airbnb.com/rooms/7651065,20160906204935,2016-09-07,Come Home to Boston,"My comfy, clean and relaxing home is one block...","Clean, attractive, private room, one block fro...","My comfy, clean and relaxing home is one block...",...,9.0,f,,,f,f,f,1,2.25,9.237421


In [99]:
# Difference between a row and its group
data['mean_diff'] = np.absolute(data['review_scores_value'] - data['mean_review_scores'])
data['mean_diff'].head()

0         NaN
1    0.307398
2    0.692602
3    0.692602
4    0.762579
Name: mean_diff, dtype: float64

## Filtering

The GroupBy object has built-in support for filtering groups as well. The filter function takes a function, applies it to each group DataFrame, and expects it to return either True or False. Groups for which it returns False are dropped from the result.

In [101]:
# For example, if we only want those groups which have a mean rating above 9 included in our results
data.groupby('cancellation_policy').filter(lambda x: np.nanmean(x['review_scores_value']) > 9.0)

Unnamed: 0,cancellation_policy,review_scores_value,id,listing_url,scrape_id,last_scraped,name,summary,space,description,...,requires_license,license,jurisdiction_names,instant_bookable,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month,mean_review_scores,mean_diff
0,moderate,,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",...,f,,,f,f,f,1,,9.307398,
1,moderate,9.0,3075044,https://www.airbnb.com/rooms/3075044,20160906204935,2016-09-07,Charming room in pet friendly apt,Charming and quiet room in a second floor 1910...,Small but cozy and quite room with a full size...,Charming and quiet room in a second floor 1910...,...,f,,,t,f,f,1,1.30,9.307398,0.307398
2,moderate,10.0,6976,https://www.airbnb.com/rooms/6976,20160906204935,2016-09-07,Mexican Folk Art Haven in Boston,"Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...",...,f,,,f,t,f,1,0.47,9.307398,0.692602
3,moderate,10.0,1436513,https://www.airbnb.com/rooms/1436513,20160906204935,2016-09-07,Spacious Sunny Bedroom Suite in Historic Home,Come experience the comforts of home away from...,Most places you find in Boston are small howev...,Come experience the comforts of home away from...,...,f,,,f,f,f,1,1.00,9.307398,0.692602
4,flexible,10.0,7651065,https://www.airbnb.com/rooms/7651065,20160906204935,2016-09-07,Come Home to Boston,"My comfy, clean and relaxing home is one block...","Clean, attractive, private room, one block fro...","My comfy, clean and relaxing home is one block...",...,f,,,f,f,f,1,2.25,9.237421,0.762579
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3580,strict,9.0,8373729,https://www.airbnb.com/rooms/8373729,20160906204935,2016-09-07,Big cozy room near T,5 min walking to Orange Line subway with 2 sto...,,5 min walking to Orange Line subway with 2 sto...,...,f,,,t,f,f,8,0.34,9.081441,0.081441
3581,strict,,14844274,https://www.airbnb.com/rooms/14844274,20160906204935,2016-09-07,BU Apartment DexterPark Bright room,"Most popular apartment in BU, best located in ...",Best location in BU,"Most popular apartment in BU, best located in ...",...,f,,,f,f,f,2,,9.081441,
3582,flexible,,14585486,https://www.airbnb.com/rooms/14585486,20160906204935,2016-09-07,Gorgeous funky apartment,Funky little apartment close to public transpo...,Modern and relaxed space with many facilities ...,Funky little apartment close to public transpo...,...,f,,,f,f,f,1,,9.237421,
3583,strict,7.0,14603878,https://www.airbnb.com/rooms/14603878,20160906204935,2016-09-07,Great Location; Train and Restaurants,"My place is close to Taco Loco Mexican Grill, ...",,"My place is close to Taco Loco Mexican Grill, ...",...,f,,,f,f,f,1,2.00,9.081441,2.081441


The result keeps the original index. Rows belonging to a group with a mean review score less than or equal to 9.0 were not copied over.
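A compact way to see that filter keeps or drops whole groups, while leaving the surviving rows under their original index labels, is the following sketch on made-up data (names and values are assumptions).

```python
import pandas as pd

# Hypothetical toy listings
toy = pd.DataFrame({
    'policy': ['flexible', 'strict', 'flexible', 'strict'],
    'score': [9.0, 8.0, 10.0, 9.0],
})

# Keep only groups whose mean score is above 9.0
kept = toy.groupby('policy').filter(lambda g: g['score'].mean() > 9.0)

print(kept.index.tolist())
print(kept['policy'].unique().tolist())
```

The 'strict' group has a mean of 8.5, so both of its rows disappear, including the individual row with a score of 9.0.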

## Applying

Apply is by far the most common operation invoked on GroupBy objects. It lets us apply an arbitrary function to each group and stitches the results of each call back into a single DataFrame in which the index is preserved. It may be slower than the specialized functions above.

In [3]:
# Clean copy of DataFrame
data = pd.read_csv('listings.csv')

# Including only the columns of interest
data = data[['cancellation_policy', 'review_scores_value']]
data.head()

Unnamed: 0,cancellation_policy,review_scores_value
0,moderate,
1,moderate,9.0
2,moderate,10.0
3,moderate,10.0
4,flexible,10.0


In [5]:
# In previous work, we wanted to find the average score of a listing and its deviation
# from the group mean. This was a two-step process, first we used transform() on the GroupBy
# object and then we had to broadcast to create a new column. With apply we can wrap this logic in one place
def calc_mean_review_scores(group):
  # group is a DataFrame containing only the rows of the current group.
  avg = np.nanmean(group['review_scores_value'])
  # Broadcast the formula to create a new column. Despite its name,
  # review_scores_mean holds each row's absolute deviation from the group mean.
  group['review_scores_mean'] = np.abs(avg-group['review_scores_value'])
  return group

# Applying it to all groups
data.groupby('cancellation_policy', group_keys=True).apply(calc_mean_review_scores).head()

Unnamed: 0_level_0,Unnamed: 1_level_0,cancellation_policy,review_scores_value,review_scores_mean
cancellation_policy,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
flexible,4,flexible,10.0,0.762579
flexible,5,flexible,10.0,0.762579
flexible,10,flexible,10.0,0.762579
flexible,11,flexible,9.0,0.237421
flexible,12,flexible,10.0,0.762579
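For comparison, the same absolute deviation can be computed either with apply or with transform followed by ordinary column arithmetic. The sketch below does both on a small made-up frame (names and values are assumptions) and is meant only to show that the two routes agree, not to measure their speed.

```python
import pandas as pd

# Hypothetical toy listings
toy = pd.DataFrame({
    'policy': ['flexible', 'strict', 'flexible', 'strict'],
    'score': [8.0, 9.0, 10.0, 6.0],
})

# apply: run an arbitrary function on each group's scores
via_apply = (
    toy.groupby('policy', group_keys=False)['score']
       .apply(lambda s: (s - s.mean()).abs())
       .sort_index()
)

# transform with the built-in 'mean', then plain vectorized arithmetic
group_mean = toy.groupby('policy')['score'].transform('mean')
via_transform = (toy['score'] - group_mean).abs()

print(via_apply.tolist())
print(via_transform.tolist())
```

When a built-in aggregation such as 'mean' covers the per-group step, the transform route avoids calling a Python function once per group.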
