Sometimes we want to select data based on groups and understand aggregated data at the group level. We've seen that even though Pandas allows us to iterate over every row in a data frame, doing so is generally very slow. Fortunately, Pandas has a groupby function to speed up such tasks.

The idea behind groupby is that it takes a data frame, splits it into chunks based on some key values, applies a computation to each chunk, and then combines the results back into another data frame. In Pandas, this is referred to as the split-apply-combine pattern.
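To make the three steps concrete, here is a minimal sketch that performs split, apply, and combine by hand and then compares the result with a single groupby call. The `toy` frame and its `state` and `pop` columns are made up for illustration and are not part of the census data.

```python
import pandas as pd

# Hypothetical toy data, not taken from census.csv
toy = pd.DataFrame({
    'state': ['A', 'A', 'B', 'B', 'B'],
    'pop':   [10, 30, 6, 12, 30],
})

# Split: one sub-frame per distinct key value
chunks = {key: toy[toy['state'] == key] for key in toy['state'].unique()}

# Apply: compute one number per chunk
means = {key: chunk['pop'].mean() for key, chunk in chunks.items()}

# Combine: assemble the per-chunk results into a single object
by_hand = pd.Series(means, name='pop')

# groupby carries out the same three steps in one call
with_groupby = toy.groupby('state')['pop'].mean()

print(by_hand)
print(with_groupby)
```

Both results hold one mean per state; the groupby version simply hides the bookkeeping.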

In [9]:
import pandas as pd
import numpy as np

df = pd.read_csv('resources/census.csv')
df = df[df['SUMLEV']==50]
df.head()

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861
5,50,3,6,1,9,Alabama,Blount County,57322,57322,57373,...,1.807375,-1.177622,-1.748766,-2.062535,-1.36997,1.859511,-0.84858,-1.402476,-1.577232,-0.884411


Let's get a list of all the unique states, then iterate over them. For each state, we reduce the data frame to that state's rows and calculate the average population. Let's time this task using the timeit cell magic, where the -n 3 option asks for the cell to be executed three times in each timing loop:

In [8]:
%%timeit -n 3

for state in df['STNAME'].unique():
    # Keep only this state's rows, then average the 2010 census population
    avg = np.average(df.where(df['STNAME']==state).dropna()['CENSUS2010POP'])
    print('Counties in state ' + state + ' have an average population of ' + str(avg))

Counties in state Alabama have an average population of 71339.34328358209
Counties in state Alaska have an average population of 24490.724137931036
Counties in state Arizona have an average population of 426134.4666666667
Counties in state Arkansas have an average population of 38878.90666666667
Counties in state California have an average population of 642309.5862068966
Counties in state Colorado have an average population of 78581.1875
Counties in state Connecticut have an average population of 446762.125
Counties in state Delaware have an average population of 299311.3333333333
Counties in state District of Columbia have an average population of 601723.0
Counties in state Florida have an average population of 280616.5671641791
Counties in state Georgia have an average population of 60928.63522012578
Counties in state Hawaii have an average population of 272060.2
Counties in state Idaho have an average population of 35626.86363636364
Counties in state Illinois have an average population of 125790.50980392157
Counties in state Indiana have an average population of 70476.10869565218
Counties in state Iowa have an average population of 30771.262626262625
Counties in state Kansas have an average population of 27172.55238095238
Counties in state Kentucky have an average population of 36161.39166666667
Counties in state Louisiana have an average population of 70833.9375
Counties in state Maine have an average population of 83022.5625
Counties in state Maryland have an average population of 240564.66666666666
Counties in state Massachusetts have an average population of 467687.78571428574
Counties in state Michigan have an average population of 119080.0
Counties in state Minnesota have an average population of 60964.65517241379
Counties in state Mississippi have an average population of 36186.54878048781
Counties in state Missouri have an average population of 52077.62608695652
Counties in state Montana have an average population of 17668.125
Counties in state Nebraska have an average population of 19638.075268817203
Counties in state Nevada have an average population of 158855.9411764706
Counties in state New Hampshire have an average population of 131647.0
Counties in state New Jersey have an average population of 418661.61904761905
Counties in state New Mexico have an average population of 62399.36363636364
Counties in state New York have an average population of 312550.03225806454
Counties in state North Carolina have an average population of 95354.83
Counties in state North Dakota have an average population of 12690.396226415094
Counties in state Ohio have an average population of 131096.63636363635
Counties in state Oklahoma have an average population of 48718.844155844155
Counties in state Oregon have an average population of 106418.72222222222
Counties in state Pennsylvania have an average population of 189587.74626865672
Counties in state Rhode Island have an average population of 210513.4
Counties in state South Carolina have an average population of 100551.39130434782
Counties in state South Dakota have an average population of 12336.060606060606
Counties in state Tennessee have an average population of 66801.1052631579
Counties in state Texas have an average population of 98998.27165354331
Counties in state Utah have an average population of 95306.37931034483
Counties in state Vermont have an average population of 44695.78571428572
Counties in state Virginia have an average population of 60111.29323308271
Counties in state Washington have an average population of 172424.10256410256
Counties in state West Virginia have an average population of 33690.8
Counties in state Wisconsin have an average population of 78985.91666666667
Counties in state Wyo

KeyboardInterrupt: 

In [None]:
%%timeit -n 3

for group, frame in df.groupby('STNAME'):
    # groupby hands us each state's rows directly, so no repeated filtering is needed
    avg = np.average(frame['CENSUS2010POP'])
    print('Counties in state ' + group + ' have an average population of ' + str(avg))

That's a huge difference in speed, an improvement of roughly a factor of two. Now, 99 percent of the time you'll use groupby on one or more columns, but you can also provide a function to groupby and use that to segment your data.

This is a bit of a fabricated example, but let's say that you have a big batch job with lots of processing and you want to work on only a third or so of the states at a given time. We could create a function that returns a number between zero and two based on the first character of the state name, then tell groupby to use this function to split up our data frame. Note that for this to work, you first need to set the index of the data frame to the column that you want to group by, because groupby passes the index values to the function.

In [10]:
df = df.set_index('STNAME')

def set_batch(item):
    if item[0]<'M':
        return 0
    if item[0]<'Q':
        return 1
    return 2

for group, frame in df.groupby(set_batch):
    print('There are '+str(len(frame))+' records in group '+str(group)+' for processing.')

There are 1177 records in group 0 for processing.
There are 1134 records in group 1 for processing.
There are 831 records in group 2 for processing.


Notice that this time I didn't pass a column name to groupby. Instead, I set the index of the DataFrame to STNAME and passed a function. When groupby is given a function, it calls that function on each value of the index and uses the return values as the group keys.
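As a small illustration of what the grouping function actually receives, the sketch below records every label passed to it. The four-row `toy` frame and the helper name `first_letter_batch` are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical toy frame indexed by state name
toy = pd.DataFrame(
    {'pop': [1, 2, 3, 4]},
    index=['Alabama', 'Maine', 'Ohio', 'Texas'],
)

seen = []  # labels handed to the grouping function

def first_letter_batch(label):
    # label is a single index value (here, a state name string)
    seen.append(label)
    return 0 if label[0] < 'M' else 1

sizes = toy.groupby(first_letter_batch).size()

print(sorted(set(seen)))
print(sizes)
```

The function never sees a row or a column, only index labels, which is why the index had to be set to STNAME before grouping.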

Let's take one more look at an example of how we might group data. In this example, I want to use a dataset of housing listings from Airbnb. In this dataset, there are two columns of interest: one is cancellation_policy, and the other is review_scores_value. Let's bring it in.

In [11]:
df = pd.read_csv('resources/listings.csv')
df.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
0,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",none,"Roslindale is quiet, convenient and friendly. ...",...,,f,,,f,moderate,f,f,1,
1,3075044,https://www.airbnb.com/rooms/3075044,20160906204935,2016-09-07,Charming room in pet friendly apt,Charming and quiet room in a second floor 1910...,Small but cozy and quite room with a full size...,Charming and quiet room in a second floor 1910...,none,"The room is in Roslindale, a diverse and prima...",...,9.0,f,,,t,moderate,f,f,1,1.3
2,6976,https://www.airbnb.com/rooms/6976,20160906204935,2016-09-07,Mexican Folk Art Haven in Boston,"Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...",none,The LOCATION: Roslindale is a safe and diverse...,...,10.0,f,,,f,moderate,t,f,1,0.47
3,1436513,https://www.airbnb.com/rooms/1436513,20160906204935,2016-09-07,Spacious Sunny Bedroom Suite in Historic Home,Come experience the comforts of home away from...,Most places you find in Boston are small howev...,Come experience the comforts of home away from...,none,Roslindale is a lovely little neighborhood loc...,...,10.0,f,,,f,moderate,f,f,1,1.0
4,7651065,https://www.airbnb.com/rooms/7651065,20160906204935,2016-09-07,Come Home to Boston,"My comfy, clean and relaxing home is one block...","Clean, attractive, private room, one block fro...","My comfy, clean and relaxing home is one block...",none,"I love the proximity to downtown, the neighbor...",...,10.0,f,,,f,flexible,f,f,1,2.25


How would I group by both of these columns? A first approach might be to promote them to a multi-index with set_index and then call groupby.

In [12]:
df = df.set_index(['cancellation_policy','review_scores_value'])
# When we have a multi-index, we need to pass in the levels that we're interested in grouping by.
# By default, groupby does not know and does not assume that you want to group by all levels.
for group, frame in df.groupby(level=(0,1)):
    print(group)

('flexible', 2.0)
('flexible', 4.0)
('flexible', 5.0)
('flexible', 6.0)
('flexible', 7.0)
('flexible', 8.0)
('flexible', 9.0)
('flexible', 10.0)
('moderate', 2.0)
('moderate', 4.0)
('moderate', 6.0)
('moderate', 7.0)
('moderate', 8.0)
('moderate', 9.0)
('moderate', 10.0)
('strict', 2.0)
('strict', 3.0)
('strict', 4.0)
('strict', 5.0)
('strict', 6.0)
('strict', 7.0)
('strict', 8.0)
('strict', 9.0)
('strict', 10.0)
('super_strict_30', 6.0)
('super_strict_30', 7.0)
('super_strict_30', 8.0)
('super_strict_30', 9.0)
('super_strict_30', 10.0)
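
The `level` argument also accepts level names in place of positions, which can be easier to read. The following is a minimal sketch on a made-up frame; the `policy`, `score`, and `price` columns are assumptions for illustration, not columns of listings.csv.

```python
import pandas as pd

# Hypothetical toy data with a two-level index
toy = pd.DataFrame({
    'policy': ['flexible', 'flexible', 'strict', 'strict'],
    'score':  [9.0, 10.0, 10.0, 10.0],
    'price':  [50, 60, 70, 80],
}).set_index(['policy', 'score'])

# Group by index levels, first by position and then by name
by_position = toy.groupby(level=[0, 1])['price'].sum()
by_name = toy.groupby(level=['policy', 'score'])['price'].sum()

print(by_name)
print(by_position.equals(by_name))
```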


This seems to work out okay. But what if we wanted to group by the cancellation policy and review scores, but separate all of the 10s from those under 10? In this case, we could use a function to manage the groupings.

In [13]:
def grouping_fun(item):
    if item[1]==10:
        return (item[0],'10.0')
    else:
        return (item[0],'not 10.0')
for group, frame in df.groupby(grouping_fun):
    print(group)

('flexible', '10.0')
('flexible', 'not 10.0')
('moderate', '10.0')
('moderate', 'not 10.0')
('strict', '10.0')
('strict', 'not 10.0')
('super_strict_30', '10.0')
('super_strict_30', 'not 10.0')


In [14]:
df.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,review_scores_communication,review_scores_location,requires_license,license,jurisdiction_names,instant_bookable,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
cancellation_policy,review_scores_value,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
moderate,,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",none,"Roslindale is quiet, convenient and friendly. ...",...,,,f,,,f,f,f,1,
moderate,9.0,3075044,https://www.airbnb.com/rooms/3075044,20160906204935,2016-09-07,Charming room in pet friendly apt,Charming and quiet room in a second floor 1910...,Small but cozy and quite room with a full size...,Charming and quiet room in a second floor 1910...,none,"The room is in Roslindale, a diverse and prima...",...,10.0,9.0,f,,,t,f,f,1,1.3
moderate,10.0,6976,https://www.airbnb.com/rooms/6976,20160906204935,2016-09-07,Mexican Folk Art Haven in Boston,"Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...",none,The LOCATION: Roslindale is a safe and diverse...,...,10.0,9.0,f,,,f,t,f,1,0.47
moderate,10.0,1436513,https://www.airbnb.com/rooms/1436513,20160906204935,2016-09-07,Spacious Sunny Bedroom Suite in Historic Home,Come experience the comforts of home away from...,Most places you find in Boston are small howev...,Come experience the comforts of home away from...,none,Roslindale is a lovely little neighborhood loc...,...,10.0,10.0,f,,,f,f,f,1,1.0
flexible,10.0,7651065,https://www.airbnb.com/rooms/7651065,20160906204935,2016-09-07,Come Home to Boston,"My comfy, clean and relaxing home is one block...","Clean, attractive, private room, one block fro...","My comfy, clean and relaxing home is one block...",none,"I love the proximity to downtown, the neighbor...",...,10.0,9.0,f,,,f,f,f,1,2.25


To this point, we've applied very simple processing to our data after splitting, really just printing some statements to demonstrate how the splitting works. The pandas developers describe three broad categories of data processing that can happen during the apply step: aggregation of group data, transformation of group data, and filtration of group data.

The most straightforward apply step is the aggregation of data. This uses a method called agg on the groupby object. Thus far, we've only iterated through the groupby object, unpacking each item into a label (the group name) and a DataFrame. But with agg, we can pass in a dictionary that maps each column we are interested in aggregating to the function we want to apply to it.

# Aggregation

In [15]:
# Now let's group by the cancellation policy and find the average review_scores_value by group.
# reset_index first turns the two index levels back into ordinary columns.
df = df.reset_index()

df.groupby('cancellation_policy').agg({'review_scores_value': np.average})

Unnamed: 0_level_0,review_scores_value
cancellation_policy,Unnamed: 1_level_1
flexible,
moderate,
strict,
super_strict_30,


In [16]:
# So that didn't work at all: we got nothing but NaN (not a number) values.
# np.average does not ignore NaNs. However, np.nanmean does, so we can use that instead.

df.groupby('cancellation_policy').agg({'review_scores_value':np.nanmean})

Unnamed: 0_level_0,review_scores_value
cancellation_policy,Unnamed: 1_level_1
flexible,9.237421
moderate,9.307398
strict,9.081441
super_strict_30,8.537313


In [17]:
# We can extend this dictionary to aggregate with multiple functions, or over multiple columns.
# np.nanstd is the standard deviation, ignoring NaNs.
df.groupby('cancellation_policy').agg({'review_scores_value': (np.nanmean, np.nanstd),
                                       'reviews_per_month': np.nanmean})


Unnamed: 0_level_0,review_scores_value,review_scores_value,reviews_per_month
Unnamed: 0_level_1,nanmean,nanstd,nanmean
cancellation_policy,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
flexible,9.237421,1.096271,1.82921
moderate,9.307398,0.859859,2.391922
strict,9.081441,1.040531,1.873467
super_strict_30,8.537313,0.840785,0.340143


Take a moment to make sure you understand the previous cell, since it's somewhat complex. First, we do a groupby on the dataframe by the column cancellation_policy, which creates a new groupby object. Then we invoke the agg function on that object. The agg function applies one or more functions that we specify to each group dataframe and returns a single row per group.

When we called agg, we sent it a dictionary with two entries, each keyed by the column we wanted functions applied to. For the first column, we supplied a tuple of two functions. Note that these are not function invocations, like np.nanmean with parentheses after it, nor function names given as strings, like "nanmean". They are references to functions that each return a single value. The groupby object recognizes the tuple and calls each function in order on the same column. The results are labeled with a hierarchical index, but because that index is on the columns rather than the rows, it shows up as two header rows instead of as a row index. For the second column, we supplied a single function to run.

It's really important that you understand what happened here and how that statement was constructed.
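
As an alternative to the dictionary form, pandas also supports named aggregation, where each keyword names an output column and maps to a (column, function) pair. The sketch below uses string function names, which skip NaN values by default in groupby. The `toy` frame and the output names `score_mean` and `score_count` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing score
toy = pd.DataFrame({
    'policy': ['flexible', 'flexible', 'strict', 'strict'],
    'score':  [8.0, np.nan, 9.0, 10.0],
})

# Each keyword becomes a flat output column: name=(input column, function)
summary = toy.groupby('policy').agg(
    score_mean=('score', 'mean'),    # NaN is skipped
    score_count=('score', 'count'),  # counts non-missing values only
)

print(summary)
```

One benefit of this form is that the result has ordinary, flat column names instead of a hierarchical column index.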

# Transformation

Transformation is different from aggregation. Where agg returns a single value per column, so one row per group, transform returns an object that is the same size as the group. Essentially, it broadcasts the result of the function you supply over the group dataframe, returning a new dataframe. This makes combining data later quite easy.

In [18]:
# Suppose we want to include the average rating for each cancellation policy group, but preserve
# the dataframe shape so that we can compute the difference between an individual observation and that mean.
cols=['cancellation_policy','review_scores_value']
transform_df = df[cols].groupby('cancellation_policy').transform(np.nanmean)
transform_df.head()

Unnamed: 0,review_scores_value
0,9.307398
1,9.307398
2,9.307398
3,9.307398
4,9.237421


In [19]:
#We can see that the index here is actually the same as the original dataframe. Let's just join this in. 
transform_df.rename({'review_scores_value':'mean_review_scores'}, axis='columns',inplace=True)
df=df.merge(transform_df, left_index=True,right_index=True)
df.head()

Unnamed: 0,cancellation_policy,review_scores_value,id,listing_url,scrape_id,last_scraped,name,summary,space,description,...,review_scores_location,requires_license,license,jurisdiction_names,instant_bookable,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month,mean_review_scores
0,moderate,,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",...,,f,,,f,f,f,1,,9.307398
1,moderate,9.0,3075044,https://www.airbnb.com/rooms/3075044,20160906204935,2016-09-07,Charming room in pet friendly apt,Charming and quiet room in a second floor 1910...,Small but cozy and quite room with a full size...,Charming and quiet room in a second floor 1910...,...,9.0,f,,,t,f,f,1,1.3,9.307398
2,moderate,10.0,6976,https://www.airbnb.com/rooms/6976,20160906204935,2016-09-07,Mexican Folk Art Haven in Boston,"Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...",...,9.0,f,,,f,t,f,1,0.47,9.307398
3,moderate,10.0,1436513,https://www.airbnb.com/rooms/1436513,20160906204935,2016-09-07,Spacious Sunny Bedroom Suite in Historic Home,Come experience the comforts of home away from...,Most places you find in Boston are small howev...,Come experience the comforts of home away from...,...,10.0,f,,,f,f,f,1,1.0,9.307398
4,flexible,10.0,7651065,https://www.airbnb.com/rooms/7651065,20160906204935,2016-09-07,Come Home to Boston,"My comfy, clean and relaxing home is one block...","Clean, attractive, private room, one block fro...","My comfy, clean and relaxing home is one block...",...,9.0,f,,,f,f,f,1,2.25,9.237421


In [20]:
# Our new column, mean_review_scores, is now in place. So now we can compute, for instance,
# the absolute difference between a given row and its cancellation policy group's mean.
df['mean_diff']=np.absolute(df['review_scores_value']-df['mean_review_scores'])
df['mean_diff'].head()

0         NaN
1    0.307398
2    0.692602
3    0.692602
4    0.762579
Name: mean_diff, dtype: float64
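
Because the transform result shares the original index, it can also be used directly in a column assignment without an intermediate merge. Here is a minimal sketch on a made-up frame; the `toy` data is hypothetical, and the string 'mean' is used in place of np.nanmean as an assumption about what reads most simply.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing score
toy = pd.DataFrame({
    'policy': ['flexible', 'flexible', 'strict', 'strict'],
    'score':  [8.0, 10.0, 7.0, np.nan],
})

# One group mean per row, aligned with toy's index (NaN is skipped when averaging)
group_mean = toy.groupby('policy')['score'].transform('mean')

# Absolute deviation of each row from its own group's mean
toy['mean_diff'] = (toy['score'] - group_mean).abs()

print(toy)
```

The row with a missing score still receives its group mean from transform, but its difference stays NaN.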

# Filtering

The groupby object has built-in support for filtering groups as well. Often you'll want to group by some feature, make some transformation to the groups, and then drop certain groups as part of your cleaning routine. The filter function takes a function which it applies to each group dataframe; that function returns either True or False, depending on whether the group should be included in the results.

In [21]:
df.groupby('cancellation_policy').filter(lambda x: np.nanmean(x['review_scores_value'])>9.2)

Unnamed: 0,cancellation_policy,review_scores_value,id,listing_url,scrape_id,last_scraped,name,summary,space,description,...,requires_license,license,jurisdiction_names,instant_bookable,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month,mean_review_scores,mean_diff
0,moderate,,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",...,f,,,f,f,f,1,,9.307398,
1,moderate,9.0,3075044,https://www.airbnb.com/rooms/3075044,20160906204935,2016-09-07,Charming room in pet friendly apt,Charming and quiet room in a second floor 1910...,Small but cozy and quite room with a full size...,Charming and quiet room in a second floor 1910...,...,f,,,t,f,f,1,1.30,9.307398,0.307398
2,moderate,10.0,6976,https://www.airbnb.com/rooms/6976,20160906204935,2016-09-07,Mexican Folk Art Haven in Boston,"Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...",...,f,,,f,t,f,1,0.47,9.307398,0.692602
3,moderate,10.0,1436513,https://www.airbnb.com/rooms/1436513,20160906204935,2016-09-07,Spacious Sunny Bedroom Suite in Historic Home,Come experience the comforts of home away from...,Most places you find in Boston are small howev...,Come experience the comforts of home away from...,...,f,,,f,f,f,1,1.00,9.307398,0.692602
4,flexible,10.0,7651065,https://www.airbnb.com/rooms/7651065,20160906204935,2016-09-07,Come Home to Boston,"My comfy, clean and relaxing home is one block...","Clean, attractive, private room, one block fro...","My comfy, clean and relaxing home is one block...",...,f,,,f,f,f,1,2.25,9.237421,0.762579
5,flexible,10.0,12386020,https://www.airbnb.com/rooms/12386020,20160906204935,2016-09-07,Private Bedroom + Great Coffee,Super comfy bedroom plus your own bathroom in ...,Our sunny condo is located on the second and t...,Super comfy bedroom plus your own bathroom in ...,...,f,,,f,f,f,1,1.70,9.237421,0.762579
7,moderate,10.0,2843445,https://www.airbnb.com/rooms/2843445,20160906204935,2016-09-07,"""Tranquility"" on ""Top of the Hill""","We can accommodate guests who are gluten-free,...",We provide a bedroom and full shared bath. Ra...,"We can accommodate guests who are gluten-free,...",...,f,,,f,t,t,2,2.38,9.307398,0.692602
8,moderate,10.0,753446,https://www.airbnb.com/rooms/753446,20160906204935,2016-09-07,6 miles away from downtown Boston!,Nice and cozy apartment about 6 miles away to ...,Nice and cozy apartment about 6 miles away to ...,Nice and cozy apartment about 6 miles away to ...,...,f,,,f,f,f,1,5.36,9.307398,0.692602
10,flexible,10.0,12023024,https://www.airbnb.com/rooms/12023024,20160906204935,2016-09-07,Cozy room in a well located house,The room is in a single family house located i...,,The room is in a single family house located i...,...,f,,,f,f,f,1,0.36,9.237421,0.762579
11,flexible,9.0,1668313,https://www.airbnb.com/rooms/1668313,20160906204935,2016-09-07,Room in Rozzie-Twin Bed-Full Bath,Quiet second floor bedroom sleeps one in comfo...,,Quiet second floor bedroom sleeps one in comfo...,...,f,,,f,f,f,2,0.48,9.237421,0.237421
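
The effect of filter is easier to see on a handful of rows. In this sketch the `toy` frame is hypothetical; one group passes the threshold and the other is dropped in its entirety.

```python
import pandas as pd

# Hypothetical toy data: two groups with different mean scores
toy = pd.DataFrame({
    'policy': ['flexible', 'flexible', 'strict', 'strict'],
    'score':  [9.0, 10.0, 8.0, 9.0],
})

# Keep every row of a group whose mean score exceeds 9.2; drop the rest
kept = toy.groupby('policy').filter(lambda g: g['score'].mean() > 9.2)

print(kept)
```

Note that filter keeps or drops whole groups: the strict row scoring 9.0 is removed along with the rest of its group, while the original row labels of the kept rows are preserved.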


# Apply


By far the most common operation I invoke on groupby objects is the apply function. This allows you to apply an arbitrary function to each group, and it stitches the results of each call back together into a single data frame where the index is preserved.

Let's look at an example using our Airbnb data. I'm going to get a clean copy of that data frame by loading it again from listings.csv, keeping just the columns that we were interested in previously.

In [26]:
df = pd.read_csv('resources/listings.csv')
df = df[['cancellation_policy','review_scores_value']]
df.head()

Unnamed: 0,cancellation_policy,review_scores_value
0,moderate,
1,moderate,9.0
2,moderate,10.0
3,moderate,10.0
4,flexible,10.0


In previous work, we wanted to find the average review score for each listing's group and the listing's deviation from that group mean. This was a two-step process: first we used transform on the groupby object, and then we had to merge and broadcast to create the new columns. With apply, we can wrap this logic in one place.

In [28]:
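def calc_mean(group):
    # group is the DataFrame of rows for a single cancellation policy
    avg = np.nanmean(group['review_scores_value'])
    # Despite its name, this column holds the absolute deviation from the group mean
    group['review_scores_mean'] = np.abs(avg - group['review_scores_value'])
    return group

df.groupby('cancellation_policy').apply(calc_mean).head(50)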

Unnamed: 0,cancellation_policy,review_scores_value,review_scores_mean
0,moderate,,
1,moderate,9.0,0.307398
2,moderate,10.0,0.692602
3,moderate,10.0,0.692602
4,flexible,10.0,0.762579
5,flexible,10.0,0.762579
6,strict,9.0,0.081441
7,moderate,10.0,0.692602
8,moderate,10.0,0.692602
9,strict,9.0,0.081441
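
The function given to apply does not have to return a frame; it may also return a single value per group, in which case the combined result has one entry per group. The sketch below is a minimal illustration with made-up data and a hypothetical helper, `score_range`. Selecting the `score` column before calling apply is a choice made here to keep the example independent of how a given pandas version treats the grouping column.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'policy': ['flexible', 'flexible', 'strict', 'strict'],
    'score':  [8.0, 10.0, 9.0, 9.5],
})

def score_range(scores):
    # scores is the Series of score values for one group
    return scores.max() - scores.min()

# One value per group, indexed by the group key
ranges = toy.groupby('policy')['score'].apply(score_range)

print(ranges)
```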
