Sometimes we want to select data based on groups and understand aggregated data at the group level. We have seen that even though pandas allows us to iterate over every row in a dataframe, it is generally very slow to do so. Fortunately, pandas has a groupby() function **to speed up such tasks**. The idea behind the **groupby()** function is that **it takes some dataframe, splits it into DataFrame chunks based on some key values, applies computation on those chunks, then combines the results back together into another dataframe**. In pandas this is referred to as the **split-apply-combine pattern**.
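To make the three steps concrete, here is a minimal sketch that spells them out by hand. The frame `toy` and its column names are invented for illustration and are not part of the census data used below.

```python
import pandas as pd

# Hypothetical toy data; the names "state" and "population" are assumptions.
toy = pd.DataFrame({
    "state": ["A", "A", "B", "B", "B"],
    "population": [10, 30, 6, 15, 42],
})

# Split: one DataFrame chunk per distinct value of the key column.
chunks = {key: chunk for key, chunk in toy.groupby("state")}

# Apply: compute something on each chunk independently.
means = {key: chunk["population"].mean() for key, chunk in chunks.items()}

# Combine: assemble the per-group results into a single object.
combined = pd.Series(means, name="population")
print(combined)
```

In everyday use pandas performs all three steps in one call, but writing them out separately shows what each step contributes.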


In [1]:
import pandas as pd 
import numpy as np

In [2]:
df = pd.read_csv("datasets/census.csv")
df

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
0,40,3,6,1,0,Alabama,Alabama,4779736,4780127,4785161,...,0.002295,-0.193196,0.381066,0.582002,-0.467369,1.030015,0.826644,1.383282,1.724718,0.712594
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.592270,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,14.832960,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-4.728132,-2.500690,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3188,50,4,8,56,37,Wyoming,Sweetwater County,43806,43806,43593,...,1.072643,16.243199,-5.339774,-14.252889,-14.248864,1.255221,16.243199,-5.295460,-14.075283,-14.070195
3189,50,4,8,56,39,Wyoming,Teton County,21294,21294,21297,...,-1.589565,0.972695,19.525929,14.143021,-0.564849,0.654527,2.408578,21.160658,16.308671,1.520747
3190,50,4,8,56,41,Wyoming,Uinta County,21118,21118,21102,...,-17.755986,-4.916350,-6.902954,-14.215862,-12.127022,-18.136812,-5.536861,-7.521840,-14.740608,-12.606351
3191,50,4,8,56,43,Wyoming,Washakie County,8533,8533,8545,...,-11.637475,-0.827815,-2.013502,-17.781491,1.682288,-11.990126,-1.182592,-2.250385,-18.020168,1.441961


In [3]:
df = df[df["SUMLEV"] == 50]
df.head()

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861
5,50,3,6,1,9,Alabama,Blount County,57322,57322,57373,...,1.807375,-1.177622,-1.748766,-2.062535,-1.36997,1.859511,-0.84858,-1.402476,-1.577232,-0.884411


In [4]:
df["STNAME"].unique()

array(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
       'Colorado', 'Connecticut', 'Delaware', 'District of Columbia',
       'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana',
       'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland',
       'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi',
       'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire',
       'New Jersey', 'New Mexico', 'New York', 'North Carolina',
       'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania',
       'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee',
       'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
       'West Virginia', 'Wisconsin', 'Wyoming'], dtype=object)

In [5]:
len(df["STNAME"].unique())

51

In [6]:
for state in df["STNAME"].unique():
    avg = df[df["STNAME"] == state]["CENSUS2010POP"].mean()
    print("counties in state " + state + 
          " have an average population of " + str(avg))


counties in state Alabama have an average population of 71339.34328358209
counties in state Alaska have an average population of 24490.724137931036
counties in state Arizona have an average population of 426134.4666666667
counties in state Arkansas have an average population of 38878.90666666667
counties in state California have an average population of 642309.5862068966
counties in state Colorado have an average population of 78581.1875
counties in state Connecticut have an average population of 446762.125
counties in state Delaware have an average population of 299311.3333333333
counties in state District of Columbia have an average population of 601723.0
counties in state Florida have an average population of 280616.5671641791
counties in state Georgia have an average population of 60928.63522012578
counties in state Hawaii have an average population of 272060.2
counties in state Idaho have an average population of 35626.86363636364
counties in state Illinois have an average populat

In [7]:
%%timeit -n 3

for state in df["STNAME"].unique():
    avg = np.average(df[df["STNAME"] == state]["CENSUS2010POP"])
    print("counties in state " + state + 
          " have an average population of " + str(avg))


counties in state Alabama have an average population of 71339.34328358209
counties in state Alaska have an average population of 24490.724137931036
counties in state Arizona have an average population of 426134.4666666667
counties in state Arkansas have an average population of 38878.90666666667
counties in state California have an average population of 642309.5862068966
counties in state Colorado have an average population of 78581.1875
counties in state Connecticut have an average population of 446762.125
counties in state Delaware have an average population of 299311.3333333333
counties in state District of Columbia have an average population of 601723.0
counties in state Florida have an average population of 280616.5671641791
counties in state Georgia have an average population of 60928.63522012578
counties in state Hawaii have an average population of 272060.2
counties in state Idaho have an average population of 35626.86363636364
counties in state Illinois have an average populat


We unpack two values in the loop below. **Iterating over the object returned by groupby() yields tuples**, where **the first value is the key we grouped by**, in this case a specific state name, and **the second is the DataFrame chunk that was found for that group**.

In [8]:
%%timeit -n 3

for group, frame in df.groupby("STNAME"):
    
    # the apply step: run the computation on each DataFrame chunk.
    avg = np.average(frame["CENSUS2010POP"])
    # print the loop variable `group`; `state` is a stale name left over from the previous cell.
    print("counties in state " + group +
          " have an average population " + str(avg))
    
# we don't have to worry about the combine step in this case, because all we do
# with each result is print it.

counties in state Alabama have an average population 71339.34328358209
counties in state Alaska have an average population 24490.724137931036
counties in state Arizona have an average population 426134.4666666667
counties in state Arkansas have an average population 38878.90666666667
counties in state California have an average population 642309.5862068966
counties in state Colorado have an average population 78581.1875
counties in state Connecticut have an average population 446762.125
counties in state Delaware have an average population 299311.3333333333
counties in state District of Columbia have an average population 601723.0
counties in state Florida have an average population 280616.5671641791
counties in state Georgia have an average population 60928.63522012578
counties in state Hawaii have an average population 272060.2
counties in state Idaho have an average population 35626.86363636364
counties in state Illinois have an average population 125790.50980392157
counties in state Indiana have an av

Wow, what a huge difference in speed.
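The groupby loop still runs Python code once per group. As a sketch of how the combine step can be left to pandas as well, the column can be selected on the GroupBy object and aggregated in a single call. The frame `census` below is a small stand-in with invented values; only the column names are taken from the notebook.

```python
import pandas as pd

# Toy stand-in for the census frame: column names match the notebook,
# the values are made up for illustration.
census = pd.DataFrame({
    "STNAME": ["Alabama", "Alabama", "Alaska", "Alaska"],
    "CENSUS2010POP": [100, 300, 50, 70],
})

# Select one column on the GroupBy object and aggregate it.
# The result is a Series indexed by state name.
avg_by_state = census.groupby("STNAME")["CENSUS2010POP"].mean()

for state, avg in avg_by_state.items():
    print("counties in state " + state +
          " have an average population of " + str(avg))
```

Because the result is an ordinary Series, it can be sorted, filtered, or merged back into other frames without any further looping.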

Now, 99% of the time, we'll use **groupby on one or more columns**. But we can also **provide a function to groupby** and use that **to segment our data**.

Suppose we have a big batch job with lots of processing and we want to work on only a third or so of the states at a given time. We could create a function which returns a number between zero and two based on the first character of the state name. Then **we can tell groupby to use this function to split up our data frame**.

It's important to note that **in order to do this we first need to set the index of the data frame to the column that we want to group by**.

If **no column is passed to groupby()**, it **will automatically use the index**: the function is called on each index value, and its return value becomes the group key.

In [9]:
df = df.set_index("STNAME")

def set_batch_number(item):
    if item[0] < "M":
        return 0
    if item[0] < "Q":
        return 1
    return 2

# there are 3 groups (group 0, group 1, and group 2), and 3 DataFrames for these groups 
for group, frame in df.groupby(set_batch_number):
    print("there are " + str(len(frame)) +
          " records in group " + str(group) + " for processing")
    

there are 1177 records in group 0 for processing
there are 1134 records in group 1 for processing
there are 831 records in group 2 for processing
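Setting the index is one way to hand key values to a grouping function. As an alternative sketch that is not part of the original notebook, the function can be mapped over the column first and the resulting Series passed to groupby, which leaves the index untouched. The frame `toy` and the name `batches` are assumptions made for this example.

```python
import pandas as pd

def set_batch_number(item):
    # Same rule as above: bucket by the first letter of the state name.
    if item[0] < "M":
        return 0
    if item[0] < "Q":
        return 1
    return 2

# Hypothetical toy frame; STNAME stays an ordinary column here.
toy = pd.DataFrame({
    "STNAME": ["Alabama", "Montana", "Texas", "Ohio"],
    "CENSUS2010POP": [1, 2, 3, 4],
})

# Build one group key per row, aligned with toy's index.
batches = toy["STNAME"].map(set_batch_number)

sizes = toy.groupby(batches).size()
for group, frame in toy.groupby(batches):
    print("there are " + str(len(frame)) +
          " records in group " + str(group) + " for processing")
```

This form is convenient when the frame's index is needed for something else, at the cost of building an intermediate Series of keys.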


In the next example, we use a **dataset of Airbnb housing listings**. In this dataset there are two columns of interest: cancellation_policy and review_scores_value.

In [10]:
df = pd.read_csv("datasets/listings.csv")
df

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
0,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",none,"Roslindale is quiet, convenient and friendly. ...",...,,f,,,f,moderate,f,f,1,
1,3075044,https://www.airbnb.com/rooms/3075044,20160906204935,2016-09-07,Charming room in pet friendly apt,Charming and quiet room in a second floor 1910...,Small but cozy and quite room with a full size...,Charming and quiet room in a second floor 1910...,none,"The room is in Roslindale, a diverse and prima...",...,9.0,f,,,t,moderate,f,f,1,1.30
2,6976,https://www.airbnb.com/rooms/6976,20160906204935,2016-09-07,Mexican Folk Art Haven in Boston,"Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...",none,The LOCATION: Roslindale is a safe and diverse...,...,10.0,f,,,f,moderate,t,f,1,0.47
3,1436513,https://www.airbnb.com/rooms/1436513,20160906204935,2016-09-07,Spacious Sunny Bedroom Suite in Historic Home,Come experience the comforts of home away from...,Most places you find in Boston are small howev...,Come experience the comforts of home away from...,none,Roslindale is a lovely little neighborhood loc...,...,10.0,f,,,f,moderate,f,f,1,1.00
4,7651065,https://www.airbnb.com/rooms/7651065,20160906204935,2016-09-07,Come Home to Boston,"My comfy, clean and relaxing home is one block...","Clean, attractive, private room, one block fro...","My comfy, clean and relaxing home is one block...",none,"I love the proximity to downtown, the neighbor...",...,10.0,f,,,f,flexible,f,f,1,2.25
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3580,8373729,https://www.airbnb.com/rooms/8373729,20160906204935,2016-09-07,Big cozy room near T,5 min walking to Orange Line subway with 2 sto...,,5 min walking to Orange Line subway with 2 sto...,none,,...,9.0,f,,,t,strict,f,f,8,0.34
3581,14844274,https://www.airbnb.com/rooms/14844274,20160906204935,2016-09-07,BU Apartment DexterPark Bright room,"Most popular apartment in BU, best located in ...",Best location in BU,"Most popular apartment in BU, best located in ...",none,,...,,f,,,f,strict,f,f,2,
3582,14585486,https://www.airbnb.com/rooms/14585486,20160906204935,2016-09-07,Gorgeous funky apartment,Funky little apartment close to public transpo...,Modern and relaxed space with many facilities ...,Funky little apartment close to public transpo...,none,"Cambridge is a short walk into Boston, and set...",...,,f,,,f,flexible,f,f,1,
3583,14603878,https://www.airbnb.com/rooms/14603878,20160906204935,2016-09-07,Great Location; Train and Restaurants,"My place is close to Taco Loco Mexican Grill, ...",,"My place is close to Taco Loco Mexican Grill, ...",none,,...,7.0,f,,,f,strict,f,f,1,2.00


In [11]:
df = df.set_index(["cancellation_policy", "review_scores_value"])
li = []
for group, frame in df.groupby(level=(0,1)):
    li.append(len(frame))
    print("there is/are " + str(len(frame)) + " records " + "in group " + str(group))
    

there is/are 1 records in group ('flexible', 2.0)
there is/are 5 records in group ('flexible', 4.0)
there is/are 1 records in group ('flexible', 5.0)
there is/are 18 records in group ('flexible', 6.0)
there is/are 12 records in group ('flexible', 7.0)
there is/are 67 records in group ('flexible', 8.0)
there is/are 200 records in group ('flexible', 9.0)
there is/are 332 records in group ('flexible', 10.0)
there is/are 1 records in group ('moderate', 2.0)
there is/are 1 records in group ('moderate', 4.0)
there is/are 10 records in group ('moderate', 6.0)
there is/are 7 records in group ('moderate', 7.0)
there is/are 82 records in group ('moderate', 8.0)
there is/are 304 records in group ('moderate', 9.0)
there is/are 379 records in group ('moderate', 10.0)
there is/are 5 records in group ('strict', 2.0)
there is/are 2 records in group ('strict', 3.0)
there is/are 6 records in group ('strict', 4.0)
there is/are 1 records in group ('strict', 5.0)
there is/are 19 records in group ('strict',

The total number of records in the original DataFrame is:

In [12]:
len(df)

3585

The total number of records covered by the groups is:

In [13]:
np.sum(li)

2764

The records with a NaN review score in the original DataFrame are:

In [14]:
review_scores_li = [item[1] for item in df.index if np.isnan(item[1])]
review_scores_series = pd.Series(review_scores_li)
review_scores_series

0     NaN
1     NaN
2     NaN
3     NaN
4     NaN
       ..
816   NaN
817   NaN
818   NaN
819   NaN
820   NaN
Length: 821, dtype: float64

In [15]:
3585 - 2764

821

Note: **by default, groupby() drops groups whose key contains NaN, but we can include those rows by grouping with a custom function**. So the 821 records with a NaN review score are ignored by the groupby call above.
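Besides a custom grouping function, recent pandas versions (1.1 or newer, as far as I know) accept a `dropna=False` argument to groupby that keeps NaN keys as their own groups. The sketch below uses an invented four-row frame rather than the listings data, and groups by columns instead of index levels.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names mirror the notebook, values are made up.
toy = pd.DataFrame({
    "cancellation_policy": ["flexible", "flexible", "strict", "strict"],
    "review_scores_value": [10.0, np.nan, 9.0, np.nan],
})

keys = ["cancellation_policy", "review_scores_value"]

# Default behaviour: rows whose key contains NaN are dropped.
default_sizes = toy.groupby(keys).size()

# dropna=False keeps NaN as a valid group key.
with_nan_sizes = toy.groupby(keys, dropna=False).size()

print(default_sizes.sum())   # rows covered when NaN keys are dropped
print(with_nan_sizes.sum())  # rows covered when NaN keys are kept
```

Comparing the two sums against `len(toy)` is the same consistency check the notebook performs by hand with 3585 and 2764.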

What if we wanted to group by the cancellation policy and review scores, but separate out all the 10's from those under ten? In this case, we could use a function to manage the groupings.

In [16]:
def grouping(item):
    if item[1] == 10:
        return (item[0], "10.0")
    return (item[0], "not 10.0")
li = []
for group, frame in df.groupby(grouping):
    li.append(len(frame))
    print(group)


('flexible', '10.0')
('flexible', 'not 10.0')
('moderate', '10.0')
('moderate', 'not 10.0')
('strict', '10.0')
('strict', 'not 10.0')
('super_strict_30', '10.0')
('super_strict_30', 'not 10.0')


After the loop finishes, frame still holds the last group, so the DataFrame of group ('super_strict_30', 'not 10.0') is:

In [17]:
frame

Unnamed: 0_level_0,Unnamed: 1_level_0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,review_scores_communication,review_scores_location,requires_license,license,jurisdiction_names,instant_bookable,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
cancellation_policy,review_scores_value,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
super_strict_30,,903598,https://www.airbnb.com/rooms/903598,20160906204935,2016-09-07,[1684-NE]2BR At The Longwood,,Inside the spectacular apartments at The Longw...,Inside the spectacular apartments at The Longw...,none,"Living at The Longwood Apartments, residents w...",...,,,f,,,f,f,t,79,
super_strict_30,,9903,https://www.airbnb.com/rooms/9903,20160906204935,2016-09-07,[1480-1] 1BR-City View at Longwood,,CityView at Longwood apartments are located on...,CityView at Longwood apartments are located on...,none,,...,,,f,,,f,f,t,79,
super_strict_30,8.0,195515,https://www.airbnb.com/rooms/195515,20160906204935,2016-09-07,[1684-ST] Lux 1BR-Longwood Med Area,Very nicely appointed apartment in an economic...,This Longwood apartments will allow residents ...,Very nicely appointed apartment in an economic...,none,,...,10.0,10.0,f,,,f,f,t,79,0.02
super_strict_30,8.0,130552,https://www.airbnb.com/rooms/130552,20160906204935,2016-09-07,[1684-2]2BRs-Near Longwood Med Area,Very nicely appointed apartment in an economic...,Inside the spectacular apartments at The Longw...,Very nicely appointed apartment in an economic...,none,,...,10.0,9.0,f,,,f,f,t,79,0.05
super_strict_30,,543188,https://www.airbnb.com/rooms/543188,20160906204935,2016-09-07,[1480-2] 2 BR-City view at Longwood,,City View at Longwood apartments are located o...,City View at Longwood apartments are located o...,none,,...,,,f,,,f,f,t,79,
super_strict_30,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
super_strict_30,7.0,180914,https://www.airbnb.com/rooms/180914,20160906204935,2016-09-07,[1125-1]Furnished 1BR - Landmark Square,,"Landmark Square, an elegant apartment communit...","Landmark Square, an elegant apartment communit...",none,,...,8.0,9.0,f,,,f,f,t,79,0.04
super_strict_30,8.0,951480,https://www.airbnb.com/rooms/951480,20160906204935,2016-09-07,[1246-2C]Elegant 2BRs - Fenway Area,,This high-rise building is modern city living ...,This high-rise building is modern city living ...,none,Central Location - near Harvard Medical School...,...,10.0,10.0,f,,,f,f,t,79,0.03
super_strict_30,8.0,951476,https://www.airbnb.com/rooms/951476,20160906204935,2016-09-07,[1246-1NE]Luxury 1BR - Fenway Area,,We know that there is more to City Living than...,We know that there is more to City Living than...,none,,...,7.0,9.0,f,,,f,f,t,79,0.08
super_strict_30,,951473,https://www.airbnb.com/rooms/951473,20160906204935,2016-09-07,[1246-1C] Lux 1BR in The Fenway,,"The Trilogy offers you stylish design, natural...","The Trilogy offers you stylish design, natural...",none,Central Location - near Harvard Medical School...,...,,,f,,,f,f,t,79,


In this case, the NaN rows are included (they fall into the "not 10.0" groups), so the total number of records covered by the groups is:

In [18]:
np.sum(li)

3585

# in the apply step :

So far we have applied only very simple processing to our data after splitting, really just outputting some print statements to demonstrate how the splitting works.

The pandas developers describe **three broad categories of data processing** that happen during **the apply step**: **aggregation of group data, transformation of group data, and filtration of group data**, plus a general-purpose **apply**.

# 1. Aggregation of group data :

In [19]:
df = df.reset_index()

df.groupby("cancellation_policy").agg({"review_scores_value" : np.average})

Unnamed: 0_level_0,review_scores_value
cancellation_policy,Unnamed: 1_level_1
flexible,
moderate,
strict,
super_strict_30,


That didn't seem to work at all because **np.average does not ignore nans**! However, we can use **np.nanmean** for this.

**The most straightforward apply step** is the **aggregation of data**, which uses the method **.agg() on the groupby object**. Thus far we have only iterated through the groupby object, unpacking it into a label (the group name) and a dataframe. But **with .agg() we can pass in a dictionary of the columns** we are interested in aggregating **along with the function** we want to apply to each.

**.agg() returns a new dataframe that is not the same size as the original dataframe object**.

In [20]:
df.groupby("cancellation_policy").agg({"review_scores_value" : np.nanmean})

Unnamed: 0_level_0,review_scores_value
cancellation_policy,Unnamed: 1_level_1
flexible,9.237421
moderate,9.307398
strict,9.081441
super_strict_30,8.537313


In [21]:
df.groupby("cancellation_policy").agg({"review_scores_value" : (np.nanmean, np.nanstd),
                                       "reviews_per_month" : np.nanmean})

Unnamed: 0_level_0,review_scores_value,review_scores_value,reviews_per_month
Unnamed: 0_level_1,nanmean,nanstd,nanmean
cancellation_policy,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
flexible,9.237421,1.096271,1.82921
moderate,9.307398,0.859859,2.391922
strict,9.081441,1.040531,1.873467
super_strict_30,8.537313,0.840785,0.340143


First we call **.groupby() on the dataframe object**, grouping by the column "cancellation_policy". This **creates a new GroupBy object**. Then we invoke the **.agg() function on that object**. **The agg function applies one or more functions we specify to each group dataframe and returns the results as a single row per group**. **When we called this function we sent it two dictionary entries, each with a key indicating the column of the group dataframe we wanted functions applied to**.

**For the first key we actually supplied a tuple of two functions**. Note that **these are not function invocations, like np.nanmean(), or function names, like "nanmean"**; **they are references to functions which will return single values**. **The results will be in a hierarchical column index**, with the column name on the first level and the function name on the second. Then we indicated another column and a single function we wanted to run.
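To see the hierarchical columns without loading the full listings file, here is a small sketch on made-up data. The `toy` dataframe and its values are assumptions for illustration. A list of functions is used here where the cell above uses a tuple, and `np.nanmax` stands in for `np.nanstd` only to keep the numbers easy to check by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the listings file
toy = pd.DataFrame({
    "cancellation_policy": ["flexible", "flexible", "strict", "strict"],
    "review_scores_value": [9.0, np.nan, 8.0, 10.0],
    "reviews_per_month": [1.0, 3.0, 2.0, np.nan],
})

# Two functions for the first column, one function for the second
summary = toy.groupby("cancellation_policy").agg({
    "review_scores_value": [np.nanmean, np.nanmax],
    "reviews_per_month": np.nanmean,
})

print(summary.shape)            # (2, 3): one row per group, one column per (column, function) pair
print(summary.columns.nlevels)  # 2: column name on level 0, function name on level 1

# Selecting a first-level key returns the sub-frame for that column
print(summary["review_scores_value"])
```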

# 2. Transformation of group data :

In [22]:
cols = ["cancellation_policy", "review_scores_value"]

df[cols].groupby("cancellation_policy")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002893E2CF3D0>

In [23]:
type(df[cols].groupby("cancellation_policy").transform(np.nanmean))

pandas.core.frame.DataFrame

Transformation is different from aggregation. Where agg() returns a single row per group, **transform() returns a new dataframe object that is the same size as the original dataframe object**. Essentially, **it broadcasts the function we supply over the grouped dataframe**, so every row receives the value computed for its group. This makes **combining data easy** later.
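The size difference can be checked directly on a small example. This is a sketch on made-up data (the `toy` dataframe is an assumption). The value column is selected before aggregating, so both results are Series rather than dataframes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: four listings in two groups
toy = pd.DataFrame({
    "cancellation_policy": ["moderate", "flexible", "moderate", "flexible"],
    "review_scores_value": [9.0, 10.0, np.nan, 7.0],
})

grouped = toy.groupby("cancellation_policy")["review_scores_value"]

aggregated = grouped.agg(np.nanmean)         # one value per group
transformed = grouped.transform(np.nanmean)  # one value per original row

print(len(aggregated))       # 2
print(len(transformed))      # 4
print(transformed.tolist())  # [9.0, 8.5, 9.0, 8.5]
```

Because the transformed result keeps the index of the original rows, it lines up with the source dataframe row by row.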

In [24]:
transformed_df = df[cols].groupby("cancellation_policy").transform(np.nanmean)
transformed_df

Unnamed: 0,review_scores_value
0,9.307398
1,9.307398
2,9.307398
3,9.307398
4,9.237421
...,...
3580,9.081441
3581,9.081441
3582,9.237421
3583,9.081441


In [25]:
transformed_df.rename({"review_scores_value" : "mean_scores_value"}, inplace= True, axis= 1)
df = df.merge(transformed_df, left_index= True, right_index= True)
df.head(10)

Unnamed: 0,cancellation_policy,review_scores_value,id,listing_url,scrape_id,last_scraped,name,summary,space,description,...,review_scores_location,requires_license,license,jurisdiction_names,instant_bookable,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month,mean_scores_value
0,moderate,,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",...,,f,,,f,f,f,1,,9.307398
1,moderate,9.0,3075044,https://www.airbnb.com/rooms/3075044,20160906204935,2016-09-07,Charming room in pet friendly apt,Charming and quiet room in a second floor 1910...,Small but cozy and quite room with a full size...,Charming and quiet room in a second floor 1910...,...,9.0,f,,,t,f,f,1,1.3,9.307398
2,moderate,10.0,6976,https://www.airbnb.com/rooms/6976,20160906204935,2016-09-07,Mexican Folk Art Haven in Boston,"Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...",...,9.0,f,,,f,t,f,1,0.47,9.307398
3,moderate,10.0,1436513,https://www.airbnb.com/rooms/1436513,20160906204935,2016-09-07,Spacious Sunny Bedroom Suite in Historic Home,Come experience the comforts of home away from...,Most places you find in Boston are small howev...,Come experience the comforts of home away from...,...,10.0,f,,,f,f,f,1,1.0,9.307398
4,flexible,10.0,7651065,https://www.airbnb.com/rooms/7651065,20160906204935,2016-09-07,Come Home to Boston,"My comfy, clean and relaxing home is one block...","Clean, attractive, private room, one block fro...","My comfy, clean and relaxing home is one block...",...,9.0,f,,,f,f,f,1,2.25,9.237421
5,flexible,10.0,12386020,https://www.airbnb.com/rooms/12386020,20160906204935,2016-09-07,Private Bedroom + Great Coffee,Super comfy bedroom plus your own bathroom in ...,Our sunny condo is located on the second and t...,Super comfy bedroom plus your own bathroom in ...,...,9.0,f,,,f,f,f,1,1.7,9.237421
6,strict,9.0,5706985,https://www.airbnb.com/rooms/5706985,20160906204935,2016-09-07,New Lrg Studio apt 15 min to Boston,It's a 5 minute walk to Rosi Square to catch t...,The whole house was recently redone and it 's ...,It's a 5 minute walk to Rosi Square to catch t...,...,9.0,f,,,f,f,f,3,4.0,9.081441
7,moderate,10.0,2843445,https://www.airbnb.com/rooms/2843445,20160906204935,2016-09-07,"""Tranquility"" on ""Top of the Hill""","We can accommodate guests who are gluten-free,...",We provide a bedroom and full shared bath. Ra...,"We can accommodate guests who are gluten-free,...",...,10.0,f,,,f,t,t,2,2.38,9.307398
8,moderate,10.0,753446,https://www.airbnb.com/rooms/753446,20160906204935,2016-09-07,6 miles away from downtown Boston!,Nice and cozy apartment about 6 miles away to ...,Nice and cozy apartment about 6 miles away to ...,Nice and cozy apartment about 6 miles away to ...,...,9.0,f,,,f,f,f,1,5.36,9.307398
9,strict,9.0,849408,https://www.airbnb.com/rooms/849408,20160906204935,2016-09-07,Perfect & Practical Boston Rental,This is a cozy and spacious two bedroom unit w...,Perfect apartment rental for those in town vis...,This is a cozy and spacious two bedroom unit w...,...,9.0,f,,,f,f,f,2,1.01,9.081441


The difference between mean_scores_value and review_scores_value:

In [26]:
# broadcasting process
df["mean_diff"] = np.absolute(df["review_scores_value"] - df["mean_scores_value"])
df["mean_diff"].head()

0         NaN
1    0.307398
2    0.692602
3    0.692602
4    0.762579
Name: mean_diff, dtype: float64
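Since the transformed result shares the index of the original rows, the rename-and-merge step is not strictly required. As an alternative sketch (not the approach used in the cells above), the group mean can be computed on the selected column and used directly. The `listings` dataframe here is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the listings file
listings = pd.DataFrame({
    "cancellation_policy": ["moderate", "moderate", "moderate", "flexible", "flexible"],
    "review_scores_value": [np.nan, 9.0, 10.0, 10.0, 8.0],
})

# Group mean broadcast to every row; the "mean" string skips NaN by default
group_mean = listings.groupby("cancellation_policy")["review_scores_value"].transform("mean")

# Absolute deviation of each listing from its group mean
listings["mean_diff"] = (listings["review_scores_value"] - group_mean).abs()

print(listings)
```

Rows with a missing review score keep NaN in `mean_diff`, as in the output above.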

# 3. Filtering of group data :

The **.filter()** method **takes in a function** which it **applies to each group dataframe**. That function **returns either True or False, depending on whether the records of that group should be included in the resulting dataframe**.

**.filter() drops the records of every group for which the function returns False and returns the rest**.

So **the resulting dataframe is generally not the same size as the original dataframe object; it is a reduced version of it**.

In [27]:
filtered_df = df.groupby("cancellation_policy").filter(lambda x: np.nanmean(x["review_scores_value"]) > 9.2)
filtered_df.head()

Unnamed: 0,cancellation_policy,review_scores_value,id,listing_url,scrape_id,last_scraped,name,summary,space,description,...,requires_license,license,jurisdiction_names,instant_bookable,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month,mean_scores_value,mean_diff
0,moderate,,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",...,f,,,f,f,f,1,,9.307398,
1,moderate,9.0,3075044,https://www.airbnb.com/rooms/3075044,20160906204935,2016-09-07,Charming room in pet friendly apt,Charming and quiet room in a second floor 1910...,Small but cozy and quite room with a full size...,Charming and quiet room in a second floor 1910...,...,f,,,t,f,f,1,1.3,9.307398,0.307398
2,moderate,10.0,6976,https://www.airbnb.com/rooms/6976,20160906204935,2016-09-07,Mexican Folk Art Haven in Boston,"Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...",...,f,,,f,t,f,1,0.47,9.307398,0.692602
3,moderate,10.0,1436513,https://www.airbnb.com/rooms/1436513,20160906204935,2016-09-07,Spacious Sunny Bedroom Suite in Historic Home,Come experience the comforts of home away from...,Most places you find in Boston are small howev...,Come experience the comforts of home away from...,...,f,,,f,f,f,1,1.0,9.307398,0.692602
4,flexible,10.0,7651065,https://www.airbnb.com/rooms/7651065,20160906204935,2016-09-07,Come Home to Boston,"My comfy, clean and relaxing home is one block...","Clean, attractive, private room, one block fro...","My comfy, clean and relaxing home is one block...",...,f,,,f,f,f,1,2.25,9.237421,0.762579


In [28]:
filtered_df["cancellation_policy"].unique()

array(['moderate', 'flexible'], dtype=object)

The moderate and flexible groups have a mean greater than 9.2 for review_scores_value, so only their records are kept.
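The same filtering logic can be traced by hand on a few rows. This is a minimal sketch on made-up data (the `toy` dataframe is an assumption), using the same 9.2 threshold.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three groups with easily checked means
toy = pd.DataFrame({
    "cancellation_policy": ["moderate", "moderate", "strict", "strict", "flexible"],
    "review_scores_value": [10.0, 9.0, 8.0, np.nan, 10.0],
})

# Group means: moderate 9.5, strict 8.0, flexible 10.0
kept = toy.groupby("cancellation_policy").filter(
    lambda group_df: np.nanmean(group_df["review_scores_value"]) > 9.2
)

print(kept.index.tolist())                   # [0, 1, 4]
print(kept["cancellation_policy"].unique())  # ['moderate' 'flexible']
```

Whole groups are kept or dropped. The rows of a kept group are returned unchanged, including any rows whose own score is below the threshold.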

# 4. Apply of group data :

**.apply() invoked on groupby object** allows us to **apply an arbitrary function to each group dataframe**.

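In the example below, it **returns a dataframe with the original columns plus the new ones added by our function**, and it **has the same number of rows as the original dataframe**. In general, the shape of the result depends on what the function returns.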

In previous work we wanted to find the average review score of a listing and its deviation from the group mean. This was a two-step process: first we used transform() on the groupby object, and then we had to broadcast to create a new column. With apply() we could wrap this logic in one place.

In [29]:
df = pd.read_csv("datasets/listings.csv")

df = df[["cancellation_policy", "review_scores_value"]]

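def calc_mean_scores_value(group_df):
    # work on a copy so the group passed in by pandas is not mutated
    group_df = group_df.copy()
    # broadcasting process: the scalar group mean fills every row of the group
    group_df["mean_scores_value"] = np.nanmean(group_df["review_scores_value"])
    return group_df

# group_keys=False keeps the original row index instead of adding the group label to it
df = df.groupby("cancellation_policy", group_keys=False).apply(calc_mean_scores_value)

# the difference between mean_scores_value and review_scores_value
def calc_diff_scores_value(group_df):
    group_df = group_df.copy()
    # broadcasting process
    group_df["diff_mean_review"] = np.abs(group_df["review_scores_value"] - group_df["mean_scores_value"])
    return group_df

df = df.groupby("cancellation_policy", group_keys=False).apply(calc_diff_scores_value)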
df

Unnamed: 0,cancellation_policy,review_scores_value,mean_scores_value,diff_mean_review
0,moderate,,9.307398,
1,moderate,9.0,9.307398,0.307398
2,moderate,10.0,9.307398,0.692602
3,moderate,10.0,9.307398,0.692602
4,flexible,10.0,9.237421,0.762579
...,...,...,...,...
3580,strict,9.0,9.081441,0.081441
3581,strict,,9.081441,
3582,flexible,,9.237421,
3583,strict,7.0,9.081441,2.081441


Using **.apply() can be slower than** using **some of the specialized functions**, especially **agg()**. But, if your dataframes are not huge, it's a solid general purpose approach.
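To make the comparison concrete, the sketch below computes the same group means two ways on made-up data (the `toy` dataframe is an assumption). No timings are claimed here. The two calls can be timed on your own data with `%timeit` in a notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "cancellation_policy": ["flexible", "flexible", "strict", "strict"],
    "review_scores_value": [9.0, np.nan, 8.0, 9.0],
})

scores = toy.groupby("cancellation_policy")["review_scores_value"]

via_apply = scores.apply(np.nanmean)  # general purpose: calls the function once per group
via_agg = scores.agg("mean")          # specialized: pandas' built-in grouped mean, NaN skipped

print(via_apply.tolist())  # [9.0, 8.5]
print(via_agg.tolist())    # [9.0, 8.5]
```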

**.groupby()** is a powerful tool and **commonly used for data cleaning and data analysis**.

**Once we have grouped the data by some category**, each group is a dataframe of just those values, and **we can conduct aggregated analysis on the segments that we are interested in**.

The groupby() function follows a split-apply-combine approach:

* first the data is **split into subgroups**, 
* then we can **apply some operation** to each subgroup: aggregation, transformation, filtering, or a general apply,
* then **the results are combined automatically by pandas** for us.
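The three steps above can be spelled out by hand and compared with what groupby() does automatically. This is a sketch on made-up data; the `toy` dataframe and the names `pieces`, `manual`, and `automatic` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "cancellation_policy": ["flexible", "strict", "flexible", "strict"],
    "review_scores_value": [9.0, 8.0, 10.0, np.nan],
})

# Split and apply: iterate over (label, group dataframe) pairs and compute one value each
pieces = {}
for policy, group_df in toy.groupby("cancellation_policy"):
    pieces[policy] = group_df["review_scores_value"].mean()  # NaN skipped by default

# Combine: assemble the per-group results into a single Series
manual = pd.Series(pieces, name="review_scores_value")

# The same result, with the combine step handled by pandas
automatic = toy.groupby("cancellation_policy")["review_scores_value"].mean()

print(manual.tolist())     # [9.5, 8.0]
print(automatic.tolist())  # [9.5, 8.0]
```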