# Groupby
* Often we want to split up and work with data based on groups
* Pandas allows us to iterate through the rows and columns of a dataframe, but doing so is slow compared with its built-in grouped operations
* Pandas also supports `groupby()` through a split-apply-combine pattern
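
As a rough illustration of split-apply-combine, the sketch below performs the three steps by hand on a small made-up frame and then compares the result with a single `groupby()` call. The `toy` frame and its `state` and `population` columns are hypothetical and not part of the census data.

```python
import pandas as pd

# Hypothetical toy data, not taken from the census file
toy = pd.DataFrame({
    "state": ["A", "A", "B", "B", "B"],
    "population": [10, 30, 3, 6, 9],
})

# Split: one sub-DataFrame per distinct value of "state"
pieces = {name: frame for name, frame in toy.groupby("state")}

# Apply: compute the mean population of each piece
means = {name: frame["population"].mean() for name, frame in pieces.items()}

# Combine: collect the per-group results into one Series
combined = pd.Series(means, name="population")

# groupby can do all three steps in one expression
shortcut = toy.groupby("state")["population"].mean()

print(combined)
print(shortcut)
```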

## Splitting
* Let's get motivated first

In [20]:
import pandas as pd
import numpy as np
df = pd.read_csv('datasets/census.csv')
df = df[df['SUMLEV']==50]
df.head()

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861
5,50,3,6,1,9,Alabama,Blount County,57322,57322,57373,...,1.807375,-1.177622,-1.748766,-2.062535,-1.36997,1.859511,-0.84858,-1.402476,-1.577232,-0.884411


In [21]:
# Compute the average county population for each state
for state in df['STNAME'].unique():  # every unique state name
    # where() masks rows from other states with NaN, dropna() removes them,
    # and np.average() then averages what is left of 'CENSUS2010POP'
    avg = np.average(df.where(df['STNAME'] == state).dropna()['CENSUS2010POP'])
    print('Counties in state ' + state + ' have an average population of ' + str(avg))

Counties in state Alabama have an average population of 71339.34328358209
Counties in state Alaska have an average population of 24490.724137931036
Counties in state Arizona have an average population of 426134.4666666667
Counties in state Arkansas have an average population of 38878.90666666667
Counties in state California have an average population of 642309.5862068966
Counties in state Colorado have an average population of 78581.1875
Counties in state Connecticut have an average population of 446762.125
Counties in state Delaware have an average population of 299311.3333333333
Counties in state District of Columbia have an average population of 601723.0
Counties in state Florida have an average population of 280616.5671641791
Counties in state Georgia have an average population of 60928.63522012578
Counties in state Hawaii have an average population of 272060.2
Counties in state Idaho have an average population of 35626.86363636364
Counties in state Illinois have an average populat

In [31]:
%%timeit -n 3
# Time the whole cell, 3 loops per run
for state in df['STNAME'].unique():
    avg = np.average(df.where(df['STNAME'] == state).dropna()['CENSUS2010POP'])


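KeyError: 'STNAME' (`df` had no `STNAME` column when this cell was run, so no timing was recorded; run it right after the census data is loaded)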

In [None]:
# groupby() yields (group, frame) pairs: group is the state name,
# frame is the DataFrame holding only that state's rows
for group, frame in df.groupby('STNAME'):
    avg = np.average(frame['CENSUS2010POP'])  # average population within this state

In [12]:
%%timeit -n 3
# Time the groupby version, 3 loops per run: a dramatic improvement in speed
for group, frame in df.groupby('STNAME'):
    avg = np.average(frame['CENSUS2010POP'])

14.3 ms ± 160 µs per loop (mean ± std. dev. of 7 runs, 3 loops each)
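
A quick way to convince yourself that the two loops compute the same thing is to run both on a small synthetic frame and compare the results. The sketch below does this with randomly generated data; the column names mirror the census file, but the values are made up, and a plain boolean mask stands in for `where().dropna()`. It makes no claim about timings.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the census data (values are random, not real)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "STNAME": rng.choice(["Alabama", "Alaska", "Arizona"], size=300),
    "CENSUS2010POP": rng.integers(1_000, 100_000, size=300),
})

# Manual approach: scan the whole frame once per state
loop_avgs = {}
for state in toy["STNAME"].unique():
    loop_avgs[state] = toy.loc[toy["STNAME"] == state, "CENSUS2010POP"].mean()

# groupby approach: the frame is partitioned once, then each piece is averaged
group_avgs = {}
for state, frame in toy.groupby("STNAME"):
    group_avgs[state] = frame["CENSUS2010POP"].mean()

for state in sorted(loop_avgs):
    print(state, loop_avgs[state], group_avgs[state])
```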


* Ok, so `groupby` is great
* Usually you'll group by the values in a column, but you can also pass a function to `groupby()`. Pandas calls it on each index label and uses the return value as the group key.

In [None]:
df = df.set_index('STNAME')
df.head()


In [None]:
def set_batch_number(item):
    # item is one index label (a state name); bucket it by its first letter
    if item[0] < 'M':  # first letter comes before 'M'
        return 'States up to M'
    if item[0] < 'Q':  # first letter is 'M' through 'P'
        return 'States up to Q'
    return 'States up to Z'  # everything else

for group, frame in df.groupby(set_batch_number):  # pass the function itself
    print(group)  # prints the three group names
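
To see which value the grouping function actually receives, here is a minimal sketch on a hypothetical four-row frame indexed by state name. The frame, the `first_letter_bucket` helper, and the bucket labels are all made up for illustration.

```python
import pandas as pd

# Hypothetical frame indexed by state name
toy = pd.DataFrame(
    {"CENSUS2010POP": [100, 200, 300, 400]},
    index=["Alabama", "Montana", "Texas", "Maine"],
)

def first_letter_bucket(label):
    # label is a single index value, here a state name
    return "A-L" if label[0] < "M" else "M-Z"

# The function is called once per index label; equal return values share a group
sizes = toy.groupby(first_letter_bucket).size()
print(sizes)
```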

* We can also group by multiple columns

In [23]:
#Airbnb data
df=pd.read_csv("datasets/listings.csv")
df.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
0,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",none,"Roslindale is quiet, convenient and friendly. ...",...,,f,,,f,moderate,f,f,1,
1,3075044,https://www.airbnb.com/rooms/3075044,20160906204935,2016-09-07,Charming room in pet friendly apt,Charming and quiet room in a second floor 1910...,Small but cozy and quite room with a full size...,Charming and quiet room in a second floor 1910...,none,"The room is in Roslindale, a diverse and prima...",...,9.0,f,,,t,moderate,f,f,1,1.3
2,6976,https://www.airbnb.com/rooms/6976,20160906204935,2016-09-07,Mexican Folk Art Haven in Boston,"Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...",none,The LOCATION: Roslindale is a safe and diverse...,...,10.0,f,,,f,moderate,t,f,1,0.47
3,1436513,https://www.airbnb.com/rooms/1436513,20160906204935,2016-09-07,Spacious Sunny Bedroom Suite in Historic Home,Come experience the comforts of home away from...,Most places you find in Boston are small howev...,Come experience the comforts of home away from...,none,Roslindale is a lovely little neighborhood loc...,...,10.0,f,,,f,moderate,f,f,1,1.0
4,7651065,https://www.airbnb.com/rooms/7651065,20160906204935,2016-09-07,Come Home to Boston,"My comfy, clean and relaxing home is one block...","Clean, attractive, private room, one block fro...","My comfy, clean and relaxing home is one block...",none,"I love the proximity to downtown, the neighbor...",...,10.0,f,,,f,flexible,f,f,1,2.25


In [24]:
# It works pretty much as you would expect
# With a list of columns, group is a tuple holding one value per column
for group, frame in df.groupby(['cancellation_policy', 'review_scores_value']):
    print('in group {} there were this many records: {}'.format(group, len(frame)))

in group ('flexible', 2.0) there were this many records: 1
in group ('flexible', 4.0) there were this many records: 5
in group ('flexible', 5.0) there were this many records: 1
in group ('flexible', 6.0) there were this many records: 18
in group ('flexible', 7.0) there were this many records: 12
in group ('flexible', 8.0) there were this many records: 67
in group ('flexible', 9.0) there were this many records: 200
in group ('flexible', 10.0) there were this many records: 332
in group ('moderate', 2.0) there were this many records: 1
in group ('moderate', 4.0) there were this many records: 1
in group ('moderate', 6.0) there were this many records: 10
in group ('moderate', 7.0) there were this many records: 7
in group ('moderate', 8.0) there were this many records: 82
in group ('moderate', 9.0) there were this many records: 304
in group ('moderate', 10.0) there were this many records: 379
in group ('strict', 2.0) there were this many records: 5
in group ('strict', 3.0) there were this ma

## Applying
* So far we have just looked at splitting up data
* There are three broad kinds of apply step: aggregation, transformation, and filtering.
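
Before looking at each one on the Airbnb data, the following sketch runs all three on a hypothetical four-row frame so the shape of each result is easy to see. The `policy` and `score` columns are invented for the example.

```python
import pandas as pd

# Hypothetical data: two groups of two rows each
toy = pd.DataFrame({
    "policy": ["flexible", "flexible", "strict", "strict"],
    "score": [10.0, 8.0, 6.0, 7.0],
})
grouped = toy.groupby("policy")["score"]

# Aggregation: one value per group
aggregated = grouped.agg("mean")

# Transformation: one value per original row
transformed = grouped.transform("mean")

# Filtering: keep or drop whole groups based on a True/False test
filtered = toy.groupby("policy").filter(lambda g: g["score"].mean() > 8)

print(aggregated)
print(transformed)
print(filtered)
```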

### Aggregation

In [25]:
# We should just be able to aggregate by calling .agg(), which applies
# one or more functions to the chosen columns of each group
# Group by cancellation_policy, then average review_scores_value
df.groupby('cancellation_policy').agg({'review_scores_value': np.average})
# Note: np.average does not ignore NaN


cancellation_policy,review_scores_value
flexible,NaN
moderate,NaN
strict,NaN
super_strict_30,NaN


In [26]:
# That didn't seem to work at all, NaN!
# Same code as above, but with np.nanmean, which averages while ignoring NaN
df.groupby('cancellation_policy').agg({'review_scores_value': np.nanmean})

cancellation_policy,review_scores_value
flexible,9.237421
moderate,9.307398
strict,9.081441
super_strict_30,8.537313


In [27]:
# We can just extend this dictionary to aggregate by multiple functions or multiple columns.
x = df.groupby('cancellation_policy').agg({'review_scores_value': (np.nanmean, np.nanstd),
                                      'reviews_per_month': np.nanmean})

In [28]:
x.columns

MultiIndex([('review_scores_value', 'nanmean'),
            ('review_scores_value',  'nanstd'),
            (  'reviews_per_month', 'nanmean')],
           )

In [29]:
type(x.loc['flexible'])  # a Series indexed by the (column, function) pairs

pandas.core.series.Series

In [30]:
x.loc['flexible']['review_scores_value']['nanmean']  # row 'flexible', then column 'review_scores_value', then function 'nanmean'

9.2374213836478
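
The chained lookup above can also be written as a single `.loc` call that takes the row label and a column tuple. The sketch below shows this on a small hypothetical frame. It passes the string names `'mean'` and `'std'` to `agg()`, which select pandas' own NaN-skipping methods; note that `'std'` is the sample standard deviation (`ddof=1`).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the listings data
toy = pd.DataFrame({
    "cancellation_policy": ["flexible", "flexible", "strict", "strict"],
    "review_scores_value": [10.0, np.nan, 8.0, 6.0],
    "reviews_per_month": [1.0, 3.0, np.nan, 2.0],
})

summary = toy.groupby("cancellation_policy").agg({
    "review_scores_value": ["mean", "std"],
    "reviews_per_month": "mean",
})

# One lookup: row label first, then the (column, function) tuple
flex_mean = summary.loc["flexible", ("review_scores_value", "mean")]

# Selecting the outer column level returns a frame with the inner level as columns
scores = summary["review_scores_value"]

print(summary)
print(flex_mean)
```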

### Transformation
* Transformation broadcasts the function you supply over the grouped `DataFrame`, returning a new `DataFrame`.
* This is an important subtlety. `agg()` reduces each group to a single value per column, whereas `transform()` returns a result with the same shape as the group, one value per original row.
* So `agg()` returns a `DataFrame` with one entry per group, while `transform()` returns a `DataFrame` the size of your original `DataFrame`, indexed the same way.
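
The difference in shape and index is easiest to see side by side. This is a minimal sketch on a hypothetical four-row frame; the column names mirror the listings data but the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three 'moderate' rows (one missing score) and one 'flexible'
toy = pd.DataFrame({
    "cancellation_policy": ["moderate", "moderate", "flexible", "moderate"],
    "review_scores_value": [np.nan, 9.0, 10.0, 10.0],
})
scores = toy.groupby("cancellation_policy")["review_scores_value"]

per_group = scores.agg("mean")      # indexed by group label, one row per group
per_row = scores.transform("mean")  # indexed like toy, one row per original row

# Because per_row shares toy's index, it can be assigned straight back
toy["group_mean"] = per_row

print(per_group)
print(toy)
```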

In [26]:
# Let's just look at a couple of columns from our DataFrame
# .copy() gives an independent frame, so adding columns later does not trigger a SettingWithCopyWarning
ndf = df[['cancellation_policy', 'review_scores_value']].copy()
ndf.head()

Unnamed: 0,cancellation_policy,review_scores_value
0,moderate,
1,moderate,9.0
2,moderate,10.0
3,moderate,10.0
4,flexible,10.0


In [27]:
# Notice that the rows are indexed by row number. To find the average for each group, we can do
ndf.groupby('cancellation_policy').agg(np.nanmean)  # group by cancellation policy and average each group, ignoring NaN

cancellation_policy,review_scores_value
flexible,9.237421
moderate,9.307398
strict,9.081441
super_strict_30,8.537313


In [None]:
# But how do we put this average, say as a column called "related_averages",
# back into our original dataframe?
# transform() lets us do this in one step: every row receives the average of its own group
ndf.groupby('cancellation_policy').transform(np.nanmean).head()


In [None]:
# Since the result is indexed just like the original dataframe, we can assign it to a column
# Selecting 'review_scores_value' first makes the result a Series, which is what a single column expects
ndf['related_averages'] = ndf.groupby('cancellation_policy')['review_scores_value'].transform(np.nanmean)

### Filtering
* You can also use `filter()` to keep or drop whole groups. The function you pass receives each group and returns `True` to keep all of its rows or `False` to drop them.

In [30]:
def scores(group):
    # True when the group's mean review score (ignoring NaN) is above 9.2
    return np.nanmean(group['review_scores_value']) > 9.2

# Keeps only the groups whose mean is above 9.2, so 'strict' and 'super_strict_30' are dropped
y = ndf.groupby('cancellation_policy').filter(scores)
y['cancellation_policy'].unique()

array(['moderate', 'flexible'], dtype=object)
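
Filtering groups is different from filtering rows with a boolean mask, as the sketch below tries to show on a hypothetical four-row frame. The `policy` and `score` columns are made up; the 9.2 threshold is borrowed from the cell above.

```python
import pandas as pd

# Hypothetical data: 'moderate' averages 9.5, 'strict' averages 8.5
toy = pd.DataFrame({
    "policy": ["moderate", "moderate", "strict", "strict"],
    "score": [10.0, 9.0, 10.0, 7.0],
})

# Group filter: the test is applied to the group, and every row of a passing group is kept
kept_groups = toy.groupby("policy").filter(lambda g: g["score"].mean() > 9.2)

# Row filter: the test is applied to each row on its own
kept_rows = toy[toy["score"] > 9.2]

print(kept_groups)  # both 'moderate' rows, including the 9.0
print(kept_rows)    # the two 10.0 rows, one from each policy
```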

### Applying
* `apply()` runs a function of your choice on each group's `DataFrame` and combines the results
* This is 95% of what I actually do with groups

In [31]:
df=pd.read_csv("datasets/listings.csv")
df=df[['cancellation_policy','review_scores_value']]
df.head()

Unnamed: 0,cancellation_policy,review_scores_value
0,moderate,
1,moderate,9.0
2,moderate,10.0
3,moderate,10.0
4,flexible,10.0


In [38]:
def calc_mean_review_scores(group):
    # group is a DataFrame holding one group's rows; we can treat it as a complete dataframe
    avg = np.nanmean(group['review_scores_value'])  # this group's mean score, ignoring NaN
    # Now broadcast our formula and create a new column:
    # the absolute difference between each score and its group mean
    group['diff_from_group_mean'] = np.abs(avg - group['review_scores_value'])
    return group

# Now just apply this to the groups
df.groupby('cancellation_policy').apply(calc_mean_review_scores).head()

Unnamed: 0,cancellation_policy,review_scores_value,diff_from_group_mean
0,moderate,,
1,moderate,9.0,0.307398
2,moderate,10.0,0.692602
3,moderate,10.0,0.692602
4,flexible,10.0,0.762579
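
For this particular calculation, the same column can also be built with `transform()`, without writing a per-group function. The following is one possible alternative, not the approach used above; it runs on a hypothetical five-row frame whose column names mirror the listings data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the two listings columns
toy = pd.DataFrame({
    "cancellation_policy": ["moderate", "moderate", "moderate", "flexible", "flexible"],
    "review_scores_value": [np.nan, 9.0, 10.0, 10.0, 8.0],
})

# Mean of each row's own group, aligned with toy's index (NaN scores are skipped)
group_mean = toy.groupby("cancellation_policy")["review_scores_value"].transform("mean")

# assign() returns a new frame, so toy itself is left unchanged
result = toy.assign(
    diff_from_group_mean=(toy["review_scores_value"] - group_mean).abs()
)

print(result)
```

`apply()` remains the more general tool when the per-group logic does not reduce to a single built-in operation.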
