# Chapter 10 - Data Aggregation and Group Operations

## Groupby Mechanics

In [1]:
import pandas as pd

In [2]:
# Read from CSV file
df = pd.read_csv('dataset-A3-loans.csv')
display(df.head(8))

Unnamed: 0,id,funded_amount,term,interest_rate,grade,employee_length,home_ownership,annual_income,purpose,title
0,721751,7000.0,36 months,14.91,D,2 years,RENT,46000.0,debt_consolidation,Debt Removal
1,40277218,16800.0,60 months,16.49,D,4 years,RENT,45500.0,home_improvement,Home improvement
2,68416017,1500.0,36 months,9.17,B,10+ years,MORTGAGE,83000.0,major_purchase,Major purchase
3,59481461,8000.0,36 months,12.29,C,2 years,RENT,74000.0,debt_consolidation,Debt consolidation
4,73003,3200.0,36 months,9.96,B,< 1 year,MORTGAGE,150000.0,other,New Bathroom
5,55917749,5000.0,36 months,12.29,C,10+ years,MORTGAGE,55000.0,home_improvement,Home improvement
6,1149328,11500.0,36 months,16.29,C,2 years,MORTGAGE,68000.0,credit_card,Credit Card
7,1614457,6000.0,36 months,15.8,C,5 years,MORTGAGE,36000.0,debt_consolidation,Debt consolidation


Suppose we want to put all loans of the same grade together and compute descriptive statistics for one column. To do so, take the `Series` that holds the values to summarise (here `funded_amount`) and call `groupby()` on it, passing the `Series` that holds the group labels (here `grade`).

In [3]:
df_grouped = df['funded_amount'].groupby(df['grade'])
print(df_grouped)

<pandas.core.groupby.SeriesGroupBy object at 0x11750cac8>


The result of `groupby()` is a `GroupBy` object. Nothing has been computed yet: the object only holds the information needed to split the data into groups. Applying an aggregation method to it produces a readable, usable result.

Calling `.mean()` gives the mean `funded_amount` for each loan `grade`. The data are aggregated by the grouping key, producing a new `Series` indexed by the unique values in the `grade` column.

In [4]:
df_grouped.mean()

grade
A    12083.333333
B    12517.857143
C    10670.000000
D    12568.750000
E    18683.333333
F    25000.000000
Name: funded_amount, dtype: float64
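
To make the deferred computation concrete, here is a minimal sketch on a tiny hand-built frame. The name `toy` and its four rows (copied from the preview at the top of the chapter) are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy frame: four rows copied from the loans preview above
toy = pd.DataFrame({
    'grade': ['D', 'D', 'B', 'C'],
    'funded_amount': [7000.0, 16800.0, 1500.0, 8000.0],
})

# Creating the GroupBy object only records which rows belong to which grade
toy_grouped = toy['funded_amount'].groupby(toy['grade'])
print(toy_grouped.ngroups)  # number of distinct grades found

# The aggregation step is what actually produces a result
toy_means = toy_grouped.mean()
print(toy_means)
```

The result is indexed by the sorted unique grades, just like the full-data output above.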

Grouping can also be done on multiple keys. 

In [5]:
# Note that multiple grouping keys are passed to groupby() as a list
ir_means = df['interest_rate'].groupby([df['grade'], df['term']]).mean()
display(ir_means)

grade  term      
A       36 months     7.086667
B       36 months    10.752500
        60 months    10.122000
C       36 months    13.820909
        60 months    13.670000
D       36 months    16.385000
        60 months    17.030000
E       36 months    18.545000
        60 months    20.500000
F       60 months    23.760000
Name: interest_rate, dtype: float64

Since the result is a `Series` with a hierarchical index, it can be reshaped with `unstack()` so that each unique `term` gets its own column. The resulting `DataFrame` has as many columns as there are unique values of `term`.

Unstacking keeps the outer (left-most) index level as the rows and converts the inner level into columns.

In [6]:
ir_by_term_df = ir_means.unstack()

# Some values are missing (NaN) because those grade/term combinations
# do not exist in the Series before unstacking
display(ir_by_term_df)
print(ir_by_term_df.index)
print(ir_by_term_df.columns)

grade,36 months,60 months
A,7.086667,
B,10.7525,10.122
C,13.820909,13.67
D,16.385,17.03
E,18.545,20.5
F,,23.76


Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object', name='grade')
Index([' 36 months', ' 60 months'], dtype='object', name='term')
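
The reshaping can be traced on a small example. This is a sketch only: the frame `toy` is an assumption built from four rows of the dataset, with the leading space in `term` stripped for readability.

```python
import pandas as pd

# Hypothetical toy frame: grade A has no 60-month loan, grade F has no 36-month loan
toy = pd.DataFrame({
    'grade': ['A', 'B', 'B', 'F'],
    'term': ['36 months', '36 months', '60 months', '60 months'],
    'interest_rate': [5.32, 9.17, 10.74, 23.76],
})

# Two grouping keys give a Series with a two-level (hierarchical) index
toy_means = toy['interest_rate'].groupby([toy['grade'], toy['term']]).mean()

# unstack() moves the inner level (term) into the columns
toy_wide = toy_means.unstack()
print(toy_wide)

# Combinations absent from the Series show up as NaN
print(toy_wide.isna().sum().sum())
```

Only the index of the grouped `Series` is hierarchical; after unstacking, both the row index and the columns are ordinary single-level indexes.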


More commonly, grouping is done directly on the `DataFrame` with `df.groupby(key)`.

In [7]:
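# Notice how methods are chained here: groupby() is followed by mean().

# This calculates the mean of every numeric column. Some of them, like id, are not meaningful.
# numeric_only=True skips the text columns; recent pandas versions raise an error without it.
all_agg = df.groupby('employee_length').mean(numeric_only=True)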
display(all_agg)

# This will only calculate the mean for one column
funded_amt_agg = df.groupby('employee_length')['funded_amount'].mean()
display(funded_amt_agg)

employee_length,id,funded_amount,interest_rate,annual_income
1 year,29269970.0,11750.0,10.4325,76585.75
10+ years,36866060.0,9321.428571,11.587857,65766.857143
2 years,31576480.0,11791.666667,14.635,62512.5
3 years,60833300.0,22300.0,17.57,88250.0
4 years,57826580.0,14391.666667,13.38,76166.666667
5 years,5444514.0,16000.0,15.45,75500.0
6 years,23323860.0,8395.0,11.314,51808.0
7 years,30321120.0,10900.0,10.1125,71500.0
8 years,27822460.0,16450.0,12.245,74450.5
9 years,40224300.0,21237.5,14.175,70875.0


employee_length
1 year       11750.000000
10+ years     9321.428571
2 years      11791.666667
3 years      22300.000000
4 years      14391.666667
5 years      16000.000000
6 years       8395.000000
7 years      10900.000000
8 years      16450.000000
9 years      21237.500000
< 1 year     14900.000000
Name: funded_amount, dtype: float64
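
The two styles can be compared side by side on a small frame. This is a sketch under stated assumptions: `toy` is a hypothetical three-row extract of the loans data.

```python
import pandas as pd

# Hypothetical toy frame with two numeric columns and one text column
toy = pd.DataFrame({
    'employee_length': ['2 years', '2 years', '4 years'],
    'funded_amount': [7000.0, 8000.0, 16800.0],
    'interest_rate': [14.91, 12.29, 16.49],
    'purpose': ['debt_consolidation', 'debt_consolidation', 'home_improvement'],
})

# Mean of every numeric column; the text column 'purpose' is skipped
all_means = toy.groupby('employee_length').mean(numeric_only=True)
print(all_means)

# Selecting the column first aggregates only that column and returns a Series
one_mean = toy.groupby('employee_length')['funded_amount'].mean()
print(one_mean)
```

Selecting the column before aggregating avoids computing means that will never be used, which matters more as the number of columns grows.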

A generally useful `GroupBy` method is `.size()`, which counts the number of rows in each group, here one group per unique `term`.

In [8]:
df.groupby('term').size()

term
 36 months    37
 60 months    13
dtype: int64
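
Because `.size()` counts rows, it includes rows whose values are missing, unlike `.count()` on a column. The sketch below illustrates this; the frame `toy` and its missing `funded_amount` are invented for the example and do not come from the dataset.

```python
import pandas as pd

# Hypothetical toy frame; the missing funded_amount is made up for illustration
toy = pd.DataFrame({
    'term': ['36 months', '60 months', '36 months', '36 months'],
    'funded_amount': [7000.0, 16800.0, 1500.0, None],
})

# size() counts every row in each group
sizes = toy.groupby('term').size()
print(sizes)

# count() on a column counts only the non-missing values
counts = toy.groupby('term')['funded_amount'].count()
print(counts)
```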

<hr>
To see in detail what a `GroupBy` object contains, iterate over it. Iteration yields a sequence of tuples, each containing the group name and the corresponding subset of the data.

In [9]:
for n, x in df.groupby('employee_length'):
    print(n)
    display(x)

1 year


Unnamed: 0,id,funded_amount,term,interest_rate,grade,employee_length,home_ownership,annual_income,purpose,title
9,290605,15000.0,36 months,9.45,B,1 year,RENT,150000.0,house,Chicago Home Purchase
19,56994235,6000.0,36 months,10.99,B,1 year,RENT,40000.0,debt_consolidation,Debt consolidation
29,58350400,20000.0,36 months,9.17,B,1 year,RENT,76343.0,credit_card,Credit card refinancing
33,1444634,6000.0,36 months,12.12,B,1 year,RENT,40000.0,credit_card,Lending Club


10+ years


Unnamed: 0,id,funded_amount,term,interest_rate,grade,employee_length,home_ownership,annual_income,purpose,title
2,68416017,1500.0,36 months,9.17,B,10+ years,MORTGAGE,83000.0,major_purchase,Major purchase
5,55917749,5000.0,36 months,12.29,C,10+ years,MORTGAGE,55000.0,home_improvement,Home improvement
8,31999103,11000.0,36 months,14.49,C,10+ years,RENT,72000.0,debt_consolidation,Debt consolidation
12,62166736,4500.0,36 months,5.32,A,10+ years,MORTGAGE,76000.0,home_improvement,Home improvement
13,62082707,3225.0,36 months,13.33,C,10+ years,OWN,60000.0,credit_card,Credit card refinancing
14,1248751,21000.0,60 months,10.74,B,10+ years,RENT,77983.0,debt_consolidation,Debt Consolidation
34,66583250,4175.0,36 months,17.86,D,10+ years,RENT,25000.0,debt_consolidation,Debt consolidation
35,43165466,1000.0,36 months,12.69,C,10+ years,MORTGAGE,55000.0,vacation,Vacation
37,29575450,14400.0,36 months,8.39,A,10+ years,MORTGAGE,67000.0,debt_consolidation,Debt consolidation
38,10119917,10000.0,36 months,11.99,B,10+ years,RENT,150000.0,credit_card,CC Payoff


2 years


Unnamed: 0,id,funded_amount,term,interest_rate,grade,employee_length,home_ownership,annual_income,purpose,title
0,721751,7000.0,36 months,14.91,D,2 years,RENT,46000.0,debt_consolidation,Debt Removal
3,59481461,8000.0,36 months,12.29,C,2 years,RENT,74000.0,debt_consolidation,Debt consolidation
6,1149328,11500.0,36 months,16.29,C,2 years,MORTGAGE,68000.0,credit_card,Credit Card
16,62042554,18000.0,60 months,12.29,C,2 years,OWN,48000.0,debt_consolidation,Debt consolidation
24,10095878,19750.0,60 months,20.5,E,2 years,MORTGAGE,99075.0,debt_consolidation,Debt Consolidation
49,55967936,6500.0,36 months,11.53,B,2 years,RENT,40000.0,credit_card,Credit card refinancing


3 years


Unnamed: 0,id,funded_amount,term,interest_rate,grade,employee_length,home_ownership,annual_income,purpose,title
23,60833297,22300.0,60 months,17.57,D,3 years,RENT,88250.0,credit_card,Credit card refinancing


4 years


Unnamed: 0,id,funded_amount,term,interest_rate,grade,employee_length,home_ownership,annual_income,purpose,title
1,40277218,16800.0,60 months,16.49,D,4 years,RENT,45500.0,home_improvement,Home improvement
30,65382755,20375.0,60 months,9.17,B,4 years,MORTGAGE,98000.0,debt_consolidation,Debt consolidation
39,67819776,6000.0,36 months,14.48,C,4 years,RENT,85000.0,debt_consolidation,Debt consolidation


5 years


Unnamed: 0,id,funded_amount,term,interest_rate,grade,employee_length,home_ownership,annual_income,purpose,title
7,1614457,6000.0,36 months,15.8,C,5 years,MORTGAGE,36000.0,debt_consolidation,Debt consolidation
31,9274571,26000.0,60 months,15.1,C,5 years,MORTGAGE,115000.0,debt_consolidation,Consolidation


6 years


Unnamed: 0,id,funded_amount,term,interest_rate,grade,employee_length,home_ownership,annual_income,purpose,title
11,42221389,2875.0,36 months,12.69,C,6 years,RENT,49040.0,debt_consolidation,Debt consolidation
15,39629278,11000.0,36 months,8.67,B,6 years,RENT,45000.0,credit_card,Credit card refinancing
25,6836002,11075.0,36 months,13.05,B,6 years,RENT,55000.0,debt_consolidation,Lower rate
41,18965501,10800.0,36 months,12.49,B,6 years,RENT,65000.0,debt_consolidation,Debt consolidation
47,8967105,6225.0,36 months,9.67,B,6 years,RENT,45000.0,debt_consolidation,Consolidate LC


7 years


Unnamed: 0,id,funded_amount,term,interest_rate,grade,employee_length,home_ownership,annual_income,purpose,title
10,49864582,10000.0,60 months,11.53,B,7 years,OWN,78000.0,credit_card,Credit card refinancing
18,38539098,6000.0,36 months,6.03,A,7 years,RENT,73000.0,credit_card,Credit card refinancing
36,8945719,9600.0,36 months,7.9,A,7 years,MORTGAGE,90000.0,credit_card,pay off credit
46,23935069,18000.0,36 months,14.99,C,7 years,RENT,45000.0,debt_consolidation,Debt consolidation


8 years


Unnamed: 0,id,funded_amount,term,interest_rate,grade,employee_length,home_ownership,annual_income,purpose,title
17,4798106,22000.0,36 months,12.12,B,8 years,RENT,78000.0,credit_card,Credit Card Payoff
22,41339796,4800.0,36 months,18.25,E,8 years,MORTGAGE,46000.0,vacation,Vacation
32,2865306,4000.0,36 months,7.62,A,8 years,RENT,71000.0,debt_consolidation,Debt Consolidation
44,62286613,35000.0,60 months,10.99,B,8 years,MORTGAGE,102802.0,credit_card,Credit card refinancing


9 years


Unnamed: 0,id,funded_amount,term,interest_rate,grade,employee_length,home_ownership,annual_income,purpose,title
21,38514650,22950.0,60 months,12.99,C,9 years,RENT,50000.0,debt_consolidation,Debt consolidation
26,3702908,25000.0,60 months,23.76,F,9 years,MORTGAGE,68000.0,debt_consolidation,Debt consolidation
27,62399461,3000.0,36 months,12.69,C,9 years,RENT,55500.0,car,Car financing
42,56280184,34000.0,36 months,7.26,A,9 years,MORTGAGE,110000.0,debt_consolidation,Debt consolidation


< 1 year


Unnamed: 0,id,funded_amount,term,interest_rate,grade,employee_length,home_ownership,annual_income,purpose,title
4,73003,3200.0,36 months,9.96,B,< 1 year,MORTGAGE,150000.0,other,New Bathroom
20,58723821,10000.0,36 months,8.18,B,< 1 year,MORTGAGE,43000.0,debt_consolidation,Debt consolidation
28,43155702,31500.0,36 months,18.84,E,< 1 year,RENT,130000.0,major_purchase,Major purchase


With multiple keys, the first element of each tuple is itself a tuple of the key values, one per grouping column.

In [10]:
for (tm, gr), x in df.groupby(['term', 'grade']):
    print("< %s|%s >" % (tm.strip(), gr.strip()))
    display(x)

< 36 months|A >


Unnamed: 0,id,funded_amount,term,interest_rate,grade,employee_length,home_ownership,annual_income,purpose,title
12,62166736,4500.0,36 months,5.32,A,10+ years,MORTGAGE,76000.0,home_improvement,Home improvement
18,38539098,6000.0,36 months,6.03,A,7 years,RENT,73000.0,credit_card,Credit card refinancing
32,2865306,4000.0,36 months,7.62,A,8 years,RENT,71000.0,debt_consolidation,Debt Consolidation
36,8945719,9600.0,36 months,7.9,A,7 years,MORTGAGE,90000.0,credit_card,pay off credit
37,29575450,14400.0,36 months,8.39,A,10+ years,MORTGAGE,67000.0,debt_consolidation,Debt consolidation
42,56280184,34000.0,36 months,7.26,A,9 years,MORTGAGE,110000.0,debt_consolidation,Debt consolidation


< 36 months|B >


Unnamed: 0,id,funded_amount,term,interest_rate,grade,employee_length,home_ownership,annual_income,purpose,title
2,68416017,1500.0,36 months,9.17,B,10+ years,MORTGAGE,83000.0,major_purchase,Major purchase
4,73003,3200.0,36 months,9.96,B,< 1 year,MORTGAGE,150000.0,other,New Bathroom
9,290605,15000.0,36 months,9.45,B,1 year,RENT,150000.0,house,Chicago Home Purchase
15,39629278,11000.0,36 months,8.67,B,6 years,RENT,45000.0,credit_card,Credit card refinancing
17,4798106,22000.0,36 months,12.12,B,8 years,RENT,78000.0,credit_card,Credit Card Payoff
19,56994235,6000.0,36 months,10.99,B,1 year,RENT,40000.0,debt_consolidation,Debt consolidation
20,58723821,10000.0,36 months,8.18,B,< 1 year,MORTGAGE,43000.0,debt_consolidation,Debt consolidation
25,6836002,11075.0,36 months,13.05,B,6 years,RENT,55000.0,debt_consolidation,Lower rate
29,58350400,20000.0,36 months,9.17,B,1 year,RENT,76343.0,credit_card,Credit card refinancing
33,1444634,6000.0,36 months,12.12,B,1 year,RENT,40000.0,credit_card,Lending Club


< 36 months|C >


Unnamed: 0,id,funded_amount,term,interest_rate,grade,employee_length,home_ownership,annual_income,purpose,title
3,59481461,8000.0,36 months,12.29,C,2 years,RENT,74000.0,debt_consolidation,Debt consolidation
5,55917749,5000.0,36 months,12.29,C,10+ years,MORTGAGE,55000.0,home_improvement,Home improvement
6,1149328,11500.0,36 months,16.29,C,2 years,MORTGAGE,68000.0,credit_card,Credit Card
7,1614457,6000.0,36 months,15.8,C,5 years,MORTGAGE,36000.0,debt_consolidation,Debt consolidation
8,31999103,11000.0,36 months,14.49,C,10+ years,RENT,72000.0,debt_consolidation,Debt consolidation
11,42221389,2875.0,36 months,12.69,C,6 years,RENT,49040.0,debt_consolidation,Debt consolidation
13,62082707,3225.0,36 months,13.33,C,10+ years,OWN,60000.0,credit_card,Credit card refinancing
27,62399461,3000.0,36 months,12.69,C,9 years,RENT,55500.0,car,Car financing
35,43165466,1000.0,36 months,12.69,C,10+ years,MORTGAGE,55000.0,vacation,Vacation
39,67819776,6000.0,36 months,14.48,C,4 years,RENT,85000.0,debt_consolidation,Debt consolidation


< 36 months|D >


Unnamed: 0,id,funded_amount,term,interest_rate,grade,employee_length,home_ownership,annual_income,purpose,title
0,721751,7000.0,36 months,14.91,D,2 years,RENT,46000.0,debt_consolidation,Debt Removal
34,66583250,4175.0,36 months,17.86,D,10+ years,RENT,25000.0,debt_consolidation,Debt consolidation


< 36 months|E >


Unnamed: 0,id,funded_amount,term,interest_rate,grade,employee_length,home_ownership,annual_income,purpose,title
22,41339796,4800.0,36 months,18.25,E,8 years,MORTGAGE,46000.0,vacation,Vacation
28,43155702,31500.0,36 months,18.84,E,< 1 year,RENT,130000.0,major_purchase,Major purchase


< 60 months|B >


Unnamed: 0,id,funded_amount,term,interest_rate,grade,employee_length,home_ownership,annual_income,purpose,title
10,49864582,10000.0,60 months,11.53,B,7 years,OWN,78000.0,credit_card,Credit card refinancing
14,1248751,21000.0,60 months,10.74,B,10+ years,RENT,77983.0,debt_consolidation,Debt Consolidation
30,65382755,20375.0,60 months,9.17,B,4 years,MORTGAGE,98000.0,debt_consolidation,Debt consolidation
44,62286613,35000.0,60 months,10.99,B,8 years,MORTGAGE,102802.0,credit_card,Credit card refinancing
48,57236645,17200.0,60 months,8.18,B,10+ years,RENT,49753.0,debt_consolidation,Debt consolidation


< 60 months|C >


Unnamed: 0,id,funded_amount,term,interest_rate,grade,employee_length,home_ownership,annual_income,purpose,title
16,62042554,18000.0,60 months,12.29,C,2 years,OWN,48000.0,debt_consolidation,Debt consolidation
21,38514650,22950.0,60 months,12.99,C,9 years,RENT,50000.0,debt_consolidation,Debt consolidation
31,9274571,26000.0,60 months,15.1,C,5 years,MORTGAGE,115000.0,debt_consolidation,Consolidation
43,7389445,17500.0,60 months,14.3,C,10+ years,MORTGAGE,40000.0,credit_card,Credit card refinancing


< 60 months|D >


Unnamed: 0,id,funded_amount,term,interest_rate,grade,employee_length,home_ownership,annual_income,purpose,title
1,40277218,16800.0,60 months,16.49,D,4 years,RENT,45500.0,home_improvement,Home improvement
23,60833297,22300.0,60 months,17.57,D,3 years,RENT,88250.0,credit_card,Credit card refinancing


< 60 months|E >


Unnamed: 0,id,funded_amount,term,interest_rate,grade,employee_length,home_ownership,annual_income,purpose,title
24,10095878,19750.0,60 months,20.5,E,2 years,MORTGAGE,99075.0,debt_consolidation,Debt Consolidation


< 60 months|F >


Unnamed: 0,id,funded_amount,term,interest_rate,grade,employee_length,home_ownership,annual_income,purpose,title
26,3702908,25000.0,60 months,23.76,F,9 years,MORTGAGE,68000.0,debt_consolidation,Debt consolidation
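
The shape of what the loop yields is easier to see on a few rows. This is a minimal sketch, assuming a hypothetical four-row frame `toy` with stripped `term` values; the list `seen` exists only to record what was visited.

```python
import pandas as pd

# Hypothetical toy frame built from the first rows of the loans data
toy = pd.DataFrame({
    'term': ['36 months', '60 months', '36 months', '36 months'],
    'grade': ['D', 'D', 'B', 'C'],
    'funded_amount': [7000.0, 16800.0, 1500.0, 8000.0],
})

seen = []
# Each iteration yields ((term value, grade value), sub-DataFrame)
for (tm, gr), subset in toy.groupby(['term', 'grade']):
    seen.append((tm, gr, len(subset)))
    print(tm, gr, len(subset))
```

Groups are visited in sorted key order, and only combinations that actually occur in the data appear.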


A useful trick is to convert the groups into a dictionary keyed by the unique group values.

In [11]:
grade_dict = dict(list(df.groupby('grade')))
display(grade_dict['D'])

Unnamed: 0,id,funded_amount,term,interest_rate,grade,employee_length,home_ownership,annual_income,purpose,title
0,721751,7000.0,36 months,14.91,D,2 years,RENT,46000.0,debt_consolidation,Debt Removal
1,40277218,16800.0,60 months,16.49,D,4 years,RENT,45500.0,home_improvement,Home improvement
23,60833297,22300.0,60 months,17.57,D,3 years,RENT,88250.0,credit_card,Credit card refinancing
34,66583250,4175.0,36 months,17.86,D,10+ years,RENT,25000.0,debt_consolidation,Debt consolidation
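
When only one group is needed, `get_group()` retrieves it without building the whole dictionary. The sketch below compares the two routes on a hypothetical four-row frame `toy`; a dictionary comprehension is used here as an equivalent alternative to `dict(list(...))`.

```python
import pandas as pd

# Hypothetical toy frame built from rows of the loans data
toy = pd.DataFrame({
    'grade': ['D', 'D', 'B', 'C'],
    'funded_amount': [7000.0, 16800.0, 1500.0, 8000.0],
})

grouped = toy.groupby('grade')

# Dictionary of group name -> sub-DataFrame
pieces = {name: subset for name, subset in grouped}

# Direct lookup of a single group
d_rows = grouped.get_group('D')
print(d_rows)
```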


<hr>
When working with `GroupBy` objects, it is common to look at only a subset of columns. This is done by indexing the `GroupBy` object with a column name, or a list of column names, after the `groupby()` call.

In [12]:
funded_groupby = df.groupby('grade')['funded_amount']
print(funded_groupby)

<pandas.core.groupby.SeriesGroupBy object at 0x1175455c0>


In [13]:
display(funded_groupby.size())
display(funded_groupby.sum())
display(funded_groupby.mean())
display(funded_groupby.std())

grade
A     6
B    21
C    15
D     4
E     3
F     1
Name: funded_amount, dtype: int64

grade
A     72500.0
B    262875.0
C    160050.0
D     50275.0
E     56050.0
F     25000.0
Name: funded_amount, dtype: float64

grade
A    12083.333333
B    12517.857143
C    10670.000000
D    12568.750000
E    18683.333333
F    25000.000000
Name: funded_amount, dtype: float64

grade
A    11416.902674
B     7915.602247
C     7982.728007
D     8447.222793
E    13381.921885
F             NaN
Name: funded_amount, dtype: float64

In [14]:
funded_groupby2 = df.groupby(['grade', 'term'])['funded_amount']
print(funded_groupby2)
print(funded_groupby2.mean())

<pandas.core.groupby.SeriesGroupBy object at 0x117562a58>
grade  term      
A       36 months    12083.333333
B       36 months     9956.250000
        60 months    20715.000000
C       36 months     6872.727273
        60 months    21112.500000
D       36 months     5587.500000
        60 months    19550.000000
E       36 months    18150.000000
        60 months    19750.000000
F       60 months    25000.000000
Name: funded_amount, dtype: float64
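
Several statistics can also be collected in one call by passing a list of function names to `.agg()`. The following is a sketch on a hypothetical four-row frame `toy`, not the notebook's own code; `'count'` is used in place of `size()`, which gives the same numbers when no values are missing.

```python
import pandas as pd

# Hypothetical toy frame: grade D has two loans, grades B and F have one each
toy = pd.DataFrame({
    'grade': ['D', 'D', 'B', 'F'],
    'funded_amount': [7000.0, 16800.0, 1500.0, 25000.0],
})

# One column per requested statistic, one row per grade
summary = toy.groupby('grade')['funded_amount'].agg(['count', 'sum', 'mean', 'std'])
print(summary)
```

A group with a single observation has a `std` of `NaN`, because the sample standard deviation is undefined for one value. This is why grade `F` shows `NaN` in the output of `In [13]`.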


**References:**

Python for Data Analysis, 2nd Edition, McKinney (2017)