## Aggregating and Grouping Data

In [1]:
import pandas as pd
import numpy as np

dataset = pd.read_csv('datasets/Credit_Data.cvs', sep = ';')
dataset.head()

,Unnamed: 0,serious_dlqin2yrs,revolving_utilization_of_unsecured_lines,age,number_of_time30-59_days_past_due_not_worse,debt_ratio,monthly_income,number_of_open_credit_lines_and_loans,number_of_times90_days_late,number_real_estate_loans_or_lines,number_of_time60-89_days_past_due_not_worse,number_of_dependents
0,0,1,0.766127,45,2,0.802982,9120.0,13,0,6,0,2.0
1,1,0,0.957151,40,0,0.121876,2600.0,4,0,0,0,1.0
2,2,0,0.65818,38,1,0.085113,3042.0,2,1,0,0,0.0
3,3,0,0.23381,30,0,0.03605,3300.0,5,0,0,0,0.0
4,4,0,0.907239,49,1,0.024926,63588.0,7,0,1,0,0.0


### Apply

In [2]:
"""
"Applies" or operates on a column in your data frame with a given function. 
This is analogous to an Excel formula.
"""
dataset.monthly_income.apply(np.log).head()

0     9.118225
1     7.863267
2     8.020270
3     8.101678
4    11.060180
Name: monthly_income, dtype: float64
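
For simple element-wise math, NumPy functions can usually be called on the whole Series directly. The sketch below uses a small made-up Series (`incomes` is a hypothetical stand-in, not the credit dataset) to check that both routes give the same values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values standing in for dataset.monthly_income
incomes = pd.Series([9120.0, 2600.0, 3042.0], name="monthly_income")

# apply runs np.log over every value in the Series
logged_apply = incomes.apply(np.log)

# Passing the whole Series to np.log also returns a Series
logged_direct = np.log(incomes)

print(np.allclose(logged_apply, logged_direct))
```

The direct call is typically the faster choice on large columns, while `apply` stays useful for functions that only accept one value at a time.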

### Apply with lambda functions

In [3]:
"""
A lambda function, or anonymous function, is a shorthand way to define a quick function that you need once.
"""
add_10 = lambda x : x + 10
plus = lambda x, y : x + y

print(add_10(9))
print(plus(10, 20))

19
30


In [4]:
dataset.monthly_income.apply(lambda x : np.log(x + 1)).head()

0     9.118335
1     7.863651
2     8.020599
3     8.101981
4    11.060196
Name: monthly_income, dtype: float64

### Using custom functions

In [5]:
"""
If you can't do it in a one-liner lambda function, don't worry. pandas also lets you apply your own custom functions. You can use custom functions when applying on Series and also when operating on chunks of data frames in groupbys.
"""
def inverse(x):
    return 1 / (x + 1)

dataset.monthly_income.apply(inverse).head()

0    0.000110
1    0.000384
2    0.000329
3    0.000303
4    0.000016
Name: monthly_income, dtype: float64

#### Exercise

In [6]:
"""
Write a custom function called cap_value(x, cap) that will set x to the cap if x > cap. Then apply it to 
debt_ratio with a cap of 5.
"""

def cap_value(x, cap):
    if x > cap:
        return cap
    return x

print(cap_value(1000, 10) == 10)
print(cap_value(10, 100) == 10)
print(dataset.debt_ratio.apply(lambda x : cap_value(x, 5.0)).mean())

True
True
1.2869633745276574
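
As an alternative to wrapping `cap_value` in a lambda, `Series.apply` accepts extra positional arguments through `args`, and `Series.clip` offers a built-in way to cap values. A minimal sketch on made-up ratios (the `ratios` Series is hypothetical):

```python
import pandas as pd

def cap_value(x, cap):
    """Return x, limited to at most cap."""
    return cap if x > cap else x

# Hypothetical toy ratios standing in for dataset.debt_ratio
ratios = pd.Series([0.8, 5.0, 12.5, 0.1])

# Extra positional arguments for the function go in args
capped_apply = ratios.apply(cap_value, args=(5.0,))

# Built-in equivalent: clip with an upper bound
capped_clip = ratios.clip(upper=5.0)

print(capped_apply.tolist())
print(capped_apply.equals(capped_clip))
```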


## Split -> Apply -> Combine

In [7]:
"""
Split, Apply, Combine is a data munging methodology similar in spirit to SQL's GROUP BY. The idea is that you 
split your data into chunks, operate on those chunks, and then combine the results into a single table. 
groupby in pandas follows the same pattern. But since we're using Python and not SQL, we have a lot more 
flexibility in the types of operations we can perform in the apply step.

From the pandas documentation:
    - Splitting the data into groups based on some criteria
    - Applying a function to each group independently
    - Combining the results into a data structure
"""
print('')
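
The three steps can be spelled out by hand to see what groupby is doing. This is only an illustrative sketch on a hypothetical toy frame (`toy` borrows column names from the credit dataset but holds made-up rows); it is not how pandas implements groupby internally.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the credit dataset
toy = pd.DataFrame({
    "serious_dlqin2yrs": [0, 1, 0, 1, 0],
    "age": [45, 40, 38, 30, 49],
})

# Split: one sub-frame per distinct value of the key
pieces = {
    key: toy[toy.serious_dlqin2yrs == key]
    for key in toy.serious_dlqin2yrs.unique()
}

# Apply: compute a statistic on each piece independently
means = {key: piece.age.mean() for key, piece in pieces.items()}

# Combine: gather the per-group results into one Series
manual = pd.Series(means).sort_index()

# groupby performs the same three steps in one expression
grouped = toy.groupby("serious_dlqin2yrs").age.mean()

print(manual.tolist() == grouped.tolist())
```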




### groupby (Split)

In [8]:
"""
groupby splits a DataFrame into chunks. This is analogous to, you guessed it, GROUP BY in SQL or Rows in an 
Excel pivot table.

In the example below, we'll split the dataset into chunks based on the serious_dlqin2yrs field.
"""
subnet = dataset[['serious_dlqin2yrs', 'age', 'monthly_income']]
subnet.groupby('serious_dlqin2yrs')

<pandas.core.groupby.DataFrameGroupBy object at 0x7f00495a1518>
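
The GroupBy object does not display its contents, but it can be inspected. A small sketch on a hypothetical toy frame (`toy` is made up for illustration) using `ngroups`, `size()` and `get_group()`:

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the credit dataset
toy = pd.DataFrame({
    "serious_dlqin2yrs": [0, 1, 0, 1, 0],
    "age": [45, 40, 38, 30, 49],
})

grouped = toy.groupby("serious_dlqin2yrs")

# Number of groups, and number of rows in each group
print(grouped.ngroups)
print(grouped.size())

# Pull out the rows belonging to a single group
delinquent = grouped.get_group(1)
print(delinquent)
```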

### Apply / Combine

In [9]:
"""
Once you've grouped your dataset, you can apply different functions on each group to make calculations, 
generate summary statistics, etc. This is done the same way as a "regular" apply as seen above. The results 
are automatically aggregated and combined back into a DataFrame.
"""
# split the dataset into groups based on the serious_dlqin2yrs field, then calculate the mean for 
# each of the remaining columns
subnet.groupby('serious_dlqin2yrs').mean()

serious_dlqin2yrs,age,monthly_income
0,52.751375,5473.758555
1,45.926591,4746.613006


### Under the hood

In [10]:
"""
What's really going on here? You can see below that when you group by one or more variables, you're 
literally splitting the data into chunks based on each possible value of those variables.
"""
for name, group in subnet.groupby('serious_dlqin2yrs'):
    print('Splitting by ', name)
    print(group.mean())
    print('*' * 80)

Splitting by  0
serious_dlqin2yrs       0.000000
age                    52.751375
monthly_income       5473.758555
dtype: float64
********************************************************************************
Splitting by  1
serious_dlqin2yrs       1.000000
age                    45.926591
monthly_income       4746.613006
dtype: float64
********************************************************************************


### agg - Complex Aggregations

In [11]:
"""
You can aggregate by multiple functions using the agg method. Simply pass in a list of the functions you'd 
like to apply to your dataset.
"""
functions = [np.min, np.mean, np.median, np.max]
subnet.groupby(subnet.serious_dlqin2yrs).agg(functions)

,age,age,age,age,monthly_income,monthly_income,monthly_income,monthly_income
serious_dlqin2yrs,amin,mean,median,amax,amin,mean,median,amax
0,0,52.751375,52,109,0.0,5473.758555,4500.0,3008750.0
1,21,45.926591,45,101,0.0,4746.613006,3905.5,250000.0
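
agg also accepts function names as strings, and a dict when different columns need different statistics. Some recent pandas versions emit a warning when NumPy functions such as `np.mean` are passed to agg, which the string names avoid. A sketch on a hypothetical toy frame (the rows in `toy` are made up):

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the credit dataset
toy = pd.DataFrame({
    "serious_dlqin2yrs": [0, 1, 0, 1, 0],
    "age": [45, 40, 38, 30, 49],
    "monthly_income": [9120.0, 2600.0, 3042.0, 3300.0, 63588.0],
})

grouped = toy.groupby("serious_dlqin2yrs")

# Same statistics for every column, named by string
summary = grouped.agg(["min", "mean", "median", "max"])

# Different statistics per column, via a dict of column -> function(s)
per_column = grouped.agg({"age": ["min", "max"], "monthly_income": "median"})

print(summary)
print(per_column)
```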


### grouping with custom apply functions

In [12]:
"""
Just as you can apply custom functions to a column in your data frame, you can do the same with groups.
"""
def age_x_income(frame):
    x = (frame.age * frame.monthly_income)
    return np.mean(x)

subnet.groupby('serious_dlqin2yrs').apply(age_x_income)

serious_dlqin2yrs
0    289683.488005
1    225787.627568
dtype: float64
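
The same pattern on a hypothetical toy frame makes the arithmetic easy to check by hand. Selecting the value columns before calling apply is a choice made for this sketch: it keeps the grouping key out of each chunk, which some pandas versions otherwise warn about.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the credit dataset
toy = pd.DataFrame({
    "serious_dlqin2yrs": [0, 1, 0, 1, 0],
    "age": [45, 40, 38, 30, 49],
    "monthly_income": [9120.0, 2600.0, 3042.0, 3300.0, 63588.0],
})

def age_x_income(frame):
    """Mean of age * monthly_income within one group."""
    return (frame.age * frame.monthly_income).mean()

# Each group's chunk is passed to age_x_income as a DataFrame
result = (
    toy.groupby("serious_dlqin2yrs")[["age", "monthly_income"]]
    .apply(age_x_income)
)
print(result)
```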

### Merging and Joining

In [13]:
"""
As in SQL, sometimes you need to join (or merge) certain datasets together. A good example might be some 
census demographic data that would be a helpful addition to our original dataset.
"""
pop = pd.read_csv('datasets/uspop.csv')
pop.head()

,age,est_pop
0,10,20055.346939
1,11,20073.020408
2,12,20090.693878
3,13,20108.367347
4,14,20139.081633


### Duplicates


In [14]:
"""
Be careful of joining datasets that contain duplicate values. It can lead to compounding error! pandas has a
handy duplicated() method that you can use to ensure uniqueness among your data.
"""
pop.age.value_counts().head()

29    2
59    2
36    1
30    1
31    1
Name: age, dtype: int64

In [15]:
pop = pop[~pop.age.duplicated()]  # keep only the first row for each age
pop.age.value_counts().head()

85    1
37    1
30    1
31    1
32    1
Name: age, dtype: int64
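
`drop_duplicates` is another way to do the same de-duplication in one call. A sketch on a hypothetical population table (`toy_pop` and its values are made up):

```python
import pandas as pd

# Hypothetical population table with a repeated age, as in pop
toy_pop = pd.DataFrame({
    "age": [29, 29, 30, 31],
    "est_pop": [100.0, 110.0, 120.0, 130.0],
})

# Boolean mask: True for every repeat after the first occurrence
mask = toy_pop.age.duplicated()
print(mask.tolist())

# Keep the first row for each age
deduped = toy_pop.drop_duplicates(subset="age", keep="first")

print(deduped.age.is_unique)
```

Which duplicate to keep is a data question; `keep="first"` here simply matches the behavior of the boolean-mask approach above.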

### Merge

In [17]:
"""
To combine datasets, you can use the merge function. You'll need to specify the following:
    - The data frames you'd like to merge (first 2 arguments)
    - A strategy for dealing with missing values and/or multiple matches ("how")
    - Criteria for which rows should be combined ("on")

In the case below, we're going to merge our original dataset with the census data. We're using a "left join" 
strategy, which means the result will contain all records from the original dataset variable, even if 
they didn't match up with anything in pop. We're going to join on age, meaning we will match up rows across 
datasets where the age is the same.
"""
cols = ['age', 'monthly_income', 'serious_dlqin2yrs']
results = pd.merge(dataset[cols], pop, how = 'left', on = 'age')
results.head()

,age,monthly_income,serious_dlqin2yrs,est_pop
0,45,9120.0,1,21988.020408
1,40,2600.0,0,20644.22449
2,38,3042.0,0,19390.918367
3,30,3300.0,0,20163.346939
4,49,63588.0,0,21936.857143


In [18]:
len(results) > len(dataset) # False: a left join on de-duplicated ages keeps exactly one row per original record

False
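
merge can also check these assumptions for you. In the sketch below, on hypothetical toy frames (`left` and `right` are made up), `validate="many_to_one"` raises an error if the right-hand keys are not unique, and `indicator=True` adds a `_merge` column showing which rows found a match.

```python
import pandas as pd

# Hypothetical toy frames; age 99 has no match on the right
left = pd.DataFrame({
    "age": [45, 40, 45, 99],
    "monthly_income": [9120.0, 2600.0, 3042.0, 3300.0],
})
right = pd.DataFrame({
    "age": [40, 45],
    "est_pop": [20644.0, 21988.0],
})

merged = pd.merge(
    left, right,
    how="left", on="age",
    validate="many_to_one",  # fail loudly if right has duplicate ages
    indicator=True,          # adds a _merge column: both / left_only
)

# Row count is unchanged because each left row matches at most one right row
print(len(merged) == len(left))
print(merged["_merge"].tolist())
```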