# Practice Optimizing Dataframes and Processing in Chunks

In this project, we'll practice working with chunked dataframes and optimizing a dataframe's memory usage. We'll be working with financial lending data from [Lending Club](https://www.lendingclub.com/), a marketplace for personal loans that matches borrowers with investors.

The dataset covers loans approved from 2007 to 2011 and is available [from Lending Club](https://www.lendingclub.com/info/download-data.action).

In [68]:
import pandas as pd
import numpy as np

pd.options.display.max_columns = 99

five_rows = pd.read_csv('loans_2007.csv', nrows=5)
five_rows.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,1077501,1296599.0,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,,10+ years,RENT,24000.0,Verified,Dec-2011,Fully Paid,n,credit_card,Computer,860xx,AZ,27.65,0.0,Jan-1985,1.0,3.0,0.0,13648.0,83.7%,9.0,f,0.0,0.0,5863.155187,5833.84,5000.0,863.16,0.0,0.0,0.0,Jan-2015,171.62,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,1077430,1314167.0,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,Ryder,< 1 year,RENT,30000.0,Source Verified,Dec-2011,Charged Off,n,car,bike,309xx,GA,1.0,0.0,Apr-1999,5.0,3.0,0.0,1687.0,9.4%,4.0,f,0.0,0.0,1008.71,1008.71,456.46,435.17,0.0,117.08,1.11,Apr-2013,119.66,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,1077175,1313524.0,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,,10+ years,RENT,12252.0,Not Verified,Dec-2011,Fully Paid,n,small_business,real estate business,606xx,IL,8.72,0.0,Nov-2001,2.0,2.0,0.0,2956.0,98.5%,10.0,f,0.0,0.0,3005.666844,3005.67,2400.0,605.67,0.0,0.0,0.0,Jun-2014,649.91,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,1076863,1277178.0,10000.0,10000.0,10000.0,36 months,13.49%,339.31,C,C1,AIR RESOURCES BOARD,10+ years,RENT,49200.0,Source Verified,Dec-2011,Fully Paid,n,other,personel,917xx,CA,20.0,0.0,Feb-1996,1.0,10.0,0.0,5598.0,21%,37.0,f,0.0,0.0,12231.89,12231.89,10000.0,2214.92,16.97,0.0,0.0,Jan-2015,357.48,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,1075358,1311748.0,3000.0,3000.0,3000.0,60 months,12.69%,67.79,B,B5,University Medical Group,1 year,RENT,80000.0,Source Verified,Dec-2011,Current,n,other,Personal,972xx,OR,17.94,0.0,Jan-1996,0.0,15.0,0.0,27783.0,53.9%,38.0,f,461.73,461.73,3581.12,3581.12,2538.27,1042.85,0.0,0.0,0.0,Jun-2016,67.79,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


In [69]:
thousand_rows = pd.read_csv('loans_2007.csv', nrows=1000)
thousand_rows.memory_usage(deep=True).sum()/(1024*1024)

1.5502548217773438

A thousand rows use 1.55 MB, so one row uses about 0.00155 MB. If we only have 10 MB of memory available for the project and the dataset is 67 MB, we need to work out how many rows fit in a chunk. We will budget half of the 10 MB for each chunk and divide that 5 MB by the memory used per row (0.00155 MB). This gives about 3226 rows, which we will round down to 3000 rows.
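
The arithmetic above can be written out as a short sketch. The variable names are illustrative, and the 1.55 MB figure is the rounded measurement from the previous cell.

```python
# Rounded measurement from the 1,000-row sample above
mb_per_1000_rows = 1.55
mb_per_row = mb_per_1000_rows / 1000

# Budget half of the 10 MB that is available for each chunk
budget_mb = 10 / 2

max_rows = round(budget_mb / mb_per_row)   # largest chunk that fits the budget
chunk_rows = (max_rows // 1000) * 1000     # round down to a convenient size

print(max_rows, chunk_rows)
```

Rounding down leaves some headroom, which matters because the memory used per row varies slightly from chunk to chunk.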

In [70]:
chunk = pd.read_csv('loans_2007.csv', nrows=3000)
chunk.memory_usage(deep=True).sum()/(1024*1024)

4.649059295654297

Let's check that every chunk stays under the 5 MB budget.

In [71]:
chunks = pd.read_csv('loans_2007.csv', chunksize=3000)
for chunk in chunks:
    print(chunk.memory_usage(deep=True).sum()/(1024*1024))

4.649059295654297
4.644805908203125
4.646563529968262
4.647915840148926
4.644108772277832
4.645991325378418
4.644582748413086
4.646951675415039
4.645077705383301
4.64512825012207
4.657840728759766
4.656707763671875
4.663515090942383
4.896956443786621
0.880854606628418


## Exploring the Data in Chunks

First, we will count the numeric and string columns in each chunk:

In [72]:
chunks = pd.read_csv('loans_2007.csv', chunksize=3000)
numeric = []
string = []

for chunk in chunks:
    numeric.append(chunk.select_dtypes(include=[np.number]).shape[1])
    string.append(chunk.select_dtypes(include=['object']).shape[1])
    
print(numeric)
print(string)

[31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 30, 30]
[21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22]


From the above, we can see that the last 2 chunks have one fewer numeric column and one more string column than the others. Let's find out which column changes type.

In [73]:
import itertools as it

chunks = pd.read_csv('loans_2007.csv', chunksize=3000)

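# Take the chunks at positions 12 and 13: the last chunk with the usual
# dtypes and the first chunk where the counts differ
diff = it.islice(chunks, 12, 14)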

In [74]:
for dif in diff:
    print(dif.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 36000 to 38999
Data columns (total 52 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   id                          3000 non-null   int64  
 1   member_id                   3000 non-null   float64
 2   loan_amnt                   3000 non-null   float64
 3   funded_amnt                 3000 non-null   float64
 4   funded_amnt_inv             3000 non-null   float64
 5   term                        3000 non-null   object 
 6   int_rate                    3000 non-null   object 
 7   installment                 3000 non-null   float64
 8   grade                       3000 non-null   object 
 9   sub_grade                   3000 non-null   object 
 10  emp_title                   2891 non-null   object 
 11  emp_length                  3000 non-null   object 
 12  home_ownership              3000 non-null   object 
 13  annual_inc                  

From the above, we can see that the difference is that the id is int64 in most chunks and object in the last 2 chunks.
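
The output above does not show which values cause the change. One plausible cause, assumed here purely for illustration, is a non-numeric entry in the id column of the later chunks. Because pandas infers dtypes separately for each chunk, a single such value is enough to change the type for that chunk only. A minimal sketch with a made-up CSV:

```python
import io
import pandas as pd

# Hypothetical miniature CSV: the third data row has text in the id column
csv_text = "id,loan_amnt\n1,5000\n2,2500\nnot a number,\n3,2400\n"

id_is_integer = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Dtypes are inferred per chunk, not for the file as a whole
    print(chunk['id'].dtype)
    id_is_integer.append(pd.api.types.is_integer_dtype(chunk['id']))

print(id_is_integer)
```

The first chunk contains only integers, while the second contains a string and so falls back to a string-compatible dtype.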

Now we will look at the unique values in each string column. We may be able to convert string columns to the category type when fewer than 50% of their values are unique.

In [75]:
chunks = pd.read_csv('loans_2007.csv', chunksize=3000)
uniques = {} # For each column

for chunk in chunks:
    strings = chunk.select_dtypes(include=['object']) # Get strings cols of chunk
    cols = strings.columns
    
    # For each column, append the value_counts
    for c in cols:
        val_counts = strings[c].value_counts()
        
        if c in uniques:
            uniques[c].append(val_counts)
        else:
            uniques[c] = [val_counts]
            
combined = {} # For putting the columns together

for col in uniques:
    c_concat = pd.concat(uniques[col]) # Stack the per-chunk value counts
    c_group = c_concat.groupby(c_concat.index).sum() # Sum the counts of each distinct value
    combined[col] = c_group # Put into the new dictionary

In [76]:
for col in combined:
    print(col + ': ' + str(len(combined[col])))

term: 2
int_rate: 394
grade: 7
sub_grade: 35
emp_title: 30658
emp_length: 11
home_ownership: 5
verification_status: 3
issue_d: 55
loan_status: 9
pymnt_plan: 2
purpose: 14
title: 21264
zip_code: 837
addr_state: 50
earliest_cr_line: 530
revol_util: 1119
initial_list_status: 1
last_pymnt_d: 103
last_credit_pull_d: 108
application_type: 1
id: 3538


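From the string columns above, we will now pick out the columns in which more than 50% of the values are unique. These are poor candidates for the category type, and every other string column falls below the 50% threshold.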

In [77]:
for col in combined:
    # chunksize * number of chunks * .5 approximates half the row count
    # (a slight overestimate, because the last chunk is smaller)
    if len(combined[col]) > 3000*15*0.5:
        print(col + ': ' + str(len(combined[col])))

emp_title: 30658
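
The check above approximates the row count as 3000 * 15. An alternative is to derive the number of non-null rows from the value counts themselves. This sketch assumes a dictionary shaped like `combined`; the toy data and the name `toy_combined` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the `combined` dictionary: value -> count, per column
toy_combined = {
    'grade': pd.Series({'A': 5, 'B': 3, 'C': 2}),
    'emp_title': pd.Series({'Ryder': 1, 'Acme': 1, 'Initech': 2}),  # made-up counts
}

mostly_unique = []
for col, counts in toy_combined.items():
    total_rows = counts.sum()                # non-null rows in the column
    unique_share = len(counts) / total_rows  # fraction of values that are distinct
    if unique_share > 0.5:
        mostly_unique.append(col)

print(mostly_unique)
```

This avoids hard-coding the chunk size and chunk count, at the cost of ignoring missing values, which `value_counts()` drops by default.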


We will now look at which float columns have no missing values and could be converted to integer.

In [78]:
chunks = pd.read_csv('loans_2007.csv', chunksize=3000)
missing = []

for chunk in chunks:
    floats = chunk.select_dtypes(include=['float'])
    missing.append(floats.isnull().sum())  # Missing values per float column
    
combined_missing = pd.concat(missing)
grouped = combined_missing.groupby(combined_missing.index).sum().sort_values()

In [79]:
print(grouped)
print(len(grouped))

member_id                        3
total_rec_int                    3
total_pymnt_inv                  3
total_pymnt                      3
revol_bal                        3
recoveries                       3
policy_code                      3
out_prncp_inv                    3
out_prncp                        3
total_rec_late_fee               3
loan_amnt                        3
last_pymnt_amnt                  3
total_rec_prncp                  3
funded_amnt_inv                  3
funded_amnt                      3
dti                              3
collection_recovery_fee          3
installment                      3
annual_inc                       7
inq_last_6mths                  32
total_acc                       32
delinq_2yrs                     32
pub_rec                         32
delinq_amnt                     32
open_acc                        32
acc_now_delinq                  32
tax_liens                      108
collections_12_mths_ex_med     148
chargeoff_within_12_

We can see from the above that all float columns have missing values, so they may not be good candidates for conversion to the integer type.
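
The reason missing values matter is that NumPy's integer types cannot represent NaN. The sketch below shows the failure on a toy Series. It also shows pandas' nullable `Int64` extension type as one possible workaround, although this notebook does not use it.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 0.0, np.nan])

# A plain integer cast fails because NaN has no integer representation
try:
    values.astype('int64')
    plain_cast_failed = False
except ValueError:
    plain_cast_failed = True

# The nullable integer type stores the gap as <NA> instead
nullable = values.astype('Int64')

print(plain_cast_failed)
print(nullable.dtype, nullable.isna().sum())
```

Note that `Int64` keeps a separate mask for missing values, so it does not save memory over float64 by itself. A smaller nullable type such as `Int8` would be needed for that.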

In [80]:
chunks = pd.read_csv('loans_2007.csv', chunksize=3000)

memory = []

for chunk in chunks:
    memory.append(chunk.memory_usage(deep=True).sum() / (1024 * 1024))

sum(memory)

66.21605968475342

The total memory use is about 66 MB, slightly below the 67 MB noted earlier.

## Optimizing String and Numeric Columns

We can achieve the greatest memory improvements by converting the string columns to a numeric or category type. 

First, string columns in which fewer than 50% of the values are unique are candidates for the category type. We will convert a selection of them below:
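
To see why the category type helps, here is a small sketch on made-up data. A category column stores each distinct string once and represents the rows as small integer codes, so a column with few distinct values shrinks considerably.

```python
import pandas as pd

# 10,000 rows but only three distinct values
home_ownership = pd.Series(['RENT', 'MORTGAGE', 'OWN', 'RENT'] * 2500)
as_category = home_ownership.astype('category')

string_bytes = home_ownership.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(string_bytes, category_bytes)
print(list(as_category.cat.categories))
```

The saving shrinks as the share of unique values grows, which is why the 50% threshold is used as a rough cutoff.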

In [81]:
# String (object) columns to convert to the category type
category_columns = [
    'term', 'sub_grade', 'emp_length', 'home_ownership', 'verification_status',
    'loan_status', 'pymnt_plan', 'purpose', 'addr_state',
]

For reference from above, these are the unique value counts: <br>
term: 2 <br>
int_rate: 394 <br>
grade: 7 <br>
sub_grade: 35 <br>
emp_title: 30658 <br>
emp_length: 11 <br>
home_ownership: 5 <br>
verification_status: 3 <br>
issue_d: 55 <br>
loan_status: 9 <br>
pymnt_plan: 2 <br>
purpose: 14 <br>
title: 21264 <br>
zip_code: 837 <br>
addr_state: 50 <br>
earliest_cr_line: 530 <br>
revol_util: 1119 <br>
initial_list_status: 1 <br>
last_pymnt_d: 103 <br>
last_credit_pull_d: 108 <br>
application_type: 1 <br>
id: 3538

In [82]:
for col in category_columns:
    print(combined[col])

 36 months    31534
 60 months    11001
Name: term, dtype: int64
A1    1142
A2    1520
A3    1823
A4    2905
A5    2793
B1    1882
B2    2113
B3    2997
B4    2590
B5    2807
C1    2264
C2    2157
C3    1658
C4    1370
C5    1291
D1    1053
D2    1485
D3    1322
D4    1140
D5    1016
E1     884
E2     791
E3     668
E4     552
E5     499
F1     392
F2     308
F3     236
F4     211
F5     154
G1     141
G2     107
G3      79
G4      99
G5      86
Name: sub_grade, dtype: int64
1 year       3595
10+ years    9369
2 years      4743
3 years      4364
4 years      3649
5 years      3458
6 years      2375
7 years      1875
8 years      1592
9 years      1341
< 1 year     5062
Name: emp_length, dtype: int64
MORTGAGE    18959
NONE            8
OTHER         136
OWN          3251
RENT        20181
Name: home_ownership, dtype: int64
Not Verified       18758
Source Verified    10306
Verified           13471
Name: verification_status, dtype: int64
Charged Off                                        

In [83]:
convert_to_category = {i: 'category' for i in category_columns}
print(convert_to_category)

{'term': 'category', 'sub_grade': 'category', 'emp_length': 'category', 'home_ownership': 'category', 'verification_status': 'category', 'loan_status': 'category', 'pymnt_plan': 'category', 'purpose': 'category', 'addr_state': 'category'}


Now we will convert those columns to the category type, parse the date columns, and convert the percentage strings (int_rate and revol_util) to numeric values.
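
The two string conversions can be tried on a few sample values first. This is a sketch: the values are taken from the rows shown at the top of the notebook, and the explicit `format='%b-%Y'` is an assumption about how the dates are laid out.

```python
import numpy as np
import pandas as pd

# Percentage strings: strip the trailing '%' and parse what remains as floats
rates = pd.Series(['10.65%', '15.27%', np.nan])
numeric_rates = pd.to_numeric(rates.str.rstrip('%'))

# Month-year strings such as 'Dec-2011' parse to the first day of that month
dates = pd.to_datetime(pd.Series(['Dec-2011', 'Jan-1985']), format='%b-%Y')

print(numeric_rates.dtype, dates.dtype)
```

Missing values pass through both conversions unchanged, so no special handling is needed for incomplete rows.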

In [86]:
date_columns = ["issue_d", "earliest_cr_line", "last_pymnt_d", "last_credit_pull_d"]
chunks = pd.read_csv(
    'loans_2007.csv',
    chunksize=3000,
    dtype=convert_to_category,
    parse_dates=date_columns,
)

memory = []

for chunk in chunks:
    # Strip the trailing '%' so the percentages can be stored as numbers
    chunk['int_rate'] = pd.to_numeric(chunk['int_rate'].str.rstrip('%'))
    chunk['revol_util'] = pd.to_numeric(chunk['revol_util'].str.rstrip('%'))
    memory.append(chunk.memory_usage(deep=True).sum() / (1024 * 1024))

In [88]:
sum(memory)

29.269675254821777

Converting the dtypes reduces total memory usage from about 66 MB to about 29 MB, a reduction of more than half. As a next step, the dataframe could be optimized further.
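
One possible next step, not carried out in this notebook, is to downcast the remaining float64 columns to float32 with `pd.to_numeric(..., downcast='float')`. A minimal sketch on a few loan amounts from the sample rows:

```python
import pandas as pd

loan_amnt = pd.Series([5000.0, 2500.0, 2400.0, 10000.0, 3000.0])

# Ask pandas for the smallest float dtype that can hold the values
downcast = pd.to_numeric(loan_amnt, downcast='float')

before = loan_amnt.memory_usage(index=False)
after = downcast.memory_usage(index=False)
print(downcast.dtype, before, after)
```

float32 keeps only about seven significant digits, so this trade-off should be checked column by column before applying it to monetary totals.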