# Optimizing Dataframes

In this project, we'll optimize a dataframe's memory usage by working with financial lending data in the form of chunked dataframes from [Lending Club](https://www.lendingclub.com), a marketplace for personal loans that matches borrowers with investors. The dataset we'll use contains loans approved from 2007-2011, and it can be downloaded from Lending Club's website [here](https://www.lendingclub.com/info/download-data.action).

If we were to read in the entire dataset, it would consume around 67MB of memory. For this project, we'll imagine that we only have 10MB of memory available for us to work with.

Let's take a first look at the dataset.

In [1]:
import pandas as pd
import numpy as np
pd.options.display.max_columns = 99

loans_2007_head = pd.read_csv('loans_2007.csv', nrows=5)
loans_2007_head

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,1077501,1296599.0,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,,10+ years,RENT,24000.0,Verified,Dec-2011,Fully Paid,n,credit_card,Computer,860xx,AZ,27.65,0.0,Jan-1985,1.0,3.0,0.0,13648.0,83.7%,9.0,f,0.0,0.0,5863.155187,5833.84,5000.0,863.16,0.0,0.0,0.0,Jan-2015,171.62,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,1077430,1314167.0,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,Ryder,< 1 year,RENT,30000.0,Source Verified,Dec-2011,Charged Off,n,car,bike,309xx,GA,1.0,0.0,Apr-1999,5.0,3.0,0.0,1687.0,9.4%,4.0,f,0.0,0.0,1008.71,1008.71,456.46,435.17,0.0,117.08,1.11,Apr-2013,119.66,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,1077175,1313524.0,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,,10+ years,RENT,12252.0,Not Verified,Dec-2011,Fully Paid,n,small_business,real estate business,606xx,IL,8.72,0.0,Nov-2001,2.0,2.0,0.0,2956.0,98.5%,10.0,f,0.0,0.0,3005.666844,3005.67,2400.0,605.67,0.0,0.0,0.0,Jun-2014,649.91,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,1076863,1277178.0,10000.0,10000.0,10000.0,36 months,13.49%,339.31,C,C1,AIR RESOURCES BOARD,10+ years,RENT,49200.0,Source Verified,Dec-2011,Fully Paid,n,other,personel,917xx,CA,20.0,0.0,Feb-1996,1.0,10.0,0.0,5598.0,21%,37.0,f,0.0,0.0,12231.89,12231.89,10000.0,2214.92,16.97,0.0,0.0,Jan-2015,357.48,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,1075358,1311748.0,3000.0,3000.0,3000.0,60 months,12.69%,67.79,B,B5,University Medical Group,1 year,RENT,80000.0,Source Verified,Dec-2011,Current,n,other,Personal,972xx,OR,17.94,0.0,Jan-1996,0.0,15.0,0.0,27783.0,53.9%,38.0,f,461.73,461.73,3581.12,3581.12,2538.27,1042.85,0.0,0.0,0.0,Jun-2016,67.79,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


We'll read in the first 1000 rows and calculate their total memory usage to get a basic idea of how much memory the rows consume. This will help us find an optimal chunk size that keeps our memory usage under 5MB, which is 50% of the memory we have to work with.

In [2]:
thousand_chunk = pd.read_csv('loans_2007.csv', nrows=1000)
thousand_chunk.memory_usage(deep=True).sum()/(1024*1024)

1.5273666381835938
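The measurement above can be turned into a rough chunk-size estimate. The sketch below assumes memory grows roughly linearly with the number of rows, which is only an approximation for string-heavy data; `estimate_chunk_rows` is a hypothetical helper, not part of pandas.

```python
import math


def estimate_chunk_rows(sample_mb, sample_rows, budget_mb):
    """Estimate the largest row count that fits in budget_mb,
    assuming memory scales linearly with the number of rows."""
    return math.floor(budget_mb * sample_rows / sample_mb)


# Figures from the measurement above: 1000 rows used about 1.527MB.
max_rows = estimate_chunk_rows(1.5273666381835938, 1000, budget_mb=5)
print(max_rows)  # 3273
```

Rounding this estimate down to 3000 rows leaves some headroom for chunks whose strings happen to be longer than those in the sample.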

Since 1000 rows take up only around 1.5MB, let's check whether chunks of 3000 rows all stay under 5MB.

In [3]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
for chunk in chunk_iter:
    print(chunk.memory_usage(deep=True).sum()/(1024*1024))

4.580394744873047
4.576141357421875
4.577898979187012
4.579251289367676
4.575444221496582
4.577326774597168
4.575918197631836
4.578287124633789
4.576413154602051
4.57646369934082
4.589176177978516
4.588043212890625
4.594850540161133
4.828314781188965
0.868586540222168


Now let's find the total rows in the dataset.

In [4]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
total_rows = 0
for chunk in chunk_iter:
    total_rows += len(chunk)
print('Total Rows: ', total_rows)

Total Rows:  42538


## Exploring the Data in Chunks

Next, we'll explore some of the columns to see where we can optimize our memory usage. Across the chunks, we'll look at how many columns have a numeric or string type, how many unique values each string column has, and which float columns have no missing values. We'll also calculate the total memory usage across all chunks.

First, we'll find out how many columns have numeric and object type.

In [5]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)

numeric = []
objects = []

for chunk in chunk_iter:
    num = chunk.select_dtypes(include=[np.number]).shape[1]
    numeric.append(num)
    obj = chunk.select_dtypes(include=['object']).shape[1]
    objects.append(obj)
    
print('Numeric: ', numeric)
print('Objects: ', objects)

Numeric:  [31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 30, 30]
Objects:  [21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22]


Let's look at the object columns to see if they are consistent across the chunks.

In [6]:
obj_cols = []
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)

for chunk in chunk_iter:
    chunk_obj_cols = chunk.select_dtypes(include=['object']).columns.tolist()
    if len(obj_cols) > 0:
        is_equal = obj_cols == chunk_obj_cols
        if not is_equal:
            print('Object Columns: ', obj_cols, '\n')
            print('Chunk Object Columns: ', chunk_obj_cols, '\n')
    else:
        obj_cols = chunk_obj_cols

Object Columns:  ['term', 'int_rate', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'title', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type'] 

Chunk Object Columns:  ['id', 'term', 'int_rate', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'title', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type'] 

Object Columns:  ['term', 'int_rate', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'title', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type

The chunks contain 31 numeric columns and 21 string columns, except for the last 2 chunks, where the `id` column is read as `object` instead of a numeric type. Since this column won't be useful for predictive modeling anyway, we can just ignore it.
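The inconsistent type of the `id` column is possible because pandas infers dtypes separately for each chunk. The sketch below reproduces the effect on a small synthetic CSV (the data and the stray `note` value are made up for illustration) and shows one way to skip the column at read time by passing a callable to `usecols`.

```python
import io

import pandas as pd

# Synthetic stand-in for the CSV: the last row has a non-numeric id.
csv_text = "id,loan_amnt\n1,5000\n2,2500\n3,2400\nnote,\n"

# Each chunk infers its dtypes independently, so the same column
# can come back with a different type from one chunk to the next.
id_dtypes = [
    str(chunk['id'].dtype)
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)
]
print(id_dtypes)

# Skip the column entirely: usecols accepts a callable that is
# evaluated against each column name.
chunks = pd.read_csv(
    io.StringIO(csv_text),
    chunksize=2,
    usecols=lambda name: name != 'id',
)
columns_seen = [chunk.columns.tolist() for chunk in chunks]
print(columns_seen)
```

If the column were needed, passing an explicit `dtype` for it to `read_csv` would be an alternative way to keep its type consistent across chunks.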

Next, we'll take a look at how many unique values are in each of the string columns. We want to find the columns with few unique values relative to the number of rows (the code below uses fewer than 50 unique values as the cutoff), which we can later convert to a more efficient type.

In [7]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)

unique = {}
for chunk in chunk_iter:
    strings = chunk.select_dtypes(include=['object'])
    cols = strings.columns
    for c in cols:
        val_counts = strings[c].value_counts()
        if c in unique:
            unique[c].append(val_counts)
        else:
            unique[c] = [val_counts]
            
unique_all = {}
unique_stats = {
    'column_name': [],
    'total_values': [],
    'unique_values': []
}

for col in unique:
    unique_concat = pd.concat(unique[col])
    unique_group = unique_concat.groupby(unique_concat.index).sum()
    unique_all[col] = unique_group
    if unique_group.shape[0] < 50:
        print(col, unique_group.shape[0])

term 2
grade 7
sub_grade 35
emp_length 11
home_ownership 5
verification_status 3
loan_status 9
pymnt_plan 2
purpose 14
initial_list_status 1
application_type 1


Now let's look at the float columns with no missing values that can be converted to the `integer` type to save space.

In [11]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)

missing = []
for chunk in chunk_iter:
    floats = chunk.select_dtypes(include=['float'])
    missing.append(floats.apply(pd.isnull).sum())
    
missing_all = pd.concat(missing)
missing_all.groupby(missing_all.index).sum().sort_values(ascending=False)

pub_rec_bankruptcies          1368
chargeoff_within_12_mths       148
collections_12_mths_ex_med     148
tax_liens                      108
acc_now_delinq                  32
open_acc                        32
delinq_amnt                     32
pub_rec                         32
delinq_2yrs                     32
total_acc                       32
inq_last_6mths                  32
annual_inc                       7
installment                      3
collection_recovery_fee          3
dti                              3
funded_amnt                      3
funded_amnt_inv                  3
total_rec_prncp                  3
last_pymnt_amnt                  3
loan_amnt                        3
total_rec_late_fee               3
out_prncp                        3
out_prncp_inv                    3
policy_code                      3
recoveries                       3
revol_bal                        3
total_pymnt                      3
total_pymnt_inv                  3
total_rec_int       

Now let's calculate the total memory usage across all chunks.

In [14]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)

mem_usage = []

for chunk in chunk_iter:
    mem_usage.append(chunk.memory_usage(deep=True).sum() / (1024 * 1024))
    
print(mem_usage)
sum(mem_usage)

[4.580394744873047, 4.576141357421875, 4.577898979187012, 4.579251289367676, 4.575444221496582, 4.577326774597168, 4.575918197631836, 4.578287124633789, 4.576413154602051, 4.57646369934082, 4.589176177978516, 4.588043212890625, 4.594850540161133, 4.828314781188965, 0.868586540222168]


65.24251079559326

## Optimizing String Columns

We can be more efficient with our memory usage by converting string columns to more compact types. We'll convert the columns with less than 50% unique values to the category type, and the columns that contain numeric values to the float type.

Let's start by determining which string columns we can convert to numeric after cleaning them.



Determine which string columns you can convert to a numeric type if you clean them. Let's focus on columns that would actually be useful for analysis and modelling.


DQ:

As we learned in the first mission of this course, we can achieve the greatest memory improvements by converting the string columns to a numeric type. Let's convert all of the columns where the values are less than 50% unique to the category type, and the columns that contain numeric values to the float type.

Instructions

While working with dataframe chunks:

- Determine which string columns you can convert to a numeric type if you clean them. For example, the `int_rate` column is only a string because of the % sign at the end.
- Determine which columns have a few unique values and convert them to the category type. For example, you may want to convert the `grade` and `sub_grade` columns.
- Based on your conclusions, perform the necessary type changes across all chunks. Calculate the total memory footprint, and compare it with the previous one.
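The steps above could be sketched as follows. This is a minimal illustration on synthetic rows modeled on a few columns from the preview at the top of the notebook, not the finished solution; `optimize_strings` is a hypothetical helper, and the choice of columns is an assumption.

```python
import io

import pandas as pd

# Synthetic data: four rows modeled on the preview, repeated to
# make the memory difference visible.
header = "int_rate,revol_util,grade,term\n"
rows = (
    "10.65%,83.7%,B,36 months\n"
    "15.27%,9.4%,C,60 months\n"
    "15.96%,98.5%,C,36 months\n"
    "13.49%,21%,C,36 months\n"
)
csv_text = header + rows * 500


def optimize_strings(chunk, percent_cols, category_cols):
    """Return a copy of chunk with percent strings parsed as floats
    and low-cardinality string columns stored as categories."""
    chunk = chunk.copy()
    for col in percent_cols:
        # Drop the trailing % sign; unparseable values become NaN.
        cleaned = chunk[col].astype(str).str.rstrip('%')
        chunk[col] = pd.to_numeric(cleaned, errors='coerce')
    for col in category_cols:
        chunk[col] = chunk[col].astype('category')
    return chunk


before = 0
after = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=1000):
    before += chunk.memory_usage(deep=True).sum()
    optimized = optimize_strings(
        chunk,
        percent_cols=['int_rate', 'revol_util'],
        category_cols=['grade', 'term'],
    )
    after += optimized.memory_usage(deep=True).sum()

print('Before (bytes):', before)
print('After (bytes): ', after)
```

One design caveat: the categories here are inferred per chunk. If the chunks are later concatenated and their category sets differ, pandas falls back to the `object` type for that column, so fixing the category list up front may be preferable.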

## Optimizing Numeric Columns
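One possible approach to this step, sketched on synthetic data: float columns that contain no missing values and only whole numbers are converted to the smallest integer type that can hold them, using `pd.to_numeric` with `downcast='integer'`. The helper name `optimize_numeric` and the sample values are assumptions for illustration. The missing-value counts shown earlier are nonzero for the listed float columns, so the real data would need those rows handled first.

```python
import numpy as np
import pandas as pd


def optimize_numeric(df):
    """Return a copy of df where float columns holding only whole,
    non-missing numbers are stored as the smallest integer type."""
    df = df.copy()
    for col in df.select_dtypes(include=['float']).columns:
        values = df[col]
        if values.notnull().all() and (values % 1 == 0).all():
            df[col] = pd.to_numeric(values.astype('int64'),
                                    downcast='integer')
        # Other float columns are left untouched in this sketch.
    return df


# Synthetic sample using column names from the lending data.
sample = pd.DataFrame({
    'delinq_2yrs': [0.0, 0.0, 1.0],
    'revol_bal': [13648.0, 1687.0, 2956.0],
    'annual_inc': [24000.0, 30000.5, 12252.0],
    'pub_rec_bankruptcies': [0.0, np.nan, 0.0],
})

result = optimize_numeric(sample)
print(result.dtypes)
```

When working in chunks, the decision to convert a column should be based on statistics gathered across the whole file, since a column can be whole-valued in one chunk and not in another.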

## Conclusion & Next Steps

In this project we used Pandas to read in a large dataset in chunks. Some next steps to continue this project could be to create a function that automates as much of the work as possible so that we could use it on other similar datasets.

The idea for this project comes from the [DATAQUEST](https://app.dataquest.io/) **Processing Large Datasets in Pandas** course.