# Guided Project
### Practice Optimizing Dataframes and Processing in Chunks

## Introduction

In this guided project, we'll practice working with chunked dataframes and optimizing a dataframe's memory usage. We'll be working with financial lending data from [Lending Club](https://www.lendingclub.com/), a marketplace for personal loans that matches borrowers with investors. You can read more about the marketplace [on its website](https://www.lendingclub.com/public/how-peer-lending-works.action).<br>

The Lending Club's website lists approved loans. Qualified investors can view the borrower's credit score, the purpose of the loan, and other details in the loan applications. Once a lender is ready to back a loan, it selects the amount of money it wants to fund. When the loan amount the borrower requested is fully funded, the borrower receives the money, minus the [origination fee](https://help.lendingclub.com/hc/en-us/articles/214501207-What-is-the-origination-fee-) that Lending Club charges.<br>

We'll be working with a dataset of loans approved from `2007-2011`, which you can download from [Lending Club's website](https://www.lendingclub.com/info/download-data.action). We've already removed the `desc` column for you to make our system run more quickly.<br>

If we read in the entire data set, it will consume about 67 megabytes of memory. Let's imagine that we only have 10 megabytes of memory available throughout this project, so you can practice the concepts you learned in the last two missions. You can find the solutions notebook for this guided project [in our GitHub repo](https://github.com/dataquestio/solutions/blob/master/Mission165Solutions.ipynb).

* Read in the first five lines from `loans_2007.csv` and look for any data quality issues.
* Read in the first 1000 rows from the data set, and calculate the total memory usage for these rows. Increase or decrease the number of rows to converge on a memory usage under five megabytes (to stay on the conservative side).
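
The row-limited read described above can be sketched as follows. This is a minimal illustration, not the project's solution: `memory_mb` is a hypothetical helper, and the in-memory CSV is made-up data standing in for `loans_2007.csv` so that the snippet runs on its own.

```python
import io

import pandas as pd


def memory_mb(df):
    """Return the total (deep) memory usage of a dataframe in megabytes."""
    return df.memory_usage(deep=True).sum() / 2**20


# Hypothetical stand-in for loans_2007.csv so the sketch runs on its own.
csv_text = "id,term,loan_amnt\n" + "".join(
    f"{i}, 36 months,{1000.0 + i}\n" for i in range(5000)
)

# nrows limits how many data rows are parsed, so the rest of the file is never loaded.
sample = pd.read_csv(io.StringIO(csv_text), nrows=1000)
print(len(sample), "rows use", round(memory_mb(sample), 3), "MB")
```

With the real file, the same idea would be `pd.read_csv('loans_2007.csv', nrows=1000)`, adjusting `nrows` until the reported figure stays under the five-megabyte budget.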

In [1]:
import pandas as pd
pd.options.display.max_columns = 99

In [2]:
loans = pd.read_csv('loans_2007.csv')
loans.head()



Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,1077501,1296599.0,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,,10+ years,RENT,24000.0,Verified,Dec-2011,Fully Paid,n,credit_card,Computer,860xx,AZ,27.65,0.0,Jan-1985,1.0,3.0,0.0,13648.0,83.7%,9.0,f,0.0,0.0,5863.155187,5833.84,5000.0,863.16,0.0,0.0,0.0,Jan-2015,171.62,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,1077430,1314167.0,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,Ryder,< 1 year,RENT,30000.0,Source Verified,Dec-2011,Charged Off,n,car,bike,309xx,GA,1.0,0.0,Apr-1999,5.0,3.0,0.0,1687.0,9.4%,4.0,f,0.0,0.0,1008.71,1008.71,456.46,435.17,0.0,117.08,1.11,Apr-2013,119.66,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,1077175,1313524.0,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,,10+ years,RENT,12252.0,Not Verified,Dec-2011,Fully Paid,n,small_business,real estate business,606xx,IL,8.72,0.0,Nov-2001,2.0,2.0,0.0,2956.0,98.5%,10.0,f,0.0,0.0,3005.666844,3005.67,2400.0,605.67,0.0,0.0,0.0,Jun-2014,649.91,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,1076863,1277178.0,10000.0,10000.0,10000.0,36 months,13.49%,339.31,C,C1,AIR RESOURCES BOARD,10+ years,RENT,49200.0,Source Verified,Dec-2011,Fully Paid,n,other,personel,917xx,CA,20.0,0.0,Feb-1996,1.0,10.0,0.0,5598.0,21%,37.0,f,0.0,0.0,12231.89,12231.89,10000.0,2214.92,16.97,0.0,0.0,Jan-2015,357.48,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,1075358,1311748.0,3000.0,3000.0,3000.0,60 months,12.69%,67.79,B,B5,University Medical Group,1 year,RENT,80000.0,Source Verified,Dec-2011,Current,n,other,Personal,972xx,OR,17.94,0.0,Jan-1996,0.0,15.0,0.0,27783.0,53.9%,38.0,f,461.73,461.73,3581.12,3581.12,2538.27,1042.85,0.0,0.0,0.0,Jun-2016,67.79,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


### Expected issues on data quality
* The `id` column has mixed data types.
* Some numeric columns (`int_rate`, `term`, etc.) contain unit characters (`months`, `%`) and must be cleaned before they can be used in calculations.
* Columns representing dates need to be converted to the `datetime` dtype before date arithmetic is possible.
* We can reduce the dataframe's total memory usage by converting string columns whose number of unique values is less than half the column length to the `category` type.

In [3]:
loans.iloc[:1000].info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 52 columns):
id                            1000 non-null object
member_id                     1000 non-null float64
loan_amnt                     1000 non-null float64
funded_amnt                   1000 non-null float64
funded_amnt_inv               1000 non-null float64
term                          1000 non-null object
int_rate                      1000 non-null object
installment                   1000 non-null float64
grade                         1000 non-null object
sub_grade                     1000 non-null object
emp_title                     949 non-null object
emp_length                    983 non-null object
home_ownership                1000 non-null object
annual_inc                    1000 non-null float64
verification_status           1000 non-null object
issue_d                       1000 non-null object
loan_status                   1000 non-null object
pymnt_plan             

In [4]:
loans.iloc[:3000].info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 52 columns):
id                            3000 non-null object
member_id                     3000 non-null float64
loan_amnt                     3000 non-null float64
funded_amnt                   3000 non-null float64
funded_amnt_inv               3000 non-null float64
term                          3000 non-null object
int_rate                      3000 non-null object
installment                   3000 non-null float64
grade                         3000 non-null object
sub_grade                     3000 non-null object
emp_title                     2829 non-null object
emp_length                    2917 non-null object
home_ownership                3000 non-null object
annual_inc                    3000 non-null float64
verification_status           3000 non-null object
issue_d                       3000 non-null object
loan_status                   3000 non-null object
pymnt_plan          

## Exploring the Data in Chunks

Let's familiarize ourselves with the columns to see which ones we can optimize. In the first mission, we explored column types by reading in the full dataframe. In this guided project, let's try to understand the column types better while using dataframe chunks.

For each chunk:
* How many columns have a numeric type? How many have a string type?
* How many unique values are there in each string column? How many of the string columns contain values that are less than 50% unique?
* Which float columns have no missing values and could be candidates for conversion to the integer type?

Calculate the total memory usage across all of the chunks.

In [5]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)

chunk_num = 1
chunk_memory_usage = 0
for chunk in chunk_iter:
    
    print('chunk #'+str(chunk_num))
    chunk_num += 1
    
    chunk_numcols = chunk.select_dtypes(include=['float'])
    chunk_strcols = chunk.select_dtypes(include=['object'])
    
    print('num of numeric columns: {}\nnum of string columns: {}'\
                  .format(len(chunk_numcols.columns), 
                          len(chunk_strcols.columns)))
    
    less_than_50p_unique = []
    for col in chunk_strcols.columns:
        tot_leng = len(chunk[col])
        unq_leng = len(chunk[col].unique())
        #print('unique# in', col, ':', unq_leng)
        
        if tot_leng*.5 > unq_leng:
            less_than_50p_unique.append(col)
    
    print('\ncontain values that are less than 50% unique:')
    print(less_than_50p_unique)
    
    chunk_nullcounts = chunk_numcols.isnull().sum()
    chunk_numcols_notnull = list(chunk_nullcounts[chunk_nullcounts == 0].index)
    print('\nnumeric columns with no null')
    print(chunk_numcols_notnull)
    
    chunk_memory_usage += chunk.memory_usage(deep=True).sum()
    
    print('#'*30)
    
    
print('Total memory usage across all the chunks (MB) :',
      chunk_memory_usage/2**20)

chunk #1
num of numeric columns: 30
num of string columns: 21

contain values that are less than 50% unique:
['term', 'int_rate', 'grade', 'sub_grade', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'title', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type']

numeric columns with no null
['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies', 'tax_liens']
##############################
chunk #2
num of numeric col

chunk #11
num of numeric columns: 30
num of string columns: 21

contain values that are less than 50% unique:
['term', 'int_rate', 'grade', 'sub_grade', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type']

numeric columns with no null
['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies', 'tax_liens']
##############################
chunk #12
num of numeric columns: 3

### String columns with less than 50% unique values ---> convert to `category` type

['term', 'int_rate', 'grade', 'sub_grade', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type']

### Float columns with no missing values ---> candidates for `integer` type
* There seem to be no candidates for this case when the whole data set is considered.
  * Every float column has at least one null value somewhere in the data.
  * Later on, we will check which columns have only a few nulls and which have many.

## Optimizing String Columns

As we learned in the first mission of this course, **we can achieve the greatest memory improvements by converting the string columns to a numeric type.**

Let's convert:
* columns whose values are less than 50% unique to the category type
* columns that contain numeric values to the float type

While working with dataframe chunks:
* Determine which string columns you can convert to a numeric type if you clean them. For example, the `int_rate` column is only a string because of the `%` sign at the end.
* Determine which columns have a few unique values and convert them to the category type. For example, you may want to convert the `grade` and `sub_grade` columns.
* Based on your conclusions, perform the necessary type changes across all chunks. Calculate the total memory footprint, and compare it with the previous one.
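
As a rough illustration of the two conversions listed above, the sketch below cleans a percent column with vectorized string methods and converts a low-cardinality column to `category`. The `sample` dataframe is invented data in the shape of the loans file, not the real data set, and this is only one possible way to do the cleaning.

```python
import pandas as pd

# Hypothetical sample in the shape of the loans data (1000 rows, one missing rate).
sample = pd.DataFrame({
    "int_rate": ["10.65%", "15.27%", None, "13.49%"] * 250,
    "grade": ["B", "C", "C", "B"] * 250,
})

before = sample.memory_usage(deep=True).sum()

# Strip the trailing percent sign, parse as float, and scale to a fraction.
# Missing values stay missing (NaN) after the conversion.
sample["int_rate"] = pd.to_numeric(sample["int_rate"].str.rstrip("%")) / 100

# Only two distinct grades occur, so the category type stores small integer codes.
sample["grade"] = sample["grade"].astype("category")

after = sample.memory_usage(deep=True).sum()
print("bytes before:", before, "bytes after:", after)
```

The same two steps could be applied to each chunk inside a `chunksize` loop; the exact savings depend on the pandas version and on how many distinct values each column holds.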

In [6]:
loans.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,1077501,1296599.0,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,,10+ years,RENT,24000.0,Verified,Dec-2011,Fully Paid,n,credit_card,Computer,860xx,AZ,27.65,0.0,Jan-1985,1.0,3.0,0.0,13648.0,83.7%,9.0,f,0.0,0.0,5863.155187,5833.84,5000.0,863.16,0.0,0.0,0.0,Jan-2015,171.62,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,1077430,1314167.0,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,Ryder,< 1 year,RENT,30000.0,Source Verified,Dec-2011,Charged Off,n,car,bike,309xx,GA,1.0,0.0,Apr-1999,5.0,3.0,0.0,1687.0,9.4%,4.0,f,0.0,0.0,1008.71,1008.71,456.46,435.17,0.0,117.08,1.11,Apr-2013,119.66,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,1077175,1313524.0,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,,10+ years,RENT,12252.0,Not Verified,Dec-2011,Fully Paid,n,small_business,real estate business,606xx,IL,8.72,0.0,Nov-2001,2.0,2.0,0.0,2956.0,98.5%,10.0,f,0.0,0.0,3005.666844,3005.67,2400.0,605.67,0.0,0.0,0.0,Jun-2014,649.91,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,1076863,1277178.0,10000.0,10000.0,10000.0,36 months,13.49%,339.31,C,C1,AIR RESOURCES BOARD,10+ years,RENT,49200.0,Source Verified,Dec-2011,Fully Paid,n,other,personel,917xx,CA,20.0,0.0,Feb-1996,1.0,10.0,0.0,5598.0,21%,37.0,f,0.0,0.0,12231.89,12231.89,10000.0,2214.92,16.97,0.0,0.0,Jan-2015,357.48,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,1075358,1311748.0,3000.0,3000.0,3000.0,60 months,12.69%,67.79,B,B5,University Medical Group,1 year,RENT,80000.0,Source Verified,Dec-2011,Current,n,other,Personal,972xx,OR,17.94,0.0,Jan-1996,0.0,15.0,0.0,27783.0,53.9%,38.0,f,461.73,461.73,3581.12,3581.12,2538.27,1042.85,0.0,0.0,0.0,Jun-2016,67.79,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


In [7]:
# columns that can be converted to a numeric type after cleaning
# term : {int} months ---> {int}
# int_rate : {float}% ---> {float} * 0.01
# emp_length : {int} year, {int} years, {int}+ years, < {int} year ---> {int}
# revol_util : {float}% ---> {float} * 0.01

convert_to_numeric = ['term', 'int_rate', 'emp_length', 'revol_util']


In [None]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000, usecols=convert_to_numeric)
chunks_to_numeric = []

for chunk in chunk_iter:
    
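    # take the leading number from values like ' 36 months'; non-strings (NaN) pass through
    term_converted = chunk['term'].apply(lambda term: float(term.split()[0])
                                         if isinstance(term, str) else term)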
    int_rate_converted = chunk['int_rate'].apply(lambda int_rate: float(int_rate[:-1])*.01
                                                if type(int_rate) == str else int_rate)
    
    emp_length_mapper = {
        '< 1 year' : 0,
        '1 year' : 1,
        '2 years' : 2,
        '3 years' : 3,
        '4 years' : 4,
        '5 years' : 5,
        '6 years' : 6,
        '7 years' : 7,
        '8 years' : 8,
        '9 years' : 9,
        '10+ years' : 10
    }
    emp_length_converted = chunk['emp_length'].map(emp_length_mapper)
    
    revol_util_converted = chunk['revol_util'].apply(lambda revol_util: float(revol_util[:-1])*.01
                                                    if type(revol_util)==str else revol_util)
    
    chunk_converted = pd.concat([term_converted, int_rate_converted, 
                                 emp_length_converted, revol_util_converted],
                               axis=1)
    
    chunks_to_numeric.append(chunk_converted)
    
loans[convert_to_numeric] = pd.concat(chunks_to_numeric)

In [61]:
loans[convert_to_numeric].head()

Unnamed: 0,term,int_rate,emp_length,revol_util
0,36.0,0.1065,10.0,0.837
1,60.0,0.1527,0.0,0.094
2,36.0,0.1596,10.0,0.985
3,36.0,0.1349,10.0,0.21
4,60.0,0.1269,1.0,0.539


In [169]:
loans[convert_to_numeric].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 42535 entries, 0 to 42535
Data columns (total 4 columns):
term          42535 non-null float64
int_rate      42535 non-null float64
emp_length    41423 non-null float64
revol_util    42445 non-null float64
dtypes: float64(4)
memory usage: 1.6 MB


In [70]:
# columns that need to be converted to a datetime type
import re

convert_to_datetime = ['issue_d', 'earliest_cr_line', 'last_pymnt_d', 'last_credit_pull_d']
month_names = ['Jan', 'Feb', 'Mar', 'Apr','May', 'Jun', 
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

chunk_iter = pd.read_csv('loans_2007.csv', 
                         chunksize=3000, 
                         usecols=convert_to_datetime)
chunks_to_datetime = []

for chunk in chunk_iter:
    
    for i, mnth in enumerate(month_names):
        
        reg_exp = mnth+'-'
        
        issue_d_converted_bymonth = pd.to_datetime(chunk['issue_d'].apply(lambda issue_d: re.sub(reg_exp, str(i+1)+'/1/', issue_d)
                                                                         if type(issue_d)==str else issue_d))
        ecl_converted_bymonth = pd.to_datetime(chunk['earliest_cr_line'].apply(lambda ecl: re.sub(reg_exp, str(i+1)+'/1/', ecl)
                                                                              if type(ecl)==str else ecl))
        pymnt_converted_bymonth = pd.to_datetime(chunk['last_pymnt_d'].apply(lambda pymnt: re.sub(reg_exp, str(i+1)+'/1/', pymnt)
                                                                            if type(pymnt)==str else pymnt))
        
        credit_pull_converted_bymonth = pd.to_datetime(chunk['last_credit_pull_d'].apply(lambda credit: re.sub(reg_exp, str(i+1)+'/1/', credit)
                                                                            if type(credit)==str else credit))
        
        # update the 4 date columns of the chunk
        chunk['issue_d'] = issue_d_converted_bymonth
        chunk['earliest_cr_line'] = ecl_converted_bymonth
        chunk['last_pymnt_d'] = pymnt_converted_bymonth
        chunk['last_credit_pull_d'] = credit_pull_converted_bymonth
        
    chunks_to_datetime.append(chunk)
    
loans[convert_to_datetime] = pd.concat(chunks_to_datetime)
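
The month-by-month substitution above works, but strings such as `Dec-2011` can also be parsed in one step by giving `pd.to_datetime` an explicit format. The sketch below is an alternative under stated assumptions, not the method used in this notebook: the `dates` series is made-up data, and `%b` assumes English month abbreviations in the current locale.

```python
import pandas as pd

# Hypothetical values in the same 'Mon-YYYY' shape as issue_d and the other date columns.
dates = pd.Series(["Dec-2011", "Jan-1985", None, "Apr-1999"])

# %b matches an abbreviated month name and %Y a four-digit year.
# The day defaults to the first of the month; missing values become NaT.
parsed = pd.to_datetime(dates, format="%b-%Y")
print(parsed)
```

Inside the chunk loop, the same call could be applied to each column named in `convert_to_datetime`.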

In [71]:
loans[convert_to_datetime].head()

Unnamed: 0,issue_d,earliest_cr_line,last_pymnt_d,last_credit_pull_d
0,2011-12-01,1985-01-01,2015-01-01,2016-06-01
1,2011-12-01,1999-04-01,2013-04-01,2013-09-01
2,2011-12-01,2001-11-01,2014-06-01,2016-06-01
3,2011-12-01,1996-02-01,2015-01-01,2016-04-01
4,2011-12-01,1996-01-01,2016-06-01,2016-06-01


In [170]:
loans[convert_to_datetime].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 42535 entries, 0 to 42535
Data columns (total 4 columns):
issue_d               42535 non-null datetime64[ns]
earliest_cr_line      42506 non-null datetime64[ns]
last_pymnt_d          42452 non-null datetime64[ns]
last_credit_pull_d    42531 non-null datetime64[ns]
dtypes: datetime64[ns](4)
memory usage: 1.6 MB


In [72]:
# which columns have only a few unique values
# and can be converted to the category type?

# only include categorical columns.
# - exclude columns converted to a numeric type (used in further calculations)
# - exclude datetime-convertible columns (the 4 columns processed right above)

convert_to_category = ['grade', 'sub_grade', 'home_ownership', 
                       'verification_status', 'loan_status', 'pymnt_plan', 
                       'purpose', 'zip_code', 'addr_state', 
                       'initial_list_status', 'application_type']

loans[convert_to_category].head()


Unnamed: 0,grade,sub_grade,home_ownership,verification_status,loan_status,pymnt_plan,purpose,zip_code,addr_state,initial_list_status,application_type
0,B,B2,RENT,Verified,Fully Paid,n,credit_card,860xx,AZ,f,INDIVIDUAL
1,C,C4,RENT,Source Verified,Charged Off,n,car,309xx,GA,f,INDIVIDUAL
2,C,C5,RENT,Not Verified,Fully Paid,n,small_business,606xx,IL,f,INDIVIDUAL
3,C,C1,RENT,Source Verified,Fully Paid,n,other,917xx,CA,f,INDIVIDUAL
4,B,B5,RENT,Source Verified,Current,n,other,972xx,OR,f,INDIVIDUAL


In [124]:
loans['sub_grade'].unique()

array(['B2', 'C4', 'C5', 'C1', 'B5', 'A4', 'E1', 'F2', 'C3', 'B1', 'D1',
       'A1', 'B3', 'B4', 'C2', 'D2', 'A3', 'A5', 'D5', 'A2', 'E4', 'D3',
       'D4', 'F3', 'E3', 'F4', 'F1', 'E5', 'G4', 'E2', 'G3', 'G2', 'G1',
       'F5', 'G5', nan], dtype=object)

In [130]:
chunk_iter = pd.read_csv('loans_2007.csv', 
                         chunksize=3000, 
                         usecols=convert_to_category)
chunks_to_category = []

for chunk in chunk_iter:
    
    for col in chunk.columns:
        
        # since each chunk does not contain all the categorical values for each column,
        # we need to reference the unique values except for nan to convert column type.
        
        categories = [unq for unq in loans[col].unique() if type(unq)==str]
        t = pd.api.types.CategoricalDtype(categories=categories, ordered=False)
        chunk[col] = chunk[col].astype(t)
    
    chunks_to_category.append(chunk)
    
loans[convert_to_category] = pd.concat(chunks_to_category)
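
The reason for passing one shared `CategoricalDtype` to every chunk can be seen in a small sketch. The two series below are invented stand-ins for two chunks of the same column: when each chunk infers its own categories, concatenation gives up the categorical dtype, while a shared dtype keeps it.

```python
import pandas as pd

# Hypothetical chunks of one column; neither contains every category.
chunk_a = pd.Series(["A", "B", "A"])
chunk_b = pd.Series(["B", "C", "C"])

# Independent conversion: each chunk gets its own category set,
# so concatenation falls back to a non-categorical dtype.
mismatched = pd.concat(
    [chunk_a.astype("category"), chunk_b.astype("category")],
    ignore_index=True,
)

# Shared dtype: every chunk uses the same categories, so the result stays categorical.
shared = pd.api.types.CategoricalDtype(categories=["A", "B", "C"], ordered=False)
matched = pd.concat(
    [chunk_a.astype(shared), chunk_b.astype(shared)],
    ignore_index=True,
)

print(mismatched.dtype, matched.dtype)
```

If the full set of categories is not known in advance, it could be collected in a first pass over the chunks before the conversion pass.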

In [171]:
loans[convert_to_category].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 42535 entries, 0 to 42535
Data columns (total 11 columns):
grade                  42535 non-null category
sub_grade              42535 non-null category
home_ownership         42535 non-null category
verification_status    42535 non-null category
loan_status            42535 non-null category
pymnt_plan             42535 non-null category
purpose                42535 non-null category
zip_code               42535 non-null category
addr_state             42535 non-null category
initial_list_status    42535 non-null category
application_type       42535 non-null category
dtypes: category(11)
memory usage: 883.8 KB


### Memory Usage Check
* Previous data set (`66.2MB`)
* Processed data set (**`22.01MB`**)

In [134]:
chunk_iter = pd.read_csv('loans_2007.csv', 
                         chunksize=3000)
total_memory_usage = 0

for chunk in chunk_iter:
    
    total_memory_usage += chunk.memory_usage(deep=True).sum()/2**20
    
total_memory_usage

66.215373039245605

In [142]:
total_memory_usage = 0

# iloc clips slices that run past the end of the dataframe, so no bounds handling is needed
for chunk_bound in range(0, len(loans), 3000):
    
    chunk = loans.iloc[chunk_bound:chunk_bound+3000]
    
    total_memory_usage += chunk.memory_usage(deep=True).sum()/2**20
    
total_memory_usage

22.017316818237305

## Optimizing Numeric Columns

It looks like we were able to realize some powerful memory savings by converting to the category type and converting string columns to numeric ones.<br>

Now let's optimize the numeric columns using the `pandas.to_numeric()` function.<br>

While working with dataframe chunks:
* Identify float columns that contain missing values, and that we can convert to a more space efficient subtype.
* Identify float columns that don't contain any missing values, and that we can convert to the integer type because they represent whole numbers.
* Based on your conclusions, perform the necessary type changes across all chunks. Calculate the total memory footprint and compare it with the previous one.
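
A minimal sketch of the `pandas.to_numeric()` approach mentioned above, using its `downcast` parameter. The `sample` dataframe is made-up data; which columns are safe to convert in the real data set still has to be checked as described in the list.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one whole-number column without nulls, one column with decimals and a null.
sample = pd.DataFrame({
    "loan_amnt": [5000.0, 2500.0, 2400.0],
    "installment": [162.87, 59.83, np.nan],
})

# Whole numbers and no nulls: cast to int64, then downcast to the smallest integer subtype that fits.
sample["loan_amnt"] = pd.to_numeric(sample["loan_amnt"].astype("int64"), downcast="integer")

# Decimals or nulls: the column must stay float, but it can be downcast to float32.
sample["installment"] = pd.to_numeric(sample["installment"], downcast="float")

print(sample.dtypes)
```

Note that `float32` keeps only about seven significant digits, so downcasting floats trades some precision for memory and may not suit columns used in exact monetary totals.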

In [145]:
loans.select_dtypes(include=['float']).head()

Unnamed: 0,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,emp_length,annual_inc,dti,delinq_2yrs,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_amnt,collections_12_mths_ex_med,policy_code,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,1296599.0,5000.0,5000.0,4975.0,36.0,0.1065,162.87,10.0,24000.0,27.65,0.0,1.0,3.0,0.0,13648.0,0.837,9.0,0.0,0.0,5863.155187,5833.84,5000.0,863.16,0.0,0.0,0.0,171.62,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,1314167.0,2500.0,2500.0,2500.0,60.0,0.1527,59.83,0.0,30000.0,1.0,0.0,5.0,3.0,0.0,1687.0,0.094,4.0,0.0,0.0,1008.71,1008.71,456.46,435.17,0.0,117.08,1.11,119.66,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,1313524.0,2400.0,2400.0,2400.0,36.0,0.1596,84.33,10.0,12252.0,8.72,0.0,2.0,2.0,0.0,2956.0,0.985,10.0,0.0,0.0,3005.666844,3005.67,2400.0,605.67,0.0,0.0,0.0,649.91,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,1277178.0,10000.0,10000.0,10000.0,36.0,0.1349,339.31,10.0,49200.0,20.0,0.0,1.0,10.0,0.0,5598.0,0.21,37.0,0.0,0.0,12231.89,12231.89,10000.0,2214.92,16.97,0.0,0.0,357.48,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,1311748.0,3000.0,3000.0,3000.0,60.0,0.1269,67.79,1.0,80000.0,17.94,0.0,0.0,15.0,0.0,27783.0,0.539,38.0,461.73,461.73,3581.12,3581.12,2538.27,1042.85,0.0,0.0,0.0,67.79,0.0,1.0,0.0,0.0,0.0,0.0,0.0


In [146]:
loans.select_dtypes(include=['float']).tail()

Unnamed: 0,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,emp_length,annual_inc,dti,delinq_2yrs,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_amnt,collections_12_mths_ex_med,policy_code,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
42533,70868.0,2525.0,2525.0,225.0,36.0,0.0933,80.69,0.0,110000.0,10.0,,,,,0.0,,,0.0,0.0,2904.498829,258.82,2525.0,379.5,0.0,0.0,0.0,82.03,,1.0,,,,,
42534,70735.0,6500.0,6500.0,0.0,36.0,0.0838,204.84,0.0,,4.0,,,,,0.0,,,0.0,0.0,7373.904962,0.0,6500.0,873.9,0.0,0.0,0.0,205.32,,1.0,,,,,
42535,70681.0,5000.0,5000.0,0.0,36.0,0.0775,156.11,10.0,70000.0,8.81,,,,,0.0,,,0.0,0.0,5619.76209,0.0,5000.0,619.76,0.0,0.0,0.0,156.39,,1.0,,,,,
42536,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
42537,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [144]:
loans.select_dtypes(include=['float']).isnull().sum()

member_id                        3
loan_amnt                        3
funded_amnt                      3
funded_amnt_inv                  3
term                             3
int_rate                         3
installment                      3
emp_length                    1115
annual_inc                       7
dti                              3
delinq_2yrs                     32
inq_last_6mths                  32
open_acc                        32
pub_rec                         32
revol_bal                        3
revol_util                      93
total_acc                       32
out_prncp                        3
out_prncp_inv                    3
total_pymnt                      3
total_pymnt_inv                  3
total_rec_prncp                  3
total_rec_int                    3
total_rec_late_fee               3
recoveries                       3
collection_recovery_fee          3
last_pymnt_amnt                  3
collections_12_mths_ex_med     148
policy_code         

In [151]:
for col in loans.select_dtypes(include=['float']).columns:
    if loans[col].isnull().sum() == 3:
        print(col)
        print('null value index:\n', loans[col][loans[col].isnull()==True].index)

member_id
null value index:
 Int64Index([39786, 42536, 42537], dtype='int64')
loan_amnt
null value index:
 Int64Index([39786, 42536, 42537], dtype='int64')
funded_amnt
null value index:
 Int64Index([39786, 42536, 42537], dtype='int64')
funded_amnt_inv
null value index:
 Int64Index([39786, 42536, 42537], dtype='int64')
term
null value index:
 Int64Index([39786, 42536, 42537], dtype='int64')
int_rate
null value index:
 Int64Index([39786, 42536, 42537], dtype='int64')
installment
null value index:
 Int64Index([39786, 42536, 42537], dtype='int64')
dti
null value index:
 Int64Index([39786, 42536, 42537], dtype='int64')
revol_bal
null value index:
 Int64Index([39786, 42536, 42537], dtype='int64')
out_prncp
null value index:
 Int64Index([39786, 42536, 42537], dtype='int64')
out_prncp_inv
null value index:
 Int64Index([39786, 42536, 42537], dtype='int64')
total_pymnt
null value index:
 Int64Index([39786, 42536, 42537], dtype='int64')
total_pymnt_inv
null value index:
 Int64Index([39786, 42536,

In [153]:
loans = loans.drop(index=[39786, 42536, 42537])
loans.select_dtypes(include=['float']).isnull().sum()

member_id                        0
loan_amnt                        0
funded_amnt                      0
funded_amnt_inv                  0
term                             0
int_rate                         0
installment                      0
emp_length                    1112
annual_inc                       4
dti                              0
delinq_2yrs                     29
inq_last_6mths                  29
open_acc                        29
pub_rec                         29
revol_bal                        0
revol_util                      90
total_acc                       29
out_prncp                        0
out_prncp_inv                    0
total_pymnt                      0
total_pymnt_inv                  0
total_rec_prncp                  0
total_rec_int                    0
total_rec_late_fee               0
recoveries                       0
collection_recovery_fee          0
last_pymnt_amnt                  0
collections_12_mths_ex_med     145
policy_code         

In [156]:
float_isnull_summed = loans.select_dtypes(include=['float']).isnull().sum()
float_cols_nonull = list(float_isnull_summed[float_isnull_summed == 0].index)
float_cols_nonull

['member_id',
 'loan_amnt',
 'funded_amnt',
 'funded_amnt_inv',
 'term',
 'int_rate',
 'installment',
 'dti',
 'revol_bal',
 'out_prncp',
 'out_prncp_inv',
 'total_pymnt',
 'total_pymnt_inv',
 'total_rec_prncp',
 'total_rec_int',
 'total_rec_late_fee',
 'recoveries',
 'collection_recovery_fee',
 'last_pymnt_amnt',
 'policy_code']

In [157]:
loans[float_cols_nonull].head()

Unnamed: 0,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,dti,revol_bal,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_amnt,policy_code
0,1296599.0,5000.0,5000.0,4975.0,36.0,0.1065,162.87,27.65,13648.0,0.0,0.0,5863.155187,5833.84,5000.0,863.16,0.0,0.0,0.0,171.62,1.0
1,1314167.0,2500.0,2500.0,2500.0,60.0,0.1527,59.83,1.0,1687.0,0.0,0.0,1008.71,1008.71,456.46,435.17,0.0,117.08,1.11,119.66,1.0
2,1313524.0,2400.0,2400.0,2400.0,36.0,0.1596,84.33,8.72,2956.0,0.0,0.0,3005.666844,3005.67,2400.0,605.67,0.0,0.0,0.0,649.91,1.0
3,1277178.0,10000.0,10000.0,10000.0,36.0,0.1349,339.31,20.0,5598.0,0.0,0.0,12231.89,12231.89,10000.0,2214.92,16.97,0.0,0.0,357.48,1.0
4,1311748.0,3000.0,3000.0,3000.0,60.0,0.1269,67.79,17.94,27783.0,461.73,461.73,3581.12,3581.12,2538.27,1042.85,0.0,0.0,0.0,67.79,1.0


In [225]:
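def check_float_cols_having_no_decimal(col):
    
    # True only if every value in the column is a whole number.
    # A remainder check avoids relying on the string form of the float,
    # which would wrongly accept values with one decimal place (e.g. 1.5).
    return bool((loans[col] % 1 == 0).all())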
    
    
import numpy as np

def optimize_numeric_dtype_size(df, col, col_min, col_max):
    # Candidate integer dtypes, ordered from smallest to largest.
    dtypes = ['int8', 'int16', 'int32', 'int64']

    for dt in dtypes:
        info = np.iinfo(dt)

        # Cast to the first (smallest) dtype whose range covers [col_min, col_max].
        if col_min >= info.min and col_max <= info.max:
            df[col] = df[col].astype(dt)
            return 'Success'

    return 'Fail'

In [228]:
float_cols_convertible_int = [col for col in float_cols_nonull
                             if check_float_cols_having_no_decimal(col)]

In [188]:
float_cols_convertible_int

['member_id', 'loan_amnt', 'funded_amnt', 'term', 'revol_bal', 'policy_code']
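
As a cross-check on the two helpers above, pandas can do the same job with built-ins. This is a minimal sketch on a made-up three-value series (not the real `term` column): a modulo test finds whole-number floats, and `pd.to_numeric` with `downcast='integer'` selects the smallest signed integer dtype.

```python
import pandas as pd

# Hypothetical toy column standing in for a whole-number float column such as `term`.
s = pd.Series([36.0, 60.0, 36.0])

# A float column can be cast losslessly only if every value is a whole number.
is_whole = bool((s % 1 == 0).all())

# downcast='integer' picks the smallest signed integer dtype that holds the data.
downcast = pd.to_numeric(s, downcast='integer')

print(is_whole, downcast.dtype)  # True int8
```

Note that `pd.to_numeric` chooses the dtype from the values it is given, so different chunks could end up with different dtypes. Passing the global min/max into `optimize_numeric_dtype_size`, as this notebook does, keeps the dtype consistent across chunks.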

In [227]:
chunks_to_integer = []

# Get the min/max values for each column from the whole dataset.
# If the whole dataset is too large to get the min/max values directly,
# we can get the values using chunking.
minmax_vals = [(loans[col_int].min(), loans[col_int].max()) 
               for col_int in float_cols_convertible_int]

# iloc slices are clipped at the end of the frame, so the final, shorter chunk
# needs no special handling. .copy() makes each chunk independent of `loans`,
# which avoids SettingWithCopyWarning when the dtypes are changed below.
for chunk_bound in range(0, len(loans), 3000):

    chunk = loans.iloc[chunk_bound:chunk_bound+3000][float_cols_convertible_int].copy()
    
    for i, col in enumerate(float_cols_convertible_int):
        
        col_min = minmax_vals[i][0]
        col_max = minmax_vals[i][1]
        
        optimize_numeric_dtype_size(chunk, col, col_min, col_max)
    
    chunks_to_integer.append(chunk)
    
loans[float_cols_convertible_int] = pd.concat(chunks_to_integer)
loans[float_cols_convertible_int].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 42535 entries, 0 to 42535
Data columns (total 6 columns):
member_id      42535 non-null int32
loan_amnt      42535 non-null int32
funded_amnt    42535 non-null int32
term           42535 non-null int8
revol_bal      42535 non-null int32
policy_code    42535 non-null int8
dtypes: int32(4), int8(2)
memory usage: 1.1 MB
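
The comment in the cell above mentions that the min/max values can be gathered chunk by chunk when the whole dataset does not fit in memory. Here is a minimal sketch of that idea, using a made-up five-row CSV held in a string instead of `loans_2007.csv`:

```python
import io
import pandas as pd

# Hypothetical miniature CSV standing in for loans_2007.csv.
csv_text = "loan_amnt,term\n5000,36\n2500,60\n2400,36\n10000,36\n3000,60\n"

chunk_mins, chunk_maxs = [], []

# chunksize=2 keeps the toy example small; the notebook works in 3000-row chunks.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    chunk_mins.append(chunk.min())
    chunk_maxs.append(chunk.max())

# Each list entry is a Series indexed by column name, so stacking them into a
# dataframe gives one row per chunk; reduce over the rows for the global values.
col_min = pd.DataFrame(chunk_mins).min()
col_max = pd.DataFrame(chunk_maxs).max()

print(col_min['loan_amnt'], col_max['loan_amnt'])  # 2400 10000
```

Only one chunk is held in memory at a time, and the per-chunk summaries are tiny compared with the data itself.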


### Memory Usage Final Check
* Previous data set (`66.2 MB`)
* First processed data set (`22.01 MB`)
* **Final processed data set (`21.12 MB`)**

In [231]:
tot_mem_usage = 0

for chunk_bound in range(0, len(loans), 3000):

    # iloc slices are clipped at the end of the frame, so the final, shorter
    # chunk needs no special handling.
    chunk = loans.iloc[chunk_bound:chunk_bound+3000]
    
    tot_mem_usage += chunk.memory_usage(deep=True).sum()/2**20
    
    
tot_mem_usage

21.122328758239746

In [232]:
# First 5 rows of the processed dataset
loans.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,1077501,1296599,5000,5000,4975.0,36,0.1065,162.87,B,B2,,10.0,RENT,24000.0,Verified,2011-12-01,Fully Paid,n,credit_card,Computer,860xx,AZ,27.65,0.0,1985-01-01,1.0,3.0,0.0,13648,0.837,9.0,f,0.0,0.0,5863.155187,5833.84,5000.0,863.16,0.0,0.0,0.0,2015-01-01,171.62,2016-06-01,0.0,1,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,1077430,1314167,2500,2500,2500.0,60,0.1527,59.83,C,C4,Ryder,0.0,RENT,30000.0,Source Verified,2011-12-01,Charged Off,n,car,bike,309xx,GA,1.0,0.0,1999-04-01,5.0,3.0,0.0,1687,0.094,4.0,f,0.0,0.0,1008.71,1008.71,456.46,435.17,0.0,117.08,1.11,2013-04-01,119.66,2013-09-01,0.0,1,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,1077175,1313524,2400,2400,2400.0,36,0.1596,84.33,C,C5,,10.0,RENT,12252.0,Not Verified,2011-12-01,Fully Paid,n,small_business,real estate business,606xx,IL,8.72,0.0,2001-11-01,2.0,2.0,0.0,2956,0.985,10.0,f,0.0,0.0,3005.666844,3005.67,2400.0,605.67,0.0,0.0,0.0,2014-06-01,649.91,2016-06-01,0.0,1,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,1076863,1277178,10000,10000,10000.0,36,0.1349,339.31,C,C1,AIR RESOURCES BOARD,10.0,RENT,49200.0,Source Verified,2011-12-01,Fully Paid,n,other,personel,917xx,CA,20.0,0.0,1996-02-01,1.0,10.0,0.0,5598,0.21,37.0,f,0.0,0.0,12231.89,12231.89,10000.0,2214.92,16.97,0.0,0.0,2015-01-01,357.48,2016-04-01,0.0,1,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,1075358,1311748,3000,3000,3000.0,60,0.1269,67.79,B,B5,University Medical Group,1.0,RENT,80000.0,Source Verified,2011-12-01,Current,n,other,Personal,972xx,OR,17.94,0.0,1996-01-01,0.0,15.0,0.0,27783,0.539,38.0,f,461.73,461.73,3581.12,3581.12,2538.27,1042.85,0.0,0.0,0.0,2016-06-01,67.79,2016-06-01,0.0,1,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


## Next Steps

Create a function that automates as much of the work you just did as possible, so that you could use it on other Lending Club data sets. This function should:
* Determine the optimal chunk size based on the memory constraints you provide
* Determine which string columns can be converted to numeric ones by removing the `%` character
* Determine which numeric columns can be converted to more space-efficient representations

In [235]:
# Determine the optimal chunk size based on the memory constraints you provide
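def determine_optimal_chunk_size(df, mem_const_min, mem_const_max, stepsize):
    # mem_const_min and mem_const_max are in megabytes; stepsize is in rows.
    chunk_size = 1000
    tried = set()

    while True:

        chunk_mem_usage = df.iloc[:chunk_size].memory_usage(deep=True).sum()/2**20

        if mem_const_min <= chunk_mem_usage <= mem_const_max:
            return chunk_size

        # Stop instead of looping forever: the search is stuck if it revisits a
        # size, or if the slice already covers every row and is still too small.
        too_small = chunk_mem_usage < mem_const_min
        if chunk_size in tried or (too_small and chunk_size >= len(df)):
            raise ValueError('no chunk size satisfies the memory constraints')
        tried.add(chunk_size)

        chunk_size += stepsize if too_small else -stepsize

        if chunk_size <= 0:
            raise ValueError('no chunk size satisfies the memory constraints')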

In [238]:
loans_default = pd.read_csv('loans_2007.csv')

# Find a chunk size whose deep memory usage falls between 10 and 15 MB.
determine_optimal_chunk_size(loans_default, 10, 15, 1000)



7000
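
`determine_optimal_chunk_size` measures slices of a dataframe that is already fully loaded. When memory is tight, one alternative is to estimate the chunk size from a small sample instead. The sketch below is one possible approach, not part of the original solution: `estimate_chunk_size` is a hypothetical helper, and the CSV is toy data built in a string.

```python
import io
import pandas as pd

def estimate_chunk_size(csv_source, target_mb, sample_rows=1000):
    """Hypothetical helper: estimate how many rows fit in `target_mb` megabytes."""
    # Only the first `sample_rows` rows are parsed, so the full file is never loaded.
    sample = pd.read_csv(csv_source, nrows=sample_rows)
    if sample.empty:
        raise ValueError('the sample contains no rows')

    # Average deep memory footprint of one row, in bytes.
    bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)

    return int(target_mb * 2**20 // bytes_per_row)

# Hypothetical toy data; with the real file, pass 'loans_2007.csv' instead.
csv_text = "loan_amnt,grade\n" + "5000,B\n" * 200
rows_per_chunk = estimate_chunk_size(io.StringIO(csv_text), target_mb=5)
print(rows_per_chunk)
```

The estimate assumes the first rows are representative of the rest of the file. Long strings later in the data would raise the real footprint, so it is worth leaving some headroom below the actual memory limit.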

In [None]:
#Determine which string columns can be converted to numeric ones by removing the % character
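
One way this cell could be filled in is sketched below. `find_percent_columns` is a hypothetical helper, and the dataframe is toy data with made-up percent strings. A column qualifies when every non-null value ends in `%` and parses as a number once the `%` is stripped.

```python
import pandas as pd

def find_percent_columns(df):
    """Hypothetical helper: non-numeric columns whose values all look like '12.5%'."""
    percent_cols = []
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            continue

        values = df[col].dropna().astype(str).str.strip()
        if values.empty or not values.str.endswith('%').all():
            continue

        # errors='coerce' turns anything unparseable into NaN instead of raising.
        stripped = pd.to_numeric(values.str.rstrip('%'), errors='coerce')
        if stripped.notna().all():
            percent_cols.append(col)

    return percent_cols

# Hypothetical toy rows, not taken from the real file.
toy = pd.DataFrame({
    'int_rate': ['10.65%', '15.27%', None],
    'revol_util': ['83.7%', '9.4%', '98.5%'],
    'grade': ['B', 'C', 'C'],
    'loan_amnt': [5000.0, 2500.0, 2400.0],
})

print(find_percent_columns(toy))  # ['int_rate', 'revol_util']
```

On a chunked read, a column should only be treated as convertible if it passes this check in every chunk.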


In [None]:
# Determine which numeric columns can be converted to more space efficient representations
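
A possible starting point for this cell, reusing the `np.iinfo` range check from earlier in the notebook. `suggest_integer_dtypes` is a hypothetical helper and the dataframe is toy data. The sketch only covers whole-number columns without missing values. Downcasting to `float32` is left out because it can lose precision.

```python
import numpy as np
import pandas as pd

def suggest_integer_dtypes(df):
    """Hypothetical helper: map numeric columns to a smaller integer dtype they fit in."""
    suggestions = {}
    for col in df.select_dtypes(include=[np.number]).columns:
        s = df[col]

        # NaN cannot be stored in NumPy integer dtypes, and fractional values
        # would be truncated, so both kinds of column are skipped.
        if s.isnull().any() or not (s % 1 == 0).all():
            continue

        lo, hi = s.min(), s.max()
        for name in ('int8', 'int16', 'int32', 'int64'):
            info = np.iinfo(name)
            if info.min <= lo and hi <= info.max:
                # Only report the column if the new dtype is actually smaller.
                if np.dtype(name).itemsize < s.dtype.itemsize:
                    suggestions[col] = name
                break

    return suggestions

# Hypothetical toy rows modelled on columns that appear in this notebook.
toy = pd.DataFrame({
    'term': [36.0, 60.0, 36.0],
    'loan_amnt': [5000.0, 2500.0, 2400.0],
    'member_id': [1296599.0, 1314167.0, 1313524.0],
    'int_rate': [0.1065, 0.1527, 0.1596],
    'annual_inc': [24000.0, None, 12252.0],
})

print(suggest_integer_dtypes(toy))
# {'term': 'int8', 'loan_amnt': 'int16', 'member_id': 'int32'}
```

The suggested dtypes are only valid for the rows inspected. For a chunked workflow, the min/max values should first be collected across all chunks, as was done for `float_cols_convertible_int` above.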