# Guided Project
### Practice Optimizing Dataframes and Processing in Chunks

## Introduction

In this guided project, we'll practice working with chunked dataframes and optimizing a dataframe's memory usage. We'll be working with financial lending data from [Lending Club](https://www.lendingclub.com/), a marketplace for personal loans that matches borrowers with investors. You can read more about the marketplace [on its website](https://www.lendingclub.com/public/how-peer-lending-works.action).

The Lending Club's website lists approved loans. Qualified investors can view the borrower's credit score, the purpose of the loan, and other details in the loan applications. Once a lender is ready to back a loan, it selects the amount of money it wants to fund. When the loan amount the borrower requested is fully funded, the borrower receives the money, minus the [origination fee](https://help.lendingclub.com/hc/en-us/articles/214501207-What-is-the-origination-fee-) that Lending Club charges.

We'll be working with a dataset of loans approved from `2007-2011`, which you can download from [Lending Club's website](https://www.lendingclub.com/info/download-data.action). We've already removed the `desc` column for you to make our system run more quickly.

If we read in the entire data set, it will consume about 67 megabytes of memory. Let's imagine that we only have 10 megabytes of memory available throughout this project, so you can practice the concepts you learned in the last two missions. You can find the solutions notebook for this guided project [in our GitHub repo](https://github.com/dataquestio/solutions/blob/master/Mission165Solutions.ipynb).

* Read in the first five lines from `loans_2007.csv` and look for any data quality issues.
* Read in the first 1000 rows from the data set, and calculate the total memory usage for these rows. Increase or decrease the number of rows to converge on a memory usage under five megabytes (to stay on the conservative side).
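
One way to approach both tasks without loading the whole file is the `nrows` parameter of `pd.read_csv`. The sketch below is illustrative only: it uses a small synthetic CSV held in memory as a stand-in for `loans_2007.csv`, and the helper name `memory_mb` is hypothetical.

```python
import io
import pandas as pd

def memory_mb(source, nrows):
    """Read only the first `nrows` rows and return their deep memory usage in MB."""
    df = pd.read_csv(source, nrows=nrows)
    return df.memory_usage(deep=True).sum() / 2**20

# Synthetic stand-in for loans_2007.csv (assumption: the real file is not read here).
header = "id,term,int_rate,loan_amnt"
rows = [f"{i},36 months,10.65%,{5000.0 + i}" for i in range(5000)]
csv_text = "\n".join([header] + rows)

# Task 1: inspect only the first five rows.
first_five = pd.read_csv(io.StringIO(csv_text), nrows=5)
print(first_five)

# Task 2: try different row counts until the usage fits the memory budget.
for n in (1000, 3000):
    print(n, "rows:", round(memory_mb(io.StringIO(csv_text), n), 3), "MB")
```

With the real file, you would pass `'loans_2007.csv'` instead of the in-memory buffer and adjust the row count until the result stays under five megabytes.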

In [1]:
import pandas as pd
pd.options.display.max_columns = 99

In [2]:
loans = pd.read_csv('loans_2007.csv')
loans.head()



Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,1077501,1296599.0,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,,10+ years,RENT,24000.0,Verified,Dec-2011,Fully Paid,n,credit_card,Computer,860xx,AZ,27.65,0.0,Jan-1985,1.0,3.0,0.0,13648.0,83.7%,9.0,f,0.0,0.0,5863.155187,5833.84,5000.0,863.16,0.0,0.0,0.0,Jan-2015,171.62,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,1077430,1314167.0,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,Ryder,< 1 year,RENT,30000.0,Source Verified,Dec-2011,Charged Off,n,car,bike,309xx,GA,1.0,0.0,Apr-1999,5.0,3.0,0.0,1687.0,9.4%,4.0,f,0.0,0.0,1008.71,1008.71,456.46,435.17,0.0,117.08,1.11,Apr-2013,119.66,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,1077175,1313524.0,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,,10+ years,RENT,12252.0,Not Verified,Dec-2011,Fully Paid,n,small_business,real estate business,606xx,IL,8.72,0.0,Nov-2001,2.0,2.0,0.0,2956.0,98.5%,10.0,f,0.0,0.0,3005.666844,3005.67,2400.0,605.67,0.0,0.0,0.0,Jun-2014,649.91,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,1076863,1277178.0,10000.0,10000.0,10000.0,36 months,13.49%,339.31,C,C1,AIR RESOURCES BOARD,10+ years,RENT,49200.0,Source Verified,Dec-2011,Fully Paid,n,other,personel,917xx,CA,20.0,0.0,Feb-1996,1.0,10.0,0.0,5598.0,21%,37.0,f,0.0,0.0,12231.89,12231.89,10000.0,2214.92,16.97,0.0,0.0,Jan-2015,357.48,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,1075358,1311748.0,3000.0,3000.0,3000.0,60 months,12.69%,67.79,B,B5,University Medical Group,1 year,RENT,80000.0,Source Verified,Dec-2011,Current,n,other,Personal,972xx,OR,17.94,0.0,Jan-1996,0.0,15.0,0.0,27783.0,53.9%,38.0,f,461.73,461.73,3581.12,3581.12,2538.27,1042.85,0.0,0.0,0.0,Jun-2016,67.79,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


### Expected data quality issues
* The `id` column has mixed data types.
* Some numerical columns (`int_rate`, `term`, etc.) contain unit characters (`%`, `months`) that must be stripped before the values can be used in calculations.
* Columns representing dates need to be converted to the `datetime` dtype before we can calculate with them.
* We can reduce the dataframe's total memory usage by converting columns whose number of unique values is less than half of the column length to the `category` type.
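
As a rough illustration of the cleaning these issues call for, the sketch below strips the unit characters and parses a date column. It runs on a tiny hand-built frame whose values are copied from the preview above; the `%b-%Y` date format is an assumption based on values such as `Dec-2011`.

```python
import pandas as pd

# Tiny stand-in frame with values taken from the first rows of the preview.
sample = pd.DataFrame({
    "term": ["36 months", "60 months"],
    "int_rate": ["10.65%", "15.27%"],
    "issue_d": ["Dec-2011", "Dec-2011"],
})

# Remove the " months" suffix and store the term as an integer.
sample["term"] = (
    sample["term"].str.strip().str.replace(" months", "", regex=False).astype("int64")
)

# Remove the trailing percent sign and store the rate as a float.
sample["int_rate"] = sample["int_rate"].str.rstrip("%").astype("float64")

# Parse month-year strings; assumes English month abbreviations.
sample["issue_d"] = pd.to_datetime(sample["issue_d"], format="%b-%Y")

print(sample.dtypes)
```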

In [3]:
loans.iloc[:1000].info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 52 columns):
id                            1000 non-null object
member_id                     1000 non-null float64
loan_amnt                     1000 non-null float64
funded_amnt                   1000 non-null float64
funded_amnt_inv               1000 non-null float64
term                          1000 non-null object
int_rate                      1000 non-null object
installment                   1000 non-null float64
grade                         1000 non-null object
sub_grade                     1000 non-null object
emp_title                     949 non-null object
emp_length                    1000 non-null object
home_ownership                1000 non-null object
annual_inc                    1000 non-null float64
verification_status           1000 non-null object
issue_d                       1000 non-null object
loan_status                   1000 non-null object
pymnt_plan            

In [4]:
loans.iloc[:3000].info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 52 columns):
id                            3000 non-null object
member_id                     3000 non-null float64
loan_amnt                     3000 non-null float64
funded_amnt                   3000 non-null float64
funded_amnt_inv               3000 non-null float64
term                          3000 non-null object
int_rate                      3000 non-null object
installment                   3000 non-null float64
grade                         3000 non-null object
sub_grade                     3000 non-null object
emp_title                     2829 non-null object
emp_length                    3000 non-null object
home_ownership                3000 non-null object
annual_inc                    3000 non-null float64
verification_status           3000 non-null object
issue_d                       3000 non-null object
loan_status                   3000 non-null object
pymnt_plan          

## Exploring the Data in Chunks

Let's familiarize ourselves with the columns to see which ones we can optimize. In the first mission, we explored column types by reading in the full dataframe. In this guided project, let's try to understand the column types better while using dataframe chunks.

For each chunk:
* How many columns have a numeric type? How many have a string type?
* How many unique values are there in each string column? How many of the string columns contain values that are less than 50% unique?
* Which float columns have no missing values and could be candidates for conversion to the integer type?

Calculate the total memory usage across all of the chunks.

In [33]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)

chunk_memory_usage = 0
for chunk_num, chunk in enumerate(chunk_iter, start=1):
    print('chunk #' + str(chunk_num))

    # Split the columns by dtype: float columns are numeric, object columns are strings.
    chunk_numcols = chunk.select_dtypes(include=['float'])
    chunk_strcols = chunk.select_dtypes(include=['object'])

    print('num of numeric columns: {}\nnum of string columns: {}'
          .format(len(chunk_numcols.columns), len(chunk_strcols.columns)))

    # Collect the string columns where fewer than half of the values are unique.
    less_than_50p_unique = []
    for col in chunk_strcols.columns:
        tot_leng = len(chunk[col])
        unq_leng = len(chunk[col].unique())
        if tot_leng * .5 > unq_leng:
            less_than_50p_unique.append(col)

    print('\ncontain values that are less than 50% unique:')
    print(less_than_50p_unique)

    # Find the float columns with no missing values in this chunk.
    chunk_nullcounts = chunk_numcols.isnull().sum()
    chunk_numcols_notnull = list(chunk_nullcounts[chunk_nullcounts == 0].index)
    print('\nnumeric columns with no null')
    print(chunk_numcols_notnull)

    # Accumulate the deep memory usage (in bytes) of every chunk.
    chunk_memory_usage += chunk.memory_usage(deep=True).sum()

    print('#' * 30)

print('Total memory usage across all the chunks (MB) :',
      chunk_memory_usage / 2**20)

chunk #1
num of numeric columns: 30
num of string columns: 21

contain values that are less than 50% unique:
['term', 'int_rate', 'grade', 'sub_grade', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'title', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type']

numeric columns with no null
['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies', 'tax_liens']
##############################
chunk #2
num of numeric col

### String columns with less than 50% unique values: convert to `category` type

['term', 'int_rate', 'grade', 'sub_grade', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type']

### Float columns with no missing values: candidates for conversion to `integer` type
* The list below holds for chunks `#1` through `#13`, that is, for every chunk except the last two.

['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'tax_liens']
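
Having no missing values is necessary for an integer conversion but not sufficient, because a column such as `installment` holds fractional amounts. The sketch below narrows the candidates to columns whose values are all whole numbers and then downcasts them. The helper name `int_candidates` and the three-column sample frame are hypothetical, used only for illustration.

```python
import pandas as pd

def int_candidates(df):
    """Return float columns with no missing values whose values are all whole numbers."""
    floats = df.select_dtypes(include=["float"])
    return [
        col for col in floats.columns
        if floats[col].notnull().all() and (floats[col] % 1 == 0).all()
    ]

# Small stand-in for one chunk of the loans data.
chunk = pd.DataFrame({
    "loan_amnt": [5000.0, 2500.0],     # whole numbers, no missing values
    "installment": [162.87, 59.83],    # fractional values, must stay float
    "open_acc": [3.0, None],           # contains a missing value
})

cands = int_candidates(chunk)
for col in cands:
    # downcast="integer" picks the smallest signed integer dtype that fits.
    chunk[col] = pd.to_numeric(chunk[col], downcast="integer")

print(cands)
print(chunk.dtypes)
```

When processing in chunks, a column is only safe to convert if it qualifies in every chunk, so in practice you would intersect the candidate lists across chunks before converting.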

## Optimizing String Columns

As we learned in the first mission of this course, we can achieve the greatest memory improvements by converting the string columns to a numeric type. Let's convert:
* the columns where less than 50% of the values are unique to the category type
* the columns that contain numeric values to the float type

While working with dataframe chunks:
* Determine which string columns you can convert to a numeric type if you clean them. For example, the `int_rate` column is only a string because of the `%` sign at the end.
* Determine which columns have a few unique values and convert them to the category type. For example, you may want to convert the `grade` and `sub_grade` columns.
* Based on your conclusions, perform the necessary type changes across all chunks. Calculate the total memory footprint, and compare it with the previous one.
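
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's solution: it reads a synthetic in-memory CSV instead of `loans_2007.csv`, the helper name `optimize_chunk` is hypothetical, and the column choices (`grade` and `sub_grade` as categories, `int_rate` as a percent string) follow the examples given in the list.

```python
import io
import pandas as pd

def optimize_chunk(chunk, category_cols, percent_cols):
    """Return a copy of `chunk` with percent strings as floats and low-cardinality strings as categories."""
    chunk = chunk.copy()
    for col in percent_cols:
        # "10.65%" -> 10.65
        chunk[col] = chunk[col].str.rstrip("%").astype("float64")
    for col in category_cols:
        chunk[col] = chunk[col].astype("category")
    return chunk

# Synthetic stand-in for loans_2007.csv (assumption: the real file is not read here).
header = "grade,sub_grade,int_rate,loan_amnt"
rows = [
    f"{'ABC'[i % 3]},{'ABC'[i % 3]}{i % 5 + 1},{10 + i % 7}.5%,{1000.0 + i}"
    for i in range(3000)
]
csv_text = "\n".join([header] + rows)

before = 0
after = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=1000):
    before += chunk.memory_usage(deep=True).sum()
    optimized = optimize_chunk(chunk, ["grade", "sub_grade"], ["int_rate"])
    after += optimized.memory_usage(deep=True).sum()

print("before (MB):", before / 2**20)
print("after  (MB):", after / 2**20)
```

Because each chunk is converted independently, the set of categories can differ from chunk to chunk, which matters if you later combine the chunks into one dataframe.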