# Stretch Goals

You can do more with the LendingClub or Instacart datasets.

LendingClub options:
- There's one other column in the dataframe with percent signs. Remove them and convert to floats. You'll need to handle missing values.
- Modify the `emp_title` column to replace titles with 'Other' if the title is not in the top 20. 
- Take initiative and work on your own ideas!
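
As a rough illustration of the `emp_title` option above, here is a minimal sketch on toy data. The helper name `keep_top_titles` and the sample titles are hypothetical, and the sketch assumes that missing titles should also fall into 'Other', which may not be what you want.

```python
import pandas as pd

def keep_top_titles(titles: pd.Series, n: int = 20, other: str = "Other") -> pd.Series:
    """Replace every title outside the n most frequent ones with `other`."""
    # value_counts() ignores NaN and sorts by frequency, most common first
    top = titles.value_counts().head(n).index
    # where() keeps a value when the condition is True and substitutes `other` otherwise;
    # NaN is never in `top`, so missing titles become `other` as well
    return titles.where(titles.isin(top), other)

# Toy data (hypothetical), using n=2 so the effect is visible
toy_titles = pd.Series(["Teacher", "Teacher", "Nurse", "Nurse", "Welder", None])
top_two = keep_top_titles(toy_titles, n=2)
print(top_two.tolist())
```

On the real dataframe this would presumably be applied as `df['emp_title'] = keep_top_titles(df['emp_title'])`, possibly after normalizing case and whitespace so that 'teacher' and 'Teacher ' count as the same title.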

Instacart options:
- Read [Instacart Market Basket Analysis, Winner's Interview: 2nd place, Kazuki Onodera](http://blog.kaggle.com/2017/09/21/instacart-market-basket-analysis-winners-interview-2nd-place-kazuki-onodera/), especially the **Feature Engineering** section. (Can you choose one feature from his bulleted lists, and try to engineer it with pandas code?)
- Read and replicate parts of [Simple Exploration Notebook - Instacart](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-instacart). (It's the Python Notebook with the most upvotes for this Kaggle competition.)
- Take initiative and work on your own ideas!

In [0]:
import numpy as np
import pandas as pd

In [2]:
# Download the dataset zip file with a shell command

!wget https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip

--2020-04-09 22:04:44--  https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip
Resolving resources.lendingclub.com (resources.lendingclub.com)... 54.212.148.51, 35.162.98.127, 52.41.172.114
Connecting to resources.lendingclub.com (resources.lendingclub.com)|54.212.148.51|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘LoanStats_2018Q4.csv.zip’

LoanStats_2018Q4.cs     [           <=>      ]  21.77M  1.95MB/s    in 12s     

2020-04-09 22:04:57 (1.87 MB/s) - ‘LoanStats_2018Q4.csv.zip’ saved [22826918]



In [3]:
# Use the !unzip command to extract the CSV from the zip archive

!unzip /content/LoanStats_2018Q4.csv.zip

Archive:  /content/LoanStats_2018Q4.csv.zip
  inflating: LoanStats_2018Q4.csv    


In [4]:
# Read in the CSV and fix header and footer problems using the
# `skiprows` and `skipfooter` parameters

df = pd.read_csv('/content/LoanStats_2018Q4.csv', skiprows=1, skipfooter=2,
                 engine="python")
print(df.shape)
df.head()

(128412, 144)


Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,...,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,,,35000,35000,35000.0,36 months,14.47%,1204.23,C,C2,Staff Physician,8 years,MORTGAGE,360000.0,Verified,Dec-2018,Fully Paid,n,,,credit_card,Credit card refinancing,336xx,FL,19.9,0,Apr-1995,1,,,24,0,57259,43.2%,51,w,0.0,0.0,38187.046837,38187.05,...,30.8,0,0,1222051,169286,124600,258401,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
1,,,5000,5000,5000.0,36 months,22.35%,191.86,D,D5,Director of Sales,10+ years,MORTGAGE,72000.0,Source Verified,Dec-2018,Fully Paid,n,,,debt_consolidation,Debt consolidation,333xx,FL,20.12,0,Mar-2010,0,,,13,0,11720,47.1%,26,f,0.0,0.0,5615.977674,5615.98,...,50.0,0,0,218686,34418,18200,37786,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
2,,,10000,10000,10000.0,60 months,23.40%,284.21,E,E1,,< 1 year,RENT,55000.0,Source Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,902xx,CA,13.51,0,Apr-2007,0,44.0,88.0,9,1,11859,53.9%,11,w,8462.43,8462.43,4243.65,4243.65,...,100.0,1,0,34386,21235,10500,12386,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
3,,,17100,17100,17100.0,36 months,18.94%,626.3,D,D2,Receptionist,10+ years,RENT,38000.0,Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,150xx,PA,38.09,0,Mar-1998,1,47.0,,14,0,15323,53%,21,w,11120.22,11120.22,9367.51,9367.51,...,75.0,0,0,70954,43351,16600,41784,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
4,,,4000,4000,4000.0,36 months,10.72%,130.43,B,B2,Extrusion assistant,10+ years,MORTGAGE,56000.0,Verified,Dec-2018,Current,n,,,credit_card,Credit card refinancing,301xx,GA,31.03,0,Sep-2006,0,,,7,0,4518,28.6%,11,w,2487.19,2487.19,1943.36,1943.36,...,0.0,0,0,221310,71375,12300,77865,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
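
To see what `skiprows=1` and `skipfooter=2` do, here is a small sketch on an in-memory CSV. The banner and summary lines are made up for illustration and are not copied from the LendingClub file. `skipfooter` is not supported by the default C parser, which is why `engine="python"` is passed.

```python
import io
import pandas as pd

# Hypothetical file layout: one banner line above the real header
# and two summary lines below the data
raw = (
    "Notes offered by Prospectus\n"
    "loan_amnt,int_rate\n"
    "35000,14.47%\n"
    "5000,22.35%\n"
    "Total amount funded: 40000\n"
    "Total rows: 2"
)

# skiprows=1 drops the banner so the second line becomes the header;
# skipfooter=2 drops the two trailing summary lines
toy_df = pd.read_csv(io.StringIO(raw), skiprows=1, skipfooter=2, engine="python")
print(toy_df.shape)
print(toy_df.columns.tolist())
```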


In [0]:
# Display max rows to get better view of null values

pd.set_option('display.max_rows', 150)

In [6]:
# Drop columns that are made up of all NaN Values

df = df.drop(['url', 'member_id', 'desc', 'id'], axis=1)

# Check table to make sure columns were dropped

df.head()

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,...,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,35000,35000,35000.0,36 months,14.47%,1204.23,C,C2,Staff Physician,8 years,MORTGAGE,360000.0,Verified,Dec-2018,Fully Paid,n,credit_card,Credit card refinancing,336xx,FL,19.9,0,Apr-1995,1,,,24,0,57259,43.2%,51,w,0.0,0.0,38187.046837,38187.05,35000.0,3187.05,0.0,0.0,...,30.8,0,0,1222051,169286,124600,258401,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
1,5000,5000,5000.0,36 months,22.35%,191.86,D,D5,Director of Sales,10+ years,MORTGAGE,72000.0,Source Verified,Dec-2018,Fully Paid,n,debt_consolidation,Debt consolidation,333xx,FL,20.12,0,Mar-2010,0,,,13,0,11720,47.1%,26,f,0.0,0.0,5615.977674,5615.98,5000.0,615.98,0.0,0.0,...,50.0,0,0,218686,34418,18200,37786,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
2,10000,10000,10000.0,60 months,23.40%,284.21,E,E1,,< 1 year,RENT,55000.0,Source Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,902xx,CA,13.51,0,Apr-2007,0,44.0,88.0,9,1,11859,53.9%,11,w,8462.43,8462.43,4243.65,4243.65,1537.57,2706.08,0.0,0.0,...,100.0,1,0,34386,21235,10500,12386,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
3,17100,17100,17100.0,36 months,18.94%,626.3,D,D2,Receptionist,10+ years,RENT,38000.0,Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,150xx,PA,38.09,0,Mar-1998,1,47.0,,14,0,15323,53%,21,w,11120.22,11120.22,9367.51,9367.51,5979.78,3387.73,0.0,0.0,...,75.0,0,0,70954,43351,16600,41784,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
4,4000,4000,4000.0,36 months,10.72%,130.43,B,B2,Extrusion assistant,10+ years,MORTGAGE,56000.0,Verified,Dec-2018,Current,n,credit_card,Credit card refinancing,301xx,GA,31.03,0,Sep-2006,0,,,7,0,4518,28.6%,11,w,2487.19,2487.19,1943.36,1943.36,1512.81,430.55,0.0,0.0,...,0.0,0,0,221310,71375,12300,77865,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
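
Instead of naming the empty columns by hand, they could also be found programmatically. The sketch below uses a toy dataframe with invented values. Note that `dropna(axis=1, how="all")` removes every fully empty column, which on the real data may be more than the four columns dropped above.

```python
import numpy as np
import pandas as pd

# Toy dataframe (hypothetical values): 'id' and 'desc' contain only NaN
toy = pd.DataFrame({
    "id": [np.nan, np.nan, np.nan],
    "loan_amnt": [35000, 5000, 10000],
    "desc": [np.nan, np.nan, np.nan],
    "emp_title": ["Staff Physician", "Director of Sales", np.nan],
})

# List the columns in which every value is missing
all_nan_cols = toy.columns[toy.isnull().all()].tolist()

# Drop them in one step; columns with at least one value are kept
cleaned = toy.dropna(axis=1, how="all")

print(all_nan_cols)
print(cleaned.columns.tolist())
```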


In [7]:
# Our original fix from lecture
def int_rate_to_float(value):
  return float(value.strip().strip('%'))

df['int_rate'] = df['int_rate'].apply(int_rate_to_float)

# Check to make sure the changes were made to the original column

df.head()

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,...,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,35000,35000,35000.0,36 months,14.47,1204.23,C,C2,Staff Physician,8 years,MORTGAGE,360000.0,Verified,Dec-2018,Fully Paid,n,credit_card,Credit card refinancing,336xx,FL,19.9,0,Apr-1995,1,,,24,0,57259,43.2%,51,w,0.0,0.0,38187.046837,38187.05,35000.0,3187.05,0.0,0.0,...,30.8,0,0,1222051,169286,124600,258401,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
1,5000,5000,5000.0,36 months,22.35,191.86,D,D5,Director of Sales,10+ years,MORTGAGE,72000.0,Source Verified,Dec-2018,Fully Paid,n,debt_consolidation,Debt consolidation,333xx,FL,20.12,0,Mar-2010,0,,,13,0,11720,47.1%,26,f,0.0,0.0,5615.977674,5615.98,5000.0,615.98,0.0,0.0,...,50.0,0,0,218686,34418,18200,37786,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
2,10000,10000,10000.0,60 months,23.4,284.21,E,E1,,< 1 year,RENT,55000.0,Source Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,902xx,CA,13.51,0,Apr-2007,0,44.0,88.0,9,1,11859,53.9%,11,w,8462.43,8462.43,4243.65,4243.65,1537.57,2706.08,0.0,0.0,...,100.0,1,0,34386,21235,10500,12386,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
3,17100,17100,17100.0,36 months,18.94,626.3,D,D2,Receptionist,10+ years,RENT,38000.0,Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,150xx,PA,38.09,0,Mar-1998,1,47.0,,14,0,15323,53%,21,w,11120.22,11120.22,9367.51,9367.51,5979.78,3387.73,0.0,0.0,...,75.0,0,0,70954,43351,16600,41784,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
4,4000,4000,4000.0,36 months,10.72,130.43,B,B2,Extrusion assistant,10+ years,MORTGAGE,56000.0,Verified,Dec-2018,Current,n,credit_card,Credit card refinancing,301xx,GA,31.03,0,Sep-2006,0,,,7,0,4518,28.6%,11,w,2487.19,2487.19,1943.36,1943.36,1512.81,430.55,0.0,0.0,...,0.0,0,0,221310,71375,12300,77865,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
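
The `int_rate_to_float` function above assumes every value is a string. A missing value (NaN is a float) would raise an `AttributeError` on `.strip()`. One possible NaN-tolerant variant is sketched below. The name `percent_to_float` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def percent_to_float(value):
    """Convert a string such as ' 14.47%' to 14.47; pass missing values through as NaN."""
    if pd.isnull(value):
        return np.nan
    # Remove surrounding whitespace first, then the trailing percent sign
    return float(str(value).strip().rstrip('%'))

# Toy data (hypothetical): padded string, integer-like percent, missing value
toy_rates = pd.Series([" 14.47%", "53%", np.nan])
converted = toy_rates.apply(percent_to_float)
print(converted.dtype)
```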


In [8]:
# The other column in the dataframe with percent signs is the revol_util column. 
# I will remove them and convert to floats and handle missing values.

df['revol_util'].head()

0    43.2%
1    47.1%
2    53.9%
3      53%
4    28.6%
Name: revol_util, dtype: object
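
One way to confirm which columns still contain percent signs is to scan every column for values ending in '%'. This is only a sketch on toy data with invented values, and it assumes the percent sign is always the last character.

```python
import numpy as np
import pandas as pd

# Toy dataframe (hypothetical): only 'revol_util' still holds percent strings
toy = pd.DataFrame({
    "int_rate": [14.47, 22.35, 23.40],        # already converted to float
    "revol_util": ["43.2%", np.nan, "53%"],   # strings with a trailing '%'
    "grade": ["C", "D", "E"],
    "loan_amnt": [35000, 5000, 10000],
})

# For each column, drop missing values, view the rest as text,
# and check whether any value ends with a percent sign
percent_cols = [
    col for col in toy.columns
    if toy[col].dropna().astype(str).str.endswith("%").any()
]
print(percent_cols)
```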

In [0]:
# Exploring Missing Values
df['revol_util'].isnull().sum()

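In [0]:
# Handle missing values by leaving them as NaN: filling them with a string
# such as 'Unavailable' would make the float conversion raise a ValueError.
# Strip whitespace and the trailing '%', then convert the column to float.
df['revol_util'] = df['revol_util'].str.strip().str.rstrip('%').astype(float)

In [0]:
# Check that the column is now float and count the remaining missing values
print(df['revol_util'].dtype)
df['revol_util'].isnull().sum()

In [0]:
# Optional: downcast from float64 to float32 to save memory
df["revol_util"] = pd.to_numeric(df["revol_util"], downcast="float")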

You can uncomment and run the cells below to re-download and extract the Instacart data.

In [0]:
# !wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
# !tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
# %cd instacart_2017_05_01