# Data cleaning

## Read data
1. loans_2007.csv is a filtered version of the raw data set 'LoanStats3a.csv'. Because the original file is too large to load comfortably, some columns, such as 'desc' and 'url', were removed on purpose.

In [10]:
import pandas as pd
#Read the data
loans_2007 = pd.read_csv('loans_2007.csv', low_memory=False)
#drop_duplicates returns a new DataFrame, so assign the result back
loans_2007 = loans_2007.drop_duplicates()
#Display the first row of 'loans_2007' and the number of columns in 'loans_2007'
print(loans_2007.iloc[0])
print(loans_2007.shape[1])

id                                1077501
member_id                      1.2966e+06
loan_amnt                            5000
funded_amnt                          5000
funded_amnt_inv                      4975
term                            36 months
int_rate                           10.65%
installment                        162.87
grade                                   B
sub_grade                              B2
emp_title                             NaN
emp_length                      10+ years
home_ownership                       RENT
annual_inc                          24000
verification_status              Verified
issue_d                          Dec-2011
loan_status                    Fully Paid
pymnt_plan                              n
purpose                       credit_card
title                            Computer
zip_code                            860xx
addr_state                             AZ
dti                                 27.65
delinq_2yrs                       
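
A note on the duplicate removal in the cell above: `drop_duplicates` returns a new DataFrame and leaves the original untouched, so the result only takes effect once it is assigned to a name. A minimal sketch on a made-up toy frame (the values are illustrative, not taken from loans_2007.csv):

```python
import pandas as pd

# Toy frame with one repeated row (made-up values for illustration only).
toy = pd.DataFrame({"id": [1, 2, 2], "loan_amnt": [5000, 2500, 2500]})

# drop_duplicates returns a new DataFrame; 'toy' itself is not modified.
deduped = toy.drop_duplicates()

print(len(toy))      # 3
print(len(deduped))  # 2
```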

## Delete extra
1. Let's delete more columns, since too many columns can cause the model to overfit. To do this, go through the data dictionary to find the useful columns.

In [11]:
#Working on deleting columns
loans_2007 = loans_2007.drop(["id", "member_id", "funded_amnt", "funded_amnt_inv", "grade", "sub_grade", "emp_title", "issue_d"], axis=1)

In [12]:
#Working on deleting columns
loans_2007 = loans_2007.drop(["zip_code", "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp"], axis=1)

In [13]:
#Working on deleting columns
loans_2007 = loans_2007.drop(["total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt"], axis=1)
#Display the first row and the number of columns
print(loans_2007.iloc[0])
print(loans_2007.shape[1])

loan_amnt                            5000
term                            36 months
int_rate                           10.65%
installment                        162.87
emp_length                      10+ years
home_ownership                       RENT
annual_inc                          24000
verification_status              Verified
loan_status                    Fully Paid
pymnt_plan                              n
purpose                       credit_card
title                            Computer
addr_state                             AZ
dti                                 27.65
delinq_2yrs                             0
earliest_cr_line                 Jan-1985
inq_last_6mths                          1
open_acc                                3
pub_rec                                 0
revol_bal                           13648
revol_util                          83.7%
total_acc                               9
initial_list_status                     f
last_credit_pull_d               J

Finding: Reduced the number of columns from 52 to 32 (the three cells above drop 8 + 6 + 6 = 20 columns).
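
The column drops can also be collected into one list, with the column count checked before and after. The sketch below assumes a tiny made-up frame that only borrows a few column names from the data set; `columns_to_drop` is a hypothetical name.

```python
import pandas as pd

# Hypothetical miniature frame; only the column names mirror the data set.
toy = pd.DataFrame({
    "id": [1],
    "member_id": [10],
    "loan_amnt": [5000],
    "zip_code": ["860xx"],
    "recoveries": [0.0],
    "term": ["36 months"],
})

# A subset of the columns removed in the cells above.
columns_to_drop = ["id", "member_id", "zip_code", "recoveries"]

before = toy.shape[1]
toy = toy.drop(columns=columns_to_drop)
after = toy.shape[1]

print(before, "->", after)  # 6 -> 2
print(list(toy.columns))    # ['loan_amnt', 'term']
```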

## Choose a target column
1. Let's use the 'loan_status' column as the target, since it tells whether the loan was paid off on time.
2. Convert columns that contain text to numbers so that I can train a model.

In [14]:
#Count how often each value of 'loan_status' occurs
print(loans_2007['loan_status'].value_counts())

Fully Paid                                             33136
Charged Off                                             5634
Does not meet the credit policy. Status:Fully Paid      1988
Current                                                  961
Does not meet the credit policy. Status:Charged Off      761
Late (31-120 days)                                        24
In Grace Period                                           20
Late (16-30 days)                                          8
Default                                                    3
Name: loan_status, dtype: int64


## Binary classification
1. According to the Lending Club website, there are 8 different meanings attached to 'loan_status'.
2. Among these values, only 'Fully Paid' and 'Charged Off' describe the final outcome of a loan.
3. Use binary classification to predict which of these two values a loan falls into.

In [15]:
#Keep only the rows whose 'loan_status' is one of the two values
loans_2007 = loans_2007[(loans_2007['loan_status'] == "Fully Paid") | (loans_2007['loan_status'] == "Charged Off")]
#Use the DataFrame method replace to mark 'Fully Paid' as 1 and 'Charged Off' as 0.
status_replace = {
    "loan_status" : {
        "Fully Paid": 1,
        "Charged Off": 0,
    }
}
loans_2007 = loans_2007.replace(status_replace)

Finding: Far more loans are fully paid than charged off, so the classes are imbalanced. I need to keep in mind that a model trained on this data may be biased towards the class with more observations.
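
To put a number on the imbalance: in the value counts above, 33136 of the 38770 remaining loans (about 85%) are 'Fully Paid'. The sketch below shows one possible way to filter, encode, and measure the class shares on a made-up toy column. Using `isin` and `map` here is an alternative to the cell above, not a description of it.

```python
import pandas as pd

# Made-up status values for illustration only.
toy = pd.DataFrame({
    "loan_status": ["Fully Paid", "Charged Off", "Current",
                    "Fully Paid", "Fully Paid"]
})

# Keep only the two final outcomes.
keep = ["Fully Paid", "Charged Off"]
toy = toy[toy["loan_status"].isin(keep)].copy()

# Encode the target: 1 = fully paid, 0 = charged off.
toy["loan_status"] = toy["loan_status"].map({"Fully Paid": 1, "Charged Off": 0})

# Share of each class; a large gap between the two signals imbalance.
shares = toy["loan_status"].value_counts(normalize=True)
print(shares)
```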

## Drop values
1. Delete the columns that contain only one unique value. They carry no information for the model, and dropping them reduces the number of columns it has to go through.

In [16]:
#Create an empty list to keep track of columns I want to drop
orig_columns = loans_2007.columns
drop_columns = []
#Use the dropna method to remove null values
#Use the unique method to get the unique values that remain
for col in orig_columns:
    col_series = loans_2007[col].dropna().unique()
    #Check whether only one unique value remains
    if len(col_series) == 1:
        #Record the column name
        drop_columns.append(col)
#Use the drop method to get rid of 'drop_columns'
loans_2007 = loans_2007.drop(drop_columns, axis=1)
#Display the result
print(drop_columns)

['pymnt_plan', 'initial_list_status', 'collections_12_mths_ex_med', 'policy_code', 'application_type', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'tax_liens']
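
The same check can be written without an explicit loop, since `DataFrame.nunique` ignores null values by default. This is a sketch on a made-up toy frame; `single_valued` is a hypothetical name.

```python
import pandas as pd

# Made-up frame: two columns hold a single non-null value, one holds two.
toy = pd.DataFrame({
    "pymnt_plan": ["n", "n", None],
    "term": ["36 months", "60 months", "36 months"],
    "policy_code": [1, 1, 1],
})

# nunique() counts distinct non-null values per column (dropna=True by default).
unique_counts = toy.nunique()
single_valued = unique_counts[unique_counts == 1].index.tolist()

toy = toy.drop(columns=single_valued)
print(single_valued)  # ['pymnt_plan', 'policy_code']
```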


# Setting up the features
1. Since scikit-learn returns an error if the data contains missing values, let's compute the number of missing values.

## Return the missing values
1. To find the missing values, let's use isnull to return a DataFrame containing Boolean values.
2. Use the method sum to count the null values in each column.

In [19]:
loans = pd.read_csv("loans_2007.csv", low_memory=False)
#Return the number of missing values using isnull and sum
null_counts = loans.isnull().sum()
print(null_counts)

id                               0
member_id                        3
loan_amnt                        3
funded_amnt                      3
funded_amnt_inv                  3
term                             3
int_rate                         3
installment                      3
grade                            3
sub_grade                        3
emp_title                     2627
emp_length                       3
home_ownership                   3
annual_inc                       7
verification_status              3
issue_d                          3
loan_status                      3
pymnt_plan                       3
purpose                          3
title                           15
zip_code                         3
addr_state                       3
dti                              3
delinq_2yrs                     32
earliest_cr_line                32
inq_last_6mths                  32
open_acc                        32
pub_rec                         32
revol_bal           

## Remove null values
1. Remove the column that contains too many null values. In this case, I am going to delete 'pub_rec_bankruptcies'.

In [20]:
#Remove the 'pub_rec_bankruptcies' column
loans = loans.drop("pub_rec_bankruptcies", axis=1)
#Drop every row that has a missing value in any column (this covers 'title', 'revol_util', and 'last_credit_pull_d')
loans = loans.dropna(axis=0)
#Count how many columns have each data type
print(loans.dtypes.value_counts())

float64    29
object     22
dtype: int64
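
Note that `dropna(axis=0)` removes a row if any column is missing, which can discard more rows than intended when a sparse column such as 'emp_title' is still present. If only a few columns matter, `dropna` accepts a `subset` argument. A sketch on made-up values:

```python
import pandas as pd

# Made-up frame: 'emp_title' is sparse, 'title' has one missing value.
toy = pd.DataFrame({
    "title": ["Computer", None, "bike"],
    "revol_util": ["83.7%", "9.4%", "21%"],
    "emp_title": [None, "Ryder", None],
})

# Drops every row with a missing value in any column.
all_complete = toy.dropna(axis=0)

# Drops only the rows where the listed columns are missing.
subset_complete = toy.dropna(subset=["title", "revol_util"])

print(len(all_complete))     # 0
print(len(subset_complete))  # 2
```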


## Convert values
1. Convert object columns from text to numbers. Start by looking at what the object columns contain.

In [21]:
#Select the object (text) columns using select_dtypes
object_columns_df = loans.select_dtypes(include=["object"])
#Display the first row of 'object_columns_df'
print(object_columns_df.iloc[0])

id                             1077430
term                         60 months
int_rate                        15.27%
grade                                C
sub_grade                           C4
emp_title                        Ryder
emp_length                    < 1 year
home_ownership                    RENT
verification_status    Source Verified
issue_d                       Dec-2011
loan_status                Charged Off
pymnt_plan                           n
purpose                            car
title                             bike
zip_code                         309xx
addr_state                          GA
earliest_cr_line              Apr-1999
revol_util                        9.4%
initial_list_status                  f
last_pymnt_d                  Apr-2013
last_credit_pull_d            Sep-2013
application_type            INDIVIDUAL
Name: 1, dtype: object
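
One way to spot which object columns are categorical is to count their distinct values: columns with only a handful of values are candidates for dummy encoding. The sketch below uses a made-up toy frame and sets the dtype to object explicitly so that it behaves the same across pandas versions.

```python
import pandas as pd

# Made-up frame; text columns are created with an explicit object dtype.
toy = pd.DataFrame({
    "term": pd.Series([" 36 months", " 60 months", " 36 months"], dtype="object"),
    "int_rate": pd.Series(["10.65%", "15.27%", "15.96%"], dtype="object"),
    "loan_amnt": [5000.0, 2500.0, 2400.0],
})

# Keep only the object columns, then count distinct values in each.
object_cols = toy.select_dtypes(include=["object"])
print(object_cols.nunique().sort_values())
```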


## Categorical values
1. Check categorical values to confirm the number of unique values.
2. Explore unique values in 'purpose' and 'title'. 

In [22]:
#Take a look at these columns that seem to contain categorical values
cols = ['home_ownership', 'verification_status', 'emp_length', 'term', 'addr_state']
for c in cols:
    print(loans[c].value_counts())

RENT        18990
MORTGAGE    17725
OWN          2809
OTHER         130
NONE            2
Name: home_ownership, dtype: int64
Not Verified       17317
Verified           12568
Source Verified     9771
Name: verification_status, dtype: int64
10+ years    8987
2 years      4589
< 1 year     4548
3 years      4221
4 years      3545
1 year       3439
5 years      3330
6 years      2289
7 years      1816
8 years      1536
9 years      1298
n/a            58
Name: emp_length, dtype: int64
 36 months    29230
 60 months    10426
Name: term, dtype: int64
CA    6942
NY    3797
FL    2864
TX    2738
NJ    1879
IL    1576
PA    1550
GA    1409
VA    1397
MA    1329
OH    1259
MD    1067
AZ     844
WA     811
CO     788
NC     765
CT     742
MI     739
MO     706
MN     604
NV     491
WI     466
SC     458
AL     453
OR     434
LA     430
KY     337
OK     301
KS     276
UT     259
AR     243
DC     218
RI     201
NM     187
WV     171
NH     170
HI     167
DE     130
MT      83
AK      82
WY      

Finding: The 'emp_length' column needs cleaning so that it can be treated as numerical.

In [23]:
#Print the value counts of the 'purpose' and 'title' columns
print(loans["purpose"].value_counts())
print(loans["title"].value_counts())

debt_consolidation    18818
credit_card            5119
other                  4052
home_improvement       2948
major_purchase         2150
small_business         1604
car                    1520
wedding                 961
medical                 701
moving                  576
house                   388
educational             374
vacation                354
renewable_energy         91
Name: purpose, dtype: int64
Debt Consolidation                                 2158
Debt Consolidation Loan                            1678
Personal Loan                                       678
Consolidation                                       528
debt consolidation                                  498
Credit Card Consolidation                           355
Home Improvement                                    353
Debt consolidation                                  330
Personal                                            311
Credit Card Loan                                    303
Small Business Loan 

Finding: The 'purpose' and 'title' columns carry overlapping information, so let's keep the 'purpose' column, which has a small set of distinct values. The 'title' column repeats the same information in many spellings (for example 'Debt Consolidation' and 'debt consolidation').

In [24]:
#Keep 'home_ownership', 'verification_status', 'emp_length', 'term', and 'purpose'
#Clean the 'emp_length' column
#Treat '10+ years' as 10 years, and '< 1 year' or 'n/a' as 0
#Remove 'addr_state', 'last_credit_pull_d', 'title', and 'earliest_cr_line' because they would make the DataFrame larger and slow things down
mapping_dict = {
    "emp_length": {
        "10+ years": 10,
        "9 years": 9,
        "8 years": 8,
        "7 years": 7,
        "6 years": 6,
        "5 years": 5,
        "4 years": 4,
        "3 years": 3,
        "2 years": 2,
        "1 year": 1,
        "< 1 year": 0,
        "n/a": 0
    }
}
#Drop the columns listed above
loans = loans.drop(["last_credit_pull_d", "earliest_cr_line", "addr_state", "title"], axis=1)
#Strip the percentage sign and use the 'astype' method to convert to the float type
loans["int_rate"] = loans["int_rate"].str.rstrip("%").astype("float")
loans["revol_util"] = loans["revol_util"].str.rstrip("%").astype("float")
#Clean the 'emp_length' column
loans = loans.replace(mapping_dict)
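
The two conversions can be checked on a small made-up frame. As an alternative to `DataFrame.replace` with a nested dictionary, the sketch uses `Series.map` on the single column. One difference to keep in mind: `map` turns any value missing from the dictionary into NaN, whereas `replace` leaves it unchanged.

```python
import pandas as pd

# Made-up rows in the same text format as the data set.
toy = pd.DataFrame({
    "int_rate": ["10.65%", "15.27%"],
    "emp_length": ["10+ years", "< 1 year"],
})

# Strip the trailing percent sign, then convert the text to float.
toy["int_rate"] = toy["int_rate"].str.rstrip("%").astype("float")

# Map the employment length labels to numbers (partial mapping for the toy data).
emp_map = {"10+ years": 10, "< 1 year": 0}
toy["emp_length"] = toy["emp_length"].map(emp_map)

print(toy)
```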

## Dummy columns
1. Encode some columns as dummy variables to use them in the model. 
2. Apply 'get_dummies' method and return a new DataFrame. 
3. Use the concat method to add the dummy columns back to the original DataFrame.
4. Use the drop method to drop the original categorical columns.

In [25]:
#Encode 'home_ownership', 'verification_status', 'purpose', and 'term' as dummy columns and add them back to loans
#'emp_length' holds numbers after the mapping above, so it is left out of the list and kept as a numerical column
cat_columns = ["home_ownership", "verification_status", "purpose", "term"]
dummy_df = pd.get_dummies(loans[cat_columns])
loans = pd.concat([loans, dummy_df], axis=1)
loans = loans.drop(cat_columns, axis=1)
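
The encode, concat, and drop steps can also be done in one call: when `pd.get_dummies` is given the whole DataFrame plus a `columns` argument, it replaces the listed columns with their dummy columns and leaves the rest untouched. A sketch on made-up rows:

```python
import pandas as pd

# Made-up rows; the leading space in 'term' mirrors the raw data.
toy = pd.DataFrame({
    "loan_amnt": [5000.0, 2500.0, 2400.0],
    "home_ownership": ["RENT", "OWN", "RENT"],
    "term": [" 36 months", " 60 months", " 36 months"],
})

# Encode the listed columns and drop the originals in one step.
encoded = pd.get_dummies(toy, columns=["home_ownership", "term"])

print(list(encoded.columns))
# ['loan_amnt', 'home_ownership_OWN', 'home_ownership_RENT',
#  'term_ 36 months', 'term_ 60 months']
```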

Finding: Converted all of the columns to numerical values, since scikit-learn works on numerical input. After this, I fixed columns that had data leakage issues, contained redundant information, or required additional processing to turn them into useful features.

# Making predictions
1. Train the model and evaluate it with cross-validation.
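
As a rough sketch of what this step can look like, the example below runs 3-fold cross-validation on synthetic data and computes the true and false positive rates. The choice of `LogisticRegression`, the `class_weight="balanced"` setting (one common way to counter class imbalance), and the random data are all assumptions for illustration; they are not taken from this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic, imbalanced data (assumption: stands in for the cleaned features).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > -1.0).astype(int)

# class_weight="balanced" penalizes errors on the rarer class more heavily.
model = LogisticRegression(class_weight="balanced")

# Each prediction comes from a model that did not see that row in training.
predictions = cross_val_predict(model, X, y, cv=3)

tp = ((predictions == 1) & (y == 1)).sum()
fn = ((predictions == 0) & (y == 1)).sum()
fp = ((predictions == 1) & (y == 0)).sum()
tn = ((predictions == 0) & (y == 0)).sum()

tpr = tp / (tp + fn)  # true positive rate
fpr = fp / (fp + tn)  # false positive rate
print(tpr, fpr)
```

With an imbalanced target, the true and false positive rates are usually more informative than plain accuracy, because always predicting the majority class already scores a high accuracy.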

## Read data
1. Let's take a look at the clean version.

In [26]:
loans = pd.read_csv("loans_2007.csv", low_memory=False)
print(loans.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42538 entries, 0 to 42537
Data columns (total 52 columns):
id                            42538 non-null object
member_id                     42535 non-null float64
loan_amnt                     42535 non-null float64
funded_amnt                   42535 non-null float64
funded_amnt_inv               42535 non-null float64
term                          42535 non-null object
int_rate                      42535 non-null object
installment                   42535 non-null float64
grade                         42535 non-null object
sub_grade                     42535 non-null object
emp_title                     39911 non-null object
emp_length                    42535 non-null object
home_ownership                42535 non-null object
annual_inc                    42531 non-null float64
verification_status           42535 non-null object
issue_d                       42535 non-null object
loan_status                   42535 non-null object
p