# Machine Learning Project
### Predicting which loans will default

In the first part, we removed all of the columns that contained redundant information, weren't useful for modeling, required too much processing to make useful, or leaked information from the future. We exported the DataFrame from the end of that part to a CSV file named cleaned_loans.csv to distinguish it from the loans_2007.csv file we used in the data cleaning process.

In this part, we'll prepare the data for machine learning by handling missing values, converting categorical columns to numeric columns, and removing any other extraneous columns we encounter along the way.

In [1]:
import pandas as pd
df = pd.read_csv('cleaned_loans.csv')

In [2]:
null_counts = df.isnull().sum()
null_counts

Unnamed: 0                 0
loan_amnt                  0
term                       0
int_rate                   0
installment                0
emp_length              1036
home_ownership             0
annual_inc                 0
verification_status        0
loan_status                0
purpose                    0
title                     11
addr_state                 0
dti                        0
delinq_2yrs                0
earliest_cr_line           0
inq_last_6mths             0
open_acc                   0
pub_rec                    0
revol_bal                  0
revol_util                50
total_acc                  0
last_credit_pull_d         2
pub_rec_bankruptcies     697
dtype: int64

In [3]:
df = df.drop("pub_rec_bankruptcies", axis=1)
df = df.dropna(axis=0)
print(df.dtypes.value_counts())

object     11
float64    10
int64       2
dtype: int64
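The cell above drops one column outright and then drops every row that still contains a null. Here is a minimal sketch of that same two-step pattern on a small made-up DataFrame (the values are illustrative and not taken from cleaned_loans.csv). It also prints the share of missing values per column, which is one possible way to inspect them alongside the raw counts.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the loans DataFrame; all values are made up.
toy = pd.DataFrame({
    "loan_amnt": [5000, 2500, 2400, 10000],
    "emp_length": ["10+ years", None, "< 1 year", "1 year"],
    "pub_rec_bankruptcies": [0.0, np.nan, np.nan, 0.0],
})

# Fraction of missing values in each column.
null_share = toy.isnull().mean()
print(null_share)

# Drop one column entirely, then drop any rows that still contain nulls
# (the same pattern as the cell above).
cleaned = toy.drop("pub_rec_bankruptcies", axis=1).dropna(axis=0)
print(cleaned.shape)
```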


While the numerical columns can be used natively with scikit-learn, the object columns that contain text need to be converted to numerical data types. Let's return a new DataFrame containing just the object columns so we can explore them in more depth.

In [5]:
object_columns_df = df.select_dtypes(include=["object"])
print(object_columns_df.iloc[0])

term                         60 months
int_rate                        15.27%
emp_length                    < 1 year
home_ownership                    RENT
verification_status    Source Verified
purpose                            car
title                             bike
addr_state                          GA
earliest_cr_line              Apr-1999
revol_util                        9.4%
last_credit_pull_d            Sep-2013
Name: 0, dtype: object


Some of the columns seem like they represent categorical values, but we should confirm by checking the number of unique values in those columns:

- home_ownership: home ownership status, can only be 1 of 4 categorical values according to the data dictionary,
- verification_status: indicates if income was verified by Lending Club,
- emp_length: number of years the borrower was employed upon time of application,
- term: number of payments on the loan, either 36 or 60,
- addr_state: borrower's state of residence,
- purpose: a category provided by the borrower for the loan request,
- title: loan title provided by the borrower.
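A quick way to check how many distinct values each candidate column holds is `DataFrame.nunique()`. The sketch below runs it on a few made-up rows, so the counts are illustrative only. A column with just a handful of distinct values is a plausible categorical feature, while one with nearly as many distinct values as rows probably is not.

```python
import pandas as pd

# Made-up rows that mimic the shape of three of the object columns.
toy = pd.DataFrame({
    "home_ownership": ["RENT", "MORTGAGE", "OWN", "RENT"],
    "term": [" 36 months", " 60 months", " 36 months", " 36 months"],
    "title": ["bike", "Debt Consolidation", "debt consolidation", "Personal Loan"],
})

# Number of distinct values per column.
unique_counts = toy.nunique()
print(unique_counts)
```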

There are also some columns that represent numeric values but are stored as text and need to be converted:

- int_rate: interest rate of the loan in %,
- revol_util: revolving line utilization rate, or the amount of credit the borrower is using relative to all available credit.
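Both of these columns store percentages as strings such as `15.27%`. As a small sketch on made-up values, stripping the trailing percent sign and casting to float is enough to make them numeric:

```python
import pandas as pd

# Made-up percentage strings in the same format as int_rate and revol_util.
rates = pd.Series(["15.27%", "9.4%", "10.65%"])

# Remove the trailing "%" and convert the remaining text to floats.
numeric_rates = rates.str.rstrip("%").astype("float")
print(numeric_rates.tolist())
```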

Based on the first row's values for purpose and title, it seems like these columns could reflect the same information. Let's explore the unique value counts separately to confirm if this is true.

Lastly, some of the columns contain date values that would require a good amount of feature engineering for them to be potentially useful:

- earliest_cr_line: The month the borrower's earliest reported credit line was opened,
- last_credit_pull_d: The most recent month Lending Club pulled credit for this loan.

Since these date features would require additional feature engineering before they could be used for modeling, let's remove them from the DataFrame.
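For reference, here is a rough sketch of the kind of feature engineering these date columns would need. It is not something this project does. It parses the `Mon-YYYY` strings and derives a hypothetical `credit_history_months` feature (a name invented for this illustration) on made-up rows.

```python
import pandas as pd

# Made-up rows in the same "Mon-YYYY" format as the two date columns.
toy = pd.DataFrame({
    "earliest_cr_line": ["Apr-1999", "Jan-2005"],
    "last_credit_pull_d": ["Sep-2013", "Jun-2016"],
})

# Parse the month-year strings into timestamps (day defaults to the 1st).
earliest = pd.to_datetime(toy["earliest_cr_line"], format="%b-%Y")
last_pull = pd.to_datetime(toy["last_credit_pull_d"], format="%b-%Y")

# Hypothetical feature: whole months between the two dates.
credit_history_months = (
    (last_pull.dt.year - earliest.dt.year) * 12
    + (last_pull.dt.month - earliest.dt.month)
)
print(credit_history_months.tolist())
```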

In [6]:
cols = ['home_ownership', 'verification_status', 'emp_length', 'term', 'addr_state']
for c in cols:
    print(df[c].value_counts())

RENT        18111
MORTGAGE    16686
OWN          2778
OTHER          96
NONE            3
Name: home_ownership, dtype: int64
Not Verified       16281
Verified           11855
Source Verified     9538
Name: verification_status, dtype: int64
10+ years    8544
< 1 year     4513
2 years      4303
3 years      4022
4 years      3353
5 years      3202
1 year       3176
6 years      2177
7 years      1714
8 years      1442
9 years      1228
Name: emp_length, dtype: int64
 36 months    28233
 60 months     9441
Name: term, dtype: int64
CA    6776
NY    3614
FL    2704
TX    2613
NJ    1776
IL    1447
PA    1442
VA    1347
GA    1323
MA    1272
OH    1149
MD    1008
AZ     806
WA     788
CO     748
NC     729
CT     711
MI     678
MO     648
MN     581
NV     466
SC     454
WI     427
OR     422
AL     420
LA     420
KY     311
OK     285
KS     249
UT     249
AR     229
DC     209
RI     194
NM     180
WV     164
HI     162
NH     157
DE     110
MT      77
WY      76
AK      76
SD      60
VT  

In [8]:
print(df["purpose"].value_counts())
print(df["title"].value_counts())

debt_consolidation    17751
credit_card            4910
other                  3711
home_improvement       2808
major_purchase         2083
small_business         1719
car                    1459
wedding                 916
medical                 655
moving                  552
house                   356
vacation                348
educational             312
renewable_energy         94
Name: purpose, dtype: int64
Debt Consolidation                                                      2068
Debt Consolidation Loan                                                 1599
Personal Loan                                                            624
Consolidation                                                            488
debt consolidation                                                       466
Credit Card Consolidation                                                345
Home Improvement                                                         336
Debt consolidation                       

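The home_ownership, verification_status, emp_length, and term columns each contain a few discrete categorical values, so we'll keep all four. We'll encode home_ownership, verification_status, and term as dummy variables, and map emp_length to integers since its values have a natural order.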

It seems like the purpose and title columns do contain overlapping information but we'll keep the purpose column since it contains a few discrete values. In addition, the title column has data quality issues since many of the values are repeated with slight modifications (e.g. Debt Consolidation and Debt Consolidation Loan and debt consolidation).
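To illustrate the data quality issue, the sketch below takes a few made-up titles modeled on the variants listed above and shows how simple normalization (trimming whitespace and lowercasing) collapses some of the near-duplicates but not all of them. This is only an illustration; we still drop the title column.

```python
import pandas as pd

# Made-up titles modeled on the near-duplicate variants seen in the output.
titles = pd.Series([
    "Debt Consolidation",
    "Debt Consolidation Loan",
    "debt consolidation",
    "Debt consolidation",
    "Personal Loan",
])

# Every spelling counts as its own category in the raw data.
raw_unique = titles.nunique()

# Trimming and lowercasing merges the case variants,
# but "debt consolidation loan" still remains a separate value.
normalized = titles.str.strip().str.lower()
normalized_unique = normalized.nunique()

print(raw_unique, normalized_unique)
```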

Lastly, the addr_state column contains many discrete values, and we'd need to add 49 dummy variable columns to use it for classification. This would make our DataFrame much wider and could slow down training. Let's remove this column from consideration.
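The width cost is easy to see on a toy example. By default, `pd.get_dummies` creates one indicator column per distinct value, so a column with dozens of states turns into dozens of new columns. The rows below are made up.

```python
import pandas as pd

# Made-up state codes; the real column has far more distinct values.
states = pd.DataFrame({"addr_state": ["CA", "NY", "FL", "CA", "TX"]})

# One indicator column is created for each distinct state.
dummies = pd.get_dummies(states["addr_state"], prefix="addr_state")
print(dummies.columns.tolist())
print(dummies.shape)
```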

In [10]:
mapping_dict = {
    "emp_length": {
        "10+ years": 10,
        "9 years": 9,
        "8 years": 8,
        "7 years": 7,
        "6 years": 6,
        "5 years": 5,
        "4 years": 4,
        "3 years": 3,
        "2 years": 2,
        "1 year": 1,
        "< 1 year": 0,
        "n/a": 0
    }
}
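# Drop the two date columns, the high-cardinality addr_state column, and the noisy title column.
df = df.drop(["last_credit_pull_d", "earliest_cr_line", "addr_state", "title"], axis=1)
# Strip the trailing "%" and convert the percentage columns to floats.
df["int_rate"] = df["int_rate"].str.rstrip("%").astype("float")
df["revol_util"] = df["revol_util"].str.rstrip("%").astype("float")
# Map the emp_length labels to integers using the nested dictionary above.
df = df.replace(mapping_dict)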
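As an alternative sketch (not the approach used in the cell above), the same label-to-integer conversion can be written with `Series.map` on the single column. One difference is that `map` turns any label missing from the dictionary into NaN, which makes unexpected values easy to spot. The rows and the `emp_length_map` name are made up for this illustration.

```python
import pandas as pd

# Build the label-to-integer mapping: "2 years" .. "9 years", plus the special cases.
emp_length_map = {f"{n} years": n for n in range(2, 10)}
emp_length_map.update({"10+ years": 10, "1 year": 1, "< 1 year": 0})

# Made-up rows with the same labels as the emp_length column.
toy = pd.DataFrame({"emp_length": ["10+ years", "< 1 year", "3 years", "1 year"]})
toy["emp_length"] = toy["emp_length"].map(emp_length_map)
print(toy["emp_length"].tolist())

# A label that is not in the dictionary becomes NaN instead of passing through.
unmapped = pd.Series(["unknown label"]).map(emp_length_map)
print(unmapped.isna().all())
```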

Let's now encode the home_ownership, verification_status, purpose, and term columns as dummy variables so we can use them in our model.

In [11]:
cat_columns = ["home_ownership", "verification_status", "purpose", "term"]
dummy_df = pd.get_dummies(df[cat_columns])
df = pd.concat([df, dummy_df], axis=1)
df = df.drop(cat_columns, axis=1)
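To see what this cell produces, here is the same sequence of steps on a small made-up DataFrame. The `dtype=int` argument is an addition that is not in the cell above. Recent pandas versions return boolean dummy columns by default, and passing `dtype=int` is one way to get 0/1 integers instead.

```python
import pandas as pd

# Made-up rows with one numeric column and two categorical columns.
toy = pd.DataFrame({
    "loan_amnt": [5000, 2500, 2400],
    "home_ownership": ["RENT", "OWN", "RENT"],
    "term": [" 36 months", " 60 months", " 36 months"],
})

cat_columns = ["home_ownership", "term"]

# Create one 0/1 column per category, named <column>_<value>.
dummy_df = pd.get_dummies(toy[cat_columns], dtype=int)

# Append the dummy columns, then drop the original text columns.
encoded = pd.concat([toy, dummy_df], axis=1).drop(cat_columns, axis=1)
print(encoded.columns.tolist())
```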

In this part, we completed the last of the data preparation needed to start training machine learning models. We converted all of the columns to numerical values because scikit-learn can only work with numeric data. In the next part, we'll experiment with training models and evaluating their accuracy using cross-validation.