# student loan

## background
The company [LendingClub](https://www.lendingclub.com/info/statistics.action) has [data](https://www.dropbox.com/s/0so14yudedjmm5m/LoanStats3d.csv?dl=1) on student loans and factors that might help predict loan status (e.g., current, fully paid, late, in grace period, etc.). Here I will use a Random Forest classifier to predict loan status from the available features.

## load data

In [None]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import ensemble
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

import warnings

warnings.filterwarnings("ignore")

%matplotlib inline

In [None]:
# read data from file
raw = pd.read_csv(
    "https://www.dropbox.com/s/0so14yudedjmm5m/LoanStats3d.csv?dl=1",
    skipinitialspace=True,
    header=1,
)


# show first 5 rows
raw.head()

## model preparation

### data cleaning

#### data types
Let's make sure each column has the correct data type. A text column with very many unique values is probably either numeric data that was read in as strings (e.g., id) or not useful as a feature (e.g., zip code, url).

In [None]:
# number of unique values in each column
categorical = raw.select_dtypes(include=["object"])
for col in categorical.columns:
    print("{}: {}".format(col, categorical[col].nunique()))

In [None]:
# create a copy of the original data
loans = raw.copy()

# convert ID and interest rate to numeric.
loans["id"] = pd.to_numeric(loans["id"], errors="coerce")
loans["int_rate"] = pd.to_numeric(loans["int_rate"].str.strip("%"), errors="coerce")

# drop columns with too many unique values
loans.drop(
    [
        "url",
        "emp_title",
        "zip_code",
        "earliest_cr_line",
        "revol_util",
        "sub_grade",
        "addr_state",
        "desc",
    ],
    axis=1,
    inplace=True,
)

#### missing data
Let's drop rows without any data.

In [None]:
loans.dropna(how="all", inplace=True)

### one hot encoding
Since most models (including Random Forest) only accept numeric values as input, we can use [one hot encoding](https://www.geeksforgeeks.org/ml-one-hot-encoding-of-datasets-in-python/) to convert categorical data ($N$ levels) into numerical data ($N$ columns) before training models.

In [None]:
# target variable
y = loans["loan_status"]

# feature vector
X = loans.drop("loan_status", axis=1)

# one hot encoding
X = pd.get_dummies(X)

# drop columns with missing data
X = X.dropna(axis=1)
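
To make the encoding concrete, here is a minimal sketch on a toy DataFrame. The column names and values are made up for illustration and are not taken from the loan data; the point is only that a categorical column with $N$ levels becomes $N$ indicator columns while numeric columns pass through unchanged.

```python
import pandas as pd

# Toy frame (made-up values): one categorical column with 3 levels, one numeric column.
toy = pd.DataFrame(
    {
        "grade": ["A", "B", "C", "A"],
        "loan_amnt": [1000, 2500, 4000, 1500],
    }
)

# get_dummies expands text columns into indicator columns
# and leaves numeric columns untouched.
encoded = pd.get_dummies(toy)

print(encoded.columns.tolist())
print(encoded.shape)
```

Each row has exactly one of the `grade_*` indicators switched on, which is why a column with thousands of levels (e.g., job titles) would blow up the feature count.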

Since one hot encoding often creates many more columns than before, we often need to follow it with principal component analysis (PCA) or another dimensionality reduction technique to reduce the number of features.

### train/test split
Before performing PCA, let's set aside 20% of the data for testing. We want to split first because PCA depends on the specific data it is fit to; we shouldn't let the testing data influence this step.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

### standardization
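PCA is sensitive to the scale of each feature, so let's standardize the features (zero mean, unit variance) first. The scaler is fit on the training set only and then applied to the test set.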

In [None]:
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
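
The fit-on-train, transform-on-test pattern can be checked on synthetic data. This is a small sketch, assuming a random feature matrix as a stand-in for the loan features (the names `X_demo`, `X_tr`, and `X_te` are made up here). It shows that the scaler's statistics come from the training rows alone.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a feature matrix (assumption: 100 rows, 3 features).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))

X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # mean and std learned from training rows only
X_te_scaled = scaler.transform(X_te)      # test rows reuse the training statistics

# Training columns are centered exactly; test columns only approximately.
print(X_tr_scaled.mean(axis=0).round(6))
print(X_te_scaled.mean(axis=0).round(3))
```

The test-set means are close to zero but not exactly zero, which is expected: the test rows never contributed to the statistics used to scale them.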

## predictive modeling

### PCA
To get a rough estimate of how many components are needed to describe the data, we can plot the *cumulative explained variance* as a function of the number of components selected.

In [None]:
pca = PCA().fit(X_train)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel("number of components")
plt.ylabel("cumulative explained variance");

For the student loan dataset, it seems we need as many as 150 components to describe the data!
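
Instead of reading the number off the plot, the cutoff can also be computed. The sketch below uses synthetic data and an assumed threshold of 95% explained variance; neither the data nor the threshold comes from this notebook, so treat it as one possible way to pick the number of components rather than the rule used above.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (assumption): 10 features driven by 3 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

pca_demo = PCA().fit(X_demo)
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

threshold = 0.95  # assumed cutoff, not a value taken from the notebook

# searchsorted returns the first index where cumulative >= threshold;
# add 1 to turn the zero-based index into a component count.
n_needed = int(np.searchsorted(cumulative, threshold) + 1)
print(n_needed)
```

scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components`, in which case it keeps enough components to explain that fraction of the variance.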

In [None]:
n_components = 150

pca = PCA(n_components=n_components)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

### training and predicting

In [None]:
classifier = RandomForestClassifier(max_depth=2, random_state=0)
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

### evaluate performance

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

cm = confusion_matrix(y_test, y_pred)
print("Confusion matrix: \n{}\n".format(cm))
print(
    "Accuracy is {} with {} components.".format(
        accuracy_score(y_test, y_pred), n_components
    )
)

## doing without payment/outstanding amounts
Predicting loan status feels a bit like cheating when payment or outstanding amounts are known. Let's see if we can still make good predictions without these variables. To do so, we can drop every column whose name contains the substring "pymnt" or "out".

### drop columns

In [None]:
# select columns to drop
to_drop = [col for col in raw.columns if "pymnt" in col or "out" in col]

# drop the selected columns (drop returns a new DataFrame, so raw is left untouched)
loans2 = raw.drop(to_drop, axis=1)
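
Substring matching is broad, so it is worth looking at what it catches before dropping anything. Here is a small sketch on a hand-written list of column names; the names are illustrative, and `checkout_flag` is made up purely to show an accidental match on "out".

```python
# Illustrative column names; "checkout_flag" is a made-up name
# included to show how a short substring can match unintended columns.
columns = [
    "loan_amnt",
    "total_pymnt",
    "last_pymnt_d",
    "out_prncp",
    "int_rate",
    "checkout_flag",
]

# Same filter as in the cell above, applied to the toy list.
to_drop_demo = [col for col in columns if "pymnt" in col or "out" in col]
kept = [col for col in columns if col not in to_drop_demo]

print(to_drop_demo)
print(kept)
```

Printing `to_drop` for the real data in the same way is a quick check that only payment and outstanding-amount columns are being removed.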

### data cleaning

We need to apply the same cleaning steps to the new dataset as we did to the previous one.

In [None]:
# convert ID and interest rate to numeric.
loans2["id"] = pd.to_numeric(loans2["id"], errors="coerce")
loans2["int_rate"] = pd.to_numeric(loans2["int_rate"].str.strip("%"), errors="coerce")

# drop columns with too many unique values
loans2.drop(
    [
        "url",
        "emp_title",
        "zip_code",
        "earliest_cr_line",
        "revol_util",
        "sub_grade",
        "addr_state",
        "desc",
    ],
    axis=1,
    inplace=True,
)

# drop rows without any data
loans2.dropna(how="all", inplace=True)

# new feature vector
X_new = loans2.drop("loan_status", axis=1)

# one hot encoding
X_new = pd.get_dummies(X_new)

# drop columns with missing data
X_new = X_new.dropna(axis=1)
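
Since the same cleaning steps are applied twice, one option is to wrap them in a helper so both datasets are guaranteed to go through identical code. The function name `clean_loans` and the toy data below are hypothetical; this is a sketch of the idea, not the code used to produce the results in this notebook.

```python
import pandas as pd

# Columns dropped in the cleaning cells above (too many unique values).
HIGH_CARDINALITY = [
    "url",
    "emp_title",
    "zip_code",
    "earliest_cr_line",
    "revol_util",
    "sub_grade",
    "addr_state",
    "desc",
]


def clean_loans(df, drop_cols=HIGH_CARDINALITY):
    """Hypothetical helper: apply the notebook's cleaning steps to a copy of df."""
    out = df.copy()
    # convert ID and interest rate to numeric
    out["id"] = pd.to_numeric(out["id"], errors="coerce")
    out["int_rate"] = pd.to_numeric(out["int_rate"].str.strip("%"), errors="coerce")
    # errors="ignore" skips listed columns that are not present in df
    out = out.drop(columns=drop_cols, errors="ignore")
    # drop rows without any data
    return out.dropna(how="all")


# Toy data (made-up values); the last row is entirely empty.
toy = pd.DataFrame(
    {
        "id": ["1", "2", None],
        "int_rate": ["13.5%", "7.2%", None],
        "url": ["a", "b", None],
        "loan_status": ["Current", "Fully Paid", None],
    }
)

cleaned = clean_loans(toy)
print(cleaned)
```

With such a helper, the two versions would differ only in which columns are dropped beforehand, which makes copy-paste slips between the two cleaning cells less likely.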

### train/test split

In [None]:
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X_new, y, test_size=0.2, random_state=0
)

### PCA
Similarly, let's estimate how many components are needed.

In [None]:
pca = PCA().fit(X_train2)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel("number of components")
plt.ylabel("cumulative explained variance");

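This time, we only seem to need about 5 principal components. Keep in mind that, unlike in the first run, this PCA was fit on unstandardized features, so columns with large numeric ranges can dominate the explained variance.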

### training and predicting

In [None]:
classifier = RandomForestClassifier(max_depth=2, random_state=0)
classifier.fit(X_train2, y_train2)

y_pred = classifier.predict(X_test2)

### evaluate performance

In [None]:
cm = confusion_matrix(y_test2, y_pred)
print("Confusion matrix: \n{}\n".format(cm))
# no PCA transform was applied to X_train2/X_test2, so report accuracy alone
print(
    "Accuracy is {} without payment/outstanding columns.".format(
        accuracy_score(y_test2, y_pred)
    )
)