# Machine Learning Engineer Nanodegree

## Capstone Project in Finance

### Predicting Credit Risk

#### Introduction

Many investment banks trade loan-backed securities. When a loan defaults, the securitized product loses value. Banks therefore use risk models to identify loans at risk and to predict which loans might default in the near future. Risk models are also used to decide whether to approve a borrower's loan request.

#### The Data

I will be using data from Lending Club (https://www.lendingclub.com/info/download-data.action). Lending Club is the world’s largest online marketplace connecting borrowers and investors.

The file LoansImputed.csv contains complete loan data for all loans issued through the time period stated. 

In [1]:
#Import Libraries
import pandas as pd

#Read data file
df = pd.read_csv('LoansImputed.csv')
#Print top 5 rows
print(df.head())

   credit.policy             purpose  int.rate  installment  log.annual.inc  \
0              1  debt_consolidation    0.1496       194.02       10.714418   
1              1           all_other    0.1114       131.22       11.002100   
2              1         credit_card    0.1343       678.08       11.884489   
3              1           all_other    0.1059        32.55       10.433822   
4              1      small_business    0.1501       225.37       12.269047   

     dti  fico  days.with.cr.line  revol.bal  revol.util  inq.last.6mths  \
0   4.00   667        3180.041667       3839        76.8               0   
1  11.08   722        5116.000000      24220        68.6               0   
2  10.15   682        4209.958333      41674        74.1               0   
3  14.47   687        1110.000000       4485        36.9               1   
4   6.45   677        6240.000000      56411        75.3               0   

   delinq.2yrs  pub.rec  not.fully.paid  annualincome  
0           

#### Variables in the dataset

##### Dependent Variable

* not.fully.paid: A binary variable. 1 means borrower defaulted and 0 means monthly payments are made on time

##### Independent Variables

* credit.policy: 1 if borrower meets credit underwriting criteria and 0 otherwise
* purpose: The reason for the loan
* int.rate: Interest rate for the loan (14% is stored as 0.14)
* installment: Monthly payment to be made for the loan
* log.annual.inc: Natural log of self reported annual income of the borrower
* dti: Debt to Income ratio of the borrower
* fico: FICO credit score of the borrower
* days.with.cr.line: Number of days borrower has had credit line
* revol.bal: The borrower's revolving balance (Principal loan amount still remaining)
* revol.util: Amount of credit line utilized by borrower as percentage of total available credit
* inq.last.6mths: Number of credit inquiries on the borrower in the last 6 months
* delinq.2yrs: Number of times the borrower was delinquent in the last 2 years
* pub.rec: Number of derogatory public records the borrower has (bankruptcies, tax liens, judgments, etc.)
* annualincome: Annual income of the borrower
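
As a small illustration of the `log.annual.inc` definition, the natural log can be inverted with the exponential function. The sketch below uses the `log.annual.inc` value of the first row printed earlier. The variable names are illustrative, and whether the result equals the `annualincome` column exactly is an assumption, since that column's values are not visible in the printed output.

```python
import math

# log.annual.inc of the first row in the printed head of the data
log_annual_inc = 10.714418

# Inverting the natural log recovers the self-reported annual income
annual_income = math.exp(log_annual_inc)
print(round(annual_income))
```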

In [2]:
#Print statistics of variables
print(df.describe())
print(df[df['not.fully.paid'] == 1].describe())
print(df[df['not.fully.paid'] == 0].describe())
print("Number of loans with more than 70% credit utilization that defaulted: ")
#not.fully.paid == 1 marks a defaulted loan
print(len(df[(df['revol.util'] > 70.00) & (df['not.fully.paid'] == 1)]))

       credit.policy     int.rate  installment  log.annual.inc          dti  \
count    5000.000000  5000.000000  5000.000000     5000.000000  5000.000000   
mean        0.896200     0.120816   308.325968       10.911819    12.308698   
std         0.305031     0.025336   197.307080        0.598897     6.754521   
min         0.000000     0.060000    15.690000        7.600902     0.000000   
25%         1.000000     0.100800   163.550000       10.545341     7.067500   
50%         1.000000     0.121800   260.640000       10.915088    12.300000   
75%         1.000000     0.137900   407.510000       11.277203    17.652500   
max         1.000000     0.216400   926.830000       14.528354    29.960000   

              fico  days.with.cr.line       revol.bal   revol.util  \
count  5000.000000        5000.000000     5000.000000  5000.000000   
mean    710.926000        4510.713433    15872.533200    46.395622   
std      37.026757        2418.553606    31116.319033    29.138604   
min     
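
A single count is hard to interpret without a comparison group. One possible way to compare default rates between high-utilization and other loans is sketched below. It is only an illustration: the `toy` frame, the `util_group` column, and the label values are made up, and only the first five `revol.util` values are taken from the rows printed earlier.

```python
import pandas as pd

# Toy data standing in for LoansImputed.csv (labels are hypothetical)
toy = pd.DataFrame({
    'revol.util': [76.8, 68.6, 74.1, 36.9, 75.3, 20.0],
    'not.fully.paid': [1, 0, 0, 0, 1, 0],
})

# Tag each loan as high or low utilization using the 70% cutoff
toy['util_group'] = ['high' if u > 70.0 else 'low' for u in toy['revol.util']]

# The mean of a 0/1 label within a group is that group's default rate
default_rate = toy.groupby('util_group')['not.fully.paid'].mean()
print(default_rate)
```

On the real data, the same `groupby` would be applied to `df` instead of `toy`.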

In [3]:
#Plot pairwise scatter plots of the numeric features
import matplotlib.pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
from pandas.plotting import scatter_matrix
scatter_matrix(df, alpha=0.2, figsize=(6, 6), diagonal='kde')
plt.show()
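
The scatter matrix shows pairwise relationships visually. A numeric complement is the Pearson correlation matrix from `DataFrame.corr()`. The sketch below is a minimal example on the five rows printed at the top of the notebook, restricted to three columns; the `toy` frame is illustrative, and five rows are far too few to draw conclusions from.

```python
import pandas as pd

# Three columns from the five rows printed by df.head()
toy = pd.DataFrame({
    'int.rate': [0.1496, 0.1114, 0.1343, 0.1059, 0.1501],
    'fico': [667, 722, 682, 687, 677],
    'dti': [4.00, 11.08, 10.15, 14.47, 6.45],
})

# Pairwise Pearson correlation between the numeric columns
corr = toy.corr()
print(corr.round(2))
```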

In [4]:
from sklearn import preprocessing

#Print unique values of purpose
print(df['purpose'].unique())
#Convert purpose to integer category codes
df['purpose'] = pd.Categorical(df['purpose']).codes
print("purpose converted to factors")
print(df.head())

#Extract dependent variable as label
Y = df['not.fully.paid']
#Drop dependent variable from the features
X = df.drop('not.fully.paid', axis=1)

print("not.fully.paid as label")
print(Y.head())
print("not.fully.paid removed from features")
print(X.head())

#purpose is kept as integer codes (not one-hot encoded), so X has 14 columns

#Scale independent variables to zero mean and unit variance
X = preprocessing.scale(X)

#Print first row of X
print(X[0])

['debt_consolidation' 'all_other' 'credit_card' 'small_business'
 'home_improvement' 'educational' 'major_purchase']
purpose converted to factors
   credit.policy  purpose  int.rate  installment  log.annual.inc    dti  fico  \
0              1        2    0.1496       194.02       10.714418   4.00   667   
1              1        0    0.1114       131.22       11.002100  11.08   722   
2              1        1    0.1343       678.08       11.884489  10.15   682   
3              1        0    0.1059        32.55       10.433822  14.47   687   
4              1        6    0.1501       225.37       12.269047   6.45   677   

   days.with.cr.line  revol.bal  revol.util  inq.last.6mths  delinq.2yrs  \
0        3180.041667       3839        76.8               0            0   
1        5116.000000      24220        68.6               0            0   
2        4209.958333      41674        74.1               0            0   
3        1110.000000       4485        36.9               1    
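
The integer codes give `purpose` an arbitrary numeric ordering, which linear models and SVMs may treat as meaningful. One way to one-hot encode the column so that the encoding actually reaches the feature matrix is `pd.get_dummies`. This is a sketch, not the encoding used for the results in this notebook: the `toy` frame and the last two `fico` values are made up, and applying it to `df` would change the number of features from 14 to 20.

```python
import pandas as pd

# The seven purpose values printed above
purposes = ['debt_consolidation', 'all_other', 'credit_card', 'small_business',
            'home_improvement', 'educational', 'major_purchase']
toy = pd.DataFrame({'purpose': purposes,
                    'fico': [667, 722, 682, 687, 677, 700, 710]})

# get_dummies replaces purpose with one indicator column per category
encoded = pd.get_dummies(toy, columns=['purpose'])
print(encoded.columns.tolist())
```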



In [5]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import ExtraTreesClassifier

# Build a forest and compute the feature importances
forest = ExtraTreesClassifier(n_estimators=250, random_state=0)

forest.fit(X, Y)
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(X.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices], color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
plt.show()

Feature ranking:
1. feature 0 (0.215506)
2. feature 2 (0.096978)
3. feature 3 (0.077505)
4. feature 6 (0.071775)
5. feature 9 (0.070654)
6. feature 8 (0.068685)
7. feature 5 (0.067838)
8. feature 7 (0.067469)
9. feature 4 (0.066262)
10. feature 13 (0.063575)
11. feature 10 (0.061159)
12. feature 1 (0.043079)
13. feature 11 (0.019153)
14. feature 12 (0.010361)
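
The ranking is easier to read with column names instead of indices. The sketch below maps the top five indices from the printed ranking back to names, assuming that the columns of `X` keep the order shown by `df.head()` with `not.fully.paid` removed. The `feature_names`, `ranking`, and `named` variables are illustrative.

```python
# Feature columns in the assumed order of X (all columns except not.fully.paid)
feature_names = ['credit.policy', 'purpose', 'int.rate', 'installment',
                 'log.annual.inc', 'dti', 'fico', 'days.with.cr.line',
                 'revol.bal', 'revol.util', 'inq.last.6mths', 'delinq.2yrs',
                 'pub.rec', 'annualincome']

# Top five (index, importance) pairs copied from the printed ranking
ranking = [(0, 0.215506), (2, 0.096978), (3, 0.077505),
           (6, 0.071775), (9, 0.070654)]

# Replace each index with its column name
named = [(feature_names[i], score) for i, score in ranking]
for rank, (name, score) in enumerate(named, start=1):
    print("%d. %s (%f)" % (rank, name, score))
```

Under that assumption, `credit.policy` stands out as the most important feature, followed by `int.rate` and `installment`.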


In [10]:
#Shuffle and split data into 70% in training and 30% in testing
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)
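
If defaults are a minority class, a plain random split can leave the training and testing sets with slightly different default rates. A possible variation, sketched below on synthetic data, passes `stratify` so both sets keep the same class proportions. The `X_demo` and `y_demo` arrays are made up for illustration, and the results in this notebook were not produced with a stratified split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 20% positive labels
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 20 + [0] * 80)

# stratify=y_demo keeps the positive rate the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

print(y_tr.mean(), y_te.mean())
```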

In [11]:
#Using Logistic Regression
from sklearn import metrics
from sklearn import linear_model

#Fit a logistic regression model to the training data
logreg = linear_model.LogisticRegression()
logreg.fit(X_train, Y_train)

print("LogisticRegression accuracy in training set: {0}".format(logreg.score(X_train, Y_train)))

#Predict using the logistic regression model
Y_pred = logreg.predict(X_test)
print("LogisticRegression accuracy in testing set: {0}".format(metrics.accuracy_score(Y_test, Y_pred)))

print("Logistic Regression F1 Score: {0}".format(metrics.f1_score(Y_test, Y_pred)))

LogisticRegression accuracy in training set: 0.806857142857
LogisticRegression accuracy in testing set: 0.787333333333
Logistic Regression F1 Score: 0.530191458027
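
The gap between an accuracy near 0.79 and an F1 score near 0.53 is typical when the positive class is harder to predict than the negative one. F1 is the harmonic mean of precision and recall on the positive (defaulted) class and ignores true negatives. The worked example below uses hypothetical confusion-matrix counts, not the counts from this model.

```python
# Hypothetical confusion-matrix counts for the positive (defaulted) class
tp, fp, fn, tn = 30, 10, 40, 120

precision = tp / float(tp + fp)   # share of predicted defaults that were real
recall = tp / float(tp + fn)      # share of real defaults that were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / float(tp + fp + fn + tn)

print("accuracy=%.3f precision=%.3f recall=%.3f f1=%.3f"
      % (accuracy, precision, recall, f1))
```

Here many true negatives keep accuracy at 0.75 while the low recall pulls F1 down to about 0.55.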


In [12]:
#Using Support Vector Machine
from sklearn import metrics
from sklearn import svm

#Fit a support vector classifier to the training data
clm = svm.SVC(probability=True)
clm.fit(X_train, Y_train)

print("SVM accuracy in training set: {0}".format(clm.score(X_train, Y_train)))

#Predict using the SVM model
Y_pred = clm.predict(X_test)

#Accuracy score
print("SVM accuracy in testing set: {0}".format(metrics.accuracy_score(Y_test, Y_pred)))
#F1 score
print("SVM F1 score in testing set: {0}".format(metrics.f1_score(Y_test, Y_pred)))

SVM accuracy in training set: 0.810571428571
SVM accuracy in testing set: 0.790666666667
SVM F1 score in testing set: 0.516923076923
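
SVMs are sensitive to feature scale. In this notebook `preprocessing.scale` is applied to all rows before the split, so the test rows influence the scaling statistics. A common alternative, sketched below on synthetic data, wraps the scaler and the classifier in a pipeline so the scaler is fitted on training rows only. The data and the names `X_demo`, `y_demo`, and `svm_pipeline` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with features on very different scales
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 4)) * [1.0, 10.0, 100.0, 1000.0]
y_demo = (X_demo[:, 0] + X_demo[:, 1] / 10.0 > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)

# fit() scales using training rows only; score() reuses that scaling on test rows
svm_pipeline = make_pipeline(StandardScaler(), SVC())
svm_pipeline.fit(X_tr, y_tr)
test_accuracy = svm_pipeline.score(X_te, y_te)
print("test accuracy: {0}".format(test_accuracy))
```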


In [13]:
from sklearn import metrics
from sklearn.ensemble import ExtraTreesClassifier

#Fit an Extra Trees model to the training data
model = ExtraTreesClassifier()
model.fit(X_train, Y_train)

print("ExtraTreeClassifier accuracy in training set: {0}".format(model.score(X_train, Y_train)))

#Predict using the Extra Trees model
Y_pred = model.predict(X_test)

print("ExtraTreeClassifier accuracy in testing set: {0}".format(metrics.accuracy_score(Y_test, Y_pred)))
#ROC AUC computed from the hard 0/1 predictions
print("ExtraTreeClassifier ROC AUC in testing set: {0}".format(metrics.roc_auc_score(Y_test, Y_pred)))

ExtraTreeClassifier accuracy in training set: 1.0
ExtraTreeClassifier accuracy in testing set: 0.781333333333
ExtraTreeClassifier ROC AUC in testing set: 0.67818627451
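
A training accuracy of 1.0 next to a testing accuracy of about 0.78 suggests that the fully grown trees memorize the training set. Cross-validation gives a steadier estimate of out-of-sample accuracy than a single split. The sketch below shows the idea on synthetic data with a noisy label; the data, `n_estimators=50`, and `cv=5` are illustrative choices, not settings used in this notebook.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data: the label depends on one feature plus noise
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(300, 5))
y_demo = (X_demo[:, 0] + rng.normal(scale=1.0, size=300) > 0).astype(int)

model_demo = ExtraTreesClassifier(n_estimators=50, random_state=0)

# Accuracy on the same rows the model was fitted on
train_accuracy = model_demo.fit(X_demo, y_demo).score(X_demo, y_demo)

# Accuracy on held-out folds, averaged over 5 splits
cv_scores = cross_val_score(model_demo, X_demo, y_demo, cv=5)

print("training accuracy: {0}".format(train_accuracy))
print("5-fold CV accuracy: {0}".format(cv_scores.mean()))
```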


In [11]:
from sklearn.metrics import roc_curve, auc

#Y_pred holds hard 0/1 predictions, so the curve has a single interior point
false_positive_rate, true_positive_rate, thresholds = roc_curve(Y_test, Y_pred)
roc_auc = auc(false_positive_rate, true_positive_rate)

plt.title('ROC Curve')
plt.plot(false_positive_rate, true_positive_rate, 'b',
         label='AUC = %0.2f' % roc_auc)
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([-0.1, 1.2])
plt.ylim([-0.1, 1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
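
An ROC curve is usually drawn from continuous scores rather than hard class labels, because each distinct score acts as a threshold. With 0/1 predictions the curve collapses to one operating point joined to the corners. The sketch below contrasts the two on synthetic data using `predict_proba`; the data and variable names are made up, and the figures it prints are unrelated to the results in this notebook.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic data: the label depends on one feature plus noise
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(300, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=1.0, size=300) > 0).astype(int)

clf_demo = ExtraTreesClassifier(n_estimators=50, random_state=0)
clf_demo.fit(X_demo[:200], y_demo[:200])

# Column 1 of predict_proba is the estimated probability of class 1
scores = clf_demo.predict_proba(X_demo[200:])[:, 1]
hard_labels = clf_demo.predict(X_demo[200:])

fpr_scores, tpr_scores, _ = roc_curve(y_demo[200:], scores)
fpr_labels, tpr_labels, _ = roc_curve(y_demo[200:], hard_labels)

print("ROC points from scores: {0}".format(len(fpr_scores)))
print("ROC points from hard labels: {0}".format(len(fpr_labels)))
print("AUC from scores: {0}".format(roc_auc_score(y_demo[200:], scores)))
```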

#### Conclusion
On the test set, the three models reach accuracies between 78% and 79% (logistic regression 78.7%, SVM 79.1%, Extra Trees 78.1%). Such a model can be used to flag loans that are at risk. It can support the decision whether to approve a loan, and it allows proactive action on existing loans that are likely to default.