# Machine Learning Engineer Nanodegree

## Capstone Project in Finance

### Predicting Credit Risk

#### Introduction

Many investment banks trade loan-backed securities. When an underlying loan defaults, the securitized product loses value. Banks therefore use risk models to identify loans at risk and to predict which loans might default in the near future. Risk models are also used to decide whether to approve a borrower's loan request.

#### The Data

I will be using data from Lending Club (https://www.lendingclub.com/info/download-data.action). Lending Club is the world’s largest online marketplace connecting borrowers and investors.

The file LoansImputed.csv contains complete loan data for all loans issued through the time period stated. 

In [1]:
#Import libraries
import pandas as pd

#Read the data file
df = pd.read_csv('LoansImputed.csv')
#Print the first 5 rows
print(df.head())

   credit.policy             purpose  int.rate  installment  log.annual.inc  \
0              1  debt_consolidation    0.1496       194.02       10.714418   
1              1           all_other    0.1114       131.22       11.002100   
2              1         credit_card    0.1343       678.08       11.884489   
3              1           all_other    0.1059        32.55       10.433822   
4              1      small_business    0.1501       225.37       12.269047   

     dti  fico  days.with.cr.line  revol.bal  revol.util  inq.last.6mths  \
0   4.00   667        3180.041667       3839        76.8               0   
1  11.08   722        5116.000000      24220        68.6               0   
2  10.15   682        4209.958333      41674        74.1               0   
3  14.47   687        1110.000000       4485        36.9               1   
4   6.45   677        6240.000000      56411        75.3               0   

   delinq.2yrs  pub.rec  not.fully.paid  annualincome  
0           

#### Variables in the dataset

##### Dependent Variable

* not.fully.paid: A binary variable. 1 means borrower defaulted and 0 means monthly payments are made on time

##### Independent Variables

* credit.policy: 1 if borrower meets credit underwriting criteria and 0 otherwise
* purpose: The reason for the loan
* int.rate: Interest rate for the loan (14% is stored as 0.14)
* installment: Monthly payment to be made for the loan
* log.annual.inc: Natural log of the borrower's self-reported annual income
* dti: Debt-to-income ratio of the borrower
* fico: FICO credit score of the borrower
* days.with.cr.line: Number of days the borrower has had a credit line
* revol.bal: The borrower's revolving balance (Principal loan amount still remaining)
* revol.util: Amount of the credit line used by the borrower, as a percentage of total available credit
* inq.last.6mths: Number of credit inquiries on the borrower in the last 6 months
* delinq.2yrs: Number of times the borrower was delinquent in the last 2 years
* pub.rec: Number of derogatory public records the borrower has (bankruptcies, tax liens, judgments, etc.)
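
Because `log.annual.inc` stores the natural log of income, exponentiating it recovers the amount in its original units. The sketch below does this for the first two values shown in the `df.head()` output; the helper name `income_from_log` is hypothetical and not part of the original notebook.

```python
import math

def income_from_log(log_income):
    """Invert the natural-log transform applied to annual income."""
    return math.exp(log_income)

# First two log.annual.inc values from the df.head() output above
for log_income in (10.714418, 11.002100):
    print(round(income_from_log(log_income)))
```

The two values come out at roughly 45,000 and 60,000, which is a quick sanity check that the column really is a natural log rather than a base-10 log.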

In [2]:
# Convert purpose to category
df['purpose'] = pd.Categorical(df['purpose']).codes
print("purpose converted to factors")
print(df.head())

#extract dependent variable as label
Y = df['not.fully.paid']
X = df.drop('not.fully.paid', axis=1)
print("not.fully.paid as label")
print(Y.head())
print("not.fully.paid removed from features")
print(X.head())

purpose converted to factors
   credit.policy  purpose  int.rate  installment  log.annual.inc    dti  fico  \
0              1        2    0.1496       194.02       10.714418   4.00   667   
1              1        0    0.1114       131.22       11.002100  11.08   722   
2              1        1    0.1343       678.08       11.884489  10.15   682   
3              1        0    0.1059        32.55       10.433822  14.47   687   
4              1        6    0.1501       225.37       12.269047   6.45   677   

   days.with.cr.line  revol.bal  revol.util  inq.last.6mths  delinq.2yrs  \
0        3180.041667       3839        76.8               0            0   
1        5116.000000      24220        68.6               0            0   
2        4209.958333      41674        74.1               0            0   
3        1110.000000       4485        36.9               1            0   
4        6240.000000      56411        75.3               0            0   

   pub.rec  not.fully.paid 
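
The integer codes follow the alphabetical order of the category labels, which is why `all_other` maps to 0 and `debt_consolidation` to 2 in the output above. The plain-Python sketch below mimics that encoding on the five `purpose` values from `df.head()`. The function name `encode_categories` is hypothetical, and because only four distinct labels appear in this small sample, `small_business` gets code 3 here rather than the 6 it receives in the full dataset.

```python
def encode_categories(values):
    """Map each label to its index in the sorted list of distinct labels."""
    categories = sorted(set(values))
    lookup = {label: code for code, label in enumerate(categories)}
    return [lookup[v] for v in values], categories

# The five purpose values from the first df.head() output
purposes = ["debt_consolidation", "all_other", "credit_card",
            "all_other", "small_business"]
codes, categories = encode_categories(purposes)
print(codes)
print(categories)
```

Keeping `categories` around makes it possible to translate a code back to its label. The codes are identifiers only: their numeric order carries no meaning about the loans.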

In [6]:
#Select the 4 features with the highest chi-squared scores
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
print("Shape of X: ")
print(X.shape)
X_imp = SelectKBest(chi2, k=4).fit_transform(X, Y)
print("New shape of X")
print(X_imp.shape)

Shape of X: 
(5000, 14)
New shape of X
(5000L, 4L)
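
To make the selection step less of a black box, here is a from-scratch sketch of a chi-squared score for non-negative features, followed by picking the top `k`. It is an illustration of the idea under my own assumptions, not scikit-learn's code: the names `chi2_scores` and `top_k` and the toy data are hypothetical, and the sketch assumes every feature has a non-zero total.

```python
def chi2_scores(X, y):
    """Score each feature by how far its per-class totals deviate from
    the totals expected if the feature were independent of the class."""
    n_samples = len(X)
    classes = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        feature_total = sum(row[j] for row in X)
        score = 0.0
        for c in classes:
            class_prob = sum(1 for label in y if label == c) / n_samples
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            expected = class_prob * feature_total  # assumed non-zero
            score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores

def top_k(scores, k):
    """Indices of the k highest-scoring features."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]

# Toy data: feature 0 is identical in both classes, feature 1 is not
X_toy = [[1, 0], [1, 0], [1, 4], [1, 4]]
y_toy = [0, 0, 1, 1]
scores = chi2_scores(X_toy, y_toy)
print(scores)
print(top_k(scores, 1))
```

Since the score is built from sums of raw feature values, it grows with the scale of a feature, so columns measured in large units can rank higher for that reason alone.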


In [7]:
#Shuffle and split data into 70% in training and 30% in testing
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X_imp, Y, test_size=0.3, random_state=42)
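
As a rough picture of what a shuffled 70/30 split does, the sketch below shuffles row indices with a fixed seed and cuts them into two disjoint groups. The function `shuffle_split` is a hypothetical helper for illustration; its shuffle order will not match the one `train_test_split` produces for `random_state=42`.

```python
import random

def shuffle_split(n_rows, test_size=0.3, seed=42):
    """Return (train_indices, test_indices) after a seeded shuffle."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_size))
    return indices[n_test:], indices[:n_test]

# 5000 rows, matching the shape printed for X
train_idx, test_idx = shuffle_split(5000)
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the split reproducible, so the scores reported later refer to the same held-out rows on every run.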

In [8]:
#Mean squared error as the performance metric; for 0/1 labels it equals the misclassification rate
from sklearn.metrics import mean_squared_error
def performance_metric(y_true, y_predict):
    error = mean_squared_error(y_true, y_predict)
    return error
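
For a binary label, each squared error is either 0 or 1, so the mean squared error of hard predictions is simply the fraction of misclassified rows, that is, one minus the accuracy. A small self-contained check on made-up labels (the function names and the data are hypothetical):

```python
def mse(y_true, y_pred):
    """Mean of squared differences between labels and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Toy labels: one of four predictions is wrong
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]
print(mse(y_true, y_pred), accuracy(y_true, y_pred))
```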

In [10]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn import metrics

#Model using Logistic Regression
logreg = linear_model.LogisticRegression()
logreg.fit(X_train, Y_train)
#Accuracy on the training set
print("Model Accuracy:")
print(logreg.score(X_train, Y_train))

#Predict labels for the test set using the logistic regression model
Y_pred = logreg.predict(X_test)
#Predicted class probabilities; column 1 is the probability of not.fully.paid = 1
Y_prob = logreg.predict_proba(X_test)

#Accuracy on the test set
print(metrics.accuracy_score(Y_test, Y_pred))
#AUC on the test set
print(metrics.roc_auc_score(Y_test, Y_prob[:,1]))

Model Accuracy:
0.704285714286
0.69
0.572142565359
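
The three numbers above are, in order, the training accuracy, the test accuracy, and the test AUC. AUC can be read as the probability that a randomly chosen defaulted loan receives a higher predicted probability than a randomly chosen fully paid loan, so 0.5 corresponds to an uninformative ranking. The sketch below computes it by direct pairwise comparison; `roc_auc` is a hypothetical helper and the labels and scores are toy values, not data from the notebook.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly,
    counting ties as half."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example: one of the four positive/negative pairs is ranked wrongly
labels = [0, 0, 1, 1]
probs = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, probs))
```

The pairwise loop is quadratic in the number of rows, which is fine for a demonstration but slow on large test sets.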


#### Conclusion
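The logistic regression model reaches about 70% accuracy on the training set and 69% on the held-out test set. Its test AUC of 0.57 is only modestly above the 0.5 of a random ranking, so its ability to separate defaulting loans from fully paid ones is limited. With further improvement, a model of this kind can be used to flag loans at risk, to inform the decision on whether to approve a loan, and to prompt proactive action on existing loans that are likely to default.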