# Credit Risk Modeling

### What is credit risk?

A credit risk is the **risk of default** on a debt that may arise from a borrower failing to make required payments. In the first resort, the risk is that of the lender and includes lost principal and interest, disruption to cash flows, and increased collection costs. The loss may be complete or partial. In an efficient market, higher levels of credit risk will be associated with higher borrowing costs.

Losses can arise in a number of circumstances. For example:

- A consumer may fail to make a payment due on a mortgage loan, credit card, line of credit, or other loan.
- A company is unable to repay asset-secured fixed or floating charge debt.
- A business or consumer does not pay a trade invoice when due.
- A business does not pay an employee's earned wages when due.
- A business or government bond issuer does not make a payment on a coupon or principal payment when due.
- An insolvent insurance company does not pay a policy obligation.
- An insolvent bank won't return funds to a depositor.
- A government grants bankruptcy protection to an insolvent consumer or business.

To reduce credit risk, the lender may:

- perform a credit check on the prospective borrower,
- require the borrower to take out appropriate insurance, such as mortgage insurance, or
- seek security over some assets of the borrower or a guarantee from a third party.

The lender can also take out insurance against the risk or on-sell the debt to another company. In general, the higher the risk, the higher the interest rate the debtor will be asked to pay on the debt. Credit risk mainly arises when borrowers fail to pay what is due, whether willingly or unwillingly.

To reiterate:

- Credit risk is the risk that a borrower may not repay a loan, so that the lender loses the principal of the loan or the interest associated with it (e.g., failure to pay back mortgages, credit cards, or personal loans).

- Higher credit risk equates to higher borrowing costs.

For a lender (e.g., a bank), it is of paramount importance to assess the credit risk of the borrower.


Sources: [Wikipedia](https://en.wikipedia.org/wiki/Credit_risk), [Investopedia](http://www.investopedia.com/terms/c/creditrisk.asp), [DataCamp course using R](https://www.datacamp.com/courses/introduction-to-credit-risk-modeling-in-r)

# Problem Statement

XYZ Bank wants to build a credit risk model for its personal loan division. XYZ has issued personal loans before and holds historical data on those customers, including whether each loan defaulted. Where a loan defaulted, the amount defaulted was recorded.

It wants you to build a credit risk model for its bankers.

The bank will use the model to decide whether or not to grant a loan to each applicant.

# Approach

A very typical approach is to estimate the following two metrics for every applicant:

`Probability of Default`: the probability that the applicant will default if given the loan.

`Loss Given Default`: the loss the bank will incur if the applicant defaults on the loan.

For the purpose of this model, Bank XYZ has asked you to ignore cash flows from interest and other costs incurred by the bank for processing and maintaining the loan, and to use only the data provided in the dataset.
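One way the two metrics might be combined is as an expected loss per applicant: the probability of default multiplied by the loss incurred if the default happens. The sketch below only illustrates that arithmetic. The function name `expected_loss` and the sample numbers are hypothetical and are not part of the bank's brief.

```python
def expected_loss(prob_default: float, loss_given_default: float) -> float:
    """Expected loss = probability of default x loss amount if default occurs."""
    if not 0.0 <= prob_default <= 1.0:
        raise ValueError("prob_default must lie in [0, 1]")
    return prob_default * loss_given_default


# Hypothetical applicant: 25% chance of default, 4000 lost if they default.
el = expected_loss(0.25, 4000.0)
print(el)
```

A banker could compare this figure across applicants, although the decision rule itself is left open here.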

# Data

You are provided with the following data:

**train_creditRisk.csv**  
This is the historical data that the bank has provided. It has the following columns:

`loan_amnt`: Amount of the loan provided to the applicant  
`int_rate`: Interest rate charged to the applicant  
`grade`: Employment Grade of the applicant  
`emp_length`: Number of years the applicant has been employed  
`home_ownership`: Whether the applicant owns a house or not  
`annual_inc`: Annual income of the applicant  
`age`: Age of the applicant  
`default`: Whether the applicant has defaulted or not (target variable)  
`default_amount`: Loan amount that was defaulted (target variable)

**testFeatures_creditRisk.csv**  
This dataset has the features of the applicants that you need to score and provide *probability of default* and *loss given default* metrics. It has the same set of columns as the train dataset, except the `default` and `default_amount` columns.


**testLabels_creditRisk.csv**  
This has the target columns of the test dataset. Once the model has been finalized, its accuracy is measured by comparing the predictions against the actual values.
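As a rough illustration of comparing predictions against the actual labels, the sketch below scores a handful of made-up predictions. The arrays are hypothetical, and the choice of metrics (share of correct default flags, mean absolute error on the defaulted amount) is an assumption rather than something the bank has specified.

```python
import numpy as np

# Hypothetical actual labels and model predictions for five applicants.
actual_default = np.array([0, 1, 0, 1, 0])
predicted_default = np.array([0, 1, 1, 1, 0])

actual_amount = np.array([0.0, 4560.0, 0.0, 3000.0, 0.0])
predicted_amount = np.array([0.0, 4000.0, 500.0, 3500.0, 0.0])

# Share of applicants whose default flag was predicted correctly.
accuracy = (actual_default == predicted_default).mean()

# Average absolute gap between predicted and actual defaulted amounts.
mae = np.abs(actual_amount - predicted_amount).mean()

print(accuracy, mae)
```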

# Loading the data

In [1]:
import pandas as pd

In [2]:
#Load the training dataset
train = pd.read_csv("data/train_creditRisk.csv")

In [3]:
#View the first few rows of train
train.head()

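| | loan_amnt | int_rate | grade | emp_length | home_ownership | annual_inc | age | default | default_amount |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 5750 | NaN | D | 1.0 | OTHER | 16000.0 | 27 | 0 | 0 |
| 1 | 7500 | 12.22 | C | 8.0 | MORTGAGE | 55000.0 | 27 | 1 | 4560 |
| 2 | 19000 | 11.99 | B | 4.0 | RENT | 47000.0 | 25 | 0 | 0 |
| 3 | 24000 | 9.91 | B | 3.0 | MORTGAGE | 50400.0 | 26 | 0 | 0 |
| 4 | 25000 | 10.62 | B | 8.0 | MORTGAGE | 156600.0 | 32 | 0 | 0 |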


In [4]:
#View the columns of the train dataset
train.columns

Index(['loan_amnt', 'int_rate', 'grade', 'emp_length', 'home_ownership',
       'annual_inc', 'age', 'default', 'default_amount'],
      dtype='object')

In [5]:
#View the data types of the train dataset
train.dtypes

loan_amnt           int64
int_rate          float64
grade              object
emp_length        float64
home_ownership     object
annual_inc        float64
age                 int64
default             int64
default_amount      int64
dtype: object

In [6]:
#View summary of train 
train.describe()

| | loan_amnt | int_rate | emp_length | annual_inc | age | default | default_amount |
|---|---|---|---|---|---|---|---|
| count | 21819.0 | 19711.0 | 21212.0 | 21819.0 | 21819.0 | 21819.0 | 21819.0 |
| mean | 9633.113342 | 10.984877 | 6.176221 | 67501.12 | 27.704753 | 0.408589 | 1983.88125 |
| std | 6324.932999 | 3.235739 | 6.680542 | 67576.26 | 6.284862 | 0.491584 | 3217.155453 |
| min | 500.0 | 5.42 | 0.0 | 4000.0 | 20.0 | 0.0 | 0.0 |
| 25% | 5000.0 | 7.9 | 2.0 | 40000.0 | 23.0 | 0.0 | 0.0 |
| 50% | 8000.0 | 10.99 | 4.0 | 57000.0 | 26.0 | 0.0 | 0.0 |
| 75% | 12500.0 | 13.47 | 8.0 | 80000.0 | 30.0 | 1.0 | 3240.0 |
| max | 35000.0 | 23.22 | 62.0 | 6000000.0 | 144.0 | 1.0 | 25146.0 |


# Data Pre-processing

We will perform the following two pre-processing steps before building our first set of models.

1. handle missing values  
2. handle categorical variables

#### Missing values

In [8]:
#Check whether train has missing values
#using the isnull() function
train.isnull().head()

| | loan_amnt | int_rate | grade | emp_length | home_ownership | annual_inc | age | default | default_amount |
|---|---|---|---|---|---|---|---|---|---|
| 0 | False | True | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False |


In [9]:
#The output above is hard to read:
#it reports, element by element, whether each value is null.
#What we want first is to know whether each column has any missing values.
#Let's find that out
train.isnull().any()

loan_amnt         False
int_rate           True
grade             False
emp_length         True
home_ownership    False
annual_inc        False
age               False
default           False
default_amount    False
dtype: bool

In [19]:
#One thing to check is how many observations are missing in each of
#the columns that have missing values.
#If a column has too many missing values, it might make sense
#to drop the column.
#Let's see how many missing values are present.
#Run this before the fillna step below; afterwards every count is 0.
#The counts equal the 21819 rows minus the non-null counts from describe().
train.isnull().sum()

loan_amnt            0
int_rate          2108
grade                0
emp_length         607
home_ownership       0
annual_inc           0
age                  0
default              0
default_amount       0
dtype: int64
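Missing-value counts are easier to weigh against the drop-or-keep question when expressed as a share of rows. A minimal sketch on a toy frame follows; the frame `toy` and its values are illustrative stand-ins, not the bank's data.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in two numeric columns (illustrative values only).
toy = pd.DataFrame({
    "loan_amnt": [5750, 7500, 19000, 24000],
    "int_rate": [np.nan, 12.22, 11.99, np.nan],
    "emp_length": [1.0, np.nan, 4.0, 3.0],
})

missing_count = toy.isnull().sum()    # number of NaNs per column
missing_share = toy.isnull().mean()   # fraction of NaNs per column

print(pd.DataFrame({"count": missing_count, "share": missing_share}))
```

The same two calls apply unchanged to `train`.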

In [10]:
#So, we see that two columns have missing values: int_rate and emp_length
#Both columns are numeric.
#Let's replace the missing values with the mean of each column

In [12]:
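#numeric_only=True skips the text columns (grade, home_ownership);
#recent pandas versions raise an error on them otherwise
train.mean(numeric_only=True)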

loan_amnt          9633.113342
int_rate             10.984877
emp_length            6.176221
annual_inc        67501.115503
age                  27.704753
default               0.408589
default_amount     1983.881250
dtype: float64

In [17]:
#fillna() fills each numeric column's missing values with that column's mean
train = train.fillna(train.mean(numeric_only=True))

In [18]:
#Now, let's check if train has missing values or not
train.isnull().any()

loan_amnt         False
int_rate          False
grade             False
emp_length        False
home_ownership    False
annual_inc        False
age               False
default           False
default_amount    False
dtype: bool
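When the test features are scored later, they will need the same treatment. One reasonable approach, sketched below as an assumption rather than as the notebook's own method, is to compute the fill values once from the training data and reuse them on the test data. The frames `train_toy` and `test_toy` are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the train and test feature frames.
train_toy = pd.DataFrame({"int_rate": [10.0, np.nan, 14.0],
                          "emp_length": [2.0, 4.0, np.nan],
                          "grade": ["B", "C", "D"]})
test_toy = pd.DataFrame({"int_rate": [np.nan, 9.0],
                         "emp_length": [np.nan, 1.0],
                         "grade": ["A", "B"]})

# Compute the column means once, from the training data only.
train_means = train_toy.mean(numeric_only=True)

# Apply the same fill values to both frames (matched by column name).
train_filled = train_toy.fillna(train_means)
test_filled = test_toy.fillna(train_means)

print(test_filled)
```

Reusing the training means keeps the test data from influencing the pre-processing.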

#### Categorical variables

In [20]:
#We will use sklearn's LabelEncoder class

from sklearn.preprocessing import LabelEncoder

In [None]:
#Replace grade and home_ownership with integer codes using LabelEncoder

In [25]:
#apply() calls fit_transform once per column, so each column gets its own mapping
train[["grade", "home_ownership"]] = train[["grade", "home_ownership"]].apply(LabelEncoder().fit_transform)

In [26]:
#Look at train to check if it worked fine
train.head()

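| | loan_amnt | int_rate | grade | emp_length | home_ownership | annual_inc | age | default | default_amount |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 5750 | 10.984877 | 3 | 1.0 | 1 | 16000.0 | 27 | 0 | 0 |
| 1 | 7500 | 12.22 | 2 | 8.0 | 0 | 55000.0 | 27 | 1 | 4560 |
| 2 | 19000 | 11.99 | 1 | 4.0 | 3 | 47000.0 | 25 | 0 | 0 |
| 3 | 24000 | 9.91 | 1 | 3.0 | 0 | 50400.0 | 26 | 0 | 0 |
| 4 | 25000 | 10.62 | 1 | 8.0 | 0 | 156600.0 | 32 | 0 | 0 |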


In [27]:
#Check the data types
train.dtypes

loan_amnt           int64
int_rate          float64
grade               int64
emp_length        float64
home_ownership      int64
annual_inc        float64
age                 int64
default             int64
default_amount      int64
dtype: object
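The one-line encoding above does not keep the fitted encoders, so the category-to-integer mapping cannot be reapplied later. If the test features need the same codes, one option is to keep one `LabelEncoder` per column, as sketched below. This is an assumption about how the step could be extended; `train_toy`, `test_toy` and the `encoders` dictionary are hypothetical names.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-ins for the train and test feature frames.
train_toy = pd.DataFrame({
    "grade": ["D", "C", "B", "B"],
    "home_ownership": ["OTHER", "MORTGAGE", "RENT", "MORTGAGE"],
})
test_toy = pd.DataFrame({
    "grade": ["B", "D"],
    "home_ownership": ["RENT", "OTHER"],
})

encoders = {}
for col in ["grade", "home_ownership"]:
    enc = LabelEncoder()
    train_toy[col] = enc.fit_transform(train_toy[col])  # learn the mapping on train
    test_toy[col] = enc.transform(test_toy[col])        # reuse the mapping on test
    encoders[col] = enc

print(list(encoders["grade"].classes_))
print(test_toy)
```

`LabelEncoder` assigns codes in sorted order of the categories it sees, so the codes in this toy example differ from those in the notebook output. `transform` raises a `ValueError` for a category that was not present during fitting.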

In [None]:
#We are good to build our first model