# Predicting Loan Repayment

In [1]:
# Import libraries.
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from sklearn import metrics

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

## Problem 1.1 - Preparing the Dataset

Load the dataset loans.csv into a data frame called loans.
What proportion of the loans in the dataset were not paid in full? Please input a number between 0 and 1.

In [2]:
loans = pd.read_csv('../data/loans.csv')
loans_imputed = pd.read_csv('../data/loans_imputed.csv')
loans['not.fully.paid'].mean()

0.16005429108373356

## Problem 1.2 - Preparing the Dataset
Which of the following variables has at least one missing observation? *Select all that apply*.
- log.annual.inc
- days.with.cr.line
- revol.util
- inq.last.6mths
- delinq.2yrs
- pub.rec

In [3]:
loans.isnull().sum()

credit.policy         0
purpose               0
int.rate              0
installment           0
log.annual.inc        4
dti                   0
fico                  0
days.with.cr.line    29
revol.bal             0
revol.util           62
inq.last.6mths       29
delinq.2yrs          29
pub.rec              29
not.fully.paid        0
dtype: int64
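
As a small illustration of reading the counts above, the sketch below filters a missing-value count down to the columns that have at least one gap. The `toy` frame and its values are hypothetical stand-ins, not rows from loans.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for loans.csv (values are made up)
toy = pd.DataFrame({
    "fico": [700, 710, 690, 720],
    "revol.util": [52.1, np.nan, 30.0, np.nan],
    "pub.rec": [0.0, 1.0, np.nan, 0.0],
})

# Count missing values per column, as loans.isnull().sum() does above
missing_counts = toy.isnull().sum()

# Keep only the columns with at least one missing value
cols_with_missing = missing_counts[missing_counts > 0].index.tolist()
print(cols_with_missing)
```

Applied to the real data frame, the same filter would list only the variables with a nonzero count in the output above.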

## Problem 1.3 - Preparing the Dataset
Which of the following is the best reason to fill in the missing values for these variables instead of removing observations with missing data?
- We want to be able to predict risk for all borrowers, instead of just the ones with all data reported.

In [4]:
loans[loans.isnull().any(axis=1)]['not.fully.paid'].mean()

0.1935483870967742

## Problem 1.4 - Preparing the Dataset
For the rest of this problem, we'll be using a revised version of the dataset that has the missing values filled in with multiple imputation (which was discussed in the Recitation of this Unit). To ensure everybody has the same data frame going forward, you can download and load into Python the dataset we created after running the imputation: loans_imputed.csv.

What best describes the process we just used to handle missing values?
- We predicted missing variable values using the available independent variables for each observation.

## Problem 2.1 - Prediction Models
Now that we have prepared the dataset, we need to split it into a training and testing set. To ensure everybody obtains the same split, set the random seed to 144 (even though you already did so earlier in the problem) and use the sample.split function to select 70% of the observations for the training set (the Python cell below uses train_test_split with train_size=0.7 and random_state=144 instead). Name the data frames train and test.

Now, use logistic regression trained on the training set to predict the dependent variable not.fully.paid using all the independent variables.

Which independent variables are significant in our model? *Select all that apply*.
- credit.policy
- purpose2 (credit card)
- purpose3 (debt consolidation)
- purpose6 (major purchase)
- purpose7 (small business)
- installment
- log.annual.inc
- fico
- revol.bal
- inq.last.6mths
- pub.rec

In [5]:
purpose_class = [
    'credit_card', 'debt_consolidation', 'educational',
    'home_improvement', 'major_purchase', 'small_business'
]

for purpose_ in purpose_class:
    loans_imputed[purpose_] = (loans_imputed['purpose']==purpose_).astype(int)
loans_imputed.drop('purpose', axis=1, inplace=True)


X = loans_imputed.drop('not.fully.paid', axis=1).copy()
y = loans_imputed['not.fully.paid'].copy()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=144
)

model1 = sm.Logit(y_train, sm.add_constant(X_train)).fit()
print(model1.summary())

Optimization terminated successfully.
         Current function value: 0.414813
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:         not.fully.paid   No. Observations:                 6704
Model:                          Logit   Df Residuals:                     6685
Method:                           MLE   Df Model:                           18
Date:                Fri, 20 Aug 2021   Pseudo R-squ.:                 0.06868
Time:                        17:41:54   Log-Likelihood:                -2780.9
converged:                       True   LL-Null:                       -2986.0
Covariance Type:            nonrobust   LLR p-value:                 6.964e-76
                         coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------
const                  8.6816      1.536      5.653      0.000       5.672      11.692
credi
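
The loop in the cell above builds one 0/1 column per purpose level and leaves one level out as the reference category. A possible shorthand is `pd.get_dummies` with `drop_first=True`, sketched below on a hypothetical toy frame. The level names, including `all_other` as the omitted reference level, are assumptions made for the illustration.

```python
import pandas as pd

# Hypothetical toy frame; "all_other" is assumed to be the reference level
toy = pd.DataFrame({
    "purpose": ["all_other", "credit_card", "small_business", "credit_card"],
    "fico": [700, 710, 690, 720],
})

# Manual encoding, mirroring the loop in the notebook cell
manual = toy.copy()
for level in ["credit_card", "small_business"]:
    manual[level] = (manual["purpose"] == level).astype(int)
manual = manual.drop(columns="purpose")

# Same encoding with get_dummies; drop_first removes the alphabetically
# first level ("all_other" here), which becomes the reference category
dummies = pd.get_dummies(toy["purpose"], drop_first=True).astype(int)
auto = pd.concat([toy.drop(columns="purpose"), dummies], axis=1)

print(auto)
```

Either way, omitting one level avoids perfect collinearity with the constant added by `sm.add_constant`.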

## Problem 2.2 - Prediction Models
Consider two loan applications, which are identical other than the fact that the borrower in Application A has FICO credit score 700 while the borrower in Application B has FICO credit score 710.

Let Logit(A) be the log odds of loan A not being paid back in full, according to our logistic regression model, and define Logit(B) similarly for loan B. What is the value of Logit(A) - Logit(B)?
- 0.093

Now, let O(A) be the odds of loan A not being paid back in full, according to our logistic regression model, and define O(B) similarly for loan B. What is the value of O(A)/O(B)? *(HINT: Use the mathematical rule that $exp(A + B + C) = exp(A) \cdot exp(B) \cdot exp(C)$.)*
- 1.0975

In [6]:
np.exp(0.093)

1.097461735268082
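
The arithmetic behind these two answers can be sketched as follows. Because the applications differ only in FICO score, every other term cancels in the difference of log odds. The coefficient `beta_fico` is an assumption inferred from the 0.093 used in the cell above (0.093 / 10), since the regression summary is truncated before the fico row.

```python
import math

# Assumed FICO coefficient, inferred from the 0.093 used above;
# the truncated summary does not show the exact fitted value.
beta_fico = -0.0093

fico_a, fico_b = 700, 710

# All other terms are identical for A and B, so they cancel:
# Logit(A) - Logit(B) = beta_fico * (fico_a - fico_b)
logit_diff = beta_fico * (fico_a - fico_b)

# Odds are exp(log odds), so the odds ratio is exp of the difference
odds_ratio = math.exp(logit_diff)

print(round(logit_diff, 4), round(odds_ratio, 4))
```

Under this assumed coefficient, a 10-point lower FICO score multiplies the odds of not repaying by roughly 1.10.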

## Problem 2.3 - Prediction Models
Predict the probability of the test set loans not being paid back in full. Store these predicted probabilities in a variable named predicted.risk and add it to your test set (we will use this variable in later parts of the problem). Compute the confusion matrix using a threshold of 0.5.

What is the accuracy of the logistic regression model? Input the accuracy as a number between 0 and 1.
- 0.8482

What is the accuracy of the baseline model? Input the accuracy as a number between 0 and 1.
- 0.8479

In [7]:
pred_1 = model1.predict(sm.add_constant(X_test))
th = 0.5

test_df = pd.DataFrame()
test_df['actual.risk'], test_df['predicted.risk'] = y_test, pred_1
test_df['predicted.risk'] = (test_df['predicted.risk']>=th).astype(int)

print(test_df.value_counts())

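# Share of test loans where the thresholded prediction matches the actual label
print("Accuracy:", (test_df['actual.risk'] == test_df['predicted.risk']).mean())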
print("Baseline", 1- y_test.mean())

actual.risk  predicted.risk
0            0                 2424
1            0                  423
             1                   14
0            1                   13
dtype: int64
Accuracy: 0.848295059151009
Baseline 0.8479471120389701
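
The accuracy and baseline figures can be recomputed by hand from the four counts printed above. The sketch below does that and also derives sensitivity and specificity, which the problem does not ask for but which help explain why the model barely beats the baseline. The variable names are chosen for this illustration.

```python
# Counts copied from the value_counts() output above
tn = 2424  # actual 0, predicted 0
fn = 423   # actual 1, predicted 0
tp = 14    # actual 1, predicted 1
fp = 13    # actual 0, predicted 1

total = tn + fn + tp + fp

# Model accuracy: share of loans classified correctly
accuracy = (tn + tp) / total

# Baseline: always predict 0 (fully paid), correct for every actual 0
baseline = (tn + fp) / total

# Share of unpaid loans the model catches, and share of paid loans it clears
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)

print("Accuracy:", accuracy)
print("Baseline:", baseline)
print("Sensitivity:", sensitivity)
print("Specificity:", specificity)
```

At a 0.5 threshold the model flags only 27 loans as risky, so its accuracy is almost identical to the baseline's.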


## Problem 2.4 - Prediction Models
Compute the test set AUC.

In [8]:
fpr, tpr, ths = metrics.roc_curve(y_test, pred_1)
print("AUC", metrics.auc(fpr, tpr))

AUC 0.6732383759527273
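
One way to interpret this number: the AUC is the probability that a randomly chosen unpaid loan receives a higher predicted risk than a randomly chosen paid loan, with ties counted as one half. The sketch below computes that directly on a tiny made-up example. `pairwise_auc` is a hypothetical helper written for this illustration, not part of scikit-learn, and it is only practical for small inputs.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Made-up labels and scores, for illustration only
y_toy = [0, 0, 1, 1]
s_toy = [0.1, 0.4, 0.35, 0.8]

print(pairwise_auc(y_toy, s_toy))  # 3 of the 4 pairs are ordered correctly: 0.75
```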


## Problem 3.1 - A "Smart Baseline"
In the previous problem, we built a logistic regression model that has an AUC significantly higher than the AUC of 0.5 that would be obtained by randomly ordering observations.

However, LendingClub.com assigns the interest rate to a loan based on their estimate of that loan's risk. This variable, int.rate, is an independent variable in our dataset. In this part, we will investigate using the loan's interest rate as a "smart baseline" to order the loans according to risk.

Using the training set, build a bivariate logistic regression model (aka a logistic regression model with a single independent variable) that predicts the dependent variable not.fully.paid using only the variable int.rate.

The variable int.rate is highly significant in the bivariate model, but it is not significant at the 0.05 level in the model trained with all the independent variables. What is the most likely explanation for this difference?
- int.rate is correlated with other risk-related variables, and therefore does not incrementally improve the model when those other variables are included. 

In [9]:
X2 = loans_imputed[['int.rate']].copy()
y2 = loans_imputed['not.fully.paid'].copy()

X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X2, y2, train_size=0.7, random_state=144
)

model2 = sm.Logit(y_train2, sm.add_constant(X_train2)).fit()
print("Int rate coef:", model2.params.values[1])

Optimization terminated successfully.
         Current function value: 0.432369
         Iterations 6
Int rate coef: 16.298697030748826


## Problem 3.2 - A "Smart Baseline"
Make test set predictions for the bivariate model. What is the highest predicted probability of a loan not being paid in full on the testing set?
- 0.4251

With a logistic regression cutoff of 0.5, how many loans would be predicted as not being paid in full on the testing set?
- 0

In [10]:
pred_2 = model2.predict(sm.add_constant(X_test2))
print("Highest prob:", max(pred_2))
print("No pay:", sum(pred_2 > 0.5))

Highest prob: 0.42510485714939467
No pay: 0


## Problem 3.3 - A "Smart Baseline"
What is the test set AUC of the bivariate model?

In [11]:
fpr2, tpr2, ths2 = metrics.roc_curve(y_test2, pred_2)
print(metrics.auc(fpr2, tpr2))

0.617354120166878


## Problem 4.1 - Computing the Profitability of an Investment
While thus far we have predicted if a loan will be paid back or not, an investor needs to identify loans that are expected to be profitable. If the loan is paid back in full, then the investor makes interest on the loan. However, if the loan is not paid back, the investor loses the money invested. Therefore, the investor should seek loans that best balance this risk and reward.

To compute interest revenue, consider a $\$c$ investment in a loan that has an annual interest rate $r$ over a period of $t$ years. Using continuous compounding of interest, this investment pays back $c * exp(rt)$ dollars by the end of the $t$ years, where $exp(rt)$ is $e$ raised to the $r*t$ power.

How much does a $\$10$ investment with an annual interest rate of $6\%$ pay back after $3$ years, using continuous compounding of interest? *Hint: remember to convert the percentage to a proportion before doing the math.*

In [12]:
10 * np.exp(6/100*3)

11.972173631218102
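
For comparison, the sketch below wraps the continuous-compounding formula in a small function and sets it next to ordinary annual compounding, which the problem does not use. The function names are hypothetical and chosen for this illustration.

```python
import math


def payback_continuous(c, r, t):
    """Payback of a c-dollar investment at annual rate r over t years,
    compounded continuously: c * exp(r * t)."""
    return c * math.exp(r * t)


def payback_annual(c, r, t):
    """Same investment compounded once per year, shown for comparison only."""
    return c * (1 + r) ** t


# The 6% rate is converted to the proportion 0.06 before use
print(payback_continuous(10, 0.06, 3))
print(payback_annual(10, 0.06, 3))
```

Continuous compounding pays slightly more than annual compounding at the same nominal rate, because interest accrues on interest without waiting for the year to end.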

## Problem 4.2 - Computing the Profitability of an Investment
While the investment has value $c * exp(rt)$ dollars after collecting interest, the investor had to pay $\$c$ for the investment. What is the profit to the investor if the investment is paid back in full?
- $c * exp(rt) - c$

## Problem 4.3 - Computing the Profitability of an Investment
Now, consider the case where the investor made a $\$c$ investment, but it was not paid back in full. Assume, conservatively, that no money was received from the borrower (often a lender will receive some but not all of the value of the loan, making this a pessimistic assumption of how much is received). What is the profit to the investor in this scenario?
- $-c$

## Problem 5.1 - A Simple Investment Strategy
In the previous subproblem, we concluded that an investor who invested $c$ dollars in a loan with interest rate $r$ for $t$ years makes $c * (exp(rt) - 1)$ dollars of profit if the loan is paid back in full and $-c$ dollars of profit if the loan is not paid back in full (pessimistically).

In order to evaluate the quality of an investment strategy, we need to compute this profit for each loan in the test set. For this variable, we will assume a $\$1$ investment (aka $c=1$). To create the variable, we first assign the profit for a fully paid loan, $exp(rt)-1$, to every observation, and we then replace this value with $-1$ in the cases where the loan was not paid in full. All the loans in our dataset are $3$-year loans, meaning $t=3$ in our calculations.

What is the maximum profit of a $\$10$ investment in any loan in the testing set?

In [13]:
test = X_test.copy()
test['not.fully.paid'] = y_test
test['profit'] = (np.exp(test['int.rate']*3)-1)
test['profit'] = np.where(
    test['not.fully.paid']==1,
    -1,
    test['profit']
)
print("Max Profit", 10 * np.max(test['profit']))

Max Profit 8.6974115219633
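
The profit rule used in the cell above can also be written as a single function of one loan, which makes the two cases explicit. This is a minimal sketch; `loan_profit` is a hypothetical helper, and the example rates are made up.

```python
import math


def loan_profit(int_rate, not_fully_paid, c=1.0, t=3):
    """Profit on a c-dollar investment in a t-year loan.

    Fully paid: c * (exp(r * t) - 1). Not fully paid: -c, under the
    pessimistic assumption that nothing is recovered.
    """
    if not_fully_paid == 1:
        return -c
    return c * (math.exp(int_rate * t) - 1)


# Made-up examples: a repaid 6% loan and a defaulted 20% loan
print(loan_profit(0.06, 0))
print(loan_profit(0.20, 1))
```

Because the payoff of a repaid loan grows with the interest rate while the loss on a default is always the full investment, the largest profit in the test set comes from the repaid loan with the highest rate.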


## Problem 6.1 - An Investment Strategy Based on Risk
A simple investment strategy of equally investing in all the loans would yield profit $\$20.94$ for a $\$100$ investment. But this simple investment strategy does not leverage the prediction model we built earlier in this problem. As stated earlier, investors seek loans that balance reward with risk, in that they simultaneously have high interest rates and a low risk of not being paid back.

To meet this objective, we will analyze an investment strategy in which the investor only purchases loans with a high interest rate (a rate of at least 15%), but amongst these loans selects the ones with the lowest predicted risk of not being paid back in full. We will model an investor who invests $\$1$ in each of the most promising 100 loans.

First, build a data frame called highInterest consisting of the test set loans with an interest rate of at least 15%.

What is the average profit of a $\$1$ investment in one of these high-interest loans?
- 0.2799257623939107

What proportion of the high-interest loans were not paid back in full?
- 0.21608040201005024

In [14]:
highInterest = test[test['int.rate']>=0.15].copy()
print("Avg profit:", np.mean(highInterest['profit']))
print("Portion of loans unpaid:", np.mean(highInterest['not.fully.paid']))

Avg profit: 0.2799257623939107
Portion of loans unpaid: 0.21608040201005024


## Problem 6.2 - An Investment Strategy Based on Risk
Next, we will determine the 100th smallest predicted probability of not paying in full by sorting the predicted risks in increasing order and selecting the 100th element of this sorted list. 

Build a data frame called selectedLoans consisting of the high-interest loans with predicted risk not exceeding the cutoff we just computed. Check to make sure you have selected 100 loans for investment.

What is the profit of the investor, who invested $\$1$ in each of these 100 loans?
- Nan

How many of 100 selected loans were not paid back in full?
- Nan

In [15]:
cut_off = sorted(pred_2)[100]

highInterest['pred_y'] = pred_2
selectedLoans = highInterest[highInterest['pred_y']<=cut_off].copy()
print("Avg profit:", np.mean(selectedLoans['profit']))
print("Portion of loans unpaid:", np.mean(selectedLoans['not.fully.paid']))

Avg profit: nan
Portion of loans unpaid: nan
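
The nan values have a likely explanation. The cutoff in the cell above is taken from pred_2, the bivariate model's predictions over the whole test set, and those predictions increase with int.rate. Element 100 of the sorted list therefore belongs to a low-interest loan, no high-interest loan falls at or below it, selectedLoans is empty, and the mean of an empty column is nan. The sketch below shows the selection the problem describes on synthetic data: the cutoff is the 100th smallest predicted risk within the high-interest subset (0-based index 99). The `toy_test` frame and its random values are assumptions for illustration. In the notebook, the analogous change would presumably use the full model's predictions (pred_1) for the highInterest rows, which has not been run here.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the test set; values are illustrative only
rng = np.random.default_rng(0)
n = 1000
toy_test = pd.DataFrame({
    "int.rate": rng.uniform(0.06, 0.22, n),
    "predicted.risk": rng.uniform(0.0, 1.0, n),
})

# Step 1: keep only loans with an interest rate of at least 15%
high = toy_test[toy_test["int.rate"] >= 0.15].copy()

# Step 2: cutoff is the 100th smallest risk *within that subset*
cutoff = np.sort(high["predicted.risk"].to_numpy())[99]

# Step 3: select the loans at or below the cutoff
selected = high[high["predicted.risk"] <= cutoff]

print(len(selected))
```

With real data, the total profit of a $1 investment in each selected loan would be the sum of the profit column rather than its mean, and the number of unpaid loans would be the sum of not.fully.paid. `high.nsmallest(100, "predicted.risk")` is an alternative that avoids computing the cutoff explicitly.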
