# Modeling Personal Loan Delinquency with LendingClub Data

Our project focuses on peer-to-peer consumer lending. The task is to model credit risk on personal loans, specifically to estimate the probability that a borrower defaults or becomes delinquent during the loan term.

We explored several ways to model the loans, all of which treat a loan as a Markov process with discrete states (presented in the next section) and discrete time steps (the data is collected monthly).

We found that building sequential models with probabilistic programming tools is not easy, so we tried several ways to build the models and perform inference, and compared them.

## Data

In [1]:
from utils.utils import load_dataframe, preprocess, split_data
from utils.models import build_mle_matrix, build_mc_no_priors, build_mc_with_priors
from utils.inference import compute_mle, infer_mc_no_priors, infer_mc_with_priors


In [2]:
df = load_dataframe()

Loading raw data from cache...
Retrieved 40,263,987 rows, 4 columns in 2.77 seconds


Our variable of interest is `loan_status`, which has eight possible states. The loan status descriptions below are taken from LendingClub's [website](https://help.lendingclub.com/hc/en-us/articles/215488038-What-do-the-different-Note-statuses-mean-):

- **Issued**: New loan that has passed all LendingClub reviews, received full funding, and has been issued.

- **Current**: Loan is up to date on all outstanding payments. 

- **In Grace Period**: Loan payment is late, but within the 15-day grace period.
 
- **Late (16-30)**: Loan is late, past the grace period, hasn't been current for 16 to 30 days.
 
- **Late (31-120)**: Loan has not been current for 31 to 120 days.
 
- **Charged Off**: Loan for which there is no longer a reasonable expectation of further payments. Charge Off typically occurs when a loan is 120 days or more past due and there is no reasonable expectation of sufficient payment to prevent the charge off. Loans for which borrowers have filed for bankruptcy may be charged off earlier.

- **Default**: Loan has not been current for an extended period of time. Charged off and default states are similar, yet different. [TODO explain]

- **Fully Paid**: Loan has been fully repaid, either at the expiration of the 3- or 5-year term or as a result of a prepayment.

In [3]:
df = preprocess(df)

Mapping column names...
Loading preprocessed data from cache...
Retrieved 27,636,875 rows, 4 columns in 2.25 seconds


In [4]:
x_train, x_test = split_data(df)

Loading split data from cache...
Retrieved 1,486,122 rows, 36 columns in 0.44 seconds
Train: (1337582, 36) | Test: (148540, 36)


## Experiment 1: Markov Model with Maximum Likelihood Estimates

We start with a simple Markov model that is stationary and homogeneous: the transition probabilities between states are the same for all loans and do not vary with time. We keep this model for several experiments, as inference can be quite challenging even on this simple model.

Even though we want to solve the problem from a Bayesian perspective, our first step is to fit the model by maximum likelihood estimation (MLE), which is easy to compute and gives us a robust baseline.

### 1.1 Model

The MLE for such a model is simply the empirical frequency of each transition, so all we need is a count matrix recording, for every pair of states i, j, the number of observed transitions i => j.
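The derivation can be sketched as follows, using notation that is our assumption here rather than something defined elsewhere in the notebook: $T_{ij} = P(X_{t+1} = j \mid X_t = i)$ is the transition probability and $n_{ij}$ is the number of observed transitions from $i$ to $j$. Ignoring the distribution of the initial state, the log-likelihood is

$$
\ell(T) = \sum_{i}\sum_{j} n_{ij} \log T_{ij}, \qquad \text{subject to } \sum_{j} T_{ij} = 1 \text{ for every } i.
$$

Adding a Lagrange multiplier $\lambda_i$ for each row constraint and setting the derivative with respect to $T_{ij}$ to zero gives $n_{ij} / T_{ij} = \lambda_i$. Summing over $j$ yields $\lambda_i = \sum_{k} n_{ik}$, hence

$$
\hat{T}_{ij} = \frac{n_{ij}}{\sum_{k} n_{ik}}.
$$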

In [5]:
realized_transitions = build_mle_matrix(df)

Loading transitions data from cache...
Retrieved 8 rows, 8 columns in 0.11 seconds
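The body of `build_mle_matrix` is not shown in this notebook. As a rough illustration of the counting step only, here is a minimal sketch in plain Python. It assumes each observation is a `(loan_id, age_of_loan, loan_status)` tuple and that consecutive observations of a loan are one month apart; the function name `count_transitions` and the toy records are hypothetical.

```python
from collections import Counter, defaultdict

def count_transitions(records):
    """Count transitions i -> j between consecutive observations of each loan.

    `records` is an iterable of (loan_id, age_of_loan, loan_status) tuples
    (an assumed layout, not necessarily the one used by the project).
    """
    by_loan = defaultdict(list)
    for loan_id, age, status in records:
        by_loan[loan_id].append((age, status))

    counts = Counter()
    for history in by_loan.values():
        history.sort()  # order each loan's observations by age
        # Pair every observation with the next one of the same loan
        for (_, src), (_, dst) in zip(history, history[1:]):
            counts[(src, dst)] += 1
    return counts

# Toy data: two loans observed for three months each
toy_records = [
    (1, 0, "Issued"), (1, 1, "Current"), (1, 2, "Current"),
    (2, 0, "Issued"), (2, 1, "Current"), (2, 2, "Fully Paid"),
]
toy_counts = count_transitions(toy_records)
```

Each key of `toy_counts` is a `(from_state, to_state)` pair, so `toy_counts[("Issued", "Current")]` is the number of loans that moved from 'Issued' to 'Current'. Pairs that never occur simply count as zero.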


In [6]:
realized_transitions

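| From / To | Charged Off | Current | Default | Fully Paid | In Grace Period | Issued | Late (16-30 days) | Late (31-120 days) |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| Charged Off | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Current | 774 | 24450170 | 3 | 707161 | 5831 | 0 | 160357 | 62006 |
| Default | 28843 | 147 | 2206 | 71 | 0 | 0 | 4 | 506 |
| Fully Paid | 0 | 0 | 0 | 8063 | 12 | 0 | 101 | 72 |
| In Grace Period | 0 | 276 | 0 | 11 | 22 | 0 | 59 | 41 |
| Issued | 0 | 17206 | 0 | 670 | 1 | 0 | 38 | 1 |
| Late (16-30 days) | 4548 | 32374 | 0 | 2066 | 257 | 0 | 13413 | 119613 |
| Late (31-120 days) | 105932 | 25398 | 29748 | 2138 | 56 | 0 | 3292 | 332463 |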


### 1.2 Inference

Now that we have built a count matrix, we just take empirical frequencies to get our estimates:

In [7]:
mle = compute_mle(realized_transitions)

In [8]:
mle

| From / To | Charged Off | Current | Default | Fully Paid | In Grace Period | Issued | Late (16-30 days) | Late (31-120 days) |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| Charged Off | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Current | 0.0 | 0.96 | 0.0 | 0.03 | 0.0 | 0.0 | 0.01 | 0.0 |
| Default | 0.91 | 0.0 | 0.07 | 0.0 | 0.0 | 0.0 | 0.0 | 0.02 |
| Fully Paid | 0.0 | 0.0 | 0.0 | 0.98 | 0.0 | 0.0 | 0.01 | 0.01 |
| In Grace Period | 0.0 | 0.67 | 0.0 | 0.03 | 0.05 | 0.0 | 0.14 | 0.1 |
| Issued | 0.0 | 0.96 | 0.0 | 0.04 | 0.0 | 0.0 | 0.0 | 0.0 |
| Late (16-30 days) | 0.03 | 0.19 | 0.0 | 0.01 | 0.0 | 0.0 | 0.08 | 0.69 |
| Late (31-120 days) | 0.21 | 0.05 | 0.06 | 0.0 | 0.0 | 0.0 | 0.01 | 0.67 |


Each row is a probability distribution over the next state, with two exceptions:
- Charged Off: this is a "sink state" with no possible next state, hence the row of zeros.
- In Grace Period: because each entry is rounded to two decimals, the row sums to 0.99 instead of 1.

**TODO** fix that in the MLE function directly
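One possible way to address the rounding issue is to normalize the raw counts and only round for display. The sketch below is in plain Python and is not the project's `compute_mle`; the function name `row_normalize` and the nested-dict layout are assumptions. The 'In Grace Period' counts are copied from the count matrix in section 1.1, with zero entries omitted.

```python
def row_normalize(counts):
    """Turn nested counts {i: {j: n_ij}} into transition probabilities.

    Rows with no outgoing transitions (a sink state) are left as zeros.
    Nothing is rounded, so every non-empty row sums to 1 up to float error.
    """
    probs = {}
    for src, row in counts.items():
        total = sum(row.values())
        probs[src] = {dst: (n / total if total else 0.0) for dst, n in row.items()}
    return probs

counts = {
    # Non-zero entries of the 'In Grace Period' row of the count matrix
    "In Grace Period": {
        "Current": 276,
        "Fully Paid": 11,
        "In Grace Period": 22,
        "Late (16-30 days)": 59,
        "Late (31-120 days)": 41,
    },
    # Sink state: no observed outgoing transitions
    "Charged Off": {"Current": 0, "Charged Off": 0},
}
probs = row_normalize(counts)

# Sum of the exact probabilities versus the sum after rounding each entry
exact_sum = sum(probs["In Grace Period"].values())
rounded_sum = sum(round(p, 2) for p in probs["In Grace Period"].values())
```

Here `exact_sum` equals 1 up to floating-point error, while `rounded_sum` reproduces the 0.99 seen above. On a pandas count matrix, the same normalization could presumably be written as `counts.div(counts.sum(axis=1), axis=0).fillna(0)`, with rounding applied only when displaying the table.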

In [9]:
mle.sum(axis=1)

Charged Off           0.00
Current               1.00
Default               1.00
Fully Paid            1.00
In Grace Period       0.99
Issued                1.00
Late (16-30 days)     1.00
Late (31-120 days)    1.00
dtype: float64

In [10]:
import numpy as np

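In [13]:
# Renormalize the row so that it sums to 1. NumPy has no `softmax` function, and a
# softmax would exponentiate the entries rather than rescale them.
row = mle.loc['In Grace Period', :]
row / row.sum()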

### 1.3 Criticism

**Values of the estimates:**

Many of the MLE estimates match our expectations. For example:
- A loan which is 'Current' is far more likely to stay 'Current' the next month (estimated probability 0.96) than to move to any other state.
- Some states are unreachable, or almost unreachable, from others. Once 'Charged Off', a loan cannot go to any other state, because LendingClub shuts down the loan completely. Similarly, a 'Current' loan should normally pass through 'In Grace Period' or 'Late (16-30)' before reaching 'Late (31-120)', since we observe the data each month: the estimated probability of a direct transition rounds to 0.0, although the count matrix still records 62,006 such transitions (about 0.2% of all transitions out of 'Current').

We can also see the limits of our model and of these estimates:
- TODO

**Sampling:**

We can now use these estimates to generate trajectories. Each trajectory starts in 'Issued', and we keep sampling transitions for 36 months unless we reach 'Charged Off', which is a "sink state".

In [28]:
def sample_mle(mle_table, length=36):
    """
    Given a table of MLE estimates for the Markov Chain,
    samples states until length is reached, starting from 'Issued'.
    Stops early if the sink state 'Charged Off' is reached.
    """
    states = list(mle_table.columns)
    current_state = 'Issued'
    chain = [current_state]
    for _ in range(length):
        if current_state == 'Charged Off':
            break  # sink state: no outgoing transitions
        # Renormalize the row: rounding can leave it summing to 0.99,
        # which np.random.choice rejects.
        p = mle_table.loc[current_state, :].values
        current_state = np.random.choice(states, p=p / p.sum())
        chain.append(current_state)
    return chain

In [30]:
print(sample_mle(mle))

**Looking at statistics based on generated samples:**

In [16]:
# TODO length of sampled chains for example.
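One statistic that could be computed here is the length of the sampled chains. The following is only a sketch in plain Python, independent of `sample_mle`: the rows of `MLE` are the rounded estimates copied from the table in section 1.2, while the names `sample_chain`, `lengths`, `mean_length` and `charged_off_share`, the seed, and the number of simulated chains are all assumptions.

```python
import random

STATES = ["Charged Off", "Current", "Default", "Fully Paid", "In Grace Period",
          "Issued", "Late (16-30 days)", "Late (31-120 days)"]

# Rounded MLE rows from section 1.2, in the same row and column order as STATES
MLE = [
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.96, 0.0, 0.03, 0.0, 0.0, 0.01, 0.0],
    [0.91, 0.0, 0.07, 0.0, 0.0, 0.0, 0.0, 0.02],
    [0.0, 0.0, 0.0, 0.98, 0.0, 0.0, 0.01, 0.01],
    [0.0, 0.67, 0.0, 0.03, 0.05, 0.0, 0.14, 0.1],
    [0.0, 0.96, 0.0, 0.04, 0.0, 0.0, 0.0, 0.0],
    [0.03, 0.19, 0.0, 0.01, 0.0, 0.0, 0.08, 0.69],
    [0.21, 0.05, 0.06, 0.0, 0.0, 0.0, 0.01, 0.67],
]

def sample_chain(rng, length=36, start="Issued", sink="Charged Off"):
    """Sample one trajectory of at most `length` transitions, stopping at the sink."""
    chain = [start]
    while len(chain) <= length and chain[-1] != sink:
        row = MLE[STATES.index(chain[-1])]
        # random.choices treats weights as relative, so a row summing to 0.99 is fine
        chain.append(rng.choices(STATES, weights=row)[0])
    return chain

rng = random.Random(0)  # fixed seed so the simulation is reproducible
chains = [sample_chain(rng) for _ in range(1000)]

# Number of transitions in each chain (36 unless the loan was charged off earlier)
lengths = [len(chain) - 1 for chain in chains]
mean_length = sum(lengths) / len(lengths)
charged_off_share = sum(chain[-1] == "Charged Off" for chain in chains) / len(chains)
```

A chain shorter than 36 transitions can only have ended in 'Charged Off', so the distribution of `lengths` describes how early simulated loans are charged off under the estimated transition matrix.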

## Experiment 2: Stationary Markov Chain without Priors

In [8]:
chain_len = max(df.age_of_loan)
n_states = df.loan_status.unique().shape[0]

### 2.1 Model

In [9]:
x, T = build_mc_no_priors(n_states, chain_len)

### 2.2 Inference

In [10]:
infer_mc_no_priors(x_train, x, T, n_states, chain_len)

Loading experiment2 data from cache...
Retrieved 8 rows, 8 columns in 0.01 seconds


| From / To | Charged Off | Current | Default | Fully Paid | In Grace Period | Issued | Late (16-30 days) | Late (31-120 days) |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| Charged Off | 0.11 | 0.12 | 0.12 | 0.12 | 0.12 | 0.11 | 0.12 | 0.12 |
| Current | 0.13 | 0.13 | 0.12 | 0.13 | 0.13 | 0.13 | 0.13 | 0.13 |
| Default | 0.1 | 0.1 | 0.1 | 0.1 | 0.11 | 0.1 | 0.1 | 0.1 |
| Fully Paid | 0.13 | 0.12 | 0.13 | 0.12 | 0.13 | 0.13 | 0.13 | 0.13 |
| In Grace Period | 0.13 | 0.12 | 0.13 | 0.13 | 0.12 | 0.12 | 0.13 | 0.13 |
| Issued | 0.14 | 0.15 | 0.15 | 0.15 | 0.15 | 0.15 | 0.15 | 0.15 |
| Late (16-30 days) | 0.16 | 0.16 | 0.16 | 0.16 | 0.16 | 0.16 | 0.16 | 0.16 |
| Late (31-120 days) | 0.09 | 0.09 | 0.09 | 0.09 | 0.09 | 0.09 | 0.09 | 0.09 |


### 2.3 Criticism

## Experiment 3: Stationary Markov Chain with Priors

### 3.1 Model

In [11]:
batch_size = 1000

In [12]:
x, pi_0, pi_T = build_mc_with_priors(n_states, chain_len, batch_size)

### 3.2 Inference (Batch)

In [15]:
infer_mc_with_priors(x_train, x, pi_0, pi_T, n_states, chain_len, batch_size)

Loading experiment3 data from cache...
Retrieved 8 rows, 8 columns in 0.01 seconds


| From / To | Charged Off | Current | Default | Fully Paid | In Grace Period | Issued | Late (16-30 days) | Late (31-120 days) |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| Charged Off | 0.26 | 0.04 | 0.13 | 0.15 | 0.05 | 0.03 | 0.32 | 0.02 |
| Current | 0.1 | 0.01 | 0.02 | 0.12 | 0.01 | 0.41 | 0.33 | 0.01 |
| Default | 0.14 | 0.12 | 0.01 | 0.03 | 0.0 | 0.04 | 0.05 | 0.61 |
| Fully Paid | 0.22 | 0.55 | 0.0 | 0.12 | 0.04 | 0.03 | 0.01 | 0.01 |
| In Grace Period | 0.07 | 0.21 | 0.07 | 0.31 | 0.2 | 0.06 | 0.08 | 0.01 |
| Issued | 0.2 | 0.04 | 0.1 | 0.04 | 0.34 | 0.01 | 0.19 | 0.08 |
| Late (16-30 days) | 0.49 | 0.04 | 0.01 | 0.05 | 0.06 | 0.03 | 0.3 | 0.03 |
| Late (31-120 days) | 0.1 | 0.14 | 0.05 | 0.11 | 0.04 | 0.24 | 0.23 | 0.08 |


### 3.3 Criticism