<a href="https://colab.research.google.com/github/ancastillar/Study-on-the-probability-of-payment-of-a-credit-requested-by-a-customer/blob/main/Model_LOAN_NOLINEAR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [10]:
import pandas as pd
import pydotplus
import numpy  as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

from io import StringIO
from IPython.display import Image, SVG
from graphviz import Source
from matplotlib import cm
from matplotlib.colors import ListedColormap, LinearSegmentedColormap
from matplotlib.patches import Patch
from scipy.stats import chi2_contingency, norm
from sklearn.tree import export_graphviz
from sklearn.metrics import roc_curve, auc, accuracy_score
from sklearn.model_selection import StratifiedKFold, train_test_split
from statsmodels.formula.api import ols
from sympy import var, plot_implicit, Eq

# ignore log(0) and divide by 0 warning
np.seterr(divide='ignore');

In [2]:
df = pd.read_csv('/content/drive/MyDrive/data_davivienda/Lending_club.csv',
                 dtype = {'loan_status':'category', 'annual_inc':'float',
                          'verification_status':'category', 'emp_length':'category',
                          'home_ownership':'category', 'int_rate':'object',
                          'loan_amnt':'float', 'purpose':'category',
                          'term':'category', 'grade':'category'})
df['int_rate'] = df['int_rate'].str.rstrip('%').astype('float')

In [3]:
df.head()

Unnamed: 0,loan_status,annual_inc,verification_status,emp_length,home_ownership,int_rate,loan_amnt,purpose,term,grade
0,Fully Paid,24000.0,Verified,10+ years,RENT,10.65,5000.0,credit_card,36 months,B
1,Charged Off,30000.0,Source Verified,< 1 year,RENT,15.27,2500.0,car,60 months,C
2,Fully Paid,12252.0,Not Verified,10+ years,RENT,15.96,2400.0,small_business,36 months,C
3,Fully Paid,49200.0,Source Verified,10+ years,RENT,13.49,10000.0,other,36 months,C
4,Fully Paid,80000.0,Source Verified,1 year,RENT,12.69,3000.0,other,60 months,B


##Disadvantages of logistic regression##
Although logistic regression is one of the most commonly used classification algorithms, it is not the only one. If the relationship between the log-odds and the covariates is not linear, we need to include higher-order terms of the covariates to make the logistic model valid, which requires a lot of tuning and experimentation. In such cases, it is often better to use a classification algorithm that does not rely on a specific assumption about the relationship between the probability of the outcome and the covariates. We therefore introduce some of the most popular such models.

**Bayes Theorem**


To solve a problem using Bayesian methods, we have to specify two functions: the likelihood function p(X|θ), which describes the probability of observing a dataset X for a given value of the unknown parameters θ, and the prior distribution p(θ), which describes any knowledge we have about the parameters before we collect the data. Note that the likelihood should be considered as a function of the parameters θ with the data X held fixed. The prior distribution and the likelihood function are used to compute the posterior distribution p(θ|X) via Bayes' rule:
> $p(\theta|X)=\frac{p(X|\theta) \ p(\theta)}{\int d\theta'\ p(X|\theta')\ p(\theta')}$

In many cases, it will not be possible to analytically compute the normalizing constant in the denominator of the posterior distribution, $p(X)=\int d\theta\ p(X|\theta)\ p(\theta)$, and Markov Chain Monte Carlo (MCMC) methods are needed to draw random samples from p(θ|X).[1]
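
When θ is one-dimensional, the normalizing integral can be approximated numerically on a grid, which makes the roles of the likelihood, the prior, and the evidence concrete. The following is only a minimal sketch: the Bernoulli likelihood, the uniform prior, and the toy data (7 successes in 10 trials) are assumptions chosen for illustration and do not come from the loan dataset.

```python
import numpy as np

# Hypothetical toy data: 7 successes in 10 Bernoulli trials (not from the loan dataset).
successes, trials = 7, 10

# Grid of candidate values for the unknown parameter theta.
theta = np.linspace(0.0, 1.0, 1001)
d_theta = theta[1] - theta[0]

# Likelihood p(X | theta), viewed as a function of theta with the data held fixed.
likelihood = theta**successes * (1.0 - theta)**(trials - successes)

# Uniform prior p(theta) on [0, 1] (an assumption made for illustration).
prior = np.ones_like(theta)

# Evidence p(X): the integral of likelihood * prior, approximated by a Riemann sum.
evidence = np.sum(likelihood * prior) * d_theta

# Posterior density p(theta | X) from Bayes' rule.
posterior = likelihood * prior / evidence

posterior_mean = np.sum(theta * posterior) * d_theta
print('posterior mean of theta: {:.3f}'.format(posterior_mean))
```

With a uniform prior the exact posterior here is a Beta(8, 4) distribution with mean 2/3, so the grid result can be checked against a known answer. In higher dimensions a grid like this becomes impractical, which is where MCMC comes in.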

**Derivation of Bayes' Theorem**

Let B be an event whose chance of occurrence must be measured under the assumption that an event A has been observed. If the experiment is repeated, under the same conditions, n times then the relative frequency of B under condition A is defined as follows: 

> $fr(B|A)=\frac{n(A\cap B)}{n(A)}, \quad \text{if } n(A)>0$

where n(A∩B) is the number of trials in which both A and B occur and n(A) is the number of trials in which A occurs. Dividing the numerator and the denominator by n gives:

> $fr(B|A)=\frac{\frac{n(A\cap B)}{n}}{\frac{n(A)}{n}}, \quad \text{if } n(A)>0$

As $n\rightarrow \infty$, the relative frequencies converge to probabilities and the last equation becomes:

>  $P(B|A)= \frac{P(A\cap B)}{P(A)}$
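
The convergence of the relative frequency to the conditional probability can be checked with a small simulation. This sketch assumes a fair six-sided die, with A the event "the roll is even" and B the event "the roll is greater than 3". These events are illustrative choices and are not part of the loan data.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is reproducible
n = 100_000
rolls = rng.integers(1, 7, size=n)  # fair die: integers 1..6 (upper bound is exclusive)

A = rolls % 2 == 0   # event A: the roll is even
B = rolls > 3        # event B: the roll is greater than 3

# Relative frequency fr(B|A) = n(A and B) / n(A)
fr_B_given_A = np.sum(A & B) / np.sum(A)

# Exact value: P(A and B) / P(A) = (2/6) / (3/6) = 2/3
exact = 2 / 3
print('fr(B|A) = {:.4f}, exact P(B|A) = {:.4f}'.format(fr_B_given_A, exact))
```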

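From the last equation we can define the independence of two events, another important concept. Events A and B are independent when observing A does not change the probability of B, that is, $P(B|A)=P(B)$, which is equivalent to: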

> $P(A \cap B)= P(A)P(B)$


Recalling the following definition, we can obtain two important results:

> **Partition of the sample space**: $\Omega=\bigcup_n \omega_n$, where $\omega_i\cap\omega_j=\emptyset$ for $i\neq j$.


**Total probability rule**

$P(B)= P(B\cap \Omega) = P(\cup_n(B \cap \omega_n)) = \sum_n P(B\cap \omega_n) = \sum_n P(B|\omega_n)P(\omega_n)$

From this rule we get the following corollary:

Corollary: Let $\omega_1, \omega_2, \ldots$ be a finite partition of $\Omega$ with $P(\omega_i)> 0$ for all $i$, and let $B$ be an event with $P(B)>0$. Then:

> $P(\omega_i| B)=\frac{P(\omega_i) P(B|\omega_i)}{\sum_j P(B|\omega_j) P(\omega_j)} $


> Proof:

> $P(\omega_i|B)=\frac{P(\omega_i \cap B)}{P(B)}= \frac{P(\omega_i)P(B|\omega_i)}{P(B)}= \frac{P(\omega_i)P(B|\omega_i)}{\sum_j P(B|\omega_j)P(\omega_j)} $  


When the partition is indexed by a continuous parameter instead of a discrete index $j$, the sum becomes an integral:

> $p(\omega|X)=\frac{p(X|\omega) \ p(\omega)}{\int d\omega'\ p(X|\omega')\ p(\omega')}$

The last equation is the continuous version of Bayes' rule.[2]
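
The discrete form of the corollary translates directly into a few lines of code. The function name `bayes_posterior` and the three-part partition below are hypothetical and used only for illustration.

```python
def bayes_posterior(priors, likelihoods):
    """Return P(omega_i | B) for every part omega_i of a finite partition.

    priors[i]      = P(omega_i)
    likelihoods[i] = P(B | omega_i)
    """
    # Total probability rule: P(B) = sum_j P(B | omega_j) P(omega_j)
    p_b = sum(l * p for l, p in zip(likelihoods, priors))
    if p_b == 0:
        raise ValueError('P(B) is zero, so the posterior is undefined')
    # Bayes' rule applied to each part of the partition.
    return [l * p / p_b for l, p in zip(likelihoods, priors)]

# Hypothetical three-part partition (numbers chosen for illustration only).
priors = [0.5, 0.3, 0.2]
likelihoods = [0.1, 0.4, 0.8]
posterior = bayes_posterior(priors, likelihoods)
print([round(p, 3) for p in posterior])
```

Because the denominator is the same for every $\omega_i$, the posteriors always sum to one.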

##Naive Bayes Classifier##

This algorithm is based on Bayes' theorem, which is a conditional probability theorem (see previous section). The principal idea for this project is:

> Given a client with certain characteristics, we will use Bayes' theorem and the observed data to estimate the probability that defaulters and non-defaulters have the same characteristics as the client. The larger of these two probabilities would then determine how we classify the client.

Let us determine whether $\text{P(defaulter|income verified)}$ is larger than $\text{P(non-defaulter|income verified)}$ using Bayes' theorem:


>  $\text{P(defaulter|income verified)}=\frac{\text{P(income verified|defaulter)}\ \text{P(defaulter)}}{\text{P(income verified)}}$ ; $\text{P(non-defaulter|income verified)}=\frac{\text{P(income verified|non-defaulter)} \ \text{P(non-defaulter)}}{\text{P(income verified)}}$


Both posteriors share the denominator $\text{P(income verified)}$, so it is enough to compare the numerators. We therefore check whether the following inequality holds:


>  $\text{P(income verified|defaulter)} \ \text{P(defaulter)} < \text{P(income verified|non-defaulter)}\ \text{P(non-defaulter)}\quad (1)$




First we are going to calculate the probability $\text{P(defaulter)}$ and $\text{P(non-defaulter)}$:

In [4]:
p_def=df[df['loan_status']=='Charged Off']['loan_status'].count()/df['loan_status'].count()
print('P(defaulter)={:.3f} \nP(non-defaulter)={:.3f}'.format(p_def, 1-p_def))

P(defaulter)=0.141 
P(non-defaulter)=0.859


Now we are going to calculate the probabilities $\text{P(income verified|defaulter)}$ and $\text{P(income verified|non-defaulter)}$:

In [5]:
p_id=df[(df['loan_status']=='Charged Off') & (df['verification_status']=='Verified')]['loan_status'].count()/df[df['loan_status']=='Charged Off']['loan_status'].count()
p_ind=df[(df['loan_status']=='Fully Paid') & (df['verification_status']=='Verified')]['loan_status'].count()/df[df['loan_status']=='Fully Paid']['loan_status'].count()
print('P(income verified given defaulter)={:.3f} \nP(income verified given non-defaulter)={:.3f}'.format(p_id, p_ind))

P(income verified given defaulter)=0.363 
P(income verified given non-defaulter)=0.313


We can now evaluate both sides of inequality (1):

In [7]:
p_def_iv=p_id*p_def
p_ndef_iv=p_ind*(1-p_def)
print('P(defaulter|income verified)  is proportional to {:.3f}  \
\nP(non-defaulter|income verified) is proportional to {:.3f}'.format(p_def_iv, p_ndef_iv))

P(defaulter|income verified)  is proportional to 0.051  
P(non-defaulter|income verified) is proportional to 0.269
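
The two numbers above are only proportional to the posteriors. Dividing each by their sum, which by the total probability rule is $\text{P(income verified)}$, gives the actual probabilities. The sketch below uses the rounded values printed by the previous cells, so the results are approximate; the variable names are new and chosen for readability.

```python
# Rounded values printed by the cells above.
p_def = 0.141             # P(defaulter)
p_ver_given_def = 0.363   # P(income verified | defaulter)
p_ver_given_ndef = 0.313  # P(income verified | non-defaulter)

# Numerators of Bayes' theorem for each class.
joint_def = p_ver_given_def * p_def
joint_ndef = p_ver_given_ndef * (1 - p_def)

# Total probability rule gives the shared denominator P(income verified).
p_ver = joint_def + joint_ndef

post_def = joint_def / p_ver
post_ndef = joint_ndef / p_ver
print('P(defaulter|income verified)={:.3f} \nP(non-defaulter|income verified)={:.3f}'.format(post_def, post_ndef))
```

The normalized probability of default is slightly higher than the prior of 0.141, because verified income is somewhat more common among defaulters, but the non-defaulter class remains far more probable.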


Since the quantity for non-defaulters is the larger one, a client whose income has been verified is classified as a non-defaulter. If income verification status were the only information available, we would probably grant loans to these clients. What happens if the client is either `Source Verified` or `Not Verified`? We can repeat the same procedure:

In [8]:
p_id2=df[(df['loan_status']=='Charged Off') & (df['verification_status']=='Source Verified')]['loan_status'].count()/df[df['loan_status']=='Charged Off']['loan_status'].count()
p_ind2=df[(df['loan_status']=='Fully Paid') & (df['verification_status']=='Source Verified')]['loan_status'].count()/df[df['loan_status']=='Fully Paid']['loan_status'].count()

p_def_iv2=p_id2*p_def
p_ndef_iv2=p_ind2*(1-p_def)
print('P(defaulter|income source verified)  is proportional to {:.3f}  \
\nP(non-defaulter|income source verified) is proportional to {:.3f}'.format(p_def_iv2, p_ndef_iv2))

P(defaulter|income source verified)  is proportional to 0.037  
P(non-defaulter|income source verified) is proportional to 0.217


In [9]:
p_id3=df[(df['loan_status']=='Charged Off') & (df['verification_status']=='Not Verified')]['loan_status'].count()/df[df['loan_status']=='Charged Off']['loan_status'].count()
p_ind3=df[(df['loan_status']=='Fully Paid') & (df['verification_status']=='Not Verified')]['loan_status'].count()/df[df['loan_status']=='Fully Paid']['loan_status'].count()

p_def_iv3=p_id3*p_def
p_ndef_iv3=p_ind3*(1-p_def)
print('P(defaulter|income not verified)  is proportional to {:.3f}  \
\nP(non-defaulter| income not verified) is proportional to {:.3f}'.format(p_def_iv3, p_ndef_iv3))

P(defaulter|income not verified)  is proportional to 0.053  
P(non-defaulter| income not verified) is proportional to 0.373


The quantities proportional to the probability of being a non-defaulter are always larger than those for being a defaulter, so every verification status is classified as non-defaulter. Remember that in the logistic regression notebook we found a Simpson's paradox involving `verification_status` and `loan_status`.
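
The three cells above repeat the same calculation for each verification status. Since $\text{P(value|class)}\ \text{P(class)}$ estimated by proportions is just the joint relative frequency $n(\text{value} \cap \text{class})/n$, one cross-tabulation can produce all of these numbers at once. This is a sketch under assumptions: the helper name `unnormalized_posteriors` and the toy DataFrame are hypothetical stand-ins for the Lending Club data, and rows with missing values are assumed to be absent.

```python
import pandas as pd

def unnormalized_posteriors(data, feature, target='loan_status'):
    """Return P(feature=value | class) * P(class) for every value and class."""
    # Joint relative frequencies: each cell is n(value and class) / n,
    # which equals P(value | class) * P(class).
    return pd.crosstab(data[feature], data[target], normalize='all')

# Hypothetical toy data standing in for the Lending Club file.
toy = pd.DataFrame({
    'loan_status': ['Fully Paid', 'Charged Off', 'Fully Paid', 'Fully Paid',
                    'Charged Off', 'Fully Paid', 'Fully Paid', 'Fully Paid'],
    'verification_status': ['Verified', 'Verified', 'Not Verified', 'Source Verified',
                            'Not Verified', 'Not Verified', 'Verified', 'Source Verified'],
})

table = unnormalized_posteriors(toy, 'verification_status')
print(table)

# Decision per verification status: the class with the larger joint frequency.
decision = table.idxmax(axis=1)
print(decision)
```

On the real data the same call would be `unnormalized_posteriors(df, 'verification_status')`.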

##The algorithm and assumptions##

There are two possible routes that we can follow to create a classification algorithm from Bayes' theorem.

#### Probabilistic Model

Recall that in logistic regression, the main quantity of interest is the probability of no default (remember that our encoding was 1 for non-defaulters and 0 for defaulters) given specific values for the covariates. We explicitly modeled this probability as a linear function of covariates after transformation by the logit function. 

The Naive Bayes classifier aims to estimate this probability from a different perspective. From Bayes' theorem, we can express this conditional probability in the following way:

$$
\begin{equation}\tag{*}
\text{P(non-defaulter|covariates=values)} = \frac{\text{P(covariates=values|non-defaulter)P(non-defaulter)}}{\text{P(covariates=values)}}.
\end{equation}
$$

**Then, as in the logistic model, we create our classifier by selecting a threshold $t$ and classifying a client with given covariate values as a non-defaulter if:**

$$
\text{P(non-defaulter|covariates=values)}>t.
$$

#### Maximum A Posteriori Model (MAP)

For this classifier, a client with given covariate values is classified as a non-defaulter if:

$$
\text{P(non-defaulter|covariates=values)}>\text{P(defaulter|covariates=values)}
$$

which is equivalent to:

$$
\begin{equation}\tag{**}
\text{P(covariates=values|non-defaulter)P(non-defaulter)}>\text{P(covariates=values|defaulter)P(defaulter)}
\end{equation}
$$
**Note: This is the same condition as inequality (1).**
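
The two decision rules can be written as small functions. This is a minimal sketch: the function names are hypothetical, and the inputs are the rounded values from the income-verified example earlier in the notebook.

```python
def classify_threshold(p_non_defaulter, t=0.5):
    """Probabilistic rule: non-defaulter when P(non-defaulter | x) > t."""
    return 'non-defaulter' if p_non_defaulter > t else 'defaulter'

def classify_map(lik_ndef, prior_ndef, lik_def, prior_def):
    """MAP rule: compare likelihood * prior; P(x) cancels on both sides."""
    return 'non-defaulter' if lik_ndef * prior_ndef > lik_def * prior_def else 'defaulter'

# Rounded values from the income-verified example.
lik_ndef, prior_ndef = 0.313, 0.859  # P(verified | non-defaulter), P(non-defaulter)
lik_def, prior_def = 0.363, 0.141    # P(verified | defaulter), P(defaulter)

# Normalized posterior needed by the threshold rule.
p_ndef = lik_ndef * prior_ndef / (lik_ndef * prior_ndef + lik_def * prior_def)

map_label = classify_map(lik_ndef, prior_ndef, lik_def, prior_def)
label_t05 = classify_threshold(p_ndef, t=0.5)
label_t09 = classify_threshold(p_ndef, t=0.9)
print(map_label, label_t05, label_t09)
```

With two classes, the MAP rule coincides with the threshold rule at $t=0.5$. A stricter threshold such as $t=0.9$ can flip the decision, which is useful when wrongly approving a defaulter is costlier than rejecting a good client.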

Let's take a look at each of these quantities:

1. $\text{P(defaulter)}$ and $\text{P(non-defaulter)}$. These are the probabilities of having a defaulter and a non-defaulter. These can be easily estimated using the proportion of defaulters in the dataset.
2. $\text{P(covariates=values)}$. This is the probability of having a customer whose covariates are equal to $\text{values}$. This can also be estimated by proportions, but we will actually find out that it is not necessary to estimate this quantity at all (more on this later).
3. $\text{P(covariates=values| non-defaulter)}$ and $\text{P(covariates=values| defaulter)}$. We can estimate these by looking at the proportion of defaulters and non-defaulters with covariates equal to $\text{values}$. But if we have a lot of covariates, the number of such defaulters might be extremely small or even equal to zero. The main challenge in building a Naive Bayes classifier is estimating these particular probabilities.

Indeed, when the number of covariates is large, it is hard to estimate $\text{P(covariates=values given non-defaulter)}$ by proportions directly. For instance, suppose that we have 10 covariates and that all of them are binary variables. Then, the total number of possible values these covariates can take is equal to $2^{10}=1024$. Thus, in order to estimate $\text{P(covariates=values given non-defaulter)}$ for every combination, we would need at least $1024$ samples! To go even further, suppose we want to classify texts having at most 82 words, where each word in the text has 10 possible alternatives. The number of samples we would need is larger than the number of atoms in the universe!
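
The counts in this argument are easy to verify. The figure of $10^{80}$ atoms is a commonly quoted order-of-magnitude estimate and is used here only as an assumption for comparison.

```python
# Number of distinct value combinations for 10 binary covariates.
binary_combinations = 2 ** 10

# Texts of exactly 82 words with 10 alternatives per word; allowing
# shorter texts as well would only increase this count.
text_combinations = 10 ** 82

# Commonly quoted order-of-magnitude estimate for the number of atoms in the
# observable universe (an assumption used only for comparison).
atoms_estimate = 10 ** 80

print(binary_combinations)
print(text_combinations > atoms_estimate)
```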

##Assumption of Naive##

All al