# Example-Dependent Classification for Credit Scoring

  To mitigate the impact of credit risk and make more objective and accurate decisions, 
  financial institutions use credit scores to predict and control their losses.
  The objective in credit scoring is to classify which potential customers are likely to default on a 
  contracted financial obligation based on the customer's past financial experience, and to use that 
  information to decide whether to approve or decline a loan [1]. This tool has 
  become standard practice among financial institutions around the world to predict 
  and control their loan portfolios. When constructing credit scores, it is common practice to 
  use standard cost-insensitive binary classification algorithms such as logistic regression, 
  neural networks, discriminant analysis, genetic programming, and decision trees, among 
  others [2,3]. 
  
  Formally, a credit score is a statistical model that allows the estimation of the probability 
  $\hat p_i=P(y_i=1|\mathbf{x}_i)$ of a customer $i$ defaulting on a contracted debt. Additionally, 
  since the objective of credit scoring is to estimate a classifier $c_i$ to decide whether or not to grant a 
  loan to a customer $i$, a threshold $t$ is defined such that if $\hat p_i <t$, then the loan is 
  granted, i.e., $c_i(t)=0$, and denied otherwise, i.e., $c_i(t)=1$.
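
The thresholding rule can be illustrated with a minimal sketch. The function name `classify_by_threshold` and the probabilities below are hypothetical and only serve to show how $c_i(t)$ follows from $\hat p_i$ and $t$.

```python
import numpy as np


def classify_by_threshold(p_hat, t=0.5):
    """Return c_i(t): 0 (grant the loan) if p_hat < t, 1 (deny) otherwise."""
    p_hat = np.asarray(p_hat, dtype=float)
    return (p_hat >= t).astype(int)


# Hypothetical default probabilities for four applicants
p_hat = np.array([0.05, 0.30, 0.50, 0.90])
decisions = classify_by_threshold(p_hat, t=0.5)
print(decisions)  # [0 0 1 1]
```

Note that an applicant with $\hat p_i = t$ is denied, since the loan is granted only when $\hat p_i < t$.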

## Example: Pacific-Asia Knowledge Discovery and Data Mining conference (PAKDD) competition 2009

Credit Risk Assessment on a Private Label Credit Card Application

### Load dataset and show basic statistics

In [1]:
import pandas as pd
import numpy as np
from costcla.datasets import load_creditscoring2
data = load_creditscoring2()

# Elements of the data file
print(data.keys())

['target_names', 'cost_mat', 'name', 'DESCR', 'feature_names', 'data', 'target']


In [2]:
# Full description of the dataset
# print(data.DESCR)

In [3]:
# Feature names and dataset shape (examples, features)
print(data.feature_names)
print(data.data.shape)

['ID_SHOP' 'AGE' 'AREA_CODE_RESIDENCIAL_PHONE' 'PAYMENT_DAY' 'SHOP_RANK'
 'MONTHS_IN_RESIDENCE' 'MONTHS_IN_THE_JOB' 'PROFESSION_CODE' 'MATE_INCOME'
 'QUANT_ADDITIONAL_CARDS_IN_THE_APPLICATION' 'PERSONAL_NET_INCOME' 'SEX_F'
 'SEX_M' 'MARITAL_STATUS_C' 'MARITAL_STATUS_D' 'MARITAL_STATUS_O'
 'MARITAL_STATUS_S' 'MARITAL_STATUS_V' 'FLAG_RESIDENCIAL_PHONE_N'
 'FLAG_RESIDENCIAL_PHONE_Y' 'RESIDENCE_TYPE_A' 'RESIDENCE_TYPE_C'
 'RESIDENCE_TYPE_O' 'RESIDENCE_TYPE_P' 'FLAG_MOTHERS_NAME_N'
 'FLAG_MOTHERS_NAME_Y' 'FLAG_FATHERS_NAME_N' 'FLAG_FATHERS_NAME_Y'
 'FLAG_RESIDENCE_TOWN_eq_WORKING_TOWN_N'
 'FLAG_RESIDENCE_TOWN_eq_WORKING_TOWN_Y'
 'FLAG_RESIDENCE_STATE_eq_WORKING_STATE_N'
 'FLAG_RESIDENCE_STATE_eq_WORKING_STATE_Y'
 'FLAG_RESIDENCIAL_ADDRESS_eq_POSTAL_ADDRESS_N'
 'FLAG_RESIDENCIAL_ADDRESS_eq_POSTAL_ADDRESS_Y']
(38938, 34)


In [4]:
# Percentage of bad (positive) clients
print(data.target.mean()*100)

19.8854589347


### Credit scoring as a standard classification problem

A random forest is trained to classify customers as good or bad.

In [5]:
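# Load the classifier and split the dataset into training and test sets
from sklearn.ensemble import RandomForestClassifier
# train_test_split lives in sklearn.model_selection since scikit-learn 0.18;
# the older sklearn.cross_validation module was removed in 0.20
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test, cost_mat_train, cost_mat_test = \
    train_test_split(data.data, data.target, data.cost_mat)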

# Fit the classifier
f_RF = RandomForestClassifier()
f_RF.fit(X_train, y_train)
y_pred = f_RF.predict(X_test)

  After the classifier $c_i$ is estimated, its performance needs to be evaluated. In 
  practice, many statistical evaluation measures are used to assess the performance of a credit 
  scoring model. Measures such as the area under the receiver operating characteristic curve (AUC),
  Brier score, Kolmogorov-Smirnov (K-S) statistic, $F_1$-Score, and misclassification rate are among 
  the most common [4]. 

In [6]:
# Evaluate the performance with standard cost-insensitive measures
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score
measures = {"f1": f1_score, "precision": precision_score, 
            "recall": recall_score, "accuracy": accuracy_score}
results = pd.DataFrame(columns=list(measures.keys()))
# Iterate over results.columns so each value lands in its matching column
results.loc["RandomForest"] = [measures[measure](y_test, y_pred) for measure in results.columns]

print(results)

                    f1  accuracy  precision    recall
RandomForest  0.131691  0.790036   0.358796  0.080645


  Nevertheless, none of these measures takes into account the 
  business and economic realities of credit scoring. Costs that the financial 
  institution incurs to acquire customers, or the expected profit from a particular client, 
  are not considered in the evaluation of the different models. 

### Financial Evaluation of a Credit Scorecard

  Typically, a credit risk model is evaluated using standard cost-insensitive measures.
  However, in practice, the cost associated with approving 
  what is known as a bad customer, i.e., a customer who defaults on a loan, is quite 
  different from the cost associated with declining a good customer, i.e., a customer who 
  successfully repays a loan. Furthermore, the costs are not constant among customers, 
  because loans have different credit line amounts, terms, and even interest rates. Some 
  authors have proposed methods that include the misclassification costs in the credit scoring 
  context [4,5,6,7].
  
  To take into account the varying costs that each example carries, we proposed in 
  [8] a cost matrix with example-dependent misclassification costs, 
  given in the following table.
  
  
| | Actual Positive ($y_i=1$) | Actual Negative ($y_i=0$) |
|---|:-:|:-:|
| Predicted Positive ($c_i=1$) | $C_{TP_i}=0$ | $C_{FP_i}=r_i+C^a_{FP}$ |
| Predicted Negative ($c_i=0$) | $C_{FN_i}=Cl_i \cdot L_{gd}$ | $C_{TN_i}=0$ |
  
  First, we assume that the costs of a correct 
  classification, $C_{TP_i}$ and $C_{TN_i}$, are zero for every customer $i$. We define $C_{FN_i}$, 
  the loss incurred if customer $i$ defaults, to be proportional to the credit line $Cl_i$, with the 
  loss given default $L_{gd}$ as the proportionality factor. We 
  define the cost of a false positive per customer $C_{FP_i}$ as the sum of two real financial 
  costs $r_i$ and $C^a_{FP}$, where $r_i$ is the loss in profit by rejecting what would have been a 
  good customer. 
  
  The profit per customer $r_i$ is calculated as the present value of the difference between the 
  financial institution's gains and expenses, given the credit line $Cl_i$, the term $l_i$, the 
  financial institution's lending rate $int_{r_i}$ for customer $i$, and the financial institution's 
  cost of funds $int_{cf}$.

  $$  r_i= PV(A(Cl_i,int_{r_i},l_i),int_{cf},l_i)-Cl_i,$$
  
  with $A$ being the customer's monthly payment and $PV$ the present value of the monthly payments,
  which are calculated using the time value of money equations [9],
 
 $$ A(Cl_i,int_{r_i},l_i) =  Cl_i \frac{int_{r_i}(1+int_{r_i})^{l_i}}{(1+int_{r_i})^{l_i}-1},$$
 
 $$ PV(A,int_{cf},l_i) = \frac{A}{int_{cf}} \left(1-\frac{1}{(1+int_{cf})^{l_i}} \right).$$
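
The three equations above translate directly into code. The sketch below is illustrative only: the function names are hypothetical, the rates are assumed to be expressed per period (here per month, to match a term in months), and dividing an annual rate by 12 is an assumption of this example rather than something the text specifies.

```python
def monthly_payment(cl, int_r, l):
    """A(Cl, int_r, l): level payment that repays cl over l periods at rate int_r."""
    growth = (1.0 + int_r) ** l
    return cl * int_r * growth / (growth - 1.0)


def present_value(a, int_cf, l):
    """PV(A, int_cf, l): value today of l payments of size a discounted at int_cf."""
    return a / int_cf * (1.0 - 1.0 / (1.0 + int_cf) ** l)


def profit(cl, int_r, int_cf, l):
    """r_i = PV(A(Cl_i, int_r, l), int_cf, l) - Cl_i."""
    return present_value(monthly_payment(cl, int_r, l), int_cf, l) - cl


# Hypothetical customer: credit line of 1000, 24 monthly payments.
# Monthly rates derived as annual rate / 12 (an assumption of this sketch).
r_example = profit(cl=1000.0, int_r=0.63 / 12, int_cf=0.165 / 12, l=24)
print(r_example)
```

A useful sanity check is that the profit vanishes when the lending rate equals the cost of funds, since the present value of the payments then equals the credit line itself.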
      
  The second term, $C^a_{FP}$, is related to the assumption that the financial institution will not 
  keep the money of the declined customer idle. It will instead give a loan to an alternative 
  customer [10]. Since no further information is known about the alternative customer, 
  it is assumed to have an average credit line $\overline{Cl}$ and an average profit $\overline{r}$.
  Given that, 
  
  $$  C^a_{FP}=- \overline{r} \cdot \pi_0+\overline{Cl}\cdot L_{gd} \cdot \pi_1,$$

  in other words, minus the profit of an average alternative customer plus the expected loss, 
  taking into account that the alternative customer will pay the debt with a probability equal to 
  the prior negative rate $\pi_0$, and similarly will default with a probability equal to the prior positive 
  rate $\pi_1$.
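
As a rough illustration, the alternative-customer term and the two non-zero entries of the cost matrix can be computed as follows. The function names and all numeric values are hypothetical, and $\pi_0$ is taken as $1-\pi_1$.

```python
def alternative_customer_cost(r_avg, cl_avg, lgd, pi_1):
    """C^a_FP = -r_avg * pi_0 + cl_avg * lgd * pi_1, with pi_0 = 1 - pi_1."""
    pi_0 = 1.0 - pi_1
    return -r_avg * pi_0 + cl_avg * lgd * pi_1


def misclassification_costs(r_i, cl_i, lgd, c_a_fp):
    """Return (C_FP_i, C_FN_i) for one customer, following the cost matrix table."""
    c_fp = r_i + c_a_fp   # lost profit plus the alternative-customer term
    c_fn = cl_i * lgd     # share of the credit line lost on default
    return c_fp, c_fn


# Hypothetical portfolio averages and customer values, for illustration only
c_a_fp = alternative_customer_cost(r_avg=200.0, cl_avg=1000.0, lgd=0.75, pi_1=0.2)
c_fp, c_fn = misclassification_costs(r_i=250.0, cl_i=1200.0, lgd=0.75, c_a_fp=c_a_fp)
print(c_a_fp, c_fp, c_fn)
```

With these made-up numbers the alternative-customer term is slightly negative, meaning the expected profit from the alternative customer outweighs the expected loss and lowers the cost of a false positive.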
  
  One key parameter of our model is the credit limit. There exist several strategies to calculate 
  $Cl_i$, depending on the type of loan, the state of the economy, and the current portfolio, 
  among others [1,9]. Nevertheless, given the lack of information 
  regarding the specific business environments of the considered datasets, we simply define 
  $Cl_i$ as

$$      Cl_i = \min \bigg\{ q \cdot Inc_i, Cl_{max}, Cl_{max}(debt_i) \bigg\},$$
  
  where $Inc_i$ and $debt_i$ are the monthly income and debt ratio of the customer $i$, 
  respectively, $q$ is a parameter that defines the maximum $Cl_i$ as a multiple of $Inc_i$, and 
  $Cl_{max}$ is the maximum overall credit line. Lastly, the maximum credit line given the current 
  debt is calculated as the maximum credit limit such that the current debt ratio plus the new 
  monthly payment, expressed as a share of monthly income, does not exceed one. It is calculated as
 
 $$  Cl_{max}(debt_i)=PV\left(Inc_i \cdot P_{m}(debt_i),int_{r_i},l_i\right),$$
  and
  $$ P_{m}(debt_i)=\min \left\{ \frac{A(q \cdot Inc_i,int_{r_i},l_i)}{Inc_i},\left(1-debt_i \right) \right\}.$$
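
The credit limit rule can be sketched as below. This is a minimal illustration under stated assumptions: the function names are hypothetical, `int_r` is a per-month rate, and the customer values are made up.

```python
def monthly_payment(cl, int_r, l):
    """A(Cl, int_r, l): level payment that repays cl over l periods."""
    growth = (1.0 + int_r) ** l
    return cl * int_r * growth / (growth - 1.0)


def present_value(a, rate, l):
    """PV(A, rate, l): value today of l payments of size a."""
    return a / rate * (1.0 - 1.0 / (1.0 + rate) ** l)


def credit_limit(inc, debt, q, cl_max, int_r, l):
    """Cl_i = min{q * Inc_i, Cl_max, Cl_max(debt_i)}."""
    # Largest payment, as a share of income, that the customer can still take on
    p_m = min(monthly_payment(q * inc, int_r, l) / inc, 1.0 - debt)
    # Credit line whose monthly payment equals that share of income
    cl_max_debt = present_value(inc * p_m, int_r, l)
    return min(q * inc, cl_max, cl_max_debt)


# Hypothetical customers with a monthly income of 1000 and a monthly rate of 5.25%
print(credit_limit(inc=1000.0, debt=0.0, q=3, cl_max=25000.0, int_r=0.0525, l=24))
print(credit_limit(inc=1000.0, debt=0.9, q=3, cl_max=25000.0, int_r=0.0525, l=24))
```

For a customer without existing debt the limit is driven by $q \cdot Inc_i$ or $Cl_{max}$, whereas a high debt ratio makes $Cl_{max}(debt_i)$ the binding term.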
  
  
### Financial savings

  Let $\mathcal{S}$ be a set of $N$ examples $i$, $N=\vert \mathcal{S} \vert$, where each example is 
  represented by the augmented feature vector $\mathbf{x}_i^*=[\mathbf{x}_i, 
  C_{TP_i},C_{FP_i},C_{FN_i},C_{TN_i}]$ and labeled using the class label $y_i \in \{0,1\}$. 
  A classifier $f$, which generates the predicted label $c_i$ for each element $i$, is trained 
  using the set $\mathcal{S}$. Then the cost of using $f$ on $\mathcal{S}$ is calculated by
  
  $$   Cost(f(\mathcal{S})) = \sum_{i=1}^N Cost(f(\mathbf{x}_i^*)),$$
  
  where
  
 $$   Cost(f(\mathbf{x}_i^*)) = y_i(c_i C_{TP_i} + (1-c_i)C_{FN_i}) + (1-y_i)(c_i C_{FP_i} + (1-c_i)C_{TN_i}).$$
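
The cost equation can be written as a short vectorized function. This is a sketch, not the library implementation: the name `total_cost` is hypothetical, and the column order `[C_FP, C_FN, C_TP, C_TN]` is assumed to match the cost matrix shipped with the dataset used in this document.

```python
import numpy as np


def total_cost(y_true, y_pred, cost_mat):
    """Sum of example-dependent costs; cost_mat columns are [C_FP, C_FN, C_TP, C_TN]."""
    y = np.asarray(y_true, dtype=float)
    c = np.asarray(y_pred, dtype=float)
    c_fp, c_fn, c_tp, c_tn = np.asarray(cost_mat, dtype=float).T
    per_example = (y * (c * c_tp + (1 - c) * c_fn)
                   + (1 - y) * (c * c_fp + (1 - c) * c_tn))
    return per_example.sum()


# Three hypothetical examples: a false negative, a false positive, a true positive
example_cost_mat = np.array([[209.0, 547.965, 0.0, 0.0],
                             [24.0, 274.725, 0.0, 0.0],
                             [89.0, 371.25, 0.0, 0.0]])
example_cost = total_cost([1, 0, 1], [0, 1, 1], example_cost_mat)
print(example_cost)
```

Only the misclassified examples contribute here, because the costs of correct classifications are zero.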
  

  However, the total cost may not be easy to interpret. In [8] we proposed measuring the savings of an algorithm, defined as the cost of using the algorithm compared with the cost of using no algorithm at all. To do that, the cost of the costless class, i.e., the cheaper of predicting every example as negative ($f_0$) or as positive ($f_1$), is defined as 
  
  $$  Cost_l(\mathcal{S}) = \min \{Cost(f_0(\mathcal{S})), Cost(f_1(\mathcal{S}))\},$$
  
  where 
  
  $$  f_a(\mathcal{S}) = \mathbf{a}, \text{ with } a\in \{0,1\}.$$
  

  The cost improvement can then be expressed as the savings relative to $Cost_l(\mathcal{S})$:
  
  $$    Savings(f(\mathcal{S})) = \frac{ Cost_l(\mathcal{S}) - Cost(f(\mathcal{S}))}   {Cost_l(\mathcal{S})}.$$
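
A minimal sketch of the savings measure follows. The helper names and the toy data are hypothetical, and the cost matrix columns are assumed to be ordered `[C_FP, C_FN, C_TP, C_TN]`.

```python
import numpy as np


def total_cost(y_true, y_pred, cost_mat):
    """Total example-dependent cost; cost_mat columns are [C_FP, C_FN, C_TP, C_TN]."""
    y = np.asarray(y_true, dtype=float)
    c = np.asarray(y_pred, dtype=float)
    c_fp, c_fn, c_tp, c_tn = np.asarray(cost_mat, dtype=float).T
    return (y * (c * c_tp + (1 - c) * c_fn)
            + (1 - y) * (c * c_fp + (1 - c) * c_tn)).sum()


def savings(y_true, y_pred, cost_mat):
    """Savings relative to the cheaper of the two constant classifiers f_0 and f_1."""
    y_true = np.asarray(y_true)
    cost_f0 = total_cost(y_true, np.zeros_like(y_true), cost_mat)  # all negative
    cost_f1 = total_cost(y_true, np.ones_like(y_true), cost_mat)   # all positive
    cost_l = min(cost_f0, cost_f1)
    return (cost_l - total_cost(y_true, y_pred, cost_mat)) / cost_l


# Hypothetical toy data: one bad customer and two good ones
y_toy = np.array([1, 0, 0])
cost_mat_toy = np.array([[10.0, 100.0, 0.0, 0.0],
                         [20.0, 50.0, 0.0, 0.0],
                         [30.0, 80.0, 0.0, 0.0]])
print(savings(y_toy, [1, 0, 0], cost_mat_toy))  # perfect predictions
print(savings(y_toy, [0, 0, 0], cost_mat_toy))  # worse than the costless class
```

Savings of 1 correspond to a zero-cost classifier, 0 to matching the costless class, and negative values to a classifier that costs more than using no model at all.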
  


### Parameters for the PAKDD Credit Database

The PAKDD database contains information regarding the features and, more importantly, the income of each example, from which an estimated credit limit $Cl_i$ can be calculated.
Since no specific information regarding the dataset is provided, we assume that it belongs to 
an average Brazilian financial institution. This enables us to find the different 
parameters needed to calculate the cost measure. 

| Parameter 	| Value |
|---	|:-:	|
|Interest rate ($int_r$) | 63.0% |
|  Cost of funds ($int_{cf}$) | 16.5% |
|  Term ($l$) in months | 24 |
|  Loss given default ($L_{gd}$) | 75% |
|  Times income ($q$) | 3 |
|  Maximum credit line ($Cl_{max}$) | 25,000|

In particular, we obtain the average interest rates in Brazil during 2004 from Trading Economics [11]. Moreover, we convert all monetary values to Euros. We use a fixed loan term $l$: 
because the PAKDD Credit dataset is related to credit cards, the term is fixed to two years [9].
Finally, we set the loss given default $L_{gd}$ using information from 
the Basel II standard, $q$ to 3 since it is the average personal loan request relative to monthly income, and the maximum credit limit $Cl_{max}$ to 25,000 Euros.

### Calculation of the savings of the random forest

In [7]:
# The cost matrix is already calculated for the dataset
# Columns of cost_mat: [C_FP, C_FN, C_TP, C_TN]
print(data.cost_mat[[10, 17, 50]])

[[ 209.     547.965    0.       0.   ]
 [  24.     274.725    0.       0.   ]
 [  89.     371.25     0.       0.   ]]


In [8]:
# Calculation of the cost and savings
from costcla.metrics import savings_score, cost_loss 

results["savings_score"] = savings_score(y_test, y_pred, cost_mat_test)

print(results)

                    f1  accuracy  precision    recall  savings_score
RandomForest  0.131691  0.790036   0.358796  0.080645         0.0392


Interestingly, the model yields almost no savings: the cost of using it is nearly equal to the cost of predicting all the examples as negative.

In [9]:
print("No model  %s" % cost_loss(y_test, np.zeros(y_test.shape), cost_mat_test))
print("RF model  %s" % cost_loss(y_test, y_pred, cost_mat_test))


No model  784510.65
RF model  753757.7875
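
As a quick arithmetic check, the reported savings score can be recovered from the two costs printed above. This assumes that predicting every example as negative is the costless class, i.e., that it is cheaper than predicting every example as positive, which is consistent with the savings score reported earlier.

```python
cost_no_model = 784510.65   # all examples predicted as negative
cost_rf = 753757.7875       # random forest predictions

# Savings = (Cost_l - Cost(f)) / Cost_l
savings_rf = (cost_no_model - cost_rf) / cost_no_model
print(round(savings_rf, 4))  # 0.0392
```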


## References
1. Anderson2007
2. Hand1997
3. Bahnsen2011
4. Beling2005
5. Verbraken2014
6. Alejo2013
7. Oliver2009
8. CorreaBahnsen2014b
9. Lawrence2012
10. Nayak1997
11. Economics2014