<h1><p style='text-align: center'; > IN-STK5000 Credit project </p></h1>
<h4><p style='text-align: center'; > 16-10-2019 <br/> <br/> Bjørn Ivar Teigen, Mathieu Diaz, Jolynde Vis </p></h4>

------

## 1. Introduction

This report presents a machine learning approach to a decision rule that evaluates new applicants for mortgage credit loans and assesses whether they are a good or bad credit risk. Such a decision rule is of particular interest to financial institutions: a machine learning model can handle tens of thousands of loan applications at once, while a bank employee can only process a limited number per day.

While machine learning models are much faster than the human employees of a financial institution (FI), several concerns have to be taken into account. First, the quality of the data used to train the model should be assessed, and the model should be adapted accordingly. There are several ways to account for the uncertainty that comes from limited or biased data and to assess model risk. In this report we explain how to do this, both in the data collection and in the model. Second, the privacy of the individuals in the database, as well as that of new applicants, has to be protected. Since machine learning algorithms work best with precise data about individuals (i.e. exact values rather than coarse categories), there is a tradeoff between the amount of privacy offered in the database and the accuracy of the decision rule. We show this tradeoff and present a way to balance it, depending on the requirements for privacy and accuracy. The third concern is the fairness of the model. While humans have some awareness of what is fair and what is not, a basic machine learning model has no notion of what is racist, sexist or otherwise unfair. The model assesses applicants based on all the data it is given, and may end up discriminating based on, for example, gender or age. We show how to reduce this risk, and that there is also a tradeoff between fairness and accuracy. 



This report presents a case study on the German credit data from the UCI Machine Learning Repository [^1]. Three machine learning models are assessed on their predictive accuracy for this dataset: a k-Nearest Neighbour model, a Neural Network model and a Random Forest model. We show how to select a model for a given dataset, in this case the German credit data. Then we show how to take privacy and fairness into account. 
The structure of the report is as follows: the next section describes the method used to build the model, then we explain the results of the various steps taken to build the model and the model itself, and we end the report with a conclusion and a critical discussion of the limitations.

[^1]: <https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)>

--------

## 2. Method

For this report the dataset ‘German Credit Data’ from UCI is used to train and test the model. This dataset classifies people as good or bad credit risks and each individual is described by twenty attributes. 

### 2.1. Data exploration
We start by exploring the data using histograms and countplots of the variables. This gives an idea of what the dataset looks like and serves as a sanity check on the data-gathering process. Visualising the data makes it easy to spot unusual outliers and missing or corrupted data.

### 2.2. Classification models

Then three different machine-learning models for classification are built. Literature shows that among the best classification models for credit loan classifications are neural networks, k-nearest neighbour and random forests (Yeh and Lien, 2009; Leo et al., 2019). We build these three models and compare them. The implementation details for the models are described below. 

#### 2.2.1 Implementation details
This section details how each classifier is implemented. All models follow the same high-level workflow: all training and data preprocessing is done in the fit() function. We do not perform any optimisation steps outside of the fit() function. We can therefore be confident that each model is re-optimised for whatever training set it is given.

##### K-Nearest Neighbor
We first standardise each column of the dataset to zero mean and unit variance, using the StandardScaler implementation from scikit-learn. 
The most important hyperparameter in the KNN algorithm is k, the number of neighbours. We vary k between 1 and 200 and also vary the weight function (uniform or distance). For this we use GridSearchCV, which evaluates every combination of hyperparameter values with cross-validation: it splits the dataset into 5 folds and selects the combination with the best average validation score. The inputs to the hyperparameter search are listed below:

| Variable | Description | Values |
|:------|:------|:------|
| n_neighbors | A list of integers which represent the <br/> number of neighbors to use by default <br/> for kneighbors queries. | [1..200] |
| weights  | Weight function used in prediction.  | 'uniform', 'distance'  |

<body>
<i><p style='text-align: right'; > Table 1: Inputs hyperparameters search kNN </p></i>
</body>
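The search described above can be sketched as follows. This is a minimal illustration, not the implementation used in the report: the data is a synthetic stand-in generated with `make_classification`, the grid for `n_neighbors` is reduced for speed, and the default accuracy scoring of `GridSearchCV` is assumed.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the credit data (assumption: 300 rows, 10 features).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# The scaler is part of the pipeline, so it is re-fitted on every training fold
# and no information from the validation fold leaks into the preprocessing.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Reduced grid for illustration; the report searches n_neighbors in [1..200].
param_grid = {
    "knn__n_neighbors": list(range(1, 52, 5)),
    "knn__weights": ["uniform", "distance"],
}

# Five-fold cross-validation over every combination in the grid.
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Keeping the scaler inside the pipeline matches the principle that all preprocessing happens inside fit().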



##### Neural Network 
The neural network consists of a batch normalization layer and a fully connected artificial neural network with elu activation functions in all layers except the final one. The last layer is a single output node with a sigmoid activation function. The network is trained with L2 regularization. The batch normalization is run with the standard parameters found in Keras version 2.2.5. 

A hyperparameter grid search is run with five-fold cross-validation to find the best hyperparameters for the neural network. The scoring function for the hyperparameter search is the mean utility, following the formula for E(U) given in section 2.3.4. The inputs to the hyperparameter search are listed below:



| Variable | Description | Values |
|:------|:------|:------|
| layer_sizes | A list of integers which defines the <br/> depth of the network and the size of <br/> each layer | [32, 16], <br/> [64, 16], <br/> [64, 32, 16, 8] |
| alpha  | The coefficient used for regularization | 1, <br/> 0.1, <br/> 0.01, <br/> 0.001, <br/> 0.0001 |

<body>
<i><p style='text-align: right'; > Table 2: Inputs hyperparameters search Neural Network </p></i>
</body>



##### Random Forest

The random forest uses the standard implementation from scikit-learn.
A hyperparameter grid search is run with five-fold cross-validation to find the best hyperparameters for the random forest. The scoring function for the hyperparameter search is the mean utility, following the formula for E(U) given in section 2.3.4. The inputs to the hyperparameter search are listed below:

| Variable | Description | Values |
|:------|:------|:------|
| 'n_estimators' | n_estimators represents the number of trees in the forest. | np.linspace(10, 200).astype(int)|
| 'max_depth' | max_depth represents the depth of each tree in the forest. | [None] + list(np.linspace(3, 20).astype(int)) |
| 'max_features' | max_features represents the number of features to consider when looking for the best split. | ['auto', 'sqrt', None] + list(np.arange(0.5, 1, 0.1)) |
| 'max_leaf_nodes'  | max_leaf_nodes is the total number of terminal nodes (leaves) in a tree. | [None] + list(np.linspace(10, 50, 500).astype(int))   |
| 'min_samples_split' | min_samples_split represents the minimum number of samples required to split an internal node. | 2, 5, 10 |
| 'bootstrap' | bootstrap controls whether each tree is built on a bootstrap sample of the data (True) or on the whole dataset (False). | True, False |

<body>
<i><p style='text-align: right'; > Table 3: Inputs hyperparameters search Random Forest </p></i>
</body>




However, due to some problems with the code and with Python, we could not complete the hyperparameter search for the random forest. We therefore used the Random Forest with n_estimators = 130 and left all other hyperparameters at the scikit-learn default values. 

#### 2.2.2 Cross-validation

Cross-validation is used to estimate the predictive performance of the model on unseen data. Cross-validation repeats the training of the model multiple times, using different parts of the training set as training and validation sets. This gives a more accurate indication of how well the model generalizes to unseen data. It is important that no optimisation is done outside of the cross-validation process, otherwise the results from the cross-validation will not be representative for expected performance on new data.
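The splitting step of cross-validation can be illustrated with a small sketch. The helper `k_fold_indices` is a hypothetical name introduced here for illustration; in practice scikit-learn's own cross-validation utilities would be used.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


# 1000 rows (the size of the German credit data) split into 5 folds.
folds = list(k_fold_indices(n_samples=1000, k=5))
for train_idx, val_idx in folds:
    print(len(train_idx), len(val_idx))
```

Every row appears in exactly one validation fold, so each observation is used for validation once and for training k - 1 times.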

### 2.3. Model evaluation

The three models are evaluated on four scores: 1) the accuracy score, 2) the number of false positives, 3) the AUC-score, and 4) the average utility. All scores are evaluated using 10-fold cross-validation.

#### 2.3.1 Accuracy score 
The accuracy score is the number of correct predictions divided by the total number of predictions. It indicates how well the model predicts overall. The accuracy score has some limitations, however, for example when the majority class makes up 90% of the data. A model that is 100% accurate on the majority class A and 0% accurate on the minority class B still reaches a total accuracy of 90%. The accuracy score gives a good general overview, but we use additional measures to assess the different models.
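The limitation of the accuracy score can be made concrete with a toy example. The labels below are hypothetical and only illustrate the effect of class imbalance.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_accuracy(y_true, y_pred):
    """Accuracy computed separately for each true class."""
    result = {}
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        result[label] = sum(y_pred[i] == label for i in idx) / len(idx)
    return result


# Hypothetical data: 90% of the cases belong to class A.
y_true = ["A"] * 90 + ["B"] * 10
# A model that always predicts the majority class.
y_pred = ["A"] * 100

print(accuracy(y_true, y_pred))            # overall accuracy
print(per_class_accuracy(y_true, y_pred))  # accuracy per class
```

The overall score looks good even though the minority class is never predicted correctly, which is why the per-class view and the confusion matrix are needed.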

#### 2.3.2 Number of false positives

While the accuracy score assesses how many predictions the models get right or wrong, the models are also assessed on where their wrong predictions are located, since some types of errors are more costly than others. In the case of credit loans, the false positives are a problem for the bank. False positives are the errors where the model falsely predicts a positive outcome, in our case that the loan will be repaid. This error is the most important to minimize, since it costs the bank money. If the model falsely predicts a negative outcome, i.e. it predicts that the loan will not be repaid and the credit loan is therefore not provided, the bank does not lose the loaned amount, only the interest it could have earned. A confusion matrix gives the distribution of the predictions. The confusion matrix is defined as follows:

<img src="Images/confusion_matrix.png" alt="Confusion matrix" style="width: 300px;"/>
<i><p style='text-align: right'; > Figure 1: Confusion matrix </p></i>

#### 2.3.3 ROC/AUC-score

The ROC (Receiver Operating Characteristic) curve of the model plots the true positive rate (y-axis) against the false positive rate (x-axis) for varying classification thresholds. A curve closer to the top-left corner indicates a better model, and the AUC-score gives the area underneath the curve. The higher the AUC-score, the better the model is at predicting actual 1s as 1s and 0s as 0s. 
An AUC-score of 1 would mean a perfect model, and an AUC-score of 0.5 would mean that the model is not able to distinguish between the positive and the negative class. The true positive rate (TPR, or recall) and false positive rate (FPR) are defined as follows: 

$$
TPR = \frac {TP} {TP + FN}
$$

$$
FPR = \frac {FP} {FP + TN}
$$ 
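As a small worked example of these two definitions, the sketch below counts the four cells of the confusion matrix and derives TPR and FPR from them. The labels are hypothetical; 1 is taken as the positive class (loan repaid).

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for a binary classification problem."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1  # predicted repaid, actually defaulted
        else:
            if t == positive:
                fn += 1  # predicted default, actually repaid
            else:
                tn += 1
    return tp, fp, tn, fn


def tpr_fpr(tp, fp, tn, fn):
    """True positive rate TP/(TP+FN) and false positive rate FP/(FP+TN)."""
    return tp / (tp + fn), fp / (fp + tn)


# Hypothetical labels and predictions for ten applicants.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 1, 1]

counts = confusion_counts(y_true, y_pred)
print(counts)            # (TP, FP, TN, FN)
print(tpr_fpr(*counts))  # (TPR, FPR)
```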

#### 2.3.4 Utility

Because the goal of a FI (Financial Institution) is to make a profit, individuals are evaluated based on the expected utility (in Deutsche Mark) of giving the loan. If the expected utility is under zero, the loan should be denied, and vice versa, when the expected utility is higher than zero, the loan should be granted. The expected utility is defined as follows: 


$$
E(U) = m((1+r)^n - 1) \cdot p - m(1-p)
$$

Here *p* is the probability that the loan is repaid (i.e. that the applicant is a good credit risk), which is given by the model. *r* is the monthly interest rate, *m* is the amount of the loan, and *n* is the lending period in months. 


### 3.4. Model optimization

#### 3.4.1 Uncertainty

The uncertainty of the model can be expressed using bagging, i.e. bootstrap aggregating. With bootstrapping, a different random sample of the dataset (drawn with replacement) is used each time, and we train a new model on each sample. The aggregated model averages over these versions, and the predicted class is the class with the highest average probability over all the models. Bagging may result in a more stable predictor (Breiman, L., 1996), and it also lets us inspect the class probabilities of the model, which allows us to assess how uncertain the models are when predicting a class. 
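A minimal sketch of this procedure with scikit-learn's `BaggingClassifier` is shown below. The synthetic data, the kNN base model with `n_neighbors=15` and the 25 bootstrap models are assumptions made for illustration, not the settings used in the report.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the credit data (assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 25 kNN models, each trained on its own bootstrap sample of the training set.
bagger = BaggingClassifier(
    KNeighborsClassifier(n_neighbors=15), n_estimators=25, random_state=0
)
bagger.fit(X_train, y_train)

# Class probabilities averaged over the bootstrap models.
proba = bagger.predict_proba(X_test)

# Probability of the predicted class: values near 0.5 mean an uncertain model.
confidence = proba.max(axis=1)
print(confidence[:5])
```

A histogram of `confidence` gives the kind of plot used to compare how certain the different models are.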


### 3.5. Privacy
With the information in the German credit dataset, we can predict whether an applicant will default on their loan. To produce this prediction, the models are trained on the German dataset. However, the dataset contains a lot of information about individuals, some of which is private. Because this is personal information, and since the General Data Protection Regulation came into force in 2018, we are not free to use it however we want.

For example, we have personal details such as age and telephone, but also personal status (married, divorced, single, etc.), financial capacity (existence of a bank account), and more. Some of the data are qualitative and some are numerical. Qualitative information is usually less privacy-sensitive than numerical data: e.g. a telephone attribute stored as a number (+33 6 51...) reveals much more than one stored as a category (yes or no).

In the case where we want to share our dataset with the public, we need to anonymize the data. A famous example occurred in 1997, when a student at MIT found the medical records of Massachusetts Governor William Weld, who had collapsed during a public ceremony. The student linked several different datasets to each other, and the combined dataset exposed personal information.
This is exactly what we want to avoid with our dataset: a privacy attack. The purpose of a privacy attack is to recreate a dataset with personal information or to find information about a specific person in an anonymized dataset. 
To defend against such attacks we explore three different privacy mechanisms: k-anonymity, the Laplace mechanism and randomised responses.

### 3.6. Fairness

We require that the model does not discriminate based on demographic factors such as race, age and gender. This section describes the methods used to measure the fairness of our classifier.

#### 3.6.1 Analytical model of fairness
Calibration and balance are two common metrics for fairness.

Calibration measures how much the actual outcome depends on the sensitive variable given the action of the classifier. A perfectly fair classifier will satisfy the property

$$ 
P(y | a, z) = P(y|a)
$$

Deviation from calibration can be defined like this:

$$
\begin{equation}
\mathrm{F}_{calibration} (\theta, \pi) = \sum_{y,a,z}[ \mathrm{P}_\theta^\pi (y|a,z) - \mathrm{P}_\theta^\pi (y|a) ]^2
\end{equation}
$$

Balance measures whether the actions of our classifier are independent of the sensitive variable, given the actual outcome y. Deviation from balance can be defined in the same way:


$$
\begin{equation}
\mathrm{F}_{balance}(\theta,\pi) = \sum_{y,z,a}[ \mathrm{P}_\theta^\pi (a|y,z) - \mathrm{P}_\theta^\pi (a|y) ]^2
\end{equation}
$$

In both equations, y is the actual outcome, a is the action of our classifier and z is the sensitive variable.
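Both deviations can be estimated from data by replacing the probabilities with empirical frequencies. The sketch below is one possible way to do this, with hypothetical helper names and toy counts; combinations of values that never occur in the data are skipped, which is an assumption of this sketch.

```python
from itertools import product


def cond_prob(rows, target, given):
    """Empirical P(target | given) over a list of dict rows, or None if undefined."""
    matching = [r for r in rows if all(r[k] == v for k, v in given.items())]
    if not matching:
        return None
    hits = sum(all(r[k] == v for k, v in target.items()) for r in matching)
    return hits / len(matching)


def deviation(rows, target_key, given_key, sensitive_key="z"):
    """Sum of squared gaps between P(target | given, z) and P(target | given)."""
    keys = (target_key, given_key, sensitive_key)
    values = {k: sorted({r[k] for r in rows}) for k in keys}
    total = 0.0
    for t, g, z in product(*(values[k] for k in keys)):
        with_z = cond_prob(rows, {target_key: t}, {given_key: g, sensitive_key: z})
        without_z = cond_prob(rows, {target_key: t}, {given_key: g})
        if with_z is not None and without_z is not None:
            total += (with_z - without_z) ** 2
    return total


def calibration_deviation(rows):
    return deviation(rows, target_key="y", given_key="a")


def balance_deviation(rows):
    return deviation(rows, target_key="a", given_key="y")


def make_rows(counts):
    """Expand {(y, a, z): count} into a list of row dicts."""
    return [{"y": y, "a": a, "z": z} for (y, a, z), n in counts.items() for _ in range(n)]


# Toy data: in `fair` both groups z are treated identically; in `unfair`
# repaying applicants (y=1) of group z=1 are granted loans (a=1) less often.
fair = make_rows({(1, 1, 0): 3, (1, 0, 0): 1, (0, 1, 0): 1, (0, 0, 0): 3,
                  (1, 1, 1): 3, (1, 0, 1): 1, (0, 1, 1): 1, (0, 0, 1): 3})
unfair = make_rows({(1, 1, 0): 3, (1, 0, 0): 1, (0, 1, 0): 1, (0, 0, 0): 3,
                    (1, 1, 1): 1, (1, 0, 1): 3, (0, 1, 1): 1, (0, 0, 1): 3})

print(calibration_deviation(fair), balance_deviation(fair))
print(calibration_deviation(unfair), balance_deviation(unfair))
```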

----------


## 4. Results

As explained before, we use the German Credit Data from UCI to train and test the models. 

### 4.1. Exploring and preprocessing the data
The dataset classifies applicants as ‘good or bad credit risk’ (i.e. the outcome variable). Thus the problem is to classify each pattern of variables as either good or bad. Figure 2 shows that there are 700 cases of ‘good’ applicants (value = 1) and 300 cases of ‘bad’ applicants (value = 2). We assume that the data points are independently sampled.


<img src="Images/countplot1.png" alt="Countplot" style="width: 350px;"/>
<i><p style='text-align: right'; > Figure 2: countplot of the outcome variable ‘credit risk’, <br/> where 1 is a good credit risk and 2 a bad credit risk
 </p></i>

The variables in the dataset are as follows: 

| | Attribute | Value |
|:------|:------|:------|
| 1 | Status of existing checking account | Categorical, qualitative |
| 2 | Duration in month | Numerical |
| 3 | Credit history | Categorical, qualitative |
| 4 | Purpose | Categorical, qualitative |
| 5 | Credit amount | Numerical |
| 6 | Savings account/bonds | Categorical, qualitative |
| 7 | Present employment since | Categorical, qualitative |
| 8 | Installment rate (in % of disposable income) | Numerical |
| 9 | Personal status and sex | Categorical, qualitative |
| 10 | Other debtors / guarantors | Categorical, qualitative |
| 11 | Present residence since | Numerical |
| 12 | Property | Categorical, qualitative |
| 13 | Age in years | Numerical |
| 14 | Other installment plans | Categorical, qualitative |
| 15 | Housing | Categorical, qualitative |
| 16 | Number of existing credits at this bank | Numerical |
| 17 | Job | Categorical, qualitative |
| 18 | Number of people being liable to provide <br/> maintenance for | Numerical |
| 19 | Telephone | Categorical, qualitative |
| 20 | Foreign worker | Categorical, qualitative |

<body>
<i><p style='text-align: right'; > Table 4: Variables in the German credit data set </p></i>
</body>




There are 7 numerical variables and 13 categorical, qualitative variables. Because the latter variables are qualitative, each observation is assigned a category label rather than a number. To use the categorical values in the machine learning algorithms, they are one-hot encoded. 
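One-hot encoding replaces a categorical column with one 0/1 column per category. The sketch below shows the idea on the housing codes from the dataset documentation; the helper `one_hot_encode` and the three example rows are hypothetical, and in practice a library routine would typically be used.

```python
def one_hot_encode(rows, column):
    """Replace `column` in each row dict by one 0/1 indicator per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical applicants; A151 = rent, A152 = own, A153 = for free.
rows = [
    {"age": 25, "housing": "A151"},
    {"age": 40, "housing": "A152"},
    {"age": 61, "housing": "A153"},
]

encoded = one_hot_encode(rows, "housing")
for row in encoded:
    print(row)
```

Numerical columns such as age are left unchanged; only the categorical column is expanded.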

We have a look at the feature importance of the data. This graph is made with the feature importance functionality built into the Random Forest model in scikit-learn. A random forest consists of decision trees that repeatedly split the data into subsets on the features that best separate the target variable, so the model can report how much each feature contributes to these splits through its feature importance attribute. 


<img src="Images/feature_importances.png" alt="Feature_importances" style="width: 800px;"/>

<body>
<i> <p style='text-align: right'; > Figure 3: The importance of the features in the German credit data set </p></i>
</body>

The graph shows that duration, amount and age are the three most important features in the dataset that influence the outcome. 


### 4.2. Designing a policy for classifying new applicants as good or bad credit risk 

The choice for giving or denying credit to individuals is based on their probability for being credit-worthy. This probability is given by the machine learning model, and using this probability  and taking into account the length of the loan, we can calculate the expected utility of giving a loan:


$$
E(U) = gain * p-amount*(1-p)
$$

Where amount is the loaned amount, and gain is the total amount of interest on the loan. *p* is the predicted probability of the loan being paid back. The interest is calculated using the following formula:

$$
gain = amount*((1+interestrate)^{duration}-1)
$$

where duration is the loan duration in months, and the interest rate is the monthly return expressed as a fraction (%/100).

We insert the loan variables and the probability predicted by the model into the expected utility function to obtain an expected utility for each application. If the result is greater than 0, that is to say if we expect to make money on the loan, the action takes the value 1: grant the loan. If the expected utility is 0 or negative, the action is to *not* grant the loan. 
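The policy can be sketched in a few lines, assuming the gain on a repaid loan is amount * ((1 + interest_rate)^duration - 1) as described above. The function names and the example loan (1000 DM, 0.5% per month, 12 months) are illustrative choices, not values from the report.

```python
def expected_utility(amount, interest_rate, duration, p_repaid):
    """Expected utility of granting a loan.

    amount: loaned amount, interest_rate: monthly rate as a fraction,
    duration: months, p_repaid: predicted probability of repayment.
    """
    gain = amount * ((1 + interest_rate) ** duration - 1)
    return gain * p_repaid - amount * (1 - p_repaid)


def decide(amount, interest_rate, duration, p_repaid):
    """Return 1 (grant the loan) if the expected utility is positive, else 0."""
    return 1 if expected_utility(amount, interest_rate, duration, p_repaid) > 0 else 0


# Illustrative loan: the same loan is denied or granted depending on p_repaid.
for p in (0.90, 0.97):
    eu = expected_utility(1000, 0.005, 12, p)
    print(p, round(eu, 2), decide(1000, 0.005, 12, p))
```

Because a default loses the whole amount while a repaid loan only earns the interest, the break-even probability is 1 / (1 + interest_rate)^duration, which is close to 1 for short loans at low rates.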

The models are built according to the description in the implementation details (section 2.2.1). 


#### 4.2.1. Model evaluation

To evaluate the three different models we look at how the three models perform on the dataset. While *maximum* revenue can never be ensured (there will always be an error rate in every machine learning model, and therefore there will always be some loss), by choosing a model with a low error on the given data and by minimizing this error, the revenue can be increased. The error can be minimized by optimizing the parameters of the model. 

In order to choose the right model for the dataset, we build the models on 80% of the data, test them on the unseen 20% of the data, and analyse the different scores of the models. 

| | Accuracy score | Confusion matrix | ROC/AUC-score | Utility <br/> (average over 10 <br/> runs, with standard deviation) |
|:------|:------:|:------:|:------:|:-------:|
| kNN Banker | 0.635 | [29  **46**] <br/> [6 127] | 0.750 | 496 (+/- 11823)
| Random Forest Banker | 0.665 | [55 **12**] <br/> [0 133] | 0.765 | 3559 (+/- 3504)
| Neural Network Banker | 0.625 | [16  **51**] <br/> [8 125] | 0.756 | 2383 (+/- 8463)

<body>
<i><p style='text-align: right'; > Table 5: Evaluation scores for the three classification models </p></i>
</body>

The upper-right corner of the confusion matrix (in bold) shows the false positives of the model. We see that the Random Forest model has the highest accuracy score, the fewest false positives, the highest AUC score and the highest average utility with the lowest standard deviation (meaning that this model has the most consistent utility score over 10 runs). For the German credit data the Random Forest seems to fit the data best and is therefore, in this case, the best model. 

#### 4.2.2. Model risk
The risk of the model being wrong can be lowered by critically assessing the model’s performance. Beforehand, performance requirements for the model should be set in terms of accuracy score and acceptable error. These requirements are dependent on the data and the business model of the FI. The utility function can be replaced with a monotonic function to achieve risk-sensitive behaviour. Using a convex or concave function for the utility, we can either take more risk or decrease the risk, respectively. This depends on the aims of the FI. 

After setting these initial requirements, they should be tested. A common measure is to calculate the accuracy score and the confusion matrix of the model, which are discussed above in the model evaluation. By optimizing the hyperparameters of the model these scores can be improved. The best hyperparameters vary between datasets, so optimization should be performed separately on each dataset. Because in our case the optimization is part of the fit, every model uses the best hyperparameters found for the dataset it is given. The confusion matrix can be improved by adjusting the probability threshold. The accuracy score can (in this case) be improved with feature selection and by tuning the hyperparameters of the model. Using the subset of attributes that best explains the relationship between the independent variables and the outcome variable reduces the noise from independent variables that explain the outcome poorly. Feature selection is done by first critically assessing the database, and then using the feature importance scores shown in the previous section (Figure 3). 

With cross-validation, the data is divided into k parts; each part is used once as test data while the remaining parts are used as training data. This gives a more reliable estimate of how well the learned relationships between input and outcome generalize, which helps in selecting a model that also performs well on unseen data. 

Lastly, it is important to keep monitoring the model to make sure it’s working well, and regularly retrain the model when new data comes in, so that the model keeps up to date. We will come back to this later. 


The results of the bagging for the different models, as explained in the method, are shown in Figures 4a and 4b. 

<img src="Images/bagging_prob_hist_kNN.png" alt="Bagging_kNN" style="width: 300px;"/>

<body>
<i> <p style='text-align: right'; > Figure 4a: Histogram of probabilities for kNN </p></i>
</body>

<img src="Images/bagging_prob_hist_RF.png" alt="Bagging_RF" style="width: 300px;"/>

<body>
<i> <p style='text-align: right'; > Figure 4b: Histogram of probabilities for Random Forest </p></i>
</body>



Figure 4a shows the histogram of the kNN-model. We see that the majority of the probabilities are located around 0.5 to 0.7, meaning that the model is not very certain about the predicted class. Figure 4b shows the histogram of the probabilities for the Random Forest model, where we see that the majority of the probabilities is around 0.8 to 0.9, indicating that this model is more certain about the predicted class. For Neural Networks that are built with Keras (such as the one used here), scikit learn’s BaggingClassifier is not available. 

#### 4.2.3. Reliability

The machine learning models are built on the assumption that the rows of the dataset are i.i.d. (independent and identically distributed). When applicants' defaults are correlated, for example during an economic downturn, the probability distributions change. In this case, when one applicant defaults, the probability of another applicant defaulting on their loan increases, i.e. the defaults are dependent. This dependency can be modeled with a copula, which couples the marginal distributions of the events into their joint distribution. 

As shown by Sklar’s Theorem (1959), for any joint distribution over $X_1,...,X_N$ there exists a copula function C that links the joint distribution to its marginals. In the two-dimensional case:

$$
H(x,y) = C(F(x), G(y))
$$

where H is a two-dimensional distribution function with marginal distribution functions F and G.

Using a copula enables the machine learning model to take into account the correlation between the applicants. When the economy is getting worse and the correlation between applicants increases, the model will be less inclined to assign the class ‘good credit risk’ to applicants who are correlated with applicants that are assigned ‘bad credit risk’. 
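To illustrate the effect of such a dependency, the sketch below simulates the defaults of two applicants with a one-factor Gaussian copula. This is an illustrative choice of copula, not a method used in the report: a shared "market" factor correlates the two applicants, while each applicant keeps the same marginal default probability. The parameter values are assumptions.

```python
import random
from statistics import NormalDist


def simulate_joint_defaults(p_default, rho, n_trials, seed=0):
    """Return (marginal default rate of applicant 1, joint default rate).

    Both applicants default with probability p_default; rho in [0, 1] is the
    correlation of the latent normal variables (Gaussian copula).
    """
    rng = random.Random(seed)
    phi = NormalDist().cdf
    a, b = rho ** 0.5, (1 - rho) ** 0.5
    first = both = 0
    for _ in range(n_trials):
        market = rng.gauss(0, 1)  # factor shared by both applicants
        z1 = a * market + b * rng.gauss(0, 1)
        z2 = a * market + b * rng.gauss(0, 1)
        # phi(z) is uniform on [0, 1], so each marginal stays Bernoulli(p_default).
        d1 = phi(z1) < p_default
        d2 = phi(z2) < p_default
        first += d1
        both += d1 and d2
    return first / n_trials, both / n_trials


independent = simulate_joint_defaults(p_default=0.3, rho=0.0, n_trials=20000)
downturn = simulate_joint_defaults(p_default=0.3, rho=0.8, n_trials=20000)
print(independent)
print(downturn)
```

The marginal default rate stays near 0.3 in both runs, but the probability that both applicants default is clearly higher when the correlation is high, which is the situation the i.i.d. assumption misses.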

Following on from this, it is important that the machine learning model supports online learning. Online machine learning means that new data arrives sequentially and that the model is updated at each step in order to predict future data as well as possible. This way the model stays up to date with current events, can recognize correlation between applicants sooner, and can adapt to new patterns in the data. 

#### 4.2.4. Limited and biased data

Cross-validation, as explained before, is also a good way to deal with limited data. Rather than splitting the data into a training and a test part, where we would end up with a small test subset to test the model on, the whole dataset is used to both train and test over different runs, therefore making optimal use of the whole dataset. Moreover, using bagging, we create multiple models on subsets of the dataset and average them, thereby using the dataset more intensively. 

Uncertainty arising from biased data is hard to take into account. By critically assessing the origin of the data and how it was collected, biases in the dataset can be estimated. If these biases are clear up front, the design of the model can be adapted so that it responds as little as possible to them. Moreover, it is important that each FI trains the model on its own data. The clientele of different FIs can be very different, and by using data on its own clientele an FI obtains predictions that apply to that clientele. Also, when the model is trained on the FI's own data, the sampling bias is consistent, because the data is always gathered in the same way. 

After making the model, it should be checked whether it reproduces societal biases. Marr (Forbes, 2019) lists steps that could be taken to minimize the risk of preserving societal biases in AI. Among others, the article focuses on ensuring that the algorithm is coded so that it doesn’t respond to societal biases. More specifically, this means that when designing a machine learning algorithm, it is important, first, to choose the training sample carefully. In other words, make sure the sample is representative of the population you are making predictions for. Since we don’t collect our own data for this report this point is difficult to address, but when using feature selection it is important to consider the population we are predicting for. Second, in feature selection, we have to make sure we only exclude features that don’t influence the outcome. 

Another way to reduce bias in the model is to monitor its performance, so that we notice when the model responds to societal biases. Also, the sampling process introduces bias which cannot be corrected by using a test set. To correct for changes in the sampling process and/or the population of customers, economic trends, etc., it is important to monitor the model in production. The reliability results from testing on the test set should be treated as a best-case scenario: they represent how well the model performs given that the sampling process and the underlying distribution remain constant.

### 4.3. Privacy

The best way to avoid privacy attacks is to share our data through an API. The API acts as a black box for the user, because we do not allow users to access the data directly. First, we apply an anonymization algorithm to our data, and only then do we send information to the user. The API design raises some questions. Should we return a complete row or column of information? Or only an average, minimum or maximum of the requested data? In general, the less information we give, the better the privacy of the individuals in the dataset.

We choose three mechanisms to protect the privacy of the users in the database:
* To protect the data we can, for example, group precise numerical values into intervals. This generalisation is a common way to work towards k-anonymity: a database provides k-anonymity if every person in the database is indistinguishable from at least k - 1 other persons with respect to the quasi-identifiers. For example, exact ages can be replaced with age groups (15-19, 20-24, 25-29, ...). Repeating this process on all numerical data makes it more difficult to identify our customers. The way we group the data matters as well. For age, we use the following classes: 18-24, 25-34, 35-49, 50-64, 65+. 
    
    The function that we implement is privacy_step(X_one_column). It takes as parameter an array X_one_column containing the column that we want to anonymize. For instance, calling it on X['age'] returns a new array with intervals instead of numerical values.


* Another way to anonymize the data is to add noise to it. The more privacy we want, the more noise we add, at the cost of accuracy and utility. In our case, we use the Laplace mechanism.
    
    The function privacy_epsilon(X_one_column,epsilon) takes as parameters an array X_one_column containing the column that we want to anonymize and the privacy parameter epsilon (a smaller epsilon means more noise and more privacy). For instance, calling it on X['age'] returns a new array in which noise has been added to each value.


* The last mechanism that we implement is an algorithm for randomising responses.
    The principle is to flip a coin: if it comes up heads, we keep the true value; otherwise, we replace the value with a random one.
    The function is privacy_step_coin(X_one_column,p) and it returns an array with anonymized data.

    For example, we can use this mechanism if we ask our users whether they think they can repay the loan. They can then answer without concern, since any individual answer may have been randomised, and that can give us additional information. The drawback is that, with a fair coin, about half of the recorded answers are random rather than truthful.
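The three mechanisms above can be sketched as follows. These are hypothetical stand-ins written for illustration, not the report's privacy_step, privacy_epsilon and privacy_step_coin functions: the names, the sensitivity of 1.0 and the binary answers in the randomised response are all assumptions.

```python
import bisect
import random

# Lower edges of the age classes 25-34, 35-49, 50-64 and 65+ listed above.
AGE_EDGES = [25, 35, 50, 65]
AGE_LABELS = ["18-24", "25-34", "35-49", "50-64", "65+"]


def bin_ages(ages):
    """Generalise exact ages to age classes (ages below 25 fall in '18-24')."""
    return [AGE_LABELS[bisect.bisect_right(AGE_EDGES, age)] for age in ages]


def laplace_mechanism(values, epsilon, sensitivity=1.0, rng=None):
    """Add Laplace noise with scale sensitivity / epsilon to every value."""
    rng = rng or random.Random()
    rate = epsilon / sensitivity
    # The difference of two exponential draws with the same rate is Laplace noise.
    return [v + rng.expovariate(rate) - rng.expovariate(rate) for v in values]


def randomised_response(values, p_truth=0.5, rng=None):
    """Keep each 0/1 answer with probability p_truth, else draw it at random."""
    rng = rng or random.Random()
    return [v if rng.random() < p_truth else rng.randint(0, 1) for v in values]


ages = [19, 24, 25, 49, 50, 70]
print(bin_ages(ages))
print(laplace_mechanism(ages, epsilon=0.5, rng=random.Random(0)))
print(randomised_response([1, 0, 1, 1, 0], p_truth=0.5, rng=random.Random(0)))
```

In all three cases a single parameter (the width of the classes, epsilon, or the coin probability) controls how much accuracy is traded for privacy.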


The anonymization algorithms are very useful for protecting the privacy of the users, but they add uncertainty to the data. As a result, the accuracy and the utility of our model are not as good as with the original data. We therefore need to find the right balance between the accuracy and the privacy of the model.


### 4.4. Fairness

We will look at the calibration of decisions with respect to age. Looking through the documentation of the data we have found that the following variables are sensitive:

* Attribute 13: (numerical) Age in years

Other variables, such as "Present residence since" and "Present employment since", are likely to be correlated with age. The dataset is skewed towards younger applicants. The figure below shows how the ages are distributed.



<img src="Images/distribution_of_age.png" alt="dist_age" style="width: 350px;"/>

<body>
<i> <p style='text-align: right'; > Figure 5: Distribution of age variable </p></i>
</body>

Balancing across many groups is unwieldy, so we sort the dataset by the age column and split it into three equal-sized groups. The resulting age brackets are [19,28), [28, 38) and [39,75), with 333-334 people in each group.
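As an illustration of this split, the following sketch sorts the samples by value and cuts the sorted order into equal-sized groups. The function name and the use of `np.array_split` are assumptions and need not match the code behind this report.

```python
import numpy as np


def split_into_equal_groups(values, n_groups=3):
    """Return a group label (0 .. n_groups - 1) for every sample.

    Samples are sorted by value and the sorted order is cut into
    n_groups chunks whose sizes differ by at most one.
    """
    values = np.asarray(values)
    order = np.argsort(values, kind="stable")  # indices of samples, youngest first
    labels = np.empty(len(values), dtype=int)
    for group, idx in enumerate(np.array_split(order, n_groups)):
        labels[idx] = group
    return labels


# Toy example with seven hypothetical ages
ages = np.array([30, 19, 45, 22, 60, 28, 75])
groups = split_into_equal_groups(ages)
```

Because the cut is made on the sorted order rather than on fixed age limits, samples with the same age can end up in two neighbouring groups.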

We now look at how some of the variables are distributed in the different age segments. First the employment variable:

    * Attribute 7:  (qualitative)
         Present employment since
          A71 : unemployed
          A72 :       ... < 1 year
          A73 : 1  <= ... < 4 years  
          A74 : 4  <= ... < 7 years
          A75 :       .. >= 7 years

<img src="Images/age_employment.png" alt="age_employment" style="width: 600px;"/>

<body>
<i> <p style='text-align: right'; > Figure 6: Distribution of age in the employment variable </p></i>
</body>

We see that more older people are unemployed or have been in their job for a long time, whereas more young people have only recently started their current job.

We now look at housing:

    * Attribute 15: (qualitative)
          Housing
          A151 : rent
          A152 : own
          A153 : for free


<img src="Images/age_housing.png" alt="age_housing" style="width: 400px;"/>

<body>
<i> <p style='text-align: right'; > Figure 7: Distribution of age in the housing variable </p></i>
</body>

There are clear differences between the groups on housing. Very few young people have free housing, and relatively few old and middle-aged people rent. This means we have a dependency between the age variable and the housing variable.

Finally, we look at whether there are differences between the groups in how often loans are paid back.

    * 1 : Loan was repaid
      2 : Loan was defaulted


<img src="Images/age_default.png" alt="age_default" style="width: 250px;"/>

<body>
<i> <p style='text-align: right'; > Figure 8: Distribution of age in the outcome variable </p></i>
</body>

As can be seen, a higher percentage of young people default on their loans.

In conclusion, we expect the loan decisions to differ between the groups, and we need to estimate how much of that difference is due to age and how much is due to other factors.

#### 4.4.1 Measuring fairness
We have chosen to measure fairness in terms of both calibration and balance with respect to age. 

We calculate the deviation from fairness using a frequentist approach. For a given set of $y$, $a$ and $z$ we calculate $P(y | a,z)$ by selecting the samples where our classifier decides $a$ and the age is in group $z$. We then calculate what fraction of this selection has “repaid” = $y$. The process is similar for the other probabilities. We then insert the calculated probabilities in the equations for calibration and balance defined in section 3.6. This analysis yields the following results:

| | Calibration | Balance |
|:------|:------:|:------:|
| Random Banker | 0.0017 | 0.0 |
| kNN Banker  | 0.0166 | 0.0310 |
| Random Forest Banker| 0.0097 | 0.0186 |
| Neural Network Banker | 0.0041 | 0.0146 |

<body>
<i><p style='text-align: right'; > Table 6: Calibration and balance scores for age variable </p></i>
</body>

As expected, the Random Banker has the best fairness. The kNN Banker is less fair than the Random Forest Banker and the Neural Network Banker. Of our three algorithms, the Neural Network Banker (`NeuralBankerGridSearch`) is the fairest with respect to age.
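The counting procedure described in this section could look roughly like the sketch below. It assumes NumPy arrays `y` (outcomes), `a` (decisions) and `z` (group labels), and it measures calibration as $\sum_{y,a,z} [P(y|a,z) - P(y|a)]^2$. Treating balance as the same expression with the roles of `y` and `a` swapped is an assumption, since the exact definitions are those of section 3.6, and any normalisation behind the reported scores is not reproduced here. The function name is hypothetical.

```python
import numpy as np


def fairness_deviation(target, cond, z):
    """Estimate sum over (target, cond, z) of [P(target|cond,z) - P(target|cond)]^2.

    Probabilities are estimated by counting: select the samples that match
    the conditioning values, then take the fraction with the target value.
    """
    target, cond, z = np.asarray(target), np.asarray(cond), np.asarray(z)
    total = 0.0
    for c_val in np.unique(cond):
        in_cond = cond == c_val
        for t_val in np.unique(target):
            p_marginal = np.mean(target[in_cond] == t_val)
            for z_val in np.unique(z):
                selected = in_cond & (z == z_val)
                if selected.any():  # skip empty combinations
                    p_group = np.mean(target[selected] == t_val)
                    total += (p_group - p_marginal) ** 2
    return total


# Toy data: every applicant gets the loan, but only group 0 repays
y = np.array([1, 1, 0, 0])
a = np.array([1, 1, 1, 1])
z = np.array([0, 0, 1, 1])
calibration = fairness_deviation(y, a, z)  # P(y | a, z) versus P(y | a)
balance = fairness_deviation(a, y, z)      # P(a | y, z) versus P(a | y), assumed form
```

In this toy example the decision is the same for everyone, so the balance deviation is zero, while the calibration deviation is large because repayment depends entirely on the group.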

We can also take the loan amount into account. We do this in a similar way, by dividing the dataset into three equal-sized groups based on amount. The dataset is thus sliced into nine parts containing all combinations of the three age groups and the three amount groups.
With amount included, the fairness results are:

| | Calibration | Balance |
|:------|:------:|:------:|
| Random Banker | 0.1065 | 0.0 |
| kNN Banker  | 0.4249 | 0.2871 |
| Random Forest Banker| 0.1479 | 0.1656 |
| Neural Network Banker | 0.1576 | 0.1930 |

<body>
<i><p style='text-align: right'; > Table 7: Calibration and balance scores for age and amount variables </p></i>
</body>

Including amount as a sensitive variable for fairness increases the deviation from fairness for all classifiers (for the random classifier only the calibration score increases; its balance score stays at 0). We believe it is unwise to include amount as a fairness variable, because bigger loans entail more risk to the bank and it is therefore justified to discriminate based on the loan amount.
If more fairness is needed, it is possible to optimise the classifier for more fairness at the cost of some accuracy. Details of this process are outlined below.

##### Finding a good balance between fairness and utility

We now want to find a policy $\pi$ which is optimised for both fairness and utility. The utility-fairness trade-off can be modelled like this:


$$
\begin{equation}
\mathrm{V} (\lambda, \theta, \pi) = (1 - \lambda) * U(\theta, \pi) - \lambda F (\theta, \pi)
\end{equation}
$$

Here $\lambda$ governs how important fairness is, $\theta$ denotes the parameters of the classifier and $\pi$ is the classifier, i.e. the decision rule. $U(\cdot)$ is a function measuring the utility of the decision function and $F(\cdot)$ is a function measuring its deviation from fairness, so a lower $F$ means a fairer decision function.

At least for some of our classifiers, for instance the neural network, we already have the gradient of $U(\theta, \pi)$ with respect to $\pi$.
We therefore focus on the gradient $\nabla_\pi F(\theta, \pi)$. $\pi$ is our rule for making decisions $(a)$ based on observations $(x)$, so $a = \pi(x)$. That gives us the following:

$$
\begin{equation}
\mathrm{\nabla}_\pi F(\theta, \pi) = \nabla_\pi \sum_{y,a,z} [\mathrm{P}_\theta^\pi (y|a,z) - \mathrm{P}_\theta^\pi(y|a)]^2
\end{equation}
$$

$$
\begin{equation}
 = \sum_{y,a,z} \nabla_\pi [\mathrm{P}_\theta (y|\pi(x),z) - \mathrm{P}_\theta(y|\pi(x))]^2
\end{equation}
$$

$$
\begin{equation}
= 2 \sum_{y,a,z} [\mathrm{P}_\theta (y|\pi(x),z) - \mathrm{P}_\theta(y|\pi(x))] \, \nabla_\pi [\mathrm{P}_\theta (y|\pi(x),z) - \mathrm{P}_\theta(y|\pi(x))]
\end{equation}
$$

Our decision rule $\pi$ is *"Give the loan if the expected return is more than 0, and reject it otherwise"*. This means that it is basically a step function which is either 1 or 0 for all $x$. Consequently the derivative of $\pi(x)$ is zero wherever it is defined, so the gradient gives no direction in which to improve. It is still possible to optimise $V(\lambda, \theta, \pi)$, but we need another method. Genetic algorithms [Chapter 12, Marsland, 2012] are well suited to this because they only require a scoring function. Due to time constraints we have not implemented this optimisation.
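Since this optimisation was not implemented, the following is only a rough sketch of what a gradient-free search of this kind could look like. It uses a very small genetic algorithm (selection of the best half, uniform crossover, Gaussian mutation, and keeping the single best individual) on a real-valued parameter vector. The function names, the population settings and the toy scoring function are all assumptions for illustration; in practice the score would be $V$ computed from the measured utility and fairness deviation of a banker.

```python
import numpy as np


def value(lam, utility, unfairness):
    """V = (1 - lambda) * U - lambda * F, as in the trade-off equation."""
    return (1 - lam) * utility - lam * unfairness


def genetic_search(score, n_params, pop_size=20, n_generations=30, seed=0):
    """Maximise score(params) using only function evaluations (no gradients)."""
    rng = np.random.default_rng(seed)
    population = rng.normal(size=(pop_size, n_params))
    for _ in range(n_generations):
        scores = np.array([score(p) for p in population])
        order = np.argsort(scores)[::-1]
        parents = population[order[: pop_size // 2]]  # selection: best half
        # Crossover: every child takes each gene from one of two random parents
        idx_a = rng.integers(len(parents), size=pop_size - 1)
        idx_b = rng.integers(len(parents), size=pop_size - 1)
        mask = rng.random((pop_size - 1, n_params)) < 0.5
        children = np.where(mask, parents[idx_a], parents[idx_b])
        children = children + rng.normal(scale=0.1, size=children.shape)  # mutation
        # Elitism: the best individual survives unchanged
        population = np.vstack([parents[:1], children])
    scores = np.array([score(p) for p in population])
    return population[np.argmax(scores)]


# Toy score made of step functions, so its gradient is zero almost everywhere
def toy_score(params, lam=0.5):
    utility = float(params[0] > 1.0)      # hypothetical utility
    unfairness = float(params[1] > 0.0)   # hypothetical fairness deviation
    return value(lam, utility, unfairness)


best = genetic_search(toy_score, n_params=2)
```

Because the best individual is always carried over, the best score in the population never decreases from one generation to the next.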

---------

## 5. Conclusion

In this report we presented three different machine learning models that are able to evaluate new applicants for mortgage credit loans and to assess whether they are a good or bad credit risk. To do so we used the German Credit Dataset from UCI. 

We started by analysing the dataset and the importance of the different features. Then we built the three models: a k-Nearest Neighbor model, a Neural Network and a Random Forest. The models are optimized within their fit function, meaning that each model is optimized for every dataset it is trained on. We evaluated the three models on four different measures: the accuracy score, the confusion matrix and the model’s number of false positives, the ROC/AUC score, and the average utility and its standard deviation. All metrics are evaluated using 10-fold cross-validation. We showed that for the German credit data the Random Forest was the best fit, but for a different dataset another model might perform best. Different aspects of model risk, uncertainty and reliability are discussed. 

For privacy, we’ve shown that there are three options for improving the privacy of both existing and new applicants in the dataset: k-anonymity, the Laplace mechanism and randomised responses. We argue that the anonymization algorithms add noise (uncertainty) to the data, and that there is a tradeoff between accuracy and privacy in the model. Using an API when sharing the data protects privacy. 

For fairness we chose age as the sensitive variable and measured the calibration and balance of the classifiers with respect to this variable. We divided the dataset into three equal-sized groups and looked at how these groups were represented in other variables (e.g. employment, housing and the outcome variable). We showed how the different models score on fairness and saw that for the German credit data the Neural Network scores best. We have outlined an algorithm to find a balance between fairness and utility. 

In this report we have presented a method to design a decision rule for giving loans to individuals. Using the German credit data as an example, we show how this method can be implemented, and how to choose and optimize a model to maximise expected utility while taking into account risk, uncertainty, privacy and fairness. FIs can implement this model using data from their own customers, following the method described here. Once a model is deployed, it needs to be monitored to make sure that the predictions stay accurate as new data comes in. 


----------

## 6. Discussion

Using a machine learning model to predict whether new applicants will default on their loan has a few limitations, as does this report in particular. 

First of all, any machine learning model is only as good as the data it is trained on. If the data sample is not representative of the population, the model won’t make useful predictions about the population, and any sampling biases in the data will carry over into the model. It is therefore important that the data collection process is well understood. For this report we use the German credit data, which rests on some assumptions we have no knowledge of. For example, the outcome is labeled as ‘good or bad credit risk’, but we don’t know which criteria these labels are based on, and we have to assume that the labeling process is accurate. The dataset also contains multiple quantitative variables that have been turned into categories, which means that the data is generalized and possibly biased. When a bank uses its own in-house dataset of applicants, it knows how these labels and categories were created, but it is still important to think critically about how everything is labeled and categorized. It is also important, when implementing the model, to train and test it on one's own data. Characteristics differ greatly among continents, countries, regions and even among different banks. Implementing a model that has been trained on German credit data in, for example, the US or Japan would therefore be risky. Lastly, the German credit dataset was created in 1994, meaning it is no longer up to date. Nowadays, some of the independent variables influencing whether an applicant is a good or bad credit risk would be different. By starting with an elaborate dataset of possible features and then applying feature selection, the currently relevant independent variables can be identified.

Second, it is important to note that our models output probabilities of an applicant belonging to a class. These probabilities represent the uncertainty of the model, and normally the model would assign the applicant to the class with the highest probability. In this case, however, the model chooses its action based on whether the expected utility is positive. The model is therefore not built to classify applicants into one of the classes, but is optimized to make a profit (the highest possible utility score), based on the probability of an applicant being a good credit risk. 

Lastly, an important limitation is that macroeconomic aspects, such as inflation, are ignored in this analysis, resulting in an optimistic model. 

----------

## 7. References

Breiman, L. (1996). Bagging predictors. *Machine learning, 24*, 123-140. 

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Leo, M., Sharma, S., Maddulety, K. (2019). Machine learning in banking risk management: a literature review. *Risks, 7*(29), 2-22.

Marr, B. (2019). Artificial intelligence has a problem with bias, here's how to tackle it. *Forbes.* Retrieved on 25-11-2019 from https://www.forbes.com/sites/bernardmarr/2019/01/29/3-steps-to-tackle-the-problem-of-bias-in-artificial-intelligence/#6cff50b47a12 

Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. *Expert Systems with Applications, 36*(2), 2473-2480.

-------