# Loan repayment prediction

Data location: https://drive.switch.ch/index.php/s/pCy5ctcFRsM2RdZ

Consider the file `HomeCredit_train.csv`, which contains anonymized data shared by a financial institution (https://www.kaggle.com/c/home-credit-default-risk/).

Each row is a loan.  See `HomeCredit_description.csv` for a description of the columns (note that we only have a subset of the columns).  The target variable (column `TARGET`) contains 1 if the client had issues repaying the loan, 0 otherwise.

If you need it for faster experiments, the file  `HomeCredit_train_small.csv` contains only a small part of the training data.

## Exercise 1: exploratory analysis
Open the dataset using Pandas.

### 1.1
What fraction of the loans is not repaid?

### 1.2
Choose 3 variables you like and whose meaning you understand. Make one or two plots for each to describe its distribution (univariate analysis), and to check whether there is an obvious relation to the target variable (bivariate analysis).

In [None]:
# 1.1 
import pandas as pd

#df = pd.read_csv("HomeCredit_train_small.csv")
df = pd.read_csv("HomeCredit_train.csv")
df_blind = pd.read_csv("HomeCredit_test_blind.csv")
m = df["TARGET"].mean()
print(f"Fraction of loans not repaid = {m:.1%}")

In [None]:
# 1.2
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots

df_unpaid = df[df['TARGET']==1]
df_paid = df[df['TARGET']==0]

def unpaid_comparison(df, df_unpaid, col, title):
    # Count how often each unique value occurs (value_counts returns a Series)
    data = df[col].value_counts()
    f1 = go.Bar(x=data.values/np.sum(data)*100, y=data.index, orientation='h',
                marker_color='rgba(0, 0, 255, 0.7)')

    data_unpaid = df_unpaid[col].value_counts() / data * 100.0
    data_unpaid.sort_values(ascending=False, inplace=True)
    f2 = go.Bar(x=data_unpaid.values, y=data_unpaid.index, orientation='h', 
                marker_color='rgba(255, 0, 0, 0.7)')
    
    fig = make_subplots(shared_yaxes=False, rows=1, cols=2, horizontal_spacing=0.3)
    fig.add_trace(f1, row=1, col=1)
    fig.add_trace(f2, row=1, col=2)
    fig.update_yaxes(row=1, col=1, autorange="reversed")
    fig.update_xaxes(title_text="% of all loans", row=1, col=1)
    fig.update_yaxes(row=1, col=2, autorange="reversed")
    fig.update_xaxes(title_text="% of loans not repaid", row=1, col=2)
    fig.update_layout(title_text=title)
    fig.update_traces(showlegend=False)
    fig.show()

# df_unpaid (defined above) holds the rows with TARGET == 1
unpaid_comparison(df, df_unpaid, 'CODE_GENDER', 'Gender')
unpaid_comparison(df, df_unpaid, 'OCCUPATION_TYPE', 'Occupation Type')
unpaid_comparison(df, df_unpaid, 'NAME_HOUSING_TYPE', 'Housing Type')
unpaid_comparison(df, df_unpaid, 'NAME_EDUCATION_TYPE', 'Education')

#F    9247/131546
#M    6962/68452
#XNA  0/2
#dfp = df_unpaid['CODE_GENDER'].value_counts() / df['CODE_GENDER'].value_counts() 


## Exercise 2: preparing a training and a validation dataset

### 2.1
Randomly split the 200k rows of your dataset in two groups; keep 150k rows for training and use 50k for validating your models.

### 2.2
Choose three variables that are already numeric.  Build the following numpy arrays:
- `X_tr` (2 dimensions: 150k rows, 3 columns)
- `y_tr` (1 dimension: 150k elements)
- `X_val` (2 dimensions: 50k rows, 3 columns)
- `y_val` (1 dimension: 50k elements)
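
One possible way to do the split and build the four arrays is sketched below. It runs on a small synthetic DataFrame (`df_toy`, a hypothetical stand-in) so that it is self-contained; the three feature names are an assumption about which numeric columns are present in your subset, so check them against `HomeCredit_description.csv`. On the real data, also check the chosen columns for missing values, since most sklearn classifiers do not accept NaN.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; on the real file use
# df = pd.read_csv("HomeCredit_train.csv") instead of df_toy.
rng = np.random.default_rng(0)
n_rows = 2000
df_toy = pd.DataFrame({
    "AMT_INCOME_TOTAL": rng.lognormal(12, 0.5, n_rows),
    "AMT_CREDIT": rng.lognormal(13, 0.6, n_rows),
    "DAYS_BIRTH": -rng.integers(7000, 25000, n_rows),
    "TARGET": (rng.random(n_rows) < 0.08).astype(int),
})

# Assumed choice of three numeric columns (hypothetical, adapt as needed).
features = ["AMT_INCOME_TOTAL", "AMT_CREDIT", "DAYS_BIRTH"]

# Shuffle the row positions once, then cut at 75%
# (that is 150k out of 200k rows on the real data).
perm = rng.permutation(len(df_toy))
n_tr = int(0.75 * len(df_toy))
idx_tr, idx_val = perm[:n_tr], perm[n_tr:]

X_tr = df_toy.iloc[idx_tr][features].to_numpy()
y_tr = df_toy.iloc[idx_tr]["TARGET"].to_numpy()
X_val = df_toy.iloc[idx_val][features].to_numpy()
y_val = df_toy.iloc[idx_val]["TARGET"].to_numpy()

print(X_tr.shape, y_tr.shape, X_val.shape, y_val.shape)
```

`sklearn.model_selection.train_test_split` would do the same job in one call; the manual permutation is shown only to make the mechanics explicit.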

## Exercise 3: training and scoring simple models

### 3.1
Train a K-Nearest-Neighbors classifier (use the sklearn function) and compute its accuracy on the validation set.  Compare it with the accuracy of a classifier that always returns 0.  Comment.
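
As a rough sketch of this comparison, the snippet below trains `KNeighborsClassifier` on synthetic, imbalanced data (generated with `make_classification`, an assumption made only to keep the example self-contained) and compares its accuracy with that of a rule that always predicts 0. The value `n_neighbors=5` is simply the sklearn default, not a recommendation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in with roughly 8% positives, mimicking the class imbalance.
X, y = make_classification(n_samples=4000, n_features=3, n_informative=3,
                           n_redundant=0, weights=[0.92, 0.08], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_tr, y_tr)
acc_knn = knn.score(X_val, y_val)   # fraction of correct predictions

# Accuracy of the trivial "always predict 0" rule.
acc_zero = float(np.mean(y_val == 0))

print(f"KNN accuracy:      {acc_knn:.3f}")
print(f"Always-0 accuracy: {acc_zero:.3f}")
```

With strongly imbalanced classes the always-0 rule already scores high, which is why accuracy alone says little about how useful a classifier is here.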

### 3.2
Train the classifier with K=20, and use the `predict_proba(...)` function on the trained classifier to obtain a *score* for each instance in the validation set.  Consider the distribution of the scores returned for the instances of the validation set.

Describe what TP, TN, FP and FN mean in this context.

Write a function that given a threshold, computes the number of TP, TN, FP, FN. 

How would you describe what the TPR and FPR are in this context?
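
A minimal sketch of such a function is shown below. It assumes that the positive class is "client had issues repaying" (`TARGET == 1`) and that an instance is predicted positive when its score is greater than or equal to the threshold; both conventions are choices, not requirements. The name `confusion_counts` is hypothetical.

```python
import numpy as np

def confusion_counts(y_true, scores, threshold):
    """Return (TP, TN, FP, FN), predicting 1 when score >= threshold."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(scores) >= threshold
    tp = int(np.sum(y_pred & y_true))     # flagged, and really had issues
    tn = int(np.sum(~y_pred & ~y_true))   # not flagged, and repaid
    fp = int(np.sum(y_pred & ~y_true))    # flagged, but repaid
    fn = int(np.sum(~y_pred & y_true))    # not flagged, but had issues
    return tp, tn, fp, fn

# Tiny hand-made example.
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.05, 0.6])
tp, tn, fp, fn = confusion_counts(y_true, scores, threshold=0.5)

tpr = tp / (tp + fn)   # share of problematic loans that were flagged
fpr = fp / (fp + tn)   # share of repaid loans that were wrongly flagged
print(tp, tn, fp, fn, tpr, fpr)
```

On the real data, `scores` would be `clf.predict_proba(X_val)[:, 1]`, the column that corresponds to class 1.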


### 3.3
Using your function defined above, compute the TPR and FPR for a large number of different thresholds.  Plot the ROC curve.

Using the [appropriate sklearn function](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html), compute the AUC of your classifier.
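
The sketch below shows one way to sweep the thresholds and to cross-check the result against `roc_auc_score`. The scores are synthetic (an assumption made to keep the example self-contained); on the real data they would come from `predict_proba`. The area is computed with a hand-written trapezoidal rule, so it only approximates the sklearn value because the threshold grid is finite.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_val = (rng.random(2000) < 0.1).astype(int)
# Synthetic scores: positives tend to score a bit higher than negatives.
scores = np.clip(rng.normal(0.2 + 0.2 * y_val, 0.15), 0, 1)

# Descending thresholds, so the curve runs from (0, 0) to (1, 1).
thresholds = np.linspace(1.01, -0.01, 200)
pos = y_val == 1
neg = ~pos
tpr = np.array([np.mean(scores[pos] >= t) for t in thresholds])
fpr = np.array([np.mean(scores[neg] >= t) for t in thresholds])

# Trapezoidal area under the (fpr, tpr) points.
auc_manual = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
auc_sklearn = roc_auc_score(y_val, scores)
print(f"manual AUC = {auc_manual:.3f}, sklearn AUC = {auc_sklearn:.3f}")

# To draw the curve with plotly:
# import plotly.graph_objects as go
# go.Figure(go.Scatter(x=fpr, y=tpr, mode="lines")).show()
```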

### 3.4
Draw the ROC curve and compute the AUC value for two "dummy" classifiers:
- one that always returns a score of 0 for each sample
- one that returns a random score for each sample
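
The two dummy classifiers need no training at all, since they are fully described by their score arrays. The following sketch, using synthetic labels as a stand-in for `y_val`, computes their AUC with `roc_auc_score` and the points of the constant classifier's ROC curve with `roc_curve`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
y_val = (rng.random(5000) < 0.08).astype(int)   # synthetic labels, ~8% positives

scores_zero = np.zeros(len(y_val))     # always returns a score of 0
scores_rand = rng.random(len(y_val))   # uniform random score in [0, 1)

auc_zero = roc_auc_score(y_val, scores_zero)
auc_rand = roc_auc_score(y_val, scores_rand)

# With a constant score, the ROC curve has only the two end points.
fpr_zero, tpr_zero, _ = roc_curve(y_val, scores_zero)

print(f"AUC constant = {auc_zero:.3f}, AUC random = {auc_rand:.3f}")
print(fpr_zero, tpr_zero)
```

Both curves lie on or close to the diagonal, which gives a reference line for judging the KNN classifier.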

## Exercise 4: training better models (optional)

### 4.1
Normalize the data and repeat the analysis. Is the accuracy better?
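
One common way to normalize is `StandardScaler`, chained with the classifier through `make_pipeline` so that the scaling parameters are learned on the training set only. The sketch below uses synthetic data constructed on purpose with one informative small-scale feature and one pure-noise large-scale feature (an assumption chosen to make the effect visible); it reports AUC, but `score(...)` would give the accuracy in the same way.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 3000
y = (rng.random(n) < 0.3).astype(int)
X = np.column_stack([
    rng.normal(2.0 * y, 1.0),    # informative feature, small scale
    rng.normal(0.0, 1e5, n),     # pure noise, very large scale
])
X_tr, X_val, y_tr, y_val = X[:2250], X[2250:], y[:2250], y[2250:]

# Without scaling, Euclidean distances are dominated by the noisy column.
raw = KNeighborsClassifier(n_neighbors=20).fit(X_tr, y_tr)

# With scaling, both columns contribute on a comparable scale.
scaled = make_pipeline(StandardScaler(),
                       KNeighborsClassifier(n_neighbors=20)).fit(X_tr, y_tr)

auc_raw = roc_auc_score(y_val, raw.predict_proba(X_val)[:, 1])
auc_scaled = roc_auc_score(y_val, scaled.predict_proba(X_val)[:, 1])
print(f"AUC raw = {auc_raw:.3f}, AUC scaled = {auc_scaled:.3f}")
```

How much normalization helps on the real data depends on which columns you picked; the gap in this toy example is exaggerated by construction.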

### 4.2
Try using other classifiers.  A good option is the [random forest classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

### 4.3
Try using more numerical features.

### 4.4
There is a lot of information also in the categorical features. Find a way to use them in a classifier.  For example, you can use One Hot Encoding, implementing it manually or by using the [appropriate sklearn function](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).

### 4.5
Given a classifier, study how the AUC on the validation data decreases if you use only part of the training data.  Make a plot with the AUC on the y-axis and the fraction of training data on the x-axis. Compare for example 0.1%, 1%, 10%, 100%.
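
A possible structure for this experiment is sketched below on synthetic data, with `LogisticRegression` as an arbitrary stand-in classifier (both are assumptions, not part of the exercise). The toy training set is small, so the sketch starts at 1%; with the real 150k rows, 0.1% is still 150 rows, but you should check that each subsample contains both classes before calling `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=12000, n_features=5, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

rng = np.random.default_rng(0)
fractions = [0.01, 0.1, 1.0]
aucs = []
for frac in fractions:
    # Random subsample of the training rows, without replacement.
    n_sub = max(1, int(frac * len(X_tr)))
    idx = rng.choice(len(X_tr), size=n_sub, replace=False)
    clf = LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx])
    # Always evaluate on the same, full validation set.
    aucs.append(roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1]))

for frac, auc in zip(fractions, aucs):
    print(f"{frac:>6.1%} of training data -> AUC = {auc:.3f}")

# Plot with a logarithmic x-axis, since the fractions span orders of magnitude:
# import plotly.graph_objects as go
# fig = go.Figure(go.Scatter(x=fractions, y=aucs, mode="lines+markers"))
# fig.update_xaxes(type="log", title_text="fraction of training data")
# fig.update_yaxes(title_text="validation AUC")
# fig.show()
```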

## Exercise 5
Download the testing dataset `HomeCredit_test_blind.csv`. It does not contain the target variable. Predict it with your best classifier, and submit the results as a CSV file.