# Loan repayment prediction

Data location: https://drive.switch.ch/index.php/s/pCy5ctcFRsM2RdZ

Consider the file `HomeCredit_train.csv`, which contains anonymized data shared by a financial institution (https://www.kaggle.com/c/home-credit-default-risk/).

Each row is a loan.  See `HomeCredit_description.csv` for a description of the columns (note that we only have a subset of the columns).  The target variable (column `TARGET`) contains 1 if the client had issues repaying the loan, 0 otherwise.

If you need it for faster experiments, the file  `HomeCredit_train_small.csv` contains only a small part of the training data.

## Exercise 1: exploratory analysis
Open the dataset using Pandas.

### 1.1
What fraction of the loans are not repaid?

### 1.2
Choose 3 variables you like and whose meaning you understand. Make one or two plots for each to describe its distribution (univariate analysis), and to check whether there is an obvious relation to the target variable (bivariate analysis).
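
As a minimal sketch of the two kinds of analysis, the snippet below uses a small invented table (the `toy` DataFrame and its values are assumptions for illustration, not taken from the dataset). The univariate view is the share of rows per category; the bivariate view is the mean of `TARGET` per category, which equals the fraction of unpaid loans in that category.

```python
import pandas as pd

# Toy stand-in for the loan table; the values are invented for illustration.
toy = pd.DataFrame({
    "CODE_GENDER": ["F", "F", "F", "M", "M", "F", "M", "M"],
    "TARGET":      [0,   0,   1,   1,   0,   0,   1,   0],
})

# Univariate view: share of loans in each category.
category_share = toy["CODE_GENDER"].value_counts(normalize=True)

# Bivariate view: TARGET is 0/1, so its mean per category is the
# fraction of unpaid loans in that category.
default_rate = toy.groupby("CODE_GENDER")["TARGET"].mean()

print(category_share)
print(default_rate)
```

The same two lines should work on any categorical column of the real table, provided missing values are handled as desired.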

In [None]:
# 1.1 
import pandas as pd

#df = pd.read_csv("HomeCredit_train_small.csv")
df = pd.read_csv("HomeCredit_train.csv")
df_blind = pd.read_csv("HomeCredit_test_blind.csv")
m = df["TARGET"].mean()
print(f"Fraction of loans not repaid = {m:.1%}")

In [None]:
# 1.2
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots

df_unpaid = df[df['TARGET']==1]
df_paid = df[df['TARGET']==0]

def unpaid_comparison(df, df_unpaid, col, title):
    # Count the occurrences of each unique value; value_counts returns a Series
    # already sorted by descending count
    counts = df[col].value_counts()
    # Keep the 10 most frequent values
    data = counts.head(10)
    
    # Express each bar as a percentage of all loans, not only of the top 10
    f2 = go.Bar(x=data.values/counts.sum()*100, y=data.index, orientation='h',
                marker_color='rgba(0, 0, 255, 0.7)')

    data_unpaid = df_unpaid[col].value_counts() / data * 100.0
    data_unpaid.sort_values(ascending=False, inplace=True)
    #top 10
    data_unpaid = data_unpaid.head(10)
    
    f1 = go.Bar(x=data_unpaid.values, y=data_unpaid.index, orientation='h', 
                marker_color='rgba(255, 0, 0, 0.7)')
    
    fig = make_subplots(shared_xaxes=False, rows=2, cols=1) #, vertical_spacing=0.25)
    fig.add_trace(f2, row=2, col=1)
    fig.add_trace(f1, row=1, col=1)
    fig.update_yaxes(row=2, col=1, autorange="reversed")
    fig.update_xaxes(title_text="% of loans granted", row=2, col=1)
    fig.update_yaxes(row=1, col=1, autorange="reversed")
    fig.update_xaxes(title_text="% of loans not repaid", row=1, col=1)
    fig.update_layout(title_text=title)
    fig.update_traces(showlegend=False)
    fig.show()
    
#F    9247/131546
#M    6962/68452
#XNA  0/2
#dfp = df_unpaid['CODE_GENDER'].value_counts() / df['CODE_GENDER'].value_counts() 

In [None]:
u = unpaid_comparison(df, df_unpaid, 'CODE_GENDER', 'Unpaid Loan By Gender')

In [None]:
unpaid_comparison(df, df_unpaid, 'OCCUPATION_TYPE', 'Top 10 Unpaid Loan By Occupation Type')

In [None]:
unpaid_comparison(df, df_unpaid, 'NAME_HOUSING_TYPE', 'Unpaid Loan By Housing Situation')

In [None]:
unpaid_comparison(df, df_unpaid, 'NAME_EDUCATION_TYPE', 'Unpaid Loan By Education')

In [None]:
unpaid_comparison(df, df_unpaid, 'ORGANIZATION_TYPE', 'Top 10 Unpaid Loan By Type of organization where client works')

## Exercise 2: preparing a training and a validation dataset

### 2.1
Randomly split the 200k rows of your dataset in two groups; keep 150k rows for training and use 50k for validating your models.

### 2.2
Choose three variables that are already numeric.  Build the following numpy arrays:
- `X_tr` (2 dimensions: 150k rows, 3 columns)
- `y_tr` (1 dimension: 150k elements)
- `X_val` (2 dimensions: 50k rows, 3 columns)
- `y_val` (1 dimension: 50k elements)
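
One possible way to obtain an exact 75% / 25% split is `sklearn.model_selection.train_test_split`. The sketch below runs on synthetic data (200 rows instead of 200k; all `*_demo` names are hypothetical) and only illustrates the shapes of the four arrays.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 200 rows, 3 numeric features, roughly 8% positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (rng.random(200) < 0.08).astype(int)

# test_size=0.25 keeps 150 rows for training and 50 for validation;
# random_state makes the split reproducible.
X_tr_demo, X_val_demo, y_tr_demo, y_val_demo = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

print(X_tr_demo.shape, y_tr_demo.shape, X_val_demo.shape, y_val_demo.shape)
```

With a pandas DataFrame, the same call can be applied to `df[cols].to_numpy()` and `df['TARGET'].to_numpy()`.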

In [None]:
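#2.1 - random 75% / 25% split (150k / 50k rows on the full 200k-row dataset)
rng = np.random.default_rng(0)  # fixed seed so the split is reproducible
perm = rng.permutation(len(df))
n_train = len(df) * 3 // 4
train = df.iloc[perm[:n_train]].copy() # 150'000 rows
test = df.iloc[perm[n_train:]].copy() # 50'000 rows, used as the validation set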

In [None]:
#2.2 - numeric variables
cols = ['AMT_INCOME_TOTAL','AMT_CREDIT', 'REGION_POPULATION_RELATIVE']

X_tr = train[cols]
y_tr = train['TARGET']

X_val = test[cols]
y_val = test['TARGET']

## Exercise 3: training and scoring simple models

### 3.1
Train a K-Nearest-Neighbors classifier (use the sklearn function) and compute its accuracy on the validation set.  Compare it with the accuracy of a classifier that always returns 0.  Comment.
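
To see why the always-0 baseline is hard to beat on accuracy, consider this small sketch with invented labels (8 positives out of 100, an assumed class balance chosen only for illustration). Its accuracy equals the share of negatives, yet it never identifies a single unpaid loan.

```python
import numpy as np

# Invented labels: 92 repaid loans (0) and 8 unpaid loans (1).
y_true = np.array([0] * 92 + [1] * 8)

# A "classifier" that always predicts 0.
y_always_zero = np.zeros_like(y_true)

# Accuracy equals the fraction of negatives in the data.
baseline_accuracy = np.mean(y_always_zero == y_true)

# Fraction of the actual positives that are detected (recall): none.
recall_on_defaults = np.mean(y_always_zero[y_true == 1] == 1)

print(baseline_accuracy, recall_on_defaults)
```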

### 3.2
Train the classifier with K=20, and use the `predict_proba(...)` function on the trained classifier to obtain a *score* for each instance in the validation set.  Consider the distribution of the scores returned for the instances of the validation set.

Describe what TP, TN, FP, FN, TPR and FPR mean in this context.

Write a function that given a threshold, computes the number of TP, TN, FP, FN. 

How would you describe what the TPR and FPR are in this context?
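
A minimal sketch of such a function, written with plain NumPy. The name `confusion_counts` and the toy labels and scores are hypothetical; an instance is predicted positive (loan not repaid) when its score is strictly greater than the threshold.

```python
import numpy as np

def confusion_counts(y_true, scores, threshold):
    """Return (TP, TN, FP, FN) when predicting positive for scores > threshold."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(scores) > threshold
    tp = int(np.sum(y_pred & y_true))    # unpaid loans flagged as unpaid
    tn = int(np.sum(~y_pred & ~y_true))  # repaid loans flagged as repaid
    fp = int(np.sum(y_pred & ~y_true))   # repaid loans flagged as unpaid
    fn = int(np.sum(~y_pred & y_true))   # unpaid loans flagged as repaid
    return tp, tn, fp, fn

# Toy example with invented labels and scores.
y_demo = [0, 0, 1, 1, 0, 1]
s_demo = [0.1, 0.6, 0.7, 0.3, 0.2, 0.9]
tp, tn, fp, fn = confusion_counts(y_demo, s_demo, threshold=0.5)

tpr = tp / (tp + fn)  # share of actual positives that are detected
fpr = fp / (fp + tn)  # share of actual negatives that are wrongly flagged
print(tp, tn, fp, fn, tpr, fpr)
```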


### 3.3
Using your function defined above, compute the TPR and FPR for a large number of different thresholds.  Plot the ROC curve.

Using the [appropriate sklearn function](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html), compute the AUC of your classifier.
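
As a cross-check for a hand-written threshold sweep, `sklearn.metrics.roc_curve` returns the FPR and TPR at every distinct score, and `roc_auc_score` returns the area under that curve. The labels and scores below are a tiny invented example.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Tiny invented example: two negatives and two positives.
y_toy = np.array([0, 0, 1, 1])
s_toy = np.array([0.1, 0.4, 0.35, 0.8])

# One (FPR, TPR) point per threshold, from the strictest to the loosest.
fpr_toy, tpr_toy, thresholds_toy = roc_curve(y_toy, s_toy)

# Area under the ROC curve, computed two equivalent ways.
auc_direct = roc_auc_score(y_toy, s_toy)
auc_from_points = auc(fpr_toy, tpr_toy)

print(fpr_toy, tpr_toy, auc_direct)
```

Here three of the four (negative, positive) pairs are ranked correctly, which matches an AUC of 0.75.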

### 3.4
Draw the ROC curve and compute the AUC value for two "dummy" classifiers:
- one that always returns a score of 0 for each sample
- one that returns a random score for each sample
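
The expected outcome for both dummy scorers can be sketched without training anything. The labels below are synthetic (an assumed positive rate of about 8%); a constant score ties every instance, and a random score ranks instances independently of the label, so both should give an AUC close to 0.5.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic labels with roughly 8% positives (an assumption for illustration).
y_synth = (rng.random(20_000) < 0.08).astype(int)

# Scorer 1: the same score for every sample.
scores_zero = np.zeros(len(y_synth))

# Scorer 2: a uniform random score for every sample.
scores_random = rng.random(len(y_synth))

auc_zero = roc_auc_score(y_synth, scores_zero)
auc_random = roc_auc_score(y_synth, scores_random)
print(auc_zero, auc_random)
```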

In [None]:
#3.1

import sklearn.metrics
import sklearn.neighbors

clf_kn2 = sklearn.neighbors.KNeighborsClassifier(n_neighbors=2)
clf_kn2.fit(X_tr, y_tr)

y_hat = clf_kn2.predict(X_val)
accurancy_kn = np.mean(y_hat == y_val)

# Baseline: a classifier that always predicts 0 (loan repaid)
y_hat_dummy = np.zeros(len(y_val), dtype=int)
accurancy_dummy = np.mean(y_hat_dummy == y_val)
print(f'Accuracy KN(2)\t {accurancy_kn:.4f}')
print(f'Accuracy Dummy\t {accurancy_dummy:.4f}')
print(f'Difference\t{accurancy_kn-accurancy_dummy:.4f}')
print('With this configuration the dummy classifier is more accurate than KNeighbors(2).')
print('This is explained by the fact that the percentage of positive cases is very low in this problem.')


In [None]:
#3.2 - Train the classifier with KNeighborsClassifier using n_neighbors=20

clf_kn20 = sklearn.neighbors.KNeighborsClassifier(n_neighbors=20)
clf_kn20.fit(X_tr, y_tr)
y_score_kn20 = clf_kn20.predict_proba(X_val)[:,1]
y_hat = y_score_kn20 > 0.5
accurancy_kn_20 = np.mean(y_hat == y_val)

def roc_calculator(t, y_val, y_sc):
    # Confusion matrix for the predictions obtained with threshold t;
    # labels=[0, 1] keeps the 2x2 layout even if only one class is predicted
    y_pred = (y_sc > t).astype(int)
    matrix = sklearn.metrics.confusion_matrix(y_val, y_pred, labels=[0, 1])
    TN, FP = matrix[0]
    FN, TP = matrix[1]
    # TPR = True Positive Rate
    TPR = TP / (TP + FN)
    # FPR = False Positive Rate
    FPR = FP / (TN + FP)
    return TPR, FPR

# Description in the context of the problem (positive = loan not repaid)
# [TN, FP] -> TN+FP = total number of clients who repaid
# [FN, TP] -> FN+TP = total number of clients who did not repay
# TP = non-payers correctly predicted as non-payers.
# TN = payers correctly predicted as payers.
# FP = payers wrongly predicted as non-payers.
# FN = non-payers wrongly predicted as payers.
# TPR = fraction of actual non-payers that are predicted as non-payers.
# FPR = fraction of actual payers that are wrongly predicted as non-payers.
# distribution of y scores
fig = go.Figure(go.Histogram(x=y_score_kn20))
fig.update_layout(
    title='Distribution of scores for a KNeighbors(20) Classifier',
    xaxis=dict(title='y scores'),
    yaxis=dict(title='Count'),  
)
fig.show()

In [None]:
#3.3
def print_roc_curve(y_val, y_sc, name):
    x_values = []
    y_values = []
    # Calculate the AUC (Area Under Curve)
    AUC = sklearn.metrics.roc_auc_score(y_val, y_sc)
    
    # Sweep the thresholds from just below 0 (everything predicted positive)
    # up to 1 (nothing predicted positive), so the curve spans (1,1) to (0,0)
    for i in np.linspace(-0.02, 1.0, 52):
        TPR,FPR = roc_calculator(i, y_val, y_sc)
        x_values.append(FPR)
        y_values.append(TPR)

    hat = y_sc > 0.5
    acc = np.mean(hat == y_val)
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=x_values, y=y_values, mode='lines+markers', name=name))
    fig.add_trace(go.Scatter(x=[0,1], y=[0,1], mode='lines', name='baseline', line=dict(width=2, dash='dash')))
    fig.update_layout(title=f'ROC curve for {name}<br>AUC={AUC:.3f} Accuracy={acc:.3f}', xaxis_title='FPR', yaxis_title='TPR', legend=dict(x=-.04, y=-.22))
    fig.show()
    return AUC

In [None]:
kn20_auc = print_roc_curve(y_val, y_score_kn20, 'KNeighbors(20) Classifier')

In [None]:
#3.4
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier

# DummyClassifier(strategy="uniform").predict_proba returns a constant 0.5 for
# every sample, so draw one random score per sample directly instead
y_score_random = np.random.default_rng(0).random(len(y_val))
print_roc_curve(y_val, y_score_random, "Random Classifier")

zero_classifier = DummyClassifier(strategy="constant", constant=0)
zero_classifier.fit(X_tr, y_tr)
y_score_zero = zero_classifier.predict_proba(X_val)[:,1]
print_roc_curve(y_val, y_score_zero, "AlwaysZero Classifier")

## Exercise 4: training better models (optional)

### 4.1
Normalize the data and repeat the analysis. Is the accuracy better?
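
A common way to normalize is to standardize each column to zero mean and unit variance, estimating the mean and standard deviation on the training data only and reusing them for the validation data. The sketch below uses `StandardScaler` on synthetic data; the array names and the feature scales are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two synthetic features on very different scales (e.g. an income and a count).
X_train_demo = rng.normal(loc=[100_000, 5], scale=[30_000, 2], size=(1000, 2))
X_valid_demo = rng.normal(loc=[100_000, 5], scale=[30_000, 2], size=(300, 2))

# Fit on the training data only, then apply the same shift and scale to both sets.
scaler = StandardScaler().fit(X_train_demo)
X_train_scaled = scaler.transform(X_train_demo)
X_valid_scaled = scaler.transform(X_valid_demo)

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```

Scaling matters for distance-based models such as K-Nearest-Neighbors, where an unscaled large-valued column would otherwise dominate the distance; tree-based models are largely insensitive to it.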

### 4.2
Try using other classifiers.  A good option is the [random forest classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

### 4.3
Try using more numerical features.

### 4.4
There is a lot of information also in the categorical features. Find a way to use them in a classifier.  For example, you can use One Hot Encoding, implementing it manually or by using the [appropriate sklearn function](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).
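
A small sketch of one-hot encoding with `OneHotEncoder`, using invented category values. The encoder is fitted on the training column only and then reused on the validation column; with `handle_unknown="ignore"`, a category never seen during fitting is encoded as a row of zeros.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Invented categorical column; "Cooks" never appears in the training part.
train_cat = np.array([["Laborers"], ["Drivers"], ["Laborers"], ["Managers"]])
valid_cat = np.array([["Drivers"], ["Cooks"]])

encoder = OneHotEncoder(handle_unknown="ignore")

# fit_transform learns the categories from the training data...
train_encoded = encoder.fit_transform(train_cat).toarray()
# ...and transform reuses them, so both matrices have the same columns.
valid_encoded = encoder.transform(valid_cat).toarray()

print(encoder.categories_)
print(valid_encoded)
```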

### 4.5
Given a classifier, study how the AUC on the validation data decreases if you use only part of the training data.  Make a plot with the AUC on the y-axis and the fraction of training data on the x-axis. Compare for example 0.1%, 1%, 10%, 100%.
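
The experiment can be sketched on synthetic data as follows. Everything here is an assumption for illustration: the data comes from `make_classification`, the classifier settings are arbitrary, and the small subsets are drawn with stratification so that both classes are always present.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem standing in for the loan data.
X_all, y_all = make_classification(
    n_samples=8000, n_features=10, n_informative=5, random_state=0
)
X_train_full, X_valid, y_train_full, y_valid = train_test_split(
    X_all, y_all, test_size=0.25, random_state=0, stratify=y_all
)

fractions_demo = [0.001, 0.01, 0.1, 1.0]
aucs_demo = []
for frac in fractions_demo:
    if frac < 1.0:
        # Stratified subsample so that both classes are represented.
        n_rows = max(2, int(len(X_train_full) * frac))
        X_sub, _, y_sub, _ = train_test_split(
            X_train_full, y_train_full, train_size=n_rows,
            random_state=0, stratify=y_train_full
        )
    else:
        X_sub, y_sub = X_train_full, y_train_full
    model = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=0)
    model.fit(X_sub, y_sub)
    # Always evaluate on the same, untouched validation set.
    aucs_demo.append(roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1]))

for frac, value in zip(fractions_demo, aucs_demo):
    print(f"{frac:>6.1%} of training data -> AUC {value:.3f}")
```

Plotting `aucs_demo` against `fractions_demo` with a logarithmic x-axis gives the requested curve.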

In [None]:
#4.1
cols = ['AMT_INCOME_TOTAL','AMT_CREDIT', 'REGION_POPULATION_RELATIVE']

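def normalize(df, cols, ref=None):
    # Standardize the columns using the mean and std of `ref` (defaults to df itself)
    arr = np.array(df[cols])
    ref_arr = arr if ref is None else np.array(ref[cols])
    means = np.mean(ref_arr, axis=0, keepdims=True)
    stds = np.std(ref_arr, axis=0, keepdims=True)
    return (arr - means) / stds

X_tr_norm = normalize(X_tr, cols)
# Scale the validation data with the training statistics
X_te_norm = normalize(X_val, cols, ref=X_tr)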

clf_kn20_norm = sklearn.neighbors.KNeighborsClassifier(n_neighbors=20)
clf_kn20_norm.fit(X_tr_norm, y_tr)
y_score_norm = clf_kn20_norm.predict_proba(X_te_norm)[:,1]

kn20_norm_auc = print_roc_curve(y_val, y_score_norm, 'KNeighbors(20) Classifier with Normalized values')

accurancy_kn_20_norm = np.mean(clf_kn20_norm.predict(X_te_norm) == y_val)

print(f'Not Normalized AUC={kn20_auc:.4f} Accuracy={accurancy_kn_20:.4f}')
print(f'Normalized AUC={kn20_norm_auc:.4f} Accuracy={accurancy_kn_20_norm:.4f}')
print('In this case, the AUC decreases while the accuracy increases')

In [None]:
#4.2
rf_classifier = RandomForestClassifier(max_depth=6, random_state=0)
rf_classifier.fit(X_tr_norm, y_tr)
y_score_randomforest = rf_classifier.predict_proba(X_te_norm)[:,1]
random_forest_auc = print_roc_curve(y_val, y_score_randomforest, "random forest classifier with normalized data")
accurancy_rf_norm = np.mean(rf_classifier.predict(X_te_norm) == y_val)
print(f'Normalized AUC={random_forest_auc:.4f} Accuracy={accurancy_rf_norm:.4f}')

In [None]:
def auc_score(clf, x_tr, x_te):
    clf.fit(x_tr, y_tr)
    score = clf.predict_proba(x_te)[:,1]
    auc = sklearn.metrics.roc_auc_score(y_val, score)
    return auc,score
def rf_auc(depth, x_tr, x_te):
    clf = RandomForestClassifier(max_depth=depth, random_state=0)
    return auc_score(clf, x_tr, x_te)
def kn_auc(k, x_tr, x_te):
    clf = sklearn.neighbors.KNeighborsClassifier(n_neighbors=k)
    return auc_score(clf, x_tr, x_te)
    
def calc_max_auc(f_auc, start, end, x_tr, x_te):
    aucs=[]
    accs=[]
    for d in range(start, end):
        auc, sc = f_auc(d, x_tr, x_te)
        acc = np.mean((sc > 0.5) == y_val)
        aucs.append(auc)
        accs.append(acc)
        print(f'{d} -> AUC:{auc:.4f} ACC:{acc:.4f}')
    max_auc = max(aucs)
    idx = aucs.index(max_auc)
    print(f'Max AUC:{max_auc:.4f} param: {idx+start} -> ACC: {accs[idx]:.4f}')
    max_accurr = max(accs)
    idx = accs.index(max_accurr)
    print(f'Max ACC:{max_accurr:.4f} param: {idx+start} -> AUC: {aucs[idx]:.4f}')

In [None]:

# Uncomment the calc_max_auc lines below to test some parameters. They may take some time to run!
print('Max AUC and accuracy for KNeighbors when changing the n_neighbors param')
#calc_max_auc(kn_auc, 40, 50, X_tr_norm, X_te_norm)
#47 -> AUC:0.5479 ACC:0.9194
        
print('Max AUC and accuracy for Random Forest when changing the max_depth param')
#calc_max_auc(rf_auc, 1, 10, X_tr_norm, X_te_norm)
#6 -> AUC:0.5596 ACC:0.9182


In [None]:
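#4.3 - Try using more numerical features.

num_cols = cols + ['AMT_ANNUITY','CNT_CHILDREN','AMT_GOODS_PRICE', 'DAYS_EMPLOYED']
# Fill missing values with the training means; plain assignment on a copy
# avoids the unreliable chained `fillna(..., inplace=True)`
train = train.copy()
test = test.copy()
num_means = train[num_cols].mean()
train[num_cols] = train[num_cols].fillna(num_means)
test[num_cols] = test[num_cols].fillna(num_means)
    
X_tr_num = normalize(train, num_cols)
# Scale the validation data with the training mean and std
tr_arr = np.array(train[num_cols])
X_te_num = (np.array(test[num_cols]) - tr_arr.mean(axis=0)) / tr_arr.std(axis=0)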

In [None]:
print('Max AUC and accuracy for KNeighbors when changing the n_neighbors param')
#calc_max_auc(kn_auc, 1, 50, X_tr_num, X_te_num)
# k:48 -> AUC:0.5881 ACC:0.9194

print('Max AUC and accuracy for Random Forest when changing the max_depth param')
#calc_max_auc(rf_auc, 1, 50, X_tr_num, X_te_num)
# 8 -> AUC:0.6199 ACC:0.9194

In [None]:
rf_classifier = RandomForestClassifier(max_depth=8, random_state=0)
rf_classifier.fit(X_tr_num, y_tr)
rf_yscore = rf_classifier.predict_proba(X_te_num)[:,1]
print_roc_curve(y_val, rf_yscore, "Random Forest Depth 8")

In [None]:
#4.4 - categorical features
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown='ignore')

cat_cols = ['OCCUPATION_TYPE']
X_tr_cat = ohe.fit_transform(train[cat_cols]).toarray()
X_tr_all = np.concatenate((X_tr_num, X_tr_cat), axis=1)

# Reuse the encoder fitted on the training data so the columns line up
X_te_cat = ohe.transform(test[cat_cols]).toarray()
X_te_all = np.concatenate((X_te_num, X_te_cat), axis=1)

In [None]:
#calc_max_auc(rf_auc, 5, 10, X_tr_all, X_te_all)
# k:8 -> AUC:0.6231 ACC:0.9176


In [None]:
rf_classifier = RandomForestClassifier(max_depth=8, random_state=0)
rf_classifier.fit(X_tr_all, y_tr)
rf_yscore = rf_classifier.predict_proba(X_te_all)[:,1]
print_roc_curve(y_val, rf_yscore, "Random Forest Depth 8 + categorical")

In [None]:
#4.5 - Compare for example 0.1%, 1%, 10%, 100%.
#      Make a plot with the AUC on the y and the fraction of training data on the x.

def calc_auc_by_fraction(fraction, train, test, cols):
    limit = int(len(train)*fraction)
    print(fraction,limit)
    tr_fraction = train[:limit]
    Xt = tr_fraction[cols]
    yt = tr_fraction['TARGET']
    Xv = test[cols]
    yv = test['TARGET']
    clf = RandomForestClassifier(max_depth=2, random_state=0)
    clf.fit(Xt, yt)
    ysc = clf.predict_proba(Xv)[:,1]
    return sklearn.metrics.roc_auc_score(yv, ysc)
    
#    0.001 - 150 rows
aucs = []
# Fractions of the training data actually used: 0.1%, 1%, 10%, 100%
fractions = np.geomspace(0.001, 1, num=4)
for f in fractions:
    aucs.append(calc_auc_by_fraction(f, train, test, cols))

In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=fractions, y=aucs, mode='lines+markers'))
fig.update_layout(title=f'AUC decreases using a fraction of training data<br>Training data rows:{len(train)}', xaxis_type='log', xaxis_title='fractions', yaxis_title='AUC', legend=dict(x=-.04, y=-.22))
fig.show()


## Exercise 5
Download the testing dataset `HomeCredit_test_blind.csv`. It does not contain the target variable. Predict it with your best classifier, and submit the results as a CSV file.
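
A minimal sketch of writing such a file with pandas. The identifiers and scores are invented, and the `score` column name is an assumption, not a prescribed submission format. Passing `index=False` keeps the DataFrame's row index out of the CSV.

```python
import io
import pandas as pd

# Invented identifiers and scores, standing in for real predictions.
submission = pd.DataFrame({
    "SK_ID_CURR": [100001, 100005, 100013],
    "score": [0.08, 0.31, 0.02],
})

# Write to an in-memory buffer here; a file path works the same way.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```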

In [None]:
#5 HomeCredit_test_blind.csv

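# Fill missing values and scale with the training statistics
blind_num = df_blind[num_cols].fillna(train[num_cols].mean())
tr_arr = np.array(train[num_cols])
X_te_num = (np.array(blind_num) - tr_arr.mean(axis=0)) / tr_arr.std(axis=0)
# Encode with the categories learned from the training data
X_te_cat = ohe.fit(train[cat_cols]).transform(df_blind[cat_cols]).toarray()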
X_te_all = np.concatenate((X_te_num, X_te_cat), axis=1)

rf_classifier = RandomForestClassifier(max_depth=8, random_state=0)
rf_classifier.fit(X_tr_all, y_tr)
rf_yscore = rf_classifier.predict_proba(X_te_all)[:,1]

df_te_predictions = pd.DataFrame()
df_te_predictions["SK_ID_CURR"] = df_blind["SK_ID_CURR"]
df_te_predictions["score"] = rf_yscore

your_surname = "DeSanti"
df_te_predictions.to_csv(f"HomeCredit_test_scores_{your_surname}.csv", index=False)