## Error Analysis of Final Selected Models 

### Table of Contents <a name="toc"></a>
- [Import packages](#1)
- [Large loans model candidates](#2)
- [Small loans model candidates](#3)
- [Large loans threshold selection](#4)
- [Small loans threshold selection](#5)
- [Final model selection](#6)
- [Error analysis](#7)


### Import necessary packages <a name="1"></a>

* Includes models, config, and helpers
* Packages for visualisation and plotting

[back to top](#toc)

In [None]:
import numpy as np
import pandas as pd
from src import config
from src.evaluation import evaluate_report
from src.model_dispatcher import (large_models, 
                                  small_models, 
                                  large_models_threshold_sel, 
                                  small_models_threshold_sel)
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt 
from sklearn.metrics import roc_curve, roc_auc_score


%matplotlib inline

sns.set_style("whitegrid")
sns.set_palette("deep")
mpl.rcParams['figure.figsize'] = config.DEFAULT_FIGSIZE
mpl.rcParams['lines.linewidth'] = config.DEFAULT_PLOT_LINEWIDTH
mpl.rcParams['lines.linestyle'] = config.DEFAULT_PLOT_LINESTYLE
mpl.rcParams['font.size'] = config.DEFAULT_AXIS_FONT_SIZE

pal = sns.color_palette("deep")
pal_hex = pal.as_hex()

### Model selection 

#### Large loans model candidates <a name="2"></a>

* The process for selecting the optimal model specification:
  * Read in the test data and model candidates
  * Generate ROC curves and classification reports for each large loans candidate
* The final selection is GBM for large loans


[back to top](#toc)

In [None]:
ll_test = pd.read_parquet(config.FIN_FILE_PATH / "test_df_large_loans_300000.parquet")

X_large_test = ll_test.drop(columns=config.TARGET)
y_large_test = ll_test[config.TARGET]

#### Generate the ROC curves for candidates

In [None]:
_, ax = plt.subplots()
plt.tight_layout()
plt.plot([0, 1], [0, 1], ls="--", color="black")
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.title("Large loans - ROC Curves")
run = 0

for name, model in large_models.items():
    y_pred_prob = model.predict_proba(X_large_test)[:,1]
    y_pred = model.predict(X_large_test)
    fpr, tpr, _ = roc_curve(y_large_test, y_pred_prob)
    auc = roc_auc_score(y_large_test, y_pred_prob)

    ax.plot(
        fpr, tpr, 
        label=f"{name}: AUC {auc:.2%}", 
        linestyle="solid", 
        linewidth=2,
        color=pal_hex[run]
    )
    run += 1

ax.legend(loc="lower right")
plt.savefig(config.REPORTS_PATH / "roc/all_large_models_300000.jpeg", bbox_inches="tight")
plt.show()
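
As a reading aid for the AUC values in the legend: AUC can be interpreted as the probability that a randomly chosen positive observation receives a higher score than a randomly chosen negative one. The block below is a minimal sketch of that interpretation on made-up scores; `pairwise_auc` is a hypothetical helper written for illustration, not part of this project, and it is only practical for small arrays because it compares every pair.

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Score difference for every positive/negative pair
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Toy labels and scores, made up for illustration (not from the loan data)
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(f"toy AUC: {toy_auc:.2%}")
```

Because the measure depends only on how observations are ranked, the ROC cells pass predicted probabilities rather than hard class labels.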

#### Generate the classification report for each candidate

In [None]:
ll_results = []

for name, model in large_models.items():
    y_pred_prob = model.predict_proba(X_large_test)[:,1]
    y_pred = model.predict(X_large_test)
    report = evaluate_report(y_test=y_large_test, y_pred=y_pred, y_pred_prob=y_pred_prob)
    report["model"] = name
    ll_results.append(report)
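
The implementation of `evaluate_report` lives in `src.evaluation` and is not shown in this notebook. Judging only from how it is used here (it returns something that accepts a `"model"` key and that `pd.DataFrame` can turn into a row), a hypothetical stand-in might look like the sketch below. The function name `toy_evaluate_report` and the choice of metrics are assumptions for illustration only.

```python
import numpy as np

def toy_evaluate_report(y_test, y_pred):
    """Hypothetical stand-in: return common classification metrics as a dict."""
    y_test = np.asarray(y_test)
    y_pred = np.asarray(y_pred)
    tp = int(((y_test == 1) & (y_pred == 1)).sum())
    fp = int(((y_test == 0) & (y_pred == 1)).sum())
    fn = int(((y_test == 1) & (y_pred == 0)).sum())
    tn = int(((y_test == 0) & (y_pred == 0)).sum())
    # Guard against empty denominators
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(y_test)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels and predictions, made up for illustration
toy_report = toy_evaluate_report([0, 0, 1, 1, 1, 0], [0, 1, 1, 1, 0, 0])
toy_report["model"] = "toy"
print(toy_report)
```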

#### Convert the results to a DataFrame and export if necessary

In [None]:
# results
large_model_metrics = pd.DataFrame(ll_results).set_index("model")
large_model_metrics

# export results to csv if needed
# large_model_metrics.to_csv("large_model_metrics.csv")

#### Small loans model candidates <a name="3"></a>

* The process for selecting the optimal model specification:
  * Read in the test data and model candidates
  * Generate ROC curves and classification reports for each small loans candidate
* The final selection is RF for small loans

[back to top](#toc)

In [None]:
sl_test = pd.read_parquet(config.FIN_FILE_PATH / "test_df_small_loans_300000.parquet")

X_small_test = sl_test.drop(columns=config.TARGET)
y_small_test = sl_test[config.TARGET]

#### Generate the ROC curves for candidates

In [None]:
_, ax = plt.subplots()
plt.tight_layout()
plt.plot([0, 1], [0, 1], ls="--", color="black")
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.title("Small loans - ROC Curves")
run = 0

for name, model in small_models.items():
    y_pred_prob = model.predict_proba(X_small_test)[:,1]
    y_pred = model.predict(X_small_test)
    fpr, tpr, _ = roc_curve(y_small_test, y_pred_prob)
    auc = roc_auc_score(y_small_test, y_pred_prob)

    ax.plot(
        fpr, tpr, 
        label=f"{name}: AUC {auc:.2%}", 
        linestyle="solid", 
        linewidth=2,
        color=pal_hex[run]
    )
    run += 1

ax.legend(loc="lower right")
plt.savefig(config.REPORTS_PATH / "roc/all_small_models_300000.jpeg", bbox_inches="tight")
plt.show()

#### Generate the classification report for each candidate

In [None]:
sl_results = []

for name, model in small_models.items():
    y_pred_prob = model.predict_proba(X_small_test)[:,1]
    y_pred = model.predict(X_small_test)
    report = evaluate_report(y_test=y_small_test, y_pred=y_pred, y_pred_prob=y_pred_prob)
    report["model"] = name
    sl_results.append(report)

#### Convert the results to a DataFrame and export if necessary

In [None]:
# results
small_model_metrics = pd.DataFrame(sl_results).set_index("model")
small_model_metrics

# export results to csv if needed
# small_model_metrics.to_csv("small_model_metrics.csv")

### Threshold selection

#### Large loans threshold selection <a name="4"></a>

* The selection process starts by reading in the necessary data files and models
* ROC curves and classification reports are then generated for each large loans threshold candidate
* We select the best-performing large loans model, taking the small loans model's performance into account
* The final threshold selected is 400k

[back to top](#toc)
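
The test files used below are already split per threshold, and the splitting code is not part of this notebook. As a rough illustration of what a threshold candidate means, the sketch below partitions a toy frame at each candidate value. The helper `split_by_loan_amount`, the use of a `loan_amount` column as the split variable, and the rule that loans at or above the threshold count as large are all assumptions.

```python
import pandas as pd

def split_by_loan_amount(df, threshold, amount_col="loan_amount"):
    """Split a frame into (small, large) loans at a loan-amount threshold.

    Assumption: loans at or above the threshold count as large.
    """
    is_large = df[amount_col] >= threshold
    return df[~is_large], df[is_large]

# Toy loan amounts, made up for illustration
toy_loans = pd.DataFrame({"loan_amount": [150000, 250000, 300000, 450000, 600000]})

segment_sizes = {}
for threshold in (200000, 300000, 400000, 500000):
    small, large = split_by_loan_amount(toy_loans, threshold)
    segment_sizes[threshold] = (len(small), len(large))

print(segment_sizes)
```

Raising the threshold moves observations from the large loans segment into the small loans segment, so each candidate changes both the data a model is evaluated on and the relative size of the two segments.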

In [None]:
ll_200000_test = pd.read_parquet(config.FIN_FILE_PATH / "test_df_large_loans_200000.parquet")
ll_300000_test = pd.read_parquet(config.FIN_FILE_PATH / "test_df_large_loans_300000.parquet")
ll_400000_test = pd.read_parquet(config.FIN_FILE_PATH / "test_df_large_loans_400000.parquet")
ll_500000_test = pd.read_parquet(config.FIN_FILE_PATH / "test_df_large_loans_500000.parquet")

X_large_200000_test = ll_200000_test.drop(columns=config.TARGET)
y_large_200000_test = ll_200000_test[config.TARGET]
X_large_300000_test = ll_300000_test.drop(columns=config.TARGET)
y_large_300000_test = ll_300000_test[config.TARGET]
X_large_400000_test = ll_400000_test.drop(columns=config.TARGET)
y_large_400000_test = ll_400000_test[config.TARGET]
X_large_500000_test = ll_500000_test.drop(columns=config.TARGET)
y_large_500000_test = ll_500000_test[config.TARGET]

ll_evaluation = {
    "gbm_200000": [X_large_200000_test, y_large_200000_test],
    "gbm_300000": [X_large_300000_test, y_large_300000_test],
    "gbm_400000": [X_large_400000_test, y_large_400000_test],
    "gbm_500000": [X_large_500000_test, y_large_500000_test]
}

#### Generate the ROC curves for candidates

In [None]:
_, ax = plt.subplots()
plt.tight_layout()
plt.plot([0, 1], [0, 1], ls="--", color="black")
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.title("Large loans - GBM threshold selection - ROC Curves")
run = 0

for name, data in ll_evaluation.items():
    model = large_models_threshold_sel.get(name)
    y_pred_prob = model.predict_proba(data[0])[:,1]
    y_pred = model.predict(data[0])
    fpr, tpr, _ = roc_curve(data[1], y_pred_prob)
    auc = roc_auc_score(data[1], y_pred_prob)

    ax.plot(
        fpr, tpr, 
        label=f"{name}: AUC {auc:.2%}", 
        linestyle="solid", 
        linewidth=2,
        color=pal_hex[run]
    )
    run += 1

ax.legend(loc="lower right")
plt.savefig(config.REPORTS_PATH / "roc/large_models_threshold_sel.jpeg", bbox_inches="tight")
plt.show()

#### Generate the classification report for each candidate

In [None]:
# generate report

ll_threshold_candidates_results = []

for name, data in ll_evaluation.items():
    model = large_models_threshold_sel.get(name)
    y_pred_prob = model.predict_proba(data[0])[:,1]
    y_pred = model.predict(data[0])
    report = evaluate_report(y_test=data[1], y_pred=y_pred, y_pred_prob=y_pred_prob)
    report["model"] = name
    ll_threshold_candidates_results.append(report)

ll_threshold_candidates_results

#### Convert the results to a DataFrame and export if necessary

In [None]:
# results
large_model_thresholds = pd.DataFrame(ll_threshold_candidates_results).set_index("model")
large_model_thresholds

# export results to csv if needed
# large_model_thresholds.to_csv("large_model_threshold_metrics.csv")

#### Small loans threshold selection <a name="5"></a>

* The selection process starts by reading in the necessary data files and models
* ROC curves and classification reports are then generated for each small loans threshold candidate
* We select the best-performing small loans model, taking the large loans model's performance into account
* The final threshold selected is 400k

[back to top](#toc)

In [None]:
sl_200000_test = pd.read_parquet(config.FIN_FILE_PATH / "test_df_small_loans_200000.parquet")
sl_300000_test = pd.read_parquet(config.FIN_FILE_PATH / "test_df_small_loans_300000.parquet")
sl_400000_test = pd.read_parquet(config.FIN_FILE_PATH / "test_df_small_loans_400000.parquet")
sl_500000_test = pd.read_parquet(config.FIN_FILE_PATH / "test_df_small_loans_500000.parquet")

X_small_200000_test = sl_200000_test.drop(columns=config.TARGET)
y_small_200000_test = sl_200000_test[config.TARGET]
X_small_300000_test = sl_300000_test.drop(columns=config.TARGET)
y_small_300000_test = sl_300000_test[config.TARGET]
X_small_400000_test = sl_400000_test.drop(columns=config.TARGET)
y_small_400000_test = sl_400000_test[config.TARGET]
X_small_500000_test = sl_500000_test.drop(columns=config.TARGET)
y_small_500000_test = sl_500000_test[config.TARGET]

sl_evaluation = {
    "rf_200000": [X_small_200000_test, y_small_200000_test],
    "rf_300000": [X_small_300000_test, y_small_300000_test],
    "rf_400000": [X_small_400000_test, y_small_400000_test],
    "rf_500000": [X_small_500000_test, y_small_500000_test]
}

#### Generate the ROC curves for candidates

In [None]:
_, ax = plt.subplots()
plt.tight_layout()
plt.plot([0, 1], [0, 1], ls="--", color="black")
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.title("Small loans - RF threshold selection - ROC Curves")
run = 0

for name, data in sl_evaluation.items():
    model = small_models_threshold_sel.get(name)
    y_pred_prob = model.predict_proba(data[0])[:,1]
    y_pred = model.predict(data[0])
    fpr, tpr, _ = roc_curve(data[1], y_pred_prob)
    auc = roc_auc_score(data[1], y_pred_prob)

    ax.plot(
        fpr, tpr, 
        label=f"{name}: AUC {auc:.2%}", 
        linestyle="solid", 
        linewidth=2,
        color=pal_hex[run]
    )
    run += 1

ax.legend(loc="lower right")
plt.savefig(config.REPORTS_PATH / "roc/small_models_threshold_sel.jpeg", bbox_inches="tight")
plt.show()

#### Generate the classification report for each candidate

In [None]:
# generate report

sl_threshold_candidates_results = []

for name, data in sl_evaluation.items():
    model = small_models_threshold_sel.get(name)
    y_pred_prob = model.predict_proba(data[0])[:,1]
    y_pred = model.predict(data[0])
    report = evaluate_report(y_test=data[1], y_pred=y_pred, y_pred_prob=y_pred_prob)
    report["model"] = name
    sl_threshold_candidates_results.append(report)

#### Convert the results to a DataFrame and export if necessary


In [None]:
# results
small_model_thresholds = pd.DataFrame(sl_threshold_candidates_results).set_index("model")
small_model_thresholds

# export results to csv if needed
# small_model_thresholds.to_csv("small_model_threshold_metrics.csv")

#### Final model selection <a name="6"></a>

* To select our final models, we assume that our test data is representative and calculate the expected discriminative power of each small/large model pair
* This is done by taking the average of the two AUCs, weighted by the number of test observations in each segment
* The best combination is obtained at the 300K threshold

[back to top](#toc)
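
The weighted average described above can be sketched as a small helper. `weighted_auc` is a hypothetical name, and the row counts below are made up purely to show the arithmetic; only the AUC figures are taken from the calculation cell in this section.

```python
def weighted_auc(auc_small, n_small, auc_large, n_large):
    """Average two segment AUCs, weighted by the number of observations in each segment."""
    total = n_small + n_large
    return (auc_small * n_small + auc_large * n_large) / total

# Made-up row counts: a lower threshold puts more loans in the large segment
combined_300k = weighted_auc(84.57, 4000, 89.28, 6000)
combined_400k = weighted_auc(85.19, 6000, 89.83, 4000)

print(f"300000 threshold: {combined_300k:.2f}%")
print(f"400000 threshold: {combined_400k:.2f}%")
```

With these illustrative counts, the 300K pair scores higher overall even though both of its segment AUCs are lower, because more weight falls on the stronger large loans model. The segment sizes therefore matter as much as the individual AUCs.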

In [None]:
ll_200000_test = pd.read_parquet(config.FIN_FILE_PATH / "test_df_large_loans_200000.parquet")
ll_300000_test = pd.read_parquet(config.FIN_FILE_PATH / "test_df_large_loans_300000.parquet")
ll_400000_test = pd.read_parquet(config.FIN_FILE_PATH / "test_df_large_loans_400000.parquet")
ll_500000_test = pd.read_parquet(config.FIN_FILE_PATH / "test_df_large_loans_500000.parquet")
sl_200000_test = pd.read_parquet(config.FIN_FILE_PATH / "test_df_small_loans_200000.parquet")
sl_300000_test = pd.read_parquet(config.FIN_FILE_PATH / "test_df_small_loans_300000.parquet")
sl_400000_test = pd.read_parquet(config.FIN_FILE_PATH / "test_df_small_loans_400000.parquet")
sl_500000_test = pd.read_parquet(config.FIN_FILE_PATH / "test_df_small_loans_500000.parquet")

In [None]:
# calculation 
threshold_200000 = (84.10 * sl_200000_test.shape[0] / (sl_200000_test.shape[0] + ll_200000_test.shape[0])) + (88.50 * ll_200000_test.shape[0] / (sl_200000_test.shape[0] + ll_200000_test.shape[0]))
threshold_300000 = (84.57 * sl_300000_test.shape[0] / (sl_300000_test.shape[0] + ll_300000_test.shape[0])) + (89.28 * ll_300000_test.shape[0] / (sl_300000_test.shape[0] + ll_300000_test.shape[0]))
threshold_400000 = (85.19 * sl_400000_test.shape[0] / (sl_400000_test.shape[0] + ll_400000_test.shape[0])) + (89.83 * ll_400000_test.shape[0] / (sl_400000_test.shape[0] + ll_400000_test.shape[0]))
threshold_500000 = (86.20 * sl_500000_test.shape[0] / (sl_500000_test.shape[0] + ll_500000_test.shape[0])) + (89.00 * ll_500000_test.shape[0] / (sl_500000_test.shape[0] + ll_500000_test.shape[0]))

print(f"""
200000 Threshold: {threshold_200000:.2f}%
300000 Threshold: {threshold_300000:.2f}%
400000 Threshold: {threshold_400000:.2f}%
500000 Threshold: {threshold_500000:.2f}%
""")

### Error Analysis <a name="7"></a>

* Given our final models, what kinds of observations are they misclassifying?
* What are some potential reasons for the misclassification?
* The diagnostic approach is to examine the false positives (FPs)

[back to top](#toc)

#### Large loans 

Steps
* Load in the GBM model with threshold 300,000
* Identify the misclassified observations
* Generate proportions for misclassified observations 
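
The steps above can be sketched on a toy frame. The column names mirror the ones used in this notebook, but the values are made up for illustration.

```python
import pandas as pd

# Toy actuals, predictions and one demographic flag (made-up values)
toy = pd.DataFrame({
    "status":     [0, 0, 1, 1, 0, 1],
    "prediction": [1, 0, 0, 1, 1, 1],
    "gender_m":   [1, 0, 1, 0, 0, 1],
})

# Tag misclassified observations as 0/1 flags
toy["false_positives"] = ((toy["status"] == 0) & (toy["prediction"] == 1)).astype(int)
toy["false_negatives"] = ((toy["status"] == 1) & (toy["prediction"] == 0)).astype(int)

# Select the false positives with an explicit boolean mask.
# Passing the raw 0/1 integers to .loc would be read as row labels, not as a mask.
fp_rows = toy.loc[toy["false_positives"] == 1]

# Proportion of false positives that carry the demographic flag
fp_share_gender_m = fp_rows["gender_m"].mean()
print(len(fp_rows), fp_share_gender_m)
```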

In [None]:
ll_model = large_models_threshold_sel.get("gbm_300000")

ll_test_300000 = pd.read_parquet(config.FIN_FILE_PATH / "test_df_large_loans_300000.parquet").reset_index(drop=True)
X_ll_test_300000 = ll_test_300000.drop(columns=config.TARGET)
y_ll_test_300000 = ll_test_300000[config.TARGET]

In [None]:
y_pred = pd.Series(ll_model.predict(X_ll_test_300000))
ll_test_300000["prediction"] = y_pred

# tag misclassified observations
ll_test_300000["false_positives"] = np.where((ll_test_300000["status"] == 0) & (ll_test_300000["prediction"] == 1), 1, 0)
ll_test_300000["false_negatives"] = np.where((ll_test_300000["status"] == 1) & (ll_test_300000["prediction"] == 0), 1, 0)

In [None]:
# any particular demographics? 
demographics_subset = ["gender_j", "gender_m", "gender_na", 
                       "age_25-34", "age_35-44", "age_45-54",
                       "age_55-64", "age_65-74", "age_>74", "region_north",
                       "region_north-east", "region_south"]
# use a boolean mask; the raw 0/1 flags would be interpreted as row labels by .loc
misclassified_dem = ll_test_300000.loc[ll_test_300000["false_positives"] == 1, demographics_subset]

# column means give the proportion of observations in each demographic group
mean_misclassified_dem = misclassified_dem.mean().to_numpy()
mean_actuals_dem = ll_test_300000[demographics_subset].mean().to_numpy()

fig, ax = plt.subplots()
plt.tight_layout()
sns.barplot(x=mean_misclassified_dem, y=demographics_subset, color=pal_hex[0])


ax.set_xlim(0, 1)
plt.title("Misclassified results originate from specific genders, ages and regions")
plt.xlabel("Proportion of misclassified results")

# separate file name so the small loans plot does not overwrite this one
plt.savefig("misclassified_dem_large_loans")
plt.show()


Looking more closely at the misclassified observations, we notice that the false positives tend to:
* Have a higher loan amount
* Have a lower loan limit and credit worthiness

This suggests that our model may be classifying these observations as potential defaulters based on a representation of risk, i.e. individuals who borrow larger amounts but do not have stellar credit worthiness.
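
One way to check this kind of claim directly is to compare average characteristics between the false positives and all other observations. The sketch below does this on a toy frame with made-up values; the column names follow this notebook, and the 0/1 encoding of `credit_worthiness` is an assumption.

```python
import pandas as pd

# Toy data, made up for illustration: two false positives and four other observations
toy = pd.DataFrame({
    "false_positives":   [1, 1, 0, 0, 0, 0],
    "loan_amount":       [500000, 450000, 350000, 320000, 400000, 330000],
    "credit_worthiness": [0, 0, 1, 1, 0, 1],
})

# Mean of each characteristic, split by the false-positive flag
profile = toy.groupby("false_positives")[["loan_amount", "credit_worthiness"]].mean()
print(profile)
```

Grouping by the `false_positives` flag itself, rather than by a demographic column, compares the misclassified observations against everything else in a single table.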

In [None]:
gender_na_groupby = ll_test_300000.groupby("gender_na")[["loan_limit", "credit_worthiness", "loan_amount"]].mean()
region_south_groupby = ll_test_300000.groupby("region_south")[["loan_limit", "credit_worthiness", "loan_amount"]].mean()
age_45_54_groupby = ll_test_300000.groupby("age_45-54")[["loan_limit", "credit_worthiness", "loan_amount"]].mean()

display(region_south_groupby)
display(gender_na_groupby)
display(age_45_54_groupby)

#### Small loans 

We follow the same approach as for the large loans:
* Load in the RF model with threshold 300,000
* Identify the misclassified observations
* Generate proportions for misclassified observations 

In [None]:
sl_model = small_models_threshold_sel.get("rf_300000")

sl_test_300000 = pd.read_parquet(config.FIN_FILE_PATH / "test_df_small_loans_300000.parquet").reset_index(drop=True)
X_sl_test_300000 = sl_test_300000.drop(columns=config.TARGET)
y_sl_test_300000 = sl_test_300000[config.TARGET]

In [None]:
y_pred = pd.Series(sl_model.predict(X_sl_test_300000))
sl_test_300000["prediction"] = y_pred

# tag misclassified observations
sl_test_300000["false_positives"] = np.where((sl_test_300000["status"] == 0) & (sl_test_300000["prediction"] == 1), 1, 0)
sl_test_300000["false_negatives"] = np.where((sl_test_300000["status"] == 1) & (sl_test_300000["prediction"] == 0), 1, 0)

In [None]:
# any particular demographics? 
demographics_subset = ["gender_j", "gender_m", "gender_na", 
                       "age_25-34", "age_35-44", "age_45-54",
                       "age_55-64", "age_65-74", "age_>74", "region_north",
                       "region_north-east", "region_south"]
# use a boolean mask; the raw 0/1 flags would be interpreted as row labels by .loc
misclassified_dem = sl_test_300000.loc[sl_test_300000["false_positives"] == 1, demographics_subset]

# column means give the proportion of observations in each demographic group
mean_misclassified_dem = misclassified_dem.mean().to_numpy()
mean_actuals_dem = sl_test_300000[demographics_subset].mean().to_numpy()

fig, ax = plt.subplots()
plt.tight_layout()
sns.barplot(x=mean_misclassified_dem, y=demographics_subset, color=pal_hex[0])


ax.set_xlim(0, 1)
plt.title("The case is similar for small loans, but with a more pronounced focus on 55-64")
plt.xlabel("Proportion of misclassified results")

# separate file name so this plot does not overwrite the large loans one
plt.savefig("misclassified_dem_small_loans")
plt.show()

We find a similar pattern in the small loans dataset: observations with gender NA and region south tend to be misclassified. An appropriate follow-up step would be to gather more data from these subsets so that the model has more examples to learn from.
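
Before concluding that a subgroup is disproportionately misclassified, it can help to compare its share among the false positives with its share in the full test set, since a large group will appear often among the errors simply because of its size. The sketch below computes that ratio on made-up data; the name `lift` for the ratio is our own label.

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    "false_positives": [1, 1, 0, 0, 0, 0, 0, 0],
    "gender_na":       [1, 1, 0, 0, 1, 0, 0, 0],
    "region_south":    [1, 0, 1, 0, 0, 1, 0, 1],
})
cols = ["gender_na", "region_south"]

# Share of each group among the false positives vs. in the whole test set
share_in_fps = toy.loc[toy["false_positives"] == 1, cols].mean()
share_overall = toy[cols].mean()

# Ratio above 1: the group is over-represented among the false positives
lift = share_in_fps / share_overall
print(lift)
```

In this toy example `gender_na` is over-represented among the false positives, while `region_south` appears exactly as often as its overall share would predict.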

In [None]:
age_55_64_groupby = sl_test_300000.groupby("age_55-64")[["loan_limit", "credit_worthiness", "loan_amount", "lump_sum_payment"]].mean()
display(age_55_64_groupby)