# 3. Validation

In this worksheet we validate the models that we trained and choose the best one.

In [45]:
import pandas as pd
pd.options.mode.chained_assignment = None
import numpy as np
import os
from joblib import load

In [46]:
validation_df = pd.read_csv("data/validation_data.csv")

# Parse columns into date values.
date_columns = ['LoanDate', 'LastPaymentOn']
validation_df[date_columns] = validation_df[date_columns].apply(pd.to_datetime, format='%Y-%m-%d', errors='coerce')

## Calculating the outcomes for the loans

We begin with the function *calculate_loan_outcomes()*, where the outcome for each loan is calculated. These columns will help us later when evaluating different classification thresholds.

The investment_amount variable is not strictly needed, but I decided to add it because it can be useful for creating more advanced strategies. For example, we could invest more into loans with higher predicted probabilities and less into loans with lower ones.

In [47]:
def calculate_loan_outcomes(df, investment_amount = 10):

    # The duration between start of the loan and the last payment. If LastPaymentOn is null then duration is 0.
    df['LoanLength'] = np.where(df.LastPaymentOn.notnull(), (df.LastPaymentOn.sub(df.LoanDate).dt.days.div(365.25)), 0)

    # The total amount repaid by the borrower.
    df['TotalRepayments'] = df.PrincipalPaymentsMade + df.InterestAndPenaltyPaymentsMade

    # Calculate the portion size of the investment based on the amount invested.
    df['InvestmentPortionSize'] = investment_amount / df.Amount

    # Calculate returns.
    df['Return'] = df.InvestmentPortionSize * df.TotalRepayments

    # Calculate the profit.
    df['Profit'] = df.Return - investment_amount

    # Calculate the return on investment.
    df['ROI'] = (df.Return - investment_amount) / investment_amount * 100

    # Calculate the annual return on investment.
    df['ROI_Annual'] = ((1 + df.ROI / 100) ** (1/df.LoanLength) - 1) * 100

    return df

# Call the function.
validation_df = calculate_loan_outcomes(validation_df)
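
To make the annualization step concrete, here is a minimal sketch of the same arithmetic for a single loan. The helper name `annualized_roi` and the loan figures are made up for illustration, and the sketch assumes a loan length greater than zero.

```python
def annualized_roi(amount, total_repayments, loan_length_years, investment_amount=10):
    """Repeat the per-loan arithmetic of calculate_loan_outcomes() for one loan."""
    # Our share of the loan, and therefore of every repayment.
    portion = investment_amount / amount
    returned = portion * total_repayments
    # Total return on investment in percent.
    roi = (returned - investment_amount) / investment_amount * 100
    # Compound annual rate that produces the same total return.
    roi_annual = ((1 + roi / 100) ** (1 / loan_length_years) - 1) * 100
    return roi, roi_annual

# Hypothetical loan: 1000 borrowed, 1210 repaid in total, last payment after 2 years.
roi, roi_annual = annualized_roi(amount=1000, total_repayments=1210, loan_length_years=2)
print(round(roi, 2), round(roi_annual, 2))
```

A total return of 21% over two years corresponds to about 10% per year, because 1.1 squared is 1.21.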

## Validating the models

We again set the input columns that we used in the modeling step. We also define a function that transforms the validation_df into an input that the models can use.

In [48]:
# Columns that are used as inputs in the models.
input_cols = ['NewCreditCustomer', 'VerificationType', 'LanguageCode', 'Age', 'Gender',
              'Amount', 'Interest', 'LoanDuration', 'MonthlyPayment',
              'Education', 'EmploymentDurationCurrentEmployer', 'HomeOwnershipType', 'IncomeTotal',
              'ExistingLiabilities', 'LiabilitiesTotal', 'Rating',
              'CreditScoreEeMini', 'NoOfPreviousLoansBeforeLoan', 'AmountOfPreviousLoansBeforeLoan',
              'PreviousRepaymentsBeforeLoan', 'PreviousEarlyRepaymentsBefoleLoan', 'PreferLoan']

def transform_into_input(df, input_cols):
    df = df[input_cols]
    df = pd.get_dummies(df)
    X = df.drop('PreferLoan', axis=1)
    return X
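
One thing that may be worth checking here is whether *pd.get_dummies()* produces the same columns for the validation data as it did for the training data. The sketch below uses made-up toy frames to show how a category that is absent from one frame leads to a missing dummy column, and one possible way to align the layout. Whether this applies to the actual data is an assumption, not something established in this worksheet.

```python
import pandas as pd

# Toy frames; the values are illustrative only.
train = pd.DataFrame({"Amount": [500, 1000, 1500], "Rating": ["A", "B", "C"]})
valid = pd.DataFrame({"Amount": [700, 900], "Rating": ["A", "C"]})

X_train = pd.get_dummies(train)
X_valid = pd.get_dummies(valid)

# Rating "B" never occurs in the validation frame, so its dummy column is missing.
missing = sorted(set(X_train.columns) - set(X_valid.columns))
print(missing)

# Reindex to the training layout; dummy columns that were absent are filled with 0.
X_valid_aligned = X_valid.reindex(columns=X_train.columns, fill_value=0)
print(list(X_valid_aligned.columns))
```

This only works if the training column layout is available at validation time, for example by saving it alongside the models.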

The *calculate_threshold_stats()* function calculates the statistics for a given threshold. These stats will be used for choosing the best model.

In [49]:
def calculate_threshold_stats(df, total_loans, threshold, filename):
    result = {
        'Model': filename.removesuffix('.joblib'),
        'Threshold': threshold,
        'Total_loans' : total_loans,
        'Investments_made' : df.shape[0],
        'Investments_made_percentage' : df.shape[0] / total_loans * 100,
        'No_of_preferred_loans' : (df.PreferLoan.values == 1).sum(),
        'Precision' : ((df.PreferLoan.values == 1).sum()) / df.shape[0],
        'ROI_annual_mean': df.ROI_Annual.mean()
    }
    return result
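
As a small illustration of the precision statistic, the sketch below computes it on a handful of made-up predictions without pandas. The helper `precision_at_threshold` is hypothetical; it returns NaN explicitly when no loan passes the threshold, which is the case where precision is undefined.

```python
import math

def precision_at_threshold(probabilities, labels, threshold):
    """Share of selected loans (probability >= threshold) whose label is 1."""
    selected = [label for p, label in zip(probabilities, labels) if p >= threshold]
    if not selected:
        return math.nan  # No loans selected, so precision is undefined.
    return sum(1 for label in selected if label == 1) / len(selected)

# Made-up predicted probabilities and true PreferLoan labels.
probs = [0.95, 0.90, 0.80, 0.60, 0.55]
labels = [1, 1, 0, 1, 0]

p_mid = precision_at_threshold(probs, labels, 0.75)   # 3 loans selected, 2 preferred
p_high = precision_at_threshold(probs, labels, 0.99)  # nothing selected
print(p_mid, p_high)
```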

The *validate_models()* function is the main function that starts the validation process. It iterates through all of the models and gets the predicted probabilities from each one. For each model, the probabilities are added to the initial validation_df as a column, overwriting the values of the previous model. We then iterate over different threshold levels and filter the df with each one, so that only the loans whose predicted probability is greater than or equal to the threshold are included. The stats are calculated for each threshold level and appended to a list that we use for choosing the most suitable model.


In [50]:
def validate_models(val_df, input_cols):
    X = transform_into_input(val_df, input_cols)
    stats = []
    total_loans = val_df.shape[0]

    for filename in os.listdir('./models'):
        df = val_df
        clf = load(f'./models/{filename}')
        predictions = clf.predict_proba(X)
        df["Prediction"] = predictions[:, 1]

        for threshold in np.arange(0.5, 1.0, 0.025):
            df_t = df.loc[(df.Prediction >= threshold)]
            t_stats = calculate_threshold_stats(df_t, total_loans, threshold, filename)
            stats.append(t_stats)

    return stats
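
Because *os.listdir()* returns every entry in the folder, any stray file or subfolder in ./models would be passed to *load()*. A minimal sketch of a stricter listing is shown below; the helper name `list_model_files` is an assumption and not part of the worksheet, and the demo runs against a temporary folder with made-up file names.

```python
import tempfile
from pathlib import Path

def list_model_files(models_dir="./models", suffix=".joblib"):
    """Return the model file names in models_dir, sorted, skipping anything else."""
    return sorted(
        p.name for p in Path(models_dir).iterdir()
        if p.is_file() and p.suffix == suffix
    )

# Demo on a temporary folder so the sketch runs without the real ./models directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["xgboost_f1.joblib", "notes.txt", "random_forest_precision.joblib"]:
        (Path(tmp) / name).touch()
    (Path(tmp) / ".ipynb_checkpoints").mkdir()
    found = list_model_files(tmp)

print(found)
```

Sorting also makes the order in which the models are evaluated reproducible across machines.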

We can now run the validation function and inspect the results.

In [None]:
# Start the validation and get stats.
stats = validate_models(validation_df, input_cols)

## Interpreting the results
We can see that the xgboost and random forest models perform the best, while logistic regression performs poorly. We also see that the highest mean annual ROI is achieved with the higher thresholds, which result in a low number of investments made.

In [52]:
results = pd.DataFrame(stats).set_index("Model").round(3).sort_values(by='ROI_annual_mean', ascending=False)
results

| Model | Threshold | Total_loans | Investments_made | Investments_made_percentage | No_of_preferred_loans | Precision | ROI_annual_mean |
|---|---|---|---|---|---|---|---|
| xgboost_roc_auc | 0.950 | 42338 | 3 | 0.007 | 3 | 1.000 | 19.000 |
| xgboost_roc_auc | 0.925 | 42338 | 8 | 0.019 | 6 | 0.750 | 15.350 |
| xgboost_f1 | 0.950 | 42338 | 5 | 0.012 | 5 | 1.000 | 13.895 |
| xgboost_roc_auc | 0.900 | 42338 | 30 | 0.071 | 26 | 0.867 | 13.525 |
| random_forest_precision | 0.600 | 42338 | 72 | 0.170 | 61 | 0.847 | 12.820 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| xgboost_precision | 0.900 | 42338 | 0 | 0.000 | 0 | NaN | NaN |
| xgboost_precision | 0.925 | 42338 | 0 | 0.000 | 0 | NaN | NaN |
| xgboost_precision | 0.950 | 42338 | 0 | 0.000 | 0 | NaN | NaN |
| xgboost_precision | 0.975 | 42338 | 0 | 0.000 | 0 | NaN | NaN |


To get a more realistic mean annual ROI, we should consider models that have a higher percentage of investments made. If we set the minimum percentage to 0.5%, we expect to invest in at least every 200th loan. The higher the percentage, the more loans we invest in.
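
The arithmetic behind the 0.5% cut-off can be checked with a few lines. The variable names below are chosen for illustration; the loan counts are taken from the results shown in this worksheet.

```python
import math

total_loans = 42338    # Size of the validation set (Total_loans column).
min_percentage = 0.5   # Minimum share of loans to invest in, in percent.

# Smallest number of investments that still reaches the minimum percentage.
min_investments = math.ceil(total_loans * min_percentage / 100)
print(min_investments)

# A model and threshold combination with 228 investments clears the cut-off.
percentage_228 = round(228 / total_loans * 100, 3)
print(percentage_228)
```

With 42338 validation loans, a model and threshold combination therefore needs at least 212 investments to be included.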

In [53]:
results.loc[(results.Investments_made_percentage >= 0.5)][:10]

| Model | Threshold | Total_loans | Investments_made | Investments_made_percentage | No_of_preferred_loans | Precision | ROI_annual_mean |
|---|---|---|---|---|---|---|---|
| xgboost_roc_auc | 0.8 | 42338 | 299 | 0.706 | 261 | 0.873 | 10.652 |
| xgboost_roc_auc | 0.775 | 42338 | 416 | 0.983 | 353 | 0.849 | 10.512 |
| xgboost_roc_auc | 0.75 | 42338 | 570 | 1.346 | 468 | 0.821 | 10.38 |
| xgboost_precision | 0.725 | 42338 | 330 | 0.779 | 272 | 0.824 | 10.126 |
| random_forest_roc_auc | 0.675 | 42338 | 228 | 0.539 | 193 | 0.846 | 9.741 |
| xgboost_roc_auc | 0.725 | 42338 | 750 | 1.771 | 602 | 0.803 | 9.666 |
| xgboost_f1 | 0.85 | 42338 | 251 | 0.593 | 208 | 0.829 | 9.398 |
| xgboost_roc_auc | 0.7 | 42338 | 969 | 2.289 | 765 | 0.789 | 9.294 |
| xgboost_f1 | 0.825 | 42338 | 400 | 0.945 | 324 | 0.81 | 9.274 |
| random_forest_precision | 0.525 | 42338 | 498 | 1.176 | 401 | 0.805 | 9.117 |


We classified all the loans that have had problems as "not preferable". But there is always a possibility that a loan recovers and payments resume. This means that these loans can still be active and we have no way of knowing their final outcome. The loans which have recovered and are still active also affect the mean annual ROI. If we eliminate the loans that still have the status "Current", we should get more accurate results.

In [None]:
validation_df_no_current = validation_df.loc[(validation_df.Status != 'Current')]
stats_no_current = validate_models(validation_df_no_current, input_cols)

With the exclusion of currently active loans, the mean annual ROI increases by roughly one to two percentage points. It should also be noted that we did not exclude loans with the status "Late", which might also have a chance to recover and therefore further increase the annual ROI.

In [55]:
results_no_current = pd.DataFrame(stats_no_current).set_index("Model").round(3).sort_values(by='ROI_annual_mean', ascending=False)
results_no_current.loc[(results_no_current.Investments_made_percentage >= 0.5)][:10]

| Model | Threshold | Total_loans | Investments_made | Investments_made_percentage | No_of_preferred_loans | Precision | ROI_annual_mean |
|---|---|---|---|---|---|---|---|
| xgboost_roc_auc | 0.8 | 31206 | 296 | 0.949 | 261 | 0.882 | 11.699 |
| xgboost_roc_auc | 0.75 | 31206 | 555 | 1.779 | 468 | 0.843 | 11.65 |
| xgboost_roc_auc | 0.775 | 31206 | 409 | 1.311 | 353 | 0.863 | 11.498 |
| xgboost_roc_auc | 0.725 | 31206 | 723 | 2.317 | 602 | 0.833 | 11.452 |
| xgboost_precision | 0.75 | 31206 | 195 | 0.625 | 173 | 0.887 | 11.368 |
| xgboost_roc_auc | 0.7 | 31206 | 924 | 2.961 | 765 | 0.828 | 11.262 |
| random_forest_precision | 0.525 | 31206 | 478 | 1.532 | 401 | 0.839 | 11.236 |
| xgboost_precision | 0.725 | 31206 | 322 | 1.032 | 272 | 0.845 | 11.16 |
| random_forest_roc_auc | 0.675 | 31206 | 222 | 0.711 | 193 | 0.869 | 11.046 |
| xgboost_roc_auc | 0.825 | 31206 | 187 | 0.599 | 167 | 0.893 | 11.014 |


It is good to see that we can consistently reach a mean annual ROI of at least 11%. When choosing which model to use, there are multiple things to consider. There should be a good balance between precision and the number of investments made. If the precision is really high but the number of investments is really low, then in practice the model might not be viable, as the preferred loans are very rare.

From this list I would consider xgboost_roc_auc with the following thresholds: 0.750, 0.775, 0.725 and 0.700. The reason is that the precision and annual ROI are quite high, and the number of investments made also seems suitable and not too low.