### **4. Modeling**
---

We will perform:
1. Forward Selection Procedure
2. Best Model Adjustment

#### **Perform Forward Selection Procedure**
---
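Start with a null model (no predictors). At each step, add the single remaining predictor that gives the best cross-validated score, and repeat until every predictor has been added. The best model is the step with the highest CV score.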
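The control flow of the procedure can be sketched without any modeling library. In the sketch below, `forward_select`, `toy_score`, and `GAINS` are hypothetical names, and `toy_score` merely stands in for the cross-validated score that the notebook computes with `cross_validate`; the numbers are made up to show how the best step is chosen.

```python
def forward_select(candidates, score_fn):
    """Greedy forward selection over `candidates` using `score_fn` (both hypothetical)."""
    selected = []
    remaining = list(candidates)
    # Step 0: the null model with no predictors.
    history = [(list(selected), score_fn(selected))]

    while remaining:
        # Score every model that adds exactly one remaining predictor.
        scored = [(score_fn(selected + [p]), p) for p in remaining]
        best_score, best_p = max(scored)
        selected = selected + [best_p]
        remaining.remove(best_p)
        history.append((list(selected), best_score))

    # The winner is the step with the highest score, not the largest model.
    best_predictors, best_score = max(history, key=lambda step: step[1])
    return history, best_predictors, best_score


# Made-up additive gains: predictor 2 lowers the score once it is added.
GAINS = {0: 30, 1: 10, 2: -5}


def toy_score(predictors):
    """Stand-in for a mean CV score (illustrative only)."""
    return 50 + sum(GAINS[p] for p in predictors)


history, best_predictors, best_score = forward_select([0, 1, 2], toy_score)
print(best_predictors, best_score)
```

Every predictor is still added in turn, but the step that is kept is the one with the highest score, which is why the selected model can be much smaller than the full predictor list.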

In [46]:
# Import library
import pandas as pd
import numpy as np

# Import library for modeling
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression

# Load configuration
import src.utils as utils

In [47]:
CONFIG_DATA = utils.load_config()
CONFIG_DATA

{'raw_data_path': 'data/raw/credit_dataset.csv',
 'data_path': 'data/output/data.pkl',
 'predictors_set_path': 'data/output/predictors.pkl',
 'target_set_path': 'data/output/target.pkl',
 'train_path': ['data/output/X_train.pkl', 'data/output/y_train.pkl'],
 'test_path': ['data/output/X_test.pkl', 'data/output/y_test.pkl'],
 'data_train_path': 'data/output/data_train.pkl',
 'data_train_binned_path': 'data/output/data_train_binned.pkl',
 'crosstab_list_path': 'data/output/crosstab_list.pkl',
 'WOE_table_path': 'data/output/WOE_table.pkl',
 'IV_table_path': 'data/output/IV_table.pkl',
 'WOE_map_dict_path': 'data/output/WOE_map_dict.pkl',
 'X_train_woe_path': 'data/output/X_train_woe.pkl',
 'target_variable': 'Credit_Score',
 'test_size': 0.3,
 'num_columns': ['Age',
  'Annual_Income',
  'Num_of_Loan',
  'Num_of_Delayed_Payment',
  'Outstanding_Debt',
  'Monthly_Inhand_Salary',
  'Num_Credit_Inquiries',
  'Credit_Utilization_Ratio',
  'Total_EMI_per_month',
  'Num_Bank_Accounts',
  'Num_C

Define the function `forward()`, which cross-validates each candidate model on the training set and records its mean CV score.

In [48]:
# Function for one step of forward selection
def forward(X, y, predictors, scoring='roc_auc', cv=5):
    """Cross-validate every model that adds one remaining predictor to `predictors`."""

    # Get the sample size and the total number of predictors.
    n_samples, n_predictors = X.shape

    # Index every predictor column.
    col_list = np.arange(n_predictors)

    # Keep only the predictors not yet in the model.
    remaining_predictors = [p for p in col_list if p not in predictors]

    # Initialize the lists of candidate predictor sets and their CV scores.
    pred_list = []
    score_list = []

    # Cross-validate the current model extended by each remaining predictor.
    for p in remaining_predictors:
        combi = predictors + [p]

        # Extract the columns for this candidate set
        X_ = X[:, combi]
        y_ = y

        # Define the estimator
        model = LogisticRegression(penalty = 'l2',
                                   class_weight = 'balanced')

        # Cross-validate the model with the chosen scoring metric
        cv_results = cross_validate(estimator = model,
                                    X = X_,
                                    y = y_,
                                    scoring = scoring,
                                    cv = cv)

        # Compute the mean CV score across folds.
        score_ = np.mean(cv_results['test_score'])

        # Add the combination of predictors and their CV score to the list.
        pred_list.append(list(combi))
        score_list.append(score_)

    # Collect the results in a table.
    models = pd.DataFrame({"Predictors": pred_list,
                           "CV Score": score_list})

    # Select the best model (idxmax returns the row label that .loc expects).
    best_model = models.loc[models['CV Score'].idxmax()]

    return models, best_model
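
The `CV Score` that `forward()` records is the mean of the per-fold test scores. Below is a minimal sketch of that idea, assuming plain unshuffled k-fold splitting. scikit-learn's `cross_validate` uses stratified folds for classifiers when `cv` is an integer, which this sketch does not replicate. `kfold_indices`, `mean_cv_score`, and `fit_and_score` are hypothetical names.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, valid_idx) pairs for plain, unshuffled k-fold splitting."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        valid_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, valid_idx
        start += size


def mean_cv_score(n_samples, n_folds, fit_and_score):
    """Average a per-fold score, mirroring np.mean(cv_results['test_score'])."""
    scores = [fit_and_score(train_idx, valid_idx)
              for train_idx, valid_idx in kfold_indices(n_samples, n_folds)]
    return sum(scores) / len(scores)


# Hypothetical scorer: a real one would fit on train_idx and score on valid_idx.
score = mean_cv_score(10, 5, lambda train_idx, valid_idx: len(valid_idx) / 10)
print(score)
```

Each row is used for validation exactly once, so the mean summarizes how the model performs on data it was not fitted on.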

In [49]:
# Function to run forward selection over all predictors
def run_forward():
    """Run forward selection over every predictor and save the results."""

    cv = CONFIG_DATA['num_of_cv']
    scoring = CONFIG_DATA['scoring']

    X_train_woe_path = CONFIG_DATA['X_train_woe_path']
    X_train_woe = utils.load_pickle(X_train_woe_path)
    X_train = X_train_woe.to_numpy()

    y_train_path = CONFIG_DATA['train_path'][1]
    y_train = utils.load_pickle(y_train_path)
    y_train = y_train.to_numpy()

    # First, fit the null model.
    # The null model has no predictors.
    predictor = []

    # Represent the null model with a single all-zero column.
    X_null = np.zeros((X_train.shape[0], 1))

    # Define the estimator
    model = LogisticRegression(penalty = 'l2',
                               class_weight = 'balanced')

    # Cross validate
    cv_results = cross_validate(estimator = model,
                                X = X_null,
                                y = y_train,
                                cv = cv,
                                scoring = scoring)

    # Compute the mean CV score across folds.
    score_ = np.mean(cv_results['test_score'])

    # Build a table holding the best model for each number of predictors k.
    # Start with the null model results.
    forward_models = pd.DataFrame({"Predictors": [predictor],
                                   "CV Score": [score_]})

    # Run forward selection over all predictors.
    # Start from an empty predictor list.
    predictors = []
    n_predictors = X_train.shape[1]

    # Find the best model for each size k = 1, ..., n_predictors.
    for k in range(n_predictors):
        _, best_model = forward(X = X_train,
                                y = y_train,
                                predictors = predictors,
                                scoring = scoring,
                                cv = cv)

        # Store the best model with k+1 predictors and carry its predictors forward.
        forward_models.loc[k+1] = best_model
        predictors = best_model['Predictors']

    # Find the row with the best CV score (idxmax returns the label that .loc expects)
    best_idx = forward_models['CV Score'].idxmax()
    best_cv_score = forward_models['CV Score'].loc[best_idx]
    best_predictors = forward_models['Predictors'].loc[best_idx]

    # Print the summary
    print('===================================================')
    print('Best index            :', best_idx)
    print('Best CV Score         :', best_cv_score)
    print('Best predictors (idx) :', best_predictors)
    print('Best predictors       :')
    print(X_train_woe.columns[best_predictors].tolist())
    print('===================================================')

    print(forward_models)
    print('===================================================')
    
    forward_models_path = CONFIG_DATA['forward_models_path']
    utils.dump_pickle(forward_models, forward_models_path)

    best_predictors_path = CONFIG_DATA['best_predictors_path']
    utils.dump_pickle(best_predictors, best_predictors_path)

    return forward_models, best_predictors

In [50]:
run_forward()

Best index            : 2
Best CV Score         : 0.9114868456019133
Best predictors (idx) : [11, 5]
Best predictors       :
['Changed_Credit_Limit', 'Num_Credit_Card']
                                           Predictors  CV Score
0                                                  []  0.000000
1                                                [11]  0.874289
2                                             [11, 5]  0.911487
3                                         [11, 5, 15]  0.902549
4                                     [11, 5, 15, 18]  0.882310
5                                  [11, 5, 15, 18, 6]  0.856865
6                              [11, 5, 15, 18, 6, 14]  0.895473
7                           [11, 5, 15, 18, 6, 14, 7]  0.895943
8                       [11, 5, 15, 18, 6, 14, 7, 21]  0.895945
9                    [11, 5, 15, 18, 6, 14, 7, 21, 8]  0.897358
10               [11, 5, 15, 18, 6, 14, 7, 21, 8, 20]  0.895945
11           [11, 5, 15, 18, 6, 14, 7, 21, 8, 20, 19]  0.895004

(                                           Predictors  CV Score
 0                                                  []  0.000000
 1                                                [11]  0.874289
 2                                             [11, 5]  0.911487
 3                                         [11, 5, 15]  0.902549
 4                                     [11, 5, 15, 18]  0.882310
 5                                  [11, 5, 15, 18, 6]  0.856865
 6                              [11, 5, 15, 18, 6, 14]  0.895473
 7                           [11, 5, 15, 18, 6, 14, 7]  0.895943
 8                       [11, 5, 15, 18, 6, 14, 7, 21]  0.895945
 9                    [11, 5, 15, 18, 6, 14, 7, 21, 8]  0.897358
 10               [11, 5, 15, 18, 6, 14, 7, 21, 8, 20]  0.895945
 11           [11, 5, 15, 18, 6, 14, 7, 21, 8, 20, 19]  0.895004
 12        [11, 5, 15, 18, 6, 14, 7, 21, 8, 20, 19, 3]  0.894528
 13     [11, 5, 15, 18, 6, 14, 7, 21, 8, 20, 19, 3, 2]  0.892648
 14  [11, 5, 15, 18, 6, 1

In [53]:
# Function to fit the best model on the full training set
def best_model_fitting(best_predictors):
    """Fit the best model on the full X_train and save the model and its summary."""

    X_train_path = CONFIG_DATA['X_train_woe_path']
    X_train_woe = utils.load_pickle(X_train_path)
    X_train = X_train_woe.to_numpy()

    y_train_path = CONFIG_DATA['train_path'][1]
    y_train = utils.load_pickle(y_train_path)
    y_train = y_train.to_numpy()

    if best_predictors is None:
        # Fall back to the predictors saved by run_forward().
        best_predictors_path = CONFIG_DATA['best_predictors_path']
        best_predictors = utils.load_pickle(best_predictors_path)
        print("Best predictors index   :", best_predictors)
    else:
        print("[Adjusted] best predictors index   :", best_predictors)

    # Use the best predictors to define X.
    X_train_best = X_train[:, best_predictors]

    # Fit best model
    best_model = LogisticRegression(penalty = 'l2',
                                    class_weight = 'balanced')
    best_model.fit(X_train_best, y_train)

    print(best_model)

    # Extract parameter estimates from the optimal model.
    best_model_intercept = pd.DataFrame({'Characteristic': 'Intercept',
                                         'Estimate': best_model.intercept_})
    
    best_model_params = X_train_woe.columns[best_predictors].tolist()

    best_model_coefs = pd.DataFrame({'Characteristic': best_model_params,
                                     'Estimate': np.reshape(best_model.coef_, 
                                                            len(best_predictors))})

    best_model_summary = pd.concat((best_model_intercept, best_model_coefs),
                                   axis = 0,
                                   ignore_index = True)
    
    print('===================================================')
    print(best_model_summary)
    
    best_model_path = CONFIG_DATA['best_model_path']
    utils.dump_pickle(best_model, best_model_path)

    best_model_summary_path = CONFIG_DATA['best_model_summary_path']
    utils.dump_pickle(best_model_summary, best_model_summary_path)

    return best_model, best_model_summary

In [54]:
best_model_fitting(best_predictors = None)

Best predictors index   : [11, 5]
LogisticRegression(class_weight='balanced')
         Characteristic  Estimate
0             Intercept -0.004195
1  Changed_Credit_Limit  0.777330
2       Num_Credit_Card  0.940723
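
To read these estimates: the fitted model turns a row of WOE values into a log-odds value, then into a probability with the logistic function. Below is a minimal sketch using the estimates printed above. The WOE inputs are made-up values, `predict_proba_manual` is a hypothetical helper, and the probability refers to the second class in `best_model.classes_`, so its meaning depends on how `Credit_Score` is encoded in `y_train`.

```python
import math

# Estimates copied from the summary table above.
INTERCEPT = -0.004195
COEFS = {"Changed_Credit_Limit": 0.777330, "Num_Credit_Card": 0.940723}


def predict_proba_manual(woe_row):
    """Return the positive-class probability for one row of WOE values."""
    log_odds = INTERCEPT + sum(COEFS[name] * woe_row[name] for name in COEFS)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Made-up WOE values, for illustration only.
neutral = {"Changed_Credit_Limit": 0.0, "Num_Credit_Card": 0.0}
shifted = {"Changed_Credit_Limit": 0.5, "Num_Credit_Card": -0.2}

print(round(predict_proba_manual(neutral), 4))
print(round(predict_proba_manual(shifted), 4))
```

With all WOE values at zero, only the intercept contributes, so the probability sits just below 0.5. Both coefficients are positive, so a higher WOE in either characteristic raises the probability.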



#### **Best Model Adjustment**
---

A scorecard that is too simple rarely holds up over time because:
- It is sensitive to small shifts in the applicant profile.
- A competent adjudicator would never base a decision on only two characteristics from an application form.

The final model therefore includes more characteristics than the two selected above (the fit below uses the first 11, indices 0 to 10):
- The independence test indicates that no attribute is independent of the response variable (default probability).
- A final scorecard typically has 8 to 15 characteristics.

In [55]:
best_model_fitting(best_predictors = np.arange(11).tolist())

[Adjusted] best predictors index   : [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
LogisticRegression(class_weight='balanced')
            Characteristic  Estimate
0                Intercept -0.048574
1                      Age  0.266541
2               Occupation  0.944933
3            Annual_Income  0.110433
4    Monthly_Inhand_Salary  0.061898
5        Num_Bank_Accounts  0.184315
6          Num_Credit_Card  0.342532
7            Interest_Rate  0.514895
8              Num_of_Loan  0.285725
9             Type_of_Loan -0.066199
10     Delay_from_due_date  0.370126
11  Num_of_Delayed_Payment  0.226606
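
One alternative to fixing the index list by hand is to reuse the forward-selection table and pick the best-scoring step whose size falls in the 8-to-15 range. This is only a sketch of that idea, not what the notebook does. `pick_in_range` is a hypothetical helper, and the values are copied from the printed `run_forward()` output for k = 8 to 13 (later rows are truncated in that output, so they are left out).

```python
# Selection order from the printed forward-selection table (first 13 steps).
ORDER = [11, 5, 15, 18, 6, 14, 7, 21, 8, 20, 19, 3, 2]

# CV score by number of predictors k, copied for k = 8..13.
CV_SCORE = {8: 0.895945, 9: 0.897358, 10: 0.895945,
            11: 0.895004, 12: 0.894528, 13: 0.892648}


def pick_in_range(order, cv_score, k_min=8, k_max=15):
    """Return (k, predictors) with the highest CV score for k_min <= k <= k_max."""
    eligible = {k: s for k, s in cv_score.items() if k_min <= k <= k_max}
    best_k = max(eligible, key=eligible.get)
    # Forward selection is nested, so the first k entries form the k-predictor model.
    return best_k, order[:best_k]


best_k, adjusted_predictors = pick_in_range(ORDER, CV_SCORE)
print(best_k, adjusted_predictors)
```

The resulting list could then be passed as `best_model_fitting(best_predictors = adjusted_predictors)`, which keeps the scorecard within the usual size range while still following the CV ranking.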

