# **Modelling and Evaluation - Predict Default (Classification)**

## Objectives

*   Fit and evaluate a classification model to predict whether a loan applicant will default.
*   Answer **business requirement 2**:
    * The client wants a classification model that predicts a loan applicant's default event with high confidence, specifically with a precision of at least 85%.
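
To make the 85% target concrete, here is a minimal sketch of how precision is derived from prediction counts and compared with the requirement. The counts and the names `precision` and `REQUIRED_PRECISION` are hypothetical and only illustrate the check; they are not results from this notebook.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of positive predictions that turn out to be correct."""
    predicted_positives = true_positives + false_positives
    if predicted_positives == 0:
        return 0.0  # no positive predictions were made
    return true_positives / predicted_positives


# Hypothetical counts for the "Default" class (illustration only).
tp, fp = 90, 10
REQUIRED_PRECISION = 0.85  # threshold taken from business requirement 2

default_precision = precision(tp, fp)
meets_requirement = default_precision >= REQUIRED_PRECISION
print(f"precision={default_precision:.2f}, meets requirement: {meets_requirement}")
```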

## Inputs

* outputs/datasets/collection/row/LoanDefaultDataset.csv
* Instructions on which variables to use for data cleaning and feature engineering. Those instructions are found in FeatureEngineering Notebook.

## Outputs

* The following is a list of files to be saved in the output folder:

  - Train Set
  - Test Set
  - Modeling pipeline
  - label map
  - feature importance plot

---

## **SetUp**

### Imports

In [None]:
import os
import warnings
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt
import seaborn as sns
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from feature_engine.selection import SmartCorrelatedSelection
from feature_engine.encoding import OrdinalEncoder
from feature_engine import transformation as vt
from sklearn.metrics import make_scorer, recall_score, classification_report, confusion_matrix
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier, AdaBoostClassifier
from xgboost import XGBClassifier



### Change working directory

* Change the working directory from its current folder to its parent folder.

In [None]:
current_dir = os.getcwd()
current_dir

* Make the parent of the current directory the new current directory.

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

* Confirm the new current directory.

In [None]:
current_dir = os.getcwd()
current_dir

## **Dataset Loading**

- Load the raw dataset.

In [None]:
df = (pd.read_csv("outputs/datasets/collection/row/LoanDefaultDataset.csv"))
df.head(3)

## **ML Pipeline with All Features**

### **Split Train and Test Set**

- Split the raw dataset into Train and Test sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(['loan_status'], axis=1),
    df['loan_status'],
    test_size=0.2,
    random_state=0,
)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

### **Data Cleaning and Feature Engineering Pipeline**

#### Data Cleaning

- Since the data is already clean, this step is skipped.

#### Feature Engineering Pipeline

- The feature engineering pipeline is extracted from **Notebook 03**.

In [None]:
def PipelineFeatureEngineering():
    pipeline_base = Pipeline([
        ("OrdinalCategoricalEncoder", OrdinalEncoder(encoding_method='arbitrary',
                                                     variables=[
                                                        'person_gender',
                                                        'person_education',
                                                        'person_home_ownership',
                                                        'loan_intent',
                                                        'previous_loan_defaults_on_file',
                                                        ])),
        ("YeoJohnsonTransformer", vt.YeoJohnsonTransformer(
            variables = [
                'person_income',
                'loan_amnt',
                'loan_percent_income',
                'credit_score',
                ])),

        ("SmartCorrelatedSelection", SmartCorrelatedSelection(variables=None,
         method="spearman", threshold=0.7, selection_method="variance")), # to be dropped = ['person_age', 'cb_person_cred_hist_length'].
    ])

    return pipeline_base


PipelineFeatureEngineering()

- Apply Feature Engineering Pipeline.

In [None]:
warnings.filterwarnings('ignore')

pipeline_feat_eng = PipelineFeatureEngineering()
X_train = pipeline_feat_eng.fit_transform(X_train)
X_test = pipeline_feat_eng.transform(X_test)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
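
As background for the `YeoJohnsonTransformer` step, the following is a minimal sketch of the Yeo-Johnson formula for a single value. It assumes a hand-picked `lmbda`, whereas the transformer estimates lambda from the training data; the income values are hypothetical.

```python
import math


def yeo_johnson(x: float, lmbda: float) -> float:
    """Yeo-Johnson power transform of one value for a given lambda."""
    if x >= 0:
        if abs(lmbda) < 1e-12:
            return math.log1p(x)  # limit case lambda = 0
        return ((x + 1.0) ** lmbda - 1.0) / lmbda
    if abs(lmbda - 2.0) < 1e-12:
        return -math.log1p(-x)  # limit case lambda = 2
    return -(((-x + 1.0) ** (2.0 - lmbda)) - 1.0) / (2.0 - lmbda)


# Hypothetical skewed incomes; lambda = 0 behaves like log(1 + x).
incomes = [12_000.0, 48_000.0, 250_000.0]
transformed = [yeo_johnson(value, 0.0) for value in incomes]
print(transformed)
```

The transform is monotonic, so the ordering of the values is preserved while the long right tail is compressed.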

### **Target Balance Analysis**

- Display Train Set Target (loan_status) distribution.
- Evaluate if the two loan_status classes (Default = 0, No Default = 1) are balanced.

In [None]:
sns.set_style("whitegrid")
y_train.value_counts().plot(kind='bar', title='Train Set Target Distribution')
plt.show()

> Result:

- The Default class (0) has more occurrences, so the train set is imbalanced and ought to be balanced.
- To balance both classes in the Train Set, SMOTE (Synthetic Minority Oversampling TEchnique) is used.
- SMOTE balances the classes by oversampling the minority class (No Default = 1).
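
The idea behind SMOTE can be sketched in a few lines: each synthetic sample is placed on the segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration under assumed toy data, not the `imblearn` implementation; the function `smote_like_oversample` and all values are hypothetical.

```python
import numpy as np


def smote_like_oversample(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))
        # Distances from sample i to every minority sample.
        dists = np.linalg.norm(X_minority - X_minority[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]  # skip the sample itself
        j = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)


# Hypothetical toy data: 4 minority rows versus 10 majority rows.
X_minority = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.5], [1.2, 2.2]])
n_majority = 10
synthetic = smote_like_oversample(X_minority, n_new=n_majority - len(X_minority))
print(synthetic.shape)
```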

- Apply **SMOTE (Synthetic Minority Oversampling TEchnique)**.

In [None]:
oversample = SMOTE(sampling_strategy='minority', random_state=0)
X_train, y_train = oversample.fit_resample(X_train, y_train)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

- Evaluate the Train Set Target distribution after oversampling.

In [None]:
y_train.value_counts().plot(kind='bar', title='Train Set Target Distribution')
plt.show()

### **ML Pipeline**

#### Define the main functions and Dictionaries for creating the pipeline.

- Classification Pipeline.

In [None]:
def PipelineClf(model):
    pipeline_base = Pipeline([
        ("scaler", StandardScaler()),
        ("feat_selection", SelectFromModel(model)),
        ("model", model),
    ])

    return pipeline_base
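
The `feat_selection` step keeps only the features whose importance reaches a threshold. As a rough sketch, assuming the default threshold of `SelectFromModel` (the mean importance, for estimators without an L1 penalty), the selection rule looks like this. The feature names and importance values are hypothetical.

```python
from statistics import mean

# Hypothetical importances, as a fitted tree-based model might report them.
importances = {
    "feature_a": 0.45,
    "feature_b": 0.30,
    "feature_c": 0.15,
    "feature_d": 0.06,
    "feature_e": 0.04,
}

# Assumed default rule: keep features at or above the mean importance.
threshold = mean(importances.values())
selected = [name for name, value in importances.items() if value >= threshold]
print(f"threshold={threshold:.2f}, selected={selected}")
```

Only the selected columns are passed on to the final `model` step, which is why the fitted model later reports fewer importances than there are input columns.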

-  Hyperparameter Optimization Class.

In [None]:
class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")

            model = PipelineClf(self.models[key])
            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs,
                              verbose=verbose, scoring=scoring, )
            gs.fit(X, y)
            self.grid_searches[key] = gs

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                'estimator': key,
                'min_score': min(scores),
                'max_score': max(scores),
                'mean_score': np.mean(scores),
                'std_score': np.std(scores),
            }
            return pd.Series({**params, **d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]
                scores.append(r.reshape(len(params), 1))

            all_scores = np.hstack(scores)
            for p, s in zip(params, all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)
        columns = ['estimator', 'min_score',
                   'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]
        return df[columns], self.grid_searches
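
The reshaping inside `score_summary` can be hard to follow, so here is a small sketch of the same idea on a hand-made stand-in for `cv_results_`: the per-split score arrays are stacked as columns, giving one row of fold scores per candidate configuration. The dictionary contents are hypothetical.

```python
import numpy as np

# Hypothetical stand-in for GridSearchCV.cv_results_: 3 candidates, 2 folds.
cv_results = {
    "params": [
        {"model__n_estimators": 50},
        {"model__n_estimators": 100},
        {"model__n_estimators": 140},
    ],
    "split0_test_score": np.array([0.90, 0.92, 0.91]),
    "split1_test_score": np.array([0.88, 0.94, 0.93]),
}
n_splits = 2

# One column per fold, one row per candidate.
columns = [cv_results[f"split{i}_test_score"].reshape(-1, 1) for i in range(n_splits)]
all_scores = np.hstack(columns)  # shape: (n_candidates, n_splits)

summary = [
    {**params, "min_score": s.min(), "mean_score": s.mean(), "max_score": s.max()}
    for params, s in zip(cv_results["params"], all_scores)
]
best = max(summary, key=lambda row: row["mean_score"])
print(best)
```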

* Define a dictionary for estimators with standard hyperparameters.

In [None]:
models_quick_search = {
    "LogisticRegression": LogisticRegression(random_state=0),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=0),
    "ExtraTreesClassifier": ExtraTreesClassifier(random_state=0),
    "AdaBoostClassifier": AdaBoostClassifier(random_state=0),
}

params_quick_search = {
    "LogisticRegression": {},
    "DecisionTreeClassifier": {},
    "RandomForestClassifier": {},
    "GradientBoostingClassifier": {},
    "ExtraTreesClassifier": {},
    "AdaBoostClassifier": {},
}

#### Grid Search CV

##### Estimator Search

- Evaluate estimators performance scores.

In [None]:
warnings.filterwarnings('ignore')

search = HyperparameterOptimizationSearch(models=models_quick_search, params=params_quick_search)
search.fit(X_train, y_train,
           scoring =  make_scorer(recall_score, pos_label=0),
           n_jobs=-1, cv=5)

* Summarize the estimators' performance scores.
* Display the results in a table.

In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary 

> Result:

* The best-performing estimators are listed below in descending order of mean score:
  * **ExtraTreesClassifier** with a mean score of **0.922684**
  * **DecisionTreeClassifier** with a mean score of **0.913894**
  * **RandomForestClassifier** with a mean score of **0.911572**
* **Important Note**:
  * When the **ExtraTreesClassifier** is used, the pipeline_clf pickle file becomes extremely large (> 100 MB, which exceeds GitHub's maximum file size). Therefore, the **RandomForestClassifier** is used as an alternative.
* The next step searches the hyperparameter configurations for a better estimator score.

* Define the best Estimator.

In [None]:
best_model = grid_search_summary.iloc[2,0] # Select the third best model
best_model

##### Search Hyperparameter Configurations

* Define a dictionary for the best estimator: **RandomForestClassifier**.
* Define a dictionary for candidate hyperparameter configurations of each nominated estimator.

In [None]:
models_search = {
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
}
params_search = {
    "RandomForestClassifier":{'model__n_estimators': [100,50,140],
                             'model__max_depth': [None,4, 15],
                             'model__min_samples_split': [2,50],
                             'model__min_samples_leaf': [1,50],
                             'model__max_leaf_nodes': [None,50],
                            },
  }
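
To gauge the cost of this search, the grid can be expanded by hand. The sketch below copies the parameter grid above and counts the candidate configurations and the resulting cross-validation fits for `cv=5`; the variable names are illustrative.

```python
from itertools import product

# Same grid as in the cell above.
param_grid = {
    "model__n_estimators": [100, 50, 140],
    "model__max_depth": [None, 4, 15],
    "model__min_samples_split": [2, 50],
    "model__min_samples_leaf": [1, 50],
    "model__max_leaf_nodes": [None, 50],
}

keys = list(param_grid)
candidates = [dict(zip(keys, values)) for values in product(*param_grid.values())]

n_folds = 5
n_fits = len(candidates) * n_folds  # cross-validation fits, excluding the final refit
print(len(candidates), n_fits)
```

Each extra value in any list multiplies the number of candidates, so the grid grows quickly as options are added.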

* Apply HyperparameterOptimizationSearch on candidate estimators with the candidate parameter configurations.

In [None]:
search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
search.fit(X_train, y_train,
           scoring =  make_scorer(recall_score, pos_label=0),
           n_jobs=-1, cv=5)

* Summarize the configurations score.

In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

* Display the best hyperparameter configuration for **RandomForestClassifier**.

In [None]:
best_parameters = grid_search_pipelines[best_model].best_params_
best_parameters

> Result:
- Optimizing the hyperparameters improves the estimator performance only marginally.
- The mean score with the optimized hyperparameters is 0.911822, compared with 0.911572 for the default configuration, so there is no significant improvement.
- best_parameters = `{'model__max_depth': None,
 'model__max_leaf_nodes': None,
 'model__min_samples_leaf': 1,
 'model__min_samples_split': 2,
 'model__n_estimators': 50}`.

* Define the best clf pipeline.

In [None]:
pipeline_clf = grid_search_pipelines[best_model].best_estimator_
pipeline_clf

### Pipeline Evaluation

In [None]:
def confusion_matrix_and_report(X, y, pipeline, label_map):

    prediction = pipeline.predict(X)

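    print('---  Confusion Matrix  ---')
    # Passing the predictions as y_true puts predictions on the rows and
    # actual values on the columns, matching the labels below.
    print(pd.DataFrame(confusion_matrix(y_true=prediction, y_pred=y),
          columns=[["Actual " + sub for sub in label_map]],
          index=[["Prediction " + sub for sub in label_map]]
          ))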
    print("\n")

    print('---  Classification Report  ---')
    print(classification_report(y, prediction, target_names=label_map), "\n")


def clf_performance(X_train, y_train, X_test, y_test, pipeline, label_map):
    print("#### Train Set #### \n")
    confusion_matrix_and_report(X_train, y_train, pipeline, label_map)

    print("#### Test Set ####\n")
    confusion_matrix_and_report(X_test, y_test, pipeline, label_map)
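
The orientation of the printed matrix matters when reading precision and recall from it. The following sketch builds a matrix with predictions on the rows and actual values on the columns, the same layout as the labels used above, and reads both metrics for class 0. The helper name and the label lists are hypothetical.

```python
def prediction_by_actual_matrix(y_actual, y_predicted, labels):
    """Count matrix with one row per predicted label and one column per actual label."""
    index = {label: pos for pos, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_actual, y_predicted):
        matrix[index[predicted]][index[actual]] += 1
    return matrix


# Hypothetical labels: 0 = Default, 1 = No Default.
y_actual = [0, 0, 0, 0, 1, 1, 1]
y_predicted = [0, 0, 0, 1, 0, 0, 1]
matrix = prediction_by_actual_matrix(y_actual, y_predicted, labels=[0, 1])

# Precision reads along a row (everything predicted as 0);
# recall reads down a column (everything that actually is 0).
precision_default = matrix[0][0] / sum(matrix[0])
recall_default = matrix[0][0] / sum(row[0] for row in matrix)
print(matrix, precision_default, recall_default)
```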

> **Business Requirement 2**:
  - Default event should be predicted with high confidence, specifically, at least 85% precision.

- Display Confusion Matrix.

In [None]:
clf_performance(X_train=X_train, y_train=y_train,
                X_test=X_test, y_test=y_test,
                pipeline=pipeline_clf,
                label_map= ['Default','No Default']
                )

> Result:

* For the test set:

  - Default Precision: 0.95
  - Default Recall: 0.91
  - No Default Precision: 0.74
  - No Default Recall: 0.85

### **Assess feature importance**

- Extract the important features.

In [None]:
# create DataFrame to display feature importance
df_feature_importance = (pd.DataFrame(data={
    'Feature': X_train.columns[pipeline_clf['feat_selection'].get_support()],
    'Importance': pipeline_clf['model'].feature_importances_})
    .sort_values(by='Importance', ascending=False)
)

# re-assign best_features order
best_features = df_feature_importance['Feature'].to_list()

# Most important features statement and plot
print(f"* These are the {len(best_features)} most important features in descending order. "
      f"The model was trained on them: \n{df_feature_importance['Feature'].to_list()}")

df_feature_importance.plot(kind='bar', x='Feature', y='Importance')
plt.show()

> Result:
- best_features = `['previous_loan_defaults_on_file', 'loan_percent_income', 'loan_int_rate', 'person_income']`.

> Summary of **ML Pipeline with All Features**:

* **RandomForestClassifier** is selected as the estimator: it scores close to the top candidates and produces a smaller clf_pipeline pickle file than the other two candidates. The estimator is defined with the configuration: `{'model__max_depth': None,
 'model__max_leaf_nodes': None,
 'model__min_samples_leaf': 1,
 'model__min_samples_split': 2,
 'model__n_estimators': 50}`.
* The model performs well with the full-feature pipeline on both the train and test sets.
* The model satisfies business requirement 2, with a higher precision than the business stipulates.
* An assessment is conducted to identify the most important features.
* Four features are assessed to be important. These are: `['previous_loan_defaults_on_file', 'loan_percent_income', 'loan_int_rate', 'person_income']`.
* The next step is to evaluate the ML pipeline on the most important features only.

## **ML Pipeline with Important Features**

- Display Best Features.

In [None]:
best_features

> Result:
- best_features = `['previous_loan_defaults_on_file',
 'loan_percent_income',
 'loan_int_rate',
 'person_income']`.

### **Split Train and Test Set**

- Split the raw dataset into Train and Test sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(['loan_status'], axis=1),
    df['loan_status'],
    test_size=0.2,
    random_state=0,
)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

- Consider only the best features.

In [None]:
X_train = X_train.filter(best_features)
X_test = X_test.filter(best_features)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
X_train.head(3)

### **Data Cleaning and Feature Engineering Pipeline**

#### Data Cleaning

- Since the data is already clean, this step is skipped.

#### Feature Engineering Pipeline

- The feature engineering pipeline is taken from **Notebook 03** and adapted to use only the best features.
- Since the dataset no longer contains the to-be-dropped features ['person_age', 'cb_person_cred_hist_length'], SmartCorrelatedSelection is removed from the pipeline.

In [None]:
def PipelineFeatureEngineering():
    pipeline_base = Pipeline([
        ("OrdinalCategoricalEncoder", OrdinalEncoder(encoding_method='arbitrary',
                                                     variables=[
                                                        'previous_loan_defaults_on_file',
                                                        ])),
        ("YeoJohnsonTransformer", vt.YeoJohnsonTransformer(
            variables = ['loan_percent_income','person_income',]
            )),
        # SmartCorrelatedSelection is removed since we filter out all the unneeded features
        # by only selecting the important features.
    ])

    return pipeline_base

PipelineFeatureEngineering()

In [None]:
X_train.head(3)

- Apply Data Feature Engineering Pipeline.

In [None]:
pipeline_feat_eng = PipelineFeatureEngineering()
X_train = pipeline_feat_eng.fit_transform(X_train)
X_test =pipeline_feat_eng.transform(X_test)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

### **Target Balance Analysis**

- Display Train Set Target (loan_status) distribution.
- Evaluate if the two loan_status classes (Default = 0, No Default = 1) are balanced.

In [None]:
sns.set_style("whitegrid")
y_train.value_counts().plot(kind='bar', title='Train Set Target Distribution')
plt.show()

> Result:

- The Default class (0) has more occurrences, so the train set is imbalanced and ought to be balanced.
- To balance both classes in the Train Set, SMOTE (Synthetic Minority Oversampling TEchnique) is used.
- SMOTE balances the classes by oversampling the minority class (No Default = 1).

- Apply **SMOTE (Synthetic Minority Oversampling TEchnique)**.

In [None]:
from imblearn.over_sampling import SMOTE
oversample = SMOTE(sampling_strategy='minority', random_state=0)
X_train, y_train = oversample.fit_resample(X_train, y_train)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

- Evaluate the Train Set Target distribution after oversampling.

In [None]:
import matplotlib.pyplot as plt
y_train.value_counts().plot(kind='bar', title='Train Set Target Distribution')
plt.show()

### **Model Pipeline**

- Classification Pipeline.

In [None]:
def PipelineClf(model):
    pipeline_base = Pipeline([
        ("scaler", StandardScaler()),
        ("model", model),
    ])

    return pipeline_base

#### Grid Search CV

- The best model and its hyperparameter configuration, taken from the earlier task, are:
  - Estimator: **RandomForestClassifier**.
  - Hyperparameter Configuration: `{'model__max_depth': None,
 'model__max_leaf_nodes': None,
 'model__min_samples_leaf': 1,
 'model__min_samples_split': 2,
 'model__n_estimators': 50}`.

In [None]:
models_search # RandomForestClassifier

In [None]:
best_model = {'RandomForestClassifier': RandomForestClassifier(random_state=0)}

- Display the best hyperparameter configurations.

In [None]:
best_parameters

- Define new search parameters based on the best configuration identified in the earlier step.

In [None]:
params_search = {'RandomForestClassifier': {
    'model__max_depth': [None],
    'model__max_leaf_nodes': [None],
    'model__min_samples_leaf': [1],
    'model__min_samples_split': [2],
    'model__n_estimators': [50]
},
}
params_search

- Apply the search.

In [None]:
quick_search = HyperparameterOptimizationSearch(
    models=best_model, params=params_search)
quick_search.fit(X_train, y_train,
                 scoring=make_scorer(recall_score, pos_label=0),
                 n_jobs=-1, cv=5)

- Display the algorithm score result.

In [None]:
grid_search_summary, grid_search_pipelines = quick_search.score_summary(sort_by='mean_score')
grid_search_summary 

> Note:
- The estimator performs slightly better after refitting the model on the important features only.
- The mean score with all the features is 0.911822.
- The mean score with the important features is 0.915967.

- Display the best model.

In [None]:
best_model = grid_search_summary.iloc[0, 0]
best_model

- Display the best parameters.

In [None]:
# Read the best configuration from the fitted search instead of hard-coding it
best_parameters = grid_search_pipelines[best_model].best_params_
best_parameters

- Define and then display the best clf pipeline.

In [None]:
pipeline_clf = grid_search_pipelines[best_model].best_estimator_
pipeline_clf

#### Evaluate Pipeline on Train and Test Sets

In [None]:
clf_performance(X_train=X_train, y_train=y_train,
                X_test=X_test, y_test=y_test,
                pipeline=pipeline_clf,
                label_map= ['Default', 'No Default'] 
                )

> Result:

* For the test set:

  - Default Precision: 0.95
  - Default Recall: 0.92
  - No Default Precision: 0.75
  - No Default Recall: 0.83

> Summary of **ML Pipeline with Best Features**:

* **RandomForestClassifier** is the selected estimator, with the configuration: `{'model__max_depth': None,
 'model__max_leaf_nodes': None,
 'model__min_samples_leaf': 1,
 'model__min_samples_split': 2,
 'model__n_estimators': 50}`.
* The model sustains its performance on the important-features pipeline, with even a slight improvement.
* The model satisfies business requirement 2, with a higher precision than the business stipulates.
* Four features are assessed to be important. These are: best_features = `['previous_loan_defaults_on_file', 'loan_percent_income', 'loan_int_rate', 'person_income']`.

### **Assess feature importance**

- Extract the important features.

In [None]:
best_features = X_train.columns

# create DataFrame to display feature importance
df_feature_importance = (pd.DataFrame(data={
    'Feature': best_features,
    'Importance': pipeline_clf['model'].feature_importances_})
    .sort_values(by='Importance', ascending=False)
)


# Most important features statement and plot
print(f"* These are the {len(best_features)} most important features in descending order. "
      f"The model was trained on them: \n{df_feature_importance['Feature'].to_list()}")

df_feature_importance.plot(kind='bar', x='Feature', y='Importance')
plt.show()

> Result:
- The result confirms that no additional filtering has been applied to the input features of the pipeline.
- best_features = `['previous_loan_defaults_on_file', 'loan_percent_income', 'loan_int_rate', 'person_income']`.

## **Push files to Repo**

- The following list summarizes the outputs generated by the Notebook:
  - Train Set
  - Test Set
  - Modeling pipeline
  - label map
  - feature importance plot

### Output folder

In [None]:
version = 'v1'
file_path = f'outputs/ml_pipeline/predict_default/{version}'

# exist_ok=True avoids an error when the folder already exists
os.makedirs(name=file_path, exist_ok=True)

### Data sets

#### Train Set

- Display the train set (features).

In [None]:
X_train.head(3)

- Save the train set (features).

In [None]:
X_train.to_csv(f"{file_path}/X_train.csv", index=False)

- Display the train set (target='loan_status').

In [None]:
y_train.head(3)

- Save the train set (target='loan_status').

In [None]:
y_train.to_csv(f"{file_path}/y_train.csv", index=False)

#### Test Set

- Display the Test set (features).

In [None]:
X_test.head(3)

- Save the Test set (features).

In [None]:
X_test.to_csv(f"{file_path}/X_test.csv", index=False)

- Display the Test set (target='loan_status').

In [None]:
y_test.head(3)

- Save the Test set (target='loan_status').

In [None]:
y_test.to_csv(f"{file_path}/y_test.csv", index=False)

### Data Cleaning and Feature Engineering

> Note:

- Since the raw data is already clean, there is no data cleaning pipeline to be saved or incorporated into the feature engineering pipeline.

- Display the Feature Engineering Pipeline.

In [None]:
pipeline_feat_eng

- Save the Feature Engineering Pipeline.

In [None]:
joblib.dump(value=pipeline_feat_eng ,
            filename=f"{file_path}/clf_pipeline_feat_eng.pkl")

### Model Pipeline

- Display Feature Scaling and Model pipeline.

In [None]:
pipeline_clf

- Save the Model Pipeline as a pickle file.

In [None]:
joblib.dump(value=pipeline_clf, filename=f"{file_path}/clf_pipeline_model.pkl")

- Calculate the clf_pipeline_model.pkl file size before the git commit and push.
- If the file size is > 100 MB, repeat the search to find the most efficient estimator in terms of both score and model size.

In [None]:
# Build the path from file_path so it follows the version defined above
clf_pipeline_model_file_path = f"{file_path}/clf_pipeline_model.pkl"
file_size_bytes = os.path.getsize(clf_pipeline_model_file_path)
print("File size in bytes:", file_size_bytes)

file_size_mb = file_size_bytes/(1024 * 1024)
print("File size in MB:", round(file_size_mb, 2))

> Result:

File size in MB: 27.4 when using RandomForestClassifier.
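
The size check can also be wrapped in a small helper that returns a yes/no answer against the 100 MB limit mentioned above. This is a minimal sketch; the helper names and the temporary demo file are hypothetical and stand in for the real pickle file.

```python
import os
import tempfile

GITHUB_FILE_LIMIT_MB = 100  # limit referred to in this notebook


def file_size_mb(path):
    """File size in MB, using the same 1024 * 1024 divisor as the cell above."""
    return os.path.getsize(path) / (1024 * 1024)


def fits_github_limit(path, limit_mb=GITHUB_FILE_LIMIT_MB):
    """True when the file is small enough to be pushed."""
    return file_size_mb(path) < limit_mb


# Demo on a throwaway 2 MB file instead of the real model pickle.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_model.pkl")
    with open(demo_path, "wb") as handle:
        handle.write(b"\0" * (2 * 1024 * 1024))
    demo_size = file_size_mb(demo_path)
    demo_ok = fits_github_limit(demo_path)

print(demo_size, demo_ok)
```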

### Feature importance plot

- Display feature importance plot.

In [None]:
df_feature_importance.plot(kind='bar', x='Feature', y='Importance')
plt.show()

- Save feature importance plot.

In [None]:
df_feature_importance.plot(kind='bar',x='Feature',y='Importance')
plt.savefig(f'{file_path}/features_importance.png', bbox_inches='tight')

## **Conclusion**

> **ML Pipeline with Best Features**:

* The ML pipeline with the best features performs as well as the pipeline with all features.
* **RandomForestClassifier** is the selected estimator, with the configuration: `{'model__max_depth': None,
 'model__max_leaf_nodes': None,
 'model__min_samples_leaf': 1,
 'model__min_samples_split': 2,
 'model__n_estimators': 50}`.
* The model sustains its performance on the important-features pipeline, with even a slight improvement.
* The model satisfies business requirement 2, with a higher precision than the business stipulates.
* Four features are assessed to be important. These are: best_features = `['previous_loan_defaults_on_file', 'loan_percent_income', 'loan_int_rate', 'person_income']`.