In [1]:
import pandas as pd
import numpy as np

### Methodology

For this dataset, a **classification** approach was identified as the most suitable. Before models are tested and evaluated, the data must first be cleaned and pre-processed.

#### Preprocessing

So that the data can be fitted by the ML models without distorting the final predictions, the following pre-processing steps are applied:
- Null Value and Duplicates Handling
    - As stated in the Kaggle notebook, the dataset contains no null values and no duplicated rows.
- Outlier Handling
    - Number of entries BEFORE outlier handling: 918
    - Number of entries AFTER outlier handling: 702

- String Indexer for Categorical Values

- Vector Assembly

- Normalization
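The notebook does not say which outlier rule was used, so the sketch below is only an illustration. It assumes a 1.5 × IQR rule applied per numeric column, which is one common choice and not necessarily the rule behind the 918 to 702 reduction. The helper `drop_iqr_outliers`, the column name `value`, and the toy data are all hypothetical.

```python
import pandas as pd


def drop_iqr_outliers(df, columns, k=1.5):
    """Keep rows whose values lie within [Q1 - k*IQR, Q3 + k*IQR] for every listed column."""
    keep = pd.Series(True, index=df.index)
    for name in columns:
        q1, q3 = df[name].quantile([0.25, 0.75])
        iqr = q3 - q1
        keep &= df[name].between(q1 - k * iqr, q3 + k * iqr)
    return df[keep]


# Hypothetical toy data with one obvious outlier (500)
toy = pd.DataFrame({"value": [10, 12, 11, 13, 12, 500]})
cleaned = drop_iqr_outliers(toy, ["value"])

# Share of rows removed according to the counts reported above (918 before, 702 after)
removed_share = (918 - 702) / 918

print(len(toy), "->", len(cleaned))
print(f"{removed_share:.1%} of rows removed")
```

The reported counts mean that 216 rows, a little under a quarter of the dataset, were dropped, which is worth keeping in mind when reading the scores further down.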

#### ML Model Selection

With various classification models available in PySpark's Machine Learning Library (MLlib), the following were tested as candidate predictors:

- Logistic Regression
- Random Forest Classifier
- Gradient-boosted Trees Classifier (GBT)
- Linear Support Vector Machine (SVM)
- Decision Tree Classifier

After a 7:3 train-test split of the data, model evaluation produced the values reported below.
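A 7:3 split can be sketched as follows. The example uses pandas so that it is self-contained; in PySpark the analogous call is `DataFrame.randomSplit([0.7, 0.3], seed=...)`, whose resulting sizes are only approximately 70% and 30%. The seed value and the placeholder frame are assumptions, as is the choice to split the 702 rows that remain after outlier handling.

```python
import pandas as pd

# Placeholder frame with as many rows as remain after outlier handling (702)
df = pd.DataFrame({"row_id": range(702)})

# Sample 70% of the rows for training; the seed (42) is an arbitrary assumption
train = df.sample(frac=0.7, random_state=42)

# Every row that was not sampled goes to the test set
test = df.drop(train.index)

print(len(train), len(test))
```

Fixing the seed makes the split reproducible, which matters when comparing models on the same held-out rows.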

In [2]:
# Our custom styling method - this is fired once for each column (Series) of the DataFrame
def highlight_cell(col, col_label, row_label):
    # check if col is a column we want to highlight
    if col.name == col_label:
        # a boolean mask where True represents a row we want to highlight
        mask = (col.index == row_label)
        # return an array of string styles (e.g. ["", "background-color: yellow"])
        return ["background-color: yellow" if val_bool else "" for val_bool in mask]
    else:
        # return an array of empty strings that has the same size as col (e.g. ["",""])
        return np.full_like(col, "", dtype="str")
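The helper is meant to be passed to pandas' `Styler.apply`, which forwards extra keyword arguments to the styling function. A minimal usage sketch follows. It restates the helper compactly so that it runs on its own, and it assumes the frame is indexed by model name: with an all-blank index, rows cannot be told apart by `row_label`. The two rows of scores are taken from the untuned results table.

```python
import pandas as pd


def highlight_cell(col, col_label, row_label):
    """Compact restatement of the notebook's helper, repeated so this example runs alone."""
    if col.name == col_label:
        return ["background-color: yellow" if idx == row_label else "" for idx in col.index]
    return [""] * len(col)


scores = pd.DataFrame(
    {"Accuracy": [84.14, 85.03], "AUC-ROC": [91.91, 91.79]},
    index=["LogisticRegression", "RandomForest"],
)

# Direct call on a single column, to show what the Styler receives back
styles = highlight_cell(scores["AUC-ROC"], "AUC-ROC", "LogisticRegression")
print(styles)

# In a notebook, the same helper would be applied column by column with:
# scores.style.apply(highlight_cell, col_label="AUC-ROC", row_label="LogisticRegression")
```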

##### Models (W/O hyperparameter tuning)

In [3]:
eval_cols = ["Model", "Accuracy", "Precision", "Recall", "F1", "AUC-ROC"]

eval_base_eval = [["LogisticRegression", 84.14, 82.26, 89.67, 85.81, 91.91],
                  ["RandomForest", 85.03, 85.29, 87.0, 86.14, 91.79],
                  ["GradientBoost", 78.79, 77.18, 85.67, 81.2, 86.59],
                  ["SVC", 85.38, 85.39, 87.67, 86.51, 91.69],
                  ["DecisionTree", 81.46, 82.03, 83.67, 82.84, 84.09]]

base_pd = pd.DataFrame(eval_base_eval, columns=eval_cols)

# blank index labels hide the default 0..n-1 index in the rendered table
blankIndex = [''] * len(base_pd)
base_pd.index = blankIndex
base_pd

| Model | Accuracy | Precision | Recall | F1 | AUC-ROC |
| --- | --- | --- | --- | --- | --- |
| LogisticRegression | 84.14 | 82.26 | 89.67 | 85.81 | 91.91 |
| RandomForest | 85.03 | 85.29 | 87.0 | 86.14 | 91.79 |
| GradientBoost | 78.79 | 77.18 | 85.67 | 81.2 | 86.59 |
| SVC | 85.38 | 85.39 | 87.67 | 86.51 | 91.69 |
| DecisionTree | 81.46 | 82.03 | 83.67 | 82.84 | 84.09 |


##### Models (W/ hyperparameter tuning)

In [4]:
eval_tuned_eval = [["LogisticRegression", 83.42, 81.08, 90.0, 85.31, 91.87],
                   ["RandomForest", 84.67, 84.74, 87.0, 85.86, 92.14],
                   ["GradientBoost", 85.74, 85.26, 88.67, 86.93, 91.65],
                   ["SVC", 84.49, 83.18, 89.0, 85.99, 91.77],
                   ["DecisionTree", 83.07, 81.35, 88.67, 84.85, 79.27]]

tune_pd = pd.DataFrame(eval_tuned_eval, columns=eval_cols)
tune_pd.index = blankIndex
tune_pd

| Model | Accuracy | Precision | Recall | F1 | AUC-ROC |
| --- | --- | --- | --- | --- | --- |
| LogisticRegression | 83.42 | 81.08 | 90.0 | 85.31 | 91.87 |
| RandomForest | 84.67 | 84.74 | 87.0 | 85.86 | 92.14 |
| GradientBoost | 85.74 | 85.26 | 88.67 | 86.93 | 91.65 |
| SVC | 84.49 | 83.18 | 89.0 | 85.99 | 91.77 |
| DecisionTree | 83.07 | 81.35 | 88.67 | 84.85 | 79.27 |


Since Logistic Regression and Random Forest have the two highest AUC-ROC scores in both rounds of testing (with and without hyperparameter tuning), these two models became the focus of further testing.
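The ranking behind this choice can be checked directly from the AUC-ROC columns of the two tables above. The sketch below copies those values into one frame, picks the top two models in each round, and also reports how much tuning changed each score. The frame layout and the column names `base`, `tuned`, and `change` are illustrative choices.

```python
import pandas as pd

models = ["LogisticRegression", "RandomForest", "GradientBoost", "SVC", "DecisionTree"]

# AUC-ROC values copied from the two results tables above
auc = pd.DataFrame(
    {
        "base": [91.91, 91.79, 86.59, 91.69, 84.09],
        "tuned": [91.87, 92.14, 91.65, 91.77, 79.27],
    },
    index=models,
)

# Two best models in each round of testing
top2_base = set(auc["base"].nlargest(2).index)
top2_tuned = set(auc["tuned"].nlargest(2).index)

# Effect of hyperparameter tuning on each model's AUC-ROC
auc["change"] = (auc["tuned"] - auc["base"]).round(2)

print(top2_base, top2_tuned)
print(auc)
```

The margins at the top are small (SVC trails by roughly 0.1 to 0.2 points in both rounds), so the selection rests on a consistent ranking and not on a large gap.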

##### Stacked Ensemble Model

To improve predictive performance further, a stacked ensemble was used. After further testing, the most suitable combination of the two models was:
- Random Forest Classification (Base Model w/o hyperparameter tuning)
- Logistic Regression Model (Meta-model w/ hyperparameter tuning)

In this setup, the outputs of the random forest classifier are used as the feature inputs of the logistic regression model. **These are the initial prediction (pred_rf) and the class probability (prob_rf): a positive-class probability above 0.5 is labeled class 1, and one below 0.5 is labeled class 0.**
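How the meta-model's inputs are formed can be sketched with NumPy. The probabilities below are hypothetical, and the three-column layout is an assumption based on MLlib's probability column being a vector with one entry per class; the notebook itself does not show the assembled vector.

```python
import numpy as np

# Hypothetical positive-class probabilities from the base random forest
prob_rf = np.array([0.91, 0.12, 0.67, 0.35])

# Class 1 when the probability exceeds 0.5, class 0 otherwise
pred_rf = (prob_rf > 0.5).astype(float)

# One row per sample: [pred_rf, P(class 0), P(class 1)]
meta_features = np.column_stack([pred_rf, 1.0 - prob_rf, prob_rf])

print(meta_features)
```

Because `pred_rf` is a thresholded copy of the probability, the logistic regression mostly learns how to recalibrate the random forest's confidence.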

In [5]:
metrics_cols = ['Metric', 'Score']

metrics_vals = [['Accuracy','89.14%'],['Precision','86.61%'],['Recall','94.01%'],['F-1','90.16%'],['AUC-ROC','94.74%']]

metrics_pd = pd.DataFrame(metrics_vals, columns=metrics_cols)
metrics_pd.index = blankIndex
metrics_pd

| Metric | Score |
| --- | --- |
| Accuracy | 89.14% |
| Precision | 86.61% |
| Recall | 94.01% |
| F-1 | 90.16% |
| AUC-ROC | 94.74% |


*Note that the metrics above were obtained with a 70:30 train-test split (the RandomForest base model was fitted on the train set and then used to transform it). Further tests with different split ratios and different random splits are highly recommended when replicating these results.*
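As a quick consistency check on the stacked-model scores, F1 can be recomputed as the harmonic mean of the reported precision and recall. The helper name `f1_from` is illustrative. Since the inputs are already rounded to two decimals, a small discrepancy from an F1 computed on raw counts would not be surprising.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall of the stacked ensemble, in percent, as reported above
stacked_f1 = f1_from(86.61, 94.01)

print(round(stacked_f1, 2))
```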

##### Stages of the Final Predictor

Because the two models take different feature inputs, pre-processing must be done separately for each so that only suitable inputs are passed.

1. Pre-processing
    - StringIndexer for categorical columns (appending _index to the original column names).
    - VectorAssembler for assembling a vector for all features.
    - StandardScaler for normalizing the vector.
2. Random Forest Classifier
    - Transform the dataset.
    - Extract the prediction and probability columns from the results.
3. Meta pre-processing
    - VectorAssembler for the prediction and probability columns from the previous step.
4. Logistic Regression
    - Transform the dataset.
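The four stages above could be wired together as a single PySpark ML `Pipeline`, roughly as sketched below. This is not the notebook's actual code. The function names, the intermediate column names (`raw_features`, `features`, `raw_rf`, `meta_features`), and the use of default hyperparameters are assumptions, since the tuned values are not given. Only `pred_rf`, `prob_rf`, and the `_index` suffix come from the text.

```python
def indexed(cols):
    """Names of the StringIndexer outputs: the original name plus `_index`."""
    return [f"{c}_index" for c in cols]


def build_stacked_pipeline(categorical_cols, numeric_cols, label_col):
    """Assemble the four stages as one Pipeline (needs pyspark and an active SparkSession)."""
    # Imported here so that `indexed` stays usable where Spark is not installed
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
    from pyspark.ml.feature import StandardScaler, StringIndexer, VectorAssembler

    stages = [
        # 1. Pre-processing
        StringIndexer(inputCols=list(categorical_cols), outputCols=indexed(categorical_cols)),
        VectorAssembler(
            inputCols=list(numeric_cols) + indexed(categorical_cols),
            outputCol="raw_features",
        ),
        StandardScaler(inputCol="raw_features", outputCol="features"),
        # 2. Base model; its outputs are renamed so they do not clash with the meta-model's
        RandomForestClassifier(
            featuresCol="features",
            labelCol=label_col,
            predictionCol="pred_rf",
            probabilityCol="prob_rf",
            rawPredictionCol="raw_rf",
        ),
        # 3. Meta pre-processing: prediction and probability become the new feature vector
        VectorAssembler(inputCols=["pred_rf", "prob_rf"], outputCol="meta_features"),
        # 4. Meta-model
        LogisticRegression(featuresCol="meta_features", labelCol=label_col),
    ]
    return Pipeline(stages=stages)
```

With a training DataFrame `train_df` and placeholder column lists, usage would look like `model = build_stacked_pipeline(cat_cols, num_cols, "label").fit(train_df)` followed by `model.transform(test_df)`. Fitting the pipeline trains the random forest first and then fits the logistic regression on the forest's outputs for the same training rows, which matches the note above about the train set.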

All that's left is to integrate this predictor into a Streamlit interface.