## Classification - Part 3.1: Imbalanced Data (Logistic Regression)
- Objective:
    1. Classify the "yes" class and report its predicted probability.

## 1.0 Constraints

### 1.1 Modeling Constraints : 
- Modeling Type : Classification    
- Model Explainability:
    - 1. One Logistic Regression model: explainable model
    - 2. Another black-box model
        
    
### 1.2 Data Constraint :
- Primary Data : 
    - Dependent / Outcome Variable :  y (0/1)
    - Independent / Predictor Variables :  100 Feature Variables (Categorical, numerical)
- Imbalanced Data :  


### 1.3 Evaluation Metrics Constraint:
- Imbalanced Data specific Evaluation Metrics
- General-purpose metrics such as accuracy lead to improper evaluation, modeling, and final model choice.

### 1.4 Framework Constraint : 
- This constraint is self-selected, based on ease of use and data size.
- Scikit-learn:
    - Because the data size is comparatively small, scikit-learn is selected as the framework, given its comparative ease of use and faster iteration.
- TensorFlow / Keras: provides complex deep neural network modeling and GPU support, and hence is used for more complex NN modeling.
- statsmodels:
    - Scikit-learn provides more varied models (forests, boosting, bagging, SVMs, NNs) and powerful parameter customisation and control. However, its LR model exposes fewer model and parameter statistics than statsmodels (e.g. p-values, 95% CI, marginal effects). Hence we will use the statsmodels logistic regression (`Logit`) in the second part.



## 1. Data Load: Modeling Ready Data
- The data was preprocessed in the earlier notebook.
- In this step we load the preprocessed data.

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

test_df = pd.read_csv("Data/ModelReadyData/test_df.csv")
train_df = pd.read_csv("Data/ModelReadyData/train_df.csv")

X_train = pd.read_csv("Data/ModelReadyData/X_train.csv")
y_train = pd.read_csv("Data/ModelReadyData/y_train.csv", index_col=False)

X_valid = pd.read_csv("Data/ModelReadyData/X_valid.csv")
y_valid = pd.read_csv("Data/ModelReadyData/y_valid.csv")

X_train_smote = pd.read_csv("Data/ModelReadyData/X_train_smote.csv")
y_train_smote = pd.read_csv("Data/ModelReadyData/y_train_smote.csv")

# ckpt_train_df_pre_scaling = pd.read_csv("Data/ModelReadyData/ckpt_train_df_pre_scaling.csv")

In [36]:
def write_yes_probability_to_a_file(predict_probability, file_name):
    # Column 1 of a predict_proba-style array holds the probability of class 1 ("yes")
    one_proba = np.asarray(predict_probability)[:, 1]
    class_1_probability_df = pd.DataFrame({"class_1_probability": np.round(one_proba, 4)})

    # Write one probability per line, without header or index
    class_1_probability_df.to_csv(file_name, header=False, index=False)
    return class_1_probability_df

### 4.2 Evaluation Metric Selection:

- Accuracy is a poor evaluation metric given the highly imbalanced nature of the data.
- We will use AUC, recall, and F1-score, and additionally the confusion matrix.
- Recall: the ability of the model to find all positive samples.
- Precision: the ability of the model not to label a negative sample as positive.
- Balanced accuracy: the average of recall across all classes.
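
To make the point about accuracy concrete, here is a minimal sketch on made-up labels. The 90/10 class split and the always-"no" classifier are illustrative assumptions, not the notebook's data.

```python
# Toy labels: 90 negatives (0) and 10 positives (1). Illustrative only.
y_true = [0] * 90 + [1] * 10
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 100

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)
tn = sum(t == 0 and p == 0 for t, p in pairs)
fp = sum(t == 0 and p == 1 for t, p in pairs)
fn = sum(t == 1 and p == 0 for t, p in pairs)

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)                      # recall of the positive class
specificity = tn / (tn + fp)                 # recall of the negative class
precision = tp / (tp + fp) if (tp + fp) else 0.0
balanced_accuracy = (recall + specificity) / 2

print(f"accuracy={accuracy:.2f} recall={recall:.2f} "
      f"precision={precision:.2f} balanced_accuracy={balanced_accuracy:.2f}")
```

Accuracy looks high even though no positive sample is found, while recall and balanced accuracy expose the failure.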


In [2]:
from typing import List

from matplotlib import pyplot
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import cross_validate, train_test_split

# Metrics reported by cross_validate for every model
scoring = ["roc_auc", "recall", "precision", "accuracy", "balanced_accuracy", "f1"]


# Defining the confusion matrix function
def plot_confusion_matrix(cm, class_labels, title="Confusion matrix", cmap=plt.cm.Blues):
    import itertools

    plt.figure(figsize=(3, 3))
    plt.imshow(cm, interpolation="nearest", cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(class_labels))
    plt.xticks(tick_marks, class_labels, rotation=45)
    plt.yticks(tick_marks, class_labels)

    thresh = cm.max() / 2.0
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j], horizontalalignment="center", color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
    plt.show()


def area_under_roc(y, pred):
    from sklearn import metrics

    # fpr,tpr, thresholds = metrics.roc_curve(y, pred, pos_label=2 )
    auc = metrics.roc_auc_score(y, pred)
    # print('fpr,tpr, AUC (higher is better)', fpr, tpr, auc)
    print("AUC (higher is better)", auc)
    return auc


def eval_classification(
    y_test,
    ypred_test,
    class_labels=["Retained", "Churn - Lost"],
    title="Confusion Matrix",
    metrics=["confusion_matrix"],
):
    returned_metrics = []

    if "auc" in metrics:
        auc = area_under_roc(y_test, ypred_test)
        returned_metrics.append(auc)
    if "confusion_matrix" in metrics:
        print("Classification report is")
        print(classification_report(y_test, ypred_test))
        conf_matrix = confusion_matrix(y_test, ypred_test)
        plot_confusion_matrix(conf_matrix, class_labels=class_labels, title=title)
        returned_metrics.append(conf_matrix)

    return returned_metrics[0] if len(returned_metrics) == 1 else returned_metrics


def plot_roc(y_test: List[int], y_test_proba: List[float]):
    auc = roc_auc_score(y_test, y_test_proba)
    print("Logistic: ROC AUC=%.3f" % (auc))

    # # keep probabilities for the positive outcome only
    # probability_positive_class = y_test_proba[:, 1]
    probability_positive_class = y_test_proba

    fpr, tpr, _ = roc_curve(y_test, probability_positive_class)

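    # A non-empty label is needed so that pyplot.legend() has an entry to show
    pyplot.plot(fpr, tpr, marker=".", label=f"Logistic (AUC={auc:.3f})")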
    pyplot.title("ROC Curve")
    pyplot.xlabel("False Positive Rate (False Alarm Rate)")
    pyplot.ylabel("True Positive Rate ( Sensitivity, Hit Rate)")
    pyplot.legend()
    pyplot.show()


def plot_roc_for_binary_prediction_label(y_test, y_test_prediction):
    from plot_metric.functions import BinaryClassification

    bc = BinaryClassification(y_test, y_test_prediction, labels=["Class 1", "Class 2"])
    # Figures
    plt.figure(figsize=(5, 5))
    bc.plot_roc_curve()
    plt.show()


def create_new_score_tracker_df(scoring):
    # One column for the model name plus one column per scoring metric
    return pd.DataFrame(columns=["model_name", *scoring])


def add_cv_score_to_df(df, model_descriptor_name, scoring, cur_clf_cv_result):
    # Mean of each cross-validated test metric, rounded to 2 decimals
    cur_scores = {"model_name": model_descriptor_name}
    for cur_scoring in scoring:
        cur_scores[cur_scoring] = round(cur_clf_cv_result["test_" + cur_scoring].mean(), 2)

    # DataFrame.append was removed in pandas 2.0; pd.concat is the supported replacement
    df = pd.concat([df, pd.DataFrame([cur_scores])], ignore_index=True)
    df = df.drop_duplicates(keep="last")
    return df


def get_cv_scores(clf, model_descriptor_name, X, y, scoring, score_tracker_df=None, cv_fold=5):
    # Create Score tracker df if first run
    if score_tracker_df is None:
        print("Created New Score Tracker")
        score_tracker_df = create_new_score_tracker_df(scoring)

    cur_clf_cv_result = cross_validate(clf, X, y, cv=cv_fold, scoring=scoring)
    score_tracker_df = add_cv_score_to_df(score_tracker_df, model_descriptor_name, scoring, cur_clf_cv_result)

    return score_tracker_df


def get_benchmark_cv_scores(clf, x_columns, df, model_name, score_tracker_df=None):
    if x_columns is None:
        x_columns = list(df.columns)
        x_columns.remove("y")
    print(f" Columns : {len(x_columns)}")
    cur_X_train, cur_y_train, _, _ = get_stratified_data(df[x_columns], df["y"], test_size=0.01, seed=4)
    score_tracker_df = get_cv_scores(
        clf=clf,
        X=cur_X_train,
        y=cur_y_train[0],
        cv_fold=5,
        scoring=scoring,
        model_descriptor_name=model_name,
        score_tracker_df=score_tracker_df,
    )
    return score_tracker_df


def get_clasifier_evaluation(clf, X_train=None, y_train=None, X_valid=None, y_valid=None):
    clf.fit(X_train, y_train)
    y_pred_valid = clf.predict(X_valid)
    # Note: the AUC here is computed from hard 0/1 predictions, not predicted probabilities
    eval_classification(y_valid, y_pred_valid, class_labels=["no", "yes"], metrics=["auc", "confusion_matrix"])

## DataSet Creation
def get_stratified_data(X_df, y_series, test_size=0.2, verbose=False, seed=4):
    from sklearn.model_selection import StratifiedShuffleSplit

    feature_cols = list(X_df.columns)
    # target_col = y_df.columns

    X = np.array(X_df)
    y = np.array(y_series)

    # Two stratified splits are generated; the loop below keeps only the last one
    sss = StratifiedShuffleSplit(n_splits=2, test_size=test_size, random_state=seed)

    for train_index, test_index in sss.split(X, y):
        if verbose:
            print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

    X_train = pd.DataFrame(X_train, columns=feature_cols)
    X_test = pd.DataFrame(X_test, columns=feature_cols)

    y_train = pd.DataFrame(y_train)
    y_test = pd.DataFrame(y_test)

    if verbose:
        print(f"\n ...No. of Training Data : {len(X_train)}")
        print(f" ...No. of Test Data     : {len(X_test)}")
        print(f" ...No. of Features : {len(X_train.columns)}")
        print(f" ...Train Bincount : {np.bincount(y_train[0])}")
        print(f" ...Test Bincount  : {np.bincount(y_test[0])}")

    return X_train, y_train, X_test, y_test

def simple_tree(
    X_trainR,
    y_trainR,
    X_testR,
    y_testR,
    max_depth=None,
    class_weight=None,
    class_labels=["no", "yes"],  # confusion_matrix orders classes as 0 ("no"), then 1 ("yes")
    metrics=["confusion_matrix"],
):
    from sklearn import tree

    clf = tree.DecisionTreeClassifier(max_depth=max_depth, class_weight=class_weight)
    clf.fit(X_trainR, y_trainR)
    ypred_testR = clf.predict(X_testR)

    eval_classification(y_testR, ypred_testR, class_labels=class_labels, metrics=metrics)
    return clf


##  6. Modeling with statsmodels: Explainable Logistic Regression Models
- Logistic regression from sklearn is good for prediction, but it is bare-bones: it provides far less statistical information than statsmodels. For example, it does not provide p-values to indicate coefficient significance, nor pseudo-R2 values.
- Hence, for the explainable model, we will build a logistic regression model with the statsmodels library and use its richer statistical output for further explanation.

Ref: https://www.statsmodels.org/stable/generated/statsmodels.discrete.discrete_model.Logit.fit.html




###  6.1 Further Pre-Processing: Multicollinearity (df-1)
- Transform each categorical variable into n-1 dummy variables, because the nth dummy variable can be perfectly explained by the other n-1.
- We will remove the nth variable and, during interpretation, use it as the reference category for the explanations.
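
The n-1 encoding can be sketched with pandas as follows. The column `day` and its levels are hypothetical, and `drop_first=True` is used only for illustration; the notebook itself removes hand-picked reference columns instead.

```python
import pandas as pd

# Hypothetical categorical column; names and values are illustrative only.
toy_df = pd.DataFrame({"day": ["Mon", "Tue", "Wed", "Mon"]})

# Full encoding: one dummy per level (n columns)
full = pd.get_dummies(toy_df, columns=["day"], dtype=int)
# Reduced encoding: the first level ("Mon") is dropped and becomes the reference
reduced = pd.get_dummies(toy_df, columns=["day"], drop_first=True, dtype=int)

# Every row of the full encoding sums to 1, so the dummies are perfectly
# collinear with the model intercept.
print(full.sum(axis=1).tolist())
print(list(full.columns))
print(list(reduced.columns))
```

With the reduced encoding, each remaining coefficient is read relative to the dropped reference level.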


In [3]:
import statsmodels.api as sm
from scipy import stats
from statsmodels.formula.api import logit
import cvxopt


def normalise_column_names(df):
    # Replace ".", "-" and " " with "_" so every column name is a valid formula term.
    # rename() without inplace returns a new frame, which avoids SettingWithCopyWarning.
    def clean(col):
        return col if col == "y" else col.replace(".", "_").replace("-", "_").replace(" ", "_")

    return df.rename(columns=clean)


def get_df_n_formulae(df, columns, class_label="y"):
    # Keep the requested columns and build a "y ~ a + b + ..." formula from them
    df = normalise_column_names(df[columns])
    normalised_column_names = sorted(df.columns)

    if class_label in normalised_column_names:
        normalised_column_names.remove(class_label)

    return df, class_label + " ~ " + " + ".join(normalised_column_names)

# reference_columns_to_remove = ['education_illiterate', 'job_unemployed','month_dec', 'marital_single', 'day_of_week_fri', 'default_miss' ]
reference_columns_to_remove = [
        "x77_toyota",
        "x33_California",
        "x3_Mon",
        "x60_August",
        "x65_allstate"
]
base_explainable_cols = list(train_df.columns)
explainable_cols = list(set(base_explainable_cols) - set(reference_columns_to_remove))

explainable_data_df, explainable_formulae = get_df_n_formulae(train_df, explainable_cols, class_label="y")


logit_model_explainable_variables = logit(explainable_formulae, explainable_data_df).fit_regularized(
    maxiter=100, method="l1", trim_mode="size", size_trim_tol="auto", auto_trim_tol="auto"
)


lr = logit_model_explainable_variables
# marginal_effect = logit_model_explainable_variables.get_margeff()
# marginal_effect.summary()
logit_model_explainable_variables.summary2()





Iteration limit reached    (Exit mode 9)
            Current function value: 0.35216009824265926
            Iterations: 100
            Function evaluations: 104
            Gradient evaluations: 100




| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Model: | Logit | Pseudo R-squared: | 0.15 |
| Dependent Variable: | y | AIC: | 28506.8079 |
| Date: | 2023-03-11 22:43 | BIC: | 29942.4459 |
| No. Observations: | 40000 | Log-Likelihood: | -14086.0 |
| Df Model: | 166 | LL-Null: | -16563.0 |
| Df Residuals: | 39833 | LLR p-value: | 0.0 |
| Converged: | 0.0000 | Scale: | 1.0 |
| No. Iterations: | 100.0000 | | |

| | Coef. | Std.Err. | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -1.9579 | 0.1381 | -14.1794 | 0.0000 | -2.2285 | -1.6872 |
| x1 | 0.0297 | 0.0223 | 1.3348 | 0.1819 | -0.0139 | 0.0733 |
| x10 | -0.0054 | 0.0194 | -0.2811 | 0.7787 | -0.0434 | 0.0325 |
| x100 | 0.0384 | 0.0511 | 0.7520 | 0.4520 | -0.0617 | 0.1386 |
| x11 | 0.1700 | 0.0403 | 4.2143 | 0.0000 | 0.0909 | 0.2490 |
| x12 | -0.0257 | 0.0765 | -0.3360 | 0.7368 | -0.1757 | 0.1243 |
| x13 | 0.0591 | 0.0371 | 1.5919 | 0.1114 | -0.0137 | 0.1318 |
| x14 | 0.0432 | 0.0240 | 1.8024 | 0.0715 | -0.0038 | 0.0902 |
| x15 | -0.0345 | 0.0155 | -2.2212 | 0.0263 | -0.0649 | -0.0041 |
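
As a sanity check on how to read the summary, the sketch below recomputes the pseudo R-squared from the two log-likelihoods and converts one coefficient into an odds ratio. All numbers are copied from the output above; choosing `x11` is an arbitrary illustration, and the interval is only as reliable as the L1-regularized fit, which did not converge.

```python
import math

# Values copied from the summary output above
log_likelihood = -14086.0
log_likelihood_null = -16563.0

# McFadden's pseudo R-squared: 1 - LL_model / LL_null
pseudo_r2 = 1 - log_likelihood / log_likelihood_null
print(f"pseudo R-squared = {pseudo_r2:.2f}")

# Coefficient and 95% confidence bounds for x11, copied from the table
coef, ci_low, ci_high = 0.1700, 0.0909, 0.2490

# exp(coef) is the multiplicative change in the odds of y=1 for a
# one-unit increase in x11 (in the units fed to the model)
odds_ratio = math.exp(coef)
odds_ratio_ci = (math.exp(ci_low), math.exp(ci_high))
print(f"odds ratio = {odds_ratio:.3f}, "
      f"95% CI = ({odds_ratio_ci[0]:.3f}, {odds_ratio_ci[1]:.3f})")
```

An odds ratio above 1 with an interval that excludes 1 matches the small p-value reported for `x11`.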


### 6.2 LR Model Confusion Matrix: Different Thresholds

In [4]:
best_lr = logit_model_explainable_variables
print("Confusion Matrix across different threshold")
# pred_table(t) labels a sample as positive when its predicted probability exceeds t
for threshold in (0.5, 0.4, 0.3, 0.2, 0.1):
    print(f"\nThreshold={threshold}   \n{best_lr.pred_table(threshold)}")



Confusion Matrix across different threshold

Threshold=0.5   
[[33769.   428.]
 [ 5178.   625.]]

Threshold=0.4   
[[33113.  1084.]
 [ 4607.  1196.]]

Threshold=0.3   
[[31439.  2758.]
 [ 3722.  2081.]]

Threshold=0.2   
[[27561.  6636.]
 [ 2463.  3340.]]

Threshold=0.1   
[[18287. 15910.]
 [  986.  4817.]]
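
The trade-off in these tables is easier to read as recall and precision per threshold. The sketch below recomputes both from the matrices printed above, relying on the statsmodels convention that `pred_table[i, j]` counts samples observed as class `i` and predicted as class `j`. The helper name `recall_precision` is introduced here for illustration.

```python
# Confusion matrices copied from the output above.
# Rows are the observed class (0, 1); columns are the predicted class (0, 1).
pred_tables = {
    0.5: [[33769, 428], [5178, 625]],
    0.4: [[33113, 1084], [4607, 1196]],
    0.3: [[31439, 2758], [3722, 2081]],
    0.2: [[27561, 6636], [2463, 3340]],
    0.1: [[18287, 15910], [986, 4817]],
}


def recall_precision(table):
    """Return (recall, precision) of the positive class for a 2x2 table."""
    (tn, fp), (fn, tp) = table
    return tp / (tp + fn), tp / (tp + fp)


results = {t: recall_precision(table) for t, table in pred_tables.items()}
for threshold, (recall, precision) in results.items():
    print(f"threshold={threshold:.1f}  recall={recall:.3f}  precision={precision:.3f}")
```

Lowering the threshold raises recall for the "yes" class at the cost of precision, so the operating threshold should follow from the relative cost of missed positives and false alarms.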


### 6.3 Future To Dos: Logistic Regression
- Variable selection: drop variables whose coefficients have large p-values
- Add interaction terms
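
A rough sketch of both to-dos, using only the p-values visible in the coefficient excerpt earlier. The 0.05 cut-off and the pairing of `x11` with `x15` are assumptions made for illustration; in the notebook the full set of p-values would presumably come from the fitted result's `pvalues` attribute.

```python
# p-values copied from the coefficient excerpt shown earlier (partial list)
p_values = {
    "x1": 0.1819, "x10": 0.7787, "x100": 0.4520, "x11": 0.0000,
    "x12": 0.7368, "x13": 0.1114, "x14": 0.0715, "x15": 0.0263,
}

ALPHA = 0.05  # assumed significance cut-off

# Keep only the variables whose coefficients are significant at ALPHA
kept = sorted(name for name, p in p_values.items() if p < ALPHA)
formula = "y ~ " + " + ".join(kept)

# In the formula syntax, "a:b" adds the interaction between a and b
formula_with_interaction = formula + " + " + ":".join(kept[:2])

print(formula)
print(formula_with_interaction)
```

The reduced formula can then be passed to `logit(...)` in the same way as `explainable_formulae`.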