## Advanced ML Kaggle Using sklearn
Sicheng Zhou, Flora Chen, University of San Francisco

In [1]:
import numpy as np
import pandas as pd

In [2]:
import imblearn
from imblearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline

from sklearn.model_selection import train_test_split

In [3]:
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, ExtraTreesClassifier

In [4]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import make_scorer

In [5]:
from sklearn.metrics import balanced_accuracy_score 
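# plot_confusion_matrix and plot_roc_curve were removed in scikit-learn 1.2;
# ConfusionMatrixDisplay and RocCurveDisplay are their replacements
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import RocCurveDisplay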
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

## Research Question

Could we build a model to predict the probability that a credit card customer is going to churn?

## Dataset Description

Here is the [dataset](https://www.kaggle.com/sakshigoyal7/credit-card-customers). It contains information on 10,127 customers, including age, salary, and other attributes. Of these, 1,627 customers have churned and the other 8,500 have not, so the dataset is highly imbalanced. A baseline model that predicts every customer as not churned would be right 83.9% of the time. As a result, our model must beat that baseline.
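The 83.9% figure follows directly from the class counts quoted above. A quick check, using only those two numbers:

```python
# Class counts quoted in the dataset description.
n_churned = 1627
n_not_churned = 8500
n_total = n_churned + n_not_churned

# Accuracy of a model that always predicts "not churned".
majority_baseline = n_not_churned / n_total
print(f"{n_total} customers, majority-class accuracy = {majority_baseline:.1%}")
```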

Fortunately, the data has no missing values. There are 19 feature columns and 1 target column. Among the feature columns, 5 are categorical: Gender, Education_Level, Marital_Status, Income_Category, and Card_Category.

## Load Data

In [6]:
df = pd.read_csv('train_ml2_2021.csv')
X = df.drop(columns=['target', 'problem_id'])
y = df.target
df_test = pd.read_csv('test0.csv', index_col='obs_id').drop(columns=['problem_id'])

In [7]:
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.8, stratify=y, random_state=1)
X_test, y_test = df_test.drop(columns='target'), df_test.target

## Fit a base model

We already described a baseline that predicts every customer as not churned, which is right 83.9% of the time. However, this dataset is highly imbalanced, so we also fit a second baseline: a logistic regression trained without any rebalancing of the data.

In [8]:
base_pipe = make_pipeline(
    LogisticRegression(
        solver='liblinear',
        class_weight=None
    )
)

In [9]:
base_pipe.fit(X_train.values, y_train)

Pipeline(steps=[('logisticregression', LogisticRegression(solver='liblinear'))])

In [10]:
base_pred = base_pipe.predict(X_val.values)

In [11]:
base_accuracy = balanced_accuracy_score(y_val, base_pred)
base_accuracy

0.33029955483826184

In [12]:
accuracy_score(y_val, base_pred)

0.588199879590608

Since we used balanced_accuracy_score for the classifier, we must also calculate a balanced accuracy score for the first baseline model, which predicts everything as "not churned", that is, 0. According to the official sklearn documentation,
>The balanced accuracy in binary and multiclass classification problems to deal with imbalanced datasets. It is defined as the average of recall obtained on each class.

A model that always predicts 0 has recall 1 on class 0 and recall 0 on every other class, so its balanced accuracy is 1/K for K classes (0.5 in the binary case). The logistic baseline's balanced accuracy of 0.330 should be compared against that figure rather than against raw accuracy. Next, we deal with the class imbalance to see whether we can get a better model.
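To make the definition quoted above concrete, the sketch below computes balanced accuracy by hand as the mean of per-class recall and applies it to a constant predictor. The labels are made-up toy data, not the project's dataset, and `balanced_accuracy` is a hypothetical helper written for illustration, not the sklearn function.

```python
from collections import Counter

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall (hypothetical helper mirroring the quoted definition)."""
    support = Counter(y_true)  # number of samples in each true class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct predictions per class
    recalls = [hits[c] / support[c] for c in support]
    return sum(recalls) / len(recalls)

# Toy imbalanced labels: 8 negatives and 2 positives (illustrative only).
y_true = [0] * 8 + [1] * 2
always_zero = [0] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, always_zero)) / len(y_true)
print(plain_accuracy)                          # 0.8
print(balanced_accuracy(y_true, always_zero))  # 0.5
```

Plain accuracy rewards the constant predictor for the class imbalance, while balanced accuracy does not.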

## Methods to deal with imbalanced data

- Sometimes we can simply ignore class imbalance, because most real-world data is imbalanced and small differences at large scale might not affect business outcomes. If the imbalance is not serious, we can ignore it. In this project, however, the imbalance cannot be ignored.
- We could collect more data for the minority group.
- The most practical approach in this project is to resample the data: over-sample the minority group, under-sample the majority group, take a representative sample of both groups, or synthetically generate samples from the minority class (SMOTE). In this project, we use SMOTE, which synthesises new minority instances between existing minority instances. In the picture below, SMOTE places synthetic minority instances somewhere on the lines connecting existing ones.
- We should pick an appropriate evaluation metric and, in particular, avoid plain accuracy. We use balanced_accuracy_score in this project.
- Use robust algorithms, for example a Support Vector Machine, which finds a hyperplane that maximizes the margin. The decision boundary depends only on a few "support vectors", which can reduce the impact of imbalanced data.

![](https://raw.githubusercontent.com/rikunert/SMOTE_visualisation/master/SMOTE_R_visualisation_2.png)

image from: https://rikunert.com/SMOTE_explained
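The interpolation idea behind SMOTE can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the imblearn implementation: `smote_like_sample` is a hypothetical helper, the points are toy data, and details such as how many synthetic points to create per class are left out.

```python
import random

def smote_like_sample(minority, k, rng):
    """Create one synthetic point between a minority point and one of its k nearest minority neighbours."""
    base = rng.choice(minority)
    # Rank the other minority points by squared Euclidean distance to the base point.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
    neighbour = rng.choice(others[:k])
    # Pick a random position on the segment joining the two points.
    gap = rng.random()
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]

# Toy minority-class points in two dimensions (illustrative only).
minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]]
rng = random.Random(42)
synthetic = [smote_like_sample(minority, k=2, rng=rng) for _ in range(5)]
for point in synthetic:
    print([round(v, 3) for v in point])
```

Every synthetic point lies on a segment between two real minority points, which is what the lines in the picture represent.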

## Models

In [37]:
models = [
#     LogisticRegression(),
#     RidgeClassifier(),
#     SGDClassifier(),
#     SVC(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    ExtraTreesClassifier()
]

In [38]:
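def pipe_4_model(model):
    # Oversample the minority class with SMOTE, then fit the given model.
    # imblearn's pipeline applies the sampler only during fit, not during predict.
    return make_pipeline(
        imblearn.over_sampling.SMOTE(k_neighbors=10),
        model
    )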

In [39]:
pipes = [pipe_4_model(model) for model in models]

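- We can use pipes[index].get_params().keys() to list the parameter names of a pipeline, including the step names (such as randomforestclassifier) that prefix the keys in the search spaces below.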
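As a small illustration of how those names are formed, the sketch below builds a stand-in pipeline and inspects it. `StandardScaler` replaces SMOTE here only so that the example needs sklearn alone. `demo_pipe` is a hypothetical name and not one of the pipelines in this notebook.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline: make_pipeline names each step after its lowercased class name.
demo_pipe = make_pipeline(StandardScaler(), RandomForestClassifier())

step_names = [name for name, _ in demo_pipe.steps]
print(step_names)  # ['standardscaler', 'randomforestclassifier']

# Hyperparameters of a step are addressed as '<step name>__<parameter>'.
rf_keys = sorted(
    key for key in demo_pipe.get_params()
    if key.startswith("randomforestclassifier__")
)
print(rf_keys[:3])
```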

## Model Hyper Parameters

In [40]:
# for LogisticRegression model, we set l1_ratio to find a mid-point
# between l1 and l2 regularization
lr_params = dict(
    logisticregression__penalty=['elasticnet'],
    logisticregression__solver=['saga'],
    logisticregression__l1_ratio=[0, 0.1, 0.3, 0.5, 0.7, 0.9, 1]
)

In [41]:
# for RidgeClassifier, alpha is important for regularization
rc_params = dict(
    ridgeclassifier__alpha=[0.1, 1, 10, 100, 1000]
)

In [42]:
# for SGDClassifier, l1_ratio still controls the balance
# between l1 and l2 regularization
# We should set early stopping to True to prevent overfitting
sgd_params = dict(
    sgdclassifier__l1_ratio=[0, 0.1, 0.3, 0.5, 0.7, 0.9, 1],
    sgdclassifier__early_stopping=[True],
)

In [43]:
# for Support Vector Machine, the kernel is important:
# each kernel defines a different way to transform the data
# (gamma is ignored by the 'linear' kernel)
svc_params = dict(
    svc__C=[0.1, 0.3, 0.5, 1, 10, 100],
    svc__kernel=['linear', 'poly', 'rbf', 'sigmoid'],
    svc__gamma=[0.1, 1, 10, 100, 1000]
)

In [49]:
# for RandomForestClassifier, n_estimators is the number of decision trees;
# more trees cost more time but generally do not by themselves cause overfitting
# max_depth is the max depth of each tree; deeper trees tend to overfit
# max_features is the number of features considered at each split. We do not
# use all the features, which decorrelates the trees and helps prevent overfitting
rfc_params = dict(
    randomforestclassifier__n_estimators=[70, 80, 90, 100, 110, 120, 130, 140],
    randomforestclassifier__max_depth=[8, 20, 22, 30, 50],
    randomforestclassifier__max_features=[500, 600, 700]
)

In [50]:
# for AdaBoostClassifier, n_estimators is still the number of estimators
abc_params = dict(
    adaboostclassifier__n_estimators=[10, 20, 30, 40, 50, 70, 100, 500, 1000],
)

In [51]:
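# for ExtraTreesClassifier, n_estimators is still the number of estimators
# max_depth is the maximum depth of the tree (None means unlimited)
et_params = dict(
    extratreesclassifier__n_estimators=[5, 10, 50, 100, 200, 300, 400, 500],
    # flatten the depths into one candidate list; a nested range is not a valid max_depth
    extratreesclassifier__max_depth=list(range(2, 30)) + [None],
    # an integer min_samples_split must be at least 2
    extratreesclassifier__min_samples_split=list(range(2, 10)),
    extratreesclassifier__min_samples_leaf=list(range(1, 10)),
    # 'auto' was an alias of 'sqrt' for classifiers and is rejected by recent sklearn releases
    extratreesclassifier__max_features=['sqrt', 'log2'],
    extratreesclassifier__warm_start=[True, False],
)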

In [52]:
params = [
#     lr_params,
#     rc_params,
#     sgd_params,
#     svc_params,
    rfc_params,
    abc_params,
    et_params
]
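To get a feel for the size of these search spaces, the sketch below counts every combination in a grid with the same candidate values as `rfc_params` and then draws a small random subset, which is the basic idea behind a randomized search. It uses only the standard library. The subset size of 5 and the seed are illustrative choices.

```python
import itertools
import random

# Same candidate values as rfc_params, without the pipeline prefix.
grid = {
    "n_estimators": [70, 80, 90, 100, 110, 120, 130, 140],
    "max_depth": [8, 20, 22, 30, 50],
    "max_features": [500, 600, 700],
}

# An exhaustive grid search would evaluate every combination.
all_combos = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]
print(len(all_combos))  # 120

# A randomized search evaluates only a random subset of them.
rng = random.Random(0)
sampled = rng.sample(all_combos, 5)
for combo in sampled:
    print(combo)
```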

## Search Parameter Space

- We use a randomized search strategy to select the best model. It is usually faster than grid search because it evaluates only a fixed number of sampled parameter combinations.
- We use cross validation with 3 folds.
- The metrics considered are balanced_accuracy_score and F1 score. Both are suitable for an imbalanced dataset.
- balanced_accuracy_score gives the average of the recall obtained on each class.
- F1 score gives the harmonic mean of precision and recall.
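As a worked example of the harmonic mean, the sketch below derives precision, recall, and F1 from confusion-matrix counts. The counts are hypothetical and `precision_recall_f1` is an illustrative helper, not part of sklearn.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 30 true positives, 10 false positives, 30 false negatives.
precision, recall, f1 = precision_recall_f1(tp=30, fp=10, fn=30)
print(precision, recall, round(f1, 3))  # 0.75 0.5 0.6
```

The harmonic mean (0.6) sits below the arithmetic mean (0.625), so F1 penalises a large gap between precision and recall.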

In [53]:
balanced_scorer = make_scorer(balanced_accuracy_score)
accuracy_scorer = make_scorer(accuracy_score)

In [54]:
best_models = []
best_params = []
best_scores = []

In [55]:
for index in range(len(models)):
    for score in [balanced_scorer]:
        model_family = models[index].__class__.__name__
        pipe = pipes[index]
        search_space = params[index]

        cross_valid = RandomizedSearchCV(
            estimator = pipe,
            param_distributions = search_space,
            n_iter = 5,
            cv = 3,
            scoring = score,
            n_jobs = -1,
            verbose = 0  # verbosity must be a non-negative integer
        )

        cross_valid.fit(X.values, y.values)
        # fit() returns the search object itself; the refitted winner is best_estimator_
        best_model = cross_valid.best_estimator_
        best_param = cross_valid.best_params_
        best_score = cross_valid.best_score_

        best_models.append(best_model)
        best_params.append(best_param)
        best_scores.append(best_score)

        print(f"index={index}, {model_family}: {best_score}, metrics={score._score_func.__name__}")
        


index=0, RandomForestClassifier: 0.3426374678944292, metrics=balanced_accuracy_score
index=1, AdaBoostClassifier: 0.31884723713513535, metrics=balanced_accuracy_score
index=2, ExtraTreesClassifier: 0.3549412102432132, metrics=balanced_accuracy_score


## Fit the best model

- From the search above, the best model is ExtraTreesClassifier, with a balanced accuracy of 0.355, ahead of RandomForestClassifier (0.343) and AdaBoostClassifier (0.319).
- Here are the parameters of the best extra-trees classifier:

In [85]:
print(best_params[2])

{'extratreesclassifier__warm_start': False, 'extratreesclassifier__n_estimators': 300, 'extratreesclassifier__min_samples_split': 8, 'extratreesclassifier__min_samples_leaf': 9, 'extratreesclassifier__max_features': 'sqrt', 'extratreesclassifier__max_depth': None}


In [110]:
# k_neighbors=13 and n_estimators=250 differ from the values used in the search above (10 and 300)
best_pipe = make_pipeline(
    imblearn.over_sampling.SMOTE(k_neighbors=13),
    ExtraTreesClassifier(
        n_estimators = 250,
        min_samples_split = 8,
        min_samples_leaf = 9,
        max_features = 'sqrt',
        max_depth = None
    )
)

In [111]:
best_pipe.fit(X_train.values, y_train.values)

Pipeline(steps=[('smote', SMOTE(k_neighbors=13)),
                ('extratreesclassifier',
                 ExtraTreesClassifier(max_features='sqrt', min_samples_leaf=9,
                                      min_samples_split=8, n_estimators=250))])

In [112]:
best_pred = best_pipe.predict(X_val.values)

In [113]:
accuracy_score(y_val.values, best_pred)

0.6827212522576761

In [90]:
y_pred_test = best_pipe.predict(X_test.values)

In [91]:
submission = pd.read_csv("sample_submission.csv")

submission["target"] = y_pred_test

submission.to_csv("submission.csv", index=False)