<a href="https://colab.research.google.com/github/adhang/data-science-digital-skola/blob/update/99.%20Final%20Project/Telco%20Customer%20Churn%20Prediction%20(Modeling).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Telco Customer Churn Prediction
Author: Adhang Muntaha Muhammad

[![LinkedIn](https://img.shields.io/badge/linkedin-0077B5?style=for-the-badge&logo=linkedin&logoColor=white&link=https://www.linkedin.com/in/adhangmuntaha/)](https://www.linkedin.com/in/adhangmuntaha/)
[![GitHub](https://img.shields.io/badge/github-121011?style=for-the-badge&logo=github&logoColor=white&link=https://github.com/adhang)](https://github.com/adhang)
[![Kaggle](https://img.shields.io/badge/kaggle-20BEFF?style=for-the-badge&logo=kaggle&logoColor=white&link=https://www.kaggle.com/adhang)](https://www.kaggle.com/adhang)
[![Tableau](https://img.shields.io/badge/tableau-E97627?style=for-the-badge&logo=tableau&logoColor=white&link=https://public.tableau.com/app/profile/adhang)](https://public.tableau.com/app/profile/adhang)
___

**Context**
- The telco customer churn data contains customer information from a fictional telco company
- This company provides various services such as streaming, phone, and internet services
<br><br>

**Problem Background**
- Customer churn is one of the biggest problems in the telecommunications industry
- By definition, customer churn is when customers stop doing business with (unsubscribe from) the company
- Companies must spend money to acquire new customers
- When a customer leaves the service (churns), it indicates a loss of investment
- Cost, time, and effort need to be channelled to replace customers who have left the service
- Acquiring new customers is often more difficult and more expensive than retaining existing customers
- According to a Harvard Business Review [article](https://hbr.org/2014/10/the-value-of-keeping-the-right-customers), acquiring a new customer is anywhere from five to 25 times more expensive than retaining an existing one
<br><br>

**Objectives**
- Predict whether customers will continue to use the service or will leave the service
- Understand the customer behaviors: what keeps customers using the service and what makes them leave the service
<br><br>

**Contents**
1. Dataset Information
2. Importing Libraries
3. Dataset Overview
4. Dataset Overview - Function
5. Exploratory Data Analysis
6. Data Preprocessing

# 1. Dataset Information
This dataset comes from Kaggle; you can find it here: [Telco Customer Churn](https://www.kaggle.com/blastchar/telco-customer-churn).
<br><br>
This dataset is used to predict customer behavior so that customers can be retained. Each row represents a customer, and each column contains one of the customer's attributes.
<br><br>
**Attribute Information**
- Identifier
  - `customerID` - ID number of the customer

- Target Variable
  - `Churn` - Churn status, whether the customer churned or not

- Demographic information
  - `gender` - Whether the customer is a male or a female
  - `SeniorCitizen` - Whether the customer is a senior citizen or not
  - `Partner` - Whether the customer has a partner or not
  - `Dependents` - Whether the customer has dependents or not

- Customer account information
  - `tenure` - Number of months the customer has used the service
  - `Contract` - The contract term of the customer
  - `PaperlessBilling` - Whether the customer has paperless billing or not
  - `PaymentMethod` - The customer’s payment method
  - `MonthlyCharges` - The amount charged to the customer monthly
  - `TotalCharges` - The total amount charged to the customer
  
- Services that each customer has signed up for
  - `PhoneService` - Whether the customer has a phone service or not
  - `MultipleLines` - Whether the customer has multiple lines or not
  - `InternetService` - Customer’s internet service provider
  - `OnlineSecurity` - Whether the customer has online security or not
  - `OnlineBackup` - Whether the customer has online backup or not
  - `DeviceProtection` - Whether the customer has device protection or not
  - `TechSupport` - Whether the customer has tech support or not
  - `StreamingTV` - Whether the customer has streaming TV or not
  - `StreamingMovies` - Whether the customer has streaming movies or not
<br><br>

**Note:** This dataset uses `CamelCase` column names; for this project, I will convert them to `snake_case`.
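
As a rough illustration of that conversion, here is a minimal standard-library sketch. The notebook itself relies on `inflection.underscore`; the helper name `to_snake_case` below is hypothetical and only approximates that behavior with two regular expressions.

```python
import re

def to_snake_case(name):
    """Convert a CamelCase column name to snake_case (illustrative helper)."""
    # split an acronym from a following capitalized word, e.g. 'TVShow' -> 'TV_Show'
    s = re.sub(r'([A-Z]+)([A-Z][a-z])', r'\1_\2', name)
    # split a lowercase letter or digit from a following capital, e.g. 'rC' -> 'r_C'
    s = re.sub(r'([a-z\d])([A-Z])', r'\1_\2', s)
    # spaces become underscores, then everything is lowercased
    return s.replace(' ', '_').lower()

columns = ['customerID', 'SeniorCitizen', 'StreamingTV', 'MonthlyCharges', 'tenure']
converted = [to_snake_case(col) for col in columns]
print(converted)
```

Names that are already lowercase, such as `tenure`, pass through unchanged.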

# 2. Importing Libraries

In [None]:
# !pip install --upgrade matplotlib
# !pip install --upgrade seaborn
!pip install inflection
!pip install dython
# !pip install xgboost

In [61]:
# association between attributes
from dython.nominal import associations

# basic
import pandas as pd
import numpy as np

# viz
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib.colors import LinearSegmentedColormap
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns

plt.style.use('fivethirtyeight')
sns.set_style('white')
sns.set_context('notebook', font_scale=1.5, rc={'lines.linewidth':1.5})
# I change the maximum width in characters of a column (default: 50)
pd.set_option('display.max_colwidth', None)

# CamelCase to snake_case format
import inflection

# encoding
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder

# scaling
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

# oversampling
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import SMOTENC
from imblearn.over_sampling import ADASYN

# train test split
from sklearn.model_selection import train_test_split

# model
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import RidgeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

# metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

# model evaluation & tuning hyperparameter
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold

# 4. Dataset Overview - Function

In [2]:
#@title Just Run This Function

def get_camel_case(data):
  # NOTE: despite its name, this returns the column names converted
  # from CamelCase to snake_case (e.g. 'SeniorCitizen' -> 'senior_citizen')
  return [inflection.underscore(col).replace(' ', '_') for col in data.columns]

def get_data_type(data, column_list_numerical):
  # general data type
  data_type_general = dict()

  for col in data.columns:
    if col in column_list_numerical:
      data_type_general[col] = 'numerical'
    else:
      data_type_general[col] = 'categorical'

  tmp = pd.Series(data_type_general)
  data_type_general = pd.DataFrame(tmp).T.rename({0:'general data types'})
  
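  # pythonic data type
  # total_charges is read as text; non-numeric entries become NaN and are filled with 0
  # (assigning the result back avoids chained in-place fillna, which newer pandas versions warn about)
  data['total_charges'] = pd.to_numeric(data['total_charges'], errors='coerce').fillna(0).astype(float)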

  data_type_python = pd.DataFrame(data.dtypes).T.rename({0:'python data types'})

  return data_type_general, data_type_python

def get_data_variation(data, column_list_numerical, column_list_categorical):
  # numerical data variation
  variation_numerical = dict()

  for col in column_list_numerical:
    tmp = f'{data[col].min()} - {data[col].max()}'
    variation_numerical[col] = tmp

  tmp = pd.Series(variation_numerical)
  data_variation_numerical = pd.DataFrame(tmp).T.rename({0:'data variation'})

  # categorical data variation
  variation_categorical = dict()

  for col in column_list_categorical:
    tmp = data[col].unique().tolist()
    tmp.sort()
    variation_categorical[col] = ', '.join(str(item) for item in tmp)

  tmp = pd.Series(variation_categorical)
  data_variation_categorical = pd.DataFrame(tmp).T.rename({0:'data variation'})

  # overall data variation
  data_variation = pd.concat([data_variation_numerical, data_variation_categorical], axis=1)

  return data_variation

def get_dataset_overview(data):
  # renaming column
  column_list = get_camel_case(data)
  data.columns = column_list

  # total duplicated values
  # print('Total duplicated values:', data.duplicated().sum())

  # dropping column 
  data.drop('customer_id', axis=1, inplace=True)

  # column list
  column_list_numerical = ['tenure', 'monthly_charges', 'total_charges']
  column_list_categorical = list(data.columns)
  column_list_categorical.remove('tenure')
  column_list_categorical.remove('monthly_charges')
  column_list_categorical.remove('total_charges')

  # data type
  data_type_general, data_type_python = get_data_type(data, column_list_numerical)

  # total data
  data_count = pd.DataFrame(data.count()).T.rename({0:'total data'})

  # total null values
  data_null_total = pd.DataFrame(data.isna().sum()).T.rename({0:'total null'})

  # percentage of null values
  data_null_percentage = pd.DataFrame(100*data.isna().sum()/data.shape[0]).T.rename({0:'percentage null'})

  # data variation
  data_variation = get_data_variation(data, column_list_numerical, column_list_categorical)

  data_info = pd.concat([data_type_general, data_type_python,
                       data_count, data_null_total,
                       data_null_percentage.round(2), data_variation],
                      axis=0)

  data_info = data_info.reindex(data.columns, axis=1)

  return data, data_info

In [3]:
#@title And Then Run This
path = 'https://raw.githubusercontent.com/adhang/data-science-digital-skola/main/99.%20Final%20Project/dataset/Telco-Customer-Churn.csv'

data = pd.read_csv(path)
data, data_info = get_dataset_overview(data)

# numerical
column_numerical = ['tenure', 'monthly_charges', 'total_charges']

# categorical
column_categorical = list(data.columns)
column_categorical.remove('tenure')
column_categorical.remove('monthly_charges')
column_categorical.remove('total_charges')

# only contains input features
column_categorical.remove('churn')

In [None]:
data.head()

,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service,online_security,online_backup,device_protection,tech_support,streaming_tv,streaming_movies,contract,paperless_billing,payment_method,monthly_charges,total_charges,churn
0,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [None]:
data_info

,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service,online_security,online_backup,device_protection,tech_support,streaming_tv,streaming_movies,contract,paperless_billing,payment_method,monthly_charges,total_charges,churn
general data types,categorical,categorical,categorical,categorical,numerical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,numerical,numerical,categorical
python data types,object,int64,object,object,int64,object,object,object,object,object,object,object,object,object,object,object,object,float64,float64,object
total data,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043
total null,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
percentage null,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
data variation,"Female, Male","0, 1","No, Yes","No, Yes",0 - 72,"No, Yes","No, No phone service, Yes","DSL, Fiber optic, No","No, No internet service, Yes","No, No internet service, Yes","No, No internet service, Yes","No, No internet service, Yes","No, No internet service, Yes","No, No internet service, Yes","Month-to-month, One year, Two year","No, Yes","Bank transfer (automatic), Credit card (automatic), Electronic check, Mailed check",18.25 - 118.75,0.0 - 8684.8,"No, Yes"


# 6. Data Preprocessing

## Preprocessing

In [24]:
# numerical
column_numerical = ['tenure', 'monthly_charges', 'total_charges']

# categorical
column_categorical = list(data.columns)
column_categorical.remove('tenure')
column_categorical.remove('monthly_charges')
column_categorical.remove('total_charges')

# only contains input features
column_categorical.remove('churn')

# =========================================
# TRAIN - TEST SPLIT
# =========================================
data_X = data.drop('churn', axis=1)
data_y = data['churn']

X_train, X_test, y_train, y_test = train_test_split(data_X, data_y, test_size=0.3, random_state=1, stratify=data_y)

# =========================================
# LABEL ENCODING
# =========================================
le = LabelEncoder()

le.fit(y_train)

y_train_encode = le.transform(y_train)
y_test_encode = le.transform(y_test)

# =========================================
# ONE HOT ENCODING
# =========================================
# ohe = OneHotEncoder(sparse=False, drop='if_binary')
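# `sparse_output` replaces the `sparse` argument in scikit-learn >= 1.2;
# on older versions use OneHotEncoder(sparse=False) instead
ohe = OneHotEncoder(sparse_output=False)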

ohe.fit(X_train[column_categorical])

# for col in column_categorical:
X_train_ohe = ohe.transform(X_train[column_categorical])
X_test_ohe = ohe.transform(X_test[column_categorical])

# =========================================
# OHE COLUMN TO SNAKE CASE
# =========================================
# rename ohe column to snake_case
column_ohe = ohe.get_feature_names_out()

for i, col in enumerate(column_ohe):
  column_ohe[i] = inflection.underscore(column_ohe[i]).replace(' ', '_').replace('(automatic)','')

# =========================================
# COMBINE NUMERICAL COLUMN & ENCODED CATEGORICAL
# =========================================
# create dataframe from one-hot encoded features
X_train_ohe_df = pd.DataFrame(X_train_ohe, columns=column_ohe, index=X_train.index)

# combine the numerical and encoded features
X_train_encode = pd.concat([X_train.drop(columns=column_categorical), X_train_ohe_df], axis=1)

# create dataframe from one-hot encoded features
X_test_ohe_df = pd.DataFrame(X_test_ohe, columns=column_ohe, index=X_test.index)

# combine the numerical and encoded features
X_test_encode = pd.concat([X_test.drop(columns=column_categorical), X_test_ohe_df], axis=1)

# =========================================
# FEATURE SCALING
# =========================================
X_train_scale = X_train_encode.copy()
X_test_scale = X_test_encode.copy()

for i in column_numerical:
  scaler = MinMaxScaler()
  scaler.fit(X_train_scale[[i]])

  X_train_scale[[i]] = scaler.transform(X_train_scale[[i]])
  X_test_scale[[i]] = scaler.transform(X_test_scale[[i]])

## Encoded Dataframe

In [25]:
# combine the X-train and X-test
data_encode = pd.concat([X_train_encode, X_test_encode], axis=0)

# combine with the y-train
data_encode = data_encode.join(pd.Series(y_train_encode, name='churn', index=X_train_encode.index), lsuffix='_1', rsuffix='_2')

# combine with the y-test
data_encode = data_encode.join(pd.Series(y_test_encode, name='churn', index=X_test_encode.index), lsuffix='_1', rsuffix='_2')

# merging the y-train and y-test column
data_encode['churn_1'] = data_encode['churn_1'].fillna(data_encode['churn_2'])
data_encode.drop(columns='churn_2', inplace=True)
data_encode.rename(columns={'churn_1':'churn'}, inplace=True)

data_encode.head()

,tenure,monthly_charges,total_charges,gender_female,gender_male,senior_citizen_0,senior_citizen_1,partner_no,partner_yes,dependents_no,...,contract_month_to_month,contract_one_year,contract_two_year,paperless_billing_no,paperless_billing_yes,payment_method_bank_transfer_,payment_method_credit_card_,payment_method_electronic_check,payment_method_mailed_check,churn
6427,41,20.15,802.35,0.0,1.0,1.0,0.0,1.0,0.0,1.0,...,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
6971,18,99.75,1836.25,1.0,0.0,0.0,1.0,0.0,1.0,1.0,...,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
96,71,66.85,4748.7,0.0,1.0,1.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
5640,1,79.6,79.6,0.0,1.0,1.0,0.0,1.0,0.0,1.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0
414,48,70.65,3545.05,1.0,0.0,0.0,1.0,0.0,1.0,1.0,...,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
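
The indicator columns in this table come from one-hot encoding fitted on the training split only. The following plain-Python sketch illustrates that idea; it is not scikit-learn's implementation, and the helper names `fit_categories` and `one_hot` as well as the tiny row lists are assumptions made for illustration.

```python
def fit_categories(rows, column):
    """Collect the sorted unique values of one column from the training rows."""
    return sorted({row[column] for row in rows})

def one_hot(row, column, categories):
    """Return one 0.0/1.0 indicator feature per category learned during fitting."""
    return {f"{column}_{cat}": float(row[column] == cat) for cat in categories}

# hypothetical rows; the values mirror the `contract` column of the dataset
train_rows = [{"contract": "Month-to-month"}, {"contract": "Two year"}, {"contract": "One year"}]
test_rows = [{"contract": "One year"}]

# categories are learned from the training rows only, then reused for the test rows
categories = fit_categories(train_rows, "contract")
encoded_test = [one_hot(row, "contract", categories) for row in test_rows]
print(encoded_test)
```

Learning the categories from the training rows and reusing them for the test rows keeps the column layout identical across both splits.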


## Scaled Dataframe
This dataframe has been encoded and scaled

In [26]:
# combine the X-train and X-test
data_scale = pd.concat([X_train_scale, X_test_scale], axis=0)

# combine with the y-train
data_scale = data_scale.join(pd.Series(y_train_encode, name='churn', index=X_train_scale.index), lsuffix='_1', rsuffix='_2')

# combine with the y-test
data_scale = data_scale.join(pd.Series(y_test_encode, name='churn', index=X_test_scale.index), lsuffix='_1', rsuffix='_2')

# merging the y-train and y-test column
data_scale['churn_1'] = data_scale['churn_1'].fillna(data_scale['churn_2'])
data_scale.drop(columns='churn_2', inplace=True)
data_scale.rename(columns={'churn_1':'churn'}, inplace=True)

data_scale.head()

,tenure,monthly_charges,total_charges,gender_female,gender_male,senior_citizen_0,senior_citizen_1,partner_no,partner_yes,dependents_no,...,contract_month_to_month,contract_one_year,contract_two_year,paperless_billing_no,paperless_billing_yes,payment_method_bank_transfer_,payment_method_credit_card_,payment_method_electronic_check,payment_method_mailed_check,churn
6427,0.569444,0.017439,0.092386,0.0,1.0,1.0,0.0,1.0,0.0,1.0,...,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
6971,0.25,0.810663,0.211433,1.0,0.0,0.0,1.0,0.0,1.0,1.0,...,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
96,0.986111,0.48281,0.546783,0.0,1.0,1.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
5640,0.013889,0.609865,0.009165,0.0,1.0,1.0,0.0,1.0,0.0,1.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0
414,0.666667,0.520678,0.40819,1.0,0.0,0.0,1.0,0.0,1.0,1.0,...,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
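
The scaled values can be reproduced by hand with the min-max formula `(x - min) / (max - min)`, where `min` and `max` are taken from the training split only. Below is a minimal plain-Python sketch, not the `MinMaxScaler` implementation; the helper names and the short value lists are assumptions for illustration. Assuming the training `tenure` spans 0 to 72, a tenure of 41 maps to 41/72 ≈ 0.569444, which is consistent with the first row of the table.

```python
def fit_min_max(values):
    """Learn the minimum and maximum from the training values only."""
    return min(values), max(values)

def transform_min_max(values, lo, hi):
    """Apply (x - min) / (max - min) using statistics learned elsewhere."""
    return [(v - lo) / (hi - lo) for v in values]

train_tenure = [0, 18, 41, 72]   # hypothetical training values
test_tenure = [1, 36, 72]        # hypothetical test values

lo, hi = fit_min_max(train_tenure)                         # fit on train only
train_scaled = transform_min_max(train_tenure, lo, hi)
test_scaled = transform_min_max(test_tenure, lo, hi)       # reuse train statistics

print([round(v, 6) for v in train_scaled])
print([round(v, 6) for v in test_scaled])
```

Because the statistics come from the training split, a test value outside the training range would fall outside [0, 1]; that is expected and avoids leaking test information into preprocessing.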


## Resampling

In [27]:
# numerical
column_numerical = ['tenure', 'monthly_charges', 'total_charges']

# categorical
column_categorical = list(data_scale.columns)
column_categorical.remove('tenure')
column_categorical.remove('monthly_charges')
column_categorical.remove('total_charges')

# only contains input features
# column_categorical.remove('churn')

### SMOTE

In [28]:
smote = SMOTE(random_state=1)

X_train_smote, y_train_smote = smote.fit_resample(X_train_scale, y_train_encode)

X_train_smote_df = pd.DataFrame(X_train_smote, columns=X_train_smote.columns)
y_train_smote_df = pd.DataFrame(y_train_smote, columns=['churn'])

data_smote = pd.concat([X_train_smote_df, y_train_smote_df], axis=1)
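
For intuition, SMOTE creates a synthetic minority sample by interpolating between a minority sample and one of its nearest minority-class neighbors. The plain-Python sketch below shows only that interpolation step; it is not the `imblearn` implementation, and the function names and toy points are hypothetical.

```python
import random

def nearest_neighbors(point, candidates, k):
    """Return the k candidates closest to `point` (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    others = [c for c in candidates if c is not point]
    return sorted(others, key=lambda c: sq_dist(point, c))[:k]

def smote_like_sample(minority, k, rng):
    """Create one synthetic point on the segment between a sample and a neighbor."""
    base = rng.choice(minority)
    neighbor = rng.choice(nearest_neighbors(base, minority, k))
    gap = rng.random()  # interpolation factor in [0, 1)
    synthetic = [b + gap * (n - b) for b, n in zip(base, neighbor)]
    return base, neighbor, synthetic

rng = random.Random(1)
minority = [[0.10, 0.20], [0.15, 0.25], [0.30, 0.10], [0.80, 0.90]]  # toy minority-class points
base, neighbor, synthetic = smote_like_sample(minority, k=2, rng=rng)
print(base, neighbor, synthetic)
```

As in the notebook, resampling of this kind is applied to the training split only, so the test set keeps its original class balance.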

### SMOTENC

In [30]:
# column 3-45 is categorical (exclude the target)
smotenc = SMOTENC(random_state=1, categorical_features=np.arange(3,46)) # 46 because exclusive

X_train_smotenc, y_train_smotenc = smotenc.fit_resample(X_train_scale, y_train_encode)

X_train_smotenc_df = pd.DataFrame(X_train_smotenc, columns=X_train_smotenc.columns)
y_train_smotenc_df = pd.DataFrame(y_train_smotenc, columns=['churn'])

data_smotenc = pd.concat([X_train_smotenc_df, y_train_smotenc_df], axis=1)

### ADASYN

In [31]:
adasyn = ADASYN(random_state=1)

X_train_adasyn, y_train_adasyn = adasyn.fit_resample(X_train_scale, y_train_encode)

X_train_adasyn_df = pd.DataFrame(X_train_adasyn, columns=X_train_adasyn.columns)
y_train_adasyn_df = pd.DataFrame(y_train_adasyn, columns=['churn'])

data_adasyn = pd.concat([X_train_adasyn_df, y_train_adasyn_df], axis=1)

# Modeling

## Function

### Print Single Report

In [32]:
def print_report(y_test, y_pred):
  print(classification_report(y_test, y_pred, digits=3))

  print('========================================')
  print('========================================')

  print('Accuracy\t: ', round(accuracy_score(y_test, y_pred),3))
  print('Precision\t: ', round(precision_score(y_test, y_pred, average='macro'),3)) 
  print('Recall\t\t: ', round(recall_score(y_test, y_pred, average='macro'),3))

### Print Score

In [97]:
def print_score(y_pred_list, scoring='accuracy'):
  model_name = []
  accuracy = []
  precision = []
  recall = []
  f1 = []
  roc_auc = []

  for name, y_pred in y_pred_list.items():
    model_name.append(name)
    accuracy.append(accuracy_score(y_test_encode, y_pred))
    precision.append(precision_score(y_test_encode, y_pred, average='macro'))
    recall.append(recall_score(y_test_encode, y_pred, average='macro'))
    f1.append(f1_score(y_test_encode, y_pred, average='macro'))
    roc_auc.append(roc_auc_score(y_test_encode, y_pred, average='macro'))

  score_list = {
      'model':model_name,
      'accuracy':accuracy,
      'precision':precision,
      'recall':recall,
      'f1_score':f1,
      'roc_auc':roc_auc
  }

  score_df = pd.DataFrame(score_list).set_index('model').sort_values(scoring, ascending=False).round(3)
  display(score_df.style.highlight_max(props='color:white; background-color:#008FD5').highlight_min(props='color:white; background-color:#FC4F30'))
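
One detail worth knowing about `print_score`: it passes hard class predictions to `roc_auc_score`. For a binary problem, ROC AUC computed from 0/1 predictions reduces to the average of the two per-class recalls, which is why the `roc_auc` and `recall` columns in the score tables are identical. The sketch below checks this in plain Python with a pairwise definition of AUC; the helper names and toy labels are assumptions, and this is not scikit-learn's code.

```python
def auc_pairwise(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_recall(y_true, y_pred):
    """Unweighted mean of the recall of class 0 and the recall of class 1."""
    recalls = []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / 2

# toy labels, hard predictions, and predicted probabilities (hypothetical)
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]
proba = [0.1, 0.2, 0.3, 0.6, 0.9, 0.8, 0.45, 0.4]

auc_from_labels = auc_pairwise(y_true, y_pred)
auc_from_proba = auc_pairwise(y_true, proba)
print(auc_from_labels, macro_recall(y_true, y_pred), auc_from_proba)
```

To measure ranking quality rather than a single operating point, scores from `predict_proba` or `decision_function` could be passed to `roc_auc_score` instead; whether a given estimator offers those methods varies (for example, `RidgeClassifier` has no `predict_proba`).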

### Grid Search

In [33]:
def grid_search(model, grid, X, y, scoring='accuracy'):
  cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
  
  # use a distinct local name so the function name is not shadowed
  search = GridSearchCV(estimator=model, param_grid=grid,
                        n_jobs=-1, cv=cv, scoring=scoring, error_score=0)

  grid_result = search.fit(X, y)
  # summarize results
  print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
  means = grid_result.cv_results_['mean_test_score']
  stds = grid_result.cv_results_['std_test_score']
  params = grid_result.cv_results_['params']
  for mean, stdev, param in zip(means, stds, params):
      print("%f (%f) with: %r" % (mean, stdev, param))
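
To get a feel for the cost of this search: with 10 splits and 3 repeats, every parameter combination is fitted 30 times. The sketch below enumerates a grid the way an exhaustive search would and counts the fits. The grid itself is hypothetical and not taken from this notebook.

```python
from itertools import product

# hypothetical parameter grid, for illustration only
grid = {
    'C': [0.1, 1, 10],
    'penalty': ['l1', 'l2'],
}

# every combination of the listed values is evaluated
combos = [dict(zip(grid, values)) for values in product(*grid.values())]

n_splits, n_repeats = 10, 3                  # same values as in grid_search above
n_fits = len(combos) * n_splits * n_repeats  # cross-validation fits

print(len(combos), 'combinations ->', n_fits, 'fits')
```

By default `GridSearchCV` also refits the best combination once on the full data passed to `fit`, so the total is one fit higher.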

## Default Parameter

In [104]:
model_list = {
    'Logistic Regression':LogisticRegression(max_iter=500),
    'Ridge Classifier':RidgeClassifier(),
    'KNN':KNeighborsClassifier(),
    'SVC':SVC(),
    'Decision Tree':DecisionTreeClassifier(random_state=1),
    'Random Forest':RandomForestClassifier(random_state=1),
    'AdaBoost':AdaBoostClassifier(random_state=1),
    'Gradient Boosting':GradientBoostingClassifier(random_state=1),
    'Hist Gradient Boosting':HistGradientBoostingClassifier(random_state=1),
    'XGBoost':XGBClassifier(random_state=1),
    'Neural Network':MLPClassifier(max_iter=500, random_state=1)
}

### SMOTE

In [105]:
y_pred_list = dict()

for name, model in model_list.items():
  model.fit(X_train_smote, y_train_smote)
  y_pred_list[name] = model.predict(X_test_scale)

print_score(y_pred_list, 'accuracy')
# print_score(y_pred_list, 'f1_score')

model,accuracy,precision,recall,f1_score,roc_auc
Hist Gradient Boosting,0.787,0.728,0.737,0.732,0.737
XGBoost,0.786,0.732,0.762,0.743,0.762
Gradient Boosting,0.784,0.729,0.754,0.738,0.754
Random Forest,0.78,0.718,0.715,0.717,0.715
SVC,0.764,0.713,0.749,0.723,0.749
AdaBoost,0.76,0.713,0.754,0.723,0.754
Logistic Regression,0.746,0.708,0.756,0.714,0.756
Ridge Classifier,0.744,0.707,0.755,0.713,0.755
Neural Network,0.732,0.668,0.684,0.674,0.684
Decision Tree,0.717,0.648,0.661,0.653,0.661


### SMOTENC

In [106]:
y_pred_list = dict()

for name, model in model_list.items():
  model.fit(X_train_smotenc, y_train_smotenc)
  y_pred_list[name] = model.predict(X_test_scale)

print_score(y_pred_list, 'accuracy')
# print_score(y_pred_list, 'f1_score')

model,accuracy,precision,recall,f1_score,roc_auc
Random Forest,0.768,0.705,0.715,0.71,0.715
Neural Network,0.764,0.7,0.708,0.703,0.708
Hist Gradient Boosting,0.764,0.707,0.733,0.716,0.733
Gradient Boosting,0.763,0.713,0.752,0.724,0.752
XGBoost,0.762,0.715,0.756,0.725,0.756
Logistic Regression,0.761,0.715,0.758,0.725,0.758
SVC,0.759,0.707,0.742,0.717,0.742
Ridge Classifier,0.755,0.71,0.752,0.719,0.752
AdaBoost,0.752,0.71,0.757,0.718,0.757
Decision Tree,0.703,0.639,0.657,0.645,0.657


### ADASYN

In [107]:
y_pred_list = dict()

for name, model in model_list.items():
  model.fit(X_train_adasyn, y_train_adasyn)
  y_pred_list[name] = model.predict(X_test_scale)

print_score(y_pred_list, 'accuracy')
# print_score(y_pred_list, 'f1_score')

model,accuracy,precision,recall,f1_score,roc_auc
Gradient Boosting,0.787,0.733,0.761,0.743,0.761
XGBoost,0.786,0.732,0.761,0.743,0.761
Hist Gradient Boosting,0.785,0.726,0.734,0.73,0.734
Random Forest,0.77,0.707,0.71,0.708,0.71
AdaBoost,0.753,0.708,0.75,0.717,0.75
SVC,0.732,0.694,0.739,0.699,0.739
Logistic Regression,0.727,0.703,0.757,0.702,0.757
Neural Network,0.727,0.659,0.672,0.664,0.672
Ridge Classifier,0.725,0.702,0.756,0.7,0.756
Decision Tree,0.72,0.652,0.666,0.657,0.666


# Model Evaluation

## Ref 1
**Precision**<br>
Intuitively speaking, a 100% precise model means that every sample it predicted as positive was a True Positive; in other words, there were NO False Positives.<br>
![Precision](https://i.stack.imgur.com/bSmbY.png)
<br><br>

**Recall**<br>
Intuitively speaking, a model with 100% recall did NOT miss any actual positive; in other words, there were NO False Negatives.<br>
![Recall](https://i.stack.imgur.com/J6EUS.png)
<br><br>

**Specificity (true negative rate)**<br>
Intuitively speaking, a 100% specific model did NOT miss any actual negative; in other words, there were NO False Positives.<br>
![Specificity](https://i.stack.imgur.com/TE01E.png)
<br><br>
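
The three definitions above can be written directly in terms of confusion-matrix counts. Here is a minimal plain-Python sketch with toy labels; the function name `confusion_counts` and the data are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)

precision = tp / (tp + fp)     # of everything predicted positive, how much was right
recall = tp / (tp + fn)        # of all actual positives, how many were found
specificity = tn / (tn + fp)   # of all actual negatives, how many were found

print(precision, recall, specificity)
```

Specificity is the recall of the negative class, so it equals one minus the false positive rate.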

**Rule of Thumb**<br>
As a rule of thumb, if the cost of a False Negative is high, we want to increase the model's recall (sensitivity).

**Example**<br>
For instance, in fraud detection or sick patient detection, we don't want to label/predict a fraudulent transaction (an actual positive) as non-fraudulent (False Negative). Also, we don't want to label/predict a contagious sick patient (an actual positive) as not sick (False Negative).
<br><br>
This is because the consequences will be worse than a False Positive (incorrectly labeling a harmless transaction as fraudulent or a non-contagious patient as contagious).
<br><br>
On the other hand, if the cost of a False Positive is high, then we want to increase the model's specificity and precision.
<br><br>
For instance, in email spam detection, we don't want to label/predict a non-spam email (an actual negative) as spam (False Positive). On the other hand, failing to label a spam email as spam (False Negative) is less costly.
<br><br>
ref: [stackoverflow](https://stackoverflow.com/questions/44172162/f1-score-vs-roc-auc)
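
One common way to act on this rule of thumb is to move the decision threshold applied to predicted probabilities: lowering it never reduces recall, but it usually costs precision. The sketch below illustrates the trade-off with made-up probabilities; the function name and all numbers are hypothetical and not results from this notebook.

```python
def recall_precision(y_true, proba, threshold):
    """Recall and precision when predicting positive for proba >= threshold."""
    y_pred = [int(p >= threshold) for p in proba]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    recall = tp / (tp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision

# toy churn labels (1 = churned) and predicted churn probabilities
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
proba = [0.9, 0.7, 0.45, 0.35, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]

recall_default, precision_default = recall_precision(y_true, proba, threshold=0.5)
recall_low, precision_low = recall_precision(y_true, proba, threshold=0.3)

print('threshold 0.5:', recall_default, round(precision_default, 3))
print('threshold 0.3:', recall_low, round(precision_low, 3))
```

In a churn setting, a lower threshold flags more customers as likely churners, which catches more true churners at the price of more false alarms.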

## Ref 2
https://neptune.ai/blog/evaluation-metrics-binary-classification

https://neptune.ai/blog/f1-score-accuracy-roc-auc-pr-auc