# Supervised Learning Final Project
## This project analyzes customer behavior at Beta Bank to determine whether it is possible to predict if a customer will leave the bank. Classification models, namely Logistic Regression and Decision Tree Classifier, will be used. The criterion for a successful model is an F1 score of 0.59 or higher.

## The steps in this project are as follows:
## 1. Load the data and all necessary packages, and prepare the data for analysis
## 2. Check the balance of classes and train different models without accounting for class imbalance
## 3. Improve the quality of the best model using class balancing strategies
## 4. Run the final test and analyze the best model

In [1]:
# Import necessary packages

import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.utils import shuffle
from sklearn.metrics import classification_report

In [2]:
# Load data and start data preparation

data = pd.read_csv('/datasets/Churn.csv')
data.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


In [3]:
# Detailed look at the data

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


In [4]:
# Looking at all rows with missing data to identify any trends

data[data.isnull().any(axis=1)]

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
30,31,15589475,Azikiwe,591,Spain,Female,39,,0.00,3,1,0,140469.38,1
48,49,15766205,Yin,550,Germany,Male,38,,103391.38,1,0,1,90878.13,0
51,52,15768193,Trevisani,585,Germany,Male,36,,146050.97,2,0,0,86424.57,0
53,54,15702298,Parkhill,655,Germany,Male,41,,125561.97,1,0,0,164040.94,1
60,61,15651280,Hunter,742,Germany,Male,35,,136857.00,1,0,0,84509.57,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9944,9945,15703923,Cameron,744,Germany,Male,41,,190409.34,2,1,1,138361.48,0
9956,9957,15707861,Nucci,520,France,Female,46,,85216.61,1,1,0,117369.52,1
9964,9965,15642785,Douglas,479,France,Male,34,,117593.48,2,0,0,113308.29,0
9985,9986,15586914,Nepean,659,France,Male,36,,123841.49,2,1,0,96833.00,0


In [5]:
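# Filling missing values with the column median. Only Tenure has gaps (909 of 10,000 rows, about 9%)
# The median of the Tenure column is 5.0 and the mean is around 4.99, so they are similar
# Median was chosen to keep outliers from strongly affecting the filled values

# numeric_only=True restricts the median to numeric columns; recent pandas versions raise an error on text columns otherwise
data = data.fillna(data.median(numeric_only=True))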
#print(data['Tenure'].mean())
data = data.drop(columns=['RowNumber', 'CustomerId', 'Surname'])
data.info()
data.tail()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CreditScore      10000 non-null  int64  
 1   Geography        10000 non-null  object 
 2   Gender           10000 non-null  object 
 3   Age              10000 non-null  int64  
 4   Tenure           10000 non-null  float64
 5   Balance          10000 non-null  float64
 6   NumOfProducts    10000 non-null  int64  
 7   HasCrCard        10000 non-null  int64  
 8   IsActiveMember   10000 non-null  int64  
 9   EstimatedSalary  10000 non-null  float64
 10  Exited           10000 non-null  int64  
dtypes: float64(3), int64(6), object(2)
memory usage: 859.5+ KB


Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
9995,771,France,Male,39,5.0,0.0,2,1,0,96270.64,0
9996,516,France,Male,35,10.0,57369.61,1,1,1,101699.77,0
9997,709,France,Female,36,7.0,0.0,1,0,1,42085.58,1
9998,772,Germany,Male,42,3.0,75075.31,2,1,0,92888.52,1
9999,792,France,Female,28,5.0,130142.79,1,1,0,38190.78,0


In [6]:
# Using one-hot encoding to prepare the data for a Logistic Regression. Also standardizing the numeric features with StandardScaler

data_ohe = pd.get_dummies(data, columns=['Geography', 'Gender'], drop_first=True)

features_to_scale = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']

# Initialize the scaler
scaler = StandardScaler()

# Scale the numerical features
data_ohe[features_to_scale] = scaler.fit_transform(data_ohe[features_to_scale])

# Display the first few rows of the preprocessed dataset
print(data_ohe.info())
data_ohe.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   CreditScore        10000 non-null  float64
 1   Age                10000 non-null  float64
 2   Tenure             10000 non-null  float64
 3   Balance            10000 non-null  float64
 4   NumOfProducts      10000 non-null  float64
 5   HasCrCard          10000 non-null  int64  
 6   IsActiveMember     10000 non-null  int64  
 7   EstimatedSalary    10000 non-null  float64
 8   Exited             10000 non-null  int64  
 9   Geography_Germany  10000 non-null  uint8  
 10  Geography_Spain    10000 non-null  uint8  
 11  Gender_Male        10000 non-null  uint8  
dtypes: float64(6), int64(3), uint8(3)
memory usage: 732.5 KB
None


Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_Germany,Geography_Spain,Gender_Male
0,-0.326221,0.293517,-1.086246,-1.225848,-0.911583,1,1,0.021886,1,0,0,0
1,-0.440036,0.198164,-1.448581,0.11735,-0.911583,0,1,0.216534,0,0,1,0
2,-1.536794,0.293517,1.087768,1.333053,2.527057,1,0,0.240687,1,0,0,0
3,0.501521,0.007457,-1.448581,-1.225848,0.807737,0,0,-0.108918,0,0,0,0
4,2.063884,0.388871,-1.086246,0.785728,-0.911583,1,1,-0.365276,0,0,1,0


In [7]:
# Check the balance of the target variable

class_balance = data_ohe['Exited'].value_counts(normalize=True)
print(class_balance)

0    0.7963
1    0.2037
Name: Exited, dtype: float64


The target column 'Exited' is imbalanced: about 80% of customers belong to class 0 (stayed) and about 20% to class 1 (exited). A model trained without accounting for this imbalance tends to favor the majority class, so it can reach a high accuracy while identifying few of the customers who leave. This is why the F1 score on the positive class, rather than accuracy, is used to judge the models.
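To make the effect of the imbalance concrete, the following sketch scores a hypothetical "model" that always predicts class 0. The label counts (7,963 zeros and 2,037 ones) follow from the proportions printed above for 10,000 rows; the predictor itself is an illustration and not one of the models trained in this project.

```python
# Labels with the same class proportions as the 'Exited' column
labels = [0] * 7963 + [1] * 2037

# Hypothetical predictor that always answers "customer stays"
always_zero = [0] * len(labels)

accuracy = sum(1 for y, p in zip(labels, always_zero) if y == p) / len(labels)

# F1 for the positive class from true positives, false positives, false negatives
tp = sum(1 for y, p in zip(labels, always_zero) if y == 1 and p == 1)
fp = sum(1 for y, p in zip(labels, always_zero) if y == 0 and p == 1)
fn = sum(1 for y, p in zip(labels, always_zero) if y == 1 and p == 0)
f1 = 2 * tp / (2 * tp + fp + fn) if tp > 0 else 0.0

print('Accuracy:', accuracy)
print('F1:', f1)
```

The predictor is right for every customer who stays and wrong for every customer who leaves, so its accuracy looks respectable while its F1 score is zero.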

### The next section will be adjusting and tuning a Logistic Regression to get the best f1 score.

In [8]:
# Splitting data into training, validation, and testing sets

# Creating features list and target list
target = data_ohe['Exited']
features = data_ohe.drop(columns=['Exited'])

# Splitting into training (60%) and temporary (40%). Used random_state=12345 as shown in the sprint.
features_train, features_temp, target_train, target_temp = train_test_split(features, target, test_size=0.4, random_state=12345)

# Splitting the temporary set into validation (20%) and test (20%)
features_valid, features_test, target_valid, target_test = train_test_split(features_temp, target_temp, test_size=0.5, random_state=12345)

# Verify the sizes of the splits following the traditional 3:1:1 split.
print(f'Training set: {features_train.shape}, {target_train.shape}')
print(f'Validation set: {features_valid.shape}, {target_valid.shape}')
print(f'Test set: {features_test.shape}, {target_test.shape}')


Training set: (6000, 11), (6000,)
Validation set: (2000, 11), (2000,)
Test set: (2000, 11), (2000,)
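The split above is purely random, so the share of positive cases can drift between the subsets (the reports further down show 418 positives in validation and 423 in test). One option, sketched here on synthetic labels rather than on the notebook's data, is the `stratify` argument of `train_test_split`, which keeps the class proportions nearly identical in every subset. The arrays `toy_features` and `toy_target` are hypothetical stand-ins.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic target with the same class proportions as 'Exited'
toy_target = np.array([0] * 7963 + [1] * 2037)
toy_features = np.arange(len(toy_target)).reshape(-1, 1)

# 60% training, 40% temporary, preserving the class ratio
f_train, f_temp, t_train, t_temp = train_test_split(
    toy_features, toy_target, test_size=0.4, random_state=12345, stratify=toy_target
)

# Temporary set split in half: 20% validation, 20% test
f_valid, f_test, t_valid, t_test = train_test_split(
    f_temp, t_temp, test_size=0.5, random_state=12345, stratify=t_temp
)

for name, part in [('train', t_train), ('valid', t_valid), ('test', t_test)]:
    print(name, len(part), 'positive share:', round(float(part.mean()), 4))
```

Whether stratification would change the scores reported in this notebook is not something this sketch can show; it only illustrates how the split could be made more uniform.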


In [9]:
# Running a Logistic Regression with default hyperparameters and no class balancing to get a baseline f1 score

log_model = LogisticRegression(random_state=12345)
log_model.fit(features_train, target_train)
predicted_log_valid = log_model.predict(features_valid)

f1_log_valid = f1_score(target_valid, predicted_log_valid)
print('F1 Score:', f1_log_valid)
#print('Accuracy Score:', accuracy_score(target_valid, predicted_log_valid))
print('Confusion Matrix:', confusion_matrix(target_valid, predicted_log_valid))
print(classification_report(target_valid, predicted_log_valid))

F1 Score: 0.33108108108108103
Confusion Matrix: [[1506   76]
 [ 320   98]]
              precision    recall  f1-score   support

           0       0.82      0.95      0.88      1582
           1       0.56      0.23      0.33       418

    accuracy                           0.80      2000
   macro avg       0.69      0.59      0.61      2000
weighted avg       0.77      0.80      0.77      2000



A Logistic Regression with default hyperparameters (only random_state is set) and no class balancing gives an f1_score of 0.33 on the validation set, so the model performs poorly on the positive class. F1 is the harmonic mean of precision and recall, and here precision is 0.56 while recall is only 0.23: the model finds fewer than a quarter of the customers who exited. This is most likely because the 'Exited' column is imbalanced toward 0, which pushes the model to predict the majority class.
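As a check on how the score arises, the F1 value can be recomputed by hand from the confusion matrix printed above. scikit-learn lays the matrix out as `[[TN, FP], [FN, TP]]`, so the four counts below are read directly from that output.

```python
# Counts taken from the confusion matrix of the baseline Logistic Regression
tn, fp, fn, tp = 1506, 76, 320, 98

precision = tp / (tp + fp)   # share of predicted exits that were real exits
recall = tp / (tp + fn)      # share of real exits that the model found

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print('Precision:', round(precision, 2))
print('Recall:', round(recall, 2))
print('F1:', f1)
```

Because the harmonic mean is pulled toward the smaller of the two values, the low recall dominates the result even though precision is moderate.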

In [10]:
# Calculate roc_auc_score

probabilities_valid = log_model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]

auc_roc = roc_auc_score(target_valid, probabilities_one_valid)
print(auc_roc)

0.7588057029137607


A roc_auc_score of 0.7588 means that the model is fairly good at ranking positive cases above negative ones across all possible classification thresholds, and is clearly better than random guessing (0.5).
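One way to read this number: the ROC AUC equals the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case. The sketch below computes that quantity by counting pairs on a small made-up example; `pairwise_auc`, `toy_labels` and `toy_scores` are hypothetical names used for illustration only.

```python
def pairwise_auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted probabilities
toy_labels = [0, 0, 1, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.2, 0.8, 0.7]

auc = pairwise_auc(toy_labels, toy_scores)
print('Pairwise AUC:', auc)
```

In this toy case 8 of the 9 positive/negative pairs are ordered correctly. The metric depends only on the ordering of the scores, which is why it can stay stable while F1 changes with the threshold or class weights.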

In [11]:
# Adding hyperparameter class_weight to balance the classes

log_model = LogisticRegression(class_weight = 'balanced',random_state=12345)
log_model.fit(features_train, target_train)
predicted_log_valid = log_model.predict(features_valid)

f1_log_valid = f1_score(target_valid, predicted_log_valid)
print('F1 Score:', f1_log_valid)

#print('Accuracy Score:', accuracy_score(target_valid, predicted_log_valid))

print('Confusion Matrix:', confusion_matrix(target_valid, predicted_log_valid))
print(classification_report(target_valid, predicted_log_valid))

F1 Score: 0.4888507718696398
Confusion Matrix: [[1119  463]
 [ 133  285]]
              precision    recall  f1-score   support

           0       0.89      0.71      0.79      1582
           1       0.38      0.68      0.49       418

    accuracy                           0.70      2000
   macro avg       0.64      0.69      0.64      2000
weighted avg       0.79      0.70      0.73      2000



Adding the class_weight='balanced' parameter increased the f1_score from 0.33 to 0.49, which is better but still below the 0.59 threshold. Recall rose from 0.23 to 0.68, while precision fell from 0.56 to 0.38.
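For reference, scikit-learn documents the 'balanced' mode as weighting each class by `n_samples / (n_classes * count_of_class)`. The sketch below applies that formula to the training split. The class counts are derived, not printed in the notebook: 2,037 positives overall minus 418 in validation and 423 in test leaves 1,196 positives (and 4,804 negatives) among the 6,000 training rows. The helper name `balanced_class_weights` is hypothetical.

```python
def balanced_class_weights(class_counts):
    """Weights following the documented 'balanced' formula."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Training-set class counts derived from the supports shown in the reports
train_counts = {0: 4804, 1: 1196}
weights = balanced_class_weights(train_counts)

for label, weight in weights.items():
    print('class', label, 'weight', round(weight, 4))
```

Each positive example ends up counting roughly four times as much as a negative one, which matches the shift toward higher recall and lower precision seen above.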

In [12]:
# Calculate roc_auc_score

probabilities_valid = log_model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]

roc_auc = roc_auc_score(target_valid, probabilities_one_valid)
print(roc_auc)

0.7636781011257023


The roc_auc_score remained basically the same (0.759 vs. 0.764), suggesting the model still distinguishes the positive and negative classes fairly accurately.

The next part upsamples and downsamples the training data to see whether rebalancing the classes also improves the f1_score.

In [13]:
# Upsample function to increase positive cases

def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)

    features_upsampled, target_upsampled = shuffle(
        features_upsampled, target_upsampled, random_state=12345
    )

    return features_upsampled, target_upsampled


In [14]:
# Running Logistic Regression using upsampling. Showing f1_score, classification report, and roc_auc_score.

best_factor = 0
best_f1 = 0
for i in range(1,6): # For loop to find the best integer for repeat parameter
    features_upsampled, target_upsampled = upsample(features_train, target_train, i)
    model_upsample = LogisticRegression( random_state=12345)
    model_upsample.fit(features_upsampled, target_upsampled)
    predicted_valid_upsample = model_upsample.predict(features_valid)
    
    f1 = f1_score(target_valid, predicted_valid_upsample)
    if(f1 > best_f1): # if statement to save the best f1 score that will be calculated
        best_f1 = f1
        best_factor = i
        
    
    print('F1 unbalanced model:', f1_score(target_valid, predicted_valid_upsample))
    print('Factor:', i)
    print(classification_report(target_valid, predicted_valid_upsample))

    probabilities_valid = model_upsample.predict_proba(features_valid)
    probabilities_one_valid = probabilities_valid[:, 1]

    roc_auc = roc_auc_score(target_valid, probabilities_one_valid)
    print('roc_auc_score unbalanced model:', roc_auc)
    
    print()

    model_upsample = LogisticRegression(class_weight= 'balanced', random_state=12345)
    model_upsample.fit(features_upsampled, target_upsampled)
    predicted_valid_upsample = model_upsample.predict(features_valid)
    
    f1 = f1_score(target_valid, predicted_valid_upsample)
    if(f1 > best_f1):
        best_f1 = f1
        best_factor = i
    
    print('F1 balanced model:', f1_score(target_valid, predicted_valid_upsample))
    print('Factor:', i )
    print(classification_report(target_valid, predicted_valid_upsample))

    probabilities_valid = model_upsample.predict_proba(features_valid)
    probabilities_one_valid = probabilities_valid[:, 1]

    roc_auc = roc_auc_score(target_valid, probabilities_one_valid)
    print('roc_auc_score balanced model:', roc_auc)
print('Best f1_score:', best_f1)
print('Best factor:', best_factor)

F1 unbalanced model: 0.33108108108108103
Factor: 1
              precision    recall  f1-score   support

           0       0.82      0.95      0.88      1582
           1       0.56      0.23      0.33       418

    accuracy                           0.80      2000
   macro avg       0.69      0.59      0.61      2000
weighted avg       0.77      0.80      0.77      2000

roc_auc_score unbalanced model: 0.7588057029137607

F1 balanced model: 0.4888507718696398
Factor: 1
              precision    recall  f1-score   support

           0       0.89      0.71      0.79      1582
           1       0.38      0.68      0.49       418

    accuracy                           0.70      2000
   macro avg       0.64      0.69      0.64      2000
weighted avg       0.79      0.70      0.73      2000

roc_auc_score balanced model: 0.7636781011257023
F1 unbalanced model: 0.46437346437346433
Factor: 2
              precision    recall  f1-score   support

           0       0.86      0.87      0

From the previous cell, a Logistic Regression with no class_weight parameter and a repeat value of 3 in the upsampling function gives the best f1_score of 0.5.

In [15]:
# Logistic Regression with best parameters for upsampling

features_upsampled, target_upsampled = upsample(features_train, target_train, 3)
model_upsample = LogisticRegression( random_state=12345)
model_upsample.fit(features_upsampled, target_upsampled)

predicted_valid_upsample = model_upsample.predict(features_valid)
print('F1:', f1_score(target_valid, predicted_valid_upsample))
print(classification_report(target_valid, predicted_valid_upsample))

probabilities_valid = model_upsample.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]

roc_auc = roc_auc_score(target_valid, probabilities_one_valid)
print('roc_auc_score:', roc_auc)

F1: 0.5
              precision    recall  f1-score   support

           0       0.88      0.79      0.83      1582
           1       0.43      0.60      0.50       418

    accuracy                           0.75      2000
   macro avg       0.66      0.69      0.67      2000
weighted avg       0.79      0.75      0.76      2000

roc_auc_score: 0.7626709573612228


Upsampling the data had a small effect on the f1_score (0.49 to 0.50) and roc_auc_score, and some effect on precision and recall (precision rose from 0.38 to 0.43, recall fell from 0.68 to 0.60), but the overall picture is basically the same as for the previous model.

In [16]:
# Downsample function to decrease negative cases

def downsample(features, target, fraction):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_downsampled = pd.concat(
        [features_zeros.sample(frac=fraction, random_state=12345)]
        + [features_ones]
    )
    target_downsampled = pd.concat(
        [target_zeros.sample(frac=fraction, random_state=12345)]
        + [target_ones]
    )

    features_downsampled, target_downsampled = shuffle(
        features_downsampled, target_downsampled, random_state=12345
    )

    return features_downsampled, target_downsampled

In [17]:
# For loop to find the optimal fraction parameter for downsampling

best_i = 0
best_f1 = 0
for i in np.arange(0.1, 1, 0.1):
    features_downsampled, target_downsampled = downsample(features_train, target_train, i)

    model_downsample = LogisticRegression(random_state=12345)
    model_downsample.fit(features_downsampled, target_downsampled)
    predicted_valid_downsample = model_downsample.predict(features_valid)
    
    print('Fraction factor:', i)
    print('F1 unbalanced model:', f1_score(target_valid, predicted_valid_downsample))
    if(f1_score(target_valid, predicted_valid_downsample) > best_f1):
        best_f1 = f1_score(target_valid, predicted_valid_downsample)
        best_i = i
    
    features_downsampled, target_downsampled = downsample(features_train, target_train, i)

    model_downsample = LogisticRegression(class_weight= 'balanced', random_state=12345)
    model_downsample.fit(features_downsampled, target_downsampled)
    predicted_valid_downsample = model_downsample.predict(features_valid)

    print('F1 balanced model:', f1_score(target_valid, predicted_valid_downsample))
    if(f1_score(target_valid, predicted_valid_downsample) > best_f1):
        best_f1 = f1_score(target_valid, predicted_valid_downsample)
        best_i = i
    print()
print('Best f1:', best_f1, 'is at fraction =', best_i)

Fraction factor: 0.1
F1 unbalanced model: 0.42986425339366513
F1 balanced model: 0.48047538200339557

Fraction factor: 0.2
F1 unbalanced model: 0.4791344667697063
F1 balanced model: 0.48704663212435234

Fraction factor: 0.30000000000000004
F1 unbalanced model: 0.497164461247637
F1 balanced model: 0.4939759036144578

Fraction factor: 0.4
F1 unbalanced model: 0.504875406283857
F1 balanced model: 0.4918314703353396

Fraction factor: 0.5
F1 unbalanced model: 0.46886446886446886
F1 balanced model: 0.4923076923076924

Fraction factor: 0.6
F1 unbalanced model: 0.4306864064602961
F1 balanced model: 0.4888888888888888

Fraction factor: 0.7000000000000001
F1 unbalanced model: 0.3883211678832117
F1 balanced model: 0.4901793339026473

Fraction factor: 0.8
F1 unbalanced model: 0.36785162287480677
F1 balanced model: 0.48717948717948717

Fraction factor: 0.9
F1 unbalanced model: 0.36245954692556637
F1 balanced model: 0.4905982905982907

Best f1: 0.504875406283857 is at fraction = 0.4


From the cell above, fraction = 0.4 with no class_weight parameter gives the best f1_score of 0.505, which is basically the same as the result from upsampling.
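It may help to see how far each resampling setting moves the class balance. The sketch below computes the share of positive rows after upsampling with a given `repeat` or downsampling with a given `fraction`. The training counts (4,804 negatives, 1,196 positives) are derived from the supports in the reports, and `positive_share` is a hypothetical helper, not part of the notebook.

```python
def positive_share(n_negative, n_positive, repeat=1, fraction=1.0):
    """Share of positive rows after repeating positives and/or sampling negatives."""
    kept_negative = n_negative * fraction
    repeated_positive = n_positive * repeat
    return repeated_positive / (kept_negative + repeated_positive)


# Training-set class counts derived from the validation and test supports
n_negative, n_positive = 4804, 1196

original = positive_share(n_negative, n_positive)
upsampled = positive_share(n_negative, n_positive, repeat=3)
downsampled = positive_share(n_negative, n_positive, fraction=0.4)

print('Original share of positives:', round(original, 3))
print('After upsampling (repeat=3):', round(upsampled, 3))
print('After downsampling (fraction=0.4):', round(downsampled, 3))
```

Both of the best settings leave the positive class at roughly 40% of the training rows, which may explain why the two approaches give nearly the same F1 score.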

In [18]:
# Logistic Regression with best parameters for downsampling

features_downsampled, target_downsampled = downsample(features_train, target_train, 0.4)
model_downsample = LogisticRegression(random_state=12345)
model_downsample.fit(features_downsampled, target_downsampled)
predicted_valid_downsample = model_downsample.predict(features_valid)

print('F1:', f1_score(target_valid, predicted_valid_downsample))
print(classification_report(target_valid, predicted_valid_downsample))

probabilities_valid = model_downsample.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]

roc_auc = roc_auc_score(target_valid, probabilities_one_valid)
print('roc_auc_score:', roc_auc)

F1: 0.504875406283857
              precision    recall  f1-score   support

           0       0.88      0.83      0.85      1582
           1       0.46      0.56      0.50       418

    accuracy                           0.77      2000
   macro avg       0.67      0.69      0.68      2000
weighted avg       0.79      0.77      0.78      2000

roc_auc_score: 0.7625877848281201


Again, downsampling changed the f1_score and roc_auc_score very little.
Next, the classification threshold (the predicted probability at or above which a customer is labeled positive) will be adjusted.

In [19]:
# Threshold adjust for upsampling case

best_thresh = 0
best_f1 = 0
for i in np.arange(0.01, 1, 0.01):
    pred_valid_upsample_new_threshold = (model_upsample.predict_proba(features_valid)[:, 1] >= i).astype(int) 
    print(f1_score(target_valid, pred_valid_upsample_new_threshold), '|', i)
    if(f1_score(target_valid, pred_valid_upsample_new_threshold) > best_f1):
        best_f1 = f1_score(target_valid, pred_valid_upsample_new_threshold)
        best_thresh = i
print('Best f1_score', best_f1, 'is at threshold', best_thresh)

0.3457402812241522 | 0.01
0.3457402812241522 | 0.02
0.3458833264377327 | 0.03
0.346744089589382 | 0.04
0.34746467165419787 | 0.05
0.3493522774759716 | 0.060000000000000005
0.3517038283550694 | 0.06999999999999999
0.35604770017035775 | 0.08
0.35815755488592343 | 0.09
0.3633187772925764 | 0.09999999999999999
0.3682342502218279 | 0.11
0.37330928764652843 | 0.12
0.3758573388203018 | 0.13
0.3794063079777365 | 0.14
0.38068448195030474 | 0.15000000000000002
0.3888622179548728 | 0.16
0.3944909001475652 | 0.17
0.3975963945918878 | 0.18000000000000002
0.4008117706747844 | 0.19
0.40329218106995884 | 0.2
0.40906694781233527 | 0.21000000000000002
0.4122383252818036 | 0.22
0.4161660294920808 | 0.23
0.4202334630350194 | 0.24000000000000002
0.4271954674220963 | 0.25
0.4303065355696934 | 0.26
0.43431952662721895 | 0.27
0.4412296564195299 | 0.28
0.44731977818853974 | 0.29000000000000004
0.4524714828897338 | 0.3
0.4596354166666667 | 0.31
0.46728971962616817 | 0.32
0.4726775956284153 | 0.33
0.473867595818

After threshold adjustment, the best f1_score was 0.502 at threshold = 0.47, which is close to the default threshold of 0.5. The results of this test are very similar to the results from previous models.

In [20]:
# Threshold adjust for downsampling case

best_thresh = 0
best_f1 = 0
for i in np.arange(0.01, 1, 0.01):
    pred_valid_downsample_new_threshold = (model_downsample.predict_proba(features_valid)[:, 1] >= i).astype(int) 
    print(f1_score(target_valid, pred_valid_downsample_new_threshold))
    if(f1_score(target_valid, pred_valid_downsample_new_threshold) > best_f1):
        best_f1 = f1_score(target_valid, pred_valid_downsample_new_threshold)
        best_thresh = i
print('Best f1_score', best_f1, 'is at threshold', best_thresh)

0.3457402812241522
0.3457402812241522
0.34616977225672874
0.3473203157457416
0.3490605427974948
0.3527426160337553
0.3568075117370892
0.3611111111111111
0.3657243816254417
0.3723021582733813
0.37477148080438755
0.3790060380863911
0.3838862559241707
0.39088263821532493
0.39641434262948205
0.397364419665484
0.4045525090532851
0.4092582851130984
0.4152770306616461
0.42041712403951703
0.4244120940649496
0.4278320874065555
0.43309859154929575
0.4370149253731343
0.4441702652683529
0.45022194039315155
0.4601425793907971
0.4687083888149134
0.47282608695652173
0.47671994440583737
0.47285714285714286
0.4815905743740795
0.48599545798637384
0.4797507788161993
0.4796812749003984
0.4816326530612245
0.4887780548628429
0.4868532654792196
0.4891209747606614
0.4920071047957371
0.4872727272727272
0.4902143522833177
0.49382716049382713
0.49466537342386024
0.49950445986124875
0.5010060362173039
0.4938271604938271
0.4952780692549842
0.4989384288747345
0.504875406283857
0.5027932960893856
0.49371428571428577

After threshold adjustment, the best f1_score was 0.505 at threshold = 0.5, which is the default threshold. The results of this test are very similar to the results from previous models.
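The threshold sweep can be sketched in isolation on a small made-up example. The version below builds the grid from integers (`i / 100`), which avoids printed values such as 0.060000000000000005 that appear when the grid comes from `np.arange`. The data and the helper `f1_at_threshold` are hypothetical and only illustrate the mechanics.

```python
def f1_at_threshold(labels, probabilities, threshold):
    """F1 for the positive class when predicting 1 for probability >= threshold."""
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Made-up labels and predicted probabilities of the positive class
toy_labels = [0, 0, 1, 0, 1, 1, 0, 1]
toy_probabilities = [0.10, 0.30, 0.35, 0.40, 0.55, 0.60, 0.65, 0.90]

# Thresholds 0.01, 0.02, ..., 0.99 built from integers
thresholds = [i / 100 for i in range(1, 100)]

best_threshold, best_f1 = 0.0, 0.0
for threshold in thresholds:
    score = f1_at_threshold(toy_labels, toy_probabilities, threshold)
    if score > best_f1:  # keeps the first threshold that reaches the maximum
        best_threshold, best_f1 = threshold, score

print('Best f1_score', best_f1, 'is at threshold', best_threshold)
```

As in the notebook, the threshold here is chosen on the same data it is scored on, so the selected value should be confirmed on a separate test set before it is relied on.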

In [21]:
# Optimal Logistic Regression model found will be used on test set to calculate f1_score

features_downsampled, target_downsampled = downsample(features_train, target_train, 0.4)
model_downsample = LogisticRegression(random_state=12345)
model_downsample.fit(features_downsampled, target_downsampled)
predicted_test_downsample = model_downsample.predict(features_test)

print('F1:', f1_score(target_test, predicted_test_downsample))
print(classification_report(target_test, predicted_test_downsample))

probabilities_test = model_downsample.predict_proba(features_test)
probabilities_one_test = probabilities_test[:, 1]

roc_auc = roc_auc_score(target_test, probabilities_one_test)
print('roc_auc_score:', roc_auc)

F1: 0.4473975636766334
              precision    recall  f1-score   support

           0       0.85      0.82      0.84      1577
           1       0.42      0.48      0.45       423

    accuracy                           0.75      2000
   macro avg       0.64      0.65      0.64      2000
weighted avg       0.76      0.75      0.76      2000

roc_auc_score: 0.7422448285115077


After testing and tuning, the Logistic Regression could not produce a validation f1_score higher than 0.505, which was achieved by downsampling with fraction = 0.4 and no class_weight parameter. The corresponding roc_auc_score is 0.76, very similar to the previous models. On the test set the model performs somewhat worse, with an f1_score of 0.45 and a roc_auc_score of 0.74. I conclude that a Logistic Regression is not adequate to meet the f1_score threshold of 0.59, so a different model is needed. A Decision Tree Classifier will be tried next.

### The next section will be adjusting and tuning a Decision Tree Classifier to get the best f1 score

In [22]:
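# Preparing data for a Decision Tree Classifier

# Note: the encoder is fit on every column, so numeric columns (CreditScore, Balance, ...) are also
# replaced by the ordinal rank of each distinct value, not only Geography and Gender
encoder = OrdinalEncoder()
data_ordinal = pd.DataFrame(encoder.fit_transform(data), columns=data.columns)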

features_to_scale = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']

# Initialize the scaler
scaler = StandardScaler()

# Scale the numerical features
data_ordinal[features_to_scale] = scaler.fit_transform(data_ordinal[features_to_scale])

# Display a summary of the preprocessed dataset
print(data_ordinal.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CreditScore      10000 non-null  float64
 1   Geography        10000 non-null  float64
 2   Gender           10000 non-null  float64
 3   Age              10000 non-null  float64
 4   Tenure           10000 non-null  float64
 5   Balance          10000 non-null  float64
 6   NumOfProducts    10000 non-null  float64
 7   HasCrCard        10000 non-null  float64
 8   IsActiveMember   10000 non-null  float64
 9   EstimatedSalary  10000 non-null  float64
 10  Exited           10000 non-null  float64
dtypes: float64(11)
memory usage: 859.5 KB
None
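The preparation cell above passes the whole DataFrame to `OrdinalEncoder`, which also converts the numeric columns into category indices. If only the text columns are meant to be encoded, one possible variant is sketched below on a three-row toy frame. The frame `toy` and the list `categorical_columns` are hypothetical, and this is an alternative to the notebook's approach, not a description of it.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy rows modeled on the first records of the dataset
toy = pd.DataFrame({
    'CreditScore': [619, 608, 502],
    'Geography': ['France', 'Spain', 'France'],
    'Gender': ['Female', 'Female', 'Male'],
    'Balance': [0.0, 83807.86, 159660.8],
})

# Only these columns are handed to the encoder
categorical_columns = ['Geography', 'Gender']

toy_encoder = OrdinalEncoder()
toy_encoded = toy.copy()
toy_encoded[categorical_columns] = toy_encoder.fit_transform(toy[categorical_columns])

print(toy_encoded)
print(toy_encoder.categories_)
```

The numeric columns keep their original values, and `categories_` records which text label maps to which index.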


In [23]:
# Splitting data into training, validation, and testing sets

# Creating features list and target list
target_x = data_ordinal['Exited']
features_x = data_ordinal.drop(columns=['Exited'])

# Splitting into training (60%) and temporary (40%). Used random_state=12345 as shown in the sprint.
features_train_x, features_temp_x, target_train_x, target_temp_x = train_test_split(features_x, target_x, test_size=0.4, random_state=12345)

# Splitting the temporary set into validation (20%) and test (20%)
features_valid_x, features_test_x, target_valid_x, target_test_x = train_test_split(features_temp_x, target_temp_x, test_size=0.5, random_state=12345)

# Verify the sizes of the splits following the traditional 3:1:1 split.
print(f'Training set: {features_train_x.shape}, {target_train_x.shape}')
print(f'Validation set: {features_valid_x.shape}, {target_valid_x.shape}')
print(f'Test set: {features_test_x.shape}, {target_test_x.shape}')
print()

# Baseline Decision Tree Classifier with default hyperparameters
tree_model = DecisionTreeClassifier(random_state=12345)
tree_model.fit(features_train_x, target_train_x)
predicted_tree_valid = tree_model.predict(features_valid_x)

f1_tree_valid = f1_score(target_valid_x, predicted_tree_valid)
print('F1:', f1_tree_valid)

probabilities_valid_x = tree_model.predict_proba(features_valid_x)
probabilities_one_valid_x = probabilities_valid_x[:, 1]

roc_auc_x = roc_auc_score(target_valid_x, probabilities_one_valid_x)
print('roc_auc_score:', roc_auc_x)
# classification_report expects class predictions, not probabilities
print(classification_report(target_valid_x, predicted_tree_valid))

Training set: (6000, 10), (6000,)
Validation set: (2000, 10), (2000,)
Test set: (2000, 10), (2000,)

F1: 0.45985401459854014
roc_auc_score: 0.6581245954790436
              precision    recall  f1-score   support

         0.0       0.86      0.86      0.86      1582
         1.0       0.47      0.45      0.46       418

    accuracy                           0.78      2000
   macro avg       0.66      0.66      0.66      2000
weighted avg       0.78      0.78      0.78      2000



After running a Decision Tree Classifier with default settings, the baseline f1_score is about 0.46 and the roc_auc_score is about 0.66. The f1_score is better than the Logistic Regression baseline (0.33), but the roc_auc_score is worse (0.66 vs. 0.76).

In [24]:
# Running a Decision Tree Classifier with different max_depth integers and class_weight = 'balanced'

# Initializing the best_f1 and best_depth values to store the best f1 score and the corresponding max depth
best_f1 = 0
best_depth = 0
# Loop to change the max_depth parameter to find the optimal max_depth for highest f1 score
for depth in range(1,30):
    
    tree_model = DecisionTreeClassifier(class_weight = "balanced", random_state=12345, max_depth=depth)
    tree_model.fit(features_train_x, target_train_x) # train model on training set
    tree_pred = tree_model.predict(features_valid_x) # find the predictions using validation set
    tree_f1 = f1_score(target_valid_x, tree_pred) # get F1 score of model on validation set
    
    # Store the best F1 score and corresponding max_depth
    if(tree_f1 > best_f1):
        best_f1 = tree_f1
        best_depth = depth
    
    print("depth:", depth, end=' : ')
    print(tree_f1)
print("Best f1 is at max depth", best_depth, "with an f1_score of", best_f1)

depth: 1 : 0.4994903160040775
depth: 2 : 0.541015625
depth: 3 : 0.541015625
depth: 4 : 0.5277777777777778
depth: 5 : 0.5894962486602359
depth: 6 : 0.5497287522603979
depth: 7 : 0.5396536007292617
depth: 8 : 0.5357142857142858
depth: 9 : 0.5290068829891839
depth: 10 : 0.5054509415262636
depth: 11 : 0.49845520082389294
depth: 12 : 0.4849115504682622
depth: 13 : 0.49073064340239914
depth: 14 : 0.4955555555555556
depth: 15 : 0.48735632183908045
depth: 16 : 0.4820143884892086
depth: 17 : 0.47086801426872776
depth: 18 : 0.48375451263537905
depth: 19 : 0.4789410348977136
depth: 20 : 0.47746650426309384
depth: 21 : 0.4833538840937115
depth: 22 : 0.47940074906367036
depth: 23 : 0.47940074906367036
depth: 24 : 0.47940074906367036
depth: 25 : 0.47940074906367036
depth: 26 : 0.47940074906367036
depth: 27 : 0.47940074906367036
depth: 28 : 0.47940074906367036
depth: 29 : 0.47940074906367036
Best f1 is at max depth 5 with an f1_score of 0.5894962486602359


In [25]:
# Running a Decision Tree Classifier with different max_depth integers and no class_weight parameter

# Initializing the best_f1 and best_depth values to store the best f1 score and the corresponding max depth
best_f1 = 0
best_depth = 0
# Loop to change the max_depth parameter to find the optimal max_depth for highest f1 score
for depth in range(1,30):
    
    tree_model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    tree_model.fit(features_train_x, target_train_x) # train model on training set
    tree_pred = tree_model.predict(features_valid_x) # find the predictions using validation set
    tree_f1 = f1_score(target_valid_x, tree_pred) # get f1 score of model on validation set
    
    # Store the best f1 score and corresponding max_depth
    if(tree_f1 > best_f1):
        best_f1 = tree_f1
        best_depth = depth
    
    print("depth:", depth, end=' : ')
    print(tree_f1)
print("Best f1 is at max depth", best_depth, "with an f1_score of", best_f1)

depth: 1 : 0.0
depth: 2 : 0.5217391304347825
depth: 3 : 0.4234875444839857
depth: 4 : 0.5528700906344411
depth: 5 : 0.5169628432956381
depth: 6 : 0.5360501567398119
depth: 7 : 0.5106382978723405
depth: 8 : 0.5379939209726444
depth: 9 : 0.5325443786982248
depth: 10 : 0.5007194244604317
depth: 11 : 0.5210312075983717
depth: 12 : 0.5169712793733682
depth: 13 : 0.5052910052910053
depth: 14 : 0.48578811369509045
depth: 15 : 0.465
depth: 16 : 0.4825
depth: 17 : 0.48507462686567165
depth: 18 : 0.4668304668304668
depth: 19 : 0.46398046398046394
depth: 20 : 0.4560099132589839
depth: 21 : 0.4678787878787879
depth: 22 : 0.4678787878787879
depth: 23 : 0.4768856447688565
depth: 24 : 0.45985401459854014
depth: 25 : 0.45985401459854014
depth: 26 : 0.45985401459854014
depth: 27 : 0.45985401459854014
depth: 28 : 0.45985401459854014
depth: 29 : 0.45985401459854014
Best f1 is at max depth 4 with an f1_score of 0.5528700906344411


In [26]:
# Decision Tree Classifier with optimal max_depth and optimal class_weight parameter

tree_model = DecisionTreeClassifier(class_weight = 'balanced', random_state=12345, max_depth=5)  # create a model with the best max_depth found above
tree_model.fit(features_train_x, target_train_x) # train model on training set
tree_pred = tree_model.predict(features_valid_x) # find the predictions using validation set
tree_f1 = f1_score(target_valid_x, tree_pred) # get f1 score of model on validation set
print('F1:', tree_f1)

probabilities_valid_x = tree_model.predict_proba(features_valid_x)
probabilities_one_valid_x = probabilities_valid_x[:, 1]

roc_auc_x = roc_auc_score(target_valid_x, probabilities_one_valid_x)
print('roc_auc_score:', roc_auc_x)

F1: 0.5894962486602359
roc_auc_score: 0.8197447056902112


The Decision Tree Classifier with class_weight = 'balanced' and max_depth = 5 gives an f1_score of 0.589, which is the best f1_score achieved so far. The roc_auc_score of 0.82 is also the best so far, indicating that this model performs better than the Logistic Regression.

Next, upsampling, downsampling and threshold adjustments will be used to try to achieve a higher f1_score.

In [27]:
# Decision Tree Classifier with upsampling and class_weight = 'balanced'

# Finding optimal repeat factor
best_i = 0
best_f1 = 0
for i in range(1,10):
    features_upsampled_x, target_upsampled_x = upsample(features_train_x, target_train_x, i)
    
    model_upsample_x = DecisionTreeClassifier(class_weight = 'balanced', random_state=12345, max_depth = 5)
    model_upsample_x.fit(features_upsampled_x, target_upsampled_x)
    predicted_valid_upsample_x = model_upsample_x.predict(features_valid_x)

    print('F1:', f1_score(target_valid_x, predicted_valid_upsample_x))
    if(f1_score(target_valid_x, predicted_valid_upsample_x) > best_f1):
        best_f1 = f1_score(target_valid_x, predicted_valid_upsample_x)
        best_i = i
print('Best f1:', best_f1, '| best repeat factor:', best_i)

F1: 0.5894962486602359
F1: 0.5894962486602359
F1: 0.5894962486602359
F1: 0.5894962486602359
F1: 0.5894962486602359
F1: 0.5894962486602359
F1: 0.5894962486602359
F1: 0.5894962486602359
F1: 0.5894962486602359
Best f1: 0.5894962486602359 | best repeat factor: 1


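With upsampling, the f1_score remained the same for every repeat factor. This is expected: class_weight = 'balanced' weights each class inversely to its frequency, so repeating the positive rows is offset by a proportionally smaller positive class weight. Next will be finding the optimal repeat factor without the class_weight parameter.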
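
One way to see why the repeat factor has no effect when class_weight = 'balanced' is set is to look at the weight formula that scikit-learn documents for this mode, n_samples / (n_classes * class_count). The sketch below reimplements that formula in plain Python on a made-up toy target (8 negatives and 2 positives, not the project data) and mimics upsampling by repeating the positives. The helper names are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Per-class weight following n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def positive_weight_share(labels):
    """Fraction of the total sample weight that belongs to class 1."""
    counts = Counter(labels)
    weights = balanced_class_weights(labels)
    totals = {cls: counts[cls] * weights[cls] for cls in counts}
    return totals[1] / (totals[0] + totals[1])


# Toy target (assumption for illustration): 8 negatives, 2 positives.
shares = {}
for repeat in (1, 3, 9):
    # Mimic upsampling by repeating the positive rows `repeat` times.
    upsampled = [0] * 8 + [1] * (2 * repeat)
    shares[repeat] = positive_weight_share(upsampled)
    print("repeat:", repeat, "| positive share of total weight:", shares[repeat])
```

In this toy setting the positive class carries half of the total weight for every repeat factor, which is consistent with the identical f1_score values printed above. Without class_weight, every row has the same weight, so the repeat factor directly changes the balance the tree sees.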

In [28]:
# Decision Tree Classifier with upsampling and no class_weight parameter

# Finding optimal repeat factor
best_i = 0
best_f1 = 0
for i in range(1,10):
    features_upsampled_x, target_upsampled_x = upsample(features_train_x, target_train_x, i)
    
    model_upsample_x = DecisionTreeClassifier(random_state=12345, max_depth = 5)
    model_upsample_x.fit(features_upsampled_x, target_upsampled_x)
    predicted_valid_upsample_x = model_upsample_x.predict(features_valid_x)

    print('F1:', f1_score(target_valid_x, predicted_valid_upsample_x))
    if(f1_score(target_valid_x, predicted_valid_upsample_x) > best_f1):
        best_f1 = f1_score(target_valid_x, predicted_valid_upsample_x)
        best_i = i
print('Best f1:', best_f1, '| best repeat factor:', best_i)

F1: 0.5169628432956381
F1: 0.6048284625158831
F1: 0.6082004555808656
F1: 0.5894962486602359
F1: 0.5315555555555556
F1: 0.5482493595217763
F1: 0.5118050266565118
F1: 0.4743758212877793
F1: 0.4588688946015424
Best f1: 0.6082004555808656 | best repeat factor: 3


In [29]:
# Optimal Decision Tree Classifier with upsampling

features_upsampled_x, target_upsampled_x = upsample(features_train_x, target_train_x, 3)
    
model_upsample_x = DecisionTreeClassifier(random_state=12345, max_depth = 5)
model_upsample_x.fit(features_upsampled_x, target_upsampled_x)
predicted_valid_upsample_x = model_upsample_x.predict(features_valid_x)

print('F1:', f1_score(target_valid_x, predicted_valid_upsample_x))

probabilities_valid_x = model_upsample_x.predict_proba(features_valid_x)
probabilities_one_valid_x = probabilities_valid_x[:, 1]

roc_auc_x = roc_auc_score(target_valid_x, probabilities_one_valid_x)
print('roc_auc_score:', roc_auc_x)

F1: 0.6082004555808656
roc_auc_score: 0.8302123470381504


With upsampling, the model performs best without a class_weight parameter, with max_depth = 5 and an upsampling repeat factor of 3. This gives an f1_score of 0.61 and a corresponding roc_auc_score of 0.83, both of which are the best so far.

In [30]:
# Threshold adjustment on upsampling

best_f1 = 0
best_thresh = 0
for i in np.arange(0.01, 1, 0.01):
    pred_valid_upsample_new_threshold_x = (model_upsample_x.predict_proba(features_valid_x)[:, 1] >= i).astype(int) 
    print(f1_score(target_valid_x, pred_valid_upsample_new_threshold_x))
    
    if(f1_score(target_valid_x, pred_valid_upsample_new_threshold_x) > best_f1):
        best_f1 = f1_score(target_valid_x, pred_valid_upsample_new_threshold_x)
        best_thresh = i
print('Best f1_score is', best_f1, 'at threshold', best_thresh)

0.34451345755693585
0.34451345755693585
0.38295880149812733
0.38295880149812733
0.38295880149812733
0.38295880149812733
0.38295880149812733
0.38295880149812733
0.38295880149812733
0.4005979073243648
0.4005979073243648
0.4005979073243648
0.4133611691022964
0.4133611691022964
0.4133611691022964
0.4477784189267167
0.4477784189267167
0.4477784189267167
0.4477784189267167
0.4477784189267167
0.4477784189267167
0.4891304347826087
0.4891304347826087
0.4891304347826087
0.4891304347826087
0.4891304347826087
0.4891304347826087
0.4891304347826087
0.4891304347826087
0.5125925925925926
0.5125925925925926
0.5125925925925926
0.5125925925925926
0.5125925925925926
0.5125925925925926
0.5205267234701781
0.5205267234701781
0.5179968701095461
0.572541382667965
0.572541382667965
0.5963791267305644
0.5963791267305644
0.5963791267305644
0.5963791267305644
0.5963791267305644
0.5963791267305644
0.5963791267305644
0.5963791267305644
0.5963791267305644
0.6082004555808656
0.6082004555808656
0.6082004555808656
0.608

With threshold adjustments, the model got its best f1_score of 0.608 at threshold = 0.5, which is the default threshold. So, threshold adjustment did not improve the model.
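
The long runs of identical f1_score values in the output above come from the tree itself: a tree with max_depth = 5 has at most 32 leaves, so predict_proba can return at most 32 distinct values, and every threshold between two neighbouring values produces the same predictions. A minimal sketch of a sweep that only tries the distinct scores is shown below. It is written in plain Python on made-up toy data, and the helper names f1_binary and best_threshold are hypothetical, not part of scikit-learn.

```python
def f1_binary(y_true, y_pred):
    """F1 for the positive class (label 1); 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def best_threshold(y_true, scores):
    """Try each distinct score as a threshold and return (best_f1, threshold)."""
    best_f1, best_thresh = 0.0, None
    for thresh in sorted(set(scores)):
        y_pred = [1 if s >= thresh else 0 for s in scores]
        current = f1_binary(y_true, y_pred)
        if current > best_f1:
            best_f1, best_thresh = current, thresh
    return best_f1, best_thresh


# Toy data (assumption for illustration): ten samples sharing four leaf probabilities.
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.1, 0.1, 0.3, 0.3, 0.3, 0.7, 0.7, 0.9, 0.9]

f1, thresh = best_threshold(y_true, scores)
print("best f1:", f1, "| threshold:", thresh)  # best f1: 0.75 | threshold: 0.7
```

With real model output, the scores would be the second column of predict_proba on the validation set. The loop then needs at most one iteration per leaf instead of 99 fixed steps.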

In [31]:
# Decision Tree Classifier with downsampling

# Finding optimal fraction factor with no class_weight parameter
best_i = 0
best_f1 = 0
for i in np.arange(0.01,1,0.01):
    features_downsampled_x, target_downsampled_x = downsample(features_train_x, target_train_x, i) 
    model_downsample_x = DecisionTreeClassifier(random_state=12345, max_depth = 5)
    model_downsample_x.fit(features_downsampled_x, target_downsampled_x)
    predicted_valid_downsample_x = model_downsample_x.predict(features_valid_x)
    print('F1:', f1_score(target_valid_x, predicted_valid_downsample_x))
    if(f1_score(target_valid_x, predicted_valid_downsample_x) > best_f1):
        best_f1 = f1_score(target_valid_x, predicted_valid_downsample_x)
        best_i = i
print('Best f1:', best_f1, '| frac =', best_i)


F1: 0.35137457044673537
F1: 0.36028751123090746
F1: 0.36061269146608316
F1: 0.4053497942386831
F1: 0.40906694781233527
F1: 0.42286348501664817
F1: 0.4097258147956544
F1: 0.4117021276595745
F1: 0.43956043956043955
F1: 0.4374649467190129
F1: 0.48951994590939824
F1: 0.48275862068965525
F1: 0.5172413793103449
F1: 0.5141158989598811
F1: 0.526148969889065
F1: 0.5255704169944925
F1: 0.5263987391646966
F1: 0.510894064613073
F1: 0.5289912629070691
F1: 0.5283911671924291
F1: 0.5598621877691645
F1: 0.5598621877691645
F1: 0.5494313210848645
F1: 0.5441819772528435
F1: 0.5435540069686412
F1: 0.5432314410480349
F1: 0.562091503267974
F1: 0.5629077353215284
F1: 0.5568281938325991
F1: 0.5633802816901409
F1: 0.5633802816901409
F1: 0.5662778366914104
F1: 0.5642633228840125
F1: 0.5655172413793103
F1: 0.5800214822771214
F1: 0.5655471289274107
F1: 0.5911214953271028
F1: 0.5746864310148233
F1: 0.5760171306209849
F1: 0.5693581780538303
F1: 0.5858369098712447
F1: 0.5858369098712447
F1: 0.5849462365591399
F1: 0.

In [32]:
# Decision Tree Classifier with downsampling

# Finding optimal fraction factor with class_weight = 'balanced'
best_i = 0
best_f1 = 0
for i in np.arange(0.01,1,0.01):
    features_downsampled_x, target_downsampled_x = downsample(features_train_x, target_train_x, i) 
    model_downsample_x = DecisionTreeClassifier(class_weight = 'balanced', random_state=12345, max_depth = 5)
    model_downsample_x.fit(features_downsampled_x, target_downsampled_x)
    predicted_valid_downsample_x = model_downsample_x.predict(features_valid_x)
    print('F1:', f1_score(target_valid_x, predicted_valid_downsample_x))
    if(f1_score(target_valid_x, predicted_valid_downsample_x) > best_f1):
        best_f1 = f1_score(target_valid_x, predicted_valid_downsample_x)
        best_i = i
print('Best f1:', best_f1, '| frac =', best_i)


F1: 0.4667802385008518
F1: 0.48730158730158724
F1: 0.5012658227848101
F1: 0.5361930294906166
F1: 0.526
F1: 0.541958041958042
F1: 0.578383641674781
F1: 0.5615763546798029
F1: 0.5390835579514824
F1: 0.5334448160535118
F1: 0.5445544554455446
F1: 0.539159109645507
F1: 0.5190380761523047
F1: 0.5552511415525114
F1: 0.5546372819100092
F1: 0.5589600742804086
F1: 0.5297670405522001
F1: 0.5612343297974928
F1: 0.5635148042024832
F1: 0.5449688334817454
F1: 0.5329861111111112
F1: 0.5598621877691645
F1: 0.546562228024369
F1: 0.546562228024369
F1: 0.5445026178010471
F1: 0.5445026178010471
F1: 0.5296108291032149
F1: 0.5445026178010471
F1: 0.5445026178010471
F1: 0.5435540069686412
F1: 0.5445026178010471
F1: 0.5445026178010471
F1: 0.5482416591523896
F1: 0.5471014492753624
F1: 0.5579150579150579
F1: 0.5589941972920697
F1: 0.5631067961165048
F1: 0.5642023346303501
F1: 0.5653021442495126
F1: 0.56640625
F1: 0.5876068376068377
F1: 0.5884861407249468
F1: 0.5851063829787234
F1: 0.5653021442495126
F1: 0.5653021

In [33]:
# Decision Tree Classifier with optimal parameters 

features_downsampled_x, target_downsampled_x = downsample(features_train_x, target_train_x, 0.59) 
model_downsample_x = DecisionTreeClassifier(class_weight = 'balanced',random_state=12345, max_depth = 5)
model_downsample_x.fit(features_downsampled_x, target_downsampled_x)
predicted_valid_downsample_x = model_downsample_x.predict(features_valid_x)
print('F1:', f1_score(target_valid_x, predicted_valid_downsample_x))
# Note: this roc_auc_score is computed from hard class predictions, not predict_proba scores
print('roc_auc_score:', roc_auc_score(target_valid_x, predicted_valid_downsample_x))

F1: 0.6075949367088608
roc_auc_score: 0.7566870716614544


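With downsampling, the highest f1_score achieved is about 0.61 (0.6076), marginally below the 0.6082 achieved with upsampling. The corresponding roc_auc_score is 0.76, but it was computed from hard class predictions rather than predicted probabilities, so it is not directly comparable to the 0.83 reported for upsampling. Next, threshold adjustment with downsampling will be used to see if the f1_score can be improved.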
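
The downsampling cell passes the output of predict to roc_auc_score, while the upsampling cell passes the positive-class column of predict_proba. When the scores are only 0 or 1, the ROC curve has a single interior point and the area under it reduces to the average of the true positive rate and the true negative rate. The sketch below illustrates this with a small pairwise AUC function in plain Python on made-up toy data; the function name roc_auc is hypothetical and is not the scikit-learn implementation.

```python
def roc_auc(y_true, scores):
    """AUC as the chance that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data (assumption for illustration), not taken from the project.
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
probs = [0.1, 0.1, 0.1, 0.3, 0.3, 0.3, 0.7, 0.7, 0.9, 0.9]
labels = [1 if p >= 0.5 else 0 for p in probs]  # hard predictions at threshold 0.5

auc_from_probs = roc_auc(y_true, probs)
auc_from_labels = roc_auc(y_true, labels)

# True positive rate and true negative rate of the hard predictions.
tpr = sum(1 for t, l in zip(y_true, labels) if t == 1 and l == 1) / y_true.count(1)
tnr = sum(1 for t, l in zip(y_true, labels) if t == 0 and l == 0) / y_true.count(0)
balanced_accuracy = (tpr + tnr) / 2

print("AUC from probabilities:", auc_from_probs)
print("AUC from hard labels:", auc_from_labels)
print("(TPR + TNR) / 2:", balanced_accuracy)
```

On this toy data the label-based value matches (TPR + TNR) / 2 and differs from the probability-based value. For a like-for-like comparison between the upsampled and downsampled models, the same kind of score would need to be passed in both cells.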

In [34]:
# Downsampling with threshold adjustment

best_f1 = 0
best_thresh = 0
for i in np.arange(0.01,1,0.01):
    pred_valid_downsample_new_threshold_x = (model_downsample_x.predict_proba(features_valid_x)[:, 1] >= i).astype(int) 
    print(f1_score(target_valid_x, pred_valid_downsample_new_threshold_x))
    
    if(f1_score(target_valid_x, pred_valid_downsample_new_threshold_x) > best_f1):
        best_f1 = f1_score(target_valid_x, pred_valid_downsample_new_threshold_x)
        best_thresh = i
print('Best f1_score is', best_f1, 'at threshold', best_thresh)

0.3443708609271523
0.3443708609271523
0.3827795975666823
0.3827795975666823
0.3870967741935484
0.3870967741935484
0.3870967741935484
0.3870967741935484
0.3870967741935484
0.3870967741935484
0.3870967741935484
0.3870967741935484
0.405255179383527
0.405255179383527
0.405255179383527
0.405255179383527
0.405255179383527
0.405255179383527
0.432819383259912
0.432819383259912
0.432819383259912
0.432819383259912
0.432819383259912
0.432819383259912
0.43309272626318707
0.43309272626318707
0.43309272626318707
0.4710526315789474
0.4710526315789474
0.4710526315789474
0.4710526315789474
0.4710526315789474
0.4710526315789474
0.4710526315789474
0.49927431059506533
0.49927431059506533
0.49927431059506533
0.49927431059506533
0.49927431059506533
0.49927431059506533
0.49927431059506533
0.49670329670329677
0.5139220365950675
0.5139220365950675
0.5139220365950675
0.5139220365950675
0.5848452508004269
0.5848452508004269
0.5848452508004269
0.6075949367088608
0.6075949367088608
0.6075949367088608
0.60759493670

Threshold adjustment with downsampling did not improve the model's f1_score.

In [35]:
# The optimal Decision Tree Classifier model found will be run on the test set to calculate f1_score

features_downsampled_x, target_downsampled_x = downsample(features_train_x, target_train_x, 0.59) 
model_downsample_x = DecisionTreeClassifier(class_weight = 'balanced',random_state=12345, max_depth = 5)
model_downsample_x.fit(features_downsampled_x, target_downsampled_x)
predicted_test_downsample_x = model_downsample_x.predict(features_test_x)
print('F1:', f1_score(target_test_x, predicted_test_downsample_x))
# Note: this roc_auc_score is computed from hard class predictions, not predict_proba scores
print('roc_auc_score:', roc_auc_score(target_test_x, predicted_test_downsample_x))

F1: 0.5963718820861679
roc_auc_score: 0.7487313944092907


Ultimately, a Decision Tree Classifier is an adequate model for predicting whether a customer will end their membership with the bank. I chose to focus on Logistic Regression and Decision Tree Classifiers because of their simplicity and how easily their parameters can be tuned. For both models, I tuned the hyperparameters, applied upsampling and downsampling to the training features and targets, and adjusted the classification threshold between the positive and negative classes.

After working with a Logistic Regression, I came to the conclusion that it is not robust enough to achieve an f1_score higher than around 0.5 on the validation set. On the test set it produced an f1_score of 0.44, meaning it is a poor model overall.

With a Decision Tree Classifier, an f1_score of about 0.61 was achieved on the validation set by tuning max_depth and class_weight and by resampling the training set. Upsampling with a repeat factor of 3 and downsampling with a fraction of 0.59 both reached this level. The downsampled model with class_weight = 'balanced' and max_depth = 5 was used for the final test. An f1_score of about 0.6 shows that the model is moderately good at identifying the positive class, and its validation roc_auc_score of 0.76 shows that it has fairly good discriminatory power. On the test set, this model computed an f1_score of 0.596, which meets the 0.59 target, and a roc_auc_score of around 0.75.

Overall, the model can adequately distinguish between a customer exiting or staying, but it does not balance precision and recall well, most likely because of the class imbalance in the raw data set.