# Beta Bank Customer Model

## Introduction

### The Problem

Beta Bank customers are leaving little by little each month. The bank determined it's cheaper to save existing customers than to attract new ones. Knowing which customers will leave soon would be a valuable tool to help Beta Bank target the right customers before they decide to leave.

### Prediction Model

Beta Bank hopes a sufficent prediction model can be made. It's sufficiency depends if a an F1 score of at least 0.59 is attained. The AUR-ROC metric will also be compared. Client behaviour including whether they have terminated their contract or not has been provided. A description of data follows.

### Data Description

#### Features

`RowNumber` — data string index

`CustomerId` — unique customer identifier

`Surname` — surname

`CreditScore` — credit score

`Geography` — country of residence

`Gender` — gender

`Age` — age

`Tenure` — period of maturation for a customer’s fixed deposit (years)

`Balance` — account balance

`NumOfProducts` — number of banking products used by the customer

`HasCrCard` — customer has a credit card

`IsActiveMember` — customer’s activeness

`EstimatedSalary` — estimated salary


#### Target
`Exited` — сustomer has left (if 1)

### The Process

The process will follow these four steps:

1. Prepare the data
2. Examine the balance of classes
3. Improve model quality
4. Perform final testing

## Read, Prepare and Pre-Process Data

### To read:

- Import packages 
- Save the dataframe

### To prepare:
- Check column names
- Check data types 
- Identify missing values 
- Identify duplicates
- Check binomial columns

### To pre-process
- Encode data based on learning algorithym types

### Read Data

In [1]:
# Import appropriate packages

import pandas as pd # to save dataframe
import numpy as np # to perform numerical operations
import re # to use regular expressions
from sklearn.preprocessing import LabelEncoder # for label encoding
from sklearn.model_selection import train_test_split # to split data into training and testing sets
from sklearn.ensemble import RandomForestClassifier # to train model with random forest
from sklearn.linear_model import LogisticRegression # to train model with logistic regression
from sklearn.metrics import accuracy_score # to calculate accuracy score
import warnings # to ignore warnings related to logistical regression
from sklearn.utils import shuffle # to shuffle data
from sklearn.metrics import precision_score, recall_score, f1_score # to calculate precision, recall and f1 scores
from sklearn.metrics import roc_curve # to plot ROC curve
import plotly.graph_objects as go # to plot ROC curve
from sklearn.metrics import roc_auc_score # To calculate AUC score


In [2]:
# Save dataframe to csv file
df = pd.read_csv('./datasets/Churn.csv')

### Prepare Data

#### Column Names

In [3]:
# Examine the first few rows of the dataframe and identify column names
df.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


The data column names are in CamelCase. Change this to snake_case as it is easier to work with.

In [4]:
# Convert column names to snake case using regular expressions
df.columns = [re.sub('([a-z0-9])([A-Z])', r'\1_\2', col).lower() for col in df.columns]

#### Data Types

In [5]:
# Examine data types with .info()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   row_number        10000 non-null  int64  
 1   customer_id       10000 non-null  int64  
 2   surname           10000 non-null  object 
 3   credit_score      10000 non-null  int64  
 4   geography         10000 non-null  object 
 5   gender            10000 non-null  object 
 6   age               10000 non-null  int64  
 7   tenure            9091 non-null   float64
 8   balance           10000 non-null  float64
 9   num_of_products   10000 non-null  int64  
 10  has_cr_card       10000 non-null  int64  
 11  is_active_member  10000 non-null  int64  
 12  estimated_salary  10000 non-null  float64
 13  exited            10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


It seems that all data types are correct, except the 'tenure' column. Check the number of values.

In [6]:
# Check unique values of 'tenure' column including NaN
df['tenure'].value_counts(dropna=False)

tenure
1.0     952
2.0     950
8.0     933
3.0     928
5.0     927
7.0     925
NaN     909
4.0     885
9.0     882
6.0     881
10.0    446
0.0     382
Name: count, dtype: int64

#### Missing Values

Notice that this column also contains missing values. It can safely be assumed that if no tenure exists, its value is zero.

In [7]:
# Change missing values in 'tenure' column to 0
df['tenure'] = df['tenure'].fillna(0)

Now that the column contains no missing values, we can convert it to an integer data type.

In [8]:
# Convert 'tenure' column to integer type
df['tenure'] = df['tenure'].astype('int64')

#### Duplicate Values

Check duplicate values generally and across columns that should contain unique items. These include row_number and customer_id.

In [9]:
# Check for duplicate rows generally
print('General duplicates:',df.duplicated().sum())

# Check for duplicates in row_number column
print('Row duplicates:',df.duplicated(subset='row_number').sum())

# Check for duplicate rows in customer_id column
print('Customer duplicates:',df.duplicated(subset='customer_id').sum())

General duplicates: 0
Row duplicates: 0
Customer duplicates: 0


As no duplicate rows are found, binomial columns can be checked.

#### Binomial columns

From observation, it seems that 'has_cr_card', 'is_active_member', and 'exited' columns are binomial categories. Ensure that this is the case by checking the number of unique values of these columns

In [10]:
# print number of unique values of df columns
print(df[['has_cr_card','is_active_member', 'exited']].nunique())

has_cr_card         2
is_active_member    2
exited              2
dtype: int64


The suspected binomial columns correctly contain two unique values. 

### Pre-Process Data

The gender and geography columns contain categories as a string.

In [11]:
# Print unique values for gender and geography columns
print('Unique gender values:', list(df['gender'].unique()))
print('Unique gender values:', list(df['geography'].unique()))


Unique gender values: ['Female', 'Male']
Unique gender values: ['France', 'Spain', 'Germany']


Unfortunately, sklearn learning algorithyms modules cannot comprehend string values. These values will need to be encoded so that the model may work with the data. The way they will be encoded will depend on the selected learning modules. These will be random forests and logistic regression.

#### Gender

As only two categories, 'Female' and 'Male', are given, either label encoding or one-hot encoding (OHE) may be used. This is because they will infer the same results which is ideal for both random trees and logistic regression algorithyms. 

For simplicity, label encoding will be used. Label encoding assigns values based on alphabetical order. Thus, the gender_label column will refer to:

`female` as 0

`male` as 1

In [12]:
# Save LabelEncoder as encoder
encoder = LabelEncoder()

# Encode 'gender' column
df['gender_label'] = encoder.fit_transform(df['gender'])

#### Geography

Unlike gender, the geography column has three values. This means encoding will differ for the two models. 

##### Random Forest: Label Encoder

For models that use trees such as random forest, the label encoder will be used as it was above. The geography_label column will refer to:

`France` as 0
`Germany` as 1
`Spain` as 2

In [13]:
# Label encode geography for random forest models
df['geography_label'] = encoder.fit_transform(df['geography'])

##### Logistic Regression: One-Hot Encoding

One-hot encoding (OHE) is a better use for the logistic regression model.

In [14]:
# Make geography column lowercase to keep column names lowercase
df['geography'] = df['geography'].str.lower()

# One-hot encode geography for logistic regression models
df = pd.get_dummies(df, columns=['geography'], drop_first=True)

## Balance of Classes
Check the balance of classes. Train the model without improvements to this balance. Note the findings.

### Check Balance

Check balance by examining the value counts of the target. Create a model that constantly predicts the more popular choice. Compare the two values.

In [15]:
# Find value counts of the 'exited' column as a percentage
print(df['exited'].value_counts(normalize=True)*100)

exited
0    79.63
1    20.37
Name: proportion, dtype: float64


In [16]:
# Create model that predicts all customers will not churn
target_pred_constant = pd.Series(0, index=df.index)

# Print accuracy score for this model
print(accuracy_score(df['exited'], target_pred_constant) *100)

79.63


#### Analysis

A model that predicts that every customer won't churn will be accurate 79.63% of the time. A model must be created that does better than this.

### Model without Improvements

To train the model, the data must first be split. Three sets for the training, validation and test sets. Another split will seperate the features and target according to the requirements of the two learning algorithyms. The first will be a random forest, the second using logistic regression. 


### Split Data

As we do not have a test set, the data will be split into 3 parts. These are for the training, validation and test sets. This split will occur in a 3:1:1 ratio as it keeps the validation and the test sets the same size.

We can do this by first splitting the data in a 3:2 ratio to give two datasets. The first will serve as the test set. The second will be split again in 1:1 ratio to give the validation and test sets.

Once this done, the datasets will be split into features and target.

#### Training, Validation and Test Splits 

In [17]:
# Split data into train and validation with test 
df_train, df_valid_test = train_test_split(df, test_size=0.4, random_state=12345)

# Split df_valid_test into validation and test
df_valid, df_test = train_test_split(df_valid_test, test_size=0.5, random_state=12345)

In [18]:
# Confirm the size of splitted data sets
print('df_train size:', df_train.shape)
print('df_test size:', df_test.shape)
print('df_val size:', df_test.shape)

df_train size: (6000, 17)
df_test size: (2000, 17)
df_val size: (2000, 17)


### Random Forests

#### Description
A random forest is a learning algorithym that trains groups of independent trees and makes decisions based on majority class prediction.

#### Hyperparameters

`max_depth`: Depth will be iterated for values 1 to 10.

`n_estimators`: The number of trees that are built before voting takes place. This will be iterated from 10 to 100. 

`random_state`: '12345' will be used to produce consistent results.

#### Process

Features and the target will be determined using the appropriate preprocessing data. Accuracy scores with their associated parameters will be determined

##### Features and Targets

Select features by dropping those that arent relevant. These include 'row_number', 'customer_id', 'surname',  'geography_germany, 'geography_spain' and 'exited'.

In [19]:
# Create a list of columns to drop for feature selection for random forests
rf_drop_columns = ['row_number', 'customer_id', 'surname','exited','gender','geography_germany','geography_spain']

# Split the training data into features and target
rf_features_train = df_train.drop(rf_drop_columns, axis=1)
rf_target_train = df_train['exited']

# Split the validation data into features and target
rf_features_valid = df_valid.drop(rf_drop_columns, axis=1)
rf_target_valid = df_valid['exited']

# Split the test data into features and target
rf_features_test = df_test.drop(rf_drop_columns, axis=1)
rf_target_test = df_test['exited']


##### Model Training

In [20]:
# Find the best accuracy score for the model with accomodating max_depth and n_estimators
best_score = 0
best_est = 0
best_depth = 0
f1 = 0

# Create loop to find the best accuracy score with different max_depths and n_estimators
for est in range(10, 51): # choose hyperparameter range
    for depth in range(1, 11):
        model = RandomForestClassifier(max_depth=depth, random_state=12345, n_estimators=est) # set number of trees
        model.fit(rf_features_train, rf_target_train) # train model on training set
        score = model.score(rf_features_valid, rf_target_valid) # calculate accuracy score on validation set
        if score > best_score: # if accuracy score is better than previous best, save model and its accuracy score
            best_score = score
            best_est = est
            best_depth = depth
            f1 = f1_score(rf_target_valid, model.predict(rf_features_valid))

print("Accuracy of the best model:", round(best_score*100,2), "%")
print("Found with",best_est, "estimators and a depth of", best_depth)
print("F1 score of the best model:", round(f1*100,2), "%")

Accuracy of the best model: 86.45 %
Found with 49 estimators and a depth of 8
F1 score of the best model: 57.32 %


### Logistic Regression

#### Description

Logistic regression may be used by assigning observations of features (calls, messages, usage, etc) as either positive or negative (Ultra or Smart plans) based on probability.

##### Parameters

`solver`: Sets the method for fitting the data. Each will be used and compared.

`random_state`: '12345' will be used.

#### Process

Features and the target will be determined using the appropriate preprocessing data. Accuracy scores with their associated parameters will be determined

##### Features and Targets

Select features by dropping those that arent relevant. These include 'row_number', 'customer_id', 'surname',  'geography_label', and 'exited'.

In [21]:
# Create a list of columns to drop for feature selection for random forests
lr_drop_columns = ['row_number', 'customer_id', 'gender', 'surname','exited','geography_germany','geography_spain']

# Split the training data into features and target
lr_features_train = df_train.drop(lr_drop_columns, axis=1)
lr_target_train = df_train['exited']

# Split the validation data into features and target
lr_features_valid = df_valid.drop(lr_drop_columns, axis=1)
lr_target_valid = df_valid['exited']

# Split the test data into features and target
lr_features_test = df_test.drop(lr_drop_columns, axis=1)
lr_target_test = df_test['exited']

##### Model

In [22]:
# Save solvers to a list
solvers = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']

# Create a loop to find the best accuracy score with different solvers
for solver in solvers:
    # Ignore warnings
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        
        # Train model
        model = LogisticRegression(random_state=12345, solver=solver)
        model.fit(lr_features_train, lr_target_train)
        score_valid = round(model.score(lr_features_valid, lr_target_valid), 3) * 100
        print('Solver:', solver)
        print('Accuracy on the validation set:', score_valid, '%')
        print('F1 score on the validation set:', round(f1_score(lr_target_valid, model.predict(lr_features_valid)), 3) * 100, '%')
        print()

Solver: newton-cg
Accuracy on the validation set: 80.2 %
F1 score on the validation set: 25.900000000000002 %

Solver: lbfgs
Accuracy on the validation set: 78.10000000000001 %
F1 score on the validation set: 8.799999999999999 %

Solver: liblinear
Accuracy on the validation set: 78.2 %
F1 score on the validation set: 8.4 %

Solver: sag
Accuracy on the validation set: 79.10000000000001 %
F1 score on the validation set: 0.0 %

Solver: saga
Accuracy on the validation set: 79.10000000000001 %
F1 score on the validation set: 0.0 %



#### Analysis

80% accuracy may look good, but is a bad metric to determine how good the model in comparison to the constant model. Only one Logistic regression model did better than the constant model on accuracy. Whilst the random forest did well with an accuracy of 86%, the constant model was only 6% behind. The high imbalance may have negatively affected the training process.

## Improve Model Quality

To improve model quality, adjustments to class weights and thresholds will be made. Metrics aside from accuracy will be used improve model quality. These include precision, recall and F1 scores. 

Precision measures how many negative responses were found by the model when searching for positive responses. Recall shows how many positive answers were found by the model in relation to all positive responses. As these are both important, the F1 value combines the two and will be used as our main metric. The goal is to find one that exceeds 59%.

Another good indicator of the effectiveness of our models is the area under curve and the receiver operating characteristic. These will be explained.

### Class Weight Adjustment

Machine learning algorithyms keep observations balanced by default. However, adjustments can be made through upscaling and downscaling.

#### Upsampling

Upsampling gives more weight to more important values by replicating observations. This requires 4 steps:

1. Split data by target
2. Duplicate appropriate observations
3. Shuffle data
4. Train models

3.Improve the quality of the model. Make sure you use at least two approaches to fixing class imbalance. Use the training set to pick the best parameters. Train different models on training and validation sets. Find the best one. Briefly describe your findings.


In [23]:
# 1. Split data by target

# Split target_train data
target_zeros = rf_target_train[rf_target_train == 0]
target_ones = rf_target_train[rf_target_train == 1]

# Split features data based on target data for both random forest and logistic regression
rf_features_zeros = rf_features_train[rf_target_train == 0]
rf_features_ones = rf_features_train[rf_target_train == 1]

lr_features_zeros = lr_features_train[rf_target_train == 0]
lr_features_ones = lr_features_train[rf_target_train == 1]

# Print lengths for target zeros and target ones
print('target_zeros length:', len(target_zeros))
print('target_ones length:', len(target_ones))
print('Ratio:', round(len(target_zeros)/len(target_ones), 2))
print('Ratio Imbalanced')


target_zeros length: 4804
target_ones length: 1196
Ratio: 4.02
Ratio Imbalanced


In [24]:
# 2. Duplicate appropriate observations

# Find ratio of target zeros to target ones
ratio= int(len(target_zeros) / len(target_ones))

# Duplicate target ones by ratio and concatenate for target for random forests
rf_target_upsampled = pd.concat([target_zeros] + [target_ones] * ratio)

# Duplicate target ones by ratio and concatenate for target for logistic regression
lr_target_upsampled = pd.concat([target_zeros] + [target_ones] * ratio)

# Repeat for features for random forests and logistic regression
rf_features_upsampled = pd.concat([rf_features_zeros] + [rf_features_ones] * ratio)
lr_features_upsampled = pd.concat([lr_features_zeros] + [lr_features_ones] * ratio)

# Print lengths for target zeros and target ones
print('Number of observations where the customer stayed:', len(target_zeros))
print('Number of observations where the customer left:', len(target_ones) * ratio)
print('Ratio:', round(len(target_zeros)/(len(target_ones) * ratio), 2))
print('Ratio Balanced')

Number of observations where the customer stayed: 4804
Number of observations where the customer left: 4784
Ratio: 1.0
Ratio Balanced


In [38]:
# 3. Shuffle Data
# Shuffle data for random forests
rf_features_upsampled, rf_target_upsampled = shuffle(rf_features_upsampled, rf_target_upsampled, random_state=12345)

# Repeat for logistic regression
lr_features_upsampled, lr_target_upsampled = shuffle(lr_features_upsampled, lr_target_upsampled, random_state=12345)

In [39]:
# 4. Train Models

# Train random forest model
# Find the best F1 score for the model with accomodating max_depth and n_estimators

best_score = 0
best_est = 0
best_depth = 0

# Create loop to find the best accuracy score with different max_depths and n_estimators
for est in range(10, 101, 10): # choose hyperparameter range
    for depth in range(1, 11):
        model = RandomForestClassifier(max_depth=depth, random_state=12345, n_estimators=est) # set number of trees
        model.fit(rf_features_upsampled, rf_target_upsampled) # train model on training set
        score = f1_score(rf_target_valid, model.predict(rf_features_valid)) # calculate f1 score on validation set
        if score > best_score: # if f1 score is better than previous best, save model and its score
            best_score = score
            best_est = est
            best_depth = depth

print("F1 score of the best model:", round(best_score*100,2), "%")
print("Found with", best_est, "estimators and a depth of", best_depth)

F1 score of the best model: 61.7 %
Found with 30 estimators and a depth of 9


In [40]:
# Train logistic regression model
# Save solvers to a list
solvers = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']

# Create a loop to find the best accuracy score with different solvers
for solver in solvers:
    # Ignore warnings
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        
        # Train model
        model = LogisticRegression(random_state=12345, solver=solver)
        model.fit(lr_features_upsampled, lr_target_upsampled)
        score_valid = round(f1_score(lr_target_test, model.predict(lr_features_valid)), 3) * 100
        print('Solver:', solver)
        print('Accuracy on the validation set:', round(accuracy_score(lr_target_valid, model.predict(lr_features_valid)), 3) * 100, '%')
        print('F1 score on the validation set:', score_valid, '%')


Solver: newton-cg
Accuracy on the validation set: 69.0 %
F1 score on the validation set: 27.900000000000002 %
Solver: lbfgs
Accuracy on the validation set: 65.4 %
F1 score on the validation set: 28.299999999999997 %
Solver: liblinear
Accuracy on the validation set: 65.9 %
F1 score on the validation set: 28.000000000000004 %
Solver: sag
Accuracy on the validation set: 49.6 %
F1 score on the validation set: 29.7 %
Solver: saga
Accuracy on the validation set: 48.8 %
F1 score on the validation set: 29.4 %


#### Analysis

Upsampling with the random trees model has produced a F1 of 62.18%. This is an improvement from 57.32% before balancing. Logistic regression has also seen increases in F1 scores to almost 30%.

#### Downsampling

Instead of upsampling the number of observations where customers exited, downsampling the number of customers that stayed could also work. This works similarly to upsampling and follows these steps:

1. Split data by target
2. Randomly drop negative observations
3. Shuffle data
4. Train models

As we already did the first step in upsampling, move onto the second.

In [44]:
# 2. Randomly drop negative observations

# Define fraction to drop
fraction = len(target_ones)/len(target_zeros)


# Drop negative observations for random forests
rf_features_downsampled = pd.concat(
    [rf_features_zeros.sample(frac=fraction, random_state=12345)]
    + [rf_features_ones]) 

rf_target_downsampled = pd.concat(
    [target_zeros.sample(frac=fraction, random_state=12345)]
    + [target_ones])

# Drop negative observations for logistic regression
lr_features_downsampled = pd.concat(
    [lr_features_zeros.sample(frac=fraction, random_state=12345)]
    + [lr_features_ones])

lr_target_downsampled = pd.concat(
    [target_zeros.sample(frac=fraction, random_state=12345)]
    + [target_ones])

# Print lengths for target zeros and target ones
print('Number of observations where the customer stayed:', len(target_zeros) * fraction)
print('Number of observations where the customer left:', len(target_ones))
print('Ratio:', round(len(target_zeros) * fraction/len(target_ones), 2))

Number of observations where the customer stayed: 1614.0
Number of observations where the customer left: 1614
Ratio: 1.0


In [45]:
# 3. Shuffle data
# Shuffle data for random forests
rf_features_downsampled, rf_target_downsampled = shuffle(rf_features_downsampled, rf_target_downsampled, random_state=12345)

# Repeat for logistic regression
lr_features_downsampled, lr_target_downsampled = shuffle(lr_features_downsampled, lr_target_downsampled, random_state=12345)

ValueError: Found input variables with inconsistent numbers of samples: [2410, 3228]

In [None]:
# 4. Train Models

# Train random forest model
# Find the best F1 score for the model with accomodating max_depth and n_estimators

best_score = 0
best_est = 0
best_depth = 0

# Create loop to find the best accuracy score with different max_depths and n_estimators
for est in range(10, 101, 10): # choose hyperparameter range
    for depth in range(1, 11):
        model = RandomForestClassifier(max_depth=depth, random_state=12345, n_estimators=est) # set number of trees
        model.fit(rf_features_downsampled, rf_target_downsampled) # train model on training set
        score = f1_score(rf_target_valid, model.predict(rf_features_valid)) # calculate f1 score on validation set
        if score > best_score: # if f1 score is better than previous best, save model and its score
            best_score = score
            best_est = est
            best_depth = depth

print("F1 score of the best model:", round(best_score*100,2), "%")
print("Found with", best_est, "estimators and a depth of", best_depth)

F1 score of the best model: 60.38 %
Found with 90 estimators and a depth of 6


In [None]:
# Train logistic regression model

# Create a loop to find the best F1 score with different solvers
for solver in solvers:
    # Ignore warnings
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        
        # Train model
        model = LogisticRegression(random_state=12345, solver=solver)
        model.fit(lr_features_downsampled, lr_target_downsampled)
        score_valid = round(f1_score(lr_target_test, model.predict(lr_features_valid)), 3) * 100
        print('Solver:', solver)
        print('Accuracy on the validation set:', round(accuracy_score(lr_target_valid, model.predict(lr_features_valid)), 3) * 100, '%')
        print('F1 score on the validation set:', score_valid, '%')


Solver: newton-cg
Accuracy on the validation set: 69.0 %
F1 score on the validation set: 27.800000000000004 %
Solver: lbfgs
Accuracy on the validation set: 65.0 %
F1 score on the validation set: 28.1 %
Solver: liblinear
Accuracy on the validation set: 65.10000000000001 %
F1 score on the validation set: 28.499999999999996 %
Solver: sag
Accuracy on the validation set: 47.699999999999996 %
F1 score on the validation set: 29.799999999999997 %
Solver: saga
Accuracy on the validation set: 47.699999999999996 %
F1 score on the validation set: 29.599999999999998 %


#### Analysis

Downsampling produced similar, but slightly less ideal results. 

### Threshold Adjustments

Threshold adjustments may help improve the F1 score of the logistic regression models. By default this level is set to 0.5 as it asigns a probability of 0.5 to each target observation outcome. However, this can be changed iterating through threshold values.

In [47]:
# Create a loop to find the best f1 score with different solvers
for solver in solvers:
    # Ignore warnings
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        model = LogisticRegression(random_state=12345, solver='liblinear')
        model.fit(lr_features_train, lr_target_train)
        probabilities_valid = model.predict_proba(lr_features_valid)
        probabilities_one_valid = probabilities_valid[:, 1]

        best_f1 = 0
        best_threshold = 0
        best_precision = 0
        best_recall = 0
        best_solver = ''

        for threshold in np.arange(0, 0.5, 0.02):
            predicted_valid = probabilities_one_valid > threshold
            precision = precision_score(lr_target_valid, predicted_valid)
            recall = recall_score(lr_target_valid, predicted_valid)
            f1 = precision * recall * 2 / (precision + recall)
            if f1 > best_f1:
                best_f1 = f1
                best_threshold = threshold
                best_precision = precision
                best_recall = recall
                best_solver = solver
                
print('Best solver:', best_solver)

print(
    'Threshold = {:.2f} | Precision = {:.3f}, Recall = {:.3f}, F1 = {:.2f} %'.format(
        best_threshold, best_precision, best_recall, best_f1 * 100))

Best solver: saga
Threshold = 0.22 | Precision = 0.322, Recall = 0.577, F1 = 41.30 %


#### Analysis

By changing threshold levels, the model was optimized for precision and recall. The best solver, 'saga', produced an F1 score of 41.3%. However, this is still lower than the random forests model.

### AUC-ROC

By iterating through the threshold, a curve that shows the relationship between the true positive rate (TPR) and false positive rate (FPR) can be made. A random model should show a consistent rise as each increases. However, a model that is better will show a curve that rises above. Once this is calculated, the area under the curve can be used to show how effective it may be. An area equal to one shows that the model is perfect, whereas a model that shows lower than 0.5 is worse than a random model.

In [None]:
# find fpr, tpr, thresholds with sklearn roc_curve function
fpr, tpr, thresholds = roc_curve(lr_target_valid, probabilities_one_valid)

# Create figure
fig = go.Figure()

# Plot ROC curve
fig.add_trace(go.Scatter(x=fpr, y=tpr,
                    mode='lines',
                    name='High quality model',
                    line=dict(color='deeppink')))  # Set ROC curve color

# Plot line of random model
fig.add_trace(go.Scatter(x=[0, 1], y=[0, 1],
                    mode='lines',
                    name='Random model',
                    line=dict(color='aqua')))  # Set Random model color

# Customize layout
fig.update_layout(
    title='ROC curve',
    xaxis=dict(title='False Positive Rate'),
    yaxis=dict(title='True Positive Rate'),
    width=700,
    height=500,
    template='plotly_dark',  # Change the template to 'plotly_dark' for a dark background 
    title_font_color='white'  # Change title font color
)

# Show plot
fig.show()

# Calculate AUC score
auc_roc = roc_auc_score(lr_target_valid, probabilities_one_valid)

# Print AUC Score
print('Area Under Curve =', auc_roc)

Area Under Curve = 0.672703984418004


# Analysis of the ROC curve

The model is far from perfect, but it's better than the random model. The area under the curve is 0.67, which is better than the 0.5 of a random model.

## Final Testing

After making adjustments to both the class weights and thresholds, the random forest model is the most effective. This model was able to reach an F1 level of 62.18% with the validation set using upscaling methods. The validaiton set will now be incorporated into the training set. Once this has been accomplished, the same methods used to upsample will be used. These results will be used to test  the test set.

### Incorporate Validation Set

In [None]:
# Combine random forest validation and training sets for features and target
rf_features_valid_train = pd.concat([rf_features_valid, rf_features_train])
rf_target_valid_train = pd.concat([rf_target_valid, rf_target_train])

# 1. Split data by target

# Split target data
target_zeros = rf_target_valid_train[rf_target_valid_train == 0]
target_ones = rf_target_valid_train[rf_target_valid_train == 1]

# Split features data based on target data for both random forest and logistic regression
features_zeros = rf_features_valid_train[rf_target_valid_train == 0]
features_ones = rf_features_valid_train[rf_target_valid_train == 1]

# Print lengths for target zeros and target ones
print('target_zeros length:', len(target_zeros))
print('target_ones length:', len(target_ones))
print('Ratio:', round(len(target_zeros)/len(target_ones), 2))
print('Ratio Imbalanced')

target_zeros length: 6386
target_ones length: 1614
Ratio: 3.96
Ratio Imbalanced


In [None]:
# 2. Duplicate appropriate observations

# Find ratio of target zeros to target ones
ratio= int(len(target_zeros) / len(target_ones))

# Duplicate target ones by ratio and concatenate for target for random forests
target_upsampled = pd.concat([target_zeros] + [target_ones] * ratio)

# Repeat for features
features_upsampled = pd.concat([features_zeros] + [features_ones] * ratio)

# Print lengths for target zeros and target ones
print('Number of observations where the customer stayed:', len(target_zeros))
print('Number of observations where the customer left:', len(target_ones) * ratio)
print('Ratio:', round(len(target_zeros)/(len(target_ones) * ratio), 2))
print('Ratio Balanced')


Number of observations where the customer stayed: 6386
Number of observations where the customer left: 4842
Ratio: 1.32
Ratio Balanced


In [None]:
# 3. Shuffle Data
features_upsampled, target_upsampled = shuffle(features_upsampled, target_upsampled, random_state=12345)

In [None]:
# 4. Train Model

# Use estimators = 100 and max_depth = 10 from previous model
model = RandomForestClassifier(max_depth=10, random_state=12345, n_estimators=100) # set number of trees

# Train model on combined training and validation upsampled set
model.fit(features_upsampled, target_upsampled)

# Calculate f1 and accuracy scores on test set
f1 = f1_score(rf_target_test, model.predict(rf_features_test))
accuracy = accuracy_score(rf_target_test, model.predict(rf_features_test))

# Print F1 score and Accuracy score
print("F1 score of the best model on test set:", round(f1*100,2), "%")
print("Accuracy of the best model on test set:", round(accuracy*100,2), "%")

F1 score of the best model on test set: 60.61 %
Accuracy of the best model on test set: 83.3 %


## Conclusion

### Data Preparation

The data was saved, cleaned and pre-processed. This meant installing the appropriate packages, tidying up column names, checking for null and duplicate values, and encoding the data according to datatypes.

### Class Balances

Class balances were examined and noted that a constant model would produce results similar to the models that were tested. To overcome these problems, the F1 score became the metric of value.

### Improving Models

The models were improved with changes to class weights and thresholds. By upsampling a downsampling, the F1 scores on the random forests were increased to over 60%. The logistic regression model saw some improvement, but saw its biggest jump in F1 when adjusting threshold levels. At a threshold of 0.22, the F1 score was 41%. After creating an AUC-ROC graph it was determined that these logistic regression models were better than random, but not by a significatn amount.

### Final Testing
Ultimately, the random forest produced the highest F1 score at 62%. The hyperparameters used 100 estimators and a depth of 10. The model was further improved by combining training and validation sets and upsampling the data. Once this was done, the model was tested on the test set. This yielded an F1 score of 60% and an accuracy of 83.3%.