# Sprint 8 Project - Supervised Learning

## Project Overview

**Problem Statement -** Beta Bank customers are leaving: little by little, chipping away every month. The bankers figured out it’s cheaper to save the existing customers rather than to attract new ones. My responsability will be to predict whether a customer will leave the bank soon. I have the data on clients’ past behavior and termination of contracts with the bank to do this task.

**Project Objective -** I will build a model with the maximum possible F1 score, with the requirement that the F1 score for my model must be at least 0.59.

**In this project, I will** - 
- Download and prepare the data and explain the procedure
- Examine the balance of classes
  - Train the model without taking into account the imbalance
- Improve the quality of the model using at least two approaches to fixing class imbalance
  - Use the training set to pick the best parameters
  - Train different models on training and validation sets to find the best model
- Perform the final testing

##  Import Libraries

In [1]:
# Import libraries required for the project
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# From sklearn get classification models, model evaluation packages, and training data split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Split dataset for training
from sklearn.model_selection import train_test_split

# For feature scaling
from sklearn.preprocessing import StandardScaler

# For upsampling shuffle
from sklearn.utils import shuffle

# Accuracy score for decision tree
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

# Show all columns when displaying dataframe
pd.set_option('display.max_columns', None)

Cell below is used to align the tables in the markdown to the left

In [2]:
%%html
<style>
table {float:left}
</style>

## Read in Data

In [3]:
# Read in csv and save as Churn dataframe
churn = pd.read_csv('/datasets/Churn.csv')

### Description of Fields in Dataset

*Below is a description for all of the fields in the dataset that I will be working with to build and train my models*

**Features**

 - `RowNumber` — data string index
 - `CustomerId` — unique customer identifier
 - `Surname` — surname
 - `CreditScore` — credit score
 - `Geography` — country of residence
 - `Gender` — gender
 - `Age` — age
 - `Tenure` — period of maturation for a customer’s fixed deposit (years)
 - `Balance` — account balance
 - `NumOfProducts` — number of banking products used by the customer
 - `HasCrCard` — customer has a credit card
 - `IsActiveMember` — customer’s activeness
 - `EstimatedSalary` — estimated salary

**Target**
- `Exited` — сustomer has left

### Explore Dataset

In [4]:
# Use print so I don't lose outputs

# Check for missing values
print('Check for Missing Values')
print(churn.isna().sum())

# Check values for each column
print('\n Describe Dataframe')
print(churn.describe())

# Check data types
print('\n Check Data Types')
print(churn.info())
#print(churn.dtypes)

#Check for duplicate rows
print('\n Check for Duplicate Rows')
print(churn[churn['CustomerId'].duplicated(keep=False)])

# Check data
print('\n Print First 10 Rows')
churn.head(10)

Check for Missing Values
RowNumber            0
CustomerId           0
Surname              0
CreditScore          0
Geography            0
Gender               0
Age                  0
Tenure             909
Balance              0
NumOfProducts        0
HasCrCard            0
IsActiveMember       0
EstimatedSalary      0
Exited               0
dtype: int64

 Describe Dataframe
         RowNumber    CustomerId   CreditScore           Age       Tenure  \
count  10000.00000  1.000000e+04  10000.000000  10000.000000  9091.000000   
mean    5000.50000  1.569094e+07    650.528800     38.921800     4.997690   
std     2886.89568  7.193619e+04     96.653299     10.487806     2.894723   
min        1.00000  1.556570e+07    350.000000     18.000000     0.000000   
25%     2500.75000  1.562853e+07    584.000000     32.000000     2.000000   
50%     5000.50000  1.569074e+07    652.000000     37.000000     5.000000   
75%     7500.25000  1.575323e+07    718.000000     44.000000     7.000000   
max

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0
5,6,15574012,Chu,645,Spain,Male,44,8.0,113755.78,2,1,0,149756.71,1
6,7,15592531,Bartlett,822,France,Male,50,7.0,0.0,2,1,1,10062.8,0
7,8,15656148,Obinna,376,Germany,Female,29,4.0,115046.74,4,1,0,119346.88,1
8,9,15792365,He,501,France,Male,44,4.0,142051.07,2,0,1,74940.5,0
9,10,15592389,H?,684,France,Male,27,2.0,134603.88,1,1,1,71725.73,0


### Data Exploration Findings

There are a few findings which I've listed below from my data exploration. I will address and changes that need to be made from my data exploration in Section 4: Prepare Data From Training
1. There are 909 missing values in the Tenure field
  - Missing values in this field will be replaced with 0. Another option could be to replace the missing value with the average tenure of all customers.
  - After replacing values, datatype shoud be changed to an integer
2. There are some fields that may add noise to the model which need to be removed since they don't add value in predicting if a user will churn. 
  - The fields that will be dropped are RowNumber, CustomerId, Surname
  - I could potentially calculate the correlation between these fields and Exited
3. There are no duplicated rows or duplicated CustomerIds in the dataset

## Prepare Data For Training

In this Section 4, I will - 
- Prepare the data for training
  - Replace missing values
  - Drop fields
- Process all of the feature types
  - Convert Categorical Fields into Numerical Using OHE
- Create Training, Validation & Test Dataset Before Optimizing For Class Imbalance
- Perform Feature Scaling

All of these steps must be done before beginning to train our model

### Replace Missing Values in Tenure
In the data exploration section above, we found that there were 909 missing values in the Tenure column. I will replace these missing values with 0.

In [5]:
# Replace missing values in Tenure
churn['Tenure'].fillna(0, inplace = True)

# Check to see if missing values were replaced
print(churn['Tenure'].isna().sum())

# Convert from float to int
churn['Tenure'] = churn['Tenure'].astype('int64')

# Check datatype
churn['Tenure'].dtype

0


dtype('int64')

### Remove Unneccesary Fields From Dataframe Before Training

In [6]:
# Check to see if there is high correlation between these fields and Exited
print(churn['RowNumber'].corr(churn['Exited'])) # Correlation -0.016
#churn['Surname'].corr(churn['Exited']) # Name should have no impact on exiting the bank
print(churn['CustomerId'].corr(churn['Exited'])) # Correlation -0.006

# Remove fields that will reduce model accuracy 
# Fields to remove are row_number, surname, CustomerId
churn = churn.drop(['RowNumber','Surname','CustomerId'], axis=1)

# Print resulting df
churn.head()

-0.016571371463984713
-0.006247986637818783


Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


### Convert Categorical Fields into Numerical Using OHE

Because Geography and Gender are categorical but do not have an implicit order or ranking, we can use OHE to change the categorical labels to numerical labels

In [7]:
# Check the unique values in the categorical fields Geography and Gender
print(churn['Geography'].unique()) # Returns ['France' 'Spain' 'Germany']
print(churn['Gender'].unique()) # Returns ['Female' 'Male']

# Because there isn't a ranking of these categorical variables, OHE should be used over OrdinalEncoding
# This will tell the model that there is an order/rank to these categorical varibales, when there is not
churn = pd.get_dummies(churn)

# Drop Gender_Male because Gender_Female will contain 1 or 0 if customer is female or male
churn = churn.drop(['Gender_Male'], axis=1)

# Print sample of dataframe with OHE columns
churn.head()

['France' 'Spain' 'Germany']
['Female' 'Male']


Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_France,Geography_Germany,Geography_Spain,Gender_Female
0,619,42,2,0.0,1,1,1,101348.88,1,1,0,0,1
1,608,41,1,83807.86,1,0,1,112542.58,0,0,0,1,1
2,502,42,8,159660.8,3,1,0,113931.57,1,1,0,0,1
3,699,39,1,0.0,2,0,0,93826.63,0,1,0,0,1
4,850,43,2,125510.82,1,1,1,79084.1,0,0,0,1,1


### Create Training, Validation & Test Dataset Before Optimizing For Class Imbalance

Because there wasn't a test dataset provided, we will need to make our own from the original dataset. To do this we will split the data in a training, validation, and test dataset with a ratio of rows 3:1:1

In [8]:
# Create target and feature datasets
# Exited will be the target
target = churn['Exited']
features = churn.drop('Exited', axis=1)

# Create Training, Validation, and Test Datasets. Split 3:1:1 since I was not provided with a test set
# First we will split .40 to get a traning dataset of 60% of the data source 
# The validiation set will be split in half to get 3 datasets with a ratio of 3:1:1
# Set Random State to 12345 to Replicate Training Set in Future
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.40, random_state=12345)

# Create a validation and test set from the the original validation set
features_valid, features_test, target_valid, target_test = train_test_split(
    features_valid, target_valid, test_size=0.50, random_state=12345)

# Check the sizes of the training, validation, and test sets for 3:1:1 ratio
# Training 
print(features_train.shape) # Training set contains 60% of original dataframe rows 
print(target_train.shape)   # Training set contains 60% of original dataframe rows 

# Validation
print(features_valid.shape) # Validation set contains 20% of original dataframe rows 
print(target_valid.shape)   # Validation set contains 20% of original dataframe rows 

# Test
print(features_test.shape)  # Test set contains 20% of original dataframe rows 
print(target_test.shape)    # Test set contains 20% of original dataframe rows 

(6000, 12)
(6000,)
(2000, 12)
(2000,)
(2000, 12)
(2000,)


### Feature Scaling

The magnitudes of values and dispersion are higher for the columns like estimated salary and balance compared to Age or Tenure. This means that the algorithm will find these features to be more important than than Age or Tenure. We don't want that. All features should be considered equally important before the algorithm's execution.

In [9]:
# Create a list of numeric columns in the churn dataset that need to be scaled
numeric = ['CreditScore','Age','Tenure','Balance','NumOfProducts'
           ,'HasCrCard','IsActiveMember','EstimatedSalary','Geography_France'
           ,'Geography_Germany','Geography_Spain','Gender_Female']

# Apply feature scaling to the numeric fields in the list
scaler = StandardScaler()
scaler.fit(features_train[numeric])

# Apply Scaler to Training and Validation Sets
features_train[numeric] = scaler.transform(features_train[numeric])
features_valid[numeric] = scaler.transform(features_valid[numeric])
features_test[numeric] = scaler.transform(features_test[numeric])

# Check Scaled Numeric Features
print(features_train.head())
print(features_valid.head())
print(features_test.head())

      CreditScore       Age    Tenure   Balance  NumOfProducts  HasCrCard  \
7479    -0.886751 -0.373192  1.104696  1.232271      -0.891560   0.642466   
3411     0.608663 -0.183385  1.104696  0.600563      -0.891560  -1.556504   
6027     2.052152  0.480939 -0.503694  1.027098       0.830152  -1.556504   
1247    -1.457915 -1.417129  0.461340 -1.233163       0.830152   0.642466   
3716     0.130961 -1.132419 -0.825373  1.140475      -0.891560  -1.556504   

      IsActiveMember  EstimatedSalary  Geography_France  Geography_Germany  \
7479       -1.055187        -0.187705         -1.005013          -0.572475   
3411       -1.055187        -0.333945          0.995012          -0.572475   
6027        0.947699         1.503095         -1.005013           1.746802   
1247       -1.055187        -1.071061          0.995012          -0.572475   
3716       -1.055187         1.524268         -1.005013           1.746802   

      Geography_Spain  Gender_Female  
7479         1.728977      -0

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  features_train[numeric] = scaler.transform(features_train[numeric])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, value[:, i].tolist(), pi)


## Train Model Without Taking into Account Feature/Class Imbalance
Now that the data has been cleaned and prepared, and can begin training our model. Because we are predicting whether a customer will churn or not (1 or 0), this is a classification exercise and not a regression exercise as we are not predicting a value; just a label for each customer of the bank. In this section, I will train and tune three models: Decision Tree, Random Forrest, Logistic Regression. 


**Note** that I will not be accounting for class imbalance of the training data in Section 5, but will address class imbalance in Section 6. After training the models, I will compare the performance of each model before and after accounting for class imbalance.

### Train & Fit DecisionTreeClassifier Model

#### Tune DecisionTreeClassifier

In [10]:
# For our Decision Tree Classification Model The Parameter That We Need to Tune is Tree Depth
# To determine the most optimal tree depth for the highest model accuracy, we will use the following code
best_result = 0
best_depth = 0

# For tree depth between 1 to 6, calculate the optimal tree depth and evaluate model for greatest accuracy
for depth in range(1, 6):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth) # create a model with the given depth
    model.fit(features_train,target_train) # train the model
    predictions = model.predict(features_valid) # get the model's predictions
    result = accuracy_score(target_valid,predictions) # calculate the accuracy
    if result > best_result:
        best_result = result
        best_depth = depth
    
    # Print model accuracy at each depth
    print("At a depth of", depth, "the accuracy of the model is", round(result,5))

# Print the accuracy of the model and which depth produced the greatest accuracy  
print()
print("The accuracy of the best model is", round(best_result,5), "at a depth of",best_depth)

At a depth of 1 the accuracy of the model is 0.791
At a depth of 2 the accuracy of the model is 0.824
At a depth of 3 the accuracy of the model is 0.838
At a depth of 4 the accuracy of the model is 0.852
At a depth of 5 the accuracy of the model is 0.853

The accuracy of the best model is 0.853 at a depth of 5


Using a depth of 5, we will now train our DecisionTreeClassifier model

#### Classification Metrics Using Tuned DecisionTreeClassifier Model

In [11]:
# Use Decision Tree Model for Classification With Depth of 5
model = DecisionTreeClassifier(random_state=12345, max_depth=5)

# Fit Model to Traing Dataset
model.fit(features_train, target_train)

# Use Model to Predict Target Values on Validation Set
predicted_valid = model.predict(features_valid)

# Get Accuracy Score 
accuracy_valid = accuracy_score(target_valid, predicted_valid)

# Print Model Accuracy 
print("At a depth of 5, the accuracy of the model is", accuracy_valid)

# Print Confusion Matrix
print('\nConfusion Matrix')
print(confusion_matrix(target_valid, predicted_valid))

# Print Recall Score
print('\nThe Recall Score is',recall_score(target_valid, predicted_valid))

# Print Precision Score 
print('The Precision Score is',precision_score(target_valid, predicted_valid))

# Print F1 Score 
print('The F1 Score is', f1_score(target_valid, predicted_valid))

# Calculate and Print AUC_ROC Score
probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)
print('The AUC_ROC Score is',auc_roc)


At a depth of 5, the accuracy of the model is 0.853

Confusion Matrix
[[1533   49]
 [ 245  173]]

The Recall Score is 0.4138755980861244
The Precision Score is 0.7792792792792793
The F1 Score is 0.5406249999999999
The AUC_ROC Score is 0.8227003550711051


### Train & Fit LogisticRegression Model

In [12]:
# Create a Logistic Regression Model
model = LogisticRegression(random_state=54321, solver="liblinear")

# Fit the model to the traning dataset
model.fit(features_train, target_train)

# Score the model accuracy on the traning and validation data sets
score_train = model.score(features_train, target_train)
score_valid = model.score(features_valid, target_valid)

# Print the accuracy of the model on the traning and validation datasets
print("The accuracy of the logistic regression model on the training set is", round(score_train,5))
print("The accuracy of the logistic regression model on the validation set is:",round(score_valid,5))

# Print Confusion Matrix
print('\nConfusion Matrix')
print(confusion_matrix(target_valid, predicted_valid))

# Print Recall Score
print('\nThe Recall Score is',recall_score(target_valid, predicted_valid))

# Print Precision Score 
print('The Precision Score is',precision_score(target_valid, predicted_valid))

# Print F1 Score 
print('The F1 Score is', f1_score(target_valid, predicted_valid))

# Calculate and Print AUC_ROC Score
probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)
print('The AUC_ROC Score is',auc_roc)

The accuracy of the logistic regression model on the training set is 0.81883
The accuracy of the logistic regression model on the validation set is: 0.8025

Confusion Matrix
[[1533   49]
 [ 245  173]]

The Recall Score is 0.4138755980861244
The Precision Score is 0.7792792792792793
The F1 Score is 0.5406249999999999
The AUC_ROC Score is 0.7585440874914559


### Train & Fit RandomForestClassifier Model

#### Tune RandomForestClassifier

In [13]:
# For our Random Forest Classification Model The Parameter That We Need to Tune is n_estimators
# n_estimators is the number of trees to be used in the forest
# To determine the most optimal tree depth for the highest model accuracy, we will use the following code
best_score = 0
best_est = 0

# For n_estimators between 1 and 11, calculate the optimal number of trees to be used in the forest and evaluate model for greatest accuracy
for est in range(1, 11):
    model = RandomForestClassifier(random_state=54321, n_estimators=est)
    model.fit(features_train, target_train)
    score = model.score(features_valid, target_valid)
    if score > best_score:
        best_score = score
        best_est = est

    # Print model accuracy at each depth
    print("With", est, "as the n_estimators, the accuracy of the model is", round(score,5))

# Print the accuracy of the model and which depth produced the greatest accuracy  
print()        
print("The accuracy of the best model on the validation set had n_estimators set to", best_est, "with a score of", round(best_score,5))

With 1 as the n_estimators, the accuracy of the model is 0.778
With 2 as the n_estimators, the accuracy of the model is 0.826
With 3 as the n_estimators, the accuracy of the model is 0.8235
With 4 as the n_estimators, the accuracy of the model is 0.836
With 5 as the n_estimators, the accuracy of the model is 0.838
With 6 as the n_estimators, the accuracy of the model is 0.8465
With 7 as the n_estimators, the accuracy of the model is 0.847
With 8 as the n_estimators, the accuracy of the model is 0.8475
With 9 as the n_estimators, the accuracy of the model is 0.8455
With 10 as the n_estimators, the accuracy of the model is 0.848

The accuracy of the best model on the validation set had n_estimators set to 10 with a score of 0.848


Using n_estimators set to 10, we will now train our RandomForestClassifier model

#### Run Classification Metrics on Tuned RandomForestClassifier Model

In [14]:
# Use RandomForestClassifier for Classification With n_estimators of 10
model = RandomForestClassifier(random_state=54321, n_estimators=10)

# Fit Model to Traing Dataset
model.fit(features_train, target_train)

# Use Model to Predict Target Values on Validation Set
predicted_valid = model.predict(features_valid)

# Get Accuracy Score 
accuracy_valid = accuracy_score(target_valid, predicted_valid)

# Print Model Accuracy 
print("With n_estimators set to 5, the accuracy of the model is", accuracy_valid)

# Print Confusion Matrix
print('\nConfusion Matrix')
print(confusion_matrix(target_valid, predicted_valid))

# Print Recall Score
print('\nThe Recall Score is',recall_score(target_valid, predicted_valid))

# Print Precision Score 
print('The Precision Score is',precision_score(target_valid, predicted_valid))

# Print F1 Score 
print('The F1 Score is', f1_score(target_valid, predicted_valid))

# Calculate and Print AUC_ROC Score
probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)
print('The AUC_ROC Score is',auc_roc)

With n_estimators set to 5, the accuracy of the model is 0.848

Confusion Matrix
[[1524   58]
 [ 246  172]]

The Recall Score is 0.41148325358851673
The Precision Score is 0.7478260869565218
The F1 Score is 0.5308641975308641
The AUC_ROC Score is 0.8003322364640484


**Please note that,** in Section 8, I will compile all of the model performance/evaluation metrics before and after adjusting the training data for class imbalance. I will discuss there each metric and why I decide to move forward with a particualar model. 

## Address Class Imbalance in the Dataset Using Class Weight Adjustment

Class imbalance in the dataset can be handled in several ways - 
1. **Class Weight Adjustment** - In order to indicate that some observations are more important than others, we can assign a weight to the respective class by using the parameter class_weight='balanced' in our classification models. The algorithm will calculate how many times the class "0" occurs more often than the class "1". We’ll denote this number as N (an unknown number of times). Because of this, rare class will have a higher weight.
2. **Upsampling** - Determine the class with fewer observations. Call it the rare class. Duplicate the rarer class observations several times. Create a new training sample based on the data obtained. The most important tasks are repeated several times to make them easier to remember for the model.
3. **Downsampling** - Determine the class with more observations. Let's call it the majority class. We can randomly drop a portion of majority class observations to remove noise for the model to more easily find rare observations.

In this section, I will be performning **Class Weight Adjustment** and **Upsampling** to the dataset to try and improve model accuracy. Again, once the models have been trained on the new datasets, I will describe the results of our models' performance.

### Class Weight Adjustment
**This will be performed in Section 7** by specifying class_weight='balanced' as a parameter in each classification model.

### Upsampling

In [15]:
# Create function for upsampling
# Function will return an upsampled feature and target traning set to train the model on 
# Upsample function will create multiple positive class observations by 10 and then shuffle the results to create a new training set
def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)

    features_upsampled, target_upsampled = shuffle(
        features_upsampled, target_upsampled, random_state=12345
    )

    return features_upsampled, target_upsampled

# Return upsampled traning sets
features_upsampled, target_upsampled = upsample(features_train, target_train, 10)

## Re-Run Models to Account for Class Imbalance By Using Upsampled Training Data
This section will be very similiar to the model evaluation I performed in **Section 5, HOWEVER,** I will be doing the following in this section
1. Run models with parameter `class_weight='balanced'` for class weight adjustment
2. Use upsampled training data

**After I retrain the models and evaluate their results on the validation data set, I will provide a table with the results of the models before and after factoring for class imbalance**

### Train & Fit DecisionTreeClassifier Model Addressing Class Imbalance

#### Tune DecisionTreeClassifier Addressing Class Imbalance

In [16]:
# For our Decision Tree Classification Model The Parameter That We Need to Tune is Tree Depth
# To determine the most optimal tree depth for the highest model accuracy, we will use the following code
best_result = 0
best_depth = 0

# For tree depth between 1 to 6, calculate the optimal tree depth and evaluate model for greatest accuracy
for depth in range(1, 6):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth, class_weight='balanced') # create a model with the given depth
    model.fit(features_upsampled,target_upsampled) # train the model
    predictions = model.predict(features_valid) # get the model's predictions
    result = accuracy_score(target_valid,predictions) # calculate the accuracy
    if result > best_result:
        best_result = result
        best_depth = depth
    
    # Print model accuracy at each depth
    print("At a depth of", depth, "the accuracy of the model is", round(result,5))

# Print the accuracy of the model and which depth produced the greatest accuracy  
print()
print("The accuracy of the best model is", round(best_result,5), "at a depth of",best_depth)

At a depth of 1 the accuracy of the model is 0.7545
At a depth of 2 the accuracy of the model is 0.765
At a depth of 3 the accuracy of the model is 0.765
At a depth of 4 the accuracy of the model is 0.711
At a depth of 5 the accuracy of the model is 0.8105

The accuracy of the best model is 0.8105 at a depth of 5


#### Classification Metrics Using Tuned DecisionTreeClassifier Model Addressing Class Imbalance

In [17]:
# Use Decision Tree Model for Classification With Depth of 5
model = DecisionTreeClassifier(random_state=12345, max_depth=5, class_weight='balanced')

# Fit Model to Traing Dataset
model.fit(features_upsampled,target_upsampled)

# Use Model to Predict Target Values on Validation Set
predicted_valid = model.predict(features_valid)

# Get Accuracy Score 
accuracy_valid = accuracy_score(target_valid, predicted_valid)

# Print Model Accuracy 
print("At a depth of 5, the accuracy of the model is", accuracy_valid)

# Print Confusion Matrix
print('\nConfusion Matrix')
print(confusion_matrix(target_valid, predicted_valid))

# Print Recall Score
print('\nThe Recall Score is',recall_score(target_valid, predicted_valid))

# Print Precision Score 
print('The Precision Score is',precision_score(target_valid, predicted_valid))

# Print F1 Score 
print('The F1 Score is', f1_score(target_valid, predicted_valid))

# Calculate and Print AUC_ROC Score
probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)
print('The AUC_ROC Score is',auc_roc)

At a depth of 5, the accuracy of the model is 0.8105

Confusion Matrix
[[1341  241]
 [ 138  280]]

The Recall Score is 0.6698564593301436
The Precision Score is 0.5374280230326296
The F1 Score is 0.5963791267305644
The AUC_ROC Score is 0.8310244134068074


### Train & Fit LogisticRegression Model Addressing Class Imbalance

In [18]:
# Create a Logistic Regression Model
model = LogisticRegression(random_state=54321, solver="liblinear", class_weight='balanced')

# Fit the model to the traning dataset
model.fit(features_upsampled,target_upsampled)

# Score the model accuracy on the traning and validation data sets
score_train = model.score(features_train, target_train)
score_valid = model.score(features_valid, target_valid)

# Print the accuracy of the model on the traning and validation datasets
print("The accuracy of the logistic regression model on the training set is", round(score_train,5))
print("The accuracy of the logistic regression model on the validation set is:",round(score_valid,5))

# Print Confusion Matrix
print('\nConfusion Matrix')
print(confusion_matrix(target_valid, predicted_valid))

# Print Recall Score
print('\nThe Recall Score is',recall_score(target_valid, predicted_valid))

# Print Precision Score 
print('The Precision Score is',precision_score(target_valid, predicted_valid))

# Print F1 Score 
print('The F1 Score is', f1_score(target_valid, predicted_valid))

# Calculate and Print AUC_ROC Score
probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)
print('The AUC_ROC Score is',auc_roc)

The accuracy of the logistic regression model on the training set is 0.718
The accuracy of the logistic regression model on the validation set is: 0.701

Confusion Matrix
[[1341  241]
 [ 138  280]]

The Recall Score is 0.6698564593301436
The Precision Score is 0.5374280230326296
The F1 Score is 0.5963791267305644
The AUC_ROC Score is 0.7633756555507836


### Train & Fit RandomForestClassifier Model Addressing Class Imbalance

#### Tune RandomForestClassifier Addressing Class Imbalance

In [19]:
# For our Random Forest Classification Model The Parameter That We Need to Tune is n_estimators
# n_estimators is the number of trees to be used in the forest
# To determine the most optimal tree depth for the highest model accuracy, we will use the following code
best_score = 0
best_est = 0

# For n_estimators between 1 and 11, calculate the optimal number of trees to be used in the forest and evaluate model for greatest accuracy
for est in range(1, 11):
    model = RandomForestClassifier(random_state=54321, n_estimators=est, class_weight='balanced')
    model.fit(features_upsampled,target_upsampled)
    score = model.score(features_valid, target_valid)
    if score > best_score:
        best_score = score
        best_est = est

    # Print model accuracy at each depth
    print("With", est, "as the n_estimators, the accuracy of the model is", round(score,5))

# Print the accuracy of the model and which depth produced the greatest accuracy  
print()        
print("The accuracy of the best model on the validation set had n_estimators set to", best_est, "with a score of", round(best_score,5))

With 1 as the n_estimators, the accuracy of the model is 0.747
With 2 as the n_estimators, the accuracy of the model is 0.814
With 3 as the n_estimators, the accuracy of the model is 0.804
With 4 as the n_estimators, the accuracy of the model is 0.826
With 5 as the n_estimators, the accuracy of the model is 0.822
With 6 as the n_estimators, the accuracy of the model is 0.8295
With 7 as the n_estimators, the accuracy of the model is 0.825
With 8 as the n_estimators, the accuracy of the model is 0.836
With 9 as the n_estimators, the accuracy of the model is 0.8305
With 10 as the n_estimators, the accuracy of the model is 0.839

The accuracy of the best model on the validation set had n_estimators set to 10 with a score of 0.839


#### Run Classification Metrics on Tuned RandomForestClassifier Model Addressing Class Imbalance

In [20]:
# Use RandomForestClassifier for Classification With n_estimators of 10
model = RandomForestClassifier(random_state=54321, n_estimators=10, class_weight='balanced')

# Fit Model to Traing Dataset
model.fit(features_upsampled,target_upsampled)

# Use Model to Predict Target Values on Validation Set
predicted_valid = model.predict(features_valid)

# Get Accuracy Score 
accuracy_valid = accuracy_score(target_valid, predicted_valid)

# Print Model Accuracy 
print("With n_estimators set to 10, the accuracy of the model is", accuracy_valid)

# Print Confusion Matrix
print('\nConfusion Matrix')
print(confusion_matrix(target_valid, predicted_valid))

# Print Recall Score
print('\nThe Recall Score is',recall_score(target_valid, predicted_valid))

# Print Precision Score 
print('The Precision Score is',precision_score(target_valid, predicted_valid))

# Print F1 Score 
print('The F1 Score is', f1_score(target_valid, predicted_valid))

# Calculate and Print AUC_ROC Score
probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)
print('The AUC_ROC Score is',auc_roc)

With n_estimators set to 10, the accuracy of the model is 0.839

Confusion Matrix
[[1461  121]
 [ 201  217]]

The Recall Score is 0.5191387559808612
The Precision Score is 0.6420118343195266
The F1 Score is 0.5740740740740741
The AUC_ROC Score is 0.8138039184848687


## Comparing Model Performance Before and After Adjusting For Class Imbalance

The two tables below describe how each classification model performed for the following metrics - 
  - **Accuracy -**  Ratio of the total number of correct predictions and the total number of predictions.
  - **Recall -** What proportion of actual positives were identified correctly?
    - Example - "for all the patients who actually have heart disease, recall tells us how many we correctly identified as having a heart disease"
  - **Precision -** What proportion of positive identifications were actually correct?
    -  Example - "measure of patients that we correctly identify as having a heart disease out of all the patients actually having it as predicted by the model"
  - **F1 Score -** The F1 score combines precision and recall using their harmonic mean, and maximizing the F1 score implies simultaneously maximizing both precision and recall. 
  - **AUC_ROC -** ROC AUC score shows how well the classifier distinguishes positive and negative classes. A random model would have an AUC of 0.5.

*__Before__ adjusting traning data to account for class imbalance*

| Model                         | Accuracy | Recall    | Precision  | F1      | AUC_ROC     |
| ----------------              | ------   | ----      | ------     | ----    | ----        |
| DecisionTreeClassifier        |  **0.853**   | 0.414     |   0.779    | 0.541   | 0.823   |
| LogisticRegression            |  0.802   | 0.414     |   0.779    | 0.541   | 0.758       |
| RandomForestClassifier        |  0.848   | 0.411     |   0.747    | 0.530   | 0.800       |

*__After__ adjusting traning data to account for class imbalance*

| Model                         | Accuracy | Recall    | Precision  | F1      | AUC_ROC     |
| ----------------              | ------   | ----      | ------     | ----    | ----        |
| DecisionTreeClassifier        |  0.810   | **0.669**     |   0.537    | **0.596**   | 0.831       |
| LogisticRegression            |  0.701   | 0.669     |   0.537    | 0.596   | 0.763       |
| RandomForestClassifier        |  0.839   | 0.519     |   0.642    | 0.574   | 0.813       |

The model with the highest metrics after traning the model and fitting to the validation set was the **DecisionTreeClassifier**. However, there were some interested results which I will describe below - 
 - **Without** class imbalance adjustment, the DecisionTreeClassifier model was optimized for higher accuracy (.853 vs .810 after class balancing)
 - **With** class imbalance adjustment, the DecisionTreeClassifier model had higher Recall, Precision, and consequnely a higher F1 Score. Accuracy of the model dropped from 85.3% to 81.0% but recall increased from 41.4% to 66.9% and AUC_ROC slightly increased. 
 
In this situation, I will be moving forward with the DecisionTreeClassifier and I will be training the model on the upsampled data set using the class balancing parameter because this results in better model performance metrics.

## Run DecisionTreeClassifier on Upsampled Dataset Using Class Balancing

With an F1 score greater than 0.59 with an upsampled DTC, we will check now check how the DTC performs on the test dataset (test dataset has had features scaled with StandardScaler)

### Train & Fit DecisionTreeClassifier on Test Data and Evaluate Performance Metrics

In this section, we will train our model with best hyperparameters on upsampled data on train set without tuning hyperparameters to avoid overfitting. I will then check the model on test dataset (test set is scaled with fitted StandardScaler).

In [21]:
# Use Decision Tree Model for Classification With Depth of 5
model = DecisionTreeClassifier(random_state=12345, max_depth=5, class_weight='balanced')

# Fit Model to Upsampled Traing Dataset
model.fit(features_upsampled,target_upsampled)

# Use Model to Predict Target Values on Scaled Test Set
predicted_test = model.predict(features_test)

# Get Accuracy Score 
accuracy_valid = accuracy_score(target_test, predicted_test)

# Print Model Accuracy 
print("At a depth of 5, the accuracy of the model is", accuracy_valid)

# Print Confusion Matrix
print('\nConfusion Matrix')
print(confusion_matrix(target_test, predicted_test))

# Print Recall Score
print('\nThe Recall Score is',recall_score(target_test, predicted_test))

# Print Precision Score 
print('The Precision Score is',precision_score(target_test, predicted_test))

# Print F1 Score 
print('The F1 Score is', f1_score(target_test, predicted_test))

# Calculate and Print AUC_ROC Score
probabilities_test = model.predict_proba(features_test)
probabilities_one_test = probabilities_test[:, 1]
auc_roc = roc_auc_score(target_test, probabilities_one_test)
print('The AUC_ROC Score is',auc_roc)

At a depth of 5, the accuracy of the model is 0.798

Confusion Matrix
[[1316  261]
 [ 143  280]]

The Recall Score is 0.6619385342789598
The Precision Score is 0.5175600739371534
The F1 Score is 0.5809128630705395
The AUC_ROC Score is 0.8355347481752318


# Conclusion

**Restate Project Overview -** Beta Bank customers are leaving: little by little, chipping away every month. The bankers figured out it’s cheaper to save the existing customers rather than to attract new ones. My responsability will be to predict whether a customer will leave the bank soon. I have the data on clients’ past behavior and termination of contracts with the bank to do this task.

**Project Objective -** I will build a model with the maximum possible F1 score, with the requirement that the F1 score for my model must be at least 0.59.

**Findings & Conclusion** - After training the DecisionTreeClassifier on the upsampled training set and fitting the model to the scaled test data, the model produces the following results - 

*__After__ adjusting traning data to account for class imbalance **before** fitting model on test data*

| Model                         | Accuracy | Recall    | Precision  | F1      | AUC_ROC     |
| ----------------              | ------   | ----      | ------     | ----    | ----        |
| DecisionTreeClassifier        |  0.810   | 0.669     |   0.537    | 0.596   | 0.831       |


*__After__ traning model on upsampled traning data and fitting model to the test data*


| Model                         | Accuracy | Recall    | Precision  | F1      | AUC_ROC     |
| ----------------              | ------   | ----      | ------     | ----    | ----        |
| DecisionTreeClassifier        |  0.798   | 0.661     |   0.517    | 0.581   | 0.835       |

When fitting the model to the test set we see a small increase in accuracy and precision, but a very small decrease in model recall, F1 Score and AUC_ROC score. 