Hello Harvey!

My name is Dmitry. I'm glad to review your work today.
I will mark your mistakes and give you some hints on how to fix them. We are getting ready for a real job, where your team leader or senior colleague will do exactly the same. Don't worry, and enjoy learning!

Below you will find my comments - **please do not move, modify or delete them**.

You can find my comments in green, yellow or red boxes like this:

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Success. Everything is done successfully.
</div>

<div class="alert alert-block alert-warning">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Remarks. Some recommendations.
</div>

<div class="alert alert-block alert-danger">

<b>Reviewer's comment</b> <a class="tocSkip"></a>

Needs fixing. The block requires some corrections. Work can't be accepted with the red comments.
</div>

You can answer me by using this:

<div class="alert alert-block alert-info">
<b>Student answer.</b> <a class="tocSkip"></a>

Text here.
</div>

# Introduction

Beta Bank customers are leaving: little by little, chipping away every month. The bankers figured out it’s cheaper to save the existing customers rather than to attract new ones.


We need to predict whether a customer will leave the bank soon. We have the data on clients’ past behavior and termination of contracts with the bank.

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Great start with an introduction!
</div>

## Initialization

In [1]:
# Load necessary libraries
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.utils import shuffle

In [2]:
# Load dataset and create Data frame
df = pd.read_csv('/datasets/Churn.csv')

The relevant libraries are loaded and the DataFrame is created. Now we can move on to preprocessing.

## Data Preprocessing

### Data inspection and visualization

In [3]:
# Visualize and inspect data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


In [4]:
# Visualize and inspect data sample
df.head(10)

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0
5,6,15574012,Chu,645,Spain,Male,44,8.0,113755.78,2,1,0,149756.71,1
6,7,15592531,Bartlett,822,France,Male,50,7.0,0.0,2,1,1,10062.8,0
7,8,15656148,Obinna,376,Germany,Female,29,4.0,115046.74,4,1,0,119346.88,1
8,9,15792365,He,501,France,Male,44,4.0,142051.07,2,0,1,74940.5,0
9,10,15592389,H?,684,France,Male,27,2.0,134603.88,1,1,1,71725.73,0


From our initial assessment, we can see that the data is fairly clean and all columns have the right data types for the information stored in them.
However, we can observe missing values in the `Tenure` column. This requires a closer look.
The column names are also irregular, so let's deal with that too.

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Good initial conclusion!
</div>

### Standardizing column names

In [5]:
# Convert column names to lower case
df.columns = df.columns.str.lower()

In [6]:
# Visualize column names
print('Column names:', df.columns)
df.head(10)

Column names: Index(['rownumber', 'customerid', 'surname', 'creditscore', 'geography',
       'gender', 'age', 'tenure', 'balance', 'numofproducts', 'hascrcard',
       'isactivemember', 'estimatedsalary', 'exited'],
      dtype='object')


Unnamed: 0,rownumber,customerid,surname,creditscore,geography,gender,age,tenure,balance,numofproducts,hascrcard,isactivemember,estimatedsalary,exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0
5,6,15574012,Chu,645,Spain,Male,44,8.0,113755.78,2,1,0,149756.71,1
6,7,15592531,Bartlett,822,France,Male,50,7.0,0.0,2,1,1,10062.8,0
7,8,15656148,Obinna,376,Germany,Female,29,4.0,115046.74,4,1,0,119346.88,1
8,9,15792365,He,501,France,Male,44,4.0,142051.07,2,0,1,74940.5,0
9,10,15592389,H?,684,France,Male,27,2.0,134603.88,1,1,1,71725.73,0


In [7]:
# Rename columns in snake_case
df = df.rename(columns={
    'rownumber':'row_number',
    'customerid' : 'customer_id',
    'creditscore' : 'credit_score',
    'numofproducts' : 'num_of_products',
    'hascrcard' : 'has_cr_card',
    'isactivemember' : 'is_active_member',
    'estimatedsalary' : 'estimated_salary'
})

In [8]:
print(df.columns)
df.head(10)

Index(['row_number', 'customer_id', 'surname', 'credit_score', 'geography',
       'gender', 'age', 'tenure', 'balance', 'num_of_products', 'has_cr_card',
       'is_active_member', 'estimated_salary', 'exited'],
      dtype='object')


Unnamed: 0,row_number,customer_id,surname,credit_score,geography,gender,age,tenure,balance,num_of_products,has_cr_card,is_active_member,estimated_salary,exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0
5,6,15574012,Chu,645,Spain,Male,44,8.0,113755.78,2,1,0,149756.71,1
6,7,15592531,Bartlett,822,France,Male,50,7.0,0.0,2,1,1,10062.8,0
7,8,15656148,Obinna,376,Germany,Female,29,4.0,115046.74,4,1,0,119346.88,1
8,9,15792365,He,501,France,Male,44,4.0,142051.07,2,0,1,74940.5,0
9,10,15592389,H?,684,France,Male,27,2.0,134603.88,1,1,1,71725.73,0


All column names have been standardized with lowercase letters and snake_case!

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Good job.
</div>

### Working with missing values

In [9]:
# Visualize rows with missing data
display(df[df['tenure'].isna()])

Unnamed: 0,row_number,customer_id,surname,credit_score,geography,gender,age,tenure,balance,num_of_products,has_cr_card,is_active_member,estimated_salary,exited
30,31,15589475,Azikiwe,591,Spain,Female,39,,0.00,3,1,0,140469.38,1
48,49,15766205,Yin,550,Germany,Male,38,,103391.38,1,0,1,90878.13,0
51,52,15768193,Trevisani,585,Germany,Male,36,,146050.97,2,0,0,86424.57,0
53,54,15702298,Parkhill,655,Germany,Male,41,,125561.97,1,0,0,164040.94,1
60,61,15651280,Hunter,742,Germany,Male,35,,136857.00,1,0,0,84509.57,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9944,9945,15703923,Cameron,744,Germany,Male,41,,190409.34,2,1,1,138361.48,0
9956,9957,15707861,Nucci,520,France,Female,46,,85216.61,1,1,0,117369.52,1
9964,9965,15642785,Douglas,479,France,Male,34,,117593.48,2,0,0,113308.29,0
9985,9986,15586914,Nepean,659,France,Male,36,,123841.49,2,1,0,96833.00,0


In [10]:
# Visualize values in tenure column
display(df['tenure'].value_counts())

1.0     952
2.0     950
8.0     933
3.0     928
5.0     927
7.0     925
4.0     885
9.0     882
6.0     881
10.0    446
0.0     382
Name: tenure, dtype: int64

In [11]:
display(df[df['tenure'] == 0.0])

Unnamed: 0,row_number,customer_id,surname,credit_score,geography,gender,age,tenure,balance,num_of_products,has_cr_card,is_active_member,estimated_salary,exited
29,30,15656300,Lucciano,411,France,Male,29,0.0,59697.17,2,1,1,53483.21,0
35,36,15794171,Lombardo,475,France,Female,45,0.0,134264.04,1,1,0,27822.99,1
57,58,15647091,Endrizzi,725,Germany,Male,19,0.0,75888.20,1,0,0,45613.75,0
72,73,15812518,Palermo,657,Spain,Female,37,0.0,163607.18,1,0,1,44203.55,0
127,128,15782688,Piccio,625,Germany,Male,56,0.0,148507.24,1,1,0,46824.08,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9793,9794,15772363,Hilton,772,Germany,Female,42,0.0,101979.16,1,1,0,90928.48,0
9799,9800,15722731,Manna,653,France,Male,46,0.0,119556.10,1,1,0,78250.13,1
9843,9844,15778304,Fan,646,Germany,Male,24,0.0,92398.08,1,1,1,18897.29,0
9868,9869,15587640,Rowntree,718,France,Female,43,0.0,93143.39,1,1,0,167554.86,0


According to the data documentation, the `tenure` column contains the period of maturation of a customer's fixed deposit (in years). Since not all customers have fixed deposits, we assume that the missing values belong to customers who don't have fixed deposit accounts, so we fill these values with 0.

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Right thinking!
</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Pro tip: we can also try filling the missing values with the median (computed from the train set only).
</div>
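
The median-based alternative from the tip above can be sketched as follows. This is a minimal illustration on made-up toy data, not the project's actual pipeline: the frame `toy` and the names `toy_train`, `toy_test` and `train_median` are hypothetical. The key point is that the split happens first and the median comes from the training part only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data; the values are made up for illustration.
toy = pd.DataFrame({
    'tenure': [1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0, 8.0, 9.0, 10.0],
    'exited': [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
})

# Split first, so the fill value never sees the test rows.
toy_train, toy_test = train_test_split(toy, test_size=0.3, random_state=12345)
toy_train = toy_train.copy()
toy_test = toy_test.copy()

# Median of the training part only (NaN values are skipped by default).
train_median = toy_train['tenure'].median()

# The same training median fills the gaps in both parts.
toy_train['tenure'] = toy_train['tenure'].fillna(train_median)
toy_test['tenure'] = toy_test['tenure'].fillna(train_median)

print(train_median)
print(toy_train['tenure'].isna().sum(), toy_test['tenure'].isna().sum())
```

Whether the median or 0 is the better fill value depends on what a missing `tenure` really means, so it is worth comparing both on the validation set.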

In [12]:
# Fill missing values
df['tenure'] = df['tenure'].fillna(0)

In [13]:
# Visualize values in tenure column
print(df['tenure'].value_counts())

0.0     1291
1.0      952
2.0      950
8.0      933
3.0      928
5.0      927
7.0      925
4.0      885
9.0      882
6.0      881
10.0     446
Name: tenure, dtype: int64


In [14]:
# Visualize data frame info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   row_number        10000 non-null  int64  
 1   customer_id       10000 non-null  int64  
 2   surname           10000 non-null  object 
 3   credit_score      10000 non-null  int64  
 4   geography         10000 non-null  object 
 5   gender            10000 non-null  object 
 6   age               10000 non-null  int64  
 7   tenure            10000 non-null  float64
 8   balance           10000 non-null  float64
 9   num_of_products   10000 non-null  int64  
 10  has_cr_card       10000 non-null  int64  
 11  is_active_member  10000 non-null  int64  
 12  estimated_salary  10000 non-null  float64
 13  exited            10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


### Checking for duplicates

In [15]:
print(df.duplicated().value_counts())

False    10000
dtype: int64


The data is fully clean and ready for use!

<div class="alert alert-block alert-warning">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

<s>It's also a good practice to check for duplicates.</s>
</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Good job.
</div>

## Prepare data for model

### Dealing with non informative columns

In [16]:
# Drop non informative columns
df_new = df.drop(['customer_id', 'surname', 'row_number'], axis=1)

### Split features and target

In [17]:
# Get target and features of data
features = df_new.drop('exited', axis=1)
target = df_new['exited']

In [18]:
# Visualize features
features.head(10)

Unnamed: 0,credit_score,geography,gender,age,tenure,balance,num_of_products,has_cr_card,is_active_member,estimated_salary
0,619,France,Female,42,2.0,0.0,1,1,1,101348.88
1,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58
2,502,France,Female,42,8.0,159660.8,3,1,0,113931.57
3,699,France,Female,39,1.0,0.0,2,0,0,93826.63
4,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1
5,645,Spain,Male,44,8.0,113755.78,2,1,0,149756.71
6,822,France,Male,50,7.0,0.0,2,1,1,10062.8
7,376,Germany,Female,29,4.0,115046.74,4,1,0,119346.88
8,501,France,Male,44,4.0,142051.07,2,0,1,74940.5
9,684,France,Male,27,2.0,134603.88,1,1,1,71725.73


In [19]:
# Check features info
features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   credit_score      10000 non-null  int64  
 1   geography         10000 non-null  object 
 2   gender            10000 non-null  object 
 3   age               10000 non-null  int64  
 4   tenure            10000 non-null  float64
 5   balance           10000 non-null  float64
 6   num_of_products   10000 non-null  int64  
 7   has_cr_card       10000 non-null  int64  
 8   is_active_member  10000 non-null  int64  
 9   estimated_salary  10000 non-null  float64
dtypes: float64(3), int64(5), object(2)
memory usage: 781.4+ KB


In [20]:
# Visualize target
target.head(10)

0    1
1    0
2    1
3    0
4    0
5    1
6    0
7    1
8    0
9    0
Name: exited, dtype: int64

### One Hot Encoding (OHE)

In [21]:
# Transform dataframe with OHE
features = pd.get_dummies(features, drop_first=True)

<div class="alert alert-block alert-danger">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

<s>Let's talk about how we can improve our code. <br>
1. Seems we have non informative columns in our dataset (like id).
2. In this case it's better to use OHE/get_dummies for categorical columns because we don't have many unique values in them.</s>
</div>

<div class="alert alert-block alert-info">
<b>Student answer.</b> <a class="tocSkip"></a>

Yes, OHE is a good option for logistic regression models. But since we are also going to test decision trees and random forests, wouldn't ordinal encoding be better? Or am I meant to use only a Logistic Regression model?
</div>

<div class="alert alert-block alert-warning">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Trees work well with OHE columns too.
</div>

<div class="alert alert-block alert-danger">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

<s>We have some non informative columns - customer_id, surname, row_number. <br>
Our recommendation here is to drop these columns. We got 2944 columns after OHE because of them =)</s>
</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Well done!
</div>
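
To make the effect of `drop_first=True` concrete, here is a small sketch on a hypothetical four-row frame (`toy` is made up; only the column names mirror the dataset). Each categorical column with k categories becomes k-1 indicator columns, and the dropped category is represented by zeros in all of them.

```python
import pandas as pd

# Hypothetical mini-sample with the same categorical columns as the dataset.
toy = pd.DataFrame({
    'geography': ['France', 'Spain', 'Germany', 'France'],
    'gender': ['Female', 'Female', 'Male', 'Male'],
    'age': [42, 41, 29, 50],
})

# drop_first=True removes the first category of each column
# ('France' and 'Female' here), which avoids redundant columns.
encoded = pd.get_dummies(toy, drop_first=True)

print(list(encoded.columns))
print(encoded)
```

A row from France has 0 (or `False`, depending on the pandas version) in both `geography_Germany` and `geography_Spain`, so no information is lost by dropping the first category.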

In [22]:
# Visualize transformed dataframe
features

Unnamed: 0,credit_score,age,tenure,balance,num_of_products,has_cr_card,is_active_member,estimated_salary,geography_Germany,geography_Spain,gender_Male
0,619,42,2.0,0.00,1,1,1,101348.88,0,0,0
1,608,41,1.0,83807.86,1,0,1,112542.58,0,1,0
2,502,42,8.0,159660.80,3,1,0,113931.57,0,0,0
3,699,39,1.0,0.00,2,0,0,93826.63,0,0,0
4,850,43,2.0,125510.82,1,1,1,79084.10,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...
9995,771,39,5.0,0.00,2,1,0,96270.64,0,0,1
9996,516,35,10.0,57369.61,1,1,1,101699.77,0,0,1
9997,709,36,7.0,0.00,1,0,1,42085.58,0,0,0
9998,772,42,3.0,75075.31,2,1,0,92888.52,1,0,1


The categorical features have been one-hot encoded. Next, we split the data into training, validation and test sets in the ratio 60%:20%:20%.

## Split data into training, validation and test datasets 

### Split data into training, validation and test datasets

In [23]:
# Split data into training and test datasets
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size = 0.2, random_state=12345)

In [24]:
# Split training dataset into training and validation datasets
features_train, features_valid, target_train, target_valid =  train_test_split(features_train, target_train, test_size = 0.25, random_state=12345)
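
The two-step split gives 60%:20%:20% because the second call takes 25% of the remaining 80%, which is 20% of the whole. The sketch below checks this on synthetic arrays with the same class counts as the dataset. The `stratify` argument is an addition that the notebook does not use; it is shown only as an optional way to keep the class shares equal across the three parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 10000 rows with 2037 positives, as in the dataset.
X = np.arange(10000).reshape(-1, 1)
y = np.array([1] * 2037 + [0] * 7963)

# Step 1: hold out 20% for the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345, stratify=y  # stratify is optional
)

# Step 2: 25% of the remaining 80% becomes the validation set.
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=12345, stratify=y_rest
)

print(len(X_train), len(X_valid), len(X_test))
print(round(y_train.mean(), 3), round(y_valid.mean(), 3), round(y_test.mean(), 3))
```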

## Feature Scaling

### Highlight numeric columns

In [25]:
# highlight numerical columns and save to a single variable
numeric = ['credit_score', 'age', 'balance', 'num_of_products', 'has_cr_card','is_active_member', 'estimated_salary']

### Create scaler variable

In [26]:
# Create scaler variable
scaler = StandardScaler()
scaler.fit(features_train[numeric])

StandardScaler()

<div class="alert alert-block alert-danger">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

<s>OK, no worries, but we have a little error here. <br>
We need to fit our scaler only on train data.</s>
</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Well done!
</div>
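
What "fit only on train data" means in numbers can be shown with a tiny sketch. The arrays are made up for illustration. The scaler learns the mean and standard deviation from the training values and reuses exactly those statistics for any other set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data.
train = np.array([[10.0], [20.0], [30.0]])
valid = np.array([[20.0], [40.0]])

scaler = StandardScaler()
scaler.fit(train)                      # statistics come from train only

train_scaled = scaler.transform(train)
valid_scaled = scaler.transform(valid)  # reuses the train mean and std

print(scaler.mean_, scaler.scale_)
print(valid_scaled.ravel())
```

The validation values are not centered around zero after scaling, and that is expected: they are expressed relative to the training distribution, which is what the model saw during fitting.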

### Transform columns with scaler

In [27]:
# Transform training columns
features_train[numeric]= scaler.transform(features_train[numeric])

In [28]:
# Transform validation columns
features_valid[numeric]= scaler.transform(features_valid[numeric])

In [29]:
# Transform test columns
features_test[numeric]= scaler.transform(features_test[numeric])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  features_test[numeric]= scaler.transform(features_test[numeric])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, value[:, i].tolist(), pi)
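
The warning above appears when values are assigned to a frame that pandas considers a possible slice of another frame. One common way to avoid it is to take explicit copies of the split frames before assigning the scaled columns. The sketch below uses a made-up frame `toy`; the names `toy_train` and `toy_test` are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up frame with one numeric and one already-encoded column.
toy = pd.DataFrame({
    'age': [42.0, 41.0, 39.0, 43.0, 44.0, 50.0, 29.0, 27.0],
    'gender_Male': [0, 0, 0, 0, 1, 1, 0, 1],
})
numeric = ['age']

toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=12345)

# Explicit copies make the split frames independent of the original frame.
toy_train = toy_train.copy()
toy_test = toy_test.copy()

scaler = StandardScaler().fit(toy_train[numeric])
toy_train[numeric] = scaler.transform(toy_train[numeric])
toy_test[numeric] = scaler.transform(toy_test[numeric])

print(toy_train.head())
```

The original frame stays untouched, which also makes it easier to rerun cells without scaling the same columns twice.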


<div class="alert alert-block alert-danger">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

<s>We also need to transform the test set.</s>
</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Good job.
</div>

### Visualize transformed dataframes

In [30]:
features_train

Unnamed: 0,credit_score,age,tenure,balance,num_of_products,has_cr_card,is_active_member,estimated_salary,geography_Germany,geography_Spain,gender_Male
492,-0.134048,-0.078068,4.0,0.076163,0.816929,-1.550255,0.968496,0.331571,0,0,0
6655,-1.010798,0.494555,0.0,0.136391,-0.896909,0.645055,0.968496,-0.727858,0,0,1
4287,0.639554,1.353490,1.0,0.358435,-0.896909,0.645055,0.968496,-0.477006,1,0,1
42,-0.990168,2.116987,2.0,0.651725,-0.896909,0.645055,0.968496,-0.100232,0,0,0
8178,0.567351,0.685430,7.0,0.813110,0.816929,0.645055,0.968496,0.801922,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...
8819,-0.618839,-0.268942,10.0,0.744283,-0.896909,-1.550255,0.968496,0.803658,1,0,1
1537,1.743227,-1.032439,8.0,-1.232442,-0.896909,0.645055,0.968496,-1.098797,0,0,0
1408,0.567351,-0.650691,9.0,-1.232442,0.816929,-1.550255,-1.032529,-1.202257,0,0,1
7661,-0.412545,0.494555,2.0,0.615524,-0.896909,0.645055,0.968496,-0.038931,1,0,1


In [31]:
features_valid

Unnamed: 0,credit_score,age,tenure,balance,num_of_products,has_cr_card,is_active_member,estimated_salary,geography_Germany,geography_Spain,gender_Male
2358,0.175393,0.399118,1.0,1.385698,-0.896909,-1.550255,0.968496,-1.466761,0,0,1
8463,-1.299609,0.971741,2.0,-1.232442,-0.896909,0.645055,-1.032529,0.254415,0,1,1
163,0.711757,-0.268942,2.0,-1.232442,0.816929,0.645055,0.968496,0.122863,0,1,0
3074,-0.391916,0.494555,6.0,0.672529,-0.896909,0.645055,-1.032529,0.585847,1,0,0
5989,0.165078,1.353490,10.0,0.536522,-0.896909,-1.550255,-1.032529,1.462457,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...
207,-0.350657,-0.459816,5.0,0.933102,-0.896909,0.645055,0.968496,0.905307,0,0,1
8746,0.082561,-0.459816,8.0,0.476293,0.816929,-1.550255,0.968496,1.432571,0,0,1
1809,-0.134048,1.067178,6.0,0.618283,0.816929,0.645055,0.968496,-0.813904,0,0,0
5919,-0.072160,0.971741,8.0,-1.232442,0.816929,0.645055,-1.032529,1.080287,0,1,1


In [32]:
features_test

Unnamed: 0,credit_score,age,tenure,balance,num_of_products,has_cr_card,is_active_member,estimated_salary,geography_Germany,geography_Spain,gender_Male
7867,-0.123733,0.685430,3.0,-1.232442,-0.896909,0.645055,0.968496,0.980212,0,1,0
1402,1.083087,-0.937002,8.0,0.858518,-0.896909,0.645055,-1.032529,-0.390486,0,0,1
8606,1.598822,0.303681,5.0,-1.232442,0.816929,0.645055,0.968496,-0.435169,0,1,1
8885,0.165078,0.589993,4.0,0.412100,0.816929,0.645055,0.968496,1.017079,0,1,1
6494,0.484834,-1.032439,7.0,-1.232442,0.816929,0.645055,0.968496,-1.343558,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...
2563,1.970151,-1.127876,5.0,-1.232442,0.816929,0.645055,0.968496,-0.564021,0,0,0
1167,-1.072686,-0.364379,10.0,-1.232442,-0.896909,0.645055,0.968496,-1.193686,0,0,0
1009,-0.020586,3.071359,0.0,-1.232442,-0.896909,0.645055,0.968496,1.312849,0,1,1
1002,0.753016,0.017370,1.0,-0.415981,0.816929,0.645055,0.968496,1.463205,1,0,1


<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Correct.
</div>

## Investigate class imbalance

In [33]:
# Get number of negative observations
neg_obs = features[target==0]
print(len(neg_obs))

7963


In [34]:
# Get number of positive observations
pos_obs = features[target==1]
print(len(pos_obs))

2037


In [35]:
# Percentage imbalance
neg_fraction = 1 - len(pos_obs)/(len(neg_obs)+len(pos_obs))
pos_fraction = len(pos_obs)/(len(neg_obs)+len(pos_obs))

print(f'Percentage of negative observations: {neg_fraction:.1%}')
print(f'Percentage of positive observations: {pos_fraction:.1%}')

Percentage of negative observations: 79.6%
Percentage of positive observations: 20.4%


From the above analysis, we can see that the classes are imbalanced: negative observations make up 79.6% of the dataset, while positive observations make up only 20.4%.

<div class="alert alert-block alert-danger">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

<s>Something went wrong! <br>
Please check calculation:

Pos fraction: 2037/(2037 + 7963) <br>
Neg fraction: 7963/(2037 + 7963)</s>
</div>

<div class="alert alert-block alert-warning">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Pro tip: we can use value_counts() to group by unique values and count them.
    
Please check example below:
</div>

In [36]:
# reviewer's example
target.value_counts()

0    7963
1    2037
Name: exited, dtype: int64

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Correct!
</div>
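
As a small extension of the `value_counts()` tip, the class shares can be obtained directly with `normalize=True`. The series below is a synthetic stand-in built from the counts printed above; `target_toy` is a hypothetical name.

```python
import pandas as pd

# Synthetic target with the same class counts as the dataset.
target_toy = pd.Series([0] * 7963 + [1] * 2037, name='exited')

# normalize=True returns fractions instead of raw counts.
shares = target_toy.value_counts(normalize=True)

print(shares)
```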

## Working With Imbalanced Models

### Logistic Regression model

In [37]:
# Create logistic Regression model
p_model_1 = LogisticRegression(random_state=12345, solver='liblinear')

In [38]:
# Train model
p_model_1.fit(features_train, target_train)

LogisticRegression(random_state=12345, solver='liblinear')

In [39]:
# Test model with validation set
prediction_1 = p_model_1.predict(features_valid)

In [40]:
# Check f1_score
print(f1_score(target_valid, prediction_1))

0.3056603773584906


### Random Forest model

In [41]:
# Create Random Forest model
p_model_2 = RandomForestClassifier(random_state=12345)

In [42]:
# Train model
p_model_2.fit(features_train, target_train)

RandomForestClassifier(random_state=12345)

In [43]:
# Test model with validation set
prediction_2 = p_model_2.predict(features_valid)

In [44]:
# Check f1_score
print(f1_score(target_valid, prediction_2))

0.5650793650793651


### Decision Tree model

In [45]:
# Create Decision Tree model
p_model_3 = DecisionTreeClassifier(random_state=12345)

In [46]:
# Train model
p_model_3.fit(features_train, target_train)

DecisionTreeClassifier(random_state=12345)

In [47]:
# Test model with validation set
prediction_3 = p_model_3.predict(features_valid)

In [48]:
# Check f1_score
print(f1_score(target_valid, prediction_3))

0.48088779284833544


The above results show the F1 scores of the basic models. The Random Forest model has the highest F1 score (about 0.565), followed by the Decision Tree (about 0.481) and Logistic Regression (about 0.306), but none of them reaches the required 0.59. Because of the class imbalance, the models are likely biased toward the majority class and may do little better than a constant model. Let's balance the classes and improve the quality of our models.

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

<s>Well done!</s>
</div>

<div class="alert alert-block alert-danger">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

<s>Please check my comments above about data preparation.</s>
</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

F1 much better.
</div>
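
The comparison with a constant model can be made explicit. The sketch below uses a synthetic target with the class counts of the full dataset (an assumption; the validation set may have slightly different shares) and evaluates two constant predictors.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Synthetic target with 2037 positives and 7963 negatives.
y_true = np.array([1] * 2037 + [0] * 7963)

all_zeros = np.zeros_like(y_true)  # "nobody leaves"
all_ones = np.ones_like(y_true)    # "everybody leaves"

acc_zeros = accuracy_score(y_true, all_zeros)
# zero_division=0 returns 0 instead of warning when nothing is predicted positive.
f1_zeros = f1_score(y_true, all_zeros, zero_division=0)
f1_ones = f1_score(y_true, all_ones)

print(f'accuracy, all zeros: {acc_zeros:.4f}')
print(f'F1, all zeros: {f1_zeros:.4f}')
print(f'F1, all ones: {f1_ones:.4f}')
```

Predicting "nobody leaves" looks good on accuracy but has an F1 of zero, which is why F1 is the more informative metric here. The "everybody leaves" baseline gives a simple reference value that any useful model should beat.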

## Balancing Observations

### Downsample negative observations

In [49]:
# Create downsample function

def downsample(features, target, fraction):
    # Split features and target by class
    neg_obs_features = features[target == 0]
    pos_obs_features = features[target == 1]

    neg_obs_target = target[target == 0]
    pos_obs_target = target[target == 1]

    # Keep a random fraction of the negatives and all positives
    # (the same random_state keeps the sampled features and target aligned)
    features_downsampled = pd.concat([neg_obs_features.sample(frac=fraction, random_state=12345), pos_obs_features])
    target_downsampled = pd.concat([neg_obs_target.sample(frac=fraction, random_state=12345), pos_obs_target])

    # Shuffle features and target in the same order
    features_downsampled, target_downsampled = shuffle(features_downsampled, target_downsampled, random_state=12345)

    return features_downsampled, target_downsampled

In [50]:
# Downsample negative observations
features_downsampled, target_downsampled = downsample(features_train, target_train, 0.25)
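
The downsampling idea can be checked on a toy example. This is a simplified sketch, not the notebook's function: `downsample_sketch` and the toy data are hypothetical, and the target rows are picked by the sampled index rather than by a second `sample` call. With 8 negatives, 2 positives and a fraction of 0.25, two negatives are kept and the classes become balanced.

```python
import pandas as pd
from sklearn.utils import shuffle


def downsample_sketch(features, target, fraction, seed=12345):
    """Keep a random fraction of the negative rows and all positive rows."""
    neg_mask = target == 0
    features_neg = features[neg_mask].sample(frac=fraction, random_state=seed)
    # Select the matching target rows by index so both stay aligned.
    target_neg = target.loc[features_neg.index]
    features_out = pd.concat([features_neg, features[~neg_mask]])
    target_out = pd.concat([target_neg, target[~neg_mask]])
    return shuffle(features_out, target_out, random_state=seed)


# Made-up data: 8 negatives and 2 positives.
toy_features = pd.DataFrame({'age': range(20, 30)})
toy_target = pd.Series([0, 0, 0, 0, 1, 0, 0, 0, 0, 1])

f_ds, t_ds = downsample_sketch(toy_features, toy_target, 0.25)

print(t_ds.value_counts().to_dict())
```

Only the training set should be resampled; the validation and test sets keep their original class shares so that the metrics reflect real conditions.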

In [51]:
# Visualize results
features_downsampled

Unnamed: 0,credit_score,age,tenure,balance,num_of_products,has_cr_card,is_active_member,estimated_salary,geography_Germany,geography_Spain,gender_Male
228,1.021198,1.067178,2.0,0.322275,-0.896909,0.645055,-1.032529,-0.395064,1,0,1
420,1.072772,2.021550,3.0,0.627744,-0.896909,0.645055,-1.032529,-0.111185,1,0,0
1861,-0.897336,-0.078068,2.0,1.313352,0.816929,0.645055,0.968496,1.383558,1,0,0
1660,0.030987,-0.173505,7.0,0.562402,0.816929,0.645055,-1.032529,-1.555339,0,0,0
5823,1.021198,1.162616,7.0,1.143636,-0.896909,0.645055,-1.032529,0.808878,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...
4592,1.877318,-0.268942,8.0,1.060318,-0.896909,0.645055,-1.032529,-0.172790,0,0,1
5260,0.536407,-0.364379,3.0,1.242022,0.816929,0.645055,0.968496,-1.037781,1,0,0
7886,0.144449,-0.268942,8.0,-1.232442,-0.896909,-1.550255,0.968496,1.022085,0,0,0
4612,0.990254,0.971741,7.0,-0.757433,-0.896909,-1.550255,-1.032529,1.107043,0,0,0


In [52]:
# Check distribution

pos_obs = features_downsampled[target_downsampled == 1]
neg_obs = features_downsampled[target_downsampled == 0]

neg_fraction = 1 - len(pos_obs)/(len(neg_obs)+len(pos_obs))
pos_fraction = len(pos_obs)/(len(neg_obs)+len(pos_obs))

print(f'Percentage of negative observations: {neg_fraction:.1%}')
print(f'Percentage of positive observations: {pos_fraction:.1%}')

Percentage of negative observations: 49.5%
Percentage of positive observations: 50.5%


### Logistic Regression Model on downsampled data

In [53]:
lr_model = LogisticRegression(random_state=12345, solver='liblinear', class_weight='balanced')

In [54]:
# Train logistic regression model on downsampled data
lr_model.fit(features_downsampled, target_downsampled)

LogisticRegression(class_weight='balanced', random_state=12345,
                   solver='liblinear')

In [55]:
# Check F1 score of trained model
lr_ds_pred = lr_model.predict(features_valid)
print(f1_score(target_valid, lr_ds_pred))

0.4799286351471901


<div class="alert alert-block alert-danger">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

<s>Please add F1 score for fitted model.</s>
</div>
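
For reference, the weights behind `class_weight='balanced'` can be computed directly. scikit-learn defines them as `n_samples / (n_classes * count_of_class)`. The sketch below applies this to a synthetic target with the class counts of the full dataset, using `compute_class_weight` from `sklearn.utils.class_weight`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic target with the class counts of the full dataset.
y = np.array([0] * 7963 + [1] * 2037)

weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y)

# Same formula by hand: n_samples / (n_classes * count per class)
manual = len(y) / (2 * np.bincount(y))

print(weights)
print(manual)
```

When the training data has already been resampled to roughly 50%:50%, these weights come out close to 1 for both classes, so combining resampling with `class_weight='balanced'` adds little extra correction.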

### Random Forest Model on downsampled data

In [56]:
# Search n_estimators and max_depth for a class-weighted Random Forest and check f1_score
# Note: the search fits on features_train (not the downsampled data) and scores on the validation set

best_rf_model = None
opt_n_est = 0
opt_depth = 0
req_f1 = 0
for n in range(10, 100, 10):
    for depth in range(1, 20):
        rf_model = RandomForestClassifier(random_state=12345, n_estimators=n, max_depth=depth, class_weight='balanced')
        rf_model.fit(features_train, target_train)
        rf_prediction = rf_model.predict(features_valid)
        # Compute the validation F1 once and reuse it
        current_f1 = f1_score(target_valid, rf_prediction)
        if current_f1 > req_f1:
            req_f1 = current_f1
            best_rf_model = rf_model
            opt_n_est = n
            opt_depth = depth
print(f'opt_n_est: {opt_n_est}')
print(f'opt_depth: {opt_depth}')
print(f'f1_score:{req_f1}')

opt_n_est: 90
opt_depth: 10
f1_score:0.6045918367346939


In [57]:
# Train best rf model with downsampled data
best_rf_model.fit(features_downsampled, target_downsampled)

RandomForestClassifier(class_weight='balanced', max_depth=10, n_estimators=90,
                       random_state=12345)

In [58]:
# Check F1 Score of model
rf_ds_pred = best_rf_model.predict(features_valid)
print(f1_score(target_valid, rf_ds_pred))

0.5666337611056268


<div class="alert alert-block alert-danger">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

<s>Please add F1 score for fitted model.</s>
</div>
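
In the cells above, `best_rf_model` is refit in place, so the model that reached the reported validation F1 during the search is overwritten. One possible way to keep both versions is `sklearn.base.clone`, which copies the hyperparameters without the fitted state. The sketch below uses synthetic data from `make_classification`; all names and parameter values in it are illustrative assumptions.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 80% negatives).
X, y = make_classification(n_samples=600, n_features=8, weights=[0.8, 0.2], random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y
)

base_model = RandomForestClassifier(
    n_estimators=20, max_depth=5, class_weight='balanced', random_state=12345
)
base_model.fit(X_train, y_train)
f1_before = f1_score(y_valid, base_model.predict(X_valid))

# clone() copies the hyperparameters only, not the fitted trees.
refit_model = clone(base_model)
refit_model.fit(X_train[:200], y_train[:200])  # stand-in for a resampled training set

# base_model is unchanged by fitting the clone.
f1_after = f1_score(y_valid, base_model.predict(X_valid))

print(f1_before, f1_after)
print(f1_score(y_valid, refit_model.predict(X_valid)))
```

With separate objects, the scores of the model trained on the original data and the model trained on resampled data can be compared side by side.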

### Decision Tree Model on downsampled data

In [59]:
# Search max_depth for a class-weighted Decision Tree and check f1_score
# Note: the search fits on features_train (not the downsampled data) and scores on the validation set
best_dt_model = None
opt_depth = 0
req_f1 = 0

for depth in range(1, 20):
    dt_model = DecisionTreeClassifier(random_state=12345, max_depth=depth, class_weight='balanced')
    dt_model.fit(features_train, target_train)
    dt_prediction = dt_model.predict(features_valid)
    # Compute the validation F1 once and reuse it
    current_f1 = f1_score(target_valid, dt_prediction)
    if current_f1 > req_f1:
        best_dt_model = dt_model
        opt_depth = depth
        req_f1 = current_f1

print(f'opt_depth: {opt_depth}')
print(f'f1_score:{req_f1}')

opt_depth: 6
f1_score:0.5587044534412956


In [60]:
# Train best dt model with downsampled data
best_dt_model.fit(features_downsampled, target_downsampled)

DecisionTreeClassifier(class_weight='balanced', max_depth=6, random_state=12345)

In [61]:
# Check F1 Score of model
dt_ds_pred = best_dt_model.predict(features_valid)
print(f1_score(target_valid, dt_ds_pred))

0.5405405405405406


### Upsample positive observations

In [62]:
# Create upsample function
def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)

    features_upsampled, target_upsampled = shuffle(
        features_upsampled, target_upsampled, random_state=12345
    )

    return features_upsampled, target_upsampled

In [63]:
# Upsample positive observations
features_upsampled, target_upsampled = upsample(features_train,target_train, 4)

In [64]:
# Visualize results
features_upsampled

Unnamed: 0,credit_score,age,tenure,balance,num_of_products,has_cr_card,is_active_member,estimated_salary,geography_Germany,geography_Spain,gender_Male
471,0.526093,-0.173505,1.0,1.170711,-0.896909,0.645055,-1.032529,-1.379314,0,0,1
2973,-0.629154,1.639801,9.0,0.835668,-0.896909,0.645055,0.968496,1.675765,0,0,0
9268,-0.577580,-0.937002,2.0,-1.232442,0.816929,0.645055,0.968496,-1.710517,0,1,1
2097,-1.248036,-0.173505,1.0,1.068739,0.816929,-1.550255,0.968496,-0.269873,0,1,1
5519,0.268225,0.399118,1.0,0.905545,-0.896909,0.645055,-1.032529,0.193406,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...
334,-0.268139,-0.173505,6.0,0.504900,-0.896909,0.645055,-1.032529,-1.640812,0,1,0
2819,0.804590,-0.459816,5.0,-0.041836,0.816929,0.645055,-1.032529,-0.052883,0,0,0
5226,-1.763771,0.017370,1.0,0.552457,-0.896909,0.645055,0.968496,0.418909,0,1,1
8228,-1.010798,0.017370,10.0,1.337124,-0.896909,0.645055,-1.032529,-1.167166,0,0,0


In [65]:
# Check distribution

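# Mask with the upsampled target so the mask index matches features_upsampled
pos_obs = features_upsampled[target_upsampled == 1]
neg_obs = features_upsampled[target_upsampled == 0]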

neg_fraction = 1 - len(pos_obs)/(len(neg_obs)+len(pos_obs))
pos_fraction = len(pos_obs)/(len(neg_obs)+len(pos_obs))

print(f'Percentage of negative observations: {neg_fraction:.1%}')
print(f'Percentage of positive observations: {pos_fraction:.1%}')

Percentage of negative observations: 49.5%
Percentage of positive observations: 50.5%
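
The same check can be read straight from the target series, which avoids masking the feature table at all. The sketch below is a minimal illustration on toy data; `toy_target` is a hypothetical stand-in for `target_upsampled`, not the project's data.

```python
import pandas as pd

# Toy target standing in for target_upsampled (hypothetical values)
toy_target = pd.Series([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# normalize=True returns class fractions instead of raw counts
fractions = toy_target.value_counts(normalize=True).sort_index()

print(f'Percentage of negative observations: {fractions.loc[0]:.1%}')
print(f'Percentage of positive observations: {fractions.loc[1]:.1%}')
```

Because the fractions come from the target itself, there is no second index to align.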


<div class="alert alert-block alert-warning">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

<s>Please check this calculations as well.</s>
</div>

### Logistic regression model on upsampled data

In [66]:
# Train lr model on upsampled data
lr_model.fit(features_upsampled, target_upsampled)

LogisticRegression(class_weight='balanced', random_state=12345,
                   solver='liblinear')

In [67]:
# Check F1 Score of model
lr_us_pred = lr_model.predict(features_valid)
print(f1_score(target_valid, lr_us_pred))

0.47763864042933807


<div class="alert alert-block alert-danger">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

<s>Please add F1 score for fitted model.</s>
</div>

### Random forest model on upsampled data

In [68]:
# Train best rf model on upsampled data
best_rf_model.fit(features_upsampled, target_upsampled)

RandomForestClassifier(class_weight='balanced', max_depth=10, n_estimators=90,
                       random_state=12345)

In [69]:
# Check F1 Score of model
rf_us_pred = best_rf_model.predict(features_valid)
print(f1_score(target_valid, rf_us_pred))

0.5893060295790671


<div class="alert alert-block alert-danger">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

<s>Please add F1 score for fitted model.</s>
</div>

### Decision tree model on upsampled data

In [70]:
# Train best dt model on upsampled data
best_dt_model.fit(features_upsampled, target_upsampled)

DecisionTreeClassifier(class_weight='balanced', max_depth=6, random_state=12345)

In [71]:
# Check F1 Score of model
dt_us_pred = best_dt_model.predict(features_valid)
print(f1_score(target_valid, dt_us_pred))

0.5587044534412956


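The training set has now been rebalanced in two separate ways: once by downsampling the negative class and once by upsampling the positive class. Upsampling gives a near-even class ratio (49.5% negative, 50.5% positive). On the validation set, the models trained on upsampled data score F1 = 0.478 (logistic regression), 0.589 (random forest), and 0.559 (decision tree).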
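
To make the "two separate ways" point concrete, here is a minimal sketch on toy data. The `downsample` helper and its `fraction` parameter are assumptions written for illustration (the project's own downsampling cell may differ), and `toy_features` / `toy_target` are hypothetical stand-ins for the training split.

```python
import pandas as pd
from sklearn.utils import shuffle


def upsample(features, target, repeat):
    """Repeat the positive rows `repeat` times, then shuffle."""
    features_up = pd.concat([features[target == 0]] + [features[target == 1]] * repeat)
    target_up = pd.concat([target[target == 0]] + [target[target == 1]] * repeat)
    return shuffle(features_up, target_up, random_state=12345)


def downsample(features, target, fraction):
    """Keep a random `fraction` of the negative rows, then shuffle (illustrative helper)."""
    zeros_kept = features[target == 0].sample(frac=fraction, random_state=12345)
    features_down = pd.concat([zeros_kept, features[target == 1]])
    # Assumes a unique index, so labels select exactly the kept rows
    target_down = target.loc[features_down.index]
    return shuffle(features_down, target_down, random_state=12345)


# Toy training split: 8 negatives, 2 positives
toy_features = pd.DataFrame({'x': range(10)})
toy_target = pd.Series([0] * 8 + [1] * 2)

# Each technique starts from the ORIGINAL training data, never from the other's output
features_up, target_up = upsample(toy_features, toy_target, repeat=4)
features_down, target_down = downsample(toy_features, toy_target, fraction=0.25)

print(len(target_up), int(target_up.sum()))
print(len(target_down), int(target_down.sum()))
```

Here `repeat=4` and `fraction=0.25` are chosen to match the toy 8:2 class ratio. Validation and test data are left untouched in both cases.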

<div class="alert alert-block alert-danger">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

<s>OK, no worries, but there's a common mix-up here. <br>
We need to split this step into 2 part:
1. Upsampling train data (add "1") and train models<br>
2. Downsampling train data (remove "0") and train models
    
We do not need do upsampling AND downsampling. Please remember, we perform this manipulations only with train dataset!</s>
</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Much better!
</div>

## Working with balanced models

### Logistic Regression Model

In [72]:
# Test model with validation set
lr_prediction = lr_model.predict(features_valid)

In [73]:
# Check f1_score on validation set
print(f1_score(target_valid, lr_prediction))

0.47763864042933807


In [74]:
# Check roc_auc value 

lr_prob = lr_model.predict_proba(features_valid)
lr_prob_one_valid = lr_prob[:, 1]

lr_auc_roc = roc_auc_score(target_valid, lr_prob_one_valid)

print(f'AUC-ROC: {lr_auc_roc}')

AUC-ROC: 0.7729698197002475


Upsampling left the classes slightly uneven (49.5% vs. 50.5%), so the model also uses the `class_weight='balanced'` argument. With both in place, logistic regression reaches an F1 score of 0.48 on the validation set, which is below the 0.59 target, and an AUC-ROC of 0.77.
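
As a rough illustration of what `class_weight='balanced'` does, the sketch below reproduces scikit-learn's documented weighting rule, `n_samples / (n_classes * np.bincount(y))`, on a toy target and compares it with `compute_class_weight`. The names `toy_target` and `manual_weights` are hypothetical and not part of the project.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy imbalanced target: 8 negatives, 2 positives (hypothetical values)
toy_target = np.array([0] * 8 + [1] * 2)

# 'balanced' rule: n_samples / (n_classes * number of samples in each class)
n_classes = 2
manual_weights = len(toy_target) / (n_classes * np.bincount(toy_target))

# The same weights as computed by scikit-learn
sklearn_weights = compute_class_weight(
    class_weight='balanced', classes=np.array([0, 1]), y=toy_target
)

print(manual_weights)
print(sklearn_weights)
```

The rarer class receives the larger weight, so errors on it cost more during training. On data that is already close to 50/50, both weights are close to 1.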

<div class="alert alert-block alert-danger">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

<s>Oops! We have problems here. Please check my comments above about data processing.</s>
</div>

<div class="alert alert-block alert-danger">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

<s>We do not need here to tune threshold...</s>
</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Well done!
</div>

### Random Forest Model

In [78]:
# Test model with validation set
rf_pred = best_rf_model.predict(features_valid)

In [79]:
# Check f1_score
print(f1_score(target_valid, rf_pred))

0.5893060295790671


In [80]:
# Check roc_auc value 

rf_prob = best_rf_model.predict_proba(features_valid)
rf_prob_one_valid = rf_prob[:, 1]

rf_auc_roc = roc_auc_score(target_valid, rf_prob_one_valid)

print(f'AUC-ROC: {rf_auc_roc}')

AUC-ROC: 0.8520883966308441


The same technique applied to the Random Forest model gives better results than the Logistic Regression model: an F1 score of 0.589 and an AUC-ROC of 0.852 on the validation set.

### Decision Tree Model

In [81]:
# Test model with validation set
dt_pred = best_dt_model.predict(features_valid)

In [82]:
# Check f1_score
print(f1_score(target_valid, dt_pred))

0.5587044534412956


In [83]:
# Check roc_auc value 

dt_prob = best_dt_model.predict_proba(features_valid)
dt_prob_one_valid = dt_prob[:, 1]

dt_auc_roc = roc_auc_score(target_valid, dt_prob_one_valid)

print(f'AUC-ROC: {dt_auc_roc}')

AUC-ROC: 0.8089065820615813


Last but not least, the decision tree model trained with the same technique reaches an F1 score of 0.559 and an AUC-ROC of 0.809 on the validation set, which places it between the other two models.

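None of the three models reaches the F1 target of 0.59 on the validation set: logistic regression scores 0.478, the decision tree 0.559, and the random forest 0.589. The Random Forest model has the highest F1 score and AUC-ROC, so it is the preferred model for this task.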
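
The comparison can be made explicit with a small summary. The sketch below only restates the validation metrics printed in the cells above; the dictionary layout and the names `validation_scores` and `F1_TARGET` are hypothetical.

```python
# Validation-set metrics copied from the cell outputs above
validation_scores = {
    'logistic_regression': {'f1': 0.47763864042933807, 'auc_roc': 0.7729698197002475},
    'random_forest': {'f1': 0.5893060295790671, 'auc_roc': 0.8520883966308441},
    'decision_tree': {'f1': 0.5587044534412956, 'auc_roc': 0.8089065820615813},
}
F1_TARGET = 0.59  # threshold named in the task

for name, scores in validation_scores.items():
    meets_target = scores['f1'] >= F1_TARGET
    print(f"{name}: f1={scores['f1']:.3f}, auc_roc={scores['auc_roc']:.3f}, "
          f"meets target: {meets_target}")

# Pick the model with the highest validation F1
best_model = max(validation_scores, key=lambda name: validation_scores[name]['f1'])
print(f'best by f1: {best_model}')
```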

## Evaluating models on test dataset

In [84]:
# Evaluate lr model on test dataset
test_pred = lr_model.predict(features_test)
f1 = f1_score(target_test, test_pred)

prob = lr_model.predict_proba(features_test)
prob_one_test = prob[:, 1]

auc_roc = roc_auc_score(target_test, prob_one_test)

print(f'f1_score: {f1}')

print(f'AUC-ROC: {auc_roc}')

f1_score: 0.5025295109612141
AUC-ROC: 0.764339981925675


In [85]:
# Evaluate best rf model on test dataset
test_pred = best_rf_model.predict(features_test)
f1 = f1_score(target_test, test_pred)

prob = best_rf_model.predict_proba(features_test)
prob_one_test = prob[:, 1]

auc_roc = roc_auc_score(target_test, prob_one_test)

print(f'f1_score: {f1}')

print(f'AUC-ROC: {auc_roc}')

f1_score: 0.6393088552915767
AUC-ROC: 0.8684132558946269


In [86]:
# Evaluate best dt model on test dataset
test_pred = best_dt_model.predict(features_test)
f1 = f1_score(target_test, test_pred)

prob = best_dt_model.predict_proba(features_test)
prob_one_test = prob[:, 1]

auc_roc = roc_auc_score(target_test, prob_one_test)

print(f'f1_score: {f1}')

print(f'AUC-ROC: {auc_roc}')

f1_score: 0.5933852140077821
AUC-ROC: 0.8180381466521556
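
The three evaluation cells above repeat the same steps, so they could be folded into one helper. The sketch below is one possible version, shown on synthetic data; `evaluate`, `toy_model`, and the `make_classification` dataset are assumptions for illustration, not the project's models or data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate(model, features, target):
    """Return (F1, AUC-ROC) for an already fitted binary classifier."""
    predictions = model.predict(features)
    prob_one = model.predict_proba(features)[:, 1]  # probability of class 1
    return f1_score(target, predictions), roc_auc_score(target, prob_one)


# Synthetic imbalanced data standing in for the bank dataset
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=12345)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=12345
)

toy_model = RandomForestClassifier(
    n_estimators=20, max_depth=5, class_weight='balanced', random_state=12345
)
toy_model.fit(X_train, y_train)

f1, auc_roc = evaluate(toy_model, X_test, y_test)
print(f'f1_score: {f1}')
print(f'AUC-ROC: {auc_roc}')
```

With a helper like this, each model in the project could be scored on `features_test` and `target_test` in a single loop.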


<div class="alert alert-block alert-warning">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

<s>Please check my comments above. Seems we have a leak here(</s>
</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Great! We get F1 > 0.59!
</div>

# Conclusion

We've been able to create machine learning models that can predict whether a customer will leave the bank soon based on their banking behaviours.

From our testing, the Random Forest Model is the most effective, with an f1_score of 0.64 and an AUC-ROC score of 0.87 on the test dataset.

We recommend the Random Forest Model for predicting which Beta Bank customers are likely to leave.

<div class="alert alert-block alert-warning">
<b>Overall reviewer's comment</b> <a class="tocSkip"></a>

Harvey, thank you for sending your project. You've done a really good job on it! <br>
    
I really like your work! <br>
Your project has good structure and nice code. <br>
    
However, there are some issues. I wrote comments. Please elaborate them.
    
Just a few tiny corrections before your project is done!
</div>

<div class="alert alert-block alert-warning">
<b>Overall reviewer's comment v2</b> <a class="tocSkip"></a>

I'm happy to see you've made a few corrections to your work! <br>
However, there is still an issue in project. Please check my comments. <br>
I will be waiting for your corrected project.
</div>

<div class="alert alert-block alert-success">
<b>Overall reviewer's comment v3</b> <a class="tocSkip"></a>

You've done such a great job improving your project! <br>
    I'm glad to say that your project has been accepted. Good luck!
</div>