# **Imbalanced Classification Project**

# **Problem Statement**

Beta Bank customers are leaving: little by little, chipping away every month. The bankers
figured out it’s cheaper to save the existing customers rather than to attract new ones.
We need to predict whether a customer will leave the bank soon. You have the data on
clients’ past behavior and termination of contracts with the bank.
Build a model with the maximum possible F1 score. To pass the project, you need an F1
score of at least 0.59. Check the F1 for the test set

##Metric of success
Building a model that can predict with high accuracy whether a customer is about to leave

## Recording the Experimental Design
1. Download and prepare the data. Explain the procedure.
2. Examine the balance of classes. Train the model without taking into account the
imbalance. Briefly describe your findings.
3. Improve the quality of the model. Make sure you use at least two approaches to
fixing class imbalance. Use the training set to pick the best parameters. Train
different models on training and validation sets. Find the best one. Briefly
describe your findings.
4. Perform the final testing

### Data Relevance
 The data was Very relevant

In [51]:
#import libraries 

import pandas as pd
import numpy as np
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score


In [36]:
#read the data
df = pd.read_csv('https://bit.ly/2XZK7Bo')
print(df.shape)

(10000, 14)


In [37]:
df.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


Data Exploration and Data Cleaning

In [38]:
df.dtypes

RowNumber            int64
CustomerId           int64
Surname             object
CreditScore          int64
Geography           object
Gender              object
Age                  int64
Tenure             float64
Balance            float64
NumOfProducts        int64
HasCrCard            int64
IsActiveMember       int64
EstimatedSalary    float64
Exited               int64
dtype: object

In [39]:
df.describe()

Unnamed: 0,RowNumber,CustomerId,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
count,10000.0,10000.0,10000.0,10000.0,9091.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,5000.5,15690940.0,650.5288,38.9218,4.99769,76485.889288,1.5302,0.7055,0.5151,100090.239881,0.2037
std,2886.89568,71936.19,96.653299,10.487806,2.894723,62397.405202,0.581654,0.45584,0.499797,57510.492818,0.402769
min,1.0,15565700.0,350.0,18.0,0.0,0.0,1.0,0.0,0.0,11.58,0.0
25%,2500.75,15628530.0,584.0,32.0,2.0,0.0,1.0,0.0,0.0,51002.11,0.0
50%,5000.5,15690740.0,652.0,37.0,5.0,97198.54,1.0,1.0,1.0,100193.915,0.0
75%,7500.25,15753230.0,718.0,44.0,7.0,127644.24,2.0,1.0,1.0,149388.2475,0.0
max,10000.0,15815690.0,850.0,92.0,10.0,250898.09,4.0,1.0,1.0,199992.48,1.0


In [40]:
# check for duplicates
df.duplicated().sum()

0

In [41]:
# check for null values
df.isnull().sum()

RowNumber            0
CustomerId           0
Surname              0
CreditScore          0
Geography            0
Gender               0
Age                  0
Tenure             909
Balance              0
NumOfProducts        0
HasCrCard            0
IsActiveMember       0
EstimatedSalary      0
Exited               0
dtype: int64

fill tenure  null values with the mean

In [42]:
df.Tenure = df.Tenure.fillna(df.Tenure.mean())

In [43]:
df.isnull().sum()

RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

In [44]:
df.nunique()

RowNumber          10000
CustomerId         10000
Surname             2932
CreditScore          460
Geography              3
Gender                 2
Age                   70
Tenure                12
Balance             6382
NumOfProducts          4
HasCrCard              2
IsActiveMember         2
EstimatedSalary     9999
Exited                 2
dtype: int64

change columns to lowercase

In [45]:
df.columns = df.columns.str.lower().str.strip()
df.columns

Index(['rownumber', 'customerid', 'surname', 'creditscore', 'geography',
       'gender', 'age', 'tenure', 'balance', 'numofproducts', 'hascrcard',
       'isactivemember', 'estimatedsalary', 'exited'],
      dtype='object')

In [46]:
#drop columns not needed
df.drop(columns=['rownumber', 'customerid', 'surname','geography','gender'], inplace=True)
df.columns

Index(['creditscore', 'age', 'tenure', 'balance', 'numofproducts', 'hascrcard',
       'isactivemember', 'estimatedsalary', 'exited'],
      dtype='object')

Data Modeling

In [47]:
train_df = df.copy()

In [48]:
target = train_df['exited']
features = train_df.drop(['exited'], axis=1)


In [49]:
# split the data into training and validation sets
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.25, random_state=12345)

logistic regression model 

In [64]:
lrm = LogisticRegression(random_state=12345,class_weight='balanced',solver='liblinear')
lrm.fit(features_train,target_train)
p_valid = log_reg_model.predict(features_valid)
print('F1 Score: ', f1_score(target_valid, p_valid))

F1 Score:  0.4494103041589074


Decision Tree

In [65]:
# check for optimal max_depth
for i in range(1,10):
    dtree_model = DecisionTreeClassifier(max_depth=i, random_state=12345,class_weight='balanced')
    dtree_model.fit(features_train,target_train)
    p_valid = dt_model.predict(features_valid)
    print('Max Depth:',i  , f1_score(target_valid, p_valid))

Max Depth: 1 0.5575289575289575
Max Depth: 2 0.5575289575289575
Max Depth: 3 0.5575289575289575
Max Depth: 4 0.5575289575289575
Max Depth: 5 0.5575289575289575
Max Depth: 6 0.5575289575289575
Max Depth: 7 0.5575289575289575
Max Depth: 8 0.5575289575289575
Max Depth: 9 0.5575289575289575


Random Forest

In [63]:
# check for optimal max_depth
for i in range(1,10):
    rf_model = RandomForestClassifier(max_depth=5,n_estimators=i,class_weight='balanced', random_state=12345)
    rf_model.fit(features_train,target_train)
    pred_valid = rf_model.predict(features_valid)
    print('Max Depth: ',i, f1_score(target_valid, pred_valid))

Max Depth:  1 0.5378787878787878
Max Depth:  2 0.5421785421785422
Max Depth:  3 0.5743834526650755
Max Depth:  4 0.5837479270315091
Max Depth:  5 0.5808636748518206
Max Depth:  6 0.5788590604026845
Max Depth:  7 0.5742411812961444
Max Depth:  8 0.5739692805173807
Max Depth:  9 0.5869037995149555


the best one is the Random Forest had an  F1 score of  0.5869037995149555 with  n_estimators of 9 