# **Problem Statement**
Beta Bank is losing customers little by little every month. The bank has found that
retaining existing customers is cheaper than attracting new ones.
The task is to predict whether a customer will leave the bank soon, using data on
clients' past behavior and contract terminations.
Build a model with the highest possible F1 score. To pass the project, the F1 score
must be at least 0.59 on the test set.
Additionally, measure the AUC-ROC metric and compare it with the F1 score.
1. Download and prepare the data. Explain the procedure.
2. Examine the balance of classes. Train the model without taking the imbalance into account. Briefly describe your findings.
3. Improve the quality of the model. Make sure you use at least two approaches to fixing class imbalance. Use the training set to pick the best parameters. Train different models on training and validation sets. Find the best one. Briefly describe your findings.
4. Perform the final testing.
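
The two metrics named above measure different things: F1 is computed from hard class labels at a fixed threshold, while AUC-ROC is computed from the predicted probabilities across all thresholds. A minimal sketch on toy values (the labels and probabilities below are hypothetical, not taken from the bank data):

```python
from sklearn.metrics import f1_score, roc_auc_score

# Toy ground truth and predicted probabilities of the positive class
# (hypothetical values chosen for illustration only)
y_true = [0, 0, 0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.45, 0.9]

# F1 needs hard labels, so a threshold (0.5 here) is applied first
y_pred = [int(p >= 0.5) for p in y_prob]

f1 = f1_score(y_true, y_pred)        # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_prob)  # ranking quality over all thresholds

print(f"F1: {f1:.3f}, AUC-ROC: {auc:.3f}")
```

Here one positive is missed and one negative is flagged, so precision and recall are both 0.5, while the probabilities still rank 7 of the 8 positive/negative pairs correctly. This is why a model can have a modest F1 and a noticeably higher AUC-ROC.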


In [1]:
!pip install fsspec

In [2]:
#import libraries
import pandas as pd

#load dataset
df = pd.read_csv("https://bit.ly/2XZK7Bo")

# **Data preprocessing and analysis**

In [3]:
df.sample(5)

| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3218 | 3219 | 15774872 | Joslin | 663 | France | Male | 36 | 10.0 | 0.0 | 2 | 1 | 0 | 136349.55 | 0 |
| 6670 | 6671 | 15667932 | Bellucci | 758 | Spain | Female | 43 | 10.0 | 0.0 | 2 | 1 | 1 | 55313.44 | 0 |
| 470 | 471 | 15759298 | Shih | 631 | Spain | Male | 27 | 10.0 | 134169.62 | 1 | 1 | 1 | 176730.02 | 0 |
| 9409 | 9410 | 15591150 | Nwebube | 570 | Spain | Male | 34 | 10.0 | 0.0 | 2 | 0 | 1 | 183387.12 | 0 |
| 6654 | 6655 | 15799998 | Cunningham | 608 | France | Female | 30 | NaN | 85859.76 | 1 | 0 | 0 | 142730.27 | 0 |


In [4]:
#size of the dataset
df.shape

(10000, 14)

In [5]:
#find missing values in the columns
df.isnull().sum()

RowNumber            0
CustomerId           0
Surname              0
CreditScore          0
Geography            0
Gender               0
Age                  0
Tenure             909
Balance              0
NumOfProducts        0
HasCrCard            0
IsActiveMember       0
EstimatedSalary      0
Exited               0
dtype: int64

In [6]:
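#fill missing values in Tenure, the only column with gaps

#inspect the unique values in Tenure (not displayed, since a cell shows only its last expression)
df['Tenure'].unique()

#fill missing values with the column median
df['Tenure'] = df['Tenure'].fillna(df['Tenure'].median())

#confirm that no missing values remain
df['Tenure'].isnull().sum()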

0

In [7]:
#check for duplicates
df.duplicated().sum()

0

In [8]:
#find datatypes in the columns
df.dtypes

RowNumber            int64
CustomerId           int64
Surname             object
CreditScore          int64
Geography           object
Gender              object
Age                  int64
Tenure             float64
Balance            float64
NumOfProducts        int64
HasCrCard            int64
IsActiveMember       int64
EstimatedSalary    float64
Exited               int64
dtype: object

In [9]:
#drop identifier columns that carry no predictive signal
df = df.drop(["Surname","RowNumber", "CustomerId"], axis=1)

In [10]:
#one-hot encode the categorical string columns
df = pd.get_dummies(df, columns=['Gender', 'Geography'])
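
The call above keeps one dummy column per category, so the `Gender_*` columns (and likewise the `Geography_*` columns) always sum to 1 and are perfectly collinear. Tree models do not mind, but for linear models such as logistic regression it is common to drop one level per feature. A minimal sketch on a toy frame (the rows below are hypothetical; `drop_first=True` is a suggested variation, not what the notebook ran):

```python
import pandas as pd

# Toy frame with the same two categorical columns as the bank data
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Geography": ["France", "Spain", "Germany"],
    "Age": [36, 43, 27],
})

# drop_first=True removes the first level of each encoded column
# (Female and France here), which avoids the redundant dummy column
encoded = pd.get_dummies(toy, columns=["Gender", "Geography"], drop_first=True)

print(list(encoded.columns))
```

Switching the notebook to this encoding would change the feature set, so the scores reported further down would need to be recomputed.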

In [11]:
df['Exited'].unique()

array([1, 0])

# **Model Training and Evaluation**

In [12]:
#Splitting Dataset

#import libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score

target = df['Exited']
features = df.drop('Exited', axis=1)
#split dataset into training (75%) and validation (25%) sets
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.25, random_state=12345
)


#feature scaling: fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
scaler.fit(features_train)
features_train = scaler.transform(features_train)
features_valid = scaler.transform(features_valid)
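
The problem statement asks for the final F1 to be checked on a test set, but the split above produces only training and validation sets. One possible way to hold out a test set is to split twice. The sketch below uses synthetic data; the 60/20/20 proportions and the use of `stratify` (to keep the class ratio equal across the three parts) are assumptions, not choices made in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and the imbalanced target
rng = np.random.default_rng(12345)
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.2).astype(int)

# First split: 60% training, 40% held out
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=12345, stratify=y
)

# Second split: the held-out 40% is halved into validation and test
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=12345, stratify=y_rest
)

print(len(X_train), len(X_valid), len(X_test))
```

With such a split, the scaler would still be fitted on the training part only, models would be compared on the validation part, and the test part would be touched once at the very end.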


### **Decision Tree Classifier**

In [13]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(random_state=12345, max_depth = 2)
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)

print(f1_score(target_valid, predicted_valid))

0.5287846481876333

### **Logistic Regression**


In [14]:
from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression(random_state=12345, solver ='liblinear')
lr_model.fit(features_train, target_train)
predicted_valid = lr_model.predict(features_valid)

print(f1_score(target_valid, predicted_valid))

0.29247910863509746

### **Random Forest Classifier**


In [15]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(random_state=12345, n_estimators=50)
rf_model.fit(features_train, target_train)
predicted_valid = rf_model.predict(features_valid)

print(f1_score(target_valid, predicted_valid))


0.5810968494749125
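
The three baselines above report only F1, while the task also asks for AUC-ROC. AUC-ROC needs the predicted probability of the positive class rather than the hard labels, which `predict_proba` provides. A minimal sketch, assuming a synthetic dataset from `make_classification` as a stand-in for the bank data (the numbers it prints are not the notebook's results):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: roughly 80% negatives and 20% positives
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=12345
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y
)

model = RandomForestClassifier(random_state=12345, n_estimators=50)
model.fit(X_train, y_train)

# F1 uses hard predictions; AUC-ROC uses the probability of class 1
f1 = f1_score(y_valid, model.predict(X_valid))
proba_one = model.predict_proba(X_valid)[:, 1]
auc_roc = roc_auc_score(y_valid, proba_one)

print(f"F1: {f1:.3f}, AUC-ROC: {auc_roc:.3f}")
```

In the notebook itself, the same two lines would apply to `rf_model`, `features_valid`, and `target_valid`.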


# **Class Balance Adjustments**

In [16]:
#check balance of target class
df['Exited'].value_counts(normalize=True)

0    0.7963
1    0.2037
Name: Exited, dtype: float64
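
With about four negatives for every positive, one common way to rebalance the training data is upsampling: repeating the minority-class rows until the classes are roughly even. The sketch below is one possible implementation on a toy frame; the helper name `upsample` and the `repeat` factor are hypothetical. It expects pandas objects, so in this notebook it would have to run before scaling (after scaling, `features_train` is a NumPy array), and only on the training set.

```python
import pandas as pd
from sklearn.utils import shuffle


def upsample(features, target, repeat, random_state=12345):
    """Repeat the positive-class rows `repeat` times and shuffle the result."""
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_up = pd.concat([features_zeros] + [features_ones] * repeat)
    target_up = pd.concat([target_zeros] + [target_ones] * repeat)

    # Shuffle both objects together so rows and labels stay aligned
    features_up, target_up = shuffle(
        features_up, target_up, random_state=random_state
    )
    return features_up, target_up


# Toy training data: four negatives, one positive
toy_features = pd.DataFrame({"Age": [30, 41, 52, 38, 45]})
toy_target = pd.Series([0, 0, 0, 0, 1])

features_up, target_up = upsample(toy_features, toy_target, repeat=4)
print(target_up.value_counts())
```

For the class ratio shown above (0.7963 to 0.2037), a `repeat` of about 4 would bring the classes close to balance. The validation and test sets should stay untouched so that the scores reflect the real class distribution.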

In [17]:
#balanced random forest model
rf_model_balanced = RandomForestClassifier(random_state=12345, n_estimators=50, class_weight='balanced')
rf_model_balanced.fit(features_train, target_train)
predicted_valid = rf_model_balanced.predict(features_valid)

print(f1_score(target_valid, predicted_valid))


0.5508982035928144


In [18]:
#balanced Decision Tree
balanced_model = DecisionTreeClassifier(random_state=12345, max_depth = 2, class_weight='balanced')
balanced_model.fit(features_train, target_train)
predicted_valid = balanced_model.predict(features_valid)

print(f1_score(target_valid, predicted_valid))

0.5391566265060241
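
Class weighting alone leaves both validation F1 scores above (0.551 for the random forest, 0.539 for the decision tree) below the 0.59 target. Another approach to imbalance is to keep the model as it is and move the decision threshold away from the default 0.5. The sketch below illustrates the idea on synthetic data with a logistic regression; the threshold grid and the dataset are assumptions, and the printed values are not results for the bank data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: roughly 80% negatives and 20% positives
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=12345
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y
)

model = LogisticRegression(random_state=12345, solver="liblinear")
model.fit(X_train, y_train)
proba_one = model.predict_proba(X_valid)[:, 1]

# Candidate thresholds from 0.10 to 0.90 in steps of 0.05
thresholds = [i / 100 for i in range(10, 91, 5)]
scores = {
    t: f1_score(y_valid, (proba_one >= t).astype(int), zero_division=0)
    for t in thresholds
}

best_threshold = max(scores, key=scores.get)
best_f1 = scores[best_threshold]
baseline_f1 = scores[0.5]  # F1 at the default threshold

print(f"default F1: {baseline_f1:.3f}")
print(f"best F1: {best_f1:.3f} at threshold {best_threshold}")
```

The threshold should be chosen on the validation set and then applied unchanged to the test set, otherwise the final F1 would be optimistically biased.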
