Customers of a certain bank are leaving: little by little, chipping away every month. The bankers figured out it’s cheaper to save the existing customers rather than to attract new ones.
The task is to predict whether a customer will leave the bank soon. We have data on clients’ past behavior and on contract terminations with the bank.
My goal is to build a model with the highest possible F1 score, and at least 0.59.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.utils import shuffle
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score

data = pd.read_csv('')

Importing the models to be used later in the code and reading the file with the dataset.

In [2]:
display(data.head(5))

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


Viewing the content of the data.

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
RowNumber          10000 non-null int64
CustomerId         10000 non-null int64
Surname            10000 non-null object
CreditScore        10000 non-null int64
Geography          10000 non-null object
Gender             10000 non-null object
Age                10000 non-null int64
Tenure             9091 non-null float64
Balance            10000 non-null float64
NumOfProducts      10000 non-null int64
HasCrCard          10000 non-null int64
IsActiveMember     10000 non-null int64
EstimatedSalary    10000 non-null float64
Exited             10000 non-null int64
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


Viewing the datatypes and the completeness of the data.

In [4]:
data.dropna(subset = ['Tenure'], inplace=True)

Dropping the rows that have null values in Tenure.

In [5]:
data['Tenure'] = data['Tenure'].astype(int)

Changing the datatype of column Tenure to integer.

In [6]:
df1 = pd.get_dummies(data['Gender'])
display(df1)

Unnamed: 0,Female,Male
0,1,0
1,1,0
2,1,0
3,1,0
4,1,0
...,...,...
9994,1,0
9995,0,1
9996,0,1
9997,1,0


I am not able to put the 0s and 1s in one column. Please guide.
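One possible way to get a single 0/1 column from get_dummies is the drop_first=True option, which drops the first category (here Female) and keeps the other. The sketch below uses a small made-up frame named toy, not the bank dataset.

```python
import pandas as pd

# Hypothetical three-row frame, used only to illustrate the call
toy = pd.DataFrame({'Gender': ['Female', 'Male', 'Female']})

# drop_first=True removes the 'Female' column, leaving one 'Male' column of 0s and 1s
gender_male = pd.get_dummies(toy['Gender'], drop_first=True, dtype=int)
print(gender_male)
```

The resulting column can then replace the original text column, which is equivalent to the manual 0/1 encoding in the next cell.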

In [7]:
# Encode Gender as a single 0/1 column (Female = 0, Male = 1)
data['Gender'] = data['Gender'].map({'Female': 0, 'Male': 1})

Encoding the Gender column with 0s and 1s.

In [8]:
# Encode Geography as integer codes (France = 1, Germany = 2, Spain = 3)
data['Geography'] = data['Geography'].map({'France': 1, 'Germany': 2, 'Spain': 3})

Encoding the Geography column with the numbers 1, 2 and 3.
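Integer codes imply an order (France < Germany < Spain) that the countries do not have. Tree-based models usually cope with this, but a one-hot encoding is a common alternative. The sketch below is only an illustration on a made-up frame; the prefix 'Geo' is an arbitrary choice.

```python
import pandas as pd

# Hypothetical frame, used only to illustrate the call
toy = pd.DataFrame({'Geography': ['France', 'Spain', 'Germany', 'France']})

# One column per country except the first (France), which becomes the all-zeros case
geo = pd.get_dummies(toy['Geography'], prefix='Geo', drop_first=True, dtype=int)
print(geo)
```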

In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9091 entries, 0 to 9998
Data columns (total 14 columns):
RowNumber          9091 non-null int64
CustomerId         9091 non-null int64
Surname            9091 non-null object
CreditScore        9091 non-null int64
Geography          9091 non-null int64
Gender             9091 non-null int64
Age                9091 non-null int64
Tenure             9091 non-null int64
Balance            9091 non-null float64
NumOfProducts      9091 non-null int64
HasCrCard          9091 non-null int64
IsActiveMember     9091 non-null int64
EstimatedSalary    9091 non-null float64
Exited             9091 non-null int64
dtypes: float64(2), int64(11), object(1)
memory usage: 1.0+ MB


Viewing the table information to confirm that the changes took effect. Tenure is now an integer.

In [10]:
display(data.head(5))

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,1,0,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,3,0,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,1,0,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,1,0,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,3,0,43,2,125510.82,1,1,1,79084.1,0


Viewing the table to confirm that the changes took effect. Geography and Gender are now numeric.

In [11]:
data.Exited.value_counts()

0    7237
1    1854
Name: Exited, dtype: int64

Exited customers are fewer than half of the total, but there are enough of them to concern the company. The target column is imbalanced (about 20% of the rows are exits) and will need to be balanced.

In [12]:
#X, y = make_classification(n_samples=9091, n_features=10, n_redundant=0, n_clusters_per_class=1, weights=[0.80,0.20], flip_y=0)

I am wondering whether a ratio of 20:80 is imbalanced enough to call for more balancing beyond upsampling. Please guide on what is considered imbalanced.
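There is no fixed cut-off for "imbalanced", but the counts printed above can be turned into numbers that help judge it. The sketch below re-enters those counts by hand, so it runs without the dataset.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({0: 7237, 1: 1854}, name='Exited')

share = counts / counts.sum()    # fraction of rows in each class
ratio = counts[0] / counts[1]    # stayed customers per exited customer

print(share)
print("ratio of class 0 to class 1:", ratio)
```

A model that always predicts 0 would already reach an accuracy equal to the share of class 0 (about 0.80) while finding no exits at all, which is why F1 is used here rather than accuracy.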

In [13]:
# Split target and features
target = data['Exited']
features = data.drop(['Exited','Surname','RowNumber', 'CustomerId'], axis = 1)

# Split into sets
features_train, features_temp, target_train, target_temp = train_test_split(features, target, test_size=0.2,random_state=12345)
features_valid, features_test, target_valid, target_test = train_test_split(features_temp, target_temp, test_size=0.2,random_state=12345)

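Defining the features and target, then splitting the data into training, validation and test sets. Because test_size=0.2 is applied twice, the split is roughly 80% training, 16% validation and 4% test.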
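If a more even 60/20/20 split were wanted, one option is to hold out 40% first and then cut that part in half. The sketch below uses synthetic data in place of the bank dataset; the stratify argument is an addition not used in the notebook, and it keeps the class ratio the same in every part.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and target (assumption, not the bank data)
rng = np.random.default_rng(12345)
features = pd.DataFrame(rng.normal(size=(1000, 3)), columns=['a', 'b', 'c'])
target = pd.Series((rng.random(1000) < 0.2).astype(int))

# First split: 60% training, 40% held out
features_train, features_temp, target_train, target_temp = train_test_split(
    features, target, test_size=0.4, random_state=12345, stratify=target)

# Second split: the held-out part is halved into validation and test
features_valid, features_test, target_valid, target_test = train_test_split(
    features_temp, target_temp, test_size=0.5, random_state=12345, stratify=target_temp)

print(len(features_train), len(features_valid), len(features_test))
```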

In [14]:
def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)
    
    features_upsampled, target_upsampled = shuffle(
        features_upsampled, target_upsampled, random_state=12345)
    
    return features_upsampled, target_upsampled

features_upsampled, target_upsampled = upsample(features_train, target_train, 100)

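Balancing the training dataset by upsampling: the rows of the positive class are repeated `repeat` times (100 here), added to the negative-class rows, and the result is shuffled.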
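The value of repeat decides how balanced the result is. With repeat=100 each positive row appears 100 times, which makes the positive class far larger than the negative one. A possible rule of thumb, shown below on made-up data, is to derive repeat from the class counts so that both classes end up about the same size.

```python
import pandas as pd
from sklearn.utils import shuffle

def upsample(features, target, repeat):
    # Same logic as the notebook's function: repeat the positive rows, then shuffle
    features_upsampled = pd.concat([features[target == 0]] + [features[target == 1]] * repeat)
    target_upsampled = pd.concat([target[target == 0]] + [target[target == 1]] * repeat)
    return shuffle(features_upsampled, target_upsampled, random_state=12345)

# Hypothetical toy data: 8 negatives and 2 positives
toy_features = pd.DataFrame({'x': range(10)})
toy_target = pd.Series([0] * 8 + [1] * 2)

# Repeat factor taken from the class ratio (8 / 2 = 4)
n_zeros = int((toy_target == 0).sum())
n_ones = int((toy_target == 1).sum())
repeat = int(round(n_zeros / n_ones))

up_features, up_target = upsample(toy_features, toy_target, repeat)
print(up_target.value_counts())
```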

In [15]:
def downsample(features, target, fraction):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_downsampled = pd.concat(
        [features_zeros.sample(frac=fraction, random_state=12345)] + [features_ones])
    target_downsampled = pd.concat(
        [target_zeros.sample(frac=fraction, random_state=12345)] + [target_ones])
    
    features_downsampled, target_downsampled = shuffle(
        features_downsampled, target_downsampled, random_state=12345)
    
    return features_downsampled, target_downsampled

features_downsampled, target_downsampled = downsample(features_train, target_train, 0.5)

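Balancing the training dataset by downsampling: a random half of the negative-class rows is kept together with all positive-class rows. The function is applied to the original training set, not to the upsampled one.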

In [16]:
# Note: DecisionTreeClassifier has no n_estimators parameter and `estim` is
# never passed to the model, so every iteration fits the same tree.
for estim in range(1, 5):
    model = DecisionTreeClassifier(random_state=12345, class_weight='balanced')
    model.fit(features_downsampled, target_downsampled)

    predictions_valid = model.predict(features_valid)
    F1 = f1_score(target_valid, predictions_valid)
    # AUC-ROC is computed here from hard 0/1 predictions, not from probabilities
    auc_roc = roc_auc_score(target_valid, predictions_valid)

    print("n_estimators =" + str(estim)+ ":  F1_score = " + str(F1)+ ": AUR_ROC = " + str(auc_roc))
       

n_estimators =1:  F1_score = 0.48633093525179855: AUR_ROC = 0.6887816699843343
n_estimators =2:  F1_score = 0.48633093525179855: AUR_ROC = 0.6887816699843343
n_estimators =3:  F1_score = 0.48633093525179855: AUR_ROC = 0.6887816699843343
n_estimators =4:  F1_score = 0.48633093525179855: AUR_ROC = 0.6887816699843343


Defining a DecisionTreeClassifier inside a loop, training it with fit() on the downsampled training set, and evaluating its predictions on the validation set with the F1 score and the AUC-ROC. The loop variable is not passed to the model (a decision tree has no n_estimators parameter), so all four iterations give the same result: an F1 score of about 0.486 and an AUC-ROC of about 0.689. The AUC-ROC is much higher than the F1 score.
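For a decision tree, a hyperparameter that does change the model is max_depth. The sketch below shows how such a loop could look. It runs on synthetic data from make_classification rather than the bank dataset, and the names X_train, X_valid and f1_by_depth are only illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (about 80:20), standing in for the real features and target
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.8, 0.2],
                           random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y)

f1_by_depth = {}
for depth in range(1, 5):
    # The loop variable is passed to the model, so each iteration fits a different tree
    model = DecisionTreeClassifier(random_state=12345, class_weight='balanced',
                                   max_depth=depth)
    model.fit(X_train, y_train)
    f1_by_depth[depth] = f1_score(y_valid, model.predict(X_valid))
    print("max_depth =", depth, ": F1 =", f1_by_depth[depth])
```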

In [17]:
for estim in range(5, 100, 5):
    for depth in range(1, 12):
        model = RandomForestClassifier(random_state=12345, class_weight='balanced',
                                       n_estimators=estim, max_depth=depth)

        # Caution: F1 is computed before this model is fitted, so it scores
        # predictions_valid from the previous iteration (or from cell 16 on the first pass).
        F1 = f1_score(target_valid, predictions_valid)

        model.fit(features_downsampled, target_downsampled)
        predictions_train = model.predict(features_train)
        predictions_valid = model.predict(features_valid)
        predictions_test = model.predict(features_test)

        print('Max Depth ' + str(depth)+': ' + str(F1) )
        print()
        print("n_estimators =" + str(estim)+': ' + str(F1) )
        

Max Depth 1: 0.48633093525179855

n_estimators =5: 0.48633093525179855
Max Depth 2: 0.5066991473812424

n_estimators =5: 0.5066991473812424
Max Depth 3: 0.5427408412483039

n_estimators =5: 0.5427408412483039
Max Depth 4: 0.5347043701799485

n_estimators =5: 0.5347043701799485
Max Depth 5: 0.5385620915032681

n_estimators =5: 0.5385620915032681
Max Depth 6: 0.5459387483355527

n_estimators =5: 0.5459387483355527
Max Depth 7: 0.546916890080429

n_estimators =5: 0.546916890080429
Max Depth 8: 0.5413333333333333

n_estimators =5: 0.5413333333333333
Max Depth 9: 0.5490196078431373

n_estimators =5: 0.5490196078431373
Max Depth 10: 0.5480225988700564

n_estimators =5: 0.5480225988700564
Max Depth 11: 0.5541125541125541

n_estimators =5: 0.5541125541125541
Max Depth 1: 0.5561959654178674

n_estimators =10: 0.5561959654178674
Max Depth 2: 0.5029797377830751

n_estimators =10: 0.5029797377830751
Max Depth 3: 0.5431789737171465

n_estimators =10: 0.5431789737171465
Max Depth 4: 0.54896907216494

Defining a RandomForestClassifier inside two nested loops that set the hyperparameters n_estimators (5 to 95 in steps of 5) and max_depth (1 to 11). Each model is trained with fit() on the downsampled training set and then used to predict the training, validation and test sets. The F1 score on the validation set is printed for each combination.
Because the F1 score is computed before fit() is called, each printed value belongs to the model from the previous iteration. The first value, 0.486, is the decision tree's score from the previous cell.
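A search loop of this kind is usually written so that each model is fitted before it is scored, and so that the best combination is remembered. The sketch below shows one way to do that. It uses synthetic data and a much smaller grid than the notebook to keep it quick; the dictionary named best is an illustrative helper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data, standing in for the real features and target
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.8, 0.2],
                           random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y)

best = {'f1': -1.0, 'n_estimators': None, 'max_depth': None}
for estim in (10, 20):
    for depth in (2, 4, 6):
        model = RandomForestClassifier(random_state=12345, class_weight='balanced',
                                       n_estimators=estim, max_depth=depth)
        model.fit(X_train, y_train)                       # fit first
        f1 = f1_score(y_valid, model.predict(X_valid))    # then score this same model
        print("n_estimators =", estim, "max_depth =", depth, "F1 =", f1)
        if f1 > best['f1']:
            best = {'f1': f1, 'n_estimators': estim, 'max_depth': depth}

print("best:", best)
```

Selecting on the validation set only, and touching the test set once at the end, keeps the test score an honest estimate.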

In [18]:
# Final model with the selected hyperparameters, fitted on the downsampled training set
model = RandomForestClassifier(n_estimators=20, random_state=12345, class_weight='balanced', max_depth=11)
model.fit(features_downsampled, target_downsampled)
predictions_test = model.predict(features_test)

# predictions_train and predictions_valid are not recomputed here: they still come
# from the last model fitted in cell 17 (n_estimators=95, max_depth=11)
f1_score_train = f1_score(target_train, predictions_train)
test_f1_score = f1_score(target_test, predictions_test)
F1 = f1_score(target_valid, predictions_valid)
auc_roc = roc_auc_score(target_valid, predictions_valid)

Defining the best model as the final model to be used with the test dataset, then computing the F1 score on the training and test datasets from the predictions.

In [19]:
print("Score")
print("Training set:", f1_score_train)
print("Test set:", test_f1_score)
print("f1_score",F1)
print("AUC_ROC",auc_roc)

Score
Training set: 0.8203912270302313
Test set: 0.6
f1_score 0.5975609756097561
AUC_ROC 0.7633894392160095


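The best model selected is a RandomForestClassifier with the following hyperparameters: n_estimators=20, random_state=12345, class_weight='balanced', max_depth=11. It is trained on the downsampled training set; the upsampled set from cell 14 is created but not passed to any model. Its F1 score on the test set is 0.60, which meets the target of 0.59.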

It can be observed that the AUC_ROC (0.763) is much higher than the F1_score (0.598) on the validation set.
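The AUC_ROC above was computed from hard 0/1 predictions. The metric is normally computed from predicted probabilities, which typically gives a different and often higher value. The sketch below compares the two on synthetic data; it does not reproduce the numbers in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data, standing in for the real features and target
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.8, 0.2],
                           random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y)

model = RandomForestClassifier(n_estimators=20, max_depth=11,
                               class_weight='balanced', random_state=12345)
model.fit(X_train, y_train)

# AUC from hard labels uses a single threshold, so the ROC curve has only one interior point
auc_from_labels = roc_auc_score(y_valid, model.predict(X_valid))

# AUC from the probability of class 1 uses the full ranking of the validation rows
auc_from_proba = roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1])

print("AUC from labels:", auc_from_labels)
print("AUC from probabilities:", auc_from_proba)
```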