# Machine Learning Task: Predict if customers will leave or stay

In this task we will help Beta Bank predict whether customers will leave or stay based on their Credit Score, Geographical Location, Gender, Age, how long they have been with the bank (Tenure), Balance, the Number of Products they use, whether they have a credit card, whether they are active members, and their Estimated Salary. This information has been gathered in a dataset that we will study.

# Table of Contents
1. [General Information](#step1)
2. [Data Preprocessing](#step2)
    1. [Deleting some columns](#step2_1)
    2. [Missing Tenure values](#step2_2)
    3. [Dummies for the categorical columns](#step2_3)
    4. [Scaling the numerical columns](#step2_4)
    5. [Features, Target, and splitting the data](#step2_5)
3. [Building a Model with the Class Imbalance](#step3)
    1. [Decision Tree](#step3_1)
    2. [Random Forest](#step3_2)
    3. [Logistic Regression](#step3_3)
4. [Balancing out the classes](#step4)
    1. [Class Weight Adjustment](#step4_1)
    2. [Upsampling](#step4_2)
5. [Finally, testing on the test set](#step5)
6. [Conclusion](#step6)

## General Information <a name="step1"></a>

First of all, let us import the libraries and functions we need.

In [1]:
import pandas as pd #for dealing with dataframes
from sklearn.tree import DecisionTreeClassifier #to deal with Decision Tree Models
from sklearn.ensemble import RandomForestClassifier #to deal with Random Forest Models
from sklearn.linear_model import LogisticRegression #to deal with Logistic Regression Models
from sklearn.model_selection import train_test_split #to be able to split datasets
from sklearn.preprocessing import StandardScaler #to be able to scale values
from sklearn.utils import shuffle #to be able to shuffle rows
from sklearn.metrics import f1_score #to be able to calculate model's f1_score
from sklearn.metrics import roc_auc_score #to be able to calculate model's auc-roc

Now we can read our data using the pd.read_csv function taking the path to the file as argument

In [2]:
df = pd.read_csv('/datasets/Churn.csv') #reads the csv file and saves it as a pandas dataframe called df
df.info() #general information about the dataframe

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
RowNumber          10000 non-null int64
CustomerId         10000 non-null int64
Surname            10000 non-null object
CreditScore        10000 non-null int64
Geography          10000 non-null object
Gender             10000 non-null object
Age                10000 non-null int64
Tenure             9091 non-null float64
Balance            10000 non-null float64
NumOfProducts      10000 non-null int64
HasCrCard          10000 non-null int64
IsActiveMember     10000 non-null int64
EstimatedSalary    10000 non-null float64
Exited             10000 non-null int64
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


In [3]:
df.head() #first 5 rows of our dataframe

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


Our target will be the Exited column, while the rest of the columns will serve as features. Let us make a to-do list to prepare our features. First, we have to fill in the missing values in the Tenure column. Secondly, we need to create dummies for the categorical columns. Then, we will have to standardize (scale) the numerical columns, except the ones with binary values (either 1 or 0) like HasCrCard and IsActiveMember.

## Data Preprocessing <a name="step2"></a>

### Deleting some columns <a name="step2_1"></a>

The columns in question are RowNumber, CustomerId, and Surname. RowNumber is basically the index, except that it starts from 1 and not 0. CustomerId uniquely identifies each customer, and Surname is another means of identification. None of these columns carries information that would help our models generalize, so we drop them.

In [4]:
data=df.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)
#drops the required columns from df and names the resulting table 'data'
data.info()#general info about 'data'

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
CreditScore        10000 non-null int64
Geography          10000 non-null object
Gender             10000 non-null object
Age                10000 non-null int64
Tenure             9091 non-null float64
Balance            10000 non-null float64
NumOfProducts      10000 non-null int64
HasCrCard          10000 non-null int64
IsActiveMember     10000 non-null int64
EstimatedSalary    10000 non-null float64
Exited             10000 non-null int64
dtypes: float64(3), int64(6), object(2)
memory usage: 859.5+ KB


We have successfully dropped the columns.

### Missing Tenure values <a name="step2_2"></a>

When we take a look at the unique values of the Tenure column, we encounter NaN (missing) values

In [5]:
data['Tenure'].unique()

array([ 2.,  1.,  8.,  7.,  4.,  6.,  3., 10.,  5.,  9.,  0., nan])

We can fill these empty cells with the median value of the Tenure column. This is a simple choice that keeps all 10,000 rows and leaves the column's median unchanged, although it does compress the column's spread slightly.

In [6]:
data['Tenure']=data['Tenure'].fillna(data['Tenure'].median())#fills missing cells with the median
data['Tenure'].unique()

array([ 2.,  1.,  8.,  7.,  4.,  6.,  3., 10.,  5.,  9.,  0.])

We have successfully eliminated all missing values from the Tenure column.
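As a small illustration of median imputation, here is a sketch on a hypothetical toy column (the values and the `Tenure_was_missing` indicator are assumptions for illustration, not part of the notebook). It shows that `median()` skips NaN values and that an optional indicator column can record which rows were imputed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for Tenure (not the real data)
toy = pd.DataFrame({'Tenure': [2.0, 1.0, np.nan, 8.0, np.nan, 5.0]})

# median() ignores NaN values by default
median_tenure = toy['Tenure'].median()

# Fill the gaps with the median; the original column is left untouched
toy['Tenure_filled'] = toy['Tenure'].fillna(median_tenure)

# Optional: remember which rows were imputed (an assumption, not in the notebook)
toy['Tenure_was_missing'] = toy['Tenure'].isna().astype(int)

print(median_tenure)
print(toy)
```

Keeping an indicator column is one way to let a model distinguish real values from imputed ones; whether it helps depends on why the values are missing.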

### Dummies for the categorical columns <a name="step2_3"></a>

We now have 2 categorical columns: Geography and Gender. Let us look at each of their values...

In [7]:
data['Geography'].value_counts() #displays the unique values and how many times they appear in the column

France     5014
Germany    2509
Spain      2477
Name: Geography, dtype: int64

In [8]:
data['Gender'].value_counts()

Male      5457
Female    4543
Name: Gender, dtype: int64

So for Geography, we have 3 values: Spain, Germany, and France. When we create dummies for this column, the column will be replaced by 3 columns: Geography_Spain, Geography_Germany, and Geography_France. Each column takes the value 1 in the observations where the Geography column had that country as its value, and 0 otherwise. The Gender column is handled the same way. We will use the pd.get_dummies function on the whole 'data' table since those are the only categorical columns. We can drop one of the dummy columns in each case because a 0 in both Spain and Germany, for example, directly implies a 1 for France. We do this by setting the parameter drop_first=True.

In [9]:
data=pd.get_dummies(data, drop_first=True)
#replaces the categorical columns by their dummies and drops the first dummy column for each replaced column
data.head()

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_Germany,Geography_Spain,Gender_Male
0,619,42,2.0,0.0,1,1,1,101348.88,1,0,0,0
1,608,41,1.0,83807.86,1,0,1,112542.58,0,0,1,0
2,502,42,8.0,159660.8,3,1,0,113931.57,1,0,0,0
3,699,39,1.0,0.0,2,0,0,93826.63,0,0,0,0
4,850,43,2.0,125510.82,1,1,1,79084.1,0,0,1,0


We have successfully created the dummies for the Geography and Gender columns
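To make the effect of `drop_first=True` concrete, here is a minimal sketch on a hypothetical four-row table (the rows are made up for illustration). The `dtype=int` argument is an assumption added so the dummies print as 0/1 on recent pandas versions, which otherwise return booleans.

```python
import pandas as pd

# Hypothetical toy rows (not the real dataset)
toy = pd.DataFrame({
    'Geography': ['France', 'Spain', 'Germany', 'France'],
    'Gender': ['Female', 'Male', 'Female', 'Male'],
    'Age': [42, 41, 39, 43],
})

# drop_first=True removes the first category (alphabetically) of each column:
# France for Geography and Female for Gender
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

A row with 0 in both Geography_Germany and Geography_Spain can only be France, so no information is lost by dropping that column.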

### Scaling the numerical columns <a name="step2_4"></a>

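Our numerical columns are: 'CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', and 'EstimatedSalary'. These variables are on very different scales (compare Age with Balance or EstimatedSalary), so we standardize them by converting each value to its z-score. We do this because some algorithms, such as Logistic Regression, can treat features with larger magnitudes as more important, and we don't want that; tree-based models are largely insensitive to scaling. So we will create a StandardScaler object, fit() it on the numerical columns, and transform() them to get our scaled values. Note that fitting the scaler on the whole dataset before splitting lets validation and test statistics leak into preprocessing; a stricter setup would fit it on the training set only.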

In [10]:
numeric = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']
#creates a list containing the numeric column names
scaler = StandardScaler()#creates the scaler object
scaler.fit(data[numeric])#trains the scaler with the data from the numeric columns
data[numeric] = scaler.transform(data[numeric])#transforms the data into scaled values
data.head()

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_Germany,Geography_Spain,Gender_Male
0,-0.326221,0.293517,-1.086246,-1.225848,-0.911583,1,1,0.021886,1,0,0,0
1,-0.440036,0.198164,-1.448581,0.11735,-0.911583,0,1,0.216534,0,0,1,0
2,-1.536794,0.293517,1.087768,1.333053,2.527057,1,0,0.240687,1,0,0,0
3,0.501521,0.007457,-1.448581,-1.225848,0.807737,0,0,-0.108918,0,0,0,0
4,2.063884,0.388871,-1.086246,0.785728,-0.911583,1,1,-0.365276,0,0,1,0


We have successfully scaled the values in the numeric columns of our table
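The notebook fits the scaler on the full table before splitting. A stricter variant, sketched below on hypothetical random data, fits the scaler on the training portion only and reuses those statistics for the held-out rows. The toy column values and variable names are assumptions for illustration; this is not the approach used in the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for two numeric columns
rng = np.random.RandomState(12345)
toy = pd.DataFrame({
    'Balance': rng.uniform(0, 200000, size=100),
    'Age': rng.randint(18, 90, size=100),
})

# Split first, so the held-out rows never influence the scaler
toy_train, toy_valid = train_test_split(toy, test_size=0.4, random_state=12345)

scaler = StandardScaler()
# fit_transform learns mean and standard deviation from the training rows only
train_scaled = pd.DataFrame(scaler.fit_transform(toy_train),
                            columns=toy.columns, index=toy_train.index)
# transform reuses the training statistics on the held-out rows
valid_scaled = pd.DataFrame(scaler.transform(toy_valid),
                            columns=toy.columns, index=toy_valid.index)

print(train_scaled.mean().round(6))
print(train_scaled.std(ddof=0).round(6))
```

With this ordering the training columns have mean 0 and unit standard deviation, while the held-out columns are only approximately standardized, which is the expected behaviour.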

### Features, target, and splitting the data <a name="step2_5"></a>

Our target will be the Exited column, while the rest of the columns will be the features. We need to split both into training, validation, and test sets making up 60%, 20%, and 20% of the data respectively. To do so, we will call the train_test_split() function twice. The first time, we split into the training set and a second set by setting the parameter test_size=0.4 (the fraction of the dataset that goes to the second set). The second time, we split that second set into two equal halves (test_size=0.5), giving the validation set (20% of the original dataset) and the test set (20% of the original dataset). The random_state will be set to 12345, and we will keep it the same throughout the model training.

In [11]:
features=data.drop('Exited', axis=1)#features will be all columns except Exited column
target=data['Exited']#target will be the exited column
features_train, features_test_valid, target_train, target_test_valid = train_test_split(
    features, target, test_size=0.4, random_state=12345)
#1st split to get training sets for both features and target (60%) and a second set (40%)
features_valid, features_test, target_valid, target_test = train_test_split(
    features_test_valid, target_test_valid, test_size=0.5, random_state=12345)
#2nd split on the second set from earlier into the validation and test sets, even split
print(len(features_train), len(target_train), len(features_valid), len(target_valid),
      len(features_test), len(target_test))
#prints the lengths of the 3 sets of features and targets we derived from splitting

6000 6000 2000 2000 2000 2000


We have successfully defined our features and targets, and split the data into training, validation and test sets with their appropriate proportions
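The two-step 60/20/20 split can be checked on a small synthetic example. The sketch below also passes `stratify`, which is an assumption added here and not used in the notebook; it keeps the class proportions the same in every subset, which can be useful when one class is rare.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 20% positive class
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

# 1st split: 60% training, 40% held out
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=12345, stratify=y)

# 2nd split: the held-out 40% is halved into validation and test (20% each)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=12345, stratify=y_rest)

print(len(X_train), len(X_valid), len(X_test))
print(y_train.mean(), y_valid.mean(), y_test.mean())
```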

## Building a Model with the Class Imbalance <a name="step3"></a>

### Decision Tree Classification <a name="step3_1"></a>

Here, we will use the DecisionTreeClassifier class. We will set 2 hyperparameters: random_state and max_depth. random_state has to be the same across the board, so we will give it a fixed value (12345). max_depth, however, is the hyperparameter we will tune. We will loop through values of max_depth from 1 to 10 and get the f1 score and AUC-ROC value for each, both of which are metrics of model quality. The f1_score function compares the target of the validation set with the predictions. The roc_auc_score function compares the target of the validation set with the positive class probability of each observation in the validation set. We get these probabilities with the predict_proba() method.

In [13]:
for i in range(1, 11): #loops through values of i from 1 to 10
    dt_model = DecisionTreeClassifier(random_state=12345, max_depth=i)
    #creates a Decision Tree model with the max_depth value
    dt_model.fit(features_train, target_train)
    #trains the model using the features and target of the training set
    dt_pred_valid=dt_model.predict(features_valid)
    #gets predictions from the model using the features of the validation set
    probabilities_valid = dt_model.predict_proba(features_valid)
    #gets negative class and positive class probabilities for each observation of the features_valid set
    probabilities_one_valid = probabilities_valid[:, 1]
    #gets the positive class probabilities for each observation of the features_valid set
    print('Max depth', i, 'F1 score =', f1_score(target_valid, dt_pred_valid), 'AUC-ROC score =', \
         roc_auc_score(target_valid, probabilities_one_valid))
    #prints the f1 score by comparing the predictions to the target of the validation set and
    #the auc_roc score by comparing the validation set's target to the positive class probabilities

Max depth 1 F1 score = 0.0 AUC-ROC score = 0.6925565119556736
Max depth 2 F1 score = 0.5217391304347825 AUC-ROC score = 0.7501814673449512
Max depth 3 F1 score = 0.4234875444839857 AUC-ROC score = 0.7973440741838507
Max depth 4 F1 score = 0.5528700906344411 AUC-ROC score = 0.813428129858032
Max depth 5 F1 score = 0.5406249999999999 AUC-ROC score = 0.8221680508592478
Max depth 6 F1 score = 0.5696969696969697 AUC-ROC score = 0.8164631712023421
Max depth 7 F1 score = 0.5320813771517998 AUC-ROC score = 0.8138530658907929
Max depth 8 F1 score = 0.5454545454545454 AUC-ROC score = 0.8119854644656693
Max depth 9 F1 score = 0.5633802816901409 AUC-ROC score = 0.7801515554775917
Max depth 10 F1 score = 0.5406162464985994 AUC-ROC score = 0.7658451236699957


The best f1 score (~0.57) is observed at max_depth=6, which has an AUC-ROC value of ~0.82.
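The depth-1 row above, with an F1 score of 0.0 but an AUC-ROC of about 0.69, shows that the two metrics measure different things. The sketch below reproduces that situation on hypothetical labels and scores: no score reaches the 0.5 cutoff, so there are no positive predictions and F1 is 0, yet the scores still rank most positives above negatives. The `zero_division=0` argument assumes scikit-learn 0.22 or later.

```python
from sklearn.metrics import f1_score, roc_auc_score

# Hypothetical labels and predicted positive-class probabilities
y_true = [0, 0, 0, 1, 1]
scores = [0.10, 0.40, 0.20, 0.30, 0.45]

# Hard predictions at the usual 0.5 cutoff: every score is below it
y_pred = [int(s >= 0.5) for s in scores]

# F1 depends on the hard predictions; with no predicted positives it is 0
f1 = f1_score(y_true, y_pred, zero_division=0)

# AUC-ROC depends only on how the scores rank positives against negatives
auc = roc_auc_score(y_true, scores)

print(f1, auc)
```

Here 5 of the 6 positive-negative pairs are ranked correctly, so the AUC-ROC is 5/6 even though F1 is 0.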

### Random Forest Classification <a name="step3_2"></a>

We will use the RandomForestClassifier class. Our random_state hyperparameter should remain the same as before. The hyperparameters we will tune are max_depth and n_estimators. We will first create an empty list. Then we will loop through values of max_depth and, within that loop, loop through values of n_estimators. This creates a model for each combination of max_depth and n_estimators values; we store the models in the list and then choose the one with the highest f1 score on the validation set.

In [14]:
rf = []#empty list
for i in range(1, 11):#loops through values of i from 1 to 10 for max_depth
    for j in range(10, 101, 10):#loops through values of j from 10 to 100 with a step of 10 for n_estimators
        rf_model = RandomForestClassifier(random_state=12345, max_depth=i, n_estimators=j)
        #creates a random forest model
        rf_model.fit(features_train, target_train)
        #trains the model using the features and target of the training set
        rf.append(rf_model)#adds model to the list
    
print(max(rf, key=lambda rf_model: f1_score(target_valid, rf_model.predict(features_valid))))
#prints the model from the list with the highest f1 score, comparing the actual target of the
#validation set (first argument) with the predictions made from the validation features




RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=10, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=12345,
                       verbose=0, warm_start=False)


The Random Forest model with the highest f1 score has max_depth=10 and n_estimators=10. So let us train a model with exactly those hyperparameters and get its f1_score and roc_auc_score, using the same syntax as before.

In [15]:
best_rf_model = RandomForestClassifier(random_state=12345, max_depth=10, n_estimators=10)
best_rf_model.fit(features_train, target_train)
best_rf_pred = best_rf_model.predict(features_valid)
probabilities_rf_valid=best_rf_model.predict_proba(features_valid)
probabilities_rf_one_valid=probabilities_rf_valid[:, 1]
print('F1 score =', f1_score(target_valid, best_rf_pred), 'AUC-ROC =', \
      roc_auc_score(target_valid, probabilities_rf_one_valid))

F1 score = 0.5869894099848714 AUC-ROC = 0.8461436676969979


The f1 score is ~0.59, with an AUC-ROC score of ~0.85.
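The grid search over max_depth and n_estimators above keeps every fitted model in a list. A lighter variant, sketched here on synthetic data, tracks only the best score and its hyperparameters. The dataset from `make_classification`, the smaller grid, and all variable names are assumptions for illustration; `zero_division=0` assumes scikit-learn 0.22 or later.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced toy data (roughly 80% / 20%), not the bank dataset
X, y = make_classification(n_samples=400, n_features=8, weights=[0.8, 0.2],
                           random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y)

depths = [2, 4, 6]       # candidate max_depth values (hypothetical grid)
tree_counts = [10, 20]   # candidate n_estimators values (hypothetical grid)

best_f1, best_params = -1.0, None
for depth, n_trees in product(depths, tree_counts):
    model = RandomForestClassifier(random_state=12345, max_depth=depth,
                                   n_estimators=n_trees)
    model.fit(X_train, y_train)
    # f1_score expects the true labels first, then the predictions
    score = f1_score(y_valid, model.predict(X_valid), zero_division=0)
    if score > best_f1:
        best_f1 = score
        best_params = {'max_depth': depth, 'n_estimators': n_trees}

print(best_params, round(best_f1, 3))
```

Recording the score alongside the parameters also avoids retraining the winning model just to see its metrics.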

### Logistic Regression <a name="step3_3"></a>

We will use the LogisticRegression class. Again, our random_state should be the same. However, the max_depth and n_estimators hyperparameters don't apply here. All we need is to set a solver; we will use 'liblinear'.

In [16]:
lr_model = LogisticRegression(random_state=12345, solver='liblinear')
lr_model.fit(features_train, target_train)
lr_valid_pred=lr_model.predict(features_valid)
probabilities_lr_valid=lr_model.predict_proba(features_valid)
probabilities_lr_one_valid=probabilities_lr_valid[:, 1]
print('F1 score =', f1_score(target_valid, lr_valid_pred), 'AUC-ROC =', \
     roc_auc_score(target_valid, probabilities_lr_one_valid))

F1 score = 0.33108108108108103 AUC-ROC = 0.7587497504824008


- **Conclusion**

The best of the 3 models was the Random Forest Classifier with max_depth=10 and n_estimators=10, since it had the highest f1 score (about 0.59) and the highest AUC-ROC score (about 0.85). We will use this model moving forward.

## Taking into Account Class Imbalance <a name="step4"></a>

Let us study class imbalance so as to know the portions or shares of each class in the target of the training set. To do so, we will use the value_counts() function and set the parameter normalize=True.

In [12]:
target_train.value_counts(normalize=True)
#shows unique values of target_train and their shares (percentages) of the data

0    0.800667
1    0.199333
Name: Exited, dtype: float64

The negative class (0) is ~80% of the data, while the positive class (1) is ~20% of the data. So there are about 4 times as many 0s as there are 1s. We will look at two approaches to tackling class imbalance.

### Class Weight Adjustment <a name="step4_1"></a>

All we need to do here is set the hyperparameter class_weight='balanced' when creating the model. This gives the rarer class (1 in this case) more weight. Apart from that, the syntax is the same as before.

In [17]:
bal_rf_model = RandomForestClassifier(random_state=12345, max_depth=10, n_estimators=10, \
                                       class_weight='balanced')
bal_rf_model.fit(features_train, target_train)
bal_rf_pred = bal_rf_model.predict(features_valid)
proba_bal_rf_valid=bal_rf_model.predict_proba(features_valid)
proba_bal_rf_one_valid=proba_bal_rf_valid[:, 1]
print('F1 score =', f1_score(target_valid, bal_rf_pred), 'AUC-ROC =', \
      roc_auc_score(target_valid, proba_bal_rf_one_valid))

F1 score = 0.5907473309608542 AUC-ROC = 0.8300845637827472


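The f1 score is slightly better than before (0.591 vs 0.587), but the AUC-ROC value dipped from ~0.85 to ~0.83. AUC-ROC summarizes ranking quality across all thresholds, so this dip alone does not tell us how the True Positive Rate changed at the default threshold; checking that would require computing recall directly.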
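For reference, class_weight='balanced' weights each class inversely to its frequency, using n_samples / (n_classes * class_count). The sketch below computes those weights with scikit-learn's `compute_class_weight` helper. The class counts (4,804 zeros and 1,196 ones) are derived from the 6,000 training rows and the proportions printed earlier, so treat them as an assumption rather than a value read from the notebook.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts implied by 6000 training rows at ~80.07% / ~19.93% (assumption)
y = np.array([0] * 4804 + [1] * 1196)

# 'balanced' uses n_samples / (n_classes * count_of_class)
weights = compute_class_weight(class_weight='balanced',
                               classes=np.array([0, 1]), y=y)

print(dict(zip([0, 1], weights.round(4))))
```

The positive class ends up with roughly four times the weight of the negative class, mirroring the 4:1 imbalance.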

### Upsampling <a name="step4_2"></a>

In this approach, we repeat the observations of the rarer class enough times for it to roughly match the other class. We saw earlier that there are about 4 times as many 0s as 1s, so we will repeat the observations with target 1 four times in the training set. After doing so, we shuffle the rows using the shuffle() function so that the repeated observations are not grouped together.

In [19]:
def upsample(features, target, repeat):
    #creates a function called upsample which takes features, target, and repeat number as arguments
    features_zeros = features[target == 0]#gets the negative class features
    features_ones = features[target == 1]#gets the positive class features
    target_zeros = target[target == 0]#gets the negative class of the target
    target_ones = target[target == 1]#gets the positive class of the target
    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    #upsamples the features by combining the negative class features and the repeated positive class features
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)
    #upsamples the target by combining the negative class target and the repeated positive class target
    features_upsampled, target_upsampled = shuffle(features_upsampled, target_upsampled, random_state=12345)
    #shuffles the resulting upsampled features and targets
    return features_upsampled, target_upsampled # returns the resulting upsampled features and target

features_upsampled, target_upsampled = upsample(features_train, target_train, 4)
#upsamples the training features and target by passing them to the upsample function with a repeat
#value of 4
print(features_upsampled.shape, target_upsampled.shape)#prints the dimensions of the upsampled sets

(9588, 11) (9588,)
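As a quick sanity check of the upsampling arithmetic, the sketch below applies the same idea to a hypothetical toy set with 8 negatives and 2 positives. The function name `upsample_toy` and the data are assumptions for illustration. It also confirms that features and target stay aligned after shuffling.

```python
import pandas as pd
from sklearn.utils import shuffle

def upsample_toy(features, target, repeat):
    """Repeat the positive-class rows `repeat` times, then shuffle."""
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]
    features_up = pd.concat([features_zeros] + [features_ones] * repeat)
    target_up = pd.concat([target_zeros] + [target_ones] * repeat)
    # shuffle applies the same permutation to both objects
    return shuffle(features_up, target_up, random_state=12345)

# Hypothetical toy data: 8 negatives, 2 positives
toy_features = pd.DataFrame({'x': range(10)})
toy_target = pd.Series([0] * 8 + [1] * 2, name='Exited')

up_features, up_target = upsample_toy(toy_features, toy_target, 4)

print(up_target.value_counts().to_dict())
print((up_features.index == up_target.index).all())
```

Upsampling is applied to the training set only; the validation and test sets keep their original class proportions so the metrics reflect real-world data.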


Now we can train our model using these upsampled features and targets.

In [20]:
ups_rf_model = RandomForestClassifier(random_state=12345, max_depth=10, n_estimators=10)
ups_rf_model.fit(features_upsampled, target_upsampled)
ups_rf_pred = ups_rf_model.predict(features_valid)
proba_ups_rf_valid=ups_rf_model.predict_proba(features_valid)
proba_ups_rf_one_valid=proba_ups_rf_valid[:, 1]
print('F1 score =', f1_score(target_valid, ups_rf_pred), 'AUC-ROC =', \
      roc_auc_score(target_valid, proba_ups_rf_one_valid))

F1 score = 0.5836909871244635 AUC-ROC = 0.8335694929197491


Our f1 score (0.584) is slightly lower than what we got with class weight adjustment (0.591). The reverse is true for AUC-ROC (0.834 vs 0.830), though again the difference is small. These numbers alone do not show how the True Positive Rate changed; that would require computing recall directly.
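Changes in the True Positive Rate are easier to verify directly than to infer from F1 or AUC-ROC. The sketch below, on hypothetical labels and predictions, shows how recall (the True Positive Rate) and precision can be read off with scikit-learn; in the notebook the same calls could be applied to target_valid and each model's predictions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical validation labels and hard predictions
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = recall_score(y_true, y_pred)        # TP / (TP + FN), the True Positive Rate
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)

print(tn, fp, fn, tp)
print(recall, precision)
```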

- **Conclusion**

We will move forward with the class weight adjustment approach as it has the higher f1 score of 0.59.

## Finally, testing on the test set <a name="step5"></a>

Now let's apply our model (with class weight adjustment) to the test set. Before that we need to train the model using both the training and validation sets; we will join them using the pd.concat() function. 

In [21]:
features_train_final=pd.concat([features_train, features_valid])
#vertically stacks the features_train and features_valid sets
target_train_final=pd.concat([target_train, target_valid])
#vertically stacks the training and validation targets
final_rf_model = RandomForestClassifier(random_state=12345, max_depth=10, n_estimators=10, \
                                       class_weight='balanced')
final_rf_model.fit(features_train_final, target_train_final)
final_rf_pred = final_rf_model.predict(features_test)
proba_rf_test=final_rf_model.predict_proba(features_test)
proba_rf_one_test=proba_rf_test[:, 1]
print('F1 score =', f1_score(target_test, final_rf_pred), 'AUC-ROC =', \
      roc_auc_score(target_test, proba_rf_one_test))

F1 score = 0.6018518518518517 AUC-ROC = 0.8468229019099917


Our final f1 score is 0.60, which exceeds our threshold of 0.59.

## Conclusion <a name="step6"></a>
We processed the dataset (filling in missing values, getting dummy columns from categorical ones, and scaling numeric columns). After splitting the data, and without taking into account the 4:1 class imbalance, we trained Decision Tree, Random Forest, and Logistic Regression classifiers and determined Random Forest to be the best due to its f1 score of about 0.59 and AUC-ROC value of ~0.85. We then took the class imbalance into account and used two approaches: class weight adjustment and upsampling. We chose the former since it had the higher f1 score of 0.59, even though the latter had a slightly higher AUC-ROC score. Finally, we trained the model on both the training and validation data, applied it to the test set, and got an f1 score of 0.60.