# Predicting Churn (Supervised Learning)

Our dataset shows that clients of the bank we represent are leaving, a few more every month. The bank has determined that retaining existing clients is cheaper than attracting new ones, so we have been tasked with predicting whether a customer will leave the bank soon. The dataset records clients' past behavior and contract terminations.

In this report, I detail how we can test different models, such as binary logistic regression, and seek the maximum possible F1 score. This is a supervised learning classification problem, since the dataset is labeled: each row of features comes with a corresponding target.

I also write checklists to serve as reminders for possible steps to take.


Data Details
* Features 
    * RowNumber — data string index
    * CustomerId — unique customer identifier 
    * Surname — surname
    * CreditScore — credit score
    * Geography — country of residence
    * Gender — gender
    * Age — age
    * Tenure — period of maturation for a customer’s fixed deposit (years)
    * Balance — account balance
    * NumOfProducts — number of banking products used by the customer
    * HasCrCard — customer has a credit card
    * IsActiveMember — customer’s activeness
    * EstimatedSalary — estimated salary
* Target
    * Exited — customer has left

## Import Libraries

- [x] Import popular libraries, including a handful of classifiers to train your data with. Return to import libraries as needed. 

In [None]:
import pandas as pd
from scipy import stats
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.utils import resample
from sklearn.metrics import roc_auc_score
from sklearn.metrics import make_scorer
import numpy as np

## Inspect the Data

- [x] Import your data
- [x] Inspect your data. Consider using: 
        - info()
        - head()
        - tail()
        - value_counts(). Can also help you to locate values MCAR, MAR, MNAR. 
        - describe() 
        - duplicated() 
        - shape (an attribute, so no parentheses)

In [None]:
df = pd.read_csv('/content/Churn.csv')

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


In [None]:
df.describe()

Unnamed: 0,RowNumber,CustomerId,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
count,10000.0,10000.0,10000.0,10000.0,9091.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,5000.5,15690940.0,650.5288,38.9218,4.99769,76485.889288,1.5302,0.7055,0.5151,100090.239881,0.2037
std,2886.89568,71936.19,96.653299,10.487806,2.894723,62397.405202,0.581654,0.45584,0.499797,57510.492818,0.402769
min,1.0,15565700.0,350.0,18.0,0.0,0.0,1.0,0.0,0.0,11.58,0.0
25%,2500.75,15628530.0,584.0,32.0,2.0,0.0,1.0,0.0,0.0,51002.11,0.0
50%,5000.5,15690740.0,652.0,37.0,5.0,97198.54,1.0,1.0,1.0,100193.915,0.0
75%,7500.25,15753230.0,718.0,44.0,7.0,127644.24,2.0,1.0,1.0,149388.2475,0.0
max,10000.0,15815690.0,850.0,92.0,10.0,250898.09,4.0,1.0,1.0,199992.48,1.0


In [None]:
df.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


In [None]:
print("\033[1m" + 'We have {} duplicated rows.'.format(df.duplicated().sum()) + "\033[0m")

We have 0 duplicated rows.


In [None]:
df_nulls = pd.DataFrame(df.isna().sum(),columns=['Missing Values'])
df_nulls['Percent of Nulls'] = round(df_nulls['Missing Values'] / df.shape[0],4) *100
df_nulls

Unnamed: 0,Missing Values,Percent of Nulls
RowNumber,0,0.0
CustomerId,0,0.0
Surname,0,0.0
CreditScore,0,0.0
Geography,0,0.0
Gender,0,0.0
Age,0,0.0
Tenure,909,9.09
Balance,0,0.0
NumOfProducts,0,0.0


Our inspection shows that about 9% of Tenure values (period of maturation for a customer's fixed deposit, in years) are missing. We'll fill these missing values in the next phase, Data Preprocessing. We also notice some categorical variables, which we'll have to encode before testing the logistic regression model. (Tree-based models can in principle work with categorical variables, although scikit-learn's decision tree and random forest implementations also expect numeric input.)

## Data Preprocessing

Before testing any models, we want to make sure that our dataset is as clean as possible. We'll use this checklist to help guide us through any transformations we might need to make. 
- [x] Deal with Missing Values 
- [ ] Eliminate Outliers 
- [ ] Check for highly correlated features. Sometimes we can drop highly correlated features, but before that, we should also test to see how they influence the model. Try all variants.
- [x] Drop irrelevant features
- [ ] Create new features/variables (i.e., proportions) 
- [ ] Feature Scaling: All features should be considered equally important before the algorithm's execution. 
- [ ] Separating Data Types: Numerical (Discrete, Continuous), Categorical(Ordinal, Nominal, Binary), Date/Time, Text, Image, Sound. 
          We can use select_dtypes() for example cat_df = df.select_dtypes(['object', 'bool']) and 
          num_df = df.select_dtypes(['int', 'float']) 
- [ ] Convert Data Types, if necessary

Transforming Categorical Variables for Logistic Regression Models
- [x] One-Hot Encoding for Categorical variables [Read](https://towardsdatascience.com/one-hot-encoding-is-making-your-tree-based-ensembles-worse-heres-why-d64b282b5769) and [Read](https://datascience.stackexchange.com/questions/18056/why-dont-tree-ensembles-require-one-hot-encoding) Remember the redundancy: one dummy column per variable can be dropped (drop_first = True), since it is implied by the others.
- [ ] Ordinal Encoding for any ordinal variables (Generally, label encoding is a bad idea, particularly for regression algorithms. Other algorithms may allow for unencoded categorical variables). OHE gives each category of a nominal variable its own column, but this discards the ordering of ordinal variables.
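
The difference between the two encodings in the checklist above can be sketched on a toy frame. This is only an illustration: the `Size` column and its category order are hypothetical, since the churn dataset itself has no ordinal variable.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame; "Size" is a made-up ordinal column used only for illustration.
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],  # nominal
    "Size": ["small", "large", "medium", "small"],           # ordinal (hypothetical)
})

# One-hot encode the nominal column; drop_first removes the redundant dummy.
nominal_encoded = pd.get_dummies(toy[["Geography"]], drop_first=True)

# Ordinal-encode with an explicit order so that small < medium < large.
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_encoded = encoder.fit_transform(toy[["Size"]]).ravel()

print(list(nominal_encoded.columns))
print(size_encoded)
```

With three countries, two dummy columns remain after `drop_first=True`; a row with zeros in both stands for the dropped category.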


*Note to self: Explore how class imbalance leads to overfitting*

*Note to self: Explore application / distinguish applications of data wrangling methods on each new dataset* 

In [None]:
# Deal with Missing Values 
df.query('Tenure == 0')

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
29,30,15656300,Lucciano,411,France,Male,29,0.0,59697.17,2,1,1,53483.21,0
35,36,15794171,Lombardo,475,France,Female,45,0.0,134264.04,1,1,0,27822.99,1
57,58,15647091,Endrizzi,725,Germany,Male,19,0.0,75888.20,1,0,0,45613.75,0
72,73,15812518,Palermo,657,Spain,Female,37,0.0,163607.18,1,0,1,44203.55,0
127,128,15782688,Piccio,625,Germany,Male,56,0.0,148507.24,1,1,0,46824.08,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9793,9794,15772363,Hilton,772,Germany,Female,42,0.0,101979.16,1,1,0,90928.48,0
9799,9800,15722731,Manna,653,France,Male,46,0.0,119556.10,1,1,0,78250.13,1
9843,9844,15778304,Fan,646,Germany,Male,24,0.0,92398.08,1,1,1,18897.29,0
9868,9869,15587640,Rowntree,718,France,Female,43,0.0,93143.39,1,1,0,167554.86,0


In [None]:
# display(df.query('Tenure.isna()'))

In order to decide what to do with the 909 rows with missing Tenure information, we looked at the rows in detail and didn't find patterns or relationships in the data. We can choose to do any number of things, but in this case, we'll fill the missing values with the median of the Tenure data. The describe() function above showed us the median and mean are virtually the same, so in this case either measure of central tendency could work. 
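
One way to make the "no patterns" check more concrete is to compare summary statistics for rows with and without a Tenure value. The sketch below uses a tiny made-up frame (the values are illustrative and not taken from Churn.csv); on the real data, similar group means would be weak evidence that the values are missing at random.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real frame; values are invented for illustration.
toy = pd.DataFrame({
    "Tenure": [2.0, np.nan, 8.0, np.nan, 5.0, 1.0],
    "Age":    [42, 41, 39, 43, 29, 45],
    "Exited": [1, 0, 1, 0, 0, 1],
})

# Flag rows where Tenure is missing, then compare the group means.
toy["tenure_missing"] = toy["Tenure"].isna()
missing_summary = toy.groupby("tenure_missing")[["Age", "Exited"]].mean()
print(missing_summary)
```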

In [None]:
df['Tenure'] = df['Tenure'].fillna(df['Tenure'].median())
display(df.isna().sum())

RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

In [None]:
# Drop irrelevant features
df = df.drop(['RowNumber', 'CustomerId', 'Surname'], axis = 1)

In [None]:
# Transform Categorical Variables for Logistic Regression
df = pd.get_dummies(df, drop_first = True)
df.head()

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_Germany,Geography_Spain,Gender_Male
0,619,42,2.0,0.0,1,1,1,101348.88,1,0,0,0
1,608,41,1.0,83807.86,1,0,1,112542.58,0,0,1,0
2,502,42,8.0,159660.8,3,1,0,113931.57,1,0,0,0
3,699,39,1.0,0.0,2,0,0,93826.63,0,0,0,0
4,850,43,2.0,125510.82,1,1,1,79084.1,0,0,1,0


We used get_dummies() to transform the two categorical variables (Geography and Gender). Both are nominal, so no further transformation is needed. We'll move forward and use this processed dataset to train our models and study their performance.

In looking at our checklist above, we can see that we still haven't investigated for highly correlated features or created any new features. In the next section, we'll choose to start training our models based off of the data and see how our dataset performs. We've already changed quite a bit and now would be a good time to dive into our model building and testing. We can always return to manipulate our dataset at a later time. 
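
For when we do return to the correlation check, here is a minimal sketch of how it could look. It runs on synthetic data, and the 0.9 cutoff is an arbitrary assumption rather than a recommended value.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.01, size=200),  # nearly a rescaled copy of "a"
    "c": rng.normal(size=200),                       # independent of "a"
})

corr = toy.corr().abs()
# Keep only the upper triangle so that each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = [(r, c) for r in upper.index for c in upper.columns
              if upper.loc[r, c] > 0.9]  # 0.9 is an assumed threshold
print(high_pairs)
```

Pairs flagged this way are candidates for dropping one feature, but as the checklist says, it is worth testing the model with and without them.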

## Create Your Datasets

- [ ] Set a random state for reproducible results
- [ ] Separate features from target 
- [ ] Create three sets of data: a training set, validation set and testing set. The validation set will be used to test different hyperparameters while the testing set is used to check the performance of the model. 
- [ ] Check the size of the samples we created

In [None]:
random_state = 93

def split_data(data):
    features, target = data.drop(columns=['Exited']), data['Exited']
    
    # 80% train+validation and 20% test
    features_train, features_test, target_train, target_test = train_test_split(
        features, target, test_size=0.2, random_state=42)
    
    # Split the remaining 80% again: 64% train and 16% validation overall.
    # Seeding this split as well keeps the validation set reproducible.
    features_train, features_valid, target_train, target_valid = train_test_split(
        features_train, target_train, test_size=0.2, random_state=random_state)
    
    return features_train, features_valid, features_test, \
           target_train, target_valid, target_test

In [None]:
features_train, features_valid, features_test, target_train, target_valid, target_test = split_data(df)

In [None]:
# Let's check sizes of samples
assert features_train.shape[0] == target_train.shape[0]
print("Training size: ", features_train.shape[0], target_train.shape[0])

assert features_valid.shape[0] == target_valid.shape[0]
print("Validation size: ",features_valid.shape[0], target_valid.shape[0])

assert features_test.shape[0] == target_test.shape[0]
print("Test size: ",features_test.shape[0], target_test.shape[0])

Training size:  6400 6400
Validation size:  1600 1600
Test size:  2000 2000
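
The split above does not control the class ratio in each subset. A possible variant, sketched here on toy data with roughly the same 80/20 imbalance as Exited, passes `stratify` to `train_test_split` so every subset keeps the same share of positives. The toy frame and the seed are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy target with an 80/20 class split, similar to Exited.
toy_target = pd.Series([0] * 800 + [1] * 200)
toy_features = pd.DataFrame({"x": np.arange(1000)})

# First split off the test set, then split the rest into train and validation.
x_rest, x_test, y_rest, y_test = train_test_split(
    toy_features, toy_target, test_size=0.2, random_state=93, stratify=toy_target)
x_train, x_valid, y_train, y_valid = train_test_split(
    x_rest, y_rest, test_size=0.2, random_state=93, stratify=y_rest)

for name, y in [("train", y_train), ("valid", y_valid), ("test", y_test)]:
    print(name, len(y), round(y.mean(), 3))
```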


## Testing Models 

 - [x] Select a variety of classifiers for this classification model. In this project, we'll be trying out the Logistic Regression and Random Forest classifiers. 
 - [ ] Set a random state in building models for reproducible results
 - [ ] Quick test using a preferred model of our choice. Test with the evaluation metrics we need. 
 - [ ] Tune hyperparameters using validation set, and leveraging tools like GridSearchCV() 
 - [ ] Check the accuracy (or the evaluation metric/s of your choice) on the models you are building. 
 - [ ] Take note of the hyperparameters that resulted in the highest evaluation metric (i.e., n_estimators = 40) 

Evaluation Metrics 
- [x] Choose the leading evaluation metric for your model. Consider
    * Confusion Matrix (TP, FP, FN, TN) : confusion_matrix() 
    * Accuracy
    * Recall
    * F1 Score 
    * Precision 
    * Score
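
To see how these metrics relate, here is a small worked example on hand-made labels (not project data). Precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is their harmonic mean, which is why a model that rarely predicts the positive class can have high accuracy and a very low F1.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hand-made labels: 4 positives, 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("precision:", precision, "recall:", recall, "f1:", f1, "accuracy:", accuracy)
```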

Additional Data Processing

Once we've created our different datasets, we can start to do other transformation steps to make our data work better with our model. At this point, since we have a set of data for training, that's the set of data that we'll use to consider:
- [ ] Upsampling. When the ratio of classes is far from 1:1, our classes are imbalanced and will cause problems as we train the model.
- [x] Downsampling. When the ratio of classes is far from 1:1, our classes are imbalanced and will cause problems as we train the model.
- [ ] You can use the class_weight argument from sklearn to try to fix class imbalance 
- [ ] Generally, any methods to correct class Imbalance
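
Upsampling is left unchecked in this project. For reference, a minimal sketch of what it could look like follows, mirroring the structure of this notebook's `downsample()` helper. The `upsample` function, its `repeat` parameter, and the toy data are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd
from sklearn.utils import shuffle

def upsample(features, target, repeat):
    """Repeat the positive-class rows `repeat` times, then shuffle."""
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)

    # shuffle keeps features and target aligned row by row.
    return shuffle(features_upsampled, target_upsampled, random_state=12345)

# Toy data: 8 negatives and 2 positives.
toy_features = pd.DataFrame({"x": range(10)})
toy_target = pd.Series([0] * 8 + [1] * 2)

features_up, target_up = upsample(toy_features, toy_target, 4)
print(target_up.value_counts().to_dict())
```

As with downsampling, this should only ever be applied to the training set, so that validation and test data keep the real class ratio.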


Before we move on to building out other classifiers, it makes sense to do a rough test of one of our models and check to see what our preferred evaluation metrics tell us about the dataset/model. Depending on what we find, we might be inclined to return to some of our other available data transformation options, like scaling features. 

## Logistic Regression

In [None]:
def train_LogReg_sanity_check(x_train, x_valid, y_train, y_valid): 
    model = LogisticRegression(solver = 'liblinear', random_state = random_state)
    model.fit(x_train, y_train)
    print('Accuracy:', model.score(x_valid, y_valid))
    print('F1 Score:', f1_score(y_valid, model.predict(x_valid)))
    print('AUC-ROC:', roc_auc_score(y_valid, model.predict_proba(x_valid)[:,1]))

train_LogReg_sanity_check(features_train, features_valid, target_train, target_valid)

Accuracy: 0.795
F1 Score: 0.0989010989010989
AUC-ROC: 0.6776739137519144


Wow, that's quite a low F1 Score. Since F1 is the prioritized metric for this project, we'll now look at our data and see if there are any other changes we can make to improve our results. We'll also have the opportunity to look for improved performance by tuning our preferred model. 

### Class Imbalance

In [None]:
# Check for class imbalance
print(df['Exited'].value_counts())

0    7963
1    2037
Name: Exited, dtype: int64


One area where we can make improvements is class imbalance. The counts above show that clients who have not exited are overrepresented: there are nearly four times as many of them as clients who have left the bank. 

In [None]:
# Balance class weights 
def train_LogReg_balanced_check(x_train, x_valid, y_train, y_valid, class_weight = None): 
    model = LogisticRegression(class_weight = class_weight, solver = 'liblinear', random_state = random_state)
    model.fit(x_train, y_train)
    print('Accuracy:', model.score(x_valid, y_valid))
    print('F1 Score:', f1_score(y_valid, model.predict(x_valid)))
    print('AUC-ROC:', roc_auc_score(y_valid, model.predict_proba(x_valid)[:,1]))

train_LogReg_balanced_check(features_train, features_valid, target_train, target_valid, 'balanced')

Accuracy: 0.676875
F1 Score: 0.454065469904963
AUC-ROC: 0.722873565811606


Great! Using class_weight resulted in a jump in F1 Score. Let's try another method for fixing class imbalance and test again. 

In [None]:
def downsample(features, target, fraction):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_downsampled = pd.concat(
        [features_zeros.sample(frac=fraction, random_state=12345)] + [features_ones])
    target_downsampled = pd.concat(
        [target_zeros.sample(frac=fraction, random_state=12345)] + [target_ones])
    
    features_downsampled, target_downsampled = shuffle(
        features_downsampled, target_downsampled, random_state=12345)
    
    return features_downsampled, target_downsampled

In [None]:
features_downsampled, target_downsampled = downsample(features_train, target_train, 0.1)

*Note to self: Explore alternate methods for downsampling*

In [None]:
# Train on the downsampled data (class weights are still balanced)
def train_LogReg_downsampled_check(x_train, x_valid, y_train, y_valid, class_weight = None): 
    model = LogisticRegression(class_weight = class_weight, solver = 'liblinear', random_state = random_state)
    model.fit(x_train, y_train)
    print('Accuracy:', model.score(x_valid, y_valid))
    print('F1 Score:', f1_score(y_valid, model.predict(x_valid)))
    print('AUC-ROC:', roc_auc_score(y_valid, model.predict_proba(x_valid)[:,1]))

train_LogReg_downsampled_check(features_downsampled, features_valid, target_downsampled, target_valid, 'balanced')



Accuracy: 0.659375
F1 Score: 0.43523316062176165
AUC-ROC: 0.7044111711901413


Now, back to other data processing we can do: feature scaling. Our quantitative features are on very different scales. Tenure ranges from 0 to 10, for example, while Balance and EstimatedSalary run into the hundreds of thousands. This kind of difference can negatively affect how the Logistic Regression model learns from our data. Let's use StandardScaler() to scale these features.

### Feature Scaling

In [None]:
# Scale features_train
numeric = ['CreditScore', 'Age', 'Balance', 'EstimatedSalary', 'Tenure', 'NumOfProducts']
scaler = StandardScaler() 
scaler.fit(features_train[numeric])
features_train.loc[:,numeric] = scaler.transform(features_train[numeric])

In [None]:
# Scale features_valid with the scaler that was fitted on features_train.
# Refitting on validation or test data would give each set its own mean and
# variance, so the model would see inconsistently scaled inputs.
features_valid.loc[:,numeric] = scaler.transform(features_valid[numeric])

In [None]:
# Scale features_test with the same training-set scaler
features_test.loc[:,numeric] = scaler.transform(features_test[numeric])

In [None]:
def train_LogReg_balanced_scale_check(x_train, x_valid, y_train, y_valid, class_weight = None): 
    model = LogisticRegression(class_weight = class_weight, solver = 'liblinear', random_state = random_state)
    model.fit(x_train, y_train)
    print('Accuracy:', model.score(x_valid, y_valid))
    print('F1 Score:', f1_score(y_valid, model.predict(x_valid)))
    print('AUC-ROC:', roc_auc_score(y_valid, model.predict_proba(x_valid)[:,1]))

train_LogReg_balanced_scale_check(features_train, features_valid, target_train, target_valid, 'balanced')

Accuracy: 0.709375
F1 Score: 0.4709897610921501
AUC-ROC: 0.7634632594586871
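
An alternative way to organize scaling, sketched here on synthetic data rather than the churn frame, is to put the scaler and the classifier in a scikit-learn `Pipeline`. The pipeline fits the scaler on the training data only and reuses those statistics whenever it transforms or predicts on other data. The dataset and the inflated first feature are assumptions made for the demo.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data; one feature is inflated to mimic Balance-like scales.
X, y = make_classification(n_samples=500, n_features=6, weights=[0.8, 0.2], random_state=93)
X[:, 0] *= 10000
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=93, stratify=y)

pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", solver="liblinear", random_state=93),
)
pipeline.fit(X_train, y_train)  # the scaler is fitted on X_train only

# make_pipeline names each step after its lowercased class name.
scaler = pipeline.named_steps["standardscaler"]
valid_scaled = scaler.transform(X_valid)  # uses the training-set mean and variance
print("Validation accuracy:", pipeline.score(X_valid, y_valid))
```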


Hmm, we're still not getting close to the F1 Score we'd like to see (.59). Let's start trying a different classifier - Random Forest. 

## Random Forest

In [None]:
def tree_model(x_train, x_valid, y_train, y_valid): 
    model = RandomForestClassifier(random_state = random_state, n_estimators = 100) 
    model.fit(x_train, y_train)
    print('Accuracy:', model.score(x_valid, y_valid))
    print('F1 Score:', f1_score(y_valid, model.predict(x_valid)))
    print('AUC-ROC:', roc_auc_score(y_valid, model.predict_proba(x_valid)[:,1]))
    
tree_model(features_train, features_valid, target_train, target_valid)

Accuracy: 0.873125
F1 Score: 0.598019801980198
AUC-ROC: 0.8536357004805487


We've got promising results! Let's dig into the Random Forest classifier and look for the best parameters using GridSearchCV. 

In [None]:
random_forest = RandomForestClassifier(random_state=random_state, class_weight = 'balanced')
random_params = {'n_estimators':range(1,100,5), 'max_depth':range(1,15, 1)}
grid_search = GridSearchCV(estimator = random_forest,
                          param_grid = random_params)
grid_search = grid_search.fit(features_train, target_train)

In [None]:
accuracy = grid_search.best_score_
accuracy

0.8534375000000001

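Note that best_score_ here is the mean cross-validated accuracy: GridSearchCV uses the estimator's default score() method (accuracy for classifiers) unless a scoring argument is passed. Passing scoring='f1' would make the search optimize F1 instead. 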
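
A minimal sketch of a grid search that optimizes F1 follows. It runs on synthetic data standing in for features_train and target_train, and the grid is deliberately tiny so it finishes quickly; both are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data standing in for the training set.
X, y = make_classification(n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=93)

forest = RandomForestClassifier(random_state=93, class_weight="balanced")
param_grid = {"n_estimators": [10, 30], "max_depth": [3, 6]}  # small grid, for speed only

# scoring="f1" replaces the default accuracy with the F1 score of the positive class.
search = GridSearchCV(estimator=forest, param_grid=param_grid, scoring="f1", cv=3)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)  # mean cross-validated F1, not accuracy
```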

In [None]:
grid_search.best_params_

{'max_depth': 14, 'n_estimators': 61}

## Evaluation Metrics

- [x] For Logistic Regression, consider adjusting the classification threshold. This model computes the probability of classes, and the line where the negative class ends and the positive class begins is called the threshold. By default, it is .5, but we can change it. 
- [x] For Logistic Regression, consider studying the PR (Precision-Recall) Curve. 
- [x] For Logistic Regression, consider studying the ROC Curve, where TPR is plotted on the y-axis and FPR on the x-axis. For a model that answers randomly, the ROC curve is a diagonal line going from the lower left to the upper right. The higher the curve, the greater the TPR value and the better the model's quality. 
- [x] For Logistic Regression, calculate the AUC-ROC value to find out how much our model differs from a random model. This metric ranges from 0 to 1, where the AUC-ROC value for a random model is .5. 
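
The threshold adjustment from the first item can be sketched as a sweep over candidate cutoffs applied to `predict_proba` output. The data here is synthetic and the threshold grid is an arbitrary choice; on the real project the sweep would use the validation set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the churn features.
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.8, 0.2], random_state=93)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=93, stratify=y)

model = LogisticRegression(solver="liblinear", random_state=93).fit(X_train, y_train)
proba_ones = model.predict_proba(X_valid)[:, 1]  # probability of the positive class

# Try nine thresholds from 0.1 to 0.9 and record the F1 score for each.
thresholds = np.linspace(0.1, 0.9, 9)
scores = [f1_score(y_valid, (proba_ones >= t).astype(int)) for t in thresholds]
best_threshold = thresholds[int(np.argmax(scores))]

for t, s in zip(thresholds, scores):
    print(f"threshold={t:.1f}  F1={s:.3f}")
print("Best threshold:", best_threshold)
```

Lowering the threshold typically trades precision for recall, so on imbalanced data the best F1 is often found below .5.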

* Underfitting occurs when accuracy is low and approximately the same for both the training and test sets. 
* Overfitting occurs when the model learns dependencies that exist only in the training data, so it scores noticeably higher on the training set than on the test set. 
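
Both situations can be spotted by comparing training and test scores. The sketch below uses a single decision tree on synthetic data (an assumption for illustration; it is not one of the models tuned in this project): a depth-1 tree tends to score modestly on both sets, while an unrestricted tree fits the training set perfectly and usually does worse on the test set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=8, random_state=93)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=93)

results = {}
for depth in (1, None):  # None lets the tree grow until it fits the training set
    tree = DecisionTreeClassifier(max_depth=depth, random_state=93).fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for each depth setting.
    results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print("max_depth =", depth, "train/test accuracy:", results[depth])
```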

In [None]:
def tree_model(x_train, x_valid, y_train, y_valid, x_test, y_test): 
    # Hyperparameters are set by hand here; the fitted grid_search object is not used.
    # These values differ from grid_search.best_params_ shown above.
    model = RandomForestClassifier(random_state = random_state, n_estimators = 46, max_depth = 13, class_weight = 'balanced')
    model.fit(x_train, y_train)
    # Validation-set metrics
    print('Accuracy:', model.score(x_valid, y_valid))
    print('F1 Score:', f1_score(y_valid, model.predict(x_valid)))
    print('AUC-ROC:', roc_auc_score(y_valid, model.predict_proba(x_valid)[:,1]))
    
    # Test-set metrics
    print('Accuracy:', model.score(x_test, y_test))
    print('F1 Score:', f1_score(y_test, model.predict(x_test)))
    print('AUC-ROC:', roc_auc_score(y_test, model.predict_proba(x_test)[:,1]))

In [None]:
tree_model(features_train, features_valid, target_train, target_valid, features_test, target_test)

Accuracy: 0.863125
F1 Score: 0.5936920222634507
AUC-ROC: 0.8459800109796727
Accuracy: 0.8575
F1 Score: 0.6013986013986014
AUC-ROC: 0.853495600513656


## Conclusion

Through several rounds of testing and tuning, and finally a GridSearchCV search over Random Forest hyperparameters, we arrived at a model that meets our target. The chosen model gives an F1 score above .59 on both the validation set (.594) and the test set (.601). Its AUC-ROC of about .85 on both sets indicates that the classifier distinguishes well between the two classes. 