<center> <h1><font size=7> Case Study A</font> </h1> </center>

## Bank Churn - example answer

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.utils import resample

%matplotlib inline

In [None]:
# Set file path
churn_filepath = '../../data/churn.csv'

# Import data into a data frame.
raw_data = pd.read_csv(filepath_or_buffer=churn_filepath, delimiter=",")

raw_data.head()

In [None]:
# Get information about the data types.
raw_data.info()

Let's create a train-test split as early as possible to avoid biasing our model.

Previously in the course we have done this quite late, but it's good practice to do it as soon as you can.

Remember, we need to follow the same steps for processing the training and test data. So for any value found using the training data, such as a mean to impute a column, we are going to save it then use it later on for the test data.

This process can be a bit cumbersome when using data frames (shown in this case study). Pipeline and composite transformers can be really useful in this case.
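
As a rough illustration of that alternative, the sketch below wraps imputation, scaling and one hot encoding in a `Pipeline` and a `ColumnTransformer`, so the values learned from the training data are stored and reused on the test data automatically. The toy frames and the variable names are made up for this example and are not the churn data itself.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Hypothetical toy data standing in for the churn training and test frames.
toy_train = pd.DataFrame({
    "Customer_Age": [45.0, np.nan, 38.0, 52.0],
    "Card_Category": ["Blue", "Silver", np.nan, "Blue"],
})
toy_test = pd.DataFrame({
    "Customer_Age": [np.nan, 61.0],
    "Card_Category": [np.nan, "Gold"],
})

# Numeric columns: impute with the training median, then scale.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", RobustScaler()),
])

# Categorical columns: impute with the most common training value, then one hot encode.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# sparse_threshold=0 asks for a dense array whatever the encoder returns.
preprocessor = ColumnTransformer(
    [
        ("numeric", numeric_steps, ["Customer_Age"]),
        ("categorical", categorical_steps, ["Card_Category"]),
    ],
    sparse_threshold=0,
)

# fit_transform learns the medians, modes and categories from the training data only...
toy_train_processed = preprocessor.fit_transform(toy_train)

# ...and transform reuses them on the test data.
toy_test_processed = preprocessor.transform(toy_test)
```

The unseen "Gold" category in the toy test frame is encoded as all zeros because of `handle_unknown="ignore"`, which is the same behaviour we rely on later in this case study.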

In [None]:
raw_train, raw_test = train_test_split(raw_data, test_size=0.2, random_state=123)

In [None]:
# Checking for missing values
raw_train.isna().sum()

The "Customer_Age", "Card_Category" and "Credit_Limit" columns are the only ones with missing values. We will explore each and impute the missing values with a sensible replacement. We are going to impute using basic summary statistics of each single variable, but we could use more complex methods.

In [None]:
# Explore Customer Age
raw_train["Customer_Age"].plot.density();

In [None]:
# The data appears to be approximately normally distributed, so the mean and
# median are similar; we impute with the median.
# ~10% of this column's values are missing, so this will skew the data
# centrally but we can ignore that for now.
clean_train = raw_train.copy()

median_age = raw_train["Customer_Age"].median()

clean_train["Customer_Age"] = clean_train["Customer_Age"].fillna(median_age)

# Some skewing occurs centrally
clean_train["Customer_Age"].plot.density();

In [None]:
# Explore Card Category
clean_train["Card_Category"].value_counts(normalize=False).plot(kind="bar"); # excludes missing values

Looking at the counts above, the most common card category, "Blue", is far more common than any of the others. It is therefore a good basic choice for imputation.

In [None]:
# From inspection, could/should be done programmatically
common_category = "Blue"

# fill with most common category
clean_train["Card_Category"] = clean_train["Card_Category"].fillna(common_category)

# look at result
clean_train["Card_Category"].value_counts(normalize=False).plot(kind="bar"); # excludes missing values
clean_train["Card_Category"].isna().sum()
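
The hard-coded category could instead be looked up from the data. The sketch below is one way to do that, using a small made-up series in place of the real "Card_Category" column; the names `toy_cards` and `most_common_card` are only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for raw_train["Card_Category"].
toy_cards = pd.Series(["Blue", "Silver", "Blue", np.nan, "Gold", "Blue"])

# value_counts() ignores missing values by default;
# idxmax() returns the label with the highest count.
most_common_card = toy_cards.value_counts().idxmax()

# Fill the missing values with the category found above.
toy_cards_filled = toy_cards.fillna(most_common_card)
```

On the real data the value found this way should be computed from the training set only, then reused for the test set.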

We know that the customer IDs ("CLIENTNUM") are going to be unique, and therefore a bad feature to include in the model. We will remove them from the training data once the imputation is finished.

Now to impute the final column with missing data, "Credit_Limit".

In [None]:
# The data is skewed, but with a second peak between 30-40,000
clean_train["Credit_Limit"].plot.density();

In [None]:
# For now we will ignore the second peak and impute with the median
median_limit = clean_train["Credit_Limit"].median()

clean_train["Credit_Limit"] = clean_train["Credit_Limit"].fillna(median_limit)

clean_train["Credit_Limit"].isna().sum()

In [None]:
# Remove unique columns.
basic_feature_data = clean_train.drop(columns=["CLIENTNUM"])
basic_feature_data.head()

From the above we can see that the "Income_Category" data can be cleaned and encoded more appropriately.

In [None]:
basic_feature_data["Income_Category"].value_counts()

We have a number of options for encoding this. We could split the values into two columns, with a lower and an upper bound. We could encode the data as integers 0, 1, 2, ... expressing the order of the categories. We could one hot encode the values, assuming that each is independent.

As there is a clear order between the different levels, I am going to encode each range with its midpoint (40 - 60 -> 50), or with the lower bound where the range is open-ended (120+ -> 120). The "Unknown" values will need to be imputed.

We cannot perfectly encapsulate the interval nature of this data, as the intervals themselves are not well bounded; we can, however, approximate them.
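
To make the options concrete, here is a small sketch of the order, midpoint and one hot encodings applied to a made-up series of income bands. The series and the `order_map` / `midpoint_map` dictionaries are illustrative only and cover just a few bands.

```python
import pandas as pd

# Hypothetical sample of income bands; "Unknown" is deliberately left out of the maps.
toy_income = pd.Series(
    ["Less than $40K", "$40K - $60K", "$80K - $120K", "Unknown"],
    name="Income_Category",
)

# Option 1: integer codes that only express the order of the bands.
order_map = {"Less than $40K": 0, "$40K - $60K": 1, "$80K - $120K": 2}
as_order = toy_income.map(order_map)

# Option 2: approximate midpoints (in thousands), which also express the spacing.
midpoint_map = {"Less than $40K": 20, "$40K - $60K": 50, "$80K - $120K": 100}
as_midpoint = toy_income.map(midpoint_map)

# Option 3: one hot encoding, which treats every band as independent.
as_one_hot = pd.get_dummies(toy_income)

print(pd.DataFrame({"order": as_order, "midpoint": as_midpoint}))
print(as_one_hot)
```

With the first two options "Unknown" becomes NaN and needs imputing, whereas one hot encoding simply gives it its own column.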

In [None]:
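income_map = {
    "Less than $40K": 20,
    "$40K - $60K": 50,
    "$60K - $80K": 70,  # assumed band; without an entry these rows become NaN and are imputed
    "$80K - $120K": 100,
    "$120K +": 120
}

# any value that does not appear in the map dict (such as "Unknown") is converted to NaN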
basic_feature_data["Income_Category_Encoded"] = basic_feature_data["Income_Category"].map(income_map)
basic_feature_data["Income_Category_Encoded"].value_counts()

In [None]:
# The data is skewed, therefore use median
median_income = basic_feature_data["Income_Category_Encoded"].median()

basic_feature_data["Income_Category_Encoded"] = basic_feature_data["Income_Category_Encoded"].fillna(median_income)

basic_feature_data = basic_feature_data.drop(columns=["Income_Category"])

In [None]:
basic_feature_data.head()

"Education_Level" is also an ordinal data type. We have the choice either to encode independence into the data set (one hot encoding), or to enforce an interval between categories (integer / label encoding). I am going to opt for integer encoding here, as the order is quite important to education level and should give more insight. This could be tested and the two methods compared. We will, again, need to impute the unknown data.

In [None]:
# Explore distribution
basic_feature_data["Education_Level"].value_counts()

In [None]:
# We could quite likely group some of these together
# but for now we will keep them separate
education_map = {
    "Uneducated": 1,
    "High School": 2,
    "College": 3,
    "Graduate": 4, # assuming graduate is after college in US 
    "Post-Graduate": 5,
    "Doctorate": 6
}

# this will convert "Unknown" to NaN as it does not appear in the map dict
basic_feature_data["Education_Level_Encoded"] = basic_feature_data["Education_Level"].map(education_map)
basic_feature_data["Education_Level_Encoded"].value_counts().sort_index()

In [None]:
basic_feature_data["Education_Level_Encoded"].value_counts().sort_index().plot();

In [None]:
# an interesting choice to make between the three averages based
# on the plot above
# the median and mean select an uncommon value
print("median", basic_feature_data["Education_Level_Encoded"].median())
print("mode", basic_feature_data["Education_Level_Encoded"].mode()[0])
print("mean", basic_feature_data["Education_Level_Encoded"].mean())
# I am going to select the mode here as it appears far more frequently than 
# the others and isn't significantly far from the others
# This is one of the challenges with encoding intervals that may not be representative

In [None]:
# mode produces a series, which needs to be indexed with [0] to get the value
mode_education = basic_feature_data["Education_Level_Encoded"].mode()[0]

basic_feature_data["Education_Level_Encoded"] = basic_feature_data["Education_Level_Encoded"].fillna(mode_education)

basic_feature_data = basic_feature_data.drop(columns=["Education_Level"])

In [None]:
basic_feature_data.head()

Let us look at the target class distribution of the data set.

In [None]:
# Plot the proportion of each target class within the data.
basic_feature_data["Attrition_Flag"].value_counts(normalize=True).plot(kind="bar",
                                                    color=["navy", "gold"],
                                                    title="Target: Attrition Flag distribution");

print(basic_feature_data["Attrition_Flag"].value_counts())

This imbalance may reduce the performance of our model, so we are going to resample our data so that we have the same number of samples in both classes. This will be achieved by undersampling the majority class ("Existing Customer"/0). We have a reasonable number of minority class samples (around 1,300 in the training data), so although we will lose some predictive power for existing customers, our model should generalise better.

In [None]:
# Encode to 1's and 0's
target_map = {
    "Existing Customer": 0,
    "Attrited Customer": 1
}
basic_feature_data["Attrition_Flag_Encoded"] = basic_feature_data["Attrition_Flag"].map(target_map)

# remove non-encoded column
basic_feature_data = basic_feature_data.drop(columns=["Attrition_Flag"])

# Separate majority and minority classes
df_majority = basic_feature_data[basic_feature_data["Attrition_Flag_Encoded"]==0]
df_minority = basic_feature_data[basic_feature_data["Attrition_Flag_Encoded"]==1]
 
# Undersample majority class.
df_majority_downsampled = resample(df_majority, 
                                   replace=False,    # sample without replacement
                                   n_samples=len(df_minority),    # to match minority class
                                   random_state=123) # reproducible results


basic_feature_resampled = pd.concat([df_majority_downsampled, df_minority], axis=0, sort=True)

basic_feature_resampled = basic_feature_resampled.reset_index(drop=True)

# Display new class counts
basic_feature_resampled["Attrition_Flag_Encoded"].value_counts()
# expected_result (ish):
#1    1316
#0    1316
#Name: Attrition_Flag, dtype: int64

In [None]:
basic_feature_resampled.head()

We could alternatively do a `.groupby` followed by a `.sample`, however the above method works for arrays and dataframes.
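
A minimal sketch of that alternative is shown below on a made-up frame. The column "Credit_Limit" and the values are placeholders, and `DataFrameGroupBy.sample` needs pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical imbalanced data: seven rows of class 0 and three rows of class 1.
toy = pd.DataFrame({
    "Credit_Limit": list(range(10)),
    "Attrition_Flag_Encoded": [0] * 7 + [1] * 3,
})

# Size of the smallest class, which every class is sampled down to.
minority_size = int(toy["Attrition_Flag_Encoded"].value_counts().min())

# Sample the same number of rows (without replacement) from each class.
toy_balanced = (
    toy.groupby("Attrition_Flag_Encoded")
       .sample(n=minority_size, random_state=123)
       .reset_index(drop=True)
)

print(toy_balanced["Attrition_Flag_Encoded"].value_counts())
```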

### Further Categorical Encoding

We will need to encode the remaining categorical data: "Card_Category", "Marital_Status" and "Gender".

Gender (in this data set, not in real life) is binary, which lends itself to 0, 1 encoding.

Card categories may have some associated rank, and therefore be ordinal. However, without the domain knowledge we will assume they are independent.

Similarly, there could be a progression of marital statuses. However, we will assume they are independent.

In [None]:
basic_feature_resampled.dtypes

We can quickly map the "Gender" feature to integers:

In [None]:
gender_map = {"M": 0, "F": 1}

# Map gender values to 0 and 1, fill missing / unknown with majority class
basic_feature_resampled["Gender"] = basic_feature_resampled["Gender"].map(gender_map)

# calculate most common
# mode produces a series, which needs to be indexed with [0] to get the value
gender_mode = basic_feature_resampled["Gender"].mode()[0]

basic_feature_resampled["Gender"] = basic_feature_resampled["Gender"].fillna(gender_mode)

We can encode the other two columns at once below.

In [None]:
# Initialise the encoder
one_hot_encoder = OneHotEncoder(handle_unknown="ignore")

# We will keep the "Unknown" value for marital status for now

# Make array of marital status and card category
marital_card = one_hot_encoder.fit_transform(basic_feature_resampled[["Card_Category", "Marital_Status"]]).toarray()

# Store the different categories
# (get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2)
column_names = one_hot_encoder.get_feature_names_out()

# Create a new data frame with the marital status and card categories data.
marital_card_frame = pd.DataFrame(data=marital_card, columns=column_names)

marital_card_frame

In [None]:
# concat the data back to the training data, dropping the original columns
basic_feature_resampled = basic_feature_resampled.drop(columns=["Marital_Status", "Card_Category"])

# concat (ensuring no rows have been dropped)
basic_feature_resampled_encoded = pd.concat([basic_feature_resampled, marital_card_frame], axis=1)

basic_feature_resampled_encoded

Let's separate our target variable from our features.

In [None]:
# Assign target to a separate object.
y_train = basic_feature_resampled_encoded["Attrition_Flag_Encoded"]

# remove target
X_train = basic_feature_resampled_encoded.drop(columns=["Attrition_Flag_Encoded"])

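We now need to scale our data. Because we are using a model which uses a distance metric, it is important to scale the features so that they are all comparable. We will use a robust scaler, which centres each feature on its median and scales it by its interquartile range, so it is less sensitive to outliers.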

In [None]:
# Initialise robust scaler
scaler = RobustScaler()

# Fit and transform the data with the scaler.
# Pass the column index directly; wrapping it in a list would create a MultiIndex.
X_train_scaled = pd.DataFrame(scaler.fit_transform(X=X_train), 
                              columns=X_train.columns)

X_train_scaled.head()

There are other options for scaling the data. We are at first going to look only at standardized data (the robust scaling above), but we could use:

* Only normalized
* Standardized then normalized
* Neither normalized nor standardized
* Only standardized (what we are going with)
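
The difference between these options is easier to see on a small example. The sketch below applies scikit-learn's `StandardScaler`, `RobustScaler` and `Normalizer` to a made-up array with one large outlier; the array and variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, RobustScaler, StandardScaler

# Hypothetical features: the second column has a large outlier in the last row.
toy_features = np.array([
    [1.0, 100.0],
    [2.0, 200.0],
    [3.0, 300.0],
    [4.0, 10000.0],
])

# Standardizing works per column: zero mean and unit variance.
standardized = StandardScaler().fit_transform(toy_features)

# Robust scaling also works per column: centre on the median, scale by the IQR.
robust = RobustScaler().fit_transform(toy_features)

# Normalizing works per row: each sample is rescaled to unit (L2) length.
normalized = Normalizer().fit_transform(toy_features)

# Standardized then normalized: scale the columns first, then the rows.
standardized_then_normalized = Normalizer().fit_transform(standardized)

print(standardized.round(2))
print(robust.round(2))
print(normalized.round(2))
```

Note that standardizing and robust scaling learn values from the data, so they must be fitted on the training data only, while `Normalizer` treats each row independently.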

Our data has already been train-test split, but we have only processed the training data. We need to apply the same steps to the test data before we evaluate the model.

In [None]:
raw_test.head()

In [None]:
# Follow the same steps from the training data processing
# Reset index to stop dropping indexes being an issue
clean_test = raw_test.copy().reset_index(drop=True)

# Remove unique column
clean_test = clean_test.drop(columns="CLIENTNUM")

# impute missing data using the values found on the training data
clean_test["Customer_Age"] = clean_test["Customer_Age"].fillna(median_age)
clean_test["Card_Category"] = clean_test["Card_Category"].fillna(common_category)
clean_test["Credit_Limit"] = clean_test["Credit_Limit"].fillna(median_limit)

# encode ordinal data, using the same column names as the training data
clean_test["Income_Category_Encoded"] = clean_test["Income_Category"].map(income_map)
clean_test["Income_Category_Encoded"] = clean_test["Income_Category_Encoded"].fillna(median_income)

clean_test["Education_Level_Encoded"] = clean_test["Education_Level"].map(education_map)
clean_test["Education_Level_Encoded"] = clean_test["Education_Level_Encoded"].fillna(mode_education)

clean_test = clean_test.drop(columns=["Income_Category", "Education_Level"])

# map gender to 0 and 1
clean_test["Gender"] = clean_test["Gender"].map(gender_map)
clean_test["Gender"] = clean_test["Gender"].fillna(gender_mode)

# one hot encode data: **transform** not fit
marital_card_test = one_hot_encoder.transform(clean_test[["Card_Category", "Marital_Status"]]).toarray()
marital_card_frame_test = pd.DataFrame(data=marital_card_test, columns=column_names)
clean_test = clean_test.drop(columns=["Marital_Status", "Card_Category"])

# concat
clean_test = pd.concat([clean_test, marital_card_frame_test], axis=1)

# encode target and remove non-encoded column
clean_test["Attrition_Flag_Encoded"] = clean_test["Attrition_Flag"].map(target_map)
clean_test = clean_test.drop(columns=["Attrition_Flag"])

clean_test

In [None]:
# split X and y for the test set
y_test = clean_test["Attrition_Flag_Encoded"]

# Select the features in the same order as the training data, as the
# scaler checks that the column names match those seen during fit
X_test = clean_test[X_train.columns]

# Scale: **transform** not fit
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)

Let's first train and evaluate a model using the single nearest neighbour to classify the test data.

In [None]:
# Initialise the classifier object
neighbour_initial_model = KNeighborsClassifier(n_neighbors=1)

# Fit the model to the training data.
neighbour_initial_model = neighbour_initial_model.fit(X_train_scaled, y_train)

# Predict values on the test set using the trained model.
init_y_pred = neighbour_initial_model.predict(X_test_scaled)

In [None]:
# Set the names for the classification report to produce.
target_names = list(target_map.keys())

# Generate the report using the target test and prediction values.
classif_report = classification_report(y_test, init_y_pred, target_names=target_names)

print(classif_report)

This is a good first attempt, but we could probably improve the F1 score by selecting a better K or weighting method.

In [None]:
# Define the parameters and the values we want to search.
parameters = {"n_neighbors":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
              "weights": ["uniform", "distance"]}

# Select the model type we have chosen.
neighbour_improved_model = KNeighborsClassifier()

# Set the number of folds we want to have to get 80:20 train/test split within the grid search cross validation.
n_cv = 5

# Define our grid search model to find optimal parameters.
opt_model = GridSearchCV(estimator=neighbour_improved_model, param_grid=parameters, scoring="f1", cv=n_cv)

# Fit our parameter search model.
opt_model.fit(X_train_scaled, y_train)

print("\nThe best parameters found are: \n\n", opt_model.best_params_)

# Predict target values based on best model found.
better_y_pred = opt_model.best_estimator_.predict(X_test_scaled)

# Generate the report using the target test and predicted values.
classif_report_new = classification_report(y_test, better_y_pred, target_names=target_names)

print(classif_report_new)

So this tuning hasn't substantially improved the performance of the model. The macro average f1-score has increased by only 0.01, with other metrics largely the same compared to $k=1$. However, the recall and f1-score for Attrited Customers have improved. We will likely care about these values more than the others, as we want to be able to predict whether someone will leave our bank, rather than predicting that they will stay (most people stay at any point!).
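
If the attrited class is what we care about most, its scores can be computed directly rather than read off the report. The sketch below uses made-up labels and predictions, not the results from this notebook, with 1 standing for "Attrited Customer" as in `target_map`.

```python
from sklearn.metrics import f1_score, recall_score

# Hypothetical true labels and predictions (1 = attrited, 0 = existing).
toy_y_true = [0, 0, 0, 0, 1, 1, 1, 0]
toy_y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# pos_label=1 makes the attrited class the one being scored.
attrited_recall = recall_score(toy_y_true, toy_y_pred, pos_label=1)
attrited_f1 = f1_score(toy_y_true, toy_y_pred, pos_label=1)

print("recall (attrited):", round(attrited_recall, 2))
print("f1 (attrited):", round(attrited_f1, 2))
```

In the same spirit, passing `scoring="recall"` to `GridSearchCV` instead of `scoring="f1"` would tune the model for recall on the positive class, at the likely cost of precision.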

Remember however, we have used some preprocessing steps which may not be optimal, we should explore whether we can improve the model by using different processes.

Consider:

* Imputing missing values differently (or drop)
* Encoding ordinal data in a different manner
* Resampling differently (did the even sample actually improve the model?)
* Scaling the data differently (could we normalize the distances after scaling?)
* Using a different model
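
As a starting point for that exploration, the sketch below searches over the scaling step and the KNN parameters together by putting both in a `Pipeline`. It runs on synthetic data from `make_classification` rather than the churn data, and the step names "scale" and "knn" and the grid values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler

# Synthetic binary classification data standing in for the processed churn features.
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=123)
X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=123
)

# Scaling and the classifier live in one pipeline, so the scaler is
# refitted on the training part of every cross validation fold.
search_pipeline = Pipeline([
    ("scale", RobustScaler()),
    ("knn", KNeighborsClassifier()),
])

# "passthrough" skips the scaling step entirely.
search_grid = {
    "scale": [RobustScaler(), StandardScaler(), "passthrough"],
    "knn__n_neighbors": [1, 5, 9],
    "knn__weights": ["uniform", "distance"],
}

toy_search = GridSearchCV(search_pipeline, param_grid=search_grid, scoring="f1", cv=5)
toy_search.fit(X_toy_train, y_toy_train)

print(toy_search.best_params_)
print("test score:", round(toy_search.score(X_toy_test, y_toy_test), 2))
```

Swapping `KNeighborsClassifier` for another estimator in the "knn" step (with a matching parameter grid) would cover the last point in the list as well.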