# Using EvalML to Predict Customer Attrition

<p align="center">
<img width=50% src="https://evalml-web-images.s3.amazonaws.com/evalml_horizontal.svg" alt="EvalML" />
</p>

## Problem and Dataset

The [Predicting Customer Churn](https://www.kaggle.com/sakshigoyal7/credit-card-customers) dataset on Kaggle poses a supervised classification task: predict whether or not a customer will end up leaving their bank's credit card service.

In this tutorial, we use [EvalML](https://github.com/alteryx/evalml) to search for and select the pipeline that performs best at identifying customers who are likely to leave their bank.

## Approach 

We will show how [EvalML](https://github.com/alteryx/evalml) can be leveraged to perform preprocessing, visualization, and automated machine learning. While EvalML offers plenty of customization options for improving prediction outcomes, we'll focus on a fairly high-level implementation.

Our approach will be as follows:

1. Read in the data and analyze it.
2. Understand the data through visualization.
3. Perform basic preprocessing.
4. Search for the best-performing pipeline based on our objective.
5. Review the best pipeline chosen and analyze its performance.

First we're going to need to import some libraries.

In [None]:
import evalml
import numpy as np
import pandas as pd

## Dataset 

The Customer Churn dataset consists of 10,000+ instances and 20 features alongside a label - `Attrition_Flag`. First we want to review this data and see what we're dealing with.

In [None]:
data = pd.read_csv('./data/BankChurners.csv')
data.head()

The first thing we'll do is drop `CLIENTNUM` from the data, since a unique client identifier carries no predictive information about attrition. There's clearly some diversity in the types of features, and at first glance it looks like we don't have to worry about any null or missing values. That seems unlikely for a dataset of this size, though.

In [None]:
data = data.drop(['CLIENTNUM'], axis=1)
print(f'Feature types: {data.dtypes.unique()}')
print('-----------------------------------')
data.info()

We're going to take a look at the unique values of the non-numeric features. Sure enough, `Education_Level`, `Marital_Status`, and `Income_Category` have `Unknown` as a value. This is something we'll have to remember before we get to model training, since `Unknown` isn't an acceptable value for any of the features.

In [None]:
for feature in data.columns:
    if data[feature].dtype not in ['int64', 'float64']:
        print(f'{feature}: {data[feature].unique()}')
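
To see why `data.info()` reported no missing values even though gaps exist, it helps to compare null counts against counts of the `Unknown` placeholder. The sketch below uses a small made-up frame (the values are illustrative, not taken from the dataset), and the variable names are our own.

```python
import pandas as pd

# Toy stand-in for the real data; all values are invented for illustration.
toy = pd.DataFrame({
    "Education_Level": ["Graduate", "Unknown", "High School", "Unknown"],
    "Marital_Status": ["Married", "Single", "Unknown", "Married"],
    "Customer_Age": [45, 52, 38, 61],
})

# Placeholder strings are ordinary values, so isna() finds nothing missing.
null_counts = toy.isna().sum()

# Counting the placeholder explicitly reveals the hidden gaps per column.
unknown_counts = toy.eq("Unknown").sum()

print(null_counts)
print(unknown_counts)
```

The same two checks can be run on `data` directly to get per-feature counts.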

## Visualization

It's worth checking how prevalent `Unknown` is relative to the other values. Based on the count plots below, it doesn't look like `Unknown` is the most common value, but its frequency is high enough that we probably don't want to drop the rows containing it.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(nrows=3, ncols=1, figsize=(16, 28))
sns.set(font_scale=1.6)
cols_ = ["Education_Level", "Marital_Status", "Income_Category"]

for ind, col in enumerate(cols_):
    sns.countplot(x=col, data=data, ax=ax[ind])

We're also going to take a look at the correlation matrix to see if there are any features that are too closely tied to others. It looks like `Avg_Open_To_Buy` is perfectly correlated with `Credit_Limit`, so we're going to drop the latter.

In [None]:
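fig, ax = plt.subplots(figsize=(20, 16))
# Correlate numeric columns only; recent pandas versions raise on string columns in corr()
df_corr = data.select_dtypes(include="number").corr(method="pearson")
# Hide the upper triangle and diagonal, which only mirror the lower triangle
mask = np.triu(np.ones_like(df_corr, dtype=bool))
ax = sns.heatmap(df_corr, mask=mask, annot=True)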
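
Reading a large heatmap by eye is error-prone, so it can help to list the strongly correlated pairs directly. The sketch below does this on synthetic data; the way `Avg_Open_To_Buy` is generated here and the 0.95 cutoff are assumptions made for the toy example only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic columns: the second is built from the first, the third is unrelated.
credit_limit = rng.uniform(1_500, 35_000, size=n)
small_balance = rng.uniform(0, 1_000, size=n)
toy = pd.DataFrame({
    "Credit_Limit": credit_limit,
    "Avg_Open_To_Buy": credit_limit - small_balance,
    "Customer_Age": rng.integers(26, 70, size=n),
})

corr = toy.corr(method="pearson").abs()

# Keep the strict upper triangle so each pair appears once and self-pairs are dropped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()

# Pairs whose absolute correlation exceeds the (arbitrary) cutoff.
high_pairs = pairs[pairs > 0.95]
print(high_pairs)
```

Only one feature from each flagged pair needs to be dropped.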

## Preprocessing the Data

The first thing we're going to do is create a copy, drop the highly correlated feature, and separate the label from the rest of the data. Following that, we should cast some of the unique values in our categorical variables to a numerical format so that our machine learning estimators can work with them.

In [None]:
X = data.copy()
X = X.drop(['Credit_Limit'], axis=1)
y = X.pop('Attrition_Flag')

X['Income_Category'] = X['Income_Category'].replace({'Less than $40K':0,
                                                     '$40K - $60K':1,
                                                     '$60K - $80K':2,
                                                     '$80K - $120K':3,
                                                     '$120K +':4})
X['Card_Category'] = X['Card_Category'].replace({'Blue':0,
                                                 'Silver':1,
                                                 'Gold':2,
                                                 'Platinum':3})
X['Education_Level'] = X['Education_Level'].replace({'Uneducated':0,
                                                     'High School':1,
                                                     'College':2,
                                                     'Graduate':3,
                                                     'Post-Graduate':4,
                                                     'Doctorate':5})

y = y.replace({'Existing Customer':0,
               'Attrited Customer':1})

Now that our data has been cleaned a bit, it's in a better spot for us to apply some transformations. We'll be replacing the `Unknown` values that we saw earlier with the most frequent value encountered in that feature using `SimpleImputer`.

In [None]:
from evalml.pipelines.components.transformers.imputers.simple_imputer import SimpleImputer

def preprocessing(X, y):
    imputer = SimpleImputer(impute_strategy="most_frequent", missing_values="Unknown")
    X = imputer.fit_transform(X, y)
    
    return X

X = preprocessing(X, y)
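
For intuition, most-frequent imputation can be sketched in plain pandas. This is not how EvalML's `SimpleImputer` is implemented; `impute_most_frequent` is a hypothetical helper written for this illustration, and the data is made up.

```python
import pandas as pd

# Invented rows with one placeholder value to fill.
toy = pd.DataFrame({
    "Marital_Status": ["Married", "Unknown", "Single", "Married"],
    "Card_Category": [0, 0, 1, 0],
})

def impute_most_frequent(frame, sentinel="Unknown"):
    """Replace `sentinel` in each column with that column's most common other value."""
    out = frame.copy()
    for col in out.columns:
        is_missing = out[col] == sentinel
        if is_missing.any():
            # mode() ignores nothing here, so exclude the sentinel rows first.
            most_common = out.loc[~is_missing, col].mode().iloc[0]
            out.loc[is_missing, col] = most_common
    return out

imputed = impute_most_frequent(toy)
print(imputed)
```

Note that this strategy pushes every gap toward the majority value, which is one reason to check how frequent `Unknown` is first.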

Using `infer_feature_types`, we can convert our dataset into a [Woodwork](https://github.com/alteryx/woodwork) data structure, and even [specify what types](https://evalml.alteryx.com/en/stable/user_guide/automl.html) certain features should be. For example, we want to cast `Income_Category` as a categorical type, rather than the natural language type it was inferred as.

In [None]:
from evalml.utils.gen_utils import infer_feature_types
X = infer_feature_types(X, feature_types={'Income_Category': 'categorical',
                                          'Education_Level': 'categorical'})
X

## AutoMLSearch

After the preprocessing has been performed, the data is ready to be split into a training and test set.

In [None]:
X_train, X_test, y_train, y_test = evalml.preprocessing.split_data(X, y, problem_type='binary',
                                                                         test_size=.2)
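
For intuition about what an 80/20 split that preserves class balance looks like, here is a minimal sketch using scikit-learn's `train_test_split` on made-up labels. It is an illustration only and not a description of how `split_data` works internally.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 toy rows with a 20% positive class (invented numbers).
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 20 + [0] * 80)

# stratify keeps the class ratio the same in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

print(f"train positives: {y_tr.mean():.2f}, test positives: {y_te.mean():.2f}")
```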

We're ready to begin our automated machine learning now! `AutoMLSearch` is a tool that automatically iterates over a collection of pipelines to see what combination of steps and estimators will result in the best performing pipeline. We would normally have more preprocessing steps explicitly defined for a machine learning problem. Everything from standardization to one hot encoding is on the table. Part of the versatility of `AutoMLSearch` is that the built-in preprocessing component can handle some of this by default. Using a OneHotEncoder to break nominal categorical variables into multiple columns is an important step in providing more useful features for models, and that's exactly what `AutoMLSearch` will do for us.
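
To make the one hot encoding step concrete, here is a small sketch using `pandas.get_dummies` on made-up values. It stands in for the idea only; it is not EvalML's encoder component.

```python
import pandas as pd

# A nominal feature with no natural order (invented rows).
toy = pd.DataFrame({"Marital_Status": ["Married", "Single", "Divorced", "Married"]})

# Each distinct value becomes its own indicator column.
encoded = pd.get_dummies(toy, columns=["Marital_Status"])

print(encoded)
```

Each row has exactly one indicator set, so no artificial ordering is imposed on the categories.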

Now we have some options when considering what model families we want to include in this search. There are a few to choose from, and to see all the ones allowed for binary classification, you can run `print(evalml.pipelines.components.utils.allowed_model_families('binary'))`. Transforming the data through standardization so that all features are on the same scale is another step that could have been very useful, but since the model families we've chosen to work with are all tree-based, it isn't as important for us in this case.

We're dealing with a binary problem type here, so we'll be sure to specify that. We also want to make sure that we're optimizing for the right objective. Since an abundance of false negatives (customers who leave without being flagged) would deteriorate the usefulness of our model, we'll focus on the F1 metric, the harmonic mean of precision and recall.
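
As a quick worked example of how F1 reacts to false negatives, the sketch below computes it from confusion counts. The counts are made up and `f1_from_counts` is a helper defined only for this illustration.

```python
def f1_from_counts(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Same number of false positives, but very different numbers of missed positives.
few_misses = f1_from_counts(tp=80, fp=20, fn=20)   # precision 0.8, recall 0.8
many_misses = f1_from_counts(tp=40, fp=20, fn=60)  # precision ~0.67, recall 0.4

print(f"few misses:  F1 = {few_misses:.2f}")
print(f"many misses: F1 = {many_misses:.2f}")
```

Here F1 falls from 0.80 to 0.50 as recall drops, which is the behavior we want from our objective.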

In [None]:
from evalml import AutoMLSearch

automl = AutoMLSearch(X_train=X_train, y_train=y_train, problem_type="binary", objective="F1", 
                      allowed_model_families=['random_forest' , 'xgboost', 'lightgbm'],
                      additional_objectives=['accuracy binary'], max_batches=5)
automl.search(data_checks=None)

## Pipelines Review

A lot just happened, so let's review the pipelines that were created and tested. We can see that the best performing pipeline used the LightGBM estimator. We want to learn a little more about it, which can be done with the `describe_pipeline` function. Notice that the pipeline included a preprocessing step of imputation. In this case, it ended up being unnecessary because of our earlier `SimpleImputer` and our lack of null values for our numerical features. However, `AutoMLSearch` comes with the built-in capacity to automatically iterate over the hyperparameters for this preprocessing step as well.

In [None]:
automl.rankings

In [None]:
best_pipeline_ = automl.best_pipeline
# rankings is sorted best-first, so row 0 holds the best pipeline's id
automl.describe_pipeline(automl.rankings.iloc[0]["id"])

## Understanding Outcomes

Following the selection of the best performing pipeline, we can continue to learn more about it before we might choose to implement it somewhere else in our business. Within the `evalml.model_understanding` module, we can find several tools that help us understand the models and outcomes we're dealing with. For now, let's take a look at the F1 scores by threshold and the permutation importance of each feature. We can also use a confusion matrix to more clearly see the breakdown of false positives and negatives.

In [None]:
best_pipeline_.fit(X_train, y_train)
predictions = best_pipeline_.predict(X_test)

In [None]:
from evalml.model_understanding.graphs import (
    graph_binary_objective_vs_threshold, 
    graph_permutation_importance, 
    graph_confusion_matrix
)

graph_binary_objective_vs_threshold(best_pipeline_, X_test, y_test, "F1")
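
The idea behind an objective-versus-threshold plot can be sketched by hand: turn predicted probabilities into labels at several cutoffs and score each one. The probabilities, labels, and the `f1_at` helper below are all made up for illustration.

```python
import numpy as np

# Invented true labels and predicted positive-class probabilities.
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
proba = np.array([0.05, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.95])

def f1_at(threshold):
    """F1 when every probability at or above `threshold` is called positive."""
    pred = proba >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

scores = {t: f1_at(t) for t in [0.3, 0.5, 0.7]}
best_threshold = max(scores, key=scores.get)

for t, s in scores.items():
    print(f"threshold {t}: F1 = {s:.3f}")
```

On this toy data the lowest cutoff wins because it catches every positive at the cost of two false alarms.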

In [None]:
graph_permutation_importance(best_pipeline_, X_test, y_test, "F1")
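
Permutation importance asks how much a score drops when one feature's values are shuffled. Below is a minimal hand-rolled sketch on synthetic data; it uses accuracy rather than F1 for simplicity, and `predict` is a stand-in rule rather than a fitted pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Column 0 fully determines the label; column 1 is pure noise.
signal = rng.normal(size=n)
noise = rng.normal(size=n)
X_toy = np.column_stack([signal, noise])
y_toy = (signal > 0).astype(int)

def predict(X):
    """Stand-in for a fitted model: it looks only at column 0."""
    return (X[:, 0] > 0).astype(int)

def accuracy(y, pred):
    return float(np.mean(y == pred))

baseline = accuracy(y_toy, predict(X_toy))

importances = []
for j in range(X_toy.shape[1]):
    X_perm = X_toy.copy()
    # Shuffling breaks the link between this column and the label.
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    importances.append(baseline - accuracy(y_toy, predict(X_perm)))

print(f"baseline accuracy: {baseline:.2f}")
print(f"importance of signal: {importances[0]:.2f}, noise: {importances[1]:.2f}")
```

A feature the model never uses gets an importance of exactly zero, while shuffling the informative one costs a large share of the score.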

In [None]:
graph_confusion_matrix(y_test, predictions)
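
To read the four cells of a confusion matrix as numbers, a cross-tabulation of actual against predicted labels is enough. The labels below are made up; with the real data, `y_test` and `predictions` would take their place.

```python
import pandas as pd

# Invented labels: 1 = attrited, 0 = existing.
actual = pd.Series([0, 0, 0, 0, 1, 1, 1, 0, 1, 0], name="actual")
predicted = pd.Series([0, 0, 1, 0, 1, 0, 1, 0, 1, 0], name="predicted")

# Rows are actual classes, columns are predicted classes.
cm = pd.crosstab(actual, predicted)

tn, fp = int(cm.loc[0, 0]), int(cm.loc[0, 1])
fn, tp = int(cm.loc[1, 0]), int(cm.loc[1, 1])

# Recall: the share of actual positives that were caught.
recall = tp / (tp + fn)

print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp} recall={recall:.2f}")
```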

To see how the final pipeline's predictions performed against `y_test`, we can import several metrics and take a look at the performance. For Recall specifically, the pipeline did a fairly good job of minimizing false negatives.

In [None]:
from evalml.objectives.standard_metrics import AccuracyBinary, AUC, F1, PrecisionWeighted, Recall

acc = AccuracyBinary()
auc = AUC()
f1 = F1()
pre_w = PrecisionWeighted()
rec = Recall()

print(f"Accuracy (Binary): {acc.score(y_true=y_test, y_predicted=predictions)}")
print(f"Area Under Curve: {auc.score(y_true=y_test, y_predicted=predictions)}")
print(f"F1: {f1.score(y_true=y_test, y_predicted=predictions)}")
print(f"Precision (Weighted): {pre_w.score(y_true=y_test, y_predicted=predictions)}")
print(f"Recall: {rec.score(y_true=y_test, y_predicted=predictions)}")
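
One thing to keep in mind when reading the AUC figure: AUC measures how well scores rank positives above negatives, so it is usually computed from predicted probabilities rather than hard class labels. The toy example below uses made-up values and scikit-learn's `roc_auc_score` (not EvalML's objective class) to show how thresholding first can change the number.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Invented labels and positive-class probabilities.
y_true = np.array([0, 0, 1, 1])
proba = np.array([0.1, 0.6, 0.7, 0.8])

# Hard labels at a 0.5 cutoff throw away the ordering within each side.
hard = (proba >= 0.5).astype(int)

auc_from_proba = roc_auc_score(y_true, proba)
auc_from_labels = roc_auc_score(y_true, hard)

print(f"AUC from probabilities: {auc_from_proba:.2f}")
print(f"AUC from hard labels:   {auc_from_labels:.2f}")
```

With the real pipeline, scoring AUC on the positive-class probabilities from `predict_proba` would be the analogous step; the exact return type may differ between EvalML versions, so treat that as something to verify.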