# Targeting Direct Marketing with XGBoost
_**Supervised Learning with Gradient Boosted Trees: A Binary Prediction Problem With Unbalanced Classes**_

---

## Contents

1. [Background](#Background)
1. [Preparation](#Preparation)
1. [Data](#Data)
    1. [Exploration](#Exploration)
    1. [Transformation](#Transformation)
1. [Training](#Training)
1. [Evaluation](#Evaluation)
    1. [Tuning](#Tuning)
1. [Extensions](#Extensions)

---

## Background
Direct marketing, whether through mail, email, phone, etc., is a common tactic to acquire customers.  Because resources and a customer's attention are limited, the goal is to only target the subset of prospects who are likely to engage with a specific offer.  Predicting those potential customers based on readily available information like demographics, past interactions, and environmental factors is a common machine learning problem.

This notebook presents an example problem to predict if a customer will enroll for a term deposit at a bank, after one or more phone calls.  The steps include:

* Preparing your Amazon SageMaker notebook
* Downloading data from the internet into Amazon SageMaker
* Investigating and transforming the data so that it can be fed to algorithms
* Estimating a model using the Gradient Boosting algorithm
* Evaluating the effectiveness of the model
* Tuning the model's performance

---

## Preparation
To start, let's set up a few environment variables.

In [None]:
import os

os.environ['AWS_DEFAULT_REGION'] = 'us-west-2'
os.environ['JOBLIB_START_METHOD'] = 'forkserver'

If needed, shell commands can be invoked to install any necessary packages.

In [None]:
!conda install -y -c conda-forge xgboost

Now let's bring in the Python libraries that we'll use throughout the analysis.

In [None]:
import numpy as np                              # For matrix operations and numerical processing
import pandas as pd                             # For munging tabular data
import matplotlib.pyplot as plt                 # For charts and visualizations
import sklearn as sk                            # For access to a variety of machine learning models
import xgboost as xgb                           # For gradient boosted trees algorithm
from IPython.display import Image               # For displaying images in the notebook
from IPython.display import display             # For displaying outputs in the notebook
from scipy.stats import randint as sp_randint   # For sampling in HPO

---

## Data
Let's start by downloading a dataset from UCI's ML Repository.

In [None]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip
!unzip bank-additional.zip

Now let's read this into a Pandas data frame and take a look.

In [None]:
data = pd.read_csv('./bank-additional/bank-additional-full.csv', sep=';')
pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 20)         # Keep the output on one page
data

Let's talk about the data.  At a high level, we can see:

* We have a little over 40K customer records, and 20 features for each customer
* The features are mixed; some numeric, some categorical
* The data appears to be sorted, at least by `time` and `contact`, maybe more

_**Specifics on each of the features:**_

*Demographics:*
* `age`: Customer's age (numeric)
* `job`: Type of job (categorical: 'admin.', 'services', ...)
* `marital`: Marital status (categorical: 'married', 'single', ...)
* `education`: Level of education (categorical: 'basic.4y', 'high.school', ...)

*Past customer events:*
* `default`: Has credit in default? (categorical: 'no', 'unknown', ...)
* `housing`: Has housing loan? (categorical: 'no', 'yes', ...)
* `loan`: Has personal loan? (categorical: 'no', 'yes', ...)

*Past direct marketing contacts:*
* `contact`: Contact communication type (categorical: 'cellular', 'telephone', ...)
* `month`: Last contact month of year (categorical: 'may', 'nov', ...)
* `day_of_week`: Last contact day of the week (categorical: 'mon', 'fri', ...)
* `duration`: Last contact duration, in seconds (numeric). Important note: If duration = 0 then `y` = 'no'.
 
*Campaign information:*
* `campaign`: Number of contacts performed during this campaign and for this client (numeric, includes last contact)
* `pdays`: Number of days that passed by after the client was last contacted from a previous campaign (numeric)
* `previous`: Number of contacts performed before this campaign and for this client (numeric)
* `poutcome`: Outcome of the previous marketing campaign (categorical: 'nonexistent','success', ...)

*External environment factors:*
* `emp.var.rate`: Employment variation rate - quarterly indicator (numeric)
* `cons.price.idx`: Consumer price index - monthly indicator (numeric)
* `cons.conf.idx`: Consumer confidence index - monthly indicator (numeric)
* `euribor3m`: Euribor 3 month rate - daily indicator (numeric)
* `nr.employed`: Number of employees - quarterly indicator (numeric)

*Target variable:*
* `y`: Has the client subscribed to a term deposit? (binary: 'yes','no')

### Exploration
Let's start exploring the data.  First, let's understand how the features are distributed.

In [None]:
# Frequency tables for each categorical feature
for column in data.select_dtypes(include=['object']).columns:
    display(pd.crosstab(index=data[column], columns='% observations', normalize='columns'))

# Summary statistics and histograms for each numeric feature
display(data.describe())
%matplotlib inline
hist = data.hist(bins=30, sharey=True, figsize=(10, 10))

Notice that:

* Almost 90% of the values for our target variable `y` are "no", so most customers did not subscribe to a term deposit.
* Many of the predictive features take on values of "unknown".  Some are more common than others.  We should think carefully about what causes a value of "unknown" (are these customers non-representative in some way?) and how that should be handled.
  * Even if "unknown" is included as its own distinct category, what does it mean given that, in reality, those observations likely fall within one of the other categories of that feature?
* Many of the predictive features have categories with very few observations in them.  If we find a small category to be highly predictive of our target outcome, do we have enough evidence to make a generalization about that?
* Contact timing is particularly skewed.  Almost a third in May and less than 1% in December.  What does this mean for predicting our target variable next December?
* There are no missing values in our numeric features.  Or missing values have already been imputed.
  * `pdays` takes a value near 1000 for almost all customers.  Likely a placeholder value signifying no previous contact.
* Several numeric features have a very long tail.  Do we need to handle these few observations with extremely large values differently?
* Several numeric features (particularly the macroeconomic ones) occur in distinct buckets.  Should these be treated as categorical?

Next, let's look at how our features relate to the target that we are attempting to predict.

In [None]:
for column in data.select_dtypes(include=['object']).columns:
    if column != 'y':
        display(pd.crosstab(index=data[column], columns=data['y'], normalize='columns'))

for column in data.select_dtypes(exclude=['object']).columns:
    print(column)
    hist = data[[column, 'y']].hist(by='y', bins=30)
    plt.show()

Notice that:

* Customers who are "blue-collar", "married", have an "unknown" default status, were contacted by "telephone", and/or were contacted in "may" make up a substantially lower share of "yes" than "no" outcomes for subscribing.
* Distributions for numeric variables are different across "yes" and "no" subscribing groups, but the relationships may not be straightforward or obvious.

Now let's look at how our features relate to one another.

In [None]:
display(data.select_dtypes(include='number').corr())   # Correlations are only defined for the numeric features
pd.plotting.scatter_matrix(data, figsize=(12, 12))
plt.show()

Notice that:
* Features vary widely in their relationship with one another.  Some pairs are highly negatively correlated, others highly positively correlated.
* Relationships between features are non-linear and discrete in many cases.

### Transformation

Cleaning up data is part of nearly every machine learning project.  It arguably presents the biggest risk if done incorrectly and is one of the more subjective aspects in the process.  Several common techniques include:

* Handling missing values: Some machine learning algorithms are capable of handling missing values, but most would rather not.  Options include:
  * Removing observations with missing values: This works well if only a very small fraction of observations have incomplete information.
  * Removing features with missing values: This works well if there are a small number of features which have a large number of missing values.
  * Imputing missing values: Entire [books](https://www.amazon.com/Flexible-Imputation-Missing-Interdisciplinary-Statistics/dp/1439868247) have been written on this topic, but common choices are replacing the missing value with the mode or mean of that column's non-missing values.
* Converting categorical to numeric: The most common method is one hot encoding, which for each feature maps every distinct value of that column to its own feature which takes a value of 1 when the categorical feature is equal to that value, and 0 otherwise.
* Oddly distributed data: Although for non-linear models like Gradient Boosted Trees, this has very limited implications, parametric models like regression can produce wildly inaccurate estimates when fed highly skewed data.  In some cases, simply taking the natural log of the features is sufficient to produce more normally distributed data.  In others, bucketing values into discrete ranges is helpful.  These buckets can then be treated as categorical variables and included in the model when one hot encoded.
* Handling more complicated data types: Manipulating images, text, or data at varying grains is left for other notebook templates.
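
As a rough illustration of three of the techniques listed above (one hot encoding, a log transform, and bucketing), here is a minimal sketch on a made-up toy table.  The column names `contact` and `balance`, the values, and the bucket edges are assumptions chosen only for this example; they are not taken from the bank dataset.

```python
import numpy as np
import pandas as pd

# Toy data: column names and values are invented for illustration only
toy = pd.DataFrame({'contact': ['cellular', 'telephone', 'cellular'],
                    'balance': [10.0, 100.0, 10000.0]})

# One hot encoding: one 0/1 indicator column per distinct category value
encoded = pd.get_dummies(toy, columns=['contact'], dtype=int)

# Log transform (log1p also copes with zeros) to pull in a long right tail
encoded['log_balance'] = np.log1p(encoded['balance'])

# Bucketing: cut the numeric feature into discrete, labelled ranges
encoded['balance_bucket'] = pd.cut(toy['balance'],
                                   bins=[0, 50, 500, np.inf],
                                   labels=['low', 'mid', 'high'])
print(encoded)
```

The bucketed column could itself be one hot encoded afterwards if a model needs purely numeric inputs.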

Luckily, some of these aspects have already been handled for us, and the algorithm we are showcasing tends to do well at handling sparse or oddly distributed data.  Therefore, let's keep pre-processing simple.

In [None]:
data['no_previous_contact'] = np.where(data['pdays'] == 999, 1, 0)                                 # Indicator variable to capture when pdays takes a value of 999
data['not_working'] = np.where(data['job'].isin(['student', 'retired', 'unemployed']), 1, 0)       # Indicator for individuals not actively employed
model_data = pd.get_dummies(data, dtype=int)                                                       # Convert categorical variables to sets of 0/1 indicators

Another question to ask yourself before building a model is whether certain features will add value in your final use case.  For example, if your goal is to deliver the best prediction, then will you have access to that data at the moment of prediction?  Knowing it's raining is highly predictive for umbrella sales, but forecasting weather far enough out to plan inventory on umbrellas is probably just as difficult as forecasting umbrella sales without knowledge of the weather.  So, including this in your model may give you a false sense of precision.

Following this logic, let's remove the economic features and `duration` from our data as they would need to be forecasted with high precision to use as inputs in future predictions.

Even if we were to use values of the economic indicators from the previous quarter, this value is likely not as relevant for prospects contacted early in the next quarter as those contacted later on.

In [None]:
model_data = model_data.drop(['duration', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed'], axis=1)

When building a model whose primary goal is to predict a target value on new data, it is important to understand overfitting.  Supervised learning models are designed to minimize error between their predictions of the target value and actuals, in the data they are given.  This last part is key, as frequently in their quest for greater accuracy, machine learning models bias themselves toward picking up on minor idiosyncrasies within the data they are shown.  These idiosyncrasies then don't repeat themselves in subsequent data, so predictions on new data can actually become less accurate even as predictions on the training data become more accurate.
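
To make the idea concrete, here is a small sketch on synthetic data (not the bank data).  It assumes a noisy label driven by a single feature, then compares a shallow tree with one that is allowed to grow until it fits the training rows exactly.  The data-generating process and sizes are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(2000, 5))
# Only the first feature carries signal, and the label is deliberately noisy
y = (X[:, 0] + rng.normal(scale=2.0, size=2000) > 0).astype(int)
X_train, y_train = X[:1000], y[:1000]     # Data the model is shown
X_new, y_new = X[1000:], y[1000:]         # "New" data it has never seen

results = {}
for name, depth in [('shallow', 2), ('deep', None)]:   # None lets the tree grow without limit
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    results[name] = (tree.score(X_train, y_train), tree.score(X_new, y_new))
    print(name, 'train accuracy:', results[name][0], 'new-data accuracy:', results[name][1])
```

The unrestricted tree memorizes the training rows, so the gap between its training accuracy and its accuracy on unseen rows is the thing to watch.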

The most common way of preventing this is to build models with the concept that a model shouldn't only be judged on its fit to the data it was trained on, but also on "new" data.  There are several different ways of operationalizing this: holdout validation, cross-validation, leave-one-out validation, etc.  For our purposes, we'll simply randomly split the data into 3 uneven groups.  The model will be trained on 70% of the data, it will then be evaluated on 20% of the data to give us an estimate of the accuracy we hope to have on "new" data, and 10% will be held back as a final testing dataset which will be used later on.

In [None]:
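shuffled = model_data.sample(frac=1, random_state=1729)                             # Randomly sort the data
train_end, validation_end = int(0.7 * len(shuffled)), int(0.9 * len(shuffled))      # Row positions at the 70% and 90% marks
train_data, validation_data, test_data = shuffled.iloc[:train_end], shuffled.iloc[train_end:validation_end], shuffled.iloc[validation_end:]   # First 70%, next 20%, and last 10%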
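
For comparison, cross-validation (one of the alternatives mentioned above) rotates which slice of the data is held out instead of fixing a single split.  The sketch below uses scikit-learn's `cross_val_score` on synthetic data; the data, the depth-2 tree, and the choice of 5 folds are assumptions made only for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(600, 4))
y = (X[:, 0] > 0).astype(int)            # Synthetic target driven by the first feature

# Each of the 5 folds takes a turn as the held-out set while the rest is used for training
scores = cross_val_score(DecisionTreeClassifier(max_depth=2, random_state=0), X, y, cv=5)
print(scores, scores.mean())
```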

In Python, most machine learning packages expect features and the target variable to be provided as separate arguments.  Let's split these apart.  Notice that, although repetitive, it's easiest to do this after the train|validation|test split rather than before.  This avoids any misalignment issues due to random reordering.

In [None]:
train_X = train_data.drop(['y_no', 'y_yes'], axis=1)
train_y = train_data['y_yes']
validation_X = validation_data.drop(['y_no', 'y_yes'], axis=1)
validation_y = validation_data['y_yes']
test_X = test_data.drop(['y_no', 'y_yes'], axis=1)
test_y = test_data['y_yes']

---

## Training
Now we know most of our features have skewed distributions, some are highly correlated with one another, and some appear to have non-linear relationships with our target variable.  Also, for targeting future prospects, good predictive accuracy is preferred to being able to explain why that prospect was targeted.  Taken together, these aspects make gradient boosted trees a good candidate algorithm.

There are several intricacies to understanding the algorithm, but at a high level, gradient boosted trees works by combining predictions from many simple models, each of which tries to address the weaknesses of the previous models.  By doing this the collection of simple models can actually outperform large, complex models.  Say we started with a simple decision tree:

In [None]:
!sudo yum install -y graphviz

In [None]:
import sklearn.tree
tr = sk.tree.DecisionTreeClassifier(max_depth=2)        # Set up a decision tree classifier limited to a depth of 2
tr = tr.fit(train_X, train_y)                           # Train that decision tree on our data
sk.tree.export_graphviz(tr,
                        out_file='tree.dot',            # Write the graph description to disk for the dot command below
                        feature_names=list(train_X.columns),
                        impurity=False,
                        proportion=True,
                        rounded=True)                    # Output the tree for visualization
!dot -Tpng tree.dot -o tree.png                          # Convert to an image (requires graphviz)
display(Image('tree.png'))                               # Display the tree image

We can see that:

* If we've contacted the customer about a previous campaign recently, then 62% of them subscribed to a term deposit.  This is great information, but it's really only relevant for 3.8% of our population.  We should also note that continuing to focus on the same subset over and over risks missing out on new, incremental prospects (even if the success rate is lower).
* If we have not contacted the customer about a previous campaign and it's not March, then 91% of contacts did not subscribe to a term deposit.  Not contacting the ~95% of the population that meets these criteria in the future could potentially save substantial cost with minimal loss of opportunity.

This extremely simple model is reasonably accurate, but if we wanted to increase accuracy further, we could either:

1. Continue adding cuts to our above tree.
2. Move to a different algorithm.

If we continue to add cuts to the above tree, we could continue to understand, within that 95%, what features distinguish the 9% that did subscribe to a term deposit versus the 91% that did not.  Knowing this would potentially allow us to cut a large portion of the prospects out, with even less risk of lost subscribers.  However, doing so also runs the risk that we find ever more obscure subsets of prospects which do not generalize well when making future predictions.  Note that whether a prospect was contacted in March isn't even a customer attribute.  It could be correlated with other attributes though, depending on how data was collected.  Have we already built a model that's too specific?

One method that might help us understand this is random forests.  These models build a large number of simple decision trees, each time taking a sample of the observations and features, and then averaging their predictions together.  By doing so we limit the chances that our model has focused in on any one feature or small subset of observations.
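
The sampling-and-averaging idea can be sketched by hand.  The example below is a simplification on synthetic data: it draws one bootstrap sample of rows and one random subset of columns per tree, whereas library random forests typically re-sample candidate features at every split.  The data, the 25 trees, and the 3-feature subsets are arbitrary assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)                    # Synthetic target driven by two features

probas = []
for i in range(25):
    rows = rng.randint(0, len(X), size=len(X))              # Bootstrap sample of observations
    cols = rng.choice(X.shape[1], size=3, replace=False)    # Random subset of features for this tree
    tree = DecisionTreeClassifier(max_depth=2, random_state=i).fit(X[rows][:, cols], y[rows])
    probas.append(tree.predict_proba(X[:, cols])[:, 1])     # This tree's predicted probability of class 1

avg_proba = np.mean(probas, axis=0)                         # Average the simple trees' predictions
accuracy = np.mean((avg_proba >= 0.5).astype(int) == y)
print(accuracy)
```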

Let's build a simple random forest, keeping the depth of the tree at 2, but now averaging across 50 trees.  Since 50 trees would be difficult to look at, let's just see if we notice different features show up in our random forest versus our simple decision tree.

In [None]:
import sklearn.ensemble
rf = sk.ensemble.RandomForestClassifier(n_estimators=50, 
                                        max_depth=2, 
                                        random_state=1729)   # Setup a 50 tree random forest each with only 2 cuts
rf.fit(train_X, train_y)                                     # Train this forest on our data
fig, ax = plt.subplots(figsize=(10, 13))                     # Plot the importance of each feature in predicting our target
ax.set_xlabel('Feature Importance')
ax.barh(range(rf.feature_importances_.shape[0]), 
        rf.feature_importances_, 
        tick_label=train_X.columns)
plt.show()

Interestingly, the month features don't show up as very important.  This suggests that our initial, single, small tree may have already become overly specific.

An alternative algorithm we could use is gradient boosted trees.  Again, we're combining multiple simple trees to produce better results than a single large complex model.  Unlike random forests, with gradient boosting the trees are not independent.  Boosting fits trees sequentially, with each new tree focusing on the errors left by the trees before it.  This can improve predictive accuracy substantially.
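
The sequential idea can be sketched in a few lines.  This is a deliberately simplified version, assuming squared error loss on a synthetic regression problem, where the negative gradient is just the residual.  It is not how `xgboost` is implemented (which, among other things, uses a logistic loss for this notebook's problem and adds regularization); the data and settings are made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)   # Synthetic, non-linear target

learning_rate = 0.3
prediction = np.full_like(y, y.mean())                  # Start from a constant prediction
errors = [np.mean((y - prediction) ** 2)]
for _ in range(50):
    residual = y - prediction                           # What the trees so far still get wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    prediction += learning_rate * tree.predict(X)       # Add a damped correction from the new tree
    errors.append(np.mean((y - prediction) ** 2))

print('training MSE before boosting:', errors[0], 'after 50 trees:', errors[-1])
```

Each depth-2 tree is weak on its own, but because every tree is fit to the remaining error, the training error falls as trees are added.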

`xgboost` is an extremely popular, open-source package for gradient boosted trees.  It is computationally powerful, fully featured, and has been successfully used in many machine learning competitions.  Let's start with a simple `xgboost` model.

In [None]:
bt = xgb.XGBClassifier(max_depth=2,
                       learning_rate=0.3,
                       n_estimators=50,
                       objective='binary:logistic')   # Setup xgboost model
bt.fit(train_X, train_y, 
       eval_set=[(validation_X, validation_y)], 
       verbose=False)                                 # Train it to our data
fig, ax = plt.subplots(figsize=(8, 8))                # Plot feature importance
xgb.plot_importance(bt, ax=ax)
plt.show()

Gradient boosting finds a very different set of features which are predictive.  Let's compare our models based on predictive accuracy.

---

## Evaluation
There are many ways to compare the performance of a machine learning model, but let's start by simply comparing actual to predicted values.  In this case, we're simply predicting whether the customer subscribed to a term deposit (`1`) or not (`0`), which produces a simple confusion matrix.

In [None]:
display(pd.crosstab(test_y, tr.predict(test_X), colnames=['tree']))               # Tree model predictions
display(pd.crosstab(test_y, rf.predict(test_X), colnames=['forest']))             # Random forest predictions
display(pd.crosstab(test_y, bt.predict(test_X), colnames=['boosted']))            # xgboost predictions

According to the above, none of the models are doing particularly well.  Boosted trees have the most true positives with 85, but this is only slightly more than the 78 from our exceptionally simple decision tree.  Random forests have the fewest false positives (none) but also the most false negatives.  A key point is that our confusion matrices are comparing binary prediction accuracy (i.e. we only predict the prospect will be a subscriber if the model thinks the probability is at or above 50%).
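
To show how the cells of a confusion matrix follow from the 50% cutoff, here is a small sketch with made-up labels and probabilities (none of these numbers come from the models above).

```python
import numpy as np
import pandas as pd

# Invented actual outcomes and predicted probabilities for 8 prospects
actual = np.array([1, 0, 0, 1, 0, 1, 0, 0])
proba = np.array([0.8, 0.1, 0.6, 0.4, 0.2, 0.7, 0.3, 0.1])
predicted = (proba >= 0.5).astype(int)        # Hard class prediction at the 0.5 threshold

matrix = pd.crosstab(actual, predicted, rownames=['actual'], colnames=['predicted'])
display_ready = matrix                         # In a notebook this could be passed to display()
print(matrix)

tp, fp = matrix.loc[1, 1], matrix.loc[0, 1]   # True positives, false positives
fn, tn = matrix.loc[1, 0], matrix.loc[0, 0]   # False negatives, true negatives
precision = tp / (tp + fp)                    # Share of predicted subscribers who subscribed
recall = tp / (tp + fn)                       # Share of actual subscribers we predicted
print(precision, recall)
```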

But, because there's most likely a high value on acquiring a new subscriber, we're probably willing to tolerate multiple rejections to get one.  So, we're much happier to have a false positive than a false negative (within reason).  How much more we should tolerate false positives than false negatives will need to be a business decision based on the economic return of acquiring a subscriber versus the cost of a call.
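
One way to frame that business decision is to score each candidate threshold by the profit it would have produced.  The sketch below is hypothetical throughout: the `expected_profit` helper, the value of 50 per subscriber, the cost of 5 per call, and the labels and probabilities are all invented for illustration.

```python
import numpy as np

def expected_profit(y_true, proba, threshold, value_per_subscriber, cost_per_call):
    """Profit from calling every prospect whose predicted probability meets the threshold."""
    called = proba >= threshold
    return value_per_subscriber * np.sum(y_true[called]) - cost_per_call * np.sum(called)

# Invented outcomes and predicted probabilities for 8 prospects
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
proba = np.array([0.05, 0.12, 0.22, 0.31, 0.35, 0.45, 0.62, 0.91])

thresholds = [0.1, 0.3, 0.5, 0.7]
profits = [expected_profit(y_true, proba, t, value_per_subscriber=50, cost_per_call=5)
           for t in thresholds]
best = thresholds[int(np.argmax(profits))]
print(dict(zip(thresholds, profits)), 'best threshold:', best)
```

With these made-up economics a threshold below 0.5 comes out ahead, because the extra calls cost less than the subscriber they recover.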

Let's address this by comparing how very similar models can perform very differently with a fixed threshold.

### Tuning
Machine learning algorithms frequently come with many knobs that can be tweaked and tuned.  Adjusting these hyperparameters can be handled several different ways.  Tuning them by hand using a high degree of subjectivity or through extensive trial and error used to be a common task for data scientists.

A hasty decision we made early on in our models was keeping the depth of our trees at 2.  We started with this for ease of implementation and interpretation, but as we look to improve performance, we can tune this hyperparameter.

In [None]:
rf4 = sk.ensemble.RandomForestClassifier(n_estimators=50, 
                                         max_depth=4, 
                                         random_state=1729)                         # 4 deep tree random forest instead of 2
rf4.fit(train_X, train_y)

bt4 = xgb.XGBClassifier(max_depth=4,
                        learning_rate=0.3,
                        n_estimators=50,
                        objective='binary:logistic')                                # Setup 4 deep tree xgboost model
bt4.fit(train_X, train_y, 
        eval_set=[(validation_X, validation_y)], 
        verbose=False)

display(pd.crosstab(test_y, rf4.predict(test_X), colnames=['forest4']))
display(pd.crosstab(test_y, bt4.predict(test_X), colnames=['boosted4']))

As we can see above, changing one hyperparameter (tree depth) produces substantially different results for random forests, but only a slight increase in true positives for `xgboost`.

The driver for the substantial shift in random forest predictions is the subtle issue of thresholds that we mentioned above.  Instead of looking at predictions as a hard binary class, let's look at the probability predictions.

In [None]:
plt.hist(rf.predict_proba(test_X)[:, 1], range=(0, 1))
plt.title('max_depth = 2 forest')
plt.show()
plt.hist(rf4.predict_proba(test_X)[:, 1], range=(0, 1))
plt.title('max_depth = 4 forest')
plt.show()

Notice that in both cases, the distribution of prospects is heavily skewed toward a low predicted probability of subscribing (less than 20%).  Both models produce a distinct second, smaller group of prospects with a higher predicted probability, but notice that the forest with trees with 4 cuts has more prospects above the 0.5 threshold.  This drives the substantial shift in true positives that we see above.

Let's look at how our false positive and false negative rates change as we move our threshold away from 0.5.  The typical way of doing this is a ROC curve, which plots the true positive rate against the false positive rate across the range of possible thresholds.

In [None]:
import sklearn.metrics                                                                           # Make sure sk.metrics is loaded
tr_false_pos, tr_true_pos, _ = sk.metrics.roc_curve(test_y, tr.predict_proba(test_X)[:, 1])      # ROC metrics for decision tree
rf_false_pos, rf_true_pos, _ = sk.metrics.roc_curve(test_y, rf.predict_proba(test_X)[:, 1])      # ROC metrics for random forest
rf4_false_pos, rf4_true_pos, _ = sk.metrics.roc_curve(test_y, rf4.predict_proba(test_X)[:, 1])   # ROC metrics for random forest (4 deep)
bt_false_pos, bt_true_pos, _ = sk.metrics.roc_curve(test_y, bt.predict_proba(test_X)[:, 1])      # ROC metrics for xgboost
bt4_false_pos, bt4_true_pos, _ = sk.metrics.roc_curve(test_y, bt4.predict_proba(test_X)[:, 1])   # ROC metrics for xgboost (4 deep)

plt.plot(tr_false_pos, tr_true_pos, label='tree')                                                # Plot ROC curves for all
plt.plot(rf_false_pos, rf_true_pos, label='forest')
plt.plot(rf4_false_pos, rf4_true_pos, label='forest4')
plt.plot(bt_false_pos, bt_true_pos, label='boosted')
plt.plot(bt4_false_pos, bt4_true_pos, label='boosted4')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()

As we can see, by adjusting the threshold, we can bring in more true positives at the expense of more false positives.  This also shows that, as hoped, random forests and gradient boosted trees can outperform our simple decision tree.  In addition, although the two random forest models had very different performance as judged by confusion matrices, they perform relatively similarly overall.
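
Each point on a ROC curve is just a pair of rates computed at one threshold, and the whole curve is often summarized by the area under it (AUC).  The sketch below uses invented labels and probabilities, a hypothetical `rates_at` helper for single points, and scikit-learn's `roc_auc_score` for the summary.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Invented outcomes and predicted probabilities for 6 prospects
actual = np.array([0, 0, 1, 0, 1, 1])
proba = np.array([0.1, 0.4, 0.35, 0.2, 0.8, 0.7])

def rates_at(threshold):
    """True positive rate and false positive rate at a single threshold."""
    predicted = proba >= threshold
    tpr = np.sum(predicted & (actual == 1)) / np.sum(actual == 1)
    fpr = np.sum(predicted & (actual == 0)) / np.sum(actual == 0)
    return tpr, fpr

print(rates_at(0.5))     # A stricter threshold: fewer true positives, fewer false positives
print(rates_at(0.3))     # A looser threshold: more of both
auc = roc_auc_score(actual, proba)
print(auc)
```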

Note though, that none of the models is giving us spectacular results; in order to get 90%+ true positives, we'd need to exceed 80% false positives.  Given that we'd like to improve model performance, but hand tuning is laborious and didn't yield substantial improvements (at least in our first attempt), let's take a more systematic approach which takes advantage of the compute resources we have at our disposal.

Grid search is a naive method of hyperparameter optimization, but is still quite common.  This sets hyperparameters to different values across a fixed interval you specify, then runs the model at each combination, collects the feedback, and returns the best performing model.

Randomized search is similar, but it doesn't test every point in the grid, which makes it easier to place an upper bound on total computational expenditure.  Let's keep things simple and explore random search across a variety of parameters in our gradient boosted tree model.
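
The difference in cost between the two approaches can be seen by counting candidates.  This sketch uses scikit-learn's `ParameterGrid` and `ParameterSampler` on a small, made-up search space (a subset of the kind of values one might try); the budget of 5 draws is an arbitrary assumption.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

space = {'max_depth': [2, 5, 10],
         'learning_rate': [0.01, 0.1, 0.3],
         'n_estimators': [50, 150]}

grid = list(ParameterGrid(space))                                      # Every combination: 3 * 3 * 2
sampled = list(ParameterSampler(space, n_iter=5, random_state=1729))   # A fixed budget of 5 random draws
print(len(grid), 'grid candidates versus', len(sampled), 'sampled candidates')
for candidate in sampled:
    print(candidate)
```

Adding another hyperparameter multiplies the size of the grid, while the number of sampled candidates stays at whatever budget is chosen.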

In [None]:
%%time

import sklearn.model_selection

bthpo = xgb.XGBClassifier(objective='binary:logistic', random_state=1729)

hp = {'max_depth': [2, 5, 10],
      'learning_rate': [0.01, 0.1, 0.3],
      'n_estimators': [50, 150],
      'scale_pos_weight': [1, 2, 3],
      'subsample': [0.8, 1],
      'colsample_bytree': [0.6, 1],
      'reg_lambda': [1, 10]}

btrs = sk.model_selection.RandomizedSearchCV(bthpo,
                                             param_distributions=hp,
                                             n_iter=100,
                                             scoring='roc_auc',
                                             n_jobs=4,
                                             cv=2,
                                             verbose=2,
                                             random_state=1729)
btrs.fit(pd.concat([train_X, validation_X]), pd.concat([train_y, validation_y]))

In [None]:
plt.hist(btrs.cv_results_['mean_test_score'])
plt.title('Mean cross-validated ROC AUC for 100 models in search space')
plt.show()

Notice that, depending on the hyperparameters, there can be a fairly substantial swing in our overall performance metric (ROC AUC).

In [None]:
btrs_proba = btrs.predict_proba(test_X)[:, 1]
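# Share of actual subscribers captured by targeting the half of prospects with the highest predicted probability
print(np.sum(np.where(btrs_proba > np.median(btrs_proba), 1, 0) * test_y) / np.sum(test_y))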
display(pd.crosstab(test_y, btrs.predict(test_X), colnames=['tuned']))                              # Confusion matrix for tuned model
btrs_false_pos, btrs_true_pos, _ = sk.metrics.roc_curve(test_y, btrs.predict_proba(test_X)[:, 1])   # Plot ROC curve for tuned model
plt.plot(bt_false_pos, bt_true_pos, label='boosted')
plt.plot(btrs_false_pos, btrs_true_pos, label='tuned')
plt.legend()
plt.show()

We also see a shift in our confusion matrix.  The ROC curve tells a slightly less promising story, as performance relative to the original gradient boosted tree model is only slightly better.  Our initial naive parameters turned out to be reasonable for this task.  Nevertheless, at large scale, small improvements in overall accuracy can produce substantial impact for a business.  Critically, the overall ROC curve is less important in practice because, upon implementation, a single threshold needs to be chosen, and that threshold will be based on specific costs and returns within the business.

Overall, it's important to note that, with minimal effort, our model produced accuracies similar to those published in this [paper](http://media.salford-systems.com/video/tutorial/2015/targeted_marketing.pdf).

---

## Extensions

This example was contained entirely within the Notebook environment.  As data sizes grow, utilizing other Amazon SageMaker features such as distributed, serverless training and our hyperparameter optimization service makes more sense.  In addition, if the model needs to be used to provide real-time, online predictions, Amazon SageMaker's auto-scaling hosting should be used.  Please check out the other Amazon SageMaker direct marketing notebook for a more functionally detailed walkthrough of those features.