# Standalone Bank Offer Notebook

### Prediction on Customer Enrollment Dataset with Amazon SageMaker 

### Background

Direct marketing, whether through mail, email, or phone, is a common tactic to acquire customers. Because resources and a customer's attention are limited, the goal is to target only the subset of prospects who are likely to engage with a specific offer. Predicting those potential customers based on readily available information like demographics, past interactions, and environmental factors is a common machine learning problem.


### Business problem: 
Predict whether a customer will enroll for a certificate of deposit product at a bank, after one or more phone calls.

### Labeled Data: 
Customer demographics (age, employment, type of job, education etc.), responses to marketing events (including past response), external factors (month, day of the week etc.) and whether the customer is enrolled.

### Step 1: Imports

Import libraries and define environment variables

In [None]:
# import libraries
import boto3, re, sys, math, json, os, sagemaker, urllib.request
from sagemaker import get_execution_role
import numpy as np                                
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt                   
from IPython.display import Image                 
from IPython.display import display               
from time import gmtime, strftime                 
from sagemaker.predictor import csv_serializer  
import random
import collections
from collections import Counter

In [None]:
# Define IAM role
role = get_execution_role()
prefix = 'sagemaker/DEMO-xgboost-dm'
containers = {'us-west-2': '433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost:latest',
              'us-east-1': '811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest',
              'us-east-2': '825641698319.dkr.ecr.us-east-2.amazonaws.com/xgboost:latest',
              'eu-west-1': '685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:latest'} # each region has its XGBoost container
my_region = boto3.session.Session().region_name # set the region of the instance
print("Success - the MySageMakerInstance is in the " + my_region + " region. You will use the " + containers[my_region] + " container for your SageMaker endpoint.")

### Step 2: Create Storage Bucket

Note that the bucket name must be globally unique across AWS S3. Enter your first and last name to generate a unique bucket name.
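
Bucket names also have to follow S3's naming rules, so a typed name containing capitals, spaces, or punctuation could make bucket creation fail. The following is a minimal sketch, assuming a simplified version of those rules (3 to 63 characters, lowercase letters, digits, and hyphens, starting and ending with a letter or digit). The helpers `clean_name_part` and `is_valid_bucket_name` are hypothetical and not part of boto3.

```python
import re

# Simplified, assumed summary of the S3 bucket naming rules:
# 3-63 characters, lowercase letters, digits, and hyphens,
# starting and ending with a letter or digit.
# (S3 also permits dots, which this sketch leaves out.)
_VALID_BUCKET = re.compile(r"^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$")

def clean_name_part(text):
    """Lowercase the text and drop anything that is not a letter or digit."""
    return re.sub(r"[^a-z0-9]", "", text.lower())

def is_valid_bucket_name(name):
    """Return True if the name matches the simplified rules above."""
    return bool(_VALID_BUCKET.match(name))

candidate = "sagemaker-844-" + clean_name_part("Joe") + clean_name_part("O'Smith") + "123"
print(candidate, is_valid_bucket_name(candidate))
```

Cleaning the typed input before building the name keeps the random three-digit suffix as the only source of variation between attempts.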

#### Unique S3 Bucket Name Generator

In [None]:
def unique_s3_name():
    fname = input('Enter your first name, eg. joe :')
    lname = input('Enter your last name, eg. smith :')
    bucket_name = 'sagemaker-844-'+fname+lname+str(random.randint(100, 999))
    print("Your unique S3 bucket name will be '{}'".format(bucket_name))
    return bucket_name

In [None]:
# eg. bucket_name = 'sagemaker-844-firstnamelastname###' <-- example of a unique bucket name 
s3 = boto3.resource('s3')
bucket_name = unique_s3_name()  # generate the name once, whichever region is in use
try:
    if my_region == 'us-east-1':
        # us-east-1 is the default location and must not be passed as a LocationConstraint
        s3.create_bucket(Bucket=bucket_name)
    else:
        s3.create_bucket(Bucket=bucket_name, CreateBucketConfiguration={'LocationConstraint': my_region})
    print('S3 bucket created successfully')
except Exception as e:
    print('S3 error: ', e)

### Step 3: Download Data
Download data from external URL to your SageMaker instance, and read it as a dataframe.

In [None]:
try:

#    urllib.request.urlretrieve ("https://d1.awsstatic.com/tmt/build-train-deploy-machine-learning-model-sagemaker/bank_clean.27f01fbbdf43271788427f3682996ae29ceca05d.csv", "bank_clean.csv")
    !wget -N https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip
    print('Success: downloaded bank-additional.zip!')    
    print('')
except Exception as e:
    print('Data load error: ',e)
    print('')


try:    
    !unzip -o bank-additional.zip    
    print('Success: Unzipped bank-additional.zip to .csv!')    
    print('')    
except Exception as e:
    print('Data load error: ',e)
    print('')
    
try:

    raw_data = pd.read_csv('./bank-additional/bank-additional-full.csv', sep=";")
#    model_data = pd.read_csv('./bank_clean.csv',index_col=0)
    print('Success: Data loaded bank-additional-full.csv into dataframe.')
    print('')
except Exception as e:
    print('Data load error: ',e)
    print('')

### Step 4: Initial View of the Dataset

Now that the dataset is downloaded and loaded into a Pandas dataframe, we can look at the data.

The data contains 20 features for each customer. Here is a summary of each column:

Demographics:
* age: Customer's age (numeric)
* job: Type of job (categorical: 'admin.', 'services', ...)
* marital: Marital status (categorical: 'married', 'single', ...)
* education: Level of education (categorical: 'basic.4y', 'high.school', ...)

Past customer events:
* default: Has credit in default? (categorical: 'no', 'unknown', ...)
* housing: Has housing loan? (categorical: 'no', 'yes', ...)
* loan: Has personal loan? (categorical: 'no', 'yes', ...)

Past direct marketing contacts:
* contact: Contact communication type (categorical: 'cellular', 'telephone', ...)
* month: Last contact month of year (categorical: 'may', 'nov', ...)
* day_of_week: Last contact day of the week (categorical: 'mon', 'fri', ...)
* duration: Last contact duration, in seconds (numeric). Important note: If duration = 0 then y = 'no'.

Campaign information:
* campaign: Number of contacts performed during this campaign and for this client (numeric, includes last contact)
* pdays: Number of days that passed by after the client was last contacted from a previous campaign (numeric)
* previous: Number of contacts performed before this campaign and for this client (numeric)
* poutcome: Outcome of the previous marketing campaign (categorical: 'nonexistent','success', ...)

External environment factors:
* emp.var.rate: Employment variation rate - quarterly indicator (numeric)
* cons.price.idx: Consumer price index - monthly indicator (numeric)
* cons.conf.idx: Consumer confidence index - monthly indicator (numeric)
* euribor3m: Euribor 3 month rate - daily indicator (numeric)
* nr.employed: Number of employees - quarterly indicator (numeric)

Target variable:
* y: Has the client subscribed to a term deposit? (binary: 'yes','no')

In [None]:
raw_data.head()

In [None]:
raw_data.describe()

In [None]:
raw_data.info()

### Step 5: Exploratory Data Analysis

In [None]:
raw_data.columns

Display the Min and Max of the numeric columns

In [None]:
numeric_colns = ['age', 'duration', 'campaign', 'pdays','previous', 'emp.var.rate', 'cons.price.idx',
       'cons.conf.idx', 'euribor3m', 'nr.employed']
for x in numeric_colns:
    value = np.array(raw_data[x])
    print(x,':', 'min:', np.min(value), 'max:', np.max(value))


Look at the range and the distinct rounded values of each numeric column

In [None]:
for x in numeric_colns:
    rounded = raw_data[x].apply(round)  # round a copy so raw_data keeps its original values
    value = np.array(rounded)
    print(x,':', 'min:', np.min(value), 'max:', np.max(value), rounded.dtype)
    print(rounded.unique())
    print('')

Display a couple of Box Plots to show the relation of some categorical columns vs. numeric columns.

e.g. "What is the distribution of ages when someone has 'defaulted' on a loan in the past?"   OR

e.g. "What is the distribution of ages by the outcome (success or failure) of the previous marketing campaign?"

In [None]:
sns.set()
fig = plt.figure(figsize=(20,10))
plt.subplot(1, 2, 1)
sns.boxplot(x='default', y='age', data=raw_data)
plt.subplot(1, 2, 2)
sns.boxplot(x='poutcome', y='age', data=raw_data)

Display a scatter joint plot to show the relation of some numeric columns vs. numeric columns.

e.g. "How does age relate to the number of contacts made during this campaign?"

In [None]:
sns.set()
# Pass x and y by keyword; recent seaborn versions no longer accept them positionally
g = sns.jointplot(x="age", y="campaign", data=raw_data,
                  kind="reg", truncate=False,
                  xlim=(40, 100), ylim=(0, 50),
                  color="m", height=10)
g.set_axis_labels("Age (years)", "Campaign (number of contacts)")

In [None]:
g = sns.lmplot(x="age", y="campaign", hue="default",
               height=10, data=raw_data)
g.set_axis_labels("Age (years)", "Campaign (number of contacts)")

Look at the pie chart of the distribution of our dataset in terms of the level of education of our customers.

In [None]:
c = Counter(raw_data['education'])
print(c)

In [None]:
fig = plt.figure(figsize=(8,8))
plt.pie([float(c[v]) for v in c], labels=[str(k) for k in c], autopct=None)
plt.title('Level of Education Achieved') 
plt.tight_layout()

In [None]:
raw_data.columns

In [None]:
for a in raw_data.columns[1:5]:
    data = raw_data[a].value_counts()  # count each category once and reuse the result
    values = data.index.to_list()
    counts = data.to_list()
    
    plt.figure(figsize=(12,5))
    ax = sns.barplot(x = values, y = counts)
    
    plt.title(a)
    plt.xticks(rotation=45)

### Step 6: Data Cleaning

Now that we have explored the data, we can start cleaning the dataset. Let's remind ourselves what we have:

In [None]:
raw_data.head()

Many records have "999" for pdays, which is the number of days that passed by after a client was last contacted. It is very likely to be a magic number to represent that no contact was made before. Therefore, we create a new column called "no_previous_contact", then make it "1" when pdays is 999 and "0" otherwise.

In [None]:
# Indicator variable to capture when pdays takes a value of 999
raw_data['no_previous_contact'] = np.where(raw_data['pdays'] == 999, 1, 0) 
raw_data.head()

Note: there is ONE additional column added to the end to give us 22 columns.

In the "job" column, several categories mean the customer is not working, e.g., "student", "retired", and "unemployed". Since whether or not a customer is working is likely to affect the decision to enroll in the certificate of deposit, we create a new column that flags non-working customers based on the "job" column.

In [None]:
# Indicator for individuals not actively employed
raw_data['not_working'] = np.where(raw_data['job'].isin(['student', 'retired', 'unemployed']), 1, 0)
raw_data.head()

And now we added another column at the end.

Finally, we convert categorical data to numeric using pd.get_dummies(data), and view the transformed data.

In [None]:
# Convert categorical variables to sets of indicators
# dtype=int keeps the indicators as 0/1 (newer pandas versions default to True/False)
model_data = pd.get_dummies(raw_data, dtype=int)


View the dataset now to see the effects of the encoding steps:

In [None]:
model_data.head()

Note, you can now see we have a total of 67 columns. 
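
To see why the column count grows, here is a toy illustration of how `pd.get_dummies` expands each categorical column into one indicator column per distinct value while leaving numeric columns alone. The three-row frame and its values are made up for this sketch, and `dtype=int` is used only so the indicators print as 0/1.

```python
import pandas as pd

# Toy frame (made-up values) with one numeric and two categorical columns.
toy = pd.DataFrame({
    "age": [30, 45, 52],
    "marital": ["single", "married", "single"],
    "y": ["no", "yes", "no"],
})

# Each categorical column becomes one 0/1 column per distinct value.
encoded = pd.get_dummies(toy, dtype=int)
print(list(encoded.columns))
print(encoded)
```

Three input columns become five here: `age` is kept, `marital` contributes two indicators, and `y` contributes `y_no` and `y_yes`.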

### Should we Drop Any Data?
Another question to ask yourself before building a model is whether certain features will add value in your final use case. For example, if your goal is to deliver the best prediction, then will you have access to that data at the moment of prediction? Knowing it's raining is highly predictive for umbrella sales, but forecasting weather far enough out to plan inventory on umbrellas is probably just as difficult as forecasting umbrella sales without knowledge of the weather. So, including this in your model may give you a false sense of precision.

Certain economic features in the data won't be available at the time of predicting a customer's enrollment behaviour, or they can be as difficult to forecast as the business problem, with data being only available for defined time periods and on a lag.

So we remove the economic features and duration from the data as they would need to be forecasted with high precision to use as inputs in future predictions.

In [None]:
#Drop these columns if they can't be used in making a prediction
model_data = model_data.drop(['duration', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed'], axis=1)

View model_data. Now the dataset is cleaned and ready to be split into training and test sets.

In [None]:
model_data.head()

In [None]:
model_data.tail()

Encoding the categorical columns and dropping the unusable ones took us from the original 21 columns to 61 columns, all with numeric values. We are now ready to start splitting up the data for training.

### Step 7: Shuffle and split the data
Shuffle and split the data into training and test sets. In this example, select 70% of customers for training data.

The remaining 30% of customers is used to evaluate model performance.
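
As a library-free illustration of the same idea, the sketch below shuffles a copy of the rows with a fixed seed and cuts the result at the 70% mark. The function name `shuffle_and_split` is hypothetical, and plain Python lists stand in for the dataframe.

```python
import random

def shuffle_and_split(rows, train_fraction=0.7, seed=1729):
    """Return (train, test) after shuffling a copy of rows with a fixed seed."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

train, test = shuffle_and_split(range(100))
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, so repeated runs train and evaluate on the same customers.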

In [None]:
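shuffled = model_data.sample(frac=1, random_state=1729)  # shuffle all rows reproducibly
split_at = int(0.7 * len(shuffled))                       # the first 70% of rows go to training
train_data, test_data = shuffled.iloc[:split_at], shuffled.iloc[split_at:]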
print("We have {} rows of Training data with {} columns".format(train_data.shape[0], train_data.shape[1]))
print('and')
print("We have {} rows of Testing data with {} columns".format(test_data.shape[0], test_data.shape[1]))


### Step 8: Training
Train a model on the training data using SageMaker's pre-built XGBoost algorithm.

XGBoost builds its model iteratively with gradient-based optimization: each boosting round uses the gradient of the loss function to add a new tree that reduces the model's error.

Reformat the data so that the label is the first column and there is no header row, then upload it to S3.

(Disregard any warning message about version 2 of the SageMaker SDK.)
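
The training file layout used in this step (label first, features after it, no header row) can be sketched with the standard library alone. The two rows and the helper `to_training_csv` are hypothetical. The notebook itself builds the same layout with `pd.concat` and `to_csv`.

```python
import csv
import io

# Hypothetical encoded rows: two feature columns plus the one-hot label pair.
rows = [
    {"age": 30, "not_working": 0, "y_no": 1, "y_yes": 0},
    {"age": 61, "not_working": 1, "y_no": 0, "y_yes": 1},
]

def to_training_csv(rows, label="y_yes", drop=("y_no", "y_yes")):
    """Write rows as header-less CSV with the label in the first column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # Keep every column except the label columns as features.
        features = [value for key, value in row.items() if key not in drop]
        writer.writerow([row[label]] + features)
    return buffer.getvalue()

print(to_training_csv(rows))
```

Both label columns are removed from the features because either one would reveal the answer to the model.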

In [None]:
#Drop the 'label' encoded columns from the train_data as that shouldn't be used for training
#Save to a .csv file without column names as training won't use these
#Concatenation
pd.concat([train_data['y_yes'],train_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('train.csv', index=False, header=False)


In [None]:
train_data.head()

In [None]:
#Upload the file to S3 for Amazon SageMaker training to pickup.
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')


In [None]:
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(bucket_name, prefix), content_type='csv')

### Step 9: Set up SageMaker session

Create a SageMaker session, an estimator (an instance of the XGBoost model), and define the model's hyperparameters. 

(Disregard any warning message about version 2 of the SageMaker SDK.)

In [None]:
sess = sagemaker.Session()
xgb = sagemaker.estimator.Estimator(containers[my_region],role, train_instance_count=1, train_instance_type='ml.m4.xlarge',output_path='s3://{}/{}/output'.format(bucket_name, prefix),sagemaker_session=sess)
xgb.set_hyperparameters(max_depth=5,eta=0.2,gamma=4,min_child_weight=6,subsample=0.8,silent=0,objective='binary:logistic',num_round=100)

### Step 10: Train the model

XGBoost (eXtreme Gradient Boosting) is a popular and efficient open-source implementation of the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler and weaker models. The XGBoost algorithm performs well in machine learning competitions because of its robust handling of a variety of data types, relationships, and distributions, and because of the variety of hyperparameters that you can fine-tune. You can use XGBoost for regression, classification (binary and multiclass), and ranking problems.
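
To make "combining an ensemble of weaker models" concrete, here is a minimal sketch of plain gradient boosting with one-split trees (stumps) and squared loss. It is not XGBoost's actual algorithm, which adds regularization and second-order gradient information, and it does not use the `binary:logistic` objective. The toy data and the function names are assumptions made for illustration.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on xs that best fits residuals (squared error)."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a split must put at least one point on each side
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, rounds=20, eta=0.2):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(rounds):
        # For squared loss, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + eta * stump(x) for x, p in zip(xs, predictions)]
    return lambda x: base + eta * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    """Mean squared error of the model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy one-feature data set (made up for this sketch).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
print(mse(boost(xs, ys, rounds=1), xs, ys), mse(boost(xs, ys, rounds=20), xs, ys))
```

Each stump is a weak model on its own, but the training error shrinks as rounds accumulate. The learning rate `eta` plays the same role as the `eta` hyperparameter set on the estimator.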

<b> Resource: </b> For a more in-depth explanation of XGBoost: https://youtu.be/8b1JEDvenQU

In [None]:
#This training can take up to 3 minutes to complete
xgb.fit({'train': s3_input_train})

### Step 11:  Deploy the Model

Deploy the model on a server and create an endpoint

(Disregard any warning message about version 2 of the SageMaker SDK.)

In [None]:
#This deployment step can take up to 10 minutes to complete
xgb_predictor = xgb.deploy(initial_instance_count=1,instance_type='ml.m4.xlarge')

### Step 12:  Make Predictions

Run the model to create predictions on whether customers in the test data enrolled for the certificate of deposit product
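
The parsing in this step implies that the endpoint returns its scores as comma-separated text. As a sketch under that assumption, the snippet below turns such a string into numbers and then into 0/1 labels with a 0.5 threshold. The payload is hypothetical and is not real endpoint output.

```python
def parse_scores(payload):
    """Turn a comma-separated string of scores into a list of floats."""
    return [float(token) for token in payload.strip().split(",") if token.strip()]

def to_labels(scores, threshold=0.5):
    """Map each score to 1 (predicted to enroll) or 0 using a fixed threshold."""
    return [1 if score > threshold else 0 for score in scores]

# Hypothetical response body, not real endpoint output.
payload = "0.05,0.71,0.32,0.93"
scores = parse_scores(payload)
print(to_labels(scores))
```

Lowering the threshold would flag more customers as likely to enroll, trading precision for recall.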

In [None]:
test_data_array = test_data.drop(['y_no', 'y_yes'], axis=1).values #load the data into an array
xgb_predictor.content_type = 'text/csv' # set the data type for an inference
xgb_predictor.serializer = csv_serializer # set the serializer type
predictions = xgb_predictor.predict(test_data_array).decode('utf-8') # predict!
predictions_array = np.fromstring(predictions[1:], sep=',') # and turn the prediction into an array
print(predictions_array)

### Step 13:  Evaluate model performance

Compare actual vs. predictions in a confusion matrix. 

What is the overall classification rate? What are the precision and recall of the model?

Remember, whether high precision or high recall matters more depends on the application your model is built for.

The example figures used below correspond to a precision of 65% (278/429) for customers predicted to enroll and 90% (10,785/11,928) for customers predicted not to enroll.

Precision = True Positive / (True Positive + False Positive) = True Positive / Total Predicted Positive = 278/429 = 0.6480

Recall = True Positive / (True Positive + False Negative) = True Positive / Total Actual Positive = 278/(1143+278) = 0.1956

F1 = 2 * (Precision * Recall) / (Precision + Recall) = 2*(0.6480*0.1956)/(0.6480+0.1956) = 0.3005
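
The arithmetic above can be checked with a few lines of Python. This is a small sketch with a hypothetical helper, `classification_metrics`. The four counts are derived from the fractions quoted above (278/429 and 10,785/11,928).

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Counts implied by the quoted fractions: 278 of 429 predicted positives were correct,
# and 10,785 of 11,928 predicted negatives were correct.
metrics = classification_metrics(tp=278, fp=429 - 278, fn=11928 - 10785, tn=10785)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The gap between a high accuracy and a low recall reflects class imbalance: most customers do not enroll, so predicting "no" is usually right even when many actual enrollees are missed.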

<b>Resource:</b> StatQuest explanation of Confusion Matrix https://youtu.be/j-EB6RqqjGI

In [None]:
cm = pd.crosstab(index=test_data['y_yes'], columns=np.round(predictions_array), rownames=['Observed'], colnames=['Predicted'])
# Unpack the confusion matrix: rows are observed labels, columns are predicted labels
tn = cm.iloc[0,0]; fn = cm.iloc[1,0]; tp = cm.iloc[1,1]; fp = cm.iloc[0,1]
p = (tp+tn)/(tp+tn+fp+fn)*100

print("Evaluate model performance")
print("\n{0:<20}{1:<4.1f}%\n".format("Overall Classification Rate: ", p))

print("CONFUSION MATRIX")
print("{0:<15}{1:<15}{2:>8}".format("Predicted", "No Purchase", "Purchase"))
print("Observed")
print("{0:<15}{1:<2.0f}% ({2:<}){3:>6.0f}% ({4:<})".format("No Purchase", tn/(tn+fn)*100,tn, fp/(tp+fp)*100, fp))
print("{0:<16}{1:<1.0f}% ({2:<}){3:>7.0f}% ({4:<}) \n".format("Purchase", fn/(tn+fn)*100,fn, tp/(tp+fp)*100, tp))

# All three metrics are expressed as percentages
precision = tp/(tp+fp)*100
recall = tp/(tp+fn)*100
f1_score = 2 * (precision * recall) / (precision + recall)

print("Precision: {}%".format(round(precision,3)))
print("Recall: {}%".format(round(recall,3)))
print("F1 Score: {}%".format(round(f1_score,3)))
print("")

### Step 14: Terminate resources

### IMPORTANT STEP!!! Terminating resources that are not actively being used reduces costs and is a best practice. Delete the endpoint and all objects in the S3 bucket.

In [None]:
sagemaker.Session().delete_endpoint(xgb_predictor.endpoint)
bucket_to_delete = boto3.resource('s3').Bucket(bucket_name)
bucket_to_delete.objects.all().delete()
