# Step 0: Explore a dataset for signal

In this step, you run data processing, model training, and model evaluation locally in the notebook, without using the `sagemaker` or `boto3` packages.


<div class="alert alert-info"> Make sure you are using the <code>Python 3</code> kernel in JupyterLab for this notebook.</div>

In [None]:
# We use the open-source XGBoost algorithm to implement the model
%pip install -q xgboost

In [None]:
import pandas as pd
import numpy as np 
import json
import joblib
import xgboost as xgb
import os
import matplotlib.pyplot as plt
import mlflow
import mlflow.sklearn
from mlflow.models import infer_signature
from time import gmtime, strftime, sleep
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import MinMaxScaler, LabelEncoder

## Data

This example uses the [direct marketing dataset](https://archive.ics.uci.edu/ml/datasets/bank+marketing) from UCI's ML Repository:
> [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014

The data relates to the direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').

Download and unzip the dataset:

In [None]:
!wget -P data/ -N https://archive.ics.uci.edu/static/public/222/bank+marketing.zip --no-check-certificate

In [None]:
import zipfile

with zipfile.ZipFile("data/bank+marketing.zip", "r") as z:
    print("Unzipping bank+marketing...")
    z.extractall("data")

with zipfile.ZipFile("data/bank-additional.zip", "r") as z:
    print("Unzipping bank-additional...")
    z.extractall("data")

print("Done")

## Load data
The following cell is tagged with the `parameters` cell tag to enable parametrization when the notebook is executed headlessly as a [SageMaker Notebook-based workflow](https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-auto-run.html). Refer to the section **Run the notebook as a SageMaker job** for details and an example. Ignore this for now.

In [None]:
# This cell is tagged with the `parameters` tag and will be overwritten if the notebook is executed headlessly
file_source = "local"
file_name = "bank-additional-full.csv"
input_path = "./data/bank-additional" 
output_path = "./data"

In [None]:
target_col = "y"

In [None]:
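# If you run the notebook as a job (non-interactively or headlessly), it cannot access the JupyterLab EBS volume, so download the dataset from S3 instead
# See the section "Run the notebook as a SageMaker job" for more details
# Note: `session`, `bucket_name`, and `bucket_prefix` are not defined in the cells above and must be provided for this branch to work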
if file_source != "local":
    session.download_data(
        path=os.path.join(input_path, ""), 
        bucket=bucket_name,
        key_prefix=f"{bucket_prefix}/input/{file_name}"
    )

## EDA
Let's do some exploratory data analysis on this dataset.

In [None]:
df_data = pd.read_csv(os.path.join(input_path, file_name), sep=";")

pd.set_option("display.max_columns", 500)  # View all of the columns
df_data  # show first 5 and last 5 rows of the dataframe

In [None]:
# see column metadata
df_data.info()

In [None]:
# see column statistics
df_data.describe()

In [None]:
# see target distribution
df_data[target_col].value_counts().plot.bar()

plt.show()

In [None]:
# see if there are any missing values
df_data.isna().sum()

In [None]:
cat_columns = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome']

fig, axs = plt.subplots(3, 3, sharex=False, sharey=False, figsize=(20, 15))

counter = 0
for cat_column in cat_columns:
    value_counts = df_data[cat_column].value_counts()
    
    trace_x = counter // 3
    trace_y = counter % 3
    x_pos = np.arange(0, len(value_counts))
    
    axs[trace_x, trace_y].bar(x_pos, value_counts.values, tick_label = value_counts.index)
    
    axs[trace_x, trace_y].set_title(cat_column)
    
    for tick in axs[trace_x, trace_y].get_xticklabels():
        tick.set_rotation(90)
    
    counter += 1

plt.show()

In [None]:
num_columns = ['duration', 'campaign', 'pdays', 'previous']

fig, axs = plt.subplots(2, 2, sharex=False, sharey=False, figsize=(20, 15))

counter = 0
for num_column in num_columns:
    
    trace_x = counter // 2
    trace_y = counter % 2
    
    axs[trace_x, trace_y].hist(df_data[num_column])
    
    axs[trace_x, trace_y].set_title(num_column)
    
    counter += 1

plt.show()

In [None]:
j_df = pd.DataFrame()

j_df['yes'] = df_data[df_data[target_col] == 'yes']['marital'].value_counts()
j_df['no'] = df_data[df_data[target_col] == 'no']['marital'].value_counts()

j_df.plot.bar(title = 'Marital status and deposit')

## Feature engineering

As an example, the processing code implements the following feature engineering:
1. Create a new column called `no_previous_contact`. Set its value to `1` when `pdays` is `999` and to `0` otherwise
1. Create a new column called `not_working` that flags customers who are not actively employed, based on the `job` column
1. Remove the economic features from the dataset, as they would need to be forecasted with high precision to be usable as features at inference time
1. Remove `duration`, as it is not known before a call is performed
1. Bin `age` into ranges and scale `pdays`, `previous`, and `campaign` to the `[0, 1]` range
1. Convert categorical variables to numeric using **one-hot encoding**
1. Move the target column `y` to the front

In a real-world project, you would implement additional processing, data quality handling, and feature engineering. You would also go through multiple trial-and-error iterations.

In [None]:
# Indicator variable to capture when pdays takes a value of 999
df_data["no_previous_contact"] = np.where(df_data["pdays"] == 999, 1, 0)

# Indicator for individuals not actively employed
# (`Series.isin` replaces `np.in1d`, which is deprecated in recent NumPy versions)
df_data["not_working"] = np.where(
    df_data["job"].isin(["student", "retired", "unemployed"]), 1, 0
)

# remove unnecessary data
df_model_data = df_data.drop(
    ["duration", "emp.var.rate", "cons.price.idx", "cons.conf.idx", "euribor3m", "nr.employed"],
    axis=1,
)


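# Left-closed age bins so that the labels match the bin edges (for example, age 30 falls into '30-39').
# The open-ended outer edges ensure that no age falls outside every bin:
# ages below 18 join the first bin, and ages of 90 or more join the last one.
bins = [0, 30, 40, 50, 60, 70, np.inf]
labels = ['18-29', '30-39', '40-49', '50-59', '60-69', '70-plus']

df_model_data['age_range'] = pd.cut(df_model_data.age, bins, labels=labels, right=False)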
df_model_data = pd.concat([df_model_data, pd.get_dummies(df_model_data['age_range'], prefix='age', dtype=int)], axis=1)
df_model_data.drop('age', axis=1, inplace=True)
df_model_data.drop('age_range', axis=1, inplace=True)

scaled_features = ['pdays', 'previous', 'campaign']
df_model_data[scaled_features] = MinMaxScaler().fit_transform(df_model_data[scaled_features])

df_model_data = pd.get_dummies(df_model_data, dtype=int)  # Convert categorical variables to sets of indicators

# Replace "y_no" and "y_yes" with a single label column, and bring it to the front:
df_model_data = pd.concat(
    [
        df_model_data["y_yes"].rename(target_col),
        df_model_data.drop(["y_no", "y_yes"], axis=1),
    ],
    axis=1,
)

In [None]:
df_model_data

In [None]:
df_model_data.describe()
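
The age binning in the feature engineering cell relies on `pd.cut`, which closes each interval on the right by default. The following sketch, on a few made-up ages that are not taken from the dataset, shows how a boundary value such as 30 is assigned under each setting, and that values outside the bin edges become `NaN`. It is only meant as a quick way to check bin assignment, not as part of the processing code.

```python
import pandas as pd

# Toy ages chosen to sit on and around the bin edges (illustrative values only)
ages = pd.Series([18, 29, 30, 31, 95])
bins = [18, 30, 40]
labels = ["18-29", "30-39"]

# Default behavior: right-closed intervals, so 30 lands in the first bin
right_closed = pd.cut(ages, bins, labels=labels, include_lowest=True)

# right=False: left-closed intervals, so 30 lands in the second bin
left_closed = pd.cut(ages, bins, labels=labels, right=False)

comparison = pd.DataFrame(
    {"age": ages, "right_closed": right_closed, "left_closed": left_closed}
)
print(comparison)
```

In both cases, the age of 95 lies outside the bin edges and is left unassigned, which later produces an all-zero row in the one-hot encoded age columns.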

## Split data

[SageMaker XGBoost](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html#InputOutput-XGBoost) expects data in the libSVM or CSV formats, with:

- The target variable in the first column, and
- No header row
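
To make the expected CSV layout concrete, here is a minimal sketch on a made-up three-row frame. The column names and values are illustrative assumptions, not the real dataset. It writes the frame without a header or an index, reads it back, and checks that the first column holds the binary target.

```python
import io

import pandas as pd

# Toy frame with the target already in the first column (illustrative values only)
toy = pd.DataFrame(
    {"y": [1, 0, 1], "campaign": [0.1, 0.5, 0.0], "age_18-29": [0, 1, 0]}
)

# Write without a header row and without the index, as in the layout described above
buffer = io.StringIO()
toy.to_csv(buffer, index=False, header=False)
csv_text = buffer.getvalue()

# Read the data back without a header and inspect the first column
restored = pd.read_csv(io.StringIO(csv_text), header=None)
first_line = csv_text.splitlines()[0]
labels_ok = bool(restored.iloc[:, 0].isin([0, 1]).all())

print(first_line)
print(f"First column is a binary label: {labels_ok}")
```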

In [None]:
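# Shuffle and split the dataset: 70% train, 20% validation, 10% test
# (plain `iloc` slicing avoids the deprecation warning that `np.split` raises on DataFrames in recent pandas versions)
shuffled_data = df_model_data.sample(frac=1, random_state=1729)
train_end = int(0.7 * len(shuffled_data))
validation_end = int(0.9 * len(shuffled_data))

train_data = shuffled_data.iloc[:train_end]
validation_data = shuffled_data.iloc[train_end:validation_end]
test_data = shuffled_data.iloc[validation_end:]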

print(f"Data split > train:{train_data.shape} | validation:{validation_data.shape} | test:{test_data.shape}")

In [None]:
# Save data to Studio filesystem
train_data.to_csv(os.path.join(output_path, "train.csv"), index=False, header=False)
validation_data.to_csv(os.path.join(output_path, "validation.csv"), index=False, header=False)
test_data.to_csv(os.path.join(output_path, "test.csv"), index=False, header=False)

## Model training and validation

In [None]:
train_features = train_data.drop(target_col, axis=1)
train_label = pd.DataFrame(train_data[target_col])

In [None]:
dtrain = xgb.DMatrix(train_features, label=train_label)

In [None]:
hyperparams = {
                "max_depth": 5,
                "eta": 0.5,
                "alpha": 2.5,
                "objective": "binary:logistic",
                "subsample" : 0.8,
                "colsample_bytree" : 0.8,
                "min_child_weight" : 3
              }

num_boost_round = 150
nfold = 3
early_stopping_rounds = 10

First, run an `nfold`-fold cross-validation on the training dataset: the model is trained `nfold` times, each time holding out a different fold for validation.
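
As a rough illustration of what a k-fold split does, the following pure-Python sketch partitions row indices into folds and uses each fold once for validation. The helper `kfold_indices` is hypothetical and written only for this example; it is not how `xgb.cv` is implemented internally.

```python
import random


def kfold_indices(n_rows, nfold, seed=10):
    """Yield (train_idx, validation_idx) pairs for a shuffled k-fold split.

    Hypothetical helper for illustration only.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into nfold roughly equal folds
    folds = [indices[i::nfold] for i in range(nfold)]
    for k in range(nfold):
        validation_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, validation_idx


splits = list(kfold_indices(n_rows=10, nfold=3))
for fold_number, (train_idx, validation_idx) in enumerate(splits):
    print(f"fold {fold_number}: train={sorted(train_idx)} validation={sorted(validation_idx)}")
```

Every row appears in exactly one validation fold, so the averaged validation metric reflects performance on data that each fold's model did not see during training.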

In [None]:
# Cross-validate on training data
cv_results = xgb.cv(
    params=hyperparams,
    dtrain=dtrain,
    num_boost_round=num_boost_round,
    nfold=nfold,
    early_stopping_rounds=early_stopping_rounds,
    metrics=["auc"],
    seed=10,
)

In [None]:
metrics_data = {
    "binary_classification_metrics": {
        "validation:auc": {
            "value": cv_results.iloc[-1]["test-auc-mean"],
            "standard_deviation": cv_results.iloc[-1]["test-auc-std"]
        },
        "train:auc": {
            "value": cv_results.iloc[-1]["train-auc-mean"],
            "standard_deviation": cv_results.iloc[-1]["train-auc-std"]
        },
    }
}

In [None]:
print(f"Cross-validated train-auc:{cv_results.iloc[-1]['train-auc-mean']:.2f}")
print(f"Cross-validated validation-auc:{cv_results.iloc[-1]['test-auc-mean']:.2f}")

In [None]:
cv_results

Now retrain the model on the full training dataset instead of splitting the training dataset across a number of folds. Use the test dataset for early stopping.
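
The idea behind early stopping can be sketched in a few lines: training stops once the evaluation metric has not improved for a given number of rounds. The function below is a simplified illustration for a metric that should be maximized, such as AUC. The function name and the toy scores are assumptions made for this example, and the logic is not XGBoost's actual implementation.

```python
def best_round_with_early_stopping(scores, patience):
    """Return (best_round, stopped_at) for a metric where higher is better.

    Simplified illustration: stop as soon as `patience` rounds have passed
    without an improvement over the best score seen so far.
    """
    best_round, best_score = 0, float("-inf")
    for current_round, score in enumerate(scores):
        if score > best_score:
            best_round, best_score = current_round, score
        elif current_round - best_round >= patience:
            return best_round, current_round
    # The metric kept improving often enough, so all rounds were used
    return best_round, len(scores) - 1


# Made-up evaluation AUC values per boosting round
toy_auc = [0.70, 0.74, 0.76, 0.755, 0.758, 0.75, 0.74]
best_round, stopped_at = best_round_with_early_stopping(toy_auc, patience=3)
print(f"best round: {best_round}, stopped at round: {stopped_at}")
```

A larger patience value tolerates longer plateaus at the cost of more boosting rounds.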

In [None]:
test_features = test_data.drop(target_col, axis=1)
test_label = pd.DataFrame(test_data[target_col])
dtest = xgb.DMatrix(test_features, label=test_label)

### Train a model

In [None]:
# In production code, you need to use unique IDs
run_suffix = strftime('%d-%H-%M-%S', gmtime())
max_metric = 0.0
best_model_run_id = 0

# Train the model for different max_depth values
for i, d in enumerate([2, 5, 10, 15, 20]):
    hyperparams["max_depth"] = d
    print(f"Fit estimator with max_depth={d}")

    # Train the model
    model = xgb.train(
        params=hyperparams, 
        dtrain=dtrain, 
        evals = [(dtrain,'train'), (dtest,'eval')], 
        num_boost_round=num_boost_round, 
        early_stopping_rounds=early_stopping_rounds, 
        verbose_eval = 0
    )

    # Calculate metrics
    test_auc = roc_auc_score(test_label, model.predict(dtest))
    train_auc = roc_auc_score(train_label, model.predict(dtrain))

    if test_auc > max_metric:
        best_model = model
        best_depth = d
        max_metric = test_auc

    print(f"Test AUC: {test_auc:.4f} | Train AUC: {train_auc:.4f}")

In [None]:
print(f"Best model has a max_depth setting of {best_depth}")

### Evaluate the model


In [None]:
predictions = best_model.predict(dtest)

In [None]:
test_auc = roc_auc_score(test_label, predictions)
test_auc
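
For intuition about the metric, ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The following pure-Python sketch computes it by comparing all positive-negative pairs on made-up labels and scores. It is meant to illustrate the definition, not to replace `roc_auc_score`, and it is too slow for large datasets.

```python
def pairwise_roc_auc(labels, scores):
    """Compute ROC AUC by comparing every positive-negative pair (illustrative helper)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for positive_score in positives:
        for negative_score in negatives:
            if positive_score > negative_score:
                wins += 1.0
            elif positive_score == negative_score:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted probabilities
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc_value = pairwise_roc_auc(toy_labels, toy_scores)
print(f"AUC: {toy_auc_value:.2f}")
```

Here three of the four positive-negative pairs are ranked correctly, which gives an AUC of 0.75.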

In [None]:
pd.crosstab(
    index=test_label.to_numpy().squeeze(),
    columns=np.round(np.array(predictions, dtype=float).squeeze()),  # predictions of the best model, not of the last model trained in the loop
    rownames=['actuals'], 
    colnames=['predictions']
)
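
The cross-tabulation above is a confusion matrix at a probability threshold of 0.5. As a small sketch on made-up values, the counts can also be derived by hand and turned into precision and recall. The helper name `confusion_counts` and the toy data are assumptions for illustration only.

```python
from collections import Counter


def confusion_counts(actuals, probabilities, threshold=0.5):
    """Count (actual, predicted) pairs after thresholding the probabilities."""
    counts = Counter()
    for actual, probability in zip(actuals, probabilities):
        predicted = 1 if probability > threshold else 0
        counts[(actual, predicted)] += 1
    return counts


# Made-up labels and predicted probabilities
toy_actuals = [0, 0, 1, 1, 1]
toy_probabilities = [0.2, 0.7, 0.9, 0.4, 0.6]

counts = confusion_counts(toy_actuals, toy_probabilities)
true_positives = counts[(1, 1)]
false_positives = counts[(0, 1)]
false_negatives = counts[(1, 0)]

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
print(f"precision: {precision:.2f}, recall: {recall:.2f}")
```

For an imbalanced target, precision and recall at a chosen threshold often say more about practical usefulness than accuracy does.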