## Initialize Notebook

Use the code-cell below to define the global variables `BUCKET` and `ROLE` using Amazon CodeWhisperer recommendations. The values for these variables can be sourced from the sagemaker library.

In [None]:
import sagemaker_datawrangler                     # For interactive data prep widget
import numpy as np                                # For matrix operations and numerical processing
import pandas as pd                               # For munging tabular data
import matplotlib.pyplot as plt                   # For charts and visualizations
from IPython.display import Image                 # For displaying images in the notebook
from IPython.display import display               # For displaying outputs in the notebook
from time import gmtime, strftime                 # For labeling SageMaker models, endpoints, etc.
import sys                                        # For writing outputs to notebook
import math                                       # For ceiling function
import json                                       # For parsing hosting outputs
import os                                         # For manipulating filepath names
import boto3
import re
import zipfile

In the code-cell below, use Amazon CodeWhisperer to determine the installed version of the `pandas` library.
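One possible answer, shown as a minimal sketch: the installed version is exposed by the `__version__` attribute of the package. The variable name `pandas_version` is only illustrative.

```python
import pandas as pd

# The installed pandas version is available as a plain string
pandas_version = pd.__version__
print(pandas_version)
```

Knowing the version matters later in the notebook, because the behavior of helpers such as `pd.get_dummies` differs between major pandas releases.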

## Data Import

In the code-cell below, download the public data-set using the requests library. The data-set is a ZIP archive that can be downloaded using this [URL](https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip) and needs to be extracted into the current directory. 
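A CodeWhisperer recommendation for this step might resemble the sketch below. The helper names `extract_zip_bytes` and `download_and_extract` and the 60-second timeout are assumptions made for illustration; only the URL comes from the text above.

```python
import io
import zipfile

DATA_URL = (
    "https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/"
    "autopilot/direct_marketing/bank-additional.zip"
)


def extract_zip_bytes(payload, target_dir="."):
    """Extract an in-memory ZIP archive into target_dir and return its member names."""
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


def download_and_extract(url=DATA_URL, target_dir="."):
    """Download the ZIP archive with requests and extract it into target_dir."""
    # Imported here so the extraction helper stays usable without network access
    import requests

    response = requests.get(url, timeout=60)  # hypothetical timeout value
    response.raise_for_status()               # fail early on HTTP errors
    return extract_zip_bytes(response.content, target_dir)


# Notebook usage (performs a network request):
# download_and_extract()
```

Calling `download_and_extract()` with the defaults should leave a `bank-additional` folder in the current directory, which is the path the next cell reads from.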


In [None]:
bank_additional = pd.read_csv('bank-additional/bank-additional-full.csv')
print(bank_additional.head())

## Data Exploration

In the code-cell below, use Amazon CodeWhisperer generated recommendations to correlate the features by plotting a `heatmap` from the `seaborn` library.

**Challenge:** Plot a `pairplot` from the `seaborn` library.
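For the heatmap exercise, one possible shape of the solution is sketched below. The helper names, figure size, and color map are assumptions, and the correlation is restricted to numeric columns because the raw data-set also contains text columns.

```python
import pandas as pd


def correlation_matrix(df):
    """Pairwise correlation of the numeric columns only."""
    return df.select_dtypes(include="number").corr()


def plot_correlation_heatmap(df):
    """Draw the correlation matrix as an annotated seaborn heatmap."""
    # Plotting libraries are imported here so correlation_matrix works without them
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, ax = plt.subplots(figsize=(10, 8))  # hypothetical figure size
    sns.heatmap(correlation_matrix(df), annot=True, fmt=".2f", cmap="coolwarm", ax=ax)
    ax.set_title("Feature correlation")
    return ax


# Notebook usage:
# plot_correlation_heatmap(bank_additional)
```

For the challenge, `sns.pairplot(bank_additional)` is the analogous call; on the full data-set it can be slow, so plotting a sample of rows may be more practical.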

## Data Transformation

In the code-cell below, create a new dataframe with a column `no_previous_contact` derived from the existing dataframe column `pdays`. Set `no_previous_contact` to `1` where the value of `pdays` is `999`, and to `0` otherwise.
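A minimal sketch of this transformation, following the same `np.where` pattern the notebook uses elsewhere. The function name `add_no_previous_contact` and the toy frame `sample` are illustrative stand-ins, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def add_no_previous_contact(df):
    """Return a copy of df with a 0/1 `no_previous_contact` indicator column."""
    result = df.copy()
    # 1 where pdays equals 999, 0 for every other value
    result["no_previous_contact"] = np.where(result["pdays"] == 999, 1, 0)
    return result


# Toy frame standing in for bank_additional
sample = pd.DataFrame({"pdays": [999, 3, 999, 6]})
flagged = add_no_previous_contact(sample)
print(flagged)
```

In the notebook, the equivalent call would be `bank_additional = add_no_previous_contact(bank_additional)` so that the following cells keep working on the same variable.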


In [None]:
# Flag students, retirees, and the unemployed in one pass; separate assignments would overwrite each other
bank_additional['not_working'] = np.where(bank_additional['job'].isin(['student', 'retired', 'unemployed']), 1, 0)

In [None]:
bank_additional = pd.get_dummies(bank_additional, dtype=int)   # One-hot encode categoricals as 0/1 integers so the CSV files stay numeric
bank_additional.head()

In the following code-cell, drop the columns `duration`, `emp.var.rate`, `cons.price.idx`, `cons.conf.idx`, `euribor3m` and `nr.employed` from the data-frame. Store the result in a new data-frame called `model_data`.
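A sketch of the column drop, assuming the column names exactly as listed above. The constant `COLUMNS_TO_DROP` and the helper `build_model_data` are hypothetical names.

```python
import pandas as pd

# Columns named in the instructions above
COLUMNS_TO_DROP = [
    "duration",
    "emp.var.rate",
    "cons.price.idx",
    "cons.conf.idx",
    "euribor3m",
    "nr.employed",
]


def build_model_data(df):
    """Return a new data-frame without the listed columns; df itself is unchanged."""
    return df.drop(columns=COLUMNS_TO_DROP)


# Notebook usage:
# model_data = build_model_data(bank_additional)
```

`DataFrame.drop` returns a new object by default, so `bank_additional` remains available for comparison after `model_data` is created.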

In [None]:
train_data, validation_data, test_data = np.split(model_data.sample(frac=1, random_state=42), [int(.7 * len(model_data)), int(.9 * len(model_data))])

In [None]:
pd.concat([train_data['y_yes'], train_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('train.csv', index=False, header=False)
pd.concat([validation_data['y_yes'], validation_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('validation.csv', index=False, header=False)

Now that the data is prepared, use the next code-cell to upload the generated CSV files `train.csv` and `validation.csv` to S3 using `boto3`.
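One way the upload could look is sketched below. The object keys `train/train.csv` and `validation/validation.csv` are chosen to match the S3 paths used by the training inputs later in the notebook; the helper name `upload_training_files` is an assumption. The S3 client is passed in as an argument so the function can be exercised without AWS credentials.

```python
def upload_training_files(s3_client, bucket, files=None):
    """Upload local CSV files to S3 and return the resulting S3 URIs."""
    # Local file name -> object key expected by the training inputs
    files = files or {
        "train.csv": "train/train.csv",
        "validation.csv": "validation/validation.csv",
    }
    uploaded = []
    for local_path, key in files.items():
        s3_client.upload_file(local_path, bucket, key)
        uploaded.append("s3://{}/{}".format(bucket, key))
    return uploaded


# Notebook usage (requires AWS credentials and the BUCKET variable):
# upload_training_files(boto3.client("s3"), BUCKET)
```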

## Model Training and Tuning

In [None]:
import sagemaker
CONTAINER = sagemaker.image_uris.retrieve('xgboost', sagemaker.session.Session().boto_region_name, 'latest')
print(CONTAINER)

In [None]:
from sagemaker.inputs import TrainingInput

train_input = TrainingInput(s3_data='s3://{}/train/train.csv'.format(BUCKET),
                                                      content_type='text/csv')
validation_input = TrainingInput(s3_data='s3://{}/validation/validation.csv'.format(BUCKET),
                                                      content_type='text/csv')

In [None]:
xgb_estimator = sagemaker.estimator.Estimator(CONTAINER, ROLE, instance_count=1, instance_type='ml.m5.xlarge', output_path='s3://{}/output'.format(BUCKET))
xgb_estimator.set_hyperparameters(max_depth=5, eta=0.2, gamma=4, min_child_weight=6, subsample=0.8, objective='binary:logistic', num_round=100)
xgb_estimator.fit({'train': train_input, 'validation': validation_input})

## Model Hosting

In the following code-cell, deploy the generated model to a real-time inference endpoint. Use Amazon CodeWhisperer recommendations to do this.
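A possible deployment cell is sketched below. The instance type and count are assumptions, not values given by the notebook, and the wrapper name `deploy_realtime_endpoint` is hypothetical. The result is assigned to `xgb_predictor` because the evaluation cells refer to that name.

```python
def deploy_realtime_endpoint(estimator, instance_type="ml.m5.xlarge", instance_count=1):
    """Deploy a trained estimator to a real-time endpoint and return the predictor."""
    return estimator.deploy(
        initial_instance_count=instance_count,  # assumed: a single instance
        instance_type=instance_type,            # assumed instance type
    )


# Notebook usage (creates a billable endpoint):
# xgb_predictor = deploy_realtime_endpoint(xgb_estimator)
```

Deployment typically takes several minutes, and the endpoint incurs charges until it is deleted.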

## Model Evaluation

In [None]:
from sagemaker.serializers import CSVSerializer

xgb_predictor.serializer = CSVSerializer()

def predict(data, predictor, rows=500):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    for array in split_array:
        predictions = ','.join([predictions, predictor.predict(array).decode('utf-8')])

    return np.fromstring(predictions[1:], sep=',')

predictions = predict(test_data.drop(['y_no', 'y_yes'], axis=1).to_numpy(), xgb_predictor)

In [None]:
pd.crosstab(index=test_data['y_yes'], columns=np.round(predictions), rownames=['actuals'], colnames=['predictions'])

## Clean Up

In the following code-cell, delete the SageMaker endpoint. Use Amazon CodeWhisperer recommendations to do this.
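A minimal sketch of the clean-up step, assuming the predictor returned by the deployment step is still available as `xgb_predictor`. The wrapper name `delete_realtime_endpoint` is hypothetical.

```python
def delete_realtime_endpoint(predictor):
    """Delete the real-time endpoint behind the given predictor."""
    predictor.delete_endpoint()


# Notebook usage:
# delete_realtime_endpoint(xgb_predictor)
```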