# Breast Cancer Prediction with XGBoost
_**Using Gradient Boosted Trees to predict breast cancer with features derived from breast mass images**_

---

## Contents

1. [Background](#Background)
2. [Notebook Setup](#notebook_setup)
3. [Data Wrangling](#data_wrangling)
4. [Dataset Preparation](#dataset_preparation)
5. [Training](#Training)
6. [Hosting](#Hosting)
  1. [Evaluate](#Evaluate)
7. [(Optional) Clean-up](#cleanup)
8. [Extensions](#Extensions)
  1. [Hyperparameter Optimization](#hyperparameter_optimization)
  1. [(Optional) Final Clean-up](#cleanupfinal)

---
<a id = "Background"></a>

## Background

This notebook illustrates the use of SageMaker's built-in XGBoost algorithm for binary classification.
XGBoost uses decision trees to build a predictive model.

It also demonstrates hyperparameter optimization (HPO) and how to use the best model found by HPO to instantiate a new endpoint.

### Why XGBoost and not Logistic Regression?

Whilst logistic regression is often used for classification exercises, it has some drawbacks. For example, additional feature engineering is required to deal with non-linear features.

XGBoost (an implementation of Gradient Boosted Trees) offers several benefits including naturally accounting for non-linear relationships between features and the target variable, as well as accommodating complex interactions between features.
Decision Tree algorithms such as XGBoost also have the added benefit of being able to deal with missing values in both the training dataset and unseen samples that are being used for inference.
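
As a rough illustration of the missing-value behaviour described above, the sketch below routes a few samples through a single tree split. It assumes a simplified picture in which each split carries a default direction for missing values; the `route` helper, the threshold and the sample values are hypothetical and are not part of the XGBoost API.

```python
import math


def route(value, threshold, default_left):
    """Send a sample left or right at one split; missing values take the default branch."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return "left" if default_left else "right"
    return "left" if value < threshold else "right"


# Two of the four samples have no value for this feature
samples = [12.5, float("nan"), 20.1, None]
directions = [route(v, threshold=15.0, default_left=False) for v in samples]
print(directions)
```

A linear model has no equivalent of the default branch, which is why missing values usually have to be imputed before fitting logistic regression.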

Amazon SageMaker provides an XGBoost container that we can use to train in a managed, distributed setting, and then host as a real-time prediction endpoint.

The model lifecycle can be viewed below:

![`SageMaker` Model Lifecycle](images/ml-concepts.png)



---
<a id = "notebook_setup"></a>

## Notebook Setup

_This notebook was created and tested on a ml.t2.medium notebook instance._

Amazon SageMaker Studio is the first fully integrated development environment (IDE) for ML. It provides a single, web-based visual interface where you can perform all ML development steps required to build, train, tune, debug, deploy, and monitor models. It gives data scientists all the tools they need to take ML models from experimentation to production without leaving the IDE.

Studio notebooks are one-click Jupyter notebooks that can be spun up quickly. The underlying compute resources are fully elastic, so you can easily dial up or down the available resources, and the changes take place automatically in the background without interrupting your work. You can also share notebooks with others in a few clicks. They get the exact same notebook, saved in the same place.

To learn more about SageMaker Studio architecture, refer to this [blog](https://aws.amazon.com/blogs/machine-learning/dive-deep-into-amazon-sagemaker-studio-notebook-architecture/)


### Libraries
You will be using the AWS SDK for Python ([Boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html)) to create, configure, and manage AWS services, such as Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). The SDK provides an object-oriented API as well as low-level access to AWS services. 

### Import dependencies

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data.  This should be within the same region as the Notebook Instance, training, and hosting.
- The SageMaker role ARN used to give training and hosting access to your data. The snippet below will use the same role used by your SageMaker notebook instance. If you wish to use a different role, specify the full ARN of a role with the `AmazonSageMakerFullAccess` policy attached.

In [None]:
# Import the SageMaker Python SDK and Boto3
import sagemaker
import boto3
from sagemaker import get_execution_role

region = boto3.Session().region_name

# Define IAM role
role = get_execution_role()

session = sagemaker.Session()

# bucket = ''  # <uncomment and change to your own bucket if you don't want to use the default bucket>
bucket = session.default_bucket()
print("The role is {} and the bucket is {}".format(role, bucket))
prefix = "sagemaker/xgboost-bc"  # modify to your own path if desired

Next, we'll import the Python libraries we'll need for examining our data

In [None]:
import numpy as np  # For matrix operations and numerical processing
import pandas as pd  # For munging tabular data
import matplotlib.pyplot as plt  # For charts and visualizations
from IPython.display import Image  # For displaying images in the notebook
from IPython.display import display  # For displaying outputs in the notebook
import time  # For labeling SageMaker models, endpoints, etc.
from time import gmtime, strftime  # For labeling SageMaker models, endpoints, etc.
import sys  # For writing outputs to notebook
import math  # For ceiling function
import json  # For parsing hosting outputs
import os  # For manipulating filepath names
import zipfile  # For working with zip archives
from datetime import datetime  # Date time library to log time
import re  # Regular expression
import seaborn as sn  # Seaborn is used to plot the confusion matrix

---
<a id='data_wrangling'></a>

## Data Wrangling

For this illustration, we use the UCI breast cancer diagnostic data set, available at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29. The data set is also available on Kaggle at https://www.kaggle.com/uciml/breast-cancer-wisconsin-data. The purpose here is to build a predictive model of whether a breast mass image indicates a benign or malignant tumor.

You can find out all the details of this dataset here: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

Let's download the data and save it in the local folder so that we can take a look at it.

In [None]:
!rm -rfv wdbc.data train.csv validation.csv
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data

The dataset we downloaded does not have column headings; however, this information is available at the source.

Sample images used in this dataset can be seen here: ftp://ftp.cs.wisc.edu/math-prog/cpo-dataset/machine-learn/cancer/cancer_images

- `id`: ID number
- `diagnosis`: The diagnosis of breast tissues (M = malignant, B = benign)
- `radius_mean`: mean of distances from center to points on the perimeter
- `texture_mean`: standard deviation of gray-scale values
- `perimeter_mean`: mean size of the core tumor
- `area_mean`: area of the core tumor
- `smoothness_mean`: mean of local variation in radius lengths
- `compactness_mean`: mean of perimeter^2 / area - 1.0
- `concavity_mean`: mean of severity of concave portions of the contour
- `concave points_mean`: mean for number of concave portions of the contour
- `symmetry_mean`: mean for symmetry between left and right breasts
- `fractal_dimension_mean`: mean for "coastline approximation" - 1
- `radius_se`: standard error for the mean of distances from center to points on the perimeter
- `texture_se`: standard error for standard deviation of gray-scale values
- `perimeter_se`: standard error for size of the core tumor
- `area_se`: standard error for area of the core tumor
- `smoothness_se`: standard error for local variation in radius lengths
- `compactness_se`: standard error for perimeter^2 / area - 1.0
- `concavity_se`: standard error for severity of concave portions of the contour
- `concave points_se`: standard error for number of concave portions of the contour
- `symmetry_se`: standard error for symmetry between left and right breasts
- `fractal_dimension_se`: standard error for "coastline approximation" - 1
- `radius_worst`: "worst" or largest mean value for mean of distances from center to points on the perimeter
- `texture_worst`: "worst" or largest mean value for standard deviation of gray-scale values
- `perimeter_worst`: "worst" or largest mean value for size of the core tumor
- `area_worst`: "worst" or largest mean value for area of the core tumor
- `smoothness_worst`: "worst" or largest mean value for local variation in radius lengths
- `compactness_worst`: "worst" or largest mean value for perimeter^2 / area - 1.0
- `concavity_worst`: "worst" or largest mean value for severity of concave portions of the contour
- `concave points_worst`: "worst" or largest mean value for number of concave portions of the contour
- `symmetry_worst`: "worst" or largest mean value for symmetry between left and right breasts
- `fractal_dimension_worst`: "worst" or largest mean value for "coastline approximation" - 1


If we load this CSV data into a pandas dataframe, we can easily take a closer look

Note that this dataset doesn't have a header line, so we will add the column names to the pandas dataframe ourselves

In [None]:
import pandas as pd

col_names = [
    "id",
    "diagnosis",
    "radius_mean",
    "texture_mean",
    "perimeter_mean",
    "area_mean",
    "smoothness_mean",
    "compactness_mean",
    "concavity_mean",
    "concave points_mean",
    "symmetry_mean",
    "fractal_dimension_mean",
    "radius_se",
    "texture_se",
    "perimeter_se",
    "area_se",
    "smoothness_se",
    "compactness_se",
    "concavity_se",
    "concave points_se",
    "symmetry_se",
    "fractal_dimension_se",
    "radius_worst",
    "texture_worst",
    "perimeter_worst",
    "area_worst",
    "smoothness_worst",
    "compactness_worst",
    "concavity_worst",
    "concave points_worst",
    "symmetry_worst",
    "fractal_dimension_worst",
]
breastcancer = pd.read_csv("./wdbc.data", header=None, names=col_names)

breastcancer

The breast cancer dataset is quite small, with only 569 records, where each record uses 32 attributes to describe the profile of a breast mass.
Features are computed from a digitized image of a [Fine Needle Aspirate (FNA)](https://www.cancer.org/cancer/breast-cancer/screening-tests-and-early-detection/breast-biopsy/fine-needle-aspiration-biopsy-of-the-breast.html) of a breast mass. They describe characteristics of the cell nuclei present in the image.

The diagnosis column is our target column. Let us take a look at the diagnosis distribution in both absolute and normalized forms:

In [None]:
display(pd.crosstab(index=breastcancer["diagnosis"], columns="observations"))
display(pd.crosstab(index=breastcancer["diagnosis"], columns="% observations", normalize="columns"))

So 63% of our samples are benign and 37% are malignant.

### Feature Distribution

Let's have a look at the distribution of the numerical features

In [None]:
%matplotlib inline
hist = breastcancer.hist(bins=30, sharey=False, figsize=(20, 20))

From these histograms we can see that:
- Most of the numeric features follow a roughly Gaussian (bell-shaped) distribution
- `id` should not be included as a feature as it is irrelevant for predicting a diagnosis (Note: if we were going to keep `id`, it should be converted to non-numeric) 


We will drop the `id` column from the dataset:


In [None]:
breastcancer = breastcancer.drop(["id"], axis=1)

#### Note for future reference:
There may be scenarios where a numeric field like `id` carries useful non-numeric information.
A good example would be if the first N characters of patient ID indicated the country or state where the patient was located, and you wanted to see if this location had any bearing on the diagnosis.
In such a case you would convert the field to a string and extract the pertinent information.

`breastcancer['id'] = breastcancer['id'].astype(object)`

You would then treat that field as a categorical field.
These categorical fields would later be converted to indicator variables using the pandas 'get_dummies' method
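
A minimal sketch of that idea is shown below. It assumes, purely for illustration, that the first two characters of the ID encode a region; the IDs, the `region` column and the values are hypothetical and do not come from the breast cancer dataset.

```python
import pandas as pd

# Hypothetical patient IDs whose first two characters encode a region
df = pd.DataFrame({"id": [101234, 102345, 201111], "radius_mean": [14.2, 20.1, 11.8]})

# Treat the ID as text and keep only the (assumed) region prefix
df["region"] = df["id"].astype(str).str[:2]
df = df.drop(columns=["id"])

# Expand the categorical column into indicator columns
encoded = pd.get_dummies(df, columns=["region"])
print(encoded.columns.tolist())
```

Each distinct prefix becomes its own indicator column, so the result contains only numeric or boolean values.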


To take a look at the relationship between any categorical fields and the final diagnosis, you would use the following cross-tabulation report: 

```python
for column in breastcancer.select_dtypes(include=['object']).columns:
    if column != 'diagnosis':
        display(pd.crosstab(index=breastcancer[column], columns=breastcancer['diagnosis'], normalize='columns'))
```

### Direct relationship between features and diagnosis

Now we will look at the direct relationship between numeric (non-object) values and diagnosis. We do this by plotting a histogram for every numeric value.
We divide our samples into `bins`. The X-axis represents the bins and the Y-axis represents how many samples fall into each bin.
By forcing the benign and malignant graphs to share the same X and Y scale it is easier to visualize which bins are more populated between the two diagnoses.

Feel free to adjust the number of bins being plotted by the histogram and view the effect.

In [None]:
for column in breastcancer.select_dtypes(exclude=["object"]).columns:
    print(column)
    hist = breastcancer[[column, "diagnosis"]].hist(
        by="diagnosis", sharey=True, sharex=True, bins=30
    )
    plt.show()

What can we infer from these relationships?

We see that malignant diagnoses appear to have higher values (extend further to the right on the X axis) for the following features:
- `radius_mean`
- `perimeter_mean`
- `area_mean`
- `compactness_mean`
- `concavity_mean`
- `concave points_mean`
- `radius_se`
- `area_se`
- `radius_worst`
- `texture_worst`
- `perimeter_worst`
- `area_worst`
- `compactness_worst`
- `concavity_worst`
- `concave points_worst`

The features `radius_mean`, `perimeter_mean` and `area_mean` have distributions that resemble one another, for both malignant and benign diagnoses. This makes sense, as each of these features is related to the size of the tumor.

Let's look at correlations using a correlation matrix and a scatter matrix.

In [None]:
import matplotlib.pyplot as plt

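# Display the correlation matrix (numeric columns only; `diagnosis` still holds strings here)
display(breastcancer.select_dtypes(exclude=["object"]).corr())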

# Plot the matrix
axes = pd.plotting.scatter_matrix(breastcancer, figsize=(30, 30))
for ax in axes.flatten():
    ax.xaxis.label.set_rotation(90)
    ax.yaxis.label.set_rotation(0)
    ax.yaxis.label.set_ha("right")
plt.tight_layout()
plt.gcf().subplots_adjust(wspace=0, hspace=0)
plt.show()

In the scatter matrix, strongly correlated features show up as points lying close to a diagonal line running from bottom left to top right. In the correlation matrix, such relationships are indicated by a correlation value close to 1.

In many cases it can be a good idea to remove one element of a highly correlated feature pair. 

- `radius_mean` has a high correlation (> 0.98) with both `perimeter_mean` and `area_mean`.
- `radius_worst` has a high correlation (> 0.98) with both `perimeter_worst` and `area_worst`.

To simplify our model, we will drop `perimeter_mean`, `area_mean`, `perimeter_worst` and `area_worst`.
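
Highly correlated pairs can also be located programmatically rather than by reading the matrix. The sketch below is one possible approach, shown on a small synthetic frame; the column names, the generated values and the 0.98 threshold are assumptions chosen for illustration and are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Small synthetic frame standing in for the real features (values are made up)
rng = np.random.default_rng(0)
radius = rng.normal(14, 3, 200)
toy = pd.DataFrame(
    {
        "radius": radius,
        "perimeter": 2 * np.pi * radius + rng.normal(0, 0.5, 200),
        "texture": rng.normal(19, 4, 200),
    }
)

corr = toy.corr().abs()

# Keep only the upper triangle so that each pair is reported once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = [
    (row, col)
    for col in upper.columns
    for row in upper.index
    if upper.loc[row, col] > 0.98
]
print(high_pairs)
```

For each reported pair you would then keep one feature and drop the other, as is done by hand in this notebook.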

In [None]:
breastcancer = breastcancer.drop(
    ["perimeter_mean", "area_mean", "perimeter_worst", "area_worst"], axis=1
)

---
<a id='dataset_preparation'></a>

## Dataset Preparation For XGBoost

Now that we have a clean dataset (and have potentially removed some unnecessary columns), we can prepare the dataset for XGBoost. 

Amazon SageMaker XGBoost can train on data in either a CSV or LibSVM format.  For this example, we'll stick with CSV.  It should:
- Contain only numeric values
- Have the target variable (the label) in the first column
- Not have a header row

We will also
- Shuffle the dataset
- Split the dataset into training, validation and testing sets
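
The CSV layout described above can be sketched as follows. The frame and its values are hypothetical; in this notebook `diagnosis` already ends up as the first column once `id` is dropped, so the reordering step is only needed when your own data has the label elsewhere.

```python
import io

import pandas as pd

# Hypothetical frame where the label is not yet the first column
frame = pd.DataFrame(
    {"radius_mean": [14.2, 20.1], "texture_mean": [19.0, 22.3], "diagnosis": [0, 1]}
)

# Move the label to the front and keep the remaining columns in order
label = "diagnosis"
ordered = frame[[label] + [c for c in frame.columns if c != label]]

# Write without a header row or index, as the built-in algorithm expects
buffer = io.StringIO()
ordered.to_csv(buffer, header=False, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```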

### Step 1: Convert our label (the field we are trying to predict) to numeric - where 0 means B (benign) and 1 means M (malignant)

Ref: [Pandas Categorical](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Categorical.html)


In [None]:
breastcancer.diagnosis = pd.Categorical(breastcancer.diagnosis).codes
breastcancer.head()

### Step 2: Convert any categorical features into numeric features using the "get_dummies" function which will automatically convert categorical variable into dummy/indicator variables.

This dataset does not contain any categorical features (essentially string labels), so we can skip this step.
If your dataset does contain categorical features, you would convert these features to numeric indicator values using the following command:

`model_data = pd.get_dummies(breastcancer)`



### Step 3: Shuffle and split the dataset

And now let's split the data into training, validation, and test sets.  This will help prevent us from overfitting the model, and allow us to test the model's accuracy on data it hasn't already seen.
We will use the train_test_split function provided within sklearn to simplify this process

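Because the second split is applied to the data that remains after the test set is held out, the resulting ratio is approximately:
- Training dataset - 72%
- Validation dataset - 18%
- Test dataset - 10%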

What is random_state?
Whenever randomization is part of a Scikit-learn algorithm, a random_state parameter may be provided to control the random number generator used. Note that the mere presence of random_state doesn’t mean that randomization is always used, as it may be dependent on another parameter, e.g. shuffle, being set.

The passed value will have an effect on the reproducibility of the results returned by the function (fit, split, or any other function like k_means). To learn more about random_state, refer to the [documentation](https://scikit-learn.org/stable/glossary.html#term-random_state).
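
The sizes produced by a two-stage split can be worked out by hand. The arithmetic below is a sketch that assumes the held-out size is rounded up at each stage, which is our understanding of how `train_test_split` treats a fractional `test_size`; the fractions match the ones used in the next cell.

```python
import math

n_samples = 569

# First split: hold out 10% as the test set (held-out size is rounded up)
n_test = math.ceil(0.10 * n_samples)
n_remaining = n_samples - n_test

# Second split: hold out 20% of what is left as the validation set
n_validation = math.ceil(0.20 * n_remaining)
n_train = n_remaining - n_validation

for name, count in [("train", n_train), ("validation", n_validation), ("test", n_test)]:
    print(f"{name}: {count} samples ({count / n_samples:.1%})")
```

Note that the second fraction applies to the remaining 512 samples, not to the full 569, so the validation share of the whole dataset is smaller than 20%.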

In [None]:
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(
    breastcancer, shuffle=True, test_size=0.10, random_state=42
)
train_data, validation_data = train_test_split(
    train_data, shuffle=True, test_size=0.20, random_state=42
)

In [None]:
print("Training data sample size:", len(train_data))
print("Validation data sample size:", len(validation_data))
print("Test data sample size:", len(test_data))

We need to convert the training dataset and validation dataset to CSV and upload to S3 for consumption by the containers running the XGBoost algorithm

In [None]:
train_data.to_csv("train.csv", header=False, index=False)
validation_data.to_csv("validation.csv", header=False, index=False)

Now we'll upload these files to S3.

In [None]:
boto3.Session().resource("s3").Bucket(bucket).Object(
    os.path.join(prefix, "train/train.csv")
).upload_file("train.csv")
boto3.Session().resource("s3").Bucket(bucket).Object(
    os.path.join(prefix, "validation/validation.csv")
).upload_file("validation.csv")

from IPython.display import HTML

url = "https://s3.console.aws.amazon.com/s3/buckets/{}?region={}&prefix={}/&showversions=false".format(
    bucket, session.boto_region_name, prefix
)
display(
    HTML(
        'Click <a target="_blank" href="{}">here</a> to view training and validation sets in S3'.format(
            url
        )
    )
)

Let us review the stages:

![`SageMaker` Model Lifecycle](images/ml-concepts.png)


---
<a id = "Training"></a>

## Training

To learn more about SageMaker Training see this [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html).

First we will need to specify the locations of the [XGBoost](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) algorithm containers.

In [None]:
container = sagemaker.image_uris.retrieve("xgboost", boto3.Session().region_name, version="latest")

Next we will create TrainingInput references to point to our training and validation data files in S3.

In [None]:
s3_input_train = sagemaker.inputs.TrainingInput(
    s3_data="s3://{}/{}/train".format(bucket, prefix), content_type="csv"
)
s3_input_validation = sagemaker.inputs.TrainingInput(
    s3_data="s3://{}/{}/validation/".format(bucket, prefix), content_type="csv"
)

### Define our estimator
We will name our estimator `xgb` (because it uses the XGBoost container provided by SageMaker).

- We will run our training job on a single `ml.m5.xlarge` instance.
- To reduce costs, we will use managed spot instances. `max_run` limits the training time to 10 minutes, and `max_wait` limits the total time, including waiting for spot capacity, to 15 minutes; if the job has not finished by then, it is stopped.

For instance types and pricing, refer to the [documentation](https://aws.amazon.com/sagemaker/pricing/).


In [None]:
xgb = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    use_spot_instances=True,
    max_wait=900,
    max_run=600,
    output_path="s3://{}/{}/output".format(bucket, prefix),
    sagemaker_session=session,
)

### Set our hyperparameters
Now we can specify our XGBoost hyperparameters (the type and number of training instances were already set on the estimator above).
A few key hyperparameters are:
- `max_depth` controls how deep each tree within the algorithm can be built.  Deeper trees can lead to better fit, but are more computationally expensive and can lead to overfitting.  There is typically some trade-off in model performance that needs to be explored between numerous shallow trees and a smaller number of deeper trees.
- `subsample` controls sampling of the training data.  This technique can help reduce overfitting, but setting it too low can also starve the model of data.
- `num_round` controls the number of boosting rounds.  This is essentially the subsequent models that are trained using the residuals of previous iterations.  Again, more rounds should produce a better fit on the training data, but can be computationally expensive or lead to overfitting.
- `eta` controls how aggressive each round of boosting is.  It is the step size (learning rate) applied to each new tree, so smaller values lead to more conservative boosting.
- `gamma` controls how aggressively trees are grown.  Larger values lead to more conservative models.

More detail on XGBoost's hyperparameters can be found on the GitHub [page](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst).
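
To build some intuition for the interplay between `eta` and `num_round`, here is a toy sketch. It is not XGBoost itself: it assumes a single sample and a "tree" that predicts the current residual exactly, so each round simply adds `eta` times the residual. The `boosted_prediction` helper is hypothetical.

```python
def boosted_prediction(target, eta, num_round):
    """Toy boosting on one sample: each round adds eta times the current residual."""
    prediction = 0.0
    for _ in range(num_round):
        residual = target - prediction
        prediction += eta * residual
    return prediction


# Compare a small and a large step size over the same number of rounds
for eta in (0.1, 0.5):
    value = boosted_prediction(target=1.0, eta=eta, num_round=10)
    print(f"eta={eta}: prediction after 10 rounds = {value:.3f}")
```

With the smaller step size the prediction approaches the target more slowly, which is why a low `eta` is usually paired with a larger `num_round`.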

In [None]:
xgb.set_hyperparameters(
    max_depth=7,
    eta=0.1,
    gamma=4,
    min_child_weight=6,
    subsample=0.8,
    objective="binary:logistic",
    num_round=100,
)

### Run the training job

In [None]:
%%time
# The total time of this step is about 4 minutes; the training job takes about 50 seconds using an ml.m5.xlarge instance
xgb.fit({"train": s3_input_train, "validation": s3_input_validation})

---
<a id = "Hosting"></a>

## Hosting

Now that we've trained the `xgboost` algorithm on our data, let's deploy a model that's hosted behind a real-time endpoint.

In [None]:
%%time
# The total time of this step is about 6 minutes
xgb_predictor = xgb.deploy(initial_instance_count=1, instance_type="ml.m5.large")

---
<a id = "Evaluate"></a>

## Evaluate

Now that we have a hosted endpoint running, we can make real-time predictions from our model by calling the `predict` method.  But first, we'll need to set up serializers and deserializers for passing our `test_data` NumPy arrays to the model behind the endpoint.

In [None]:
xgb_predictor.serializer = sagemaker.serializers.CSVSerializer()
xgb_predictor.deserializer = sagemaker.deserializers.StringDeserializer()

Now, we'll use a simple function to:
1. Loop over our test dataset
1. Extract the features for each sample 
1. Retrieve the prediction for each sample by invoking the XGBoost endpoint
1. Collect predictions and convert from a python list to a NumPy array

In [None]:
# Convert the dataframe to a numpy array
dtest = test_data.to_numpy()

# As expected, the numpy array has 57 rows (57 samples in the test dataset), and 27 columns (26 features + 1 label)
print(dtest.shape)

# Create a list to hold all of our predictions
predictions = []

# Loop through the matrix of our test data samples, pulling out the features for each sample and running inference
# Note: dtest[i:i+1, 1:] is a 1 x 26 array holding all the features for sample i (without the first entry, which is the label)
for i in range(dtest.shape[0]):
    sample_features = dtest[i : i + 1, 1:]
    prediction = xgb_predictor.predict(sample_features)
    predictions.append(float(prediction))

# Convert our list of predictions to a numpy array
predictions = np.asarray(predictions)
display(predictions)

To evaluate the performance of this machine learning model on the test dataset, we will use a simple confusion matrix to compare actual to predicted values.

This [article](https://www.frontiersin.org/articles/10.3389/fpubh.2017.00307/full) outlines sensitivity, specificity and predictive values in medical research.


In this case, we're predicting whether the tumor was malignant (`1`) or benign (`0`).

- We get the actual values from the first column (column 0) of the dataset: `test_data.iloc[:, 0]`
- We get the predicted values from our array of predictions: `predictions`. We will simply round the predictions to the nearest integer (so a prediction < 0.5 will be 0 - benign and a prediction > 0.5 will be 1 - malignant)

In [None]:
confusion_matrix = pd.crosstab(
    index=test_data.iloc[:, 0],
    columns=np.round(predictions, 0),  # change 0 to 1 to see distribution
    rownames=["actual"],
    colnames=["predictions"],
    #     margins = True, # Enable to see All count
)

sn.heatmap(confusion_matrix, annot=True)
plt.show()

Of the 57 samples in the test dataset, 40 were benign, and we correctly predicted 39 of them.
1 of the benign samples was incorrectly predicted as malignant.

17 of the samples were malignant, and we correctly predicted 16 of them.
1 of the malignant samples was incorrectly predicted as benign.

An important point here is that because of the `np.round()` function above we are using a simple threshold (or cutoff) of 0.5.  Our predictions from `xgboost` come out as continuous values between 0 and 1, and we force them into the binary classes that we began with.  

However, because we would rather err on the side of a false positive than a false negative, we will adjust this cutoff. 

To get a rough intuition here, let's look at the continuous values of our predictions.

In [None]:
plt.hist(predictions, bins=20)
plt.show()

The continuous valued predictions coming from our model are generally quite decisive so tend to skew toward 0 or 1; however there are a few values between 0.1 and 0.9 where the model is less confident.

How you adjust the cutoff depends entirely on the problem space you are addressing and whether you would prefer to err on the side of false positives or false negatives.

In the case of predicting malignant tumors we would rather err on the side of false positives than false negatives, so we will be extremely conservative and report any prediction greater than 0.1 as malignant.

In [None]:
confusion_matrix = pd.crosstab(
    index=test_data.iloc[:, 0],
    columns=np.where(predictions > 0.1, 1, 0),
    rownames=["actual"],
    colnames=["predictions"],
)

sn.heatmap(confusion_matrix, annot=True)
plt.show()

Our confusion matrix is very interesting. We now have fewer correct predictions overall; however, for our problem domain this is a better result.

Of the 40 benign tumors, our model has correctly predicted 36 of them. The remaining 4 would be predicted as malignant. These are our false positives.

Of the 17 malignant tumors, our model has correctly predicted 16 of them. We have 1 false negative within this test set.
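
The effect of moving the cutoff can be summarized with sensitivity (the share of malignant cases that are caught) and specificity (the share of benign cases that are correctly cleared). The sketch below uses made-up labels and scores purely for illustration; `confusion_counts` is a hypothetical helper, not part of any library used in this notebook.

```python
def confusion_counts(actual, scores, cutoff):
    """Return (tp, fp, tn, fn) when scores above the cutoff are labelled malignant (1)."""
    tp = fp = tn = fn = 0
    for label, score in zip(actual, scores):
        predicted = 1 if score > cutoff else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


# Made-up labels and scores, for illustration only
actual = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.02, 0.05, 0.30, 0.08, 0.95, 0.40, 0.15, 0.99]

for cutoff in (0.5, 0.1):
    tp, fp, tn, fn = confusion_counts(actual, scores, cutoff)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    print(f"cutoff={cutoff}: sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```

In this toy data, lowering the cutoff removes the false negatives at the cost of one additional false positive, which is the same kind of trade-off seen in the confusion matrices above.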


---
<a id='cleanup'></a>

## (Optional) Clean-up

If you're finished with this predictor, please run the cell below.  This will remove the hosted endpoint you created and avoid any charges from a stray instance being left on.

In [None]:
# Clean up model endpoints
xgb_predictor.delete_model()
xgb_predictor.delete_endpoint(delete_endpoint_config=True)

---
<a id = "Extensions"></a>

## Extensions

<a id ='hyperparameter_optimization'></a>

### Hyperparameter Optimization (HPO)

Machine learning is great at finding parameters which define patterns in our data. However, there are many hyperparameters which govern how our machine learning algorithm will go about finding those parameters.

SageMaker's hyperparameter optimization can assist in finding the best set of hyperparameters. To learn more about how HPO works, refer to this [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-how-it-works.html).

> Note that our current results are excellent, so hyperparameter optimization is unlikely to improve these; however, we will run through the process of hyperparameter optimization as it could be very relevant to your own dataset




#### Configure HPO job
First we set the hyperparameters that we do not want SageMaker Hyperparameter optimization to experiment with. These are the static hyperparameters

In [None]:
static_hyperparameters = {"objective": "binary:logistic", "num_round": "100"}

We create an estimator that we will use for training and set the static hyperparameters for the training job

Again we will utilize spot instances to reduce costs

In [None]:
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

container = sagemaker.image_uris.retrieve("xgboost", boto3.Session().region_name, version="latest")

xgb_hpo = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    use_spot_instances=True,
    max_wait=900,
    max_run=600,
    output_path="s3://{}/{}/output".format(bucket, prefix),
    sagemaker_session=session,
)

xgb_hpo.set_hyperparameters(**static_hyperparameters)

Now we tell SageMaker which hyperparameters we want it to experiment with, the goal being to find the hyperparameter combination that yields the best performing model.

In [None]:
tuned_hyperparameter_ranges = {
    "eta": ContinuousParameter(0.1, 0.5),
    "min_child_weight": ContinuousParameter(1, 10),
    "alpha": ContinuousParameter(0, 2),
    "subsample": ContinuousParameter(0.5, 1),
    "max_depth": IntegerParameter(5, 10),
}
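
To get a feel for the search space defined by these ranges, the sketch below draws a few candidate combinations at random. This is plain random sampling used only for illustration; it is not the strategy SageMaker applies (the tuner defaults to a Bayesian search), and `sample_candidate` is a hypothetical helper.

```python
import random

# The same ranges as above, written as plain (low, high) tuples
continuous_ranges = {
    "eta": (0.1, 0.5),
    "min_child_weight": (1, 10),
    "alpha": (0, 2),
    "subsample": (0.5, 1),
}
integer_ranges = {"max_depth": (5, 10)}


def sample_candidate(rng):
    """Draw one hyperparameter combination uniformly from the ranges."""
    candidate = {
        name: rng.uniform(low, high) for name, (low, high) in continuous_ranges.items()
    }
    candidate.update(
        {name: rng.randint(low, high) for name, (low, high) in integer_ranges.items()}
    )
    return candidate


rng = random.Random(0)
candidates = [sample_candidate(rng) for _ in range(3)]
for candidate in candidates:
    print(candidate)
```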

Define the hyperparameter tuning job

In [None]:
tuner = HyperparameterTuner(
    xgb_hpo,
    objective_metric_name="validation:auc",
    objective_type="Maximize",
    hyperparameter_ranges=tuned_hyperparameter_ranges,
    max_jobs=20,
    max_parallel_jobs=3,
)

#### Run the hyperparameter tuning job

In [None]:
%%time
timestamp = time.strftime("-%Y%m%d%H%M", time.gmtime())

tuner.fit(
    {"train": s3_input_train, "validation": s3_input_validation},
    job_name="HPObreastCancer" + timestamp,
    include_cls_metadata=False,
)

### Check job name and status of HPO job
sage_client = boto3.Session().client("sagemaker")

hpo_job_name = tuner.latest_tuning_job.job_name
print(hpo_job_name)

sage_client.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName=hpo_job_name)[
    "HyperParameterTuningJobStatus"
]

### Analyze tuning job results - after tuning job is completed
Please refer to `HPO_Analyze_TuningJob_Results.ipynb` to see example code to analyze the tuning job results.

However, if the job status is `Completed` and you do not need to analyze the results of all the different hyperparameter combinations, the code below will return the best job name and the best combination of hyperparameters found.

In [None]:
tuning_job_result = sage_client.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=hpo_job_name
)
best_job = tuning_job_result.get("BestTrainingJob", None)

best_job_name = best_job.get("TrainingJobName", None)
tuned_hyperparams = best_job.get("TunedHyperParameters", None)
print("Best job had name:", best_job_name)
print("Best hyperparameter combination:", tuned_hyperparams)

### Create a hosted endpoint from the best job results

Locate the S3 path to the model artifact

In [None]:
info = sage_client.describe_training_job(TrainingJobName=best_job_name)
model_data = info["ModelArtifacts"]["S3ModelArtifacts"]
print(model_data)

Create a SageMaker model from the model artifact

In [None]:
primary_container = {"Image": container, "ModelDataUrl": model_data}

model_name = best_job_name + "-model"

create_model_response = sage_client.create_model(
    ModelName=model_name, ExecutionRoleArn=role, PrimaryContainer=primary_container
)

print(create_model_response["ModelArn"])

Create a configuration for a SageMaker hosted endpoint

In [None]:
endpoint_config_name = "HPO-XGBoostEndpointConfig-" + time.strftime(
    "%Y-%m-%d-%H-%M-%S", time.gmtime()
)
print(endpoint_config_name)
create_endpoint_config_response = sage_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": "ml.t2.medium",
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": model_name,
            "VariantName": "AllTraffic",
        }
    ],
)

print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

Create a SageMaker hosted endpoint using the configuration created above.
The loop below polls every 60 seconds while the endpoint status is `Creating`. Check that the final status printed is `InService` before continuing.

In [None]:
endpoint_name = "HPO-XGBoostEndpoint-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
print(endpoint_name)
create_endpoint_response = sage_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print(create_endpoint_response["EndpointArn"])

resp = sage_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sage_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)
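The loop above exits as soon as the status is no longer `Creating`, whether or not the endpoint actually came up, and it has no upper time limit. A minimal sketch of a more defensive polling helper follows. It is not part of the original notebook: the name `wait_for_status`, its parameters and the default timeout are all assumptions, and a canned sequence of statuses stands in for the real `describe_endpoint` call so the example runs without AWS access.

```python
import time
from typing import Callable, Tuple


def wait_for_status(
    get_status: Callable[[], str],
    target: str = "InService",
    pending: Tuple[str, ...] = ("Creating",),
    poll_seconds: float = 60.0,
    timeout_seconds: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until it returns `target`.

    Raises RuntimeError if a status outside `pending` appears (for example
    "Failed") and TimeoutError if the deadline passes first.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status == target:
            return status
        if status not in pending:
            raise RuntimeError(f"Unexpected status: {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still {status} after {timeout_seconds} seconds")
        sleep(poll_seconds)


# Stand-in for describe_endpoint: a canned sequence of statuses
fake_statuses = iter(["Creating", "Creating", "InService"])
final_status = wait_for_status(lambda: next(fake_statuses), poll_seconds=0)
print("Status: " + final_status)
```

With the client used in this notebook, `get_status` could be `lambda: sage_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]`. boto3 also provides an `endpoint_in_service` waiter for the SageMaker client, which may be a simpler option.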

Create a predictor, an object that simplifies making requests to the endpoint.

In [None]:
xgb_predictor_hpo = sagemaker.predictor.Predictor(
    endpoint_name,
    serializer=sagemaker.serializers.CSVSerializer(),
    deserializer=sagemaker.deserializers.StringDeserializer(),
)

As we did with our previous endpoint, we'll use a simple function to:
  - Loop over our test dataset
  - Extract the features for each sample
  - Retrieve the prediction for each sample by invoking the XGBoost endpoint
  - Collect the predictions and convert them from a Python list to a NumPy array

In [None]:
# Create a list to hold all of our predictions using the HPO model
predictions_hpo = []

# Loop through the matrix of our test data samples, pulling out the features for each sample and running inference
# Note: dtest[i:i+1, 1:] is a vector of all the features for sample i (without the first entry, which is the label)
for i in range(dtest.shape[0]):
    sample_features = dtest[i : i + 1, 1:]
    prediction = xgb_predictor_hpo.predict(sample_features)
    predictions_hpo.append(float(prediction))

# Convert our list of predictions to a numpy array
predictions_hpo = np.asarray(predictions_hpo)
display(predictions_hpo)
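Invoking the endpoint once per sample is simple but makes one network round trip per row. A possible alternative is to send several rows per request and split the response. The sketch below is not part of the original notebook. The helper names `chunk_rows` and `parse_predictions` are hypothetical, and the response format is an assumption: the separator between predictions may be a comma or a newline depending on the container version, so the parser accepts both. The rows and the response string are made-up values.

```python
import re
from typing import Iterator, List, Sequence


def chunk_rows(
    rows: Sequence[Sequence[float]], batch_size: int
) -> Iterator[Sequence[Sequence[float]]]:
    """Yield consecutive slices of at most `batch_size` rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start : start + batch_size]


def parse_predictions(payload: str) -> List[float]:
    """Split a response string on commas or whitespace and convert to floats."""
    return [float(tok) for tok in re.split(r"[,\s]+", payload.strip()) if tok]


# Made-up feature rows and response text for illustration only
rows = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
batch_sizes = [len(batch) for batch in chunk_rows(rows, batch_size=2)]
print(batch_sizes)

parsed = parse_predictions("0.91,0.05\n0.5\n")
print(parsed)
```

In this notebook, each batch would be a slice such as `dtest[start : start + batch_size, 1:]` passed to `xgb_predictor_hpo.predict`, with the returned string handed to `parse_predictions`.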

How well does our new model make predictions?

In [None]:
confusion_matrix = pd.crosstab(
    index=test_data.iloc[:, 0],
    columns=np.round(predictions_hpo, 0),
    rownames=["actual"],
    colnames=["predictions"],
)

sn.heatmap(confusion_matrix, annot=True)
plt.show()

We are getting the same results as we had with our initial model.
This is not unexpected, as our initial model already had excellent results.
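The heatmap shows raw counts. Summary metrics can make the comparison with the initial model easier to state. The following is a minimal sketch, not part of the original notebook, that uses made-up labels and predictions. The helper name `classification_metrics` is hypothetical, and it assumes that label `1` is the positive class.

```python
from typing import Dict, Sequence


def classification_metrics(
    actual: Sequence[int], predicted: Sequence[int]
) -> Dict[str, float]:
    """Compute accuracy, precision and recall for binary labels (1 = positive)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up labels and predictions for illustration only (not notebook results)
example_actual = [1, 1, 1, 0, 0, 0, 0, 1]
example_predicted = [1, 1, 0, 0, 0, 1, 0, 1]

metrics = classification_metrics(example_actual, example_predicted)
print(metrics)
```

In this notebook the inputs would be `test_data.iloc[:, 0]` and `np.round(predictions_hpo, 0)`.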

What happens if we shift the cutoff point for a positive result?

In [None]:
confusion_matrix = pd.crosstab(
    index=test_data.iloc[:, 0],
    columns=np.where(predictions_hpo > 0.1, 1, 0),
    rownames=["actual"],
    colnames=["predictions"],
)

sn.heatmap(confusion_matrix, annot=True)
plt.show()

This is no better than our result before hyperparameter optimization.
At this point we probably cannot improve the model any further by tuning hyperparameters, so the next place to look is the data.
Perhaps domain knowledge could suggest extra relevant features to include.
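Lowering the cutoff trades false negatives for false positives, and a single cutoff shows only one point on that trade-off. As a minimal sketch (not part of the original notebook), several cutoffs can be compared side by side. The helper name `sweep_cutoffs` is hypothetical, the labels and scores are made up, and the comparison uses `>` to match the `predictions_hpo > 0.1` rule used earlier.

```python
from typing import List, Sequence, Tuple


def sweep_cutoffs(
    actual: Sequence[int], scores: Sequence[float], cutoffs: Sequence[float]
) -> List[Tuple[float, int, int]]:
    """Return (cutoff, false_negatives, false_positives) for each cutoff."""
    results = []
    for cutoff in cutoffs:
        # A sample is predicted positive when its score exceeds the cutoff
        predicted = [1 if score > cutoff else 0 for score in scores]
        fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
        fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
        results.append((cutoff, fn, fp))
    return results


# Made-up labels and scores for illustration only (not notebook results)
example_labels = [1, 1, 1, 0, 0, 0]
example_scores = [0.95, 0.60, 0.20, 0.30, 0.05, 0.02]

sweep = sweep_cutoffs(example_labels, example_scores, [0.1, 0.5, 0.9])
for cutoff, fn, fp in sweep:
    print(f"cutoff={cutoff}: false negatives={fn}, false positives={fp}")
```

In this notebook the inputs would be `test_data.iloc[:, 0]` and `predictions_hpo`.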

---
<a id='cleanupfinal'></a>
### (Optional) Final Clean-up

If you're finished with this notebook, please run the cell below. This removes the model and hosted endpoint you created after the hyperparameter optimization stage, which avoids charges from a stray instance being left on, and deletes the local data files.

In [None]:
# Clean up model endpoints
xgb_predictor_hpo.delete_model()
xgb_predictor_hpo.delete_endpoint(delete_endpoint_config=True)

# Clean up files
!rm -rfv wdbc.data train.csv validation.csv