# SageMaker basics (all-in-one)
**Demonstrating SageMaker basics: training jobs, built-in algorithms (XGBoost), dataset balancing, endpoint deployment, batch transform, and Hyperparameter Optimization.**

---

## Contents

1. [Objective](#Objective)
1. [Background](#Background-(Problem-Description-and-Approach))
1. [Prepare our Environment](#Prepare-our-Environment)
1. [Download and Explore the Data](#Download-and-Explore-the-Data)
1. [Transform the Data](#Transform-the-Data)
1. [Understand the Algorithm](#Understand-the-Algorithm-(XGBoost))
1. [Upload the Input Data to S3](#Upload-the-Input-Data-to-S3)
1. [Train the Model](#Train-the-Model)
1. [Deploy and Evaluate the Model](#Deploy-and-Evaluate-the-Model)
1. [Hyperparameter Optimization (HPO)](#Hyperparameter-Optimization-(HPO))
1. [Conclusions](#Conclusions)
1. [Releasing Cloud Resources](#Releasing-Cloud-Resources)


---

## Objective

This workshop aims to give you an **example of using and tuning a SageMaker built-in algorithm**, focusing on the **data interfaces** and SageMaker's automatic **Hyperparameter Optimization** (HPO) capabilities.

Teaching in-depth data science approaches for tabular data is out of scope here, but we hope you can use this notebook as a starting point and adapt it to the needs of your future projects.

---

## Background (Problem Description and Approach)

- **Direct marketing**: contacting potential new customers via mail, email, phone call etc. 
- **Challenge**: A) too many potential customers. B) limited resources of the approacher (time, money etc.).
- **Problem**: Which potential customers have the highest chance of becoming actual customers (so that effort can be focused on them)?
- **Our setting**: A bank that wants to predict *whether a customer will enroll for a term deposit, after one or more phone calls*.
- **Our approach**: Build an ML model to make this prediction from readily available information (features), e.g. demographics, past interactions etc.
- **Our tools**: We will be using the **XGBoost** algorithm implementation by **Amazon SageMaker**, and using SageMaker **Hyperparameter Optimization (HPO)** to improve our model.


---

## Prepare our Environment

We'll need to:

- **import** some useful libraries (as in any Python notebook)
- **configure** the S3 bucket and folder where data should be stored (to keep our environment tidy)
- **connect** to AWS in general (with [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html)) and SageMaker in particular (with the [sagemaker SDK](https://sagemaker.readthedocs.io/en/stable/)), to use the cloud services

While `boto3` is the general AWS SDK for Python, `sagemaker` provides some powerful, higher-level interfaces designed specifically for ML workflows.


In [None]:
!pip install matplotlib

In [None]:
import numpy as np  # For matrix operations and numerical processing
import pandas as pd  # For munging tabular data
import time
import os
from util.classification_report import generate_classification_report  # helper function for classification reports

# setting up SageMaker parameters
import sagemaker
import boto3

boto_session = boto3.Session()
region = boto_session.region_name
sgmk_session = sagemaker.Session()  # one SageMaker session, reused throughout the notebook
bucket_name = sgmk_session.default_bucket()
bucket_prefix = "xgboost-example"  # Location in the bucket to store our files
sgmk_client = boto_session.client("sagemaker")
sgmk_role = sagemaker.get_execution_role()


In [None]:
print('Using role:', sgmk_role)
print('Using bucket:', bucket_name)

---

## Download and Explore the Data

Let's start by downloading the [direct marketing dataset](https://archive.ics.uci.edu/ml/datasets/bank+marketing) from UCI's ML Repository.

We can run shell commands from inside Jupyter using the `!` prefix:


In [None]:
!wget -P data/ -N https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip

import zipfile
with zipfile.ZipFile("data/bank-additional.zip", 'r') as zip_ref:
    print("Unzipping...")
    zip_ref.extractall("data")
print("Done")

Now let's read this into a Pandas data frame and take a look.


In [None]:
df_data = pd.read_csv("./data/bank-additional/bank-additional-full.csv", sep=";")

pd.set_option("display.max_columns", 500)  # Make sure we can see all of the columns
df_data.head()  # show part of the dataframe


_**Specifics on each of the features:**_

*Demographics:*
* `age`: Customer's age (numeric)
* `job`: Type of job (categorical: 'admin.', 'services', ...)
* `marital`: Marital status (categorical: 'married', 'single', ...)
* `education`: Level of education (categorical: 'basic.4y', 'high.school', ...)

*Past customer events:*
* `default`: Has credit in default? (categorical: 'no', 'unknown', ...)
* `housing`: Has housing loan? (categorical: 'no', 'yes', ...)
* `loan`: Has personal loan? (categorical: 'no', 'yes', ...)

*Past direct marketing contacts:*
* `contact`: Contact communication type (categorical: 'cellular', 'telephone', ...)
* `month`: Last contact month of year (categorical: 'may', 'nov', ...)
* `day_of_week`: Last contact day of the week (categorical: 'mon', 'fri', ...)
* `duration`: Last contact duration, in seconds (numeric). Important note: If duration = 0 then `y` = 'no'.
 
*Campaign information:*
* `campaign`: Number of contacts performed during this campaign and for this client (numeric, includes last contact)
* `pdays`: Number of days that passed by after the client was last contacted from a previous campaign (numeric)
* `previous`: Number of contacts performed before this campaign and for this client (numeric)
* `poutcome`: Outcome of the previous marketing campaign (categorical: 'nonexistent','success', ...)

*External environment factors:*
* `emp.var.rate`: Employment variation rate - quarterly indicator (numeric)
* `cons.price.idx`: Consumer price index - monthly indicator (numeric)
* `cons.conf.idx`: Consumer confidence index - monthly indicator (numeric)
* `euribor3m`: Euribor 3 month rate - daily indicator (numeric)
* `nr.employed`: Number of employees - quarterly indicator (numeric)

*Target variable* **(the one we want to eventually predict):**
* `y`: Has the client subscribed to a term deposit? (binary: 'yes','no')


---

## Transform the Data

Cleaning up data is part of nearly every ML project. Several common steps include:

* **Handling missing values**: In our case there are no missing values.
* **Handling weird/outlier values**: There are some values in the dataset that may require manipulation.
* **Converting categorical to numeric**: There are a lot of categorical variables in our dataset. We need to address this.
* **Oddly distributed data**: We will be using XGBoost, which is a non-linear method, and is minimally affected by the data distribution.
* **Remove unnecessary data**: There are lots of columns representing general economic features that may not be available during inference time.

To summarise, we need to A) address some weird values, B) convert the categorical variables to numeric and C) remove unnecessary data:


1. Many records have the value of "999" for `pdays`. It is very likely a 'magic' number representing that *no contact was made before*. Considering that, we will create a new column called "no_previous_contact" and set it to "1" when `pdays` is 999 and "0" otherwise.

2. In the `job` column, there is more than one category for people who don't work, e.g. "student", "retired", and "unemployed". It is very likely that the decision to enroll for a term deposit depends a lot on whether the customer is working or not. As such, we generate a new column to show whether the customer is working, based on the `job` column.

3. We will remove the economic features and `duration` from our data as they would need to be forecasted with high precision to be used as features during inference time.

4. We convert categorical variables to numeric using *one hot encoding*.


In [None]:
# Indicator variable to capture when pdays takes a value of 999
df_data["no_previous_contact"] = np.where(df_data["pdays"] == 999, 1, 0)

# Indicator for individuals not actively employed
df_data["not_working"] = np.where(df_data["job"].isin(["student", "retired", "unemployed"]), 1, 0)

# remove unnecessary data
df_model_data = df_data.drop(
    ["duration", 
    "emp.var.rate", 
    "cons.price.idx", 
    "cons.conf.idx", 
    "euribor3m", 
    "nr.employed"], 
    axis=1,
)

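# dtype=int keeps the indicators as 0/1 (newer pandas versions default to True/False, which would end up in the CSV files)
df_model_data = pd.get_dummies(df_model_data, dtype=int)  # Convert categorical variables to sets of indicators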

# Replace "y_no" and "y_yes" with a single label column, and bring it to the front:
df_model_data = pd.concat(
    [
        df_model_data["y_yes"].rename("y"),
        df_model_data.drop(["y_no", "y_yes"], axis=1),
    ],
    axis=1,
)

df_model_data.head()  # show part of the new transformed dataframe (which will be used for training)


---

## Understand the Algorithm (XGBoost)

We'll be using SageMaker's [built-in **XGBoost Algorithm**](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html): Benefiting from performance-optimized, pre-implemented functionality like multi-instance parallelization, and support for multiple input formats.

In general to use the pre-built algorithms, we'll need to:

- Refer to the [Common Parameters docs](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html) to see the **high-level configuration** and what features each algorithm has
- Refer to the [algorithm docs](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) to understand the **detail** of the **data formats** and **(hyper)-parameters** it supports

From these docs, we'll understand what data format we need to upload to S3 (next), and how to get the container image URI of the algorithm, which is listed on the Common Parameters page but can also be retrieved through the SDK with `sagemaker.image_uris.retrieve()` (as we do in the training step).


---

## Upload the Input Data to S3

We know from [the algorithm docs](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html#InputOutput-XGBoost) that SageMaker XGBoost expects data in the **libSVM** or **CSV** formats, with:

- The target variable in the first column, and
- No header row

...So before initializing training, we will:

1. Shuffle and split the data into **Training (70%)**, **Validation (20%)**, and **Test (10%)** sets
2. Save the data in the format the algorithm expects (e.g. CSV)
3. Upload the data to S3
4. Define the training job input "channels" with explicit CSV content type tagging, via the SageMaker SDK [TrainingInput](https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html#sagemaker.inputs.TrainingInput) class

The Training and Validation datasets will be used during the training (and tuning) phase, while the 'holdout' Test set will be used afterwards to evaluate the model.


In [None]:
# Shuffle and splitting dataset
train_data, validation_data, test_data = np.split(
    df_model_data.sample(frac=1, random_state=1729), 
    [int(0.7 * len(df_model_data)), int(0.9*len(df_model_data))],
) 

# Create CSV files for Train / Validation / Test
train_data.to_csv("data/train.csv", index=False, header=False)
validation_data.to_csv("data/validation.csv", index=False, header=False)
test_data.to_csv("data/test.csv", index=False, header=False)


In [None]:
# Upload CSV files to S3 for SageMaker training
train_uri = sgmk_session.upload_data(
    path="data/train.csv",
    bucket=bucket_name,
    key_prefix=bucket_prefix,
)
val_uri = sgmk_session.upload_data(
    path="data/validation.csv",
    bucket=bucket_name,
    key_prefix=bucket_prefix,
)


# Define the data input channels for the training job:
s3_input_train = sagemaker.inputs.TrainingInput(train_uri, content_type="csv")
s3_input_validation = sagemaker.inputs.TrainingInput(val_uri, content_type="csv")

print(f"{s3_input_train.config}\n\n{s3_input_validation.config}")

---

## Train the Model

Training a model on SageMaker follows steps similar to those of other ML libraries (e.g. SciKit-Learn):
1. Initiate a session (we did this up top).
2. Instantiate an estimator object for our algorithm (XGBoost).
3. Define its hyperparameters.
4. Start the training job.


#### Estimating dataset imbalance

In [None]:
positive_examples = np.count_nonzero(train_data["y"].values==1)
negative_examples = np.count_nonzero(train_data["y"].values==0)
total_examples = positive_examples + negative_examples
ratio_neg_pos = negative_examples/positive_examples
ratio_pos_neg = positive_examples/negative_examples

print('y=0: ', negative_examples, '(', round((negative_examples*100)/total_examples,2), '%)')
print('y=1: ', positive_examples, '(', round((positive_examples*100)/total_examples,2), '%)')
print('positive/negative examples ratio:', ratio_pos_neg)
print('negative/positive examples ratio:', ratio_neg_pos)

The negative examples (customers who did not opt for the term deposit) are almost 8 times as numerous as the positive examples (customers who opted for the term deposit). This is a considerable dataset imbalance. If we don't address it, our classifier will be biased towards predicting new inputs as negative, because negatives are the majority by far. We will address this imbalance by taking into consideration the **negative/positive examples ratio** when we configure our SageMaker estimator.\
This can be done with the `scale_pos_weight` hyperparameter of XGBoost, which **controls the balance of positive and negative weights**. You can find more information [here](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html). A typical value to consider: sum(negative cases) / sum(positive cases). In our case, we will set `scale_pos_weight` to the negative/positive examples ratio, to counteract this imbalance.

#### A small competition!
SageMaker's XGBoost includes 38 parameters. You can find more information about them [here](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html).
For simplicity, we choose to experiment only with a few of them.

**Please select values for the 4 hyperparameters (by editing the values in the cell below) based on the provided ranges.** Later we will see which model performed best and compare it with the one from the Hyperparameter Optimization step.

In [None]:
# specify algorithm container
training_image = sagemaker.image_uris.retrieve("xgboost", region=region, version="1.0-1")
print(training_image)

# Instantiate an XGBoost estimator object
estimator = sagemaker.estimator.Estimator(
    image_uri=training_image,      # XGBoost algorithm container
    instance_type="ml.m5.xlarge",  # type of training instance
    instance_count=1,              # number of instances to be used
    role=sgmk_role,                # IAM role to be used
    max_run=20*60,                 # Maximum allowed active runtime
    use_spot_instances=True,       # Use spot instances to reduce cost
    max_wait=30*60,                # Maximum clock time (including spot delays)
)

# scale_pos_weight controls the balance of positive and negative weights. 
# It's useful for unbalanced classes. A typical value to consider: sum(negative cases) / sum(positive cases). 

# define its hyperparameters
estimator.set_hyperparameters(
    num_round=150,  # int: [1,300]
    max_depth=5,    # int: [1,10]
    alpha=2,        # float: [0,5]
    eta=0.2,        # float: [0,1]
    objective="binary:logistic",
    scale_pos_weight=ratio_neg_pos,  # set the balance between positive and negative classes, to undo the imbalance
)

# start a training (fitting) job
estimator.fit({ "train": s3_input_train, "validation": s3_input_validation })


---

## Deploy and Evaluate the Model

### Deployment

Now that we've trained the xgboost algorithm on our data, deploying the model (hosting it behind a real-time endpoint) is just one function call!

This deployment might take **up to 10 minutes**, and by default the code will wait for the deployment to complete.

If you like, you can instead:

- Un-comment the `wait=False` parameter
- Use the [Endpoints page of the SageMaker Console](https://console.aws.amazon.com/sagemaker/home?#/endpoints) to check the status of the deployment
- Skip over the *Evaluation* section below (which won't run until the deployment is complete), and start the Hyperparameter Optimization job - which will take a while to run too, so can be started in parallel


In [None]:
# Real-time endpoint:
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.t2.xlarge",
    # wait=False,  # Remember, predictor.predict() won't work until deployment finishes!
)


### Evaluation

Since SageMaker is a general purpose ML platform and our endpoint is a web service, we'll need to be explicit that we're sending in tabular data (_serialized_ in CSV string format for the HTTPS request) and expect a tabular response (to be _deserialized_ from CSV to numpy).

In the SageMaker SDK (from v2), this packing and unpacking of the payload for the web endpoint is handled by [serializer classes](https://sagemaker.readthedocs.io/en/stable/api/inference/serializers.html) and [deserializer classes](https://sagemaker.readthedocs.io/en/stable/api/inference/deserializers.html).

Unfortunately the pre-built `CSVDeserializer` produces nested Python lists of strings, rather than a numpy array of numbers - so rather than bothering to implement a custom class (like the examples [here](https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/deserializers.py)) we'll be lazy and take this as a post-processing step.

With this setup ready, requesting inferences is as easy as calling `predictor.predict()`:

In [None]:
predictor.serializer = sagemaker.serializers.CSVSerializer()
predictor.deserializer = sagemaker.deserializers.CSVDeserializer()

In [None]:
X_test_numpy = test_data.drop(["y"], axis=1).values

predictions = np.array(predictor.predict(X_test_numpy), dtype=float).squeeze()
predictions

Each number in this vector is the **predicted probability** (in the interval [0,1]) of a potential customer enrolling for a term deposit.

- Close to 0: The person is **unlikely** to enroll
- Close to 1: The person is **likely** to enroll (making them a good candidate for direct marketing)
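
To turn these probabilities into yes/no decisions, a decision threshold is applied (the report further down uses 0.5). The sketch below illustrates the idea on made-up values. The helper `to_labels` and the choice of `>=` at the boundary are assumptions for illustration, not necessarily what the report helper does internally.

```python
def to_labels(probabilities, threshold=0.5):
    """Map each predicted probability to 1 (enroll) or 0 (not enroll)."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical endpoint outputs, not real predictions
example_probs = [0.03, 0.48, 0.50, 0.91]
example_labels = to_labels(example_probs)
print(example_labels)
```

Raising the threshold makes the model more conservative about flagging customers as likely to enroll, which trades recall for precision.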

If we like, we could stitch these predictions back on to the original dataframe to explore performance:


In [None]:
test_results = pd.concat(
    [
        pd.Series(predictions, name="y_pred", index=test_data.index),
        test_data,
    ],
    axis=1
)
test_results.head()


...Or use this function we provided to generate a more **comprehensive model report**:


In [None]:
%matplotlib inline

In [None]:
generate_classification_report(
    y_actual=test_data["y"].values, 
    y_predict_proba=predictions, 
    decision_threshold=0.5,
    class_names_list=["Did not enroll","Enrolled"],
    model_info="XGBoost SageMaker inbuilt",
)


---

## Handling large data

Assume you have a large dataset that you would like to use for predictions. Let's create this large dataset by concatenating the test set many times.

In [None]:
large_data = pd.concat([test_data for i in range(100)])  # concatenate the test set 100 times to form a larger dataset
large_data = large_data.drop(["y"], axis=1)  # discard the ground truth (for inference later)
large_data.to_csv("data/large.csv", index=False, header=False)  # save to a csv without headers

large_uri = sgmk_session.upload_data(  # upload it to S3
    path="data/large.csv",
    bucket=bucket_name,
    key_prefix=bucket_prefix,
)


In [None]:
X_large_numpy = large_data.values
predictions = np.array(predictor.predict(X_large_numpy), dtype=float).squeeze()  # this will create an error. It is intended to do so. 

# Check the following cell for an explanation on why this error occurs and how you can address it.


What just happened? Why did we get this error? 

- **Short answer**: we sent too much data to our endpoint. 

- **Longer answer**: Payloads for invoking SageMaker endpoints are limited to 5MB. We sent 100 times the size of our testing set (412K rows!), which exceeded this. The endpoint couldn't handle a payload greater than 5MB.

If we limit the size of data to less than 5MB, then our endpoint should be able to handle it.

In [None]:
X_large_numpy = large_data.iloc[:10,:].values  # send only the first 10 rows, not the entire large dataset
predictions = np.array(predictor.predict(X_large_numpy), dtype=float).squeeze() 
predictions

This time our endpoint was able to handle the smaller size of data.

This makes sense because endpoints are designed for **online (real-time) inference**. This essentially means they are optimized for **continuous arrival of small data packages**, not big chunks of data. 
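
If a moderately large dataset still has to go through a real-time endpoint, one option is to group rows into batches that each stay under the payload limit and send them one request at a time. The sketch below shows only the batching logic on toy data. The helper name `batch_rows_by_size` and the tiny `max_bytes` value are hypothetical, chosen so the effect is visible.

```python
def batch_rows_by_size(rows, max_bytes):
    """Yield lists of CSV lines whose newline-joined UTF-8 size stays within max_bytes."""
    batch, batch_size = [], 0
    for row in rows:
        line = ",".join(str(value) for value in row)
        line_size = len(line.encode("utf-8")) + 1  # +1 for the newline separator
        if line_size > max_bytes:
            raise ValueError("a single row exceeds the payload limit")
        if batch and batch_size + line_size > max_bytes:
            yield batch  # current batch is full, start a new one
            batch, batch_size = [], 0
        batch.append(line)
        batch_size += line_size
    if batch:
        yield batch  # emit the final, possibly smaller batch


# Hypothetical toy rows and an artificially small limit (14 bytes)
toy_rows = [[1, 0, 35], [0, 1, 52], [1, 1, 41], [0, 0, 29]]
toy_batches = list(batch_rows_by_size(toy_rows, max_bytes=14))
print(toy_batches)
```

In this notebook, each batch could then be joined with `"\n".join(batch)` and passed to `predictor.predict()`, with `max_bytes` set a little below the real payload limit. Treat that as an assumption to verify against your SDK version.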

If you want to run a one-time offline inference on a large dataset, then using Batch Transform makes more sense.

---

## Batch Transform

Batch transform is optimized for **offline bulk inference of many predictions**. It is better suited for **periodic arrival** of big chunks of data (as opposed to continuous arrival of small data packages in endpoints). You can think of it as a **transient computing cluster** for a one-time inference. Once inference is finished, the infrastructure is decommissioned, as opposed to endpoints which remain in service until you take them down. 

Let's execute a one-time inference on the large dataset using Batch Transform.

In [None]:
transformer = estimator.transformer(
    instance_count=1, 
    instance_type='ml.m5.xlarge',
    strategy='MultiRecord',  # or 'SingleRecord' (one line per request). 'MultiRecord' sends as many lines as fit within max_payload
    max_payload=6,  # the max payload size in MB. 6 is the default.
    assemble_with='Line',
    output_path=f"s3://{bucket_name}/{bucket_prefix}/",
)


Running a Batch Transform may take a few minutes, depending on the dataset, the complexity of the model and the number of instances you use. 

More details on the arguments can be found here. 
- https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html
- https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html?highlight=transformer#sagemaker.estimator.Estimator.transformer




In [None]:
transformer.transform(
    large_uri, 
    content_type='text/csv', 
    split_type='Line'
)

transformer.wait()

Let's download and open the output of the Batch Transform job:

In [None]:
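sgmk_session.download_data(
    path='data/',  # local destination folder
    bucket=bucket_name,
    key_prefix=f'{bucket_prefix}/large.csv.out',  # source object in S3
)

# The output file has no header row, so don't let pandas treat the first prediction as one
batch_out = pd.read_csv('data/large.csv.out', header=None)
batch_out.head(10)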

---

## Hyperparameter Optimization (HPO)
*Note, with the default settings below, the hyperparameter tuning job can take up to ~20 minutes to complete.*

We will use SageMaker Hyperparameter Optimization (HPO) to automate the search process. Specifically, we **specify a range**, or a list of possible values in the case of categorical hyperparameters, for each of the hyperparameters that we plan to tune.

SageMaker hyperparameter tuning will automatically launch **multiple training jobs** with different hyperparameter settings, evaluate results of those training jobs based on a predefined "objective metric", and select the hyperparameter settings for future attempts based on previous results. For each hyperparameter tuning job, we will specify the maximum number of HPO tries (`max_jobs`) and how many of these can happen in parallel (`max_parallel_jobs`).

Tip: `max_parallel_jobs` creates a **trade-off between performance and speed** (better hyperparameter values vs how long it takes to find these values). If `max_parallel_jobs` is large, then HPO is faster, but the discovered values may not be optimal. Smaller `max_parallel_jobs` will increase the chance of finding optimal values, but HPO will take more time to finish.
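
To make this trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes, as a simplification, that jobs run in lockstep batches, so the number of batches approximates how many times the search can learn from completed results before choosing new values. The helper name `feedback_rounds` is hypothetical.

```python
import math


def feedback_rounds(max_jobs, max_parallel_jobs):
    """Approximate number of sequential batches, assuming lockstep execution."""
    if max_jobs < 1 or max_parallel_jobs < 1:
        raise ValueError("both arguments must be positive integers")
    return math.ceil(max_jobs / max_parallel_jobs)


rounds_parallel = feedback_rounds(max_jobs=20, max_parallel_jobs=5)
rounds_serial = feedback_rounds(max_jobs=20, max_parallel_jobs=1)
print(rounds_parallel, rounds_serial)
```

With 20 jobs and 5 in parallel, the search gets roughly 4 rounds of feedback instead of 20, which is why fully sequential tuning tends to find better values at the cost of wall-clock time.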

Next we'll specify the objective metric that we'd like to tune and its definition, which includes the regular expression (Regex) needed to extract that metric from the CloudWatch logs of the training job. Since we are using the built-in XGBoost algorithm here, it emits predefined metrics, including **validation:auc** and **train:auc**, and we elected to monitor *validation:auc* as you can see below. In this case (because it's pre-built for us), we only need to specify the metric name.
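
For intuition about what such a metric regex does (it is only needed for custom algorithms, not here), the sketch below pulls a value out of a log line. Both the log line and the pattern are hypothetical examples, not the actual format or definition used by the built-in algorithm.

```python
import re

# Hypothetical training log line and metric pattern, for illustration only
log_line = "[42] train-auc:0.8123 validation-auc:0.7741"
metric_regex = r"validation-auc:([0-9.]+)"

# The first capture group holds the numeric value of the metric
match = re.search(metric_regex, log_line)
validation_auc = float(match.group(1)) if match else None
print(validation_auc)
```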

For more information on SageMaker HPO, please refer to the [SDK documentation](https://sagemaker.readthedocs.io/en/stable/tuner.html).

In [None]:
# import required HPO objects
from sagemaker.tuner import IntegerParameter, ContinuousParameter, HyperparameterTuner

# set up hyperparameter ranges
ranges = {
    "num_round": IntegerParameter(1, 300),
    "max_depth": IntegerParameter(1, 10),
    "alpha": ContinuousParameter(0, 5),
    "eta": ContinuousParameter(0, 1),
}

# set up the objective metric
objective = "validation:auc"

# instantiate a HPO object
tuner = HyperparameterTuner(
    estimator=estimator,              # the SageMaker estimator object
    hyperparameter_ranges=ranges,     # the range of hyperparameters
    max_jobs=20,                      # total number of HPO jobs
    max_parallel_jobs=5,              # how many HPO jobs can run in parallel
    strategy="Bayesian",              # the internal optimization strategy of HPO
    objective_metric_name=objective,  # the objective metric to be used for HPO
    objective_type="Maximize",        # maximize or minimize the objective metric
)  


### Launch HPO
Now we can launch a hyperparameter tuning job by calling the `fit()` function. After the hyperparameter tuning job is created, we can go to the SageMaker console to track its progress until it is completed.

In [None]:
# start HPO
tuner.fit({ "train": s3_input_train, "validation": s3_input_validation })


HPO jobs often take quite a long time to finish and as such, sometimes you may want to free up the notebook and then resume the wait later.

Just like the Estimator, we won't be able to `deploy()` the model until the HPO tuning job is complete. The status is visible through both the [AWS Console](https://console.aws.amazon.com/sagemaker/home?#/hyper-tuning-jobs) and the [SageMaker API](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeHyperParameterTuningJob.html), so one option is to poll the job status until it reaches a terminal state.
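
A minimal polling sketch is shown here. To keep it runnable anywhere, the status lookup is passed in as a function and replaced by a stand-in. The helper name `wait_for_tuning_job`, its parameters, and the set of terminal statuses are assumptions for illustration.

```python
import time

# Statuses treated as final in this sketch (assumed subset of the API's values)
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_tuning_job(get_status, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Call get_status() repeatedly until a terminal status is returned."""
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)  # wait before asking again
    raise TimeoutError("tuning job did not reach a terminal status in time")


# Stand-in for the real API call, so the sketch runs without AWS access
_fake_statuses = iter(["InProgress", "InProgress", "Completed"])
final_status = wait_for_tuning_job(lambda: next(_fake_statuses), poll_seconds=0)
print(final_status)
```

In this notebook, `get_status` could wrap `sgmk_client.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName=...)` and return the `HyperParameterTuningJobStatus` field of the response, assuming the tuning job name is taken from the `tuner` object.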

### Deploy and test optimized model
Deploying the best model is another simple `.deploy()` call:

In [None]:
# deploy the best model from HPO
hpo_predictor = tuner.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    serializer=sagemaker.serializers.CSVSerializer(),
    deserializer=sagemaker.deserializers.CSVDeserializer(),
)


Once deployed, we can now evaluate the performance of the best model.

In [None]:
# getting the predicted probabilities of the best model
hpo_predictions = np.array(hpo_predictor.predict(X_test_numpy), dtype=float).squeeze()
print(hpo_predictions)

# generate report for the best model
generate_classification_report(
    y_actual=test_data["y"].values, 
    y_predict_proba=hpo_predictions, 
    decision_threshold=0.5,
    class_names_list=["Did not enroll","Enrolled"],
    model_info="Best model (with HPO)",
)


---

## Conclusions

In our run, the optimized HPO model exhibited an AUC of ~0.774, which was higher than our first-guess parameter combination!

Depending on the number of tries, HPO can find a better performing model faster, compared to simply trying different hyperparameters by trial and error or grid search. You can learn more in-depth details about SageMaker HPO [here](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-how-it-works.html).

SageMaker built-in algorithms are great for getting a first model fast, and combining them with SageMaker HPO can really boost their accuracy.

As mentioned earlier, the best way to succeed with a built-in algorithm is to **read the [algorithm's doc pages](https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html) carefully** - to understand what data format and parameters it needs!


---

## Releasing Cloud Resources

It's generally good practice to delete all endpoints that are not in use.  

Run the following cell to delete the 2 endpoints that were created earlier. 


In [None]:
predictor.delete_endpoint(delete_endpoint_config=True)
hpo_predictor.delete_endpoint(delete_endpoint_config=True)
