# HDBSCAN

This notebook is divided into two parts: _building_ the container and _using_ the container.

# Part 1: Packaging and Uploading a Custom Algorithm for Use with Amazon SageMaker

### How Amazon SageMaker runs the Docker container

Amazon SageMaker runs the container with the argument `train` or `serve`. 

* The Dockerfile does not define an `ENTRYPOINT`, so Docker runs the `train` command at training time and the `serve` command at serving time. 

#### Running the container during training

When Amazon SageMaker runs training, it invokes the `train` script, which reads from and writes to the following files under the `/opt/ml` directory:

    /opt/ml
    |-- input
    |   |-- config
    |   |   |-- hyperparameters.json
    |   |   `-- resourceConfig.json
    |   `-- data
    |       `-- <channel_name>
    |           `-- <input data>
    |-- model
    |   `-- <model files>
    `-- output
        `-- failure

##### The input

* `/opt/ml/input/config` contains information that controls how the program runs. `hyperparameters.json` contains the hyperparameters. `resourceConfig.json` describes the network layout used for distributed training. 
* `/opt/ml/input/data/<channel_name>/` (for File mode) contains the input data for that channel. The channels are created based on the call to `CreateTrainingJob`. The files for each channel are copied from S3 to this directory, preserving the tree structure indicated by the S3 key structure. 
* `/opt/ml/input/data/<channel_name>_<epoch_number>` (for Pipe mode) is the pipe for a given epoch. 
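
As a rough illustration of the File-mode layout above, a `train` program might read its configuration and locate its input files as follows. This is a minimal sketch, not the code shipped in this container: the function names and the `prefix` parameter are assumptions added so the code can be tried outside SageMaker.

```python
import json
from pathlib import Path

# Default SageMaker mount point; pass another prefix to try the code locally.
DEFAULT_PREFIX = Path("/opt/ml")


def load_hyperparameters(prefix=DEFAULT_PREFIX):
    """Return the hyperparameters as a dict, or {} if the file is absent."""
    path = Path(prefix) / "input" / "config" / "hyperparameters.json"
    if not path.exists():
        return {}
    with path.open() as f:
        return json.load(f)


def list_channel_files(channel, prefix=DEFAULT_PREFIX):
    """Return every file under a File-mode channel, sorted, including nested keys."""
    channel_dir = Path(prefix) / "input" / "data" / channel
    return sorted(p for p in channel_dir.rglob("*") if p.is_file())
```

Hyperparameter values arrive as strings, so a script would typically cast them, for example `int(load_hyperparameters().get("min_cluster_size", "5"))`, where `min_cluster_size` is only a hypothetical key. The training log later in this notebook shows the channel name `training`, so `list_channel_files("training")` would list the uploaded CSV files.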

##### The output

* `/opt/ml/model/` is where the custom algorithm writes the model it generates. The contents of this directory will be available at the S3 location returned in the `DescribeTrainingJob` result.
* `/opt/ml/output` is where the algorithm can write a file `failure` that describes why the job failed. The contents of this file will be returned in the `FailureReason` field of the `DescribeTrainingJob` result. 
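
The output conventions above could be handled with a small wrapper around the training routine. The sketch below is an assumption about how a `train` program might be organized, not this container's actual script: `run_training`, `train_fn`, and the exit code 255 are illustrative choices.

```python
import sys
import traceback
from pathlib import Path


def run_training(train_fn, prefix="/opt/ml"):
    """Run train_fn(model_dir); on error, record the reason in output/failure.

    Returns the exit code the `train` program should finish with.
    """
    prefix = Path(prefix)
    model_dir = prefix / "model"
    output_dir = prefix / "output"
    try:
        model_dir.mkdir(parents=True, exist_ok=True)
        train_fn(model_dir)  # train_fn is expected to write model files here
        return 0
    except Exception as exc:
        output_dir.mkdir(parents=True, exist_ok=True)
        # The contents of this file are what surfaces as FailureReason.
        (output_dir / "failure").write_text(
            f"Exception during training: {exc}\n{traceback.format_exc()}"
        )
        # Also report on stderr so the message appears in the job logs.
        print(f"Exception during training: {exc}", file=sys.stderr)
        return 255
```

A `train` script could end with `sys.exit(run_training(my_train_fn))`, where `my_train_fn` is a hypothetical function that fits the model and saves it into the directory it is given. Exiting with a non-zero code signals that the job did not succeed.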

#### Running the container during hosting

During hosting, the container responds to inference requests that arrive over HTTP. The following Python serving stack is used:

![Request serving stack](stack.png)

Amazon SageMaker uses two URLs in the container:

* `/ping` will receive `GET` requests from the infrastructure. The program returns 200 if the container is up and accepting requests.
* `/invocations` is the endpoint that receives client inference `POST` requests. The format of the request and the response is up to the algorithm. If the client supplied `ContentType` and `Accept` headers, these will be passed in as well. 
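
To make the two routes concrete, here is a minimal sketch of the request handling written as a plain WSGI callable using only the standard library. It is not the Flask app in `predictor.py`. The names `make_app`, `predict_fn`, and `is_healthy`, and the decision to accept only `text/csv`, are assumptions made for illustration.

```python
def make_app(predict_fn, is_healthy=lambda: True):
    """Build a WSGI app exposing GET /ping and POST /invocations."""

    def app(environ, start_response):
        path = environ.get("PATH_INFO", "")
        method = environ.get("REQUEST_METHOD", "GET")

        if path == "/ping" and method == "GET":
            # Health check: 200 when the container can serve requests.
            status = "200 OK" if is_healthy() else "404 Not Found"
            start_response(status, [("Content-Type", "application/json")])
            return [b"\n"]

        if path == "/invocations" and method == "POST":
            content_type = environ.get("CONTENT_TYPE", "")
            if not content_type.startswith("text/csv"):
                start_response(
                    "415 Unsupported Media Type", [("Content-Type", "text/plain")]
                )
                return [b"This sketch only accepts text/csv\n"]
            # Read exactly the request body and hand it to the predictor.
            length = int(environ.get("CONTENT_LENGTH") or 0)
            body = environ["wsgi.input"].read(length).decode("utf-8")
            result = predict_fn(body).encode("utf-8")
            start_response("200 OK", [("Content-Type", "text/csv")])
            return [result]

        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found\n"]

    return app
```

For a quick local check, the app can be served with `wsgiref.simple_server.make_server("", 8080, make_app(my_predict)).serve_forever()`, where `my_predict` is a hypothetical function that maps a CSV string to a string of predictions. In the real container, gunicorn behind nginx plays this role.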

The container will have the model files in the same place they were written during training:

    /opt/ml
    `-- model
        `-- <model files>
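
Because the model files sit in the same directory at serving time, the prediction code can load them once and reuse them across requests. The sketch below assumes the model was saved with `pickle` under the hypothetical name `model.pkl`. Neither detail comes from this notebook, so adjust both to match what the `train` script actually writes.

```python
import pickle
from pathlib import Path

# Cache of already-loaded models, keyed by file path.
_model_cache = {}


def get_model(model_dir="/opt/ml/model", filename="model.pkl"):
    """Load the pickled model on first use and reuse it afterwards."""
    path = Path(model_dir) / filename
    if path not in _model_cache:
        with path.open("rb") as f:
            _model_cache[path] = pickle.load(f)
    return _model_cache[path]
```

Loading lazily keeps container start-up fast and avoids re-reading the file on every request. Only unpickle files produced by your own training job, since `pickle.load` can execute arbitrary code.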



### Parts of the container

In the `container` directory are the components needed to package the custom algorithm for Amazon SageMaker:

    .
    |-- Dockerfile
    |-- build_and_push.sh
    `-- log_clustering
        |-- nginx.conf
        |-- predictor.py
        |-- serve
        |-- train
        `-- wsgi.py

* __`Dockerfile`__ describes how to build your Docker container image. More details below.
* __`build_and_push.sh`__ is a script that uses the Dockerfile to build the container image and then pushes it to ECR. 
* __`log_clustering`__ is the directory that contains the files that will be installed in the container.

The files installed in the container are:

* __`nginx.conf`__ is the configuration file for the nginx front-end. 
* __`predictor.py`__ is the program that implements the Flask web server and the custom algorithm predictions. 
* __`serve`__ is the program started when the container is started for hosting. It launches the gunicorn server which runs multiple instances of the Flask app defined in `predictor.py`. 
* __`train`__ is the program that is invoked when the container is run for training. 
* __`wsgi.py`__ is a small wrapper used to invoke the Flask app. 

In [1]:
!cat container/Dockerfile

# Build an image that can do training and inference in SageMaker
# This is a Python 3 image that uses the nginx, gunicorn, flask stack
# for serving inferences in a stable way.

FROM ubuntu:18.04

MAINTAINER Amazon AI <sage-learner@amazon.com>


RUN apt-get -y update && apt-get install -y --no-install-recommends \
    wget \
    python3-pip \
    python3-setuptools \
    nginx \
    ca-certificates \
    && rm -rf /var/lib/apt/lists/*

RUN apt-get update
RUN apt-get install -y python3-dev  
RUN apt-get install -y build-essential

RUN ln -s /usr/bin/python3 /usr/bin/python
RUN ln -s /usr/bin/pip3 /usr/bin/pip

# Here we get all python packages.
# pip leaves the install caches populated which uses
# a significant amount of space. These optimizations save a fair amount of space in the
# image, which reduces start up time.
RUN python -m pip install --upgrade pip
RUN pip --no-cache-dir install Cython
RUN pip --no-cache-dir install numpy 
RUN pip --no-cache-dir install pandas
RUN pip --no-ca

### Building and registering the container

The following shell code builds the container image using `docker build` and pushes the container image to ECR using `docker push`. 

The script looks for an ECR repository named after the algorithm in the current account and region. If the repository doesn't exist, the script creates it.
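
The full image name used by the script, and later by the estimator, follows a fixed pattern. The helpers below are a small sketch of that pattern. The function names are illustrative, and the `amazonaws.com` suffix is an assumption that holds for the standard commercial regions.

```python
def ecr_image_uri(account, region, repository, tag="latest"):
    """Compose the full ECR image name from its parts."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def ecr_registry(image_uri):
    """Return the registry host, i.e. everything before the first slash."""
    return image_uri.split("/", 1)[0]
```

For example, `ecr_image_uri("123456789012", "us-west-2", "hdb")` yields `123456789012.dkr.ecr.us-west-2.amazonaws.com/hdb:latest`, where the account ID is a placeholder.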

In [2]:
%%sh

# The name of our algorithm
algorithm_name=hdb

cd container

chmod +x hdbscan-cluster/train
chmod +x hdbscan-cluster/serve

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
region=${region:-us-west-2}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

# Get a login password from ECR and pipe it to docker login
aws ecr get-login-password --region ${region}|docker login --username AWS --password-stdin ${fullname}

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

docker build  -t ${algorithm_name} .
docker tag ${algorithm_name} ${fullname}

docker push ${fullname}

Login Succeeded
Sending build context to Docker daemon   34.3kB
Step 1/23 : FROM ubuntu:18.04
 ---> 5a214d77f5d7
Step 2/23 : MAINTAINER Amazon AI <sage-learner@amazon.com>
 ---> Using cache
 ---> 75ab4e79e103
Step 3/23 : RUN apt-get -y update && apt-get install -y --no-install-recommends     wget     python3-pip     python3-setuptools     nginx     ca-certificates     && rm -rf /var/lib/apt/lists/*
 ---> Using cache
 ---> c4b1bab7dc93
Step 4/23 : RUN apt-get update
 ---> Running in fa7fbd781b1a
Get:1 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic InRelease [242 kB]
Get:3 http://security.ubuntu.com/ubuntu bionic-security/universe amd64 Packages [1434 kB]
Get:4 http://security.ubuntu.com/ubuntu bionic-security/multiverse amd64 Packages [26.7 kB]
Get:5 http://security.ubuntu.com/ubuntu bionic-security/restricted amd64 Packages [633 kB]
Get:6 http://security.ubuntu.com/ubuntu bionic-security/main amd64 Packages [2400 kB]





# Part 2: Using the Custom Algorithm in Amazon SageMaker

## Set up the environment

The S3 prefix and the IAM role used for working with SageMaker are defined here.

In [3]:
# S3 prefix
prefix = "DEMO-hdbscan_cluster"

# Define IAM role
import boto3
import re

import os
import numpy as np
import pandas as pd
from sagemaker import get_execution_role

role = get_execution_role()

## Create the session

The session remembers the connection parameters for SageMaker and is used to perform all SageMaker operations.

In [4]:
import sagemaker as sage
from time import gmtime, strftime

sess = sage.Session()

## Upload the data for training

In [5]:
WORK_DIRECTORY = "data"

data_location = sess.upload_data(WORK_DIRECTORY, key_prefix=prefix)

## Create an estimator and fit the model

To train the custom algorithm with SageMaker, create an `Estimator` that defines how to use the container for training.

In [6]:
account = sess.boto_session.client("sts").get_caller_identity()["Account"]
region = sess.boto_session.region_name
image = "{}.dkr.ecr.{}.amazonaws.com/hdb:latest".format(account, region)

# Train with the custom image on a single ml.c4.2xlarge instance
hdbcluster = sage.estimator.Estimator(
    image,
    role,
    instance_count=1,
    instance_type="ml.c4.2xlarge",
    output_path="s3://{}/output".format(sess.default_bucket()),
    sagemaker_session=sess,
)

hdbcluster.fit(data_location)

2021-11-03 02:43:59 Starting - Starting the training job...
2021-11-03 02:44:24 Starting - Launching requested ML instancesProfilerReport-1635907439: InProgress
......
2021-11-03 02:45:25 Starting - Preparing the instances for training......
2021-11-03 02:46:26 Downloading - Downloading input data...
2021-11-03 02:46:45 Training - Downloading the training image......
2021-11-03 02:47:47 Training - Training image download completed. Training in progress.
Starting the training.
Loading /opt/ml/input/data/training/HDFS_100k.log_structured.csv
219 94
Total: 7940 instances, 313 anomaly, 7627 normal
Train: 5557 instances, 219 anomaly, 5338 normal
Test: 2383 instances, 94 anomaly, 2289 normal

Train data shape: 5557-by-16

Test data shape: 2383-by-16

  scores[i] = (max_lambda - lambda_) / max_lambda
prediction (col)     0   1
actual (row)              
0                 2287   2
1      

## Hosting trained model
The trained model is deployed behind an HTTP endpoint and used to get predictions. 

### Deploy the model

In [7]:
from sagemaker.predictor import csv_serializer

predictor = hdbcluster.deploy(1, "ml.m4.xlarge", serializer=csv_serializer)

--------!

### Choose some data and use it for a prediction

In [8]:
import pandas as pd

test_data = pd.read_csv("data/payload2.csv", header=None)
test_data.sample(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
3,-7.55862e-12,-2.51954e-12,0.041863,0.042629,0.042629,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,-7.55862e-12,-2.51954e-12,0.041863,0.042629,0.042629,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0,-7.55862e-12,-2.51954e-12,0.041863,0.042629,0.042629,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [9]:
test_data_np = test_data.to_numpy()

In [10]:
print(predictor.predict(test_data_np).decode("utf-8"))

The csv_serializer has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


0
0
1
0
1
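
The request and response bodies in this exchange are plain text: rows of comma-separated features go in, and one label per line comes back. The helpers below sketch that round trip with the standard library only. They are not the SageMaker SDK's serializer, and the function names are illustrative.

```python
import csv
import io


def rows_to_csv(rows):
    """Encode an iterable of rows as CSV text, one row per line."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def parse_labels(response_text):
    """Turn a newline-separated response into a list of integer labels."""
    return [int(line) for line in response_text.splitlines() if line.strip()]
```

Applied to the decoded response shown above, `parse_labels` would return `[0, 0, 1, 0, 1]`, which is easier to compare against known labels than the raw string.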



### Optional cleanup

In [None]:
sess.delete_endpoint(predictor.endpoint_name)