# Build preprocessing model and custom inference image

---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/inference|structured|realtime|byoc|byoc-nginx-python|featurizer|featurizer.ipynb)

---

We demonstrate building an ML application to predict the rings of Abalone.

After the model is hosted for inference, the payload is sent as a raw (untransformed) CSV string to a real-time endpoint.
The preprocessing container receives the raw payload first. Its "preprocessor" transforms the payload (feature engineering), and the container returns the transformed record (float values) as a CSV string.

The transformed record is then passed to the predictor container (hosting an XGBoost model). The predictor then converts the input data (from preprocessing container) into [XGBoost DMatrix](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.DMatrix) format, loads the model, calls `booster.predict(input_data)` and returns the predictions (Rings) in a JSON format.

In this notebook, we build the "preprocessor" model that transforms the raw CSV records using the following [scikit-learn](https://scikit-learn.org) transformers:
 - [`SimpleImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) for handling missing values,
 - [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) for normalizing numerical columns, and
 - [`OneHotEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) for transforming categorical columns.

After fitting the transformer, we save the fitted model to disk in [`joblib`](https://joblib.readthedocs.io/en/latest/persistence.html) format.

## Prerequisites

Upgrade the packages below to their latest versions.

In [None]:
!pip install -U awscli boto3 sagemaker watermark scikit-learn tqdm --quiet

%load_ext watermark
%watermark -p awscli,boto3,sagemaker,scikit-learn,tqdm

## Build preprocessor model

Here are the high-level transformation steps (feature engineering) we perform on the columns in the dataset:

- The target column is `rings`
- Convert all numeric columns to dtype `float64`
  - Handle missing values with a `SimpleImputer` using the `median` strategy
  - Scale numerical values using a `StandardScaler`
- One-hot encode the `sex` column
  - Handle missing values with a `SimpleImputer` using the `constant` strategy (fill value `"missing"`)
- Process the columns using [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn-compose-columntransformer)
- Serialize the fitted transformer to disk as `preprocess.joblib`
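
To make the arithmetic behind these steps concrete, here is a minimal pure-Python sketch of median imputation, standard scaling and one-hot encoding on a few made-up values. It is an illustration only, not scikit-learn's implementation; the helper names (`impute_median`, `standard_scale`, `one_hot`) and the sample values are assumptions introduced for this example.

```python
from statistics import mean, median, pstdev


def impute_median(values):
    """Replace None entries with the median of the known values."""
    known = [v for v in values if v is not None]
    fill = median(known)
    return [fill if v is None else v for v in values]


def standard_scale(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def one_hot(values, categories):
    """Encode each value as a 0/1 vector; unseen categories become all zeros."""
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]


# Made-up sample: one numeric column with a missing value, one categorical column.
lengths = [0.2, None, 0.4, 0.6]
imputed = impute_median(lengths)
scaled = standard_scale(imputed)
encoded = one_hot(["M", "F", "I", "M"], categories=["F", "I", "M"])

print(imputed)
print(scaled)
print(encoded)
```

After scaling, the column has zero mean and unit variance; a category that was not seen when the categories were defined is encoded as all zeros, which mirrors the intent of `handle_unknown="ignore"`.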

![](../images/byoc-featurizer.png)

## Download UCI Abalone dataset from S3

We download the UCI Abalone dataset from the SageMaker sample files bucket in S3.

In [None]:
import boto3
import os
from sagemaker import session, get_execution_role
from sagemaker.s3 import S3Downloader, S3Uploader, s3_path_join
from pathlib import Path

account_id = boto3.client("sts").get_caller_identity().get("Account")
sm_session = session.Session()
region = sm_session.boto_region_name
role = get_execution_role()
bucket = sm_session.default_bucket()
prefix = "sagemaker/abalone/models/byoc"

current_dir = os.getcwd()

abalone_s3uri = (
    f"s3://sagemaker-example-files-prod-{region}/datasets/tabular/uci_abalone/abalone.csv"
)

featurizer_image_name = "abalone/featurizer"

base_dir = Path("../data").resolve()
featurizer_model_dir = Path("./models").absolute()

if not base_dir.joinpath("abalone.csv").exists():
    S3Downloader.download(s3_uri=abalone_s3uri, local_path=base_dir, sagemaker_session=sm_session)

### Fitting the preprocessor model

First, we split the dataset into train and test sets and save the test data to the `abalone_test.csv` file in the data directory.

We use [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn-compose-columntransformer) to impute, scale and one-hot encode the columns, then fit the [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn-compose-columntransformer) on input data.

In [None]:
import os
import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

DATA_FILE = base_dir.joinpath("abalone.csv")

if not DATA_FILE.exists():
    raise ValueError(f"{DATA_FILE} doesn't exist")

if not featurizer_model_dir.exists():
    featurizer_model_dir.mkdir(parents=True)

# As we get a headerless CSV file, we specify the column names here.
feature_columns_names = [
    "sex",
    "length",
    "diameter",
    "height",
    "whole_weight",
    "shucked_weight",
    "viscera_weight",
    "shell_weight",
]
label_column = "rings"

feature_columns_dtype = {
    "sex": str,
    "length": np.float64,
    "diameter": np.float64,
    "height": np.float64,
    "whole_weight": np.float64,
    "shucked_weight": np.float64,
    "viscera_weight": np.float64,
    "shell_weight": np.float64,
}
label_column_dtype = {"rings": np.float64}


def merge_two_dicts(x, y):
    z = x.copy()
    z.update(y)
    return z


df = pd.read_csv(
    DATA_FILE,
    header=None,
    names=feature_columns_names + [label_column],
    dtype=merge_two_dicts(feature_columns_dtype, label_column_dtype),
)

print("Splitting raw dataset into train and test datasets...")
(df_train_val, df_test) = train_test_split(df, random_state=42, test_size=0.1)
df_test.to_csv(f"{base_dir}/abalone_test.csv", index=False)
print(f"Test dataset written to {base_dir}/abalone_test.csv")


numeric_features = list(feature_columns_names)
numeric_features.remove("sex")
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)

categorical_features = ["sex"]
categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

# Fit all transformers in the ColumnTransformer on the training split.
# Columns not listed in the transformers (such as the label) are dropped by default.
preprocessor = preprocess.fit(df_train_val)

# Save the fitted preprocessor model to disk
joblib.dump(preprocess, featurizer_model_dir.joinpath("preprocess.joblib"))
print(f"Saved preprocessor model to {featurizer_model_dir}")

## Build custom inference container with preprocessor

Now that our preprocessor model is ready, we build our own container to host it. This is also referred to as Bring-Your-Own-Container (BYOC) mode in Amazon SageMaker.

In BYOC mode, we bring our own serving stack. We use [nginx](https://nginx.org/), [gunicorn](https://gunicorn.org/#deployment) and a [flask](https://flask.palletsprojects.com/en/2.3.x/tutorial/factory/) app as our serving stack.

First, we create an inference script [`preprocessing.py`](./code/preprocessing.py) that implements the following:

1. Implement `/ping` and `/invocations` routes
1. Define functions to load the serialized model (`./models/preprocess.joblib`) from disk and to transform the received input with `model.transform()`
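
As a rough illustration of the two routes, the sketch below implements `/ping` and `/invocations` as a bare WSGI callable (the kind of object gunicorn serves) rather than as the Flask app used in `preprocessing.py`, so that it runs with the standard library alone. The names `make_app` and `call` and the placeholder transform are assumptions for this example; the real script calls `model.transform()` on the loaded `preprocess.joblib`.

```python
import io
from http import HTTPStatus


def make_app(transform):
    """Build a WSGI app exposing /ping and /invocations.

    `transform` is a hypothetical callable: CSV string in, CSV string out.
    """

    def app(environ, start_response):
        path = environ.get("PATH_INFO", "")
        method = environ.get("REQUEST_METHOD", "GET")

        if path == "/ping" and method == "GET":
            # Health check: an empty 200 response signals the container is up.
            status, body = HTTPStatus.OK, ""
        elif path == "/invocations" and method == "POST":
            length = int(environ.get("CONTENT_LENGTH") or 0)
            payload = environ["wsgi.input"].read(length).decode("utf-8")
            if environ.get("CONTENT_TYPE", "").startswith("text/csv"):
                status, body = HTTPStatus.OK, transform(payload)
            else:
                status, body = HTTPStatus.UNSUPPORTED_MEDIA_TYPE, "expected text/csv"
        else:
            status, body = HTTPStatus.NOT_FOUND, "not found"

        data = body.encode("utf-8")
        start_response(
            f"{status.value} {status.phrase}",
            [("Content-Type", "text/csv"), ("Content-Length", str(len(data)))],
        )
        return [data]

    return app


def call(app, method, path, body=b"", content_type=""):
    """Invoke the WSGI app in-process (no server) and return (status, body)."""
    captured = {}
    environ = {
        "REQUEST_METHOD": method,
        "PATH_INFO": path,
        "CONTENT_TYPE": content_type,
        "CONTENT_LENGTH": str(len(body)),
        "wsgi.input": io.BytesIO(body),
    }
    chunks = app(environ, lambda status, headers: captured.update(status=status))
    return captured["status"], b"".join(chunks).decode("utf-8")


# Placeholder transform: upper-cases the payload instead of calling a fitted model.
app = make_app(lambda csv_text: csv_text.upper())

print(call(app, "GET", "/ping"))
print(call(app, "POST", "/invocations", b"i,0.365,0.295", "text/csv"))
```

Passing the transform in as a callable keeps the routing logic separate from model loading, which makes the routes easy to exercise without a fitted model.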

In [None]:
!pygmentize code/preprocessing.py

Next, we create configuration files for:
- [Nginx](https://nginx.org/) (reverse proxy): [`nginx.conf`](code/nginx.conf)
- [Gunicorn](https://gunicorn.org/#deployment) (WSGI HTTP server): [`wsgi.py`](code/wsgi.py)
- [serve](code/serve) (Python script that launches the nginx and [`gunicorn`](https://gunicorn.org/#deployment) processes)

For convenience, we place all the above files along with the inference script [`preprocessing.py`](./code/preprocessing.py) under the `code/` directory.

Next, we build and test the custom image locally.

### Build and test docker image locally

In [None]:
!pygmentize Dockerfile

- Build the image from the [Dockerfile](./Dockerfile) using the command `docker build -t abalone/featurizer .`
- Next, we launch the inference container locally by mounting `./models` directory to `/opt/ml/model` in the container and exposing port `8080`
  ```docker
  docker run --rm -v $(pwd)/models:/opt/ml/model -p 8080:8080 abalone/featurizer
  ```
- We then test the running container by invoking the `/ping` and `/invocations` endpoints with some test data
- Finally, we push the local image to ECR using the [./build_n_push.sh](./build_n_push.sh) shell script

>NOTE: We set `abalone/featurizer` as the docker image name.

Sample commands are saved in [commands.txt](./commands.txt)

In [None]:
!docker build -t $featurizer_image_name .

**NOTE:** Do not run the cell below from the notebook. The container stays in serving mode, so the cell never returns and execution hangs.

Open a terminal, change into the `featurizer/` directory (where the `models/` folder exists) and run the following command.

In [None]:
# docker run --rm -v $(pwd)/models:/opt/ml/model -p 8080:8080 abalone/featurizer

# NOTE: If you run this from the notebook, the cell keeps running and blocks further execution

#### Invoke endpoint /ping and /invocations

Test the serving container by sending HTTP requests with curl to both the `/ping` and `/invocations` endpoints.

Uncomment the cells below to test inference on the local container.

In [None]:
# Health check by invoking  /ping
# !curl http://localhost:8080/ping

In [None]:
# Send test records
# !curl --data-raw 'I,0.365,0.295,0.095,0.25,0.1075,0.0545,0.08,9.0' \
# -H 'Content-Type: text/csv' \
# -v http://localhost:8080/invocations
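
The same two requests can also be built from Python with the standard library. The sketch below assumes the container from the `docker run` command above is listening on `localhost:8080`; the helper names (`build_ping_request`, `build_invocation_request`, `send`) are introduced here for illustration. Only the request objects are constructed when the block runs; the `send` calls are left commented out because they need a running container.

```python
import urllib.request

# Assumption: the local container is exposed on port 8080, as in the docker run command.
BASE_URL = "http://localhost:8080"


def build_ping_request(base_url=BASE_URL):
    """GET /ping: the health check."""
    return urllib.request.Request(f"{base_url}/ping", method="GET")


def build_invocation_request(csv_row, base_url=BASE_URL):
    """POST /invocations with a single raw CSV record as the body."""
    return urllib.request.Request(
        f"{base_url}/invocations",
        data=csv_row.encode("utf-8"),
        headers={"Content-Type": "text/csv"},
        method="POST",
    )


def send(request, timeout=5):
    """Send a request and return (HTTP status, decoded body)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8")


ping = build_ping_request()
invoke = build_invocation_request("I,0.365,0.295,0.095,0.25,0.1075,0.0545,0.08,9.0")
print(ping.get_method(), ping.full_url)
print(invoke.get_method(), invoke.full_url)

# Uncomment once the container is running:
# print(send(ping))
# print(send(invoke))
```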

#### (Optional) View container logs locally (using docker logs)

- To inspect a running container and view its config values or IP address, we use `docker inspect <CONTAINER_ID_OR_NAME>`
- To view and tail the logs generated in the container, we use `docker logs --follow --tail <NUM_OF_LINES> <CONTAINER_ID_OR_NAME>`
- SageMaker publishes container logs to CloudWatch. CloudWatch logs for a given endpoint are published to the following log stream path
`/aws/sagemaker/Endpoints/ENDPOINT_NAME/VARIANT_NAME/CONTAINER_NAME`

**NOTE:** 
1. Run this command in a terminal, as running it inside a cell would hang execution.
1. The command below assumes there is only one running container. If you have more, pass the container explicitly, for example `docker logs --follow --tail 50 <CONTAINER_ID_OR_NAME>`.

In [None]:
# RUN THE BELOW IN A TERMINAL
# docker ps --format "{{.Names}}" | xargs -n1 -I{} docker logs --follow --tail 50 {}

### Tag and push the local image to private ECR

- Tag the `abalone/featurizer` local image to `{account_id}.dkr.ecr.{region}.amazonaws.com/{imagename}:{tag}` format
- Run [./build_n_push.sh](./build_n_push.sh) shell script with image name `abalone/featurizer` as parameter
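
As a small illustration of the tag format above, the following sketch assembles the ECR image URI from its parts. The function name `ecr_image_uri` and the account ID in the example are placeholders, not values from this notebook.

```python
def ecr_image_uri(account_id, region, image_name, tag="latest"):
    """Compose an ECR image URI in the {account_id}.dkr.ecr.{region}.amazonaws.com/{imagename}:{tag} format."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{image_name}:{tag}"


# Placeholder account ID, for illustration only.
uri = ecr_image_uri("123456789012", "us-west-2", "abalone/featurizer")
print(uri)
```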


In [None]:
!pygmentize ./build_n_push.sh

In [None]:
# push image to private ECR
!chmod +x ./build_n_push.sh

!./build_n_push.sh $featurizer_image_name

### Test preprocessor inference image by deploying to a real-time endpoint

1. We first compress and upload the preprocessor model artifact to S3
2. Create a SageMaker [`Model`](https://sagemaker.readthedocs.io/en/stable/api/inference/model.html) object with `image_uri` pointing to the custom image in ECR and `model_data` pointing to the S3 location of the model artifact from step 1
3. Deploy the model to an Amazon SageMaker real-time endpoint

### Compress and upload model artifact to S3

The generated preprocessor model (under the `./models` directory) is compressed into `tar.gz` format and uploaded to S3.
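
For readers who prefer to avoid shelling out, the compression step can be sketched with Python's `tarfile` module. This is an alternative illustration, not the approach used in the cells of this notebook: the helper name `make_model_archive` is an assumption, the demo writes a placeholder file into a temporary directory instead of a real fitted model, and it archives only the listed files.

```python
import tarfile
import tempfile
from pathlib import Path


def make_model_archive(model_dir, files):
    """Create model.tar.gz in model_dir containing the named files at the archive root."""
    archive = Path(model_dir) / "model.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for name in files:
            # arcname keeps the member path relative, without the temp dir prefix.
            tar.add(Path(model_dir) / name, arcname=name)
    return archive


# Demo with a placeholder artifact in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp)
    (model_dir / "preprocess.joblib").write_bytes(b"placeholder")
    archive = make_model_archive(model_dir, ["preprocess.joblib"])
    with tarfile.open(archive, "r:gz") as tar:
        members = tar.getnames()

print(members)
```

Because the archive path is built from `model_dir`, this variant does not need to change the working directory.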

In [None]:
# os.chdir(current_dir)
os.getcwd()

In [None]:
import subprocess

os.chdir(featurizer_model_dir)

model_s3uri = s3_path_join(f"s3://{bucket}/{prefix}", "featurizer")

featurizer_model_path = featurizer_model_dir.joinpath("model.tar.gz")

if featurizer_model_path.exists():
    featurizer_model_path.unlink()

# SageMaker expects model artifacts to be compressed to `model.tar.gz`
tar_cmd = "tar -czvf model.tar.gz preprocess.joblib ../code/"
result = subprocess.run(tar_cmd, shell=True, capture_output=True)

# Return to the original working directory whether or not tar succeeded
os.chdir(current_dir)

if result.returncode == 0:
    print(f"{featurizer_model_path} archive created successfully!")
else:
    print("An error occurred:", result.stderr.decode())

In [None]:
# Upload compressed model artifact to S3 using S3Uploader utility class
model_data_url = S3Uploader.upload(
    local_path=featurizer_model_path.absolute(),
    desired_s3_uri=model_s3uri,
    sagemaker_session=sm_session,
)
print(f"Uploaded featurizer model.tar.gz to {model_data_url}")

### Deploy model to endpoint

We deploy the model to a real-time endpoint.

In [None]:
# Preprocessing Scikit-learn model
from datetime import datetime
from uuid import uuid4
from sagemaker.model import Model

suffix = f"{str(uuid4())[:5]}-{datetime.now().strftime('%d%b%Y')}"

featurizer_image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/{featurizer_image_name}:latest"

model_name = f"AbaloneXGB-featurizer-{suffix}"
print(f"Creating featurizer model: {model_name}")
sklearn_model = Model(
    image_uri=featurizer_image_uri,
    name=model_name,
    model_data=model_data_url,
    role=role,
    sagemaker_session=sm_session,
)

endpoint_name = f"AbaloneXGB-featurizer-ep-{suffix}"
print(f"Deploying model {model_name} to Endpoint: {endpoint_name}")

predictor = sklearn_model.deploy(
    endpoint_name=endpoint_name,
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)

### Send test payload as inference

- Use records from `./data/abalone_test.csv` as payload to "featurizer" model
- Save responses (transformations) to `./data/abalone_featurizer_predictions.csv`

In [None]:
import csv
import os
from time import sleep
from tqdm import tqdm
from sagemaker.predictor import Predictor
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import CSVDeserializer

raw_dataset = base_dir.joinpath("abalone_test.csv")
transformed_dataset = base_dir.joinpath("abalone_featurizer_predictions.csv")

# The request Content-Type and Accept headers ("text/csv") are derived from
# the serializer and deserializer passed to the Predictor.
predictor = Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sm_session,
    serializer=CSVSerializer(),
    deserializer=CSVDeserializer(),
)


# Send 10 records for inference
limit = 10
i = 0

# Remove file if exists
if transformed_dataset.exists():
    transformed_dataset.unlink()

transformations = []
with open(raw_dataset, "r") as _f:
    lines = _f.readlines()
    for row in lines:
        # Skip headers
        if i == 0:
            i += 1
        elif i <= limit:
            row = row.rstrip("\n")
            splits = row.split(",")
            # Remove the target column (last column)
            label = splits.pop(-1)
            input_cols = ",".join(splits)
            try:
                # print(input_cols)
                response = predictor.predict(input_cols)
                response = ",".join(map(str, response[0]))
                print(response)
                transformations.append(response)
                i += 1
                sleep(0.15)
            except Exception as e:
                print(f"Prediction error: {e}")

# The with-block closes the file automatically
with open(transformed_dataset, "w") as csvfile:
    for line in transformations:
        csvfile.write(f"{line}\n")

print(f"Saved transformed records to {transformed_dataset}")

### View Logs emitted by the endpoint in CloudWatch

Logs from the Amazon SageMaker real-time endpoint that are written to `stdout` and `stderr` streams are automatically streamed to Amazon CloudWatch.

We can verify the logs by reading them from the CloudWatch log stream for the endpoint

In [None]:
from datetime import timedelta, datetime, timezone

# Get endpoint logs from CloudWatch log stream (last 15 minutes)
logs_client = boto3.client("logs")
# Use a timezone-aware UTC time so that .timestamp() yields the correct epoch value
end_time = datetime.now(timezone.utc)
start_time = end_time - timedelta(minutes=15)

log_group_name = f"/aws/sagemaker/Endpoints/{endpoint_name}"
log_streams = logs_client.describe_log_streams(logGroupName=log_group_name)
log_stream_name = log_streams["logStreams"][0]["logStreamName"]

# Retrieve the logs
logs = logs_client.get_log_events(
    logGroupName=log_group_name,
    logStreamName=log_stream_name,
    startTime=int(start_time.timestamp() * 1000),
    endTime=int(end_time.timestamp() * 1000),
)

# Print the logs
for event in logs["events"]:
    print(f"{datetime.fromtimestamp(event['timestamp'] // 1000)}: {event['message']}")

### Cleanup

Finally, clean up resources by deleting the model and the endpoint.

In [None]:
# Delete model
try:
    print(f"Deleting model: {model_name}")
    predictor.delete_model()
except Exception as e:
    print(f"Error deleting model: {model_name}\n{e}")

# Delete endpoint
try:
    print(f"Deleting endpoint: {endpoint_name}")
    predictor.delete_endpoint()
except Exception as e:
    print(f"Error deleting endpoint: {endpoint_name}\n{e}")

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.


![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/inference|structured|realtime|byoc|byoc-nginx-python|featurizer|featurizer.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/inference|structured|realtime|byoc|byoc-nginx-python|featurizer|featurizer.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/inference|structured|realtime|byoc|byoc-nginx-python|featurizer|featurizer.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/inference|structured|realtime|byoc|byoc-nginx-python|featurizer|featurizer.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/inference|structured|realtime|byoc|byoc-nginx-python|featurizer|featurizer.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/inference|structured|realtime|byoc|byoc-nginx-python|featurizer|featurizer.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/inference|structured|realtime|byoc|byoc-nginx-python|featurizer|featurizer.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/inference|structured|realtime|byoc|byoc-nginx-python|featurizer|featurizer.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/inference|structured|realtime|byoc|byoc-nginx-python|featurizer|featurizer.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/inference|structured|realtime|byoc|byoc-nginx-python|featurizer|featurizer.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/inference|structured|realtime|byoc|byoc-nginx-python|featurizer|featurizer.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/inference|structured|realtime|byoc|byoc-nginx-python|featurizer|featurizer.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/inference|structured|realtime|byoc|byoc-nginx-python|featurizer|featurizer.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/inference|structured|realtime|byoc|byoc-nginx-python|featurizer|featurizer.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/inference|structured|realtime|byoc|byoc-nginx-python|featurizer|featurizer.ipynb)
