This notebook screens whether the execution role can run the full workflow: *train* -> *register model* -> *batch transform* -> *delete model*.

Run this notebook in one go, without restarting the kernel. The training and batch-transform jobs are short by design (about 5 minutes).

If you must restart the kernel after the training job, set the `train_job_name` variable manually before running any cell that follows the training section.
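
If the job name is no longer at hand after a restart, one option is to ask SageMaker for the most recently created training job. The sketch below is not part of the original notebook: `latest_training_job_name` is a hypothetical helper, and the `kmeans` name filter assumes the job kept the SDK's default naming, so adjust it if your job was named differently.

```python
def latest_training_job_name(sm_client, name_contains="kmeans"):
    """Return the name of the newest matching training job, or None.

    `sm_client` is a boto3 SageMaker client, e.g. `sess.sagemaker_client`.
    The default `name_contains` value is an assumption about the job name.
    """
    resp = sm_client.list_training_jobs(
        NameContains=name_contains,
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )
    summaries = resp.get("TrainingJobSummaries", [])
    return summaries[0]["TrainingJobName"] if summaries else None
```

In the notebook this could be used as `train_job_name = latest_training_job_name(sess.sagemaker_client)`. Check the returned name before relying on it, since other jobs in the account may match the same filter.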

In [None]:
%matplotlib inline
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = 'retina'

import pandas as pd
import numpy as np
import sagemaker as sm
from sagemaker import KMeans, KMeansModel

import smconfig

# Configuration of this screening test.
sess = sm.Session()
sm_kwargs = smconfig.SmKwargs(sm.get_execution_role())
s3_input_path = f's3://{smconfig.s3_bucket}/kmeans-input'
s3_sagemaker_path = f's3://{smconfig.s3_bucket}/sagemaker'

# Enforce blocking API to screen whether this role has permissions to
# Describe{Training,Transform}Job.
block_notebook_while_training = True

# Set env vars for ! or %%.
%set_env BUCKET=$smconfig.s3_bucket
%set_env S3_INPUT_PATH=$s3_input_path
%set_env S3_SAGEMAKER_PATH=$s3_sagemaker_path

# Training job

In [None]:
estimator = KMeans(
    k=2,
    epochs=10,
    data_location=s3_input_path,
    output_path=s3_sagemaker_path,
    instance_count=1,
    instance_type='ml.m5.large',
    sagemaker_session=sess,
    **sm_kwargs.train,
)

# Generate synthetic input data as protobuf files on S3.
# NOTE: SageMaker K-Means algo REQUIRES float32.
train_input = estimator.record_set(
    np.random.rand(100, 8).astype('float32')
)

# Submit a training job.
estimator.fit(train_input, wait=block_notebook_while_training)

# Track the job name for subsequent cells and CloudWatch CLI operations.
train_job_name = estimator.latest_training_job.name
%set_env TRAIN_JOB_NAME=$train_job_name
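
With `wait=True`, the SDK polls the training job on your behalf, which is what exercises the `DescribeTrainingJob` permission. For illustration, the same check can be done by hand, for example after submitting with `wait=False`. This is a minimal sketch under stated assumptions: `wait_for_training_job`, its polling interval, and its poll limit are hypothetical and not part of the original notebook.

```python
import time

# Statuses after which a training job no longer changes state.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(sm_client, job_name, poll_seconds=30, max_polls=60):
    """Poll DescribeTrainingJob until the job reaches a terminal status.

    Returns the terminal status string, or raises TimeoutError if the job
    is still running after `max_polls` polls.
    """
    for _ in range(max_polls):
        desc = sm_client.describe_training_job(TrainingJobName=job_name)
        status = desc["TrainingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"{job_name} still running after {max_polls} polls")
```

A call such as `wait_for_training_job(sess.sagemaker_client, train_job_name)` fails with an access-denied error if the role lacks the describe permission, which is the condition this notebook screens for.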

# Retrieve CloudWatch log events

You can retrieve the training logs using `awscli` after this notebook is unblocked. Assuming the job name is stored in an environment variable `TRAIN_JOB_NAME`, run these CLI commands:

```bash
# Find out the log-stream name; should look like TRAIN_JOB_NAME/xxx.
aws logs describe-log-streams \
    --log-group-name /aws/sagemaker/TrainingJobs \
    --log-stream-name-prefix $TRAIN_JOB_NAME \
    | jq -r '.logStreams[].logStreamName'


# Get the log events
aws logs get-log-events \
    --log-group-name /aws/sagemaker/TrainingJobs \
    --log-stream-name <LOG_STREAM_NAME>
```
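
The same two steps can also be run from Python with the boto3 CloudWatch Logs client. The sketch below is an illustration rather than part of the original notebook: `fetch_training_logs` is a hypothetical helper, and it reads only the first page of events per stream, which is assumed to be enough for a job this short.

```python
def fetch_training_logs(logs_client, job_name,
                        log_group="/aws/sagemaker/TrainingJobs"):
    """Return {log stream name: [messages]} for one training job.

    `logs_client` is a boto3 CloudWatch Logs client. Only the first page
    of events is read from each stream.
    """
    # Step 1: find the log streams whose names start with the job name.
    streams = logs_client.describe_log_streams(
        logGroupName=log_group,
        logStreamNamePrefix=job_name,
    )["logStreams"]

    # Step 2: read the events of each stream, oldest first.
    messages = {}
    for stream in streams:
        name = stream["logStreamName"]
        events = logs_client.get_log_events(
            logGroupName=log_group,
            logStreamName=name,
            startFromHead=True,
        )["events"]
        messages[name] = [event["message"] for event in events]
    return messages
```

In the notebook, a client could be obtained with `sess.boto_session.client('logs')` and passed in together with `train_job_name`.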

# Optional: Inspect model artifact

The model artifact `s3://bucket/sagemaker/train_job_name/output/model.tar.gz` contains the cluster centroids.
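
The artifact location follows the pattern output path, then job name, then `output/model.tar.gz`. A small sketch of that pattern is shown below; `model_artifact_uri` is a hypothetical helper added for illustration. The authoritative value is the one SageMaker reports for the job, so treat this as a convenience for reading the path, not as a replacement for that lookup.

```python
def model_artifact_uri(output_path, train_job_name):
    """Build the expected S3 URI of a training job's model artifact.

    Follows the layout <output_path>/<train_job_name>/output/model.tar.gz.
    """
    return f"{output_path.rstrip('/')}/{train_job_name}/output/model.tar.gz"


# Example with placeholder names:
print(model_artifact_uri("s3://bucket/sagemaker", "train_job_name"))
```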

In [None]:
import mxnet as mx

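# Uncomment & modify the next line if:
# - you want to use another k-means job, or
# - you restarted the kernel, which clears the `train_job_name` variable.
#train_job_name = 'xxxx'

def get_model_tgz(train_job_name, sm_client):
    """Stand-in for a helper this notebook calls but never defines
    (assumed behavior): return the S3 URI of the job's model.tar.gz."""
    desc = sm_client.describe_training_job(TrainingJobName=train_job_name)
    return desc['ModelArtifacts']['S3ModelArtifacts']

model_artifact = get_model_tgz(train_job_name, sess.sagemaker_client)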
%set_env MODEL_ARTIFACT=$model_artifact

# Extract model artifact to /tmp
!aws s3 cp $MODEL_ARTIFACT - | tar -C /tmp -xzf -

# Load & inspect
kmeans_model_params = mx.ndarray.load('/tmp/model_algo-1')
cluster_centroids = pd.DataFrame(kmeans_model_params[0].asnumpy())

print(
    f'type(kmeans_model_params) = {type(kmeans_model_params)}',
    f'len(kmeans_model_params) = {len(kmeans_model_params)}',
    sep='\n'
)
for i, o in enumerate(kmeans_model_params):
    print(f'type(kmeans_model_params[{i}]) = {type(o)}')
    try:
        print('  => shape:', o.shape, end='\n\n')
    except Exception:
        pass

cluster_centroids

# Batch Transform

After the training job finishes, we assign a cluster ID to each record of the training data.
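
Conceptually, assigning a cluster ID means finding the centroid closest to each record. The sketch below illustrates that idea locally in plain Python, assuming Euclidean distance. It is not the SageMaker implementation, and `assign_clusters` and the sample values are made up for illustration.

```python
import math


def assign_clusters(points, centroids):
    """For each point, return (index of closest centroid, distance to it)."""
    results = []
    for point in points:
        distances = [math.dist(point, centroid) for centroid in centroids]
        # Ties are resolved in favor of the lowest centroid index.
        closest = min(range(len(centroids)), key=distances.__getitem__)
        results.append((closest, distances[closest]))
    return results


# Illustrative values only.
centroids = [(0.0, 0.0), (1.0, 1.0)]
points = [(0.1, 0.2), (0.9, 0.8), (0.0, 1.0)]
for point, (cluster, dist) in zip(points, assign_clusters(points, centroids)):
    print(point, "-> cluster", cluster, f"(distance {dist:.3f})")
```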

## Register model

NOTE: if the model name is already registered, the existing registered model is left untouched.
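
To find out up front whether the name is taken, the registered models can be listed and compared by exact name. A minimal sketch, assuming a boto3 SageMaker client such as `sess.sagemaker_client`: `model_exists` is a hypothetical helper and only inspects the first page of results.

```python
def model_exists(sm_client, model_name):
    """Return True if a model with exactly this name is registered.

    Only the first page (up to 100 models whose names contain
    `model_name`) is inspected.
    """
    resp = sm_client.list_models(NameContains=model_name, MaxResults=100)
    return any(m["ModelName"] == model_name for m in resp.get("Models", []))
```

For example, `model_exists(sess.sagemaker_client, 'kmeans-screening-1234')` can be checked before the next cell, and again after the de-registration step to confirm the model is gone.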

In [None]:
model = KMeansModel(
    model_data=model_artifact,
    sagemaker_session=sess,
    name='kmeans-screening-1234',
    **sm_kwargs.model,
)

# Create model
model._create_sagemaker_model(instance_type='ml.m5.large', tags=sm_kwargs.tags)

## Submit Batch-Transform job

In [None]:
# The S3 input path for batch transform must contain only protobuf files, so we
# simply copy everything but the manifest from the input record set
# (autogenerated prior to training) to a new area.
!aws s3 sync \
    s3://... \
    $S3_SAGEMAKER_PATH/bt/input/ \
    --exclude .amazon.manifest \
    --storage-class ONEZONE_IA

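# I/O path to S3 MUST have trailing '/'.
def endslash(s):
    """Stand-in for a helper this notebook calls but never defines
    (assumed behavior): append '/' unless the string already ends with it."""
    return s if s.endswith('/') else s + '/'

bt_input_dir = endslash(s3_sagemaker_path + '/bt/input')
bt_output_dir = endslash(s3_sagemaker_path + '/bt/output')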

bt = model.transformer(
    instance_count=1,
    instance_type='ml.m5.large',
    strategy='MultiRecord',
    output_path=bt_output_dir,

    # https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html#cm-batch
    accept='application/jsonlines',
    assemble_with='Line',

    **sm_kwargs.bt,
)

# `wait` and `logs` are arguments of transform(), not of transformer().
bt.transform(
    data=bt_input_dir,
    data_type='S3Prefix',

    # https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html#cm-batch
    content_type='application/x-recordio-protobuf',
    split_type='RecordIO',
    wait=True,
    logs=True,
)

## Inspect sample output

In [None]:
!aws s3 ls $S3_SAGEMAKER_PATH/bt/output/
!aws s3 cp $S3_SAGEMAKER_PATH/bt/output/matrix_0.pbr.out - | head
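
Because the job requests `application/jsonlines`, each output line should be one JSON object per input record. The sketch below parses such lines and counts records per cluster. The field names `closest_cluster` and `distance_to_cluster` are an assumption based on the documented K-Means response format, `parse_kmeans_output` is a hypothetical helper, and the sample values are invented for illustration.

```python
import json
from collections import Counter


def parse_kmeans_output(lines):
    """Parse JSON Lines records into (cluster_id, distance) tuples."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # Skip blank lines, e.g. a trailing newline.
        rec = json.loads(line)
        records.append(
            (int(rec["closest_cluster"]), float(rec["distance_to_cluster"]))
        )
    return records


# Invented sample lines in the assumed output format.
sample = [
    '{"closest_cluster": 0.0, "distance_to_cluster": 0.41}',
    '{"closest_cluster": 1.0, "distance_to_cluster": 0.37}',
    '{"closest_cluster": 0.0, "distance_to_cluster": 0.52}',
]
records = parse_kmeans_output(sample)
print(Counter(cluster for cluster, _ in records))
```

To apply it to real output, download `matrix_0.pbr.out` and pass the open file object, since iterating over a text file yields its lines.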

## De-register model

In [None]:
model.delete_model()