This notebook is a screening test: it verifies that the environment can perform *train* -> *register model* -> *batch transform* -> *delete model*. It is designed to run in one go without a kernel restart, so it submits only short training and batch-transform jobs, each of which runs for 3+ minutes.

Steps:
- **Pre-requisite**:
  * Install `requirements.txt` to conda environment `mxnet_p36`.
  * Make sure to choose kernel `conda_mxnet_p36`.
- **Action**: click *Kernel* -> *Restart Kernel and Run All Cells...* 
- **Expected outcome**: no exception seen.

# Setup

Before you run the next cell, open `smconfig.py`, review the mandatory SageMaker `kwargs`, and then disable the `NotImplementedException` in the last line.

In [None]:
%matplotlib inline
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = 'retina'

import pandas as pd
import mxnet as mx
import numpy as np
import sagemaker as sm
from sagemaker import KMeans, KMeansModel

import smconfig
from smallmatter.pathlib import Path2
from smallmatter.sm import get_model_tgz


# Configuration of this screening test.
sess = sm.Session()
sm_kwargs = smconfig.SmKwargs(sm.get_execution_role())
s3_input_path = f'{smconfig.s3_bucket}/screening/kmeans-input'
s3_sagemaker_path = f'{smconfig.s3_bucket}/screening/sagemaker'

# Enforce blocking API to validate permissions to Describe{Training,Transform}Job.
block_notebook_while_training = True

# Propagate to env vars of the whole notebook, for usage by ! or %%.
%set_env BUCKET=$smconfig.s3_bucket
%set_env S3_INPUT_PATH=$s3_input_path
%set_env S3_SAGEMAKER_PATH=$s3_sagemaker_path

# Training job

In [None]:
estimator = KMeans(
    k=2,
    epochs=5,

    # record_set() needs a trailing '/'. Otherwise, instead of
    # s3://bucket/screening/kmeans-input/KMeans-xxxx, we'll get
    # s3://bucket/screening/kmeans-inputKMeans-xxxx.
    data_location=s3_input_path + '/',

    output_path=s3_sagemaker_path,
    instance_count=1,
    instance_type='ml.m5.large',
    sagemaker_session=sess,
    **sm_kwargs.train,
)

# Generate synthetic input data as protobuf files on S3.
# NOTE: SageMaker K-Means algo REQUIRES float32.
train_input = estimator.record_set(
    np.random.rand(100, 8).astype('float32')
)

# Submit a training job.
estimator.fit(train_input, wait=block_notebook_while_training)

# Track the jobname for subsequent CloudWatch CLI operations.
train_job_name = estimator.latest_training_job.name
%set_env TRAIN_JOB_NAME=$estimator.latest_training_job.name

# Retrieve CloudWatch log events

You can retrieve the training logs using `awscli` after this notebook is unblocked. This will be a good test to verify that this notebook's role has sufficient permissions to read CloudWatch logs.

Assuming the job name is stored in an environment variable `TRAIN_JOB_NAME`, run these CLI commands:

```bash
# Find out the log-stream name; should look like TRAIN_JOB_NAME/xxx.
aws logs describe-log-streams \
    --log-group-name /aws/sagemaker/TrainingJobs \
    --log-stream-name-prefix $TRAIN_JOB_NAME \
    | jq -r '.logStreams[].logStreamName'


# Get the log events
aws logs get-log-events \
    --log-group-name /aws/sagemaker/TrainingJobs \
    --log-stream-name <LOG_STREAM_NAME>
```
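
The same two lookups can also be done from Python. The block below is a minimal sketch, assuming a `boto3` CloudWatch Logs client is passed in as `logs_client`. The helper names `list_log_streams` and `iter_log_messages` are hypothetical, and only the first page of results is read.

```python
from typing import Any, Iterator, List

LOG_GROUP = '/aws/sagemaker/TrainingJobs'


def list_log_streams(logs_client: Any, job_name: str) -> List[str]:
    """Return the log-stream names whose prefix is the training job name."""
    resp = logs_client.describe_log_streams(
        logGroupName=LOG_GROUP,
        logStreamNamePrefix=job_name,
    )
    return [stream['logStreamName'] for stream in resp['logStreams']]


def iter_log_messages(logs_client: Any, stream_name: str) -> Iterator[str]:
    """Yield the messages of one log stream, oldest first (first page only)."""
    resp = logs_client.get_log_events(
        logGroupName=LOG_GROUP,
        logStreamName=stream_name,
        startFromHead=True,
    )
    for event in resp['events']:
        yield event['message']
```

With `boto3` installed, `boto3.client('logs')` can be passed as `logs_client` and `train_job_name` as `job_name`.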



# Optional: Inspect model artifact

The model artifact `s3://bucket/screening/sagemaker/<train_job_name>/output/model.tar.gz` contains the cluster centroids.
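
The artifact location can also be read from the training job description. The block below is a sketch of that lookup, assuming a `boto3` SageMaker client such as `sess.sagemaker_client`. The helper name `model_artifact_uri` is hypothetical, and this is not necessarily how `get_model_tgz` works internally.

```python
from typing import Any


def model_artifact_uri(sagemaker_client: Any, job_name: str) -> str:
    """Return the S3 URI of model.tar.gz for a finished training job."""
    desc = sagemaker_client.describe_training_job(TrainingJobName=job_name)
    # DescribeTrainingJob reports the artifact under ModelArtifacts.
    return desc['ModelArtifacts']['S3ModelArtifacts']
```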

In [None]:
# Extract model artifact to /tmp
model_artifact = str(get_model_tgz(train_job_name, sess.sagemaker_client))
%env MODEL_ARTIFACT=$model_artifact
!aws s3 cp $MODEL_ARTIFACT - | tar -C /tmp -xzf -

# Load & inspect
kmeans_model_params = mx.ndarray.load('/tmp/model_algo-1')
cluster_centroids = pd.DataFrame(kmeans_model_params[0].asnumpy())

print(
    f'type(kmeans_model_params) = {type(kmeans_model_params)}',
    f'len(kmeans_model_params) = {len(kmeans_model_params)}',
    sep='\n'
)
for i, o in enumerate(kmeans_model_params):
    print(f'type(kmeans_model_params[{i}]) = {type(o)}')
    try:
        print('  => shape:', o.shape, end='\n\n')
    except Exception:
        pass

cluster_centroids

# Batch Transform

After the training job finishes, we'll assign a cluster id to each record of the training data.
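
To make the assignment concrete, here is a local NumPy sketch of what it amounts to: each point is given the index of its nearest centroid. It assumes plain Euclidean distance, and `assign_clusters` is a hypothetical helper. The SageMaker algorithm may differ in details.

```python
from typing import Tuple

import numpy as np


def assign_clusters(points: np.ndarray, centroids: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    """Return (closest centroid index, distance to it) for each point."""
    # Pairwise differences, shape (n_points, n_centroids, n_features).
    diffs = points[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    # Euclidean distances, shape (n_points, n_centroids).
    dists = np.linalg.norm(diffs, axis=2)
    closest = dists.argmin(axis=1)
    return closest, dists[np.arange(len(points)), closest]
```

For example, `assign_clusters(data, cluster_centroids.to_numpy())` could be compared against the batch-transform output, where `data` is a float32 array with the same number of columns as the centroids.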

## Register model

NOTE: if a model with the same name is already registered, the existing registered model is left untouched.
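
If a leftover model from an aborted run is a concern, one option is to give each run its own name. The block below is a sketch only: `unique_model_name` is a hypothetical helper, and the notebook itself uses a fixed name.

```python
import time


def unique_model_name(prefix: str = 'kmeans-screening') -> str:
    """Append a UTC timestamp so that repeated runs do not reuse a name."""
    suffix = time.strftime('%Y%m%d-%H%M%S', time.gmtime())
    return f'{prefix}-{suffix}'
```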

In [None]:
model = KMeansModel(
    model_data=str(model_artifact),
    sagemaker_session=sess,
    name='kmeans-screening-1234',
    **sm_kwargs.model,
)

# Create model
model._create_sagemaker_model(instance_type='ml.m5.large', tags=sm_kwargs.tags)

## Submit Batch-Transform job

In [None]:
bt_input_src = Path2(train_input.s3_data).parent
bt_input_dir = s3_sagemaker_path + '/bt/input'
bt_output_dir = s3_sagemaker_path + '/bt/output'
# Propagate to env vars so that ! commands in this and later cells can use them.
%env BT_INPUT_SRC=$bt_input_src
%env BT_INPUT_DIR=$bt_input_dir
%env BT_OUTPUT_DIR=$bt_output_dir

# The S3 input path to batch transform must contain only protobuf files, so we
# simply copy everything but the manifest from the input record set
# (autogenerated prior to training) to a new area.
!aws s3 sync \
    $BT_INPUT_SRC/ \
    $BT_INPUT_DIR/ \
    --exclude .amazon.manifest \
    --storage-class ONEZONE_IA
!aws s3 ls $BT_INPUT_DIR/

bt = model.transformer(
    instance_count=1,
    instance_type='ml.m5.large',
    strategy='MultiRecord',
    output_path=bt_output_dir + '/',  # Output S3 dir for batch transform must end with '/'

    # https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html#cm-batch
    accept='application/jsonlines',
    assemble_with='Line',

    **sm_kwargs.bt,
)

bt.transform(
    data=bt_input_dir + '/',   # Input S3 dir for batch transform must end with '/'
    data_type='S3Prefix',
    wait=block_notebook_while_training,
    logs=True,

    # https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html#cm-batch
    content_type='application/x-recordio-protobuf',
    split_type='RecordIO'
)

## Inspect sample output
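
Because the transform job requests `application/jsonlines`, each output line should be one JSON object per input record. The sketch below parses such lines, assuming the fields are named `closest_cluster` and `distance_to_cluster`. That is our understanding of the K-Means response format, not something this notebook verifies. `parse_kmeans_output` is a hypothetical helper.

```python
import json
from typing import Iterable, List, Tuple


def parse_kmeans_output(lines: Iterable[str]) -> List[Tuple[int, float]]:
    """Parse JSON Lines output into (cluster id, distance) pairs."""
    results = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # Skip blank lines, e.g. a trailing newline.
        record = json.loads(line)
        # Assumed field names; adjust if the actual output differs.
        results.append(
            (int(record['closest_cluster']), float(record['distance_to_cluster']))
        )
    return results
```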

In [None]:
!aws s3 ls $BT_OUTPUT_DIR/
!aws s3 cp $BT_OUTPUT_DIR/matrix_0.pbr.out - | head

## De-register model

In [None]:
model.delete_model()