### Set SageMaker version for Local Mode for Processing

The cells below install specific versions of the SageMaker SDK. Pick just one, run it once, and then comment it out.

#### Dev version 2.9 (Processing Local Mode Support)

In [2]:
import sys
import IPython
dist_version = '2.9.2.dev0'

!aws s3 cp s3://gianpo-public/sagemaker-{dist_version}.tar.gz .
!{sys.executable} -m pip install -q -U pip
!{sys.executable} -m pip install -q sagemaker-{dist_version}.tar.gz
IPython.Application.instance().kernel.do_shutdown(True)

download: s3://gianpo-public/sagemaker-2.9.2.dev0.tar.gz to ./sagemaker-2.9.2.dev0.tar.gz


{'status': 'ok', 'restart': True}
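After the kernel restarts, it can help to confirm which SDK version is actually loaded. The following is a minimal sketch and not part of the original notebook. The helper names `release_tuple` and `supports_processing_local_mode` are hypothetical, and the `(2, 9)` threshold is an assumption taken from the heading above, which names 2.9 as the version with Processing local mode support.

```python
import re
from importlib import metadata


def release_tuple(version):
    """Return the leading numeric components of a version string as a tuple."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def supports_processing_local_mode(version, minimum=(2, 9)):
    """Assumption: local mode for Processing needs at least the 2.9 line."""
    return release_tuple(version) >= minimum


try:
    installed = metadata.version("sagemaker")
    print(installed, supports_processing_local_mode(installed))
except metadata.PackageNotFoundError:
    print("sagemaker is not installed in this kernel")
```

Comparing tuples of integers avoids the string-comparison trap where `"2.10"` would sort before `"2.9"`.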

#### Latest release

In [None]:
#!pip install -U sagemaker
#import IPython
#IPython.Application.instance().kernel.do_shutdown(True)

#### Latest 1.x Release

In [None]:
#!pip install -U sagemaker==1.72.1
#import IPython
#IPython.Application.instance().kernel.do_shutdown(True)

# Data Processing Job Creation and Execution

This notebook shows how to capture the data processing steps from the original notebook as an *Amazon SageMaker Processing* job. Processing jobs capture common, repeatable data transformations and let you run them as managed processes on resources spun up just for that purpose. We'll use a variation of the [Dask](https://www.dask.org) [Processing job example](https://github.com/aws/amazon-sagemaker-examples/blob/master/sagemaker_processing/feature_transformation_with_sagemaker_processing_dask/feature_transformation_with_sagemaker_processing_dask.ipynb) to execute our script.

## Initialization scripts

In [4]:
import os
from pathlib import Path

The variables below are used to build a container image and push it to Amazon Elastic Container Registry (ECR). The image is what makes the process repeatable and scalable.

In [5]:
import boto3

account_id = boto3.client('sts').get_caller_identity().get('Account')
region = boto3.session.Session().region_name

ecr_repository = 'dask_processing'
tag = 'latest'
URI_SUFFIX = 'amazonaws.com'
dask_repository_uri = f'{account_id}.dkr.ecr.{region}.{URI_SUFFIX}/{ecr_repository}:{tag}'
print(dask_repository_uri)
# Project paths on the notebook instance
root_path = Path('/home/ec2-user/SageMaker/defect_detection/')
code_path = root_path / "notebooks/WM-811K/src/"
code_path.mkdir(parents=True, exist_ok=True)  # also create missing parent directories
data_path = root_path / "data/MIR-WM811K/"

160951647621.dkr.ecr.us-east-1.amazonaws.com/dask_processing:latest
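The URI printed above follows the fixed ECR naming pattern. As a minimal sketch, the same construction can be wrapped in a function. The name `ecr_image_uri` is hypothetical and the account ID in the usage line is a placeholder. The suffix switch reflects that the China regions use the `amazonaws.com.cn` domain, which is presumably why the notebook keeps `URI_SUFFIX` as a variable.

```python
def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Build the registry URI that `docker push` expects for an ECR image."""
    # China regions (cn-north-1, cn-northwest-1) are served from a different DNS suffix.
    suffix = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
    return f"{account_id}.dkr.ecr.{region}.{suffix}/{repository}:{tag}"


# Placeholder account ID, not a real account.
print(ecr_image_uri("123456789012", "us-east-1", "dask_processing"))
```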


## Create SageMaker Processing Job

### Build a Container for Dask Processing

Create a container for processing with Dask. The code below is based on [this example](https://github.com/aws/amazon-sagemaker-examples/blob/master/sagemaker_processing/feature_transformation_with_sagemaker_processing_dask/feature_transformation_with_sagemaker_processing_dask.ipynb). Here are the contents of the Dockerfile:

In [1]:
!pygmentize src/data_processing/Dockerfile

FROM continuumio/miniconda3:4.7.12


RUN apt-get update
RUN apt-get install -y curl unzip python3 python3-setuptools python3-pip python-dev python3-dev python-psutil
RUN pip3 install py4j psutil==5.6.5 numpy==1.17.4
RUN apt-get clean
RUN rm -rf /var/lib/apt/lists/*

ENV PYTHONHASHSEED 0
ENV PYTHONIOENCODING UTF-8
ENV PIP_DISABLE_PIP_VERSION_CHECK 1


RUN conda install --yes \
    -c conda-forge \
    python==3.8 \
    python-blosc \
    cytoolz \
    pillow \
    dask==2.16.0 \
    distributed==2.16.0 \

While it may seem complicated, most of it is copied from the example linked above. The only reason we had to build another container is that we needed an [image manipulation library](https://pillow.readthedocs.io/en/stable/index.html), Pillow, that the example does not include.

This container can run arbitrary Python scripts on a Dask cluster built on demand, as we'll see below.
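To make "a cluster built on demand" more concrete, here is a minimal sketch of how a bootstrap script could decide what each container should start. It is not the actual `bootstrap.py` from the example. It assumes the container reads the resource configuration that SageMaker Processing writes to `/opt/ml/config/resourceconfig.json` (with `current_host` and `hosts` keys), that the first host in sorted order acts as the scheduler, and that the scheduler listens on Dask's default port 8786. The function name `dask_role` is hypothetical.

```python
import json


def dask_role(resource_config):
    """Pick the Dask process for this container and the scheduler address.

    Assumption: the first host (in sorted order) is the scheduler,
    every other host is a worker that connects to it.
    """
    scheduler_host = sorted(resource_config["hosts"])[0]
    is_scheduler = resource_config["current_host"] == scheduler_host
    role = "scheduler" if is_scheduler else "worker"
    return role, f"tcp://{scheduler_host}:8786"


# In a real job this JSON would be read from /opt/ml/config/resourceconfig.json.
example = json.loads('{"current_host": "algo-2", "hosts": ["algo-1", "algo-2"]}')
print(dask_role(example))
```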

#### Docker Build

In [51]:
%%bash
# pushd/popd are bash builtins, so run this cell with bash
pushd src/data_processing
docker build -t wafer-data-processing .
popd

~/SageMaker/defect_detection/notebooks/WM-811K/src/data_processing ~/SageMaker/defect_detection/notebooks/WM-811K
Sending build context to Docker daemon  17.92kB
Step 1/21 : FROM continuumio/miniconda3:4.7.12
 ---> 406f2b43ea59
Step 2/21 : RUN apt-get update
 ---> Using cache
 ---> 683b4f2f5e32
Step 3/21 : RUN apt-get install -y curl unzip python3 python3-setuptools python3-pip python-dev python3-dev python-psutil
 ---> Running in 3db7da2349c9
Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
  binutils binutils-common binutils-x86-64-linux-gnu build-essential cpp cpp-8
  dbus dh-python dirmngr dpkg-dev fakeroot g++ g++-8 gcc gcc-8 gir1.2-glib-2.0
  gnupg gnupg-l10n gnupg-utils gpg gpg-agent gpg-wks-client gpg-wks-server
  gpgconf gpgsm libalgorithm-diff-perl libalgorithm-diff-xs-perl
  libalgorithm-merge-perl libapparmor1 libasan5 libassuan0 libatomic1
  libbinutils libc-dev-bin libc6-dev libcc1-0 lib

#### Push to ECR

In [150]:
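# Create ECR repository and push docker image
# Requires region, account_id, ecr_repository and dask_repository_uri from the initialization
# cell; if they are undefined, the aws commands fail with "expected one argument".
# `aws ecr get-login` exists only in AWS CLI v1; with v2 use `aws ecr get-login-password`.

!$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)
!aws ecr create-repository --repository-name $ecr_repository
# The image was built above as wafer-data-processing, so tag that image for ECR
!docker tag wafer-data-processing:latest $dask_repository_uri
!docker push $dask_repository_uri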

Note: AWS CLI version 2, the latest major version of the AWS CLI, is now stable and recommended for general use. For more information, see the AWS CLI version 2 installation instructions at: https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html

usage: aws [options] <command> <subcommand> [<subcommand> ...] [parameters]
To see help text, you can run:

  aws help
  aws <command> help
  aws <command> <subcommand> help
aws: error: argument --region: expected one argument
Note: AWS CLI version 2, the latest major version of the AWS CLI, is now stable and recommended for general use. For more information, see the AWS CLI version 2 installation instructions at: https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html

usage: aws [options] <command> <subcommand> [<subcommand> ...] [parameters]
To see help text, you can run:

  aws help
  aws <command> help
  aws <command> <subcommand> help
aws: error: argument --repository-name: expected one argument
"docker tag" requ
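Re-running `aws ecr create-repository` returns an error once the repository exists. A minimal sketch of an idempotent alternative using the boto3 ECR client follows. The helper name `ensure_repository` is hypothetical, and the client is passed in as an argument so the function can be defined without AWS credentials.

```python
def ensure_repository(ecr_client, name):
    """Create the ECR repository if it is missing; return True when it was created."""
    try:
        ecr_client.create_repository(repositoryName=name)
        return True
    except ecr_client.exceptions.RepositoryAlreadyExistsException:
        # The repository is already there, which is fine for our purposes.
        return False


# Usage (needs AWS credentials and permissions):
#   import boto3
#   ensure_repository(boto3.client("ecr"), "dask_processing")
```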

### Create Script

As mentioned before, once built, the Dask container can be reused to run any number of scripts. We'll use it to run the data transformations we had before. The script itself was prepared in an editor and can be found at `notebooks/WM-811K/src/data_processing.py`. It is assembled from parts of the original notebook, with imports resolved and some refactoring for clarity.

In [2]:
!pygmentize ~/SageMaker/defect_detection/notebooks/WM-811K/src/data_processing.py

import sys
import os
import logging
import dask.dataframe as dd
import pandas as pd
import numpy as np
from PIL import Image
from pathlib import Path
from dask.distributed import Client


def hot_encode(img_arr):
    new_arr = np.zeros((676, 3))
    for x in range(676):
        new_arr[x, img_ar
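The listing above is cut off, so the rest of `hot_encode` is not shown here. As an independent illustration of the general technique, the sketch below one-hot encodes a wafer map with NumPy indexing instead of a loop. It assumes each pixel holds an integer class 0, 1, or 2, and the function name `one_hot_wafer` is hypothetical. It is not the script's implementation.

```python
import numpy as np


def one_hot_wafer(img_arr, num_classes=3):
    """One-hot encode a 2-D map of integer classes into shape (H, W, num_classes)."""
    flat = np.asarray(img_arr, dtype=int).reshape(-1)
    encoded = np.eye(num_classes)[flat]  # row i is the one-hot vector for pixel i
    return encoded.reshape(*np.shape(img_arr), num_classes)


# Tiny 2x2 example map; real wafer maps in this notebook are 26x26.
wafer = np.array([[0, 1], [2, 1]])
print(one_hot_wafer(wafer).shape)
```

Indexing the identity matrix with the flattened class values does the whole lookup in one step and yields the trailing channel dimension of 3.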

## Run the Processing Job

With the container and script ready, setting up and running the data transformation is simple. First we tell SageMaker where to find the container, then which script to run and which data it reads and writes.

### Set up the Script Processor

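The most important parameter below is `image_uri`, which is set to `dask_repository_uri`. The other parameters control the naming of job executions and the resources available to them. Setting `instance_type="local"` runs the job in local mode, in Docker containers on the notebook instance.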

In [7]:
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

region = boto3.session.Session().region_name

role = get_execution_role()

dask_processor = ScriptProcessor(
    base_job_name="wafer-data-processing",
    image_uri=dask_repository_uri,
    command=["/opt/program/bootstrap.py"],
    volume_size_in_gb=5,
    role=role,
    instance_count=4,
    instance_type="local",
    max_runtime_in_seconds=60*20,
)

### Run

With the `dask_processor` ready, we execute it, pointing to the files we want to process. The run prints a log below, which can be inspected for results or error messages.

In [8]:
dask_processor.run(
    code=str(code_path / 'data_processing.py'),
    inputs=[ProcessingInput(
        source="s3://sagemaker-us-east-1-160951647621/wafer-input/wafers.pkl.gz",
        destination='/opt/ml/processing/input'
    )],
    outputs=[ProcessingOutput(output_name='autoencoder/train', source='/opt/ml/processing/train')]
)


Job Name:  wafer-data-processing-2020-10-05-21-26-16-330
Inputs:  [{'InputName': 'input-1', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-160951647621/wafer-input/wafers.pkl.gz', 'LocalPath': '/opt/ml/processing/input', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-160951647621/wafer-data-processing-2020-10-05-21-26-16-330/input/code/data_processing.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'autoencoder/train', 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-160951647621/wafer-data-processing-2020-10-05-21-26-16-330/output/autoencoder/train', 'LocalPath': '/opt/ml/processing/train', 'S3UploadMode': 'EndOfJob'}}]
Creating hxwuq9jhaz-algo-2-pf7g4 ... 
Creating zqoiqizqyp-algo-1-pf7g4

We can now check the results of the processing job. The `latest_job` attribute describes the most recent run, including where the generated file was written.

In [10]:
processed_data = dask_processor.latest_job.describe()['ProcessingOutputConfig']['Outputs'][0]['S3Output']['S3Uri']

In [19]:
bucket, *path = processed_data.split("/")[2:]
path = "/".join(path)
print(f"The output was saved to bucket {bucket}, under the folder {path}")

The output was saved to bucket sagemaker-us-east-1-160951647621, under the folder wafer-data-processing-2020-10-05-21-26-16-330/output/autoencoder/train
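Splitting the URI on `/` works for this output. A slightly more defensive sketch uses the standard library's `urlparse`, which also lets us reject strings that are not S3 URIs. The function name `split_s3_uri` and the example bucket are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


example_bucket, example_key = split_s3_uri("s3://example-bucket/some-job/output/train")
print(example_bucket, example_key)
```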


Download the file and check the resulting vectors inside it.

In [16]:
sagemaker.utils.download_file(bucket, path + "/data.npz", "/tmp/data.npz", sagemaker.session.Session())

In [17]:
import numpy as np

with np.load("/tmp/data.npz", allow_pickle=True) as data:
    x = data['x']
    y = data['y']
    label_classes = data['label_classes'].item(0)

In [18]:
print(f"X shape: {x.shape}\nY shape: {y.shape}\nLabels: {label_classes}")

X shape: (22894, 26, 26, 3)
Y shape: (22894,)
Labels: {'Center': 0, 'Edge-Loc': 1, 'Edge-Ring': 2, 'Loc': 3, 'Near-full': 4, 'Random': 5, 'Scratch': 6, 'none': 7, 'Donut': 8}
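Beyond checking shapes, a quick look at the class balance can catch a broken label mapping. The sketch below is a hedged example that assumes `y` holds the integer indices stored in `label_classes`. The helper name `class_counts` and the tiny sample input are hypothetical.

```python
from collections import Counter


def class_counts(y, label_classes):
    """Count samples per class name, given integer labels and a name-to-index map."""
    index_to_name = {index: name for name, index in label_classes.items()}
    counts = Counter(int(label) for label in y)
    return {index_to_name[index]: count for index, count in sorted(counts.items())}


# Tiny made-up sample; in the notebook you would pass the loaded `y` and `label_classes`.
print(class_counts([0, 7, 7, 8], {"Center": 0, "none": 7, "Donut": 8}))
```

A `KeyError` here would indicate a label value in `y` that has no entry in the mapping.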



Everything looks in order. We can proceed to [training the autoencoder](train_autoencoder.ipynb).