The cell above is for access to a new experimental feature for faster debugging of processing jobs. It will be released on the SageMaker SDK following the standard release process.

# Data Augmentation

This notebook runs a processing job for generating additional wafer maps using the encode -> add noise -> decode strategy we built before. It has a very similar structure to the data processing notebook, since the general steps for building and running processing are similar. We built a separate docker image to have access to Tensorflow. We're using the AWS Optimized [Deep Learning Containers](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html) Tensorflow image as the base for our processing.

## Initialization

In [1]:
import os

from pathlib import Path

In [2]:
root_path = Path('/home/ec2-user/SageMaker/defect_detection/')#.resolve()

code_path = root_path / "notebooks/WM-811K/src/"
code_path.mkdir(exist_ok=True)
data_path = root_path / "data/MIR-WM811K/"

In [3]:
import boto3


account_id = boto3.client('sts').get_caller_identity().get('Account')
region = boto3.session.Session().region_name
ecr_repository = 'data-augmentation'
tag = ':latest'
uri_suffix = 'amazonaws.com'
repository_uri = '{}.dkr.ecr.{}.{}/{}'.format(account_id, region, uri_suffix, ecr_repository + tag)

In [4]:
repository_uri

'160951647621.dkr.ecr.us-east-1.amazonaws.com/data-augmentation:latest'

# Create the execution script

In [1]:
!pygmentize ./src/data_augmentation/program/augmentation.py

[34mimport[39;49;00m [04m[36mnumpy[39;49;00m [34mas[39;49;00m [04m[36mnp[39;49;00m
[34mimport[39;49;00m [04m[36mtarfile[39;49;00m
[34mimport[39;49;00m [04m[36margparse[39;49;00m
[34mimport[39;49;00m [04m[36mlogging[39;49;00m
[34mfrom[39;49;00m [04m[36mpathlib[39;49;00m [34mimport[39;49;00m Path
[34mfrom[39;49;00m [04m[36mtensorflow[39;49;00m[04m[36m.[39;49;00m[04m[36mkeras[39;49;00m[04m[36m.[39;49;00m[04m[36mmodels[39;49;00m [34mimport[39;49;00m load_model


[34mdef[39;49;00m [32mparse_arguments[39;49;00m():
    parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument([33m"[39;49;00m[33m--limit[39;49;00m[33m"[39;49;00m, [36mtype[39;49;00m=[36mint[39;49;00m, default=[34mNone[39;49;00m)
    parser.add_argument([33m"[39;49;00m[33m--augmented-size[39;49;00m[33m"[39;49;00m, [36mtype[39;49;00m=[36mint[39;49;00m, default=[34m2000[39;49;00m)
    [34mreturn[39

# Build a Container for augmentation

In [46]:
%%sh 
pushd src/data_augmentation
$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)
docker build -t  data-augmentation .
popd

~/SageMaker/defect_detection/notebooks/WM-811K/src/data_augmentation ~/SageMaker/defect_detection/notebooks/WM-811K
Sending build context to Docker daemon  13.82kB
Step 1/15 : FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:2.1.0-gpu-py3
 ---> 43a74e93a483
Step 2/15 : RUN apt-get update
 ---> Using cache
 ---> 0e0a4fe719b7
Step 3/15 : RUN apt-get install -y curl unzip python3 python3-setuptools python3-pip python-dev python3-dev python-psutil ffmpeg libsm6 libxext6
 ---> Using cache
 ---> bd665a670110
Step 4/15 : RUN pip3 install py4j psutil==5.6.5 numpy==1.17.4
 ---> Using cache
 ---> a12df01f221b
Step 5/15 : RUN apt-get clean
 ---> Using cache
 ---> de8e4caee4ab
Step 6/15 : RUN rm -rf /var/lib/apt/lists/*
 ---> Using cache
 ---> 7ab1fdfd2da8
Step 7/15 : ENV PYTHONHASHSEED 0
 ---> Using cache
 ---> 0644a25711e6
Step 8/15 : ENV PYTHONIOENCODING UTF-8
 ---> Using cache
 ---> ddc2111be8d3
Step 9/15 : ENV PIP_DISABLE_PIP_VERSION_CHECK 1
 ---> Using cache
 ---> 3768ac

Note: AWS CLI version 2, the latest major version of the AWS CLI, is now stable and recommended for general use. For more information, see the AWS CLI version 2 installation instructions at: https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html

usage: aws [options] <command> <subcommand> [<subcommand> ...] [parameters]
To see help text, you can run:

  aws help
  aws <command> help
  aws <command> <subcommand> help
aws: error: argument --region: expected one argument


In [47]:
!aws ecr create-repository --repository-name $ecr_repository
!docker tag {ecr_repository + tag} $repository_uri
!docker push $repository_uri


An error occurred (RepositoryAlreadyExistsException) when calling the CreateRepository operation: The repository with name 'data-augmentation' already exists in the registry with id '160951647621'
The push refers to repository [160951647621.dkr.ecr.us-east-1.amazonaws.com/data-augmentation]

[1B48bf6f13: Preparing 
[1B4c23698f: Preparing 
[1Bb8edb1eb: Preparing 
[1Bd3858bc6: Preparing 
[1Bbd70a43c: Preparing 
[1Bb267bb8a: Preparing 
[1B7470f0dd: Preparing 
[1Bec6e212a: Preparing 
[1Bf319a508: Preparing 
[1Bf7132110: Preparing 
[1Beb9ebda6: Preparing 
[7Bb267bb8a: Waiting g 
[7B7470f0dd: Waiting g 
[7Bec6e212a: Waiting g 
[1B62cacce5: Preparing 
[1Bd22b16ab: Preparing 
[1B26dec4ac: Preparing 
[1B6ff78197: Preparing 
[1Bdf5cf960: Preparing 
[7B5bf23a91: Waiting g 
[1Bb763c8de: Preparing 
[13B7132110: Waiting g 
[8Bd22b16ab: Waiting g 
[14Bb9ebda6: Waiting g 
[9B26dec4ac: Waiting g 
[15B829d3bc: Waiting g 
[15B4b15037: Waiting g 
[1Ba4b22186: Preparing 
[12Bf

# Run the Container

In [5]:
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

region = boto3.session.Session().region_name

role = get_execution_role()

data_augmenter = ScriptProcessor(
    base_job_name="data-augmentation",
    image_uri=repository_uri,
    command=["python3"],
    role=role,
    instance_count=1,
    instance_type="local",
    max_runtime_in_seconds=1200,
)

In [6]:
data_augmenter.run(
    code="src/data_augmentation/program/augmentation.py",
    arguments=["--augmented-size", "2000"],
    inputs=[
        ProcessingInput(
            source="s3://sagemaker-us-east-1-160951647621/train-autoencoder-2020-10-05-21-46-58-914/output",
            destination='/opt/ml/processing/models'
        ), ProcessingInput(
            source="s3://sagemaker-us-east-1-160951647621/wafer-data-processing-2020-10-05-21-26-16-330/output/autoencoder/train",
            destination="/opt/ml/processing/data"
        )
    ],
    outputs=[ProcessingOutput(output_name='classifier/train', source='/opt/ml/processing/augmented')]
)


Job Name:  data-augmentation-2020-10-05-21-59-11-375
Inputs:  [{'InputName': 'input-1', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-160951647621/train-autoencoder-2020-10-05-21-46-58-914/output', 'LocalPath': '/opt/ml/processing/models', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'input-2', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-160951647621/wafer-data-processing-2020-10-05-21-26-16-330/output/autoencoder/train', 'LocalPath': '/opt/ml/processing/data', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-160951647621/data-augmentation-2020-10-05-21-59-11-375/input/code/augmentation.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionTyp

We can now check the results of the processing job. The `latest_job` property holds all information about what happened on it, including where the generated file is.

In [7]:
processed_data = data_augmenter.latest_job.describe()['ProcessingOutputConfig']['Outputs'][0]['S3Output']['S3Uri']

In [8]:
bucket, *path = processed_data.split("/")[2:]
path = "/".join(path)

In [14]:
bucket, *path = processed_data.split("/")[2:]
path = "/".join(path)
print(f"The output was saved to bucket {bucket}, under the folder {path}")

The output was saved to bucket sagemaker-us-east-1-160951647621, under the folder data-augmentation-2020-10-05-21-59-11-375/output/classifier/train


Download the file and check the resulting vectors inside it.

In [10]:
sagemaker.utils.download_file(bucket, path + "/data.npz", "/tmp/data.npz", sagemaker.session.Session())

In [12]:
import numpy as np

with np.load("/tmp/data.npz", allow_pickle=True) as data:
    x = data['x']
    y = data['y']
    label_classes = data['label_classes'].item(0)

In [13]:
print(f"X shape: {x.shape}\nY shape: {y.shape}\nLabels: {label_classes}")

X shape: (40484, 26, 26, 3)
Y shape: (40484,)
Labels: {'Center': 0, 'Edge-Loc': 1, 'Edge-Ring': 2, 'Loc': 3, 'Near-full': 4, 'Random': 5, 'Scratch': 6, 'none': 7, 'Donut': 8}


After augmentation, we almost doubled the total number of examples. This new dataset will be used to [train the classifier model](train_classifier.ipynb).