# Process Your Data with Your Own Processing Container

In the [scikit-learn data processing notebook](scikit_learn_data_processing_and_model_evaluation.ipynb), you learned how to use the pre-built `SKLearn` container to process your data. In this notebook, you will learn how to build your own processing container and run it with the `Processor` class.

First, inspect the sample data your processing job will consume. It is in a public S3 bucket we preloaded for you. 

In [2]:
import pandas as pd
import boto3

region = boto3.Session().region_name

input_data = "s3://sagemaker-sample-data-{}/processing/census/census-income.csv".format(region)
df = pd.read_csv(input_data, nrows=10)
df.head(n=10)

Unnamed: 0,age,class of worker,detailed industry recode,detailed occupation recode,education,wage per hour,enroll in edu inst last wk,marital stat,major industry code,major occupation code,...,country of birth father,country of birth mother,country of birth self,citizenship,own business or self employed,fill inc questionnaire for veteran's admin,veterans benefits,weeks worked in year,year,income
0,73,Not in universe,0,0,High school graduate,0,Not in universe,Widowed,Not in universe or children,Not in universe,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,0,95,- 50000.
1,58,Self-employed-not incorporated,4,34,Some college but no degree,0,Not in universe,Divorced,Construction,Precision production craft & repair,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,52,94,- 50000.
2,18,Not in universe,0,0,10th grade,0,High school,Never married,Not in universe or children,Not in universe,...,Vietnam,Vietnam,Vietnam,Foreign born- Not a citizen of U S,0,Not in universe,2,0,95,- 50000.
3,9,Not in universe,0,0,Children,0,Not in universe,Never married,Not in universe or children,Not in universe,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,0,0,94,- 50000.
4,10,Not in universe,0,0,Children,0,Not in universe,Never married,Not in universe or children,Not in universe,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,0,0,94,- 50000.
5,48,Private,40,10,Some college but no degree,1200,Not in universe,Married-civilian spouse present,Entertainment,Professional specialty,...,Philippines,United-States,United-States,Native- Born in the United States,2,Not in universe,2,52,95,- 50000.
6,42,Private,34,3,Bachelors degree(BA AB BS),0,Not in universe,Married-civilian spouse present,Finance insurance and real estate,Executive admin and managerial,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,52,94,- 50000.
7,28,Private,4,40,High school graduate,0,Not in universe,Never married,Construction,Handlers equip cleaners etc,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,30,95,- 50000.
8,47,Local government,43,26,Some college but no degree,876,Not in universe,Married-civilian spouse present,Education,Adm support including clerical,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,52,95,- 50000.
9,34,Private,4,37,Some college but no degree,0,Not in universe,Married-civilian spouse present,Construction,Machine operators assmblrs & inspctrs,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,52,94,- 50000.
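The input location above is an ordinary `s3://bucket/key` URI built from the region name. As a minimal sketch using only the standard library, the helper below rebuilds that URI and splits it into bucket and key. The function names `sample_data_uri` and `split_s3_uri` are illustrative and do not come from any SDK.

```python
from urllib.parse import urlparse


def sample_data_uri(region):
    """Build the sample-data URI the same way the notebook cell does."""
    return "s3://sagemaker-sample-data-{}/processing/census/census-income.csv".format(region)


def split_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: {!r}".format(uri))
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri(sample_data_uri("us-west-2"))
print(bucket)  # sagemaker-sample-data-us-west-2
print(key)     # processing/census/census-income.csv
```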


Next, write the data processing logic that will go into your Docker image.

In [3]:
%%writefile preprocessing.py

import argparse
import os
import warnings

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelBinarizer, KBinsDiscretizer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.compose import make_column_transformer

from sklearn.exceptions import DataConversionWarning

warnings.filterwarnings(action="ignore", category=DataConversionWarning)


columns = [
    "age",
    "education",
    "major industry code",
    "class of worker",
    "num persons worked for employer",
    "capital gains",
    "capital losses",
    "dividends from stocks",
    "income",
]
class_labels = [" - 50000.", " 50000+."]


def print_shape(df):
    negative_examples, positive_examples = np.bincount(df["income"])
    print(
        "Data shape: {}, {} positive examples, {} negative examples".format(
            df.shape, positive_examples, negative_examples
        )
    )


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-test-split-ratio", type=float, default=0.3)
    args, _ = parser.parse_known_args()

    print("Received arguments {}".format(args))

    input_data_path = os.path.join("/opt/ml/processing/input", "census-income.csv")

    print("Reading input data from {}".format(input_data_path))
    df = pd.read_csv(input_data_path)
    df = pd.DataFrame(data=df, columns=columns)
    df.dropna(inplace=True)
    df.drop_duplicates(inplace=True)
    df.replace(class_labels, [0, 1], inplace=True)

    negative_examples, positive_examples = np.bincount(df["income"])
    print(
        "Data after cleaning: {}, {} positive examples, {} negative examples".format(
            df.shape, positive_examples, negative_examples
        )
    )

    split_ratio = args.train_test_split_ratio
    print("Splitting data into train and test sets with ratio {}".format(split_ratio))
    X_train, X_test, y_train, y_test = train_test_split(
        df.drop("income", axis=1), df["income"], test_size=split_ratio, random_state=0
    )

    # scikit-learn >= 0.20.1 expects (transformer, columns) tuples; the older
    # (columns, transformer) order was deprecated and later removed.
    preprocess = make_column_transformer(
        (
            KBinsDiscretizer(encode="onehot-dense", n_bins=10),
            ["age", "num persons worked for employer"],
        ),
        (StandardScaler(), ["capital gains", "capital losses", "dividends from stocks"]),
        (OneHotEncoder(sparse=False), ["education", "major industry code", "class of worker"]),
    )
    print("Running preprocessing and feature engineering transformations")
    train_features = preprocess.fit_transform(X_train)
    test_features = preprocess.transform(X_test)

    print("Train data shape after preprocessing: {}".format(train_features.shape))
    print("Test data shape after preprocessing: {}".format(test_features.shape))

    train_features_output_path = os.path.join("/opt/ml/processing/train", "train_features.csv")
    train_labels_output_path = os.path.join("/opt/ml/processing/train", "train_labels.csv")

    test_features_output_path = os.path.join("/opt/ml/processing/test", "test_features.csv")
    test_labels_output_path = os.path.join("/opt/ml/processing/test", "test_labels.csv")

    print("Saving training features to {}".format(train_features_output_path))
    pd.DataFrame(train_features).to_csv(train_features_output_path, header=False, index=False)

    print("Saving test features to {}".format(test_features_output_path))
    pd.DataFrame(test_features).to_csv(test_features_output_path, header=False, index=False)

    print("Saving training labels to {}".format(train_labels_output_path))
    y_train.to_csv(train_labels_output_path, header=False, index=False)

    print("Saving test labels to {}".format(test_labels_output_path))
    y_test.to_csv(test_labels_output_path, header=False, index=False)

Overwriting preprocessing.py
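The script reads its only option with `parse_known_args`, so any argument it does not recognize is set aside rather than causing an error. The sketch below isolates that behavior. `parse_args` is an illustrative wrapper and `--hypothetical-flag` is a made-up option used only to show how unknown arguments are handled.

```python
import argparse


def parse_args(argv):
    """Parse the script's single option, collecting anything unrecognized."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-test-split-ratio", type=float, default=0.3)
    args, unknown = parser.parse_known_args(argv)
    return args, unknown


# No arguments: the default ratio of 0.3 applies.
args, unknown = parse_args([])
print(args.train_test_split_ratio, unknown)

# An explicit ratio plus a hypothetical flag the script does not define.
args, unknown = parse_args(["--train-test-split-ratio", "0.2", "--hypothetical-flag", "x"])
print(args.train_test_split_ratio, unknown)
```

To my understanding of the SageMaker Python SDK, `processor.run` also takes an `arguments` list that is forwarded to the container entrypoint, which is how a non-default ratio would reach this parser.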


In [8]:
%%writefile Dockerfile

FROM python:3.7-slim-buster

RUN pip3 install pandas==0.25.3 scikit-learn==0.21.3
ENV PYTHONUNBUFFERED=TRUE

RUN mkdir /code
WORKDIR /code
COPY preprocessing.py /code

ENTRYPOINT ["python3", "preprocessing.py"]

Overwriting Dockerfile


Next, create an ECR repository, build the Docker image, and push it to the ECR repository.

In [6]:
import boto3

account_id = boto3.client("sts").get_caller_identity().get("Account")
region = boto3.Session().region_name
ecr_repository = "sagemaker-processing-container"
tag = ":latest"

uri_suffix = "amazonaws.com"
if region in ["cn-north-1", "cn-northwest-1"]:
    uri_suffix = "amazonaws.com.cn"
processing_repository_uri = "{}.dkr.ecr.{}.{}/{}".format(
    account_id, region, uri_suffix, ecr_repository + tag
)
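The URI construction in the cell above can be wrapped in a small function so it can be checked without calling AWS. This is a sketch: `ecr_image_uri` and `CHINA_REGIONS` are illustrative names, and the account ID `123456789012` is a placeholder.

```python
CHINA_REGIONS = ("cn-north-1", "cn-northwest-1")


def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Return <account>.dkr.ecr.<region>.<suffix>/<repository>:<tag>."""
    # The China regions use a different domain suffix, as in the cell above.
    uri_suffix = "amazonaws.com.cn" if region in CHINA_REGIONS else "amazonaws.com"
    return "{}.dkr.ecr.{}.{}/{}:{}".format(account_id, region, uri_suffix, repository, tag)


print(ecr_image_uri("123456789012", "us-west-2", "sagemaker-processing-container"))
print(ecr_image_uri("123456789012", "cn-north-1", "sagemaker-processing-container"))
```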

In [6]:
import pprint as pp

ecr = boto3.client("ecr")
try:
    cr_res = ecr.create_repository(repositoryName=ecr_repository)
    pp.pprint(cr_res)
except ecr.exceptions.RepositoryAlreadyExistsException as e:
    # The repository already exists in your ECR registry; reuse it
    print(e)

An error occurred (RepositoryAlreadyExistsException) when calling the CreateRepository operation: The repository with name 'sagemaker-processing-container' already exists in the registry with id '688520471316'


In [9]:
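# Build the Docker image, log in to ECR, then tag and push the image
!docker build -t $ecr_repository .
# Note: `aws ecr get-login` exists only in AWS CLI v1; with AWS CLI v2, use
# `aws ecr get-login-password` piped into `docker login` instead.
!$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)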

!docker tag {ecr_repository + tag} $processing_repository_uri
!docker push $processing_repository_uri

Sending build context to Docker daemon  328.7kB
Step 1/7 : FROM python:3.7-slim-buster
 ---> fac67772ca5f
Step 2/7 : RUN pip3 install pandas==0.25.3 scikit-learn==0.21.3
 ---> Using cache
 ---> f9dfbb0dbe41
Step 3/7 : ENV PYTHONUNBUFFERED=TRUE
 ---> Using cache
 ---> e654b7466952
Step 4/7 : RUN mkdir /code
 ---> Running in 21cdf57cae50
Removing intermediate container 21cdf57cae50
 ---> ac402d1c249d
Step 5/7 : WORKDIR /code
 ---> Running in 5e2025ffb3c9
Removing intermediate container 5e2025ffb3c9
 ---> d0a10bc9f581
Step 6/7 : COPY preprocessing.py /code
 ---> 02b6f9cb0380
Step 7/7 : ENTRYPOINT ["python3", "preprocessing.py"]
 ---> Running in 61fffe192f5e
Removing intermediate container 61fffe192f5e
 ---> 338e005911df
Successfully built 338e005911df
Successfully tagged sagemaker-processing-container:latest
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
The push refers to repository [688520471316.dkr.ecr.us-west-2.amazonaws.com/sagemaker-pr
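The login step above exchanges your AWS credentials for a temporary Docker password. As background, the ECR `GetAuthorizationToken` API returns that credential as a base64-encoded `username:password` string, where the username is `AWS`. The sketch below decodes such a token. `decode_ecr_token` is an illustrative helper, and the token is fabricated so that the example runs without AWS access.

```python
import base64


def decode_ecr_token(authorization_token):
    """Decode a base64 'username:password' token into its two parts."""
    decoded = base64.b64decode(authorization_token).decode("utf-8")
    username, _, password = decoded.partition(":")
    return username, password


# Fabricated token for illustration only; a real one would come from ECR.
fake_token = base64.b64encode(b"AWS:not-a-real-password").decode("ascii")
print(decode_ecr_token(fake_token))
```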

## Configure the inputs for a Processing Job

The `Processor` class helps you configure a SageMaker [`CreateProcessingJob`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateProcessingJob.html) request. The basic configuration defines:
- The ECR location of your processing image
- The execution role that the SageMaker service assumes on your behalf
- The instance type on which to run your processing job
- The number of instances on which to run your processing job
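To make the mapping concrete, the sketch below shows roughly where these four settings sit in a `CreateProcessingJob` request body. It is a hand-written illustration, not the SDK's implementation. `build_processing_job_request` is a hypothetical helper, the job name, account ID, and role ARN are placeholders, and the 30 GB volume size is an assumed default.

```python
def build_processing_job_request(job_name, image_uri, role_arn,
                                 instance_type, instance_count, volume_size_gb=30):
    """Assemble the core fields of a CreateProcessingJob request as a dict."""
    return {
        "ProcessingJobName": job_name,
        # ECR location of the processing image
        "AppSpecification": {"ImageUri": image_uri},
        # Execution role SageMaker assumes on your behalf
        "RoleArn": role_arn,
        # Instance type, instance count, and attached storage
        "ProcessingResources": {
            "ClusterConfig": {
                "InstanceType": instance_type,
                "InstanceCount": instance_count,
                "VolumeSizeInGB": volume_size_gb,  # assumed default
            }
        },
    }


request = build_processing_job_request(
    job_name="example-processing-job",  # placeholder
    image_uri="123456789012.dkr.ecr.us-west-2.amazonaws.com/sagemaker-processing-container:latest",
    role_arn="arn:aws:iam::123456789012:role/ExampleRole",  # placeholder
    instance_type="ml.m5.xlarge",
    instance_count=1,
)
print(request["ProcessingResources"]["ClusterConfig"])
```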

In [None]:
from sagemaker.processing import Processor
help(Processor)

In [10]:
from sagemaker import get_execution_role

processor = Processor(
    role=get_execution_role(),
    image_uri=processing_repository_uri,
    instance_type='ml.m5.xlarge',
    instance_count=1
)


Now, configure the inputs and outputs and run the processing job.

In [22]:
from sagemaker.processing import ProcessingInput, ProcessingOutput
help(ProcessingInput)

Help on class ProcessingInput in module sagemaker.processing:

class ProcessingInput(builtins.object)
 |  Accepts parameters that specify an Amazon S3 input for a processing job.
 |  
 |  Also provides a method to turn those parameters into a dictionary.
 |  
 |  Methods defined here:
 |  
 |  __init__(self, source, destination, input_name=None, s3_data_type='S3Prefix', s3_input_mode='File', s3_data_distribution_type='FullyReplicated', s3_compression_type='None')
 |      Initializes a ``ProcessingInput`` instance. ``ProcessingInput`` accepts parameters
 |      that specify an Amazon S3 input for a processing job and provides a method
 |      to turn those parameters into a dictionary.
 |      
 |      Args:
 |          source (str): The source for the input. If a local path is provided, it will
 |              automatically be uploaded to S3 under:
 |              "s3://<default-bucket-name>/<job-name>/input/<input-name>".
 |          destination (str): The destination of the input.
 |

In [21]:
help(ProcessingOutput)

Help on class ProcessingOutput in module sagemaker.processing:

class ProcessingOutput(builtins.object)
 |  Accepts parameters that specify an Amazon S3 output for a processing job.
 |  
 |  It also provides a method to turn those parameters into a dictionary.
 |  
 |  Methods defined here:
 |  
 |  __init__(self, source, destination=None, output_name=None, s3_upload_mode='EndOfJob')
 |      Initializes a ``ProcessingOutput`` instance. ``ProcessingOutput`` accepts parameters that
 |      specify an Amazon S3 output for a processing job and provides a method to turn
 |      those parameters into a dictionary.
 |      
 |      Args:
 |          source (str): The source for the output.
 |          destination (str): The destination of the output. If a destination
 |              is not provided, one will be generated:
 |              "s3://<default-bucket-name>/<job-name>/output/<output-name>".
 |          output_name (str): The name of the output. If a name
 |              is not provide

In [25]:
import sagemaker

default_bucket = sagemaker.Session().default_bucket()

inputs = [
    ProcessingInput(
        source=input_data, # where SageMaker fetches input
        destination='/opt/ml/processing/input') # where SageMaker mounts the input into your container
    ]

outputs = [
    ProcessingOutput(
        output_name='train_data',
        source='/opt/ml/processing/train', # where SageMaker finds your output
        destination="s3://" + default_bucket + '/tmp/output/train_data' # where SageMaker saves your output
    ),
    ProcessingOutput(
        output_name='test_data',
        source='/opt/ml/processing/test',
        destination='s3://' + default_bucket + '/tmp/output/test_data'
    )
]
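Each `ProcessingOutput` pairs a directory inside the container (`source`) with an S3 prefix (`destination`). Files the script writes under the directory are uploaded under the prefix when the job ends. The sketch below models that mapping for a single file. `s3_uri_for_output_file` is an illustrative helper and `example-bucket` is a placeholder bucket name.

```python
import posixpath


def s3_uri_for_output_file(local_file, output_source, output_destination):
    """Map a file written under an output's container path to its S3 URI."""
    relative = posixpath.relpath(local_file, output_source)
    if relative.startswith(".."):
        raise ValueError("{} is not under {}".format(local_file, output_source))
    return output_destination.rstrip("/") + "/" + relative


print(
    s3_uri_for_output_file(
        "/opt/ml/processing/test/test_features.csv",
        "/opt/ml/processing/test",
        "s3://example-bucket/tmp/output/test_data",
    )
)
```

This is why `preprocessing.py` writes to `/opt/ml/processing/train` and `/opt/ml/processing/test`: those paths must match the `source` values configured above.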

In [26]:
from sagemaker.processing import ProcessingInput, ProcessingOutput

processor.run(
    inputs=inputs,
    outputs=outputs
)


Job Name:  sagemaker-processing-container-2021-05-14-20-37-28-155
Inputs:  [{'InputName': 'input-1', 'S3Input': {'S3Uri': 's3://sagemaker-sample-data-us-west-2/processing/census/census-income.csv', 'LocalPath': '/opt/ml/processing/input', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'train_data', 'S3Output': {'S3Uri': 's3://sagemaker-us-west-2-688520471316/tmp/output/train_data', 'LocalPath': '/opt/ml/processing/train', 'S3UploadMode': 'EndOfJob'}}, {'OutputName': 'test_data', 'S3Output': {'S3Uri': 's3://sagemaker-us-west-2-688520471316/tmp/output/test_data', 'LocalPath': '/opt/ml/processing/test', 'S3UploadMode': 'EndOfJob'}}]
.......................Received arguments Namespace(train_test_split_ratio=0.3)
Reading input data from /opt/ml/processing/input/census-income.csv
Data after cleaning: (68285, 9), 11401 positive examples, 56884 negative examples


In [29]:
!aws s3 ls s3://sagemaker-us-west-2-688520471316/tmp/output/test_data/

2021-05-14 20:41:15    6688214 test_features.csv
2021-05-14 20:41:15      40972 test_labels.csv
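The listing can be checked against the job log with a little arithmetic. The sketch below assumes that `train_test_split` rounds the test-set size up and that each line of `test_labels.csv` holds one digit plus a newline, which is 2 bytes. Under those assumptions, the 68,285 cleaned rows and the 0.3 ratio reproduce the 40,972-byte label file.

```python
import math

n_rows = 68285        # rows after cleaning, from the job log
test_ratio = 0.3      # default --train-test-split-ratio
bytes_per_label = 2   # assumption: "0\n" or "1\n" per row

n_test = math.ceil(test_ratio * n_rows)   # assumption: test size rounds up
n_train = n_rows - n_test
expected_label_bytes = n_test * bytes_per_label

print(n_test, n_train, expected_label_bytes)  # 20486 47799 40972
```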


## Summary
In this notebook, you learned how to build your own container for running a Processing Job and how to use the `Processor` class from the SageMaker Python SDK.