# Creating an ML Pipeline

This notebook puts together all the steps we have built so far. It runs the processing job and then generates the inference results in batch.

## Setup

In [16]:
import boto3
import sagemaker as sm
from datetime import datetime
from time import strftime, gmtime

bucket = sm.session.Session().default_bucket()
smclient = boto3.client('sagemaker')
s3client = boto3.client('s3')
role = sm.get_execution_role()
dask_repository_uri = '113147044314.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-dask-muse:latest'

prefix = "sagemaker/muse-dask-preprocess-demo"
input_prefix = prefix + "/input/book-depository/raw"
code_prefix = prefix + "/code"
input_preprocessed_prefix = prefix + "/input/book-depository/preprocessed"
input_descriptions_prefix = prefix + "/input/book-depository/descriptions"
input_rejected_prefix = prefix + "/input/book-depository/rejected"
input_reports_prefix = prefix + "/input/book-depository/reports"

output_inference_prefix = prefix + "/output/book-depository/inference"

## Run the Processing Job

In [24]:
from sagemaker.processing import ProcessingInput, ProcessingOutput, ScriptProcessor

dask_processor = ScriptProcessor(
    base_job_name="dask-preprocessor",
    image_uri=dask_repository_uri,
    command=["/opt/program/bootstrap.py"],
    role=role,
    instance_count=10,
    instance_type="ml.m5.xlarge",
    max_runtime_in_seconds=1200,
)

In [17]:
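timestamp_prefix = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
# Overrides the generated timestamp with a fixed one, so the job name and S3 paths
# match the outputs shown in this notebook. Remove this line to use a fresh timestamp.
timestamp_prefix = '2020-07-07-01-09-02'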

job_name = f'muse-dask-processing-{timestamp_prefix}'
s3_code_location = f"s3://{bucket}/{code_prefix}/preprocessing.py"
sample_ratio = 0.1

input_data = ProcessingInput(source=f"s3://{bucket}/{input_prefix}", destination='/opt/ml/processing/input', input_name='dataset')
output_data = ProcessingOutput(source='/opt/ml/processing/processed/', destination=f"s3://{bucket}/{input_preprocessed_prefix}/{timestamp_prefix}", output_name='processed-dataset')
descriptions_data = ProcessingOutput(source='/opt/ml/processing/descriptions/', destination=f"s3://{bucket}/{input_descriptions_prefix}/{timestamp_prefix}", output_name='descriptions-dataset')
rejected_data = ProcessingOutput(source='/opt/ml/processing/rejected', destination=f"s3://{bucket}/{input_rejected_prefix}/{timestamp_prefix}", output_name='rejected-dataset')
reports_on_data = ProcessingOutput(source='/opt/ml/processing/reports', destination=f"s3://{bucket}/{input_reports_prefix}/{timestamp_prefix}", output_name='dataset-reports')

print(f"Ready to execute {job_name}:\n\tScript location: {s3_code_location}")
print(f"\tInputs: {input_data.source} ({sample_ratio*100:0.2f}% sample)")
print(f"\tProcessed data destination: {output_data.destination}")
print(f"\tDescriptions data destination: {descriptions_data.destination}")
print(f"\tRejected data destination: {rejected_data.destination}")
print(f"\tReports destination: {reports_on_data.destination}")      

Ready to execute muse-dask-processing-2020-07-07-01-09-02:
	Script location: s3://sagemaker-eu-west-1-113147044314/sagemaker/muse-dask-preprocess-demo/code/preprocessing.py
	Inputs: s3://sagemaker-eu-west-1-113147044314/sagemaker/muse-dask-preprocess-demo/input/book-depository/raw (10.00% sample)
	Processed data destination: s3://sagemaker-eu-west-1-113147044314/sagemaker/muse-dask-preprocess-demo/input/book-depository/preprocessed/2020-07-07-01-09-02
	Descriptions data destination: s3://sagemaker-eu-west-1-113147044314/sagemaker/muse-dask-preprocess-demo/input/book-depository/descriptions/2020-07-07-01-09-02
	Rejected data destination: s3://sagemaker-eu-west-1-113147044314/sagemaker/muse-dask-preprocess-demo/input/book-depository/rejected/2020-07-07-01-09-02
	Reports destination: s3://sagemaker-eu-west-1-113147044314/sagemaker/muse-dask-preprocess-demo/input/book-depository/reports/2020-07-07-01-09-02


In [45]:
%%time

dask_processor.run(code=s3_code_location,
                   inputs=[input_data],
                   outputs=[output_data, descriptions_data, rejected_data, reports_on_data],
                   job_name=job_name,
                   arguments=['--sample', str(sample_ratio)]
                  )


Job Name:  muse-dask-processing-2020-07-07-01-09-02
Inputs:  [{'InputName': 'dataset', 'S3Input': {'S3Uri': 's3://sagemaker-eu-west-1-113147044314/sagemaker/muse-dask-preprocess-demo/input/book-depository/raw', 'LocalPath': '/opt/ml/processing/input', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-eu-west-1-113147044314/sagemaker/muse-dask-preprocess-demo/code/preprocessing.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'processed-dataset', 'S3Output': {'S3Uri': 's3://sagemaker-eu-west-1-113147044314/sagemaker/muse-dask-preprocess-demo/input/book-depository/preprocessed/2020-07-07-01-09-02', 'LocalPath': '/opt/ml/processing/processed/', 'S3UploadMode': 'EndOfJob'}}, {'OutputName': 'descriptions-dat

## Creating a Processing Job for Inference

We can reuse the base processing container and the output of the previous job to run inference. Here is the new script we need. It takes as input:
- The location of the generated data.
- The desired destination of the inference results.
- The name of the endpoint that serves the model we should use.
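The script does not receive the data locations as paths. It receives the input and output names (`dataset` and `inference-dataset` by default) and resolves them to local container paths through the processing job configuration file. A minimal sketch of that lookup follows, using a hand-written `sample_config` trimmed to the fields the script reads; the helper name `local_paths` is only illustrative.

```python
# Hand-written sample, trimmed to the fields the script reads.
# The names and paths mirror the job definitions used in this notebook.
sample_config = {
    "ProcessingInputs": [
        {"InputName": "dataset", "S3Input": {"LocalPath": "/opt/ml/processing/input/"}},
        {"InputName": "code", "S3Input": {"LocalPath": "/opt/ml/processing/input/code"}},
    ],
    "ProcessingOutputConfig": {
        "Outputs": [
            {"OutputName": "inference-dataset", "S3Output": {"LocalPath": "/opt/ml/processing/inference/"}},
        ]
    },
}


def local_paths(config):
    """Map each input and output name to its local path inside the container."""
    inputs = {i["InputName"]: i["S3Input"]["LocalPath"] for i in config["ProcessingInputs"]}
    outputs = {o["OutputName"]: o["S3Output"]["LocalPath"] for o in config["ProcessingOutputConfig"]["Outputs"]}
    return inputs, outputs


inputs, outputs = local_paths(sample_config)
print(inputs["dataset"])
print(outputs["inference-dataset"])
```

Because the lookup is by name, the `input_name` and `output_name` values passed to `ProcessingInput` and `ProcessingOutput` have to match the script's `--data-to-process` and `--inference-data` arguments.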

In [68]:
%%writefile inference.py

import argparse
import json
import logging
import os
import sys
import boto3
import time
import csv
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client

def invoke_endpoint(row: pd.Series, region, endpoint):
    runtime = boto3.client('runtime.sagemaker', region_name=region)
    payload = row.description
    response = runtime.invoke_endpoint(EndpointName=endpoint,
                                       ContentType='text/csv',
                                       Body=payload)
    result = json.loads(response['Body'].read().decode())['output']
    return(str(result))

def invoke_inference(df, region, endpoint):
    return(df.apply(invoke_endpoint, axis=1, region=region, endpoint=endpoint))

def gen_inference(source_data_dir, inference_data_dir, endpoint_name, region, block_size='32MB', sample=1.0):
    print("---------------------------------------------")
    print(f"Loading data from {source_data_dir}.")
    if sample < 1.0:
        print(f"Taking a fraction of {sample:0.2f} of the data")
    print("---------------------------------------------")
    data = dd.read_csv(
        f'{source_data_dir}/dataset-*.csv', header=0, 
        usecols=['description', 'title'],
        blocksize=block_size,
    ).repartition(partition_size=block_size).sample(frac=sample)
    
    data['embedding'] = data.map_partitions(
        invoke_inference,
        meta=pd.Series(name='embedding', dtype='U'),
        region=region,
        endpoint=endpoint_name
    )
    print(f"Saving inference dataset to {inference_data_dir}")
    data.to_csv(f'{inference_data_dir}/dataset-*.csv', compute=True, index=False, quoting=csv.QUOTE_NONNUMERIC)    


def start_dask_cluster(scheduler_ip):
    # Start the Dask cluster client
    try:
        client = Client("tcp://{ip}:8786".format(ip=scheduler_ip))
        logging.info("Cluster information: {}".format(client))
    except Exception as err:
        logging.exception(err)


def parse_processing_job_config(config_file="/opt/ml/config/processingjobconfig.json"):
    with open(config_file, "r") as config_file:
        config = json.load(config_file)
    inputs = {in_path["InputName"]: in_path["S3Input"]["LocalPath"] for in_path in config["ProcessingInputs"]}
    outputs = {out_path["OutputName"]: out_path["S3Output"]["LocalPath"] for out_path in config["ProcessingOutputConfig"]["Outputs"]}
    return (inputs, outputs)
    
    
def parse_arguments():
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-to-process", type=str, default="dataset")
    parser.add_argument("--inference-data", type=str, default="inference-dataset")
    parser.add_argument("--endpoint-name", type=str, default="muse-large")
    parser.add_argument("--region", type=str, default="us-east-1")
    parser.add_argument("--block-size", type=str, default="32MB")
    parser.add_argument("--scheduler-ip", type=str, default=sys.argv[-1])
    parser.add_argument("--sample", type=float, default=1.0)
    args, _ = parser.parse_known_args()
    
    # Get processor script arguments as flag/value pairs
    args_iter = iter(sys.argv[1:])
    script_args = dict(zip(args_iter, args_iter))
    return(args, script_args)


if __name__ == '__main__':
    inputs, outputs = parse_processing_job_config()
    args, script_args = parse_arguments()
    start_dask_cluster(args.scheduler_ip)
    
    print('----------------------------------------------------')
    print('Starting inference')
    print('----------------------------------------------------')
    gen_inference(
        source_data_dir=inputs[args.data_to_process], 
        inference_data_dir=outputs[args.inference_data],
        endpoint_name=args.endpoint_name,
        region=args.region,
        block_size=args.block_size,
        sample=args.sample
    )
    print('----------------------------------------------------')
    print('Inference finished')
    print('----------------------------------------------------')

Overwriting inference.py
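The `dict(zip(args_iter, args_iter))` line in `parse_arguments` pairs consecutive command-line items into flag/value entries by pulling from the same iterator twice. The standalone sketch below illustrates the idiom on a hypothetical argument list; the trailing IP address is an assumption standing in for the scheduler address, which the `--scheduler-ip` default suggests is appended last.

```python
def pair_arguments(argv):
    """Pair consecutive items as flag/value; an unpaired trailing item is dropped."""
    args_iter = iter(argv)
    # zip pulls two items per step from the same iterator: one flag, one value.
    return dict(zip(args_iter, args_iter))


# Hypothetical argument list, not taken from a real job.
example_argv = ["--sample", "0.1", "--endpoint-name", "muse-large", "10.0.0.1"]
script_args = pair_arguments(example_argv)
print(script_args)
```

The values stay as strings, and an odd trailing item is silently discarded, so this dictionary is a convenience view rather than a replacement for the typed values returned by `argparse`.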


In [27]:
parallelism = 10

inference_processor = ScriptProcessor(
    base_job_name="dask-inference",
    image_uri=dask_repository_uri,
    command=["/opt/program/bootstrap.py"],
    role=role,
    instance_count=parallelism-5,
    instance_type="ml.m5.xlarge",
    max_runtime_in_seconds=1200,
)

In [19]:
model_descr = smclient.describe_model(ModelName='model-muse-large-000002')
model_name = model_descr['ModelName']
model_data = model_descr['PrimaryContainer']['ModelDataUrl']
model_image = model_descr['PrimaryContainer']['Image']

In [33]:
model = sm.model.Model(model_data = model_data,
                       image=model_image,
                       role=role, 
                       predictor_cls=sm.predictor.RealTimePredictor,
                       name=model_name)

In [38]:
predictor = model.deploy(initial_instance_count=parallelism, instance_type='ml.c5.2xlarge', endpoint_name='muse-large')

Using already existing model: model-muse-large-000002


---------------!

In [69]:
timestamp_prefix_2 = strftime("%Y-%m-%d-%H-%M-%S", gmtime())

job_name = f'muse-dask-inference-{timestamp_prefix_2}'
s3_code_location = f"s3://{bucket}/{code_prefix}/inference.py"
s3client.upload_file("inference.py", Bucket=bucket, Key=f"{code_prefix}/inference.py")
sample_ratio = 0.1

input_data = ProcessingInput(source=f"s3://{bucket}/{input_preprocessed_prefix}/{timestamp_prefix}", destination='/opt/ml/processing/input/', input_name='dataset')
inference_data = ProcessingOutput(source='/opt/ml/processing/inference/', destination=f"s3://{bucket}/{output_inference_prefix}/{timestamp_prefix_2}", output_name='inference-dataset')

print(f"Ready to execute {job_name}:\n\tScript location: {s3_code_location}")
print(f"\tInputs: {input_data.source} ({sample_ratio*100:0.2f}% sample)")
print(f"\tInference data destination: {inference_data.destination}")
#print(f"\t Using endpoint: {predictor.endpoint}")

Ready to execute muse-dask-inference-2020-07-07-03-02-05:
	Script location: s3://sagemaker-eu-west-1-113147044314/sagemaker/muse-dask-preprocess-demo/code/inference.py
	Inputs: s3://sagemaker-eu-west-1-113147044314/sagemaker/muse-dask-preprocess-demo/input/book-depository/preprocessed/2020-07-07-01-09-02 (10.00% sample)
	Inference data destination: s3://sagemaker-eu-west-1-113147044314/sagemaker/muse-dask-preprocess-demo/output/book-depository/inference/2020-07-07-03-02-05


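And here is the call that runs inference. Check the `region` argument before executing: the script uses it to create the SageMaker runtime client, so it must match the region where the endpoint is deployed. Most labs run in `us-east-1`, while this example uses `eu-west-1`.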

In [70]:
%%time

inference_processor.run(code=s3_code_location,
                        inputs=[input_data],
                        outputs=[inference_data],
                        job_name=job_name,
                        arguments=[
                            '--sample', str(sample_ratio),
                            '--endpoint-name', 'muse-large',
                            '--region', 'eu-west-1'
                        ]
                  )


Job Name:  muse-dask-inference-2020-07-07-03-02-05
Inputs:  [{'InputName': 'dataset', 'S3Input': {'S3Uri': 's3://sagemaker-eu-west-1-113147044314/sagemaker/muse-dask-preprocess-demo/input/book-depository/preprocessed/2020-07-07-01-09-02', 'LocalPath': '/opt/ml/processing/input/', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-eu-west-1-113147044314/sagemaker/muse-dask-preprocess-demo/code/inference.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'inference-dataset', 'S3Output': {'S3Uri': 's3://sagemaker-eu-west-1-113147044314/sagemaker/muse-dask-preprocess-demo/output/book-depository/inference/2020-07-07-03-02-05', 'LocalPath': '/opt/ml/processing/inference/', 'S3UploadMode': 'EndOfJob'}}]
.........
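Since a mismatched `--region` value is an easy mistake to make, one option is to derive it instead of typing it. The sketch below extracts the region from an ECR image URI such as `dask_repository_uri`. It assumes the endpoint lives in the same region as the container image, which holds for this notebook but is not guaranteed in general; `region_from_ecr_uri` is an illustrative helper, not part of any SDK.

```python
import re


def region_from_ecr_uri(image_uri):
    """Return the region embedded in an ECR image URI, or None if the URI has another shape."""
    match = re.match(r"^\d+\.dkr\.ecr\.([a-z0-9-]+)\.amazonaws\.com/", image_uri)
    return match.group(1) if match else None


# Same value as the dask_repository_uri defined in Setup.
dask_repository_uri = "113147044314.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-dask-muse:latest"
region = region_from_ecr_uri(dask_repository_uri)
print(region)
```

The result could then be passed as `'--region', region` in the `arguments` list. Reading `boto3.session.Session().region_name` is another way to get the region the notebook itself runs in.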

Let's check if the job finished correctly...

In [84]:
inference_processor.jobs[-1].describe()['ProcessingJobStatus']

'Completed'
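When a job is started without waiting for it to finish, the status has to be checked repeatedly. The following is a minimal polling sketch, not part of the SageMaker SDK: `wait_for_status` is a hypothetical helper that takes any callable returning a dictionary with a `ProcessingJobStatus` key, and a fake callable stands in for a real job so the example runs on its own.

```python
import time

# Statuses after which a processing job no longer changes state.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_status(describe, poll_seconds=30, max_polls=60, sleep=time.sleep):
    """Call describe() until ProcessingJobStatus is terminal or max_polls is reached."""
    status = None
    for _ in range(max_polls):
        status = describe()["ProcessingJobStatus"]
        if status in TERMINAL_STATUSES:
            break
        sleep(poll_seconds)
    return status


# Stand-in for a real job: reports InProgress twice, then Completed.
fake_responses = iter(["InProgress", "InProgress", "Completed"])


def fake_describe():
    return {"ProcessingJobStatus": next(fake_responses)}


final_status = wait_for_status(fake_describe, poll_seconds=0)
print(final_status)
```

With a real job, the callable could be `inference_processor.jobs[-1].describe`, the same method used in the cell above.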

Now download one of the resulting files to inspect it. If you ran with a large sample, the filename may need to be changed to `dataset-00.csv`.

In [83]:
# Use a separate variable so the `prefix` defined in Setup is not overwritten.
inference_key = "/".join(inference_data.destination.split("/")[3:]) + '/dataset-0.csv'
s3client.download_file(bucket, inference_key, 'inference-sample.csv')

The file should now appear in your file browser. You can open it with Jupyter.
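The download cell builds the object key by splitting the destination URI on `/` and dropping the first three pieces. A small helper based on `urllib.parse` makes the same split more explicit. This is a sketch only: `split_s3_uri` is an illustrative name and the URI below is made up, with the same shape as the inference destination.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split s3://bucket/key/parts into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"Not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical URI with the same shape as inference_data.destination.
example_uri = "s3://my-bucket/sagemaker/demo/output/inference/2020-07-07-03-02-05"
bucket_name, key_prefix = split_s3_uri(example_uri)
print(bucket_name)
print(key_prefix)
```

Appending `/dataset-0.csv` to `key_prefix` gives the kind of key that `download_file` expects.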