## Workflow Creation using SageMaker Pipelines (Simple - Only One Step)


This notebook shows how to:

1. Define a set of Pipeline parameters that can be used to parametrize a SageMaker Pipeline.
2. Define a Processing step that cleans the input data, performs feature engineering, and splits the result into training, validation, and test datasets.
3. Start a Pipeline execution and wait for it to complete.

![A typical ML Application pipeline](./img/pipeline-full.png)

#### Imports 

In [1]:
from sagemaker.workflow.parameters import ParameterInteger,ParameterString
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.steps import ProcessingStep
from sagemaker.workflow.pipeline import Pipeline
import pandas as pd
import sagemaker
import logging
import boto3
import json

##### Setup logger

In [2]:
logger = logging.getLogger(__name__)  # use the module name, not the literal string '__name__'
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())

##### Essentials

In [3]:
region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session()
role = sagemaker.get_execution_role()
bucket = sagemaker_session.default_bucket()
model_package_group_name = 'Abalone'

Couldn't call 'get_role' to get Role ARN from role name AmazonSageMaker-ExecutionRole-20210522T230509 to get Role path.
Assuming role was created in SageMaker AWS console, as the name contains `AmazonSageMaker-ExecutionRole`. Defaulting to Role ARN with service-role in path. If this Role ARN is incorrect, please add IAM read permissions to your role or supply the Role Arn directly.


In [4]:
print(f'Default bucket = {bucket}')

Default bucket = sagemaker-us-east-1-892313895307


#### Prep data

This notebook uses the [UCI Machine Learning Abalone Dataset](https://archive.ics.uci.edu/ml/datasets/abalone). The task is to predict the age of an abalone from its physical measurements, which makes this a regression problem. The target column is `rings`; according to the dataset description, adding 1.5 to the ring count gives the age in years.

In [5]:
df = pd.read_csv('./data/abalone.csv')
df.head(5)

| | sex | length | diameter | height | whole_weight | shucked_weight | viscera_weight | shell_weight | rings |
|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 15 |
| 1 | M | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 7 |
| 2 | F | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 9 |
| 3 | M | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 10 |
| 4 | I | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 7 |


In [6]:
df.pop('rings')
df.head(5)

| | sex | length | diameter | height | whole_weight | shucked_weight | viscera_weight | shell_weight |
|---|---|---|---|---|---|---|---|---|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 |
| 1 | M | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 |
| 2 | F | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 |
| 3 | M | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 |
| 4 | I | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 |


In [7]:
list(df.columns)

['sex',
 'length',
 'diameter',
 'height',
 'whole_weight',
 'shucked_weight',
 'viscera_weight',
 'shell_weight']

Copy the data files from the local `data` directory to S3

In [8]:
!aws s3 cp ./data/abalone.csv s3://{bucket}/abalone/

upload: data/abalone.csv to s3://sagemaker-us-east-1-892313895307/abalone/abalone.csv


In [9]:
!aws s3 cp ./data/abalone-unlabeled.csv s3://{bucket}/abalone/

upload: data/abalone-unlabeled.csv to s3://sagemaker-us-east-1-892313895307/abalone/abalone-unlabeled.csv


In [10]:
input_data_uri = f's3://{bucket}/abalone/abalone.csv'
batch_data_uri = f's3://{bucket}/abalone/abalone-unlabeled.csv'

### 1. Define Pipeline-level parameters 

In [11]:
processing_instance_count = ParameterInteger(name='ProcessingInstanceCount', default_value=1)
processing_instance_type = ParameterString(name='ProcessingInstanceType', default_value='ml.m5.xlarge')
training_instance_type = ParameterString(name='TrainingInstanceType', default_value='ml.m5.xlarge')
model_approval_status = ParameterString(name='ModelApprovalStatus', default_value='PendingManualApproval')
input_data = ParameterString(name='InputData', default_value=input_data_uri)
batch_data = ParameterString(name='BatchData', default_value=batch_data_uri)

![Define Parameters](./img/pipeline-1.png)

### 2. Feature Engineering

* Fill in missing sex category data and encode it so that it is suitable for training.
* Scale and normalize all numerical fields, that is, every column except the categorical `sex` column and the target `rings` column.
* Split the data into training, validation, and test datasets.
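The first two bullets can be illustrated without any ML library. The following is a minimal pure-Python sketch of what standard scaling and one-hot encoding compute, using the `length` and `sex` values from the first five rows shown earlier. The helper names `standard_scale` and `one_hot` are hypothetical and are not part of the preprocessing script, which relies on scikit-learn's `StandardScaler` and `OneHotEncoder` instead.

```python
from statistics import mean, pstdev


def standard_scale(values):
    """Center values to mean 0 and scale to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # avoid division by zero for constant columns
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Return the sorted categories and one indicator row per input value."""
    categories = sorted(set(values))
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return categories, rows


# Values taken from the first five rows of the dataset preview.
lengths = [0.455, 0.35, 0.53, 0.44, 0.33]
sexes = ['M', 'M', 'F', 'M', 'I']

scaled = standard_scale(lengths)
categories, encoded = one_hot(sexes)

print(categories)   # category order used for the indicator columns
print(encoded[0])   # indicator row for the first abalone ('M')
```

After scaling, the column has mean 0 and standard deviation 1, and each `sex` value becomes a row with a single 1.0 in the column of its category.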

In [12]:
%%writefile src/preprocessing.py
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
import pandas as pd
import numpy as np
import argparse
import requests
import tempfile
import logging
import sklearn
import os


logger = logging.getLogger(__name__)  # use the module name, not the literal string '__name__'
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())

logger.info(f'Using Sklearn version: {sklearn.__version__}')


if __name__ == '__main__':
    logger.info('Sklearn Preprocessing Job [Start]')
    base_dir = '/opt/ml/processing'

    df = pd.read_csv(f'{base_dir}/input/abalone.csv')
    y = df.pop('rings')
    cols = df.columns
    logger.info(f'Columns = {cols}')

    numeric_features = list(df.columns)
    numeric_features.remove('sex')
    numeric_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')), 
                                          ('scaler', StandardScaler())])

    categorical_features = ['sex']
    categorical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
                                              ('onehot', OneHotEncoder(handle_unknown='ignore'))])

    preprocess = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_features), 
                                                 ('cat', categorical_transformer, categorical_features)])

    X_pre = preprocess.fit_transform(df)
    y_pre = y.to_numpy().reshape(len(y), 1)

    X = np.concatenate((y_pre, X_pre), axis=1)

    np.random.shuffle(X)
    train, validation, test = np.split(X, [int(0.7 * len(X)), int(0.85 * len(X))])

    pd.DataFrame(train).to_csv(f'{base_dir}/train/train.csv', header=False, index=False)
    pd.DataFrame(validation).to_csv(f'{base_dir}/validation/validation.csv', header=False, index=False)
    pd.DataFrame(test).to_csv(f'{base_dir}/test/test.csv', header=False, index=False)
    logger.info('Sklearn Preprocessing Job [End]')

Overwriting src/preprocessing.py
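The script above shuffles the rows and then calls `np.split` with cut points at 70% and 85% of the row count, which yields a 70/15/15 split. The following pure-Python sketch mirrors that logic with list slicing so the arithmetic is easy to check. The function `three_way_split` and its `seed` argument are assumptions added for illustration; the script itself shuffles without a fixed seed, so its split differs from run to run.

```python
import random


def three_way_split(rows, cuts=(0.7, 0.85), seed=None):
    """Shuffle rows, then slice them at the given fractional cut points."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    first_cut = int(cuts[0] * len(rows))
    second_cut = int(cuts[1] * len(rows))
    return rows[:first_cut], rows[first_cut:second_cut], rows[second_cut:]


# 100 placeholder rows stand in for the preprocessed feature matrix.
rows = list(range(100))
train, validation, test = three_way_split(rows, seed=0)

print(len(train), len(validation), len(test))
```

Every row lands in exactly one of the three subsets, and passing a seed makes the split reproducible.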


In [13]:
framework_version = '0.23-1'

sklearn_processor = SKLearnProcessor(framework_version=framework_version, 
                                     instance_type=processing_instance_type, 
                                     instance_count=processing_instance_count, 
                                     base_job_name='sklearn-abalone-preprocess', 
                                     role=role)

In [14]:
step_process = ProcessingStep(name='AbalonePreprocess', 
                              processor=sklearn_processor, 
                              inputs=[ProcessingInput(source=input_data, destination='/opt/ml/processing/input')], 
                              outputs=[ProcessingOutput(output_name='train', source='/opt/ml/processing/train'), 
                                       ProcessingOutput(output_name='validation', source='/opt/ml/processing/validation'), 
                                       ProcessingOutput(output_name='test', source='/opt/ml/processing/test')], 
                              code='src/preprocessing.py')

In [15]:
step_process.__dict__

{'name': 'AbalonePreprocess',
 'step_type': <StepTypeEnum.PROCESSING: 'Processing'>,
 'depends_on': None,
 'processor': <sagemaker.sklearn.processing.SKLearnProcessor at 0x7f94f49b06d0>,
 'inputs': [<sagemaker.processing.ProcessingInput at 0x7f94f41dbfd0>],
 'outputs': [<sagemaker.processing.ProcessingOutput at 0x7f94f423afd0>,
  <sagemaker.processing.ProcessingOutput at 0x7f94f423af50>,
  <sagemaker.processing.ProcessingOutput at 0x7f94f423aed0>],
 'job_arguments': None,
 'code': 'src/preprocessing.py',
 'property_files': None,
 '_properties': <sagemaker.workflow.properties.Properties at 0x7f94f423add0>,
 'cache_config': None}

![Define a Processing Step for Feature Engineering](img/pipeline-2.png)

### 3. Start a Pipeline Execution

In [16]:
pipeline_name = 'AbalonePipeline'

pipeline = Pipeline(name=pipeline_name, 
                    parameters=[processing_instance_type, 
                                processing_instance_count, 
                                input_data, 
                                batch_data], 
                    steps=[step_process])

Examine the Pipeline definition

In [17]:
definition = json.loads(pipeline.definition())
definition

{'Version': '2020-12-01',
 'Metadata': {},
 'Parameters': [{'Name': 'ProcessingInstanceType',
   'Type': 'String',
   'DefaultValue': 'ml.m5.xlarge'},
  {'Name': 'ProcessingInstanceCount', 'Type': 'Integer', 'DefaultValue': 1},
  {'Name': 'InputData',
   'Type': 'String',
   'DefaultValue': 's3://sagemaker-us-east-1-892313895307/abalone/abalone.csv'},
  {'Name': 'BatchData',
   'Type': 'String',
   'DefaultValue': 's3://sagemaker-us-east-1-892313895307/abalone/abalone-unlabeled.csv'}],
 'Steps': [{'Name': 'AbalonePreprocess',
   'Type': 'Processing',
   'Arguments': {'ProcessingResources': {'ClusterConfig': {'InstanceType': {'Get': 'Parameters.ProcessingInstanceType'},
      'InstanceCount': {'Get': 'Parameters.ProcessingInstanceCount'},
      'VolumeSizeInGB': 30}},
    'AppSpecification': {'ImageUri': '683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3',
     'ContainerEntrypoint': ['python3',
      '/opt/ml/processing/input/code/preprocessing.py']},
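In the definition, step arguments refer to Pipeline parameters through `{'Get': 'Parameters.<Name>'}` entries. The sketch below uses a trimmed copy of the printed definition to show how such references map to concrete values once defaults and overrides are known. The `resolve` helper and the `overrides` dictionary are hypothetical; this is a local illustration only, and the actual resolution is done by the SageMaker Pipelines service when an execution starts.

```python
import json

# Trimmed copy of the definition printed above; the real document has more fields.
definition_json = '''
{
  "Version": "2020-12-01",
  "Parameters": [
    {"Name": "ProcessingInstanceType", "Type": "String", "DefaultValue": "ml.m5.xlarge"},
    {"Name": "ProcessingInstanceCount", "Type": "Integer", "DefaultValue": 1}
  ],
  "Steps": [
    {"Name": "AbalonePreprocess", "Type": "Processing",
     "Arguments": {"ProcessingResources": {"ClusterConfig": {
       "InstanceType": {"Get": "Parameters.ProcessingInstanceType"},
       "InstanceCount": {"Get": "Parameters.ProcessingInstanceCount"},
       "VolumeSizeInGB": 30}}}}
  ]
}
'''


def resolve(node, values):
    """Recursively replace {'Get': 'Parameters.X'} references with values['X']."""
    if isinstance(node, dict):
        if set(node) == {'Get'} and node['Get'].startswith('Parameters.'):
            return values[node['Get'].split('.', 1)[1]]
        return {key: resolve(child, values) for key, child in node.items()}
    if isinstance(node, list):
        return [resolve(child, values) for child in node]
    return node


definition = json.loads(definition_json)
defaults = {p['Name']: p['DefaultValue'] for p in definition['Parameters']}
overrides = {'ProcessingInstanceCount': 2}  # hypothetical override for one execution
values = {**defaults, **overrides}

arguments = resolve(definition['Steps'][0]['Arguments'], values)
cluster = arguments['ProcessingResources']['ClusterConfig']
print(cluster)
```

Parameters without an override fall back to their `DefaultValue`, which is why the instance type stays at `ml.m5.xlarge` here.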
 

#### Kickstart Pipeline execution

In [18]:
pipeline.upsert(role_arn=role)

{'PipelineArn': 'arn:aws:sagemaker:us-east-1:892313895307:pipeline/abalonepipeline',
 'ResponseMetadata': {'RequestId': 'f87f634f-e487-4e61-8d5b-bc7beccca7bb',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'f87f634f-e487-4e61-8d5b-bc7beccca7bb',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '83',
   'date': 'Sun, 23 May 2021 20:10:43 GMT'},
  'RetryAttempts': 0}}

In [19]:
execution = pipeline.start()

In [20]:
execution.describe()

{'PipelineArn': 'arn:aws:sagemaker:us-east-1:892313895307:pipeline/abalonepipeline',
 'PipelineExecutionArn': 'arn:aws:sagemaker:us-east-1:892313895307:pipeline/abalonepipeline/execution/by6e2643ovj5',
 'PipelineExecutionDisplayName': 'execution-1621800644848',
 'PipelineExecutionStatus': 'Executing',
 'CreationTime': datetime.datetime(2021, 5, 23, 20, 10, 44, 676000, tzinfo=tzlocal()),
 'LastModifiedTime': datetime.datetime(2021, 5, 23, 20, 10, 44, 676000, tzinfo=tzlocal()),
 'CreatedBy': {'UserProfileArn': 'arn:aws:sagemaker:us-east-1:892313895307:user-profile/d-dowart1jabkf/team-v',
  'UserProfileName': 'team-v',
  'DomainId': 'd-dowart1jabkf'},
 'LastModifiedBy': {'UserProfileArn': 'arn:aws:sagemaker:us-east-1:892313895307:user-profile/d-dowart1jabkf/team-v',
  'UserProfileName': 'team-v',
  'DomainId': 'd-dowart1jabkf'},
 'ResponseMetadata': {'RequestId': '54bdeb31-b405-4069-98d7-daeb17c55146',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '54bdeb31-b405-4069-98d7

In [21]:
execution.wait()
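Conceptually, waiting for an execution means polling its status until it reaches a terminal state. The sketch below shows that idea with a stubbed `describe` callable so it runs without AWS access. The function `wait_for_execution`, its default arguments, and the stub are assumptions for illustration and do not reproduce how the SDK implements `execution.wait()`.

```python
import time

# Statuses after which a pipeline execution no longer changes.
TERMINAL_STATUSES = {'Succeeded', 'Failed', 'Stopped'}


def wait_for_execution(describe, delay=30, max_attempts=60, sleep=time.sleep):
    """Poll describe() until a terminal status appears or attempts run out."""
    for _ in range(max_attempts):
        status = describe()['PipelineExecutionStatus']
        if status in TERMINAL_STATUSES:
            return status
        sleep(delay)
    raise TimeoutError(f'Execution still running after {max_attempts} polls')


# Stub that imitates an execution finishing on the third poll.
responses = iter(['Executing', 'Executing', 'Succeeded'])


def fake_describe():
    return {'PipelineExecutionStatus': next(responses)}


final_status = wait_for_execution(fake_describe, delay=0)
print(final_status)
```

With a real execution object, the same helper could be called as `wait_for_execution(execution.describe)`.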

In [22]:
execution.list_steps()

[{'StepName': 'AbalonePreprocess',
  'StartTime': datetime.datetime(2021, 5, 23, 20, 10, 45, 166000, tzinfo=tzlocal()),
  'EndTime': datetime.datetime(2021, 5, 23, 20, 14, 58, 253000, tzinfo=tzlocal()),
  'StepStatus': 'Succeeded',
  'Metadata': {'ProcessingJob': {'Arn': 'arn:aws:sagemaker:us-east-1:892313895307:processing-job/pipelines-by6e2643ovj5-abalonepreprocess-qy49bwcpyu'}}}]
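The list returned by `execution.list_steps()` can be summarized into a status and duration per step. The following sketch works on a hand-copied subset of the output above, with timezone information omitted for brevity; the `summarize` helper is hypothetical.

```python
from datetime import datetime

# Subset of the list_steps() output shown above (tzinfo omitted).
steps = [{
    'StepName': 'AbalonePreprocess',
    'StartTime': datetime(2021, 5, 23, 20, 10, 45, 166000),
    'EndTime': datetime(2021, 5, 23, 20, 14, 58, 253000),
    'StepStatus': 'Succeeded',
}]


def summarize(steps):
    """Return (name, status, duration in seconds or None) for each step."""
    rows = []
    for step in steps:
        end = step.get('EndTime')  # missing while a step is still running
        seconds = (end - step['StartTime']).total_seconds() if end else None
        rows.append((step['StepName'], step['StepStatus'], seconds))
    return rows


summary = summarize(steps)
for name, status, seconds in summary:
    duration = f'{seconds:.1f}s' if seconds is not None else 'in progress'
    print(f'{name}: {status} ({duration})')
```

For the run shown here, the preprocessing step took a little over four minutes.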