# Inference Pipeline with Scikit-learn and Linear Learner
Typically a Machine Learning (ML) process consists of few steps: data gathering with various ETL jobs, pre-processing the data, featurizing the dataset by incorporating standard techniques or prior knowledge, and finally training an ML model using an algorithm. 
In many cases, when the trained model is used for processing real time or batch prediction requests, the model receives data in a format which needs to pre-processed (e.g. featurized) before it can be passed to the algorithm. In the following notebook, we will demonstrate how you can build your ML Pipeline leveraging the Sagemaker Scikit-learn container and SageMaker Linear Learner algorithm & after the model is trained, deploy the Pipeline (Data preprocessing and Lineara Learner) as an Inference Pipeline behind a single Endpoint for real time inference and for batch inferences using Amazon SageMaker Batch Transform.

We will demonstrate this using the Abalone Dataset to guess the age of Abalone with physical features. The dataset is available from [UCI Machine Learning](https://archive.ics.uci.edu/ml/datasets/abalone); the aim for this task is to determine age of an Abalone (a kind of shellfish) from its physical measurements. We'll use Sagemaker's Scikit-learn container to featurize the dataset so that it can be used for training with Linear Learner.

### Table of contents
* [Preprocessing data and training the model](#training)
 * [Upload the data for training](#upload_data)
 * [Create a Scikit-learn script to train with](#create_sklearn_script)
 * [Create SageMaker Scikit Estimator](#create_sklearn_estimator)
 * [Batch transform our training data](#preprocess_train_data)
 * [Fit a LinearLearner Model with the preprocessed data](#training_model)
* [Inference Pipeline with Scikit preprocessor and Linear Learner](#inference_pipeline)
 * [Set up the inference pipeline](#pipeline_setup)
 * [Make a request to our pipeline endpoint](#pipeline_inference_request)
 * [Delete Endpoint](#delete_endpoint)

Let's first create our Sagemaker session and role, and create a S3 prefix to use for the notebook example.

In [6]:
!pwd

/home/ec2-user/SageMaker/aws-ml/pipeline/sklearn-pipeline-linear


In [7]:
import sagemaker
from sagemaker import get_execution_role
import pandas as pd

sagemaker_session = sagemaker.Session()

# Get a SageMaker-compatible role used by this Notebook Instance.
role = get_execution_role()

# S3 prefix
bucket = sagemaker_session.default_bucket()
prefix = 'Scikit-LinearLearner-pipeline-abalone-example'

In [8]:
bucket

'sagemaker-us-east-1-120286446822'

# Preprocessing data and training the model <a class="anchor" id="training"></a>
## Downloading dataset <a class="anchor" id="download_data"></a>
SageMaker team has downloaded the dataset from UCI and uploaded to one of the S3 buckets in our account.

In [9]:
!wget --directory-prefix=./abalone_data https://s3-us-west-2.amazonaws.com/sparkml-mleap/data/abalone/abalone.csv

--2020-09-16 11:04:56--  https://s3-us-west-2.amazonaws.com/sparkml-mleap/data/abalone/abalone.csv
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 52.218.232.248
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.218.232.248|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 191873 (187K) [binary/octet-stream]
Saving to: ‘./abalone_data/abalone.csv.4’


2020-09-16 11:04:57 (855 KB/s) - ‘./abalone_data/abalone.csv.4’ saved [191873/191873]



## Upload the data for training <a class="anchor" id="upload_data"></a>

When training large models with huge amounts of data, you'll typically use big data tools, like Amazon Athena, AWS Glue, or Amazon EMR, to create your data in S3. We can use the tools provided by the SageMaker Python SDK to upload the data to a default bucket. 

In [10]:
RAW_FILE = 'abalone.csv'
WORK_DIRECTORY = 'abalone_data'

RAW_FILE_PATH  = WORK_DIRECTORY + "/abalone.csv"
RAW_TRAIN_PATH = WORK_DIRECTORY + "/abalone_train.csv"
RAW_TEST_PATH  = WORK_DIRECTORY + "/abalone_test.csv"


X = pd.read_csv(filepath_or_buffer=RAW_FILE_PATH, header=None)

train_data = X.head(int(len(X)*0.8)).copy()
test_data  = X.tail(int(len(X)*0.2)).copy()

train_data.to_csv(path_or_buf=RAW_TRAIN_PATH, index=False)
test_data.to_csv(path_or_buf=RAW_TEST_PATH, index=False)

print(len(train_data), len(test_data))
print(len(train_data)+len(test_data))

print(X.shape)
X.head(2)

3341 835
4176
(4177, 9)


Unnamed: 0,0,1,2,3,4,5,6,7,8
0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
1,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7


In [11]:
train_input = sagemaker_session.upload_data(
    path=RAW_TRAIN_PATH, 
    bucket=bucket,
    key_prefix='{}/{}'.format(prefix, 'train'))

test_input = sagemaker_session.upload_data(
    path=RAW_TEST_PATH, 
    bucket=bucket,
    key_prefix='{}/{}'.format(prefix, 'test'))

train_input, test_input

('s3://sagemaker-us-east-1-120286446822/Scikit-LinearLearner-pipeline-abalone-example/train/abalone_train.csv',
 's3://sagemaker-us-east-1-120286446822/Scikit-LinearLearner-pipeline-abalone-example/test/abalone_test.csv')

## Create a Scikit-learn script to train with <a class="anchor" id="create_sklearn_script"></a>
To run Scikit-learn on Sagemaker `SKLearn` Estimator with a script as an entry point. The training script is very similar to a training script you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables, such as:

* SM_MODEL_DIR: A string representing the path to the directory to write model artifacts to. These artifacts are uploaded to S3 for model hosting.
* SM_OUTPUT_DIR: A string representing the filesystem path to write output artifacts to. Output artifacts may include checkpoints, graphs, and other files to save, not including model artifacts. These artifacts are compressed and uploaded to S3 to the same S3 prefix as the model artifacts.

Supposing two input channels, 'train' and 'test', were used in the call to the Chainer estimator's fit() method, the following will be set, following the format SM_CHANNEL_[channel_name]:

* SM_CHANNEL_TRAIN: A string representing the path to the directory containing data in the 'train' channel
* SM_CHANNEL_TEST: Same as above, but for the 'test' channel.

A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, and saves a model to model_dir so that it can be hosted later. Hyperparameters are passed to your script as arguments and can be retrieved with an argparse.ArgumentParser instance. For example, the script run by this notebook:

## Create SageMaker Scikit Estimator <a class="anchor" id="create_sklearn_estimator"></a>

To run our Scikit-learn training script on SageMaker, we construct a `sagemaker.sklearn.estimator.sklearn` estimator, which accepts several constructor arguments:

* __entry_point__: The path to the Python script SageMaker runs for training and prediction.
* __role__: Role ARN
* __framework_version__: Scikit-learn version you want to use for executing your model training code.
* __train_instance_type__ *(optional)*: The type of SageMaker instances for training. __Note__: Because Scikit-learn does not natively support GPU training, Sagemaker Scikit-learn does not currently support training on GPU instance types.
* __sagemaker_session__ *(optional)*: The session used to train on Sagemaker.

To see the code for the SKLearn Estimator, see here: https://github.com/aws/sagemaker-python-sdk/tree/master/src/sagemaker/sklearn

In [12]:
from sagemaker.sklearn.estimator import SKLearn

FRAMEWORK_VERSION = "0.23-1"
script_path = 'sklearn_abalone_featurizer.py'

sklearn_preprocessor = SKLearn(
    entry_point=script_path,
    role=role,
    framework_version=FRAMEWORK_VERSION,
    train_instance_type="ml.c4.xlarge",
    sagemaker_session=sagemaker_session)


In [13]:
sklearn_preprocessor.fit({'train': train_input})

's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.


2020-09-16 11:05:35 Starting - Starting the training job...
2020-09-16 11:05:37 Starting - Launching requested ML instances.........
2020-09-16 11:07:09 Starting - Preparing the instances for training...
2020-09-16 11:08:03 Downloading - Downloading input data...
2020-09-16 11:08:35 Training - Downloading the training image.....[34m2020-09-16 11:09:13,087 sagemaker-training-toolkit INFO     Imported framework sagemaker_sklearn_container.training[0m
[34m2020-09-16 11:09:13,089 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)[0m
[34m2020-09-16 11:09:13,099 sagemaker_sklearn_container.training INFO     Invoking user training script.[0m
[34m2020-09-16 11:09:13,436 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)[0m
[34m2020-09-16 11:09:14,905 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)[0m
[34m2020-09-16 11:09:14,916 sagemaker-training-toolkit INFO     No GPUs detected (norm

## Batch transform our training data <a class="anchor" id="preprocess_train_data"></a>
Now that our proprocessor is properly fitted, let's go ahead and preprocess our training data. Let's use batch transform to directly preprocess the raw data and store right back into s3.

In [14]:
# Define a SKLearn Transformer from the trained SKLearn Estimator
transformer = sklearn_preprocessor.transformer(
    instance_count=1, 
    instance_type='ml.m5.xlarge',
    assemble_with = 'Line',
    accept = 'text/csv')

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


In [15]:
# Preprocess training input
transformer.transform(train_input, content_type="text/csv")
print("Waiting for transform job: " + transformer.latest_transform_job.job_name)
transformer.wait()
preprocessed_train = transformer.output_path

Waiting for transform job: sagemaker-scikit-learn-2020-09-16-11-16-28-070
...........................
.[32m2020-09-16T11:20:58.120:[sagemaker logs]: MaxConcurrentTransforms=1, MaxPayloadInMB=6, BatchStrategy=MULTI_RECORD[0m
[34m2020-09-16 11:20:54,764 INFO - sagemaker-containers - No GPUs detected (normal if no gpus installed)[0m
[34m2020-09-16 11:20:54,766 INFO - sagemaker-containers - No GPUs detected (normal if no gpus installed)[0m
[34m2020-09-16 11:20:54,767 INFO - sagemaker-containers - nginx config: [0m
[34mworker_processes auto;[0m
[34mdaemon off;[0m
[34mpid /tmp/nginx.pid;[0m
[34merror_log  /dev/stderr;
[0m
[34mworker_rlimit_nofile 4096;
[0m
[34mevents {
  worker_connections 2048;[0m
[34m}
[0m
[35m2020-09-16 11:20:54,764 INFO - sagemaker-containers - No GPUs detected (normal if no gpus installed)[0m
[35m2020-09-16 11:20:54,766 INFO - sagemaker-containers - No GPUs detected (normal if no gpus installed)[0m
[35m2020-09-16 11:20:54,767 INFO - sagemaker-

[35m169.254.255.130 - - [16/Sep/2020:11:20:55 +0000] "GET /ping HTTP/1.1" 502 182 "-" "Go-http-client/1.1"
  Building wheel for sklearn-abalone-featurizer (setup.py): finished with status 'done'
  Created wheel for sklearn-abalone-featurizer: filename=sklearn_abalone_featurizer-1.0.0-py2.py3-none-any.whl size=10934 sha256=d662646db52d5a98048bb6abfd64826603ad14b3d1715a1da3b8d0df8e61ecc2
  Stored in directory: /home/model-server/tmp/pip-ephem-wheel-cache-yn76aboi/wheels/3e/0f/51/2f1df833dd0412c1bc2f5ee56baac195b5be563353d111dca6[0m
[35mSuccessfully built sklearn-abalone-featurizer[0m
[35m2020/09/16 11:20:55 [crit] 14#14: *17 connect() to unix:/tmp/gunicorn.sock failed (2: No such file or directory) while connecting to upstream, client: 169.254.255.130, server: , request: "GET /ping HTTP/1.1", upstream: "http://unix:/tmp/gunicorn.sock:/ping", host: "169.254.255.131:8080"[0m
[35m169.254.255.130 - - [16/Sep/2020:11:20:55 +0000] "GET /ping HTTP/1.1" 502 182 "-" "Go-http-client/1.1"[0

In [16]:
# Preprocess test input
transformer.transform(test_input, content_type="text/csv")
print("Waiting for transform job: " + transformer.latest_transform_job.job_name)
transformer.wait()
preprocessed_test = transformer.output_path

Waiting for transform job: sagemaker-scikit-learn-2020-09-16-11-21-11-153
............................[34m2020-09-16 11:25:46,180 INFO - sagemaker-containers - No GPUs detected (normal if no gpus installed)[0m
[34m2020-09-16 11:25:46,182 INFO - sagemaker-containers - No GPUs detected (normal if no gpus installed)[0m
[34m2020-09-16 11:25:46,183 INFO - sagemaker-containers - nginx config: [0m
[34mworker_processes auto;[0m
[34mdaemon off;[0m
[34mpid /tmp/nginx.pid;[0m
[34merror_log  /dev/stderr;
[0m
[34mworker_rlimit_nofile 4096;
[0m
[34mevents {
  worker_connections 2048;[0m
[34m}
[0m
[35m2020-09-16 11:25:46,180 INFO - sagemaker-containers - No GPUs detected (normal if no gpus installed)[0m
[35m2020-09-16 11:25:46,182 INFO - sagemaker-containers - No GPUs detected (normal if no gpus installed)[0m
[35m2020-09-16 11:25:46,183 INFO - sagemaker-containers - nginx config: [0m
[35mworker_processes auto;[0m
[35mdaemon off;[0m
[35mpid /tmp/nginx.pid;[0m
[35merror

[32m2020-09-16T11:25:49.620:[sagemaker logs]: MaxConcurrentTransforms=1, MaxPayloadInMB=6, BatchStrategy=MULTI_RECORD[0m



In [18]:
preprocessed_train, preprocessed_test

('s3://sagemaker-us-east-1-120286446822/sagemaker-scikit-learn-2020-09-16-11-16-28-070',
 's3://sagemaker-us-east-1-120286446822/sagemaker-scikit-learn-2020-09-16-11-21-11-153')

## Fit a LinearLearner Model with the preprocessed data <a class="anchor" id="training_model"></a>
Let's take the preprocessed training data and fit a LinearLearner Model. Sagemaker provides prebuilt algorithm containers that can be used with the Python SDK. The previous Scikit-learn job preprocessed the raw Titanic dataset into labeled, useable data that we can now use to fit a binary classifier Linear Learner model.

For more on Linear Learner see: https://docs.aws.amazon.com/sagemaker/latest/dg/linear-learner.html

### custom sklearn model

In [47]:
%%writefile script.py

import argparse
import joblib
import os

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor



# inference functions ---------------
def model_fn(model_dir):
    clf = joblib.load(os.path.join(model_dir, "model.joblib"))
    return clf



if __name__ =='__main__':

    print('extracting arguments')
    parser = argparse.ArgumentParser()

    # hyperparameters sent by the client are passed as command-line arguments to the script.
    # to simplify the demo we don't use all sklearn RandomForest hyperparameters
    parser.add_argument('--n-estimators', type=int, default=10)
    parser.add_argument('--min-samples-leaf', type=int, default=3)

    # Data, model, and output directories
    parser.add_argument('--model-dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST'))
    #parser.add_argument('--train-file', type=str, default='boston_train.csv')
    #parser.add_argument('--test-file', type=str, default='boston_test.csv')
    parser.add_argument('--features', type=str)  # in this script we ask user to explicitly name features
    parser.add_argument('--target', type=str) # in this script we ask user to explicitly name the target

    args, _ = parser.parse_known_args()
    
    print("args.train", args.train)
    print("args.test", args.test)
    
    print('reading train data')
    print("args.train : ",args.train)
    # Take the set of files and read them all into a single pandas dataframe
    input_files = [ os.path.join(args.train, file) for file in os.listdir(args.train) ]
    if len(input_files) == 0:
        raise ValueError(('There are no files in {}.\n' +
                          'This usually indicates that the channel ({}) was incorrectly specified,\n' +
                          'the data specification in S3 was incorrectly specified or the role specified\n' +
                          'does not have permission to access the data.').format(args.train, "train"))
    raw_data = [ pd.read_csv(file, header=None, engine="python") for file in input_files ]
    train_df = pd.concat(raw_data)
    print(train_df.shape)
    
    print('reading test data')
    print("args.test : ",args.test)
    # Take the set of files and read them all into a single pandas dataframe
    input_files = [ os.path.join(args.test, file) for file in os.listdir(args.test) ]
    if len(input_files) == 0:
        raise ValueError(('There are no files in {}.\n' +
                          'This usually indicates that the channel ({}) was incorrectly specified,\n' +
                          'the data specification in S3 was incorrectly specified or the role specified\n' +
                          'does not have permission to access the data.').format(args.train, "train"))
    raw_data = [ pd.read_csv(file, header=None, engine="python") for file in input_files ]
    test_df = pd.concat(raw_data)
    print(test_df.shape)

    print('building training and testing datasets')
    """
    X_train = train_df[args.features.split()]
    X_test = test_df[args.features.split()]
    y_train = train_df[args.target]
    y_test = test_df[args.target]
    """
    print(train_df.columns.values)
    col_to_predict = train_df.columns.values[0]
    print("col_to_predict : {}, arg_type : {}".format(col_to_predict, type(col_to_predict)))
    X_train = train_df.drop(columns=[col_to_predict])
    X_test = test_df.drop(columns=[col_to_predict])
    y_train = train_df[col_to_predict]
    y_test = test_df[col_to_predict]
    
    
    # train
    print('training model')
    model = RandomForestRegressor(
        n_estimators=args.n_estimators,
        min_samples_leaf=args.min_samples_leaf,
        n_jobs=-1)
    
    print("-"*100)
    print("X_train.shape : ", X_train.shape)
    print("model training on num features : ", X_train.shape[1])
    print("sample data : \n", X_train.head(1).values)
    model.fit(X_train, y_train)

    # print abs error
    print('validating model')
    abs_err = np.abs(model.predict(X_test) - y_test)

    # print couple perf metrics
    for q in [10, 50, 90]:
        print('AE-at-' + str(q) + 'th-percentile: '
              + str(np.percentile(a=abs_err, q=q)))
        
    # persist model
    path = os.path.join(args.model_dir, "model.joblib")
    joblib.dump(model, path)
    print('model persisted at ' + path)
    print(args.min_samples_leaf)

Overwriting script.py


In [20]:
!pwd

/home/ec2-user/SageMaker/aws-ml/pipeline/sklearn-pipeline-linear


In [21]:
"""
local
-----
AE-at-10th-percentile: 0.19839047619047645
AE-at-50th-percentile: 1.0396309523809517
AE-at-90th-percentile: 3.0963095238095253

sagemaker:
---------
AE-at-10th-percentile: 0.16885396825397017
AE-at-50th-percentile: 1.0484166666666646
AE-at-90th-percentile: 3.149433333333336
"""
print()




### Sagemaker Training

In [55]:
preprocessed_train, preprocessed_test

('s3://sagemaker-us-east-1-120286446822/sagemaker-scikit-learn-2020-09-16-11-16-28-070',
 's3://sagemaker-us-east-1-120286446822/sagemaker-scikit-learn-2020-09-16-11-21-11-153')

In [56]:
from sagemaker.sklearn.estimator import SKLearn

FRAMEWORK_VERSION = '0.23-1'

sklearn_estimator = SKLearn(
    entry_point='script.py',
    role = get_execution_role(),
    train_instance_count=1,
    train_instance_type='ml.c5.xlarge',
    framework_version=FRAMEWORK_VERSION,
    base_job_name='rf-scikit',
    metric_definitions=[
        {'Name': 'median-AE',
         'Regex': "AE-at-50th-percentile: ([0-9.]+).*$"}],
    hyperparameters = {'n-estimators': 100,
                       'min-samples-leaf': 2,
                       'features': 'CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT',
                       'target': 'target'})

In [57]:
# launch training job, with asynchronous call
#sklearn_estimator.fit({'train':trainpath, 'test': testpath}, wait=True)
sklearn_estimator.fit({'train':preprocessed_train, 'test': preprocessed_test}, wait=True)

's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.


2020-09-16 12:07:09 Starting - Starting the training job...
2020-09-16 12:07:12 Starting - Launching requested ML instances.........
2020-09-16 12:08:43 Starting - Preparing the instances for training...
2020-09-16 12:09:32 Downloading - Downloading input data...
2020-09-16 12:10:02 Training - Downloading the training image..[34m2020-09-16 12:10:17,859 sagemaker-training-toolkit INFO     Imported framework sagemaker_sklearn_container.training[0m
[34m2020-09-16 12:10:17,860 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)[0m
[34m2020-09-16 12:10:17,869 sagemaker_sklearn_container.training INFO     Invoking user training script.[0m

2020-09-16 12:10:17 Training - Training image download completed. Training in progress.[34m2020-09-16 12:10:40,709 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)[0m
[34m2020-09-16 12:10:40,719 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)[0m
[

### only model pred

#### deploy model

In [58]:
predictor = sklearn_estimator.deploy(instance_type='ml.m4.xlarge', initial_instance_count=1)

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


-----------------!

#### get test data

In [59]:
import json
import io
from urllib.parse import urlparse
import boto3

def get_csv_output_from_s3(s3uri, file_name):
    parsed_url = urlparse(s3uri)
    bucket_name = parsed_url.netloc
    prefix = parsed_url.path[1:]
    s3 = boto3.resource('s3')
    print(bucket_name)
    print(prefix)
    print(file_name)
    obj = s3.Object(bucket_name, '{}/{}'.format(prefix, file_name))
    return obj.get()["Body"].read().decode('utf-8')   

In [61]:
import pandas as pd

path       = preprocessed_train
batch_file = 'abalone_train.csv' # imp
output = get_csv_output_from_s3(path, '{}.out'.format(batch_file))
validate_df = pd.read_csv(io.StringIO(output), sep=",", header=None)
print(validate_df.shape)
validate_df.sample(2) 

sagemaker-us-east-1-120286446822
sagemaker-scikit-learn-2020-09-16-11-16-28-070
abalone_train.csv.out
(3342, 12)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
1964,9.0,1.095726,1.088898,0.453911,0.646967,0.73203,0.280501,0.586375,0.0,0.0,0.0,1.0
318,10.0,-0.59974,-0.55421,-0.616586,-0.92433,-0.893819,-0.535805,-0.65768,0.0,0.0,0.0,1.0


In [63]:
# `data` is a NumPy array or a Python list.
# `response` is a NumPy array.

data = validate_df.drop(columns=[0]).values

response = predictor.predict(data)
response

array([12.39078571, 11.73078571,  7.52116667, ..., 16.62067063,
       12.3132619 , 14.55588095])

In [95]:
print("hello")

hello


# Serial Inference Pipeline with Scikit preprocessor and Linear Learner <a class="anchor" id="serial_inference"></a>


## Set up the inference pipeline <a class="anchor" id="pipeline_setup"></a>
Setting up a Machine Learning pipeline can be done with the Pipeline Model. This sets up a list of models in a single endpoint; in this example, we configure our pipeline model with the fitted Scikit-learn inference model and the fitted Linear Learner model. Deploying the model follows the same ```deploy``` pattern in the SDK.

In [97]:
sklearn_estimator_RF = sklearn_estimator.create_model()
sklearn_estimator_RF

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


<sagemaker.sklearn.model.SKLearnModel at 0x7ffa3650aef0>

In [98]:
from sagemaker.model import Model
from sagemaker.pipeline import PipelineModel
import boto3
from time import gmtime, strftime

timestamp_prefix = strftime("%Y-%m-%d-%H-%M-%S", gmtime())

# step_1 : get models
scikit_learn_inferencee_model = sklearn_preprocessor.create_model()
sklearn_estimator_RF = sklearn_estimator.create_model()
#linear_learner_model = ll_estimator.create_model()

# step_2 : set-up pipeline
model_name = 'sklearn-inference-pipeline-' + timestamp_prefix
endpoint_name = 'sklearn-inference-pipeline-ep-' + timestamp_prefix
sm_model = PipelineModel(
    name=model_name, 
    role=role, 
    models=[
        scikit_learn_inferencee_model, 
        sklearn_estimator_RF])

#sm_model.deploy(initial_instance_count=1, instance_type='ml.c4.xlarge', endpoint_name=endpoint_name)
sm_model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge', endpoint_name=endpoint_name)

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.
Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


-----------------!

In [105]:
pipeline_tf = sm_model.transformer(
                            instance_count=1, 
                            instance_type='ml.m5.xlarge',
                            assemble_with = 'Line',
                            accept = 'text/csv')

Using already existing model: sklearn-inference-pipeline-2020-09-16-10-18-39


In [106]:
# input : test_input

pipeline_tf.transform(test_input, content_type="text/csv")
print("Waiting for transform job: " + transformer.latest_transform_job.job_name)
pipeline_tf.wait()
predictions_path = pipeline_tf.output_path
predictions_path

Waiting for transform job: sagemaker-scikit-learn-2020-09-16-09-01-25-571
............................[34m2020-09-16 10:36:47,551 INFO - sagemaker-containers - No GPUs detected (normal if no gpus installed)[0m
[34m2020-09-16 10:36:47,554 INFO - sagemaker-containers - No GPUs detected (normal if no gpus installed)[0m
[34m2020-09-16 10:36:47,554 INFO - sagemaker-containers - nginx config: [0m
[34mworker_processes auto;[0m
[34mdaemon off;[0m
[34mpid /tmp/nginx.pid;[0m
[34merror_log  /dev/stderr;
[0m
[35m2020-09-16 10:36:47,551 INFO - sagemaker-containers - No GPUs detected (normal if no gpus installed)[0m
[35m2020-09-16 10:36:47,554 INFO - sagemaker-containers - No GPUs detected (normal if no gpus installed)[0m
[35m2020-09-16 10:36:47,554 INFO - sagemaker-containers - nginx config: [0m
[35mworker_processes auto;[0m
[35mdaemon off;[0m
[35mpid /tmp/nginx.pid;[0m
[35merror_log  /dev/stderr;
[0m
[34mworker_rlimit_nofile 4096;
[0m
[34mevents {
  worker_connection

[36m2020-09-16T10:36:52.460:[sagemaker logs]: MaxConcurrentTransforms=1, MaxPayloadInMB=6, BatchStrategy=MULTI_RECORD[0m
[36m2020-09-16T10:36:52.515:[sagemaker logs]: sagemaker-us-east-1-120286446822/Scikit-LinearLearner-pipeline-abalone-example/test/abalone_test.csv: [container-2]: Bad HTTP status received from algorithm: 500[0m
[36m2020-09-16T10:36:52.515:[sagemaker logs]: sagemaker-us-east-1-120286446822/Scikit-LinearLearner-pipeline-abalone-example/test/abalone_test.csv: [0m
[36m2020-09-16T10:36:52.515:[sagemaker logs]: sagemaker-us-east-1-120286446822/Scikit-LinearLearner-pipeline-abalone-example/test/abalone_test.csv: Message:[0m
[36m2020-09-16T10:36:52.515:[sagemaker logs]: sagemaker-us-east-1-120286446822/Scikit-LinearLearner-pipeline-abalone-example/test/abalone_test.csv: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">[0m
[36m2020-09-16T10:36:52.515:[sagemaker logs]: sagemaker-us-east-1-120286446822/Scikit-LinearLearner-pipeline-abalone-example/test/abalone_

UnexpectedStatusException: Error for Transform job sklearn-inference-pipeline-2020-09-16-1-2020-09-16-10-32-14-868: Failed. Reason: AlgorithmError: See job logs for more information

In [107]:
test_input

's3://sagemaker-us-east-1-120286446822/Scikit-LinearLearner-pipeline-abalone-example/test/abalone_test.csv'

In [None]:
# Preprocess training input
transformer.transform(train_input, content_type="text/csv")
print("Waiting for transform job: " + transformer.latest_transform_job.job_name)
transformer.wait()
preprocessed_train = transformer.output_path

In [None]:
error

## Make a request to our pipeline endpoint <a class="anchor" id="pipeline_inference_request"></a>

Here we just grab the first line from the test data (you'll notice that the inference python script is very particular about the ordering of the inference request data). The ```ContentType``` field configures the first container, while the ```Accept``` field configures the last container. You can also specify each container's ```Accept``` and ```ContentType``` values using environment variables.

We make our request with the payload in ```'text/csv'``` format, since that is what our script currently supports. If other formats need to be supported, this would have to be added to the ```output_fn()``` method in our entry point. Note that we set the ```Accept``` to ```application/json```, since Linear Learner does not support ```text/csv``` ```Accept```. The prediction output in this case is trying to guess the number of rings the abalone specimen would have given its other physical features; the actual number of rings is 10.

In [103]:
endpoint_name

'sklearn-inference-pipeline-ep-2020-09-16-10-18-39'

In [110]:
from sagemaker.predictor import json_serializer, csv_serializer, json_deserializer, RealTimePredictor
from sagemaker.content_types import CONTENT_TYPE_CSV, CONTENT_TYPE_JSON

#payload = 'M, 0.44, 0.365, 0.125, 0.516, 0.2155, 0.114, 0.155' # 10
payload = '14,I,0.47,0.4,0.16,0.51,0.1615,0.073,0.198' # 14
actual_rings = 10

predictor = RealTimePredictor(
    endpoint=endpoint_name,
    sagemaker_session=sagemaker_session,
    serializer=csv_serializer,
    content_type=CONTENT_TYPE_CSV,
    accept=CONTENT_TYPE_JSON)

print(predictor.predict(payload))

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (500) from container-1 with message "<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>500 Internal Server Error</title>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.</p>
". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/sklearn-inference-pipeline-ep-2020-09-16-10-18-39 in account 120286446822 for more information.

## Delete Endpoint <a class="anchor" id="delete_endpoint"></a>
Once we are finished with the endpoint, we clean up the resources!

In [None]:
sm_client = sagemaker_session.boto_session.client('sagemaker')
sm_client.delete_endpoint(EndpointName=endpoint_name)

In [15]:
sm_client = sagemaker_session.boto_session.client('sagemaker')
sm_client.delete_endpoint(EndpointName=endpoint_name)

'inference-pipeline-ep-2020-09-14-10-41-27'

In [18]:
#ll_estimator
