In [2]:
import sys

!{sys.executable} -m pip install "sagemaker>=2.51.0"


# SageMaker Pipelines Air Quality: Batch Transforms and Model Monitoring

This notebook illustrates how to train and deploy a model in a SageMaker Pipeline, with both a transformer and an endpoint. It also introduces model monitoring to detect model drift and dataset corruption.

The steps in this pipeline include:
* Preprocessing the dataset.
* Training a Linear Learner model.
* Creating a transform job to run batch inference on the dataset.
* Creating an endpoint.

After the pipeline is completed, model monitoring is applied to the 

## The Scenario
For the demonstration in this notebook, we examine the relationship between an air pollutant (NO<sub>2</sub>) and weather in a selected city: Dublin, Ireland. 

The air quality data comes from a long-established monitoring station run by the Irish Environmental Protection Agency. The station is located in Rathmines, an inner suburb of Dublin about 3 kilometers south of the city center. Dublin, the capital city of the Republic of Ireland, has a population of approximately one million people. The city is bounded by the sea to the east, mountains to the south, and flat topography to the west and north. The mountains to the south of Dublin affect the wind speed and direction over the city. When the general flow of wind is from the south, the mountains deflect the flow to a south-westerly or south-easterly direction.

The weather data comes from the long-established weather station located at Dublin Airport. Dublin Airport is located on the flat topography to the north of the city, about 12 kilometers north of Dublin city center.


## The Tools
* Amazon SageMaker for machine learning and deploying pipelines. 
* Amazon Simple Storage Service (Amazon S3) to stage the data for analysis. 

## The Data
Hourly air pollution datasets for the Rathmines monitoring station are published by the Irish Environmental Protection Agency. The data we used spans the years 2011 to 2016. This data is available as Open Data. The provenance of the data is described at the following link, and data can also be downloaded at this link:

http://erc.epa.ie/

A daily weather data set for Dublin Airport stretching back to 1942 is published by the Irish Meteorological Service (Met. Eireann) on their website under a Creative Commons License.

https://www.met.ie/climate/available-data/historical-data

For global studies, there is a handy repository of air quality data available on [OpenAQ](https://openaq.org). This data is also available via the [Registry of Open Data on AWS](https://registry.opendata.aws/openaq/).


## The Method
#### Preparing the data for analysis and loading data from Amazon S3
The data is in CSV format. Before being uploaded to our Amazon S3 bucket, the data was modified to prepare it for analysis:
 - Weather Data: The data set contained more information than we needed for the purpose of this proof of concept. To prepare the weather data the following actions with the original dataset were carried out:
     - Removed the header, this takes up the first 25 rows of the dataset.
     - Converted measurement unit for wind speed from knots to meters per second.
     - Selected a subset of the parameters available. Parameters were chosen based on results from scientific papers on this subject.
     - The names of the parameters selected were changed to reduce ambiguity.
         - ‘rain’ became ‘rain_mm’. The precipitation amount in mm.
         - ‘maxtp’ became ‘maxtemp’. The maximum air temperature in Celsius.
         - ‘mintp’ became ‘mintemp’. The minimum air temperature in Celsius.
         - ‘cbl’ became ‘pressure_hpa’. The mean air pressure in hectopascals.
         - ‘wdsp’ became ‘wd_speed_m_per_s’ (and the units converted from knots).
         - ‘ddhm’ became ‘winddirection’.
         - ‘sun’ became ‘sun_hours’. The sunshine duration.
         - ‘evap’ became ‘evap_mm’. Evaporation (mm).
 - Air Quality Data: Each year of air quality data came in separate files and the units used to measure the pollutants changed from standard units (SI) to an obsolete unit. We decided to only use the years where the SI units are used, this limited us to a time period of 2011 to 2016. These yearly files were merged into one file. 
 - Sample Rate: The weather observations are 24-hour daily averages, and the air quality data came as 1-hour averages. We resampled the air quality data to 24-hour averages and changed the parameter name to indicate this. For example NO<sub>2</sub> became NO2_avg.
 
After this preliminary data transformation, we published the new data in our S3 bucket.
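
The two numeric transformations described above (knots to meters per second, and hourly to daily averages) can be sketched with the standard library alone. This is an illustrative sketch, not the script that produced the published files; the function names and the sample readings are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# One knot is one nautical mile (1852 m) per hour.
KNOT_IN_M_PER_S = 1852 / 3600


def knots_to_m_per_s(speed_knots):
    """Convert a wind speed from knots to meters per second."""
    return speed_knots * KNOT_IN_M_PER_S


def daily_averages(hourly_readings):
    """Average (timestamp, value) pairs per calendar day, skipping missing values."""
    buckets = defaultdict(list)
    for timestamp, value in hourly_readings:
        if value is not None:
            buckets[timestamp.date()].append(value)
    return {day: sum(values) / len(values) for day, values in sorted(buckets.items())}


# Hypothetical hourly NO2 readings, for illustration only.
sample = [
    (datetime(2011, 1, 1, 0), 20.0),
    (datetime(2011, 1, 1, 1), 30.0),
    (datetime(2011, 1, 1, 2), None),
    (datetime(2011, 1, 2, 0), 10.0),
]
no2_avg = daily_averages(sample)
```

With pandas, the same resampling would typically be a one-line `resample("D").mean()` on a datetime-indexed frame; the explicit version is shown here only to make the grouping visible.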

### Preparing Amazon SageMaker 

When opening a SageMaker notebook, we load the relevant libraries into the notebook:

In [3]:
import os
import time
import boto3
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import sagemaker
from sagemaker import get_execution_role

In [4]:
sess = boto3.Session()
sm = sess.client("sagemaker")
role = get_execution_role()
sagemaker_session = sagemaker.Session(boto_session=sess)
bucket = "sagemaker-workshop-us-west-1-648739860567"
region = boto3.Session().region_name
prefix = "sagemaker/DEMO-ModelMonitor"

data_capture_prefix = "{}/datacapture".format(prefix)
s3_capture_upload_path = "s3://{}/{}".format(bucket, data_capture_prefix)
reports_prefix = "{}/reports".format(prefix)
s3_report_path = "s3://{}/{}".format(bucket, reports_prefix)


model_package_group_name = "Linear-Learner-Air-Quality"  # Model name in model registry
pipeline_name = "LinearLearnerAirQualityPipeline"  # SageMaker Pipeline name
current_time = time.strftime("%m-%d-%H-%M-%S", time.localtime())


These libraries help us analyze the data: pandas is a popular data manipulation tool, and numpy is the de facto scientific computing library in Python.

#### Loading prepared data into Amazon SageMaker from Amazon S3
Now that we have the notebook ready for use with the right libraries imported, we can import the data. For this, we will use the pandas library, which is great for exploring and massaging tabular data directly in Python. We can use the **pandas.read_csv** command, supplying it with the location of our data. For both the air pollution and weather data, we changed the column names to something more readable (see `preprocess.py` below for the air pollution example). We also had to create our own date parser, due to the specific format used for dates in the data.
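
The custom date parser has to cope with two-digit years. A minimal sketch of the idea follows, assuming dates such as `01-jan-42` and an English locale for month abbreviations. The function name and the `now` argument are hypothetical, and `datetime.replace` is used here in place of the `dateutil` helper in the preprocessing script.

```python
from datetime import datetime


def parse_two_digit_year_date(text, now=None):
    """Parse dates such as '01-jan-42' and move future dates back a century."""
    now = now or datetime.now()
    parsed = datetime.strptime(text, "%d-%b-%y")
    if parsed > now:
        # strptime maps '42' to 2042, but the weather record starts in 1942.
        parsed = parsed.replace(year=parsed.year - 100)
    return parsed


# A fixed reference date keeps the example deterministic.
reference = datetime(2022, 6, 6)
old = parse_two_digit_year_date("01-jan-42", now=reference)
recent = parse_two_digit_year_date("15-Mar-11", now=reference)
```

Dates that parse into the future are assumed to belong to the previous century, which holds for a record that ends before the current date.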

In [5]:
from sagemaker.workflow.parameters import ParameterInteger, ParameterString, ParameterFloat
from sagemaker.workflow.steps import CacheConfig

cache_config = CacheConfig(enable_caching=True,expire_after="PT8H")
# raw input data
nox_data = ParameterString(name="NoxData", default_value='s3://sagemaker-workshop-us-west-1-648739860567/aws-machine-learning-blog/artifacts/air-quality/Dublin_Rathmines_NOx_2011_2016_ugm3_daily.csv')
weather_data = ParameterString(name="WeatherData", default_value='s3://sagemaker-workshop-us-west-1-648739860567/aws-machine-learning-blog/artifacts/air-quality/DublinAirportWeatherStationDerived_1942_to_2018.csv')

# processing step parameters
processing_instance_type = ParameterString(
    name="ProcessingInstanceType", default_value="ml.m5.large"
)

# training step parameters
training_instance_type = ParameterString(name="TrainingInstanceType", default_value="ml.m5.2xlarge")
training_epochs = ParameterString(name="TrainingEpochs", default_value="1")


# Transformer step parameters
transformer_instance_type = ParameterString(name="TransformInstanceType", default_value="ml.m5.large")
transformer_instance_count = ParameterInteger(name="TransformInstanceCount", default_value=2)
max_payload_in_mb = ParameterInteger(name="MaxPayloadMB", default_value=2)
output_data_path = ParameterString(name="OutputDataS3Path",default_value="s3://{}/air-quality-batch-infer/".format(bucket))
concurrency = ParameterInteger(name="MaxConcurrentRequests",default_value=4)


## Preprocessing

Whether your data was exported from Data Wrangler or already exists elsewhere, you can use a preprocessing job to clean it.

In [11]:
%%writefile preprocess.py

from pathlib import Path
# import boto3
import numpy as np
import pandas as pd
import os
from sklearn.preprocessing import StandardScaler
from datetime import datetime
from dateutil.relativedelta import relativedelta

def convert_date_to_right_century(dt):
    if dt > datetime.now():
        dt -= relativedelta(years=100)
    return dt

def parse(x):
    return datetime.strptime(x, '%d-%b-%y')

if __name__ == "__main__":
    col_names = ['daily_avg', 'nox_avg', 'no_avg', 'no2_avg']
    nox_df = pd.read_csv(next(Path('/opt/ml/processing/input/nox').iterdir()),  
                        date_parser=parse,
                        parse_dates=['Daily_Avg'])
    nox_df.columns = col_names
    nox_df = nox_df.set_index('daily_avg')
    nox_df["no2_avg"] = nox_df["no2_avg"].apply(lambda x: 5 if x <= 0 else x)
    
    weather_col_names = ['observation_date', 'maxtemp', 'mintemp', 'rain_mm', 'pressure_hpa', 'wd_speed_m_per_s', 'wind_direction', 'sun_hours', 'g_rad', 'evap_mm']
    weather_df = pd.read_csv(next(Path('/opt/ml/processing/input/weather').iterdir()), 
                        date_parser=parse,
                        parse_dates=['date'])
    weather_df.columns = weather_col_names
    weather_df['observation_date'] = weather_df['observation_date'].apply(convert_date_to_right_century)
    weather_df = weather_df.set_index('observation_date')
    # Copy the slice so the column assignment below does not act on a view
    new_weather_df = weather_df['2011-01-01':'2016-12-31'].copy()
    new_weather_df[['wind_direction']] = new_weather_df[['wind_direction']].apply(pd.to_numeric)
    weather_sub_df = new_weather_df[['maxtemp','wd_speed_m_per_s','wind_direction','pressure_hpa','sun_hours']]
    no2_df = nox_df[['no2_avg']]
    comp_df = pd.merge(weather_sub_df,no2_df, left_index=True, right_index=True)
    aq_df = comp_df.iloc[1:].copy()

    # Adding wind_speed_direction, the product of wind_speed and the direction
    aq_df["wind_speed_direction"] = aq_df.apply(lambda row: row['wd_speed_m_per_s'] * float(row['wind_direction']), axis=1)
    aq_train_df = aq_df[aq_df.index.year < 2016]
    aq_test_df = aq_df[aq_df.index.year == 2016]
    
    x_train = aq_train_df.drop('no2_avg', axis=1)
    output_path = os.path.join("/opt/ml/processing/train", "x_train.csv")
    x_train.to_csv(output_path,index=False)#,header=False)
    x_test = aq_test_df.drop('no2_avg', axis=1)
    output_path = os.path.join("/opt/ml/processing/test", "x_test.csv")
    x_test.to_csv(output_path,index=False)#,header=False)
    aq_train_df.to_csv('/opt/ml/processing/testcsv/x_test_header.csv')
    
    
#     aq_train_df.to_csv("/opt/ml/preprocessing/training-dataset-with-header.csv")
#     with open('/op/ml/preprocessing/training-dataset-with-header.csv','rb') as training_data_file:
#         baseline_prefix = "sagemaker/DEMO-ModelMonitor/baselining"
#         bucket = "sagemaker-workshop-us-west-1-648739860567"
#         s3_key = os.path.join(baseline_prefix, "data", "training-dataset-with-header.csv")
#         boto3.Session().resource("s3").Bucket(bucket).Object(s3_key).upload_fileobj(training_data_file)
    
    y_train = aq_train_df[["no2_avg"]]
    output_path = os.path.join("/opt/ml/processing/train", "y_train.csv")
    y_train.to_csv(output_path,index=False)#,header=False)
    y_test = aq_test_df[["no2_avg"]]
    output_path = os.path.join("/opt/ml/processing/test", "y_test.csv")
    y_test.to_csv(output_path,index=False)#,header=False)

Overwriting preprocess.py
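
SageMaker's built-in algorithms, Linear Learner included, expect CSV training data to have no header row and the target in the first column. The script above writes features and labels to separate files with headers, so a combined file may be needed before training. The following is a minimal sketch of that layout using only the standard library. The helper name and the sample rows are hypothetical, and it is not part of the original script.

```python
import csv
import io


def to_label_first_csv(rows, label, feature_names):
    """Serialize dict rows as header-less CSV with the label in column 0."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([row[label]] + [row[name] for name in feature_names])
    return buffer.getvalue()


# Hypothetical rows using a few of the column names from preprocess.py.
rows = [
    {"maxtemp": 9.1, "pressure_hpa": 1012.3, "no2_avg": 31.5},
    {"maxtemp": 11.4, "pressure_hpa": 1003.8, "no2_avg": 18.2},
]
csv_text = to_label_first_csv(rows, "no2_avg", ["maxtemp", "pressure_hpa"])
```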


In [None]:
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

framework_version = "0.23-1"

# Create SKlearn processor object,
# The object contains information about what instance type to use, the IAM role to use etc.
# A managed processor comes with a preconfigured container, so only specifying version is required.
sklearn_processor = SKLearnProcessor(
    framework_version=framework_version,
    role=role,
    instance_type="ml.m5.large",
    instance_count=1,
    base_job_name="linear-learner-air-quality-processing-job",
)

sklearn_processor.run(
    code="preprocess.py",
    inputs=[
        ProcessingInput(source=nox_data, destination="/opt/ml/processing/input/nox"),
        ProcessingInput(source=weather_data, destination="/opt/ml/processing/input/weather")
        
    ],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train",destination="s3://sagemaker-workshop-us-west-1-648739860567/workshop/processing/train"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test",destination="s3://sagemaker-workshop-us-west-1-648739860567/workshop/processing/test"),
        ProcessingOutput(output_name="testcsv",source="/opt/ml/processing/testcsv",destination="s3://sagemaker-workshop-us-west-1-648739860567/workshop/processing/testcsv"),
    ],
)


Job Name:  linear-learner-air-quality-processing-j-2022-06-06-16-16-13-009
Inputs:  [{'InputName': 'input-1', 'AppManaged': False, 'S3Input': {'S3Uri': ParameterString(name='NoxData', parameter_type=<ParameterTypeEnum.STRING: 'String'>, default_value='s3://sagemaker-workshop-us-west-1-648739860567/aws-machine-learning-blog/artifacts/air-quality/Dublin_Rathmines_NOx_2011_2016_ugm3_daily.csv'), 'LocalPath': '/opt/ml/processing/input/nox', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'input-2', 'AppManaged': False, 'S3Input': {'S3Uri': ParameterString(name='WeatherData', parameter_type=<ParameterTypeEnum.STRING: 'String'>, default_value='s3://sagemaker-workshop-us-west-1-648739860567/aws-machine-learning-blog/artifacts/air-quality/DublinAirportWeatherStationDerived_1942_to_2018.csv'), 'LocalPath': '/opt/ml/processing/input/weather', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistrib

## Training & Hyperparameter Tuning

In [None]:
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
import time
import boto3
from sagemaker.image_uris import retrieve

linear_image = retrieve("linear-learner", boto3.Session().region_name)


# Where to store the trained model
model_path = f"s3://{bucket}/{prefix}/model/"

hyperparameters = {"epochs": training_epochs}

linear_estimator = Estimator(
    linear_image,
    role,
    instance_count=1,
    instance_type=training_instance_type,
    volume_size=20,
    max_run=3600,
    input_mode="File",
    output_path=model_path,
    sagemaker_session=sagemaker_session,
)

linear_estimator.set_hyperparameters(normalize_data=True,normalize_label=True, predictor_type="regressor", mini_batch_size=32)

linear_estimator.fit(
    {"train": TrainingInput(s3_data="s3://sagemaker-workshop-us-west-1-648739860567/workshop/processing/train",s3_data_type='S3Prefix',content_type="text/csv"),
    "test": TrainingInput(s3_data="s3://sagemaker-workshop-us-west-1-648739860567/workshop/processing/test",s3_data_type='S3Prefix',content_type="text/csv")}
)

In [None]:
param_l1 = sagemaker.parameter.ContinuousParameter(1e-7, 
                                                   1,
                                                   scaling_type='Logarithmic')

param_wd = sagemaker.parameter.ContinuousParameter(1e-7, 
                                                   1,
                                                   scaling_type='Logarithmic')

param_learning_rate = sagemaker.parameter.ContinuousParameter(1e-5,
                                                             1,
                                                             scaling_type='Logarithmic')

hypertuner = sagemaker.tuner.HyperparameterTuner(linear_estimator, 
                             objective_metric_name = 'test:mse', 
                             hyperparameter_ranges = {
                                               'l1' : param_l1,
                                               'wd' : param_wd,
                                               'learning_rate' : param_learning_rate,
                             }, 
                             metric_definitions=None, 
                             strategy='Bayesian', 
                             objective_type='Minimize', 
                             max_jobs=20, max_parallel_jobs=3,
                             early_stopping_type='Off'
                             )

## Processing Step 

The first step in the pipeline will preprocess the data to prepare it for training. The data was already cleaned, as described above, and those steps would be incorporated here when working with the raw data.

We create a `SKLearnProcessor` object that has been parameterized, so we can separately track and change the job configuration as needed. As an example, we can increase the instance type size and count to accommodate a growing dataset.
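
To illustrate how parameterization helps, the sketch below builds a dictionary of overrides that could be supplied when starting an execution. The defaults mirror a few of the parameters declared earlier in this notebook; `build_overrides` is a hypothetical helper, not part of the SageMaker SDK.

```python
# Parameter names and defaults as declared earlier in this notebook.
DEFAULTS = {
    "ProcessingInstanceType": "ml.m5.large",
    "TrainingInstanceType": "ml.m5.2xlarge",
    "TransformInstanceCount": 2,
}


def build_overrides(defaults, **requested):
    """Keep only the requested values that differ from the declared defaults."""
    unknown = set(requested) - set(defaults)
    if unknown:
        raise KeyError("Unknown pipeline parameters: {}".format(sorted(unknown)))
    return {
        name: value for name, value in requested.items() if value != defaults[name]
    }


overrides = build_overrides(
    DEFAULTS,
    ProcessingInstanceType="ml.m5.xlarge",
    TransformInstanceCount=2,
)

# Once the pipeline object exists (it is created later in this notebook),
# the overrides could be passed as:
# execution = pipeline.start(parameters=overrides)
```

Rejecting unknown names early avoids submitting an execution that the service would refuse because a parameter was misspelled.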

In [None]:
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep

framework_version = "0.23-1"

# Create SKlearn processor object,
# The object contains information about what instance type to use, the IAM role to use etc.
# A managed processor comes with a preconfigured container, so only specifying version is required.
sklearn_processor = SKLearnProcessor(
    framework_version=framework_version,
    role=role,
    instance_type=processing_instance_type,
    instance_count=1,
    base_job_name="linear-learner-air-quality-processing-job",
)

# Use the sklearn_processor in a Sagemaker pipelines ProcessingStep
step_preprocess_data = ProcessingStep(
    name="Preprocess-Linear-Learner-Air-Quality-Data",
    processor=sklearn_processor,
    inputs=[
        ProcessingInput(source=nox_data, destination="/opt/ml/processing/input/nox"),
        ProcessingInput(source=weather_data, destination="/opt/ml/processing/input/weather")
        
    ],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
        ProcessingOutput(output_name="testcsv",source="/opt/ml/processing/testcsv"),
    ],
    code="preprocess.py",
    cache_config=cache_config,
)

## Train model step
In the second step, the train and test outputs from the previous processing step are used to train a model. 

In [None]:
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import TrainingStep, TuningStep
import time
import boto3
from sagemaker.image_uris import retrieve

linear_image = retrieve("linear-learner", boto3.Session().region_name)


# Where to store the trained model
model_path = f"s3://{bucket}/{prefix}/model/"

hyperparameters = {"epochs": training_epochs}

linear_estimator = Estimator(
    linear_image,
    role,
    instance_count=1,
    instance_type=training_instance_type,
    volume_size=20,
    max_run=3600,
    input_mode="File",
    output_path=model_path,
    sagemaker_session=sagemaker_session,
)

linear_estimator.set_hyperparameters(normalize_data=True,normalize_label=True, predictor_type="regressor", mini_batch_size=32)

# param_l1 = sagemaker.parameter.ContinuousParameter(1e-7, 
#                                                    1,
#                                                    scaling_type='Logarithmic')

# param_wd = sagemaker.parameter.ContinuousParameter(1e-7, 
#                                                    1,
#                                                    scaling_type='Logarithmic')

# param_learning_rate = sagemaker.parameter.ContinuousParameter(1e-5,
#                                                              1,
#                                                              scaling_type='Logarithmic')

# hypertuner = sagemaker.tuner.HyperparameterTuner(linear_estimator, 
#                              objective_metric_name = 'test:mse', 
#                              hyperparameter_ranges = {
#                                                'l1' : param_l1,
#                                                'wd' : param_wd,
#                                                'learning_rate' : param_learning_rate,
#                              }, 
#                              metric_definitions=None, 
#                              strategy='Bayesian', 
#                              objective_type='Minimize', 
#                              max_jobs=20, max_parallel_jobs=3,
#                              early_stopping_type='Off'
#                              )

# step_tune_model = TuningStep(
#     name="Tune-Linear-Learner-Air-Quality-Model",
#     tuner=hypertuner,
#     inputs={
#         "train": TrainingInput(
#             s3_data=step_preprocess_data.properties.ProcessingOutputConfig.Outputs[
#                 "train"
#             ].S3Output.S3Uri,
#             content_type="text/csv",
#         ),
#         "test": TrainingInput(
#             s3_data=step_preprocess_data.properties.ProcessingOutputConfig.Outputs[
#                 "test"
#             ].S3Output.S3Uri,
#             content_type="text/csv",
#         ),
#     },
# )

# Use the linear_estimator in a Sagemaker pipelines TrainingStep.
# NOTE how the input to the training job directly references the output of the previous step.
step_train_model = TrainingStep(
    name="Train-Linear-Learner-Air-Quality-Model",
    estimator=linear_estimator,
    inputs={
        "train": TrainingInput(
            s3_data=step_preprocess_data.properties.ProcessingOutputConfig.Outputs[
                "train"
            ].S3Output.S3Uri,
            content_type="text/csv",
        ),
        "test": TrainingInput(
            s3_data=step_preprocess_data.properties.ProcessingOutputConfig.Outputs[
                "test"
            ].S3Output.S3Uri,
            content_type="text/csv",
        ),
    }
)

## Create the model

The model is created and the name of the model is provided to the Lambda function for deployment. The `CreateModelStep` dynamically assigns a name to the model.

In [369]:
from sagemaker.workflow.step_collections import CreateModelStep
from sagemaker.model import Model

model = Model(
    role=role,
    image_uri = linear_image,
    model_data=step_train_model.properties.ModelArtifacts.S3ModelArtifacts,
    sagemaker_session=sagemaker_session,
)

step_create_model = CreateModelStep(
    name="Create-Linear-Learner-Air-Quality-Model",
    model=model,
    inputs=sagemaker.inputs.CreateModelInput(instance_type=transformer_instance_type),
)

## Endpoint creation for Model Monitoring

In [357]:
%%writefile deploy_model_lambda.py


"""
This Lambda function deploys the model to SageMaker Endpoint. 
If Endpoint exists, then Endpoint will be updated with new Endpoint Config.
"""

import json
import boto3
import time


sm_client = boto3.client("sagemaker")


def lambda_handler(event, context):

    print(f"Received Event: {event}")

    current_time = time.strftime("%m-%d-%H-%M-%S", time.localtime())
    endpoint_instance_type = event["endpoint_instance_type"]
    model_name = event["model_name"]
    endpoint_config_name = "{}-{}".format(event["endpoint_config_name"], current_time)
    endpoint_name = event["endpoint_name"]
    s3_capture_upload_path = event["s3_capture_upload_path"]

    # Create Endpoint Configuration
    create_endpoint_config_response = sm_client.create_endpoint_config(
        EndpointConfigName=endpoint_config_name,
        ProductionVariants=[
            {
                "InstanceType": endpoint_instance_type,
                "InitialVariantWeight": 1,
                "InitialInstanceCount": 1,
                "ModelName": model_name,
                "VariantName": "AllTraffic",
            }
        ],
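        DataCaptureConfig={
            'EnableCapture': True,
            'InitialSamplingPercentage': 100,
            'DestinationS3Uri': s3_capture_upload_path,
            # CaptureOptions is a required field: capture both requests and responses
            'CaptureOptions': [{'CaptureMode': 'Input'}, {'CaptureMode': 'Output'}],
        }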
    )
    print(f"create_endpoint_config_response: {create_endpoint_config_response}")

    # Check if an endpoint exists. If no - Create new endpoint, if yes - Update existing endpoint
    list_endpoints_response = sm_client.list_endpoints(
        SortBy="CreationTime",
        SortOrder="Descending",
        NameContains=endpoint_name,
    )
    print(f"list_endpoints_response: {list_endpoints_response}")

    if len(list_endpoints_response["Endpoints"]) > 0:
        print("Updating Endpoint with new Endpoint Configuration")
        update_endpoint_response = sm_client.update_endpoint(
            EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
        )
        print(f"update_endpoint_response: {update_endpoint_response}")
    else:
        print("Creating Endpoint")
        create_endpoint_response = sm_client.create_endpoint(
            EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
        )
        print(f"create_endpoint_response: {create_endpoint_response}")

    return {"statusCode": 200, "body": json.dumps("Endpoint Created Successfully")}

Overwriting deploy_model_lambda.py
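
The handler above reads five keys from the incoming event, so the inputs passed by the pipeline's `LambdaStep` must use exactly those names. A small check like the following sketch can catch a mismatch before the Lambda fails with a `KeyError`. The helper name and the sample event values are hypothetical.

```python
# Keys that lambda_handler reads from the event.
REQUIRED_EVENT_KEYS = (
    "endpoint_instance_type",
    "model_name",
    "endpoint_config_name",
    "endpoint_name",
    "s3_capture_upload_path",
)


def missing_event_keys(event):
    """Return the handler's required keys that are absent from the event."""
    return [key for key in REQUIRED_EVENT_KEYS if key not in event]


# Hypothetical event for illustration; the values are placeholders.
event = {
    "endpoint_instance_type": "ml.m5.large",
    "model_name": "example-model",
    "endpoint_config_name": "linear-learner-air-quality-config",
    "endpoint_name": "linear-learner-air-quality-endpoint",
}
missing = missing_event_keys(event)
```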


In [358]:
from iam_helper import create_sagemaker_lambda_role

lambda_role = create_sagemaker_lambda_role("deploy-model-lambda-role")

Using ARN from existing role: deploy-model-lambda-role


In [359]:
from sagemaker.workflow.lambda_step import LambdaStep
from sagemaker.lambda_helper import Lambda

endpoint_config_name = "linear-learner-air-quality-config"
endpoint_name = "linear-learner-air-quality-endpoint-" + current_time

deploy_model_lambda_function_name = "sagemaker-deploy-model-lambda-" + current_time

deploy_model_lambda_function = Lambda(
    function_name=deploy_model_lambda_function_name,
    execution_role_arn=lambda_role,
    script="deploy_model_lambda.py",
    handler="deploy_model_lambda.lambda_handler",
)

step_deploy_predictor = LambdaStep(
    name="Deploy-Linear-Learner-Air-Quality-Endpoint",
    lambda_func=deploy_model_lambda_function,
    inputs={
        "model_name": step_create_model.properties.ModelName,
        "endpoint_config_name": endpoint_config_name,
        "endpoint_name": endpoint_name,
        "endpoint_instance_type": transformer_instance_type,
        # The key must match the name read in lambda_handler
        "s3_capture_upload_path": s3_capture_upload_path,
    },
    cache_config=cache_config,
)

## Batch Transformer Step

The model can either be deployed for real-time inference or set up to run on batches of data with a transform job. Creating a `Transformer` from a SageMaker model gives an object that can be used to perform batch inference.

When creating the transformer, the output defaults to the SageMaker default bucket. Specify `output_path` to save to a more suitable location. The other relevant parameters are `instance_count` and `instance_type`, which set the number and size of the instances that run the transform job; `max_concurrent_transforms`, which determines how many HTTP requests can be made to each transform container at a time; and `max_payload`, which sets the maximum size in megabytes of a single request payload. According to the CreateTransformJob API reference, the product of `max_concurrent_transforms` and `max_payload` cannot exceed 100 MB.

The transformer can then be passed to the TransformStep, which enables the pipeline to create it.
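
The concurrency and payload settings interact: as we understand the CreateTransformJob API reference, their product may not exceed 100 MB. The following sketch checks a pair of values before they are used as pipeline parameters. The helper name is hypothetical, and the limit is stated here as an assumption taken from that reference.

```python
# Assumed limit on max_concurrent_transforms * max_payload (in MB).
MAX_TOTAL_PAYLOAD_MB = 100


def check_transform_settings(max_concurrent_transforms, max_payload_mb):
    """Return the data in flight per instance in MB, or raise if over the limit."""
    total = max_concurrent_transforms * max_payload_mb
    if total > MAX_TOTAL_PAYLOAD_MB:
        raise ValueError(
            "{} concurrent requests x {} MB = {} MB exceeds {} MB".format(
                max_concurrent_transforms, max_payload_mb, total, MAX_TOTAL_PAYLOAD_MB
            )
        )
    return total


# Defaults declared earlier: MaxConcurrentRequests=4, MaxPayloadMB=2.
in_flight_mb = check_transform_settings(4, 2)
```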

In [360]:
from sagemaker.transformer import Transformer
from sagemaker.workflow.steps import TransformStep
transformer = Transformer(
    model_name=step_create_model.properties.ModelName,
    instance_count=transformer_instance_count,
    instance_type=transformer_instance_type,
    max_concurrent_transforms=concurrency,
    max_payload=max_payload_in_mb,
    output_path=output_data_path,
)

step_batch_transform = TransformStep(
    name="Create-Linear-Learner-Air-Quality-Transformer",
    transformer=transformer,
    inputs=
        sagemaker.inputs.TransformInput(
            data=step_preprocess_data.properties.ProcessingOutputConfig.Outputs[
                "test"
            ].S3Output.S3Uri, # Use the same data from S3 as before
            data_type='S3Prefix',
            content_type='text/csv'
        ),
    
    cache_config=cache_config,
)

## Pipeline Creation: Orchestrate all steps

Now that all the steps are defined, we assemble them into a pipeline.
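
The execution order follows from the data each step consumes, not from the order in which the steps are listed. The sketch below illustrates that idea with a topological sort over the dependencies used in this notebook. The short step names are hypothetical labels, and this is not how the service itself schedules steps.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each step maps to the steps whose outputs it consumes in this notebook.
dependencies = {
    "preprocess": set(),
    "train": {"preprocess"},
    "create_model": {"train"},
    "batch_transform": {"create_model", "preprocess"},
    "deploy_endpoint": {"create_model"},
}

# static_order() yields every step after all of its prerequisites.
execution_order = list(TopologicalSorter(dependencies).static_order())
```

The batch transform and endpoint deployment both depend only on earlier steps, so they may run in either order or in parallel.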

In [384]:

from sagemaker.workflow.pipeline import Pipeline

# Create a Sagemaker Pipeline.
# Each parameter for the pipeline must be set as a parameter explicitly when the pipeline is created.
# Also pass in each of the steps created above.
# Note that the order of execution is determined from each step's dependencies on other steps,
# not on the order they are passed in below.
pipeline = Pipeline(
    name=pipeline_name,
    parameters=[
        processing_instance_type,
        training_instance_type,
        training_epochs,
        transformer_instance_type,
        transformer_instance_count,
        max_payload_in_mb,
        output_data_path,
        concurrency,
        nox_data,
        weather_data,
    ],
    steps=[step_preprocess_data, step_train_model, step_create_model, step_batch_transform, step_deploy_predictor],
)

## Execute the Pipeline

### Inspect the pipeline definition

In [385]:
import json

definition = json.loads(pipeline.definition())
# definition

### Submit pipeline

In [386]:
pipeline.upsert(role_arn=role)

{'PipelineArn': 'arn:aws:sagemaker:us-west-1:648739860567:pipeline/linearlearnerairqualitypipeline',
 'ResponseMetadata': {'RequestId': '2aaf00f9-26e5-4da6-876b-9451499dc9da',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '2aaf00f9-26e5-4da6-876b-9451499dc9da',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '99',
   'date': 'Mon, 06 Jun 2022 15:25:35 GMT'},
  'RetryAttempts': 0}}

### Execute pipeline using the default parameters

In [387]:
execution = pipeline.start()

### Wait for pipeline to complete

In [None]:
execution.wait()

## Visualize SageMaker Pipeline
In SageMaker Studio, choose `SageMaker Components and registries` in the left pane and, under `Pipelines`, click the pipeline that was created. All pipeline executions are then shown, and the one just created should have a status of `Succeeded`. After selecting that execution, you can track the individual pipeline steps as they run.

![](images/pipeline.png)
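
The same status information can be read programmatically. The sketch below summarizes a list of step records shaped like those returned by `execution.list_steps()`. The helper name is hypothetical and the sample records are invented for illustration, not output from a real run.

```python
from collections import Counter


def summarize_steps(steps):
    """Count steps per status and collect the names of failed steps."""
    counts = Counter(step["StepStatus"] for step in steps)
    failed = [step["StepName"] for step in steps if step["StepStatus"] == "Failed"]
    return dict(counts), failed


# Hypothetical records; a real list would come from execution.list_steps().
sample_steps = [
    {"StepName": "Preprocess-Linear-Learner-Air-Quality-Data", "StepStatus": "Succeeded"},
    {"StepName": "Train-Linear-Learner-Air-Quality-Model", "StepStatus": "Succeeded"},
    {"StepName": "Create-Linear-Learner-Air-Quality-Transformer", "StepStatus": "Failed"},
]
status_counts, failed_steps = summarize_steps(sample_steps)
```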

### Create a baselining job with training dataset
Now that you have the training data ready in Amazon S3, start a job to suggest constraints. `DefaultModelMonitor.suggest_baseline(...)` starts a processing job that uses an Amazon SageMaker provided Model Monitor container to generate the constraints.
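
To give a feel for what a baseline captures, the following sketch computes simple per-feature statistics and flags values that fall far outside them. It is a simplified illustration of the idea only and does not reproduce what the Model Monitor container computes. The function names, the three-standard-deviation threshold, and the sample values are all hypothetical.

```python
import statistics


def baseline_stats(values):
    """Summarize one numeric feature of the training data."""
    return {
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


def looks_drifted(value, stats, n_std=3.0):
    """Flag a value more than n_std standard deviations from the baseline mean."""
    return abs(value - stats["mean"]) > n_std * stats["std"]


# Hypothetical NO2 daily averages standing in for the training column.
training_no2 = [20.0, 22.0, 24.0, 26.0, 28.0]
stats = baseline_stats(training_no2)
typical_flag = looks_drifted(25.0, stats)
extreme_flag = looks_drifted(60.0, stats)
```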

In [None]:
baseline_prefix = prefix + "/baselining"
baseline_data_prefix = baseline_prefix + "/data"
baseline_results_prefix = baseline_prefix + "/results"

baseline_data_uri = "s3://{}/{}".format(bucket, baseline_data_prefix)
baseline_results_uri = "s3://{}/{}".format(bucket, baseline_results_prefix)
print("Baseline data uri: {}".format(baseline_data_uri))
print("Baseline results uri: {}".format(baseline_results_uri))

In [None]:
from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

my_default_monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)

my_default_monitor.suggest_baseline(
    baseline_dataset=baseline_data_uri + "/training-dataset-with-header.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri=json.loads(pipeline.definition())['Steps'][0]['Arguments']['ProcessingOutputConfig']['Outputs'][2]['S3Output']['S3Uri'],
    wait=True,
)

## Clean up (optional)

#### Delete the pipeline to keep the studio environment tidy.

In [None]:
def delete_sagemaker_pipeline(sm_client, pipeline_name):
    try:
        sm_client.delete_pipeline(
            PipelineName=pipeline_name,
        )
        print("{} pipeline deleted".format(pipeline_name))
    except Exception as e:
        print("{} \n".format(e))
        return

In [None]:
# `sm` is the SageMaker boto3 client created at the top of the notebook
delete_sagemaker_pipeline(sm, pipeline_name)

## Acknowledgements


### Irish Weather Data:
Met Éireann retains Intellectual Property Rights and copyright over our data. If data are published in raw or processed format Met Éireann must be acknowledged as the source. Met Éireann does not accept any liability whatsoever for any error or omission in the data series, their availability, or for any loss or damage arising from their use. This work is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) License.

### Irish Air Quality Data:
EPA, "EPA Ireland Archive of Nitrogen Oxides Monitoring Data". Associated datasets and digital information objects connected to this resource are available at: Secure Archive For Environmental Research Data (SAFER) managed by Environmental Protection Agency Ireland http://erc.epa.ie/safer/resource?id=216a8992-76e5-102b-aa08-55a7497570d3 (Last Accessed: 2018-06-30). Both data sources require, as part of their data usage license, that they be credited.

### Wind Rose Code
The Air Quality Rose was adapted from Wind Rose code that was published on GitHub under a BSD-license:
https://github.com/Geosyntec/cloudside

The air quality rose is based on a function called "rose" in the viz.py submodule:
https://github.com/Geosyntec/cloudside/blob/master/cloudside/viz.py#L370

