# <u>Model Deployment and Integration</u>

# Outline

- model deployment options
- how to choose the right option

- strategy to deploy model
    - minimize risk
        - A/B Testing
        - multi-armed bandits
        
- integrating models
    - how to use model
   
- monitor the model
    - SageMaker Model Monitor
    
- Lab
    - setting monitors for deployment
    - setting A/B testing

# Model Deployment Overview

- <img src='./pics/machine learning workflow.png'>

- model deployment
    - cloud
        - real-time inference
        - batch inference
        
- Deployment option (real-time inference)
- <img src='./pics/deploying model.png'>

- Batch inference
- <img src='./pics/batch inferance.png'>
    - run a batch job against a batch of requests
    - store the predictions in a database
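
The batch flow above can be sketched in plain Python. This is only an illustration under stated assumptions: `toy_model` is a hypothetical stand-in for a trained model, and an in-memory SQLite database stands in for whatever database actually holds the predictions.

```python
import sqlite3
from typing import Callable, Iterable, Sequence, Tuple


def toy_model(features: Sequence[float]) -> float:
    """Hypothetical stand-in for a trained model: returns the mean of the features."""
    return sum(features) / len(features)


def run_batch_job(
    conn: sqlite3.Connection,
    requests: Iterable[Tuple[str, Sequence[float]]],
    predict: Callable[[Sequence[float]], float],
) -> int:
    """Score every request in the batch and store the predictions in a table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS predictions "
        "(request_id TEXT PRIMARY KEY, prediction REAL)"
    )
    rows = [(request_id, predict(features)) for request_id, features in requests]
    conn.executemany("INSERT OR REPLACE INTO predictions VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


# Example batch: (request id, feature vector) pairs
conn = sqlite3.connect(":memory:")
batch = [("r1", [1.0, 3.0]), ("r2", [2.0, 2.0, 5.0])]
stored = run_batch_job(conn, batch, toy_model)
```

Downstream applications then read from the `predictions` table instead of calling the model directly.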
    
- Deployment of model in Edge
    - deploy model closer to users
    - in case of poor network connectivity
    
    - procedure
        - train the model elsewhere
        - optimize the model for deployment
            - package the model to run on smaller devices
            - use SageMaker Neo to compile the model to run at the edge
        - typical example: detecting manufacturing defects on the production line
        - inference results are sent to the cloud
            - for further analysis
            - for training and optimizing the model
- <img src='./pics/edge deployment.png'>

## Deployment options
- <img src='./pics/deployment options.png'>

    - Transient environment: compute and storage are consumed only while the job is running

# Model Deployment Strategies

- deploy new/updated model
- <img src='./pics/model deployment strategies.png'>
    - when you have a new model, you don't want its deployment to disrupt the service
    - you may want to monitor the new model for a period to assess its performance and, if there is any issue, roll it back
   
- Deployment strategies
    - Blue/Green
    - Shadow/Challenger
    - Canary
    - A/B
    - Multi-Armed Bandits
    
- Blue/Green Deployment
    - swap the prediction traffic to the new model
    - advantage: easy to roll back, and reduces downtime
    - <img src='./pics/Blue Green Deployment.png'>
    - risk: if the new model performs poorly, you are serving bad predictions to 100% of the traffic
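
A minimal sketch of the swap-and-rollback idea, assuming two in-process callables stand in for the blue and green model versions. `BlueGreenRouter` and its method names are hypothetical and not part of any AWS API.

```python
from typing import Callable, Dict, Optional


class BlueGreenRouter:
    """Routes 100% of traffic to one live model and remembers the previous one."""

    def __init__(self, models: Dict[str, Callable[[float], float]], live: str) -> None:
        self.models = models
        self.live = live
        self.previous: Optional[str] = None

    def swap(self, new_live: str) -> None:
        """Send all traffic to `new_live`; keep the old model around for rollback."""
        if new_live not in self.models:
            raise KeyError(f"unknown model: {new_live}")
        self.previous, self.live = self.live, new_live

    def rollback(self) -> None:
        """Return all traffic to the model that was live before the last swap."""
        if self.previous is None:
            raise RuntimeError("nothing to roll back to")
        self.live, self.previous = self.previous, self.live

    def predict(self, x: float) -> float:
        return self.models[self.live](x)


router = BlueGreenRouter({"blue": lambda x: x + 1, "green": lambda x: x * 10}, live="blue")
```

Because the old version stays deployed, rolling back is just another swap, which is why this strategy keeps downtime low.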
    
- Shadow/Challenger Deployment
    - prediction requests are sent to both models in parallel
    - validate the new version without impacting users
    - <img src='./pics/shadow channelger deployment.png'>
    - the prediction from model 1 is sent to the user
    - the prediction from model 2 is captured for analysis
    - once we are comfortable with model 2, we can serve the predictions from model version 2 to users
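
The shadow pattern can be sketched as follows, under the assumption that both models are plain Python callables. `ShadowDeployment` is a hypothetical name, and the two models are called one after the other here for simplicity rather than truly in parallel.

```python
from typing import Callable, List, Tuple


class ShadowDeployment:
    """Serve the live model's prediction; record the shadow model's for analysis."""

    def __init__(
        self,
        live: Callable[[float], float],
        shadow: Callable[[float], float],
    ) -> None:
        self.live = live
        self.shadow = shadow
        # Each entry is (input, live prediction, shadow prediction)
        self.shadow_log: List[Tuple[float, float, float]] = []

    def predict(self, x: float) -> float:
        live_pred = self.live(x)
        try:
            shadow_pred = self.shadow(x)
            self.shadow_log.append((x, live_pred, shadow_pred))
        except Exception:
            # A failing shadow model must never affect what the user receives
            pass
        return live_pred


deployment = ShadowDeployment(live=lambda x: x * 2, shadow=lambda x: x * 2 + 1)
served = deployment.predict(3)
```

Comparing the pairs in `shadow_log` offline is what lets you judge model 2 before any user sees its output.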
    
- Canary Deployment
    - split traffic
    - target smaller, specific users/groups
    - shorter validation cycles
    - minimize risk of low performing model    
    - <img src='./pics/canery deployment.png'>
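
One way to sketch a canary split is to hash each user ID into a bucket, so the same user always lands on the same model. This is an assumption made for illustration: the notes only say that a small, specific group is targeted, and the function names below are hypothetical.

```python
import hashlib


def canary_bucket(user_id: str, buckets: int = 100) -> int:
    """Map a user ID to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def choose_model(user_id: str, canary_percent: int) -> str:
    """Send roughly `canary_percent`% of users to the canary model."""
    return "canary" if canary_bucket(user_id) < canary_percent else "stable"


# Share of 10,000 synthetic users that would hit the canary at a 10% setting
canary_users = sum(choose_model(f"user-{i}", 10) == "canary" for i in range(10_000))
```

Raising `canary_percent` step by step widens the rollout without reshuffling users who are already on the canary.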

- A/B Deployment
    - splitting traffic
        - similar to canary
        - the split may be by group or random
    - target larger users/groups OR distribute a % of traffic
    - Longer validation cycles
    - minimize risk of low performing models
    - <img src='./pics/AB deployment.png'>
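
A random percentage split can be sketched with weighted sampling. The variant names and the 50/50 weights are illustrative assumptions, and `pick_variant` is a hypothetical helper.

```python
import random
from collections import Counter
from typing import Dict


def pick_variant(weights: Dict[str, float], rng: random.Random) -> str:
    """Pick one variant at random, in proportion to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Route 10,000 simulated requests with a fixed seed for repeatability
rng = random.Random(0)
counts = Counter(pick_variant({"A": 50, "B": 50}, rng) for _ in range(10_000))
```

With equal weights, each variant ends up with roughly half of the requests; the weights stay fixed for the whole test.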
    
- all the approaches covered so far are static approaches

- Multi-Armed Bandits
    - Dynamic approach
        - uses machine learning to decide how and when to route traffic between models
    - uses reinforcement learning to shift traffic to the winning model
    - however, some traffic still goes to the non-winning models, since an early winner is not necessarily the best model
    - experiment manager: a reinforcement learning model that determines how traffic is split between the models
    - reward metrics
        - exploitation: send more traffic to the winning model
        - exploration: keep checking whether the losing model can catch up with the winning model
        
    - <img src='./pics/multi-armed strategies.png'>
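
The exploration/exploitation trade-off can be illustrated with epsilon-greedy, one of the simplest bandit algorithms. This is a swapped-in toy technique chosen for brevity: the notes do not say which algorithm the experiment manager uses, and `EpsilonGreedyRouter`, `simulate`, and the reward rates are all hypothetical.

```python
import random
from typing import Dict, List


class EpsilonGreedyRouter:
    """Mostly routes to the best model so far, but sometimes explores the others."""

    def __init__(self, model_names: List[str], epsilon: float = 0.1, seed: int = 0) -> None:
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts: Dict[str, int] = {name: 0 for name in model_names}
        self.mean_reward: Dict[str, float] = {name: 0.0 for name in model_names}

    def choose(self) -> str:
        # Exploration: with probability epsilon, pick any model at random
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.counts))
        # Exploitation: otherwise pick the model with the best average reward
        return max(self.mean_reward, key=self.mean_reward.get)

    def update(self, name: str, reward: float) -> None:
        # Incremental update of the running mean reward for this model
        self.counts[name] += 1
        self.mean_reward[name] += (reward - self.mean_reward[name]) / self.counts[name]


def simulate(true_rates: Dict[str, float], rounds: int = 5000, seed: int = 0) -> Dict[str, int]:
    """Route `rounds` requests; reward is 1 when the chosen model 'succeeds'."""
    router = EpsilonGreedyRouter(list(true_rates), epsilon=0.1, seed=seed)
    env = random.Random(seed + 1)
    for _ in range(rounds):
        name = router.choose()
        reward = 1.0 if env.random() < true_rates[name] else 0.0
        router.update(name, reward)
    return router.counts


traffic = simulate({"model_a": 0.2, "model_b": 0.6})
```

Most traffic ends up on the model with the higher reward rate, while the other model keeps receiving a small share so it can still prove itself.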

# Amazon SageMaker Hosting: Real-Time Inference

- <img src='./pics/sagemaker endpoints.png'>

- model serving stack
- Hosting stack
    - web server that interacts with the model
    - the client application invokes the endpoint in real time
    - a request is sent each time you invoke the model
        - the request is sent through a load balancer
        - different built-in serializers and deserializers are available

    - we can choose,
        - instance type (the resources you need for prediction)
        - instance size
        - number of instances
        - autoscaling options
 
- 3 options for the model
    - built-in model
    - script mode
    - Docker

- Built-in model
- <img src='./pics/builtin model hosting.png'>
- you need,
    - the model container
    - the model artifacts stored in S3 (the trained model parameters)

- Script mode
- <img src='./pics/script mode hosting.png'>
- same as above, but you bring your own inference code

- Docker (bring your own model)
- <img src='./pics/docker hosting.png'>
       
- AWS distributes the containers across multiple Availability Zones for high availability


## Autoscaling 
- Why?
    - meet the demand of the workload
    - cost optimization
    
- <img src='./pics/autoscaling.png'>

- CloudWatch captures metrics such as
    - utilization metrics
    - invocation metrics: the number of invocations sent to an ML hosting endpoint
        - this is the default autoscaling metric; you can switch to another metric, such as CPU utilization, if you want
        
- <img src='./pics/scaling policy.png'>

- once the instances are up, the load balancer can distribute the load among the various instances.
- cooldown policy: controls how quickly the endpoint scales back down
    - time to wait (in seconds) before scaling down once scaled up

- cooldown period: the time, in seconds, after a scale-in activity completes before another scale-in can begin
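
To make the cooldown idea concrete, here is a deliberately simplified model of a target-tracking scaler. It is a sketch under assumptions, not the algorithm AWS actually runs: the desired instance count is taken as total invocations divided by the per-instance target, and each direction is blocked until its own cooldown has passed since the last activity in that direction. `TargetTrackingScaler` and its fields are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass
class TargetTrackingScaler:
    target_per_instance: float  # desired invocations per instance
    min_capacity: int
    max_capacity: int
    scale_out_cooldown: int     # seconds between scale-out activities
    scale_in_cooldown: int      # seconds between scale-in activities
    instances: int = 1
    last_scale_out: float = -math.inf
    last_scale_in: float = -math.inf

    def step(self, now: float, total_invocations: float) -> int:
        """Re-evaluate capacity at time `now` (seconds) and return the instance count."""
        wanted = math.ceil(total_invocations / self.target_per_instance)
        wanted = max(self.min_capacity, min(self.max_capacity, wanted))
        if wanted > self.instances and now - self.last_scale_out >= self.scale_out_cooldown:
            self.instances, self.last_scale_out = wanted, now
        elif wanted < self.instances and now - self.last_scale_in >= self.scale_in_cooldown:
            self.instances, self.last_scale_in = wanted, now
        return self.instances


scaler = TargetTrackingScaler(
    target_per_instance=2.0,
    min_capacity=1,
    max_capacity=4,
    scale_out_cooldown=60,
    scale_in_cooldown=300,
)
```

A longer scale-in cooldown than scale-out cooldown means the endpoint reacts quickly to load spikes but gives up capacity cautiously.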

### How to do the scaling? (cell below)
- register a scalable target
    - an AWS resource to scale: here it is the SageMaker endpoint
- define the scaling policy
- apply the autoscaling policy
    - apply the autoscaling policy to the endpoint
    - TargetTrackingScaling is the policy type used for the SageMaker endpoint

In [None]:
import boto3

# Application Auto Scaling client; `endpoint_name` and `role` are assumed to be defined earlier
autoscale = boto3.client("application-autoscaling")

# The scalable target is a production variant of the endpoint.
# "AllTraffic" is an assumed variant name; use the one from your endpoint configuration.
resource_id = "endpoint/" + endpoint_name + "/variant/AllTraffic"

# Register Scalable Target
autoscale.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=2,
    RoleARN=role,
    SuspendedState={
        "DynamicScalingInSuspended": False,
        "DynamicScalingOutSuspended": False,
        "ScheduledScalingSuspended": False,
    },
)

# Define the Scaling Policy
scaling_policy = {
    "TargetValue": 2.0,
    "PredefinedMetricSpecification": {  # scaling metric
        "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
    },
    "ScaleOutCooldown": 60,  # wait time, in seconds, before beginning another scale-out activity after the last one completes
    "ScaleInCooldown": 300,  # wait time, in seconds, before beginning another scale-in activity after the last one completes
}

# Apply the scaling policy
autoscale.put_scaling_policy(
    PolicyName=...,
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration=scaling_policy,
)

## Multi-Model Endpoints

- <img src='./pics/multi model endpoint.png'>

- multiple models behind the same endpoint
    - SageMaker dynamically invokes the various models
    - which model is invoked depends on the client request
        - in the picture, the client application is invoking model 1
    - models are loaded until the resources are exhausted
    - all models that share an endpoint also share the same container
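
With a multi-model endpoint, the client names the model it wants in the request. The sketch below builds such a request for the `invoke_endpoint` call of the boto3 `sagemaker-runtime` client, using its `TargetModel` parameter. The endpoint name, artifact name, JSON payload format, and the helper names `build_mme_request` and `invoke` are assumptions for illustration.

```python
import json
from typing import Any, Dict


def build_mme_request(
    endpoint_name: str, target_model: str, payload: Dict[str, Any]
) -> Dict[str, Any]:
    """Build invoke_endpoint arguments that select one model behind the endpoint."""
    return {
        "EndpointName": endpoint_name,
        "TargetModel": target_model,  # model artifact to invoke, e.g. "model-1.tar.gz"
        "ContentType": "application/json",
        "Body": json.dumps(payload),
    }


def invoke(request: Dict[str, Any]) -> bytes:
    """Send the request; needs AWS credentials and a deployed endpoint."""
    import boto3  # imported lazily so the request builder works without AWS access

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(**request)
    return response["Body"].read()


request = build_mme_request("my-endpoint", "model-1.tar.gz", {"x": [1, 2]})
```

Changing only `target_model` is enough to reach a different model through the same endpoint.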
    
## Inference Pipeline

- <img src='./pics/inferance pipeline.png'>
- the inference pipeline runs its steps sequentially behind the same endpoint
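
Conceptually, a sequential pipeline passes each step's output to the next step. The toy sketch below shows only that chaining: the three steps (`scale`, `toy_model`, `postprocess`) are hypothetical placeholders, whereas in SageMaker each step would be its own container.

```python
from typing import Callable, List, Sequence

Step = Callable[[List[float]], List[float]]


def run_pipeline(steps: Sequence[Step], payload: List[float]) -> List[float]:
    """Run the steps in order, feeding each step's output into the next."""
    for step in steps:
        payload = step(payload)
    return payload


def scale(values: List[float]) -> List[float]:
    """Preprocessing step: rescale the raw features."""
    return [v / 10.0 for v in values]


def toy_model(values: List[float]) -> List[float]:
    """Hypothetical model step: 'predicts' the sum of the features."""
    return [sum(values)]


def postprocess(values: List[float]) -> List[float]:
    """Postprocessing step: round the prediction for the client."""
    return [round(v, 2) for v in values]


result = run_pipeline([scale, toy_model, postprocess], [10.0, 20.0])
```

The client makes a single request and never sees the intermediate outputs.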

# Amazon SageMaker: Real-Time Inference Production Variants

- production variants
    - A/B testing (lab)
    - canary testing
    
- <img src='./pics/production varient.png'>

- production variants can be used for canary testing and A/B testing
- <img src='./pics/production varient for canery.png'>

- The canary user group is routed to the variant B model
    - this is achieved programmatically, by invoking variant B when the client application invokes the model
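
Programmatic routing can be sketched with the `TargetVariant` parameter of `invoke_endpoint` on the boto3 `sagemaker-runtime` client. The canary user IDs, the variant names, and the helper `build_variant_request` are hypothetical; passing the returned dictionary to `invoke_endpoint(**request)` would send the call.

```python
import json
from typing import Any, Dict, FrozenSet

# Hypothetical canary group; in practice this would come from your user store
CANARY_USERS: FrozenSet[str] = frozenset({"user-17", "user-42"})


def build_variant_request(
    endpoint_name: str,
    user_id: str,
    payload: Dict[str, Any],
    canary_users: FrozenSet[str] = CANARY_USERS,
) -> Dict[str, Any]:
    """Build invoke_endpoint arguments, pinning canary users to VariantB."""
    variant = "VariantB" if user_id in canary_users else "VariantA"
    return {
        "EndpointName": endpoint_name,
        "TargetVariant": variant,  # overrides the weight-based traffic split
        "ContentType": "application/json",
        "Body": json.dumps(payload),
    }


canary_request = build_variant_request("my-endpoint", "user-42", {"x": 1})
regular_request = build_variant_request("my-endpoint", "user-1", {"x": 1})
```

Leaving `TargetVariant` out entirely would let the endpoint fall back to the variant weights.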
    
- A/B testing (LAB for the week)
- <img src='./pics/production varient for AB.png'>
    - the traffic is split equally
    - you can route specific traffic to a specific model programmatically
    - In A/B testing
        - a larger group is participating in the model testing
        - for a longer period of time

## Steps for A/B testing (see the cell below)

- construct Docker image URIs
- create 2 model objects
    - <img src='./pics/sagemaker models.png'>
    
- create production variants
    - <img src='./pics/create production varients.png'>
    - 50% of the traffic flows to model A while the other 50% flows to model B
    
- create the endpoint configuration
- create endpoint

In [None]:
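# construct Docker image URIs
import boto3
import sagemaker
from sagemaker.session import production_variant

# low-level SageMaker client used by the create_* calls below
sm = boto3.client("sagemaker")

inference_image_uri = sagemaker.image_uris.retrieve(
    framework=...,  # pytorch, tensorflow etc
    version="1.6.0",
    instance_type="ml.m5.xlarge",  # assumed instance type; pick one supported for your framework
    py_version="py3",
    image_scope="inference",
)

# create the two models (model_name_a and model_name_b are assumed to be defined earlier)
sm.create_model(
    ModelName=model_name_a,
    # ... remaining arguments elided
)

sm.create_model(
    ModelName=model_name_b,
    # ... remaining arguments elided
)

# create production variants, each receiving 50% of the traffic
variant_a = production_variant(
    model_name=...,
    instance_type=...,
    initial_instance_count=1,
    variant_name="VariantA",
    initial_weight=50,
)

variant_b = production_variant(
    model_name=...,
    instance_type=...,
    initial_instance_count=1,
    variant_name="VariantB",
    initial_weight=50,
)

# endpoint configuration
endpoint_config = sm.create_endpoint_config(
    EndpointConfigName=...,
    ProductionVariants=[variant_a, variant_b],
)

# create endpoint
endpoint_response = sm.create_endpoint(
    EndpointName=...,
    EndpointConfigName=...,
)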
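
Once an A/B test has run, the variant weights on a live endpoint can be changed with the boto3 SageMaker client call `update_endpoint_weights_and_capacities`. The sketch below only builds the arguments; the 10/90 split, the endpoint name, and the helper names are assumptions for illustration and are not part of the lab steps.

```python
from typing import Any, Dict


def build_weight_update(endpoint_name: str, weights: Dict[str, float]) -> Dict[str, Any]:
    """Build arguments that set a new traffic weight for each production variant."""
    return {
        "EndpointName": endpoint_name,
        "DesiredWeightsAndCapacities": [
            {"VariantName": name, "DesiredWeight": weight}
            for name, weight in weights.items()
        ],
    }


def apply_weight_update(update: Dict[str, Any]) -> Dict[str, Any]:
    """Send the update; needs AWS credentials and an existing endpoint."""
    import boto3  # imported lazily so the builder works without AWS access

    sm = boto3.client("sagemaker")
    return sm.update_endpoint_weights_and_capacities(**update)


# Hypothetical outcome: VariantB won, so it now gets most of the traffic
update = build_weight_update("my-endpoint", {"VariantA": 10, "VariantB": 90})
```

Shifting weights this way avoids creating a new endpoint configuration for each traffic change.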

# Amazon SageMaker Batch Transform: Batch Inference

- SageMaker Batch Transform
- workflow
    - <img src='./pics/batch workflow.png'>
    
    - package the model
    - <img src='./pics/package the model batch.png'>
    
    - create the transformer
    - <img src='./pics/batch trasformer.png'>
    
    - start the transform job
    - <img src='./pics/run trasformer.png'>
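
As a rough sketch of the last step, a transform job can be started with the boto3 SageMaker client call `create_transform_job`. The helper names, S3 URIs, CSV content type, and instance type below are assumptions for illustration; the model is assumed to have been created already in the packaging step.

```python
from typing import Any, Dict


def build_transform_job(
    job_name: str,
    model_name: str,
    input_s3_uri: str,
    output_s3_uri: str,
    instance_type: str = "ml.m5.xlarge",
    instance_count: int = 1,
) -> Dict[str, Any]:
    """Build create_transform_job arguments for line-delimited CSV input in S3."""
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_s3_uri}
            },
            "ContentType": "text/csv",
            "SplitType": "Line",  # treat each line as one record
        },
        "TransformOutput": {"S3OutputPath": output_s3_uri},
        "TransformResources": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
        },
    }


def start_transform_job(job: Dict[str, Any]) -> Dict[str, Any]:
    """Start the job; needs AWS credentials and an existing SageMaker model."""
    import boto3  # imported lazily so the builder works without AWS access

    return boto3.client("sagemaker").create_transform_job(**job)


job = build_transform_job(
    "demo-job", "demo-model", "s3://my-bucket/input/", "s3://my-bucket/output/"
)
```

The instances exist only for the duration of the job, and the predictions are written under the output S3 path.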
   

# <u>Model Integration and Monitoring</u>

# Model Integration

# Monitoring ML Workloads

# Model Monitoring using Amazon SageMaker Model Monitor