# **Advanced Model Deployment and Monitoring**

## **Model Deployment and Integration**

### **Model Deployment Overview**

![](2024-01-02-11-12-38.png)

![](2024-01-02-11-12-59.png)

There are two general deployment options that you'll learn about: real-time inference and batch inference. I'll cover both of these options this week, but let's start with deploying a model for real-time inference in the cloud.

Deploying a model for real-time inference means deploying it to a persistent hosted environment that can serve prediction requests and return prediction responses in real time or near real time. This involves exposing an endpoint that has a serving stack that can accept and respond to requests. A serving stack needs to include a proxy that can accept incoming requests and direct them to an application that then uses your inference code to interact with your model.

This is a good option when you need low latency combined with the ability to serve new prediction requests as they come in. One example use case is fraud detection, where you may need to identify whether an incoming transaction is potentially fraudulent in near real time. Another is product recommendations, where you want to predict the appropriate products based on a customer's current search history or current shopping cart.

Let's take a look at how a real-time persistent endpoint would apply to your product review use case. In this case, let's assume you need to identify whether a product review is negative and immediately notify a customer support engineer about negative reviews so that they can proactively reach out to the customer right away. Here you have some type of web application that a consumer enters their product review into. That web application, or a secondary process called by that web application, then coordinates a call to the real-time endpoint that serves your model, passing in the new product review text. The hosted model then returns a prediction.
In this case it would be the negative class for sentiment, which can then be used to initiate a back-end process that opens a high-severity support ticket for the customer support engineer. Given that your objective here is a quick customer support response, you can see why you would need that model to be consistently available through a real-time endpoint that can serve prediction requests as they come in and return responses.

Let's now look at batch inference and see how it compares to real-time inference. With batch inference, you aren't hosting a model that persists and serves prediction requests as they come in. Instead, you batch those prediction requests, run a batch job against them, and then output your prediction responses, typically as batch records as well. Once you have your prediction responses, they can be used in a number of different ways. They are often used for reporting, or are persisted into a secondary data store for use by other applications or for additional reporting. Use cases that are focused on forecasting are a natural fit for batch inference. Say you're doing sales forecasting, where you typically use batch sales data over a period of time to come up with a new sales forecast.

![](2024-01-02-11-15-20.png)

In this case, you'd use batch jobs to process those prediction requests and potentially store those predictions for additional visibility or analysis.

Let's go back to your product review case. Let's say your ultimate business goal is to identify vendors that have potential quality issues by detecting trends in negative product reviews per vendor. In this case, you don't need a real-time endpoint. Instead, you would use a batch inference job that takes a batch of product review data as input, runs at a reasonable frequency that you identify, processes the predictions, and outputs that data. Just as the prediction request data is a set of batch records on input, the prediction responses that are output by the model are also collected as a set of batch records. That data could then be persisted so that your analysts can aggregate it and run reports to identify any potential issues with vendors that have a large number of negative reviews. Batch jobs aren't persistent, so they run for only the amount of time that it takes to process the batch requests on input.

Let's briefly touch on deployment to the edge, which is an option that is not cloud specific but is a key consideration when deploying models closer to your users or in areas with poor network connectivity. In the case of edge deployments, you train your models in another environment, in this case in the cloud, and then optimize your model for deployment to edge devices. This process is typically aimed at compiling or packaging your model in a way that is optimized to run at the edge, which usually means things like reducing the model package size for running on smaller devices. In this case, you could use something like SageMaker Neo to compile your model in a way that is optimized for running at the edge.

Edge use cases bring your model closer to where it will be used for prediction. Typical use cases here include manufacturing, where you have cameras on an assembly line and need to make real-time inferences, or use cases where you need to detect equipment anomalies at the edge. Inference data in this case is often sent back to the cloud for additional analysis or for collection of ground truth data that can then be used to further optimize your model.

I've covered the primary options for deploying a model: real-time inference, batch inference, and deployment to the edge. The right option to choose depends on several factors. The choice to deploy to the edge is typically an obvious one, because you're working with edge devices and possibly with use cases where there is limited network connectivity. You might also be working with Internet of Things (IoT) use cases, or use cases where the time spent on data transfer is unacceptable, even when it adds only single-digit milliseconds to the response.

The choice between real-time inference and batch inference typically comes down to the ways that you need to request and consume predictions, in combination with cost. A real-time endpoint can serve real-time predictions, where each prediction request sent on input is unique and requires an immediate response with low latency. The trade-off is that a persistent endpoint typically costs more, because you pay for the compute and storage resources that are required to host the model for as long as the endpoint is up and running. A batch job, in contrast, works well when you can batch your data for prediction and batch your responses back. These responses can then be persisted into a secondary database that can serve real-time applications when there is no need for new prediction requests and responses per transaction. In this case, you can run batch jobs in a transient environment, meaning that the compute and storage environments are only active for the duration of your batch job. As a general rule, you should use the option that meets your use case and is the most cost effective.
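To make the cost trade-off concrete, here is a rough back-of-the-envelope sketch comparing an always-on endpoint with a transient batch job. The hourly rate, instance counts, and job durations are hypothetical placeholders, not real prices, and the sketch covers compute only (no storage or data transfer).

```python
# Rough compute-cost comparison: persistent endpoint vs. transient batch job.
# All rates and durations are hypothetical; check current pricing before relying on this.

HOURS_PER_MONTH = 24 * 30  # simplifying assumption: a 30-day month


def endpoint_monthly_cost(hourly_rate: float, instance_count: int) -> float:
    """A persistent endpoint is billed for every hour it is up."""
    return hourly_rate * instance_count * HOURS_PER_MONTH


def batch_monthly_cost(hourly_rate: float, instance_count: int,
                       hours_per_job: float, jobs_per_month: int) -> float:
    """A batch job is billed only while the job is running."""
    return hourly_rate * instance_count * hours_per_job * jobs_per_month


if __name__ == "__main__":
    rate = 0.25  # hypothetical price per instance-hour
    print("endpoint:", endpoint_monthly_cost(rate, instance_count=2))
    print("batch:   ", batch_monthly_cost(rate, instance_count=2,
                                          hours_per_job=1.5, jobs_per_month=30))
```

With these placeholder numbers the batch option is much cheaper, but the comparison only matters if your use case can tolerate waiting for the next batch run.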

![](2024-01-02-11-18-22.png)

![](2024-01-02-11-18-57.png)

![](2024-01-02-11-19-20.png)

![](2024-01-02-11-19-50.png)

![](2024-01-02-11-20-32.png)

![](2024-01-02-11-21-10.png)

![](2024-01-02-11-21-40.png)

![](2024-01-02-11-22-01.png)

![](2024-01-02-11-22-28.png)

![](2024-01-02-11-23-01.png)

![](2024-01-02-11-23-41.png)

### **Model Deployment Strategies**

![](2024-01-02-11-36-18.png)

Choosing a deployment strategy is important because you want to deploy new models in a way that minimizes risk and downtime while measuring the performance of a new model or a new model version. As an example, if you have a newer version of a model, you typically don't want to deploy that new version in a way that disrupts service. You may also want to monitor the performance of that new model version for a period of time in a way that allows you to seamlessly roll back if there is an issue with it.

In this section, I'll talk about some of the common deployment strategies: blue/green deployments, shadow/challenger deployments, canary deployments, A/B testing, and finally, multi-armed bandits.

I'll start with blue/green deployments. Some of you may be familiar with the concept of blue/green deployments for applications or software. The same concept applies to models as well. With blue/green deployments, you deploy your new model version to a stack that can serve prediction and response traffic coming into an endpoint. Then, when you're ready to have that new model version actually start to process incoming prediction requests, you swap the traffic to that new model version. This makes it easy to roll back, because if there are issues with the new model version or it doesn't perform well, you can swap traffic back to the previous model version.

Let's take a closer look at how a blue/green deployment works. With a blue/green deployment, you have a current model version running in production. In this case, we have version 1. This accepts 100 percent of the prediction request traffic and responds with prediction responses. When you have a new model version to deploy, in this case model version 2, you build a new server or container to deploy your model version into.
This includes not only the new model version but also the code and the software that's needed to accept and respond to prediction requests. As you can see in this picture, the new model version is deployed, but the load balancer has not yet been updated to point to the new server hosting the model, so no traffic is hitting that endpoint yet. After the new model version is deployed successfully, you can then shift 100 percent of your traffic to the new cluster serving model version 2 by updating your load balancer. This strategy helps reduce downtime if there's a need to roll back to version 1, because you only need to re-point your load balancer back to version 1. The downside to this strategy is that it is a 100 percent swap of traffic. If the new model version, version 2 in this case, is not performing well, then you run the risk of serving bad predictions to 100 percent of your traffic rather than to a smaller percentage of traffic.

Let's now cover the second type of deployment strategy, called shadow or challenger deployment. The new version is often referred to as a challenger model because you're running it in production and letting it accept prediction requests to see how it would respond, but you're not actually serving the prediction response data from that new model version. This lets you validate the new model version with real traffic without impacting live prediction responses.

Let's take a look at how it works. With the shadow or challenger deployment strategy, the new model version is deployed, and 100 percent of the prediction request traffic is sent to each version. However, you'll notice that for version 2, only the prediction requests are sent to the model, and you aren't actually serving prediction responses from model version 2.
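The shadow routing just described can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the two models are ordinary callables, and `make_shadow_router` and `shadow_log` are hypothetical names, not part of any serving framework.

```python
from typing import Callable, List, Tuple

Model = Callable[[str], str]


def make_shadow_router(champion: Model, challenger: Model,
                       shadow_log: List[Tuple[str, str, str]]) -> Model:
    """Return a handler that serves the champion and only records the challenger."""

    def route(request: str) -> str:
        served = champion(request)
        try:
            shadowed = challenger(request)
        except Exception as err:  # a failing challenger must never affect live traffic
            shadowed = f"error: {err}"
        # Keep both answers side by side for later offline comparison.
        shadow_log.append((request, served, shadowed))
        return served  # only the champion's response reaches the caller

    return route
```

In a real system the challenger call would usually run asynchronously so that it adds no latency to the live response; the synchronous call here just keeps the sketch short.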
The responses that would have been sent back from model version 2 are typically captured and then analyzed to determine whether version 1 or version 2 of the model would have performed better against that full traffic load. This strategy also allows you to minimize the risk of deploying a new model version that may not perform as well as model version 1, because you can analyze how version 2 would perform without actually serving its prediction responses. Then, once you are comfortable that model version 2 is performing better, you can start to serve prediction responses directly from model version 2 instead of model version 1.

The next deployment strategy that I'll cover is canary deployment. With a canary deployment, you split traffic between model versions and expose the new model version to a smaller, targeted group. Typically, you expose a select set of users to the new model for a shorter period of time so that you can validate the performance of the new model version before fully deploying it to production. With canary deployments, model version 1 still serves the majority of your traffic while that smaller, specific group is routed to the new model version.
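One simple way to route a small, stable group of users to the new version is to hash each user ID into a bucket. The sketch below assumes that approach; the function names, the version labels, and the 5 percent default are illustrative choices rather than anything prescribed by a particular platform.

```python
import hashlib


def canary_bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def choose_version(user_id: str, canary_percent: int = 5) -> str:
    """Send roughly `canary_percent` of users to the new version, the rest to the old one."""
    return "model-v2" if canary_bucket(user_id) < canary_percent else "model-v1"
```

Because the bucket depends only on the user ID, a given user always sees the same model version for as long as the percentage is unchanged, which keeps the canary group consistent while you validate the new version.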

![](2024-01-02-11-41-46.png)

![](2024-01-02-11-43-08.png)

![](2024-01-02-11-44-26.png)

![](2024-01-02-11-44-56.png)

![](2024-01-02-11-45-45.png)

![](2024-01-02-11-49-58.png)

![](2024-01-02-11-50-42.png)

![](2024-01-02-11-53-10.png)

In the image here, you can see that 95 percent of prediction requests and responses are served by model version 1, and a smaller set of users is directed to model version 2. Canary deployments are good for validating a new model version with a specific or smaller set of users before rolling it out to all users. This is something that can't be done with a blue/green deployment strategy.

The next deployment strategy is A/B testing. Canary deployments and A/B testing are similar in that you're splitting traffic. However, A/B testing is different in that you typically split traffic between larger groups and for longer periods of time in order to measure the performance of different model versions over time. This split can be done by targeting specific user groups or just by setting a percentage of traffic to randomly distribute to different groups.

Let's take a closer look at A/B testing. With A/B testing, you split traffic between those larger groups for the purpose of comparing different model versions in live production environments. Here, you typically do a larger split across users, for example 50 percent to one model version and 50 percent to the other. You can also perform A/B testing against more than two model versions, although that's not shown here. While A/B testing seems similar to canary deployments, A/B tests are focused on gathering live data about different model versions. They typically run for longer periods of time so that they can gather enough performance data to be statistically significant, which gives you the ability to confidently roll out version 2 to a larger percentage of traffic.
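As one illustration of what "statistically significant" can mean here, the sketch below applies a two-proportion z-test to the success counts of two model versions. The choice of test and the notion of a "success" (for example, a prediction later confirmed as correct) are assumptions made for this example; the right test depends on your metric.

```python
import math
from typing import Tuple


def two_proportion_z_test(successes_a: int, total_a: int,
                          successes_b: int, total_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in success rates, B minus A."""
    rate_a = successes_a / total_a
    rate_b = successes_b / total_b
    # Pooled rate under the null hypothesis that both versions perform the same.
    pooled = (successes_a + successes_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

A small p-value suggests the observed difference is unlikely to be noise alone. The function assumes the pooled rate is strictly between 0 and 1; otherwise the standard error is zero and the test is undefined.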
Because you're running multiple models for longer periods of time, A/B testing allows you to validate your different model versions across multiple variations of user behavior. As an example, you may have a forecasting use case that has seasonality to it, so you need to capture how your model performs as the environment changes over time.

I just covered some of the common static approaches to deploying new or updated models. They are static in the sense that you manually identify things like when to swap traffic and how to distribute that traffic. I'll now cover another approach that is more dynamic in nature. Instead of manually identifying when and how to distribute traffic, you can take advantage of approaches that incorporate machine learning to automatically decide when and how to distribute traffic between multiple versions of a deployed model. For this, I'll cover multi-armed bandits.

A/B tests are typically fairly static and need to run over a period of time. With this, you run the risk of serving traffic with a bad or low-performing model for that same longer period of time. A more dynamic method for testing is multi-armed bandits. Multi-armed bandits use reinforcement learning to dynamically shift traffic to the winning model versions by rewarding the winning model with more traffic, while still exploring the non-winning model versions in case those early winners were not the overall best models.

Let's take a look at what a multi-armed bandit testing strategy looks like. In this implementation, you first have an experiment manager, which is basically a model that uses reinforcement learning to determine how to distribute traffic between your model versions. This model chooses the model version to send traffic to based on the current reward metrics and the chosen exploit/explore strategy.
Exploitation refers to continuing to send traffic to the winning model, whereas exploration routes traffic to the other models to see if they can eventually catch up or perform as well as the current winner. The experiment manager will also continue to adjust the prediction traffic to send more traffic to the winning model.

In this example, you can see a new product review comment and star rating, and your model versions are trying to predict the star rating. You can see that model version 1 predicted a five-star rating, while model version 2 predicted a four-star rating. The actual rating was four stars, so in this case model version 2 wins. Your multi-armed bandit will reward that model by sending more traffic to model version 2.

In this section, you learned about various deployment strategies that can be used to minimize downtime and evaluate the performance of a new model with little or no impact on your users. All of these concepts are general and apply to machine learning on any platform. In the next section, you'll learn more about deployment options that are specific to Amazon SageMaker.
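The exploit/explore behavior of a multi-armed bandit can be sketched with an epsilon-greedy strategy, which is one of the simplest bandit algorithms. This is an illustrative stand-in, not the specific experiment manager shown in the slides; the class name, the reward definition (1 for a correct star rating, 0 otherwise), and the default `epsilon` are all assumptions.

```python
import random
from typing import Dict, List


class EpsilonGreedyRouter:
    """Route traffic between model versions, favoring the best average reward so far."""

    def __init__(self, versions: List[str], epsilon: float = 0.1, seed: int = 0) -> None:
        self.epsilon = epsilon              # fraction of traffic spent exploring
        self.rng = random.Random(seed)
        self.pulls: Dict[str, int] = {v: 0 for v in versions}
        self.rewards: Dict[str, float] = {v: 0.0 for v in versions}

    def mean_reward(self, version: str) -> float:
        pulls = self.pulls[version]
        return self.rewards[version] / pulls if pulls else 0.0

    def choose(self) -> str:
        versions = list(self.pulls)
        if self.rng.random() < self.epsilon:
            return self.rng.choice(versions)        # explore: pick any version
        return max(versions, key=self.mean_reward)  # exploit: pick the current winner

    def record(self, version: str, reward: float) -> None:
        """Feed back the outcome once the true label (the actual rating) is known."""
        self.pulls[version] += 1
        self.rewards[version] += reward
```

For the product review example, you would call `record("v1", 0.0)` and `record("v2", 1.0)` after seeing that the actual rating was four stars, and subsequent calls to `choose()` would then mostly return `"v2"`.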

![](2024-01-02-11-55-57.png)

![](2024-01-02-11-57-05.png)

![](2024-01-02-11-58-07.png)

![](2024-01-02-11-59-28.png)

![](2024-01-02-12-00-28.png)

![](2024-01-02-12-02-16.png)

### **Amazon SageMaker Hosting: Real-Time Inference**

![](2024-01-02-12-07-16.png)

![](2024-01-02-12-08-11.png)

Let's go a little deeper and talk more about those endpoints and some of their more advanced features. As a reminder, SageMaker endpoints can be used to serve your models for predictions in real time with low latency. Serving your predictions in real time requires a model serving stack that has not only your trained model, but also a hosting stack to serve those predictions. That hosting stack typically includes some type of proxy and a web server that can interact with your loaded serving code and your trained model. Your model can then be consumed by client applications through real-time invoke API requests.

The request payload sent when you invoke the endpoint is routed to a load balancer and then routed to the machine learning instance or instances that are hosting your models for prediction. SageMaker has several built-in serializers and deserializers that you can use depending on your data formats. As an example, for serialization on the prediction request, you can use the JSON Lines serializer, which serializes your inference request data to a JSON Lines formatted string. For deserialization on the prediction response, the JSON Lines deserializer then deserializes the JSON Lines data from the inference endpoint response. Finally, the response payload is routed back to the client application.

With SageMaker model hosting, you choose the machine learning instance type and count, combined with the Docker container image and, optionally, the inference code. SageMaker then takes care of creating the endpoint and deploying the model to it. The type of machine learning instance you choose comes down to the amount of compute and memory you need.

I've covered the high-level architecture of a deployed SageMaker endpoint, but let's now cover some of the deployment options as they relate to the actual components that are deployed inside your machine learning instance.
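The request and response path described above can be sketched with the `boto3` SageMaker runtime client. The helpers below do the JSON Lines conversion by hand so that the format is visible; the record layout (`{"features": [...]}`) and the assumption that the container accepts and returns `application/jsonlines` are illustrative and depend on your serving code.

```python
import json
from typing import Any, Dict, List


def to_json_lines(records: List[Dict[str, Any]]) -> str:
    """Serialize records as JSON Lines: one JSON document per line."""
    return "\n".join(json.dumps(record) for record in records)


def from_json_lines(payload: str) -> List[Dict[str, Any]]:
    """Deserialize a JSON Lines payload back into a list of records."""
    return [json.loads(line) for line in payload.splitlines() if line.strip()]


def invoke(endpoint_name: str, records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Send records to a real-time endpoint (requires AWS credentials and a live endpoint)."""
    import boto3  # imported lazily so the helpers above work without AWS access

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/jsonlines",
        Accept="application/jsonlines",
        Body=to_json_lines(records),
    )
    return from_json_lines(response["Body"].read().decode("utf-8"))
```

A call such as `invoke("my-endpoint", [{"features": ["I love this product"]}])` would return the parsed predictions, where `"my-endpoint"` is a placeholder for your own endpoint name.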
SageMaker has three basic scenarios for deployment when you use it to train and deploy your model. You can use prebuilt code, prebuilt serving containers, or a mixture of the two.

I'll start with deploying a model that was trained using a built-in algorithm. In this option, you use prebuilt inference code combined with a prebuilt serving container. The container includes the web proxy and the serving stack, combined with the code that's needed to load and serve your model for real-time predictions. This scenario is valid for some of the SageMaker built-in algorithms, where you need only your trained model and the configuration for how you want to host it on the machine learning instances behind the endpoint. To deploy your endpoint in this scenario, you identify the prebuilt container image to use and the location of your trained model artifact in S3. Because SageMaker provides these built-in container images, you don't have any container images to build.

Let's now cover deploying a model using a built-in framework like TensorFlow or PyTorch. This option still uses a prebuilt container image that's purpose-built for the framework, but you can optionally bring your own serving code as well. You'll have the opportunity to work with this option in the lab for this week.

Finally, the last option I'll cover is bringing your own container image and inference code for hosting a model on a SageMaker endpoint. In this case, you'll have some additional work to do to create a container that's compatible with SageMaker for inference.
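For the first scenario above (a prebuilt container image plus a trained model artifact in S3), the deployment boils down to three `boto3` SageMaker calls: create a model, create an endpoint configuration, and create the endpoint. The sketch below builds the request payloads separately so they can be inspected; all names, the image URI, the S3 path, and the role ARN are placeholders you would replace with your own values.

```python
from typing import Any, Dict


def build_model_request(model_name: str, image_uri: str,
                        model_data_url: str, role_arn: str) -> Dict[str, Any]:
    """Model = container image for inference + location of the trained artifact in S3."""
    return {
        "ModelName": model_name,
        "PrimaryContainer": {"Image": image_uri, "ModelDataUrl": model_data_url},
        "ExecutionRoleArn": role_arn,
    }


def build_endpoint_config_request(config_name: str, model_name: str,
                                  instance_type: str, instance_count: int) -> Dict[str, Any]:
    """Endpoint config = which model to host and on what type and number of instances."""
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [{
            "VariantName": "AllTraffic",  # arbitrary label for the single variant
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": instance_count,
        }],
    }


def deploy(model_request: Dict[str, Any], config_request: Dict[str, Any],
           endpoint_name: str) -> None:
    """Create the model, the endpoint config, and the endpoint (requires AWS credentials)."""
    import boto3  # imported lazily so the builders above work without AWS access

    sm = boto3.client("sagemaker")
    sm.create_model(**model_request)
    sm.create_endpoint_config(**config_request)
    sm.create_endpoint(EndpointName=endpoint_name,
                       EndpointConfigName=config_request["EndpointConfigName"])
```

The other two scenarios use the same calls; what changes is the image you point to and whether the model artifact also carries your own inference code.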

![](2024-01-02-12-18-27.png)

![](2024-01-02-12-20-30.png)

![](2024-01-02-12-21-04.png)

![](2024-01-02-12-21-39.png)

![](2024-01-02-12-22-24.png)

![](2024-01-02-12-22-46.png)

![](2024-01-02-12-22-59.png)

![](2024-01-02-12-23-18.png)

![](2024-01-02-12-23-51.png)

![](2024-01-02-12-24-35.png)

But this also offers the flexibility to choose and customize the underlying container that's hosting your model.

You just learned about the three different types of deployment options for SageMaker. All of these options deploy your model to the number of machine learning instances that you specify when you're configuring your endpoint. You typically want to use smaller instances and more than one machine learning instance. In this case, SageMaker will automatically distribute those instances across AWS Availability Zones for high availability. But once your endpoints are deployed, how do you ensure that you can scale up and down to meet the demands of your workloads without overprovisioning your ML instances?

This is where autoscaling comes in. It allows you to scale the number of machine learning instances that are hosting your endpoints up or down based on your workload demands. This is important for meeting the demands of your workload, because you can increase the number of instances that serve your model when you reach a capacity threshold that you've established. It is also important for cost optimization, for two reasons. First, not only can you scale your instances up to meet higher workload demands when you need to, but you can also scale back down to a lower level of compute when it is no longer needed. Second, autoscaling allows you to maintain a minimum footprint during normal traffic workloads, rather than overprovisioning and paying for compute that you don't need. The on-demand access to compute and storage resources that the cloud provides is what makes it possible to quickly scale up and down.

Let's take a look at how it works conceptually. When you deploy your endpoint, the machine learning instances that back that endpoint emit a number of metrics to Amazon CloudWatch. For those who are unfamiliar with it, CloudWatch is the managed AWS service for monitoring your AWS resources.
SageMaker emits a number of metrics about the deployed endpoint, such as utilization metrics and invocation metrics. Invocation metrics indicate the number of times an invoke endpoint request has been run against your endpoint, and this is the default scaling metric for SageMaker autoscaling. You can also define a custom scaling metric, such as CPU utilization.

Let's assume you've set up autoscaling on your endpoint and you're using the default scaling metric of number of invocations. Each instance emits that metric to CloudWatch. As part of the scaling policy that you configure, if the number of invocations exceeds the threshold that you've identified, SageMaker applies the scaling policy and scales out by the number of instances that you've configured. After the scaling policy is applied to your endpoint, the new instances come online, and your load balancer distributes traffic to those new instances automatically.

You can also add a cooldown policy for scaling out your model, which is the value in seconds that you specify to wait for a previous scale-out activity to take effect. The scale-out cooldown period is intended to allow instances to scale out continuously, but not excessively. Finally, you can specify a cooldown period for scaling in your model as well. This is the amount of time in seconds after a scale-in activity completes before another scale-in activity can start. This allows instances to scale in slowly.

I just covered the concept of autoscaling SageMaker endpoints, but let's now cover how you actually set it up. First, you register your scalable target. A scalable target is an AWS resource, and in this case, you want to scale the SageMaker resource, as indicated in the service namespace that is passed as an input parameter. Because autoscaling is used by other AWS resources, you'll see a few parameters that specifically indicate that you want to scale a SageMaker endpoint resource.
Similarly, the scalable dimension is a set value for SageMaker endpoint scaling. The additional input parameters that you need to configure include the resource ID, which in this case is the endpoint variant that you want to scale. You'll also need to specify a few key parameters that control the minimum and maximum number of machine learning instances. The minimum capacity is the minimum number of instances that you want to scale in to. The maximum capacity is the maximum number of instances that you want to scale out to. In this case, you always want to have at least one instance running, and a maximum of two during peak periods.

After you register your scalable target, you need to define the scaling policy. The scaling policy provides additional information about the scaling behavior for your instances. In this case, you have your predefined metric, which is the number of invocations on your instance, and your target value, which indicates the number of invocations per machine learning instance that you want to allow before invoking your scaling policy. You'll also see the scale-out and scale-in cooldown settings that I mentioned previously. In this case, you see a scale-out cooldown of 60 seconds, which means that after autoscaling successfully scales out, it starts to calculate the cooldown time. The scaling policy won't increase the desired capacity again until the cooldown period ends. The scale-in cooldown setting of 300 seconds means that SageMaker will not attempt to start another scale-in activity within 300 seconds of when the last one completed.

In the final step to set up autoscaling, you apply the autoscaling policy to your endpoint.
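The setup steps above map onto two calls of the `boto3` Application Auto Scaling client: `register_scalable_target` and `put_scaling_policy`. The sketch below builds both payloads using the capacity and cooldown values from the walkthrough (minimum 1, maximum 2, cooldowns of 60 and 300 seconds). The endpoint name, variant name, policy name, and target value are placeholders, since the text does not give them.

```python
from typing import Any, Dict


def build_scalable_target(endpoint_name: str, variant_name: str,
                          min_capacity: int = 1, max_capacity: int = 2) -> Dict[str, Any]:
    """Describe which endpoint variant to scale and the allowed instance range."""
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": min_capacity,
        "MaxCapacity": max_capacity,
    }


def build_scaling_policy(target: Dict[str, Any], policy_name: str,
                         invocations_per_instance: float) -> Dict[str, Any]:
    """Target tracking policy on invocations per instance, with cooldowns in seconds."""
    return {
        "PolicyName": policy_name,
        "ServiceNamespace": target["ServiceNamespace"],
        "ResourceId": target["ResourceId"],
        "ScalableDimension": target["ScalableDimension"],
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
            "ScaleOutCooldown": 60,   # wait after a scale-out before scaling out again
            "ScaleInCooldown": 300,   # wait after a scale-in before scaling in again
        },
    }


def apply_autoscaling(target: Dict[str, Any], policy: Dict[str, Any]) -> None:
    """Register the target and attach the policy (requires AWS credentials)."""
    import boto3  # imported lazily so the builders above work without AWS access

    client = boto3.client("application-autoscaling")
    client.register_scalable_target(**target)
    client.put_scaling_policy(**policy)
```

Keeping the payload builders separate from the API calls makes it easy to review or unit test the configuration before anything is applied to a live endpoint.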

![](2024-01-02-12-25-14.png)

![](2024-01-02-12-26-21.png)

![](2024-01-02-12-27-06.png)

![](2024-01-02-12-27-53.png)

Your endpoint will now be scaled in and scaled out according to the scaling policy that you've defined. You'll notice here that you refer to the previous configuration that was discussed, and you'll also see a new parameter called policy type. Target tracking scaling refers to the specific autoscaling type that is supported by SageMaker. It uses a scaling metric and a target value as the indicator to scale. You'll have the opportunity to get hands-on in your lab for this week by setting up and applying autoscaling to SageMaker endpoints.

You just learned how SageMaker handles deployment to your machine learning instances across a variety of options, and I also walked you through how to apply autoscaling to dynamically provision resources to meet the demands of your workload. I'll now quickly cover a few additional capabilities of SageMaker endpoints that you should be aware of: multi-model endpoints and inference pipelines.

I'll start with multi-model endpoints. Until now, you've learned about SageMaker endpoints that serve predictions for one model. However, you can also host multiple models behind a single endpoint. Instead of downloading your model from S3 to the machine learning instance immediately when you create the endpoint, with multi-model endpoints SageMaker dynamically loads your models when you invoke them. You invoke them through your client applications by explicitly identifying the model that you're invoking. In this case, you see that the predict function identifies Model 1 for this prediction request. SageMaker will keep that model loaded until resources are exhausted on that instance. If you remember, I previously discussed the deployment options around the container image that is used for inference when you deploy a SageMaker endpoint. All of the models that are hosted on a multi-model endpoint must share the same serving container image.
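With the `boto3` SageMaker runtime client, identifying the model for a multi-model endpoint is done through the `TargetModel` parameter of `invoke_endpoint`. The sketch below assumes a CSV payload and uses placeholder endpoint and artifact names; the helper function names are illustrative.

```python
from typing import Any, Dict


def build_multi_model_request(endpoint_name: str, target_model: str,
                              body: str, content_type: str = "text/csv") -> Dict[str, Any]:
    """Build an invoke request that names the model artifact to use for this prediction."""
    return {
        "EndpointName": endpoint_name,
        "TargetModel": target_model,  # artifact path relative to the endpoint's S3 model prefix
        "ContentType": content_type,
        "Body": body,
    }


def invoke_model(request: Dict[str, Any]) -> bytes:
    """Send the request (requires AWS credentials and a live multi-model endpoint)."""
    import boto3  # imported lazily so the builder above works without AWS access

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(**request)
    return response["Body"].read()
```

The first request for a given `TargetModel` may be slower, because the model has to be loaded onto the instance before it can serve the prediction.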
Multi-model endpoints are an option that can improve endpoint utilization when your models are of similar size, share the same container image, and have similar invocation latency requirements.

Here, you'll see another feature called inference pipeline. An inference pipeline also allows you to host multiple models behind a single endpoint, but in this case the models are a sequential chain of models that perform the steps required for inference. This allows you to take your data transformation model, your predictor model, and your post-processing transformer, and host them so that they run sequentially behind a single endpoint. As you can see in this picture, the inference request comes into the endpoint, and then the first model is invoked, which is your data transformation. The output of that model is passed to the next step, which is your predictor model, an XGBoost model here. That output is then passed to the final step in the pipeline, which provides the final, post-processed response to the inference request. This allows you to couple your pre-processing and post-processing code behind the same endpoint, and helps ensure that your training and your inference code stay synchronized.

In this section, you learned more about using SageMaker hosting to deploy models to a fully managed endpoint for your real-time inference use cases. You also learned about hosting your endpoint on machine learning instances, where you can take advantage of capabilities like autoscaling to dynamically increase or decrease the number of machine learning instances hosting your models so that they can meet the demands of your prediction request traffic. You also learned about some of the advanced deployment options, such as multi-model endpoints and inference pipelines. These won't be in your labs for this week, but they are advanced deployment options to be aware of when you're looking at the best option for deploying your models.
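The sequential chaining behind an inference pipeline can be illustrated conceptually in plain Python, where each step's output becomes the next step's input. The three toy steps below are stand-ins for the data transformation, predictor, and post-processing containers; they are hypothetical and far simpler than real models.

```python
from functools import reduce
from typing import Any, Callable, Dict, List, Sequence

Step = Callable[[Any], Any]


def run_pipeline(steps: Sequence[Step], request: Any) -> Any:
    """Feed the request through each step in order, passing outputs along the chain."""
    return reduce(lambda data, step: step(data), steps, request)


def preprocess(text: str) -> List[str]:
    """Data transformation step: normalize and tokenize the raw review text."""
    return text.lower().split()


def predict(tokens: List[str]) -> int:
    """Predictor step: toy keyword rule standing in for a trained model."""
    negative_words = {"bad", "broken", "terrible"}
    return -1 if any(token in negative_words for token in tokens) else 1


def postprocess(label: int) -> Dict[str, str]:
    """Post-processing step: turn the raw label into a client-friendly response."""
    return {"sentiment": "negative" if label < 0 else "positive"}
```

Calling `run_pipeline([preprocess, predict, postprocess], "Arrived BROKEN")` walks the request through all three steps, which mirrors how a single endpoint invocation flows through the chained containers.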

![](2024-01-02-12-28-42.png)

![](2024-01-02-12-29-13.png)

![](2024-01-02-12-29-59.png)

### **Amazon SageMaker: Real-time Inference Production Variants**

![](2024-01-02-12-36-40.png)

![](2024-01-02-12-36-57.png)

![](2024-01-02-12-37-24.png)

![](2024-01-02-12-37-54.png)

![](2024-01-02-12-38-35.png)

![](2024-01-02-12-39-01.png)

![](2024-01-02-12-39-22.png)

![](2024-01-02-12-39-43.png)

![](2024-01-02-12-39-54.png)

![](2024-01-02-12-40-23.png)

### **Amazon SageMaker Batch Transform: Batch Inference**

![](2024-01-02-12-42-26.png)

![](2024-01-02-12-42-45.png)

![](2024-01-02-12-43-57.png)

![](2024-01-02-12-44-14.png)

![](2024-01-02-12-44-43.png)

![](2024-01-02-12-44-54.png)

Let's start with how Batch Transform jobs work. With Batch Transform, you package your model first. This step is the same whether you're going to deploy your model to a SageMaker endpoint or deploy it for batch use cases. Similar to hosting for SageMaker endpoints, you either use a built-in container for your inference image or bring your own. Your model package contains information about the S3 location of your trained model artifact and the container image to use for inference.

Next, you create your transformer. For this, you provide the configuration information about how you want your batch job to run. This includes parameters such as the size and type of the machine learning instances that you want to run your batch job with, as well as the name of the model package that you previously created. Additionally, you specify the output location, which is the S3 bucket where you want to store your prediction responses.

After you've configured your transformer, you're ready to start your batch transform job. This can be done on an ad hoc basis or scheduled as part of a normal process. When you start your job, you provide the S3 location of your batch prediction request data. SageMaker will then automatically spin up the machine learning instances using the configuration that you supplied, and it will process your batch requests for prediction. When the job is complete, SageMaker will automatically output the prediction response data to the S3 location that you specified and spin down the machine learning instances. Batch jobs operate in a transient environment, which means that the compute is only needed for the time it takes to complete the batch transform job.

Batch Transform also has more advanced features, such as inference pipeline. If you recall, inference pipeline allows you to sequentially chain together multiple models. You can combine your steps for inference within a single batch job, so that the batch job includes your data transformation model for transforming your input data into the format expected by the model, the actual model for prediction, and then potentially a data post-processing model that transforms the labels that will be used as your inference response and written to your S3 bucket for output.

In this section, you learned about Batch Transform as a way to deploy your model using SageMaker so that it meets your batch use case needs. You also learned that, similar to SageMaker endpoints, you can use the feature called inference pipeline to combine multiple models to run sequentially during your batch job.
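
The flow of a batch job (read request data from an input location, score it, write responses to an output location) can be sketched locally in Python. This is only an analogy that uses local folders in place of S3 and a toy function in place of a trained model. The names `run_batch_job` and `toy_predict` are hypothetical, and the `.out` suffix on output files is an assumption made for the example.

```python
from pathlib import Path
from typing import Callable, List


def run_batch_job(input_dir: Path,
                  output_dir: Path,
                  predict_fn: Callable[[str], str]) -> List[Path]:
    """Score every non-empty line of every input file, one output file per input file."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []

    for input_file in sorted(input_dir.glob("*.txt")):
        lines = input_file.read_text().splitlines()
        predictions = [predict_fn(line) for line in lines if line.strip()]

        # Assumed naming convention: input file name plus ".out".
        output_file = output_dir / (input_file.name + ".out")
        output_file.write_text("\n".join(predictions) + "\n")
        written.append(output_file)

    return written


def toy_predict(review: str) -> str:
    """Hypothetical stand-in for a trained sentiment model."""
    return "1" if "love" in review.lower() else "-1"
```

Nothing stays running after `run_batch_job` returns, which mirrors the transient nature of a batch job: the work happens only while there is input to process.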

![](2024-01-02-12-45-32.png)

![](2024-01-02-12-45-53.png)

![](2024-01-02-12-47-22.png)

## **Model Integration and Monitoring**

### **Model Integration**

![](2024-01-02-12-56-25.png)

![](2024-01-02-12-57-10.png)

![](2024-01-02-13-00-18.png)

![](2024-01-02-13-01-08.png)

![](2024-01-02-13-01-34.png)

When you deploy a model to an endpoint, that model was trained using specific features that have been engineered for model performance, and also to ensure that those features are in the required format and are understandable to your machine learning algorithm. The same transformations that were applied to your data for training need to be applied to the prediction requests that are sent to that deployed model. As an example, with your product review use case, if you were to send the raw text payload "I simply love it" to your hosted model, you would get an error. This is because your model was trained on data in a specific format. Without a data transformation that puts the request into the format expected for inference, you'll get an error because your model can't understand that text data. To fix this, you need to apply the same data transformations that were applied when you trained your model to your product review text before you send it to the model for prediction.

There are a number of ways that you can do this, but let's look at one potential method. In this case, you're relying on your client applications to transform the prediction request data into the correct format before it's sent to the endpoint for inference. This would work if your client code always remained synchronized with your training code, but it's difficult to scale when you have multiple applications or teams that interface with your model. As you can imagine, it's challenging in this case to always ensure that the data preprocessing code in each client stays synchronized with the data preprocessing code used for training your model. Another consideration here is that you may also still need to convert the model prediction response into a format that's readable by your client application. As an example, your model here will return a 1 for positive, so your client applications need to know that a class of 1 translates into positive.

Let's look at another option. You could implement a back-end function or process that runs before you reach the endpoint that hosts your model for prediction. This is a common implementation pattern, but you still need to ensure that your data transformation code, or the transformer model that runs before the formatted prediction request is sent to your endpoint, always stays synchronized with your trained model.

Finally, you can also couple your data preprocessing transformers with your model by hosting them behind the same endpoint. In this case, because your data preprocessing is tightly coupled with your model, and is hosted and deployed along with it, it helps ensure that your training and your inference code stay synchronized, while abstracting the complexity away from the machine learning client applications that integrate with your model.

In this section, I briefly covered some of the integration patterns for integrating your client applications with your deployed machine learning models. This doesn't cover every possibility or scenario, but it does provide you with the information that you need to consider when integrating your own models with different applications.
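
The difference between sending raw text straight to a model and coupling the preprocessing with the model can be illustrated with a toy Python sketch. Everything here is hypothetical: the tiny vocabulary, the rule-based `model_predict` stand-in, the label mapping (1, 0, -1), and the function names are assumptions chosen for illustration, not the course's actual model or code.

```python
from typing import Dict, List

# Hypothetical vocabulary and label mapping, assumed for this example only.
VOCAB: Dict[str, int] = {"i": 1, "simply": 2, "love": 3, "it": 4, "hate": 5}
LABELS: Dict[int, str] = {1: "positive", 0: "neutral", -1: "negative"}


def preprocess(text: str, max_len: int = 8) -> List[int]:
    """Turn raw review text into fixed-length token ids, as done for training."""
    token_ids = [VOCAB.get(token, 0) for token in text.lower().split()][:max_len]
    return token_ids + [0] * (max_len - len(token_ids))


def model_predict(token_ids: List[int]) -> int:
    """Toy stand-in for the trained model: it only understands integer token ids."""
    if not isinstance(token_ids, list) or not all(isinstance(t, int) for t in token_ids):
        raise TypeError("model expects a list of integer token ids")
    if 3 in token_ids:   # id of "love"
        return 1
    if 5 in token_ids:   # id of "hate"
        return -1
    return 0


def endpoint_predict(text: str) -> str:
    """Preprocessing, prediction, and post-processing coupled behind one call."""
    return LABELS[model_predict(preprocess(text))]
```

Calling `model_predict("I simply love it")` raises a `TypeError`, which mirrors the error you get when raw text reaches the model. Calling `endpoint_predict("I simply love it")` returns a readable label, and the client never needs to know about token ids or class numbers.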

### **Monitoring ML Workloads**

![](2024-01-02-13-03-14.png)

Models decay over time for a number of reasons, but they typically relate to some type of change in the environment where the model was originally trained. The trained models make predictions based on old information, and they are not able to adapt to changes in the environment over time. There are many examples of what causes models to degrade, but I'll cover a few. First, you can have a change in customer behavior. Let's say you have a model that's trying to predict which products a specific customer might be interested in. Customer behavior can change drastically and quickly, based on a number of factors such as life changes or the economy, just to name a few. Customers may be interested in new products based on those life changes, and if you're using a model that's trained on old data, the likelihood of providing timely and relevant recommendations goes down. Next, you could have a changing business environment. Let's say your company acquires a new product line that your model has never seen. Finally, you could have a sudden change in the upstream data as a result of a changing data pipeline. Let's say you ingest raw data that's used to train your model from multiple sources, and suddenly a feature that is used to train your model no longer appears in your ingested data. All of these examples, and many more, can lead to model decay.

So how can we monitor for signals of model decay? You often hear about two types of monitors when it comes to monitoring machine learning models. The first is concept drift and the second is data drift. I'll cover each of these in more detail, starting with concept drift. At a high level, concept drift happens when the environment you trained your model in no longer reflects the current environment. In this case, the actual definition of a label changes depending on a particular feature, such as geographical location or age group. When you have a model that predicts information in a dynamic world, the underlying dynamics of that world can shift, impacting the target your machine learning model is trying to predict. One method for detecting concept drift is to continue collecting ground truth data that reflects your current environment, and to run this labeled ground truth data against your deployed model to evaluate your model performance against your current environment. Here, you're looking to see whether the performance metric that you optimized for during training, like accuracy, still falls within an acceptable range for your current environment.

Another common type of model monitor is data drift. With data drift, you're looking for changes in the model input data, or changes to the feature data. You're looking for signals that the serving data has shifted from the original expected data distribution that was used for training. This is often referred to as training-serving skew. There are many methods to help with this level of monitoring. One is an open source library called Deequ, which performs a few steps to detect signs of data drift. First, you do data profiling to gather statistics about each feature that was used to train the model, collecting data like the number of distinct values for categorical data, or statistics like min and max for numeric features. Using the statistics that are gathered during data profiling, constraints are established to identify the boundaries for normal or expected ranges of values for your feature data. Finally, by using the profile data in combination with the identified constraints, you can detect anomalies to determine when your data goes out of range from the constraints. I covered two common types of model monitors for detecting concept and data drift. This isn't inclusive of every model monitor or method for monitoring your models, but it gives you a good idea of the monitors you should consider to detect potential signs of model decay.

Next, system monitoring is also key to ensuring that your models and the surrounding resources that support your machine learning workloads are monitored for signals of disruption as well as potential performance decline. You want to include system monitoring so that you can make sure the surrounding and underlying resources that are used to host your model are healthy and functioning as expected. This includes monitoring things like model latency, which is the time it takes for a model to respond to a prediction request. This also includes system metrics for the infrastructure that's hosting your model, such as CPU utilization. Finally, another example is monitoring your machine learning pipelines so that you know if there are any potential issues with model retraining or deploying a new model version. These are just some examples of the system monitoring you would need to consider as part of your overall monitoring strategy for your machine learning workloads.

Finally, let's cover the monitoring or measuring of business impact. With this, you're looking at ensuring your deployed model is actually accomplishing what you intend it to do, which ties back to the impact on your business objectives. This can be difficult to monitor or measure depending on the use case, but let's say you have excess stock of a particular item, and you want to get rid of that excess stock by offering coupons to customers who are likely to be interested in that particular product. The model you're building in this case will predict which users are likely to respond to that offer. You can typically identify how much stock you have before sending those targeted coupons, then see how many of the customers you sent coupons to actually bought the product, as well as the impact that it had on your stock of products.

In this section, I covered the general considerations for monitoring your machine learning workloads, which included monitoring the model specifically, as well as monitoring the underlying systems and resources that are hosting your model and interacting with your model. I also covered monitoring and measuring the business impact of your model.
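
The three data drift steps described in this section (profile the training data, derive constraints, detect anomalies in serving data) can be sketched in plain Python. This is a simplified illustration and does not use the Deequ API. The function names `profile`, `suggest_constraints`, and `find_violations`, and the choice of min and max as the only statistics, are assumptions made for the example.

```python
from typing import Dict, List, Tuple

Row = Dict[str, float]


def profile(rows: List[Row]) -> Dict[str, Dict[str, float]]:
    """Step 1: gather min and max statistics for each numeric feature."""
    stats: Dict[str, Dict[str, float]] = {}
    for row in rows:
        for name, value in row.items():
            feature = stats.setdefault(name, {"min": value, "max": value})
            feature["min"] = min(feature["min"], value)
            feature["max"] = max(feature["max"], value)
    return stats


def suggest_constraints(stats: Dict[str, Dict[str, float]]) -> Dict[str, Tuple[float, float]]:
    """Step 2: turn the profile into expected value ranges per feature."""
    return {name: (s["min"], s["max"]) for name, s in stats.items()}


def find_violations(rows: List[Row],
                    constraints: Dict[str, Tuple[float, float]]) -> List[str]:
    """Step 3: flag serving rows with missing or out-of-range features."""
    violations: List[str] = []
    for i, row in enumerate(rows):
        for name, (low, high) in constraints.items():
            if name not in row:
                violations.append(f"row {i}: missing feature '{name}'")
            elif not low <= row[name] <= high:
                violations.append(f"row {i}: '{name}'={row[name]} outside [{low}, {high}]")
    return violations
```

In practice you would usually widen or hand-tune the suggested ranges with domain knowledge, because constraints taken directly from the training min and max flag any new extreme value as an anomaly.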

![](2024-01-02-13-05-19.png)

![](2024-01-02-13-05-59.png)

![](2024-01-02-13-06-22.png)

![](2024-01-02-13-06-35.png)

![](2024-01-02-13-06-55.png)

![](2024-01-02-13-07-39.png)

![](2024-01-02-13-07-59.png)

![](2024-01-02-13-08-26.png)

### **Model Monitoring using Amazon SageMaker Model Monitor**

![](2024-01-02-13-10-20.png)

![](2024-01-02-13-11-10.png)

Model Monitor includes four different monitor types: data quality, to monitor drift in data; model quality, to monitor drift in model quality metrics; statistical bias drift, which is used to monitor signs of statistical bias drift in your model predictions; and finally feature attribution drift, which is used to monitor drift in feature attribution. Let's cover each of these in more detail, starting with data quality.

With the data quality monitor, you're monitoring for signals that the feature data that was used to train your models has drifted, or is statistically different from the current data that's coming in for inference. Model Monitor uses Deequ, which is an open source library built on Apache Spark that performs data profiling, generates constraints based on that data profile, and then detects anomalies when data goes out of bounds from the expected values or constraints.

I'll now walk you through how to set up the data quality monitor for your hosted SageMaker endpoints. To start, you need to enable data capture on your endpoint, which tells your endpoint to begin capturing the prediction request data coming in and the prediction response data. You can also identify a sampling percentage, which is the percentage of traffic that you want to capture. In this case, you're telling SageMaker to capture the prediction request and response for 100% of the traffic that comes into that endpoint. Next, you create a baseline, which runs a SageMaker Processing job that uses Deequ to capture statistics about the data that was used to train your model. Once that baseline runs, the output includes statistics about each feature of your training data, so depending on the features, there will be statistics relevant to that feature type. As an example, for numeric data you'll see statistics like the min or the max, whereas for string or categorical data you'll see statistics on missing or distinct values. The baseline job also automatically identifies constraints based on the statistics discovered. You can optionally add or modify constraints based on your domain knowledge as well. These constraints are then used to evaluate or monitor for potential signs of data drift. In the next step, you set up the monitoring schedule, which identifies how often you want to analyze your inference data against the established baseline, and more specifically against the constraints that have been identified. Finally, the monitoring job outputs results, which include statistics and any violations against your identified constraints. That information is also captured in Amazon CloudWatch, so that you can set up alerts for potential signs of data drift.

The second type of monitor in SageMaker Model Monitor is the model quality monitor. With the model quality monitor, you're using new ground truth data that is collected to evaluate your deployed model for signs of concept drift. Here, you use the new labeled data to evaluate your model against the performance metric that you optimized for during training, which could be something like accuracy. You then compare the new accuracy value to the one you identified during model training to see if your accuracy is potentially going down. Because you aren't doing hands-on activities with Model Monitor this week, I'm not going to walk through the steps for each monitor type. However, the general steps are the same as the ones you saw before with data quality: you enable data capture on your endpoint, create a model quality baseline, and then set up a monitoring schedule. For model quality, you also need to ensure that you have a method in place to collect and ingest new ground truth data that can be used to evaluate your model performance on that new labeled data.

Statistical bias drift is the third monitor type. The statistical bias drift monitor monitors predictions for signals of statistical bias, and it does this by integrating with SageMaker Clarify. Again, the process to set up this monitor is similar to the others. In this case, you create a baseline that is specific to bias drift and then schedule your monitoring jobs, just like the other monitor types.

Finally, the feature attribution monitor monitors for drift in your features. This monitor detects drift by comparing how the ranking of individual features changed from the training data to the live data. This helps explain model predictions over time. Again, the steps for this monitor type are similar to the others. However, the baseline job in this case uses SHAP behind the scenes. SHAP, or SHapley Additive exPlanations, is a common technique used to explain the output of a machine learning model.

In this section, you learned about the monitor types that SageMaker Model Monitor can use to monitor your models, as well as the process for setting up those monitors. Again, this is important because all of these monitors can help you detect potential signs of model decay for your deployed models, which also allows you to react quickly to models that are no longer performing well.
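
The core check behind a model quality monitor, comparing accuracy on newly collected ground truth with the accuracy identified during training, can be sketched in a few lines of Python. This is not the SageMaker Model Monitor API. The function names `accuracy` and `quality_violation`, and the `max_drop` tolerance of 0.05, are assumptions chosen for illustration.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the ground truth labels."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    correct = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
    return correct / len(y_true)


def quality_violation(baseline_accuracy: float,
                      y_true: Sequence[int],
                      y_pred: Sequence[int],
                      max_drop: float = 0.05) -> bool:
    """Return True when accuracy on new ground truth falls too far below the baseline.

    max_drop is a hypothetical tolerance; choose it to fit your use case.
    """
    return accuracy(y_true, y_pred) < baseline_accuracy - max_drop
```

Here `y_pred` would come from the captured endpoint predictions and `y_true` from the ground truth labels you collect afterwards. A scheduled job could run this comparison on each new batch and raise an alert whenever `quality_violation` returns `True`.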

![](2024-01-02-13-11-25.png)

![](2024-01-02-13-11-41.png)

![](2024-01-02-13-11-55.png)

![](2024-01-02-13-12-12.png)

![](2024-01-02-13-12-32.png)

![](2024-01-02-13-12-53.png)

![](2024-01-02-13-13-11.png)

![](2024-01-02-13-13-26.png)

![](2024-01-02-13-13-36.png)

![](2024-01-02-13-14-05.png)

![](2024-01-02-13-14-22.png)

![](2024-01-02-13-15-01.png)

![](2024-01-02-13-14-43.png)

![](2024-01-02-13-15-15.png)