# **Data Labeling and Human-in-the-loop Pipelines**

## **Data Labeling**

### **Data Labeling**

![](2024-01-02-17-02-16.png)

![](2024-01-02-17-02-55.png)

![](2024-01-02-17-03-12.png)

Before building, training, and deploying machine learning models, you need data. Again, successful models are built on high-quality training data. But collecting and labeling training data sets involves a lot of time and effort. To build training data sets, you need to evaluate and label a large number of data samples. These labeling tasks are usually distributed across more than one person, adding significant overhead and cost. If there are incorrect labels, the system will learn from the bad information and make inaccurate predictions. Let's discuss the concept of data labeling, common data labeling types, its challenges, and ways to efficiently label data in more detail.

A variety of studies have shown that the majority of a data scientist's time is spent on data preparation tasks such as finding, cleaning, and labeling data, and only about 20% of the time is actually spent on developing models and creating insight. Why is this? And how can you implement more efficient data labeling to break this 80/20 rule?

To understand the challenges, let's first define data labeling. In machine learning, data labeling is the process of identifying raw data, such as images, text files, and videos, and adding one or more meaningful and informative labels to that data. For example, labels might indicate whether a photo contains a dog or a cat, which words were uttered in an audio recording, or whether an X-ray image shows a tumor. Labels provide context so that the machine learning model can learn from the data. Today, most practical machine learning models use supervised learning. For supervised learning to work, you need a labeled set of data that the model can learn from, so it can make correct decisions. In machine learning, a properly labeled data set that you use as the objective standard to train and assess a given model is often called the ground truth. The accuracy of your trained model will depend on the accuracy of your ground truth, so spending the time and resources to ensure highly accurate data labeling is essential.

Let's have a look at common types of data labeling. When building a computer vision system, you need to label images or pixels, or draw a border that fully encloses an object in an image, known as a bounding box, to generate your training data set. You can classify images by type, such as an image showing a scene from either a basketball or a soccer game. Or you can classify images by content, defining what is actually in the image itself, such as a human and a vehicle in the example shown here. These are examples of single-label and multi-label classification. You can also segment an image at the pixel level. This process, known as semantic segmentation, identifies all pixels that fall under a given label and usually involves applying a colored fill or mask over those pixels. You can use labeled image data to build a computer vision model that automatically categorizes images, detects the location of objects, or segments an image.

If you need to label video data, you can choose between video classification and video object detection tasks. In a video classification task, you categorize your video clips into specific classes, such as whether the video shows a scene from a concert or from sports. In video object detection tasks, you can choose between bounding box, where workers draw bounding boxes around specified objects in your video; polygon, where you draw polygons around specified objects, as shown here with the cars example; polyline, where you draw polylines along specified objects, as shown here in the running track video; or key point, where you place key points on specified objects, as shown here in the volleyball game example. Instead of just detecting objects, you can also track objects in video data using the same data labeling techniques shown here. The difference is that instead of looking at the video on an individual frame-by-frame basis, you track the movement of objects across a sequence of video frames.

In natural language processing, you identify important sections of text or tag the text with specific labels to generate your training data set. For example, you may want to identify the sentiment or intent of a text. In a single-label classification task, this might be assigning the label positive or negative to a text. Or you might want to assign multiple labels, such as positive and inspiring, to the text. This would be an example of multi-label classification. With named entity recognition, you apply labels to words within a larger text, for example, to identify places and people. Natural language processing models are used for text classification, sentiment analysis, named entity recognition, and optical character recognition.

The biggest challenge in data labeling is the massive scale. Machine learning models need large labeled data sets. This could be tens of thousands of images to train a computer vision model, or thousands of documents to fine-tune a natural language model. Another challenge is the need for high accuracy. Machine learning models depend on accurately labeled data. Again, if there are incorrect labels, the system will learn from the bad information and make inaccurate predictions. A third challenge is time. Data labeling is time consuming. As discussed, building a training data set can take up to 80% of a data scientist's time.

So how can you label data more efficiently? To address these challenges, you can combine human labelers with managed data labeling services. These services provide additional tools to scale the labeling effort through access to additional human workforces, to train a model based on human feedback so it can perform automated data labeling, and to increase labeling quality by offering additional features that assist the human labelers. In the next video, I will introduce you to one of those managed data labeling services, Amazon SageMaker Ground Truth.
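
To make the label types above more concrete, here is a minimal sketch of how classification labels, bounding boxes, named-entity spans, and segmentation masks might be represented in code. The class names, fields, and sample values are illustrative assumptions, not the format of any particular labeling service.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClassificationLabel:
    """Single-label or multi-label classification of an image, video clip, or text."""
    item_id: str
    labels: List[str]  # one entry = single-label, several entries = multi-label

    @property
    def is_multi_label(self) -> bool:
        return len(self.labels) > 1


@dataclass
class BoundingBox:
    """Axis-aligned box around one object, in pixel coordinates."""
    label: str
    left: int
    top: int
    width: int
    height: int

    def area(self) -> int:
        return self.width * self.height


@dataclass
class EntitySpan:
    """Named entity: a labeled character range [start, end) within a text."""
    label: str
    start: int
    end: int

    def text(self, document: str) -> str:
        return document[self.start:self.end]


@dataclass
class SegmentationMask:
    """Semantic segmentation: one class index per pixel, stored row by row."""
    classes: List[str]
    pixels: List[List[int]]

    def pixel_count(self, label: str) -> int:
        index = self.classes.index(label)
        return sum(row.count(index) for row in self.pixels)


# Made-up sample annotations for illustration.
sentiment = ClassificationLabel("review-1", ["positive", "inspiring"])
box = BoundingBox("vehicle", left=40, top=60, width=120, height=80)
doc = "Ada Lovelace was born in London."
entities = [EntitySpan("PERSON", 0, 12), EntitySpan("PLACE", 25, 31)]
mask = SegmentationMask(["background", "dog"], [[0, 1, 1], [0, 0, 1]])
```

Keeping each label type in its own small structure makes it easy to validate annotations, for example by checking that an entity span really covers the intended words, before they become part of the ground truth.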

![](2024-01-03-11-44-40.png)

![](2024-01-03-11-46-43.png)

![](2024-01-03-11-47-59.png)

![](2024-01-03-11-50-18.png)

![](2024-01-03-11-52-05.png)

![](2024-01-03-11-52-56.png)

![](2024-01-03-11-54-10.png)

### **Data Labeling with Amazon SageMaker Ground Truth**

### **Data Labeling Best Practices**

## **Human-in-the-loop Pipelines**

### **Human-In-The-Loop Pipelines**

![](2024-01-02-21-38-51.png)

Some machine learning applications need human oversight to ensure accuracy with sensitive data, to help provide continuous improvements, and to retrain models with updated predictions. However, in these situations, you're often forced to choose between a machine-learning-only or a human-only system. Again, what if you want the best of both worlds: integrating machine learning systems into your workflow while keeping a human eye on the results to provide a required level of precision? This concept is called human-in-the-loop pipelines. You can allow human reviewers to step in when a model is unable to make a high-confidence prediction, or to audit its predictions on an ongoing basis. For example, in image classification tasks, you can allow human reviewers to verify that the appropriate class has been selected. In the example shown here, a human reviewer would verify that the appropriate dog breed has been assigned by the machine learning model. Another common use case is to review form extraction tasks. Extracting information from scanned employment or mortgage application forms can require human review in some cases due to low-quality scans or poor handwriting.

Now, let's have a look at how to implement human reviews of model predictions. First, a client application sends input data to your machine learning model, and the model makes a prediction. If the prediction comes back with a high confidence score, the prediction result is returned directly to the client application. If the prediction comes back with a low confidence score, the result is sent for human review. The human reviewers correct the result if needed, and the consolidated results across all human reviewers are returned to the client application. You should also store the reviewed prediction result and make it available to retrain the model, so it can improve over time. Note that the confidence threshold value that decides whether to start a human loop for review will depend on your use case and requirements. Instead of starting a human loop based on the model's prediction confidence, you could also decide to review a random sampling percentage of the model predictions.

Now, if you were to implement this human review of ML predictions manually, you would need to coordinate across ML scientists, engineering, and operations teams. You would need to manage a large number of reviewers and write custom software to manage the review tasks. Also, with manual processes, it's difficult to achieve high review accuracy. This is another scenario in which you can benefit from managed services to orchestrate the human-in-the-loop pipelines for you. One of those managed services is Amazon Augmented AI, or Amazon A2I.
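
The review workflow described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the threshold and sampling rate are arbitrary example values, and `fake_model` and `fake_reviewers` are hypothetical stand-ins for a real model endpoint and a real review team.

```python
import random
from typing import Callable, List, Optional, Tuple

CONFIDENCE_THRESHOLD = 0.90  # assumed value; depends on your use case
SAMPLING_RATE = 0.05         # assumed: audit 5% of confident predictions


def needs_human_review(confidence: float,
                       threshold: float = CONFIDENCE_THRESHOLD,
                       sampling_rate: float = SAMPLING_RATE,
                       rng: Optional[random.Random] = None) -> bool:
    """Send low-confidence predictions, plus a random sample, to reviewers."""
    if confidence < threshold:
        return True
    generator = rng or random
    return generator.random() < sampling_rate


def handle_request(item: str,
                   predict: Callable[[str], Tuple[str, float]],
                   ask_humans: Callable[[str, str], str],
                   reviewed_store: List[dict],
                   threshold: float = CONFIDENCE_THRESHOLD,
                   sampling_rate: float = SAMPLING_RATE) -> str:
    """Return the model's label directly, or the human-reviewed label."""
    label, confidence = predict(item)
    if not needs_human_review(confidence, threshold, sampling_rate):
        return label
    reviewed_label = ask_humans(item, label)
    # Keep the reviewed result so it can be used to retrain the model later.
    reviewed_store.append({"input": item, "label": reviewed_label})
    return reviewed_label


def fake_model(text: str) -> Tuple[str, float]:
    # Hypothetical model: confident about one phrase, unsure otherwise.
    return ("positive", 0.97) if "enjoyed" in text else ("neutral", 0.54)


def fake_reviewers(text: str, predicted: str) -> str:
    # Hypothetical reviewers who confirm the neutral label.
    return "neutral"


store: List[dict] = []
confident = handle_request("I enjoyed this product", fake_model,
                           fake_reviewers, store, sampling_rate=0.0)
uncertain = handle_request("It is okay", fake_model,
                           fake_reviewers, store, sampling_rate=0.0)
```

Setting `sampling_rate=0.0` in the usage lines disables random auditing so that the outcome is deterministic; a nonzero rate would additionally route a random share of confident predictions to review.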

![](2024-01-03-10-46-02.png)

![](2024-01-03-10-46-29.png)

![](2024-01-03-10-47-07.png)

![](2024-01-03-10-48-00.png)

![](2024-01-03-10-48-35.png)
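
The workflow in this section returns a consolidated result across all human reviewers. How answers are consolidated is not specified here, so the following sketch assumes a simple majority vote and also reports the share of reviewers who agreed; the function name is hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def consolidate(answers: List[str]) -> Tuple[str, float]:
    """Majority vote over reviewer answers, plus the agreement ratio."""
    if not answers:
        raise ValueError("at least one reviewer answer is required")
    counts = Counter(answers)
    # most_common(1) returns the label with the highest vote count; on a tie,
    # the label that was encountered first wins.
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)


label, agreement = consolidate(["neutral", "neutral", "positive"])
```

A low agreement ratio can itself be a useful signal, for example to send the item to an additional reviewer before storing it as training data.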

### **Human-In-The-Loop Pipelines with Amazon Augmented AI (Amazon A2I)**

![](2024-01-03-11-12-19.png)

![](2024-01-03-11-13-20.png)

Let's discuss how you build human-in-the-loop pipelines with Amazon Augmented AI, or Amazon A2I. Here is the human-in-the-loop workflow again, to implement human review of model predictions. Amazon A2I makes it easy to build and manage human reviews for machine learning applications. Amazon A2I provides built-in human review workflows for common machine learning use cases such as text extraction from documents, which allows predictions from, for example, Amazon Textract to be reviewed easily. You can also create your own workflows for machine learning models built on SageMaker or any other tools. Using Amazon A2I, you can allow human reviewers to step in when a model is unable to make a high-confidence prediction, or to audit its predictions on an ongoing basis.

Let's dive deeper into the individual steps for building a human-in-the-loop review system for model predictions. First, you need to define the human workforce again. Similar to Ground Truth, which was shown earlier this week, you can choose between the Amazon Mechanical Turk workforce of over 500,000 independent contractors worldwide, a private workforce consisting of your employees or co-workers, or a vendor company listed on the AWS Marketplace. The workforce setup steps are exactly the same, because Amazon A2I leverages the workforce teams set up via Ground Truth. In this example, let's reuse the private workforce setup I demonstrated earlier this week.

Next, you need to define the human task UI, which is the instructions page for your human reviewers. Again, this is done in exactly the same way you set up the task UI for the data labelers in the Ground Truth example. Again, I will reuse the text classification UI to classify product reviews into the three sentiment classes.

In a third step, you need to define the human review workflow. The human review workflow is defined in a flow definition. The flow definition specifies the workforce where your tasks are sent, the set of instructions that your workforce receives, which is the task UI template you created in the previous step, and the configuration of your worker tasks, including the number of workers that receive a task. The flow definition also specifies where your output data is stored.

Amazon A2I provides built-in human review workflows for common ML use cases such as content moderation and text extraction from documents. For this purpose, Amazon A2I is integrated with AWS AI services, including Amazon Textract. You can also create custom workflows to integrate with additional AI services or ML models built on SageMaker or any other tools. Let's have a look at those options. Amazon Textract is a document analysis service that detects and extracts printed text, handwriting, and structured data, such as fields of interest with their values and tables, from images and scans of documents. Thanks to the built-in integration, you can start a human review workflow within an application programming interface (API) call to Amazon Textract by providing the conditions that cause a human loop to be started. You can also build a custom human review workflow with other AWS AI services by providing a few lines of code that define the conditions that cause the human loop to be created.

Let's have a look at how you could start a human loop on low-confidence model predictions from Amazon Comprehend. Amazon Comprehend is a natural language processing service that uses machine learning to uncover information in unstructured data. Let's use Amazon Comprehend to classify our product reviews. First, let's define a few sample product reviews, such as "I enjoyed this product" and "It is okay". You also need to define a condition for when to start the human loop. For this example, let's say you want to review all predictions that are returned with a confidence score lower than 90 percent.
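
Before moving on to the Amazon Comprehend call, the flow definition described above can be made concrete. The sketch below assembles the arguments for the `create_flow_definition` operation of the SageMaker API as a plain dictionary. The helper name, task title and description, and all ARNs and the bucket are placeholders chosen for illustration; the actual API call is left as a comment because it requires AWS credentials.

```python
from typing import Any, Dict


def build_flow_definition_request(
    name: str,
    role_arn: str,
    workteam_arn: str,
    human_task_ui_arn: str,
    s3_output_path: str,
    task_count: int = 1,
) -> Dict[str, Any]:
    """Assemble keyword arguments for SageMaker's create_flow_definition."""
    return {
        "FlowDefinitionName": name,
        "RoleArn": role_arn,
        "HumanLoopConfig": {
            "WorkteamArn": workteam_arn,          # workforce that receives tasks
            "HumanTaskUiArn": human_task_ui_arn,  # task UI template (instructions)
            "TaskTitle": "Classify product review sentiment",
            "TaskDescription": "Choose the sentiment class that fits the review.",
            "TaskCount": task_count,              # number of workers per task
        },
        "OutputConfig": {"S3OutputPath": s3_output_path},  # where results go
    }


# Placeholder values for illustration only.
request = build_flow_definition_request(
    name="product-review-sentiment",
    role_arn="arn:aws:iam::111122223333:role/example-a2i-role",
    workteam_arn=("arn:aws:sagemaker:us-east-1:111122223333:"
                  "workteam/private-crowd/example-team"),
    human_task_ui_arn=("arn:aws:sagemaker:us-east-1:111122223333:"
                       "human-task-ui/example-ui"),
    s3_output_path="s3://example-bucket/a2i-results",
    task_count=3,
)

# With AWS credentials configured, the request could be submitted like this:
# import boto3
# boto3.client("sagemaker").create_flow_definition(**request)
```
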
Next, you call the Amazon Comprehend API and provide the sample reviews and an Amazon Comprehend model endpoint. Note that in this example I've already trained a custom Amazon Comprehend model on the product reviews data and the three sentiment classes. This custom Amazon Comprehend model is available via the provided endpoint Amazon Resource Name, or ARN. Then I parse the sentiment class and confidence score from the Amazon Comprehend API response. In a simple if-clause, I can check whether the returned confidence score is under the defined threshold of 90 percent, which means I want to start the human loop with the predicted sentiment class and the actual sample review as inputs.

Lastly, let's discuss how you can create your human review workflow for custom ML models built on SageMaker. As in the previous example of how Amazon A2I can work with AWS AI services, you can also use Amazon A2I to review real-time low-confidence predictions made by a model deployed to a SageMaker hosted endpoint, and incrementally train your model using Amazon A2I output data. From your notebook environment, let's define a custom SageMaker Predictor class to specify how to process the model inputs and how to process the model outputs. You can do this via the Serializer and Deserializer settings. Then let's deploy the fine-tuned PyTorch RoBERTa model as a SageMaker model endpoint. You might recall that RoBERTa stands for the Robustly Optimized BERT Pretraining Approach. You can use the PyTorchModel class, provide a model name, specify the custom predictor I created in the previous step, and provide the S3 location of the model artifact, the model.tar.gz file.
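
The Amazon Comprehend walkthrough above can be sketched as follows. The function takes the two service clients as parameters so that it can be tried without AWS access; with credentials, they would be `boto3.client("comprehend")` and `boto3.client("sagemaker-a2i-runtime")`. The class names, scores, ARNs, and the `initialValue` and `taskObject` input fields are assumptions for illustration and must match whatever your own task UI template expects.

```python
import json
import uuid
from typing import Dict, List, Tuple

CONFIDENCE_THRESHOLD = 0.90  # review everything below 90 percent


def top_class(response: Dict) -> Tuple[str, float]:
    """Pick the highest-scoring class from a classify_document response."""
    best = max(response["Classes"], key=lambda c: c["Score"])
    return best["Name"], best["Score"]


def classify_and_review(reviews: List[str], comprehend, a2i,
                        endpoint_arn: str, flow_definition_arn: str,
                        threshold: float = CONFIDENCE_THRESHOLD) -> List[str]:
    """Classify each review; start a human loop for low-confidence results."""
    started = []
    for review in reviews:
        response = comprehend.classify_document(Text=review,
                                                EndpointArn=endpoint_arn)
        sentiment, score = top_class(response)
        if score < threshold:
            loop_name = str(uuid.uuid4())
            a2i.start_human_loop(
                HumanLoopName=loop_name,
                FlowDefinitionArn=flow_definition_arn,
                HumanLoopInput={"InputContent": json.dumps(
                    {"initialValue": sentiment, "taskObject": review})},
            )
            started.append(loop_name)
    return started


class FakeComprehend:
    """Stand-in for the Comprehend client, returning made-up scores."""
    def classify_document(self, Text, EndpointArn):
        if "enjoyed" in Text:
            return {"Classes": [{"Name": "positive", "Score": 0.97},
                                {"Name": "neutral", "Score": 0.03}]}
        return {"Classes": [{"Name": "neutral", "Score": 0.54},
                            {"Name": "positive", "Score": 0.46}]}


class FakeA2I:
    """Stand-in for the A2I runtime client that records started loops."""
    def __init__(self):
        self.started = []

    def start_human_loop(self, **kwargs):
        self.started.append(kwargs)
        return {"HumanLoopArn": "arn:placeholder"}


fake_a2i = FakeA2I()
loops = classify_and_review(
    ["I enjoyed this product", "It is okay"],
    FakeComprehend(), fake_a2i,
    endpoint_arn="arn:placeholder:comprehend-endpoint",
    flow_definition_arn="arn:placeholder:flow-definition",
)
```

Only the second review falls below the threshold here, so exactly one human loop is started.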

![](2024-01-03-11-14-24.png)

![](2024-01-03-11-15-04.png)

![](2024-01-03-11-15-32.png)

Then you can deploy the model by calling model.deploy and providing a model endpoint name. Once the model endpoint is ready, you can send the sample reviews to the model again via the predictor.predict API call. Note that you need to pass the reviews in the JSON format the model expects as input. Then you parse the model response again to obtain the predicted label and the confidence score. Now, you are ready to define the human loop condition again. In an if-clause similar to the previous one, you can check whether the returned confidence score is under the defined threshold of 90 percent, which again means you want to start the human loop with the predicted label and the review as inputs. You can also add print statements to document when the human loops get started. Here you can see a sample result showing that a human loop was started for a low-confidence prediction of 54 percent for the sample review "It is okay". To verify the human loop results, you can capture the completed human loops' API responses. A sample result of the human loop input content and the human answer is shown here.
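
The endpoint-based variant above hinges on serializing the reviews into the format the model expects and parsing the response. The sketch below assumes a JSON Lines exchange with a `features` input field and `predicted_label` and `probability` output fields; these names are assumptions for illustration, and the response string is a stand-in for what a call to the deployed predictor would return.

```python
import json
from typing import Dict, List

CONFIDENCE_THRESHOLD = 0.90  # same assumed threshold as in the walkthrough


def to_json_lines(reviews: List[str]) -> str:
    """Serialize reviews as one JSON object per line (assumed input format)."""
    return "\n".join(json.dumps({"features": [review]}) for review in reviews)


def parse_json_lines(body: str) -> List[Dict]:
    """Deserialize a JSON Lines response into a list of prediction dicts."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def select_for_review(reviews: List[str], predictions: List[Dict],
                      threshold: float = CONFIDENCE_THRESHOLD) -> List[Dict]:
    """Return the human loop inputs for all low-confidence predictions."""
    selected = []
    for review, prediction in zip(reviews, predictions):
        label = prediction["predicted_label"]
        probability = prediction["probability"]
        if probability < threshold:
            print(f"Starting human loop: {probability:.0%} for {review!r}")
            selected.append({"initialValue": label, "taskObject": review})
        else:
            print(f"No review needed: {probability:.0%} for {review!r}")
    return selected


reviews = ["I enjoyed this product", "It is okay"]
payload = to_json_lines(reviews)

# Stand-in for the body a deployed endpoint might return for `payload`.
fake_response = ('{"predicted_label": 1, "probability": 0.97}\n'
                 '{"predicted_label": 0, "probability": 0.54}')
pending = select_for_review(reviews, parse_json_lines(fake_response))
```

Each entry in `pending` would then be passed as the input content of a human loop, in the same way as in the Amazon Comprehend example.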

![](2024-01-03-11-16-33.png)

![](2024-01-03-11-16-54.png)

![](2024-01-03-11-17-25.png)

![](2024-01-03-11-18-15.png)

![](2024-01-03-11-19-22.png)

![](2024-01-03-11-20-42.png)

![](2024-01-03-11-21-31.png)
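
To verify human loop results as described above, you can call `describe_human_loop` on the A2I runtime client; once the status is `Completed`, the response points to an output document in S3. The sketch below only parses such a document, so it runs without AWS access. The sample document is made up for illustration, and the shape of `answerContent` depends on your task UI template.

```python
import json
from typing import Any, Dict


def summarize_human_loop_output(output_document: str) -> Dict[str, Any]:
    """Extract the input content and the human answers from an output document."""
    output = json.loads(output_document)
    return {
        "human_loop_name": output["humanLoopName"],
        "input_content": output["inputContent"],
        "answers": [a["answerContent"] for a in output["humanAnswers"]],
    }


# Made-up example of a completed human loop's output document.
sample_output = json.dumps({
    "humanLoopName": "example-loop",
    "inputContent": {"initialValue": "neutral", "taskObject": "It is okay"},
    "humanAnswers": [
        {"answerContent": {"sentiment": {"label": "neutral"}},
         "workerId": "example-worker"},
    ],
})

summary = summarize_human_loop_output(sample_output)
```

Pairing `input_content` with the reviewers' answers gives you labeled records that can be added to the training data for the next round of model training.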


![](2024-01-03-11-22-09.png)

![](2024-01-03-11-22-40.png)

![](2024-01-03-11-23-36.png)

![](2024-01-03-11-26-33.png)

![](2024-01-03-11-26-59.png)

![](2024-01-03-11-27-41.png)

![](2024-01-03-11-28-17.png)

![](2024-01-03-11-28-37.png)

![](2024-01-03-11-29-00.png)

![](2024-01-03-11-30-03.png)

![](2024-01-03-11-30-52.png)

![](2024-01-03-11-31-12.png)

![](2024-01-03-11-31-57.png)