# **Data Labeling and Human-in-the-loop Pipelines**

## **Data Labeling**

## **Human-in-the-loop Pipelines**

![](2023-12-30-15-11-34.png)

![](2023-12-30-15-12-02.png)

![](2023-12-30-15-12-30.png)

![](2023-12-30-15-22-45.png)

![](2023-12-30-15-23-06.png)

![](2023-12-30-15-23-36.png)

![](2023-12-30-15-24-28.png)

![](2023-12-30-15-25-15.png)

![](2023-12-30-15-25-55.png)

![](2023-12-30-15-26-10.png)

![](2023-12-30-15-26-24.png)

![](2023-12-30-15-27-23.png)

![](2023-12-30-15-29-45.png)

![](2023-12-30-15-30-30.png)

## **BERT Feature Engineering at Scale**

![](2023-12-30-15-33-17.png)

![](2023-12-30-15-33-43.png)

![](2023-12-30-15-39-25.png)

I will highlight a few differences between BlazingText and BERT at a very high level. As you can see here, BlazingText is based on Word2Vec, whereas BERT is based on the transformer architecture. Both BlazingText and BERT generate word embeddings. However, BlazingText operates at the word level, whereas BERT operates at the sentence level. Additionally, using the bidirectional nature of the transformer architecture, BERT can capture the context of a word.

Let me clarify with an example. First, BlazingText learns word-level embeddings for all the words that are included in the training corpus. These vectors, or embeddings, are then projected into a high-dimensional vector space, so similar words produce vectors that lie close to each other in the learned vector space. Take, for example, the word dress. The embedding that is generated by BlazingText for the word dress is shown on the screen here. Regardless of where the word dress appears in a sentence, the embedding generated for it is always the same, which means BlazingText does not capture the context of the word dress in a sentence.

Contrast that with BERT, where the input is not the word but the sentence itself. The output is once again an embedding, but this time the embedding is based on three individual components: token, segment, and position. You will see in the next video exactly how to transform a sentence into an embedding that consists of these three individual embeddings. Let's take the example of these two sentences: "I love the dress." "I love the dress, but not the price." Obviously, the context of the word dress is different in these two sentences. BERT can take into consideration the words that come before the word dress, as well as the words that follow it. Using this bidirectional nature of the transformer architecture, BERT is able to capture the context.

So the embeddings that are generated for the word dress in these two sentences will be different. However, the length of the embeddings in these two sentences is fixed. With BERT, you encode sequences. Here is an end-to-end example of converting a sequence into BERT embeddings that can be readily used by the BERT algorithm. I will build this entire process step-by-step in the next video. You will continue to learn about the BERT algorithm and the model architecture in the next few weeks. For this week, I will stay focused on converting the raw text into the BERT embeddings that can be readily used by the BERT algorithm. That is, I will stay focused on the feature engineering portion of the process.
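
The contrast between a word-level lookup and a context-dependent embedding can be shown with a toy sketch. This is not BlazingText or BERT: the three-dimensional vectors in `STATIC_TABLE`, the naive `tokenize` helper, and the 50/50 mixing rule in `contextual_embedding` are all made up for illustration. The sketch only demonstrates the two properties described above: the static lookup returns the same vector for dress everywhere, while the context-dependent function returns different vectors of the same fixed length.

```python
# Toy illustration only: hand-written 3-dimensional vectors, not learned embeddings.
STATIC_TABLE = {
    "i": [0.1, 0.0, 0.2],
    "love": [0.9, 0.1, 0.0],
    "the": [0.0, 0.0, 0.1],
    "dress": [0.3, 0.8, 0.1],
    "but": [0.0, 0.2, 0.0],
    "not": [-0.8, 0.0, 0.1],
    "price": [0.1, 0.1, 0.9],
}


def tokenize(sentence):
    """Lowercase, drop punctuation, split on whitespace (deliberately naive)."""
    cleaned = "".join(ch for ch in sentence.lower() if ch.isalpha() or ch.isspace())
    return cleaned.split()


def static_embedding(word):
    """Word-level lookup: the result never depends on the surrounding sentence."""
    return STATIC_TABLE[word]


def contextual_embedding(sentence, word):
    """Mix the word's own vector with the mean of every vector in the sentence.

    Words before and after the target both contribute, which loosely mimics a
    bidirectional model. The 50/50 mixing rule is an assumption of this sketch.
    """
    tokens = tokenize(sentence)
    if word not in tokens:
        raise ValueError(f"{word!r} does not occur in the sentence")
    own = STATIC_TABLE[word]
    dim = len(own)
    mean = [sum(STATIC_TABLE[t][d] for t in tokens) / len(tokens) for d in range(dim)]
    return [0.5 * own[d] + 0.5 * mean[d] for d in range(dim)]


short_sentence = "I love the dress."
long_sentence = "I love the dress, but not the price."

static_short = static_embedding("dress")
static_long = static_embedding("dress")
context_short = contextual_embedding(short_sentence, "dress")
context_long = contextual_embedding(long_sentence, "dress")

print("static vectors equal:", static_short == static_long)
print("contextual vectors equal:", context_short == context_long)
print("contextual lengths:", len(context_short), len(context_long))
```

A real contextual model learns how much each neighboring word matters instead of averaging them uniformly; the sketch keeps only the idea that the surrounding words change the result.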

![](2023-12-30-15-40-32.png)

![](2023-12-30-15-41-57.png)

![](2023-12-30-15-43-40.png)

I will start with the input of this process. As you can see here, the input is an input sequence. You will notice that my input sequence has just one sentence. The input sequence could also consist of two different sentences, which is much more applicable for NLP tasks such as question answering.

Once I have the input sequence, the next step is to apply WordPiece tokenization. WordPiece tokenization is a technique that segments words into subwords using a pre-trained vocabulary; the pre-trained model represents each token with a vector of dimension 768. Here, you will see the tokens that are generated from the input sequence. In addition to the tokens coming from the individual words of the sentence, you also see a special token, CLS. CLS marks the start of the sequence and is the token used for classification tasks. If my input sequence consisted of multiple sentences, then I would see another special token, SEP, that separates the tokens from the individual sentences.

Once I have the WordPiece tokens, the next step is to apply token embedding. To determine the token embedding for the individual tokens, all I have to do is look up each token in the pre-trained vocabulary that I mentioned before. Here, the token CLS gets a token embedding ID of 101 because that is the index of CLS in the vocabulary, and that index selects a vector of dimension 768. Similarly, the token love gets a token embedding ID of 2293, the token this gets a token embedding ID of 2023, and so on.

The next step is to perform segment embedding. Segment embedding becomes much more important when there are multiple sentences in the input sequence. A segment ID of 0 indicates that a token belongs to the first sentence in the sequence, and similarly a segment ID of 1 indicates that it belongs to the second sentence. Here I have only one sentence in my input sequence, so all the individual tokens get a segment embedding of 0.

The next step is to apply position embedding. The position embedding captures the index position of the individual token in the input sequence. Here, my input sequence consists of four tokens. So based on a zero-based index, you can see the position embeddings for all the tokens. The position embedding for the token CLS is 0, the position embedding for the token love is 1, and so on.

Once I have the three individual embeddings, it's time to pull all of these together. The final step is an element-wise sum of the position, segment, and token embeddings that have been previously determined. So the final embedding is of the dimension (1, 4, 768) in this particular case. That makes sense because I started with one input sequence of three words, which becomes four tokens once CLS is added, and each token maps to a vector of dimension 768. Now this BERT embedding can be applied directly to the BERT algorithm as an input.
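
The steps above can be sketched in plain Python. This is a simplified illustration under several assumptions: there is no real WordPiece subword splitting, the `lookup` helper returns deterministic pseudo-random vectors in place of learned embedding rows, and the ID for dress is a made-up placeholder because the text only gives the IDs for CLS, love, and this.

```python
import random

HIDDEN_SIZE = 768  # embedding dimension mentioned in the text

# IDs for [CLS], love, and this come from the text above.
# The ID for "dress" is NOT given there; 9999 is a made-up placeholder.
VOCAB = {"[CLS]": 101, "love": 2293, "this": 2023, "dress": 9999}


def lookup(table_name, index, dim=HIDDEN_SIZE):
    """Return a deterministic pseudo-random vector standing in for a learned embedding row."""
    rng = random.Random(f"{table_name}:{index}")
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def embed_sequence(words):
    """Build token, segment, and position IDs for one sentence, then sum their vectors."""
    tokens = ["[CLS]"] + [w.lower() for w in words]
    token_ids = [VOCAB[t] for t in tokens]
    segment_ids = [0] * len(tokens)          # single sentence: every token is in segment 0
    position_ids = list(range(len(tokens)))  # zero-based index of each token

    summed = []
    for tok, seg, pos in zip(token_ids, segment_ids, position_ids):
        tok_vec = lookup("token", tok)
        seg_vec = lookup("segment", seg)
        pos_vec = lookup("position", pos)
        # Element-wise sum of the three embeddings for this token.
        summed.append([t + s + p for t, s, p in zip(tok_vec, seg_vec, pos_vec)])

    batch = [summed]  # a batch holding one input sequence
    return token_ids, segment_ids, position_ids, batch


token_ids, segment_ids, position_ids, batch = embed_sequence(["love", "this", "dress"])
shape = (len(batch), len(batch[0]), len(batch[0][0]))
print(token_ids, segment_ids, position_ids, shape)
```

The resulting `shape` is (batch size, number of tokens, hidden size), which matches the (1, 4, 768) dimension described above.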

![](2023-12-30-15-45-34.png)

![](2023-12-30-15-46-05.png)

![](2023-12-30-15-47-01.png)

![](2023-12-30-15-47-43.png)

![](2023-12-30-15-52-22.png)

![](2023-12-30-15-52-52.png)

![](2023-12-30-15-54-49.png)

![](2023-12-30-15-58-57.png)

For this purpose, you will use scikit-learn together with the RoBERTa tokenizer class, `RobertaTokenizer`, which is provided by the Hugging Face Transformers library. Before jumping into the code, I will briefly introduce the RoBERTa model. The RoBERTa model is built on top of the BERT model, but it modifies a few hyperparameters and the way the model is trained. It also uses a lot more training data than the original BERT model. This results in significant performance improvements in a variety of NLP tasks when compared to the original BERT model. You will learn more about the RoBERTa model architecture in week two. If you would like to read more about this model, please have a look at this research paper. You will find the link in the additional resources section for this week.

Time to look at some code. To start using the RoBERTa tokenizer, you first import the class and then construct a tokenizer object. To construct the tokenizer, you specify the pretrained model. The pretrained model we use here is `roberta-base`. Once you have the tokenizer object in hand, you run the `encode_plus` method. The `encode_plus` method expects a few parameters, as you can see here. One of the parameters is the review. This is the raw review text from your product review data set, which is the text that needs to be encoded. You will also see a true-or-false flag that controls whether to add special tokens to the encoded sequence. You will also see the parameter that specifies the maximum sequence length, along with a few other parameters.

A brief note on the maximum sequence length parameter. This is a hyperparameter that is available on both BERT and RoBERTa models. The max sequence length parameter specifies the maximum number of tokens that can be passed into the BERT model with a single sample. To determine the right value for this hyperparameter, you can analyze your data set. Here you can see the word distribution of the product review data set. This analysis indicates that all of our reviews consist of 115 or fewer words. Now, it's not a one-to-one mapping between the word count and the input token count, but it can be a good indication. You can definitely experiment with different values for this parameter. For the purposes of this use case and the product review data set, setting the max sequence length to 128 has been proven to work well. Once you determine all the necessary parameters, generating the embeddings is really very straightforward. You simply run the `encode_plus` method.
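
A minimal sketch of these steps follows, assuming the Hugging Face `transformers` package is installed. The function names `max_word_count`, `load_roberta_tokenizer`, and `encode_review` are made up for this example. The text does not list the other parameters passed to `encode_plus`, so `padding`, `truncation`, and `return_attention_mask` are assumptions of this sketch.

```python
def max_word_count(reviews):
    """Largest whitespace-separated word count across the reviews.

    Word count is only a rough proxy for token count, as noted above.
    """
    return max(len(review.split()) for review in reviews)


def load_roberta_tokenizer(pretrained_model="roberta-base"):
    """Build the tokenizer; downloads the vocabulary on first use."""
    # Imported lazily so the other helpers can be used without transformers installed.
    from transformers import RobertaTokenizer

    return RobertaTokenizer.from_pretrained(pretrained_model)


def encode_review(tokenizer, review, max_seq_length=128):
    """Encode one raw review, padding or truncating to max_seq_length tokens."""
    encoded = tokenizer.encode_plus(
        review,
        add_special_tokens=True,     # add the special start and end tokens
        max_length=max_seq_length,   # maximum number of tokens per sample
        padding="max_length",        # assumption: pad shorter reviews to max_length
        truncation=True,             # assumption: cut longer reviews to max_length
        return_attention_mask=True,  # assumption: 1 for real tokens, 0 for padding
    )
    return encoded["input_ids"], encoded["attention_mask"]
```

With the package available, `encode_review(load_roberta_tokenizer(), "I love this dress")` would return two lists of length 128: the input IDs and the attention mask.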

![](2023-12-30-16-00-21.png)

![](2023-12-30-16-00-50.png)

![](2023-12-30-16-01-29.png)

![](2023-12-30-16-02-04.png)

![](2023-12-30-16-02-51.png)

![](2023-12-30-16-03-23.png)

![](2023-12-30-16-03-42.png)

![](2023-12-30-16-04-26.png)

![](2023-12-30-16-05-23.png)

![](2023-12-30-16-05-52.png)

![](2023-12-30-16-06-23.png)

However, the real challenge comes in when you have to generate these embeddings at scale. This is exactly the challenge that you will tackle in this week's lab. The challenge is performing feature engineering at scale, and to address the challenge, you will use Amazon SageMaker Processing. Amazon SageMaker Processing allows you to perform data-related tasks such as preprocessing, postprocessing, and model evaluation at scale. SageMaker Processing provides this capability by using a distributed cluster. By specifying some parameters, you can control how many nodes, and which type of nodes, make up the distributed cluster. The SageMaker Processing job executes on the distributed cluster that you configure. SageMaker Processing provides a built-in container for scikit-learn, so the code that you use with scikit-learn and the RoBERTa tokenizer should work out of the box using SageMaker Processing. As you can see here, SageMaker Processing expects data to be in an S3 bucket. So you specify the S3 location where your raw input data is stored, and SageMaker Processing executes the scikit-learn script on the raw data. Finally, the output, which consists of the embeddings, is persisted back into an S3 bucket.

Let's look at some code again. To use SageMaker Processing with scikit-learn, you start by importing a class called `SKLearnProcessor`, along with a couple of other classes that capture the input and the output of the processing job. Then you set up the processing cluster using the `SKLearnProcessor` object. To this object, you pass in the framework version of scikit-learn you would like to use, as well as the instance count and the instance type that make up the distributed cluster. Once you configure the cluster, you simply call the `run` method with a few parameters. As expected, these parameters include a script to execute. This is a Python script that contains the scikit-learn code to generate the embeddings. Additionally, you provide the processing input that specifies the location of the input data in the S3 bucket. And finally, you specify where in S3 the output should go, using the processing output construct.

Pulling all of these together, this is what your lab for this week is going to look like. You will convert the review text from the product review data set into BERT embeddings, using the scikit-learn container on SageMaker Processing.
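
The processing job described above could be set up roughly as follows. This is a sketch, not the lab's code: it assumes the SageMaker Python SDK is installed and AWS credentials are configured, and the bucket layout, the script name `prepare_data.py`, the instance type, the instance count, the framework version, and the output name are all placeholder choices for illustration.

```python
def build_job_config(bucket, prefix, instance_count=2, instance_type="ml.c5.2xlarge"):
    """Collect the cluster settings and S3 locations for the job (all values are placeholders)."""
    return {
        "processor": {
            "framework_version": "0.23-1",     # scikit-learn container version (assumed)
            "instance_count": instance_count,  # number of nodes in the cluster
            "instance_type": instance_type,    # type of each node
        },
        "script": "prepare_data.py",           # hypothetical script that generates the embeddings
        "input_s3_uri": f"s3://{bucket}/{prefix}/raw/",
        "output_s3_uri": f"s3://{bucket}/{prefix}/features/",
    }


def run_processing_job(config, role):
    """Start the job. Requires the sagemaker package and valid AWS credentials."""
    # Imported lazily so build_job_config can be used without the SDK installed.
    from sagemaker.processing import ProcessingInput, ProcessingOutput
    from sagemaker.sklearn.processing import SKLearnProcessor

    processor = SKLearnProcessor(role=role, **config["processor"])
    processor.run(
        code=config["script"],
        inputs=[
            ProcessingInput(
                source=config["input_s3_uri"],
                destination="/opt/ml/processing/input",
                # Split the input objects across nodes instead of copying all data to each one.
                s3_data_distribution_type="ShardedByS3Key",
            )
        ],
        outputs=[
            ProcessingOutput(
                output_name="bert-features",
                source="/opt/ml/processing/output",
                destination=config["output_s3_uri"],
            )
        ],
    )
    return processor


config = build_job_config("my-bucket", "reviews")
print(config["input_s3_uri"], config["output_s3_uri"])
```

Calling `run_processing_job(config, role)` with an IAM role ARN would launch the cluster, run the script against the input data, and write the results to the output location.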

## **Feature Store**

![](2023-12-30-16-19-51.png)

It would save you a lot of time if you could store the results of your feature engineering efforts and reuse those results, so that you don't have to run the feature engineering pipeline again and again. It would save time not only for you, but also for any other teams in your organization that may want to use the same data and the same features for their own machine learning projects. Enter the feature store.

A feature store, at a very high level, is a repository to store engineered features. For such a feature store, there are three high-level characteristics that you want to consider. First, you want the feature store to be centralized, so that multiple teams can contribute their features to this centralized repository. Second, you want the features from the feature store to be reusable. This is to allow reuse of engineered features, not just across multiple phases of a single machine learning project, but across multiple machine learning projects. And finally, you want the feature store to be discoverable, so that any team member can come in and search for the features they want, and use the search results in their own machine learning projects.

Now, if you extend the feature engineering pipeline that you reviewed before to include a feature store, it would look just like this. As you can see here, the engineered features can be persisted directly into the feature store. Besides the high-level characteristics of being centralized, reusable, and discoverable, what are some other functional capabilities that you can think of for such a feature store to be usable for you and your organization? The functional capabilities would include the ability to create the feature store itself, as well as the ability to create and ingest individual features, retrieve the features, and delete the features once they are obsolete.

Now, you can architect, design, and build such a feature store using mechanisms like a database for persistence and APIs for creating, retrieving, and deleting the features. Or, you can use an AWS tool like Amazon SageMaker Feature Store, which is a managed service that provides a purpose-built feature store.
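
To make the functional capabilities concrete, here is a toy in-memory feature store. It is a sketch for illustration only: the class `InMemoryFeatureStore` and its method names are made up, a dictionary stands in for the database, and nothing here reflects how SageMaker Feature Store is implemented.

```python
class InMemoryFeatureStore:
    """Toy feature store: create, ingest, retrieve, search, and delete feature groups."""

    def __init__(self):
        self._groups = {}  # stands in for a database

    def create_feature_group(self, name, feature_names, record_id):
        if name in self._groups:
            raise ValueError(f"feature group {name!r} already exists")
        if record_id not in feature_names:
            raise ValueError("record_id must be one of the feature names")
        self._groups[name] = {
            "features": list(feature_names),
            "record_id": record_id,
            "records": {},
        }

    def ingest(self, name, records):
        group = self._groups[name]
        for record in records:
            missing = set(group["features"]) - set(record)
            if missing:
                raise ValueError(f"record is missing features: {sorted(missing)}")
            key = record[group["record_id"]]
            group["records"][key] = {f: record[f] for f in group["features"]}

    def retrieve(self, name, record_id_value, feature_names=None):
        group = self._groups[name]
        record = group["records"][record_id_value]
        wanted = feature_names or group["features"]
        return {f: record[f] for f in wanted}

    def search(self, keyword):
        """Discoverability: find groups whose name or feature names contain the keyword."""
        return sorted(
            name
            for name, group in self._groups.items()
            if keyword in name or any(keyword in f for f in group["features"])
        )

    def delete_feature_group(self, name):
        del self._groups[name]


store = InMemoryFeatureStore()
store.create_feature_group("reviews", ["review_id", "sentiment", "input_ids"], "review_id")
store.ingest("reviews", [{"review_id": "r1", "sentiment": 1, "input_ids": [0, 5, 2]}])
found = store.search("sentiment")
row = store.retrieve("reviews", "r1", ["sentiment"])
print(found, row)
```

A shared service would add what this sketch leaves out, such as durable storage, access control, and concurrent access from multiple teams.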

![](2023-12-30-16-20-39.png)

![](2023-12-30-16-21-00.png)

![](2023-12-30-16-21-51.png)

SageMaker Feature Store is a fully managed service that provides a purpose-built feature store. The capabilities of SageMaker Feature Store closely align with the desirable characteristics for any feature store that were reviewed in the previous video. First, SageMaker Feature Store provides you with a centralized repository to securely save and serve features from. Next, SageMaker Feature Store provides you with the capabilities to reuse the features, not just within a single machine learning project, but across multiple projects. A typical challenge that data scientists see is training-inference skew, which can result from discrepancies between the data used for training and the data used for inference. SageMaker Feature Store helps reduce the skew by reusing the features across the training and inference phases and by keeping the features consistent. Finally, SageMaker Feature Store provides the capabilities to query for the features both in real time and in batch. The ability to query for features in real time supports use cases such as near-real-time ML predictions. Similarly, the ability to look up features in batch mode can be used to support use cases such as model training.

Next, I will introduce the APIs to use with SageMaker Feature Store. In the lab this week, you will create a feature store and use several of the capabilities that I discuss here. To start using the Feature Store APIs, import the `FeatureGroup` class from the SageMaker SDK. A feature group is a construct that allows you to group multiple features together and treat them as a set. The first step here is to create a feature group. As you can see here, the feature group expects a name and also a list of feature definitions. Feature definitions capture the individual name and type of the features. Once you have the name and the feature definitions, go ahead and create the feature group. You will also see that the `create` method expects an S3 location, where the feature group, along with the individual features, will be saved. Once the feature group is created, it's time to ingest features into the feature group.
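
Creating a feature group could look roughly like this with the SageMaker Python SDK. This is a sketch under assumptions: the feature names and types in `FEATURES`, the choice of `review_id` as the record identifier and `date` as the event time, and the helper names are all illustrative, and running `create_feature_group` requires the `sagemaker` package, AWS credentials, and an IAM role.

```python
# (name, type) pairs. The names and types are assumptions chosen for this sketch.
FEATURES = [
    ("review_id", "string"),
    ("date", "string"),
    ("sentiment", "integral"),
    ("input_ids", "string"),
]
ALLOWED_TYPES = {"string", "integral", "fractional"}


def validate_features(features, record_id, event_time):
    """Check that the types are known and the identifier and event-time features exist."""
    names = [name for name, _ in features]
    bad_types = [t for _, t in features if t not in ALLOWED_TYPES]
    return not bad_types and record_id in names and event_time in names


def create_feature_group(name, s3_uri, role_arn, sagemaker_session):
    """Create the feature group. Requires the sagemaker package and AWS access."""
    # Imported lazily so the helpers above can be used without the SDK installed.
    from sagemaker.feature_store.feature_definition import FeatureDefinition, FeatureTypeEnum
    from sagemaker.feature_store.feature_group import FeatureGroup

    type_map = {
        "string": FeatureTypeEnum.STRING,
        "integral": FeatureTypeEnum.INTEGRAL,
        "fractional": FeatureTypeEnum.FRACTIONAL,
    }
    definitions = [
        FeatureDefinition(feature_name=n, feature_type=type_map[t]) for n, t in FEATURES
    ]
    feature_group = FeatureGroup(
        name=name,
        feature_definitions=definitions,
        sagemaker_session=sagemaker_session,
    )
    feature_group.create(
        s3_uri=s3_uri,                       # S3 location where the features are saved
        record_identifier_name="review_id",  # assumed unique key for each record
        event_time_feature_name="date",      # assumed timestamp feature
        role_arn=role_arn,
        enable_online_store=True,            # assumption: also enable real-time lookups
    )
    return feature_group


print(validate_features(FEATURES, "review_id", "date"))
```

Creation is asynchronous on the service side, so code that ingests right away would typically wait until the feature group reports that it has been created.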

![](2023-12-30-16-29-02.png)

![](2023-12-30-16-29-31.png)

Here, you use the `ingest` API to ingest features into the feature group in a multi-threaded fashion. Once the feature group is populated, you can retrieve the features by taking advantage of the retrieve APIs. Here, I'm showing you one possible way to retrieve the features. To query the features directly from the S3 location where they are persisted, you take advantage of the Athena query object. As you can see here, you construct a SQL statement, a query string that selects the appropriate features that you want to retrieve from the feature group. Once you have the query string, you call the `run` method on the Athena query object, passing in the query string. This returns the queried features in DataFrame format. Now, you have the option to convert that DataFrame into a CSV file and save the CSV file wherever you need to. In fact, you can store the CSV file in an S3 location and use that as a direct input to a training job on SageMaker.

So far, you have seen the APIs to use with SageMaker Feature Store. If you'd like more of a visual approach, you can view the feature store and the feature groups created in a SageMaker Studio environment. Here is a view of the feature group and the feature definitions. As you can see here, the feature definitions capture the feature name and the feature type. You'll see features like review ID, date, sentiment, and so on. The SageMaker Studio environment also provides you with queries that you can use to interactively explore the feature groups. You can take the queries provided in this environment and run them in any query interface of your choice. When you execute these queries, you will get results similar to this. Here are all the features that you have queried for in that query string. You can see the date, review ID, and so on.

Now, think about what the input IDs feature is representing here. The input IDs are the BERT embeddings that you generated from the raw review text in a prior lab. So here, you're storing the embeddings as one of the features, and you have the ability to query the feature group to retrieve those input IDs.
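
The ingest and query flow can be sketched as follows. The helper names `build_query` and `ingest_and_query`, the `max_workers` value, and the column names in the example are assumptions of this sketch. `feature_group` is expected to be a SageMaker SDK `FeatureGroup` that has already been created, and `data_frame` a pandas DataFrame whose columns match its feature definitions.

```python
def build_query(table_name, columns, limit=None):
    """Build the SQL query string that selects the requested features."""
    column_list = ", ".join(f'"{c}"' for c in columns)
    query = f'SELECT {column_list} FROM "{table_name}"'
    if limit is not None:
        query += f" LIMIT {int(limit)}"
    return query


def ingest_and_query(feature_group, data_frame, columns, output_s3_uri):
    """Ingest a DataFrame into the feature group, then read features back through Athena."""
    # Multi-threaded ingest; the worker count here is an arbitrary choice.
    feature_group.ingest(data_frame=data_frame, max_workers=3, wait=True)

    # The Athena query object knows the table that backs the feature group.
    athena_query = feature_group.athena_query()
    query_string = build_query(athena_query.table_name, columns)

    # Query results are written to the given S3 location before being loaded.
    athena_query.run(query_string=query_string, output_location=output_s3_uri)
    athena_query.wait()
    return athena_query.as_dataframe()


example_query = build_query("reviews", ["review_id", "input_ids"], limit=10)
print(example_query)
```

The returned DataFrame can then be written out with its `to_csv` method and uploaded to S3 for use as training input, as described above.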

![](2023-12-30-16-30-08.png)

![](2023-12-30-16-31-14.png)

![](2023-12-30-16-31-53.png)

![](2023-12-30-16-33-02.png)

![](2023-12-30-16-33-39.png)

![](2023-12-30-16-34-01.png)

![](2023-12-30-16-34-53.png)

![](2023-12-30-16-35-07.png)