# **Train, Debug, and Profile a Machine Learning Model**

## **Training of a Custom Model**

![](2023-12-31-11-12-08.png)

![](2023-12-31-11-20-38.png)

![](2023-12-31-11-21-16.png)

![](2023-12-31-11-22-11.png)

![](2023-12-31-11-22-53.png)

![](2023-12-31-11-23-29.png)

![](2023-12-31-11-24-13.png)

![](2023-12-31-11-29-36.png)

![](2023-12-31-11-31-06.png)

The main difference here is that the model has already been trained on large collections of text data. You provide specific text data, the product reviews, to adapt the model to your text domain, and you also provide your task and model training code, telling the pretrained model to perform a text classification task with the three sentiment classes.

Now, let's dive deeper into the concept of model pretraining and fine-tuning. I already mentioned a few times that pretrained NLP models have been trained on large text corpora, such as large book collections or Wikipedia. In this unsupervised learning step, the model builds a vocabulary of tokens from the training data and learns the vector representations. You can also pretrain NLP models on specific language data. A BERT model trained on the German language is available as GermanBERT. Another one trained on the French language is available under the name CamemBERT. BERTje is trained on the Dutch language. There are also many more pretrained models available that focus on specific text domains and use cases, such as PatentBERT for classifying patents, SciBERT, which is trained on scientific text, or ClinicalBERT, which is trained on healthcare text data.

You can use those pretrained models and simply adapt them to your specific dataset. This process is called fine-tuning. You might be familiar with the concept of transfer learning, which has become popular in computer vision. It's a machine learning technique where a model is trained on one task and then repurposed on a second, related task. Now, think of fine-tuning as the transfer learning of NLP. If you work with English product reviews as your training data, you can use an English language model, pretrained for example on Wikipedia, and then fine-tune it to the English product reviews. The assumption here is that the majority of words used in the product reviews have already been learned from the English Wikipedia.

![](2023-12-31-11-33-08.png)

![](2023-12-31-11-34-09.png)

As part of the fine-tuning step, you also train the model on your specific NLP task. In the product reviews example, this means adding a text classifier layer to the pretrained model that classifies the reviews into positive, neutral, and negative sentiment classes. Fine-tuning is generally faster than pretraining, as the model doesn't have to learn millions or billions of BERT vector representations. Also note that fine-tuning is a supervised learning step, as you fit the model using labeled training data.

Now, where can you find pretrained models to get started? Many of the popular machine learning frameworks, such as PyTorch, TensorFlow, and Apache MXNet, have dedicated model hubs, or zoos, where you can find pretrained models. The open source NLP project Hugging Face also provides an extensive model hub with over 8,000 pretrained NLP models. If you want to deploy pretrained models straight into your AWS account, you can use SageMaker JumpStart to get easy access to pretrained text and vision models. JumpStart works with PyTorch Hub and TensorFlow Hub and lets you deploy supported models in one click into the SageMaker model hosting environment. JumpStart provides access to over 100 pretrained vision models, such as Inception V3, ResNet 18, and many more. JumpStart also lists over 30 pretrained text models from PyTorch Hub and TensorFlow Hub, including a variety of BERT models. In one click, you can deploy the pretrained model in your AWS account, or you can select the model and fine-tune it to your dataset.

JumpStart also provides a collection of solutions for popular machine learning use cases, such as fraud detection in financial transactions, predictive maintenance, demand forecasting, churn prediction, and more. When you choose a solution, JumpStart provides a description of the solution and a launch button. There's no extra configuration needed. Solutions launch all of the resources necessary to run the solution, including training and model hosting instances. After launching the solution, JumpStart provides a link to a notebook that you can use to explore the solution's features. If you don't find a suitable model via JumpStart, you can also pull in other pretrained models via custom code. This week, you will work with a pretrained RoBERTa model from the Hugging Face model zoo: RoBERTa for sequence classification (`RobertaForSequenceClassification`), which is a pretrained RoBERTa model that comes preconfigured for text classification tasks.

![](2023-12-31-11-35-10.png)

![](2023-12-31-11-36-40.png)

![](2023-12-31-11-38-09.png)

![](2023-12-31-11-38-38.png)

![](2023-12-31-11-39-19.png)

![](2023-12-31-11-39-44.png)

![](2023-12-31-11-40-21.png)

![](2023-12-31-11-40-47.png)

![](2023-12-31-11-41-34.png)

![](2023-12-31-11-42-17.png)

![](2023-12-31-11-43-04.png)

![](2023-12-31-11-44-21.png)

![](2023-12-31-11-45-01.png)

![](2023-12-31-11-45-17.png)

![](2023-12-31-11-45-59.png)

![](2023-12-31-11-48-14.png)

![](2023-12-31-11-49-02.png)

![](2023-12-31-11-49-23.png)

![](2023-12-31-11-49-36.png)

While you can use BERT as is without training from scratch, it's useful to understand how BERT uses word masking and next sentence prediction in parallel to learn and understand language. As BERT sees new text, the model masks 15 percent of the words in each sentence. BERT then predicts the masked words and corrects itself, meaning it updates the model weights when it predicts incorrectly. This step is called masked language modeling, or masked LM. Masking forces the model to learn the surrounding words for each sentence.

At the same time that BERT is masking and predicting words, or to be more precise, input tokens, it is also performing next sentence prediction, or NSP, on pairs of input sequences. To perform NSP, BERT randomly chooses 50 percent of the sentence pairs and replaces one of the two sentences with a random sentence from another part of the document. BERT then predicts whether the two sentences are a valid sentence pair or not. Again, BERT corrects itself when it predicts incorrectly. Both of those training tasks are performed in parallel to create a single accuracy score for the combined training efforts. This results in a more robust model capable of performing word-level and sentence-level predictive tasks. Also note that this pretraining step is implemented as unsupervised learning. The input data is large collections of unlabeled text.

Now, in many cases, you don't need to train BERT from scratch. Neural networks are designed to be reused and continuously trained as new data arrives in the system. Since BERT has already been pretrained on millions of public documents from Wikipedia and the Google Books corpus, the vocabulary and learned representations are indeed transferable to a large number of NLP and NLU tasks across a wide variety of domains. In the fine-tuning step, you also configure the model for the actual NLP task, such as question answering, text classification, or named entity recognition. Fine-tuning is implemented as supervised learning, and no masking or next sentence prediction happens. As a result, fine-tuning is very fast and requires a relatively small number of samples, or product reviews in our case.

In this week's use case, you will take the pretrained RoBERTa model from the Hugging Face model hub and fine-tune it to classify the product reviews into the three sentiment classes. Again, the RoBERTa model architecture builds on BERT's language masking strategy but removes the next sentence pretraining objective. It also trains with much larger mini-batches and learning rates. With 160 gigabytes of text, RoBERTa also uses much more training data than BERT, which is pretrained on 16 gigabytes of text data. These model architecture changes focus on building an even better performing masked language model for NLP downstream tasks, such as text classification.
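
To make the two pretraining tasks described above more concrete, here is a minimal sketch in plain Python. It is an illustration only, not BERT's actual implementation: the function names, the `[MASK]` placeholder handling, and the choice to always replace the second sentence of a pair are assumptions made for this example, and real BERT pretraining applies additional refinements that are left out here.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Return (masked_tokens, labels) for one masked-LM training example.

    labels[i] holds the original token where a mask was applied and None
    elsewhere, so a loss would only be computed on the masked positions.
    """
    if not tokens:
        return [], []
    rng = rng if rng is not None else random.Random()
    # Mask about 15% of the positions, but always at least one.
    num_to_mask = max(1, round(len(tokens) * mask_prob))
    masked_positions = set(rng.sample(range(len(tokens)), num_to_mask))
    masked = [MASK_TOKEN if i in masked_positions else tok
              for i, tok in enumerate(tokens)]
    labels = [tok if i in masked_positions else None
              for i, tok in enumerate(tokens)]
    return masked, labels


def make_nsp_pairs(sentences, rng=None):
    """Build (sentence_a, sentence_b, is_next) examples for NSP.

    Roughly half of the pairs keep the true next sentence; the others get
    a randomly chosen sentence from elsewhere in the same document.
    """
    if len(sentences) < 3:
        raise ValueError("need at least three sentences to draw a replacement")
    rng = rng if rng is not None else random.Random()
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], True))
        else:
            # Exclude the sentence itself and its true successor.
            candidates = [s for j, s in enumerate(sentences)
                          if j not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(candidates), False))
    return pairs
```

During pretraining, the model would be asked to recover the entries in `labels` and to predict the `is_next` flag; both signals come from unlabeled text, which is why this step counts as unsupervised learning.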

![](2023-12-31-11-52-06.png)

## **Training of a Custom Model with Amazon SageMaker**

### **Train a custom model with Amazon SageMaker**

![](2023-12-31-12-03-45.png)

![](2023-12-31-12-27-56.png)

This time I show you how to train, or fine-tune, the text classifier with custom model code for the pretrained BERT model you pull from the Hugging Face model hub. This option is also called bring your own script, or script mode, in SageMaker. It requires a little more effort but gives you increased levels of customization in return.

To start training a model in SageMaker, you create a training job. The training job includes the following information: the URL of the Amazon Simple Storage Service (Amazon S3) bucket where you have stored the training data; the compute resources that you want SageMaker to use for the model training, which are ML compute instances managed by SageMaker; the URL of the S3 bucket where you want to store the output of the training job; and the Amazon Elastic Container Registry (Amazon ECR) path where the training code image is stored. SageMaker provides built-in Docker images that include deep learning framework libraries and other dependencies needed for model training and inference. Using script mode, you can leverage these prebuilt images for many popular frameworks, including TensorFlow, PyTorch, and MXNet. After you create the training job, SageMaker launches the ML compute instances and uses the training code and the training dataset to train the model. It saves the resulting model artifacts and other outputs in the S3 bucket you specify for that purpose.

Here are the steps you need to perform. First, you need to configure the training, validation, and test datasets. You also need to specify which evaluation metrics to capture, for example the validation loss and validation accuracy. Next, you need to configure the model hyperparameters, such as the number of epochs, the learning rate, and so on. Then you need to write and provide the custom model training script used to fit the model. Let's discuss each step in more detail, starting with the dataset and the evaluation metrics.

You can use the SageMaker `TrainingInput` class to configure a data input flow for the training. The example code here shows how to configure `TrainingInput` objects to use the training, validation, and test data splits uploaded to an S3 bucket. If you write your custom model training code, make sure the algorithm code calculates and emits model metrics such as validation loss and validation accuracy. You can then define regular expressions as shown here to capture the values of these metrics from the Amazon CloudWatch logs.

Next, configure the model's hyperparameters. Model hyperparameters include, for example, the number of epochs, the learning rate, batch sizes for training and validation data, and more. One important hyperparameter for BERT models is the maximum sequence length. If you remember, you didn't have to specify this parameter before, as both SageMaker Autopilot and the BlazingText algorithm use different NLP algorithms with different hyperparameters. As a quick reminder, the maximum sequence length refers to the maximum number of input tokens you can pass to the BERT model per sample.
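
The following sketch shows one way the data channels, metric regular expressions, and hyperparameters described above could be written down. Treat it as an assumption-laden example rather than the course's exact code: the S3 prefix layout, the log format (`val_loss: ...`, `val_acc: ...`), all hyperparameter values, and the helper names `make_data_channels` and `extract_metrics` are hypothetical. The `TrainingInput` import is deferred into the function so the rest of the block runs without the SageMaker SDK installed.

```python
import re

# Illustrative hyperparameter values; tune them for your own training job.
hyperparameters = {
    "epochs": 3,
    "learning_rate": 1e-5,
    "train_batch_size": 256,
    "validation_batch_size": 256,
    "max_seq_length": 128,
}

# Each entry pairs a metric name with a regex whose first capture group
# is the metric value. The log format matched here is an assumption about
# what the training script prints.
metric_definitions = [
    {"Name": "validation:loss", "Regex": r"val_loss: ([0-9.]+)"},
    {"Name": "validation:accuracy", "Regex": r"val_acc: ([0-9.]+)"},
]


def make_data_channels(s3_prefix):
    """Create one TrainingInput per data split under a common S3 prefix."""
    from sagemaker.inputs import TrainingInput  # needs the SageMaker SDK

    return {
        split: TrainingInput(s3_data=f"{s3_prefix}/{split}")
        for split in ("train", "validation", "test")
    }


def extract_metrics(log_line, definitions=metric_definitions):
    """Apply the metric regexes to one log line, like a local dry run."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], log_line)
        if match:
            found[definition["Name"]] = float(match.group(1))
    return found
```

Running `extract_metrics` against a few lines that your script actually prints is a cheap way to check the regular expressions before you start a training job.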

![](2023-12-31-12-30-01.png)
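
As a small illustration of what the maximum sequence length means in practice, the sketch below checks how many samples fit within a candidate length and forces a token ID list to exactly that length. Both helper names are hypothetical, and the padding ID must come from your tokenizer. Note also that word counts are only a rough proxy: subword tokenizers usually produce more tokens than words.

```python
def coverage(lengths, max_seq_length):
    """Fraction of samples whose length fits within max_seq_length."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    return sum(1 for n in lengths if n <= max_seq_length) / len(lengths)


def pad_or_truncate(token_ids, max_seq_length, pad_id):
    """Return a copy of token_ids with exactly max_seq_length entries.

    Longer inputs are cut off at the end; shorter inputs are filled up
    with pad_id, which is tokenizer-specific.
    """
    clipped = list(token_ids[:max_seq_length])
    return clipped + [pad_id] * (max_seq_length - len(clipped))
```

A value that covers nearly all samples avoids losing text through truncation, while a smaller value reduces memory use and training time per batch.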

![](2023-12-31-12-30-28.png)

![](2023-12-31-12-31-12.png)

![](2023-12-31-12-31-26.png)

![](2023-12-31-12-31-44.png)

![](2023-12-31-12-32-09.png)

I chose the value of 128 because the word distribution of the reviews showed that 100% of the reviews in the training dataset have 115 words or fewer.

Next, provide your custom training script. You can start from some example code and then customize it to your needs. In the following code example, you can see an extract from the Python model training script called `train.py`. First, you import the Hugging Face transformers library. Remember, you can install the library with pip install transformers. Hugging Face provides a pretrained RoBERTa model for sequence classification that comes preconfigured for text classification tasks. Let's download the model config for this RoBERTa model. You can do this by calling `RobertaConfig.from_pretrained` and simply providing the model name, in this example roberta-base. You can then customize the configuration by specifying the number of labels for the classifier. You set `num_labels` to three, representing the three sentiment classes. The `id2label` and `label2id` parameters let you map the zero-based index to the actual class labels: -1 for the negative class, 0 for the neutral class, and 1 for the positive class. You then download the pretrained RoBERTa model from the Hugging Face library with the command `RobertaForSequenceClassification.from_pretrained`, providing the model name and the customized configuration.

With the pretrained model at hand, you need to write the code to fine-tune the model, here called train_model. Here is the code extract that shows how to fine-tune the model using PyTorch in the train_model function. You define the loss function and the optimizer to use. In this example, I'm using the CrossEntropyLoss function and the Adam optimizer. Then you write the training code, looping through the number of epochs and training steps. In each step, you read the training data from the PyTorch DataLoader, put the model into training mode, clear the gradients from the previous step, pass the training sample to the model, retrieve the model prediction, calculate the loss, and compute the gradients via backpropagation. Finally, you update the parameters with the optimizer step and repeat the loop through all specified training steps and epochs. Make sure to also include code, not shown here, that runs a validation loop after each epoch and calculates and emits the evaluation metrics you want to capture. You can see a full example in this week's lab assignment.

With all configurations done and the model training code ready, you can now fit the model. Create a SageMaker PyTorch estimator as shown here. You specify the location of the model training script as the entry point and set the source directory to a local directory containing your model training script and any additional dependencies listed in a requirements.txt file. In the estimator, you can also specify the AWS instance type and instance count to run the model training. You can also define the PyTorch framework version you're using. This instructs SageMaker to use the right framework training image to run your code. You pass the hyperparameters you defined and the model evaluation metrics to capture. And then finally, you call estimator.fit to start the fine-tuning of the model.
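
The model configuration and download steps described above might look like the following sketch. The function name `load_pretrained_classifier` and the constant name are assumptions for this example. The transformers import is deferred into the function, so the label mappings can be inspected without the library installed; calling the function downloads the pretrained weights and therefore needs network access.

```python
# Map the zero-based class index used by the classifier head to the
# sentiment labels in the reviews data (-1 negative, 0 neutral, 1 positive).
id2label = {0: -1, 1: 0, 2: 1}
# Invert the mapping so labels can be turned back into class indices.
label2id = {label: index for index, label in id2label.items()}

PRE_TRAINED_MODEL_NAME = "roberta-base"


def load_pretrained_classifier(model_name=PRE_TRAINED_MODEL_NAME):
    """Download RoBERTa configured as a three-class text classifier."""
    from transformers import RobertaConfig, RobertaForSequenceClassification

    # Start from the published config and customize the classifier head.
    config = RobertaConfig.from_pretrained(
        model_name,
        num_labels=len(id2label),
        id2label=id2label,
        label2id=label2id,
    )
    # Load the pretrained weights together with the customized config.
    return RobertaForSequenceClassification.from_pretrained(
        model_name, config=config
    )
```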

![](2023-12-31-12-32-38.png)
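
Here is a minimal sketch of the fine-tuning loop described above, written with PyTorch. It makes a few assumptions that are not from the lecture: each batch from the data loader is an `(input_ids, labels)` pair, the model's first output holds the class scores, and the default learning rate is only a placeholder. The validation loop is left out, and the function returns the per-step losses purely so you can inspect them.

```python
def train_model(model, train_data_loader, epochs, learning_rate=1e-5):
    """Fine-tune `model` and return the list of per-step training losses."""
    import torch  # deferred import; requires PyTorch to be installed

    loss_function = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(params=model.parameters(), lr=learning_rate)

    losses = []
    for _ in range(epochs):
        for input_ids, labels in train_data_loader:
            model.train()                  # put the model into training mode
            optimizer.zero_grad()          # clear gradients from the previous step
            logits = model(input_ids)[0]   # forward pass; first output = class scores
            loss = loss_function(logits, labels)
            loss.backward()                # compute gradients via backpropagation
            optimizer.step()               # update the model parameters
            losses.append(loss.item())
    return losses
```

In a real training script you would follow each epoch with a validation pass that prints the validation loss and accuracy, so SageMaker can capture them from the logs.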

![](2023-12-31-12-32-53.png)
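
The estimator setup described above could be sketched as follows. The concrete values (`src` as the source directory, a single `ml.c5.9xlarge` instance, PyTorch `1.6.0`) and both function names are assumptions for illustration, and `role` stands for an IAM role ARN from your own account. The SageMaker SDK import is deferred, so the settings can be reviewed without AWS access.

```python
def estimator_settings(hyperparameters, metric_definitions):
    """Collect the keyword arguments for a PyTorch script-mode estimator."""
    return {
        "entry_point": "train.py",         # the custom training script
        "source_dir": "src",               # directory with script + requirements.txt
        "instance_count": 1,
        "instance_type": "ml.c5.9xlarge",
        "framework_version": "1.6.0",      # selects the framework training image
        "py_version": "py3",
        "hyperparameters": hyperparameters,
        "metric_definitions": metric_definitions,
    }


def start_fine_tuning(role, data_channels, settings):
    """Create the estimator and start the training job without blocking."""
    from sagemaker.pytorch import PyTorch  # needs the SageMaker SDK

    estimator = PyTorch(role=role, **settings)
    # data_channels maps channel names such as "train" to S3 inputs.
    estimator.fit(inputs=data_channels, wait=False)
    return estimator
```

Keeping the settings in a plain dictionary makes it easy to log or compare configurations between training runs.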

![](2023-12-31-12-33-21.png)

![](2023-12-31-12-33-51.png)

![](2023-12-31-12-34-17.png)

![](2023-12-31-12-35-28.png)

![](2023-12-31-12-36-11.png)

![](2023-12-31-12-36-39.png)

![](2023-12-31-12-37-03.png)

![](2023-12-31-12-37-30.png)

![](2023-12-31-12-37-49.png)

![](2023-12-31-12-38-25.png)

![](2023-12-31-12-39-08.png)

![](2023-12-31-12-39-31.png)

![](2023-12-31-12-39-50.png)

![](2023-12-31-12-40-25.png)

![](2023-12-31-12-40-48.png)

![](2023-12-31-12-41-07.png)

### **Debug and profile models**

![](2023-12-31-12-47-28.png)

Training machine learning models is difficult and often an opaque process. Training deep learning models in particular usually takes a long time, with several training iterations and different combinations of hyperparameters, before your model yields the desired accuracy. Also, system resources could be used inefficiently, making the model training expensive and compute intensive. Debugging and profiling your model training gives you visibility and control to quickly troubleshoot and take corrective measures if needed. For example, capturing metrics in real time during training can help you detect common training errors, such as the gradient values becoming too large or too small.

Common training errors include vanishing or exploding gradients. Deep neural networks typically learn through backpropagation, in which the model's losses are traced back through the network. The neurons' weights are modified in order to minimize the loss. If the network is too deep, however, the learning algorithm can spend its whole loss budget on the top layers, and weights in the lower layers never get updated. That's the vanishing gradient problem. Conversely, the learning algorithm might trace a series of errors to the same neuron, resulting in such a large modification to that neuron's weight that it imbalances the network. That's the exploding gradient problem.

Another common error is bad initialization. Initialization assigns random values to the model parameters. If all parameters have the same initial value, they receive the same gradient and the model is unable to learn. Initializing parameters with values that are too small or too large may again lead to vanishing or exploding gradients.

And then there is overfitting. The training loop consists of training and validation. If the model's performance improves on the training set but not on the validation dataset, it's a clear indication that the model is overfitting. If the model's performance initially improves on the validation set but then begins to fall off, training needs to stop to prevent overfitting.

All these issues impact your model's learning process. Debugging them is usually hard, and even harder when you run distributed training. Another area you want to track is system resource utilization. Monitoring and profiling system resources can help you answer how many GPU, CPU, network, and memory resources your model training consumes. More specifically, it helps you detect bottlenecks and alerts you to them so you can quickly take corrective actions. Here is an overview of potential bottlenecks. These could include I/O bottlenecks when loading your data, CPU or memory bottlenecks when processing the data, and GPU bottlenecks or underutilization during model training.

If you encounter any model training errors or system bottlenecks, you want to be informed so you can take corrective actions. For example, why not stop the model training as soon as the model starts overfitting? This can help save both time and money. And it's not only about stopping the model training when an issue is found; maybe you also want to send a notification via email or text message in that case. Let's see how you can debug and profile models using another tool from the AWS toolbox, SageMaker Debugger.
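
To get a feel for why gradients vanish or explode in deep networks, consider this deliberately simplified sketch. It assumes that backpropagation scales the gradient by the same constant factor at every layer, which real networks do not do exactly, and the thresholds in `classify_gradient` are arbitrary values chosen for the example.

```python
def gradient_scale(per_layer_factor, num_layers):
    """Overall factor applied to a gradient passed back through num_layers,
    assuming each layer multiplies it by per_layer_factor."""
    scale = 1.0
    for _ in range(num_layers):
        scale *= per_layer_factor
    return scale


def classify_gradient(scale, low=1e-6, high=1e6):
    """Label a gradient scale using illustrative thresholds."""
    if scale < low:
        return "vanishing"
    if scale > high:
        return "exploding"
    return "healthy"
```

With 50 layers, a per-layer factor of 0.5 shrinks the gradient to roughly 1e-15 of its original size, while a factor of 1.5 grows it by several hundred million, which is why monitoring gradient magnitudes during training is useful.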

![](2023-12-31-12-48-18.png)
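
Stopping training once the validation loss stops improving, as discussed above, can be sketched with a simple patience check. This is a generic early-stopping heuristic written for illustration, not the logic SageMaker uses; the function name and the default patience are assumptions.

```python
def should_stop_early(val_losses, patience=2):
    """Return True when the validation loss has not improved on its earlier
    best value for `patience` consecutive epochs."""
    if len(val_losses) <= patience:
        return False  # not enough history to decide yet
    best_before = min(val_losses[:-patience])
    # Stop only if none of the most recent epochs beat the earlier best.
    return all(loss >= best_before for loss in val_losses[-patience:])
```

You would call this after each validation pass with the list of validation losses collected so far and break out of the training loop when it returns `True`.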

![](2023-12-31-12-49-42.png)

![](2023-12-31-12-50-11.png)

![](2023-12-31-12-50-31.png)

![](2023-12-31-12-51-09.png)

![](2023-12-31-12-51-26.png)

### **Debug and Profile Models with Amazon SageMaker Debugger**

![](2023-12-31-13-10-33.png)

![](2023-12-31-13-11-08.png)

![](2023-12-31-13-11-23.png)

![](2023-12-31-13-11-38.png)

![](2023-12-31-13-12-03.png)

![](2023-12-31-13-12-29.png)

![](2023-12-31-13-12-44.png)

![](2023-12-31-13-12-55.png)

![](2023-12-31-13-13-26.png)

![](2023-12-31-13-13-40.png)

Debugger automatically captures real-time metrics during the model training, such as training and validation loss and accuracy, confusion matrices, and learned gradients, to help you improve model accuracy. The metrics from Debugger can also be visualized in SageMaker Studio for easy understanding. Debugger can also generate warnings and remediation advice when common training problems are detected. In addition, Debugger automatically monitors and profiles your system resources, such as CPU, GPU, network, and memory, in real time and provides recommendations on reallocation of these resources. This enables you to use your resources more efficiently during the model training and helps to reduce costs and resources.

So how does Debugger work exactly? Debugger captures real-time debugging data during the model training and stores this data in your secure S3 bucket. The captured data includes system metrics, framework metrics, and output tensors. System metrics include, for example, hardware resource utilization data such as CPU, GPU, and memory utilization, network metrics, and data input and output (I/O) metrics. Framework metrics could include convolutional operations in the forward pass, batch normalization operations in the backward pass, data loader operations between steps, and gradient descent algorithm operations to calculate and update the loss function. And finally, the output tensors. Output tensors are collections of model parameters that are continuously updated during the backpropagation and optimization process of training machine learning and deep learning models. The captured data for output tensors includes scalar values, such as accuracy and loss, and matrices, for example representing weights, gradients, input layers, and output layers.

Now, while the model training is still in progress, Debugger also reads the data from the S3 bucket and already runs a real-time, continuous analysis through rules. Here's a list of Debugger built-in rules you can choose from. You can use these rules to debug and profile your model during a training job. There are many rules available, whether you want to analyze your datasets, track loss and accuracy during the model training, inspect weights, monitor tensors, observe the activation function, monitor metrics of decision trees, or keep an eye on gradients. You can also take corrective actions in case Debugger detects an issue, for example when the model starts to overfit. You can use Amazon CloudWatch Events to create a trigger that sends you a text message, emails you the status, or even stops the training job. You can also analyze the data in your notebook environment using the Debugger SDK, or you can visualize the training metrics and system resources using the SageMaker Studio IDE.

Now let's have a look at some code examples. This code here shows how you can leverage the built-in rules to watch for common training errors. Select the rules you want to evaluate, such as `loss_not_decreasing` or `overtraining`. By the way, detecting when the model starts overtraining can prevent the model from overfitting. Then pass the rules with the rules parameter in your estimator. SageMaker will then start a separate processing job for each rule you specify, in parallel to your training job. The processing job will collect the relevant data and observe the metrics.

To profile the system and framework metrics for your training jobs, you need to perform very similar steps. First, you select the rules to observe again. Debugger comes with a list of built-in rules you can select, such as a check for low GPU utilization. If you select the ProfilerReport rule, the rule will invoke all of the built-in rules for monitoring and profiling.
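
Selecting the built-in rules mentioned above might look like the following sketch, which uses the `sagemaker.debugger` module of the SageMaker Python SDK. The function name `build_rules` is an assumption, and the import is deferred so the block can be read and loaded without the SDK. The second function, `loss_not_decreasing`, is not Debugger's implementation; it is a simplified, hypothetical stand-in that shows the kind of check such a rule performs on captured loss values.

```python
def build_rules():
    """Combine debugging and profiling rules into one list for the estimator."""
    from sagemaker.debugger import ProfilerRule, Rule, rule_configs

    return [
        # Debugging rules watch the captured output tensors.
        Rule.sagemaker(rule_configs.loss_not_decreasing()),
        Rule.sagemaker(rule_configs.overtraining()),
        # Profiling rules watch system and framework metrics.
        ProfilerRule.sagemaker(rule_configs.LowGPUUtilization()),
        ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
    ]


def loss_not_decreasing(losses, num_steps=3, min_rel_decrease=0.01):
    """Simplified illustration: True if the loss fell by less than
    min_rel_decrease (relative) over the last num_steps recorded values."""
    if len(losses) <= num_steps:
        return False  # not enough history to evaluate
    previous = losses[-num_steps - 1]
    latest = losses[-1]
    return latest > previous * (1.0 - min_rel_decrease)
```

The list returned by `build_rules` is what you would pass to the estimator through its `rules` parameter.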

![](2023-12-31-13-15-58.png)

![](2023-12-31-13-16-40.png)

By default, Debugger collects system metrics every 500 milliseconds and basic output tensors, that is, scalar outputs such as loss and accuracy, every 500 steps. You can modify the configuration if needed. To enable framework profiling, configure the `framework_profile_params` parameter as shown here. Then pass the rules and `profiler_config` to the estimator as shown earlier. Note that the list of selected rules can contain both debugging rules and profiling rules.

Now, how can you analyze the results? For any SageMaker training job, the Debugger ProfilerReport rule invokes all of the monitoring and profiling rules and aggregates the rule analysis into a comprehensive report. You can download the Debugger profiling report from S3 while your training job is running or after the job has finished. At the top of the report, Debugger provides a summary of your training job. In this section you can view, for example, the durations and timestamps of the different training phases. In the rule summary section, Debugger aggregates all of the rule evaluation results, analysis, rule descriptions, and suggestions. The report also shows system resource utilization, such as CPU and network utilization, over time. In this sample screenshot, you can see that the CPU utilization is lower in the beginning, probably as the data is loaded via the network, and then goes up when the training starts.

Debugger also creates a system utilization heat map. In the sample shown here, I have run the training job on an ml.c5.9xlarge instance, which has 36 vCPUs. The heat map shows how each vCPU was utilized over time. The darker the color, the higher the utilization. If you see resources being underutilized, you could scale down to a smaller instance type to save cost and run the training job more efficiently.
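
The profiler configuration described above could be sketched like this. The 500 millisecond interval comes from the text; the number of profiled steps and the function name are assumptions for illustration, and the SDK import is deferred so the block loads without the SageMaker SDK installed.

```python
SYSTEM_MONITOR_INTERVAL_MILLIS = 500  # how often system metrics are sampled
FRAMEWORK_PROFILE_NUM_STEPS = 10      # hypothetical number of steps to profile


def build_profiler_config():
    """Create a profiler configuration with framework profiling enabled."""
    from sagemaker.debugger import FrameworkProfile, ProfilerConfig

    return ProfilerConfig(
        system_monitor_interval_millis=SYSTEM_MONITOR_INTERVAL_MILLIS,
        framework_profile_params=FrameworkProfile(
            num_steps=FRAMEWORK_PROFILE_NUM_STEPS
        ),
    )
```

The resulting object is what you would hand to the estimator as `profiler_config`, next to the list of rules.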

![](2023-12-31-13-17-02.png)
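
If you export utilization samples yourself, a rough underutilization check in the spirit of the heat map could look like the sketch below. The input layout (one list of per-vCPU percentages per time step), the function names, and the 30 percent threshold are all assumptions for this example and not part of the Debugger report.

```python
def mean_utilization_per_vcpu(samples):
    """Average utilization of each vCPU across all time steps.

    samples: list of time steps, each a list with one percentage per vCPU.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    num_vcpus = len(samples[0])
    return [
        sum(step[i] for step in samples) / len(samples)
        for i in range(num_vcpus)
    ]


def is_underutilized(samples, threshold_pct=30.0):
    """True if the average utilization across all vCPUs is below threshold."""
    means = mean_utilization_per_vcpu(samples)
    return sum(means) / len(means) < threshold_pct
```

A consistently low average would be one signal that a smaller instance type might be sufficient for the training job.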

![](2023-12-31-13-17-23.png)

![](2023-12-31-13-17-42.png)

![](2023-12-31-13-18-01.png)

![](2023-12-31-13-18-18.png)

![](2023-12-31-13-18-43.png)