# **Built-in Algorithms**

![](2023-12-30-09-51-17.png)

![](2023-12-30-09-48-28.png)

The big advantage of built-in algorithms is that they require no coding. To start running experiments, you just provide the algorithms with your input data, set any model hyperparameters, and define the compute resources, such as the number and type of compute instances to use. Another benefit of built-in algorithms is that many of them support GPUs and parallelization across multiple compute instances without any additional configuration. This means that if you are working with a large dataset and want to distribute your model training, you don't have to worry about writing that code either.

## **Built-in Algorithms**

![](2023-12-30-09-51-53.png)

![](2023-12-30-09-52-39.png)

![](2023-12-30-09-52-53.png)

![](2023-12-30-09-53-45.png)

![](2023-12-30-09-54-26.png)

![](2023-12-30-09-54-44.png)

![](2023-12-30-09-55-13.png)

### **Use Cases and Algorithms**

![](2023-12-30-09-56-12.png)

![](2023-12-30-09-57-15.png)

![](2023-12-30-10-32-31.png)

Classification covers both binary and multi-class classification. For example, a typical binary classification use case is predicting whether an email is spam or not spam. Your input data in this case would be tabular, with a label stating whether each training sample is indeed spam or not spam. For such classification problems, you could choose between the XGBoost and the K-Nearest Neighbors algorithms. XGBoost, which is short for extreme gradient boosting, is a popular and efficient open-source implementation of the gradient boosted trees algorithm. The XGBoost algorithm performs really well because of its robust handling of a variety of data types, relationships, and distributions, and because of the variety of hyperparameters that you can fine-tune. You can also use XGBoost for regression and ranking problems. The K-Nearest Neighbors or k-NN algorithm is an index-based algorithm. For classification problems, the algorithm queries the K points that are closest to the sample point and returns the most frequent label among them as the predicted label.

Let's move on to regression problems. Here, the goal is to predict a numeric or continuous value, such as estimating the value of a house given input features such as location, number of rooms, property tax rates, and other data. For regression tasks, you can choose between the Linear Learner and the XGBoost algorithms. The Linear Learner algorithm extends typical linear models by training many models in parallel, each with slightly different hyperparameters, and then returns the one with the best fit.
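To make the k-NN voting step described above concrete, here is a minimal sketch in plain Python. It is a brute-force illustration, not the index-based SageMaker implementation, and the feature values (links and exclamation marks per email) and the function name `knn_predict` are made up for this example.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=3):
    """Return the most frequent label among the k training points closest to `query`."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k nearest neighbors and take a majority vote over their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical features per email: (number of links, number of exclamation marks).
train_points = [(0, 0), (1, 0), (0, 1), (7, 9), (8, 6), (9, 8)]
train_labels = ["not spam", "not spam", "not spam", "spam", "spam", "spam"]

print(knn_predict(train_points, train_labels, query=(8, 7)))  # spam
print(knn_predict(train_points, train_labels, query=(1, 1)))  # not spam
```

Scanning every training point is fine for a toy dataset. An index-based approach, as mentioned above, avoids that full scan on large datasets.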


![](2023-12-30-10-33-35.png)

A third problem type is time-series forecasting. Imagine you want to predict sales of a new product, given previous sales data. For time-series forecasting, you can leverage the DeepAR Forecasting algorithm. The DeepAR Forecasting algorithm is a supervised learning algorithm for forecasting scalar, meaning one-dimensional, time series using recurrent neural networks or RNNs.

Let's move on to clustering tasks. Clustering is an example of unsupervised learning. Here, the data is not labeled. The clustering algorithm tries to find patterns in the data and groups data points into distinct clusters. One prominent problem type addressed here is dimensionality reduction in the feature engineering step. Assume you want to predict the mileage of a car. In this use case, the color of the car is unlikely to be a relevant input feature and can be dropped. You can use the principal component analysis or PCA algorithm for this task. PCA is an unsupervised machine learning algorithm that attempts to reduce the dimensionality, or the number of features, within a dataset while still retaining as much information as possible.

Another popular problem type is anomaly detection. For example, anomalies could manifest as unexpected spikes in time-series data, such as an unexpectedly high or low number of requested ride shares, maybe due to weather conditions or a large event in town. Anomalies can be detected using the Random Cut Forest algorithm or RCF. RCF is an unsupervised algorithm for detecting anomalous data points within a dataset. RCF associates an anomaly score with each data point. Low score values indicate that the data point is considered normal.

![](2023-12-30-10-34-26.png)

![](2023-12-30-10-35-29.png)

![](2023-12-30-10-37-32.png)

![](2023-12-30-10-38-55.png)

![](2023-12-30-10-39-26.png)

![](2023-12-30-10-40-30.png)

However, high values indicate the presence of an anomaly in the data.

One of the most popular algorithms for clustering or grouping of data is K-Means. K-Means could be used, for example, if you want to group customers into high, medium, or low spending groups based on transaction data. K-Means is an algorithm that trains a model that groups similar objects together. For example, suppose you want to create a model to recognize handwritten digits, and you choose the MNIST dataset for training. You can think of the MNIST database, or Modified National Institute of Standards and Technology database, as the Hello World dataset of Computer Vision. The dataset provides thousands of images of handwritten digits from 0 to 9. In this example, you might choose to create 10 clusters, one for each digit. As part of model training, the K-Means algorithm would then group the input images into one of those 10 clusters.

Another clustering problem type is topic modeling. Here, you are working specifically with text data as input. For example, say you want to organize a set of documents into topics based on the words and phrases used in those documents. Two built-in algorithms could help you implement this: Latent Dirichlet Allocation, also known as LDA, and the Neural Topic Model, also known as NTM. Although you can use both the NTM and LDA algorithms for topic modeling, they are distinct algorithms and can be expected to produce different results on the same input data. LDA is a generative probabilistic model, which means it attempts to provide a model for the distribution of outputs and inputs based on latent variables. In statistics, latent variables are variables that are not directly observed but are inferred from other variables in the training dataset. This is in contrast to discriminative models, which attempt to learn how inputs map to outputs. NTM uses a deep learning model rather than a pure statistical model. I would recommend you try both LDA and NTM and explore which one works better on your specific data.
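As a rough illustration of the K-Means customer-grouping example above, the following sketch runs the classic assign-then-update loop on one-dimensional data in plain Python. The spend values, the initial centroid guesses, and the function name `kmeans_1d` are all assumptions made for this example; this is not how the SageMaker built-in algorithm is invoked.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Basic K-Means on 1-D data; returns (centroids, assignments)."""
    centroids = list(centroids)
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda c: abs(v - centroids[c]))
            for v in values
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, a in zip(values, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = sum(members) / len(members)
    return centroids, assignments


# Made-up monthly spend per customer and hand-picked initial centroids.
spend = [12, 15, 18, 95, 100, 110, 480, 500, 520]
centroids, groups = kmeans_1d(spend, centroids=[10, 100, 500])

print(groups)     # cluster index per customer: 0 = low, 1 = medium, 2 = high
print(centroids)  # mean spend of each group
```

The result depends on the initial centroids. A poor starting guess can leave a cluster empty or merge two natural groups, which is why practical implementations use more careful initialization.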

![](2023-12-30-10-41-21.png)

Let's have a look at popular image processing use cases. Content moderation refers to the ability to review user-generated content and decide whether the content is appropriate to display or whether it should be removed. This use case could be implemented in various ways and doesn't apply only to images. For the image use case, you can use the built-in image classification algorithm to classify the image into one of your defined output categories. The built-in image classification algorithm can be run in two modes: full training or transfer learning. In full training, the network is initialized with random weights and trained on user data from scratch. In transfer learning mode, the network is initialized with pre-trained weights, and just the top fully connected layer is initialized with random weights. Then the whole network is fine-tuned with new data. In this mode, training can be achieved even with a smaller dataset. This is because the network is already trained and can therefore be used in cases without sufficient training data.

Another use case working with image data is detecting people or objects in images. The object detection algorithm detects all instances of predefined objects within the images, categorizes each object, and adds a bounding box indicating the location and scale of the object in the given image.

![](2023-12-30-10-43-48.png)

The third use case describes, for example, how self-driving cars identify objects in their path, which is realized with semantic segmentation. Semantic segmentation differs from image classification and object detection in that it classifies every pixel in an image. This provides information about the shapes of the objects contained in the image. The segmentation output is represented as a grayscale image, called a segmentation mask, that has the same shape as the input image. Classifying each pixel is fundamental for understanding scenes, which is critical to an increasing number of Computer Vision applications, such as self-driving vehicles, medical imaging diagnostics, and robot sensing.

Let's come back to the field of text analysis. A typical use case here is translating text. Let's say you want to convert Spanish to English. A built-in algorithm that you could use for this purpose is sequence-to-sequence. The sequence-to-sequence algorithm is a supervised learning algorithm where the input is a sequence of tokens, for example, text or audio, and the output is generated as another sequence of tokens. You can use the same algorithm to summarize texts. Imagine you want to summarize a long research paper into just a short abstract. You can also use the sequence-to-sequence algorithm for speech-to-text conversion. Let's say you want to transcribe call center conversations. Finally, the use case you will be working on is text classification: classifying product reviews into sentiment classes.

![](2023-12-30-10-45-03.png)

![](2023-12-30-10-45-33.png)

![](2023-12-30-10-46-27.png)

### **Text Analysis**

In fact, a very simple bag-of-words model was already introduced in the 1950s to count the occurrence of each word in a document. I want to spend a few minutes walking you through some of the more recent advancements, starting in 2013.

In 2013, a research team led by Tomas Mikolov at Google introduced the Word2Vec algorithm. Word2Vec converts text into vectors, also called embeddings. Each of those vectors consists of 300 values and thus represents a point in a 300-dimensional vector space. You can then use those vector representations as inputs to your machine learning use cases, for example, applying nearest neighbor classification or clustering algorithms. Word2Vec is famous for the two different model architectures you see here, which it implements to generate the word embeddings: continuous bag-of-words or CBOW, and continuous skip-gram. The architectures are based on shallow two-layer neural networks. A novel approach back in 2013, CBOW predicts the current word from a window of surrounding context words, while continuous skip-gram uses the current word to predict the surrounding window of context words.

One challenge with Word2Vec, though, is that it tends to run into what are called out-of-vocabulary issues, because its vocabulary only contains three million words. The vocabulary is the set of known words that the model learned in the training phase. Out-of-vocabulary words are words that were not present in the text dataset the model was initially trained on. If a word is not found in its vocabulary, the model assigns a zero to that word, which basically discards the word.

In 2014, a research team led by Jeffrey Pennington at Stanford University introduced GloVe, or global vectors for word representation. GloVe's novel approach in 2014 used a regression model to learn word representations through unsupervised learning. In 2016, a research team led by Piotr Bojanowski at Facebook AI Research published their work on FastText.
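The difference between the CBOW and continuous skip-gram setups described above is easiest to see in how training examples are built from a sentence. The sketch below only generates those examples in plain Python and does not train a network. The function names and the window size of 2 are assumptions made for illustration.

```python
def context_window(tokens, index, window):
    """Tokens within `window` positions of `index`, excluding the token itself."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right


def cbow_examples(tokens, window=2):
    """CBOW: the surrounding context words are the input, the current word is the target."""
    return [(context_window(tokens, i, window), word) for i, word in enumerate(tokens)]


def skipgram_examples(tokens, window=2):
    """Skip-gram: the current word is the input, each context word is a target."""
    return [
        (word, context)
        for i, word in enumerate(tokens)
        for context in context_window(tokens, i, window)
    ]


tokens = "the quick brown fox jumps".split()

print(cbow_examples(tokens)[2])
# (['the', 'quick', 'fox', 'jumps'], 'brown')
print(skipgram_examples(tokens)[:3])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the')]
```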
Let's have a closer look at FastText. FastText builds on Word2Vec, but it treats each word as a set of subwords called character n-grams. This helps with the out-of-vocabulary issue that I mentioned for Word2Vec. Here are examples of how FastText divides a word into smaller character sets. Now, even if the word Amazon is not in the vocabulary, chances are that the character set 'am' is. The embedding that FastText learns for a word is the aggregate of the embeddings of each n-gram within the word. FastText uses the same CBOW and skip-gram models, but it adds support for text classification use cases. With character n-gram representations of words, FastText increases the effective vocabulary of Word2Vec beyond the three million words.

Another large milestone in the evolution of text analysis was the introduction of the Transformer architecture in a paper called "Attention Is All You Need", published at the 2017 Conference on Neural Information Processing Systems. Vaswani et al. from Google Brain, with collaborators at Google Research and the University of Toronto, introduced a novel neural network architecture based on a self-attention mechanism. The concept of attention had been studied before for different model architectures and generally refers to one model component capturing the correlation between inputs and outputs. In NLP terms, attention maps each word from the model's output to the words in the input sequence, assigning them weights depending on their importance to the predicted word.
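The character n-gram idea described above for FastText can be sketched in a few lines of plain Python. This is a simplified illustration and not the FastText implementation: it omits details such as word boundary markers and n-gram hashing, it treats "aggregate" as a plain sum, and the two-dimensional n-gram vectors are made up.

```python
def char_ngrams(word, n_min=2, n_max=3):
    """All character n-grams of `word` for n between n_min and n_max."""
    word = word.lower()
    return [
        word[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(word) - n + 1)
    ]


def embed(word, ngram_vectors, dim):
    """Sum the vectors of the word's known n-grams; all zeros if none is known."""
    known = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    if not known:
        return [0.0] * dim
    return [sum(values) for values in zip(*known)]


# Hypothetical n-gram vectors, as if learned from other words during training.
ngram_vectors = {"am": [1.0, 0.0], "az": [0.0, 1.0], "on": [1.0, 1.0]}

print(char_ngrams("Amazon", 2, 2))            # ['am', 'ma', 'az', 'zo', 'on']
print(embed("Amazon", ngram_vectors, dim=2))  # [2.0, 2.0]
```

Even though "Amazon" itself has no entry, it still gets a non-zero vector from the n-grams it shares with known words. A word-level lookup would fall back to zero here.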

![](2023-12-30-11-36-24.png)

![](2023-12-30-11-37-45.png)

![](2023-12-30-11-38-27.png)

![](2023-12-30-11-52-57.png)

![](2023-12-30-11-53-41.png)

The self-attention mechanism in this new Transformer architecture focuses on capturing the relationships between all words in the input sequence, thereby significantly improving the accuracy of natural language understanding tasks such as machine translation. While the Transformer architecture marked a very important milestone for NLP, other research teams kept evolving alternative architectures.

Also in 2017, Saurabh Gupta and Vineet Khare from AWS introduced BlazingText. BlazingText provides highly optimized implementations of the Word2Vec and text classification algorithms. BlazingText scales and accelerates Word2Vec using multiple CPUs or GPUs for training. Similarly, the BlazingText implementation of the text classification algorithm extends FastText to use GPU acceleration with custom CUDA kernels. CUDA, or Compute Unified Device Architecture, is a parallel computing platform and programming model developed by NVIDIA. To give you an idea of the scope of the acceleration: using BlazingText, you can train a model on more than a billion words in a couple of minutes using a multi-core CPU or GPU. BlazingText creates character n-gram embeddings using the continuous bag-of-words and skip-gram training architectures. BlazingText also allows you to save money by stopping your model training early, for example, when the validation accuracy stops increasing. BlazingText also optimizes the I/O for datasets stored in Amazon Simple Storage Service or Amazon S3. Later this week, you will use the BlazingText algorithm to train the text classifier model.

In 2018, Matthew E. Peters at the Allen Institute for Artificial Intelligence, along with collaborators from the University of Washington, published the ELMo algorithm, which is short for embeddings from language models. In ELMo, word vectors are learned by a deep bidirectional language model. ELMo combines a forward and a backward language model and is thus able to better capture syntax and semantics across different linguistic contexts.

Later in 2018, a research team led by Alec Radford at OpenAI released GPT to improve language understanding by generative pre-training. GPT is based on the Transformer architecture but performs two training steps. First, GPT learns a language model from a large unlabeled text corpus. Second, GPT performs a supervised learning step with labeled data to learn a specific NLP task such as text classification. GPT is trained on, and can only predict, context from left to right, which is often referred to as unidirectional.

Shortly after GPT, a research team led by Jacob Devlin at Google AI Language published BERT, or bidirectional encoder representations from transformers. BERT, in contrast to GPT, is truly bidirectional. In the unsupervised training step, BERT learns representations from unlabeled text, from left-to-right and right-to-left contexts jointly. This novel approach created interest in BERT across the industry and has led to many variations of BERT models, some of which focus on specific languages such as French, German, or Spanish. There are also BERT models that focus on a specific text domain such as scientific text. Up to today, BERT is still among the most popular NLP models.
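The unidirectional versus bidirectional distinction above comes down to which neighboring tokens a model may use when representing a given position. The following toy function only illustrates that difference in available context. It says nothing about how GPT or BERT are actually implemented, and the function name and example sentence are made up.

```python
def visible_context(tokens, position, bidirectional):
    """Tokens a model may use as context when representing tokens[position]."""
    left = tokens[:position]
    # A left-to-right model sees nothing to the right of the current position.
    right = tokens[position + 1:] if bidirectional else []
    return left + right


tokens = ["the", "bank", "of", "the", "river"]

# Left-to-right only: "bank" is represented without seeing "river".
print(visible_context(tokens, 1, bidirectional=False))  # ['the']
# Both directions: "river" is available to disambiguate "bank".
print(visible_context(tokens, 1, bidirectional=True))   # ['the', 'of', 'the', 'river']
```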

![](2023-12-30-11-54-04.png)

![](2023-12-30-11-55-04.png)

![](2023-12-30-11-56-11.png)

![](2023-12-30-11-56-47.png)

![](2023-12-30-11-57-21.png)

![](2023-12-30-11-57-58.png)

### **Train a Text Classifier**

![](2023-12-30-11-59-29.png)

Before you start training the model, you need to transform your training data into the input format BlazingText requires. Specifically, append the sentiment classes 1, 0, and -1 to the label identifier, as shown here, and remove any column headers. The BlazingText algorithm also requires the text to be tokenized, with one sentence per line. You can use Python's Natural Language Toolkit or NLTK to perform exactly that step. Finally, just upload the training data to an S3 bucket.

You can tune SageMaker BlazingText text classification models with the hyperparameters shown here. Parameters are what a model learns; hyperparameters control how the model learns those parameters. The tunable hyperparameters for BlazingText include the number of epochs, which corresponds to the number of complete passes through the dataset. The learning rate is the step size used by the numerical optimizer. min_count lets you configure the removal of words that appear fewer times than the value you specify. vector_dim is the number of dimensions in the vector space to use. word_ngrams is the number of words in a word n-gram; this is an important parameter and can have a significant impact on accuracy. early_stopping defines whether to stop training if, for example, validation accuracy doesn't improve after the number of epochs specified in the patience parameter.

Now, in preparation for training, you need to configure the algorithm's data input channels to point to the uploaded training and validation files in S3. You can retrieve the correct SageMaker model training image for BlazingText via the SageMaker image_uris.retrieve call. SageMaker offers prebuilt Docker images for built-in algorithms that contain all model code. The Docker images are stored in a Docker container registry, the Amazon Elastic Container Registry or Amazon ECR. The image_uris.retrieve function retrieves the correct image from the container registry. You simply have to specify the framework as shown here. Then you pass that image URI, together with any additional settings, to a SageMaker Estimator object, and finally you can start training the BlazingText text classifier by calling the estimator's fit method. Are you curious to see a result? Here is a sample of model evaluation metrics showing the results: training accuracy and validation accuracy.
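As a sketch of the data preparation step described at the start of this section, the function below turns a labeled review into one line in the `__label__` format that BlazingText text classification expects. The regex tokenizer is a simple stand-in for NLTK so that the example has no dependencies. Lowercasing the text, the sample reviews, and the function name `to_blazingtext_line` are assumptions made for illustration.

```python
import re


def to_blazingtext_line(sentiment, review_text):
    """Format one labeled review as '__label__<class> <space-separated tokens>'."""
    # Stand-in tokenizer (not NLTK): split into word tokens and punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", review_text.lower())
    return f"__label__{sentiment} " + " ".join(tokens)


# Made-up reviews with sentiment classes 1, 0, and -1.
reviews = [(1, "Great fit, love it!"), (0, "It is okay."), (-1, "Poor quality")]
lines = [to_blazingtext_line(sentiment, text) for sentiment, text in reviews]

print("\n".join(lines))
# __label__1 great fit , love it !
# __label__0 it is okay .
# __label__-1 poor quality
```

The resulting lines would be written to a file without a header row and uploaded to S3 as the training or validation channel.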

![](2023-12-30-12-01-10.png)

![](2023-12-30-12-01-33.png)

![](2023-12-30-12-05-28.png)

![](2023-12-30-12-05-56.png)

![](2023-12-30-12-06-18.png)

![](2023-12-30-12-06-35.png)

![](2023-12-30-12-08-22.png)

![](2023-12-30-12-09-18.png)

![](2023-12-30-12-10-30.png)

![](2023-12-30-12-10-49.png)

![](2023-12-30-12-11-24.png)

## **Practice**