# NOTES -  

## Encoding Text in Numeric Form

Machine-learning models such as those that you might use for natural language processing cannot work with text data directly. You need to encode your text data in some numeric form. And the question is, how? You can imagine that any document, whether it's a review or some other kind of text, is essentially just a word sequence, and you can split a document into individual words using some kind of tokenizer. Now your document is essentially just a sequence or a list of words, and you can encode or represent each word in numeric form. Once you have numeric representations for the individual words in your document, it's easy to represent your document as a tensor: represent each word as numeric data and aggregate the results into a tensor, which is essentially just a matrix.

Now the big question is how do you encode individual words in numeric form? There are several techniques that you can use to achieve this. You can choose to one-hot encode your words, you can represent individual words using frequency-based numeric representations, or you can use prediction-based representations to encode words. I'll give you a quick big-picture overview of how each of these encodings works.

One-hot encoding represents each word in the text by the presence or absence of that word. The size of the feature vector used to represent a word is equal to the size of your vocabulary, that is, all of the unique words that you have across your entire corpus. Each word has a corresponding position in that feature vector, and you use 1 to indicate that the word is present or 0 to indicate that the word is absent. One-hot encoding has several flaws. For one, you have no idea how often a particular word occurred in the text. You only know whether the word was present or absent. An improvement over one-hot encoding is frequency-based encodings.


Frequency-based numeric encodings can be divided into three broad categories. You can have count-based encodings, you can have individual words represented using their TFIDF scores, which stands for term frequency-inverse document frequency, or you can represent words using co-occurrence matrices.


The simplest of these numeric representations is the count-based word representation. The only improvement that count-based feature vectors offer over one-hot encoding is that you use the numbers in your feature vector to represent a count of how often a word occurs in a document. These capture the frequency of a word in a particular document, and this is important because the frequency may indicate the importance of a word in a document.
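
To make the difference between presence/absence vectors and count vectors concrete, here is a minimal sketch using scikit-learn's `CountVectorizer`. The two-sentence corpus is made up purely for illustration, and the variable names are my own.

```python
from sklearn.feature_extraction.text import CountVectorizer

# A tiny made-up corpus, used only for illustration.
docs = [
    "the movie was good good good",
    "the movie was bad",
]

# binary=True records only presence (1) or absence (0) of each vocabulary word.
presence_vectorizer = CountVectorizer(binary=True)
presence = presence_vectorizer.fit_transform(docs).toarray()

# The default settings record how many times each word occurs in each document.
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(docs).toarray()

vocabulary = count_vectorizer.vocabulary_   # maps word -> column index
good_col = vocabulary["good"]

print(sorted(vocabulary))       # the unique words across the corpus
print(presence[0, good_col])    # presence flag for "good" in the first document
print(counts[0, good_col])      # number of times "good" occurs in the first document
```

Both vectorizers produce vectors whose length equals the vocabulary size; only the values stored in each position differ.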



An improvement over count-based feature vectors is feature vectors built using TFIDF scores. TF stands for term frequency, and IDF stands for inverse document frequency. TFIDF scores try to capture how often a word occurs in a document, as well as across the entire corpus.


- The TF component of the TFIDF score stands for term frequency. If a word occurs more frequently in a single document, that word might be important. TF up-weights words that occur more frequently in one document.
- The IDF component stands for inverse document frequency. IDF scores down-weight words that occur frequently across the corpus. Stop words such as a, an, and the might occur frequently across the corpus. These words are not significant and should be down-weighted.


The TFIDF score for a word is a combination of these two components. Neither TFIDF vectors nor count vectors capture the context surrounding a word; co-occurrence frequencies do.
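
As a rough illustration of the IDF idea, the sketch below fits scikit-learn's `TfidfVectorizer` on a three-sentence made-up corpus and compares the learned IDF weight of a word that appears in every document ("the") with one that appears in only one ("rocket"). The corpus and variable names are assumptions for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A tiny made-up corpus; "the" occurs in every document, "rocket" in only one.
docs = [
    "the rocket reached the orbit",
    "the team won the game",
    "the game went to overtime",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)   # rows: documents, columns: vocabulary words

# idf_ holds one inverse-document-frequency weight per vocabulary word.
idf_the = vectorizer.idf_[vectorizer.vocabulary_["the"]]
idf_rocket = vectorizer.idf_[vectorizer.vocabulary_["rocket"]]

print("IDF of 'the':   ", idf_the)
print("IDF of 'rocket':", idf_rocket)
```

The word that is common across the corpus receives the smaller IDF weight, which is the down-weighting described above.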



Co-occurrence frequencies are generated on the principle that similar words will occur together and will have similar context. 


Co-occurrence matrices for a word representation are generated using something called a context window, a window centered around a word, which includes a certain number of neighboring words. 


The co-occurrence matrix will try to capture word co-occurrences, the number of times two words, w1 and w2, have occurred together within that context window.
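
The counting described above can be sketched in plain Python. This is only an illustrative implementation: the function name `cooccurrence_counts`, the `window_size` parameter, and the sample sentence are my own, and the pairs are stored in a `Counter` rather than a dense matrix.

```python
from collections import Counter


def cooccurrence_counts(tokens, window_size=2):
    """Count how often each unordered pair of words appears within
    `window_size` positions of each other."""
    counts = Counter()
    for i, word in enumerate(tokens):
        # Looking only to the right counts each pair of positions exactly once,
        # which is equivalent to a symmetric window for unordered pairs.
        for j in range(i + 1, min(i + window_size + 1, len(tokens))):
            pair = tuple(sorted((word, tokens[j])))
            counts[pair] += 1
    return counts


tokens = "the cat sat on the mat".split()
counts = cooccurrence_counts(tokens, window_size=2)

print(counts[("sat", "the")])   # "sat" has a "the" two positions to either side
print(counts[("cat", "mat")])   # these two words are never within the window
```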



And finally, an advanced technique for word vector representations is prediction-based embeddings.


- These word embeddings are generated using advanced machine-learning models such as neural networks.


- These prediction-based embeddings are numerical representations of text, which capture the meaning of text, as well as semantic relationships, and these are generated using ML models. 


Word vector embeddings are low-dimensionality representations of words, which are almost like magic. They capture meaning. Word embeddings are able to capture relationships such as analogies. The relationship between king and queen is the same as that between man and woman, or the relationship between Paris and France is similar to the relationship between London and England. Not only do word embeddings capture meaning, they also offer dramatic dimensionality reduction. The feature vectors used to represent words have very low dimensionality compared with the size of the vocabulary, and they work very well in natural language processing applications.
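
The analogy idea can be illustrated with vector arithmetic. The sketch below uses hand-made two-dimensional vectors that I invented for the example; they are not learned embeddings, and real embeddings have many more dimensions. It only shows the mechanics of "king - man + woman" landing near "queen".

```python
import numpy as np

# Hand-made 2-D vectors (made up for illustration, NOT learned embeddings).
# Dimension 0 loosely stands for "royalty", dimension 1 for "maleness".
toy_vectors = {
    "king": np.array([0.9, 0.9]),
    "queen": np.array([0.9, 0.1]),
    "man": np.array([0.1, 0.9]),
    "woman": np.array([0.1, 0.1]),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def closest_word(vector, vectors):
    """Return the word whose vector is most similar to `vector`."""
    return max(vectors, key=lambda word: cosine_similarity(vector, vectors[word]))


# king - man + woman should point in roughly the same direction as queen.
query = toy_vectors["king"] - toy_vectors["man"] + toy_vectors["woman"]
answer = closest_word(query, toy_vectors)
print(answer)
```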

## Loading and Exploring the Newsgroup Dataset

In this demo, we'll see how we can use a neural network in scikit-learn to build and train a classification model to classify text data. To train this model, we'll work with the 20 newsgroups dataset that's available as a part of the built-in datasets in scikit-learn. This is a dataset of postings in 20 different newsgroups. Import the fetch_20newsgroups function and invoke it to get the newsgroups data.

This newsgroup data is in the form of a dictionary. Let's take a look at the keys of this dictionary. The data key here contains the actual text contents. target_names contains the target categories or classes into which the newsgroup posts have been divided. And if you want a quick explanation of what this dataset is about, the description key, DESCR, is what you're looking for. Let's print out a description of this dataset so that we understand what we're working with. You can see that this is a text dataset of newsgroup posts across 20 different newsgroups. If you scroll down, you'll be able to view the dataset characteristics. You can see that there are about 18,000 posts divided into 20 classes, and each post consists of just text data. If you scroll down further, you'll find the 20 categories into which this dataset has been classified. There are sports-related categories, software-related categories, science-related categories, and so on.

The target_names property on your newsgroups data dictionary will give you all of the target classes listed out in one place. These classes are identified by the numeric identifiers that are present in the target property of the same dictionary. I'm going to take a look at one newsgroup post, picked more or less at random, to see what the data that I'm working with looks like. It is a single post made by someone who has a Harvard email ID. It could be about software, it could be about science. Let's see what the target label is. The numeric identifier for the category to which this post belongs is 14, which corresponds to sci.space, the space category under science. When we build and train our classifier models, we'll use these numeric identifiers to feed into our model rather than using the class names directly. You can use np.unique to see how many unique categories or classes you are dealing with. You can see that the numeric identifiers range from 0 to 19. There are 20 different categories. And the total length of the data that we're working with here is roughly 11,000 records.
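
The exploration steps above might look roughly like the sketch below. The helper names `summarize_newsgroups` and `load_and_summarize` are my own, not from the demo. The download is wrapped in a function so that nothing is fetched until you call it; `fetch_20newsgroups` downloads the data on first use and returns the training subset by default.

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups


def summarize_newsgroups(data, target, target_names):
    """Return a few summary facts about newsgroup-style data."""
    return {
        "n_posts": len(data),
        "n_classes": len(np.unique(target)),
        # The numeric identifier of the first post, mapped back to its class name.
        "first_label": target_names[target[0]],
    }


def load_and_summarize():
    """Fetch the dataset (network access needed on first call) and summarize it."""
    newsgroups = fetch_20newsgroups()
    print(newsgroups.keys())          # data, target, target_names, DESCR, ...
    print(newsgroups.DESCR[:500])     # the start of the dataset description
    return summarize_newsgroups(
        newsgroups.data, newsgroups.target, newsgroups.target_names
    )
```

Calling `load_and_summarize()` in a notebook should print the dictionary keys and the beginning of the description, and return the post count, the number of classes, and the class name of the first post.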

## Creating Feature Vectors from Text Data Using Tf-Idf

All machine-learning models, including neural networks, can only work with numeric data. They can't work with text directly, so you need to convert this text data to numeric form using some technique. There are different methods you can use to create feature vectors from text data. What I've chosen here is TFIDF. TFIDF stands for term frequency-inverse document frequency. Every word in the vocabulary of our text data will be represented using a TFIDF score. More frequent terms in a document get upscaled: the score rises. That is the term frequency. Common terms across the document corpus get downscaled. That is the inverse document frequency.

Creating feature vectors from text data in scikit-learn is straightforward. You simply use the TfidfVectorizer. Here I've specified that I want English stop words to be removed from the underlying text by setting stop_words to english. Stop words can be thought of as terms with no information content, such as a, an, the, and so on. I'm going to create feature vectors from our newsgroup data by using this TfidfVectorizer and calling fit_transform on our newsgroup text. The TfidfVectorizer will build a vocabulary of all of the words that it encounters in our text. You can see that the shape of the transformed data is (11314, 129796). The first dimension here corresponds to the number of records in our dataset. We have about 11,000 newsgroup posts. And the second dimension here refers to the size of our feature vectors, which is equal to the number of words in our vocabulary, that is roughly 130,000 words. If you invoke get_feature_names on your TfidfVectorizer, the length of the feature names will give you the size of your vocabulary. You can see that it's equal to the second dimension.

The feature vectors containing TFIDF scores are created by first assigning numeric identifiers to every word in our vocabulary. Here is a random sample of words that make up our vocabulary and the numeric identifier associated with each of these words. These numeric identifiers identify the position of each of these words in a feature vector. These identifiers start at 0, and the last number is the length of the vocabulary minus 1. Let's take a look at the first newsgroup post. This is the original text data now in feature vector form, and you can see that this is a sparse matrix. The first dimension is the post at index 0, the second dimension holds the numeric identifiers of the words that are present in this post, and the values that you see here are the TFIDF scores for these words. Each newsgroup post is now represented using such feature vectors, and this is what we'll use to train our model.

Let's import the train_test_split function in order to split our data into a training subset and a test subset. We'll use 80% of the data to train our classification model and 20% to evaluate the model. That'll give us roughly 9,000 records to train our model and about 2,000 records to test our model.
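
Here is a small sketch of the same vectorize-and-split steps. The five short posts and their labels are made up as a stand-in for the newsgroup data, so the shapes are tiny compared with the ones in the demo. The sketch reads the vocabulary size from the `vocabulary_` attribute, which avoids depending on `get_feature_names` (newer scikit-learn releases expose this as `get_feature_names_out`).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Tiny stand-in corpus; in the demo this would be the newsgroup posts.
posts = [
    "the shuttle reached orbit",
    "the team won the hockey game",
    "a new graphics card for the computer",
    "the probe sent images from orbit",
    "the goalie saved the game",
]
labels = [0, 1, 2, 0, 1]   # made-up numeric class identifiers

# Remove English stop words and build TFIDF feature vectors.
vectorizer = TfidfVectorizer(stop_words="english")
features = vectorizer.fit_transform(posts)   # a SciPy sparse matrix

print(features.shape)                # (number of posts, vocabulary size)
print(len(vectorizer.vocabulary_))   # vocabulary size, matches the second dimension
print(features[0])                   # nonzero TFIDF scores for the first post

# Hold out 20% of the records for evaluation.
x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0
)
```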

## Building and Training a Classification Model on Text Data

With our text data now ready and converted to feature vector form, we're ready to build a classification model using the MLPClassifier. I'm going to keep this neural network fairly simple. I'll have just 1 hidden layer with 32 neurons. You can, of course, change the design of your neural network and see how the performance of your model changes. That's something I suggest you try. The activation function that I've chosen is the ReLU activation, and I'm going to train for just 50 iterations. Now that's on the low side given the size of our data. Let's see how this works out for us. Notice that I have specified a different optimizer to train our neural network, the Adam optimizer. The Adam optimizer is a popular one, and it's widely used when your datasets are large. Our text dataset contains thousands of samples for training, and at this scale Adam works better than L-BFGS and other optimizers.

Let's start training our neural network on this text data for classification. Call fit on the training data, and wait for the iterations to run through. As the iterations progress, you can see that the loss is constantly falling. After completing 50 iterations of training, I see a warning on screen, which says ConvergenceWarning. What this essentially means is that your neural network trained for those 50 iterations, but the loss was still falling. There is a way to improve your neural network by training for longer. You can say that your neural network model parameters have converged when the rate at which the loss reduces is very small, typically of the order of 0.0001. Now this does not mean that your model can't be used for prediction. It just says this is not the best possible model. It can perform better if you allow the parameters to converge to a lower loss value.

I'm curious as to how this model performs on the test data even when the model parameters are not at their best possible values. So I'm going to call predict on x_test, and I get the predicted values in the variable y_pred. Let's set up a DataFrame with the test and the predicted values, and let's see a sample of actual labels versus the predicted values from our model. You can see that there are many matches and there are many predictions that are wrong as well. In order to objectively evaluate this model, we need something like accuracy. Let's calculate the accuracy on the test data, and you can see that it's 92%. This is an accuracy across 20 categories, which means this is a great score. With just 50 iterations of training, we built a very good model.

Now for self study, I suggest you make two specific changes and see how the performance of your model changes. Make a deeper neural network with many more layers, and train for longer. Train for 1,000 iterations and see if your model improves. You're likely to discover that it will.
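
A minimal sketch of this model configuration is shown below, trained on a handful of made-up posts rather than the newsgroup data, so the accuracy it prints says nothing about the 92% reported in the demo. The posts, labels, and `random_state` value are assumptions for illustration; the hyperparameters mirror the ones described above (one hidden layer of 32 neurons, ReLU, Adam, 50 iterations).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.neural_network import MLPClassifier

# Made-up posts with made-up labels: 0 = space, 1 = sport.
train_posts = [
    "the shuttle launch reached orbit",
    "nasa probe sends images from mars orbit",
    "the goalie saved the hockey game",
    "the team won the playoff game in overtime",
    "rocket engine test for the moon mission",
    "the striker scored twice in the final game",
]
train_labels = [0, 0, 1, 1, 0, 1]
test_posts = ["the probe reached orbit", "the team won the game"]
test_labels = [0, 1]

# Fit the vocabulary on the training text only, then reuse it for the test text.
vectorizer = TfidfVectorizer(stop_words="english")
x_train = vectorizer.fit_transform(train_posts)
x_test = vectorizer.transform(test_posts)

# One hidden layer with 32 neurons, ReLU activation, Adam optimizer, 50 iterations.
mlp_clf = MLPClassifier(
    hidden_layer_sizes=(32,),
    activation="relu",
    solver="adam",
    max_iter=50,
    random_state=0,
)
mlp_clf.fit(x_train, train_labels)   # may emit a ConvergenceWarning, as discussed above

y_pred = mlp_clf.predict(x_test)
accuracy = accuracy_score(test_labels, y_pred)
print(accuracy)
```

Raising `max_iter` or adding entries to `hidden_layer_sizes` is how you would try the two self-study changes suggested above.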

## Encoding Images in Numeric Form

Machine-learning algorithms work with images as well. Image classification and object detection are examples of models that work with images, and you need to be able to represent these images in numeric form. An image is something that you can view in two dimensions, but images are essentially just matrices as far as their representations are concerned. The basic building block of any image is a pixel, and a pixel can be a grayscale pixel or a color pixel.

Let's talk about RGB, or color images, where pixels are represented using red, green, and blue values. A pixel in an RGB image is represented using three values, one for red, another for green, and another for blue, and each of these values is a number between 0 and 255. A red pixel will have an R value of 255, with G and B equal to 0. A green pixel will have a G value of 255, with R and B equal to 0. And a blue pixel will have a B value of 255, with R and G equal to 0. A single pixel in an RGB image needs three values to represent color. This is a three-channel image, also referred to as a multi-channel image.

The image that you're working with can also be a grayscale image, where the value of a pixel just represents intensity or shades of gray. Each pixel represents only intensity information, and it's represented using a value between 0 and 1. A very bright pixel might have an intensity of 1. A dark or a black pixel might have an intensity of 0. So you have one value to represent intensity. This is a 1-channel or single-channel image.

Let's move on from pixels to images. Every image, whether it's a grayscale image or an RGB multi-channel image, can be represented by a three-dimensional matrix. Let's consider the 6x6 images that you see here on screen. For a grayscale image, the dimensions of this matrix will be (6, 6, 1). Observe that the last dimension has size 1, corresponding to the 1 value used to represent pixel intensity. The matrix representing the same 6x6 image in color will have shape (6, 6, 3). The last dimension is of size 3, corresponding to the three channels for the RGB values used to represent a pixel.

When you feed image data into machine-learning models, you don't feed in a single image at a time. You typically work with a list of images, which is essentially just a four-dimensional matrix. The very last dimension refers to the number of channels. The dimensions at the center here are the height and width of each image in this list. You can have a list of images only if all of the images in that list are of the same size. And finally, the first dimension of this 4D matrix representing a list of images is the batch size, the number of images in the list.
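
The shapes described above can be reproduced with a few NumPy arrays. This is only a sketch with blank 6x6 images; the variable names are my own.

```python
import numpy as np

# A 6x6 grayscale image: one intensity value in [0, 1] per pixel.
gray_image = np.zeros((6, 6, 1), dtype=np.float32)
gray_image[0, 0, 0] = 1.0            # a very bright pixel

# A 6x6 RGB image: three values in [0, 255] per pixel.
rgb_image = np.zeros((6, 6, 3), dtype=np.uint8)
rgb_image[0, 0] = [255, 0, 0]        # a red pixel
rgb_image[0, 1] = [0, 255, 0]        # a green pixel
rgb_image[0, 2] = [0, 0, 255]        # a blue pixel

# A list of four same-sized RGB images stacked into one 4D array:
# (batch size, height, width, channels).
batch = np.stack([rgb_image, rgb_image, rgb_image, rgb_image])

print(gray_image.shape)
print(rgb_image.shape)
print(batch.shape)
```

`np.stack` raises an error if the images differ in shape, which reflects the requirement that all images in the list be the same size.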

## Loading and Visualizing the Lego Bricks Image Dataset

In this demo, we'll see how we can perform classification on image data using scikit-learn's MLPClassifier. Let's start with a new Python notebook. In order to work with images, I'm going to install an additional library here, opencv-python. OpenCV is a very popular library used for image processing, and the Python API is available here in opencv-python. Once you have it on your local machine, let's go ahead and set up the import statements for the libraries that we'll use in this demo. Notice that we import cv2. That is the module for the OpenCV library.

We'll train our neural network classifier on the Lego dataset. This is made up of Lego brick images, and the original source of this data is here at Kaggle. This is a fairly large dataset of different kinds of Lego brick images. Let's take a look at how this image data is organized. Look under datasets/Lego/train, and you'll find that there are a number of subfolders in there. Each of these subfolders contains a number of different images for that image type. An image type can be a 2x2 brick, a 1x1 brick, a flat tile, and so on.

I'm going to define a helper function here called load_images to load this image data into this program. This will take in the path where the image data is located, then populate and return the training images and the training labels in the form of lists. The images are located in subfolders under this dataset path. We'll iterate through every subfolder, and we'll print out the full path to screen so that we can see the progress of this load. If the current path is not a directory, we'll just continue. All of the images are located in subdirectories. We'll get the entire list of images located under each subfolder by calling os.listdir on the full path to that subfolder. I'll now set up a second for loop nested within the outer one in order to iterate over every image file within the subfolder. We'll get the full path to this particular file and then check whether this path is a directory or not. If it's a directory, we'll just continue. We expect image files to be under the subfolder, not directories. I'll now use the OpenCV library to read in the contents of this image file and store it in the image variable, and we'll append this image to the list of training images. The image file is present under these subfolders that you see displayed off to the right of your screen, and the name of the subfolder is the category or type of the image. So the label corresponding to this particular image, which we'll append to labels_train, is basically the name of the subfolder, whether it's a Cross axle Lego brick, a plate 1x1, and so on. Once both of these for loops are complete, all of the images have been loaded into two lists, which we'll then return in the form of NumPy arrays.

With this helper function set up, we are now ready to load in the image data. Invoke the load_images function and pass in the dataset path. You'll find that this code takes a minute or two to execute before all of the image files are loaded in. This is a fairly large dataset. Once you have everything loaded into your program, let's take a look at the shape of the training data. We have a total of 6,379 images, that is the batch_size. Every image is 200 pixels by 200 pixels, and each of these is a color image with three channels. These three channels represent the RGB values for each pixel. You can access any one of these images and view the shape of the image matrix. You can see it's a 200 by 200 by 3 image. Let's take a look at the shape of the label data as well. There are 6,379 labels, 1 corresponding to each of the images in our data.

You can use Matplotlib to display one of these images and the corresponding label to screen. I've simply picked an image at index 10 at random, and you can see that this is a 1x1 brick, and the image label is displayed right there. Let's pick another image at random. This is one at index 11, and you'll see it's once again a brick 1x1. Since we read in images one subfolder at a time, images that belong to the same category are clustered together. Let's take a look at a different Lego block, this one at index 1000, and you can see that this is a different image. It's a plate 1x1.

## Building and Training a Classification Model on Image Data

We've loaded in the image data, but the current shape of the image data, where every image is a height by width by channels matrix, will not work when we try to feed this data into a fully-connected neural network, so we need to reshape the images to be in a form that will work with our model. We'll preserve the original batch dimension of the images, that is the number of images that we are working with. So after the reshape, the first dimension will continue to be the batch_size, or number of images, and the remaining dimensions will be flattened to form a 1D feature vector. Every image, instead of being represented using three dimensions, will be flattened to form a single-dimensional vector of length height multiplied by width multiplied by number of channels. If you now look at the shape of X, which contains our reshaped images, you can see that we have the 6,379 from earlier, but each image has been flattened to be a 1-dimensional vector. Two hundred multiplied by two hundred multiplied by three channels gives us 120,000 values for each image.

Before we feed our training data to our neural network, we need to label-encode the categories, that is, our target labels. Instantiate the LabelEncoder and call fit_transform on labels_train. This will convert the Lego brick categories, currently in string format, to numeric IDs. Let's take a look at the unique numeric IDs that the LabelEncoder generated. You can see that there are 16 categories of Lego bricks, starting from 0 all the way through to 15. Let's split our Lego brick images into a training subset and a test subset. The train_test_split function, by default, will shuffle your data. We've also explicitly set shuffle to True so that all the images in the same category will not appear together when we feed them into our model. After the split, we'll use about 5,000 images to train our neural network, and the remaining 1,000 or so images we'll use to evaluate the network that we built.
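
The reshape, label-encoding, and split steps can be sketched as follows. To keep the example self-contained it uses twelve random 8x8 color images and three made-up category names in place of the 6,379 Lego images and 16 categories; only the mechanics carry over.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Synthetic stand-in for the loaded data: 12 random 8x8 color images.
rng = np.random.default_rng(0)
images_train = rng.integers(0, 256, size=(12, 8, 8, 3), dtype=np.uint8)
# Made-up category names, grouped together as they would be after loading by folder.
labels_train = np.array(["brick 1x1"] * 4 + ["flat tile"] * 4 + ["plate 1x1"] * 4)

# Keep the batch dimension and flatten height * width * channels into one vector.
X = images_train.reshape(images_train.shape[0], -1)
print(X.shape)   # (12, 8 * 8 * 3)

# Convert the string categories to numeric IDs.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(labels_train)
print(np.unique(y))

# Shuffle while splitting so that one category does not arrive in a single block.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0
)
```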
Now that we have created feature vectors from our image data, building and training a neural network in scikit-learn is straightforward. We simply use the MLPClassifier and specify the hyperparameters of our network. Observe that this is a much deeper neural network than any that we've used so far. There are three hidden layers, and each layer contains 100 neurons. I've chosen to use ReLU activation for all of these layers, and the optimization algorithm that we've chosen here is the Adam optimizer. If your training data runs into thousands of records, the Adam optimizer is what you should go with. And I'm going to run training for 100 iterations. Let's see how this neural network does with image data. Call fit on the training data, and wait for a while. It'll take a while for this model to train. Image data tends to be heavyweight, remember? Every image has 120,000 features, one for each pixel value.

I wait for training to complete, and at the end, I find an interesting little message. Observe that training ran for just 43 iterations. The message here says that training for the neural network stopped early because the training loss did not improve by more than the specified threshold. The threshold here of 0.0001 is the default tolerance of the neural network. As this classifier trained, the cross-entropy loss that was calculated did not improve by more than this tolerance factor, and that's why the training of this model stopped early. These are the final weights of the model; the model has converged early. Now this is interesting because our model on the text data did not converge after 50 iterations, whereas this model converged early. Let's use this model for prediction on the test data and see how it performs on the Lego images. Let's calculate the accuracy of this model, and you'll see it's 88%, that is 88% across 16 different categories of images. That's pretty good.
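
Below is a small sketch of the deeper network described above, trained on synthetic feature vectors rather than the Lego images, so its accuracy and iteration count will not match the 88% and 43 iterations from the demo. The synthetic data and `random_state` are assumptions; `tol=1e-4` is written out explicitly only to make the default tolerance visible.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for flattened images: 90 samples, 48 features, 3 classes.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 30)
X = rng.random((90, 48)) + y[:, None]   # shift each class so they are separable

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0
)

# Three hidden layers of 100 neurons each, ReLU activation, Adam optimizer.
mlp_clf = MLPClassifier(
    hidden_layer_sizes=(100, 100, 100),
    activation="relu",
    solver="adam",
    max_iter=100,
    tol=1e-4,          # the default tolerance, shown explicitly
    random_state=0,
)
mlp_clf.fit(x_train, y_train)

# n_iter_ reports how many iterations actually ran; it is below max_iter
# when training stops early because the loss stopped improving by more than tol.
print(mlp_clf.n_iter_)

accuracy = accuracy_score(y_test, mlp_clf.predict(x_test))
print(accuracy)
```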

## MODULE -4

### Implementing Dimensionality Reduction Using Restricted Boltzmann Machines in scikit-learn

## Restricted Boltzmann Machines for Dimensionality Reduction

Restricted Boltzmann machines, or RBMs, are shallow neural networks that use unsupervised learning techniques to learn latent factors that exist in your data, which is why restricted Boltzmann machines can be used to perform dimensionality reduction. An RBM performs dimensionality reduction using a shallow neural network. This is a neural network with just two layers. The first layer here, at the left of the screen in blue, is the visible layer. This is the layer into which you'll feed your input data. So the number of neurons in this visible layer should be equal to the size of your input. The dimensionality should match the dimensionality of your input. The second layer of the restricted Boltzmann machine is this hidden layer. The hidden layer has a number of neurons that is less than the dimensionality of the input. This is what learns latent features that exist in your data. When you work with RBMs in scikit-learn, you'll be able to specify the number of units, or neurons, in the hidden layer. You'll configure the number of hidden units to be equal to the low-dimensionality output that you want.

Any input that you pass into this RBM is fed into the first visible layer, also referred to as the input layer. Now I've shown the input going in only to one neuron in the visible layer. That's not generally the case. The input would be fed in across all of the neurons. Every neuron in this visible layer will be connected only to neurons in the other layer, so there is no intra-layer connectivity of neurons, and this is the restriction in restricted Boltzmann machines. In an RBM, every neuron in the first visible layer will be connected to every neuron in the hidden layer, but there is no intra-layer communication at all. This is what makes this Boltzmann machine a restricted Boltzmann machine.

Now all of these neurons, or nodes, perform a mathematical operation on the input, so each node processes the input that it receives, and it then makes stochastic decisions about whether to transmit the input or not and how to transmit the input. The term stochastic here simply means randomly determined, and the decision depends on the weights of the interconnections between the layers. All of the interconnections between these two layers will be associated with weights, and this should not be surprising to you at all. This is how neural networks work. Interconnections between neurons are associated with weights, and a bias is added to the weighted input. The hidden layer is associated with an activation function: the weighted sum of the inputs to the hidden layer is passed through this activation function, and the final output that you'll receive from this RBM is the output of the activation function.

What I've essentially described so far is a forward pass through this shallow neural network, but how do RBMs learn latent features in your data? We'll get to that in just a bit. Let's first make sure that you understand this architecture completely. Observe that all inputs from all visible nodes are fed to the hidden nodes in your RBM, and what you see here is a computation graph that is a symmetric bipartite graph. It is a symmetric bipartite graph because RBMs do not allow intra-layer neuron communications. A bipartite graph is one whose vertices can be divided into two disjoint and independent sets, U and V, which correspond to the two layers of our RBM, such that each edge connects a vertex in U to a vertex in V. So vertices in U are not connected to each other, and vertices in V are not connected to each other.

What we've discussed so far is the forward pass that the RBM makes on our input data. Inputs have been fed into the first visible layer. The weighted sum of the inputs is fed into the hidden layer, where it passes through an activation function. How do RBMs learn latent factors in your data? Well, they try to reconstruct the data by themselves in an unsupervised manner, and they do this by taking the output generated in the forward pass and sending this output back through the RBM in the reverse direction. The output of the forward pass is the output of the activation function, so these activations are then reversed and sent back through the RBM. And this backward pass that works on the activated output tries to reconstruct the input. If you remember, the interconnections between these two layers are associated with weights, and these weights are applied in the backward pass as well. And in the backward pass, the RBM is trained. The weights of the RBM are adjusted to improve the reconstruction of the input. The optimizer will tweak the weights of these interconnections so that the reconstructed input is as close to the original input as possible. And really, that's it. This is how unsupervised training of RBMs works. Multiple forward passes are made on the input, and backward passes improve the reconstruction of the input.

After these multiple forward and backward passes are done, the final lower-dimensionality output of the hidden layer represents the latent features of our data. And once you have these lower-dimensionality representations of your data, you can feed them into machine-learning models. RBMs can be used for dimensionality reduction as a preprocessing technique on your input data before you use this lower-dimensionality data to build and train classifiers or regressors.

As you can see, the idea behind RBMs is very, very interesting; however, RBMs are an older concept and have been replaced with newer models such as autoencoders, which find latent features in your data. An autoencoder is a feedforward neural network that has a characteristic sandwich-like appearance. So you have the input layer, then hidden layers, and then the output layer. And the idea is that the hidden layer is of a lower dimensionality, and it is this hidden layer that learns latent features in your data. Autoencoders work on the same principle as restricted Boltzmann machines. They try to reconstruct the input at the output.
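
To make the forward and backward passes more concrete, here is a small NumPy sketch of an RBM with a sigmoid activation. The description above does not spell out the exact weight update, so the sketch uses one-step contrastive divergence (CD-1), a common training rule for RBMs; treat that choice, the class name `TinyRBM`, and the toy binary data as assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class TinyRBM:
    """A minimal RBM: one visible layer, one hidden layer, shared weights."""

    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = self.rng.normal(0.0, 0.01, size=(n_visible, n_hidden))
        self.b_visible = np.zeros(n_visible)
        self.b_hidden = np.zeros(n_hidden)

    def hidden_probs(self, v):
        # Forward pass: weighted sum of the inputs plus bias, then activation.
        return sigmoid(v @ self.W + self.b_hidden)

    def visible_probs(self, h):
        # Backward pass: the same weights, transposed, reconstruct the input.
        return sigmoid(h @ self.W.T + self.b_visible)

    def train_step(self, v0, learning_rate=0.1):
        """One CD-1 update on a batch; returns the mean reconstruction error."""
        h0_prob = self.hidden_probs(v0)
        # Stochastic neurons: each hidden unit turns on with probability h0_prob.
        h0 = (self.rng.random(h0_prob.shape) < h0_prob).astype(float)
        v1_prob = self.visible_probs(h0)       # reconstruction of the input
        h1_prob = self.hidden_probs(v1_prob)

        n = v0.shape[0]
        self.W += learning_rate * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / n
        self.b_visible += learning_rate * (v0 - v1_prob).mean(axis=0)
        self.b_hidden += learning_rate * (h0_prob - h1_prob).mean(axis=0)
        return float(np.mean((v0 - v1_prob) ** 2))


# Toy binary data with 6 visible features, reduced to 2 hidden units.
data = np.array(
    [[1, 1, 1, 0, 0, 0], [1, 0, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 0, 1]] * 5,
    dtype=float,
)
rbm = TinyRBM(n_visible=6, n_hidden=2)
errors = [rbm.train_step(data) for _ in range(200)]
latent = rbm.hidden_probs(data)   # the lower-dimensionality representation
print(latent.shape)
```

The `latent` array plays the role of the reduced-dimensionality output described above: one row per input record, one column per hidden unit.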

## A Brief History of Restricted Boltzmann Machines

Now that we understand how restricted Boltzmann machines work, let's take a look at a brief history of how RBMs came to be. Now neural networks, as a concept, have been around for a while. It's only been in recent years that we've been able to build and train them as scale. Hopfield networks were introduced in 1974. These were an early form of recurrent neural networks where neurons had memory or state. Now these tended to be quite inefficient. They were huge networks and hard to train. Boltzmann machines originated from these Hopfield networks in the year around 1985. Boltzmann machines were defined as a stochastic Hopfield network with hidden units. Once again, these were not possible to train efficiently, which is why Boltzmann machines were restricted. Restricted Boltzmann machines were originally developed in 1986, but became quite popular in the mid-2000s. The idea behind RBMs were to impose constraints on Boltzmann machines to ease the training process. No inter-layer communication RBMs allow only inter-layer communication between neurons. Technology evolves fast and RBMs are not really used much today. deep belief networks introduced around 2009 compose, or stack, RBMs and autoencoder layers. Autoencoders, as we've discussed, reconstruct the input at the output and those are generative in nature. There are also newer machine-learning models such as GANs, or generative adversarial networks, which are very popular today and used to generate realistic images and videos. And these are based on similar generative principles as the original RBMs and autoencoders. Now that we've understood the evolution of deep belief nets, let's quickly discuss Boltzmann machines, the precursor to RBMs. Boltzmann machines were fully-connected neural networks, which had visible layers, as well as hidden layers. They were built using a special type of neuron known as the stochastic neuron. 
Stochastic neurons help make decisions about how the input is passed through to the output, and the output of Boltzmann machines was probabilistic rather than deterministic. So the neurons output a value of 1 or 0 with specific probabilities. Boltzmann machines are so called because they rely on a mathematical concept called the Boltzmann probability distribution. Boltzmann machines use the Boltzmann probability distribution to model the distribution of the input data at the output, and this is what helps Boltzmann machines reconstruct the input at the output. You can see that the basic principle behind these machines, which date back 30 or even 40 years, was essentially the same. In practice, Boltzmann machines were found to be very hard to train efficiently, which is why tweaks were proposed to the original architecture to enable practical use. The tweaks impose restrictions on the architecture of the Boltzmann machine to give us restricted Boltzmann machines. Now that we've studied RBMs, I'll quickly summarize what we learned here. RBMs are made up of a visible layer and a hidden layer, and there are no connections allowed between two visible neurons or between two hidden neurons. Connections are only allowed between visible and hidden neurons, thus the RBM neural network can be represented as a symmetric bipartite graph. The advantage of restricted Boltzmann machines is that they can be trained efficiently, and the training algorithm is very similar to backpropagation for regular neural networks. They use something called Gibbs sampling, which is a Monte Carlo-based technique to generate sample sequences. The contrastive divergence algorithm trains the model's parameters, and a technique called pseudolikelihood is used to approximate the likelihood.
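
To make the summary a little more concrete, here is a minimal NumPy sketch of stochastic neurons and a single contrastive divergence step with one round of Gibbs sampling. It is only an illustration under assumptions: the function names, the toy 6-pixel data, the learning rate, and the layer sizes are all made up here, and this is not how scikit-learn implements its RBM internally.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_bernoulli(probs, rng):
    """Stochastic neuron: outputs 1 with probability p, otherwise 0."""
    return (rng.random(probs.shape) < probs).astype(float)

def cd1_step(v0, W, b_vis, b_hid, rng, lr=0.1):
    """One contrastive divergence (CD-1) update on a batch of binary rows v0."""
    # Visible -> hidden: the only connections are between the two layers.
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = sample_bernoulli(p_h0, rng)
    # Hidden -> visible: one Gibbs step that tries to reconstruct the input.
    p_v1 = sigmoid(h0 @ W.T + b_vis)
    v1 = sample_bernoulli(p_v1, rng)
    p_h1 = sigmoid(v1 @ W + b_hid)
    # Move toward the data statistics and away from the reconstruction statistics.
    n = v0.shape[0]
    W = W + lr * (v0.T @ p_h0 - v1.T @ p_h1) / n
    b_vis = b_vis + lr * (v0 - v1).mean(axis=0)
    b_hid = b_hid + lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_vis, b_hid, p_v1

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 3  # hypothetical toy sizes
W = rng.normal(0, 0.01, size=(n_visible, n_hidden))
b_vis = np.zeros(n_visible)
b_hid = np.zeros(n_hidden)

# Two made-up binary patterns standing in for input data.
batch = np.array([[1, 1, 1, 0, 0, 0],
                  [0, 0, 0, 1, 1, 1]], dtype=float)

for _ in range(200):
    W, b_vis, b_hid, recon = cd1_step(batch, W, b_vis, b_hid, rng)

print(np.round(recon, 2))  # reconstruction probabilities for each visible unit
```

Notice that the weight matrix only ever connects a visible unit to a hidden unit, which is the bipartite restriction we just described.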

## Training a Classifier on All Features of the Input Data

In this demo, we'll see how we can reduce the dimensionality of the input data that we feed into our machine-learning model, and we'll perform this dimensionality reduction using a Bernoulli restricted Boltzmann machine. An RBM is a two-layer shallow neural network that tries to reconstruct the input data, and while doing so, learns latent features of reduced dimensionality in the input data. Just like with other neural networks, the RBM in scikit-learn is implemented using a high-level estimator object. In this demo, we'll build and train a logistic regression classifier to work on the MNIST dataset of handwritten digit images. We'll build and train a simple classification model, observe its accuracy, and we'll compare this model to one where we preprocessed the data to reduce dimensionality using the Bernoulli restricted Boltzmann machine. As a student of machine learning, you've probably used the MNIST dataset before. This is a dataset of handwritten digit images where every image is 28 pixels by 28 pixels, and all of the images are grayscale images. The original source of this dataset is this URL that you see here on screen. I have this data in the form of a CSV file that I'll read into a pandas DataFrame. The first column in this DataFrame corresponds to the label, which is the numeric digit that the image represents. The remaining columns represent pixel values for the images. One record, or one row, corresponds to one image, and you can see that the total number of pixels is equal to 784. The columns range from 0 to 783. 784 is simply 28 multiplied by 28. Let's take a look at the shape of this data. You see that there are a total of 42,000 images in this CSV file. Each row corresponds to an image, and each row has a label, that is the numeric digit represented by the image, and the 784 pixel values of the image itself.
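
Here is a small sketch of the DataFrame layout just described. The course's CSV file is not reproduced in these notes, so this builds a random stand-in with the same shape of record: one label column followed by 784 pixel columns. The column names `label` and `pixel0` through `pixel783`, and the 100-row size, are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# With the real file, the DataFrame would come from pd.read_csv on the CSV path.
# Here we build a random stand-in with the same column layout (assumed names).
rng = np.random.default_rng(0)
n_images = 100  # the course file has 42,000 rows
pixels = rng.integers(0, 256, size=(n_images, 28 * 28))
labels = rng.integers(0, 10, size=n_images)

mnist_data = pd.DataFrame(pixels, columns=[f"pixel{i}" for i in range(28 * 28)])
mnist_data.insert(0, "label", labels)  # the label is the first column

# One row per image: 1 label column + 784 pixel columns.
print(mnist_data.shape)
```

With the real file, the first number of the shape would be 42,000 and the second would be 785, that is the label plus 784 pixels.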
If you want to see the unique numeric identifiers in the MNIST handwritten data, calling unique on the label column will show you that there are a total of 10 digits from 0 through 9. Let's set up the features that we'll use to train our classification model. The features are all of the pixel values of the individual images, that is all of the columns except for the very first column at index 0. And the first column contains the target labels for classification. We extract the label data into the mnist_labels variable. Before we use this data to train a classification model, let's see what an MNIST image looks like. For that, I'm going to set up this little helper function here called display_image, which takes as its input argument the index of the image to display from our dataset. We print out the label corresponding to the image and we use Matplotlib to display the image itself. We need to reshape the record, that is the single record in our DataFrame, to be 28 pixels by 28 pixels. That is a 2D image. Now that we have this helper function, let's take a look at the image at index position 2. You can see that it's an image of the handwritten digit 1, and the label that corresponds to this image is also 1. Let's take a look at one more image here. This is the image at index 200, and it's an image of the digit 0. The mnist_features DataFrame contains the features that we'll feed into our model to train it. There are 42,000 images, and each image has 784 pixels. All of these are grayscale images, so we have just one value representing each pixel, that is the pixel intensity, and you can see that pixel intensities are represented using integers from 0 to 255. I'm now going to scale all of the pixel intensity values in my MNIST dataset to be between 0 and 1. I simply divide by 255. If you sample the intensity values in mnist_features now, you'll see that all pixel intensities are now represented between 0 and 1 as decimals.
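
The steps above might look roughly like this. Again this is a sketch on random stand-in data rather than the course's CSV, so the displayed images would be noise. The `image_at` helper is a hypothetical addition that separates the reshape from the plotting, and Matplotlib is imported inside `display_image` so the rest of the sketch runs without it.

```python
import numpy as np
import pandas as pd

# Random stand-in for the course's DataFrame: label first, then 784 pixel columns.
rng = np.random.default_rng(0)
mnist_data = pd.DataFrame(rng.integers(0, 256, size=(50, 784)))
mnist_data.insert(0, "label", rng.integers(0, 10, size=50))

# Features are every column except the first; the first column holds the labels.
mnist_features = mnist_data.iloc[:, 1:]
mnist_labels = mnist_data.iloc[:, 0]

def image_at(index):
    """Reshape one flat 784-value record into a 28 x 28 image array."""
    return mnist_features.iloc[index].to_numpy().reshape(28, 28)

def display_image(index):
    """Print the label and show the image at the given index."""
    import matplotlib.pyplot as plt  # imported here so the sketch runs without it
    print("Label:", mnist_labels.iloc[index])
    plt.imshow(image_at(index), cmap="gray")
    plt.show()

# Scale the pixel intensities from the 0-255 range into the 0-1 range.
mnist_features = mnist_features / 255.0
print(mnist_features.to_numpy().min(), mnist_features.to_numpy().max())
```

Calling `display_image(2)` on the real data would print the label and draw the digit, as in the demo.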
I'll now use the train_test_split function in order to split our MNIST dataset into training data and the test data that we'll use to evaluate our model. We'll use a simple LogisticRegression classification model on this MNIST dataset, and we'll run this model for a maximum of 1,000 iterations. Because we're working with multi-class classification, where there are 10 output categories, I set multi_class to be multinomial. This uses the multinomial loss function to train this model behind the scenes, which works well with multi-class data. I'm going to fit this LogisticRegression model using regularization so it doesn't overfit on our training data. And in order to find the best value of this regularization parameter, I'm going to use GridSearch with cross validation. The regularization parameter in LogisticRegression is this parameter C. This is the inverse of the regularization strength. Smaller values of C indicate stronger regularization. Instantiate the GridSearchCV estimator object. The model that we want to train is our LogisticRegression model with different values of C, and we use two-fold cross validation. The training data will be divided into two parts. One part will be used to train the model, and the second part will be used to validate the models that are built, with the roles then swapped. We call grid_search.fit to start the training process. Based on the hyperparameters that we have specified, this will train three different models for different values of C, and best_params_ will print out the value of C that produced the best model, and it happens to be 0.1. A couple of things to observe here. First, we trained our LogisticRegression model without performing any dimensionality reduction on the MNIST dataset. We used all 784 pixels in every image. Second, stronger regularization, that is a smaller value of C, produced a better model. Let's rank all of the models, the three models that we built and trained, based on the mean test score.
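
As a rough sketch of this grid search, here is the same flow on scikit-learn's small bundled digits dataset (8 by 8 images), used as a stand-in because the course's CSV is not part of these notes. The three values of C, the test split size, and the random seed are assumptions; the notes only tell us that three values were tried and that 0.1 won. The sketch leaves out `multi_class`, since recent scikit-learn releases deprecate that argument and already use the multinomial loss for multi-class data with the default solver.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in data: bundled 8x8 digits, not the 28x28 MNIST CSV from the demo.
digits = load_digits()
features = digits.data / 16.0  # pixel values in this dataset run from 0 to 16
labels = digits.target

x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0  # assumed split
)

logistic = LogisticRegression(max_iter=1000)

# C is the inverse of regularization strength; these three values are assumed.
param_grid = {"C": [0.1, 1.0, 10.0]}
grid_search = GridSearchCV(logistic, param_grid, cv=2)  # two-fold cross validation
grid_search.fit(x_train, y_train)

print(grid_search.best_params_)

# Rank the three candidate models by their mean validation accuracy.
results = grid_search.cv_results_
for i in range(len(results["params"])):
    print(results["rank_test_score"][i],
          results["params"][i],
          round(results["mean_test_score"][i], 4))
```

The scores on this stand-in dataset will not match the numbers from the demo, since both the data and the candidate values of C differ.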
The test score, by default, is simply the accuracy of the model on the validation data. Run the for loop through the three models that were built and trained using GridSearch, and here are the three models along with their rank. You can see that the best model here had an accuracy of 91.51%.

## Dimensionality Reduction Using Restricted Boltzmann Machines

The size of a feature vector in the MNIST dataset is 784, but are all of those features equally important? Is it possible to get a better or maybe an equally good model using a feature vector of reduced dimensionality? That's exactly what we'll see here in this demo, and we'll perform dimensionality reduction using a Bernoulli restricted Boltzmann machine. Restricted Boltzmann machines are shallow neural networks, often with just two layers. This neural network uses unsupervised learning and tries to reconstruct the input that was passed into the first layer using the activations of the hidden layer. In this manner, it learns lower-dimensionality latent representations of your input data. Instantiate the BernoulliRBM estimator object and store it in the rbm variable, and we'll preprocess the data using a pipeline in scikit-learn. A pipeline in scikit-learn allows you to combine the data preprocessing steps and a final ML model all in one higher-level estimator abstraction. Here we preprocess the data and reduce its dimensionality using the restricted Boltzmann machine, and we feed this reduced-dimensionality data into our logistic regression model. The first stage of the pipeline performs dimensionality reduction, and the second stage, the actual classification. Note that this logistic regression classifier will be trained on input features with lower dimensionality. The learning rate of our RBM is set to 0.06. This is something you can change. And the LogisticRegression model will have the regularization parameter C equal to 0.1. Remember, this is what gave us the best model previously. There are a number of different types of parameters that you can tune for this pipeline as a whole. This includes the hyperparameters of the RBM, that is the restricted Boltzmann machine, as well as the hyperparameters for the LogisticRegression classifier. Observe the RBM-specific hyperparameters.
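
A minimal sketch of such a pipeline is shown here. The step names `rbm` and `logistic` and the random seed are assumptions made for illustration; the learning rate of 0.06 and C of 0.1 come from the demo. Nothing is fitted in this sketch, it only builds the pipeline and lists the parameters that belong to the RBM step.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

# Stage 1: the RBM reduces dimensionality. Stage 2: the classifier.
rbm = BernoulliRBM(learning_rate=0.06, verbose=True, random_state=0)
logistic = LogisticRegression(C=0.1, max_iter=1000)

# Step names "rbm" and "logistic" are assumed; they prefix the parameter names.
rbm_pipeline = Pipeline(steps=[("rbm", rbm), ("logistic", logistic)])

# Parameters of a pipeline step are addressed as <step name>__<parameter name>.
rbm_param_names = sorted(
    name for name in rbm_pipeline.get_params() if name.startswith("rbm__")
)
print(rbm_param_names)
```

The double-underscore names printed here are the same names you would use as keys when tuning the pipeline with a grid search.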
It has a batch size, a learning rate, the number of components, and the number of iterations for which you want to train. The number of components is the lower-dimensionality representation. This is something that you can specify, the number of latent features of lower dimensionality that you want this RBM to generate. I'm going to use GridSearch with cross validation in order to find the best hyperparameters for my restricted Boltzmann machine. In GridSearch, I'm going to tweak the number of components, that is the lower-dimensionality representation, to go from just five components all the way through to 150. I've specified two values for the number of iterations of training for this RBM, 5 iterations or 20 iterations. GridSearch will use this parameter grid to train the pipeline, which preprocesses the data, as well as fits the LogisticRegression model. This will allow us to find the best classifier trained on features of a reduced dimensionality. GridSearch will build and train eight different models, one for each hyperparameter combination. Let's start training by calling fit on the training data, and once training is complete, we print out the best_params_ for this pipeline. We had instantiated our RBM estimator object with verbose equal to True, so this way you'll see all of these details printed out during training. Now the training process for any model involves choosing values for the model parameters such that those values maximize the likelihood that training data looks the way it actually does. This process is called maximum likelihood estimation. Now there are different ways to approximate the likelihood, and one such technique is the pseudolikelihood. The pseudolikelihood is the approximation that this RBM reports as it searches for the best values of the model parameters. There are eight different models to train, lots of training data, and there's also preprocessing involved, so this might take a while to run.
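
Here is a hedged sketch of that grid search over the whole pipeline. It again uses scikit-learn's bundled 8 by 8 digits as a stand-in, and because those images have only 64 pixels, the component counts are scaled down from the demo's range of 5 to 150. The exact component values, the step names, the two-fold setting, and the split are all assumptions; the two iteration counts, 5 and 20, are from the demo.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

# Stand-in data: 64-pixel digits scaled to the 0-1 range the Bernoulli RBM expects.
digits = load_digits()
features = digits.data / 16.0
x_train, x_test, y_train, y_test = train_test_split(
    features, digits.target, test_size=0.2, random_state=0  # assumed split
)

rbm_pipeline = Pipeline(steps=[
    ("rbm", BernoulliRBM(learning_rate=0.06, random_state=0)),
    ("logistic", LogisticRegression(C=0.1, max_iter=1000)),
])

# 4 component counts x 2 iteration counts = 8 candidate pipelines.
param_grid = {
    "rbm__n_components": [5, 16, 32, 48],  # assumed, scaled down for 64 pixels
    "rbm__n_iter": [5, 20],
}
grid_search = GridSearchCV(rbm_pipeline, param_grid, cv=2)  # cv=2 is assumed
grid_search.fit(x_train, y_train)

print(grid_search.best_params_)

# Mean validation accuracy for each of the eight combinations.
cv_results = grid_search.cv_results_
for params, score in zip(cv_results["params"], cv_results["mean_test_score"]):
    print(params, round(score, 4))
```

Each candidate refits both stages, so the RBM is retrained for every combination before the classifier sees the reduced features.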
And at the very end, the best parameters for your pipeline will be printed out to screen. The best classifier model was found to use 150 training features as its input. This is dimensionality reduced from 784, remember that, and the RBM used just 5 iterations to train. I'm going to run a for loop to iterate through all eight models that were built and trained using GridSearch, and we'll print out the mean test scores to see how the different models performed. When we reduce dimensionality from 784 down to just 5 components in our MNIST data, the classifier models performed really poorly. The accuracy was around 11% or 15%. But with just 100 or 150 components, the classifier model reached accuracy levels comparable to that of the model that was trained on all 784 input features. You can see that the best model had an accuracy of 93.57%, which is greater than the roughly 91% accuracy that we got when we used all 784 features to train our classifier.

## Summary and Further Study
And with this demo on preprocessing data using restricted Boltzmann machines to build a classifier model, we come to the very end of this module. We started this module off by understanding how restricted Boltzmann machines work for dimensionality reduction. We saw that an RBM is a shallow two-layer neural network where the input layer has dimensionality equal to that of our input data and the hidden layer has lower dimensionality, and it is this hidden layer that helps it perform dimensionality reduction. We saw that the training of an RBM works by passing the inputs forwards and backwards through the layers of an RBM until it learns latent features in your data by trying to reconstruct the input. We then performed a hands-on demo using RBMs in scikit-learn as a preprocessing step for our classification model. And this brings us to the very end of this course on Building Neural Network Models with scikit-learn. Now if you're interested in the scikit-learn library, here are some other resources for you on Pluralsight. Building Clustering Models with scikit-learn will explore the different clustering algorithms available as estimators in scikit-learn. Or if you're interested in ensemble learning techniques, Employing Ensemble Methods with scikit-learn is the course for you. That's it from me here today. Thank you for listening.