# Building Neural Networks

LIN 371 :: UT Austin

Jessy Li

Most of the material adapted from https://realpython.com/python-keras-text-classification/

**NOTE**: the original post has an **error** in that they used the test data as validation data during training. In practice, ALWAYS use a separate validation set! Validation data is used to tune the model with respect to some parameters, as we will see below.

In this tutorial we'll explore:
* Using word embeddings in Keras
* Using RNN (e.g., LSTM) in Keras
* Applying Dropout regularization

## Data
We will again use sentiment analysis as an example.

Go ahead and download the data set from [the Sentiment Labelled Sentences Data Set](https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences) from the UCI Machine Learning Repository.

By the way, this repository is a wonderful source for machine learning data sets when you want to try out some algorithms. This data set includes labeled reviews from IMDb, Amazon, and Yelp. Each review is marked with a score of 0 for a negative sentiment or 1 for a positive sentiment.

Extract the folder into the current directory (if using Colab: upload to your Google Drive) and go ahead and load the data with Pandas:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pandas as pd

filepath_dict = {'yelp':   '/content/drive/My Drive/LIN371/sentiment-labelled-sentences/yelp_labelled.txt',
                 'amazon': '/content/drive/My Drive/LIN371/sentiment-labelled-sentences/amazon_cells_labelled.txt',
                 'imdb':   '/content/drive/My Drive/LIN371/sentiment-labelled-sentences/imdb_labelled.txt'}

df_list = []
for source, filepath in filepath_dict.items():
    df = pd.read_csv(filepath, names=['sentence', 'label'], sep='\t')
    df['source'] = source  # Add another column filled with the source name
    df_list.append(df)

df = pd.concat(df_list)
print(df.head())

                                            sentence  label source
0                           Wow... Loved this place.      1   yelp
1                                 Crust is not good.      0   yelp
2          Not tasty and the texture was just nasty.      0   yelp
3  Stopped by during the late May bank holiday of...      1   yelp
4  The selection on the menu was great and so wer...      1   yelp


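Now split the data into **train/validation/test**. Usually, with a moderately-sized dataset like this one, a rough 80-20 split between training and held-out data is good. Here we further split the held-out 20% evenly into validation and test sets, giving roughly 80/10/10 overall.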

In [None]:
from sklearn.model_selection import train_test_split

sentences = df['sentence'].values
y = df['label'].values
# First split 20% of the data into testing and validation
sentences_train, sentences_test_val, y_train, y_test_val = train_test_split(sentences, y, test_size=0.2, random_state=1000)
# Then split 50% of the test+val data as validation
sentences_test, sentences_val, y_test, y_val = train_test_split(sentences_test_val, y_test_val, test_size=0.5, random_state=1000)

print(y_train.shape)
print(y_val.shape)
print(y_test.shape)

(2198,)
(275,)
(275,)
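The sizes printed above follow from the two `test_size` fractions. As a quick sanity check, the arithmetic can be sketched without scikit-learn. The total of 2748 sentences is inferred from the printed shapes, and the rounding rule (held-out size rounded up) is an assumption of this sketch that happens to reproduce those numbers.

```python
import math

n_total = 2198 + 275 + 275  # total inferred from the shapes printed above

# First split: hold out 20% for validation + test (assumed to round up)
n_held_out = math.ceil(0.2 * n_total)
n_train = n_total - n_held_out

# Second split: half of the held-out data becomes validation, the rest test
n_val = math.ceil(0.5 * n_held_out)
n_test = n_held_out - n_val

print(n_train, n_val, n_test)
print("fractions:", round(n_train / n_total, 2),
      round(n_val / n_total, 2), round(n_test / n_total, 2))
```

So the model is trained on about 80% of the sentences, tuned on about 10%, and evaluated once on the remaining 10%.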


Transform our sentences into bag-of-words count vectors using `CountVectorizer`:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(sentences_train) #in the past we did fit_transform()
X_train = vectorizer.transform(sentences_train)
X_test  = vectorizer.transform(sentences_test)
X_val = vectorizer.transform(sentences_val)


In [None]:
print(X_train.shape)

(2198, 4642)
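To make the shape above concrete, here is a minimal pure-Python sketch of the bag-of-words idea: every sentence becomes a vector with one slot per vocabulary word. It is not how `CountVectorizer` is implemented (the toy sentences and the whitespace tokenization are assumptions for illustration), but the resulting layout is the same.

```python
from collections import Counter

# Toy corpus (hypothetical sentences, not taken from the data set)
toy_sentences = ["the food was good", "the service was not good", "good good food"]

# Vocabulary: every distinct word, sorted so that each word gets a fixed column
vocab = sorted({word for s in toy_sentences for word in s.split()})

def to_count_vector(sentence):
    """Return a vector of word counts with one entry per vocabulary word."""
    counts = Counter(sentence.split())
    # Words outside the vocabulary have no column and are therefore ignored
    return [counts.get(word, 0) for word in vocab]

vectors = [to_count_vector(s) for s in toy_sentences]
print(vocab)
for v in vectors:
    print(v)
```

Every vector has the same length as the vocabulary, which is why `X_train` has 4642 columns: one per distinct word seen in the training sentences.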


## Building a logistic regression benchmark

In [None]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()
classifier.fit(X_train, y_train)
score = classifier.score(X_test, y_test)
print("Accuracy:", score)

Accuracy: 0.8472727272727273


## Keras basics

[Keras](https://keras.io/) is a popular, high-level deep learning and neural networks API by [François Chollet](https://twitter.com/fchollet) which is capable of running on top of [Tensorflow](https://www.tensorflow.org/) (Google).

To quote the wonderful book by François Chollet, Deep Learning with Python:


_Keras is a model-level library, providing high-level building blocks for developing deep-learning models. It doesn’t handle low-level operations such as tensor manipulation and differentiation. Instead, it relies on a specialized, well-optimized tensor library to do so, serving as the backend engine of Keras_


It is a great way to start experimenting with neural networks without having to implement every layer and piece on your own. For example Tensorflow is a great machine learning library, but you have to implement a lot of boilerplate code to have a model running.

**Note**: a very, very popular deep learning framework is [PyTorch](https://pytorch.org/); it is extremely powerful but also less _high level_ than Keras. In class, we work with Keras because it is very, very easy to understand.

### Installing Keras

**You do not need to do this if you're using COLAB**

Two ways (among many) to install:
* You can install Keras using the Anaconda Navigator; search for "keras". This **will not** install Tensorflow as your backend, so you must install it separately yourself, in the same interface.
* You can also install it using pip, but you will need to install the backend, e.g., Tensorflow, yourself:
```
pip install tensorflow
pip install keras
```

### Your First Keras Model

The most convenient way to think of a Keras model is a stack of layers; in Keras this is handled by [the Sequential Model API](https://keras.io/models/sequential/).

The Sequential model is a linear stack of layers, where you can use the large variety of available layers in Keras. The most common layer is the `Dense` layer which is your regular densely connected neural network layer with all the weights and biases that you are already familiar with.

Before we build our model, we need to know the input dimension of our feature vectors. This happens only in the first layer since the following layers can do automatic shape inference. In order to build the Sequential model, you can add layers one by one in order as follows:

In [None]:
import os
from keras.models import Sequential
from keras import layers

#TODO
input_dim = X_train.shape[1]
model = Sequential()
model.add(layers.Dense(10, input_dim = input_dim, activation = "relu"))
model.add(layers.Dense(1, activation = "sigmoid"))

Before you can start with the training of the model, you need to configure the learning process. This is done with the `.compile()` method. This method specifies the optimizer and the loss function.

Additionally, you can add a list of metrics which can be later used for evaluation, but they do not influence the training. In this case, we want to use the binary cross entropy and the Adam optimizer (a popular method that's often used instead of SGD).
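For intuition, the binary cross-entropy of a single example with label `y` and predicted probability `p` is `-(y*log(p) + (1-y)*log(1-p))`. The sketch below computes it by hand for a few made-up predictions. The function names and numbers are illustrative only, and real implementations typically also guard against `log(0)`.

```python
import math

def sigmoid(z):
    """Squash a real-valued score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, p_pred):
    """Average loss over examples; low when confident predictions are correct."""
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for y, p in zip(y_true, p_pred)]
    return sum(losses) / len(losses)

labels = [1, 0, 1]
good_probs = [sigmoid(3.0), sigmoid(-3.0), sigmoid(2.0)]  # confident and correct
bad_probs = [sigmoid(-3.0), sigmoid(3.0), sigmoid(-2.0)]  # confident and wrong

print(binary_cross_entropy(labels, good_probs))
print(binary_cross_entropy(labels, bad_probs))
```

The loss is small when the model assigns high probability to the correct label and grows quickly when it is confidently wrong, which is the signal the optimizer uses to update the weights.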

Keras also includes a handy `.summary()` function to give an overview of the model and the number of parameters available for training:

In [None]:
model.compile(loss = "binary_crossentropy", optimizer = "adam", metrics = ["accuracy"])
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_2 (Dense)             (None, 10)                46430     
                                                                 
 dense_3 (Dense)             (None, 1)                 11        
                                                                 
Total params: 46441 (181.41 KB)
Trainable params: 46441 (181.41 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


Note the `Param #` column in the table: it gives the number of trainable parameters (weights and biases) in each layer. It is always a good idea to check the dimensions!
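The counts in the summary can be reproduced by hand: a `Dense` layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. A small check, using the input dimension of 4642 printed earlier (the helper name is made up for this sketch):

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

input_dim = 4642  # number of columns in X_train, printed earlier

hidden = dense_params(input_dim, 10)  # first Dense layer
output = dense_params(10, 1)          # sigmoid output layer

print(hidden, output, hidden + output)
```

These values agree with the 46430, 11, and 46441 reported by `model.summary()`.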

Training a neural network is an iterative process: it does not stop on its own, so you have to specify how many iterations you want the model to train for. Each **completed pass over the training data** is commonly called an epoch. We will run it for 10 epochs to see how the training loss and accuracy change after each epoch.

Another parameter you have to choose is the batch size. The batch size determines how many samples are used in **one forward/backward pass** (think of it as the number of examples in each iteration of Stochastic Gradient Descent). A larger batch size speeds up the computation because each epoch needs fewer weight updates, but it also needs more memory, and the model may degrade with larger batch sizes. Since we have a small training set, we can keep the batch size low:

In [None]:
#TODO
model.fit(X_train, y_train,
          epochs = 10,
          batch_size = 16,
          validation_data = (X_val, y_val))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x7e173edc3a00>
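To see how epochs and batch size interact, here is the arithmetic for the call above, using the 2198 training examples reported earlier. This is a sketch of the bookkeeping only, not of anything Keras exposes.

```python
import math

n_train = 2198   # number of training examples (from y_train.shape)
batch_size = 16
epochs = 10

# One weight update per batch; the last batch of an epoch may be smaller
updates_per_epoch = math.ceil(n_train / batch_size)
last_batch_size = n_train - (updates_per_epoch - 1) * batch_size
total_updates = updates_per_epoch * epochs

print(updates_per_epoch, last_batch_size, total_updates)
```

Doubling the batch size would roughly halve the number of updates per epoch, which is why larger batches run faster but give the optimizer fewer chances to adjust the weights.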

Note that if you rerun the `.fit()` method, you’ll start off with the computed weights from the previous training. Recompiling alone does not reset them, so make sure to build the model again (re-run the cell that creates it) before you retrain from scratch.

Now you can use the `.evaluate()` method to measure the accuracy of the model. We expect the accuracy on the training data to be higher than on the testing data. The longer you train a neural network, the more likely it is to start overfitting.

In [None]:
#TODO
loss, accuracy = model.evaluate(X_test, y_test)
print("Testing accuracy: {:.4f}".format(accuracy))

Testing accuracy: 0.8436


Did we overfit?

### Using word embeddings

How can you get word embeddings? You have two options:
* Train your word embeddings during the training of your neural network.
* Use pretrained word embeddings (e.g., Word2Vec, GloVe), which you can plug directly into your model. You can then either leave these word embeddings unchanged during training or continue training them as well.

Now you need to tokenize the data into a format that can be used by the word embeddings. Keras offers a couple of convenience methods for [text preprocessing](https://keras.io/preprocessing/text/) and [sequence preprocessing](https://keras.io/preprocessing/sequence/) which you can employ to prepare your text.

You can start by using the `Tokenizer` utility class which can vectorize a text corpus into a list of integers. Each integer maps to a value in a dictionary that encodes the entire corpus, with the keys in the dictionary being the vocabulary terms themselves.

You can add the parameter `num_words`, which is responsible for setting the size of the vocabulary. Only the most frequent words, up to `num_words`, will then be kept.

In [None]:
from keras.preprocessing.text import Tokenizer

#TODO
tokenizer = Tokenizer(num_words = 10000)
tokenizer.fit_on_texts(sentences_train)

X_train = tokenizer.texts_to_sequences(sentences_train)
X_test = tokenizer.texts_to_sequences(sentences_test)
X_val = tokenizer.texts_to_sequences(sentences_val)


In [None]:
print(sentences_train[2])
print(X_train[2])

As an earlier review noted, plug in this charger and nothing happens.
[26, 47, 773, 563, 1974, 339, 11, 8, 237, 2, 174, 1975]


In [None]:
print(sentences_train[3])
print(X_train[3])

I went on Motorola's website and followed all directions, but could not get it to pair again.
[3, 227, 19, 1976, 774, 2, 1977, 32, 1978, 22, 111, 13, 81, 6, 7, 775, 103]


The indices are ordered by word frequency, most common first, which you can see from the word `the` having index 1.

It is important to note that the index 0 is reserved and is not assigned to any word. This zero index is used for *padding*, which I’ll introduce in a moment.

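You can see the index of each word by taking a look at the `word_index` dictionary of the `Tokenizer` object.

By default, the Keras `Tokenizer` silently drops unknown words (words that are not in the vocabulary) when converting text to sequences. Since unknown words can also hold some information, you can pass an `oov_token` argument so that they are mapped to a dedicated index instead.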

In [None]:
for word in ['the', 'all', 'happy', 'sad']:
    print('{}: {}'.format(word, tokenizer.word_index[word]))

the: 1
all: 32
happy: 202
sad: 671
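The frequency-ranked indexing can be sketched in a few lines of plain Python. This is a simplified stand-in, not the Keras `Tokenizer` itself: it lowercases and splits on whitespace only, the toy sentences are made up, and unknown words are simply skipped.

```python
from collections import Counter

toy_train = ["the movie was great", "the plot was thin", "the acting was great"]

# Count words, then rank them: the most frequent word gets index 1.
# Index 0 is left unused so that it can serve as the padding value.
counts = Counter(word for s in toy_train for word in s.lower().split())
word_index = {word: rank
              for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts):
    """Map each text to a list of integer indices, skipping unknown words."""
    return [[word_index[w] for w in t.lower().split() if w in word_index]
            for t in texts]

print(word_index)
print(texts_to_sequences(["the plot was great", "the soundtrack was great"]))
```

The second test sentence comes out one token shorter because `soundtrack` never appeared in the toy training data.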


**Note**: Pay close attention to the difference between this technique and the `X_train` that was produced by scikit-learn’s `CountVectorizer`. With `CountVectorizer`, we had stacked vectors of word counts, and each vector was the **same** length (the size of the total corpus vocabulary). With Keras `Tokenizer`, the resulting vectors equal the length of each text/sentence, and the numbers don’t denote counts, but rather correspond to the word values from the dictionary `tokenizer.word_index`.

This means that one problem we have is that the text sequences have, in most cases, **different** lengths. To counter this, you can use `pad_sequences()`, which simply pads each sequence of word indices with zeros. By default, it prepends zeros, but we can also append them. Typically it does not matter whether you prepend or append zeros.

Additionally you would want to add a `maxlen` parameter to specify how long the sequences should be. This cuts sequences that exceed that number. In the following code, you can see how to pad sequences with Keras:

In [None]:
from keras.utils import pad_sequences

maxlen = 100
X_train = pad_sequences(X_train, maxlen = maxlen, padding = "post")
X_test = pad_sequences(X_test, maxlen = maxlen, padding = "post")
X_val = pad_sequences(X_val, maxlen = maxlen, padding = "post")

In [None]:
print(sentences_train[2])
print(X_train[2])

As an earlier review noted, plug in this charger and nothing happens.
[  26   47  773  563 1974  339   11    8  237    2  174 1975    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0]


The first values represent the index in the vocabulary as you have learned from the previous examples. You can also see that the resulting feature vector contains mostly zeros, since you have a fairly short sentence.

Let's take a look at another example (index 4):

In [None]:
print(sentences_train[4])
print(X_train[4])

All of the tapas dishes were delicious!
[  32    9    1 1979  956   43  215    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0]
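What padding and truncation do can be sketched in plain Python. The helper below is a hypothetical stand-in written for illustration, not the Keras function: it pads with zeros at the end (like `padding = "post"` above) and, as an assumption of this sketch, drops tokens from the front when a sequence is longer than `maxlen`.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad each sequence with `value` at the end, or cut it down to `maxlen`."""
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            seq = seq[len(seq) - maxlen:]   # keep only the last maxlen tokens
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

toy = [[26, 47, 773], [3, 227, 19, 1976, 774, 2, 1977]]
for row in pad_post(toy, maxlen=5):
    print(row)
```

After this step every row has exactly `maxlen` entries, so the rows can be stacked into one rectangular array and fed to the network.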


### Keras Embedding Layer

Now we are ready to learn a new embedding space through a task, just like we did with neural language models or Word2Vec. In this case, our task is sentiment classification. The first step is to use Keras' [Embedding Layer](https://keras.io/layers/embeddings/) which takes the integer word indices and maps each one to a dense embedding vector, randomly initialized. You will need the following parameters:
* `input_dim`: the size of the vocabulary
* `output_dim`: the size of the dense vector
* `input_length`: the length of the sequence

But how do we go from the embedding layer, which gives a `sequence_length` X `embedding_dim` matrix for each input example, to a dense layer, which expects a flat vector? One way is to just average the embeddings!

In [None]:
from keras.models import Sequential
from keras import layers

embedding_dim = 50

model = Sequential()

model.add(layers.Embedding(input_dim = len(tokenizer.word_index)+1,
                           output_dim = embedding_dim,
                           input_length = maxlen))

model.add(layers.GlobalAveragePooling1D())
model.add(layers.Dense(10, activation = "relu"))
model.add(layers.Dense(1, activation = "sigmoid"))
model.compile(loss = "binary_crossentropy", optimizer = "adam", metrics = ["accuracy"])
model.summary()

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 100, 50)           237150    
                                                                 
 global_average_pooling1d (  (None, 50)                0         
 GlobalAveragePooling1D)                                         
                                                                 
 dense_4 (Dense)             (None, 10)                510       
                                                                 
 dense_5 (Dense)             (None, 1)                 11        
                                                                 
Total params: 237671 (928.40 KB)
Trainable params: 237671 (928.40 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


Now we have maaaannnyyyy parameters to train...

Let's look at the results:

In [None]:
model.fit(X_train, y_train, epochs = 10, validation_data = (X_val, y_val), batch_size = 16)
loss, accuracy = model.evaluate(X_test, y_test)
print("Testing Accuracy:  {:.4f}".format(accuracy))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Testing Accuracy:  0.8436
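The two steps this model adds, looking up one embedding per token and averaging over the sequence, can be sketched in plain Python. The tiny embedding table, its values, and the function names are made up for illustration; in the real model the table is learned during training. Note that in this sketch the padding position (index 0) is averaged in along with the real tokens.

```python
# Hypothetical embedding table: 4 vocabulary entries (row 0 = padding), 3 dimensions
embedding_table = [
    [0.0, 0.0, 0.0],    # index 0: padding
    [0.1, 0.2, 0.3],    # index 1
    [0.5, -0.4, 0.0],   # index 2
    [-0.3, 0.8, 0.9],   # index 3
]

def embed(sequence):
    """Look up one vector per token index: sequence_length x embedding_dim."""
    return [embedding_table[idx] for idx in sequence]

def global_average(vectors):
    """Average over the sequence axis, leaving a single embedding_dim vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

padded_sentence = [1, 3, 2, 0]             # three tokens plus one padding position
matrix = embed(padded_sentence)            # 4 x 3
sentence_vector = global_average(matrix)   # length 3

print(len(matrix), len(matrix[0]), len(sentence_vector))
print(sentence_vector)
```

The averaged vector has a fixed length no matter how long the sentence is, which is what lets a `Dense` layer follow it.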


### Using Pretrained Word Embeddings

We just saw an example of jointly learning word embeddings incorporated into the larger model that we want to solve. But our data is too small compared to the number of parameters we have to learn and we're grossly overfitting. One way to solve this is to use pre-trained word embeddings.

We will work with the [GloVe](https://nlp.stanford.edu/projects/glove/) (Global Vectors for Word Representation) word embeddings from the Stanford NLP Group as their size is more manageable than the Word2Vec word embeddings provided by Google. Go ahead and download the 6B (trained on 6 billion words) word embeddings from [here](http://nlp.stanford.edu/data/glove.6B.zip) (822 MB).

This is a large file with 400000 lines, with each line representing a word followed by its vector as a stream of floats. For example, here are the first 50 characters of the first line:
```
$ head -n 1 data/glove_word_embeddings/glove.6B.50d.txt | cut -c-50
    the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.04445
```
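Each line can be split into the word and its vector with Python's starred assignment. The sample below reuses only the truncated excerpt shown above, so it has six numbers rather than the full 50, and the last number may itself be cut off.

```python
# Truncated excerpt of the first GloVe line (the real line has 50 values)
line = "the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.04445"

# The first field is the word; everything after it is the vector
word, *vector = line.split()
vector = [float(x) for x in vector]

print(word)
print(len(vector), vector[:3])
```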

Now let's build a matrix where the row indices correspond to the integer indices in our `Tokenizer`'s `word_index`, and each row holds that word's embedding:

In [None]:
import numpy as np

def create_embedding_matrix(filepath, word_index, embedding_dim):
    vocab_size = len(word_index) + 1  # Adding again 1 because of reserved 0 index
    # Words that never appear in the GloVe file keep an all-zero row
    embedding_matrix = np.zeros((vocab_size, embedding_dim))

    # GloVe files are UTF-8 text: one word followed by its vector per line
    with open(filepath, encoding="utf-8") as f:
        for line in f:
            word, *vector = line.split()
            if word in word_index:
                idx = word_index[word]
                embedding_matrix[idx] = np.array(vector, dtype=np.float32)[:embedding_dim]

    return embedding_matrix

embedding_dim = 50
embedding_matrix = create_embedding_matrix('/content/drive/My Drive/LIN371/glove.6B.50d.txt',
                                           tokenizer.word_index,
                                           embedding_dim)

We can check what fraction of the words in our training vocabulary are in GloVe:

In [None]:
vocab_size = len(tokenizer.word_index)+1
nonzero_elements = np.count_nonzero(np.count_nonzero(embedding_matrix, axis=1))
print(nonzero_elements / vocab_size)

0.9424414927261227


In [None]:
model = Sequential()

model.add(layers.Embedding(input_dim = vocab_size,
                           output_dim = embedding_dim,
                           input_length = maxlen,
                           weights = [embedding_matrix],trainable = False))

model.add(layers.GlobalAveragePooling1D())
model.add(layers.Dense(10, activation = "relu"))
model.add(layers.Dense(1, activation = "sigmoid"))
model.compile(loss = "binary_crossentropy", optimizer = "adam", metrics = ["accuracy"])
model.summary()


Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_4 (Embedding)     (None, 100, 50)           237150    
                                                                 
 global_average_pooling1d_2  (None, 50)                0         
  (GlobalAveragePooling1D)                                       
                                                                 
 dense_8 (Dense)             (None, 10)                510       
                                                                 
 dense_9 (Dense)             (None, 1)                 11        
                                                                 
Total params: 237671 (928.40 KB)
Trainable params: 521 (2.04 KB)
Non-trainable params: 237150 (926.37 KB)
_________________________________________________________________


In [None]:
model.fit(X_train, y_train, epochs = 10, validation_data = (X_val, y_val), batch_size = 16)
loss, accuracy = model.evaluate(X_test, y_test)
print("Testing Accuracy:  {:.4f}".format(accuracy))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Testing Accuracy:  0.8436


### LSTMs with Keras

Recall that we were "averaging" the word embeddings in order to get a sentence representation... but can we do better? In class, we discussed the use of RNNs (e.g., LSTM) as a way of "summarizing" sentence meaning in its hidden vectors. As it turns out, it is very easy to use an LSTM instead of doing an average pooling; you only need to specify an output dimension for the LSTM, i.e., the size of your "sentence summary vector":

In [None]:
model = Sequential()

#TODO

model.add(layers.Embedding(input_dim = vocab_size,
                           output_dim = embedding_dim,
                           input_length = maxlen,
                           weights = [embedding_matrix],trainable = False))

model.add(layers.LSTM(64))
model.add(layers.Dropout(0.5))

model.add(layers.Dense(10, activation = "relu"))
model.add(layers.Dense(1, activation = "sigmoid"))
model.compile(loss = "binary_crossentropy", optimizer = "adam", metrics = ["accuracy"])
model.summary()
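The `Dropout` layer used above can be illustrated with a minimal sketch, assuming the common "inverted dropout" formulation: during training each value is zeroed with probability `rate` and the survivors are scaled up by `1 / (1 - rate)`, while at test time the input passes through unchanged. The function below is illustrative and is not how the Keras layer is implemented internally.

```python
import random

def dropout(values, rate, training=True, rng=random):
    """Zero each value with probability `rate`; rescale the rest when training."""
    assert 0.0 <= rate < 1.0, "rate must be in [0, 1)"
    if not training or rate == 0.0:
        return list(values)
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(0)   # fixed seed so the run is repeatable
hidden = [1.0] * 10
print(dropout(hidden, rate=0.5, rng=rng))          # about half zeros, rest 2.0
print(dropout(hidden, rate=0.5, training=False))   # unchanged at test time
```

Because a different random subset of units is silenced on every training step, the network cannot rely too heavily on any single unit, which is why dropout acts as a regularizer.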


## Notable hyperparameters

You might already have noticed: there are so many hyperparameters associated with a neural network! How do you find the best combination? Unfortunately, in most cases, you'll just have to try varying one parameter while holding everything else constant. Name a few that we looked at in this tutorial!

* Dropout
* Padded sentence length
* \# of nodes
* \# of epochs
* \# of layers
* embedding dimension
* activation
* learning rate
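The "vary one, hold the rest" strategy can be written down as a small loop. The baseline values mirror settings used in this tutorial, but the candidate values are arbitrary choices for illustration. Each generated configuration would be trained and compared on the **validation** set, never the test set.

```python
# Baseline settings taken from the models above
baseline = {"embedding_dim": 50, "maxlen": 100, "hidden_units": 10,
            "dropout": 0.5, "epochs": 10, "batch_size": 16}

# Candidate values to try for each hyperparameter (arbitrary examples)
candidates = {"hidden_units": [5, 10, 20],
              "dropout": [0.2, 0.5],
              "epochs": [5, 10, 20]}

def one_at_a_time(baseline, candidates):
    """Yield configurations that differ from the baseline in exactly one setting."""
    for name, values in candidates.items():
        for value in values:
            if value != baseline[name]:
                yield name, {**baseline, name: value}

for changed, config in one_at_a_time(baseline, candidates):
    print(changed, "->", config[changed])
```

This keeps the number of training runs linear in the number of candidate values, at the cost of missing interactions between hyperparameters.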

In [None]:
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
bert_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4", trainable=False)