#### CSC 180 Intelligent Systems 

#### Dr. Haiquan Chen, Dept of Computer Science

#### California State University, Sacramento


# Lab 13: Natural Language Processing using RNN with word embeddings 


### Introduction to Natural Language Processing (NLP)

***Natural Language Processing (NLP)*** is a branch of computer science (and more broadly, a branch of artificial intelligence) that is concerned with providing computers with the ability to understand texts and human language. 

Common tasks in NLP include:

- *Text classification* — assign a class label to text based on the topic discussed in the text, e.g., sentiment analysis (positive or negative movie review), spam detection, content filtering (detect abusive content).
- *Text summarization/reading comprehension* — summarize a long input document with a shorter text.
- *Speech recognition* — convert spoken language to text.
- *Machine translation* — convert text in a source language to a target language.
- *Part of Speech (PoS) tagging* — mark up words in text as nouns, verbs, adverbs, etc. 
- *Question answering* — output an answer to an input question.
- *Dialog generation* — generate the next reply in a conversation given the history of the conversation.
- *Text generation/language modeling* — generate text to complete the sentence or to complete the paragraph.


## Before a model can operate on text data, the text must be converted into numerical form. This conversion typically involves the following steps:

#### 1 *Standardization* - remove punctuation, convert the text to lowercase.

* Remove punctuation marks (such as comma, period) or non-alphabetic characters (@, #, {, ]).
* Change all words to lower-case letters, since the model should consider *Text* and *text* as the same word.

#### 2 *Tokenization* - break up the text into tokens (e.g., tokens are individual (sub)-words).

#### 3 *Indexing* - assign a numerical index to each token in the training set (i.e., vocabulary).

#### 4 *Embedding* - convert each index to a numerical vector. Generally, we have three approaches:  

* Convert each document to its TF-IDF representation (each word becomes a separate feature) (traditional approach)
* Train an embedding layer from scratch (to convert each word to its embedding) 
* Use pretrained embedding models such as word2vec or GloVe (to convert each word to its embedding) 


An example of text standardization and word-level tokenization is shown in the next figure.

<img src='https://raw.githubusercontent.com/avakanski/Fall-2022-Python-Programming-for-Data-Science/main/Lectures/Theme%203%20-%20Model%20Engineering%20Pipelines/Lecture%2018%20-%20Natural%20Language%20Processing/images/tokenization.png' width=500px/>
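
The first three steps (standardization, tokenization, indexing) can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not the way Keras implements them: the helper names `standardize`, `tokenize`, and `build_vocabulary` and the two sample sentences are made up for this example, punctuation is simply deleted, and indices are assigned by descending word frequency.

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase the text and delete ASCII punctuation characters."""
    return text.lower().translate(str.maketrans('', '', string.punctuation))

def tokenize(text):
    """Word-level tokenization: split on whitespace."""
    return text.split()

def build_vocabulary(corpus):
    """Assign index 1 to the most frequent word, 2 to the next, and so on."""
    counts = Counter(tok for doc in corpus for tok in tokenize(standardize(doc)))
    # most_common() sorts by count; words with equal counts keep first-seen order
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=1)}

# Hypothetical two-document corpus
corpus = ["The cat sat on the mat.", "The dog sat, too!"]

vocab = build_vocabulary(corpus)
encoded = [[vocab[tok] for tok in tokenize(standardize(doc))] for doc in corpus]

print(vocab)    # {'the': 1, 'sat': 2, 'cat': 3, 'on': 4, 'mat': 5, 'dog': 6, 'too': 7}
print(encoded)  # [[1, 3, 2, 4, 1, 5], [1, 6, 2, 7]]
```

Index 0 is left unused here so that it can later serve as a padding value.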

In [3]:
#!pip install tensorflow-datasets

In [1]:
import tensorflow as tf             
import tensorflow_datasets as tfds



# 1. Text Tokenization with TensorFlow/Keras Tokenizer

TensorFlow/Keras provides a text preprocessing class `Tokenizer` for converting raw text into sequences of tokens. The `Tokenizer` performs both text standardization and tokenization. 

The `Tokenizer` has the following arguments:
- *num_words*: the maximum number of words to keep, based on word frequency; only the `num_words - 1` most frequent words are used when text is converted to sequences. It is better to set a high number if we are not sure, because words ranked beyond this limit will be dropped (or replaced by the OOV token, if one is set).
- *filters*: a string of characters to remove from the text. By default, it contains all punctuation and special characters (except the apostrophe), plus tabs and line breaks. To keep certain characters, pass a string that omits them. 
- *lower*: can be True or False. By default, it is True, which means all text will be converted to lowercase.
- *split*: separator for splitting words. The default separator is a space (" ").  
- *char_level*: can be True or False. By default, it is False and the tokenizer performs word-level tokenization. If it is True, it performs character-level tokenization. 
- *oov_token*: OOV stands for Out Of Vocabulary. If given, this token is added to `word_index` and replaces words that are not in the vocabulary when text is converted to sequences. 



###  Word-level Tokens

In [2]:
# Sample sentences
sentences = ['TensorFlow is a Deep Learning framework.',
             'TensorFlow is a well designed, most widely used deep learning API!',
             'Our lectures fully embrace TensorFlow environment!']    

After the text is broken down into individual words, the `Tokenizer` builds a *vocabulary* of all words found in the input text and assigns a unique integer to each word, with more frequent words receiving lower indices. We can inspect the vocabulary through the `word_index` attribute.

In [3]:
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=1000)

# Fitting tokenizer on sentences
tokenizer.fit_on_texts(sentences)

word_index = tokenizer.word_index
print(word_index)

{'tensorflow': 1, 'is': 2, 'a': 3, 'deep': 4, 'learning': 5, 'framework': 6, 'well': 7, 'designed': 8, 'most': 9, 'widely': 10, 'used': 11, 'api': 12, 'our': 13, 'lectures': 14, 'fully': 15, 'embrace': 16, 'environment': 17}


There are 17 unique words in the above sentences. By default, all punctuation is removed and all letters are converted to lowercase. 


In [4]:
print(tokenizer.texts_to_sequences(sentences))

[[1, 2, 3, 4, 5, 6], [1, 2, 3, 7, 8, 9, 10, 11, 4, 5, 12], [13, 14, 15, 16, 1, 17]]


### Out of Vocabulary Words

To handle the case when the Tokenizer is applied to text containing words that were not present in the original documents, we can define a special token through the `oov_token` argument. This token replaces every word that is Out Of Vocabulary (OOV).

In the example below, we set the `oov_token`, which is assigned the index `1`.

In [5]:
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=1000, oov_token='Word Out of Vocab')
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

{'Word Out of Vocab': 1, 'tensorflow': 2, 'is': 3, 'a': 4, 'deep': 5, 'learning': 6, 'framework': 7, 'well': 8, 'designed': 9, 'most': 10, 'widely': 11, 'used': 12, 'api': 13, 'our': 14, 'lectures': 15, 'fully': 16, 'embrace': 17, 'environment': 18}


In [6]:
# Converting text to sequences
print(tokenizer.texts_to_sequences(sentences))

[[2, 3, 4, 5, 6, 7], [2, 3, 4, 8, 9, 10, 11, 12, 5, 6, 13], [14, 15, 16, 17, 2, 18]]


Next, if we pass text containing words that the tokenizer was not fitted on, those words will be replaced with the `oov_token`.

In [7]:
new_sentences = ['I enjoy learning AI Topics', 
                'TensorFlow is a superb deep learning API'] # superb is a new word 

print(tokenizer.texts_to_sequences(new_sentences))

[[1, 1, 6, 1, 1], [2, 3, 4, 1, 5, 6, 13]]


If we work with a large dataset of documents, we can also limit the vocabulary to the 20,000 or 30,000 most frequent words (through `num_words`) and treat the rarer words as out-of-vocabulary words. This reduces the feature space of the model by ignoring words that appear only once or twice in the dataset.
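
The idea of capping the vocabulary can be sketched in plain Python. This is a simplified stand-in rather than the Keras implementation: the helper names `build_capped_vocab` and `encode`, the choice of index 1 for the OOV token, and the toy reviews are all assumptions made for this example.

```python
from collections import Counter

OOV_INDEX = 1  # index reserved for the out-of-vocabulary token (assumption of this sketch)

def build_capped_vocab(token_lists, max_words):
    """Keep only the `max_words` most frequent tokens; their indices start at 2."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {word: idx
            for idx, (word, _) in enumerate(counts.most_common(max_words), start=2)}

def encode(tokens, vocab):
    """Replace every token that is missing from the vocabulary with OOV_INDEX."""
    return [vocab.get(tok, OOV_INDEX) for tok in tokens]

# Hypothetical, already tokenized reviews
reviews = [['good', 'movie'], ['good', 'plot'], ['good', 'movie', 'ending']]

vocab = build_capped_vocab(reviews, max_words=2)
print(vocab)                                      # {'good': 2, 'movie': 3}
print(encode(['good', 'plot', 'twist'], vocab))   # [2, 1, 1]
```

Both the rare word `plot` and the unseen word `twist` end up with the same OOV index, which is what keeps the feature space small.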



### Padding Word Sequences 

Most machine learning models require the input samples to have the same length/size. In Keras, the function `pad_sequences()` can be used to pad the text sequences with predefined values, so that they have the same length. 

The function `pad_sequences()` accepts the following arguments:
- *sequences*: a list of sequences in integer form (tokenized texts).
- *maxlen*: maximum length of all sequences; if not provided, sequences will be padded to the length of the longest sequence.
- *padding*: 'pre' (default) or 'post', whether to pad before the sequence or after the sequence.
- *truncating*: 'pre' (default) or 'post', whether to remove values from sequences longer than maxlen at the beginning or at the end of the sequences.
- *value*: a float or a string to use as a padding value. By default, the sequences are padded with 0.
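
To make the roles of `maxlen`, `padding`, `truncating`, and `value` concrete, here is a minimal pure-Python sketch for a single sequence. The function `pad_one` is a hypothetical helper written for this illustration; it is not part of Keras and only mimics the behavior described in the list above.

```python
def pad_one(seq, maxlen, padding='pre', truncating='pre', value=0):
    """Return a copy of `seq` with exactly `maxlen` entries."""
    seq = list(seq)
    # Truncate first: 'pre' drops entries at the beginning, 'post' at the end
    if len(seq) > maxlen:
        seq = seq[len(seq) - maxlen:] if truncating == 'pre' else seq[:maxlen]
    # Then pad: 'pre' adds values in front, 'post' adds them at the end
    fill = [value] * (maxlen - len(seq))
    return fill + seq if padding == 'pre' else seq + fill

short = [2, 3, 4, 5, 6, 7]                        # 6 tokens
long_ = [2, 3, 4, 8, 9, 10, 11, 12, 5, 6, 13]     # 11 tokens

print(pad_one(short, maxlen=10, padding='post'))  # [2, 3, 4, 5, 6, 7, 0, 0, 0, 0]
print(pad_one(short, maxlen=10))                  # [0, 0, 0, 0, 2, 3, 4, 5, 6, 7]
print(pad_one(long_, maxlen=10, padding='post'))  # [3, 4, 8, 9, 10, 11, 12, 5, 6, 13]
print(pad_one(long_, maxlen=10, truncating='post'))  # [2, 3, 4, 8, 9, 10, 11, 12, 5, 6]
```

Note that `padding` and `truncating` are independent: post-padding does not imply post-truncation.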


In [16]:
tokenized_sentences = tokenizer.texts_to_sequences(sentences)

print(tokenized_sentences)

[[2, 3, 4, 5, 6, 7], [2, 3, 4, 8, 9, 10, 11, 12, 5, 6, 13], [14, 15, 16, 17, 2, 18]]


The next cell shows the above sequences post-padded with 0 to length 10. The second sequence has 11 tokens, so its first token is removed, because `truncating` defaults to 'pre'. 

In [17]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

padded_sequences = pad_sequences(tokenized_sentences, maxlen=10, padding='post')

print(padded_sequences)

[[ 2  3  4  5  6  7  0  0  0  0]
 [ 3  4  8  9 10 11 12  5  6 13]
 [14 15 16 17  2 18  0  0  0  0]]



# 2.  Word Embeddings

### Loading the IMDB Reviews Dataset

https://keras.io/api/datasets/imdb/

https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

The IMDB Reviews dataset can be loaded from the built-in datasets in Keras. There are 25,000 movie reviews for training and 25,000 for validation. Setting `max_features` to 20,000 means that only the 20,000 most frequent words are kept; all other words are replaced by the out-of-vocabulary token (index 2 in this dataset). Each movie review has a positive or negative label. 

The training and validation datasets will each be loaded as an array with 25,000 elements, where each element is a list of word indices.

In [10]:
max_features = 20000

(train_data, train_labels), (val_data, val_labels) = tf.keras.datasets.imdb.load_data(num_words=max_features)

In [11]:
print(len(train_data))
print(len(val_data))

25000
25000


Displayed below is the third movie review. It is a list of 141 word indices; as we can see, the words in the dataset are already converted to integer tokens. 

In [12]:
# Display the third movie review
print('Number of words in the third review', len(train_data[2]))
print(train_data[2])

Number of words in the third review 141
[1, 14, 47, 8, 30, 31, 7, 4, 249, 108, 7, 4, 5974, 54, 61, 369, 13, 71, 149, 14, 22, 112, 4, 2401, 311, 12, 16, 3711, 33, 75, 43, 1829, 296, 4, 86, 320, 35, 534, 19, 263, 4821, 1301, 4, 1873, 33, 89, 78, 12, 66, 16, 4, 360, 7, 4, 58, 316, 334, 11, 4, 1716, 43, 645, 662, 8, 257, 85, 1200, 42, 1228, 2578, 83, 68, 3912, 15, 36, 165, 1539, 278, 36, 69, 2, 780, 8, 106, 14, 6905, 1338, 18, 6, 22, 12, 215, 28, 610, 40, 6, 87, 326, 23, 2300, 21, 23, 22, 12, 272, 40, 57, 31, 11, 4, 22, 47, 6, 2307, 51, 9, 170, 23, 595, 116, 595, 1352, 13, 191, 79, 638, 89, 2, 14, 9, 8, 106, 607, 624, 35, 534, 6, 227, 7, 129, 113]


In [13]:
# Display the first 10 labels
train_labels[:10]

array([1, 0, 0, 1, 0, 0, 1, 0, 1, 0], dtype=int64)

In [14]:
from tensorflow.keras.utils import to_categorical

# One-hot encode the labels: 0 -> [1, 0] and 1 -> [0, 1]
train_labels = to_categorical(train_labels, num_classes=2)
val_labels = to_categorical(val_labels, num_classes=2)

### Preparing the Dataset

Let's pad the data using the `pad_sequences` function in Keras. Setting `maxlen=200` fixes the length of every movie review at 200 tokens. Reviews shorter than 200 tokens are padded with 0 at the end (`padding='post'`). Most movie reviews in the dataset are shorter than 200 words; however, those that are longer are truncated and some information is lost. Because `truncating` is left at its default of 'pre', the last 200 tokens are kept and the beginning of the review is dropped. That is a tradeoff between computational expense and model performance.

We can see in the next cell that for the third review, which has a length of 141 words, the last 59 entries are now 0, and the length is 200.

In [18]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

train_data = pad_sequences(train_data, maxlen=200, padding='post')
val_data = pad_sequences(val_data, maxlen=200, padding='post')

In [19]:
# Display the third movie review
print('Shape of the third padded review:', train_data[2].shape, '\n')
print(train_data[2])

Shape of the third padded review: (200,) 

[   1   14   47    8   30   31    7    4  249  108    7    4 5974   54
   61  369   13   71  149   14   22  112    4 2401  311   12   16 3711
   33   75   43 1829  296    4   86  320   35  534   19  263 4821 1301
    4 1873   33   89   78   12   66   16    4  360    7    4   58  316
  334   11    4 1716   43  645  662    8  257   85 1200   42 1228 2578
   83   68 3912   15   36  165 1539  278   36   69    2  780    8  106
   14 6905 1338   18    6   22   12  215   28  610   40    6   87  326
   23 2300   21   23   22   12  272   40   57   31   11    4   22   47
    6 2307   51    9  170   23  595  116  595 1352   13  191   79  638
   89    2   14    9    8  106  607  624   35  534    6  227    7  129
  113    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0]

### Embedding Layer in TensorFlow

TensorFlow has an `Embedding` layer, which we will use to project each input token into a vector in an embedding space. The `Embedding` layer requires, at a minimum, the number of possible tokens in the data sequences and the dimensionality of the vectors in the embedding space. The layer takes integer indices as inputs and outputs a feature vector for each index. It can be considered a look-up table that maps each integer to a vector.

To understand how the Embedding layer works, let's take data with the maximum number of words set to 100, and represent them with 5-dimensional vectors. The Embedding layer will represent words with feature vectors of a given dimension. In the cell below, the layer just assigned random values to the numbers 1, 2, and 3, and we can see that to each number a 5-dimensional vector is assigned. 

***Note that the embeddings are trainable, and when we include the Embedding layer in a model, as we train the model, words that are similar will get closer in the embedding space.***

In [27]:
# embedding layer: represent a dataset with a vocabulary of 100 words with 5 dimensional vectors
embedding_layer = tf.keras.layers.Embedding(input_dim=100, output_dim=5)

embed_integers = embedding_layer(tf.constant([1, 2, 3]))

embed_integers

<tf.Tensor: shape=(3, 5), dtype=float32, numpy=
array([[ 0.0034371 , -0.01215959,  0.04182121,  0.01836193,  0.01725897],
       [-0.02923861,  0.03471864,  0.03798798,  0.03107215, -0.03317062],
       [-0.02710744, -0.02245896, -0.00804018, -0.02075752, -0.02589865]],
      dtype=float32)>

In [28]:
embed_integers.numpy()

array([[ 0.0034371 , -0.01215959,  0.04182121,  0.01836193,  0.01725897],
       [-0.02923861,  0.03471864,  0.03798798,  0.03107215, -0.03317062],
       [-0.02710744, -0.02245896, -0.00804018, -0.02075752, -0.02589865]],
      dtype=float32)
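
The look-up table view of the layer can be imitated in a few lines of plain Python. This is a conceptual sketch only: the names `embedding_table` and `embed` are hypothetical, the table is not trainable here, and initializing it with small uniform random values is an assumption chosen to resemble the output above.

```python
import random

VOCAB_SIZE = 100   # number of possible token indices (plays the role of input_dim)
EMBED_DIM = 5      # length of each embedding vector (plays the role of output_dim)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# The "layer" is a table with one row of EMBED_DIM numbers per token index
embedding_table = [[rng.uniform(-0.05, 0.05) for _ in range(EMBED_DIM)]
                   for _ in range(VOCAB_SIZE)]

def embed(token_ids):
    """Look up the row of the table for each token index."""
    return [embedding_table[i] for i in token_ids]

vectors = embed([1, 2, 3])
print(len(vectors), len(vectors[0]))   # 3 5

# The same index always maps to the same row of the table
print(embed([2])[0] == vectors[1])     # True
```

Training a real `Embedding` layer amounts to adjusting the numbers stored in this table by gradient descent.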

### Define, Compile, and Train the Model

Next, we will define a model that uses an `Embedding` layer to project the words in the input sequences into 8-dimensional vectors. These vectors are then processed by a bidirectional LSTM layer and a dense layer, and the last layer predicts the label of each movie review.

In [34]:
embedding_dim = 8
#max_features = 20000

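# Create a model
model = tf.keras.Sequential([
    # Map each of the 200 token indices to a trainable 8-dimensional vector
    tf.keras.layers.Embedding(input_dim=max_features, output_dim=embedding_dim, input_length=200),
    # Recurrent layer that reads the sequence in both directions; add more RNN layers if you prefer
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(100)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    # Two output neurons with softmax: one probability per class
    tf.keras.layers.Dense(2, activation='softmax')
])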
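
As a rough sanity check on the size of this model, the number of trainable parameters can be worked out by hand. The arithmetic below is a sketch that assumes the standard LSTM formulation (four gates, each with input weights, recurrent weights, and a bias) and dense layers with biases; calling `model.summary()` on the built model remains the authoritative source.

```python
vocab_size = 20000    # max_features
embedding_dim = 8
lstm_units = 100

# Embedding: one 8-dimensional vector per token index
embedding_params = vocab_size * embedding_dim

# One LSTM direction: 4 gates x (input weights + recurrent weights + bias)
lstm_one_direction = 4 * (embedding_dim * lstm_units
                          + lstm_units * lstm_units
                          + lstm_units)
bilstm_params = 2 * lstm_one_direction

# The bidirectional layer concatenates both directions: 2 * 100 = 200 features
dense32_params = (2 * lstm_units) * 32 + 32
output_params = 32 * 2 + 2   # the Dropout layer in between has no parameters

total_params = embedding_params + bilstm_params + dense32_params + output_params

print(embedding_params)  # 160000
print(bilstm_params)     # 87200
print(dense32_params)    # 6432
print(output_params)     # 66
print(total_params)      # 253698
```

Under these assumptions, well over half of the parameters sit in the embedding table, even with only 8 dimensions per word.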

Since there are only 2 labels (positive and negative review) and the labels were one-hot encoded with `to_categorical`, we use 2 neurons with softmax in the output layer. Accordingly, the loss must be `categorical_crossentropy`.

We will compile the model with the `categorical_crossentropy` loss and the `adam` optimizer. Note that `binary_crossentropy` is calculated on top of a single sigmoid output, whereas `categorical_crossentropy` is calculated over softmax activation outputs.
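
For two classes the two losses measure the same thing, which a small hand computation can show. The sketch below uses plain Python with hypothetical helper names (`one_hot`, `categorical_crossentropy`, `binary_crossentropy`) and a made-up prediction; it leaves out the numerical clipping that a real framework would apply.

```python
import math

def one_hot(label, num_classes=2):
    """Encode an integer label as a one-hot list, e.g. 1 -> [0.0, 1.0]."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def categorical_crossentropy(y_true, y_pred):
    """Loss for a one-hot target and a softmax output (one probability per class)."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred))

def binary_crossentropy(label, p_positive):
    """Loss for a 0/1 target and a single sigmoid output P(positive)."""
    return -(label * math.log(p_positive) + (1 - label) * math.log(1 - p_positive))

label = 1                  # a positive review
softmax_out = [0.2, 0.8]   # hypothetical output: [P(negative), P(positive)]

cce = categorical_crossentropy(one_hot(label), softmax_out)
bce = binary_crossentropy(label, softmax_out[1])

print(round(cce, 4))  # 0.2231
print(round(bce, 4))  # 0.2231
```

The choice between the two is therefore mostly a matter of how the labels and the output layer are set up: one-hot labels with a 2-neuron softmax layer, or 0/1 labels with a 1-neuron sigmoid layer.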

In [36]:
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

In [37]:
# Train and validate on the first 500 samples only
history = model.fit(train_data[0:500], train_labels[0:500], validation_data = (val_data[0:500], val_labels[0:500]), epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
