# Processing Text Part II

The two fundamental deep-learning algorithms for sequence processing are **Recurrent Neural Networks** and **1D CNNs**. Applications of these algorithms include the following:

- Document classification and timeseries classification, such as identifying the topic of an article or the author of a book.
- Timeseries comparisons, such as estimating how closely related two documents are.
- Sequence-to-sequence learning, such as translating an English sentence into French.
- Sentiment analysis, such as classifying the sentiment of tweets or movie reviews as positive or negative.
- Timeseries forecasting, such as predicting the future weather at a certain location given recent weather data.

The following examples are modified versions of some applications that can be found in [1]. Further, the next cell is included in case you have your data on Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

%cd "/content/drive/MyDrive/TC3007C/Sequences"
!pwd

Now we import some libraries and modules.

In [None]:
import numpy as np
import tensorflow as tf
import os
import matplotlib.pyplot as plt

from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.layers import Flatten, Dense, Embedding
from keras.datasets import imdb
from keras import preprocessing
from keras.models import Sequential
from keras import models

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

## Processing text data

Deep learning for natural-language processing is pattern recognition applied to words, sentences, and paragraphs, in much the same way that computer vision is pattern recognition applied to pixels. 

As expected, deep-learning models do not take raw text as input; they only work with numeric tensors. To deal with this we need to **vectorize** text, which is the process of transforming text into numeric tensors. This can be done in multiple ways:

- Segment text into words, and transform each word into a vector.
- Segment text into characters, and transform each character into a vector.
- Extract n-grams of words or characters, and transform each n-gram into a vector.

The different units into which you can break down text (words, characters, or n-grams) are called **tokens**, and breaking text into such tokens is called **tokenization**. In general, text-vectorization processes consist of applying some tokenization scheme and then associating numeric vectors with the generated tokens. These vectors are fed into deep neural networks. There are multiple ways to associate a vector with a token. Let us talk about two major ones: **one-hot encoding** of tokens, and **token embedding**.
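To make the three kinds of tokens concrete, here is a minimal sketch in plain Python. The helper names (`word_tokens`, `char_tokens`, `ngrams`) are ours, chosen only for illustration, and the regular expression is one simple way of splitting words among many.

```python
import re

def word_tokens(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def char_tokens(text):
    """Split the text into individual characters (spaces included)."""
    return list(text)

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sample = 'The cat sat on the mat.'
words = word_tokens(sample)
bigrams = ngrams(words, 2)

print(words)               # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(char_tokens('cat'))  # ['c', 'a', 't']
print(bigrams)
```

The same `ngrams` helper works on a list of characters, which gives character n-grams instead of word n-grams.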

### One-hot encoding

One-hot encoding is the most common, most basic way to turn a token into a vector. It consists of associating a unique integer index with every word and then turning this integer index $i$ into a binary vector of size $n$, which is the size of the vocabulary: the vector is all zeros except for the $i$-th entry, which is equal to one.
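Before relying on a library, it may help to see the idea written by hand. The following is a minimal sketch in plain Python; the names `token_index` and `one_hot` are ours, and indices are assigned in order of first appearance, which is an assumption of this sketch and not necessarily the order a library tokenizer would use.

```python
samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Assign each distinct word a unique index, starting at 1 (index 0 is left unused).
token_index = {}
for sample in samples:
    for word in sample.lower().replace('.', '').split():
        if word not in token_index:
            token_index[word] = len(token_index) + 1

def one_hot(word, vocabulary_size):
    """Return a list of zeros with a single one at the word's index."""
    vector = [0] * vocabulary_size
    vector[token_index[word]] = 1
    return vector

n = len(token_index) + 1  # +1 because index 0 is reserved
print(token_index)
print(one_hot('cat', n))  # [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
```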



In [None]:
samples = ['The cat sat on the mat.', 'The dog ate my homework.']
tokenizer = Tokenizer(num_words=10)
tokenizer.fit_on_texts(samples) 
sequences = tokenizer.texts_to_sequences(samples)
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')
word_index = tokenizer.word_index
print(f'Found {len(word_index)} unique tokens.')
print(word_index)
print(sequences)
print(one_hot_results)

### Token embedding

This method is mainly used for words, so it is also known as **word embeddings**. Whereas the vectors obtained through one-hot encoding are binary, sparse, and very high-dimensional (same dimensionality as the number of words in the vocabulary), word embeddings are low-dimensional floating-point vectors: that is, dense vectors, as opposed to sparse vectors. So, word embeddings pack more information into far fewer dimensions.

There are two ways to obtain word embeddings: **learn word embeddings** jointly with the main task you care about, and load **pretrained word embeddings**.

#### Learning word embeddings

The simplest way to associate a dense vector with a word is to choose the vector at random. The problem with this approach is that the resulting embedding space has no structure. Something more useful would be a space in which the geometric relationships between word vectors reflect the semantic relationships between these words. In other words, word embeddings are meant to map human language into a geometric space.

In real-world word-embedding spaces, common examples of meaningful geometric transformations are **gender** vectors and **plural** vectors. For instance, by adding a `female` vector to the vector `king`, we obtain the vector `queen`. By adding a `plural` vector, we obtain `kings`. Word-embedding spaces typically feature thousands of such interpretable and potentially useful vectors.
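The following toy sketch illustrates this kind of vector arithmetic. The two-dimensional vectors are hand-made for illustration only (they are not learned from data), and the helper `nearest` is ours; real embedding spaces have many more dimensions and the relationships hold only approximately.

```python
import numpy as np

# Hand-made 2D vectors, chosen only for illustration (not learned from data).
# Axis 0 loosely encodes "royalty", axis 1 loosely encodes "gender".
vectors = {
    'king':  np.array([1.0, 0.0]),
    'queen': np.array([1.0, 1.0]),
    'man':   np.array([0.0, 0.0]),
    'woman': np.array([0.0, 1.0]),
}

# The "female" direction is the offset that turns `man` into `woman`.
female = vectors['woman'] - vectors['man']

def nearest(vector):
    """Return the word whose vector is closest (Euclidean distance) to `vector`."""
    return min(vectors, key=lambda w: np.linalg.norm(vectors[w] - vector))

result = nearest(vectors['king'] + female)
print(result)  # queen
```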

Learning a word embedding for a particular task is equivalent to training an extra layer of a neural network, which is known as the `Embedding` layer. This layer can be understood as a dictionary that maps integer indices (which stand for specific words) to dense vectors. It takes integers as input, looks up these integers in an internal dictionary, and returns the associated vectors. By the way, this layer needs to be told, at least, the size of the vocabulary we are working with and the dimensionality of the embedding.
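The dictionary analogy can be sketched with a plain NumPy array: the "dictionary" is a weight matrix with one row per word index, and the lookup is row indexing. This is only an illustration of the idea, not the `keras` implementation; the sizes and the random initialization below are assumptions.

```python
import numpy as np

vocabulary_size = 10   # hypothetical, tiny vocabulary
embedding_dim = 3      # hypothetical embedding dimensionality

# The layer's only weights: one row per word index, initialised at random.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocabulary_size, embedding_dim))

# A "sentence" encoded as integer word indices.
word_indices = np.array([3, 1, 7, 1])

# The lookup is plain row indexing: one dense vector per input index.
embedded = embedding_table[word_indices]
print(embedded.shape)  # (4, 3)
```

Note that the word with index 1 appears twice in the input and gets the same vector both times; training would adjust the rows of `embedding_table`.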

To see how this works, let us implement a classifier that predicts whether a movie review is *positive* or *negative*. The data we will work with is the **IMDB** dataset that is already included in `keras`. For now, let us import the data, limiting the number of the most frequent words we will handle (`max_features`) and truncating the reviews so that they have twenty words at most (`maxlen`).
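To see what truncating and padding do to a single review, here is a simplified pure-Python sketch. It assumes the default settings of `pad_sequences` (`padding='pre'` and `truncating='pre'`), under which long sequences lose their first items and short sequences are padded with zeros on the left. The function name `pad_or_truncate` is ours, and the sketch assumes `maxlen` is at least one.

```python
def pad_or_truncate(sequence, maxlen, value=0):
    """Keep the last `maxlen` items; left-pad shorter sequences with `value`."""
    truncated = sequence[-maxlen:]
    return [value] * (maxlen - len(truncated)) + truncated

long_review = pad_or_truncate([5, 8, 2, 9, 4, 7], maxlen=4)
short_review = pad_or_truncate([5, 8], maxlen=4)

print(long_review)   # [2, 9, 4, 7]
print(short_review)  # [0, 0, 5, 8]
```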

In [None]:
max_features = 10000
maxlen = 20

(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=max_features) 
X_train = tf.keras.preprocessing.sequence.pad_sequences(X_train, maxlen=maxlen) 
X_test = tf.keras.preprocessing.sequence.pad_sequences(X_test, maxlen=maxlen)

Let us implement a simple neural network that includes an `Embedding` layer. In this case, the dimensionality of the embedding space is eight, so the model will learn a vector of eight components for each word, which means that this layer will have 80,000 trainable parameters.
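The parameter count can be checked with a little arithmetic. The sketch below assumes the architecture used in this section: an embedding layer, a flatten step, and a single dense unit with a bias.

```python
max_features = 10000  # vocabulary size
embedding_dim = 8
maxlen = 20

embedding_params = max_features * embedding_dim   # one 8-component vector per word
flattened_size = maxlen * embedding_dim           # Flatten turns (20, 8) into (160,)
dense_params = flattened_size * 1 + 1             # 160 weights plus one bias

print(embedding_params)                 # 80000
print(dense_params)                     # 161
print(embedding_params + dense_params)  # 80161
```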

In [None]:
model = Sequential([
                    Embedding(max_features, 8, input_length=maxlen), 
                    Flatten(),
                    Dense(1, activation='sigmoid')
                   ])

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc']) 
model.summary()

And now we train...

In [None]:
history = model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test))

So we get a validation accuracy of approximately 75%, which is pretty good considering that we are only looking at twenty words of every review (with its default settings, `pad_sequences` keeps the last twenty words of each review, not the first).

By the way, now that the model is trained, we can see what the embedding layer learned during training. As we mentioned, the embedding layer will map every word in our corpus to a vector that lives in a space with eight dimensions. This means that a review, which has twenty words, will become a matrix in which each row is a vector of eight components.



In [None]:
layer_outputs = [layer.output for layer in model.layers[:3]] 
activation_model = models.Model(inputs=model.input, outputs=layer_outputs)
activations = activation_model.predict(np.expand_dims(X_train[0], axis=0))
print(X_train[0])
print(activations[0].shape)
print(activations[0])

#### Pretrained word embeddings

When little training data is available, instead of learning word embeddings jointly with the problem we want to solve, we can load embedding vectors from a precomputed embedding space that we know is highly structured and exhibits useful properties. This is analogous to the concept of transfer learning: there is not enough data available to learn truly powerful features on our own, but we expect the features that we need to be fairly generic.

The idea of a dense, low-dimensional embedding space for words, computed in an unsupervised way, was initially explored by Bengio et al. in the early 2000s, but it only started to take off in research and industry applications after the release of one of the most famous and successful word-embedding schemes: the **Word2vec** algorithm, developed by Tomas Mikolov at Google in 2013. Word2vec captures specific semantic properties, such as gender.

There are various precomputed databases of word embeddings that you can download and use in a `keras` embedding layer. Word2vec is one of them. Another popular one is called **Global Vectors for Word Representation (GloVe)**, which was developed by Stanford researchers in 2014. This embedding technique is based on factorizing a matrix of word co-occurrence statistics. Its developers have made available precomputed embeddings for millions of English tokens, obtained from Wikipedia data and Common Crawl data.
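To give an idea of what word co-occurrence statistics look like, here is a small sketch that counts how often two words appear next to each other in a toy corpus. It only illustrates the counting step; it is not the GloVe procedure, and the factorization of the resulting matrix is omitted. The function name, the window size, and the corpus are assumptions made for illustration.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=1):
    """Count how often each ordered pair of words appears within `window` positions."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

corpus = ['the cat sat', 'the dog sat']
counts = cooccurrence_counts(corpus)

print(counts[('the', 'cat')])  # 1
print(counts[('sat', 'cat')])  # 1
print(counts[('the', 'sat')])  # 0
```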

Well, let us use GloVe as a pretrained embedding. To accomplish this, we will have to work with the raw IMDB dataset, which can be downloaded from here: http://mng.bz/0tIo 

Now, let us collect the individual training reviews into a list of strings, one string per review. We will also get the review labels (positive/negative) into a labels list.

In [None]:
imdb_dir = '/Users/dotero/Documents/TEC/Cursos/Concentración IA/Módulos/TC3007C/Deep Learning/Code/Sequences/IMDB'
train_dir = os.path.join(imdb_dir, 'train')
labels = []
texts = []

for label_type in ['neg', 'pos']:
    dir_name = os.path.join(train_dir, label_type)
    for fname in os.listdir(dir_name):
        if fname[-4:] == '.txt':
            # Read each review as UTF-8; the file is closed automatically.
            with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                texts.append(f.read())
            if label_type == 'neg':
                labels.append(0)
            else:
                labels.append(1)

Let us vectorize the text and prepare a training and validation split. Because pretrained word embeddings are meant to be particularly useful on problems where little training data is available, we will add the following twist: restricting the training data to the first 200 samples. In other words, we will train our model to classify movie reviews after looking at just 200 examples. Notice that we have to first shuffle the data because samples of said data are ordered (all negative first, then all positive).

In [None]:
maxlen = 100
training_samples = 200 
validation_samples = 10000
max_words = 10000

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index

print(f'Found {len(word_index)} unique tokens.')

data = pad_sequences(sequences, maxlen=maxlen)
labels = np.asarray(labels)

print(f'Shape of data tensor: {data.shape}.')
print(f'Shape of label tensor: {labels.shape}.')

indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
X_train = data[:training_samples]
y_train = labels[:training_samples]
X_val = data[training_samples: training_samples + validation_samples] 
y_val = labels[training_samples: training_samples + validation_samples]

To use the pretrained GloVe embedding from 2014 English Wikipedia we have to go to https://nlp.stanford.edu/projects/glove and download it. It is a large file, so be patient if you do not have a good internet connection. The file contains 100-dimensional embedding vectors for 400,000 words (or nonword tokens). Unzip it.

Let us parse the unzipped file (a .txt file) to build an index that maps words (as strings) to their vector representation (as number vectors).
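Each line of the file holds a word followed by its coefficients, separated by spaces. The sketch below parses a single made-up line with four coefficients to show the idea; the helper name `parse_glove_line` and the sample line are ours, not taken from the real file.

```python
import numpy as np

def parse_glove_line(line):
    """Split one line into the word and its vector of float32 coefficients."""
    values = line.split()
    return values[0], np.asarray(values[1:], dtype='float32')

# Made-up line with a 4-dimensional vector (the real file has 100 numbers per line).
word, coefs = parse_glove_line('cat 0.1 -0.2 0.3 0.4')
print(word, coefs.shape)
```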

In [None]:
glove_dir = '/Users/dotero/Documents/TEC/Cursos/Concentración IA/Módulos/TC3007C/Deep Learning/Code/Sequences/glove.6B'
embeddings_index = {}
# Each line holds a word followed by its embedding coefficients.
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print(f'Found {len(embeddings_index)} word vectors.')

In order to use the parameters of the pretrained embedding, we will have to build an embedding matrix that we will feed into the embedding layer. It must be a matrix with `max_words` rows and `embedding_dim` columns, where each row $i$ contains the `embedding_dim`-dimensional vector for the word of index $i$ in the reference word index (built during tokenization). Note that index zero is not supposed to stand for any word or token; it is just a placeholder.
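A toy version of this construction shows what happens in the corner cases. The word index and the "pretrained" vectors below are hypothetical: `'zyxw'` stands for a word that the pretrained embedding does not contain, and `'rare'` for a word whose index falls outside the first `max_words` indices.

```python
import numpy as np

max_words = 4        # keep only indices 0..3
embedding_dim = 3

# Hypothetical tokenizer output and pretrained vectors.
word_index = {'the': 1, 'cat': 2, 'zyxw': 3, 'rare': 4}
embeddings_index = {
    'the': np.array([0.1, 0.2, 0.3], dtype='float32'),
    'cat': np.array([0.4, 0.5, 0.6], dtype='float32'),
    'rare': np.array([0.7, 0.8, 0.9], dtype='float32'),
}

embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if i < max_words and vector is not None:
        embedding_matrix[i] = vector

print(embedding_matrix)
```

Row 0 (the placeholder) and row 3 (the word missing from the pretrained vectors) stay at zero, and `'rare'` is skipped because its index is not below `max_words`.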

In [None]:
embedding_dim = 100
embedding_matrix = np.zeros((max_words, embedding_dim))

for word, i in word_index.items():
    if i < max_words:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

Now we build our model.

In [None]:
model_pre_trained = Sequential([
                                Embedding(max_words, embedding_dim, input_length=maxlen),
                                Flatten(),
                                Dense(32, activation='relu'),
                                Dense(1, activation='sigmoid')
                               ])

model_pre_trained.summary()

The embedding layer has a single weight matrix: a 2D float matrix where each row $i$ is the word vector meant to be associated with index $i$. Let us load the embedding matrix that we built into the embedding layer. By the way, we do not want to alter these parameters during training, so we have to tell `keras` that the embedding layer will not be trained.

In [None]:
model_pre_trained.layers[0].set_weights([embedding_matrix])
model_pre_trained.layers[0].trainable = False

We are all set for compiling and training our model.

In [None]:
model_pre_trained.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model_pre_trained.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_val, y_val))

Let us visualize our results.

In [None]:
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()

print(f"Average validation accuracy: {sum(val_acc) / len(val_acc)}.")

The model quickly starts overfitting, which is unsurprising given the small number of training samples. Validation accuracy seems to reach the high fifties.
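One simple way to locate where overfitting begins is to find the epoch with the lowest validation loss. The numbers below are hypothetical, made up for illustration; they are not results from this notebook.

```python
# Hypothetical validation losses, one per epoch (not results from this notebook).
val_loss = [0.70, 0.68, 0.69, 0.74, 0.81]

# Epochs are numbered from 1, so add 1 to the zero-based list position.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
print(best_epoch)  # 2
```

After that epoch the validation loss only goes up, which is the usual sign that the model has started to memorize the training samples.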

Notice that results may vary because we have so few training samples, so performance is heavily dependent on exactly which 200 samples are chosen randomly. If this works poorly for you, try choosing a different random set of 200 samples. However, keep in mind that in real life we do not get to choose our training data.
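If we want the same random set of samples on every run, one option is to shuffle with a seeded generator. The sketch below uses tiny made-up arrays, and the function name `shuffled_split` and the seed value are assumptions for illustration.

```python
import numpy as np

data = np.arange(10).reshape(5, 2)      # five hypothetical samples
labels = np.array([0, 0, 0, 1, 1])      # ordered: negatives first, then positives

def shuffled_split(data, labels, n_train, seed):
    """Shuffle data and labels with the same permutation, then split."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(data))
    data, labels = data[indices], labels[indices]
    return (data[:n_train], labels[:n_train]), (data[n_train:], labels[n_train:])

(X_a, y_a), _ = shuffled_split(data, labels, n_train=3, seed=42)
(X_b, y_b), _ = shuffled_split(data, labels, n_train=3, seed=42)
print(np.array_equal(X_a, X_b))  # True: same seed, same split
```

Changing the seed gives a different set of training samples, which makes it easy to compare runs.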

We can also train the same model without loading the pretrained word embeddings and without freezing the embedding layer. In that case, we will use a task-specific embedding of the input tokens, which is generally more powerful than pretrained word embeddings when lots of data is available. But in this case, we have only 200 training samples. 

In [None]:
model = Sequential([
                    Embedding(max_words, embedding_dim, input_length=maxlen),
                    Flatten(),
                    Dense(32, activation='relu'),
                    Dense(1, activation='sigmoid')
                   ])

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_val, y_val))

And now we see how things went.

In [None]:
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()

print(f"Average validation accuracy: {sum(val_acc) / len(val_acc)}.")

The average validation accuracy with a pretrained embedding is a bit higher than without it, so, in this case, pretrained word embeddings outperform learned embeddings. If we were to increase the number of training samples, this would quickly stop being the case.

## References

[1] Chollet, Francois. *Deep learning with Python*. Simon and Schuster, 2021.

[2] https://code.google.com/archive/p/word2vec

[3] https://nlp.stanford.edu/projects/glove