# RNN: Working with text data<a id="Top"></a>

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>
Table of Contents
<ul>
<li>1. Tokenization</li>  
<li>2. <a href="#Part_2">One-hot text encoding</a></li>
<li>3. <a href="#Part_3">Word embeddings</a></li>
    <ul>
        <li> 3.1 <a href="#Part_3_1">Learning word embeddings with the Keras Embedding layer</a></li>
        <li> 3.2 <a href="#Part_3_2">Using pretrained word embeddings</a></li>
    </ul>
</ul>    
</font>
</div>

# 1. Tokenization

Machine learning algorithms don't truly understand text in a human sense. A learning model only takes numerical
inputs and tries to map out the structure of the input data. So in order to work with text, which is one of the most
widespread forms of sequential data, one has to decide on a representation that maps text into numerical
tensors. This procedure is sometimes called vectorizing text. There are multiple possible ways:
- Segment text into words, and transform each word into a vector.
- Segment text into characters, and transform each character into a vector.
- Extract n-grams of words or characters, and transform each n-gram into a vector.

The different units that one can break down text into (words, characters, n-grams) are called __tokens__.
Breaking text into tokens is called __tokenization__. Every text-vectorization procedure applies some tokenization
scheme and then associates a numeric vector with each generated token. An example of this process that goes from 
text to tokens to vectors is depicted by the following diagram:

<img src='./images/fig_RNN-TextData.png' width=350>

There are two major ways to associate a vector with a token: one-hot encoding and token embedding (or word embedding).

As a side note, word n-grams are groups of $N$ (or fewer) consecutive words that one can extract from a sentence.
The same idea can be applied to characters. Take the sentence "__The cat sat on the mat__" as an example.
It can be decomposed into the following set of 2-grams:
```python
    {"The", "The cat", "cat", "cat sat", "sat", "sat on", "on", "on the", "the", "the mat", "mat"}
```
or the following 3-grams:
```python
    {"The", "The cat", "cat", "cat sat", "The cat sat",
     "sat", "sat on", "on", "cat sat on", "on the", "the", 
     "sat on the", "the mat", "mat", "on the mat"}
```

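One can see that n-gram tokenization is not an order-preserving method: the result is a set (a "bag") of tokens
rather than a sequence, so the overall word order of the sentence is lost.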
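
As a minimal sketch of how such a bag of n-grams could be produced, the helper below collects every run of 1 to $N$ consecutive words. The function name `bag_of_ngrams` is a hypothetical choice for illustration, and splitting on whitespace is an assumption; it is not taken from any library.

```python
def bag_of_ngrams(sentence, n):
    """Return the set of all word n-grams of length 1..n found in `sentence`."""
    words = sentence.split()  # assumption: whitespace-separated words
    grams = set()
    for size in range(1, n + 1):
        # Slide a window of `size` words across the sentence
        for start in range(len(words) - size + 1):
            grams.add(" ".join(words[start:start + size]))
    return grams


two_grams = bag_of_ngrams("The cat sat on the mat", 2)
three_grams = bag_of_ngrams("The cat sat on the mat", 3)
print(sorted(two_grams))
print(len(two_grams), len(three_grams))
```

Because the result is a Python `set`, it carries no information about where in the sentence each n-gram occurred.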


# 2. One-hot text encoding<a id="Part_2"></a>
<a href="#Top">Back to page top</a>

One-hot encoding is the most common and most basic way of turning a token into a vector. It is used
throughout machine learning. For example, in the MNIST challenge, the targets are one-hot encoded
before they are sent to the CNN model. One-hot encoding consists of associating a unique integer index $i$ with 
every word, then turning this integer $i$ into a binary vector of size $N$, the size of the vocabulary. The
vector is all zeros except for the $i$-th entry, which is 1. Of course, one-hot encoding can be applied at the 
character level too. Below is a code example from Chollet's book that performs word-level one-hot encoding:

In [1]:
import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
token_index = {}
for sample in samples:
    for word in sample.split():
        if word not in token_index:
            token_index[word] = len(token_index) + 1
            # Print out the unique words            
            print(' unique word: {0:12s}, token index: {1:3d}'.format(word, token_index[word]))
            
# Consider the first 10 words in each sample            
max_length = 10

results = np.zeros(shape=(len(samples), 
                          max_length, 
                          max(token_index.values()) + 1)) 

for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(word)
        results[i, j, index] = 1.

 unique word: The         , token index:   1
 unique word: cat         , token index:   2
 unique word: sat         , token index:   3
 unique word: on          , token index:   4
 unique word: the         , token index:   5
 unique word: mat.        , token index:   6
 unique word: dog         , token index:   7
 unique word: ate         , token index:   8
 unique word: my          , token index:   9
 unique word: homework.   , token index:  10


In this example, the two sentences contain 11 words in total, but only 10 of them are unique. Note that 
the code is case sensitive, so "The" and "the" are two different tokens. Because the sentences are split on
whitespace only, punctuation also stays attached to the words, as in "mat." and "homework.". Once the unique words
have been identified, the code assigns each word an integer index and then converts that integer into a
binary vector. Note that index 0 of the vector is not associated with any word. The variable 
`max_length` limits the number of words encoded per sample. The following cell shows the vectorized samples:

In [2]:
results

array([[[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],

       [[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]])
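
The same procedure carries over to the character level. The sketch below is an illustration under stated assumptions: the vocabulary is taken to be Python's `string.printable` (all printable ASCII characters), and `max_length = 50` is an arbitrary cutoff chosen for this example.

```python
import string

import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Assumption: the vocabulary is the set of printable ASCII characters
characters = string.printable
# Index 0 is left unused, mirroring the word-level example
char_index = {ch: i + 1 for i, ch in enumerate(characters)}

max_length = 50  # consider at most the first 50 characters of each sample
char_results = np.zeros((len(samples), max_length, len(characters) + 1))

for i, sample in enumerate(samples):
    for j, ch in enumerate(sample[:max_length]):
        char_results[i, j, char_index[ch]] = 1.0

print(char_results.shape)
```

Each character position now gets its own one-hot row, so the tensor is much longer along the sequence axis than in the word-level case but needs a far smaller vocabulary.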

Keras has built-in utilities for one-hot encoding text at the word or character level. These tools have a number
of important features, such as stripping special characters from strings and only taking into account the $N$ 
most common words in your dataset. So in general one should use the Keras utilities. Below is an example using 
the Keras utilities on the sentences from the previous example:

In [3]:
from keras.preprocessing.text import Tokenizer

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Only take the most common words into account (the limit is set by num_words).
tokenizer = Tokenizer(num_words=20) 
tokenizer.fit_on_texts(samples)

sequences = tokenizer.texts_to_sequences(samples)

one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')
word_index = tokenizer.word_index 
print('Found %s unique tokens.' % len(word_index))

Using TensorFlow backend.


Found 9 unique tokens.


In [4]:
word_index

{'the': 1,
 'cat': 2,
 'sat': 3,
 'on': 4,
 'mat': 5,
 'dog': 6,
 'ate': 7,
 'my': 8,
 'homework': 9}

In [5]:
print(one_hot_results[0])
print(one_hot_results[1])

[0. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


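Evidently, the Keras `Tokenizer` is case insensitive by default: it treats "The" and "the" as the same token. It
also strips punctuation, so "mat." becomes "mat". This is why only 9 unique tokens are found here.
`Tokenizer` also returns a dictionary that maps each unique word to its index.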

If the number of unique tokens in the vocabulary is too large to handle explicitly, one can hash words into vectors
of fixed size. The advantage of this method is that it does not maintain an explicit word index, which saves memory
and allows online encoding of the data. However, the method may suffer from hash collisions: two different
words may end up with the same hash. 
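
A minimal sketch of this hashing approach is given below. The helper `stable_hash` and the choice of MD5 are assumptions made for this illustration (Python's built-in `hash()` for strings changes between interpreter runs, so a digest from `hashlib` is used to keep the encoding reproducible); `dimensionality = 1000` is likewise an arbitrary choice.

```python
import hashlib

import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

dimensionality = 1000  # size of each hashed vector; smaller values mean more collisions
max_length = 10        # consider at most the first 10 words of each sample


def stable_hash(word):
    """Hypothetical helper: hash a word to an integer that is identical across runs."""
    return int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)


hashed = np.zeros((len(samples), max_length, dimensionality))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        # No word index is stored: the position comes directly from the hash
        index = stable_hash(word) % dimensionality
        hashed[i, j, index] = 1.0

print(hashed.shape)
```

The closer the number of distinct words gets to `dimensionality`, the more likely it is that two words land on the same index.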

# 3. Word embeddings<a id="Part_3"></a>
<a href="#Top">Back to page top</a>

While one-hot encoding is straightforward, it lacks representational power. Each one-hot vector merely
distinguishes a word in a vocabulary from every other word in the vocabulary, so one-hot encoding cannot
tell you anything about the semantics of the text. Word embeddings, on the other hand, will group commonly co-occurring
tokens together in the embedding space. Let's expand on the meaning of "together."

Word embeddings map human language into a geometric space, called the embedding space. As a
result, it is possible to measure the distance between two tokens (words or characters) in the embedding space. 
For example, the L2 distance can serve as a measure of separation between two tokens. And since the embedding space is a 
vector space, the vector that connects two tokens has two properties: length and direction. Reasonable 
word embeddings should reflect the semantic relationships between tokens in terms of these geometric 
quantities. Such a relationship can be understood as a geometric transformation, as indicated by the
example below.

<img src='./images/fig_RNN-WordEmbedding.png' width=300>

The diagram is a toy example of four words embedded in a two-dimensional embedding space. Some semantic 
relationships can be observed. Firstly, the translations from cat to tiger and from dog to wolf are given
by the same vector (solid orange). Secondly, another vector (dashed orange) moves the point from dog to cat 
and from wolf to tiger. The first vector can be interpreted as a "from pet to wild animal" vector, 
while the second can be interpreted as a "from canine to feline" vector.
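
The geometry of such a toy embedding can be sketched numerically. The coordinates below are made up by hand purely for illustration; they are not taken from the figure or from any trained embedding.

```python
import numpy as np

# Hypothetical 2D coordinates, chosen by hand for illustration only
embedding = {
    "cat":   np.array([1.0, 1.0]),
    "dog":   np.array([3.0, 1.0]),
    "tiger": np.array([1.0, 3.0]),
    "wolf":  np.array([3.0, 3.0]),
}

# The "from pet to wild animal" and "from canine to feline" vectors
pet_to_wild = embedding["tiger"] - embedding["cat"]
canine_to_feline = embedding["cat"] - embedding["dog"]

# Applying the same vector to another word should land on the analogous word
print(np.allclose(embedding["dog"] + pet_to_wild, embedding["wolf"]))
print(np.allclose(embedding["wolf"] + canine_to_feline, embedding["tiger"]))

# L2 distance between two tokens in the embedding space
print(np.linalg.norm(embedding["cat"] - embedding["wolf"]))
```

In a learned embedding these relations hold only approximately, so one would look for the nearest neighbor of the shifted point rather than an exact match.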

One should keep in mind that __there is no one ideal word-embedding space__, as word embeddings are largely 
task-dependent. A good word-embedding space for classifying English-language movie reviews will very likely look different 
from one for French-language spam email detection.

There are two ways to get word embeddings:
- Learn word embeddings jointly with the main network training task (such as document classification or sentiment
  prediction). In this setup, you start with random word vectors and then learn word vectors in the same way you
  learn the weights of a neural network.
- Pretrained word embeddings: load into your model word embeddings that were precomputed using a different 
  machine-learning task than the one you’re trying to solve. 


## 3.1 Learning word embeddings with the Keras `Embedding` layer<a id="Part_3_1"></a>
<a href="#Top">Back to page top</a>

Since a good word-embedding space varies from task to task, sometimes it's reasonable to learn a new embedding
space for the task at hand. Keras offers an excellent tool to carry out the job: the `Embedding` layer. The `Embedding` layer can be understood as a dictionary that maps integer indices to dense vectors:

$$ \mbox{Word index}\,\Longrightarrow\,\mbox{Embedding layer}\,\Longrightarrow\,\mbox{Word vector} $$

The Keras `Embedding` layer has the following interface:
```python
   Embedding(input_dim, output_dim, input_length=None) 
```    
The meaning of the arguments:
- `input_dim`: __Size of the vocabulary, i.e. the largest token index + 1__. For example, setting 
  `input_dim=1000` means the largest token index can be no larger than 999. 
- `output_dim`: __Dimension of the embedding space__.
- `input_length`: This optional argument specifies __the length of input sequences__, when it is constant. 
  It is required if one is going to connect `Flatten()` then `Dense()` layers.
  
Let's look at an example:  

In [6]:
from keras.models import Sequential
from keras.layers import Embedding, Flatten

maxlen=100
model = Sequential()
model.add( Embedding(1000, 8, input_length=maxlen) )
model.add( Flatten() )
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 100, 8)            8000      
_________________________________________________________________
flatten_1 (Flatten)          (None, 800)               0         
Total params: 8,000
Trainable params: 8,000
Non-trainable params: 0
_________________________________________________________________


In this code snippet, 
1. `input_dim=1000` means we are considering 1000 unique tokens. So the maximal index should be no larger than 999.
2. `output_dim=8` implies the embedding space dimension is 8.
3. `input_length=maxlen` means that every input sequence holds exactly `maxlen=100` token indices, so each
   text has to be truncated or padded to 100 tokens before it reaches the model.

This model's input tensor has shape `(batch_size, input_length) = (batch_size, 100)`. After 
the `Embedding` layer, the output has shape `(batch_size, 100, 8)`. The `Flatten()` layer turns
the 3D tensor `(batch_size, 100, 8)` into a 2D one with shape `(batch_size, 100*8)`.
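
To make the "dictionary lookup" picture concrete, here is a NumPy-only sketch of what the layer computes in the forward pass. It is an illustration of the idea and not the Keras implementation: the random weights and the batch of 4 sequences are assumptions made for this example.

```python
import numpy as np

input_dim, output_dim, maxlen = 1000, 8, 100
rng = np.random.default_rng(0)

# The layer's only weight: one row of size output_dim per token index
embedding_matrix = rng.normal(size=(input_dim, output_dim))

# A made-up batch of 4 sequences, each holding maxlen token indices in [0, input_dim)
batch = rng.integers(0, input_dim, size=(4, maxlen))

# Looking up every index replaces it with its row of the matrix
embedded = embedding_matrix[batch]            # shape (4, 100, 8)
flattened = embedded.reshape(len(batch), -1)  # shape (4, 800)

print(embedding_matrix.size, embedded.shape, flattened.shape)
```

The matrix has `1000 * 8 = 8000` entries, which matches the parameter count reported by `model.summary()`.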
  
Let's look at another example where `input_length` is not specified:

In [7]:
from keras.layers import LSTM

model = Sequential()
model.add( Embedding(1000, 8) )
model.add( LSTM(32) )
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, None, 8)           8000      
_________________________________________________________________
lstm_1 (LSTM)                (None, 32)                5248      
Total params: 13,248
Trainable params: 13,248
Non-trainable params: 0
_________________________________________________________________


Now in this example, the model's input is of the shape `(batch_size, sequence_length)`, i.e. the second 
dimension is inferred dynamically from the input itself. As such, the model returns a floating point tensor 
of shape `(batch_size, sequence_length, output_dim)`.  This 3D tensor is then sent to an LSTM layer.
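
The parameter counts in the summary can be checked by hand. The short calculation below assumes the standard LSTM parameterization: four gates, each with an input kernel, a recurrent kernel, and a bias vector.

```python
input_dim, output_dim, units = 1000, 8, 32

# Embedding: one output_dim-sized vector per token index
embedding_params = input_dim * output_dim

# LSTM: 4 gates x (input kernel + recurrent kernel + bias)
lstm_params = 4 * (output_dim * units + units * units + units)

print(embedding_params, lstm_params, embedding_params + lstm_params)
# prints: 8000 5248 13248
```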

## 3.2 Using pretrained word embeddings<a id="Part_3_2"></a>
<a href="#Top">Back to page top</a>

When there is too little data available to learn an appropriate task-specific embedding space, one can
load embedding vectors from a precomputed embedding space that is highly structured and has useful properties.
The rationale for using pretrained word embeddings is basically the same as for using pretrained convolutional
neural networks. 

There are various precomputed databases of word embeddings that one can use in a Keras `Embedding` layer.
Examples include
- __Google <a href='https://code.google.com/archive/p/word2vec/'>Word2vec</a>__. It seems that the Google Code
  repository is no longer in use. In any case, Google provides a nice 
  <a href='https://www.tensorflow.org/tutorials/representation/word2vec'>tutorial</a>. 
- __<a href='https://nlp.stanford.edu/projects/glove/'>GloVe</a>__, or Global Vectors for Word Representation. The
  English vectors are trained on Wikipedia data and Common Crawl data. It is argued 
  <a href='https://towardsdatascience.com/beyond-word-embeddings-part-2-word-vectors-nlp-modeling-from-bow-to-bert-4ebd4711d0ec'>here</a> that Word2vec only takes local contexts into account, whereas GloVe uses neural methods to
  decompose the co-occurrence matrix into more expressive and dense word vectors. In practice, however, neither
  GloVe nor Word2vec has been shown to provide consistently better results. Rather, both should be evaluated on the given
  dataset.
- __<a href='https://github.com/facebookresearch/fastText'>Facebook fastText</a>__ builds on Word2vec by
  learning representations for each word and for the character n-grams found within each word. FastText has been
  reported to be more accurate than Word2vec vectors by several measures.
- <a href='https://www.tensorflow.org/hub/tutorials/text_classification_with_tf_hub'>TensorFlow Hub</a>'s
  <a href='https://tfhub.dev/s?module-type=text-embedding'>Text Embedding module</a>.
  
How do we use pretrained word embeddings? From the section on the `Embedding` layer, we know that an 
`Embedding` layer is essentially a dictionary that maps a word index to a vector in the embedding space.
Therefore, a pretrained word embedding can be loaded as a 2D matrix of shape `(max_words, embedding_dim)`, where 
row $i$ contains the `embedding_dim`-dimensional vector for the word of index $i$
in the reference word index built during tokenization. Index 0 does not stand for any word or token; it is
a placeholder.

Let's download the GloVe word embeddings from 2014 English Wikipedia as an example and take a closer look at 
the precomputed embedding. The file name is `glove.6B.zip`. After the file is unzipped, we'll load the file
`glove.6B.50d.txt`. As the file name suggests, the dimension of the embedding space is 50. The file has 400K 
tokens, but we'll read in the first 100 words for the purpose of demonstration.

In [10]:
import pandas as pd
import csv

In [18]:
glove_path = './../Keras/glove.6B/glove.6B.50d.txt'
GloveEmbedding = pd.read_table(glove_path, sep=" ", index_col=0, header=None, quoting=csv.QUOTE_NONE, nrows=100)
GloveEmbedding.head(10)

token,1,2,3,4,5,6,7,8,9,10,...,41,42,43,44,45,46,47,48,49,50
the,0.418,0.24968,-0.41242,0.1217,0.34527,-0.044457,-0.49688,-0.17862,-0.00066,-0.6566,...,-0.29871,-0.15749,-0.34758,-0.045637,-0.44251,0.18785,0.002785,-0.18411,-0.11514,-0.78581
",",0.013441,0.23682,-0.16899,0.40951,0.63812,0.47709,-0.42852,-0.55641,-0.364,-0.23938,...,-0.080262,0.63003,0.32111,-0.46765,0.22786,0.36034,-0.37818,-0.56657,0.044691,0.30392
.,0.15164,0.30177,-0.16763,0.17684,0.31719,0.33973,-0.43478,-0.31086,-0.44999,-0.29486,...,-6.4e-05,0.068987,0.087939,-0.10285,-0.13931,0.22314,-0.080803,-0.35652,0.016413,0.10216
of,0.70853,0.57088,-0.4716,0.18048,0.54449,0.72603,0.18157,-0.52393,0.10381,-0.17566,...,-0.34727,0.28483,0.075693,-0.062178,-0.38988,0.22902,-0.21617,-0.22562,-0.093918,-0.80375
to,0.68047,-0.039263,0.30186,-0.17792,0.42962,0.032246,-0.41376,0.13228,-0.29847,-0.085253,...,-0.094375,0.018324,0.21048,-0.03088,-0.19722,0.082279,-0.09434,-0.073297,-0.064699,-0.26044
and,0.26818,0.14346,-0.27877,0.016257,0.11384,0.69923,-0.51332,-0.47368,-0.33075,-0.13834,...,-0.069043,0.36885,0.25168,-0.24517,0.25381,0.1367,-0.31178,-0.6321,-0.25028,-0.38097
in,0.33042,0.24995,-0.60874,0.10923,0.036372,0.151,-0.55083,-0.074239,-0.092307,-0.32821,...,-0.48609,-0.008027,0.031184,-0.36576,-0.42699,0.42164,-0.11666,-0.50703,-0.027273,-0.53285
a,0.21705,0.46515,-0.46757,0.10082,1.0135,0.74845,-0.53104,-0.26256,0.16812,0.13182,...,0.13813,0.36973,-0.64289,0.024142,-0.039315,-0.26037,0.12017,-0.043782,0.41013,0.1796
"""",0.25769,0.45629,-0.76974,-0.37679,0.59272,-0.063527,0.20545,-0.57385,-0.29009,-0.13662,...,0.030498,-0.39543,-0.38515,-1.0002,0.087599,-0.31009,-0.34677,-0.31438,0.75004,0.97065
's,0.23727,0.40478,-0.20547,0.58805,0.65533,0.32867,-0.81964,-0.23236,0.27428,0.24265,...,-0.12342,0.65961,-0.51802,-0.82995,-0.082739,0.28155,-0.423,-0.27378,-0.007901,-0.030231


As the dataframe shows, each token (e.g. `the`, `of`, `to`, as well as punctuation marks) has its corresponding 
50-dimensional vector. 

With the precomputed vectors, the next step is to build an embedding matrix that maps each token
from the text data to its corresponding vector in the precomputed embedding. If a token
is not found in the precomputed embedding, it is customary to set its vector to zero.
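
The construction of such an embedding matrix can be sketched as follows. To keep the example self-contained, the dictionaries `embeddings_index` and `word_index` are tiny hypothetical stand-ins (with 4-dimensional vectors) for the vectors parsed from the GloVe file and for the word index produced by the tokenizer.

```python
import numpy as np

embedding_dim = 4  # 50 for glove.6B.50d.txt; kept small here for readability
max_words = 6      # number of rows in the matrix, including the placeholder row 0

# Hypothetical stand-in for the vectors read from a pretrained embedding file
embeddings_index = {
    "the": np.array([0.1, 0.2, 0.3, 0.4]),
    "cat": np.array([0.5, 0.6, 0.7, 0.8]),
    "dog": np.array([0.9, 1.0, 1.1, 1.2]),
}

# Hypothetical word index from tokenization (indices start at 1)
word_index = {"the": 1, "cat": 2, "sat": 3, "dog": 4}

embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i >= max_words:
        continue  # ignore words beyond the vocabulary limit
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector  # words without a pretrained vector stay all-zero

print(embedding_matrix)
```

Here "sat" has no pretrained vector, so row 3 stays zero, as does the placeholder row 0. In Keras, a matrix like this is typically loaded into the `Embedding` layer as its weights, and the layer is then frozen so that training does not overwrite the pretrained vectors.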