# Deep Learning for Text and Sequences
## Preamble

This series explores deep-learning models that can process text, timeseries and sequence data in general.

### Fundamental Algorithms
- Recurrent Neural Networks
- One-Dimensional (1D) Convnets

### Applications of Deep Learning Algorithms in Sequence Processing
1. Document Classification and Timeseries Classification
    - Identifying the author of an article
2. Timeseries Comparisons
    - Estimating how close two documents are
3. Sequence-to-Sequence Learning
    - Decoding an English sentence into Swahili
4. Sentiment Analysis
    - Classifying the sentiment of movie reviews as positive or negative
5. Timeseries Forecasting
    - Predicting the weather at a certain location, given recent weather data
   
### Our Area of Focus
- Sentiment Analysis of the IMDB Dataset
- Weather Temperature Forecasting

## Working With Text Data
Text can be treated either as a __sequence of characters__ or as a __sequence of words__. Deep-learning models __map the statistical structure of written language without understanding it in a human sense__.  
This is sufficient for many simple textual tasks.

Deep-learning for natural-language processing is <span style = "color: red;">___pattern recognition applied to words, sentences and paragraphs___</span>.  
Deep-learning models work with __numeric tensors only__.

### Discover your Vocabulary
1. __Vectorizing__: (in the context of text) The process of transforming text into numeric tensors in one of the following ways:
    - Segmenting text into words, and transforming each word into a vector
    - Segmenting text into characters, and transforming each character into 
        a vector
    - Extracting __n-grams__ of words or characters, and transforming each 
        n-gram into a vector
2. __Tokenizing__: The breaking of text into different units known as __tokens__.
3. __N-grams__: Overlapping groups of multiple consecutive words or characters.

All text-vectorization processes consist of applying some tokenization scheme and then associating numeric vectors with the generated tokens. These vectors, packed into __sequence tensors__, are fed into deep neural networks.  
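As a minimal sketch of the two simplest tokenization schemes listed above, the snippet below splits a sentence into word tokens and character tokens and gives each distinct token an integer index. The helper name `build_index` is hypothetical, and plain whitespace splitting is a simplifying assumption; practical tokenizers usually also normalize case and punctuation.

```python
def build_index(tokens):
    """Assign each distinct token an integer index, starting at 1."""
    index = {}
    for token in tokens:
        if token not in index:
            index[token] = len(index) + 1
    return index

text = "Kenya is a beautiful country"

word_tokens = text.split()   # word-level tokenization
char_tokens = list(text)     # character-level tokenization

word_index = build_index(word_tokens)
char_index = build_index(char_tokens)

# Each token is replaced by its integer index; these indices are what
# later get turned into vectors (one-hot encoded or embedded).
word_ids = [word_index[token] for token in word_tokens]
print(word_tokens)
print(word_ids)
print(len(char_tokens), "characters,", len(char_index), "distinct")
```

The same index-building step works for any token unit, which is why the choice of tokenization and the choice of vector representation can be made separately.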

### Ways to Associate a Vector With a Token
- One-hot encoding of tokens
- Token embedding (typically used exclusively for words, in which case it is called 
    word embedding)

## Understanding N-grams and Bags-of-words
N-grams are <span style = "color: red;">__groups of N (or fewer) consecutive words or characters that one can extract from a sentence__</span>.

### Examples
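The sets below are extracted from the sentence "The quick brown fox jumps over the lazy dog" and, following the "N (or fewer)" definition, include the shorter groups as well.

#### Word-based Bigrams (2-grams)
{"The", "The quick", "quick", "quick brown", "brown", "brown fox", "fox",
 "fox jumps", "jumps", "jumps over", "over", "over the", "the", "the lazy",
 "lazy", "lazy dog", "dog"}

#### Word-based Trigrams (3-grams)
{"The", "The quick", "quick", "quick brown", "brown", "brown fox", "fox",
 "fox jumps", "jumps", "jumps over", "over", "over the", "the", "the lazy",
 "lazy", "lazy dog", "dog",
 "The quick brown", "quick brown fox", "brown fox jumps", 
 "fox jumps over", "jumps over the", "over the lazy", "the lazy dog"}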
 
The term _bag_ refers to the fact that one is dealing with a set of tokens rather than a list or sequence, i.e., the tokens are in no specific order.  
This family of tokenization methods is called __bag-of-words__.
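The extraction itself is short enough to sketch. The function below is a hypothetical helper, not taken from any library, and assumes whitespace tokenization; it collects every group of 1 up to `n` consecutive words into a Python `set`, which matches the "N (or fewer)" definition and the unordered "bag" idea.

```python
def word_ngrams(text, n):
    """Return the set of all word groups of size 1..n found in text."""
    words = text.split()
    grams = set()
    for size in range(1, n + 1):
        # Slide a window of `size` words across the sentence
        for start in range(len(words) - size + 1):
            grams.add(" ".join(words[start:start + size]))
    return grams

sentence = "The quick brown fox jumps over the lazy dog"

bag_of_2grams = word_ngrams(sentence, 2)
bag_of_3grams = word_ngrams(sentence, 3)

print(sorted(bag_of_2grams))
print(len(bag_of_2grams), len(bag_of_3grams))
```

Because the result is a set, printing it without `sorted` may show the n-grams in any order, which is exactly the property that makes it a bag rather than a sequence.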

Bag-of-words _tends to be used in shallow language-processing models rather
than deep-learning models_ because it is not an order-preserving tokenization method.
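A small illustration of what losing word order means in practice, assuming a simple count-based bag built with `collections.Counter` (the helper `bag_of_words` is hypothetical): two sentences with opposite meanings produce identical bags.

```python
from collections import Counter

def bag_of_words(text):
    """Count word occurrences, ignoring case and word order."""
    return Counter(text.lower().split())

a = bag_of_words("the dog bites the man")
b = bag_of_words("the man bites the dog")

# Same words, same counts: the two bags cannot be told apart
print(a == b)
```

A model that only sees these bags has no way to distinguish who bit whom, which is the information a sequence model keeps.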

Extracting n-grams is a form of feature engineering, and deep learning replaces this _rigid and brittle approach_ with __hierarchical feature learning__. N-gram extraction remains suitable for lightweight, shallow text-processing models such as __logistic regression__ and __random forests__.

## Text-Vectorization Processes
### One-hot Encoding of Words and Characters
One-hot encoding consists of associating a unique integer index with every word and then
turning this integer index, _i_, into a binary vector of size _N_ (the size of the vocabulary).  
The vector is all zeros except for the _i_<sup>th</sup> entry, which is 1.

**Listing 1-1.** Word-level one-hot encoding

In [3]:
import numpy as np

# Define a list of text samples
samples = ["Kenya is a beautiful country", 
           "To find oneself, you must lose oneself"]

# Initialize an empty dictionary to store word indices
token_index = {}

# Tokenize the text samples and create a dictionary of unique 
# words with integer indices
for sample in samples:
    for word in sample.split():
        # Only add words that are not already in the dictionary
        if word not in token_index:
            # Assign the next free index; index 0 is left unassigned
            token_index[word] = len(token_index) + 1
print("Token index: ", token_index)

# Define the maximum sequence length
max_length = 10
# Initialize a NumPy array for one-hot encoding with appropriate dimensions
results = np.zeros(
    shape = (len(samples), max_length, max(token_index.values()) + 1))
print("Initial results array: ", results)

# Iterate over the samples and their indices
for i, sample in enumerate(samples):
    # Tokenize the sample into words, keeping at most max_length of them
    for j, word in enumerate(sample.split()[:max_length]):
        # Get the index of the word from the dictionary
        index = token_index.get(word)
        # Set the corresponding element in the results array to 1
        results[i, j, index] = 1
print("Final results array: ", results)

Token index:  {'Kenya': 1, 'is': 2, 'a': 3, 'beautiful': 4, 'country': 5, 'To': 6, 'find': 7, 'oneself,': 8, 'you': 9, 'must': 10, 'lose': 11, 'oneself': 12}
Initial results array:  [[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

 [[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

**Listing 1-2.** Character-level one-hot encoding (toy example)

In [4]:
import string

# All printable ASCII characters form the character vocabulary
characters = string.printable
# Map each character to a unique integer index, starting at 1
token_index = dict(zip(characters, range(1, len(characters) + 1)))
print("Token index dict: \n", token_index)

max_length = 50
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
print("Initial results array: \n", results)

for i, sample in enumerate(samples):
    # Only the first max_length characters of each sample are encoded
    for j, character in enumerate(sample[:max_length]):
        index = token_index.get(character)
        results[i, j, index] = 1
print("Final results array: \n", results)

Token index dict: 
 {'0': 1, '1': 2, '2': 3, '3': 4, '4': 5, '5': 6, '6': 7, '7': 8, '8': 9, '9': 10, 'a': 11, 'b': 12, 'c': 13, 'd': 14, 'e': 15, 'f': 16, 'g': 17, 'h': 18, 'i': 19, 'j': 20, 'k': 21, 'l': 22, 'm': 23, 'n': 24, 'o': 25, 'p': 26, 'q': 27, 'r': 28, 's': 29, 't': 30, 'u': 31, 'v': 32, 'w': 33, 'x': 34, 'y': 35, 'z': 36, 'A': 37, 'B': 38, 'C': 39, 'D': 40, 'E': 41, 'F': 42, 'G': 43, 'H': 44, 'I': 45, 'J': 46, 'K': 47, 'L': 48, 'M': 49, 'N': 50, 'O': 51, 'P': 52, 'Q': 53, 'R': 54, 'S': 55, 'T': 56, 'U': 57, 'V': 58, 'W': 59, 'X': 60, 'Y': 61, 'Z': 62, '!': 63, '"': 64, '#': 65, '$': 66, '%': 67, '&': 68, "'": 69, '(': 70, ')': 71, '*': 72, '+': 73, ',': 74, '-': 75, '.': 76, '/': 77, ':': 78, ';': 79, '<': 80, '=': 81, '>': 82, '?': 83, '@': 84, '[': 85, '\\': 86, ']': 87, '^': 88, '_': 89, '`': 90, '{': 91, '|': 92, '}': 93, '~': 94, ' ': 95, '\t': 96, '\n': 97, '\r': 98, '\x0b': 99, '\x0c': 100}
Initial results array: 
 [[[0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]


Keras has built-in utilities for one-hot encoding text at the word level or 
character level, starting from raw text data. They take care of important details such as 
stripping special characters from strings and only taking into account the _N_ most 
common words in the dataset.

**Listing 1-3.** Using Keras for word-level one-hot encoding

In [6]:
from keras.preprocessing.text import Tokenizer

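# Create a tokenizer configured to consider only the 1,000 most common words
tokenizer = Tokenizer(num_words = 1000)

# Build the word index from the samples
tokenizer.fit_on_texts(samples)

# Convert each string into a list of integer indices
sequences = tokenizer.texts_to_sequences(samples)
# Get the one-hot binary representation of each sample directly
one_hot_results = tokenizer.texts_to_matrix(samples, mode = 'binary')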

# Recover word index computed
word_index = tokenizer.word_index
print(f"Found {len(word_index)} unique tokens.")

Found 11 unique tokens.
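To make the "strip special characters, keep the _N_ most common words" idea concrete, here is a simplified stand-in written with the standard library only. It is not the Keras implementation; the function name `top_n_vocabulary` and the regular expression used for cleaning are assumptions chosen for illustration.

```python
import re
from collections import Counter

def top_n_vocabulary(texts, num_words):
    """Index the most frequent words, most common first, starting at 1."""
    counts = Counter()
    for text in texts:
        # Lowercase, then keep only runs of letters, digits and apostrophes
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    most_common = counts.most_common(num_words)
    return {word: rank for rank, (word, _) in enumerate(most_common, start=1)}

samples = ["Kenya is a beautiful country",
           "To find oneself, you must lose oneself"]

vocab = top_n_vocabulary(samples, num_words=1000)
print(len(vocab), "unique tokens")
print(vocab["oneself"])
```

Lowercasing and dropping the comma merge `oneself,` and `oneself` into a single token, which is why this cleaned vocabulary is one entry smaller than the 12-entry index built by plain `split()` in Listing 1-1.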


### Using Word Embeddings
![image.png](attachment:2d08a494-71b0-4fa4-9be3-d5610d6c0d76.png)  
Whereas the vectors obtained through one-hot encoding are __binary__, __sparse__
(mostly made of 0s) and __high-dimensional__, word embeddings are low-dimensional 
floating-point vectors (that is, dense vectors as opposed to sparse vectors).

Additionally, unlike the word vectors obtained via one-hot encoding, word embeddings __are 
learned from data.__
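The contrast in size and sparsity can be seen with two NumPy arrays. The vocabulary size of 10,000 and the embedding dimensionality of 64 are assumed values for illustration, and the dense vector is random here only because nothing has been trained.

```python
import numpy as np

vocabulary_size = 10_000   # assumed vocabulary size, for illustration
embedding_dim = 64         # assumed embedding dimensionality

# One-hot vector for the token with index 42: a single 1 among 10,000 entries
one_hot = np.zeros(vocabulary_size)
one_hot[42] = 1.0

# Dense vector: 64 floating-point values (random here, learned in practice)
rng = np.random.default_rng(0)
dense = rng.normal(size=embedding_dim)

print(one_hot.shape, int(np.count_nonzero(one_hot)))
print(dense.shape, dense.dtype)
```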

#### Ways of Obtaining Word Embeddings
1. Learn word embeddings jointly with the main task one cares about: start with random 
    word vectors and then learn them in the same way one learns the weights of a 
    neural network.
2. Load pretrained word embeddings into the model.

#### Learning Word Embeddings Using the Embedding Layer
The simplest way to associate a dense vector with a word is to choose the vector at 
random. A drawback is that __the resulting embedding space has no structure__. For instance, 
the words _accurately_ and _exactly_ may end up with completely different embeddings, despite 
their interchangeability in most sentences.  
It is difficult for a neural network to make sense of such a noisy, unstructured embedding 
space.

The geometric relationships between word vectors should reflect the semantic relationships 
between these words. Word embeddings are meant to map human language into a geometric space.  
For instance, in a reasonable embedding space, one would expect:
- synonyms to be embedded into similar word vectors
- the geometric distance (such as L2 distance) between any two word vectors to relate to the 
    semantic distance between the associated words
- specific directions in the embedding space to be meaningful
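The expectations above can be checked numerically once vectors are available. The sketch below uses hypothetical 3-dimensional vectors, hand-picked purely for illustration (they do not come from any trained model), and compares them with L2 distance and cosine similarity.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only
vectors = {
    "accurately": np.array([0.9, 0.1, 0.3]),
    "exactly":    np.array([0.8, 0.2, 0.3]),
    "banana":     np.array([-0.5, 0.9, -0.7]),
}

def l2_distance(a, b):
    """Euclidean distance between two vectors."""
    return float(np.linalg.norm(a - b))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

near = l2_distance(vectors["accurately"], vectors["exactly"])
far = l2_distance(vectors["accurately"], vectors["banana"])

# In a well-structured space, near-synonyms sit closer together
print(near < far)
print(cosine_similarity(vectors["accurately"], vectors["exactly"]))
```

With randomly initialized vectors there is no reason for `near < far` to hold, which is the lack of structure described earlier.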

What makes a good embedding space depends heavily on the task at hand because _the importance 
of certain semantic relationships varies from task to task_. Therefore, it is reasonable to 
learn a new embedding space with every new task. Fortunately, backpropagation makes 
this easy, and Keras makes it even easier: it comes down to learning the weights of a layer, 
__the Embedding layer__.
 
**Listing 1-4.** Instantiating an embedding layer

In [7]:
from keras.layers import Embedding
embedding_layer = Embedding(1000, 64)

The Embedding layer takes at least two arguments:
- Number of possible tokens (here 1,000)
- Dimensionality of the embeddings (here 64)

It is best understood as a dictionary that maps integer indices (which stand for specific 
words) to dense vectors. It takes integers as input, looks up these integers in an internal 
dictionary and returns the associated vectors. It is effectively a dictionary lookup.

The Embedding layer takes as input a __2D tensor of integers__ of __shape (samples, 
sequence_length)__, where each row is a sequence of integers.  
It can embed sequences of variable length, but all sequences in a batch must have the 
same length because they need to be packed into a single tensor. This is 
achieved by padding shorter sequences with zeros and truncating longer 
sequences.
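Padding and truncation can be sketched in a few lines of NumPy. The helper `pad_or_truncate` is hypothetical, and padding or cutting at the end of each sequence is an assumption made here for simplicity; library utilities may default to a different side.

```python
import numpy as np

def pad_or_truncate(sequences, max_length, pad_value=0):
    """Return an integer array of shape (len(sequences), max_length)."""
    batch = np.full((len(sequences), max_length), pad_value, dtype="int64")
    for row, sequence in enumerate(sequences):
        kept = sequence[:max_length]      # truncate sequences that are too long
        batch[row, :len(kept)] = kept     # shorter ones keep trailing zeros
    return batch

sequences = [[1, 2, 3, 4, 5], [6, 7], [8, 9, 10, 11, 12, 13, 14]]
batch = pad_or_truncate(sequences, max_length=5)
print(batch)
print(batch.shape)
```

Using 0 as the padding value is one reason token indices in the earlier listings start at 1: index 0 stays free to mean "no token".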

The layer returns a __3D floating-point tensor__ of __shape (samples, sequence_length, 
embedding_dimensionality)__. Such a 3D tensor can then be processed by a __Recurrent 
Neural Network__ layer or a __One-Dimensional (1D) Convolutional__ layer.
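The lookup and the resulting shapes can be imitated with plain NumPy indexing. This is a sketch of the idea only, not how Keras implements the layer; the random `embedding_matrix` stands in for the layer's weights, and the sizes mirror the `Embedding(1000, 64)` call from Listing 1-4.

```python
import numpy as np

num_tokens, embedding_dim = 1000, 64
rng = np.random.default_rng(0)

# Stand-in for the layer's weights: one row (vector) per token index
embedding_matrix = rng.normal(size=(num_tokens, embedding_dim))

# A batch of 2 sequences, each 5 integer token indices long
token_ids = np.array([[1, 2, 3, 4, 5],
                      [6, 7, 0, 0, 0]])

# Integer-array indexing fetches one row per index
embedded = embedding_matrix[token_ids]
print(token_ids.shape, "->", embedded.shape)
```

Each integer in the `(samples, sequence_length)` input is replaced by a 64-dimensional row, which is where the extra third axis comes from.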

When one instantiates an Embedding layer, its weights (its internal dictionary of token 
vectors) are initially random, just as with any other layer. During training, these word 
vectors are gradually adjusted via __backpropagation__, structuring the space into 
something the downstream model can exploit. Once fully trained, the embedding space 
will show a lot of structure: a kind of structure specialized for the specific 
problem for which one is training the model.
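A single hand-written gradient-descent step shows how such an adjustment touches the embedding weights. Everything here is a toy assumption made for illustration: the tiny matrix, the squared-error loss, the all-zero `targets` and the learning rate of 0.1. In a real model the gradients come from the downstream task and are computed by the framework.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(10, 4))    # 10 tokens, 4 dimensions
before = embedding_matrix.copy()

token_ids = np.array([2, 5, 2])                # token 2 appears twice
targets = np.zeros((3, 4))                     # hypothetical training signal

def loss(matrix):
    """Toy squared-error loss on the looked-up vectors."""
    return 0.5 * np.sum((matrix[token_ids] - targets) ** 2)

# Gradient w.r.t. the looked-up vectors, scattered back onto their rows;
# np.add.at accumulates the contributions of repeated indices.
grad = np.zeros_like(embedding_matrix)
np.add.at(grad, token_ids, embedding_matrix[token_ids] - targets)

loss_before = loss(embedding_matrix)
embedding_matrix -= 0.1 * grad                 # one gradient-descent step
loss_after = loss(embedding_matrix)

print(loss_after < loss_before)
changed = np.any(embedding_matrix != before, axis=1)
print(np.flatnonzero(changed))
```

Only the rows of tokens that occur in the batch receive a gradient, so vectors for unseen tokens stay at their random initial values until those tokens appear in training data.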