
To perform machine learning on text documents, the raw text cannot be fed directly to an algorithm, because most algorithms expect numerical feature vectors. We therefore first need to turn the text content into numerical feature vectors.

From the [scikit-learn documentation](https://scikit-learn.org/stable/modules/feature_extraction.html):
<b>
We call vectorization the general process of turning a collection of text documents into numerical feature vectors.
</b>
So vectorizing text is the process of transforming text into numeric tensors.

Vectorization can be done in multiple ways:
- Segment text into words, and transform each word into a vector.
- Segment text into characters, and transform each character into a vector.
- Extract n-grams of words or characters, and transform each n-gram into a vector.
N-grams are overlapping groups of multiple consecutive words or characters.
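
As a minimal sketch of the n-gram option above, the helper `ngrams` below (a hypothetical name, not part of any library) slides a window of size `n` over a sequence of tokens. Passing a list of words gives word n-grams, and passing a string gives character n-grams.

```python
def ngrams(tokens, n):
    """Return all overlapping groups of n consecutive tokens (hypothetical helper)."""
    tokens = list(tokens)
    # one n-gram starts at every position that still has n tokens ahead of it
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "My team Bayern lost today"
word_bigrams = ngrams(sentence.split(), 2)   # word-level 2-grams
char_trigrams = ngrams("good", 3)            # character-level 3-grams
print(word_bigrams)
print(char_trigrams)
```

Each n-gram could then be treated as a token in its own right and mapped to a vector in the same way as a single word or character.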

<b>Tokenization: </b> the segmentation of text into words or characters.

***Text-vectorization processes consist of applying some tokenization scheme to the text then associating numeric vectors with the generated tokens.***

# One-hot encoding of words and characters

In [1]:
import numpy as np
import string

In [2]:
samples = ['My team Bayern lost today','good boy']

In [3]:

class OneHot_w:
    """Word-level one-hot encoder for a list of sentences."""

    def __init__(self, max_length):
        # maximum number of words encoded per sentence
        self.max_length = max_length

    def Word_index(self, sentences):
        # assign each unique word an integer index, starting at 1 (0 is left unused)
        word_index = {}
        for sent in sentences:
            for word in sent.split():
                if word not in word_index:
                    word_index[word] = len(word_index) + 1
        return word_index

    def __call__(self, sentences):
        token_index = self.Word_index(sentences)
        # shape: (number of sentences, max_length, vocabulary size + 1)
        results = np.zeros((len(sentences), self.max_length, max(token_index.values()) + 1))
        for i, sentence in enumerate(sentences):
            # only the first max_length words of each sentence are encoded
            for j, word in enumerate(sentence.split()[:self.max_length]):
                index = token_index[word]
                results[i, j, index] = 1
        return results

In [4]:
oh=OneHot_w(max_length=6)
oh(samples)

array([[[0., 1., 0., 0., 0., 0., 0., 0.],
        [0., 0., 1., 0., 0., 0., 0., 0.],
        [0., 0., 0., 1., 0., 0., 0., 0.],
        [0., 0., 0., 0., 1., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.]],

       [[0., 0., 0., 0., 0., 0., 1., 0.],
        [0., 0., 0., 0., 0., 0., 0., 1.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.]]])

# Character-level one-hot encoding

In [5]:
samples = ['My team Bayern lost today','good boy']
characters = string.printable
# map each printable character to an integer index, starting at 1 (0 is left unused)
token_index = dict(zip(characters, range(1, len(characters) + 1)))
max_length = 25
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    # only the first max_length characters of each sample are encoded
    for j, character in enumerate(sample[:max_length]):
        index = token_index[character]
        results[i, j, index] = 1.

In [6]:
# the abbreviated display shows only the first and last three columns,
# none of which correspond to a character used in these samples
results

array([[[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]],

       [[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]])

In [7]:
from tensorflow.keras.preprocessing.text import Tokenizer

In [8]:
samples = ['My team Bayern lost today','good boy']

In [9]:
tokenizer=Tokenizer(num_words=10)
tokenizer.fit_on_texts(samples)

In [10]:
tokenizer.index_word

{1: 'my', 2: 'team', 3: 'bayern', 4: 'lost', 5: 'today', 6: 'good', 7: 'boy'}

In [11]:
tokenizer.document_count,tokenizer.num_words

(2, 10)

In [12]:
seq=tokenizer.texts_to_sequences(samples)
one_hot=tokenizer.sequences_to_matrix(seq)
one_hot

array([[0., 1., 1., 1., 1., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 1., 0., 0.]])

In [13]:
tokenizer.texts_to_matrix(samples)

array([[0., 1., 1., 1., 1., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 1., 0., 0.]])
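
Note that the matrices above have one row per document rather than one row per word: in the default `binary` mode, column `k` is 1 if the word with index `k` occurs anywhere in the document, so word order is not preserved. The NumPy sketch below reproduces that behaviour for illustration only; `sequences_to_binary_matrix` is a hypothetical name and not the Keras implementation.

```python
import numpy as np

def sequences_to_binary_matrix(sequences, num_words):
    """Hypothetical sketch of a binary document-by-word matrix."""
    matrix = np.zeros((len(sequences), num_words))
    for row, seq in enumerate(sequences):
        for index in seq:
            if index < num_words:  # indices beyond the vocabulary limit are skipped
                matrix[row, index] = 1.0
    return matrix

# integer sequences matching the word index shown above (1: 'my', ..., 7: 'boy')
seq = [[1, 2, 3, 4, 5], [6, 7]]
binary = sequences_to_binary_matrix(seq, num_words=10)
print(binary)
```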

# One-hot character encoding

In [14]:
tokens=Tokenizer(char_level=True)

In [15]:
tokens.fit_on_texts(samples)

In [16]:
tokens.index_word

{1: ' ',
 2: 'o',
 3: 'y',
 4: 't',
 5: 'a',
 6: 'm',
 7: 'e',
 8: 'b',
 9: 'd',
 10: 'r',
 11: 'n',
 12: 'l',
 13: 's',
 14: 'g'}

In [17]:
seqq=tokens.texts_to_sequences(samples)
tokens.sequences_to_matrix(seqq)

array([[0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0.],
       [0., 1., 1., 1., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 1.]])

# Using word embeddings

Another popular and powerful way to associate a vector with a word is the use of dense
word vectors, also called word embeddings. Unlike the one-hot encoding, word embeddings are learned from data.


<P>Consider the sentence "The cat sat on the mat". The vocabulary (or unique words) in this sentence is (cat, mat, on, sat, the). </P>
From <a href="https://www.tensorflow.org/tutorials/text/word_embeddings">TensorFlow documentation: Word embeddings</a>

### Word embeddings

> Word embeddings give us a way to use an efficient, dense representation in which similar words have a similar encoding. Importantly, you do not have to specify this encoding by hand. An embedding is a dense vector of floating point values (the length of the vector is a parameter you specify). Instead of specifying the values for the embedding manually, they are trainable parameters (weights learned by the model during training, in the same way a model learns weights for a dense layer). It is common to see word embeddings that are 8-dimensional (for small datasets), up to 1024-dimensions when working with large datasets. A higher dimensional embedding can capture fine-grained relationships between words, but takes more data to learn.
<img src="images/embedding2.png?raw=1" alt="Diagram of an embedding" width="400"/>
Above is a diagram for a word embedding. Each word is represented as a 4-dimensional vector of floating point values. Another way to think of an embedding is as "lookup table". After these weights have been learned, you can encode each word by looking up the dense vector it corresponds to in the table.
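
The "lookup table" view can be sketched with plain NumPy. This is only an illustration: the table below is filled with random numbers standing in for learned weights, and the names `embedding_table` and `word_to_index` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["cat", "mat", "on", "sat", "the"]
word_to_index = {word: i for i, word in enumerate(vocab)}

# stand-in for learned weights: one 4-dimensional row per vocabulary word
embedding_table = rng.normal(size=(len(vocab), 4)).astype("float32")

sentence = "the cat sat on the mat".split()
indices = np.array([word_to_index[w] for w in sentence])

# encoding a sentence is a row lookup by integer index
encoded = embedding_table[indices]
print(encoded.shape)  # (6, 4)
```

Both occurrences of "the" map to the same row of the table, which is what makes the representation a lookup rather than a per-position computation.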





# The Embedding layer
From <a href="https://www.tensorflow.org/tutorials/text/word_embeddings">TensorFlow documentation: Word embeddings</a>
> Keras makes it easy to use word embeddings. Take a look at the Embedding layer.
The Embedding layer can be understood as a lookup table that maps from integer indices (which stand for specific words) to dense vectors (their embeddings). The dimensionality (or width) of the embedding is a parameter you can experiment with to see what works well for your problem, much in the same way you would experiment with the number of neurons in a Dense layer.

> When you create an Embedding layer, the weights for the embedding are randomly initialized (just like any other layer). During training, they are gradually adjusted via backpropagation. Once trained, the learned word embeddings will roughly encode similarities between words (as they were learned for the specific problem your model is trained on). If you pass an integer to an embedding layer, the result replaces each integer with the vector from the embedding table:

In [18]:
from tensorflow.keras.layers import Embedding,Flatten,Dense,Dropout
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.datasets import imdb
import tensorflow as tf
from tensorflow.keras.models import Sequential

For text or sequence problems, the Embedding layer takes as input a 2D tensor of integers of shape (samples, sequence_length) and returns a 3D floating-point tensor of shape (samples, sequence_length, embedding_dimensionality). All sequences in a batch must have the same length, because they are packed into a single tensor, so shorter sequences should be padded with zeros and longer sequences should be truncated.
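
The padding and truncation step can be sketched in plain Python. The helper `pad_or_truncate` is hypothetical and written only for illustration; it pads and truncates at the front of each sequence, which corresponds to the `'pre'` settings of `pad_sequences` used later in this notebook.

```python
def pad_or_truncate(sequence, maxlen, value=0):
    """Hypothetical helper: left-pad with `value` or keep the last `maxlen` items."""
    sequence = list(sequence)
    if len(sequence) >= maxlen:
        # too long: drop items from the front
        return sequence[len(sequence) - maxlen:]
    # too short: add padding at the front
    return [value] * (maxlen - len(sequence)) + sequence

batch = [[5, 8], [3, 1, 4, 1, 5, 9], [7, 7, 7, 7]]
padded = [pad_or_truncate(seq, maxlen=4) for seq in batch]
print(padded)
```

After this step every sequence has length 4, so the batch can be packed into a single integer tensor of shape (3, 4).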

In [19]:
embedding_layer=Embedding(10,5)

In [20]:
result = embedding_layer(tf.constant([[1,2],[2,4]]))
result.numpy()

array([[[ 0.00124159,  0.02829922,  0.02806177, -0.01132657,
          0.01436049],
        [ 0.02277093,  0.00207782,  0.00144293,  0.03614566,
          0.01122695]],

       [[ 0.02277093,  0.00207782,  0.00144293,  0.03614566,
          0.01122695],
        [ 0.0148036 , -0.01145764, -0.02028   ,  0.00362878,
         -0.03939173]]], dtype=float32)

# Loading the IMDB data for use with an Embedding layer

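You’ll restrict the movie reviews to the 15,000 most common words and keep only 30 words from each review. Note that with `truncating='pre'`, as used below, `pad_sequences` drops words from the start of a review, so it is the last 30 words that are kept. The network will learn a 16-dimensional embedding for each of the 15,000 words.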

In [21]:
# keep the 15000 most common words
max_features=15000
embedding_dim=16
max_length=30
(x_train, y_train), (x_test, y_test)=imdb.load_data(num_words=max_features)

In [22]:
x_train_pad=pad_sequences(x_train,maxlen=max_length,padding='pre',truncating='pre')
x_test_pad=pad_sequences(x_test,maxlen=max_length,padding='pre',truncating='pre')

In [23]:
x_test_pad[1]

array([  49,  238,   60,  135, 1162,   14,    9,  290,    4,   58,   10,
         10,  472,   45,   55,  878,    8,  169,   11,  374, 5687,   25,
        203,   28,    8,  818,   12,  125,    4, 3077])

In [24]:
x_train_pad[1]

array([ 371,   78,   22,  625,   64, 1382,    9,    8,  168,  145,   23,
          4, 1690,   15,   16,    4, 1355,    5,   28,    6,   52,  154,
        462,   33,   89,   78,  285,   16,  145,   95])

In [29]:
model=Sequential()
model.add(Embedding(input_dim=max_features,output_dim=embedding_dim,input_length=max_length))
model.add(Flatten())
model.add(Dense(14,activation='relu'))
model.add(Dropout(0.05))
model.add(Dense(1,activation='sigmoid'))

In [30]:
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 30, 16)            240000    
_________________________________________________________________
flatten_1 (Flatten)          (None, 480)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 14)                6734      
_________________________________________________________________
dropout_1 (Dropout)          (None, 14)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 15        
Total params: 246,749
Trainable params: 246,749
Non-trainable params: 0
_________________________________________________________________
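
The parameter counts in the summary above can be checked by hand. The arithmetic below is a small sketch using the hyperparameters from this notebook; the variable name `hidden_units` is introduced here only for readability.

```python
max_features = 15000   # vocabulary size (Embedding input_dim)
embedding_dim = 16     # Embedding output_dim
max_length = 30        # tokens kept per review
hidden_units = 14      # units in the first Dense layer

embedding_params = max_features * embedding_dim             # one row per word
flatten_size = max_length * embedding_dim                   # Flatten has no weights
dense_params = flatten_size * hidden_units + hidden_units   # weights + biases
output_params = hidden_units * 1 + 1                        # weights + bias

total_params = embedding_params + dense_params + output_params
print(embedding_params, flatten_size, dense_params, output_params, total_params)
```

Almost all of the trainable parameters sit in the embedding table, which is typical for small text models with a large vocabulary.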


In [31]:
history = model.fit(x_train_pad, y_train,epochs=6,batch_size=32,validation_split=0.2)

Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
