### Word Embeddings
A word embedding is a mapping of words or phrases to vectors of real numbers.

Two popular methods of learning word embeddings from text include:

1. Word2Vec.
2. GloVe.

A word embedding can also be learned as part of a deep learning model. This is slower than using **Word2Vec** or **GloVe**, but it tailors the embedding to the specific training dataset.
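To make the idea of "words as vectors" concrete, here is a minimal sketch using a hand-written lookup table. The three-dimensional vectors are made up for illustration (real embeddings are learned from text and typically have far more dimensions), and `toy_embedding` and `cosine_similarity` are hypothetical names, not part of any library.

```python
import math

# Hypothetical 3-dimensional embeddings with made-up values.
toy_embedding = {
    "good":  [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "poor":  [-0.7, 0.6, 0.1],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_good_great = cosine_similarity(toy_embedding["good"], toy_embedding["great"])
sim_good_poor = cosine_similarity(toy_embedding["good"], toy_embedding["poor"])

# With these made-up vectors, "good" is closer to "great" than to "poor".
print(sim_good_great > sim_good_poor)  # True
```

In a useful embedding, words that appear in similar contexts end up with similar vectors, which is what the similarity comparison is meant to suggest.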

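### Embedding Layer in a Deep Learning Library
Keras offers an Embedding layer that is used as the first hidden layer of a neural network. It takes integer-encoded words as input and maps each integer to a dense vector that is learned during training.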


### Learning an Embedding from 'Jumanji: Welcome to the jungle' Review

In [1]:
# import from the public tensorflow.keras API rather than the private tensorflow.python.keras path
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Embedding

In [90]:
reviews = ['Well scripted!',
        'Good movie',
        'Great effort',
        'nice work',
        'Excellent!',
        'cool',
        'poor movie!',
        'not cool',
        'poor work',
        'Could have done better.']


# define class labels

labels = [1,1,1,1,1,1,0,0,0,0]

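Next, integer encode each review.
**Keras** provides the `one_hot` function, which hashes each word to an integer. This is an efficient encoding, but because it is a hash, two different words can end up with the same integer.

Let's estimate a vocabulary size of 50, well above the 16 distinct words in the reviews, to reduce the chance of such collisions.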

In [91]:
# integer encode the review
vocab_size = 50
encoded_reviews = [one_hot(r, vocab_size) for r in reviews]
print(encoded_reviews)

[[22, 43], [25, 45], [31, 10], [7, 46], [41], [18], [19, 45], [7, 18], [19, 46], [29, 19, 3, 11]]
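In the output above, `'nice'` and `'not'` both map to 7, which is a hash collision. The sketch below illustrates the general hashing idea in plain Python. It is not the Keras implementation: `hash_encode` is a hypothetical helper, and the choice of `md5` and of reserving index 0 are assumptions made for this example, so the integers it produces will differ from those printed above.

```python
import hashlib
import string

def hash_encode(text, n):
    """Map each word of `text` to an integer in [1, n - 1] using a stable hash."""
    # Lowercase, turn punctuation into spaces, then split on whitespace.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.lower().translate(table).split()
    # md5 is used here because Python's built-in hash() for strings
    # changes from one interpreter run to the next.
    return [int(hashlib.md5(w.encode("utf-8")).hexdigest(), 16) % (n - 1) + 1
            for w in words]

sample_reviews = ['Good movie', 'poor movie!', 'not cool']
encoded = [hash_encode(r, 50) for r in sample_reviews]
print(encoded)
```

The same word always receives the same integer (here `'movie'` in the first two reviews), but nothing prevents two different words from sharing one. A larger `n` makes that less likely.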


The sequences have different lengths, but Keras expects all inputs in a batch to have the same length. We will pad every input sequence with zeros to a length of 4, the length of the longest review.

In [72]:
# pad documents to a max length of 4 words
max_length = 4
padded_reviews = pad_sequences(encoded_reviews, maxlen=max_length, padding='post')
print(padded_reviews)

[[22 43  0  0]
 [25 45  0  0]
 [31 10  0  0]
 [ 7 46  0  0]
 [41  0  0  0]
 [18  0  0  0]
 [19 45  0  0]
 [ 7 18  0  0]
 [19 46  0  0]
 [29 19  3 11]]
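The padding step can be reproduced without Keras. The following is a minimal sketch of post-padding; `pad_post` is a hypothetical helper, and it assumes that sequences longer than `maxlen` lose items from the front, which is the default truncation behaviour of `pad_sequences`.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value` up to `maxlen` items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep the last `maxlen` items if too long
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

sample = [[22, 43], [41], [29, 19, 3, 11]]
print(pad_post(sample, 4))
# [[22, 43, 0, 0], [41, 0, 0, 0], [29, 19, 3, 11]]
```

Unlike this sketch, `pad_sequences` returns a NumPy array rather than a list of lists.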


We are now ready to define our Embedding layer as part of our neural network model.

The Embedding has a vocabulary of 50 and an input length of 4. We will choose a small embedding space of 8 dimensions.

The model is a simple binary classification model. Importantly, the output from the Embedding layer will be 4 vectors of 8 dimensions each, one for each word. We flatten this into a single 32-element vector to pass on to the Dense output layer.
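The shapes described above can be traced in plain Python. This is only a sketch of what a lookup followed by flattening does, with a randomly filled weight table standing in for the learned embedding; `weights`, `embed`, and `flatten` are hypothetical names.

```python
import random

vocab_size, embedding_dim, max_length = 50, 8, 4

# Hypothetical weight table: one row of 8 numbers per word index.
random.seed(0)
weights = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
           for _ in range(vocab_size)]

def embed(sequence):
    """Replace each integer index by its row in the weight table."""
    return [weights[i] for i in sequence]

def flatten(vectors):
    """Concatenate a list of vectors into one flat list."""
    return [x for vec in vectors for x in vec]

padded_review = [25, 45, 0, 0]  # 'Good movie' after encoding and padding
embedded = embed(padded_review)
flat = flatten(embedded)
print(len(embedded), len(embedded[0]), len(flat))  # 4 8 32
```

Note that the padding index 0 also has its own row in the table, so padded positions contribute values to the flattened vector as well.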

In [73]:
# define the model
model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# summarize the model
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_10 (Embedding)     (None, 4, 8)              400       
_________________________________________________________________
flatten_10 (Flatten)         (None, 32)                0         
_________________________________________________________________
dense_10 (Dense)             (None, 1)                 33        
Total params: 433
Trainable params: 433
Non-trainable params: 0
_________________________________________________________________
None
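The parameter counts in the summary can be checked by hand. The short calculation below is a sketch that uses only the sizes chosen above.

```python
vocab_size, embedding_dim, max_length = 50, 8, 4

embedding_params = vocab_size * embedding_dim   # one 8-value vector per word index
flattened_size = max_length * embedding_dim     # 4 vectors of 8 values each
dense_params = flattened_size * 1 + 1           # 32 weights plus 1 bias

print(embedding_params, dense_params, embedding_params + dense_params)
# 400 33 433
```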


Finally, we fit the classification model and evaluate it on the training data.

In [75]:
# fit the model
model.fit(padded_reviews, labels, epochs=50, verbose=0)
# evaluate the model
loss, accuracy = model.evaluate(padded_reviews, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))

Accuracy: 80.000001


### Using Pre-Trained GloVe Embedding

The Keras Embedding layer can also use a word embedding learned elsewhere.

It is common in the field of Natural Language Processing to learn, save, and make freely available word embeddings.

For example, the researchers behind the GloVe method provide a suite of pre-trained word embeddings on their website, released under a public domain license. See:

[GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/projects/glove/)

In this case, we need to be able to map words to integers as well as integers to words.

Keras provides a `Tokenizer` class that can be fit on the training data. Calling its `texts_to_sequences()` method converts text to integer sequences consistently, and its `word_index` attribute exposes the dictionary that maps words to integers.

In [77]:
from tensorflow.keras.preprocessing.text import Tokenizer

In [80]:
# prepare tokenizer
t = Tokenizer()
t.fit_on_texts(reviews)
vocab_size = len(t.word_index) + 1
# integer encode the documents
encoded_reviews = t.texts_to_sequences(reviews)
print(encoded_reviews)
# pad documents to a max length of 4 words
max_length = 4
padded_reviews = pad_sequences(encoded_reviews, maxlen=max_length, padding='post')
print(padded_reviews)

[[5, 6], [7, 1], [8, 9], [10, 2], [11], [3], [4, 1], [12, 3], [4, 2], [13, 14, 15, 16]]
[[ 5  6  0  0]
 [ 7  1  0  0]
 [ 8  9  0  0]
 [10  2  0  0]
 [11  0  0  0]
 [ 3  0  0  0]
 [ 4  1  0  0]
 [12  3  0  0]
 [ 4  2  0  0]
 [13 14 15 16]]
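To see where these integers come from, the word-to-integer mapping can be rebuilt in plain Python. This is a sketch, not the `Tokenizer` source: it assumes words are ranked by frequency, with ties kept in order of first appearance, and that index 0 is reserved for padding. `tokenize`, `word_index`, and `index_word` are names chosen for this example.

```python
import string

reviews = ['Well scripted!', 'Good movie', 'Great effort', 'nice work',
           'Excellent!', 'cool', 'poor movie!', 'not cool', 'poor work',
           'Could have done better.']

def tokenize(text):
    """Lowercase, turn punctuation into spaces, and split on whitespace."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table).split()

# Count words; dicts keep keys in order of first appearance.
counts = {}
for review in reviews:
    for word in tokenize(review):
        counts[word] = counts.get(word, 0) + 1

# sorted() is stable, so words with equal counts keep their first-seen order.
ranked = sorted(counts, key=counts.get, reverse=True)
word_index = {word: i for i, word in enumerate(ranked, start=1)}
index_word = {i: word for word, i in word_index.items()}

encoded = [[word_index[w] for w in tokenize(r)] for r in reviews]
print(encoded)
# [[5, 6], [7, 1], [8, 9], [10, 2], [11], [3], [4, 1], [12, 3], [4, 2], [13, 14, 15, 16]]
print(index_word[1], index_word[4])  # movie poor
```

The four words that occur twice (`movie`, `work`, `cool`, `poor`) receive the lowest indices, and `len(word_index) + 1` gives the vocabulary size of 17 once the padding index is counted.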


Next, we need to load the entire GloVe word embedding file into memory as a dictionary of word to embedding array.

In [81]:
# load the whole embedding into memory
import numpy as np

embeddings_index = dict()
# each line holds a word followed by its 100 coefficients
with open('glove.6B.100d.txt', encoding='utf8') as f:
	for line in f:
		values = line.split()
		word = values[0]
		coefs = np.asarray(values[1:], dtype='float32')
		embeddings_index[word] = coefs
print('Loaded %s word vectors.' % len(embeddings_index))

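FileNotFoundError: [Errno 2] No such file or directory: 'glove.6B.100d.txt'

The error above means `glove.6B.100d.txt` is not in the working directory. It has to be downloaded from the GloVe page (it is part of the `glove.6B.zip` archive) and extracted before this cell will run.

Loading the whole file is pretty slow, since it holds 400,000 word vectors. It might be better to filter the embedding for the unique words in your training data.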
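The filtering idea can be sketched as follows. `load_filtered` is a hypothetical helper, and the three lines of "GloVe" data are a made-up, 3-dimensional stand-in so that the example runs without the real file.

```python
import io

def load_filtered(lines, wanted):
    """Parse 'word v1 v2 ...' lines, keeping only words found in `wanted`."""
    index = {}
    for line in lines:
        values = line.split()
        if not values:
            continue  # skip blank lines
        word = values[0]
        if word in wanted:
            index[word] = [float(v) for v in values[1:]]
    return index

# Made-up stand-in for the GloVe text file.
fake_glove = io.StringIO(
    "movie 0.1 0.2 0.3\n"
    "the 0.4 0.5 0.6\n"
    "cool -0.1 0.0 0.7\n"
)
wanted = {"movie", "cool", "scripted"}
filtered_index = load_filtered(fake_glove, wanted)
print(sorted(filtered_index))  # ['cool', 'movie']
```

With the real file, the open file handle could be passed as `lines` and the keys of the tokenizer's `word_index` as `wanted`. The file is still read line by line, but only the needed vectors are converted and kept in memory.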

Next, we need to create a matrix of one embedding for each word in the training dataset. We can do that by enumerating all unique words in the Tokenizer.word_index and locating the embedding weight vector from the loaded GloVe embedding.

The result is a matrix of weights only for words we will see during training.

In [84]:
import numpy as np

In [85]:
# create a weight matrix for words in training docs
embedding_matrix = np.zeros((vocab_size, 100))
for word, i in t.word_index.items():
	embedding_vector = embeddings_index.get(word)
	if embedding_vector is not None:
		embedding_matrix[i] = embedding_vector
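Words without a pre-trained vector keep an all-zero row in the matrix, as does row 0, which corresponds to the padding value. A quick way to see which words are affected is sketched below with a made-up three-word vocabulary and 2-dimensional vectors; all names and values here are hypothetical.

```python
toy_word_index = {"movie": 1, "cool": 2, "scripted": 3}        # hypothetical subset
toy_embeddings = {"movie": [0.1, 0.2], "cool": [0.3, 0.4]}      # made-up vectors

dim = 2
# One row per word index, plus row 0 for padding.
toy_matrix = [[0.0] * dim for _ in range(len(toy_word_index) + 1)]
missing = []
for word, i in toy_word_index.items():
    vector = toy_embeddings.get(word)
    if vector is None:
        missing.append(word)  # row i stays all zeros
    else:
        toy_matrix[i] = list(vector)

print(missing)        # ['scripted']
print(toy_matrix[3])  # [0.0, 0.0]
```

If many words end up in `missing`, the frozen embedding carries little information for them, which is worth knowing before training.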

Now we can define our model, fit, and evaluate it as before.

The key difference is that the Embedding layer can be seeded with the GloVe word embedding weights. We chose the 100-dimensional version, so the Embedding layer must be defined with `output_dim` set to 100. Finally, we do not want to update the pre-trained word weights while fitting this model, so we set the layer's `trainable` argument to `False`.

In [86]:
# define model
model = Sequential()
e = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=4, trainable=False)
model.add(e)
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# summarize the model
print(model.summary())
# fit the model
model.fit(padded_reviews, labels, epochs=50, verbose=0)
# evaluate the model
loss, accuracy = model.evaluate(padded_reviews, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_11 (Embedding)     (None, 4, 100)            1700      
_________________________________________________________________
flatten_11 (Flatten)         (None, 400)               0         
_________________________________________________________________
dense_11 (Dense)             (None, 1)                 401       
Total params: 2,101
Trainable params: 401
Non-trainable params: 1,700
_________________________________________________________________
None