# Word embeddings

Word embeddings give us a way to use an efficient, dense representation in which similar words have a similar encoding. Importantly, we do not have to specify this encoding by hand. An embedding is a dense vector of floating point values (the length of the vector is a parameter you specify).

Word embeddings commonly range from 8 dimensions (for small datasets) up to 1024 dimensions (for large datasets). A higher-dimensional embedding can capture fine-grained relationships between words, but takes more data to learn.

<img src="../../data/pictures/embedding2.png" />

In [1]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

import tensorflow_datasets as tfds
tfds.disable_progress_bar()

In [2]:
embedding_layer = layers.Embedding(1000, 5)

When you create an Embedding layer, the weights for the embedding are randomly initialized (just like any other layer). During training, they are gradually adjusted via backpropagation. Once trained, the learned word embeddings will roughly encode similarities between words.

In [9]:
# Look up the embedding vectors for token ids 1, 2 and 999
embedding_layer(tf.constant([1, 2, 999]))

<tf.Tensor: id=27, shape=(3, 5), dtype=float32, numpy=
array([[-0.02030228, -0.00625715,  0.03203246, -0.00734886, -0.03341927],
       [ 0.04393892, -0.0441018 ,  0.04863973, -0.04373846, -0.01031322],
       [ 0.03667463,  0.02193252,  0.04739991,  0.0125013 ,  0.04758617]],
      dtype=float32)>
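
One way to picture the lookup above: an Embedding layer behaves like a table with one row per integer id, and calling it selects rows. The NumPy sketch below is illustrative only. `table` is a random stand-in for the layer's weights (not the values printed above), and the uniform range is an assumption made for the demo. It also checks that row lookup gives the same result as multiplying a one-hot encoding by the table, so the dense lookup is simply the cheaper of two equivalent operations.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 1000, 5

# Stand-in for the layer's weight matrix (random, like a freshly created layer).
table = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim)).astype(np.float32)

ids = np.array([1, 2, 999])

# Embedding lookup: pick one row of the table per id.
looked_up = table[ids]

# Same result via an explicit one-hot encoding followed by a matrix product.
one_hot = np.eye(vocab_size, dtype=np.float32)[ids]
via_matmul = one_hot @ table

print(looked_up.shape)                      # (3, 5)
print(np.allclose(looked_up, via_matmul))   # True
```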

For text or sequence problems, the Embedding layer takes a 2D tensor of integers, of shape `(samples, sequence_length)`, where each entry is a sequence of integers.

The returned tensor has one more axis than the input; the embedding vectors are aligned along the new last axis. Pass it a `(2, 3)` input batch and the output is `(2, 3, N)`, that is, `(samples, sequence_length, embedding_dimensionality)`.

In [10]:
result = embedding_layer(tf.constant([[0,1,2],[3,4,5]]))
result.shape

TensorShape([2, 3, 5])
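
The extra axis can be reproduced with plain NumPy indexing, which may help when reasoning about shapes. This is a sketch, not how Keras implements the layer; `table` is a hypothetical stand-in weight matrix filled with consecutive numbers so that rows are easy to recognize.

```python
import numpy as np

# Stand-in weight matrix: 1000 ids, 5 values per id.
table = np.arange(1000 * 5, dtype=np.float32).reshape(1000, 5)

batch = np.array([[0, 1, 2], [3, 4, 5]])   # shape (samples, sequence_length) = (2, 3)
out = table[batch]                         # one 5-vector per integer in the batch

print(out.shape)                           # (2, 3, 5)
# Position [1, 2] of the batch holds id 5, so it maps to row 5 of the table.
print(np.array_equal(out[1, 2], table[5])) # True
```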

## Learn embeddings from scratch

In [12]:
(train_data, test_data), info = tfds.load('imdb_reviews/subwords8k', 
                                          split = (tfds.Split.TRAIN, tfds.Split.TEST), 
                                          with_info=True,
                                          as_supervised=True)

In [18]:
encoder = info.features['text'].encoder
encoder.subwords[:20]

['the_',
 ', ',
 '. ',
 'a_',
 'and_',
 'of_',
 'to_',
 's_',
 'is_',
 'br',
 'in_',
 'I_',
 'that_',
 'this_',
 'it_',
 ' /><',
 ' />',
 'was_',
 'The_',
 'as_']
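
The vocabulary above consists of word pieces rather than whole words, so a word that is not in the vocabulary can still be spelled out from smaller pieces. The toy function below sketches the general idea with a greedy longest-match split. It is not the `tensorflow_datasets` implementation, and `greedy_subwords` and `toy_vocab` are hypothetical names introduced only for this illustration.

```python
def greedy_subwords(token, vocab):
    """Split `token` into the longest vocabulary pieces, scanning left to right."""
    pieces, start = [], 0
    while start < len(token):
        # Try the longest candidate first, then shrink it.
        for end in range(len(token), start, -1):
            if token[start:end] in vocab:
                pieces.append(token[start:end])
                start = end
                break
        else:
            # Nothing in the vocabulary matches: fall back to a single character.
            pieces.append(token[start])
            start += 1
    return pieces


toy_vocab = {"the_", "was_", "is_", "s_", "br", "in_"}

print(greedy_subwords("was_", toy_vocab))     # ['was_']
print(greedy_subwords("brains_", toy_vocab))  # ['br', 'a', 'i', 'n', 's_']
```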

In [14]:
train_batches = train_data.shuffle(1000).padded_batch(10, padded_shapes=([None],[]))
test_batches = test_data.shuffle(1000).padded_batch(10, padded_shapes=([None],[]))

In [17]:
train_batch, train_labels = next(iter(train_batches))
train_batch.numpy()

array([[ 444,   18,  122, ...,    0,    0,    0],
       [2307, 1031,  800, ...,    0,    0,    0],
       [ 156,   37, 1167, ...,    0,    0,    0],
       ...,
       [1071,    2,    4, ...,    0,    0,    0],
       [  62,    9,  281, ...,    0,    0,    0],
       [  62,   27,   18, ...,    0,    0,    0]])
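
The trailing zeros in the batch above come from padding: reviews have different lengths, so shorter ones are filled with 0 up to the length of the longest review in the batch. A minimal NumPy sketch of that idea follows. `pad_batch` is a hypothetical helper, not part of `tf.data`, and the three short "reviews" are made-up id sequences.

```python
import numpy as np


def pad_batch(sequences, pad_value=0):
    """Right-pad variable-length id sequences to the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    batch = np.full((len(sequences), max_len), pad_value, dtype=np.int64)
    for row, seq in enumerate(sequences):
        batch[row, :len(seq)] = seq
    return batch


reviews = [[444, 18, 122], [2307, 1031], [156]]
padded = pad_batch(reviews)
print(padded)
# [[ 444   18  122]
#  [2307 1031    0]
#  [ 156    0    0]]
```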

#### Create a simple model

In [21]:
embedding_dim = 16

model = keras.Sequential([
  layers.Embedding(encoder.vocab_size, embedding_dim),
  layers.GlobalAveragePooling1D(),
  layers.Dense(16, activation='relu'),
  layers.Dense(1)
])

model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, None, 16)          130960    
_________________________________________________________________
global_average_pooling1d_1 ( (None, 16)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 16)                272       
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 17        
Total params: 131,249
Trainable params: 131,249
Non-trainable params: 0
_________________________________________________________________
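
The parameter counts in the summary can be checked by hand. The arithmetic below assumes a vocabulary size of 8185 (the value the notebook reports for this encoder) and the layer sizes defined in the model above.

```python
vocab_size, embedding_dim, hidden_units = 8185, 16, 16

# Embedding: one 16-dimensional vector per vocabulary entry, no bias.
embedding_params = vocab_size * embedding_dim

# GlobalAveragePooling1D only averages over the sequence axis: nothing to learn.
pooling_params = 0

# Dense(16): a 16x16 weight matrix plus 16 biases.
hidden_params = embedding_dim * hidden_units + hidden_units

# Dense(1): 16 weights plus 1 bias.
output_params = hidden_units * 1 + 1

total = embedding_params + pooling_params + hidden_params + output_params
print(embedding_params, hidden_params, output_params, total)
# 130960 272 17 131249
```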


#### Compile and train the model

In [41]:
callbacks = [
    tf.keras.callbacks.TensorBoard(log_dir='/tf/tb_logs')
]

In [43]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

history = model.fit(
    train_batches,
    epochs=10,
    validation_data=test_batches,
    validation_steps=20,
    callbacks=callbacks
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
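
The last layer of the model is `Dense(1)` with no activation, so it outputs a raw score (a logit) rather than a probability, which is why the loss is created with `from_logits=True`. The NumPy sketch below shows a numerically stable way to compute binary cross-entropy directly from logits. `bce_from_logits` is a hypothetical helper written for illustration and is not the Keras implementation; the labels and logits are made up.

```python
import numpy as np


def bce_from_logits(labels, logits):
    """Mean binary cross-entropy computed from raw scores (logits)."""
    labels = np.asarray(labels, dtype=np.float64)
    logits = np.asarray(logits, dtype=np.float64)
    # Stable rewrite of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))].
    per_example = (np.maximum(logits, 0.0)
                   - logits * labels
                   + np.log1p(np.exp(-np.abs(logits))))
    return per_example.mean()


labels = [1.0, 0.0, 1.0]
logits = [2.0, -1.0, 0.0]
print(bce_from_logits(labels, logits))

# A logit of 0 corresponds to a probability of 0.5, so the loss is log(2).
print(np.isclose(bce_from_logits([1.0], [0.0]), np.log(2.0)))  # True
```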


Here the model is overfitting (look at the difference between the `validation loss` and the `training loss`).

In this case, however, we are interested in the embedding layer that we just trained.

#### Retrieve the learned embeddings

Next, let's retrieve the word embeddings learned during training. This will be a matrix of shape `(vocab_size, embedding_dim)`.

In [24]:
weights = model.layers[0].get_weights()[0]
print('shape: (vocab_size, embedding_dim)')
print(weights.shape) 

shape: (vocab_size, embedding_dim)
(8185, 16)
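
Once the matrix is available, a quick way to inspect it without any plotting is to look for the rows closest to a given row by cosine similarity. The sketch below is one possible approach, not something the notebook runs. `nearest_rows` is a hypothetical helper and `demo` is a tiny hand-made matrix standing in for `weights`.

```python
import numpy as np


def nearest_rows(weights, index, k=3):
    """Return the indices of the k rows most cosine-similar to row `index`."""
    norms = np.linalg.norm(weights, axis=1, keepdims=True)
    unit = weights / np.maximum(norms, 1e-12)   # guard against all-zero rows
    sims = unit @ unit[index]
    sims[index] = -np.inf                       # exclude the query row itself
    return np.argsort(-sims)[:k]


demo = np.array([[1.0, 0.0],
                 [0.9, 0.1],
                 [0.0, 1.0],
                 [-1.0, 0.0]])

print(nearest_rows(demo, 0, k=2))  # [1 2]
```

With the trained model, the same function could be called on `weights` and the returned indices mapped back to subwords through the encoder.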


To plot the embedding space with the TensorBoard Projector plugin, please refer to this [GitHub issue](https://github.com/tensorflow/tensorboard/issues/2471#issuecomment-580423961).

(The TensorFlow documentation is not very good on this part.)
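
An alternative that avoids the plugin setup is to export the embeddings as two tab-separated files, one with the vectors and one with the matching labels, and load them into the standalone Embedding Projector. The sketch below shows that export under stated assumptions. `write_projector_files` is a hypothetical helper, and the demo uses in-memory buffers and made-up values instead of real files and trained weights.

```python
import io

import numpy as np


def write_projector_files(weights, tokens, vec_file, meta_file):
    """Write one tab-separated vector and one label per line, in matching order."""
    for token, vector in zip(tokens, weights):
        vec_file.write("\t".join(f"{float(x):.6g}" for x in vector) + "\n")
        meta_file.write(token + "\n")


# Demo with made-up values; in practice these would be open file handles.
demo_weights = np.array([[0.1, 0.2], [0.3, 0.4]])
demo_tokens = ["the_", "a_"]
vec_file, meta_file = io.StringIO(), io.StringIO()
write_projector_files(demo_weights, demo_tokens, vec_file, meta_file)

print(vec_file.getvalue())
print(meta_file.getvalue())
```

In the notebook, a call along the lines of `write_projector_files(weights[1:], encoder.subwords, vec_file, meta_file)` would pair each subword with its row. That assumes id 0 is reserved for padding, so that subword `i` lives in row `i + 1`.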