##### Copyright 2019 The TensorFlow Authors.

In [1]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

In [3]:
tf.__version__

'2.20.0'

# Word embeddings

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://www.tensorflow.org/tutorials/text/word_embeddings">
    <img src="https://www.tensorflow.org/images/tf_logo_32px.png" />
    View on TensorFlow.org</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/text/word_embeddings.ipynb">
    <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />
    Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/docs/blob/master/site/en/tutorials/text/word_embeddings.ipynb">
    <img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />
    View source on GitHub</a>
  </td>
  <td>
    <a href="https://storage.googleapis.com/tensorflow_docs/docs/site/en/tutorials/text/word_embeddings.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" />Download notebook</a>
  </td>
</table>

This tutorial contains an introduction to word embeddings. You will train your own word embeddings using a simple Keras model for a sentiment classification task, and then visualize them in the [Embedding Projector](http://projector.tensorflow.org).

## Representing text as numbers

Machine learning models take vectors (arrays of numbers) as input. When working with text, the first thing you must do is come up with a strategy to convert strings to numbers (or to "vectorize" the text) before feeding it to the model. In this section, you will look at three strategies for doing so.

### One-hot encodings

As a first idea, you might "one-hot" encode each word in your vocabulary. Consider the sentence "The cat sat on the mat". The vocabulary (or unique words) in this sentence is (cat, mat, on, sat, the). To represent each word, you will create a zero vector with length equal to the vocabulary, then place a one in the index that corresponds to the word.

To create a vector that contains the encoding of the sentence, you could then concatenate the one-hot vectors for each word.

**Key point: This approach is inefficient.** A one-hot encoded vector is sparse (meaning, most indices are zero). Imagine you have 10,000 words in the vocabulary. To one-hot encode each word, you would create a vector where 99.99% of the elements are zero.

### Encode each word with a unique number

A second approach you might try is to encode each word using a **unique number**. Continuing the example above, you could assign 1 to "cat", 2 to "mat", and so on. You could then encode the sentence "The cat sat on the mat" as a dense vector like [5, 1, 4, 3, 5, 2]. This appoach is efficient. Instead of a sparse vector, you now have a dense one (where all elements are full).

There are two downsides to this approach, however:

* The integer-encoding is arbitrary (it does not capture any relationship between words).

* An integer-encoding can be challenging for a model to interpret. A linear classifier, for example, learns a single weight for each feature. Because there is no relationship between the similarity of any two words and the similarity of their encodings, this feature-weight combination is not meaningful.

### Word embeddings

Word embeddings give us a way to use an efficient, dense representation in which similar words have a similar encoding. Importantly, you do not have to specify this encoding by hand. An embedding is a dense vector of floating point values (the length of the vector is a parameter you specify). Instead of specifying the values for the embedding manually, they are trainable parameters (weights learned by the model during training, in the same way a model learns weights for a dense layer). It is common to see word embeddings that are 8-dimensional (for small datasets), up to 1024-dimensions when working with large datasets. A higher dimensional embedding can capture fine-grained relationships between words, but takes more data to learn.


## Setup

In [1]:
import io
import os
import re
import shutil
import string
import tensorflow as tf

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D, TextVectorization
#from tensorflow.keras.layers.experimental.preprocessing import TextVectorization



### Download the IMDb Dataset
You will use the [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/) through the tutorial. You will train a sentiment classifier model on this dataset and in the process learn embeddings from scratch.  

Take a look at the `train/` directory. It has `pos` and `neg` folders with movie reviews labelled as positive and negative respectively. You will use reviews from `pos` and `neg` folders to train a binary classification model.

In [None]:
#dataset_dir = os.getcwd()+ "\\data\\"

Next, create a `tf.data.Dataset` using `tf.keras.preprocessing.text_dataset_from_directory`. You can read more about using this utility in this [text classification tutorial](https://www.tensorflow.org/tutorials/keras/text_classification). 

Use the `train` directory to create both train and validation datasets with a split of 20% for validation.

In [None]:
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

dataset = tf.keras.utils.get_file("aclImdb_v1", url,
                                    untar=True, cache_dir='.',
                                    cache_subdir='')

#dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')


In [10]:
dataset_dir = os.path.join(dataset, 'aclImdb')

In [11]:
dataset_dir

'.\\aclImdb_v1\\aclImdb'

In [12]:
os.listdir(dataset_dir)

['imdb.vocab', 'imdbEr.txt', 'README', 'test', 'train']

In [14]:
batch_size = 1024
seed = 123

train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    dataset_dir,
    batch_size = batch_size,
    validation_split = 0.2,
    subset = 'training',
    seed = seed
)

val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    dataset_dir,
    batch_size = batch_size,
    validation_split = 0.2,
    subset = 'validation',
    seed = seed
)

Found 100005 files belonging to 2 classes.
Using 80004 files for training.
Found 100005 files belonging to 2 classes.
Using 20001 files for validation.


Take a look at a few movie reviews and their labels `(1: positive, 0: negative)` from the train dataset.


In [15]:
for text_batch, label_batch in train_ds.take(1):
  for i in range(5): 
    print(label_batch[i].numpy(), text_batch.numpy()[i])

1 b"TLJ's understated tour de force - of course Daniel Day-Lewis got the 2007 Oscar for showing the world what a great actor he is (and he is, don't get me wrong). <br /><br />On the other hand, Tommy Lee Jones doesn't show off his acting skills, he shows us a man, a troubled man, a, restrained, some would say uptight individual, a loving but maybe distant father, a remote and quite desperate husband, with all his prejudices and doubts. A performance not to be missed, it won't leave you untouched.<br /><br />It's a pity that this film probably won't be watched by the general public in the US - it's probably too devoid of outright action to appeal to moviegoers. Still, I feel it's one of the best and most thought provoking war films I've seen."
1 b"Reality TV hit a new low with this offensive crap of a show. Why anyone thought the Gotti family should be treated like stars is beyond me. They are nothing but a bunch of scumbags who came from an even bigger scumbag John Gotti. John Gotti g

## Using the Embedding layer

Keras makes it easy to use word embeddings. Take a look at the [Embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) layer.

The Embedding layer can be understood as a lookup table that maps from integer indices (which stand for specific words) to dense vectors (their embeddings). The dimensionality (or width) of the embedding is a parameter you can experiment with to see what works well for your problem, much in the same way you would experiment with the number of neurons in a Dense layer.


In [None]:
1000**(1/4)

5.623413251903491

In [16]:
embedding_layer = tf.keras.layers.Embedding(1000, 5)

When you create an Embedding layer, the weights for the embedding are randomly initialized (just like any other layer). During training, they are gradually adjusted via backpropagation. Once trained, the learned word embeddings will roughly encode similarities between words (as they were learned for the specific problem your model is trained on).

If you pass an integer to an embedding layer, the result replaces each integer with the vector from the embedding table:

In [17]:
result = embedding_layer(tf.constant([0,1,2,3,4,999]))
result.numpy()

array([[ 0.04820881,  0.04538068, -0.02478263, -0.02170532, -0.01400336],
       [ 0.02425085, -0.01960527, -0.04573896,  0.02602256, -0.0418945 ],
       [ 0.00992699,  0.02334431,  0.029349  ,  0.03743095, -0.02251465],
       [ 0.03972812, -0.03846786,  0.02049775,  0.02883178, -0.00373375],
       [ 0.03185851, -0.01781528,  0.01423195, -0.01213143,  0.04362803],
       [ 0.04838741,  0.03356085, -0.0286508 ,  0.00046884, -0.02944571]],
      dtype=float32)

In [18]:
print(embedding_layer.embeddings.shape)
embedding_layer.embeddings

(1000, 5)


<Variable path=embedding/embeddings, shape=(1000, 5), dtype=float32, value=[[ 0.04820881  0.04538068 -0.02478263 -0.02170532 -0.01400336]
 [ 0.02425085 -0.01960527 -0.04573896  0.02602256 -0.0418945 ]
 [ 0.00992699  0.02334431  0.029349    0.03743095 -0.02251465]
 ...
 [-0.04700933 -0.03634419 -0.04952866 -0.0435307  -0.02841244]
 [-0.04396464  0.03299543  0.03521543  0.01868508 -0.00172043]
 [ 0.04838741  0.03356085 -0.0286508   0.00046884 -0.02944571]]>

For text or sequence problems, the Embedding layer takes a 2D tensor of integers, of shape `(samples, sequence_length)`, where each entry is a sequence of integers. It can embed sequences of variable lengths. You could feed into the embedding layer above batches with shapes `(32, 10)` (batch of 32 sequences of length 10) or `(64, 15)` (batch of 64 sequences of length 15).

The returned tensor has one more axis than the input, the embedding vectors are aligned along the new last axis. Pass it a `(2, 3)` input batch and the output is `(2, 3, N)`


In [None]:
result = embedding_layer(tf.constant([[1, 2, 999],
                                      [3, 4, 5]]))
print(result.shape)
result.numpy()

(2, 3, 5)


array([[[-0.04285935, -0.03683591,  0.0365647 ,  0.02776528,
          0.04007946],
        [ 0.04702337, -0.01751066,  0.04784627,  0.03322775,
         -0.02377505],
        [ 0.00138973,  0.04737257,  0.046233  , -0.04244896,
         -0.00984656]],

       [[ 0.04705515,  0.01817394,  0.00196736,  0.04857327,
         -0.02762976],
        [ 0.0409334 ,  0.0323813 ,  0.0018477 , -0.01845671,
          0.0471117 ],
        [ 0.02788243, -0.01122351,  0.02467654, -0.04573428,
         -0.02559091]]], dtype=float32)

When given a batch of sequences as input, an embedding layer returns a 3D floating point tensor, of shape `(samples, sequence_length, embedding_dimensionality)`. To convert from this sequence of variable length to a fixed representation there are a variety of standard approaches. You could use an RNN, Attention, or pooling layer before passing it to a Dense layer. This tutorial uses pooling because it's the simplest.

## Text preprocessing

Next, define the dataset preprocessing steps required for your sentiment classification model. Initialize a TextVectorization layer with the desired parameters to vectorize movie reviews. You can learn more about using this layer in the [Text Classification](https://www.tensorflow.org/tutorials/keras/text_classification) tutorial.

In [19]:
# Create a custom standardization function to strip HTML break tags '<br />'.
def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)
  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
  return tf.strings.regex_replace(stripped_html,
                                  '[%s]' % re.escape(string.punctuation), '')


vocab_size = 10000
sequence_length = 100

vectorize_layer = TextVectorization(
    standardize = custom_standardization,
    max_tokens = vocab_size,
    output_mode = 'int',
    output_sequence_length = sequence_length
)

# Make a text-only dataset (no labels) and call adapt to build the vocabulary.
text_ds = train_ds.map(lambda x, y: x)
vectorize_layer.adapt(text_ds)

## Create a classification model

Use the [Keras Sequential API](https://www.tensorflow.org/guide/keras/sequential_model) to define the sentiment classification model. In this case it is a "Continuous bag of words" style model.
* The [`TextVectorization`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/preprocessing/TextVectorization) layer transforms strings into vocabulary indices. You have already initialized `vectorize_layer` as a TextVectorization layer and built it's vocabulary by calling `adapt` on `text_ds`. Now vectorize_layer can be used as the first layer of your end-to-end classification model, feeding transformed strings into the Embedding layer.
* The [`Embedding`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) layer takes the integer-encoded vocabulary and looks up the embedding vector for each word-index. These vectors are learned as the model trains. The vectors add a dimension to the output array. The resulting dimensions are: `(batch, sequence, embedding)`.

* The [`GlobalAveragePooling1D`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GlobalAveragePooling1D) layer returns a fixed-length output vector for each example by averaging over the sequence dimension. This allows the model to handle input of variable length, in the simplest way possible.

* The fixed-length output vector is piped through a fully-connected ([`Dense`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense)) layer with 16 hidden units.

* The last layer is densely connected with a single output node. 

Caution: This model doesn't use masking, so the zero-padding is used as part of the input and hence the padding length may affect the output.  To fix this, see the [masking and padding guide](https://www.tensorflow.org/guide/keras/masking_and_padding).

In [None]:
10000**(1/4)

10.0

In [20]:
embedding_dim = 16
vocab_size = 10000

model = Sequential([
    vectorize_layer,
    Embedding(vocab_size, embedding_dim, name='embedding'),
    GlobalAveragePooling1D(),
    Dense(16, activation='relu'),
    Dense(1, activation='sigmoid')
])

## Compile and train the model

You will use [TensorBoard](https://www.tensorflow.org/tensorboard) to visualize metrics including loss and accuracy. Create a `tf.keras.callbacks.TensorBoard`.

In [21]:
model.compile(optimizer='adam',
             loss = 'binary_crossentropy',
             metrics = ['accuracy'])

Compile and train the model using the `Adam` optimizer and `BinaryCrossentropy` loss. 

In [22]:
model.fit(
    train_ds,
    validation_data = val_ds,
    epochs = 10
)

Epoch 1/10
[1m79/79[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m57s[0m 700ms/step - accuracy: 0.7506 - loss: 0.5731 - val_accuracy: 0.7474 - val_loss: 0.5664
Epoch 2/10
[1m79/79[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m18s[0m 221ms/step - accuracy: 0.7506 - loss: 0.5617 - val_accuracy: 0.7474 - val_loss: 0.5658
Epoch 3/10
[1m79/79[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m18s[0m 224ms/step - accuracy: 0.7506 - loss: 0.5606 - val_accuracy: 0.7474 - val_loss: 0.5651
Epoch 4/10
[1m79/79[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m19s[0m 234ms/step - accuracy: 0.7506 - loss: 0.5594 - val_accuracy: 0.7474 - val_loss: 0.5646
Epoch 5/10
[1m79/79[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m19s[0m 237ms/step - accuracy: 0.7506 - loss: 0.5579 - val_accuracy: 0.7474 - val_loss: 0.5636
Epoch 6/10
[1m79/79[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m19s[0m 235ms/step - accuracy: 0.7506 - loss: 0.5560 - val_accuracy: 0.7474 - val_loss: 0.5628
Epoch 7/10
[1m79/79[

<keras.src.callbacks.history.History at 0x1f777fdea50>

With this approach the model reaches a validation accuracy of around 84% (note that the model is overfitting since training accuracy is higher).

Note: Your results may be a bit different, depending on how weights were randomly initialized before training the embedding layer. 

You can look into the model summary to learn more about each layer of the model.

In [23]:
model.summary()

## Retrieve the trained word embeddings and save them to disk

Next, retrieve the word embeddings learned during training. The embeddings are weights of the Embedding layer in the model. The weights matrix is of shape `(vocab_size, embedding_dimension)`.

Obtain the weights from the model using `get_layer()` and `get_weights()`. The `get_vocabulary()` function provides the vocabulary to build a metadata file with one token per line. 

In [24]:
model.get_layer('embedding').get_weights()

[array([[ 0.05907438,  0.23412746,  0.23534971, ..., -0.09626865,
          0.09062655,  0.06896333],
        [ 0.1147095 ,  0.17401318,  0.31448993, ..., -0.0022399 ,
          0.18856098,  0.0735514 ],
        [ 0.08888543,  0.12151925,  0.39881966, ...,  0.11146606,
          0.19021235,  0.06070075],
        ...,
        [ 0.12234419, -0.14856614, -0.10638776, ..., -0.1314967 ,
          0.05774949,  0.072873  ],
        [ 0.00645153, -0.01365015, -0.04901185, ..., -0.04100302,
          0.02415335,  0.02211501],
        [-0.00688213,  0.03717855,  0.04374218, ..., -0.03315779,
         -0.00409688,  0.0451286 ]], shape=(10000, 16), dtype=float32)]

In [25]:
weights = model.get_layer('embedding').get_weights()[0]
vocab = vectorize_layer.get_vocabulary()

In [26]:
print(len(vocab))
print(vocab[:10])

10000
['', '[UNK]', np.str_('the'), np.str_('and'), np.str_('a'), np.str_('of'), np.str_('to'), np.str_('is'), np.str_('in'), np.str_('it')]


In [27]:
print(weights.shape)
print(weights[:2])

(10000, 16)
[[ 0.05907438  0.23412746  0.23534971 -0.07160985 -0.06878137  0.09857569
   0.00254296  0.1315659  -0.04926193  0.10393105 -0.16451576  0.12808688
  -0.14555945 -0.09626865  0.09062655  0.06896333]
 [ 0.1147095   0.17401318  0.31448993 -0.14692967 -0.08142169  0.05555097
   0.12721878  0.19671664 -0.12406144  0.22536229 -0.17943113  0.03286254
  -0.14897685 -0.0022399   0.18856098  0.0735514 ]]


Write the weights to disk. To use the [Embedding Projector](http://projector.tensorflow.org), you will upload two files in tab separated format: a file of vectors (containing the embedding), and a file of meta data (containing the words).

In [28]:
out_v = io.open('vectors.tsv', 'w', encoding='utf-8')
out_m = io.open('metadata.tsv', 'w', encoding='utf-8')

for index, word in enumerate(vocab):
  if index == 0:
    continue  # skip 0, it's padding.
  vec = weights[index]
  out_v.write('\t'.join([str(x) for x in vec]) + "\n")
  out_m.write(word + "\n")
out_v.close()
out_m.close()

If you are running this tutorial in [Colaboratory](https://colab.research.google.com), you can use the following snippet to download these files to your local machine (or use the file browser, *View -> Table of contents -> File browser*).

In [None]:
# try:
#   from google.colab import files
#   files.download('vectors.tsv')
#   files.download('metadata.tsv')
# except Exception:
#   pass

## Visualize the embeddings

To visualize the embeddings, upload them to the embedding projector.

Open the [Embedding Projector](http://projector.tensorflow.org/) (this can also run in a local TensorBoard instance).

* Click on "Load data".

* Upload the two files you created above: `vecs.tsv` and `meta.tsv`.

The embeddings you have trained will now be displayed. You can search for words to find their closest neighbors. For example, try searching for "beautiful". You may see neighbors like "wonderful". 

Note: Experimentally, you may be able to produce more interpretable embeddings by using a simpler model. Try deleting the `Dense(16)` layer, retraining the model, and visualizing the embeddings again.

Note: Typically, a much larger dataset is needed to train more interpretable word embeddings. This tutorial uses a small IMDb dataset for the purpose of demonstration.


## Next Steps

This tutorial has shown you how to train and visualize word embeddings from scratch on a small dataset.

* To train word embeddings using Word2Vec algorithm, try the [Word2Vec](https://www.tensorflow.org/tutorials/text/word2vec) tutorial. 

* To learn more about advanced text processing, read the [Transformer model for language understanding](https://www.tensorflow.org/tutorials/text/transformer).