In [3]:
from __future__ import absolute_import, division, print_function, unicode_literals
import io
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow_datasets as tfds

## Import TensorFlow to generate word embeddings from IMDB data

We are going to generate embeddings. An embedding layer can be treated as a look-up table that maps the integer index of each word to a dense vector (its embedding).

In [15]:
# 1,000 possible integer indices, each mapped to a 5-dimensional vector
embedding_layer = layers.Embedding(1000, 5)

When you pass integers to an embedding layer, the result replaces each integer with the corresponding vector from the embedding table:

In [16]:
result = embedding_layer(tf.constant([1,2,3]))
result.numpy()

array([[ 0.00317502,  0.00068425,  0.02909   , -0.03462193, -0.01016934],
       [-0.02223927,  0.00222535,  0.04956671, -0.00577765,  0.03148793],
       [ 0.03160057,  0.04774586,  0.01194793,  0.0312151 ,  0.018883  ]],
      dtype=float32)

For text or sequence problems, the Embedding layer takes a 2D tensor of integers of shape (samples, sequence_length), where each row is a sequence of integer indices. The sequence length may differ from one batch to the next, but every sequence within a batch must have the same length. For example, you could feed the embedding layer above batches of shape (32, 10) (32 sequences of length 10) or (64, 15) (64 sequences of length 15).

The returned tensor has one more axis than the input; the embedding vectors are aligned along the new last axis. Pass it a (2, 3) input batch and the output has shape (2, 3, N), where N is the embedding dimension (5 for the layer above).

In [None]:
# A (2, 3) batch: 2 sequences of 3 integer indices each
result = embedding_layer(tf.constant([[0, 1, 2], [3, 4, 5]]))
result.shape
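
The shape rule can also be checked without TensorFlow. The following is a minimal, framework-free sketch of the look-up, not how Keras implements it: `make_embedding_table` and `embed` are hypothetical helper names, and the table is filled with small random values purely for illustration.

```python
import random


def make_embedding_table(vocab_size, embedding_dim, seed=0):
    """Build a toy look-up table: one random vector per integer index."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
            for _ in range(vocab_size)]


def embed(table, batch):
    """Replace every integer in a (samples, sequence_length) batch with its vector."""
    return [[table[index] for index in sequence] for sequence in batch]


table = make_embedding_table(vocab_size=1000, embedding_dim=5)
batch = [[0, 1, 2], [3, 4, 5]]  # shape (2, 3)
embedded = embed(table, batch)

# The result gains one trailing axis: (samples, sequence_length, embedding_dim)
shape = (len(embedded), len(embedded[0]), len(embedded[0][0]))
print(shape)  # (2, 3, 5)
```

Each inner entry is simply a row of the table, so `embedded[1][2]` is the same vector as `table[5]`.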

# Learning embeddings from scratch
In this tutorial you will train a sentiment classifier on IMDB movie reviews. In the process, the model will learn embeddings from scratch. We will use a preprocessed dataset.

To load a text dataset from scratch see the [Loading text tutorial](https://www.tensorflow.org/tutorials/load_data/text).

In [19]:
(train_data, test_data), info = tfds.load(
    'imdb_reviews/subwords8k',
    split=(tfds.Split.TRAIN, tfds.Split.TEST),
    with_info=True, as_supervised=True)

Downloading and preparing dataset imdb_reviews (80.23 MiB) to C:\Users\Christopher\tensorflow_datasets\imdb_reviews\subwords8k\1.0.0...


Shuffling and writing examples to C:\Users\Christopher\tensorflow_datasets\imdb_reviews\subwords8k\1.0.0.incompleteM80R0U\imdb_reviews-train.tfrecord

Shuffling and writing examples to C:\Users\Christopher\tensorflow_datasets\imdb_reviews\subwords8k\1.0.0.incompleteM80R0U\imdb_reviews-test.tfrecord

Shuffling and writing examples to C:\Users\Christopher\tensorflow_datasets\imdb_reviews\subwords8k\1.0.0.incompleteM80R0U\imdb_reviews-unsupervised.tfrecord

Dataset imdb_reviews downloaded and prepared to C:\Users\Christopher\tensorflow_datasets\imdb_reviews\subwords8k\1.0.0. Subsequent calls will reuse this data.


Get the encoder (tfds.features.text.SubwordTextEncoder), and have a quick look at the vocabulary.

The "_" in the vocabulary represents a space. Note how the vocabulary includes whole words (ending with "_") as well as partial words, which the encoder can combine to build larger words:

In [20]:
encoder = info.features['text'].encoder
encoder.subwords[:20]

['the_',
 ', ',
 '. ',
 'a_',
 'and_',
 'of_',
 'to_',
 's_',
 'is_',
 'br',
 'in_',
 'I_',
 'that_',
 'this_',
 'it_',
 ' /><',
 ' />',
 'was_',
 'The_',
 'as_']
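
To illustrate how partial words can be combined into larger words, here is a toy greedy longest-match splitter. It is a sketch under stated assumptions only: `greedy_subword_split` and `toy_vocabulary` are hypothetical, the vocabulary is made up for the example, and the code is not the encoder's actual implementation.

```python
def greedy_subword_split(word, vocabulary):
    """Split `word` into the longest vocabulary pieces, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining candidate first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocabulary:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No vocabulary piece matched: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces


# Hypothetical vocabulary; a trailing "_" stands for the space after a word.
toy_vocabulary = {"the_", "was_", "embed", "ding", "s_", "un", "watch", "able_"}

print(greedy_subword_split("embeddings_", toy_vocabulary))   # ['embed', 'ding', 's_']
print(greedy_subword_split("unwatchable_", toy_vocabulary))  # ['un', 'watch', 'able_']
```

Neither "embeddings_" nor "unwatchable_" is in the toy vocabulary, yet both can be represented exactly by concatenating smaller pieces.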

Movie reviews can have different lengths. We will use the padded_batch method to pad every review in a batch to the length of the longest review in that batch.

In [21]:
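# Pad the text (variable length, hence None); the label is a scalar and needs no padding
padded_shapes = ([None], ())
# Shuffle with a buffer of 1000 examples, then form padded batches of 10 reviews
train_batches = train_data.shuffle(1000).padded_batch(10, padded_shapes=padded_shapes)
test_batches = test_data.shuffle(1000).padded_batch(10, padded_shapes=padded_shapes)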
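
The effect of padding on a single batch can be sketched in plain Python. This is an illustration, not the tf.data implementation: `pad_batch` is a hypothetical helper and the integer sequences are made-up stand-ins for encoded reviews. It pads with zeros, which is the default padding value for numeric data in padded_batch.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence to the length of the longest one in the batch."""
    longest = max(len(sequence) for sequence in sequences)
    return [list(sequence) + [pad_value] * (longest - len(sequence))
            for sequence in sequences]


# Three made-up "reviews" of different lengths, already encoded as integers
reviews = [[12, 7, 431], [5], [88, 2, 19, 64, 3]]
padded = pad_batch(reviews)

for row in padded:
    print(row)
# [12, 7, 431, 0, 0]
# [5, 0, 0, 0, 0]
# [88, 2, 19, 64, 3]
```

Because the padding length is decided per batch, two different batches can end up with different sequence lengths, which the Embedding layer accepts.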