<a href="https://colab.research.google.com/github/apresland/tensorflow-nlp/blob/word-embeddings/word_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word Embeddings
This notebook demonstrates how to create word embeddings in TensorFlow. We will train word embeddings with a simple Keras model on a sentiment classification task.

In [None]:
import re
import string
import tensorflow as tf
import tensorflow_datasets as tfds

# TextVectorization is no longer experimental in recent TensorFlow releases (2.6+).
from tensorflow.keras.layers import TextVectorization

# Download the IMDB dataset
The IMDB reviews dataset is available through TensorFlow Datasets. The following code downloads it and splits it into training, validation, and test sets:

In [None]:
# Use the first 80% of the training split for training and the remaining 20% for validation.
train_ds, val_ds, test_ds = tfds.load(
    name="imdb_reviews", 
    split=('train[:80%]', 'train[80%:]', 'test'),
    batch_size=1024,
    as_supervised=True)

# Inspect some examples

In [None]:
for text_batch, label_batch in train_ds.take(1):
  for i in range(5):
    print(label_batch[i].numpy(), text_batch.numpy()[i])

# Text pre-processing
Define the dataset preprocessing steps required for the classification model.

In [None]:
VOCAB_SIZE = 10000
MAX_SEQUENCE_LENGTH = 100

# Custom standardization: lowercase the text, replace HTML break tags '<br />'
# with spaces, and strip punctuation.
def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)
  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
  return tf.strings.regex_replace(stripped_html,
                                  '[%s]' % re.escape(string.punctuation), '')

# Normalize, split, and map strings to integers.
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)

# Make a text-only dataset (no labels) and call adapt to build the vocabulary.
text_ds = train_ds.map(lambda x, y: x)
vectorize_layer.adapt(text_ds)
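
To make the vectorization step concrete, here is a rough pure-Python sketch of what the layer does conceptually: standardize the text, split on whitespace, keep the most frequent tokens, and pad or truncate to a fixed length. The helper names (`standardize`, `build_vocab`, `vectorize`) are hypothetical, and the tie-breaking between equally frequent tokens is an assumption; this is not the Keras implementation. It does follow the layer's convention of reserving index 0 for padding and index 1 for out-of-vocabulary tokens.

```python
import re
import string
from collections import Counter

PUNCT_PATTERN = re.compile("[%s]" % re.escape(string.punctuation))

def standardize(text):
    """Lowercase, replace '<br />' with a space, then drop punctuation."""
    text = text.lower().replace("<br />", " ")
    return PUNCT_PATTERN.sub("", text)

def build_vocab(texts, max_tokens):
    """Map the most frequent tokens to integer ids; 0 = padding, 1 = unknown."""
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    # Two ids are reserved, so only max_tokens - 2 real tokens are kept.
    # Sorting by (-count, token) makes tie-breaking deterministic (an assumption).
    kept = sorted(counts, key=lambda tok: (-counts[tok], tok))[:max_tokens - 2]
    return {tok: i + 2 for i, tok in enumerate(kept)}

def vectorize(text, vocab, sequence_length):
    """Convert text to a fixed-length list of ids, truncating or zero-padding."""
    ids = [vocab.get(tok, 1) for tok in standardize(text).split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

# Toy corpus for illustration only (not taken from the IMDB data).
reviews = ["Great movie!<br />Great cast.", "Terrible movie."]
vocab = build_vocab(reviews, max_tokens=5)
encoded = vectorize("Great plot, terrible cast", vocab, sequence_length=6)
print(vocab)
print(encoded)
```

In the toy example, "plot" and "terrible" fall outside the three-token vocabulary, so both map to the unknown id, and the sequence is zero-padded to length 6.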

# Create a classification model
Use Keras to define the sentiment classification model in a "continuous bag of words" style.

In [None]:
EMBEDDING_DIM=16

model = tf.keras.models.Sequential([
  vectorize_layer,
  tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM, name="embedding"),
  tf.keras.layers.GlobalAveragePooling1D(),
  tf.keras.layers.Dense(16, activation='relu'),
  tf.keras.layers.Dense(1)
])
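
The "bag of words" character of this model comes from the pooling step: each token id selects one row of the embedding table, and the rows are averaged over the sequence, so word order is discarded. The sketch below illustrates this with plain Python lists and a tiny random table. The function names and sizes are hypothetical. It also mirrors the fact that, with the `Embedding` layer's default `mask_zero=False`, padding positions (id 0) are included in the average.

```python
import random

def embed(token_ids, table):
    """Look up one embedding vector (a row of the table) per token id."""
    return [table[i] for i in token_ids]

def global_average_pool(vectors):
    """Average vectors over the sequence axis, giving one fixed-size vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

random.seed(0)
vocab_size, embedding_dim = 5, 3  # toy sizes, not the notebook's 10000 x 16
table = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
         for _ in range(vocab_size)]

# The same ids in a different order pool to the same vector (up to rounding).
a = global_average_pool(embed([2, 3, 4, 0], table))
b = global_average_pool(embed([4, 2, 0, 3], table))
same = all(abs(x - y) < 1e-12 for x, y in zip(a, b))
print(len(a), same)
```

Whatever the sequence length, the pooled vector has `embedding_dim` entries, which is what lets the dense layers that follow operate on a fixed-size input.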

# Compile and train the model

In [None]:
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")


In [None]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])
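
Because the final `Dense(1)` layer has no activation, the model outputs logits, and `from_logits=True` tells the loss to apply the sigmoid itself. As a rough illustration of the per-example computation, the sketch below compares the textbook formula with a commonly used numerically stable form. The function names are hypothetical and this is not Keras's own code.

```python
import math

def sigmoid(x):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def bce_from_logit(logit, label):
    """Binary cross-entropy for one example, computed directly from the logit.

    Uses the stable form max(x, 0) - x * z + log(1 + exp(-|x|)),
    which avoids overflow for large |x|.
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

logit, label = 2.0, 1.0
p = sigmoid(logit)
# Textbook definition: -(z * log(p) + (1 - z) * log(1 - p))
naive = -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))
stable = bce_from_logit(logit, label)
print(round(naive, 6), round(stable, 6))

# A logit of 0 corresponds to probability 0.5, so the decision boundary
# for a logit output is 0.0, not 0.5.
predicted_positive = logit > 0.0
```

The two values agree for moderate logits. The stable form is preferable because it never evaluates `log(0)` or exponentiates a large positive number.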

In [None]:
model.fit(
    train_ds,
    validation_data=val_ds, 
    epochs=15,
    callbacks=[tensorboard_callback])

# Visualize the metrics in TensorBoard

In [None]:
%load_ext tensorboard
%tensorboard --logdir logs