## Intro
We'll use the IMDB dataset, which contains the text of 50,000 movie reviews from the Internet Movie Database. These are split into 25,000 reviews for training and 25,000 reviews for testing. Both sets are balanced, meaning each contains an equal number of positive and negative reviews.



In [1]:
# !pip install tensorflow-datasets

In [2]:
import tensorflow_datasets as tfds
import tensorflow as tf
import matplotlib.pyplot as plt
from tensorflow import keras
from tensorflow.keras import layers
# Import layers from tf.keras so that standalone Keras and tf.keras are not mixed.
from tensorflow.keras.layers import Dense, Dropout, Embedding, LSTM, Bidirectional



In [3]:
def plot_graphs(history, metric):
    plt.plot(history.history[metric])
    plt.plot(history.history['val_' + metric], '')
    plt.xlabel('Epoch')
    plt.ylabel(metric)
    plt.legend([metric, 'val_'+metric])
    plt.show()

## Download dataset

In [4]:
# Use the version pre-encoded with an ~8k subword vocabulary.
dataset, info = tfds.load('imdb_reviews/subwords8k',
                          # Also return the `info` structure.
                          with_info=True,
                          # Return (example, label) pairs from the dataset (instead of a dictionary).
                          as_supervised=True)

In [5]:
train_dataset, test_dataset = dataset['train'], dataset['test']
train_dataset

Let's take a moment to understand the format of the data. The dataset comes preprocessed: each example is an array of integers that encodes the text of one movie review.

The text of each review has been converted to integers, where each integer represents a specific word-piece in the dictionary.

Each label is an integer value of either 0 or 1, where 0 is a negative review, and 1 is a positive review.
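To make the idea of word-piece integers concrete, here is a minimal sketch of a greedy longest-match encoder. The vocabulary `VOCAB` and the `encode`/`decode` helpers are hypothetical and written only for illustration; this is not the algorithm the dataset's own encoder uses. Ids start at 1 in this sketch so that 0 stays free for padding.

```python
# Hypothetical word-piece vocabulary (an assumption for illustration only).
# Ids start at 1 so that 0 can be reserved for padding.
VOCAB = {"great": 1, " ": 2, "mov": 3, "ie": 4, "s": 5}
PIECES = {index: piece for piece, index in VOCAB.items()}


def encode(text):
    """Greedily match the longest known piece at each position."""
    ids, pos = [], 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos = end
                break
        else:
            raise ValueError("no vocabulary piece matches at position %d" % pos)
    return ids


def decode(ids):
    """Concatenate the pieces back into a string."""
    return "".join(PIECES[index] for index in ids)


encoded = encode("great movies")
print(encoded)
print(decode(encoded))
```

A single word such as "movies" becomes several integers here, which is why an encoded review is usually longer than its word count.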

In [None]:
# Print the first encoded review and its label
for train_example, train_label in train_dataset.take(1):
  print('Encoded text:', train_example.numpy())
  print('Label:', train_label.numpy())

In [None]:
encoder = info.features['text'].encoder
print('Unique vocabulary size: {}'.format(encoder.vocab_size))

In [None]:
encoder.vocab_size

In [None]:
sample_string = 'Hello'

encoded_string = encoder.encode(sample_string)
print('Encoded string is {}'.format(encoded_string))
print(type(encoded_string[0]))

original_string = encoder.decode(encoded_string)
print('The original string is: {}'.format(original_string))

In [None]:
for index in encoded_string:
    print('{} ==> {}'.format(index, encoder.decode([index])))

In [None]:
assert original_string == sample_string  # make sure the decoded string matches the original string

In [None]:
original_sentences = encoder.decode(train_example.numpy())
print('Original sentences: {}'.format(original_sentences))

In [None]:
train_example.numpy().shape

Next, create batches of these encoded strings. First, set the shuffle buffer size and the batch size:



In [None]:
BUFFER_SIZE = 10000
BATCH_SIZE = 64

The reviews all have different lengths, so use the padded_batch method to zero-pad each sequence to the length of the longest one in its batch.

In [None]:
train_dataset = train_dataset.shuffle(BUFFER_SIZE)
train_dataset = train_dataset.padded_batch(BATCH_SIZE)

test_dataset = test_dataset.padded_batch(BATCH_SIZE)
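As a rough illustration of what zero-padding does to one batch, the framework-free sketch below pads plain Python lists. The `pad_batch` helper and the sample ids are hypothetical; padded_batch performs the equivalent operation on tensors.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]


# Three encoded "reviews" of different lengths (made-up ids).
batch = [[12, 7, 301], [45], [9, 88]]
padded = pad_batch(batch)
for row in padded:
    print(row)
```

Because padding is applied per batch, different batches can end up with different sequence lengths.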

## Create the model

Build a tf.keras.Sequential model and start with an embedding layer. An embedding layer stores one vector per word. When called, it converts the sequences of word indices to sequences of vectors. These vectors are trainable. After training (on enough data), words with similar meanings often have similar vectors.

This index-lookup is much more efficient than the equivalent operation of passing a one-hot encoded vector through a tf.keras.layers.Dense layer.
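The equivalence between the two operations can be checked with a small sketch in plain Python. The sizes and the randomly filled `embedding` table are assumptions chosen for illustration; a real embedding layer learns these values during training.

```python
import random

VOCAB_SIZE, EMBED_DIM = 5, 3  # toy sizes, chosen only for this sketch
random.seed(0)
# One trainable vector per vocabulary entry (random stand-in values here).
embedding = [[random.uniform(-1, 1) for _ in range(EMBED_DIM)]
             for _ in range(VOCAB_SIZE)]


def lookup(index):
    """Embedding-style lookup: pick one row of the table."""
    return embedding[index]


def one_hot_matmul(index):
    """Dense-style equivalent: multiply a one-hot vector by the same table."""
    one_hot = [1.0 if i == index else 0.0 for i in range(VOCAB_SIZE)]
    return [sum(one_hot[i] * embedding[i][j] for i in range(VOCAB_SIZE))
            for j in range(EMBED_DIM)]


for token in [0, 3, 4]:
    by_lookup, by_matmul = lookup(token), one_hot_matmul(token)
    assert all(abs(a - b) < 1e-12 for a, b in zip(by_lookup, by_matmul))
print("lookup and one-hot matmul agree")
```

The lookup touches one row, while the matrix product multiplies and sums over every vocabulary entry, so the gap grows with the vocabulary size.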

A recurrent neural network (RNN) processes sequence input by iterating through the elements. At each timestep, an RNN feeds its output from the previous timestep back in alongside the current input.

The tf.keras.layers.Bidirectional wrapper can also be used with an RNN layer. This propagates the input forward and backward through the RNN layer and then concatenates the two outputs. This helps the RNN to learn long-range dependencies.
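The recurrence and the bidirectional concatenation can be sketched with a scalar tanh RNN. This is a deliberately simplified stand-in, not an LSTM, and the weights and the `run_rnn`/`bidirectional` helpers are hypothetical values and names chosen for illustration.

```python
import math


def run_rnn(inputs, w_in, w_rec):
    """Scalar RNN: h_t = tanh(w_in * x_t + w_rec * h_{t-1}); returns the final state."""
    h = 0.0
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
    return h


def bidirectional(inputs, fwd_weights, bwd_weights):
    """Run one RNN left-to-right and another right-to-left, then concatenate."""
    forward = run_rnn(inputs, *fwd_weights)
    backward = run_rnn(list(reversed(inputs)), *bwd_weights)
    return [forward, backward]


sequence = [0.5, -1.0, 0.25]
output = bidirectional(sequence, fwd_weights=(0.8, 0.5), bwd_weights=(0.3, -0.4))
print(output)
```

The output is twice as wide as a single direction's state, which is why a bidirectional layer wrapping 64 units produces 128 features.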

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

In [None]:
model.summary()

In [None]:
model.compile(loss='binary_crossentropy',
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

In [None]:
history = model.fit(train_dataset, epochs=2,
                    validation_data=test_dataset, 
                    validation_steps=30)