# Importing the Modules

- Let us begin by importing the modules and setting the random seed so as to get reproducible results.

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow import keras

tf.random.set_seed(42)

You can load the IMDB dataset easily:

In [None]:
(X_train, y_train), (X_test, y_test) = keras.datasets.imdb.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


This dataset is already preprocessed.

X_train consists of a list of reviews, each of which is represented as a NumPy array of integers, where each integer represents a word.
All punctuation was removed, and then words were converted to lowercase, split by spaces, and finally indexed by frequency (so low integers correspond to frequent words).
The integers 0, 1, and 2 are special: they represent the padding token, the start-of-sequence (SOS) token, and unknown words, respectively.

In [None]:
X_train[0][:10]

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65]

If you want to visualize a review, you can decode it like this:

In [None]:
word_index = keras.datasets.imdb.get_word_index()
id_to_word = {id_ + 3: word for word, id_ in word_index.items()}
for id_, token in enumerate(("<pad>", "<sos>", "<unk>")):
    id_to_word[id_] = token
" ".join([id_to_word[id_] for id_ in X_train[0][:10]])

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json


'<sos> this film was just brilliant casting location scenery story'
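Going the other way, from text to IDs, follows the same convention: frequency ranks are shifted by 3 to leave room for the three special tokens. The snippet below is a minimal sketch of that encoding. The word index is a made-up toy dictionary in the same format as `get_word_index()`, and `encode_review` is a hypothetical helper, not part of Keras.

```python
# Toy word index in the same format as keras.datasets.imdb.get_word_index():
# word -> frequency rank, starting at 1. The words and ranks are made up.
toy_word_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4}

PAD_ID, SOS_ID, UNK_ID = 0, 1, 2  # reserved IDs for the special tokens
INDEX_OFFSET = 3                  # ranks are shifted by 3 to make room for them


def encode_review(text):
    """Encode a lowercase, space-separated review as a list of integer IDs."""
    ids = [SOS_ID]  # every encoded review starts with the <sos> token
    for word in text.split():
        rank = toy_word_index.get(word)
        # Words missing from the index are mapped to the <unk> ID.
        ids.append(UNK_ID if rank is None else rank + INDEX_OFFSET)
    return ids


encoded = encode_review("the film was utterly brilliant")
print(encoded)  # [1, 4, 5, 6, 2, 7]
```

The word "utterly" is not in the toy index, so it is encoded as 2 (`<unk>`).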

# Preparing the Dataset

- Let us load the IMDb reviews dataset using TensorFlow Datasets.

- The dataset comes already split into train, test, and unsupervised parts. We shall use the train and test parts.

First, we will load the original IMDb reviews, as text (byte strings), using TensorFlow Datasets.

In [None]:
import tensorflow_datasets as tfds

datasets, info = tfds.load("imdb_reviews", as_supervised=True, with_info=True)


**Note:**

- `datasets["train"]` contains the train data. Similarly, `datasets["test"]` contains the test data.

- `datasets["train"].batch(2)` batches 2 data samples at a time.

- `datasets["train"].batch(2).take(1)` takes only the first batch.

- Each batch is of type `EagerTensor`. We can convert it to a NumPy array using `X_batch.numpy()`.
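To make the semantics of `batch()` and `take()` concrete, here is a plain-Python sketch that mimics them on a small list. The helpers `batch` and `take` and the toy reviews are illustrative assumptions; they are not the `tf.data` implementation, only an analogy for how samples are grouped and how many batches are returned.

```python
from itertools import islice


def batch(samples, batch_size):
    """Yield consecutive lists of up to batch_size samples."""
    iterator = iter(samples)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk  # the last chunk may be smaller than batch_size


def take(batches, count):
    """Return an iterator over at most `count` batches."""
    return islice(batches, count)


# Made-up (review, label) pairs standing in for datasets["train"].
toy_reviews = [(b"great movie", 1), (b"terrible plot", 0),
               (b"loved it", 1), (b"too long", 0), (b"fine", 1)]

all_batches = list(batch(toy_reviews, 2))
first_batches = list(take(batch(toy_reviews, 2), 1))

print(len(all_batches))   # 3 batches: two full ones and one with a single sample
print(first_batches)      # only the first batch of 2 samples
```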

In [None]:
datasets.keys()

dict_keys(['test', 'train', 'unsupervised'])

In [None]:
train_size = info.splits["train"].num_examples
test_size = info.splits["test"].num_examples

In [None]:
train_size, test_size

(25000, 25000)

In [None]:
train_size==25000 and test_size==25000

True

In [None]:
# count=0
# for X_batch, y_batch in datasets["train"].batch(10):
#     print(X_batch)
#     count+=1
#     if count==6:
#         break

In [None]:
# for X_batch, y_batch in datasets["train"].batch(10).take(5):
#     print("@ ",X_batch," #")

We shall take the first batch and print the review (first 200 characters) and the label of each sample in it:

In [None]:
for X_batch, y_batch in datasets["train"].batch(2).take(1):
    print(type(X_batch))
    # print(X_batch.numpy()[0], type(y_batch))
    for review, label in zip(X_batch.numpy(), y_batch.numpy()):
        print("Review:", review.decode("utf-8")[:200], "...")
        print("Label:", label, "= Positive" if label else "= Negative")
        print()

<class 'tensorflow.python.framework.ops.EagerTensor'>
Review: This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting  ...
Label: 0 = Negative

Review: I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However  ...
Label: 0 = Negative



In [None]:
str(type(y_batch))=="<class 'tensorflow.python.framework.ops.EagerTensor'>"

True

## Defining the preprocess function

- Now we will create this preprocessing function where we will:

 - Truncate the reviews, keeping only the first 300 characters of each since you can generally tell whether a review is positive or not in the first sentence or two.

 - Then we use regular expressions to replace `<br/>` tags with spaces and to replace any characters other than letters and quotes with spaces.

 - Finally, the `preprocess()` function splits the reviews by the spaces, which returns a [ragged tensor][1], and it converts this ragged tensor to a dense tensor, padding all reviews with the padding token `<pad>` so that they all have the same length.


  [1]: https://www.tensorflow.org/api_docs/python/tf/RaggedTensor

**Note:**

- `tf.strings` - Operations for working with string Tensors.

- `tf.strings.substr(X_batch, 0, 300)` - For each string in the input Tensor `X_batch`, it returns the substring starting at index `pos` (here 0) with a total length of `len` (here 300). By default, positions and lengths are counted in bytes.

- `tf.strings.regex_replace(X_batch, rb"<br\s*/?>", b" ")` - Replaces every match of the regex pattern `<br\s*/?>` in the elements of `X_batch` with a space.

- `tf.strings.regex_replace(X_batch, b"[^a-zA-Z']", b" ")` - Replaces every character other than letters and apostrophes with a space.

- `tf.strings.split(X_batch)` - Split elements of input `X_batch` into a RaggedTensor.

- `X_batch.to_tensor(default_value=b"<pad>")` - Converts the RaggedTensor into a `tf.Tensor`. `default_value` is the value to set for indices not specified in `X_batch`. Empty values are assigned `default_value`(here `<pad>`).

In [None]:
def preprocess(X_batch, y_batch):
    X_batch = tf.strings.substr(X_batch, 0, 300)
    X_batch = tf.strings.regex_replace(X_batch, rb"<br\s*/?>", b" ")
    X_batch = tf.strings.regex_replace(X_batch, b"[^a-zA-Z']", b" ")
    X_batch = tf.strings.split(X_batch)
    return X_batch.to_tensor(default_value=b"<pad>"), y_batch
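
To see what each step does to the text, the following plain-Python sketch applies the same four steps to ordinary byte strings using the `re` module. The function name `preprocess_py` and the sample reviews are made up for illustration; the sketch works on Python lists rather than tensors, so it is an analogy to `preprocess()`, not a replacement for it.

```python
import re


def preprocess_py(reviews, max_chars=300):
    """Truncate, clean, split, and pad a list of byte-string reviews."""
    tokenized = []
    for review in reviews:
        review = review[:max_chars]                    # keep the first 300 bytes
        review = re.sub(rb"<br\s*/?>", b" ", review)   # replace <br /> tags with spaces
        review = re.sub(rb"[^a-zA-Z']", b" ", review)  # keep only letters and apostrophes
        tokenized.append(review.split())               # split on whitespace
    # Pad every token list to the length of the longest one in the batch.
    longest = max(len(tokens) for tokens in tokenized)
    return [tokens + [b"<pad>"] * (longest - len(tokens)) for tokens in tokenized]


padded = preprocess_py([b"Great film!<br />Loved it.", b"Don't bother"])
for tokens in padded:
    print(tokens)
# [b'Great', b'film', b'Loved', b'it']
# [b"Don't", b'bother', b'<pad>', b'<pad>']
```

As in the TensorFlow version, the amount of padding depends on the longest review in the batch, so different batches can have different widths.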

In [None]:
str(type(preprocess(X_batch, y_batch)[1]))=="<class 'tensorflow.python.framework.ops.EagerTensor'>"

True

In [None]:
callable(preprocess)

True

## Constructing the Vocabulary

Next, we will construct the vocabulary. This requires going through the whole training set once, applying our `preprocess()` function, and using a `Counter()` to count the number of occurrences of each word.

**Note:**

- `Counter.update(iterable)` : Adds one to the count of each element in `iterable`.

- `map(myfunc)` on a TensorFlow dataset applies the function `myfunc` to every element of the dataset (here, to every batch, since we batch before mapping). [More here][1].





  [1]: https://www.tensorflow.org/api_docs/python/tf/data/Dataset?hl=en&version=stable#map

In [None]:
from collections import Counter

vocabulary = Counter()
for X_batch, y_batch in datasets["train"].batch(2).map(preprocess):
    # print(vocabulary)
    for review in X_batch:
        vocabulary.update(list(review.numpy()))
        # print(vocabulary)
        # break
    # break

Let’s look at the five most common words:

In [None]:
vocabulary.most_common()[:5]

[(b'<pad>', 63155),
 (b'the', 61137),
 (b'a', 38564),
 (b'of', 33983),
 (b'and', 33431)]

In [None]:
len(vocabulary)==53893

True

In [None]:
str(type(Counter))=="<class 'type'>"

True

## Truncating the Vocabulary

There are more than 50,000 words in the `vocabulary`. So let us truncate it, keeping only the 10,000 most common words:

In [None]:
vocab_size = 10000
truncated_vocabulary = [
    word for word, count in vocabulary.most_common()[:vocab_size]]

In [None]:
len(truncated_vocabulary)==10000

True
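
The counting and truncation steps can be checked on a tiny example. The sketch below uses made-up, already tokenized batches (`toy_batches`) in place of the preprocessed dataset and keeps the 3 most common tokens instead of 10,000. All names prefixed with `toy_` are assumptions introduced for illustration.

```python
from collections import Counter

# Two made-up batches of already tokenized and padded reviews.
toy_batches = [
    [[b"the", b"film", b"was", b"great"], [b"the", b"plot", b"<pad>", b"<pad>"]],
    [[b"the", b"film", b"dragged"], [b"the", b"cast", b"<pad>"]],
]

toy_vocabulary = Counter()
for token_batch in toy_batches:
    for tokens in token_batch:
        toy_vocabulary.update(tokens)  # adds 1 to the count of every token

toy_vocab_size = 3
# most_common(n) returns the n (token, count) pairs with the highest counts.
toy_truncated = [word for word, count in toy_vocabulary.most_common(toy_vocab_size)]

print(toy_vocabulary[b"the"])  # 4
print(toy_truncated)           # [b'the', b'<pad>', b'film']
```

Note that `<pad>` is counted like any other token, which is why it appears among the most common entries in the real vocabulary too.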

## Creating a lookup table

Computers can only process numbers, not words. Thus we need to convert the words in `truncated_vocabulary` into numbers.

So we now need to add a preprocessing step to replace each word with its ID (i.e., its index in the `truncated_vocabulary`). We will create a lookup table for this, using 1,000 out-of-vocabulary (oov) buckets.

We shall create the lookup table such that the most frequently occurring words have lower indices than less frequently occurring words.
 
**Note:**

- `tf.lookup.KeyValueTensorInitializer` : Table initializer given keys and values tensors. [More here](https://www.tensorflow.org/api_docs/python/tf/lookup/KeyValueTensorInitializer?version=nightly#methods)

- `tf.lookup.StaticVocabularyTable` : String to Id table wrapper that assigns out-of-vocabulary keys to buckets. [More here](https://www.tensorflow.org/api_docs/python/tf/lookup/StaticVocabularyTable#methods)

 An out-of-vocabulary term is mapped to a `bucket_id` between `vocab_size` and `vocab_size + num_oov_buckets - 1` (here, between 10,000 and 10,999), calculated as: hash(`<term>`) % `num_oov_buckets` + `vocab_size`

- `table.lookup` : Looks up keys in the table, outputs the corresponding values.
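
The bucket rule can be sketched in plain Python. The example below is an illustration under stated assumptions: the vocabulary is a made-up list of four words, and `zlib.crc32` is only a stand-in for the hash function, since TensorFlow uses its own fingerprint hash. The bucket IDs it produces will therefore differ from those returned by `StaticVocabularyTable`.

```python
import zlib

toy_vocab = [b"<pad>", b"the", b"movie", b"was"]  # made-up, ordered by frequency
toy_num_oov_buckets = 5
word_to_id = {word: index for index, word in enumerate(toy_vocab)}


def lookup(word):
    """Return the word's ID, or an out-of-vocabulary bucket ID if it is unknown."""
    if word in word_to_id:
        return word_to_id[word]
    # Stand-in hash (assumption): TensorFlow does not use crc32 here.
    return len(toy_vocab) + zlib.crc32(word) % toy_num_oov_buckets


ids = [lookup(word) for word in b"the movie was faaaaaantastic".split()]
print(ids[:3])  # [1, 2, 3]: known words get their position in the vocabulary
print(ids[3])   # some value from 4 to 8: one of the 5 out-of-vocabulary buckets
```

Two different unknown words may land in the same bucket. Using more buckets makes such collisions less likely.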

In [None]:
words = tf.constant(truncated_vocabulary)
word_ids = tf.range(len(truncated_vocabulary), dtype=tf.int64)
vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids) # Table initializers given keys and values tensors.
num_oov_buckets = 1000
# String to Id table wrapper that assigns out-of-vocabulary keys to buckets.
table = tf.lookup.StaticVocabularyTable(vocab_init, num_oov_buckets)

In [None]:
table.lookup(tf.constant([b"This movie was faaaaaantastic".split()]))

<tf.Tensor: shape=(1, 4), dtype=int64, numpy=array([[   22,    12,    11, 10053]])>

**Observe,** the words “this,” “movie,” and “was” were found in the table, so their IDs are lower than 10,000, while the word “faaaaaantastic” was not found, so it was mapped to one of the oov buckets, with an ID greater than or equal to 10,000.

## Creating the Final Train and Test sets

Now we will create the final training and test sets.

For creating the final training set `train_set`,

 - we batch the reviews

 - then we convert them to short sequences of words using the `preprocess()` function

 - then encode these words using a simple `encode_words()` function that uses the `table` we just built and finally [prefetch](https://www.tensorflow.org/guide/data_performance#prefetching) the next batch.

After training, we shall evaluate the model on the test data. Note that `batch(1000)` groups the whole test set into batches of 1,000 reviews; it does not limit the test set to 1,000 samples. To evaluate on only the first 1,000 reviews (which is faster), add `.take(1)` after `batch(1000)`.

For creating the final test set `test_set`,

 - we group the test reviews into batches of 1,000

 - then we convert them to short sequences of words using the `preprocess()` function

 - then encode these words using a simple `encode_words()` function that uses the `table` we just built.

**Note:**

 - `dataset.repeat().batch(32)` repeats the dataset indefinitely and groups the samples into batches of 32.

 - `dataset.repeat().batch(32).map(preprocess)` applies the function `preprocess` to every batch.

 - `dataset.map(encode_words).prefetch(1)` applies the function `encode_words` to every batch and prepares the next batch in parallel while the current one is being processed.

In [None]:
def encode_words(X_batch, y_batch):
    return table.lookup(X_batch), y_batch

train_set = datasets["train"].repeat().batch(32).map(preprocess)
train_set = train_set.map(encode_words).prefetch(1)

In [None]:
test_set = datasets["test"].batch(1000).map(preprocess)
test_set = test_set.map(encode_words)

In [None]:
for X_batch, y_batch in train_set.take(1):
    print(X_batch)
    print(y_batch)

tf.Tensor(
[[  22   11   28 ...    0    0    0]
 [   6   21   70 ...    0    0    0]
 [4099 6881    1 ...    0    0    0]
 ...
 [  22   12  118 ...  331 1047    0]
 [1757 4101  451 ...    0    0    0]
 [3365 4392    6 ...    0    0    0]], shape=(32, 60), dtype=int64)
tf.Tensor([0 0 0 1 1 1 0 0 0 0 0 1 1 0 1 0 1 1 1 0 1 1 1 1 1 0 0 0 1 0 0 0], shape=(32,), dtype=int64)


## Building the Model

- Now that we have preprocessed and created the dataset, we can create the model:

 - The first layer is an Embedding layer, which will convert word IDs into embeddings. The embedding matrix needs to have one row per word ID (vocab_size + num_oov_buckets) and one column per embedding dimension (this example uses 128 dimensions, but this is a hyperparameter you could tune). 
 - Whereas the inputs of the model will be 2D tensors of shape [batch size, time steps], the output of the Embedding layer will be a 3D tensor of shape [batch size, time steps, embedding size].
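
The shape change performed by the Embedding layer can be illustrated without TensorFlow. The sketch below is only an analogy: it builds a small embedding matrix as a nested list with recognisable values (a real layer starts from random values and learns them), and all the `toy_` sizes are made up. The last lines compute the number of rows and trainable values for the sizes used in this notebook.

```python
toy_vocab_size, toy_oov_buckets, toy_embed_size = 6, 2, 3
num_rows = toy_vocab_size + toy_oov_buckets  # one row per possible word ID

# Row i holds the vector for word ID i; the values are placeholders.
embedding_matrix = [[row + 0.1 * col for col in range(toy_embed_size)]
                    for row in range(num_rows)]

# A batch of word IDs with shape [batch size = 2, time steps = 4].
id_batch = [[1, 4, 5, 0], [7, 2, 0, 0]]

# The lookup replaces every ID with its row of the matrix.
embedded = [[embedding_matrix[word_id] for word_id in sequence]
            for sequence in id_batch]

shape = (len(embedded), len(embedded[0]), len(embedded[0][0]))
print(shape)  # (2, 4, 3): [batch size, time steps, embedding size]

# For this notebook: (10000 + 1000) rows of 128 values each.
embedding_params = (10000 + 1000) * 128
print(embedding_params)  # 1408000
```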

**Note:**

- `keras.layers.Embedding` : Turns positive integers (indexes) into dense vectors of fixed size. [More here](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding).
- `keras.layers.GRU` : The GRU(Gated Recurrent Unit) Layer.

In [None]:
embed_size = 128
model = keras.models.Sequential([
    keras.layers.Embedding(vocab_size + num_oov_buckets, embed_size,
                           mask_zero=True, # not shown in the book
                           input_shape=[None]),
    keras.layers.GRU(4, return_sequences=True),
    keras.layers.GRU(2),
    keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

In [None]:
str(type(model))=="<class 'tensorflow.python.keras.engine.sequential.Sequential'>"

True

## Training and Testing the Model

- It's time to train the model on the training data.

- Let us also measure the training time using the `time` module.

- Finally, let us evaluate the model on the test data.

In [None]:
import time
start = time.time()
history = model.fit(train_set, steps_per_epoch=train_size // 32, epochs=2)
end = time.time()

print("Time of execution:", end-start)

Epoch 1/2
Epoch 2/2
Time of execution: 132.91920161247253
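
Because `train_set` was built with `repeat()`, it never ends, so `fit()` has to be told how many batches make up one epoch. The arithmetic behind `steps_per_epoch=train_size // 32` is worked out below; the variable names other than `train_size` are introduced here for illustration.

```python
train_size = 25000  # number of training reviews, as reported earlier
batch_size = 32

steps_per_epoch = train_size // batch_size    # full batches per epoch
reviews_seen = steps_per_epoch * batch_size   # reviews covered by those batches
leftover = train_size - reviews_seen          # reviews that spill into the next epoch

print(steps_per_epoch, reviews_seen, leftover)  # 781 24992 8
```

Each epoch therefore covers 781 batches, and the 8 remaining reviews are simply picked up at the start of the next pass over the repeated dataset.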


In [None]:
model.evaluate(test_set)



[0.5337879061698914, 0.7559599876403809]

In [None]:
np.save('/content/sentiment_analysis.npy', history.history)
model.save("/content/sentiment_analysis.h5")
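
When reading the saved history back, keep in mind that `np.save` stores a dictionary as a pickled object array, so `np.load` needs `allow_pickle=True` and `.item()` to recover the dictionary. The sketch below demonstrates this round trip with made-up history values and a temporary file; the file name and numbers are assumptions, not results from this notebook.

```python
import os
import tempfile

import numpy as np

# Made-up values in the same layout as history.history.
toy_history = {"loss": [0.61, 0.48], "accuracy": [0.66, 0.79]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_history.npy")
    np.save(path, toy_history)  # the dict is wrapped in a 0-d object array
    # allow_pickle=True is required for object arrays; .item() unwraps the dict.
    loaded = np.load(path, allow_pickle=True).item()

print(loaded["accuracy"])  # [0.66, 0.79]
```

The saved model itself can be restored with `keras.models.load_model()`, passing the path used in `model.save()`.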