## Part 0: Importing dependencies

In [1]:
import tensorflow as tf

tf.__version__

'2.6.2'

## Part 1: Dataset preprocessing

In this exercise, IMDB reviews are classified by sentiment (positive or negative).
A recurrent neural network (RNN) is used for this.
The preprocessing of the data also differs considerably from previous exercises.

### Set up dataset parameters

These parameters determine what is loaded from the dataset.
Here, I keep the 20,000 most frequent words in the dataset and cap the length of every sequence (converted review) at 100 words.
For more details, see [the Keras documentation page](https://keras.io/api/datasets/imdb/).

In [2]:
number_of_words = 20000
max_len = 100
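
As a rough illustration of what the word limit does: word indices at or above the cutoff are replaced by an out-of-vocabulary marker, which is index 2 with the loader's default settings. The helper `cap_vocabulary` below is hypothetical and written in plain Python to mimic that rule; it is not the library's code.

```python
def cap_vocabulary(sequence, num_words, oov_char=2):
    """Replace every index >= num_words with the out-of-vocabulary marker."""
    return [idx if idx < num_words else oov_char for idx in sequence]


# A made-up review containing two rare words (indices 25000 and 31000).
review = [1, 14, 22, 25000, 16, 43, 31000]
capped = cap_vocabulary(review, num_words=20000)
print(capped)  # [1, 14, 22, 2, 16, 43, 2]
```

This is why the value 2 can show up repeatedly in a loaded review.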

### Load the IMDB dataset

In [3]:
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.imdb.load_data(num_words=number_of_words)

### Pad all sequences to be the same length

To be classified by the network, all reviews have to be of the same length.
Shorter reviews are therefore padded with zeros until they reach `max_len`.
Longer reviews are truncated.

In [4]:
X_train = tf.keras.preprocessing.sequence.pad_sequences(X_train, maxlen=max_len)
X_test = tf.keras.preprocessing.sequence.pad_sequences(X_test, maxlen=max_len)
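
A minimal sketch of the padding rule, assuming the default settings of `pad_sequences` (zeros are added at the front, and reviews that are too long lose their first tokens). The function `pad_or_truncate` is a hypothetical plain-Python stand-in for a single sequence, not the Keras implementation.

```python
def pad_or_truncate(sequence, maxlen, value=0):
    """Mimic the default behaviour: pad on the left, drop from the left."""
    if len(sequence) >= maxlen:
        return sequence[-maxlen:]  # keep only the last maxlen items
    return [value] * (maxlen - len(sequence)) + sequence


short_review = [5, 8, 13]
long_review = [1, 2, 3, 4, 5, 6, 7]
print(pad_or_truncate(short_review, maxlen=5))  # [0, 0, 5, 8, 13]
print(pad_or_truncate(long_review, maxlen=5))   # [3, 4, 5, 6, 7]
```

With these defaults, the end of a long review is what survives truncation.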

Let's look at one sequence. It appears as a sequence of integers rather than words, because the dataset has been partially prepared: each review is already encoded as a list of word indices.

In [5]:
X_train[1]

array([  163,    11,  3215, 10156,     4,  1153,     9,   194,   775,
           7,  8255, 11596,   349,  2637,   148,   605, 15358,  8003,
          15,   123,   125,    68,     2,  6853,    15,   349,   165,
        4362,    98,     5,     4,   228,     9,    43,     2,  1157,
          15,   299,   120,     5,   120,   174,    11,   220,   175,
         136,    50,     9,  4373,   228,  8255,     5,     2,   656,
         245,  2350,     5,     4,  9837,   131,   152,   491,    18,
           2,    32,  7464,  1212,    14,     9,     6,   371,    78,
          22,   625,    64,  1382,     9,     8,   168,   145,    23,
           4,  1690,    15,    16,     4,  1355,     5,    28,     6,
          52,   154,   462,    33,    89,    78,   285,    16,   145,
          95], dtype=int32)

### Set up Embedding Layer parameters

For more information on the embedding layer see the next section.

In [6]:
vocab_size = number_of_words
vocab_size

20000

In [7]:
embed_size = 128

## Part 2: Building a Recurrent Neural Network

Recurrent neural networks differ greatly from simple ANNs and CNNs: they process a sequence step by step and carry a hidden state from one step to the next.
For more details, see [the deep learning book](https://www.deeplearningbook.org/).

The RNN consists (much like the CNN in exercise 02) of a first part that makes the network an RNN and a simple part that produces the output.
There are more complex models, which are outside the scope of this exercise.

### Define the model

In [8]:
model = tf.keras.Sequential()

### Add the Embedding Layer

Word embeddings are a method of "translating" words into numerical data for the neural network.
Word embeddings are dense and trainable, which means they are more efficient than one-hot encodings and allow the network to learn relationships between words.
For more details, see [the TensorFlow word embeddings guide](https://www.tensorflow.org/text/guide/word_embeddings).

In [9]:
model.add(tf.keras.layers.Embedding(vocab_size, embed_size, input_shape=(X_train.shape[1],)))
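
Conceptually, an embedding layer is a trainable lookup table with one row per word index. The sketch below uses a tiny table with random toy values to show the lookup only; the names `make_embedding_table` and `embed` are hypothetical, and nothing here is trained.

```python
import random


def make_embedding_table(vocab_size, embed_size, seed=0):
    """Build a vocab_size x embed_size table of small random toy values."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(embed_size)]
            for _ in range(vocab_size)]


def embed(sequence, table):
    """Replace every word index with its row (vector) from the table."""
    return [table[idx] for idx in sequence]


toy_table = make_embedding_table(vocab_size=10, embed_size=4)
embedded = embed([3, 7, 3], toy_table)
print(len(embedded), len(embedded[0]))  # 3 4
print(embedded[0] == embedded[2])       # True: same index, same vector
```

In the notebook's model the table has 20000 rows of 128 values each, and training adjusts those values.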

### Add the LSTM Layer

Long Short-Term Memory (LSTM) units are very different from the neurons seen so far. They have gates that decide which pieces of information are kept and which are forgotten.
For more details, see [this paper](https://arxiv.org/pdf/1506.02078.pdf).

- number of neurons: 128
- activation function: tanh

In [10]:
model.add(tf.keras.layers.LSTM(units=128, activation='tanh'))
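
To make the gating idea concrete, here is a sketch of one time step of a single LSTM unit with scalar input and state, following the standard textbook equations. The function `lstm_step` and the parameter values are hypothetical and chosen only to show a unit that holds on to its memory; this is not how Keras implements the layer internally.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step for a single unit with scalar input and state."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g   # keep part of the old memory, add new content
    h = o * math.tanh(c)     # expose a gated view of the memory
    return h, c


# Toy parameters: forget gate almost fully open, input gate almost closed.
params = {k: 0.5 for k in ("wf", "uf", "wi", "ui", "wo", "uo", "wg", "ug")}
params.update(bf=10.0, bi=-10.0, bo=0.0, bg=0.0)

h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.8, p=params)
print(round(c, 3))  # 0.8: the cell state is carried over almost unchanged
```

The `tanh` in this sketch plays the role of the `activation` argument of the layer.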

### Add the Dense output layer

- neurons: 1
- activation function: sigmoid

In [11]:
model.add(tf.keras.layers.Dense(units=1, activation='sigmoid'))
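
A small sketch of what the sigmoid output means: it squashes the neuron's raw value into a number between 0 and 1 that can be read as the probability of a positive review. The 0.5 threshold for turning it into a label is an assumption made here for illustration. The second helper shows binary cross-entropy, the loss this notebook passes to `compile`, for a single prediction.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y_true, p, eps=1e-7):
    """Loss for one sample; p is clipped to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


for z in (-2.0, 0.0, 3.0):
    p = sigmoid(z)
    label = int(p > 0.5)  # assumed decision threshold
    print(f"z={z:+.1f}  p={p:.3f}  predicted label={label}")

# The same confident prediction is cheap when right and expensive when wrong.
confident_right = binary_cross_entropy(1, sigmoid(3.0))
confident_wrong = binary_cross_entropy(0, sigmoid(3.0))
print(confident_right < confident_wrong)  # True
```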

### Compile the model

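This time I go with Root Mean Square Propagation (RMSprop) for the optimizer.
Recurrent neural networks run into vanishing gradient problems, where the gradient shrinks toward zero as it is propagated back through many time steps. RMSprop divides each update by a running average of recent gradient magnitudes, which keeps the step size usable when gradients are very small or very large, and it is a common choice for recurrent networks.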

In [12]:
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
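
The scaling behaviour of RMSprop can be sketched for a single weight. This is the textbook update rule, not the exact TensorFlow implementation, and the hyperparameter values and the name `rmsprop_step` are assumptions chosen for illustration.

```python
import math


def rmsprop_step(grad, avg_sq, lr=0.001, rho=0.9, eps=1e-7):
    """Return the weight change and the updated running average of grad**2."""
    avg_sq = rho * avg_sq + (1.0 - rho) * grad ** 2
    step = lr * grad / (math.sqrt(avg_sq) + eps)
    return step, avg_sq


# Two gradients that differ by six orders of magnitude...
small_step, _ = rmsprop_step(grad=1e-3, avg_sq=0.0)
large_step, _ = rmsprop_step(grad=1e3, avg_sq=0.0)

# ...produce nearly the same step size after normalization.
print(f"{small_step:.5f} {large_step:.5f}")
```

Plain gradient descent would take steps a million times apart for these two gradients; the normalization removes most of that difference.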

In [13]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 100, 128)          2560000   
_________________________________________________________________
lstm (LSTM)                  (None, 128)               131584    
_________________________________________________________________
dense (Dense)                (None, 1)                 129       
Total params: 2,691,713
Trainable params: 2,691,713
Non-trainable params: 0
_________________________________________________________________
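
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM layout with four weight sets (three gates plus the candidate value), each with input weights, recurrent weights, and a bias.

```python
vocab_size, embed_size, lstm_units = 20000, 128, 128

# Embedding: one vector of embed_size values per word index.
embedding_params = vocab_size * embed_size

# LSTM: 4 weight sets, each with input weights, recurrent weights and biases.
lstm_params = 4 * (lstm_units * (embed_size + lstm_units) + lstm_units)

# Dense: one weight per LSTM output plus one bias.
dense_params = lstm_units * 1 + 1

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
# 2560000 131584 129 2691713
```

Almost all parameters sit in the embedding table, so the vocabulary size drives the size of this model.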


## Part 3: Train the model

The `batch_size` parameter defines the size of the training batches, i.e. the number of samples passed through the network before the weights are updated.
Batch size is an important training hyperparameter that can be tuned.
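
As a quick sanity check on what the batch size implies, the number of weight updates per epoch can be computed directly. The sketch assumes the 25,000 training reviews of the Keras IMDB dataset; `steps_per_epoch` is a hypothetical helper.

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return math.ceil(num_samples / batch_size)


num_train = 25000  # assumed size of the IMDB training split
for batch_size in (32, 128, 512):
    print(batch_size, steps_per_epoch(num_train, batch_size))
# 32 782
# 128 196
# 512 49
```

Larger batches mean fewer, smoother updates per epoch; smaller batches mean more, noisier ones.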

In [14]:
model.fit(X_train, y_train, epochs=3, batch_size=128)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7f1b7f0da9b0>

### Evaluate the model

In [15]:
test_loss, test_accuracy = model.evaluate(X_test, y_test)



In [16]:
print("Test accuracy: {}".format(test_accuracy))

Test accuracy: 0.8454800248146057


## Part 4: Save the model

### Save the architecture of the network as .json

In [17]:
model_json = model.to_json()
with open("03_imdb_model.json", "w") as json_file:
    json_file.write(model_json)

### Save the network weights

In [18]:
model.save_weights("03_imdb_model.h5")