# Text processing with RNNs using the IMDB reviews dataset

#### Run the cells and answer the following questions.

## Question 1:

#### The dense model below was built on unigram (1-gram) features and reaches a classification accuracy of ~0.888 on the test data. Reuse the same model with the 2-gram option. What do you have to change? What accuracy do you achieve on the test data? Is it better or worse than with 1-grams?

## Question 2:

#### Find the function that retrieves the created vocabulary (reference: https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization). Retrieve the vocabulary of the 2-gram model, display it, and count how many entries contain the word 'terrible'.

## Question 3:

#### In the last part, with the MultiHeadAttention layer, what regularization method could we use to cut training short when performance on the validation set starts oscillating?

## Question 4 (optional):

#### In the last part it would be interesting to access and inspect the Embedding and MultiHeadAttention layers to understand their structure, but the IMDB dataset is very complex. How can we create a very simple dataset of two or four sentences to work with the model, and how do we access those layers to inspect their weights? The most important thing is to understand why we choose one architecture over another.

## Download IMDB data from github and unpack it

In [1]:
!wget https://github.com/erinijapranckeviciene/MF54609_18981_1_20241/raw/refs/heads/main/datasets/RNN/aclImdb.zip

--2025-01-14 23:25:15--  https://github.com/erinijapranckeviciene/MF54609_18981_1_20241/raw/refs/heads/main/datasets/RNN/aclImdb.zip
Resolving github.com (github.com)... 140.82.113.3
Connecting to github.com (github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/erinijapranckeviciene/MF54609_18981_1_20241/refs/heads/main/datasets/RNN/aclImdb.zip [following]
--2025-01-14 23:25:16--  https://raw.githubusercontent.com/erinijapranckeviciene/MF54609_18981_1_20241/refs/heads/main/datasets/RNN/aclImdb.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 73380778 (70M) [application/zip]
Saving to: ‘aclImdb.zip’


2025-01-14 23:25:18 (87.1 MB/s) - ‘aclImdb.zip’ saved [73380778/73380778

In [2]:
!unzip -qq aclImdb.zip

## Prepare IMDB data

In [3]:
import os, pathlib, shutil, random
from tensorflow import keras

batch_size = 32
train_ds = keras.utils.text_dataset_from_directory("aclImdb/train", batch_size=batch_size )
val_ds = keras.utils.text_dataset_from_directory("aclImdb/val", batch_size=batch_size)
test_ds = keras.utils.text_dataset_from_directory("aclImdb/test", batch_size=batch_size)

Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


#### Inspect the dataset read from directory files
Displaying the shapes and dtypes of the first batch

In [4]:
for inputs, targets in train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

inputs.shape: (32,)
inputs.dtype: <dtype: 'string'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor(b"I just caught an episode about Brad, the crack cocaine addict who turned to a drug addicted life on the streets after his bicycle racing career went to shambles as fast as it started. I have to say that the story about his biking career was more heart-breaking than his drug addiction. Here's this young guy who is winning bike races left and right and is invited to train with an Olympic training team for two weeks, and immediately upon arriving he insults Lance Armstrong, one of the greatest athletes who ever lived, and is generally callous and unfriendly to everyone in general. Understandably, he is soon asked to leave. Most of the show is about his struggle with addiction and how he got his life back, but what I wanted to know was what was wrong with him in the first place to make his act like such an ass?<br /><br />At any rate, I was confused about how the 

#### Keep only text in text_only_train_ds. Use only text data.

In [5]:
# Drop the targets and keep only the review text;
# this text-only dataset is used to build the vocabulary
text_only_train_ds = train_ds.map(lambda x, y: x)
for inputs in text_only_train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("inputs[0]:", inputs[0])
    break


inputs.shape: (32,)
inputs.dtype: <dtype: 'string'>
inputs[0]: tf.Tensor(b'What I liked best in this film is that like the films of Hitchcock, it is a thriller that does not take itself too seriously.<br /><br />Hitchcock understood that people go the the movies to have a good time. Something that Hollywood seems to have forgotten in recent years. This is a thriller, but it has plenty of laughs and always has one eye winking at the camera.<br /><br />Rachel McAdams is wonderful as always. Cillian Murphy is creepier than he was in Batman Begins. In the old days, there were guys who always played the bad guy. We don\'t see much of that these days because I suspect the Hollywood agents consider it a bad career move, but Cillian Murphy is really good at being bad.<br /><br />The directing is surprising stylish. The story is good but the dialog could have used some sprucing up.<br /><br />"Red Eye" is a really fun film and people were applauding when the closing credits started rolling. If 

In [6]:
from tensorflow.keras.layers import TextVectorization

# The TextVectorization layer turns raw strings into numeric features
# https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization
# It standardizes the text and splits it into N-grams (single words or pairs of words),
# then encodes the result according to output_mode
text_vectorization = TextVectorization( max_tokens=20000, output_mode="multi_hot")

# adapt() builds the vocabulary from the words in text_only_train_ds (the IMDB reviews);
# the vocabulary is limited to max_tokens=20000 entries
text_vectorization.adapt(text_only_train_ds)

# Here datasets are created where the text in the string is converted
# to the representation of 20000 dimensional vectors in which a presence of word in
# the text is marked by 1 and absence by 0. This is done for train, validation and test.
binary_1gram_train_ds = train_ds.map( lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
binary_1gram_val_ds = val_ds.map( lambda x, y: (text_vectorization(x), y),num_parallel_calls=4)
binary_1gram_test_ds = test_ds.map( lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
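
To make the multi-hot representation concrete, here is a minimal pure-Python sketch of the idea on a toy corpus, for 1-grams and for N-grams up to 2. It is not how Keras implements `TextVectorization`: the helper names (`tokenize`, `ngrams_up_to`, `build_vocab`, `multi_hot`) and the two toy reviews are hypothetical, the vocabulary is sorted alphabetically rather than by frequency, and punctuation stripping and the out-of-vocabulary slot are left out.

```python
from itertools import chain

def tokenize(text):
    # Lowercase and split on whitespace (Keras also strips punctuation by default).
    return text.lower().split()

def ngrams_up_to(tokens, n):
    # All contiguous word groups of length 1..n, joined with a space.
    return [" ".join(tokens[i:i + k])
            for k in range(1, n + 1)
            for i in range(len(tokens) - k + 1)]

def build_vocab(texts, n):
    # Alphabetical order keeps this sketch deterministic.
    terms = chain.from_iterable(ngrams_up_to(tokenize(t), n) for t in texts)
    return sorted(set(terms))

def multi_hot(text, vocab, n):
    # 1 where the vocabulary entry occurs in the text, 0 otherwise.
    present = set(ngrams_up_to(tokenize(text), n))
    return [1 if term in present else 0 for term in vocab]

toy_reviews = ["a terrible movie", "not a terrible movie"]  # hypothetical data

vocab_1 = build_vocab(toy_reviews, 1)
vocab_2 = build_vocab(toy_reviews, 2)
print(vocab_1)  # ['a', 'movie', 'not', 'terrible']
print(vocab_2)  # adds 'a terrible', 'not a', 'terrible movie'
print(multi_hot("a terrible movie", vocab_1, 1))  # [1, 1, 0, 1]
print(multi_hot("a terrible movie", vocab_2, 2))  # [1, 1, 1, 0, 0, 1, 1]
```

With 1-grams the two toy reviews differ only in the `not` position; with 2-grams the pair `not a` also becomes a feature, which is the kind of local word order a bag of single words cannot express.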


In [7]:
for inputs, targets in binary_1gram_train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

inputs.shape: (32, 20000)
inputs.dtype: <dtype: 'int64'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor([1 1 1 ... 0 0 0], shape=(20000,), dtype=int64)
targets[0]: tf.Tensor(1, shape=(), dtype=int32)


#### This function creates the model from Listing 11.5
Our model-building utility

In [8]:
from tensorflow import keras
from tensorflow.keras import layers

# A small dense (fully connected) classifier
def get_model(max_tokens=20000, hidden_dim=16):
    inputs = keras.Input(shape=(max_tokens,))
    x = layers.Dense(hidden_dim, activation="relu")(inputs)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
    return model

In [9]:
model = get_model()
model.summary()
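
The summary output is not shown above, but the size of this model can be worked out by hand. A short sketch, assuming the default arguments of `get_model()`; the helper `dense_params` is a hypothetical name used only for this calculation.

```python
def dense_params(n_inputs, n_units):
    # One weight per (input, unit) pair plus one bias per unit.
    return n_inputs * n_units + n_units

max_tokens, hidden_dim = 20000, 16

hidden = dense_params(max_tokens, hidden_dim)  # Dense(16) on 20000 inputs
output = dense_params(hidden_dim, 1)           # Dense(1) on 16 inputs
total = hidden + output                        # Dropout has no trainable weights

print(hidden, output, total)  # 320016 17 320033
```

Almost all of the weights sit in the first layer, so the model size grows linearly with `max_tokens`.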



#### Training model

In [10]:
# cache() is a dataset method that saves the dataset elements in memory or disk and reuses them when called
callbacks = [ keras.callbacks.ModelCheckpoint("binary_1gram.keras", save_best_only=True) ]
model.fit(binary_1gram_train_ds.cache(), validation_data=binary_1gram_val_ds.cache(), verbose=0, epochs=10, callbacks=callbacks)

<keras.src.callbacks.history.History at 0x780a795fab30>

In [11]:
# Accuracy on test data
model = keras.models.load_model("binary_1gram.keras")
print(f"Test acc: {model.evaluate(binary_1gram_test_ds)[1]:.3f}")

782/782 ━━━━━━━━━━━━━━━━━━━━ 498s 635ms/step - accuracy: 0.8897 - loss: 0.2896
Test acc: 0.888


## Different output mode -  integer sequence datasets

#### An LSTM layer with 4 units gives worse results than the 2-gram model

In [12]:
import tensorflow as tf
from tensorflow.keras import layers

# max_length is a length of the vector that encodes text
# 250 size gave ~0.85 on test data
# 600 with 4 LSTM units does not train at all
# the reviews are about 300 words
max_length = 300
max_tokens = 20000
text_vectorization = layers.TextVectorization( max_tokens=max_tokens, output_mode="int", output_sequence_length=max_length)

text_vectorization.adapt(text_only_train_ds)

int_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
int_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
int_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
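
The `"int"` output mode can be illustrated on a tiny corpus. The following is a simplified pure-Python sketch, not the Keras implementation: `build_index`, `encode` and the four toy sentences are hypothetical, and ties in word frequency are broken alphabetically here. It mirrors two conventions of the layer: index 0 is padding and index 1 stands for out-of-vocabulary words, and both count towards `max_tokens`.

```python
from collections import Counter

PAD, UNK = 0, 1  # reserved indices: padding and out-of-vocabulary

def build_index(texts, max_tokens):
    counts = Counter(word for t in texts for word in t.lower().split())
    # Most frequent first; ties broken alphabetically to keep the result deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    kept = ordered[:max_tokens - 2]  # two slots are taken by PAD and UNK
    return {word: i + 2 for i, word in enumerate(kept)}

def encode(text, index, max_length):
    # Map words to indices, truncate to max_length, then pad with zeros.
    ids = [index.get(w, UNK) for w in text.lower().split()][:max_length]
    return ids + [PAD] * (max_length - len(ids))

toy = ["good film", "bad film", "good good acting", "bad plot"]  # hypothetical data
index = build_index(toy, max_tokens=5)

print(index)                                                # {'good': 2, 'bad': 3, 'film': 4}
print(encode("good film", index, 4))                        # [2, 4, 0, 0]
print(encode("bad acting and a bad plot twist", index, 4))  # [3, 1, 1, 1]
```

Short reviews are padded with 0, long ones are cut at `max_length`, and any word outside the vocabulary is encoded as 1.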

In [13]:
for inputs, targets in int_train_ds:
    print(inputs.shape)
    print(inputs[0])
    print(targets[0])
    # With this test_input variable verify tf.one_hot() transformation
    test_input=inputs[0]
    break

(32, 300)
tf.Tensor(
[   11   482     7  1006   384     3     2   538   112     2   354    26
     4   164  6358     3  1411    66   838  3140    17   252    82    30
    73    51   937     3    81    69  2147     6     9    30    73    51
  1783     3   896    21     4   158   662    71    82  3656  4020    21
   354     3    66   456    39   384     3     2   538 12372  2744    13
    13   879  8317     1     7     4  7013    17     4   463  1053    55
   183    73    51  4541     3  3460 10134    71     2   340     5    40
   458   350    55     7  1903    17     2   342     5    40  1121  1530
    36    14    40  3168   133 15112    25  2381  7751    21    40    40
 15904  1623     3  1978     3  1116   310  2701    40   971  1419    52
    74    13    13     1     1     4  4491     7  8393  1319    17  3055
   139     5    40   358   672    37  4261     3    53     4  2188  2942
   726   261    34   597     8    40    55     7  1050   895     6    83
     4   661   808    33    40

#### Try RNN for text classification using integer feature vector

In [14]:
tf.one_hot(test_input, depth=max_tokens)

<tf.Tensor: shape=(300, 20000), dtype=float32, numpy=
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)>
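
A rough calculation shows why this representation is expensive. The sketch below reproduces the one-hot encoding with NumPy for the first three token ids printed above and estimates the memory of the dense float32 tensor; the helper `one_hot` is a hypothetical stand-in for `tf.one_hot`, and the estimate ignores any framework overhead.

```python
import numpy as np

def one_hot(ids, depth):
    # Row i is all zeros except for a 1.0 at column ids[i].
    out = np.zeros((len(ids), depth), dtype=np.float32)
    out[np.arange(len(ids)), ids] = 1.0
    return out

encoded = one_hot(np.array([11, 482, 7]), depth=20000)
print(encoded.shape)        # (3, 20000)
print(encoded.sum(axis=1))  # exactly one 1.0 per row

max_length, max_tokens, batch_size = 300, 20000, 32
bytes_per_review = max_length * max_tokens * 4  # float32 takes 4 bytes
mb_per_review = bytes_per_review / 1e6
mb_per_batch = mb_per_review * batch_size
print(mb_per_review, mb_per_batch)  # 24.0 768.0
```

At roughly 24 MB per review and 768 MB per batch of 32, the input alone is large, which is one reason to keep the recurrent layer small in the next cell.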

In [21]:
# The sizes of layers are minimized in order to run
import tensorflow as tf

inputs = keras.Input(shape=(None,), dtype="int64")
# One-hot encode every integer token, as checked with test_input above

embedded = tf.keras.ops.one_hot(inputs, num_classes=max_tokens)
print(embedded)

# RNN layer
# Do not use too large a model, e.g. x = layers.Bidirectional(layers.LSTM(32))(embedded)
# With units=4 the model runs.
x = layers.LSTM(10)(embedded)
x = layers.Dropout(0.1)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()


<KerasTensor shape=(None, None, 20000), dtype=float32, sparse=False, name=keras_tensor_18>


In [22]:
callbacks = [ keras.callbacks.ModelCheckpoint("int_lstm.keras", save_best_only=True) ]

In [23]:
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, verbose=1, callbacks=callbacks)

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 47s 68ms/step - accuracy: 0.5161 - loss: 0.6921 - val_accuracy: 0.6316 - val_loss: 0.6517
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 79s 67ms/step - accuracy: 0.6459 - loss: 0.6515 - val_accuracy: 0.7078 - val_loss: 0.6085
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 83s 70ms/step - accuracy: 0.7311 - loss: 0.5790 - val_accuracy: 0.7872 - val_loss: 0.5143
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 81s 68ms/step - accuracy: 0.7998 - loss: 0.5066 - val_accuracy: 0.8372 - val_loss: 0.4449
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 82s 68ms/step - accuracy: 0.8291 - loss: 0.4621 - val_accuracy: 0.8422 - val_loss: 0.4273
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 81s 67ms/step - accuracy: 0.8352 - loss: 0.4511 - val_accuracy: 0.8418 - val_loss: 0.4303
Epoch 7/10
[1m6

<keras.src.callbacks.history.History at 0x780a701aafb0>

In [24]:
model = keras.models.load_model("int_lstm.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

782/782 ━━━━━━━━━━━━━━━━━━━━ 22s 27ms/step - accuracy: 0.8545 - loss: 0.3942
Test acc: 0.855


In [25]:
pred=model.predict(int_test_ds)

782/782 ━━━━━━━━━━━━━━━━━━━━ 17s 21ms/step


In [None]:
print(pred)

[[0.50074637]
 [0.9223024 ]
 [0.92039317]
 ...
 [0.10627599]
 [0.04852633]
 [0.9195446 ]]


In [28]:
import numpy as np

predicted=np.array([int(x>0.5) for x in np.concatenate(pred) ] )
print(predicted)
unique, counts = np.unique(predicted, return_counts=True)
print([unique, counts])

[0 0 1 ... 0 0 0]
[array([0, 1]), array([13013, 11987])]


In [29]:
import numpy as np
# This displays the contents of the dataset
# list(int_test_ds)
# Collect targets to use in confusion table
target_arr=[]
for inputs, targets in int_test_ds:
    target=[int(x) for x in targets]
    target_arr.append(target)

actual=np.concatenate(target_arr)
print(actual)
unique, counts = np.unique(actual, return_counts=True)
print([unique, counts])

[0 0 0 ... 1 0 0]
[array([0, 1]), array([12500, 12500])]


In [30]:
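# Show the confusion matrix.
# It implies an accuracy of about 0.50, far from the 0.855 reported by model.evaluate().
# Likely cause (still to be verified): int_test_ds is reshuffled on every pass, so `pred`
# and `actual` were collected in different orders and no longer line up.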
tf.math.confusion_matrix(actual, predicted)

<tf.Tensor: shape=(2, 2), dtype=int32, numpy=
array([[6534, 5966],
       [6479, 6021]], dtype=int32)>
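
The diagonal of this matrix sums to 6534 + 6021 = 12555 out of 25000, about 0.502, which is what pairing predictions with unrelated labels would give. One plausible explanation, stated here as an assumption rather than a verified diagnosis, is that `text_dataset_from_directory` shuffles by default, so the pass made by `model.predict` and the later loop over `int_test_ds` visit the reviews in different orders. The NumPy simulation below shows the effect on synthetic data; the 85% hit rate, the helper `confusion_matrix` and the variable names are hypothetical.

```python
import numpy as np

def confusion_matrix(actual, predicted):
    # Rows are true classes, columns are predicted classes.
    cm = np.zeros((2, 2), dtype=int)
    np.add.at(cm, (actual, predicted), 1)
    return cm

rng = np.random.default_rng(0)
actual = np.repeat([0, 1], 12500)

# Simulated classifier that is right about 85% of the time.
correct = rng.random(actual.size) < 0.85
predicted = np.where(correct, actual, 1 - actual)

aligned = confusion_matrix(actual, predicted)
# Labels read in a second, differently shuffled pass no longer match the predictions.
misaligned = confusion_matrix(rng.permutation(actual), predicted)

aligned_acc = np.trace(aligned) / actual.size
misaligned_acc = np.trace(misaligned) / actual.size
print(aligned_acc, misaligned_acc)  # roughly 0.85 versus roughly 0.50
```

If this is indeed the cause, two possible remedies are to load the test split with `shuffle=False`, or to collect predictions and labels in the same loop, for example with `model.predict_on_batch(inputs)` on each batch.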

#### Try to use an Embedding layer for the previous problem

In [31]:
vocab_size = 20000
embed_dim = 256

# Not used here (from FC book Ch.11 Listing 11.22)
#num_heads = 2
#dense_dim = 32

inputs = keras.Input(shape=(None,), dtype="int64")
x = layers.Embedding(vocab_size, embed_dim)(inputs)

x = layers.LSTM(35)(x)

x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)

# Maybe optimizer could be different?
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
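
An embedding lookup is mathematically the same as multiplying a one-hot vector by a weight matrix, only without building the one-hot tensor. The NumPy sketch below checks this on tiny hypothetical sizes; `table` stands in for the weights of the `Embedding` layer and is filled with random numbers here.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 6, 3                      # tiny hypothetical sizes
table = rng.normal(size=(vocab_size, embed_dim))  # stand-in for the Embedding weights

ids = np.array([4, 1, 4, 0])                      # one short integer sequence

looked_up = table[ids]              # what an embedding lookup does: pick rows
one_hot = np.eye(vocab_size)[ids]   # shape (4, 6), one 1.0 per row
via_matmul = one_hot @ table        # same vectors, far more arithmetic

same = bool(np.allclose(looked_up, via_matmul))
print(looked_up.shape, same)        # (4, 3) True

embedding_weights = 20000 * 256     # weights in Embedding(20000, 256)
print(embedding_weights)            # 5120000
```

The lookup avoids the large one-hot input, but with `embed_dim = 256` the embedding table alone holds about 5.1 million weights that have to be learned from 20000 training reviews.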

In [32]:
callbacks = [ keras.callbacks.ModelCheckpoint("int_embed_lstm.keras", save_best_only=True) ]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, verbose=1, callbacks=callbacks)

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 18s 25ms/step - accuracy: 0.5097 - loss: 0.6942 - val_accuracy: 0.5048 - val_loss: 0.6947
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 16s 26ms/step - accuracy: 0.5269 - loss: 0.6915 - val_accuracy: 0.5084 - val_loss: 0.6934
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 14s 23ms/step - accuracy: 0.5291 - loss: 0.6885 - val_accuracy: 0.5122 - val_loss: 0.6957
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 22s 25ms/step - accuracy: 0.5471 - loss: 0.6813 - val_accuracy: 0.5252 - val_loss: 0.6879
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 17s 27ms/step - accuracy: 0.5614 - loss: 0.6641 - val_accuracy: 0.5518 - val_loss: 0.6642
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 16s 25ms/step - accuracy: 0.5774 - loss: 0.6458 - val_accuracy: 0.5598 - val_loss: 0.6546
Epoch 7/10
[1m6

<keras.src.callbacks.history.History at 0x7809fb151690>

In [33]:
model = keras.models.load_model("int_embed_lstm.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

782/782 ━━━━━━━━━━━━━━━━━━━━ 6s 7ms/step - accuracy: 0.8047 - loss: 0.4942
Test acc: 0.801


### The more complex the architecture, the poorer the results

#### Model that uses Embedding and MultiHeadAttention layers (loosely analogous to Listing 11.22)

In [34]:
# What can we do to improve? Here we don't have any RNN, just transformations

vocab_size = 20000
embed_dim = 256

# from FC book Ch.11 Listing 11.22
# num_heads = 2 achieved 0.869
# Let's try more heads
# With 4 heads 0.873, just a bit better; with 5 heads 0.870
# Still worse than a simpler method.
num_heads = 5
dense_dim = 32

inputs = keras.Input(shape=(None,), dtype="int64")
x = layers.Embedding(vocab_size, embed_dim)(inputs)

# try to add attention
x = layers.MultiHeadAttention( num_heads=num_heads, key_dim=embed_dim)(x,x)

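# This layer is needed. What does it do? It collapses the sequence axis by taking, for each
# feature, the maximum over all token positions, so the Dense layer gets one vector per review.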
x = layers.GlobalMaxPooling1D()(x)

x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)

model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
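
To see what the two central layers compute, here is a simplified NumPy sketch of single-head self-attention followed by global max pooling. It is an illustration under stated assumptions, not the Keras implementation: the real `MultiHeadAttention` layer applies learned query, key, value and output projections in each head, which are omitted here, and the function names and random token vectors are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # x: (seq_len, dim). Queries, keys and values are all x (no learned projections).
    scores = x @ x.T / np.sqrt(x.shape[-1])  # similarity of every token pair
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ x, weights              # weighted mix of token vectors

def global_max_pool(x):
    # Largest value of each feature over all positions: (seq_len, dim) -> (dim,)
    return x.max(axis=0)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 4))             # 5 hypothetical token vectors of size 4

attended, weights = self_attention(tokens)
pooled = global_max_pool(attended)
print(attended.shape, pooled.shape)          # (5, 4) (4,)
print(weights.sum(axis=1))                   # every row sums to 1
```

Attention keeps one vector per token, so the output still has a sequence axis; the pooling step is what turns a review of any length into a single fixed-size vector for the final `Dense` layer.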

#### Train the model with attention. It clearly improves on the previous model, but not on the simpler 2-gram model.

In [35]:
callbacks = [ keras.callbacks.ModelCheckpoint("attention.keras", save_best_only=True) ]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=20, callbacks=callbacks)

Epoch 1/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 42s 58ms/step - accuracy: 0.5084 - loss: 0.6942 - val_accuracy: 0.5000 - val_loss: 0.6940
Epoch 2/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 36s 57ms/step - accuracy: 0.5202 - loss: 0.6926 - val_accuracy: 0.5836 - val_loss: 0.6548
Epoch 3/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 41s 58ms/step - accuracy: 0.6555 - loss: 0.6213 - val_accuracy: 0.8152 - val_loss: 0.4337
Epoch 4/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 40s 57ms/step - accuracy: 0.7776 - loss: 0.4691 - val_accuracy: 0.8428 - val_loss: 0.3689
Epoch 5/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 41s 56ms/step - accuracy: 0.8252 - loss: 0.3889 - val_accuracy: 0.7776 - val_loss: 0.4559
Epoch 6/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 36s 57ms/step - accuracy: 0.8482 - loss: 0.3501 - val_accuracy: 0.8646 - val_loss: 0.3240
Epoch 7/20
[1m6

<keras.src.callbacks.history.History at 0x7809f3cf7d60>

In [36]:
model = keras.models.load_model("attention.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

782/782 ━━━━━━━━━━━━━━━━━━━━ 12s 15ms/step - accuracy: 0.8732 - loss: 0.3030
Test acc: 0.875
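
The validation loss logged for this model goes down, jumps up at epoch 5 and drops again at epoch 6. One common way to deal with such oscillation is early stopping with some patience, which Keras provides as the `keras.callbacks.EarlyStopping` callback (with arguments such as `monitor`, `patience` and `restore_best_weights`). The pure-Python sketch below only illustrates the stopping rule; `early_stopping_epoch` is a hypothetical helper, and the last three loss values are invented to show the rule firing.

```python
def early_stopping_epoch(val_losses, patience):
    # Returns (epoch at which training stops, best epoch), both 1-based.
    best_loss, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# The first six values are the val_loss figures logged above; the last three are
# hypothetical, added only to show what happens once the loss keeps rising.
val_losses = [0.6940, 0.6548, 0.4337, 0.3689, 0.4559, 0.3240, 0.3400, 0.3600, 0.3900]

print(early_stopping_epoch(val_losses, patience=1))  # (5, 4)
print(early_stopping_epoch(val_losses, patience=2))  # (8, 6)
```

With a patience of 1 the single bad epoch 5 would end training before the better epoch 6 is reached, while a patience of 2 rides out the oscillation. This is why the patience value matters when the validation curve is noisy.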
