# Text processing with RNNs using the IMDB reviews dataset

#### Run the cells and answer the following questions.

## Question 1:

#### The dense model below was built on unigram (1-gram) features and reaches a classification accuracy of ~0.888 on the test data. Reuse the same model with bigram (2-gram) features. What do you have to change? What accuracy do you achieve on the test data? Is it better or worse than with unigrams?

## Question 2:

#### Find the function that retrieves the created vocabulary (reference: https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization). Retrieve the vocabulary of the 2-gram model, display it, and count how many entries contain the word 'terrible'.

## Question 3:

#### In the last part, with the MultiHeadAttention layer, what regularization method could we use to cut training short when performance on the validation set starts oscillating?

## Question 4 (optional):

#### In the last part it would be interesting to access and inspect the Embedding and MultiHeadAttention layers to understand their structure, but the IMDB dataset is too complex for that. How can we create a very simple dataset of two or four sentences to work with the model, and how can we access those layers to inspect their weights? The most important thing is to understand why we choose one architecture over another.

## Prepare IMDB data

In [None]:
import os, pathlib, shutil, random
from tensorflow import keras

batch_size = 32
train_ds = keras.utils.text_dataset_from_directory("aclImdb/train", batch_size=batch_size )
val_ds = keras.utils.text_dataset_from_directory("aclImdb/val", batch_size=batch_size)
test_ds = keras.utils.text_dataset_from_directory("aclImdb/test", batch_size=batch_size)

#### Inspect the dataset read from directory files 
Displaying the shapes and dtypes of the first batch

In [2]:
for inputs, targets in train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

inputs.shape: (32,)
inputs.dtype: <dtype: 'string'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor(b'An entertaining and substantive film, Non-Stop has drawn deserving comparisons with "Run Lola Run". The film quickly develops into a chase sequence, during which the viewers learn about the three main characters through flashbacks and daydream sequences. The chase serves not as as a fast-paced climax, but as a journey that makes up the majority of the film. During the "run" we see the characters grow and momentarily forget about their dreary lives, about the "macho" roles they\'ve bought into, and eventually forgetting about why they started running in the first place. Much like fighting provided a "clarity" for the characters in "Fight Club," running provides this film\'s characters with a means to step away from the false values that we all allow society to create for us. Their running serves as way to truly taste life from an unclouded perspective, and all 

#### Keep only the text in text_only_train_ds and drop the labels.

In [3]:
# Drop the targets and keep only the text;
# the resulting dataset is used to build the vocabulary
text_only_train_ds = train_ds.map(lambda x, y: x)
for inputs in text_only_train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("inputs[0]:", inputs[0])
    break


inputs.shape: (32,)
inputs.dtype: <dtype: 'string'>
inputs[0]: tf.Tensor(b'When this movie was first shown on television I had high hopes that we would finally have a decent movie about World War I as experienced by American soldiers. Unfortunately this is not it.<br /><br />It should have been a good movie about WWI. Even though it was made for television it is obvious that a real effort was made to use appropriate equipment and props. But the writing and directing are badly lacking, even though the makers of this movie obviously borrowed freely from quite a few well made war movies. War movie clich\xc3\xa9s abound such as the arrogant general who apparently does not care a flip about the lives of his men. When will Hollywood realize that, even though there have been plenty of bad generals, most combat unit generals have seen plenty of combat themselves and are not naive about what the average grunt experiences? The first part of this movie appeared to be "Paths of Glory" with America

In [12]:
from tensorflow.keras.layers import TextVectorization

# The TextVectorization layer standardizes text, splits it into n-grams
# (single words or word pairs), and encodes it according to output_mode
# https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization
text_vectorization = TextVectorization(max_tokens=20000, output_mode="multi_hot")

# adapt() builds the vocabulary from the words in the IMDB training reviews;
# max_tokens limits it to 20000 entries
text_vectorization.adapt(text_only_train_ds)

# Convert each review string into a 20000-dimensional vector in which
# 1 marks the presence of a word in the text and 0 its absence.
# This is done for the train, validation, and test datasets.
binary_1gram_train_ds = train_ds.map( lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
binary_1gram_val_ds = val_ds.map( lambda x, y: (text_vectorization(x), y),num_parallel_calls=4)
binary_1gram_test_ds = test_ds.map( lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)


In [13]:
for inputs, targets in binary_1gram_train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

inputs.shape: (32, 20000)
inputs.dtype: <dtype: 'float32'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor([1. 1. 1. ... 0. 0. 0.], shape=(20000,), dtype=float32)
targets[0]: tf.Tensor(1, shape=(), dtype=int32)
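To make the multi-hot encoding concrete, here is a minimal pure-Python sketch of the idea. It is not the layer's implementation: the `standardize`, `ngrams`, and `multi_hot` helpers and the toy vocabulary are hypothetical, the sketch omits the slot the real layer reserves for unknown tokens, and it assumes that an n-gram setting of 2 means "all n-grams up to size 2", as the TextVectorization documentation describes.

```python
import re

def standardize(text):
    # Lowercase and strip punctuation, roughly like the layer's default standardization.
    return re.sub(r"[^\w\s]", "", text.lower())

def ngrams(tokens, n):
    # Return all n-grams of size 1..n as space-joined strings.
    out = []
    for size in range(1, n + 1):
        out += [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]
    return out

def multi_hot(text, vocab, n=1):
    # 1.0 where the vocabulary entry occurs in the text, 0.0 otherwise.
    present = set(ngrams(standardize(text).split(), n))
    return [1.0 if term in present else 0.0 for term in vocab]

toy_vocab = ["terrible", "movie", "not", "not terrible"]  # hypothetical vocabulary
review = "Not terrible, a decent movie."

unigram_vec = multi_hot(review, toy_vocab, n=1)
bigram_vec = multi_hot(review, toy_vocab, n=2)
print(unigram_vec)  # [1.0, 1.0, 1.0, 0.0]
print(bigram_vec)   # [1.0, 1.0, 1.0, 1.0]
```

With unigrams only, "not" and "terrible" are marked independently; the bigram entry "not terrible" can only be set once word pairs are part of the vocabulary.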


#### This function creates the model from Listing 11.5
Our model-building utility

In [15]:
from tensorflow import keras
from tensorflow.keras import layers

# This is dense model
def get_model(max_tokens=20000, hidden_dim=16):
    inputs = keras.Input(shape=(max_tokens,))
    x = layers.Dense(hidden_dim, activation="relu")(inputs)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
    return model

In [16]:
model = get_model()
model.summary()



Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense (Dense)               (None, 16)                320016    
                                                                 
 dropout (Dropout)           (None, 16)                0         
                                                                 
 dense_1 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320033 (1.22 MB)
Trainable params: 320033 (1.22 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
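The parameter counts in the summary can be reproduced by hand. The helper below is a hypothetical name used only for this check; it counts one weight per input-unit pair plus one bias per unit.

```python
def dense_params(input_dim, units):
    # Weight matrix (input_dim x units) plus one bias per unit.
    return input_dim * units + units

hidden = dense_params(20000, 16)   # first Dense layer
output = dense_params(16, 1)       # sigmoid output layer
total = hidden + output
print(hidden, output, total)       # 320016 17 320033
```

Almost all parameters sit in the first layer, because every one of the 20000 vocabulary slots connects to each of the 16 hidden units.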


#### Training model

In [17]:
# cache() is a dataset method that stores the elements in memory (or on disk) after the first pass and reuses them on later passes
callbacks = [ keras.callbacks.ModelCheckpoint("binary_1gram.keras", save_best_only=True) ]
model.fit(binary_1gram_train_ds.cache(), validation_data=binary_1gram_val_ds.cache(), verbose=0, epochs=10, callbacks=callbacks)

2024-01-16 18:49:19.326636: I external/local_xla/xla/service/service.cc:168] XLA service 0x7f820840c350 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2024-01-16 18:49:19.326669: I external/local_xla/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA GeForce RTX 2060, Compute Capability 7.5
2024-01-16 18:49:19.337653: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2024-01-16 18:49:19.366483: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:454] Loaded cuDNN version 8904
I0000 00:00:1705448959.406738   31625 device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


<keras.src.callbacks.History at 0x7f83102a6690>

In [18]:
# Accuracy on test data
model = keras.models.load_model("binary_1gram.keras")
print(f"Test acc: {model.evaluate(binary_1gram_test_ds)[1]:.3f}")

Test acc: 0.888
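Note that the reloaded model is not necessarily the one from the last epoch. With `save_best_only=True`, `ModelCheckpoint` overwrites the file only when the monitored quantity improves, and the monitored quantity defaults to the validation loss. The sketch below mimics that rule in plain Python; the function name and the loss values are made up for illustration.

```python
def epochs_that_save(val_losses):
    # Return the 1-based epochs at which a "best only" checkpoint would be overwritten.
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            saved.append(epoch)
    return saved

made_up_val_losses = [0.40, 0.31, 0.29, 0.30, 0.33, 0.28, 0.35]  # not real results
saved_epochs = epochs_that_save(made_up_val_losses)
print(saved_epochs)  # [1, 2, 3, 6]
```

In this made-up run the file on disk after training holds the weights from epoch 6, which is what `load_model` would return.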


## Different output mode -  integer sequence datasets

#### An LSTM layer with 4 units gives worse results than the 2-gram model

In [73]:
import tensorflow as tf
from tensorflow.keras import layers

# max_length is the length of the integer sequence that encodes each review
# max_length = 250 gave ~0.85 on the test data
# max_length = 600 with 4 LSTM units does not train at all
# the reviews are about 300 words long
max_length = 300 
max_tokens = 20000
text_vectorization = layers.TextVectorization( max_tokens=max_tokens, output_mode="int", output_sequence_length=max_length)

text_vectorization.adapt(text_only_train_ds)

int_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
int_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
int_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)

In [5]:
for inputs, targets in int_train_ds:
    print(inputs.shape)
    print(inputs[0])
    print(targets[0])
    # Keep one encoded review in test_input to check the tf.one_hot() transformation
    test_input=inputs[0]
    break

(32, 300)
tf.Tensor(
[  275    10  3286    17    11    20  1052    10   417     8  4824     3
  2755    37    84  7536    19   845     9     7    10   220  2625     2
  1508    21     4    88   472   287     6    83    49    55   117    17
    56     1   173    32    10    69   130    62    62    50    39    56
   278    14  1067    21    11    20     3     9   125  4166    70   241
    10     1  1395  3142     9    46    45    23    69    85   948   775
   583     3     5   257     4   226    12    76  3522    23    21   123
 19844    68     9     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0
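The trailing zeros and the occasional `1` in the output above follow from how the `"int"` output mode works: index 0 is reserved for padding and index 1 for out-of-vocabulary words, and every sequence is padded or truncated to `output_sequence_length`. A minimal sketch of that behavior, with a hypothetical `encode` helper and toy word index:

```python
PAD, UNK = 0, 1  # index 0 pads, index 1 stands for out-of-vocabulary words

def encode(text, word_index, sequence_length):
    # Map words to indices, cut off anything past sequence_length, pad the rest with 0.
    ids = [word_index.get(word, UNK) for word in text.lower().split()]
    ids = ids[:sequence_length]
    return ids + [PAD] * (sequence_length - len(ids))

toy_index = {"this": 2, "movie": 3, "was": 4, "great": 5}  # hypothetical vocabulary

short_ids = encode("This movie was great", toy_index, 6)
long_ids = encode("This movie was really really great indeed", toy_index, 6)
print(short_ids)  # [2, 3, 4, 5, 0, 0]
print(long_ids)   # [2, 3, 4, 1, 1, 5]
```

Reviews shorter than `max_length` end in a run of zeros, and longer reviews lose their final words.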

#### Try an RNN for text classification using integer sequences

In [6]:
tf.one_hot(test_input, depth=max_tokens)

<tf.Tensor: shape=(300, 20000), dtype=float32, numpy=
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.]], dtype=float32)>
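The last rows of the output above all start with `1.` because the end of the review is padding (index 0), and one-hot encoding turns index 0 into a 1 in column 0. The NumPy sketch below reproduces this on a toy sequence; the `one_hot` helper is a stand-in for `tf.one_hot`, not its implementation.

```python
import numpy as np

def one_hot(ids, depth):
    # Row i has a single 1.0 in column ids[i].
    return np.eye(depth, dtype=np.float32)[np.asarray(ids)]

padded = [3, 2, 0, 0]  # toy sequence: two tokens followed by padding (index 0)
encoded = one_hot(padded, depth=5)
print(encoded.shape)   # (4, 5)
print(encoded[-1])     # [1. 0. 0. 0. 0.]
```

At the notebook's sizes each review becomes a 300 x 20000 float32 array, about 24 million bytes, which is one reason the models that consume it have to stay small.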

In [7]:
# The layer sizes are kept small so that the model can run
import tensorflow as tf

inputs = keras.Input(shape=(None,), dtype="int64")
# One-hot encode the integer sequences, as done for test_input above
embedded = tf.one_hot(inputs, depth=max_tokens)
print(embedded)

# RNN layer
# Do NOT use too big a model, e.g. x = layers.Bidirectional(layers.LSTM(32))(embedded)
# With units=4 the model runs; this cell uses 10 units.
x = layers.LSTM(10)(embedded)
x = layers.Dropout(0.1)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()


KerasTensor(type_spec=TensorSpec(shape=(None, None, 20000), dtype=tf.float32, name=None), name='tf.one_hot/one_hot:0', description="created by layer 'tf.one_hot'")
Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, None)]            0         
                                                                 
 tf.one_hot (TFOpLambda)     (None, None, 20000)       0         
                                                                 
 lstm (LSTM)                 (None, 10)                800440    
                                                                 
 dropout (Dropout)           (None, 10)                0         
                                                                 
 dense (Dense)               (None, 1)                 11        
                                                                 
Total params: 800451 (3.05 MB
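The LSTM parameter count in the summary can be checked with the usual formula: four gates, each with input weights, recurrent weights, and one bias vector. The helper name below is hypothetical.

```python
def lstm_params(input_dim, units):
    # Four gates, each with (input_dim + units) * units weights and units biases.
    return 4 * ((input_dim + units) * units + units)

one_hot_lstm = lstm_params(input_dim=20000, units=10)
print(one_hot_lstm)            # 800440
print(one_hot_lstm + 10 + 1)   # 800451, adding the final Dense layer (10 weights + 1 bias)
```

Nearly all of these weights connect the 20000-dimensional one-hot input to the gates, so the input width dominates the model size far more than the number of units does.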

In [8]:
callbacks = [ keras.callbacks.ModelCheckpoint("int_lstm.keras", save_best_only=True) ]

In [9]:
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, verbose=1, callbacks=callbacks)

Epoch 1/10


2024-01-16 20:28:54.352918: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:454] Loaded cuDNN version 8904
2024-01-16 20:28:55.343696: I external/local_xla/xla/service/service.cc:168] XLA service 0x7fe0c012b120 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2024-01-16 20:28:55.343753: I external/local_xla/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA GeForce RTX 2060, Compute Capability 7.5
2024-01-16 20:28:55.354269: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
I0000 00:00:1705454935.397039   45236 device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x7fe1d03d5f50>

In [59]:
model = keras.models.load_model("int_lstm.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Test acc: 0.867


In [60]:
pred=model.predict(int_test_ds)



In [61]:
print(pred)

[[0.50074637]
 [0.9223024 ]
 [0.92039317]
 ...
 [0.10627599]
 [0.04852633]
 [0.9195446 ]]


In [62]:
import numpy as np

# Threshold the predicted probabilities at 0.5 to get class labels
predicted = np.array([int(x > 0.5) for x in np.concatenate(pred)])
print(predicted)
unique, counts = np.unique(predicted, return_counts=True)
print([unique, counts])

[1 1 1 ... 0 0 1]
[array([0, 1]), array([12590, 12410])]


In [63]:
import numpy as np
# This displays the contents of the dataset
# list(int_test_ds)
# Collect targets to use in confusion table
target_arr=[]
for inputs, targets in int_test_ds:
    target=[int(x) for x in targets]
    target_arr.append(target)
    
actual=np.concatenate(target_arr)
print(actual)
unique, counts = np.unique(actual, return_counts=True)
print([unique, counts])

[0 1 1 ... 0 0 1]
[array([0, 1]), array([12500, 12500])]


In [68]:
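# Show the confusion matrix.
# It implies an accuracy of about 0.50, not the 0.867 reported by model.evaluate().
# Likely cause: text_dataset_from_directory shuffles by default, so predict() and the
# loop that collected the targets saw the test batches in different orders.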
tf.math.confusion_matrix(actual, predicted)

<tf.Tensor: shape=(2, 2), dtype=int32, numpy=
array([[6289, 6211],
       [6301, 6199]], dtype=int32)>
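The diagonal of this matrix sums to 12488 out of 25000, which is chance level. One way to rule out an ordering mismatch is to collect labels and predictions in the same pass over the dataset. The sketch below shows the pattern with NumPy and stand-in data; `collect_aligned`, `confusion_matrix`, and `fake_predict` are hypothetical helpers, and the toy batches are not real results.

```python
import numpy as np

def collect_aligned(dataset, predict_fn, threshold=0.5):
    # Read each batch once, so labels and predictions share the same order.
    actual, predicted = [], []
    for inputs, targets in dataset:
        probs = np.asarray(predict_fn(inputs)).reshape(-1)
        predicted.append((probs > threshold).astype(int))
        actual.append(np.asarray(targets).reshape(-1).astype(int))
    return np.concatenate(actual), np.concatenate(predicted)

def confusion_matrix(actual, predicted, num_classes=2):
    # Rows are true classes, columns are predicted classes.
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (actual, predicted), 1)
    return cm

def fake_predict(batch):
    # Stand-in for a model: the toy "inputs" are already probabilities.
    return batch

toy_batches = [
    (np.array([[0.9], [0.2]]), np.array([1, 0])),
    (np.array([[0.7], [0.4]]), np.array([1, 1])),
]
actual, predicted = collect_aligned(toy_batches, fake_predict)
cm = confusion_matrix(actual, predicted)
print(cm)
print("accuracy:", np.trace(cm) / cm.sum())
```

In the notebook this would correspond to something like `collect_aligned(int_test_ds, model.predict_on_batch)`. An alternative, assuming the directory layout is unchanged, is to build the test dataset with `shuffle=False` so that every pass yields the same order.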

#### Try to use an Embedding layer for the previous problem

In [77]:
vocab_size = 20000
embed_dim = 256

# Not used here; from FC book Ch. 11, Listing 11.22
#num_heads = 2
#dense_dim = 32

inputs = keras.Input(shape=(None,), dtype="int64")
x = layers.Embedding(vocab_size, embed_dim)(inputs)

x = layers.LSTM(35)(x)

x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)

# Maybe optimizer could be different? 
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()

Model: "model_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_5 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding_3 (Embedding)     (None, None, 256)         5120000   
                                                                 
 lstm_4 (LSTM)               (None, 35)                40880     
                                                                 
 dropout_4 (Dropout)         (None, 35)                0         
                                                                 
 dense_4 (Dense)             (None, 1)                 36        
                                                                 
Total params: 5160916 (19.69 MB)
Trainable params: 5160916 (19.69 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
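An Embedding layer is a trainable lookup table with one row per vocabulary index, which is why it has 20000 x 256 = 5120000 parameters. Looking up a row gives the same result as multiplying a one-hot vector by the table, without ever building the one-hot array. A small NumPy sketch with toy sizes and random values:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 6, 3                  # toy sizes; the notebook uses 20000 and 256
table = rng.normal(size=(vocab_size, embed_dim))

ids = np.array([4, 1, 0])
looked_up = table[ids]                         # direct row lookup
via_one_hot = np.eye(vocab_size)[ids] @ table  # same rows via one-hot times matrix

print(np.allclose(looked_up, via_one_hot))     # True
print(20000 * 256)                             # 5120000
```

The LSTM that follows now sees 256 input features per time step instead of 20000, which is why its parameter count drops to 40880 even though it has more units than before.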


In [78]:
callbacks = [ keras.callbacks.ModelCheckpoint("int_embed_lstm.keras", save_best_only=True) ]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, verbose=1, callbacks=callbacks)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x7fdf66b05c10>

In [79]:
model = keras.models.load_model("int_embed_lstm.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Test acc: 0.768


### The more complex the architecture, the poorer the results

#### Model that uses Embedding and MultiHeadAttention layers (loosely analogous to Listing 11.22)

In [101]:
# What can we do to improve? Here we don't have any RNN, just transformations

vocab_size = 20000
embed_dim = 256

# from FC book Ch. 11, Listing 11.22
# num_heads = 2 achieved 0.869
# Let's try more heads/words
# 4 heads gave 0.873, just a bit better; 5 heads gave 0.870
# Still worse than a simpler method.
num_heads = 5
dense_dim = 32

inputs = keras.Input(shape=(None,), dtype="int64")
x = layers.Embedding(vocab_size, embed_dim)(inputs)

# try to add attention 
x = layers.MultiHeadAttention( num_heads=num_heads, key_dim=embed_dim)(x,x)

# This layer is needed - what does it do?
x = layers.GlobalMaxPooling1D()(x)

x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)

model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()

Model: "model_13"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_17 (InputLayer)       [(None, None)]               0         []                            
                                                                                                  
 embedding_15 (Embedding)    (None, None, 256)            5120000   ['input_17[0][0]']            
                                                                                                  
 multi_head_attention_10 (M  (None, None, 256)            1314816   ['embedding_15[0][0]',        
 ultiHeadAttention)                                                  'embedding_15[0][0]']        
                                                                                                  
 global_max_pooling1d_9 (Gl  (None, 256)                  0         ['multi_head_attention_
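The shapes in the summary can be followed with a stripped-down NumPy sketch. It is a simplification, not what `MultiHeadAttention` computes: it uses a single head and no learned query, key, value, or output projections. It only shows that self-attention keeps one vector per position, and that global max pooling then collapses the sequence axis into one vector per review.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # x: (seq_len, dim). Every position is compared with every other position.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ x, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 4))         # toy sequence: 5 positions, 4 features
attended, weights = self_attention(tokens)
pooled = attended.max(axis=0)            # global max pooling over the sequence axis

print(attended.shape, pooled.shape)      # (5, 4) (4,)
```

The same pattern explains the summary: the attention output keeps the shape `(None, None, 256)`, and the pooling layer reduces it to `(None, 256)` so that the Dense layer receives a fixed-size input.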

#### Use the model with attention: it clearly improves on the previous model, but not on the simpler 2-gram model.

In [102]:
callbacks = [ keras.callbacks.ModelCheckpoint("attention.keras", save_best_only=True) ]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=20, callbacks=callbacks)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.src.callbacks.History at 0x7fdf76216550>

In [103]:
model = keras.models.load_model("attention.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Test acc: 0.870
