<a href="https://colab.research.google.com/github/akprodromou/Natural-Language-Processing/blob/main/exercise3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel $\rightarrow$ Restart) and then **run all cells** (in the menubar, select Cell $\rightarrow$ Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [17]:
NAME = "ANTONIS PRODROMOU"
AEM = "238"

---

# Introduction and learning goals

Welcome to your 3rd assignment! The goal is to get hands-on experience with training feed-forward neural networks.

# Initialization

In [18]:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from tensorflow import keras

# Setting random seeds for reproducible grading
np.random.seed(42)
tf.random.set_seed(42)

print("TensorFlow Version:", tf.__version__)
print("Keras Version:", keras.__version__)

TensorFlow Version: 2.19.0
Keras Version: 3.10.0


# Reading and preprocessing the data

Load the IMDb dataset using keras.datasets.imdb.load_data. Keep only the 10,000 most frequent words (num_words=10000). Store the inputs in the variables x_train and x_test, and the associated ground-truth labels in y_train and y_test respectively.

These data are already tokenized into lists of integers corresponding to tokens. The next step is therefore to pad or truncate them to a fixed length using keras.preprocessing.sequence.pad_sequences. Set this length to 256. Store the updated data in the same x_train and x_test variables.
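To make the effect of padding and truncation concrete, here is a minimal NumPy sketch of the 'pre' behaviour on toy data. The helper `pad_pre` and the toy sequences are hypothetical names introduced for illustration; the sketch only mimics what a 'pre' padding and 'pre' truncating setting is expected to do and is not the Keras implementation.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad or left-truncate each sequence to exactly `maxlen` items."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        kept = list(seq)[-maxlen:]        # 'pre' truncation drops the earliest tokens
        if kept:
            out[row, -len(kept):] = kept  # 'pre' padding puts the fill value first
    return out

# hypothetical toy reviews: one shorter and one longer than maxlen
toy = [[11, 12, 13], [21, 22, 23, 24, 25, 26]]
padded = pad_pre(toy, maxlen=4)
print(padded)
```

The short sequence gains a leading 0, while the long one keeps only its last four tokens, so every row ends up with the same length and the batch forms a rectangular array.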

In [19]:
keras.utils.set_random_seed(42)

# load the IMDb dataset
# only keep the 10,000 most frequent words (this becomes the size of our vocabulary)
(x_train, y_train), (x_test, y_test) = keras.datasets.imdb.load_data(num_words=10000)

# padding and truncation are strategies for creating rectangular tensors from batches of varying lengths
# pad/truncate x_train and x_test to a length of 256 tokens
# store the updated data in the same x_train and x_test variables
x_train = tf.keras.utils.pad_sequences(
    x_train,
    maxlen=256,
    dtype='int32',
    padding='pre',
    truncating='pre',
    value=0.0
)

x_test = tf.keras.utils.pad_sequences(
    x_test,
    maxlen=256,
    dtype='int32',
    padding='pre',
    truncating='pre',
    value=0.0
)


In [20]:
"""Testing"""
assert x_train.shape == (25000, 256), f"Expected x_train shape (25000, 256), but got {x_train.shape}"
assert x_test.shape == (25000, 256), f"Expected x_test shape (25000, 256), but got {x_test.shape}"


# Creating the architecture

Create the neural architecture. Use an embedding layer with 32 dimensions per word, and concatenate the embeddings of each review into a single vector. Then add a dense layer with 64 units and ReLU activation. Finally, add an appropriate output layer. Store the model in a variable called model.
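As a sanity check on the layer sizes described above, the following sketch works out the number of trainable parameters by hand. It assumes a vocabulary of 10,000 words, reviews of 256 tokens, and a single-unit output layer (the choice made in the solution cell below); the variable names are hypothetical.

```python
# assumed sizes taken from the assignment text
vocab_size, embed_dim, seq_len, hidden = 10_000, 32, 256, 64

embedding_params = vocab_size * embed_dim    # one 32-d vector per vocabulary entry
flat_width = seq_len * embed_dim             # 256 embeddings concatenated per review
dense_params = flat_width * hidden + hidden  # weights plus one bias per unit
output_params = hidden * 1 + 1               # assumed single-unit output layer

total_params = embedding_params + dense_params + output_params
print(flat_width, embedding_params, dense_params, output_params, total_params)
```

Most of the parameters sit in the first dense layer, because concatenating the embeddings produces an 8192-wide input vector.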

In [21]:
# initialize the model
# store the model in a variable called model
model = keras.Sequential()

# embedding layer with 32 dimensions for each word
model.add(keras.layers.Embedding(
    # size of vocabulary
    input_dim = 10000,
    # embedding length
    output_dim = 32
))

# concatenate the embeddings of each review into a single vector
# Flatten leaves the batch dimension untouched: (batch, 256, 32) -> (batch, 8192)
model.add(keras.layers.Flatten())

# use a dense layer with 64 nodes and a ReLU as activation function
model.add(keras.layers.Dense(
    units = 64,
    activation='relu'
))

# use an appropriate output layer
# since our classification is binary, we'll use sigmoid
model.add(keras.layers.Dense(
    units = 1,
    activation='sigmoid'
))


In [22]:
"""Testing"""
assert len(model.layers) == 4, "Model should have exactly 4 layers"
assert isinstance(model.layers[0], keras.layers.Embedding), "First layer must be Embedding"


Compile the model using the appropriate loss function. Study the available loss functions here: https://www.tensorflow.org/api_docs/python/tf/keras/losses. Which one should you use, given the representation of the target values in our dataset and your choice of output layer? The following code prints the target values for the first 20 training examples. Use the Adam optimizer with a learning rate of 0.0001. Use accuracy as the metric.

In [23]:
print(y_train[0:20])

[1 0 0 1 0 0 1 0 1 0 1 0 0 0 0 0 1 1 0 1]
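Since the targets are plain 0/1 labels, a loss that compares each label with a single predicted probability is a natural fit. The sketch below computes binary cross-entropy in NumPy to show what such a loss measures. The function name, the clipping constant `eps`, and the toy probabilities are assumptions made for illustration, not the Keras internals.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    y = np.asarray(y_true, dtype="float64")
    # clip to avoid log(0); eps is a choice made for this sketch
    p = np.clip(np.asarray(p_pred, dtype="float64"), eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

labels = [1, 0, 0, 1]
confident = binary_cross_entropy(labels, [0.9, 0.1, 0.2, 0.8])
uninformed = binary_cross_entropy(labels, [0.5, 0.5, 0.5, 0.5])
print(confident, uninformed)
```

A model that always outputs 0.5 scores ln 2 ≈ 0.693, which is a useful baseline when reading the loss of an untrained network.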


In [24]:
# compile is a model method
model.compile(
    # Adam optimizer with a learning rate of 0.0001
    optimizer=keras.optimizers.Adam(learning_rate=1e-4),
    # cross entropy for binary classification
    loss=keras.losses.BinaryCrossentropy(),
    metrics=[
        # use accuracy as metric
        'accuracy'
    ],
)

In [25]:
"""Testing"""
assert model.optimizer.learning_rate == 0.0001

# Training and evaluating the model

Fit the model for 5 epochs, using 1/4 of the data as validation set, and presenting the metrics also for the validation set at the end of each epoch. Save the fit results to a history variable. Use a batch size of 128 samples.

In [26]:
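# Reserve 1/4 of the data for validation, i.e. the first 6250 of the 25000 examples
# note: this cell overwrites x_train and y_train, so re-running it without
# reloading the data shrinks the training set again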
x_val = x_train[:6250]
y_val = y_train[:6250]
x_train = x_train[6250:]
y_train = y_train[6250:]

# fit() will train the model by slicing the data into "batches" of size 'batch_size',
# and repeatedly iterating over the entire dataset for a given number of epochs
# save the fit results to a history variable
history = model.fit(
    x_train,
    y_train,
    # use a batch size of 128 samples
    batch_size=128,
    # fit the model for 5 epochs
    epochs=5,
    # validation for monitoring validation loss and
    # metrics at the end of each epoch
    validation_data=(x_val, y_val),
)

print(f"Accuracy History per epoch: {history.history['accuracy']}")

Epoch 1/5
147/147 ━━━━━━━━━━━━━━━━━━━━ 6s 32ms/step - accuracy: 0.5192 - loss: 0.6920 - val_accuracy: 0.5453 - val_loss: 0.6887
Epoch 2/5
147/147 ━━━━━━━━━━━━━━━━━━━━ 5s 37ms/step - accuracy: 0.6695 - loss: 0.6735 - val_accuracy: 0.7045 - val_loss: 0.6337
Epoch 3/5
147/147 ━━━━━━━━━━━━━━━━━━━━ 9s 28ms/step - accuracy: 0.7772 - loss: 0.5724 - val_accuracy: 0.8139 - val_loss: 0.4642
Epoch 4/5
147/147 ━━━━━━━━━━━━━━━━━━━━ 7s 39ms/step - accuracy: 0.8594 - loss: 0.3915 - val_accuracy: 0.8526 - val_loss: 0.3675
Epoch 5/5
147/147 ━━━━━━━━━━━━━━━━━━━━ 4s 30ms/step - accuracy: 0.8936 - loss: 0.2943 - val_accuracy: 0.8634 - val_loss: 0.3273
Accuracy History per epoch: [0.5271999835968018, 0.6822400093078613, 0.8036800026893616, 0.8692799806594849, 0.8993066549301147]
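The 147 steps reported per epoch follow directly from the split and the batch size. A short arithmetic sketch, assuming the 25,000-example training set and the 1/4 validation split used above (variable names are hypothetical):

```python
import math

n_total = 25_000
n_val = n_total // 4           # 1/4 held out for validation
n_train = n_total - n_val      # examples left for gradient updates
batch_size = 128

steps_per_epoch = math.ceil(n_train / batch_size)
last_batch = n_train - (steps_per_epoch - 1) * batch_size  # size of the final, partial batch
print(n_val, n_train, steps_per_epoch, last_batch)
```

Because 18,750 is not a multiple of 128, the last batch of each epoch is smaller than the others.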


In [27]:
"""Testing"""
final_accuracy = history.history['accuracy'][-1]
assert final_accuracy > 0.85

Evaluate the model on the test data and store the accuracy in a variable called accuracy.

In [28]:
print("Evaluate on test data")
test_metrics = model.evaluate(x_test, y_test)
print("test loss, test acc:", test_metrics)
accuracy = test_metrics[1]
print("accuracy:", accuracy)

Evaluate on test data
782/782 ━━━━━━━━━━━━━━━━━━━━ 3s 4ms/step - accuracy: 0.8629 - loss: 0.3286
test loss, test acc: [0.3284449875354767, 0.8626000285148621]
accuracy: 0.8626000285148621
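For intuition about the reported number, the sketch below shows how an accuracy of this kind can be derived from sigmoid outputs: each probability is thresholded and compared with its label. The 0.5 threshold, the toy probabilities, and the evaluation batch size of 32 are assumptions for illustration, the last one being chosen because it would explain the 782 steps in the progress bar.

```python
import math
import numpy as np

def binary_accuracy(y_true, p_pred, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    predicted = (np.asarray(p_pred) > threshold).astype("int32")
    return float(np.mean(predicted == np.asarray(y_true)))

# hypothetical sigmoid outputs and their true labels
probs = np.array([0.91, 0.40, 0.55, 0.08, 0.73])
labels = np.array([1, 0, 0, 0, 1])
acc = binary_accuracy(labels, probs)

# assumed evaluation batch size of 32 over 25,000 test reviews
eval_steps = math.ceil(25_000 / 32)
print(acc, eval_steps)
```

Only the third toy example is misclassified (0.55 is above the threshold while the label is 0), so four of the five predictions count as correct.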


In [29]:
"""Testing"""
assert accuracy > 0.85