Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel $\rightarrow$ Restart) and then **run all cells** (in the menubar, select Cell $\rightarrow$ Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [1]:
STUDENT_ID = "128"

---

Your task is to create an RNN using TensorFlow to classify news headlines into one of four categories: World News, Sports, Business, or Sci/Tech. First study the example given in the "Recurrent Neural Networks and the Encoder-Decoder Architecture" section on the e-learning platform: https://colab.research.google.com/drive/1dA0mXy8nwvnxPB-51LIKV7LBjIkOlMkF?usp=sharing

In [2]:
import os
import numpy as np

import tensorflow as tf
import tensorflow_datasets as tfds

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("GPU is", "available" if tf.config.list_physical_devices("GPU") else "NOT AVAILABLE")

Version:  2.17.1
Eager mode:  True
GPU is available


Load the AG News Subset dataset using tfds. Split the existing train split in half, using the first half for training and the second half for validation. Store the training, validation and test splits in the variables `train_dataset`, `validation_dataset` and `test_dataset` respectively.

In [3]:
# Load the AG News Subset dataset
dataset, info = tfds.load(
    'ag_news_subset',
    split=['train', 'test'],
    with_info=True,
    as_supervised=True
)

train_val_dataset, test_dataset = dataset

# Split the train dataset into training and validation subsets
train_dataset = train_val_dataset.take(len(train_val_dataset) // 2)
validation_dataset = train_val_dataset.skip(len(train_val_dataset) // 2)

# Display dataset sizes (len() uses the known cardinality, so nothing is iterated)
train_size = len(train_dataset)
val_size = len(validation_dataset)
test_size = len(test_dataset)

print(f"Number of training examples: {train_size}")
print(f"Number of validation examples: {val_size}")
print(f"Number of test examples: {test_size}")

Number of training examples: 60000
Number of validation examples: 60000
Number of test examples: 7600


In [4]:
"""Testing"""
assert len(train_dataset) == 60000
assert len(validation_dataset) == 60000
assert len(test_dataset) == 7600

Batch the datasets to be ready for training and inference. Use a batch size of 64.

In [5]:
BUFFER_SIZE = 5000
BATCH_SIZE = 64
train_dataset = train_dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
validation_dataset = validation_dataset.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
test_dataset = test_dataset.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
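
As a quick sanity check on the batching step, the number of batches per split can be worked out by hand. The sketch below assumes the split sizes printed earlier (60000, 60000 and 7600) and the default `drop_remainder=False` behaviour of `batch`, which keeps the final partial batch. The helper names are illustrative only and are not part of the assignment.

```python
import math

BATCH_SIZE = 64

def batches_per_epoch(num_examples, batch_size=BATCH_SIZE):
    """Number of batches when the last partial batch is kept."""
    return math.ceil(num_examples / batch_size)

def last_batch_size(num_examples, batch_size=BATCH_SIZE):
    """Size of the final batch, which may be smaller than batch_size."""
    remainder = num_examples % batch_size
    return remainder if remainder else batch_size

# Split sizes taken from the output printed earlier in the notebook
for name, size in [("train", 60000), ("validation", 60000), ("test", 7600)]:
    print(name, batches_per_epoch(size), last_batch_size(size))
```

The 938 training batches agree with the `938/938` step counter in the training log, and the final training batch holds only 32 examples, which is why the batch-size test inspects the first batch.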

In [6]:
"""Testing"""
for batch in train_dataset.take(1):
  inputs, labels = batch
  assert inputs.shape[0] == 64

Define a TextVectorization layer with an appropriate output mode for use with RNNs and learn the vocabulary from the training set using the `adapt` function. Use a vocabulary size of 9,000. Store the layer in a variable called `encoder`.

In [7]:
# Define the TextVectorization layer
VOCAB_SIZE = 9000

encoder = tf.keras.layers.TextVectorization(
    max_tokens=VOCAB_SIZE,      # Maximum vocabulary size
    output_mode='int',
    output_sequence_length=100  # Pad or truncate to 100 tokens for faster training
)

# Adapt the encoder to the training set
encoder.adapt(train_dataset.map(lambda text, label: text))

print("TextVectorization layer created and adapted to the training set.")


TextVectorization layer created and adapted to the training set.
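
To make the `'int'` output mode more concrete, here is a minimal pure-Python sketch of the idea: each token is mapped to an integer index, index 0 is reserved for padding and index 1 for out-of-vocabulary tokens, and every sequence is padded or truncated to a fixed length. This is only an illustration, not the Keras implementation. The functions `build_vocabulary` and `encode` are hypothetical, and the sketch only lowercases and splits on whitespace.

```python
from collections import Counter

def build_vocabulary(texts, max_tokens):
    """Keep the most frequent tokens, reserving two slots for padding and OOV."""
    counts = Counter(token for text in texts for token in text.lower().split())
    most_common = [token for token, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def encode(text, vocabulary, sequence_length):
    """Map tokens to indices, then truncate or zero-pad to sequence_length."""
    index = {token: i for i, token in enumerate(vocabulary)}
    ids = [index.get(token, 1) for token in text.lower().split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

toy_texts = ["stocks rise", "stocks fall again"]
toy_vocabulary = build_vocabulary(toy_texts, max_tokens=4)
print(toy_vocabulary)
print(encode("Stocks fall", toy_vocabulary, sequence_length=4))
```

In this toy run "fall" does not make it into the four-slot vocabulary, so it is encoded as the out-of-vocabulary index 1, and the two trailing zeros are padding.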


In [8]:
"""Testing"""
assert len(encoder.get_vocabulary()) == 9000

Construct the model architecture. Use a trainable embedding layer with 300 dimensions, followed by an LSTM layer with 128 units. Finally, use an appropriate output layer for the classification task at hand. Store the model in a variable called `model`.

In [9]:
# Construct the model architecture
model = tf.keras.Sequential([
    # Input layer to define the shape explicitly
    tf.keras.Input(shape=(1,), dtype=tf.string),
    encoder,  # Text Vectorization layer
    tf.keras.layers.Embedding(
        input_dim=len(encoder.get_vocabulary()),  # Vocabulary size
        output_dim=300,                          # Embedding dimensions
        mask_zero=True                           # Mask padding tokens
    ),
    # LSTM layer
    tf.keras.layers.LSTM(128),
    # Dense output layer with softmax activation for multi-class classification
    tf.keras.layers.Dense(4, activation='softmax')  # 4 categories
])


# Print model summary
model.summary()
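
The output of `model.summary()` is not shown here, so the following sketch estimates the trainable parameter counts by hand. It assumes default Keras layer settings (biases enabled, a single bias vector per LSTM weight set) and the sizes used above. The `TextVectorization` layer contributes no trainable parameters.

```python
VOCAB_SIZE = 9000
EMBEDDING_DIM = 300
LSTM_UNITS = 128
NUM_CLASSES = 4

# One 300-dimensional vector per vocabulary entry
embedding_params = VOCAB_SIZE * EMBEDDING_DIM

# Four weight sets (input, forget and output gates plus the cell candidate),
# each with input weights, recurrent weights and a bias
lstm_params = 4 * (LSTM_UNITS * (EMBEDDING_DIM + LSTM_UNITS) + LSTM_UNITS)

# Dense output layer: weights plus one bias per class
dense_params = LSTM_UNITS * NUM_CLASSES + NUM_CLASSES

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Under these assumptions the embedding table accounts for most of the model's parameters.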


In [10]:
"""Testing"""
assert len(model.layers) == 4

Compile the model using the appropriate loss function. Study the available loss functions at https://www.tensorflow.org/api_docs/python/tf/keras/losses and look particularly at the `CategoricalCrossentropy` and `SparseCategoricalCrossentropy` losses. Which one should you use given the representation of the target values in our dataset? The following code prints the target values for the first batch of training examples. Use the Adam optimizer with a learning rate of 0.00001. Use accuracy as the metric.

In [11]:
for text, label in train_dataset.take(1):
  print(label)

tf.Tensor(
[2 1 2 1 2 0 2 1 2 1 1 3 3 3 3 3 0 3 3 2 0 1 2 1 3 0 1 3 1 0 3 1 1 2 2 2 2
 1 2 3 3 3 0 2 2 2 2 3 3 3 1 0 3 1 2 3 2 1 1 2 1 3 2 2], shape=(64,), dtype=int64)


In [12]:
model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
    metrics=['accuracy']
)
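
The labels printed above are integer class ids rather than one-hot vectors, which is the case the sparse variant of the loss is designed for. The NumPy sketch below illustrates, on made-up probabilities, that the two losses compute the same quantity and differ only in the label format they expect. The function names are illustrative, and numerical-stability details of the real Keras losses are ignored.

```python
import numpy as np

def sparse_categorical_crossentropy(labels, probs):
    """Mean negative log-probability of the true class, given integer labels."""
    true_class_probs = probs[np.arange(len(labels)), labels]
    return float(np.mean(-np.log(true_class_probs)))

def categorical_crossentropy(one_hot_labels, probs):
    """Same quantity, but the labels are one-hot encoded vectors."""
    return float(np.mean(-np.sum(one_hot_labels * np.log(probs), axis=1)))

# Made-up predictions for three examples over four classes
labels = np.array([2, 1, 0])
probs = np.array([
    [0.10, 0.20, 0.60, 0.10],
    [0.10, 0.70, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
])
one_hot_labels = np.eye(4)[labels]

sparse_loss = sparse_categorical_crossentropy(labels, probs)
dense_loss = categorical_crossentropy(one_hot_labels, probs)
print(sparse_loss, dense_loss)
```

With integer labels, the sparse loss avoids the extra one-hot conversion step.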

In [13]:
"""Testing"""
assert model.optimizer.learning_rate == 0.00001

Fit the model for 5 epochs, also reporting the metrics for the validation set at the end of each epoch. Save the fit results to a `history` variable.

In [14]:
EPOCHS = 5
history = model.fit(
    train_dataset,
    validation_data=validation_dataset,
    epochs=EPOCHS
)
print("Model trained successfully.")

Epoch 1/5
938/938 ━━━━━━━━━━━━━━━━━━━━ 20s 17ms/step - accuracy: 0.3465 - loss: 1.3811 - val_accuracy: 0.5072 - val_loss: 1.2238
Epoch 2/5
938/938 ━━━━━━━━━━━━━━━━━━━━ 19s 18ms/step - accuracy: 0.5048 - loss: 1.0976 - val_accuracy: 0.5564 - val_loss: 0.9757
Epoch 3/5
938/938 ━━━━━━━━━━━━━━━━━━━━ 17s 18ms/step - accuracy: 0.5971 - loss: 0.9307 - val_accuracy: 0.7189 - val_loss: 0.7972
Epoch 4/5
938/938 ━━━━━━━━━━━━━━━━━━━━ 21s 18ms/step - accuracy: 0.7620 - loss: 0.7317 - val_accuracy: 0.8058 - val_loss: 0.6360
Epoch 5/5
938/938 ━━━━━━━━━━━━━━━━━━━━ 17s 18ms/step - accuracy: 0.8218 - loss: 0.5845 - val_accuracy: 0.8243 - val_loss: 0.5707
Model trained successfully.


In [15]:
"""Testing"""
final_accuracy = history.history['accuracy'][-1]
assert final_accuracy > 0.75

Develop a function that retrieves the embeddings the network learned and returns the words whose embeddings are most similar to that of a given word.

In [28]:
from sklearn.metrics.pairwise import cosine_similarity

def find_most_similar(target_word, top_n):
    """
    Finds the most similar words to a given target word based on learned embeddings.

    Args:
        target_word (str): The word to find similar words for.
        top_n (int): The number of most similar words to return.

    Returns:
        List[str]: The list of most similar words.
    """
    # Retrieve the embedding layer from the model
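    # (looked up by type, since the auto-generated layer name changes on re-runs)
    embedding_layer = next(
        layer for layer in model.layers
        if isinstance(layer, tf.keras.layers.Embedding)
    )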
    embedding_weights = embedding_layer.get_weights()[0]  # Shape: (vocab_size, embedding_dim)

    # Get the index of the target word using the encoder
    target_word_idx = encoder([target_word]).numpy()[0][0]  # Get the token index for the target word

    # Check that the word is in the vocabulary: index 0 is the padding token
    # and index 1 is the out-of-vocabulary token '[UNK]'
    if target_word_idx <= 1:
        raise ValueError(f"Word '{target_word}' is not in the vocabulary.")

    # Get the embedding of the target word
    target_embedding = embedding_weights[target_word_idx]

    # Compute similarities with all other embeddings
    similarities = cosine_similarity(
        target_embedding.reshape(1, -1),  # Target word embedding
        embedding_weights                # All embeddings in the vocabulary
    )[0]  # Flatten the result to a 1D array

    # Rank all indices by descending similarity, then drop the target word,
    # padding (index 0) and the OOV token (index 1) so that top_n words remain
    ranked_indices = np.argsort(similarities)[::-1]
    similar_indices = [i for i in ranked_indices if i != target_word_idx and i > 1]

    # Map indices to words using the vocabulary
    vocabulary = encoder.get_vocabulary()
    similar_words = [vocabulary[i] for i in similar_indices][:top_n]  # Map indices to words

    return similar_words

# Example usage:
most_similar_words = find_most_similar("computer", 5)
print(f"Words most similar to 'computer': {most_similar_words}")



Words most similar to 'computer': ['software', 'technology', 'space', 'internet', 'web']
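
The ranking step can be checked in isolation on a toy example. The sketch below uses a made-up four-word vocabulary with hand-written two-dimensional embeddings (both are assumptions for illustration, not values learned by the model) and ranks neighbours by cosine similarity using NumPy only.

```python
import numpy as np

def top_similar(embeddings, vocabulary, word, top_n):
    """Return the top_n words closest to `word` by cosine similarity."""
    idx = vocabulary.index(word)
    # Normalise rows so that a dot product equals the cosine similarity
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarities = unit @ unit[idx]
    # Sort in descending order and skip the query word itself
    ranked = [i for i in np.argsort(similarities)[::-1] if i != idx]
    return [vocabulary[i] for i in ranked[:top_n]]

# Hypothetical vocabulary and embeddings, chosen by hand
toy_vocabulary = ["computer", "software", "football", "goal"]
toy_embeddings = np.array([
    [1.0, 0.1],
    [0.9, 0.2],
    [0.1, 1.0],
    [0.2, 0.9],
])

neighbours = top_similar(toy_embeddings, toy_vocabulary, "computer", 2)
print(neighbours)
```

Because cosine similarity depends only on the angle between vectors, "software" ranks first here: its vector points in nearly the same direction as "computer".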


In [20]:
"""Testing"""
assert "software" in find_most_similar("computer", 5)