# --------------------- ASSIGNMENT 2 (AI) - TRANSFORMERS -----------------------

### Question 1: Sentiment Analysis with Transformers

Dataset Problem: Use the IMDB movie reviews dataset to perform sentiment analysis with a Transformer model. Load the dataset from the TensorFlow Datasets library and solve the problem.

Because Transformer models are large and complex, use them through a library such as Hugging Face's Transformers. Feel free to experiment with more than one Transformer model, compare the results, and give a short explanation of which model performs best and why.

In [None]:
!pip install tensorflow tensorflow-datasets transformers datasets scikit-learn



In [4]:
# Importing Libraries

import tensorflow as tf
import tensorflow_datasets as tfds
from transformers import TFAutoModelForSequenceClassification, AutoTokenizer
from sklearn.metrics import accuracy_score, classification_report
import numpy as np

In [6]:
# Load the IMDB dataset from TensorFlow Datasets

(train_data, test_data), info = tfds.load(
    'imdb_reviews',
    split=['train', 'test'],
    as_supervised=True,
    with_info=True
)



Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


In [None]:
# Convert Dataset to list

# Needed for Hugging Face Tokenizers

# Train data

train_texts  = []
train_labels = []

for text, label in tfds.as_numpy(train_data):
    train_texts.append(text.decode('utf-8'))
    train_labels.append(label)

# Test data

test_texts  = []
test_labels = []

for text, label in tfds.as_numpy(test_data):
    test_texts.append(text.decode('utf-8'))
    test_labels.append(label)

## Model Building

In [None]:
# -------------------------------------------------------- Model 1 - DistilBERT ------------------------------------------------------------------------

# - Smaller and Faster than BERT
# - Retains ~97% of BERT's performance
# - Ideal for limited compute environments

In [None]:
# Tokenization

distilbert_tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

train_encodings = distilbert_tokenizer(
    train_texts,
    truncation=True,
    padding=True,
    max_length=256
)

test_encodings = distilbert_tokenizer(
    test_texts,
    truncation=True,
    padding=True,
    max_length=256
)
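
What `truncation=True`, `padding=True`, and `max_length` do to a batch can be sketched in plain Python. This is a simplified illustration, not the tokenizer's actual implementation: the helper name `pad_and_truncate` and the toy token ids are hypothetical, the padding id of 0 is an assumption, and a real tokenizer also keeps the closing special token when it truncates.

```python
def pad_and_truncate(batch_ids, max_length=256, pad_id=0):
    """Toy version of truncation + padding to the longest sequence in the batch."""
    # Truncation: cut every sequence to at most max_length tokens.
    truncated = [ids[:max_length] for ids in batch_ids]
    # padding=True pads to the longest sequence that remains in the batch.
    target = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (target - len(ids)) for ids in truncated]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [
        [1] * len(ids) + [0] * (target - len(ids)) for ids in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids for one short and one long review.
toy_batch = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 11, 102]]
toy_encodings = pad_and_truncate(toy_batch, max_length=5)
print(toy_encodings["input_ids"])
print(toy_encodings["attention_mask"])
```

The attention mask is what lets the model ignore padded positions, which is why it is passed to the model alongside `input_ids`.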


In [None]:
# Model Loading

distilbert_model = TFAutoModelForSequenceClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=2,
    force_download=True,
    use_safetensors=False
)


TensorFlow and JAX classes are deprecated and will be removed in Transformers v5. We recommend migrating to PyTorch classes or pinning your version of Transformers.
Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['activation_13', 'vocab_transform', 'vocab_projector', 'vocab_layer_norm']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-

In [None]:
# Prepare TF Dataset

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
)).shuffle(1000).batch(16)

test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    test_labels
)).batch(16)

In [None]:
# Compile and Train

distilbert_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy']
)

distilbert_model.fit(
    train_dataset,
    validation_data=test_dataset,
    epochs=2
)

Epoch 1/2
Epoch 2/2


<tf_keras.src.callbacks.History at 0x7993df713ef0>

In [None]:
# Evaluation

logits = distilbert_model.predict(test_dataset).logits
preds = np.argmax(logits, axis=1)

print('DistilBERT Accuracy:', accuracy_score(test_labels, preds))
print(classification_report(test_labels, preds))

DistilBERT Accuracy: 0.89736
              precision    recall  f1-score   support

           0       0.91      0.88      0.90     12500
           1       0.89      0.91      0.90     12500

    accuracy                           0.90     25000
   macro avg       0.90      0.90      0.90     25000
weighted avg       0.90      0.90      0.90     25000
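
How the evaluation cell turns logits into the numbers above can be sketched without any libraries. This is a minimal illustration, assuming a binary task with class 1 as the positive class; the logits and labels are made-up toy values, and `binary_report` is a hypothetical helper, not part of scikit-learn.

```python
def argmax(row):
    """Index of the largest logit in a row (the predicted class)."""
    return max(range(len(row)), key=lambda i: row[i])


def binary_report(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for one class of a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy logits (one row per review, one column per class) and toy labels.
toy_logits = [[2.1, -0.3], [0.2, 1.5], [-1.0, 0.4], [0.9, 0.1], [0.3, 0.8]]
toy_labels = [0, 1, 1, 1, 0]

toy_preds = [argmax(row) for row in toy_logits]
report = binary_report(toy_labels, toy_preds)
print(toy_preds, report)
```

Because the loss was configured with `from_logits=True`, no softmax is needed before the argmax: the largest logit is also the largest probability.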



In [None]:
# ----------------------------------------------- Model 2 - BERT (Base Uncased) ------------------------------------------------------------

# - Full transformer encoder
# - Deeper and more expressive than DistilBERT
# - Often achieves higher accuracy on NLP tasks

In [None]:
# Tokenization

bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

train_encodings_bert = bert_tokenizer(
    train_texts,
    truncation=True,
    padding=True,
    max_length=256
)

test_encodings_bert = bert_tokenizer(
    test_texts,
    truncation=True,
    padding=True,
    max_length=256
)


In [None]:
# Model Loading

bert_model = TFAutoModelForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2,
    force_download=True,
    use_safetensors=False
)


All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Prepare TF Dataset

train_dataset_bert = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings_bert),
    train_labels
)).shuffle(1000).batch(16)

test_dataset_bert = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings_bert),
    test_labels
)).batch(16)

In [None]:
# Compile and Train

bert_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy']
)

# Train on the BERT-tokenized datasets, not the DistilBERT ones built earlier
bert_model.fit(
    train_dataset_bert,
    validation_data=test_dataset_bert,
    epochs=2
)

Epoch 1/2
Epoch 2/2


<tf_keras.src.callbacks.History at 0x7991cada5640>

In [None]:
# Evaluation

logits_bert = bert_model.predict(test_dataset_bert).logits
preds_bert = np.argmax(logits_bert, axis=1)

print('BERT Accuracy:', accuracy_score(test_labels, preds_bert))
print(classification_report(test_labels, preds_bert))

BERT Accuracy: 0.91616
              precision    recall  f1-score   support

           0       0.91      0.92      0.92     12500
           1       0.92      0.91      0.92     12500

    accuracy                           0.92     25000
   macro avg       0.92      0.92      0.92     25000
weighted avg       0.92      0.92      0.92     25000



In [None]:
print(best_model := 'DistilBERT' if accuracy_score(test_labels, preds) > accuracy_score(test_labels, preds_bert) else 'BERT')

BERT


### **Best Model & Explanation:**

#### Best Performing Model: BERT (Base Uncased)

Reasons for Better Performance:

* Deeper Architecture
    * BERT has 12 Transformer layers
    * DistilBERT has 6 layers
* Richer Contextual Understanding
    * Better handling of long movie reviews
    * Strong bidirectional attention
* Higher Representational Capacity
    * Learns subtle sentiment cues better (sarcasm, negation)

Trade-off:

* BERT is slower and heavier
* DistilBERT is faster and more efficient, making it suitable for deployment
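
The size gap behind this trade-off can be estimated from the layer counts alone. The sketch below is a back-of-the-envelope count, assuming the standard published configuration of the two base checkpoints (hidden size 768, feed-forward size 3072, vocabulary of 30522 tokens, 512 positions); none of these values appear in this notebook, `encoder_params` is a hypothetical helper, and the classification heads are left out.

```python
def encoder_params(num_layers, hidden=768, ffn=3072, vocab=30522,
                   max_pos=512, token_types=2, pooler=True):
    """Approximate parameter count of a BERT-style encoder (no task head)."""
    layer_norm = 2 * hidden  # scale + bias
    embeddings = (vocab * hidden + max_pos * hidden
                  + token_types * hidden + layer_norm)
    # Query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers: hidden -> ffn -> hidden.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + layer_norm + feed_forward + layer_norm
    pool = hidden * hidden + hidden if pooler else 0
    return embeddings + num_layers * per_layer + pool


bert_params = encoder_params(num_layers=12)
# DistilBERT drops the token-type embeddings and the pooler.
distilbert_params = encoder_params(num_layers=6, token_types=0, pooler=False)

print(f"BERT:       {bert_params / 1e6:.1f}M parameters")
print(f"DistilBERT: {distilbert_params / 1e6:.1f}M parameters")
```

Under these assumptions DistilBERT has roughly 60% of BERT's parameters, because the embedding table is shared in size and only the Transformer layers are halved.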

#### Summary

Sentiment analysis was performed on the IMDB movie reviews dataset using Transformer-based models loaded via Hugging Face. DistilBERT and BERT were implemented and compared. While DistilBERT offered faster training with competitive accuracy, BERT achieved superior performance due to its deeper architecture and richer contextual representations. Therefore, BERT is the preferred model when accuracy is the priority, whereas DistilBERT is ideal for resource-constrained environments.

### Question 2: Text Generation with Transformers

Dataset Problem: Using a pre-trained GPT model (any version) from Hugging Face's Transformers, generate a short story based on a given prompt. Example prompt is below:

Prompt = "In a distant future, humanity has discovered"

In [14]:
# Import Libraries

from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

In [15]:
# Load Pre-trained GPT-2 Model and Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

model.eval() # set model to evaluation mode


GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

In [16]:
# Define Prompt

prompt = 'In a distant future, humanity has discovered'

In [17]:
# Tokenize Input

input_ids = tokenizer.encode(prompt, return_tensors='pt')

In [27]:
# Generate Text

output = model.generate(
    input_ids,
    max_length=130,
    num_return_sequences=1,
    top_k=50,
    top_p=0.95,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=True
)

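* max_length: Maximum total length of the sequence in tokens, including the prompt

* temperature: Scales the logits before sampling (higher = more random, lower = more predictable)

* top_k: Restricts sampling to the k most probable tokens

* top_p: Nucleus sampling; restricts sampling to the smallest set of tokens whose cumulative probability reaches p

* do_sample: Samples from the distribution instead of always picking the most likely token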
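
How these parameters interact for a single decoding step can be sketched in plain Python. This is a simplified illustration rather than the library's code: `sample_next_token` is a hypothetical helper, the five-token vocabulary and its logits are made up, and the order of operations (temperature, then top-k, then top-p) is an assumption about how the filters are usually chained.

```python
import math
import random


def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.95,
                      rng=random):
    """Pick one token id; returns (token_id, candidate_ids)."""
    # Temperature: divide logits, so values < 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Top-k: keep only the k highest-scoring token ids.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    kept = order[:top_k]
    # Softmax over the kept tokens (shifted by the max for stability).
    peak = max(scaled[i] for i in kept)
    exps = [math.exp(scaled[i] - peak) for i in kept]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-p: keep the smallest prefix whose cumulative probability >= top_p.
    candidates, weights, cumulative = [], [], 0.0
    for token_id, p in zip(kept, probs):
        candidates.append(token_id)
        weights.append(p)
        cumulative += p
        if cumulative >= top_p:
            break
    # Sample from what is left; choices() renormalizes the weights.
    token = rng.choices(candidates, weights=weights, k=1)[0]
    return token, candidates


toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # made-up scores for 5 tokens
token, candidates = sample_next_token(
    toy_logits, temperature=0.8, top_k=3, top_p=0.8, rng=random.Random(0)
)
print("candidates:", candidates, "sampled:", token)
```

In this toy case top-k first narrows five tokens to three, and top-p then drops the third because the first two already cover more than 80% of the probability mass.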

In [28]:
# Decode and print output

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)

In a distant future, humanity has discovered that, in the very same region where it had its first contact with the first humans, it had its first contact with the first extraterrestrial race, which came to Earth in the 18th century. In that same place, the first colonists were able to create their own species of living beings; in other words, the first human-made race to be found on Mars had its own race.

This information was crucial to the development of the concept of the "Planet of the Apes" and its future. In general, the first human colonists discovered on Earth were a group of about 500 people.
