**BERT text classifier on the IMDb dataset**

In [1]:
import os
import torch
import numpy as np
from torch import nn
from torch.utils.data import DataLoader, Dataset
from transformers import BertTokenizer, BertModel, get_linear_schedule_with_warmup
from sklearn.model_selection import train_test_split
from torch.optim import AdamW
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd

In [2]:
pip install datasets transformers



In [3]:
# Random seed
torch.manual_seed(42)
np.random.seed(42)


device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Using device: cuda


**Data Loading and Splitting**

In [4]:
from datasets import load_dataset

imdb_dataset = load_dataset("imdb")
split = imdb_dataset["train"].train_test_split(test_size=0.10, seed=42, stratify_by_column="label")
train_ds = split["train"]
val_ds   = split["test"]
test_ds  = imdb_dataset["test"]

print("Sizes -> Train:", len(train_ds), "Val.:", len(val_ds), "Test.:", len(test_ds))

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Sizes -> Train: 22500 Val.: 2500 Test.: 25000
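
The `stratify_by_column="label"` argument keeps the positive/negative ratio the same in both halves of the split. As a rough, library-free sketch of that idea (the helper `stratified_split` and the toy label list are hypothetical, not part of `datasets`):

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.10, seed=42):
    """Return (train_idx, val_idx) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for label, indices in sorted(by_class.items()):
        rng.shuffle(indices)                      # shuffle within each class
        n_val = round(len(indices) * test_size)   # same fraction from every class
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Toy stand-in for the 25,000 balanced IMDb training labels.
toy_labels = [0] * 12500 + [1] * 12500
train_idx, val_idx = stratified_split(toy_labels)
val_counts = Counter(toy_labels[i] for i in val_idx)
print(len(train_idx), len(val_idx), dict(val_counts))
# 22500 2500 {0: 1250, 1: 1250}
```

The sizes match the `Train: 22500` / `Val.: 2500` line printed above; the actual row selection inside `datasets` will differ from this sketch.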


**Use AutoTokenizer with truncation and an explicit max_length**

In [5]:
from transformers import AutoTokenizer, DataCollatorWithPadding, set_seed

In [6]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
MAX_LENGTH = 384


In [7]:
def tokenize_fn(batch):
    return tokenizer(
        batch["text"],
        truncation=True,
        max_length=MAX_LENGTH
    )
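
To make the effect of `truncation=True` with `max_length=MAX_LENGTH` concrete, here is a minimal sketch of single-sequence truncation. It assumes the default right-side truncation and uses made-up word-piece ids; `truncate_single_sequence` is a hypothetical helper, not a tokenizer API.

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] ids in the bert-base-uncased vocabulary

def truncate_single_sequence(token_ids, max_length=384):
    """Mimic single-sentence truncation: leave room for [CLS] and [SEP]."""
    budget = max_length - 2        # the two special tokens count toward max_length
    kept = token_ids[:budget]      # tokens past the budget are dropped from the end
    return [CLS_ID] + kept + [SEP_ID]

long_review = list(range(1000, 1500))    # 500 fake word-piece ids
short_review = list(range(1000, 1010))   # 10 fake word-piece ids

print(len(truncate_single_sequence(long_review)))    # 384
print(len(truncate_single_sequence(short_review)))   # 12
```

Short reviews are left at their natural length here; padding is deferred to the data collator.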

**Create Datasets (tokenize with map; use DataCollatorWithPadding for dynamic padding).**

In [8]:
from transformers import DataCollatorWithPadding

# Still BERT, but loaded via AutoTokenizer. The collator pads each batch to its
# longest sequence, rounded up to a multiple of 8.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, pad_to_multiple_of=8)
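
Dynamic padding means each batch is padded only to its own longest sequence rather than to `MAX_LENGTH`. A small sketch of that behaviour, assuming pad id 0 (BERT's `[PAD]`) and arbitrary placeholder token ids; `pad_batch` is a hypothetical helper written for illustration:

```python
def pad_batch(sequences, pad_id=0, multiple_of=8):
    """Pad every sequence to the batch maximum, rounded up to `multiple_of`."""
    longest = max(len(seq) for seq in sequences)
    target = -(-longest // multiple_of) * multiple_of   # ceiling to a multiple
    input_ids = [seq + [pad_id] * (target - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (target - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two fake tokenized reviews of length 3 and 11.
batch = pad_batch([[7, 8, 9], list(range(20, 31))])
print(len(batch["input_ids"][0]), sum(batch["attention_mask"][0]))   # 16 3
```

Rounding up to a multiple of 8 is commonly chosen because it tends to suit mixed-precision GPU kernels; whether it helps on a given GPU is worth measuring.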

In [9]:
tokenized_train = train_ds.map(tokenize_fn, batched=True, remove_columns=["text"])
tokenized_val   = val_ds.map(tokenize_fn,   batched=True, remove_columns=["text"])
tokenized_test  = test_ds.map(tokenize_fn,  batched=True, remove_columns=["text"])

print("Tokenized columns:", tokenized_train.column_names[:6])


Tokenized columns: ['label', 'input_ids', 'token_type_ids', 'attention_mask']


**AutoModelForSequenceClassification** is a **Hugging Face** class that dynamically loads a *pre-trained transformer* model with a sequence classification head on top.

In [10]:
MODEL_NAME = "bert-base-uncased"

**Initialize the model**

In [12]:
from transformers import AutoModelForSequenceClassification

# labels
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
).to(device)

print("Model ready:", MODEL_NAME)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model ready: bert-base-uncased


In [16]:
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback
import numpy as np
import torch

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": (preds == labels).mean()}

args = TrainingArguments(
    output_dir="imdb_out",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_strategy="epoch",             # evaluate once per epoch
    save_strategy="epoch",             # must match eval_strategy for load_best_model_at_end
    load_best_model_at_end=True,
    warmup_ratio=0.1,                  # linear warmup over the first 10% of steps
    lr_scheduler_type="linear",
    label_smoothing_factor=0.1,        # small regularization boost
    metric_for_best_model="accuracy",
    fp16=torch.cuda.is_available(),
    report_to="none",
    logging_steps=100,
)
# Early stopping: stop training when the validation metric stops improving.
# It saves time and limits overfitting; load_best_model_at_end=True restores the best checkpoint.
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    processing_class=tokenizer,        # replaces the deprecated `tokenizer=` argument (transformers >= 4.46)
    data_collator=data_collator,
    compute_metrics=compute_metrics,   # returns {"accuracy": ...}
    callbacks=[EarlyStoppingCallback(
        early_stopping_patience=2,     # stop after 2 evals with no improvement
    )],
)
trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | 0.3325 | 0.336887 | 0.9272 |
| 2 | 0.2857 | 0.352895 | 0.9216 |
| 3 | 0.2535 | 0.341054 | 0.932 |
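
With `early_stopping_patience=2`, the three validation accuracies above never trigger a stop. The simplified patience logic below (an illustrative re-implementation, not the code of `EarlyStoppingCallback`) traces why:

```python
def early_stopping_trace(metrics, patience=2, threshold=0.0):
    """Return (epoch, best_so_far, evals_without_improvement, stop) per evaluation."""
    best = None
    bad_evals = 0
    trace = []
    for epoch, value in enumerate(metrics, start=1):
        if best is None or value > best + threshold:
            best, bad_evals = value, 0     # new best: reset the counter
        else:
            bad_evals += 1                 # no improvement this evaluation
        stop = bad_evals >= patience
        trace.append((epoch, best, bad_evals, stop))
        if stop:
            break
    return trace

trace = early_stopping_trace([0.9272, 0.9216, 0.932])
for row in trace:
    print(row)
# (1, 0.9272, 0, False)
# (2, 0.9272, 1, False)
# (3, 0.932, 0, False)
```

Epoch 2 counts as one evaluation without improvement, and epoch 3 sets a new best and resets the counter, so all three epochs run and the epoch-3 checkpoint is the one kept.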


TrainOutput(global_step=4221, training_loss=0.3139053140028877, metrics={'train_runtime': 1362.1008, 'train_samples_per_second': 49.556, 'train_steps_per_second': 3.099, 'total_flos': 1.33103771796384e+16, 'train_loss': 0.3139053140028877, 'epoch': 3.0})
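
The `global_step=4221` figure follows from the dataset and batch sizes, and `warmup_ratio=0.1` with `lr_scheduler_type="linear"` gives a ramp-up followed by a linear decay. The sketch below reproduces that arithmetic; it assumes a single GPU with no gradient accumulation and that warmup steps are rounded up, and `linear_warmup_lr` is a hypothetical helper rather than the library scheduler.

```python
import math

def linear_warmup_lr(step, total_steps, peak_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at `step` for linear warmup followed by linear decay to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

steps_per_epoch = math.ceil(22500 / 16)   # 1407 optimizer steps per epoch
total_steps = steps_per_epoch * 3         # 4221, matching global_step above
warmup_steps = math.ceil(total_steps * 0.1)

for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{linear_warmup_lr(step, total_steps):.2e}")
```

The rate starts at 0, reaches the configured `2e-5` at the end of warmup, and returns to 0 at the final step.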

In [17]:
val_metrics = trainer.evaluate(tokenized_val)
test_metrics = trainer.evaluate(tokenized_test)

print("Validation metrics:", val_metrics)  # includes 'eval_accuracy'
print("Test metrics:", test_metrics)        # includes 'eval_accuracy'


Validation metrics: {'eval_loss': 0.34105443954467773, 'eval_accuracy': 0.932, 'eval_runtime': 15.5905, 'eval_samples_per_second': 160.354, 'eval_steps_per_second': 5.067, 'epoch': 3.0}
Test metrics: {'eval_loss': 0.3372657895088196, 'eval_accuracy': 0.9328, 'eval_runtime': 141.5518, 'eval_samples_per_second': 176.614, 'eval_steps_per_second': 5.524, 'epoch': 3.0}
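
A loss around 0.34 next to 93% accuracy can look high. One likely contributor is `label_smoothing_factor=0.1`: a smoothed cross-entropy cannot reach zero, assuming the smoothed objective is also the one reported at evaluation time. The sketch below uses the common formulation (1 - eps) * NLL + eps * mean negative log-probability over classes; treat it as an illustration rather than the exact Trainer code.

```python
import math

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def label_smoothed_loss(logits, label, epsilon=0.1):
    """(1 - eps) * NLL of the true class + eps * mean NLL over all classes."""
    log_probs = log_softmax(logits)
    nll = -log_probs[label]
    uniform = -sum(log_probs) / len(log_probs)
    return (1 - epsilon) * nll + epsilon * uniform

def smoothed_loss_floor(num_classes=2, epsilon=0.1):
    """Smallest achievable loss: the entropy of the smoothed target distribution."""
    on = 1 - epsilon + epsilon / num_classes
    off = epsilon / num_classes
    return -(on * math.log(on) + (num_classes - 1) * off * math.log(off))

print(f"floor: {smoothed_loss_floor():.4f}")                                  # about 0.1985
print(f"over-confident but correct: {label_smoothed_loss([10.0, -10.0], 0):.4f}")
```

Under this objective even a perfect two-class model bottoms out near 0.2, and very confident correct predictions are penalized, so loss and accuracy are best read separately.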


In [18]:
SAVE_DIR = "/content/imdb_sentiment"
os.makedirs(SAVE_DIR, exist_ok=True)
trainer.save_model(SAVE_DIR)           # saves model + config
tokenizer.save_pretrained(SAVE_DIR)    # saves tokenizer files
print("Saved to:", SAVE_DIR)


Saved to: /content/imdb_sentiment


In [23]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
import torch

reload_tok = AutoTokenizer.from_pretrained(SAVE_DIR, use_fast=True)
reload_model = AutoModelForSequenceClassification.from_pretrained(SAVE_DIR).to(device)

clf = pipeline(
    "text-classification",
    model=reload_model,
    tokenizer=reload_tok,
    device=0 if torch.cuda.is_available() else -1,
    truncation=True,
    return_token_type_ids=False
)

custom = [
    "This movie was a total waste of time.",
    "Absolutely loved it—brilliant performances!",
    "It was okay; some scenes worked, others dragged.",
    "Great visuals but the story was predictable.",
    "One of the best films I’ve seen this year."
]

print("\nCustom predictions:")
for s in custom:
    out = clf(s)[0]
    print(f"- {s}\n  -> {out['label']} (confidence={out['score']:.3f})")

Device set to use cuda:0



Custom predictions:
- This movie was a total waste of time.
  -> NEGATIVE (confidence=0.945)
- Absolutely loved it—brilliant performances!
  -> POSITIVE (confidence=0.950)
- It was okay; some scenes worked, others dragged.
  -> NEGATIVE (confidence=0.865)
- Great visuals but the story was predictable.
  -> NEGATIVE (confidence=0.919)
- One of the best films I’ve seen this year.
  -> POSITIVE (confidence=0.954)
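
The `confidence` values above are, as far as I understand the text-classification pipeline for a two-label model, softmax probabilities of the predicted class. A minimal sketch of that conversion, using hypothetical logits rather than values taken from the model:

```python
import math

ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}

def logits_to_prediction(logits):
    """Softmax over the class logits, then report the top class and its probability."""
    m = max(logits)                               # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": probs[best]}

pred = logits_to_prediction([-1.4, 1.5])          # hypothetical logits
print(pred["label"], f"{pred['score']:.3f}")      # POSITIVE 0.948
```

Because the score is relative between only two classes, mixed reviews such as "It was okay; some scenes worked, others dragged." can still receive a high score for one label.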


**Q2. Explain what PagedAttention is (in a markdown cell). Implement model inference using vLLM. Try using a different model from the Hugging Face Model Hub and experiment with your own prompt.**

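**PagedAttention** *is a KV-cache memory-management technique inspired by paging and virtual memory in operating systems.*


1.  It partitions each sequence's KV (key-value) cache into small fixed-size blocks (pages), each holding the keys and values for a fixed number of tokens, and keeps a lightweight block table that maps a sequence's logical blocks to physical blocks in GPU memory so the right ones can be fetched on demand.
2.  It removes the need for one large contiguous allocation per sequence and largely avoids fragmentation; blocks are allocated on demand as a sequence grows and returned to the pool when it finishes.
3.  The vLLM authors report that existing systems waste 60-80% of KV-cache memory through over-reservation and fragmentation, and that PagedAttention brings this below 4%.
4.  The memory saved allows larger batches, which supports high-throughput serving of many users and queries at once.
5.  It improves long-text generation because the cache grows block by block instead of being reserved up front for the maximum length.
6.  vLLM, which implements PagedAttention, supports the main Transformer families (LLaMA, GPT, Falcon, Mistral, etc.) and loads models directly from the Hugging Face Hub.
7.  Compared with contiguous KV-cache allocation, it serves more requests per GPU, with fewer out-of-memory (OOM) errors and faster responses.
8.  Blocks can be shared between sequences (for example, parallel samples or beam-search candidates share the prompt's blocks), which improves batching and GPU utilization; the vLLM authors report up to roughly 2x higher throughput for such complex sampling methods.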
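
The block-table idea in the list above can be sketched without any tensors. The toy class below only tracks which physical block each token position would live in; the block size of 16 tokens and the class `ToyPagedKVCache` are illustrative assumptions, not vLLM's implementation.

```python
class ToyPagedKVCache:
    """Toy bookkeeping for a paged KV cache (block ids only, no tensors)."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # pool of physical block ids
        self.block_tables = {}                       # seq_id -> [physical block ids]
        self.seq_lens = {}                           # seq_id -> cached token count

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length == len(table) * self.block_size:   # last block is full (or none yet)
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())     # allocate one block on demand
        self.seq_lens[seq_id] = length + 1

    def locate(self, seq_id, token_pos):
        """Map a logical token position to (physical block, offset in block)."""
        block = self.block_tables[seq_id][token_pos // self.block_size]
        return block, token_pos % self.block_size

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id))
        self.seq_lens.pop(seq_id)

    def wasted_slots(self):
        """Reserved-but-unused token slots (only the tail of each last block)."""
        return sum(len(t) * self.block_size - self.seq_lens[s]
                   for s, t in self.block_tables.items())

cache = ToyPagedKVCache(num_blocks=8, block_size=16)
for _ in range(20):
    cache.append_token("a")
for _ in range(5):
    cache.append_token("b")

print(cache.block_tables)        # {'a': [7, 6], 'b': [5]}
print(cache.locate("a", 17))     # (6, 1)
print(cache.wasted_slots())      # 23
```

Reserving a contiguous 2048-token region for each of these two sequences would leave 2 * 2048 - 25 = 4071 slots unused, against 23 here. The physical blocks also need not be adjacent or in order.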



**Code**

In [2]:
pip install -U vllm transformers

Collecting transformers
  Downloading transformers-4.56.2-py3-none-any.whl.metadata (40 kB)
Downloading transformers-4.56.2-py3-none-any.whl (11.6 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.56.1
    Uninstalling transformers-4.56.1:
      Successfully uninstalled transformers-4.56.1
Successfully installed transformers-4.56.2


In [1]:
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
# 1) Choose a model from the HF Hub
MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

INFO 09-26 18:57:11 [__init__.py:216] Automatically detected platform cuda.


**2) Build a "chat-style" prompt using the model's tokenizer template**

In [2]:
tok = AutoTokenizer.from_pretrained(MODEL)

messages = [
    {"role": "system", "content": "You are a kind, clear teacher."},
    {"role": "user",   "content": "Explain 'Paged Attention' to me like I am new to LLMs. Keep it very short."}
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
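
`apply_chat_template` turns the message list into the single string the model was fine-tuned on. The sketch below assumes a Zephyr-style layout (role tags such as `<|user|>` and `</s>` as the end-of-turn marker), which is my understanding of the TinyLlama chat format; `render_zephyr_style` is a hypothetical helper, and printing `prompt` shows what the real template produces.

```python
def render_zephyr_style(messages, add_generation_prompt=True, eos="</s>"):
    """Render chat messages in an assumed Zephyr-style format."""
    parts = [f"<|{m['role']}|>\n{m['content']}{eos}\n" for m in messages]
    if add_generation_prompt:
        parts.append("<|assistant|>\n")   # cue the model to answer as the assistant
    return "".join(parts)

demo_messages = [
    {"role": "system", "content": "You are a kind, clear teacher."},
    {"role": "user", "content": "Explain 'Paged Attention' very briefly."},
]
rendered = render_zephyr_style(demo_messages)
print(rendered)
```

`add_generation_prompt=True` matters because, without the trailing assistant tag, the model may continue the user's turn instead of answering it.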



**3) Decide how the model should sample text (controls style/length)**

In [3]:

sampling = SamplingParams(
    temperature=0.7,   # 0.0 = deterministic, higher = more creative
    top_p=0.9,
    max_tokens=200     # how many new tokens to generate at most
)
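
To show what `temperature` and `top_p` do to the next-token distribution, here is a small stand-alone sketch of temperature scaling followed by nucleus (top-p) filtering. The logits are invented and the functions are hypothetical helpers, not vLLM internals.

```python
import math
import random

def top_p_distribution(logits, temperature=0.7, top_p=0.9):
    """Temperature-scale the logits, then keep the smallest set of tokens
    whose cumulative probability reaches top_p and renormalize."""
    if temperature == 0:                               # treat 0 as greedy decoding
        best = max(range(len(logits)), key=logits.__getitem__)
        return {best: 1.0}
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

def sample_token(logits, rng, temperature=0.7, top_p=0.9):
    dist = top_p_distribution(logits, temperature, top_p)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]

toy_logits = [2.0, 1.0, 0.5, -1.0]                     # invented scores for 4 tokens
dist = top_p_distribution(toy_logits)
print(sorted(dist))                                    # [0, 1]
print(sample_token(toy_logits, random.Random(0)))
```

With these toy logits only the two most likely tokens survive the filter; a lower temperature sharpens the distribution further, while a higher one lets more tokens into the nucleus.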


**4) Create the vLLM engine.**

In [4]:
llm = LLM(model=MODEL, dtype="auto", trust_remote_code=True)

INFO 09-26 18:58:08 [utils.py:328] non-default args: {'trust_remote_code': True, 'disable_log_stats': True, 'model': 'TinyLlama/TinyLlama-1.1B-Chat-v1.0'}


The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.



INFO 09-26 18:58:44 [__init__.py:742] Resolved architecture: LlamaForCausalLM


`torch_dtype` is deprecated! Use `dtype` instead!


INFO 09-26 18:58:44 [__init__.py:1815] Using max model len 2048
INFO 09-26 18:58:51 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=8192.



INFO 09-26 19:01:40 [llm.py:295] Supported_tasks: ['generate']
INFO 09-26 19:01:40 [__init__.py:36] No IOProcessor plugins requested by the model


**5) Generate! (can pass a list of prompts to batch multiple requests.)**

In [5]:
outputs = llm.generate([prompt], sampling)


**6) Print the model's answer**

In [6]:

print("\n Model Response")
print(outputs[0].outputs[0].text.strip())


 Model Response
Paged Attention is a technique used in LLMs (Learning to Listen and Speak) to help learners focus on the speaker's message. Here's a short explanation:

When you listen to someone speaking, you're constantly switching your attention between the speaker and the background noise around you. This can be distracting, especially when you're not familiar with the speaker or the topic. Paged Attention is a technique that helps learners maintain their focus on the speaker by using a page to represent the speaker's voice. The learner keeps their attention on the page as they listen to the speaker, and they gradually shift their attention back to the speaker as they come to the end of a page. This helps the learner to become more aware of the speaker's message and to respond to it more effectively.


**My own prompt**

In [7]:
messages = [
    {"role": "system", "content": "You are a helpful study buddy."},
    {"role": "user",   "content": "Create 5 beginner MCQs about Paged Attention, with answers."}
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = llm.generate([prompt], sampling)
print(outputs[0].outputs[0].text)


1. Which algorithm is used to implement Paged Attention?
A. Dynamic Programming
B. Linear Programming
C. Binary Search
D. Bellman-Ford Algorithm

2. What is the purpose of Paged Attention in Computer Vision?
A. Detecting objects in a video
B. Identifying objects in a still image
C. Image segmentation
D. Image registration

3. What is the role of the center node in Paged Attention?
A. It is responsible for finding the closest object to the current node
B. It is responsible for updating the object's location in the graph
C. It is responsible for finding the center node in the graph
D. It is responsible for updating the object's location in the graph

4. What is the algorithm used for image segmentation in Paged Attention?
A. K-Means Clustering
B. D


In [9]:
prompts = []
for q in [
    "Explain Paged Attention in one paragraph.",
    "Give a 3-bullet summary of Paged Attention.",
    "Write a one-line best feature for Paged Attention."
]:
    msgs = [
        {"role": "system", "content": "You are concise and concrete."},
        {"role": "user", "content": q}
    ]
    prompts.append(tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True))

outs = llm.generate(prompts, sampling)
for i, out in enumerate(outs, 1):
    print(f" Answer {i}\n{out.outputs[0].text.strip()}")



 Answer 1
Paged Attention is a technique that involves using the entirety of a page, rather than just a small section, to emphasize a particular point or message. It can be particularly useful when conveying complex information or ideas that require a deep dive into a topic. By focusing on the entire page, readers are able to engage with the content more deeply and retain the information for longer. This technique can also be used to create a sense of unity and cohesion in a document, as the entire page is used to convey a unified message. Ultimately, Paged Attention can help to make a document more memorable, engaging, and effective in conveying its message.
 Answer 2
Paged Attention is a platform that provides individuals with a personalized attention budget, allowing them to prioritize their time and focus on tasks that matter most to them. The platform uses machine learning algorithms to analyze users' past behavior and preferences, and then provides a customized attention budget b