# Week 4: Transfer Learning, BERT (Homework)

## Question Search Engine

Embeddings are a good source of information for solving various tasks. For example, we can classify texts or find similar documents using their representations. We already know about word2vec, GloVe and fastText, but these assign each word a single fixed vector learned from the training corpus, so they ignore the context in which the word appears in a given text.
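To make this limitation concrete, here is a toy sketch of a static embedding lookup. The three-dimensional vectors and the `static_vectors` table are invented for illustration; real word2vec, GloVe or fastText vectors have hundreds of dimensions. The sketch only shows that a lookup table returns the same vector for a word no matter which sentence it appears in.

```python
# Toy static embedding table (values invented for illustration only).
static_vectors = {
    "river": [0.9, 0.1, 0.0],
    "bank": [0.5, 0.5, 0.2],
    "money": [0.0, 0.2, 0.9],
}


def embed_static(sentence):
    """Look up each known word; the surrounding words are never consulted."""
    return {w: static_vectors[w] for w in sentence.lower().split() if w in static_vectors}


a = embed_static("She sat on the river bank")
b = embed_static("He deposited money at the bank")

# "bank" gets an identical vector in both sentences, although the senses differ.
print(a["bank"] == b["bank"])  # prints True
```

A context-aware model such as BERT would instead produce different vectors for the two occurrences of "bank", because each token's representation depends on the whole input.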

Today we will use the full power of context-aware embeddings to find duplicate texts!

__Warning:__ this task assumes you have seen `seminar.ipynb`!

In [1]:

import torch
import torch.nn as nn
import torch.nn.functional as F
import transformers
import datasets


### Data Preparation

In [2]:
data_files = {
    "train": "train.jsonl",
    "validation": "validation.jsonl",
    "test": "test.jsonl"
}

qqp = datasets.load_dataset("json", data_files=data_files)
print("\n")
print("Sample[0]:", qqp["train"][0])
print("Sample[3]:", qqp["train"][3])



Sample[0]: {'text1': 'How is the life of a math student? Could you describe your own experiences?', 'text2': 'Which level of prepration is enough for the exam jlpt5?', 'label': 0, 'idx': 0, 'label_text': 'not duplicate'}
Sample[3]: {'text1': 'What can one do after MBBS?', 'text2': 'What do i do after my MBBS ?', 'label': 1, 'idx': 3, 'label_text': 'duplicate'}


In [3]:
model_name = "./model"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name)

In [4]:
MAX_LENGTH = 128

def preprocess_function(examples):
    result = tokenizer(
        examples["text1"],
        examples["text2"],
        padding="max_length",
        max_length=MAX_LENGTH,
        truncation=True,
    )

    result["label"] = examples["label"]

    return result

In [5]:
qqp_preprocessed = qqp.map(preprocess_function, batched=True)

In [6]:
print(repr(qqp_preprocessed["train"][0]["input_ids"])[:100], "...")

[101, 1293, 1110, 1103, 1297, 1104, 170, 12523, 2377, 136, 1180, 1128, 5594, 1240, 1319, 5758, 136,  ...
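The ids printed above begin with 101 because the tokenizer packs both questions into one sequence. The following toy sketch shows roughly what that packing looks like. It assumes the standard BERT special-token ids (`[CLS]` = 101, `[SEP]` = 102, `[PAD]` = 0), and `pack_pair` is a hypothetical helper with a simplified truncation rule. It is not the tokenizer's actual implementation.

```python
CLS, SEP, PAD = 101, 102, 0  # assumed standard BERT special-token ids


def pack_pair(ids_a, ids_b, max_length):
    """Toy pair encoding: [CLS] a [SEP] b [SEP], then padding up to max_length."""
    ids_a, ids_b = list(ids_a), list(ids_b)
    # Shorten the longer sequence until the pair fits; 3 slots are reserved
    # for the special tokens. Simplified relative to the real tokenizer.
    while len(ids_a) + len(ids_b) > max_length - 3:
        longer = ids_a if len(ids_a) >= len(ids_b) else ids_b
        longer.pop()
    input_ids = [CLS] + ids_a + [SEP] + ids_b + [SEP]
    # Segment 0 covers [CLS] a [SEP]; segment 1 covers b [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD] * n_pad,
        "token_type_ids": token_type_ids + [0] * n_pad,
        "attention_mask": [1] * len(input_ids) + [0] * n_pad,
    }


encoded = pack_pair([1293, 1110], [1184, 2146, 136], max_length=10)
print(encoded["input_ids"])       # [101, 1293, 1110, 102, 1184, 2146, 136, 102, 0, 0]
print(encoded["token_type_ids"])  # [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```

The attention mask lets the model ignore padding positions, which is why padding every example to `MAX_LENGTH` does not change the predictions.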


### Evaluation (1 point)

We randomly chose a model trained on QQP - but is it any good?

One way to measure this is with validation accuracy - which is what you will implement next.

Here's the interface to help you do that:

In [7]:
val_set = qqp_preprocessed["validation"]
val_loader = torch.utils.data.DataLoader(
    val_set, 
    batch_size=32,
    shuffle=False, 
    num_workers=2,
    collate_fn=transformers.default_data_collator,
    pin_memory=True
)

In [8]:
for batch in val_loader:
    break  # here be your training code
print("Sample batch:", batch)

with torch.no_grad():
    predicted = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        token_type_ids=batch["token_type_ids"],
    )

print("\nPrediction (probs):", torch.softmax(predicted.logits, dim=1).data.numpy())

Sample batch: {'labels': tensor([0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 0]), 'idx': tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
        18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]), 'input_ids': tensor([[ 101, 1725, 1132,  ...,    0,    0,    0],
        [ 101,  178, 1328,  ...,    0,    0,    0],
        [ 101, 1110, 1175,  ...,    0,    0,    0],
        ...,
        [ 101,  107, 1150,  ...,    0,    0,    0],
        [ 101, 1184, 2146,  ...,    0,    0,    0],
        [ 101, 1725, 1674,  ...,    0,    0,    0]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...
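The cell above turns logits into probabilities with a softmax. As a small stand-alone illustration, the sketch below applies a softmax to one hypothetical pair of logits (the numbers are made up) and checks that the most likely class is the same before and after. This is why accuracy can be computed from an argmax over the raw logits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.2, -0.7]  # hypothetical scores for [not duplicate, duplicate]
probs = softmax(logits)

print([round(p, 3) for p in probs])
# Softmax is monotonic, so the winning class does not change.
print(probs.index(max(probs)) == logits.index(max(logits)))  # prints True
```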

**Task 1 (1 point)**

- Measure the validation accuracy of your model. Doing so naively may take several hours. Please make sure you use the following optimizations:
  - Run the model on a GPU under `torch.no_grad()`
  - Use a batch size larger than 1
  - Use an optimized data loader with `num_workers > 1`
  - (Optional) Use [mixed precision](https://pytorch.org/docs/stable/notes/amp_examples.html)
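One detail to watch when batching: the last batch is usually smaller than the rest (here the validation split has 40,430 examples, which is 1,263 full batches of 32 plus one batch of 14). Averaging per-batch accuracies would therefore be slightly biased, so it is safer to accumulate counts of correct and total examples. A minimal sketch on toy data, with a hypothetical helper `accuracy_from_batches`:

```python
def accuracy_from_batches(batches):
    """batches: iterable of (predictions, labels) pairs of equal-length lists."""
    correct = total = 0
    for preds, labels in batches:
        correct += sum(p == y for p, y in zip(preds, labels))
        total += len(labels)
    return correct / total


# Toy data: one full batch of 4 and a final, smaller batch of 1.
batches = [([1, 0, 1, 1], [1, 0, 0, 0]), ([1], [1])]

exact = accuracy_from_batches(batches)      # 3 correct out of 5
mean_of_batch_acc = (2 / 4 + 1 / 1) / 2     # over-weights the small batch

print(exact, mean_of_batch_acc)  # prints 0.6 0.75
```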


In [9]:
from tqdm import tqdm

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

correct = 0
total = 0

with torch.no_grad():
    for batch in tqdm(val_loader, desc="Evaluating"):
        inputs = {
            'input_ids': batch['input_ids'].to(device),
            'attention_mask': batch['attention_mask'].to(device),
            'token_type_ids': batch['token_type_ids'].to(device)
        }
        
        labels = batch['labels'].to(device)
        
        outputs = model(**inputs)
        predictions = torch.argmax(outputs.logits, dim=-1)
        
        correct += (predictions == labels).sum().item()
        total += predictions.size(0)

accuracy = correct / total
print(f"Validation Accuracy: {accuracy:.4f}")

Evaluating: 100%|██████████| 1264/1264 [02:26<00:00,  8.60it/s]

Validation Accuracy: 0.8926





In [10]:
assert 0.89 < accuracy < 0.91

### Training (4 points)

For this task, you have two options:

__Option A:__ fine-tune your own model. You are free to choose any model __except for the original BERT.__ We recommend [DeBERTa-v3](https://huggingface.co/microsoft/deberta-v3-base). Better yet, choose the best model based on public benchmarks (e.g. [GLUE](https://gluebenchmark.com/)).

You can write the training code manually or use transformers.Trainer (see [this example](https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification)). Please make sure that your model's accuracy is at least __comparable__ with the above example for BERT.


__Option B:__ compare at least 3 pre-finetuned models (in addition to the above BERT model). For each model, report (1) its accuracy, (2) its speed, measured in samples per second on your hardware setup, and (3) its size in megabytes. Please take care to compare the models under equal conditions, e.g. the same CPU / GPU. Compile your results into a table and write a short report (about half a page, in addition to the table) summarizing your findings.
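For the speed and size columns in Option B, a measurement harness could look roughly like the sketch below. Both helper names are hypothetical, the workload is a stand-in so the code runs without a model, and the 110,000,000 parameter count is just a round illustrative number.

```python
import time


def samples_per_second(run_batch, batches):
    """Time run_batch over all batches and return processed samples per second."""
    n_samples = 0
    start = time.perf_counter()
    for batch in batches:
        run_batch(batch)
        n_samples += len(batch)
    elapsed = time.perf_counter() - start
    return n_samples / elapsed


def size_in_megabytes(n_parameters, bytes_per_parameter=4):
    """Approximate weight size; 4 bytes per parameter corresponds to float32."""
    return n_parameters * bytes_per_parameter / 1024**2


# Stand-in workload: 10 "batches" of 32 items each.
dummy_batches = [list(range(32)) for _ in range(10)]
speed = samples_per_second(lambda b: sum(x * x for x in b), dummy_batches)

print(f"{speed:.0f} samples/s")
print(f"{size_in_megabytes(110_000_000):.1f} MB")
```

With a real PyTorch model, the byte count can be taken from `sum(p.numel() * p.element_size() for p in model.parameters())`. When timing on a GPU, call `torch.cuda.synchronize()` before reading the clock, because CUDA kernels run asynchronously.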

**Task 2 (4 points)**
- Choose Option A or Option B (only one will be graded)
- Follow all the instructions and restrictions

In [13]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
import numpy as np
import evaluate

model_name = "./distilbert" 
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def preprocess_function(examples):
    return tokenizer(examples["text1"], examples["text2"], truncation=True, max_length=128)

encoded_dataset = qqp.map(preprocess_function, batched=True)

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)

training_args = TrainingArguments(
    output_dir="./results_distilbert",
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="no",
    fp16=True,
    report_to="none"
)

train_subset = encoded_dataset["train"].shuffle(seed=42).select(range(15000))
eval_subset = encoded_dataset["validation"].shuffle(seed=42).select(range(2000))

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_subset,
    eval_dataset=eval_subset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

print("Starting training DistilBERT...")
trainer.train()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at ./distilbert and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 363846/363846 [00:12<00:00, 28964.81 examples/s]
Map: 100%|██████████| 40430/40430 [00:01<00:00, 28806.32 examples/s]
Map: 100%|██████████| 390965/390965 [00:13<00:00, 28208.07 examples/s]


Starting training DistilBERT...


| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | No log        | 0.389822        | 0.81     |
| 2     | 0.458900      | 0.390444        | 0.831    |
| 3     | 0.305100      | 0.427847        | 0.833    |


TrainOutput(global_step=1407, training_loss=0.3334024804826315, metrics={'train_runtime': 102.9699, 'train_samples_per_second': 437.021, 'train_steps_per_second': 13.664, 'total_flos': 795145909448160.0, 'train_loss': 0.3334024804826315, 'epoch': 3.0})
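The numbers in this training output can be checked with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and the `Trainer` default of logging the training loss every 500 steps (`logging_steps` is not set in the cell above). Under these assumptions it reproduces `global_step=1407`, explains the "No log" entry for epoch 1, and recovers the reported throughput.

```python
import math

train_examples = 15_000   # size of train_subset
batch_size = 32           # per_device_train_batch_size
epochs = 3                # num_train_epochs
logging_steps = 500       # assumed Trainer default
train_runtime = 102.9699  # seconds, from TrainOutput

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # prints 469 1407

# The first training-loss log happens at step 500, which falls in epoch 2,
# so nothing has been logged yet when epoch 1 is evaluated.
first_log_epoch = math.ceil(logging_steps / steps_per_epoch)
print(first_log_epoch)  # prints 2

throughput = train_examples * epochs / train_runtime
print(round(throughput, 1))  # prints 437.0
```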

### Finding Duplicates (1 point)

Finally, it is time to use your model to find duplicate questions.
Please implement a function that takes a question and finds the top-5 potential duplicates in the training set. For now, it is fine if your function is slow, as long as it yields correct results.

Showcase how your function works with at least 5 examples.

**Task 3 (1 point)**
- Implement a function for finding duplicates
- Test it on several examples (at least 5)
- Check the suggested duplicates and draw a conclusion about the model's correctness

In [14]:
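def find_duplicates(question, dataset, model, tokenizer, top_k=5, search_limit=1000):
    """Score `question` against the first `search_limit` `text1` entries of `dataset`
    and return up to `top_k` (candidate, duplicate probability) pairs, best first."""
    device = model.device
    model.eval()

    # Only a prefix of the dataset is searched to keep the brute-force loop fast.
    candidates = dataset.select(range(min(len(dataset), search_limit)))
    candidate_texts = candidates["text1"]

    # Pair the query with every candidate, in the same (text1, text2) layout
    # that was used for fine-tuning.
    pairs = [[question, cand] for cand in candidate_texts]

    duplicates = []
    batch_size = 32

    with torch.no_grad():
        for i in range(0, len(pairs), batch_size):
            batch_pairs = pairs[i : i + batch_size]
            inputs = tokenizer(
                [p[0] for p in batch_pairs],
                [p[1] for p in batch_pairs],
                padding=True, truncation=True, max_length=128, return_tensors="pt"
            ).to(device)

            outputs = model(**inputs)

            # Probability of class 1 ("duplicate") for every pair in the batch.
            probs = torch.softmax(outputs.logits, dim=1)[:, 1]

            for j, prob in enumerate(probs):
                # Drop candidates that the model considers very unlikely duplicates.
                if prob.item() > 0.1:
                    duplicates.append((candidate_texts[i + j], prob.item()))

    # Highest duplicate probability first.
    duplicates.sort(key=lambda x: x[1], reverse=True)
    return duplicates[:top_k]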


test_questions = [
    "How can I learn Python?",
    "What is the best way to lose weight?",
    "Why is the sky blue?",
    "How do I make money online?",
    "What are the best movies of 2020?"
]

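search_dataset = qqp["train"]
search_limit = 1000  # number of training examples scanned per query

print(f"Searching duplicates in first {search_limit} examples...\n")

for q in test_questions:
    print(f"Query: {q}")
    results = find_duplicates(q, search_dataset, model, tokenizer, search_limit=search_limit)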
    
    if not results:
        print("  No duplicates found in subset.")
    for res, score in results:
        print(f"  [{score:.4f}] {res}")
    print("-" * 30)

Searching duplicates in first 1000 examples...

Query: How can I learn Python?
  No duplicates found in subset.
------------------------------
Query: What is the best way to lose weight?
  [0.9682] What will be the repercussions of banning Rs 500 and Rs 1000 notes on Indian economy?
  [0.9679] How can changing 500 and 1000 rupee notes end the black money in India?
  [0.9613] What are your views about governments decision to stop flow of 1000 and 500 rupee notes.?
  [0.9539] How do I lose weight fast?
  [0.9539] How do I lose weight fast?
------------------------------
Query: Why is the sky blue?
  [0.3982] It bothers me to see a black man with a white woman but it does not bother me to see a white man with a black woman. Why might this bother me?
  [0.1562] What will be the repercussions of banning Rs 500 and Rs 1000 notes on Indian economy?
  [0.1085] If more vacuum energy appears with expansion and it has no limit, can infinite of this energy be created? If yes is energy infinite?
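In the output above, "How do I lose weight fast?" takes two of the five slots with the same score, presumably because the same question text occurs in more than one training pair. One possible remedy is to keep only the best score per distinct text before taking the top k. The helper `top_k_unique` below is a hypothetical sketch of that post-processing step, and the third candidate is invented for the example.

```python
import heapq


def top_k_unique(scored, k=5):
    """scored: iterable of (text, score). Keep the best score per distinct text."""
    best = {}
    for text, score in scored:
        if text not in best or score > best[text]:
            best[text] = score
    return heapq.nlargest(k, best.items(), key=lambda item: item[1])


scored = [
    ("How do I lose weight fast?", 0.9539),
    ("How do I lose weight fast?", 0.9539),
    ("How can I learn Python?", 0.42),  # hypothetical extra candidate
]

print(top_k_unique(scored, k=2))
# [('How do I lose weight fast?', 0.9539), ('How can I learn Python?', 0.42)]
```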
  

### Bonus: Finding Duplicates Faster (0.5 point)

Try to find a way to run the function faster than just passing over all questions in a loop. For instance, you can form a short-list of potential candidates using a cheaper method, and then run your transformer on that short list. If you opt for this solution, please keep both the original implementation and the optimized one, and briefly explain the difference between them.
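As one possible reading of the short-list idea, the sketch below filters candidates by token overlap (Jaccard similarity) and calls the expensive scorer only on the survivors. Every name here is hypothetical, and `fake_model_score` is a stand-in for the transformer so that the example runs without a model. It also records how many times the "model" is invoked.

```python
def tokens(text):
    return set(text.lower().replace("?", " ").split())


def jaccard(a, b):
    """Token-set overlap: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def shortlist_then_rerank(question, candidates, expensive_score, shortlist_size=50, top_k=5):
    q = tokens(question)
    # Stage 1: cheap lexical filter over every candidate.
    shortlist = sorted(candidates, key=lambda c: jaccard(q, tokens(c)), reverse=True)[:shortlist_size]
    # Stage 2: the expensive scorer runs only on the short list.
    reranked = sorted(shortlist, key=lambda c: expensive_score(question, c), reverse=True)
    return reranked[:top_k]


calls = []


def fake_model_score(question, candidate):
    calls.append(candidate)  # record each invocation of the "model"
    return jaccard(tokens(question), tokens(candidate))


candidates = [
    "What can one do after MBBS?",
    "How do I lose weight fast?",
    "What do i do after my MBBS ?",
    "How is the life of a math student? Could you describe your own experiences?",
]
result = shortlist_then_rerank(
    "What is the best way to lose weight?", candidates, fake_model_score, shortlist_size=2, top_k=1
)
print(result, len(calls))  # prints ['How do I lose weight fast?'] 2
```

The trade-off is recall: a purely lexical filter can discard true paraphrases that share no words with the query, so the short-list size should be chosen generously.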

**Bonus Task 1 (0.5 point)**
- Speed up your implementation from the "Finding Duplicates" part
- Measure the running time of both the old and the new implementation
- Describe your approach

In [None]:
<A whole lot of YOUR CODE HERE>

### Bonus: Finding Duplicates the Old-Fashioned Way (1.5 points)

In this bonus task you are supposed to use pretrained embeddings (word2vec, GloVe or fastText) to solve the duplicates problem.

**Bonus Task 2 (1.5 points)**
- Solve the Finding Duplicates problem using the mentioned embeddings
- Compare the old-fashioned solution to the previous ones (quality, speed, etc.)
- Write a small report (up to 5 steps, results and conclusions) on the work done in this part
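A common baseline with static embeddings is to average the word vectors of a question and rank candidates by cosine similarity. The sketch below illustrates that idea with a made-up three-dimensional `word_vectors` table. In a real solution the vectors would be loaded from pretrained word2vec, GloVe or fastText files, and all helper names are hypothetical.

```python
import math

# Toy 3-d word vectors, invented for illustration only.
word_vectors = {
    "lose": [0.9, 0.1, 0.0], "weight": [0.8, 0.2, 0.1], "fast": [0.5, 0.1, 0.3],
    "learn": [0.0, 0.9, 0.2], "python": [0.1, 0.8, 0.4],
}


def sentence_vector(text, dim=3):
    """Average the vectors of known words; unknown words are skipped."""
    words = text.lower().replace("?", " ").split()
    vecs = [word_vectors[w] for w in words if w in word_vectors]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def cosine(u, v):
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    return sum(x * y for x, y in zip(u, v)) / norm if norm else 0.0


def rank(question, candidates):
    """Return candidates sorted by cosine similarity to the question, best first."""
    q = sentence_vector(question)
    return sorted(candidates, key=lambda c: cosine(q, sentence_vector(c)), reverse=True)


ranked = rank(
    "What is the best way to lose weight?",
    ["How can I learn Python?", "How do I lose weight fast?"],
)
print(ranked[0])  # prints How do I lose weight fast?
```

Because candidate vectors do not depend on the query, they can be computed once and reused for every search. This is the main speed advantage over scoring each pair with a transformer.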

In [None]:
<A whole lot of YOUR CODE HERE>