### Homework 5: Question search engine

Remeber week01 where you used GloVe embeddings to find related questions? That was.. cute, but far from state of the art. It's time to really solve this task using context-aware embeddings.

__Warning:__ this task assumes you have seen `seminar.ipynb`!

In [1]:
%pip install --upgrade transformers datasets accelerate tqdm protobuf sentencepiece evaluate
import torch
import torch.nn as nn
import torch.nn.functional as F
import transformers
import datasets
from tqdm import tqdm
import numpy as np


Defaulting to user installation because normal site-packages is not writeable

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49m/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


  from .autonotebook import tqdm as notebook_tqdm


### Load data and model

In [2]:
qqp = datasets.load_dataset('SetFit/qqp')
print('\n')
print("Sample[0]:", qqp['train'][0])
print("Sample[3]:", qqp['train'][3])

Repo card metadata block was not found. Setting CardData to empty.




Sample[0]: {'text1': 'How is the life of a math student? Could you describe your own experiences?', 'text2': 'Which level of prepration is enough for the exam jlpt5?', 'label': 0, 'idx': 0, 'label_text': 'not duplicate'}
Sample[3]: {'text1': 'What can one do after MBBS?', 'text2': 'What do i do after my MBBS ?', 'label': 1, 'idx': 3, 'label_text': 'duplicate'}


In [184]:
model_name = "gchhablani/bert-base-cased-finetuned-qqp"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




### Tokenize the data

In [6]:
MAX_LENGTH = 128
def preprocess_function(examples):
    result = tokenizer(
        examples['text1'], examples['text2'],
        padding='max_length', max_length=MAX_LENGTH, truncation=True
    )
    result['label'] = examples['label']
    return result

qqp_preprocessed = qqp.map(preprocess_function, batched=True)

Map: 100%|██████████| 363846/363846 [00:14<00:00, 24650.16 examples/s]
Map: 100%|██████████| 40430/40430 [00:01<00:00, 24620.94 examples/s]
Map: 100%|██████████| 390965/390965 [00:15<00:00, 24903.69 examples/s]


In [181]:
print(repr(qqp_preprocessed['train'][0]['input_ids'])[:100], "...")

[101, 1731, 1110, 1103, 1297, 1104, 170, 12523, 2377, 136, 7426, 1128, 5594, 1240, 1319, 5758, 136,  ...


### Task 1: evaluation (1 point)

We randomly chose a model trained on QQP - but is it any good?

One way to measure this is with validation accuracy - which is what you will implement next.

Here's the interface to help you do that:

In [201]:
batch_size = 5 # 1

val_set = qqp_preprocessed['validation']
val_loader = torch.utils.data.DataLoader(
    val_set,
    batch_size=batch_size,
    shuffle=False,
    collate_fn=transformers.default_data_collator,
    # num_workers=2
)
print(val_set['label'][0], val_set['label_text'][0], val_set['text1'][0], val_set['text2'][0])
print(val_set['label'][2], val_set['label_text'][2], val_set['text1'][2], val_set['text2'][2])

0 not duplicate Why are African-Americans so beautiful? Why are hispanics so beautiful?
1 duplicate Is there a reason why we should travel alone? What are some reasons to travel alone?


__Your task__ is to measure the validation accuracy of your model.
Doing so naively may take several hours. Please make sure you use the following optimizations:

- run the model on GPU with no_grad
- using batch size larger than 1
- use optimize data loader with num_workers > 1
- (optional) use [mixed precision](https://pytorch.org/docs/stable/notes/amp_examples.html)


In [210]:
samples_count = 0
successfull_preds_count = 0

model.eval()

for batch in tqdm(val_loader):
    with torch.no_grad():
        predicted = model(
            input_ids=batch['input_ids'],
            attention_mask=batch['attention_mask'],
            token_type_ids=batch['token_type_ids']
        )
        
    probs = torch.softmax(predicted.logits, dim=1)
    
    batch_successfull_preds_count = 0
    for i, sample_probs in enumerate(probs):
        y = batch['labels'][i]
        if sample_probs[y] >= 0.5:
            batch_successfull_preds_count += 1
            
    successfull_preds_count += batch_successfull_preds_count
    samples_count += batch['labels'].size(0)
    
accuracy = successfull_preds_count / samples_count
accuracy

100%|██████████| 8086/8086 [36:22<00:00,  3.70it/s]     


0.9083848627256987

In [211]:
assert 0.9 < accuracy < 0.91

### Task 2: train the model (4 points)

For this task, you have two options:

__Option A:__ fine-tune your own model. You are free to choose any model __except for the original BERT.__ We recommend [DeBERTa-v3](https://huggingface.co/microsoft/deberta-v3-base). Better yet, choose the best model based on public benchmarks (e.g. [GLUE](https://gluebenchmark.com/)).

You can write the training code manually or use transformers.Trainer (see [this example](https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification)). Please make sure that your model's accuracy is at least __comparable__ with the above example for BERT.


__Option B:__ compare at least 3 pre-finetuned models (in addition to the above BERT model). For each model, report (1) its accuracy, (2) its speed, measured in samples per second in your hardware setup and (3) its size in megabytes. Please take care to compare models in equal setting, e.g. same CPU / GPU. Compile your results into a table and write a short (~half-page on top of a table) report, summarizing your findings.

In [3]:
model_name = "microsoft/deberta-v3-base"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name, use_fast = False)
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name)

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [5]:
MAX_SEQ_LENGTH = 256

def preprocess_function(examples):
    result = tokenizer(
        examples['text1'],
        examples['text2'],
        padding='max_length',
        max_length=MAX_SEQ_LENGTH,
        truncation=True
    )
    result['label'] = examples['label']
    return result

qqp_preprocessed = qqp.map(preprocess_function, batched=True)


Map:   4%|▍         | 14000/363846 [00:01<00:45, 7740.67 examples/s]Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Map:  17%|█▋        | 62000/363846 [00:07<00:37, 8107.91 examples/s]Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Map:  28%|██▊       | 102000/363846 [00:12<00:32, 7996.08 examples/s]Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Map:  38%|███▊      | 139000/363846 [00:17<00:29, 7667.45 examples/s]Be aware, overflowing tokens are not returned fo

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    eval_strategy="steps",
    warmup_steps=500,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    # num_train_epochs=3,
    num_train_epochs=2,
    output_dir="deberta_train",
    overwrite_output_dir=True,
    # logging_steps=1000,
    # logging_dir="deberta_train"
)

In [7]:
import evaluate

metric = evaluate.load("accuracy")

In [8]:
import numpy as np

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

In [9]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=qqp_preprocessed['train'],
    eval_dataset=qqp_preprocessed['validation'],
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train() # 17+ hours with default deberta args

Step,Training Loss,Validation Loss


KeyboardInterrupt: 

### Task 3: try the full pipeline (1 point)

Finally, it is time to use your model to find duplicate questions.
Please implement a function that takes a question and finds top-5 potential duplicates in the training set. For now, it is fine if your function is slow, as long as it yields correct results.

Showcase how your function works with at least 5 examples.

In [109]:
def get_duplicates(tokenizer, model, question):
    mps_device = torch.device("mps")
    model.to(mps_device)
    
    duplicates = []
    duplicate_logits = []
    
    for i, q in enumerate(tqdm(set(qqp['train']['text2']))):
        with torch.no_grad():
            prepocessed = tokenizer(
                question,
                q,
                padding='max_length',
                max_length=MAX_LENGTH,
                truncation=True,
                return_tensors="pt"
                )
            
            predicted = model(**prepocessed.to(mps_device))
            
            probs = torch.softmax(predicted.logits[0], dim=-1)
            
            if probs[1] > 0.9:
                duplicates.append(f'{i} | {probs[1]}: {q}')
                duplicate_logits.append(predicted.logits[0][1].item())
            
    sorted = np.argsort(np.array(duplicate_logits))[-20:]
    duplicates = np.array(duplicates)[sorted].tolist()
        
    return duplicates

In [94]:
q = qqp['train'][1]['text1']
q, get_duplicates(q)

  1%|          | 2032/273393 [00:44<1:38:07, 46.09it/s]Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
 11%|█         | 30640/273393 [11:06<1:26:35, 46.73it/s]Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
 24%|██▍       | 65186/273393 [22:50<1:14:07, 46.82it/s]Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
 28%|██▊       | 77514/273393 [26:57<1:07:33, 48.32it/s]Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs 

('How do I control my horny emotions?',
 ["60821 | 0.9846774339675903: Is there any way to tell if you wrote a question on Quora if you don't remember if you wrote it?",
  '229885 | 0.9847133755683899: How should I edit my question correctly if Quora marks down my question for improvement?',
  '177772 | 0.9853441119194031: How do I protect a business idea from being stolen from VC? How do I protect the idea from being copied?',
  '97005 | 0.9854615926742554: Why are some questions on Quora flagged as needing improvement when they don’t need improvement?',
  "209962 | 0.9856707453727722: Why did Modi scrap Rs 500 & Rs 1000 notes? And what's the reason for the sudden introduction of the 2000 rupee note?",
  '34828 | 0.9861811995506287: Why did the government print Rs 2000 notes? Why they didn’t print new 1000 notes?',
  '208963 | 0.9873888492584229: Why Central Govt banned old 500 and 1000 Rs note, but issued new 500 and 2000 Rs note?',
  '46728 | 0.9880470037460327: Why does 500 and 100

In [111]:
# взять все парные тексты входного текста, получить предсказания по ним
# искусственно, неприменимо, кроме как на обучающей сборке
question = 'How do I control my horny emotions?'
candidates = []

for pair in qqp['train']:
    if question == pair['text1'] or question == pair['text2']:
        candidates.append(pair['text1'])
        candidates.append(pair['text2'])

candidates 

['How do I control my horny emotions?', 'How do you control your horniness?']

In [107]:
bert_tokenizer = transformers.AutoTokenizer.from_pretrained("gchhablani/bert-base-cased-finetuned-qqp")
bert_model = transformers.AutoModelForSequenceClassification.from_pretrained("gchhablani/bert-base-cased-finetuned-qqp")

In [110]:
q = qqp['train'][1]['text1']
q, get_duplicates(bert_tokenizer, bert_model, q) # x3 performance improvement with msp device

100%|██████████| 273393/273393 [57:47<00:00, 78.85it/s] 


('How do I control my horny emotions?',
 ['216701 | 0.9965547323226929: How do I learn how to use self control and get over illogical jealousy?',
  '17380 | 0.9965278506278992: WHERE DO I GET GOOD TAMIL NADU FOOD PRODUCTS IN RALEIGH, NORTH CAROLINA?',
  '122782 | 0.9970905780792236: Can we control our feelings and emotions?',
  '134518 | 0.997090220451355: How can I curb my sexual desires?',
  '137568 | 0.9968903660774231: How do I get over limerence?',
  '82080 | 0.996866762638092: How do I overcome my social anxiety?',
  '28544 | 0.9973135590553284: How do I stop being so horny all the damn time?',
  '254619 | 0.9973885416984558: Can we control our feelings?',
  '63725 | 0.9973657727241516: Does our mind control our emotions?',
  '175427 | 0.9975578784942627: How can I stop being horny?',
  '72352 | 0.9972877502441406: I HAD CUM IN SHINY NYLON SHORTS WHEN I WAS 12... I NOW HAVE DEVELOPED A FETISH FOR THEM.. WHY?',
  '94921 | 0.9974210262298584: How do i control emotions at work place

__Bonus:__ for bonus points, try to find a way to run the function faster than just passing over all questions in a loop. For isntance, you can form a short-list of potential candidates using a cheaper method, and then run your tranformer on that short list. If you opted for this solution, please keep both the original implementation and the optimized one - and explain briefly what is the difference there.

In [136]:
from torch.distributed.device_mesh import init_device_mesh

mesh_1d = init_device_mesh("cuda", mesh_shape=(8,))

Unexpected exception formatting exception. Falling back to standard exception


Traceback (most recent call last):
  File "/Users/ant.korneev/Library/Python/3.9/lib/python/site-packages/IPython/core/interactiveshell.py", line 3550, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/var/folders/7l/3qdkg7n17mj7g3w_q8yf5xzc0000gq/T/ipykernel_61839/1726673958.py", line 3, in <module>
    mesh_1d = init_device_mesh("cuda", mesh_shape=(8,))
  File "/Users/ant.korneev/Library/Python/3.9/lib/python/site-packages/torch/distributed/device_mesh.py", line 713, in init_device_mesh
    return not_none(_find_pg_by_ranks_and_tag(*dim_group_infos))
  File "/Users/ant.korneev/Library/Python/3.9/lib/python/site-packages/torch/distributed/device_mesh.py", line 255, in __init__
    @staticmethod
  File "/Users/ant.korneev/Library/Python/3.9/lib/python/site-packages/torch/distributed/device_mesh.py", line 268, in _get_or_create_default_group
    raise KeyError(
  File "/Users/ant.korneev/Library/Python/3.9/lib/python/site-packages/torch/distributed/c10d_logger.p

In [135]:
%%sh

cd ../../transformers/examples/pytorch/text-classification/

pip install datasets
export TASK_NAME=qqp 

output_dir="deberta_results"

num_gpus=8

batch_size=8

python -m torch.distributed.launch --nproc_per_node=${num_gpus} \
  run_glue.py \
  --model_name_or_path microsoft/deberta-v3-base \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --evaluation_strategy steps \
  --max_seq_length 256 \
  --warmup_steps 500 \
  --per_device_train_batch_size ${batch_size} \
  --learning_rate 2e-5 \
  --num_train_epochs 3 \
  --output_dir $output_dir \
  --overwrite_output_dir \
  --logging_steps 1000 \
  --logging_dir $output_dir

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Defaulting to user installation because normal site-packages is not writeable



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49m/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip[0m
/opt/anaconda3/bin/python: Error while finding module specification for 'torch.distributed.launch' (ModuleNotFoundError: No module named 'torch')


CalledProcessError: Command 'b'\ncd ../../transformers/examples/pytorch/text-classification/\n\npip install datasets\nexport TASK_NAME=qqp \n\noutput_dir="deberta_results"\n\nnum_gpus=8\n\nbatch_size=8\n\npython -m torch.distributed.launch --nproc_per_node=${num_gpus} \\\n  run_glue.py \\\n  --model_name_or_path microsoft/deberta-v3-base \\\n  --task_name $TASK_NAME \\\n  --do_train \\\n  --do_eval \\\n  --evaluation_strategy steps \\\n  --max_seq_length 256 \\\n  --warmup_steps 500 \\\n  --per_device_train_batch_size ${batch_size} \\\n  --learning_rate 2e-5 \\\n  --num_train_epochs 3 \\\n  --output_dir $output_dir \\\n  --overwrite_output_dir \\\n  --logging_steps 1000 \\\n  --logging_dir $output_dir\n'' returned non-zero exit status 1.