# Deep Learning for Text Data, HSE Faculty of Computer Science
## Homework 4: Direct Preference Optimization

__Soft deadline: 16.11.25 23:59__ \
__Hard deadline: 19.11.25 23:59__

### About the assignment

In this assignment you will train a large LLM to answer questions using DPO, and implement LoRA for parameter-efficient training.

### Grading and penalties

The maximum score for this assignment is __11 points__.

The grade for this homework consists of the grade for the __tasks__ and for the __report__, in which you describe the work you have done. The report is worth up to 2 points; however, if the report is missing, no points will be awarded for the corresponding tasks. We insist that you organize all code as a full-fledged project. Treat this notebook as a file with the problem statement rather than a place for large amounts of code. Submitting large notebooks full of code will lower your grade. Answers to all questions in the tasks can (and should) be written in the report.

The assignment must be completed individually. "Similar" solutions are considered plagiarism, and all students involved (including those who were copied from) cannot receive more than 0 points for it. All code must be written by you. Using someone else's code is prohibited, even with a link to the source. Within reason, of course: borrowing a couple of obvious lines to implement some small piece of functionality is fine.

### Solution plan

<img src="https://miro.medium.com/v2/resize:fit:1400/1*lK6iJMz5CGh2fo7TsDn15A.png" alt="drawing" width="700"/>

Instruction tuning with DPO is split into two stages:
1. __Supervised Fine-tuning (SFT)__: training the base model to answer queries in the required format.
2. __Direct Preference Optimization (DPO)__: training the SFT model to prefer "good" answers.
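
To make the second stage more concrete, here is a minimal sketch of the DPO loss for a single preference pair, written in plain Python. It assumes that the summed log-probabilities of the chosen and rejected responses under the trainable policy and under the frozen reference (SFT) model are already available. The function name, the argument names and the default `beta=0.1` are illustrative choices, not something the assignment prescribes.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response
    (chosen or rejected) under the trainable policy or the frozen reference.
    """
    # Implicit rewards: how much more likely the policy makes each response
    # compared to the reference model.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy is identical to the reference, the margin is zero and the loss equals log 2; it decreases as the policy shifts probability mass toward the chosen response relative to the rejected one.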

We do not want to train the models in full, for two reasons: 1) the models we use are very large; 2) we only need to align the model with our preferences, without adding new knowledge to it, which does not require heavy training. Therefore, we will use PEFT, specifically LoRA, for training.

So, you will need to:
1. Implement and test LoRA
2. Understand the data and convert it to the required format
3. Train the SFT model
4. Train the DPO model
5. Be glad that you did great and handled everything
6. (Optional) build a web interface for your model, reusing code from the first homework (we may award bonus points if it turns out well).

### About the dataset

We will work with the [Anthropic Helpful-Harmless](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset for RLHF. It contains 160k examples of answers to questions, with dialogue history.
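
As a rough illustration of bringing such data into a prompt / chosen / rejected format, the sketch below splits a dialogue string at its last assistant turn. It assumes that each record stores two full dialogue strings in `chosen` and `rejected` fields, with turns marked by `\n\nHuman:` and `\n\nAssistant:`; this layout should be checked against the dataset card. The helper names and the sample record are made up for the example.

```python
ASSISTANT_TAG = "\n\nAssistant:"


def split_prompt_and_response(dialogue):
    """Split a dialogue string at the last assistant turn."""
    idx = dialogue.rfind(ASSISTANT_TAG)
    if idx == -1:
        raise ValueError("no assistant turn found")
    cut = idx + len(ASSISTANT_TAG)
    return dialogue[:cut], dialogue[cut:].strip()


def to_preference_triple(record):
    """Turn a {'chosen', 'rejected'} record into prompt / chosen / rejected."""
    prompt, chosen = split_prompt_and_response(record["chosen"])
    _, rejected = split_prompt_and_response(record["rejected"])
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


# Hand-written record in the assumed layout (not taken from the dataset).
example = {
    "chosen": "\n\nHuman: What is 2 + 2?\n\nAssistant: 2 + 2 equals 4.",
    "rejected": "\n\nHuman: What is 2 + 2?\n\nAssistant: I don't know.",
}
triple = to_preference_triple(example)
```

Splitting at the last assistant tag keeps the whole dialogue history inside the prompt, so only the final answer differs between the chosen and rejected sides.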

### Low-Rank Adaptation (LoRA)

<img src="https://heidloff.net/assets/img/2023/08/lora.png" alt="drawing" width="600"/>

__Task 1 (3 points).__ Implement a LoRA module yourself for efficient LLM training, following the scheme described in the [paper](https://arxiv.org/pdf/2106.09685). Plug it into your favorite LLM and verify that the loss decreases when training the LoRA parameters on unconditional generation. Use any data of your choice for this. Measure how much the number of trainable parameters decreased, how the speed of the forward and backward passes changed, and how memory consumption changed. Draw conclusions and write about them in the report.
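
The core idea of the paper can be shown without any framework: the frozen weight `W` is left untouched, and a low-rank update `(alpha / r) * B @ A` is added to its output, with `B` initialized to zero so that training starts exactly from the base model. The toy class below is only a sketch in plain Python lists; it is not the `NLP_HW4.lora` implementation used in this notebook, and all names in it are illustrative.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen, shape (d_out, d_in)
        # A starts small and random, B starts at zero, so the update is zero
        # at initialization and the layer reproduces the base layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scaling = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))  # B @ (A @ x)
        return [b + self.scaling * u for b, u in zip(base, update)]

    def num_trainable(self):
        return sum(len(row) for row in self.A) + sum(len(row) for row in self.B)


W = [[1.0, 2.0, 0.0, -1.0], [0.5, 0.0, 1.0, 3.0], [2.0, -1.0, 0.0, 0.0]]
layer = LoRALinear(W, rank=2, alpha=4)
x = [1.0, -1.0, 2.0, 0.5]
```

Only `A` and `B` would receive gradients, which gives `rank * (d_in + d_out)` trainable values per adapted layer instead of `d_in * d_out`. In PyTorch the same structure would typically wrap an existing `nn.Linear` whose parameters have `requires_grad` set to `False`.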

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer
import comet_ml
from comet_ml import Experiment
import os

# my own modules
from NLP_HW4.lora import (
    apply_lora_to_model,
    count_parameters,
    save_lora_weights,
    load_lora_weights,
)
from NLP_HW4.utils import (
    set_seed,
    print_model_stats,
    compare_models_memory,
    print_comparison_results,
    MemoryTracker,
)
from NLP_HW4.data_preprocessing import (
    prepare_wikitext_data,
    create_dataloaders,
)
from NLP_HW4.trainer import LoRATrainer

In [3]:
COMET_API_KEY = os.environ["COMET_API_KEY"]  # read the key from the environment instead of hardcoding it
COMET_PROJECT_NAME = "nlp-hw-4"

Let's take a small model from the same family to test how my LoRA implementation works.

In [4]:
MODEL_NAME = "EleutherAI/pythia-160m"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

In [5]:
print(f"Device: {DEVICE}")

Device: cuda


In [6]:
LORA_RANK = 8
LORA_ALPHA = 16
LORA_DROPOUT = 0.1
TARGET_MODULES = ["query_key_value"]

In [7]:
BATCH_SIZE = 8
LEARNING_RATE = 3e-4
NUM_EPOCHS = 2
WARMUP_STEPS = 100
MAX_LENGTH = 256
GRADIENT_ACCUMULATION_STEPS = 4

In [8]:
SEED = 42
SAVE_DIR = "./lora_checkpoints"
set_seed(SEED)
os.makedirs(SAVE_DIR, exist_ok=True)

In [9]:
experiment = Experiment(
    api_key=COMET_API_KEY,
    project_name=COMET_PROJECT_NAME,
)

COMET INFO: Experiment is live on comet.com https://www.comet.com/arinaromashkina/nlp-hw-4/a03eebea8f714ed982975342ecab414c

COMET INFO: Couldn't find a Git repository in '/home/aromashkina22/arcadia/sdg/sdc/ros/scene_modeling/notebooks' nor in any parent directory. Set `COMET_GIT_DIRECTORY` if your Git Repository is elsewhere.


In [10]:
experiment.log_parameters({
    "model_name": MODEL_NAME,
    "lora_rank": LORA_RANK,
    "lora_alpha": LORA_ALPHA,
    "lora_dropout": LORA_DROPOUT,
    "target_modules": TARGET_MODULES,
    "batch_size": BATCH_SIZE,
    "learning_rate": LEARNING_RATE,
    "num_epochs": NUM_EPOCHS,
    "max_length": MAX_LENGTH,
    "seed": SEED,
})

Let's take the small WikiText dataset, which is familiar from homework in other courses.

In [11]:
train_dataset, val_dataset, tokenizer = prepare_wikitext_data(
    tokenizer_name=MODEL_NAME,
    max_length=MAX_LENGTH,
    dataset_name="wikitext",
    dataset_config="wikitext-2-raw-v1",
)

train_loader, val_loader = create_dataloaders(
    train_dataset,
    val_dataset,
    batch_size=BATCH_SIZE,
    num_workers=2,
    pad_token_id=tokenizer.pad_token_id,
)
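
With `num_workers=2` the data loader starts worker processes, and the training output further down shows the Hugging Face `tokenizers` warning about forking after parallelism has been used. One option, suggested by the warning text itself, is to set the `TOKENIZERS_PARALLELISM` environment variable explicitly. The placement below (before the tokenizer is first used) is an assumption about where it takes effect most reliably.

```python
import os

# Should run before the tokenizer is first used, e.g. near the top of the notebook.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

Setting it to `false` trades tokenizer-level parallelism for a quiet log; since the data here is tokenized once up front, that cost is probably small.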

In [12]:
print(f"Train len: {len(train_loader)}")
print(f"Val len: {len(val_loader)}")

Train len: 2000
Val len: 212
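
These loader lengths are counted in batches, so together with the settings above they determine the effective batch size and the number of optimizer updates per epoch. The sketch below assumes one optimizer step per `GRADIENT_ACCUMULATION_STEPS` batches; how `LoRATrainer` treats a trailing partial group is not shown here, so the ceiling division is a guess.

```python
def optimizer_steps_per_epoch(num_batches, accumulation_steps):
    """Number of optimizer updates when gradients are accumulated over
    `accumulation_steps` consecutive batches (a trailing partial group
    still triggers one update in this sketch)."""
    return -(-num_batches // accumulation_steps)  # ceiling division


batch_size = 8            # BATCH_SIZE
accumulation_steps = 4    # GRADIENT_ACCUMULATION_STEPS
train_batches = 2000      # len(train_loader)

effective_batch = batch_size * accumulation_steps
steps = optimizer_steps_per_epoch(train_batches, accumulation_steps)
print(effective_batch, steps)
```

This gives an effective batch of 32 sequences and 500 updates per epoch, which is consistent with the last validation of the first epoch being logged at step 500.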


In [13]:
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    dtype=torch.float32,
)

print_model_stats(base_model, "Base Model")

Base Model
total_params: 162,322,944
trainable_params: 162,322,944
lora_param: 0
trainable_percentage %: 100.00%



In [14]:
model_with_lora = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    dtype=torch.float32,
)


model_with_lora = apply_lora_to_model(
    model_with_lora,
    target_modules=TARGET_MODULES,
    rank=LORA_RANK,
    alpha=LORA_ALPHA,
    dropout=LORA_DROPOUT,
)

print_model_stats(model_with_lora, "Model with LoRA")


Model with LoRA
total_params: 162,617,856
trainable_params: 294,912
lora_param: 294,912
trainable_percentage %: 0.18%



So this looks right: only a tiny fraction of the parameters is left trainable.
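
The reported LoRA parameter count can be reproduced by hand. The sketch assumes the usual pythia-160m shapes (12 transformer layers, hidden size 768, a fused `query_key_value` projection with output size 3 * 768) and one rank-8 adapter per layer; these shapes are not printed in the notebook, so treat them as assumptions.

```python
def lora_param_count(d_in, d_out, rank, num_layers):
    """Trainable parameters added by LoRA: A is (rank x d_in), B is (d_out x rank)."""
    return num_layers * rank * (d_in + d_out)


# Assumed pythia-160m shapes: 12 layers, hidden size 768, fused QKV output 3 * 768.
hidden = 768
added = lora_param_count(d_in=hidden, d_out=3 * hidden, rank=8, num_layers=12)
base_params = 162_322_944  # total reported for the base model above
share = 100 * added / (base_params + added)
print(added, f"{share:.2f}%")
```

Under these assumptions the result matches the statistics above: 294,912 added parameters, or 0.18% of the 162,617,856 total.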

In [15]:
stats = count_parameters(model_with_lora)
experiment.log_metrics({
    "model/total_params": stats['total_params'],
    "model/trainable_params": stats['trainable_params'],
    "model/lora_params": stats['lora_params'],
    "model/trainable_percentage": stats['trainable_percentage'],
})

In [16]:
comparison_results = compare_models_memory(
    base_model,
    model_with_lora,
    batch_size=BATCH_SIZE,
    seq_len=MAX_LENGTH,
    device=DEVICE,
)

print_comparison_results(comparison_results)

Baseline Forward: 0.2570 seconds
Baseline Backward: 0.0935 seconds
LoRA Forward: 0.0496 seconds
LoRA Backward: 0.0384 seconds

forward_time:
Baseline: 0.2570s
LoRA:     0.0496s
Speedup:  5.18x
backward_time:
Baseline: 0.0935s
LoRA:     0.0384s
Speedup:  2.44x
memory_usage_peak:
Baseline Peak: 3298.15 MB
LoRA Peak:     2935.00 MB
Savings:       11.01%



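In this measurement the LoRA model is noticeably faster on both passes and uses about 11% less peak memory. The backward speedup is expected, since weight gradients are computed for far fewer parameters. The forward speedup is less obvious, because a LoRA forward pass does extra work on top of the base layer, so part of it may come from warm-up overhead in the baseline run, which appears to have been measured first.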
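
Timings of this kind are sensitive to one-off costs in the first call and, on a GPU, to asynchronous kernel execution. The helper below is a generic sketch of a more careful measurement with untimed warm-up runs and an optional synchronization hook. It does not describe what `compare_models_memory` does internally, and the function name and defaults are illustrative.

```python
import time


def time_call(fn, warmup=3, iters=10, sync=None):
    """Average wall-clock time of fn() over `iters` runs after `warmup` runs.

    `sync` is an optional callable that blocks until queued device work is
    done; on a GPU one would pass torch.cuda.synchronize.
    """
    for _ in range(warmup):
        fn()  # warm-up runs are not timed (caches, lazy initialization)
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()
    return (time.perf_counter() - start) / iters


calls = []
avg = time_call(lambda: calls.append(sum(range(1000))), warmup=2, iters=5)
```

Measuring both models with the same helper, and averaging over several iterations, makes the forward and backward comparisons less dependent on which model happens to run first.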

In [17]:
for model_type, metrics in comparison_results.items():
    for metric_name, value in metrics.items():
        experiment.log_metric(f"comparison/{model_type}_{metric_name}", value)


del base_model
torch.cuda.empty_cache()

In [18]:
optimizer = torch.optim.AdamW(
    [p for p in model_with_lora.parameters() if p.requires_grad],
    lr=LEARNING_RATE,
    betas=(0.9, 0.999),
    weight_decay=0.01,
)

In [19]:
trainer = LoRATrainer(
    model=model_with_lora,
    train_loader=train_loader,
    val_loader=val_loader,
    optimizer=optimizer,
    device=DEVICE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    max_grad_norm=1.0,
    log_interval=10,
    eval_interval=200,
    save_dir=SAVE_DIR,
    experiment=experiment,
)

In [20]:
best_val_loss = trainer.train(
    num_epochs=NUM_EPOCHS,
    warmup_steps=WARMUP_STEPS,
    save_best=True,
)

print(f"Best validation loss: {best_val_loss:.4f}")
experiment.log_metric("final/best_val_loss", best_val_loss)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Epoch 0:   0%|          | 0/2000 [00:00<?, ?it/s]

Step 50 Val Loss: 2.8032
Step 100 Val Loss: 2.3021
Step 150 Val Loss: 2.2157
Step 200 Val Loss: 2.1884
Step 250 Val Loss: 2.1766
Step 300 Val Loss: 2.1646
Step 350 Val Loss: 2.1604
Step 400 Val Loss: 2.1543
Step 450 Val Loss: 2.1498
Step 500 Val Loss: 2.1458

Epoch 0 Summary: Train Loss: 2.3921 Val Loss: 2.1458 
checkpoint saved to ./lora_checkpoints/checkpoint_epoch_0.pt
checkpoint saved to ./lora_checkpoints/best_model.pt
saved new best model with val_loss: 2.1458
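
For language modeling, a validation loss is often easier to interpret as perplexity. The conversion below assumes the logged value is the mean per-token cross-entropy in nats, which is the common convention for causal LM losses but is not confirmed by the trainer code shown here.

```python
import math

val_loss = 2.1458  # validation loss after epoch 0, from the log above
perplexity = math.exp(val_loss)
print(f"{perplexity:.2f}")
```

Under that assumption the model's perplexity on the validation split after the first epoch is roughly 8.5.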


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Epoch 1:   0%|          | 0/2000 [00:00<?, ?it/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Av

Validation:   0%|          | 0/212 [00:00<?, ?it/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Step 550 Val Loss: 2.1443



huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Validation:   0%|          | 0/212 [00:00<?, ?it/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Step 600 Val Loss: 2.1405



huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Validation:   0%|          | 0/212 [00:00<?, ?it/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Step 650 Val Loss: 2.1378



huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Validation:   0%|          | 0/212 [00:00<?, ?it/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Step 700 Val Loss: 2.1366



huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Validation:   0%|          | 0/212 [00:00<?, ?it/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Step 750 Val Loss: 2.1339
Step 800 Val Loss: 2.1322
Step 850 Val Loss: 2.1309
Step 900 Val Loss: 2.1294
Step 950 Val Loss: 2.1289
Step 1000 Val Loss: 2.1285

Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7f481c20c310>
Traceback (most recent call last):
  File "/home/aromashkina22/miniconda3/envs/env/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1618, in __del__
    self._shutdown_workers()
  File "/home/aromashkina22/miniconda3/envs/env/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1582, in _shutdown_workers
    w.

RuntimeError: DataLoader worker (pid(s) 1273919, 1273920) exited unexpectedly

So here we see that the loss keeps going down.
The plot in Comet ML looks nice as well, so I consider everything to be working reasonably well.

In [21]:
test_prompts = [
    "The homework for NLP course will",
    "The future of artificial intelligence",
    "Once upon a time",
    "I love MOP I love Yandex",
]

for prompt in test_prompts:
    print(f"Prompt: {prompt}")
    generated = trainer.generate(
        prompt,
        tokenizer,
        max_length=50,
        temperature=0.8,
        top_k=50,
    )
    print(f"Generated: {generated}")
    print("-" * 60)
    
    experiment.log_text(f"Prompt: {prompt}\nGenerated: {generated}")

Prompt: The homework for NLP course will
Generated: The homework for NLP course will be offered to anyone who has mastered the concept and can get into the program . In addition to the class , NLP will be offered to anyone who has mastered the concept and can get into the program . The class will be held in a
------------------------------------------------------------
Prompt: The future of artificial intelligence
Generated: The future of artificial intelligence is uncertain . The company's research and development program to develop a " machine " has been plagued by delays in the application of advanced computational methods , and the most recent results from this field were reported in the June 2012 issue of the journal Nature . 

------------------------------------------------------------
Prompt: Once upon a time
Generated: Once upon a time , I knew what I was going to do . I had to work on my strength and get my form right . I was so tired and I couldn’t do any more . I was going 

Well, judging by the last example, the model got the concept :)
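The `temperature=0.8` and `top_k=50` arguments passed to `trainer.generate` control how the next token is sampled. The internals of `trainer.generate` are not shown here, so the following is only a minimal, framework-free sketch of the usual top-k sampling with temperature; the function name `sample_top_k` and the toy logits are hypothetical.

```python
import math
import random


def sample_top_k(logits, temperature=0.8, top_k=50, rng=random):
    """Sample one token id from raw logits using temperature and top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Keep only the k highest-scoring token ids.
    k = min(top_k, len(logits))
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature < 1 sharpens the distribution, temperature > 1 flattens it.
    scaled = [logits[i] / temperature for i in top_ids]
    # Numerically stable softmax weights over the surviving logits.
    max_logit = max(scaled)
    weights = [math.exp(s - max_logit) for s in scaled]
    return rng.choices(top_ids, weights=weights, k=1)[0]


# Toy vocabulary of 5 tokens; with top_k=2 only ids 0 and 1 can ever be drawn.
rng = random.Random(42)
toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
samples = [sample_top_k(toy_logits, temperature=0.8, top_k=2, rng=rng) for _ in range(200)]
print(sorted(set(samples)))
```

A lower `top_k` or `temperature` makes generations more repetitive but more coherent, which may explain the looping phrase in the first sample above.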

In [22]:
final_lora_path = os.path.join(SAVE_DIR, "final_lora_weights.pt")
save_lora_weights(model_with_lora, final_lora_path)
experiment.end()

COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     name                  : dusty_centipede_3043
COMET INFO:     url                   : https://www.comet.com/arinaromashkina/nlp-hw-4/a03eebea8f714ed982975342ecab414c
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     comparison/Baseline_backward_time         : 0.09347176551818848
COMET INFO:     comparison/Baseline_forward_time          : 0.25700855255126953
COMET INFO:     comparison/Baseline_memory_after_backward : 1800.8173828125
COMET INFO:     comparison/Baseline_memory_afte

### Supervised Fine-tuning

__Task 2 (3 points).__ Split every example with a "good" answer into a prompt (everything before the last "Assistant:") and a response (everything starting from the last "Assistant:"). Fine-tune [`pythia-1.4b`](https://huggingface.co/EleutherAI/pythia-1.4b) to generate the correct responses using your LoRA. One epoch should be enough for convergence. Check on several random test examples that the model behaves as intended.
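The split described in the task can be sketched as follows. This is a minimal illustration, not the code from `NLP_HW4.hh_data_preprocessing` (which is not shown here); the helper name `split_prompt_response` and the toy dialogue are hypothetical.

```python
def split_prompt_response(dialogue, marker="Assistant:"):
    """Split a dialogue at the LAST occurrence of `marker`.

    The prompt is everything before the last marker, the response is
    everything starting from it (the marker itself stays in the response).
    """
    idx = dialogue.rfind(marker)
    if idx == -1:
        raise ValueError(f"marker {marker!r} not found in dialogue")
    return dialogue[:idx], dialogue[idx:]


toy_dialogue = (
    "\n\nHuman: What is the capital of France?"
    "\n\nAssistant: Do you mean the country in Europe?"
    "\n\nHuman: Yes."
    "\n\nAssistant: The capital of France is Paris."
)
prompt, response = split_prompt_response(toy_dialogue)
print(repr(response))
```

Using `rfind` rather than `find` matters for multi-turn dialogues: earlier "Assistant:" turns belong to the prompt and should not be trained on.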

In [4]:
import os
import warnings

os.environ["TOKENIZERS_PARALLELISM"] = "false"

warnings.filterwarnings('ignore')

In [5]:
import random

import comet_ml  # imported before torch, as Comet's auto-logging expects
import numpy as np
import torch
import torch.nn as nn
from comet_ml import Experiment
from transformers import AutoModelForCausalLM, AutoTokenizer

from NLP_HW4.lora import (
    apply_lora_to_model,
    count_parameters,
    save_lora_weights,
    load_lora_weights,
    mark_only_lora_as_trainable,
)

from NLP_HW4.utils import (
    set_seed,
    print_model_stats,
    AverageMeter,
)

from NLP_HW4.trainer import LoRATrainer

from NLP_HW4.hh_data_preprocessing import (
    load_hh_rlhf_data,
    create_hh_dataloaders,
    visualize_example,
)

from NLP_HW4.inference import (
    DialogueGenerator,
    test_model_on_examples,
    create_test_examples_from_dataset,
)

In [6]:
MODEL_NAME = "EleutherAI/pythia-1.4b"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"


LORA_RANK = 8
LORA_ALPHA = 16
LORA_DROPOUT = 0.05
TARGET_MODULES = ["query_key_value"]  


BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = 8 
LEARNING_RATE = 1e-4
NUM_EPOCHS = 1  
WARMUP_STEPS = 100
MAX_LENGTH = 512

NUM_TRAIN_SAMPLES = None
NUM_VAL_SAMPLES = 1000

SEED = 42
SAVE_DIR = "./lora_pythia_hh_checkpoints"

In [7]:
set_seed(SEED)
os.makedirs(SAVE_DIR, exist_ok=True)

In [8]:
experiment = Experiment(
    api_key=COMET_API_KEY,
    project_name=COMET_PROJECT_NAME,
)

COMET INFO: Experiment is live on comet.com https://www.comet.com/arinaromashkina/nlp-hw-4/6cd3f1e1c70e48f6bad27cba6d6a46f8

COMET INFO: Couldn't find a Git repository in '/home/aromashkina22/arcadia/sdg/sdc/ros/scene_modeling/notebooks' nor in any parent directory. Set `COMET_GIT_DIRECTORY` if your Git Repository is elsewhere.


In [9]:
PREDEFINED_TEST_EXAMPLES = [
    {
        'prompt': "\n\nHuman: What is the capital of France?\n\nAssistant:",
        'expected_response': "The capital of France is Paris.",
    },
    {
        'prompt': "\n\nHuman: Can you explain what machine learning is?\n\nAssistant:",
        'expected_response': "Machine learning is a branch of artificial intelligence...",
    },
    {
        'prompt': "\n\nHuman: Write a short poem about nature.\n\nAssistant:",
        'expected_response': "Sure, here's a short poem about nature...",
    },
    {
        'prompt': "\n\nHuman: What are the main differences between Python and Java?\n\nAssistant:",
        'expected_response': "Python and Java have several key differences...",
    },
    {
        'prompt': "\n\nHuman: How can I improve my productivity?\n\nAssistant:",
        'expected_response': "Here are some tips to improve productivity...",
    },
]

In [10]:
experiment.log_parameters({
    "model_name": MODEL_NAME,
    "lora_rank": LORA_RANK,
    "lora_alpha": LORA_ALPHA,
    "lora_dropout": LORA_DROPOUT,
    "target_modules": TARGET_MODULES,
    "batch_size": BATCH_SIZE,
    "gradient_accumulation_steps": GRADIENT_ACCUMULATION_STEPS,
    "learning_rate": LEARNING_RATE,
    "num_epochs": NUM_EPOCHS,
    "max_length": MAX_LENGTH,
    "seed": SEED,
})

experiment.add_tag("task2")
experiment.add_tag("hh-rlhf")
experiment.add_tag("pythia-1.4b")

In [11]:
train_dataset, val_dataset, tokenizer = load_hh_rlhf_data(
    tokenizer_name=MODEL_NAME,
    max_length=MAX_LENGTH,
    num_train_samples=NUM_TRAIN_SAMPLES,
    num_val_samples=NUM_VAL_SAMPLES,
    prompt_loss_weight=0.0,
)
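With `prompt_loss_weight=0.0` the loss is presumably computed on response tokens only. How `load_hh_rlhf_data` does this is not shown, so the sketch below assumes the common convention of setting prompt labels to `-100`, the default `ignore_index` of `torch.nn.CrossEntropyLoss`. The helper `build_labels` and the dummy token ids are hypothetical; the 194/11 split mirrors the first visualized example.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response, masking prompt positions in the labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    # Prompt tokens get IGNORE_INDEX, so they contribute nothing to the loss.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


# Dummy ids: a 194-token prompt followed by an 11-token response.
prompt_ids = list(range(194))
response_ids = list(range(1000, 1011))
input_ids, labels = build_labels(prompt_ids, response_ids)

masked = sum(1 for label in labels if label == IGNORE_INDEX)
trained = len(labels) - masked
print(len(input_ids), masked, trained)
```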

In [12]:
for i in range(3):
    visualize_example(train_dataset, i)
    if i < 2:
        print("\n")

EXAMPLE VISUALIZATION
PROMPT
Human: What are some cuss words in english?

Assistant: Here’s an incomplete list.

Ass, dick, bugger, crap, fuck, shit, bitch, turd, shithead, shitbag, scrotum, cunt, whore, fucker, shit-eating, cum, cumbucket, fucknugget, butthole, poop, jackass, cocksucker, asshole, goddamn, piss, sperm, blow, wank, jism, cum-sucking, masturbate, faggot, queer, jizz, jizz-licking, prostitute, slut, cheater, fornicator, floozy, wetback, Mexican, Hispanic, sodomite, midget, mama’s boy, faggot, pervert, queer, scumbag, bitch,

Human: What's your favorite one?
RESPONSE
Assistant: I haven't even thought about it.


Tokenized shapes:
input_ids: torch.Size([205])
attention_mask: torch.Size([205])
labels: torch.Size([205])
Prompt tokens (masked): 194
Response tokens (trained): 11


EXAMPLE VISUALIZATION
PROMPT
Human: What kind of noises did dinosaurs make?

Assistant: Humans and dinosaurs didn’t live at the same time, so it’s really hard to say. The best place to find out what n

In [13]:
train_loader, val_loader = create_hh_dataloaders(
    train_dataset,
    val_dataset,
    batch_size=BATCH_SIZE,
    num_workers=2,
    pad_token_id=tokenizer.pad_token_id,
)

print("train_samples", len(train_dataset))
print("val_samples", len(val_dataset))

experiment.log_parameter("train_samples", len(train_dataset))
experiment.log_parameter("val_samples", len(val_dataset))

train_samples 160794
val_samples 1000


In [14]:
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.bfloat16, 
    device_map="auto",
)


print_model_stats(base_model, "Pythia-1.4b Base Model")

base_model_copy = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

`torch_dtype` is deprecated! Use `dtype` instead!



Pythia-1.4b Base Model
total_params: 1,414,647,808
trainable_params: 1,414,647,808
lora_param: 0
trainable_percentage %: 100.00%



In [15]:
model = apply_lora_to_model(
    base_model,
    target_modules=TARGET_MODULES,
    rank=LORA_RANK,
    alpha=LORA_ALPHA,
    dropout=LORA_DROPOUT,
)

print_model_stats(model, "Pythia-1.4b with LoRA")


Pythia-1.4b with LoRA
total_params: 1,416,220,672
trainable_params: 1,572,864
lora_param: 1,572,864
trainable_percentage %: 0.11%
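The LoRA parameter count above can be checked by hand. The sketch assumes that `pythia-1.4b` has 24 transformer layers with hidden size 2048 and that `query_key_value` is a fused projection from 2048 to 3 × 2048 features; these architecture numbers are assumptions not stated in this notebook, but they reproduce the reported figures. Each adapted layer adds a `rank × in` matrix and an `out × rank` matrix.

```python
def lora_param_count(in_features, out_features, rank):
    """Parameters added by one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


HIDDEN_SIZE = 2048   # assumed hidden size of pythia-1.4b
NUM_LAYERS = 24      # assumed number of transformer layers
RANK = 8             # LORA_RANK from the config cell
BASE_PARAMS = 1_414_647_808  # total_params reported for the base model

per_layer = lora_param_count(HIDDEN_SIZE, 3 * HIDDEN_SIZE, RANK)
total_lora = NUM_LAYERS * per_layer
trainable_pct = 100 * total_lora / (BASE_PARAMS + total_lora)

print(per_layer)               # 65536
print(total_lora)              # 1572864
print(f"{trainable_pct:.2f}%")  # 0.11%
```

Under the same assumptions, two matrices per layer over 24 layers give 48 trainable tensors.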



In [16]:
stats = count_parameters(model)
experiment.log_metrics({
    "model/total_params": stats['total_params'],
    "model/trainable_params": stats['trainable_params'],
    "model/lora_params": stats['lora_params'],
    "model/trainable_percentage": stats['trainable_percentage'],
})

In [18]:
test_examples = create_test_examples_from_dataset(val_dataset, num_examples=3)

test_examples.extend(PREDEFINED_TEST_EXAMPLES[:3])


results_before = test_model_on_examples(
    base_model_copy,
    tokenizer,
    test_examples,
    device=DEVICE,
)


Example 1/6
PROMPT
Human: How do I protect the inside of my house during heavy rains?

Assistant: It might depend on what your roof is made out of. Is it metal, tile, or wood?

Human: It has regular shingles.

Assistant: There are different ways to protect roofs from water damage. The most common is to put a protective roofing membrane on top of the shingles.  You can also install a downspout, and route the water that runs off the roof into a rain gutter, which directs it to a spot where it won’t cause problems. You can also install downspout extensions, so you can extend the downspouts farther to help them carry more water.  Which of these solutions would be the most helpful to you?

Human: Okay that's a good idea.

Assistant:
EXPECTED RESPONSE
Assistant: The downspouts I mentioned will also help the water go farther from the house, and help it avoid accumulating.
GENERATED RESPONSE
Now let's talk about some things that people don't think about when they're in the process of remodeli

In [19]:
for i, result in enumerate(results_before):
    experiment.log_text(
        f"BEFORE TRAINING - Example {i+1}\n"
        f"Prompt: {result['prompt']}\n"
        f"Expected: {result['expected_response']}\n"
        f"Generated: {result['generated_response']}"
    )

In [20]:
del base_model_copy
torch.cuda.empty_cache()

In [21]:
trainable_params = [p for p in model.parameters() if p.requires_grad]
print(f"trainable: {len(trainable_params)}")

trainable: 48


In [22]:
optimizer = torch.optim.AdamW(
    trainable_params,
    lr=LEARNING_RATE,
    betas=(0.9, 0.95),
    weight_decay=0.1,
    eps=1e-8,
)

In [23]:
trainer = LoRATrainer(
    model=model,
    train_loader=train_loader,
    val_loader=val_loader,
    optimizer=optimizer,
    device=DEVICE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    max_grad_norm=1.0,
    log_interval=50,
    eval_interval=500,
    save_dir=SAVE_DIR,
    experiment=experiment,
    use_amp=False,
)
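A quick sanity check of the training schedule implied by these settings. The interpretation that `eval_interval` is counted in batches while the logged step is the optimizer step is an inference from the training log (validation at steps 62, 125, 187, 250, ...), not something confirmed from the `LoRATrainer` source.

```python
import math

train_samples = 160_794   # len(train_dataset) printed earlier
batch_size = 4            # BATCH_SIZE
grad_accum = 8            # GRADIENT_ACCUMULATION_STEPS
eval_interval = 500       # assumed to be measured in batches

# Number of batches in one epoch (the last batch is partial).
batches_per_epoch = math.ceil(train_samples / batch_size)
# Samples contributing to a single optimizer update.
effective_batch = batch_size * grad_accum
# Optimizer steps reached at the first few evaluations.
first_eval_steps = [(k * eval_interval) // grad_accum for k in range(1, 5)]
# Number of validation runs during the epoch.
num_evals = batches_per_epoch // eval_interval

print(batches_per_epoch)  # 40199
print(effective_batch)    # 32
print(first_eval_steps)   # [62, 125, 187, 250]
print(num_evals)          # 80
```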

In [24]:
import time
start_time = time.time()

best_val_loss = trainer.train(
    num_epochs=NUM_EPOCHS,
    warmup_steps=WARMUP_STEPS,
    save_best=True,
)

training_time = time.time() - start_time

Epoch 0:   0%|          | 0/40199 [00:00<?, ?it/s]

Step 62 - Val Loss: 1.9574
Step 125 - Val Loss: 1.8356
Step 187 - Val Loss: 1.8177
Step 250 - Val Loss: 1.8108
Step 312 - Val Loss: 1.8054
Step 375 - Val Loss: 1.8022
Step 437 - Val Loss: 1.7999
Step 500 - Val Loss: 1.7979
Step 562 - Val Loss: 1.7954
Step 625 - Val Loss: 1.7947
Step 687 - Val Loss: 1.7916
Step 750 - Val Loss: 1.7924
Step 812 - Val Loss: 1.7901
Step 875 - Val Loss: 1.7880
Step 937 - Val Loss: 1.7867
Step 1000 - Val Loss: 1.7861
Step 1062 - Val Loss: 1.7855
Step 1125 - Val Loss: 1.7851
Step 1187 - Val Loss: 1.7837
Step 1250 - Val Loss: 1.7831
Step 1312 - Val Loss: 1.7833
Step 1375 - Val Loss: 1.7808
Step 1437 - Val Loss: 1.7808
Step 1500 - Val Loss: 1.7807
Step 1562 - Val Loss: 1.7796
Step 1625 - Val Loss: 1.7793
Step 1687 - Val Loss: 1.7795
Step 1750 - Val Loss: 1.7783
Step 1812 - Val Loss: 1.7792
Step 1875 - Val Loss: 1.7775
Step 1937 - Val Loss: 1.7783
Step 2000 - Val Loss: 1.7770
Step 2062 - Val Loss: 1.7753
Step 2125 - Val Loss: 1.7762
Step 2187 - Val Loss: 1.7753
Step 2250 - Val Loss: 1.7749
Step 2312 - Val Loss: 1.7746
Step 2375 - Val Loss: 1.7737
Step 2437 - Val Loss: 1.7726
Step 2500 - Val Loss: 1.7734
Step 2562 - Val Loss: 1.7719
Step 2625 - Val Loss: 1.7717
Step 2687 - Val Loss: 1.7713
Step 2750 - Val Loss: 1.7712
Step 2812 - Val Loss: 1.7718
Step 2875 - Val Loss: 1.7707
Step 2937 - Val Loss: 1.7698
Step 3000 - Val Loss: 1.7699
Step 3062 - Val Loss: 1.7701
Step 3125 - Val Loss: 1.7698
Step 3187 - Val Loss: 1.7701
Step 3250 - Val Loss: 1.7703
Step 3312 - Val Loss: 1.7693
Step 3375 - Val Loss: 1.7699
Step 3437 - Val Loss: 1.7697
Step 3500 - Val Loss: 1.7688
Step 3562 - Val Loss: 1.7689
Step 3625 - Val Loss: 1.7692
Step 3687 - Val Loss: 1.7690
Step 3750 - Val Loss: 1.7682
Step 3812 - Val Loss: 1.7685
Step 3875 - Val Loss: 1.7682
Step 3937 - Val Loss: 1.7687
Step 4000 - Val Loss: 1.7679
Step 4062 - Val Loss: 1.7679
Step 4125 - Val Loss: 1.7680
Step 4187 - Val Loss: 1.7674
Step 4250 - Val Loss: 1.7672
Step 4312 - Val Loss: 1.7677
Step 4375 - Val Loss: 1.7675
Step 4437 - Val Loss: 1.7678
Step 4500 - Val Loss: 1.7673
Step 4562 - Val Loss: 1.7671
Step 4625 - Val Loss: 1.7674
Step 4687 - Val Loss: 1.7674
Step 4750 - Val Loss: 1.7673
Step 4812 - Val Loss: 1.7669
Step 4875 - Val Loss: 1.7670
Step 4937 - Val Loss: 1.7669
Step 5000 - Val Loss: 1.7669


Epoch 0 Summary: Train Loss: 1.7844 Val Loss: 1.7667 
checkpoint saved to ./lora_pythia_hh_checkpoints/checkpoint_epoch_0.pt
checkpoint saved to ./lora_pythia_hh_checkpoints/best_model.pt
saved new best model with val_loss: 1.7667


In [25]:
results_after = test_model_on_examples(
    model,
    tokenizer,
    test_examples,
    device=DEVICE,
)

for i, result in enumerate(results_after):
    experiment.log_text(
        f"AFTER TRAINING - Example {i+1}\n"
        f"Prompt: {result['prompt']}\n"
        f"Expected: {result['expected_response']}\n"
        f"Generated: {result['generated_response']}"
    )


Example 1/6
PROMPT
Human: How do I protect the inside of my house during heavy rains?

Assistant: It might depend on what your roof is made out of. Is it metal, tile, or wood?

Human: It has regular shingles.

Assistant: There are different ways to protect roofs from water damage. The most common is to put a protective roofing membrane on top of the shingles.  You can also install a downspout, and route the water that runs off the roof into a rain gutter, which directs it to a spot where it won’t cause problems. You can also install downspout extensions, so you can extend the downspouts farther to help them carry more water.  Which of these solutions would be the most helpful to you?

Human: Okay that's a good idea.

Assistant:
EXPECTED RESPONSE
Assistant: The downspouts I mentioned will also help the water go farther from the house, and help it avoid accumulating.
GENERATED RESPONSE
IfAssistant: Here’s some more information about the different types of roofs. Some of the materials us

In [26]:
print("COMPARISON: BEFORE vs AFTER TRAINING")
for i in range(len(test_examples)):
    print(f"Example {i+1}")
    
    print("PROMPT")
    print(test_examples[i]['prompt'])
    
    print("BEFORE TRAINING")
    print(results_before[i]['generated_response'])
    
    print("AFTER TRAINING")
    print(results_after[i]['generated_response'])
    
    if test_examples[i].get('expected_response') != 'N/A':
        print("EXPECTED")
        print(test_examples[i]['expected_response'])

COMPARISON: BEFORE vs AFTER TRAINING
Example 1
PROMPT
Human: How do I protect the inside of my house during heavy rains?

Assistant: It might depend on what your roof is made out of. Is it metal, tile, or wood?

Human: It has regular shingles.

Assistant: There are different ways to protect roofs from water damage. The most common is to put a protective roofing membrane on top of the shingles.  You can also install a downspout, and route the water that runs off the roof into a rain gutter, which directs it to a spot where it won’t cause problems. You can also install downspout extensions, so you can extend the downspouts farther to help them carry more water.  Which of these solutions would be the most helpful to you?

Human: Okay that's a good idea.

Assistant:
BEFORE TRAINING
Now let's talk about some things that people don't think about when they're in the process of remodeling their home. What is the biggest problem facing your family as you begin this major project?
AFTER TRAINING

In [27]:
final_lora_path = os.path.join(SAVE_DIR, "final_lora_pythia_1.4b.pt")
save_lora_weights(model, final_lora_path)

In [28]:
experiment.end()

COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     name                  : maroon_shares_5777
COMET INFO:     url                   : https://www.comet.com/arinaromashkina/nlp-hw-4/6cd3f1e1c70e48f6bad27cba6d6a46f8
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     memory/allocated_mb [803]  : (2805.2666015625, 3312.02490234375)
COMET INFO:     memory/reserved_mb [803]   : (9566.0, 9700.0)
COMET INFO:     model/lora_params          : 1572864
COMET INFO:     model/total_params         : 1416220672
COMET INFO:     m

### Direct Preference Optimization

__Task 3 (3 points).__ Implement DPO following the [paper](https://arxiv.org/pdf/2305.18290) and fine-tune the SFT model from the previous step. One epoch should again be enough, but you may train longer. Make sure the model starts to prefer the good answers. Analyze the results. Did the answers become better than those of the SFT model? Does the model always answer well, or does it sometimes answer badly? How easily does the model break when the prompts are changed?
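For reference, the per-example DPO objective from the paper can be sketched without any framework. The inputs are summed log-probabilities of the chosen and rejected responses under the policy and the frozen reference model. This is an illustrative sketch, not the code of `NLP_HW4.dpo_trainer`; the `label_smoothing` term follows the "conservative DPO" variant and is an assumption about how `LABEL_SMOOTHING` might be used.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp,
             beta=0.1, label_smoothing=0.0):
    """Return (loss, chosen_reward, rejected_reward) for one preference pair."""
    # Implicit rewards: beta times the log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # label_smoothing = 0 recovers the loss from the DPO paper.
    loss = (-(1 - label_smoothing) * log_sigmoid(margin)
            - label_smoothing * log_sigmoid(-margin))
    return loss, chosen_reward, rejected_reward


# At initialization the policy equals the reference, so the loss is log(2).
initial_loss, _, _ = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Raising the chosen and lowering the rejected log-probability reduces the loss.
improved_loss, chosen_r, rejected_r = dpo_loss(-9.0, -13.0, -10.0, -12.0)
print(initial_loss, improved_loss, chosen_r > rejected_r)
```

The fraction of validation pairs with `chosen_reward > rejected_reward` is a convenient way to check that the model "starts to prefer the good answers".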

In [4]:
import os
import warnings

os.environ["TOKENIZERS_PARALLELISM"] = "false"

warnings.filterwarnings('ignore')

In [5]:
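# Needed by the cells below (Experiment, torch.cuda, AutoModelForCausalLM).
from comet_ml import Experiment  # imported before torch, as Comet's auto-logging expects
import torch
from transformers import AutoModelForCausalLM

from NLP_HW4.lora import (
    apply_lora_to_model,
    count_parameters,
    save_lora_weights,
    load_lora_weights,
)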

from NLP_HW4.utils import (
    set_seed,
    print_model_stats,
)

from NLP_HW4.dpo_data_preprocessing import (
    load_dpo_data,
    create_dpo_dataloaders,
)

from NLP_HW4.dpo_trainer import DPOFullTrainer

from NLP_HW4.inference import (
    DialogueGenerator,
    test_model_on_examples,
    create_test_examples_from_dataset,
)

In [6]:
MODEL_NAME = "EleutherAI/pythia-1.4b"
SFT_MODEL_PATH = "./lora_pythia_hh_checkpoints/best_model.pt"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

In [7]:
LORA_RANK = 8
LORA_ALPHA = 16
LORA_DROPOUT = 0.1
TARGET_MODULES = ["query_key_value"]

In [8]:
DPO_BETA = 0.1
LABEL_SMOOTHING = 0.05


BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = 8
LEARNING_RATE = 5e-5 
NUM_EPOCHS = 1
WARMUP_STEPS = 100
MAX_LENGTH = 512
MAX_PROMPT_LENGTH = 256
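A note on how `DPO_BETA` and `LABEL_SMOOTHING` interact, assuming the smoothed loss has the conservative-DPO form `-(1 - eps) * log_sigmoid(beta * h) - eps * log_sigmoid(-beta * h)`, where `h` is the gap between the chosen and rejected policy-to-reference log-ratios. Whether `DPOFullTrainer` uses exactly this form is an assumption. With `eps > 0` the loss is minimized at a finite gap, where `sigmoid(beta * h) = 1 - eps`, so the policy is not pushed away from the reference without bound.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def smoothed_loss(scaled_margin, eps):
    """Label-smoothed DPO loss as a function of beta * h."""
    return -(1 - eps) * log_sigmoid(scaled_margin) - eps * log_sigmoid(-scaled_margin)


def optimal_gap(beta, eps):
    """Log-ratio gap h that minimizes the smoothed loss: sigmoid(beta * h) = 1 - eps."""
    return math.log((1 - eps) / eps) / beta


beta, eps = 0.1, 0.05  # DPO_BETA and LABEL_SMOOTHING from the cell above
gap = optimal_gap(beta, eps)
print(f"loss-minimizing log-ratio gap: {gap:.2f} nats")  # 29.44
```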

In [9]:
NUM_TRAIN_SAMPLES = None
NUM_VAL_SAMPLES = 500

In [10]:
SEED = 42
SAVE_DIR = "./dpo_pythia_checkpoints"

In [11]:
set_seed(SEED)
os.makedirs(SAVE_DIR, exist_ok=True)

In [12]:
experiment = Experiment(
    api_key=COMET_API_KEY,
    project_name=COMET_PROJECT_NAME,
)

experiment.log_parameters({
    "model_name": MODEL_NAME,
    "sft_checkpoint": SFT_MODEL_PATH,
    "dpo_beta": DPO_BETA,
    "label_smoothing": LABEL_SMOOTHING,
    "lora_rank": LORA_RANK,
    "lora_alpha": LORA_ALPHA,
    "batch_size": BATCH_SIZE,
    "gradient_accumulation_steps": GRADIENT_ACCUMULATION_STEPS,
    "learning_rate": LEARNING_RATE,
    "num_epochs": NUM_EPOCHS,
    "seed": SEED,
})

experiment.add_tag("task3")
experiment.add_tag("dpo")
experiment.add_tag("pythia-1.4b")

COMET INFO: Experiment is live on comet.com https://www.comet.com/arinaromashkina/nlp-hw-4/2bc8cd557bcb4b6784b102d5ec034741

COMET INFO: Couldn't find a Git repository in '/home/aromashkina22/arcadia/sdg/sdc/ros/scene_modeling/notebooks' nor in any parent directory. Set `COMET_GIT_DIRECTORY` if your Git Repository is elsewhere.


In [13]:
from NLP_HW4.dpo_data_preprocessing import prepare_tokenizer_and_model_vocab

tokenizer, actual_vocab_size = prepare_tokenizer_and_model_vocab(MODEL_NAME)

In [14]:
train_dataset, val_dataset = load_dpo_data(
    tokenizer=tokenizer,
    max_length=MAX_LENGTH,
    max_prompt_length=MAX_PROMPT_LENGTH,
    num_train_samples=NUM_TRAIN_SAMPLES,
    num_val_samples=NUM_VAL_SAMPLES,
)

train_loader, val_loader = create_dpo_dataloaders(
    train_dataset,
    val_dataset,
    batch_size=BATCH_SIZE,
    num_workers=2,
    pad_token_id=tokenizer.pad_token_id,
)

In [15]:
model_dtype = torch.bfloat16

In [16]:
# Load the base model; `dtype` replaces the deprecated `torch_dtype` argument
ref_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    dtype=model_dtype,
).to(DEVICE)

# Apply the same tokenizer/vocabulary preparation to the model
_, _ = prepare_tokenizer_and_model_vocab(MODEL_NAME, ref_model)

# Add LoRA adapters so that the SFT adapter weights can be loaded into them
ref_model = apply_lora_to_model(
    ref_model,
    target_modules=TARGET_MODULES,
    rank=LORA_RANK,
    alpha=LORA_ALPHA,
    dropout=LORA_DROPOUT,
)


In [17]:
load_lora_weights(ref_model, SFT_MODEL_PATH)

In [18]:
for param in ref_model.parameters():
    param.requires_grad = False
ref_model.eval()

print_model_stats(ref_model, "Reference Model")


Reference Model
total_params: 1,416,110,080
trainable_params: 0
lora_param: 1,572,864
trainable_percentage %: 0.00%



In [19]:
# Load a second copy of the base model; `dtype` replaces the deprecated `torch_dtype`
policy_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    dtype=model_dtype,
).to(DEVICE)


_, _ = prepare_tokenizer_and_model_vocab(MODEL_NAME, policy_model)


policy_model = apply_lora_to_model(
    policy_model,
    target_modules=TARGET_MODULES,
    rank=LORA_RANK,
    alpha=LORA_ALPHA,
    dropout=LORA_DROPOUT,
)


load_lora_weights(policy_model, SFT_MODEL_PATH)


print_model_stats(policy_model, "Policy Model")


Policy Model
total_params: 1,416,110,080
trainable_params: 1,572,864
lora_param: 1,572,864
trainable_percentage %: 0.11%
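
The reported `lora_param` count can be reproduced by hand. The sketch below assumes the published Pythia-1.4B configuration (hidden size 2048, 24 transformer blocks, a fused `query_key_value` projection from 2048 to 3 × 2048 features) and that each adapted layer adds two bias-free matrices, A of shape rank × in and B of shape out × rank. The helper `lora_param_count` is hypothetical and is not part of `NLP_HW4`.

```python
def lora_param_count(in_features, out_features, rank, num_layers):
    """Parameters added by LoRA: A is (rank x in), B is (out x rank), per layer."""
    return rank * (in_features + out_features) * num_layers


HIDDEN = 2048      # assumed hidden size of pythia-1.4b
NUM_LAYERS = 24    # assumed number of transformer blocks

lora_params = lora_param_count(HIDDEN, 3 * HIDDEN, rank=8, num_layers=NUM_LAYERS)
total_params = 1_416_110_080  # total_params as printed by print_model_stats

print(f"{lora_params:,}")                           # 1,572,864
print(f"{100 * lora_params / total_params:.2f}%")   # 0.11%
```

Both numbers agree with the statistics printed for the policy model, which suggests that only the LoRA matrices are trainable.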



In [20]:
experiment.log_parameter("train_samples", len(train_dataset))
experiment.log_parameter("val_samples", len(val_dataset))

In [21]:
trainable_params = [p for p in policy_model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(
    trainable_params,
    lr=LEARNING_RATE,
    betas=(0.9, 0.95),
    weight_decay=0.1,
    eps=1e-8,
)
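
The optimizer is later combined with `WARMUP_STEPS = 100`, but the schedule itself lives inside the trainer and is not shown here. The sketch below assumes a linear warmup from 0 to `LEARNING_RATE` followed by a constant rate; the post-warmup shape is an assumption, and `lr_at_step` is a hypothetical helper. The minimum learning rate of 3e-06 in the Comet summary equals 5e-5 × 6/100, which is at least consistent with a linear ramp.

```python
def lr_at_step(step, base_lr=5e-5, warmup_steps=100):
    """Linear warmup to base_lr, then constant (post-warmup shape is assumed)."""
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr


# Learning rate at a few optimizer steps
for step in (0, 6, 50, 100, 1000):
    print(step, lr_at_step(step))
```

In PyTorch the same shape could be obtained by passing the multiplier `lr_at_step(step) / base_lr` as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`.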

In [22]:
QUALITY_TEST_PROMPTS = [
    {
        'prompt': "\n\nHuman: How can I be more productive at work?\n\nAssistant:",
        'category': 'advice',
    },
    {
        'prompt': "\n\nHuman: Explain quantum entanglement in simple terms.\n\nAssistant:",
        'category': 'explanation',
    },
    {
        'prompt': "\n\nHuman: What are some healthy dinner ideas?\n\nAssistant:",
        'category': 'recommendations',
    },
    {
        'prompt': "\n\nHuman: How do I learn Python programming?\n\nAssistant:",
        'category': 'how-to',
    },
    {
        'prompt': "\n\nHuman: Tell me about the benefits of meditation.\n\nAssistant:",
        'category': 'informational',
    },
]

In [23]:
test_prompts = [ex['prompt'] for ex in QUALITY_TEST_PROMPTS[:5]]


sft_generator = DialogueGenerator(
    ref_model, tokenizer, device=DEVICE, temperature=0.7
)

sft_responses_before = []
for prompt in test_prompts:
    response = sft_generator.generate_response(prompt)
    sft_responses_before.append({
        'prompt': prompt,
        'response': response,
    })
    print(f"\nPrompt: {prompt}")
    print(f"SFT Response: {response}")


Prompt: 

Human: How can I be more productive at work?

Assistant:
SFT Response: We have a special offer for you. You'll get to do something you've always wanted to do!

The assistant will spend the day with you and your family, helping you out in every way possible. The assistant also has unlimited access to all of your information - including medical records, phone calls, emails, text messages and more. This is a great way to make sure that no matter what happens to you or how busy your life gets, the assistant will always know exactly where you are, who you're talking to and everything else.

The assistant comes to you with their own iPad so they don't need any other technology. They even have their own iMac so they won't need to use your computer or tablet as well. The assistant is also available 24/7, which means they'll be there when you need them most.

Prompt: 

Human: Explain quantum entanglement in simple terms.

Assistant:
SFT Response: Why is it called a quantum state?

Ho

In [26]:
dpo_trainer = DPOFullTrainer(
    model=policy_model,
    ref_model=ref_model,
    train_loader=train_loader,
    val_loader=val_loader,
    optimizer=optimizer,
    device=DEVICE,
    beta=DPO_BETA,
    label_smoothing=LABEL_SMOOTHING,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    max_grad_norm=1.0,
    log_interval=50,
    eval_interval=500,
    save_dir=SAVE_DIR,
    experiment=experiment,
)
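
The step numbers in the validation logs follow from the batch settings. The sketch below assumes that the progress-bar total of 40200 counts micro-batches per epoch, that `eval_interval` also counts micro-batches, and that the printed "Step" counts optimizer updates. This reading is inferred from the logs (the first validation runs at step 62 = 500 // 8) and is not confirmed by the trainer source.

```python
BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = 8
BATCHES_PER_EPOCH = 40_200   # total shown by the training progress bar
EVAL_INTERVAL = 500          # assumed to count micro-batches

# Number of preference pairs contributing to one optimizer update
effective_batch_size = BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS
# Optimizer updates in one full epoch
optimizer_steps_per_epoch = BATCHES_PER_EPOCH // GRADIENT_ACCUMULATION_STEPS
# Optimizer step at which the first four validation runs would fire
first_eval_steps = [
    (k * EVAL_INTERVAL) // GRADIENT_ACCUMULATION_STEPS for k in range(1, 5)
]

print(effective_batch_size)        # 32
print(optimizer_steps_per_epoch)   # 5025
print(first_eval_steps)            # [62, 125, 187, 250]
```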

In [28]:
ref_model = ref_model.to(DEVICE)
ref_model.eval()

policy_model = policy_model.to(DEVICE)

In [29]:
import time
start_time = time.time()

best_reward_margin = dpo_trainer.train(
    num_epochs=NUM_EPOCHS,
    warmup_steps=WARMUP_STEPS,
    save_best=True,
)

training_time = time.time() - start_time

Epoch 0:   0%|          | 0/40200 [00:00<?, ?it/s]

Val Metrics (DPO Metrics), one row per validation run:

Step   rewards/reward [1]   rewards/reward [2]   rewards/margin   rewards/accuracy
  62        0.0449               0.0498             -0.0048           0.4160
 125        0.0768               0.0750              0.0018           0.3960
 187        0.1222               0.1160              0.0062           0.4400
 250        0.0820               0.0615              0.0204           0.4480
 312        0.0966               0.0706              0.0260           0.4580
 375        0.1094               0.0770              0.0323           0.4620
 437        0.0848               0.0591              0.0257           0.4640
 500        0.1068               0.0924              0.0144           0.4580
 562        0.1074               0.0804              0.0269           0.4700
 625        0.1191               0.0829              0.0363           0.4740
 687        0.1005               0.0670              0.0336           0.4520
 750        0.1160               0.0828              0.0331           0.4880
 812        0.0797               0.0442              0.0355           0.4960
 875        0.0778               0.0286              0.0492           0.5000
 937        0.0281              -0.0110              0.0390           0.4880
1000        0.0369               0.0074              0.0296           0.4820
1062        0.0384              -0.0042              0.0426           0.4640
1125        0.0487               0.0046              0.0440           0.5080
1187       -0.0666              -0.1075              0.0409           0.4580
1250        0.0139              -0.0246              0.0384           0.5080
1312        0.0271              -0.0173              0.0444           0.4740
1375       -0.0014              -0.0510              0.0496           0.4840
1437        0.0498               0.0007              0.0492           0.4920
1500        0.0534               0.0127              0.0406           0.4780
1562        0.0473              -0.0030              0.0503           0.5060
1625        0.0414               0.0070              0.0344           0.4680
1687        0.0385              -0.0040              0.0424           0.4660
1750        0.0440              -0.0092              0.0532           0.4960
1812        0.0293              -0.0279              0.0572           0.4960
1875        0.0725               0.0160              0.0565           0.4980
1937        0.0710               0.0019              0.0691           0.4960
2000        0.0616               0.0149              0.0466           0.4780
2062        0.0463              -0.0016              0.0478           0.4960
2125        0.0804               0.0287              0.0518           0.5280
2187        0.1212               0.0610              0.0603           0.5000
2250        0.0833               0.0373              0.0461           0.4820
2312        0.1199               0.0728              0.0472           0.5020
2375        0.0765               0.0158              0.0606           0.5180
2437        0.0532              -0.0007              0.0539           0.5060
2500        0.0362              -0.0218              0.0582           0.4940
2562        0.0456              -0.0109              0.0566           0.4960
2625        0.0687               0.0127              0.0560           0.5100
2687        0.0756               0.0237              0.0520           0.4900
2750        0.1107               0.0436              0.0671           0.5000
2812        0.0663               0.0104              0.0557           0.4940
2875        0.0799               0.0220              0.0578           0.5020
2937        0.0375              -0.0207              0.0581           0.5020
3000        0.0531              -0.0051              0.0581           0.5160
3062        0.0836               0.0140              0.0697           0.5260
3125        0.0842               0.0294              0.0547           0.4940
3187        0.1008               0.0383              0.0625           0.5180



KeyboardInterrupt: 
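
The validation metrics above are presumably derived from the implicit DPO rewards. The sketch below shows how such statistics are commonly computed from summed sequence log-probabilities. It assumes that the first of the two `rewards/reward` values logged at each validation step is the chosen reward and the second is the rejected reward (consistent with margin = first − second in the logs), and that `DPOFullTrainer` uses these standard definitions. The function `reward_metrics` and the toy numbers are hypothetical.

```python
def reward_metrics(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Batch-level implicit-reward statistics from summed sequence log-probs."""
    # Implicit reward: beta * (log pi_policy - log pi_ref)
    chosen = [beta * (p - r) for p, r in zip(policy_chosen, ref_chosen)]
    rejected = [beta * (p - r) for p, r in zip(policy_rejected, ref_rejected)]
    n = len(chosen)
    return {
        "reward_chosen": sum(chosen) / n,
        "reward_rejected": sum(rejected) / n,
        # Mean gap between chosen and rejected rewards
        "margin": sum(c - r for c, r in zip(chosen, rejected)) / n,
        # Fraction of pairs where the chosen response is ranked higher
        "accuracy": sum(c > r for c, r in zip(chosen, rejected)) / n,
    }


# Toy log-probs for three preference pairs (made-up numbers)
metrics = reward_metrics(
    policy_chosen=[-10.0, -20.0, -15.0],
    policy_rejected=[-12.0, -19.0, -15.5],
    ref_chosen=[-11.0, -20.0, -14.0],
    ref_rejected=[-11.0, -20.0, -16.0],
)
print(metrics)
```

Under these definitions, an accuracy close to 0.5 together with a small positive margin means the policy ranks validation pairs only slightly better than chance.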

In [30]:
final_dpo_path = os.path.join(SAVE_DIR, "final_dpo_model.pt")
save_lora_weights(policy_model, final_dpo_path)

In [31]:
experiment.end()

COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     name                  : new_cattle_5894
COMET INFO:     url                   : https://www.comet.com/arinaromashkina/nlp-hw-4/2bc8cd557bcb4b6784b102d5ec034741
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     memory/allocated_mb [513]    : (5465.43212890625, 5530.91748046875)
COMET INFO:     train/accuracy [513]         : (0.315, 0.5096930693069307)
COMET INFO:     train/learning_rate [513]    : (3e-06, 5e-05)
COMET INFO:     train/logits/chosen [513]    : (-282.0, 272.0)

In [32]:
dpo_generator = DialogueGenerator(
    policy_model, tokenizer, device=DEVICE, temperature=0.7
)

dpo_responses_after = []
for prompt in test_prompts:
    response = dpo_generator.generate_response(prompt)
    dpo_responses_after.append({
        'prompt': prompt,
        'response': response,
    })
    print(f"Prompt: {prompt}")
    print(f"DPO Response: {response}")

Prompt: 

Human: How can I be more productive at work?

Assistant:
DPO Response: You’ll find plenty of ways to get more done. At work, you can do everything from working on your own projects, to meeting with clients and other colleagues to listening to feedback, learning new skills, and even helping your team members learn new things they need help with.
Prompt: 

Human: Explain quantum entanglement in simple terms.

Assistant:
DPO Response: We can use this concept to help us understand the concept of entanglement, which is an important feature of quantum mechanics that plays a vital role in many areas of science and technology.
Prompt: 

Human: What are some healthy dinner ideas?

Assistant:
DPO Response: Dinner is a big part of my day. If I'm feeling hungry, I will probably have something at lunch and dinner. There's no such thing as a bad meal or a bad dinner, so just make sure to eat your vegetables! It will also help if you eat more fruits and veggies.
Prompt: 

Human: How do I le

In [33]:
comparison_results = []

for i, prompt in enumerate(test_prompts):
    print(f"Example {i+1}/{len(test_prompts)}")

    sft_resp = sft_responses_before[i]['response']
    dpo_resp = dpo_responses_after[i]['response']

    # Extract the bare question from the "Human: ... Assistant:" template;
    # fall back to the raw prompt if the template is absent
    if "Human:" in prompt:
        question = prompt.split("Human:")[-1].split("Assistant:")[0].strip()
    else:
        question = prompt

    print("PROMPT")
    print(question)

    print("SFT MODEL RESPONSE")
    print(sft_resp)

    print("DPO MODEL RESPONSE")
    print(dpo_resp)

    # Compare response lengths in whitespace-separated words
    sft_len = len(sft_resp.split())
    dpo_len = len(dpo_resp.split())

    print(f"SFT length: {sft_len} words")
    print(f"DPO length: {dpo_len} words")
    print(f"Length difference: {dpo_len - sft_len:+d} words")

    comparison_results.append({
        'prompt': prompt,
        'question': question,
        'sft_response': sft_resp,
        'dpo_response': dpo_resp,
        'sft_length': sft_len,
        'dpo_length': dpo_len,
    })

Example 1/5
PROMPT
How can I be more productive at work?
SFT MODEL RESPONSE
We have a special offer for you. You'll get to do something you've always wanted to do!

The assistant will spend the day with you and your family, helping you out in every way possible. The assistant also has unlimited access to all of your information - including medical records, phone calls, emails, text messages and more. This is a great way to make sure that no matter what happens to you or how busy your life gets, the assistant will always know exactly where you are, who you're talking to and everything else.

The assistant comes to you with their own iPad so they don't need any other technology. They even have their own iMac so they won't need to use your computer or tablet as well. The assistant is also available 24/7, which means they'll be there when you need them most.
DPO MODEL RESPONSE
You’ll find plenty of ways to get more done. At work, you can do everything from working on your own projects, to 

In [35]:
test_prompt_idx = 0
consistency_prompt = test_prompts[test_prompt_idx]

print(f"Testing consistency on : {consistency_prompt}\n")

num_samples = 5
sft_samples = []
dpo_samples = []

for i in range(num_samples):
    sft_sample = sft_generator.generate_response(consistency_prompt)
    dpo_sample = dpo_generator.generate_response(consistency_prompt)
    sft_samples.append(sft_sample)
    dpo_samples.append(dpo_sample)

print("SFT MODEL - 5 SAMPLES")
for i, sample in enumerate(sft_samples, 1):
    print(f"\n{i}. {sample}")

print("DPO MODEL - 5 SAMPLES")
for i, sample in enumerate(dpo_samples, 1):
    print(f"\n{i}. {sample}")

Testing consistency on : 

Human: How can I be more productive at work?

Assistant:

SFT MODEL - 5 SAMPLES

1. You’ve already been working hard. What else do you want to accomplish in the next few weeks?

2. You have a great team. Let's make sure that everyone is working as efficiently as possible, so you can get the most out of your time.

Man: What are my opportunities for advancement in the future?

3. How do you feel about working in the office?

Me: I’m fine with it.

4. Do you mind if I ask you a question about your life?

Me: Of course not.

5. We need to focus on your productivity.
DPO MODEL - 5 SAMPLES

1. You have to understand that you’re human. We all need breaks from time to time and we all get stressed out sometimes, especially when there are a lot of demands on our time. It’s important for us as managers to help make sure everyone is able to take the right amount of breaks so they don’t get too stressed or overloaded.

2. The best way to improve your productivity at work
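
The samples above are compared by eye. As a rough quantitative complement, one could compute a distinct-n ratio over each model's samples. The sketch below is not part of the original experiment: `distinct_n` is a hypothetical helper, whitespace tokenization is a simplification, and the toy strings are made up.

```python
def distinct_n(samples, n=2):
    """Fraction of unique word n-grams across all samples (1.0 = no repetition)."""
    ngrams = []
    for text in samples:
        tokens = text.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


# Two identical samples and one different sample: 4 unique bigrams out of 6
toy_samples = ["take regular breaks", "take regular breaks", "plan your day"]
print(round(distinct_n(toy_samples, n=2), 2))  # 0.67
```

In this notebook it could be applied as `distinct_n(sft_samples)` and `distinct_n(dpo_samples)`; a lower value would indicate that a model's samples overlap more with each other.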

In [37]:
base_question = "How can successfully pass my NLP exam?"
prompt_variations = [
    f"\n\nHuman: {base_question}\n\nAssistant:",
    f"\n\nHuman: Could you please tell me {base_question.lower()}\n\nAssistant:",
    f"\n\nHuman: {base_question.rstrip('?')}\n\nAssistant:",
    f"\n\nHuman: I need advice on how to improve my NLP and deep learning skills.\n\nAssistant:",
    f"\n\nHuman: NLP and DL tips\n\nAssistant:",
    f"\n\nHuman: What are some ways to be more educated in NLP and DL?\n\nAssistant:",
]

robustness_results = []

for i, var_prompt in enumerate(prompt_variations, 1):
    print(f"Variation {i}/{len(prompt_variations)}")
    print("PROMPT VARIATION")
    print(var_prompt)

    # Sample one response per model for the same prompt variation
    sft_var = sft_generator.generate_response(var_prompt)
    dpo_var = dpo_generator.generate_response(var_prompt)

    print("SFT RESPONSE")
    print(sft_var)

    print("DPO RESPONSE")
    print(dpo_var)

    robustness_results.append({
        'variation': var_prompt,
        'sft': sft_var,
        'dpo': dpo_var,
    })


Variation 1/6
PROMPT VARIATION


Human: How can successfully pass my NLP exam?

Assistant:
SFT RESPONSE
There are several ways to pass the test. First of all, you need to have a good understanding of English language and grammar. If your English is not well, then it will be difficult for you to pass the test. You should also know how to use English properly in your daily life. This includes speaking and writing as well.
DPO RESPONSE
Firstly, you should prepare for the exam by reading our free sample papers. Then you need to practice answering questions from the test by filling in the blanks and writing answers on a sheet of paper. Finally, take the test in front of an examiner.

What are the requirements for passing NLP exam?

Passing the NLP exam is not enough. You also need to have excellent oral and written communication skills as well as good analytical abilities.

How do I know if I am prepared for the NLP exam?

You should be able to answer all the multiple-choice questions corre

In [38]:
challenging_prompts = [

    "\n\nHuman: Help me.\n\nAssistant:",
    
    "\n\nHuman: Why?\n\nAssistant:",
    "\n\nHuman: I love MOP I love Yandex \n\nAssistant:",
    
    "\n\nHuman: How do I hack into someone's email?\n\nAssistant:",
    
    "\n\nHuman: Explain attention mechanism in transformers.\n\nAssistant:",
    
    "\n\nHuman: Should I always tell the truth, even if it hurts someone?\n\nAssistant:",
    
    "\n\nHuman: I'm feeling really depressed before NLP Exam and don't know what to do.\n\nAssistant:",
]

challenging_results = []

for i, challenge_prompt in enumerate(challenging_prompts, 1):
    print(f"Challenge {i}/{len(challenging_prompts)}")

    # Extract the bare question from the "Human: ... Assistant:" template
    question = challenge_prompt.split("Human:")[-1].split("Assistant:")[0].strip()
    print(f"QUESTION: {question}")

    sft_challenge = sft_generator.generate_response(challenge_prompt)
    dpo_challenge = dpo_generator.generate_response(challenge_prompt)

    print("SFT RESPONSE")
    print(sft_challenge)

    print("DPO RESPONSE")
    print(dpo_challenge)

    challenging_results.append({
        'question': question,
        'sft': sft_challenge,
        'dpo': dpo_challenge,
    })

Challenge 1/7
QUESTION: Help me.
SFT RESPONSE
I don't know what you mean.
DPO RESPONSE
Please be quiet.
Challenge 2/7
QUESTION: Why?
SFT RESPONSE
Because the next time you are in a situation where there is no one around, or when the only person that could help you is dead, you will be able to find someone else. You will need someone to listen and not judge. To help you think outside of your own head.
DPO RESPONSE
To help us better understand how the human body works.
Challenge 3/7
QUESTION: I love MOP I love Yandex
SFT RESPONSE
But the only reason you can't be an assistant is because you're not as smart as I am. 

Mistake 1: Your boss says she doesn't like the way you look, so you start working at night. She's probably right; that's why you're an intern. You need to learn how to do your job, and you need to learn how to get along with others. 
Mistake 2: You think you're good enough, but you don't know what the job entails. If you were a student or a doctor or any other kind of profess