## **Imports and Setup**

Let's install and import the required dependencies.

In [1]:
%pip install transformers trl

Collecting trl
  Obtaining dependency information for trl from https://files.pythonhosted.org/packages/0d/44/c406c3cf5981bddb16ff72acb5ca235888db4073d868cf51bd143bef3aad/trl-0.7.4-py3-none-any.whl.metadata
  Downloading trl-0.7.4-py3-none-any.whl.metadata (10 kB)
Collecting tyro>=0.5.11 (from trl)
  Obtaining dependency information for tyro>=0.5.11 from https://files.pythonhosted.org/packages/c5/11/abdf67467d06713b431618732a43f82d1b1f02120107b05a789afbcdf54d/tyro-0.6.0-py3-none-any.whl.metadata
  Downloading tyro-0.6.0-py3-none-any.whl.metadata (7.5 kB)
Collecting shtab>=1.5.6 (from tyro>=0.5.11->trl)
  Obtaining dependency information for shtab>=1.5.6 from https://files.pythonhosted.org/packages/40/ad/7227da64498eaa7abecee4311008f70869e156014b3270cec36e2e70cd31/shtab-1.6.5-py3-none-any.whl.metadata
  Downloading shtab-1.6.5-py3-none-any.whl.metadata (7.3 kB)
Downloading trl-0.7.4-py3-none-any.whl (133 kB)

In [2]:
import torch
from tqdm import tqdm
import pandas as pd
import numpy as np
import random
import seaborn as sns
import matplotlib.pyplot as plt

from dataclasses import dataclass, field
from typing import Dict, Optional

tqdm.pandas()

from transformers import (
    pipeline,
    AutoTokenizer,
    TrainingArguments,
    AutoModelForCausalLM
)

import datasets

from torch.utils.data import (
    Dataset,
    DataLoader,
    RandomSampler,
    random_split
)

from trl import (
    AutoModelForCausalLMWithValueHead,
    DPOTrainer,
    create_reference_model
)

from trl.core import LengthSampler

from typing import List, Dict
from scipy.stats import entropy
from collections import defaultdict

import wandb

import pickle

import gc



## **Configuration and Seed**

Let's set the parameters for generation, evaluation, and so on. As the SFT model we take GPT-2 fine-tuned on the IMDb dataset: https://huggingface.co/lvwerra/gpt2-imdb. We also fix the seed so that the results are reproducible.

In [3]:
# name of the SFT model
model_name = 'lvwerra/gpt2-imdb'

# name of the reward model
reward_model_name = 'lvwerra/distilbert-imdb'

# evaluation (reward scoring) parameters
sentiment_pipe_kwargs = {
    'top_k': None,                # return scores for all labels
    'function_to_apply': 'none',  # return raw logits, no softmax
    'batch_size': 16
}

# generation parameters
generation_pipe_kwargs = {
    'num_return_sequences': 1,
    'min_length': 32,
    'max_length': 64,
    'top_k': 0.0,
    'top_p': 1.0,
    'do_sample': True,
}
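
In `generation_pipe_kwargs`, `top_k: 0.0` and `top_p: 1.0` switch off top-k and nucleus filtering, so with `do_sample: True` tokens are drawn from the model's full distribution. The sketch below is a simplified, pure-Python illustration of what the two filters do to a next-token distribution. `filter_probs` is a hypothetical helper written for this note, not the transformers implementation, and the probabilities are invented.

```python
def filter_probs(probs, top_k=0, top_p=1.0):
    """Keep the top-k / nucleus tokens, zero the rest, and renormalise."""
    # token indices sorted from most to least probable
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]          # top_k == 0 means "no top-k limit"
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:        # smallest set with mass >= top_p
            break
    total = sum(probs[i] for i in kept)
    return [p / total if i in kept else 0.0 for i, p in enumerate(probs)]


probs = [0.5, 0.3, 0.2]                    # made-up next-token distribution
unfiltered = filter_probs(probs)           # defaults: nothing is removed
top2 = filter_probs(probs, top_k=2)        # drops the least likely token
nucleus = filter_probs(probs, top_p=0.7)   # keeps tokens until mass >= 0.7
```

With the defaults the distribution is returned unchanged, which is the setting used throughout this notebook.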

In [4]:
def seed_all(seed: int) -> None:
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    random.seed(seed)

In [5]:
seed_all(42)

In [6]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [7]:
wandb.init()


## **Level 1**

First, let's learn to generate text with this model. To do so, we create the model, the tokenizer, and a text-generation pipeline.

In [8]:
# create the model and the tokenizer
gpt2_model = AutoModelForCausalLM.from_pretrained(model_name)
gpt2_tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side='left')

# set pad_token and pad_token_id
gpt2_tokenizer.pad_token = gpt2_tokenizer.eos_token
generation_pipe_kwargs['pad_token_id'] = gpt2_tokenizer.eos_token_id

In [9]:
# create the generation pipeline
text_generation = pipeline('text-generation', model=model_name, device=device, tokenizer=gpt2_tokenizer, **generation_pipe_kwargs)

prefix_text = '<|startoftext|>'
postfix_text = '<|endoftext|>'
generated_text = text_generation(prefix_text)[0]
generated_text['generated_text'][len(prefix_text):]

'just to be sure.<br /><br />It would have been fun to have seen all this with the help of the Aussie Turner Classic Movies for a mere rating (mine is 2), but alas *I* miss Mr Bates\' "bunny parade" feature "Tie'

Great, we managed to generate text. On to the next step: generate N texts with the SFT model and compute a reward for each one
using https://huggingface.co/lvwerra/distilbert-imdb. The logits of this binary
classifier can serve as the reward value: the higher the POSITIVE logit,
the more positive the text.
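
To make the reward definition concrete, here is a minimal sketch that uses a hand-written stand-in for the classifier output. The scores are invented, and the nested list-of-dicts shape is an assumption about what the pipeline returns for a single string when all labels are requested. The sketch reads off the POSITIVE logit and, for comparison, converts both logits into a probability with softmax.

```python
import math

# Hypothetical classifier output for one text: one inner list per input,
# one dict per class. The scores are made-up raw logits.
mock_outputs = [[
    {"label": "NEGATIVE", "score": -1.5},
    {"label": "POSITIVE", "score": 2.0},
]]


def positive_logit(outputs):
    """Return the POSITIVE logit of the first (and only) input."""
    for element in outputs[0]:
        if element["label"] == "POSITIVE":
            return element["score"]
    raise ValueError("no POSITIVE label in classifier output")


def positive_probability(outputs):
    """Softmax over the two logits, returned for the POSITIVE class."""
    logits = {e["label"]: e["score"] for e in outputs[0]}
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - top) for k, v in logits.items()}
    return exps["POSITIVE"] / sum(exps.values())


reward = positive_logit(mock_outputs)
prob = positive_probability(mock_outputs)
```

The raw logit is unbounded, so it keeps separating texts that a probability would squash close to 0 or 1.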

In [10]:
sentiment_pipe = pipeline('sentiment-analysis', model=reward_model_name, device=device, **sentiment_pipe_kwargs)

In [11]:
def extract_sentpipe_output(outputs):
    positive_logit = 0
    for out in outputs:
        for element in out:
            if element["label"] == "POSITIVE":
                positive_logit = element["score"]
    return positive_logit

In [12]:
def example_generator(N):
    example_lst = []
    for i in range(N):
        res = {}
        generated_text = text_generation(prefix_text)[0]['generated_text'][len(prefix_text):]
        labels = sentiment_pipe(generated_text)
        res['generated_text'] = generated_text
        res['reward'] = extract_sentpipe_output(labels)
        example_lst.append(res)
    return example_lst

We take N = 5 and display the generated examples with their scores as a DataFrame.

In [13]:
N = 5
df = pd.DataFrame(example_generator(N))
df

Unnamed: 0,generated_text,reward
0,presumably tellingThings.dd. Ty Cobb had Kim P...,-1.217334
1,<br /><br />The movie is stellar; it's quite a...,2.677276
2,=><br />0:27 <|beginpoint of text| startoflin...,-0.759771
3,"<p>The addition of "" Also Based on Annis Berg...",-2.091571
4,"And satisfy yourself, eat takeout.<br /><br />...",0.515937


**Let's build a dataset of winner-loser pairs.**

We set the parameters and write a function that generates a prompt of a given length.

In [14]:
# parameters for generating responses to a prompt
generation_pipe_kwargs_chosen_reject = {
    'num_return_sequences': 15,
    'min_length': 64,
    'max_length': 96,
    'top_k': 0.0,
    'top_p': 1.0,
    'do_sample': True,
    'pad_token_id': gpt2_tokenizer.eos_token_id
}

In [15]:
def dataset_builder(config, input_min_text_length=32, input_max_text_length=64):
    prompt_chosen_rejected_dict = {'prompt': [], 'chosen': [], 'rejected': []}
    # note: the sampled length is applied below as a number of characters, not tokens
    length_sampler = LengthSampler(input_min_text_length, input_max_text_length)
    prompt_generation = pipeline('text-generation', model=model_name, device=device, tokenizer=gpt2_tokenizer, **generation_pipe_kwargs)
    # `config` holds the generation kwargs for the candidate responses
    answer_generation = pipeline('text-generation', model=model_name, device=device, tokenizer=gpt2_tokenizer, **config)

    for _ in tqdm(range(1000)):
        gen_len = length_sampler()
        # sample a prompt from the SFT model and cut it to gen_len characters after the prefix
        prompt = prompt_generation(prefix_text)[0]['generated_text'][: (len(prefix_text) + gen_len)]
        # sample candidate continuations and keep gen_len characters of each
        texts_generation = answer_generation(prompt)
        texts = [elem['generated_text'][len(prompt): (len(prompt) + gen_len)] for elem in texts_generation]
        # score each distinct continuation with the reward model
        rewards = {text: extract_sentpipe_output(sentiment_pipe(text)) for text in texts}
        # rank continuations from highest to lowest reward
        ranked = sorted(rewards, key=rewards.get, reverse=True)
        # the 5 best continuations become `chosen`, the 5 worst become `rejected`
        prompt_chosen_rejected_dict['prompt'].extend([prompt] * 5)
        prompt_chosen_rejected_dict['chosen'].extend(ranked[: 5])
        prompt_chosen_rejected_dict['rejected'].extend(ranked[-5: ])
    return prompt_chosen_rejected_dict
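
The pairing step can be looked at in isolation. Because the scored continuations are stored in a dict keyed by text, duplicates collapse, so a prompt may end up with fewer than 15 distinct candidates; with fewer than 10, the top 5 and bottom 5 would overlap. The sketch below shows the same ranking logic on invented data. `make_pairs` and its guard against overlap are hypothetical additions for illustration, not part of the notebook's pipeline.

```python
def make_pairs(prompt, scored_texts, k=5):
    """Pair the k highest-reward texts with the k lowest-reward texts.

    scored_texts maps each distinct continuation to its reward.
    Raises if there are fewer than 2*k distinct texts, because the
    chosen and rejected groups would then overlap.
    """
    if len(scored_texts) < 2 * k:
        raise ValueError("need at least 2*k distinct continuations")
    # texts ordered from highest to lowest reward
    ranked = sorted(scored_texts, key=scored_texts.get, reverse=True)
    return [
        {"prompt": prompt, "chosen": c, "rejected": r}
        for c, r in zip(ranked[:k], ranked[-k:])
    ]


# Made-up continuations and rewards, for illustration only.
scores = {"great film": 2.1, "it was fine": 0.3, "boring": -1.2, "awful": -2.5}
pairs = make_pairs("The movie", scores, k=2)
```

Note the pairing order: the best text is matched with the best of the rejected group, not with the worst overall.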

In [16]:
prompt_chosen_rejected_dict = dataset_builder(generation_pipe_kwargs_chosen_reject)

100%|██████████| 1000/1000 [25:54<00:00,  1.55s/it]


Let's check that the resulting lists contain no empty strings.

In [17]:
assert all(map(lambda x: x != '', prompt_chosen_rejected_dict['prompt']))
assert all(map(lambda x: x != '', prompt_chosen_rejected_dict['chosen']))
assert all(map(lambda x: x != '', prompt_chosen_rejected_dict['rejected']))

Let's look at some examples:

In [18]:
for i in range(5):
    print(prompt_chosen_rejected_dict['prompt'][i], '|', prompt_chosen_rejected_dict['chosen'][i], '|', prompt_chosen_rejected_dict['rejected'][i])

<|startoftext|>the longest movie I have seen by much  | irl. Excellent movie, no mean feat ple | ______________________________________
<|startoftext|>the longest movie I have seen by much  | !! <br /><br />Chris Bachman!!!!!<br / | _________. Well, that's saying somethi
<|startoftext|>the longest movie I have seen by much  | !!! Rainbow Girl and Underworld 3 were | !!<br /><br />no plot changes but just
<|startoftext|>the longest movie I have seen by much  | !!<br /><br />I would rate the film lo | !!<br /><br />Do not buy this movie, d
<|startoftext|>the longest movie I have seen by much  | !!! my opinion is that anything that i | imec utterly fails to represent the gi


Let's save the resulting dataset to disk:

In [19]:
with open("prompt_chosen_rejected_dict.pkl", "wb") as file:
    pickle.dump(prompt_chosen_rejected_dict, file)

Let's set up the training parameters, starting with the dataset size.

In [20]:
size = len(prompt_chosen_rejected_dict['rejected'])
size

5000

Let's split the dataset into train, evaluation, and test sets.

In [21]:
train_split = {
    'prompt': prompt_chosen_rejected_dict['prompt'][: int(size*0.8)],
    'chosen': prompt_chosen_rejected_dict['chosen'][: int(size*0.8)],
    'rejected': prompt_chosen_rejected_dict['rejected'][: int(size*0.8)]
}

eval_split = {
    'prompt': prompt_chosen_rejected_dict['prompt'][int(size*0.8): int(size*0.95)],
    'chosen': prompt_chosen_rejected_dict['chosen'][int(size*0.8): int(size*0.95)],
    'rejected': prompt_chosen_rejected_dict['rejected'][int(size*0.8): int(size*0.95)]
}

test_split = {
    'prompt': prompt_chosen_rejected_dict['prompt'][int(size*0.95): ],
    'chosen': prompt_chosen_rejected_dict['chosen'][int(size*0.95): ],
    'rejected': prompt_chosen_rejected_dict['rejected'][int(size*0.95): ]
}

In [22]:
train_dataset = datasets.Dataset.from_dict(train_split)
eval_dataset = datasets.Dataset.from_dict(eval_split)
test_dataset = datasets.Dataset.from_dict(test_split)

We get the following split sizes:

In [23]:
len(train_split['rejected']), len(eval_split['rejected']), len(test_split['rejected'])

(4000, 750, 250)
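
Each prompt contributes 5 consecutive rows, so a split boundary that is a multiple of 5 keeps all pairs of one prompt inside a single split. The boundaries used here (4000 and 4750) already satisfy that. For other dataset sizes, a small helper can snap the boundaries to a prompt boundary; `split_bounds` below is a hypothetical sketch, using integer percentages to avoid floating-point rounding.

```python
def split_bounds(n_rows, rows_per_prompt, train_pct=80, eval_pct=15):
    """Return (train_end, eval_end) row indices snapped down to a prompt boundary."""
    def snap(index):
        return index // rows_per_prompt * rows_per_prompt

    train_end = snap(n_rows * train_pct // 100)
    eval_end = snap(n_rows * (train_pct + eval_pct) // 100)
    return train_end, eval_end


train_end, eval_end = split_bounds(5000, 5)
sizes = (train_end, eval_end - train_end, 5000 - eval_end)
```

For the 5000-row dataset this reproduces the sizes shown above.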

In [24]:
train_kwargs = {
    'model_name': 'lvwerra/gpt2-imdb',
    'report_to': 'wandb',
    'learning_rate': 1e-3,
    'per_device_train_batch_size': 16,
    'max_length': 512,
    'max_steps': 1000,
    'gradient_accumulation_steps': 1,
    'beta': 0.1,
    'max_target_length': 128,
    'max_prompt_length': 128
}

In [25]:
hinge_training_args = TrainingArguments(
        per_device_train_batch_size=train_kwargs['per_device_train_batch_size'],
        max_steps=train_kwargs['max_steps'],
        remove_unused_columns=False,
        gradient_accumulation_steps=train_kwargs['gradient_accumulation_steps'],
        learning_rate=train_kwargs['learning_rate'],
        evaluation_strategy='steps',
        logging_first_step=True,
        logging_steps=10,
        eval_steps=500,
        output_dir='./test1',
        warmup_steps=100,
        report_to=train_kwargs['report_to'],
        gradient_checkpointing=False,
    )

Let's create the reference model, which we pass to DPOTrainer as the `ref_model` argument.

In [26]:
gpt2_ref_model = create_reference_model(gpt2_model)

In [27]:
hinge_dpo_trainer = DPOTrainer(
        gpt2_model,
        gpt2_ref_model,
        args=hinge_training_args,
        beta=train_kwargs['beta'],
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=gpt2_tokenizer,
        max_length=train_kwargs['max_length'],
        max_target_length=train_kwargs['max_target_length'],
        max_prompt_length=train_kwargs['max_prompt_length'],
        generate_during_eval=True,
        loss_type='hinge'
    )

In [28]:
hinge_dpo_trainer.train()

Could not estimate the number of tokens of the input, floating-point operations will not be computed


| Step | Training Loss | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|
| 500 | 0.2611 | 0.681446 | -9.743262 | -13.45987 | 0.803191 | 3.716607 | -200.227936 | -160.909714 | -8.631719 | -8.566963 |
| 1000 | 0.008 | 0.666728 | -14.06731 | -19.820368 | 0.839096 | 5.753059 | -263.832947 | -204.150177 | -16.603994 | -16.472866 |


TrainOutput(global_step=1000, training_loss=0.26716122599318626, metrics={'train_runtime': 662.6885, 'train_samples_per_second': 24.144, 'train_steps_per_second': 1.509, 'total_flos': 0.0, 'train_loss': 0.26716122599318626, 'epoch': 4.0})
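
The reward columns in the training log are not scores from the sentiment classifier. As far as we understand TRL's DPO implementation, they are implicit rewards: `beta` times the difference between the policy's and the reference model's log-probability of a response. `Rewards/margins` is chosen minus rejected, and `Rewards/accuracies` is the fraction of pairs where the chosen reward is higher. The sketch below recomputes these quantities from invented log-probabilities; `implicit_rewards` is a hypothetical helper.

```python
def implicit_rewards(policy_logps, ref_logps, beta=0.1):
    """DPO implicit reward per sequence: beta * (log pi(y|x) - log pi_ref(y|x))."""
    return [beta * (p - r) for p, r in zip(policy_logps, ref_logps)]


# Made-up sequence log-probabilities for two preference pairs.
chosen = implicit_rewards([-160.0, -150.0], [-100.0, -100.0])
rejected = implicit_rewards([-200.0, -140.0], [-100.0, -100.0])

# margin per pair, and the share of pairs ranked correctly
margins = [c - r for c, r in zip(chosen, rejected)]
accuracy = sum(c > r for c, r in zip(chosen, rejected)) / len(chosen)
```

Under this reading, both reward columns can be negative while accuracy stays above 0.8: the policy lowers the log-probability of both responses relative to the reference, but lowers the rejected one more.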

In [29]:
hinge_dpo_trainer.save_model()

Let's evaluate the reward before and after training.

In [30]:
bs = len(test_dataset) // 10
hinge_game_data = dict()
test_dataset.set_format('pandas')
df_hinge_batch = test_dataset[:].sample(bs)
hinge_game_data['prompt'] = df_hinge_batch['prompt'].tolist()
query_tensors = df_hinge_batch['prompt'].map(gpt2_tokenizer).tolist()
output_min_length = 32
output_max_length = 64
output_length_sampler = LengthSampler(output_min_length, output_max_length)
response_tensors_ref, response_tensors = [], []

for i in tqdm(range(bs)):
    gen_len = output_length_sampler()
    output = gpt2_ref_model.generate(
        torch.tensor(query_tensors[i]['input_ids']).unsqueeze(dim=0).to(device), **generation_pipe_kwargs).squeeze()[-gen_len:]
    response_tensors_ref.append(output)
    output = gpt2_model.generate(
        torch.tensor(query_tensors[i]['input_ids']).unsqueeze(dim=0).to(device), **generation_pipe_kwargs).squeeze()[-gen_len:]
    response_tensors.append(output)

hinge_game_data['response (before)'] = [gpt2_tokenizer.decode(response_tensors_ref[i]) for i in range(bs)]
hinge_game_data['response (after)'] = [gpt2_tokenizer.decode(response_tensors[i]) for i in range(bs)]

hinge_texts = [q + r for q, r in zip(hinge_game_data['prompt'], hinge_game_data['response (before)'])]
hinge_game_data['rewards (before)'] = [output[0]['score'] if output[0]['label'] == 'POSITIVE' else output[1]['score'] for output in sentiment_pipe(hinge_texts,  **sentiment_pipe_kwargs)]

hinge_texts = [q + r for q, r in zip(hinge_game_data['prompt'], hinge_game_data['response (after)'])]
hinge_game_data['rewards (after)'] = [output[0]['score'] if output[0]['label'] == 'POSITIVE' else output[1]['score'] for output in sentiment_pipe(hinge_texts,  **sentiment_pipe_kwargs)]

df_hinge_results = pd.DataFrame(hinge_game_data)
df_hinge_results.head()

100%|██████████| 25/25 [00:23<00:00,  1.04it/s]


Unnamed: 0,prompt,response (before),response (after),rewards (before),rewards (after)
0,<|startoftext|> partying over mountain suds he...,'s golden year). no Christmas night <br /><br ...,Hall-I We%> TV<>ian :>@ most Gl coversWE=>www...,-2.771453,1.294239
1,<|startoftext|>can't wait to see what happens ...,/><br />I've always loved the humor--especial...,"/ap>// Once>http:AL> Feast,/and| 7 a <fi LA fa...",1.788933,1.475759
2,"<|startoftext|>, say, Phil and Houry're not my fr",can't really take the position that these are...,comedy comedy 3 (www; Sweetagap/You Do (iv31 ...,-0.468038,-0.415085
3,<|startoftext|>4 K&K<br /><br />Battlegate and...,<|startoftext|>4 K&K<br /><br />Battlegate and...,|startoftext|>4 K&K<br /><br />Battlegate and ...,0.636828,-0.383679
4,<|startoftext|> 23rd scm cast <br /><br />goth...,thacha >)<br /><br /> <br /><br />DVD conducte...,"ini-"")23fiI/:ap> 10 I--- ALL andI on myo grand...",-0.4637,-0.072108


In [31]:
print('mean:')
display(df_hinge_results[['rewards (before)', 'rewards (after)']].mean())
print()
print('median:')
display(df_hinge_results[['rewards (before)', 'rewards (after)']].median())

mean:


rewards (before)   -0.721547
rewards (after)     0.461635
dtype: float64


median:


rewards (before)   -1.032124
rewards (after)     0.180650
dtype: float64

**As the table and the summary statistics show, the mean reward on this 25-prompt sample rose noticeably, from -0.72 to 0.46 (median: from -1.03 to 0.18).**

In [32]:
def token_entropy(generations, tokenizer):
    stats = defaultdict(int)
    num_tokens = 0
    for example in generations:
        tokens = tokenizer.encode(example)
        for t in tokens:
            if t == tokenizer.pad_token_id:
                continue
            stats[t] += 1
            num_tokens += 1
    for k in stats.keys():
        stats[k] /= num_tokens

    return entropy(list(stats.values()))
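
The function above pools all tokens from all texts into one frequency distribution and takes its Shannon entropy, so it measures the diversity of the vocabulary used, not the model's per-step uncertainty. `scipy.stats.entropy` uses the natural logarithm by default, so the values are in nats. The sketch below reproduces the computation with the standard library only. It assumes the texts are already split into tokens (in place of the GPT-2 tokenizer), and `frequency_entropy` is a hypothetical name.

```python
import math
from collections import Counter


def frequency_entropy(sequences):
    """Shannon entropy (in nats) of the pooled token frequency distribution."""
    counts = Counter(tok for seq in sequences for tok in seq)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# four equally frequent tokens: the maximum entropy for a 4-token vocabulary
uniform = frequency_entropy([["a", "b"], ["c", "d"]])

# one token dominates, so the entropy is lower
repetitive = frequency_entropy([["a", "a"], ["a", "b"]])
```

A model that collapses onto a small set of tokens pushes this number down, which is the effect the comparison below looks for.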

In [33]:
def create_test_generations(model, tokenizer):
    generator = pipeline('text-generation', model=model, device=device, tokenizer=tokenizer, **generation_pipe_kwargs)
    generated_reviews = generator(test_dataset['prompt'].to_list(), **generation_pipe_kwargs)
    generated_texts = []
    for batch_elem in tqdm(generated_reviews):
        for x in batch_elem:
            generated_texts.append(x['generated_text'])
            break
    return generated_texts

In [34]:
hinge_token_entropy_before = token_entropy(test_dataset['chosen'], gpt2_tokenizer)
hinge_token_entropy_after = token_entropy(create_test_generations(gpt2_model, gpt2_tokenizer), gpt2_tokenizer)

100%|██████████| 250/250 [00:00<00:00, 256626.53it/s]


In [35]:
hinge_token_entropy_df = pd.DataFrame({'token_entropy_before': [hinge_token_entropy_before], 'token_entropy_after': [hinge_token_entropy_after]})
hinge_token_entropy_df

|   | token_entropy_before | token_entropy_after |
|---|---|---|
| 0 | 6.371303 | 5.938285 |


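**The computed entropy decreased, as expected (from 6.37 to 5.94), since the model shifted its predictions toward positive responses.** Note that the "before" value is measured on the `chosen` texts of the test split, while the "after" value is measured on fresh generations from the trained model.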

Now let's switch the loss function to sigmoid and train the model.
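
The two loss types differ only in how they penalise the same quantity: the policy's chosen-vs-rejected log-ratio minus the reference model's log-ratio, scaled by `beta`. To the best of our understanding of TRL's `DPOTrainer`, `sigmoid` is the negative log-sigmoid of that value and `hinge` is `relu(1 - value)`. The sketch below computes both for a single pair; `dpo_loss` is a hypothetical stand-alone function, not the library code.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp,
             beta=0.1, loss_type="sigmoid"):
    """Per-pair DPO loss for the 'sigmoid' and 'hinge' variants."""
    pi_logratio = policy_chosen_logp - policy_rejected_logp
    ref_logratio = ref_chosen_logp - ref_rejected_logp
    z = beta * (pi_logratio - ref_logratio)
    if loss_type == "sigmoid":
        # -log(sigmoid(z)), written in a numerically stable form
        return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
    if loss_type == "hinge":
        return max(0.0, 1.0 - z)
    raise ValueError(f"unknown loss_type: {loss_type}")


# policy identical to the reference: z = 0
sigmoid_at_start = dpo_loss(-50.0, -50.0, -50.0, -50.0, loss_type="sigmoid")
hinge_at_start = dpo_loss(-50.0, -50.0, -50.0, -50.0, loss_type="hinge")

# policy prefers the chosen response by 20 nats more than the reference: z = 2
sigmoid_later = dpo_loss(-40.0, -60.0, -50.0, -50.0, loss_type="sigmoid")
hinge_later = dpo_loss(-40.0, -60.0, -50.0, -50.0, loss_type="hinge")
```

The hinge loss is exactly zero once the scaled margin reaches 1, so such pairs stop contributing gradient, whereas the sigmoid loss stays positive and keeps pushing the margin up.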

In [36]:
new_gpt2_model = AutoModelForCausalLM.from_pretrained(model_name)
new_gpt2_ref_model = AutoModelForCausalLM.from_pretrained(model_name)
new_gpt2_tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side='left', return_tensors='pt')

new_gpt2_tokenizer.pad_token = new_gpt2_tokenizer.eos_token

In [37]:
sigmoid_training_args = TrainingArguments(
        per_device_train_batch_size=train_kwargs['per_device_train_batch_size'],
        dataloader_num_workers=8,
        max_steps=train_kwargs['max_steps'],
        remove_unused_columns=False,
        gradient_accumulation_steps=train_kwargs['gradient_accumulation_steps'],
        learning_rate=train_kwargs['learning_rate'],
        evaluation_strategy='steps',
        logging_first_step=True,
        logging_steps=500,
        eval_steps=500,
        per_device_eval_batch_size=16,
        output_dir='./test2',
        optim="rmsprop",
        warmup_steps=100,
        report_to=train_kwargs['report_to'],
        save_steps=train_kwargs['max_steps'],
        gradient_checkpointing=False,
    )

In [38]:
sigmoid_dpo_trainer = DPOTrainer(
        new_gpt2_model,
        new_gpt2_ref_model,
        args=sigmoid_training_args,
        beta=train_kwargs['beta'],
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=new_gpt2_tokenizer,
        max_length=train_kwargs['max_length'],
        max_target_length=train_kwargs['max_target_length'],
        max_prompt_length=train_kwargs['max_prompt_length'],
        generate_during_eval=False,
        loss_type='sigmoid'
    )

In [39]:
del gpt2_model, gpt2_ref_model, gpt2_tokenizer
torch.cuda.empty_cache()
gc.collect()

93

In [40]:
sigmoid_dpo_trainer.train()

Could not estimate the number of tokens of the input, floating-point operations will not be computed


| Step | Training Loss | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|
| 500 | 0.3822 | 0.82019 | -16.149477 | -22.664883 | 0.8144 | 6.51541 | -292.269928 | -224.94725 | -24.842876 | -24.644238 |
| 1000 | 0.0066 | 0.973736 | -24.376251 | -34.189003 | 0.833397 | 9.812756 | -407.511139 | -307.214996 | -22.33761 | -22.07892 |


TrainOutput(global_step=1000, training_loss=0.19472437715530397, metrics={'train_runtime': 644.534, 'train_samples_per_second': 24.824, 'train_steps_per_second': 1.552, 'total_flos': 0.0, 'train_loss': 0.19472437715530397, 'epoch': 4.0})

In [41]:
sigmoid_dpo_trainer.save_model()

In [46]:
bs = len(test_dataset) // 10
sigmoid_game_data = dict()
test_dataset.set_format('pandas')
df_sigmoid_batch = test_dataset[:].sample(bs)
sigmoid_game_data['prompt'] = df_sigmoid_batch['prompt'].tolist()
query_tensors = df_sigmoid_batch['prompt'].map(new_gpt2_tokenizer).tolist()
output_min_length = 32
output_max_length = 64
output_length_sampler = LengthSampler(output_min_length, output_max_length)
response_tensors_ref, response_tensors = [], []

for i in tqdm(range(bs)):
    gen_len = output_length_sampler()
    output = new_gpt2_ref_model.generate(
        torch.tensor(query_tensors[i]['input_ids']).unsqueeze(dim=0).to(device), **generation_pipe_kwargs).squeeze()[-gen_len:]
    response_tensors_ref.append(output)
    output = new_gpt2_model.generate(
        torch.tensor(query_tensors[i]['input_ids']).unsqueeze(dim=0).to(device), **generation_pipe_kwargs).squeeze()[-gen_len:]
    response_tensors.append(output)

sigmoid_game_data['response (before)'] = [new_gpt2_tokenizer.decode(response_tensors_ref[i]) for i in range(bs)]
sigmoid_game_data['response (after)'] = [new_gpt2_tokenizer.decode(response_tensors[i]) for i in range(bs)]

sigmoid_texts = [q + r for q, r in zip(sigmoid_game_data['prompt'], sigmoid_game_data['response (before)'])]
sigmoid_game_data['rewards (before)'] = [output[0]['score'] if output[0]['label'] == 'POSITIVE' else output[1]['score'] for output in sentiment_pipe(sigmoid_texts,  **sentiment_pipe_kwargs)]

sigmoid_texts = [q + r for q, r in zip(sigmoid_game_data['prompt'], sigmoid_game_data['response (after)'])]
sigmoid_game_data['rewards (after)'] = [output[0]['score'] if output[0]['label'] == 'POSITIVE' else output[1]['score'] for output in sentiment_pipe(sigmoid_texts,  **sentiment_pipe_kwargs)]

df_sigmoid_results = pd.DataFrame(sigmoid_game_data)
df_sigmoid_results.head()

100%|██████████| 25/25 [00:24<00:00,  1.01it/s]


Unnamed: 0,prompt,response (before),response (after),rewards (before),rewards (after)
0,<|startoftext|>You are not!!!<br /><br />I wat...,animation since it is that she produced and i...,", my me Fwww!! smart EdIYou love://fi/&://wwwY...",2.375863,0.840934
1,<|startoftext|> the show with the window set t...,"|> the show with the window set to ""film"" or ""...","|> the show with the window set to ""film"" or ""...",-0.688315,0.87834
2,<|startoftext|>http://eng.vezda.com/schutz/cel...,"payed 10$ for this one Polanski, wasn't one o...",/it myRAé8Br://You Good& my mylyULAllfi Ed- it...,-1.076934,0.608403
3,"<|startoftext|><br /><br />Mickey, not to be f...","br />Mickey, not to be forgotten, decided to l...","br />Mickey, not to be forgotten, decided to l...",-0.189207,1.147862
4,<|startoftext|> partying over mountain suds he...,suds he takes tasteries to a small middle con...,"suds he takes til/"")ionag:// his day my- Good...",0.312062,1.402104


In [47]:
print('mean:')
display(df_sigmoid_results[['rewards (before)', 'rewards (after)']].mean())
print()
print('median:')
display(df_sigmoid_results[['rewards (before)', 'rewards (after)']].median())

mean:


rewards (before)   -0.373826
rewards (after)     1.069654
dtype: float64


median:


rewards (before)   -0.688315
rewards (after)     0.976978
dtype: float64

In [48]:
sigmoid_token_entropy_before = token_entropy(test_dataset['chosen'], new_gpt2_tokenizer)
sigmoid_token_entropy_after = token_entropy(create_test_generations(new_gpt2_model, new_gpt2_tokenizer), new_gpt2_tokenizer)

100%|██████████| 250/250 [00:00<00:00, 250376.31it/s]


In [51]:
sigmoid_token_entropy_df = pd.DataFrame({'token_entropy_before': [sigmoid_token_entropy_before], 'token_entropy_after': [sigmoid_token_entropy_after]})
sigmoid_token_entropy_df

|   | token_entropy_before | token_entropy_after |
|---|---|---|
| 0 | 6.371303 | 5.287679 |


### **Conclusion:**
   In the first part of the experiments we collected a winner-loser dataset of 5000 pairs and then trained models on it with DPOTrainer. The results are as follows:
1. Training with hinge loss:
    * the mean reward rose from -0.721547 to 0.461635, and the median from -1.032124 to 0.180650
    * the entropy dropped from 6.371303 to 5.938285
2. Training with sigmoid loss:
    * the mean reward rose from -0.373826 to 1.069654, and the median from -0.688315 to 0.976978
    * the entropy dropped from 6.371303 to 5.287679

These results reflect that the model learned to shift its responses toward the positive class, which is exactly what we were aiming for. Judging by the metrics, the model trained better with the sigmoid loss, so it is the better choice when fine-tuning a model with DPOTrainer out of the box. Could these numbers be improved further? Possibly, for example by increasing the dataset size and the number of training epochs, but that is hard to do on free resources (Colab, Kaggle), where compute is limited.