# Large Language Models Summer School 2023: Exercise 6

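## Table of contents
1. [Overview of RLHF and the libraries used to implement it](#1.-Overview-of-RLHF-and-the-libraries-used-to-implement-it)
2. [Dataset format](#2.-Dataset-format)
3. [Training the reward model](#3.-Training-the-reward-model)
4. [Reinforcement learning with PPO](#4.-Reinforcement-learning-with-PPO)
5. [References](#5.-References)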


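## 1. Overview of RLHF and the libraries used to implement it

### Overview of RLHF
Reinforcement learning from human feedback (RLHF) fine-tunes an AI (language) model with reinforcement learning, using human feedback as the training signal, so that the model's behavior aligns with human values. The same technique is used in ChatGPT. The training procedure is outlined below.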


RLHF training procedure ([source](https://arxiv.org/abs/2203.02155)):
1. Fine-tune a pretrained model
2. Train a reward model from human rankings
3. Run reinforcement learning with PPO using the trained reward model

![Diagram of the RLHF training steps](https://cdn.openai.com/instruction-following/draft-20220126f/methods.svg)

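### About trlX
trlX (Transformer Reinforcement Learning X) is a distributed training framework for fine-tuning large language models (LLMs) with reinforcement learning, using either a function that computes rewards or a labeled dataset (e.g., HH-RLHF).
It can fine-tune causal and T5-based language models of up to 20 billion parameters, such as `facebook/opt-6.7b` and `EleutherAI/gpt-neox-20b`.
The following reinforcement learning algorithms are currently implemented: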

- PPO (Proximal Policy Optimization)
- ILQL (Implicit Language Q-Learning)

## 2. Dataset format

```json
{
    "prompt": "The quick brown fox...",
    "answer1": "jumps over the lazy dog.",
    "answer2": "bags few lynx."
}
```

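A labeler is shown a prompt and indicates which of the candidate answers is preferable. The reward model is trained on these rankings from human labelers.

In this example, the dataset is a list of dictionaries, each with three keys: `prompt`, `answer1` (good), and `answer2` (bad).  
(The labeler ranked `answer1` > `answer2`.)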
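
As a minimal sketch of this format, the snippet below writes a tiny comparison file and checks that every record has the three expected keys. The file name `sample_comparisons.json`, the helper `validate_records`, and the second record are made up for illustration; they are not the contents of the `input_data.json` used in this notebook.

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"prompt", "answer1", "answer2"}

# Hypothetical records; answer1 is the preferred (good) completion.
records = [
    {
        "prompt": "The quick brown fox...",
        "answer1": "jumps over the lazy dog.",
        "answer2": "bags few lynx.",
    },
    {
        "prompt": "Summarize: the meeting was moved to Friday.",
        "answer1": "The meeting is now on Friday.",
        "answer2": "Meetings are important.",
    },
]


def validate_records(rows):
    """Return the indices of rows that lack any of the required keys."""
    return [i for i, row in enumerate(rows) if not REQUIRED_KEYS.issubset(row)]


# Write the list of dictionaries as JSON, then read it back.
path = os.path.join(tempfile.mkdtemp(), "sample_comparisons.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

print(len(loaded), "records; invalid rows:", validate_records(loaded))
```

Checking the keys up front is cheap and avoids a `KeyError` later, when the pairs are built from `answer1` and `answer2`.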

In [1]:
import json
import codecs
from pprint import pprint
from datasets import load_dataset

# Path to the saved sample data
data_path = 'input_data.json'

with codecs.open(data_path, 'r', encoding='utf-8') as f:
    data = json.load(f)
pprint(data[0])

# Alternative: use an existing dataset from the Hugging Face Hub
pprint("-" * 50)
dataset = load_dataset("CarperAI/openai_summarize_comparisons", split=['train[:100]', 'test[:100]'])
pprint(dataset[0][0])

{'answer1': 'Let the spotlight shine on something big, something that matters. '
            "If you haven't picked up on this year's stocks market (which will "
            'likely be over for a few months), then you may be missing',
 'answer2': 'Today the U.S. stock market rose for the 10th consecutive year '
            'and for the ninth consecutive year to trade at their highest '
            'level since January 2004.\n'
            '\n'
            '"These were big gains',
 'prompt': 'What is the latest news on the stock market?'}
'--------------------------------------------------'
{'chosen': 'TL;DR:  Snooped, found something, should I admit what I found so '
           'we can have a more honest conversation about it with less denial '
           'on her part?',
 'prompt': 'SUBREDDIT: r/relationships\n'
           'TITLE: To admit or not to admit snooping...\n'
           'POST: I [25M] have snooped in the past and copped up to it to my '
           'gf [25F] of 6 years.  We t

## 3. Training the reward model

### 3.1 Defining the reward model

In [2]:
import torch
from torch import nn
from transformers import AutoModelForCausalLM, AutoTokenizer

# Reward model: a causal LM backbone with a scalar value head
class GPTRewardModel(nn.Module):
    def __init__(self, model_path):
        super().__init__()
        # Load the pretrained GPT model
        model = AutoModelForCausalLM.from_pretrained(model_path)
        self.config = model.config
        # `gpt-neo(x)` models name this attribute `hidden_size`, so mirror it into `n_embd`
        self.config.n_embd = self.config.hidden_size if hasattr(self.config, "hidden_size") else self.config.n_embd
        self.transformer = model.transformer
        # Linear layer that maps each hidden state to a scalar reward
        self.v_head = nn.Linear(self.config.n_embd, 1, bias=False)
        # Initialize the tokenizer (used in this class only to look up the padding token ID)
        self.tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
        self.tokenizer.pad_token = self.tokenizer.eos_token
        self.PAD_ID = self.tokenizer(self.tokenizer.pad_token)["input_ids"][0]

    def forward(
        self,
        input_ids=None,
        past_key_values=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        mc_token_ids=None,
        labels=None,
        return_dict=False,
        output_attentions=False,
        output_hidden_states=False,
    ):
        loss = None
        transformer_outputs = self.transformer(
            input_ids,
            past_key_values=past_key_values,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
        )
        
        hidden_states = transformer_outputs[0]

        rewards = self.v_head(hidden_states).squeeze(-1)
        chosen_end_scores = []
        rejected_end_scores = []

        # Split the inputs and rewards into the "chosen" half and the "rejected" half
        assert len(input_ids.shape) == 2
        bs = input_ids.shape[0] // 2
        chosen = input_ids[:bs]
        rejected = input_ids[bs:]
        chosen_rewards = rewards[:bs]
        rejected_rewards = rewards[bs:]

        loss = 0
        inference = False
        for i in range(bs):
            # Identical sequences: treat the pair as inference and score only `chosen`
            if torch.all(torch.eq(chosen[i], rejected[i])).item():
                c_inds = (chosen[i] == self.PAD_ID).nonzero()
                c_ind = c_inds[0].item() if len(c_inds) > 0 else chosen.shape[1]
                chosen_end_scores.append(chosen_rewards[i, c_ind - 1])
                inference = True
                continue

            # Find where padding starts in each sequence (or use the full length if there is none)
            c_inds = (chosen[i] == self.PAD_ID).nonzero()
            c_ind = c_inds[0].item() if len(c_inds) > 0 else chosen.shape[1]
            r_inds = (rejected[i] == self.PAD_ID).nonzero()
            r_ind = r_inds[0].item() if len(r_inds) > 0 else rejected.shape[1]
            end_ind = max(c_ind, r_ind)

            # First index at which the chosen and rejected sequences differ
            divergence_ind = (chosen[i] != rejected[i]).nonzero()[0]
            assert divergence_ind > 0

            # Slice both reward sequences from the divergence point to the end of the longer sequence
            c_truncated_reward = chosen_rewards[i][divergence_ind:end_ind]
            r_truncated_reward = rejected_rewards[i][divergence_ind:end_ind]

            # Keep the reward at the last position of each slice as the end score
            chosen_end_scores.append(c_truncated_reward[-1])
            rejected_end_scores.append(r_truncated_reward[-1])

            # Pairwise ranking loss, averaged over the sliced positions
            loss += -torch.log(torch.sigmoid(c_truncated_reward - r_truncated_reward)).mean()
        loss = loss / bs

        if not inference:
            chosen_end_scores = torch.stack(chosen_end_scores)
            rejected_end_scores = torch.stack(rejected_end_scores)

        if inference:
            chosen_end_scores = torch.stack(chosen_end_scores)
            return {"chosen_end_scores": chosen_end_scores}

        return {
            "loss": loss,
            "chosen_end_scores": chosen_end_scores,
            "rejected_end_scores": rejected_end_scores,
        }
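
The per-pair loss in `forward` can be restated without tensors. The following is a simplified sketch, not the class above: the function names and the toy token IDs and rewards are made up, and it assumes (as in this notebook) that padding reuses token ID 50256.

```python
import math

PAD_ID = 50256  # assumption: GPT-2's eos token ID, reused as padding in this notebook


def first_pad_index(ids, pad_id=PAD_ID):
    """Index of the first padding token, or len(ids) if there is none."""
    for i, tok in enumerate(ids):
        if tok == pad_id:
            return i
    return len(ids)


def pairwise_loss(chosen_ids, rejected_ids, chosen_rewards, rejected_rewards):
    """Return (loss, chosen_end_score, rejected_end_score) for one pair.

    Assumes the two sequences differ somewhere; identical pairs are the
    inference case in `forward` and are not handled here.
    """
    end = max(first_pad_index(chosen_ids), first_pad_index(rejected_ids))
    start = next(i for i, (c, r) in enumerate(zip(chosen_ids, rejected_ids)) if c != r)
    # -log(sigmoid(x)) == log(1 + exp(-x)), averaged over positions start..end-1
    per_token = [
        math.log1p(math.exp(-(c - r)))
        for c, r in zip(chosen_rewards[start:end], rejected_rewards[start:end])
    ]
    return sum(per_token) / len(per_token), chosen_rewards[end - 1], rejected_rewards[end - 1]


# Toy pair: shared prefix [1, 2], then the sequences diverge.
chosen_ids = [1, 2, 3, 4, PAD_ID, PAD_ID]
rejected_ids = [1, 2, 5, PAD_ID, PAD_ID, PAD_ID]
chosen_rewards = [0.1, 0.2, 0.5, 0.9, 0.0, 0.0]
rejected_rewards = [0.1, 0.2, -0.3, -0.1, 0.0, 0.0]

loss, chosen_end, rejected_end = pairwise_loss(
    chosen_ids, rejected_ids, chosen_rewards, rejected_rewards
)
print(round(loss, 4), chosen_end, rejected_end)
```

Note that the loss is averaged over every position from the divergence point onward, not only the final token. The end scores are the values that the accuracy metric compares.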

### 3.2 Creating the dataset

In [3]:
import os

import torch
from datasets import load_dataset
from torch.utils.data import Dataset
from tqdm import tqdm
from transformers import AutoTokenizer, Trainer, TrainingArguments

# Load the chosen/rejected text pairs from a JSON file
# (`json` and `codecs` are imported in the first cell)
def create_comparison_dataset_ls(path: str):
    with codecs.open(path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    pairs = []
    for sample in data:
        # answer1 is the preferred completion, answer2 the rejected one
        pair = {
            'chosen': sample['answer1'],
            'rejected': sample['answer2']
        }
        pairs.append(pair)
    return pairs

# Tokenize the text pairs and wrap them in a Dataset
class PairwiseDataset(Dataset):
    def __init__(self, pairs, tokenizer, max_length):
        self.chosen_input_ids = []
        self.chosen_attn_masks = []
        self.rejected_input_ids = []
        self.rejected_attn_masks = []
        for pair in tqdm(pairs):
            chosen, rejected = pair["chosen"], pair["rejected"]
            chosen_encodings_dict = tokenizer(
                "<|startoftext|>" + chosen + "<|endoftext|>",
                truncation=True,
                max_length=max_length,
                padding="max_length",
                return_tensors="pt",
            )
            rejected_encodings_dict = tokenizer(
                "<|startoftext|>" + rejected + "<|endoftext|>",
                truncation=True,
                max_length=max_length,
                padding="max_length",
                return_tensors="pt",
            )
            self.chosen_input_ids.append(chosen_encodings_dict["input_ids"])
            self.chosen_attn_masks.append(chosen_encodings_dict["attention_mask"])
            self.rejected_input_ids.append(rejected_encodings_dict["input_ids"])
            self.rejected_attn_masks.append(rejected_encodings_dict["attention_mask"])

    def __len__(self):
        return len(self.chosen_input_ids)

    def __getitem__(self, idx):
        return (
            self.chosen_input_ids[idx],
            self.chosen_attn_masks[idx],
            self.rejected_input_ids[idx],
            self.rejected_attn_masks[idx],
        )


# Collate examples into a batch that the reward model can consume:
# the chosen sequences are concatenated first, followed by the rejected ones
class DataCollatorReward:
    def __call__(self, data):
        batch = {}
        batch["input_ids"] = torch.cat([f[0] for f in data] + [f[2] for f in data])
        batch["attention_mask"] = torch.cat([f[1] for f in data] + [f[3] for f in data])
        batch["labels"] = torch.tensor([0] * len(data) + [1] * len(data))
        return batch


# Accuracy metric:
# the fraction of pairs in which the chosen text scores higher than the rejected one
def compute_metrics(eval_preds):
    chosen_end_scores = eval_preds.predictions[0]  # chosen scores
    rejected_end_scores = eval_preds.predictions[1]  # rejected scores

    result = {}
    acc = sum(chosen_end_scores > rejected_end_scores) / len(rejected_end_scores)
    result["accuracy"] = acc

    return result




### 3.3 Checking the dataset and the reward model

In [4]:
# Load the dataset pairs
pairs = create_comparison_dataset_ls(data_path)
pprint(pairs[0])

{'chosen': 'Let the spotlight shine on something big, something that matters. '
           "If you haven't picked up on this year's stocks market (which will "
           'likely be over for a few months), then you may be missing',
 'rejected': 'Today the U.S. stock market rose for the 10th consecutive year '
             'and for the ninth consecutive year to trade at their highest '
             'level since January 2004.\n'
             '\n'
             '"These were big gains'}


In [5]:
# Check the dataset
# Each item is returned as the following tuple
# (
#     chosen_input_id,
#     chosen_attn_mask,
#     rejected_input_id,
#     rejected_attn_mask,
# )
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
dataset = PairwiseDataset(pairs, tokenizer, max_length=550)
dataset[0]

100%|██████████| 32/32 [00:00<00:00, 885.30it/s]


(tensor([[   27,    91,  9688,  1659,  5239,    91,    29,  5756,   262, 17838,
          18340,   319,  1223,  1263,    11,  1223,   326,  6067,    13,  1002,
            345,  4398,   470,  6497,   510,   319,   428,   614,   338, 14420,
           1910,   357,  4758,   481,  1884,   307,   625,   329,   257,  1178,
           1933,   828,   788,   345,   743,   307,  4814, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50

In [6]:
# Collate examples into a batch that the reward model can consume
data_collator = DataCollatorReward()
data = data_collator([dataset[i] for i in range(5)])
data

{'input_ids': tensor([[   27,    91,  9688,  ..., 50256, 50256, 50256],
         [   27,    91,  9688,  ..., 50256, 50256, 50256],
         [   27,    91,  9688,  ..., 50256, 50256, 50256],
         ...,
         [   27,    91,  9688,  ..., 50256, 50256, 50256],
         [   27,    91,  9688,  ..., 50256, 50256, 50256],
         [   27,    91,  9688,  ..., 50256, 50256, 50256]]),
 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]]),
 'labels': tensor([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])}
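
The collator places all chosen sequences in the first half of the batch and all rejected sequences in the second half, which is the layout `GPTRewardModel.forward` relies on when it splits the batch at `bs`. Below is a list-based sketch of that round trip; `collate`, `split_batch`, and the toy token IDs are illustrative stand-ins, not the classes defined above.

```python
def collate(examples):
    """examples: list of (chosen_ids, rejected_ids) tuples of token-ID lists."""
    chosen = [c for c, _ in examples]
    rejected = [r for _, r in examples]
    return {
        # chosen rows first, rejected rows second
        "input_ids": chosen + rejected,
        "labels": [0] * len(examples) + [1] * len(examples),
    }


def split_batch(input_ids):
    """Undo the concatenation the same way `forward` does."""
    bs = len(input_ids) // 2
    return input_ids[:bs], input_ids[bs:]


examples = [([1, 2, 3], [1, 2, 9]), ([4, 5, 6], [4, 7, 8])]
batch = collate(examples)
chosen, rejected = split_batch(batch["input_ids"])

print(batch["labels"])  # [0, 0, 1, 1]
print(chosen)           # [[1, 2, 3], [4, 5, 6]]
print(rejected)         # [[1, 2, 9], [4, 7, 8]]
```

Row `i` of the chosen half and row `i` of the rejected half belong to the same pair. The `labels` entry only marks which half a row came from; the `forward` method defined earlier accepts it but does not read it.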

In [7]:
# Feed the batch to the reward model
# A score is computed for each text
model = GPTRewardModel("gpt2")
model(**data)

{'loss': tensor(1.0065, grad_fn=<DivBackward0>),
 'chosen_end_scores': tensor([-5.5683, -7.3404, -7.2044, -8.6393, -7.2811], grad_fn=<StackBackward0>),
 'rejected_end_scores': tensor([-7.1448, -6.5397, -7.1639, -7.1666, -7.0192], grad_fn=<StackBackward0>)}
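
Applying the accuracy definition from `compute_metrics` to the scores printed above shows where the model starts before any training. This is plain-Python arithmetic on the printed values, so treat it as an illustration rather than a re-run of the model.

```python
# End scores copied from the output above
chosen_end_scores = [-5.5683, -7.3404, -7.2044, -8.6393, -7.2811]
rejected_end_scores = [-7.1448, -6.5397, -7.1639, -7.1666, -7.0192]

# A pair counts as correct when the chosen text scores higher
wins = [c > r for c, r in zip(chosen_end_scores, rejected_end_scores)]
accuracy = sum(wins) / len(wins)

print(wins)      # [True, False, False, False, False]
print(accuracy)  # 0.2
```

Only the first pair is ranked correctly, which is not surprising at this point because `v_head` is a freshly initialized linear layer.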

### 3.4 Setting hyperparameters and training the reward model

In [8]:
# Initialize the GPT-2 tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Create the checkpoint directory if it does not exist yet
os.makedirs("rm_checkpoint", exist_ok=True)

# Initialize the reward model
model = GPTRewardModel("gpt2")

# Freeze roughly the first 70% of the transformer layers;
# only the last int(0.3 * num_layers) layers stay trainable
layers = model.transformer.h
num_layers = len(layers)
num_unfrozen = int(0.3 * num_layers)
for layer in layers[:-num_unfrozen]:
    layer.requires_grad_(False)

# Load the chosen/rejected pairs
pairs = create_comparison_dataset_ls(data_path)
# Use the first 80% for training and the remaining 20% for validation
train_size = int(0.8 * len(pairs))
train_pairs = pairs[0:train_size]
val_pairs = pairs[train_size:]

# Build the training and validation datasets
max_length = 550
train_dataset = PairwiseDataset(train_pairs, tokenizer, max_length=max_length)
val_dataset = PairwiseDataset(val_pairs, tokenizer, max_length=max_length)

data_collator = DataCollatorReward()

100%|██████████| 25/25 [00:00<00:00, 912.11it/s]
100%|██████████| 7/7 [00:00<00:00, 935.84it/s]
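
To see which layers the freezing loop in the cell above actually touches, the slicing can be replayed on plain layer indices. The helper `split_frozen` is a made-up name for this sketch; the 12-layer case corresponds to the `gpt2` checkpoint.

```python
def split_frozen(num_layers, unfrozen_fraction=0.3):
    """Mirror `layers[:-num_unfrozen]` on a list of layer indices."""
    num_unfrozen = int(unfrozen_fraction * num_layers)
    layers = list(range(num_layers))
    frozen = layers[:-num_unfrozen]
    trainable = layers[len(frozen):]
    return frozen, trainable


frozen, trainable = split_frozen(12)
print(frozen)     # [0, 1, 2, 3, 4, 5, 6, 7, 8]
print(trainable)  # [9, 10, 11]

# Edge case: with very few layers, int(0.3 * n) is 0 and `layers[:-0]` is empty
print(split_frozen(3))  # ([], [0, 1, 2])
```

With 12 layers, nine are frozen and the last three are trained. The edge case is worth keeping in mind when reusing the loop with a smaller model: when `num_unfrozen` rounds down to 0, the slice selects nothing and every layer stays trainable.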


In [9]:
training_args = TrainingArguments(
    output_dir="rm_checkpoint/",
    num_train_epochs=50,
    logging_steps=10,
    gradient_accumulation_steps=4,
    save_strategy="steps",
    evaluation_strategy="steps",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    eval_accumulation_steps=1,
    eval_steps=10,
    save_steps=10,
    warmup_steps=100,
    logging_dir="./logs",
    fp16=False,
    bf16=True,
    learning_rate=1e-4,
    save_total_limit=1
)

Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    compute_metrics=compute_metrics,
    eval_dataset=val_dataset,
    data_collator=data_collator,
).train()

[34m[1mwandb[0m: Currently logged in as: [33mseele[0m. Use [1m`wandb login --relogin`[0m to force relogin


Step,Training Loss,Validation Loss,Accuracy
10,0.8852,0.812972,0.571429
20,0.8585,0.792872,0.714286
30,0.7473,0.776134,0.714286
40,0.6374,0.782413,0.714286
50,0.5406,0.827006,0.428571


TrainOutput(global_step=50, training_loss=0.7337837409973145, metrics={'train_runtime': 117.2659, 'train_samples_per_second': 10.66, 'train_steps_per_second': 0.426, 'total_flos': 0.0, 'train_loss': 0.7337837409973145, 'epoch': 50.0})
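
The step count in `TrainOutput` can be reproduced from the training arguments with a little arithmetic. The sketch below assumes a single device and the Trainer's default linear warmup schedule; the variable names are chosen for this illustration.

```python
import math

num_train_samples = 25            # size of the training split (see the progress bar above)
per_device_train_batch_size = 8
gradient_accumulation_steps = 4
num_train_epochs = 50
warmup_steps = 100
learning_rate = 1e-4

# Batches per epoch, then optimizer updates per epoch after accumulation
batches_per_epoch = math.ceil(num_train_samples / per_device_train_batch_size)
updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
total_steps = updates_per_epoch * num_train_epochs

# Linear warmup: the learning rate ramps from 0 to `learning_rate` over `warmup_steps`
lr_at_end = learning_rate * min(total_steps / warmup_steps, 1.0)

print(batches_per_epoch, updates_per_epoch, total_steps)  # 4 1 50
print(lr_at_end)  # 5e-05
```

Under these assumptions there is one optimizer update per epoch, which matches `global_step=50`. Because `warmup_steps=100` exceeds the 50 total steps, training ends halfway through warmup and the learning rate never reaches the configured `1e-4`; this is consistent with the `train/learning_rate` value of `5e-05` in the run summary logged later in this notebook.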

## 4. Reinforcement learning with PPO

In [10]:
import os
from typing import List

import torch
from datasets import load_dataset
from tqdm import tqdm
from transformers import AutoTokenizer

import trlx.data
from trlx.data.configs import (
    ModelConfig,
    OptimizerConfig,
    SchedulerConfig,
    TokenizerConfig,
    TrainConfig,
    TRLConfig,
)
from trlx.models.modeling_ppo import PPOConfig

### 4.1 Loading the trained reward model

In [11]:
REWARD_CHECKPOINT_PATH = "./rm_checkpoint/checkpoint-50/pytorch_model.bin"
SFT_MODEL_PATH = "gpt2"

rw_tokenizer = AutoTokenizer.from_pretrained("gpt2")
rw_tokenizer.pad_token = rw_tokenizer.eos_token
rw_model = GPTRewardModel(SFT_MODEL_PATH)
rw_model.load_state_dict(torch.load(REWARD_CHECKPOINT_PATH))
rw_model.half()
rw_model.eval()
rw_device = torch.device("cuda:{}".format(0))  # set reward model device
rw_model.to(rw_device)

GPTRewardModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (v_head): Linear(in_features=768, out_features=1, bias=False)
)

### 4.2 Defining the required functions

In [12]:
def get_scores(samples):
    scores_list = []
    batch_size = 2
    for i in range(0, len(samples), batch_size):
        sub_samples = samples[i : i + batch_size]
        sub_samples = ["<|startoftext|>" + chosen + "<|endoftext|>" for chosen in sub_samples]
        encodings_dict = rw_tokenizer(
            sub_samples,
            truncation=True,
            max_length=config.train.seq_length,
            padding="max_length",
            return_tensors="pt",
        )
        input_ids = encodings_dict["input_ids"].to(rw_device)
        attn_masks = encodings_dict["attention_mask"].to(rw_device)
        input_ids = input_ids.repeat(2, 1)
        attn_masks = attn_masks.repeat(2, 1)
        with torch.no_grad():
            sub_scores = rw_model(input_ids=input_ids, attention_mask=attn_masks)
        scores_list.append(sub_scores["chosen_end_scores"])
    scores = torch.cat(scores_list, dim=0)
    return scores


def get_prompt_dataset(prompts, max_length):
    formatted_prompts = []
    for i in tqdm(range(len(prompts))):
        tmp = tokenizer.decode(
            tokenizer(
                prompts[i].split("TL;DR:")[0],
                truncation=True,
                max_length=max_length - 5, 
                add_special_tokens=False,
            )["input_ids"],
            skip_special_tokens=True,
        ).strip()
        tmp = tmp + "\nTL;DR:"
        tmp = tokenizer.decode(
            tokenizer(tmp, truncation=True, max_length=max_length, add_special_tokens=False)["input_ids"],
            skip_special_tokens=True,
        ).strip()
        formatted_prompts.append(tmp)
    return formatted_prompts

def reward_fn(samples, **kwargs):
    original_samples = [text.split("TL;DR:")[0] + "TL;DR: " for text in samples]
    original_samples = [text + post_summary_dict[text.strip()] for text in original_samples]
    original_scores = get_scores(original_samples)
    scores = get_scores(samples)
    norms_scores = scores - original_scores
    return norms_scores
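
`reward_fn` does not return the raw reward-model score. It subtracts the score of the reference summary for the same post, so a sample is rewarded only for beating that reference. The sketch below replays this logic with a toy scorer: `normalized_rewards`, `toy_score`, and the example texts are hypothetical stand-ins for `reward_fn`, `get_scores`, and the TL;DR data.

```python
def normalized_rewards(samples, reference_summaries, score_fn):
    """Score each sample relative to the reference summary of its prompt."""
    rewards = []
    for text in samples:
        # Rebuild the prompt part exactly as `reward_fn` does
        prompt = text.split("TL;DR:")[0] + "TL;DR: "
        reference = prompt + reference_summaries[prompt.strip()]
        rewards.append(score_fn(text) - score_fn(reference))
    return rewards


def toy_score(text):
    """Toy stand-in for the reward model: word count of the summary part."""
    return float(len(text.split("TL;DR:")[1].split()))


reference_summaries = {"POST: cats are great\nTL;DR:": "Cats are great."}
sample = "POST: cats are great\nTL;DR: I like cats a lot."
reference_text = "POST: cats are great\nTL;DR: Cats are great."

print(normalized_rewards([sample], reference_summaries, toy_score))          # [2.0]
print(normalized_rewards([reference_text], reference_summaries, toy_score))  # [0.0]
```

A sample identical to the reference gets a reward of zero, which gives the rewards a per-prompt baseline. The lookup key is the stripped prompt ending in `TL;DR:`, so the keys of the summary dictionary must be formatted the same way.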

### 4.3 Setting hyperparameters and running PPO training

In [13]:
# Load the GPT-2 tokenizer from the pretrained model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
max_length_input = 500

# Load the CarperAI/openai_summarize_tldr dataset
dataset = load_dataset("CarperAI/openai_summarize_tldr", split=['train[:500]', 'valid[:500]'])

# Split into training and validation data
train_set = [(sample["prompt"], sample["label"]) for sample in dataset[0]]
val_set = [(sample["prompt"], sample["label"]) for sample in dataset[1]]

# Separate the prompts (posts) from the summaries (labels)
train_posts, train_summaries = zip(*train_set)
val_posts, val_summaries = zip(*val_set)

# Store the reference summary for each formatted prompt in a dictionary
post_summary_dict = {}
train_prompts = get_prompt_dataset(train_posts, max_length_input)
for i in range(len(train_prompts)):
    post_summary_dict[train_prompts[i]] = train_summaries[i]
val_prompts = get_prompt_dataset(val_posts, max_length_input)
for i in range(len(val_prompts)):
    post_summary_dict[val_prompts[i]] = val_summaries[i]

100%|██████████| 500/500 [00:03<00:00, 150.89it/s]
100%|██████████| 500/500 [00:03<00:00, 150.95it/s]


In [None]:
config = TRLConfig(
    train=TrainConfig(
        seq_length=550,
        epochs=50,
        total_steps=100000,
        batch_size=4,
        checkpoint_interval=10000,
        eval_interval=200,
        pipeline="PromptPipeline",
        trainer="AcceleratePPOTrainer",
    ),
    model=ModelConfig(
        model_path="gpt2",
        num_layers_unfrozen=8,
    ),
    tokenizer=TokenizerConfig(
        tokenizer_path="gpt2",
        truncation_side="right",
    ),
    optimizer=OptimizerConfig(
        name="adamw",
        kwargs={
            "lr": 5.0e-6,
            "betas": [0.9, 0.999],
            "eps": 1.0e-8,
            "weight_decay": 0.01,
        },
    ),
    scheduler=SchedulerConfig(
        name="cosine_annealing",
        kwargs={
            "T_max": 100000,
            "eta_min": 5.0e-6,
        },
    ),
    method=PPOConfig(
        name="PPOConfig",
        num_rollouts=128,
        chunk_size=16,
        ppo_epochs=4,
        init_kl_coef=0.1,
        target=6,
        horizon=10000,
        gamma=1,
        lam=0.95,
        cliprange=0.2,
        cliprange_value=0.2,
        vf_coef=0.2,
        scale_reward=None,
        ref_mean=None,
        ref_std=None,
        cliprange_reward=10,
        gen_kwargs={
            "max_new_tokens": 50,
        },
    ),
)

trainer = trlx.train(
    reward_fn=reward_fn,
    prompts=train_prompts,
    eval_prompts=val_prompts[0:500],
    config=config,
)

[RANK 0] Initializing model: gpt2
Using pad_token, but it is not set yet.



Metric,Value
eval/accuracy,0.42857
eval/loss,0.82701
eval/runtime,0.2244
eval/samples_per_second,31.192
eval/steps_per_second,4.456
train/epoch,50.0
train/global_step,50.0
train/learning_rate,5e-05
train/loss,0.5406
train/total_flos,0.0



[RANK 0] Starting training
[RANK 0] Collecting rollouts
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
  torch.tensor(score, dtype=torch.float, device=device).view(
[RANK 0] Evaluating model


[generation sweep 0/1 | eval batch 0/32]:   0%|          | 0/32 [00:00<?, ?it/s]

[RANK 0] Computing rewards
[RANK 0] Summarizing evaluation


  0%|          | 0/6400 [00:00<?, ?it/s]

[RANK 0] Collecting rollouts
[RANK 0] Evaluating model


[generation sweep 0/1 | eval batch 0/32]:   0%|          | 0/32 [00:00<?, ?it/s]
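
A side note on the scheduler settings in the `TRLConfig` above: the optimizer's `lr` and the scheduler's `eta_min` are both `5.0e-6`. The sketch below evaluates the standard closed-form cosine annealing formula for those values. It assumes that the `cosine_annealing` scheduler name maps to this formula; `cosine_annealing_lr` is a helper written for this example, not a trlX function.

```python
import math


def cosine_annealing_lr(step, base_lr, eta_min, t_max):
    """Closed-form cosine annealing from base_lr down to eta_min over t_max steps."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * step / t_max))


base_lr, eta_min, t_max = 5.0e-6, 5.0e-6, 100000

# With base_lr == eta_min the cosine term is multiplied by zero
schedule = [cosine_annealing_lr(s, base_lr, eta_min, t_max) for s in (0, 25000, 50000, 100000)]
print(schedule)

# For comparison: the same schedule with eta_min = 0 at the halfway point
print(cosine_annealing_lr(50000, base_lr, 0.0, t_max))
```

Under that assumption the learning rate stays constant at `5.0e-6` for the whole run, because there is no gap between the starting rate and the minimum. Lowering `eta_min` would be one way to get an actual decay.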

## 5. References
- [Implementing RLHF: Learning to Summarize with trlX](https://wandb.ai/carperai/summarize_RLHF/reports/Implementing-RLHF-Learning-to-Summarize-with-trlX--VmlldzozMzAwODM2)

- [General overview about RLHF](https://huggingface.co/blog/rlhf)
- [Similar human-in-the-loop annotation framework](https://github.com/CarperAI/cheese/tree/main/examples)
- [Anthropic harmless RLHF paper](https://arxiv.org/pdf/2204.05862.pdf) and [blog about CAI general principles](https://lifearchitect.ai/anthropic/)