<h2>Reward Model Training</h2>

Let's train the reward model.

In [1]:
!nvidia-smi

Sun Jun 11 01:31:12 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.56.06    Driver Version: 520.56.06    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100 80G...  On   | 00000000:61:00.0 Off |                    0 |
| N/A   28C    P0    41W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
!pip install transformers
!pip install datasets
!pip install nvidia-ml-py3

In [1]:
from transformers import LEDModel, LEDTokenizer, BartModel
from datasets import load_dataset
import torch
from torch import nn, optim
from torch.cuda.amp import autocast, GradScaler
from tqdm import tqdm

In [21]:
# This dataset was built with an added "noised" part, but that part ended up not being used...
dataset = load_dataset("aeromaki/arxiv_noised_small")

Found cached dataset parquet (/root/.cache/huggingface/datasets/aeromaki___parquet/aeromaki--arxiv_noised_small-203d8e72f8332c5e/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)


  0%|          | 0/1 [00:00<?, ?it/s]

In [2]:
tokenizer = LEDTokenizer.from_pretrained("allenai/led-large-16384-arxiv")

In [3]:
# Longformer applies sliding-window local attention to most tokens, but applies global attention to a few special tokens
# forward therefore needs a mask that marks which special tokens get global attention
# Not a particularly important part

def generate_global_attention_mask(tokenizer, input_ids):
    mask = torch.zeros_like(input_ids)
    mask[((input_ids == tokenizer.bos_token_id) | (input_ids == tokenizer.eos_token_id)).nonzero(as_tuple=True)] = 1
    return mask
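
To make the rule above concrete, here is a minimal pure-Python sketch of the same mask on a toy sequence. The ids 0 and 2 for `<s>` and `</s>` are assumptions for this example (in the notebook they come from `tokenizer.bos_token_id` and `tokenizer.eos_token_id`), and `toy_global_attention_mask` is a hypothetical helper, not part of the notebook.

```python
# Hypothetical stand-in for generate_global_attention_mask, using plain lists.
BOS_ID = 0  # assumed id of <s>
EOS_ID = 2  # assumed id of </s>

def toy_global_attention_mask(input_ids):
    """Return 1 where a token is BOS/EOS (global attention), else 0 (local attention)."""
    return [[1 if tok in (BOS_ID, EOS_ID) else 0 for tok in seq] for seq in input_ids]

toy_ids = [[0, 713, 16, 10, 2]]
toy_mask = toy_global_attention_mask(toy_ids)
print(toy_mask)  # [[1, 0, 0, 0, 1]]
```

Only the first and last positions are marked, so every other token keeps the cheaper sliding-window attention.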

In [4]:
# Reward Model
# After some trial and error, this ended up as an unusual architecture that uses two encoders...


# self.led :
    # Longformer encoder (only the encoder of allenai/led-large-16384-arxiv)
    # takes the source article as input

# self.bart :
    # bart-large encoder (only the encoder of facebook/bart-large)
    # takes the two summaries to compare as input

    # Combining two different models like this may look questionable, but the tokenizer vocabularies match exactly, so it seems to be fine (both also have d_model = 1024)
    # The rest was left to the head to sort out, and the results came out reasonably well

# self.flatten :
    # takes the concatenated last hidden states of the LED and BART encoders and flattens them along d_model (d_model -> 1)

# head :
    # produces a scalar value

class RewardModel(nn.Module):
    def __init__(self, device="cuda"):
        super().__init__()

        self.device = device

        self.led = LEDModel.from_pretrained("allenai/led-large-16384-arxiv").get_encoder()
        self.bart = BartModel.from_pretrained("facebook/bart-large").get_encoder()

        self.flatten = nn.Linear(1024, 1)

        self.head = nn.Sequential(
            nn.Linear(17408, 4096),
            nn.ReLU(),
            nn.Linear(4096, 256),
            nn.ReLU(),
            nn.Linear(256, 64),
            nn.ReLU(),
            nn.Linear(64, 1)
        )

    def forward(self, input_ids, summary_input_ids, global_attention_mask=None):
        hidden_state = self.led(input_ids, global_attention_mask=global_attention_mask).last_hidden_state
        output = torch.zeros((hidden_state.shape[0], 16384, 1024)).to(self.device) # zero-pad because the head only accepts a fixed input size
        output[:, :hidden_state.shape[1], :] = hidden_state

        bart_hidden_state = self.bart(summary_input_ids).last_hidden_state
        bart_output = torch.zeros((bart_hidden_state.shape[0], 1024, 1024)).to(self.device) # zero-pad because the head only accepts a fixed input size
        bart_output[:, :bart_hidden_state.shape[1], :] = bart_hidden_state

        concat = torch.cat([output.repeat((summary_input_ids.shape[0], 1, 1)), bart_output], dim=1)
        # concat = torch.cat([output.repeat((summary_input_ids.shape[0], 1, 1)), bart_output], dim=1).detach()
            # with the augmented arxiv dataset, only flatten and head are updated
            # (training the encoders together with a randomly initialized flatten and head did not work well, hence the detach)
            # with the openai feedback dataset everything, including the encoders, is updated (this worked fine), so detach is not used
        concat = self.flatten(concat)
        result = self.head(concat.transpose(1, 2))

        return result.squeeze()
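
The head's input size of 17408 follows from the two padded sequence lengths. The sketch below only tracks tensor shapes with plain tuples, for one article and two candidate summaries; the constant names are introduced here for illustration and do not appear in the notebook.

```python
# Shape walk-through for one article and two candidate summaries (sizes from the model above).
LED_MAX_LEN = 16384   # article positions after zero-padding
BART_MAX_LEN = 1024   # summary positions after zero-padding
D_MODEL = 1024        # hidden size of both encoders
num_summaries = 2

led_repeated = (num_summaries, LED_MAX_LEN, D_MODEL)    # output.repeat(...)
bart_padded = (num_summaries, BART_MAX_LEN, D_MODEL)    # bart_output
# torch.cat(..., dim=1) stacks the two along the sequence dimension
concat = (num_summaries, led_repeated[1] + bart_padded[1], D_MODEL)
flattened = (concat[0], concat[1], 1)                   # self.flatten: nn.Linear(1024, 1)
head_input = (flattened[0], flattened[2], flattened[1]) # transpose(1, 2)
head_output = (head_input[0], head_input[1], 1)         # last layer: nn.Linear(64, 1)

print(head_input)   # (2, 1, 17408)
print(head_output)  # (2, 1, 1); squeeze() leaves one scalar per summary
```

This is why the first layer of the head is `nn.Linear(17408, 4096)`: 16384 article positions plus 1024 summary positions, each reduced to a single value by `self.flatten`.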

In [16]:
device = "cuda"
model = RewardModel().to(device)
# model.load_state_dict(torch.load("reward_model_openai_2190.pth"))
    # load a previously saved checkpoint, if there is one

Some weights of the model checkpoint at allenai/led-large-16384-arxiv were not used when initializing LEDModel: ['final_logits_bias', 'lm_head.weight']
- This IS expected if you are initializing LEDModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing LEDModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


<All keys matched successfully>

In [6]:
# The loss function is the negative log-sigmoid of the reward difference (following the OpenAI paper)
# Positions are fixed: output[0] is the preferred summary, output[1] is the rejected one
class Criterion():
    def __init__(self):
        self.logsig = nn.LogSigmoid()
    def loss(self, output):
        return -self.logsig(output[0] - output[1])
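
As a sanity check on this loss, here is a small standard-library sketch of the same formula, loss = -log(sigmoid(r_preferred - r_rejected)). The function name `pairwise_loss` is hypothetical; the two reward values are the ones printed by the test cell near the end of this notebook.

```python
import math

def pairwise_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(r_preferred - r_rejected)), written in a numerically stable form."""
    diff = reward_preferred - reward_rejected
    # -log(sigmoid(d)) equals log(1 + exp(-d)); branch so exp() never overflows
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

# Rewards from the test cell's output: tensor([-123.8978, -127.6668])
print(round(pairwise_loss(-123.8978, -127.6668), 4))  # 0.0228
# Equal rewards give ln 2: the model cannot tell the two summaries apart
print(round(pairwise_loss(1.0, 1.0), 4))              # 0.6931
```

Only the difference between the two rewards matters, so the absolute scale of the scores (around -125 here) carries no meaning by itself.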

In [7]:
optimizer = optim.Adam(model.parameters(), lr=1e-5)
criterion = Criterion()
scaler = GradScaler() # required for fp16 mixed-precision training (without mixed precision, VRAM runs out even on an A100 80GB!)

In [8]:
import nvidia_smi

nvidia_smi.nvmlInit()
handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0) # used to check GPU VRAM usage (not important)

In [26]:
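# augmentation 1
# Builds the negative sample as a sequence of random tokens with roughly the same length as the abstract
def arxivrand(row):
    merged_1 = row["abstract"]

    # Random token ids with the abstract's token length (ids 0-3 are skipped; assumed to be the special tokens)
    rand = torch.randint(4, tokenizer.vocab_size - 1, tokenizer.encode(merged_1, return_tensors="pt")[:,1:-1].shape)
    # Drop 0-6 trailing tokens. An explicit end index is used because `[:, :-0]` would return an empty tensor
    cut = int(torch.randint(0, 7, (1,)))
    end = max(rand.shape[1] - cut, 1)
    merged_0 = tokenizer.batch_decode(rand[:, :end], skip_special_tokens=True)[0]

    return [merged_1, merged_0]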
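
One detail worth keeping in mind when trimming a random number of trailing tokens: in Python, `-0` is just `0`, so a slice written as `x[:-cut]` drops everything when `cut` is 0. The sketch below shows the difference with a plain list; `trim_tail` is a hypothetical helper used only for this illustration.

```python
def trim_tail(seq, cut):
    """Drop the last `cut` items using an explicit end index (safe for cut == 0)."""
    return seq[:len(seq) - cut]

tokens = list(range(10))  # stand-in for a row of sampled token ids

print(tokens[:-0])           # [] : the negative-index form loses every token
print(trim_tail(tokens, 0))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(trim_tail(tokens, 3))  # [0, 1, 2, 3, 4, 5, 6]
```

With a cut sampled from 0 to 6, the negative-index form would produce an empty negative sample roughly one time in seven.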

In [27]:
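# augmentation 2
# Builds the negative sample as a corrupted text that repeats the tail of the abstract
def pad_dup_back(row):
    merged_1 = row["abstract"]
    l = int(torch.randint(3, 6, (1,)))  # number of chunks: 3, 4, or 5
    r = len(merged_1) // l              # chunk length in characters
    dup = merged_1[-r:]                 # last chunk of the abstract
    # Keep the first chunk, then repeat the last chunk l-1 times (minus the final character)
    merged_0 = (merged_1[:r] + dup * (l-1))[:-1]

    return [merged_1, merged_0]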
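
To see what this corruption produces, here is the same construction on a short string, with the repeat count passed in rather than sampled. `pad_dup_back_fixed` is a hypothetical variant written for this illustration only.

```python
def pad_dup_back_fixed(text, l):
    """Same construction as pad_dup_back, with the chunk count `l` given instead of sampled."""
    r = len(text) // l          # chunk length in characters
    dup = text[-r:]             # last chunk of the text
    return (text[:r] + dup * (l - 1))[:-1]

corrupted = pad_dup_back_fixed("abcdefghijkl", 3)
print(corrupted)  # abcdijklijk
```

The result starts like the original, then loops on its ending, and is one character shorter than the source text.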

In [None]:
# augmentation 3
# Uses the abstract of a completely different paper as the negative sample
def other(row):
    idx = int(torch.randint(0, 10000, (1,)))  # plain int index into the train split
    return [row["abstract"], dataset["train"][idx]["abstract"]]

In [None]:
# Training step
# The augmentation method to apply is passed in as `aug`
def update(d, aug):
    put = tokenizer.batch_encode_plus(aug(d), return_tensors="pt", padding=True).input_ids
    # Truncate the summaries to the 1024 positions the model pads to
    put = put[:,:1024].to(device)

    art = tokenizer.encode(d["article"], return_tensors="pt").to(device)
    att = generate_global_attention_mask(tokenizer, art).to(device)

    optimizer.zero_grad()

    # Only the forward pass and the loss run under autocast
    with autocast():
        res = model(art, put, att)
        loss = criterion.loss(res)

    # backward and the optimizer step run outside the autocast context
    scaler.scale(loss).backward()
    t = loss.item()
    # Drop references early to keep peak VRAM down
    del art, put, att, res, loss

    scaler.step(optimizer)
    scaler.update()

    return t

In [None]:
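# Training on the arxiv dataset

i = 0 # counter kept separately: wrapping the dataset in zip made iteration go column by column (presumably because it behaves like a dictionary)
s = torch.Tensor([0, 0, 0]) # running loss sums, one per augmentation, for the average loss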

save_size = 1000
check_size = 100

model.train()
for d in tqdm(dataset["train"]):
    i += 1

    s[0] += update(d, pad_dup_back)
    s[1] += update(d, arxivrand)
    s[2] += update(d, other)
    
    if i % save_size == 0:
        torch.save(model.state_dict(), f"./reward_model_{i}.pth")
    if i % check_size == 0:
        info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
        print(f"epoch: {i} / loss: {s / check_size} / GPU: {100 * (1 - info.free / info.total)}% used")
        s = torch.Tensor([0, 0, 0])

In [32]:
torch.cuda.empty_cache()
info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
print(f"GPU: {100 * (1 - info.free / info.total)}% used")

GPU: 42.48138427734375% used


In [19]:
torch.save(model.state_dict(), f"./reward_model_openai_2190.pth")

In [9]:
dataset2 = load_dataset("openai/summarize_from_feedback", "comparisons")

Found cached dataset summarize_from_feedback (/root/.cache/huggingface/datasets/openai___summarize_from_feedback/comparisons/0.0.0/483f970ceb55b926b0a087ef4f678ab1b089bc8174a107a452c6152e88af7ff0)


  0%|          | 0/2 [00:00<?, ?it/s]

In [13]:
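# Preprocessing for the openai dataset
# The openai dataset provides short summaries of short posts, so it is a poor fit for a summarizer trained to turn long texts into long summaries, and for the reward model that evaluates that summarizer
# Still, this is "real" human feedback and too valuable to leave unused, so several short posts are concatenated into a single long text
# Feeding the short posts as they are did not improve the loss at all, but concatenating several of them had an immediate effect!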

def redditpreprocessor(rows):
    text = "\n\n".join(["TITLE: " + orig["title"] + "POST: " + orig["post"] for orig in rows["info"]])

    # choice is the index (0 or 1) of the summary the human preferred
    pairs = list(zip(rows["summaries"], rows["choice"]))
    chosen = "\n\n".join(["SUMMARY: " + r[c]["text"] for r, c in pairs])
    rejected = "\n\n".join(["SUMMARY: " + r[1-c]["text"] for r, c in pairs])

    return text, [chosen, rejected]
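
The pairing logic is easier to follow on toy data. The sketch below applies the same join to two made-up comparison rows; the row contents and the helper name `build_pair` are invented for illustration, and only the field names (`summaries`, `choice`, `text`) follow the code above.

```python
# Two made-up comparison rows in the column-wise layout returned by slicing the dataset.
toy_rows = {
    "summaries": [[{"text": "s0"}, {"text": "s1"}], [{"text": "t0"}, {"text": "t1"}]],
    "choice": [1, 0],  # index of the preferred summary in each row
}

def build_pair(rows):
    """Join all preferred summaries into one text and all rejected summaries into another."""
    pairs = list(zip(rows["summaries"], rows["choice"]))
    chosen = "\n\n".join("SUMMARY: " + s[c]["text"] for s, c in pairs)
    rejected = "\n\n".join("SUMMARY: " + s[1 - c]["text"] for s, c in pairs)
    return chosen, rejected

chosen, rejected = build_pair(toy_rows)
print(repr(chosen))    # 'SUMMARY: s1\n\nSUMMARY: t0'
print(repr(rejected))  # 'SUMMARY: s0\n\nSUMMARY: t1'
```

Each concatenated text therefore contains only preferred or only rejected summaries, in the same order as the posts they belong to.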

In [17]:
optimizer = optim.Adam(model.parameters(), lr=1e-5)
criterion = Criterion()
scaler = GradScaler()

In [18]:
# Training on the openai dataset

i = 0
s = 0
num_batch = 20
model.train()
for d in tqdm(range(0, dataset2["train"].num_rows-num_batch, num_batch)):
    i += 1

    art, put = redditpreprocessor(dataset2["train"][d:d+num_batch])
    put = tokenizer.batch_encode_plus(put, return_tensors="pt", padding=True).input_ids
    
    if put.shape[1] > 1024:
        continue
    else:
        put = put.to(device)
    
    art = tokenizer.encode(art, return_tensors="pt").to(device)
    att = generate_global_attention_mask(tokenizer, art).to(device)
    
    optimizer.zero_grad()
    
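    # Only the forward pass and the loss run under autocast
    with autocast():
        res = model(art, put, att)
        loss = criterion.loss(res)

    # backward and the optimizer step run outside the autocast context
    scaler.scale(loss).backward()
    s += loss.item()
    del art, put, att, res, loss

    scaler.step(optimizer)
    scaler.update()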
    
    if i % 1000 == 0:
        torch.save(model.state_dict(), f"./reward_model_openai_{i}.pth")
    if i % 30 == 0:
        info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
        print(f"epoch: {i} / loss: {s / 30} / GPU: {100 * (1 - info.free / info.total)}% used")
        s = 0

 43%|████▎     | 2001/4642 [00:00<00:00, 2867.52it/s]

epoch: 2010 / loss: 0.06071450511614482 / GPU: 64.26849365234375% used


 44%|████▎     | 2025/4642 [00:16<00:30, 86.41it/s]  

epoch: 2040 / loss: 0.574515414237976 / GPU: 64.26849365234375% used


 44%|████▍     | 2064/4642 [00:46<03:13, 13.32it/s]

epoch: 2070 / loss: 0.4203218460083008 / GPU: 64.26849365234375% used


 45%|████▍     | 2086/4642 [00:59<05:39,  7.52it/s]

epoch: 2100 / loss: 0.16639448006947835 / GPU: 64.26849365234375% used


 46%|████▌     | 2128/4642 [01:28<14:06,  2.97it/s]

epoch: 2130 / loss: 0.2298765738805135 / GPU: 64.26849365234375% used


 46%|████▋     | 2158/4642 [01:50<23:19,  1.77it/s]

epoch: 2160 / loss: 0.06116012136141459 / GPU: 64.26849365234375% used


 47%|████▋     | 2190/4642 [02:14<02:30, 16.28it/s]

epoch: 2190 / loss: 0.17225597500801088 / GPU: 64.26849365234375% used





<h3>Pretrained model</h3>

If needed, the checkpoint can be downloaded with gdown from within a Jupyter environment.

https://drive.google.com/file/d/13AJNdIcUjsa3EXKIJvJrzALNf4Fr6y1V/view?usp=sharing
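
A rough sketch of how the download could look. The runnable part only extracts the file id from the share link above using the standard library; the gdown call is left as comments, and the output filename `reward_model.pth` is a placeholder chosen here, not a name from the notebook.

```python
import re

SHARE_URL = "https://drive.google.com/file/d/13AJNdIcUjsa3EXKIJvJrzALNf4Fr6y1V/view?usp=sharing"

def drive_file_id(url):
    """Extract the file id from a Google Drive '/file/d/<id>/...' share link."""
    match = re.search(r"/file/d/([^/]+)", url)
    if match is None:
        raise ValueError(f"no Drive file id found in {url!r}")
    return match.group(1)

file_id = drive_file_id(SHARE_URL)
print(file_id)  # 13AJNdIcUjsa3EXKIJvJrzALNf4Fr6y1V

# With gdown installed (`!pip install gdown`), the download would then be roughly:
# import gdown
# gdown.download(id=file_id, output="reward_model.pth")  # placeholder filename
# model.load_state_dict(torch.load("reward_model.pth"))
```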

In [27]:
# Everything below is just for testing (no need to read it)

label = 10079
wrong = 2170

wrong_ = 'the multiple quantum ( mq ) nmr dynamics in the system of equivalent spins with the dipolar ordered initial state is considered. \n the high symmetry of the hamiltonian responsible for the mq nmr dynamics ( the mq hamiltonian ) is used in order to develop the analytical and numerical methods for an investigation of the mq nmr dynamics in the systems consisting of hundreds of spins from `` the first principles ''. \n we obtain the dependence of the intensities of the mq nmr coherences on'
#put = tokenizer.batch_encode_plus([dataset["train"][label]["abstract"], wrong_], return_tensors="pt", padding=True).input_ids.to("cuda")
put = tokenizer.batch_encode_plus([dataset["train"][label]["abstract"], dataset["train"][wrong]["abstract"]], return_tensors="pt", padding=True).input_ids.to(device)
art = tokenizer.encode(dataset["train"][label]["article"], return_tensors="pt").to(device)
att = generate_global_attention_mask(tokenizer, art).to(device)

model.eval()
with torch.no_grad():
    res = model(art, put, att)
    loss = criterion.loss(res)
print(res, loss)

tensor([-123.8978, -127.6668], device='cuda:0') tensor(0.0228, device='cuda:0')


In [46]:
dataset["train"][label]["abstract"], dataset["train"][wrong]["abstract"]

('a universal quantum simulator would enable efficient simulation of quantum dynamics by implementing quantum - simulation algorithms on a quantum computer. \n specifically the quantum simulator would efficiently generate qubit - string states that closely approximate physical states obtained from a broad class of dynamical evolutions. \n i provide an overview of theoretical research into universal quantum simulators and the strategies for minimizing computational space and time costs. \n applications to simulating many - body quantum simulation and solving linear equations are discussed    computing, quantum algorithms, quantum simulation',
 "the multiple quantum ( mq ) nmr dynamics in the system of equivalent spins with the dipolar ordered initial state is considered. \n the high symmetry of the hamiltonian responsible for the mq nmr dynamics ( the mq hamiltonian ) is used in order to develop the analytical and numerical methods for an investigation of the mq nmr dynamics in the syst