- https://arxiv.org/abs/2305.07759
    - TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
        - The trick was to carefully curate the training data by generating it synthetically (using GPT).
- https://philliphaeusler.com/posts/aligning_tinystories/

In [3]:
import torch
from torch.distributions import Categorical
from sentence_transformers import SentenceTransformer, util
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch.optim import AdamW
import wandb

In [2]:
model = AutoModelForCausalLM.from_pretrained("roneneldan/TinyStories-33M").to("cuda")


In [4]:
NUM_EPOCHS = 100
BATCH_SIZE = 1000
NUM_TOKENS = 10
LR = 1e-5
KL_FACTOR = 6000

run = wandb.init(
    project="tinycatstories",
    config={
        "epochs": NUM_EPOCHS,
        "batch_size": BATCH_SIZE,
        "num_tokens": NUM_TOKENS,
        "learning_rate": LR,
        "kl_factor": KL_FACTOR,
    },
)

[34m[1mwandb[0m: Currently logged in as: [33mlanchunhui[0m ([33mloveresearch[0m). Use [1m`wandb login --relogin`[0m to force relogin


### REINFORCE

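- Cumulative log-probability
    - $\mathcal L_{\text{log\_prob}}=\sum_{t=1}^T\log p_\theta(y_t|y_{1:t-1})$
    - The log-probability measures the model's confidence in generating the sequence $y$ (it is the log of the sequence's joint probability).
- KL divergence
    - At **each generation step**, compute the KL divergence between the current model $\mathcal M$ and the reference model $\mathcal M_{ref}$.
    - $\mathcal L_{KL}=\sum_{t=1}^TKL(p_\theta(y_t|y_{1:t-1})\|p_{\theta_{ref}}(y_t|y_{1:t-1}))$
        - where the per-step KL divergence is defined as $KL(P\|Q)=\sum_{y_t}P(y_t)(\log P(y_t)-\log Q(y_t))$
- Policy-gradient term (averaged over the batch)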
$$
\mathcal L_{policy}=-\frac1B\sum_{i=1}^B\left(\hat{\mathcal L}^{(i)}_{\text{log\_prob}}\cdot r(y^{(i)})\right)
$$
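
The three terms above can be sketched on toy tensors as follows. This is only an illustration under stated assumptions, not the notebook's training loop: the helper name `reinforce_loss` is hypothetical, random logits stand in for the outputs of the policy and reference models, and combining the terms as `policy_loss + kl_factor * mean(KL)` is an assumed choice. The value `kl_factor=0.1` is arbitrary.

```python
import torch
import torch.nn.functional as F


def reinforce_loss(logits, ref_logits, actions, rewards, kl_factor):
    """Hypothetical helper: REINFORCE loss plus a KL penalty.

    logits, ref_logits: (B, T, V) next-token logits from policy and reference.
    actions: (B, T) sampled token ids; rewards: (B,) one scalar per sequence.
    """
    log_p = F.log_softmax(logits, dim=-1)
    log_q = F.log_softmax(ref_logits, dim=-1)

    # Cumulative log-probability of each sampled sequence: sum_t log p(y_t | y_<t).
    seq_log_prob = log_p.gather(-1, actions.unsqueeze(-1)).squeeze(-1).sum(dim=1)

    # Per-step KL(P || Q) summed over the vocabulary, then over time steps.
    kl = (log_p.exp() * (log_p - log_q)).sum(dim=-1).sum(dim=1)

    # Policy-gradient term, averaged over the batch; rewards are treated as constants.
    policy_loss = -(seq_log_prob * rewards.detach()).mean()

    # Assumption: the KL penalty is batch-averaged and added with weight kl_factor.
    return policy_loss + kl_factor * kl.mean(), policy_loss, kl


torch.manual_seed(0)
B, T, V = 4, 10, 50  # toy batch size, tokens per sequence, vocabulary size
logits = torch.randn(B, T, V, requires_grad=True)
ref_logits = torch.randn(B, T, V)
actions = torch.distributions.Categorical(logits=logits).sample()  # shape (B, T)
rewards = torch.rand(B)

loss, policy_loss, kl = reinforce_loss(logits, ref_logits, actions, rewards, kl_factor=0.1)
loss.backward()  # gradients flow only into the policy logits
```

When the policy and the reference produce identical logits, the KL term is zero for every sequence, so the penalty only becomes active once the fine-tuned model drifts away from the reference.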

In [5]:
embedding_model = SentenceTransformer("all-MiniLM-L6-v2").to("cuda")
reference_embedding = embedding_model.encode("cat", convert_to_tensor=True)

for param in embedding_model.parameters():
    param.requires_grad = False

In [7]:
# embedding dimension: 384 (= 128 * 3)
reference_embedding.shape

torch.Size([384])

In [8]:
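def compute_rewards(sequences):
    """Reward = cosine similarity between each sequence and the "cat" embedding."""
    # sequences: list of decoded strings -> embeddings of shape (N, 384)
    sequence_embeddings = embedding_model.encode(sequences, convert_to_tensor=True)
    # (1, 384) vs (N, 384) -> (1, N); drop only the leading dim so N == 1 still yields shape (1,)
    cosine_similarities = util.pytorch_cos_sim(
        reference_embedding.unsqueeze(0), sequence_embeddings
    ).squeeze(0)
    return cosine_similarities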
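
To make the reward concrete, here is a toy version of the same cosine-similarity computation on hand-written vectors instead of sentence embeddings. The vectors and the name `toy_rewards` are made up for illustration, and `torch.nn.functional.cosine_similarity` stands in for `util.pytorch_cos_sim`.

```python
import torch
import torch.nn.functional as F

# Made-up 3-dimensional "embeddings"; the real ones have 384 dimensions.
reference = torch.tensor([1.0, 0.0, 0.0])
candidates = torch.tensor([
    [2.0, 0.0, 0.0],   # same direction as the reference
    [0.0, 3.0, 0.0],   # orthogonal to the reference
    [-1.0, 0.0, 0.0],  # opposite direction
])

# Compare the reference with every row; the result has one reward per candidate.
toy_rewards = F.cosine_similarity(
    reference.unsqueeze(0).expand_as(candidates), candidates, dim=-1
)
print(toy_rewards)  # rewards of 1, 0 and -1 for the three rows
```

Cosine similarity ignores vector length, so the reward lies in $[-1, 1]$ and depends only on how closely a generated sequence's embedding points in the direction of the reference embedding.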


In [11]:
model = AutoModelForCausalLM.from_pretrained("roneneldan/TinyStories-33M").to("cuda")
ref_model = AutoModelForCausalLM.from_pretrained("roneneldan/TinyStories-33M").to("cuda")
for param in ref_model.parameters():
    param.requires_grad = False
    
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")
optimizer = AdamW(model.parameters(), lr=LR)

prompt = "Once upon a time there was"
input_ids = tokenizer.encode(prompt, return_tensors="pt").to("cuda")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [10]:
input_ids

tensor([[7454, 2402,  257,  640,  612,  373]], device='cuda:0')