https://lena-voita.github.io/nlp_course/language_modeling.html

In [1]:
from IPython.display import Image

- what matters when training LLMs
    - Architecture
    - Training Algorithms/loss
        - post-training dataset
            - openassistant
            - alpaca: use LLMs to scale data collection
            - LIMA: you need very little data for SFT! ~few thousands
        - pre-training => post-training
            - same loss, but different dataset (tasks)
            - different type of hyperparameters
                - at the end of pretraining you essentially end up with a learning rate of 0
                - in post-training, you increase your learning rate: $1e-5$
    - Data
    - Evaluation: PPL
        - pretrained benchmark: MMLU
        - multiple-choice questions: scored by likelihood, i.e. the likelihood the LLM assigns to the correct option vs. the other options
            - compare the likelihood of generating each of the options A/B/C/D and take the most likely,
            - or ask which of A/B/C/D is the most likely.
    - Systems
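
The likelihood-based multiple-choice scoring mentioned under Evaluation can be sketched as follows. This is a minimal illustration, not the procedure of any particular benchmark harness: the log-likelihood values are made up, and `pick_answer` and `option_probs` are hypothetical helper names.

```python
import math

def pick_answer(option_logprobs):
    """Return the option letter whose log-likelihood under the model is highest."""
    return max(option_logprobs, key=option_logprobs.get)

def option_probs(option_logprobs):
    """Renormalize the log-likelihoods over the listed options only (softmax)."""
    m = max(option_logprobs.values())
    exps = {k: math.exp(v - m) for k, v in option_logprobs.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}

# Made-up log p(option | question) values, for illustration only.
scores = {"A": -2.3, "B": -0.4, "C": -3.1, "D": -1.9}
prediction = pick_answer(scores)
probs = option_probs(scores)
```

The model is counted as correct when `prediction` matches the gold letter; no free-form generation is needed.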

### language modeling

- language modeling
    - probability distribution over sequences of tokens/words $p(x_1,\cdots, x_L)$
- LMs are generative models: $x_{1:L}\sim p(x_1,\cdots,x_L)$
    - because a language model is a probability distribution, generation is simply sampling from that distribution

### autoregressive

- Autoregressive (AR) language models
    - $p(x_1,\cdots, x_L)=p(x_1)p(x_2|x_1)p(x_3|x_1,x_2)\cdots$
    - note: this is not an approximation, it is the chain rule of probability
    - you only need a model that can predict the next token given past context.
- tasks & steps
    - task: predict the next word
    - steps:
        - tokenize
        - forward
        - **predict probability of next token**
        - sample
        - detokenize
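
The steps above (tokenize, forward, predict the next-token distribution, sample, detokenize) can be sketched with a toy model. Everything here is an assumption made for illustration: the probability table `NEXT`, the whitespace tokenizer, and the `<bos>`/`<eos>` markers are hypothetical, and a real LM conditions on the whole context rather than only the last token.

```python
import random

# Toy "model": a hand-written table of next-token probabilities (hypothetical).
NEXT = {
    "<bos>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "<eos>": 0.3},
    "dog": {"sat": 0.5, "<eos>": 0.5},
    "sat": {"<eos>": 1.0},
}

def tokenize(text):
    # Whitespace split stands in for a real tokenizer.
    return ["<bos>"] + text.split()

def detokenize(tokens):
    return " ".join(t for t in tokens if t not in ("<bos>", "<eos>"))

def next_token_probs(context):
    # "Forward pass": this toy only looks at the last token of the context.
    return NEXT[context[-1]]

def generate(prompt, rng, max_new_tokens=10):
    tokens = tokenize(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        choices, weights = zip(*probs.items())
        tok = rng.choices(choices, weights=weights, k=1)[0]  # sample
        tokens.append(tok)
        if tok == "<eos>":
            break
    return detokenize(tokens)

text = generate("the", random.Random(0))
```

Because each step only needs $p(x_i | x_{1:i-1})$, the same loop works for any model that exposes a next-token distribution.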

In [3]:
Image(url='https://lena-voita.github.io/resources/lectures/lang_models/neural/nn_lm_idea_linear-min.png', width=500)

### loss

In [4]:
Image(url='https://lena-voita.github.io/resources/lectures/lang_models/neural/one_step_loss_intuition-min.png', width=500)

- maximizing the text's log-likelihood = minimizing the cross-entropy loss

$$
\arg\max \prod_{i} p(x_i | x_{1:i-1}) = \arg\min \left( - \sum_{i} \log p(x_i | x_{1:i-1}) \right) = \arg\min \mathcal{L}(x_{1:L})
$$
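
A small numeric check of this equivalence, assuming natural logarithms and made-up per-token probabilities (`nll` is a hypothetical helper name): the negative log of the product equals the sum of the per-token negative logs, so a higher likelihood always means a lower loss.

```python
import math

def nll(token_probs):
    """Negative log-likelihood: -sum_i log p(x_i | x_{1:i-1})."""
    return -sum(math.log(p) for p in token_probs)

# Hypothetical probabilities the model assigned to each correct next token.
probs = [0.5, 0.25, 0.8]
loss = nll(probs)
likelihood = math.prod(probs)

# A model that is more confident in the correct tokens gets a lower loss.
better_loss = nll([0.9, 0.9, 0.9])
```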

### tokenizer & BPE

- Take large corpus of text
- Start with one token per character
- Merge common pairs of tokens into a token
- Repeat until desired vocab size or all merged
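- About whitespace
    - `to` vs. ` to`  vs. `to ` vs. `to\n`
    - nearly every token has two versions: `token` and ` token`
- At encoding time, matching is roughly longest-match (strictly, BPE applies its learned merges in priority order);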
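
The training loop described above can be sketched as a toy character-level BPE. This is a simplified illustration under stated assumptions, not how `tiktoken` is implemented: production tokenizers typically work on bytes and pre-split the text before merging. `train_bpe` is a hypothetical name.

```python
from collections import Counter

def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges, starting from one token per character."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats any more, nothing worth merging
        merges.append((a, b))
        # Replace every non-overlapping occurrence of (a, b) with the merged token.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens

corpus = "to be or not to be"
merges, tokens = train_bpe(corpus, num_merges=3)
```

Each merge adds one entry to the vocabulary, so `num_merges` plays the role of the "desired vocab size" stopping condition.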

In [9]:
import tiktoken

tokenizer = tiktoken.get_encoding("o200k_base")
for token in [' ', '\n', 'to', 'to ', ' to', 'to\n']:
    # repr() makes spaces and newlines visible in the printed output
    print(repr(token), tokenizer.encode(token))

' ' [220]
'\n' [198]
'to' [935]
'to ' [935, 220]
' to' [316]
'to\n' [935, 198]


### PPL for evaluation

$$
\begin{split}
PPL(x_{1:L})=\exp\left(\frac1L\mathcal L(x_{1:L})\right)=\prod_{i=1}^L p(x_i|x_{1:i-1})^{-1/L}\\
\mathcal L(x_{1:L})=-\sum_{i=1}^L\log p(x_i|x_{1:i-1})
\end{split}
$$

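- PPL: between 1 (every token predicted with probability 1) and |vocab| (uniform guessing over the vocabulary); a model worse than uniform can exceed |vocab|
- intuition: the number of tokens the model is hesitating between at each step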
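
The two bounds can be checked numerically. This sketch uses natural logarithms with `exp`; any base gives the same perplexity as long as the logarithm and the exponent use the same one. The vocabulary size and probabilities are made-up values, and `perplexity` is a hypothetical helper name.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-likelihood per token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

vocab_size = 50
uniform = perplexity([1 / vocab_size] * 10)  # model guesses uniformly at every step
confident = perplexity([1.0] * 10)           # model is always certain and right
mixed = perplexity([0.5, 0.25, 0.5, 0.25])   # "hesitating" between about 2 and 4 tokens
```

Here `mixed` is the geometric mean of 2 and 4, which matches the "number of tokens you are hesitating between" reading.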

### Data

- Note: internet is dirty & not representative of what we want. Practice:

1. Download all of the internet. Common Crawl: 250 billion pages, > 1PB (> 1e6 GB)
2. Text extraction from HTML (challenges: math, boilerplate)
3. Filter undesirable content (e.g. NSFW, harmful content, PII)
4. Deduplication (URL/document/line). E.g. the headers/footers/menus in forums are always the same
5. Heuristic filtering. Remove low-quality documents (e.g. # words, word length, outlier tokens, dirty tokens)
6. Model-based filtering. Predict whether a page could be referenced by Wikipedia.
7. **Data mix**. Classify data categories (code/books/entertainment). **Reweight domains** using scaling laws to get high downstream performance.
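
Steps 4 and 5 can be sketched on a handful of strings. This is a toy illustration only: the normalization, the thresholds `min_words` and `max_mean_word_len`, and the sample documents are assumptions, and real pipelines also use fuzzy (near-duplicate) matching rather than just exact hashes.

```python
import hashlib

def dedupe(docs):
    """Drop exact duplicate documents after whitespace/case normalization."""
    seen, kept = set(), []
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

def passes_heuristics(doc, min_words=5, max_mean_word_len=12):
    """Toy quality rules; the thresholds are illustrative assumptions."""
    words = doc.split()
    if len(words) < min_words:
        return False  # too short, e.g. a navigation menu
    mean_len = sum(len(w) for w in words) / len(words)
    return mean_len <= max_mean_word_len  # very long "words" suggest junk

raw = [
    "Login | Sign up",
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",
    "a" * 24 + " " + "b" * 24 + " " + "c" * 20 + " " + "d" * 16 + " " + "e" * 20,
]
clean = [d for d in dedupe(raw) if passes_heuristics(d)]
```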

- Also: lr annealing on high-quality data, continual pretraining with longer context
    - overfitting your model on very high-quality data
        - high quality data: (expert) wikipedia 
        - low quality data: human data 
    - Learning rate annealing that considers data quality
        - Higher learning rate for high quality data, lower for low quality

### SFT


> finetune the LLM with **language modeling** (next token prediction) of the **desired answers** (supervised)

- LIMA: you need very little data for SFT! ~few thousands
- Just learns the format of desired answers (length, bullet points, ...)
    - the knowledge (every user) is already in the pretrained LLM!
    - **specializes** to one "type of user"
- intuition: all SFT teaches is how to format the desired answers.
    - pretrained models essentially model the distribution of **every user** on the internet,
        - one might write bullet points, another might answer a question with an answer
    - all you tell your model is: you should be optimizing more for **this type of user than another one**.
        - so you are **not actually teaching** it anything new through SFT,
        - in SFT all you do is tell the model to optimize for one type of user **that it already saw in the pretraining dataset**.
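
Since SFT is next-token prediction on the desired answers, its loss can be sketched as the usual negative log-likelihood restricted to answer tokens. Masking out the prompt tokens is a common choice and an assumption here, not something these notes specify; the probabilities and the helper name `sft_loss` are hypothetical.

```python
import math

def sft_loss(token_probs, is_answer):
    """Mean negative log-likelihood over answer tokens only.

    token_probs[i]: probability the model gave to the i-th target token.
    is_answer[i]:   True if that token belongs to the desired answer.
    """
    losses = [-math.log(p) for p, keep in zip(token_probs, is_answer) if keep]
    return sum(losses) / len(losses)

# Hypothetical example: 3 prompt tokens followed by 2 answer tokens.
probs = [0.1, 0.2, 0.3, 0.5, 0.25]
mask = [False, False, False, True, True]
loss = sft_loss(probs, mask)
```

The loss function is the same as in pretraining; only the data and the set of tokens contributing to it change.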

### RLHF

- not purely LLMs, not purely humans.
- Problem: SFT is **behavior cloning** of humans (RLHF maximizes human preference rather than cloning human behavior)
    - bound by human abilities: humans may prefer things that they are not able to generate
        - reading a book vs. writing a book
    - hallucination: cloning a correct answer teaches the LLM to hallucinate if it did not already know that fact.
    - price: collecting ideal answers is expensive.
- reward modeling
    - binary reward doesn't have much information
    - train a reward model $R(\cdot)$ using a logistic regression loss to classify preferences
        - $p(i\gt j)=\frac{\exp(R(x,\hat y_i))}{\exp(R(x,\hat y_i))+\exp(R(x,\hat y_j))}$
        - Use logits $R(\cdot)$ as reward => continuous information => information heavy!
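
A minimal sketch of the pairwise preference loss, assuming the standard Bradley-Terry form in which the preferred answer's reward appears in the numerator, so that $p(i \gt j)$ reduces to a sigmoid of the reward difference. The reward values are made up, and the function names are hypothetical.

```python
import math

def preference_prob(r_i, r_j):
    """p(i > j) = exp(r_i) / (exp(r_i) + exp(r_j)) = sigmoid(r_i - r_j)."""
    return 1.0 / (1.0 + math.exp(-(r_i - r_j)))

def reward_model_loss(r_chosen, r_rejected):
    """Logistic-regression loss: -log p(chosen > rejected)."""
    return -math.log(preference_prob(r_chosen, r_rejected))

# Hypothetical reward-model logits for two answers to the same prompt.
p = preference_prob(2.0, 0.5)
loss_good = reward_model_loss(2.0, 0.5)  # model already ranks the pair correctly
loss_bad = reward_model_loss(0.5, 2.0)   # model ranks the pair the wrong way
```

Only the difference between the two rewards matters, which is why the raw logits can serve as a continuous reward signal.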

- optimize $\mathbb{E}_{\hat{y} \sim p_\theta(\hat{y} | x)} \left[ R(x, \hat{y}) - \beta \log \frac{p_\theta(\hat{y} | x)}{p_{\text{ref}}(\hat{y} | x)} \right]$ using PPO
    - regularization avoids overoptimization
    - LMs are policies (actors), not a model of some distribution.
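
The effect of the regularization term can be seen on a tiny example. This sketch evaluates the objective exactly over three candidate answers instead of running PPO; the distributions, rewards, and the name `regularized_objective` are all hypothetical.

```python
import math

def regularized_objective(policy, ref, reward, beta):
    """Exact E_{y ~ policy}[ R(y) - beta * log(policy(y) / ref(y)) ] over a finite answer set."""
    return sum(
        p * (reward[y] - beta * math.log(p / ref[y]))
        for y, p in policy.items()
        if p > 0  # zero-probability answers contribute nothing
    )

# Hypothetical setup: three candidate answers for one prompt.
ref = {"a": 0.5, "b": 0.3, "c": 0.2}
reward = {"a": 0.0, "b": 1.0, "c": 0.2}
greedy = {"a": 0.0, "b": 1.0, "c": 0.0}    # puts all mass on the top-reward answer
moderate = {"a": 0.3, "b": 0.5, "c": 0.2}  # shifts toward "b" but stays near ref

same = regularized_objective(ref, ref, reward, beta=1.0)
greedy_reg = regularized_objective(greedy, ref, reward, beta=1.0)
moderate_reg = regularized_objective(moderate, ref, reward, beta=1.0)
```

With $\beta = 0$ the greedy policy scores highest, while with $\beta = 1$ the policy that stays close to the reference wins, which is how the penalty discourages overoptimizing the reward model.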