In [1]:
text = "I believe the meaning of life is"

In [2]:
from sentencepiece import SentencePieceProcessor
sp_model = SentencePieceProcessor(model_file='./tokenizer.model')

In [13]:
print(sp_model.bos_id())
print(sp_model.eos_id())
print(sp_model.pad_id())
print(sp_model.vocab_size())

1
2
-1
32000


In [6]:
prompt_tokens = [sp_model.bos_id(), *sp_model.encode(text)]
prompt_tokens

[1, 306, 4658, 278, 6593, 310, 2834, 338]

```
generation_tokens, generation_logprobs = self.generate(
    prompt_tokens=prompt_tokens,
    max_gen_len=max_gen_len,
    temperature=temperature,
    top_p=top_p,
    logprobs=logprobs,
    echo=echo,
)
```

- `echo (bool, optional): `
    - `Flag indicating whether to include prompt tokens in the generated output. Defaults to False.`
- `total_len = min(params.max_seq_len, max_gen_len + max_prompt_len)`
    - `params.max_seq_len`: 1024 (2048)
    - max_gen_len + max_prompt_len = 64 + 8 = 72
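The arithmetic above can be checked with a minimal sketch. It assumes `max_seq_len = 1024` (the first of the two values listed), `max_gen_len = 64`, and the 8-token prompt from the cell above; the variable names mirror the notes rather than any particular config object.

```python
# Values taken from the notes above; max_seq_len = 1024 is an assumed setting.
max_seq_len = 1024
max_gen_len = 64
max_prompt_len = 8  # len(prompt_tokens), BOS included

# The sequence buffer never exceeds the model's context window.
total_len = min(max_seq_len, max_gen_len + max_prompt_len)
print(total_len)  # 72
```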

## basics

- RMSNorm and SwiGLU: https://www.bilibili.com/video/BV1e14y1C7G8
- RoPE relative positional encoding:
    - https://www.bilibili.com/video/BV1Dh4y1P7KY/
    - https://www.bilibili.com/video/BV18u411M7j1/
- cache KV: https://www.bilibili.com/video/BV1FB4y1Z79y/
- GQA (Grouped Query Attention): https://www.bilibili.com/video/BV1vc411o7fa/
- top_p/top_k, nucleus sampling
    - https://www.bilibili.com/video/BV1Ho4y1x76q

## `Llama.generate`

| **model** | **heads** | **layers** | **dim** |   **head_dim**   |
|-----------|-----------|------------|---------|------------------|
| 7b        | 32        | 32         | 4096    | 4096/32=128      |
| 13b       | 40        | 40         | 5120    | 5120/40=128      |
| 70b       | 64        | 80         | 8192    | 8192/64=128      |
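As a quick sanity check on the table, the sketch below recomputes `head_dim = dim / heads` for each model size. The dictionary layout is only illustrative; the numbers are the ones in the table.

```python
# (heads, dim) per model size, copied from the table above.
configs = {"7b": (32, 4096), "13b": (40, 5120), "70b": (64, 8192)}

# Each attention head works on an equal slice of the model dimension.
head_dims = {name: dim // heads for name, (heads, dim) in configs.items()}
print(head_dims)  # {'7b': 128, '13b': 128, '70b': 128}
```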

### `logits = self.model.forward(tokens, prev_pos)`

```
for cur_pos in range(min_prompt_len, total_len):
    # logits.shape: [bsz, slice_seq_len, vocab_size]
    # first call, tokens [0, 8): the whole prompt
    # next call, tokens [8, 9)
    # next call, tokens [9, 10)
    logits = self.model.forward(tokens[:, prev_pos:cur_pos], prev_pos)
```
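The slice schedule described in the comments can be reproduced with a small sketch. It assumes `prev_pos` starts at 0 and is set to `cur_pos` after each step. `forward_slices` is a hypothetical helper for illustration and is not part of the Llama code.

```python
def forward_slices(min_prompt_len, total_len):
    """Yield the (prev_pos, cur_pos) slice of tokens fed to each forward call."""
    prev_pos = 0
    for cur_pos in range(min_prompt_len, total_len):
        yield prev_pos, cur_pos
        # Assumption: the next call starts where this one ended.
        prev_pos = cur_pos

slices = list(forward_slices(min_prompt_len=8, total_len=12))
print(slices)  # [(0, 8), (8, 9), (9, 10), (10, 11)]
```

Only the first slice covers more than one token; every later call passes exactly one new token.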

- The input prompt tokens enter as a whole in the first forward pass, so an upper-triangular (causal) mask is required there.

- I believe the meaning of life is
- I believe the meaning of life is to
- I believe the meaning of life is to find
- I believe the meaning of life is to find the
- I believe the meaning of life is to find the happiness
- I believe the meaning of life is to find the happiness that
- I believe the meaning of life is to find the happiness that comes
- I believe the meaning of life is to find the happiness that comes from
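The upper-triangular mask needed for the first forward pass can be sketched in plain Python as follows. `causal_mask` is a hypothetical helper, and the additive 0 / `-inf` convention is an assumption about how the mask is applied to the attention scores before the softmax.

```python
import math

def causal_mask(seqlen, start_pos=0):
    """Additive mask of shape [seqlen, start_pos + seqlen]:
    0.0 where attention is allowed, -inf where it is blocked."""
    mask = []
    for i in range(seqlen):
        row = []
        for j in range(start_pos + seqlen):
            # Query i sits at absolute position start_pos + i;
            # it must not attend to any key position after its own.
            row.append(-math.inf if j > start_pos + i else 0.0)
        mask.append(row)
    return mask

for row in causal_mask(4):
    print(row)
```

With a single new token (`seqlen=1`) every entry is 0.0, which is why the mask only matters for the multi-token prompt pass.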

### cache KV

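- Generation proceeds token by token.
- At the interface level: `logits = self.model.forward(tokens[:, prev_pos:cur_pos], prev_pos)`
    - After the prompt pass, each call takes only one new token.
    - Information about the historical tokens (input prompt + tokens generated so far) is cached in `cache_k` / `cache_v`.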
    - Attention layer

        ```
        # everything up to and including the current position (history included)
        keys = self.cache_k[:bsz, : start_pos + seqlen]
        values = self.cache_v[:bsz, : start_pos + seqlen]
        ```
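A toy version of the cache read/write pattern is sketched below. It assumes a single sequence and uses plain Python lists with string placeholders instead of tensors; `ToyKVCache` and its `update` method are hypothetical names for illustration only.

```python
class ToyKVCache:
    """Toy single-sequence cache; keys and values are placeholder objects."""

    def __init__(self, max_seq_len):
        self.cache_k = [None] * max_seq_len
        self.cache_v = [None] * max_seq_len

    def update(self, start_pos, new_k, new_v):
        seqlen = len(new_k)
        # Write the new tokens' keys/values at their absolute positions.
        self.cache_k[start_pos:start_pos + seqlen] = new_k
        self.cache_v[start_pos:start_pos + seqlen] = new_v
        # Read back everything up to and including the current tokens.
        return self.cache_k[:start_pos + seqlen], self.cache_v[:start_pos + seqlen]

cache = ToyKVCache(max_seq_len=16)
# Prompt pass: 8 tokens written at once.
keys, values = cache.update(0, [f"k{i}" for i in range(8)], [f"v{i}" for i in range(8)])
print(len(keys))  # 8
# Decode step: one new token, but attention sees all 9 cached positions.
keys, values = cache.update(8, ["k8"], ["v8"])
print(len(keys))  # 9
```

The new token's query attends over all cached keys, so the earlier tokens never need to be recomputed.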

### Other control parameters

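- Return-related:
    - `token_logprobs`
    - `echo`
- Generation-related:
    - `temperature`: the higher the temperature, the flatter the distribution (more randomness, higher entropy); the lower the temperature, the sharper the distribution (more deterministic, lower entropy)
        $$
        p_i = \frac{\exp\left(\frac{z_i}{T}\right)}{\sum_j\exp\left(\frac{z_j}{T}\right)}
        $$
    - `top_p`: nucleus sampling; https://www.bilibili.com/video/BV1Ho4y1x76q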
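The effect of temperature on entropy can be seen in a small stdlib-only sketch. The logits are made-up values for illustration, and `softmax_with_temperature` and `entropy` are hypothetical helper names.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, with max-subtraction for stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [2.0, 1.0, 0.1]  # made-up example logits
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

The printed entropy grows as `t` increases, which matches the "flatter at high temperature, sharper at low temperature" description.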

- temperature & top_p

```
if temperature > 0:
    # temperature and top_p are used together
    probs = torch.softmax(logits[:, -1] / temperature, dim=-1)
    next_token = sample_top_p(probs, top_p)
else:
    next_token = torch.argmax(logits[:, -1], dim=-1)
```
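The notes do not show the body of `sample_top_p`, so the following is only a sketch of nucleus sampling in plain Python, not the actual implementation. It assumes tokens are sorted by probability and a token is dropped once the probability mass before it already exceeds `p`; the `rng` parameter is an addition for reproducible testing.

```python
import random

def sample_top_p(probs, p, rng=random):
    """Sample a token index from the smallest high-probability set covering mass p."""
    # Token indices sorted from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        # Drop a token once the mass accumulated before it exceeds p.
        if cumulative > p:
            break
        kept.append(i)
        cumulative += probs[i]
    # random.choices renormalizes the kept weights internally.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]

probs = [0.5, 0.3, 0.15, 0.05]
rng = random.Random(0)
samples = {sample_top_p(probs, 0.7, rng) for _ in range(200)}
print(samples)  # only indices 0 and 1 can appear
```

With `p = 0.7` the nucleus is tokens 0 and 1 (mass 0.8); the low-probability tail is never sampled. The most probable token is always kept, so a very small `p` reduces to greedy decoding.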