## data

- Llama 3 was **pretrained** on over 15 trillion tokens (15T) of data from **publicly available sources**.
- The **fine-tuning** data includes publicly available **instruction datasets**, as well as over **10M human-annotated examples**.
- Neither the pretraining nor the fine-tuning datasets include Meta user data.

## config

- Meta-Llama-3-8B

```
{
    "dim": 4096,
    "n_layers": 32,
    "n_heads": 32,
    "n_kv_heads": 8,
    "vocab_size": 128256,
    "multiple_of": 1024,
    "ffn_dim_multiplier": 1.3,
    "norm_eps": 1e-05,
    "rope_theta": 500000.0
}
```

- rope_theta: the base $\theta$ of the rotary position embedding (RoPE) frequencies
    - $5\times 10^{5}$
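
A few quantities follow directly from this config. The sketch below derives the per-head dimension, the number of query heads that share one KV head (`n_heads / n_kv_heads`), and the RoPE inverse frequencies, assuming the usual formulation $\theta^{-2i/d}$ with $d$ equal to the head dimension. The variable names `head_dim`, `n_rep`, and `inv_freq` are chosen here for illustration and are not part of the config.

```python
# Values copied from the Meta-Llama-3-8B config above.
dim, n_heads, n_kv_heads, rope_theta = 4096, 32, 8, 500000.0

head_dim = dim // n_heads      # size of each attention head
n_rep = n_heads // n_kv_heads  # query heads sharing one KV head

# Assumed standard RoPE formulation: one inverse frequency per pair of
# channels, theta^(-2i / head_dim) for i = 0 .. head_dim/2 - 1.
inv_freq = [rope_theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]

print(head_dim, n_rep, len(inv_freq))  # 128 4 64
print(inv_freq[0])                     # 1.0 (fastest-rotating pair)
```

A larger base makes the later frequencies smaller, so the slowest-rotating channel pairs take longer to complete a full rotation.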

#### ffn layer shape

```
self.feed_forward = FeedForward(
    dim=args.dim,
    hidden_dim=4 * args.dim,
    multiple_of=args.multiple_of,
    ffn_dim_multiplier=args.ffn_dim_multiplier,
)
```

In [5]:
hidden_dim = 4*4096

```
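from typing import Optional

import torch.nn.functional as F
from fairscale.nn.model_parallel.layers import ColumnParallelLinear, RowParallelLinear
from torch import nn


class FeedForward(nn.Module):
    def __init__(
        self,
        dim: int,
        hidden_dim: int,
        multiple_of: int,
        ffn_dim_multiplier: Optional[float],
    ):
        super().__init__()
        # SwiGLU has three weight matrices instead of two, so 4 * dim is scaled by 2/3
        hidden_dim = int(2 * hidden_dim / 3)
        # custom dim factor multiplier
        if ffn_dim_multiplier is not None:
            hidden_dim = int(ffn_dim_multiplier * hidden_dim)
        # round up to the nearest multiple of `multiple_of`
        hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)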

        self.w1 = ColumnParallelLinear(
            dim, hidden_dim, bias=False, gather_output=False, init_method=lambda x: x
        )
        self.w2 = RowParallelLinear(
            hidden_dim, dim, bias=False, input_is_parallel=True, init_method=lambda x: x
        )
        self.w3 = ColumnParallelLinear(
            dim, hidden_dim, bias=False, gather_output=False, init_method=lambda x: x
        )

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```
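
The `forward` above computes `w2(silu(w1(x)) * w3(x))`, a SwiGLU-style gated feed-forward. As a minimal, dependency-free sketch of that same expression (plain Python lists instead of the model-parallel linear layers; the helper names `silu`, `matvec`, and `swiglu_ffn` and the toy weights are made up for illustration):

```python
import math


def silu(v: float) -> float:
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def swiglu_ffn(x, w1, w2, w3):
    """Compute w2(silu(w1 x) * (w3 x)) for a single input vector."""
    gate = [silu(g) for g in matvec(w1, x)]     # gating branch
    up = matvec(w3, x)                          # linear branch
    hidden = [g * u for g, u in zip(gate, up)]  # elementwise product
    return matvec(w2, hidden)                   # project back to dim


# Toy sizes: dim = 2, hidden_dim = 3 (the real model uses 4096 and 14336).
x = [1.0, -2.0]
w1 = [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]]  # hidden_dim x dim
w3 = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]  # hidden_dim x dim
w2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]     # dim x hidden_dim
y = swiglu_ffn(x, w1, w2, w3)
print(len(y))  # 2
```

Both `w1` and `w3` map `dim` to `hidden_dim`, and `w2` maps back, which is why the layer has three weight matrices.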

In [6]:
hidden_dim = int(2*hidden_dim / 3)
hidden_dim = int(1.3 * hidden_dim)
hidden_dim = 1024 * ((hidden_dim + 1024 - 1) // 1024)
hidden_dim

14336

In [7]:
14336 / 4096

3.5
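
The cells above can be folded into one helper that mirrors the arithmetic in `FeedForward.__init__`. This is a sketch for checking the numbers, not code from the model itself; the function name `ffn_hidden_dim` and the per-layer parameter count (three bias-free matrices of shape `dim x hidden_dim`) are introduced here for illustration.

```python
from typing import Optional


def ffn_hidden_dim(dim: int, multiple_of: int,
                   ffn_dim_multiplier: Optional[float] = None) -> int:
    """Reproduce the hidden-size arithmetic from FeedForward.__init__."""
    hidden_dim = int(2 * (4 * dim) / 3)  # 2/3 of the 4 * dim passed in
    if ffn_dim_multiplier is not None:
        hidden_dim = int(ffn_dim_multiplier * hidden_dim)
    # round up to the nearest multiple of `multiple_of`
    return multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)


hidden = ffn_hidden_dim(dim=4096, multiple_of=1024, ffn_dim_multiplier=1.3)
# w1, w2, w3 each hold dim * hidden weights and have no bias.
ffn_params_per_layer = 3 * 4096 * hidden

print(hidden, hidden / 4096, ffn_params_per_layer)
# 14336 3.5 176160768
```

The intermediate values are 16384, then 10922, then 14198, and rounding up to a multiple of 1024 gives 14336, i.e. 3.5 times `dim`.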