In [1]:
import json

In [18]:
def calculate_parameters(config, use_mqa=True):
    """
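    Calculate the total number of parameters for a Llama-style LLM architecture.

    Args:
        config (dict): Dictionary containing the model's configuration parameters.
        use_mqa (bool): If True, size the key/value projections from
            `num_key_value_heads` (multi-query or grouped-query attention);
            if False, count them as standard multi-head attention.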

    Returns:
        int: Total number of parameters in the model.
    """

    # Extract necessary parameters from the configuration
    vocab_size = config.get('vocab_size')
    hidden_size = config.get('hidden_size')
    intermediate_size = config.get('intermediate_size')
    num_hidden_layers = config.get('num_hidden_layers')
    num_attention_heads = config.get('num_attention_heads')
    num_key_value_heads = config.get('num_key_value_heads', num_attention_heads)
    head_dim = config.get('head_dim')
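    # Flags missing from a config are treated as False, matching the
    # Hugging Face LlamaConfig defaults.
    mlp_bias = config.get('mlp_bias', False)
    attention_bias = config.get('attention_bias', False)
    tie_word_embeddings = config.get('tie_word_embeddings', False)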

    if head_dim is None:
        head_dim = hidden_size // num_attention_heads

    # Initialize total parameters
    total_params = 0

    # 1. Embedding Layer
    # Token embeddings: vocab_size x hidden_size
    embedding_params = vocab_size * hidden_size
    total_params += embedding_params

    # 2. Transformer Layers
    transformer_layer_params = 0

    for _ in range(num_hidden_layers):
        layer_params = 0

        # a. Layer Normalization (RMSNorm)
        # Assuming scale parameters only (no bias)
        layernorm_params = 2 * hidden_size  # Two LayerNorms per layer
        layer_params += layernorm_params

        # b. Self-Attention Mechanism
        if use_mqa:
            # Multi-Query / Grouped-Query Attention
            # Query projection (W_q): hidden_size x (num_attention_heads * head_dim)
            W_q_params = hidden_size * num_attention_heads * head_dim
            # Key and Value projections (W_kv): hidden_size x (2 * head_dim * num_key_value_heads)
            W_kv_params = hidden_size * (2 * head_dim * num_key_value_heads)
            # Output projection (W_o): (num_attention_heads * head_dim) x hidden_size
            W_o_params = num_attention_heads * head_dim * hidden_size
            # Biases (if any)
            attention_bias_params = 0
            if attention_bias:
                # Biases for W_q, W_kv, and W_o
                attention_bias_params = num_attention_heads * head_dim + 2 * head_dim * num_key_value_heads + hidden_size
            attention_params = W_q_params + W_kv_params + W_o_params + attention_bias_params
        else:
            # Standard Multi-Head Attention
            # Query, Key, Value projections (W_q, W_k, W_v): each hidden_size x hidden_size
            W_q_params = hidden_size * hidden_size
            W_k_params = hidden_size * hidden_size
            W_v_params = hidden_size * hidden_size
            # Output projection (W_o): hidden_size x hidden_size
            W_o_params = hidden_size * hidden_size
            # Biases (if any)
            attention_bias_params = 0
            if attention_bias:
                # Biases for W_q, W_k, W_v, and W_o
                attention_bias_params = 3 * hidden_size + hidden_size
            attention_params = W_q_params + W_k_params + W_v_params + W_o_params + attention_bias_params

        layer_params += attention_params

        # c. Feed-Forward Network (MLP)
        gate_proj = hidden_size * intermediate_size
        up_proj = hidden_size * intermediate_size
        down_proj = intermediate_size * hidden_size
        # Biases (if any): gate_proj and up_proj each add intermediate_size,
        # down_proj adds hidden_size
        mlp_bias_params = 0
        if mlp_bias:
            mlp_bias_params = 2 * intermediate_size + hidden_size
        mlp_params = gate_proj + up_proj + down_proj + mlp_bias_params

        layer_params += mlp_params

        # Add layer parameters to total transformer parameters
        transformer_layer_params += layer_params

    total_params += transformer_layer_params

    # 3. Final Layer Normalization
    # Assuming scale parameters only (no bias)
    final_layernorm_params = hidden_size
    total_params += final_layernorm_params

    # 4. Output Projection Layer (if embeddings are not tied)
    if not tie_word_embeddings:
        output_projection_params = hidden_size * vocab_size
        total_params += output_projection_params

    return total_params
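
Because every Transformer layer has the same shape, the loop above adds the same per-layer value `num_hidden_layers` times, so the count collapses to a closed form. The sketch below is a minimal bias-free version under that assumption. The helper `closed_form_params` and its keyword names are illustrative and not part of the notebook; the sample values are the Llama-3.2-1B settings listed in the next section.

```python
def closed_form_params(vocab_size, hidden_size, intermediate_size,
                       num_hidden_layers, num_attention_heads,
                       num_key_value_heads, head_dim, tie_word_embeddings):
    """Bias-free parameter count for a Llama-style decoder (illustrative helper)."""
    q_dim = num_attention_heads * head_dim      # width of the query/output projections
    kv_dim = num_key_value_heads * head_dim     # width of each key/value projection
    attention = hidden_size * q_dim + 2 * hidden_size * kv_dim + q_dim * hidden_size
    mlp = 3 * hidden_size * intermediate_size   # gate_proj, up_proj, down_proj
    norms = 2 * hidden_size                     # two RMSNorm scale vectors per layer
    per_layer = attention + mlp + norms

    embeddings = vocab_size * hidden_size
    lm_head = 0 if tie_word_embeddings else vocab_size * hidden_size
    final_norm = hidden_size
    return embeddings + num_hidden_layers * per_layer + final_norm + lm_head


llama_3_2_1b = dict(vocab_size=128256, hidden_size=2048, intermediate_size=8192,
                    num_hidden_layers=16, num_attention_heads=32,
                    num_key_value_heads=8, head_dim=64, tie_word_embeddings=True)
total_1b = closed_form_params(**llama_3_2_1b)
print(f"{total_1b:,}")
```

Writing the count this way makes it easier to see which term dominates as a model scales: the per-layer term is multiplied by the depth, while the embedding term is paid once (or twice when the embeddings are untied).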

# LLAMA-3.2-1B

**Config**

- head_dim : 64
- hidden_size : 2048
- intermediate_size : 8192
- mlp_bias : false
- num_attention_heads : 32
- num_hidden_layers : 16
- num_key_value_heads : 8
- vocab_size : 128256
- tie_word_embeddings : true


## Parameter Calculation

### **1. Embedding Layer**

- **Token Embeddings**: Maps `vocab_size` tokens to `hidden_size` embeddings.
  - Parameters: `vocab_size × hidden_size`
  - Calculation: `128,256 × 2,048 = 262,668,288`

### **2. Transformer Layers**

There are `num_hidden_layers = 16` Transformer layers. Each layer consists of:

#### **a. Layer Normalization (RMSNorm)**

- **Parameters per layer**: `2 × hidden_size` (since there are two RMSNorms per layer)
  - Calculation: `2 × 2,048 = 4,096`

#### **b. Self-Attention Mechanism**

- **Query Projection (W_q)**:
  - Parameters: `hidden_size × (num_attention_heads × head_dim)`
  - Calculation: `2,048 × (32 × 64) = 4,194,304`
  
- **Key and Value Projections (W_kv)**:
  - Parameters: `hidden_size × (2 × head_dim × num_key_value_heads)`
  - Calculation: `2,048 × (2 × 64 × 8) = 2,097,152`
  
- **Output Projection (W_o)**:
  - Parameters: `hidden_size × hidden_size`
  - Calculation: `2,048 × 2,048 = 4,194,304`
  
- **Total Attention Parameters per layer**:
  - Calculation: `4,194,304 (W_q) + 2,097,152 (W_kv) + 4,194,304 (W_o) = 10,485,760`
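
The saving from sharing key/value heads can be checked directly. This is a small sketch using the 1B values above; `attention_params` is a hypothetical helper, and setting `num_key_value_heads` equal to `num_attention_heads` stands in for standard multi-head attention.

```python
def attention_params(hidden_size, num_attention_heads, num_key_value_heads, head_dim):
    """Bias-free weights in one attention block: W_q, W_k, W_v, W_o."""
    q_dim = num_attention_heads * head_dim
    kv_dim = num_key_value_heads * head_dim
    w_q = hidden_size * q_dim
    w_kv = 2 * hidden_size * kv_dim   # keys and values together
    w_o = q_dim * hidden_size
    return w_q + w_kv + w_o


shared_kv = attention_params(2048, 32, 8, 64)    # 8 key/value heads, as in the config
full_kv = attention_params(2048, 32, 32, 64)     # one key/value head per query head
print(shared_kv, full_kv, full_kv - shared_kv)
```

With 8 key/value heads the block needs 10,485,760 weights instead of 16,777,216, a saving of 6,291,456 per layer.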

#### **c. Feed-Forward Network (MLP)**

- **up_proj**:
  - Parameters: `hidden_size × intermediate_size`
  - Calculation: `2,048 × 8,192 = 16,777,216`

- **gate_proj**:
  - Parameters: `hidden_size × intermediate_size`
  - Calculation: `2,048 × 8,192 = 16,777,216`


- **down_proj**:
  - Parameters: `intermediate_size × hidden_size`
  - Calculation: `8,192 × 2,048 = 16,777,216`


  
- **Total MLP Parameters per layer**:
  - Calculation: `16,777,216 (up_proj) + 16,777,216 (down_proj) + 16,777,216 (gate_proj) = 50,331,648`

#### **d. Total Parameters per Transformer Layer**

- **Sum of all components per layer**:
  - Calculation: `4,096 (LayerNorm) + 10,485,760 (Attention) + 50,331,648 (MLP) = 60,821,504`

#### **e. Total Parameters for All Transformer Layers**

- **Total for 16 layers**:
  - Calculation: `60,821,504 × 16 = 973,144,064`

### **3. Final Layer Normalization**

- **Parameters**: `hidden_size`
  - Calculation: `2,048`

### **4. Total Model Parameters**

- **Sum of Embedding Layer, Transformer Layers, and Final LayerNorm**:
  - Calculation: `262,668,288 (Embeddings) + 973,144,064 (Transformer Layers) + 2,048 (Final LayerNorm) = 1,235,814,400`

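Therefore, the model has a total of 1,235,814,400 parameters, roughly 1.24B. This assumes the output projection shares its weights with the input embedding (`tie_word_embeddings` is true), so no separate output matrix is counted. It also assumes grouped-query attention with 8 key/value heads (the `use_mqa=True` path in the code), which shrinks the key and value projections from `2,048 × 2,048` to `2,048 × 512` each.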

In [19]:
## Code Verification

with open('configs/llama-3.2-1B.json') as f:
    config_llama3_2_1B = json.load(f)

num_params_with_mqa = calculate_parameters(config_llama3_2_1B, use_mqa=True)
num_params_without_mqa = calculate_parameters(config_llama3_2_1B, use_mqa=False)
print(f"Number of parameters with MQA: {num_params_with_mqa}")
print(f"Number of parameters without MQA: {num_params_without_mqa}")

Number of parameters with MQA: 1235814400
Number of parameters without MQA: 1336477696


# LLAMA-3.2-3B

**Config**

- head_dim : 128
- hidden_size : 3072
- intermediate_size : 8192
- mlp_bias : false
- num_attention_heads : 24
- num_hidden_layers : 28
- num_key_value_heads : 8
- vocab_size : 128256
- tie_word_embeddings : true

### **1. Embedding Layer**

- **Token Embeddings**: Maps `vocab_size` tokens to `hidden_size` embeddings.
  - **Parameters**: `vocab_size × hidden_size`
  - **Calculation**: `128,256 × 3,072 = 394,002,432`

### **2. Transformer Layers**

There are `num_hidden_layers = 28` Transformer layers. Each layer consists of:

#### **a. Layer Normalization (RMSNorm)**

- **Parameters per layer**: `2 × hidden_size` (since there are two RMSNorms per layer)
  - **Calculation**: `2 × 3,072 = 6,144`

#### **b. Self-Attention Mechanism**

- **Query Projection (W_q)**:
  - **Parameters**: `hidden_size × hidden_size`
  - **Calculation**: `3,072 × 3,072 = 9,437,184`
  
- **Key and Value Projections (W_kv)**:
  - **Parameters**: `hidden_size × (2 × head_dim × num_key_value_heads)`
  - **Calculation**: `3,072 × (2 × 128 × 8) = 3,072 × 2,048 = 6,291,456`
  
- **Output Projection (W_o)**:
  - **Parameters**: `hidden_size × hidden_size`
  - **Calculation**: `3,072 × 3,072 = 9,437,184`
  
- **Total Attention Parameters per layer**:
  - **Calculation**: `9,437,184 (W_q) + 6,291,456 (W_kv) + 9,437,184 (W_o) = 25,165,824`

#### **c. Feed-Forward Network (MLP)**

- **Gate Proj**:
  - **Parameters**: `hidden_size × intermediate_size`
  - **Calculation**: `3,072 × 8,192 = 25,165,824`

- **Up Proj**:
  - **Parameters**: `hidden_size × intermediate_size`
  - **Calculation**: `3,072 × 8,192 = 25,165,824`

- **Down Proj**:
  - **Parameters**: `intermediate_size × hidden_size`
  - **Calculation**: `8,192 × 3,072 = 25,165,824`
  
- **Total MLP Parameters per layer**:
  - **Calculation**: `25,165,824 (gate_proj) + 25,165,824 (up_proj) + 25,165,824 (down_proj) = 75,497,472`

#### **d. Total Parameters per Transformer Layer**

- **Sum of all components per layer**:
  - **Calculation**: `6,144 (LayerNorm) + 25,165,824 (Attention) + 75,497,472 (MLP) = 100,669,440`

#### **e. Total Parameters for All Transformer Layers**

- **Total for 28 layers**:
  - **Calculation**: `100,669,440 × 28 = 2,818,744,320`

### **3. Final Layer Normalization**

- **Parameters**: `hidden_size`
  - **Calculation**: `3,072`

### **4. Total Model Parameters**

- **Sum of Embedding Layer, Transformer Layers, and Final LayerNorm**:
  - **Calculation**: `394,002,432 (Embeddings) + 2,818,744,320 (Transformer Layers) + 3,072 (Final LayerNorm) = 3,212,749,824`

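**Answer**: The model has a total of 3,212,749,824 parameters, roughly 3.21B. This assumes the output projection shares its weights with the input embedding (`tie_word_embeddings` is true), so no separate output matrix is counted. It also assumes grouped-query attention with 8 key/value heads (the `use_mqa=True` path in the code), which shrinks the key and value projections from `3,072 × 3,072` to `3,072 × 1,024` each.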

In [25]:
## Code Verification

with open('configs/llama-3.2-3B.json') as f:
    config_llama3_2_3B = json.load(f)

num_params_with_mqa = calculate_parameters(config_llama3_2_3B, use_mqa=True)
num_params_without_mqa = calculate_parameters(config_llama3_2_3B, use_mqa=False)
print(f"Number of parameters with MQA: {num_params_with_mqa}")
print(f"Number of parameters without MQA: {num_params_without_mqa}")

Number of parameters with MQA: 3212749824
Number of parameters without MQA: 3565071360


In [31]:
4096*4096

16777216

# LLAMA3.1-8B

**Config:**

- **hidden_size**: 4,096
- **intermediate_size**: 14,336
- **mlp_bias**: false
- **num_attention_heads**: 32
- **num_hidden_layers**: 32
- **num_key_value_heads**: 8
- **vocab_size**: 128,256
- **head_dim**: hidden_size // num_attention_heads = 4,096 // 32 = 128
- **tie_word_embeddings** : False

### **1. Embedding Layer**

- **Token Embeddings**: Maps `vocab_size` tokens to `hidden_size` embeddings.
  - **Parameters**: `vocab_size × hidden_size`
  - **Calculation**: `128,256 × 4,096 = 525,336,576`

### **2. Transformer Layers**

There are `num_hidden_layers = 32` Transformer layers. Each layer consists of:

#### **a. Layer Normalization (RMSNorm)**

- **Parameters per layer**: `2 × hidden_size` (since there are two RMSNorms per layer)
  - **Calculation**: `2 × 4,096 = 8,192`

#### **b. Self-Attention Mechanism**

- **Query Projection (W_q)**:
  - **Parameters**: `hidden_size × hidden_size`
  - **Calculation**: `4,096 × 4,096 = 16,777,216`

- **Key Projection (W_k)**:
  - **Parameters**: `hidden_size × (num_key_value_heads × head_dim)`
  - **Calculation**: `4,096 × (8 × 128) = 4,096 × 1,024 = 4,194,304`

- **Value Projection (W_v)**:
  - **Parameters**: Same as Key Projection
  - **Calculation**: `4,194,304`

- **Output Projection (W_o)**:
  - **Parameters**: `(num_attention_heads × head_dim) × hidden_size`
  - **Calculation**: `(32 × 128) × 4,096 = 4,096 × 4,096 = 16,777,216`

- **Total Attention Parameters per layer**:
  - **Calculation**: `16,777,216 (W_q) + 4,194,304 (W_k) + 4,194,304 (W_v) + 16,777,216 (W_o) = 41,943,040`

#### **c. Feed-Forward Network (MLP)**

- **Gate Proj**:
  - **Parameters**: `hidden_size × intermediate_size`
  - **Calculation**: `4,096 × 14,336 = 58,720,256`

- **Up Proj**:
  - **Parameters**: `hidden_size × intermediate_size`
  - **Calculation**: `4,096 × 14,336 = 58,720,256`

- **Down Proj**:
  - **Parameters**: `intermediate_size × hidden_size`
  - **Calculation**: `14,336 × 4,096 = 58,720,256`

- **Total MLP Parameters per layer**:
  - **Calculation**: `58,720,256 (gate_proj) + 58,720,256 (up_proj) + 58,720,256 (down_proj) = 176,160,768`

#### **d. Total Parameters per Transformer Layer**

- **Sum of all components per layer**:
  - **Calculation**: `8,192 (LayerNorm) + 41,943,040 (Attention) + 176,160,768 (MLP) = 218,112,000`

#### **e. Total Parameters for All Transformer Layers**

- **Total for 32 layers**:
  - **Calculation**: `218,112,000 × 32 = 6,979,584,000`

### **3. Final Layer Normalization**

- **Parameters**: `hidden_size`
  - **Calculation**: `4,096`

### **4. Total Model Parameters**

- **Sum of Embedding Layer, Transformer Layers, and Final LayerNorm**:
  - **Calculation**: `2 * 525,336,576 (Embeddings) + 6,979,584,000 (Transformer Layers) + 4,096 (Final LayerNorm)  = 8,030,261,248`

**Answer**: The model has a total of 8,030,261,248 parameters. Because `tie_word_embeddings` is False, the output projection is a separate `vocab_size × hidden_size` matrix and is counted in addition to the input embeddings, which is why the embedding term appears as `2 * 525,336,576`.
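
The effect of `tie_word_embeddings` can be isolated in a few lines. The sketch below uses the 8B numbers from this section; `vocab_matrix_params` is an illustrative name, not something defined in the notebook, and the tied total is a what-if figure, since this model's config sets the flag to False.

```python
def vocab_matrix_params(vocab_size, hidden_size, tie_word_embeddings):
    """Parameters in the input embedding plus the output projection."""
    embedding = vocab_size * hidden_size
    # When tied, the output projection reuses the embedding matrix and adds nothing.
    lm_head = 0 if tie_word_embeddings else vocab_size * hidden_size
    return embedding + lm_head


untied = vocab_matrix_params(128256, 4096, tie_word_embeddings=False)
tied = vocab_matrix_params(128256, 4096, tie_word_embeddings=True)
layers_and_final_norm = 6_979_584_000 + 4_096   # taken from the steps above

print(f"untied total: {untied + layers_and_final_norm:,}")
print(f"tied total:   {tied + layers_and_final_norm:,}")
```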


In [32]:
## Code Verification

with open('configs/llama-3.1-8B.json') as f:
    config_llama3_1_8B = json.load(f)

num_params_with_mqa = calculate_parameters(config_llama3_1_8B, use_mqa=True)
num_params_without_mqa = calculate_parameters(config_llama3_1_8B, use_mqa=False)
print(f"Number of parameters with MQA: {num_params_with_mqa}")
print(f"Number of parameters without MQA: {num_params_without_mqa}")

Number of parameters with MQA: 8030261248
Number of parameters without MQA: 8835567616


# LLAMA 3.1 70B

**Config:**

- **hidden_size**: 8,192
- **intermediate_size**: 28,672
- **mlp_bias**: false
- **num_attention_heads**: 64
- **num_hidden_layers**: 80
- **num_key_value_heads**: 8
- **vocab_size**: 128,256
- **head_dim**: hidden_size // num_attention_heads = 8,192 // 64 = 128
- **tie_word_embeddings** : False
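
When a config omits `head_dim`, it is derived as `hidden_size // num_attention_heads`. A quick check, using the `hidden_size` and `num_attention_heads` values from the three Llama 3.1 config sections in this notebook, shows that all of them arrive at 128. The dictionary name `llama_3_1_shapes` is only for illustration.

```python
# (hidden_size, num_attention_heads) as listed in each model's config section
llama_3_1_shapes = {
    "8B": (4096, 32),
    "70B": (8192, 64),
    "405B": (16384, 128),
}

head_dims = {name: hidden // heads for name, (hidden, heads) in llama_3_1_shapes.items()}
print(head_dims)
```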

### 1. Embedding Layer

- **Token Embeddings**:
  - Parameters: `vocab_size × hidden_size = 128,256 × 8,192 = 1,050,673,152`
- **Output Embeddings** (since `tie_word_embeddings` is False):
  - Parameters: `1,050,673,152` (same as token embeddings)
- **Total Embedding Parameters**:
  - Calculation: `1,050,673,152 (Token Embeddings) + 1,050,673,152 (Output Embeddings) = 2,101,346,304`

### 2. Transformer Layers

There are `num_hidden_layers = 80` Transformer layers. Each layer consists of:

#### a. Layer Normalization (RMSNorm)

- **Parameters per layer**:
  - Calculation: `2 × hidden_size = 2 × 8,192 = 16,384`

#### b. Self-Attention Mechanism

- **Query Projection (W_q)**:
  - Parameters: `hidden_size × hidden_size`
  - Calculation: `8,192 × 8,192 = 67,108,864`
- **Key Projection (W_k)**:
  - Parameters: `hidden_size × (num_key_value_heads × head_dim)`
  - Calculation: `8,192 × (8 × 128) = 8,192 × 1,024 = 8,388,608`
- **Value Projection (W_v)**:
  - Parameters: Same as Key Projection
  - Calculation: `8,388,608`
- **Output Projection (W_o)**:
  - Parameters: `(num_attention_heads × head_dim) × hidden_size`
  - Calculation: `(64 × 128) × 8,192 = 8,192 × 8,192 = 67,108,864`
- **Total Attention Parameters per layer**:
  - Calculation: `67,108,864 (W_q) + 8,388,608 (W_k) + 8,388,608 (W_v) + 67,108,864 (W_o) = 150,994,944`

#### c. Feed-Forward Network (MLP)

- **Gate Proj**:
  - Parameters: `hidden_size × intermediate_size`
  - Calculation: `8,192 × 28,672 = 234,881,024`
- **Up Proj**:
  - Parameters: `hidden_size × intermediate_size`
  - Calculation: `8,192 × 28,672 = 234,881,024`
- **Down Proj**:
  - Parameters: `intermediate_size × hidden_size`
  - Calculation: `28,672 × 8,192 = 234,881,024`
- **Total MLP Parameters per layer**:
  - Calculation: `234,881,024 (Gate Proj) + 234,881,024 (Up Proj) + 234,881,024 (Down Proj) = 704,643,072`

#### d. Total Parameters per Transformer Layer

- **Sum of all components per layer**:
  - Calculation: `16,384 (LayerNorm) + 150,994,944 (Attention) + 704,643,072 (MLP) = 855,654,400`

#### e. Total Parameters for All Transformer Layers

- **Total for 80 layers**:
  - Calculation: `855,654,400 × 80 = 68,452,352,000`

### 3. Final Layer Normalization

- **Parameters**:
  - Calculation: `hidden_size = 8,192`

### 4. Total Model Parameters

- **Sum of Embedding Layer, Transformer Layers, and Final LayerNorm**:
  - Calculation: `2,101,346,304 (Embeddings) + 68,452,352,000 (Transformer Layers) + 8,192 (Final LayerNorm) = 70,553,706,496`

**Answer**: The model has a total of **70,553,706,496** parameters.

In [35]:
## Code Verification

with open('configs/llama-3.1-70B.json') as f:
    config_llama3_1_70B = json.load(f)

num_params_with_mqa = calculate_parameters(config_llama3_1_70B, use_mqa=True)
num_params_without_mqa = calculate_parameters(config_llama3_1_70B, use_mqa=False)
print(f"Number of parameters with MQA: {num_params_with_mqa}")
print(f"Number of parameters without MQA: {num_params_without_mqa}")

Number of parameters with MQA: 70553706496
Number of parameters without MQA: 79948947456


In [38]:
config_llama3_1_405B

{'architectures': ['LlamaForCausalLM'],
 'attention_bias': False,
 'attention_dropout': 0.0,
 'bos_token_id': 128000,
 'eos_token_id': 128001,
 'hidden_act': 'silu',
 'hidden_size': 16384,
 'initializer_range': 0.02,
 'intermediate_size': 53248,
 'max_position_embeddings': 131072,
 'mlp_bias': False,
 'model_type': 'llama',
 'num_attention_heads': 128,
 'num_hidden_layers': 126,
 'num_key_value_heads': 8,
 'pretraining_tp': 1,
 'rms_norm_eps': 1e-05,
 'rope_scaling': {'factor': 8.0,
  'low_freq_factor': 1.0,
  'high_freq_factor': 4.0,
  'original_max_position_embeddings': 8192,
  'rope_type': 'llama3'},
 'rope_theta': 500000.0,
 'tie_word_embeddings': False,
 'torch_dtype': 'bfloat16',
 'transformers_version': '4.42.3',
 'use_cache': True,
 'vocab_size': 128256}

# LLAMA 3.1 405B

**Config:**

- **hidden_size**: 16,384
- **intermediate_size**: 53,248
- **mlp_bias**: false
- **num_attention_heads**: 128
- **num_hidden_layers**: 126
- **num_key_value_heads**: 8
- **vocab_size**: 128,256
- **head_dim**: hidden_size // num_attention_heads = 16,384 // 128 = 128
- **tie_word_embeddings** : False

### 1. Embedding Layer

- **Token Embeddings**:
  - Parameters: `vocab_size × hidden_size = 128,256 × 16,384 = 2,101,346,304`
- **Output Embeddings** (since `tie_word_embeddings` is False):
  - Parameters: `2,101,346,304`
- **Total Embedding Parameters**:
  - Calculation: `2,101,346,304 + 2,101,346,304 = 4,202,692,608`

### 2. Transformer Layers

There are `num_hidden_layers = 126` Transformer layers. Each layer consists of:

#### a. Layer Normalization (RMSNorm)

- **Parameters per layer**:
  - Calculation: `2 × hidden_size = 2 × 16,384 = 32,768`

#### b. Self-Attention Mechanism

- **Query Projection (W_q)**:
  - Parameters: `hidden_size × hidden_size = 16,384 × 16,384 = 268,435,456`
- **Key Projection (W_k)**:
  - Parameters: `hidden_size × (num_key_value_heads × head_dim) = 16,384 × (8 × 128) = 16,384 × 1,024 = 16,777,216`
- **Value Projection (W_v)**:
  - Parameters: `16,777,216` (same as W_k)
- **Output Projection (W_o)**:
  - Parameters: `(num_attention_heads × head_dim) × hidden_size = (128 × 128) × 16,384 = 16,384 × 16,384 = 268,435,456`
- **Total Attention Parameters per layer**:
  - Calculation: `268,435,456 (W_q) + 16,777,216 (W_k) + 16,777,216 (W_v) + 268,435,456 (W_o) = 570,425,344`

#### c. Feed-Forward Network (MLP)

- **Gate Proj**:
  - Parameters: `hidden_size × intermediate_size = 16,384 × 53,248 = 872,415,232`
- **Up Proj**:
  - Parameters: `16,384 × 53,248 = 872,415,232`
- **Down Proj**:
  - Parameters: `intermediate_size × hidden_size = 53,248 × 16,384 = 872,415,232`
- **Total MLP Parameters per layer**:
  - Calculation: `872,415,232 + 872,415,232 + 872,415,232 = 2,617,245,696`

#### d. Total Parameters per Transformer Layer

- **Sum of all components per layer**:
  - Calculation: `32,768 (LayerNorm) + 570,425,344 (Attention) + 2,617,245,696 (MLP) = 3,187,703,808`

#### e. Total Parameters for All Transformer Layers

- **Total for 126 layers**:
  - Calculation: `3,187,703,808 × 126 = 401,650,679,808`

### 3. Final Layer Normalization

- **Parameters**:
  - Calculation: `hidden_size = 16,384`

### 4. Total Model Parameters

- **Sum of Embedding Layer, Transformer Layers, and Final LayerNorm**:
  - Calculation: `4,202,692,608 (Embeddings) + 401,650,679,808 (Transformer Layers) + 16,384 (Final LayerNorm) = 405,853,388,800`

**Answer**: The model has a total of **405,853,388,800** parameters.
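
To see where the parameters sit, the per-component figures above can be regrouped by type. This is a rough sketch that uses only numbers from this section; the variable names are illustrative.

```python
num_layers = 126
attention_per_layer = 570_425_344
mlp_per_layer = 2_617_245_696
norm_per_layer = 32_768

components = {
    "embeddings + output projection": 4_202_692_608,
    "attention": num_layers * attention_per_layer,
    "mlp": num_layers * mlp_per_layer,
    "norms": num_layers * norm_per_layer + 16_384,  # per-layer norms plus the final norm
}
total = sum(components.values())

# Print each component's count and its share of the total.
for name, count in components.items():
    print(f"{name:32s} {count:>18,d} {100 * count / total:6.2f}%")
print(f"{'total':32s} {total:>18,d}")
```

The MLP projections account for about four fifths of the total, which is why `intermediate_size` has such a large effect on model size.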


In [37]:
## Code Verification

with open('configs/llama-3.1-405B.json') as f:
    config_llama3_1_405B = json.load(f)

num_params_with_mqa = calculate_parameters(config_llama3_1_405B, use_mqa=True)
num_params_without_mqa = calculate_parameters(config_llama3_1_405B, use_mqa=False)
print(f"Number of parameters with MQA: {num_params_with_mqa}")
print(f"Number of parameters without MQA: {num_params_without_mqa}")

Number of parameters with MQA: 405853388800
Number of parameters without MQA: 469271265280
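
The gap between the two printed totals in each verification cell comes entirely from the key and value projections: with `use_mqa=False` each is `hidden_size × hidden_size`, while with `use_mqa=True` each is `hidden_size × (num_key_value_heads × head_dim)`. The sketch below reproduces that gap for all five models, assuming bias-free attention as in these configs. The shapes and totals are copied from the config sections and printed outputs, and the helper name `kv_savings` is illustrative.

```python
def kv_savings(hidden_size, num_hidden_layers, num_key_value_heads, head_dim):
    """Weights saved by shrinking W_k and W_v to num_key_value_heads * head_dim columns."""
    kv_dim = num_key_value_heads * head_dim
    return num_hidden_layers * 2 * hidden_size * (hidden_size - kv_dim)


# name: (hidden_size, num_hidden_layers, num_key_value_heads, head_dim,
#        printed total with shared K/V heads, printed total without)
models = {
    "3.2-1B": (2048, 16, 8, 64, 1235814400, 1336477696),
    "3.2-3B": (3072, 28, 8, 128, 3212749824, 3565071360),
    "3.1-8B": (4096, 32, 8, 128, 8030261248, 8835567616),
    "3.1-70B": (8192, 80, 8, 128, 70553706496, 79948947456),
    "3.1-405B": (16384, 126, 8, 128, 405853388800, 469271265280),
}

for name, (hidden, layers, kv_heads, head_dim, with_shared, without_shared) in models.items():
    saved = kv_savings(hidden, layers, kv_heads, head_dim)
    # The formula should account for the whole difference between the two totals.
    assert saved == without_shared - with_shared
    print(f"{name:9s} saves {saved:>15,d} parameters")
```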
