### Text Classification

- Sentiment analysis (positive / negative)
- Topic classification
- Spam detection
- Intent classification (chatbots)

### Token Classification

- Named Entity Recognition (NER)
- Part-of-Speech tagging
- Chunking


## Initialize the Model and Check Parameter Counts

In [72]:
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
)

total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(model.eval())
print(f"Total parameters     : {total_params:,}")
print(f"Trainable parameters : {trainable_params:,}")

# Print the same counts in millions
print(f"Total parameters     : {total_params/1_000_000:.2f} Million")
print(f"Trainable parameters : {trainable_params/1_000_000:.2f} Million")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

## Embedding & Layer Norm Parameters

In [65]:
emb = model.bert.embeddings

word_emb = emb.word_embeddings.weight.numel()
pos_emb  = emb.position_embeddings.weight.numel()
seg_emb  = emb.token_type_embeddings.weight.numel()
ln_emb   = sum(p.numel() for p in emb.LayerNorm.parameters())

print("Embeddings")
print(" Word embeddings      :", f"{word_emb:,}")
print(" Position embeddings  :", f"{pos_emb:,}")
print(" Segment embeddings   :", f"{seg_emb:,}")
print(" LayerNorm            :", f"{ln_emb:,}")
print(" Total embeddings     :", f"{word_emb + pos_emb + seg_emb + ln_emb:,}")


Embeddings
 Word embeddings      : 23,440,896
 Position embeddings  : 393,216
 Segment embeddings   : 1,536
 LayerNorm            : 1,536
 Total embeddings     : 23,837,184
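
The counts above follow directly from the embedding table shapes. As a cross-check, here is a minimal sketch that recomputes them from the dimensions shown in the model printout (30522 × 768, 512 × 768, 2 × 768). The variable names are illustrative and are not part of the `transformers` API.

```python
# Dimensions taken from the model printout above (bert-base-uncased).
vocab_size = 30522       # rows of word_embeddings
max_positions = 512      # rows of position_embeddings
type_vocab_size = 2      # rows of token_type_embeddings
hidden_size = 768        # width of every embedding vector

expected_word = vocab_size * hidden_size
expected_pos = max_positions * hidden_size
expected_seg = type_vocab_size * hidden_size
expected_ln = 2 * hidden_size  # LayerNorm has a weight vector and a bias vector

expected_embeddings = expected_word + expected_pos + expected_seg + expected_ln
print(f"Expected embedding parameters: {expected_embeddings:,}")
```

Almost all of the embedding parameters sit in the word embedding table, so the vocabulary size dominates this part of the model.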


## Encoder Layer and Attention Head Parameters

In [66]:
# All 12 encoder layers share the same architecture,
# so every layer has exactly the same parameter count.
def count_module_params(module):
    return sum(p.numel() for p in module.parameters())

total_encoder = 0

for i, layer in enumerate(model.bert.encoder.layer):
    layer_params = count_module_params(layer)
    total_encoder += layer_params
    print(f"Encoder layer {i:02d} parameters: {layer_params:,}")

print("Total encoder parameters:", f"{total_encoder:,}")


Encoder layer 00 parameters: 7,087,872
Encoder layer 01 parameters: 7,087,872
Encoder layer 02 parameters: 7,087,872
Encoder layer 03 parameters: 7,087,872
Encoder layer 04 parameters: 7,087,872
Encoder layer 05 parameters: 7,087,872
Encoder layer 06 parameters: 7,087,872
Encoder layer 07 parameters: 7,087,872
Encoder layer 08 parameters: 7,087,872
Encoder layer 09 parameters: 7,087,872
Encoder layer 10 parameters: 7,087,872
Encoder layer 11 parameters: 7,087,872
Total encoder parameters: 85,054,464
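
The per-layer figure can also be derived by hand. The sketch below is a closed-form count, assuming the layer layout used in this notebook: four 768 → 768 linear layers for attention (Q, K, V, output), a 768 → 3072 and a 3072 → 768 linear layer for the feed-forward network, and two LayerNorms. The helper names `linear_params` and `encoder_layer_params` are hypothetical, introduced only for this illustration.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


def encoder_layer_params(hidden_size: int, intermediate_size: int) -> int:
    """Closed-form parameter count of one BERT-style encoder layer."""
    # Q, K, V projections and the attention output projection.
    attention = 4 * linear_params(hidden_size, hidden_size)
    # Feed-forward network: expand to intermediate_size, then project back.
    ffn = (linear_params(hidden_size, intermediate_size)
           + linear_params(intermediate_size, hidden_size))
    # Two LayerNorms per layer, each with a weight and a bias vector.
    layer_norms = 2 * (2 * hidden_size)
    return attention + ffn + layer_norms


per_layer = encoder_layer_params(hidden_size=768, intermediate_size=3072)
print(f"Per layer : {per_layer:,}")
print(f"12 layers : {12 * per_layer:,}")
```

If the formula is right, both printed values should agree with the measured per-layer and total encoder counts above.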


#### NOTE: BERT-base has 12 encoder layers with 12 attention heads each, so there are 12 × 12 = 144 attention heads in total.

### (Q, K, V) Parameter Count for a Single Attention Head
- The 12 attention heads in a layer do not each own separate Q, K, and V linear layers. Each projection is one large matrix (768×768 weights + 768 biases) whose output is reshaped internally and shared evenly across all 12 heads. Each head therefore works with its own 768 / 12 = 64-dimensional slice of the Q, K, and V weight matrices.
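
The head split described above can be checked with simple arithmetic. This is a conceptual sketch only: it treats each head as owning a 64-row slice of each projection, whereas the model stores a single 768 × 768 matrix per projection and reshapes the output at run time. All variable names are illustrative.

```python
hidden_size = 768
num_heads = 12
head_dim = hidden_size // num_heads  # width of each head's Q, K, V vectors

# One full projection (Q, K, or V) for the whole layer: weights + biases.
projection_params = hidden_size * hidden_size + hidden_size

# The share of one projection that belongs to a single head:
# head_dim output rows of length hidden_size, plus head_dim biases.
per_head_projection = head_dim * hidden_size + head_dim

# Q, K, and V shares together for a single head.
per_head_qkv = 3 * per_head_projection

print(f"Head dimension               : {head_dim}")
print(f"One projection, whole layer  : {projection_params:,}")
print(f"One projection, single head  : {per_head_projection:,}")
print(f"Q + K + V, single head       : {per_head_qkv:,}")

# The 12 per-head shares add back up to the full projection.
assert per_head_projection * num_heads == projection_params
```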

In [67]:
attn_q = 0
attn_k = 0
attn_v = 0
attn_out = 0

for i, layer in enumerate(model.bert.encoder.layer):
    attn = layer.attention.self
    out  = layer.attention.output.dense

    attn_q += attn.query.weight.numel() + attn.query.bias.numel() # ( 768x768 + 768 ) x 12 layers
    attn_k += attn.key.weight.numel()   + attn.key.bias.numel()   # ( 768x768 + 768 ) x 12 layers
    attn_v += attn.value.weight.numel() + attn.value.bias.numel() # ( 768x768 + 768 ) x 12 layers
    attn_out += out.weight.numel() + out.bias.numel()             # ( 768x768 + 768 ) x 12 layers

print("ATTENTION PARAMETERS")
print(f" Query (Q) : {attn_q:,}")
print(f" Key   (K) : {attn_k:,}")
print(f" Value (V) : {attn_v:,}")
print(f" Output    : {attn_out:,}")
print(f" Total Attention : {attn_q + attn_k + attn_v + attn_out:,}")


ATTENTION PARAMETERS
 Query (Q) : 7,087,104
 Key   (K) : 7,087,104
 Value (V) : 7,087,104
 Output    : 7,087,104
 Total Attention : 28,348,416


### Feed-Forward Network (FFN) Layer Parameters

In [68]:
ffn_intermediate = 0
ffn_output = 0

for layer in model.bert.encoder.layer:
    ffn_intermediate += (
        layer.intermediate.dense.weight.numel() +
        layer.intermediate.dense.bias.numel()
    )

    ffn_output += (
        layer.output.dense.weight.numel() +
        layer.output.dense.bias.numel()
    )

print("\nFFN PARAMETERS")
print(f" Intermediate (768→3072): {ffn_intermediate:,}")
print(f" Output (3072→768):       {ffn_output:,}")
print(f" Total FFN:               {ffn_intermediate + ffn_output:,}")




FFN PARAMETERS
 Intermediate (768→3072): 28,348,416
 Output (3072→768):       28,320,768
 Total FFN:               56,669,184


### Layer Norm Parameters

In [69]:
layernorm_params = 0

for layer in model.bert.encoder.layer:
    layernorm_params += count_module_params(layer.attention.output.LayerNorm)
    layernorm_params += count_module_params(layer.output.LayerNorm)

print("\nLayerNorm parameters:", f"{layernorm_params:,}")



LayerNorm parameters: 36,864


## Total Parameter Count

In [61]:
encoder_total = count_module_params(model.bert.encoder)
print("Manual encoder total parameters:", f"{attn_q+attn_k+attn_v+attn_out+ffn_intermediate+ffn_output+layernorm_params:,}")
print("\nENCODER TOTAL PARAMETERS:", f"{encoder_total:,}")

Manual encoder total parameters: 85,054,464

ENCODER TOTAL PARAMETERS: 85,054,464


In [71]:
# Print the name and shape of every trainable parameter
for name, param in model.named_parameters():
    if param.requires_grad:
        print(name)
        print(param.shape)
        print("----------------")
        

bert.embeddings.word_embeddings.weight
torch.Size([30522, 768])
----------------
bert.embeddings.position_embeddings.weight
torch.Size([512, 768])
----------------
bert.embeddings.token_type_embeddings.weight
torch.Size([2, 768])
----------------
bert.embeddings.LayerNorm.weight
torch.Size([768])
----------------
bert.embeddings.LayerNorm.bias
torch.Size([768])
----------------
bert.encoder.layer.0.attention.self.query.weight
torch.Size([768, 768])
----------------
bert.encoder.layer.0.attention.self.query.bias
torch.Size([768])
----------------
bert.encoder.layer.0.attention.self.key.weight
torch.Size([768, 768])
----------------
bert.encoder.layer.0.attention.self.key.bias
torch.Size([768])
----------------
bert.encoder.layer.0.attention.self.value.weight
torch.Size([768, 768])
----------------
bert.encoder.layer.0.attention.self.value.bias
torch.Size([768])
----------------
bert.encoder.layer.0.attention.output.dense.weight
torch.Size([768, 768])
----------------
bert.encoder.layer.

### Classifier Head Parameters

In [70]:
classifier = model.classifier

clf_params = sum(p.numel() for p in classifier.parameters())

print("Classifier head parameters (768 * 2 + 2 = 1,538 ):", f"{clf_params:,}")


Classifier head parameters (768 * 2 + 2 = 1,538 ): 1,538
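
The classifier count depends only on the hidden size and the number of labels. The sketch below assumes the head is a single linear layer from 768 hidden units to `num_labels` outputs, which matches the 768 * 2 + 2 count above. The function name `classifier_head_params` and the label counts 3 and 5 are hypothetical, chosen only to illustrate how the head would scale for tasks with more classes.

```python
def classifier_head_params(hidden_size: int, num_labels: int) -> int:
    """Parameters of a linear classification head: weights plus biases."""
    return hidden_size * num_labels + num_labels


# 2 labels reproduces the count measured above; 3 and 5 are illustrative.
for num_labels in (2, 3, 5):
    count = classifier_head_params(hidden_size=768, num_labels=num_labels)
    print(f"num_labels={num_labels}: {count:,} parameters")
```

Even with more labels, the head stays negligible next to the encoder and embeddings.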


```
Embeddings        ≈ 23.8M
Encoder layers    ≈ 85.1M
 ├─ Attention     ≈ 28.3M
 ├─ FFN           ≈ 56.7M
 └─ LayerNorm     ≈ 0.04M
Classifier        ≈ 1.5K
────────────────────────
Total             ≈ 109M
```
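
The rounded breakdown above can be tied back to an exact total. The sketch below adds the exact counts printed earlier in this notebook. The pooler term is an assumption: it is not measured in any cell here, and is taken to be the single 768 → 768 dense layer that sits between the encoder and the classifier in `BertModel`.

```python
# Exact counts printed earlier in this notebook.
embeddings = 23_837_184
encoder = 85_054_464
classifier = 1_538

# Assumption: pooler is one dense layer, 768 -> 768, with a bias.
# It is not measured in this notebook.
pooler = 768 * 768 + 768

total = embeddings + encoder + pooler + classifier
print(f"Pooler (assumed) : {pooler:,}")
print(f"Total            : {total:,} ({total / 1_000_000:.2f} Million)")
```

If the assumption holds, this total should equal the `total_params` value printed in the first cell.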
