## Encoder models learn tasks by training,
BERT မှာ “few-shot” ဆိုရင်
training data အနည်းငယ်နဲ့ fine-tune လုပ်ပြီး inference လုပ်တာ ကို ဆိုလိုပါတယ်။ BERT ရဲ့ output က [CLS] embedding classifier head က logits ထုတ် ပေးတာဖြစ်ပါတယ်။ label mapping (positive / negative) က BERT pretrained weight ထဲမှာ မရှိပါဘူး။ အဲ့ဒါကို classifier head (sometimes) upper encoder layers ထဲမှာ training လုပ်ပြီး inference လုပ်တယ်လို့ခေါ်ပါတယ်။ 
- ကိုယ်လိုချင်တဲ့ scope ကို Train စရာမလိုပဲ Classification တန်းလုပ်လို့ရတဲ့ encoder model လည်းရှိပါတယ်။


#### Decoder models learn tasks by prompting.

## Testing Few Shot Classification by training encoder model

In [15]:
# 1 = positive, 0 = negative
few_shot_data = [
    ("I love this product, it works perfectly!", 0),
    ("Absolutely fantastic experience", 0),
    ("Very happy with the results", 0),
    ("This is amazing", 0),
    ("Highly recommended", 0),

    ("I hate this thing", 1),
    ("Terrible and disappointing", 1),
    ("Very bad experience", 1),
    ("Not worth the money", 1),
    ("Completely useless", 1),
]

## Init Tokenizer

In [16]:
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

texts = [x[0] for x in few_shot_data]
labels = torch.tensor([x[1] for x in few_shot_data])

print("Texts:", texts)
print("Labels:", labels)

encodings = tokenizer(
    texts,
    padding=True,
    truncation=True,
    return_tensors="pt"
)


Texts: ['I love this product, it works perfectly!', 'Absolutely fantastic experience', 'Very happy with the results', 'This is amazing', 'Highly recommended', 'I hate this thing', 'Terrible and disappointing', 'Very bad experience', 'Not worth the money', 'Completely useless']
Labels: tensor([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])


## Init Model

In [17]:
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
    output_attentions=True,
    output_hidden_states=True
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### ALL layers are trained by default

- Embedding layer
- All Transformer encoder layers (12 layers in bert-base)
- Final classification head

In [None]:
## Freeze all BERT layers
for param in model.parameters(): ## parameters တစ်ခုထဲမလို့ .parameters() သုံးလို့၇တယ်။
    param.requires_grad = False

In [None]:
## Encoder fine-tuning: last 2 layers ကိုပဲ train လုပ်မယ်
## classifier layer ကိုလည်း train လုပ်မယ်
## Total 4 layers ကိုပဲ train လုပ်မယ်
for name, param in model.named_parameters(): ## parameters အများကြီး update လုပ်ဖို့ named_parameters() သုံးရတယ်။
    if "encoder.layer.10" in name or "encoder.layer.11" in name:
        param.requires_grad = True
    elif "classifier" in name:
        param.requires_grad = True
    else:
        param.requires_grad = False

#### Train All parameters in BERT model (Model ထဲက parameters အားလုံးကို loss တွက်ပြီး train ကြမယ်။)

In [19]:
## Train All parameters in BERT model
#for param in model.parameters():
#    param.requires_grad = True
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)
loss_fn = torch.nn.CrossEntropyLoss()

model.train() # Set the model to training mode

for epoch in range(30):  # few epochs only
    optimizer.zero_grad()

    outputs = model(
        input_ids=encodings["input_ids"],
        attention_mask=encodings["attention_mask"],
        labels=labels
    )

    loss = outputs.loss
    loss.backward()
    optimizer.step()

    print(f"Epoch {epoch+1} | Loss: {loss.item():.4f}")

Epoch 1 | Loss: 0.6218
Epoch 2 | Loss: 0.6950
Epoch 3 | Loss: 0.7131
Epoch 4 | Loss: 0.6973
Epoch 5 | Loss: 0.6779
Epoch 6 | Loss: 0.6832
Epoch 7 | Loss: 0.6789
Epoch 8 | Loss: 0.6860
Epoch 9 | Loss: 0.6138
Epoch 10 | Loss: 0.7250
Epoch 11 | Loss: 0.6802
Epoch 12 | Loss: 0.6760
Epoch 13 | Loss: 0.7155
Epoch 14 | Loss: 0.6353
Epoch 15 | Loss: 0.6870
Epoch 16 | Loss: 0.6783
Epoch 17 | Loss: 0.6648
Epoch 18 | Loss: 0.6625
Epoch 19 | Loss: 0.6798
Epoch 20 | Loss: 0.6890
Epoch 21 | Loss: 0.6500
Epoch 22 | Loss: 0.6686
Epoch 23 | Loss: 0.6433
Epoch 24 | Loss: 0.7058
Epoch 25 | Loss: 0.6551
Epoch 26 | Loss: 0.6890
Epoch 27 | Loss: 0.6534
Epoch 28 | Loss: 0.7263
Epoch 29 | Loss: 0.7130
Epoch 30 | Loss: 0.6147


## Evaluate After Training 

In [24]:
model.eval() # evaluation mode

test_sentences = [
    "I absolutely love this!",
    "Not good at all.",
    "Holy Shit, I hate this.",
]

test_enc = tokenizer(
    test_sentences,
    padding=True,
    truncation=True,
    return_tensors="pt"
)

print("input_ids:", test_enc["input_ids"].shape)
print("input_ids size:", test_enc["input_ids"])
print("attention_mask:", test_enc["attention_mask"].shape)
print("attention_mask size:", test_enc["attention_mask"])

with torch.no_grad():
    outputs = model(**test_enc)
    predictions = torch.argmax(outputs.logits, dim=1)

hidden_states = outputs.hidden_states
attentions = outputs.attentions

# hidden state ဆိုတာက Token Embedding နဲ့ → LayerNorm ကြားထဲက + Position Embedding နဲ့ + Segment Embedding matrix တွေကိုပြောတာပါ။

print("Number of hidden states:", len(hidden_states))
print("Embedding output shape:", hidden_states[0].shape)
print("Embedding output:", hidden_states[0])


input_ids: torch.Size([3, 9])
input_ids size: tensor([[ 101, 1045, 7078, 2293, 2023,  999,  102,    0,    0],
        [ 101, 2025, 2204, 2012, 2035, 1012,  102,    0,    0],
        [ 101, 4151, 4485, 1010, 1045, 5223, 2023, 1012,  102]])
attention_mask: torch.Size([3, 9])
attention_mask size: tensor([[1, 1, 1, 1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1]])
Number of hidden states: 13
Embedding output shape: torch.Size([3, 9, 768])
Embedding output: tensor([[[ 1.6855e-01, -2.8577e-01, -3.2613e-01,  ..., -2.7571e-02,
           3.8253e-02,  1.6400e-01],
         [-3.4026e-04,  5.3974e-01, -2.8805e-01,  ...,  7.5731e-01,
           8.9008e-01,  1.6575e-01],
         [ 4.6552e-01,  2.5250e-01, -2.7314e-01,  ...,  5.4588e-01,
           4.2764e-01,  6.0232e-01],
         ...,
         [-1.4815e-01, -2.9485e-01, -1.6900e-01,  ..., -5.0090e-01,
           2.5442e-01, -7.0021e-02],
         [ 2.4668e-01, -8.4544e-01, -1.1325e-01,  ...,  2.6931e-0

In [25]:
for text, pred in zip(test_sentences, predictions):
    print(text, "→", "Positive" if pred.item() == 1 else "Negative")

I absolutely love this! → Negative
Not good at all. → Negative
Holy Shit, I hate this. → Negative


In [26]:
for i, layer_hidden in enumerate(hidden_states[1:], start=1):
    print(f"Encoder layer {i} output:", layer_hidden.shape)

Encoder layer 1 output: torch.Size([3, 9, 768])
Encoder layer 2 output: torch.Size([3, 9, 768])
Encoder layer 3 output: torch.Size([3, 9, 768])
Encoder layer 4 output: torch.Size([3, 9, 768])
Encoder layer 5 output: torch.Size([3, 9, 768])
Encoder layer 6 output: torch.Size([3, 9, 768])
Encoder layer 7 output: torch.Size([3, 9, 768])
Encoder layer 8 output: torch.Size([3, 9, 768])
Encoder layer 9 output: torch.Size([3, 9, 768])
Encoder layer 10 output: torch.Size([3, 9, 768])
Encoder layer 11 output: torch.Size([3, 9, 768])
Encoder layer 12 output: torch.Size([3, 9, 768])


In [27]:
last_hidden = hidden_states[-1]

cls_embedding = last_hidden[:, 0, :]
print("CLS embedding shape:", cls_embedding.shape)
print("CLS embedding:", cls_embedding)


CLS embedding shape: torch.Size([3, 768])
CLS embedding: tensor([[ 0.0456,  0.3026,  0.1679,  ..., -0.4577,  0.2394,  0.2425],
        [-0.2764,  0.0137, -0.2456,  ..., -0.1348,  0.3168,  0.3745],
        [-0.1571,  0.1834, -0.1916,  ..., -0.0787,  0.3466,  0.2784]])


In [28]:
classifier = model.classifier

logits_manual = classifier(cls_embedding)

print("Logits (manual):", logits_manual.shape)

Logits (manual): torch.Size([3, 2])


### Output Probabilities

In [29]:
print("Logits (model):", outputs.logits.shape)
print("logits :", outputs.logits)

Logits (model): torch.Size([3, 2])
logits : tensor([[ 0.0982, -0.2283],
        [ 0.1364, -0.0842],
        [-0.0203, -0.2121]])
