## Decoder-only models provide autoregressive completion
- Text generation & creative writing
- Conversation & chatbots
- Instruction-following tasks (classification, QA, summarization)
- Code generation & reasoning
- Translation & multi-lingual tasks



##### GPT-2 = Decoder-only Transformer

- No encoder
- No cross-attention
- Only masked self-attention
- Trained with next-token prediction
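
To make "only masked self-attention" concrete, here is a minimal single-head sketch in plain PyTorch. The tensor sizes are toy values chosen for illustration, not GPT-2's real dimensions, and the random `q`, `k`, `v` tensors stand in for projections a real model would compute.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

seq_len, d_head = 5, 8            # toy sizes (assumption), not GPT-2's real dimensions
q = torch.randn(seq_len, d_head)  # queries for one head
k = torch.randn(seq_len, d_head)  # keys for one head
v = torch.randn(seq_len, d_head)  # values for one head

# Scaled dot-product scores: entry (i, j) says how strongly token i attends to token j.
scores = q @ k.T / d_head ** 0.5

# Causal mask: True where j <= i (the current and all earlier positions).
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Future positions are set to -inf so softmax gives them exactly zero weight.
scores = scores.masked_fill(~causal_mask, float("-inf"))
weights = F.softmax(scores, dim=-1)

context = weights @ v  # each row mixes only current and past value vectors
print(weights)
```

The printed matrix is lower-triangular, which is the pattern the real attention weights show later in this notebook.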

In [28]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

device = "cuda" if torch.cuda.is_available() else "cpu"


In [29]:
model_name = "gpt2"  # GPT-2 small: 12 layers, 12 heads
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(
    model_name,
    output_attentions=True,
    output_hidden_states=True
)

model.to(device)
model.eval()


GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

The first embedding layer in the GPT-2 model, wte, is the word token embedding; its vocabulary has 50,257 tokens.
The second embedding layer, wpe, is the position embedding; its 1024x768 shape gives each position in the 1024-token context window its own 768-dimensional vector.

The decoder-only GPT-2 uses no sinusoidal encoding and no rotary embeddings.

It uses only absolute positions, and their embeddings are learned during training.
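
As a rough sketch of how learned absolute position embeddings enter the model, the snippet below builds two randomly initialised `nn.Embedding` tables with the shapes from the module printout and adds them together. The token IDs are hypothetical, and the tables here are not the pretrained weights.

```python
import torch
import torch.nn as nn

vocab_size, n_positions, n_embd = 50257, 1024, 768  # sizes from the module printout

wte = nn.Embedding(vocab_size, n_embd)   # token embedding table (random init here)
wpe = nn.Embedding(n_positions, n_embd)  # learned position embedding table (random init here)

input_ids = torch.tensor([[10, 20, 30, 40]])                  # hypothetical token IDs
position_ids = torch.arange(input_ids.shape[1]).unsqueeze(0)  # [[0, 1, 2, 3]]

# Each token's vector is the sum of "what it is" and "where it is".
hidden = wte(input_ids) + wpe(position_ids)
print(hidden.shape)  # torch.Size([1, 4, 768])
```

Because `wpe` has only 1024 rows, a position index of 1024 or higher has no embedding to look up, which is why the context window is capped at 1024 tokens.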

In [43]:
wte_params = model.transformer.wte.weight.numel()
wpe_params = model.transformer.wpe.weight.numel()

print("Token embedding params:", f"{wte_params:,}")
print("Position embedding params:", f"{wpe_params:,}")


Token embedding params: 38,597,376
Position embedding params: 786,432


In [44]:
block = model.transformer.h[0]

print("c_attn params:", block.attn.c_attn.weight.numel() + block.attn.c_attn.bias.numel())
print("c_proj params:", block.attn.c_proj.weight.numel() + block.attn.c_proj.bias.numel())


c_attn params: 1771776
c_proj params: 590592


In [30]:
text = "Recurrent Neural Network will change the"
inputs = tokenizer(text, return_tensors="pt").to(device)

input_ids = inputs["input_ids"]
print("Input IDs shape:", input_ids.shape)


Input IDs shape: torch.Size([1, 7])


In [31]:
with torch.no_grad():
    outputs = model(**inputs)


In [32]:
logits = outputs.logits
print("Logits shape:", logits.shape)


Logits shape: torch.Size([1, 7, 50257])
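
The last axis of the logits tensor holds one raw score per vocabulary entry. A small helper like the one below (the name `top_k_next_tokens` is made up for this sketch) turns the scores at the final position into probabilities and keeps the most likely candidates. It runs here on a toy tensor rather than on the real model output.

```python
import torch
import torch.nn.functional as F

def top_k_next_tokens(logits, k=5):
    """Return (probabilities, token IDs) of the k most likely next tokens.

    Expects logits of shape [batch, seq_len, vocab_size].
    """
    last_logits = logits[:, -1, :]          # scores predicted at the final position
    probs = F.softmax(last_logits, dim=-1)  # normalise over the vocabulary
    return torch.topk(probs, k, dim=-1)

# Toy tensor standing in for outputs.logits (batch=1, seq_len=7, vocab=10).
toy_logits = torch.randn(1, 7, 10)
top_probs, top_ids = top_k_next_tokens(toy_logits, k=3)
print(top_probs, top_ids)
```

With the notebook's own objects, calling `top_k_next_tokens(logits)` and passing each returned ID to `tokenizer.decode` would list the leading candidates for the next token.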


In [33]:
hidden_states = outputs.hidden_states

print("Total hidden state tensors:", len(hidden_states))
print("Embedding output shape:", hidden_states[0].shape)
print("Last layer output shape:", hidden_states[-1].shape)


Total hidden state tensors: 13
Embedding output shape: torch.Size([1, 7, 768])
Last layer output shape: torch.Size([1, 7, 768])


In [34]:
attentions = outputs.attentions

print("Total attention layers:", len(attentions))
print("One layer attention shape:", attentions[0].shape)


Total attention layers: 12
One layer attention shape: torch.Size([1, 12, 7, 7])


In [35]:
layer = 0
head = 11

# attentions[layer] has shape [batch, heads, query_pos, key_pos]; take batch item 0 and one head.
attn = attentions[layer][0, head]
print(attn)


tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.6984, 0.3016, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.4638, 0.2843, 0.2520, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3511, 0.2360, 0.2186, 0.1943, 0.0000, 0.0000, 0.0000],
        [0.3529, 0.1802, 0.1466, 0.1212, 0.1992, 0.0000, 0.0000],
        [0.3420, 0.1347, 0.1247, 0.1247, 0.1232, 0.1506, 0.0000],
        [0.2430, 0.1132, 0.1216, 0.1333, 0.1036, 0.1298, 0.1555]])
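
The printed matrix can be checked for the two properties that causal attention should have: no weight on future tokens, and rows that sum to 1. The values below are copied from the printout above, so the row sums are only exact up to its 4-decimal rounding.

```python
import torch

# Values copied from the printed layer-0, head-11 attention matrix.
attn = torch.tensor([
    [1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
    [0.6984, 0.3016, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
    [0.4638, 0.2843, 0.2520, 0.0000, 0.0000, 0.0000, 0.0000],
    [0.3511, 0.2360, 0.2186, 0.1943, 0.0000, 0.0000, 0.0000],
    [0.3529, 0.1802, 0.1466, 0.1212, 0.1992, 0.0000, 0.0000],
    [0.3420, 0.1347, 0.1247, 0.1247, 0.1232, 0.1506, 0.0000],
    [0.2430, 0.1132, 0.1216, 0.1333, 0.1036, 0.1298, 0.1555],
])

# Causal masking: every entry above the diagonal (a future position) is zero.
no_future_weight = bool(torch.all(torch.triu(attn, diagonal=1) == 0))

# Softmax output: each row sums to 1, within the rounding of the printout.
rows_sum_to_one = torch.allclose(attn.sum(dim=-1), torch.ones(7), atol=1e-3)

print(no_future_weight, rows_sum_to_one)  # True True
```

Row i has exactly i + 1 non-zero entries: the first token can attend only to itself, while the last token attends to all seven positions.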


In [36]:
len(model.transformer.h)



12

In [37]:
model.config.n_head

12

In [38]:
model.config.n_embd


768

In [39]:
def count_params(module):
    return sum(p.numel() for p in module.parameters())

print("Total GPT-2 parameters:", f"{count_params(model):,}")


Total GPT-2 parameters: 124,439,808


In [40]:
block = model.transformer.h[0]

attn_params = count_params(block.attn)
mlp_params = count_params(block.mlp)

print("Attention params per layer:", f"{attn_params:,}")
print("MLP params per layer:", f"{mlp_params:,}")


Attention params per layer: 2,362,368
MLP params per layer: 4,722,432
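
The per-layer counts and the 124,439,808 total can be reproduced by hand from the shapes in the module printout. The arithmetic below is a sketch of that bookkeeping. It leaves `lm_head` out of the sum on the understanding that Hugging Face GPT-2 ties its weight to `wte`, so `parameters()` counts it once. That reading is consistent with the total printed earlier.

```python
n_vocab, n_ctx, d, n_layer = 50257, 1024, 768, 12

wte = n_vocab * d                  # token embeddings
wpe = n_ctx * d                    # position embeddings
layer_norm = 2 * d                 # weight + bias

c_attn = d * 3 * d + 3 * d         # fused Q, K, V projection (weight + bias)
attn_proj = d * d + d              # attention output projection
attn = c_attn + attn_proj

c_fc = d * 4 * d + 4 * d           # MLP expansion 768 -> 3072
mlp_proj = 4 * d * d + d           # MLP contraction 3072 -> 768
mlp = c_fc + mlp_proj

block = 2 * layer_norm + attn + mlp                # ln_1, attn, ln_2, mlp
total = wte + wpe + n_layer * block + layer_norm   # plus the final ln_f

print(f"attn:  {attn:,}")   # attn:  2,362,368
print(f"mlp:   {mlp:,}")    # mlp:   4,722,432
print(f"total: {total:,}")  # total: 124,439,808
```

The MLP holds roughly twice as many parameters as attention in each block, and the token embedding table alone accounts for about 31% of the model.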


In [41]:
block.attn.c_attn.weight.shape


torch.Size([768, 2304])
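
The 2304 columns are three 768-wide projections (query, key, value) fused into one matrix. The sketch below uses random stand-in tensors with the same shapes to show how one matrix multiply yields all three, and how each is then reshaped into 12 heads of 64 dimensions. The helper name `split_heads` and the random weights are assumptions for illustration, not the library's code.

```python
import torch

torch.manual_seed(0)

n_embd, n_head = 768, 12
head_dim = n_embd // n_head  # 64

x = torch.randn(1, 7, n_embd)               # stand-in for one block's normalised input
w = torch.randn(n_embd, 3 * n_embd) * 0.02  # same shape as c_attn.weight: [768, 2304]
b = torch.zeros(3 * n_embd)                 # same shape as c_attn.bias

qkv = x @ w + b                             # [1, 7, 2304]
q, k, v = qkv.split(n_embd, dim=-1)         # three [1, 7, 768] tensors

def split_heads(t):
    # [batch, seq, n_embd] -> [batch, n_head, seq, head_dim]
    batch, seq, _ = t.shape
    return t.reshape(batch, seq, n_head, head_dim).transpose(1, 2)

q, k, v = split_heads(q), split_heads(k), split_heads(v)
print(q.shape)  # torch.Size([1, 12, 7, 64])
```

Multiplying `q` by the transpose of `k` over the last two axes then gives a `[1, 12, 7, 7]` score tensor, matching the attention shape printed earlier.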

In [42]:
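# Logits at the last position predict the token that follows the prompt.
next_token_logits = logits[:, -1, :]
# Greedy choice: take the single highest-scoring token ID.
next_token_id = torch.argmax(next_token_logits, dim=-1)

print("Next token:", tokenizer.decode(next_token_id))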


Next token:  way
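
Repeating this single step, feeding each chosen token back in, is what makes the model autoregressive. A minimal greedy loop might look like the sketch below. The function name `greedy_generate` is made up for this example, and it assumes only that `model(input_ids=...)` returns an object with a `.logits` attribute, as the forward pass above does.

```python
import torch

@torch.no_grad()
def greedy_generate(model, input_ids, max_new_tokens=5):
    """Append max_new_tokens tokens, always taking the argmax at the last position."""
    ids = input_ids
    for _ in range(max_new_tokens):
        logits = model(input_ids=ids).logits                     # [batch, seq_len, vocab]
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # [batch, 1]
        ids = torch.cat([ids, next_id], dim=-1)                  # grow the sequence by one
    return ids
```

With the notebook's objects, `tokenizer.decode(greedy_generate(model, input_ids)[0])` would print the prompt followed by five generated tokens. This sketch re-runs the whole sequence at every step and does nothing to stop at the 1024-token limit. The library's own `model.generate` handles caching and stopping criteria and is the better choice in practice.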
