# Self-Attention Mechanisms and Text Generation

This project implements and demonstrates two fundamental components of modern NLP: self-attention mechanisms and transformer-based text generation.

## Project Overview

Part I builds single-head and multi-head self-attention from scratch in PyTorch. Part II loads a pre-trained causal language model with 4-bit quantization and generates text from a few prompts.

## Part I: Self-Attention Mechanism Implementation

### **Single-Head Self-Attention**
- **Architecture**: Query-Key-Value attention mechanism
- **Key Components**:
  - Linear projections for Q, K, V transformations
  - Scaled dot-product attention computation
  - Softmax normalization with dropout regularization
  - Output projection layer
- **Features**: Optional `debug` flag that prints the shape of each intermediate tensor
- **Applications**: Foundation for transformer-based models
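
The components listed above can be traced by hand on a tiny input. The following is a minimal, dependency-free sketch rather than the notebook's PyTorch module: it assumes a single sequence with no batch dimension, omits the learned projections and dropout (so Q = K = V = x), and the helper names `softmax`, `matmul`, `transpose`, and `attention` are illustrative.

```python
import math

def softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    # (n x k) @ (k x m) -> (n x m) on nested lists.
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def attention(Q, K, V):
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))                      # (T, T)
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]            # each row sums to 1
    return matmul(weights, V), weights                    # (T, d_k), (T, T)

# Toy input: T = 3 tokens, d_k = 2.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context, weights = attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` is a probability distribution over the tokens that one query position attends to, and `context` has the same shape as the input.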

### **Multi-Head Self-Attention**
- **Architecture**: Parallel attention heads with different learned representations
- **Key Improvements**:
  - Multiple attention heads capture different types of relationships
  - Head-wise dimension splitting and recombination
  - Enhanced representational capacity
  - Efficient parallel computation
- **Implementation**: `view` and `transpose` reshape tensors from (B, T, D) to (B, H, T, d) and back
- **Scalability**: Configurable number of attention heads
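
The head-wise splitting and recombination can be illustrated without tensors. This sketch assumes a single sequence stored as a nested list of shape (T, D); `split_heads` and `merge_heads` are illustrative names, and the slicing mirrors what a `view(T, H, d)` followed by a transpose does to the last dimension.

```python
def split_heads(x, n_heads):
    """(T, D) nested list -> (H, T, d), with d = D // n_heads."""
    D = len(x[0])
    assert D % n_heads == 0, "embedding size must be divisible by the head count"
    d = D // n_heads
    # Head h receives the contiguous column slice [h*d, (h+1)*d) of every token.
    return [[row[h * d:(h + 1) * d] for row in x] for h in range(n_heads)]

def merge_heads(heads):
    """(H, T, d) -> (T, D) by concatenating each token's per-head slices."""
    T = len(heads[0])
    return [[v for head in heads for v in head[t]] for t in range(T)]

T, D, H = 3, 8, 4
x = [[t * 10 + j for j in range(D)] for t in range(T)]
heads = split_heads(x, H)
print(len(heads), len(heads[0]), len(heads[0][0]))  # 4 3 2
print(heads[1][0])                                  # [2, 3]
assert merge_heads(heads) == x                      # the round trip is lossless
```

Splitting changes only how the D features are grouped, so the total number of values per token is the same before and after.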

### **Technical Features**:
- **Mathematical Foundation**: Scaled dot-product attention; scores are divided by √d_k, where d_k is the per-head dimension
- **Memory Efficiency**: All heads are computed in one batched matrix multiplication instead of a loop over heads
- **Debugging Support**: Shape printing for every intermediate tensor when `debug=True`
- **Modularity**: Clean, reusable attention module design
- **PyTorch Integration**: Native tensor operations and automatic differentiation
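
For reference, the mathematical foundation named above can be written out as follows. The notation is an assumption chosen to match the code in this notebook: X is the input of shape (T, D), H is the number of heads, the W matrices correspond to `q_proj`, `k_proj`, `v_proj`, and `o_proj`, and the dropout applied to the attention weights is omitted.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

\mathrm{head}_i = \mathrm{Attention}\!\left(X W_i^{Q},\; X W_i^{K},\; X W_i^{V}\right), \qquad d_k = \frac{D}{H}

\mathrm{MultiHead}(X) = \mathrm{Concat}\!\left(\mathrm{head}_1, \dots, \mathrm{head}_H\right) W^{O}
```

In the implementation, the per-head matrices are column slices of a single D × D projection, which is why one `nn.Linear` per role is enough.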

## Part II: Transformer-Based Text Generation

### **Model**: Large Language Model with 4-bit Quantization
- **Base Model**: `Qwen/Qwen2.5-3B-Instruct`, a pre-trained causal language model
- **Optimization**: 4-bit NF4 quantization with double quantization and bfloat16 compute
- **Technology**: `bitsandbytes` via `BitsAndBytesConfig` in the Transformers library
- **Acceleration**: GPU acceleration support
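
A back-of-the-envelope calculation shows why 4-bit weights matter for memory. This is a rough sketch under stated assumptions: the parameter count is taken as a nominal 3 billion based on the "3B" in the model name, and the estimate covers weights only. Quantization constants, layers that stay in higher precision, activations, and the KV cache would all add to the real footprint.

```python
def weight_bytes(n_params, bits_per_param):
    """Bytes needed to store n_params weights at the given precision."""
    return n_params * bits_per_param / 8

N_PARAMS = 3_000_000_000  # assumption: nominal size implied by "3B"

for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("int8", 8), ("4-bit", 4)]:
    gib = weight_bytes(N_PARAMS, bits) / 2**30
    print(f"{label:<10}{gib:6.2f} GiB")
```

Going from 16-bit to 4-bit storage cuts the weight memory by a factor of four, which is the main effect the quantization config aims for.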

### **Text Generation Pipeline**:
- **Tokenization**: `AutoTokenizer` loaded from the same checkpoint as the model
- **Prompts**: A short list of plain-text questions, tokenized one at a time
- **Generation Strategy**: Sampling with `temperature=0.7`, `top_p=0.95`, and up to 128 new tokens
- **Post-processing**: Clean text extraction and formatting
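
One post-processing detail is worth spelling out: a decoder-only model returns the prompt tokens followed by the continuation, so decoding the full output repeats the question. The sketch below shows the slicing idea on plain lists. The prompt ids are the ones this notebook prints for the first prompt, the continuation ids are hypothetical placeholders, and `strip_prompt` is an illustrative helper.

```python
def strip_prompt(prompt_ids, generated_ids):
    """Return only the token ids produced after the prompt."""
    n = len(prompt_ids)
    # The generated sequence is expected to begin with the prompt itself.
    assert generated_ids[:n] == prompt_ids, "output should start with the prompt"
    return generated_ids[n:]

prompt_ids = [3838, 374, 279, 6722, 315, 9625, 30]  # "What is the capital of France?"
generated_ids = prompt_ids + [101, 102, 103]        # hypothetical continuation ids
print(strip_prompt(prompt_ids, generated_ids))      # [101, 102, 103]
```

With the tensors returned by `model.generate`, the equivalent slice would be `gen_ids[0][inputs["input_ids"].shape[1]:]`, applied before `tokenizer.decode`.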

### **Key Capabilities**:
- **Interactive Generation**: Custom prompt-based text generation
- **Memory Optimization**: Efficient large model handling
- **Flexible Configuration**: Adjustable generation parameters
- **Quality Control**: Temperature and sampling controls
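
The temperature and sampling controls can be sketched in a few lines. This is a generic illustration of temperature scaling and nucleus (top-p) filtering, not the Transformers library's internal code; the logits are hypothetical and the function names are illustrative.

```python
import math
import random

def softmax(logits, temperature=1.0):
    # Lower temperature sharpens the distribution, higher temperature flattens it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}

def sample(logits, temperature=0.7, top_p=0.95, rng=random):
    dist = top_p_filter(softmax(logits, temperature), top_p)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]

logits = [2.0, 1.0, 0.5, -3.0]            # hypothetical next-token logits
probs_07 = softmax(logits, temperature=0.7)
probs_10 = softmax(logits, temperature=1.0)
dist = top_p_filter(probs_07, top_p=0.95)
print(sorted(dist))                        # indices that survive the top-p cut
print(sample(logits, rng=random.Random(0)))
```

In this example the least likely token falls outside the 0.95 nucleus and can never be sampled, while the remaining tokens keep their relative odds.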

### **Technical Implementation**:
- **Framework**: PyTorch + Transformers library
- **Quantization**: 4-bit precision for large model deployment
- **Memory Management**: Optimized loading and inference
- **GPU Utilization**: Accelerated computation when available
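
To make the idea of 4-bit precision concrete, here is a simplified sketch of block-wise quantization. It deliberately swaps in plain uniform absmax quantization with 16 levels; the NF4 format used in this notebook relies on a non-uniform code book and additionally quantizes the per-block scales (double quantization). The weight values and function names are hypothetical.

```python
def quantize_block(values, n_levels=16):
    """Uniform absmax quantization of one block to integer codes in [0, n_levels - 1]."""
    absmax = max(abs(v) for v in values) or 1.0   # avoid dividing by zero for an all-zero block
    half = (n_levels - 1) / 2                     # 7.5 for 16 levels
    codes = [round(v / absmax * half + half) for v in values]
    return codes, absmax

def dequantize_block(codes, absmax, n_levels=16):
    """Map integer codes back to approximate floating-point values."""
    half = (n_levels - 1) / 2
    return [(c - half) / half * absmax for c in codes]

weights = [0.12, -0.40, 0.03, 0.25, -0.07, 0.31, -0.22, 0.00]  # hypothetical weight block
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(codes)
print(f"max abs error: {max_err:.4f}")
```

Each weight is stored as one of 16 codes plus a shared per-block scale, and the reconstruction error is bounded by half of one quantization step.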

### **Applications**:
- **Creative Writing**: Story and content generation
- **Conversational AI**: Interactive dialogue systems
- **Code Generation**: Programming assistance capabilities
- **Educational Tools**: Learning and demonstration purposes

---

In [58]:
import torch
import torch.nn as nn
import math

In [None]:
# Single-head Self-Attention Implementation
# forward() never splits heads, so this version is only correct with n_heads=1.
class Attention(nn.Module):
    def __init__(self, emb_dim, n_heads=1, dropout=0.0, debug=False):
        super().__init__()
        assert emb_dim % n_heads == 0
        self.emb_dim, self.n_heads = emb_dim, n_heads
        self.head_dim  = emb_dim // n_heads
        self.scale     = math.sqrt(self.head_dim)
        self.debug     = debug

        self.q_proj = nn.Linear(emb_dim, emb_dim, bias=False)
        self.k_proj = nn.Linear(emb_dim, emb_dim, bias=False)
        self.v_proj = nn.Linear(emb_dim, emb_dim, bias=False)
        self.o_proj = nn.Linear(emb_dim, emb_dim, bias=False)
        self.dropout = nn.Dropout(dropout)

    def _p(self, name, t):
        if self.debug:  # print shapes
            print(f"{name:<12}{tuple(t.shape)}")

    def forward(self, x):
        # ---------- Linear projections ----------
        Q = self.q_proj(x); self._p("Q", Q)
        K = self.k_proj(x); self._p("K", K)
        V = self.v_proj(x); self._p("V", V)

        # ---------- Scaled dot‑product ----------
        scores = Q @ K.transpose(-2, -1) / self.scale  # (B,T,T)
        self._p("scores", scores)

        # ---------- softmax + dropout -----------
        weights = self.dropout(torch.softmax(scores, -1))
        self._p("weights", weights)

        # ---------- Weighted sum ----------------
        context = weights @ V                       # (B,T,D)
        self._p("context", context)

        # ---------- Final linear ----------------
        out = self.o_proj(context)
        self._p("out", out)
        return out

In [None]:
# Test single-head attention with tensor shape debugging
B, T, D = 2, 5, 16
x = torch.randn(B, T, D)
attn_single = Attention(D, n_heads=1, debug=True)
y = attn_single(x)
print("return", y.shape)   # (2,5,16)

Q           (2, 5, 16)
K           (2, 5, 16)
V           (2, 5, 16)
scores      (2, 5, 5)
weights     (2, 5, 5)
context     (2, 5, 16)
out         (2, 5, 16)
return torch.Size([2, 5, 16])
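
The division by `self.scale` in the forward pass can be motivated with a small experiment. The sketch below assumes an idealized setting in which query and key entries are independent with zero mean and unit variance; under that assumption the raw dot product has variance close to d, and dividing by √d brings it back to roughly 1 so the softmax does not saturate as d grows. `dot_variance` is an illustrative helper.

```python
import math
import random

def dot_variance(d, n_trials=2000, seed=0):
    """Empirical variance of q.k for random unit-variance vectors of size d."""
    rng = random.Random(seed)
    dots = []
    for _ in range(n_trials):
        q = [rng.gauss(0, 1) for _ in range(d)]
        k = [rng.gauss(0, 1) for _ in range(d)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / n_trials
    return sum((v - mean) ** 2 for v in dots) / n_trials

for d in (16, 64, 256):
    var = dot_variance(d)
    # Dividing the dot product by sqrt(d) divides its variance by d.
    print(f"d={d:<4} var(q.k)={var:8.1f}  var(q.k/sqrt(d))={var / d:5.2f}")
```

For the test above, D = 16 with one head, so the scores are divided by 4.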


In [None]:
# Multi-head Self-Attention Implementation (redefines the Attention class above)
class Attention(nn.Module):
    def __init__(self, emb_dim, n_heads=8, dropout=0.0, debug=False):
        super().__init__()
        assert emb_dim % n_heads == 0
        self.emb_dim, self.n_heads = emb_dim, n_heads
        self.head_dim = emb_dim // n_heads
        self.scale    = math.sqrt(self.head_dim)
        self.debug    = debug

        self.q_proj = nn.Linear(emb_dim, emb_dim, bias=False)
        self.k_proj = nn.Linear(emb_dim, emb_dim, bias=False)
        self.v_proj = nn.Linear(emb_dim, emb_dim, bias=False)
        self.o_proj = nn.Linear(emb_dim, emb_dim, bias=False)
        self.dropout = nn.Dropout(dropout)

    def _p(self, name, t):
        if self.debug:
            print(f"{name:<12}{tuple(t.shape)}")

    def _split_heads(self, t):
        # (B,T,D) -> (B,H,T,d)
        B, T, _ = t.shape
        return t.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)

    def forward(self, x):
        B, T, _ = x.shape
        # ---------- Linear projections ----------
        Q = self._split_heads(self.q_proj(x)); self._p("Q", Q)
        K = self._split_heads(self.k_proj(x)); self._p("K", K)
        V = self._split_heads(self.v_proj(x)); self._p("V", V)

        # ---------- Scaled dot‑product ----------
        scores = Q @ K.transpose(-2, -1) / self.scale      # (B,H,T,T)
        self._p("scores", scores)

        # ---------- softmax + dropout -----------
        weights = self.dropout(torch.softmax(scores, -1))  # (B,H,T,T)
        self._p("weights", weights)

        # ---------- Weighted sum ----------------
        ctx = weights @ V                                  # (B,H,T,d)
        self._p("context", ctx)

        # ---------- Concatenate heads -----------
        ctx = ctx.transpose(1, 2).contiguous().view(B, T, self.emb_dim)
        self._p("concat", ctx)

        out = self.o_proj(ctx)                             # (B,T,D)
        self._p("out", out)
        return out

In [None]:
# Test multi-head attention with tensor shape debugging
B, T, D, H = 2, 10, 64, 8
x = torch.randn(B, T, D)
attn_multi = Attention(D, n_heads=H, debug=True)
y = attn_multi(x)
print("return", y.shape)   # (2,10,64)

Q           (2, 8, 10, 8)
K           (2, 8, 10, 8)
V           (2, 8, 10, 8)
scores      (2, 8, 10, 10)
weights     (2, 8, 10, 10)
context     (2, 8, 10, 8)
concat      (2, 10, 64)
out         (2, 10, 64)
return torch.Size([2, 10, 64])


---

## Part II: Transformer-Based Text Generation

This section demonstrates large language model inference with memory optimization techniques.

### Environment Setup for Text Generation

This section needs `transformers`, `bitsandbytes`, and `accelerate` for model loading and quantization. These packages may require different versions than the LSTM sentiment analysis project, so a separate environment may be needed.

In [None]:
# Environment configuration for text generation (run if needed)
# !pip uninstall -y torch torchvision torchaudio
# !pip install torch torchvision torchaudio

In [1]:
!pip install -qU transformers bitsandbytes accelerate




In [None]:
# Import required packages for text generation
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
)
import torch


In [None]:
# Configure 4-bit quantization for memory efficiency
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,   # dequantized weights are multiplied in bfloat16
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

model_name = "Qwen/Qwen2.5-3B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_cfg,
    device_map="auto",              # spreads across available GPUs
    trust_remote_code=True,
)
model.eval()                        # inference‑only



Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 2048)
    (layers): ModuleList(
      (0-35): 36 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=True)
          (k_proj): Linear4bit(in_features=2048, out_features=256, bias=True)
          (v_proj): Linear4bit(in_features=2048, out_features=256, bias=True)
          (o_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear4bit(in_features=2048, out_features=11008, bias=False)
          (up_proj): Linear4bit(in_features=2048, out_features=11008, bias=False)
          (down_proj): Linear4bit(in_features=11008, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((2048,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((2048,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((2048,), 
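
One detail in the printout differs from the `Attention` class in Part I: `q_proj` maps 2048 to 2048 features, while `k_proj` and `v_proj` map 2048 to only 256. That is consistent with grouped-query attention, where several query heads share one key/value head. The arithmetic below is a sketch; the projection sizes come from the printout, but the query head count of 16 is an assumption that the printout itself does not show.

```python
Q_OUT, KV_OUT = 2048, 256  # out_features of q_proj and of k_proj / v_proj in the printout
N_Q_HEADS = 16             # assumption: query head count, not visible in the printout

head_dim = Q_OUT // N_Q_HEADS            # features per head
n_kv_heads = KV_OUT // head_dim          # key/value heads implied by the smaller projection
group_size = N_Q_HEADS // n_kv_heads     # query heads sharing each key/value head
print(head_dim, n_kv_heads, group_size)  # 128 2 8
```

Independent of the assumed head count, the key and value projections are 8 times narrower than the query projection, which shrinks the cached keys and values by the same factor.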

In [None]:
# Create prompts for text generation
prompts = [
        "What is the capital of France?",
        "What is the meaning of life?",
        "What is the best programming language?"
]

In [None]:
# Tokenize prompts for model input
for i, text in enumerate(prompts, 1):
    ids = tokenizer(text, return_tensors="pt").input_ids
    print(f"\nPrompt {i}:")
    print("token ids :", ids.tolist()[0][:15], "...")  # preview first 15 tokens
    print("back‑to‑text:", tokenizer.decode(ids[0], skip_special_tokens=True))


Prompt 1:
token ids : [3838, 374, 279, 6722, 315, 9625, 30] ...
back‑to‑text: What is the capital of France?

Prompt 2:
token ids : [3838, 374, 279, 7290, 315, 2272, 30] ...
back‑to‑text: What is the meaning of life?

Prompt 3:
token ids : [3838, 374, 279, 1850, 15473, 4128, 30] ...
back‑to‑text: What is the best programming language?


In [None]:
# Generate text using the language model
device = model.device               # already on GPU via device_map
gen_kwargs = dict(
    max_new_tokens=128,
    temperature=0.7,
    top_p=0.95,
    do_sample=True,
)

for i, text in enumerate(prompts, 1):
    inputs = tokenizer(text, return_tensors="pt").to(device)
    gen_ids = model.generate(**inputs, **gen_kwargs)
    reply = tokenizer.decode(gen_ids[0], skip_special_tokens=True)
    print(f"\n=== Q{i} =============================================")
    print(text)
    print("--- A -----------------------------------------------")
    print(reply.strip())


=== Q1 =============================================
What is the capital of France?
--- A -----------------------------------------------
What is the capital of France? The capital of France is Paris.

Paris is a beautiful city located in northern France, and it has been the capital of France since 1795. It is known for its art, culture, cuisine, fashion, and historical landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. 

The city has a rich history dating back to Roman times, and it has played a significant role in French politics, art, science, and technology throughout the centuries. Today, Paris continues to be one of the most popular tourist destinations in the world, attracting millions of visitors each year. 

In addition to

=== Q2 =============================================
What is the meaning of life?
--- A -----------------------------------------------
What is the meaning of life? This question has puzzled many people throughout history. It is a philosophical inquiry that deals with the ultimate purpose and meaning of existence. The answer to this q