# A4: Do You Agree?

In [1]:
import math
import re
import random
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

In [11]:
from transformers import BertTokenizerFast
from datasets import load_dataset

In [3]:
import numpy as np

In [4]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


In [5]:
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed(SEED)
    torch.cuda.manual_seed_all(SEED)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

### Dataset Details — WikiText-2

For this assignment, I used the **WikiText-2** dataset, a language modeling corpus extracted from English Wikipedia articles rated “Good” or “Featured.” Because it keeps full articles with their original case, punctuation, and numbers, it supports modeling long-range dependencies and reads more naturally than older, heavily preprocessed benchmarks such as Penn Treebank.

- **Official Name:** WikiText-2  
- **Dataset Source:** Hugging Face — Salesforce/wikitext  
  https://huggingface.co/datasets/Salesforce/wikitext 
- **Original Paper:** Merity, S., Xiong, C., Bradbury, J., & Socher, R. *Pointer Sentinel Mixture Models*. arXiv:1609.07843 (2016); published at ICLR 2017. This paper introduces the WikiText datasets.

**Academic citation (APA):**
> Merity, S., Xiong, C., Bradbury, J., & Socher, R. (2016). *Pointer sentinel mixture models*. arXiv. https://arxiv.org/abs/1609.07843


In [6]:
dataset = load_dataset('wikitext', 'wikitext-2-v1', split='train')

print(f"Dataset loaded. Original size: {len(dataset)} samples")


print("\nSample text from dataset:")
print("-" * 50)
print(dataset[10]['text']) # Print a fixed sample (index 10) to check quality
print("-" * 50)

Dataset loaded. Original size: 36718 samples

Sample text from dataset:
--------------------------------------------------
 The game 's battle system , the <unk> system , is carried over directly from <unk> Chronicles . During missions , players select each unit using a top @-@ down perspective of the battlefield map : once a character is selected , the player moves the character around the battlefield in third @-@ person . A character can only act once per @-@ turn , but characters can be granted multiple turns at the expense of other characters ' turns . Each character has a field and distance of movement limited by their Action <unk> . Up to nine characters can be assigned to a single mission . During gameplay , characters will call out if something happens to them , such as their health points ( HP ) getting low or being knocked out by enemy attacks . Each character has specific " Potentials " , skills unique to each character . They are divided into " Personal Potential " , which 

## Task 1: Training BERT from Scratch

### Preprocessing: Masked Language Modeling (MLM) Logic

To train BERT, the model needs to learn how to predict missing words based on their context. I built a custom `Dataset` class that manually implements the BERT masking strategy. 

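Rather than masking tokens naively, I followed the masking procedure from the original BERT paper:
1. I select 15% of the tokens in each sequence as prediction targets (never `[CLS]`, `[SEP]`, or `[PAD]`).
2. Of those selected tokens, 80% are replaced with a `[MASK]` token, 10% are replaced with a random token, and 10% are left unchanged.

I also set the labels of all *unselected* tokens to `-100`, the value that `nn.CrossEntropyLoss(ignore_index=-100)` skips, so only the selected positions contribute to the loss.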
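
The selection and 80/10/10 corruption rule can be checked empirically on toy data. The following is a minimal, framework-free sketch rather than the notebook's implementation: `mask_tokens`, the six-word `vocab`, and the string `"[MASK]"` placeholder are illustrative assumptions, and plain Python lists stand in for tensors.

```python
import random

MASK, IGNORE = "[MASK]", -100

def mask_tokens(tokens, vocab, rng, p_select=0.15):
    """Return (inputs, labels) after BERT-style 80/10/10 corruption."""
    inputs, labels = list(tokens), [IGNORE] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= p_select:
            continue                       # not selected: label stays -100
        labels[i] = tok                    # selected: the model must predict the original
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token, but still predict it
    return inputs, labels

rng = random.Random(0)
vocab = ["the", "game", "battle", "system", "player", "map"]  # toy vocabulary
tokens = [rng.choice(vocab) for _ in range(100_000)]
inputs, labels = mask_tokens(tokens, vocab, rng)

selected = [i for i, lab in enumerate(labels) if lab != IGNORE]
masked = sum(inputs[i] == MASK for i in selected)
print(f"selected: {len(selected) / len(tokens):.3f}")  # close to 0.15
print(f"[MASK]:   {masked / len(selected):.3f}")       # close to 0.80
```

The cell below reaches the same split with two Bernoulli draws: probability 0.8 for `[MASK]`, then 0.5 on the remaining 20% for a random token.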

In [13]:
# 1. Configuration
# We define our hyperparameters here.
MAX_LEN = 128        # Maximum length of a sentence (tokens)
BATCH_SIZE = 32      # Process 32 samples at a time
MAX_MASK = 20        # Maximum number of tokens to mask per sequence (defined here but not enforced by the dataset below)
VOCAB_SIZE = 30522   # Standard BERT vocab size

# 2. Initialize Tokenizer
# We use the standard BERT tokenizer to convert text -> numbers
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

# 3. Custom Dataset Class
# This class handles the logic of reading text and creating masks manually.
class BERTDataset(Dataset):
    def __init__(self, data, tokenizer, max_len):
        self.data = data
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        # 1. Get the text
        text = self.data[index]['text']

        # 2. Tokenize
        # - truncation=True: Cut off if longer than max_len
        # - padding='max_length': Add [PAD] if shorter than max_len
        # - return_tensors='pt': Return PyTorch tensors
        encoding = self.tokenizer(
            text,
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )

        # Remove the batch dimension added by tokenizer (1, 128) -> (128)
        input_ids = encoding['input_ids'].squeeze(0)
        attention_mask = encoding['attention_mask'].squeeze(0)

        # 3. Create Masking (MLM) - "From Scratch" Logic
        # We create a copy of input_ids to be our "labels" (the answers)
        labels = input_ids.clone()

        # Per-token selection probability: 15% for ordinary tokens,
        # and 0 for the special tokens [CLS] (101), [SEP] (102), and [PAD] (0)
        probability_matrix = torch.full(labels.shape, 0.15)

        # Flag special tokens using the dataset's own tokenizer (not the global one)
        special_tokens_mask = (input_ids == self.tokenizer.pad_token_id) | \
                              (input_ids == self.tokenizer.cls_token_id) | \
                              (input_ids == self.tokenizer.sep_token_id)

        # Set probability of masking special tokens to 0
        probability_matrix.masked_fill_(special_tokens_mask, value=0.0)

        # Select indices to mask based on probabilities
        masked_indices = torch.bernoulli(probability_matrix).bool()

        # Set labels for NON-selected tokens to -100 (so the loss function ignores them)
        labels[~masked_indices] = -100

        # 4. Apply the 80-10-10 Rule
        # 80% of the selected tokens: Replace with [MASK] token
        indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
        input_ids[indices_replaced] = self.tokenizer.mask_token_id

        # 10% of the selected tokens: Replace with a random word
        # (probability 0.5 applied to the 20% that were not replaced by [MASK])
        indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
        random_words = torch.randint(len(self.tokenizer), labels.shape, dtype=torch.long)
        input_ids[indices_random] = random_words[indices_random]

        # The remaining 10% are kept as original words (but still predicted)

        return {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'labels': labels
        }

# 4. Create DataLoader
print("Preparing DataLoader...")
train_dataset = BERTDataset(dataset, tokenizer, MAX_LEN)
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)

# 5. Verify a single batch
sample_batch = next(iter(train_loader))
print(f"DataLoader ready. Batch size: {BATCH_SIZE}")
print(f"Input shape: {sample_batch['input_ids'].shape}") 
print(f"Labels shape: {sample_batch['labels'].shape}")  

Preparing DataLoader...
DataLoader ready. Batch size: 32
Input shape: torch.Size([32, 128])
Labels shape: torch.Size([32, 128])


## Building the BERT Architecture
I built the Transformer components using PyTorch `nn.Module`:
* **BERTEmbedding**: Combines token embeddings, position embeddings, and segment embeddings.
* **MultiHeadAttention**: Calculates the scaled dot-product attention to help the model focus on different parts of the sentence.
* **EncoderLayer**: Stacks the attention mechanism with a feed-forward network, layer normalization, and dropout.
* **BERT**: The final wrapper that connects the embedding layer, multiple encoder layers, and adds a linear classifier on top to predict our masked vocabulary words.
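
To make the padding mask in the attention step concrete, here is a small sketch of scaled dot-product attention for a single query, written with plain Python lists. The helper names and toy vectors are assumptions for illustration; the notebook's `MultiHeadAttention` performs the same computation on batched tensors across several heads.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, keys, values, key_mask):
    """Single-query scaled dot-product attention with a padding mask."""
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    # Padded keys (mask == 0) get a very negative score before the softmax
    scores = [s if m == 1 else -1e9 for s, m in zip(scores, key_mask)]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, context

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]        # third position is padding
values = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
weights, context = attention(q, keys, values, key_mask=[1, 1, 0])
print([round(w, 3) for w in weights])   # the padded position receives weight 0.0
print([round(c, 3) for c in context])
```

Even though the padded key has the largest raw dot product with the query, the mask drives its weight to zero, so its value vector never reaches the context.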

In [14]:

#  BERT Model Architecture

class BERTEmbedding(nn.Module):
    def __init__(self, vocab_size, d_model, max_len, n_segments=2, dropout=0.1):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)
        self.seg_embed = nn.Embedding(n_segments, d_model)
        self.norm = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, seg):
        seq_len = x.size(1)
        # Create position tensor: [0, 1, 2, ..., seq_len-1]
        pos = torch.arange(seq_len, dtype=torch.long, device=x.device)
        pos = pos.unsqueeze(0).expand_as(x)
        
        # Sum all embeddings
        embedding = self.tok_embed(x) + self.pos_embed(pos) + self.seg_embed(seg)
        return self.drop(self.norm(embedding))

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        
        self.d_k = d_model // n_heads
        self.n_heads = n_heads
        
        # Linear projections
        self.W_Q = nn.Linear(d_model, d_model)
        self.W_K = nn.Linear(d_model, d_model)
        self.W_V = nn.Linear(d_model, d_model)
        self.fc = nn.Linear(d_model, d_model)
        
        self.layernorm = nn.LayerNorm(d_model)

    def forward(self, Q, K, V, attn_mask):
        batch_size = Q.size(0)
        
        # Linear projections and split into heads
        # Shape after transpose: [batch_size, n_heads, seq_len, d_k]
        q_s = self.W_Q(Q).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        k_s = self.W_K(K).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        v_s = self.W_V(V).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)

        # Scaled Dot-Product Attention
        # Scores: [batch_size, n_heads, seq_len, seq_len]
        scores = torch.matmul(q_s, k_s.transpose(-1, -2)) / math.sqrt(self.d_k)
        
        # Apply mask (prevent attending to padding positions)
        # attn_mask shape: [batch_size, seq_len] -> [batch_size, 1, 1, seq_len]
        if attn_mask is not None:
            scores.masked_fill_(attn_mask.unsqueeze(1).unsqueeze(1) == 0, -1e9)

        attn = F.softmax(scores, dim=-1)
        context = torch.matmul(attn, v_s) # [batch_size, n_heads, seq_len, d_k]
        
        # Concatenate heads
        context = context.transpose(1, 2).contiguous().view(batch_size, -1, self.n_heads * self.d_k)
        output = self.fc(context)
        
        return self.layernorm(output + Q), attn # Residual connection + LayerNorm

class PositionWiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)
        self.layernorm = nn.LayerNorm(d_model)

    def forward(self, x):
        residual = x
        # GELU is standard for BERT (Smoother ReLU)
        x = F.gelu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return self.layernorm(x + residual)

class EncoderLayer(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, dropout):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, n_heads)
        self.ffn = PositionWiseFeedForward(d_model, d_ff, dropout)

    def forward(self, x, attn_mask):
        x, attn_weights = self.attn(x, x, x, attn_mask)
        x = self.ffn(x)
        return x, attn_weights

class BERT(nn.Module):
    def __init__(self, vocab_size, d_model, n_layers, n_heads, d_ff, max_len, n_segments, dropout):
        super().__init__()
        self.embedding = BERTEmbedding(vocab_size, d_model, max_len, n_segments, dropout)
        self.layers = nn.ModuleList([
            EncoderLayer(d_model, n_heads, d_ff, dropout) for _ in range(n_layers)
        ])
        
        # MLM Head
        # Projects back to vocab_size to predict the masked word
        self.linear = nn.Linear(d_model, d_model)
        self.activ = nn.GELU()
        self.norm = nn.LayerNorm(d_model)
        self.classifier = nn.Linear(d_model, vocab_size)

        # Weight tying (optional but standard in BERT to reduce params)
        # self.classifier.weight = self.embedding.tok_embed.weight

    def forward(self, input_ids, attention_mask, segment_ids=None):
        # Handle missing segment_ids (default to all zeros)
        if segment_ids is None:
            segment_ids = torch.zeros_like(input_ids)

        output = self.embedding(input_ids, segment_ids)
        
        for layer in self.layers:
            output, attn = layer(output, attention_mask)
        
        # MLM Prediction
        # We only predict the masked tokens, but for implementation simplicity
        # we project the whole sequence and let the loss function handle the masking logic.
        h_masked = self.norm(self.activ(self.linear(output)))
        logits_lm = self.classifier(h_masked)
        
        return logits_lm

In [15]:
# Training Setup & Execution


# 1. Hyperparameters (Small BERT for faster training)
VOCAB_SIZE = 30522 # Standard BERT tokenizer vocab size
D_MODEL = 768      # Standard BERT hidden size
N_LAYERS = 6       # Standard is 12, we use 6 for speed
N_HEADS = 8        # Must divide D_MODEL (768/8 = 96)
D_FF = 3072        # Standard BERT feed-forward size (4 * D_MODEL)
MAX_LEN = 128
N_SEGMENTS = 2
DROPOUT = 0.1
EPOCHS = 10         # Small dataset needs more epochs to converge
LEARNING_RATE = 1e-4

# 2. Initialize Model
model = BERT(VOCAB_SIZE, D_MODEL, N_LAYERS, N_HEADS, D_FF, MAX_LEN, N_SEGMENTS, DROPOUT)
model.to(device)
print(f"Model initialized on {device}")

# 3. Optimizer & Loss
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)
criterion = nn.CrossEntropyLoss(ignore_index=-100) # Ignore non-masked tokens

# 4. Training Loop
print("\n Starting Training (MLM)...")
model.train()

for epoch in range(EPOCHS):
    total_loss = 0
    for step, batch in enumerate(train_loader):
        # Move batch to GPU
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        
        optimizer.zero_grad()
        
        # Forward pass
        outputs = model(input_ids, attention_mask)
        
        # Calculate Loss
        # Flatten outputs: [batch_size * seq_len, vocab_size]
        # Flatten labels:  [batch_size * seq_len]
        loss = criterion(outputs.view(-1, VOCAB_SIZE), labels.view(-1))
        
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
        
        if step % 100 == 0 and step > 0:
            print(f"Epoch {epoch+1} | Step {step} | Loss: {loss.item():.4f}")

    avg_loss = total_loss / len(train_loader)
    print(f" Epoch {epoch+1} Completed | Average Loss: {avg_loss:.4f}")



Model initialized on cuda

 Starting Training (MLM)...
Epoch 1 | Step 100 | Loss: 7.2220
Epoch 1 | Step 200 | Loss: 6.1597
Epoch 1 | Step 300 | Loss: 6.9123
Epoch 1 | Step 400 | Loss: 6.5726
Epoch 1 | Step 500 | Loss: 6.1917
Epoch 1 | Step 600 | Loss: 6.7937
Epoch 1 | Step 700 | Loss: 6.3622
Epoch 1 | Step 800 | Loss: 6.2571
Epoch 1 | Step 900 | Loss: 6.2625
Epoch 1 | Step 1000 | Loss: 6.0941
Epoch 1 | Step 1100 | Loss: 6.4586
 Epoch 1 Completed | Average Loss: 6.6300
Epoch 2 | Step 100 | Loss: 5.9137
Epoch 2 | Step 200 | Loss: 5.8978
Epoch 2 | Step 300 | Loss: 6.1429
Epoch 2 | Step 400 | Loss: 5.4760
Epoch 2 | Step 500 | Loss: 5.3608
Epoch 2 | Step 600 | Loss: 5.4596
Epoch 2 | Step 700 | Loss: 5.8484
Epoch 2 | Step 800 | Loss: 4.5628
Epoch 2 | Step 900 | Loss: 5.7266
Epoch 2 | Step 1000 | Loss: 6.2032
Epoch 2 | Step 1100 | Loss: 5.8064
 Epoch 2 Completed | Average Loss: 5.7019
Epoch 3 | Step 100 | Loss: 5.3395
Epoch 3 | Step 200 | Loss: 5.6606
Epoch 3 | Step 300 | Loss: 5.4777
Epoch 3

In [16]:
# 5. Save the Model 
save_path = "bert_mlm_scratch.pth"
torch.save(model.state_dict(), save_path)
print(f"\n Model weights saved to {save_path}")


 Model weights saved to bert_mlm_scratch.pth


## Task 2: Sentence Embedding with Sentence-BERT

We need to train a Siamese Network to classify the relationship between two sentences (Entailment, Neutral, Contradiction).

In [19]:
# SNLI Data Loading & Preprocessing

# 1. Load SNLI Dataset
print("Loading SNLI dataset...")
snli_dataset = load_dataset('snli')

# 2. Filter & Subset
# - Remove entries with label -1 (which means annotators disagreed)
# - Select a subset (e.g., 50k samples) to ensure training finishes quickly
print("Filtering and creating subset...")
snli_dataset = snli_dataset.filter(lambda x: x['label'] != -1)

# We use 50,000 samples for training to match the scale of Task 1
train_subset = snli_dataset['train'].select(range(50000)) 
val_subset = snli_dataset['validation']
test_subset = snli_dataset['test']

print(f"SNLI Train Size: {len(train_subset)}")
print(f"SNLI Val Size:   {len(val_subset)}")
print(f"SNLI Test Size:  {len(test_subset)}")

# 3. Custom Dataset for Sentence Pairs
class SNLIDataset(Dataset):
    def __init__(self, data, tokenizer, max_len):
        self.data = data
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        row = self.data[index]
        premise = row['premise']
        hypothesis = row['hypothesis']
        label = row['label']

        # Tokenize Premise (Sentence A)
        encoding_a = self.tokenizer(
            premise,
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        
        # Tokenize Hypothesis (Sentence B)
        encoding_b = self.tokenizer(
            hypothesis,
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )

        return {
            'input_ids_a': encoding_a['input_ids'].squeeze(0),
            'attention_mask_a': encoding_a['attention_mask'].squeeze(0),
            'input_ids_b': encoding_b['input_ids'].squeeze(0),
            'attention_mask_b': encoding_b['attention_mask'].squeeze(0),
            'label': torch.tensor(label, dtype=torch.long)
        }

# 4. Create DataLoaders
# We reuse the tokenizer and MAX_LEN from Task 1
train_snli_ds = SNLIDataset(train_subset, tokenizer, MAX_LEN)
val_snli_ds = SNLIDataset(val_subset, tokenizer, MAX_LEN)
test_snli_ds = SNLIDataset(test_subset, tokenizer, MAX_LEN)

train_snli_loader = DataLoader(train_snli_ds, batch_size=BATCH_SIZE, shuffle=True)
val_snli_loader = DataLoader(val_snli_ds, batch_size=BATCH_SIZE, shuffle=False)
test_snli_loader = DataLoader(test_snli_ds, batch_size=BATCH_SIZE, shuffle=False)

# Check a batch
sample_snli = next(iter(train_snli_loader))
print(f"\n SNLI DataLoaders ready.")
print(f"Premise Shape: {sample_snli['input_ids_a'].shape}")
print(f"Label Shape: {sample_snli['label'].shape}")

Loading SNLI dataset...
Filtering and creating subset...
SNLI Train Size: 50000
SNLI Val Size:   9842
SNLI Test Size:  9824

 SNLI DataLoaders ready.
Premise Shape: torch.Size([32, 128])
Label Shape: torch.Size([32])


## The Siamese Network Architecture

The original BERT architecture is poorly suited to large-scale sentence comparison because it acts as a cross-encoder: both sentences must be fed through the model together, so comparing $n$ sentences with each other takes $n(n-1)/2$ forward passes.

Sentence-BERT avoids this with a "Siamese" architecture:

1. Sentence A (premise) and Sentence B (hypothesis) pass independently through the same BERT encoder to produce their hidden states.
2. Mean pooling averages the token embeddings of each sentence into a single vector ($u$ and $v$).
3. The vectors $u$, $v$, and the element-wise absolute difference $|u - v|$ are concatenated.
4. The concatenated vector is fed to a softmax classifier that predicts Entailment, Neutral, or Contradiction.

The classification objective being implemented is:

$o = \mathrm{softmax}\left(W^{T} \cdot (u, v, |u - v|)\right)$
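
A tiny worked example may help tie the pooling step to this objective. It is a sketch with plain Python lists and made-up numbers: `mean_pool`, `classify`, the weight matrix `W`, and the bias `b` are illustrative assumptions (the bias is included only because `nn.Linear` adds one by default; the formula above omits it).

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average token vectors over positions where attention_mask == 1."""
    dim = len(token_vectors[0])
    count = max(sum(attention_mask), 1)   # guard against an all-padding row
    return [
        sum(vec[j] for vec, m in zip(token_vectors, attention_mask) if m) / count
        for j in range(dim)
    ]

def classify(u, v, W, b):
    """softmax(W^T (u, v, |u - v|) + b), with W of shape (3 * dim, n_classes)."""
    features = u + v + [abs(a - c) for a, c in zip(u, v)]   # concatenation
    logits = [sum(f * row[k] for f, row in zip(features, W)) + b[k]
              for k in range(len(b))]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return features, [e / total for e in exps]

u = mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0])  # last token is padding
v = mean_pool([[2.0, 2.0], [7.0, 7.0]], [1, 0])
W = [[0.1, 0.0, -0.1] for _ in range(6)]   # hypothetical weights, shape (6, 3)
b = [0.0, 0.0, 0.0]
features, probs = classify(u, v, W, b)
print(features)                        # [2.0, 3.0, 2.0, 2.0, 0.0, 1.0]
print([round(p, 3) for p in probs])    # three class probabilities summing to 1
```

With a hidden size of 2 the feature vector has 6 entries; in the notebook the same construction gives `3 * d_model` inputs to the classifier.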

In [20]:
# Siamese Network Architecture


class MeanPooling(nn.Module):
    """
    Applies mean pooling to the token embeddings, ignoring padding tokens.
    """
    def forward(self, last_hidden_state, attention_mask):
        # Expand attention mask to match hidden state dimensions
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
        
        # Sum the embeddings, but only for non-padding tokens
        sum_embeddings = torch.sum(last_hidden_state * input_mask_expanded, 1)
        
        # Count the number of non-padding tokens (clamp to avoid division by zero)
        sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
        
        # Calculate the mean
        return sum_embeddings / sum_mask

class SiameseNLI(nn.Module):
    def __init__(self, pretrained_bert, d_model, num_classes=3):
        super().__init__()
        # 1. Extract the base encoder from our custom BERT
        self.embedding = pretrained_bert.embedding
        self.layers = pretrained_bert.layers
        
        # 2. Pooling layer
        self.pooler = MeanPooling()
        
        # 3. Classification Head (Softmax Loss)
        # The input is concatenated (u, v, |u-v|), so the dimension is 3 * d_model
        self.classifier = nn.Linear(3 * d_model, num_classes)

    def get_sentence_embedding(self, input_ids, attention_mask):
        # Pass through the embedding layer
        segment_ids = torch.zeros_like(input_ids) # No segments needed for single sentences
        output = self.embedding(input_ids, segment_ids)
        
        # Pass through the Transformer layers
        for layer in self.layers:
            output, _ = layer(output, attention_mask)
            
        # Apply Mean Pooling
        return self.pooler(output, attention_mask)

    def forward(self, input_ids_a, attention_mask_a, input_ids_b, attention_mask_b):
        # 1. Get embeddings for Sentence A (u) and Sentence B (v)
        # Both calls go through the same encoder modules, so the weights are shared
        u = self.get_sentence_embedding(input_ids_a, attention_mask_a)
        v = self.get_sentence_embedding(input_ids_b, attention_mask_b)
        
        # 2. Calculate absolute difference |u - v|
        uv_abs = torch.abs(u - v)
        
        # 3. Concatenate (u, v, |u-v|)
        features = torch.cat([u, v, uv_abs], dim=1)
        
        # 4. Softmax Classifier
        # Note: We return raw logits because PyTorch's CrossEntropyLoss 
        # applies LogSoftmax internally.
        logits = self.classifier(features)
        
        return logits, u, v

# Initialize the Siamese Network
# We pass in the 'model' trained in Task 1. Its embedding and encoder modules are
# reused by reference, so fine-tuning also updates them inside 'model'.
siamese_model = SiameseNLI(model, D_MODEL, num_classes=3)
siamese_model.to(device)

print("Siamese Network Initialized with weights from Task 1.")

Siamese Network Initialized with weights from Task 1.


In [21]:
# Training the Siamese Network

# 1. Hyperparameters for Fine-Tuning
EPOCHS_NLI = 2          
LEARNING_RATE_NLI = 2e-5

# 2. Optimizer and Loss Function
# We use CrossEntropyLoss which combines LogSoftmax and NLLLoss
optimizer_nli = optim.Adam(siamese_model.parameters(), lr=LEARNING_RATE_NLI)
criterion_nli = nn.CrossEntropyLoss()

# 3. Training Loop
print(f"Starting Siamese Network Training (Fine-Tuning on SNLI)...")

for epoch in range(EPOCHS_NLI):
    siamese_model.train()
    total_train_loss = 0
    correct_predictions = 0
    total_samples = 0
    
    for step, batch in enumerate(train_snli_loader):
        # Move tensors to GPU
        input_ids_a = batch['input_ids_a'].to(device)
        attention_mask_a = batch['attention_mask_a'].to(device)
        input_ids_b = batch['input_ids_b'].to(device)
        attention_mask_b = batch['attention_mask_b'].to(device)
        labels = batch['label'].to(device)
        
        optimizer_nli.zero_grad()
        
        # Forward pass
        logits, _, _ = siamese_model(input_ids_a, attention_mask_a, input_ids_b, attention_mask_b)
        
        # Calculate loss
        loss = criterion_nli(logits, labels)
        
        # Backward pass
        loss.backward()
        optimizer_nli.step()
        
        total_train_loss += loss.item()
        
        # Calculate accuracy for monitoring
        predictions = torch.argmax(logits, dim=1)
        correct_predictions += (predictions == labels).sum().item()
        total_samples += labels.size(0)
        
        # Print progress every 200 steps
        if step % 200 == 0 and step > 0:
            current_acc = correct_predictions / total_samples
            print(f"Epoch {epoch+1} | Step {step}/{len(train_snli_loader)} | Loss: {loss.item():.4f} | Acc: {current_acc:.4f}")

    # Calculate average training loss and accuracy
    avg_train_loss = total_train_loss / len(train_snli_loader)
    train_acc = correct_predictions / total_samples
    print(f"Epoch {epoch+1} Train Completed | Avg Loss: {avg_train_loss:.4f} | Train Acc: {train_acc:.4f}")



Starting Siamese Network Training (Fine-Tuning on SNLI)...
Epoch 1 | Step 200/1563 | Loss: 1.0259 | Acc: 0.4512
Epoch 1 | Step 400/1563 | Loss: 0.9197 | Acc: 0.4737
Epoch 1 | Step 600/1563 | Loss: 0.9179 | Acc: 0.4914
Epoch 1 | Step 800/1563 | Loss: 0.9106 | Acc: 0.5066
Epoch 1 | Step 1000/1563 | Loss: 0.7016 | Acc: 0.5159
Epoch 1 | Step 1200/1563 | Loss: 0.8719 | Acc: 0.5242
Epoch 1 | Step 1400/1563 | Loss: 0.9880 | Acc: 0.5297
Epoch 1 Train Completed | Avg Loss: 0.9528 | Train Acc: 0.5349
Epoch 2 | Step 200/1563 | Loss: 0.6787 | Acc: 0.6253
Epoch 2 | Step 400/1563 | Loss: 0.7992 | Acc: 0.6227
Epoch 2 | Step 600/1563 | Loss: 0.9668 | Acc: 0.6220
Epoch 2 | Step 800/1563 | Loss: 0.7900 | Acc: 0.6226
Epoch 2 | Step 1000/1563 | Loss: 0.7235 | Acc: 0.6252
Epoch 2 | Step 1200/1563 | Loss: 0.8005 | Acc: 0.6258
Epoch 2 | Step 1400/1563 | Loss: 0.8410 | Acc: 0.6271
Epoch 2 Train Completed | Avg Loss: 0.8278 | Train Acc: 0.6273


In [22]:
# 4. Save the Fine-Tuned Model
save_path_nli = "sbert_snli_scratch.pth"
torch.save(siamese_model.state_dict(), save_path_nli)
print(f"\nSiamese model weights saved to {save_path_nli}")


Siamese model weights saved to sbert_snli_scratch.pth


## Task 3: Evaluation and Analysis

In [23]:
# Evaluation and Classification Report
from sklearn.metrics import classification_report

# 1. Define the label mapping based on SNLI dataset
# 0: entailment, 1: neutral, 2: contradiction
label_names = ['entailment', 'neutral', 'contradiction']

# 2. Evaluation Function
def evaluate_model(model, dataloader):
    model.eval() # Set model to evaluation mode
    all_predictions = []
    all_true_labels = []
    
    print("Evaluating on Test Set...")
    with torch.no_grad(): # Disable gradient calculation for faster inference
        for batch in dataloader:
            # Move to GPU
            input_ids_a = batch['input_ids_a'].to(device)
            attention_mask_a = batch['attention_mask_a'].to(device)
            input_ids_b = batch['input_ids_b'].to(device)
            attention_mask_b = batch['attention_mask_b'].to(device)
            labels = batch['label'].to(device)
            
            # Get raw logits from the model
            logits, _, _ = model(input_ids_a, attention_mask_a, input_ids_b, attention_mask_b)
            
            # The predicted class is the one with the highest logit score
            predictions = torch.argmax(logits, dim=1)
            
            # Store predictions and true labels
            all_predictions.extend(predictions.cpu().numpy())
            all_true_labels.extend(labels.cpu().numpy())
            
    return all_true_labels, all_predictions

# 3. Run Evaluation on the Test Set
y_true, y_pred = evaluate_model(siamese_model, test_snli_loader)

# 4. Generate Classification Report
print("\n" + "="*50)
print("             CLASSIFICATION REPORT")
print("="*50)
report = classification_report(y_true, y_pred, target_names=label_names, digits=2)
print(report)

Evaluating on Test Set...

             CLASSIFICATION REPORT
               precision    recall  f1-score   support

   entailment       0.61      0.75      0.68      3368
      neutral       0.62      0.63      0.63      3219
contradiction       0.69      0.51      0.59      3237

     accuracy                           0.63      9824
    macro avg       0.64      0.63      0.63      9824
 weighted avg       0.64      0.63      0.63      9824



## Limitations and Potential Improvements

### Documentation & Datasets
* **Pre-training (Task 1):** I used the `wikitext-2-v1` dataset from Hugging Face as a manageable subset of Wikipedia. 
* **Fine-Tuning (Task 2):** I used the `snli` dataset from Hugging Face.
* **Hyperparameters:** The BERT model was configured as a "Small BERT" (6 layers, 8 attention heads, 768 hidden size) to allow training from scratch within reasonable compute limits. Pre-training used a learning rate of 1e-4 for 10 epochs, while fine-tuning used 2e-5 for 2 epochs.
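
As a rough size check for this configuration, the parameter count can be worked out by hand from the layer definitions in the `BERT` class. The helper names below are illustrative assumptions; the arithmetic assumes every `nn.Linear` has a bias and every `nn.LayerNorm` has a scale and a shift, which are the PyTorch defaults.

```python
def linear(n_in, n_out):
    """Parameters in a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

def layer_norm(d):
    """Parameters in a LayerNorm: scale plus shift."""
    return 2 * d

def bert_param_count(vocab=30522, d_model=768, n_layers=6, d_ff=3072,
                     max_len=128, n_segments=2):
    # Token, position, and segment embedding tables plus the embedding LayerNorm
    embedding = (vocab + max_len + n_segments) * d_model + layer_norm(d_model)
    # W_Q, W_K, W_V, and the output projection, plus the attention LayerNorm
    attention = 4 * linear(d_model, d_model) + layer_norm(d_model)
    # Two feed-forward projections plus the feed-forward LayerNorm
    ffn = linear(d_model, d_ff) + linear(d_ff, d_model) + layer_norm(d_model)
    # MLM head: dense layer, LayerNorm, and the (untied) vocabulary projection
    mlm_head = linear(d_model, d_model) + layer_norm(d_model) + linear(d_model, vocab)
    return {
        "embedding": embedding,
        "encoder": n_layers * (attention + ffn),
        "mlm_head": mlm_head,
    }

counts = bert_param_count()
total = sum(counts.values())
for name, n in counts.items():
    print(f"{name:>10}: {n:,}")
print(f"{'total':>10}: {total:,}")   # 90,133,050
```

Of that total, 23,440,896 parameters (`vocab * d_model`) sit in the untied output projection of the MLM head; enabling the commented-out weight tying would remove that matrix from the count. The number of heads does not change the total, since the heads only partition `d_model`.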

### Limitations & Challenges
1. **Dataset Size:** Training BERT from scratch on such a small corpus (36,718 WikiText-2 lines for pre-training, plus 50,000 SNLI pairs for fine-tuning) is conceptually correct, but it severely limits the range of vocabulary and context the model actually sees. The original BERT was pre-trained on about 3.3 billion words.
2. **Compute Constraints:** Due to local hardware limits, I reduced the number of Transformer layers (from 12 to 6) and attention heads (from 12 to 8) relative to BERT-base. This reduces the model's capacity to capture complex sentence relationships.
3. **Training Time:** The pre-training phase was short: 10 epochs at roughly 1,148 batches per epoch is only about 11,500 optimization steps, whereas the original BERT was pre-trained for 1,000,000 steps at a batch size of 256.
4. **No NSP Task:** I exclusively used MLM for pre-training. The original BERT also uses Next Sentence Prediction (NSP), which might have helped the Siamese network better understand sentence-pair relationships later on.

### Potential Improvements
1. **Scale Up:** If given access to a computing cluster, I would train on the full English Wikipedia and BookCorpus with the standard 12-layer architecture.
2. **Add MNLI Data:** As suggested in the assignment, supplementing the SNLI dataset with the MNLI dataset would introduce more diverse linguistic genres (e.g., telephone conversations, fiction), improving the model's generalization capabilities.
3. **Longer Fine-Tuning:** Implementing a learning rate scheduler with warm-up steps and training for 4-5 epochs on the SNLI dataset could squeeze out slightly better classification accuracy.
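
One possible shape for such a schedule is linear warm-up followed by linear decay. The sketch below is an assumption about how it might be set up, not something the notebook ran: `warmup_linear_decay`, the 10% warm-up fraction, and the 4-epoch budget are illustrative choices, while the 1,563 steps per epoch and the 2e-5 base rate come from the fine-tuning cell.

```python
def warmup_linear_decay(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)            # ramp from 0 up to 1
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))  # decay to 0

steps_per_epoch = 1563                  # from the fine-tuning log
total_steps = 4 * steps_per_epoch       # assumed: 4 epochs
warmup_steps = int(0.1 * total_steps)   # assumed: 10% warm-up
base_lr = 2e-5

for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    lr = base_lr * warmup_linear_decay(step, warmup_steps, total_steps)
    print(f"step {step:>5}: lr = {lr:.2e}")
```

In PyTorch this multiplier could be handed to `torch.optim.lr_scheduler.LambdaLR` as its `lr_lambda` argument, with `scheduler.step()` called after each `optimizer_nli.step()`.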