TASK 1

In [1]:
from typing import List, Optional
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer


class SentenceTransformer(nn.Module):
    """
    Wraps a pretrained encoder (default: bert‑base‑uncased) with mean‑pooling
    to produce one fixed‑length embedding per sentence.
    """

    def __init__(
        self,
        model_name: str = "bert-base-uncased",
        trainable: bool = False,      # Set True to fine‑tune
        device: Optional[str] = None,
    ):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.encoder = AutoModel.from_pretrained(model_name)
        self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")
        self.to(self.device)

        # Optionally freeze the backbone
        if not trainable:
            for p in self.encoder.parameters():
                p.requires_grad = False

    def forward(self, sentences: List[str]) -> torch.Tensor:
        """
        Returns (batch_size, hidden) tensor of sentence embeddings.
        """
        encoded = self.tokenizer(
            sentences,
            padding=True,
            truncation=True,
            return_tensors="pt",
        ).to(self.device)

        # Encoder outputs: last_hidden_state shape = (B, L, H)
        last_hidden = self.encoder(**encoded).last_hidden_state

        # --- Mean pooling ---
        # Zero out padding tokens, then divide by the number of real tokens.
        # The integer mask is cast to the hidden-state dtype so all arithmetic stays in floating point.
        attention_mask = encoded["attention_mask"].unsqueeze(-1).to(last_hidden.dtype)  # (B, L, 1)
        masked_hidden = last_hidden * attention_mask
        summed = masked_hidden.sum(dim=1)              # (B, H)
        counts = attention_mask.sum(dim=1)             # (B, 1)
        sentence_embeddings = summed / counts.clamp(min=1e-9)

        return sentence_embeddings

    # Convenience encode wrapper
    @torch.inference_mode()
    def encode(self, sentences: List[str]) -> torch.Tensor:
        return self.forward(sentences).cpu()



In [2]:
if __name__ == "__main__":
    model = SentenceTransformer(trainable=False)
    samples = [
        "Transformers are changing natural‑language processing.",
        "An embedding represents a sentence as a dense vector.",
        "The weather is lovely today!",
    ]
    embeddings = model.encode(samples)
    print("Embeddings shape:", embeddings.shape)  # (3, 768)
    print(embeddings)  # full tensor

Embeddings shape: torch.Size([3, 768])
tensor([[ 0.2334,  0.1515,  0.0473,  ..., -0.1749, -0.2418,  0.0309],
        [-0.1025, -0.3769,  0.0619,  ...,  0.0876, -0.7424,  0.5973],
        [-0.0817, -0.3850,  0.0510,  ...,  0.0072,  0.0259, -0.0114]])


#### Model Architecture Choices

Backbone Model:
 - Used bert-base-uncased from Hugging Face.
 - Chosen for its balance of performance and efficiency, widely adopted and well-supported.

Embedding Strategy:
 - Used mean pooling over the last hidden layer.
 - Applied attention mask to exclude padded tokens during pooling.
 - Chosen for simplicity and effectiveness across unsupervised and similarity tasks.
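
The effect of the attention mask on pooling can be checked on a toy tensor. The sketch below is not part of the notebook: the helper name `masked_mean_pool` and the numbers are illustrative assumptions, chosen so the padded position would visibly distort a naive average.

```python
import torch

def masked_mean_pool(last_hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors over real (non-padding) positions only."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden.dtype)   # (B, L, 1)
    summed = (last_hidden * mask).sum(dim=1)                    # (B, H)
    counts = mask.sum(dim=1).clamp(min=1e-9)                    # (B, 1)
    return summed / counts

# One sentence, three positions, hidden size 2; the last position is padding.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])

pooled = masked_mean_pool(hidden, mask)   # padding row is excluded
naive = hidden.mean(dim=1)                # padding row leaks into the average
print(pooled)  # tensor([[2., 3.]])
```

Without the mask, the padded row pulls the average far away from the two real token vectors, which is why the mask is applied before summing.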

Trainability Configuration:
 - Included a trainable flag to toggle freezing of the backbone.
 - Allows use as a static feature extractor or as a fine-tunable encoder.

Device Handling:
 - Automatically detects GPU with torch.cuda.is_available().
 - Ensures code runs efficiently on both CPU and GPU without manual changes.

#### Implementation Choices

Frameworks Used:
 - PyTorch: for neural network modules and training.
 - Transformers (Hugging Face): for pretrained models and tokenization.

Tokenizer Settings:
 - padding=True pads each batch to its longest sentence; truncation=True cuts inputs at the model's maximum length, so every batch is a rectangular tensor.
 - Returns PyTorch tensors for seamless integration with the model.

Model Structure:
 - Single class SentenceTransformer encapsulates tokenizer, encoder, and pooling.
 - Clean separation of components allows easy extension to multi-task learning.

Inference Convenience:
 - Provided .encode() method with @torch.inference_mode() for easy use and no gradient computation.

Testing:
 - Included three example sentences to demonstrate sentence embedding outputs.
 - Printed embedding shape and actual tensor values.

TASK 2

In [12]:
from typing import Dict, List, Optional
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer


class MultiTaskSentenceTransformer(nn.Module):
    """
    Transformer encoder with two parallel heads:
      • Task A – sentence‑level classification
      • Task B – token‑level classification (e.g., NER)
    """

    def __init__(
        self,
        model_name: str = "bert-base-uncased",
        num_classes_task_a: int = 4,
        num_labels_task_b: int = 7,
        trainable: bool = True,
        device: Optional[str] = None,
    ):
        super().__init__()

        # Shared backbone
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden_size = self.encoder.config.hidden_size

        # Task heads
        self.sent_classifier = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_size, num_classes_task_a),
        )
        self.token_classifier = nn.Linear(hidden_size, num_labels_task_b)

        # Optional freezing
        if not trainable:
            for p in self.encoder.parameters():
                p.requires_grad = False

        # Device
        self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")
        self.to(self.device)

    # --------------------------------------------------- #
    #  Training / inference forward (tensor interface)   #
    # --------------------------------------------------- #
    def forward(
        self,
        input_ids: torch.Tensor,
        attention_mask: torch.Tensor,
    ) -> Dict[str, torch.Tensor]:
        """
        Args
        ----
        input_ids:      (B, L)
        attention_mask: (B, L)

        Returns
        -------
        dict with:
            • task_a_logits: (B, num_classes_task_a)
            • task_b_logits: (B, L, num_labels_task_b)
        """
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        last_hidden = outputs.last_hidden_state                          # (B, L, H)

        # Task A
        mask = attention_mask.unsqueeze(-1).to(last_hidden.dtype)        # (B, L, 1)
        pooled = (last_hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        task_a_logits = self.sent_classifier(pooled)                     # (B, C)

        # Task B
        task_b_logits = self.token_classifier(last_hidden)               # (B, L, K)

        return {"task_a_logits": task_a_logits, "task_b_logits": task_b_logits}

    # --------------------------------------------------- #
    #  Convenience wrapper (raw sentences → predictions)  #
    # --------------------------------------------------- #
    @torch.inference_mode()
    def predict(self, sentences: List[str]):
        """
        Returns
        -------
        tuple (cls_ids, ner_ids):
            cls_ids: Tensor (B,)   – predicted class index per sentence
            ner_ids: Tensor (B, L) – predicted label index per token
        """
        enc = self.tokenizer(
            sentences,
            padding=True,
            truncation=True,
            return_tensors="pt",
        ).to(self.device)

        self.eval()  # disable dropout in the heads so repeated calls give the same output
        out = self(enc["input_ids"], enc["attention_mask"])
        cls_ids = out["task_a_logits"].argmax(-1).cpu()
        ner_ids = out["task_b_logits"].argmax(-1).cpu()
        return cls_ids, ner_ids


In [16]:
if __name__ == "__main__":
    model = MultiTaskSentenceTransformer(trainable=False)

    demo_sentences = [
        "Barack Obama was born in Hawaii.",
        "Apple unveiled a new iPhone during its September event.",
    ]
    cls_ids, ner_ids = model.predict(demo_sentences)

    print("Task A - predicted class indices:", cls_ids.tolist())
    print("Task B - predicted NER label indices (per token):")
    for sent, ids in zip(demo_sentences, ner_ids):
        print(f"  {sent}\n  {ids.tolist()}")
        

Task A - predicted class indices: [2, 2]
Task B - predicted NER label indices (per token):
  Barack Obama was born in Hawaii.
  [2, 6, 6, 2, 2, 3, 3, 3, 3, 6, 6, 5]
  Apple unveiled a new iPhone during its September event.
  [2, 2, 3, 2, 2, 1, 5, 3, 3, 3, 3, 5]


#### Architectural changes for multi‑task learning:

Shared Transformer Encoder:
 - Kept the original bert-base-uncased backbone to provide a common semantic representation for all tasks.

Task‑specific Heads:
 - Sentence Classification Head (Task A):
    - Mean‑pooled sentence embedding → 2‑layer feed‑forward network → logits over num_classes_task_a (softmax is applied inside the cross‑entropy loss, not in the model).
 - Token Classification Head (Task B: NER):
    - Linear layer applied to every token hidden state → logits over num_labels_task_b.

Output Shapes:
 - Task A logits: (batch_size, num_classes_task_a)
 - Task B logits: (batch_size, seq_len, num_labels_task_b)

Loss Computation (training‑time):
 - Use cross‑entropy for both heads.
 - Total loss = λ * loss_task_a + (1‑λ) * loss_task_b, where λ can be tuned or dynamically balanced.

Gradient Flow:
 - Both heads back‑propagate through the shared encoder, enabling transfer of useful features across tasks.

Data Handling:
 - Sentences are tokenized once; the same batch feeds both heads.
 - For NER, generated label IDs must align with tokenized sub‑words (e.g., first‑sub‑token labelling or BIO on split tokens).
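
First‑sub‑token labelling can be sketched without loading a tokenizer. Fast Hugging Face tokenizers expose `word_ids()` on their output, which maps each sub‑token to the index of the word it came from (and `None` for special tokens). The `word_ids` list below is hand‑written for illustration, not real tokenizer output, and the helper name `align_labels` is an assumption.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # value the token-level loss is told to skip

def align_labels(word_labels: List[int], word_ids: List[Optional[int]]) -> List[int]:
    """First-sub-token labelling: label the first piece of each word, ignore the rest."""
    aligned: List[int] = []
    previous: Optional[int] = None
    for wid in word_ids:
        if wid is None:                  # special token such as [CLS], [SEP] or padding
            aligned.append(IGNORE_INDEX)
        elif wid != previous:            # first sub-token of a new word
            aligned.append(word_labels[wid])
        else:                            # continuation sub-token of the same word
            aligned.append(IGNORE_INDEX)
        previous = wid
    return aligned

# Hand-written example: three words, the second split into two pieces.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [1, 2, 0]
aligned = align_labels(word_labels, word_ids)
print(aligned)  # [-100, 1, 2, -100, 0, -100]
```

Labelling every sub‑token with the word's tag instead of `-100` is the other common choice; either way the label tensor ends up the same length as `input_ids`.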

Flexibility:
 - Heads are independent modules—additional tasks (e.g., sentence similarity regression, question answering) can be appended with minimal code changes.

Training Strategy:
 - Alternate mini‑batches from each task or mix them within a batch.
 - Optionally freeze encoder for initial steps, then unfreeze for full fine‑tuning.

#### Why the numbers vary every run
The BERT backbone is pretrained, but both task‑specific heads start from random weights and have never been trained, so the logits, and therefore the arg‑max IDs, are effectively random.

Re‑running the script will almost certainly produce different class IDs and token‑label IDs unless a manual random seed is set before the model is constructed.
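
A minimal sketch of the seeding point, using a small `nn.Linear` as a stand‑in for a task head (the helper `make_head` and the layer sizes are illustrative assumptions):

```python
import torch
from torch import nn

def make_head(seed: int) -> nn.Linear:
    torch.manual_seed(seed)      # fixes the RNG state used for weight initialisation
    return nn.Linear(8, 4)

head_1 = make_head(0)
head_2 = make_head(0)            # same seed, same initial weights
head_3 = make_head(1)            # different seed, different initial weights

same = torch.equal(head_1.weight, head_2.weight)
different = not torch.equal(head_1.weight, head_3.weight)
```

In the notebook the equivalent step would be calling `torch.manual_seed(...)` before `MultiTaskSentenceTransformer(...)` is constructed. Dropout in the sentence head is a second source of variation if the module is left in training mode; calling `model.eval()` before predicting removes it.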

TASK 3

### Training‑time configuration options
#### 1  Entire network frozen
What happens
 - Backbone and all task‑specific heads keep their initial weights.
 - The model becomes a static feature extractor; gradients are never computed.

Implications
 - No training cost: only forward passes are run.
 - No over‑fitting to new data, because no weight ever changes.
 - Useful only if the heads are already trained, or if only the encoder's embeddings are needed (e.g. nearest‑neighbour search or clustering); randomly initialised heads stay random.

When it makes sense
 - You have no labelled data but need quick sentence embeddings.
 - You are running on the edge with limited memory/compute and can’t afford back‑prop.
 - Serving latency is critical and you can pre‑compute all embeddings offline.

#### 2  Only transformer backbone frozen
What happens
 - The large pretrained encoder is locked; only the lightweight heads learn.

Advantages
 - Lower memory and compute than full fine‑tuning—the backbone’s activations do not require gradient storage.
 - Reduced catastrophic forgetting: preserves the general‑language knowledge captured during pre‑training.
 - Heads adapt quickly even with a small labelled set; common in few‑shot settings.

Trade‑offs
 - Can’t adapt deeper representations to the specifics of your domain (e.g. biomedical jargon).
 - If tasks differ greatly from pre‑training data, performance may be capped.

Typical usage
 - Many production systems fine‑tune only classification layers while freezing BERT to save GPU hours.
 - Academic low‑resource benchmarks often start with this setup.
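
A hedged sketch of this configuration on a tiny stand‑in model (`TinyMultiTask` and its layer sizes are assumptions made so the example runs without downloading BERT). It freezes the encoder, passes only the remaining parameters to the optimizer, and counts what is left to train.

```python
import torch
from torch import nn

class TinyMultiTask(nn.Module):
    """Stand-in for the notebook model: a small 'encoder' plus two heads."""
    def __init__(self, hidden: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh())
        self.head_a = nn.Linear(hidden, 4)
        self.head_b = nn.Linear(hidden, 7)

model = TinyMultiTask()

# Freeze the backbone only; the heads keep requires_grad=True.
for p in model.encoder.parameters():
    p.requires_grad = False

# Hand the optimizer just the parameters that still learn.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-3)

n_trainable = sum(p.numel() for p in trainable)
n_total = sum(p.numel() for p in model.parameters())
print(f"trainable {n_trainable} / total {n_total}")
```

With a real BERT backbone the ratio is far more extreme, since the encoder holds nearly all of the parameters.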

#### 3  Freeze just one task‑specific head
Why you might do it
 - You already have a well‑performing head for Task A but must add Task B without hurting Task A.
 - Or you’re transferring knowledge from Task A to Task B and want Task A as a regulariser.

Effects
 - Gradients from the frozen head do not flow into that head’s weights, but they do flow through the shared encoder (unless it is also frozen).
 - Keeps performance on the frozen task stable while letting the other task improve.

Pitfalls
 - Encoder drift: the frozen head cannot adapt, so updates driven by Task B can shift the shared representation and degrade Task A anyway. Keeping the Task A loss in the objective (its gradient still reaches the encoder) limits this.
 - Sometimes requires loss‑weighting tricks or periodic evaluation checkpoints.
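
The gradient behaviour described under Effects can be verified on a toy setup. Everything below is a stand‑in (single linear layers instead of BERT and the real heads, random inputs and labels), so it only illustrates where gradients end up, not realistic training.

```python
import torch
from torch import nn

torch.manual_seed(0)
encoder = nn.Linear(8, 8)     # stand-in for the shared backbone
head_a = nn.Linear(8, 4)      # mature head, frozen
head_b = nn.Linear(8, 7)      # new head, still learning

for p in head_a.parameters():
    p.requires_grad = False

x = torch.randn(5, 8)
y_a = torch.randint(0, 4, (5,))
y_b = torch.randint(0, 7, (5,))
ce = nn.CrossEntropyLoss()

h = encoder(x)
loss = ce(head_a(h), y_a) + ce(head_b(h), y_b)   # Task A loss kept in the objective
loss.backward()

head_a_has_grad = head_a.weight.grad is not None      # False: frozen weights get no gradient
head_b_has_grad = head_b.weight.grad is not None      # True
encoder_has_grad = encoder.weight.grad is not None    # True

# Task A's loss alone still produces a gradient for the encoder.
(grad_a,) = torch.autograd.grad(ce(head_a(encoder(x)), y_a), encoder.weight)
a_reaches_encoder = bool(grad_a.abs().sum() > 0)
```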

### Transfer learning scenario
Choice of a pretrained model
 - Start with a domain‑general large model such as bert-base-uncased or a domain‑specific one like BioClinicalBERT if the target corpus is biomedical.
 - Criteria: vocabulary coverage, size vs. compute budget, licence compatibility.

Layers to freeze or unfreeze
 - Step 1 – Warm‑up
    - Freeze all encoder layers. Train only the new heads for a few epochs to stabilise learning signals.
 - Step 2 – Gradual unfreezing
    - Unfreeze the top N transformer layers (closest to the output) first.
    - Optionally continue unfreezing lower layers one block at a time (ULMFiT‑style), using smaller learning rates for the layers closer to the input.

 - Optional regularisation
    - Use layer‑wise learning‑rate decay: higher LR for task heads, progressively smaller LR for early transformer layers.
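
The warm‑up and gradual‑unfreezing steps above can be sketched as a single helper. The name `unfreeze_top_n` is hypothetical, and the twelve linear layers stand in for transformer blocks; with the notebook's wrapper the real blocks would typically be reachable as `model.encoder.encoder.layer` for a BERT backbone, which is an assumption about the loaded model class.

```python
from torch import nn

def unfreeze_top_n(layers: nn.ModuleList, n: int) -> None:
    """Freeze every block, then re-enable gradients for the last n (closest to the output)."""
    for p in layers.parameters():
        p.requires_grad = False
    if n > 0:                                # layers[-0:] would select everything
        for block in layers[-n:]:
            for p in block.parameters():
                p.requires_grad = True

# Stand-in for a 12-block encoder.
blocks = nn.ModuleList([nn.Linear(4, 4) for _ in range(12)])

unfreeze_top_n(blocks, 0)   # step 1: warm-up, encoder fully frozen
frozen_all = not any(p.requires_grad for p in blocks.parameters())

unfreeze_top_n(blocks, 2)   # step 2: top two blocks join training
status = [all(p.requires_grad for p in b.parameters()) for b in blocks]
```

Calling the helper again with a larger `n` at later epochs gives the block‑by‑block schedule; the optimizer must be rebuilt or given the newly trainable parameters when that happens.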

Rationale
 - Preserve lower‑level lexical and syntactic features that transfer well across domains.
 - Adapt higher‑level semantic representations to domain‑specific patterns (medical entities, sentiment cues, etc.).
 - Staged unfreezing avoids large, unstable gradient updates that could destroy pretrained knowledge (catastrophic forgetting).
 - Layer‑wise LR decay mirrors the intuition that the early layers need only a gentle nudge while the heads require significant updates.
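
One way to realise layer‑wise learning‑rate decay is through optimizer parameter groups. This is a sketch under assumptions: the function name `llrd_param_groups`, the learning rates, and the decay factor of 0.9 are illustrative values, not settings used in the notebook.

```python
from typing import Dict, List

import torch
from torch import nn

def llrd_param_groups(blocks: nn.ModuleList, head: nn.Module,
                      head_lr: float = 1e-3, top_lr: float = 2e-5,
                      decay: float = 0.9) -> List[Dict]:
    """Head gets head_lr; the top block gets top_lr; each block below is scaled by `decay`."""
    groups = [{"params": list(head.parameters()), "lr": head_lr}]
    n = len(blocks)
    for i, block in enumerate(blocks):       # i = 0 is the block closest to the input
        depth_from_top = n - 1 - i
        groups.append({"params": list(block.parameters()),
                       "lr": top_lr * decay ** depth_from_top})
    return groups

blocks = nn.ModuleList([nn.Linear(4, 4) for _ in range(4)])   # stand-in encoder blocks
head = nn.Linear(4, 2)                                        # stand-in task head

optimizer = torch.optim.AdamW(llrd_param_groups(blocks, head))
lrs = [g["lr"] for g in optimizer.param_groups]   # head first, then blocks from input to output
```

The resulting list starts with the head's rate, followed by block rates that increase from the input side towards the output side.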

#### Key Decisions & Insights (Training Strategy)
Fine‑tuning granularity
 - Keep more layers frozen when data are scarce or latency is critical; unfreeze progressively as domain shift or accuracy needs grow.

Two‑stage transfer learning
 - Train task heads only to stabilise gradients.
 - Gradually unfreeze upper transformer layers while using smaller learning rates for lower layers.

Layer‑wise learning‑rate decay (LLRD)
 - Apply high LR to the new heads, medium LR to upper encoder blocks, and very small LR to early blocks to avoid catastrophic forgetting.

Multi‑task loss balancing
 - Combine losses as λ · L_A + (1 − λ) · L_B; tune λ or adopt dynamic weighting to prevent one task from dominating.

Selective head freezing
 - Freeze a mature head (e.g., Task A) when adding a new task so its performance remains stable while the shared encoder still adapts.

TASK 4

In [None]:
import random
from pathlib import Path
from typing import Dict, List, Tuple

import torch
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader


# ------------------------------
# 1. Hypothetical data pipeline
# ------------------------------

SENTENCES = [
    "Barack Obama was born in Hawaii.",
    "Apple unveiled a new iPhone.",
    "The quick brown fox jumps over the lazy dog.",
    "OpenAI released a new language model.",
]
NUM_CLASSES_A = 4          # sentence‑level classes
NUM_LABELS_B = 7           # token‑level NER labels (BIO)

def random_label_a() -> int:
    return random.randint(0, NUM_CLASSES_A - 1)

def random_labels_b(n_tokens: int) -> List[int]:
    return [random.randint(0, NUM_LABELS_B - 1) for _ in range(n_tokens)]

class ToyMultiTaskDataset(Dataset):
    def __init__(self, n_samples: int = 100):
        self.samples = [random.choice(SENTENCES) for _ in range(n_samples)]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        sent = self.samples[idx]
        # fake labels, redrawn on every access (normally read from annotation files)
        label_a = random_label_a()
        label_b = random_labels_b(len(sent.split()))    # one label per whitespace word, not per WordPiece
        return {"sentence": sent,
                "label_a": label_a,
                "label_b": label_b}

# ------------------------------
# 2. Collate‑fn for DataLoader
# ------------------------------

def collate(batch: List[Dict], tokenizer) -> Dict[str, torch.Tensor]:
    sents  = [item["sentence"]  for item in batch]
    labelA = torch.tensor([item["label_a"] for item in batch], dtype=torch.long)

    # Tokenise once for both tasks
    enc = tokenizer(
        sents,
        padding=True,
        truncation=True,
        return_tensors="pt",
    )
    seq_len = enc["input_ids"].shape[1]

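    # Pad / truncate Task-B labels to the padded sequence length.
    # Toy shortcut: labels fill the first n positions and are not aligned to WordPieces or [CLS]/[SEP].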
    padded_b = torch.full((len(batch), seq_len),
                          fill_value=-100, dtype=torch.long)   # ignore_index
    for i, item in enumerate(batch):
        n = min(seq_len, len(item["label_b"]))
        padded_b[i, :n] = torch.tensor(item["label_b"][:n])

    enc["label_a"] = labelA
    enc["label_b"] = padded_b
    return enc

# ------------------------------
# 3. Training loop
# ------------------------------

def train_one_epoch(model, loader, optimizer, loss_weights: Tuple[float,float], device):
    model.train()
    ce_sent  = nn.CrossEntropyLoss()
    ce_token = nn.CrossEntropyLoss(ignore_index=-100)

    total_loss, correct, total = 0.0, 0, 0
    for batch in loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        out = model(batch["input_ids"], batch["attention_mask"])  # call the module, not .forward(), so hooks run

        # ---- Task A loss ----
        loss_a = ce_sent(out["task_a_logits"], batch["label_a"])

        # ---- Task B loss ----
        # out["task_b_logits"]: (B, L, K)  → reshape for CE
        token_logits = out["task_b_logits"].view(-1, out["task_b_logits"].size(-1))
        token_labels = batch["label_b"].view(-1)
        loss_b = ce_token(token_logits, token_labels)

        # ---- Weighted joint loss ----
        loss = loss_weights[0] * loss_a + loss_weights[1] * loss_b

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # minimal metric: Task A accuracy
        preds_a = out["task_a_logits"].argmax(-1)
        correct += (preds_a == batch["label_a"]).sum().item()
        total   += preds_a.size(0)
        total_loss += loss.item() * preds_a.size(0)

    return total_loss / total, correct / total

# ------------------------------
# 4. Main script
# ------------------------------

def main():
    device = "cuda" if torch.cuda.is_available() else "cpu"

    model = MultiTaskSentenceTransformer(
        num_classes_task_a=NUM_CLASSES_A,
        num_labels_task_b=NUM_LABELS_B,
    ).to(device)

    dataset = ToyMultiTaskDataset(n_samples=256)
    loader = DataLoader(
        dataset,
        batch_size=8,
        shuffle=True,
        collate_fn=lambda b: collate(b, model.tokenizer),
    )

    optimizer = optim.AdamW(model.parameters(), lr=2e-5)
    lambda_a, lambda_b = 0.5, 0.5          # loss weights

    for epoch in range(3):
        loss, acc = train_one_epoch(
            model,
            loader,
            optimizer,
            (lambda_a, lambda_b),
            device
        )
        print(f"Epoch {epoch+1}: loss={loss:.4f} | Task-A acc={acc:.3f}")

In [19]:
if __name__ == "__main__":
    main()

Epoch 1: loss=1.6803 | Task-A acc=0.266
Epoch 2: loss=1.6747 | Task-A acc=0.246
Epoch 3: loss=1.6738 | Task-A acc=0.246


#### Assumptions & Design Rationale
Synthetic dataset
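 - Samples sentences at random from a fixed list of four and assigns random labels (redrawn on every access), purely to illustrate shapes and collate logic. Task A accuracy near chance (about 0.25 for four classes) is therefore expected.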
 - In practice, replace with a Dataset that reads real annotated files.

Single pass tokenisation
 - Collate‑fn tokenises the sentence batch once, creating input_ids & attention_mask for both tasks.

Label padding
 - Token‑level labels are padded to the max sequence length with -100; nn.CrossEntropyLoss(ignore_index=-100) skips them.
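
A small check of what `ignore_index=-100` does, using random logits and hand‑written labels (all values are illustrative): the loss with ignored positions equals the loss computed only on the labelled positions.

```python
import torch
from torch import nn

torch.manual_seed(0)
logits = torch.randn(2, 5, 7)                    # (B, L, K)
labels = torch.tensor([[3, 1, -100, -100, -100],
                       [0, 6, 2, -100, -100]])   # -100 marks positions without a label

flat_logits = logits.view(-1, 7)                 # (B*L, K)
flat_labels = labels.view(-1)                    # (B*L,)

# Loss as in the training loop: ignored positions do not contribute.
loss_ignored = nn.CrossEntropyLoss(ignore_index=-100)(flat_logits, flat_labels)

# Same loss computed by hand on the labelled positions only.
keep = flat_labels != -100
loss_kept = nn.CrossEntropyLoss()(flat_logits[keep], flat_labels[keep])
```

Because the default reduction averages over non‑ignored targets, padding length does not dilute the token‑level loss.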

Weighted joint loss
 - Two cross‑entropy losses combined as λA·L_A + λB·L_B.
 - Scalars λ adjust task importance; tune on a validation set.

Metrics
 - For brevity, only sentence‑level accuracy (Task A) is computed.
 - Token‑level F1 or seqeval metrics can be added similarly—update after each batch or epoch.
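
As a starting point for a Task B metric, the sketch below counts token‑level accuracy while skipping `-100` positions. It is not entity‑level F1 (a library such as seqeval would be needed for that), and the function name `token_accuracy` is an assumption.

```python
import torch

def token_accuracy(logits: torch.Tensor, labels: torch.Tensor, ignore_index: int = -100):
    """Return (correct, counted) over positions whose label is not ignore_index."""
    preds = logits.argmax(dim=-1)                    # (B, L)
    valid = labels != ignore_index
    correct = ((preds == labels) & valid).sum().item()
    return correct, valid.sum().item()

# Logits built so the prediction at each position is known in advance.
logits = torch.zeros(1, 4, 3)
logits[0, 0, 2] = 1.0   # predicts 2
logits[0, 1, 0] = 1.0   # predicts 0
logits[0, 2, 1] = 1.0   # predicts 1
logits[0, 3, 1] = 1.0   # predicts 1 at an ignored position
labels = torch.tensor([[2, 1, 1, -100]])

correct, counted = token_accuracy(logits, labels)
print(correct, counted)   # 2 3
```

Accumulating `correct` and `counted` across batches, in the same way the loop accumulates Task A accuracy, gives an epoch‑level figure.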

Optimiser & LR
 - Uses AdamW with a single learning rate; real training might employ layer‑wise decay or scheduler.
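
A scheduler can be attached with `torch.optim.lr_scheduler.LambdaLR`. The sketch below uses linear warm‑up followed by linear decay; the step counts and the helper name `linear_warmup_decay` are illustrative assumptions, and a single linear layer stands in for the model.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import LambdaLR

def linear_warmup_decay(warmup_steps: int, total_steps: int):
    """Multiplier that rises linearly to 1 during warm-up, then falls linearly to 0."""
    def factor(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return factor

model = nn.Linear(4, 2)                                   # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scheduler = LambdaLR(optimizer, lr_lambda=linear_warmup_decay(warmup_steps=2, total_steps=10))

lrs = []
for _ in range(10):
    lrs.append(optimizer.param_groups[0]["lr"])           # LR used for this step
    optimizer.step()                                      # would follow loss.backward() in a real loop
    scheduler.step()                                      # advance once per optimizer step
```

In the training loop the `scheduler.step()` call would sit directly after `optimizer.step()`.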

Epoch loop
 - Shows three toy epochs and prints the epoch‑average loss and Task A accuracy to confirm the loop executes.



#### How Multi‑Task Training Operates Here
Shared Forward Pass
 - The batch is encoded once by the transformer; hidden states feed both heads, saving compute.

Two Losses, Shared Gradient Flow
 - Gradients from both heads propagate through the shared backbone (unless frozen), enabling cross‑task feature sharing.

Balancing Tasks
 - Loss weights λ counteract dataset‑size or scale imbalance.
 - Alternative schemes: dynamic uncertainty weighting or alternating mini‑batches per task.
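
Dynamic weighting can take several forms. The sketch below is a simplified variant of uncertainty weighting, in which one learnable log‑variance `s_i` per task scales that task's loss and is itself penalised. The class name `UncertaintyWeighting` is hypothetical and the loss values are made up; this is not what the notebook's loop does.

```python
import torch
from torch import nn

class UncertaintyWeighting(nn.Module):
    """Learns one log-variance s_i per task; total = sum(exp(-s_i) * L_i + s_i)."""
    def __init__(self, n_tasks: int = 2):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(n_tasks))

    def forward(self, losses: torch.Tensor) -> torch.Tensor:
        return (torch.exp(-self.log_vars) * losses + self.log_vars).sum()

weigher = UncertaintyWeighting()
losses = torch.tensor([1.5, 0.5])    # stand-ins for loss_a and loss_b from one batch
total = weigher(losses)              # plain sum of the losses while every s_i is 0
total.backward()
grad = weigher.log_vars.grad         # equals 1 - L_i at s_i = 0
```

In a training loop the input would be `torch.stack([loss_a, loss_b])`, and `weigher.parameters()` would need to be passed to the optimizer alongside the model's parameters. A task with a large loss gets a negative gradient on its `s_i`, so `s_i` grows and that task's weight shrinks.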

Metrics Isolation
 - Each task keeps its own validation metrics—even if the losses are blended—so progress can be monitored independently.

#### Key Decisions & Insights (Training Loop)
Tensor‑based forward interface
 - Accepts input_ids and attention_mask, so each batch is tokenised once (in the collate function) and the forward pass works on tensors only, which keeps it compatible with mixed‑precision or distributed wrappers.

Custom collate_fn
 - Pads token‑level labels with -100 and attaches both label tensors to the tokenizer output (input_ids, attention_mask), preparing a batch that both heads can consume.

Joint loss computation
 - Uses cross‑entropy for each task; total loss is a weighted sum, making it easy to adjust task importance.

Metric tracking
 - Accumulates Task A sentence‑level accuracy over each epoch; token‑level F1 is not computed yet and can be added alongside it so each task is monitored independently.

Optimiser and learning rate
 - Starts with AdamW at 2 × 10⁻⁵; ready to swap in LLRD or schedulers once encoder layers are unfrozen.

Scalability hooks
 - Automatic device selection, batch‑to‑device mapping, and a modular dataset class keep the loop runnable on CPU or a single GPU; multi‑GPU training would additionally need a distributed wrapper around the model.

Progressive workflow
 - Run heads‑only training for quick convergence.
 - Unfreeze additional layers as validation metrics plateau.
 - Tweak λ weights and learning rates to maintain balance between tasks.