## Preprocessing and Data Pipeline

The preprocessing pipeline prepares question tokens and image features as described in the paper.

- **Question Processing**:
  - Questions are tokenized to a maximum length of 14 tokens[^1].
  - Words are embedded using 300-dimensional GloVe vectors[^1].
  - A single-layer LSTM encodes the sequence into 512-dimensional question features[^1].

- **Image Processing**:
  - Images are passed through a ResNeXt-152 CNN pretrained on Visual Genome to extract 8×8 grid features (64 regions), each of 2048 dimensions[^1].
  - These features are projected to 512-dimensional vectors to match the question features[^1].

- **Answer Representation**:
  - The candidate answer set is fixed to the top 3129 most frequent answers[^1].
  - Ground-truth answers are encoded as multi-hot vectors with soft scores, enabling multi-label classification[^1].
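
To make the soft-score encoding concrete, here is a minimal sketch of the `min(count / 3, 1)` rule applied to one question. The helper name `soft_scores` and the ten example answers are hypothetical and used only for illustration.

```python
from collections import Counter

def soft_scores(answers):
    """Map each distinct annotator answer to min(count / 3, 1.0)."""
    counts = Counter(answers)
    return {ans: min(n / 3, 1.0) for ans, n in counts.items()}

# Ten hypothetical annotator answers for a single question
example_answers = ["red"] * 7 + ["dark red"] * 2 + ["maroon"]
scores = soft_scores(example_answers)
print(scores)
```

An answer given by three or more annotators receives full credit, while rarer answers receive partial credit, so several entries of the target vector can be non-zero at once.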


In [None]:
import json
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

# Example utility: load GloVe embeddings into a dictionary
def load_glove_embeddings(glove_path):
    glove_dict = {}
    with open(glove_path, 'r', encoding='utf8') as f:
        for line in f:
            parts = line.split()
            if len(parts) == 0: 
                continue
            word = parts[0]
            vec = np.array(parts[1:], dtype=np.float32)
            glove_dict[word] = vec
    return glove_dict

# Build vocabulary for questions (word2idx) and answers (answer2idx) from training data
def build_vocab_and_answers(questions, answers, glove_dict, min_freq=1, top_answers=3129):
    # Build question word vocabulary (include only words present in GloVe for consistency)
    word_counts = {}
    for q in questions:
        for w in q.lower().split():
            if w in glove_dict:  # only consider words with glove vectors
                word_counts[w] = word_counts.get(w, 0) + 1
    # Include words with frequency >= min_freq
    vocab = ['<pad>', '<unk>']  # special tokens
    vocab += [w for w, cnt in word_counts.items() if cnt >= min_freq]
    word2idx = {w: i for i, w in enumerate(vocab)}
    # Build answer vocabulary (top_answers by frequency)
    ans_counts = {}
    for ans_list in answers:
        for ans in ans_list:
            ans_counts[ans] = ans_counts.get(ans, 0) + 1
    # sort answers by frequency and take top N
    top_ans = sorted(ans_counts.items(), key=lambda x: x[1], reverse=True)[:top_answers]
    answer_vocab = [ans for ans, cnt in top_ans]
    answer2idx = {ans: i for i, ans in enumerate(answer_vocab)}
    return word2idx, answer2idx

class VQAv2Dataset(Dataset):
    def __init__(self, questions_json, annotations_json, image_dir, glove_dict, word2idx, answer2idx):
        # Load question and annotation data
        with open(questions_json, 'r') as f_q:
            questions_data = json.load(f_q)
        with open(annotations_json, 'r') as f_a:
            ann_data = json.load(f_a)
        # Map question id to question and image, and to answers
        self.entries = []
        ann_map = {ann['question_id']: ann for ann in ann_data['annotations']}
        for q in questions_data['questions']:
            q_id = q['question_id']
            img_id = q['image_id']
            question_str = q['question']
            # Collect answers (10 answers per question in VQA v2)
            if q_id in ann_map:
                ans_objs = ann_map[q_id]['answers']
                answers = [ans_obj['answer'] for ans_obj in ans_objs]
            else:
                answers = []  # for test questions (no answers)
            self.entries.append({
                'question_id': q_id,
                'image_id': img_id,
                'question': question_str,
                'answers': answers
            })
        self.image_dir = image_dir
        self.word2idx = word2idx
        self.answer2idx = answer2idx
        self.glove_dict = glove_dict

    def __len__(self):
        return len(self.entries)

    def __getitem__(self, idx):
        entry = self.entries[idx]
        # Tokenize question
        tokens = entry['question'].lower().split()  # simple split; could use more advanced tokenization
        # Convert tokens to indices with padding/truncation to length 14
        max_q_len = 14
        tok_idxs = []
        for t in tokens[:max_q_len]:
            tok_idxs.append(self.word2idx.get(t, self.word2idx.get('<unk>')))
        pad_length = max_q_len - len(tok_idxs)
        if pad_length > 0:
            tok_idxs += [self.word2idx['<pad>']] * pad_length
        else:
            tok_idxs = tok_idxs[:max_q_len]
        question_indices = torch.tensor(tok_idxs, dtype=torch.long)
        # Prepare answer multi-hot vector
        ans_vector = torch.zeros(len(self.answer2idx), dtype=torch.float)
        if entry['answers']:
            # Use soft score: each answer gets credit = min(count/3, 1)
            ans_count = {}
            for ans in entry['answers']:
                if ans in self.answer2idx:
                    ans_count[ans] = ans_count.get(ans, 0) + 1
            for ans, count in ans_count.items():
                score = min(count / 3, 1.0)
                ans_idx = self.answer2idx.get(ans)
                if ans_idx is not None:
                    ans_vector[ans_idx] = score
        # Load image and preprocess (assuming images are COCO, adjust path accordingly)
        image_path = f"{self.image_dir}/COCO_train2014_{entry['image_id']:012d}.jpg"  # example for train2014
        # Image loading and transformation will be handled in a collate or in model (for efficiency, do in model forward)
        # We return image path or ID for later loading
        return image_path, question_indices, ans_vector


## Dataset Handling: `VQAv2Dataset`

In the code above, the `VQAv2Dataset` class performs the following preprocessing steps:

- **Question Handling**:
  - Loads questions and annotations[^1].
  - Tokenizes each question.
  - Pads or truncates tokens to a fixed length of 14[^1].

- **Answer Vector Construction**:
  - Constructs a multi-label answer vector.
  - Applies soft target scoring between 0 and 1 for each answer, using the standard VQA strategy[^1].

- **Image Handling**:
  - Image loading is deferred for efficiency.
  - The dataset returns only an image path or ID.
  - Actual image feature extraction is handled later in the model via a CNN[^1].
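
The pad-or-truncate step can be checked in isolation. The sketch below pulls it out into a standalone function; `encode_question` and the toy vocabulary are hypothetical names for illustration, and the tokenizer is the same plain whitespace split used in the dataset class.

```python
def encode_question(question, word2idx, max_len=14):
    """Lower-case, whitespace-tokenize, map to indices, then truncate/pad to max_len."""
    unk, pad = word2idx["<unk>"], word2idx["<pad>"]
    idxs = [word2idx.get(tok, unk) for tok in question.lower().split()[:max_len]]
    return idxs + [pad] * (max_len - len(idxs))

# Toy vocabulary (hypothetical); index 0 is padding, index 1 is unknown
toy_vocab = {"<pad>": 0, "<unk>": 1, "what": 2, "color": 3, "is": 4, "the": 5}
encoded = encode_question("What color is the umbrella?", toy_vocab)
long_encoded = encode_question("the " * 20, toy_vocab)
```

Because the split is on whitespace only, `umbrella?` keeps its question mark and falls back to `<unk>`. Stripping punctuation before splitting would be one way to avoid that, at the cost of departing from the code above.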

# Model Architecture Components

We now define the building blocks of the LRCN model. The model uses a Transformer-like co-attention architecture with a **Layer-Residual Mechanism (LRM)** to preserve information across layers.

There are two types of attention sub-layers:

### 1. Self-Attention (SA) Block
Operates on a single modality (text or image) to capture **intra-modal features**.

### 2. Guided-Attention (GA) Block
A cross-attention that uses one modality to guide attention in the other, capturing **inter-modal features** between image and question.

### Layer-Residual Mechanism (LRM)

This mechanism adds a direct residual connection from the output of an attention block in the previous layer to the output of the corresponding block in the current layer. In other words, each SA or GA block receives an extra skip input from its counterpart in the previous layer, mitigating information loss through deep layers.

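Formally, if \( X_{l-1} \) is the input to the block (the output of the preceding block) and `PrevRe` is the output of the same type of block in the previous layer, then the LRM is applied as:

$$
X_l = \text{LayerNorm}(X_{l-1} + \text{PrevRe} + \text{MHA}(Q_{l-1}, K_{l-1}, V_{l-1}))
$$

where `MHA` denotes multi-head attention. In an SA block, \( Q_{l-1} = K_{l-1} = V_{l-1} = X_{l-1} \); in a GA block, the queries come from \( X_{l-1} \) and the keys and values come from the guiding modality.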

We implement this by adding the previous layer's output of the same block type (`prev_output`) in addition to the standard residual connection. Below we define PyTorch modules for **Multi-Head Attention with LRM**. We use `nn.MultiheadAttention` for the attention operation and apply residual connections and layer normalization as described (Post-LN architecture).

We provide separate classes for self-attention and guided-attention blocks.


In [None]:
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionBlock(nn.Module):
    """Self-Attention block with Layer-Residual Mechanism (LRM) for one modality."""
    def __init__(self, hidden_dim=512, num_heads=8):
        super(SelfAttentionBlock, self).__init__()
        # Multi-head self-attention (queries=keys=values=input sequence)
        self.mha = nn.MultiheadAttention(embed_dim=hidden_dim, num_heads=num_heads, batch_first=True)
        self.ln = nn.LayerNorm(hidden_dim)

    def forward(self, x, prev_sa_output=None):
        """
        x: Tensor of shape (batch, seq_len, hidden_dim) - input features for this layer.
        prev_sa_output: Tensor of same shape as x, output from previous layer's SA block (for LRM).
        """
        # Multi-head self-attention (with residual connection)
        attn_out, _ = self.mha(x, x, x)  # Self-attend: Q=K=V=x
        out = x + attn_out  # primary residual: add input features
        if prev_sa_output is not None:
            out = out + prev_sa_output  # add skip connection from previous layer's SA
        out = self.ln(out)  # layer normalization after residual sum (Post-LN)
        return out

class GuidedAttentionBlock(nn.Module):
    """Guided (Cross) Attention block with LRM, guiding features of one modality with another."""
    def __init__(self, hidden_dim=512, num_heads=8):
        super(GuidedAttentionBlock, self).__init__()
        # Multi-head attention for cross-modal: will use queries from one modality and keys/values from another
        self.mha = nn.MultiheadAttention(embed_dim=hidden_dim, num_heads=num_heads, batch_first=True)
        self.ln = nn.LayerNorm(hidden_dim)

    def forward(self, query_seq, context_seq, prev_ga_output=None):
        """
        query_seq: Tensor (batch, L_q, hidden_dim) - features to be updated (queries).
        context_seq: Tensor (batch, L_c, hidden_dim) - guiding features (keys and values).
        prev_ga_output: Tensor of same shape as query_seq, output from previous layer's GA block (for LRM).
        """
        # Multi-head cross-attention: query attends to context
        attn_out, _ = self.mha(query_seq, context_seq, context_seq)  # Q=query_seq, K=context_seq, V=context_seq
        out = query_seq + attn_out  # add primary residual connection (input to output)
        if prev_ga_output is not None:
            out = out + prev_ga_output  # add skip from previous layer's GA output
        out = self.ln(out)
        return out


In the `SelfAttentionBlock`, we pass `prev_sa_output` (the previous layer's self-attention output for the same modality) to incorporate the **Layer-Residual Mechanism (LRM)**. 

In the `GuidedAttentionBlock`, `prev_ga_output` carries the previous layer's guided-attention output (for the same query modality).

Both blocks add their immediate input (`x` or `query_seq`) as well as the skip connection (`prev_*_output`) to the attention output before normalization, matching equations (8)–(10) in the paper.
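
As a usage illustration, the sketch below chains three self-attention blocks and threads the LRM skip input from one layer to the next. `LRMSelfAttention` is a compact, hypothetical restatement of the SA block with a small hidden size so that the snippet runs on its own; it is not a separate component of the model.

```python
import torch
import torch.nn as nn

class LRMSelfAttention(nn.Module):
    """Compact SA block: LayerNorm(x + prev + MHA(x, x, x))."""
    def __init__(self, hidden_dim=64, num_heads=4):
        super().__init__()
        self.mha = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.ln = nn.LayerNorm(hidden_dim)

    def forward(self, x, prev=None):
        attn_out, _ = self.mha(x, x, x)
        out = x + attn_out            # standard residual
        if prev is not None:
            out = out + prev          # layer-residual skip from the previous layer
        return self.ln(out)

torch.manual_seed(0)
layers = nn.ModuleList([LRMSelfAttention() for _ in range(3)])
x = torch.randn(2, 14, 64)   # (batch, tokens, hidden_dim)
prev = None                  # the first layer has no counterpart to skip from
for layer in layers:
    x = layer(x, prev)
    prev = x                 # this layer's output becomes the next layer's skip input
lrm_out = x
```

In a chain of SA blocks only, the skip input is the same tensor as the layer input, so the residual is effectively doubled. The skip carries distinct information when another block sits between consecutive blocks of the same type, as in the Co-Stacking variant.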


# LRCN Model Variants: Pure-Stacking, Co-Stacking, Encoder–Decoder

The LRCN model can be configured in three stacking variants as discussed in the paper:

### Encoder–Decoder (E-D)
The question features are fully encoded with self-attention layers, then the final question representation guides the image features via a guided-attention layer (similar to a traditional Transformer encoder-decoder).  
This means the image is only attended once by the final question features.

### Pure-Stacking
The question and image features interact at every layer. At each layer:
- The question features are first refined by a self-attention block.
- Then, the image features are updated by a guided-attention block using the same layer's question output.

The question is not directly updated by the image in this variant; image guidance happens progressively with increasingly refined question features.

### Co-Stacking
A bidirectional co-attention is applied at every layer. At each layer:
- After the question self-attention, **textual co-attention** is applied (question features are guided by image features) to inject visual information into the question representation.
- Then, **visual guided-attention** is applied (image features are guided by the updated question).

This allows early and reciprocal interactions: image guiding text and text guiding image at each layer, enabling richer feature fusion.

---

We implement a single `LRCNModel` class that can switch between these variants. The model consists of:

### Word Embedding + LSTM Encoder
- Embeds the input question tokens using preloaded **GloVe** weights.
- Encodes them into an initial question feature sequence \( Y \) of length \( m = 14 \) and dimension 512.
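
The training cell later passes a `glove_matrix` to the model, but the notebook does not show how it is built. One possible construction, offered as a sketch: `build_glove_matrix` is a hypothetical helper, and the small random initialization for words without a GloVe vector is an assumption rather than something stated in the paper.

```python
import numpy as np

def build_glove_matrix(word2idx, glove_dict, dim=300, seed=0):
    """Return a (vocab_size, dim) float32 matrix whose rows follow word2idx."""
    rng = np.random.default_rng(seed)
    # Assumption: words missing from GloVe (e.g. '<unk>') get a small random vector
    matrix = rng.normal(0.0, 0.1, size=(len(word2idx), dim)).astype(np.float32)
    for word, idx in word2idx.items():
        if word in glove_dict:
            matrix[idx] = glove_dict[word]
    if "<pad>" in word2idx:
        matrix[word2idx["<pad>"]] = 0.0   # keep the padding row at zero
    return matrix

# Toy inputs (4-dimensional vectors instead of 300 to keep the demo small)
toy_glove = {"cat": np.ones(4, dtype=np.float32)}
toy_word2idx = {"<pad>": 0, "<unk>": 1, "cat": 2}
glove_matrix = build_glove_matrix(toy_word2idx, toy_glove, dim=4)
```

The row order must match `word2idx`, since `nn.Embedding` looks vectors up by index.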

### CNN Feature Extractor
- Processes the input image to produce a grid of \( n = 64 \) visual features of dimension 512.
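- The paper's backbone is **ResNeXt-152** (pretrained); it is truncated before global average pooling, and for a 448×448 input the resulting \( 14 \times 14 \) feature map is zero-padded to \( 16 \times 16 \).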
- This is then downsampled to \( 8 \times 8 \) via a 2×2 stride-2 convolution, as described in the paper.
- A linear layer reduces 2048-dimensional features to 512-dimensional.

### Stacked Co-Attention Layers
- A stack of \( L \) layers (we allow \( L = 6 \) or \( 8 \), as per the paper) of **Self-Attention (SA)** and **Guided-Attention (GA)** blocks with **Layer-Residual Mechanism (LRM)**.
- The exact sequence per layer depends on the variant:

| Variant         | Layer Sequence                                                                 |
|-----------------|----------------------------------------------------------------------------------|
| Encoder–Decoder | \( L \) SA layers on question only, then 1 GA layer on image.                   |
| Pure-Stacking   | For each layer: Question SA → Image GA.                                         |
| Co-Stacking     | For each layer: Question SA → Text GA (Q guided by image) → Image GA (X guided by question). |
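
The table can also be read as an execution order. The sketch below lists the blocks that run for each variant; `layer_schedule` and the block labels (`SA_q`, `GA_txt`, `GA_img`) are hypothetical names chosen for illustration.

```python
def layer_schedule(variant, num_layers):
    """List the attention blocks in execution order for one variant."""
    if variant == "enc_dec":
        # L question SA layers, then a single image GA layer
        return [f"SA_q[{i}]" for i in range(num_layers)] + ["GA_img"]
    per_layer = {
        "pure": ["SA_q", "GA_img"],
        "co": ["SA_q", "GA_txt", "GA_img"],
    }
    if variant not in per_layer:
        raise ValueError(f"Unknown variant: {variant!r}")
    return [f"{name}[{i}]" for i in range(num_layers) for name in per_layer[variant]]

for v in ("enc_dec", "pure", "co"):
    print(v, layer_schedule(v, 2))
```

For \( L \) layers this gives \( L + 1 \) blocks for Encoder–Decoder, \( 2L \) for Pure-Stacking, and \( 3L \) for Co-Stacking.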

### Multimodal Fusion & Classifier
- After \( L \) layers, we obtain final refined question features \( Y^{(L)} \) and image features \( X^{(L)} \).
- These are **pooled via attention pooling** (described in the next section) to get a single vector each for image and question.
- The vectors are fused and passed through a classifier to predict the **answer scores**.


In [None]:
import torchvision.models as models
from torchvision import transforms
from PIL import Image

class LRCNModel(nn.Module):
    def __init__(self, hidden_dim=512, num_heads=8, num_layers=6, 
                 vocab_size=10000, glove_weights=None, answer_vocab_size=3129, 
                 variant="pure"):
        """
        variant: "enc_dec", "pure", or "co" indicating Encoder-Decoder, Pure-Stacking, or Co-Stacking structure.
        num_layers: number of layers L for SA (and GA in stacking). For Encoder-Decoder, this is number of question SA layers.
        """
        super(LRCNModel, self).__init__()
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        self.variant = variant

        # Question embedding: 300-d GloVe embeddings, project to hidden_dim via LSTM
        self.embedding = nn.Embedding(vocab_size, 300, padding_idx=0)
        if glove_weights is not None:
            # Initialize embedding weights with pre-trained GloVe (freeze or fine-tune as needed)
            self.embedding.weight.data.copy_(torch.tensor(glove_weights, dtype=torch.float32))
            self.embedding.weight.requires_grad = False  # Freeze GloVe embeddings (optional)
        self.question_lstm = nn.LSTM(input_size=300, hidden_size=hidden_dim, batch_first=True, bidirectional=False)

        # Image feature extractor. torchvision does not ship a ResNeXt-152, so resnext101_32x8d
        # (ImageNet weights) is used here as a stand-in; load Visual Genome-pretrained
        # ResNeXt-152 weights separately to match the paper's setup.
        backbone = models.resnext101_32x8d(weights="DEFAULT")
        # Remove classifier and avgpool to get convolutional feature map
        self.cnn_backbone = nn.Sequential(*list(backbone.children())[:-2])  # shape: (batch, 2048, 14, 14) for a 448x448 input
        # The 14x14 map is padded to 16x16 in forward(); this stride-2 conv then reduces it to 8x8
        self.downsample_conv = nn.Conv2d(in_channels=2048, out_channels=2048, kernel_size=2, stride=2)
        # Linear projection to 512-d
        self.img_feat_proj = nn.Linear(2048, hidden_dim)

        # Define attention blocks for L layers
        # For Encoder-Decoder: question SA layers + one image GA (so num_layers SA, 1 GA)
        # For Pure-Stacking: L layers of question SA and image GA
        # For Co-Stacking: L layers of question SA, text GA, image GA
        self.sa_blocks = nn.ModuleList([SelfAttentionBlock(hidden_dim, num_heads) for _ in range(num_layers)])
        if variant == "enc_dec":
            # Only one GA block for the final cross-attention in Encoder-Decoder
            self.ga_image_block = GuidedAttentionBlock(hidden_dim, num_heads)
        elif variant == "pure":
            # Pure-stacking: L GA blocks for image (question guides image at each layer)
            self.ga_blocks = nn.ModuleList([GuidedAttentionBlock(hidden_dim, num_heads) for _ in range(num_layers)])
        elif variant == "co":
            # Co-stacking: L GA blocks for text (image guides question) and L GA blocks for image (question guides image)
            self.ga_text_blocks = nn.ModuleList([GuidedAttentionBlock(hidden_dim, num_heads) for _ in range(num_layers)])
            self.ga_image_blocks = nn.ModuleList([GuidedAttentionBlock(hidden_dim, num_heads) for _ in range(num_layers)])
        else:
            raise ValueError("Unknown variant type")

        # Attention pooling layers for final feature summarization (two-layer MLP as described)
        self.att_mlp_text = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(hidden_dim, 1)
        )
        self.att_mlp_image = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(hidden_dim, 1)
        )
        # Fusion and classifier
        self.fusion_norm = nn.LayerNorm(hidden_dim)  # LayerNorm for fused features
        self.fusion_Wx = nn.Linear(hidden_dim, hidden_dim)  # W_x for image feature
        self.fusion_Wy = nn.Linear(hidden_dim, hidden_dim)  # W_y for question feature
        self.classifier = nn.Linear(hidden_dim, answer_vocab_size)  # projects fused feature to answer logits

        # Image preprocessing transform (to normalize images as ResNeXt expects)
        self.image_transform = transforms.Compose([
            transforms.Resize(448),  # ensure image is large enough (ResNeXt was trained on 224x224, but VG training might use larger)
            transforms.CenterCrop(448),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406], 
                                 std=[0.229, 0.224, 0.225])
        ])

    def forward(self, image_paths, question_indices):
        batch_size = question_indices.size(0)
        # 1. Question embedding encoding
        embed = self.embedding(question_indices)  # (B, 14, 300)
        # Pack sequence for LSTM (though all seq len =14 here, we can skip packing for simplicity)
        lstm_out, _ = self.question_lstm(embed)   # (B, 14, 512)
        # Initial question features Y(0)
        q_feats = lstm_out  # we'll treat this as Y^(0) (already contextualized by LSTM)
        # 2. Image feature extraction
        # Load and preprocess images
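        imgs = [self.image_transform(Image.open(path).convert('RGB')) for path in image_paths]
        # Stack and move to the same device as the question tensor (and the model weights)
        imgs_tensor = torch.stack(imgs, dim=0).to(question_indices.device)  # (B, 3, H, W)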
        # CNN backbone to get conv features
        conv_feats = self.cnn_backbone(imgs_tensor)  # (B, 2048, Hc, Wc), ideally Hc=Wc≈14 or 15
        # If feature map is not 16x16, pad to 16 and downsample to 8x8 as described
        # (In practice, a 448x448 input to ResNeXt-152 yields ~14x14 feature map. We pad to 16x16 then conv stride2 to get 8x8.)
        if conv_feats.size(2) < 16:
            # pad to 16x16
            pad_h = 16 - conv_feats.size(2)
            pad_w = 16 - conv_feats.size(3)
            conv_feats = F.pad(conv_feats, (0, pad_w, 0, pad_h))
        grid_feats = self.downsample_conv(conv_feats)  # (B, 2048, 8, 8)
        grid_feats = grid_feats.view(batch_size, 2048, -1).permute(0, 2, 1)  # (B, 64, 2048) flatten spatial
        # Project to 512-dim
        grid_feats = self.img_feat_proj(grid_feats)  # (B, 64, 512)
        # Initial image features X(0)
        v_feats = grid_feats

        # 3. Layer-residual co-attention stacks
        # Initialize skip connection storages for LRM
        prev_sa_out = None    # previous layer's question SA output
        prev_ga_img_out = None  # previous layer's image GA output
        prev_ga_txt_out = None  # (co-stacking) previous layer's text GA output
        # Iterate through L layers
        for l in range(self.num_layers):
            # Self-attention on question features (Y) for this layer
            q_out = self.sa_blocks[l](q_feats, prev_sa_output=prev_sa_out)  # SA on question
            # Update prev_sa_out for next iteration
            prev_sa_out = q_out
            if self.variant == "enc_dec":
                # Encoder-Decoder: Only perform question SA in all layers; guided attention will be done once after loop
                q_feats = q_out
                continue  # skip GA within loop
            elif self.variant == "pure":
                # Pure-Stacking: guided attention on image, using current question output
                v_out = self.ga_blocks[l](v_feats, q_out, prev_ga_output=prev_ga_img_out)  # GA: image features guided by question
                # Update prev outputs and current features
                prev_ga_img_out = v_out
                q_feats = q_out    # question features carry to next layer (not updated by image in pure stacking)
                v_feats = v_out
            elif self.variant == "co":
                # Co-Stacking: two-stage GA at each layer
                # Stage 1: Text GA - update question using image as context
                q_co_out = self.ga_text_blocks[l](q_out, v_feats, prev_ga_output=prev_ga_txt_out)  # question guided by image
                # Stage 2: Image GA - update image using updated question as context
                v_out = self.ga_image_blocks[l](v_feats, q_co_out, prev_ga_output=prev_ga_img_out)  # image guided by question
                # Update prev outputs for next layer
                prev_ga_txt_out = q_co_out
                prev_ga_img_out = v_out
                # Set up for next iteration
                q_feats = q_co_out  # updated question features go to next layer
                v_feats = v_out
        # End of layers loop

        if self.variant == "enc_dec":
            # After encoding question for L layers, do one guided attention on image with final question output
            # (Image features v_feats still initial X(0) here)
            v_feats = self.ga_image_block(v_feats, q_feats, prev_ga_output=None)
            # In E-D, q_feats is final question from encoder, v_feats is final image after one cross-attention
        # At this point, we have final question features in q_feats (B, m, 512) and image features in v_feats (B, n, 512)

        # 4. Attention pooling: compute attention weights over image and question features
        # and produce attended feature vectors X_bar and Y_bar as weighted sums.
        # Compute attention logits
        img_att_logits = self.att_mlp_image(v_feats)  # (B, n, 1)
        txt_att_logits = self.att_mlp_text(q_feats)   # (B, m, 1)
        # Attention weights
        img_att_weights = F.softmax(img_att_logits, dim=1)  # (B, n, 1), softmax over image regions
        txt_att_weights = F.softmax(txt_att_logits, dim=1)  # (B, m, 1), softmax over question words
        # Weighted sum to get single feature vectors
        v_att = torch.sum(img_att_weights * v_feats, dim=1)  # (B, 512), visual attended feature \bar{X}
        q_att = torch.sum(txt_att_weights * q_feats, dim=1)  # (B, 512), textual attended feature \bar{Y}

        # 5. Feature fusion and answer prediction
        # Linear fusion with layer normalization
        fused = self.fusion_norm(self.fusion_Wx(v_att) + self.fusion_Wy(q_att))  # fused feature z
        # Predict answer scores
        logits = self.classifier(fused)  # raw scores f for each answer
        # Apply activation: ReLU + Sigmoid for multi-label classification as per paper
        # Note: ReLU makes the logits non-negative, so every sigmoid output is >= 0.5
        logits = F.relu(logits)         # ReLU on logits
        probs = torch.sigmoid(logits)   # Sigmoid to get probabilities
        return probs


## A Few Notes on the Implementation

- We use `prev_sa_out`, `prev_ga_img_out`, and `prev_ga_txt_out` to carry the **LRM skip connections** between layers for:
  - Self-Attention (SA)
  - Image Guided-Attention (GA)
  - Text Guided-Attention (GA), respectively.

- In **Pure-Stacking**, only `prev_sa_out` and `prev_ga_img_out` are used (since text GA is not present).

- In **Encoder–Decoder**, only `prev_sa_out` is used to form the question SA chain.

- In **Co-Stacking**, each layer:
  1. First updates the **question** with **image context** using `ga_text_blocks`.
  2. Then updates the **image** with the new question using `ga_image_blocks`.

  This matches the description of applying **textual co-attention after self-attention** to incorporate visual guidance early.

---

### CNN Feature Extraction

- The paper specifies a **ResNeXt-152** backbone.
- The feature map is padded to \( 16 \times 16 \) and a stride-2 convolution is applied to obtain an \( 8 \times 8 \) grid of features.
- Each of the 64 grid cells is a **2048-dimensional** vector, which is projected to **512 dimensions** so that \( X \) (image features) and \( Y \) (question features) share the same dimension.
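
The shape arithmetic behind the grid can be traced with a dummy tensor. This is a sketch only: it uses 32 channels as a stand-in for 2048 to keep the demo light, and it assumes a \( 14 \times 14 \) backbone output, which is what a stride-32 network produces for a 448×448 input.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

channels = 32                                     # stand-in for 2048 in this demo
conv_feats = torch.randn(1, channels, 14, 14)     # assumed backbone output
pad_h = 16 - conv_feats.size(2)
pad_w = 16 - conv_feats.size(3)
padded = F.pad(conv_feats, (0, pad_w, 0, pad_h))  # zero-pad right and bottom to 16x16
downsample = nn.Conv2d(channels, channels, kernel_size=2, stride=2)
grid = downsample(padded)                         # (1, C, 8, 8)
regions = grid.flatten(2).transpose(1, 2)         # (1, 64, C): one row per grid cell
print(padded.shape, grid.shape, regions.shape)
```

Each row of `regions` corresponds to one of the 64 grid cells and is what the linear projection maps to the shared 512-dimensional space.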

---

### Attention Pooling

- Uses a learned **MLP** to produce attention weights for each feature vector.
- A **softmax** is applied to get attention weights \( \alpha_i \) and \( \beta_j \) as in equations (15)–(16) of the paper.
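
A minimal sketch of the pooling step, written as a standalone function so that the weights can be inspected. `attention_pool` is a hypothetical helper, and the 16-dimensional features are chosen only to keep the example small.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def attention_pool(feats, scorer):
    """feats: (B, N, D); scorer maps D -> 1. Returns pooled (B, D) and weights (B, N)."""
    logits = scorer(feats).squeeze(-1)    # (B, N), one score per position
    weights = F.softmax(logits, dim=1)    # normalize over the N positions
    pooled = torch.einsum("bn,bnd->bd", weights, feats)  # weighted sum of features
    return pooled, weights

torch.manual_seed(0)
scorer = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 1))
feats = torch.randn(2, 64, 16)            # e.g. 64 image regions per example
pooled, weights = attention_pool(feats, scorer)
```

The weights for each example sum to 1. Neither this sketch nor the model code masks padded question positions, so `<pad>` tokens also receive some attention weight.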

---

### Final Fusion & Output

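- The final fused vector is computed as \( z = \text{LayerNorm}(W_x \bar{X} + W_y \bar{Y}) \), where \( \bar{X} \) and \( \bar{Y} \) are the attention-pooled image and question vectors. A linear classifier followed by the ReLU and sigmoid activations maps \( z \) to per-answer scores.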



## Training Configuration and Loop

We train the model using **Binary Cross-Entropy (BCE) loss** with **sigmoid outputs** for multi-label answer prediction.

- **Optimizer**: Adam  
  - \( \beta_1 = 0.9 \), \( \beta_2 = 0.98 \), as specified in the paper.

- **Learning Rate Schedule**:
  - **Warm-up** for the first 3 epochs.
  - **Decay** at epochs 10 and 12.

- **Training Duration**:
  - A total of **13 epochs** on the combined **train + val** datasets (optionally including **Visual Genome**), as described in the paper.

- **Batch Size**: 64

---

Below is a high-level description of the training loop incorporating these settings:

1. Initialize model, optimizer (Adam), and BCE loss.
2. Apply warm-up learning rate schedule for the first 3 epochs.
3. Train the model for 13 epochs:
   - At each epoch:
     - Perform forward and backward passes.
     - Update weights using Adam.
     - Log training loss and accuracy.
     - Run validation after each epoch.
4. Decay learning rate at epochs 10 and 12.
5. Track and log validation performance for model selection.
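
The schedule can be written as a pure function of the epoch, which makes it easy to check before training. In this sketch, `lr_at_epoch` is a hypothetical helper; the base rate of `1e-4` and the decay factor of `0.2` mirror the training cell in this notebook, and treating epochs as 0-indexed is an assumption.

```python
def lr_at_epoch(epoch, base_lr=1e-4, warmup_epochs=3, decay_epochs=(10, 12), decay=0.2):
    """Learning rate for a 0-indexed epoch under linear warm-up plus step decay."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    n_decays = sum(1 for e in decay_epochs if epoch >= e)
    return base_lr * decay ** n_decays

schedule = [lr_at_epoch(e) for e in range(13)]
for e, lr in enumerate(schedule):
    print(f"epoch {e:2d}: lr = {lr:.2e}")
```

Under these assumptions the rate rises to `1e-4` by the third epoch, holds until epoch 10, and is multiplied by `0.2` at epochs 10 and 12.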


In [None]:
# Define training components
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = LRCNModel(hidden_dim=512, num_heads=8, num_layers=6, 
                  vocab_size=len(word2idx), glove_weights=glove_matrix, 
                  answer_vocab_size=len(answer2idx), variant="co").to(device)

optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-4, betas=(0.9, 0.98))
criterion = nn.BCELoss()  # Binary Cross-Entropy for multi-label classification

# Learning rate warm-up and decay settings
def adjust_learning_rate(epoch):
    # Warm-up for first 3 epochs: linearly scale lr from 1/3 to 1 of the 1e-4 peak
    if epoch < 3:
        lr_scale = (epoch + 1) / 3.0  # epoch starts at 0
        for param_group in optimizer.param_groups:
            param_group['lr'] = 1e-4 * lr_scale
    # Decay by 1/5 at epoch 10 and 12
    if epoch == 10 or epoch == 12:
        for param_group in optimizer.param_groups:
            param_group['lr'] *= 0.2  # decay learning rate

# Example DataLoader (assuming dataset and collate are defined)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=64, shuffle=False)

# Training loop
for epoch in range(13):  # train for 13 epochs
    model.train()
    adjust_learning_rate(epoch)
    total_loss = 0.0
    for i, (image_paths, questions, targets) in enumerate(train_loader):
        questions = questions.to(device)
        targets = targets.to(device)
        # Forward pass
        probs = model(image_paths, questions)  # forward (image_paths used inside model)
        loss = criterion(probs, targets)
        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        # (Optional: logging every N batches)
    avg_loss = total_loss / len(train_loader)
    # Validation
    model.eval()
    val_loss = 0.0
    correct = 0; total_questions = 0
    with torch.no_grad():
        for image_paths, questions, targets in val_loader:
            questions = questions.to(device)
            targets = targets.to(device)
            probs = model(image_paths, questions)
            loss = criterion(probs, targets)
            val_loss += loss.item()
            # Compute accuracy (e.g., whether any ground truth answer is predicted as top-1)
            # For multi-label, we can consider top answer correctness for simplicity
            preds = torch.argmax(probs, dim=1)
            # If any of the ground truth answers (those with target > 0) matches the prediction, count as correct
            for j in range(preds.size(0)):
                true_answers = (targets[j] > 0).nonzero(as_tuple=True)[0]
                if preds[j].item() in true_answers.cpu().tolist():
                    correct += 1
            total_questions += questions.size(0)
    avg_val_loss = val_loss / len(val_loader)
    accuracy = correct / total_questions * 100.0
    print(f"Epoch {epoch+1}: Train Loss = {avg_loss:.4f}, Val Loss = {avg_val_loss:.4f}, Val Accuracy = {accuracy:.2f}%")


## Notes on the Training Loop

- We **adjust the learning rate** at the start of each epoch:
  - Linearly **increase** it during the first 3 epochs (warm-up).
  - **Decay** it by a factor of 5 at epochs 10 and 12.

- We use **`BCELoss`** on the **sigmoid outputs (`probs`)** for multi-label classification, since each question can have multiple correct answers.

- **Validation Accuracy**:
  - We compute a simple accuracy metric by checking if the top predicted answer is among the ground truth answers.
  - (More rigorous VQA-specific metrics can be applied, but this serves well for monitoring.)

- **Logging**:
  - After each epoch, we log the **average training loss**, **validation loss**, and **accuracy**.
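
The binary check counts any answer with a non-zero target as fully correct. A closer approximation to the standard VQA metric gives the prediction the soft score of the answer it picked. The sketch below assumes `targets` holds the `min(count / 3, 1)` scores built by the dataset class; `soft_vqa_accuracy` is a hypothetical helper, and answers outside the candidate vocabulary are not represented.

```python
import torch

def soft_vqa_accuracy(probs, targets):
    """Mean soft score of the top-1 prediction over a batch."""
    preds = probs.argmax(dim=1, keepdim=True)      # (B, 1) index of the top answer
    return targets.gather(1, preds).mean().item()  # credit = target score at that index

# Toy batch: 2 questions, 3 candidate answers
probs = torch.tensor([[0.9, 0.6, 0.5], [0.5, 0.7, 0.8]])
targets = torch.tensor([[1.0, 0.0, 0.0], [0.0, 0.0, 1 / 3]])
acc = soft_vqa_accuracy(probs, targets)
```

In the toy batch the first prediction earns full credit and the second earns one third, so the batch accuracy is their mean, two thirds.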

---

With this setup, the model is ready for training and experimentation. 

- The **training cell above** uses:
  - `6` layers
  - the **Co-Stacking** variant

- You can easily modify:
  - `variant="pure"` or `variant="enc_dec"`
  - `num_layers=8` (for deeper stacks)

to explore the different LRCN variants described in the paper.

### Architectural and Hyperparameter Defaults:
- 8 attention heads  
- 512-dimensional features  
- Layer-Residual connections  
- Training schedule from Section 4.2 of the LRCN paper

---

### Sources

- D. Han et al., *"LRCN: Layer-residual Co-Attention Networks for visual question answering,"*  
  *Expert Systems With Applications, vol. 263, 2025*  
  (Includes architecture and preprocessing details)

- Equation references **(8)–(18)** from the LRCN paper for model computations.

- Training hyperparameter settings taken from **Section 4.2** of the LRCN paper.
