

# Multimodal Learning

⸻

1. What Is Multimodal Learning?

Multimodal learning combines information from multiple data modalities, such as:
	•	Text (e.g., captions, documents)
	•	Image (e.g., pixels, features)
	•	Audio (e.g., speech, sounds)
	•	Video, sensor, or tabular data

Goal: Learn joint representations or complementary features for better prediction, classification, or generation.

⸻

2. Fusion Strategies

Early Fusion:
	•	Combine raw or low-level features at the input, before any joint modeling (e.g., concatenate feature vectors)
	•	Example: [text features | image features]
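
A minimal sketch of early fusion, assuming each modality has already been reduced to a fixed-size feature vector. The helper name `early_fuse` and the dimensions (300 for text, 512 for image) are illustrative only. L2-normalizing each block before concatenation is an added assumption, so that neither modality dominates purely through feature scale.

```python
import torch
import torch.nn.functional as F

def early_fuse(text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
    """Concatenate per-modality features along the feature axis.

    Each block is L2-normalized first (an illustrative choice) so that
    differing feature scales do not let one modality dominate.
    """
    text_feats = F.normalize(text_feats, dim=-1)
    image_feats = F.normalize(image_feats, dim=-1)
    return torch.cat([text_feats, image_feats], dim=-1)

# Hypothetical batch: 4 samples, 300-d text features, 512-d image features.
torch.manual_seed(0)
text_feats = torch.randn(4, 300)
image_feats = torch.randn(4, 512)
fused = early_fuse(text_feats, image_feats)
print(fused.shape)  # torch.Size([4, 812])
```

The fused vector can then be fed to a single downstream model, which sees both modalities from its first layer.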

Late Fusion:
	•	Train separate models for each modality
	•	Combine decisions (e.g., ensemble of outputs)
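
Late fusion can be sketched without any framework, assuming each modality's model has already produced class logits. The function names and the example logits below are hypothetical; a weighted average of probabilities is only one way to combine decisions (voting or a small learned combiner are alternatives).

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def late_fuse(per_modality_logits, weights=None):
    """Weighted average of each modality's class probabilities."""
    n = len(per_modality_logits)
    weights = weights or [1.0 / n] * n  # default: equal trust in every model
    probs = [softmax(logits) for logits in per_modality_logits]
    num_classes = len(probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probs))
        for c in range(num_classes)
    ]

# Hypothetical outputs of two independently trained 2-class classifiers.
text_logits = [2.0, 0.0]   # text model favors class 0
image_logits = [0.0, 1.0]  # image model favors class 1
fused = late_fuse([text_logits, image_logits])
prediction = max(range(len(fused)), key=fused.__getitem__)
print(fused, prediction)
```

Because the text model is more confident than the image model here, the averaged probabilities still favor class 0.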

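Hybrid (Intermediate) Fusion:
	•	Combine intermediate representations inside the model, between the input and the decision level
	•	Often implemented with cross-modal attention (e.g., ViLBERT, LXMERT); CLIP, by contrast, keeps two separate encoders and only aligns their output embeddings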

⸻

3. Applications

| Modality Pair | Task Example |
|---|---|
| Text + Image | Image Captioning, VQA |
| Text + Audio | Speech Recognition |
| Video + Audio | Emotion Recognition |
| Image + Sensor | Medical Diagnosis |
| Text + Table | Financial Report Analysis |



⸻

4. Architecture Example (Text + Image)

Image → CNN → Img Features ┐
                            ├→ Concatenate → MLP → Output
Text  → BERT → Txt Features ┘

Or using attention:

Image → CNN → Keys/Values ┐
                           ├→ Cross Attention
Text → BERT → Queries     ┘
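
The attention variant above could be sketched with PyTorch's built-in `nn.MultiheadAttention`. This is a sketch under stated assumptions: the class name `CrossAttentionFusion` is hypothetical, both modalities are assumed to be already projected to a shared width of 256, and the residual connection plus `LayerNorm` are a common but added design choice.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Text tokens (queries) attend over image patches (keys/values)."""

    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, image_patches):
        # attended: [B, T, dim]; weights: [B, T, P], averaged over heads
        attended, weights = self.attn(
            query=text_tokens, key=image_patches, value=image_patches
        )
        # Residual connection keeps the original text information.
        return self.norm(text_tokens + attended), weights

# Hypothetical shapes: 12 text tokens, 49 image patches (e.g., a 7x7 feature map).
torch.manual_seed(0)
fusion = CrossAttentionFusion()
text_tokens = torch.randn(2, 12, 256)
image_patches = torch.randn(2, 49, 256)
fused, weights = fusion(text_tokens, image_patches)
print(fused.shape, weights.shape)
```

Each row of `weights` is a distribution over image patches, so it can be inspected to see which regions a given token attended to.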



⸻

5. Python: Simple Text + Image Fusion (Early Fusion)

import torch
import torch.nn as nn
from transformers import BertModel

class MultimodalClassifier(nn.Module):
    """Concatenates a BERT sentence vector with projected image features."""

    def __init__(self, img_dim=2048, hidden=512, num_classes=2):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # Read the text width from the loaded model (768 for bert-base)
        # instead of hard-coding it, so the layers below always match.
        txt_dim = self.bert.config.hidden_size
        # Project precomputed image features (e.g., 2048-d pooled CNN
        # features) to the same width as the text vector.
        self.img_fc = nn.Linear(img_dim, txt_dim)
        self.classifier = nn.Sequential(
            nn.Linear(txt_dim + txt_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, text_input_ids, text_attention_mask, img_features):
        txt_out = self.bert(input_ids=text_input_ids, attention_mask=text_attention_mask)
        txt_vec = txt_out.pooler_output           # [B, 768]
        img_vec = self.img_fc(img_features)       # [B, 768]
        x = torch.cat([txt_vec, img_vec], dim=1)  # [B, 1536]
        return self.classifier(x)                 # [B, num_classes] logits



⸻

6. Challenges in Multimodal Learning
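	•	Missing modalities: Not every input is available for every sample, at training or at inference time
	•	Alignment: Matching corresponding content across modalities in time or space (e.g., audio frames to video frames)
	•	Noise and imbalance: One modality may be noisy, or may dominate training so that the others are underused
	•	Dimensional mismatch: Feature vectors of very different sizes (e.g., 768-d text vs. 2048-d image)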
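
Two of these challenges, dimensional mismatch and missing modalities, can be illustrated together. The sketch below is one possible approach, not a standard recipe: the class name `RobustFusion`, the shared width of 256, and the learned placeholder vector for absent images are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class RobustFusion(nn.Module):
    """Projects both modalities to one width and tolerates missing images."""

    def __init__(self, txt_dim=768, img_dim=2048, shared_dim=256):
        super().__init__()
        # Dimensional mismatch: map both modalities to the same width.
        self.txt_proj = nn.Linear(txt_dim, shared_dim)
        self.img_proj = nn.Linear(img_dim, shared_dim)
        # Missing modality: a learned stand-in used when no image exists.
        self.missing_img = nn.Parameter(torch.zeros(shared_dim))

    def forward(self, txt_feats, img_feats, img_present):
        txt = self.txt_proj(txt_feats)        # [B, shared_dim]
        img = self.img_proj(img_feats)        # [B, shared_dim]
        mask = img_present.unsqueeze(-1)      # [B, 1], boolean
        # Replace rows without an image by the placeholder vector.
        img = torch.where(mask, img, self.missing_img.expand_as(img))
        return torch.cat([txt, img], dim=-1)  # [B, 2 * shared_dim]

# Hypothetical batch of 3 samples; the second sample has no image.
torch.manual_seed(0)
model = RobustFusion()
txt_feats = torch.randn(3, 768)
img_feats = torch.randn(3, 2048)
img_feats[1] = 0.0  # dummy values, ignored because of the mask
img_present = torch.tensor([True, False, True])
out = model(txt_feats, img_feats, img_present)
print(out.shape)  # torch.Size([3, 512])
```

Randomly setting `img_present` to `False` during training (often called modality dropout) is a related trick for making the model less dependent on any single input.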

⸻

7. Large Multimodal Models

| Model | Modalities | Typical Tasks |
|---|---|---|
| CLIP | Text + Image | Retrieval, Zero-shot vision |
| Flamingo | Text + Image | VQA, Captioning |
| Whisper | Audio + Text | Transcription, Translation |
| Gemini / GPT-4V | Text + Image (Gemini also accepts audio and video) | Multimodal reasoning |
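
To give a feel for how a model like CLIP aligns text and images, here is a rough sketch of a symmetric contrastive loss over a batch of matched pairs. It is a simplification under stated assumptions: the function name `clip_style_loss` is hypothetical, the temperature is fixed at 0.07 rather than learned, and the embeddings are random stand-ins for encoder outputs.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric cross-entropy over a batch of matched text/image pairs."""
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    # Cosine similarity of every text with every image: [B, B]
    logits = text_emb @ image_emb.t() / temperature
    # The i-th text matches the i-th image, so targets are the diagonal.
    targets = torch.arange(logits.size(0))
    loss_t2i = F.cross_entropy(logits, targets)      # text -> image
    loss_i2t = F.cross_entropy(logits.t(), targets)  # image -> text
    return (loss_t2i + loss_i2t) / 2

# Random stand-ins for encoder outputs: 8 pairs of 64-d embeddings.
torch.manual_seed(0)
emb = torch.randn(8, 64)
aligned = clip_style_loss(emb, emb)                    # perfectly matched pairs
shuffled = clip_style_loss(emb, emb.roll(1, dims=0))   # every pair mismatched
print(aligned.item(), shuffled.item())
```

The loss is low when each text embedding is closest to its own image and high when the pairs are mismatched, which is the signal that pulls the two encoders into a shared embedding space.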



⸻


In [None]:
import os
import nbformat
import yaml

# Path to your _toc.yml
TOC_FILE = "_toc.yml"

# Where to create notebooks (can be same as where _toc.yml is)
OUTPUT_DIR = "."

# Template notebook content
def make_notebook(title):
    nb = nbformat.v4.new_notebook()
    nb["cells"] = [
        nbformat.v4.new_markdown_cell(f"# {title}"),
        nbformat.v4.new_code_cell("# Your code here")
    ]
    return nb

def create_notebook(path, title):
    """Create an .ipynb file if it doesn't exist."""
    if not path.endswith(".ipynb"):
        path += ".ipynb"
    full_path = os.path.join(OUTPUT_DIR, path)
    os.makedirs(os.path.dirname(full_path), exist_ok=True)
    if not os.path.exists(full_path):
        nb = make_notebook(title)
        with open(full_path, "w", encoding="utf-8") as f:
            nbformat.write(nb, f)
        print(f"Created: {full_path}")
    else:
        print(f"Exists:  {full_path}")

def process_entry(entry):
    """Recursively process toc entries."""
    if isinstance(entry, dict):
        if "file" in entry:
            title = entry.get("title", entry["file"])
            create_notebook(entry["file"], title)
        if "sections" in entry:
            for sec in entry["sections"]:
                process_entry(sec)
    elif isinstance(entry, list):
        for e in entry:
            process_entry(e)

def main():
    with open(TOC_FILE, "r", encoding="utf-8") as f:
        toc = yaml.safe_load(f)

    if "root" in toc:
        create_notebook(toc["root"], toc.get("title", toc["root"]))

    if "chapters" in toc:
        process_entry(toc["chapters"])

if __name__ == "__main__":
    main()
