# Homework · Multimodal Large Language Models
**Course:** Applied Data Science (CPSC 8xxx)
**Due:** _One week after the lab session_
**Submission:** Push to your private GitLab repository and submit a Palmetto job report.

This homework builds on the lecture and lab notebook. You will fine-tune multimodal foundation models across three modality pairings. Each exercise must include:
- The SLURM script or `srun` command you used on Palmetto.
- Training and evaluation logs (TensorBoard, W&B, or MLflow).
- A short written summary (1–2 paragraphs) interpreting your results and the challenges you encountered.


---
## Environment Checklist
- Use the same `multimodal-llm` Conda environment from the lab notebook.
- Reserve GPUs with at least 24 GB memory (A100 preferred).
- Store intermediate checkpoints under `/scratch1/$USER/hw3-multimodal`.


---
## Exercise 1 · Vision-Language Fine-Tuning (20 pts)
**Goal:** Fine-tune the CLIP-style dual encoder on the [Clemson Campus Scenes](https://example.org) dataset and evaluate zero-shot transfer to COCO.

**Requirements:**
- Implement balanced sampling to mitigate class imbalance between campus landmarks.
- Apply parameter-efficient fine-tuning (LoRA on the projector or QLoRA on the text tower).
- Report Recall@1/5/10 on both the COCO validation split and the Clemson Campus validation split.
- Analyze modality alignment drift compared to the pretraining checkpoint.
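
One common way to approach the balanced-sampling requirement is to weight each example by the inverse frequency of its class. The sketch below assumes the landmark labels are available as a plain list; `inverse_frequency_weights` and the toy label names are hypothetical and not part of the lab notebook.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Return one sampling weight per example so every class has equal total weight."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]

# Toy labels standing in for campus landmark classes (illustrative only).
labels = ["landmark_a", "landmark_a", "landmark_a", "landmark_b", "landmark_c", "landmark_c"]
weights = inverse_frequency_weights(labels)
```

In PyTorch, these weights could be passed to `torch.utils.data.WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)` so that each class is drawn with roughly equal probability.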


In [None]:
# TODO: configure paths
from pathlib import Path
import os

DATA_ROOT = Path('/scratch1') / os.environ.get('USER', 'student') / 'hw3-multimodal'
DATA_ROOT.mkdir(parents=True, exist_ok=True)

# TODO: implement dataset loader and balanced sampling strategy


In [None]:
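# TODO: load pretrained encoders and attach LoRA adapters
from transformers import AutoModel, CLIPVisionModel
from peft import LoraConfig, get_peft_model

# CLIPVisionModel loads only the vision tower; AutoModel would return the full CLIPModel.
vision_encoder = CLIPVisionModel.from_pretrained('openai/clip-vit-base-patch16')
text_encoder = AutoModel.from_pretrained('distilbert-base-uncased')

# LoRA target names must match each architecture's attention projections:
# CLIP uses q_proj/v_proj, DistilBERT uses q_lin/v_lin.
vision_lora = LoraConfig(r=16, lora_alpha=32, target_modules=['q_proj', 'v_proj'])
text_lora = LoraConfig(r=16, lora_alpha=32, target_modules=['q_lin', 'v_lin'])
vision_encoder = get_peft_model(vision_encoder, vision_lora)
text_encoder = get_peft_model(text_encoder, text_lora)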


In [None]:
# TODO: training loop
# Use Accelerate or PyTorch Lightning if preferred
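
As a starting point for the training loop, a CLIP-style objective can be sketched as a symmetric cross-entropy over the batch similarity matrix. This is a minimal sketch, not the lab's reference implementation: `clip_contrastive_loss` is a hypothetical name, and the fixed temperature of 0.07 is a simplifying assumption (CLIP itself learns this scale).

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image/text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (batch, batch) similarities
    # Row i's correct match is column i.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Random embeddings stand in for encoder outputs (illustrative only).
torch.manual_seed(0)
loss = clip_contrastive_loss(torch.randn(8, 32), torch.randn(8, 32))
```

Inside a training step, the two embedding batches would come from the vision and text encoders (after projecting both to a shared dimension), followed by `loss.backward()` and an optimizer step.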


In [None]:
# TODO: compute retrieval metrics and save a JSON report
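
A minimal sketch of the Recall@K computation follows. It assumes a square similarity matrix in which query `i` has exactly one correct candidate at index `i`; this is a simplification for COCO, where each image typically has several captions. The function name and toy values are hypothetical.

```python
import json

def recall_at_k(similarity, ks=(1, 5, 10)):
    """Recall@K for retrieval where query i's correct candidate is index i.

    similarity[i][j] is the score of query i against candidate j.
    """
    hits = {k: 0 for k in ks}
    for i, row in enumerate(similarity):
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        rank = ranked.index(i)  # 0-based position of the correct candidate
        for k in ks:
            if rank < k:
                hits[k] += 1
    n = len(similarity)
    return {f"recall@{k}": hits[k] / n for k in ks}

# Toy 3x3 similarity matrix (illustrative values only).
sim = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.8],
    [0.1, 0.7, 0.6],
]
report = recall_at_k(sim, ks=(1, 2))
report_json = json.dumps(report, indent=2)  # write this string to a file under reports/
```

For the full validation sets, a tensor-based version (for example with `torch.topk`) would be considerably faster than this pure-Python loop.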


---
## Exercise 2 · Audio-Text Instruction Alignment (25 pts)
**Goal:** Adapt Whisper-small + LLaMA-2-7B-chat to follow spoken instructions and output textual answers.

**Dataset Suggestions:** AudioCaps, Spoken-SQuAD, campus tour recordings. Combine with synthetic speech generated via Torchaudio + TTS for augmentation.

**Requirements:**
- Implement a projection layer from Whisper encoder embeddings to the LLaMA hidden size.
- Apply LoRA to the language model, _or_ freeze the language model and train only the adapter.
- Evaluate on a held-out set with WER and instruction-following accuracy (exact match or ROUGE-L).
- Discuss robustness to noisy backgrounds recorded on campus.
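
The projection requirement could be sketched as a small MLP that maps audio encoder frames into the language model's embedding space. The defaults assume an encoder width of 768 (Whisper-small) and a hidden size of 4096 (LLaMA-2-7B); the frame-stacking step, which shortens the audio sequence before it reaches the language model, is an optional design choice and not something the assignment prescribes. `AudioProjector` is a hypothetical name.

```python
import torch
from torch import nn

class AudioProjector(nn.Module):
    """Map audio encoder frames to the language model's hidden size."""

    def __init__(self, audio_dim=768, llm_dim=4096, stack=4):
        super().__init__()
        self.stack = stack  # number of consecutive frames merged into one token
        self.net = nn.Sequential(
            nn.Linear(audio_dim * stack, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats):
        # feats: (batch, frames, audio_dim)
        b, t, d = feats.shape
        t = t - t % self.stack  # drop trailing frames that do not fill a group
        feats = feats[:, :t].reshape(b, t // self.stack, d * self.stack)
        return self.net(feats)  # (batch, frames // stack, llm_dim)

# Tiny dimensions keep the demonstration cheap; real runs would use the defaults.
projector = AudioProjector(audio_dim=16, llm_dim=32, stack=4)
tokens = projector(torch.randn(2, 10, 16))
```

The projected tokens would then be concatenated with the text prompt embeddings before being passed to the language model.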


In [None]:
# TODO: prepare audio dataset manifest and dataloader
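
One possible manifest layout is JSON Lines, with one example per line. The field names below (`audio_path`, `instruction`, `answer`) and the file paths are assumptions chosen for illustration, not a format required by the assignment.

```python
import json
from pathlib import Path

def write_manifest(records, path):
    """Write one JSON object per line (JSONL) so a dataloader can stream examples."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

def read_manifest(path):
    """Load a JSONL manifest back into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Hypothetical records; the paths and text are placeholders.
example_records = [
    {"audio_path": "clips/0001.wav", "instruction": "Summarize the recording.", "answer": "..."},
    {"audio_path": "clips/0002.wav", "instruction": "What is being asked?", "answer": "..."},
]
```

A `torch.utils.data.Dataset` could then call `read_manifest` once in its constructor and load each audio file lazily in `__getitem__`.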


In [None]:
# TODO: build audio-to-text adapter and fine-tuning loop


In [None]:
# TODO: evaluation metrics (WER, instruction accuracy)
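
Both metrics can be sketched without external dependencies. WER is computed here as the word-level edit distance divided by the number of reference words, and exact match is taken after lowercasing and whitespace normalization; that normalization choice is an assumption. Libraries such as `jiwer` offer a maintained WER implementation if you prefer not to write your own.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the previous ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def exact_match_accuracy(references, predictions):
    """Fraction of predictions equal to the reference after simple normalization."""
    def norm(s):
        return " ".join(s.lower().split())
    matches = sum(norm(r) == norm(p) for r, p in zip(references, predictions))
    return matches / len(references)

wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

In this example the hypothesis has one substitution and one deletion against six reference words, so `wer` is 2/6.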


---
## Exercise 3 · Any-to-Any Generation (35 pts)
**Goal:** Extend an open-source multimodal assistant (e.g., LLaVA-NeXT, InstructBLIP, or Qwen-VL) to support **chart-to-audio** and **audio-to-image** tasks via tool augmentation.

**Requirements:**
- Add at least two modality adapters (e.g., Chart OCR encoder + audio synthesizer).
- Implement routing logic that selects the correct adapter based on the prompt.
- Demonstrate two end-to-end examples per new capability.
- Benchmark against a baseline without the new adapters.
- Provide an ablation table analyzing latency and GPU memory usage under Palmetto scheduling constraints.
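
For the latency column of the ablation table, a small timing helper along these lines may be enough. It is a sketch using only the standard library; `benchmark_latency` is a hypothetical name and the lambda is a stand-in workload.

```python
import statistics
import time

def benchmark_latency(fn, warmup=2, repeats=10):
    """Time fn() with a monotonic clock and return summary statistics in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are excluded from the timings
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.mean(samples),
        "median_ms": statistics.median(samples),
        "max_ms": max(samples),
    }

# Stand-in workload; replace with a call into the assistant pipeline.
stats = benchmark_latency(lambda: sum(range(10_000)))
```

When timing GPU code, call `torch.cuda.synchronize()` inside `fn` so that asynchronous kernels are included in the measurement. Peak GPU memory can be read with `torch.cuda.reset_peak_memory_stats()` before the run and `torch.cuda.max_memory_allocated()` after it.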


In [None]:
# TODO: design routing policy for any-to-any interactions
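
A simple baseline for the routing policy is a rule-based router: an explicitly declared input modality takes priority, and keyword matching on the prompt is the fallback. The adapter names and keywords below are hypothetical. Keyword routing is brittle (for instance, "plot" can also refer to a story), so a learned classifier or an LLM-based router may be worth comparing against it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    name: str
    keywords: tuple

# Hypothetical adapter names and trigger keywords, for illustration only.
ROUTES = (
    Route("chart_to_audio", ("chart", "plot", "graph")),
    Route("audio_to_image", ("recording", "audio clip", "sound")),
)

def select_adapter(prompt, input_modality=None, default="baseline"):
    """Pick an adapter: an explicit input modality wins, then keyword matches."""
    if input_modality == "chart":
        return "chart_to_audio"
    if input_modality == "audio":
        return "audio_to_image"
    text = prompt.lower()
    for route in ROUTES:
        if any(kw in text for kw in route.keywords):
            return route.name
    return default  # fall back to the unmodified assistant

choice = select_adapter("Read this chart aloud")
```

Returning a `default` route keeps the unmodified assistant reachable, which also makes the baseline comparison in the requirements straightforward to run through the same entry point.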


In [None]:
# TODO: integrate tool calls (e.g., image generation API, TTS)


---
## Deliverables Checklist
- [ ] Completed code cells for all exercises.
- [ ] `README.md` in your repository describing data sources and ethical considerations.
- [ ] SLURM submission scripts (`*.sbatch`).
- [ ] Evaluation summaries in `reports/`.
- [ ] Short reflection (submit via Canvas) covering lessons learned and next steps.


---
## Grading Rubric
| Component | Points | Criteria |
| --- | --- | --- |
| Exercise 1 | 20 | Completeness, retrieval metrics, analysis |
| Exercise 2 | 25 | Adapter design, instruction accuracy, robustness discussion |
| Exercise 3 | 35 | Tool integration, demonstrations, ablation |
| Engineering Report | 10 | Clarity of SLURM scripts, logging, reproducibility |
| Responsible AI Reflection | 10 | Bias/safety analysis, mitigation proposals |

Late policy: 10% deduction per day late (maximum 3 days). Collaborations must be declared.
