Binary roof segmentation from aerial imagery under strict data scarcity. Several approaches are compared head-to-head on a ~30-sample training budget, with per-model tuning and a three-way train/val/test split.
With only ~30 labeled training samples, what segmentation quality can each approach actually reach, and how does that ceiling fall as training data is reduced further?
This is a capability-ceiling comparison, not a controlled single-variable experiment. Each model is tuned to its own best recipe under an equal tuning budget, and the results are reported with the recipe that achieved them.
- Source: Inria Aerial Image Labeling dataset, Austin subset.
- Tiles: 36 RGB tiles at 5000×5000 px, each with a binary mask of building footprints.
- Label semantics: the masks mark building footprints. For nadir aerial imagery, this is effectively the roof outline (including overhangs/eaves — no distinction between roof and walls).
- Source layout (not committed to the repo):
      data/imgs/austin{1..36}.tif
      data/msks/austin{1..36}.tif

  The source TIFFs live locally only and are removed once `patches/` is finalized. The dataset-of-record is `patches/`, not `data/`.
Three-way split, fixed once, never revisited.
| Set | Tiles | Count | Purpose |
|---|---|---|---|
| train | austin2,3,4,5,7,8,9,10,12,13,14,15,17,18,19,20,22,23,24,25,27,28,29,30 | 24 | Model fitting |
| val | austin1, 6, 11, 16, 21, 26 | 6 | LR tuning, augmentation tuning, checkpoint selection, in-training IoU/F1 |
| test | austin31, 32, 33, 34, 35, 36 | 6 | Final reported numbers — read once at the end of each attempt |
Test discipline: no tuning loop is permitted to read test. Enforced in code via a separate dataloader and a config flag.
- One 1024×1024 crop per source tile → 36 patches total.
- Smart-center crop: start at the geometric center; if the centered 1024² window has < 5% positive (building) pixels, slide the window around the 5000² source to find a position with meaningful coverage.
- Manual verification: all 36 outputs are eyeballed once; any that look bad are nudged by hand.
- Formats: JPEG q90 for RGB images (~400 KB each), PNG for binary masks (~50 KB each). Total footprint: ~15 MB.
- Committed to repo: yes.
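The smart-center crop described above can be sketched as follows. This is a sketch only: the scan stride and the "best coverage wins" fallback are assumptions, and the real `make_patches.py` may differ.

```python
# Sketch of the smart-center crop: start at the geometric center of the
# 5000x5000 tile; if the centered window has < 5% positive pixels, scan
# candidate positions and keep the best-covered one. Stride is an assumption.
import numpy as np

def smart_center_crop(mask: np.ndarray, size: int = 1024,
                      min_pos: float = 0.05, stride: int = 512):
    """Return (y, x) top-left corner of a size x size crop window."""
    h, w = mask.shape
    cy, cx = (h - size) // 2, (w - size) // 2
    frac = mask[cy:cy + size, cx:cx + size].mean()
    if frac >= min_pos:
        return cy, cx                      # centered crop is good enough
    best, best_frac = (cy, cx), frac
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            f = mask[y:y + size, x:x + size].mean()
            if f > best_frac:              # keep the best-covered window
                best, best_frac = (y, x), f
    return best
```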
`patches/` is the dataset-of-record.
patches/
train/{images,masks}/austin{...}.{jpg,png}
val/{images,masks}/austin{1,6,11,16,21,26}.{jpg,png}
test/{images,masks}/austin{31..36}.{jpg,png}
No runtime resizing anywhere in the pipeline. The only resampling in the entire project is the one-time preprocessing step (5000² → 1024²). Every model trains and evaluates at 1024².
Five model configurations, ordered so each builds on a lesson from the previous.
| # | Model | Encoder / prior | Trainable params (approx) | What it tests |
|---|---|---|---|---|
| 1 | U-Net from scratch | none | ~7–30 M | Baseline floor; no pretrained prior |
| 2 | SMP U-Net + ResNet (ImageNet) | ImageNet-pretrained CNN | ~24 M | Standard transfer-learning recipe |
| 3 | SMP DeepLabV3+ + ResNet (ImageNet) | ImageNet-pretrained CNN | ~25 M | Architecture impact on top of #2 |
| 4 | SAM ViT-B frozen + conv decoder | frozen ViT (SAM) | ~1–3 M | Massive pretrained prior; very few trainable params |
| 5 | SAM ViT-B encoder + U-Net decoder w/ skip connections | frozen ViT (SAM), rich decoder | ~5–10 M | Decoder capacity on top of SAM prior |
Attempts #2 and #4 are ported from the reference Keras/PyTorch code in `~/Code/python/dida/` — they serve as known-working baselines that the new project's numbers can be cross-checked against.
Attempt #5 pulls features from intermediate ViT blocks (e.g., blocks 3/6/9/12
for ViT-B), reshapes (B, N, C) → (B, C, H/16, W/16), and feeds them as
skip connections into a U-Net-style decoder that upsamples back to 1024².
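The token-to-feature-map reshape can be sketched as below. The function name is hypothetical; note that SAM's image encoder blocks actually emit spatial `(B, H, W, C)` maps, in which case only a permute is needed rather than the full reshape.

```python
# Sketch of reshaping ViT token sequences (B, N, C) into conv-friendly
# feature maps (B, C, H/16, W/16). Assumes the CLS token has been dropped.
import torch

def tokens_to_feature_map(tokens: torch.Tensor, grid: int) -> torch.Tensor:
    b, n, c = tokens.shape
    assert n == grid * grid, "token count must match the patch grid"
    return tokens.transpose(1, 2).reshape(b, c, grid, grid)

# 1024 input / 16-px patches -> 64x64 grid of 768-dim tokens (ViT-B)
feats = tokens_to_feature_map(torch.randn(2, 64 * 64, 768), grid=64)
# feats.shape == (2, 768, 64, 64)
```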
- Framework: PyTorch; `segmentation_models_pytorch` for the CNN track, `segment_anything` for the SAM track.
- Input resolution: 1024 × 1024 (unchanged from `patches/`).
- Sample oversampling (critical for tiny datasets — see "Design notes"): each epoch randomly draws `samples_per_epoch=400` indices with replacement from the 24 training patches; each draw is independently augmented. An epoch is thus ~100 optimizer steps at batch 4, not 6.
- Loss: `w * BCE + (1-w) * Dice` — weighting `w` is a per-model hyperparameter (default 0.5, sweep considered during tuning).
- Optimizer: AdamW, weight decay 1e-4, optionally EMA-wrapped.
- Callbacks:
  - EarlyStopping on `val_iou`, patience ~10 epochs
  - ReduceLROnPlateau, factor 0.5, patience 3
  - ModelCheckpoint: save best by `val_iou`
  - PredictionVisualizer: dump prediction overlays every N epochs for the Colab/terminal feedback loop
- Two-stage training for pretrained-encoder attempts (#2, #3):
  1. Train with `encoder_freeze=True` for ~20 epochs (decoder warmup).
  2. Unfreeze encoder, continue training at the same or a reduced LR.
- Decoder dropout (SMP wrapper): `Dropout2d(p)` injected between decoder stages, `p` tunable per model. Encoder is never wrapped in dropout.
- Mixed precision: fp16 on Colab T4.
- Seed: 42 for all seedable operations.
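The weighted BCE + Dice loss from the list above can be sketched like this; the Dice smoothing constant is an assumption, not taken from the project code.

```python
# Sketch of w * BCE + (1 - w) * Dice on raw logits. `w` is the per-model
# hyperparameter (default 0.5); `eps` smooths the Dice term for empty masks.
import torch
import torch.nn.functional as F

def bce_dice_loss(logits: torch.Tensor, target: torch.Tensor,
                  w: float = 0.5, eps: float = 1.0) -> torch.Tensor:
    bce = F.binary_cross_entropy_with_logits(logits, target)
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum()
    dice = 1 - (2 * inter + eps) / (prob.sum() + target.sum() + eps)
    return w * bce + (1 - w) * dice
```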
What's tuned (per-model, equal tuning budget):
- Learning rate (via short LR-finder run)
- Augmentation preset (LIGHT / MEDIUM / HEAVY / EXTREME — see below)
- Decoder dropout rate (SMP track: 0.0 / 0.2 / 0.3 / 0.5)
- Loss weighting (BCE/Dice 0.5/0.5 vs 0.3/0.7)
- Convergence epoch (via early stopping)
What's locked (identical across attempts):
- Train / val / test split and the patch bytes
- Metric and evaluation protocol
- Random seed
- Optimizer family (AdamW)
- Oversampling scheme (`samples_per_epoch=400`)
- Batch size in effective terms (actual batch may drop with gradient accumulation under memory pressure, but effective batch is constant)
Tuning budget per model: 1 LR-finder run + 3–4 augmentation/dropout sweeps + 1 final long run. Documented in the results table.
Built around the curated pipeline from the reference Keras code, which
explicitly excluded GaussianBlur and GaussNoise because "there is not
much variety of blur in the given images" — a dataset-specific observation
worth respecting.
LIGHT:
HorizontalFlip, VerticalFlip, RandomRotate90
(the full D4 dihedral group — free for aerial nadir imagery)
MEDIUM (default / reference recipe):
LIGHT +
ShiftScaleRotate(shift=0.2, scale=0.3, rotate=0.4, BORDER_REFLECT, p=0.7)
RandomBrightnessContrast(0.2, 0.3, p=0.5)
HueSaturationValue(hue=10, sat=15, val=15, p=0.3)
HEAVY:
MEDIUM +
RandomShadow
RandomGamma
CLAHE
mild JPEG compression artifacts
EXTREME:
HEAVY +
mild ElasticTransform
CoarseDropout
RandomFog
Hypothesis: MEDIUM or HEAVY wins for CNN track; LIGHT or MEDIUM wins for SAM track (frozen ViT encoders tend to prefer milder augmentation).
- Metric: mean IoU + F1, computed on reconstructed full-resolution predictions (not on patches, since there's exactly one patch per tile).
- Validation: used continuously during tuning.
- Test: read exactly once per attempt, after all tuning is finalized. Test numbers are the leaderboard.
- Qualitative output: prediction overlays on val + test tiles, committed to `results/<attempt>/overlays/`.
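The IoU/F1 computation on a binary full-tile prediction can be sketched as follows; the threshold and epsilon are assumptions.

```python
# Sketch of per-tile IoU and F1 on binarized predictions. With one patch
# per tile, "full-resolution" and "patch" metrics coincide here.
import numpy as np

def iou_f1(pred: np.ndarray, target: np.ndarray, thr: float = 0.5,
           eps: float = 1e-7) -> tuple:
    p = pred >= thr
    t = target >= thr
    inter = np.logical_and(p, t).sum()
    union = np.logical_or(p, t).sum()
    iou = (inter + eps) / (union + eps)
    f1 = (2 * inter + eps) / (p.sum() + t.sum() + eps)
    return float(iou), float(f1)
```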
| # | Model | LR | Aug | Dropout | Loss (BCE/Dice) | Epochs | Val IoU | Val F1 | Test IoU | Test F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ... |
Both val and test are shown — val for honesty (see the val→test drop), test as the actual reported number.
Best 1–2 models retrained on {5, 12, 24} training tiles. Plot: val IoU vs. training tile count, one line per model. This is the project's headline figure — it directly visualizes few-shot behavior.
Phase 0 — Smoke test (~5 short runs, < 1 h)
Verify each of the 5 models trains at all under a sensible default recipe.
Loss decreases, no NaNs, no OOM, val IoU improves above trivial floor.
Phase 1 — Per-model LR calibration (~5 short runs)
LR-finder on each model.
Phase 2 — Per-model aug/dropout sweep (~15 runs)
For each model, try 3 augmentation presets × 2 dropout rates at chosen LR.
Pick best (LR, aug, dropout) triple per model.
Phase 3 — Final training runs (~5 full runs)
Each model trained with its tuned recipe to convergence. Read test once.
Fills the headline table.
Phase 4 — Data-budget ablation (~6 runs)
Best 1–2 models on {5, 12, 24} training tiles with tuned recipes.
Produces the headline figure.
Total: ~35–40 runs. Fits across a few Colab sessions.
image-segmentation/
├── README.md — this file
├── .gitignore — excludes data/ and runtime artifacts
├── patches/ — committed dataset-of-record (~15 MB)
│ ├── train/{images,masks}/
│ ├── val/{images,masks}/
│ └── test/{images,masks}/
├── scripts/
│ ├── make_patches.py — one-time preprocessing (5000² → 1024²)
│ └── eyeball_patches.py — visual check of all 36 patches
├── src/
│ ├── data.py — Dataset + oversampling Sampler
│ ├── augment.py — Albumentations presets
│ ├── models/
│ │ ├── unet_scratch.py
│ │ ├── smp_wrapper.py — SMP models with decoder dropout
│ │ └── sam_decoder.py — SAM frozen encoder + decoders (#4, #5)
│ ├── losses.py — BCE+Dice with tunable weighting
│ ├── metrics.py — IoU, F1, full-tile reconstruction
│ ├── train.py — shared training loop, callbacks
│ ├── diagnostics.py — structured per-epoch logging
│ └── config.py — per-attempt hyperparameter dataclasses
├── tests/ — local pytest (CPU only)
│ ├── test_dataset.py
│ ├── test_augment.py
│ ├── test_models.py
│ ├── test_losses.py
│ ├── test_metrics.py
│ └── test_inference.py
├── notebooks/
│ ├── 00_smoke_test.ipynb
│ ├── 01_attempt_unet_scratch.ipynb
│ ├── 02_attempt_smp_unet.ipynb
│ ├── 03_attempt_smp_deeplab.ipynb
│ ├── 04_attempt_sam_frozen.ipynb
│ ├── 05_attempt_sam_unet.ipynb
│ └── 99_ablation_data_budget.ipynb
└── results/ — per-attempt outputs (gitignored)
└── <attempt>/
├── config.yaml
├── history.csv
├── best.pt
└── overlays/
Data (data/raw/, source TIFFs) is local-only and git-ignored.
# one-time patch generation from local source TIFFs
python scripts/make_patches.py --src data/ --dst patches/
# unit tests (runs on CPU in seconds)
pytest tests/

Each notebook follows the same skeleton:
!git clone https://github.com/<user>/image-segmentation
%cd image-segmentation
!pip install -r requirements.txt
from src.config import ATTEMPT_02
from src.train import train_with_diagnostics
train_with_diagnostics(ATTEMPT_02)

The notebook prints structured per-epoch diagnostics in a format designed to be copy-pasted back into chat for remote debugging:
=== attempt: smp_unet_resnet34 epoch: 5/60 wall: 0:04:12 ===
train: loss=0.214 (n=400 augmented, bs=4, steps=100)
val: loss=0.287 iou=0.612 f1=0.748 (n=6 tiles)
per_tile_iou: austin1=0.71 austin6=0.55 austin11=0.68
austin16=0.63 austin21=0.49 austin26=0.61
worst_tile: austin21 (iou=0.49)
class_balance: pred_pos=0.18 true_pos=0.21
gpu_mem: 4.2 / 15.0 GB step_time: 0.42s
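A sketch of how such a report could be assembled; the function name, signature, and field set are hypothetical, and the real `diagnostics.py` would also carry wall time, class balance, and GPU stats.

```python
# Sketch of building the copy-pasteable per-epoch diagnostic block.
def format_epoch_report(attempt: str, epoch: int, total: int,
                        train_loss: float, val_loss: float,
                        val_iou: float, val_f1: float,
                        per_tile_iou: dict) -> str:
    worst = min(per_tile_iou, key=per_tile_iou.get)
    tiles = " ".join(f"{k}={v:.2f}" for k, v in per_tile_iou.items())
    return "\n".join([
        f"=== attempt: {attempt} epoch: {epoch}/{total} ===",
        f"train: loss={train_loss:.3f}",
        f"val:   loss={val_loss:.3f} iou={val_iou:.3f} f1={val_f1:.3f}",
        f"per_tile_iou: {tiles}",
        f"worst_tile: {worst} (iou={per_tile_iou[worst]:.2f})",
    ])
```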
Free Colab T4 is sufficient for all attempts at batch 4 (SAM) to 8 (SMP).
Two-tier approach. Local red-green-refactor for everything that doesn't need GPU; structured observational logging for everything that does.
Tier 1 — local pytest (runs in seconds on any laptop):
- `test_dataset.py` — patch loader returns correct shapes and pixel ranges; mask is binary; no NaNs
- `test_augment.py` — augmentation pipeline preserves shapes, mask stays binary, handles all-zero and all-positive edge cases
- `test_models.py` — each model factory builds, forward pass on a `(1, 3, 1024, 1024)` zero tensor returns `(1, 1, 1024, 1024)`, trainable param count matches expectation (catches "did I actually freeze the SAM encoder?" bugs)
- `test_losses.py` — BCE+Dice returns scalar > 0, gradients flow, hand-computed values match on tiny tensors
- `test_metrics.py` — IoU on hand-crafted 4×4 masks matches expected fractions
- `test_inference.py` — full-tile reconstruction is a bit-exact identity when the model is `nn.Identity()`
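As a flavor of Tier 1, the frozen-encoder param check from `test_models.py` might look like this; the tiny modules here are stand-ins, not the project's model factories.

```python
# Sketch of the "did I actually freeze the encoder?" test. A frozen
# encoder must contribute zero trainable parameters.
import torch.nn as nn

def count_trainable(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def test_frozen_encoder_has_no_trainable_params():
    encoder = nn.Sequential(nn.Conv2d(3, 8, 3), nn.Conv2d(8, 8, 3))
    for p in encoder.parameters():
        p.requires_grad_(False)          # mimic freezing the SAM encoder
    decoder = nn.Conv2d(8, 1, 1)         # 8 weights + 1 bias = 9 params
    assert count_trainable(encoder) == 0
    assert count_trainable(decoder) == 9
```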
Smoke-test flag on the training script: train.py --smoke runs 2 epochs on
a CPU-tiny subset to exercise the entire pipeline (data → model → loss →
optimizer → eval → checkpoint) before touching Colab.
Tier 2 — Colab observational:
Structured per-epoch diagnostics (see "How to run") are the remote debugging interface. The loop is: you paste output to chat, I diagnose overfitting / class imbalance / LR issues from the numbers, you adjust and re-run.
Non-obvious choices and the reasoning behind them.
- Oversampled epochs (`samples_per_epoch=400`). With 24 training samples at batch 4, a naive epoch is 6 optimizer steps — too small for LR schedules and early stopping to behave sensibly. Oversampling with replacement + fresh augmentation per draw restores normal training dynamics without faking data. Each epoch sees ~100 steps of genuinely distinct augmented views. Pattern borrowed from the reference Keras code.
- Single resolution, no runtime resizing. Upsampling creates fake detail and can't add information; downsampling loses it. Storing and training at the same resolution (1024²) means the only resampling in the entire pipeline is the one-time preprocessing step. Both tracks eat identical bytes.
- Per-model tuning, not shared recipe. The research question is "what's each approach capable of," not "which wins with a fixed recipe." Locking LR or augmentation across fundamentally different models (from-scratch CNN vs. frozen pretrained ViT) would penalize whichever disagrees with the chosen value — measuring misconfiguration rather than potential. Equal tuning budget per model, documented in the results table, is the fairness discipline instead.
- Decoder-only dropout. Dropout inside a pretrained encoder disrupts its BN/feature statistics; dropout in the decoder regularizes the tiny set of newly trained parameters, which is where overfitting actually happens with 30 samples. Matches the pattern used in the reference SAM decoder code.
- Test set held back. Per-model tuning against val means val is contaminated by tuning effort by the time the final run happens. Test is read once at the end of each attempt, never influences any decision, and provides the honest leaderboard.
- Reference Keras code retained (`~/Code/python/dida/`). Attempts #2 and #4 are PyTorch ports of existing working Keras/PyTorch code. Cross-checking numbers against the reference catches port bugs early.
- No GaussianBlur / GaussNoise in augmentation. Excluded based on a dataset-specific observation from the reference code: the source imagery has very little blur variation, so these augmentations add unrealistic variance without matching any real-world deployment condition.
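The oversampled-epoch scheme from the first design note can be sketched as a PyTorch `Sampler`; the class name is hypothetical, and the real `src/data.py` may implement it differently.

```python
# Sketch of an epoch that draws samples_per_epoch indices with replacement
# from a tiny dataset, so LR schedules and early stopping see ~100 steps
# per "epoch" instead of 6.
import torch
from torch.utils.data import Sampler

class OversampleWithReplacement(Sampler):
    def __init__(self, dataset_len: int, samples_per_epoch: int = 400):
        self.dataset_len = dataset_len
        self.samples_per_epoch = samples_per_epoch

    def __iter__(self):
        # fresh random draw every epoch; each index is augmented anew
        idx = torch.randint(self.dataset_len, (self.samples_per_epoch,))
        return iter(idx.tolist())

    def __len__(self):
        return self.samples_per_epoch
```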
Project planning complete; implementation pending. Next concrete steps:
- `scripts/make_patches.py` — one-time preprocessing
- `src/` skeleton modules + `tests/` for local TDD
- Phase 0 smoke test notebook