ai-sci-computing/image-segmentation
Roof Segmentation — Few-Shot Comparison

Binary roof segmentation from aerial imagery under strict data scarcity. Several approaches are compared head-to-head on a ~30-sample training budget, with per-model tuning and a three-way train/val/test split.


Research question

With only ~30 labeled training samples, what segmentation quality can each approach actually reach, and how does that ceiling fall as training data is reduced further?

This is a capability-ceiling comparison, not a controlled single-variable experiment. Each model is tuned to its own best recipe under an equal tuning budget, and the results are reported with the recipe that achieved them.


Data

  • Source: Inria Aerial Image Labeling dataset, Austin subset.
  • Tiles: 36 RGB tiles at 5000×5000 px, each with a binary mask of building footprints.
  • Label semantics: the masks mark building footprints. For nadir aerial imagery, this is effectively the roof outline (including overhangs/eaves — no distinction between roof and walls).
  • Source layout (not committed to the repo):
    data/imgs/austin{1..36}.tif
    data/msks/austin{1..36}.tif
    
    The source TIFFs live locally only and are removed once patches/ is finalized. The dataset-of-record is patches/, not data/.

Split

Three-way split, fixed once, never revisited.

| Set | Tiles | Count | Purpose |
| --- | --- | --- | --- |
| train | austin2, 3, 4, 5, 7, 8, 9, 10, 12, 13, 14, 15, 17, 18, 19, 20, 22, 23, 24, 25, 27, 28, 29, 30 | 24 | Model fitting |
| val | austin1, 6, 11, 16, 21, 26 | 6 | LR tuning, augmentation tuning, checkpoint selection, in-training IoU/F1 |
| test | austin31, 32, 33, 34, 35, 36 | 6 | Final reported numbers — read once at the end of each attempt |

Test discipline: no tuning loop is permitted to read test. Enforced in code via separate dataloader and config flag.


Preprocessing

  • One 1024×1024 crop per source tile → 36 patches total.
  • Smart-center crop: start at the geometric center; if the centered 1024² window has < 5% positive (building) pixels, slide the window around the 5000² source to find a position with meaningful coverage.
  • Manual verification: all 36 outputs are eyeballed once; any that look bad are nudged by hand.
  • Formats: JPEG q90 for RGB images (~400 KB each), PNG for binary masks (~50 KB each). Total footprint: ~15 MB.
  • Committed to repo: yes. patches/ is the dataset-of-record.
patches/
  train/{images,masks}/austin{...}.{jpg,png}
  val/{images,masks}/austin{1,6,11,16,21,26}.{jpg,png}
  test/{images,masks}/austin{31..36}.{jpg,png}

No runtime resizing anywhere in the pipeline. The only resampling in the entire project is the one-time preprocessing step (5000² → 1024²). Every model trains and evaluates at 1024².
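The smart-center crop search can be sketched as follows. This is a minimal numpy sketch, not the actual `make_patches.py`: the function name, the stride, and the grid-search order are illustrative assumptions.

```python
import numpy as np

def smart_center_crop(mask, size=1024, min_pos=0.05, stride=512):
    """Pick a size x size window: prefer the geometric center; if it has
    less than min_pos positive fraction, grid-search other positions."""
    h, w = mask.shape
    cy, cx = (h - size) // 2, (w - size) // 2

    def pos_frac(y, x):
        return mask[y:y + size, x:x + size].mean()

    if pos_frac(cy, cx) >= min_pos:
        return cy, cx                     # centered window is fine
    best, best_frac = (cy, cx), pos_frac(cy, cx)
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            f = pos_frac(y, x)
            if f >= min_pos:
                return y, x               # first window with enough coverage
            if f > best_frac:
                best, best_frac = (y, x), f
    return best                           # fall back to best seen
```

The manual-verification pass then catches any tiles where no window has meaningful coverage.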


Attempts

Five model configurations, ordered so each builds on a lesson from the previous.

| # | Model | Encoder / prior | Trainable params (approx) | What it tests |
| --- | --- | --- | --- | --- |
| 1 | U-Net from scratch | none | ~7–30 M | Baseline floor; no pretrained prior |
| 2 | SMP U-Net + ResNet (ImageNet) | ImageNet-pretrained CNN | ~24 M | Standard transfer-learning recipe |
| 3 | SMP DeepLabV3+ + ResNet (ImageNet) | ImageNet-pretrained CNN | ~25 M | Architecture impact on top of #2 |
| 4 | SAM ViT-B frozen + conv decoder | frozen ViT (SAM) | ~1–3 M | Massive pretrained prior; very few trainable params |
| 5 | SAM ViT-B encoder + U-Net decoder w/ skip connections | frozen ViT (SAM), rich decoder | ~5–10 M | Decoder capacity on top of SAM prior |

Attempts #2 and #4 are ported from the reference Keras/PyTorch code in ~/Code/python/dida/ — they serve as known-working baselines that the new project's numbers can be cross-checked against.

Attempt #5 pulls features from intermediate ViT blocks (e.g., blocks 3/6/9/12 for ViT-B), reshapes (B, N, C) → (B, C, H/16, W/16), and feeds them as skip connections into a U-Net-style decoder that upsamples back to 1024².
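That token-to-map reshape can be sketched in numpy (shapes follow the text; in the real PyTorch code this would be a `reshape` plus `permute` on ViT block outputs, and the exact token layout depends on the SAM implementation):

```python
import numpy as np

def tokens_to_feature_map(tokens, image_size=1024, patch=16):
    """Reshape ViT patch tokens (B, N, C) into a spatial feature map
    (B, C, H/patch, W/patch) usable as a decoder skip connection."""
    b, n, c = tokens.shape
    g = image_size // patch                 # grid side: 64 for 1024 / 16
    assert n == g * g, "token count must match the patch grid"
    # (B, N, C) -> (B, g, g, C) -> (B, C, g, g)
    return tokens.reshape(b, g, g, c).transpose(0, 3, 1, 2)
```

For a 1024² input with 16-pixel patches this yields a 64×64 map per block, which the decoder upsamples back to full resolution.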


Training protocol (shared across all attempts)

  • Framework: PyTorch, segmentation_models_pytorch for the CNN track, segment_anything for the SAM track.
  • Input resolution: 1024 × 1024 (unchanged from patches/).
  • Sample oversampling (critical for tiny datasets — see "Design notes"): each epoch randomly draws samples_per_epoch=400 indices with replacement from the 24 training patches; each draw is independently augmented. An epoch is thus ~100 optimizer steps at batch size 4, instead of the naive 6.
  • Loss: w * BCE + (1-w) * Dice — weighting w is a per-model hyperparameter (default 0.5, sweep considered during tuning).
  • Optimizer: AdamW, weight decay 1e-4, optionally EMA-wrapped.
  • Callbacks:
    • EarlyStopping on val_iou, patience ~10 epochs
    • ReduceLROnPlateau, factor 0.5, patience 3
    • ModelCheckpoint: save best by val_iou
    • PredictionVisualizer: dump prediction overlays every N epochs for the Colab/terminal feedback loop
  • Two-stage training for pretrained-encoder attempts (#2, #3):
    1. Train with encoder_freeze=True for ~20 epochs (decoder warmup).
    2. Unfreeze encoder, continue training at same or reduced LR.
  • Decoder dropout (SMP wrapper): Dropout2d(p) injected between decoder stages, p tunable per model. Encoder is never wrapped in dropout.
  • Mixed precision: fp16 on Colab T4.
  • Seed: 42 for all seedable operations.
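The oversampling scheme above can be sketched as a sampler. This is a plain-Python sketch with illustrative names; `torch.utils.data.RandomSampler(dataset, replacement=True, num_samples=400)` provides equivalent behavior in the real pipeline.

```python
import random

class OversampleWithReplacement:
    """Yields dataset indices drawn with replacement, so an 'epoch'
    has samples_per_epoch draws regardless of the true dataset size.
    Fresh augmentation per draw happens downstream in the Dataset."""

    def __init__(self, dataset_len, samples_per_epoch=400, seed=42):
        self.n = dataset_len
        self.k = samples_per_epoch
        self.rng = random.Random(seed)

    def __iter__(self):
        return iter(self.rng.randrange(self.n) for _ in range(self.k))

    def __len__(self):
        return self.k
```

With `dataset_len=24` and batch size 4, a DataLoader driven by this sampler produces the ~100 optimizer steps per epoch described above.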

Per-model tuning

What's tuned (per-model, equal tuning budget):

  • Learning rate (via short LR-finder run)
  • Augmentation preset (LIGHT / MEDIUM / HEAVY / EXTREME — see below)
  • Decoder dropout rate (SMP track: 0.0 / 0.2 / 0.3 / 0.5)
  • Loss weighting (BCE/Dice 0.5/0.5 vs 0.3/0.7)
  • Convergence epoch (via early stopping)

What's locked (identical across attempts):

  • Train / val / test split and the patch bytes
  • Metric and evaluation protocol
  • Random seed
  • Optimizer family (AdamW)
  • Oversampling scheme (samples_per_epoch=400)
  • Batch size in effective terms (actual batch may drop with gradient accumulation under memory pressure, but effective batch is constant)

Tuning budget per model: 1 LR-finder run + 3–4 augmentation/dropout sweeps + 1 final long run. Documented in the results table.
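The tuned loss weighting, w * BCE + (1-w) * Dice, can be sketched as follows. This is a numpy sketch on predicted probabilities; the project's `losses.py` would operate on torch tensors (typically logits), so treat it as illustrative.

```python
import numpy as np

def bce_dice_loss(pred, target, w=0.5, eps=1e-7):
    """w * BCE + (1 - w) * Dice on probabilities in (0, 1).
    pred and target have the same shape; target is binary."""
    p = np.clip(pred, eps, 1 - eps)
    bce = -(target * np.log(p) + (1 - target) * np.log(1 - p)).mean()
    inter = (p * target).sum()
    dice = 1 - (2 * inter + eps) / (p.sum() + target.sum() + eps)
    return w * bce + (1 - w) * dice
```

Sweeping w from 0.5 to 0.3 shifts emphasis toward Dice, which directly optimizes overlap and is less dominated by the background class.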


Augmentation presets

Built around the curated pipeline from the reference Keras code, which explicitly excluded GaussianBlur and GaussNoise because "there is not much variety of blur in the given images" — a dataset-specific observation worth respecting.

LIGHT:
  HorizontalFlip, VerticalFlip, RandomRotate90
  (the full D4 dihedral group — free for aerial nadir imagery)

MEDIUM (default / reference recipe):
  LIGHT +
  ShiftScaleRotate(shift=0.2, scale=0.3, rotate=0.4, BORDER_REFLECT, p=0.7)
  RandomBrightnessContrast(0.2, 0.3, p=0.5)
  HueSaturationValue(hue=10, sat=15, val=15, p=0.3)

HEAVY:
  MEDIUM +
  RandomShadow
  RandomGamma
  CLAHE
  mild JPEG compression artifacts

EXTREME:
  HEAVY +
  mild ElasticTransform
  CoarseDropout
  RandomFog

Hypothesis: MEDIUM or HEAVY wins for CNN track; LIGHT or MEDIUM wins for SAM track (frozen ViT encoders tend to prefer milder augmentation).
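The LIGHT preset's flips plus 90° rotations realize all eight symmetries of the D4 dihedral group. A numpy sketch of applying one at random, jointly to image and mask (the project itself uses Albumentations; this just illustrates the geometry):

```python
import numpy as np

def random_d4(image, mask, rng=None):
    """Apply one of the 8 dihedral-group symmetries (rotation by
    0/90/180/270 degrees, optionally followed by a horizontal flip)
    to an HWC image and an HW mask together."""
    if rng is None:
        rng = np.random.default_rng()
    k = rng.integers(4)        # number of quarter turns
    flip = rng.integers(2)     # horizontal flip: yes/no

    def t(a):
        a = np.rot90(a, k, axes=(0, 1))
        return a[:, ::-1] if flip else a

    return t(image), t(mask)
```

These transforms are label-preserving for nadir aerial imagery, since a roof viewed from above has no preferred orientation.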


Evaluation

  • Metric: mean IoU + F1, computed on reconstructed full-resolution predictions (not on patches, since there's exactly one patch per tile).
  • Validation: used continuously during tuning.
  • Test: read exactly once per attempt, after all tuning is finalized. Test numbers are the leaderboard.
  • Qualitative output: prediction overlays on val + test tiles, committed to results/<attempt>/overlays/.
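A minimal numpy sketch of the IoU/F1 computation on thresholded binary predictions (the eps smoothing term is an illustrative choice, not necessarily what `metrics.py` uses):

```python
import numpy as np

def iou_f1(pred, target, eps=1e-7):
    """Binary IoU and F1 (Dice) from thresholded predictions."""
    p, t = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(p, t).sum()
    union = np.logical_or(p, t).sum()
    iou = (inter + eps) / (union + eps)
    f1 = (2 * inter + eps) / (p.sum() + t.sum() + eps)
    return iou, f1
```

With one patch per tile, computing these on the reconstructed prediction and on the patch is equivalent, which is why the mean over tiles is the reported number.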

Reporting

Headline table (one row per attempt)

| # | Model | LR | Aug | Dropout | Loss (BCE/Dice) | Epochs | Val IoU | Val F1 | Test IoU | Test F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ... | | | | | | | | | | |

Both val and test are shown — val for honesty (see the val→test drop), test as the actual reported number.

Data-budget ablation figure

Best 1–2 models retrained on {5, 12, 24} training tiles. Plot: val IoU vs. training tile count, one line per model. This is the project's headline figure — it directly visualizes few-shot behavior.


Phase plan

Phase 0 — Smoke test               (~5 short runs, < 1 h)
  Verify each of the 5 models trains at all under a sensible default recipe.
  Loss decreases, no NaNs, no OOM, val IoU improves above trivial floor.

Phase 1 — Per-model LR calibration (~5 short runs)
  LR-finder on each model.

Phase 2 — Per-model aug/dropout sweep (~15 runs)
  For each model, try 3 augmentation presets × 2 dropout rates at chosen LR.
  Pick best (LR, aug, dropout) triple per model.

Phase 3 — Final training runs     (~5 full runs)
  Each model trained with its tuned recipe to convergence. Read test once.
  Fills the headline table.

Phase 4 — Data-budget ablation    (~6 runs)
  Best 1–2 models on {5, 12, 24} training tiles with tuned recipes.
  Produces the headline figure.

Total: ~35–40 runs. Fits across a few Colab sessions.


Repository layout

image-segmentation/
├── README.md                    — this file
├── .gitignore                   — excludes data/ and runtime artifacts
├── patches/                     — committed dataset-of-record (~15 MB)
│   ├── train/{images,masks}/
│   ├── val/{images,masks}/
│   └── test/{images,masks}/
├── scripts/
│   ├── make_patches.py          — one-time preprocessing (5000² → 1024²)
│   └── eyeball_patches.py       — visual check of all 36 patches
├── src/
│   ├── data.py                  — Dataset + oversampling Sampler
│   ├── augment.py               — Albumentations presets
│   ├── models/
│   │   ├── unet_scratch.py
│   │   ├── smp_wrapper.py       — SMP models with decoder dropout
│   │   └── sam_decoder.py       — SAM frozen encoder + decoders (#4, #5)
│   ├── losses.py                — BCE+Dice with tunable weighting
│   ├── metrics.py               — IoU, F1, full-tile reconstruction
│   ├── train.py                 — shared training loop, callbacks
│   ├── diagnostics.py           — structured per-epoch logging
│   └── config.py                — per-attempt hyperparameter dataclasses
├── tests/                       — local pytest (CPU only)
│   ├── test_dataset.py
│   ├── test_augment.py
│   ├── test_models.py
│   ├── test_losses.py
│   ├── test_metrics.py
│   └── test_inference.py
├── notebooks/
│   ├── 00_smoke_test.ipynb
│   ├── 01_attempt_unet_scratch.ipynb
│   ├── 02_attempt_smp_unet.ipynb
│   ├── 03_attempt_smp_deeplab.ipynb
│   ├── 04_attempt_sam_frozen.ipynb
│   ├── 05_attempt_sam_unet.ipynb
│   └── 99_ablation_data_budget.ipynb
└── results/                     — per-attempt outputs (gitignored)
    └── <attempt>/
        ├── config.yaml
        ├── history.csv
        ├── best.pt
        └── overlays/

Data (data/raw/, source TIFFs) is local-only and git-ignored.


How to run

Local (once, for preprocessing and testing)

# one-time patch generation from local source TIFFs
python scripts/make_patches.py --src data/ --dst patches/

# unit tests (runs on CPU in seconds)
pytest tests/

Colab (for each training attempt)

Each notebook follows the same skeleton:

!git clone https://github.com/<user>/image-segmentation
%cd image-segmentation
!pip install -r requirements.txt

from src.config import ATTEMPT_02
from src.train import train_with_diagnostics

train_with_diagnostics(ATTEMPT_02)

The notebook prints structured per-epoch diagnostics in a format designed to be copy-pasted back into chat for remote debugging:

=== attempt: smp_unet_resnet34  epoch: 5/60  wall: 0:04:12 ===
train: loss=0.214 (n=400 augmented, bs=4, steps=100)
val:   loss=0.287  iou=0.612  f1=0.748  (n=6 tiles)
per_tile_iou:  austin1=0.71  austin6=0.55  austin11=0.68
               austin16=0.63  austin21=0.49  austin26=0.61
worst_tile:    austin21 (iou=0.49)
class_balance: pred_pos=0.18  true_pos=0.21
gpu_mem:       4.2 / 15.0 GB   step_time: 0.42s

Free Colab T4 is sufficient for all attempts at batch 4 (SAM) to 8 (SMP).


Testing strategy

Two-tier approach. Local red-green-refactor for everything that doesn't need GPU; structured observational logging for everything that does.

Tier 1 — local pytest (runs in seconds on any laptop):

  • test_dataset.py — patch loader returns correct shapes and pixel ranges; mask is binary; no NaNs
  • test_augment.py — augmentation pipeline preserves shapes, mask stays binary, handles all-zero and all-positive edge cases
  • test_models.py — each model factory builds, forward pass on (1, 3, 1024, 1024) zero tensor returns (1, 1, 1024, 1024), trainable param count matches expectation (catches "did I actually freeze the SAM encoder?" bugs)
  • test_losses.py — BCE+Dice returns scalar > 0, gradients flow, hand-computed values match on tiny tensors
  • test_metrics.py — IoU on hand-crafted 4×4 masks matches expected fraction
  • test_inference.py — full-tile reconstruction is bit-exact identity when the model is nn.Identity()
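In the same spirit, the reconstruction-identity check can be sketched as a small pytest (pure numpy; `reconstruct` and its signature are illustrative assumptions, not the repo's actual API):

```python
import numpy as np

def reconstruct(patches, positions, tile_shape):
    """Paste patch predictions back onto a full-tile canvas."""
    canvas = np.zeros(tile_shape, dtype=patches[0].dtype)
    for patch, (y, x) in zip(patches, positions):
        h, w = patch.shape
        canvas[y:y + h, x:x + w] = patch
    return canvas

def test_reconstruction_identity():
    # With an identity 'model', pasting a patch back must be bit-exact.
    tile = np.random.default_rng(42).integers(0, 2, (8, 8))
    patch = tile[2:6, 2:6]                 # one 4x4 'prediction'
    out = reconstruct([patch], [(2, 2)], (8, 8))
    assert np.array_equal(out[2:6, 2:6], patch)
```

Bit-exactness matters here: any silent dtype cast or off-by-one in the paste-back would bias every reported metric.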

Smoke-test flag on the training script: train.py --smoke runs 2 epochs on a CPU-tiny subset to exercise the entire pipeline (data → model → loss → optimizer → eval → checkpoint) before touching Colab.

Tier 2 — Colab observational:

Structured per-epoch diagnostics (see "How to run") are the remote debugging interface. The loop is: you paste output to chat, I diagnose overfitting / class imbalance / LR issues from the numbers, you adjust and re-run.


Design notes

Non-obvious choices and the reasoning behind them.

  • Oversampled epochs (samples_per_epoch=400). With 24 training samples at batch 4, a naive epoch is 6 optimizer steps — too small for LR schedules and early stopping to behave sensibly. Oversampling with replacement + fresh augmentation per draw restores normal training dynamics without faking data. Each epoch sees ~100 steps of genuinely distinct augmented views. Pattern borrowed from the reference Keras code.

  • Single resolution, no runtime resizing. Upsampling creates fake detail and can't add information; downsampling loses it. Storing and training at the same resolution (1024²) means the only resampling in the entire pipeline is the one-time preprocessing step. Both tracks eat identical bytes.

  • Per-model tuning, not shared recipe. The research question is "what's each approach capable of," not "which wins with a fixed recipe." Locking LR or augmentation across fundamentally different models (from-scratch CNN vs. frozen pretrained ViT) would penalize whichever disagrees with the chosen value — measuring misconfiguration rather than potential. Equal tuning budget per model, documented in the results table, is the fairness discipline instead.

  • Decoder-only dropout. Dropout inside a pretrained encoder disrupts its BN/feature statistics; dropout in the decoder regularizes the tiny set of newly-trained parameters, which is where overfitting actually happens with 30 samples. Matches the pattern used in the reference SAM decoder code.

  • Test set held back. Per-model tuning against val means val is contaminated by tuning effort by the time the final run happens. Test is read once at the end of each attempt, never influences any decision, and provides the honest leaderboard.

  • Reference Keras code retained (~/Code/python/dida/). Attempts #2 and #4 are PyTorch ports of existing working Keras/PyTorch code. Cross-checking numbers against the reference catches port bugs early.

  • No GaussianBlur / GaussNoise in augmentation. Excluded based on a dataset-specific observation from the reference code: the source imagery has very little blur variation, so these augmentations add unrealistic variance without matching any real-world deployment condition.


Status

Project planning complete; implementation pending. Next concrete steps:

  1. scripts/make_patches.py — one-time preprocessing
  2. src/ skeleton module + tests/ for local TDD
  3. Phase 0 smoke test notebook

About

Few-shot image segmentation comparison.