A scalable, configurable pipeline for cleaning and augmenting vision-language pretraining datasets.
Large-scale image-caption datasets scraped from the web are noisy. Text is often mismatched, duplicated, spammy, or machine-translated garbage. Training a vision-language model on this raw data wastes compute and degrades quality. This toolkit provides a practical pipeline to fix that.
Raw web data → [Quality Filter] → [Deduplication] → [Augmentation] → Clean dataset
Three main stages:
- Quality filtering — CLIP-based image-text relevance scoring + text heuristics (spam patterns, length checks, repeated n-grams)
- Deduplication — perceptual hashing (pHash/dHash) for near-duplicate images + SimHash for near-duplicate captions
- Caption augmentation — template-based paraphrasing and optional bilingual (zh↔en) back-translation
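The text heuristics in the first stage can be sketched roughly as follows. This is a minimal illustration, not the toolkit's implementation: the function names, the spam patterns, and the repetition threshold are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical spam patterns; the toolkit's real list is not shown in this README.
SPAM_PATTERNS = [
    re.compile(r"https?://", re.IGNORECASE),
    re.compile(r"click here|buy now|free shipping", re.IGNORECASE),
]


def repeated_ngram_ratio(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that repeat an n-gram seen earlier in the text."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(ngrams)


def passes_text_filter(caption: str, min_text_len: int = 10,
                       max_repeat_ratio: float = 0.3) -> bool:
    """Return True if the caption survives length, spam, and repetition checks."""
    caption = caption.strip()
    if len(caption) < min_text_len:
        return False
    if any(p.search(caption) for p in SPAM_PATTERNS):
        return False
    return repeated_ngram_ratio(caption) <= max_repeat_ratio
```

A caption such as "best price best price best price best price best price" is rejected by the repetition check even though it matches none of the spam patterns.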
git clone https://github.com/falenai/vl-data-engine.git
cd vl-data-engine
pip install -r requirements.txt
pip install -e .

python scripts/run_pipeline.py \
--config configs/default.yaml \
--input data/raw.jsonl \
--output output/clean/

python scripts/run_filter.py \
--input data/raw.jsonl \
--image-root data/images/ \
--output data/filtered.jsonl \
--min-clip-score 0.25

python scripts/run_dedup.py \
--input data/filtered.jsonl \
--image-root data/images/ \
--output data/deduped.jsonl \
--image-method phash \
--image-threshold 5

Everything is controlled via YAML. Copy configs/default.yaml and adjust:
filter:
min_clip_score: 0.22 # CLIP cosine similarity threshold
min_text_len: 10 # Minimum caption character length
max_aspect_ratio: 3.0 # Max image aspect ratio
use_aesthetic_filter: false # Aesthetic score (disabled by default)
run_dedup: true
dedup_image_method: "phash" # or "dhash"
dedup_image_threshold: 5 # Hamming distance threshold
run_augmentation: false # Enable caption augmentation

Dataset-specific configs are in configs/. Currently: default.yaml, cc3m.yaml.
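To make the Hamming distance threshold concrete, here is a small sketch of how a threshold of 5 could separate near-duplicates from distinct images. It assumes hashes are 64-bit integers and uses a simple quadratic greedy scan; the function names, the boundary handling (a distance equal to the threshold counts as a duplicate), and the hash values are all made up for illustration and may differ from what the toolkit does.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two integer-encoded hashes."""
    return bin(a ^ b).count("1")


def dedup_by_hash(hashes: dict, threshold: int = 5) -> list:
    """Greedy near-duplicate removal.

    Keeps an item only if it is more than `threshold` bits away from every
    item kept so far. This is O(n^2) and meant for illustration only.
    """
    kept = []
    for key, h in hashes.items():
        if all(hamming_distance(h, hashes[k]) > threshold for k in kept):
            kept.append(key)
    return kept


# Made-up 64-bit hashes for illustration.
example = {
    "a.jpg": 0xFFFF0000FFFF0000,
    "a_resized.jpg": 0xFFFF0000FFFF0003,  # 2 bits away from a.jpg
    "b.jpg": 0x0123456789ABCDEF,          # 40 bits away from a.jpg
}
print(dedup_by_hash(example))  # ['a.jpg', 'b.jpg']
```

Raising the threshold removes more images at the cost of more false positives; lowering it keeps more borderline pairs.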
On a sample of CC3M (1M pairs, single A100 GPU):
| Stage | Throughput | Retention |
|---|---|---|
| Text filter | ~80k pairs/s | ~82% |
| CLIP filter (ViT-B/32) | ~4k pairs/s | ~68% |
| Deduplication | ~12k pairs/s | ~91% |
Full pipeline end-to-end: ~2.5k pairs/s on a single GPU.
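As a rough planning aid, the end-to-end figure can be turned into a wall-clock estimate. The helper below is hypothetical and assumes throughput stays constant at the ~2.5k pairs/s reported above, which may not hold on other hardware or datasets.

```python
def estimated_hours(num_pairs: int, pairs_per_second: float = 2500.0) -> float:
    """Wall-clock hours to process num_pairs at a constant throughput."""
    return num_pairs / pairs_per_second / 3600.0


for n in (1_000_000, 10_000_000):
    print(f"{n:>12,} pairs: ~{estimated_hours(n):.2f} h")
```

At the reported rate, 1M pairs take about 400 seconds and 10M pairs a little over an hour.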
The pipeline expects JSONL files, one record per line:
{"image": "relative/path/to/image.jpg", "caption": "A dog playing fetch."}
{"image": "another/image.jpg", "caption": "Mountain landscape at sunset.", "score": 0.9}

For billion-scale datasets, the in-memory deduplication hashes will exceed RAM. Shard the dataset and run dedup per shard; a cross-shard merge step is planned for v0.4. For now, the pipeline is validated on datasets of up to ~10M pairs on a single machine with 64 GB RAM.
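One way to prepare for per-shard dedup is to split the JSONL input into shards by a stable hash of the image path. The sketch below is an assumption about how this could be done, not part of the toolkit: `shard_index`, `shard_jsonl`, the shard file naming, and the choice of MD5 as the bucketing hash are all hypothetical. Near-duplicate images stored under different paths can still land in different shards, so this does not replace a cross-shard merge.

```python
import hashlib
import json
from pathlib import Path


def shard_index(image_path: str, num_shards: int) -> int:
    """Map an image path to a shard using a stable (non-randomized) hash."""
    digest = hashlib.md5(image_path.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def shard_jsonl(input_path: str, output_dir: str, num_shards: int) -> list:
    """Stream records from input_path into num_shards JSONL files.

    Returns the number of records written to each shard. Blank lines and
    records without both "image" and "caption" fields are skipped.
    """
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    counts = [0] * num_shards
    files = [open(out_dir / f"shard-{i:05d}.jsonl", "w", encoding="utf-8")
             for i in range(num_shards)]
    try:
        with open(input_path, encoding="utf-8") as src:
            for line in src:
                line = line.strip()
                if not line:
                    continue
                record = json.loads(line)
                if "image" not in record or "caption" not in record:
                    continue
                idx = shard_index(record["image"], num_shards)
                files[idx].write(json.dumps(record, ensure_ascii=False) + "\n")
                counts[idx] += 1
    finally:
        for f in files:
            f.close()
    return counts
```

Each resulting shard file keeps the one-record-per-line format, so it can be passed to the dedup script in place of the full input.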
- Text heuristic filters
- CLIP-based relevance scoring
- Perceptual hash deduplication
- SimHash text deduplication
- Template-based caption augmentation
- Back-translation (zh↔en)
- WebDataset native output
- Distributed processing with Ray
- Aesthetic score predictor integration (LAION aesthetic model)
- GPU-accelerated pHash
MIT
Part of ongoing research at ZJU CCNT Lab.
make test # run tests
make filter # run filter pipeline

Q: Can I use this with LAION-5B?
A: Yes, but you'll need to shard the dataset first; see Notes on Scale.
Raw Data → TextFilter → CLIPFilter → ImageFilter → DedupFilter → Augmentor → Clean Data
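The stage chain above can be pictured as a sequence of streaming stages, each consuming and yielding records. The following is an illustrative sketch only: `text_filter`, `exact_caption_dedup`, and `run_pipeline` are hypothetical stand-ins rather than the toolkit's classes, and the dedup stand-in matches exact captions instead of using perceptual hashes or SimHash.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]
Stage = Callable[[Iterable[Record]], Iterator[Record]]


def text_filter(records: Iterable[Record]) -> Iterator[Record]:
    """Drop records whose caption is shorter than 10 characters."""
    for r in records:
        if len(str(r.get("caption", ""))) >= 10:
            yield r


def exact_caption_dedup(records: Iterable[Record]) -> Iterator[Record]:
    """Stand-in for DedupFilter: drops case-insensitive exact repeats only."""
    seen = set()
    for r in records:
        key = str(r.get("caption", "")).lower()
        if key not in seen:
            seen.add(key)
            yield r


def run_pipeline(records: Iterable[Record], stages: List[Stage]) -> Iterator[Record]:
    """Chain stages lazily so records stream through one at a time."""
    stream: Iterable[Record] = records
    for stage in stages:
        stream = stage(stream)
    return iter(stream)


raw = [
    {"image": "a.jpg", "caption": "A dog playing fetch."},
    {"image": "b.jpg", "caption": "dog"},
    {"image": "c.jpg", "caption": "a dog playing fetch."},
]
clean = list(run_pipeline(raw, [text_filter, exact_caption_dedup]))
print([r["image"] for r in clean])  # ['a.jpg']
```

Placing the cheap text filter first mirrors the order in the diagram: inexpensive checks discard records before the slower stages see them.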