VL-Data-Engine

A scalable, configurable pipeline for cleaning and augmenting vision-language pretraining datasets.

Python 3.8+ | License: MIT | PyTorch

The Problem

Large-scale image-caption datasets scraped from the web are noisy. Text is often mismatched, duplicated, spammy, or machine-translated garbage. Training a vision-language model on this raw data wastes compute and degrades quality. This toolkit provides a practical pipeline to fix that.

What It Does

Raw web data → [Quality Filter] → [Deduplication] → [Augmentation] → Clean dataset

Three main stages:

  1. Quality filtering — CLIP-based image-text relevance scoring + text heuristics (spam patterns, length checks, repeated n-grams)
  2. Deduplication — perceptual hashing (pHash/dHash) for near-duplicate images + SimHash for near-duplicate captions
  3. Caption augmentation — template-based paraphrasing and optional bilingual (zh↔en) back-translation
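Of the techniques listed above, SimHash is probably the least familiar. The following is a minimal sketch of the general idea, not this repository's implementation: the tokenization, the 3-word shingles, the 64-bit width, and the function names `simhash` and `hamming` are all assumptions made for illustration.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64, ngram: int = 3) -> int:
    """Compute a SimHash fingerprint from word n-grams (illustrative sketch)."""
    tokens = re.findall(r"\w+", text.lower())
    # Use shorter shingles when the caption has fewer than `ngram` tokens.
    n = min(ngram, max(len(tokens), 1))
    shingles = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    counts = [0] * bits
    for shingle in shingles:
        digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        # Each shingle votes +1 or -1 on every bit position.
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1

    # A bit is set in the fingerprint when the votes for it are positive.
    return sum(1 << b for b in range(bits) if counts[b] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Captions whose fingerprints differ in only a few bits would be treated as near-duplicates; the cutoff is a tuning choice and is not specified here.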

Installation

git clone https://github.com/falenai/vl-data-engine.git
cd vl-data-engine
pip install -r requirements.txt
pip install -e .

Quick Start

Full pipeline

python scripts/run_pipeline.py \
    --config configs/default.yaml \
    --input data/raw.jsonl \
    --output output/clean/

Just filter

python scripts/run_filter.py \
    --input data/raw.jsonl \
    --image-root data/images/ \
    --output data/filtered.jsonl \
    --min-clip-score 0.25
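For intuition about what `--min-clip-score` controls, here is a minimal sketch of cosine-similarity thresholding. It assumes the image and text embeddings have already been produced by a CLIP model and are passed in as plain lists of floats; `cosine_similarity` and `keep_pair` are hypothetical names, and whether the real filter uses `>=` or `>` at the boundary is an assumption.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Treat a zero vector as having no similarity to anything.
    return dot / norm if norm else 0.0


def keep_pair(image_emb: Sequence[float], text_emb: Sequence[float],
              min_clip_score: float = 0.25) -> bool:
    """Keep the pair only if its similarity reaches the threshold (assumed inclusive)."""
    return cosine_similarity(image_emb, text_emb) >= min_clip_score
```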

Just deduplicate

python scripts/run_dedup.py \
    --input data/filtered.jsonl \
    --image-root data/images/ \
    --output data/deduped.jsonl \
    --image-method phash \
    --image-threshold 5
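To illustrate what a Hamming-distance threshold on image hashes means, here is a minimal dHash sketch. It assumes the image has already been converted to grayscale and resized to 8 rows by 9 columns (for example with an imaging library), and is passed in as a nested list of pixel values. The function names and the inclusive `<=` comparison are assumptions, not the repository's code.

```python
from typing import Sequence


def dhash(gray: Sequence[Sequence[int]]) -> int:
    """Difference hash: one bit per horizontally adjacent pixel pair.

    An 8x9 grid yields 8 * 8 = 64 bits.
    """
    bits = 0
    for row in gray:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def is_near_duplicate(h1: int, h2: int, threshold: int = 5) -> bool:
    """Treat two images as near-duplicates when few hash bits differ."""
    return hamming(h1, h2) <= threshold
```

A lower threshold is stricter (fewer images are merged); a higher one catches more aggressive crops and re-encodes at the cost of more false positives.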

Configuration

Everything is controlled via YAML. Copy configs/default.yaml and adjust:

filter:
  min_clip_score: 0.22       # CLIP cosine similarity threshold
  min_text_len: 10           # Minimum caption character length
  max_aspect_ratio: 3.0      # Max image aspect ratio
  use_aesthetic_filter: false # Aesthetic score (disabled by default)

run_dedup: true
dedup_image_method: "phash"  # or "dhash"
dedup_image_threshold: 5     # Hamming distance threshold

run_augmentation: false      # Enable caption augmentation

Dataset-specific configs are in configs/. Currently: default.yaml, cc3m.yaml.
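To make the `filter` thresholds concrete, the sketch below shows how `min_text_len` might combine with the spam-pattern and repeated n-gram checks mentioned under What It Does. The pattern list, the `max_repeat_ratio` parameter, and the function names are hypothetical; the actual heuristics in this repository may differ.

```python
import re
from collections import Counter

# Hypothetical spam patterns, for illustration only.
SPAM_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r"https?://", r"click here", r"buy now")
]


def repeated_ngram_ratio(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that occur more than once in the text."""
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)


def passes_text_filter(caption: str, min_text_len: int = 10,
                       max_repeat_ratio: float = 0.3) -> bool:
    """Apply length, spam-pattern, and repetition checks in that order."""
    caption = caption.strip()
    if len(caption) < min_text_len:
        return False
    if any(p.search(caption) for p in SPAM_PATTERNS):
        return False
    return repeated_ngram_ratio(caption) <= max_repeat_ratio
```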

Benchmarks

On a sample of CC3M (1M pairs, single A100 GPU):

| Stage                  | Throughput   | Retention |
|------------------------|--------------|-----------|
| Text filter            | ~80k pairs/s | ~82%      |
| CLIP filter (ViT-B/32) | ~4k pairs/s  | ~68%      |
| Deduplication          | ~12k pairs/s | ~91%      |

Full pipeline end-to-end: ~2.5k pairs/s on a single GPU.

Dataset Input Format

The pipeline expects JSONL files, one record per line:

{"image": "relative/path/to/image.jpg", "caption": "A dog playing fetch."}
{"image": "another/image.jpg", "caption": "Mountain landscape at sunset.", "score": 0.9}
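A minimal sketch of reading this format is shown below. It assumes that `image` and `caption` are the required string fields and that any extra fields (such as `score`) are simply passed through; `iter_records` is a hypothetical helper, and silently skipping malformed lines is a design choice made here for brevity.

```python
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield valid records from JSONL lines, skipping blank or malformed ones."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not valid JSON; skip the line
        if (
            isinstance(record, dict)
            and isinstance(record.get("image"), str)
            and isinstance(record.get("caption"), str)
        ):
            yield record
```

Because an open file iterates line by line, the same function works on a file handle, e.g. `with open("data/raw.jsonl", encoding="utf-8") as f: records = list(iter_records(f))`.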

Notes on Scale

For billion-scale datasets, the in-memory deduplication hashes will exceed RAM. Until the cross-shard merge step lands (planned for v0.4), consider sharding the dataset and running dedup per shard; note that near-duplicates that fall in different shards will not be caught without that merge. For now, the pipeline is validated on datasets up to ~10M pairs on a single machine with 64 GB RAM.
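One possible way to find near-duplicates across shards, offered here only as a sketch and not as the design planned for v0.4, is banding. Split each 64-bit hash into `max_dist + 1` bands; by the pigeonhole principle, two hashes within `max_dist` bits of each other must agree exactly on at least one band, so band values can serve as bucket (or shard) keys. All names below are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple


def band_keys(h: int, bits: int = 64, max_dist: int = 5) -> List[Tuple[int, int]]:
    """Split a hash into max_dist + 1 contiguous bands of (index, value)."""
    n_bands = max_dist + 1
    base, extra = divmod(bits, n_bands)
    keys, shift = [], 0
    for i in range(n_bands):
        width = base + (1 if i < extra else 0)  # spread leftover bits over the first bands
        keys.append((i, (h >> shift) & ((1 << width) - 1)))
        shift += width
    return keys


def near_duplicate_pairs(hashes: Dict[str, int], max_dist: int = 5,
                         bits: int = 64) -> Set[Tuple[str, str]]:
    """Return ID pairs whose hashes differ by at most max_dist bits."""
    buckets = defaultdict(list)
    for item_id, h in hashes.items():
        for key in band_keys(h, bits, max_dist):
            buckets[key].append(item_id)

    pairs = set()
    for ids in buckets.values():
        # Only items sharing a band are compared, then verified exactly.
        for a, b in combinations(sorted(ids), 2):
            if bin(hashes[a] ^ hashes[b]).count("1") <= max_dist:
                pairs.add((a, b))
    return pairs
```

In a sharded setting, each `(band index, band value)` key could be routed to one worker, so that only items sharing a band are ever compared. The trade-off is that every hash is stored `max_dist + 1` times.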

Roadmap

  • Text heuristic filters
  • CLIP-based relevance scoring
  • Perceptual hash deduplication
  • SimHash text deduplication
  • Template-based caption augmentation
  • Back-translation (zh↔en)
  • WebDataset native output
  • Distributed processing with Ray
  • Aesthetic score predictor integration (LAION aesthetic model)
  • GPU-accelerated pHash

License

MIT


Part of ongoing research at ZJU CCNT Lab.

Development

make test    # run tests
make filter  # run the filter pipeline

FAQ

Q: Can I use this with LAION-5B?
A: Yes, but you'll need to shard the dataset first; see Notes on Scale.

Architecture

Raw Data → TextFilter → CLIPFilter → ImageFilter → DedupFilter → Augmentor → Clean Data
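The chain above can be pictured as a sequence of stages that each consume and yield records. The sketch below is an illustration of that composition pattern only: `text_filter` and `exact_caption_dedup` are simplified stand-ins, not the repository's `TextFilter` or `DedupFilter`, and `run_pipeline` is a hypothetical helper.

```python
from typing import Callable, Iterable, Iterator, List

Record = dict
Stage = Callable[[Iterable[Record]], Iterator[Record]]


def text_filter(records: Iterable[Record], min_text_len: int = 10) -> Iterator[Record]:
    """Stand-in stage: drop records whose caption is too short."""
    for r in records:
        if len(r["caption"].strip()) >= min_text_len:
            yield r


def exact_caption_dedup(records: Iterable[Record]) -> Iterator[Record]:
    """Stand-in stage: keep only the first record for each normalized caption."""
    seen = set()
    for r in records:
        key = r["caption"].strip().lower()
        if key not in seen:
            seen.add(key)
            yield r


def run_pipeline(records: Iterable[Record], stages: List[Stage]) -> List[Record]:
    """Chain the stages lazily, then materialize the surviving records."""
    for stage in stages:
        records = stage(records)
    return list(records)
```

Because each stage is a generator, records stream through the chain one at a time, and cheap filters placed early reduce the work left for expensive ones.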
