VL-Data-Engine

A scalable, configurable pipeline for cleaning and augmenting vision-language pretraining datasets.

The Problem

Large-scale image-caption datasets scraped from the web are noisy. Text is often mismatched, duplicated, spammy, or machine-translated garbage. Training a vision-language model on this raw data wastes compute and degrades quality. This toolkit provides a practical pipeline to fix that.

What It Does

Raw web data → [Quality Filter] → [Deduplication] → [Augmentation] → Clean dataset

Three main stages:

Quality filtering — CLIP-based image-text relevance scoring + text heuristics (spam patterns, length checks, repeated n-grams)
Deduplication — perceptual hashing (pHash/dHash) for near-duplicate images + SimHash for near-duplicate captions
Caption augmentation — template-based paraphrasing and optional bilingual (zh↔en) back-translation
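
The text heuristics named in the quality-filtering stage can be sketched in plain Python. This is an illustrative sketch, not the toolkit's own code: the function names, the spam patterns, and the max_repeat_ratio cutoff are all assumptions; only the 10-character minimum length comes from the default config.

```python
import re
from collections import Counter

# Hypothetical spam patterns, chosen for illustration only.
SPAM_PATTERNS = [
    re.compile(r"https?://", re.IGNORECASE),
    re.compile(r"\b(click here|buy now|free shipping)\b", re.IGNORECASE),
]


def repeated_ngram_ratio(text, n=3):
    """Fraction of word n-grams that repeat an n-gram seen earlier in the text."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(ngrams)


def passes_text_heuristics(caption, min_text_len=10, max_repeat_ratio=0.3):
    """Apply a length check, spam patterns, and a repeated n-gram check."""
    caption = caption.strip()
    if len(caption) < min_text_len:
        return False
    if any(pattern.search(caption) for pattern in SPAM_PATTERNS):
        return False
    # max_repeat_ratio is an assumed cutoff, not a documented default.
    return repeated_ngram_ratio(caption) <= max_repeat_ratio
```

Checks like these need no model inference, which is why a text stage can run much faster than a CLIP-based one.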

Installation

git clone https://github.com/falenai/vl-data-engine.git
cd vl-data-engine
pip install -r requirements.txt
pip install -e .

Quick Start

Full pipeline

python scripts/run_pipeline.py \
    --config configs/default.yaml \
    --input data/raw.jsonl \
    --output output/clean/

Just filter

python scripts/run_filter.py \
    --input data/raw.jsonl \
    --image-root data/images/ \
    --output data/filtered.jsonl \
    --min-clip-score 0.25
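
The config describes the CLIP score as a cosine similarity threshold. The sketch below shows what such a threshold test looks like on two embedding vectors. It uses toy vectors instead of real CLIP embeddings, and the function names are hypothetical, not the toolkit's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def keep_pair(image_emb, text_emb, min_clip_score=0.25):
    """Keep the pair when the similarity reaches the threshold (assumed inclusive)."""
    return cosine_similarity(image_emb, text_emb) >= min_clip_score


# Toy embeddings standing in for CLIP image and text features.
print(keep_pair([1.0, 0.0], [1.0, 1.0]))  # similarity about 0.71
print(keep_pair([1.0, 0.0], [0.1, 1.0]))  # similarity about 0.10
```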

Just deduplicate

python scripts/run_dedup.py \
    --input data/filtered.jsonl \
    --image-root data/images/ \
    --output data/deduped.jsonl \
    --image-method phash \
    --image-threshold 5
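
To show what a Hamming distance threshold means for image hashes, here is a minimal dHash sketch. It assumes the image has already been converted to grayscale and resized to 8 rows of 9 pixels (that step is omitted), and it treats the threshold as inclusive. Both are assumptions for illustration, not the toolkit's documented behaviour.

```python
def dhash_bits(gray_rows):
    """Difference hash: one bit per horizontally adjacent pixel pair.

    gray_rows is a list of rows of grayscale values. With 8 rows of
    9 pixels each, the result is a 64-bit integer.
    """
    bits = 0
    for row in gray_rows:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Number of bit positions in which the two hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


def is_near_duplicate(hash_a, hash_b, threshold=5):
    # Assumes "within threshold" is inclusive.
    return hamming_distance(hash_a, hash_b) <= threshold


# Two tiny synthetic "images": identical except for one pixel.
image_a = [list(range(9, 0, -1)) for _ in range(8)]
image_b = [row[:] for row in image_a]
image_b[0][0] = 0

print(hamming_distance(dhash_bits(image_a), dhash_bits(image_b)))
```

A lower threshold keeps more images and misses more near-duplicates; a higher one removes more but risks merging distinct images.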

Configuration

Everything is controlled via YAML. Copy configs/default.yaml and adjust:

filter:
  min_clip_score: 0.22       # CLIP cosine similarity threshold
  min_text_len: 10           # Minimum caption character length
  max_aspect_ratio: 3.0      # Max image aspect ratio
  use_aesthetic_filter: false # Aesthetic score (disabled by default)

run_dedup: true
dedup_image_method: "phash"  # or "dhash"
dedup_image_threshold: 5     # Hamming distance threshold

run_augmentation: false      # Enable caption augmentation

Dataset-specific configs are in configs/. Currently: default.yaml, cc3m.yaml.

Benchmarks

On a sample of CC3M (1M pairs, single A100 GPU):

Stage                    Throughput      Retention
Text filter              ~80k pairs/s    ~82%
CLIP filter (ViT-B/32)   ~4k pairs/s     ~68%
Deduplication            ~12k pairs/s    ~91%

Full pipeline end-to-end: ~2.5k pairs/s on a single GPU.
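
For planning, the end-to-end figure gives a rough wall-clock estimate. The helper below is a back-of-the-envelope sketch that assumes throughput stays constant as the dataset grows, which may not hold in practice. The function names are illustrative.

```python
def estimated_seconds(num_pairs, pairs_per_second):
    """Linear estimate of processing time at a fixed throughput."""
    if pairs_per_second <= 0:
        raise ValueError("pairs_per_second must be positive")
    return num_pairs / pairs_per_second


def format_duration(seconds):
    """Render a duration in seconds as hours, minutes, and seconds."""
    minutes, secs = divmod(int(round(seconds)), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


# ~2.5k pairs/s end-to-end, as reported above.
print(format_duration(estimated_seconds(1_000_000, 2500)))   # 0h 06m 40s
print(format_duration(estimated_seconds(10_000_000, 2500)))  # 1h 06m 40s
```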

Dataset Input Format

The pipeline expects JSONL files, one record per line:

{"image": "relative/path/to/image.jpg", "caption": "A dog playing fetch."}
{"image": "another/image.jpg", "caption": "Mountain landscape at sunset.", "score": 0.9}
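
Before a long run it can help to validate the input. The reader below is a small sketch that assumes "image" and "caption" are required and that "score" is optional, which is inferred from the two example records and not from a documented schema. The name iter_records is hypothetical.

```python
import json

REQUIRED_FIELDS = ("image", "caption")  # assumed from the examples above


def iter_records(lines):
    """Yield one dict per non-empty JSONL line, checking required fields."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing field(s) {missing}")
        yield record


sample = [
    '{"image": "relative/path/to/image.jpg", "caption": "A dog playing fetch."}',
    '{"image": "another/image.jpg", "caption": "Mountain landscape at sunset.", "score": 0.9}',
]
for record in iter_records(sample):
    print(record["image"], record.get("score"))
```

Passing an open file object works the same way, since iterating over a text file yields its lines.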

Notes on Scale

For billion-scale datasets, the in-memory deduplication hashes will exceed RAM. Shard the dataset and run deduplication per shard. A cross-shard merge step is planned for v0.4; until then, duplicates that land in different shards are not removed. The pipeline is currently validated on datasets up to ~10M pairs on a single machine with 64GB RAM.
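
One way to shard is to assign each record to a shard by a stable hash of its image path. The sketch below is an assumption about how sharding could be done, not a feature of the toolkit, and the names shard_index and partition are hypothetical. It uses hashlib, not Python's built-in hash(), so the assignment stays the same across runs and machines.

```python
import hashlib


def shard_index(key, num_shards):
    """Map a string key to a shard number in [0, num_shards)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition(records, num_shards):
    """Group records into num_shards lists keyed on the image path."""
    shards = [[] for _ in range(num_shards)]
    for record in records:
        shards[shard_index(record["image"], num_shards)].append(record)
    return shards


records = [{"image": f"img/{i}.jpg", "caption": f"caption {i}"} for i in range(1000)]
shards = partition(records, 4)
print([len(shard) for shard in shards])
```

Sharding by path only balances the load. Near-duplicate images with different paths can still end up in different shards.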

Roadmap

License

MIT

Part of ongoing research at ZJU CCNT Lab.

Development

make test   # run tests
make filter  # run filter pipeline

FAQ

Q: Can I use this with LAION-5B?
A: Yes, but you'll need to shard the dataset first; see Notes on Scale.

Architecture

Raw Data → TextFilter → CLIPFilter → ImageFilter → DedupFilter → Augmentor → Clean Data
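
The chain above can be read as a sequence of stages, each consuming and producing a stream of records. The sketch below shows that composition pattern with two simplified stand-in stages. The real TextFilter, CLIPFilter, and other classes may expose a different interface, and every name here is hypothetical.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]
Stage = Callable[[Iterable[Record]], Iterator[Record]]


def text_filter(records: Iterable[Record]) -> Iterator[Record]:
    """Stand-in for a text stage: drop captions shorter than 10 characters."""
    for record in records:
        if len(str(record["caption"])) >= 10:
            yield record


def exact_caption_dedup(records: Iterable[Record]) -> Iterator[Record]:
    """Stand-in for a dedup stage: keep the first record for each caption."""
    seen = set()
    for record in records:
        key = str(record["caption"]).lower()
        if key not in seen:
            seen.add(key)
            yield record


def run_pipeline(records: Iterable[Record], stages: List[Stage]) -> List[Record]:
    """Feed the output of each stage into the next, in order."""
    stream: Iterable[Record] = records
    for stage in stages:
        stream = stage(stream)
    return list(stream)


raw = [
    {"image": "a.jpg", "caption": "A dog playing fetch."},
    {"image": "b.jpg", "caption": "a dog playing fetch."},
    {"image": "c.jpg", "caption": "Nice"},
]
clean = run_pipeline(raw, [text_filter, exact_caption_dedup])
print([record["image"] for record in clean])  # ['a.jpg']
```

Because the stages are generators, records stream through one at a time. Putting the cheaper stages first means the expensive ones see fewer records.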
