Anyone who has tried to build a vision-language pretraining dataset from scratch
knows the routine. You start with a list of a few hundred million image URLs.
A third of them are dead by the time you crawl. Another big chunk decode into
corrupt JPEGs, 1x1 tracking pixels, or 50MB PNGs of a logo. The captions are
sometimes great, sometimes machine-translated word salad, sometimes literally
the string "image". By the time you have something trainable, you have spent
weeks shuffling shards across disks and writing one-off scripts that nobody
remembers how to run.
I built VL-Data-Engine because I got tired of rewriting the same five filtering scripts every quarter. The goal is boring on purpose: a small, composable set of stages that take a URL list (or parquet of image-text pairs) and produce sharded WebDataset tars suitable for CLIP / BLIP / SigLIP-style training. No distributed magic if you don't need it. Run it on a laptop on 100k pairs, or hand it to Ray and chew through 500M on a cluster.
The pipeline is a list of Stage objects. Each stage reads a stream of samples
(dicts with at least a url and a text key), does its thing, and emits the
survivors downstream. Stages are independent: drop one, reorder, swap in your
own. The orchestrator handles batching, retries, and counting drops per stage
so you can see exactly where samples die.
from vlde import Pipeline, Config
from vlde.stages import (
    Download, Decode, Dedup, CaptionQuality, CLIPFilter, Resize, ShardWriter,
)

cfg = Config.from_yaml("configs/cc3m_like.yaml")
pipe = Pipeline([
    Download(timeout=10, max_retries=2),
    Decode(min_size=128, max_size=4096),
    Dedup(hash_size=8),
    CaptionQuality(min_words=3, max_words=80, langs={"en"}),
    CLIPFilter(model="ViT-B-32", threshold=0.28),
    Resize(size=256),
    ShardWriter(out_dir="shards/", samples_per_shard=10_000),
], config=cfg)

stats = pipe.run("data/urls.parquet")
print(stats.report())

- Download -- async-ish HTTP fetcher with domain blocklist and a robots.txt cache. Skips anything >max_bytes.
- Decode -- PIL decode plus a size sanity check; rejects 1x1 spacer GIFs and truncated JPEGs.
- Dedup -- perceptual-hash (pHash) dedup, in-memory bloom-filter mode or sharded on-disk mode for large runs.
- NSFWFilter -- pluggable classifier interface; ships with a stub. Drop in your own NSFW model if you have one.
- CLIPFilter -- image-text cosine similarity via open_clip. Below threshold is dropped.
- CaptionQuality -- length, language, punctuation/repetition heuristics.
- Resize -- center-crop + resize + re-encode (JPEG quality configurable).
- ShardWriter -- writes WebDataset tar shards with deterministic keys.
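
If you want to swap in your own stage, it helps to see the contract in miniature. The sketch below is not VL-Data-Engine's code: the `Stage` base class, its `keep` method, `run_pipeline`, and the `max_repeat_ratio` threshold are all hypothetical names and values picked for illustration, and it works on a plain list rather than a batched stream. It only shows the idea described above: each stage decides keep-or-drop per sample, and the orchestrator counts drops per stage.

```python
from collections import Counter


class Stage:
    """Hypothetical stage contract: return True to keep a sample, False to drop it."""

    name = "stage"

    def keep(self, sample: dict) -> bool:
        raise NotImplementedError


class HasUrl(Stage):
    """Trivial stage: drop samples with a missing or empty url."""

    name = "has_url"

    def keep(self, sample: dict) -> bool:
        return bool(sample.get("url"))


class CaptionQualitySketch(Stage):
    """Length and repetition heuristics only; language ID is left out."""

    name = "caption_quality"

    def __init__(self, min_words=3, max_words=80, max_repeat_ratio=0.5):
        self.min_words = min_words
        self.max_words = max_words
        # Assumed threshold, not taken from VL-Data-Engine.
        self.max_repeat_ratio = max_repeat_ratio

    def keep(self, sample: dict) -> bool:
        words = sample.get("text", "").lower().split()
        if not words or not (self.min_words <= len(words) <= self.max_words):
            return False
        # Share of the caption taken up by its single most common word.
        top_count = Counter(words).most_common(1)[0][1]
        return top_count / len(words) <= self.max_repeat_ratio


def run_pipeline(stages, samples):
    """Push each sample through the stages in order, counting drops per stage."""
    drops = {stage.name: 0 for stage in stages}
    survivors = []
    for sample in samples:
        for stage in stages:
            if not stage.keep(sample):
                drops[stage.name] += 1
                break  # a dropped sample never reaches later stages
        else:
            survivors.append(sample)
    return survivors, drops


samples = [
    {"url": "http://example.com/a.jpg", "text": "a dog running on the beach"},
    {"url": "http://example.com/b.jpg", "text": "image"},
    {"url": "http://example.com/c.jpg", "text": "buy buy buy buy now"},
    {"url": "", "text": "a red bicycle leaning against a wall"},
]
survivors, drops = run_pipeline([HasUrl(), CaptionQualitySketch()], samples)
print(drops)
```

Because a dropped sample is charged to the first stage that rejects it, reordering stages changes the per-stage counts even when the final survivors stay the same. Keep that in mind when reading a drop report.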
pip install -e .
pip install -e ".[clip,ray,lang]" # full install
You will want [clip] if you actually want the CLIP filter to do anything.
Without it the stage is a no-op pass-through.
YAML, dataclass-backed. Unknown keys raise.
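
For a feel of what "unknown keys raise" can look like with dataclasses, here is a minimal sketch. `DownloadConfig` and `build_strict` are hypothetical names, and raising `KeyError` is an assumption; the real loader may be structured differently. The literal dicts stand in for a mapping parsed from YAML.

```python
from dataclasses import dataclass, fields


@dataclass
class DownloadConfig:
    # Field names mirror the download section of the example config.
    timeout: int = 10
    max_bytes: int = 10_000_000


def build_strict(cls, raw: dict):
    """Instantiate dataclass `cls` from `raw`, rejecting keys it does not declare."""
    allowed = {f.name for f in fields(cls)}
    unknown = sorted(set(raw) - allowed)
    if unknown:
        raise KeyError(f"unknown {cls.__name__} keys: {unknown}")
    return cls(**raw)


download = build_strict(DownloadConfig, {"timeout": 5})
print(download)

try:
    build_strict(DownloadConfig, {"timout": 5})  # typo on purpose
except KeyError as exc:
    print("rejected:", exc)
```

Failing on a misspelled key up front is cheaper than finding out mid-crawl that a timeout silently fell back to its default.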
# configs/cc3m_like.yaml
input: data/cc3m.parquet
output: shards/cc3m/
num_workers: 16
backend: multiprocessing # or "ray"
download:
  timeout: 10
  max_bytes: 10_000_000
  blocklist: configs/blocklist.txt
clip:
  model: ViT-B-32
  pretrained: openai
  threshold: 0.28
  batch_size: 256
shard:
  samples_per_shard: 10000
  compress: false

- Core stage interface + multiprocessing backend
- pHash dedup
- CLIP filter with open_clip
- WebDataset writer
- Ray backend
- YAML config with strict validation
- Streaming dedup across shards (not just within batch)
- Native fasttext lang-id integration (currently optional)
- HuggingFace datasets push helper
- Resumable runs (checkpoint per shard)
BSD-3-Clause. See LICENSE.