VL-Data-Engine

The Problem

Anyone who has tried to build a vision-language pretraining dataset from scratch knows the routine. You start with a list of a few hundred million image URLs. A third of them are dead by the time you crawl. Another big chunk decode into corrupt JPEGs, 1x1 tracking pixels, or 50 MB PNGs of a logo. The captions are sometimes great, sometimes machine-translated word salad, sometimes literally the string "image". By the time you have something trainable, you have spent weeks shuffling shards across disks and writing one-off scripts that nobody remembers how to run.

I built VL-Data-Engine because I got tired of rewriting the same five filtering scripts every quarter. The goal is boring on purpose: a small, composable set of stages that take a URL list (or a Parquet file of image-text pairs) and produce sharded WebDataset tars suitable for CLIP / BLIP / SigLIP-style training. No distributed magic if you don't need it. Run it on a laptop on 100k pairs, or hand it to Ray and chew through 500M on a cluster.

What This Does

The pipeline is a list of Stage objects. Each stage reads a stream of samples (dicts with at minimum a url and text), does its thing, and emits the survivors downstream. Stages are independent: drop one, reorder, swap in your own. The orchestrator handles batching, retries, and counting drops per stage so you can see exactly where samples die.
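
To make the stage contract concrete, here is a minimal sketch of how a list of stages with per-stage drop counting could work. It is not the actual vlde implementation: the Stage base class, the process method, the two example stages, and run_pipeline are all hypothetical names chosen for illustration, and batching and retries are left out.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple

Sample = dict  # at minimum {"url": ..., "text": ...}


class Stage:
    """Hypothetical base class: return the sample to keep it, None to drop it."""

    name = "stage"

    def process(self, sample: Sample) -> Optional[Sample]:
        raise NotImplementedError


class RequireText(Stage):
    """Drops samples whose caption is empty or whitespace only."""

    name = "require_text"

    def process(self, sample: Sample) -> Optional[Sample]:
        return sample if sample.get("text", "").strip() else None


class HttpsOnly(Stage):
    """Drops samples whose URL does not use https."""

    name = "https_only"

    def process(self, sample: Sample) -> Optional[Sample]:
        return sample if sample["url"].startswith("https://") else None


def run_pipeline(
    stages: List[Stage], samples: Iterable[Sample]
) -> Tuple[List[Sample], Counter]:
    """Push each sample through the stages in order and count drops per stage."""
    drops: Counter = Counter()
    survivors: List[Sample] = []
    for sample in samples:
        current: Optional[Sample] = sample
        for stage in stages:
            current = stage.process(current)
            if current is None:
                drops[stage.name] += 1  # attribute the drop to this stage
                break
        else:
            survivors.append(current)  # no stage dropped the sample
    return survivors, drops
```

Because each stage only sees one sample and returns either a sample or None, dropping, reordering, or swapping a stage is just an edit to the list passed to run_pipeline.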

Show Me

from vlde import Pipeline, Config
from vlde.stages import (
    Download, Decode, Dedup, CaptionQuality, CLIPFilter, Resize, ShardWriter,
)

# Load run settings from YAML.
cfg = Config.from_yaml("configs/cc3m_like.yaml")

# Stages run in list order; each one passes its survivors to the next.
pipe = Pipeline([
    Download(timeout=10, max_retries=2),
    Decode(min_size=128, max_size=4096),
    Dedup(hash_size=8),
    CaptionQuality(min_words=3, max_words=80, langs={"en"}),
    CLIPFilter(model="ViT-B-32", threshold=0.28),
    Resize(size=256),
    ShardWriter(out_dir="shards/", samples_per_shard=10_000),
], config=cfg)

# Run on the input parquet, then print the run report.
stats = pipe.run("data/urls.parquet")
print(stats.report())

Pipeline Stages

Download -- async-ish HTTP fetcher with domain blocklist and a robots.txt cache. Skips anything > max_bytes.
Decode -- PIL decode plus a size sanity check; rejects 1x1 spacer GIFs and truncated JPEGs.
Dedup -- perceptual-hash (pHash) dedup, with an in-memory bloom-filter mode or a sharded on-disk mode for large runs.
NSFWFilter -- pluggable classifier interface; ships with a stub. Drop in your own NSFW model if you have one.
CLIPFilter -- image-text cosine similarity via open_clip. Below threshold is dropped.
CaptionQuality -- length, language, punctuation/repetition heuristics.
Resize -- center-crop + resize + re-encode (JPEG quality configurable).
ShardWriter -- writes WebDataset tar shards with deterministic keys.
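
As an illustration of the kind of checks CaptionQuality describes, here is a rough sketch of length, repetition, and punctuation heuristics in plain Python. This is not the stage's real code: caption_ok is a hypothetical helper, the min_words and max_words defaults are copied from the example above, and the max_repeat_frac and min_alnum_frac thresholds are assumptions picked for the demo. Language detection is omitted.

```python
def caption_ok(
    text: str,
    min_words: int = 3,
    max_words: int = 80,
    max_repeat_frac: float = 0.5,  # assumed threshold, not from vlde
    min_alnum_frac: float = 0.6,   # assumed threshold, not from vlde
) -> bool:
    """Return True if the caption passes simple length/repetition/punctuation checks."""
    words = text.split()
    if not words:
        return False

    # Length: reject captions that are too short or too long.
    if not (min_words <= len(words) <= max_words):
        return False

    # Repetition: reject if a single word makes up too much of the caption.
    lowered = [w.lower() for w in words]
    most_common = max(lowered.count(w) for w in set(lowered))
    if most_common / len(words) > max_repeat_frac:
        return False

    # Punctuation: reject if too few non-space characters are alphanumeric.
    alnum = sum(ch.isalnum() for ch in text)
    non_space = sum(not ch.isspace() for ch in text)
    if alnum / non_space < min_alnum_frac:
        return False

    return True
```

With these defaults, a caption like "a brown dog running on the beach" passes, while "image", "buy buy buy buy now", and "!!! ??? ... ###" are rejected by the length, repetition, and punctuation checks respectively.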

Install

pip install -e .
pip install -e ".[clip,ray,lang]"   # full install

You will want [clip] if you actually want the CLIP filter to do anything. Without it the stage is a no-op pass-through.
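
One way a pass-through fallback like this could be arranged is sketched below. It is an assumption about the design, not the actual CLIPFilter code: CLIPFilterSketch, its scorer argument, and clip_available are hypothetical names. The scorer stands in for whatever computes image-text similarity when open_clip is installed.

```python
import importlib.util
from typing import Callable, Optional


def clip_available() -> bool:
    """True if the open_clip package can be imported in this environment."""
    return importlib.util.find_spec("open_clip") is not None


class CLIPFilterSketch:
    """Hypothetical filter: scores samples if a scorer exists, else passes them through."""

    def __init__(
        self,
        threshold: float = 0.28,
        scorer: Optional[Callable[[dict], float]] = None,
    ) -> None:
        self.threshold = threshold
        self.scorer = scorer  # None models "the [clip] extra is not installed"

    def process(self, sample: dict) -> Optional[dict]:
        if self.scorer is None:
            return sample  # no-op pass-through: nothing is filtered
        # Below threshold is dropped; at or above is kept.
        return sample if self.scorer(sample) >= self.threshold else None
```

The practical consequence is the same as stated above: without the extra, every sample survives this stage, so drop counts for it will read zero.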

Configuration

YAML, dataclass-backed. Unknown keys raise.

# configs/cc3m_like.yaml
input: data/cc3m.parquet
output: shards/cc3m/
num_workers: 16
backend: multiprocessing   # or "ray"

download:
  timeout: 10
  max_bytes: 10_000_000
  blocklist: configs/blocklist.txt

clip:
  model: ViT-B-32
  pretrained: openai
  threshold: 0.28
  batch_size: 256

shard:
  samples_per_shard: 10000
  compress: false
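
For readers wondering what "dataclass-backed, unknown keys raise" can look like in practice, here is a small sketch using only the standard library. It is not the vlde Config class: DownloadConfig and load_section are hypothetical names, the field names mirror the download section above, and the choice of KeyError is an assumption.

```python
from dataclasses import dataclass, fields


@dataclass
class DownloadConfig:
    """Hypothetical mirror of the `download:` section of the YAML."""

    timeout: int = 10
    max_bytes: int = 10_000_000
    blocklist: str = ""


def load_section(cls, raw: dict):
    """Build a dataclass from a parsed YAML mapping, rejecting unknown keys."""
    known = {f.name for f in fields(cls)}
    unknown = set(raw) - known
    if unknown:
        raise KeyError(
            f"unknown config keys for {cls.__name__}: {sorted(unknown)}"
        )
    return cls(**raw)
```

Failing loudly on a misspelled key such as timout means a typo cannot silently fall back to a default partway through a long run.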

Roadmap

License

BSD-3-Clause. See LICENSE.
