andraiming/vl-data-engine


VL-Data-Engine

The Problem

Anyone who has tried to build a vision-language pretraining dataset from scratch knows the routine. You start with a list of a few hundred million image URLs. A third of them are dead by the time you crawl them. Another big chunk decode into corrupt JPEGs, 1x1 tracking pixels, or 50MB PNGs of a logo. The captions are sometimes great, sometimes machine-translated word salad, sometimes literally the string "image". By the time you have something trainable, you have spent weeks shuffling shards across disks and writing one-off scripts that nobody remembers how to run.

I built VL-Data-Engine because I got tired of rewriting the same five filtering scripts every quarter. The goal is boring on purpose: a small, composable set of stages that take a URL list (or a parquet file of image-text pairs) and produce sharded WebDataset tars suitable for CLIP / BLIP / SigLIP-style training. No distributed magic if you don't need it. Run it on a laptop on 100k pairs, or hand it to Ray and chew through 500M on a cluster.

What This Does

The pipeline is a list of Stage objects. Each stage reads a stream of samples (dicts with at minimum a url and text), does its thing, and emits the survivors downstream. Stages are independent: drop one, reorder, swap in your own. The orchestrator handles batching, retries, and counting drops per stage so you can see exactly where samples die.
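To make the stage contract concrete, here is a minimal sketch of the idea, not the actual vlde implementation. The `Stage` base class, the `process` method, the two toy stages, and the `run` helper are all hypothetical names chosen for illustration; batching and retries are left out entirely.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple

Sample = dict  # at minimum {"url": ..., "text": ...}


class Stage:
    """Hypothetical base class: return the sample to keep it, None to drop it."""

    name = "stage"

    def process(self, sample: Sample) -> Optional[Sample]:
        raise NotImplementedError


class HttpsOnly(Stage):
    """Toy filter: keep only samples whose URL uses https."""

    name = "https_only"

    def process(self, sample: Sample) -> Optional[Sample]:
        return sample if sample["url"].startswith("https://") else None


class MinWords(Stage):
    """Toy filter: drop samples whose caption has fewer than `n` words."""

    name = "min_words"

    def __init__(self, n: int) -> None:
        self.n = n

    def process(self, sample: Sample) -> Optional[Sample]:
        return sample if len(sample["text"].split()) >= self.n else None


def run(stages: List[Stage], samples: Iterable[Sample]) -> Tuple[List[Sample], Counter]:
    """Push each sample through the stages in order and count drops per stage."""
    drops: Counter = Counter()
    survivors: List[Sample] = []
    for sample in samples:
        for stage in stages:
            sample = stage.process(sample)
            if sample is None:
                drops[stage.name] += 1  # attribute the drop to this stage
                break
        else:
            # No stage rejected the sample, so it survives.
            survivors.append(sample)
    return survivors, drops
```

Because a drop is attributed to the first stage that rejects a sample, reordering the list changes the per-stage counts but not the set of survivors, as long as the stages are pure filters.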

Show Me

from vlde import Pipeline, Config
from vlde.stages import (
    Download, Decode, Dedup, CaptionQuality, CLIPFilter, Resize, ShardWriter,
)

# Pipeline-wide settings (workers, backend, ...) come from the YAML config.
cfg = Config.from_yaml("configs/cc3m_like.yaml")

# Stages run in list order; each one only sees what the previous one kept.
pipe = Pipeline([
    Download(timeout=10, max_retries=2),
    Decode(min_size=128, max_size=4096),
    Dedup(hash_size=8),
    CaptionQuality(min_words=3, max_words=80, langs={"en"}),
    CLIPFilter(model="ViT-B-32", threshold=0.28),
    Resize(size=256),
    ShardWriter(out_dir="shards/", samples_per_shard=10_000),
], config=cfg)

# Run on a parquet of image-text pairs and print the per-stage drop counts.
stats = pipe.run("data/urls.parquet")
print(stats.report())

Pipeline Stages

  • Download -- async-ish HTTP fetcher with domain blocklist and a robots.txt cache. Skips anything larger than max_bytes.
  • Decode -- PIL decode plus a size sanity check; rejects 1x1 spacer GIFs and truncated JPEGs.
  • Dedup -- perceptual-hash (pHash) dedup, in-memory bloom-filter mode or sharded on-disk mode for large runs.
  • NSFWFilter -- pluggable classifier interface; ships with a stub. Drop in your own NSFW model if you have one.
  • CLIPFilter -- image-text cosine similarity via open_clip. Below threshold is dropped.
  • CaptionQuality -- length, language, punctuation/repetition heuristics.
  • Resize -- center-crop + resize + re-encode (JPEG quality configurable).
  • ShardWriter -- writes WebDataset tar shards with deterministic keys.
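As an illustration of the last stage, the sketch below writes one WebDataset-style shard using only the standard library. WebDataset groups tar members that share a basename into one sample and uses the extension as the field name. Everything else here is an assumption for illustration: `sample_key`, `write_shard`, and the choice of a truncated SHA-256 of the URL as the deterministic key are not taken from the vlde source.

```python
import hashlib
import io
import tarfile
from typing import BinaryIO, Iterable, List, Tuple


def sample_key(url: str) -> str:
    """Deterministic key (assumed scheme): the same URL always yields the same key."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]


def _add_member(tar: tarfile.TarFile, name: str, data: bytes) -> None:
    """Add an in-memory byte string to the tar under the given member name."""
    info = tarfile.TarInfo(name=name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


def write_shard(fileobj: BinaryIO, samples: Iterable[Tuple[str, bytes, str]]) -> List[str]:
    """Write (url, image_bytes, caption) triples as <key>.jpg / <key>.txt pairs."""
    keys = []
    with tarfile.open(fileobj=fileobj, mode="w") as tar:
        for url, image_bytes, caption in samples:
            key = sample_key(url)
            _add_member(tar, f"{key}.jpg", image_bytes)
            _add_member(tar, f"{key}.txt", caption.encode("utf-8"))
            keys.append(key)
    return keys
```

Deriving the key from the URL rather than from a running counter means a rerun over the same input produces the same keys regardless of how samples were batched, which makes shards from different runs easier to compare.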

Install

pip install -e .
pip install -e ".[clip,ray,lang]"   # full install

You will want [clip] if you actually want the CLIP filter to do anything. Without it the stage is a no-op pass-through.
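The pass-through behaviour can be pictured roughly as follows. This is a sketch of the general optional-dependency pattern, not the actual CLIPFilter code: `ThresholdFilter`, `build_filter`, and the `make_scorer` callback are hypothetical, and the warning is my own addition because a silent no-op is easy to miss.

```python
import importlib.util
import warnings
from typing import Callable, Optional

Scorer = Callable[[dict], float]  # hypothetical: image-text similarity for a sample


def module_available(name: str) -> bool:
    """True if a top-level module can be imported, without importing it."""
    return importlib.util.find_spec(name) is not None


class ThresholdFilter:
    """Drops samples scoring below `threshold`; passes everything if no scorer."""

    def __init__(self, threshold: float, scorer: Optional[Scorer]) -> None:
        self.threshold = threshold
        self.scorer = scorer

    def process(self, sample: dict) -> Optional[dict]:
        if self.scorer is None:
            return sample  # optional dependency missing: no-op pass-through
        return sample if self.scorer(sample) >= self.threshold else None


def build_filter(
    threshold: float,
    make_scorer: Callable[[], Scorer],
    module: str = "open_clip",
) -> ThresholdFilter:
    """Build a real filter if `module` is installed, else a pass-through."""
    if not module_available(module):
        warnings.warn(f"{module} is not installed; filter will pass all samples")
        return ThresholdFilter(threshold, scorer=None)
    # Only construct the (potentially heavy) scorer once the module is known to exist.
    return ThresholdFilter(threshold, scorer=make_scorer())
```

If your drop report shows zero samples removed by the CLIP stage, a missing extra is the first thing to check.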

Configuration

Configuration is YAML, backed by dataclasses. Unknown keys raise an error instead of being silently ignored.

# configs/cc3m_like.yaml
input: data/cc3m.parquet
output: shards/cc3m/
num_workers: 16
backend: multiprocessing   # or "ray"

download:
  timeout: 10
  max_bytes: 10000000
  blocklist: configs/blocklist.txt

clip:
  model: ViT-B-32
  pretrained: openai
  threshold: 0.28
  batch_size: 256

shard:
  samples_per_shard: 10000
  compress: false
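For readers curious what strict, dataclass-backed validation can look like, here is a small sketch. It is not the vlde `Config` class: the dataclass names, the defaults, and the `from_dict` helper are assumptions that simply mirror the keys in the YAML above. It starts from an already-parsed dict (for example the result of `yaml.safe_load`) so it runs without extra dependencies.

```python
from dataclasses import dataclass, field, fields, is_dataclass
from typing import Any, Dict


@dataclass
class DownloadConfig:
    timeout: int = 10
    max_bytes: int = 10_000_000
    blocklist: str = ""


@dataclass
class ClipConfig:
    model: str = "ViT-B-32"
    pretrained: str = "openai"
    threshold: float = 0.28
    batch_size: int = 256


@dataclass
class ShardConfig:
    samples_per_shard: int = 10_000
    compress: bool = False


@dataclass
class Config:
    input: str = ""
    output: str = ""
    num_workers: int = 1
    backend: str = "multiprocessing"
    download: DownloadConfig = field(default_factory=DownloadConfig)
    clip: ClipConfig = field(default_factory=ClipConfig)
    shard: ShardConfig = field(default_factory=ShardConfig)


def from_dict(cls: type, data: Dict[str, Any], path: str = "") -> Any:
    """Build `cls` from a dict, raising KeyError on any key it does not define."""
    known = {f.name: f for f in fields(cls)}
    unknown = sorted(set(data) - set(known))
    if unknown:
        raise KeyError(f"unknown config key(s) under '{path or 'root'}': {unknown}")
    kwargs = {}
    for name, value in data.items():
        ftype = known[name].type
        # Recurse into nested sections such as `clip:` or `shard:`.
        if is_dataclass(ftype) and isinstance(value, dict):
            value = from_dict(ftype, value, path=f"{path}{name}.")
        kwargs[name] = value
    return cls(**kwargs)
```

With this approach a typo such as `treshold: 0.3` under `clip:` fails at load time instead of quietly falling back to the default threshold.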

Roadmap

  • Core stage interface + multiprocessing backend
  • pHash dedup
  • CLIP filter with open_clip
  • WebDataset writer
  • Ray backend
  • YAML config with strict validation
  • Streaming dedup across shards (not just within batch)
  • Native fasttext lang-id integration (currently optional)
  • HuggingFace datasets push helper
  • Resumable runs (checkpoint per shard)

License

BSD-3-Clause. See LICENSE.

About

Scalable preprocessing & filtering pipeline for vision-language pretraining datasets (CC, LAION-style, web-scraped image-text).
