uv-scripts-for-ai

A UV script is a single Python file that declares its own dependencies inline — a portable unit you run with uv run where you have the hardware, or hand to hf jobs uv run on Hugging Face Jobs for a GPU. Chain several into a pipeline.

Each script carries its own dependencies, so people and agents can run one without cloning a repo, making a virtualenv, or installing a requirements.txt first.

A recipe here is one such script. Most read and write the Hugging Face Hub, so one script's output dataset becomes the next one's input.

Quickstart

First, install uv — it's the only thing you install; every script brings its own Python dependencies:

curl -LsSf https://astral.sh/uv/install.sh | sh

Run a recipe on a GPU — point Hugging Face Jobs at the script's URL and it runs on managed hardware, no GPU of your own needed. Here davanstrien/ufo-ColPali is a small public image dataset you can use as-is; the output lands in your namespace:

hf jobs uv run --flavor l4x1 --secrets HF_TOKEN \
  https://huggingface.co/datasets/uv-scripts/ocr/raw/main/glm-ocr.py \
  davanstrien/ufo-ColPali your-username/ufo-ocr

No pip install, no local setup. --secrets HF_TOKEN forwards your token so the job can write the output dataset back to the Hub. (Jobs needs the hf CLI, installed with uv tool install huggingface_hub, and a Hugging Face account with pay-as-you-go credit; no subscription is needed. Jobs are billed by the second, and a small CPU job costs ~$0.01/hr. Run hf jobs hardware for current flavors and prices.)

Prefer your own machine? A recipe is just a UV script, so on a box with the hardware it needs — most recipes here want a CUDA GPU — you can run it (or inspect it with --help) directly, no Jobs required:

uv run https://huggingface.co/datasets/uv-scripts/ocr/raw/main/glm-ocr.py --help

What's a UV script?

A normal Python file with a metadata block at the top that lists its dependencies:

# /// script
# requires-python = ">=3.10"
# dependencies = ["datasets", "transformers", "torch"]
# ///

Normally, running someone's Python script means cloning their repo, making a virtual environment, and pip install-ing a requirements.txt first, and if your versions don't match theirs, it can still break. Here the dependencies live inside the file, in that comment block, so uv (and hf jobs uv run) reads them, installs them into a throwaway environment at whatever versions the block specifies, and runs the file straight from a URL, with nothing to set up. This is the standard PEP 723 inline-script-metadata format; see the uv scripts guide to learn more.

Why UV scripts

A self-contained, pinned script is easy to run and reuse, for a few reasons:

  • Discrete & single-purpose — one script, one job. That job can be a two-second transform or a multi-hour fine-tune; either way it's one self-contained unit you pick by reading a header instead of a whole codebase.
  • Self-describing — the PEP 723 dependency block, the docstring, and --help tell you what it needs and how to call it.
  • Reproducible — dependencies are pinned in the file, so there's no env drift and no "works on my machine."
  • Composable — recipes hand off through the Hub (usually a dataset in, a dataset or model out), so you can chain them into a pipeline.
  • Portable — one self-contained file; run it with uv run where you have the hardware (most recipes need a GPU), or hf jobs uv run it on a managed GPU.

Built for agents, too. Every recipe takes its arguments in the same order, input then output, and runs from a URL, so an AI agent can pick a tool from its header and run it with no setup. On Jobs the agent runs in a sandbox: a throwaway disk, access limited to what the token's repo permissions allow, and a cost cap per job, rather than arbitrary code on your machine. (Hugging Face also ships an hf CLI skill for agents for driving Jobs from an editor.) This repo also ships a ready-to-use uv-recipes agent skill; point your agent at it to discover, run, and adapt recipes.

Recipes

| Domain | What it does | On the Hub |
| --- | --- | --- |
| ocr | OCR / document → text & structured data: GLM, PaddleOCR-VL, Nanonets, olmOCR, dots, … (30+ models) | uv-scripts/ocr |
| vision | Zero-shot detection & segmentation over image datasets | sam3 · object-detection · vlm-object-detection |
| audio | Transcription & speech translation | transcription |
| embeddings & atlas | Embed a dataset; build an interactive map | build-atlas |
| data processing | Filter / dedup / stats over large datasets | dataset-stats · deduplication · classification |
| dataset creation | Turn PDFs / image URLs into Hub datasets | dataset-creation · iiif-tiles |
| synthetic data | Generate datasets with LLMs | synthetic-data |
| inference | Run any open LLM / VLM over a dataset | vllm · openai-oss · transformers-inference |
| entity extraction | NER / structured extraction over text | gliner |
| …and more | Training, evaluation, RAG indexing (migrating as they mature) | training · transformers-training |

Most recipes now live in this repo; the rest link to the uv-scripts Hugging Face org, where they run today, and will migrate here over time. Each folder mirrors to its Hub dataset repo.

What fits here: any self-contained UV script for data or ML work on the Hub. OCR and dataset work are the current focus, but inference, evaluation, RAG indexing, and training (fine-tuning with TRL / transformers, producing a model) are all in scope. If it's one pinned script that reads from or writes to the Hub, it belongs.

Compose a pipeline

Because recipes hand off through the Hub, you can chain them — each step's output dataset is the next step's input. A document-collection pipeline, end to end:

PDFs / scans          →   OCR to markdown      →   dedup + stats        →   embed + visualise
dataset-creation          ocr/glm-ocr.py           deduplication            build-atlas

Each arrow is a Hub dataset; each box is one hf jobs uv run (or uv run), and every box runs today from its Hub URL, even before it's migrated into this repo. A pipeline can also end in a trained model instead of another dataset. You can write the chain as a shell script, or an agent can generate it — the scripts are the same.
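
One way to write the chain is a small driver that feeds each step's output dataset to the next step as its input. This is a sketch under assumptions: only the glm-ocr.py URL appears in this README; the other script URLs and all intermediate dataset names are placeholders, every step is assumed to take just an input and an output on the same flavor, and the function only builds the commands rather than running them.

```python
import shlex

# Only the OCR URL is taken from this README; the example.invalid URLs are
# placeholders standing in for the real recipe scripts.
STEPS = [
    "https://huggingface.co/datasets/uv-scripts/ocr/raw/main/glm-ocr.py",
    "https://example.invalid/dedup-recipe.py",
    "https://example.invalid/atlas-recipe.py",
]


def build_pipeline(steps, source, namespace, flavor="l4x1"):
    """Return one `hf jobs uv run` argv per step, chaining datasets."""
    commands = []
    current = source
    for index, script_url in enumerate(steps, start=1):
        output = f"{namespace}/pipeline-step-{index}"  # hypothetical naming scheme
        commands.append([
            "hf", "jobs", "uv", "run",
            "--flavor", flavor,
            "--secrets", "HF_TOKEN",
            script_url, current, output,
        ])
        current = output  # this step's output is the next step's input
    return commands


commands = build_pipeline(STEPS, "davanstrien/ufo-ColPali", "your-username")
for argv in commands:
    print(shlex.join(argv))
```

To execute the chain, each list could be passed to subprocess.run(argv, check=True); a real driver would also need to wait for each job to finish before starting the next.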

Portable: run it locally or on Jobs

A recipe is the same file wherever you run it — on a machine with the hardware it needs, or on Hugging Face Jobs for a managed GPU. Same file, same arguments:

SCRIPT=https://huggingface.co/datasets/uv-scripts/ocr/raw/main/glm-ocr.py

# locally — needs the right hardware (a GPU for most recipes)
uv run $SCRIPT davanstrien/ufo-ColPali your-username/ufo-ocr

# on a managed GPU — pick hardware with --flavor; --secrets forwards your write token
hf jobs uv run --flavor l4x1 --secrets HF_TOKEN $SCRIPT davanstrien/ufo-ColPali your-username/ufo-ocr

Why reach for Jobs:

  • Pay by the second — billed only while the job runs. Run hf jobs hardware, or see the flavors and pricing.
  • No infra: hf jobs uv run <url> and you're done. See the hf jobs CLI.
  • Hub-native — read and write datasets, models, and storage buckets directly. Running from the https://huggingface.co/datasets/uv-scripts/… URL also attributes usage to the recipe.
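
Per-second billing means a job's cost is its hourly rate divided by 3600, multiplied by the seconds it ran. A quick sketch of that arithmetic, using the ~$0.01/hr small-CPU figure quoted in the Quickstart; the GPU rate in the second example is a made-up number, so check hf jobs hardware for real prices.

```python
def job_cost(hourly_rate_usd: float, runtime_seconds: float) -> float:
    """Cost of a job billed per second at the given hourly rate."""
    return hourly_rate_usd / 3600 * runtime_seconds


# Ten minutes on a small CPU flavor at ~$0.01/hr (figure from this README).
cpu_cost = job_cost(0.01, 10 * 60)

# Ninety minutes at a hypothetical $1.00/hr GPU rate (not a real price).
gpu_cost = job_cost(1.00, 90 * 60)

print(f"CPU: ${cpu_cost:.4f}  GPU: ${gpu_cost:.2f}")
```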

Model licenses

These scripts are orchestration code: they download third-party models from the Hugging Face Hub at runtime and run inference. This repo does not redistribute any model weights. Each model you run carries its own license (MIT, Apache-2.0, OpenRAIL-M, and some with non-commercial or other use-based terms); those terms govern your use of the model, not this repo's code. You are responsible for checking each model's license — on its Hugging Face model card — before using it, especially in production.

License

The code and documentation in this repository are licensed under the Apache License 2.0. See NOTICE for attribution.


Recipes mirror to the uv-scripts Hugging Face org via GitHub Actions. See CONTRIBUTING.md to add one.

