A framework for large-scale LLM data preprocessing using Ray Data. Think datatrove or dolma (in fact, it borrows a lot from those), but with native Ray integration.
Create a YAML pipeline configuration:
name: "language_filtering"
description: "An example pipeline that filters input data by language, keeping only English text."
input:
  format: "jsonl"
  path: "data/input.jsonl"
processors:
  - category: "tagger"
    type: "language"
    params: {}
  - category: "filter"
    type: "language"
    params:
      allowed_languages: ["en"]
output:
  format: "parquet"
  path: "data/output.parquet"
Then load and run the pipeline from Python:

from llmdata import DataPipeline, PipelineConfig

# Load the YAML configuration and execute the pipeline
config = PipelineConfig.from_yaml("config.yaml")
pipeline = DataPipeline(config)
pipeline.run()
You can invoke a processing pipeline using the CLI:
llmdata run config.yaml
Full options:
llmdata -h
usage: llmdata [-h] [--config CONFIG] [--print_config[=flags]] {export_schemas,list,run,validate} ...
options:
-h, --help Show this help message and exit.
--config CONFIG Path to a configuration file.
--print_config[=flags]
Print the configuration after applying all other arguments and exit. The optional flags customizes the output and are one or more keywords separated by
comma. The supported flags are: comments, skip_default, skip_null.
subcommands:
For more details of each subcommand, add it as an argument followed by --help.
Available subcommands:
export_schemas Writes all available processor config schemas to a JSON file.
list            Print all available processor categories and names.
run Run a pipeline using a config file.
validate Validate a config file.
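For example, a config can be checked before running it. As a sketch, assuming the validate subcommand takes the config path as a positional argument (mirroring run):

llmdata validate config.yaml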
$ llmdata list
=== Available Components ===
aggregation:
- sum
- count
- quantile
- mean
- min
- max
- std
- absmax
- unique
reader:
- parquet
- jsonl
- csv
- text
writer:
- parquet
- jsonl
- csv
extractor:
- html
- tei
- plain
filter:
- language
- gopher_quality
- gopher_repetition
- num_tokens
- value
- exists
formatter:
- deduplication
- ftfy
- ocr_error
- pii
tagger:
- language
- gopher_quality
- gopher_repetition
- token_count
- length
- value
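For illustration, here is a sketch of how several of the listed processors could be combined in a pipeline config. The category and processor names follow the list above, but the params shown (e.g. min_tokens) are assumptions and may differ from the actual schemas; use llmdata export_schemas to inspect the real ones.

name: "quality_filtering"
description: "Fix encoding issues, then keep only documents with a minimum token count."
input:
  format: "parquet"
  path: "data/crawl.parquet"
processors:
  - category: "formatter"
    type: "ftfy"
    params: {}
  - category: "tagger"
    type: "token_count"
    params: {}
  - category: "filter"
    type: "num_tokens"
    params:
      min_tokens: 50   # assumed parameter name; check the exported schema
output:
  format: "jsonl"
  path: "data/filtered.jsonl"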
# Setup
uv sync --dev
uv run pre-commit install
# Testing
uv run pytest
# Code quality
make check
If you use this software, please cite the corresponding paper:
@article{gienapp:2025d,
title = {{The German Commons -- 154 Billion Tokens of Openly Licensed Text for German Language Models}},
author = {Lukas Gienapp and
Christopher Schr\"oder and
Stefan Schweter and
Christopher Akiki and
Ferdinand Schlatt and
Arden Zimmermann and
Phillipe Gen\^et and
Martin Potthast},
year = 2025,
month = oct,
journal = {CoRR},
volume = {abs/2510.13996},
url = {https://arxiv.org/abs/2510.13996}
}