LLMData

A framework for large-scale LLM data preprocessing using Ray Data. Think datatrove or dolma (in fact, it borrows a lot from both), but with native Ray integration.

Quick Start

Configuration

Create a YAML pipeline configuration:

name: "language_filtering"
description: "An example pipeline that filters input data by language, keeping only English text."
input:
  format: "jsonl"
  path: "data/input.jsonl"

processors:
  - category: "tagger"
    type: "language"
    params: {}
  - category: "filter"
    type: "language"
    params:
      allowed_languages: ["en"]

output:
  format: "parquet"
  path: "data/output.parquet"

Usage

from llmdata import DataPipeline, PipelineConfig

config = PipelineConfig.from_yaml("config.yaml")
pipeline = DataPipeline(config)
pipeline.run()
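
Once the run completes, the output can be inspected with standard Parquet tooling. A minimal sketch, assuming the pipeline wrote to the path configured above (Ray Data writers may emit a directory of Parquet shards rather than a single file; pyarrow reads both):

import pyarrow.parquet as pq

# Read the pipeline output produced by the config above. read_table
# accepts a single Parquet file or a directory of Parquet shards.
table = pq.read_table("data/output.parquet")
print(table.schema)
print(table.num_rows, "documents kept after language filtering")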

CLI Interface

You can invoke a processing pipeline using the CLI:

llmdata run config.yaml

Full options:

llmdata -h
usage: llmdata [-h] [--config CONFIG] [--print_config[=flags]] {export_schemas,list,run,validate} ...

options:
  -h, --help            Show this help message and exit.
  --config CONFIG       Path to a configuration file.
  --print_config[=flags]
                        Print the configuration after applying all other arguments and exit. The optional flags customizes the output and are one or more keywords separated by
                        comma. The supported flags are: comments, skip_default, skip_null.

subcommands:
  For more details of each subcommand, add it as an argument followed by --help.

  Available subcommands:
    export_schemas      Writes all available processor config schemas to a JSON file.
    list                Print all available processors categories and names.
    run                 Run a pipeline using a config file.
    validate            Validate a config file.
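
For example, a config file can be checked before a full run (assuming validate takes the config path positionally, like run):

llmdata validate config.yaml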

Built-in Processors

$ llmdata list
=== Available Components ===

aggregation:
  - sum
  - count
  - quantile
  - mean
  - min
  - max
  - std
  - absmax
  - unique

reader:
  - parquet
  - jsonl
  - csv
  - text

writer:
  - parquet
  - jsonl
  - csv

extractor:
  - html
  - tei
  - plain

filter:
  - language
  - gopher_quality
  - gopher_repetition
  - num_tokens
  - value
  - exists

formatter:
  - deduplication
  - ftfy
  - ocr_error
  - pii

tagger:
  - language
  - gopher_quality
  - gopher_repetition
  - token_count
  - length
  - value

Development

# Setup
uv sync --dev
uv run pre-commit install

# Testing
uv run pytest

# Code quality
make check

Citation

If you use this software, please cite the corresponding paper:

@article{gienapp:2025d,
    title        = {{The German Commons -- 154 Billion Tokens of Openly Licensed Text for German Language Models}},
    author       = {Lukas Gienapp and
                    Christopher Schr\"oder and
                    Stefan Schweter and
                    Christopher Akiki and
                    Ferdinand Schlatt and
                    Arden Zimmermann and
                    Phillipe Gen\^et and
                    Martin Potthast},
    year         = 2025,
    month        = oct,
    journal      = {CoRR},
    volume       = {abs/2510.13996},
    url          = {https://arxiv.org/abs/2510.13996}
}
