LLMData

A framework for large-scale LLM data preprocessing using Ray Data. Think datatrove or dolma (in fact, it borrows a lot from both), but with native Ray integration.

Quick Start

Configuration

Create a YAML pipeline configuration:

name: "language_filtering"
description: "An example pipeline that filters input data by language, keeping only English text."
input:
  format: "jsonl"
  path: "data/input.jsonl"

processors:
  - category: "tagger"
    type: "language"
    params: {}
  - category: "filter"
    type: "language"
    params:
      allowed_languages: ["en"]

output:
  format: "parquet"
  path: "data/output.parquet"

Usage

from llmdata import DataPipeline, PipelineConfig

config = PipelineConfig.from_yaml("config.yaml")
pipeline = DataPipeline(config)
pipeline.run()
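
Once the run completes, the output can be inspected with standard Parquet tooling. A minimal sketch, assuming the pipeline wrote to the path configured above (Ray Data writers may emit a directory of Parquet shards rather than a single file; pyarrow reads both):

import pyarrow.parquet as pq

# Read the pipeline output produced by the config above. read_table
# accepts a single Parquet file or a directory of Parquet shards.
table = pq.read_table("data/output.parquet")
print(table.schema)
print(table.num_rows, "documents kept after language filtering")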

CLI Interface

You can invoke a processing pipeline using the CLI:

llmdata run config.yaml

Full options:

llmdata -h
usage: llmdata [-h] [--config CONFIG] [--print_config[=flags]] {export_schemas,list,run,validate} ...

options:
  -h, --help            Show this help message and exit.
  --config CONFIG       Path to a configuration file.
  --print_config[=flags]
                        Print the configuration after applying all other arguments and exit. The optional flags customizes the output and are one or more keywords separated by
                        comma. The supported flags are: comments, skip_default, skip_null.

subcommands:
  For more details of each subcommand, add it as an argument followed by --help.

  Available subcommands:
    export_schemas      Writes all available processor config schemas to a JSON file.
    list                Print all available processors categories and names.
    run                 Run a pipeline using a config file.
    validate            Validate a config file.
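
For example, a config file can be checked before a full run (assuming validate takes the config path positionally, like run):

llmdata validate config.yaml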

Built-in Processors

$ llmdata list
=== Available Components ===

aggregation:
  - sum
  - count
  - quantile
  - mean
  - min
  - max
  - std
  - absmax
  - unique

reader:
  - parquet
  - jsonl
  - csv
  - text

writer:
  - parquet
  - jsonl
  - csv

extractor:
  - html
  - tei
  - plain

filter:
  - language
  - gopher_quality
  - gopher_repetition
  - num_tokens
  - value
  - exists

formatter:
  - deduplication
  - ftfy
  - ocr_error
  - pii

tagger:
  - language
  - gopher_quality
  - gopher_repetition
  - token_count
  - length
  - value

Development

# Setup
uv sync --dev
uv run pre-commit install

# Testing
uv run pytest

# Code quality
make check

Citation

If you use this software, please cite the corresponding paper:

@article{gienapp:2025d,
    title        = {{The German Commons -- 154 Billion Tokens of Openly Licensed Text for German Language Models}},
    author       = {Lukas Gienapp and
                    Christopher Schr\"oder and
                    Stefan Schweter and
                    Christopher Akiki and
                    Ferdinand Schlatt and
                    Arden Zimmermann and
                    Phillipe Gen\^et and
                    Martin Potthast},
    year         = 2025,
    month        = oct,
    journal      = {CoRR},
    volume       = {abs/2510.13996},
    url          = {https://arxiv.org/abs/2510.13996}
}
