TrainingData

Training Data

⚠️ Sample. An AI-generated demo of LLM Wiki Newsroom — the "open source AI" topic is just the example corpus.

Overview

Training data is the corpus a model learns from, and whether it must be released is the central dispute in the OpenSourceAI debate. The OpenSourceInitiative's Open Source AI Definition (OSAID) requires "data information" (enough detail for a skilled person to recreate a substantially equivalent system) rather than the full dataset. Its rationale is that some data, such as medical records, cannot be legally shared, and that a full-data requirement would relegate open-source AI to a tiny niche. Critics including Bruce Perens and the FreeSoftwareFoundation counter that training data is effectively the source code, so a model is not open unless the raw data and processing scripts are released. Fully open-data models such as OLMo (AI2) and Pythia (EleutherAI) are offered as proof that the open-data path is viable.

Connections

OpenSourceAI — training data is the OSAID's most contested component
OpenWeights — open-weights releases withhold training data entirely
FreeSoftwareFoundation — argues raw training data must be released
OpenSourceInitiative — requires data information rather than full data
OpenWashing — opaque training data is central to open-washing concerns
