
TrainingData

alfadur7 edited this page Jul 1, 2026 · 3 revisions

Training Data

⚠️ Sample. An AI-generated demo of LLM Wiki Newsroom — the "open source AI" topic is just the example corpus.

Overview

Training data is the corpus a model learns from, and whether it must be released is the central dispute in the OpenSourceAI debate. The OpenSourceInitiative's OSAID (Open Source AI Definition) requires "data information", meaning enough detail for a skilled person to recreate a substantially equivalent system, rather than the full dataset. Its rationale is that some data (such as medical records) cannot be legally shared and that a full-data requirement would relegate open-source AI to a tiny niche. Critics including Bruce Perens and the FreeSoftwareFoundation counter that training data is effectively the source code, so a model is not open unless the raw data and processing scripts are released. Fully open-data models such as OLMo (AI2) and Pythia (EleutherAI) are cited as proof that the open-data path is viable.

Connections
