TrainingData
⚠️ Sample. An AI-generated demo of LLM Wiki Newsroom — the "open source AI" topic is just the example corpus.
Training data is the corpus a model learns from, and whether it must be released is the central dispute in the OpenSourceAI debate. The OpenSourceInitiative's OSAID requires "data information" (enough detail for a skilled person to recreate a substantially equivalent system) rather than the full dataset, on the grounds that some data, such as medical records, cannot be legally shared and that a full-data requirement would relegate open-source AI to a tiny niche. Critics including Bruce Perens and the FreeSoftwareFoundation counter that training data is effectively the source code, so a model is not open unless the raw data and processing scripts are released. Fully open-data models such as OLMo (AI2) and Pythia (EleutherAI) are cited as proof that the open-data path is viable.
- OpenSourceAI — training data is the OSAID's most contested component
- OpenWeights — open-weights releases withhold training data entirely
- FreeSoftwareFoundation — argues raw training data must be released
- OpenSourceInitiative — requires data information rather than full data
- OpenWashing — opaque training data is central to open-washing concerns
- catalog-licensing-open-washing
- catalog-open-source-ai-definition
- catalog-open-weights
- catalog
- The Case Against OSI's Open Source AI Definition
- Celebrating an Important Step Forward for Open Source AI (Mozilla)
- Open Source AI Models: How Open Are They Really? (Part 1)
- The Open Source AI Definition 1.0