Data GPT: A Systematic Laboratory for Data-Centric AI Research

"Reproduction in science is not about running the code without errors; it is about independently verifying core hypotheses and exploring the boundaries of robustness."

📖 Abstract

Data GPT is not merely a codebase for training Large Language Models (LLMs); it is a research instrument designed to cultivate "Research Sense" in the domain of Data-Centric AI.

Moving beyond the engineering paradigm of "building for throughput," this project adopts the scientific paradigm of "dissecting for causality." We systematically reproduce, ablate, and stress-test state-of-the-art data processing algorithms—including Deduplication (MinHash/SemDeDup), Domain Reweighting (DoReMi), and Gradient-based Selection (LESS).

Our primary objective is to quantify the causal link between data artifacts (e.g., near-duplicates, domain mixtures) and training dynamics, with a rigorous emphasis on documenting Negative Results and Failed Attempts.


🔬 Research Philosophy

This project operates on three core epistemological principles:

  1. Hypothesis-Driven Implementation: Every line of code serves to verify a specific hypothesis (e.g., "Does semantic deduplication retain more reasoning diversity than n-gram deduplication?").
  2. The Value of Negative Results: A divergent loss curve is not a bug; it is a data point. We treat failed experiments as critical findings regarding the fragility of published methods.
  3. Controlled Environments: We utilize the Pythia model suite and DataComp benchmarks to ensure that all performance gains are attributable to data interventions, not architectural variances.

🗺️ Research Roadmap & Methodology

We investigate four pillars of Data-Centric AI. Each pillar is associated with specific Research Questions (RQs).

Phase I: The Purity Hypothesis (Deduplication)

Focus: Trade-offs between information density and semantic diversity.

  • Algorithms: MinHash LSH, SemDeDup (Embedding-based).
  • Key Hypotheses:
  • H_1: Aggressive deduplication (low Jaccard threshold) harms generalization by removing "near-duplicate" concept reinforcements.
  • H_2: Semantic deduplication outperforms surface-level deduplication on reasoning tasks (ARC, HellaSwag) but incurs O(N^2) computational cost.
  • Experimental Control: S-Curve manipulation (b, r parameters) to control False Positive Rates; see the sketch below.
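
With b bands of r rows each (b * r = num_perm), two documents with Jaccard similarity s become LSH candidates with probability 1 - (1 - s^r)^b. A minimal sketch of that S-Curve in plain Python (the b=32, r=4 split is illustrative, not a project default):

def collision_prob(s: float, b: int, r: int) -> float:
    """P(a pair with Jaccard similarity s becomes an LSH candidate) = 1 - (1 - s^r)^b."""
    return 1.0 - (1.0 - s**r) ** b

def approx_threshold(b: int, r: int) -> float:
    """Similarity at which the S-Curve rises steeply: roughly (1/b)^(1/r)."""
    return (1.0 / b) ** (1.0 / r)

# Example: 128 permutations split into b=32 bands of r=4 rows
print(approx_threshold(32, 4))     # ~0.42 -> an aggressive, high-recall setting
print(collision_prob(0.8, 32, 4))  # ~1.0 -> 0.8-similar pairs are almost always caught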

Phase II: The Proxy Paradox (Domain Reweighting)

Focus: DoReMi and Distributionally Robust Optimization (DRO).

  • Algorithms: DoReMi (Domain Reweighting with Minimax Optimization); a simplified weight-update sketch follows this list.
  • Key Hypotheses:
  • H_1: The Proxy-Target Gap: Domain weights learned by a small proxy model (e.g., 160M) are suboptimal for larger target models (e.g., 1B) due to capacity constraints.
  • H_2: Tokenizer mismatch between Proxy and Reference models leads to catastrophic weight collapse.
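
A minimal sketch of the DoReMi-style multiplicative weight update, assuming per-domain proxy and reference losses are already computed (the step size and smoothing constant are illustrative, not the paper's tuned values):

import numpy as np

def doremi_step(weights, proxy_loss, ref_loss, eta=1.0, smoothing=1e-3):
    """Upweight domains where the proxy lags the reference (clipped excess loss),
    via an exponentiated-gradient step, then mix with uniform for stability."""
    excess = np.maximum(proxy_loss - ref_loss, 0.0)
    logits = np.log(weights) + eta * excess
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return (1.0 - smoothing) * w + smoothing / len(weights)

# Example: three domains starting from uniform weights
w = doremi_step(np.full(3, 1 / 3), np.array([2.1, 3.0, 2.4]), np.array([2.0, 2.5, 2.6]))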

Phase III: The Gradient Fingerprint (Data Selection)

Focus: Instance-level selection using Influence Functions.

  • Algorithms: LESS (Low-rank gradiEnt Similarity Search); see the selection sketch after this list.
  • Key Hypotheses:
  • H_1: Small-to-Large Transferability: Data selected by a 160M model effectively improves a 1B model.
  • H_2: Gradient-based selection introduces a bias towards short sequence lengths.
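
LESS builds low-dimensional gradient features (LoRA gradients plus random projection) and ranks candidates by similarity to a target-task gradient. A minimal sketch of just the ranking step, assuming the projected features already exist (shapes are illustrative):

import numpy as np

def select_top_k(train_feats, target_feat, k):
    """Cosine similarity between each candidate's gradient feature and the
    target-task gradient feature; keep the k highest-scoring examples."""
    a = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    b = target_feat / np.linalg.norm(target_feat)
    return np.argsort(-(a @ b))[:k]

# Example: 10,000 candidates with 8,192-dim projected gradient features
rng = np.random.default_rng(0)
selected = select_top_k(rng.standard_normal((10_000, 8_192)),
                        rng.standard_normal(8_192), k=1_000)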

Phase IV: The Illusion of Complexity (Synthetic Data)

Focus: Evol-Instruct and complexity metrics.

  • Algorithms: Evol-Instruct, Tree-of-Thought (ToT) generation.
  • Key Hypotheses:
  • H_1: Verbosity ≠ Complexity. Many "evolved" samples increase token count without increasing logic depth (proxied by Type-Token Ratio); a minimal TTR sketch follows.
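
A minimal Type-Token Ratio sketch, using whitespace tokenization for simplicity (the project's analysis may use a subword tokenizer, which changes absolute values):

def type_token_ratio(text: str) -> float:
    """Unique tokens / total tokens; lower values indicate repetitive verbosity."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# A sample that merely repeats itself scores lower than a lexically dense one
print(type_token_ratio("step one think step two think step three think"))         # ~0.56
print(type_token_ratio("decompose the claim, test each premise, then recombine"))  # 1.0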

🧪 Experimental Setup

To ensure strict reproducibility and academic comparability:

Component           Specification                       Rationale
------------------  ----------------------------------  ------------------------------------------------------
Model Architecture  Pythia (160M, 410M, 1B)             Transparent checkpoints; consistent data ordering.
Baseline Dataset    CommonCrawl (DataComp-Small)        Standardized baseline for SOTA comparison.
Evaluation          MMLU, HellaSwag, Perplexity         Standard academic metrics (via lm-evaluation-harness).
Infrastructure      PyTorch, Hydra, Accelerate, WandB   Modern, scalable research stack.
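
Evaluation is scripted rather than manual. A sketch using the lm-evaluation-harness Python API (assuming a >=0.4 release; the task and checkpoint shown are illustrative):

import lm_eval

# Score a Pythia checkpoint on HellaSwag through the harness's high-level API
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag"],
)
print(results["results"]["hellaswag"])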

📂 Repository Structure

The structure reflects the research workflow, separating hypothesis configuration from implementation.

data-gpt/
├── configs/
│   ├── hypothesis/         # Specific configs for testing assumptions (e.g., high_recall_minhash)
│   ├── experimental_setup/ # Control vs. Treatment definitions
│   └── ...
├── reports/
│   ├── 01_deduplication/
│   │   ├── README.md       # Main Findings
│   │   └── failed_attempts.md # Rigorous analysis of what DIDN'T work
│   └── ...
├── notebooks/              # Analysis of S-Curves, Gradient Norms, and Token distributions
├── src/                    # Core implementation of algorithms
└── ...


📉 Featured "Failed Attempts"

Science is built on the graveyard of failed hypotheses. This section documents experiments that failed to reproduce reported results or that diverged during training, with root-cause analyses; see the failed_attempts.md files under reports/.


🚀 Getting Started

Prerequisites

  • Python 3.10+
  • CUDA 11.8+
  • uv (Recommended for dependency management)

Installation

# Clone the laboratory
git clone https://github.com/duoan/data-gpt.git
cd data-gpt

# Set up the environment (uv creates the virtual env and installs dependencies)
uv sync

Running a Control Experiment

# Run a baseline training on raw data (DataComp-Small) using Pythia-160M
python src/train.py experiment=control_baseline model=pythia_160m

Running an Ablation Study (e.g., MinHash)

# Run training with specific deduplication hypothesis
python src/train.py experiment=dedup_minhash_conservative \
    hypothesis.minhash.num_perm=128 \
    hypothesis.minhash.threshold=0.8

📚 Citation & Acknowledgements

This project is deeply inspired by the following works. If you find this repository useful, please cite the original papers:

  • Pythia: Biderman et al., 2023. Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling.
  • DataComp: Gadre et al., 2023. DataComp: In search of the next generation of multimodal datasets.
  • DoReMi: Xie et al., 2023. DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining.
  • LESS: Xia et al., 2024. LESS: Selecting Influential Data for Targeted Instruction Tuning.

Author: Victor An · Last Updated: December 2025
