Empirical validation of the model presented in "Natural Language as Code: Large Language Models as Probabilistic Compilers" (Schrimpsher, 2026).
This experiment tests whether LLMs function as probabilistic compilers that translate natural language intent into context-dependent intermediate representations.
- H1: For a fixed natural language intent, LLMs generate multiple distinct intermediate representations across repeated generations and contextual conditions.
- H2: Changes in contextual constraints significantly influence the selected representation, including programming language, framework, and execution model.
- H3: A substantial subset of these representation changes preserves the intended task semantics.
| Factor | Levels | Count |
|---|---|---|
| Task | 10 programming tasks | 10 |
| Context | baseline, performance, memory, database, streaming | 5 |
| Temperature | 0.2 (low), 0.8 (high) | 2 |
| Model | Claude Opus 4.6, GPT-4o | 2 |
| Runs | repetitions per cell | 5 |
Total: 1,000 API calls
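As a sanity check on the cell count, the cross product of the factors above can be enumerated directly. The level names below are placeholders mirroring the design table, not the identifiers actually used in experiment.yaml.

```python
# Sketch only: placeholder level names; the real levels live in experiment.yaml.
from itertools import product

tasks        = [f"task_{i:02d}" for i in range(10)]          # 10 programming tasks
contexts     = ["baseline", "performance", "memory", "database", "streaming"]
temperatures = [0.2, 0.8]
models       = ["claude-opus-4.6", "gpt-4o"]                 # placeholder model IDs
runs         = list(range(5))                                # repetitions per cell

cells = list(product(tasks, contexts, temperatures, models, runs))
assert len(cells) == 1000                                    # 10 * 5 * 2 * 2 * 5
```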
The pipeline runs in four stages:

```
experiment.yaml → generate.py → analyze.py → evaluate.py → report.py
                   (Stage 1)     (Stage 2)     (Stage 3)     (Stage 4)
```
- generate.py — Calls LLM APIs across the experimental matrix, saves raw code outputs
- analyze.py — Classifies outputs (language, framework, abstraction, execution model) via LLM-as-judge
- evaluate.py — Runs automated tests + LLM-as-judge scoring for semantic correctness
- report.py — Statistical analysis: Shannon entropy, Wilcoxon signed-rank, chi-squared tests
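As a rough illustration of the LLM-as-judge classification step, the sketch below sends one generated snippet to a judge model and parses a small JSON verdict. The prompt wording, judge model ID, and field names are assumptions for illustration only; the actual logic lives in analyze.py and providers.py.

```python
# Illustrative LLM-as-judge sketch (not the analyze.py implementation).
# Assumes the Anthropic Python SDK and ANTHROPIC_API_KEY in the environment;
# the model ID and JSON schema are placeholders.
import json
import anthropic

JUDGE_PROMPT = (
    "Classify the following generated code. Respond with JSON only, "
    "with keys: language, framework, abstraction, execution_model.\n\n{code}"
)

def classify(code: str) -> dict:
    client = anthropic.Anthropic()          # reads ANTHROPIC_API_KEY
    msg = client.messages.create(
        model="claude-opus-4-6",            # placeholder judge model ID
        max_tokens=300,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(code=code)}],
    )
    return json.loads(msg.content[0].text)  # assumes the judge returns bare JSON
```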
Set up a virtual environment and install dependencies:

```bash
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Set your API keys:

```bash
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
```

Validate the pipeline end-to-end before committing to the full experiment:
```bash
source venv/bin/activate
python generate.py --config pilot.yaml
python analyze.py --config pilot.yaml
python evaluate.py --config pilot.yaml
python report.py
```

The pilot runs 2 tasks x 3 contexts x 1 temperature x 1 model x 3 runs = 18 API calls. Review the results in results/ before proceeding.
Then run the full experiment:

```bash
source venv/bin/activate
python generate.py --config experiment.yaml
python analyze.py --config experiment.yaml
python evaluate.py --config experiment.yaml
python report.py
```

Each stage is resumable: if it crashes or you stop it, re-run the same command and it picks up where it left off.
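The exact resume logic lives in csv_utils.py; as a sketch of the general pattern, assuming output rows are keyed by the experimental cell, completed cells can be read back from the CSV and skipped on the next run. All names below are hypothetical.

```python
# Hypothetical resume-by-key sketch (not the actual csv_utils.py code).
import csv
import os

KEY_FIELDS = ("task", "context", "temperature", "model", "run")  # assumed columns

def completed_keys(path: str) -> set:
    """Cells already present in an output CSV from a previous (partial) run."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="") as f:
        return {tuple(row[k] for k in KEY_FIELDS) for row in csv.DictReader(f)}

def remaining(cells, path: str):
    """Filter the experimental matrix down to cells not yet written."""
    done = completed_keys(path)
    return [c for c in cells if tuple(str(v) for v in c) not in done]
```

Keying on the full experimental cell means a re-run only issues API calls for cells that are still missing.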
All output goes to results/:
| File | Description |
|---|---|
| `raw_outputs.csv` | Generated code + metadata from Stage 1 |
| `analyzed.csv` | + language/framework/abstraction classification from Stage 2 |
| `evaluated.csv` | + test results and semantic correctness from Stage 3 |
| `summary_stats.csv` | Statistical test results for H1/H2/H3 from Stage 4 |
Stage 4 also prints a human-readable summary to the console.
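For a sense of what the Stage 4 tests look like, the sketch below computes Shannon entropy of the language label per task (H1) and a chi-squared test of language against context (H2) from analyzed.csv. The column names, and whether report.py uses exactly these calls, are assumptions.

```python
# Illustrative statistics only; report.py's exact analysis may differ.
# Assumes analyzed.csv has 'task', 'context', and 'language' columns.
import pandas as pd
from scipy.stats import chi2_contingency, entropy

df = pd.read_csv("results/analyzed.csv")

# H1: representation diversity = Shannon entropy (bits) of the language label per task.
h1 = df.groupby("task")["language"].apply(
    lambda s: entropy(s.value_counts(normalize=True), base=2)
)
print(h1)

# H2: does context shift the chosen language? Chi-squared on the contingency table.
table = pd.crosstab(df["context"], df["language"])
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi-squared = {chi2:.2f}, dof = {dof}, p = {p:.4g}")
```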
To run the pipeline's unit tests:

```bash
source venv/bin/activate
python -m pytest tests_harness/ -v
```

Repository layout:

```
6GL/
├── experiment.yaml   # full experiment config (1,000 calls)
├── pilot.yaml        # pilot config (18 calls)
├── generate.py       # Stage 1: LLM code generation
├── analyze.py        # Stage 2: output classification
├── evaluate.py       # Stage 3: execution + scoring
├── report.py         # Stage 4: statistical analysis
├── providers.py      # API abstraction (Anthropic + OpenAI)
├── csv_utils.py      # CSV I/O with resumability
├── sandbox.py        # sandboxed code execution
├── tests/            # task fixtures for automated evaluation
├── tests_harness/    # unit tests for the pipeline
├── results/          # output CSVs (generated at runtime)
└── 6GL.pdf           # the paper
```