Empirical validation of the model presented in "Natural Language as Code: Large Language Models as Probabilistic Compilers" (Schrimpsher, 2026).
This experiment tests whether LLMs function as probabilistic compilers that translate natural language intent into context-dependent intermediate representations.
- H1: For a fixed natural language intent, LLMs generate multiple distinct intermediate representations across repeated generations and contextual conditions.
- H2: Changes in contextual constraints significantly influence the selected representation, including programming language, framework, and execution model.
- H3: A substantial subset of these representation changes preserves the intended task semantics.
| Factor | Levels | Count |
|---|---|---|
| Task | 10 programming tasks | 10 |
| Context | baseline, performance, memory, database, streaming | 5 |
| Temperature | 0.2 (low), 0.8 (high) | 2 |
| Model | Claude Opus 4.6, GPT-4o | 2 |
| Runs | repetitions per cell | 5 |
Total: 1,000 API calls
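As a sanity check on the cell count, the cross product of the factors above can be enumerated directly. The level names below are placeholders mirroring the design table, not the identifiers actually used in experiment.yaml.

```python
# Sketch only: placeholder level names; the real levels live in experiment.yaml.
from itertools import product

tasks        = [f"task_{i:02d}" for i in range(10)]          # 10 programming tasks
contexts     = ["baseline", "performance", "memory", "database", "streaming"]
temperatures = [0.2, 0.8]
models       = ["claude-opus-4.6", "gpt-4o"]                 # placeholder model IDs
runs         = list(range(5))                                # repetitions per cell

cells = list(product(tasks, contexts, temperatures, models, runs))
assert len(cells) == 1000                                    # 10 * 5 * 2 * 2 * 5
```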
The pipeline runs in four stages:

```
experiment.yaml → generate.py → analyze.py → evaluate.py → report.py
                   (Stage 1)     (Stage 2)     (Stage 3)     (Stage 4)
```
- generate.py — Calls LLM APIs across the experimental matrix, saves raw code outputs
- analyze.py — Classifies outputs (language, framework, abstraction, execution model) via LLM-as-judge
- evaluate.py — Runs automated tests + LLM-as-judge scoring for semantic correctness
- report.py — Statistical analysis: Shannon entropy, Wilcoxon signed-rank, chi-squared tests
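As a rough illustration of the LLM-as-judge classification step, the sketch below sends one generated snippet to a judge model and parses a small JSON verdict. The prompt wording, judge model ID, and field names are assumptions for illustration only; the actual logic lives in analyze.py and providers.py.

```python
# Illustrative LLM-as-judge sketch (not the analyze.py implementation).
# Assumes the Anthropic Python SDK and ANTHROPIC_API_KEY in the environment;
# the model ID and JSON schema are placeholders.
import json
import anthropic

JUDGE_PROMPT = (
    "Classify the following generated code. Respond with JSON only, "
    "with keys: language, framework, abstraction, execution_model.\n\n{code}"
)

def classify(code: str) -> dict:
    client = anthropic.Anthropic()          # reads ANTHROPIC_API_KEY
    msg = client.messages.create(
        model="claude-opus-4-6",            # placeholder judge model ID
        max_tokens=300,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(code=code)}],
    )
    return json.loads(msg.content[0].text)  # assumes the judge returns bare JSON
```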
Set up a virtual environment and install dependencies:

```bash
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Set your API keys:

```bash
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
```

Validate the pipeline end-to-end before committing to the full experiment:
```bash
source venv/bin/activate
python generate.py --config pilot.yaml
python analyze.py --config pilot.yaml
python evaluate.py --config pilot.yaml
python report.py
```

The pilot runs 2 tasks x 3 contexts x 1 temperature x 1 model x 3 runs = 18 API calls. Review the results in results/ before proceeding.
Then run the full experiment:

```bash
source venv/bin/activate
python generate.py --config experiment.yaml
python analyze.py --config experiment.yaml
python evaluate.py --config experiment.yaml
python report.py
```

Each stage is resumable: if it crashes or you stop it, re-run the same command and it picks up where it left off.
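The exact resume logic lives in csv_utils.py; as a sketch of the general pattern, assuming output rows are keyed by the experimental cell, completed cells can be read back from the CSV and skipped on the next run. All names below are hypothetical.

```python
# Hypothetical resume-by-key sketch (not the actual csv_utils.py code).
import csv
import os

KEY_FIELDS = ("task", "context", "temperature", "model", "run")  # assumed columns

def completed_keys(path: str) -> set:
    """Cells already present in an output CSV from a previous (partial) run."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="") as f:
        return {tuple(row[k] for k in KEY_FIELDS) for row in csv.DictReader(f)}

def remaining(cells, path: str):
    """Filter the experimental matrix down to cells not yet written."""
    done = completed_keys(path)
    return [c for c in cells if tuple(str(v) for v in c) not in done]
```

Keying on the full experimental cell means a re-run only issues API calls for cells that are still missing.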
All output goes to results/:
| File | Description |
|---|---|
| `raw_outputs.csv` | Generated code + metadata from Stage 1 |
| `analyzed.csv` | + language/framework/abstraction classification from Stage 2 |
| `evaluated.csv` | + test results and semantic correctness from Stage 3 |
| `summary_stats.csv` | Statistical test results for H1/H2/H3 from Stage 4 |
Stage 4 also prints a human-readable summary to the console.
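For a sense of what the Stage 4 tests look like, the sketch below computes Shannon entropy of the language label per task (H1) and a chi-squared test of language against context (H2) from analyzed.csv. The column names, and whether report.py uses exactly these calls, are assumptions.

```python
# Illustrative statistics only; report.py's exact analysis may differ.
# Assumes analyzed.csv has 'task', 'context', and 'language' columns.
import pandas as pd
from scipy.stats import chi2_contingency, entropy

df = pd.read_csv("results/analyzed.csv")

# H1: representation diversity = Shannon entropy (bits) of the language label per task.
h1 = df.groupby("task")["language"].apply(
    lambda s: entropy(s.value_counts(normalize=True), base=2)
)
print(h1)

# H2: does context shift the chosen language? Chi-squared on the contingency table.
table = pd.crosstab(df["context"], df["language"])
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi-squared = {chi2:.2f}, dof = {dof}, p = {p:.4g}")
```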
To run the pipeline's unit tests:

```bash
source venv/bin/activate
python -m pytest tests_harness/ -v
```

Repository layout:

```
6GL/
├── experiment.yaml   # full experiment config (1,000 calls)
├── pilot.yaml        # pilot config (18 calls)
├── generate.py       # Stage 1: LLM code generation
├── analyze.py        # Stage 2: output classification
├── evaluate.py       # Stage 3: execution + scoring
├── report.py         # Stage 4: statistical analysis
├── providers.py      # API abstraction (Anthropic + OpenAI)
├── csv_utils.py      # CSV I/O with resumability
├── sandbox.py        # sandboxed code execution
├── tests/            # task fixtures for automated evaluation
├── tests_harness/    # unit tests for the pipeline
├── results/          # output CSVs (generated at runtime)
└── 6GL.pdf           # the paper
```