6GL: LLMs as Probabilistic Compilers

Empirical validation of the model presented in "Natural Language as Code: Large Language Models as Probabilistic Compilers" (Schrimpsher, 2026).

This experiment tests whether LLMs function as probabilistic compilers that translate natural language intent into context-dependent intermediate representations.

Hypotheses

  • H1: For a fixed natural language intent, LLMs generate multiple distinct intermediate representations across repeated generations and contextual conditions.
  • H2: Changes in contextual constraints significantly influence the selected representation, including programming language, framework, and execution model.
  • H3: A substantial subset of these representation changes preserves the intended task semantics.

Experimental Design

Factor      | Levels                                              | Count
Task        | 10 programming tasks                                | 10
Context     | baseline, performance, memory, database, streaming  | 5
Temperature | 0.2 (low), 0.8 (high)                               | 2
Model       | Claude Opus 4.6, GPT-4o                             | 2
Runs        | repetitions per cell                                | 5

Total: 10 × 5 × 2 × 2 × 5 = 1,000 API calls
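
The 1,000 cells are simply the Cartesian product of the factor levels. As a sanity check, a sketch like the following reproduces the count (factor values are copied from the table above; the variable and task names are illustrative, not the repo's config keys):

from itertools import product

# Factor levels from the design table above; task names are placeholders.
tasks = [f"task_{i:02d}" for i in range(1, 11)]            # 10 programming tasks
contexts = ["baseline", "performance", "memory", "database", "streaming"]
temperatures = [0.2, 0.8]
models = ["Claude Opus 4.6", "GPT-4o"]
runs = range(1, 6)                                         # 5 repetitions per cell

cells = list(product(tasks, contexts, temperatures, models, runs))
assert len(cells) == 1000                                  # 10 * 5 * 2 * 2 * 5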

Pipeline

experiment.yaml → generate.py → analyze.py → evaluate.py → report.py
                  (Stage 1)      (Stage 2)    (Stage 3)     (Stage 4)
  1. generate.py — Calls LLM APIs across the experimental matrix, saves raw code outputs
  2. analyze.py — Classifies outputs (language, framework, abstraction, execution model) via LLM-as-judge
  3. evaluate.py — Runs automated tests + LLM-as-judge scoring for semantic correctness
  4. report.py — Statistical analysis: Shannon entropy, Wilcoxon signed-rank, and chi-squared tests (the H1 entropy measure is sketched below)
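
For H1, a natural summary statistic is the Shannon entropy of the representations observed within one task/context cell. A minimal sketch of that calculation (the label format is an assumption, not the repo's actual schema):

import math
from collections import Counter

def shannon_entropy(labels):
    """Entropy in bits of the distribution of representation labels for one cell."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Example: 5 runs of one prompt yielding 3 distinct language/framework combinations.
print(shannon_entropy(["python/flask", "python/flask", "go/net-http",
                       "python/fastapi", "python/flask"]))   # ~1.37 bits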

Setup

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Set your API keys:

export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
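
providers.py wraps both vendors behind a single interface. A rough sketch of what such a wrapper can look like (the function signature and defaults here are illustrative, not the repo's actual API):

import os
from anthropic import Anthropic
from openai import OpenAI

def generate(provider: str, model: str, prompt: str, temperature: float) -> str:
    """Send one prompt to the chosen provider and return the raw completion text."""
    if provider == "anthropic":
        client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
        msg = client.messages.create(
            model=model,
            max_tokens=2048,
            temperature=temperature,
            messages=[{"role": "user", "content": prompt}],
        )
        return msg.content[0].text
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    resp = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content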

Running the Pilot (18 calls)

Validate the pipeline end-to-end before committing to the full experiment:

source venv/bin/activate
python generate.py --config pilot.yaml
python analyze.py --config pilot.yaml
python evaluate.py --config pilot.yaml
python report.py

The pilot runs 2 tasks × 3 contexts × 1 temperature × 1 model × 3 runs = 18 API calls. Review the results in results/ before proceeding.

Running the Full Experiment (1,000 calls)

source venv/bin/activate
python generate.py --config experiment.yaml
python analyze.py --config experiment.yaml
python evaluate.py --config experiment.yaml
python report.py

Each stage is resumable — if it crashes or you stop it, re-run the same command and it picks up where it left off.
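
A common way to implement that kind of resumability (a sketch of the general pattern, not necessarily how csv_utils.py does it) is to give every cell of the matrix a stable key and skip keys already present in the output CSV:

import csv
import os

def completed_keys(path, key_field="cell_id"):
    """Return the set of cell ids already written, so a re-run can skip them."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="") as f:
        return {row[key_field] for row in csv.DictReader(f)}

def append_row(path, row, fieldnames):
    """Append one result row, writing the header only if the file is new."""
    new_file = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if new_file:
            writer.writeheader()
        writer.writerow(row)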

Results

All output goes to results/:

File              | Description
raw_outputs.csv   | Generated code + metadata from Stage 1
analyzed.csv      | raw_outputs.csv columns plus language/framework/abstraction classification from Stage 2
evaluated.csv     | analyzed.csv columns plus test results and semantic correctness from Stage 3
summary_stats.csv | Statistical test results for H1/H2/H3 from Stage 4

Stage 4 also prints a human-readable summary to the console.
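
The CSVs are also easy to inspect directly. For example (column names such as task and language are assumptions about the schema; adjust to whatever the headers actually are):

import pandas as pd

analyzed = pd.read_csv("results/analyzed.csv")
# How many distinct target languages each task produced across all conditions.
print(analyzed.groupby("task")["language"].nunique())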

Tests

source venv/bin/activate
python -m pytest tests_harness/ -v

Project Structure

6GL/
├── experiment.yaml     # full experiment config (1,000 calls)
├── pilot.yaml          # pilot config (18 calls)
├── generate.py         # Stage 1: LLM code generation
├── analyze.py          # Stage 2: output classification
├── evaluate.py         # Stage 3: execution + scoring
├── report.py           # Stage 4: statistical analysis
├── providers.py        # API abstraction (Anthropic + OpenAI)
├── csv_utils.py        # CSV I/O with resumability
├── sandbox.py          # sandboxed code execution (sketched below)
├── tests/              # task fixtures for automated evaluation
├── tests_harness/      # unit tests for the pipeline
├── results/            # output CSVs (generated at runtime)
└── 6GL.pdf             # the paper
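
sandbox.py isolates the execution of generated code. The simplest version of that idea, shown here only as an illustration (a subprocess with a timeout; the repo's actual isolation may be stricter), looks like:

import subprocess
import sys
import tempfile

def run_sandboxed(code: str, timeout: float = 10.0):
    """Write generated code to a temp file and run it in a separate interpreter."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, text=True, timeout=timeout)
        return proc.returncode, proc.stdout, proc.stderr
    except subprocess.TimeoutExpired:
        return None, "", "timed out"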

About

Exploring natural language as code and LLM-driven programming models (6GL), treating natural language as an intermediate representation for the design and execution of software systems.
