Pāṇini wrote 3,959 rules with zero redundancy. This is an attempt to apply that principle to agentic AI.
Agentic AI systems fail in characteristic ways that are not failures of capability — they are failures of precision:
- Tool calls with underspecified parameters
- Error messages that don't name the cause
- Multi-agent handoffs that lose the thread
- Plans that state *what* without stating *why* or *with what*
These failures share a root: RLHF training rewards responses that sound helpful, and hedging sounds helpful. In a single-turn chat this is tolerable. In an agentic loop it compounds — ambiguous output at step 3 becomes a misrouted tool call at step 4 becomes silent failure at step 5.
The proposal: apply Pāṇinian kāraka structure as a prompt-level constraint on agentic AI output. Force explicit attribution of agent, object, instrument, and cause. Measure whether this reduces ambiguity and improves causal traceability in agentic responses.
This repo is the experiment.
The intuition is structural: Sanskrit grammatical analysis (especially kāraka relations) is built to encode who did what, by what means, for what purpose, and because of what cause with minimal ambiguity. That is exactly the failure surface in agentic systems.
In modern agent loops, failures are usually not "the model is incapable"; failures are "the relation between action, object, tool, and cause is implicit." We use Pāṇinian roles as a compact relational schema for AI outputs.
What we tested:
- Whether this schema increases explicit causal structure (`karaka_score`)
- Whether it reduces hedge-heavy ambiguity (`ambiguity_count`)
- What token-cost tradeoff it introduces vs terse prompting
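As a rough sketch of the second metric, a hedge-counting `ambiguity_count` could be implemented as below. The phrase list and scoring rule are illustrative assumptions, not the repo's actual implementation:

```python
import re

# Illustrative hedge phrases; the real eval may use a different list.
HEDGES = [
    "might", "maybe", "perhaps", "possibly", "it seems",
    "could be", "i think", "likely", "probably",
]

def ambiguity_count(text: str) -> int:
    """Count hedge-phrase occurrences in a response, case-insensitively."""
    lowered = text.lower()
    return sum(
        len(re.findall(r"\b" + re.escape(phrase) + r"\b", lowered))
        for phrase in HEDGES
    )
```

A response like "It might work, or maybe the path is wrong" scores 2; a kāraka-structured line like `apādāna: encoding-mismatch` scores 0.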
What we found:
- Kāraka completeness improves strongly across providers
- Ambiguity drops most on Anthropic runs
- Token use increases, especially on Anthropic, which is the cost of explicit traceability
Pāṇini's kāraka system describes six relational roles between an action and its participants:
| Kāraka | Role | In an agentic response |
|---|---|---|
| kartā | agent | which component / model / agent acted |
| karma | object | what was acted upon |
| karaṇa | instrument | which tool / mechanism was used |
| sampradāna | purpose | what downstream goal this serves |
| apādāna | source/cause | root cause / where failure originated |
| adhikaraṇa | locus | which step / service / file |
A response that makes these explicit is harder to misinterpret. An orchestrator acting on kāraka-complete output has less ambiguity to resolve downstream.
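For orchestrator code, the six roles map naturally onto a typed record. The sketch below (transliterated field names, assumed scoring rule) shows one way a kāraka frame and a `karaka_score`-style completeness check could look; it is illustrative, not this repo's schema:

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class KarakaFrame:
    """One action, annotated with explicit Pāṇinian roles."""
    karta: str                        # agent: which component acted
    karma: str                        # object: what was acted upon
    karana: Optional[str] = None      # instrument: tool/mechanism used
    sampradana: Optional[str] = None  # purpose: downstream goal served
    apadana: Optional[str] = None     # source/cause: where it originated
    adhikarana: Optional[str] = None  # locus: which step/service/file

    def completeness(self) -> int:
        """Number of roles explicitly filled (0-6)."""
        return sum(1 for f in fields(self) if getattr(self, f.name))

frame = KarakaFrame(
    karta="file-reader-tool",
    karma="file content",
    karana="read operation",
    adhikarana="tool output layer",
)
```

An orchestrator can route on `completeness()` directly, e.g. refusing a handoff whose frame omits apādāna after a failure.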
See docs/idea-sanskrit-grammar.md for the intuition, hypothesis framing, and findings-backed interpretation.
These are real outputs from the eval run (Anthropic / claude-haiku-4-5-20251001).
File-reader tool returns empty response with no error
Terse ("Be concise.") — karaka_score: 1
Possible causes of empty response (no error):
1. File is empty or contains only whitespace
2. File encoding mismatch (binary read as text)
3. File path incorrect but tool didn't validate strictly
4. Permissions issue — tool succeeds but returns empty
Pāṇini (shuddha) — karaka_score: 3
```
kartā: file-reader-tool
karma: file content
karaṇa: read operation
adhikaraṇa: tool output layer

Causes by apādāna (root):

File state:
- empty-file — zero bytes, valid read, no error signal
- race-condition — file deleted between invocation and read
- permission-revocation — read succeeds, returns empty (tool-specific)
- directory-not-file — some tools return empty instead of error

Tool layer:
- encoding-mismatch — binary file read as text → empty decode
- path-resolution-failure — relative path, wrong cwd
```
Two agents disagreeing on world state
Terse — karaka_score: 1
Resolving Agent Disagreement:
Treat it as missing information rather than picking a winner.
1. Verify against ground truth — query a definitive source
2. Check timestamps — who has fresher information?
3. Identify which agent's context is stale
Pāṇini (shuddha) — karaka_score: 3
```
kartā: orchestrator
karma: state disagreement between agents
karaṇa: truth-grounding mechanism
adhikaraṇa: multi-agent coordination layer

Resolution hierarchy (in order):
1. Shared observable ground truth (strongest)
   kartā: orchestrator
   karaṇa: query external system of record
   — if both agents claim different things about current DB state → query DB
2. Recency tiebreak
   karaṇa: timestamp comparison
   apādāna: one agent has stale context-window snapshot
3. Confidence-weighted merge
   karaṇa: structured HandoffMessage with confidence + source fields
   sampradāna: downstream agent receives merged view with provenance
```
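That resolution hierarchy can be sketched as orchestrator logic. The `Claim` type and merge rule below are hypothetical illustrations of the three tiers, not an API from this repo:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Claim:
    agent: str         # kartā: which agent asserts this
    value: str         # karma: the disputed world state
    timestamp: float   # for the recency tiebreak
    confidence: float  # for the weighted merge

def resolve(a: Claim, b: Claim,
            ground_truth: Optional[Callable[[], str]] = None) -> str:
    # 1. Shared observable ground truth (strongest): query system of record
    if ground_truth is not None:
        return ground_truth()
    # 2. Recency tiebreak: prefer the fresher context snapshot
    if a.timestamp != b.timestamp:
        return max(a, b, key=lambda c: c.timestamp).value
    # 3. Confidence-weighted fallback: keep the stronger claim, with provenance
    winner = max(a, b, key=lambda c: c.confidence)
    return f"{winner.value} (source: {winner.agent}, conf={winner.confidence})"
```

Tier 3 here degrades to keeping the higher-confidence claim with provenance attached; a real merge would carry both values forward in a structured HandoffMessage.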
Context growing too large across agents
Terse — karaka_score: 2
Compression strategies:
1. Summarization — replace verbose exchanges with key facts
2. Structured extraction — convert free-form text to templates
3. Semantic deduplication — remove repeated information
Pāṇini (shuddha) — karaka_score: 3
The problem is not size — it's redundancy of roles.
Cut it by restructuring around what matters for the next agent.
Three compression moves:
```
1. Replace narratives with kāraka tuples
   Before: "The user asked the search-agent to find information about
            authentication patterns. The search-agent used the web-search-tool..."
   After:  {kartā: search-agent, karaṇa: web-search-tool,
            karma: auth-pattern-results, sampradāna: planner-agent}

2. Drop intermediate states, keep terminal states
   apādāna: intermediate steps are only useful if they explain a failure
   karaṇa: keep only the last successful state + error states

3. Compress by kāraka role, not by token count
   kartā list → deduplicate agents
   karma list → deduplicate affected resources
   apādāna list → the causal chain the next agent needs to avoid
```
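Moves 1 and 3 can be sketched together: represent each trace step as a kāraka tuple, then deduplicate by role before handoff. The dict shape and role keys here are illustrative assumptions, not this repo's wire format:

```python
def compress_handoff(steps: list[dict]) -> dict:
    """Collapse a trace of kāraka tuples into a role-deduplicated handoff.

    Each step is a dict with optional keys: karta, karma, karana,
    sampradana, apadana. First-seen order is preserved.
    """
    def dedup(role: str) -> list[str]:
        seen: list[str] = []
        for step in steps:
            value = step.get(role)
            if value and value not in seen:
                seen.append(value)
        return seen

    return {
        "agents": dedup("karta"),      # kartā list: who acted
        "resources": dedup("karma"),   # karma list: what was touched
        "causes": dedup("apadana"),    # apādāna list: chain to avoid repeating
    }
```

The compressed handoff grows with the number of distinct roles, not with the number of steps, which is the point of move 3.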
Eval run: 40 agentic prompts, three arms (baseline / terse / panini), judge-model scoring.
**Kāraka completeness (share of responses with a full causal chain, score=3):**

| Provider / Model | Panini score=3 | Terse score=3 | Gain |
|---|---|---|---|
| Anthropic / claude-haiku-4-5-20251001 | 87% (34/39) | 30% (12/39) | +57pp |
| OpenAI / gpt-4o-mini | 95% (38/40) | 65% (26/40) | +30pp |
**Ambiguity (mean hedges per response; lower is better):**

| Provider / Model | Baseline | Terse | Panini | Δ vs terse |
|---|---|---|---|---|
| Anthropic / claude-haiku-4-5-20251001 | 4.5 | 5.7 | 4.0 | −1.7 |
| OpenAI / gpt-4o-mini | 0.9 | 0.4 | 0.3 | −0.1 |
**Token use (mean tokens per response):**

| Provider / Model | Baseline | Terse | Panini | Δ vs terse |
|---|---|---|---|---|
| Anthropic / claude-haiku-4-5-20251001 | 681 | 467 | 814 | +347 (+74%) |
| OpenAI / gpt-4o-mini | 475 | 185 | 203 | +18 (+10%) |
Panini produces more tokens, not fewer — on Anthropic models. The skill causes haiku to write more complete, structured responses rather than compressed ones. This is expected: kāraka completeness requires stating agent, object, and cause explicitly. The token cost is the precision cost.
On OpenAI models, the token increase is modest (+10%). gpt-4o-mini is already terse by default; the skill adds structure without dramatically expanding length.
Kāraka score gain is consistent across both providers. 87–95% of panini responses achieve score=3 (full causal chain) vs 30–65% for plain terse instruction. This is the claim the skill makes: explicit causal attribution, not compression.
Ambiguity reduction is significant on Anthropic (−1.7 hedges/response), negligible on OpenAI. OpenAI's terse baseline already has low hedging; Anthropic's haiku hedges more by default.
Interpretation: panini mode trades tokens for traceability. Whether that trade is worth it depends on your agentic loop — in a 10-step pipeline where step 4 failure attribution matters, the cost is likely worth it. In a simple single-turn workflow, use terse.
Provider: anthropic · Model: claude-haiku-4-5-20251001
| Task | Baseline | Terse | Pāṇini | Δ vs terse |
|---|---|---|---|---|
| Agent tool call timeout | 325 | 297 | 472 | -175 (-59%) |
| Multi-agent state conflict | 674 | 415 | 655 | -240 (-58%) |
| Silent tool failure | 384 | 348 | 451 | -103 (-30%) |
| Hallucinated tool name | 985 | 529 | 1024 | -495 (-94%) |
| Ambiguous handoff | 459 | 347 | 1024 | -677 (-195%) |
| Agent loop detection | 533 | 320 | 404 | -84 (-26%) |
| Context window overflow | 476 | 364 | 1024 | -660 (-181%) |
| Confidence communication | 1024 | 479 | 884 | -405 (-85%) |
| Partial completion handoff | 1024 | 454 | 1024 | -570 (-126%) |
| Human intervention trigger | 469 | 369 | 967 | -598 (-162%) |
| Average | 635 | 392 | 793 | -401 (-102%) |
Provider: openai · Model: gpt-4o-mini
| Task | Baseline | Terse | Pāṇini | Δ vs terse |
|---|---|---|---|---|
| Agent tool call timeout | 369 | 57 | 87 | -30 (-53%) |
| Multi-agent state conflict | 407 | 94 | 356 | -262 (-279%) |
| Silent tool failure | 345 | 160 | 104 | +56 (+35%) |
| Hallucinated tool name | 363 | 216 | 194 | +22 (+10%) |
| Ambiguous handoff | 531 | 212 | 205 | +7 (+3%) |
| Agent loop detection | 505 | 227 | 67 | +160 (+70%) |
| Context window overflow | 389 | 173 | 291 | -118 (-68%) |
| Confidence communication | 601 | 206 | 233 | -27 (-13%) |
| Partial completion handoff | 449 | 135 | 155 | -20 (-15%) |
| Human intervention trigger | 528 | 198 | 290 | -92 (-46%) |
| Average | 449 | 168 | 198 | -30 (-18%) |
Note: negative Δ means panini used more tokens than terse. This is expected — see interpretation above.
One line:
```bash
bash <(curl -s https://raw.githubusercontent.com/dpaul0501/panini/main/install.sh)
```

npx skills:

```bash
npx skills add dpaul0501/panini
```

Always-on (add to CLAUDE.md or agent system prompt):
```
Apply Pāṇini precision mode every response:
- No hedging unless epistemic uncertainty is the actual claim.
- Prefer compound nouns: search-agent not "the agent that does search".
- For errors and failures: name kartā (agent), karma (object), apādāna (cause).
- Code blocks and tool parameters unchanged. Technical terms exact.
- Off: "stop panini" / "normal mode".
```
| Command | Effect |
|---|---|
| `/panini` or "panini mode" | Activate (default: shuddha) |
| `/panini viveka` | Lite — remove hedging only, readable prose |
| `/panini shuddha` | Full — compositionality + kāraka structure |
| `/panini sutra` | Ultra — aphoristic, compound-first |
| "stop panini" | Deactivate |
```bash
git clone https://github.com/dpaul0501/panini && cd panini
uv sync

# Anthropic
ANTHROPIC_API_KEY=... uv run python evals/llm_run.py --provider anthropic

# OpenAI
OPENAI_API_KEY=... uv run python evals/llm_run.py --provider openai

# Any OpenAI-compatible endpoint (Ollama, Groq, Together, etc.)
uv run python evals/llm_run.py --provider openai \
  --base-url http://localhost:11434/v1 --model llama3

# Token benchmark + README patch
ANTHROPIC_API_KEY=... uv run python benchmarks/run.py --provider anthropic --update-readme

# Count tokens from snapshot (no API cost)
uv run --with tiktoken python evals/measure.py
```

Results from other models and providers welcome — see CONTRIBUTING.md.
```
panini/
├── skills/panini/SKILL.md          # The skill — only file to edit for behavior
├── evals/
│   ├── llm_run.py                  # Three-arm precision eval (multi-provider)
│   ├── measure.py                  # Token counter (no API cost, tiktoken)
│   ├── prompts/en.txt              # 40 agentic prompts — append only
│   ├── snapshots/results.json      # Committed results
│   └── README.md                   # Eval methodology
├── benchmarks/
│   ├── run.py                      # Token benchmark + README patcher (multi-provider)
│   ├── results/                    # Per-provider benchmark JSON
│   └── README.md                   # Benchmark methodology
├── docs/idea-sanskrit-grammar.md   # Sanskrit grammar intuition + hypothesis + findings
├── install.sh                      # One-line installer
├── pyproject.toml                  # uv/pip config
├── CLAUDE.md                       # Instructions for Claude working on this repo
├── CONTRIBUTING.md                 # How to contribute results
└── README.md                       # This file
```
- The judge model scoring kāraka is itself an LLM — scores have variance, especially on short responses. Human annotation would improve calibration.
- Panini mode increases token count on Anthropic models (+74% vs terse on haiku). The net economics depend on loop depth and how many steps benefit from explicit causal attribution.
- Whether kāraka-complete responses actually reduce downstream agent failure rates in real pipelines remains unmeasured. That is the longitudinal experiment this repo is building toward.
- Only tested in English. Kāraka structure applied to non-English agentic prompts is unexplored.
- The benchmark tasks are synthetic. Results on production agentic workloads may differ.
MIT