Pāṇini wrote 3,959 rules with zero redundancy. This is an attempt to apply that principle to agentic AI.
Agentic AI systems fail in characteristic ways that are not failures of capability — they are failures of precision:
- Tool calls with underspecified parameters
- Error messages that don't name the cause
- Multi-agent handoffs that lose the thread
- Plans that state *what* without stating *why* or *with what*
These failures share a root: RLHF training rewards responses that sound helpful, and hedging sounds helpful. In a single-turn chat this is tolerable. In an agentic loop it compounds — ambiguous output at step 3 becomes a misrouted tool call at step 4 becomes silent failure at step 5.
The proposal: apply Pāṇinian kāraka structure as a prompt-level constraint on agentic AI output. Force explicit attribution of agent, object, instrument, and cause. Measure whether this reduces ambiguity and improves causal traceability in agentic responses.
This repo is the experiment.
The intuition is structural: Sanskrit grammatical analysis (especially kāraka relations) is built to encode who did what, by what means, for what purpose, and because of what cause with minimal ambiguity. That is exactly the failure surface in agentic systems.
In modern agent loops, failures are usually not "the model is incapable"; failures are "the relation between action, object, tool, and cause is implicit." We use Pāṇinian roles as a compact relational schema for AI outputs.
What we tested:
- Whether this schema increases explicit causal structure (`karaka_score`)
- Whether it reduces hedge-heavy ambiguity (`ambiguity_count`)
- What token-cost tradeoff it introduces vs terse prompting
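As a rough sketch of the second metric, a hedge-counting `ambiguity_count` could be implemented as below. The phrase list and scoring rule are illustrative assumptions, not the repo's actual implementation:

```python
import re

# Illustrative hedge phrases; the real eval may use a different list.
HEDGES = [
    "might", "maybe", "perhaps", "possibly", "it seems",
    "could be", "i think", "likely", "probably",
]

def ambiguity_count(text: str) -> int:
    """Count hedge-phrase occurrences in a response, case-insensitively."""
    lowered = text.lower()
    return sum(
        len(re.findall(r"\b" + re.escape(phrase) + r"\b", lowered))
        for phrase in HEDGES
    )
```

A response like "It might work, or maybe the path is wrong" scores 2; a kāraka-structured line like `apādāna: encoding-mismatch` scores 0.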
What we found:
- Kāraka completeness improves strongly across providers
- Ambiguity drops most on Anthropic runs
- Token use increases, especially on Anthropic, which is the cost of explicit traceability
Pāṇini's kāraka system describes six relational roles between an action and its participants:
| Kāraka | Role | In an agentic response |
|---|---|---|
| kartā | agent | which component / model / agent acted |
| karma | object | what was acted upon |
| karaṇa | instrument | which tool / mechanism was used |
| sampradāna | purpose | what downstream goal this serves |
| apādāna | source/cause | root cause / where failure originated |
| adhikaraṇa | locus | which step / service / file |
A response that makes these explicit is harder to misinterpret. An orchestrator acting on kāraka-complete output has less ambiguity to resolve downstream.
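For orchestrator code, the six roles map naturally onto a typed record. The sketch below (transliterated field names, assumed scoring rule) shows one way a kāraka frame and a `karaka_score`-style completeness check could look; it is illustrative, not this repo's schema:

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class KarakaFrame:
    """One action, annotated with explicit Pāṇinian roles."""
    karta: str                        # agent: which component acted
    karma: str                        # object: what was acted upon
    karana: Optional[str] = None      # instrument: tool/mechanism used
    sampradana: Optional[str] = None  # purpose: downstream goal served
    apadana: Optional[str] = None     # source/cause: where it originated
    adhikarana: Optional[str] = None  # locus: which step/service/file

    def completeness(self) -> int:
        """Number of roles explicitly filled (0-6)."""
        return sum(1 for f in fields(self) if getattr(self, f.name))

frame = KarakaFrame(
    karta="file-reader-tool",
    karma="file content",
    karana="read operation",
    adhikarana="tool output layer",
)
```

An orchestrator can route on `completeness()` directly, e.g. refusing a handoff whose frame omits apādāna after a failure.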
See docs/idea-sanskrit-grammar.md for the intuition, hypothesis framing, and findings-backed interpretation.
These are real outputs from the eval run (Anthropic / claude-haiku-4-5-20251001).
File-reader tool returns empty response with no error
Terse ("Be concise.") — karaka_score: 1
Possible causes of empty response (no error):
1. File is empty or contains only whitespace
2. File encoding mismatch (binary read as text)
3. File path incorrect but tool didn't validate strictly
4. Permissions issue — tool succeeds but returns empty
Pāṇini (shuddha) — karaka_score: 3
```
kartā: file-reader-tool
karma: file content
karaṇa: read operation
adhikaraṇa: tool output layer

Causes by apādāna (root):

File state:
- empty-file — zero bytes, valid read, no error signal
- race-condition — file deleted between invocation and read
- permission-revocation — read succeeds, returns empty (tool-specific)
- directory-not-file — some tools return empty instead of error

Tool layer:
- encoding-mismatch — binary file read as text → empty decode
- path-resolution-failure — relative path, wrong cwd
```
Two agents disagreeing on world state
Terse — karaka_score: 1
Resolving Agent Disagreement:
Treat it as missing information rather than picking a winner.
1. Verify against ground truth — query a definitive source
2. Check timestamps — who has fresher information?
3. Identify which agent's context is stale
Pāṇini (shuddha) — karaka_score: 3
```
kartā: orchestrator
karma: state disagreement between agents
karaṇa: truth-grounding mechanism
adhikaraṇa: multi-agent coordination layer

Resolution hierarchy (in order):
1. Shared observable ground truth (strongest)
   kartā: orchestrator
   karaṇa: query external system of record
   — if both agents claim different things about current DB state → query DB
2. Recency tiebreak
   karaṇa: timestamp comparison
   apādāna: one agent has stale context-window snapshot
3. Confidence-weighted merge
   karaṇa: structured HandoffMessage with confidence + source fields
   sampradāna: downstream agent receives merged view with provenance
```
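That resolution hierarchy can be sketched as orchestrator logic. The `Claim` type and merge rule below are hypothetical illustrations of the three tiers, not an API from this repo:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Claim:
    agent: str         # kartā: which agent asserts this
    value: str         # karma: the disputed world state
    timestamp: float   # for the recency tiebreak
    confidence: float  # for the weighted merge

def resolve(a: Claim, b: Claim,
            ground_truth: Optional[Callable[[], str]] = None) -> str:
    # 1. Shared observable ground truth (strongest): query system of record
    if ground_truth is not None:
        return ground_truth()
    # 2. Recency tiebreak: prefer the fresher context snapshot
    if a.timestamp != b.timestamp:
        return max(a, b, key=lambda c: c.timestamp).value
    # 3. Confidence-weighted fallback: keep the stronger claim, with provenance
    winner = max(a, b, key=lambda c: c.confidence)
    return f"{winner.value} (source: {winner.agent}, conf={winner.confidence})"
```

Tier 3 here degrades to keeping the higher-confidence claim with provenance attached; a real merge would carry both values forward in a structured HandoffMessage.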
Context growing too large across agents
Terse — karaka_score: 2
Compression strategies:
1. Summarization — replace verbose exchanges with key facts
2. Structured extraction — convert free-form text to templates
3. Semantic deduplication — remove repeated information
Pāṇini (shuddha) — karaka_score: 3
The problem is not size — it's redundancy of roles.
Cut it by restructuring around what matters for the next agent.
Three compression moves:
```
1. Replace narratives with kāraka tuples
   Before: "The user asked the search-agent to find information about
            authentication patterns. The search-agent used the web-search-tool..."
   After:  {kartā: search-agent, karaṇa: web-search-tool,
            karma: auth-pattern-results, sampradāna: planner-agent}

2. Drop intermediate states, keep terminal states
   apādāna: intermediate steps are only useful if they explain a failure
   karaṇa: keep only the last successful state + error states

3. Compress by kāraka role, not by token count
   kartā list → deduplicate agents
   karma list → deduplicate affected resources
   apādāna list → the causal chain the next agent needs to avoid
```
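Moves 1 and 3 can be sketched together: represent each trace step as a kāraka tuple, then deduplicate by role before handoff. The dict shape and role keys here are illustrative assumptions, not this repo's wire format:

```python
def compress_handoff(steps: list[dict]) -> dict:
    """Collapse a trace of kāraka tuples into a role-deduplicated handoff.

    Each step is a dict with optional keys: karta, karma, karana,
    sampradana, apadana. First-seen order is preserved.
    """
    def dedup(role: str) -> list[str]:
        seen: list[str] = []
        for step in steps:
            value = step.get(role)
            if value and value not in seen:
                seen.append(value)
        return seen

    return {
        "agents": dedup("karta"),      # kartā list: who acted
        "resources": dedup("karma"),   # karma list: what was touched
        "causes": dedup("apadana"),    # apādāna list: chain to avoid repeating
    }
```

The compressed handoff grows with the number of distinct roles, not with the number of steps, which is the point of move 3.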
Eval run: 40 agentic prompts, three arms (baseline / terse / panini), judge-model scoring.
**Kāraka completeness (share of responses with a full causal chain, score=3):**

| Provider / Model | Panini score=3 | Terse score=3 | Gain |
|---|---|---|---|
| Anthropic / claude-haiku-4-5-20251001 | 87% (34/39) | 30% (12/39) | +57pp |
| OpenAI / gpt-4o-mini | 95% (38/40) | 65% (26/40) | +30pp |
**Ambiguity (mean hedges per response; lower is better):**

| Provider / Model | Baseline | Terse | Panini | Δ vs terse |
|---|---|---|---|---|
| Anthropic / claude-haiku-4-5-20251001 | 4.5 | 5.7 | 4.0 | −1.7 |
| OpenAI / gpt-4o-mini | 0.9 | 0.4 | 0.3 | −0.1 |
**Token use (mean tokens per response):**

| Provider / Model | Baseline | Terse | Panini | Δ vs terse |
|---|---|---|---|---|
| Anthropic / claude-haiku-4-5-20251001 | 681 | 467 | 814 | +347 (+74%) |
| OpenAI / gpt-4o-mini | 475 | 185 | 203 | +18 (+10%) |
Panini produces more tokens, not fewer — on Anthropic models. The skill causes haiku to write more complete, structured responses rather than compressed ones. This is expected: kāraka completeness requires stating agent, object, and cause explicitly. The token cost is the precision cost.
On OpenAI models, the token increase is modest (+10%). gpt-4o-mini is already terse by default; the skill adds structure without dramatically expanding length.
Kāraka score gain is consistent across both providers. 87–95% of panini responses achieve score=3 (full causal chain) vs 30–65% for plain terse instruction. This is the claim the skill makes: explicit causal attribution, not compression.
Ambiguity reduction is significant on Anthropic (−1.7 hedges/response), negligible on OpenAI. OpenAI's terse baseline already has low hedging; Anthropic's haiku hedges more by default.
Interpretation: panini mode trades tokens for traceability. Whether that trade is worth it depends on your agentic loop — in a 10-step pipeline where step 4 failure attribution matters, the cost is likely worth it. In a simple single-turn workflow, use terse.
Provider: anthropic · Model: claude-haiku-4-5-20251001
| Task | Baseline | Terse | Pāṇini | Δ vs terse |
|---|---|---|---|---|
| Agent tool call timeout | 325 | 297 | 472 | -175 (-59%) |
| Multi-agent state conflict | 674 | 415 | 655 | -240 (-58%) |
| Silent tool failure | 384 | 348 | 451 | -103 (-30%) |
| Hallucinated tool name | 985 | 529 | 1024 | -495 (-94%) |
| Ambiguous handoff | 459 | 347 | 1024 | -677 (-195%) |
| Agent loop detection | 533 | 320 | 404 | -84 (-26%) |
| Context window overflow | 476 | 364 | 1024 | -660 (-181%) |
| Confidence communication | 1024 | 479 | 884 | -405 (-85%) |
| Partial completion handoff | 1024 | 454 | 1024 | -570 (-126%) |
| Human intervention trigger | 469 | 369 | 967 | -598 (-162%) |
| Average | 635 | 392 | 793 | -401 (-102%) |
Provider: openai · Model: gpt-4o-mini
| Task | Baseline | Terse | Pāṇini | Δ vs terse |
|---|---|---|---|---|
| Agent tool call timeout | 369 | 57 | 87 | -30 (-53%) |
| Multi-agent state conflict | 407 | 94 | 356 | -262 (-279%) |
| Silent tool failure | 345 | 160 | 104 | +56 (+35%) |
| Hallucinated tool name | 363 | 216 | 194 | +22 (+10%) |
| Ambiguous handoff | 531 | 212 | 205 | +7 (+3%) |
| Agent loop detection | 505 | 227 | 67 | +160 (+70%) |
| Context window overflow | 389 | 173 | 291 | -118 (-68%) |
| Confidence communication | 601 | 206 | 233 | -27 (-13%) |
| Partial completion handoff | 449 | 135 | 155 | -20 (-15%) |
| Human intervention trigger | 528 | 198 | 290 | -92 (-46%) |
| Average | 449 | 168 | 198 | -30 (-18%) |
Note: negative Δ means panini used more tokens than terse. This is expected — see interpretation above.
One line:
```bash
bash <(curl -s https://raw.githubusercontent.com/dpaul0501/panini/main/install.sh)
```

npx skills:

```bash
npx skills add dpaul0501/panini
```

Always-on (add to CLAUDE.md or agent system prompt):
```
Apply Pāṇini precision mode every response:
- No hedging unless epistemic uncertainty is the actual claim.
- Prefer compound nouns: search-agent not "the agent that does search".
- For errors and failures: name kartā (agent), karma (object), apādāna (cause).
- Code blocks and tool parameters unchanged. Technical terms exact.
- Off: "stop panini" / "normal mode".
```
| Command | Effect |
|---|---|
| `/panini` or "panini mode" | Activate (default: shuddha) |
| `/panini viveka` | Lite — remove hedging only, readable prose |
| `/panini shuddha` | Full — compositionality + kāraka structure |
| `/panini sutra` | Ultra — aphoristic, compound-first |
| "stop panini" | Deactivate |
```bash
git clone https://github.com/dpaul0501/panini && cd panini
uv sync

# Anthropic
ANTHROPIC_API_KEY=... uv run python evals/llm_run.py --provider anthropic

# OpenAI
OPENAI_API_KEY=... uv run python evals/llm_run.py --provider openai

# Any OpenAI-compatible endpoint (Ollama, Groq, Together, etc.)
uv run python evals/llm_run.py --provider openai \
  --base-url http://localhost:11434/v1 --model llama3

# Token benchmark + README patch
ANTHROPIC_API_KEY=... uv run python benchmarks/run.py --provider anthropic --update-readme

# Count tokens from snapshot (no API cost)
uv run --with tiktoken python evals/measure.py
```

Results from other models and providers welcome — see CONTRIBUTING.md.
```
panini/
├── skills/panini/SKILL.md          # The skill — only file to edit for behavior
├── evals/
│   ├── llm_run.py                  # Three-arm precision eval (multi-provider)
│   ├── measure.py                  # Token counter (no API cost, tiktoken)
│   ├── prompts/en.txt              # 40 agentic prompts — append only
│   ├── snapshots/results.json      # Committed results
│   └── README.md                   # Eval methodology
├── benchmarks/
│   ├── run.py                      # Token benchmark + README patcher (multi-provider)
│   ├── results/                    # Per-provider benchmark JSON
│   └── README.md                   # Benchmark methodology
├── docs/idea-sanskrit-grammar.md   # Sanskrit grammar intuition + hypothesis + findings
├── install.sh                      # One-line installer
├── pyproject.toml                  # uv/pip config
├── CLAUDE.md                       # Instructions for Claude working on this repo
├── CONTRIBUTING.md                 # How to contribute results
└── README.md                       # This file
```
- The judge model scoring kāraka is itself an LLM — scores have variance, especially on short responses. Human annotation would improve calibration.
- Panini mode increases token count on Anthropic models (+74% vs terse on haiku). The net economics depend on loop depth and how many steps benefit from explicit causal attribution.
- Whether kāraka-complete responses actually reduce downstream agent failure rates in real pipelines remains unmeasured. That is the longitudinal experiment this repo is building toward.
- Only tested in English. Kāraka structure applied to non-English agentic prompts is unexplored.
- The benchmark tasks are synthetic. Results on production agentic workloads may differ.
MIT