
panini

Pāṇini wrote 3,959 rules with zero redundancy. This is an attempt to apply that principle to agentic AI.


Author

Debjyoti Paul

The hypothesis

Agentic AI systems fail in characteristic ways that are not failures of capability — they are failures of precision:

  • Tool calls with underspecified parameters
  • Error messages that don't name the cause
  • Multi-agent handoffs that lose the thread
  • Plans that state what will be done without stating why or with what

These failures share a root: RLHF training rewards responses that sound helpful, and hedging sounds helpful. In a single-turn chat this is tolerable. In an agentic loop it compounds — ambiguous output at step 3 becomes a misrouted tool call at step 4 becomes silent failure at step 5.

The proposal: apply Pāṇinian kāraka structure as a prompt-level constraint on agentic AI output. Force explicit attribution of agent, object, instrument, and cause. Measure whether this reduces ambiguity and improves causal traceability in agentic responses.

This repo is the experiment.


Why Sanskrit Grammar?

The intuition is structural: Sanskrit grammatical analysis (especially kāraka relations) is built to encode who did what, by what means, for what purpose, and because of what cause with minimal ambiguity. That is exactly the failure surface in agentic systems.

In modern agent loops, failures are usually not "the model is incapable"; failures are "the relation between action, object, tool, and cause is implicit." We use Pāṇinian roles as a compact relational schema for AI outputs.

What we tested:

  • Whether this schema increases explicit causal structure (karaka_score)
  • Whether it reduces hedge-heavy ambiguity (ambiguity_count)
  • What token-cost tradeoff it introduces vs terse prompting

What we found:

  • Kāraka completeness improves strongly across providers
  • Ambiguity drops most on Anthropic runs
  • Token use increases, especially on Anthropic, which is the cost of explicit traceability

What is a kāraka?

Pāṇini's kāraka system describes six relational roles between an action and its participants:

| Kāraka | Role | In an agentic response |
|---|---|---|
| kartā | agent | which component / model / agent acted |
| karma | object | what was acted upon |
| karaṇa | instrument | which tool / mechanism was used |
| sampradāna | purpose | what downstream goal this serves |
| apādāna | source/cause | root cause / where failure originated |
| adhikaraṇa | locus | which step / service / file |

A response that makes these explicit is harder to misinterpret. An orchestrator acting on kāraka-complete output has less ambiguity to resolve downstream.
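As an illustration, a kāraka-complete output can be carried as a small structured record. This is a hypothetical schema sketched for this README, not code from the repo; the class and field names are assumptions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class KarakaFrame:
    """One action with its Pāṇinian relational roles made explicit."""
    karta: str                        # agent: which component / model / agent acted
    karma: str                        # object: what was acted upon
    karana: Optional[str] = None      # instrument: which tool / mechanism was used
    sampradana: Optional[str] = None  # purpose: what downstream goal this serves
    apadana: Optional[str] = None     # source/cause: where failure originated
    adhikarana: Optional[str] = None  # locus: which step / service / file

    def complete(self) -> bool:
        # "kāraka-complete" here means agent, object, and cause are all named
        return all([self.karta, self.karma, self.apadana])

frame = KarakaFrame(
    karta="file-reader-tool",
    karma="file content",
    karana="read operation",
    apadana="encoding-mismatch",
    adhikarana="tool output layer",
)
print(frame.complete())  # True
```

An orchestrator can gate handoffs on `complete()` instead of parsing free prose for the same information.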

See docs/idea-sanskrit-grammar.md for the intuition, hypothesis framing, and findings-backed interpretation.


Before / After

These are real outputs from the eval run (Anthropic / claude-haiku-4-5-20251001).


File-reader tool returns empty response with no error

Terse ("Be concise.") — karaka_score: 1

Possible causes of empty response (no error):
1. File is empty or contains only whitespace
2. File encoding mismatch (binary read as text)
3. File path incorrect but tool didn't validate strictly
4. Permissions issue — tool succeeds but returns empty

Pāṇini (shuddha) — karaka_score: 3

kartā: file-reader-tool
karma: file content
karaṇa: read operation
adhikaraṇa: tool output layer

Causes by apādāna (root):

File state:
- empty-file — zero bytes, valid read, no error signal
- race-condition — file deleted between invocation and read
- permission-revocation — read succeeds, returns empty (tool-specific)
- directory-not-file — some tools return empty instead of error

Tool layer:
- encoding-mismatch — binary file read as text → empty decode
- path-resolution-failure — relative path, wrong cwd

Two agents disagreeing on world state

Terse — karaka_score: 1

Resolving Agent Disagreement:
Treat it as missing information rather than picking a winner.
1. Verify against ground truth — query a definitive source
2. Check timestamps — who has fresher information?
3. Identify which agent's context is stale

Pāṇini (shuddha) — karaka_score: 3

kartā: orchestrator
karma: state disagreement between agents
karaṇa: truth-grounding mechanism
adhikaraṇa: multi-agent coordination layer

Resolution hierarchy (in order):

1. Shared observable ground truth (strongest)
   kartā: orchestrator
   karaṇa: query external system of record
   — if both agents claim different things about current DB state → query DB

2. Recency tiebreak
   karaṇa: timestamp comparison
   apādāna: one agent has stale context-window snapshot

3. Confidence-weighted merge
   karaṇa: structured HandoffMessage with confidence + source fields
   sampradāna: downstream agent receives merged view with provenance
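The structured HandoffMessage named in step 3 is not defined anywhere in this repo; one plausible shape with the confidence and source fields it mentions might look like this (all names here are illustrative assumptions):

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    statement: str     # the asserted fact about world state
    source: str        # system of record or observation backing it
    confidence: float  # 0.0-1.0, the asserting agent's own estimate

@dataclass
class HandoffMessage:
    sender: str                 # karta: which agent is handing off
    recipient: str              # sampradana: which agent receives the merged view
    claims: list[Claim] = field(default_factory=list)

def merge(a: Claim, b: Claim) -> Claim:
    # confidence-weighted tiebreak: keep the higher-confidence claim,
    # but preserve its provenance so the recipient can re-verify
    return a if a.confidence >= b.confidence else b

m = merge(Claim("row count = 42", "agent-a cache", 0.6),
          Claim("row count = 40", "live DB query", 0.95))
msg = HandoffMessage(sender="orchestrator", recipient="planner-agent", claims=[m])
print(m.statement)  # row count = 40
```

The point is that the downstream agent receives a claim plus its provenance, not a bare assertion it cannot audit.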

Context growing too large across agents

Terse — karaka_score: 2

Compression strategies:
1. Summarization — replace verbose exchanges with key facts
2. Structured extraction — convert free-form text to templates
3. Semantic deduplication — remove repeated information

Pāṇini (shuddha) — karaka_score: 3

The problem is not size — it's redundancy of roles.
Cut it by restructuring around what matters for the next agent.

Three compression moves:

1. Replace narratives with kāraka tuples
Before: "The user asked the search-agent to find information about
authentication patterns. The search-agent used the web-search-tool..."
After:  {kartā: search-agent, karaṇa: web-search-tool,
         karma: auth-pattern-results, sampradāna: planner-agent}

2. Drop intermediate states, keep terminal states
apādāna: intermediate steps are only useful if they explain a failure
karaṇa: keep only the last successful state + error states

3. Compress by kāraka role, not by token count
kartā list → deduplicate agents
karma list → deduplicate affected resources
apādāna list → the causal chain the next agent needs to avoid
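The third move (compress by kāraka role, not by token count) could be sketched as follows. This is illustrative only; the dictionary keys and function name are assumptions, not the repo's actual schema:

```python
def compress_history(tuples: list[dict]) -> dict:
    """Compress a list of karaka tuples by deduplicating per role."""
    compressed = {"karta": [], "karma": [], "apadana": []}
    for t in tuples:
        for role in compressed:
            value = t.get(role)
            # deduplicate within each role, preserving first-seen order
            if value and value not in compressed[role]:
                compressed[role].append(value)
    return compressed

history = [
    {"karta": "search-agent", "karma": "auth-pattern-results"},
    {"karta": "search-agent", "karma": "auth-pattern-results"},
    {"karta": "planner-agent", "karma": "plan-v1", "apadana": "stale-cache"},
]
print(compress_history(history))
# → {'karta': ['search-agent', 'planner-agent'], 'karma': ['auth-pattern-results', 'plan-v1'], 'apadana': ['stale-cache']}
```

Repeated narrative collapses to one entry per agent, per resource, and per cause, which is what the next agent actually needs.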

Results

Eval run: 40 agentic prompts, three arms (baseline / terse / panini), judge-model scoring.

Kāraka score (0–3 scale, primary metric)

| Provider / Model | Panini score=3 | Terse score=3 | Gain |
|---|---|---|---|
| Anthropic / claude-haiku-4-5-20251001 | 87% (34/39) | 30% (12/39) | +57pp |
| OpenAI / gpt-4o-mini | 95% (38/40) | 65% (26/40) | +30pp |

Ambiguity reduction (hedges per response, lower = better)

| Provider / Model | Baseline | Terse | Panini | Δ vs terse |
|---|---|---|---|---|
| Anthropic / claude-haiku-4-5-20251001 | 4.5 | 5.7 | 4.0 | −1.7 |
| OpenAI / gpt-4o-mini | 0.9 | 0.4 | 0.3 | −0.1 |

Token count (secondary metric — not the goal)

| Provider / Model | Baseline | Terse | Panini | Δ vs terse |
|---|---|---|---|---|
| Anthropic / claude-haiku-4-5-20251001 | 681 | 467 | 814 | +347 (+74%) |
| OpenAI / gpt-4o-mini | 475 | 185 | 203 | +18 (+10%) |

What the numbers say

Panini produces more tokens, not fewer — on Anthropic models. The skill causes haiku to write more complete, structured responses rather than compressed ones. This is expected: kāraka completeness requires stating agent, object, and cause explicitly. The token cost is the precision cost.

On OpenAI models, the token increase is modest (+10%). gpt-4o-mini is already terse by default; the skill adds structure without dramatically expanding length.

Kāraka score gain is consistent across both providers. 87–95% of panini responses achieve score=3 (full causal chain) vs 30–65% for plain terse instruction. This is the claim the skill makes: explicit causal attribution, not compression.

Ambiguity reduction is significant on Anthropic (−1.7 hedges/response), negligible on OpenAI. OpenAI's terse baseline already has low hedging; Anthropic's haiku hedges more by default.

Interpretation: panini mode trades tokens for traceability. Whether that trade is worth it depends on your agentic loop — in a 10-step pipeline where step 4 failure attribution matters, the cost is likely worth it. In a simple single-turn workflow, use terse.


Benchmark (token counts, 10 representative tasks)

Provider: anthropic · Model: claude-haiku-4-5-20251001

| Task | Baseline | Terse | Pāṇini | Δ vs terse |
|---|---|---|---|---|
| Agent tool call timeout | 325 | 297 | 472 | -175 (-59%) |
| Multi-agent state conflict | 674 | 415 | 655 | -240 (-58%) |
| Silent tool failure | 384 | 348 | 451 | -103 (-30%) |
| Hallucinated tool name | 985 | 529 | 1024 | -495 (-94%) |
| Ambiguous handoff | 459 | 347 | 1024 | -677 (-195%) |
| Agent loop detection | 533 | 320 | 404 | -84 (-26%) |
| Context window overflow | 476 | 364 | 1024 | -660 (-181%) |
| Confidence communication | 1024 | 479 | 884 | -405 (-85%) |
| Partial completion handoff | 1024 | 454 | 1024 | -570 (-126%) |
| Human intervention trigger | 469 | 369 | 967 | -598 (-162%) |
| Average | 635 | 392 | 793 | -401 (-102%) |

Provider: openai · Model: gpt-4o-mini

| Task | Baseline | Terse | Pāṇini | Δ vs terse |
|---|---|---|---|---|
| Agent tool call timeout | 369 | 57 | 87 | -30 (-53%) |
| Multi-agent state conflict | 407 | 94 | 356 | -262 (-279%) |
| Silent tool failure | 345 | 160 | 104 | +56 (+35%) |
| Hallucinated tool name | 363 | 216 | 194 | +22 (+10%) |
| Ambiguous handoff | 531 | 212 | 205 | +7 (+3%) |
| Agent loop detection | 505 | 227 | 67 | +160 (+70%) |
| Context window overflow | 389 | 173 | 291 | -118 (-68%) |
| Confidence communication | 601 | 206 | 233 | -27 (-13%) |
| Partial completion handoff | 449 | 135 | 155 | -20 (-15%) |
| Human intervention trigger | 528 | 198 | 290 | -92 (-46%) |
| Average | 449 | 168 | 198 | -30 (-18%) |

Note: Δ is computed as terse minus panini, so a negative Δ means panini used more tokens than terse. This is expected — see interpretation above.


Install

One line:

bash <(curl -s https://raw.githubusercontent.com/dpaul0501/panini/main/install.sh)

npx skills:

npx skills add dpaul0501/panini

Always-on (add to CLAUDE.md or agent system prompt):

Apply Pāṇini precision mode every response:
- No hedging unless epistemic uncertainty is the actual claim.
- Prefer compound nouns: search-agent not "the agent that does search".
- For errors and failures: name kartā (agent), karma (object), apādāna (cause).
- Code blocks and tool parameters unchanged. Technical terms exact.
- Off: "stop panini" / "normal mode".

Usage

| Command | Effect |
|---|---|
| /panini or "panini mode" | Activate (default: shuddha) |
| /panini viveka | Lite — remove hedging only, readable prose |
| /panini shuddha | Full — compositionality + kāraka structure |
| /panini sutra | Ultra — aphoristic, compound-first |
| "stop panini" | Deactivate |

Run the evals yourself

git clone https://github.com/dpaul0501/panini && cd panini
uv sync

# Anthropic
ANTHROPIC_API_KEY=... uv run python evals/llm_run.py --provider anthropic

# OpenAI
OPENAI_API_KEY=... uv run python evals/llm_run.py --provider openai

# Any OpenAI-compatible endpoint (Ollama, Groq, Together, etc.)
uv run python evals/llm_run.py --provider openai \
  --base-url http://localhost:11434/v1 --model llama3

# Token benchmark + README patch
ANTHROPIC_API_KEY=... uv run python benchmarks/run.py --provider anthropic --update-readme

# Count tokens from snapshot (no API cost)
uv run --with tiktoken python evals/measure.py

Results from other models and providers welcome — see CONTRIBUTING.md.


Repo structure

panini/
├── skills/panini/SKILL.md       # The skill — only file to edit for behavior
├── evals/
│   ├── llm_run.py               # Three-arm precision eval (multi-provider)
│   ├── measure.py               # Token counter (no API cost, tiktoken)
│   ├── prompts/en.txt           # 40 agentic prompts — append only
│   ├── snapshots/results.json   # Committed results
│   └── README.md                # Eval methodology
├── benchmarks/
│   ├── run.py                   # Token benchmark + README patcher (multi-provider)
│   ├── results/                 # Per-provider benchmark JSON
│   └── README.md                # Benchmark methodology
├── docs/idea-sanskrit-grammar.md # Sanskrit grammar intuition + hypothesis + findings
├── install.sh                   # One-line installer
├── pyproject.toml               # uv/pip config
├── CLAUDE.md                    # Instructions for Claude working on this repo
├── CONTRIBUTING.md              # How to contribute results
└── README.md                    # This file

Limitations and open questions

  • The judge model scoring kāraka is itself an LLM — scores have variance, especially on short responses. Human annotation would improve calibration.
  • Panini mode increases token count on Anthropic models (+74% vs terse on haiku). The net economics depend on loop depth and how many steps benefit from explicit causal attribution.
  • Whether kāraka-complete responses actually reduce downstream agent failure rates in real pipelines remains unmeasured. That is the longitudinal experiment this repo is building toward.
  • Only tested in English. Kāraka structure applied to non-English agentic prompts is unexplored.
  • The benchmark tasks are synthetic. Results on production agentic workloads may differ.

License

MIT
