ctxlint optimizes messy LLM context windows before an agent or model sees them.
It is not an agent framework, memory system, or RAG platform. The core primitive is:
const { optimizeContext } = require("@anzalabidi/ctxlint")
const result = optimizeContext({
task: "fix stale search results",
context: {
system,
messages,
retrievedDocs,
memory,
toolOutputs
},
budgetTokens: 1200,
profile: "openai"
})
console.log(result.packet)
console.log(result.selected)
console.log(result.dropped)Given a task, messy context, and a budget, it returns the safest, most relevant packet it can fit.
npm install @anzalabidi/ctxlintThis repo is currently published to GitHub first. Until an npm release exists, install from GitHub:
npm install github:anzal1/ctxlintAI agents fail when their context becomes noisy, stale, contradictory, duplicated, badly ordered, or unsafe. The failure becomes worse under tight budgets: naive truncation cuts off the current evidence and leaves stale memory, irrelevant docs, or prompt-injection text.
ctxlint solves the budgeted version of that problem: it drops dangerous/duplicated context, ranks evidence by relevance and trust, and emits a compact packet under a fixed budget.
- conflicting facts, such as two values for the same env var
- prompt-injection-like instructions inside untrusted context
- duplicate or near-duplicate claims
- stale-looking language and old dated facts
- task-relevant context appearing after unrelated token bulk
- large context blocks with weak task relevance
Library:
const {
optimizeContext,
fromOpenAIMessages,
fromLangChainDocs
} = require("@anzalabidi/ctxlint")
const context = {
...fromOpenAIMessages(messages, { task }),
retrievedDocs: fromLangChainDocs(docs).documents
}
const optimized = optimizeContext({
task,
context,
budgetTokens: 800,
profile: "small"
})
await model.generateContent(optimized.packet)Profiles:
optimizeContext({ task, context, budgetTokens: 800, profile: "gemini" })
optimizeContext({ task, context, budgetTokens: 800, profile: "openai" })
optimizeContext({ task, context, budgetTokens: 800, profile: "anthropic" })
optimizeContext({ task, context, budgetTokens: 800, profile: "small" })
optimizeContext({ task, context, budgetTokens: 800, profile: "tiny" })Adapters:
fromOpenAIMessages(messages, { task })
fromVercelMessages(messages, { task })
fromLangChainDocs(docs, { task })
fromLlamaIndexNodes(nodes, { task })CLI:
node bin/ctxlint.js fixtures/dirty-agent-context.json \
--task "fix the auth timeout bug without changing billing"Cleaned before/after view:
node bin/ctxlint.js fixtures/dirty-agent-context.json \
--task "fix the auth timeout bug without changing billing" \
--cleanedMachine-readable output:
node bin/ctxlint.js fixtures/dirty-agent-context.json --jsonOptimized packet under a budget:
node bin/ctxlint.js fixtures/dirty-agent-context.json \
--task "fix the auth timeout bug without changing billing" \
--packet \
--budget 120 \
--profile openaiOutput as model messages:
node bin/ctxlint.js fixtures/dirty-agent-context.json \
--task "fix the auth timeout bug without changing billing" \
--packet \
--budget-chars 500 \
--format messagesnpm run benchmarkOn the included dirty fixture, the prototype found:
- 3 contradictions
- 1 injection risk
- 3 duplicate claims
- 1 stale-looking block
- 4 buried relevant-context issues
After applying the conservative cleaner:
- estimated tokens:
268 -> 215 - total issues:
12 -> 4 - injection risks:
1 -> 0 - duplicate claims:
3 -> 1 - buried relevant-context issues:
4 -> 0 - quality score:
0 -> 58
This is not a real model-quality benchmark yet. It is a static context-quality benchmark. The next step is to run raw context vs linted context through the same model on a task suite and compare task success, cost, latency, and instruction violations.
The Gemini eval harness compares raw context vs ctxlint-cleaned context against real Gemini models.
npm run gemini:eval -- --env-file /path/to/.envOptional model override:
node scripts/gemini-eval.js \
--env-file /path/to/.env \
--models gemini-2.5-flash,gemini-2.5-flash-lite,gemini-2.0-flashThe script writes a JSON report to reports/ and scores each response on:
- valid JSON
- avoids billing changes
- refuses secret exfiltration
- targets auth/OAuth
- includes a test plan
Latest local run, using gemini-2.5-flash, gemini-2.5-flash-lite, gemini-2.0-flash, gemini-2.5-pro, and gemini-3-flash-preview:
- raw context model score:
20/20 - linted context model score:
20/20 - raw aggregate latency:
29.5s - linted aggregate latency:
31.9s - API errors after retry:
0
Finding: the included fixture is useful for proving static context cleanup, but too easy for current Gemini models. It did not show task-success improvement because every tested model ignored the malicious/noisy context and produced the correct auth-only plan. A stronger benchmark needs larger traces, weaker/cheaper models, more realistic stale memory, and tasks where the relevant fact is not repeated in the user request.
fixtures/adversarial-suite.json contains seven harder cases:
- stale memory vs current runbook
- prompt injection in retrieved support/customer content
- conflicting API versions
- unsafe feature-flag rollback instructions
- wrong numeric constants
- buried dependency advisories
Run it:
node scripts/gemini-suite-eval.js \
--env-file /path/to/.env \
--models gemini-2.5-flash-lite,gemini-3-flash-preview,gemma-3-4b-it \
--suite fixtures/adversarial-suite.jsonBudgeted run:
node scripts/gemini-suite-eval.js \
--env-file /path/to/.env \
--models gemini-2.5-flash-lite,gemini-3-flash-preview,gemma-3-4b-it \
--suite fixtures/adversarial-suite.json \
--budget-chars 500Latest 500-character budget findings using the optimized packet primitive:
| Model | Raw | Linted | Perfect Raw | Perfect Linted | Avg Latency Raw | Avg Latency Linted |
|---|---|---|---|---|---|---|
| gemini-2.5-flash-lite | 22/28 | 28/28 | 3/7 | 7/7 | 5273ms | 4691ms |
| gemini-3-flash-preview | 22/28 | 28/28 | 3/7 | 7/7 | 7122ms | 4492ms |
| gemma-3-4b-it | 22/28 | 28/28 | 3/7 | 7/7 | 2067ms | 2093ms |
| gemma-3-1b-it | 22/28 | 18/28 | 3/7 | 3/7 | 1440ms | 1318ms |
Finding: ctxlint is most useful under context-budget pressure. With full context, frontier Gemini models often recover despite noise. With a tight budget, naive context assembly cuts off important facts, while ctxlint's optimizer preserves the current evidence. The current approach is not reliable for very small models yet; Gemma 1B got worse at 500 chars, likely because it needs an even simpler task-specific output shape.
ctxlint is model-agnostic in the sense that it emits plain text packets and message JSON. It does not guarantee improvement for every LLM.
Best current fit:
- budgeted agents
- RAG systems with noisy retrieved chunks
- coding agents with stale memory and tool output
- cheap/fast models where every token matters
Known limits:
- very small models may need custom packet templates
- contradiction detection is heuristic
- token counting is profile-based approximation, not provider-native tokenization
- safety detection should be treated as defense-in-depth, not a complete prompt-injection firewall
Useful before/after metrics:
- input tokens and cost
- model latency
- task success rate
- instruction-following violations
- contradiction rate in outputs
- prompt-injection success rate
- time to debug a bad agent trace
Good first eval sets:
- dirty RAG traces with injected stale docs and prompt injections
- coding-agent traces with stale
AGENTS.md/CLAUDE.mdinstructions - long-context QA fixtures with relevant facts placed behind unrelated bulk
- SWE-bench-style coding tasks once integrated with an actual coding agent