Skip to content

anzal1/ctxlint

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ctxlint

ctxlint optimizes messy LLM context windows before an agent or model sees them.

It is not an agent framework, memory system, or RAG platform. The core primitive is:

const { optimizeContext } = require("@anzalabidi/ctxlint")

const result = optimizeContext({
  task: "fix stale search results",
  context: {
    system,
    messages,
    retrievedDocs,
    memory,
    toolOutputs
  },
  budgetTokens: 1200,
  profile: "openai"
})

console.log(result.packet)
console.log(result.selected)
console.log(result.dropped)

Given a task, messy context, and a budget, it returns the safest, most relevant packet it can fit.

Install

npm install @anzalabidi/ctxlint

This repo is currently published to GitHub first. Until an npm release exists, install from GitHub:

npm install github:anzal1/ctxlint

Problem Statement

AI agents fail when their context becomes noisy, stale, contradictory, duplicated, badly ordered, or unsafe. The failure becomes worse under tight budgets: naive truncation cuts off the current evidence and leaves stale memory, irrelevant docs, or prompt-injection text.

ctxlint solves the budgeted version of that problem: it drops dangerous/duplicated context, ranks evidence by relevance and trust, and emits a compact packet under a fixed budget.

Current Checks

  • conflicting facts, such as two values for the same env var
  • prompt-injection-like instructions inside untrusted context
  • duplicate or near-duplicate claims
  • stale-looking language and old dated facts
  • task-relevant context appearing after unrelated token bulk
  • large context blocks with weak task relevance

Usage

Library:

const {
  optimizeContext,
  fromOpenAIMessages,
  fromLangChainDocs
} = require("@anzalabidi/ctxlint")

const context = {
  ...fromOpenAIMessages(messages, { task }),
  retrievedDocs: fromLangChainDocs(docs).documents
}

const optimized = optimizeContext({
  task,
  context,
  budgetTokens: 800,
  profile: "small"
})

await model.generateContent(optimized.packet)

Profiles:

optimizeContext({ task, context, budgetTokens: 800, profile: "gemini" })
optimizeContext({ task, context, budgetTokens: 800, profile: "openai" })
optimizeContext({ task, context, budgetTokens: 800, profile: "anthropic" })
optimizeContext({ task, context, budgetTokens: 800, profile: "small" })
optimizeContext({ task, context, budgetTokens: 800, profile: "tiny" })

Adapters:

fromOpenAIMessages(messages, { task })
fromVercelMessages(messages, { task })
fromLangChainDocs(docs, { task })
fromLlamaIndexNodes(nodes, { task })

CLI:

node bin/ctxlint.js fixtures/dirty-agent-context.json \
  --task "fix the auth timeout bug without changing billing"

Cleaned before/after view:

node bin/ctxlint.js fixtures/dirty-agent-context.json \
  --task "fix the auth timeout bug without changing billing" \
  --cleaned

Machine-readable output:

node bin/ctxlint.js fixtures/dirty-agent-context.json --json

Optimized packet under a budget:

node bin/ctxlint.js fixtures/dirty-agent-context.json \
  --task "fix the auth timeout bug without changing billing" \
  --packet \
  --budget 120 \
  --profile openai

Output as model messages:

node bin/ctxlint.js fixtures/dirty-agent-context.json \
  --task "fix the auth timeout bug without changing billing" \
  --packet \
  --budget-chars 500 \
  --format messages

Benchmark

npm run benchmark

On the included dirty fixture, the prototype found:

  • 3 contradictions
  • 1 injection risk
  • 3 duplicate claims
  • 1 stale-looking block
  • 4 buried relevant-context issues

After applying the conservative cleaner:

  • estimated tokens: 268 -> 215
  • total issues: 12 -> 4
  • injection risks: 1 -> 0
  • duplicate claims: 3 -> 1
  • buried relevant-context issues: 4 -> 0
  • quality score: 0 -> 58

This is not a real model-quality benchmark yet. It is a static context-quality benchmark. The next step is to run raw context vs linted context through the same model on a task suite and compare task success, cost, latency, and instruction violations.

Gemini Eval

The Gemini eval harness compares raw context vs ctxlint-cleaned context against real Gemini models.

npm run gemini:eval -- --env-file /path/to/.env

Optional model override:

node scripts/gemini-eval.js \
  --env-file /path/to/.env \
  --models gemini-2.5-flash,gemini-2.5-flash-lite,gemini-2.0-flash

The script writes a JSON report to reports/ and scores each response on:

  • valid JSON
  • avoids billing changes
  • refuses secret exfiltration
  • targets auth/OAuth
  • includes a test plan

Latest local run, using gemini-2.5-flash, gemini-2.5-flash-lite, gemini-2.0-flash, gemini-2.5-pro, and gemini-3-flash-preview:

  • raw context model score: 20/20
  • linted context model score: 20/20
  • raw aggregate latency: 29.5s
  • linted aggregate latency: 31.9s
  • API errors after retry: 0

Finding: the included fixture is useful for proving static context cleanup, but too easy for current Gemini models. It did not show task-success improvement because every tested model ignored the malicious/noisy context and produced the correct auth-only plan. A stronger benchmark needs larger traces, weaker/cheaper models, more realistic stale memory, and tasks where the relevant fact is not repeated in the user request.

Adversarial Suite

fixtures/adversarial-suite.json contains seven harder cases:

  • stale memory vs current runbook
  • prompt injection in retrieved support/customer content
  • conflicting API versions
  • unsafe feature-flag rollback instructions
  • wrong numeric constants
  • buried dependency advisories

Run it:

node scripts/gemini-suite-eval.js \
  --env-file /path/to/.env \
  --models gemini-2.5-flash-lite,gemini-3-flash-preview,gemma-3-4b-it \
  --suite fixtures/adversarial-suite.json

Budgeted run:

node scripts/gemini-suite-eval.js \
  --env-file /path/to/.env \
  --models gemini-2.5-flash-lite,gemini-3-flash-preview,gemma-3-4b-it \
  --suite fixtures/adversarial-suite.json \
  --budget-chars 500

Latest 500-character budget findings using the optimized packet primitive:

Model Raw Linted Perfect Raw Perfect Linted Avg Latency Raw Avg Latency Linted
gemini-2.5-flash-lite 22/28 28/28 3/7 7/7 5273ms 4691ms
gemini-3-flash-preview 22/28 28/28 3/7 7/7 7122ms 4492ms
gemma-3-4b-it 22/28 28/28 3/7 7/7 2067ms 2093ms
gemma-3-1b-it 22/28 18/28 3/7 3/7 1440ms 1318ms

Finding: ctxlint is most useful under context-budget pressure. With full context, frontier Gemini models often recover despite noise. With a tight budget, naive context assembly cuts off important facts, while ctxlint's optimizer preserves the current evidence. The current approach is not reliable for very small models yet; Gemma 1B got worse at 500 chars, likely because it needs an even simpler task-specific output shape.

Compatibility

ctxlint is model-agnostic in the sense that it emits plain text packets and message JSON. It does not guarantee improvement for every LLM.

Best current fit:

  • budgeted agents
  • RAG systems with noisy retrieved chunks
  • coding agents with stale memory and tool output
  • cheap/fast models where every token matters

Known limits:

  • very small models may need custom packet templates
  • contradiction detection is heuristic
  • token counting is profile-based approximation, not provider-native tokenization
  • safety detection should be treated as defense-in-depth, not a complete prompt-injection firewall

Data To Prove Value Later

Useful before/after metrics:

  • input tokens and cost
  • model latency
  • task success rate
  • instruction-following violations
  • contradiction rate in outputs
  • prompt-injection success rate
  • time to debug a bad agent trace

Good first eval sets:

  • dirty RAG traces with injected stale docs and prompt injections
  • coding-agent traces with stale AGENTS.md / CLAUDE.md instructions
  • long-context QA fixtures with relevant facts placed behind unrelated bulk
  • SWE-bench-style coding tasks once integrated with an actual coding agent

About

Optimize messy LLM context into the safest, most relevant packet under a fixed budget.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors