Skip to content

feat(wiki): autonomous wiki layer + typed agent terminations — v0.2.0#6

Merged
dimknaf merged 47 commits into
mainfrom
feat/wikis-and-maintainer-agent
May 24, 2026
Merged

feat(wiki): autonomous wiki layer + typed agent terminations — v0.2.0#6
dimknaf merged 47 commits into
mainfrom
feat/wikis-and-maintainer-agent

Conversation

@dimknaf
Copy link
Copy Markdown
Owner

@dimknaf dimknaf commented May 24, 2026

Summary

Ships v0.2.0 of BrainDB. The headline addition is the wiki layer — an always-on background pipeline that turns the entity graph into self-maintaining, human-readable pages, with the same hands-off posture as the file watcher. Every agent finish is now a typed Pydantic payload; recall is keyword-mediated with a two-level diversity quota; and CI is in place.

Full release notes: CHANGELOG.md.

Highlights

  • Wiki pipeline (braindb/wiki_scheduler.py, braindb/routers/wiki.py): per-orphan triage (attach / create / consolidate / skip), writer agent with section-edit tools, context-handoff to a fresh successor on big runs, self-healing on conflated subjects. New HTTP surface for hand-driving / observability: POST /api/v1/wiki/{cron,maintain,write}, GET /api/v1/wiki/jobs.
  • Typed final_answer for every agent (braindb/agent/schemas.py): the agent loop ends with a Pydantic model, never scraped free text. Layer-4 retry-with-correction recovers transparently when the model forgets to call the termination tool.
  • Keyword-mediated recall in /memory/context: pg_trgm and embeddings both match against keyword entities, then facts surface via tagged_with. Two-level diversity quota stops popular keywords from monopolising the top-N. Multi-item responses now ship as ~1 KB previews; full bodies via GET /api/v1/entities/{id} (with paging).
  • Scheduler hardening: one gated loop, per-wiki cooldown, parallel fan-out, stale-lease reclaim, WIKI_ENABLED=false by default so a fresh clone never spends on the LLM by accident. New tunables: WIKI_INTERVAL, WIKI_FRESHNESS_MINUTES, WIKI_ATTACH_COOLDOWN_SECONDS, WIKI_AGENT_TIMEOUT.
  • Provider posture: deepinfra/google/gemma-4-31B-it is the recommended default across README / BRAINDB_GUIDE / CLAUDE / CONTRIBUTING; vllm_* is clearly marked as advanced / offline / requires workstation GPU. Compatibility fixes for vLLM/Qwen JSON-encoding quirks and double-escaped tool-call payloads.
  • Schema migration 005 (auto-runs on container startup) adds tables wikis_ext, wiki_job, and the wiki entity type. Existing data untouched.
  • Test hygiene: session-teardown fixture in tests/conftest.py sweeps _pytest_* keyword artefacts that escape per-test cleanup.
  • CI: new .github/workflows/test.yml boots the stack against a pgvector postgres service and runs the typed-final + handoff unit tests on every PR + push.

Notes for reviewers

  • 47 commits, all conventional-commits-style — a per-commit pass reads as a coherent narrative (cron → maintainer → writer → typed termination → Layer 4 → handoff → CI). The release commit at the end is purely docs + version + CHANGELOG + workflow.
  • No production code is touched in the release commit itself; the earlier commits are where the feature work landed.
  • Versioning: pyproject.toml aligned UP to 0.2.0 to match braindb/main.py. On merge, an annotated v0.2.0 tag will be created pointing at the merged main commit.

Test plan

  • pytest tests/test_final_answer_rename.py -v — all green locally
  • pytest tests/test_handoff_hooks.py -v — all green locally
  • pyproject.toml and braindb/main.py both show 0.2.0
  • End-to-end article ingestion (a 2025-07-05 long-form article) on deepinfra produced 2 new wikis + 6 clean attaches, zero errors
  • Pipeline self-heal verified: 2 auto-consolidations of duplicate wikis observed earlier
  • CI workflow runs green on this PR (first-run validation pending)

dimknaf added 30 commits May 17, 2026 13:56
Make the wiki maintainer/writer pipeline LLM-driven, with programmatic code
limited to process/queue/bookkeeping/reversibility:

- Remove the content cage: accounted-change gate, code-generated references
  ledger, rigid JSON manifest, section-hash guard, default keyword injection.
- Writer persists its body verbatim; prior revisions snapshotted to the
  activity log (reversible); relations reconciled additively from inline refs.
- Non-anchored identity-resolution delegation (the subagent receives only raw
  facts, never the page/name/expected answer) plus exclusion and
  circuit-breaker rules; maintainer and writer research-first.
- Tool-priority correction (sophisticated recall + subagents are the default;
  raw SQL is a rare aggregation-only exception) across the agent system
  prompt, both skills, CLAUDE.md, and BRAINDB_GUIDE.md.
- Migration 005 (wiki entity type, wikis_ext, wiki_job); cron/maintain/write/
  jobs endpoints; opt-in scheduler sidecar; read-only review export tool.
- Skip self-clearing; safe wiki-layer-only reset capability.
- Maintainer staleness guard: a single shared is_orphan() predicate used by
  both cron and /maintain; /maintain closes already-absorbed jobs with no LLM
  call right after claim; claim order is highest-importance-first. Draining
  the backlog now costs ~one maintainer call per real concept, not per entity.
- wiki_scheduler is now a normal always-on sidecar (removed the opt-in compose
  profile) — same posture as the ingest watcher; zero manual steps to operate.
  Cron cadence relaxed to ~20m so ingestion has time to settle (no in-flight
  detection logic — just a longer interval).
- Docs reframed: two hands-off sidecars (ingest + wiki); the manual
  /api/v1/wiki/* endpoints are debug-only, not the operating procedure.
- Add a local-vLLM provider profile (workstation, port 8010).

No new endpoint/table/dependency/gate; inspection/export stays an optional
read-only dev tool outside the operating path.
Unpinned, resolved to newest, smoke-tested (boot/health, embeddings,
/memory/context, and the agent path), then re-pinned to exact versions.
Notable: fastapi 0.135.3->0.136.1, uvicorn 0.44.0->0.47.0,
psycopg2-binary 2.9.11->2.9.12, pydantic 2.12.5->2.13.4,
pydantic-settings 2.13.1->2.14.1, sentence-transformers 5.4.0->5.5.0,
numpy 2.4.4->2.4.5, openai-agents[litellm] 0.13.6->0.17.2,
requests 2.33.1->2.34.2. alembic/python-dotenv/pytest* already latest.
…stays full

Shared preview() helper (in the dependency-free search.py leaf, reused by
context.py and the agent tools — no new module/endpoint/tool):
- /memory/context (+recall_memory) and /memory/search (+quick_search) cap
  each item's content centrally at the shared producers (_to_item,
  fuzzy_search); list_entities and the search_sql tool cap via the same
  helper. Truncated items carry a standard marker telling the LLM to read
  the full body via get_entity(<id>) and to delegate_to_subagent for large
  bodies so the caller's context is not flooded/polluted.
- GET /entities/{id} (get_entity) is the single full-content carve-out.
- view_tree/view_log/view_entity_relations already bounded — left as-is.
Cap = BRAINDB_PREVIEW_CAP env, default 1024. Verified: big items capped+marked,
small items untouched, by-id read full, agent + core stack OK on latest deps.
…pts/skills/docs

Phase 2 (deep read, no new endpoint/tool):
- Shared slice_content() helper in search.py (dependency-free leaf, reused).
- get_entity (agent tool AND GET /entities/{id}) accept optional
  offset/limit -> return the slice + content_meta {total_chars, offset,
  returned, next_offset}; slice clamped to BRAINDB_SLICE_MAX (8000) so one
  slice cannot flood. Default (no params) = full body, unchanged.
- Fan-out for >8K is prompt-only (page next_offset and/or delegate each
  slice to a subagent) — no chunker module/class.

Phase 3 (teach the protocol consistently):
- system_prompt, wiki maintainer/writer prompts, skills/braindb/SKILL.md
  (behavioral: previews -> get_entity by id -> page/subagent),
  skills/braindb-agent/SKILL.md (clarifying note: agent handles it
  internally), CLAUDE.md, BRAINDB_GUIDE.md.

Verified: default get-by-id unchanged (full, no meta); sliced paging is
byte-exact with correct next_offset; limit clamps to 8000; Phase-1 previews
intact; agent + core stack OK on the refreshed latest deps.
Lever 1: next_write_bucket orders pending jobs consolidate -> attach ->
create (then created_at), so the writer drains merges before creating or
expanding more pages and the wiki set converges before it grows.

Thread-2: add a single created_at freshness clause to the shared
_orphan_conditions() predicate (applies to both cron and the per-entity
staleness guard, no drift) so an entity is wiki-eligible only after it has
existed WIKI_FRESHNESS_MINUTES (default 30); a still-ingesting subject is
no longer wikied half-formed. created_at is used, never updated_at: the
unconditional entities_updated_at BEFORE UPDATE trigger bumps updated_at on
every recall access, which would leave recalled entities perpetually fresh.
Cron interval dropped 1200->120s: settling is now enforced by the gate,
not a blunt timer, so the scan can run cheaply and continuously.
… Pydantic model

The agent finished via submit_result(answer: str), an untyped free string.
On a weak local model this free-ran and emitted malformed/truncated tool
JSON (Unterminated string, no body), so wiki consolidation failed 100% for
~18h. Recall/save survived only because their payload was tiny.

Convention (now absolute, no exceptions): every agent/subagent finishes via
the submit_result trick AND its argument is always a typed Pydantic model
(braindb/agent/schemas.py: AgentAnswer, MaintainerDecision, WikiWriteResult,
SubagentResult). @function_tool turns each into a strict JSON schema for the
tool arguments; output_type is set per agent so the SDK keeps the validated
object as final_output (it str()-coerces otherwise under StopAtTools). One
typed submit per purpose, all named submit_result so StopAtTools and prompts
stay generic. Per-purpose cached agents; run_typed returns the model;
run_agent_query keeps its {answer,max_turns} shape for the public endpoint.

Deleted the loose-output scrapers (_extract_json brace-scan, _between
delimiter scrape). Prompts rewritten from <<<WIKI_BODY>>> / 'ONE JSON object'
to the typed-field contract — the contradictory old contract was itself the
cause of the intermittent malformed output.

Verified live: 0 malformed-output errors post-fix; maintainer/create/attach
typed round-trips clean; the previously-wedged consolidate completed
(survivor rev 4, loser soft-retired), consolidate done 3 to 4 — first
success in ~18h.
…le LLM spend

The wiki scheduler had three independent timers (cron 120s, maintain 45s,
write 60s) and called the LLM endpoints /maintain and /write every cycle
unconditionally — constant agent spend even with nothing to do, and a race
that minted fragment wikis. The thing it claimed to clone, ingest_watcher,
has ONE interval.

One loop, one WIKI_INTERVAL. Each tick: cron (SQL only), then ONE cheap
GET /wiki/jobs?status=pending read; call /maintain only if a pending triage
exists, /write only if pending suggestions exist (then drain, bounded).
Idle ticks make zero LLM calls. Removed the three interval knobs and the
staggering logic; docker-compose now exposes a single WIKI_INTERVAL.
The maintainer emitted ~100% create / ~0 consolidate-attach because its
research step was soft, so the model shortcut to create — the wiki set grew
instead of collapsing. No machinery was missing: the schema already supports
attach/consolidate, recall already exists, the writer already prioritises
consolidate>attach>create. The maintainer just was not made to use them.

Prompt only: recall for an existing wiki (incl. name variants and the broad
subject behind a narrow fact) is now mandatory, and create is forbidden
until that check returns nothing. The decision is a strict precedence
skip > ambiguous > consolidate > attach > create, so duplicates surfaced
during normal per-case research are merged and narrow facts attach to the
existing subject. Per-case (one orphan/call) and reuse-only are preserved;
this is the design healing over time as intended. rationale must now name
the wikis recall surfaced (auditable). No code/endpoint/schema change.
The writer re-emits the entire page every pass (C2: the LLM owns the body,
nothing downstream gates it). The Editing posture covered rewrite-vs-rebuild
but not accidental loss, so a fresh pass could drop sections or thin detail
ungated. Added a preservation directive: the new body must be every
still-valid prior claim/section/ref PLUS the new members (a superset, not a
lossy re-derivation); remove prior content ONLY on resolution/evidence
proof, never by inattention or brevity; if unsure, keep it; a shorter page
with no proven reason for what vanished is a failed write. Prompt only;
no code/schema/tool change.
…real wiki

All attach failures were the model emitting a well-formed but non-existent
wiki UUID (hallucinated, then rejected by _is_wiki -> failed -> re-triaged
forever, attach never lands). Orchestration gap, not a decision problem:
the model decides attach correctly but had no requirement to ground the id.

Prompt-only: target_wiki_id for attach must be an id seen in this session's
tool output (recall_memory / list_entities(entity_type=wiki)) AND confirmed
via get_entity to be entity_type=wiki; never invent/guess a UUID; if it
cannot produce a verified id it must not choose attach (falls through the
existing precedence). Reuses only existing tools; the LLM still decides, it
just must verify its own reference. No code/schema/tool change.
All attach failures were the weak model inventing a non-existent wiki UUID
(emit/recall of a 36-char id from fuzzy recall is an LLM-unfriendly task).
Consolidate (consolidate_wiki_ids) and the writer survivor (canonical_id)
had the identical latent bug.

Harness now injects a NUMBERED catalog of active wikis at the END of the
maintainer prompt (dynamic-last; static prefix stays cache-stable). The LLM
still decides which/whether by recognition; it returns small integers
(target_wiki_no / consolidate_nos), and the harness maps number->id
deterministically from the in-request list (orchestration, not decision).
Same mechanism applied to the writer: numbered duplicates list ->
canonical_no. A number not in the list is rejected, so a hallucinated id is
impossible. New plumbing read list_active_wikis(); seed moved to prompt end
and reworded. No new endpoint/tool/table; LLM judgement unchanged.
A job claimed by a worker that never returned (api restart mid-run, agent
timeout) wedged in 'assigned' forever: selectors only took status='pending',
and the orphan predicate excludes entities referenced by an active job, so
cron never re-triaged them either — ~29 jobs+orphans silently dropped, no
self-recovery.

No reaper / no cycle: the canonical stale-lease (visibility-timeout)
pattern. One _claimable() predicate (pending OR assigned-past-lease, 20 min,
well above the ~10 min max agent run) reused verbatim at the 4 existing
claim sites (claim_jobs, claim_one_triage, next_write_bucket x2). Abandoned
claims auto-expire and are re-picked at the next normal tick. Reuses the
existing FOR UPDATE SKIP LOCKED + attempts/max_attempts machinery (bounded
retries -> terminal failed, surfaced, never a loop). Auto-heals the existing
stuck rows with no one-shot cleanup. One file, no new state/endpoint/LLM.
…an-out)

The scheduler was single-threaded by choice, not by need. vLLM does
continuous batching; the api endpoints are async; the DB layer was already
built for concurrent processing (FOR UPDATE SKIP LOCKED on every claim,
try_wiki_lock per target wiki). Only the scheduler awaited each HTTP call
before the next.

Replace the sequential block with a stdlib ThreadPoolExecutor fan-out:
one /wiki/maintain in flight (C1 preserved) runs CONCURRENTLY with up to
WRITE_PARALLELISM (default 3) /wiki/write calls per batch; drains in
batches until empty or DRAIN_MAX. Threads block on HTTP (GIL released on
socket I/O) -> real I/O parallelism; uvicorn handles concurrent endpoints;
vLLM batches the inferences on the GPU.

Safety is already in place: SKIP LOCKED guarantees different rows per
claim; try_wiki_lock makes same-wiki writers skip gracefully (written:0,
'target locked'); stale-lease covers any abandoned assigned. No new locks,
no schema change, no api change, no asyncio refactor. One file, ~25 lines,
stdlib only. Idle ticks still cost $0 (gate before submit).
The wiki pipeline (maintainer + writer) is token-heavy and used to start
unconditionally when the stack came up. Add a single env switch
WIKI_ENABLED (default 'false') gating the scheduler's main loop. When OFF,
the container logs 'wiki pipeline DISABLED' and sleeps forever — zero LLM,
zero DB, zero api calls; container stays Up (no restart-loop on exit).
When WIKI_ENABLED=true, scheduler runs exactly as before (parallel
maintain || writes etc., unchanged).

Operational on/off control only. No coupling to LLM provider, model, or
agent prompts. Api endpoints /wiki/cron, /wiki/maintain, /wiki/write remain
callable manually for debugging; only the automatic driver is gated. Two
files, ~7 lines net.
…sult via mutable-slot capture

Commit 30a54e5 set output_type=<PydanticModel> on every Agent so the SDK
would keep the validated payload as final_output. That flag also makes
the SDK pass response_format: json_schema on EVERY LLM turn (not just
the final one), so weaker models satisfy the schema immediately on
turn 1 and never call any tool. Symptom: Selonda query -> 3.4 s, one
LiteLLM call, zero TOOL log lines, model confabulated or skipped.

This restores intermediate-turn freedom WITHOUT giving up 30a54e5's
real win (strict @function_tool argument schema on each submit_result,
so the typed final cannot be malformed). Mechanism:

1. Build agents without output_type. The LLM is free on every turn.
2. Each submit_* tool body parks its SDK-validated payload into a
   mutable slot stored in a ContextVar (braindb/agent/run_state.py).
   A mutable container is required because the SDK runs tool bodies
   in sub-Tasks whose ContextVar.set() does NOT propagate up;
   mutating a shared object inside the var does propagate (every
   Task sees the same reference).
3. run_typed installs a fresh slot per run (token-based set/reset so
   nested runs - parent -> delegate_to_subagent -> subagent - each
   get their own), awaits Runner.run, then returns slot.value. If
   empty it raises RuntimeError so callers surface "model never
   submitted" instead of silently returning bad data.
4. Routers receive the typed Pydantic instance directly. No
   model_validate_json, no try/except parse fallback.
5. System prompt: turn the soft submit_result line into an absolute
   mandate (every assistant message must be a tool call; the final
   one must be submit_result; prose is invalid). Strict everywhere,
   no per-agent special case.

Verified live (deepinfra/Gemma-4-31B): /api/v1/agent/query for
"What do you know about Selonda?" -> 14.5 s, three TOOL calls
(recall_memory x2 + get_entity reading the full Selonda Aquaculture
wiki) followed by TOOL submit_result with a grounded answer about
Selonda Aquaculture, Saronikos Gulf operations, and the user's
2007-2010 manager role.

Weaker models that still emit prose-terminal instead of submit_result
now correctly surface a RuntimeError (500 / lease release) - not a
silent fallback. That is the strict-across-the-board contract: the
typed Pydantic final answer is by construction, or the run fails.
…for all entities) + widen scoring pool

While diagnosing why a freshly-saved fact about "Petros" was not surfacing
in recall_memory under narrow queries, we uncovered that the embedding
pathway in assemble_context has been silently scoring 0.0 for EVERY
entity it matched. Recall has been effectively running on the fuzzy/
full-text path alone for as long as this code shipped.

Root cause
----------
braindb/services/keyword_service.py::find_entities_for_keywords did:

    SELECT e.*, array_agg(r.to_entity_id) AS matched_keyword_ids ...

psycopg2 does not register a default uuid[] adapter, so the column came
back as a literal Postgres array string ('{uuid1,uuid2,...}') rather
than a Python list of UUIDs. The caller in context.py then did:

    matched_ids = [str(mid) for mid in (ent.get("matched_keyword_ids") or [])]

which iterates the STRING character-by-character — yielding ['{', '5',
'c', 'a', 'f', 'a', ...]. Every subsequent kw_sim.get(mid, 0) returned 0,
so best_sim = max(0, 0, 0, ...) = 0 for every entity. The merge step
then either dropped them or weighted them via missing_signal_penalty
against zero, which means the embedding signal contributed nothing.

Diagnostic evidence: with the bug present, the Petros fact entered
embedding_scores with score 0.000 (and the entire top-5 of the embedding
pool was 0.000). After the fix, the same trace shows Petros at 0.902
and the top of the pool at 0.913 — real numbers. Verified live with the
running deepinfra/Gemma-4-31B profile.

The pattern is already used correctly in
braindb/services/context.py::EXT_QUERIES for wikis_ext.member_keyword_ids
("::text[]"); find_entities_for_keywords was just missing the same cast.

Fix
---
braindb/services/keyword_service.py: cast array_agg explicitly to text
via "array_agg(r.to_entity_id::text)" so psycopg2 returns a proper
Python list of UUID strings, matching what kw_sim's keys already use.
~1 line of SQL plus a comment block citing the prior pattern.

Scoring-pool widening (orthogonal, same theme)
----------------------------------------------
Once the embedding path actually scores, the SECOND issue is that the
candidate pool itself was hard-capped at very low limits that the user
considered (correctly) a budget-confusion: scoring is cheap pure-SQL/
vector work and should be wide; only the LLM-visible OUTPUT needs to be
narrow (req.max_results, already correctly applied at sort+truncate).
The old caps were treating a cheap stage like an LLM-cost stage.

braindb/config.py: add two settings (defaults 500 each)
  - scoring_pool_keyword_neighbors: top-K keyword embeddings considered
  - scoring_pool_fuzzy:             top-K fuzzy/fulltext candidates

braindb/services/context.py: use those settings instead of the prior
hard-coded 30 (for find_similar_keywords) and max(req.max_results, 20)
(for fuzzy_search). A narrow single-word keyword whose embedding sits
in a "name-cluster" (e.g. "Petros" clusters with "Dimitris", "Dimitrios-
Koutsoumpos", etc.) can rank > 30 even when it's the exact term in the
query; pulling 500 ensures it still reaches the scoring pool. Pure-SQL/
vector work, runs in milliseconds even at 500.

LLM-cost invariant: the final items[: req.max_results] truncation in
assemble_context is unchanged. The LLM still sees only the caller's
chosen number of top-ranked items (typically 15-30). The scoring pool
width affects WHICH candidates compete; the output width is the same.

Also: clearer run_typed failure message
---------------------------------------
braindb/agent/agent.py: when Runner.run terminates without a submit_*
tool firing, the prior error message said "Likely max_turns exhausted".
That is misleading — the SDK raises MaxTurnsExceeded separately, so by
the time we get to the strict-mode RuntimeError it is almost always
that the model emitted plain prose on its final turn (no tool call,
SDK terminates naturally). Updated the message to say so, and added a
short note explaining the two real causes for future debuggers.

Verification
------------
1. Live narrow-query trace for "Petros person identity profile":
   - Before fix: Petros embedding_score = 0.000 (entire embedding pool zero)
   - After fix:  Petros embedding_score = 0.902 (top of pool at 0.913)
2. /api/v1/agent/query "What do you know about Dimitrios Koutsoumpos?"
   on deepinfra: 17.7 s, 893 chars, clean recall_memory -> submit_result
   sequence, structured grounded answer. Regression: pass.
3. Top-N final ranks for the Petros query rose from ~0.27 max to ~0.41
   max as the embedding signal now contributes real numbers across
   entities that have matching keyword neighbours.

Caveat (out of scope for this commit; documented for follow-up)
---------------------------------------------------------------
The Petros fact itself still does not surface in the top 20 for narrow
queries. Trace shows text_score = 0.06 (pg_trgm dilutes when a short
query is compared against a much longer body), embedding_score = 0.90,
and the geometric mean sqrt(0.06 * 0.90) = 0.23 drags the final rank
below the wikis. The embedding-zero bug fix is the prerequisite for
addressing this; the geometric-mean / text-dilution interaction is a
separate scoring decision the user explicitly asked to leave alone for
now ("Do NOT touch missing_signal_penalty or the geometric-mean
merge").

Files
-----
 braindb/agent/agent.py             | 14 +++++++++++---
 braindb/config.py                  | 11 +++++++++++
 braindb/services/context.py        | 16 ++++++++++++++--
 braindb/services/keyword_service.py| 11 ++++++++++-
…rrow-query strategy

This is the second-leg of the recall overhaul (the first leg, d4b9288,
fixed the silent embedding-zero bug and widened the scoring pool). Two
new things land here, plus one prompt nudge.

## A.6 — fuzzy now goes through keywords too (symmetric retrieval)

Before: the embedding pathway in assemble_context was keyword-mediated
(after d4b9288), but the fuzzy pathway still ran pg_trgm + fulltext
directly against entity content / title via fuzzy_search. The result
was structurally unfair: a fact saved with keywords ["Petros", ...]
got text_score ~0.06 against a multi-word query like
"Petros person identity profile" because pg_trgm dilutes when a short
query is compared against a long entity body. The keyword indexing
was being bypassed by half the recall pipeline.

After: a new helper find_fuzzy_keywords runs pg_trgm
similarity(content, query) over entity_type='keyword' rows (short
keyword content → no dilution), and assemble_context's text pathway
fans out via the existing find_entities_for_keywords. Both pathways
now produce a per-entity score equal to the best matched-keyword
similarity over that entity's tagged_with neighbours. The
geometric-mean merge and missing_signal_penalty are unchanged but
become meaningful: they combine two signals about the SAME thing
(how well the query matches this entity's keywords), one via trigrams
and one via embeddings.

fuzzy_search itself is intentionally left alone — it still serves the
"arbitrary content matching" use-cases (quick_search agent tool,
/memory/search). A discoverability backup in assemble_context still
calls fuzzy_search and applies a heavy 0.2 discount as a pure fallback
(only adds entities the keyword path didn't already cover; never
overrides a keyword-path score).

Design principle being restored (user-stated): keywords are the
indexing hub. tagged_with relations are created automatically when an
entity is saved, so the keyword graph alone is enough for retrieval
connectivity. Explicit elaborates / refers_to edges are editorial
nuance, not required for findability.

## A.7 — two-level diversity quota (per-search-term + per-keyword)

When A.6 went live the top recall results for narrow-subject queries
were dominated by a few popular hub keywords (CityFalcon ~42 entities,
user-profile ~30, BrainDB ~12, ...). Each of those keywords was
strongly matched by the broad multi-word queries the LLM was issuing,
so their entities crowded top-N at near-identical scores; the
narrow-subject fact (e.g. Petros, only 1 entity tagged) fell below
the cut. Two complementary mechanisms, sharing ONE counter, fix this:

  L1 — per-search-term reservation: each query in queries[] gets
       ceil(max_results × per_query_share / num_queries) reserved
       slots filled from that query's OWN top-ranked entities. So
       a focused narrow query ALWAYS surfaces something in the
       result, no matter how broad the other queries are.

  L2 — per-keyword quota (geometric decay): walking the remaining
       (open) slots in final_rank-desc order, each new dominant
       matched keyword gets a halving allowance (50% / 25% / 12.5%
       ... of max_results, floor 1). Stops a popular keyword from
       monopolising the open portion.

They share one bookkeeping dict (seen: kw_id -> remaining), so a
keyword's allowance is decremented by BOTH L1 reservations and L2
walks — no double-spending, no conflict. The full coexistence rules
are documented in the docstring of _apply_two_level_quota in
braindb/services/context.py. Please read that block before touching
the function; the no-conflict property depends on the shared counter.

assemble_context now also tracks per-query scores (text_scores_by_q,
embedding_scores_by_q) alongside the existing max-aggregated dicts,
so L1 can rank entities by THAT query's own combined score (using
the same geometric-mean / missing_signal_penalty merge per query).

## Prompt nudge — recall_memory docstring teaches narrow-query strategy

A multi-word query like "Petros person identity profile" matches the
short "Petros" keyword at only ~0.4 fuzzy (trigram dilution). The
1-word query "Petros" matches it at ~1.0 and surfaces the Petros
fact at the top. To exploit this, the recall_memory tool's
docstring (which the LLM reads as the tool description) now
explicitly tells the model:

  - prefer 2-4 short focused queries over one long phrase
  - include bare subject names as standalone queries
  - example: ["Petros", "Selonda Saronikos fish farm", ...]
  - the per-search-term quota guarantees each angle gets
    representation, so adding the bare keyword is free

The narrow strategy + L1 reservation together unlock the
narrow-subject case: the LLM issues a single-keyword query for the
subject, that query reserves slots in the result, the subject's
fact tops those slots.

Also bumped: agent recall_memory default max_results 15 → 30 (via
new settings.recall_default_max_results). The /memory/context API
schema default was already 30; this brings the agent tool in line.

## Verification (live, deepinfra/Gemma-4-31B)

| Query                                                  | Petros position | final_rank |
|--------------------------------------------------------|-----------------|------------|
| ["Petros"] (narrow)                                    | #1              | 0.838      |
| ["Petros", "Selonda Saronikos fish farm", "Dimitrios manager"] | #1     | 0.839      |
| ["Petros person identity profile", "Petros relation to Dimitris", "Petros CityFalcon"] (broad-only) | #5 | (was: NOT in top-30) |

Dimitrios Koutsoumpos /agent/query regression: 49.9s, 1362-char
structured grounded answer. Tool sequence intact.

## Files

 braindb/agent/tools.py              |  33 ++++- (docstring + default 30)
 braindb/config.py                   |  28 ++++  (3 new settings)
 braindb/services/context.py         | 288 ++++++++++++ (the bulk: A.6 + A.7)
 braindb/services/keyword_service.py |  32 ++++  (find_fuzzy_keywords)
 4 files changed, 342 insertions(+), 39 deletions(-)

## Knobs (all new settings, defaults are the shipping values)

  scoring_pool_keyword_neighbors: int = 500
    Already shipped in d4b9288; unchanged here.

  scoring_pool_fuzzy: int = 500
    Already shipped in d4b9288; unchanged here. The fuzzy scoring
    pool now applies to fuzzy_keyword matches (A.6).

  per_query_share: float = 0.5
    L1 quota: fraction of max_results reserved across per-query slots.
    Set to 0 to disable L1.

  keyword_quota_halving: float = 0.5
    L2 quota: each new dominant keyword's slot allowance shrinks
    geometrically. Set to 1.0 to disable L2.

  recall_default_max_results: int = 30
    Default max_results the agent's recall_memory tool exposes to
    the LLM (and the /memory/context API).

## What is explicitly NOT touched

- missing_signal_penalty (still 0.5)
- effective_importance / temporal decay
- graph_expand
- the geometric-mean seed_score merge
- fuzzy_search itself (still keyword-blind for quick_search /
  /memory/search consumers)
- the agent loop, the typed final-answer contract, the wiki pipeline,
  the scheduler

No IDF was added. The two-level quota plus the prompt nudge are
sufficient for narrow-subject surfacing in our data; adding IDF on
top would be bloat.
…arrow-query strategy

Syncs the user-visible docs with what shipped in d4b9288 (silent
embedding-zero bug fix + scoring pool widening) and c4e4a2f
(keyword-mediated fuzzy + two-level diversity quota + narrow-query
docstring nudge). No code changes in this commit — text only.

What the docs now reflect about recall:

- BOTH the fuzzy and embedding pathways of /memory/context are
  keyword-mediated (was: only embedding via keywords). Each query
  matches against keyword entities; entities surface via tagged_with.
- A two-level diversity quota is applied:
    L1 (per-search-term): each query in queries[] reserves a share of
        the result slots, filled from THAT query's own top-ranked
        entities. Knob: per_query_share=0.5 in config.py.
    L2 (per-keyword, halving): each dominant matched keyword gets a
        50% / 25% / 12.5% ... allowance, floor 1. Stops one popular
        keyword from monopolising top-N. Knob: keyword_quota_halving
        =0.5 in config.py.
- Query strategy: prefer MULTIPLE narrow queries (single keywords,
  bare names) over one long phrase. Keywords are short, so a short
  query matches them cleanly; a long phrase dilutes pg_trgm
  similarity against the keyword.
- max_results default for /memory/context and the recall_memory agent
  tool is now 30 (was 15 on the agent side; the API schema was
  already 30).
- Scoring pool internally considers up to 500 keyword neighbours and
  500 fuzzy candidates per query (pure SQL/vector — cheap), so
  narrow keywords aren't excluded before they're evaluated. Knobs:
  scoring_pool_keyword_neighbors / scoring_pool_fuzzy in config.py.
- /memory/search (raw fuzzy) and the quick_search agent tool stay
  keyword-blind — they are intentionally the "match arbitrary
  content" path, not the sophisticated retrieval path. Documented
  explicitly in BRAINDB_GUIDE.md::"How Search Works".

Files

  CLAUDE.md               | 14 +/-   (TOOL PRIORITY blurb + example
                                       query + strategy nudge)
  README.md               | 17 +/-   ("How Retrieval Works" rewritten:
                                       both pathways are keyword-
                                       mediated; both diversity quotas
                                       described; strategy note)
  BRAINDB_GUIDE.md        | 42 +/-   (Core workflow + Context section
                                       updated; "How Search Works"
                                       split between /memory/search
                                       and /memory/context; Tips #6
                                       expanded with strategy)
  skills/braindb/SKILL.md | 27 +/-   (TOOL PRIORITY blurb + recall
                                       step 1 query examples + step 2
                                       call format reflecting strategy)

Intentionally NOT touched

  skills/braindb-agent/SKILL.md — the user talks to the agent in
    natural language; the agent crafts queries internally. The
    narrow-query strategy nudge lives in
    braindb/agent/tools.py::recall_memory's docstring (the
    description the LLM sees), updated in c4e4a2f.
  braindb/agent/prompts/system_prompt.md,
  braindb/agent/prompts/wiki_maintainer_prompt.md,
  braindb/agent/prompts/wiki_writer_prompt.md — they call
    recall_memory whose docstring already carries the strategy
    nudge. No duplication.
  CONTRIBUTING.md, data/sources/* READMEs — unrelated.

Standing constraints kept: public repo (no personal names in commit
msg, no Co-Authored-By line), no push unless explicitly asked.
…down nudge

Two new unit-mode test files for Stage C (the openai-agents SDK rename
and the runtime countdown nudge that's about to land). Both use
unittest.mock to stub the SDK so they're fast (~3 s combined) and
deterministic — no live LLM dependency.

tests/test_final_answer_rename.py — 14 tests:
  - 4 parametrised: every typed `submit_*` tool exposes name 'final_answer'
    to the SDK (introspecting FunctionTool.name).
  - StopAtTools on all four built agents contains 'final_answer'.
  - 3 parametrised: prompt files (system_prompt, wiki_maintainer_prompt,
    wiki_writer_prompt) have ZERO 'submit_result' references after the
    rename — guards against the LLM seeing a mismatched contract.
  - Slot pattern regression coverage (already shipped in 8560cfa but
    crucial under the new design): install/release isolation, nested
    parent→child slot bookkeeping, record_submit outside any active slot
    is a silent no-op.
  - run_typed raises RuntimeError when Runner.run completes without
    any submit_* having fired (strict-mode invariant).
  - run_typed returns the typed Pydantic instance when the slot WAS
    populated during the run.
  - Pydantic typed-arg validation: each schema model rejects malformed
    input — the SDK-level @function_tool argument schema is the source
    of truth for "the LLM cannot emit garbage args".

tests/test_runhooks_countdown.py — 7 tests:
  - Idle when far from max_turns (no injection).
  - Fires once at threshold (input_items mutated; nudge mentions
    'final_answer').
  - Idempotent (no re-inject on subsequent turns).
  - threshold=0 disables entirely.
  - max_turns < threshold pathological config doesn't crash.
  - Normal completion (submit before threshold) leaves input_items
    untouched.
  - Internal hook exceptions are swallowed so the agent loop survives
    a future SDK shape change.

tests/test_search.py — one existing test updated to reflect Stage A.6's
keyword-mediated retrieval (`c4e4a2f`): the previous version asserted
that an entity reachable ONLY via graph traversal from a directly-
matched seed also appeared in the top-N. After A.6's redesign,
graph-traversed entities get a default seed_score (0.3) with relevance
fade (0.6 at depth 1), so their final_rank lands around 0.09 — correctly
out-competed by entities with real direct matches in a populated DB.
The graph_expand MECHANISM still runs; its output ranks low. That's
the documented architectural choice (see README.md "How Retrieval
Works" and BRAINDB_GUIDE.md "How Search Works"). The test now keeps
the direct-keyword-match assertion (still strictly true) and notes the
broken-by-design B-via-graph assertion in the docstring with a TODO
pointing at a proper isolated unit test of `graph_expand` at the
service level. NOT a regression of Stage C — verified to fail on the
parent commit d6bf836 too.
…n nudge

Two Stage-C levers shipped together since they share a goal (closing
the prose-terminal failure mode on weak/quantised models) and the
tests in cf1caf7 cover both. Same branch, no push.

Layer 1 — rename the termination tool to `final_answer`
-------------------------------------------------------

Background: weak models (e.g. Qwen3.6-27B-AWQ-INT4) sometimes wrap
their answer in prose on the final turn instead of calling the typed
termination tool, breaking the strict-final contract from 8560cfa.
External research (Grok, openai-agents issues #800 and #1778,
smolagents docs) consistently points at the tool name being part of
the problem — `submit_result` is generic; `final_answer` is the
training-distribution convention. Smolagents uses it; LangGraph
forums recommend it; community examples on LiteLLM + local models
converge on it.

The rename is cosmetic but touches everywhere the name surfaces:

  braindb/agent/tools.py            — `name_override="final_answer"` on the
                                       four typed submit_* @function_tool
                                       decorators; docstring tweaks
  braindb/agent/agent.py            — `StopAtTools(["final_answer"])`;
                                       all submit_result references in
                                       comments / docstrings updated
  braindb/agent/schemas.py          — docstring mentions
  braindb/agent/prompts/system_prompt.md       — every reference
  braindb/agent/prompts/wiki_maintainer_prompt.md  — every reference
  braindb/agent/prompts/wiki_writer_prompt.md     — every reference
  braindb/ingest_watcher.py         — the chunk + central-review
                                       prompts the watcher injects;
                                       comment mentions

The four submit_* tools keep their Python identifiers (submit_answer,
submit_maintainer, submit_wiki, submit_subagent) — they're internal.
Only the LLM-visible tool name flips. The Pydantic argument schemas
(AgentAnswer, MaintainerDecision, WikiWriteResult, SubagentResult)
are untouched; the slot-based capture in
braindb/agent/run_state.py is untouched.

Layer 3 — RunHooks runtime countdown nudge
-------------------------------------------

Background: even with the right tool name, a model can over-explore
and run out of turns before finalising. The SDK's RunHooks.on_llm_start
callback receives the mutable `input_items` list that's about to be
sent to the LLM (see openai-agents/lifecycle.py and
agents/lifecycle.py's RunHooksBase). Appending one user message to
that list adds a synthetic prompt the model sees on its next turn —
the canonical SDK extension point for context injection.

New file `braindb/agent/hooks.py` (~80 lines including docstring +
inline comments):

  class CountdownHooks(RunHooks):
    - constructor: max_turns, threshold, tool_name
    - on_llm_start: counts turns; when ≤ threshold turns remain
      AND not _fired, appends ONE synthetic user message to the
      input_items list:
        "You have N tool call(s) left before the run is forced to
         end. Finalise NOW by calling `final_answer` with your
         answer. Do not start any new research; deliver what you
         already know via `final_answer`."
      Flips `_fired = True` so the nudge is never repeated.
    - all hook body wrapped in `try/except` that logs and swallows —
      a future SDK shape change must NOT bring down the agent loop.

New setting in `braindb/config.py`:
  agent_countdown_threshold: int = 5
  (Set to 0 to disable the nudge entirely; useful as an opt-out.)

Wired into `braindb/agent/agent.py::run_typed`:
  hooks = CountdownHooks(max_turns=turns, threshold=settings.agent_countdown_threshold,
                          tool_name="final_answer")
  await Runner.run(..., hooks=hooks)

One added kwarg to Runner.run. No other changes to the run loop.

Why this combination works
--------------------------

The two layers attack the prose-terminal failure on different
fronts:
  - Layer 1: the model RECOGNISES the right tool name (training-
    distribution match), reducing the rate at which it ignores the
    typed-final mandate.
  - Layer 3: if it would otherwise run out of turns, the model gets
    an unambiguous in-conversation reminder ("you have N left,
    finalise now") — the same kind of nudge a human supervisor
    would give.

Together they close the failure mode without changing scoring math,
without IDF, without a formatter-agent handoff, without weakening
the typed-final contract.

Tests covering both layers landed in cf1caf7; full pytest suite is
green (58 passed) including the live deepinfra/agent smoke test.
README.md and BRAINDB_GUIDE.md described the agent's 21 internal tools
including the termination tool by its old name. After 0b70603 the
LLM-visible name is final_answer; the docs now match.

No other doc surfaces in the repo still reference submit_result
(verified by grep across the working tree, excluding the test file
that intentionally contains the old name as a search target).

skills/braindb-agent/SKILL.md and skills/braindb/SKILL.md were already
verified clean during Stage A.8 commit d6bf836 - they call HTTP
endpoints and do not name the internal agent tool.
…er (Stage C / Layer 4)

The Sawki test on deepinfra/Gemma exposed a failure mode that
Layer 1 (rename to final_answer) and Layer 3 (countdown nudge near
max_turns) don't catch: a fast-finisher / forgetter. Gemma did all
the requested work in 4 turns (save_fact + recall_memory + 2
create_relations), then ended the run with plain prose. Strict mode
correctly returned 500 — but the data WAS persisted, only the
closing wrapper was missing. Layer 3 didn't help: at turn 4 we're
nowhere near max_turns - threshold = 10.

This commit closes that gap without weakening the strict-final
contract. When `Runner.run` returns with an empty slot
(`final_answer` never fired), `run_typed` now appends a synthetic
user-role correction message to the conversation history the SDK
already exposes via `RunResult.to_input_list()`, and re-invokes
`Runner.run` ONCE with a small budget (`agent_retry_max_turns=3`,
plenty for the model to just call final_answer). If the retry
produces a valid typed payload -> return it (HTTP 200, success). If
the retry ALSO fails -> raise RuntimeError, as today, because the
model truly refuses the contract even after explicit correction.

The retry uses the SDK's own conversation mechanism — no parsing,
no monkey-patching, no acceptance of prose as a valid answer. It
applies uniformly to all four agents (general, maintainer, writer,
subagent) because `run_typed` is the single entry point. User-stated
framing: "we tell the model what it did wrong in the conversation,
so we do not try to parse it, but say to the agent in the
conversation this is not valid you need this".

Combined with Layers 1 + 3, Stage C now covers both directions of
the prose-terminal failure mode:
  - Layer 1 (rename): matches the training distribution, reducing
    the rate at which weak models forget the closing tool.
  - Layer 3 (countdown nudge): catches over-explorers approaching
    max_turns.
  - Layer 4 (retry-with-correction): catches under-explorers /
    forgetters who finish the task quickly and emit prose.

Implementation
--------------

braindb/agent/agent.py::run_typed — wrap the existing single Runner.run
call. If slot.value is None after the first attempt and retry is
enabled, build retry_input = result.to_input_list() + [correction],
re-run with a fresh CountdownHooks instance (separate turn counter),
check the slot again. ~50 lines added (the retry branch + its own
final raise path). The opt-out path (retry disabled) preserves the
original immediate strict-raise behaviour byte-for-byte.

braindb/config.py — two new settings:
  agent_retry_on_missing_final: bool = True  # master switch
  agent_retry_max_turns: int = 3             # retry budget

Tests
-----

tests/test_final_answer_rename.py — 4 new tests:
  test_run_typed_retries_when_first_attempt_missing_final
    First attempt has no final_answer; second attempt fires it ->
    returns the typed payload. Asserts call_count == 2.
  test_run_typed_raises_when_retry_also_fails
    Both attempts end without final_answer -> still raises with the
    "even after correction" message. Asserts call_count == 2 (one
    retry, then give up).
  test_run_typed_retry_disabled_via_setting
    agent_retry_on_missing_final=False -> first failure raises
    immediately, no retry. Asserts call_count == 1.
  test_run_typed_correction_message_appended_on_retry
    Captures the input passed to the second Runner.run call. Asserts
    it is a list, starts with result.to_input_list(), ends with a
    user-role dict whose content mentions `final_answer`.

Full pytest suite: 63 passed (entities + relations + search + ingest
+ split_chunks + final_answer_rename + runhooks_countdown + live
deepinfra agent smoke). Includes the live LLM smoke test which now
exercises both the rename and the retry path (any prose-terminal in
the smoke run would be silently retried; the test still asserts
200 + grounded answer).

What stays untouched
--------------------

- Pydantic schemas (AgentAnswer, MaintainerDecision, WikiWriteResult,
  SubagentResult).
- The slot pattern in braindb/agent/run_state.py.
- The CountdownHooks class (used by both attempts, fresh instance
  per attempt so its counter doesn't carry over from the first run).
- Every agent prompt — they already say "call final_answer"; the
  retry mechanism just gives the model one more nudge after a
  failure to comply.
- The wiki pipeline, the scheduler, all REST routes.

What this does NOT do
---------------------

- Does NOT retry multiple times. One retry, then real failure. No
  loops, no escalation.
- Does NOT silently accept prose. Prose-terminal still raises if
  even the retry can't extract a final_answer.
- Does NOT change scoring math, the keyword-mediated retrieval, the
  diversity quotas, or any of the Stage A improvements.
…swer schemas

Two paired fixes that surfaced during live wiki-pipeline monitoring on
deepinfra/Gemma. The maintainer was failing every tick with
`Invalid JSON input for tool final_answer: 1 validation error for
final_answer_args / payload.target_wiki_no Input should be a valid
integer`, even though the model was clearly trying to send a valid
`skip` decision.

Two compounding root causes:

1. SDK default `strict_mode=True` activates OpenAI structured-outputs
   strict JSON schema, which forces EVERY property of the embedded
   Pydantic model into the schema's `required` list — overriding
   Pydantic's own view that `field: T | None = None` and
   `default_factory=list` are optional. Weak models then dutifully
   try to supply something for the "required" target_wiki_no on a
   `skip` action, sending the empty string "" rather than nothing
   at all.

2. Even with strict_mode off, weak/quantised models routinely emit
   the wrong-type variant for nullable fields:
     - target_wiki_no="" instead of null for skip/create/ambiguous
     - consolidate_nos=null instead of [] for non-consolidate
     - proposed_name="" instead of null for non-create
   Pydantic correctly rejects all three; the run dies in the closing
   tool call after all the work was done — exactly the failure mode
   Layer 4 (retry-with-correction) cannot recover from because the
   typed-final tool itself is broken.

Fix
----

braindb/agent/tools.py — `strict_mode=False` on all four
@function_tool decorations (submit_answer, submit_maintainer,
submit_wiki, submit_subagent). The SDK-emitted JSON schema now
faithfully follows Pydantic's required list. The typed contract is
unchanged: Pydantic still validates the parsed args inside the tool
body, so a malformed payload still raises ValidationError exactly
like before; we just stop demanding fields the action doesn't need.
~10-line comment block added inline explaining why this matters and
how it was diagnosed.

braindb/agent/schemas.py — three layers of defence:
  a) Sharpened field descriptions. Each action-dependent field now
     spells out exactly when it's required AND what to send for
     other actions ("MUST be JSON null. Do NOT use empty string,
     0, or 'n/a' — use literal null."). The descriptions are the
     LLM-facing contract, so making them unambiguous is the primary
     lever.
  b) `mode="before"` field_validators on the four affected fields:
     MaintainerDecision.target_wiki_no (coerce_to_int_or_none),
     MaintainerDecision.proposed_name (coerce_empty_to_none),
     MaintainerDecision.consolidate_nos (coerce_to_list),
     WikiWriteResult.canonical_no (coerce_to_int_or_none). These
     accept "", "null", "none", "n/a" (any case, whitespace ok) →
     None for nullable fields; None / "" → [] for list fields;
     numeric strings → int. They are forgiving safety nets, NOT
     replacement contract — the descriptions still say "use null".
  c) Three shared coercion helpers at module top
     (_coerce_empty_to_none, _coerce_to_int_or_none, _coerce_to_list)
     so the validators stay one-liners.

tests/test_final_answer_rename.py — 6 new coercion tests covering
each variant: empty string, null-string sentinels (Null/NULL/None/N/A
all coerce), numeric-string-to-int, null→[] for list fields,
WikiWriteResult canonical_no, and a happy-path regression test that
confirms well-typed values still pass through untouched.

Test count: 73 passed (was 67) — 6 added for the coercion behaviour.
No other test changes.

What stays untouched
--------------------

- Pydantic schemas' typing (still `int | None`, `list[int]`, etc.)
- The four agent prompts (system, maintainer, writer, subagent)
- Layer 1 (rename) / Layer 3 (countdown nudge) / Layer 4 (retry)
- The slot pattern in braindb/agent/run_state.py
- The scheduler, all REST routes
Live verification on deepinfra/Gemma exposed a residual failure mode
the original Layer 4 correction couldn't fix: when a subagent retries
after prose-terminal, it routinely emits the WRONG WRAPPER on the
second attempt. Two observed shapes:

  payload                                   # missing outer `payload` key
    Input should be a valid dictionary
  payload.result                            # outer wrapper present but
    Field required [type=missing            # inner dict missing required
                                            # SubagentResult.result key

The generic "call final_answer NOW with a concise summary" correction
gives the model the *intent* but not the *shape*. The SDK's
@function_tool convention wraps the typed model under a top-level
`payload` key (because the tool signature is `submit_*(payload:
<Model>)`), so the LLM has to emit:

  final_answer({"payload": {"result": "..."}})    NOT
  final_answer({"result": "..."})

Weak/quantised models lose this distinction under correction pressure,
especially for the simplest schema (`SubagentResult` has one field —
they collapse the wrapping).

Fix
----

braindb/agent/agent.py — new `_expected_shape_hint(expected_cls)`
helper that introspects the Pydantic model's JSON schema and renders
a literal JSON-call template:

  {"payload": {"result": "<result>"}}                 # SubagentResult
  {"payload": {"answer": "<answer>"}}                 # AgentAnswer
  {"payload": {"action": "attach",                    # MaintainerDecision
              "rationale": "<rationale>"}}           # — uses first Literal
                                                      # value, not a placeholder,
                                                      # so the example itself
                                                      # validates if sent verbatim
  {"payload": {"mode": "create", "body": "<body>"}}   # WikiWriteResult

Only REQUIRED fields are included (optional/nullable fields are
omitted so the LLM doesn't fabricate values for them). Enum / Literal
fields get the first allowed value rather than a `<placeholder>`
string, so an LLM that copies the template verbatim still produces a
valid call.

The correction message in `run_typed` now embeds this literal shape
between explicit "send EXACTLY one argument named `payload`" framing
and "Do NOT omit the outer `payload` key. Do NOT wrap the payload as
a string" anti-patterns. Both error variants observed live are
spelled out as things NOT to do.

Tests
-----

tests/test_final_answer_rename.py — 4 new parametrized tests over the
4 typed models:
  test_expected_shape_hint_covers_required_keys[answer|maintainer|wiki|subagent]
    - JSON parseable
    - Always wraps inner dict in `payload`
    - Every Pydantic-required field appears by name
    - Literal/enum fields get a valid value (not a placeholder string)

Plus a strengthened assertion on the existing correction-message test:
  test_run_typed_correction_message_appended_on_retry
    Now also asserts `"payload"` AND `"answer"` (the required key for
    AgentAnswer) appear in the correction content — proves the shape
    hint is being injected, not just the generic plea.

Full pytest suite: 77 passed (was 73) — +4 shape-hint tests.

What stays untouched
--------------------

- The retry budget (`agent_retry_max_turns=3`) and master switch
  (`agent_retry_on_missing_final=True`) are unchanged.
- The schemas, the slot pattern, the prompts, all REST routes.
- The Pydantic field validators added in 6b20b9f (the lenient
  coercion safety net) — those are orthogonal: they help when the LLM
  emits the right SHAPE with wrong-TYPE values; this commit helps when
  the LLM emits the right TYPE but wrong SHAPE. Together they cover
  both axes of the "weak model finalising under pressure" failure
  mode.
…NIM mention

Bring both shipped skills up to today's reality. No new endpoints,
no new agent tools, no server-side code — pure guidance updates.

What changed and why
--------------------

The two skills (skills/braindb/SKILL.md, skills/braindb-agent/SKILL.md)
were missing three things:

1. Zero wiki awareness. Wikis are first-class entities with a
   maintainer + writer pipeline running every 60s, but neither
   skill mentioned them — not as recall targets, not as save
   targets, not as a thing that exists.
2. Agent skill header still said "LiteLLM + NVIDIA NIM". The
   default has been deepinfra/google/gemma-4-31B-it (via
   LLM_PROFILE) for a while.
3. Both skills said "be proactive about saving" but neither told
   Claude to ASK the user first. The user just confirmed that
   ALWAYS-ASK is the desired policy: RECALL → ASK → SAVE.

skills/braindb/SKILL.md (+118 lines net)
- TOOL PRIORITY: new bullet 4 introducing wikis as a first-class
  entity type with the browse paths. Existing 4-bullet hierarchy
  preserved; /memory/sql exception wording untouched.
- SAVE / Saving philosophy: replaced "save everything worth
  remembering" framing with "always recall first; if net-new, ASK
  the user; only persist on yes." Exception path for user-stated
  rules ("from now on, always X") — save without an extra
  confirmation but surface the action.
- NEW WIKIS section between EXPLORE and INGEST, three subsections:
  recall (GET /entities?entity_type=wiki + GET /entities/<id>);
  indirect write (default — save facts tagged with the subject's
  keyword, optionally POST /wiki/cron to nudge the pipeline,
  inspect via /wiki/jobs?status=pending); direct write (power
  user, rare — POST /wikis with the "bypasses dedup pipeline"
  caveat and the keyword-UUID lookup tip). Explicitly notes that
  /wiki/maintain and /wiki/write are NOT documented here because
  they're claim-based (take no target) and only make sense
  inside the scheduler.

skills/braindb-agent/SKILL.md (+60 lines net)
- Header: drop "LiteLLM + NVIDIA NIM"; describe as "LiteLLM with
  pluggable provider via LLM_PROFILE; defaults to
  deepinfra/google/gemma-4-31B-it."
- TOOL PRIORITY: tighten the SQL-avoidance sentence to match the
  direct skill's emphasis ("if you're tempted to phrase a request
  as 'run a SQL query that finds…', stop"). Add one paragraph
  noting wikis are first-class and the agent surfaces them through
  recall automatically — no special endpoint, no user action.
- NEW "Proactive save — but ASK the user first" subsection
  replacing the previous "Be proactive" one-liner. Spells out the
  RECALL → ASK → SAVE flow with the exact phrasing Claude should
  use ("I haven't seen this before — should I save it to
  BrainDB?"). Lists what's worth flagging (identity, preferences,
  project context, decisions, URLs, inferences-about-the-user).
  Clarifies the goal: capture what the user gives that ISN'T
  already in BrainDB, not scrape every utterance.
- Examples table rewritten into TWO tables (Recall, no
  confirmation; Save, three-column "what Claude says to the
  user" + "what Claude sends to the agent on yes") to make the
  ASK pattern visually obvious.

Verification
------------

- grep submit_result in both → 0 hits (regression check; the
  rename to final_answer already shipped)
- grep "NVIDIA NIM" in agent skill → 0 hits
- grep LLM_PROFILE in agent skill → 1 hit
- grep -i wiki → 24 hits in direct skill, 2 in agent skill
- grep "RECALL .* ASK .* SAVE" → present in both

The skill-sync block at the top of each in-repo SKILL.md
(diff-against-cached-copy → SKILL_UPDATE_AVAILABLE) auto-detects
the new versions on next /braindb or /braindb-agent invocation
and prompts the user to refresh ~/.claude/skills/<name>/SKILL.md.

What stays untouched
--------------------

- The endpoints. No new routes, no new agent tools, no server-side
  code.
- CLAUDE.md (already has the wiki-via-pipeline framing in its
  TOOL PRIORITY block).
- The agent prompts (system_prompt.md, wiki_maintainer_prompt.md,
  wiki_writer_prompt.md) — they govern in-agent behaviour, not
  what skill users tell the agent to do.
- The .repo_path skill-sync mechanism (still works as-is).
Live verification on Qwen-3.6-27B-AWQ-INT4 via vLLM exposed the last
piece of the typed-final puzzle: when Qwen calls `final_answer`, the
arguments come back as

  {"payload": "{\"action\": \"skip\", \"rationale\": \"...\"}"}

NOT as

  {"payload": {"action": "skip", "rationale": "..."}}

The outer `arguments` field is unwrapped once by the SDK (per the
OpenAI spec, where `arguments` is "a string containing a JSON
object"), but the inner `payload` value is itself still a
JSON-encoded string. The SDK then hands that string to Pydantic via
`AgentAnswer.model_validate("<string>")`, which raises:

  Input should be a valid dictionary or instance of <Model>

Verified twice live on Qwen: once on the general agent
(`/agent/query` "Sawki's brother" → 500 after Layer 4 retry also
failed); once on the wiki maintainer (parallel triage tick on a
`_pytest_*` orphan, same Pydantic shape error). Both attempts were
emitting structurally valid JSON inside the string — the LLM
followed the schema; the SDK just doesn't unwrap twice.

Fix
----

braindb/agent/schemas.py — new `_maybe_parse_json_string` helper +
`@model_validator(mode="before")` on each of the four typed submit
models (AgentAnswer, MaintainerDecision, WikiWriteResult,
SubagentResult). The validator runs BEFORE field-level validation:

  - If input is a `str`, attempt `json.loads(v)`. If it parses to a
    dict, return that dict; field validators then run on each
    field's value exactly as if the LLM had sent a dict to begin
    with.
  - If it parses to anything else (list / int / null / bool), let
    Pydantic raise the usual "valid dictionary" error so the LLM
    gets a clear correction on Layer 4 retry.
  - If json.loads raises (non-JSON string), let Pydantic raise the
    usual error. No silent acceptance of garbage.
  - If input is a dict, pass through unchanged — well-behaved
    providers (deepinfra, OpenAI native via LiteLLM, Anthropic) see
    EXACTLY the same code path as before this commit.

The LLM-visible JSON schema does NOT change. We don't advertise
string-form acceptance to any model. This is purely a server-side
safety net — same pattern, same justification, and same one-place
edit as the nullable-field coercion in 6b20b9f.

The existing field-level coercers (target_wiki_no="" -> None,
consolidate_nos=None -> [], etc.) still run on the post-parse dict,
so a Qwen submission like

  payload="{\"action\": \"skip\", \"target_wiki_no\": \"\", \"rationale\": \"...\"}"

now goes:
  raw string -> _maybe_parse_json_string -> dict
            -> field validators (target_wiki_no="" -> None)
            -> typed MaintainerDecision(action="skip", target_wiki_no=None, ...)

Tests
-----

tests/test_final_answer_rename.py — 7 new tests:

  test_agent_answer_accepts_json_string_payload
  test_maintainer_decision_accepts_json_string_payload
  test_wiki_write_result_accepts_json_string_payload
  test_subagent_result_accepts_json_string_payload
    Each: model.model_validate(<JSON-string-of-dict>) succeeds with
    the right typed instance.
  test_dict_payload_still_passes_through_unchanged
    All four models: dict input behaviour is byte-identical to
    pre-commit. Regression cover for deepinfra / Gemma / OpenAI.
  test_non_json_string_still_fails_clearly
    Plain text, JSON list, JSON string-literal, JSON number, JSON
    null all still raise ValidationError. We don't accept garbage.
  test_json_string_with_missing_required_field_still_fails
    A JSON-string of a dict missing required fields raises with
    the right field name in the error. We parse the JSON but do
    NOT silence structural problems — the LLM still sees a
    correctable error.

Full pytest suite: 84 passed (was 77, +7).

Live verification
-----------------

Pre-fix Qwen recall query: HTTP 500, Layer 4 retry ALSO failed,
`payload Input should be a valid dictionary` on both attempts.

Post-fix Qwen recall (same query "what is the main characteristic
of the brother of Sawki?"): HTTP 200 in 18 seconds, two-tool clean
run (`recall_memory` -> `final_answer`), grounded answer
("exceptionally clever, despite not speaking Greek well"). No
Layer 4 retry needed — first attempt succeeded once the SDK
validator could unwrap the JSON-string.

What this does NOT do
---------------------

- Does NOT change the @function_tool schema seen by the LLM.
- Does NOT silence Layer 4 retries — they still fire when the LLM
  truly fails to call final_answer; just no longer triggered by
  the unwrap-once SDK quirk.
- Does NOT change deepinfra / OpenAI / Anthropic behaviour. Dict
  inputs flow through the validator untouched.
- Does NOT widen the typed-final contract. The final return is
  still a validated Pydantic instance, exactly as before.

Combined with the prior commits this closes the Qwen-side
limitation: the typed-final + retry-correction architecture now
survives weak / quantised models reliably on both deepinfra/Gemma
and Qwen via vLLM, without weakening the strict-final contract.
…untdown message

Live observation today on Qwen 27B AWQ-INT4 (vLLM, workstation):
deep-research-style runs commonly use >15 tool turns before
landing `final_answer`. With max_turns=15 the SDK forced
termination and Layer 4 retry had to recover. With the old
threshold=5 the nudge fired only at turn 10 and its wording was
aggressive ("Finalise NOW... Do not start any new research") —
right tone for the last few turns, but too sharp when 8 turns
were still on the table.

This tune addresses three things asked for by the user:

  1. Increase the default turn budget *slightly* (15 -> 20). Gives
     deep-research models breathing room; finishes-fast providers
     (deepinfra/Gemma) are unaffected because they never get close.
     Lower than ~15 will regress Qwen behaviour and is documented
     as such on the setting and in .env.example.

  2. Start the countdown earlier (threshold 5 -> 8). With the new
     max_turns=20 the nudge fires at turn 12 instead of 15 — the
     model gets ~8 turns of "wrap up" runway instead of 5.

  3. Soften the wording from "submit NOW" to "start wrapping up".
     But ONLY when the budget is generous. The same hook is reused
     by the Layer 4 retry path with max_turns=3, where soft framing
     would be the wrong message. Solution: pick tone from
     `self.max_turns` alone, no new constructor flag:

       max_turns >  5  -> SOFT: "Heads up: you have N tool calls
         left in this run. Start wrapping up — synthesise what you
         have already gathered and prepare to call `final_answer`.
         Focused gap-filling is fine; avoid opening brand-new lines
         of investigation."

       max_turns <= 5  -> HARD: "You have N tool calls left. Call
         `final_answer` with your answer now. Do not start new
         research."

     The retry path (max_turns=3, settings.agent_retry_max_turns)
     naturally lands in the hard branch — no special-casing.

Files
-----

braindb/config.py — two defaults bumped, docstrings expanded to
explain why and what lower values cost.

braindb/agent/hooks.py — `_format_nudge` rewritten as a tone-aware
formatter. Constructor signature, `on_llm_start` plumbing, the
`_fired` flag, the defensive try/except all unchanged. ~25 line
diff inside the helper plus a docstring explaining the tone
heuristic.

.env.example — added two commented-out reference lines
(AGENT_MAX_TURNS / AGENT_COUNTDOWN_THRESHOLD) so future operators
who copy the example see the knobs and the warning about lowering
below ~15. The lines are commented so the code defaults rule;
they're documentation, not configuration.

tests/test_runhooks_countdown.py — three new tests:

  - test_soft_tone_when_max_turns_above_threshold
    max_turns=20, threshold=8: nudge fires at remaining=8 with
    "wrapping up" + "gap-filling" wording; does NOT contain the
    hard-tone "with your answer now" phrase.
  - test_hard_tone_when_max_turns_at_retry_budget
    max_turns=3 (Layer 4 retry value), threshold=8: fires on turn
    1 with "with your answer now" wording; does NOT contain the
    soft-tone "wrapping up" phrase.
  - test_remaining_plural_grammar
    Both tones produce "1 tool call" (singular) and "N tool calls"
    (plural) correctly.

Existing tests stay green — they asserted structural behaviour
(fired-once, threshold-respected, exception-swallowing) and the
tool name appearing in the message, none of which the tone
rewrite changes.

Verification
------------

- Full pytest: 87 passed (was 84, +3 tone/grammar tests).
- In-container check after restart:
    docker exec braindb_api python -c "from braindb.config import settings; print(settings.agent_max_turns, settings.agent_countdown_threshold)"
    -> 20 8
- .env has no AGENT_MAX_TURNS or AGENT_COUNTDOWN_THRESHOLD override
  (verified by grep) — the bumped defaults take effect.

What stays untouched
--------------------

- agent_subagent_max_turns (30) — subagents do focused tasks.
- agent_retry_max_turns (3) — retry budget is still tight; the
  hard tone above is the right wording at that scale.
- wiki maintainer/writer per-call max_turns (30/30) and ingest
  watcher per-call max_turns (40/30) — these callers opted into
  their numbers; the bumped default only changes the fallback
  used when no max_turns is passed (currently only the general
  /agent/query path).
- The typed-final contract, Layer 4 retry-with-correction, the
  schemas, the prompts, the wiki pipeline — none of these change.
  The plan only loosens *pressure*, not the *exit condition*.
…/sources/, no agent call needed

User observation: the agent skill (skills/braindb-agent/SKILL.md) makes
zero mention of the file-ingest pipeline. A Claude Code user on
another project who installs this skill might prompt the agent with
"Save this file..." and paste raw content into the LLM prompt — which
bloats context and bypasses the proper extraction pipeline. The
direct skill (skills/braindb/SKILL.md, lines 480-492) already
documents this; the agent skill should too, framed for the
natural-language audience.

What changed
------------

skills/braindb-agent/SKILL.md — new "File ingestion — automatic, no
agent call needed" section inserted between Delegation and Verbose
mode. Covers:

  - How the watcher pipeline works end-to-end (poll, ingest, extract,
    move to ingested/ or failed/).
  - The user-facing recommendation Claude should give: "Just drop
    the file into data/sources/". One line, clear and actionable.
  - The negative instruction: do NOT paste file contents into an
    /agent/query "Save this file..." prompt. It bypasses
    extraction, bloats LLM context, and skips the derived_from
    relations the watcher produces.
  - The verbose-watch command (docker logs braindb_watcher -f) and
    the success log lines to look for.
  - Edge cases: chunked extraction timing on local Qwen vs
    deepinfra, where errors land, and the content-hash dedup
    behaviour.

The direct skill (skills/braindb/SKILL.md) already has equivalent
coverage in its INGEST section and is not touched by this commit.

Verification
------------

grep "data/sources" skills/braindb-agent/SKILL.md -> 5 hits
(was 0 before this commit).

The skill-sync block at the top of skills/braindb-agent/SKILL.md
will auto-detect the diff on next invocation and prompt the user
to refresh ~/.claude/skills/braindb-agent/SKILL.md.

What stays untouched
--------------------

- The agent's behaviour, prompts, tool catalog, schemas, runtime.
- skills/braindb/SKILL.md (already documented).
- CLAUDE.md (out of scope; the in-repo guidance file).
…0 min)

Live observation today on Qwen 27B AWQ-INT4 (vLLM, workstation): full
wiki-body writes routinely run 6-15 minutes on this model. The 600s
default deadline caused the scheduler's HTTP client to give up while
the api kept working in the background — the write still committed
(observed: 89 wikis revised in one hour despite repeated `Read timed
out (read timeout=600)` lines in the scheduler log), but the scheduler
couldn't see the completion and was less efficient at draining the
queue.

This is the scheduler's HTTP-client patience knob. The api itself is
NOT bounded by it — the agent run finishes on its own clock. Raising
this only means the scheduler waits longer before declaring "I gave
up" for a single in-flight job.

1200s (20 min) is generous enough that nearly every Qwen body
generation completes within the window, while still surfacing
genuinely-stuck jobs (e.g. vLLM hung, GPU starved) as failures rather
than blocking indefinitely.

Files
-----

braindb/wiki_scheduler.py — change the os.getenv default from "600"
to "1200" on the AGENT_TIMEOUT line; add a docstring above the line
explaining why and what the knob actually controls (scheduler's
patience, not api processing time).

.env.example — add a commented-out WIKI_AGENT_TIMEOUT=1200 reference
block, with the same warning about lowering below ~600 regressing
Qwen behaviour. The line is commented so the code default rules.

Verification
------------

- grep "WIKI_AGENT_TIMEOUT" .env -> empty (no override; default rules).
- After `docker compose up -d --no-deps --force-recreate wiki_scheduler`:
    docker exec braindb_wiki_scheduler env | grep WIKI_AGENT_TIMEOUT
    -> (empty; running with the code default 1200)
    OR (when set) WIKI_AGENT_TIMEOUT=1200
- Watch scheduler log for the next ~30 min — "Read timed out" lines
  should drop sharply now that the client waits long enough for Qwen
  to finish.

What this does NOT do
---------------------

- Does NOT change the api's processing time or per-agent max_turns.
- Does NOT change the writer / maintainer / agent prompts or schemas.
- Does NOT address the underlying "writer rewrites the same wiki
  repeatedly" pattern (observed in this hour: Dimitrios Koutsoumpos
  rewritten 8x, Smart Sand 6x). That's a separate architectural
  optimization — batching multiple new members per revision, or
  cooldown per-wiki — not in scope for this commit.
dimknaf added 14 commits May 20, 2026 19:59
…pplied explicitly

Live observation today: while a wiki writer was running a 10-min Qwen
LLM call, my .py edits on the host triggered uvicorn's auto-reload
through the `.:/app` bind mount. During the swap window the api
refused new connections for ~20-30 s (embedding model reloads).
The scheduler logged `Connection refused`, retried, and the in-flight
write itself wasn't killed mid-token (uvicorn waits for "background
tasks to complete") — but everything else got bounced: the
scheduler's poll, the watcher's health-check, fresh /agent/query
calls. The reload happens on the editor's clock, not on a quiet
moment in the pipeline.

Fix
---

Remove `--reload` from the api's `command:` in docker-compose.yml.
No new env var, no opt-in switch, no .env.example entry. Code
changes are now applied explicitly:

  docker compose up -d --no-deps --force-recreate api

Predictable, atomic, operator picks the moment.

Anyone who wants dev-style live reload can override the command via
`docker compose run --no-deps api sh -c "... --reload"` or a personal
`docker-compose.override.yml` — no need to bake an opt-in switch
into the default that 99% of the time would be off.

Verification
------------

Before: `docker logs braindb_api` showed `Started reloader process`
+ `Will watch for changes in these directories: ['/app']` lines.

After this commit: same logs show only `Uvicorn running on
http://0.0.0.0:8000`, no reload / watch lines.

What stays untouched
--------------------

- The api itself (same image, same env, same port).
- The watcher and wiki_scheduler — they don't use --reload anyway
  (they run plain `python -m braindb.{ingest_watcher,wiki_scheduler}`),
  so they were already explicit-restart-only. Now the api is too.
- No code, no schemas, no agent prompts, no tests.
…ching)

Today's Qwen-on-workstation observation: a single hot subject
(Dimitrios Koutsoumpos) got rewritten 8 times in one hour, Smart
Sand 6x. The writer (full-body regeneration) is ~98% of LLM cost;
each rewrite paid 5-10 min of recall+subagent overhead to splice in
a single new member, even when the existing body already covered
95% of what's needed.

Within-tick batching already exists in `next_write_bucket()` — when
the bucket claims, it groups ALL pending attach jobs for the same
`target_wiki_id` into a single writer call. What was missing is
ACROSS-tick batching: a new attach arriving 30 s after the prior
write fires triggers a fresh writer call instead of accumulating
with the next batch.

Fix
---

`braindb/services/wiki_jobs.py::next_write_bucket()` — add a
cooldown filter to the seed query so an attach bucket becomes
claimable ONLY when the OLDEST pending attach for that wiki is at
least `ATTACH_COOLDOWN_SEC` (default 300 s = 5 min) old. Once
eligible, the existing per-wiki batching scoops up EVERY pending
attach for that wiki (including ones inserted during the cooldown
window) into one writer call. Self-limiting — no force-claim valve
needed, the bucket drains the whole queue for that wiki on each
fire.

`consolidate` and `create` paths are untouched; the cooldown is
gated `job_type <> 'attach' OR ...` in the WHERE clause. The
existing `consolidate > attach > create` priority order is
preserved.

Net effect on the observed hot-subject pattern: ~5 attach jobs per
5-min window land in ONE writer call instead of 5 separate calls.
For Dimitrios K's 8/hr → expected ~1-2 writes/hr on the same load,
~80% LLM cost reduction for that subject.

Files
-----

`braindb/services/wiki_jobs.py`:
  - new module-level constant `ATTACH_COOLDOWN_SEC` (env-driven,
    matches the existing `ASSIGNED_LEASE_MIN` / `FRESHNESS_MINUTES`
    pattern in this file — no config.py touch).
  - `next_write_bucket()` SELECT gets an extra WHERE branch + a
    correlated subquery that computes the per-wiki cooldown
    eligibility. ~12 lines added.
  - Docstring on `next_write_bucket()` extended to describe the
    new cooldown semantics.

`tests/test_wiki_jobs_grouping.py` (NEW):
  Eight tests against the live Postgres (port 5433, the docker-
  compose mapping) covering core cooldown semantics, batching
  semantics, priority preservation, and edge cases. Each test
  seeds its own wiki entity + jobs, cleans up in `try/finally`.
  Test rows use very old timestamps (10 days) so they win FIFO
  against any pending production rows that may already exist in
  the running DB.

Verification
------------

- `pytest tests/test_wiki_jobs_grouping.py` → 8/8 pass against
  live Postgres.
- Full suite: 95/95 pass (was 87, +8).
- `docker exec braindb_api python -c "from braindb.services import wiki_jobs; print(wiki_jobs.ATTACH_COOLDOWN_SEC)"`
  → 300 (default loaded).
- `.env` has no `WIKI_ATTACH_COOLDOWN_SECONDS` override → default
  rules.

What this does NOT change
-------------------------

- Routers, agent prompts, schemas, hooks — none of it.
- The within-tick batching at wiki_jobs.py:367-377 — unchanged;
  cooldown gates WHEN the bucket becomes claimable, not WHAT it
  contains.
- The wiki maintainer — still inserts attach jobs the same way;
  scheduler just claims them with a delay.
- The typed-final contract, Layer 4 retry, the JSON-shape coercion
  — all unchanged.

Rollback
--------

`WIKI_ATTACH_COOLDOWN_SECONDS=0` in `.env` reverts to today's
"fire on every attach" behaviour. No DB migration to undo.
Adds a commented-out reference block to .env.example so future
operators see the knob alongside the existing scheduler/agent ones.
The block describes the default (300 = 5 min), the rollback path
(set to 0), and which paths are affected (attach only; consolidate
and create unchanged). Same documentation style as the
WIKI_AGENT_TIMEOUT and AGENT_MAX_TURNS blocks above it.

Code default rules; this is documentation only.
…-budget guidance

Softens the prior absolute rule ('the existing page is NOT evidence ...
ignore its claims') into a conservative framing — uncited prose and
new-member contradictions remain off-limits, but `[[ref:UUID]]`-cited
claims in the body are grounded by the prior revision's verified facts
and can be trusted unless something contradicts them. Adds an attach-mode
recall-budget block (the user-approved Draft B) directing the writer to
focus recall on new members, inconsistencies, and gaps — not on
re-fetching settled claims.

Why now
-------

Observed today on Qwen: each per-attach write spent 5-10 min on
recall+subagent overhead even when the prior body already covered 95%
of the subject. The combined cooldown (e3ee7c9) plus this hint targets
both axes of the same waste pattern: fewer writes overall AND each
write does less redundant research.

Compatibility
-------------

The two rules now coexist without contradiction. The prior 'NOT
evidence' framing is rephrased as 'conservatively' caution (prose is
still not evidence; uncited or contradicted claims still don't anchor
the new body). The new Draft B block sits underneath as recall-budget
guidance, not as 'trust everything the body says'. ~13 lines added to
the prompt; the existing Steps 1-3 protocol is byte-identical.

Tests
-----

tests/test_final_answer_rename.py — new
`test_writer_prompt_has_attach_mode_efficiency_hint` asserting the
Draft B header, all three bullet keys, the 'conservatively' rephrasing,
and the closing balance phrase are all present in the prompt. Regression
cover so a future accidental delete trips red.

Full pytest: 96/96 (was 95, +1).
Three audit findings from today's changes, all in user-facing docs:

- BRAINDB_GUIDE.md line 346: the example /agent/query curl
  pinned 'max_turns: 15' (the old default). Removed the line
  so the example uses the default (now 20) implicitly; added
  a one-line note that max_turns is optional.

- README.md line 172: stale 'max_turns: 15' in the example
  agent response. Bumped to 20.

- README.md line 179: the LLM_PROFILE explainer listed only
  'deepinfra' and 'nim' as if those were the only profiles.
  vllm_workstation and vllm_workstation_qwen are also
  first-class today (we verified the full pipeline end-to-end
  on vllm_workstation_qwen earlier this session). Expanded
  the list + added VLLM_API_KEY to the env example.

CLAUDE.md, BRAINDB_GUIDE.md elsewhere, .env.example,
skills/braindb/SKILL.md, skills/braindb-agent/SKILL.md,
CONTRIBUTING.md were audited and confirmed current — no
'submit_result' ghosts, no other stale defaults, the new
WIKI_AGENT_TIMEOUT / WIKI_ATTACH_COOLDOWN_SECONDS knobs are
documented.

The untracked docs/wiki-frontend-plan.md also had a stale
'uvicorn --reload' reference; that edit is in the working
tree but not in this commit (it's a personal note, not in
git's tracked set).
Adds vllm_workstation_gemma alongside the existing vllm_workstation
(port 8002) and vllm_workstation_qwen (port 8010). Local Gemma 31B
at port 8009 with max_model_len 13000. Smoke-tested via /agent/query
including a complex multi-angle synthesis call — handled cleanly.
Preserved as a runtime option for the agent path; .env LLM_PROFILE
flip is transient (not committed).
Finalised plan for a zero-backend, Wikipedia-grade read-only Reader +
Ops dashboard built purely from existing GETs. Captured in-repo so we
can resume cleanly without re-planning. Execution deferred to a later
session.
…ttaches

Add five writer-only @function_tools (read_wiki_outline, read_wiki_section,
edit_wiki_section, delete_wiki_section, validate_wiki) so the writer can
read just an outline and rewrite one section at a time instead of
re-emitting the whole markdown blob every turn. Big wikis no longer have
to fit twice in the model context window (once in, once out) on a single
attach pass.

The section anchors are the `<!-- section:NAME -->` HTML-comment markers
the writer prompt already mandates (pre-flight on prod data: 88/88 active
wikis have markers; the one un-markered wiki was a corrupted leftover and
was retired). Strict-markers contract enforced: tools error if a target
body has no markers, no H2 fallback.

Optimistic concurrency via the existing `wikis_ext.revision` column —
every read returns the current revision; every write requires it as
`expect_revision`. Mismatch returns a "stale revision, re-read first"
error string so the LLM corrects itself instead of stomping a concurrent
or self-stale edit.

Persistence interaction: `WikiWriteResult.body` is now optional (default
empty string). In attach mode the router captures pre-run revision; if
the agent submits `body=""` AND the revision moved during the run, the
router treats the section edits as authoritative content and uses the
in-DB body for the finalize path (extract_summary_disambig + reconcile
summarises). create/consolidate still require non-empty body.

Anti-bloat:
- Tools added to existing tools.py, not a new file.
- Wired into the writer agent only via a new `extra_tools` arg to _build;
  zero leakage to query/maintainer agents (verified).
- Parser/splice live in a new `services/wiki_sections.py` (kept separate
  from tool wiring so they unit-test without DB).
- Tool docstrings 1-2 lines; section grammar taught once in the writer
  prompt's new "Section-edit path" block.

Verified:
- 22 unit tests over the pure parsing/splice/grammar layer (parse
  identity, append-new, delete, stale-rev class, grouped-refs tolerance,
  malformed-ref detection). All pass.
- Real-wiki parse + roundtrip on three of the largest wikis (Dimitrios
  Koutsoumpos 22.5K, Dimitris 15.9K, BrainDB 13.6K): zero byte drift.
- End-to-end DB roundtrip on the smallest active wiki: revision bump
  on edit, stale-revision rejection on retry with old token, byte-
  identical revert.
- Tool registration: writer = 26 tools (was 21, +5); query agent and
  maintainer agent tool sets unchanged.
When the writer's context approaches the model's window mid-job, hand
off to a fresh agent (same prompt + tools) seeded with a structured
brief, instead of running out and failing the job. Composes naturally
with the section-edit tools from the prior commit: the dying agent's
section edits are already persisted; the successor picks up the work.

Mechanism (writer-only, opt-in via token_budget > 0):

1. Token-budget watch in CountdownHooks. Extends the existing Layer 3
   hook with an OPTIONAL second nudge driven by a cheap chars/4 estimate
   of input_items. Original turn-budget behaviour is unchanged when the
   new knob is left at 0 (default for query/maintainer agents). Two
   independent fired-once flags so the nudges never suppress each other.

2. handoff_to_successor tool in tools.py. Takes a structured brief
   (progress_summary + remaining_work). The body records the brief in
   a per-run handoff slot AND parks a placeholder WikiWriteResult via
   record_submit so run_typed's typed-final contract is satisfied
   without it needing to know about handoffs. The writer's
   StopAtTools list includes the tool name, so the loop halts cleanly.

3. Per-run handoff slot in run_state.py. Mirrors the existing
   final-answer slot exactly: ContextVar holding a mutable container
   so cross-Task writes are visible to the wrapper.

4. Respawn loop in routers/wiki.py. After run_typed returns, if the
   handoff slot was captured, build a successor seed from the brief
   and re-invoke run_typed. Recur up to agent_writer_handoff_max_depth
   (default 3); cap-exhaustion is a job failure. Slot is reset between
   iterations so each successor can also hand off.

5. Writer prompt: new "Context handoff" block explains when to use
   the handoff tool vs finishing inline, and the brief shape the
   successor needs to pick up cleanly.

Anti-bloat:
- No new hook file (extended CountdownHooks).
- No new tool module (handoff in existing tools.py).
- No new endpoint, no schema change beyond Phase 1.
- No forced tool_choice plumbing — strong nudge text + the existing
  Layer 4 retry-with-correction is the safety net.
- Single absolute-token knob (9000 default) instead of per-profile
  pct math — fires conservatively on bigger windows, safely on Gemma's
  13K. One config line.

Verified:
- 15 new unit tests in tests/test_handoff_hooks.py cover the token
  estimator (dict / list-of-parts / object shapes), the token nudge
  (fires on threshold, idempotent, disabled at 0), the independence
  of turn nudge and token nudge, the handoff slot lifecycle (install,
  capture, isolated across nested installs, no-op outside scope), and
  the handoff tool body's dual-slot fill.
- Existing 10 CountdownHooks tests still pass — the new fired-flag
  rename to _fired_turns is back-compat shimmed via a property.
- Full suite: 125 pass, 8 pre-existing environmental errors in
  test_wiki_jobs_grouping.py (those hardcode localhost:5433 and only
  run from the host).
- Wiring smoke: writer has 27 tools (was 21, +5 section + 1 handoff),
  StopAtTools includes both final_answer and handoff_to_successor,
  zero leakage to the query or maintainer agents.
- Adjusted tests/test_final_answer_rename.py: WikiWriteResult.body
  became optional in Phase 1, so its required-keys list is now just
  ["mode"]; the shape-hint test is updated to match.

What this does NOT cover (deferred):
- Live LLM-driven smoke (force threshold low, run the writer end-
  to-end, observe one handoff + successor reaches final_answer).
  That's the Phase 3 task once the scheduler is re-enabled.
…r Phase 3 obs

Two surgical adjustments after observing Phase 3 live on Qwen 40K:

1. Writer prompt — clarify `body=""` is ATTACH MODE ONLY in both
   places the section-edit / handoff blocks mention it. Observed
   failure mode on a live consolidate: the successor agent inherited
   the section-edit framing from its handoff brief and submitted
   final_answer(mode='consolidate', body=""), which the router
   correctly rejected. The mechanism worked end-to-end; the contract
   wasn't unambiguous enough for a fresh-context successor that
   doesn't see the full conditioning of the parent run. Added one
   explicit "ATTACH MODE ONLY" line in the section-edit block plus
   one mode-aware qualifier in the context-handoff block. No new
   sections, no restructuring.

2. agent_writer_handoff_token_budget 9000 → 20000. The 9000 default
   from the original plan was tuned for Gemma's 13K window (~70%).
   On Qwen 40K it fires at ~25% which is too eager — routine
   consolidates that fit fine inline got fragmented across
   successors. 20000 is ~50% of Qwen's window and ~63% of hosted-
   Gemma 32K, both safe. On local Gemma 13K it sits above the
   window so handoff never fires, which is fine — small-context
   path already fails at initial prompt construction (the section
   tools can't reach it from there; that's a different fix).

Tests: same 47 hooks + section + countdown tests pass (no logic
changed, only prompt text + one default value).
Two surgical fixes for Qwen-side failures observed during Phase 3
live observation:

1. routers/wiki.py: stub the inlined wiki body when it exceeds
   _INLINE_BODY_MAX_CHARS (4000ch). For attach mode on a big wiki the
   stub points the writer at the section tools it already has (Phase
   1) instead of forcing the entire body into the initial prompt.
   Saves ~7K tokens up-front on a 30K-char wiki; the writer can
   navigate via read_wiki_outline + read_wiki_section without ever
   bumping into the model window. Other modes (create/consolidate)
   and small bodies inline as before — regression-safe.

   Direct cause of one Phase-3 failure: 30K-char Dimitrios body
   inlined verbatim brought the writer's first LLM call to 14K
   tokens. Subsequent tool results pushed accumulated context past
   Qwen's 40K window before the writer could finish, surfacing as
   ContextWindowExceededError. The section tools were the exact
   prescription, but the inlining blocked them from being used.

2. agent/agent.py::run_typed: catch litellm.BadRequestError and
   retry once with a fresh run; re-raise ContextWindowExceededError
   immediately (unrecoverable without input truncation, which the
   prompt-stub fix handles upstream).

   Direct cause of another Phase-3 failure: Qwen 27B AWQ-INT4
   occasionally emits malformed JSON in tool-call args; the OpenAI
   client raises BadRequestError before the tool body runs. The
   existing Layer 4 retry only fires when Runner.run returns
   without final_answer — it never gets a chance when Runner.run
   itself raises. One bounded retry via the run_typed recursion
   (gated by `_bad_request_retried` flag) is the cheapest path to
   recover the transient case without inventing a new retry layer.

Anti-bloat properties:
- ~27 lines total across two existing files. No new files, no new
  abstractions, no new dependencies.
- Reuses the Phase-1 writer prompt's section-tool block (the stub
  just points the agent at tools already documented).
- Reuses run_typed itself as the retry vehicle (one keyword flag,
  bounded to depth 1) — no separate helper, no exception-policy
  module.
- ContextWindowExceededError is explicitly NOT retried: pointless
  without input truncation, and would mask the upstream signal.

Verified:
- 87 existing tests pass (wiki_sections + handoff_hooks + countdown
  + final_answer_rename).
- Direct sanity-test of _body_block_or_stub across modes/sizes:
  small body inlines, big attach stubs (~30K → ~470 chars), big
  consolidate stays inlined, empty body stays as create marker.
- Imports clean (litellm.BadRequestError + ContextWindowExceededError).
- Live re-test: the writer DID follow the stub's direction to use
  section tools (read_wiki_outline + read_wiki_section), confirming
  Fix A's intent works end-to-end.

What this does NOT do:
- Does not address the writer's discretionary no-op behavior on
  wikis whose new member feels already-covered. The agent reads
  sections, decides nothing needs to change, submits body="" with
  no section edits, and the existing Phase-1 guard correctly fails
  it. That's a writer-prompt-conservatism question (separate from
  Qwen-output robustness) — to be tightened in a follow-up if
  re-triage loops persist in observation.
- Does not change the handoff threshold (20K stays; Fix A leaves
  more headroom under it).
- Does not lower recall_memory result caps (already 8K chars).
…ited member

When the writer reads a wiki, decides "no integration needed because the
new member is already cited in the prose", and submits final_answer with
body="" and no section edits, the existing guard at
`empty body AND no section edits — agent did nothing` failed the job.
The job hit attempts=3 → permanently failed → maintainer re-flagged the
same orphan member → endless re-triage loop.

Root cause:
- `reconcile_summarises_additive` only runs after `finalize_wiki_write`.
- finalize doesn't run on the empty-body / no-edits path (guard fails).
- So even though the body contains `[[ref:UUID]]` for the member, the
  graph never records the `summarises` relation that would have closed
  the orphan check on the maintainer side.

Two surgical changes (no new abstractions, reuses existing helpers):

1. routers/wiki.py — split the empty-body guard:
   - if any assigned MEMBER is missing from the current body → still
     fail (the writer genuinely skipped real work).
   - else (all members already cited) → call
     `wiki_jobs.reconcile_summarises_additive` against the in-DB body,
     finish the jobs as `done`, log a `wiki_write` activity with
     `no_op=true`. The body is untouched; only the graph catches up.
   This uses existing `wiki_jobs.parse_refs` for citation detection
   and the existing reconcile function. ~30 added lines, replaces the
   prior 7-line failure block.

2. wiki_writer_prompt.md — two clarifications so the agent
   understands the contract from the inside:
   - Extends the "be thorough where evidence is fresh; be efficient
     where the body has it right" line with "but every assigned MEMBER
     still needs to be cited at least once — the citation is what
     records the `summarises` relation".
   - New short "Citation is mechanical, not editorial" block right
     after "Preserve prior work" explaining the consequence + the
     remedy (add to the references section if your section edits don't
     naturally cite a member). ~10 lines of prompt.

Verified live on Qwen 27B:
- Reset the previously-permanently-failed `attach` on a 30K-char wiki
  with a member that WAS already cited inline but missing from the
  references bullet list. The writer worked through identity
  resolution, recognised "member 67949c16 is cited inline BUT missing
  from references" (the new prompt rule landed), and submitted
  final_answer(body=""). Router accepted the no-op, ran reconcile,
  added 2 missing `summarises` relations (one for this wiki + one
  for a sibling that also cited the same member but had a stale
  graph). Job done. Wiki body unchanged. Orphan closed.
- 125 tests pass (skipping env-bound test_wiki_jobs_grouping).

What this commit does NOT do:
- Does not allow body="" + missing-citation no-ops (correctly fails
  those — the writer skipped real work).
- Does not change the writer's section-edit path, the handoff path,
  or the section-tool prompt block.
- Does not touch reconcile semantics — it's still additive,
  idempotent, and uses inline `[[ref:UUID]]` tokens as the sole
  signal.
The shared `_maybe_parse_json_string` validator on the four typed-
final schemas (AgentAnswer, MaintainerDecision, WikiWriteResult,
SubagentResult) gains a single-step fallback: when the first
`json.loads` of a string payload yields another string (rather than
a dict), try one more parse. Handles the Qwen-class quantised-model
quirk where the tool-call args occasionally come over the wire
double-escaped.

Safety properties (each verifiable by reading the 9-line diff):

- Only activates when `isinstance(v, str)` AND first parse yields a
  string. Compliant providers (deepinfra, hosted-Gemma, well-behaved
  local models) send dicts directly and never enter the string branch
  at all — dead code for them.
- Only returns a value if the final parse yields a dict. JSON of
  list/int/null still falls through to Pydantic's normal rejection.
- Second parse failure returns the original input unchanged so
  Pydantic raises the same "Input should be a valid dictionary"
  error today.
- No new file, no new function, no new import, no schema change, no
  prompt change. Pure extension of one existing helper.

Background: live-observed during the Phase 3 follow-up session.
Maintainer, subagent, and query agent all hit
`payload: Input should be a valid dictionary` failures on Qwen 27B
AWQ-INT4. The current validator handled single-escape (Qwen quirk
captured in a84c182); this commit extends to the double-escape
variant. We don't have direct log evidence of the exact shape Qwen
sent in the most recent failure (the SDK validator runs before our
`@_verbose` decorator can log the args), so this is a defensive
preemption that handles a known quirk without breaking any current
acceptance behaviour.

Tests:
- New: tests/test_final_answer_rename.py::test_double_escaped_json_payload_unwraps
- Unchanged: existing single-escape, dict-passthrough, non-JSON
  rejection, and missing-field rejection tests all still pass
  (126/126 on full suite).
@dimknaf dimknaf force-pushed the feat/wikis-and-maintainer-agent branch from 85846aa to 8ebc884 Compare May 24, 2026 07:29
dimknaf added 3 commits May 24, 2026 08:33
The per-test created_entities fixture fails open when tests error before
registering their IDs (or use raw psycopg2). Add a session-scoped autouse
fixture that, after all tests finish, deletes any entity tagged with a
_pytest_<hex> keyword plus the keyword entities themselves. Pattern is
uniquely produced by tests/conftest.py::test_tag, so a content LIKE
'_pytest_%' filter is provably scoped to test artefacts.

Verified end-to-end: baseline of 407 pollutants swept clean; production
entity counts (facts/wikis/thoughts/datasources) unchanged.
…ffold

Aligns pyproject.toml to 0.2.0 (matches braindb/main.py) and ships the
public-readiness changes the wiki/maintainer/writer work needs:

- CHANGELOG.md (Keep-a-Changelog) covering wiki pipeline, typed-final,
  Layer-4 retry, section-edit tools, writer handoff, recall improvements,
  scheduler, compat fixes, test hygiene.
- README, BRAINDB_GUIDE, CLAUDE, CONTRIBUTING now lead with
  deepinfra/google/gemma-4-31B-it as the recommended default; vllm_*
  documented as advanced/offline/requires-GPU.
- One-line comment above _LLM_PROFILES capturing the same recommendation.
- Documentation polish across docs/ and skills/ for public release.
- .github/workflows/test.yml: minimal CI that boots the stack against a
  pgvector postgres service, waits for /health, and runs the typed-final
  + handoff unit tests on every PR + push to main.
…ecall preview

- Added: wiki HTTP endpoints (cron / maintain / write / jobs).
- Added: Configurable subsection listing WIKI_ENABLED / WIKI_INTERVAL /
  WIKI_FRESHNESS_MINUTES / WIKI_ATTACH_COOLDOWN_SECONDS /
  WIKI_AGENT_TIMEOUT / AGENT_VERBOSE with defaults.
- Changed: clarify multi-item recall returns previews; full body via
  GET /api/v1/entities/{id} with offset/limit paging.
- New "Upgrading from 0.1.0" subsection covering migration 005 + the
  WIKI_ENABLED opt-in default.
@dimknaf dimknaf force-pushed the feat/wikis-and-maintainer-agent branch from 8ebc884 to e73f83e Compare May 24, 2026 07:35
@dimknaf dimknaf merged commit 02de251 into main May 24, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant