Language models are a larynx. Nous is a persistent epistemic substrate for AI.
Nous stores typed relations, graded uncertainty, contradiction boundaries, and memory across time.
Inversion · Research Program · Evidence · Research · Roadmap · Integration
Most AI systems still place the language model at the center and attach tools, memory, and wrappers around it.
Nous inverts that stack.
In this picture, the language model is not discarded. It is repositioned.
The model remains the expression system, the semantic surface, the larynx.
Nous is the persistent epistemic layer behind it.
Nous should be read as a research program with a live implementation, not as a generic AI utility.
It stands on four linked claims:
- language output is not identical with intelligence
- a persistent epistemic substrate can supply what the output layer lacks
- a different category of system requires different benchmarks and longitudinal metrics
- these claims are testable because the repository contains a working substrate rather than only a paper argument
The fastest reading path is:
Evaluating Nous with standard LLM benchmarks would be like measuring the sweetness of chocolate with a Scoville scale. The instrument is not merely inaccurate; it is measuring the wrong physical phenomenon entirely.
Benchmarks such as MMLU, ARC, and HumanEval are valid instruments for language models. Nous is not merely a language model output surface. It is a persistent epistemic substrate.
The relevant question is therefore not only whether a model produced the expected answer right now. It is whether the system knew what it knew, knew what it did not know, preserved consistency under contradiction, and changed structure coherently across time.
What is currently called AI is mostly semantic prediction.
Nous is an attempt to define a different category of system: one that can preserve knowledge, uncertainty, contradiction, and structural change across time.
- It stores relations, not just retrieved chunks.
- It carries confidence, rationale, and uncertainty with the memory itself.
- It makes the boundary between known, probable, and unknown visible to the model.
That changes agent behavior in the place that actually matters: when a model is close to hallucinating but still sounds fluent.
One documented reference run is run_20260403_094211 (see eval/RESULTS_INDEX.md).
Model Score Questions
─────────────────────────────────────────────────────
llama3.1-8b (no memory) 46% 60
llama-3.3-70b (no memory) 47% 60
llama3.1-8b + Nous memory → 96% 60
On this domain-specific run, an 8B model with Nous grounding scored above a 70B baseline.
The effect is not retrieval. It is epistemic grounding — a small, precise knowledge signal redirects the model's existing priors onto the correct frame, with confidence and evidence attached. In the repo, this pattern is referred to as the Intent Disambiguation Effect.
This is evidence for graph grounding. It is not, by itself, a general benchmark for intelligence or a complete measure of a cognitive substrate.
→ Full benchmark details: eval/RESULTS.md · eval/RESULTS_INDEX.md · Run it yourself
| Capability | What it does |
|---|---|
| Structured memory | Stores typed relations between concepts instead of plain text chunks |
| Confidence-aware retrieval | Returns what is known, with evidence and uncertainty attached |
| Gap awareness | Surfaces where knowledge ends instead of bluffing through it |
| Continuous learning | Strengthens or weakens graph paths over time via Hebbian plasticity |
| Local-first runtime | Runs as a local graph and daemon, then injects context into any LLM |
Nous (νοῦς, Gk. mind / active intellect) is a persistent, self-growing epistemic substrate that can attach to any LLM.
It is informed by brain-inspired plasticity, cognitive research, and the practical failure modes of LLM memory.
Your documents, conversations, research
↓
Nous knowledge graph
(SQLite WAL + NetworkX + Hebbian learning + evidence scoring)
↓
brain.query("your question")
↓
Structured context injected into any LLM prompt:
— what is known (relations + confidence)
— why it is known (evidence chain)
— what is NOT known (gap map from TDA)
It is not a RAG system. RAG retrieves chunks. Nous extracts relations — typed, weighted, evidence-scored connections between concepts — and injects a compact, structured context block.
It is not just a memory system. Memory stores and retrieves. Nous maintains an epistemic account: every relation carries a trust tier (hypothesis / indication / validated), a rationale, and a contradiction flag. The system knows the difference between what it has evidence for and what it is guessing.
It learns continuously. Every interaction strengthens or weakens connections (Hebbian plasticity). There is no retraining. No gradient descent. The graph grows — and the gaps become visible.
| System | Main unit | Knows confidence | Knows what's missing | Learns over time | Local-first |
|---|---|---|---|---|---|
| Basic RAG | text chunk | ✗ | ✗ | ✗ | ✓ |
| Vector memory | embedding | ~ | ✗ | ✗ | ✓ |
| Mem0 | memory objects | ~ | ✗ | ~ | ✓ |
| MemGPT / Letta | conversation pages | ✗ | ✗ | ~ | ✗ |
| Claude Memory | key-value | ✗ | ✗ | ✗ | ✗ |
| Nous | typed relation + evidence | ✓ | ✓ | ✓ | ✓ |
Nous does not try to replace the model. It treats the model as the larynx, and the substrate as the persistent epistemic layer behind it.
pip install nouseimport nouse
# Auto-detects the local daemon if it is running.
# Otherwise falls back to direct local graph access.
brain = nouse.attach()
result = brain.query("transformer attention mechanism")
print(result.context_block())
print(result.confidence)
print(result.strong_axioms())If the daemon is running, attach() connects over HTTP. Otherwise it falls back to direct local graph access. The same code works either way.
Provider-specific examples follow below.
# You handle the LLM call. Nous handles the memory.
context = brain.query(user_question).context_block()
response = openai.chat(messages=[
{"role": "system", "content": context},
{"role": "user", "content": user_question},
])from openai import OpenAI
import nouse
client = OpenAI()
brain = nouse.attach()
question = "How does residual attention affect token relevance?"
context = brain.query(question).context_block()
response = client.chat.completions.create(
model="gpt-4.1-mini",
messages=[
{"role": "system", "content": context},
{"role": "user", "content": question},
],
)
print(response.choices[0].message.content)from anthropic import Anthropic
import nouse
client = Anthropic()
brain = nouse.attach()
question = "What does this repo know about topological plasticity?"
context = brain.query(question).context_block()
response = client.messages.create(
model="claude-3-7-sonnet-latest",
max_tokens=800,
system=context,
messages=[
{"role": "user", "content": question},
],
)
print(response.content[0].text)import ollama
import nouse
brain = nouse.attach()
question = "Summarize what is known about epistemic grounding."
context = brain.query(question).context_block()
response = ollama.chat(
model="qwen3.5:latest",
messages=[
{"role": "system", "content": context},
{"role": "user", "content": question},
],
)
print(response["message"]["content"])from groq import Groq
import nouse
client = Groq()
brain = nouse.attach()
question = "What does this project know about Hebbian learning?"
context = brain.query(question).context_block()
response = client.chat.completions.create(
model="llama3-8b-8192",
messages=[
{"role": "system", "content": context},
{"role": "user", "content": question},
],
)
print(response.choices[0].message.content)The pattern is always the same: brain.query(...) first, provider call second.
Nous is local-first today. A managed cloud version is planned:
brain = nouse.attach(api_key="nouse_sk_...")Hosted memory graphs, shared project memory across agents and teams, and zero local setup. Interested? Get in touch.
When you query Nous, the model does not just get a blob of context. It gets an epistemic frame:
[Nous memory]
• transformer attention: mechanism for routing token influence across context
claim: attention modulates token relevance based on learned relational patterns
Validated relations:
transformer —[uses]→ attention [ev=0.92]
attention —[modulates]→ token relevance [ev=0.81]
Uncertain / under review:
attention —[is_equivalent_to]→ memory routing [ev=0.41] ⚑
That is the real product surface: not storage, but a more honest and better-calibrated answer path.
git clone https://github.com/base76-research-lab/Nous
cd Nous
pip install -e .
# Generate questions from your own graph
python eval/generate_questions.py --n 60
# Run benchmark (requires Cerebras or Groq API key, or use Ollama)
python eval/run_eval.py \
--small cerebras/llama3.1-8b \
--large groq/llama-3.3-70b-versatile \
--n 60 --no-judgeThe current benchmark is domain-specific and intentionally small. Its purpose is to test whether a grounded memory signal can redirect the model onto the right frame, not to claim a universal leaderboard win.
For claim consistency, each public benchmark statement should cite an explicit run id from eval/RESULTS_INDEX.md.
Evaluating Nous with standard LLM benchmarks — MMLU, ARC, HumanEval, and similar — would be like measuring the sweetness of chocolate with a Scoville scale. The instrument is not merely inaccurate; it is measuring the wrong physical phenomenon entirely.
Standard benchmarks measure output quality at a single moment: does the system produce the statistically expected token sequence given this prompt? That is a meaningful question for a language model. Nous is not a language model.
Nous is a plastic cognitive substrate: a system that changes its internal structure through what it learns, maintains persistent beliefs with graded uncertainty, consolidates memory asynchronously, and runs a continuous cognitive loop between interactions. The relevant questions are not about outputs — they are about the system's internal epistemic state, how that state evolves over time, and whether that evolution reflects genuine learning rather than pattern matching.
FNC-Bench (in eval/fnc_bench/) is built around different primitives: epistemic honesty, contradiction resistance, and confidence calibration. Even these are partial measures. A complete benchmark for a cognitive architecture must be longitudinal and structural, not momentary and behavioral. That benchmark does not yet exist — because the category of system it would measure has not existed before.
Read a document / have a conversation
↓
nouse daemon (background)
↓
DeepDive: extract concepts + relations
↓
Hebbian update: strengthen confirmed paths
↓
NightRun: consolidate, prune weak edges
↓
Ghost Q (nightly): ask LLM about weak nodes → enrich graph
The daemon runs as a systemd service. It watches your files, chat history, browser bookmarks — anything you configure. You never manually curate the graph.
- Coding agents that need stable project memory across sessions
- Research copilots that must preserve terminology, evidence, and uncertainty
- Domain-specific assistants where bluffing is worse than saying "unknown"
- Local-first AI workflows where you want observability instead of hidden memory state
nouse/
├── inject.py # Public API: attach(), NouseBrain, Axiom, QueryResult
├── field/
│ └── surface.py # SQLite WAL + NetworkX graph interface
├── daemon/
│ ├── main.py # Autonomous learning loop
│ ├── nightrun.py # Nightly consolidation (9 phases)
│ ├── node_deepdive.py # 5-step concept extraction
│ └── ghost_q.py # LLM-driven graph enrichment
├── limbic/ # Neuromodulation (relevance, arousal, novelty)
├── memory/ # Episodic + procedural + semantic memory
├── metacognition/ # Self-monitoring and confidence calibration
└── search/
└── escalator.py # 3-level knowledge escalation
small model + Nous[domain] > large model without Nous
We have evidence for this in our benchmark. The next step is to test across more domains, more models, and with an LLM judge instead of keyword scoring.
Contributions welcome — especially domain-specific question banks.
The repo now stands on three linked claims:
- The Larynx Problem: current AI discourse often mistakes the expression channel of intelligence for intelligence itself.
- FNC-Bench: a different category of system needs a different benchmark, centered on epistemic structure rather than output fluency.
- Nous: a local, persistent epistemic substrate is one concrete attempt to instantiate that missing category.
If there is a paradigm-shift claim here, it lives in the relationship between those three layers.
Further reading:
- The Larynx Problem (wiki)
- Benchmark philosophy (wiki)
- Architecture (wiki)
- Nous Strategic Doctrine
- Frontier Visibility Plan
The theoretical foundation for Nous is described in:
- Wikström, B. (2026). The Larynx Problem: Why Large Language Models Are Not Artificial Intelligence. Zenodo · PhilPapers
- Quattrociocchi, W. et al. (2025). Epistemia: Structural Fault Lines in Generative AI. arXiv:2512.19466
Wikström (2026) argumenterar att LLMs modellerar uttryckskanalen för intelligens (språk), inte intelligens i sig — och att epistemisk grundning via strukturerade, plastiska kunskapsgrafer är nödvändig.
Quattrociocchi et al. (2025) introducerar begreppet "epistemia" för att beskriva det strukturella glappet där språklig trovärdighet ersätter faktisk epistemisk utvärdering. Artikeln identifierar sju epistemologiska "fault lines" mellan mänskligt och maskinellt omdöme och ger en teoretisk ram för varför system som Nous behövs.
pip install nouse
# Start the learning daemon
nouse daemon start
# Interactive REPL with memory
nouse run
# Check graph stats
nouse statusRequires Python 3.11+. Graph stored in ~/.local/share/nouse/.
| Phase | Status | Description |
|---|---|---|
| Core engine | ✅ | SQLite WAL + NetworkX + Hebbian plasticity + TDA gap detection |
| Multi-provider | ✅ | OpenAI, Anthropic, Ollama, Groq, Cerebras |
| MCP integration | ✅ | Model Context Protocol server for Claude and compatible clients |
| Cross-domain benchmarks | 🔄 | Validating on external datasets beyond internal domain |
| Docker support | 📋 | One-command deployment for teams |
| Managed cloud | 📋 | nouse.attach(api_key="nouse_sk_...") — hosted brain for teams |
| Multi-tenant API | 📋 | Shared project memory, team collaboration, SLAs |
- Discord — real-time chat, help, show & tell
- GitHub Discussions — Q&A, ideas, research notes, show & tell
- Open an issue — bugs, feature requests, domain benchmark submissions
- Contributing guide — how to contribute code, benchmarks, examples, and docs
Contributions welcome — especially domain-specific question banks. See CONTRIBUTING.md for how.
MIT — see LICENSE
Björn Wikström / Base76 Research Lab
- 𝕏 / Twitter: @Q_for_qualia
- LinkedIn: bjornshomelab
- Email: bjorn@base76research.com
- Issues: GitHub Issues
For security vulnerabilities, see SECURITY.md.