Nous (νοῦς) — persistent epistemic substrate for AI

Language models are a larynx. Nous is a persistent epistemic substrate for AI.

Nous stores typed relations, graded uncertainty, contradiction boundaries, and memory across time.


Inversion · Research Program · Evidence · Research · Roadmap · Integration


The Architectural Inversion

Most AI systems still place the language model at the center and attach tools, memory, and wrappers around it.

Nous inverts that stack.

Industry stack versus Nous stack

In this picture, the language model is not discarded. It is repositioned.

The model remains the expression system, the semantic surface, the larynx.
Nous is the persistent epistemic layer behind it.


Research Program

Nous should be read as a research program with a live implementation, not as a generic AI utility.

It stands on four linked claims:

  • language output is not identical with intelligence
  • a persistent epistemic substrate can supply what the output layer lacks
  • a different category of system requires different benchmarks and longitudinal metrics
  • these claims are testable because the repository contains a working substrate rather than only a paper argument

The fastest reading path follows the sections below in order.


Why Standard Benchmarks Do Not Apply

Evaluating Nous with standard LLM benchmarks would be like measuring the sweetness of chocolate with a Scoville scale. The instrument is not merely inaccurate; it is measuring the wrong physical phenomenon entirely.

Benchmarks such as MMLU, ARC, and HumanEval are valid instruments for language models. Nous is not merely a language model output surface. It is a persistent epistemic substrate.

The relevant question is therefore not only whether a model produced the expected answer right now. It is whether the system knew what it knew, knew what it did not know, preserved consistency under contradiction, and changed structure coherently across time.


Why It Matters

What is currently called AI is mostly semantic prediction.

Nous is an attempt to define a different category of system: one that can preserve knowledge, uncertainty, contradiction, and structural change across time.

  • It stores relations, not just retrieved chunks.
  • It carries confidence, rationale, and uncertainty with the memory itself.
  • It makes the boundary between known, probable, and unknown visible to the model.

That changes agent behavior in the place that actually matters: when a model is close to hallucinating but still sounds fluent.

Reference Evidence

One documented reference run is run_20260403_094211 (see eval/RESULTS_INDEX.md).

Model                               Score   Questions
─────────────────────────────────────────────────────
llama3.1-8b (no memory)             46%     60
llama-3.3-70b (no memory)           47%     60
llama3.1-8b + Nous memory           96%     60

On this domain-specific run, an 8B model with Nous grounding scored above a 70B baseline.

The effect is not retrieval. It is epistemic grounding — a small, precise knowledge signal redirects the model's existing priors onto the correct frame, with confidence and evidence attached. In the repo, this pattern is referred to as the Intent Disambiguation Effect.

This is evidence for graph grounding. It is not, by itself, a general benchmark for intelligence or a complete measure of a cognitive substrate.

→ Full benchmark details: eval/RESULTS.md · eval/RESULTS_INDEX.md · Run it yourself


Architectural Properties

Capability                   What it does
─────────────────────────────────────────────────────────────────────────────────────────
Structured memory            Stores typed relations between concepts instead of plain text chunks
Confidence-aware retrieval   Returns what is known, with evidence and uncertainty attached
Gap awareness                Surfaces where knowledge ends instead of bluffing through it
Continuous learning          Strengthens or weakens graph paths over time via Hebbian plasticity
Local-first runtime          Runs as a local graph and daemon, then injects context into any LLM

What Nous Is

Nous (νοῦς, Gk. mind / active intellect) is a persistent, self-growing epistemic substrate that can attach to any LLM.

It is informed by brain-inspired plasticity, cognitive research, and the practical failure modes of LLM memory.

Your documents, conversations, research
           ↓
    Nous knowledge graph
    (SQLite WAL + NetworkX + Hebbian learning + evidence scoring)
           ↓
    brain.query("your question")
           ↓
    Structured context injected into any LLM prompt:
      — what is known (relations + confidence)
      — why it is known (evidence chain)
      — what is NOT known (gap map from TDA)
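The storage layer named above (SQLite in WAL mode feeding a NetworkX graph) can be sketched roughly as follows. The table schema and column names here are illustrative, not the repo's actual schema:

```python
import sqlite3
import networkx as nx

def load_graph(db_path: str) -> nx.DiGraph:
    """Load typed, weighted relations from a SQLite file into a directed graph."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")  # WAL: readers don't block the writing daemon
    conn.execute("""
        CREATE TABLE IF NOT EXISTS relations (
            src TEXT, rel TEXT, dst TEXT,
            weight REAL, evidence REAL
        )""")
    g = nx.DiGraph()
    for src, rel, dst, weight, evidence in conn.execute(
        "SELECT src, rel, dst, weight, evidence FROM relations"
    ):
        # Each edge carries its relation type and evidence score as attributes.
        g.add_edge(src, dst, rel=rel, weight=weight, evidence=evidence)
    conn.close()
    return g
```

The point of the split is that SQLite gives durable, concurrent storage while the in-memory graph gives cheap traversal for queries.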

It is not a RAG system. RAG retrieves chunks. Nous extracts relations — typed, weighted, evidence-scored connections between concepts — and injects a compact, structured context block.

It is not just a memory system. Memory stores and retrieves. Nous maintains an epistemic account: every relation carries a trust tier (hypothesis / indication / validated), a rationale, and a contradiction flag. The system knows the difference between what it has evidence for and what it is guessing.

It learns continuously. Every interaction strengthens or weakens connections (Hebbian plasticity). There is no retraining. No gradient descent. The graph grows — and the gaps become visible.
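The Hebbian idea described above can be sketched in a few lines: edges whose endpoints fire together are strengthened, everything else slowly decays. The learning rate, decay rate, and bounded-growth rule are illustrative choices, not the repo's actual parameters:

```python
def hebbian_update(weights: dict, activated: set,
                   rate: float = 0.1, decay: float = 0.02) -> dict:
    """One plasticity step over a graph's edge weights.

    `weights` maps (src, dst) edge tuples to a strength in [0, 1];
    `activated` is the set of nodes touched by the current interaction.
    """
    out = {}
    for (src, dst), w in weights.items():
        if src in activated and dst in activated:
            w = w + rate * (1.0 - w)   # co-activation: move toward 1, never past it
        else:
            w = w * (1.0 - decay)      # otherwise: slow forgetting
        out[(src, dst)] = w
    return out
```

No gradients, no retraining pass: each interaction is a single local update over the edges it touched.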


How Nous Differs From Alternatives

System          Main unit                   Knows       Knows what's  Learns      Local-
                                            confidence  missing       over time   first
──────────────────────────────────────────────────────────────────────────────────────────
Basic RAG       text chunk
Vector memory   embedding                   ~
Mem0            memory objects              ~           ~
MemGPT / Letta  conversation pages                                    ~
Claude Memory   key-value
Nous            typed relation + evidence   ✓           ✓             ✓           ✓

Nous does not try to replace the model. It treats the model as the larynx, and the substrate as the persistent epistemic layer behind it.


Integration Pattern

pip install nouse

import nouse

# Auto-detects the local daemon if it is running.
# Otherwise falls back to direct local graph access.
brain = nouse.attach()

result = brain.query("transformer attention mechanism")

print(result.context_block())
print(result.confidence)
print(result.strong_axioms())

If the daemon is running, attach() connects over HTTP. Otherwise it falls back to direct local graph access. The same code works either way.
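The detect-then-fall-back behavior can be approximated with a cheap health probe; the endpoint path, port, and timeout below are assumptions for illustration, not the library's documented values:

```python
import urllib.request
import urllib.error

def daemon_running(url: str = "http://127.0.0.1:7437/health",  # hypothetical port/path
                   timeout: float = 0.5) -> bool:
    """Probe a local daemon health endpoint.

    Any connection failure (refused, timed out, unreachable) is read as
    "daemon not running", so the caller can fall back to direct graph access.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

A wrapper like attach() can then branch on this once at startup and hand back the same query interface either way.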

Provider-specific examples follow below.

# You handle the LLM call. Nous handles the memory.
context = brain.query(user_question).context_block()
response = your_llm_call(           # any provider client works here
    system=context,
    user=user_question,
)

Use With OpenAI, Anthropic, Ollama Or Groq

OpenAI

from openai import OpenAI
import nouse

client = OpenAI()
brain = nouse.attach()

question = "How does residual attention affect token relevance?"
context = brain.query(question).context_block()

response = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        {"role": "system", "content": context},
        {"role": "user", "content": question},
    ],
)

print(response.choices[0].message.content)

Anthropic

from anthropic import Anthropic
import nouse

client = Anthropic()
brain = nouse.attach()

question = "What does this repo know about topological plasticity?"
context = brain.query(question).context_block()

response = client.messages.create(
    model="claude-3-7-sonnet-latest",
    max_tokens=800,
    system=context,
    messages=[
        {"role": "user", "content": question},
    ],
)

print(response.content[0].text)

Ollama

import ollama
import nouse

brain = nouse.attach()

question = "Summarize what is known about epistemic grounding."
context = brain.query(question).context_block()

response = ollama.chat(
    model="qwen3.5:latest",
    messages=[
        {"role": "system", "content": context},
        {"role": "user", "content": question},
    ],
)

print(response["message"]["content"])

Groq

from groq import Groq
import nouse

client = Groq()
brain = nouse.attach()

question = "What does this project know about Hebbian learning?"
context = brain.query(question).context_block()

response = client.chat.completions.create(
    model="llama3-8b-8192",
    messages=[
        {"role": "system", "content": context},
        {"role": "user", "content": question},
    ],
)
print(response.choices[0].message.content)

The pattern is always the same: brain.query(...) first, provider call second.


Managed Nous (Coming)

Nous is local-first today. A managed cloud version is planned:

brain = nouse.attach(api_key="nouse_sk_...")

Hosted memory graphs, shared project memory across agents and teams, and zero local setup. Interested? Get in touch.


What A Grounded Answer Looks Like

When you query Nous, the model does not just get a blob of context. It gets an epistemic frame:

[Nous memory]
• transformer attention: mechanism for routing token influence across context
       claim: attention modulates token relevance based on learned relational patterns

Validated relations:
       transformer —[uses]→ attention  [ev=0.92]
       attention —[modulates]→ token relevance  [ev=0.81]

Uncertain / under review:
       attention —[is_equivalent_to]→ memory routing  [ev=0.41] ⚑

That is the real product surface: not storage, but a more honest and better-calibrated answer path.
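A context block in the shape shown above can be assembled from scored relations in a few lines. The trust-tier cut-off here is an illustrative value, not the repo's actual tiering logic:

```python
def render_context_block(relations) -> str:
    """Render (src, rel, dst, evidence) tuples into a tiered prompt block.

    Relations with evidence >= 0.6 are shown as validated; weaker ones are
    flagged as uncertain. The 0.6 threshold is illustrative only.
    """
    validated = [r for r in relations if r[3] >= 0.6]
    uncertain = [r for r in relations if r[3] < 0.6]
    lines = ["[Nous memory]", "Validated relations:"]
    lines += [f"  {s} —[{rel}]→ {d}  [ev={ev:.2f}]" for s, rel, d, ev in validated]
    lines.append("Uncertain / under review:")
    lines += [f"  {s} —[{rel}]→ {d}  [ev={ev:.2f}] ⚑" for s, rel, d, ev in uncertain]
    return "\n".join(lines)
```

Because the uncertain tier is explicitly flagged, the model downstream can hedge on those relations instead of asserting them.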


Run the benchmark yourself

git clone https://github.com/base76-research-lab/Nous
cd Nous
pip install -e .

# Generate questions from your own graph
python eval/generate_questions.py --n 60

# Run benchmark (requires Cerebras or Groq API key, or use Ollama)
python eval/run_eval.py \
  --small cerebras/llama3.1-8b \
  --large groq/llama-3.3-70b-versatile \
  --n 60 --no-judge

The current benchmark is domain-specific and intentionally small. Its purpose is to test whether a grounded memory signal can redirect the model onto the right frame, not to claim a universal leaderboard win.

For claim consistency, each public benchmark statement should cite an explicit run id from eval/RESULTS_INDEX.md.

Why standard LLM benchmarks do not apply

Standard benchmarks measure output quality at a single moment: does the system produce the statistically expected token sequence given this prompt? That is a meaningful question for a language model. Nous is not a language model.

Nous is a plastic cognitive substrate: a system that changes its internal structure through what it learns, maintains persistent beliefs with graded uncertainty, consolidates memory asynchronously, and runs a continuous cognitive loop between interactions. The relevant questions are not about outputs — they are about the system's internal epistemic state, how that state evolves over time, and whether that evolution reflects genuine learning rather than pattern matching.

FNC-Bench (in eval/fnc_bench/) is built around different primitives: epistemic honesty, contradiction resistance, and confidence calibration. Even these are partial measures. A complete benchmark for a cognitive architecture must be longitudinal and structural, not momentary and behavioral. That benchmark does not yet exist — because the category of system it would measure has not existed before.
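Confidence calibration, one of the primitives named above, is commonly measured with a Brier score: the mean squared gap between stated confidence and actual correctness. A minimal sketch (not FNC-Bench's actual scoring code):

```python
def brier_score(predictions) -> float:
    """Calibration error over (confidence, correct) pairs.

    `predictions` is a list of (confidence in [0, 1], correct: bool) pairs.
    0.0 means perfectly calibrated; 1.0 means maximally miscalibrated.
    """
    return sum((conf - float(ok)) ** 2 for conf, ok in predictions) / len(predictions)
```

A system that says 0.9 and is right, or says 0.1 and is wrong, scores well; a system that is confidently wrong is penalized quadratically.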


How the graph grows

Read a document / have a conversation
           ↓
    nouse daemon (background)
           ↓
    DeepDive: extract concepts + relations
           ↓
    Hebbian update: strengthen confirmed paths
           ↓
    NightRun: consolidate, prune weak edges
           ↓
    Ghost Q (nightly): ask LLM about weak nodes → enrich graph

The daemon runs as a systemd service. It watches your files, chat history, browser bookmarks — anything you configure. You never manually curate the graph.
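The NightRun pruning step in the pipeline above amounts to dropping edges whose strength has decayed below a floor. A sketch with an illustrative threshold, not the repo's actual consolidation logic:

```python
def prune_weak_edges(weights: dict, floor: float = 0.05) -> dict:
    """Keep only edges whose strength survived decay above `floor`.

    `weights` maps (src, dst) edge tuples to a strength in [0, 1].
    Edges that were never reinforced eventually decay past the floor
    and are forgotten, keeping the graph from growing without bound.
    """
    return {edge: w for edge, w in weights.items() if w >= floor}
```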


Good Fits

  • Coding agents that need stable project memory across sessions
  • Research copilots that must preserve terminology, evidence, and uncertainty
  • Domain-specific assistants where bluffing is worse than saying "unknown"
  • Local-first AI workflows where you want observability instead of hidden memory state

Architecture

nouse/
├── inject.py          # Public API: attach(), NouseBrain, Axiom, QueryResult
├── field/
│   └── surface.py     # SQLite WAL + NetworkX graph interface
├── daemon/
│   ├── main.py        # Autonomous learning loop
│   ├── nightrun.py    # Nightly consolidation (9 phases)
│   ├── node_deepdive.py  # 5-step concept extraction
│   └── ghost_q.py     # LLM-driven graph enrichment
├── limbic/            # Neuromodulation (relevance, arousal, novelty)
├── memory/            # Episodic + procedural + semantic memory
├── metacognition/     # Self-monitoring and confidence calibration
└── search/
    └── escalator.py   # 3-level knowledge escalation

The hypothesis (work in progress)

small model + Nous[domain]  >  large model without Nous

We have evidence for this in our benchmark. The next step is to test across more domains, more models, and with an LLM judge instead of keyword scoring.

Contributions welcome — especially domain-specific question banks.


Research

The repo now stands on three linked claims:

  • The Larynx Problem: current AI discourse often mistakes the expression channel of intelligence for intelligence itself.
  • FNC-Bench: a different category of system needs a different benchmark, centered on epistemic structure rather than output fluency.
  • Nous: a local, persistent epistemic substrate is one concrete attempt to instantiate that missing category.

If there is a paradigm-shift claim here, it lives in the relationship between those three layers.

Further reading:

The theoretical foundation for Nous is described in:

  • Wikström, B. (2026). The Larynx Problem: Why Large Language Models Are Not Artificial Intelligence. Zenodo · PhilPapers
  • Quattrociocchi, W. et al. (2025). Epistemia: Structural Fault Lines in Generative AI. arXiv:2512.19466

Wikström (2026) argues that LLMs model the expression channel of intelligence (language), not intelligence itself, and that epistemic grounding via structured, plastic knowledge graphs is necessary.

Quattrociocchi et al. (2025) introduce the term "epistemia" to describe the structural gap in which linguistic credibility substitutes for actual epistemic evaluation. The paper identifies seven epistemological fault lines between human and machine judgment and provides a theoretical frame for why systems like Nous are needed.


Install & Run Daemon

pip install nouse

# Start the learning daemon
nouse daemon start

# Interactive REPL with memory
nouse run

# Check graph stats
nouse status

Requires Python 3.11+. Graph stored in ~/.local/share/nouse/.


Roadmap

Phase                    Status  Description
──────────────────────────────────────────────────────────────────────────────────────────
Core engine              ✅      SQLite WAL + NetworkX + Hebbian plasticity + TDA gap detection
Multi-provider           ✅      OpenAI, Anthropic, Ollama, Groq, Cerebras
MCP integration          ✅      Model Context Protocol server for Claude and compatible clients
Cross-domain benchmarks  🔄      Validating on external datasets beyond internal domain
Docker support           📋      One-command deployment for teams
Managed cloud            📋      nouse.attach(api_key="nouse_sk_...") — hosted brain for teams
Multi-tenant API         📋      Shared project memory, team collaboration, SLAs

Community

See CONTRIBUTING.md for how to contribute; domain-specific question banks are especially valuable.


License

MIT — see LICENSE


Contact

Björn Wikström / Base76 Research Lab

For security vulnerabilities, see SECURITY.md.
