AnthonyNystrom/mneme

mneme

A layered semantic cache for LLMs and any expensive function with embeddable input.

mneme (Greek: μνήμη, "memory"; pronounced NEE-mee) is an embeddable, in-process Python library for semantic memoization: cache an expensive function once, return the cached result whenever a later input means the same thing. LLM completions are the canonical use case; the same machinery covers RAG retrievals, translations, classifications, deduplication, and agent memory. It pairs an exact-match layer (normalized query hash) with a semantic-match layer (cosine similarity over L2-normalized embeddings) and persists durably to a single SQLite file by default.

Full documentation: https://anthonynystrom.github.io/mneme/

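from mneme import SemanticCache

query = "How do I reset my password?"

# my_embedder and call_my_llm are placeholders for the embedder and LLM client you supply.
with SemanticCache(path="cache.db", embedder=my_embedder) as cache:
    hit = cache.get(query)
    if hit is None:
        # Miss: pay for the call once, then store the result for later lookups.
        response = call_my_llm(query)
        cache.put(query, response)
    else:
        # Hit: an exact or semantic match was found.
        response = hit.response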
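
The lookup order described above (normalized query hash first, then cosine similarity over L2-normalized embeddings) can be sketched in plain Python. This is a toy illustration, not mneme's implementation: `ToyLayeredCache`, `normalize`, `unit`, the whitespace/lowercase normalization rule, and the 0.9 threshold are all assumptions made for the example.

```python
import hashlib
import math


def normalize(query: str) -> str:
    # Assumed normalization for this sketch: lowercase, collapse whitespace.
    return " ".join(query.lower().split())


def unit(vec):
    # L2-normalize so that a dot product equals cosine similarity.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


class ToyLayeredCache:
    """Illustration of the two-layer lookup order; not mneme's implementation."""

    def __init__(self, embed, threshold=0.9):
        self.embed = embed            # callable: str -> sequence of floats
        self.threshold = threshold    # minimum cosine similarity for a semantic hit
        self.exact = {}               # sha256(normalized query) -> response
        self.rows = []                # (unit vector, response), scanned linearly

    def _key(self, query):
        return hashlib.sha256(normalize(query).encode("utf-8")).hexdigest()

    def put(self, query, response):
        self.exact[self._key(query)] = response
        self.rows.append((unit(self.embed(query)), response))

    def get(self, query):
        # Layer 1: O(1) hash lookup on the normalized query.
        key = self._key(query)
        if key in self.exact:
            return ("exact", 1.0, self.exact[key])
        # Layer 2: best cosine match, accepted only at or above the threshold.
        q = unit(self.embed(query))
        best = max(
            ((sum(a * b for a, b in zip(q, vec)), resp) for vec, resp in self.rows),
            key=lambda pair: pair[0],
            default=None,
        )
        if best is not None and best[0] >= self.threshold:
            return ("semantic", best[0], best[1])
        return None


# Hand-written 2-d "embeddings" keep the demo deterministic.
vectors = {"reset password": [1.0, 0.0], "change password": [0.9, 0.1], "pricing": [0.0, 1.0]}
cache = ToyLayeredCache(embed=lambda q: vectors[normalize(q)])
cache.put("reset password", "Click 'Forgot password' on login.")
print(cache.get("Reset  Password")[0])   # exact
print(cache.get("change password")[0])   # semantic
print(cache.get("pricing"))              # None
```

The linear scan here stands in for the matrix-vector product a real index would use; the point is only the ordering of the two layers and the threshold check.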

Why

  • Cache before you call. Turn redundant expensive operations - LLM calls, RAG rerankers, paid translation APIs, slow classifiers - into a microsecond dict lookup or a millisecond NumPy matvec. For chatbots, agent loops, classification pipelines, and batch-style scoring, this is the difference between a viable product and one that pays for every paraphrase.
  • One required dependency. NumPy. Optional extras for hnsw, redis, postgres, dynamodb, prometheus, otel. Bring your own embedder, your own LLM client, your own server.
  • In-process, no daemon. A library you import, not a service you operate. Persists to a single SQLite file by default; swap in Redis / Postgres / DynamoDB for cross-host shared state.
  • Strict typing, zero magic. Public surface is a small set of frozen @dataclasses and Protocols. py.typed shipped.

Features

  • Layered cache - O(1) exact match, then cosine similarity over an in-memory matrix
  • Sync + async APIs (SemanticCache, AsyncSemanticCache)
  • 5 Store backends: Memory, SQLite (default), Redis, Postgres, DynamoDB
  • 2 Index backends: NumPy (default; exact search bounded by memory bandwidth, comfortable up to ~500k entries at a typical d=768 and well past 1M at d=384) and hnswlib (opt-in; sub-millisecond approximate search at 1M+ entries)
  • 3 vector dtypes: float32, float16, int8 for memory-constrained deployments
  • 3 multi-process modes: single, stale-tolerant, mmap-shared
  • Multi-tenant via namespaces with per-namespace LRU quotas
  • Calibration tooling (Python API + CLI) for tuning similarity thresholds
  • Checkpoint export/import for backup and environment promotion
  • Re-embed migration tool when the embedder changes
  • Prometheus and OpenTelemetry metrics adapters
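
Tuning a similarity threshold is a precision/recall trade: too low and unrelated queries share answers, too high and paraphrases miss. The sketch below shows the general idea on a small labeled sample. It is not mneme's calibration API; `sweep`, `pick_threshold`, the sample data, and the 0.99 precision target are hypothetical names and values chosen for illustration.

```python
def sweep(pairs, thresholds):
    """pairs: (similarity, should_hit) tuples from a hand-labeled sample."""
    rows = []
    for t in thresholds:
        tp = sum(1 for s, ok in pairs if s >= t and ok)       # correct hits
        fp = sum(1 for s, ok in pairs if s >= t and not ok)   # wrong answers served
        fn = sum(1 for s, ok in pairs if s < t and ok)        # paraphrases missed
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        rows.append((t, precision, recall))
    return rows


def pick_threshold(pairs, thresholds, min_precision=0.99):
    # Keep thresholds that are precise enough, then take the best recall;
    # ties go to the higher (more conservative) threshold.
    ok = [row for row in sweep(pairs, thresholds) if row[1] >= min_precision]
    return max(ok, key=lambda row: (row[2], row[0]))[0] if ok else None


sample = [(0.98, True), (0.95, True), (0.91, False), (0.88, True), (0.70, False)]
print(pick_threshold(sample, [0.85, 0.90, 0.93, 0.97]))  # 0.93
```

In this sample, 0.93 is the lowest candidate that serves no wrong answers, so it keeps the most paraphrase hits while meeting the precision target.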

Install

pip install mneme-cache                       # core (NumPy only)
pip install "mneme-cache[hnsw]"               # approximate-NN at 1M+ entries
pip install "mneme-cache[redis]"              # RedisStore
pip install "mneme-cache[postgres]"           # PostgresStore
pip install "mneme-cache[dynamodb]"           # DynamoDBStore
pip install "mneme-cache[prometheus,otel]"    # metrics adapters
pip install "mneme-cache[all]"                # everything

Python 3.10+. The distribution is mneme-cache on PyPI; the import name is mneme. See the full install matrix.

Quickstart

from mneme import SemanticCache, MemoryStore

# my_embedder is a placeholder for the embedder you supply.
with SemanticCache(store=MemoryStore(), embedder=my_embedder) as cache:
    cache.put("How do I reset my password?", "Click 'Forgot password' on login.")
    hit = cache.get("Where do I reset my password?")  # paraphrase, not an exact match
    assert hit is not None
    # Which layer matched, how similar the match was, and the cached response.
    print(hit.layer, hit.similarity, hit.response)

For the async API, see Async quickstart. For wrapping an actual LLM call, see Your first cached LLM.
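
The snippets above leave `my_embedder` undefined because mneme does not bundle one. For smoke tests, a deterministic stand-in can be handy. The sketch below is a hashed bag-of-words embedder built on the standard library only. It assumes an embedder is something that maps text to a fixed-length float vector; whether mneme expects a plain callable, a particular Protocol, or a NumPy array is not stated in this README, so adapt it to the documented interface. `toy_embedder` and its `dim` parameter are hypothetical.

```python
import hashlib
import math
import re


def toy_embedder(text: str, dim: int = 64) -> list[float]:
    """Hashed bag-of-words vector, L2-normalized. For smoke tests only."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # hashlib (not the built-in hash) keeps buckets stable across processes.
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


a = toy_embedder("How do I reset my password?")
b = toy_embedder("password reset: how do I my")
print(a == b)  # True: same tokens, word order is ignored
```

A bag-of-words vector only captures token overlap, so it will not recognize true paraphrases; a real sentence-embedding model is needed for meaningful semantic hits.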

Use cases

The same machinery covers more than LLM caching. Each pattern is the same three lines (cache.get, cache.put, your function); only what your function does changes.

| Pattern | What it caches |
| --- | --- |
| LLM caching | Wrap any LLM call so paraphrases hit a microsecond cache instead of a multi-second model |
| RAG retrieval | Top-k chunks behind paraphrased questions; skips the cross-encoder reranker on cache hits |
| Translation | "Source text → translated text" per language pair; cuts billed translation API calls |
| Semantic deduplication | Read Hit.similarity directly to detect near-duplicate content in ingestion pipelines |
| Classification | Cache labels from any classifier (sklearn, fastText, BERT, rules engines) |
| Agent memory | Per-agent task → plan lookup; consistency on similar tasks across runs |

Full walkthrough with runnable scripts →
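
The get / call / put pattern can be factored into one small helper. The sketch below relies only on `get`, `put`, and `hit.response` as they appear in the snippets above; `cached_call`, `StubCache`, and `StubHit` are hypothetical names, and the stub is a dict-backed stand-in so the example runs without mneme installed.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class StubHit:
    response: str


class StubCache:
    """Dict-backed stand-in exposing only get/put; exact match only."""

    def __init__(self):
        self._data = {}

    def get(self, query):
        value = self._data.get(query)
        return None if value is None else StubHit(value)

    def put(self, query, response):
        self._data[query] = response


def cached_call(cache, query: str, fn: Callable[[str], str]) -> str:
    """Return the cached response for query, calling fn only on a miss."""
    hit = cache.get(query)
    if hit is not None:
        return hit.response
    response = fn(query)
    cache.put(query, response)
    return response


calls = []


def slow_upper(text):
    # Stand-in for an LLM call, translation API, classifier, and so on.
    calls.append(text)
    return text.upper()


cache = StubCache()
first = cached_call(cache, "hello", slow_upper)
second = cached_call(cache, "hello", slow_upper)
print(first, second, len(calls))  # HELLO HELLO 1
```

Swapping `slow_upper` for a reranker, a translation client, or a classifier changes what is cached without changing the helper.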

Performance

Apple M4 Max baseline at 100k entries (full table on the docs site):

| Operation | Latency / throughput |
| --- | --- |
| Exact-match get | ~2.3 ms p99 |
| Semantic get (fp32, d=768) | ~2.7 ms p99 |
| put (no eviction) | ~0.9 ms p99 |
| Single-thread throughput | ~5,700 ops/sec |
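
For a sense of scale, the raw size of the vector matrix follows directly from entry count, dimension, and dtype. The arithmetic below covers the vector payload only and assumes nothing about mneme's internal layout (no per-entry metadata, no quantization scale factors); `index_bytes` is a hypothetical helper.

```python
BYTES_PER_ELEMENT = {"float32": 4, "float16": 2, "int8": 1}


def index_bytes(entries: int, dim: int, dtype: str) -> int:
    # Raw matrix payload: one dim-length vector per entry.
    return entries * dim * BYTES_PER_ELEMENT[dtype]


for dtype in BYTES_PER_ELEMENT:
    mb = index_bytes(100_000, 768, dtype) / 1e6
    print(f"{dtype}: {mb:.1f} MB")
# float32: 307.2 MB
# float16: 153.6 MB
# int8: 76.8 MB
```

An exact scan reads the whole matrix on every semantic lookup, which is why halving the element size also roughly halves the memory traffic per query.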

Documentation

| Section | Contents |
| --- | --- |
| Getting started | Sync + async quickstarts, bring your own embedder |
| Use cases | Six patterns: LLM, RAG retrieval, translation, dedup, classification, agent memory |
| How mneme is different | Where mneme makes different choices than other semantic-cache libraries |
| Concepts | Layered cache, embedders, quantization, multi-process, multi-tenant |
| Stores | Memory · SQLite · Redis · Postgres · DynamoDB |
| Guides | Calibration, checkpoints, re-embed migration, metrics, custom stores, perf tuning |
| API reference | Auto-generated from docstrings |
| Performance | Measured baseline against the original targets |
| Showcase | Flask demo covering all 5 use cases against Nemotron on a DGX Spark |
| Changelog | Release notes |

Comparison

| | mneme | GPTCache |
| --- | --- | --- |
| Required runtime deps | NumPy | many (faiss, etc.) |
| Bundled embedder | no (BYOE) | yes |
| Bundled LLM client | no | yes |
| Sync + async parity | yes | partial |
| Strict typing (py.typed) | yes | no |
| Multi-process modes | 3 | n/a |
| Multi-tenant quotas | per-namespace LRU | n/a |
| Calibration tooling | yes (CLI + Python API) | no |

Status

v1.0. Public surface locked; future minor versions are additive. See Changelog.

License

Apache 2.0. See LICENSE.
