BrainPalace 26.6.15

@bxw91 bxw91 released this 04 Jun 09:45

Multi-language BM25

BrainPalace now tokenizes each document with its own natural-language analyzer (normalize → tokenize → stopwords → stem/lemmatize) instead of a language-agnostic tokenizer.
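The four analyzer stages can be sketched roughly as follows. This is an illustration only, not BrainPalace's code: the function names `analyze` and `toy_stem`, the tiny stopword sets, and the suffix-stripping rule are all hypothetical stand-ins for the real stopwordsiso lists and Snowball stemmers, chosen so the example runs with the standard library alone.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny per-language resources (illustration only).
STOPWORDS = {
    "en": {"the", "a", "of", "and", "is", "are"},
    "de": {"der", "die", "das", "und", "ist", "sind"},
}
SUFFIXES = {
    "en": ("ing", "ed", "s"),
    "de": ("en", "er", "e"),
}


def toy_stem(token: str, language: str) -> str:
    """Strip the first matching suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES.get(language, SUFFIXES["en"]):
        # Keep at least three characters of the word after stripping.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text: str, language: str) -> list[str]:
    """normalize -> tokenize -> stopwords -> stem, per document language."""
    # 1. Normalize: Unicode compatibility form plus lowercasing.
    text = unicodedata.normalize("NFKC", text).lower()
    # 2. Tokenize: runs of word characters.
    tokens = re.findall(r"\w+", text)
    # 3. Stopwords: drop function words for this language (English if unknown).
    stop = STOPWORDS.get(language, STOPWORDS["en"])
    tokens = [t for t in tokens if t not in stop]
    # 4. Stem: reduce each remaining token to a root form.
    return [toy_stem(t, language) for t in tokens]


print(analyze("The cats are running", "en"))
print(analyze("Die Katzen sind klein", "de"))
```

The point of the per-language step is that the same pipeline yields different index terms depending on the analyzer: "die" is dropped as a stopword under `de` but would be kept as a content word under `en`.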

Highlights

  • ~27 Snowball/PyStemmer languages (en, de, fr, es, ru, it, pt, nl, sv, fi, hu, ro, tr, ar, …) + a vendored Croatian (hr) stemmer; stopwords via stopwordsiso; unknown codes fall back to English.
  • New bm25: config block with the keys language, engine (stem|lemma), detect, and detect_min_confidence.
  • CLI: new flags init --language/--bm25-engine, folders add --language, and query --language; status now shows the language and engine. The MCP query tool gains a language parameter.
  • Croatian lemma tier: pip install 'brainpalace[lemma-hr]' (simplemma, Serbo-Croatian hbs).
  • Engine: BM25 now uses bm25s directly (dropped the LlamaIndex BM25Retriever wrapper). Existing indexes auto-migrate from the stored corpus on first start — no manual action.
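One way to picture how the language settings might combine is the sketch below. Only the fallback of unknown codes to English is stated in these notes; the precedence of a detected language over the configured one, the 0.5 default threshold, the abbreviated `SUPPORTED` set, and the function name `resolve_language` are assumptions made for illustration. See docs/CHANGELOG.md for the actual behavior.

```python
from typing import Optional

# Abbreviated, hypothetical list of supported codes (the release names ~27 plus hr).
SUPPORTED = {
    "en", "de", "fr", "es", "ru", "it", "pt", "nl",
    "sv", "fi", "hu", "ro", "tr", "ar", "hr",
}


def resolve_language(
    configured: Optional[str],
    detected: Optional[str] = None,
    confidence: float = 0.0,
    detect: bool = False,
    detect_min_confidence: float = 0.5,  # assumed default, not from the release
) -> str:
    """Pick the analyzer language for one document."""
    candidate = configured
    # Assumption: a detected language overrides the configured one only when
    # detection is enabled and the detector is confident enough.
    if detect and detected is not None and confidence >= detect_min_confidence:
        candidate = detected
    code = (candidate or "en").lower()
    # Stated in the release notes: unknown codes fall back to English.
    return code if code in SUPPORTED else "en"


print(resolve_language("de"))
print(resolve_language("xx"))
print(resolve_language("en", detected="fr", confidence=0.9, detect=True))
```

Under these assumptions a low-confidence detection leaves the configured language in place, which avoids switching analyzers on short or mixed-language documents.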

See docs/CHANGELOG.md for full details.