BrainPalace 26.6.15
Multi-language BM25
BrainPalace now tokenizes each document with its own natural-language analyzer (normalize → tokenize → stopwords → stem/lemmatize) instead of a language-agnostic tokenizer.
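The four stages can be illustrated with a minimal, standard-library-only sketch. This is not BrainPalace's implementation: the `analyze` and `strip_suffix` functions, the tiny stopword sets, and the suffix tables are hypothetical stand-ins for the real per-language stopword lists and Snowball stemmers.

```python
import re
import unicodedata

# Toy stand-ins (assumptions): a real analyzer would load full stopword
# lists and a proper Snowball stemmer for each language.
STOPWORDS = {
    "en": {"the", "a", "of", "and", "is"},
    "de": {"der", "die", "das", "und", "ist"},
}
SUFFIXES = {
    "en": ("ing", "ed", "s"),
    "de": ("en", "er", "e"),
}


def strip_suffix(token: str, suffixes: tuple) -> str:
    """Crude stemmer stand-in: drop the first matching suffix, keeping >= 3 chars."""
    for suffix in suffixes:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text: str, language: str = "en") -> list[str]:
    """Run normalize -> tokenize -> stopwords -> stem for one language."""
    if language not in STOPWORDS:
        language = "en"  # unknown codes fall back to English
    normalized = unicodedata.normalize("NFKC", text).lower()  # normalize
    tokens = re.findall(r"\w+", normalized)                   # tokenize
    kept = [t for t in tokens if t not in STOPWORDS[language]]  # stopwords
    return [strip_suffix(t, SUFFIXES[language]) for t in kept]  # stem


print(analyze("The indexing of documents", "en"))
print(analyze("Die Katzen und der Hund", "de"))
```

The point of the per-language tables is that the same surface text yields different index terms depending on the analyzer, which is why a single language-agnostic tokenizer gives weaker matches on inflected languages.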
Highlights
- ~27 Snowball/PyStemmer languages (`en`, `de`, `fr`, `es`, `ru`, `it`, `pt`, `nl`, `sv`, `fi`, `hu`, `ro`, `tr`, `ar`, …) + a vendored Croatian (`hr`) stemmer; stopwords via `stopwordsiso`; unknown codes fall back to English.
- New `bm25:` config block: `language`, `engine` (`stem`|`lemma`), `detect`, `detect_min_confidence`.
- CLI: `init --language/--bm25-engine`, `folders add --language`, `query --language`, `status` shows language/engine. MCP `query` tool gains `language`.
- Croatian lemma tier: `pip install 'brainpalace[lemma-hr]'` (simplemma, Serbo-Croatian `hbs`).
- Engine: BM25 now uses `bm25s` directly (dropped the LlamaIndex `BM25Retriever` wrapper). Existing indexes auto-migrate from the stored corpus on first start; no manual action is needed.
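As a rough illustration of how the `bm25:` keys and the English fallback might interact, here is a hedged sketch. The `Bm25Config` class, the `resolve_language` helper, the default values, the `SUPPORTED` subset, and the precedence order (explicit override, then confident detection, then configured language) are all assumptions made for this example, not documented BrainPalace behavior.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical subset of supported codes; the real list is longer.
SUPPORTED = {"en", "de", "fr", "es", "ru", "hr"}


@dataclass
class Bm25Config:
    # Field names mirror the bm25: block; the defaults are assumptions.
    language: str = "en"
    engine: str = "stem"  # "stem" or "lemma"
    detect: bool = False
    detect_min_confidence: float = 0.8

    def __post_init__(self) -> None:
        if self.engine not in ("stem", "lemma"):
            raise ValueError(f"unknown engine: {self.engine!r}")


def resolve_language(
    config: Bm25Config,
    override: Optional[str] = None,
    detected: Optional[Tuple[str, float]] = None,
) -> str:
    """Pick an analyzer language (assumed precedence, see lead-in).

    `detected` is a (code, confidence) pair from some language detector.
    """
    if override:
        candidate = override
    elif (
        config.detect
        and detected is not None
        and detected[1] >= config.detect_min_confidence
    ):
        candidate = detected[0]
    else:
        candidate = config.language
    candidate = candidate.lower()
    # Unknown codes fall back to English.
    return candidate if candidate in SUPPORTED else "en"


cfg = Bm25Config(language="de", detect=True)
print(resolve_language(cfg, detected=("hr", 0.95)))
print(resolve_language(cfg, detected=("hr", 0.50)))
print(resolve_language(cfg, override="xx"))
```

Under these assumptions, a low-confidence detection is ignored in favor of the configured language, and an unsupported override resolves to English rather than raising an error.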
See docs/CHANGELOG.md for full details.