docs: refresh README + ARCHITECTURE for v2.0.0#68
Merged
autholykos merged 2 commits into main · Apr 18, 2026
Conversation
Pull request overview
Updates top-level documentation to reflect Stroma’s current (v2.0.0) API surface and on-disk schema, replacing outdated v0.2.0-era descriptions.
Changes:
- Refresh README scope/packages and update the example to use current public APIs (e.g., `corpus.NewRecord`, `index.Search` with embedded `SearchParams`).
- Expand ARCHITECTURE coverage for chunk policies, hybrid fusion/provenance, contextual embedding, quantization, and schema v5 + migration chain.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| README.md | Updates scope/packages and example to match v2.0.0 features and APIs. |
| ARCHITECTURE.md | Updates architecture + schema documentation to match current chunking, retrieval, embedding, quantization, and migrations. |
autholykos added a commit that referenced this pull request on Apr 18, 2026
- Clarify binary quantization footprint: the 32× reduction applies to the vec0 prefilter representation; full-precision vectors are retained in `chunks_vec_full` for rescoring, so total snapshot size is not 32× smaller. Updated both README scope and ARCHITECTURE `store`.
- Fix `Fingerprint`/`FingerprintFromPairs` error description: they fail on records that cannot be normalized or pairs with empty `Ref`/`ContentHash`; the injective-encoding guarantee comes from `HashRecord`'s serialization, not from runtime collision detection.
- Fix `LateChunkPolicy` description: it emits a parent span per heading-aware section (not per record) and may skip leaf emission when a parent already fits the child token budget.
- Narrow `DefaultFusion` claim: ordering matches pre-#17 RRF on every pre-change shape; `Score` matches on the hybrid multi-arm path and preserves arm-native score on single-arm paths.
- Fix migration ordering: v2→v3 adds `chunks.context_prefix`; v3→v4 re-hashes `records.content_hash` via `HashRecord` (no DDL); v4→v5 adds `chunks.parent_chunk_id` + partial index + same-record triggers. Previous text transposed v2→v3 and v3→v4.
Both docs still reflected the v0.2.0 surface: single-strategy Markdown chunking, dense-only retrieval, schema v1. Refresh to cover the substrate as it actually ships today.
- README: expand scope to call out pluggable chunking (`chunk.Policy`), hybrid retrieval with pluggable `FusionStrategy`, quantization modes, matryoshka prefilter, `ContextualEmbedder`, and incremental `Update`. Swap the example's record literal for `corpus.NewRecord`. Link the v2.0.0 release notes as the authoritative API surface.
- ARCHITECTURE: rename `Record.Normalized` → `Record.Normalize`; add `chunk.Policy` / `MarkdownPolicy` / `KindRouterPolicy` / `LateChunkPolicy`; add `ContextualEmbedder`, OpenAI `MaxBatchSize` + `APIToken` redaction; add quantization modes + `chunks_vec_full` rescore table; add `FusionStrategy` / `HitProvenance` / `Reranker` / `SearchDimension` / `DefaultSearchLimit`; describe schema v5 with `context_prefix` and `parent_chunk_id` + same-record triggers; add the v2→v3→v4→v5 migration chain.

No code changes.
bd162e8 to c2f7f1e
Summary
Both docs still reflected the v0.2.0 surface — single-strategy Markdown chunking, dense-only retrieval, schema v1 — which means a library consumer landing on the repo today would miss every feature shipped across v0.3.0 → v2.0.0. This PR brings them up to the substrate as it actually ships.
No code changes.
README.md
- Expand scope to call out pluggable chunking (`chunk.Policy`), hybrid retrieval with pluggable `FusionStrategy`, quantization modes, matryoshka prefilter (`SearchDimension`), `ContextualEmbedder`, and incremental `Update`.
- Refresh the package listing (`NewRecord`, `ErrTooManySections` DoS cap, `MaxBatchSize` + `APIToken` redaction, quantization constants, `ExpandContext`).
- Swap the example's record literal for `corpus.NewRecord`; add an inline comment noting that `Fusion`/`Reranker`/`SearchDimension` are optional and zero-value retrieval gives hybrid RRF over vector+FTS.

ARCHITECTURE.md
- Rename `Record.Normalized` → `Record.Normalize` (deprecated alias preserved in the code; doc now points at the live entry point).
- Document `corpus.NewRecord`; note `Fingerprint` now returns `(string, error)`.
- Add a `chunk` entry with the `Policy` interface contract and the three shipped implementations (`MarkdownPolicy`, `KindRouterPolicy`, `LateChunkPolicy`); mention `SectionWithLineage` and the `MaxSections` DoS guard.
- Add `ContextualEmbedder`, OpenAI `MaxBatchSize` + multi-batch deadline scaling + `APIToken` redaction.
- Add `chunks_vec_full` as the rescore companion for binary quantization.
- Add `FusionStrategy`/`RetrievalArm`/`HitProvenance` + `DefaultFusion()` semantics; `Reranker` seeing provenance; `SearchDimension` matryoshka prefilter; `DefaultSearchLimit`.
- Describe schema v5 with `context_prefix` and `parent_chunk_id` + same-record triggers; add the v2 → v3 → v4 → v5 migration chain table.

Test plan
- `git diff --stat`: only `README.md` + `ARCHITECTURE.md` touched
- `corpus.NewRecord`, `index.Rebuild`, `index.Search` with embedded `SearchParams` all match the current v2.0.0 public API
- Checked identifiers against `corpus/record.go`, `chunk/policy.go`, `embed/{embed,openai}.go`, `index/{index,snapshot,hybrid,context}.go`, `store/vector_blob.go`, and the `createSchema` DDL in `index/index.go`