docs: refresh README + ARCHITECTURE for v2.0.0#68
Merged
autholykos merged 2 commits into main · Apr 18, 2026
Conversation
Pull request overview
Updates top-level documentation to reflect Stroma’s current (v2.0.0) API surface and on-disk schema, replacing outdated v0.2.0-era descriptions.
Changes:
- Refresh README scope/packages and update the example to use current public APIs (e.g., `corpus.NewRecord`, `index.Search` with embedded `SearchParams`).
- Expand ARCHITECTURE coverage for chunk policies, hybrid fusion/provenance, contextual embedding, quantization, and schema v5 + migration chain.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| README.md | Updates scope/packages and example to match v2.0.0 features and APIs. |
| ARCHITECTURE.md | Updates architecture + schema documentation to match current chunking, retrieval, embedding, quantization, and migrations. |
autholykos added a commit that referenced this pull request on Apr 18, 2026
- Clarify binary quantization footprint: the 32× reduction applies to the vec0 prefilter representation; full-precision vectors are retained in `chunks_vec_full` for rescoring, so total snapshot size is not 32× smaller. Updated both README scope and ARCHITECTURE `store`.
- Fix `Fingerprint`/`FingerprintFromPairs` error description: they fail on records that cannot be normalized or pairs with empty `Ref`/`ContentHash`; the injective-encoding guarantee comes from `HashRecord`'s serialization, not from runtime collision detection.
- Fix `LateChunkPolicy` description: it emits a parent span per heading-aware section (not per record) and may skip leaf emission when a parent already fits the child token budget.
- Narrow `DefaultFusion` claim: ordering matches pre-#17 RRF on every pre-change shape; `Score` matches on the hybrid multi-arm path and preserves arm-native score on single-arm paths.
- Fix migration ordering: v2→v3 adds `chunks.context_prefix`; v3→v4 re-hashes `records.content_hash` via `HashRecord` (no DDL); v4→v5 adds `chunks.parent_chunk_id` + partial index + same-record triggers. Previous text transposed v2→v3 and v3→v4.
Both docs still reflected the v0.2.0 surface: single-strategy Markdown chunking, dense-only retrieval, schema v1. Refresh to cover the substrate as it actually ships today.
- README: expand scope to call out pluggable chunking (`chunk.Policy`), hybrid retrieval with pluggable `FusionStrategy`, quantization modes, matryoshka prefilter, `ContextualEmbedder`, and incremental `Update`. Swap the example's record literal for `corpus.NewRecord`. Link the v2.0.0 release notes as the authoritative API surface.
- ARCHITECTURE: rename `Record.Normalized` → `Record.Normalize`; add `chunk.Policy` / `MarkdownPolicy` / `KindRouterPolicy` / `LateChunkPolicy`; add `ContextualEmbedder`, OpenAI `MaxBatchSize` + `APIToken` redaction; add quantization modes + `chunks_vec_full` rescore table; add `FusionStrategy` / `HitProvenance` / `Reranker` / `SearchDimension` / `DefaultSearchLimit`; describe schema v5 with `context_prefix` and `parent_chunk_id` + same-record triggers; add the v2→v3→v4→v5 migration chain.

No code changes.
bd162e8 to c2f7f1e
Summary
Both docs still reflected the v0.2.0 surface — single-strategy Markdown chunking, dense-only retrieval, schema v1 — which means a library consumer landing on the repo today would miss every feature shipped across v0.3.0 → v2.0.0. This PR brings them up to the substrate as it actually ships.
No code changes.
README.md
- Expand scope to call out pluggable chunking (`chunk.Policy`), hybrid retrieval with pluggable `FusionStrategy`, quantization modes, matryoshka prefilter (`SearchDimension`), `ContextualEmbedder`, and incremental `Update`.
- Refresh the package listing (`NewRecord`, `ErrTooManySections` DoS cap, `MaxBatchSize` + `APIToken` redaction, quantization constants, `ExpandContext`).
- Swap the example's record literal for `corpus.NewRecord`; add an inline comment noting that `Fusion`/`Reranker`/`SearchDimension` are optional and zero-value retrieval gives hybrid RRF over vector+FTS.

ARCHITECTURE.md
- Rename `Record.Normalized` → `Record.Normalize` (deprecated alias preserved in the code; doc now points at the live entry point).
- Document `corpus.NewRecord`; note `Fingerprint` now returns `(string, error)`.
- Add a `chunk` entry with the `Policy` interface contract and the three shipped implementations (`MarkdownPolicy`, `KindRouterPolicy`, `LateChunkPolicy`); mention `SectionWithLineage` and the `MaxSections` DoS guard.
- Add `ContextualEmbedder`, OpenAI `MaxBatchSize` + multi-batch deadline scaling + `APIToken` redaction.
- Add `chunks_vec_full` as the rescore companion for binary quantization.
- Add `FusionStrategy`/`RetrievalArm`/`HitProvenance` + `DefaultFusion()` semantics; `Reranker` seeing provenance; `SearchDimension` matryoshka prefilter; `DefaultSearchLimit`.
- Describe schema v5 with `context_prefix` and `parent_chunk_id` + same-record triggers; add the v2 → v3 → v4 → v5 migration chain table.

Test plan
- `git diff --stat`: only `README.md` + `ARCHITECTURE.md` touched
- `corpus.NewRecord`, `index.Rebuild`, `index.Search` with embedded `SearchParams` all match the current v2.0.0 public API
- Checked identifiers against `corpus/record.go`, `chunk/policy.go`, `embed/{embed,openai}.go`, `index/{index,snapshot,hybrid,context}.go`, `store/vector_blob.go`, and the `createSchema` DDL in `index/index.go`