docs: refresh README + ARCHITECTURE for v2.0.0#68

Merged: autholykos merged 2 commits into main from docs/refresh-post-v2 on Apr 18, 2026

@autholykos (Member)

Summary

Both docs still reflected the v0.2.0 surface — single-strategy Markdown chunking, dense-only retrieval, schema v1 — which means a library consumer landing on the repo today would miss every feature shipped across v0.3.0 → v2.0.0. This PR brings them up to the substrate as it actually ships.

No code changes.

README.md

  • Expand Scope to call out pluggable chunking (chunk.Policy), hybrid retrieval with pluggable FusionStrategy, quantization modes, matryoshka prefilter (SearchDimension), ContextualEmbedder, and incremental Update.
  • Expand Packages with accurate per-package capabilities (NewRecord, ErrTooManySections DoS cap, MaxBatchSize + APIToken redaction, quantization constants, ExpandContext).
  • Swap the example's record literal for corpus.NewRecord; add an inline comment noting that Fusion / Reranker / SearchDimension are optional and zero-value retrieval gives hybrid RRF over vector+FTS.
  • Link the v2.0.0 release notes as the authoritative API surface.
  • Update Status to say v2.0.0.
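For readers unfamiliar with the "hybrid RRF over vector+FTS" default mentioned above, here is a minimal sketch of reciprocal rank fusion in general. It is not Stroma's implementation: the `Hit` type, the `fuseRRF` function, and the constant `k = 60` (a commonly used value) are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Hit is a hypothetical fused result; it is not Stroma's hit type.
type Hit struct {
	ID    string
	Score float64
}

// fuseRRF merges ranked ID lists with reciprocal rank fusion: every list
// contributes 1/(k + rank) for each ID it contains, with rank starting at 1.
func fuseRRF(k float64, arms ...[]string) []Hit {
	scores := make(map[string]float64)
	for _, arm := range arms {
		for i, id := range arm {
			scores[id] += 1.0 / (k + float64(i+1))
		}
	}
	hits := make([]Hit, 0, len(scores))
	for id, s := range scores {
		hits = append(hits, Hit{ID: id, Score: s})
	}
	// Sort by descending score; break ties by ID so the output is deterministic.
	sort.Slice(hits, func(a, b int) bool {
		if hits[a].Score != hits[b].Score {
			return hits[a].Score > hits[b].Score
		}
		return hits[a].ID < hits[b].ID
	})
	return hits
}

func main() {
	vector := []string{"chunk-a", "chunk-b", "chunk-c"} // dense arm, best first
	fts := []string{"chunk-b", "chunk-d", "chunk-a"}    // full-text arm, best first
	for _, h := range fuseRRF(60, vector, fts) {
		fmt.Printf("%s %.5f\n", h.ID, h.Score)
	}
}
```

Because RRF only looks at ranks, the two arms do not need comparable score scales, which is one reason it is a common default for combining dense and full-text results.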

ARCHITECTURE.md

  • Rename Record.Normalized → Record.Normalize (deprecated alias preserved in the code; doc now points at the live entry point).
  • Add corpus.NewRecord; note Fingerprint now returns (string, error).
  • Replace the single-line chunk entry with the Policy interface contract and the three shipped implementations (MarkdownPolicy, KindRouterPolicy, LateChunkPolicy); mention SectionWithLineage and the MaxSections DoS guard.
  • Add ContextualEmbedder, OpenAI MaxBatchSize + multi-batch deadline scaling + APIToken redaction.
  • Add the three quantization modes; note chunks_vec_full as the rescore companion for binary.
  • Add FusionStrategy / RetrievalArm / HitProvenance + DefaultFusion() semantics; Reranker seeing provenance; SearchDimension matryoshka prefilter; DefaultSearchLimit.
  • Describe schema v5 with context_prefix and parent_chunk_id + same-record triggers; add the v2 → v3 → v4 → v5 migration chain table.
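The binary quantization mode with a full-precision rescore companion can be pictured with the generic sketch below. It assumes sign-bit quantization, Hamming distance for the prefilter, and dot product for rescoring; none of these choices, nor the function names, are taken from Stroma's code or from its `chunks_vec_full` table layout.

```go
package main

import (
	"fmt"
	"math/bits"
	"sort"
)

// quantizeBinary packs the sign of each dimension into one bit (1 if v > 0).
func quantizeBinary(v []float32) []uint64 {
	out := make([]uint64, (len(v)+63)/64)
	for i, x := range v {
		if x > 0 {
			out[i/64] |= uint64(1) << uint(i%64)
		}
	}
	return out
}

// hamming counts differing bits between two packed codes of equal length.
func hamming(a, b []uint64) int {
	d := 0
	for i := range a {
		d += bits.OnesCount64(a[i] ^ b[i])
	}
	return d
}

// dot is the full-precision similarity used for rescoring.
func dot(a, b []float32) float32 {
	var s float32
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

// search prefilters by Hamming distance over binary codes, keeps the `keep`
// closest candidates, then reorders them by full-precision dot product.
func search(query []float32, full [][]float32, keep int) []int {
	q := quantizeBinary(query)
	idx := make([]int, len(full))
	codes := make([][]uint64, len(full))
	for i, v := range full {
		idx[i] = i
		codes[i] = quantizeBinary(v)
	}
	sort.SliceStable(idx, func(a, b int) bool {
		return hamming(q, codes[idx[a]]) < hamming(q, codes[idx[b]])
	})
	if keep < len(idx) {
		idx = idx[:keep]
	}
	sort.SliceStable(idx, func(a, b int) bool {
		return dot(query, full[idx[a]]) > dot(query, full[idx[b]])
	})
	return idx
}

func main() {
	query := []float32{0.9, -0.1, 0.4, -0.7}
	full := [][]float32{
		{0.8, -0.2, 0.5, -0.6},
		{0.1, -0.9, 0.1, -0.1},
		{-0.8, 0.2, -0.5, 0.6},
		{1.0, 0.3, 0.9, -0.9},
	}
	fmt.Println(search(query, full, 3))
}
```

One bit per dimension instead of a 32-bit float is where a 32× figure for the prefilter representation comes from. The rescore step still needs the original vectors, so they have to be stored somewhere alongside the codes.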

Test plan

  • git diff --stat — only README.md + ARCHITECTURE.md touched
  • README example: corpus.NewRecord, index.Rebuild, index.Search with embedded SearchParams all match the current v2.0.0 public API
  • ARCHITECTURE: every API name and schema column cross-checked against corpus/record.go, chunk/policy.go, embed/{embed,openai}.go, index/{index,snapshot,hybrid,context}.go, store/vector_blob.go, and the createSchema DDL in index/index.go

Copilot AI review requested due to automatic review settings April 18, 2026 14:57

Copilot AI left a comment

Pull request overview

Updates top-level documentation to reflect Stroma’s current (v2.0.0) API surface and on-disk schema, replacing outdated v0.2.0-era descriptions.

Changes:

  • Refresh README scope/packages and update the example to use current public APIs (e.g., corpus.NewRecord, index.Search with embedded SearchParams).
  • Expand ARCHITECTURE coverage for chunk policies, hybrid fusion/provenance, contextual embedding, quantization, and schema v5 + migration chain.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| README.md | Updates scope/packages and example to match v2.0.0 features and APIs. |
| ARCHITECTURE.md | Updates architecture + schema documentation to match current chunking, retrieval, embedding, quantization, and migrations. |

Review comment threads (all marked outdated): 1 on README.md, 5 on ARCHITECTURE.md.

autholykos added a commit that referenced this pull request Apr 18, 2026
- Clarify binary quantization footprint: the 32× reduction is for the
  vec0 prefilter representation; full-precision vectors are retained
  in chunks_vec_full for rescoring, so total snapshot size is not 32×
  smaller. Updated both README scope and ARCHITECTURE `store`.
- Fix Fingerprint/FingerprintFromPairs error description: they fail
  on records that cannot be normalized or pairs with empty
  Ref/ContentHash; the injective-encoding guarantee comes from
  HashRecord's serialization, not from runtime collision detection.
- Fix LateChunkPolicy description: it emits a parent span per
  heading-aware section (not per record) and may skip leaf emission
  when a parent already fits the child token budget.
- Narrow DefaultFusion claim: ordering matches pre-#17 RRF on every
  pre-change shape; Score matches on the hybrid multi-arm path and
  preserves arm-native score on single-arm paths.
- Fix migration ordering: v2->v3 adds chunks.context_prefix;
  v3->v4 re-hashes records.content_hash via HashRecord (no DDL);
  v4->v5 adds chunks.parent_chunk_id + partial index + same-record
  triggers. Previous text transposed v2->v3 and v3->v4.
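
To illustrate the distinction drawn in the Fingerprint fix above (errors come from invalid input, while collision resistance comes from an injective serialization), here is a generic sketch using length-prefixed fields. The `Pair` type, `fingerprintPairs`, and the choice of SHA-256 with an 8-byte length prefix are assumptions for illustration; they do not describe how HashRecord or FingerprintFromPairs actually serialize data.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"errors"
	"fmt"
	"hash"
)

// Pair is a hypothetical (Ref, ContentHash) tuple named after the commit
// message; it is not the library's real type.
type Pair struct {
	Ref         string
	ContentHash string
}

var errEmptyField = errors.New("fingerprint: empty Ref or ContentHash")

// writeField writes an 8-byte big-endian length followed by the bytes, so
// ("ab", "c") and ("a", "bc") produce different byte streams.
func writeField(h hash.Hash, s string) {
	var n [8]byte
	binary.BigEndian.PutUint64(n[:], uint64(len(s)))
	h.Write(n[:]) // hash.Hash.Write never returns an error
	h.Write([]byte(s))
}

// fingerprintPairs rejects empty fields up front and hashes the rest with
// the length-prefixed encoding; it does no runtime collision detection.
func fingerprintPairs(pairs []Pair) (string, error) {
	h := sha256.New()
	for _, p := range pairs {
		if p.Ref == "" || p.ContentHash == "" {
			return "", errEmptyField
		}
		writeField(h, p.Ref)
		writeField(h, p.ContentHash)
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	a, _ := fingerprintPairs([]Pair{{Ref: "ab", ContentHash: "c"}})
	b, _ := fingerprintPairs([]Pair{{Ref: "a", ContentHash: "bc"}})
	fmt.Println("distinct:", a != b)

	_, err := fingerprintPairs([]Pair{{Ref: "doc-1"}})
	fmt.Println("error:", err)
}
```

Without the length prefix, plain concatenation would map both example pairs to the same bytes `abc`, which is the ambiguity an injective encoding rules out by construction.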

Both docs still reflected the v0.2.0 surface: single-strategy Markdown
chunking, dense-only retrieval, schema v1. Refresh to cover the
substrate as it actually ships today.

- README: expand scope to call out pluggable chunking (chunk.Policy),
  hybrid retrieval with pluggable FusionStrategy, quantization modes,
  matryoshka prefilter, ContextualEmbedder, and incremental Update.
  Swap the example's record literal for corpus.NewRecord. Link the
  v2.0.0 release notes as the authoritative API surface.
- ARCHITECTURE: rename Record.Normalized -> Record.Normalize; add
  chunk.Policy / MarkdownPolicy / KindRouterPolicy / LateChunkPolicy;
  add ContextualEmbedder, OpenAI MaxBatchSize + APIToken redaction;
  add quantization modes + chunks_vec_full rescore table; add
  FusionStrategy / HitProvenance / Reranker / SearchDimension /
  DefaultSearchLimit; describe schema v5 with context_prefix and
  parent_chunk_id + same-record triggers; add the v2->v3->v4->v5
  migration chain.

No code changes.
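
The migration ordering stated in the commit message (v2->v3, v3->v4, v4->v5) can be sketched as a sequential chain lookup. The `migration` struct and `plan` function below are hypothetical; only the version numbers and the per-step notes come from the text above, and the real migration code may be organized differently.

```go
package main

import "fmt"

// migration is one hypothetical step; note stands in for the DDL or data rewrite.
type migration struct {
	from, to int
	note     string
}

// chain lists the steps in the order the commit message gives them.
var chain = []migration{
	{2, 3, "add chunks.context_prefix"},
	{3, 4, "re-hash records.content_hash via HashRecord (no DDL)"},
	{4, 5, "add chunks.parent_chunk_id + partial index + same-record triggers"},
}

// plan returns the ordered steps needed to move a snapshot from schema
// version `from` to `target`, or an error if no path exists.
func plan(from, target int) ([]migration, error) {
	if from > target {
		return nil, fmt.Errorf("schema v%d is newer than target v%d", from, target)
	}
	var steps []migration
	for v := from; v < target; {
		found := false
		for _, m := range chain {
			if m.from == v {
				steps = append(steps, m)
				v = m.to
				found = true
				break
			}
		}
		if !found {
			return nil, fmt.Errorf("no migration from schema v%d", v)
		}
	}
	return steps, nil
}

func main() {
	steps, err := plan(2, 5)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, m := range steps {
		fmt.Printf("v%d -> v%d: %s\n", m.from, m.to, m.note)
	}
}
```

Keeping each step keyed by its starting version makes a transposition like the one fixed in this PR easy to spot, since every step must begin where the previous one ended.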
autholykos force-pushed the docs/refresh-post-v2 branch from bd162e8 to c2f7f1e on April 18, 2026 15:13
autholykos merged commit 38310e2 into main on Apr 18, 2026
4 checks passed
autholykos deleted the docs/refresh-post-v2 branch on April 18, 2026 15:16