docs: zero-downtime rolling deploy design #129
Merged
…zations

Multi-phase dump correctness:
- DocOp::Merge variant: merges fields into existing docs instead of replacing
- All dump phases use Merge for object-level writes (fixes data loss bug)
- Tags post-pass: bitmap inversion writes one Merge per slot (4.5B→109M ops)
- 10 unit tests for Merge semantics (roundtrip, accumulate, delete+resurrect)

Pipeline performance (StreamingDocWriter fixes):
- BufWriter 256→8192 bytes on new shard creation (2x throughput improvement)
- Hardware CRC32 via crc32fast (replaces software byte-at-a-time table)
- Remove per-shard fsync in finalize (saves 20-80s per phase)
- Background enrichment drop (50s blocking → non-blocking)
- Mmap explicit drop after parse (zombie RSS 83GB→24GB)

DataSilo crate (crates/datasilo/):
- Generic mmap'd key-value store: 35M writes/sec, 23M reads/sec
- ParallelWriter with atomic bump + 1MB thread-local regions
- OpsLog with CRC32 append + replay on startup
- Compaction (replay ops → rewrite data file)
- 6 unit tests passing

Server endpoints:
- POST /time-buckets/rebuild: rebuild from sort field data + cache clear
- GET /dictionaries: reverse maps for LCS/MappedString fields
- GET /ui-config: serves YAML as JSON for config-driven UI

Config-driven UI (static/index.html):
- Dynamic filter/sort controls from engine metadata + YAML overrides
- Card rendering with image URL templates, badges, meta fields
- Detail modal with configurable fields, display types, formats
- URL state sync for bookmarkable/shareable filter states
- Civitai UI config (deploy/configs/civitai/ui-config.yaml)

Design docs:
- docs/design/docop-merge.md (GPT + Gemini reviewed)
- docs/design/datasilo-implementation-plan.md (full migration plan)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
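The Merge-vs-replace distinction above is the crux of the data loss fix: a later dump phase writing an object must not drop fields an earlier phase wrote. A minimal sketch of those semantics, using a plain `HashMap<String, String>` as a stand-in for the real document encoding (the `DocOp` variants follow the commit, everything else is illustrative):

```rust
use std::collections::HashMap;

// Simplified model of the write ops: Put replaces the whole document,
// Merge folds fields into whatever is already stored, Delete removes it.
enum DocOp {
    Put(HashMap<String, String>),
    Merge(HashMap<String, String>),
    Delete,
}

// Replay ops in order. Merge accumulates fields instead of dropping the
// ones an earlier phase already wrote; a Merge after Delete resurrects
// the document with only the merged fields.
fn apply(ops: &[DocOp]) -> Option<HashMap<String, String>> {
    let mut doc: Option<HashMap<String, String>> = None;
    for op in ops {
        match op {
            DocOp::Put(fields) => doc = Some(fields.clone()),
            DocOp::Merge(fields) => {
                let d = doc.get_or_insert_with(HashMap::new);
                for (k, v) in fields {
                    d.insert(k.clone(), v.clone()); // last write wins per field
                }
            }
            DocOp::Delete => doc = None,
        }
    }
    doc
}
```

This mirrors the "accumulate" and "delete+resurrect" cases the commit's unit tests cover; the real implementation operates on the packed wire format rather than a map.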
Frozen query path (Task #33):
- BitmapSilo frozen accessors: get_frozen_filter(), get_frozen_sort_layer()
- mark_filters_backed() / mark_sorts_backed() — startup marks bitmaps as unloaded placeholders, reads from mmap at query time
- QueryExecutor: get_effective_bitmap() + and_effective_bitmap() helpers with frozen fallback for all filter ops (Eq, In, NotEq, NotIn, Or, Range)
- Sort traversal: bifurcate_frozen(), apply_cursor_filter_frozen(), reconstruct_value_frozen() — frozen layers from BitmapSilo mmap
- ConcurrentEngine holds BitmapSilo behind RwLock, passes it to the executor

Aggressive V2 retirement (~15K lines removed):
- Removed lazy loading: pending_filter_loads, pending_sort_loads, lazy_value_fields, ensure_fields_loaded(), LazyLoad enum
- Removed eviction: eviction_stamps, eviction_total, idle sweep
- Removed existence sets: existing_keys
- Deleted bitmap_memory_cache.rs, bitmap_fs.rs, bound_store.rs, doc_cache.rs, field_handler.rs, preset.rs, shard_store*.rs (4 files)
- Removed FilterField::load_field_complete(), load_values(), clear_bases_and_unload()
- Cleaned up 47 stale TODO comments (49→2)
- Deleted 8 dead test stubs, un-ignored 3 tests (0 ignored remaining)

635 tests passing, 0 failed, 0 ignored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove dead SortField wrappers: bifurcate(), order_results(), apply_cursor_filter() — only frozen variants remain
- Remove dead FlushCommand fields: skip_lazy, cursors, dictionaries
- Remove dead docstore_root field from ConcurrentEngine
- Clean unused imports across datasilo, concurrent_engine, executor

635 tests passing, 0 failed, 0 ignored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Arc<RoaringBitmap> on VersionedBitmap.base existed for ArcSwap CoW snapshot publishing. With V3 frozen mmap, published snapshots read bases from the BitmapSilo mmap, making the Arc unnecessary overhead.

- base: Arc<RoaringBitmap> → base: RoaringBitmap
- Removed from_arc() constructor
- Simplified merge(), or_into_base(), load_base() — direct mutation
- Updated all .base().as_ref() call sites to .base()
- diff: Arc<BitmapDiff> stays (still needed for swap_diff)

635 tests passing, 0 failed, 0 ignored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Temporary get_shard/get_shard_packed shims on DocSiloAdapter to unblock server compilation. These will be replaced with a proper get_document() API that reads from mmap and applies pending ops.

635 tests passing, 0 failed, 0 ignored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Zero references to ShardStore, DocStoreV3, BitmapFs, doc_cache, bound_store, or field_handler remain in the codebase.

- Removed DocCacheConfigEntry struct + doc_cache config field
- Removed 8 dead doc_cache metrics from metrics.rs
- Removed evict_doc_cache() + doc_cache_stats() stub methods
- Removed doc_cache metric scraping from server.rs
- Updated all comments from V2 system names to V3 (DataSilo/BitmapSilo)
- Updated test assertions from DocStoreV3 to DataSilo

635 tests passing, 0 failed, 0 ignored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
DocSink and Ingester<B> were V2 abstractions never used in production. Keep the BitmapSink trait, CoalescerSink, and AccumSink (actively used).

631 tests passing, 0 failed, 0 ignored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- DataSilo::delete(key) appends a Delete tombstone to the ops log
- get_with_ops() respects delete tombstones (returns None)
- Cold compaction: deleted keys excluded from the output data file
- Hot compaction: deleted keys have their index entry zeroed out
- OpsLog::for_each_ops() yields the full SiloOp (Put + Delete)
- Delete CRC validation in for_each()
- 4 new tests: cold delete, hot delete, get_with_ops delete, delete+reinsert

29 datasilo tests passing, 631 lib tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
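The tombstone behavior above boils down to a last-write-wins replay over the ops log. A sketch under simplified types (`SiloOp` here carries only a value or a tombstone; the real entries also carry CRCs and offsets):

```rust
use std::collections::HashMap;

// Simplified ops-log entry: Put carries the value, Delete is a tombstone.
enum SiloOp {
    Put(Vec<u8>),
    Delete,
}

// Replay the log in append order. A trailing Delete leaves the key absent,
// which is why cold compaction can simply skip it when rewriting the data
// file, and why get_with_ops() returns None for it.
fn replay(log: &[(u64, SiloOp)]) -> HashMap<u64, Vec<u8>> {
    let mut live = HashMap::new();
    for (key, op) in log {
        match op {
            SiloOp::Put(v) => {
                live.insert(*key, v.clone());
            }
            SiloOp::Delete => {
                live.remove(key);
            }
        }
    }
    live
}
```

A delete followed by a reinsert (the commit's delete+reinsert test case) naturally resolves to the reinserted value, since only the final op for a key matters.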
The merge thread now only does:
1. DataSilo compaction when dirty (apply pending doc ops)
2. RSS-aware memory pressure eviction

Removed: unused inner clone, time_buckets capture, cursors capture, suppress-unused hacks. Named the thread "bitdex-merge".

631 tests passing, 0 failed, 0 ignored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Added compact_threshold to SiloConfig (default 0.20 = 20%)
- Added dead_bytes counter on DataSilo (AtomicU64)
- Hot compaction tracks dead bytes from deletes (zeroed index entries) and relocating updates (overflows where the old slot becomes dead)
- Cold compaction resets dead_bytes to 0 (full rewrite)
- Added dead_bytes(), dead_ratio(), needs_compaction() accessors
- BitmapSilo uses compact_threshold=0.0 (bitmaps rewritten in full)

29 datasilo tests, 631 lib tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
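The accounting above is straightforward to sketch. Field and method names follow the commit, but the struct itself is illustrative, not the real DataSilo:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Minimal sketch of dead-space accounting with a compaction trigger.
struct DeadSpace {
    total_bytes: AtomicU64,
    dead_bytes: AtomicU64,
    compact_threshold: f64, // e.g. 0.20 = compact at 20% dead space
}

impl DeadSpace {
    // Called when a delete zeroes an index entry or an overflow
    // relocates a value, leaving the old slot dead.
    fn record_dead(&self, n: u64) {
        self.dead_bytes.fetch_add(n, Ordering::Relaxed);
    }

    fn dead_ratio(&self) -> f64 {
        let total = self.total_bytes.load(Ordering::Relaxed);
        if total == 0 {
            return 0.0;
        }
        self.dead_bytes.load(Ordering::Relaxed) as f64 / total as f64
    }

    // threshold 0.0 disables the ratio trigger (the BitmapSilo case,
    // where bitmaps are rewritten in full anyway).
    fn needs_compaction(&self) -> bool {
        self.compact_threshold > 0.0 && self.dead_ratio() > self.compact_threshold
    }
}
```

Cold compaction would then reset `dead_bytes` to zero after the full rewrite, matching the commit.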
Cold compaction now uses mmap for both data and index writes:
1. Compute entry layouts sequentially (offsets are cumulative)
2. Pre-allocate the data file at its exact size and mmap it
3. Write entries via pointer copy to the pre-computed offsets

Each entry targets a unique non-overlapping region, ready for parallel writes (rayon) when needed. Currently sequential, but the infrastructure is in place — just change .for_each to .par_iter().for_each() when rayon is added to datasilo.

29 datasilo tests, 631 lib tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Deleted pg_sync/backfill.rs entirely (no external callers)
- Deleted pg_sync/csv_ops.rs entirely (no external callers)
- Removed apply_ops_batch_dump + process_wal_dump from ops_processor.rs
- Removed 7 dead parse_*_row functions from copy_queries.rs (kept parse_post_row, parse_model_version_row, parse_model_row)
- Removed associated dead types: CopyImageRow, CopyResourceRow, CopyMetricRow

631 tests passing, 0 failed, 0 ignored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New src/cache_silo.rs:
- CacheEntryData: serializable subset of UnifiedEntry (bitmap, metadata, sorted_keys)
- Binary format v1: fixed header + variable bitmap + optional sorted_keys
- hash_unified_key(): folds the 64-bit hash to u32 for the DataSilo key
- save_entry/delete_entry: append to the ops log
- load_all: scan the ops log + data file for a last-write-wins restore
- compact: delegates to DataSilo compaction

Wiring in ConcurrentEngine:
- cache_silo field (Arc<RwLock<CacheSilo>>)
- Startup: open + load_all from bitmap_path/cache_silo/
- Flush thread: drain_dirty_for_silo() → save dirty entries after cache maintenance
- Merge thread: compact CacheSilo when dead space exceeds the threshold

UnifiedCache additions:
- drain_dirty_for_silo(): collects dirty entries as (key_hash, CacheEntryData)

8 new CacheSilo tests, 639 total tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CacheSilo restore fix:
- Added UnifiedKey serialization to the CacheEntryData binary format (v2)
- Added a key field to CacheEntryData (encode/decode round-trips the key)
- Wired the actual restore path: load_all → from_cache_entry_data → insert_restored_entry
- Added UnifiedEntry::from_cache_entry_data() constructor
- begin_restore/finish_restore for batch eviction

Dead enrichment code removal:
- Removed PostEnrichment, MvEnrichment, ModelEnrichment structs
- Removed load_posts_enrichment, load_mv_enrichment, load_model_enrichment
- Removed CopyPostRow, CopyModelVersionRow, CopyModelRow + parse functions
- Removed dead helper functions (is_null, parse_opt_*, parse_bool, parse_i64_fast)

639 tests passing, 0 failed, 0 ignored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
compact_hot() previously truncated the single ops log after compaction, losing any ops written during the compaction window.

Fix: two ops log slots (ops_a.log, ops_b.log) with an atomic swap. Protocol:
1. Freeze the active slot, redirect writes to the other slot (atomic xor)
2. Compact data from the frozen slot
3. Truncate the frozen slot only after data+index are fully flushed

Legacy migration: an existing ops.log is renamed to ops_a.log on first open.

Tests: test_ab_swap_no_ops_lost, test_legacy_ops_log_migration.

31 datasilo tests, 639 lib tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
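The heart of the protocol is step 1: a single atomic xor both redirects writers and tells the compactor which slot it now owns exclusively. A sketch (struct and helper names are illustrative):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Two-slot ops log: writers always append to the active slot;
// compaction flips the active index and drains the frozen one.
struct AbLog {
    active: AtomicUsize, // 0 = ops_a.log, 1 = ops_b.log
}

impl AbLog {
    // Atomically redirect writers to the other slot and return the
    // index of the now-frozen slot to compact from. fetch_xor returns
    // the previous value, so no write racing the flip can land in the
    // slot we are about to drain.
    fn freeze_active(&self) -> usize {
        self.active.fetch_xor(1, Ordering::SeqCst)
    }
}

fn slot_path(slot: usize) -> &'static str {
    if slot == 0 { "ops_a.log" } else { "ops_b.log" }
}
```

Because the flip and the read of the old value are one atomic operation, ops written during the compaction window land in the other slot and survive the truncation in step 3.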
Two bugs fixed in compact_hot_from():
1. Reader blocking: old code dropped self.data_mmap during compaction,
causing get() to return None. Fix: write to data.bin.tmp while old
mmap stays alive, then rename over data.bin.
2. Data/index interleaving: old code wrote data AND updated index in
same loop body. Crash mid-loop = corrupt state. Fix: three strict
phases — classify (read-only), write data (tmp file), update index
(only after data flushed).
Dead-space accounting also fixed: captures old_allocated during the
read-only classification pass before any mutations.
Tests: test_hot_compact_does_not_drop_read_mmap_early,
test_hot_compact_data_before_index_sequential_rounds
33 datasilo tests, 639 lib tests passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- BitmapSilo compact_threshold: 0.0 → 0.20 (20% dead space triggers)
- Added compact() and needs_compaction() to BitmapSilo
- Merge thread now round-robins across the doc, cache, and bitmap silos
- bitmap_silo_arc created early for sharing with the merge thread

639 tests passing, 0 failed, 0 ignored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Scarlet audit: the previous hot compaction copied the ENTIRE data file to a temp file on every cycle — a 25GB memcpy at 107M docs.

Fix: two-tier approach:
- In-place updates: seek+write to the existing data.bin at allocated offsets
- Overflows: append to the end of the existing data.bin (old slot = dead space)
- Full file rewrite only when dead_ratio > compact_threshold (separate pass)
- Never copy the entire file for routine compaction

No temp file, no rename. data_mmap remaps only when the file grows (overflows). The in-place path doesn't touch the mmap at all — readers stay unblocked throughout.

33 datasilo tests, 639 lib tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
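The routing decision behind the two tiers is a size check per update: fits in the allocated slot → write in place; doesn't fit → append and mark the old allocation dead. A sketch of that decision (types and names are illustrative, not the DataSilo API):

```rust
// Two-tier write plan: an update that still fits in its allocated slot
// is written in place at its existing offset; one that doesn't is
// appended at the end of the file, and the whole old allocation
// becomes dead space counted toward the compaction threshold.
enum WritePlan {
    InPlace { offset: u64 },
    Overflow { dead_bytes: u64 },
}

fn plan_update(allocated_len: u64, old_offset: u64, new_len: u64) -> WritePlan {
    if new_len <= allocated_len {
        WritePlan::InPlace { offset: old_offset }
    } else {
        WritePlan::Overflow { dead_bytes: allocated_len }
    }
}
```

Only the overflow path grows the file (and hence forces a remap); the in-place path never touches the mmap, which is what keeps readers unblocked.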
BitmapSilo true silo (Phase 2 foundation):
- Ops encoding: OP_SET_BIT (0x01) and OP_CLEAR_BIT (0x02) for individual bit mutations, alongside the existing full frozen bitmap format
- Mutation methods: filter_set/clear, sort_set/clear, alive_set/clear — append 5-byte ops to the silo ops log
- Ops-on-read: get_filter_with_ops, get_sort_layer_with_ops, get_alive_with_ops — read the frozen base + scan ops for pending set/clear, apply inline
- DataSilo.scan_ops_for_key() — scan both A/B logs for all ops on a key

Dead stubs cleanup (Phase 5 partial):
- Deleted memory_pressure.rs + all references
- Deleted get_rss_bytes() + Windows/Linux FFI from concurrent_engine
- Deleted dead stubs: boundstore_*, preload_*, build_all_from_docstore, rebuild_fields_from_docstore, add_fields_from_docstore, etc.
- Merge thread: removed the RSS eviction loop (no heap data to evict)
- Removed rebuild_on_boot from server.rs

636 tests passing, 0 failed, 0 ignored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
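A 5-byte bit-mutation op is most plausibly a 1-byte opcode plus a u32 slot id; the exact byte layout below (little-endian slot after the opcode) is an assumption for illustration — only the opcodes and the 5-byte size come from the commit:

```rust
// Opcodes from the commit; the byte layout is assumed for the sketch.
const OP_SET_BIT: u8 = 0x01;
const OP_CLEAR_BIT: u8 = 0x02;

// Encode one bit mutation as [opcode][slot as u32 LE] = 5 bytes,
// the unit appended to the silo ops log by filter_set/clear etc.
fn encode_bit_op(op: u8, slot: u32) -> [u8; 5] {
    let mut buf = [0u8; 5];
    buf[0] = op;
    buf[1..5].copy_from_slice(&slot.to_le_bytes());
    buf
}

// Decode for the ops-on-read path: the reader scans pending ops and
// applies each set/clear on top of the frozen bitmap base.
fn decode_bit_op(buf: &[u8; 5]) -> (u8, u32) {
    (buf[0], u32::from_le_bytes([buf[1], buf[2], buf[3], buf[4]]))
}
```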
get_effective_bitmap now reads from BitmapSilo first (frozen base + pending silo ops), then merges in in-memory VersionedBitmap diffs for mutations not yet written to the silo. During the Phase 2→4 transition both sources may have data; the union combines them.

and_effective_bitmap simplified to delegate to get_effective_bitmap.

636 tests passing, 0 failed, 0 ignored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Added a send_mutation_ops() helper that dual-writes every MutationOp to both the BitmapSilo ops log (V3 path) and the coalescer channel (V2 path, removed in Phase 4). All 6 mutation entry points are wired.

Filter, sort, AND alive mutations all go to the silo ops log:
- FilterInsert/Remove → silo.filter_set/clear per slot
- SortSet/Clear → silo.sort_set/clear per slot
- AliveInsert/Remove → silo.alive_set/clear per slot

Combined with the executor ops-on-read from the previous commit, the silo now has complete mutation data AND reads apply it. The coalescer/ArcSwap path is now redundant (Phase 4 removes it).

636 tests passing, 0 failed, 0 ignored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Query path now checks CacheSilo before UnifiedCache:
1. Hash UnifiedKey → key_hash
2. If not in UnifiedCache, try cache_silo.get_entry(key_hash)
3. On a silo hit: promote to UnifiedCache via from_cache_entry_data
4. Downstream logic (sorted_keys, radix, bucket diffs) works unchanged

New: CacheSilo.get_entry(key_hash) — single-key read via get_with_ops
New: silo_hits metric for tracking cross-restart cache effectiveness

Write path unchanged: the flush thread's drain_dirty_for_silo still handles persistence.

4 new get_entry tests. 640 total tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- BitmapSilo: RwLock on name_to_key for concurrent key auto-creation (new bitmap values auto-assign silo keys instead of silently skipping)
- send_mutation_ops(): skip the coalescer when bitmap_silo exists (mutations go ONLY to the silo ops log for engines with a silo)
- get_effective_bitmap(): simplified to silo-first, VB fallback for tests
- Removed a V2 lazy-load test (tested flush thread mechanics, N/A with a silo)

Phase 4 foundation: with silo-only mutations, the coalescer path is now dead for production engines. Tests without silos still use it.

640 tests passing, 0 failed, 0 ignored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MutationOp and MutationSender now live in mutation.rs (their natural home) instead of write_coalescer.rs. Updated all imports across concurrent_engine, ingester, and ops_processor. write_coalescer.rs now imports from mutation.rs — preparation for deleting the coalescer.

640 tests passing, 0 failed, 0 ignored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The WriteCoalescer batching system is replaced by direct silo ops log writes. The flush thread uses a local FlushBatch struct for the remaining staging updates. MutationOp + MutationSender were already moved to mutation.rs. FilterGroupKey moved to unified_cache.rs.

615 tests passing, 0 failed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
UnifiedCache replaced entirely by CacheSilo:
- Query path reads the cache via CacheSilo.get_entry() only
- No in-memory HashMap, no radix sort index, no LRU tracking
- UnifiedKey moved to cache_silo.rs
- Flush thread live maintenance removed (~1,800 lines from concurrent_engine)
- Prefetch worker removed
- Cache stats/metrics simplified to CacheSilo-only

Total removed this commit: ~5,200 lines (unified_cache.rs + flush thread code).

561 tests passing, 0 failed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
FlushCommand (ForcePublish, SyncUnloaded, ExitLoadingSaveUnload) and the cmd_tx/cmd_rx command channel are gone. The loading_mode AtomicBool and all enter/exit methods are removed.

- enter_loading_mode() / exit_loading_mode() → no-ops
- exit_loading_mode_and_save_unload() → just calls save_snapshot()
- save_and_unload() → calls publish_staging directly
- Flush thread simplified: no command handling, no loading mode checks
- 2 V2 tests deleted (loading mode timing tests)

559 tests passing, 0 failed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ArcSwap<InnerEngine> replaced with direct RwLock fields:
- slots: Arc<RwLock<SlotAllocator>>
- filters: Arc<RwLock<FilterIndex>>
- sorts: Arc<RwLock<SortIndex>>

Queries hold read locks. The flush thread holds write locks for mutation application only. No more staging clone, no snapshot publishing.

Bulk-load paths (clone_staging/publish_staging) still work via read-lock clone → offline build → write-lock swap.

559 tests passing, 0 failed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Removes the following methods that are no longer part of the engine API:
- put_via_wal, patch_document_via_wal (WAL write path — superseded by the ops pipeline)
- put_inner (inlined into put())
- patch, patch_document (PATCH semantics — use PUT for all writes)
- sync_filter_values (filter_only sync — use PUT for all writes)
- put_many, put_bulk, put_bulk_loading, put_bulk_into (bulk loading)
- spawn_docstore_writer, write_docs_to_docstore (docstore helpers)
- apply_accum (BitmapAccum apply — superseded by apply_bitmap_maps)
- wal_writer field and set_wal_writer (WAL path removed)

Keeps: put(), delete(), clone_staging(), publish_staging(), apply_bitmap_maps() — these are still used by dump_processor, loader, and remove_fields.

Server PATCH and filter_sync handlers now return 501 Not Implemented. Removes 7 tests that covered the deleted methods. The benchmark "bulk" stage is replaced with a no-op placeholder.

cargo check --lib: 0 errors
cargo check --features server,pg-sync: 0 errors
cargo test --lib: 548 passed, 0 failed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Converted concurrent_engine.rs to a directory module. Extracted 6 query methods to src/concurrent_engine/query.rs:
- query(), execute_query(), execute_query_impl()
- execute_query_traced(), execute_query_with_collector()
- resolve_filters(), post_validate()

concurrent_engine/mod.rs is now the engine struct + construction + mutations. concurrent_engine/query.rs is the query execution path.

548 tests passing, 0 failed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Scarlet audit items 1-4:
- Alive bitmap: load_alive().to_owned() → get_alive_with_ops() (ops-on-read)
- put() removed from ConcurrentEngine (test helper added in tests.rs)
- InFlightTracker removed (field + all calls + post_validate)
- Loading mode call sites removed from server.rs + benchmark.rs
- Cache setter call sites removed from the server.rs config patch handler

536 lib tests passing. Server + pg-sync features compile clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
FlushArgs struct + run_flush_thread() function extracted from build().

mod.rs: 1,578 → 1,244 lines (-334). flush.rs: 436 lines (flush loop + deferred alive + time buckets).

536 tests passing, 0 failed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Organized source into domain directories:
- src/engine/: executor, filter, sort, slot, versioned_bitmap
- src/silos/: bitmap_silo, cache_silo, doc_silo_adapter, doc_format
- src/query/: planner + query types (BitdexQuery, FilterClause, etc.)

engine.rs → engine_facade.rs (avoids a conflict with the engine/ dir). query.rs content folded into query/mod.rs. 21 files updated with new import paths.

536 tests passing, server+pg-sync features compile clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
load_alive() with .to_owned() removed. Test updated to use get_alive_with_ops(). The alive bitmap is not special — same ops-on-read as all other bitmaps.

536 tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Deleted remove_fields() from concurrent_engine (server endpoint returns 501)
- Extracted the janitor to src/janitor.rs (compaction round-robin across silos)
- Merge thread now delegates to janitor::run_janitor()
- Time bucket methods + config setters verified as thin delegations (no change needed)

mod.rs: 1,244 → 1,190 lines.

536 tests passing, server+pg-sync features compile clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Delete 4 dead files (2,211 lines): meta_index.rs, engine_facade.rs, concurrency.rs, radix_sort.rs + execute_from_radix from executor.rs
- Add DataSilo::write_batch_parallel() — rayon parallel mmap writes bypassing the ops log for bulk saves (used by BitmapSilo::save_all_parallel)
- Add rayon to the datasilo crate, parallelize cold compaction mmap writes
- Add ParallelBitmapWriter for lock-free bulk bitmap mutations
- Clean the flush thread: remove the dead cache invalidation no-op + merge_dirty
- Remove the deprecated enabled_metrics config field (keep disabled_metrics)
- Add QueryExecutor::new_full() replacing 5 conditional .with_*() chains
- Move concurrent_engine/ under engine/ as engine/concurrent_engine/
- Move cache.rs → silos/cache.rs, query_metrics.rs → query/metrics.rs
- Defer save_snapshot + compact from per-phase to the server handler

488 tests passing, net -6,681 lines.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Replace to_owned() + per-op insert/remove with frozen.apply_ops(&sets, &clears) in BitmapSilo::get_bitmap_with_ops() — only copies containers touched by ops
- Aggressive cache silo compaction: compact whenever ops exist (not just on threshold)
- Add CacheSilo::has_ops() delegation

488 tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move concurrent_engine/{mod,flush,flush_batch,query,tests}.rs up to
engine/ as flat siblings. Fields on ConcurrentEngine promoted to
pub(crate) for cross-module access. Delete the nested directory.
Layout: engine/{concurrent_engine,executor,filter,flush,flush_batch,
query,slot,sort,tests,versioned_bitmap}.rs
488 tests passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Rename src/pg_sync/ to src/sync/ — not PG-specific anymore
- Move dump_processor.rs, dump_enrichment.rs, dump_expression.rs into sync/
- Move ingester.rs, loader.rs into sync/
- Delete the old standalone files and the pg_sync/ directory
- Update all import paths (crate::pg_sync → crate::sync, etc.)
- Fix crate::concurrent_engine → crate::engine::concurrent_engine

654 tests passing (with the pg-sync feature).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add ParallelOpsWriter::write_put_reuse() — zero-alloc per call
- Add encode_merge_fields_into() — writes to a caller-provided buffer
- Wire thread-local scratch buffers in the dump parse loop
- Baseline: 579K rows/s → Fix 1: 597K rows/s (+3%)

A bigger win is expected at 107M scale (214M fewer allocations).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
HashMap::with_capacity(8) for config_computed_sort_vals — avoids reallocation growth on first insert. Minimal impact at 14.6M scale (591K/s, within noise of the 597K/s baseline).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Enable the parallel ops writer for ALL phases (was disabled for MV phases)
- Multi-value post-pass now uses par_iter + write_put instead of sequential append_ops_batch with Mutex contention
- Tags/tools/techniques at 107M scale will benefit most (4.73B rows through lock-free mmap writes instead of locked sequential append)
- No regression on the images phase: 599K/s (baseline 579K/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
src/bin/pg_sync.rs: bitdex_v2::pg_sync::* → bitdex_v2::sync::*

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…lescer, pg_sync

- Remove 'UnifiedCache' from the metrics description, cache_silo docs, engine comments
- Remove 'BoundStore' comment from the concurrent_engine struct, server purge handler
- Remove 'WriteCoalescer' reference from the flush_batch docs
- Update 'pg_sync' comment in loader.rs to 'sync pipeline'

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ParallelOpsWriter::write_frame() returned false on mmap overflow, but callers silently ignored it, dropping doc ops. This could cause missing documents after a dump.

Fix: add an overflow_count AtomicU64 to ParallelOpsWriter, incremented on every dropped write. The dump processor checks it after parallel writes and logs a WARNING with the count of dropped ops.

Bug 1 (fill_indexed_fields reuse) deferred — borrow checker conflict between the row lifetime and the thread-local buffer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace per-row RoaringBitmap::insert() with Vec<u32> collection + sort_unstable() + from_sorted_iter() for sort layer bitmaps. from_sorted_iter uses push_unchecked (O(1) per value) vs insert's binary search across ~1,678 containers.

Benchmarked at a 5.86x speedup on 32 bit-layers × 7.3M values (9,592ms → 1,638ms).

Each rayon thread collects slot IDs into a Vec<u32> per bit-layer during the row loop, then builds bitmaps in one shot after all rows are processed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
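The collect-then-build pattern can be sketched with plain `Vec<u32>` standing in for the roaring bitmap (the real code feeds the sorted, deduplicated values to `RoaringBitmap::from_sorted_iter`; everything else here is illustrative):

```rust
use std::collections::HashMap;

// Per-thread accumulation for sort layer bitmaps: O(1) push per
// (bit_layer, slot) during the row loop, then one sort + one sorted
// build per layer in a post-pass — instead of an ordered bitmap
// insert on every row.
fn build_layers(rows: &[(u8, u32)]) -> HashMap<u8, Vec<u32>> {
    let mut acc: HashMap<u8, Vec<u32>> = HashMap::new();
    for &(layer, slot) in rows {
        acc.entry(layer).or_default().push(slot);
    }
    for slots in acc.values_mut() {
        slots.sort_unstable(); // single O(n log n) sort per layer
        slots.dedup(); // from_sorted_iter requires strictly increasing input
        // real code: RoaringBitmap::from_sorted_iter(slots.iter().copied())
    }
    acc
}
```

The win comes from replacing millions of per-row binary searches across containers with one sort per layer followed by a linear, append-only build.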
Applied OS page-management hints across every mmap creation site so the kernel can make better decisions about readahead, THP, and page reclaim:

- dump_processor.rs: SEQUENTIAL after map (bulk CSV read, left-to-right), DONTNEED (Linux only) immediately before drop to release pages promptly
- dump_enrichment.rs: SEQUENTIAL after map (same bulk read pattern)
- slot_arena.rs: RANDOM at creation (random slot lookups), DONTNEED (Linux only) in cleanup() before drop to reclaim arena pages after a phase
- datasilo/lib.rs: SEQUENTIAL on bulk-write data mmaps (build_cold, rebuild), RANDOM on load_index (random bucket lookups), RANDOM + conditional HUGEPAGE (>512 MB, Linux only) on load_data for large silos
- datasilo/ops_log.rs: SEQUENTIAL on both the open-existing and ensure_capacity grow paths (append-only log, purely sequential writes)
- datasilo/hash_index.rs: RANDOM on both create() and open() (hash table — scattered random access by definition)

All advise() calls are #[cfg(unix)] gated (the method does not exist on Windows). DontNeed and HugePage are additionally #[cfg(target_os = "linux")]. Uses let _ = mmap.advise(...) — errors are ignored (hints are advisory only; failure is never fatal).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Benchmarks all viable approaches for merging N partial roaring bitmaps:
A) sequential pairwise |=
B) rayon fold+reduce |= (current dump pipeline)
C) MultiOps::union() refs — CoW streaming merge (roaring-rs built-in)
D) MultiOps::union() owned
E) largest-first sequential |=
F) k-way iterator merge → from_sorted_iter
G) parallel tree reduction

Results across 6 scenarios (8/32 threads × large/medium/sparse):
- MEDIUM-32 (most common tagId shape): C=3.6ms vs B=18.8ms — 5.2x faster
- LARGE-8 (dense nsfwLevel shape): C=1.2ms vs A=1.4ms — 1.2x faster
- SPARSE-32 (rare tag, many threads): C=0.5ms vs B=0.8ms — 1.7x faster

The winner is always C (MultiOps::union refs). It does a single streaming merge walk over all N bitmaps, borrowing containers from the largest bitmap first and deferring ensure_correct_store() until the final pass. Pairwise |= promotes containers and fixes cardinality on every intermediate step.

Rayon fold+reduce (B) is slower than single-threaded A in 5 of 6 scenarios because the merge is memory-bandwidth-bound, not CPU-bound. Parallel tree (G) and owned MultiOps (D) are consistently worse than C.

Recommendation: replace the dump pipeline's par_iter fold/reduce with bitmaps.iter().union() (MultiOps trait from roaring). Expected 4-5x speedup on the per-value merge for tagIds (31K distinct values, medium cardinality).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds ahash 0.8 and swaps std::collections::HashMap for AHashMap in the three hot-path modules: dump_processor (51 uses), engine/filter (FilterField bitmap map), and engine/sort (SortIndex field map).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… 4 scenarios)
Investigates whether threads sharing a single accumulator can eliminate the
per-thread bitmap merge that costs 6.4s+ in the dump pipeline.
8 strategies benchmarked across 4 cardinality shapes at 14.6M rows:
A per-thread HashMap<u64,bitmap> + sequential OR reduce (current baseline)
A2 same parse + MultiOps::union() merge (previous benchmark winner)
B shared DashMap<u64,Mutex<bitmap>> — zero merge cost
C shared DashMap<u64,Mutex<Vec<u32>>> + sort/from_sorted_iter finalize
D per-thread Vec<(val,slot)> + global sort + group-by finalize
E per-thread HashMap<u64,Vec<u32>> + parallel sort/from_sorted_iter
F 256-shard batched Mutex<HashMap<Vec<u32>>> accumulator
G per-thread HashMap<u64,Vec<u32>> + sharded parallel finalize
Key finding: NO single approach wins across all cardinalities.
Low/mid-card (nsfwLevel, tagIds, <50K distinct values):
G/E win at 106ms and 415ms vs A at 59ms/2086ms. A2 (MultiOps merge)
is the simplest win: 5x speedup on mid-card with no parse change.
B is catastrophic on low-card: 3.4s from 14.6M threads on 5 Mutexes.
High-card (userId, postId, 2M distinct values):
B wins at 2.4s vs A at 8.2s (3.5x). D (flat Vec + global sort) is
nearly as fast at 2.5s with zero lock overhead — simpler and safer.
A2 is WORSE than A here: MultiOps overhead on 2M bitmaps with 7
entries each dominates. E/G are also worse than A.
Recommendation:
Low/mid-card fields: keep per-thread structure, switch merge to
MultiOps::union() — 1.8x–5x faster, zero structural change.
High-card fields (>50K distinct values): switch to approach D —
per-thread Vec<(u64,u32)>, concat+sort+group-by finalize.
3x speedup, ~175MB working buffer, no locks.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
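The recommended approach D for high-cardinality fields can be sketched with `Vec<u32>` standing in for the per-value bitmap (the real finalize would feed each sorted run to `from_sorted_iter`; names are illustrative):

```rust
// Approach D finalize: each thread pushes (value, slot) pairs into a
// flat Vec during parsing; the threads' Vecs are concatenated, sorted
// once globally, then grouped into per-value runs — no locks, no
// per-thread HashMap merge.
fn group_by_value(mut pairs: Vec<(u64, u32)>) -> Vec<(u64, Vec<u32>)> {
    // Lexicographic sort orders by value first, then slot, so each
    // value's slots come out already sorted.
    pairs.sort_unstable();
    let mut out: Vec<(u64, Vec<u32>)> = Vec::new();
    for (value, slot) in pairs {
        match out.last_mut() {
            Some((v, slots)) if *v == value => slots.push(slot),
            _ => out.push((value, vec![slot])),
        }
    }
    out
}
```

Because the sort leaves each group's slots in ascending order, every run is directly usable as sorted bitmap input, which is what makes the flat-Vec approach competitive with the shared-DashMap variant at a fraction of the complexity.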
…election)

Proposes a same-node shared-PVC architecture for zero-downtime deploys. File-lock writer election, read-only serving mode, 503 on write endpoints for sidecar compatibility. ~200 lines of Rust when implemented. Depends on the V3 mmap architecture.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
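The writer-election half of the design can be approximated in a few lines. This sketch uses atomic lock-file creation as the election primitive; the design doc's actual proposal is an advisory file lock (flock), which the OS releases automatically on crash, whereas a `create_new` file can go stale — the simplification here is for a self-contained std-only example:

```rust
use std::fs::OpenOptions;
use std::path::Path;

// First process to create the lock file wins the election and serves
// read-write; everyone else starts in read-only mode. A production
// implementation would hold an advisory flock for the process lifetime
// instead, so a crashed writer releases the lock automatically.
fn try_become_writer(lock_path: &Path) -> bool {
    OpenOptions::new()
        .write(true)
        .create_new(true) // atomic: fails if the file already exists
        .open(lock_path)
        .is_ok()
}
```

In the proposed deploy flow, the new pod would call this on the shared PVC at startup, lose the election while the old pod is alive, serve read-only (returning 503 on writes), and win on a retry once the old pod exits.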
Prototypes 5 approaches for parallelizing the compact_cold_from() scan against the current single-threaded baseline, across 5 scenarios.

Approaches tested:
- Baseline: sequential for_each_ops scan → HashMap<key, Vec<u8>> (current)
- 2: sequential header prescan → offset table → parallel chunk scan
- 2B: fully parallel header prescan → offset table → parallel scan
- 3: byte-range parallel scan with CRC self-sync (no prescan)
- 4: sequential scan → flat Vec<(key, offset)> + sort (no HashMap)
- 5: sequential scan, no value copy (lower-bound measurement)

Results (1M keys × 300B, 400MB log, 32 threads):
- Baseline: 584ms
- Approach 2: 704ms (0.83x — SLOWER)
- Approach 3: 586ms (1.00x — breakeven only)
- Thread scaling (approach 3): 0.61x–0.86x at 2–32 threads

Finding: ALL parallel approaches are slower on Windows (no MADV_SEQUENTIAL). The scan is memory-bandwidth bound; sequential access wins because the OS prefetcher predicts the sequential pattern. Multiple threads thrash the TLB and compete for the same memory bus bandwidth.

Critical discovery from approach 5: the no-copy lower bound is 335ms vs the 584ms baseline, meaning 43% of scan time is Vec<u8> allocation overhead (14.6M × 300B = 4.4GB of heap allocations per compact).

Real bottleneck: TWO full passes over 4.4GB — the scan copies values to the heap, and the write phase reads them back. Total: ~9GB of memory traffic for a 4.4GB log.

Correct fix: zero-copy compaction. Store HashMap<key, (mmap_offset, len)> instead of Vec<u8>. The write phase reads directly from the source mmap into the dest data file, eliminating the 4.4GB heap allocation pass entirely (~2x speedup).

Parallel scan remains viable on Linux with MADV_SEQUENTIAL — the production pod already has the hint applied (from the previous madvise PR). Re-benchmarking there should show real scaling for approach 3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
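The zero-copy fix identified above can be sketched with a byte slice standing in for the source mmap (the log entries here are pre-parsed `(key, offset, len)` triples; in the real scan those come from walking the ops log):

```rust
use std::collections::HashMap;

// Zero-copy scan: instead of copying each value into an owned Vec<u8>,
// record only where the latest copy of each key lives in the source
// mmap. Last write wins, same as the HashMap<key, Vec<u8>> version.
struct SiloOpRef {
    offset: usize,
    len: usize,
}

fn scan_latest(log: &[(u64, usize, usize)]) -> HashMap<u64, SiloOpRef> {
    let mut latest = HashMap::new();
    for &(key, offset, len) in log {
        latest.insert(key, SiloOpRef { offset, len });
    }
    latest
}

// Write phase: copy live values straight from the source mapping into
// the destination buffer — the 4.4GB of intermediate heap Vec<u8>
// allocations from the old scan never happen.
fn write_phase(src: &[u8], latest: &HashMap<u64, SiloOpRef>) -> Vec<u8> {
    let mut dest = Vec::new();
    for r in latest.values() {
        dest.extend_from_slice(&src[r.offset..r.offset + r.len]);
    }
    dest
}
```

This is the shape of the `SiloOpRef` change the commit on the performance-overhaul branch later lands ("SiloOpRef stores mmap offsets instead of Vec<u8> copies").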
Add --read-only flag (or BITDEX_READ_ONLY=1 env var) that starts the
server in read-only mode:
- Write endpoints (POST /ops, PUT /dumps) return 503 with clear message
- All admin routes (create/delete/upsert/config) blocked via middleware
- WAL reader thread skipped (no write pipeline)
- Health endpoint reports {"status":"ok","mode":"read-only"|"read-write"}
- Queries, stats, cursors, and all read endpoints work normally
This enables K8s rolling deploys where the new pod starts read-only,
serves queries immediately from shared mmap'd data, and the sidecar's
existing retry logic handles the 503s until the pod is promoted.
See docs/design/zero-downtime-deploy.md for the full architecture.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
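The gating logic above can be sketched as a pure predicate. This is a hypothetical illustration, not the actual middleware: the function name, the `/admin` prefix, and the exact route matching are assumptions; the real server blocks admin routes (create/delete/upsert/config) via middleware.

```rust
// Decide whether a request must be rejected with 503 in read-only mode.
fn is_write_blocked(read_only: bool, method: &str, path: &str) -> bool {
    if !read_only {
        return false; // read-write mode: everything passes
    }
    // Write pipeline endpoints return 503 in read-only mode.
    let write_endpoint = matches!((method, path), ("POST", "/ops") | ("PUT", "/dumps"));
    // Admin routes are blocked wholesale (prefix is illustrative).
    let admin_route = path.starts_with("/admin");
    write_endpoint || admin_route
}

// Health payload reports the serving mode, per the commit message.
fn health_body(read_only: bool) -> String {
    let mode = if read_only { "read-only" } else { "read-write" };
    format!("{{\"status\":\"ok\",\"mode\":\"{}\"}}", mode)
}
```

Because the check is a pure function of (mode, method, path), it can sit in one middleware layer while queries, stats, and cursors pass through untouched.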
…gies Compares 4 strategies for building filter bitmaps during the dump parse loop across 4 Civitai-realistic data shapes (low/med/high cardinality and 8-field mixed). Key finding: Approach A (current HashMap insert) is 5x slower than B/D on the realistic mixed scenario (71.6s vs 13.4s for 116M tuples). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Compares merge strategies for combining 32-thread filter bitmap outputs at 1M-row scale with 8 fields (2 low, 3 medium, 3 high cardinality). Results: Approach B (per-field parallel merge) wins the merge-only phase at 516ms vs 2591ms for current rayon fold+reduce (5x faster). Pipeline total including per-thread bitmap build: B = 747ms vs A = 2822ms (3.78x). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
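The shape of Approach B can be sketched with stdlib threads. Assumptions: the real code merges RoaringBitmaps with rayon's `par_iter`; here a "bitmap" is modeled as a sorted, deduplicated `Vec<u32>` and the parallelism uses `std::thread::scope`, so the sketch shows the two-phase structure (sequential per-field collect, then independent per-field merges) rather than the production code.

```rust
use std::thread;

// Phase 1: sequentially bucket every thread's partial (field, slot) output
// by field. Phase 2: merge each field independently, one thread per field,
// so a huge field (e.g. userId with 2M values) gets its own thread.
fn merge_per_field(partials: Vec<Vec<(usize, u32)>>, num_fields: usize) -> Vec<Vec<u32>> {
    let mut per_field: Vec<Vec<u32>> = vec![Vec::new(); num_fields];
    for part in partials {
        for (field, slot) in part {
            per_field[field].push(slot);
        }
    }
    thread::scope(|s| {
        let handles: Vec<_> = per_field
            .into_iter()
            .map(|mut slots| {
                s.spawn(move || {
                    // Stand-in for the per-field bitmap union.
                    slots.sort_unstable();
                    slots.dedup();
                    slots
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

The key property measured in the benchmark is that fields never contend with each other: the expensive cross-thread reduce of the fold+reduce baseline is replaced by ~20 embarrassingly parallel per-field merges.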
Comprehensive dump pipeline performance overhaul. Parse+merge time reduced from 30.9s to 14.8s on the 14.6M images-small dataset.

## Per-row optimizations (20,230 → 11,968 ns/row, -41%)
- Mmap enrichment with dense Vec offset index: 6.1x faster build, 5.2x less memory (HashMap → mmap + Vec<u64> for >100MB CSVs)
- Sort bitmap from_sorted_iter: collect Vec<u32> per bit-layer, build bitmaps via sort + from_sorted_iter after the row loop
- Flat Vec filter bitmap batch insert (Approach B): push (field_idx, value, slot) tuples per row, sort + grouped from_sorted_iter in a post-pass. 66% faster than per-row HashMap insert.
- Compiled DocFieldPlan: pre-resolve all field indices, value types, and skip flags at phase setup. Single flat loop per row, zero HashMap/HashSet lookups.
- DumpFieldValue with zero-copy strings: borrow &str from mmap/enrichment instead of .to_string(). Shared wire format primitives in doc_format.rs (write_field_int/bool/str/multi_int).
- Duplicate config-computed sort elimination: compute GREATEST/LEAST once (early), reuse for bitmap writes (was 22% of parse time).
- Reusable indexed_fields Vec (lifetime fix: 'a mmap, not 'b row)
- Reusable enrichment buffer (enrich_row_indexed_into)
- O(1) enriched_get via AHashMap (was O(n) linear scan, 8 calls/row)
- ahash in dump_expression.rs + dump_enrichment.rs for hot-path maps

## Merge phase optimization (5.6s → 2.4s, -57%)
- Per-field parallel merge: sequential collect into per-field Vecs, then rayon par_iter over ~20 fields. Each field merges independently; userId (2M values) gets its own thread.

## Infrastructure
- Zero-copy cold compaction: SiloOpRef stores mmap offsets instead of Vec<u8> copies. 43% faster compaction scan.
- dump-timing feature flag: per-row nanosecond instrumentation with doc_encode sub-timings (field_collect, pack_encode, mmap_write). Zero overhead when the feature is off.
- streaming_merge config option on the dump request body (MultiOps::union path for 107M+ scale, default off).
- Mi merge concatenation fix (Merge ops concatenate multi-int arrays)

## Cleanup
- Deleted dead CacheStats/CacheEntryDetail stubs + zero-value metrics
- Renamed clear_unified_cache → clear_cache
- Deleted dead enrich_from_lookup method
- Panic guard on EnrichmentTable::get() for Mmap-backed tables
- MADV_RANDOM for the mmap enrichment lookup phase
- 200M key cap warning for the dense Vec enrichment index

685 tests pass. 11 files changed, +1207 -507.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
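The flat-Vec batch insert (Approach B) can be sketched as follows. Assumptions: the real post-pass feeds each grouped run into a RoaringBitmap via a `from_sorted_iter`-style constructor; here the bitmap side is elided and `build_grouped` is a hypothetical name — the sketch shows only the push/sort/group pattern that replaces per-row HashMap inserts.

```rust
// Per row, the hot loop just pushes (field_idx, value, slot) — no hashing.
// The post-pass sorts once, then walks the sorted tuples grouping runs into
// one ascending slot list per (field, value) key; each list can be handed
// to a sorted-iterator bitmap constructor in a single pass.
fn build_grouped(mut tuples: Vec<(u16, u64, u32)>) -> Vec<((u16, u64), Vec<u32>)> {
    tuples.sort_unstable(); // one O(n log n) sort replaces n map lookups
    let mut out: Vec<((u16, u64), Vec<u32>)> = Vec::new();
    for (field, value, slot) in tuples {
        match out.last_mut() {
            // Same (field, value) run continues: slots arrive in sorted order.
            Some((key, slots)) if *key == (field, value) => slots.push(slot),
            // New run starts a new group.
            _ => out.push(((field, value), vec![slot])),
        }
    }
    out
}
```

The win measured in the benchmarks comes from keeping the per-row cost to a bounds-checked `Vec::push` and paying for ordering once, in cache-friendly bulk, after the row loop.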
Replace .iter().clone() with .into_iter() when converting AHashMap to std::HashMap for apply_bitmap_maps. Eliminates deep-cloning millions of RoaringBitmaps during the filter/sort bitmap transfer to engine staging. Also uses into_iter for sort_maps_indexed conversion and removes unnecessary .clone() on alive bitmap. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
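The difference can be illustrated with stdlib maps (the real code converts AHashMap to std::HashMap of RoaringBitmaps; `Vec<u32>` stands in for the bitmaps, and both function names are illustrative):

```rust
use std::collections::HashMap;

// Iterating by reference forces a deep clone of every key and value —
// expensive when the values are multi-megabyte bitmaps.
fn convert_cloning(src: &HashMap<String, Vec<u32>>) -> HashMap<String, Vec<u32>> {
    src.iter().map(|(k, v)| (k.clone(), v.clone())).collect()
}

// Consuming the source map moves each entry: no per-bitmap copy at all.
fn convert_moving(src: HashMap<String, Vec<u32>>) -> HashMap<String, Vec<u32>> {
    src.into_iter().collect()
}
```

Since the staging transfer never needs the source map again, ownership can simply be handed over, which is what makes the clone-free version valid.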
Move apply_bitmap_maps from process_dump (outer) into process_dump_with_progress (inner), right after the merge phase. Merged bitmaps are consumed directly via into_iter — no intermediate PhaseResult storage, no AHashMap→HashMap conversion overhead. process_dump becomes a thin wrapper (save dictionaries + return). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Write frozen bitmaps directly to BitmapSilo via write_dump_maps() instead of the V2 clone_staging → apply → publish → save_snapshot roundtrip. Eliminates ~15s overhead (5s apply + 10.5s save_snapshot) at 14.6M scale. Results: 1,048K → 1,428K rows/sec (+36%), total process_dump 19.9s → 11.2s. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- File-lock (flock) writer election — no K8s API dependency, no external coordination
- Write endpoints (POST /ops, PUT /dumps) return 503 in read-only mode — sidecar retries naturally

Notes
- rolling-restart-cursors.md for same-node deployments

🤖 Generated with Claude Code