perf: fix slow /packets?groupByHash=true on large stores by efiten · Pull Request #2 · efiten/meshcore-analyzer

efiten · 2026-03-31T14:14:12Z

Summary

Adds LatestSeen field to StoreTx, maintained in all three observation write paths (load, real-time ingest, poll) — eliminates the O(n×observations) scan that was happening per packet at query time
Builds grouped packet maps under read lock (correct), sorts the local copy outside the lock — avoids holding the lock during O(n log n) sort
Caches the full sorted result for 3 seconds keyed by filter params — repeated requests within TTL return instantly without re-sorting

Fixes /packets?limit=50000&groupByHash=true taking 16s on large stores.

Test plan

Deploy to staging: ./deploy-staging.sh fix/grouped-packets-perf
Open packets page and confirm load time is well under 1s
Check [SLOW API] log in browser console is gone
Verify packet data is correct (hashes, counts, observer counts)

🤖 Generated with Claude Code

- Add LatestSeen field to StoreTx, maintained in all three observation write paths (load, real-time ingest, poll). Eliminates the per-packet observation scan that was O(total_packets * avg_observations). - Build grouped packet maps under read lock (correct), sort the local copy outside the lock (avoids holding lock during O(n log n) sort). - Cache the full sorted result for 3 seconds keyed by filter params. Repeated requests within the TTL return instantly without re-sorting. Fixes /packets?limit=50000&groupByHash=true taking 16s on large stores. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

GetChannels was iterating all payload-type-5 packets and JSON-unmarshaling each one while holding s.mu.RLock(), blocking all concurrent reads. On stores with many channel messages this caused /api/channels to take 13s+. - Copy only the needed fields under the read lock, release before unmarshal - Cache the result for 15 seconds keyed by region param - Invalidate cache on new packet ingestion Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…channel cache - TestLatestSeenMaintained: verifies StoreTx.LatestSeen is set >= FirstSeen and >= all observation timestamps after store load - TestQueryGroupedPacketsSortedByLatest: verifies packets with more-recent observations sort before packets with newer first_seen but older observations - TestQueryGroupedPacketsCacheReturnsConsistentResult: verifies cache returns consistent total and ordering on back-to-back calls - TestGetChannelsCacheReturnsConsistentResult: verifies GetChannels cache returns same channel names on repeated calls - TestGetChannelsNotBlockedByLargeLock: verifies GetChannels returns correct data (channel name, messageCount) after lock-copy-unmarshal refactor Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…event silent reconnect-loop death (Kpa-clawbot#1216) RED commit: `1cd25f7b` — CI (failing on assertion): https://github.com/Kpa-clawbot/CoreScope/actions?query=sha%3A1cd25f7b1bdd0091f689dd64ce1bfec6d031191f Fixes Kpa-clawbot#1212 ## Root cause NOT that `AutoReconnect` was off — it was set; `MaxReconnectInterval=30s` was set (PR Kpa-clawbot#949); a `SetReconnectingHandler` was wired. The defect was an **observability gap**: `SetReconnectingHandler` fires only INSIDE paho's reconnect goroutine. If that goroutine never iterates (status race after the recovered handler panic at 21:07:13, or an internal abort), operators see ONLY the `disconnected: pingresp not received` line and then total silence. They cannot distinguish "paho is patiently retrying" from "paho gave up and the goroutine is gone." That ambiguity is what turned a 30s blip into 6h of downtime. ## Changes ### `cmd/ingestor/main.go` — `SetConnectionAttemptHandler` Fires on every TCP/TLS dial — the initial `Connect()` AND every reconnect — independent of paho's internal reconnect-loop state. Logs: ``` MQTT [staging] connection attempt #1 to tcp://broker:1883 MQTT [staging] connection attempt #2 to tcp://broker:1883 ``` Per-source attempt counter via `atomic.AddInt64`. ### `cmd/ingestor/mqtt_watchdog.go` (new) — per-source stall watchdog Satisfies the watchdog acceptance criterion. Even when paho reports `connected`, if no MQTT messages have flowed for >5m, log a WARN line every 60s: ``` MQTT [staging] WATCHDOG: client reports connected to tcp://broker:1883 but no messages received for 7m30s (threshold 5m) — possible half-open socket or upstream stall ``` Catches half-open TCP and broker-accepted-but-not-forwarding scenarios that look "connected" to paho. Hot-path cost: one `atomic.StoreInt64` per inbound message. Watchdog scans the registry once a minute. ### Tests (`cmd/ingestor/mqtt_reconnect_test.go`, new) - `TestBuildMQTTOpts_InstrumentsConnectionAttempt` — asserts `OnConnectAttempt` is wired in `buildMQTTOpts`. - `TestMQTTStallWatchdog_FiresOnSilentSource` — connected + 10m silent + 5m threshold → stall flagged. - `TestMQTTStallWatchdog_QuietWhenRecent` — recent message → no stall. - `TestMQTTStallWatchdog_QuietWhenDisconnected` — disconnected → no stall (paho's reconnect logging covers it). ## TDD - RED `1cd25f7b` — 2 assertion failures (compile OK, stub returns no-stall, `OnConnectAttempt` nil). - GREEN `2527be6f` — implementation; all ingestor tests pass. ## Out of scope - Slice-bounds decode panic (Kpa-clawbot#1211, separate PR). - A full in-process MQTT broker integration test would require a new dep (mochi-mqtt) — the observability and watchdog behaviors are independently verifiable by the unit tests above, and the reconnect path itself is paho's responsibility (we already test it's configured via `mqtt_opts_test.go`). --------- Co-authored-by: bot <bot@example.com> Co-authored-by: OpenClaw Bot <bot@openclaw.local> Co-authored-by: corescope-bot <bot@corescope.local> Co-authored-by: openclaw-bot <openclaw-bot@users.noreply.github.com>

… <500ms (Kpa-clawbot#1226) ## Summary Fixes Kpa-clawbot#1225 — channel messages endpoint took ~30s on staging. ## Root cause `(*DB).GetChannelMessages` SELECTed every observation row for the channel (one row per observation, not per transmission), JSON-unmarshalled each row into a Go map, dedupe-folded by `(sender, packetHash)`, then sliced the tail in Go for pagination. On staging `#wardriving`: - `transmissions` rows with `channel_hash='#wardriving' AND payload_type=5`: **5,703** - `observations` joined to those: **274,632** (~48× amplification) - `time curl /api/channels/%23wardriving/messages?limit=50`: **30.04s / 31.41s / 31.48s / 35.33s / 34.05s** (5 calls before I killed the loop) `EXPLAIN QUERY PLAN` showed the index `idx_tx_channel_hash` was being used — the cost was entirely in fetching, unmarshalling, and folding the full observation set per request even for `limit=50`. Hypothesis #1 from the issue (full table scan on `messages/decoded`) is rejected; #2 (missing index) is rejected; the actual cause was **pagination in Go instead of SQL** — request cost was O(observations) not O(limit). ## Fix Move pagination into SQL on the `transmissions` table. Because `transmissions.hash` is `UNIQUE` and the original dedup key was `(sender, hash)`, each transmission collapses to exactly one logical message — paginating on transmissions is semantically equivalent to the prior in-Go dedup + tail slice. New shape: 1. `COUNT(*)` on transmissions for total (uses `idx_tx_channel_hash`). 2. `SELECT id FROM transmissions … ORDER BY first_seen DESC LIMIT ? OFFSET ?` to pick the page of newest transmissions. 3. `SELECT … FROM observations WHERE transmission_id IN (…page ids…)` — typically 50 ids → a few hundred observation rows. 4. Reassemble in pageIDs order, preserving the ASC-by-`first_seen` API contract. Region filtering, observation-count-as-`repeats`, and "first observation wins for hops/snr/observer" semantics are preserved (observations are scanned `ORDER BY o.id ASC`). ## Perf measurements **Before** (staging `#wardriving`, limit=50, 5 samples killed mid-loop): 30.04s, 31.41s, 31.48s, 35.33s, 34.05s. **Synthetic regression test** (`TestGetChannelMessagesPerfLargeChannel`): 3000 tx × 50 obs. - Broken impl: ~4.5s (test fails the 500ms budget — the RED commit). - Fixed impl: well under 500ms (test passes). **After (staging)**: will measure post-deploy and post-comment on issue with numbers. Synthetic scaling: staging is ~2× the test's transmission count, fixed-path cost scales with `limit` (50) + `COUNT(*)` (~5k rows on index) — expect <100ms p99. ## TDD - RED: `697c290d` — perf test asserts <500ms on 3k×50 dataset; fails at ~4.5s. - GREEN: `3f1f82d3` — fix; full suite green, perf test passes. ## Hypotheses status | # | Hypothesis | Verdict | |---|---|---| | 1 | Endpoint slow on prod-sized data | **CONFIRMED** (different mechanism — see root cause) | | 2 | Missing channel_hash index | Rejected (`idx_tx_channel_hash` exists & used) | | 3 | Frontend re-render storm | Not investigated (backend was clearly the bottleneck) | | 4 | Decode in request path | Rejected (decode is at ingest time; JSON unmarshal of cached `decoded_json` is the cost, addressed by reducing row count) | | 5 | WS subscription failure | Rejected | | 6 | Staging artifact | Rejected (reproducible) | ## Out of scope - The in-memory `(*PacketStore).GetChannelMessages` path (used when `s.db == nil`) has the same shape but operates on bounded in-memory data; not touched. If we ever fall back to it in production we'll revisit. --------- Co-authored-by: clawbot <bot@corescope>

…pa-clawbot#1279 P0+P1) (Kpa-clawbot#1280) Addresses the four P0+P1 firmware reconciliation gaps from the umbrella audit (issue Kpa-clawbot#1279). RED commit: `0a4c084e` (asserts on stub returns; all 13 assertions fail). GREEN commit: `13867681`. ## What's in this PR ### P0 — silently dropped data - **#1 GRP_DATA (0x06) decoder.** Outer envelope is the same shape as GRP_TXT (`channel_hash(1)+MAC(2)+ciphertext`) per `firmware/src/helpers/BaseChatMesh.cpp:476,500`. Factored `decryptChannelBlock(...)` helper used by both 5 and 6. When a channel key matches, the inner is parsed per `firmware/src/helpers/BaseChatMesh.cpp:382-385` as `data_type(uint16 LE) + data_len(1) + blob(data_len)`. Surfaces `{channelHash, MAC, dataType, dataLen, decryptedBlob}` on decrypt or `{channelHash, MAC, encryptedData}` otherwise. Server-side decoder surfaces envelope only (no key store). - **#2 MULTIPART (0x0A) decoder.** Per `firmware/src/Mesh.cpp:289`, byte0 = `(remaining<<4) | inner_type`. When `inner_type == PAYLOAD_TYPE_ACK (0x03)`, next 4 bytes are the LE ack_crc per `firmware/src/Mesh.cpp:292-307`. Surfaces `{remaining, innerType, innerTypeName, innerAckCrc | innerPayload}`. ### P1 — mis-classified / opaque - **#3 `advertRole()` raw-type fix.** Per `firmware/src/helpers/AdvertDataHelpers.h:7-12`, ADV_TYPE_NONE = 0 and 5-15 are FUTURE. The previous boolean fallback collapsed both into `"companion"`, silently relabelling unknown/reserved types. New behaviour: type 0 → `none`, 1 → `companion`, 2-4 → `repeater`/`room`/`sensor`, 5-15 → `type-N`. `ValidateAdvert` accepts the new labels. - **#4 CONTROL (0x0B) byte0 flags + length.** Per `firmware/src/Mesh.cpp:69` + `createControlData` at `Mesh.cpp:609`, byte0 high-bit marks the zero-hop direct subset. Surfaces `{ctrlFlags, ctrlZeroHop, ctrlLength}`. ### Drift fix - `cmd/server/store.go` `payloadTypeNames` now includes `6: GRP_DATA` and `10: MULTIPART` (previously omitted; canonical decoder map already had them). ## Lockstep & TDD Both `cmd/ingestor/decoder.go` and `cmd/server/decoder.go` updated in the same commits — same wire-vector tests live in both packages (`cmd/{ingestor,server}/issue1279_test.go`). Per-item RED→GREEN visible in `git log`. | Item | Tests | RED proof | |---|---|---| | #1 GRP_DATA | ingestor: NoKey + DecryptedInner; server: Envelope | 6 assertions failed pre-impl | | #2 MULTIPART | ingestor + server: Ack + NonAck | 8 assertions failed pre-impl | | #3 advertRole | ingestor + server: 7-row table | 3 assertions failed pre-impl | | #4 CONTROL | ingestor + server: ZeroHop + MultiHop | 6 assertions failed pre-impl | ## What's NOT in this PR The umbrella issue lists P2 items that ship in follow-up PRs: - Live + compare legend entries for the long tail of newly-named types (Kpa-clawbot#1274 + others). - TransportCodes UI surface + filter grammar. - feat1/feat2 capability badges. - `payloadTypeNames` consolidation across server/ingestor (drift-prevention). Leave the umbrella open after this merges. Refs Kpa-clawbot#1279 --------- Co-authored-by: OpenClaw Bot <bot@openclaw.local>

efiten force-pushed the fix/grouped-packets-perf branch from 38e1e38 to f80974b Compare March 31, 2026 14:19

efiten and others added 2 commits April 1, 2026 14:58

efiten force-pushed the fix/grouped-packets-perf branch from f80974b to 3bf1df5 Compare April 1, 2026 13:01

efiten force-pushed the master branch 2 times, most recently from 41bf3b9 to f0fc940 Compare April 1, 2026 21:20

efiten force-pushed the master branch 2 times, most recently from 493d9e8 to 568de4b Compare May 1, 2026 22:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: fix slow /packets?groupByHash=true on large stores#2

perf: fix slow /packets?groupByHash=true on large stores#2
efiten wants to merge 3 commits into
masterfrom
fix/grouped-packets-perf

efiten commented Mar 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

efiten commented Mar 31, 2026

Summary

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant