Skip to content

perf: fix slow /packets?groupByHash=true on large stores#2

Open
efiten wants to merge 3 commits into
masterfrom
fix/grouped-packets-perf
Open

perf: fix slow /packets?groupByHash=true on large stores#2
efiten wants to merge 3 commits into
masterfrom
fix/grouped-packets-perf

Conversation

@efiten
Copy link
Copy Markdown
Owner

@efiten efiten commented Mar 31, 2026

Summary

  • Adds LatestSeen field to StoreTx, maintained in all three observation write paths (load, real-time ingest, poll) — eliminates the O(n×observations) scan that was happening per packet at query time
  • Builds grouped packet maps under read lock (correct), sorts the local copy outside the lock — avoids holding the lock during O(n log n) sort
  • Caches the full sorted result for 3 seconds keyed by filter params — repeated requests within TTL return instantly without re-sorting

Fixes /packets?limit=50000&groupByHash=true taking 16s on large stores.

Test plan

  • Deploy to staging: ./deploy-staging.sh fix/grouped-packets-perf
  • Open packets page and confirm load time is well under 1s
  • Check [SLOW API] log in browser console is gone
  • Verify packet data is correct (hashes, counts, observer counts)

🤖 Generated with Claude Code

@efiten efiten force-pushed the fix/grouped-packets-perf branch from 38e1e38 to f80974b Compare March 31, 2026 14:19
efiten and others added 2 commits April 1, 2026 14:58
- Add LatestSeen field to StoreTx, maintained in all three observation
  write paths (load, real-time ingest, poll). Eliminates the per-packet
  observation scan that was O(total_packets * avg_observations).
- Build grouped packet maps under read lock (correct), sort the local
  copy outside the lock (avoids holding lock during O(n log n) sort).
- Cache the full sorted result for 3 seconds keyed by filter params.
  Repeated requests within the TTL return instantly without re-sorting.

Fixes /packets?limit=50000&groupByHash=true taking 16s on large stores.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GetChannels was iterating all payload-type-5 packets and JSON-unmarshaling
each one while holding s.mu.RLock(), blocking all concurrent reads.
On stores with many channel messages this caused /api/channels to take 13s+.

- Copy only the needed fields under the read lock, release before unmarshal
- Cache the result for 15 seconds keyed by region param
- Invalidate cache on new packet ingestion

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@efiten efiten force-pushed the fix/grouped-packets-perf branch from f80974b to 3bf1df5 Compare April 1, 2026 13:01
…channel cache

- TestLatestSeenMaintained: verifies StoreTx.LatestSeen is set >= FirstSeen
  and >= all observation timestamps after store load
- TestQueryGroupedPacketsSortedByLatest: verifies packets with more-recent
  observations sort before packets with newer first_seen but older observations
- TestQueryGroupedPacketsCacheReturnsConsistentResult: verifies cache returns
  consistent total and ordering on back-to-back calls
- TestGetChannelsCacheReturnsConsistentResult: verifies GetChannels cache
  returns same channel names on repeated calls
- TestGetChannelsNotBlockedByLargeLock: verifies GetChannels returns correct
  data (channel name, messageCount) after lock-copy-unmarshal refactor

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@efiten efiten force-pushed the master branch 2 times, most recently from 41bf3b9 to f0fc940 Compare April 1, 2026 21:20
@efiten efiten force-pushed the master branch 2 times, most recently from 493d9e8 to 568de4b Compare May 1, 2026 22:16
efiten pushed a commit that referenced this pull request May 17, 2026
…event silent reconnect-loop death (Kpa-clawbot#1216)

RED commit: `1cd25f7b` — CI (failing on assertion):
https://github.com/Kpa-clawbot/CoreScope/actions?query=sha%3A1cd25f7b1bdd0091f689dd64ce1bfec6d031191f

Fixes Kpa-clawbot#1212

## Root cause

NOT that `AutoReconnect` was off — it was set;
`MaxReconnectInterval=30s` was set (PR Kpa-clawbot#949); a `SetReconnectingHandler`
was wired. The defect was an **observability gap**:

`SetReconnectingHandler` fires only INSIDE paho's reconnect goroutine.
If that goroutine never iterates (status race after the recovered
handler panic at 21:07:13, or an internal abort), operators see ONLY the
`disconnected: pingresp not received` line and then total silence. They
cannot distinguish "paho is patiently retrying" from "paho gave up and
the goroutine is gone." That ambiguity is what turned a 30s blip into 6h
of downtime.

## Changes

### `cmd/ingestor/main.go` — `SetConnectionAttemptHandler`
Fires on every TCP/TLS dial — the initial `Connect()` AND every
reconnect — independent of paho's internal reconnect-loop state. Logs:

```
MQTT [staging] connection attempt #1 to tcp://broker:1883
MQTT [staging] connection attempt #2 to tcp://broker:1883
```

Per-source attempt counter via `atomic.AddInt64`.

### `cmd/ingestor/mqtt_watchdog.go` (new) — per-source stall watchdog
Satisfies the watchdog acceptance criterion. Even when paho reports
`connected`, if no MQTT messages have flowed for >5m, log a WARN line
every 60s:

```
MQTT [staging] WATCHDOG: client reports connected to tcp://broker:1883 but no messages received for 7m30s (threshold 5m) — possible half-open socket or upstream stall
```

Catches half-open TCP and broker-accepted-but-not-forwarding scenarios
that look "connected" to paho.

Hot-path cost: one `atomic.StoreInt64` per inbound message. Watchdog
scans the registry once a minute.

### Tests (`cmd/ingestor/mqtt_reconnect_test.go`, new)
- `TestBuildMQTTOpts_InstrumentsConnectionAttempt` — asserts
`OnConnectAttempt` is wired in `buildMQTTOpts`.
- `TestMQTTStallWatchdog_FiresOnSilentSource` — connected + 10m silent +
5m threshold → stall flagged.
- `TestMQTTStallWatchdog_QuietWhenRecent` — recent message → no stall.
- `TestMQTTStallWatchdog_QuietWhenDisconnected` — disconnected → no
stall (paho's reconnect logging covers it).

## TDD
- RED `1cd25f7b` — 2 assertion failures (compile OK, stub returns
no-stall, `OnConnectAttempt` nil).
- GREEN `2527be6f` — implementation; all ingestor tests pass.

## Out of scope
- Slice-bounds decode panic (Kpa-clawbot#1211, separate PR).
- A full in-process MQTT broker integration test would require a new dep
(mochi-mqtt) — the observability and watchdog behaviors are
independently verifiable by the unit tests above, and the reconnect path
itself is paho's responsibility (we already test it's configured via
`mqtt_opts_test.go`).

---------

Co-authored-by: bot <bot@example.com>
Co-authored-by: OpenClaw Bot <bot@openclaw.local>
Co-authored-by: corescope-bot <bot@corescope.local>
Co-authored-by: openclaw-bot <openclaw-bot@users.noreply.github.com>
efiten pushed a commit that referenced this pull request May 17, 2026
… <500ms (Kpa-clawbot#1226)

## Summary
Fixes Kpa-clawbot#1225 — channel messages endpoint took ~30s on staging.

## Root cause
`(*DB).GetChannelMessages` SELECTed every observation row for the
channel (one row per observation, not per transmission),
JSON-unmarshalled each row into a Go map, dedupe-folded by `(sender,
packetHash)`, then sliced the tail in Go for pagination.

On staging `#wardriving`:
- `transmissions` rows with `channel_hash='#wardriving' AND
payload_type=5`: **5,703**
- `observations` joined to those: **274,632** (~48× amplification)
- `time curl /api/channels/%23wardriving/messages?limit=50`: **30.04s /
31.41s / 31.48s / 35.33s / 34.05s** (5 calls before I killed the loop)

`EXPLAIN QUERY PLAN` showed the index `idx_tx_channel_hash` was being
used — the cost was entirely in fetching, unmarshalling, and folding the
full observation set per request even for `limit=50`.

Hypothesis #1 from the issue (full table scan on `messages/decoded`) is
rejected; #2 (missing index) is rejected; the actual cause was
**pagination in Go instead of SQL** — request cost was O(observations)
not O(limit).

## Fix
Move pagination into SQL on the `transmissions` table. Because
`transmissions.hash` is `UNIQUE` and the original dedup key was
`(sender, hash)`, each transmission collapses to exactly one logical
message — paginating on transmissions is semantically equivalent to the
prior in-Go dedup + tail slice.

New shape:
1. `COUNT(*)` on transmissions for total (uses `idx_tx_channel_hash`).
2. `SELECT id FROM transmissions … ORDER BY first_seen DESC LIMIT ?
OFFSET ?` to pick the page of newest transmissions.
3. `SELECT … FROM observations WHERE transmission_id IN (…page ids…)` —
typically 50 ids → a few hundred observation rows.
4. Reassemble in pageIDs order, preserving the ASC-by-`first_seen` API
contract.

Region filtering, observation-count-as-`repeats`, and "first observation
wins for hops/snr/observer" semantics are preserved (observations are
scanned `ORDER BY o.id ASC`).

## Perf measurements
**Before** (staging `#wardriving`, limit=50, 5 samples killed mid-loop):
30.04s, 31.41s, 31.48s, 35.33s, 34.05s.
**Synthetic regression test**
(`TestGetChannelMessagesPerfLargeChannel`): 3000 tx × 50 obs.
- Broken impl: ~4.5s (test fails the 500ms budget — the RED commit).
- Fixed impl: well under 500ms (test passes).
**After (staging)**: will measure post-deploy and post-comment on issue
with numbers. Synthetic scaling: staging is ~2× the test's transmission
count, fixed-path cost scales with `limit` (50) + `COUNT(*)` (~5k rows
on index) — expect <100ms p99.

## TDD
- RED: `697c290d` — perf test asserts <500ms on 3k×50 dataset; fails at
~4.5s.
- GREEN: `3f1f82d3` — fix; full suite green, perf test passes.

## Hypotheses status
| # | Hypothesis | Verdict |
|---|---|---|
| 1 | Endpoint slow on prod-sized data | **CONFIRMED** (different
mechanism — see root cause) |
| 2 | Missing channel_hash index | Rejected (`idx_tx_channel_hash`
exists & used) |
| 3 | Frontend re-render storm | Not investigated (backend was clearly
the bottleneck) |
| 4 | Decode in request path | Rejected (decode is at ingest time; JSON
unmarshal of cached `decoded_json` is the cost, addressed by reducing
row count) |
| 5 | WS subscription failure | Rejected |
| 6 | Staging artifact | Rejected (reproducible) |

## Out of scope
- The in-memory `(*PacketStore).GetChannelMessages` path (used when
`s.db == nil`) has the same shape but operates on bounded in-memory
data; not touched. If we ever fall back to it in production we'll
revisit.

---------

Co-authored-by: clawbot <bot@corescope>
efiten pushed a commit that referenced this pull request May 19, 2026
…pa-clawbot#1279 P0+P1) (Kpa-clawbot#1280)

Addresses the four P0+P1 firmware reconciliation gaps from the umbrella
audit (issue Kpa-clawbot#1279). RED commit: `0a4c084e` (asserts on stub returns;
all 13 assertions fail). GREEN commit: `13867681`.

## What's in this PR

### P0 — silently dropped data

- **#1 GRP_DATA (0x06) decoder.** Outer envelope is the same shape as
GRP_TXT (`channel_hash(1)+MAC(2)+ciphertext`) per
`firmware/src/helpers/BaseChatMesh.cpp:476,500`. Factored
`decryptChannelBlock(...)` helper used by both 5 and 6. When a channel
key matches, the inner is parsed per
`firmware/src/helpers/BaseChatMesh.cpp:382-385` as `data_type(uint16 LE)
+ data_len(1) + blob(data_len)`. Surfaces `{channelHash, MAC, dataType,
dataLen, decryptedBlob}` on decrypt or `{channelHash, MAC,
encryptedData}` otherwise. Server-side decoder surfaces envelope only
(no key store).
- **#2 MULTIPART (0x0A) decoder.** Per `firmware/src/Mesh.cpp:289`,
byte0 = `(remaining<<4) | inner_type`. When `inner_type ==
PAYLOAD_TYPE_ACK (0x03)`, next 4 bytes are the LE ack_crc per
`firmware/src/Mesh.cpp:292-307`. Surfaces `{remaining, innerType,
innerTypeName, innerAckCrc | innerPayload}`.

### P1 — mis-classified / opaque

- **#3 `advertRole()` raw-type fix.** Per
`firmware/src/helpers/AdvertDataHelpers.h:7-12`, ADV_TYPE_NONE = 0 and
5-15 are FUTURE. The previous boolean fallback collapsed both into
`"companion"`, silently relabelling unknown/reserved types. New
behaviour: type 0 → `none`, 1 → `companion`, 2-4 →
`repeater`/`room`/`sensor`, 5-15 → `type-N`. `ValidateAdvert` accepts
the new labels.
- **#4 CONTROL (0x0B) byte0 flags + length.** Per
`firmware/src/Mesh.cpp:69` + `createControlData` at `Mesh.cpp:609`,
byte0 high-bit marks the zero-hop direct subset. Surfaces `{ctrlFlags,
ctrlZeroHop, ctrlLength}`.

### Drift fix

- `cmd/server/store.go` `payloadTypeNames` now includes `6: GRP_DATA`
and `10: MULTIPART` (previously omitted; canonical decoder map already
had them).

## Lockstep & TDD

Both `cmd/ingestor/decoder.go` and `cmd/server/decoder.go` updated in
the same commits — same wire-vector tests live in both packages
(`cmd/{ingestor,server}/issue1279_test.go`). Per-item RED→GREEN visible
in `git log`.

| Item | Tests | RED proof |
|---|---|---|
| #1 GRP_DATA | ingestor: NoKey + DecryptedInner; server: Envelope | 6
assertions failed pre-impl |
| #2 MULTIPART | ingestor + server: Ack + NonAck | 8 assertions failed
pre-impl |
| #3 advertRole | ingestor + server: 7-row table | 3 assertions failed
pre-impl |
| #4 CONTROL | ingestor + server: ZeroHop + MultiHop | 6 assertions
failed pre-impl |

## What's NOT in this PR

The umbrella issue lists P2 items that ship in follow-up PRs:

- Live + compare legend entries for the long tail of newly-named types
(Kpa-clawbot#1274 + others).
- TransportCodes UI surface + filter grammar.
- feat1/feat2 capability badges.
- `payloadTypeNames` consolidation across server/ingestor
(drift-prevention).

Leave the umbrella open after this merges.

Refs Kpa-clawbot#1279

---------

Co-authored-by: OpenClaw Bot <bot@openclaw.local>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant