1.2.0 - 2026-06-09
Dashboard performance overhaul plus capability-focused security hardening. Cold and warm dashboard loads drop from seconds to sub-second on large services; sustained concurrent load no longer wedges the backend. Read-path I/O is structurally cut by a per-service DuckDB connection pool, a per-minute time-series rollup bundle, size-capped bin-packing local compaction, composite endpoints that collapse multi-card admin pages into one request, and a frontend pre-warm / hover-prefetch pattern that makes navigation feel instant. Security hardening tightens cross-tenant boundaries, closes a ContextVar propagation hole in the s3fs proxy hook, removes a secret-in-URL leak on downloads, and adds strict validation across the destructive-op surface.
Performance
Structural:
- Per-minute time-series rollup bundle (`backend/core/rollups.py`) precomputes an hour-bundled per-minute aggregate for the dashboard chart, eliminating the wide Iceberg scan on chart render. It is generated alongside the existing Top-N rollups.
- Per-day compaction tier for rollups: closed days are compacted into per-day Parquet files; the reader prefers the per-day file and falls back to hourly files only for the current day, cutting file-handle pressure on long-running services.
- Size-capped bin-packing local compaction (backend/core/local_compaction.py) replaces single-file daily/weekly rollups with sequential bin-packing capped at `_MAX_PARTITION_BYTES` (default 256 MB). Hourly partitions older than 7 days bin-pack into daily files; daily files older than 30 days bin-pack into weekly files. DuckDB query parallelism is preserved on multi-month services where the prior single-file approach degraded to scan-of-one-huge-file.
- DuckDB connection-pool tuning knobs: `DUCKDB_POOL_CONN_MEMORY_LIMIT` and `DUCKDB_POOL_CONN_THREADS` env vars cap per-pool-connection memory and thread count so 8 concurrent queries don't oversubscribe physical cores or balloon RSS. Pool view-binding moved outside the `Condition` lock to eliminate a deadlock under stale-Iceberg-snapshot reload.
- Composite read endpoints collapse multi-card mounts into single requests:
  - `POST /api/scoring/dashboard` (8 per-card requests → 1)
  - `GET /api/scoring/analytics` and `GET /api/scoring/config`
  - `GET /api/network-health` now includes shielding analysis
  - `POST /api/origin/aggregates` (new) batches the origin page's per-card queries
Per-card endpoints stay mounted for back-compat; the frontend opts into composite where it makes sense.
- Parquet ingest sort key changed to `(timestamp, ip)` so sessions queries can stream-merge on `ip` instead of materialising a temp table, a ~2× speedup on sessions dashboards.
- `ingested_files.file_date` column + `(source_name, file_date)` index added via numbered SQLite migration. The log-accounting fast path uses the index to bucket by day without scanning every row; `metadata_db.get_node_count_avg` and `get_log_accounting_counts` split on it.
- Iceberg commit hygiene: buffer files are tombstoned and removed on the next pass instead of unlinked inline at commit time, removing a commit-path stall. `optimize_table` adds `union_by_name` + retry-on-CAS-conflict to silence the nightly schema-evolution warning.
- Bootstrap stale-while-revalidate: `/api/bootstrap` returns cached dir-stats immediately and refreshes in the background; views are folded into the response so the admin page doesn't issue a follow-up.
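The size-capped sequential bin-packing used by local compaction can be sketched roughly as below. This is an illustration only, not the code in backend/core/local_compaction.py: the `plan_bins` function and its `(name, size)` input are hypothetical, and the cap simply mirrors the stated 256 MB default of `_MAX_PARTITION_BYTES`.

```python
from typing import Iterable

# Mirrors the documented default cap; this constant name is illustrative.
MAX_PARTITION_BYTES = 256 * 1024 * 1024


def plan_bins(
    files: Iterable[tuple[str, int]],
    max_bytes: int = MAX_PARTITION_BYTES,
) -> list[list[str]]:
    """Group files, in input order, into bins capped at max_bytes.

    Sequential (next-fit) packing: a file joins the current bin if it fits,
    otherwise the current bin is closed and a new one is started. A single
    file larger than the cap ends up in a bin of its own.
    """
    bins: list[list[str]] = []
    current: list[str] = []
    current_bytes = 0
    for name, size in files:
        if current and current_bytes + size > max_bytes:
            bins.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        bins.append(current)
    return bins
```

Because files are packed in order, each output bin covers a contiguous time range, and several moderately sized outputs leave DuckDB something to scan in parallel.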
Tuning:
- Dashboard live-hour TEMP TABLE shared across CTEs; Python-side bot match + memoised `ngwaf_top` cut DuckDB round-trips.
- Insights coalesce four city/region/country queries into one and four URL-keyed insights into one CTE (Option C pattern).
- Sessions split the monolithic CTE into measurable stages and eliminate the temp-table materialisation on the hot path.
- Origin summary combines two sequential scans into one via `GROUPING SETS`.
- Cron-runs `since_id` delta-poll param + frontend wiring on `/logs recentCrons` so the page only fetches new events.
- Admin usage-log visibility-gates its 30s tick and rewrites the latest-per-task SQL to skip the full join.
- Admin shielding banner endpoint trimmed; share-status `staleTime` tightened.
- Bot-source cache: 60s TTL on the recursive cache-dir `scandir` (was 200–1500 ms per `/api/bootstrap`).
- React-Query: skip 4xx retries; hooks lifted out of insights / ReportLayout render-props so each page mount re-uses one query instance instead of re-mounting on every parent render.
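The 60 s TTL on the bot-source directory scan follows a common pattern, sketched below. The `TTLCached` class is hypothetical and not taken from the codebase; the clock is injectable only so the expiry behaviour is easy to exercise, and the sketch is not thread-safe.

```python
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class TTLCached(Generic[T]):
    """Cache the result of an expensive zero-argument call for ttl seconds."""

    def __init__(
        self,
        compute: Callable[[], T],
        ttl: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._compute = compute
        self._ttl = ttl
        self._clock = clock
        self._value: Optional[T] = None
        self._expires_at: Optional[float] = None

    def get(self) -> T:
        now = self._clock()
        # Recompute on first use or once the entry has expired.
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._compute()
            self._expires_at = now + self._ttl
        return self._value  # type: ignore[return-value]
```

Wrapping the recursive scan in something like `TTLCached(scan_cache_dir, ttl=60.0)` (where `scan_cache_dir` is a placeholder name) means only the first request in each minute pays for the walk.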
Frontend:
- `starlette-compress` replaces `GZipMiddleware`: backend now negotiates `br` / `zstd` / `gzip` (was gzip-only). Modern browsers get brotli; rendered-text payloads drop ~25 % on the wire.
- Keep-alive on Next.js http/undici global agents so the proxy reuses TCP connections to the FastAPI backend instead of new-handshake-per-request.
- Pre-warm + lazy-mount pattern: plotly + maplibre-gl + `world.geojson` are pre-warmed on `AppLayout` mount via hidden one-point charts; the visible chart hydrates from the warm module cache instead of triggering a fresh import on first render. `LazyMount` + `PlotlyChart` start `visible=false` to avoid the hydration-mismatch warning that came with the prior eager-mount pattern.
- Hover-prefetch sidebar links so the destination's data warms before the click commits.
- Per-insight skeleton cards on first paint; full skeleton rendered from `CARD_CATEGORIES` on the dashboard.
- Modulepreload for the plotly chunk via a build-time-generated preload manifest (`scripts/build-preload-manifest.mjs` + `lib/preload-manifest.ts`); restores plotly's preload without re-introducing the nav-lag the first attempt caused.
- Drop `force-dynamic` on routes that don't need it; root layout opts out of build-time SSG so the preload manifest is read at request time.
- `/geo/*` static assets cached aggressively; `PlotlyChart` dynamic-import on `/network`.
- `SystemHealthCard` polling moved to 1 s for live attack/load feedback now that the endpoint is cheap.
- `useNowMs` reuse: multiple visible-tick components (countdowns, "X seconds ago") share one interval.
- Map style-data listener replaces a 100 ms `setTimeout` poll.
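For readers unfamiliar with content-coding negotiation, the `br` / `zstd` / `gzip` selection can be pictured with the simplified sketch below. It is not how `starlette-compress` is implemented: `pick_encoding` is a hypothetical helper that applies a fixed server preference order and only checks whether the client accepts each coding (a `q=0` weight counts as a refusal).

```python
from typing import Optional


def pick_encoding(
    accept_encoding: str,
    preferred: tuple[str, ...] = ("br", "zstd", "gzip"),
) -> Optional[str]:
    """Return the first server-preferred coding the client accepts, else None."""
    accepted: dict[str, float] = {}
    for part in accept_encoding.split(","):
        token, _, params = part.strip().partition(";")
        token = token.strip().lower()
        if not token:
            continue
        q = 1.0
        params = params.strip()
        if params.lower().startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # treat a malformed weight as "not acceptable"
        accepted[token] = q
    for coding in preferred:
        # An explicit entry wins over the "*" wildcard; q=0 means refused.
        q = accepted.get(coding, accepted.get("*", 0.0))
        if q > 0:
            return coding
    return None
```

With a typical browser header such as `gzip, deflate, br, zstd`, this picks `br`; a client that only sends `gzip` still gets gzip.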
Reliability
- Multi-worker login loop fixed: `tunnel.py` now rehydrates a share session on-demand from SQLite when an in-memory cache miss happens on a different uvicorn worker. Previously, login on worker A would loop because worker B couldn't see the freshly-minted session.
- DuckDB lock conflict resolved between the connection pool and cron writes: `get_connection` forces `read_only=False` so pool readers and cron writers no longer trip DuckDB's "different configuration" error on the same file.
- Stale-view self-heal: `QueryRunner` clears `_view_cache` before the `force=True` rebuild on the post-empty recovery path so the next query doesn't see the stale schema.
- Iceberg s3fs proxy hook falls back to the process-global source so the hook always registers, even when the ContextVar is empty (e.g. cold-start LIST before any `_get_catalog` has fired).
- Top-N current-hour merge: a silent `ImportError` was dropping the current-hour merge; restored with an explicit fail-loud import.
- Rollup compaction: `run_id` threaded through the error branch and the compaction step now uses an in-memory DuckDB so a corrupted on-disk catalog can't wedge the cron.
- Dashboard response cache: write to `is_cached` (not the aliased `_is_cached`) so Pydantic doesn't drop the flag on serialise.
- Dashboard cache hit rate: disabled the 30 s response-level cache that was masking the rollup wins for fast-changing queries.
- Usage-log rollup drift: reconcile cycle changed from DELETE+INSERT to UPSERT so concurrent flushes can't lose rows.
- Botnet insight investigate link filters only the queried column, not all of them.
- `expire_snapshots` updated for pyiceberg 0.11.1 API and now emits `cron_runs` telemetry.
- Proxy compatibility: switched from `middleware.ts` to `proxy.ts` for Next.js 16; restored the Caddy-marker middleware that the upgrade broke.
- Telemetry response middleware backstop (backend/utils/telemetry_response_middleware.py) auto-injects `_debug_queries` / `_debug_calls` / `_is_cached` into JSON-dict responses that bypassed `BaseResponse.with_telemetry`, so newly-added endpoints don't silently blank the Debug Panel.
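The DELETE+INSERT to UPSERT change in the usage-log reconcile cycle can be illustrated with a small standard-library sketch. Everything named here is an assumption made for the example: the `usage_rollup` table, its columns, the `reconcile` helper, and the choice of SQLite as the store (the `ON CONFLICT` clause needs SQLite 3.24 or newer).

```python
import sqlite3


def reconcile(conn: sqlite3.Connection, rows: list[tuple[str, str, int]]) -> None:
    """Write recomputed (task, day, total) rows with one UPSERT per row.

    Existing rows are updated in place instead of being deleted and
    re-inserted, so there is no moment at which a row is missing.
    """
    with conn:  # single transaction: commit on success, roll back on error
        conn.executemany(
            """
            INSERT INTO usage_rollup (task, day, total)
            VALUES (?, ?, ?)
            ON CONFLICT (task, day) DO UPDATE SET total = excluded.total
            """,
            rows,
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE usage_rollup ("
    "task TEXT NOT NULL, day TEXT NOT NULL, total INTEGER NOT NULL, "
    "PRIMARY KEY (task, day))"
)
reconcile(conn, [("ingest", "2026-06-01", 10)])
reconcile(conn, [("ingest", "2026-06-01", 12), ("compact", "2026-06-01", 3)])
```

After the second call the `ingest` row holds the newer total and the `compact` row has been added, without any intermediate delete.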
Security
Capability-focused hardening across the backend and frontend trust boundaries.
- Cross-tenant ContextVar leak in the s3fs proxy hook closed. PyIceberg writes parquet via a `ThreadPoolExecutor`; ContextVars don't propagate to executor workers by default, so the prior fix used an endpoint-keyed global registry that was vulnerable to overwrite when two tenants shared an endpoint URL. Replaced with a global `ThreadPoolExecutor.submit` monkeypatch that wraps the callable in `contextvars.copy_context()`, mirroring how `asyncio.to_thread` copies the caller's context (plain `loop.run_in_executor` does not). Documented in MONKEYPATCHES.md §6.
- Path-param service-scope desync: analyst sessions could supply a `service_id` path param that didn't match their session scope on a handful of mutation endpoints. Centralised the check via a router-utils helper invoked on every scoped route.
- Secret-in-URL leak on downloads: the download endpoint previously embedded the shared CDN secret in the redirect URL where it could land in browser history / referrer headers. Switched to a signed short-lived bearer that's stripped before the redirect.
- Strict input validation on the destructive-op surface: provision teardown, NGWAF workspace mutations, and scoring threshold + enforce-status-code + recv-exclusion-regex changes all run through length caps, character allowlists, and (where applicable) `falco` static analysis before any VCL ships.
- CSRF gates: moved GET→POST on `logging-settings/update` and sibling state-changing endpoints that were addressable via GET.
- Authorisation tightening: share-admin endpoints reject the Caddy-marker header from non-Caddy paths; `claim_token` path consolidated under a single atomic UPDATE so concurrent claims can't both succeed.
- Cross-tenant cache audit: re-verified that every per-tenant cache key includes `service_id`; closed two missing entries on insights and origin paths.
- Thread leak fix: the share-login flow was leaking a daemon thread per failed login on multi-worker setups; the new on-demand SQLite rehydration replaces the thread entirely.
- Terms-of-service bypass: share-login `/acknowledge` now fetches the active TOS version and refuses acknowledgement of a stale one; frontend was sending a hardcoded version.
- Telemetry-proxy diagnostics for silent 400s (`Missing X-Fos-Target`) and unclassified `list_objects_v2` calls; preserve `Content-Type` so downstream compression always fires; preserve multi-valued response headers.
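The idea behind the ContextVar fix, running each submitted callable inside a copy of the submitter's context, can be sketched as follows. To keep the example self-contained it uses a subclass rather than the global `ThreadPoolExecutor.submit` monkeypatch described above; `ContextPropagatingExecutor`, `current_tenant`, and `read_tenant` are hypothetical names, not identifiers from the codebase.

```python
import contextvars
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable

# Hypothetical per-request tenant marker.
current_tenant: contextvars.ContextVar[str] = contextvars.ContextVar(
    "current_tenant", default="<unset>"
)


class ContextPropagatingExecutor(ThreadPoolExecutor):
    """ThreadPoolExecutor whose tasks run in a copy of the submitter's context."""

    def submit(self, fn: Callable[..., Any], /, *args: Any, **kwargs: Any) -> Future:
        # Snapshot the caller's context at submit time, then run fn inside
        # that snapshot on the worker thread.
        ctx = contextvars.copy_context()
        return super().submit(ctx.run, fn, *args, **kwargs)


def read_tenant() -> str:
    return current_tenant.get()


current_tenant.set("tenant-a")
with ContextPropagatingExecutor(max_workers=2) as pool:
    seen = pool.submit(read_tenant).result()
```

Each submission gets its own context copy, so a value set by one tenant's request is visible to that request's worker tasks without being shared through a process-wide registry.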
Tests
- 3500+ backend tests (+450).
- 290+ frontend vitest tests (+25).
- New coverage: `tests/core/test_duckdb_pool.py`, `test_local_compaction.py`, `test_rollups_compaction.py`, `test_rollups_hour_bundling.py`, `test_iceberg_helpers.py`, `tests/services/test_service_manager.py`, `tests/utils/test_sql_validator.py`, `test_telemetry_response_middleware.py`, `test_router_utils.py`, `test_state_sync.py`, `test_terraform_gen.py`, plus router coverage for the new composite endpoints and the destructive-op-auth surface.
- `make ci` green: lint + format + mypy + pytest + vcl-test + verify-deps + typecheck-frontend + test-frontend + osv + secret-scan.
Infrastructure
- Synthetic load generator (scripts/loadtest_generator.py) and read-path probe (scripts/dev/loadtest_probe.sh) for reproducible perf measurement against local Parquet+Iceberg.
- Two-pass next build in the frontend Dockerfile so SSG sees the correct plotly chunk hashes; preload-manifest scanner runs after `next build` to capture them.
Documentation
- `AGENTS.md`: added Key Systems entries for the DuckDB connection pool, the hourly Top-N rollup pipeline, and the response telemetry middleware. Updated the local-compaction section to reflect the bin-packing tiers.
- `MONKEYPATCHES.md`: documents the new `ThreadPoolExecutor.submit` patch.