Releases: fastly/fastly-log-analytics
v1.2.0 — dashboard performance overhaul + security hardening
[1.2.0] - 2026-06-09
Dashboard performance overhaul plus capability-focused security hardening. Cold and warm dashboard loads drop from seconds to sub-second on large services; sustained concurrent load no longer wedges the backend. Read-path I/O is structurally cut by a per-service DuckDB connection pool, a per-minute time-series rollup bundle, size-capped bin-packing local compaction, composite endpoints that collapse multi-card admin pages into one request, and a frontend pre-warm / hover-prefetch pattern that makes navigation feel instant. Security hardening tightens cross-tenant boundaries, closes a ContextVar propagation hole in the s3fs proxy hook, removes a secret-in-URL leak on downloads, and adds strict validation across the destructive-op surface.
Performance
Structural:
- Per-minute time-series rollup bundle (`backend/core/rollups.py`) precomputes an hour-bundled per-minute aggregate for the dashboard chart, eliminating the wide Iceberg scan on chart render. Generated alongside the existing Top-N rollups.
- Per-day compaction tier for rollups: closed days are compacted into per-day parquet files; the reader prefers the per-day file and falls back to hourly only for the current day, cutting file-handle pressure on long-running services.
- Size-capped bin-packing local compaction (`backend/core/local_compaction.py`) replaces single-file daily/weekly rollups with sequential bin-packing capped at `_MAX_PARTITION_BYTES` (default 256 MB). Hourly partitions older than 7 days bin-pack into daily files; daily files older than 30 days bin-pack into weekly files. DuckDB query parallelism is preserved on multi-month services where the prior single-file approach degraded to scan-of-one-huge-file.
- DuckDB connection-pool tuning knobs: the `DUCKDB_POOL_CONN_MEMORY_LIMIT` and `DUCKDB_POOL_CONN_THREADS` env vars cap per-pool-connection memory and thread count so 8 concurrent queries don't oversubscribe physical cores or balloon RSS. Pool view-binding moved outside the `Condition` lock to eliminate a deadlock under stale-Iceberg-snapshot reload.
- Composite read endpoints collapse multi-card mounts into single requests:
  - `POST /api/scoring/dashboard` (8 per-card requests → 1)
  - `GET /api/scoring/analytics` and `GET /api/scoring/config`
  - `GET /api/network-health` now includes shielding analysis
  - `POST /api/origin/aggregates` (new) batches the origin page's per-card queries

  Per-card endpoints stay mounted for back-compat; the frontend opts into composite where it makes sense.
- Parquet ingest sort key changed to `(timestamp, ip)` so sessions queries can stream-merge on `ip` instead of materialising a temp table, a ~2× speedup on sessions dashboards.
- `ingested_files.file_date` column + `(source_name, file_date)` index added via numbered SQLite migration. The log-accounting fast path uses the index to bucket by day without scanning every row; `metadata_db.get_node_count_avg` and `get_log_accounting_counts` split on it.
- Iceberg commit hygiene: buffer files are tombstoned and removed on the next pass instead of unlinked inline at commit time, removing a commit-path stall. `optimize_table` adds `union_by_name` + retry-on-CAS-conflict to silence the nightly schema-evolution warning.
- Bootstrap stale-while-revalidate: `/api/bootstrap` returns cached dir-stats immediately and refreshes in the background; views are folded into the response so the admin page doesn't issue a follow-up.
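For readers unfamiliar with size-capped sequential bin-packing, the following is a minimal sketch of the idea, not the code in `backend/core/local_compaction.py`. The names `Bin`, `bin_pack_sequential`, and `MAX_PARTITION_BYTES` are hypothetical; only the 256 MB default comes from the notes above. Partitions are walked in time order and a new output file is started whenever adding the next partition would exceed the cap.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumption: mirrors the documented 256 MB default of _MAX_PARTITION_BYTES.
MAX_PARTITION_BYTES = 256 * 1024 * 1024


@dataclass
class Bin:
    """One compacted output file: the inputs it absorbs and their combined size."""
    files: List[str] = field(default_factory=list)
    total_bytes: int = 0


def bin_pack_sequential(
    partitions: List[Tuple[str, int]],
    max_bytes: int = MAX_PARTITION_BYTES,
) -> List[Bin]:
    """Pack (name, size_bytes) pairs, given in time order, into size-capped bins.

    Order is preserved, so each bin covers a contiguous time range. A single
    partition larger than max_bytes still gets a bin of its own.
    """
    bins: List[Bin] = []
    current = Bin()
    for name, size in partitions:
        # Close the current bin if this partition would push it over the cap.
        if current.files and current.total_bytes + size > max_bytes:
            bins.append(current)
            current = Bin()
        current.files.append(name)
        current.total_bytes += size
    if current.files:
        bins.append(current)
    return bins
```

Because the output is several capped files rather than one large file, a query engine can still scan them in parallel, which is the property the compaction change is described as preserving.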
Tuning:
- Dashboard live-hour TEMP TABLE shared across CTEs; Python-side bot match + memoised `ngwaf_top` cut DuckDB round-trips.
- Insights coalesce four city/region/country queries into one and four URL-keyed insights into one CTE (Option C pattern).
- Sessions split the monolithic CTE into measurable stages and eliminate the temp-table materialisation on the hot path.
- Origin summary combines two sequential scans into one via `GROUPING SETS`.
- Cron-runs `since_id` delta-poll param + frontend wiring on `/logs` `recentCrons` so the page only fetches new events.
- Admin usage-log visibility-gates its 30s tick and rewrites the latest-per-task SQL to skip the full join.
- Admin shielding banner endpoint trimmed; share-status `staleTime` tightened.
- Bot-source cache: 60s TTL on the recursive cache-dir `scandir` (was 200–1500 ms per `/api/bootstrap`).
- React-Query: skip 4xx retries; hooks lifted out of insights / ReportLayout render-props so each page mount re-uses one query instance instead of re-mounting on every parent render.
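The bot-source cache item can be illustrated with a small TTL wrapper around a recursive `os.scandir` walk. This is a sketch under stated assumptions rather than the project's code: `TTLCache` and `dir_stats` are hypothetical names, and the injectable clock exists only to make the expiry behaviour easy to test.

```python
import os
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Caches one value per key and recomputes it once the entry is ttl_seconds old."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, object]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], object]) -> object:
        now = self._clock()
        hit = self._entries.get(key)
        # Serve the cached value while it is younger than the TTL.
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        value = compute()
        self._entries[key] = (now, value)
        return value


def dir_stats(root: str) -> Tuple[int, int]:
    """Return (file_count, total_bytes) for everything under root, without following symlinks."""
    files, total = 0, 0
    stack = [root]
    while stack:
        with os.scandir(stack.pop()) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    files += 1
                    total += entry.stat(follow_symlinks=False).st_size
    return files, total
```

A caller would wrap the expensive walk as `cache.get_or_compute(path, lambda: dir_stats(path))` with `TTLCache(60.0)`, so at most one recursive scan per path runs in any 60-second window.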
Frontend:
- `starlette-compress` replaces `GZipMiddleware`: the backend now negotiates `br` / `zstd` / `gzip` (was gzip-only). Modern browsers get brotli; rendered-text payloads drop ~25 % on the wire.
- Keep-alive on Next.js http/undici global agents so the proxy reuses TCP connections to the FastAPI backend instead of new-handshake-per-request.
- Pre-warm + lazy-mount pattern: plotly + maplibre-gl + `world.geojson` are pre-warmed on `AppLayout` mount via hidden one-point charts; the visible chart hydrates from the warm module cache instead of triggering a fresh import on first render. `LazyMount` + `PlotlyChart` start `visible=false` to avoid the hydration-mismatch warning that came with the prior eager-mount pattern.
- Hover-prefetch sidebar links so the destination's data warms before the click commits.
- Per-insight skeleton cards on first paint; full skeleton rendered from `CARD_CATEGORIES` on the dashboard.
- Modulepreload for the plotly chunk via a build-time-generated preload manifest (`scripts/build-preload-manifest.mjs` + `lib/preload-manifest.ts`); restores plotly's preload without re-introducing the nav-lag the first attempt caused.
- Drop `force-dynamic` on routes that don't need it; root layout opts out of build-time SSG so the preload manifest is read at request time.
- `/geo/*` static assets cached aggressively; `PlotlyChart` dynamic-import on `/network`.
- `SystemHealthCard` polling moved to 1 s for live attack/load feedback now that the endpoint is cheap.
- `useNowMs` reuse: multiple visible-tick components (countdowns, "X seconds ago") share one interval.
- Map style-data listener replaces a 100 ms `setTimeout` poll.
Reliability
- Multi-worker login loop fixed: `tunnel.py` now rehydrates a share session on demand from SQLite when an in-memory cache miss happens on a different uvicorn worker. Previously, login on worker A would loop because worker B couldn't see the freshly-minted session.
- DuckDB lock conflict resolved between the connection pool and cron writes: `get_connection` forces `read_only=False` so pool readers and cron writers no longer trip DuckDB's "different configuration" error on the same file.
- Stale-view self-heal: `QueryRunner` clears `_view_cache` before the `force=True` rebuild on the post-empty recovery path so the next query doesn't see the stale schema.
- Iceberg s3fs proxy hook falls back to the process-global source so the hook always registers, even when the ContextVar is empty (e.g. cold-start LIST before any `_get_catalog` has fired).
- Top-N current-hour merge: a silent `ImportError` was dropping the current-hour merge; restored with an explicit fail-loud import.
- Rollup compaction: `run_id` is threaded through the error branch, and the compaction step now uses an in-memory DuckDB so a corrupted on-disk catalog can't wedge the cron.
- Dashboard response cache: write to `is_cached` (not the aliased `_is_cached`) so Pydantic doesn't drop the flag on serialise.
- Dashboard cache hit rate: disabled the 30 s response-level cache that was masking the rollup wins for fast-changing queries.
- Usage-log rollup drift: reconcile cycle changed from DELETE+INSERT to UPSERT so concurrent flushes can't lose rows.
- Botnet insight investigate link filters only the queried column, not all of them.
- `expire_snapshots` updated for pyiceberg 0.11.1 API and now emits `cron_runs` telemetry.
- Proxy compatibility: switched from `middleware.ts` to `proxy.ts` for Next.js 16; restored the Caddy-marker middleware that the upgrade broke.
- Telemetry response middleware backstop (`backend/utils/telemetry_response_middleware.py`) auto-injects `_debug_queries` / `_debug_calls` / `_is_cached` into JSON-dict responses that bypassed `BaseResponse.with_telemetry`, so newly-added endpoints don't silently blank the Debug Panel.
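The usage-log rollup fix above rests on the difference between DELETE+INSERT and UPSERT. A minimal sketch with the standard-library `sqlite3` module shows the pattern; the `usage_rollup` table, its columns, and the helper names are hypothetical and not taken from the project's schema. It assumes SQLite 3.24 or newer, which introduced `ON CONFLICT ... DO UPDATE`.

```python
import sqlite3
from typing import Iterable, Tuple


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical rollup table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE usage_rollup ("
        " task TEXT NOT NULL,"
        " hour TEXT NOT NULL,"
        " calls INTEGER NOT NULL,"
        " PRIMARY KEY (task, hour))"
    )
    return conn


def upsert_counts(conn: sqlite3.Connection, rows: Iterable[Tuple[str, str, int]]) -> None:
    """Insert each (task, hour, calls) row, or overwrite calls if the key already exists.

    Unlike DELETE followed by INSERT, rows that are not part of this batch
    are never removed, so a flush landing between the two statements cannot
    be wiped out.
    """
    conn.executemany(
        "INSERT INTO usage_rollup (task, hour, calls) VALUES (?, ?, ?) "
        "ON CONFLICT(task, hour) DO UPDATE SET calls = excluded.calls",
        rows,
    )
    conn.commit()
```

Each statement touches only the keys it names, which is why a reconcile pass written this way leaves rows from a concurrent flush intact.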
Security
Capability-focused hardening across the backend and frontend trust boundaries.
- Cross-tenant ContextVar leak in the s3fs proxy hook closed. PyIceberg writes parquet via a `ThreadPoolExecutor`; ContextVars don't propagate to executor workers by default, so the prior fix used an endpoint-keyed global registry that was vulnerable to overwrite when two tenants shared an endpoint URL. Replaced with a global `ThreadPoolExecutor.submit` monkeypatch that wraps the callable in `contextvars.copy_context()`; matches asyncio's `loop.run_in_executor` semantics. Documented in MONKEYPATCHES.md §6.
- Path-param service-scope desync: analyst sessions could supply a `service_id` path param that didn't match their session scope on a handful of mutation endpoints. Centralised the check via a router-utils helper invoked on every scoped route.
- Secret-in-URL leak on downloads: the download endpoint previously embedded the shared CDN secret in the redirect URL where it could land in browser history / referrer headers. Switched to a signed short-lived bearer that's stripped before the redirect.
- Strict input validation on the destructive-op surface (provision teardown, NGWAF workspace mutations, scoring threshold + enforce-status-code + recv-exclusion-regex changes) runs through length caps, character allowlists, and (where applicable) `falco` static analysis before any VCL ships.
- **CSRF gate...
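The ContextVar item in this list describes wrapping executor callables in a copied context. The sketch below shows that mechanism with an explicit helper instead of a monkeypatch on `ThreadPoolExecutor.submit`; `tenant`, `read_tenant`, and `submit_with_context` are hypothetical names used only for illustration.

```python
import contextvars
from concurrent.futures import Future, ThreadPoolExecutor

# Hypothetical per-request variable standing in for tenant-scoped state.
tenant: contextvars.ContextVar = contextvars.ContextVar("tenant", default=None)


def read_tenant():
    """Return whatever tenant value is visible in the calling context."""
    return tenant.get()


def submit_with_context(executor: ThreadPoolExecutor, fn, *args, **kwargs) -> Future:
    """Submit fn so that it runs inside a snapshot of the caller's context.

    copy_context() is taken at submit time in the submitting thread, and
    Context.run executes fn inside that snapshot on the worker thread.
    """
    ctx = contextvars.copy_context()
    return executor.submit(ctx.run, fn, *args, **kwargs)


if __name__ == "__main__":
    tenant.set("tenant-a")
    with ThreadPoolExecutor(max_workers=1) as pool:
        print(submit_with_context(pool, read_tenant).result())
```

Because each submission carries its own snapshot, two tenants sharing one executor each see their own value, and writes made inside the worker do not leak back into the caller's context.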
1.1.0 — Session scoring
Edge session scoring lands as the headline feature for 1.1.0, alongside a security hardening pass and operator-tunable scorer URL exclusion.
Highlights
- Edge session scoring: Fastly Compute scorer + 6-snippet VCL preflight (recv / pass / fetch / deliver / miss / enforce) with AES-GCM session cookies carrying rotating sid and L2 transition state. L1 (cookie compliance + timing) + L2 (PageRank-trained transition matrix) produce a combined 0–100 score on every request.
- Admin UI at `/admin/session-scoring`: live ROC-AUC against operator labels, score-distribution / top-reasons / matrix-staleness cards, threshold slider with counterfactual flag/pass preview, ROC + PR curves, per-reason AUC breakdown, label CRUD with click-to-view-events, matrix retrain + version history + rollback, AES key rotation, operator audit log.
- Live edge enforcement: operator commits a threshold and a response code (default 429, operator-overridable to 403 / 451 / 503 / any 4xx-5xx). The enforce snippet rejects scored requests on the post-scoring restart within seconds of commit.
- URL exclusion regex override: per-service regex telling the scorer which URLs to skip. Defaults to the built-in static-asset extension list. Three-layer validation (input policy → falco static analysis → Fastly VCL compiler) before any VCL ships. Focused orchestrator swaps only the recv snippet in ~5–15s.
- Security hardening across the FastAPI backend, Fastly VCL, Next.js frontend, and Rust scorer: trust-boundary normalisation, destructive-op token auth, DuckDB user-SQL parse-tree validator, VCL header & cache discipline, cross-tenant scope enforcement, path-traversal cages, SSH host-key pinning, and scorer signal tightening.
- Dashboard performance: DuckDB connection pool, hourly Top-N rollup precomputation pipeline, bounded cache primitive, streaming Suspense skeletons on admin routes.
- Reliability: cron-progress reaping fixes, `state_sync` merge guards closing a class of "remote-overwrites-code-managed-state" data-loss paths, per-key in-flight collapse in the analytics cache.
Full details, including the security capability breakdown, reliability fixes, performance work, and infrastructure / dependency changes, are in CHANGELOG.md.