Skip to content

hardening: pre-tenant readiness pass#8

Merged
giftedloser merged 6 commits into
mainfrom
codex/pre-tenant-readiness
May 3, 2026
Merged

hardening: pre-tenant readiness pass#8
giftedloser merged 6 commits into
mainfrom
codex/pre-tenant-readiness

Conversation

@giftedloser
Copy link
Copy Markdown
Owner

Summary

Pre-tenant readiness audit + targeted hardening pass. No new product
features, no version bump, no Settings IA changes, no Graph scope
removals. All findings either fixed under tight scope or marked PASS
after verification.

Findings fixed

Security

  • Gate POST /api/rules/preview behind requireDelegatedAuth so a
    same-host process can't fan out predicates against synced tenant
    data (rules.ts). Focused tests in
    rules-preview-gate.api.test.ts.
  • Gate GET /api/settings/tag-config behind requireDelegatedAuth
    for consistency with the mutation siblings — verified the client
    reads via the parent /api/settings endpoint, so no flow regresses
    (settings.ts). Existing test
    updated to assert 401.
  • Desktop-token compare switched to crypto.timingSafeEqual with
    explicit length-mismatch and missing-header guards
    (local-access.ts). Three new
    cases in local-access.api.test.ts.
  • Pino redact paths extended to cover request body credential
    fields and the idempotency-key request header
    (logger.ts). Bodies aren't logged today;
    this is defence in depth.

Destructive UX

  • Typed-confirmation fallback uses
    serial -> deviceName -> intuneId -> autopilotId. Devices missing
    all four refuse to render the destructive flow, with an explanatory
    toast. Replaces a silent fallback to the literal string CONFIRM
    (ActionsToolbar.tsx).
  • At-rest danger affordance on destructive remote-action buttons:
    subtle critical-color left border, critical-tinted icon, explicit
    focus-visible outline matching severity, and aria-label that
    includes "destructive" for screen readers.

Performance

  • Migration 015 adds idx_action_log_triggered_at ON action_log(triggered_at DESC)
    so the audit-log CSV/NDJSON export at
    GET /api/actions/logs/export no longer falls back to a sequential
    scan as action history grows.

Scale tooling

  • npm run db:seed:scale generates ~4,500 synthetic devices
    across ~30 properties with realistic problem distribution and
    prints rough query timings. Dev-only; never invoked by tests/CI.
    Sample run on this branch:
    • Seed total: 278 ms
    • Dashboard query: 37 ms (4500 devices, 7 failure patterns)
    • Devices page 1 (25): 2.9 ms
    • Tags inventory (31 tags): 2.7 ms
    • Provisioning lookup (146 devices): 0.8 ms
    • Device detail: 0.3 ms
    • computeAllDeviceStates full run: 184 ms

Findings verified PASS (no change)

  • Properties sidebar overflow — the existing outer <nav> already
    has lg:overflow-y-auto and the footer is a separate sibling
    outside the nav, so 30+ properties scroll cleanly inside the nav and
    the footer stays pinned. No inner cap needed.
  • CSV exports (server + client) — both sides already escape
    formula-injection prefixes (=, +, -, @) and quote
    newlines/quotes. Filenames include the date.
  • Idempotency keys — minted fresh per click in
    useActions.ts, 24h replay window
    on the server with scope-hash validation.
  • Effect cleanup sweep — EntityPicker (AbortController), HelpTooltip
    (event listener), SyncStatusPill (state-only), FirstRunBanner (SSR-safe
    storage guards), LapsWidget/BitLockerWidget (auto-rehide timers),
    ConfirmDialog/BulkActionConfirm/Tags drawer (keydown listeners) all
    clean up on unmount. No leaks found.
  • Local-storage namespacing — all keys under pilotcheck.pref.*
    or runway.*. No secrets/tokens stored.
  • Graph timeout/retry — 15s AbortController, 3-retry on 429/5xx,
    Retry-After clamped to 1-60s. Partial sync preserves prior good
    data.
  • Tauri shell — restrictive CSP, embedded updater pubkey, narrow
    window-control capability, in-memory desktop token.

Findings intentionally deferred

  • Per-resource sync error tracking — current sync_log granularity
    is per-sync with NDJSON errors array. Surfacing per-resource
    status on the Sync page is a UI improvement that's not blocking
    tenant testing.
  • Confirmation modal with Intune/Autopilot ID alongside name+serial
    — current modal already shows device name + serial; adding portal
    IDs is a nice-to-have.
  • Removing User.ReadBasic.All etc. — verified all three scopes
    the initial audit flagged are actually used (EntityPicker user
    search, group-actions writes, autopilot-import). Documented as
    "Optional" by feature in the README/security-report tables instead.

Permissions docs

README.md and docs/security-report.md now show an Optional?
column on the delegated permission table so a security owner can grant
only the scopes for the features they'll use. Application scopes are
all required.

Live-tenant dry-run runbook

New docs/live-tenant-dry-run.md walks
through prerequisites, first sync, read-only verification, first safe
action (sync one known device), portal deep-link verification,
EntityPicker dry run, LAPS/BitLocker on a lab device, deferred
destructive list, log inspection, disconnect path, and what to capture.

Tests added

  • test/server/rules-preview-gate.api.test.ts — 401 unauthenticated /
    200 after delegated sign-in, including the loopback Origin header
    the real client sets.
  • test/server/local-access.api.test.ts — three new cases: empty
    desktop token, length mismatch, same-length wrong-content token.

Existing test updated:

  • test/server/provisioning-groups.api.test.ts — GET tag-config now
    expects 401, matching the new gate.

Verification

  • npm run lint — clean
  • npm run typecheck — clean
  • npm run test324 passed / 32 files
  • npm run test:e2e30 passed / 4 files
  • npm run build — client (~430 KB) + server (271 KB), clean
  • git diff --check — clean
  • package.json version unchanged (1.6.0)

Risks

  • Scope of destructive action visual change — the new at-rest
    danger affordance is subtle (left border + tinted icon). Visually
    reviewing on each theme would help but isn't required for the gate.
  • Tag-config gate — closes a loophole the client never used; if an
    external automation script relied on the unauthenticated GET, it
    will now need a delegated session.
  • All other changes are additive (tests, docs, scale seed) or
    defence-in-depth (token compare, redact paths, action_log index).

Verdict

READY FOR REVIEW. Suggest squash-merge once a reviewer scans
the auth/destructive-UX commits and the live-tenant dry-run doc.

- Gate POST /api/rules/preview behind requireDelegatedAuth so rule
  authors must be signed in to evaluate predicates against the live
  device snapshot. New rules-preview-gate.api.test.ts proves the 401 /
  200 transitions across the auth boundary.
- Gate GET /api/settings/tag-config behind requireDelegatedAuth for
  consistency with the mutation siblings. The Tags view continues to
  read mappings via the parent /api/settings endpoint, so no client
  flow regresses; provisioning-groups.api.test.ts updated to assert
  the new 401.
- Switch the desktop-token compare to crypto.timingSafeEqual with
  explicit length-mismatch and missing-header guards. Adds three
  focused cases to local-access.api.test.ts (empty token, length
  mismatch, same-length wrong content).
- Extend pino redact paths to cover request body credential fields
  (clientSecret, azureClientSecret, password, token, accessToken,
  idToken, refreshToken, session secrets, certificate thumbprints)
  plus the idempotency-key request header. Defence in depth — bodies
  aren't logged today.
The /api/actions/logs/export route reads up to 100k rows ordered by
triggered_at DESC. Without a covering index that's a sequential scan;
adding idx_action_log_triggered_at keeps the export snappy as action
history grows past the first tenant soak.
…dance

- Build the typed-confirmation token from
  serial -> deviceName -> intuneId -> autopilotId. Devices missing all
  four refuse to render the destructive flow; the previous behaviour
  silently fell back to the literal string "CONFIRM", which let an
  admin confirm a destructive action against an unidentifiable record.
  An explanatory toast surfaces the missing-identifier case.
- Destructive remote-action buttons now carry a subtle critical-color
  left border at rest and a critical-tinted icon, so an admin can read
  the danger affordance without hovering. Hover and confirmation
  behaviour unchanged.
- Add an explicit focus-visible outline that matches severity
  (critical for destructive, accent for safe), and an aria-label that
  includes the "destructive" suffix for screen readers and keyboard
  navigation safety.
Adds npm run db:seed:scale which builds a synthetic SnapshotPayload
of ~4,500 devices spread across ~30 properties with realistic problem
distribution (no_profile, not_in_target_group, user_mismatch,
tag_mismatch, stalled, compliance_drift, orphan_autopilot,
no_autopilot, hybrid_risk). Reuses the production persistSnapshot +
computeAllDeviceStates helpers and prints rough query timings:

  • build payload                        ~18ms
  • persist snapshot                     ~75ms
  • compute all device states           ~184ms
  • dashboard query                       37ms (4500 devices)
  • devices page 1 (25)                  2.9ms
  • tags inventory (31 tags)             2.7ms
  • provisioning lookup (146 devices)    0.8ms
  • device detail                        0.3ms
  • recompute (second pass)             ~186ms

Indicative timings only — never invoked by the test suite or CI. The
script wipes whatever DATABASE_PATH points at, so use a dedicated
DATABASE_PATH=./data/scale-smoke.sqlite when running it.
- New docs/live-tenant-dry-run.md: a step-by-step runbook for the
  first time Runway is pointed at a real tenant. Read-and-look-only
  for the first soak; destructive actions explicitly deferred. Pairs
  with the existing live-testing-checklist.md preflight rather than
  replacing it.
- README + docs/security-report.md permission tables now have an
  Optional? column so a security owner can grant only the delegated
  scopes for the features they intend to use during the pilot. The
  five application scopes remain non-optional.
- CHANGELOG [Unreleased] entries under Security, Performance, Changed,
  Added, and Documentation describing the readiness pass.
- Revert gratuitous export on hasDesktopToken; nothing else imports it.
- Drop req.headers["idempotency-key"] from the redact list. Idempotency
  keys are per-click UUIDs (not secrets) and redacting them costs
  debug value when chasing duplicate-dispatch behaviour.
- Soften the "clear permission error" claim in the README +
  docs/security-report.md permission tables. Action handlers surface
  the Graph status code (e.g. 403), not a translated permission
  string. Updated copy points operators at Action Audit and explains
  that 403 distinguishes "missing Graph scope" from "missing
  Entra/Intune role".
- Add a defensive DATABASE_PATH guard to seed-scale.ts: refuses to
  run when DATABASE_PATH is unset or points at the default
  pilotcheck.sqlite, so a fat-finger can't wipe the user's primary
  database. Verified that the guard trips on the default and lets
  ./data/scale-smoke.sqlite through.

Skipped my own self-review item about a device_state_history
composite index — the existing PRIMARY KEY (device_key, computed_at)
is already a covering index for the trend query at devices.ts:669.
A separate index would be redundant.
@giftedloser giftedloser marked this pull request as ready for review May 3, 2026 03:18
@giftedloser giftedloser merged commit f772c55 into main May 3, 2026
1 check passed
@giftedloser giftedloser deleted the codex/pre-tenant-readiness branch May 3, 2026 03:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant