Skip to content

fix: dashboard stability, infrastructure improvements, auth hardening#399

Merged
lalalune merged 5 commits intodevfrom
sol/dashboard-improvements-20260316
Mar 19, 2026
Merged

fix: dashboard stability, infrastructure improvements, auth hardening#399
lalalune merged 5 commits intodevfrom
sol/dashboard-improvements-20260316

Conversation

@0xSolace
Copy link
Collaborator

Summary

Comprehensive dashboard stability and infrastructure monitoring improvements.

Auth & Proxy Hardening

  • Reject malformed auth tokens early (before hitting Privy API)
  • Cache invalid auth results to short-circuit repeated bad-token requests
  • Fix cookie-based auth error handling (redirect to login instead of error page)
  • Proxy routing cleanup for public vs protected paths

Navigation & Layout Fixes

  • Fix sidebar anchor navigation (scroll-to-section links)
  • Fix user-menu infinite re-render loop caused by unstable context references
  • Stabilize PageHeaderProvider context value with primitive field comparison
  • Fix useSetPageHeader to extract primitives so effect deps are stable

Infrastructure Dashboard

  • Allocation drift fix: Use actual sandbox record count instead of stale allocated_count column from docker_nodes
  • Health classification: Downgrade heartbeat-stale severity to warning when Docker reports container as running + healthy
  • Collapsible incidents panel with per-container action buttons
  • Container actions API (POST /api/v1/admin/infrastructure/containers/actions) — logs, restart, stop, start, inspect, pull-recreate via SSH

Containers Table

  • Upgraded milady-sandboxes-table with improved column layout

Tests

  • Added proxy auth handling tests
  • Added proxy auth routing tests
  • Updated infrastructure health classification tests for new severity logic

Files Changed (20 files, +2383/-655)

- Add auth grace period (5s) to prevent redirect loops during Privy token refresh
- Replace router.push with plain <a> tags for login/navigation stability
- Add returnTo parameter to all login redirects consistently
- Add dashboard error.tsx and loading.tsx boundaries for RSC failures
- Add TabErrorBoundary for infrastructure dashboard containers tab
- Preserve user menu state during transient auth loss
- Proxy: detect malformed JWTs early, redirect to /login with returnTo
- Proxy: add /dashboard/build and /api/auth/create-anonymous-session to public paths
- Replace Next Link with <a> in sandboxes table (fixes /dashboard/milady routes)
- Use soft refresh event instead of hard reload after anon migration
- Add proxy auth unit tests
…ntainer actions API

- Fix allocation drift: use actual sandbox record count instead of stale allocated_count column
- Fix health classification: downgrade heartbeat-stale severity when Docker reports healthy+running
- Add collapsible incidents panel with per-container actions (logs, restart, stop, inspect)
- Add container actions API endpoint (POST /api/v1/admin/infrastructure/containers/actions)
- Fix page header infinite re-render: stabilize context value references with primitive comparison
- Fix useSetPageHeader to extract primitives so effect deps are stable across re-renders
@vercel
Copy link

vercel bot commented Mar 19, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
eliza-cloud-v2 Ready Ready Preview, Comment Mar 19, 2026 1:00pm

@coderabbitai
Copy link

coderabbitai bot commented Mar 19, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9ca43dfb-91e8-4668-b8d1-1c7885c5a30c

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch sol/dashboard-improvements-20260316
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

Migrating from UI to YAML configuration.

Use the @coderabbitai configuration command in a PR comment to get a dump of all your UI settings in YAML format. You can then edit this YAML file and upload it to the root of your repository to configure CodeRabbit programmatically.

@claude
Copy link

claude bot commented Mar 19, 2026

PR Review — dashboard stability, infrastructure improvements, auth hardening

Overall this is a solid PR with well-thought-out solutions to the auth refresh flickering and allocation drift problems. A few items worth addressing:


Security

since parameter not validated (containers/actions/route.ts, line 123)

The since value is passed to shellQuote before use in docker logs --since, which neutralises single-quote injection. However, there's no format check — any arbitrary string (potentially very long) passes through. Consider adding a lightweight regex guard:

if (since && !/^(\d+[smhd]|[\d]{4}-\d{2}-\d{2}T[\d:.Z+-]+)$/.test(since)) {
  return NextResponse.json({ success: false, error: "Invalid since format" }, { status: 400 });
}

Internal error messages returned to the client (line 315)

return NextResponse.json({ success: false, error: `Action failed: ${message}` }, { status: 500 });

SSH error messages can include internal hostnames, IPs, key fingerprints, and connection details. Even for admin-only endpoints, it's worth sanitising: log the full error server-side, return a generic message (or only the docker exit reason) to the client.


Correctness

pull-recreate name is misleading (line 258–297)

The action only pulls the new image; it never recreates the container. The response note acknowledges this, but the action name will cause operators to think recreation happened. Consider renaming it to pull-image now while it's early, since the frontend action buttons will need updating too. If full recreate behaviour is planned, the current implementation is half-done.

docker inspect --format output used directly in docker pull (line 264–275)

const imageOutput = await ssh.exec(`docker inspect --format '{{.Config.Image}}' ...`);
const image = imageOutput.trim();
// ...
await ssh.exec(`docker pull ${shellQuote(image)} 2>&1`);

docker inspect output via SSH includes a trailing newline but .trim() handles that. The bigger risk: if docker inspect prints a warning line before the image name (e.g., a TLS warning to stderr that bleeds into stdout), shellQuote(image) will pass a multi-line string to docker pull and the pull will fail with a confusing error. Consider adding a stricter format validation:

if (!image || !/^[a-zA-Z0-9_./:@-]+$/.test(image)) {
  return NextResponse.json({ success: false, error: "Could not determine a valid container image" }, { status: 400 });
}

Auth Grace Period

hasBeenAuthenticated ref mutation during render (layout.tsx, line 422–424)

if (authenticated) {
  hasBeenAuthenticated.current = true;
}

Mutating a ref during render is technically safe but can be surprising in Strict Mode (double-render). It works here because setting it true is idempotent, but a comment noting why this is deliberate would help future readers.

authGraceActive and isAnonymous inversion (line 521)

isAnonymous={!shouldAllowProtectedContent}

shouldAllowProtectedContent is authenticated || authGraceActive, so during the 5-second grace window the header shows as authenticated while the user is actually unauthenticated. This is intentional for UX stability, but worth a comment to prevent someone "fixing" it later.


Tests

No tests for the container actions API

proxy-auth-handling.test.ts and proxy-auth-routing.test.ts are great additions. The containers/actions/route.ts file is new, security-sensitive, and has several branches (all 6 actions, validation errors, SSH error handling). It deserves at least happy-path and validation-failure coverage.


Minor

  • SSH_ACTION_TIMEOUT_MS and SSH_LOGS_TIMEOUT_MS are both 30_000 — they could be a single constant unless you plan to differentiate them later.
  • The error.tsx component calls console.error directly rather than the project's logger utility — worth staying consistent.
  • app/dashboard/milady/agents/[id]/page.tsx catches all errors from miladySandboxService.getAgent and silently redirects. This swallows legitimate 500-level errors (DB down, schema mismatch) in the same way as a 404. A narrower catch on the expected "not found" case would preserve observability.

What's good

  • Allocation drift fix (using actual dbContainers.length) is the right call.
  • Downgrading stale-heartbeat severity when Docker reports healthy is a sensible heuristic and the test update is correctly scoped.
  • shellQuote + container name regex validation (^[a-zA-Z0-9_.-]+$) is solid injection prevention.
  • Replacing window.location.reload() with a custom event + router.refresh() is a much cleaner approach to the anon-migration cycle.
  • The auth grace period with hasBeenAuthenticated ref is a clean solution to the Privy token refresh UX problem.

- Fix biome lint: remove unused imports (Pause, Terminal), fix import ordering, fix formatting
- Address review: validate 'since' param format in container logs endpoint
- Address review: validate docker image name format before pull
- Address review: sanitize error messages returned to client (don't leak SSH internals)
@claude
Copy link

claude bot commented Mar 19, 2026

PR Review — #399: Dashboard stability, infrastructure improvements, auth hardening

Overall this is a solid, well-structured PR. The auth hardening, context stabilization, and infrastructure dashboard improvements are all meaningful improvements. A few issues worth addressing before merge:


🔴 Bugs / Correctness Issues

1. pull-recreate only pulls — it doesn't recreate

app/api/v1/admin/infrastructure/containers/actions/route.ts lines ~255–285

The action is named pull-recreate but only performs a docker pull. The response even acknowledges this with a note field:

"Container recreation requires docker-compose or manual recreation."

This is misleading — an operator clicking "pull-recreate" will expect the container to be restarted with the new image. Either rename the action to pull or implement the recreation step (docker stop → rm → run with original args, or signal docker-compose).

2. SortableHeader defined inside InfrastructureDashboard component

packages/ui/src/components/admin/infrastructure-dashboard.tsx ~line 1123

function SortableHeader({ field, label }: { field: SortField; label: string }) { ... }

This function is defined inside the render body of InfrastructureDashboard. React will create a new component type on every render, causing all sorted headers to unmount/remount. Move it outside the component (it can close over sortField, sortDirection, toggleSort via props or accept them as props).

3. setTimeout polling after container action

setTimeout(() => loadInfraSnapshot(), 2000)

Hardcoded 2-second delay before refreshing snapshot is fragile — if SSH is slow the data will still be stale. Either accept the stale read and let the user manually refresh, or implement a proper polling mechanism with a loading state. At minimum use window.setTimeout and store the handle so it can be cleared on unmount.


🟡 Security Considerations

4. nodeId parameter not validated in container actions API

containerName is validated against ^[a-zA-Z0-9_.-]+$ (good), but nodeId has no such validation. It's looked up from the DB (findByNodeId), so injection isn't directly possible — but an exhaustive search with arbitrary nodeId strings could still be used for node enumeration. Consider limiting nodeId to a safe character set as well.

5. Auth grace period allows protected content for 5 seconds after auth loss

app/dashboard/layout.tsx:

const AUTH_LOSS_GRACE_MS = 5000;

This is intentional for Privy token refresh jitter, but 5 seconds of protected content remaining visible to a logged-out user is worth documenting. Consider whether the grace period should apply to read-only views vs. sensitive admin pages, or if 2–3 seconds would suffice.

6. since regex in logs action has redundant character class

if (!/^(\d+[smhd]|[\d]{4}-\d{2}-\d{2}T[\d:.Z+-]+)$/.test(since)) {

[\d]{4} should be \d{4} — minor, but inconsistent and slightly confusing. More importantly, the ISO timestamp portion [\d:.Z+-]+ is very permissive (allows Z+- as individual characters). This passes through shellQuote anyway so it's not exploitable, but worth tightening.


🟡 Code Quality

7. Redundant timeout constants

const SSH_ACTION_TIMEOUT_MS = 30_000;
const SSH_LOGS_TIMEOUT_MS = 30_000;

Both are identical. Consolidate or explain why they differ.

8. [key: string]: unknown in InfraSummary interface

interface InfraSummary {
  // ... typed fields ...
  [key: string]: unknown;
}

The index signature defeats the purpose of having typed fields — TypeScript won't catch typos. Remove it and add any additional fields explicitly.

9. isObviouslyMalformedToken only checks segment count

A token like aaaa.bbbb.cccc passes the 3-segment check but will still fail Privy verification. This is fine (it'll just hit Privy and fail), but since the cache write happens for obviously malformed tokens before Privy is called, consider also doing a basic base64url character check. Not critical since Privy handles it — just noting the gap.


✅ What's Well Done

  • shellQuote + regex validation on container names before SSH exec — correct defense-in-depth
  • SSH error sanitization\"Action failed. Check server logs for details.\" avoids leaking hostnames/IPs to the client
  • Auth caching for invalid tokens (reason: \"expired\" | \"invalid\") prevents hammering Privy with known-bad tokens
  • handleTokenFailure consolidation — clean refactor that unifies the scattered redirect/401 logic
  • useSetPageHeader stabilization — extracting primitives and using actionsRef is the right pattern to break the infinite re-render loop
  • PageHeaderProvider setter memoization — comparing primitive fields before triggering re-renders is correct
  • lastAuthenticatedUser in UserMenuInner — preserving user state during transient auth loss is a good UX pattern
  • Allocation drift fix using actual sandbox record count is clearly correct over the stale allocated_count column
  • Test coverage for proxy auth routing and handling is comprehensive and well-structured

Summary

Items that should be addressed before merge:

  1. pull-recreate naming/behavior mismatch (correctness)
  2. SortableHeader defined inside component body (performance/correctness)
  3. setTimeout for post-action refresh (reliability)

Items that are lower priority but worth tracking:
4. nodeId validation
5. InfraSummary index signature
6. Timeout constant consolidation

@claude
Copy link

claude bot commented Mar 19, 2026

PR #399 Review

Overall this is a solid, well-structured PR addressing real problems (auth flicker, stale allocation counts, navigation instability). A few issues worth addressing before merge.


Issues

[Medium] Security — Raw docker inspect leaks env vars through unmasked JSON view

File: packages/ui/src/components/admin/infrastructure-dashboard.tsx

The container details dialog renders a "Raw Docker Inspect JSON" collapsible at the bottom using JSON.stringify(detailsData, null, 2), which dumps the full inspect payload including Config.Env again — unmasked. The key-based masking logic above only applies to the parsed env var table, not the raw JSON block.

// masking only covers the env var section above, not this:
JSON.stringify(detailsData, null, 2)

Recommendation: Strip Config.Env from detailsData before passing it to the raw JSON view, or apply the same masking transform to that block.


[Medium] Bug — "+N more" incidents badge is dead code (never renders)

File: packages/ui/src/components/admin/infrastructure-dashboard.tsx

The "+N more" message is inside the {incidentsExpanded && (...)} block but also has the condition !incidentsExpanded, making it impossible to render:

{incidentsExpanded && (
  // ...
  {hasMore && !incidentsExpanded && (  // ← always false
    <p>+{infraSnapshot.incidents.length - COLLAPSED_LIMIT} more</p>
  )}
)}

Users with >3 incidents and the panel collapsed get no indication there are hidden incidents.

Recommendation: Move the "+N more" text outside the incidentsExpanded guard.


[Low] Bug — setTimeout in performContainerAction without cleanup

File: packages/ui/src/components/admin/infrastructure-dashboard.tsx

setTimeout(() => loadInfraSnapshot(), 2000);

If the component unmounts within 2 seconds (user navigates away), setInfraSnapshot is called on an unmounted component. Compare how authLossTimerRef handles this correctly in dashboard/layout.tsx.

Recommendation: Track the timeout in a ref and clear on unmount.


[Low] Performance — SortableHeader defined inside component render body

File: packages/ui/src/components/admin/infrastructure-dashboard.tsx

function SortableHeader({ field, label }: { field: SortField; label: string }) {
  // declared inside InfrastructureDashboard's body
}

A new function identity is created on every render, defeating React reconciliation — the component will remount every render rather than update. Should be extracted to module scope.


[Low] Design — Hard navigation for all /dashboard routes is broad

File: packages/ui/src/components/layout/sidebar-item.tsx

const shouldHardNavigate = item.hardNavigate || item.href.startsWith("/dashboard");

This forces full page reloads for every dashboard sidebar link, removing client-side navigation, prefetching, and scroll restoration. The comment says "Temporary escape hatch" but there's no issue link or removal condition documented.

Recommendation: Narrow this to only the specific routes with known RSC instability, or add a // TODO(#issue): with the condition to revert.


[Low] pull-recreate action is misleadingly named

File: app/api/v1/admin/infrastructure/containers/actions/route.ts

The action only pulls the image — it doesn't recreate the container. The API response even says: "Container recreation requires docker-compose or manual recreation." The action key implies more than it does.

Recommendation: Rename to pull-image or actually restart the container. Document the limitation in the UI.


[Info] as any casts in inspect dialog

File: packages/ui/src/components/admin/infrastructure-dashboard.tsx

Repeated (detailsData as any)?.Config?.Image, (detailsData as any)?.State?.Status, etc. A minimal DockerInspect interface would eliminate these and improve safety.


[Info] Auth grace period risk not documented

File: app/dashboard/layout.tsx

The 5-second grace period (AUTH_LOSS_GRACE_MS = 5000) has a known tradeoff: if a user logs out from another tab, the dashboard stays rendered for 5 seconds. API calls will fail correctly, so data exposure is minimal — but this accepted risk is worth a comment.


Positive Observations

  • shellQuote + container name validation regex (/^[a-zA-Z0-9_.-]+$/) are solid injection defenses
  • isObviouslyMalformedToken pre-check avoids unnecessary Privy API calls
  • Caching invalid auth results to short-circuit repeated bad tokens is a good DoS mitigation
  • PageHeaderProvider primitive comparison fix correctly stops the re-render loop
  • useSetPageHeader actions stabilization via useRef is the right approach
  • authLossTimerRef cleanup in effect return is correctly handled
  • Allocation drift fix (actual sandbox record count vs stale allocated_count) is correct
  • Health classification downgrade for heartbeat-stale + Docker-healthy is a sensible UX improvement

…eak, naming, types

- Extract SortableHeader to module scope (prevents remount on every render)
- Fix '+N more' incidents badge dead code (was inside incidentsExpanded guard)
- Mask env vars in raw Docker inspect JSON view (was leaking unmasked)
- Add DockerInspectData interface, remove all 'as any' casts
- Rename pull-recreate → pull-image (action only pulls, doesn't recreate)
- Add nodeId validation regex to container actions API
- Consolidate SSH_ACTION_TIMEOUT_MS and SSH_LOGS_TIMEOUT_MS into single constant
- Remove index signature from InfraSummary interface
- Track post-action setTimeout in ref with cleanup on unmount
- Tighten 'since' parameter regex validation
- Add auth grace period documentation comments (layout.tsx)
- Add ref mutation comment for hasBeenAuthenticated
- Document isAnonymous/authGraceActive intentional behavior
- Add TODO to sidebar hardNavigate escape hatch
- Narrow catch in milady agent page (rethrow unexpected errors)
@claude
Copy link

claude bot commented Mar 19, 2026

PR Review: fix: dashboard stability, infrastructure improvements, auth hardening

Overall this is well-structured work with clear intent behind each change. A few issues worth addressing before merge:


Security

since parameter validation is too permissive (containers/actions/route.ts ~L124–130)

The ISO timestamp regex allows invalid dates like 9999-99-99T99:99:99Z and timezone offsets like +99:00. Docker will reject them server-side, but the input reaches shellQuote and an SSH exec before that happens. The shellQuote helper correctly wraps the value in single quotes, so there's no injection risk here — but tightening the regex would prevent unnecessary SSH round-trips and give better error messages.

maskInspectForDisplay regex is incomplete (infrastructure-dashboard.tsx ~L1730)

const isSensitive = /key|secret|password|token|api/i.test(key ?? "");

This misses common patterns like DATABASE_URL, PRIVATE_KEY, CREDENTIALS, CERT, DSN, CONN_STR. Consider expanding the pattern or using a stricter allowlist approach for non-sensitive keys.

Also, JSON.parse(JSON.stringify(data)) on L1726–1727 will throw if the inspect output contains undefined, Date objects, or circular refs — worth wrapping in a try/catch fallback.

Invalid auth token cache — verify it's bounded

The test at L1132 confirms caching of invalid auth results to short-circuit repeated bad-token requests. This is a good optimization, but please confirm the cache has a max-size bound. An unbounded in-memory cache keyed by token strings could become a slow memory leak under a flood of unique malformed tokens.


Logic / Correctness

Heartbeat severity downgrade may hide real failures (admin-infrastructure.ts ~L1004–1010)

const runtimeHealthy = runtime?.state === "running" && runtime?.health === "healthy";
severity: runtimeHealthy ? "warning" : "critical",

A container can pass Docker health checks while being functionally dead (deadlock, crashed event loop, hung process). A stale heartbeat means the application stopped reporting — that's independently alarming regardless of what Docker thinks. I'd suggest keeping severity: "critical" but including the Docker health context in the reason string, so operators get the full picture without missing the alert.

pull-image doesn't recreate the container (PR description vs. implementation)

The PR description says "pull-recreate via SSH" but the implemented action is pull-image, which pulls the image and stops there — a separate restart is needed. The success response even notes this:

note: "Image pulled successfully. Restart the container to use the new image."

Either rename the action to make the two-step workflow explicit in the UI, or implement the full pull-recreate (pull → stop → rm → run with same flags) atomically. The current behavior risks leaving admins confused when they "pull-recreate" but the container keeps running the old image.

Allocation drift may count stale DB records (admin-infrastructure.ts ~L1020)

const actualAllocatedCount = dbContainers.length;

If dbContainers includes records in terminal states ("deleted", "error"), the drift calculation will be off. A comment explaining the exact filter criteria applied to dbContainers upstream would help, or filter explicitly before using .length.


React / Code Quality

Ref mutation during render (dashboard/layout.tsx ~L443–445)

if (authenticated) {
  hasBeenAuthenticated.current = true;
}

The comment acknowledges this is intentional and idempotent, but it's still a React anti-pattern. Moving it into a useEffect is a one-liner and avoids surprises if strict mode behavior changes:

useEffect(() => {
  if (authenticated) hasBeenAuthenticated.current = true;
}, [authenticated]);

Auth grace timer may clear prematurely (dashboard/layout.tsx ~L467–493)

The useEffect clears the timer at the top of every run before checking conditions. If isFreeModePath changes while the grace timer is counting down, the timer gets wiped even though the auth state itself hasn't changed. Separating the "clear on auth restored" logic from "clear on path change" would make this safer.


Tests

No test for pull-image behavior or the "restart still needed" case

The new container actions API has tests for auth and routing, but not for the actual SSH command construction (logs, restart, pull-image). At minimum a unit test confirming that pull-image does not restart the container (and that the note is present in the response) would prevent regressions.

Missing coverage: valid → invalid → valid token transitions

The proxy auth tests cover static invalid tokens and caching, but don't test the dynamic case where a token is valid, expires mid-session, and then refreshes. This is exactly the scenario the AUTH_LOSS_GRACE_MS is designed for, so it's worth exercising in tests.


Minor

  • SSH_DEFAULT_TIMEOUT_MS = 30_000 is used for logs, restart, stop, and start — consider separate constants for action vs. log timeouts since log fetching with a large --tail can take longer than a restart command.
  • The lines param is clamped to [1, 5000] but not validated as an integer — lines: 1.7 would pass through as Math.min(Math.max(1.7, 1), 5000) = 1.7 and get interpolated into the shell command. An explicit Math.floor or Number.isInteger check would be safer.

Good additions overall — the allocation drift fix, collapsible incidents panel, and auth hardening are solid improvements. Addressing the heartbeat severity logic and the pull-image vs pull-recreate naming mismatch would be the highest priority before merge.

@lalalune lalalune merged commit a3a9767 into dev Mar 19, 2026
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants