Skip to content

docs: add admin dashboard operator guide (P4)#674

Merged
bootjp merged 6 commits intofeat/admin-s3-writesfrom
docs/admin-operator-guide
Apr 26, 2026
Merged

docs: add admin dashboard operator guide (P4)#674
bootjp merged 6 commits intofeat/admin-s3-writesfrom
docs/admin-operator-guide

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented Apr 26, 2026

P4 deliverable from docs/design/2026_04_24_proposed_admin_dashboard.md Section 8: a single self-contained operator-facing reference for the admin HTTP listener.

Stacked on #669 (slice 2a) + #673 (slice 2b) so the API-surface table can describe S3 write endpoints as shipped. Rebases cleanly onto main once those land.

Sections

  • Quick start — minimal loopback-dev invocation
  • Configuration reference — required vs optional flags, with the rationale for each guard
  • Hard-error startup conditions — the explicit cases where the process refuses to start (missing creds, partial TLS, non-loopback without TLS, role-list overlap)
  • TLS setup — loopback / TLS / discouraged plaintext-non-loopback topologies
  • Roles — read-only vs full + how live role re-validation works on every state-changing request
  • API surface — full /admin/api/v1/* table including the slice 2 S3 write endpoints
  • Follower → leader forwarding — what the SPA sees, the rolling-update caveat, and the election-period 503 + Retry-After contract
  • Audit logadmin_audit slog shape and the forwarded_from field
  • Troubleshooting — the common bring-up failures (missing creds, TLS hard-error, 401 ambiguity, 503 leader_unavailable, bucket_not_empty, placeholder bundle blank-screen)

What is NOT in this PR

Section 8's P4 plan also called out "TLS, read-only role, CSRF" as deliverables; those are already implemented in main (see config.go's validateTLS / validateAccessKeyRoles, the role gates in DynamoHandler.principalForWrite + S3Handler.principalForWrite, and the CSRFDoubleSubmit middleware). This doc stitches them into a single reference operators can land on without reading code.

Test plan

  • Markdown renders cleanly (manual check)
  • Cross-references match real file paths (design doc, architecture overview, proto file)
  • Flag names match main.go's flag definitions verbatim
  • The bucket_not_empty 409 response shape matches S3Handler.writeBucketsError

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 26, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 75b5b2cb-dff7-4645-bab9-65cd73c43770

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch docs/admin-operator-guide

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 26, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 3m 42s —— View job


Review

  • Read docs/admin.md
  • Verify flag names against main.go
  • Verify cross-references (file paths, design doc, architecture overview, proto file)
  • Check bucket_not_empty 409 response shape against S3Handler.writeBucketsError
  • Review hard-error startup conditions against config.go
  • Review TLS, roles, audit log, troubleshooting sections for accuracy
  • Post review feedback

Overall the doc is solid and well-structured. Flag names, CSRF fields, cookie names, credentials file format, bucket_not_empty error code, TLS min-version, and leader_unavailable semantics all verify correctly against the code. Three issues need fixing before merge.

Bug 1 — Key size is wrong (high)

docs/admin.md line 45:

Cluster-shared base64-encoded HS256 key (≥ 32 raw bytes / 44 base64 chars)

internal/admin/config.go line 14 and comment at line 38–39:

const sessionSigningKeyLen = 64   // must decode to exactly 64 bytes

main.go line 123 also says "64 bytes decoded". The code rejects any key that isn't exactly 64 bytes; a 32-byte key fails validation with a confusing error. Correct the doc to say "exactly 64 raw bytes / 88 base64 chars (standard) or 86 base64 chars (URL/raw)".

Fix this →


Bug 2 — Broken link in the Cross-references section (moderate)

docs/admin.md line 249:

- Design rationale: [docs/design/2026_04_24_partial_admin_dashboard.md](design/2026_04_24_partial_admin_dashboard.md)

The actual file on disk is docs/design/2026_04_24_proposed_admin_dashboard.md (the _proposed_ lifecycle marker). A _partial_ variant doesn't exist yet. The intro text at line 5 of the same file already uses the correct _proposed_ filename, so this is a copy-paste stale edit.

Fix this →


Bug 3 — Audit log sample shows fields that aren't emitted (moderate)

docs/admin.md lines 182–183:

admin_audit actor=AKIA_ADMIN role=full method=POST path=/admin/api/v1/dynamo/tables status=201 duration=8.2ms

The actual slog entries in internal/admin/s3_handler.go:299–303 and internal/admin/forward_server.go:170–176 use operation= + a resource key (bucket= / table=), not method=, path=, status=, or duration=. For example:

admin_audit actor=AKIA_ADMIN role=full operation=create_bucket bucket=my-bucket

and for forwarded writes:

admin_audit actor=AKIA_ADMIN role=full forwarded_from=n2 operation=delete_table table=orders

If operators build log parsers or dashboards against the documented shape, they will find zero matches. The forwarded_from description is correct; only the sample shape needs updating.

Fix this →


Nit — Misleading "data-plane" label in troubleshooting (minor)

docs/admin.md line 231:

check Raft leader status via the data-plane /admin/api/v1/cluster endpoint

/admin/api/v1/cluster is an admin-plane endpoint, not the data-plane. "data-plane" here is confusing (it implies the DynamoDB/S3/Redis ports). Drop "data-plane" or replace with "admin".


Verified correct ✓

Claim Source Result
All -admin* flag names main.go:117–128 ✓ exact match
admin_session / admin_csrf / X-Admin-CSRF internal/admin/middleware.go:19–21
bucket_not_empty JSON error code (snake_case) internal/admin/s3_handler.go:419
Credentials file JSON shape main_sigv4_creds.go:16,20–21
TLS 1.2+ enforced main_admin.go:536
leader_unavailable 503 + Retry-After: 1 internal/admin/s3_handler.go:411–413
forwarded_from log field internal/admin/forward_server.go:96,173
private / public-read ACLs adapter/s3.go:96–97
proto/admin_forward.proto exists repo root
internal/admin/dist/index.html exists repo root
web/admin directory exists repo root
ELASTICKV_ADMIN_SESSION_SIGNING_KEY env var main_admin.go:25

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a comprehensive operator guide for the elastickv admin dashboard, detailing its configuration, security model, and operational behavior. The review feedback correctly identifies a terminology contradiction regarding data-plane vs. admin endpoints and an inconsistent filename reference in the design rationale section.

Comment thread docs/admin.md Outdated
Comment on lines +230 to +231
two seconds, check Raft leader status via the data-plane
`/admin/api/v1/cluster` endpoint or `cmd/elastickv-admin`.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Referring to the /admin/api/v1/cluster endpoint as "data-plane" is confusing and contradictory, as the introduction (line 16) explicitly distinguishes the admin listener from the data-plane endpoints (DynamoDB and S3).

Suggested change
two seconds, check Raft leader status via the data-plane
`/admin/api/v1/cluster` endpoint or `cmd/elastickv-admin`.
two seconds, check Raft leader status via the
/admin/api/v1/cluster endpoint or cmd/elastickv-admin.

Comment thread docs/admin.md Outdated

## Cross-references

- Design rationale: [docs/design/2026_04_24_partial_admin_dashboard.md](design/2026_04_24_partial_admin_dashboard.md)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The design document filename here (2026_04_24_partial_admin_dashboard.md) is inconsistent with the reference at the beginning of the file and in the pull request description (2026_04_24_proposed_admin_dashboard.md).

Suggested change
- Design rationale: [docs/design/2026_04_24_partial_admin_dashboard.md](design/2026_04_24_partial_admin_dashboard.md)
- Design rationale: [docs/design/2026_04_24_proposed_admin_dashboard.md](design/2026_04_24_proposed_admin_dashboard.md)

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a5980e6b51

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread docs/admin.md Outdated
| Flag | Description |
|---|---|
| `-adminEnabled` | Master on/off switch. Default `false`. |
| `-adminSessionSigningKey` *or* `-adminSessionSigningKeyFile` *or* `ELASTICKV_ADMIN_SESSION_SIGNING_KEY` | Cluster-shared base64-encoded HS256 key (≥ 32 raw bytes / 44 base64 chars). **Must be the same on every node** — JWTs minted by node A are verified by node B during follower→leader forwarding, so a mismatch breaks the dashboard's read paths on follower nodes. The `*File` / env-var forms keep the secret out of `/proc/<pid>/cmdline`. |
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge State the exact 64-byte session key requirement

This row says the admin signing key can be ≥ 32 raw bytes / 44 base64 chars, but the server currently hard-fails unless the decoded key is exactly 64 bytes (sessionSigningKeyLen = 64 in internal/admin/config.go via decodeSigningKey). Operators who follow this guidance with a 32-byte key will be unable to start the admin listener.

Useful? React with 👍 / 👎.

Comment thread docs/admin.md Outdated
Comment on lines +215 to +217
`-adminReadOnlyAccessKeys`. The dashboard does not distinguish the
two cases on the wire — both produce 401 — but the leader's audit
log shows the precise reason.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Document 403 for keys missing from admin allow-lists

The troubleshooting text says both bad secrets and non-allowlisted access keys return 401, but AuthService.authenticate distinguishes them: invalid key/secret returns 401 invalid_credentials, while a valid credential not present in -adminFullAccessKeys/-adminReadOnlyAccessKeys returns 403 forbidden (internal/admin/auth_handler.go). This mismatch can send operators down the wrong path when debugging login failures.

Useful? React with 👍 / 👎.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 26, 2026

@claude review

303f8f9 (rebased onto updated #673 feat/admin-s3-forward) addresses the four findings:

  1. High: HS256 key size now correctly says "exactly 64 raw bytes — 88 base64 chars (standard padding) or 86 (RawURLEncoding)". The validator in internal/admin/config.go enforces exactly 64; the prior "≥ 32 / 44 chars" would have produced a confusing startup error.
  2. Moderate: cross-reference link uses _proposed_ (matches main today) with an inline note that the link will follow once docs(design): promote admin dashboard from proposed to partial #675 lands.
  3. Moderate: audit log sample replaced with two real shapes — leader-direct (operation=create_bucket bucket=...) and forwarded (forwarded_from=n2 operation=put_bucket_acl bucket=... acl=...). An operator building a log parser against the documented contract will now find matches.
  4. Nit: "data-plane" label for /admin/api/v1/cluster replaced with "admin".

bootjp added a commit that referenced this pull request Apr 26, 2026
…ane label

Three substantive findings + one nit from claude review on #674:

1. **HS256 key size (high)**: said "≥ 32 raw bytes / 44 base64 chars"
   but the validator in internal/admin/config.go enforces exactly 64
   raw bytes. A 32-byte key fails startup with a confusing error.
   Updated to "exactly 64 raw bytes — 88 base64 chars (standard
   padding) or 86 (RawURLEncoding)". Operators copying the doc into
   key-generation tooling now produce a key that actually validates.

2. **Audit log sample (moderate)**: showed method=POST / path=... /
   status=201 / duration=8.2ms — none of those fields are emitted.
   The actual slog entries use operation= + a resource key
   (bucket= or table=). Replaced the example with two real shapes
   (leader-direct and forwarded), so an operator building a log
   parser against the documented contract finds matches.

3. **Cross-reference link (moderate)**: line 249 used the
   _partial_ filename for the design doc, but #675 (the rename PR)
   hasnt landed yet. Point to _proposed_ (matches main today)
   with an inline note that the link will follow once #675 lands.
   The intro paragraph at line 5 already used _proposed_, so this
   removes the inconsistency claude flagged.

4. **"data-plane" label nit**: /admin/api/v1/cluster is an admin-
   plane endpoint, not data-plane. "data-plane" implied the
   DynamoDB / S3 / Redis ports. Replaced with "admin".

The cosmetic Last-updated / Status verbosity items on #675 are
tracked separately.
@bootjp bootjp force-pushed the docs/admin-operator-guide branch from a5980e6 to 303f8f9 Compare April 26, 2026 12:29
@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 3m 51s —— View job


Review

  • Read updated docs/admin.md (commit 303f8f93)
  • Verify fix 1: HS256 key size (64 bytes) — confirmed ✓
  • Verify fix 2: Cross-reference link (_proposed_) — confirmed ✓
  • Verify fix 3: Audit log sample shapes — partially correct; new finding below
  • Verify fix 4: "data-plane" label removed — confirmed ✓
  • Check Codex P2 finding (401 vs 403 distinction) — still not addressed; details below

Confirmed fixes ✓

All four stated fixes are present and correct against the code.

Fix Claim Source Result
1 (High) "exactly 64 raw bytes / 88 base64 (std) or 86 (RawURLEncoding)" config.go:14 sessionSigningKeyLen = 64; all four base64 encodings tried at config.go:196–211
2 (Moderate) Link uses _proposed_ with inline note re PR #675 Design doc exists only as _proposed_ today
3 (Moderate) Handler-level samples use operation=create_bucket bucket=… and forwarded operation=put_bucket_acl bucket=… acl=… s3_handler.go:299–303, forward_server.go:391–397 ✓ (partial — see Bug A below)
4 (Nit) "admin /admin/api/v1/cluster endpoint" — no "data-plane" label docs/admin.md:241

Bug A — delete_table "leader-direct" sample doesn't exist on the direct path (moderate)

docs/admin.md lines 181–187:

**Leader-direct write** (`DynamoHandler` / `S3Handler` after a successful mutation):

admin_audit actor=AKIA_ADMIN role=full operation=create_bucket bucket=my-bucket
admin_audit actor=AKIA_ADMIN role=full operation=delete_table table=orders

DynamoHandler.handleDelete (dynamo_handler.go:379–396) calls w.WriteHeader(http.StatusNoContent) and returns — there is no admin_audit slog call. The operation=delete_table table=… shape only appears when the call was forwarded through a follower (it comes from forward_server.go:224–230).

The Audit middleware (middleware.go:206–232) does emit a second admin_audit line for every non-GET state-changing request, but it uses method=, path=, status=, remote=, duration= — not operation= / table=. (This is actually the shape the original doc was showing before my first review, and it was valid.) For a direct-leader DynamoDB delete the line operators will actually see is:

admin_audit actor=AKIA_ADMIN role=full method=DELETE path=/admin/api/v1/dynamo/tables/orders status=204 remote=127.0.0.1:… duration=1.2ms

The recommended fix is to either: (a) distinguish the sample by operation source more precisely ("S3 handler-level" vs middleware/forwarded), or (b) add the handler-level admin_audit call to DynamoHandler.handleCreate and handleDelete to make the direct path symmetric with the S3 handler. The middleware shape should also be documented — it fires for all operations including S3, so operators would see two admin_audit lines per S3 mutation.

Fix this →


Bug B — Login section incorrectly says allow-list rejections return 401 (moderate, Codex P2 — still open)

docs/admin.md lines 221–227:

The access key + secret pair did not match the credentials file, or the key is not listed in -adminFullAccessKeys / -adminReadOnlyAccessKeys. The dashboard does not distinguish the two cases on the wire — both produce 401

auth_handler.go:336–345:

if !known || !match {
    writeJSONError(w, http.StatusUnauthorized, "invalid_credentials", ...)  // 401
    return AuthPrincipal{}, false
}
role, ok := s.roles[req.AccessKey]
if !ok {
    writeJSONError(w, http.StatusForbidden, "forbidden",                    // 403
        "access_key is not authorised for admin access")
    return AuthPrincipal{}, false
}

The code is unambiguous: a valid key+secret pair whose access key is absent from both allow-lists returns 403 forbidden, not 401. An operator who successfully added their credentials to the SigV4 file but forgot to add the key to -adminFullAccessKeys or -adminReadOnlyAccessKeys will see 403 — but the troubleshooting guide tells them to check their credentials (401), sending them in the wrong direction. This was flagged by Codex in the previous round and is still outstanding.

Recommended fix: split the entry into two separate troubleshooting items.

Fix this →


Verified correct ✓

All items from the previous round still hold after verifying against the current code. Fix 1 math checks out (64 bytes × 4/3 = 85.33 → 88 chars standard, 86 raw). Fix 3's S3 samples and the forwarded shape are exact matches.


@bootjp bootjp force-pushed the feat/admin-s3-forward branch from e5ef0a9 to cafac05 Compare April 26, 2026 12:32
bootjp added a commit that referenced this pull request Apr 26, 2026
…ane label

Three substantive findings + one nit from claude review on #674:

1. **HS256 key size (high)**: said "≥ 32 raw bytes / 44 base64 chars"
   but the validator in internal/admin/config.go enforces exactly 64
   raw bytes. A 32-byte key fails startup with a confusing error.
   Updated to "exactly 64 raw bytes — 88 base64 chars (standard
   padding) or 86 (RawURLEncoding)". Operators copying the doc into
   key-generation tooling now produce a key that actually validates.

2. **Audit log sample (moderate)**: showed method=POST / path=... /
   status=201 / duration=8.2ms — none of those fields are emitted.
   The actual slog entries use operation= + a resource key
   (bucket= or table=). Replaced the example with two real shapes
   (leader-direct and forwarded), so an operator building a log
   parser against the documented contract finds matches.

3. **Cross-reference link (moderate)**: line 249 used the
   _partial_ filename for the design doc, but #675 (the rename PR)
   hasnt landed yet. Point to _proposed_ (matches main today)
   with an inline note that the link will follow once #675 lands.
   The intro paragraph at line 5 already used _proposed_, so this
   removes the inconsistency claude flagged.

4. **"data-plane" label nit**: /admin/api/v1/cluster is an admin-
   plane endpoint, not data-plane. "data-plane" implied the
   DynamoDB / S3 / Redis ports. Replaced with "admin".

The cosmetic Last-updated / Status verbosity items on #675 are
tracked separately.
@bootjp bootjp force-pushed the docs/admin-operator-guide branch from 303f8f9 to b9b0b0c Compare April 26, 2026 12:33
bootjp added a commit that referenced this pull request Apr 26, 2026
Two findings from the latest claude review on #674:

1. **Audit log shapes (Bug A)**: the previous example claimed
   `admin_audit ... operation=delete_table table=orders` was a
   leader-direct sample, but `DynamoHandler.handleDelete`
   (dynamo_handler.go:379-396) does not emit any handler-level
   admin_audit — that shape only appears via `ForwardServer` on
   the forwarded path. It also omitted the `Audit` middleware line
   (middleware.go:206-232) which wraps every non-GET admin request
   with method= / path= / status= / remote= / duration=. An
   operator building a parser against the doc would have missed
   half of the actual lines.

   Rewrote the section as three explicit shapes — Audit middleware
   (always), S3Handler op line (only emits from s3_handler.go:299
   /333/355), ForwardServer op line (covers both Dynamo and S3
   forwarded paths). Removed the misleading `delete_table`
   leader-direct sample and replaced the Dynamo examples with
   forwarded-only ones, which is where they actually originate.
   Added a one-sentence note that a single request typically
   produces two lines (op-specific + middleware) so log parsers
   know to treat the key as a union.

2. **Login 401 vs 403 (Codex P2)**: the troubleshooting entry said
   "both produce 401" but auth_handler.go:336-345 returns 403
   forbidden when the credentials match but the access key is not
   in `-adminFullAccessKeys` / `-adminReadOnlyAccessKeys`. Split
   into two entries: 401 invalid_credentials (bad access_key or
   secret) and 403 forbidden (valid SigV4 creds, but no admin
   role assignment), with the remediation specific to each case.

No functional changes. Doc-only.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 26, 2026

@claude review

Addressed both findings from the previous review:

  1. Audit log shapes (Bug A) — rewrote the section as three explicit shapes: Audit middleware (always wraps non-GET requests), S3Handler operation line (only S3 admin path emits these), ForwardServer operation line (covers both Dynamo and S3 forwarded paths). Removed the misleading delete_table leader-direct sample and added a one-sentence note that a single request typically produces two lines.
  2. Login 401 vs 403 (Codex P2) — split into two troubleshooting entries: 401 invalid_credentials (bad access_key or secret) and 403 forbidden (valid creds but no admin role assignment), each with the remediation specific to that case.

Latest commit: 05560e7.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 1m 47s —— View job


Review

  • Read updated docs/admin.md (commit 05560e79)
  • Verify Bug A fix: audit log three-shape documentation
  • Verify Bug B / Codex P2 fix: 401 vs 403 login distinction
  • Check for any new issues
  • Post review feedback

Both fixes are correct and verified against the code. LGTM.

Bug A — Audit log shapes ✓ confirmed fixed

The section now documents three shapes:

Shape Source verified Result
Audit middleware (method=, path=, status=, remote=, duration=) middleware.go:221–229 ✓ exact match
S3Handler operation lines (operation=create_bucket/put_bucket_acl/delete_bucket, bucket=, acl=) s3_handler.go:299–304, 333–339, 355–360 ✓ exact match
ForwardServer operation lines (forwarded_from=, operation=, table= / bucket=, acl=) forward_server.go:170–176, 224–230, 391–398 ✓ exact match

The note at line 180 that "a single state-changing request typically produces two audit lines" is accurate. The qualifier "typically" correctly covers the edge case where a leader-direct DynamoDB write produces only the middleware line (no DynamoHandler operation line), while S3 and forwarded DynamoDB writes produce two.

Bug B — 401 vs 403 at login ✓ confirmed fixed

The troubleshooting section now has two separate entries:

  • 401 invalid_credentials — access key unknown or secret mismatch → remediation: check -s3CredentialsFile
  • 403 forbidden — credentials valid, key absent from both allow-lists → remediation: add key to role flag + restart nodes ✓

Both match auth_handler.go:336–346 exactly.

No new findings

All items from prior rounds remain confirmed. The doc is ready to merge once the stacked #669 / #673 dependencies land.


Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 05560e795e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread docs/admin.md Outdated
|---|---|
| `-adminListen` | host:port for the admin listener. Defaults to `127.0.0.1:8080`. |
| `-adminTLSCertFile` / `-adminTLSKeyFile` | PEM cert + key. Both must be set together; a partial config fails validation at startup. |
| `-adminAllowPlaintextNonLoopback` | Explicit opt-out for the non-loopback-without-TLS startup hard-error. **Strongly discouraged** — enables the dashboard to mint cookies without the `Secure` attribute and ship session JWTs over plaintext. Use only for short-lived test rigs you control. |
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Correct plaintext non-loopback cookie behavior

This line says -adminAllowPlaintextNonLoopback makes the dashboard mint cookies without Secure, but that flag only bypasses the startup TLS guard (internal/admin/config.go); cookie security is controlled separately by -adminAllowInsecureDevCookie (main_admin.go passes it to AuthServiceOpts.InsecureCookie, and buildCookie uses that to set Secure). In the documented non-loopback HTTP scenario, operators who set only -adminAllowPlaintextNonLoopback can end up with Secure cookies that browsers won't send over plaintext, causing login/session behavior to fail contrary to this guidance.

Useful? React with 👍 / 👎.

Comment thread docs/admin.md Outdated
Comment on lines +171 to +172
header. The SPA's API client honours `Retry-After` and re-issues
the request once. Operators investigating "intermittent 503s"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Remove unsupported automatic Retry-After retry claim

The guide claims the SPA API client honors Retry-After and retries once, but the frontend client currently performs a single fetch call with no retry path (web/admin/src/api/client.ts, and useApi also has no 503 retry logic). During leader elections, this mismatch changes operator expectations and can lead to unresolved write failures because users wait for an automatic retry that never happens.

Useful? React with 👍 / 👎.

bootjp added 3 commits April 26, 2026 21:53
P4 deliverable from docs/design/2026_04_24_proposed_admin_dashboard.md
Section 8: a single self-contained operator-facing reference for
the admin HTTP listener.

Covers:
- Quick-start invocation for a loopback dev cluster
- Required + optional flag reference, with explanations of why each
  guard exists (TLS hard-error, rolling-update caveats, etc.)
- TLS topologies (loopback / TLS / discouraged plaintext-non-loopback)
- Role model + how live role re-validation works on every state-
  changing request
- The full /admin/api/v1/* surface (auth + cluster + dynamo +
  s3, including the slice 2 write paths and the AdminForward
  forwarding contract)
- forwarded_from audit log shape and why it carries the follower's
  node ID
- Troubleshooting guide for the common failure modes operators
  hit during initial bring-up (missing credentials, TLS hard-error,
  401 ambiguity, 503 leader_unavailable, bucket_not_empty,
  blank-screen / placeholder bundle)
- Cross-references to the design doc and architecture overview

The Section 8 P4 plan also called out "TLS, read-only role, CSRF" as
deliverables; those are already implemented (see config.go's
validateTLS / validateAccessKeyRoles, the role gates in
DynamoHandler.principalForWrite + S3Handler.principalForWrite, and
the CSRFDoubleSubmit middleware. This doc stitches them into a
single reference operators can land on without reading code.

Stacked on #669 (P2 slice 2a) + #673 (P2 slice 2b) so the API-
surface table can describe S3 write endpoints as shipped. Once both
land in main, this rebases cleanly.
EOF
)
…ane label

Three substantive findings + one nit from claude review on #674:

1. **HS256 key size (high)**: said "≥ 32 raw bytes / 44 base64 chars"
   but the validator in internal/admin/config.go enforces exactly 64
   raw bytes. A 32-byte key fails startup with a confusing error.
   Updated to "exactly 64 raw bytes — 88 base64 chars (standard
   padding) or 86 (RawURLEncoding)". Operators copying the doc into
   key-generation tooling now produce a key that actually validates.

2. **Audit log sample (moderate)**: showed method=POST / path=... /
   status=201 / duration=8.2ms — none of those fields are emitted.
   The actual slog entries use operation= + a resource key
   (bucket= or table=). Replaced the example with two real shapes
   (leader-direct and forwarded), so an operator building a log
   parser against the documented contract finds matches.

3. **Cross-reference link (moderate)**: line 249 used the
   _partial_ filename for the design doc, but #675 (the rename PR)
   hasnt landed yet. Point to _proposed_ (matches main today)
   with an inline note that the link will follow once #675 lands.
   The intro paragraph at line 5 already used _proposed_, so this
   removes the inconsistency claude flagged.

4. **"data-plane" label nit**: /admin/api/v1/cluster is an admin-
   plane endpoint, not data-plane. "data-plane" implied the
   DynamoDB / S3 / Redis ports. Replaced with "admin".

The cosmetic Last-updated / Status verbosity items on #675 are
tracked separately.
Two findings from the latest claude review on #674:

1. **Audit log shapes (Bug A)**: the previous example claimed
   `admin_audit ... operation=delete_table table=orders` was a
   leader-direct sample, but `DynamoHandler.handleDelete`
   (dynamo_handler.go:379-396) does not emit any handler-level
   admin_audit — that shape only appears via `ForwardServer` on
   the forwarded path. It also omitted the `Audit` middleware line
   (middleware.go:206-232) which wraps every non-GET admin request
   with method= / path= / status= / remote= / duration=. An
   operator building a parser against the doc would have missed
   half of the actual lines.

   Rewrote the section as three explicit shapes — Audit middleware
   (always), S3Handler op line (only emits from s3_handler.go:299
   /333/355), ForwardServer op line (covers both Dynamo and S3
   forwarded paths). Removed the misleading `delete_table`
   leader-direct sample and replaced the Dynamo examples with
   forwarded-only ones, which is where they actually originate.
   Added a one-sentence note that a single request typically
   produces two lines (op-specific + middleware) so log parsers
   know to treat the key as a union.

2. **Login 401 vs 403 (Codex P2)**: the troubleshooting entry said
   "both produce 401" but auth_handler.go:336-345 returns 403
   forbidden when the credentials match but the access key is not
   in `-adminFullAccessKeys` / `-adminReadOnlyAccessKeys`. Split
   into two entries: 401 invalid_credentials (bad access_key or
   secret) and 403 forbidden (valid SigV4 creds, but no admin
   role assignment), with the remediation specific to each case.

No functional changes. Doc-only.
@bootjp bootjp force-pushed the docs/admin-operator-guide branch from 05560e7 to c9da923 Compare April 26, 2026 12:54
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c9da9235a7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread docs/admin.md Outdated
Comment on lines +186 to +188
**`Audit` middleware** — emitted for every non-GET/HEAD/OPTIONS
request that reaches the admin mux on this node, regardless of which
handler served it. Always present on the node that received the HTTP
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Correct audit coverage statement for write requests

This text says the Audit middleware is emitted for every non-GET/HEAD/OPTIONS request regardless of handler, but routing does not do that: /auth/logout is explicitly wired through protectNoAudit (no Audit wrapper), and the protected chain runs SessionAuth before Audit, so unauthenticated write attempts are rejected before any middleware audit line is produced (internal/admin/server.go). As written, operators may incorrectly assume all write attempts are guaranteed to appear in admin_audit logs.

Useful? React with 👍 / 👎.

Two Codex P2 findings on the latest review of #674:

1. **Cookie Secure attribute (line 55 + the TLS topology
   section)**: I wrote that `-adminAllowPlaintextNonLoopback`
   "enables the dashboard to mint cookies without the `Secure`
   attribute". That conflates two flags. In the actual wiring
   (auth_handler.go:114 + auth_handler.go:458-468 +
   main_admin.go:597), `secureCookie` is `!opts.InsecureCookie`
   which comes from `-adminAllowInsecureDevCookie`, **not**
   `-adminAllowPlaintextNonLoopback`. The latter only bypasses
   the startup TLS guard (config.go:112).

   Operationally this is the difference between "fails closed"
   and "broken end-to-end": an operator who reads the old doc
   and sets only `-adminAllowPlaintextNonLoopback` gets a
   listener that accepts plaintext, but the dashboard still
   mints `Secure=true` cookies that the browser refuses to send
   back over the plaintext channel — login appears to succeed
   and every subsequent request 401s, exactly the bug Codex was
   guarding against.

   Rewrote the line-55 entry to scope `-adminAllowPlaintextNonLoopback`
   to the listener TLS guard only, and added the cross-reference
   to `-adminAllowInsecureDevCookie` for the cookie attribute.
   Updated the loopback section to drop the misleading "Secure
   regardless" claim and explain when a loopback rig actually
   needs the dev-cookie flag. Added a paragraph at the bottom
   of "Discouraged: plaintext non-loopback" calling out that a
   working plaintext rig needs both flags, and that the two are
   deliberately separate so misconfiguration fails closed on
   either axis instead of silently downgrading both at once.

2. **SPA Retry-After claim (line 172)**: I wrote that "the SPA's
   API client honours Retry-After and re-issues the request
   once." Codex caught that web/admin/src/api/client.ts:75 does
   a single `fetch` call with no retry path, and useApi.ts has
   no 503 retry logic either. The doc was setting an operator
   expectation the code does not meet — under leader election a
   user would see the 503 surface directly and conclude
   "intermittent failure" instead of re-issuing the action.

   Rewrote the section to describe the current behaviour
   accurately: server emits `Retry-After: 1`, but the SPA
   surfaces the 503 directly and the operator must re-issue.
   Kept the header documented because a future client (or an
   external operator script driving the JSON API) can still
   honour it — it is part of the server's contract.

No code changes; doc-only. The cookie wiring claim was verified
against auth_handler.go and main_admin.go; the SPA claim was
verified against web/admin/src/api/client.ts and useApi.ts.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 26, 2026

@claude review

Addressed both Codex P2 findings on c9da923:

  1. Cookie Secure flag wiring (line 55)-adminAllowPlaintextNonLoopback only bypasses the startup TLS guard; cookie Secure is controlled separately by -adminAllowInsecureDevCookie (verified in auth_handler.go:114 + main_admin.go:597). Rewrote the line-55 entry, the loopback section, and the discouraged-plaintext section to scope each flag correctly and call out that a working plaintext rig needs both. Operationally this is the difference between fails-closed and broken-end-to-end (browser refuses to send Secure cookie over HTTP → 401 every request).

  2. SPA Retry-After retry claim (line 172) — verified in web/admin/src/api/client.ts:75 and useApi.ts that there is no automatic retry. Rewrote the Election-period 503 section to describe current behaviour: server emits Retry-After: 1 (kept in the contract for future clients / external scripts), SPA surfaces the 503 directly, operator must re-issue.

Latest commit: 02d52dc. Doc-only; no code changes.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 2m 36s —— View job


Review

  • Read updated docs/admin.md (commit 02d52dc6)
  • Verify Fix 1: Cookie Secure flag wiring (-adminAllowPlaintextNonLoopback vs -adminAllowInsecureDevCookie)
  • Verify Fix 2: SPA Retry-After retry claim (no automatic retry)
  • Check Codex P2 (audit coverage for write requests — logout not audited, unauthenticated writes)
  • Check for any new issues
  • Post review feedback

Both stated fixes are correct and verified. One Codex P2 finding from the c9da9235 round is still open.

Confirmed fixes ✓

Fix Claim Verified against Result
1 — Cookie flag wiring -adminAllowPlaintextNonLoopback bypasses TLS guard only; -adminAllowInsecureDevCookie controls cookie Secure attribute; need both for a working plaintext rig auth_handler.go:114 (secureCookie: !opts.InsecureCookie), main_admin.go:597 (InsecureCookie: adminCfg.AllowInsecureDevCookie)
2 — SPA retry claim "makes a single fetch call with no automatic retry, so the user sees the 503 surfaced directly and must re-issue" client.ts:75 — bare fetch, no retry path in apiFetch or any caller

Bug — Audit coverage claim still overstated (Codex P2 from c9da9235, still open) (moderate)

docs/admin.md line 203–206:

Audit middleware — emitted for every non-GET/HEAD/OPTIONS request that reaches the admin mux on this node, regardless of which handler served it.

Two inaccuracies remain unfixed:

1. Logout is explicitly wired through protectNoAudit.

internal/admin/server.go:262–268:

protectNoAudit := func(next http.Handler) http.Handler {
    return BodyLimit(defaultBodyLimit)(
        SessionAuth(verifier)(
            CSRFDoubleSubmit()(next),
        ),
    )
}
// ...
logoutChain := protectNoAudit(logoutHandler)  // line 284

/auth/logout is a non-GET POST, and it does not pass through the Audit middleware. It emits exactly one admin_audit line (action=logout from auth_handler.go:443) — not two. An operator reading "regardless of which handler served it" may expect a second generic middleware line for logout that will never appear, and build a log parser or alerting rule around it.

2. Unauthenticated write attempts produce zero audit lines.

The protect chain runs SessionAuth before Audit (server.go:247–255):

protect := func(next http.Handler) http.Handler {
    return BodyLimit(defaultBodyLimit)(
        SessionAuth(verifier)(   // rejects first
            Audit(logger)(       // never reached if unauthenticated
                CSRFDoubleSubmit()(next),
            ),
        ),
    )
}

A POST with a missing or expired admin_session cookie is rejected by SessionAuth and returns 401 before Audit runs. The doc's opening sentence ("Every state-changing admin request emits structured slog lines", line 194) and the middleware description ("regardless of which handler served it") imply such attempts are logged. They are not.

Suggested fix for lines 203–206 — scope the claim precisely:

**`Audit` middleware** — emitted for every non-GET/HEAD/OPTIONS request on the
`protect` chain (all endpoints except `/auth/login` and `/auth/logout`, which
carry their own audit paths). Requires a valid session: `SessionAuth` runs
before `Audit`, so unauthenticated write attempts (missing or expired session
cookie) are rejected before any `Audit` line is produced.

And the opening sentence at line 194 should drop "Every" or add the qualifier: "Every authenticated state-changing admin request emits …".

Fix this →


Verified correct ✓

All items from prior rounds still hold.

Claim Source Result
All -admin* flag names main.go:117–128
HS256 key exactly 64 bytes / 88 base64 std config.go:14
admin_session / admin_csrf / X-Admin-CSRF middleware.go:19–21
bucket_not_empty JSON error code s3_handler.go:419
Credentials file JSON shape main_sigv4_creds.go:16,20–21
TLS 1.2+ enforced main_admin.go:536
leader_unavailable 503 + Retry-After: 1 s3_handler.go:411–413
S3Handler operation shapes s3_handler.go:299–304,333–339,355–360
ForwardServer shapes forward_server.go:170–176,224–230,391–398
401 vs 403 at login auth_handler.go:336–346
secureCookie: !opts.InsecureCookie auth_handler.go:114
InsecureCookie wired from adminCfg.AllowInsecureDevCookie main_admin.go:597
SPA single fetch, no retry client.ts:75
design/2026_04_24_proposed_admin_dashboard.md link repo

Codex P2 on the latest review of #674: the audit-log section
said the Audit middleware "is emitted for every non-GET/HEAD/
OPTIONS request that reaches the admin mux on this node,
regardless of which handler served it." That overstates coverage:

- `/auth/login` is wired through `publicAuth`, not `protect` —
  the middleware can't identify the actor without a pre-existing
  session, so it is intentionally skipped. AuthService emits
  `admin_audit action=login` directly (auth_handler.go:432).
- `/auth/logout` is wired through `protectNoAudit` — the
  middleware is omitted to avoid emitting two audit lines per
  logout (a generic one plus the action-specific one); the
  specific line is the one operators want (auth_handler.go:442).
- The `protect` chain is `BodyLimit → SessionAuth → Audit →
  CSRFDoubleSubmit → handler` (server.go:247-255). Audit only
  fires AFTER SessionAuth accepts the session, so an
  unauthenticated write attempt (missing / expired / invalid
  cookie) gets 401'd before reaching Audit and produces no
  middleware line. CSRF-rejected requests still produce one,
  by design, because the actor is known by then.

An operator building log-monitoring on top of the previous text
could conclude that every write attempt is guaranteed to
appear in `admin_audit`, then miss CSRF-bypassing probes
against an unauthenticated endpoint.

Rewrote the middleware-shape section to:
- Scope the rule to the protected mux chain.
- Explain the ordering choice (Audit between SessionAuth and
  CSRFDoubleSubmit) so the CSRF-rejection-still-audits guarantee
  is preserved in the doc.
- Enumerate the two auth endpoints that bypass the middleware
  and what they emit instead.
- Adjust the intro paragraph: protected-chain mutations produce
  two audit lines, auth endpoints produce one (was: "every
  state-changing request typically produces two").

No behaviour change; doc-only. Verified against
internal/admin/server.go:236-296 and
internal/admin/auth_handler.go:432-449.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 26, 2026

@claude review

Addressed the new Codex P2 finding on docs/admin.md:205 (audit middleware coverage):

The previous text overstated coverage by claiming the Audit middleware fires for every non-GET write. Verified in internal/admin/server.go:236-296:

  • /auth/login uses publicAuth (no Audit) — AuthService emits its own action=login line.
  • /auth/logout uses protectNoAudit (Audit deliberately omitted) — AuthService emits action=logout; wrapping Audit too would produce two lines per logout.
  • The protect chain is BodyLimit → SessionAuth → Audit → CSRFDoubleSubmit, so unauthenticated requests are rejected at SessionAuth before reaching Audit. CSRF-rejected requests still produce a middleware line (the actor is known by then) — this ordering is preserved in the doc rewrite.

Updated the middleware-shape section to enumerate which endpoints are wrapped and which aren't, plus the intro paragraph (auth endpoints produce one line, protected-chain mutations produce two).

Latest commit: 9010085. Doc-only.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 2m 15s —— View job


Review

  • Read updated docs/admin.md (commit 9010085)
  • Verify audit coverage fix: login/logout/protect chain enumeration
  • Verify internal/admin/server.go wiring matches doc claims
  • Check middleware.go Audit field set against middleware sample
  • Check for any remaining issues
  • Post review feedback

The Codex P2 audit-coverage finding is correctly addressed. One new bug in the middleware sample path — introduced in this round.

Confirmed fixes ✓

Claim Source Result
protect chain: BodyLimit → SessionAuth → Audit → CSRFDoubleSubmit server.go:247–254
CSRF-rejected requests still produce an audit line (Audit before CSRF) server.go:241–254 (Audit wraps CSRFDoubleSubmit, so CSRFDoubleSubmit rejection is captured)
Unauthenticated requests rejected at SessionAuth, never reach Audit server.go:247–254, middleware.go:91–108
/auth/login uses publicAuth (no SessionAuth, no Audit) server.go:277–283
/auth/logout uses protectNoAudit (SessionAuth, no Audit) server.go:262–268, 284
Protected-chain mutations → two audit lines Audit middleware + handler/forwarder operation lines
Auth endpoints → one audit line from AuthService auth_handler.go:432–448
Middleware field set: actor, role, method, path, status, remote, duration middleware.go:221–229
ForwardServer create_table, delete_table, put_bucket_acl, delete_bucket shapes forward_server.go:219–225, 321–328, 376–383

Bug — Wrong path in middleware audit sample (moderate)

docs/admin.md line 229:

admin_audit actor=AKIA_ADMIN role=full method=POST path=/admin/api/v1/buckets status=201 ...

The path constant is pathS3Buckets = "/admin/api/v1/s3/buckets" (s3_handler.go:26). The sample omits the /s3 segment — an operator who pastes this into a log filter will find no matches. The correct sample is:

admin_audit actor=AKIA_ADMIN role=full method=POST path=/admin/api/v1/s3/buckets status=201 remote=10.0.0.7:51234 duration=8.2ms

Fix this →


Nit — claimed_actor is login-only (minor)

docs/admin.md lines 258–261:

Login and logout emit their own admin_audit lines with action=login / action=logout (plus actor, claimed_actor, remote, status)

claimed_actor is emitted by auditLogin (auth_handler.go:436) but not by auditLogout (auth_handler.go:442–448, which emits only action, actor, remote, status). An operator building a log parser that expects claimed_actor on logout lines will find it missing and may interpret absence as truncated output. The fix is to qualify the description: e.g. "login additionally emits claimed_actor (what the caller said they were)".


Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 02d52dc6e3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread docs/admin.md Outdated
Comment on lines +238 to +240
Login and logout emit their own `admin_audit` lines with
`action=login` / `action=logout` (plus `actor`, `claimed_actor`,
`remote`, `status`) so the JWT's lifetime can be correlated with the
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Fix logout audit schema description

The guide says both login and logout admin_audit entries include claimed_actor, but auditLogout only logs action, actor, remote, and status (internal/admin/auth_handler.go). This mismatch can break log pipelines that validate field presence from this doc and will misclassify normal logout events as malformed.

Useful? React with 👍 / 👎.

Comment thread docs/admin.md Outdated
Comment on lines +196 to +199
wherever the process slog handler is wired). A single state-changing
request typically produces **two** audit lines: one operation-specific
line from the source that performed the mutation, plus one generic
HTTP-shaped line from the `Audit` middleware. The shapes differ by
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Don’t claim write requests usually emit two audit lines

This states a state-changing request typically yields an operation-specific line plus an Audit middleware line, but leader-direct Dynamo mutations only emit the middleware line (there is no Dynamo operation admin_audit emission in internal/admin/dynamo_handler.go). Operators who alert on a two-line pattern will undercount or incorrectly flag normal Dynamo writes.

Useful? React with 👍 / 👎.

bootjp added a commit that referenced this pull request Apr 26, 2026
Per `docs/design/README.md`'s lifecycle convention, the admin dashboard
design doc is now "partial":

- **P1** (DynamoDB CRUD + AdminForward) — shipped via #634, #635, #644,
#648
- **P2** (S3 buckets list/create/delete/ACL + DescribeTable) — shipped
via #658, with #669 + #673 in flight
- **P3** (React SPA + embed) — shipped via #649, #650
- **P4** (TLS / role / CSRF / operator docs) — TLS, role, CSRF are
already live in P1; operator docs in #674

Independent of the in-flight slice 2 PRs (#669/#673) and the docs PR
(#674) — this rename only reflects what is already on main today, plus
an "Implementation status" table mapping each phase to the PR it landed
in.

## What this PR changes

- `git mv` the design doc from `2026_04_24_proposed_admin_dashboard.md`
to `2026_04_24_partial_admin_dashboard.md` so its history follows
- Add an "Implementation status" header table indexing each phase to the
PRs that landed it
- List the outstanding open items so future readers know what is still
owed against the original proposal:
- AdminForward acceptance criterion 5 (rolling-upgrade compat flag) —
deferred
  - S3 object browser — explicitly out of scope per Section 2 Non-goals
  - TLS cert hot-reload — restart-to-rotate is the documented model

When the rolling-upgrade flag lands, the doc gets renamed once more to
`2026_04_24_implemented_admin_dashboard.md` per the README's lifecycle
convention.
Base automatically changed from feat/admin-s3-forward to feat/admin-s3-writes April 26, 2026 13:41
Codex P2 on the latest review of #674: my line 260 said both
login and logout `admin_audit` entries include `claimed_actor`,
but `auditLogout` (auth_handler.go:442-449) only emits `action`,
`actor`, `remote`, and `status`. `claimed_actor` is login-only
because the field exists to distinguish the typed access key
from the authenticated one — logout has no claim to verify (the
actor comes from the already-validated session cookie).

A log pipeline that validated field presence against the
previous text would misclassify every successful logout as
malformed, exactly the operator-facing bug Codex flagged.

Replaced the one-line summary with two concrete sample lines —
one for login (with `claimed_actor`), one for logout (without)
— and a closing sentence telling parsers that `claimed_actor`
is present-only-on-login. Same shape verified against
auth_handler.go:432-449.

No code changes; doc-only.
@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 26, 2026

@claude review

Addressed the Codex P2 finding on docs/admin.md:260:

The previous text said both login and logout admin_audit entries include claimed_actor, but auditLogout (auth_handler.go:442-449) only emits action, actor, remote, status. claimed_actor is login-only because it exists to distinguish the typed key from the authenticated one — logout has no claim to verify.

Replaced the one-line summary with two concrete sample audit lines (login with claimed_actor, logout without) plus a sentence telling parsers claimed_actor is present-only-on-login. Verified against auth_handler.go:432-449.

Latest commit: 85df320. Doc-only.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 2m 4s —— View job


Review

  • Read updated docs/admin.md (commit 85df320f)
  • Verify claimed_actor fix: login-only, logout sample without it
  • Check prior moderate bug: middleware sample path (/admin/api/v1/s3/buckets)
  • Check Codex P2 "two audit lines" claim for leader-direct Dynamo mutations
  • Check auth_handler.go audit shapes
  • Check dynamo_handler.go for any operation-level audit emission
  • Post review feedback

The claimed_actor fix is correct. One moderate bug from the previous round is still not addressed; the Codex P2 "two audit lines" claim is partially improved but the intro text still contradicts the per-source detail.

Confirmed fix ✓ — claimed_actor is login-only

Shape Source verified Result
auditLogin emits claimed_actor auth_handler.go:436
auditLogout does NOT emit claimed_actor auth_handler.go:442–449
Doc lines 266–269 show two distinct samples, claimed_actor present on login, absent on logout docs/admin.md:266–269
"Log parsers consuming this shape should treat claimed_actor as present-only-on-login" docs/admin.md:271–272

Bug — Middleware audit sample path still wrong (moderate, carried from previous round, not yet fixed)

docs/admin.md line 229:

admin_audit actor=AKIA_ADMIN role=full method=POST path=/admin/api/v1/buckets status=201 remote=10.0.0.7:51234 duration=8.2ms

internal/admin/s3_handler.go:26:

pathS3Buckets = "/admin/api/v1/s3/buckets"

The /s3 segment is still missing in the sample path. An operator who copies this into a log filter or alert rule will find zero matches. This was first flagged in the round on commit 9010085 and was not included in the 85df320f changes.

Fix this →


Nit — "Typically two" intro contradicts the per-source detail for leader-direct Dynamo (Codex P2, partially open)

docs/admin.md lines 196–199:

A protected-chain mutation (Dynamo / S3 / cluster / keyviz writes) typically produces two audit lines: one operation-specific line from the source that performed the mutation, plus one generic HTTP-shaped line from the Audit middleware.

docs/admin.md lines 232–235:

S3Handler operation lineOnly the S3 admin path emits these; the DynamoDB admin path relies on the middleware line plus the forwarded line below for its audit trail.

internal/admin/dynamo_handler.go: zero admin_audit slog calls — confirmed by grep.

The per-source section correctly states DynamoDB has no operation-specific line. But the intro's definition of "two" is "operation-specific line + middleware line," which is wrong for leader-direct Dynamo (one line only — the middleware). The "typically" qualifier was meant to cover this, but the two statements still directly contradict for the common case: a direct-leader DynamoDB write. An operator debugging a Dynamo operation audit trail who reads the intro first will expect two lines and spend time hunting a second line that was never emitted.

Suggested fix: change the intro to scope "two lines" to S3 explicitly and flag Dynamo as middleware-only except when forwarded:

A protected-chain S3 mutation typically produces two audit lines: one operation-specific line from S3Handler and one generic HTTP-shaped line from the Audit middleware. DynamoDB mutations produce one middleware line on the leader-direct path; forwarded DynamoDB operations additionally emit a ForwardServer operation line.


Verified correct ✓

All items confirmed across prior rounds still hold.

Claim Source Result
HS256 key exactly 64 bytes / 88 base64 std config.go:14
admin_session / admin_csrf / X-Admin-CSRF middleware.go:19–21
bucket_not_empty JSON error code s3_handler.go:419
TLS 1.2+ enforced main_admin.go:536
leader_unavailable 503 + Retry-After: 1 s3_handler.go:411–413
S3Handler operation shapes s3_handler.go:299–304,333–339,355–360
ForwardServer shapes forward_server.go:170–176,224–230,391–398
401 vs 403 at login auth_handler.go:336–346
secureCookie: !opts.InsecureCookie auth_handler.go:114
SPA single fetch, no retry client.ts:75
Audit middleware chain order (SessionAuth → Audit → CSRF) server.go:247–254
/auth/logout through protectNoAudit server.go:262–268, 284
claimed_actor on login, absent on logout auth_handler.go:432–449

@bootjp bootjp merged commit 07bdc20 into feat/admin-s3-writes Apr 26, 2026
5 checks passed
@bootjp bootjp deleted the docs/admin-operator-guide branch April 26, 2026 17:53
bootjp added a commit that referenced this pull request Apr 26, 2026
…ane label

Three substantive findings + one nit from claude review on #674:

1. **HS256 key size (high)**: said "≥ 32 raw bytes / 44 base64 chars"
   but the validator in internal/admin/config.go enforces exactly 64
   raw bytes. A 32-byte key fails startup with a confusing error.
   Updated to "exactly 64 raw bytes — 88 base64 chars (standard
   padding) or 86 (RawURLEncoding)". Operators copying the doc into
   key-generation tooling now produce a key that actually validates.

2. **Audit log sample (moderate)**: showed method=POST / path=... /
   status=201 / duration=8.2ms — none of those fields are emitted.
   The actual slog entries use operation= + a resource key
   (bucket= or table=). Replaced the example with two real shapes
   (leader-direct and forwarded), so an operator building a log
   parser against the documented contract finds matches.

3. **Cross-reference link (moderate)**: line 249 used the
   _partial_ filename for the design doc, but #675 (the rename PR)
   hasnt landed yet. Point to _proposed_ (matches main today)
   with an inline note that the link will follow once #675 lands.
   The intro paragraph at line 5 already used _proposed_, so this
   removes the inconsistency claude flagged.

4. **"data-plane" label nit**: /admin/api/v1/cluster is an admin-
   plane endpoint, not data-plane. "data-plane" implied the
   DynamoDB / S3 / Redis ports. Replaced with "admin".

The cosmetic Last-updated / Status verbosity items on #675 are
tracked separately.
bootjp added a commit that referenced this pull request Apr 26, 2026
Two findings from the latest claude review on #674:

1. **Audit log shapes (Bug A)**: the previous example claimed
   `admin_audit ... operation=delete_table table=orders` was a
   leader-direct sample, but `DynamoHandler.handleDelete`
   (dynamo_handler.go:379-396) does not emit any handler-level
   admin_audit — that shape only appears via `ForwardServer` on
   the forwarded path. It also omitted the `Audit` middleware line
   (middleware.go:206-232) which wraps every non-GET admin request
   with method= / path= / status= / remote= / duration=. An
   operator building a parser against the doc would have missed
   half of the actual lines.

   Rewrote the section as three explicit shapes — Audit middleware
   (always), S3Handler op line (only emits from s3_handler.go:299
   /333/355), ForwardServer op line (covers both Dynamo and S3
   forwarded paths). Removed the misleading `delete_table`
   leader-direct sample and replaced the Dynamo examples with
   forwarded-only ones, which is where they actually originate.
   Added a one-sentence note that a single request typically
   produces two lines (op-specific + middleware) so log parsers
   know to treat the key as a union.

2. **Login 401 vs 403 (Codex P2)**: the troubleshooting entry said
   "both produce 401" but auth_handler.go:336-345 returns 403
   forbidden when the credentials match but the access key is not
   in `-adminFullAccessKeys` / `-adminReadOnlyAccessKeys`. Split
   into two entries: 401 invalid_credentials (bad access_key or
   secret) and 403 forbidden (valid SigV4 creds, but no admin
   role assignment), with the remediation specific to each case.

No functional changes. Doc-only.
bootjp added a commit that referenced this pull request Apr 26, 2026
Two Codex P2 findings on the latest review of #674:

1. **Cookie Secure attribute (line 55 + the TLS topology
   section)**: I wrote that `-adminAllowPlaintextNonLoopback`
   "enables the dashboard to mint cookies without the `Secure`
   attribute". That conflates two flags. In the actual wiring
   (auth_handler.go:114 + auth_handler.go:458-468 +
   main_admin.go:597), `secureCookie` is `!opts.InsecureCookie`
   which comes from `-adminAllowInsecureDevCookie`, **not**
   `-adminAllowPlaintextNonLoopback`. The latter only bypasses
   the startup TLS guard (config.go:112).

   Operationally this is the difference between "fails closed"
   and "broken end-to-end": an operator who reads the old doc
   and sets only `-adminAllowPlaintextNonLoopback` gets a
   listener that accepts plaintext, but the dashboard still
   mints `Secure=true` cookies that the browser refuses to send
   back over the plaintext channel — login appears to succeed
   and every subsequent request 401s, exactly the bug Codex was
   guarding against.

   Rewrote the line-55 entry to scope `-adminAllowPlaintextNonLoopback`
   to the listener TLS guard only, and added the cross-reference
   to `-adminAllowInsecureDevCookie` for the cookie attribute.
   Updated the loopback section to drop the misleading "Secure
   regardless" claim and explain when a loopback rig actually
   needs the dev-cookie flag. Added a paragraph at the bottom
   of "Discouraged: plaintext non-loopback" calling out that a
   working plaintext rig needs both flags, and that the two are
   deliberately separate so misconfiguration fails closed on
   either axis instead of silently downgrading both at once.

2. **SPA Retry-After claim (line 172)**: I wrote that "the SPA's
   API client honours Retry-After and re-issues the request
   once." Codex caught that web/admin/src/api/client.ts:75 does
   a single `fetch` call with no retry path, and useApi.ts has
   no 503 retry logic either. The doc was setting an operator
   expectation the code does not meet — under leader election a
   user would see the 503 surface directly and conclude
   "intermittent failure" instead of re-issuing the action.

   Rewrote the section to describe the current behaviour
   accurately: server emits `Retry-After: 1`, but the SPA
   surfaces the 503 directly and the operator must re-issue.
   Kept the header documented because a future client (or an
   external operator script driving the JSON API) can still
   honour it — it is part of the server's contract.

No code changes; doc-only. The cookie wiring claim was verified
against auth_handler.go and main_admin.go; the SPA claim was
verified against web/admin/src/api/client.ts and useApi.ts.
bootjp added a commit that referenced this pull request Apr 26, 2026
Codex P2 on the latest review of #674: the audit-log section
said the Audit middleware "is emitted for every non-GET/HEAD/
OPTIONS request that reaches the admin mux on this node,
regardless of which handler served it." That overstates coverage:

- `/auth/login` is wired through `publicAuth`, not `protect` —
  the middleware can't identify the actor without a pre-existing
  session, so it is intentionally skipped. AuthService emits
  `admin_audit action=login` directly (auth_handler.go:432).
- `/auth/logout` is wired through `protectNoAudit` — the
  middleware is omitted to avoid emitting two audit lines per
  logout (a generic one plus the action-specific one); the
  specific line is the one operators want (auth_handler.go:442).
- The `protect` chain is `BodyLimit → SessionAuth → Audit →
  CSRFDoubleSubmit → handler` (server.go:247-255). Audit only
  fires AFTER SessionAuth accepts the session, so an
  unauthenticated write attempt (missing / expired / invalid
  cookie) gets 401'd before reaching Audit and produces no
  middleware line. CSRF-rejected requests still produce one,
  by design, because the actor is known by then.

An operator building log-monitoring on top of the previous text
could conclude that every write attempt is guaranteed to
appear in `admin_audit`, then miss CSRF-bypassing probes
against an unauthenticated endpoint.

Rewrote the middleware-shape section to:
- Scope the rule to the protected mux chain.
- Explain the ordering choice (Audit between SessionAuth and
  CSRFDoubleSubmit) so the CSRF-rejection-still-audits guarantee
  is preserved in the doc.
- Enumerate the two auth endpoints that bypass the middleware
  and what they emit instead.
- Adjust the intro paragraph: protected-chain mutations produce
  two audit lines, auth endpoints produce one (was: "every
  state-changing request typically produces two").

No behaviour change; doc-only. Verified against
internal/admin/server.go:236-296 and
internal/admin/auth_handler.go:432-449.
bootjp added a commit that referenced this pull request Apr 26, 2026
Codex P2 on the latest review of #674: my line 260 said both
login and logout `admin_audit` entries include `claimed_actor`,
but `auditLogout` (auth_handler.go:442-449) only emits `action`,
`actor`, `remote`, and `status`. `claimed_actor` is login-only
because the field exists to distinguish the typed access key
from the authenticated one — logout has no claim to verify (the
actor comes from the already-validated session cookie).

A log pipeline that validated field presence against the
previous text would misclassify every successful logout as
malformed, exactly the operator-facing bug Codex flagged.

Replaced the one-line summary with two concrete sample lines —
one for login (with `claimed_actor`), one for logout (without)
— and a closing sentence telling parsers that `claimed_actor`
is present-only-on-login. Same shape verified against
auth_handler.go:432-449.

No code changes; doc-only.
bootjp added a commit that referenced this pull request Apr 27, 2026
## Summary

State-refresh of `docs/design/2026_04_24_partial_admin_dashboard.md` now
that all four phases of the original P1–P4 plan have shipped.

| Phase | Was | Now |
|---|---|---|
| P1 | shipped | shipped (no change) |
| P2 | partial — read-only #658 only | **shipped** (+ #669 writes + #673
forwarding) |
| P3 | shipped | shipped (no change) |
| P4 | mostly | **shipped** (+ operator doc #674, deployment runbook
#669, script wiring #669+#678) |

Also added the **AdminDeleteBucket TOCTOU** to Outstanding open items —
coderabbitai flagged it during PR #669 review (pre-existing race,
`s3.go:deleteBucket` inherits the same shape; recorded as a code comment
on `AdminDeleteBucket` for the future fix).

Doc stays at `_partial_` rather than promoting to `_implemented_`
because two outstanding items remain: AdminForward acceptance criterion
5 (rolling-upgrade flag, explicitly deferred at design time) and the new
TOCTOU. Updated the closing rename trigger to cover both.

## Test plan

- [x] No design changes — this is a state refresh
- [x] Status table entries cross-checked against the merge commits on
main
- [x] AdminDeleteBucket TOCTOU description matches the code comment in
adapter/s3_admin.go and the operator-side workaround in
docs/admin_deployment.md
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant