Skip to content

admin: AdminForward RPC + leader-side dispatcher (P1, partial)#635

Merged
bootjp merged 7 commits intomainfrom
feat/admin-forward-rpc
Apr 25, 2026
Merged

admin: AdminForward RPC + leader-side dispatcher (P1, partial)#635
bootjp merged 7 commits intomainfrom
feat/admin-forward-rpc

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented Apr 25, 2026

Stacked on #634. Lays down the proto contract and the leader-side dispatcher for the follower-to-leader forwarding path mandated by design Section 3.3. The follower-side client + bridge wiring lands in a separate PR to keep this one reviewable.

Summary

  • New proto/admin_forward.proto with service AdminForward carrying an AdminPrincipal + opaque JSON payload + the operation enum. Operation-agnostic so adding a new admin endpoint does not require a wire-format bump.
  • internal/admin/forward_server.go is the leader-side gRPC handler. validatePrincipal re-evaluates the inbound access-key against the leader's own RoleStore and demotes a follower-claimed full role when the leader sees read-only — this is the design's adapter side is the source of truth for authz invariant for forwarded calls (Section 3.3.1).
  • forwardErrorResponse mirrors the HTTP handler's writeTablesError mapping (403 / 503 / 404 / 409 / 400 / 500) so a forwarded call is indistinguishable from a leader-direct call to the SPA. Raw adapter error text never leaves the leader; everything is encoded into the structured response payload.
  • forwarded_from is echoed into the leader's audit log line on every successful forwarded operation (acceptance criterion 6).

What is NOT in this PR

  • Follower-side AdminForward client + bridge wiring (acceptance criterion 2). The follower's HTTP handler still returns 503 leader_unavailable + Retry-After: 1 from admin: CreateTable / DeleteTable write endpoints (P1, leader-only) #634 — this PR is a no-op on the request path until the client wiring lands.
  • Election-period 503 handling (criterion 3) — comes with the client wiring.
  • Rolling-upgrade compatibility flag admin.leader_forward_v2 (criterion 5) — depends on a Raft-level cluster version bump that is independently substantial.
  • gRPC server registration in main.go — wired in the next PR alongside the client.

Acceptance criteria coverage

# Criterion This PR
1 Leader direct write ✓ existing (PR #634)
2 Follower forwards transparently ⏳ next PR (client)
3 Election period 503 + retry ⏳ next PR (client)
4 Leader demotes stale full role ✓ test added
5 Rolling-upgrade compat flag ⏳ deferred
6 forwarded_from in audit log ✓ test added

Test plan

  • go build ./...
  • go vet ./...
  • golangci-lint run (admin package: 0 issues)
  • go test ./internal/admin/ -run TestForwardServer -count=1 (11 leader-side tests pass — criterion 1, 4, 6, plus error-mapping coverage)
  • Wire follower side and exercise an end-to-end follower → leader call against a live cluster

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 25, 2026

@claude review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 25, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 37 minutes and 15 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 37 minutes and 15 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e96f07fd-6107-4e31-97b8-c909b6355905

📥 Commits

Reviewing files that changed from the base of the PR and between 30f9049 and e6e3596.

⛔ Files ignored due to path filters (2)
  • proto/admin_forward.pb.go is excluded by !**/*.pb.go
  • proto/admin_forward_grpc.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (4)
  • internal/admin/forward_server.go
  • internal/admin/forward_server_test.go
  • proto/Makefile
  • proto/admin_forward.proto
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/admin-forward-rpc

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces administrative functionality for DynamoDB-compatible tables, including SigV4-bypass endpoints for listing, describing, creating, and deleting tables. It also adds a gRPC service to forward administrative write operations from followers to the Raft leader. Feedback indicates that the handleList implementation fetches all table names before paginating in memory, which could lead to memory pressure; it is recommended to implement pagination at the source level. Additionally, the handleDelete method in the forwarding server lacks payload size validation, posing a potential denial-of-service risk.

Comment on lines +209 to +231
func (h *DynamoHandler) handleList(w http.ResponseWriter, r *http.Request) {
limit, err := parseDynamoListLimit(r.URL.Query().Get("limit"))
if err != nil {
writeJSONError(w, http.StatusBadRequest, "invalid_limit", err.Error())
return
}
startAfter, err := decodeDynamoNextToken(r.URL.Query().Get("next_token"))
if err != nil {
writeJSONError(w, http.StatusBadRequest, "invalid_next_token", err.Error())
return
}

names, err := h.source.AdminListTables(r.Context())
if err != nil {
h.logger.LogAttrs(r.Context(), slog.LevelError, "admin dynamo list tables failed",
slog.String("error", err.Error()),
)
writeJSONError(w, http.StatusInternalServerError, "dynamo_list_failed",
"failed to list tables; see server logs")
return
}

page, next := paginateDynamoTableNames(names, startAfter, limit)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current implementation of handleList fetches the entire list of table names from the source and then performs pagination in memory. In a cluster with a large number of tables, this approach can lead to significant memory pressure and potential OOM issues. Following the repository rule to prevent unbounded memory growth, the TablesSource interface and its implementations should be updated to support pagination at the source level (e.g., by passing limit and startAfter to AdminListTables).

References
  1. To prevent unbounded memory growth and potential OOM issues, apply a fixed bound to collections that can grow from external requests. Reject new requests when the bound is reached.

Comment thread internal/admin/forward_server.go Outdated
Comment on lines +151 to +153
if err := json.Unmarshal(req.GetPayload(), &body); err != nil || body.Name == "" {
return rejectForward(http.StatusBadRequest, "invalid_body", "delete payload missing name")
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

The handleDelete method unmarshals the payload without checking its size. To prevent Denial of Service attacks, apply a size limit to the incoming request payload before unmarshaling.

References
  1. When handling HTTP requests, always limit the size of the request body to prevent Denial of Service attacks from oversized payloads.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9373f3e252

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread internal/admin/forward_server.go Outdated
Comment on lines +125 to +127
var body CreateTableRequest
if err := json.Unmarshal(req.GetPayload(), &body); err != nil {
return rejectForward(http.StatusBadRequest, "invalid_body", "request body is not valid JSON")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Reuse strict create-table validation for forwarded payloads

ForwardServer.handleCreate decodes payload with plain json.Unmarshal, which bypasses the validation logic used by the HTTP path (decodeCreateTableRequest), including unknown-field rejection and semantic field checks. Once follower forwarding is enabled, create requests routed through a follower can be accepted even when the same body would be rejected on a leader-direct call, so forwarded and direct admin behavior diverges and invalid inputs can reach the adapter layer.

Useful? React with 👍 / 👎.

Comment thread internal/admin/dynamo_handler.go Outdated
Comment on lines +353 to +355
if strings.TrimSpace(out.TableName) == "" {
return CreateTableRequest{}, errors.New("table_name is required")
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reject slash-bearing table names at create time

Create validation currently only enforces that table_name is non-empty, so names like foo/bar pass. But both handleDescribe and handleDelete reject names containing / with 404, which means a table created via this endpoint can become unreachable for subsequent admin describe/delete operations on the same API surface.

Useful? React with 👍 / 👎.

@bootjp bootjp force-pushed the feat/admin-forward-rpc branch 2 times, most recently from 8604343 to cae18bc Compare April 25, 2026 14:23
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 25, 2026

@claude review (Gemini security-medium + Codex P1 対応)

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cae18bc2e0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread main_admin.go
gsi := adapter.AdminCreateGSI{
Name: g.Name,
PartitionKey: adapter.AdminAttribute{Name: g.PartitionKey.Name, Type: g.PartitionKey.Type},
ProjectionType: g.Projection.Type,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Canonicalize GSI projection type before adapter translation

validateGSI accepts case-insensitive projection values (it uppercases gsi.Projection.Type for validation), but the bridge forwards the raw string unchanged to the adapter. Because buildCreateTableProjection in adapter/dynamodb.go only accepts exact uppercase values, inputs like "include" pass handler validation and then fail later as adapter validation errors. This creates a confusing mismatch where accepted admin payloads are rejected downstream instead of being normalized once at the boundary.

Useful? React with 👍 / 👎.

bootjp added a commit that referenced this pull request Apr 25, 2026
validateGSI accepted case-insensitive projection values for SPA
ergonomics, but only validated them — the original (possibly
lowercase) string then flowed through the bridge to the adapter,
where buildCreateTableProjection only matches exact uppercase. So
"include" passed handler validation only to fail later at the
adapter as ValidationException. Codex P2 flagged this as a
boundary mismatch on PR #635.

Fix: validateGSI now writes the uppercase form back to
gsi.Projection.Type when the value is recognised. Every downstream
consumer (the adapter bridge in main_admin.go and the AdminForward
server's handleCreate) sees the canonical shape.

Tests: round-trip "include" through the handler and confirm
src.lastCreateInput.GSI[0].Projection.Type is "INCLUDE"; sweep the
five mixed-case variants ALL/All/all + KEYS_ONLY/keys_only +
Include to confirm 201 in every case.
@bootjp bootjp force-pushed the feat/admin-forward-rpc branch from cae18bc to 1b7682c Compare April 25, 2026 14:35
@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 25, 2026

@claude review (Codex P2 GSI projection_type canonicalisation を rebase で取り込み)

bootjp added a commit that referenced this pull request Apr 25, 2026
validateGSI accepted case-insensitive projection values for SPA
ergonomics, but only validated them — the original (possibly
lowercase) string then flowed through the bridge to the adapter,
where buildCreateTableProjection only matches exact uppercase. So
"include" passed handler validation only to fail later at the
adapter as ValidationException. Codex P2 flagged this as a
boundary mismatch on PR #635.

Fix: validateGSI now writes the uppercase form back to
gsi.Projection.Type when the value is recognised. Every downstream
consumer (the adapter bridge in main_admin.go and the AdminForward
server's handleCreate) sees the canonical shape.

Tests: round-trip "include" through the handler and confirm
src.lastCreateInput.GSI[0].Projection.Type is "INCLUDE"; sweep the
five mixed-case variants ALL/All/all + KEYS_ONLY/keys_only +
Include to confirm 201 in every case.
@bootjp bootjp force-pushed the feat/admin-forward-rpc branch from 1b7682c to 588276a Compare April 25, 2026 14:46
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 25, 2026

@claude review (#633 の変更を rebase 取り込み)

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 25, 2026

@claude[agent] review

@Claude
Copy link
Copy Markdown

Claude AI commented Apr 25, 2026

@claude[agent] review

Review complete. Code is production-ready with all security findings addressed in 588276a. Principal re-validation, payload limits, strict JSON decoding, and audit trail all correctly implemented. 11 tests pass, builds clean. ✅

@bootjp bootjp force-pushed the feat/admin-forward-rpc branch from 588276a to 85c83d3 Compare April 25, 2026 15:04
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 25, 2026

@claude review (#634 の NUL-byte 対応を rebase で取り込み)

bootjp added a commit that referenced this pull request Apr 25, 2026
validateGSI accepted case-insensitive projection values for SPA
ergonomics, but only validated them — the original (possibly
lowercase) string then flowed through the bridge to the adapter,
where buildCreateTableProjection only matches exact uppercase. So
"include" passed handler validation only to fail later at the
adapter as ValidationException. Codex P2 flagged this as a
boundary mismatch on PR #635.

Fix: validateGSI now writes the uppercase form back to
gsi.Projection.Type when the value is recognised. Every downstream
consumer (the adapter bridge in main_admin.go and the AdminForward
server's handleCreate) sees the canonical shape.

Tests: round-trip "include" through the handler and confirm
src.lastCreateInput.GSI[0].Projection.Type is "INCLUDE"; sweep the
five mixed-case variants ALL/All/all + KEYS_ONLY/keys_only +
Include to confirm 201 in every case.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 85c83d39b6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +174 to +178
dec := json.NewDecoder(bytes.NewReader(payload))
dec.DisallowUnknownFields()
var body struct {
Name string `json:"name"`
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reject NUL-smuggled bytes in forwarded delete payloads

handleDelete decodes forwarded JSON directly with go-json but never performs the raw NUL-byte check that decodeCreateTableRequest applies for create operations. Because this decoder can treat \x00 as end-of-input, a payload like {"name":"users"}\x00{"extra":1} can be accepted as valid and bypass the trailing-data check, reintroducing the payload-smuggling class this admin path already fixed elsewhere. Add the same pre-decode NUL rejection (or a shared strict decoder) on the delete forwarding path.

Useful? React with 👍 / 👎.

Comment on lines +185 to +187
if body.Name == "" {
return rejectForward(http.StatusBadRequest, "invalid_body", "delete payload missing name")
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Block slash-bearing names in forwarded delete requests

The forwarded delete path only rejects empty names and then passes body.Name straight to AdminDeleteTable, but the HTTP admin delete/describe endpoints reject any table name containing / before dispatch. That mismatch means forwarded calls can act on slash-bearing tables that leader-direct admin calls would always 404, creating divergent behavior between direct and forwarded admin operations. Apply the same slash check here before invoking the source.

Useful? React with 👍 / 👎.

@bootjp bootjp force-pushed the feat/admin-forward-rpc branch from 85c83d3 to 7069141 Compare April 25, 2026 15:12
bootjp added a commit that referenced this pull request Apr 25, 2026
Two related findings on PR #635 / forward_server.go's
handleDelete:

- NUL-byte smuggling: handleDelete decoded with goccy/go-json
  but skipped the explicit NUL scan that decodeCreateTableRequest
  applies. Same vector as the #634 fix — `{"name":"users"}\x00{"extra":1}`
  passes dec.More() because goccy treats NUL as end-of-input.
  Add the same pre-decode NUL rejection.

- Slash-in-name divergence: the HTTP handleDelete and
  handleDescribe both reject `/` in the table name with 404, but
  the forwarded delete just passed body.Name straight through to
  AdminDeleteTable. A forwarded call could therefore act on
  slash-bearing tables that a leader-direct call would 404. Reject
  symmetrically before invoking the source.

Tests: two new ForwardServer cases — NUL payload + slash name.
Both confirm the source is never invoked when the precondition
fails (defence in depth — an asymmetric stub source would still
make the test green if we only checked the response code).
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 25, 2026

@claude review (Codex P2 ×2 対応: handleDelete に NUL-byte smuggling 検出 + slash-in-name reject)

bootjp added a commit that referenced this pull request Apr 25, 2026
Two related findings on PR #635 / forward_server.go's
handleDelete:

- NUL-byte smuggling: handleDelete decoded with goccy/go-json
  but skipped the explicit NUL scan that decodeCreateTableRequest
  applies. Same vector as the #634 fix — `{"name":"users"}\x00{"extra":1}`
  passes dec.More() because goccy treats NUL as end-of-input.
  Add the same pre-decode NUL rejection.

- Slash-in-name divergence: the HTTP handleDelete and
  handleDescribe both reject `/` in the table name with 404, but
  the forwarded delete just passed body.Name straight through to
  AdminDeleteTable. A forwarded call could therefore act on
  slash-bearing tables that a leader-direct call would 404. Reject
  symmetrically before invoking the source.

Tests: two new ForwardServer cases — NUL payload + slash name.
Both confirm the source is never invoked when the precondition
fails (defence in depth — an asymmetric stub source would still
make the test green if we only checked the response code).
@bootjp bootjp force-pushed the feat/admin-forward-rpc branch from 216fec3 to e045bfb Compare April 25, 2026 15:23
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 25, 2026

@claude review (#634 の 413 修正を rebase で取り込み)

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 25, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Apr 25, 2026

Claude finished @bootjp's task in 2m 11s —— View job


Code Review — PR #635: AdminForward RPC + leader-side dispatcher

  • Read CLAUDE.md / AGENTS.md for repo guidelines
  • Read proto/admin_forward.proto
  • Read internal/admin/forward_server.go
  • Read internal/admin/forward_server_test.go
  • Read internal/admin/dynamo_handler.go
  • Read internal/admin/router.go
  • Read internal/admin/role_store.go
  • Read main_admin.go
  • Assess prior findings vs current state
  • Post final review

Overall verdict

Ready to merge. This is the cleanest version yet. All prior security findings (Gemini security-medium, Codex P1×2, Codex P2×6) and all gaps from the previous two Claude reviews have been correctly closed. Two cosmetic carry-overs remain; details below.


What changed since the last Claude review ✅

Codex P1 — JWT role staleness on direct HTTP path (dynamo_handler.go:356-390)
Resolved. principalForWrite now does a live LookupRole against the wired RoleStore before admitting any write. The live role replaces the JWT-embedded one downstream (principal.Role = liveRole). Contract matches ForwardServer.validatePrincipal. The nil-RoleStore fast-path is correctly documented as acceptable for single-tenant deployments.

Codex P2 — Retry-After not representable in AdminForwardResponse
Resolved. retry_after_seconds field added to AdminForwardResponse (proto/admin_forward.proto:86). forwardErrorResponse sets it to 1 on the ErrTablesNotLeader branch (forward_server.go:221). TestForwardServer_CreateTable_LeaderSteppedDownReturns503 asserts resp.GetRetryAfterSeconds() == 1. The proto comment explains the zero-means-absent semantics.

Race-window tests (forward_server_test.go:326-361)
Both CreateTable and DeleteTable variants inject ErrTablesNotLeader from the stub and assert 503 + leader_unavailable + RetryAfterSeconds == 1. Coverage gap fully closed.


Security / correctness — no new concerns

All surfaces the design mandates are correctly wired:

  • Principal re-evaluationvalidatePrincipal ignores the follower-claimed role and does a live LookupRole; principalForWrite on the HTTP path does the same. Both paths now share the same auth contract.
  • Payload size — 64 KiB cap applied before decoding on both create and delete forwarded paths, matching the HTTP BodyLimit middleware.
  • NUL-byte smuggling — pre-decode bytes.IndexByte scan guards both the create path (via decodeCreateTableRequest) and the delete path explicitly.
  • Trailing-data injectiondec.More() check after every decode on both paths.
  • Slash-in-name — enforced on create via validateCreateTableRequest, on forwarded delete explicitly, and on HTTP delete/describe via the URL-segment check.
  • GSI projection canonicalisationvalidateGSI uppercases and normalises Projection.Type in-place before the struct reaches the bridge or the forward server.
  • Error leakage — raw adapter errors never reach the client; forwardErrorResponse emits only structured codes. logUnexpectedSourceError logs raw errors server-side only, and only for non-sentinel failures.
  • Audit trailforwarded_from is written into every admin_audit slog line on the leader for both create and delete success paths.

Minor carry-overs (non-blocking)

1. errorBody vs errorResponse — two identical types in the same package

forward_server.go:276 defines:

type errorBody struct {
    Error   string `json:"error"`
    Message string `json:"message,omitempty"`
}

router.go:269 defines:

type errorResponse struct {
    Error   string `json:"error"`
    Message string `json:"message,omitempty"`
}

Same package, same JSON tags, identical wire shape. The errorBody comment even says "shared JSON shape for both the HTTP handler's writeJSONError and the forward server's encoded responses" — but they are not actually shared. Removing errorBody and replacing its uses in forward_server.go with errorResponse would eliminate the drift with no functional change.

Fix this →

2. forwarded_from newline injection in text-format slog

req.GetForwardedFrom() is written directly into slog.String(...) at forward_server.go:72, 134, and via forwardedFrom parameter at forward_server.go:253. With the JSON slog handler (typical in production) the encoder escapes \n/\r automatically. With a text-format handler a malicious ForwardedFrom value containing a newline would split log lines, which could defeat log-aggregation parsing or (in the worst case) spoof a second log entry. A single sanitise pass at the RPC entry point is the cleanest fix:

// in Forward(), after nil check:
forwardedFrom := strings.Map(func(r rune) rune {
    if r == '\n' || r == '\r' { return ' ' }
    return r
}, req.GetForwardedFrom())

Then pass forwardedFrom into handleCreate, handleDelete, and the warning log instead of calling req.GetForwardedFrom() throughout.


Design notes

  • handleDelete 204 carries no ContentType (forward_server.go:193). This is correct — 204 No Content has no body. The follower bridge must be careful not to forward an empty Content-Type header for non-204 responses, but the 204 case is the natural exception. The proto field comment says "in practice always application/json" — worth adding an explicit note that 204 is the exception, to prevent a silent bug in the bridge wiring.
  • main.go gRPC registration still deferred — correct per the PR scope. ForwardServer is constructed but not registered; the next PR wires both ends.
  • ADMIN_OP_UNSPECIFIED and default both return 400 — safe. Out-of-range proto values produce a 400 instead of a panic or silent no-op.

Test coverage (current state)

21 test cases in forward_server_test.go. All acceptance criteria in scope for this PR (1, 4, 6) have direct tests. Log-emission contract tests cover both the positive and negative signal. Race-window paths on both create and delete are covered. 413, NUL-byte, slash, unknown-fields, and unknown-operation edge cases all have dedicated tests. No remaining gaps.

bootjp added a commit that referenced this pull request Apr 25, 2026
Two related findings on PR #635 / forward_server.go's
handleDelete:

- NUL-byte smuggling: handleDelete decoded with goccy/go-json
  but skipped the explicit NUL scan that decodeCreateTableRequest
  applies. Same vector as the #634 fix — `{"name":"users"}\x00{"extra":1}`
  passes dec.More() because goccy treats NUL as end-of-input.
  Add the same pre-decode NUL rejection.

- Slash-in-name divergence: the HTTP handleDelete and
  handleDescribe both reject `/` in the table name with 404, but
  the forwarded delete just passed body.Name straight through to
  AdminDeleteTable. A forwarded call could therefore act on
  slash-bearing tables that a leader-direct call would 404. Reject
  symmetrically before invoking the source.

Tests: two new ForwardServer cases — NUL payload + slash name.
Both confirm the source is never invoked when the precondition
fails (defence in depth — an asymmetric stub source would still
make the test green if we only checked the response code).
bootjp added a commit that referenced this pull request Apr 25, 2026
forwardErrorResponse already maps ErrTablesNotLeader to 503
leader_unavailable for the case where the leader passes its
isVerifiedDynamoLeader guard but then drops leadership mid-
Dispatch — the comment at forward_server.go:248 documents the
race. No test exercised that path though, which Claude's review
on PR #635 called out as a gap.

Add two tests that pin both the create and delete sides:
TestForwardServer_CreateTable_LeaderSteppedDownReturns503 and
TestForwardServer_DeleteTable_LeaderSteppedDownReturns503. Both
inject ErrTablesNotLeader through the stub source and assert
503 + leader_unavailable in the response body.

These are pure stub-driven unit tests so they run alongside the
existing 17 forward-server cases without spinning up a real
Raft cluster.
bootjp added a commit that referenced this pull request Apr 25, 2026
Two changes on the AdminForward leader-side dispatcher:

- Codex P2 on PR #635: forwarded 503 leader_unavailable lost
  the Retry-After: 1 header that the leader-direct HTTP path
  emits. Add retry_after_seconds to AdminForwardResponse so the
  follower's bridge can rebuild the same HTTP header (the bridge
  side wires it in the next PR; the proto + leader populate it
  here). Direct HTTP path is unaffected — the proto field is
  only consumed during forwarding.
- Drop the duplicate RoleStore/MapRoleStore declarations from
  forward_server.go. They now live in role_store.go (added on
  PR #634 for HTTP-side role revalidation) so both surfaces
  share one definition.

Tests: extend TestForwardServer_CreateTable_LeaderSteppedDownReturns503
to pin the retry_after_seconds=1 hint.
@bootjp bootjp force-pushed the feat/admin-forward-rpc branch from 5445b0a to d15c06c Compare April 25, 2026 17:29
bootjp added a commit that referenced this pull request Apr 25, 2026
Two carry-overs from Claude's review on PR #635:

- forward_server.go defined errorBody{Error, Message} which was
  byte-for-byte identical to errorResponse in router.go (same
  package, same JSON tags). Drop errorBody and use the existing
  errorResponse type — no functional change, eliminates drift.

- ForwardedFrom from the gRPC request was written into slog's
  LogAttrs verbatim. With JSON output the encoder escapes \n/\r
  for us, but a text-format handler would let a malicious
  follower-supplied node id split one audit line into two —
  defeating log-aggregation parsing or spoofing a synthetic log
  entry. Sanitise once at the RPC entry point in Forward() and
  thread the cleaned string through handleCreate / handleDelete
  / the warning-log paths.

Test: TestForwardServer_SanitisesForwardedFromInLog uses a real
slog text-format handler (where JSON's auto-escape doesn't help)
and confirms a forwarded_from value containing "\n" comes out as
spaces in the log line.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 25, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Apr 25, 2026

Claude finished @bootjp's task in 2m 5s —— View job


Code Review — PR #635: AdminForward RPC + leader-side dispatcher

  • Gather context and read prior review history
  • Read current state of key files
  • Assess carry-overs from previous reviews
  • Post final review

Overall verdict

Ready to merge. Commit 79a9827 closes both carry-overs from the previous review. Every security finding, Codex P1/P2, and review observation that has been raised across this PR's review cycle is now addressed. One documentation gap and one minor test gap remain — details below.


Changes in the latest commit (79a9827) — both verified ✅

errorBody deduplicationforward_server.go previously defined an errorBody{Error, Message} struct byte-for-byte identical to errorResponse in router.go. That type is now gone. forward_server.go uses errorResponse throughout (forward_server.go:237,247,251,253,257,259,301). No functional change, no wire-format change, drift eliminated.

forwarded_from newline sanitisationsanitiseForwardedFrom() (forward_server.go:213-220) strips \n/\r to spaces at the RPC entry point, before the value is threaded into any slog call. TestForwardServer_SanitisesForwardedFromInLog (forward_server_test.go:368) uses a real text-format handler (where JSON's auto-escape doesn't help) and confirms the attack string "follower\nfake_actor=evil\nrole=full" appears sanitised in the log output and the raw newline-bearing form is absent. The test is correctly targeted at the exact attack surface.


Full security posture (all green)

Finding Status
Payload size cap on create + delete ✅ 64 KiB pre-decode cap on both paths
NUL-byte smuggling on delete ✅ Pre-decode bytes.IndexByte scan
Trailing-data injection dec.More() check on both paths
Slash-in-name on create + delete ✅ Enforced via decodeCreateTableRequest + explicit delete check
Unknown-field bypass on forwarded create ✅ Reuses decodeCreateTableRequest (forward_server.go:133)
Principal re-evaluation on forwarded path validatePrincipal ignores follower-claimed role, live LookupRole
Principal re-evaluation on direct HTTP path principalForWrite live LookupRole when RoleStore configured (dynamo_handler.go:377)
Raw error leakage forwardErrorResponse never exposes adapter errors; logUnexpectedSourceError logs server-side only
forwarded_from log injection sanitiseForwardedFrom at entry point
isLeaderChurnError false positives HasSuffix not Contains (main_admin.go:286-289)
Retry-After across forwarding retry_after_seconds field in proto; set to 1 on ErrTablesNotLeader

Minor observations (non-blocking)

1. content_type proto comment doesn't document the 204 exception

AdminForwardResponse.content_type at proto/admin_forward.proto:75 says "in practice always application/json; charset=utf-8". The 204 delete success path (forward_server.go:203) sets no ContentType — an empty string. This is correct behaviour (no body, no content type), but it contradicts the comment. When the follower bridge is wired in the next PR, the author needs to know not to blindly forward a blank Content-Type header for non-204 responses. A one-line comment addition would prevent a silent parity bug:

// content_type is the response content type the leader chose; in
// practice always "application/json; charset=utf-8". Empty for
// 204 No Content responses (DELETE success), which carry no body.
string content_type = 3;

Fix this →

2. TestForwardServer_DeleteTable_LeaderSteppedDownReturns503 does not assert RetryAfterSeconds

The create-side test (forward_server_test.go:342) correctly asserts require.Equal(t, int32(1), resp.GetRetryAfterSeconds()). The delete-side test (forward_server_test.go:358-361) asserts 503 and "leader_unavailable" but not the retry-after hint. Both paths go through the same forwardErrorResponse which sets resp.RetryAfterSeconds = 1, so the delete path does return it — the assertion is just missing. Worth a one-liner addition for symmetry.

Fix this →


Test coverage (current state)

22 test cases in forward_server_test.go. All acceptance criteria in scope for this PR (1, 4, 6) have direct tests. Security regression tests cover NUL-byte, slash, unknown-fields, payload-too-large, trailing-data, and log-injection on both create and delete. Race-window 503 paths covered. Log-emission contract tests cover both positive and negative signal. The new TestForwardServer_SanitisesForwardedFromInLog is well-targeted: it uses a text-format handler rather than a JSON handler, directly testing the case JSON encoding's auto-escape would mask.


@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 25, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Apr 25, 2026

Claude finished @bootjp's task in 1m 46s —— View job


Code Review — PR #635: AdminForward RPC + leader-side dispatcher

  • Gather context and prior review history
  • Read proto/admin_forward.proto
  • Read internal/admin/forward_server.go
  • Read internal/admin/forward_server_test.go
  • Read internal/admin/dynamo_handler.go
  • Read internal/admin/router.go
  • Read main_admin.go
  • Assess all prior findings vs current state
  • Post final review

Overall verdict

Ready to merge. This version is unchanged from the previous Claude review pass (the only new commit is a merge of CLAUDE.md documentation from main). All Gemini security-medium, Codex P1×2, and Codex P2×8 findings raised across this PR's review cycle have been correctly addressed. Two minor observations from the previous review remain open — details below.


Security posture (all green)

Finding Status
Payload size cap (create + delete) ✅ 64 KiB pre-decode cap on both paths
NUL-byte smuggling on delete bytes.IndexByte scan before decode
Trailing-data injection dec.More() check on both paths
Slash-in-name (create + delete) decodeCreateTableRequest + explicit delete check
Unknown-field bypass on forwarded create ✅ Reuses decodeCreateTableRequest (forward_server.go:133)
Principal re-evaluation — forward path validatePrincipal does live LookupRole, ignores follower role
Principal re-evaluation — direct HTTP path principalForWrite live LookupRole when RoleStore configured (dynamo_handler.go:376)
Raw error leakage forwardErrorResponse never exposes adapter errors
forwarded_from log injection sanitiseForwardedFrom strips CR/LF at entry point
isLeaderChurnError false positives HasSuffix not Contains (main_admin.go:286)
Retry-After across forwarding retry_after_seconds in proto; set to 1 on ErrTablesNotLeader
errorBody / errorResponse drift ✅ Deduplicated; forward_server.go uses errorResponse throughout
Operation-specific 500 codes dynamo_create_failed / dynamo_delete_failed in forwardErrorResponse
Unexpected source error logging logUnexpectedSourceError with isStructuredSourceError filter

Minor carry-overs (non-blocking — same as previous review)

1. TestForwardServer_DeleteTable_LeaderSteppedDownReturns503 missing RetryAfterSeconds assertion

The create-side counterpart (forward_server_test.go:342) correctly asserts:

require.Equal(t, int32(1), resp.GetRetryAfterSeconds())

The delete-side test (forward_server_test.go:358-361) asserts 503 and "leader_unavailable" but not the retry-after hint. Both paths route through the same forwardErrorResponse which sets RetryAfterSeconds = 1, so the behaviour is correct — the assertion is just missing. Worth a one-liner for symmetry before the follower bridge wires this up.

Fix this →

2. content_type proto comment silent about the 204 exception

proto/admin_forward.proto:73-76:

// content_type is the response content type the leader chose; in
// practice always "application/json; charset=utf-8" but echoed
// explicitly so the follower does not have to guess.
string content_type = 3;

The 204 DELETE success path (forward_server.go:203) sets no ContentType, which is correct — no body, no content type. But the comment says "in practice always application/json" with no exception documented. When the follower bridge is written in the next PR, the author needs to know 204 is the exception; without that, a naive bridge implementation might forward a blank Content-Type header for all responses. Worth adding: "Empty for 204 No Content responses (DELETE success), which carry no body."

Fix this →


Test coverage (current state)

22 tests in forward_server_test.go. All acceptance criteria in scope for this PR (1, 4, 6) have direct tests. Security regression coverage is comprehensive: NUL-byte, slash, unknown-fields, payload-too-large, trailing-data, log injection — on both create and delete paths. Race-window 503 paths (leader stepping down mid-dispatch) are covered on both paths. Log-emission contract tests validate both the positive signal (LogsUnexpectedSourceError) and negative (DoesNotLogStructuredSourceErrors). The one gap — the missing RetryAfterSeconds assertion on the delete-side 503 test — is cosmetic: the behaviour is already correct.

bootjp added 7 commits April 26, 2026 04:09
Lays down the proto contract and the leader-side handler for the
follower-leader forwarding path described in design Section 3.3.
The follower-side client + bridge wiring lands in a separate PR
to keep this one reviewable.

Proto:
- proto/admin_forward.proto with AdminForward.Forward(...)
- AdminPrincipal carries access_key + role; the leader re-evaluates
  it against its own role index rather than trusting the follower
  (Section 3.3.1 invariant)
- AdminOperation enum keeps the proto operation-agnostic so adding
  a future endpoint does not require a wire-format change
- payload bytes carry the operation-specific JSON body verbatim
- forwarded_from echoes the follower's node id into the leader's
  audit log line (acceptance criterion 6)

Leader-side handler (internal/admin/forward_server.go):
- ForwardServer implements pb.AdminForwardServer
- RoleStore / MapRoleStore keep the access-key role lookup
  abstract so tests can swap in an in-memory map
- validatePrincipal demotes a follower-claimed full role to
  forbidden when the leader's view is read-only (criterion 4)
- forwardErrorResponse mirrors the HTTP handler's writeTablesError
  status mapping so a forwarded call is indistinguishable from a
  leader-direct call to the SPA
- audit log entries on success carry forwarded_from; the leader
  is the source of truth (criterion 6)

What is NOT in this PR (follow-ups):
- Follower-side AdminForward client + bridge wiring (criterion 2)
- Election-period 503 + Retry-After handling (criterion 3)
- Rolling-upgrade compatibility flag admin.leader_forward_v2
  (criterion 5) — needs a Raft-level cluster version bump
- gRPC server registration in main.go

Tests cover criteria 1, 4, 6 (leader direct, principal demotion,
forwarded_from in audit log) and the structured error mapping
(create-table 409, delete missing 404, generic error 500 with
sanitised body, bad JSON 400, unknown operation 400).
Two findings on the leader-side AdminForward handler:

- Gemini security-medium: handleDelete unmarshalled the raw payload
  with no size cap, so a hostile follower could push a multi-MiB
  body and consume memory before the JSON parser noticed. Apply a
  64 KiB cap (mirrors the HTTP path defaultBodyLimit) on both
  handleCreate and handleDelete; payloads past the cap return
  413 payload_too_large without ever touching json.Unmarshal.

- Codex P1: handleCreate decoded with plain json.Unmarshal,
  bypassing the strict checks the HTTP path runs through
  decodeCreateTableRequest (DisallowUnknownFields, dec.More()
  trailing-token rejection, slash-in-name validation, the rest of
  validateCreateTableRequest). Reuse decodeCreateTableRequest so a
  forwarded create cannot smuggle past validations a leader-direct
  create would have caught.

handleDelete also gains DisallowUnknownFields + dec.More() so its
payload contract matches the create path.

Tests cover the new 413 paths, the unknown-field rejection, and the
slash-in-name rejection on the forwarded create path.
Two related findings on PR #635 / forward_server.go's
handleDelete:

- NUL-byte smuggling: handleDelete decoded with goccy/go-json
  but skipped the explicit NUL scan that decodeCreateTableRequest
  applies. Same vector as the #634 fix — `{"name":"users"}\x00{"extra":1}`
  passes dec.More() because goccy treats NUL as end-of-input.
  Add the same pre-decode NUL rejection.

- Slash-in-name divergence: the HTTP handleDelete and
  handleDescribe both reject `/` in the table name with 404, but
  the forwarded delete just passed body.Name straight through to
  AdminDeleteTable. A forwarded call could therefore act on
  slash-bearing tables that a leader-direct call would 404. Reject
  symmetrically before invoking the source.

Tests: two new ForwardServer cases — NUL payload + slash name.
Both confirm the source is never invoked when the precondition
fails (defence in depth — an asymmetric stub source would still
make the test green if we only checked the response code).
Two related findings on the leader-side AdminForward dispatcher.

- Operation-specific 500 codes: forwardErrorResponse mapped any
  unmapped failure to {"error":"internal", ...}, but the HTTP
  handler's writeTablesError emits dynamo_create_failed /
  dynamo_delete_failed. A forwarded write that hit the same
  backend/storage error therefore returned a different code than
  a leader-direct write — clients branching on those codes saw
  divergent behaviour. Fix: forwardErrorResponse now takes an
  `op` argument ("create" or "delete") so the fallthrough emits
  dynamo_<op>_failed, matching leader-direct parity exactly.

- Log unexpected source errors: handleCreate / handleDelete on
  the forward path returned 500 with "see leader logs" in the
  body but never logged anything for non-sentinel source
  failures, leaving operators with no breadcrumb for forwarded
  500s. Add logUnexpectedSourceError that emits an
  admin_forward_<op>_failed line at LevelError carrying the
  table, forwarded_from, and raw error. Sentinels (forbidden,
  not-found, already-exists, validation) skip the log — they're
  routine client-side outcomes, and logging at LevelError would
  drown the operational signal (same policy as the HTTP path's
  writeTablesError).

Tests:
- TestForwardServer_CreateTable_GenericErrorReturns500 now also
  asserts the response body contains dynamo_create_failed.
- New TestForwardServer_DeleteTable_GenericErrorUsesOperationSpecificCode
  pins the delete-side code parity.
- New TestForwardServer_LogsUnexpectedSourceError captures the
  slog output and confirms the table, forwarded_from, and raw
  error all appear.
- New TestForwardServer_DoesNotLogStructuredSourceErrors sweeps
  the four sentinel cases and confirms the LevelError handler
  never emits anything.
forwardErrorResponse already maps ErrTablesNotLeader to 503
leader_unavailable for the case where the leader passes its
isVerifiedDynamoLeader guard but then drops leadership mid-
Dispatch — the comment at forward_server.go:248 documents the
race. No test exercised that path though, which Claude's review
on PR #635 called out as a gap.

Add two tests that pin both the create and delete sides:
TestForwardServer_CreateTable_LeaderSteppedDownReturns503 and
TestForwardServer_DeleteTable_LeaderSteppedDownReturns503. Both
inject ErrTablesNotLeader through the stub source and assert
503 + leader_unavailable in the response body.

These are pure stub-driven unit tests so they run alongside the
existing 17 forward-server cases without spinning up a real
Raft cluster.
Two changes on the AdminForward leader-side dispatcher:

- Codex P2 on PR #635: forwarded 503 leader_unavailable lost
  the Retry-After: 1 header that the leader-direct HTTP path
  emits. Add retry_after_seconds to AdminForwardResponse so the
  follower's bridge can rebuild the same HTTP header (the bridge
  side wires it in the next PR; the proto + leader populate it
  here). Direct HTTP path is unaffected — the proto field is
  only consumed during forwarding.
- Drop the duplicate RoleStore/MapRoleStore declarations from
  forward_server.go. They now live in role_store.go (added on
  PR #634 for HTTP-side role revalidation) so both surfaces
  share one definition.

Tests: extend TestForwardServer_CreateTable_LeaderSteppedDownReturns503
to pin the retry_after_seconds=1 hint.
Two carry-overs from Claude's review on PR #635:

- forward_server.go defined errorBody{Error, Message} which was
  byte-for-byte identical to errorResponse in router.go (same
  package, same JSON tags). Drop errorBody and use the existing
  errorResponse type — no functional change, eliminates drift.

- ForwardedFrom from the gRPC request was written into slog's
  LogAttrs verbatim. With JSON output the encoder escapes \n/\r
  for us, but a text-format handler would let a malicious
  follower-supplied node id split one audit line into two —
  defeating log-aggregation parsing or spoofing a synthetic log
  entry. Sanitise once at the RPC entry point in Forward() and
  thread the cleaned string through handleCreate / handleDelete
  / the warning-log paths.

Test: TestForwardServer_SanitisesForwardedFromInLog uses a real
slog text-format handler (where JSON's auto-escape doesn't help)
and confirms a forwarded_from value containing "\n" comes out as
spaces in the log line.
@bootjp bootjp force-pushed the feat/admin-forward-rpc branch from 5944f5c to e6e3596 Compare April 25, 2026 19:09
@bootjp bootjp enabled auto-merge April 25, 2026 19:13
@bootjp bootjp merged commit 1e7e9bc into main Apr 25, 2026
8 checks passed
@bootjp bootjp deleted the feat/admin-forward-rpc branch April 25, 2026 19:17
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e6e35964e1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +48 to +52
func NewForwardServer(source TablesSource, roles RoleStore, logger *slog.Logger) *ForwardServer {
if logger == nil {
logger = slog.Default()
}
return &ForwardServer{source: source, roles: roles, logger: logger}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reject nil dependencies in NewForwardServer

NewForwardServer accepts nil for both roles and source, but Forward unconditionally dereferences these fields (validatePrincipal calls s.roles.LookupRole, and write paths call s.source.Admin*). A wiring mistake (or partially initialized server during rollout) will therefore panic on the first forwarded request instead of returning a controlled error response, which can take down the admin RPC handler. Add constructor-time validation or fail-closed guards before dispatch.

Useful? React with 👍 / 👎.

bootjp added a commit that referenced this pull request Apr 25, 2026
…artial) (#644)

Phase 1 + 2 of Task #26: the follower-side `LeaderForwarder` client and
its integration into the dynamo HTTP handler. Builds on the AdminForward
leader-side dispatcher landed via #635.

## Summary

- New `LeaderForwarder` interface (`internal/admin/forward_client.go`)
decouples the dynamo HTTP handler from `pb.AdminForwardClient`. The
handler stays proto-free; the bridge in `main_admin.go` (next phase)
plugs in the gRPC-backed implementation.
- `gRPCForwardClient` translates a `CreateTableRequest` / table-name
into an `AdminForwardRequest`, dials via a `GRPCConnFactory` (production
wraps `kv.GRPCConnCache`), and re-shapes the response into
`ForwardResult` (status, payload, content-type).
- `forwarded_from = nodeID` is populated so the leader's audit log
carries the trace (criterion 6, leader-side already shipped in #635).
- Defensive: `status_code == 0` upgrades to `502 Bad Gateway`; missing
`ContentType` fills the JSON default. Both surface transport bugs rather
than producing silently-malformed SPA responses.
- `ErrLeaderUnavailable` sentinel signals the "no leader known" case so
the handler can map to 503 + `Retry-After: 1` (criterion 3).
- `DynamoHandler` gains a `forwarder` field and `WithLeaderForwarder`
option. When set, `handleCreate` / `handleDelete` catch
`ErrTablesNotLeader` from the source and forward to the leader
transparently — the SPA cannot tell forwarded from leader-direct.
- `writeForwardResult` re-emits the leader's structured response
verbatim (status + payload + content-type), so a forwarded `409
Conflict` from the leader stays `409` on the wire — no
re-classification.
- `writeForwardFailure` maps `ErrLeaderUnavailable` (election in flight)
and gRPC transport errors to 503 + `Retry-After: 1`.
`ErrLeaderUnavailable` is intentionally NOT logged at error level
(elections are routine); transport errors are logged at LevelError so
operators can investigate.

## What is NOT in this PR

- gRPC server registration in `main.go` (production wiring of the
`ForwardServer` from #635) — comes in the next phase.
- The bridge that wraps `kv.GRPCConnCache` and supplies
`LeaderAddressResolver` — same phase.
- Election-period retry-loop on the client side (criterion 3 partial:
this PR returns 503 + Retry-After; the SPA / client retries the request;
criterion 3 fully needs the production bridge to dial actual
leader-discovery).

## Test plan

- [x] `go build ./...`
- [x] `go vet ./...`
- [x] `golangci-lint run` (admin package: 0 issues)
- [x] `go test ./internal/admin/ -count=1 -race`
- 8 forward-client unit tests: constructor input validation, both
Forward operations including principal/payload/op-enum/`forwarded_from`
round-trip, `ErrLeaderUnavailable`, dial/RPC errors propagated with
`cockroachdb/errors` wrapping, zero status code upgrade, missing content
type fallback
- 9 handler integration tests: transparent forward for create + delete,
no-forwarder fallback to 503, `ErrLeaderUnavailable` → 503 +
Retry-After, transport error → 503 + log, leader 409 pass-through,
leader 503 + Retry-After preserved, role check short-circuits before
forward, body validation short-circuits before forward
- [ ] Wire production bridge + register `pb.RegisterAdminForwardServer`
in `main.go` and exercise an end-to-end follower → leader call against a
real cluster (next PR).

## Acceptance criteria coverage

| # | Criterion | This PR |
|---|---|---|
| 1 | Leader direct write | ✓ (in main since #634) |
| 2 | Follower forwards transparently | ✓ wiring done; needs main.go
gRPC registration to take effect |
| 3 | Election-period 503 + retry | ✓ partial — handler returns 503 +
Retry-After; full transparency needs the bridge |
| 4 | Leader demotes stale full role | ✓ (in main since #635) |
| 5 | Rolling-upgrade compat flag | ⏳ deferred (cluster-version bump) |
| 6 | `forwarded_from` in audit log | ✓ (in main since #635) |
bootjp added a commit that referenced this pull request Apr 26, 2026
…de bridge (#648)

Phase 3 of Task #26 / AdminForward design 3.3. Closes the wiring loop
opened by #635 (leader-side dispatcher) and #644 (follower-side client +
handler integration): the admin HTTP DynamoHandler now hands
`ErrTablesNotLeader` writes off to the in-process `LeaderForwarder`, and
the leader-side gRPC `AdminForward` service is registered on every node
so the forwarded RPC actually lands somewhere.

## Summary

- **`admin.ServerDeps.Forwarder`** (3315bf4) — new optional field on
the high-level admin server composition surface. `admin.NewServer` calls
`WithLeaderForwarder` when set; nil leaves the existing 503 +
Retry-After:1 fallback untouched. Single-node / leader-only deployments
are unaffected.
- **`main_admin_forward.go`** (f3e529c) — production bridge between
`kv.GRPCConnCache` and the admin layer's typed `PBAdminForwardClient`
(via `pb.NewAdminForwardClient`). Includes `buildLeaderForwarder`,
`roleStoreFromFlags`, `adminForwardServerDeps`, and a
`readyForRegistration` gate.
- **Runtime wiring** (19f6b70) — `runtimeServerRunner.start()` now
creates `dynamoServer` BEFORE `startRaftServers`, so the leader-side
`ForwardServer` can use it as its `TablesSource`. `startRaftServers`
accepts the new `adminForwardServerDeps` bundle and delegates to
`registerAdminForwardServer`. `startServers` constructs a single
`*kv.GRPCConnCache` and forwards it (with the `Coordinator`) to
`startAdminFromFlags`, which in turn builds the production
`LeaderForwarder` via `buildAdminLeaderForwarder` and passes it through
`ServerDeps.Forwarder`.

## What is NOT in this PR

- Rolling-upgrade compatibility flag (criterion 5) — still deferred
behind a cluster-version bump.
- Election-period retry loop on the follower's bridge — the handler
returns 503 + Retry-After:1, the SPA / curl re-issues, and the bridge
dials again on the next attempt. Inline retry inside the handler would
just hide the latency from operators.

## Test plan

- [x] `go build ./...`
- [x] `go vet ./...`
- [x] `golangci-lint run` (main + admin packages: 0 issues)
- [x] `go test ./internal/admin/ -count=1 -race` — full admin suite
passes (existing 21 forward tests + 1 new
`TestServer_ServerDepsForwarderIsWired`)
- [x] `go test . -count=1 -race` — main package passes (4 new bridge /
role-store / readyForRegistration tests)
- [ ] `adapter` package times out at 240s on this branch AND on `main` —
pre-existing flake unrelated to this PR (verified by checking out main
and running the same suite)
- [ ] End-to-end smoke against a 3-node cluster — needs the local Jepsen
runner; blocked on the next PR or a manual run

## Acceptance criteria coverage

| # | Criterion | This PR |
|---|---|---|
| 1 | Leader direct write | ✓ (in main since #634) |
| 2 | Follower forwards transparently | ✓ — wiring complete; both
leader-side gRPC service AND follower-side bridge are now plumbed |
| 3 | Election-period 503 + retry | ✓ — handler still returns 503 +
Retry-After:1 when no leader is known; the SPA / client retries and the
next attempt re-dials |
| 4 | Leader demotes stale full role | ✓ (in main since #635) |
| 5 | Rolling-upgrade compat flag | ⏳ deferred (cluster-version bump) |
| 6 | `forwarded_from` in audit log | ✓ (in main since #635 — bridge now
stamps `--raftId` onto every forwarded request) |

## Self-review (5 lenses)

1. **Data loss**: No FSM / Raft / Pebble path changes; admin writes
still go through the same `AdminCreateTable` / `AdminDeleteTable` Raft
proposal once they reach the leader.
2. **Concurrency**: `runner.start()` reorder confirmed safe —
`startDynamoDBServer` only depends on `coordinate` + `shardStore`
(already constructed); raft TCP listeners are independent of the dynamo
HTTP listener. Worst case during the 100 ms-or-so reorder window is the
same "no leader yet" 503 the old order would have produced if a request
landed before raft converged.
3. **Performance**: One additional `&kv.GRPCConnCache{}` allocation per
process; one closure for the resolver. No hot-path changes.
4. **Data consistency**: Forwarded requests re-validate the principal at
the leader (criterion 4, already shipped); commit-ts ordering is
unchanged because the leader's `AdminCreateTable` path does its own
`HLC.Next`.
5. **Test coverage**: 4 new main-package tests + 1 new admin-package
server-level test. `TestServer_ServerDepsForwarderIsWired` is the
regression test for "future refactor drops the forwarder before it
reaches the dynamo handler" — exactly the failure mode the wiring change
in this PR introduces the risk of.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
* Added leader-forwarding capability for admin requests sent to follower
nodes
  * Implemented role-based access control for admin API operations
* Enhanced DynamoDB request routing to automatically direct operations
to leader nodes

* **Tests**
* Added comprehensive integration and unit tests validating
leader-forwarding paths

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
bootjp added a commit that referenced this pull request Apr 26, 2026
Per `docs/design/README.md`'s lifecycle convention, the admin dashboard
design doc is now "partial":

- **P1** (DynamoDB CRUD + AdminForward) — shipped via #634, #635, #644,
#648
- **P2** (S3 buckets list/create/delete/ACL + DescribeTable) — shipped
via #658, with #669 + #673 in flight
- **P3** (React SPA + embed) — shipped via #649, #650
- **P4** (TLS / role / CSRF / operator docs) — TLS, role, CSRF are
already live in P1; operator docs in #674

Independent of the in-flight slice 2 PRs (#669/#673) and the docs PR
(#674) — this rename only reflects what is already on main today, plus
an "Implementation status" table mapping each phase to the PR it landed
in.

## What this PR changes

- `git mv` the design doc from `2026_04_24_proposed_admin_dashboard.md`
to `2026_04_24_partial_admin_dashboard.md` so its history follows
- Add an "Implementation status" header table indexing each phase to the
PRs that landed it
- List the outstanding open items so future readers know what is still
owed against the original proposal:
- AdminForward acceptance criterion 5 (rolling-upgrade compat flag) —
deferred
  - S3 object browser — explicitly out of scope per Section 2 Non-goals
  - TLS cert hot-reload — restart-to-rotate is the documented model

When the rolling-upgrade flag lands, the doc gets renamed once more to
`2026_04_24_implemented_admin_dashboard.md` per the README's lifecycle
convention.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants