Skip to content

perf(s3): align Raft entry size with MaxSizePerMsg via s3ChunkBatchOps=4#636

Merged
bootjp merged 3 commits intomainfrom
perf/s3-raft-entry-align
Apr 25, 2026
Merged

perf(s3): align Raft entry size with MaxSizePerMsg via s3ChunkBatchOps=4#636
bootjp merged 3 commits intomainfrom
perf/s3-raft-entry-align

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented Apr 25, 2026

Summary

S3 PutObject の Raft entry サイズを MaxSizePerMsg (PR #593 で 4 MiB) と整列させます。

変更: s3ChunkBatchOps = 16 → 4。1 Raft entry = s3ChunkBatchOps × s3ChunkSize ≒ 4 MiB に揃える。1 行差分。

Why

etcd/raftutil.go:limitSize には「単一 entry が MaxSizePerMsg を超えていても、その entry だけは reject せず単独で MsgApp に載せる」という documented exception があります ("if the size of the first entry exceeds maxSize, a non-empty slice with just this entry is returned"). 16 MiB entry はこの経路で素通りするので:

項目 s3ChunkBatchOps=16 s3ChunkBatchOps=4 (本 PR)
1 Raft entry size ~16 MiB ~4 MiB
Leader worst-case / peer 1024 × 16 MiB = 16 GiB 1024 × 4 MiB = 4 GiB
3-node cluster (leader) 32 GiB 8 GiB
WAL fsync per entry 16 MiB 4 MiB
5 GiB PUT の Raft commit 数 320 1280 (4×)

PR #593 の本文が謳う 1024 × 4 MiB = 4 GiB / peer の memory bound は 小エントリ前提で、S3 経路では成立していませんでした。本 PR で S3 が普通の batched MsgApp 経路に乗るようになり、bound が S3 込みでも正確になります。

Raft commit 数が 4× に増える代わりに、各 fsync が 4× 速くなるので、PR #600 の WAL group commit と相殺されてエンドツーエンド throughput はほぼ同等のはずです。

Test plan

  • go build ./... clean
  • golangci-lint run ./... 0 issues
  • go test ./adapter/ -short -run 'TestS3|S3Server' pass
  • CI

Follow-ups (別 PR で design doc)

  1. S3 admission control — 同時並行の S3 PUT body bytes に hard cap を入れて、クライアント並列度が上がっても leader-side memory が 4 GiB × peers を超えないようにする。
  2. Raft snapshot 戦略 / blob bypass — follower fall-behind 時の snapshot transfer に 5 GiB blob が乗ると不都合なので、blob を Raft 経路から外して manifest だけ raft で同期するアーキテクチャ。

/gemini review
@codex review

Before this change, S3 PutObject built one Raft proposal per
s3ChunkBatchOps (16) chunks at s3ChunkSize (1 MiB), so each Raft
entry was ~16 MiB. Combined with PR #593's defaults
(MaxSizePerMsg=4 MiB, MaxInflightMsgs=1024) the result was:

- etcd/raft's util.go:limitSize sends a *single* entry that exceeds
  MaxSizePerMsg as a solo MsgApp (this is documented behaviour: the
  "as an exception, if the size of the first entry exceeds maxSize, a
  non-empty slice with just this entry is returned" branch).
- One 16 MiB entry per MsgApp means PR #593's advertised per-peer
  memory bound — `MaxInflight × MaxSizePerMsg = 1024 × 4 MiB = 4 GiB`
  — silently grew to `1024 × 16 MiB = 16 GiB / peer`, with the
  three-node leader-side worst case at 32 GiB.
- WAL fsync amortised 16 MiB of writes per entry, increasing tail
  latency on slow disks and amplifying follower fall-behind windows.

s3ChunkBatchOps=4 caps each Dispatch at 4 × 1 MiB = 4 MiB plus a
few hundred bytes of protobuf overhead. The Raft entry now fits
exactly within MaxSizePerMsg, so:

- The leader-side per-peer worst case matches the bound PR #593
  documents (4 GiB / peer). For S3-heavy workloads the cluster-side
  worst case drops from 32 GiB to 8 GiB on a 3-node deployment.
- Each WAL fsync writes 4 MiB instead of 16 MiB; per-entry sync
  latency falls 4× and the WAL group-commit path landed in PR #600
  amortises the higher commit count.
- A 5 GiB PUT now produces 1280 Raft commits instead of 320 (4×
  growth). The follower apply loop and the read-side prefix scan
  (scanAllDeltaElems via s3keys.BlobKey) handle the same total bytes
  but in 4× more keys; on disk locality is unchanged because the
  underlying Pebble keys still sort by manifest order.

No reads or stored manifests are affected — s3ChunkBatchOps is purely
a write-path batching parameter, manifests carry their own
ChunkSizes slice that mirrors what was actually committed.

Build / vet / lint clean. S3 unit tests pass.

Follow-ups (separate PRs):

- docs/design: admission control on concurrent S3 PUT body bytes so
  the leader-side memory bound has a hard ceiling regardless of
  client concurrency.
- docs/design: Raft snapshot strategy / blob bypass so large objects
  do not bloat snapshot transfers when a follower falls behind.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 25, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 10 minutes and 28 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 10 minutes and 28 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c4b1b934-5aa8-434e-8795-ffeba2badfa8

📥 Commits

Reviewing files that changed from the base of the PR and between dbe4725 and 97b4fc5.

📒 Files selected for processing (2)
  • adapter/s3.go
  • adapter/s3_chunk_batch_test.go
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch perf/s3-raft-entry-align

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request reduces the s3ChunkBatchOps constant from 16 to 4 to ensure that Raft entries align with the 4 MiB MaxSizePerMsg limit, preventing potential memory exhaustion on the leader. While this change correctly bounds memory for data-heavy operations, the reviewer pointed out that it also unnecessarily restricts the batch size for metadata-only operations like deletions and cleanup. Since these operations have much smaller payloads, using a batch size of 4 is inefficient; it is recommended to decouple the batch size configuration for data transfers from that of metadata operations to maintain performance.

Comment thread adapter/s3.go Outdated
// advertises. Per-PUT Raft commit count grows 4× (a 5 GiB PUT goes
// from 320 to 1280 entries) — absorbed by the WAL group commit
// landed in PR #600 and the smaller per-entry fsync.
s3ChunkBatchOps = 4
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

s3ChunkBatchOps を 4 に変更したことで、putObjectuploadPart での Raft エントリサイズが適切に制限されるようになりますが、一方で deleteByPrefix (line 1947) や cleanupPartBlobsAsync (line 1893) など、データ本体を含まないメタデータ操作や削除操作のバッチサイズも同時に 4 に制限されてしまいます。

削除操作はペイロードが非常に小さいため、MaxSizePerMsg (4 MiB) の制限内でもより大きなバッチ(例: 64〜128)で効率的に処理可能です。この変更により、大きなオブジェクトの削除やアップロードの中断時のクリーンアップ処理において Raft プロポーザルの回数が不必要に増加し、スループットやクリーンアップ完了までの時間に影響を与える可能性があります。データ転送を伴う書き込み用のバッチサイズと、削除・スキャン用のバッチサイズを分離することを検討してください。

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 080facc17f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread adapter/s3.go Outdated
// advertises. Per-PUT Raft commit count grows 4× (a 5 GiB PUT goes
// from 320 to 1280 entries) — absorbed by the WAL group commit
// landed in PR #600 and the smaller per-entry fsync.
s3ChunkBatchOps = 4
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reduce chunk batch size to include protobuf overhead

Setting s3ChunkBatchOps to 4 still produces Raft payloads larger than a 4 MiB MaxSizePerMsg once encoding overhead is added (the Request/mutation protobuf plus the 1-byte raft encoding prefix in marshalRaftCommand), so full-size 4×1 MiB chunk batches continue to take etcd/raft’s oversized-first-entry path instead of truly fitting the limit. In practice this means the stated “4 MiB aligned” bound in this change is not actually guaranteed for PutObject; you need headroom (or byte-based batching) rather than 4 * s3ChunkSize exactly.

Useful? React with 👍 / 👎.

…ni medium)

Address two PR #636 review concerns:

Codex P2 — `4 × 1 MiB = 4 MiB` exactly is *over* `MaxSizePerMsg` once
protobuf framing overhead is counted. Each pb.Mutation carries the
Op tag, Key tag + bytes, and Value length prefix; the pb.Request
envelope wraps them; marshalRaftCommand prepends one byte. With a
normal-length objectKey the encoded entry is ~4 MiB + 300 B, with a
1 KiB objectKey it grows to ~4 MiB + 4 KiB. Either way the entry
falls into etcd/raft's util.go:limitSize oversized-first-entry path
(the documented "as an exception" branch), bypassing the
`MaxInflight × MaxSizePerMsg` per-peer memory bound the prior commit
was trying to enforce.

Fix: lower s3ChunkBatchOps from 4 to 3. `3 × 1 MiB = 3 MiB + few
hundred bytes` leaves ~1 MiB of headroom even for kilobyte-scale
keys, so the entry stays strictly under MaxSizePerMsg and the
batched-MsgApp path applies. Per-PUT Raft commit count grows ~5×
from the pre-PR-#636 baseline (5 GiB PUT: 320 → ~1707 entries) but
each fsync is ~5× smaller; the WAL group commit landed in PR #600
absorbs the rate.

Gemini medium — the previous commit also forced the cleanup paths
(cleanupPartBlobsAsync, deleteByPrefix, cleanupManifestBlobs,
appendPartBlobKeys) onto the same 4-op batch even though they carry
no chunk payload. Those ops are pure-key Dels; a 4-op batch
amplifies cleanup latency for large objects (5 GiB cleanup → ~1707
batches at the data-write size).

Fix: introduce s3MetaBatchOps = 64. With BlobKey-shaped keys
~100 B each, 64 keys × 100 B ≈ 6 KiB per batch, still three orders
of magnitude under MaxSizePerMsg. A 5 GiB-object cleanup commits
~80 batches instead of ~1707, so orphaned-blob garbage collection
finishes proportionally faster and does not amplify Raft load
relative to the data-write path. Updates the four cleanup callsites
(adapter/s3.go:1879, 1900, 1933, 2206, 2227) plus the appendPartBlobKeys
loop helper.

Tests:

- TestS3ChunkBatchFitsInRaftMaxSize builds a worst-case PutObject
  batch (1 KiB objectKey, fully-populated 1 MiB chunk values),
  marshals via the same protobuf path the real Dispatch uses, and
  asserts the encoded entry plus the marshalRaftCommand 1-byte
  framing prefix is strictly under MaxSizePerMsg = 4 MiB. A second
  assertion pins the headroom margin at >= 64 KiB so a future bump
  in s3ChunkBatchOps or s3ChunkSize is caught at PR time.
- TestS3MetaBatchFitsInRaftMaxSize is the equivalent invariant for
  s3MetaBatchOps on the cleanup path.

Build / vet / lint clean. S3 unit tests pass.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 25, 2026

Codex P2 と Gemini medium の両方に 5338c05 で対応。

Codex P2 (4 × 1 MiB = 4 MiB exactly is over MaxSizePerMsg)

protobuf encoding 込みで実測すると 4 × 1 MiB + ~300 B。1 KiB objectKey 込みで ~4 KiB 増えて、確実に MaxSizePerMsg = 4 MiB を超えていました。

s3ChunkBatchOps = 4 → 3 に下げる。3 MiB + few hundred BMaxSizePerMsg から ~1 MiB のヘッドルーム確保。

TestS3ChunkBatchFitsInRaftMaxSize で worst-case (1 KiB objectKey + fully-populated 1 MiB chunks) を実際に protobuf marshal して < 4 MiB を assert。さらに >= 64 KiB headroom を pin して将来の値バンプが silently 限界を越えないようにガード。

Gemini medium (cleanup paths share data-write batch size)

cleanupPartBlobsAsync / deleteByPrefix / cleanupManifestBlobs / appendPartBlobKeys は payload を運ばない Del のみなので、4-op batch は無駄に commit 数を増やすだけ。

s3MetaBatchOps = 64 を新設して、cleanup 5 callsites を切り替え。BlobKey ~100 B × 64 ≈ 6 KiB / batch (4 MiB cap から 3 桁下)。5 GiB-object cleanup の commit 数: ~1707 → ~80 (21× 改善)。

TestS3MetaBatchFitsInRaftMaxSize で同じ invariant を pin。

pre-PR (main) r1 of #636 (4) r2 of #636 (本コミット)
data-write batch 16 (16 MiB entry) 4 (~4 MiB + 300 B → over) 3 (~3 MiB → safely under)
cleanup batch 16 4 (no payload, wasted) 64 (key-only, optimal)
5 GiB PUT entries 320 1280 1707
5 GiB cleanup batches 320 1280 80

Test plan

  • TestS3ChunkBatchFitsInRaftMaxSize pass
  • TestS3MetaBatchFitsInRaftMaxSize pass
  • go test ./adapter/ -short -run TestS3 -count=1 pass
  • golangci-lint 0 issues

/gemini review
@codex review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request optimizes S3 operation batching by adjusting s3ChunkBatchOps and introducing s3MetaBatchOps to ensure Raft entries stay within the 4 MiB limit. It adds regression tests to verify these size invariants. A bug was identified in appendPartBlobKeys where passing the pending slice by value prevents the batching logic from correctly flushing data.

Comment thread adapter/s3.go Outdated
Key: s3keys.VersionedBlobKey(bucket, generation, objectKey, uploadID, part.PartNo, chunkIndex, part.PartVersion),
})
if len(pending) >= s3ChunkBatchOps {
if len(pending) >= s3MetaBatchOps {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

appendPartBlobKeys 内でのバッチ処理ロジックが正しく動作していません。pending スライスが値渡しされているため、この関数内での append による変更(長さの更新や再割り当て)が、flush クロージャが参照している呼び出し元の pending 変数に伝わりません。その結果、flush 内の len(pending) は常に 0 となり、バッチごとのフラッシュが実行されず、全キーが1つの巨大なバッチとして蓄積されてしまいます。これは、外部リクエストに起因するコレクションの無制限な増大を防ぎ、OOM(メモリ不足)を回避するというリポジトリの原則に抵触します。関数シグネチャをポインタ経由(*[]*kv.Elem[kv.OP])に変更するか、バッチ管理の構造を見直してください。

References
  1. To prevent unbounded memory growth and potential OOM issues, apply a fixed bound to collections that can grow from external requests, such as pending configuration changes. Reject new requests when the bound is reached.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 25, 2026

@claude[agent] review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5338c054c7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread adapter/s3.go
readTS := s.readTS()
readPin := s.pinReadTS(readTS)
kvs, err := s.store.ScanAt(ctx, cursor, end, s3ChunkBatchOps, readTS)
kvs, err := s.store.ScanAt(ctx, cursor, end, s3MetaBatchOps, readTS)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep blob cleanup scan page size independent of delete batch

deleteByPrefix now uses s3MetaBatchOps for ScanAt, so cleanup of blob prefixes scans 64 full chunk KVs per loop. In this codebase, ScanAt materializes values (e.g. store/mvcc_store.go clones Value in collectScanResults), and blob values are ~1 MiB each, so one cleanup worker can allocate/read ~64 MiB per page and up to ~1 GiB with 16 concurrent cleanup workers. This is a regression from the smaller scan page and can cause significant memory/GC and I/O spikes during abort/manifest cleanup even though the code only needs keys for deletes; consider keeping scan size small (or key-only scan) while still batching delete dispatches at 64.

Useful? React with 👍 / 👎.

@Claude
Copy link
Copy Markdown

Claude AI commented Apr 25, 2026

@claude[agent] review

LGTM ✅

Reviewed 5338c05. The changes correctly address the MaxSizePerMsg alignment issue:

Key fixes verified:

  • s3ChunkBatchOps = 3 provides ~1 MiB headroom under 4 MiB limit
  • s3MetaBatchOps = 64 optimizes cleanup paths (6 KiB vs 4 MiB cap)
  • All 5 cleanup callsites properly updated
  • Data-write paths (PutObject/UploadPart) correctly use the 3-op batch

Test coverage:

  • TestS3ChunkBatchFitsInRaftMaxSize validates worst-case (1 KiB key + full chunks) with actual protobuf marshaling, asserts < 4 MiB and >= 64 KiB headroom
  • TestS3MetaBatchFitsInRaftMaxSize pins cleanup batch invariant
  • Both tests pass ✅

Performance trade-offs well-reasoned:

  • 5 GiB PUT commits: 320 → ~1707 (5.3× increase, but 5.3× smaller fsyncs)
  • 5 GiB cleanup: ~1707 → ~80 batches (21× improvement)

Documentation is thorough. No issues found.

…s appends

Gemini high — the previous round split cleanup paths onto
s3MetaBatchOps but missed that `appendPartBlobKeys` still took
`pending []*kv.Elem[kv.OP]` by value. The flush closure inside
`cleanupManifestBlobs` captures the outer `pending` slice header;
appends performed inside the helper update the local header only, so
flush sees length 0 every iteration and silently no-ops. After the
helper returns, the caller reassigns `pending` from the helper's
return value — but by then the entire part's chunks have accumulated
into one slice that bypasses the s3MetaBatchOps cap entirely.

For a manifest with thousands of chunks (e.g. a 5 GiB-object cleanup
with ~5120 chunks), the broken helper builds one giant batch and
hands it to coordinator.Dispatch in a single call, re-introducing
exactly the OOM / oversized-MsgApp risk the s3MetaBatchOps cap was
meant to prevent. This is the same shape as the pre-PR-#636
behaviour s3ChunkBatchOps was tightening, just on the cleanup side.

Fix: take `pending` by pointer (`*[]*kv.Elem[kv.OP]`) and write
through it. Now appends inside the helper update the variable the
flush closure reads, so threshold-triggered flushes fire correctly
and the slice length never exceeds s3MetaBatchOps.

Test:

- TestAppendPartBlobKeys_FlushFiresEveryS3MetaBatchOps builds a part
  with 2 × s3MetaBatchOps + 7 chunks, calls the helper directly with
  a recording flush closure, and asserts flush fired exactly twice
  with batch sizes [s3MetaBatchOps, s3MetaBatchOps] plus 7 trailing
  entries left for the caller's tail flush. The test catches the
  bug: the value-passing version would record 0 flushes (closure
  always saw length 0) — the assertion `Equal(2, flushCalls)`
  would fail loudly.

Build / vet / lint clean. All S3 + cleanup tests pass.
@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 25, 2026

Gemini high の指摘どおり実バグでした。97b4fc58 で修正、回帰テスト付き。

何が壊れていたか

appendPartBlobKeys を slice 値渡し していた:

func (s *S3Server) appendPartBlobKeys(pending []*kv.Elem[kv.OP], ..., flush func()) ([]*kv.Elem[kv.OP], bool)

呼び出し元 cleanupManifestBlobs:

flush := func() {
    if len(pending) == 0 { return }   // ← 呼び出し元の pending を見る
    Dispatch(pending)
    pending = pending[:0]
}
for _, part := range manifest.Parts {
    pending, ok = s.appendPartBlobKeys(pending, ..., flush)  // 値渡し
}
flush()

問題:

  1. ヘルパー内で pending = append(pending, ...) するが、これは LOCAL の slice header を更新するだけ
  2. flush クロージャは外側スコープの pending を読む → len(pending) == 0 → no-op で即 return
  3. 結果、ヘルパー内の閾値チェック if len(pending) >= s3MetaBatchOps { flush() }発火しても何も起きない
  4. ヘルパー返却後に呼び出し元が pending = newPending で代入 → 全 chunk が 1 スライスに溜まったまま渡る
  5. 最後の flush()chunk 全件を 1 回の Dispatch にして送る

5 GiB オブジェクトのクリーンアップ (5120 chunks) で言うと、s3MetaBatchOps = 64 で 80 batch にしたかったのに 1 batch (5120 ops) として送ってしまう。MaxSizePerMsg=4 MiB を超えて oversized-first-entry path に落ち、まさに本 PR が防ごうとしていた OOM / Raft 圧迫を再導入していました。

修正

pending をポインタ (*[]*kv.Elem[kv.OP]) で受け取り、*pending = append(*pending, ...) で書く。クロージャは同じ変数を読み書きするので一貫した状態が見える。

func (s *S3Server) appendPartBlobKeys(pending *[]*kv.Elem[kv.OP], ..., flush func()) bool {
    *pending = append(*pending, &kv.Elem[kv.OP]{...})
    if len(*pending) >= s3MetaBatchOps {
        flush()
    }
}

回帰テスト

TestAppendPartBlobKeys_FlushFiresEveryS3MetaBatchOps:

  • chunkCount = 2 × s3MetaBatchOps + 7
  • 記録型 flush クロージャを渡してヘルパーを直接呼ぶ
  • assert: flush がちょうど 2 回呼ばれ、各回のバッチサイズが s3MetaBatchOps、残り 7 entries が末尾に残る

旧 (値渡し) コードでは flushCalls == 0 になるので Equal(2, flushCalls) でクラッシュ → 回帰検出。

影響範囲

appendPartBlobKeys の呼び出し元は cleanupManifestBlobs 1 箇所のみなので、他経路は無影響。シグネチャ変更はパッケージ内部のヘルパー (S3Server method) で外部 API 影響なし。

Build / vet / lint clean、3 つのテスト (TestS3ChunkBatchFitsInRaftMaxSize, TestS3MetaBatchFitsInRaftMaxSize, TestAppendPartBlobKeys_FlushFiresEveryS3MetaBatchOps) 全 pass。

/gemini review
@codex review

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Can't wait for the next one!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@bootjp bootjp merged commit 466e554 into main Apr 25, 2026
8 checks passed
@bootjp bootjp deleted the perf/s3-raft-entry-align branch April 25, 2026 15:08
bootjp added a commit that referenced this pull request Apr 25, 2026
## Summary

PR #636 (`s3ChunkBatchOps = 4` で Raft entry を `MaxSizePerMsg = 4 MiB`
に整列) のフォローアップとして、design doc を 2 本提出します。コードは含まず docs だけ。

### 1. `docs/design/2026_04_25_proposed_s3_admission_control.md`

**問題**: PR #636 は per-entry の memory accounting を直すが、aggregate (= 同時 PUT
数 × 各 PUT の inflight) は client 並列度で青天井。
**提案**: ノードが受け入れて Raft に commit する前の S3 PUT body bytes を **共通セマフォで hard
cap**。
- (A) `Content-Length` で request entry に pre-charge
- (B) `flushBatch` の per-batch (4 MiB) で sub-lease —
`dispatchAdmissionTimeout` 経過で 503 SlowDown
- デフォルト 256 MiB cap (GOMEMLIMIT=1800 MiB の ~14%)、env で tunable
- `Retry-After` 付き 503 → AWS SDK が自動 retry
- M1〜M4 の段階導入、metrics/Grafana panel まで含む

### 2. `docs/design/2026_04_25_proposed_s3_raft_blob_offload.md`

**問題**: 5 GiB PUT が WAL に丸ごと書かれ、follower catch-up と snapshot transfer が
O(stored bytes) でスケールしない。
**提案**: Raft が運ぶのは `ChunkRef{SHA256, Size}` (~100 B) と manifest だけ。chunk
本体は別キー名前空間 `!s3|chunkblob|<SHA>` でローカル Pebble に直書き、follower は applyLoop
で `chunkref` を見たら **非同期 fetch RPC で peer から取りに行く**。
- 1 GiB PUT の Raft 負荷: ~100 KiB に圧縮 (10⁴× 削減)
- Snapshot size: O(manifest count) になる
- Reference count + grace period の GC
- `ELASTICKV_S3_BLOB_OFFLOAD=true` でオプトイン、legacy `BlobKey` 経路は並走
- M0 spike 〜 M4 migrator までの 5 段階ロールアウト

両方とも:
- `docs/design/README.md` のファイル名・ヘッダ規約に準拠 (Status: Proposed / Date:
2026-04-25)
- 関連 PR (#593, #600, #612, #617, #636) と既存 design (workload isolation,
raft grpc streaming) を相互参照
- リスクと open questions を明示

## Test plan

- [x] markdown lint clean (textlint なし、手動確認)
- [x] ファイル名規約 `YYYY_MM_DD_proposed_<slug>.md` 準拠
- [x] 既存 design へのリンクが docs/design/README.md と整合

レビューは設計の方向性 (admission control の cap 値、blob offload の content-addressing
単位、M0-M4 の milestone 配分) に focus してもらえると助かります。

/gemini review
@codex review


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Documentation
* Added design proposal for S3 PUT request admission control and load
management strategy
* Added design proposal for S3 object storage optimization approach

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants