Conversation
Implements internal/backup/{filename.go,filename_test.go} per the Phase 0
design doc (docs/design/2026_04_29_proposed_snapshot_logical_decoder.md).
Encoding rules (deterministic, reversible from the filename alone except
for the SHA fallback):
- RFC3986 unreserved set [A-Za-z0-9._-] passes through.
- Other bytes -> %HH (uppercase hex), like form-urlencoded but applied
to every non-allowlisted byte.
- Segments exceeding 240 bytes after percent-encoding render as
<sha256-hex-prefix-32>__<truncated-original> with the full original
recorded in KEYMAP.jsonl by the caller. DecodeSegment refuses these
with ErrShaFallbackNeedsKeymap so callers cannot fabricate the
original bytes from the filename alone.
- DynamoDB B-attribute (binary) keys take a separate "b64.<base64url>"
path so a binary key never collides with a hex-shaped string key.
Tests cover passthrough, percent escaping, hex-uppercase invariant,
SHA-fallback firing on long inputs (raw or post-expansion), binary
round-trip, malformed-input rejection, output-length bound under
adversarial inputs, and rapid-driven property tests for round-trip on
both percent and binary paths.
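The rules above are enough to sketch the string-segment encoder. This is a hedged reconstruction, not the PR's code: `isUnreserved` and `encodeSegment` are hypothetical names, and the SHA-fallback and `b64.` branches are omitted.

```go
package main

import (
	"fmt"
	"strings"
)

// isUnreserved reports whether b is in the RFC 3986 unreserved set
// [A-Za-z0-9._-] that passes through unescaped.
func isUnreserved(b byte) bool {
	return b >= 'A' && b <= 'Z' ||
		b >= 'a' && b <= 'z' ||
		b >= '0' && b <= '9' ||
		b == '.' || b == '_' || b == '-'
}

// encodeSegment percent-encodes every non-allowlisted byte as %HH
// with uppercase hex, per the rules in the commit message.
func encodeSegment(raw []byte) string {
	var b strings.Builder
	for _, c := range raw {
		if isUnreserved(c) {
			b.WriteByte(c)
		} else {
			fmt.Fprintf(&b, "%%%02X", c)
		}
	}
	return b.String()
}

func main() {
	fmt.Println(encodeSegment([]byte("orders/2026 04"))) // orders%2F2026%2004
}
```

Determinism falls out of the byte-at-a-time loop: the same input always yields the same filename, and reversing it only requires undoing `%HH`.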
@claude review

Claude finished @bootjp's task in 4m 11s — Code Review:
| Location | Magic value | Suggested constant |
|---|---|---|
| percentEncode, line 139 | 3 (len("%HH")) | percentEncodedMaxExpansion = 3 |
| percentEncode, lines 146–147 | 4 (nibble width), 0x0F (low-nibble mask) | nibbleWidth = 4, lowNibbleMask byte = 0x0F |
| percentDecode, line 164 | 2 (hex digits after %) | hexEscapeDigits = 2 |
| percentDecode, line 182 | 4 (nibble width) | same nibbleWidth |
| hexUpper, lines 228, 231 | 10 (decimal/hex boundary) | hexAlphaBase = 10 |
| unhex, lines 239, 241 | 10 | same hexAlphaBase |
Minor — shaFallbackHexPrefixBytes misleading name
shaFallbackHexPrefixBytes = 32 is 32 hex characters, which encode 16 bytes. The name suffix Bytes implies raw bytes, but the constant is used as a character count. Consider shaFallbackHexPrefixLen or shaFallbackHexChars to avoid confusion at the call sites.
What's well done
- Encoding logic is correct for the non-colliding cases: `percentEncode`/`percentDecode` are proper inverses; the boundary check at `i+2 >= len(seg)` is correct; `hexUpper`/`unhex` are symmetric. The `EncodeBinarySegment` separation cleanly prevents the `b64.` form from colliding with hex-shaped string keys.
- Negative tests (`TestDecodeSegment_Rejects*`) cover truncated escapes, non-hex digits, raw reserved bytes, and malformed base64 — exactly the right failure cases.
- Property tests via `rapid` cover the round-trip invariant and the SHA-fallback post-condition for randomly generated inputs.
- Length bound guarantee (`TestEncodeSegment_OutputLengthBoundedByMax`) is a good correctness invariant test.
- SHA-fallback human-suffix test confirms operator greppability.
Code Review
This pull request introduces a filename encoding and decoding system for logical backups, supporting percent-encoding, base64-encoded binary segments, and a SHA-256 fallback for segments exceeding 240 bytes. The review feedback highlights critical improvements for robustness, specifically addressing potential OOM risks by enforcing length checks before large memory allocations in encoding and decoding functions. Furthermore, a logic fix was suggested for EncodeSegment to prevent user-supplied keys from colliding with the SHA-fallback pattern, ensuring deterministic decoding.
```go
func EncodeSegment(raw []byte) string {
	encoded := percentEncode(raw)
	if len(encoded) <= maxSegmentBytes {
		return encoded
	}
	return shaFallback(raw)
}
```
The current implementation of EncodeSegment has two issues:
- OOM Risk: It calls percentEncode(raw) before checking the length. If raw is extremely large, percentEncode will attempt to allocate 3 * len(raw) bytes (via strings.Builder.Grow), which can lead to an OOM. Since any raw longer than maxSegmentBytes is guaranteed to exceed the limit after encoding, we should short-circuit early.
- Collision Bug: A user-supplied key that happens to match the SHA-fallback pattern (32 hex characters followed by __) will be returned as-is by EncodeSegment but will be misidentified as a fallback by DecodeSegment, making it undecodable without a keymap. EncodeSegment should detect this collision and force the SHA-fallback path to ensure the invariant that DecodeSegment only returns ErrShaFallbackNeedsKeymap for segments that actually require the keymap.
Suggested change:

```diff
 func EncodeSegment(raw []byte) string {
-	encoded := percentEncode(raw)
-	if len(encoded) <= maxSegmentBytes {
-		return encoded
-	}
-	return shaFallback(raw)
-}
+	if len(raw) > maxSegmentBytes {
+		return shaFallback(raw)
+	}
+	encoded := percentEncode(raw)
+	if len(encoded) <= maxSegmentBytes && !isShaFallback(encoded) {
+		return encoded
+	}
+	return shaFallback(raw)
+}
```
References
- To prevent unbounded memory growth and potential OOM issues, apply a fixed bound to inputs that can trigger large allocations.
- When deserializing or processing data, pre-allocating a buffer is acceptable if a strictly enforced maximum size limit is checked before allocation.
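The collision half of this finding is easy to reproduce against a minimal shape check; `looksLikeShaFallback` below is a hypothetical stand-in for the package's `isShaFallback`, not its actual code:

```go
package main

import "fmt"

// looksLikeShaFallback mimics the structural test the decoder uses:
// 32 hex characters followed by the "__" separator.
func looksLikeShaFallback(seg string) bool {
	const prefixLen = 32
	if len(seg) < prefixLen+2 {
		return false
	}
	for i := 0; i < prefixLen; i++ {
		c := seg[i]
		isHex := c >= '0' && c <= '9' || c >= 'a' && c <= 'f'
		if !isHex {
			return false
		}
	}
	return seg[prefixLen:prefixLen+2] == "__"
}

func main() {
	// A perfectly ordinary user key made entirely of unreserved bytes...
	key := "0123456789abcdef0123456789abcdef__x"
	// ...still matches the fallback shape, so a naive decoder would
	// demand a KEYMAP entry that was never written.
	fmt.Println(looksLikeShaFallback(key)) // true
}
```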
```go
func EncodeBinarySegment(raw []byte) string {
	enc := binaryPrefix + base64.RawURLEncoding.EncodeToString(raw)
	if len(enc) <= maxSegmentBytes {
		return enc
	}
	return shaFallback(raw)
}
```
Similar to EncodeSegment, EncodeBinarySegment should guard against large inputs to prevent excessive memory allocation in base64.EncodeToString before the length check is performed.
Suggested change:

```diff
 func EncodeBinarySegment(raw []byte) string {
+	if len(raw) > maxSegmentBytes {
+		return shaFallback(raw)
+	}
 	enc := binaryPrefix + base64.RawURLEncoding.EncodeToString(raw)
 	if len(enc) <= maxSegmentBytes {
 		return enc
 	}
 	return shaFallback(raw)
 }
```
References
- To prevent unbounded memory growth and potential OOM issues, apply a fixed bound to inputs that can trigger large allocations.
- When deserializing or processing data, pre-allocating a buffer is acceptable if a strictly enforced maximum size limit is checked before allocation.
```go
func DecodeSegment(seg string) ([]byte, error) {
	if isShaFallback(seg) {
```
As a defensive measure, DecodeSegment should validate that the input segment length does not exceed maxSegmentBytes. Since all valid encoded segments (including fallbacks) are guaranteed to be within this limit, rejecting oversized inputs prevents unnecessary allocations in percentDecode or base64.DecodeString.
Suggested change:

```diff
 func DecodeSegment(seg string) ([]byte, error) {
+	if len(seg) > maxSegmentBytes {
+		return nil, errors.Wrap(ErrInvalidEncodedSegment, "segment length exceeds maximum")
+	}
 	if isShaFallback(seg) {
```
References
- To prevent unbounded memory growth and potential OOM issues, apply a fixed bound to collections or inputs that can grow from external requests.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 25f5da368d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```go
	if _, ok := unhex(seg[i]); !ok {
		return false
	}
}
return seg[shaFallbackHexPrefixBytes:shaFallbackHexPrefixBytes+len(shaFallbackSeparator)] == shaFallbackSeparator
```
Disambiguate SHA fallback from plain encoded segments
Any segment that begins with 32 hex characters followed by __ is treated as SHA fallback, even when it is a normal output of EncodeSegment (e.g. key 0123456789abcdef0123456789abcdef__x, which is unreserved and <=240 bytes). In that case EncodeSegment returns the raw string but DecodeSegment returns ErrShaFallbackNeedsKeymap, so round-trip decode fails for valid keys and restore logic can incorrectly require a KEYMAP entry that was never written.
Two reviewer-flagged correctness issues in EncodeSegment:

1. SHA-fallback collision (Gemini HIGH, Codex P1): A user key whose percent-encoded form happens to begin with 32 hex characters + "__" (e.g. "0123456789abcdef0123456789abcdef__x") passed through EncodeSegment unchanged. DecodeSegment's structural detection then misclassified the legitimate key as a SHA fallback and returned ErrShaFallbackNeedsKeymap, breaking round-trip for keys that should not require KEYMAP.jsonl. EncodeSegment now promotes such inputs to a real SHA fallback so KEYMAP.jsonl carries the original bytes for exact recovery.

2. Huge-input transient allocation (Gemini HIGH on EncodeSegment and EncodeBinarySegment, MEDIUM on DecodeSegment): percentEncode pre-grew its builder to 3*len(raw); a 1 GiB raw input would allocate 3 GiB before the length check fired. The encoders now short-circuit to shaFallback when the worst-case expansion (3*len for percent-encoded, 4/3*len + 4 for b64) would always overflow the ceiling. DecodeSegment refuses inputs longer than maxSegmentBytes outright as a defensive measure.

Tests added:
- TestEncodeSegment_KeyMatchingShaFallbackShapeIsPromotedToFallback
- TestEncodeSegment_HugeInputDoesNotMaterialiseFullExpansion
- TestDecodeSegment_RejectsOversizedInput
- TestEncodeBinarySegment_HugeInputTakesShaFallbackWithoutEncoding
Review feedback addressed:
- HIGH (Gemini line 90, Codex P1 line 207) — SHA-fallback collision
- HIGH (Gemini lines 90, 104) — huge-input transient allocation
- MEDIUM (Gemini line 111) — DecodeSegment defensive length check

New tests:
@claude review

Claude finished @bootjp's task in 3m 1s — Code Review:
| Line | Literal | Suggested constant |
|---|---|---|
| 174 | 3 in b.Grow(len(raw) * 3) | use existing percentEncodeMaxExpansion |
| 181 | 4 (nibble shift) | nibbleWidth = 4 |
| 182 | 0x0F (low-nibble mask) | lowNibbleMask byte = 0x0F |
| 199 | 2 (hex digits after %) | hexEscapeDigits = 2 |
| 217 | 4 (nibble shift) | same nibbleWidth |
| 263, 266, 274, 276 | 10 (decimal/hex boundary) | hexAlphaBase = 10 |
Note that percentEncodeMaxExpansion was added in this PR specifically to name the 3, but line 174 inside percentEncode still uses the literal with //nolint. The constant should be used there too.
Minor — shaFallbackHexPrefixBytes misleading name
Still names a character count, not a byte count (32 hex chars = 16 bytes of SHA-256). Consider shaFallbackHexPrefixLen or shaFallbackHexChars. Affects lines 44, 225, 234, 237, 242.
What's well done in this revision
- The promoted-fallback path in `EncodeSegment` is the right design choice: it preserves the invariant that `isShaFallback` is authoritative on decoder input, with no separator change required.
- `base64.RawURLEncoding.EncodedLen` used for the pre-check in `EncodeBinarySegment` is precise and allocation-free.
- `TestEncodeSegment_HugeInputDoesNotMaterialiseFullExpansion` (1 MiB) and `TestEncodeBinarySegment_HugeInputTakesShaFallbackWithoutEncoding` are the right regression anchors.
- `TestDecodeSegment_RejectsOversizedInput` correctly validates the new guard.
- Commit message is excellent — it maps each fix back to the specific reviewer and comment.
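The allocation-free pre-check praised here can be sketched with nothing but the standard library; the constants and the helper name `wouldFitAsBinary` are assumptions, not the PR's literal code:

```go
package main

import (
	"encoding/base64"
	"fmt"
)

const (
	maxSegmentBytes = 240    // segment ceiling from the design doc
	binaryPrefix    = "b64." // prefix marking binary segments
)

// wouldFitAsBinary reports whether the b64.<base64url> form of raw
// would stay within the segment ceiling, without encoding anything:
// EncodedLen is pure arithmetic, so no buffer is ever allocated.
func wouldFitAsBinary(raw []byte) bool {
	return len(binaryPrefix)+base64.RawURLEncoding.EncodedLen(len(raw)) <= maxSegmentBytes
}

func main() {
	fmt.Println(wouldFitAsBinary(make([]byte, 100)))   // true: 4 + 134 <= 240
	fmt.Println(wouldFitAsBinary(make([]byte, 1<<20))) // false, decided in O(1)
}
```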
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e7a84eba67
```go
if len(raw)*percentEncodeMaxExpansion > maxSegmentBytes {
	return shaFallback(raw)
```
Remove overbroad fast path in EncodeSegment
The early-return condition len(raw)*percentEncodeMaxExpansion > maxSegmentBytes is not a safe “always overflow” check: it sends every input longer than 80 bytes to SHA fallback, including plain unreserved keys whose encoded form would still be <=240 bytes (e.g., 200 ASCII letters). That changes many reversible keys into ErrShaFallbackNeedsKeymap cases, creating unnecessary KEYMAP dependence and breaking the documented non-fallback round-trip behavior for 81–240 byte segments.
```go
if strings.HasPrefix(seg, binaryPrefix) {
	raw, err := base64.RawURLEncoding.DecodeString(seg[len(binaryPrefix):])
```
Prevent b64. prefix collisions with string segments
Any segment beginning with b64. is decoded as binary, but EncodeSegment leaves unreserved string keys unchanged, so string keys like b64.Zm9v decode to different bytes (foo) and keys like b64.abc can fail decode entirely if the suffix is not valid base64url. This makes string-key round-tripping non-invertible for a realistic key prefix unless callers carry an external type channel for every segment.
Two P1 follow-ups from Codex.

#100 Codex P1 -- overbroad fast path. The previous len(raw)*3 > 240 fast path tripped for any input longer than 80 bytes, including pure-ASCII keys whose 1:1 encoded form would have fit fine. That converted reversible keys into ErrShaFallbackNeedsKeymap unnecessarily. Replaced with:
1. len(raw) > maxSegmentBytes -> SHA fallback (1:1 lower bound; no encoding can fit).
2. percentEncodeBounded: stream-encodes with an in-loop overflow check so the partial allocation is bounded to maxSegmentBytes even on adversarial inputs that DO need escaping. Returns ("", false) on overflow so the caller takes the SHA path without seeing the partial output.

TestEncodeSegment_LongUnreservedASCIIEncodesAsIs locks the correct 200-byte-ASCII round-trip.

#146 Codex P1 -- b64. prefix collision. A user STRING key like "b64.foo" was returned as-is by EncodeSegment (all unreserved) and then misclassified by DecodeSegment as a binary segment, decoding the base64 to "foo" instead of round-tripping. EncodeSegment now promotes any input whose percent-encoded form starts with the binary prefix to a real SHA fallback, parallel to the existing SHA-shape collision check, so KEYMAP.jsonl carries the original bytes. TestEncodeSegment_KeyStartingWithBinaryPrefixIsPromotedToFallback covers it.

The previous huge-input OOM-guard property (no all-at-once 3*len(raw) allocation) is preserved by percentEncodeBounded; the existing TestEncodeSegment_HugeInputDoesNotMaterialiseFullExpansion still passes.
2 Codex P1 follow-ups addressed:
- #100 — Overbroad fast path.
- #146 — b64. prefix collision.
@claude review

Claude finished @bootjp's task in 2m 54s — Code Review:
| Lines | Literal | Suggested constant |
|---|---|---|
| 141, 215 | c >> 4 (nibble shift) | nibbleWidth = 4 |
| 142, 216 | c & 0x0F | lowNibbleMask byte = 0x0F |
| 208 | len(raw) * 3 | use existing escapeBytes local |
| 251 | hi << 4 | same nibbleWidth |
| 297, 300, 308, 310 | 10 | hexAlphaBase = 10 |
CLAUDE.md: "Avoid //nolint — refactor instead."
Minor — shaFallbackHexPrefixBytes misleading name
The constant counts hex characters, not raw bytes (32 hex chars = 16 bytes of SHA-256). Still present at lines 44, 259, 268, 271, 276. Consider shaFallbackHexPrefixLen or shaFallbackHexChars.
New observation (Low, non-blocking)
percentEncode (lines 205–219) is now unreachable from the main encoding path — percentEncodeBounded superseded it in EncodeSegment. Its only remaining caller is shaFallback (line 264), where it encodes the truncated suffix (at most 64 bytes, so the 3*64 = 192-byte pre-grow is safe). The two functions share an identical inner loop, which is minor duplication. Not a bug and not blocking, but percentEncode could be refactored away by having shaFallback call percentEncodeBounded with maxLen = shaFallbackTruncatedSuffixBytes*3.
What's well done in this revision
- `percentEncodeBounded`'s streaming design is exactly right for an OOM guard: bail early, cap pre-growth, no large intermediate allocation.
- The `base64.RawURLEncoding.EncodedLen(len(raw))` pre-check in `EncodeBinarySegment` is allocation-free and precise.
- The three promotion invariants (`len > max`, SHA-shape, `b64.` prefix) are all in a single readable ternary (line 110) with a clear doc comment above.
- New tests `TestEncodeSegment_LongUnreservedASCIIEncodesAsIs` and `TestEncodeSegment_KeyStartingWithBinaryPrefixIsPromotedToFallback` are the right regression anchors for the P1 bugs.
Summary
First piece of the Phase 0a logical-backup decoder described in `docs/design/2026_04_29_proposed_snapshot_logical_decoder.md`. Adds `internal/backup/filename.go` + tests — the filename encoding/decoding primitive every per-adapter encoder will depend on.

Subsequent PRs will add: the `KEYMAP.jsonl` writer, per-adapter encoders (DynamoDB / S3 / Redis / SQS), the main decoder pipeline, and the `cmd/elastickv-snapshot-decode` CLI.

What this PR does
- String segments in `[A-Za-z0-9._-]` pass through; every other byte becomes `%HH`. Long segments (>240 bytes after expansion) take a SHA-256-prefix + truncated-original fallback.
- Decoding reverses percent-encoded and `b64.` segments; SHA-fallback inputs return `ErrShaFallbackNeedsKeymap` so callers cannot fabricate the original bytes from the filename alone.
- Binary keys take the `b64.<base64url>` form so binary keys never collide with hex-shaped string keys.

Test plan
- `go test -race ./internal/backup/...` — all tests pass.
- `golangci-lint run ./internal/backup/...` — clean.
- Property tests via `pgregory.net/rapid` covering round-trip on both encoding paths and SHA-fallback post-condition.
- Malformed-input tests: truncated `%HH`, non-hex digits, raw reserved bytes, malformed `b64.` segments.

Self-review
- `-race` clean.
- SHA fallbacks require `KEYMAP.jsonl` for full reverse.