
backup: filename encoding (Phase 0a foundation)#711

Open
bootjp wants to merge 3 commits into main from
feat/backup-phase0a-filename

Conversation

@bootjp
Owner

@bootjp bootjp commented Apr 29, 2026

Summary

First piece of the Phase 0a logical-backup decoder described in docs/design/2026_04_29_proposed_snapshot_logical_decoder.md. Adds internal/backup/filename.go + tests — the filename encoding/decoding primitive every per-adapter encoder will depend on.

Subsequent PRs will add: KEYMAP.jsonl writer, per-adapter encoders (DynamoDB / S3 / Redis / SQS), the main decoder pipeline, and the cmd/elastickv-snapshot-decode CLI.

What this PR does

  • Encode: bytes in [A-Za-z0-9._-] pass through; every other byte becomes %HH. Long segments (>240 bytes after expansion) take a SHA-256-prefix + truncated-original fallback.
  • Decode: reverses both percent-encoded and b64.-prefixed segments; SHA-fallback inputs return ErrShaFallbackNeedsKeymap so callers cannot fabricate the original bytes from the filename alone.
  • Binary path: DynamoDB B-attribute keys take a separate b64.<base64url> form so binary keys never collide with hex-shaped string keys.
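The documented rules can be sketched in a few lines of standalone Go (an illustration of the stated rules only, not the PR's internal/backup API; the 240-byte SHA-fallback ceiling is omitted here):

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// isUnreserved reports whether b is in the documented pass-through set
// [A-Za-z0-9._-].
func isUnreserved(b byte) bool {
	switch {
	case b >= 'A' && b <= 'Z', b >= 'a' && b <= 'z', b >= '0' && b <= '9':
		return true
	}
	return b == '.' || b == '_' || b == '-'
}

// encodeSegment applies the string-key rule: unreserved bytes pass
// through; every other byte becomes %HH with uppercase hex.
func encodeSegment(raw []byte) string {
	out := make([]byte, 0, len(raw))
	for _, c := range raw {
		if isUnreserved(c) {
			out = append(out, c)
		} else {
			out = append(out, fmt.Sprintf("%%%02X", c)...)
		}
	}
	return string(out)
}

// encodeBinarySegment applies the separate binary-key rule:
// "b64." + unpadded base64url.
func encodeBinarySegment(raw []byte) string {
	return "b64." + base64.RawURLEncoding.EncodeToString(raw)
}

func main() {
	fmt.Println(encodeSegment([]byte("user.1")))         // user.1
	fmt.Println(encodeSegment([]byte("a b/c")))          // a%20b%2Fc
	fmt.Println(encodeBinarySegment([]byte{0xDE, 0xAD})) // b64.3q0
}
```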

Test plan

  • go test -race ./internal/backup/... — all tests pass.
  • golangci-lint run ./internal/backup/... — clean.
  • Property tests via pgregory.net/rapid covering round-trip on both encoding paths and SHA-fallback post-condition.
  • Negative tests: truncated %HH, non-hex digits, raw reserved bytes, malformed b64. segments.
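The round-trip property the rapid tests check can be approximated with a plain stdlib loop over random inputs (a sketch against the same encoding rules, not the actual package under test; the real suite uses pgregory.net/rapid):

```go
package main

import (
	"bytes"
	"fmt"
	"math/rand"
)

func isUnreserved(b byte) bool {
	switch {
	case b >= 'A' && b <= 'Z', b >= 'a' && b <= 'z', b >= '0' && b <= '9':
		return true
	}
	return b == '.' || b == '_' || b == '-'
}

func encodeSegment(raw []byte) string {
	out := make([]byte, 0, len(raw))
	for _, c := range raw {
		if isUnreserved(c) {
			out = append(out, c)
		} else {
			out = append(out, fmt.Sprintf("%%%02X", c)...)
		}
	}
	return string(out)
}

// unhex accepts the uppercase hex the encoder emits.
func unhex(c byte) (byte, bool) {
	switch {
	case c >= '0' && c <= '9':
		return c - '0', true
	case c >= 'A' && c <= 'F':
		return c - 'A' + 10, true
	}
	return 0, false
}

// decodeSegment inverts encodeSegment, rejecting truncated escapes,
// non-hex digits, and raw reserved bytes.
func decodeSegment(seg string) ([]byte, error) {
	out := make([]byte, 0, len(seg))
	for i := 0; i < len(seg); i++ {
		switch c := seg[i]; {
		case c == '%':
			if i+2 >= len(seg) {
				return nil, fmt.Errorf("truncated %%HH escape")
			}
			hi, ok1 := unhex(seg[i+1])
			lo, ok2 := unhex(seg[i+2])
			if !ok1 || !ok2 {
				return nil, fmt.Errorf("non-hex digit in escape")
			}
			out = append(out, hi<<4|lo)
			i += 2
		case isUnreserved(c):
			out = append(out, c)
		default:
			return nil, fmt.Errorf("raw reserved byte %q", c)
		}
	}
	return out, nil
}

func main() {
	rng := rand.New(rand.NewSource(1))
	for n := 0; n < 1000; n++ {
		raw := make([]byte, rng.Intn(64))
		rng.Read(raw)
		got, err := decodeSegment(encodeSegment(raw))
		if err != nil || !bytes.Equal(got, raw) {
			panic(fmt.Sprintf("round-trip failed for %x", raw))
		}
	}
	fmt.Println("round-trip holds")
}
```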

Self-review

  • Data loss — N/A; pure encoding/decoding library. SHA-fallback explicitly returns a typed error rather than synthesizing a wrong key.
  • Concurrency — All functions are pure; no shared state. -race clean.
  • Performance — Single-pass encoders; pre-grown builder; no allocations beyond the output. SHA-256 only on the fallback path.
  • Data consistency — Encoder is deterministic given the same input. Decode is the inverse on non-fallback inputs (verified by rapid). SHA-fallback is documented as requiring KEYMAP.jsonl for full reverse.
  • Test coverage — Table-driven cases for the documented rules + rapid property tests for round-trip + targeted negative tests. New non-test code lines added: 222.

…ndation)

Implements internal/backup/{filename.go,filename_test.go} per the Phase 0
design doc (docs/design/2026_04_29_proposed_snapshot_logical_decoder.md).

Encoding rules (deterministic, reversible from the filename alone except
for the SHA fallback):
- RFC3986 unreserved set [A-Za-z0-9._-] passes through.
- Other bytes -> %HH (uppercase hex), like form-urlencoded but applied
  to every non-allowlisted byte.
- Segments exceeding 240 bytes after percent-encoding render as
  <sha256-hex-prefix-32>__<truncated-original> with the full original
  recorded in KEYMAP.jsonl by the caller. DecodeSegment refuses these
  with ErrShaFallbackNeedsKeymap so callers cannot fabricate the
  original bytes from the filename alone.
- DynamoDB B-attribute (binary) keys take a separate "b64.<base64url>"
  path so a binary key never collides with a hex-shaped string key.

Tests cover passthrough, percent escaping, hex-uppercase invariant,
SHA-fallback firing on long inputs (raw or post-expansion), binary
round-trip, malformed-input rejection, output-length bound under
adversarial inputs, and rapid-driven property tests for round-trip on
both percent and binary paths.
@coderabbitai

coderabbitai Bot commented Apr 29, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 23 minutes and 10 seconds before requesting another review.


📥 Commits

Reviewing files that changed from the base of the PR and between 7ac9c04 and 1e28238.

📒 Files selected for processing (2)
  • internal/backup/filename.go
  • internal/backup/filename_test.go

@bootjp
Owner Author

bootjp commented Apr 29, 2026

@claude review

@claude

claude Bot commented Apr 29, 2026

Claude finished @bootjp's task in 4m 11s


Code Review: internal/backup/filename.go

  • Gather context
  • Read internal/backup/filename.go
  • Read internal/backup/filename_test.go
  • Analyze encoding/decoding logic
  • Review test coverage
  • Post final review

The implementation is clean and well-structured overall, but there is one correctness bug and a project-convention violation that should be fixed before merging.


Bug (High) — SHA fallback detection collides with valid user keys

isShaFallback (filename.go:198–208) is not injection-proof.

The detector checks: "first 32 chars are hex-like, then __". The problem: all of those characters (0–9, a–f, A–F, _) are in the unreserved set and pass through percentEncode unchanged. A user key like:

0123456789abcdef0123456789abcdef__mykey

is entirely unreserved and well under the 240-byte ceiling, so EncodeSegment emits it verbatim. Then DecodeSegment's isShaFallback check reports true → it returns ErrShaFallbackNeedsKeymap for what is actually a well-formed, directly decodable key.

The existing test TestEncodeSegment_ShaFallbackPrefixCannotCollideWithEncodedHex misses this case: it only tests inputs that are one char too short or that have a wrong separator. A test case that demonstrates the current breakage:

key := "0123456789abcdef0123456789abcdef__mykey"
enc := EncodeSegment([]byte(key))
// enc == key (all unreserved, length 38)
_, err := DecodeSegment(enc)
// currently: ErrShaFallbackNeedsKeymap  ← WRONG
// should be: nil, []byte(key)

The property test doesn't reliably catch this because it draws inputs up to 80 bytes randomly — the probability of hitting exactly 32 hex chars + __ is negligible.

Root cause: _ is unreserved, so __ can naturally appear in encoded user-key output. Any separator composed entirely of unreserved bytes can collide.
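A minimal stand-in for the detector makes the collision concrete (a simplified sketch of the structural check described above, not the PR's actual isShaFallback):

```go
package main

import "fmt"

const shaFallbackHexPrefixLen = 32

func isHexDigit(b byte) bool {
	return b >= '0' && b <= '9' || b >= 'a' && b <= 'f' || b >= 'A' && b <= 'F'
}

// isShaFallback mirrors the structural check: 32 hex-like characters
// followed by "__".
func isShaFallback(seg string) bool {
	if len(seg) < shaFallbackHexPrefixLen+len("__") {
		return false
	}
	for i := 0; i < shaFallbackHexPrefixLen; i++ {
		if !isHexDigit(seg[i]) {
			return false
		}
	}
	return seg[shaFallbackHexPrefixLen:shaFallbackHexPrefixLen+2] == "__"
}

func main() {
	// Every byte of this user key is unreserved, so percent-encoding
	// would emit it verbatim -- and the detector then misfires on it.
	key := "0123456789abcdef0123456789abcdef__mykey"
	fmt.Println(isShaFallback(key)) // true: misclassified as a fallback
}
```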

Fix: use a separator byte that can only appear in encoder-generated output. Since % is never a literal character in percent-encoded output (a user byte % would be encoded as %25), embedding % in the SHA-fallback separator makes it injection-proof. For example:

shaFallbackSeparator = "%__"
// isShaFallback then checks seg[32:35] == "%__"

A user key producing % at position 32 of its encoded form would mean the user byte at that position is non-unreserved, which would already break the 32-char hex check. So this separator cannot collide with any percent-encoded user-key output.

This is a wire-format change and is cheap to make now in Phase 0a before any data is written.


Convention violation (Medium) — //nolint:mnd suppressions

CLAUDE.md states: "Avoid //nolint — refactor instead."

There are 9 //nolint:mnd suppressions across filename.go. Each can be removed by extracting a named constant:

  • percentEncode, line 139: 3 (len("%HH")) → percentEncodedMaxExpansion = 3
  • percentEncode, lines 146–147: 4 (nibble width), 0x0F (low-nibble mask) → nibbleWidth = 4, lowNibbleMask byte = 0x0F
  • percentDecode, line 164: 2 (hex digits after %) → hexEscapeDigits = 2
  • percentDecode, line 182: 4 (nibble width) → same nibbleWidth
  • hexUpper, lines 228, 231: 10 (decimal/hex boundary) → hexAlphaBase = 10
  • unhex, lines 239, 241: 10 → same hexAlphaBase



Minor — shaFallbackHexPrefixBytes misleading name

shaFallbackHexPrefixBytes = 32 is 32 hex characters, which encode 16 bytes. The name suffix Bytes implies raw bytes, but the constant is used as a character count. Consider shaFallbackHexPrefixLen or shaFallbackHexChars to avoid confusion at the call sites.


What's well done

  • Encoding logic is correct for the non-colliding cases: percentEncode / percentDecode are proper inverses; the boundary check at i+2 >= len(seg) is correct; hexUpper / unhex are symmetric.
  • EncodeBinarySegment separation cleanly prevents the b64. form from colliding with hex-shaped string keys.
  • Negative tests (TestDecodeSegment_Rejects*) cover truncated escapes, non-hex digits, raw reserved bytes, and malformed base64 — exactly the right failure cases.
  • Property tests via rapid cover the round-trip invariant and the SHA-fallback post-condition for randomly generated inputs.
  • Length bound guarantee (TestEncodeSegment_OutputLengthBoundedByMax) is a good correctness invariant test.
  • SHA-fallback human-suffix test confirms operator greppability.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a filename encoding and decoding system for logical backups, supporting percent-encoding, base64-encoded binary segments, and a SHA-256 fallback for segments exceeding 240 bytes. The review feedback highlights critical improvements for robustness, specifically addressing potential OOM risks by enforcing length checks before large memory allocations in encoding and decoding functions. Furthermore, a logic fix was suggested for EncodeSegment to prevent user-supplied keys from colliding with the SHA-fallback pattern, ensuring deterministic decoding.

Comment on lines +84 to +90
func EncodeSegment(raw []byte) string {
encoded := percentEncode(raw)
if len(encoded) <= maxSegmentBytes {
return encoded
}
return shaFallback(raw)
}
Contributor


high

The current implementation of EncodeSegment has two issues:

  1. OOM Risk: It calls percentEncode(raw) before checking the length. If raw is extremely large, percentEncode will attempt to allocate 3 * len(raw) bytes (via strings.Builder.Grow), which can lead to an OOM. Since any raw longer than maxSegmentBytes is guaranteed to exceed the limit after encoding, we should short-circuit early.
  2. Collision Bug: A user-supplied key that happens to match the SHA-fallback pattern (32 hex characters followed by __) will be returned as-is by EncodeSegment but will be misidentified as a fallback by DecodeSegment, making it undecodable without a keymap. EncodeSegment should detect this collision and force the SHA-fallback path to ensure the invariant that DecodeSegment only returns ErrShaFallbackNeedsKeymap for segments that actually require the keymap.
Suggested change
func EncodeSegment(raw []byte) string {
encoded := percentEncode(raw)
if len(encoded) <= maxSegmentBytes {
return encoded
}
return shaFallback(raw)
}
func EncodeSegment(raw []byte) string {
if len(raw) > maxSegmentBytes {
return shaFallback(raw)
}
encoded := percentEncode(raw)
if len(encoded) <= maxSegmentBytes && !isShaFallback(encoded) {
return encoded
}
return shaFallback(raw)
}
References
  1. To prevent unbounded memory growth and potential OOM issues, apply a fixed bound to inputs that can trigger large allocations.
  2. When deserializing or processing data, pre-allocating a buffer is acceptable if a strictly enforced maximum size limit is checked before allocation.

Comment on lines +98 to +104
func EncodeBinarySegment(raw []byte) string {
enc := binaryPrefix + base64.RawURLEncoding.EncodeToString(raw)
if len(enc) <= maxSegmentBytes {
return enc
}
return shaFallback(raw)
}
Contributor


high

Similar to EncodeSegment, EncodeBinarySegment should guard against large inputs to prevent excessive memory allocation in base64.EncodeToString before the length check is performed.

Suggested change
func EncodeBinarySegment(raw []byte) string {
enc := binaryPrefix + base64.RawURLEncoding.EncodeToString(raw)
if len(enc) <= maxSegmentBytes {
return enc
}
return shaFallback(raw)
}
func EncodeBinarySegment(raw []byte) string {
if len(raw) > maxSegmentBytes {
return shaFallback(raw)
}
enc := binaryPrefix + base64.RawURLEncoding.EncodeToString(raw)
if len(enc) <= maxSegmentBytes {
return enc
}
return shaFallback(raw)
}
References
  1. To prevent unbounded memory growth and potential OOM issues, apply a fixed bound to inputs that can trigger large allocations.
  2. When deserializing or processing data, pre-allocating a buffer is acceptable if a strictly enforced maximum size limit is checked before allocation.

Comment on lines +110 to +111
func DecodeSegment(seg string) ([]byte, error) {
if isShaFallback(seg) {
Contributor


medium

As a defensive measure, DecodeSegment should validate that the input segment length does not exceed maxSegmentBytes. Since all valid encoded segments (including fallbacks) are guaranteed to be within this limit, rejecting oversized inputs prevents unnecessary allocations in percentDecode or base64.DecodeString.

Suggested change
func DecodeSegment(seg string) ([]byte, error) {
if isShaFallback(seg) {
func DecodeSegment(seg string) ([]byte, error) {
if len(seg) > maxSegmentBytes {
return nil, errors.Wrap(ErrInvalidEncodedSegment, "segment length exceeds maximum")
}
if isShaFallback(seg) {
References
  1. To prevent unbounded memory growth and potential OOM issues, apply a fixed bound to collections or inputs that can grow from external requests.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 25f5da368d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +203 to +207
if _, ok := unhex(seg[i]); !ok {
return false
}
}
return seg[shaFallbackHexPrefixBytes:shaFallbackHexPrefixBytes+len(shaFallbackSeparator)] == shaFallbackSeparator


P1 Badge Disambiguate SHA fallback from plain encoded segments

Any segment that begins with 32 hex characters followed by __ is treated as SHA fallback, even when it is a normal output of EncodeSegment (e.g. key 0123456789abcdef0123456789abcdef__x, which is unreserved and <=240 bytes). In that case EncodeSegment returns the raw string but DecodeSegment returns ErrShaFallbackNeedsKeymap, so round-trip decode fails for valid keys and restore logic can incorrectly require a KEYMAP entry that was never written.


Two reviewer-flagged correctness issues in EncodeSegment:

1. SHA-fallback collision (Gemini HIGH, Codex P1):
   A user key whose percent-encoded form happens to begin with 32 hex
   characters + "__" (e.g. "0123456789abcdef0123456789abcdef__x")
   passed through EncodeSegment unchanged. DecodeSegment's structural
   detection then misclassified the legitimate key as a SHA fallback
   and returned ErrShaFallbackNeedsKeymap, breaking round-trip for
   keys that should not require KEYMAP.jsonl. EncodeSegment now
   promotes such inputs to a real SHA fallback so KEYMAP.jsonl
   carries the original bytes for exact recovery.

2. Huge-input transient allocation (Gemini HIGH on EncodeSegment and
   EncodeBinarySegment, MEDIUM on DecodeSegment):
   percentEncode pre-grew its builder to 3*len(raw); a 1 GiB raw
   input would allocate 3 GiB before the length check fired. The
   encoders now short-circuit to shaFallback when the worst-case
   expansion (3*len for percent-encoded, 4/3*len + 4 for b64) would
   always overflow the ceiling. DecodeSegment refuses inputs longer
   than maxSegmentBytes outright as a defensive measure.

Tests added:
- TestEncodeSegment_KeyMatchingShaFallbackShapeIsPromotedToFallback
- TestEncodeSegment_HugeInputDoesNotMaterialiseFullExpansion
- TestDecodeSegment_RejectsOversizedInput
- TestEncodeBinarySegment_HugeInputTakesShaFallbackWithoutEncoding
@bootjp
Owner Author

bootjp commented Apr 30, 2026

Review feedback addressed in e7a84eba:

HIGH (Gemini line 90, Codex P1 line 207) — SHA-fallback collision
Confirmed bug. A user key like 0123456789abcdef0123456789abcdef__suffix (entirely unreserved) passed through EncodeSegment unchanged, then DecodeSegment's structural detector misclassified it as a SHA fallback and returned ErrShaFallbackNeedsKeymap. Round-trip broke for legitimate keys that should not have required KEYMAP.jsonl. EncodeSegment now promotes such inputs to a real SHA fallback so KEYMAP.jsonl carries the original bytes.

HIGH (Gemini lines 90, 104) — huge-input transient allocation
percentEncode pre-grew its builder to 3 * len(raw); a 1 GiB raw input would allocate 3 GiB just to be discarded by the length check. Encoders now short-circuit to shaFallback when the worst-case expansion would always overflow the ceiling — 3*len for percent, 4/3*len + 4 for b64.<base64url>. No allocation of the full expansion happens for adversarially large inputs.

MEDIUM (Gemini line 111) — DecodeSegment defensive length check
DecodeSegment now rejects inputs longer than maxSegmentBytes outright with ErrInvalidEncodedSegment. EncodeSegment never produces such inputs; any caller passing one is reading a corrupted dump or has a bug.

New tests:

  • TestEncodeSegment_KeyMatchingShaFallbackShapeIsPromotedToFallback — the collision-shaped input
  • TestEncodeSegment_HugeInputDoesNotMaterialiseFullExpansion — 1 MiB raw, asserts SHA fallback + bounded output
  • TestDecodeSegment_RejectsOversizedInput
  • TestEncodeBinarySegment_HugeInputTakesShaFallbackWithoutEncoding

@claude review

@claude

claude Bot commented Apr 30, 2026

Claude finished @bootjp's task in 3m 1s


Code Review: internal/backup/filename.go (round 2)

  • Read current state of filename.go and filename_test.go
  • Verify SHA-fallback collision fix
  • Verify huge-input OOM fix
  • Verify DecodeSegment defensive length check
  • Verify new tests are correct and sufficient
  • Look for any remaining issues

All three high/medium issues from round 1 are correctly fixed. Two convention items from round 1 remain unaddressed.


Fixes verified ✓

SHA-fallback collision (EncodeSegment, lines 98–107)

The logic is correct. After percentEncode(raw), the result is checked with isShaFallback(encoded): if the encoded form happens to match the 32-hex-chars + "__" pattern (possible only when the first 32 raw bytes are all from {0-9, a-f, A-F} and bytes 32–33 are __, all of which are unreserved and pass through unchanged), the encoder promotes to a real SHA fallback so DecodeSegment's structural check remains correct. The new test TestEncodeSegment_KeyMatchingShaFallbackShapeIsPromotedToFallback (line 312) directly exercises the broken case.

OOM short-circuit in EncodeSegment (line 99)

len(raw)*percentEncodeMaxExpansion > maxSegmentBytes is exactly right. At len(raw) = 80, 80*3 = 240 = maxSegmentBytes (not >), so percentEncode is still called and b.Grow allocates at most 240 bytes. At len(raw) = 81, we short-circuit. The bound is tight.

OOM short-circuit in EncodeBinarySegment (line 118)

Using base64.RawURLEncoding.EncodedLen(len(raw)) + len(binaryPrefix) > maxSegmentBytes is correct: EncodedLen is pure integer arithmetic with no allocation, so the check is free.
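The shape of that pre-check can be sketched as follows (constant names follow the review discussion and are assumptions, not necessarily the package's exact identifiers):

```go
package main

import (
	"encoding/base64"
	"fmt"
)

const (
	maxSegmentBytes = 240    // ceiling from the design doc
	binaryPrefix    = "b64." // binary-path marker
)

// wouldOverflow performs the allocation-free pre-check: EncodedLen is
// pure integer arithmetic, so no base64 output is ever materialised
// for inputs that cannot fit under the ceiling.
func wouldOverflow(raw []byte) bool {
	return base64.RawURLEncoding.EncodedLen(len(raw))+len(binaryPrefix) > maxSegmentBytes
}

func main() {
	fmt.Println(wouldOverflow(make([]byte, 100)))   // false: fits under 240
	fmt.Println(wouldOverflow(make([]byte, 1<<30))) // true: rejected before any encoding
}
```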

DecodeSegment defensive length guard (lines 138–141)

Clean. Any EncodeSegment output is ≤ maxSegmentBytes, so anything longer is either a corrupted dump or a caller bug — rejecting it outright is the right call.


Remaining issues (both carried from round 1)

Convention violation (Medium) — //nolint:mnd suppressions (9 sites)

CLAUDE.md: "Avoid //nolint — refactor instead." All nine suppressions in the file can be replaced with named constants:

  • line 174: 3 in b.Grow(len(raw) * 3) → use the existing percentEncodeMaxExpansion
  • line 181: 4 (nibble shift) → nibbleWidth = 4
  • line 182: 0x0F (low-nibble mask) → lowNibbleMask byte = 0x0F
  • line 199: 2 (hex digits after %) → hexEscapeDigits = 2
  • line 217: 4 (nibble shift) → same nibbleWidth
  • lines 263, 266, 274, 276: 10 (decimal/hex boundary) → hexAlphaBase = 10

Note that percentEncodeMaxExpansion was added in this PR specifically to name the 3, but line 174 inside percentEncode still uses the literal with //nolint. The constant should be used there too.


Minor — shaFallbackHexPrefixBytes misleading name

Still names a character count, not a byte count (32 hex chars = 16 bytes of SHA-256). Consider shaFallbackHexPrefixLen or shaFallbackHexChars. Affects lines 44, 225, 234, 237, 242.


What's well done in this revision

  • The promoted-fallback path in EncodeSegment is the right design choice: it preserves the invariant that isShaFallback is authoritative on decoder input, with no separator change required.
  • base64.RawURLEncoding.EncodedLen used for the pre-check in EncodeBinarySegment is precise and allocation-free.
  • TestEncodeSegment_HugeInputDoesNotMaterialiseFullExpansion (1 MiB) and TestEncodeBinarySegment_HugeInputTakesShaFallbackWithoutEncoding are the right regression anchors.
  • TestDecodeSegment_RejectsOversizedInput correctly validates the new guard.
  • Commit message is excellent — it maps each fix back to the specific reviewer and comment.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e7a84eba67


Comment thread: internal/backup/filename.go (outdated)
Comment on lines +99 to +100
if len(raw)*percentEncodeMaxExpansion > maxSegmentBytes {
return shaFallback(raw)


P1 Badge Remove overbroad fast path in EncodeSegment

The early-return condition len(raw)*percentEncodeMaxExpansion > maxSegmentBytes is not a safe “always overflow” check: it sends every input longer than 80 bytes to SHA fallback, including plain unreserved keys whose encoded form would still be <=240 bytes (e.g., 200 ASCII letters). That changes many reversible keys into ErrShaFallbackNeedsKeymap cases, creating unnecessary KEYMAP dependence and breaking the documented non-fallback round-trip behavior for 81–240 byte segments.


Comment on lines +145 to +146
if strings.HasPrefix(seg, binaryPrefix) {
raw, err := base64.RawURLEncoding.DecodeString(seg[len(binaryPrefix):])


P1 Badge Prevent b64. prefix collisions with string segments

Any segment beginning with b64. is decoded as binary, but EncodeSegment leaves unreserved string keys unchanged, so string keys like b64.Zm9v decode to different bytes (foo) and keys like b64.abc can fail decode entirely if the suffix is not valid base64url. This makes string-key round-tripping non-invertible for a realistic key prefix unless callers carry an external type channel for every segment.


Two P1 follow-ups from Codex.

#100 Codex P1 -- overbroad fast path.
The previous len(raw)*3 > 240 fast path tripped for any input
longer than 80 bytes, including pure-ASCII keys whose 1:1 encoded
form would have fit fine. That converted reversible keys into
ErrShaFallbackNeedsKeymap unnecessarily. Replaced with:
  1. len(raw) > maxSegmentBytes -> SHA fallback (1:1 lower bound;
     no encoding can fit).
  2. percentEncodeBounded: stream-encodes with an in-loop overflow
     check so the partial allocation is bounded to maxSegmentBytes
     even on adversarial inputs that DO need escaping. Returns
     ("", false) on overflow so the caller takes the SHA path
     without seeing the partial output.
TestEncodeSegment_LongUnreservedASCIIEncodesAsIs locks the
correct 200-byte-ASCII round-trip.

#146 Codex P1 -- b64. prefix collision.
A user STRING key like "b64.foo" was returned as-is by
EncodeSegment (all unreserved) and then misclassified by
DecodeSegment as a binary segment, decoding the base64 to "foo"
instead of round-tripping. EncodeSegment now promotes any input
whose percent-encoded form starts with the binary prefix to a
real SHA fallback, parallel to the existing SHA-shape collision
check, so KEYMAP.jsonl carries the original bytes.
TestEncodeSegment_KeyStartingWithBinaryPrefixIsPromotedToFallback
covers it.

The previous huge-input OOM-guard property (no all-at-once
3*len(raw) allocation) is preserved by percentEncodeBounded; the
existing TestEncodeSegment_HugeInputDoesNotMaterialiseFullExpansion
still passes.
@bootjp
Owner Author

bootjp commented Apr 30, 2026

2 Codex P1 follow-ups addressed in 1e282382:

#100 — Overbroad fast path. len(raw)*3 > 240 falsely tripped for any input longer than 80 bytes, including pure-ASCII unreserved keys whose 1:1 encoding would have fit. Replaced with:

  1. len(raw) > maxSegmentBytes → SHA fallback (1:1 lower bound, can't fit anyway).
  2. percentEncodeBounded streams the encode and bails on overflow, capping the partial allocation at maxSegmentBytes. Restores the documented round-trip for 81–240 byte unreserved keys while preserving the OOM guard for adversarial inputs that DO need escaping.

TestEncodeSegment_LongUnreservedASCIIEncodesAsIs (200-byte ASCII round-trip) locks it in. The existing TestEncodeSegment_HugeInputDoesNotMaterialiseFullExpansion still passes because percentEncodeBounded never allocates more than maxSegmentBytes + 3.

#146 — b64. prefix collision. A user STRING key like b64.foo was returned as-is (all unreserved) and then misclassified by DecodeSegment as binary, decoding the base64 to foo instead of round-tripping. EncodeSegment now promotes any input whose encoded form starts with b64. to a real SHA fallback (parallel to the existing SHA-shape collision check). KEYMAP.jsonl carries the original bytes.

TestEncodeSegment_KeyStartingWithBinaryPrefixIsPromotedToFallback.

@claude review

@claude

claude Bot commented Apr 30, 2026

Claude finished @bootjp's task in 2m 54s


Code Review: internal/backup/filename.go (round 3)

  • Read current state of filename.go and filename_test.go
  • Verify overbroad fast-path fix (percentEncodeBounded)
  • Verify b64. prefix collision fix
  • Verify new tests are correct and sufficient
  • Check remaining issues from round 2
  • Post final review

Both Codex P1 fixes are correctly implemented. One new low-priority observation.


Fixes verified ✓

Overbroad fast path (#100) — percentEncodeBounded (lines 121–145)

The streaming encoder is correct. It bails as soon as adding a character (+1) or a 3-byte escape (+escapeBytes) would exceed maxLen, so it never over-runs the ceiling. The pre-grow cap is bounded to min(3*len(raw), maxLen+3) = 243 bytes regardless of input size — never 3*len(raw) for adversarial inputs.

The early guard len(raw) > maxSegmentBytes (line 105) handles the 1:1 lower bound before percentEncodeBounded is called, which also keeps the cap multiplication within 3*240 = 720 (never overflows). The bound is tight: len(raw) = 80 bytes → 3*80 = 240 = maxSegmentBytes, so percentEncodeBounded is still called; at len(raw) = 81 the caller short-circuits first.

TestEncodeSegment_LongUnreservedASCIIEncodesAsIs (line 312) correctly anchors the previously-broken 200-byte ASCII case.

b64. prefix collision (#146) — EncodeSegment line 110

encoded, ok := percentEncodeBounded(raw, maxSegmentBytes)
if !ok || isShaFallback(encoded) || strings.HasPrefix(encoded, binaryPrefix) {
    return shaFallback(raw)
}

The check is correct and complete. It covers both fully-unreserved keys (b64.foo encodes verbatim to b64.foo, promoted) and partially-reserved keys (b64.!!! encodes to b64.%21%21%21, which still carries the b64. prefix, promoted). TestEncodeSegment_KeyStartingWithBinaryPrefixIsPromotedToFallback (line 331) is the right regression anchor.


Remaining issues (carried across all three review rounds)

Convention violation (Medium) — //nolint:mnd suppressions (now 10 sites)

Round 3 added 2 more at lines 141–142 inside percentEncodeBounded. All 10 can be replaced with named constants:

  • lines 141, 215: c >> 4 (nibble shift) → nibbleWidth = 4
  • lines 142, 216: c & 0x0F → lowNibbleMask byte = 0x0F
  • line 208: len(raw) * 3 → use the existing escapeBytes local
  • line 251: hi << 4 → same nibbleWidth
  • lines 297, 300, 308, 310: 10 → hexAlphaBase = 10

CLAUDE.md: "Avoid //nolint — refactor instead."


Minor — shaFallbackHexPrefixBytes misleading name

The constant counts hex characters, not raw bytes (32 hex chars = 16 bytes of SHA-256). Still present at lines 44, 259, 268, 271, 276. Consider shaFallbackHexPrefixLen or shaFallbackHexChars.


New observation (Low, non-blocking)

percentEncode (lines 205–219) is now unreachable from the main encoding path — percentEncodeBounded superseded it in EncodeSegment. Its only remaining caller is shaFallback (line 264), where it encodes the truncated suffix (at most 64 bytes, so the 3*64 = 192-byte pre-grow is safe). The two functions share an identical inner loop, which is minor duplication. Not a bug and not blocking, but percentEncode could be refactored away by having shaFallback call percentEncodeBounded with maxLen = shaFallbackTruncatedSuffixBytes*3.


What's well done in this revision

  • percentEncodeBounded's streaming design is exactly right for an OOM guard: bail early, cap pre-growth, no large intermediate allocation.
  • base64.RawURLEncoding.EncodedLen(len(raw)) pre-check in EncodeBinarySegment is allocation-free and precise.
  • The three promotion invariants (len > max, SHA-shape, b64. prefix) are all in a single readable ternary (line 110) with a clear doc comment above.
  • New tests TestEncodeSegment_LongUnreservedASCIIEncodesAsIs and TestEncodeSegment_KeyStartingWithBinaryPrefixIsPromotedToFallback are the right regression anchors for the P1 bugs.
