Skip to content

Add byte-count limit to TCompactProtocol varint reader#3410

Merged
Jens-G merged 1 commit into
apache:masterfrom
Jens-G:compact-varint-limit
Apr 23, 2026
Merged

Add byte-count limit to TCompactProtocol varint reader#3410
Jens-G merged 1 commit into
apache:masterfrom
Jens-G:compact-varint-limit

Conversation

@Jens-G
Copy link
Copy Markdown
Member

@Jens-G Jens-G commented Apr 22, 2026

Summary

The varint reader in TCompactProtocol loops until it encounters a non-continuation byte (bit 7 clear), without bounding the number of bytes it will consume. The protobuf wire-format specification, which the compact protocol wire format is based on, defines a maximum of 10 bytes for a 64-bit varint (5 bytes for 32-bit). C++ and Node.js already enforce this limit and raise a protocol exception when it is exceeded.

This change extends that limit to all remaining runtimes:

  • Python (lib/py): readVarint() uses a bounded for loop; raises TProtocolException(INVALID_DATA) on overflow.
  • Go (lib/go): readVarint64() uses a bounded loop; returns TProtocolException(INVALID_DATA) on overflow.
  • PHP (lib/php): readVarint() uses a bounded while loop; throws TProtocolException(INVALID_DATA) on overflow.
  • Java (lib/java): slow-path readVarint32() (5-byte limit) and readVarint64() (10-byte limit) both throw TProtocolException(INVALID_DATA) on overflow.
  • netstd (lib/netstd): ReadVarInt32Async() and ReadVarInt64Async() use bounded loops; throw TProtocolException(INVALID_DATA) on overflow.
  • Delphi (lib/delphi): ReadVarInt32() and ReadVarInt64() use bounded loops; raise TProtocolException(INVALID_DATA) on overflow.
  • Haxe (lib/haxe): readVarint32() and readVarint64() use bounded loops; throw TProtocolException(INVALID_DATA) on overflow.
  • Lua (lib/lua): readVarint() uses a bounded loop; returns an error on overflow.
  • Swift (lib/swift): readVarInt() uses a bounded loop; throws TProtocolError(.invalidData) on overflow.
  • Ruby (lib/rb): both the pure-Ruby implementation and the C extension are fixed. read_varint32() gains a separate 5-byte limit; read_varint64() a 10-byte limit. Both raise ProtocolException(INVALID_DATA) on overflow.
  • Dart (lib/dart): _readVarInt32() and _readVarInt64() use bounded for loops; throw TProtocolError(INVALID_DATA) on overflow.
  • Erlang (lib/erl): read_varint/3 gains a when Count < 10 guard; call sites propagate the error tuple.
  • Rust (lib/rs): VarIntReader trait replaced with local bounded loops for both 32-bit (5 bytes) and 64-bit (10 bytes); zigzag decode moved to free functions independent of the integer_encoding crate.

Named constants replace bare integer literals in all runtimes, with a comment explaining the derivation (ceil(64/7) = 10, ceil(32/7) = 5). Java, Ruby, and Rust enforce both the 5-byte (32-bit) and 10-byte (64-bit) limits separately; other runtimes enforce the 10-byte ceiling uniformly for both 32-bit and 64-bit reads.

Test plan

  • Python: two new varint limit tests pass
  • Go: two new varint limit tests pass
  • PHP: existing test suite passes
  • Java: existing test suite passes
  • netstd: dotnet build clean
  • Ruby: bundle exec rspec spec/compact_protocol_spec.rb — 4 new varint limit tests pass (23/23 total); pure-Ruby path exercised (C extension not compiled in Docker)
  • Swift: 2 new varint limit tests pass (13/13 total)
  • Dart: 2 new varint limit tests added and verified
  • Rust: cargo test compact — 75/75 pass including 3 new varint limit tests (rust:latest image, toolchain override required due to root rust-toolchain pinning 1.83.0)

🤖 Generated with Claude Code

@mergeable mergeable Bot added golang Pull requests that update Go code java Pull requests that update Java code php python labels Apr 22, 2026
Copy link
Copy Markdown
Contributor

@mhlakhani mhlakhani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: instead of 10 everywhere, should we use a named constant or comment in the code? as it stands this is kind of like a magic number and hard to reason about

@Jens-G Jens-G force-pushed the compact-varint-limit branch 2 times, most recently from f230b90 to 77b24c9 Compare April 22, 2026 22:01
@Jens-G Jens-G requested a review from kpumuk as a code owner April 22, 2026 22:01
@mergeable mergeable Bot added dart Pull requests that update Dart code delphi erlang haxe lua Pull requests that update Lua code c# Pull requests that update C# code ruby Pull requests that update Ruby code swift labels Apr 22, 2026
Comment thread lib/rb/lib/thrift/protocol/compact_protocol.rb
@Jens-G Jens-G force-pushed the compact-varint-limit branch from 77b24c9 to ade316a Compare April 22, 2026 22:30
@mergeable mergeable Bot added the rust Pull requests that update Rust code label Apr 22, 2026
@Jens-G Jens-G force-pushed the compact-varint-limit branch from ade316a to 2a59422 Compare April 22, 2026 22:40
@Jens-G
Copy link
Copy Markdown
Member Author

Jens-G commented Apr 22, 2026

nit: instead of 10 everywhere, should we use a named constant or comment in the code? as it stands this is kind of like a magic number and hard to reason about

Should be covered now.

This only covers pure Ruby implementation, we also have C extension here https://github.com/apache/thrift/blob/master/lib/rb/ext/compact_protocol.c#L442

The description also mentions 5 bytes for 32-bit, which is not covered in the change.

Issue 1 — Ruby C extension missing: lib/rb/ext/compact_protocol.c:read_varint64 had an unbounded while (true) loop, and read_varint32 just delegated to it. Both are now fixed with bounded for loops using rb_exc_raise(get_protocol_exception(PROTOERR_INVALID_DATA, ...)).

Issue 2 — 5-byte limit for 32-bit not enforced: The description claimed ceil(32/7) = 5 but neither Ruby path enforced it. Fixed in both the C extension and pure Ruby — read_varint32 now has its own 5-byte loop instead of delegating.

What changed:

  • lib/rb/ext/compact_protocol.c — added MAX_VARINT32_BYTES 5 / MAX_VARINT64_BYTES 10 defines; replaced both reader functions
  • lib/rb/lib/thrift/protocol/compact_protocol.rb — added MAX_VARINT32_BYTES = 5; read_varint32 now has its own 5-byte loop
  • lib/rb/spec/compact_protocol_spec.rb — 2 new tests for 32-bit overlong rejection and valid 5-byte acceptance (23/23 pass)
  • Commit amended and force-pushed; PR description updated to explain that Java, Ruby, and Rust enforce the 5/10 split, while other runtimes use 10 uniformly

Client: py,go,php,java,netstd,delphi,haxe,lua,swift,rb,dart,erl,rs

C++ and Node.js already enforce a 10-byte ceiling on varint reads.
This change brings all remaining runtimes into parity by raising a
protocol exception (INVALID_DATA) when more than 10 bytes are consumed
without finding a terminating byte.

Java: both fast-path and slow-path varint readers are covered.
Erlang: read_varint/3 gains a when Count < 10 guard clause; call sites
  updated to propagate the error tuple rather than crashing on badmatch.
Rust: VarIntReader trait replaced with local bounded loops; zigzag
  decode moved to free functions independent of the integer_encoding crate.
Ruby: both the pure-Ruby implementation and the C extension are fixed;
  read_varint32 gains a separate 5-byte limit instead of delegating to
  read_varint64.

Named constants replace bare integer literals in all runtimes, with a
comment explaining the derivation (ceil(64/7) = 10, ceil(32/7) = 5).
Java and Rust enforce both the 5-byte (32-bit) and 10-byte (64-bit)
limits; other runtimes enforce the 10-byte ceiling uniformly.

Tests added for py, go, php, java (all four original runtimes), plus
rb, swift, dart, and rs which have existing unit-test infrastructure.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Jens-G Jens-G force-pushed the compact-varint-limit branch from 2a59422 to 2e38de7 Compare April 22, 2026 23:11
@Jens-G Jens-G merged commit d515221 into apache:master Apr 23, 2026
167 of 168 checks passed
@Jens-G Jens-G deleted the compact-varint-limit branch April 23, 2026 20:20
madesroches added a commit to madesroches/micromegas that referenced this pull request May 20, 2026
## Summary
- Bumps three vulnerable transitive deps to their patched versions,
closing four Dependabot alerts.
- Two `thrift` alerts (#223, #224) are left open — upstream has not
published a patched version to crates.io. See research note below.

## Changes
| Manifest | Package | From | To | Advisory |
|---|---|---|---|---|
| `yarn.lock` | `ws` | 8.19.0 | 8.20.1 |
[GHSA-58qx-3vcg-4xpx](GHSA-58qx-3vcg-4xpx)
(#255) |
| `yarn.lock` | `brace-expansion` | 5.0.5 | 5.0.6 |
[GHSA-jxxr-4gwj-5jf2](GHSA-jxxr-4gwj-5jf2)
(#252) |
| `analytics-web-app/yarn.lock` | `brace-expansion` | 5.0.5 | 5.0.6 |
GHSA-jxxr-4gwj-5jf2 (#251) |
| `python/micromegas/poetry.lock` | `idna` | 3.11 | 3.15 |
[GHSA-65pc-fj4g-8rjx](GHSA-65pc-fj4g-8rjx)
(#253) |

## thrift alerts (#223, #224) — left open

The `thrift` Rust crate on crates.io is stuck at 0.17.0 (Nov 2022).
Apache Thrift has released through 0.23.0 on GitHub, including the fix
for CVE-2026-43868 (PR apache/thrift#3410, merged Apr 2026), but no
0.18+ version has been published to crates.io — the Rust binding was
effectively unmaintained (THRIFT-5917). Rust CI was re-enabled in
apache/thrift#3417 (Apr 2026) and a release workflow exists (#3027), so
a publish is plausibly near-term.

Even if we forked, `parquet 57.3.0` constrains to `thrift ^0.17`, so a
0.23.x cannot satisfy resolution without also patching parquet. arrow-rs
is independently migrating off the `thrift` crate (custom parser shipped
in v57 for reads; writes still pending).

Severity is medium (CVSS 5.3, DoS only, requires attacker-controlled
parquet metadata). Per CLAUDE.md the alerts stay open until a real fix
is available.

## Test plan
- [ ] `yarn install --immutable` clean at repo root
- [ ] `yarn install --immutable` clean in `analytics-web-app/`
- [ ] `poetry check` clean in `python/micromegas/`
- [ ] Dependabot alerts #251, #252, #253, #255 auto-close on merge
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

c# Pull requests that update C# code dart Pull requests that update Dart code delphi erlang golang Pull requests that update Go code haxe java Pull requests that update Java code lua Pull requests that update Lua code php python ruby Pull requests that update Ruby code rust Pull requests that update Rust code swift

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants