arrow-csv: integer overflow panic in Reader::records::flush
The cargo-fuzz `csv_reader` harness being prototyped for #5332 found a clean panic in `arrow-csv` after a few minutes of running against `main` HEAD. Filing here because it's an actual bug surfaced by the proposed fuzzing infrastructure.
## Repro
72 bytes:

```
0000 2e 22 3f 0a 31 0a 3f 3f 0a 3c 50 50 0a 3f 0a 31 |."?.1.??.<PP.?.1|
0010 0a 3f 38 0a 3c 0a 3f 0a 3c 50 50 0a 3f 0a 31 0a |.?8.<.?.<PP.?.1.|
0020 3f 38 0a 0a 2e 22 3f 0a 31 0a 3f 3f 0a ce ce ce |?8..."?.1.??....|
0030 b1 ce ce ce ce ce ce ce ce 31 0a 3f 38 0a 3c 0a |.........1.?8.<.|
0040 3f 0a 3c 0a 3f 0a 3f 69 |?.<.?.?i|
```
(The same input feeds straight into `ReaderBuilder::new(Arc::new(schema)).with_format(format).build(Cursor::new(data))` and iterates batches; the panic happens inside the reader's flush, not in client code.)
## Panic
```
thread '<unnamed>' panicked at arrow-csv/src/reader/records.rs:207:21:
attempt to add with overflow
…
3: <closure> at arrow-csv/src/reader/records.rs:207:21
4: for_each over chunks_exact_mut row offsets
5: <closure> at arrow-csv/src/reader/records.rs:206:32
6: …
9: arrow_csv::reader::records::*::flush
   at arrow-csv/src/reader/records.rs:204:14
```
The site is the `row.iter_mut().for_each(|x| { *x += offset; row_offset = *x; })` accumulation:
```rust
// arrow-csv/src/reader/records.rs:201-210 (approx)
let mut row_offset = 0;
self.offsets[1..self.offsets_len]
    .chunks_exact_mut(self.num_columns)
    .for_each(|row| {
        let offset = row_offset;
        row.iter_mut().for_each(|x| {
            *x += offset; // <-- panic here on overflow
            row_offset = *x;
        });
    });
```
`*x` and `offset` are both `usize`. With pathological CSV input the per-row offsets grow without bound until `*x + offset` overflows `usize`. The crash is reachable from any caller of `Reader::next()` on attacker-controlled bytes, so it's a DoS surface for any service that feeds untrusted CSV through `arrow-csv`.
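To see how the accumulator compounds, the quoted loop can be reduced to plain `Vec<usize>` arithmetic. This is a standalone toy model of that loop, not arrow-csv internals; the function name `rebase` and the sample offsets are made up for illustration:

```rust
// Toy model of the flush rebase loop quoted above: each row's offsets
// get the running total added, and the running total becomes the last
// rebased offset of that row.
fn rebase(offsets: &mut [usize], num_columns: usize) {
    let mut row_offset = 0usize;
    offsets.chunks_exact_mut(num_columns).for_each(|row| {
        let offset = row_offset;
        row.iter_mut().for_each(|x| {
            *x += offset; // panics on overflow when debug assertions are on
            row_offset = *x;
        });
    });
}

fn main() {
    // Two rows of two columns, per-row field ends [1, 2] each:
    // the second row is rebased by the first row's final offset (2).
    let mut offsets = vec![1, 2, 1, 2];
    rebase(&mut offsets, 2);
    assert_eq!(offsets, [1, 2, 3, 4]);

    // Once any accumulated value approaches usize::MAX, the next
    // addition overflows:
    // let mut bad = vec![usize::MAX, 1, 1, 1];
    // rebase(&mut bad, 2); // would panic: attempt to add with overflow
}
```

Because the totals carry across rows (and across successive flushes), crafted input only needs to keep the accumulator climbing until the addition overflows.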
## Suggested fix sketch
Either:

- Bound the cumulative offset by the actual data length: each emitted offset can never legitimately exceed `self.data_len`, so an explicit cap prior to the addition turns the overflow into a clean `ParseError`.
- Use `checked_add` on the accumulator and surface the overflow as a `ParseError("CSV record offsets overflowed")` rather than letting it panic under debug assertions (in release builds, where overflow checks are off by default, the addition silently wraps instead, which is arguably worse).
I haven't sent a PR for this since I have two related parquet allocation-bound PRs (#9883, #9884) already in flight and didn't want to flood the queue. Happy to send a fix as a follow-up once those land if no one else picks it up first.
The 72-byte repro and the `arrow-csv/fuzz` harness that produced it live on my fork at `fuzz/initial-harnesses` (per #5332). The harness reproduces this in single-digit minutes from an empty corpus.
xref #5332 #9883 #9884