
arrow-csv: integer overflow panic in Reader::records::flush #9885


Description

@youichi-uda


The cargo-fuzz csv_reader harness being prototyped for #5332 found a clean panic in arrow-csv after a few minutes of running against main HEAD. Filing it here because it's a real bug surfaced by the proposed fuzzing infrastructure.

Repro

72 bytes:

0000  2e 22 3f 0a 31 0a 3f 3f  0a 3c 50 50 0a 3f 0a 31  |."?.1.??.<PP.?.1|
0010  0a 3f 38 0a 3c 0a 3f 0a  3c 50 50 0a 3f 0a 31 0a  |.?8.<.?.<PP.?.1.|
0020  3f 38 0a 0a 2e 22 3f 0a  31 0a 3f 3f 0a ce ce ce  |?8..."?.1.??....|
0030  b1 ce ce ce ce ce ce ce  ce 31 0a 3f 38 0a 3c 0a  |.........1.?8.<.|
0040  3f 0a 3c 0a 3f 0a 3f 69                            |?.<.?.?i|

(The same input feeds straight into ReaderBuilder::new(Arc::new(schema)).with_format(format).build(Cursor::new(data)), iterating the resulting batches; the panic happens inside the reader's flush, not in client code.)
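For reference, a minimal sketch of that driver, assuming the harness infers the schema via Format::infer_schema (the exact wiring lives in the harness on my fork; the function name here is illustrative):

use std::io::Cursor;
use std::sync::Arc;

use arrow_csv::reader::Format;
use arrow_csv::ReaderBuilder;

fn repro(data: &[u8]) {
    let format = Format::default().with_header(false);
    // Infer a schema from the same bytes; an inference failure is not
    // the bug under test, so just bail out.
    let (schema, _) = match format.infer_schema(Cursor::new(data), None) {
        Ok(v) => v,
        Err(_) => return,
    };
    let reader = ReaderBuilder::new(Arc::new(schema))
        .with_format(format)
        .build(Cursor::new(data))
        .expect("reader should build");
    // Draining the iterator reaches Reader::next(), which calls into
    // the records flush -- the overflow panics there.
    for batch in reader {
        let _ = batch;
    }
}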

Panic

thread '<unnamed>' panicked at arrow-csv/src/reader/records.rs:207:21:
attempt to add with overflow
   …
   3: <closure> at arrow-csv/src/reader/records.rs:207:21
   4: for_each over chunks_exact_mut row offsets
   5: <closure> at arrow-csv/src/reader/records.rs:206:32
   6: …
   9: arrow_csv::reader::records::*::flush
              at arrow-csv/src/reader/records.rs:204:14

The panic site is the row.iter_mut().for_each(|x| { *x += offset; row_offset = *x; }) accumulation:

// arrow-csv/src/reader/records.rs:201-210 (approx)
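// Each row's offsets are shifted by the running total (row_offset),
// turning per-row-relative offsets into absolute buffer positions.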
let mut row_offset = 0;
self.offsets[1..self.offsets_len]
    .chunks_exact_mut(self.num_columns)
    .for_each(|row| {
        let offset = row_offset;
        row.iter_mut().for_each(|x| {
            *x += offset;          // <-- panic here on overflow
            row_offset = *x;
        });
    });

*x and offset are both usize. With pathological CSV input the accumulated offsets grow without bound, and *x + offset eventually exceeds usize::MAX. The crash is reachable from any caller of Reader::next() on attacker-controlled bytes, so this is a denial-of-service surface for any service that accepts CSV through arrow-csv.

Suggested fix sketch

Either:

  • Bound the cumulative offset by the actual data length: no emitted offset can legitimately exceed self.data_len, so an explicit cap prior to the addition turns the overflow into a clean ParseError.
  • Use checked_add on the accumulator and surface the overflow as a ParseError("CSV record offsets overflowed") instead of panicking; in release builds (where overflow checks are off by default) the addition would otherwise wrap silently and produce bogus offsets. See the sketch after this list.
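
A minimal sketch of the checked_add variant, written against the approximate snippet above (self.offsets, self.offsets_len, and self.num_columns are copied from that snippet, and it assumes flush already returns Result<_, ArrowError>; the error text is just a suggestion):

let mut row_offset = 0usize;
for row in self.offsets[1..self.offsets_len].chunks_exact_mut(self.num_columns) {
    let offset = row_offset;
    for x in row.iter_mut() {
        // checked_add turns the wrap into a recoverable error instead of
        // a debug-mode panic or a release-mode silent wrap.
        *x = x.checked_add(offset).ok_or_else(|| {
            ArrowError::ParseError("CSV record offsets overflowed".to_string())
        })?;
        row_offset = *x;
    }
}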

I haven't sent a PR for this since I have two related parquet allocation-bound PRs (#9883, #9884) already in flight and didn't want to flood the queue. Happy to send a fix as a follow-up once those land if no one else picks it up first.

The 72-byte repro and the arrow-csv/fuzz harness that produced it live on my fork at fuzz/initial-harnesses (per #5332). The harness reproduces this in single-digit minutes from an empty corpus.

xref #5332 #9883 #9884
