Describe the bug, including details regarding any error messages, version, and platform.
The CSV block parser sizes its per-chunk value array from `num_cols`, the column count inferred from the first line of the input, times the rows-in-chunk count. `PresizedValueDescWriter` computes the capacity as `2 + num_rows * num_cols`, and `ParseSpecialized` computes a bulk-filter threshold as `num_cols_ * (num_rows_ - start) * 10`, both in `int32_t`.
A CSV whose first line carries a few million fields drives `num_cols` high enough that these products exceed `INT32_MAX`, which is signed-integer-overflow UB (UBSan flags both expressions). Beyond the arithmetic, `ParsedValueDesc` stores a 31-bit offset, so once the per-chunk value count (rows x columns) passes `INT32_MAX` the offsets can no longer represent the data anyway, and the parser should reject the input rather than overflow.
This affects `cpp/src/arrow/csv/parser.cc`.
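
As a rough illustration of the kind of check that would avoid the overflow, the two expressions can be evaluated in 64-bit arithmetic and rejected when the value count no longer fits in `int32_t`. This is only a sketch: `CheckedValuesCapacity` and `BulkFilterThreshold` are hypothetical helper names, not functions in `parser.cc`, and the actual fix may be structured differently (for example, by returning an error `Status`).

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical helper (not Arrow's API): computes 2 + num_rows * num_cols in
// 64-bit arithmetic and returns std::nullopt if the result exceeds INT32_MAX.
std::optional<int32_t> CheckedValuesCapacity(int32_t num_rows, int32_t num_cols) {
  if (num_rows < 0 || num_cols < 0) return std::nullopt;
  // The product of two non-negative int32_t values always fits in int64_t.
  const int64_t capacity =
      2 + static_cast<int64_t>(num_rows) * static_cast<int64_t>(num_cols);
  if (capacity > std::numeric_limits<int32_t>::max()) return std::nullopt;
  return static_cast<int32_t>(capacity);
}

// Hypothetical helper (not Arrow's API): computes
// num_cols * (num_rows - start) * 10 as a 64-bit value. The value count is
// bounded by INT32_MAX before the final multiplication, so the * 10 step
// cannot overflow int64_t.
std::optional<int64_t> BulkFilterThreshold(int32_t num_cols, int32_t num_rows,
                                           int32_t start) {
  const int64_t remaining_rows = static_cast<int64_t>(num_rows) - start;
  if (num_cols < 0 || remaining_rows < 0) return std::nullopt;
  const int64_t num_values = static_cast<int64_t>(num_cols) * remaining_rows;
  if (num_values > std::numeric_limits<int32_t>::max()) return std::nullopt;
  return num_values * 10;
}
```

With these helpers, a chunk of 1000 rows and 3,000,000 columns would be rejected instead of silently wrapping, since 3 x 10^9 values cannot be addressed by a 31-bit offset.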
Component(s)
C++