[C++][CSV] Possible int32 overflow of per-chunk value count in block parser #50275

Description

@metsw24-max

Describe the bug, including details regarding any error messages, version, and platform.

The CSV block parser sizes its per-chunk value array as num_cols, the column count inferred from the first line of the input, multiplied by the number of rows in the chunk. PresizedValueDescWriter computes the capacity as 2 + num_rows * num_cols, and ParseSpecialized computes a bulk-filter threshold as num_cols_ * (num_rows_ - start) * 10. Both expressions are evaluated in int32_t.

A CSV whose first line carries a few million fields drives num_cols high enough that these products exceed INT32_MAX. That is signed integer overflow, which is undefined behavior (UBSan flags both expressions). Beyond the arithmetic, ParsedValueDesc stores a 31-bit offset, so once the per-chunk value count (rows × columns) passes INT32_MAX the offsets can no longer represent the data anyway. The parser should reject such input rather than overflow.
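To make the arithmetic concrete, here is a minimal sketch of one way to guard both expressions by widening to int64_t before multiplying. The helper names CheckedValueCapacity and BulkFilterThreshold are hypothetical and do not exist in Arrow; this is not the actual fix, only an illustration of the check under the assumption that row and column counts are non-negative int32_t values.

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical helper (not an Arrow API): computes 2 + num_rows * num_cols
// in 64-bit arithmetic and returns std::nullopt when the result would not
// fit in int32_t, so the caller can reject the input instead of overflowing.
std::optional<int32_t> CheckedValueCapacity(int32_t num_rows, int32_t num_cols) {
  if (num_rows < 0 || num_cols < 0) return std::nullopt;
  const int64_t capacity = 2 + static_cast<int64_t>(num_rows) * num_cols;
  if (capacity > std::numeric_limits<int32_t>::max()) return std::nullopt;
  return static_cast<int32_t>(capacity);
}

// Hypothetical helper (not an Arrow API): the bulk-filter threshold
// num_cols * (num_rows - start) * 10, evaluated in int64_t.
// Assumes 0 <= start <= num_rows and that CheckedValueCapacity has already
// accepted (num_rows, num_cols); under that assumption the product is at
// most about 10 * INT32_MAX and cannot overflow int64_t.
int64_t BulkFilterThreshold(int32_t num_cols, int32_t num_rows, int32_t start) {
  return static_cast<int64_t>(num_cols) *
         (static_cast<int64_t>(num_rows) - start) * 10;
}
```

For instance, with 4,000,000 columns and 1,024 rows the value count is about 4.1 billion, so CheckedValueCapacity returns std::nullopt. With 3,000,000 columns and 100 rows the capacity still fits in int32_t, but the threshold is 3,000,000,000, which already exceeds INT32_MAX.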

This affects cpp/src/arrow/csv/parser.cc.

Component(s)

C++
