feat: migrate common/data/columnar module #24
leaves12138
left a comment
Thanks for the update. The non-owning ColumnarArray constructor is clearer now, but I found two remaining correctness blockers in nested row lifetime and decimal conversion.
std::shared_ptr<InternalRow> ColumnarRow::GetRow(int32_t pos, int32_t num_fields) const {
    auto struct_array = arrow::internal::checked_cast<const arrow::StructArray*>(array_vec_[pos]);
    assert(struct_array);
    return std::make_shared<ColumnarRow>(struct_array->fields(), pool_, row_id_);
This nested row still drops the ownership chain. The returned ColumnarRow stores raw child Array pointers, but it is constructed without a holder. If the parent ColumnarRow is the only object owning the top-level StructArray, a nested row returned from GetRow can outlive the parent and then its raw child pointers dangle. The current DataLifeCycle test only uses result_row while the parent row is alive, so it does not catch this. Please keep an owning holder here, for example by passing a shared_ptr/sliced StructArray or by returning ColumnarRowRef with a ColumnarBatchContext that owns struct_array->fields().
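A minimal sketch of the holder idea, written against stand-in types rather than the real Arrow or ColumnarRow classes. `FakeArray`, `FakeStructArray`, `NestedRow` and `MakeNestedRowThenDropParent` are hypothetical names used only for illustration; the point is that the nested row keeps a `shared_ptr` to whatever owns the children, next to the raw pointers it reads through.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-ins for arrow::Array / arrow::StructArray (ownership only).
struct FakeArray {
    std::vector<int32_t> values;
};

struct FakeStructArray {
    std::vector<std::shared_ptr<FakeArray>> children;
};

// A nested row that reads through raw child pointers but also stores an owning
// holder, so the children stay alive for as long as the row does.
class NestedRow {
 public:
    NestedRow(std::shared_ptr<const FakeStructArray> holder, int64_t row_id)
        : holder_(std::move(holder)), row_id_(row_id) {
        for (const auto& child : holder_->children) {
            fields_.push_back(child.get());
        }
    }

    int32_t GetInt(std::size_t field) const {
        return fields_[field]->values[static_cast<std::size_t>(row_id_)];
    }

 private:
    std::shared_ptr<const FakeStructArray> holder_;  // owns the data
    std::vector<const FakeArray*> fields_;           // non-owning views into holder_
    int64_t row_id_;
};

// Builds a parent, hands out a nested row, then drops the parent reference.
inline std::shared_ptr<NestedRow> MakeNestedRowThenDropParent() {
    auto child = std::make_shared<FakeArray>();
    child->values = {10, 20, 30};
    auto parent = std::make_shared<FakeStructArray>();
    parent->children.push_back(std::move(child));
    auto row = std::make_shared<NestedRow>(parent, /*row_id=*/1);
    parent.reset();  // the nested row is now the only owner
    return row;
}
```

A lifecycle test along these lines reads the nested row only after the parent has been released, which is the case the current DataLifeCycle test does not exercise.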
    assert(array);
    arrow::Decimal128 decimal(array->GetValue(offset_ + pos));
    return Decimal(precision, scale,
                   static_cast<Decimal::int128_t>(decimal.high_bits()) << 64 | decimal.low_bits());
Please avoid reconstructing the signed int128 value by left-shifting a signed high word. For negative Decimal128 values, decimal.high_bits() is negative, and left-shifting a negative signed integer is undefined behavior in C++ before C++20. This same expression is also used in ColumnarRow and ColumnarRowRef. Please combine the bits through an unsigned 128-bit value first, then cast to Decimal::int128_t, and add a negative decimal test.
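One possible shape for the unsigned-first combination, as a sketch. It assumes `Decimal::int128_t` behaves like the GCC/Clang `__int128` extension; `CombineInt128` and the local type aliases are hypothetical names, not part of the PR.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: Decimal::int128_t is equivalent to the compiler's __int128.
using int128_t = __int128;
using uint128_t = unsigned __int128;

// Hypothetical helper: joins the two 64-bit words of a Decimal128 without ever
// shifting a signed value. The high word is reinterpreted as unsigned first.
inline int128_t CombineInt128(int64_t high_bits, uint64_t low_bits) {
    const uint128_t combined =
        (static_cast<uint128_t>(static_cast<uint64_t>(high_bits)) << 64) |
        static_cast<uint128_t>(low_bits);
    // Unsigned-to-signed conversion is modular in C++20 and implementation-defined
    // (two's complement on GCC/Clang) in earlier standards, never undefined.
    return static_cast<int128_t>(combined);
}
```

A negative decimal test could then check, for instance, that a high word of -1 with an all-ones low word round-trips to -1.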
leaves12138
left a comment
I did another full pass and found one additional correctness issue beyond the two existing blockers.
    auto value_type_id = dict_type->value_type()->id();
    auto index_type_id = dict_type->index_type()->id();
    int64_t dict_index = -1;
    if (index_type_id == arrow::Type::type::INT32) {
This only decodes dictionary indices for INT32 and INT64, but Arrow dictionary indices are not limited to these two widths (for example int8/int16 dictionaries are valid and commonly used for small dictionaries). For those arrays dict_index stays -1; in release builds the assert is compiled out and the code then calls dictionary->GetView(-1), which can read before the offsets buffer or crash. Please handle all supported integer index types (or return a proper error instead of relying on assert).
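A sketch of decoding every integer index width with an explicit failure path instead of an assert. To stay self-contained it uses a hypothetical `IndexWidth` enum and a raw data pointer in place of the Arrow type id and index array; `ReadIndexAs` and `DecodeDictIndex` are likewise illustrative names.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the Arrow type id of the dictionary index array.
enum class IndexWidth {
    kInt8, kUInt8, kInt16, kUInt16, kInt32, kUInt32, kInt64, kUInt64, kOther
};

// Reads element `pos` of a buffer holding values of type T and widens it.
template <typename T>
int64_t ReadIndexAs(const void* data, int64_t pos) {
    return static_cast<int64_t>(static_cast<const T*>(data)[pos]);
}

// Returns the dictionary index at `pos`, or std::nullopt for an unsupported
// index type, so the caller can report an error in release builds too.
inline std::optional<int64_t> DecodeDictIndex(IndexWidth width, const void* data,
                                              int64_t pos) {
    switch (width) {
        case IndexWidth::kInt8:   return ReadIndexAs<int8_t>(data, pos);
        case IndexWidth::kUInt8:  return ReadIndexAs<uint8_t>(data, pos);
        case IndexWidth::kInt16:  return ReadIndexAs<int16_t>(data, pos);
        case IndexWidth::kUInt16: return ReadIndexAs<uint16_t>(data, pos);
        case IndexWidth::kInt32:  return ReadIndexAs<int32_t>(data, pos);
        case IndexWidth::kUInt32: return ReadIndexAs<uint32_t>(data, pos);
        case IndexWidth::kInt64:  return ReadIndexAs<int64_t>(data, pos);
        // Values above INT64_MAX would not fit; a real implementation may reject them.
        case IndexWidth::kUInt64: return ReadIndexAs<uint64_t>(data, pos);
        default:                  return std::nullopt;
    }
}
```

The caller would turn an empty result into a proper error status rather than passing -1 on to dictionary->GetView.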
Purpose
No Linked issue.
Migrate the common/data/columnar/ module. This module provides a columnar data access layer built on top of Apache Arrow arrays.
Also migrates the dependency DateTimeUtils (timestamp conversion utilities).
Tests
columnar_utils_test.cpp: Tests GetView/GetBytes for plain and dictionary-encoded string arrays
columnar_array_test.cpp: Tests all primitive types, complex/nested types (struct, list, map, decimal, timestamp), and null handling
columnar_row_test.cpp: Tests ColumnarRow and ColumnarRowRef for simple types, complex/nested types, dictionary types, null handling, data lifecycle, and binary access
API and Format
Documentation
Generative AI tooling