Skip to content

feat: migrate common/data/columnar module#24

Open
lxy-9602 wants to merge 2 commits into
apache:mainfrom
lxy-9602:add-columnar-data
Open

feat: migrate common/data/columnar module#24
lxy-9602 wants to merge 2 commits into
apache:mainfrom
lxy-9602:add-columnar-data

Conversation

@lxy-9602
Copy link
Copy Markdown
Contributor

Purpose

No Linked issue.

Migrate common/data/columnar/ module. This module provides columnar data access layer built on top of Apache Arrow arrays, including:

  • ColumnarArray: Access vector column data for array types
  • ColumnarMap: Access vector column data for map types
  • ColumnarRow: Row view over arrow::Array with per-row field access
  • ColumnarRowRef: Lightweight row view sharing batch-level context to reduce per-row overhead
  • ColumnarUtils: Utility functions for reading Arrow arrays (including dictionary-encoded arrays)
  • ColumnarBatchContext: Shared batch context struct for ColumnarRowRef

Also migrates the dependency DateTimeUtils (timestamp conversion utilities).

Tests

  • columnar_utils_test.cpp — Tests GetView/GetBytes for plain and dictionary-encoded string arrays
  • columnar_array_test.cpp — Tests all primitive types, complex/nested types (struct, list, map, decimal, timestamp), and null handling
  • columnar_row_test.cpp — Tests ColumnarRow and ColumnarRowRef for simple types, complex/nested types, dictionary types, null handling, data lifecycle, and binary access

API and Format

Documentation

Generative AI tooling

Copy link
Copy Markdown

@leaves12138 leaves12138 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. The non-owning ColumnarArray constructor is clearer now, but I found two remaining correctness blockers in nested row lifetime and decimal conversion.

std::shared_ptr<InternalRow> ColumnarRow::GetRow(int32_t pos, int32_t num_fields) const {
auto struct_array = arrow::internal::checked_cast<const arrow::StructArray*>(array_vec_[pos]);
assert(struct_array);
return std::make_shared<ColumnarRow>(struct_array->fields(), pool_, row_id_);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This nested row still drops the ownership chain. The returned ColumnarRow stores raw child Array pointers, but it is constructed without a holder. If the parent ColumnarRow is the only object owning the top-level StructArray, a nested row returned from GetRow can outlive the parent and then its raw child pointers dangle. The current DataLifeCycle test only uses result_row while the parent row is alive, so it does not catch this. Please keep an owning holder here, for example by passing a shared_ptr/sliced StructArray or by returning ColumnarRowRef with a ColumnarBatchContext that owns struct_array->fields().

assert(array);
arrow::Decimal128 decimal(array->GetValue(offset_ + pos));
return Decimal(precision, scale,
static_cast<Decimal::int128_t>(decimal.high_bits()) << 64 | decimal.low_bits());
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please avoid reconstructing the signed int128 value by left-shifting a signed high word. For negative Decimal128 values, decimal.high_bits() is negative, and left-shifting a negative signed integer is undefined behavior in C++. This same expression is also used in ColumnarRow and ColumnarRowRef. Please combine the bits through an unsigned 128-bit value first, then cast to Decimal::int128_t, and add a negative decimal test.

Copy link
Copy Markdown

@leaves12138 leaves12138 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did another full pass and found one additional correctness issue beyond the two existing blockers.

auto value_type_id = dict_type->value_type()->id();
auto index_type_id = dict_type->index_type()->id();
int64_t dict_index = -1;
if (index_type_id == arrow::Type::type::INT32) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only decodes dictionary indices for INT32 and INT64, but Arrow dictionary indices are not limited to these two widths (for example int8/int16 dictionaries are valid and commonly used for small dictionaries). For those arrays dict_index stays -1; in release builds the assert is compiled out and the code then calls dictionary->GetView(-1), which can read before the offsets buffer or crash. Please handle all supported integer index types (or return a proper error instead of relying on assert).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants