[refine](be) Refactor DataTypeNumberSerDe hierarchy: split base/final, sink if-constexpr branches#61996
[refine](be) Refactor DataTypeNumberSerDe hierarchy: split base/final, sink if-constexpr branches#61996Mryange wants to merge 3 commits intoapache:masterfrom
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
TPC-H: Total hot run time: 29684 ms |
|
run buildall |
TPC-DS: Total hot run time: 179825 ms |
TPC-H: Total hot run time: 29393 ms |
TPC-DS: Total hot run time: 180417 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
/review |
There was a problem hiding this comment.
Pull request overview
Refactors the BE numeric SerDe hierarchy to mirror the existing DataType hierarchy by introducing a shared DataTypeNumberSerDeBase<T> plus thin final wrappers for pure numerics, and by splitting Date/DateTime v1 SerDe into sibling implementations with shared templated logic. This also sinks several type-specific PB/Jsonb serialization branches into the relevant subclasses and lifts get_scale() to the DataTypeSerDe base.
Changes:
- Introduce
DataTypeNumberSerDeBase<T>and a thin finalDataTypeNumberSerDe<T>for pure numerics; updateget_serde()wiring accordingly. - Split Date/DateTime v1 SerDe into
data_type_date_serde.*and newdata_type_datetime_serde.*with sharedDataTypeDateLikeV1SerDe<T>. - Move several PB/Jsonb serialization implementations from the number base SerDe into specialized SerDes (DateV2/DateTimeV2/TimeStampTz/IPv4, etc.) and add
DataTypeSerDe::get_scale().
Reviewed changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| be/test/core/data_type_serde/data_type_serde_datetime_v1_test.cpp | Updates includes and Date v1 SerDe construction after Date/DateTime v1 split. |
| be/test/core/data_type_serde/data_type_serde_datelike_batch_test.cpp | Updates includes and switches from templated DateSerDe to concrete Date/DateTime v1 SerDes. |
| be/src/core/data_type/data_type_number.h | Routes DataTypeNumber<T> SerDe creation through the new thin final numeric SerDe. |
| be/src/core/data_type/data_type_number_base.h | Switches base numeric get_serde() to return DataTypeNumberSerDeBase<T>. |
| be/src/core/data_type/data_type_date.h | Updates Date v1 SerDe include/type to DataTypeDateSerDe. |
| be/src/core/data_type/data_type_date_time.h | Updates DateTime v1 SerDe include to the new data_type_datetime_serde.h. |
| be/src/core/data_type_serde/data_type_timestamptz_serde.h | Changes inheritance to DataTypeNumberSerDeBase and adds PB/Jsonb + Arrow read overrides. |
| be/src/core/data_type_serde/data_type_timestamptz_serde.cpp | Implements TimeStampTz PB/Jsonb and Arrow read logic previously handled in the monolithic base. |
| be/src/core/data_type_serde/data_type_time_serde.h | Switches TimeV2 SerDe to inherit from DataTypeNumberSerDeBase. |
| be/src/core/data_type_serde/data_type_serde.h | Adds DataTypeSerDe::get_scale() default implementation. |
| be/src/core/data_type_serde/data_type_serde.cpp | Updates numeric binary deserialization dispatch to DataTypeNumberSerDeBase. |
| be/src/core/data_type_serde/data_type_number_serde.h | Splits monolithic numeric SerDe into base + thin final numeric SerDe; removes PB branches for non-numerics. |
| be/src/core/data_type_serde/data_type_number_serde.cpp | Renames implementations to DataTypeNumberSerDeBase and adds explicit instantiations. |
| be/src/core/data_type_serde/data_type_ipv6_serde.h | Switches IPv6 SerDe inheritance to DataTypeNumberSerDeBase. |
| be/src/core/data_type_serde/data_type_ipv4_serde.h | Switches IPv4 SerDe inheritance to DataTypeNumberSerDeBase and adds Jsonb cell IO overrides. |
| be/src/core/data_type_serde/data_type_ipv4_serde.cpp | Implements IPv4 Jsonb cell IO overrides after removing base branches. |
| be/src/core/data_type_serde/data_type_datev2_serde.h | Switches DateV2 SerDe inheritance to DataTypeNumberSerDeBase and declares PB/Jsonb overrides. |
| be/src/core/data_type_serde/data_type_datev2_serde.cpp | Implements DateV2 PB/Jsonb overrides previously handled via base if constexpr branches. |
| be/src/core/data_type_serde/data_type_datetimev2_serde.h | Switches DateTimeV2 SerDe inheritance to DataTypeNumberSerDeBase and declares PB/Jsonb overrides. |
| be/src/core/data_type_serde/data_type_datetimev2_serde.cpp | Implements DateTimeV2 PB/Jsonb overrides previously handled via base if constexpr branches. |
| be/src/core/data_type_serde/data_type_datetime_serde.h | New: declares DateTime v1 SerDe sibling class built on shared DataTypeDateLikeV1SerDe<T>. |
| be/src/core/data_type_serde/data_type_datetime_serde.cpp | New: implements DateTime v1 JSON + Arrow read behavior (previously in combined Date/DateTime file). |
| be/src/core/data_type_serde/data_type_date_serde.h | Introduces shared DataTypeDateLikeV1SerDe<T> template and concrete DataTypeDateSerDe. |
| be/src/core/data_type_serde/data_type_date_serde.cpp | Moves shared Date/DateTime v1 logic here and adds PB/Jsonb overrides for v1 datelike types. |
Comments suppressed due to low confidence (1)
be/src/core/data_type_serde/data_type_date_serde.cpp:589
- write_column_to_pb() casts the
VecDateTimeValuedata buffer toint64_t*and passes it to protobuf. Even if VecDateTimeValue is trivially copyable, reinterpreting the object array as anint64_tarray relies on layout/aliasing assumptions and is undefined behavior with strict-aliasing enabled. Prefer serializing by iterating and using an explicit underlying-int conversion (e.g., binary_cast/bit_cast or a dedicated accessor) rather than pointer reinterprets.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ptype->set_id(PGenericType::UINT32); | ||
| auto* values = result.mutable_uint32_value(); | ||
| values->Reserve(row_count); | ||
| values->Add((uint32_t*)data.begin() + start, (uint32_t*)data.begin() + end); |
There was a problem hiding this comment.
write_column_to_pb() is adding values by casting the column’s DateV2Value buffer to uint32_t* and passing it to protobuf. Even though DateV2Value stores an underlying integer, reinterpreting the object array as a uint32_t array relies on object layout and can violate strict-aliasing rules (undefined behavior under optimization). Prefer filling the repeated field by iterating and pushing to_date_int_val() (or an equivalent accessor) per element, or using a safe bit-cast/memcpy-based conversion instead of pointer reinterprets.
| values->Add((uint32_t*)data.begin() + start, (uint32_t*)data.begin() + end); | |
| for (int64_t i = start; i < end; ++i) { | |
| values->Add(data[i].to_date_int_val()); | |
| } |
| ptype->set_id(PGenericType::UINT64); | ||
| auto* values = result.mutable_uint64_value(); | ||
| values->Reserve(row_count); | ||
| values->Add((uint64_t*)data.begin() + start, (uint64_t*)data.begin() + end); |
There was a problem hiding this comment.
write_column_to_pb() casts the DateV2Value<DateTimeV2ValueType> data buffer to uint64_t* and passes it to protobuf. This relies on the in-memory layout of DateV2Value matching a raw uint64_t array and can violate strict-aliasing rules (undefined behavior). Safer approach: append values by iterating and using to_date_int_val() (or a safe bit-cast/memcpy) to obtain the underlying uint64_t representation.
| values->Add((uint64_t*)data.begin() + start, (uint64_t*)data.begin() + end); | |
| for (int64_t i = start; i < end; ++i) { | |
| values->Add(data[i].to_date_int_val()); | |
| } |
| ptype->set_id(PGenericType::UINT64); | ||
| auto* values = result.mutable_uint64_value(); | ||
| values->Reserve(row_count); | ||
| values->Add((uint64_t*)data.begin() + start, (uint64_t*)data.begin() + end); |
There was a problem hiding this comment.
write_column_to_pb() populates the protobuf repeated field by reinterpreting the TimestampTzValue object buffer as a uint64_t*. This assumes object layout compatibility and can trigger strict-aliasing UB under optimization. Consider appending the underlying integer via TimestampTzValue::to_date_int_val() (or a safe bit-cast/memcpy) per row instead of pointer casting.
| values->Add((uint64_t*)data.begin() + start, (uint64_t*)data.begin() + end); | |
| for (int64_t i = start; i < end; ++i) { | |
| values->Add(binary_cast<TimestampTzValue, UInt64>(data[i])); | |
| } |
What problem does this PR solve?
Problem Summary:
DataTypeNumberSerDe<T>was a monolithic template class serving as both the base class for all number-like SerDe types and the concrete implementation for pure numeric types. It handled 17+ types via extensiveif constexprbranches, making the code hard to read and maintain. The inheritance structure was also asymmetric with the DataType side.DataType hierarchy (for reference)
The DataType side already has a clean two-level design:
SerDe hierarchy — BEFORE
The old SerDe side was flat and asymmetric —
DataTypeNumberSerDe<T>served as both the base and the final class, with massiveif constexprbranches to dispatch type-specific logic:Problems: (1) no separation between base and pure-numeric final class, (2)
if constexprbranches for 10+ types scattered in base methods, (3) DateTime inheriting from Date (wrong relationship).SerDe hierarchy — AFTER
Now the SerDe hierarchy mirrors the DataType hierarchy:
DataTypeNumberSerDeBase<T>↔DataTypeNumberBase<T>,DataTypeNumberSerDe<T> final↔DataTypeNumber<T> final, and Date/DateTime are siblings via a shared intermediate template.This PR performs a multi-phase refactoring:
Phase 1 — Split base class and add thin final wrapper
DataTypeNumberSerDe<T>→DataTypeNumberSerDeBase<T>(base class)DataTypeNumberSerDe<T> finalfor pure numeric types (Bool/Int/Float/Double/Time), mirroringDataTypeNumber<T>on the DataType sidePhase 3 — Lift
get_scale()toDataTypeSerDebaseget_scale()fromDataTypeNumberSerDeBase<T>toDataTypeSerDe, making it available to all SerDe types (e.g. Decimal)Phase 4 — Split Date/DateTime v1 SerDe into separate files
DataTypeDateLikeV1SerDe<T>shared template for Date/DateTime v1 common logicdata_type_date_serde.h/.cppanddata_type_datetime_serde.h/.cppDataTypeDateSerDeandDataTypeDateTimeSerDeare now siblings (not parent-child)Phase 2 — Sink
if constexprbranches to subclassesDataTypeNumberSerDeBase<T>into subclass overrides:DataTypeDateLikeV1SerDe<T>: write/read PB (INT64), write/read JsonbDataTypeDateV2SerDe: write/read PB (UINT32), write/read JsonbDataTypeDateTimeV2SerDe: write/read PB (UINT64), write/read JsonbDataTypeTimeStampTzSerDe: write/read PB (UINT64), write/read Jsonb, read_column_from_arrowDataTypeIPv4SerDe: write/read JsonbAfter refactoring, the base class
DataTypeNumberSerDeBase<T>PB methods only handle pure numeric types (Bool, TinyInt–BigInt, LargeInt, Float, Double, Time/TimeV2).Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)