[bug](hive) fix datetime type outfile parquet file not correct#61396
[bug](hive) fix datetime type outfile parquet file not correct#61396zhangstar333 wants to merge 3 commits intoapache:masterfrom
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
TPC-H: Total hot run time: 26668 ms |
TPC-DS: Total hot run time: 167680 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
|
/review |
There was a problem hiding this comment.
Code Review Summary
PR Goal
This PR fixes datetime type outfile parquet file generation by ensuring that when enable_int96_timestamps is true, the Arrow TimestampType uses NANO unit (required by the patched Arrow library after PR #60946). It also refactors the write_column_to_arrow serialization to use the actual Arrow timestamp unit rather than the column scale.
Critical Checkpoint Conclusions
1. Does the code accomplish the goal? Is there a test that proves it?
- The Arrow type mapping fix (NANO for INT96) is correct and well-placed.
- The
write_column_to_arrowrefactor to switch ontimestamp_type->unit()is a good improvement over_scale-based branching. - However, the test has issues: the
hive_dockerexpected output shows timezone-shifted values (10:00:00->02:00:00), indicating that the wall-clock time is still being incorrectly shifted by the session timezone. This means the core INT96 write path still has a timezone correctness issue that is not fixed by this PR.
2. Is this modification as small, clear, and focused as possible?
- Yes. The refactoring of
ParquetFileOptionsto use designated initializers is a good cleanup included alongside the fix.
3. Concurrency?
- Not applicable. No concurrency concerns in this change.
4. Lifecycle management / static initialization?
- Not applicable.
5. Configuration items?
- No new configs added. Existing
enable_int96_timestampsis properly wired.
6. Incompatible changes?
- Not applicable.
7. Parallel code paths?
- Issue found:
DataTypeTimeStampTzSerDe::write_column_to_arrowindata_type_timestamptz_serde.cpp:192-223has the exact same old_scale-based branching pattern that was fixed indata_type_datetimev2_serde.cpp. SinceTYPE_TIMESTAMPTZshares the sameconvert_to_arrow_typemapping (which now returns NANO whenenable_int96_timestamps=true), the TIMESTAMPTZ serde will produce wrong values when it encounters a NANO timestamp type. See inline comment.
8. Special conditional checks?
- Not applicable.
9. Test coverage?
- The new regression test has several issues: dead code (
enable_int96parameter), misleading comments, and thehive_dockerexpected output shows timezone-shifted values suggesting the fix is incomplete. See inline comments.
10. Observability?
- Not applicable for this change.
11. Transaction/persistence?
- Not applicable.
12. Data writes?
- The timestamp value computation is correct for the non-INT96 paths (MICRO, MILLI, SECOND). For the INT96/NANO path, the arithmetic is correct.
13. FE-BE variable passing?
- All paths correctly pass
enable_int96_timestampsthrough toParquetFileOptions.
14. Performance?
- The switch statement is equivalent performance to the prior if-else chain. No concerns.
15. Other issues?
- See inline comments for specific issues.
Summary of Issues Found
- [Parallel path not updated]
DataTypeTimeStampTzSerDe::write_column_to_arrowstill uses_scale-based branching and will produce wrong values when the Arrow type is NANO (INT96 mode). - [Test dead code] The
enable_int96parameter inoutfile_to_HDFSis never used (the property line is commented out). The comment at line 122 says "enable_int96_timestamps=true" but the call passesfalse. Clean up the dead code or actually wire the parameter. - [Hive results show timezone shift] The
.outfilehive_dockersection shows an 8-hour shift (e.g.10:00:00->02:00:00), suggesting that the INT96 value written still embeds the timezone-converted UTC epoch rather than the wall-clock nanoseconds Hive expects. This may indicate the fix is incomplete for the write path. - [Unused imports] The test file imports
IOGroovyMethods,StandardCharsets,Files,Pathsbut none are used. - [Unused variable]
defaultFS_with_postfixis defined but never used in the test.
...est/suites/external_table_p0/export/hive_read/parquet/test_hive_read_parquet_datetime.groovy
Show resolved
Hide resolved
...ion-test/data/external_table_p0/export/hive_read/parquet/test_hive_read_parquet_datetime.out
Show resolved
Hide resolved
...est/suites/external_table_p0/export/hive_read/parquet/test_hive_read_parquet_datetime.groovy
Show resolved
Hide resolved
...est/suites/external_table_p0/export/hive_read/parquet/test_hive_read_parquet_datetime.groovy
Show resolved
Hide resolved
|
run buildall |
TPC-H: Total hot run time: 27218 ms |
TPC-DS: Total hot run time: 168331 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
| uint32_t microsecond = datetime_val.microsecond(); | ||
| timestamp = (timestamp * 1000000) + microsecond; | ||
| } else if (_scale > 0) { | ||
| timestamp = (timestamp * 1000000000LL) + (microsecond * 1000LL); |
There was a problem hiding this comment.
here overflow of int64 when datetime is 9999-12-31
need some time to find soultion
What problem does this PR solve?
Problem Summary:
before #60946, enable_int96_timestamps to true could write int96 type in parquet file.
after the pr add arrow patch, when set enable_int96_timestamps to true in conf, must need timeunit to be NANO to write int96.
so check when enable_int96_timestamps is true, set arrow TimestampType timeunit to NANO
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)