
[bug](hive) fix datetime type outfile parquet file not correct #61396

Open
zhangstar333 wants to merge 3 commits into apache:master from zhangstar333:hive_datetime_int96

Conversation

@zhangstar333
Contributor

What problem does this PR solve?

Problem Summary:
Before #60946, setting enable_int96_timestamps to true was enough to write INT96 timestamps to a Parquet file.
After that PR added the Arrow patch, INT96 is written only when enable_int96_timestamps is true in the conf and the Arrow timestamp time unit is NANO.
This PR therefore sets the Arrow TimestampType time unit to NANO whenever enable_int96_timestamps is true.
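
As a rough illustration of the rule described above, the unit selection might look like the sketch below. The `TimeUnit` enum is a stand-in for Arrow's time unit type, `pick_timestamp_unit` is a hypothetical helper name, and the scale thresholds for the non-INT96 case are assumptions rather than code taken from this PR.

```cpp
// Hypothetical stand-in for Arrow's time unit enum; not the real Arrow type.
enum class TimeUnit { SECOND, MILLI, MICRO, NANO };

// Hypothetical helper: pick the Arrow timestamp unit for a datetime column.
// When INT96 output is requested the unit must be NANO; otherwise the unit
// follows the column scale (the thresholds here are an assumption).
constexpr TimeUnit pick_timestamp_unit(int scale, bool enable_int96_timestamps) {
    if (enable_int96_timestamps) {
        return TimeUnit::NANO;
    }
    if (scale > 3) {
        return TimeUnit::MICRO;
    }
    if (scale > 0) {
        return TimeUnit::MILLI;
    }
    return TimeUnit::SECOND;
}
```

With this sketch, `pick_timestamp_unit(0, true)` and `pick_timestamp_unit(6, true)` both yield `TimeUnit::NANO`, regardless of the column scale.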

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@zhangstar333
Contributor Author

run buildall

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.65% (19705/37429)
Line Coverage 36.21% (184116/508489)
Region Coverage 32.39% (142114/438733)
Branch Coverage 33.57% (62143/185125)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100% (0/0) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.47% (26183/36635)
Line Coverage 54.27% (275049/506776)
Region Coverage 51.33% (227292/442782)
Branch Coverage 52.86% (98118/185607)

@zhangstar333
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 26668 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 338b86126a93cf9329aef12f87f81bee71d6f875, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17651	4501	4294	4294
q2	q3	10643	758	518	518
q4	4678	355	257	257
q5	7550	1214	1027	1027
q6	177	171	146	146
q7	776	848	674	674
q8	9296	1450	1290	1290
q9	4837	4744	4672	4672
q10	6267	1934	1638	1638
q11	465	251	253	251
q12	704	580	465	465
q13	18035	2966	2179	2179
q14	226	247	216	216
q15	q16	725	755	683	683
q17	711	846	457	457
q18	6093	5549	5157	5157
q19	1229	989	617	617
q20	532	500	373	373
q21	4711	1839	1430	1430
q22	535	383	324	324
Total cold run time: 95841 ms
Total hot run time: 26668 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4785	4718	4634	4634
q2	q3	3879	4350	3818	3818
q4	895	1213	767	767
q5	4106	4367	4376	4367
q6	181	172	142	142
q7	1806	1694	1524	1524
q8	2504	2685	2556	2556
q9	7632	7405	7409	7405
q10	3792	3990	3640	3640
q11	520	440	415	415
q12	495	568	448	448
q13	2716	3138	2345	2345
q14	275	314	388	314
q15	q16	798	771	720	720
q17	1159	1406	1369	1369
q18	7117	6698	6570	6570
q19	884	897	924	897
q20	2060	2216	2043	2043
q21	3950	3416	3337	3337
q22	433	558	398	398
Total cold run time: 49987 ms
Total hot run time: 47709 ms

@doris-robot

TPC-DS: Total hot run time: 167680 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 338b86126a93cf9329aef12f87f81bee71d6f875, data reload: false

query5	4332	637	502	502
query6	347	234	220	220
query7	4229	489	270	270
query8	333	243	228	228
query9	8716	2712	2717	2712
query10	526	385	340	340
query11	6982	5101	4869	4869
query12	184	128	124	124
query13	1280	466	355	355
query14	5775	3702	3480	3480
query14_1	2891	2806	2816	2806
query15	205	197	174	174
query16	1004	484	441	441
query17	1115	738	631	631
query18	2444	454	351	351
query19	216	208	183	183
query20	144	127	127	127
query21	220	148	114	114
query22	13252	13961	14743	13961
query23	16154	15855	15791	15791
query23_1	15846	15572	15263	15263
query24	7231	1614	1238	1238
query24_1	1218	1230	1230	1230
query25	612	470	398	398
query26	1220	267	144	144
query27	2785	475	291	291
query28	4513	1864	1844	1844
query29	821	555	461	461
query30	301	223	190	190
query31	1008	947	877	877
query32	86	71	74	71
query33	511	339	278	278
query34	906	900	520	520
query35	643	691	602	602
query36	1043	1146	995	995
query37	133	99	82	82
query38	2929	2893	2885	2885
query39	858	839	826	826
query39_1	776	799	797	797
query40	230	149	131	131
query41	63	60	57	57
query42	263	259	259	259
query43	241	248	219	219
query44	
query45	194	189	187	187
query46	880	974	594	594
query47	2144	2132	2064	2064
query48	307	310	239	239
query49	627	447	377	377
query50	700	285	209	209
query51	4141	4045	3985	3985
query52	260	262	263	262
query53	295	329	282	282
query54	314	271	268	268
query55	92	86	82	82
query56	322	317	312	312
query57	1928	1698	1788	1698
query58	282	298	263	263
query59	2781	2972	2761	2761
query60	340	336	321	321
query61	150	140	148	140
query62	637	600	522	522
query63	304	280	281	280
query64	5075	1253	1001	1001
query65	
query66	1455	461	348	348
query67	24257	24292	24149	24149
query68	
query69	434	323	285	285
query70	894	939	996	939
query71	338	307	303	303
query72	2815	2623	2369	2369
query73	536	537	320	320
query74	9578	9549	9317	9317
query75	2850	2759	2461	2461
query76	2324	1024	658	658
query77	363	365	313	313
query78	10932	11166	10477	10477
query79	1076	810	565	565
query80	813	617	560	560
query81	521	264	221	221
query82	1333	148	118	118
query83	338	267	245	245
query84	252	119	90	90
query85	896	492	438	438
query86	391	301	294	294
query87	3112	3117	3002	3002
query88	3536	2652	2605	2605
query89	429	371	335	335
query90	1844	181	177	177
query91	172	156	134	134
query92	81	73	74	73
query93	926	873	503	503
query94	524	317	290	290
query95	580	338	311	311
query96	642	517	225	225
query97	2484	2531	2425	2425
query98	232	222	217	217
query99	1018	992	889	889
Total cold run time: 247664 ms
Total hot run time: 167680 ms

@doris-robot

BE UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.66% (19710/37429)
Line Coverage 36.25% (184319/508487)
Region Coverage 32.43% (142286/438730)
Branch Coverage 33.59% (62180/185123)

@zhangstar333
Contributor Author

/review

Contributor

@github-actions github-actions bot left a comment


Code Review Summary

PR Goal

This PR fixes datetime type outfile parquet file generation by ensuring that when enable_int96_timestamps is true, the Arrow TimestampType uses NANO unit (required by the patched Arrow library after PR #60946). It also refactors the write_column_to_arrow serialization to use the actual Arrow timestamp unit rather than the column scale.

Critical Checkpoint Conclusions

1. Does the code accomplish the goal? Is there a test that proves it?

  • The Arrow type mapping fix (NANO for INT96) is correct and well-placed.
  • The write_column_to_arrow refactor to switch on timestamp_type->unit() is a good improvement over _scale-based branching.
  • However, the test has issues: the hive_docker expected output shows timezone-shifted values (10:00:00 -> 02:00:00), indicating that the wall-clock time is still being incorrectly shifted by the session timezone. This means the core INT96 write path still has a timezone correctness issue that is not fixed by this PR.

2. Is this modification as small, clear, and focused as possible?

  • Yes. The refactoring of ParquetFileOptions to use designated initializers is a good cleanup included alongside the fix.

3. Concurrency?

  • Not applicable. No concurrency concerns in this change.

4. Lifecycle management / static initialization?

  • Not applicable.

5. Configuration items?

  • No new configs added. Existing enable_int96_timestamps is properly wired.

6. Incompatible changes?

  • Not applicable.

7. Parallel code paths?

  • Issue found: DataTypeTimeStampTzSerDe::write_column_to_arrow in data_type_timestamptz_serde.cpp:192-223 has the exact same old _scale-based branching pattern that was fixed in data_type_datetimev2_serde.cpp. Since TYPE_TIMESTAMPTZ shares the same convert_to_arrow_type mapping (which now returns NANO when enable_int96_timestamps=true), the TIMESTAMPTZ serde will produce wrong values when it encounters a NANO timestamp type. See inline comment.
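
To make the parallel-path concern concrete, here is a minimal sketch of branching on the Arrow timestamp unit instead of the column scale. `TimeUnit` is a stand-in for Arrow's enum and `to_arrow_timestamp` is a hypothetical name; the real serde code works on Doris datetime values and Arrow builders, which are omitted here.

```cpp
#include <cstdint>

// Hypothetical stand-in for Arrow's time unit enum; not the real Arrow type.
enum class TimeUnit { SECOND, MILLI, MICRO, NANO };

// Hypothetical helper: scale an epoch-seconds value plus a microsecond
// fraction to the unit the Arrow timestamp type actually declares.
// No overflow checking is done in this sketch.
constexpr int64_t to_arrow_timestamp(int64_t epoch_seconds, uint32_t microsecond,
                                     TimeUnit unit) {
    switch (unit) {
    case TimeUnit::SECOND:
        return epoch_seconds;
    case TimeUnit::MILLI:
        return epoch_seconds * 1000LL + static_cast<int64_t>(microsecond) / 1000;
    case TimeUnit::MICRO:
        return epoch_seconds * 1000000LL + static_cast<int64_t>(microsecond);
    case TimeUnit::NANO:
        return epoch_seconds * 1000000000LL + static_cast<int64_t>(microsecond) * 1000LL;
    }
    return epoch_seconds;  // unreachable for valid enum values
}
```

A scale-based branch has no NANO case, so a serde that still branches on `_scale` would hand microsecond- or millisecond-scaled values to a builder whose type says nanoseconds. Switching on the unit keeps the value and the declared type in agreement.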

8. Special conditional checks?

  • Not applicable.

9. Test coverage?

  • The new regression test has several issues: dead code (enable_int96 parameter), misleading comments, and the hive_docker expected output shows timezone-shifted values suggesting the fix is incomplete. See inline comments.

10. Observability?

  • Not applicable for this change.

11. Transaction/persistence?

  • Not applicable.

12. Data writes?

  • The timestamp value computation is correct for the non-INT96 paths (MICRO, MILLI, SECOND). For the INT96/NANO path, the arithmetic is correct.

13. FE-BE variable passing?

  • All paths correctly pass enable_int96_timestamps through to ParquetFileOptions.

14. Performance?

  • The switch statement is equivalent performance to the prior if-else chain. No concerns.

15. Other issues?

  • See inline comments for specific issues.

Summary of Issues Found

  1. [Parallel path not updated] DataTypeTimeStampTzSerDe::write_column_to_arrow still uses _scale-based branching and will produce wrong values when the Arrow type is NANO (INT96 mode).
  2. [Test dead code] The enable_int96 parameter in outfile_to_HDFS is never used (the property line is commented out). The comment at line 122 says "enable_int96_timestamps=true" but the call passes false. Clean up the dead code or actually wire the parameter.
  3. [Hive results show timezone shift] The .out file hive_docker section shows an 8-hour shift (e.g. 10:00:00 -> 02:00:00), suggesting that the INT96 value written still embeds the timezone-converted UTC epoch rather than the wall-clock nanoseconds Hive expects. This may indicate the fix is incomplete for the write path.
  4. [Unused imports] The test file imports IOGroovyMethods, StandardCharsets, Files, Paths but none are used.
  5. [Unused variable] defaultFS_with_postfix is defined but never used in the test.
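
The 8-hour shift in issue 3 can be reproduced with simple arithmetic. The sketch below assumes the session timezone is UTC+8, which is inferred from the size of the shift and is not stated in the PR; `shift_seconds_of_day` is a hypothetical helper.

```cpp
#include <cstdint>

constexpr int64_t kSecondsPerDay = 86400;

// Hypothetical helper: convert a wall-clock time of day (seconds since
// midnight) to UTC by subtracting the zone offset, wrapping into [0, 86400).
constexpr int64_t shift_seconds_of_day(int64_t wall_clock_seconds,
                                       int64_t utc_offset_seconds) {
    int64_t shifted = (wall_clock_seconds - utc_offset_seconds) % kSecondsPerDay;
    return shifted < 0 ? shifted + kSecondsPerDay : shifted;
}
```

Under that assumption, `shift_seconds_of_day(10 * 3600, 8 * 3600)` gives `2 * 3600`, matching the 10:00:00 -> 02:00:00 values in the hive_docker section.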

@zhangstar333
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 27218 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 9488bd02caed1bf08e75eb91711b9fd81f08c280, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17596	4535	4292	4292
q2	q3	10642	835	538	538
q4	4681	361	253	253
q5	7562	1210	1011	1011
q6	179	175	147	147
q7	825	854	687	687
q8	9302	1497	1376	1376
q9	4895	4722	4758	4722
q10	6249	1929	1639	1639
q11	471	261	244	244
q12	737	598	469	469
q13	18060	2930	2186	2186
q14	229	228	211	211
q15	q16	715	730	663	663
q17	745	869	428	428
q18	6062	5490	5341	5341
q19	1255	991	621	621
q20	560	499	375	375
q21	5071	1855	1682	1682
q22	428	355	333	333
Total cold run time: 96264 ms
Total hot run time: 27218 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4636	4646	4608	4608
q2	q3	3882	4399	3836	3836
q4	908	1215	814	814
q5	4117	4422	4368	4368
q6	185	181	148	148
q7	1788	1690	1555	1555
q8	2529	2743	2671	2671
q9	7616	7693	7350	7350
q10	3783	4093	3706	3706
q11	535	454	423	423
q12	503	615	469	469
q13	2692	3199	2356	2356
q14	282	316	281	281
q15	q16	714	770	714	714
q17	1164	1351	1330	1330
q18	7181	6700	6712	6700
q19	960	928	987	928
q20	2083	2124	2002	2002
q21	4007	3657	3388	3388
q22	452	422	374	374
Total cold run time: 50017 ms
Total hot run time: 48021 ms

@doris-robot

TPC-DS: Total hot run time: 168331 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 9488bd02caed1bf08e75eb91711b9fd81f08c280, data reload: false

query5	4351	637	520	520
query6	331	228	197	197
query7	4212	465	267	267
query8	333	242	223	223
query9	8733	2722	2742	2722
query10	526	394	342	342
query11	7007	5089	4862	4862
query12	191	134	126	126
query13	1277	483	359	359
query14	5747	3725	3430	3430
query14_1	2869	2843	2822	2822
query15	195	194	172	172
query16	997	467	376	376
query17	1073	703	582	582
query18	2441	437	332	332
query19	211	210	179	179
query20	137	125	122	122
query21	206	130	110	110
query22	13171	14184	14674	14184
query23	16546	15718	15666	15666
query23_1	15484	15364	15301	15301
query24	7266	1601	1230	1230
query24_1	1227	1231	1215	1215
query25	544	459	435	435
query26	1235	263	143	143
query27	2789	478	290	290
query28	4531	1849	1847	1847
query29	857	550	489	489
query30	301	227	195	195
query31	1026	938	873	873
query32	87	73	72	72
query33	504	321	281	281
query34	894	876	514	514
query35	629	681	603	603
query36	1068	1094	998	998
query37	132	96	85	85
query38	2958	2967	2931	2931
query39	862	835	810	810
query39_1	799	813	801	801
query40	230	148	135	135
query41	63	61	59	59
query42	261	255	253	253
query43	240	250	223	223
query44	
query45	196	188	181	181
query46	879	965	605	605
query47	2138	2114	2074	2074
query48	305	320	231	231
query49	632	469	377	377
query50	695	279	221	221
query51	4120	4007	4019	4007
query52	266	271	259	259
query53	292	333	284	284
query54	297	268	265	265
query55	92	88	82	82
query56	320	326	316	316
query57	1926	1813	1636	1636
query58	290	265	267	265
query59	2775	2919	2746	2746
query60	358	353	334	334
query61	188	184	180	180
query62	647	604	560	560
query63	314	281	278	278
query64	5233	1384	1102	1102
query65	
query66	1477	478	376	376
query67	24265	24424	24242	24242
query68	
query69	418	316	299	299
query70	1001	965	931	931
query71	370	312	300	300
query72	3116	2794	2377	2377
query73	536	540	325	325
query74	9633	9550	9400	9400
query75	2838	2748	2464	2464
query76	2297	1040	669	669
query77	358	370	308	308
query78	11043	11245	10533	10533
query79	1081	821	584	584
query80	1430	625	548	548
query81	529	264	226	226
query82	1361	156	120	120
query83	364	261	241	241
query84	253	117	95	95
query85	1111	503	460	460
query86	412	335	290	290
query87	3147	3073	3027	3027
query88	3603	2694	2691	2691
query89	424	371	340	340
query90	1830	173	177	173
query91	169	162	137	137
query92	84	76	73	73
query93	903	847	499	499
query94	578	345	309	309
query95	592	348	320	320
query96	645	528	231	231
query97	2432	2489	2393	2393
query98	230	217	224	217
query99	1019	1000	892	892
Total cold run time: 250626 ms
Total hot run time: 168331 ms

@doris-robot

BE UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.65% (19761/37533)
Line Coverage 36.21% (184494/509535)
Region Coverage 32.41% (142473/439595)
Branch Coverage 33.54% (62256/185615)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100% (0/0) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.58% (26297/36738)
Line Coverage 54.39% (276212/507820)
Region Coverage 51.60% (228903/443643)
Branch Coverage 53.06% (98735/186097)

uint32_t microsecond = datetime_val.microsecond();
timestamp = (timestamp * 1000000) + microsecond;
} else if (_scale > 0) {
timestamp = (timestamp * 1000000000LL) + (microsecond * 1000LL);
Contributor Author


This multiplication overflows int64 when the datetime is 9999-12-31.
I need some time to find a solution.
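
For reference, the overflow is easy to confirm: 9999-12-31 23:59:59 is 253402300799 seconds after the Unix epoch, and multiplying that by 10^9 exceeds the int64 maximum of about 9.22 * 10^18. The sketch below shows a checked conversion and, purely as one conceivable direction rather than the author's chosen fix, a split into the Julian day and nanoseconds-of-day pair that the Parquet INT96 layout stores. All names are hypothetical.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: compute epoch nanoseconds into *out and return true,
// or return false if the result would not fit in int64.
inline bool epoch_nanos_checked(int64_t epoch_seconds, uint32_t microsecond,
                                int64_t* out) {
    constexpr int64_t kNanosPerSecond = 1000000000LL;
    constexpr int64_t kMax = std::numeric_limits<int64_t>::max();
    constexpr int64_t kMin = std::numeric_limits<int64_t>::min();
    if (epoch_seconds > kMax / kNanosPerSecond ||
        epoch_seconds < kMin / kNanosPerSecond) {
        return false;  // the multiplication itself would overflow
    }
    const int64_t nanos = epoch_seconds * kNanosPerSecond;
    const int64_t frac = static_cast<int64_t>(microsecond) * 1000LL;
    if (nanos > kMax - frac) {
        return false;  // adding the fraction would overflow
    }
    *out = nanos + frac;
    return true;
}

// The two components stored in a Parquet INT96 timestamp.
struct Int96Parts {
    int32_t julian_day;
    int64_t nanos_of_day;
};

// Hypothetical helper: split epoch seconds into Julian day and nanoseconds
// of day. Neither component overflows for dates up to 9999-12-31.
inline Int96Parts split_int96(int64_t epoch_seconds, uint32_t microsecond) {
    constexpr int64_t kSecondsPerDay = 86400;
    constexpr int64_t kJulianDayOfUnixEpoch = 2440588;  // 1970-01-01
    int64_t days = epoch_seconds / kSecondsPerDay;
    int64_t secs = epoch_seconds % kSecondsPerDay;
    if (secs < 0) {  // floor division for pre-1970 values
        secs += kSecondsPerDay;
        --days;
    }
    return {static_cast<int32_t>(days + kJulianDayOfUnixEpoch),
            secs * 1000000000LL + static_cast<int64_t>(microsecond) * 1000LL};
}
```

Whether such a split can be used here depends on how the patched Arrow writer accepts INT96 values, which this thread does not say; if it only takes int64 nanoseconds, the split would have to happen elsewhere.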
