[fix](parquet)Fix the be core issue when reading parquet unsigned types. #39926

Merged: 4 commits merged on Aug 29, 2024

hubgeter
Contributor

Proposed changes

Doris has no unsigned integer types, so each Parquet unsigned type is mapped to a wider signed Doris type; for example, Parquet uint32 is mapped to Doris BIGINT (int64).
When such a Parquet file was read, the byte size of the values stored in Parquet did not match the byte size of the Doris type they were mapped to, which caused the BE to core dump.
Fix:
Values are now read using the byte size stored in Parquet and are then converted to the Doris type they are mapped to.

Type mapping (Parquet -> Doris):
UInt8 -> Int16
UInt16 -> Int32
UInt32 -> Int64
UInt64 -> Int128
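
The read-then-convert approach can be illustrated with a minimal C++ sketch. This is not the code from this PR: `read_and_widen` and its parameters are hypothetical names, and the sketch assumes the values sit in a contiguous buffer at the stored type's width. Each value is read at the stored width first and only then widened to the mapped signed type, so the read stride never depends on the size of the mapped type. UInt64 -> Int128 is left out because standard C++ has no portable 128-bit integer type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical helper (not Doris code): decode `count` values that are stored
// with the unsigned type `StoredT`, then widen each one into the signed type
// `MappedT` that the engine exposes.
template <typename StoredT, typename MappedT>
std::vector<MappedT> read_and_widen(const unsigned char* data, std::size_t count) {
    static_assert(std::is_unsigned<StoredT>::value, "stored type must be unsigned");
    static_assert(std::is_signed<MappedT>::value, "mapped type must be signed");
    static_assert(sizeof(MappedT) > sizeof(StoredT),
                  "mapped type must be wider so every stored value fits");
    assert(data != nullptr || count == 0);

    std::vector<MappedT> out;
    out.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        StoredT value;
        // Advance by the stored width, not by sizeof(MappedT): stepping by the
        // wider mapped size would read past the end of the buffer.
        std::memcpy(&value, data + i * sizeof(StoredT), sizeof(StoredT));
        // Widening an unsigned value into a larger signed type keeps its value.
        out.push_back(static_cast<MappedT>(value));
    }
    return out;
}
```

For example, `read_and_widen<std::uint32_t, std::int64_t>(buf, n)` keeps 4294967295 as a positive 64-bit value instead of wrapping to -1.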

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@hubgeter
Contributor Author

run buildall

Contributor

Possible file(s) that should be tracked in LFS detected: 🚨

The following file(s) exceed the file size limit of 1048576 bytes, as set in the .yml configuration files:

  • regression-test/data/external_table_p0/tvf/unsigned_integers_3.parquet

Consider using git-lfs to manage large files.

@github-actions github-actions bot added the "lfs-detected!" label (Warning Label for use when LFS is detected in the commits of a Pull Request) on Aug 26, 2024
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@github-actions github-actions bot removed the "lfs-detected!" label (Warning Label for use when LFS is detected in the commits of a Pull Request) on Aug 27, 2024
@hubgeter
Contributor Author

run buildall

Contributor

clang-tidy review says "All clean, LGTM! 👍"

@hubgeter
Contributor Author

run buildall

Contributor

clang-tidy review says "All clean, LGTM! 👍"

@hubgeter
Contributor Author

run buildall

Contributor

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot

TPC-H: Total hot run time: 38574 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 8b624cb373a0123edcb0dea32085351dae52dc8f, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	min hot (ms)
q1	18026	4500	4423	4423
q2	2027	181	185	181
q3	10474	1116	1114	1114
q4	10144	795	780	780
q5	7709	2840	2835	2835
q6	227	139	139	139
q7	963	623	615	615
q8	9336	2066	2070	2066
q9	7180	6515	6523	6515
q10	7015	2252	2257	2252
q11	462	252	259	252
q12	401	232	237	232
q13	18890	3068	3031	3031
q14	291	244	233	233
q15	530	498	486	486
q16	509	391	393	391
q17	984	731	790	731
q18	7324	6911	6849	6849
q19	1388	1168	1078	1078
q20	696	335	329	329
q21	3925	3040	3182	3040
q22	1116	1031	1002	1002
Total cold run time: 109617 ms
Total hot run time: 38574 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	min hot (ms)
q1	4518	4451	4249	4249
q2	382	285	264	264
q3	2850	2626	2616	2616
q4	1928	1700	1706	1700
q5	5466	5371	5391	5371
q6	217	133	133	133
q7	2110	1723	1792	1723
q8	3171	3353	3315	3315
q9	8413	8462	8359	8359
q10	3462	3192	3216	3192
q11	610	506	503	503
q12	775	633	580	580
q13	12027	3061	3049	3049
q14	307	275	286	275
q15	522	490	483	483
q16	481	425	428	425
q17	1780	1508	1505	1505
q18	7799	7571	7502	7502
q19	1646	1603	1399	1399
q20	2054	1799	1808	1799
q21	5357	5138	5287	5138
q22	1159	1048	1063	1048
Total cold run time: 67034 ms
Total hot run time: 54628 ms

@doris-robot

TPC-DS: Total hot run time: 187957 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 8b624cb373a0123edcb0dea32085351dae52dc8f, data reload: false

query	cold (ms)	hot 1 (ms)	hot 2 (ms)	min hot (ms)
query1	905	367	378	367
query2	6463	1951	1777	1777
query3	6652	213	225	213
query4	27970	23074	23145	23074
query5	4154	521	515	515
query6	255	171	165	165
query7	4578	289	295	289
query8	252	200	205	200
query9	8582	2495	2505	2495
query10	442	260	286	260
query11	17796	15103	15002	15002
query12	151	107	99	99
query13	1628	376	391	376
query14	9748	6856	7229	6856
query15	267	169	180	169
query16	7384	489	467	467
query17	1564	587	539	539
query18	1979	281	284	281
query19	273	147	146	146
query20	117	112	107	107
query21	206	101	101	101
query22	4563	4199	4064	4064
query23	34155	33790	33630	33630
query24	11180	2881	2910	2881
query25	629	371	391	371
query26	1089	156	153	153
query27	2292	285	283	283
query28	6786	2089	2067	2067
query29	751	433	406	406
query30	308	151	147	147
query31	970	759	782	759
query32	98	54	57	54
query33	747	289	285	285
query34	989	494	496	494
query35	857	774	696	696
query36	1100	917	889	889
query37	152	80	86	80
query38	3975	3865	3811	3811
query39	1438	1405	1393	1393
query40	200	113	114	113
query41	48	48	45	45
query42	116	99	102	99
query43	519	467	469	467
query44	1216	765	768	765
query45	196	170	167	167
query46	1110	732	749	732
query47	1903	1770	1804	1770
query48	377	306	296	296
query49	1110	434	423	423
query50	823	417	416	416
query51	7237	7133	7039	7039
query52	97	85	89	85
query53	257	189	181	181
query54	903	461	464	461
query55	79	78	82	78
query56	278	261	261	261
query57	1197	1057	1067	1057
query58	242	216	223	216
query59	3122	2866	2970	2866
query60	296	264	271	264
query61	143	98	100	98
query62	841	634	664	634
query63	226	185	184	184
query64	4547	671	682	671
query65	3209	3130	3119	3119
query66	1403	333	336	333
query67	15645	15226	15346	15226
query68	3296	584	559	559
query69	395	279	281	279
query70	1125	1073	1067	1067
query71	341	278	274	274
query72	6324	4037	4155	4037
query73	752	335	340	335
query74	9130	8845	8813	8813
query75	3426	2695	2652	2652
query76	1943	1010	1031	1010
query77	523	331	331	331
query78	9763	9032	9070	9032
query79	1048	544	539	539
query80	700	531	527	527
query81	499	229	227	227
query82	243	147	141	141
query83	179	164	179	164
query84	231	84	78	78
query85	752	374	346	346
query86	306	303	292	292
query87	4354	4230	4309	4230
query88	2983	2427	2415	2415
query89	374	296	282	282
query90	1833	207	207	207
query91	141	116	117	116
query92	66	52	56	52
query93	1024	555	544	544
query94	844	301	290	290
query95	379	279	283	279
query96	592	271	273	271
query97	3203	3059	3111	3059
query98	223	202	206	202
query99	1517	1280	1287	1280
Total cold run time: 279294 ms
Total hot run time: 187957 ms

@doris-robot

ClickBench: Total hot run time: 30.69 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 8b624cb373a0123edcb0dea32085351dae52dc8f, data reload: false

query	cold (s)	hot 1 (s)	hot 2 (s)
query1	0.05	0.04	0.04
query2	0.08	0.04	0.05
query3	0.23	0.05	0.05
query4	1.67	0.08	0.07
query5	0.54	0.49	0.51
query6	1.13	0.74	0.73
query7	0.02	0.02	0.02
query8	0.06	0.05	0.04
query9	0.56	0.46	0.50
query10	0.54	0.54	0.54
query11	0.15	0.12	0.12
query12	0.15	0.12	0.11
query13	0.61	0.59	0.58
query14	0.76	0.79	0.78
query15	0.88	0.81	0.82
query16	0.39	0.37	0.39
query17	1.06	0.97	1.05
query18	0.21	0.21	0.22
query19	1.94	1.77	1.76
query20	0.02	0.01	0.01
query21	15.40	0.67	0.66
query22	4.36	7.67	1.82
query23	18.29	1.41	1.31
query24	2.07	0.25	0.22
query25	0.16	0.08	0.08
query26	0.27	0.18	0.17
query27	0.09	0.08	0.08
query28	13.19	1.02	0.99
query29	12.65	3.33	3.28
query30	0.24	0.06	0.06
query31	2.86	0.40	0.41
query32	3.26	0.48	0.48
query33	2.97	3.03	3.04
query34	16.99	4.50	4.41
query35	4.43	4.44	4.44
query36	0.66	0.47	0.50
query37	0.19	0.16	0.15
query38	0.18	0.15	0.15
query39	0.05	0.03	0.03
query40	0.16	0.12	0.13
query41	0.10	0.05	0.05
query42	0.06	0.05	0.06
query43	0.05	0.04	0.04
Total cold run time: 109.73 s
Total hot run time: 30.69 s

Contributor

@morningman morningman left a comment

LGTM

@github-actions github-actions bot added the "approved" label (Indicates a PR has been approved by one committer.) on Aug 28, 2024
Contributor

PR approved by at least one committer and no changes requested.

Contributor

PR approved by anyone and no changes requested.

Contributor

@kaka11chen kaka11chen left a comment

LGTM

@morningman morningman merged commit 80b9213 into apache:master Aug 29, 2024
27 of 30 checks passed
hubgeter added a commit to hubgeter/doris that referenced this pull request Aug 29, 2024
…es. (apache#39926)
dataroaring pushed a commit that referenced this pull request Sep 3, 2024
…es. (#39926)
Labels
approved (Indicates a PR has been approved by one committer.), dev/2.1.6-merged, dev/3.0.2-merged, reviewed
5 participants