@suxiaogang223 suxiaogang223 commented Oct 16, 2024

Proposed changes

Cherry-picked PRs:
#41506
#41526
#41683
#41816

suxiaogang223 and others added 3 commits October 16, 2024 09:50
…he#41506)

## Proposed changes
Reason: https://issues.apache.org/jira/browse/ARROW-5322
The Java reader (parquet-mr) checks `dictionaryPageOffset = 0` to
determine whether a dictionary page exists, whereas the C++ reader uses
`has_dictionaryPageOffset` (the `__isset` bit in the Thrift message) for
the same check, which results in incompatible behaviour.
Therefore, we should consider that a dictionary page exists only when
both `__isset.dictionary_page_offset` is true and
`dictionary_page_offset` is greater than 0.
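
The rule above can be sketched in a few lines of C++. This is only an illustration, not the code from the PR: `ColumnMetaDataSketch` and `has_dictionary_page` are hypothetical names, and the struct merely stands in for the Thrift-generated column metadata with the two fields mentioned in the description.

```cpp
#include <cstdint>

// Hypothetical stand-in for the Thrift-generated column chunk metadata.
// Only the two fields discussed above are modeled.
struct ColumnMetaDataSketch {
    bool isset_dictionary_page_offset = false;  // mirrors __isset.dictionary_page_offset
    int64_t dictionary_page_offset = 0;         // mirrors dictionary_page_offset
};

// A dictionary page is treated as present only when the field is set
// AND the offset is strictly positive. This covers files from writers
// that mark the field as set but store 0 when there is no dictionary.
inline bool has_dictionary_page(const ColumnMetaDataSketch& meta) {
    return meta.isset_dictionary_page_offset && meta.dictionary_page_offset > 0;
}
```

With this check, metadata where the field is set but the offset is 0 is read as "no dictionary page", which is consistent with the parquet-mr interpretation described above.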
## Proposed changes
Implemented reading Parquet files with the decimal256 type
…der (apache#41683)

## Proposed changes
Implement ByteStreamSplitDecoder to decode Parquet data that uses the
BYTE_STREAM_SPLIT encoding.
Related PR: apache/arrow#42372

> Apache Parquet does not have any encodings suitable for FP data and
the available text compressors (zstd, gzip, etc) do not handle FP data
very well.
It is possible to apply a simple data transformation named "stream
splitting". Such could be "byte stream splitting" which creates K
streams of length N where K is the number of bytes in the data type (4
for floats, 8 for doubles) and N is the number of elements in the
sequence.
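
The transformation described in the quote can be sketched as a pair of small C++ functions. This is a minimal illustration under stated assumptions, not the `ByteStreamSplitDecoder` added by the PR: the function names are hypothetical, the whole buffer is processed at once, and the input is assumed to contain exactly K * N bytes.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Encode: byte k of element i is written to stream k at position i.
// The K streams (each N bytes long) are stored back to back.
template <typename T>
std::vector<uint8_t> byte_stream_split_encode(const std::vector<T>& values) {
    const size_t K = sizeof(T);       // number of streams (4 for float, 8 for double)
    const size_t N = values.size();   // number of elements per stream
    std::vector<uint8_t> out(K * N);
    const auto* src = reinterpret_cast<const uint8_t*>(values.data());
    for (size_t i = 0; i < N; ++i) {
        for (size_t k = 0; k < K; ++k) {
            out[k * N + i] = src[i * K + k];
        }
    }
    return out;
}

// Decode: gather byte i from each of the K streams to rebuild element i.
// Assumes encoded.size() is an exact multiple of sizeof(T).
template <typename T>
std::vector<T> byte_stream_split_decode(const std::vector<uint8_t>& encoded) {
    const size_t K = sizeof(T);
    const size_t N = encoded.size() / K;
    std::vector<T> values(N);
    for (size_t i = 0; i < N; ++i) {
        uint8_t bytes[sizeof(T)];
        for (size_t k = 0; k < K; ++k) {
            bytes[k] = encoded[k * N + i];
        }
        // memcpy avoids aliasing problems when reassembling the value.
        std::memcpy(&values[i], bytes, K);
    }
    return values;
}
```

Grouping the bytes this way does not shrink the data by itself; it only rearranges it so that a general-purpose compressor applied afterwards sees more regular byte streams.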

---------

Co-authored-by: morningman <morningman@163.com>
@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@suxiaogang223

run buildall

@gavinchou gavinchou changed the title [cherry-pick](branch-3.0) fix parquet cases [cherry-pick](branch-3.0) fix parquet cases (#41506 #41526 #41683) Oct 16, 2024
@suxiaogang223

run buildall

@doris-robot

TPC-H: Total hot run time: 40390 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit f285248d1b6b9d69f363b511f50c23c340f2c2ef, data reload: false

------ Round 1 ----------------------------------
q1	17606	7459	7344	7344
q2	2057	153	154	153
q3	10707	1076	1178	1076
q4	10550	757	738	738
q5	7758	2846	2724	2724
q6	233	150	150	150
q7	981	652	644	644
q8	9572	1928	1873	1873
q9	7922	6425	6467	6425
q10	6999	2276	2320	2276
q11	448	258	254	254
q12	413	217	218	217
q13	17764	2966	2970	2966
q14	256	214	205	205
q15	558	527	516	516
q16	499	412	408	408
q17	983	588	535	535
q18	7242	6633	6615	6615
q19	1573	1025	1038	1025
q20	578	275	271	271
q21	3922	3143	3010	3010
q22	1096	1003	965	965
Total cold run time: 109717 ms
Total hot run time: 40390 ms

----- Round 2, with runtime_filter_mode=off -----
q1	7348	7300	7264	7264
q2	325	223	230	223
q3	2944	2917	2833	2833
q4	2060	1856	1748	1748
q5	5711	5694	5768	5694
q6	223	147	148	147
q7	2237	1782	1769	1769
q8	3335	3666	3416	3416
q9	8916	8868	8856	8856
q10	3553	3543	3532	3532
q11	575	481	486	481
q12	847	636	601	601
q13	16442	3143	3169	3143
q14	300	262	268	262
q15	570	533	535	533
q16	520	464	458	458
q17	1888	1630	1606	1606
q18	8182	7901	7472	7472
q19	3861	1500	1561	1500
q20	2160	1862	1891	1862
q21	5437	5237	5259	5237
q22	1126	997	1028	997
Total cold run time: 78560 ms
Total hot run time: 59634 ms

@doris-robot

TPC-DS: Total hot run time: 193811 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit f285248d1b6b9d69f363b511f50c23c340f2c2ef, data reload: false

query1	1310	911	887	887
query2	6284	2015	1971	1971
query3	10810	3675	3652	3652
query4	67607	25610	23423	23423
query5	5858	479	475	475
query6	506	176	169	169
query7	6432	312	310	310
query8	309	212	209	209
query9	9409	2695	2688	2688
query10	531	305	272	272
query11	18453	15244	15756	15244
query12	160	102	112	102
query13	1666	476	466	466
query14	11784	7460	7562	7460
query15	233	174	171	171
query16	7234	476	432	432
query17	1076	580	557	557
query18	1853	299	306	299
query19	192	155	144	144
query20	112	108	108	108
query21	208	101	98	98
query22	4456	4247	4292	4247
query23	34371	33679	33846	33679
query24	5556	2883	2840	2840
query25	523	416	403	403
query26	689	163	159	159
query27	1707	301	309	301
query28	3923	2247	2238	2238
query29	655	433	433	433
query30	239	155	158	155
query31	999	804	771	771
query32	71	53	55	53
query33	494	296	300	296
query34	918	506	512	506
query35	866	727	733	727
query36	1089	929	941	929
query37	136	79	79	79
query38	3914	3845	3966	3845
query39	1472	1390	1410	1390
query40	208	101	97	97
query41	47	42	43	42
query42	114	92	97	92
query43	515	489	488	488
query44	1154	795	796	795
query45	206	167	169	167
query46	1151	725	732	725
query47	1904	1812	1800	1800
query48	415	349	329	329
query49	708	393	388	388
query50	828	419	422	419
query51	7038	7031	6901	6901
query52	131	85	99	85
query53	255	183	177	177
query54	583	454	456	454
query55	78	73	73	73
query56	266	250	243	243
query57	1210	1113	1094	1094
query58	221	212	247	212
query59	3037	3057	2961	2961
query60	292	256	268	256
query61	103	98	99	98
query62	796	669	659	659
query63	216	188	183	183
query64	1718	622	620	620
query65	3267	3148	3165	3148
query66	748	325	300	300
query67	15484	15274	15158	15158
query68	3421	557	549	549
query69	713	279	288	279
query70	1183	1149	1127	1127
query71	505	279	279	279
query72	8307	3969	3928	3928
query73	790	342	347	342
query74	9834	9006	8934	8934
query75	5020	2661	2646	2646
query76	4114	914	844	844
query77	776	284	301	284
query78	10055	9215	9303	9215
query79	2733	592	588	588
query80	1324	427	432	427
query81	563	224	230	224
query82	644	134	125	125
query83	393	141	140	140
query84	292	78	84	78
query85	1434	297	279	279
query86	455	314	282	282
query87	4558	4168	4225	4168
query88	4524	2402	2396	2396
query89	408	285	291	285
query90	2107	187	191	187
query91	137	104	106	104
query92	53	44	44	44
query93	4069	538	546	538
query94	1051	289	272	272
query95	354	252	254	252
query96	631	279	281	279
query97	3282	3132	3096	3096
query98	223	187	186	186
query99	1587	1288	1282	1282
Total cold run time: 332338 ms
Total hot run time: 193811 ms

## Proposed changes
Fix Parquet case: nation.dict-malformed.parquet
@suxiaogang223 suxiaogang223 changed the title [cherry-pick](branch-3.0) fix parquet cases (#41506 #41526 #41683) [cherry-pick](branch-3.0) fix parquet cases (#41506 #41526 #41683 #41816) Oct 17, 2024
@suxiaogang223

run buildall

@doris-robot

TPC-H: Total hot run time: 40017 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit ce02dd7a9948e45ab9200f15069a5e876b9214df, data reload: false

------ Round 1 ----------------------------------
q1	17608	7417	7328	7328
q2	2035	163	149	149
q3	10939	1043	1162	1043
q4	10551	795	742	742
q5	7752	2780	2752	2752
q6	231	151	150	150
q7	984	641	622	622
q8	9575	1897	1992	1897
q9	7232	6374	6503	6374
q10	6988	2295	2292	2292
q11	448	252	245	245
q12	403	214	216	214
q13	17775	2960	2940	2940
q14	241	223	217	217
q15	553	509	529	509
q16	504	403	400	400
q17	971	612	532	532
q18	7192	6443	6600	6443
q19	3128	955	923	923
q20	588	284	273	273
q21	3951	3003	3077	3003
q22	1048	1000	969	969
Total cold run time: 110697 ms
Total hot run time: 40017 ms

----- Round 2, with runtime_filter_mode=off -----
q1	7406	7262	7219	7219
q2	321	223	220	220
q3	2922	2837	2838	2837
q4	2029	1845	1762	1762
q5	5679	5696	5715	5696
q6	225	137	141	137
q7	2210	1772	1760	1760
q8	3326	3465	3432	3432
q9	8895	8853	8839	8839
q10	3541	3530	3497	3497
q11	578	495	495	495
q12	787	598	612	598
q13	16429	3170	3142	3142
q14	313	271	275	271
q15	576	518	512	512
q16	505	454	457	454
q17	1834	1635	1592	1592
q18	8197	7794	7553	7553
q19	2820	1474	1480	1474
q20	2115	1863	1875	1863
q21	5392	5386	5251	5251
q22	1112	1004	1028	1004
Total cold run time: 77212 ms
Total hot run time: 59608 ms

@doris-robot

TPC-DS: Total hot run time: 193279 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit ce02dd7a9948e45ab9200f15069a5e876b9214df, data reload: false

query1	1262	911	901	901
query2	6254	1991	1952	1952
query3	10758	3800	3786	3786
query4	68479	26871	23441	23441
query5	5873	473	432	432
query6	494	165	161	161
query7	6527	321	301	301
query8	300	214	202	202
query9	9616	2686	2658	2658
query10	537	276	274	274
query11	18511	15412	15758	15412
query12	156	101	101	101
query13	1631	461	446	446
query14	11628	6844	6991	6844
query15	224	178	181	178
query16	7325	457	489	457
query17	1062	561	554	554
query18	1712	296	309	296
query19	192	153	139	139
query20	109	106	109	106
query21	207	98	100	98
query22	4214	4319	4031	4031
query23	34720	33805	33767	33767
query24	5499	2850	2899	2850
query25	521	409	400	400
query26	694	166	162	162
query27	1717	292	298	292
query28	3929	2233	2218	2218
query29	686	430	434	430
query30	232	145	147	145
query31	952	791	806	791
query32	71	52	56	52
query33	484	293	286	286
query34	893	503	490	490
query35	854	724	736	724
query36	1050	923	950	923
query37	133	79	74	74
query38	3918	3912	3976	3912
query39	1483	1442	1412	1412
query40	213	97	96	96
query41	46	43	43	43
query42	115	97	101	97
query43	526	494	485	485
query44	1111	785	782	782
query45	198	169	165	165
query46	1135	695	730	695
query47	1906	1842	1805	1805
query48	433	338	331	331
query49	713	426	391	391
query50	825	401	419	401
query51	6969	6957	6921	6921
query52	100	93	93	93
query53	262	183	180	180
query54	573	470	480	470
query55	77	73	73	73
query56	276	254	268	254
query57	1235	1152	1121	1121
query58	225	255	244	244
query59	3122	2924	2866	2866
query60	276	261	256	256
query61	104	103	104	103
query62	760	667	648	648
query63	220	185	183	183
query64	1946	617	615	615
query65	3233	3153	3216	3153
query66	714	310	301	301
query67	15545	15244	15175	15175
query68	3485	565	540	540
query69	589	284	282	282
query70	1127	1128	1119	1119
query71	456	263	297	263
query72	7726	3880	3856	3856
query73	775	337	336	336
query74	10204	8870	9173	8870
query75	3796	2641	2633	2633
query76	3211	919	906	906
query77	728	288	282	282
query78	10082	9277	9162	9162
query79	3411	591	576	576
query80	2591	430	446	430
query81	565	222	227	222
query82	742	129	121	121
query83	315	131	138	131
query84	299	88	75	75
query85	1889	300	277	277
query86	433	302	304	302
query87	4389	4295	4371	4295
query88	4855	2377	2408	2377
query89	408	283	292	283
query90	2145	181	182	181
query91	151	105	107	105
query92	66	46	46	46
query93	4887	542	535	535
query94	1051	275	285	275
query95	352	250	248	248
query96	618	278	272	272
query97	3268	3181	3155	3155
query98	219	199	200	199
query99	1556	1280	1291	1280
Total cold run time: 334154 ms
Total hot run time: 193279 ms

@morningman morningman merged commit e634f0d into apache:branch-3.0 Oct 17, 2024
@suxiaogang223 suxiaogang223 deleted the fix_parquet branch October 21, 2024 11:23