[refactor](spark load) update spark version for spark load to resolve cve problem #30368

Merged Mar 18, 2024 (1 commit)

Contributor

@gnehil gnehil commented Jan 25, 2024

Proposed changes

Issue Number: close #xxx

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

@CalvinKirs
Member

run buildall

@gnehil
Contributor Author

gnehil commented Mar 14, 2024

run buildall

@doris-robot

TPC-H: Total hot run time: 38360 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 7ad6689c8b6b4eb5c12094982deb72fb7f80c4c2, data reload: false

------ Round 1 ----------------------------------
q1	17699	5572	4152	4152
q2	2039	153	149	149
q3	10752	1051	903	903
q4	7878	780	724	724
q5	7467	2577	2623	2577
q6	184	125	123	123
q7	1231	830	793	793
q8	9349	2034	2032	2032
q9	7097	6430	6422	6422
q10	8491	3499	3635	3499
q11	426	218	221	218
q12	573	310	305	305
q13	17806	2815	2874	2815
q14	273	252	245	245
q15	505	460	465	460
q16	488	413	389	389
q17	947	597	519	519
q18	7231	6553	6437	6437
q19	2828	1494	1421	1421
q20	566	285	281	281
q21	6221	3591	3656	3591
q22	363	305	307	305
Total cold run time: 110414 ms
Total hot run time: 38360 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4144	4119	4061	4061
q2	316	224	229	224
q3	2928	2826	2842	2826
q4	1893	1581	1602	1581
q5	5214	5246	5275	5246
q6	195	116	118	116
q7	2262	1817	1874	1817
q8	3154	3268	3283	3268
q9	8616	8564	8538	8538
q10	3753	3668	3688	3668
q11	535	455	439	439
q12	718	544	539	539
q13	16915	2870	2857	2857
q14	297	251	242	242
q15	481	452	454	452
q16	450	437	413	413
q17	1734	1463	1487	1463
q18	7531	7266	7167	7167
q19	1617	1542	1497	1497
q20	1897	1702	1704	1702
q21	4705	4746	4628	4628
q22	533	452	463	452
Total cold run time: 69888 ms
Total hot run time: 53196 ms

@gnehil
Contributor Author

gnehil commented Mar 14, 2024

run buildall

@doris-robot

TPC-H: Total hot run time: 38446 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 23b7c0b0ab58d04d4b294206d2b53cf027d4b6fd, data reload: false

------ Round 1 ----------------------------------
q1	17644	4201	4125	4125
q2	2029	150	143	143
q3	10672	1087	903	903
q4	7078	755	731	731
q5	7486	2771	2774	2771
q6	196	123	125	123
q7	1159	835	806	806
q8	9403	2031	1971	1971
q9	7056	6470	6376	6376
q10	8551	3477	3676	3477
q11	423	229	223	223
q12	676	304	296	296
q13	17789	2853	2847	2847
q14	276	250	264	250
q15	496	463	455	455
q16	498	400	398	398
q17	964	519	603	519
q18	7255	6505	6465	6465
q19	1529	1477	1494	1477
q20	540	283	275	275
q21	6358	3515	3561	3515
q22	366	300	307	300
Total cold run time: 108444 ms
Total hot run time: 38446 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4099	4079	4123	4079
q2	320	221	219	219
q3	2982	2824	2849	2824
q4	1892	1568	1596	1568
q5	5203	5244	5248	5244
q6	196	115	125	115
q7	2239	1858	1862	1858
q8	3158	3264	3257	3257
q9	8574	8522	8553	8522
q10	3696	3703	3693	3693
q11	538	435	462	435
q12	708	567	532	532
q13	16910	2843	2884	2843
q14	277	244	254	244
q15	490	446	449	446
q16	482	413	419	413
q17	1729	1479	1464	1464
q18	7485	7298	7038	7038
q19	1631	1473	1548	1473
q20	1901	1722	1715	1715
q21	4887	4704	4769	4704
q22	541	452	467	452
Total cold run time: 69938 ms
Total hot run time: 53138 ms

@gnehil
Contributor Author

gnehil commented Mar 14, 2024

run buildall

@doris-robot

TPC-H: Total hot run time: 38466 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit f3fb59651850ac7accccd2356c4cb5c082afcc87, data reload: false

------ Round 1 ----------------------------------
q1	17684	4889	4156	4156
q2	2027	157	167	157
q3	10692	1077	899	899
q4	6744	732	737	732
q5	7474	2580	2725	2580
q6	182	123	122	122
q7	1180	829	817	817
q8	9346	1984	1997	1984
q9	7157	6457	6450	6450
q10	8488	3476	3606	3476
q11	430	231	222	222
q12	667	310	302	302
q13	17796	2868	2893	2868
q14	273	245	251	245
q15	491	452	466	452
q16	493	405	394	394
q17	943	532	554	532
q18	7217	6616	6509	6509
q19	1559	1403	1388	1388
q20	545	296	279	279
q21	6539	3589	3638	3589
q22	381	313	314	313
Total cold run time: 108308 ms
Total hot run time: 38466 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4201	4105	4148	4105
q2	322	229	231	229
q3	2934	2888	2800	2800
q4	1890	1607	1615	1607
q5	5238	5280	5264	5264
q6	209	114	119	114
q7	2235	1834	1846	1834
q8	3149	3302	3276	3276
q9	8573	8565	8600	8565
q10	3715	3696	3680	3680
q11	536	430	437	430
q12	712	553	571	553
q13	16924	2901	2852	2852
q14	274	246	250	246
q15	471	457	448	448
q16	457	419	399	399
q17	1775	1475	1454	1454
q18	7491	7181	7110	7110
q19	1616	1503	1485	1485
q20	1876	1681	1689	1681
q21	4715	4676	4696	4676
q22	517	433	438	433
Total cold run time: 69830 ms
Total hot run time: 53241 ms

@CalvinKirs
Member

Could you provide more details on the version requirements? Are there any specific requirements for the Spark version on the user service end?

@gnehil
Contributor Author

gnehil commented Mar 18, 2024

run buildall

@gnehil
Contributor Author

gnehil commented Mar 18, 2024

Could you provide more details on the version requirements? Are there any specific requirements for the Spark version on the user service end?

In the previous code, when writing the Parquet file, each Row object was converted to an InternalRow by calling the toRow method of RowEncoder. Spark 2 and Spark 3 implement this part of RowEncoder differently, so Spark Load could only run in a Spark 2 environment.
With this change, a new InternalRow is created by passing the value array of the Row object to the apply method of InternalRow. This method is implemented the same way in Spark 2 and Spark 3, so Spark Load runs normally in both Spark environments.
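
The approach described above can be sketched roughly as follows. This is a minimal illustration in Scala, not the code from this PR: the object name `RowToInternalRowSketch` and the helper `toInternalRow` are hypothetical, and the string conversion step is an assumption added for the example, since `InternalRow.apply` copies values as-is and Catalyst stores strings as `UTF8String`. It assumes `spark-sql` and `spark-catalyst` are on the classpath.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.unsafe.types.UTF8String

// Hypothetical sketch; names are illustrative and not taken from the Doris code base.
object RowToInternalRowSketch {

  // Spark 2.x style, as described in the comment above: RowEncoder(schema).toRow(row).
  // That encoder-based call is the part that differs between Spark 2 and Spark 3.

  // Version-independent style: build the InternalRow from the Row's values.
  // InternalRow.apply(values: Any*) wraps the values without any conversion.
  def toInternalRow(row: Row): InternalRow = {
    // Assumption for this example: map external Strings to Catalyst's UTF8String,
    // because no encoder is involved to do that conversion automatically.
    val values: Seq[Any] = row.toSeq.map {
      case s: String => UTF8String.fromString(s)
      case other     => other
    }
    InternalRow(values: _*)
  }

  def main(args: Array[String]): Unit = {
    val row = Row(1, 2L, "doris")
    val internal = toInternalRow(row)
    // Read the fields back through the InternalRow accessors.
    println(internal.numFields)
    println(internal.getInt(0))
    println(internal.getLong(1))
    println(internal.getUTF8String(2))
  }
}
```

Skipping the encoder removes the dependency on the version-specific encoder API, at the cost of losing the automatic conversion from external types to Catalyst's internal types. Whether the actual change needs a conversion step like the one above depends on how the rows are produced, which this thread does not state.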

@github-actions github-actions bot added the approved label Mar 18, 2024
Contributor

PR approved by at least one committer and no changes requested.

Contributor

PR approved by anyone and no changes requested.

Contributor

@morningman morningman left a comment

LGTM

@CalvinKirs
Member

Please try to add regression tests for Spark 3; this can be done in the next PR.

@CalvinKirs CalvinKirs merged commit 06801d5 into apache:master Mar 18, 2024
25 of 29 checks passed
gnehil added a commit to gnehil/doris that referenced this pull request Apr 1, 2024
Labels: approved, dev/2.0.x, dev/2.1.1-merged, reviewed
5 participants