[GLUTEN-4170][VL] Decouple partitions from plan to avoid driver stalled by Yohahaha · Pull Request #4177 · apache/gluten

Yohahaha · 2023-12-25T06:16:00Z

What changes were proposed in this pull request?

Two steps can stall the driver when a scan contains a large number of partitions:

1. Plan serialization, which happens in every GlutenPartition construction.
2. GlutenWholeStageColumnarRDD#getPartitions.

state	step	22374 partitions	44611 partitions
before	#1	880 ms	1352 ms
before	#2	3662 ms	17186 ms
after	#1	21 ms	106 ms
after	#2	6 ms	25 ms

This patch decouples the scan splitInfo (LocalFilesNode) from ReadRel so that the driver no longer serializes the Substrait plan for each partition. When the plan is complex or the number of partitions is particularly large, the cost of this serialization cannot be ignored.

The stream splitInfo (inputIterator) is still kept in ReadRel for now.
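To make the cost concrete, here is a minimal C++ sketch of the idea. It is not Gluten's actual code: `Plan`, `SplitInfo`, `Stage`, `serializePlan`, `serializeSplit`, `buildCoupled`, and `buildDecoupled` are hypothetical names introduced for illustration, and plain strings stand in for serialized protobuf messages. It contrasts a layout where the split info lives inside the plan (one full plan serialization per partition) with one possible decoupled layout where the plan is serialized once and each partition carries only its own split info.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for a scan split and a Substrait plan (not Gluten types).
struct SplitInfo {
  std::string path;
};

struct Plan {
  std::string body;  // everything in the plan except the split info
};

// Counts how often the (potentially expensive) plan body is serialized.
static std::size_t planSerializations = 0;

// Pretend serialization: in a real system the cost grows with plan complexity.
std::string serializePlan(const Plan& plan) {
  ++planSerializations;
  return plan.body;
}

std::string serializeSplit(const SplitInfo& split) {
  return split.path;
}

// Coupled layout: the split is embedded in the plan, so every partition
// serializes the whole plan again. Returns one blob per partition.
std::vector<std::string> buildCoupled(const Plan& plan, const std::vector<SplitInfo>& splits) {
  std::vector<std::string> blobs;
  blobs.reserve(splits.size());
  for (const auto& split : splits) {
    blobs.push_back(serializePlan(plan) + "|" + serializeSplit(split));
  }
  return blobs;
}

// Decoupled layout: the plan is serialized once for the stage, and each
// partition only carries its own serialized split info.
struct Stage {
  std::string planBytes;
  std::vector<std::string> splitBytes;
};

Stage buildDecoupled(const Plan& plan, const std::vector<SplitInfo>& splits) {
  Stage stage;
  stage.planBytes = serializePlan(plan);
  stage.splitBytes.reserve(splits.size());
  for (const auto& split : splits) {
    stage.splitBytes.push_back(serializeSplit(split));
  }
  return stage;
}
```

With N partitions, `buildCoupled` calls `serializePlan` N times while `buildDecoupled` calls it once, which is the kind of difference the timings above reflect when N is in the tens of thousands.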

(Fixes: #4170)

How was this patch tested?

github-actions · 2023-12-25T06:16:26Z

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/oap-project/gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

Other pull requests

github-actions · 2023-12-25T06:16:35Z

Run Gluten Clickhouse CI

github-actions · 2023-12-25T06:16:43Z

#4170

github-actions · 2023-12-25T07:26:45Z

Run Gluten Clickhouse CI

Yohahaha · 2023-12-25T09:30:47Z

@FelixYBW @rui-mo @philo-he @ulysses-you
Let's discuss the solution based on the current patch. The key is to decouple the scan partitions from the plan during serialization. I know this patch may seem tricky, and I'm open to better suggestions.

ulysses-you

Thank you @Yohahaha for the idea; it looks reasonable to me. One concern is the naming: can we avoid using streamXxx? I think iteratorXxx is easier to follow, since we use iterator on both the Java and native sides.

github-actions · 2023-12-27T08:24:30Z

Run Gluten Clickhouse CI

rui-mo

Thanks for the improvement.

rui-mo · 2024-01-02T05:48:30Z

-     /// File schema
-     NamedStruct schema = 17;
+      /// File schema
+      NamedStruct schema = 17;


It seems these changes are not needed.

They are needed; there should be 6 leading spaces.

Is it about the format? How did you run into these errors?

I just noticed this format issue and fixed it.

philo-he · 2024-01-03T03:26:33Z

+      case (splitInfos, index) =>
+        wsCtx.substraitContext.initSplitInfosIndex(0)
+        wsCtx.substraitContext.setSplitInfos(splitInfos)
+        val substraitPlan = wsCtx.root.toProtobuf


I'm assuming the proposed optimization can also be applied to the CH backend. If so, it will need some follow-up work from a CH engineer. @baibaichen

Yohahaha · 2024-01-04T01:47:22Z

I assume this solution looks good to all of you.

zhouyuan · 2024-01-04T03:36:53Z

@lgbo-ustc Hi, this patch refactors how file partitions are generated. Please take a look and check whether it will impact the CK backend.

thanks, -yuan

github-actions · 2024-01-04T04:03:56Z

Run Gluten Clickhouse CI

github-actions · 2024-01-04T04:07:33Z

Run Gluten Clickhouse CI

github-actions · 2024-01-04T04:13:49Z

Run Gluten Clickhouse CI

github-actions · 2024-01-04T04:20:36Z

Run Gluten Clickhouse CI

Yohahaha · 2024-01-04T07:24:33Z

I see that QueryBenchmark is similar to GenericBenchmark, and QueryBenchmark is not covered by CI. Could we remove it? @marin-ma @jinchengchenghh

marin-ma · 2024-01-04T07:28:36Z

> I see that QueryBenchmark is similar to GenericBenchmark, and QueryBenchmark is not covered by CI. Could we remove it? @marin-ma @jinchengchenghh

cc: @rui-mo

rui-mo · 2024-01-04T07:47:08Z

> I see that QueryBenchmark is similar to GenericBenchmark, and QueryBenchmark is not covered by CI. Could we remove it?

@Yohahaha GenericBenchmark uses Arrow to read files, while QueryBenchmark uses Velox. So QueryBenchmark is useful when we want to test Velox TableScan. I think the better option is to enable QueryBenchmark in CI. @marin-ma Please help confirm, thanks.

marin-ma · 2024-01-04T08:06:59Z

> > I see that QueryBenchmark is similar to GenericBenchmark, and QueryBenchmark is not covered by CI. Could we remove it?
>
> @Yohahaha GenericBenchmark uses Arrow to read files, while QueryBenchmark uses Velox. So QueryBenchmark is useful when we want to test Velox TableScan. I think the better option is to enable QueryBenchmark in CI. @marin-ma Please help confirm, thanks.

@rui-mo If the input comes from a middle stage, GenericBenchmark uses the Arrow reader to load the input iterator. If the input comes from the first stage, the whole pipeline is offloaded, including the table scan. Here's the doc: https://github.com/oap-project/gluten/blob/main/docs/developers/MicroBenchmarks.md#generate-substrait-plan-and-input-for-any-query

rui-mo · 2024-01-04T08:14:46Z

@marin-ma Thanks for confirming. @Yohahaha We can remove QueryBenchmark because its functionality is covered by GenericBenchmark.

github-actions · 2024-01-17T07:12:47Z

Run Gluten Clickhouse CI

zhztheplayer

Thanks for working on this!

github-actions · 2024-01-17T07:55:15Z

Run Gluten Clickhouse CI

github-actions · 2024-01-17T09:43:23Z

Run Gluten Clickhouse CI

github-actions · 2024-01-17T09:45:43Z

Run Gluten Clickhouse CI

ulysses-you

lgtm, thank you @Yohahaha

marin-ma

Thank you for this work! Could you please also update the micro benchmark documentation? I noticed that we now need to specify split files for first stages.

github-actions · 2024-01-18T02:29:59Z

Run Gluten Clickhouse CI

marin-ma · 2024-01-18T02:32:26Z

LGTM. Thanks!

@zzcclp Do you have any further comments?

ulysses-you · 2024-01-18T09:39:08Z

We can create follow-ups if any new issues are found.

GlutenPerfBot · 2024-01-18T11:30:02Z

===== Performance report for TPCH SF2000 with Velox backend, for reference only ====

query	log/native_4177_time.csv	log/native_master_01_17_2024_6e070aee2_time.csv	difference	percentage
q1	32.83	32.53	-0.299	99.09%
q2	23.86	25.15	1.288	105.40%
q3	37.16	35.63	-1.529	95.89%
q4	38.23	39.47	1.240	103.24%
q5	69.55	69.91	0.365	100.52%
q6	6.74	7.16	0.425	106.31%
q7	80.76	83.43	2.667	103.30%
q8	84.40	86.98	2.576	103.05%
q9	118.49	125.57	7.081	105.98%
q10	40.96	42.09	1.132	102.76%
q11	19.53	20.23	0.697	103.57%
q12	25.49	27.47	1.981	107.77%
q13	44.31	44.85	0.538	101.21%
q14	20.65	17.86	-2.795	86.47%
q15	26.55	29.77	3.213	112.10%
q16	12.55	13.95	1.398	111.13%
q17	101.78	100.92	-0.855	99.16%
q18	147.60	146.37	-1.232	99.17%
q19	12.47	13.91	1.442	111.56%
q20	28.27	26.50	-1.771	93.74%
q21	222.72	226.25	3.531	101.59%
q22	13.74	13.71	-0.030	99.78%
total	1208.65	1229.72	21.063	101.74%

jinchengchenghh · 2025-11-04T18:20:52Z

+  // Get the input schema of this iterator.
+  uint64_t colNum = 0;
+  std::vector<TypePtr> veloxTypeList;
+  if (readRel.has_base_schema()) {


I see that you change the input iterator names here. The comment says that ValueStreamNode in Velox does not support name changes, yet this code updates the output type. Did you run into any problem? @Yohahaha
https://github.com/apache/incubator-gluten/blob/c5b7e59335201f960cda49dff7edc315b36ed05e/cpp/velox/operators/plannodes/RowVectorStream.h#L100

Also, this update only takes effect on the top-level columns. What if there is a nested column?

I know this is used for constructing the ValueStreamNode outputType; it may also need to update nested columns.
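The top-level versus nested distinction raised here can be illustrated with a small C++ sketch. It is only an illustration under stated assumptions, not the code in SubstraitToVeloxPlan.cc: `Column`, `renameTopLevel`, and `renameDepthFirst` are hypothetical, and the sketch assumes the new names arrive as one flat list in depth-first order, which is the convention Substrait documents for the names of a NamedStruct.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical column tree; this is not Velox's RowType/TypePtr.
struct Column {
  std::string name;
  std::vector<Column> children;  // empty for a scalar column
};

// Renames only the top-level columns; names of nested fields are left untouched.
void renameTopLevel(std::vector<Column>& columns, const std::vector<std::string>& names) {
  if (names.size() != columns.size()) {
    throw std::invalid_argument("expected one name per top-level column");
  }
  for (std::size_t i = 0; i < columns.size(); ++i) {
    columns[i].name = names[i];
  }
}

// Renames every column in depth-first order, consuming one name per column
// at every nesting level. `next` is the index of the next unused name.
void renameDepthFirst(
    std::vector<Column>& columns,
    const std::vector<std::string>& names,
    std::size_t& next) {
  for (auto& column : columns) {
    if (next >= names.size()) {
      throw std::invalid_argument("too few names for the column tree");
    }
    column.name = names[next++];
    renameDepthFirst(column.children, names, next);
  }
}
```

For a schema with a scalar column and a two-field struct column, `renameTopLevel` needs two names and leaves the struct's fields as they were, while `renameDepthFirst` needs four names and rewrites the struct's fields as well.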

Yohahaha changed the title from "[GLUTEN-][VL] Decouple partitions from plan to avoid driver hang" to "[GLUTEN-4170][VL] Decouple partitions from plan to avoid driver hang" on Dec 25, 2023

Yohahaha changed the title from "[GLUTEN-4170][VL] Decouple partitions from plan to avoid driver hang" to "[GLUTEN-4170][VL] Decouple partitions from plan to avoid driver stalled" on Dec 25, 2023

ulysses-you reviewed Dec 27, 2023


Comment thread gluten-core/src/main/resources/substrait/proto/substrait/algebra.proto Outdated

Comment thread gluten-core/src/main/java/io/glutenproject/substrait/rel/LocalFilesNode.java Outdated

Comment thread cpp/velox/substrait/SubstraitToVeloxPlan.h Outdated

Yohahaha force-pushed the decouple-partition branch from 777373d to 996bccc Compare December 27, 2023 08:23

rui-mo requested a review from zzcclp January 2, 2024 05:41

rui-mo reviewed Jan 2, 2024


ulysses-you reviewed Jan 2, 2024


Comment thread cpp/velox/substrait/SubstraitToVeloxPlan.cc

Comment thread backends-velox/src/main/scala/io/glutenproject/backendsapi/velox/IteratorApiImpl.scala Outdated

Comment thread cpp/velox/substrait/SubstraitToVeloxPlan.cc Outdated

rui-mo reviewed Jan 2, 2024


Comment thread cpp/velox/substrait/SubstraitToVeloxPlan.cc

FelixYBW mentioned this pull request Jan 2, 2024

[VL] driver stalled before first job starts #4170

Closed

philo-he reviewed Jan 3, 2024


Yohahaha force-pushed the decouple-partition branch from 996bccc to 422a220 Compare January 4, 2024 04:03

rui-mo reviewed Jan 4, 2024


Comment thread cpp/velox/benchmarks/GenericBenchmark.cc Outdated

Comment thread cpp/velox/compute/VeloxPlanConverter.h Outdated

Comment thread cpp/velox/substrait/SubstraitToVeloxPlan.h Outdated

Yohahaha force-pushed the decouple-partition branch from 375baed to c9fd781 Compare January 4, 2024 08:14

Yohahaha added 5 commits January 17, 2024 15:11

rebase and fix

a031781

fix compile and ut.

626d149

fix compile and ut.

c091242

fix ut

f1e6db9

fix comments

95c301b

Yohahaha force-pushed the decouple-partition branch from 3d1e7d0 to 95c301b Compare January 17, 2024 07:12

zhztheplayer reviewed Jan 17, 2024


Comment thread cpp/core/jni/JniWrapper.cc Outdated

Comment thread cpp/core/jni/JniWrapper.cc Outdated

Comment thread cpp/velox/compute/VeloxPlanConverter.cc Outdated

Comment thread cpp/velox/substrait/SubstraitToVeloxPlan.cc Outdated

fix comments

ebac5aa

fix comments

c899630

fix format

9e56777

ulysses-you previously approved these changes Jan 18, 2024


marin-ma requested changes Jan 18, 2024


update docs

4894e7e

Yohahaha dismissed ulysses-you’s stale review via 4894e7e January 18, 2024 02:29

marin-ma approved these changes Jan 18, 2024


ulysses-you approved these changes Jan 18, 2024


ulysses-you merged commit 2fc4503 into apache:main Jan 18, 2024

Yohahaha deleted the decouple-partition branch January 18, 2024 10:19

This was referenced Jan 22, 2024

[CH] Decouple LocalFiles from plan to improve driver generating substrait plan #4480

Closed

[GLUTEN-4480][CH] Decouple LocalFiles from plan to improve driver generating substrait plan #4481

Merged

marin-ma mentioned this pull request Jan 22, 2024

[VL] Update generic benchmark usage and doc #4485

Closed

jinchengchenghh reviewed Nov 4, 2025

