Skip to content

[GLUTEN-4480][CH] Decouple LocalFiles from plan to improve driver generating substrait plan#4481

Merged
baibaichen merged 1 commit into
apache:mainfrom
exmy:gluten-4480
Jan 29, 2024
Merged

[GLUTEN-4480][CH] Decouple LocalFiles from plan to improve driver generating substrait plan#4481
baibaichen merged 1 commit into
apache:mainfrom
exmy:gluten-4480

Conversation

@exmy
Copy link
Copy Markdown
Contributor

@exmy exmy commented Jan 22, 2024

What changes were proposed in this pull request?

Applied for CH backend following velox does #4177

The time taken to generate the substrait plan has decreased from 249938ms to 4486ms for 163180 partitions after this pach.

(Fixes: #4480)

How was this patch tested?

Pass CI

@github-actions
Copy link
Copy Markdown

#4480

@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI

@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI

@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI

Comment thread cpp-ch/local-engine/Storages/SubstraitSource/SubstraitFileSource.cpp Outdated
@zhztheplayer
Copy link
Copy Markdown
Member

Thank you for working on this @exmy.

I noticed the PR doesn't make changes to the following two parts of code yet:

  1. https://github.com/oap-project/gluten/blob/19a863c0d3781034b4def57be899e5bd4a8825b6/backends-clickhouse/src/main/scala/io/glutenproject/backendsapi/clickhouse/CHIteratorApi.scala#L247-L280

  2. https://github.com/oap-project/gluten/blob/19a863c0d3781034b4def57be899e5bd4a8825b6/backends-clickhouse/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ClickHouseAppendDataExec.scala#L234-L252

Do you have a plan to make cleanup to the code as well? Especially 2 since some of the code seems to be obsolete so can be removed probably. I am no sure so would you like to help doing a check? Thanks.

After that, we may be able to remove a series of APIs on splitInfos from SubstraitContext to get a cleaner design.

@exmy exmy marked this pull request as draft January 23, 2024 01:43
@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI

@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI

@exmy
Copy link
Copy Markdown
Contributor Author

exmy commented Jan 23, 2024

Do you have a plan to make cleanup to the code as well? Especially 2 since some of the code seems to be obsolete so can be removed probably. I am no sure so would you like to help doing a check? Thanks.

After that, we may be able to remove a series of APIs on splitInfos from SubstraitContext to get a cleaner design.

@zzcclp Could you help check if we need cleanup the two parts of code?

@zzcclp
Copy link
Copy Markdown
Contributor

zzcclp commented Jan 23, 2024

  1. genNativeFileScanRDD

change the code in the part of the genNativeFileScanRDD , and the ClickHouseAppendDataExec.scala is already removed in other pr #4457

@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI

@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI

2 similar comments
@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI

@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI

@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI

@exmy
Copy link
Copy Markdown
Contributor Author

exmy commented Jan 25, 2024

@zhztheplayer @zzcclp Have cleaned up the code in the part of the genNativeFileScanRDD except ClickHouseAppendDataExec. I'd like to remove the related APIs on splitInfos from SubstraitContext in the following pr.

@zzcclp
Copy link
Copy Markdown
Contributor

zzcclp commented Jan 25, 2024

@exmy PR #4457 has been merged, please rebase.
BTW, there are some changes for LocalFileNode when query data from mergetree data

@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI

@exmy exmy marked this pull request as ready for review January 25, 2024 23:56
@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI

@zhztheplayer
Copy link
Copy Markdown
Member

LGTM for changes to common module

@zzcclp If you want to review CH changes, thanks.

// TODO: We still maintain the old logic of parsing LocalFiles or ExtensionTable in RealRel
// to be compatiable with some suites about metrics.
// Remove this compatiability in later and then only java iter has local files in ReadRel.
if (read.has_local_files() || (!read.has_extension_table() && !isReadFromMergeTree(read)))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't add this parameter isMergeTree, can ignore the suites about metrics first and raist a issue to fix.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the paramter isMergeTree is confused.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't add this parameter isMergeTree, can ignore the suites about metrics first and raist a issue to fix.

We need a way to identify whether to parse according to LocalFiles or ExtensionTable here. If we don't have isMergeTree parameter, are there alternative ways to achieve the same thing?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

read.has_extension_table() is for mergetree, and read.has_local_files() is for other file format, parquet or orc don't need the extension table part.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReadRel no longer includes LocalFiles or ExtensionTable since they have been decoupled from RealRel which is exactly the goal this pr want to achieve, right?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got, can we judge the split_info is localfiles or extendsiontable according to the split_info string?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried before, but unfortunately, there wasn't a suitable way to judge it solely based on split_info string information.

Copy link
Copy Markdown
Contributor

@liuneng1994 liuneng1994 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI

@baibaichen baibaichen merged commit 7763a0a into apache:main Jan 29, 2024
@GlutenPerfBot
Copy link
Copy Markdown
Contributor

===== Performance report for TPCH SF2000 with Velox backend, for reference only ====

query log/native_4481_time.csv log/native_master_01_28_2024_d041ce584_time.csv difference percentage
q1 32.47 32.99 0.519 101.60%
q2 25.53 23.85 -1.682 93.41%
q3 36.14 37.74 1.591 104.40%
q4 38.06 38.98 0.921 102.42%
q5 68.29 67.95 -0.340 99.50%
q6 7.11 7.02 -0.086 98.78%
q7 82.83 80.87 -1.962 97.63%
q8 85.66 84.35 -1.310 98.47%
q9 119.32 121.55 2.230 101.87%
q10 41.16 43.06 1.898 104.61%
q11 21.35 20.27 -1.080 94.94%
q12 28.11 27.38 -0.726 97.42%
q13 45.15 45.45 0.300 100.66%
q14 20.61 22.94 2.333 111.32%
q15 27.72 28.71 0.995 103.59%
q16 14.68 14.08 -0.594 95.95%
q17 98.46 102.67 4.218 104.28%
q18 146.51 148.89 2.386 101.63%
q19 13.58 12.47 -1.110 91.82%
q20 26.28 26.11 -0.171 99.35%
q21 224.09 226.03 1.942 100.87%
q22 13.38 13.55 0.163 101.22%
total 1216.50 1226.93 10.433 100.86%

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CH] Decouple LocalFiles from plan to improve driver generating substrait plan

8 participants