[SPARK-48037][CORE][3.4] Fix SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data #46464
Conversation
… metrics resulting in potentially inaccurate data

### What changes were proposed in this pull request?

This PR aims to fix SortShuffleWriter lacking shuffle write related metrics, which can result in inaccurate data.

### Why are the changes needed?

When the shuffle writer is SortShuffleWriter, it does not use SQLShuffleWriteMetricsReporter to update metrics, so the runtime statistics AQE obtains report a rowCount of 0. Some optimization rules rely on rowCount statistics, such as `EliminateLimits`. Because rowCount is 0, the rule removes the limit operator, and the query returns results without the limit applied.

https://github.com/apache/spark/blob/59d5946cfd377e9203ccf572deb34f87fab7510c/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala#L168-L172
https://github.com/apache/spark/blob/59d5946cfd377e9203ccf572deb34f87fab7510c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L2067-L2070

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

Production environment verification.

**master metrics**
<img width="296" alt="image" src="https://github.com/apache/spark/assets/3898450/dc9b6e8a-93ec-4f59-a903-71aa5b11962c">

**PR metrics**
<img width="276" alt="image" src="https://github.com/apache/spark/assets/3898450/2d73b773-2dcc-4d23-81de-25dcadac86c1">

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#46273 from cxzl25/SPARK-48037.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit e24f896)
…AndVerifyResult` to skip check results

### What changes were proposed in this pull request?

This PR aims to support AdaptiveQueryExecSuite skipping result checks.

### Why are the changes needed?

apache#46273 (comment)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#46316 from cxzl25/SPARK-48070.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 35767bb)
+1, LGTM. Thank you, @cxzl25 .
```diff
@@ -644,6 +644,7 @@ jobs:
     python3.9 -m pip install 'sphinx<3.1.0' mkdocs pydata_sphinx_theme 'sphinx-copybutton==0.5.2' nbsphinx numpydoc 'jinja2<3.0.0' 'markupsafe==2.0.1' 'pyzmq<24.0.0' 'sphinxcontrib-applehelp==1.0.4' 'sphinxcontrib-devhelp==1.0.2' 'sphinxcontrib-htmlhelp==2.0.1' 'sphinxcontrib-qthelp==1.0.3' 'sphinxcontrib-serializinghtml==1.1.5' 'nest-asyncio==1.5.8' 'rpds-py==0.16.2' 'alabaster==0.7.13'
     python3.9 -m pip install ipython_genutils # See SPARK-38517
     python3.9 -m pip install sphinx_plotly_directive 'numpy>=1.20.0' 'pyarrow==12.0.1' pandas 'plotly>=4.8'
+    python3.9 -m pip install 'nbsphinx==0.9.3'
```
https://github.com/cxzl25/spark/actions/runs/8997219681/job/24725778387#step:24:6423
Exception occurred:

```
  File "/usr/local/lib/python3.9/dist-packages/nbsphinx/__init__.py", line 1316, in apply
    for section in self.document.findall(docutils.nodes.section):
AttributeError: 'document' object has no attribute 'findall'
```
The failing CI run used nbsphinx 0.9.4, which requires docutils >= 0.18.1.
https://github.com/spatialaudio/nbsphinx/releases/tag/0.9.4
Releases:
- 0.9.4 (May 7, 2024)
- 0.9.3 (Aug 27, 2023)
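The error above stems from an API difference in docutils: `findall` was introduced in docutils 0.18, while older versions only expose `traverse`, so nbsphinx 0.9.4 crashes against an older docutils. A minimal, self-contained sketch of the failure mode and a common compatibility shim (the `LegacyDocument` class is a stand-in for illustration, not the real docutils API):

```python
class LegacyDocument:
    """Stand-in for a docutils < 0.18 document: it has traverse(), not findall()."""

    def traverse(self, condition=None):
        return ["section-a", "section-b"]


doc = LegacyDocument()

# nbsphinx 0.9.4 calls document.findall(...), which raises AttributeError
# on older docutils, matching the traceback shown in the CI log.
try:
    doc.findall(None)
except AttributeError as exc:
    print(exc)

# A common shim falls back to traverse() when findall() is missing:
findall = getattr(doc, "findall", doc.traverse)
print(findall(None))  # ['section-a', 'section-b']
```

Pinning `nbsphinx==0.9.3` (as the diff above does) sidesteps the incompatibility without touching the docutils version.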
Do we need to pin the nbsphinx version in the master branch as well? Similar to SPARK-39421. cc @HyukjinKwon
Oops. I missed this line. My bad. We should proceed with this separately because this is a follow-up, @cxzl25.
Thanks @dongjoon-hyun! I didn’t know why CI failed on the 3.4 branch at first, so I tested it in my own way.
Do we need to backport SPARK-48179 to branch 3.4?
> Do we need to backport SPARK-48179 to branch 3.4?
I believe it's too late and would be redundant.
…lated metrics resulting in potentially inaccurate data

### What changes were proposed in this pull request?

This PR aims to fix SortShuffleWriter lacking shuffle write related metrics, which can result in inaccurate data.

### Why are the changes needed?

When the shuffle writer is SortShuffleWriter, it does not use SQLShuffleWriteMetricsReporter to update metrics, so the runtime statistics AQE obtains report a rowCount of 0. Some optimization rules rely on rowCount statistics, such as `EliminateLimits`. Because rowCount is 0, the rule removes the limit operator, and the query returns results without the limit applied.

https://github.com/apache/spark/blob/59d5946cfd377e9203ccf572deb34f87fab7510c/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala#L168-L172
https://github.com/apache/spark/blob/59d5946cfd377e9203ccf572deb34f87fab7510c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L2067-L2070

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

Production environment verification.

**master metrics**
<img width="296" alt="image" src="https://github.com/apache/spark/assets/3898450/dc9b6e8a-93ec-4f59-a903-71aa5b11962c">

**PR metrics**
<img width="276" alt="image" src="https://github.com/apache/spark/assets/3898450/2d73b773-2dcc-4d23-81de-25dcadac86c1">

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46464 from cxzl25/SPARK-48037-3.4.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Merged to branch-3.4.
What changes were proposed in this pull request?
This PR aims to fix SortShuffleWriter lacking shuffle write related metrics, which can result in inaccurate data.
Why are the changes needed?
When the shuffle writer is SortShuffleWriter, it does not use SQLShuffleWriteMetricsReporter to update metrics, which causes AQE to obtain runtime statistics and the rowCount obtained is 0.
Some optimization rules rely on rowCount statistics, such as `EliminateLimits`. Because rowCount is 0, the rule removes the limit operator, and the query returns results without the limit applied.

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala (lines 168 to 172 at 59d5946)
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (lines 2067 to 2070 at 59d5946)
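To make the failure mode concrete, here is a small Python sketch (not Spark source; the function and names are illustrative) of how an `EliminateLimits`-style rule behaves when the reported rowCount collapses to 0:

```python
def eliminate_limit(limit, row_count_stat):
    """Return None (limit removed) when statistics claim the child already
    produces at most `limit` rows; otherwise keep the limit operator."""
    if row_count_stat is not None and row_count_stat <= limit:
        return None  # limit considered redundant and dropped from the plan
    return limit


# With correct shuffle-write metrics, the limit is kept:
print(eliminate_limit(limit=10, row_count_stat=1_000_000))  # 10

# With SortShuffleWriter's missing metrics, rowCount is reported as 0,
# so the rule wrongly drops the limit and the query returns all rows:
print(eliminate_limit(limit=10, row_count_stat=0))  # None
```

This is why wiring SQLShuffleWriteMetricsReporter into SortShuffleWriter matters: the rule itself is sound, but it is only as good as the statistics it is fed.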
Does this PR introduce any user-facing change?
Yes
How was this patch tested?
Production environment verification.
master metrics
PR metrics
Was this patch authored or co-authored using generative AI tooling?
No