[WIP][SPARK-37487][SQL][CORE] Avoid performing CollectMetrics twice if the operation is followed by global sort. #34765
sarutak wants to merge 1 commit into apache:master from
Conversation
Does this bug also impact the metrics reported by other nodes? For example
Seems so. We need a more comprehensive solution...
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #145798 has finished for PR 34765 at commit
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
This PR fixes an issue where CollectMetrics is performed twice if it is followed by a global sort.
In the PR's example, the statistics that CollectMetrics is expected to calculate are [0,99,4950,50], but the actual result is [0,99,9900,100].
The reason is that the sampling jobs referenced below can run before the global sort, and each of them executes CollectMetrics an extra time.
spark/core/src/main/scala/org/apache/spark/Partitioner.scala, Line 171 in e7fa289
spark/core/src/main/scala/org/apache/spark/Partitioner.scala, Line 195 in e7fa289
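The arithmetic behind the two result vectors can be reproduced without Spark. The following is a minimal plain-Scala sketch, not the PR's code: it assumes the four statistics are min, max, sum, and a count of even values over the ids 0 to 99 (the fourth metric is a guess that happens to match the reported 50 and 100), and every name in it is hypothetical.

```scala
// Plain-Scala simulation of the double counting; no Spark involved.
// All names are hypothetical and the metric definitions are assumptions.
object DoubleCountDemo {

  // Stand-in for the aggregation state kept by a metrics-collecting node.
  final class MetricsAccumulator {
    private var min = Long.MaxValue
    private var max = Long.MinValue
    private var sum = 0L
    private var evenCount = 0L

    def update(v: Long): Unit = {
      min = math.min(min, v)
      max = math.max(max, v)
      sum += v
      if (v % 2 == 0) evenCount += 1
    }

    def result: Seq[Long] = Seq(min, max, sum, evenCount)
  }

  // One job that pulls every input row through the metrics node.
  def runJob(rows: Range, acc: MetricsAccumulator): Unit =
    rows.foreach(v => acc.update(v.toLong))

  // Run `passes` jobs against the same accumulator and return the metrics.
  def collect(passes: Int): Seq[Long] = {
    val acc = new MetricsAccumulator
    (1 to passes).foreach(_ => runJob(0 until 100, acc))
    acc.result
  }

  def main(args: Array[String]): Unit = {
    println(collect(passes = 1)) // sort job only: List(0, 99, 4950, 50)
    println(collect(passes = 2)) // sampling job + sort job: List(0, 99, 9900, 100)
  }
}
```

Min and max are unaffected by the second pass, while the additive metrics double, which matches the pattern in the reported result.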
The solution this PR proposes is to introduce a property, spark.job.isSamplingJob, which is intended to be get/set only internally.
Before the sampling jobs run, Spark sets the property, and it resets the property after the jobs finish.
CollectMetrics can then determine whether or not a task belongs to a sampling job.
Why are the changes needed?
Bug fix.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
New test.
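As a rough illustration of the set/reset mechanism described under "What changes were proposed" and of the kind of assertion a regression test might make, here is a plain-Scala sketch. Only the property name spark.job.isSamplingJob comes from the PR description; the java.util.Properties stand-in, the helper names, and the choice to reset by removing the key are all assumptions, and this is not the PR's actual implementation or test.

```scala
import java.util.Properties

// Plain-Scala sketch of a "sampling job" flag; no Spark involved.
// Everything except the property name is hypothetical.
object SamplingFlagDemo {
  val SamplingKey = "spark.job.isSamplingJob"

  // Stand-in for job-level properties that tasks are able to read.
  val jobProps = new Properties()

  private var sum = 0L

  // Stand-in for the metrics node: skips updates while the flag is set.
  def metricsNode(rows: Range): Unit = {
    val isSampling = jobProps.getProperty(SamplingKey, "false").toBoolean
    rows.foreach { v => if (!isSampling) sum += v }
  }

  // Set the flag around `body` and always reset it afterwards.
  def withSamplingFlag[T](body: => T): T = {
    jobProps.setProperty(SamplingKey, "true")
    try body finally jobProps.remove(SamplingKey)
  }

  // A global sort modeled as a sampling pass followed by the real pass.
  def sortWithSampling(rows: Range): Long = {
    sum = 0L
    withSamplingFlag { metricsNode(rows) } // sampling job: metrics skipped
    metricsNode(rows)                      // actual job: metrics collected
    sum
  }

  def main(args: Array[String]): Unit = {
    println(sortWithSampling(0 until 100)) // 4950
  }
}
```

The try/finally keeps the flag from leaking into later jobs if sampling throws. In Spark itself, job-local properties are available through SparkContext.setLocalProperty and TaskContext.getLocalProperty; whether the PR relies on them is not stated in this description.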