[SPARK-32626][CORE] Do not increase the input metrics when reading an RDD from cache #29441
Udbhav30 wants to merge 1 commit into apache:master
Conversation
ok to test
Test build #127480 has finished for PR 29441 at commit
val existingMetrics = context.taskMetrics().inputMetrics
existingMetrics.incBytesRead(blockResult.bytes)
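For context, the two lines above sit in the cache-hit branch of RDD.getOrCompute. The following is a paraphrased sketch of that surrounding logic (names such as readCachedBlock come from core/src/main/scala/org/apache/spark/rdd/RDD.scala; the exact shape may vary between Spark versions):

// Paraphrased sketch of RDD.getOrCompute's cache-hit path, not the verbatim source.
SparkEnv.get.blockManager.getOrElseUpdate(blockId, storageLevel, elementClassTag, () => {
  readCachedBlock = false // block was not cached; it is being computed now
  computeOrReadCheckpoint(partition, context)
}) match {
  case Left(blockResult) =>
    if (readCachedBlock) {
      // The lines under discussion: a cache hit still counts toward task input metrics.
      val existingMetrics = context.taskMetrics().inputMetrics
      existingMetrics.incBytesRead(blockResult.bytes)
      new InterruptibleIterator[T](context, blockResult.data.asInstanceOf[Iterator[T]]) {
        override def next(): T = {
          existingMetrics.incRecordsRead(1)
          delegate.next()
        }
      }
    } else {
      new InterruptibleIterator(context, blockResult.data.asInstanceOf[Iterator[T]])
    }
  case Right(iter) =>
    // Another attempt computed the block first; just consume its iterator.
    new InterruptibleIterator(context, iter.asInstanceOf[Iterator[T]])
}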
Why? Based on the TaskMetrics.inputMetrics documentation, it is:
Metrics related to reading data from a [[org.apache.spark.rdd.HadoopRDD]] or from persisted data, defined only in tasks with input.
Even if the data is cached, don't we need to include its size in the input size of the task?
The input size of the task is increased when the RDD is first computed, so I think it is not correct to increase the metrics again when it is read from cache. Correct me if I am wrong.
From the doc, I think inputMetrics counts all input the task processes, not just input read from disk. Although you don't need to read from disk when the data is cached, a cached read still adds to the amount of input the task processes.
I agree with @viirya: as currently defined, it includes all reads. Note that reading a cached RDD can still involve network or disk reads (for example, when the block is persisted at a disk-backed storage level or fetched from a remote executor).
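To make the agreed-upon semantics concrete, here is a minimal standalone sketch (not part of the PR) that observes bytesRead through a SparkListener. The file path and app name are placeholders; under the current behavior both passes report non-zero input, the second coming from the cached block:

import java.util.concurrent.atomic.LongAdder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

object CachedInputMetricsDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("cached-input-metrics"))
    val bytesRead = new LongAdder

    // Accumulate bytesRead across all finished tasks.
    sc.addSparkListener(new SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        if (taskEnd.taskMetrics != null) {
          bytesRead.add(taskEnd.taskMetrics.inputMetrics.bytesRead)
        }
      }
    })

    val rdd = sc.textFile("README.md").cache() // placeholder input file
    rdd.count() // first pass: reads the file and populates the cache
    Thread.sleep(1000) // listener delivery is asynchronous; crude wait for a sketch
    val firstPass = bytesRead.sumThenReset()

    rdd.count() // second pass: served from the block manager cache
    Thread.sleep(1000)
    val cachedPass = bytesRead.sumThenReset()

    println(s"bytesRead first pass = $firstPass, cached pass = $cachedPass")
    sc.stop()
  }
}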
Thanks for the information @viirya. I'll close the PR.
What changes were proposed in this pull request?
Do not increase the input metrics when reading an RDD from the cache.
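The PR diff itself is not reproduced on this page; assuming the change simply drops the metric update in the cache-hit branch of RDD.getOrCompute, it would look roughly like this sketch:

// Hypothetical shape of the proposed change (assumption; the actual diff is not shown above).
if (readCachedBlock) {
  // No incBytesRead / incRecordsRead here: the bytes were already counted
  // as input when the partition was first computed.
  new InterruptibleIterator[T](context, blockResult.data.asInstanceOf[Iterator[T]])
} else {
  new InterruptibleIterator(context, blockResult.data.asInstanceOf[Iterator[T]])
}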
Why are the changes needed?
Input metrics are already increased when the RDD is first computed, so it is not correct to increment them again when the RDD is read from the cache.
Does this PR introduce any user-facing change?
Yes, the user will get the correct read metrics now.
How was this patch tested?
Existing UTs