
[SPARK-32626][CORE] Do not increase the input metrics when read rdd from cache#29441

Closed
Udbhav30 wants to merge 1 commit into apache:master from Udbhav30:metric


@Udbhav30
Contributor

What changes were proposed in this pull request?

Do not increment the input metrics when reading an RDD from the cache.

Why are the changes needed?

The input metrics are already incremented when the RDD is first computed, so it is not correct to increment them again when the RDD is read from the cache.

Does this PR introduce any user-facing change?

Yes, users will now get the correct read metrics.

How was this patch tested?

Existing unit tests.

@Udbhav30
Contributor Author

cc @dongjoon-hyun

@dongjoon-hyun
Member

ok to test

@SparkQA

SparkQA commented Aug 15, 2020

Test build #127480 has finished for PR 29441 at commit 49f9db6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Comment on lines -391 to -392
val existingMetrics = context.taskMetrics().inputMetrics
existingMetrics.incBytesRead(blockResult.bytes)
Member

Why? According to the TaskMetrics.inputMetrics documentation, it is:

Metrics related to reading data from a [[org.apache.spark.rdd.HadoopRDD]] or from persisted data, defined only in tasks with input.

Even when the RDD is cached, don't we need to include its size in the input size of the task?

Contributor Author

The input size of the task is increased when the RDD is first computed, so I think it is not correct to increase the metrics again when it is read from the cache. Correct me if I am wrong.

Member

From the doc, I think inputMetrics counts all the input the task processes, not only the input read from disk. Although a cached RDD does not need to be read from disk, it still adds to the amount of input the task is going to process.
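
A minimal, Spark-free sketch of the semantics described above, assuming each task owns its own input-metrics object and that a cached block has the same size as the data it was computed from. TaskInputMetricsSketch, BlockStoreSketch, and CachedReadDemo are hypothetical names used only for illustration; they are not Spark classes and this is not how Spark's block manager is implemented.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a task's input metrics (not a Spark class).
final class TaskInputMetricsSketch {
  private var _bytesRead: Long = 0L
  def bytesRead: Long = _bytesRead
  def incBytesRead(n: Long): Unit = _bytesRead += n
}

// Hypothetical stand-in for a block store: block id -> cached size in bytes.
final class BlockStoreSketch {
  private val cache = mutable.Map.empty[String, Long]

  /** Simulates one task reading a block, computing and caching it on a miss. */
  def runTask(blockId: String, sourceBytes: Long): TaskInputMetricsSketch = {
    val metrics = new TaskInputMetricsSketch // assumption: fresh metrics per task
    cache.get(blockId) match {
      case Some(cachedBytes) =>
        // Cache hit: the task still processes this input, so it is counted.
        metrics.incBytesRead(cachedBytes)
      case None =>
        // Cache miss: read from the source, count it, then cache the block.
        metrics.incBytesRead(sourceBytes)
        cache(blockId) = sourceBytes
    }
    metrics
  }
}

object CachedReadDemo {
  /** Runs two tasks over one block: the first computes it, the second hits the cache. */
  def run(): (Long, Long) = {
    val store = new BlockStoreSketch
    val first = store.runTask("rdd_0_0", 1024L)
    val second = store.runTask("rdd_0_0", 1024L)
    (first.bytesRead, second.bytesRead)
  }
}
```

Under these assumptions, CachedReadDemo.run() reports the same bytes read for both tasks. Because the count belongs to each task rather than to the RDD, the cache read is that task's own input and not a second count of the first task's read.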

Contributor

I agree with @viirya: as currently defined, it includes all reads. Note that a cached RDD read could still involve network or disk reads.

Member

+1 on @viirya's comment.

Contributor Author

Thanks for the information, @viirya. I'll close the PR.

@maropu changed the title from "[SPARK-32626] Do not increase the input metrics when read rdd from cache" to "[SPARK-32626][CORE] Do not increase the input metrics when read rdd from cache" on Aug 16, 2020
@Udbhav30 closed this on Aug 17, 2020
6 participants