[SPARK-32626][CORE] Do not increase the input metrics when reading an RDD from cache #29441
Udbhav30 wants to merge 1 commit into apache:master
Conversation
ok to test
Test build #127480 has finished for PR 29441 at commit
val existingMetrics = context.taskMetrics().inputMetrics
existingMetrics.incBytesRead(blockResult.bytes)
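For context, the two lines above sit in the cache-hit branch of RDD.getOrCompute. The following is a paraphrased sketch of that surrounding logic (names such as readCachedBlock come from core/src/main/scala/org/apache/spark/rdd/RDD.scala; the exact shape may vary between Spark versions):

// Paraphrased sketch of RDD.getOrCompute's cache-hit path, not the verbatim source.
SparkEnv.get.blockManager.getOrElseUpdate(blockId, storageLevel, elementClassTag, () => {
  readCachedBlock = false // block was not cached; it is being computed now
  computeOrReadCheckpoint(partition, context)
}) match {
  case Left(blockResult) =>
    if (readCachedBlock) {
      // The lines under discussion: a cache hit still counts toward task input metrics.
      val existingMetrics = context.taskMetrics().inputMetrics
      existingMetrics.incBytesRead(blockResult.bytes)
      new InterruptibleIterator[T](context, blockResult.data.asInstanceOf[Iterator[T]]) {
        override def next(): T = {
          existingMetrics.incRecordsRead(1)
          delegate.next()
        }
      }
    } else {
      new InterruptibleIterator(context, blockResult.data.asInstanceOf[Iterator[T]])
    }
  case Right(iter) =>
    // Another attempt computed the block first; just consume its iterator.
    new InterruptibleIterator(context, iter.asInstanceOf[Iterator[T]])
}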
Why? Based on the TaskMetrics.inputMetrics documentation, it is:
Metrics related to reading data from a [[org.apache.spark.rdd.HadoopRDD]] or from persisted data, defined only in tasks with input.
Even if the data is cached, don't we need to include its size in the input size of the task?
The input size of the task is increased when the RDD is first computed, so I think it is not correct to increase the metrics again when it is read from cache. Correct me if I am wrong.
From the doc, I think inputMetrics counts all input the task processes, not just input read from disk. Although you don't need to read from disk when the data is cached, a cached read still adds to the amount of input the task processes.
I agree with @viirya: as currently defined, it includes all reads. Note that reading a cached RDD can still involve network or disk reads (for example, when the block is persisted at a disk-backed storage level or fetched from a remote executor).
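To make the agreed-upon semantics concrete, here is a minimal standalone sketch (not part of the PR) that observes bytesRead through a SparkListener. The file path and app name are placeholders; under the current behavior both passes report non-zero input, the second coming from the cached block:

import java.util.concurrent.atomic.LongAdder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

object CachedInputMetricsDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("cached-input-metrics"))
    val bytesRead = new LongAdder

    // Accumulate bytesRead across all finished tasks.
    sc.addSparkListener(new SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        if (taskEnd.taskMetrics != null) {
          bytesRead.add(taskEnd.taskMetrics.inputMetrics.bytesRead)
        }
      }
    })

    val rdd = sc.textFile("README.md").cache() // placeholder input file
    rdd.count() // first pass: reads the file and populates the cache
    Thread.sleep(1000) // listener delivery is asynchronous; crude wait for a sketch
    val firstPass = bytesRead.sumThenReset()

    rdd.count() // second pass: served from the block manager cache
    Thread.sleep(1000)
    val cachedPass = bytesRead.sumThenReset()

    println(s"bytesRead first pass = $firstPass, cached pass = $cachedPass")
    sc.stop()
  }
}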
Thanks for the information @viirya. I'll close the PR.
What changes were proposed in this pull request?
Do not increase the input metrics when reading an RDD from the cache.
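The PR diff itself is not reproduced on this page; assuming the change simply drops the metric update in the cache-hit branch of RDD.getOrCompute, it would look roughly like this sketch:

// Hypothetical shape of the proposed change (assumption; the actual diff is not shown above).
if (readCachedBlock) {
  // No incBytesRead / incRecordsRead here: the bytes were already counted
  // as input when the partition was first computed.
  new InterruptibleIterator[T](context, blockResult.data.asInstanceOf[Iterator[T]])
} else {
  new InterruptibleIterator(context, blockResult.data.asInstanceOf[Iterator[T]])
}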
Why are the changes needed?
Input metrics are already increased when the RDD is first computed, so it is not correct to increment them again when the RDD is read from the cache.
Does this PR introduce any user-facing change?
Yes, the user will get the correct read metrics now.
How was this patch tested?
Existing UTs