[SPARK-55523][SQL] Fix inputMetrics.recordsRead to report rows instead of CachedBatches#54684
Open
moomindani wants to merge 1 commit into apache:master from
Conversation
…d of CachedBatches

### What changes were proposed in this pull request?

`InMemoryTableScanExec.doExecute()` iterates over `CachedBatch` objects from `RDD.getOrCompute`, which increments `inputMetrics.recordsRead` by 1 per `CachedBatch` on a cache hit, counting batches instead of actual rows.

This fix registers a `TaskCompletionListener` inside `mapPartitionsInternal` that corrects the count once all batches in a partition are consumed:

- On cache hit: `getOrCompute` added `numBatches`; the listener adds `(rowCount - numBatches)` so the final value equals the actual row count.
- On cache miss: the source scan already added `rowCount` to `recordsRead`; the listener adds `(rowCount - rowCount) = 0`, leaving the value unchanged.

No changes to `RDD.scala`, so non-SQL RDD caching is unaffected.

### Why are the changes needed?

Users and monitoring tools relying on `TaskMetrics.inputMetrics.recordsRead` (e.g. the Spark UI, external observability systems) see incorrect values when reading from a cached DataFrame. For example, 3,605 rows stored in a single `CachedBatch` would report `recordsRead = 1` instead of `3605`.

### Does this PR introduce _any_ user-facing change?

Yes. `inputMetrics.recordsRead` for queries reading from a cached DataFrame now reports the correct row count instead of the batch count.

### How was this patch tested?

Added a test in `InMemoryColumnarQuerySuite` that caches 1000 rows, materializes the cache, then reads from the cache and asserts `inputMetrics.recordsRead == 1000` via a `SparkListener`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
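The hit/miss correction arithmetic above can be illustrated with a minimal, Spark-free sketch. `InputMetricsStub` and `correctOnCompletion` are hypothetical stand-ins for `TaskMetrics.inputMetrics` and the `TaskCompletionListener` body, not the actual patch code:

```scala
// Hypothetical stand-in for TaskMetrics.inputMetrics.
final class InputMetricsStub {
  private var records: Long = 0L
  def incRecordsRead(n: Long): Unit = records += n
  def recordsRead: Long = records
}

object RecordsReadCorrection {
  // Models the TaskCompletionListener fix-up: once all batches in a
  // partition are consumed, add (rowCount - alreadyCounted) so the
  // final value equals the true row count in both the hit and miss cases.
  def correctOnCompletion(metrics: InputMetricsStub,
                          rowCount: Long,
                          alreadyCounted: Long): Unit =
    metrics.incRecordsRead(rowCount - alreadyCounted)

  def main(args: Array[String]): Unit = {
    // Cache hit: 3605 rows packed into a single CachedBatch, so
    // getOrCompute counted only 1 record for the whole partition.
    val hit = new InputMetricsStub
    hit.incRecordsRead(1L)                       // 1 per CachedBatch
    correctOnCompletion(hit, 3605L, 1L)          // listener adds 3604
    println(hit.recordsRead)                     // 3605

    // Cache miss: the source scan already counted every row,
    // so the listener's delta is zero.
    val miss = new InputMetricsStub
    miss.incRecordsRead(3605L)
    correctOnCompletion(miss, 3605L, 3605L)      // adds 0
    println(miss.recordsRead)                    // 3605
  }
}
```

In both branches the partition ends with `recordsRead == rowCount`, which is why a single listener covers the hit and miss paths without touching `RDD.scala`.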
### Notes
`doExecuteColumnar()` has the same underlying issue (batch-level counting via `getOrCompute`) and could be addressed as a follow-up.
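The listener-based verification described under "How was this patch tested?" can be sketched roughly as follows. This is a hedged illustration rather than the actual suite code: the DataFrame contents, the `Thread.sleep` drain, and the object name are assumptions, while `SparkListener`, `onTaskEnd`, and `taskMetrics.inputMetrics.recordsRead` are real Spark APIs. It requires a running `SparkSession`, so it is not self-contained:

```scala
import java.util.concurrent.atomic.AtomicLong

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
import org.apache.spark.sql.SparkSession

object RecordsReadCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("recordsRead-check")
      .getOrCreate()

    // Accumulate recordsRead across all tasks of the cached read.
    val recordsRead = new AtomicLong(0L)
    spark.sparkContext.addSparkListener(new SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
        recordsRead.addAndGet(taskEnd.taskMetrics.inputMetrics.recordsRead)
    })

    val df = spark.range(1000).toDF("id").cache()
    df.count()            // materialize the cache
    recordsRead.set(0L)   // measure only the read served from cache
    df.count()            // scan the InMemoryRelation

    // Listener delivery is asynchronous; the real suite waits for the
    // listener bus to drain instead of sleeping.
    Thread.sleep(1000L)
    assert(recordsRead.get() == 1000L,
      s"expected recordsRead == 1000, got ${recordsRead.get()}")
    spark.stop()
  }
}
```

Without the fix, the assertion would see the number of `CachedBatch` instances (e.g. the number of cached partitions) instead of 1000.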