Skip to content

Fix duration metric accuracy: RDD timing, codegen try/finally, per-partition write timing#63

Merged
minskya merged 1 commit intomainfrom
DATAFLINT-4738
Apr 20, 2026
Merged

Fix duration metric accuracy: RDD timing, codegen try/finally, per-partition write timing#63
minskya merged 1 commit intomainfrom
DATAFLINT-4738

Conversation

@minskya
Copy link
Copy Markdown
Contributor

@minskya minskya commented Apr 20, 2026

Summary

  • Custom RDD instead of mapPartitions for duration timingDataFlintRDDUtils now uses a custom RDD that captures startTime inside compute() before firstParent.iterator(). Previously, mapPartitions set startTime after the parent RDD's compute, missing eager work in operators like SortExec (full partition sort) and HashAggregateExec (hash map build).

  • Per-partition write durationTimedExec.executeCollect now reconstructs DataWritingCommandExec with the data plan wrapped in RDDTimingWrapper inside WriteFilesExec (Spark 3.4+) or directly (older Spark). The write command consumes the timed RDD via sparkContext.runJob, so wall-clock-per-partition timing captures both data production and write I/O. Previously, write duration was measured as driver-side wall-clock time, inconsistent with other per-partition metrics.

  • Codegen try/finally for blocking operatorsdoProduce now wraps the child code with try/finally so the duration metric flushes even when blocking operators (SortExec, etc.) exit early via shouldStop()/return. Previously, duration was 0 for these operators in codegen because the timing code after the child code was never reached.

  • Sanitize codegen variable prefix — Nodes with spaces in nodeName (e.g. RDDScanExec's "Scan ExistingRDD", DataWritingCommandExec's "Execute InsertIntoHadoopFsRelationCommand") caused invalid Java identifiers in generated code. doProduce now sanitizes ctx.freshNamePrefix to strip non-alphanumeric characters.

  • Add RDDScanExec to instrumented nodes — "Scan ExistingRDD" nodes now get duration metrics in both Spark 3 and Spark 4.

Test plan

  • pluginspark3 tests pass (43 tests, 42 pass, 1 pre-existing flaky BroadcastHashJoinExec timing test)
  • pluginspark4 compiles
  • Codegen timing tests (DataFlintCodegenExecSpec) pass with try/finally fix
  • Write timing tests (DataWritingCommandExec) pass with per-partition approach

…upport for duration Writing duration now really gives per partition duration and not just wall time
@notion-workspace
Copy link
Copy Markdown

Duration fixes

@minskya minskya changed the title TimedExec now wraps compute instead of mapPartitions codeGen better … Fix duration metric accuracy: RDD timing, codegen try/finally, per-partition write timing Apr 20, 2026
@minskya minskya marked this pull request as ready for review April 20, 2026 12:00
@minskya minskya merged commit 51cd734 into main Apr 20, 2026
2 of 3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant