Fix duration metric accuracy: RDD timing, codegen try/finally, per-partition write timing by minskya · Pull Request #63 · dataflint/spark

minskya · 2026-04-20T10:36:56Z

Summary

Custom RDD instead of mapPartitions for duration timing — DataFlintRDDUtils now uses a custom RDD that captures startTime inside compute() before firstParent.iterator(). Previously, mapPartitions set startTime after the parent RDD's compute, missing eager work in operators like SortExec (full partition sort) and HashAggregateExec (hash map build).
Per-partition write duration — TimedExec.executeCollect now reconstructs DataWritingCommandExec with the data plan wrapped in RDDTimingWrapper inside WriteFilesExec (Spark 3.4+) or directly (older Spark). The write command consumes the timed RDD via sparkContext.runJob, so wall-clock-per-partition timing captures both data production and write I/O. Previously, write duration was measured as driver-side wall-clock time, inconsistent with other per-partition metrics.
Codegen try/finally for blocking operators — doProduce now wraps the child code with try/finally so the duration metric flushes even when blocking operators (SortExec, etc.) exit early via shouldStop()/return. Previously, duration was 0 for these operators in codegen because the timing code after the child code was never reached.
Sanitize codegen variable prefix — Nodes with spaces in nodeName (e.g. RDDScanExec's "Scan ExistingRDD", DataWritingCommandExec's "Execute InsertIntoHadoopFsRelationCommand") caused invalid Java identifiers in generated code. doProduce now sanitizes ctx.freshNamePrefix to strip non-alphanumeric characters.
Add RDDScanExec to instrumented nodes — "Scan ExistingRDD" nodes now get duration metrics in both Spark 3 and Spark 4.

Test plan

pluginspark3 tests pass (43 tests, 42 pass, 1 pre-existing flaky BroadcastHashJoinExec timing test)
pluginspark4 compiles
Codegen timing tests (DataFlintCodegenExecSpec) pass with try/finally fix
Write timing tests (DataWritingCommandExec) pass with per-partition approach

…upport for duration Writing duration now really gives per partition duration and not just wall time

notion-workspace · 2026-04-20T10:37:03Z

TimedExec now wraps compute instead of mapPartitions codeGen better s…

99a6506

…upport for duration Writing duration now really gives per partition duration and not just wall time

minskya changed the title ~~TimedExec now wraps compute instead of mapPartitions codeGen better …~~ Fix duration metric accuracy: RDD timing, codegen try/finally, per-partition write timing Apr 20, 2026

minskya marked this pull request as ready for review April 20, 2026 12:00

minskya merged commit 51cd734 into main Apr 20, 2026
2 of 3 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix duration metric accuracy: RDD timing, codegen try/finally, per-partition write timing#63

Fix duration metric accuracy: RDD timing, codegen try/finally, per-partition write timing#63
minskya merged 1 commit intomainfrom
DATAFLINT-4738

minskya commented Apr 20, 2026 •

edited

Loading

Uh oh!

notion-workspace Bot commented Apr 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

minskya commented Apr 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

notion-workspace Bot commented Apr 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

minskya commented Apr 20, 2026 •

edited

Loading