Support Spark 3.0/3.1 instrumentation with non-transparent TimedExec wrapper #68
Merged
Conversation
…wrapper

TimedExec's transparent wrapper (`children = child.children`) is incompatible with Spark 3.0/3.1's `withNewChildren`, which uses `mapProductIterator` + `containsChild`. This caused `CollapseCodegenStages`, `ApplyColumnarRulesAndInsertTransitions`, and AQE to silently fail to update children through the wrapper.

Changes:
- TimedExec: detect Spark < 3.2 via `SPARK_VERSION` and use non-transparent `children = Seq(child)` on legacy Spark, transparent on 3.2+
- TimedWithCodegenExec: split codegen support into a separate class so non-codegen nodes (e.g. `FileSourceScanExec` on 3.1) don't get cast errors
- DataFlintInstrumentationExtension: re-enable SQL node instrumentation on all Spark versions (was disabled for 3.0/3.1)
- SqlReducer: merge duplicate wrapper+child node pairs in the UI by keeping the child node (rich plan descriptions) and adding the wrapper's duration metric
- SqlReducer: preserve `parsedPlan` when plan data is unavailable on repeated polls (the non-paginated SQL API returns all SQLs every cycle)
- PySpark test: add a columnar scan + Python UDF + shuffle test case

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
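The incompatibility this commit describes can be shown with a toy model. This is a hypothetical Python sketch, not the actual Spark or DataFlint code: a 3.0/3.1-style `withNewChildren` walks a node's constructor fields and replaces a field only when it appears in that node's `children`, so a transparent wrapper whose `children` delegate to the grandchildren never has its own wrapped `child` field rewritten.

```python
class Node:
    """Toy plan node: 'fields' stands in for Scala constructor fields."""
    def __init__(self, name, fields):
        self.name = name
        self.fields = fields

    @property
    def children(self):
        # Default behaviour: direct plan-typed fields are the children.
        return [f for f in self.fields if isinstance(f, Node)]

    def with_new_children(self, new_children):
        # Spark 3.0/3.1 style: mapProductIterator over fields, replacing a
        # field only if containsChild(field) -- i.e. it is in self.children.
        it = iter(new_children)
        child_ids = {id(c) for c in self.children}
        new_fields = [next(it) if id(f) in child_ids else f
                      for f in self.fields]
        return Node(self.name, new_fields)


class TransparentWrapper(Node):
    """Transparent wrapper: children = child.children (the pre-fix TimedExec)."""
    @property
    def children(self):
        return self.fields[0].children


leaf = Node("Scan", [])
project = Node("Project", [leaf])

# Transparent wrapper: its children are the *grandchildren*, so the wrapped
# 'project' field is never recognised as a child and is silently kept as-is.
transparent = TransparentWrapper("TimedExec", [project])
broken = transparent.with_new_children([Node("NewScan", [])])

# Non-transparent wrapper (children = Seq(child)): replacement works.
opaque = Node("TimedExec", [project])
fixed = opaque.with_new_children([Node("NewProject", [])])
```

The same silent no-op is what made `CollapseCodegenStages` and AQE fail to rewrite plans through the wrapper on 3.0/3.1.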
…tMap nodes

- WindowParser: regex now matches `WindowInPandas`/`ArrowWindowPython` plan descriptions (was hardcoded to the `Window [` prefix)
- batchEvalPythonParser: fallback regex for bare function names used by `MapInPandas`, `FlatMapGroupsInPandas`, and `FlatMapCoGroupsInPandas` (these don't use the `[funcs], [udfs]` bracket format)
- SqlReducer: reverted the unnecessary `WindowInPandas` split (the parser fix makes it redundant)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
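A rough Python sketch of the two parser fixes. The regex patterns and plan-description strings below are illustrative assumptions, not the actual DataFlint patterns:

```python
import re

# Window regex: accept the pandas/Arrow window variants instead of only the
# hardcoded "Window [" prefix.
WINDOW_RE = re.compile(r"^(Window|WindowInPandas|ArrowWindowPython)\b")

# Bracketed form, e.g. "BatchEvalPython [my_udf(x#1)], [pythonUDF0#2]"
BRACKET_RE = re.compile(r"\[([^\]]*)\],\s*\[([^\]]*)\]")

# Bare form used by MapInPandas / FlatMap(Co)GroupsInPandas,
# e.g. "MapInPandas my_func(a#1, b#2)" -- no bracket lists at all.
BARE_RE = re.compile(r"(\w+)\(")

def parse_functions(plan_desc: str):
    """Prefer the bracketed [funcs], [udfs] form; fall back to bare names."""
    m = BRACKET_RE.search(plan_desc)
    if m:
        return [f.strip() for f in m.group(1).split(",") if f.strip()]
    return BARE_RE.findall(plan_desc)
```

Note that Python's regex alternation backtracks, so `Window|WindowInPandas` followed by `\b` still matches the longer variant correctly.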
The bracket regex `[funcs], [udfs]` was matching FlatMapCoGroupsInPandas's `[group_keys], [group_keys]` as function/UDF lists. The pattern now requires the first bracket to contain parentheses (function calls) or be empty, so CoGroups falls through to the bare-function-name regex.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
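The tightened first-bracket requirement can be sketched as follows (an illustrative pattern, not the actual code): the bracket's contents must include a `(` or be empty, which rejects bare identifier lists like `[group_keys]`.

```python
import re

# First bracket must contain a "(" (a function call) or be empty.
FIRST_BRACKET_RE = re.compile(r"\[(?:[^\]]*\([^\]]*|)\]")

def looks_like_func_list(first_bracket: str) -> bool:
    """True for "[my_udf(x)]" or "[]", False for "[group_keys]"."""
    return FIRST_BRACKET_RE.fullmatch(first_bracket) is not None
```

With this check, `FlatMapCoGroupsInPandas ... [group_keys], [group_keys]` no longer matches the bracketed branch and reaches the bare-name fallback instead.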
Summary
- TimedExec's transparent wrapper (`children = child.children`) is incompatible with Spark 3.0/3.1's `withNewChildren` (uses `mapProductIterator` + `containsChild`). On legacy Spark, switches to `children = Seq(child)` so plan transformations (`CollapseCodegenStages`, `ApplyColumnarRulesAndInsertTransitions`, AQE) work correctly.
- Non-codegen nodes (e.g. `FileSourceScanExec` on Spark 3.1) are wrapped without codegen, avoiding `ClassCastException`.
- The `/sql` API (used on Spark < 3.2) returns all SQLs every poll cycle, causing repeated recalculations that lost plan descriptions when the `sqlplan` offset advanced. Now preserves existing plan data when unavailable.
- `WindowInPandas`/`ArrowWindowPython` plan descriptions weren't parsed (regex required the `Window [` prefix). `MapInPandas`, `FlatMapGroupsInPandas`, and `FlatMapCoGroupsInPandas` use a bare function format that the bracket-based parser couldn't handle. Both fixed with fallback regex patterns.

Test plan
- `pyspark-testing/dataflint_pyspark_example.py` on Spark 3.1.2

🤖 Generated with Claude Code
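For illustration, the SqlReducer wrapper/child merge from the summary could look roughly like this. The node shape and field names (`id`, `wraps`, `metrics`) are hypothetical, not the actual DataFlint data model:

```python
def merge_wrapper_pairs(nodes):
    """Merge each TimedExec wrapper with the child it encloses: keep the
    child node (it carries the rich plan description) and fold the wrapper's
    duration metric into the child's metrics.

    nodes: list of dicts with 'id', 'name', 'metrics', and an optional
    'wraps' key holding the id of the wrapped child.
    """
    by_id = {n["id"]: n for n in nodes}
    dropped = set()
    for n in nodes:
        child_id = n.get("wraps")
        if child_id is not None and child_id in by_id:
            child = by_id[child_id]
            # Copy the wrapper's duration onto the child, keep its own metrics.
            child["metrics"] = {**child.get("metrics", {}),
                                "duration": n["metrics"].get("duration", 0)}
            dropped.add(n["id"])  # the wrapper node disappears from the UI
    return [n for n in nodes if n["id"] not in dropped]
```

This keeps one node per wrapper+child pair, which is what removes the duplicate nodes from the SQL plan graph in the UI.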