Skip to content

perf: O(1) PlanDataInjector lookup by op kind#4535

Open
schenksj wants to merge 2 commits into
apache:mainfrom
schenksj:fix/plandatainjector-o1-lookup
Open

perf: O(1) PlanDataInjector lookup by op kind#4535
schenksj wants to merge 2 commits into
apache:mainfrom
schenksj:fix/plandatainjector-o1-lookup

Conversation

@schenksj
Copy link
Copy Markdown

Which issue does this PR close?

Closes #4530.

Rationale for this change

PlanDataInjector.injectPlanData walked every operator in the tree against every registered injector:

for (injector <- injectors if injector.canInject(op)) { ... }

That is N operators × M injectors canInject calls, even though most operators in any tree are non-scan and match no injector.

This is an independent micro-optimization, extracted from the Delta Lake contrib integration branch (which registers a third injector) purely to keep that PR focused; there is no functional dependency.

What changes are included in this PR?

  • Add def opStructCase: Operator.OpStructCase to the PlanDataInjector trait, implemented by IcebergPlanDataInjector (ICEBERG_SCAN) and NativeScanPlanDataInjector (NATIVE_SCAN).
  • Build a Map[Operator.OpStructCase, PlanDataInjector] and look up by op.getOpStructCase (O(1)), then run a single canInject confirm (which still inspects detail fields like hasCommon / !hasFilePartition). Non-scan operators skip the iteration entirely.

Pure performance change — no behavior difference.

How are these changes tested?

  • New PlanDataInjectorSuite: asserts a non-scan operator tree passes through injectPlanData unchanged (exercises the O(1) miss path), and that each registered injector resolves back to itself via its declared opStructCase (the kinds are distinct and the map is complete, so no injector is silently shadowed).
  • CometExecSuite (126/0), which exercises native-scan plan-data injection end-to-end on every native query, passes unchanged — confirming the refactor preserves behavior.

schenksj and others added 2 commits May 30, 2026 07:38
PlanDataInjector.injectPlanData walked every operator in the tree against every
registered injector (`for (injector <- injectors if injector.canInject(op))`)
-- N operators x M injectors canInject calls -- even though most operators in
any tree are non-scan and match no injector.

Add `opStructCase` to the PlanDataInjector trait and key a
Map[OpStructCase, PlanDataInjector]. Look up by op.getOpStructCase (O(1)) then a
single canInject confirm; non-scan operators skip the iteration entirely. Pure
performance change -- no behavior difference.

Closes apache#4530

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
check-suites.py requires every *Suite.scala to appear in both
pr_build_linux.yml and pr_build_macos.yml. Add the new
PlanDataInjectorSuite alongside its sibling org.apache.spark.sql.comet
suites.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@andygrove andygrove requested a review from mbutrovich May 31, 2026 14:08
@andygrove
Copy link
Copy Markdown
Member

I'm not familiar with this code so will defer to @mbutrovich for review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PlanDataInjector does N x M canInject calls per operator tree

2 participants