Skip to content

fix: include port id in materialization reader actor id to fix self-join deadlock#4985

Closed
Ma77Ball wants to merge 8 commits into
apache:mainfrom
Ma77Ball:fix/differenceOperator
Closed

fix: include port id in materialization reader actor id to fix self-join deadlock#4985
Ma77Ball wants to merge 8 commits into
apache:mainfrom
Ma77Ball:fix/differenceOperator

Conversation

@Ma77Ball
Copy link
Copy Markdown
Contributor

@Ma77Ball Ma77Ball commented May 8, 2026

What changes were proposed in this PR?

Fixes the Difference operator hanging when one upstream operator (e.g., a single CSV) is wired to both of its input ports.

The bug

  • Scheduler materializes the upstream to break the self-join cycle.
  • Each downstream worker spawns one InputPortMaterializationReaderThread per upstream URI.
  • Each thread's virtual "from" actor ID was built from (uri, workerActorId) only.
  • With a self-join, both threads share the same URI and worker → same ChannelIdentity → FIFO sequence numbers and end-of-channel markers cross-routed → one port never drains → Difference hangs.

The fix

  • Mix the destination PortIdentity into the actor name: MATERIALIZATION_READER_<uri>_port<n>[i]_<workerActorId>.
  • Thread toPortId through the three callers of getFromActorIdForInputPortStorage:
    • ResourceAllocatorglobalPortId.portId.
    • AssignPortHandlermsg.portId.
    • InputManagerInputPortMaterializationReaderThread (new ctor field)

Any related issues, documentation, or discussions?

Closes: #2588

How was this PR tested?

  • VirtualIdentityUtilsSpec — checks the new ID format and asserts distinct IDs for the same (uri, worker) but different port IDs.
  • ExpansionGreedyScheduleGeneratorSpec — builds csv → difference with both inputs from the same csv; asserts levelSets.size > 1, proving materialization happens and the schedule isn't a deadlocked single-level region.
  • Manual: ran a workflow with one CSV connected to both Difference ports — previously hung, now completes.

Was this PR authored or co-authored using generative AI tooling?

Co-authored with: Claude Opus 4.7 in compliance with ASF

@Ma77Ball Ma77Ball changed the title Fix/difference operator fix: include port id in materialization reader actor id to fix self-join deadlock May 8, 2026
@Ma77Ball
Copy link
Copy Markdown
Contributor Author

Ma77Ball commented May 8, 2026

@Yicong-Huang please review

@Ma77Ball Ma77Ball changed the title fix: include port id in materialization reader actor id to fix self-join deadlock fix: include port id in materialization reader actor id to fix self-join deadlock May 8, 2026
@Yicong-Huang
Copy link
Copy Markdown
Contributor

Yicong-Huang commented May 8, 2026

Mix the destination PortIdentity into the actor name: MATERIALIZATION_READER_port[i].

I don't think fixing the port id into the actor's name is the right way to go. port id shouldn't be part of actor name and Actor name should be unique. If we have an operator with 5 input ports, this fix would name the actor to 5 different names.

The bug should be on the other lower layer: an actor could have multiple channels, and channels are connected between different ports. I think we did not differentiate channels well, and the fix is deeper than just renames, the gateway logic might also require fix. To fix it you need a deeper understanding of amber.

@chenlica chenlica requested a review from Xiao-zhen-Liu May 9, 2026 06:17
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 9, 2026

Codecov Report

❌ Patch coverage is 71.42857% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 42.72%. Comparing base (b4f3a41) to head (d9830c6).
⚠️ Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
...pache/texera/amber/util/VirtualIdentityUtils.scala 75.00% 0 Missing and 1 partial ⚠️
...a/amber/operator/difference/DifferenceOpDesc.scala 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##               main    #4985   +/-   ##
=========================================
  Coverage     42.72%   42.72%           
+ Complexity     2185     2184    -1     
=========================================
  Files          1031     1031           
  Lines         38152    38156    +4     
  Branches       4004     4005    +1     
=========================================
+ Hits          16302    16304    +2     
  Misses        20831    20831           
- Partials       1019     1021    +2     
Flag Coverage Δ *Carryforward flag
access-control-service 39.53% <ø> (ø)
agent-service 33.72% <ø> (ø) Carriedforward from b4f3a41
amber 43.22% <71.42%> (+<0.01%) ⬆️
computing-unit-managing-service 0.00% <ø> (ø)
config-service 0.00% <ø> (ø)
file-service 32.18% <ø> (ø)
frontend 33.08% <ø> (ø) Carriedforward from b4f3a41
python 88.90% <ø> (ø) Carriedforward from b4f3a41
workflow-compiling-service 47.72% <ø> (ø)

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Contributor

@Xiao-zhen-Liu Xiao-zhen-Liu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for digging into this — the symptom analysis (self-join through materialization producing two reader threads with the same ChannelIdentity) is accurate.

However, I do agree with @Yicong-Huang. Actor ID is not the right layer to fix it:

The root issue is that ChannelIdentity only carries (fromWorkerId, toWorkerId, isControl) — no port. And NetworkInputGateway keys its channel map by ChannelIdentity, so two channels between the same actor pair collapse into one AmberFIFOChannel with a single portId slot. That's where the FIFO seq numbers and end-of-channel markers cross-route. Renaming the synthetic "from" actor per destination port sidesteps the collision in the materialization path, but it doesn't address the underlying model.

It also doesn't fix the pipelined case. When the upstream is a real worker (not a synthetic reader thread), its name isn't ours to change. Two pipelined links sharing (fromActor, toActor) but different toPortId will still hit the same AmberFIFOChannel, and the second addInputChannel call will overwrite the port assignment of the first. So Intersect and SymmetricDifference — which don't declare port dependencies and never materialize — would still hang on self-link. #2588 wouldn't really be closed.

And making one reader thread present itself as N actor IDs (one per destination port) breaks the "an actor has one name" invariant. An actor ID should identify the sender, not the (sender, destination) pair.

The proper fix is probably at the channel layer: either add toPortId to ChannelIdentity, or re-key inputChannels by (ChannelIdentity, toPortId). Either way it's a bigger change that needs careful work on the gateway, FIFO tracking, and EOC routing.

Given the scope around and engine-knowledge required for this fix, I think it would be better for @Ma77Ball to close this PR, and I can work on a more fundamental fix later.

@Ma77Ball Ma77Ball closed this May 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

self-diff cannot finish execution

4 participants