test(amber-python): per-channel sync in test_main_loop_thread_can_align_ecm#4526
Merged
Merged
Conversation
…lerant The test put a data tuple and a NoOperation-bearing ECM into input_queue back-to-back, then asserted output_queue produces the DataElement before the NoOperation reply. That's only true on a fast machine where the test pops output_queue before the NoOperation reply is enqueued. output_queue is a LinkedBlockingMultiQueue (internal_queue.py:80) with priority 1 for control sub-queues and priority 2 for data sub-queues, so once both items are present, get() returns the control reply first. On CI runners the test routinely sees the control reply first and fails with `tag=ChannelIdentity(..., is_control=True)` mismatch — observed on unrelated PRs apache#4512 and apache#4520. Drain both items and identify each by type (DataElement vs DCMElement), then assert each independently. Production behavior — control outranking data on egress — is correct as is. Closes apache#4524
…gn_ecm
Replace the order-tolerant drain-then-group-by-type approach with an
explicit per-channel synchronization that better expresses the actual
production semantics:
1. Wait until both expected channels (the data channel to the downstream
worker, and the control channel back to "sender") have their item in
output_queue's sub-queues. Sub-queues are added lazily on first put,
so guard against the missing-key case before reading size.
2. Drain two items via priority get(), key them by tag, and assert each
per channel — DataElement on the data channel, DCMElement carrying
the NoOperation reply on the control channel.
This matches what production guarantees:
- control to coordinator outranks data on egress (cross-channel
priority — kept by InternalQueue);
- within a channel sub-queue, FIFO is preserved (so ECM behind data on
the same data channel stays behind data).
The previous fixed-order assertion only worked on a fast machine where
the test popped between the two production puts. The earlier
order-tolerant version sidestepped the issue but lost the per-channel
intent. This version makes the channel scope explicit, fails fast with
a descriptive timeout instead of hanging, and runs deterministically on
both fast and slow runners.
Closes apache#4524
After the per-channel sync wait already ensures both items are queued, assert the priority semantics directly: pop control first (priority 1 sub-queue), then data (priority 2). This actually verifies the production guarantee — control to coordinator outranks data on egress — instead of just confirming both items showed up in some order. If a regression flips priorities or routes the NoOperation reply to a different channel, the first get() now fails fast with "expected control reply first". Refs apache#4524
zuozhiw
approved these changes
Apr 26, 2026
Contributor
zuozhiw
left a comment
There was a problem hiding this comment.
looks good, we are now effectively waiting for both the control and data output queue to be filled with at least one item (they are in a stable state) then start to assert our tests
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this PR?
Fixes the recurring CI flake in
core/runnables/test_main_loop.py::TestMainLoop::test_main_loop_thread_can_align_ecmand tightens the assertion to verify the actual production priority guarantee.Why it flaked: the test put a data tuple and an alignment-completing ECM into
input_queueback-to-back, then assumedoutput_queue.get()would yield theDataElementfirst followed by the NoOperation reply.output_queueis aLinkedBlockingMultiQueuewhose sub-queues are keyed by channel — control channels at priority 1, data channels at priority 2 (internal_queue.py:80). MainLoop produces in this order: data → NoOperation reply, but whether the test pops them in that order depends on whether MainLoop has finished both productions by the time the test calls.get():The production semantics are intentional and correct:
Fix: the test now expresses the priority semantics directly:
output_queue's sub-queues — the data channel to the downstream worker, and the control channel back to "sender". Sub-queues are added lazily on first put, so the wait safely treats a missing key as size zero.output_queue.get()first returns theDCMElement(NoOperation reply, control sub-queue priority 1), then theDataElement(data sub-queue priority 2).If a future change flips priorities or routes the NoOperation reply to a different channel, the first
get()now fails fast with "expected control reply first".No production code change.
Any related issues, documentation, discussions?
Closes #4524. Has been hitting unrelated PRs (#4512, #4520).
How was this PR tested?
Ran 30 consecutive iterations locally — 30 PASS, 0 FAIL.
ruff format --check .andruff check .clean.Was this PR authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Opus 4.7)