Skip to content

test(amber-python): per-channel sync in test_main_loop_thread_can_align_ecm#4526

Merged
Yicong-Huang merged 4 commits into
apache:mainfrom
Yicong-Huang:test/fix-align-ecm-flake
Apr 26, 2026
Merged

test(amber-python): per-channel sync in test_main_loop_thread_can_align_ecm#4526
Yicong-Huang merged 4 commits into
apache:mainfrom
Yicong-Huang:test/fix-align-ecm-flake

Conversation

@Yicong-Huang
Copy link
Copy Markdown
Contributor

@Yicong-Huang Yicong-Huang commented Apr 26, 2026

What changes were proposed in this PR?

Fixes the recurring CI flake in core/runnables/test_main_loop.py::TestMainLoop::test_main_loop_thread_can_align_ecm and tightens the assertion to verify the actual production priority guarantee.

Why it flaked: the test put a data tuple and an alignment-completing ECM into input_queue back-to-back, then assumed output_queue.get() would yield the DataElement first followed by the NoOperation reply. output_queue is a LinkedBlockingMultiQueue whose sub-queues are keyed by channel — control channels at priority 1, data channels at priority 2 (internal_queue.py:80). MainLoop produces in this order: data → NoOperation reply, but whether the test pops them in that order depends on whether MainLoop has finished both productions by the time the test calls .get():

  • Fast machine: only data is in the queue → data first → ✅
  • Slow CI: both items queued → priority returns the control reply first → ❌

The production semantics are intentional and correct:

  • Control to coordinator outranks data on egress (cross-channel priority — priority 1 vs 2 sub-queues).
  • Within a channel sub-queue, FIFO is preserved (so an ECM forwarded on a data channel stays behind the data tuples on that same channel).

Fix: the test now expresses the priority semantics directly:

  1. Wait until both expected channels have their item in output_queue's sub-queues — the data channel to the downstream worker, and the control channel back to "sender". Sub-queues are added lazily on first put, so the wait safely treats a missing key as size zero.
  2. With both items queued, assert the priority pop order: output_queue.get() first returns the DCMElement (NoOperation reply, control sub-queue priority 1), then the DataElement (data sub-queue priority 2).
  3. Bounded with a 5s timeout so a regression that drops one of the channels fails with a descriptive message instead of hanging.

If a future change flips priorities or routes the NoOperation reply to a different channel, the first get() now fails fast with "expected control reply first".

No production code change.

Any related issues, documentation, discussions?

Closes #4524. Has been hitting unrelated PRs (#4512, #4520).

How was this PR tested?

Ran 30 consecutive iterations locally — 30 PASS, 0 FAIL.
ruff format --check . and ruff check . clean.

Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.7)

…lerant

The test put a data tuple and a NoOperation-bearing ECM into input_queue
back-to-back, then asserted output_queue produces the DataElement before
the NoOperation reply. That's only true on a fast machine where the
test pops output_queue before the NoOperation reply is enqueued.

output_queue is a LinkedBlockingMultiQueue (internal_queue.py:80) with
priority 1 for control sub-queues and priority 2 for data sub-queues, so
once both items are present, get() returns the control reply first. On
CI runners the test routinely sees the control reply first and fails
with `tag=ChannelIdentity(..., is_control=True)` mismatch — observed on
unrelated PRs apache#4512 and apache#4520.

Drain both items and identify each by type (DataElement vs DCMElement),
then assert each independently. Production behavior — control outranking
data on egress — is correct as is.

Closes apache#4524
@Yicong-Huang Yicong-Huang marked this pull request as draft April 26, 2026 06:05
…gn_ecm

Replace the order-tolerant drain-then-group-by-type approach with an
explicit per-channel synchronization that better expresses the actual
production semantics:

1. Wait until both expected channels (the data channel to the downstream
   worker, and the control channel back to "sender") have their item in
   output_queue's sub-queues. Sub-queues are added lazily on first put,
   so guard against the missing-key case before reading size.
2. Drain two items via priority get(), key them by tag, and assert each
   per channel — DataElement on the data channel, DCMElement carrying
   the NoOperation reply on the control channel.

This matches what production guarantees:
  - control to coordinator outranks data on egress (cross-channel
    priority — kept by InternalQueue);
  - within a channel sub-queue, FIFO is preserved (so ECM behind data on
    the same data channel stays behind data).

The previous fixed-order assertion only worked on a fast machine where
the test popped between the two production puts. The earlier
order-tolerant version sidestepped the issue but lost the per-channel
intent. This version makes the channel scope explicit, fails fast with
a descriptive timeout instead of hanging, and runs deterministically on
both fast and slow runners.

Closes apache#4524
@Yicong-Huang Yicong-Huang changed the title test(amber-python): make test_main_loop_thread_can_align_ecm order-tolerant test(amber-python): per-channel sync in test_main_loop_thread_can_align_ecm Apr 26, 2026
After the per-channel sync wait already ensures both items are queued,
assert the priority semantics directly: pop control first (priority 1
sub-queue), then data (priority 2). This actually verifies the
production guarantee — control to coordinator outranks data on egress —
instead of just confirming both items showed up in some order.

If a regression flips priorities or routes the NoOperation reply to a
different channel, the first get() now fails fast with "expected control
reply first".

Refs apache#4524
@Yicong-Huang Yicong-Huang marked this pull request as ready for review April 26, 2026 06:46
Copy link
Copy Markdown
Contributor

@zuozhiw zuozhiw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, we are now effectively waiting for both the control and data output queue to be filled with at least one item (they are in a stable state) then start to assert our tests

@Yicong-Huang Yicong-Huang merged commit a6ae296 into apache:main Apr 26, 2026
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky test_main_loop_thread_can_align_ecm: drain order depends on inter-channel timing

2 participants