Skip to content

[Monitor] Support repeat ... loops + Update monitor output on WAL examples#194

Merged
ngernest merged 65 commits into
mainfrom
monitor_repeat_loops
Mar 6, 2026
Merged

[Monitor] Support repeat ... loops + Update monitor output on WAL examples#194
ngernest merged 65 commits into
mainfrom
monitor_repeat_loops

Conversation

@ngernest
Copy link
Copy Markdown
Contributor

@ngernest ngernest commented Feb 20, 2026

This PR extends the monitor to support repeat loops by implementing the design discussed in #183 (see this comment for details about the design).

Other miscellaneous changes:

  • Define a file macros.rs with some helper macros for logging that print out the line number, function name and file name (the default info! macro from the log crate doesn't do this)
  • Changed the AugmentedTrace type from a type alias to a tuple struct so that we can define methods over it (see types.rs)
  • Renamed Stmt::BoundedLoop to RepeatLoop in the interpreter code

Current status (EoD 2/26):

1. Loop with assignments
Consider loop_with_assigns.prot:

// Each iteration of the loop performs the full add protocol:
//   assign inputs -> step -> release inputs -> assert output
// This tests that assignments, DontCare releases, and assertions
// all work correctly inside a bounded loop body.
prot loop_add<DUT: Adder>(in a: u32, in b: u32, in num_iters: u64, out s: u32) {
  repeat num_iters iterations {
    DUT.a := a;
    DUT.b := b;
    step();
    DUT.a := X;
    DUT.b := X;
    assert_eq(s, DUT.s);
  }
  fork();
  step();
}

This is the original transaction file supplied to the interpreter:

trace {
    loop_add(1, 2, 3, 3);   // assert 1+2=3 on each of 3 iterations
    loop_add(10, 20, 1, 30); // assert 10+20=30 on 1 iteration
}

The monitor currently produces the following output:

// trace 0
trace {
    loop_add(1, 2, 3, 3);
    loop_add(10, 20, 1, 30);
    loop_add(198490043, 4254467669, 1, 30);
}

// trace 1
trace {
    loop_add(1, 2, 1, 3);
    loop_add(1, 2, 2, 3);
    loop_add(10, 20, 1, 30);
    loop_add(198490043, 4254467669, 1, 30);
}

// trace 2
trace {
    loop_add(1, 2, 2, 3);
    loop_add(1, 2, 1, 3);
    loop_add(10, 20, 1, 30);
    loop_add(198490043, 4254467669, 1, 30);
}

// trace 3
trace {
    loop_add(1, 2, 1, 3);
    loop_add(1, 2, 1, 3);
    loop_add(1, 2, 1, 3);
    loop_add(10, 20, 1, 30);
    loop_add(198490043, 4254467669, 1, 30);
}

The issue is with the final spurious loop_add protocol in each of the traces suggested by the monitor. This is caused after a legitimate loop_call transaction completes and triggers an ExplicitFork (i.e. another loop_add protocol) at time = 4:

This new thread reads the values of DUT.a and DUT.b at time 4, both of which are randomized values (caused by DontCare), it does a step() (time is now 5), but assert_eq(s, DUT.s) succeeds since DUT.s is still 30. This thread finishes and emits loop_add(random_a, random_b, 1, 30).

Edit: we discussed on Monday how this is fine (this is an issue with how the waveform is being produced by the interpreter), so it doesn't block this PR

2: Nested busy wait
Monitor hangs on this protocol due to exponential blowup in the no. of threads (this also happens for push_pop_loop_empty.prot and push_pop_loop_not_empty.prot for the FIFO example):

// Nested bounded loops: the outer loop runs `outer_iters` times,
// and the inner loop runs `inner_iters` times per outer iteration,
// for a total of `outer_iters * inner_iters` steps.
// Since inputs persist on the adder, the final result is
// the same regardless of how many steps we take.
prot nested_busy_wait<DUT: Adder>(in a: u32, in b: u32, in outer_iters: u64, in inner_iters: u64, out s: u32) {
  DUT.a := a;
  DUT.b := b;
  repeat outer_iters iterations {
    repeat inner_iters iterations {
      step();
      assert_eq(s, DUT.s);
    }
    step();
    assert_eq(s, DUT.s);
  }
  DUT.a := X;
  DUT.b := X;
  assert_eq(s, DUT.s);
  fork();
  step();
}

Edit: we discussed on Monday this is fine. I added a 5 second timeout for Turnt for these specific test cases and allowed them to fail in Turnt to allow CI to proceed.

Comment thread monitor/src/thread.rs Outdated
pub args_to_pins: FxHashMap<SymbolId, SymbolId>,

/// Map which tracks how many times an input/output parameter's value
/// was updated *after* it was initially inferred due to the corresponding
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ngernest What does it mean for the monitor if an "input/output parameter's value was updated after it was initially inferred"? Can you provide me with an example of why/how that would happen?

Copy link
Copy Markdown
Contributor Author

@ngernest ngernest Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! When the monitor is run on busy_wait.prot, previously there was this (wrong) trace being produced:

...
// Trace 1
trace {
    add_busy_wait(4, 5, 3, 3);      // no. of loop iterations is 2nd to last arg, so num_cycles = 3
}
...

This trace is wrong, because if we look at the waveform (busy_wait.fst below), we see that the trace should be add_busy_wait(1, 2, 1, 3); add_busy_wait(4, 5, 2, 9) instead, i.e. this trace is missing the first cycle (doesn't cover the whole waveform). This violates the assumption that at any point in the waveform, there must be at most one protocol that can explain the signals (and since the only protocol in busy_wait.prot is add_busy_wait, the monitor is missing the first instance of add_busy_wait).

Screenshot 2026-03-04 at 9 17 01 AM

Looking at the logs, it appears that the trace only containing add_busy_wait(4, 5, 3, 3) was caused by a thread where the input parameters a, b were initially inferred as 1 and 2 (by reading off DUT.a & DUT.b during the first cycle in the waveform), but after a few iterations of the repeat loop, the values of DUT.a and DUT.b changed to 4 and 5 in the waveform, i.e. this is an example where a parameter's value was initially inferred, but then it changed due to the waveform signal changing.

I should mention this rebind_counts map doesn't change the "runtime" behavior of the monitor (i.e. how it infers parameter values) -- it is only used when there are multiple candidate traces and we want to reject some of them (e.g. traces which only contain one instance of add_busy_wait). For other tests where the parameter values do change over time (e.g. the WAL AXI-Stream example), since there is only one trace produced by the monitor for those (since the monitor waits exactly for ready and valid to both become 1 in the same cycle to detect the data transfer), this strategy doesn't get used.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the logs, it appears that the trace only containing add_busy_wait(4, 5, 3, 3) was caused by a thread where the input parameters a, b were initially inferred as 1 and 2 (by reading off DUT.a & DUT.b during the first cycle in the waveform), but after a few iterations of the repeat loop, the values of DUT.a and DUT.b changed to 4 and 5 in the waveform,

Should that not lead to a failure of the trace?

This is what should happen when there is, e.g., a Dut.a := a in the protocol:

  • if a is not defined, use the value of Dut.a to define a
  • if a is defined, check the value of a against Dut.a if they disagree, then the protocol fails.

So then how would you ever have a rebind_count that isn't just zero?

Copy link
Copy Markdown
Contributor Author

@ngernest ngernest Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what should happen when there is, e.g., a Dut.a := a in the protocol:

  • if a is not defined, use the value of Dut.a to define a
  • if a is defined, check the value of a against Dut.a if they disagree, then the protocol fails.

This is mostly what is done in the current implementation in the main branch, although in the main branch of scheduler.rs here, if a is defined and trace(DUT.a) != previous inferred value for a, we update the value for a.

The reason for this is that WAL AXIS test implicitly relies on this behavior in order to work -- here is the protocol:

prot send_data<D: AXIS>(in data: u8) {
    D.m_axis_tdata := data;
    D.m_axis_tvalid := 1'b1;
    while (!(D.m_axis_tready == 1'b1)) {
        step();
    }
    // one cycle for the data transfer to take place
    step();
}

When a thread corresponding to send_data begins, it immediately infers data to be the value of D.m_axis_tdata at the time. However, the thread can stay in the while loop for many cycles, and after every step(), the current implementation of the monitor on the main branch re-performs the "compare data against the trace value for D.m_axis_tdata" check, and updates data to match D.m_axis_tdata, and this causes the WAL tests to currently pass. (As a result, rebind_count can be non-zero under this scheme.)

(Yesterday I tried enforcing a thread failure if trace(DUT.a) != previous inferred value for a but this caused the WAL tests to fail, so the rebind_counts approach was mostly a way to keep the existing monitor behavior while filtering out invalid traces from the add_busy_wait example that don't span the entire waveform. I'm happy to revert this change and try something else though!)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for this is that WAL AXIS test implicitly relies on this behavior in order to work -- here is the protocol:

That means that something is wrong with the WAL AXIS protocol.

Copy link
Copy Markdown
Contributor Author

@ngernest ngernest Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at the WAL AXIS waveforms again and I noticed a bug that the existing approach might have caused:

Here is part of the trace produced for axis_truncated.prot (in axis_truncated.out) (with idle protocols explicitly printed out)

...
idle();  // [time: 1400ns -> 1408ns]
send_data(18);  // [time: 1408ns -> 3272ns]
idle();  // [time: 3272ns -> 3280ns]
...

This is the relevant section of the waveform corresponding to send_data(18) (between the two markers indicating 1408ns - 3272ns):

Screenshot 2026-03-04 at 10 39 30 AM

A thread is spawned at 1408ns, and at this time, data is 16, but because ready & valid are both 0 during this time, the thread keeps waiting (stuck in the while-loop) until 3264 ns (near the orange marker)
when ready and valid both become 1. During this time from 1408 - 3264ns, data changes from 16 -> 17 -> 18, and because the value of data at 3264 ns is 18, data is updated to be 18 after the next step().

The protocol that shows up in the trace (send_data(18)) is correct, since 18 is the value of data when both ready & valid are 1, but the timestamps are incorrect, since the starting timestamp of the thread 1480ns corresponds to when data = 16 still.

If we change the monitor behavior to more strictly align with this

if a is defined, check the value of a against Dut.a if they disagree, then the protocol fails

then the monitor should show a later starting timestamp for send_data(18), and at 1408ns (when data = 16), we would just have another idle protocol.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A thread is spawned at 1408ns, and at this time, data is 16, but because ready & valid are both 0 during this time, the thread keeps waiting (stuck in the while-loop) until 3264 ns (near the orange marker)

When valid is 0, then the send_data protocol should immediately fail. The protocol immediately constrains D.m_axis_tvalid := 1'b1; and that is never changed. So it cannot match any cycle where m_axis_tvalid is not 1.

Copy link
Copy Markdown
Contributor Author

@ngernest ngernest Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha I'll debug this thanks (removed all the rebind_count logic for now and set threads to fail if their inferred parameter values don't match the waveform value)

@ngernest
Copy link
Copy Markdown
Contributor Author

ngernest commented Mar 4, 2026

Latest updates:

  • The monitor now fails if a parameter is defined (i.e. we've already determined a value for it) and the waveform value doesn't match it in subsequent cycles
  • All WAL tests now fail as their waveforms are buggy (data is not stable when valid is asserted), the expected output for these tests in Turnt is an error message saying that no transactions match
    • This involved updating the error reporting logic so that if a scheduler fails and no other scheduler can make any progress (i.e. its current and next queues are both empty, indicating that there are no more threads that can be run in either the current or the next cycle), then we report the error (see lines 129 - 149 in global_scheduler.rs)
Screenshot 2026-03-04 at 3 23 19 PM
  • One of the Brave New World tests has also been updated (the AXI-Stream FIFO axis-fifo-d4, [Monitor] AXI FIFO example from Brave New World artifact  #140) as the monitor now fails on one of the waveforms. The monitor now reports an error for d4_fixed.prot, which is expected, since the bug involves fixing a FIFO so that we cannot perform any further push-es when it is full (the buggy protocol d4_buggy.prot still permits pushes even when the FIFO is full).

(More details about the bug in this PR comment)

@ngernest ngernest marked this pull request as ready for review March 4, 2026 20:21
@ngernest ngernest changed the title (WIP) [Monitor] Support repeat ... loops [Monitor] Support repeat ... loops + Update monitor output on WAL examples Mar 4, 2026
@ngernest ngernest merged commit 51de153 into main Mar 6, 2026
18 checks passed
@ngernest ngernest deleted the monitor_repeat_loops branch March 6, 2026 19:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants