[Monitor] Support repeat ... loops + Update monitor output on WAL examples#194
Conversation
This reverts commit 1667082.
…us guesses for loop_args in repeat loops
| pub args_to_pins: FxHashMap<SymbolId, SymbolId>, | ||
|
|
||
| /// Map which tracks how many times an input/output parameter's value | ||
| /// was updated *after* it was initially inferred due to the corresponding |
There was a problem hiding this comment.
@ngernest What does it mean for the monitor if an "input/output parameter's value was updated after it was initially inferred"? Can you provide me with an example of why/how that would happen?
There was a problem hiding this comment.
Yep! When the monitor is run on busy_wait.prot, previously there was this (wrong) trace being produced:
...
// Trace 1
trace {
add_busy_wait(4, 5, 3, 3); // no. of loop iterations is 2nd to last arg, so num_cycles = 3
}
...This trace is wrong, because if we look at the waveform (busy_wait.fst below), we see that the trace should be add_busy_wait(1, 2, 1, 3); add_busy_wait(4, 5, 2, 9) instead, i.e. this trace is missing the first cycle (doesn't cover the whole waveform). This violates the assumption that at any point in the waveform, there must be at most one protocol that can explain the signals (and since the only protocol in busy_wait.prot is add_busy_wait, the monitor is missing the first instance of add_busy_wait).
Looking at the logs, it appears that the trace only containing add_busy_wait(4, 5, 3, 3) was caused by a thread where the input parameters a, b were initially inferred as 1 and 2 (by reading off DUT.a & DUT.b during the first cycle in the waveform), but after a few iterations of the repeat loop, the values of DUT.a and DUT.b changed to 4 and 5 in the waveform, i.e. this is an example where a parameter's value was initially inferred, but then it changed due to the waveform signal changing.
I should mention this rebind_counts map doesn't change the "runtime" behavior of the monitor (i.e. how it infers parameter values) -- it is only used when there are multiple candidate traces and we want to reject some of them (e.g. traces which only contain one instance of add_busy_wait). For other tests where the parameter values do change over time (e.g. the WAL AXI-Stream example), since there is only one trace produced by the monitor for those (since the monitor waits exactly for ready and valid to both become 1 in the same cycle to detect the data transfer), this strategy doesn't get used.
There was a problem hiding this comment.
Looking at the logs, it appears that the trace only containing
add_busy_wait(4, 5, 3, 3)was caused by a thread where the input parametersa, bwere initially inferred as 1 and 2 (by reading offDUT.a&DUT.bduring the first cycle in the waveform), but after a few iterations of the repeat loop, the values ofDUT.aandDUT.bchanged to 4 and 5 in the waveform,
Should that not lead to a failure of the trace?
This is what should happen when there is, e.g., a Dut.a := a in the protocol:
- if
ais not defined, use the value ofDut.ato definea - if
ais defined, check the value ofaagainstDut.aif they disagree, then the protocol fails.
So then how would you ever have a rebind_count that isn't just zero?
There was a problem hiding this comment.
This is what should happen when there is, e.g., a Dut.a := a in the protocol:
- if a is not defined, use the value of Dut.a to define a
- if a is defined, check the value of a against Dut.a if they disagree, then the protocol fails.
This is mostly what is done in the current implementation in the main branch, although in the main branch of scheduler.rs here, if a is defined and trace(DUT.a) != previous inferred value for a, we update the value for a.
The reason for this is that WAL AXIS test implicitly relies on this behavior in order to work -- here is the protocol:
prot send_data<D: AXIS>(in data: u8) {
D.m_axis_tdata := data;
D.m_axis_tvalid := 1'b1;
while (!(D.m_axis_tready == 1'b1)) {
step();
}
// one cycle for the data transfer to take place
step();
}When a thread corresponding to send_data begins, it immediately infers data to be the value of D.m_axis_tdata at the time. However, the thread can stay in the while loop for many cycles, and after every step(), the current implementation of the monitor on the main branch re-performs the "compare data against the trace value for D.m_axis_tdata" check, and updates data to match D.m_axis_tdata, and this causes the WAL tests to currently pass. (As a result, rebind_count can be non-zero under this scheme.)
(Yesterday I tried enforcing a thread failure if trace(DUT.a) != previous inferred value for a but this caused the WAL tests to fail, so the rebind_counts approach was mostly a way to keep the existing monitor behavior while filtering out invalid traces from the add_busy_wait example that don't span the entire waveform. I'm happy to revert this change and try something else though!)
There was a problem hiding this comment.
The reason for this is that WAL AXIS test implicitly relies on this behavior in order to work -- here is the protocol:
That means that something is wrong with the WAL AXIS protocol.
There was a problem hiding this comment.
I looked at the WAL AXIS waveforms again and I noticed a bug that the existing approach might have caused:
Here is part of the trace produced for axis_truncated.prot (in axis_truncated.out) (with idle protocols explicitly printed out)
...
idle(); // [time: 1400ns -> 1408ns]
send_data(18); // [time: 1408ns -> 3272ns]
idle(); // [time: 3272ns -> 3280ns]
...This is the relevant section of the waveform corresponding to send_data(18) (between the two markers indicating 1408ns - 3272ns):
A thread is spawned at 1408ns, and at this time, data is 16, but because ready & valid are both 0 during this time, the thread keeps waiting (stuck in the while-loop) until 3264 ns (near the orange marker)
when ready and valid both become 1. During this time from 1408 - 3264ns, data changes from 16 -> 17 -> 18, and because the value of data at 3264 ns is 18, data is updated to be 18 after the next step().
The protocol that shows up in the trace (send_data(18)) is correct, since 18 is the value of data when both ready & valid are 1, but the timestamps are incorrect, since the starting timestamp of the thread 1480ns corresponds to when data = 16 still.
If we change the monitor behavior to more strictly align with this
if a is defined, check the value of a against Dut.a if they disagree, then the protocol fails
then the monitor should show a later starting timestamp for send_data(18), and at 1408ns (when data = 16), we would just have another idle protocol.
There was a problem hiding this comment.
A thread is spawned at 1408ns, and at this time,
datais 16, but becauseready&validare both 0 during this time, the thread keeps waiting (stuck in the while-loop) until 3264 ns (near the orange marker)
When valid is 0, then the send_data protocol should immediately fail. The protocol immediately constrains D.m_axis_tvalid := 1'b1; and that is never changed. So it cannot match any cycle where m_axis_tvalid is not 1.
There was a problem hiding this comment.
Gotcha I'll debug this thanks (removed all the rebind_count logic for now and set threads to fail if their inferred parameter values don't match the waveform value)
…line numbers instead
…nfcombinatorial explosion
…nferred value != waveform value)
…the current / next cycle) before emitting global monitor failure
|
Latest updates:
(More details about the bug in this PR comment) |
repeat ... loops repeat ... loops + Update monitor output on WAL examples

This PR extends the monitor to support
repeatloops by implementing the design discussed in #183 (see this comment for details about the design).Other miscellaneous changes:
macros.rswith some helper macros for logging that print out the line number, function name and file name (the defaultinfo!macro from thelogcrate doesn't do this)AugmentedTracetype from a type alias to a tuple struct so that we can define methods over it (seetypes.rs)Stmt::BoundedLooptoRepeatLoopin the interpreter codeCurrent status (EoD 2/26):
1. Loop with assignments
Consider
loop_with_assigns.prot:This is the original transaction file supplied to the interpreter:
The monitor currently produces the following output:
The issue is with the final spurious
loop_addprotocol in each of the traces suggested by the monitor. This is caused after a legitimateloop_calltransaction completes and triggers anExplicitFork(i.e. anotherloop_addprotocol) at time = 4:This new thread reads the values of
DUT.aandDUT.bat time 4, both of which are randomized values (caused by DontCare), it does astep()(time is now 5), butassert_eq(s, DUT.s)succeeds sinceDUT.sis still 30. This thread finishes and emitsloop_add(random_a, random_b, 1, 30).Edit: we discussed on Monday how this is fine (this is an issue with how the waveform is being produced by the interpreter), so it doesn't block this PR
2: Nested busy wait
Monitor hangs on this protocol due to exponential blowup in the no. of threads (this also happens for
push_pop_loop_empty.protandpush_pop_loop_not_empty.protfor the FIFO example):Edit: we discussed on Monday this is fine. I added a 5 second timeout for Turnt for these specific test cases and allowed them to fail in Turnt to allow CI to proceed.