Fix CDC issue in async FIFO: ensure that Gray code pointers are registered before resynchronisation. #45
Comments
Comment by Wren6991 Yep, looks like I've broken something :D Will update once I've got formal running on my own machine, and fixed the issue.
Comment by Wren6991 The reset values look nonsensical to me, e.g. 010 (bin) != 010 (gray). Maybe the fact it passed before was a lucky break due to the bin and gray pointers being driven by the same register, so their initial values were correct. I need to dig into this formal stuff more. Note: it fails during BMC.
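For reference, in the standard reflected-binary Gray code the encoding of 2 (0b010) is 0b011, so binary and Gray pointer registers that both reset to 0b010 cannot be mutually consistent. A quick plain-Python check (independent of nMigen) illustrates the point:

```python
def bin_to_gray(b: int) -> int:
    # Standard reflected-binary Gray encoding: g = b ^ (b >> 1)
    return b ^ (b >> 1)

# 0b010 (2) encodes to 0b011 (3), so identical binary/Gray reset
# values of 0b010 cannot both be correct.
assert bin_to_gray(0b010) == 0b011

# Successive Gray codes differ in exactly one bit position.
for b in range(15):
    diff = bin_to_gray(b) ^ bin_to_gray(b + 1)
    assert bin(diff).count("1") == 1
```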
Comment by codecov[bot] Codecov Report

```
@@            Coverage Diff             @@
##           master      #38      +/-   ##
==========================================
+ Coverage   85.52%   85.57%   +0.05%
==========================================
  Files          27       27
  Lines        3918     3932      +14
  Branches      765      765
==========================================
+ Hits         3351     3365      +14
  Misses        478      478
  Partials       89       89
==========================================
```

Continue to review full report at Codecov.
Comment by Wren6991 I've added an additional commit which asserts FIFO resets whilst […]. Since this fix is outside the scope of this PR, I am happy to open a separate one if you would prefer.
Comment by whitequark First, thanks for finding the bug! This is indeed very important, and I'm not sure how I missed it. It is definitely not present in Migen, it's an nMigen bug.
Unfortunately, you've significantly weakened the contract by adding a reset. Before your changes, the FIFO contract would test an arbitrary initial state (any block RAM contents and any pointers). After your changes, the initial state is always defined, so e.g. a FIFO that would pass just two elements through and then hang would still pass the test.
Comment by whitequark Can you check if the following diff also fixes the bug, please?

```diff
diff --git a/nmigen/lib/fifo.py b/nmigen/lib/fifo.py
index 3af5c51..d4f16e1 100644
--- a/nmigen/lib/fifo.py
+++ b/nmigen/lib/fifo.py
@@ -304,10 +304,8 @@ class AsyncFIFO(FIFOInterface):
             GrayEncoder(self._ctr_bits)
         produce_cdc = m.submodules.produce_cdc = \
             MultiReg(produce_w_gry, produce_r_gry, odomain="read")
-        m.d.comb += [
-            produce_enc.i.eq(produce_w_bin),
-            produce_w_gry.eq(produce_enc.o),
-        ]
+        m.d.comb  += produce_enc.i.eq(produce_w_bin)
+        m.d.write += produce_w_gry.eq(produce_enc.o)

         consume_r_bin = Signal(self._ctr_bits)
         consume_r_gry = Signal(self._ctr_bits)
@@ -316,17 +314,15 @@ class AsyncFIFO(FIFOInterface):
             GrayEncoder(self._ctr_bits)
         consume_cdc = m.submodules.consume_cdc = \
             MultiReg(consume_r_gry, consume_w_gry, odomain="write")
-        m.d.comb += [
-            consume_enc.i.eq(consume_r_bin),
-            consume_r_gry.eq(consume_enc.o),
-        ]
+        m.d.comb += consume_enc.i.eq(consume_r_bin)
+        m.d.read += consume_r_gry.eq(consume_enc.o)

         m.d.comb += [
             self.writable.eq(
-                (produce_w_gry[-1] == consume_w_gry[-1]) |
-                (produce_w_gry[-2] == consume_w_gry[-2]) |
-                (produce_w_gry[:-2] != consume_w_gry[:-2])),
-            self.readable.eq(consume_r_gry != produce_r_gry)
+                (produce_enc.o[-1] == consume_w_gry[-1]) |
+                (produce_enc.o[-2] == consume_w_gry[-2]) |
+                (produce_enc.o[:-2] != consume_w_gry[:-2])),
+            self.readable.eq(consume_enc.o != produce_r_gry)
         ]

         do_write = self.writable & self.we
diff --git a/nmigen/test/test_lib_fifo.py b/nmigen/test/test_lib_fifo.py
index c7f4c28..e70d7be 100644
--- a/nmigen/test/test_lib_fifo.py
+++ b/nmigen/test/test_lib_fifo.py
@@ -262,8 +262,8 @@ class FIFOFormalCase(FHDLTestCase):
         # self.assertFormal(FIFOModelEquivalenceSpec(fifo, rdomain="read", wdomain="write"),
         #                   mode="bmc", depth=fifo.depth * 3 + 1)
         self.assertFormal(FIFOContractSpec(fifo, rdomain="read", wdomain="write",
-                                           bound=fifo.depth * 4 + 1),
-                          mode="hybrid", depth=fifo.depth * 4 + 1)
+                                           bound=(fifo.depth + 1) * 4),
+                          mode="hybrid", depth=(fifo.depth + 1) * 4)

     def test_async(self):
         self.check_async_fifo(AsyncFIFO(width=8, depth=4))
```
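As a sanity check of the scheme in the diff (a behavioural plain-Python model I sketched, not nMigen code): feeding the encoder from the counter's next value and registering the encoder output means that, after every write-clock edge, the registered Gray pointer is exactly the Gray encoding of the current binary pointer, so the full/empty comparisons are unchanged while the CDC path now starts at a register:

```python
def bin_to_gray(b: int) -> int:
    # Standard reflected-binary Gray encoding
    return b ^ (b >> 1)

WIDTH = 4
MASK = (1 << WIDTH) - 1

bin_ptr = 0                      # binary produce counter (registered)
gray_ptr = bin_to_gray(bin_ptr)  # registered Gray pointer (reset values must agree)

for _ in range(40):
    nxt = (bin_ptr + 1) & MASK   # counter *input* (next value)
    # On each write-clock edge, both registers update together:
    bin_ptr = nxt
    gray_ptr = bin_to_gray(nxt)  # encoder fed from the input, output registered
    # Viewed synchronously, the registered Gray pointer always equals
    # the Gray encoding of the current binary pointer.
    assert gray_ptr == bin_to_gray(bin_ptr)
```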
Comment by Wren6991
Possibly distracted by CC's name!
Ah okay, I see the problem. Wouldn't you normally test this with a separate induction pass? The failure I saw was due to the design starting in an unreachable state. You are talking about reachable states, which should be testable by induction after a sufficiently long BMC. I haven't used SymbiYosys before, but I have used yosys-smtbmc with a different solver backend, and always performed BMC followed by induction.
Comment by whitequark
Yes, but I never got induction to work on all the FIFO designs. This oddball BMC is the best I could do.
Comment by Wren6991 That diff fixes the bug, but does introduce an extra cycle of latency in each direction. Immediate problem is defused :)
Sounds like your experience with formal has been similar to mine... if this would be a valuable thing, then I can have a play with it. It would be nice at some point in the future to win back that additional latency cycle, without breaking the test! Thank you for taking a look at my patch.
Comment by whitequark
Oh, I see! I will do something about it. Thanks for pointing that out!
This is an open issue, #29. For now I will do the copy-pastey thing.
Comment by Wren6991
Ah okay. I had a quick look at that area of code (Module etc.) and decided I needed to learn some GoF patterns first! If it's any help at all, this is what my "flat" patch looked like: link to commit. But this had the same issue with the FIFO contract test.
Comment by whitequark
I'm not really a huge fan of GoF. That code doesn't need patterns so much as one horrible hack, and I can't decide exactly how horrible it's going to be. So it's not done just yet.
Comment by whitequark Fixed by 4027317, without any added latency. I've solved this properly by adding appropriate […]
Issue by Wren6991
Sunday Mar 03, 2019 at 15:56 GMT
Originally opened as m-labs/nmigen#38
The AsyncFIFO class is subtly different from the Clifford Cummings paper referenced in the source code. Here is a diagram showing the difference:
In the current nMigen code, there is a path that goes from the `produce` counter register in the write domain, through the `GrayEncoder`, to the `MultiReg` in the read domain. The problem is that there is a narrow window after a write clock edge where all of the Gray pointer bits may toggle, due to static hazards in the encoder, since all encoder inputs can toggle simultaneously. If you sample this with an asynchronous clock, it's possible to see a completely wrong value for the `produce` pointer, which can cause irreversible corruption/loss of data.

I've only shown this for the `produce` pointer, but the same is true in the other direction too. The fix is to drive the Gray encoder from the binary counter register's input, rather than its output, and register the encoder output before resynchronising. From a synchronous point of view this behaves the same as the old Gray count, so the full/empty logic stays the same.

This bug will only be observed with low probability, and may not be observed at all, e.g. if your two asynchronous clocks share some fixed rational relationship with a common clock source through PLLs, which affects the distribution of the relative phase of one clock observed at the rising edge of the other. However, it is a legitimate bug, with fatal consequences.
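The hazard argument can be seen numerically (plain Python, nothing nMigen-specific): a binary increment can flip every counter bit at once, so a purely combinational encoder fed from that register momentarily sees many inputs changing together, even though the Gray value itself only ever moves by one bit per step:

```python
def bin_to_gray(b: int) -> int:
    # Standard reflected-binary Gray encoding
    return b ^ (b >> 1)

WIDTH = 3
MASK = (1 << WIDTH) - 1

worst = 0
for b in range(MASK + 1):
    nxt = (b + 1) & MASK
    toggled = bin(b ^ nxt).count("1")  # binary bits flipping on this edge
    worst = max(worst, toggled)
    # The Gray code still moves by exactly one bit per step...
    assert bin(bin_to_gray(b) ^ bin_to_gray(nxt)).count("1") == 1

# ...but at 0b011 -> 0b100 (and at the 0b111 -> 0b000 wrap) all three
# binary bits toggle at once, which is what lets a combinational
# encoder fed from the binary register glitch.
assert worst == WIDTH
```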
This code is not ready to merge:
- `MultiReg`s should ideally be resettable (although that is a different conversation!)
- […] `GrayBinCounter`, which is currently only used by `AsyncFIFO`. It contains some extra detail, like putting a `no_retiming` attribute on the Gray register.
- […] `__init__` of this class, like `MultiReg`, but I saw references elsewhere to the `DomainRenamer` class, which seems cleaner. However, it seems you can only call this on `Fragment`s, not `Module`s, and I currently don't know what a `Fragment` is :)
- […] `AsyncFIFO`. It just looks a bit copy-pastey.

I'm really new to nMigen, so this code definitely needs a close review.
Wren6991 included the following code: https://github.com/m-labs/nmigen/pull/38/commits