feed.xml

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>The ZipCPU by Gisselquist Technology</title>
    <description>The ZipCPU blog, featuring how to discussions of FPGA and soft-core CPU design.  This site will be focused on Verilog solutions, using exclusively OpenSource IP products for FPGA design.  Particular focus areas include topics often left out of more mainstream FPGA design courses such as how to debug an FPGA design.
</description>
    <link>https://zipcpu.com/</link>
    <atom:link href="https://zipcpu.com/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Sat, 06 Jul 2024 17:20:39 -0400</pubDate>
    <lastBuildDate>Sat, 06 Jul 2024 17:20:39 -0400</lastBuildDate>
    <generator>Jekyll v4.2.0</generator>
    <image>
      <url>https://zipcpu.com/img/gt-rss.png</url>
      <title></title>
      <link></link>
    </image>
    
      <item>
        <title>My Personal Journey in Verification</title>
        <description>&lt;p&gt;This week, I’ve been testing a CI/CD pipeline.  This has been my opportunity
to shake the screws and kick the tires on what should become a new verification
product shortly.&lt;/p&gt;

&lt;p&gt;I thought that a good design to check might be my
&lt;a href=&quot;https://github.com/ZipCPU/sdsdpi&quot;&gt;SDIO project&lt;/a&gt;.  It has roughly all the
pieces in place, and so makes sense for an automated testing pipeline.&lt;/p&gt;

&lt;p&gt;This weekend, the CI project engineer shared with me:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;It’s literally the first time I get to know a good hardware project needs
such many verifications and testings!  There’s even a real SD card
simulation model and RW test…&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;After reminiscing about this for a bit, I thought it might be worth taking a
moment to tell how I got here.&lt;/p&gt;

&lt;h2 id=&quot;verification-the-goal&quot;&gt;Verification: The Goal&lt;/h2&gt;

&lt;p&gt;Perhaps the best way to explain the “goal” of verification is by way of an
old “war story”–as we used to call them.&lt;/p&gt;

&lt;p&gt;At one time, I was involved with a DOD unit whose whole goal and purpose was
to build quick reaction hardware capabilities for the warfighter.  We bragged
about our ability to respond to a call on a Friday night with a new product
shipped out on a C-130 before the weekend was over.&lt;/p&gt;

&lt;p&gt;Anyone who has done engineering for a while will easily recognize that this
sort of concept violates all the good principles of engineering.  There’s no
time for a requirements review.  There’s no time for prototyping–or perhaps
there is, to the extent that it’s always the &lt;em&gt;prototype&lt;/em&gt; that heads out the
door to the warfighter as if it were a &lt;em&gt;product&lt;/em&gt;.  There’s no time to build a
complete test suite, to verify the new capability against all things that could
go wrong.  However, we’d often get only one chance to do this right.&lt;/p&gt;

&lt;p&gt;Now, how do you accomplish quality engineering in that kind of environment?&lt;/p&gt;

&lt;p&gt;The key to making this sort of shop work lay in the “warehouse”, and what
sort of capabilities we might have “lying on the shelf” as we called it.
Hence, we’d spend our time polishing prior capabilities, as well as
anticipating new requirements.  We’d then spend our time building, verifying,
and testing these capabilities against phantom requirements, in the hopes that
they’d be close to what we’d need to build should a real requirement arise.
We’d then place these concept designs in the “warehouse”, and show them off
to anyone who came to visit wondering what it was that our team was able to
accomplish.  Then, when a new requirement arose, we’d go into this “warehouse”
and find whatever capability was closest to what the customer required and
modify it to fit the mission requirement.&lt;/p&gt;

&lt;p&gt;That was how we achieved success.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/vlog-wait/rule-of-gold.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;The same applies in digital logic design.  You want to have a good set of
tried, trusted, and true components in your “library” so that whenever a new
customer comes along, you can leverage these components quickly to meet his
needs.  This is why I’ve often said that well written, well tested, well
verified design components are gold in this business.  Such components allow
you to go from zero to product in short order.  Indeed, the more well-tested
components you have that you can
&lt;a href=&quot;/blog/2020/01/13/reuse.html&quot;&gt;reuse&lt;/a&gt;, the faster you’ll be
to market with any new need, and the cheaper it will cost you to get there.&lt;/p&gt;

&lt;p&gt;That’s therefore the ultimate goal: a library of
&lt;a href=&quot;/blog/2020/01/13/reuse.html&quot;&gt;reusable&lt;/a&gt;
components that can be quickly composed into new products for customers.&lt;/p&gt;

&lt;p&gt;As I’ve tried to achieve this objective over the years, my approach to
component verification has changed, or rather grown, many times over.&lt;/p&gt;

&lt;h2 id=&quot;hardware-verification&quot;&gt;Hardware Verification&lt;/h2&gt;

&lt;p&gt;When I first started learning FPGA design, I understood nothing about
simulation.  Rather than learning how to do simulation properly, I instead
learned quickly how to test my designs in hardware.  Most of these designs
were DSP based.  (My background was DSP, so this made sense …)  Hence,
the following approach tended to work for me:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;I created access points in the hardware that allowed me to read and write
registers at key locations within the design.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;One of these “registers” I could write to controlled the inputs to my DSP
pipeline.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Another register, when written to, would cause the design to “step” the
entire DSP pipeline as if a new sample had just arrived from the A/D.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;A set of registers within the design then allowed me to read the state of
the entire pipeline, so I could do debugging.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This worked great for “stepping” through designs.  When I moved to processing
real-time information, such as the A/D results from the antenna connected to
the design, I build an internal logic analyzer to catch and capture key
signals along the way.&lt;/p&gt;

&lt;p&gt;I called this “Hardware in the loop testing”.&lt;/p&gt;

&lt;p&gt;Management thought I was a genius.&lt;/p&gt;

&lt;p&gt;This approach worked … for a while.  Then I started realizing how painful it
was.  I think the transition came when I was trying to debug
&lt;a href=&quot;/2018/10/02/fft.html&quot;&gt;my FFT&lt;/a&gt; by writing test vectors to
an Arty A7 circuit board via UART, and reading the results back to display
them on my screen. Even with the hardware in the loop, hitting all the test
vectors was painfully slow.&lt;/p&gt;

&lt;p&gt;Eventually, I had to search for a new and better solution.  This was just too
slow.  Later on, I would start to realize that this solution didn’t catch
enough bugs–but I’ll get to that in a bit.&lt;/p&gt;

&lt;h2 id=&quot;happy-path-simulation-testing&quot;&gt;Happy Path Simulation Testing&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Happy_path&quot;&gt;“Happy path” testing&lt;/a&gt;
is a reference to simply testing working paths
through a project’s environment.  To use an aviation analogy, a &lt;a href=&quot;https://en.wikipedia.org/wiki/Happy_path&quot;&gt;“happy path”
test&lt;/a&gt;
might make sure the ground avoidance radar never alerted when you
weren’t close to the ground.  It doesn’t make certain that the radar
necessarily does the right thing when you are close to the ground.&lt;/p&gt;

&lt;p&gt;So, let’s talk about my next project: the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Verification of the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt;
began with an &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/a20b6064ea794d66fdeb2e00929287d7f2dc9ac6/bench/asm/simtest.s&quot;&gt;assembly
program&lt;/a&gt;
the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; would run.  The
&lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/a20b6064ea794d66fdeb2e00929287d7f2dc9ac6/bench/asm/simtest.s&quot;&gt;program&lt;/a&gt;
was designed to test all the instructions of the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt;
with sufficient fidelity to know when/if the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt; worked.&lt;/p&gt;

&lt;p&gt;The test had one of two outcomes.  If the program halted, then the test was
considered a success.  If it detected an error, the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt; would execute a
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BUSY&lt;/code&gt; instruction (i.e. jump to current address) and then perpetually loop.
My test harness could then detect this condition and end with a failing exit
code.&lt;/p&gt;

&lt;p&gt;When the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; acquired a software
tool chain (GCC+Binutils) and C-library support, this &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/a20b6064ea794d66fdeb2e00929287d7f2dc9ac6/bench/asm/simtest.s&quot;&gt;assembly
program&lt;/a&gt;
was abandoned and replaced with a &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/a20b6064ea794d66fdeb2e00929287d7f2dc9ac6/sim/zipsw/cputest.c&quot;&gt;similar program in
C&lt;/a&gt;.
While I still use &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/a20b6064ea794d66fdeb2e00929287d7f2dc9ac6/sim/zipsw/cputest.c&quot;&gt;this
program&lt;/a&gt;,
it’s no longer the core of the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;’s
verification suite.  Instead, I tend to use it to shake out any bugs in any
new environment the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; might be
placed into.&lt;/p&gt;

&lt;p&gt;This approach failed horribly, however, when I tried integrating an &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/a20b6064ea794d66fdeb2e00929287d7f2dc9ac6/rtl/core/pfcache.v&quot;&gt;instruction
cache&lt;/a&gt;
into the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;.  I built the
&lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/a20b6064ea794d66fdeb2e00929287d7f2dc9ac6/rtl/core/pfcache.v&quot;&gt;instruction
cache&lt;/a&gt;.
I tested the &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/a20b6064ea794d66fdeb2e00929287d7f2dc9ac6/rtl/core/pfcache.v&quot;&gt;instruction
cache&lt;/a&gt;
in isolation.  I tested the
&lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/a20b6064ea794d66fdeb2e00929287d7f2dc9ac6/rtl/core/pfcache.v&quot;&gt;cache&lt;/a&gt;
as part of the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt;.  I convinced myself that it worked.
Then I placed my “working” design onto hardware and &lt;a href=&quot;/zipcpu/2017/12/28/ugliest-bug.html&quot;&gt;all
hell broke lose&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This was certainly not “the way.”&lt;/p&gt;

&lt;h2 id=&quot;formal-verification&quot;&gt;Formal Verification&lt;/h2&gt;

&lt;p&gt;I was then asked to &lt;a href=&quot;/blog/2017/10/19/formal-intro.html&quot;&gt;review a new, open source, formal verification tool called
SymbiYosys&lt;/a&gt;.  The tool
handed my cocky attitude back to me, and took my pride down a couple steps.  In
particular, I found a bunch of bugs in a FIFO I had used for years.  The bugs
had never shown up in hardware testing (that I had noticed at least), and
certainly hadn’t shown up in any of my &lt;a href=&quot;https://en.wikipedia.org/wiki/Happy_path&quot;&gt;“Happy path”
testing&lt;/a&gt;.  This left me wondering,
how many other bugs did I have in my designs that I didn’t know about?&lt;/p&gt;

&lt;p&gt;I then started &lt;a href=&quot;/blog/2018/01/22/formal-progress.html&quot;&gt;working through my previous projects, formally verifying all my
prior work&lt;/a&gt;.  In every
case, I found more bugs.  By the time I got to the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;–&lt;a href=&quot;/blog/2018/04/02/formal-cpu-bugs.html&quot;&gt;I found a myriad of bugs
in what I thought was a “working”&lt;/a&gt;
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I’d like to say that the quality of my IP went up at this point.  I was
certainly finding a lot of bugs I’d never found before by using formal methods.
I now knew, for example, how to guarantee I’d never have any more of those
cache bugs I’d had before.&lt;/p&gt;

&lt;p&gt;So, while it is likely that my IP quality was going up, the unfortunate
reality was that I was still finding bugs in my “formally verified”
IP–although not nearly as many.&lt;/p&gt;

&lt;p&gt;A &lt;a href=&quot;/formal/2020/06/12/four-keys.html&quot;&gt;couple of improvements&lt;/a&gt;
helped me move forward here.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Bidirectional formal property sets&lt;/p&gt;

    &lt;p&gt;The biggest danger in formal verification is that you might &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;assume()&lt;/code&gt;
something that isn’t true.  The first way to limit this is to make
sure you never &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;assume()&lt;/code&gt; a property within the design, but rather you
only &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;assume()&lt;/code&gt; properties of inputs–never outputs, and never local
registers.&lt;/p&gt;

    &lt;p&gt;But how do you know when you’ve assumed too much?  This can be a challenge.&lt;/p&gt;

    &lt;p&gt;One of the best ways I’ve found to do this is to create a bidirectional
property set.  A bus master, for example, would make assumptions about
how the slave would respond.  A similar property set for the bus slave
would make assumptions about what the master would do.  Further, the slave
would turn the master’s assumptions into verifiable assertions–guaranteeing
that the master’s assumptions were valid.  If you can use the same property
set in this manner for both master and slave, save that you swap
assumptions and assertions, then you can verify both in isolation to
include only assuming those things that can be verified elsewhere.&lt;/p&gt;

    &lt;p&gt;Creating such property sets for both AXI-Lite and AXI led me to find
many bugs in Xilinx IP.  This alone suggested that I was on the “right path”.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Cover checking&lt;/p&gt;

    &lt;p&gt;I also learned to use &lt;a href=&quot;/formal/2018/07/14/dev-cycle.html&quot;&gt;formal coverage
checking&lt;/a&gt;, in
addition to straight assertion
based verification.  Cover checks weren’t the end all, but they could
be useful in some key situations.  For example, a quick cover check might
help you discover that you had gotten the reset polarity wrong, and so
all of your formal assertions were passing because your design was assumed
to be held in reset.  (This has happened to me more than once.  Most
recently, the &lt;a href=&quot;/blog/2024/06/13/kimos.html&quot;&gt;cost was a couple of months
delay&lt;/a&gt; on what should’ve
otherwise been a straight forward hardware bringup–but that wasn’t really
a &lt;em&gt;formal&lt;/em&gt; verification issue.)&lt;/p&gt;

    &lt;p&gt;For a while, I also &lt;a href=&quot;/formal/2018/07/14/dev-cycle.html&quot;&gt;used cover checking to quickly discover (with minimal
work) how a design component might work within a larger
environment&lt;/a&gt;.  I’ve
since switched to simulation checking (with assertions enabled) for my
most recent examples of this type of work, but I do still find it valuable.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;/blog/2018/03/10/induction-exercise.html&quot;&gt;Induction&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;&lt;a href=&quot;/blog/2018/03/10/induction-exercise.html&quot;&gt;Induction&lt;/a&gt; isn’t
really a “new” thing I learned along the way, but it is worth mentioning
specially.  As I learned formal verification, I learned to use
&lt;a href=&quot;/blog/2018/03/10/induction-exercise.html&quot;&gt;induction&lt;/a&gt;
right from the start and so I’ve tended to use
&lt;a href=&quot;/blog/2018/03/10/induction-exercise.html&quot;&gt;induction&lt;/a&gt;
in every proof I’ve ever done.  It’s just become my normal practice from day
one.&lt;/p&gt;

    &lt;p&gt;&lt;a href=&quot;/blog/2018/03/10/induction-exercise.html&quot;&gt;Induction&lt;/a&gt;,
however, takes a lot of work.  Sometimes it takes so much work I wonder
if there’s really any value in it.  Then I tend to find some key bug or
other–perhaps a buffer overflow or something–some bug I’d have never found
without
&lt;a href=&quot;/blog/2018/03/10/induction-exercise.html&quot;&gt;induction&lt;/a&gt;.
That alone keeps me running
&lt;a href=&quot;/blog/2018/03/10/induction-exercise.html&quot;&gt;induction&lt;/a&gt;.
every time I can.  Even better, once the
&lt;a href=&quot;/blog/2018/03/10/induction-exercise.html&quot;&gt;induction&lt;/a&gt;.
proof is complete, you can often &lt;a href=&quot;/formal/2019/08/03/proof-duration.html&quot;&gt;trim the entire formal proof down from
15-20 minutes down to less than a single
minute&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Contract checking&lt;/p&gt;

    &lt;p&gt;My initial formal proofs were haphazard.  I’d throw assertions at the wall
and see what I could find.  Yes, I found bugs.  However, I never really had
the confidence that I was “proving” a design worked.  That is, not until I
learned of the idea of a “formal contract”.  The “formal contract” simply
describes the essence of how a component worked.&lt;/p&gt;

    &lt;p&gt;For example, in a memory system, the formal contract might have the solver
track a single value of memory.  When written to, the value should change.
When read, the value should be returned.  If this contract holds for all such
memory addresses, then the memory acts (as you would expect) … like a
&lt;em&gt;memory&lt;/em&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Parameter checks&lt;/p&gt;

    &lt;p&gt;For a while, I was maintaining &lt;a href=&quot;https://github.com/ZipCPU/zbasic&quot;&gt;“ZBasic”–a basic ZipCPU
distribution&lt;/a&gt;.  This was where I did all
my simulation based testing of the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;.  The problem was, this
approach didn’t work.  Sure, I’d test the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt; in one configuration, get it
to work, and then put it down believing the
“&lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt;” worked.  Some time later,
I’d try the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt; in a different
configuration–such as pipelined vs non-pipelined, and … it
would fail in whatever mode it had not been tested in.  The problem with the
&lt;a href=&quot;https://github.com/ZipCPU/zbasic&quot;&gt;ZBasic approach&lt;/a&gt; is that it tended to only
check one mode–leaving all of the others unchecked.&lt;/p&gt;

    &lt;p&gt;This lead my to adjust the proofs of the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; so that the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt; would at least be formally
verified with as many parameter configurations as I could to make sure it
would work in all environments.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I’ve written more about &lt;a href=&quot;/formal/2020/06/12/four-keys.html&quot;&gt;these parts of a proof some time
ago&lt;/a&gt;, and I still stand
by them today.&lt;/p&gt;

&lt;p&gt;Yes, formal verification is hard work.  However, a well verified design is
highly valuable–on the shelf, waiting for that new customer requirement to
come in.&lt;/p&gt;

&lt;p&gt;The problem with all this formal verification work lies in its (well known)
Achilles heel.  Because formal verification includes an exhaustive
combinatorial search for bugs across all potential design inputs and states,
it can be computationally expensive.  Yeah, it can take a while.  To reduce
this expense, it’s important to limit the scope of what is verified.  As a
result, I tend to verify design &lt;em&gt;components&lt;/em&gt; rather than entire designs.  This
leaves open the possibility of a failure in the logic used to connect all
these smaller, verified components together.&lt;/p&gt;

&lt;h2 id=&quot;autofpga-and-better-crossbars&quot;&gt;AutoFPGA and Better Crossbars&lt;/h2&gt;

&lt;p&gt;Sure enough, the next class of bugs I had to deal with were integration bugs.&lt;/p&gt;

&lt;p&gt;I had to deal with several.  Common bugs included:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Using unnamed ports, and connecting module ports to the wrong signals.&lt;/p&gt;

    &lt;p&gt;At one point, I decided the
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;
“stall” port should come before the
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;
acknowledgment port.  Now, how many designs had to change to accommodate
that?&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;I had a bunch of problems with my &lt;a href=&quot;/blog/2017/06/22/simple-wb-interconnect.html&quot;&gt;initial interconnect
design&lt;/a&gt;
methodology.  Initially, I used the slave’s
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;
strobe signal as an address decoding signal.  I then had a bug where the
address would move off of the slave of interest, and the acknowledgment
was never returned.  The result of that bug was that the design hung any
time I tried to read the entirety of &lt;a href=&quot;/blog/2019/03/27/qflexpress.html&quot;&gt;flash
memory&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;Think about how much simulation time and effort I had to go through to
simulate reading an &lt;em&gt;entire&lt;/em&gt; &lt;a href=&quot;/blog/2019/03/27/qflexpress.html&quot;&gt;flash
memory&lt;/a&gt;–just to find
this bug at the end of it.  Yes, it was painful.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Basically, when connecting otherwise “verified” modules together by hand,
I had problems where the result wasn’t reliably working.&lt;/p&gt;

&lt;p&gt;The first and most obvious solution to something like this is to use a linting
tool, such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;verilator -Wall&lt;/code&gt;. 
&lt;a href=&quot;https://www.veripool.org/verilator/&quot;&gt;Verilator&lt;/a&gt; can find things like
unconnected pins and such.  That’s a help, but I had been doing that from
early on.&lt;/p&gt;

&lt;p&gt;My eventual solution was twofold.  First, I redesigned my &lt;a href=&quot;/blog/2019/07/17/crossbar.html&quot;&gt;bus
interconnect&lt;/a&gt; from the
top to the bottom.  You can find the new and redesigned
&lt;a href=&quot;/blog/2019/07/17/crossbar.html&quot;&gt;interconnect&lt;/a&gt; components
in my &lt;a href=&quot;https://github.com/ZipCPU/wb2axip&quot;&gt;wb2axip repository&lt;/a&gt;.  Once these
components were verified, I then had a proper guarantee: all masters would get
acknowledgments (or errors) from all slave requests they made.  Errors would
no longer be lost.  Attempts to interact with a non-existent slave would
(properly) return bus errors.&lt;/p&gt;

&lt;p&gt;To deal with problems where signals were connected incorrectly, I built a tool
I call &lt;a href=&quot;/zipcpu/2017/10/05/autofpga-intro.html&quot;&gt;AutoFPGA&lt;/a&gt; to
connect components into designs.  A special tag given to the tool would
immediately connect all bus signals to a bus component–whether it be a slave
or master, whether it be connected to a
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;,
&lt;a href=&quot;/formal/2018/12/28/axilite.html&quot;&gt;AXI-Lite&lt;/a&gt;, or
&lt;a href=&quot;/formal/2019/05/13/axifull&quot;&gt;AXI&lt;/a&gt; bus.  This required that my
slaves followed one of two conventions.  Either all the bus ports had to
follow a basic port ordering convention, or they needed to follow a bus
naming convention.  Ideally, a slave should follow both.  Further, after
finding even more port connection bugs, I’m slowly moving towards the practice
of naming all of my port connections.&lt;/p&gt;

&lt;p&gt;This works great for composing designs of bus components.  Almost all of my
designs now use this approach, and only a few (mostly test bench) designs
remain where I connect bus components by hand manually.&lt;/p&gt;

&lt;h2 id=&quot;mcy&quot;&gt;MCY&lt;/h2&gt;

&lt;p&gt;At one time along the way, I was asked to review &lt;a href=&quot;https://github.com/YosysHQ/mcy&quot;&gt;MCY: Mutation Coverage with
Yosys&lt;/a&gt;.  My review back to the team was …
mixed.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/YosysHQ/mcy&quot;&gt;MCY&lt;/a&gt;
works by intentionally breaking your design.  Such changes to the design are
called “mutations”.  The goal is to determine whether or not the mutated
(broken) design will trigger a test failure.  In this fashion, the test suite
can be evaluated.  A “good” test suite will be able to find any mutation.
Hence, &lt;a href=&quot;https://github.com/YosysHQ/mcy&quot;&gt;MCY&lt;/a&gt;
allows you to measure how good your test suite is in the first place.&lt;/p&gt;

&lt;p&gt;Upon request, I tried &lt;a href=&quot;https://github.com/YosysHQ/mcy&quot;&gt;MCY&lt;/a&gt; with the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;.  This turned into a bigger
challenge than I had expected.  Sure, &lt;a href=&quot;https://github.com/YosysHQ/mcy&quot;&gt;MCY&lt;/a&gt;
works with &lt;a href=&quot;https://github.com/steveicarus/iverilog&quot;&gt;Icarus Verilog&lt;/a&gt;,
&lt;a href=&quot;https://www.veripool.org/verilator/&quot;&gt;Verilator&lt;/a&gt;, and even (perhaps) some other
(not so open) simulators as well.  However, when I ran a design under
&lt;a href=&quot;https://github.com/YosysHQ/mcy&quot;&gt;MCY&lt;/a&gt;, my simulations tended to find only a
(rough) 70% of any mutations.  The formal proofs, however, could find 95-98% of
any mutations.&lt;/p&gt;

&lt;p&gt;That’s good, right?&lt;/p&gt;

&lt;p&gt;Well, not quite.  The problem is that I tend to place all of my formal
logic in the same file as the component that would be mutated.  In order to
keep the mutation engine from mutating the formal properties, I had to remove
the formal properties from the file to be mutated into a separate file.
Further, I then had to access the values that were to be assumed or asserted
external from the file under test using something often known as “dot notation”.
While (System)Verilog allows such descriptions natively, there weren’t any open
source tools that allowed such external formal property descriptions.
(Commercial tools allowed this, just not the open source
&lt;a href=&quot;https://github.com/YosysHQ/sby&quot;&gt;SymbiYosys&lt;/a&gt;.) This left me stuck with a couple
of unpleasant choices:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;I could remove the ability of the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;
(or whatever design) to be formally verified with Open Source tools,&lt;/li&gt;
  &lt;li&gt;I could give up on using
&lt;a href=&quot;/blog/2018/03/10/induction-exercise.html&quot;&gt;induction&lt;/a&gt;,&lt;/li&gt;
  &lt;li&gt;I could use &lt;a href=&quot;https://github.com/YosysHQ/mcy&quot;&gt;MCY&lt;/a&gt; with simulation only, or&lt;/li&gt;
  &lt;li&gt;I could choose to not use &lt;a href=&quot;https://github.com/YosysHQ/mcy&quot;&gt;MCY&lt;/a&gt; at all.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is why I don’t use &lt;a href=&quot;https://github.com/YosysHQ/mcy&quot;&gt;MCY&lt;/a&gt;.  It may be a
“good” tool, but it’s just not for me.&lt;/p&gt;

&lt;p&gt;What I did learn, however, was that my
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; test suite was checking the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt;’s functionality nicely–just not
the debugging port.  Indeed, none of my tests checked the debugging port to the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt;
at all.  As a result, none of the (simulation-based) mutations of the
debugging port were ever caught.&lt;/p&gt;

&lt;p&gt;Lesson learned?  My test suite still wasn’t good enough.  Sure, the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt; might
“work” today, but how would I know some change in the future wouldn’t break it?&lt;/p&gt;

&lt;p&gt;I needed a better way of knowing whether or not my test suite was good enough.&lt;/p&gt;

&lt;h2 id=&quot;coverage-checking&quot;&gt;Coverage Checking&lt;/h2&gt;

&lt;p&gt;Sometime during this process I discovered
&lt;a href=&quot;https://en.wikipedia.org/wiki/Code_coverage&quot;&gt;coverage checking&lt;/a&gt;.
&lt;a href=&quot;https://en.wikipedia.org/wiki/Code_coverage&quot;&gt;Coverage checking&lt;/a&gt;.
is a process of automatically watching over all of your simulation based tests
to see which lines get executed and which do not.  Depending on the tool,
coverage checks can also tell whether particular signals are ever flipped or
adjusted during simulation.  A good coverage check, therefore, can provide
some level of indication of whether or not all control paths within a design
have been exercised, and whether or not all signals have been toggled.&lt;/p&gt;

&lt;p&gt;Coverage metrics are actually kind of nice in this regard.&lt;/p&gt;

&lt;p&gt;Sadly, coverage checking isn’t as good as mutation coverage, but … it’s
better than nothing.&lt;/p&gt;

&lt;p&gt;Consider a classic coverage failure: many of my simulations check for
AXI &lt;a href=&quot;https://en.wikipedia.org/wiki/Back_pressure&quot;&gt;backpressure&lt;/a&gt;.  Such
&lt;a href=&quot;https://en.wikipedia.org/wiki/Back_pressure&quot;&gt;backpressure&lt;/a&gt; is generated when
either &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BVALID &amp;amp;&amp;amp; !BREADY&lt;/code&gt;, or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RVALID &amp;amp;&amp;amp; !RREADY&lt;/code&gt;.  If your design is to
follow the AXI specification, it should be able to handle
&lt;a href=&quot;https://en.wikipedia.org/wiki/Back_pressure&quot;&gt;backpressure&lt;/a&gt;
properly.  That is, if you hold &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;!BREADY&lt;/code&gt; long enough, it should be possible
to force &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;!AWREADY&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;!WREADY&lt;/code&gt;.  Likewise, it should be possible to hold
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RREADY&lt;/code&gt; low long enough that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ARREADY&lt;/code&gt; gets held low.  A well verified,
bug-free design should be able to deal with these conditions.&lt;/p&gt;

&lt;p&gt;However, a “good” design should never create any significant
&lt;a href=&quot;https://en.wikipedia.org/wiki/Back_pressure&quot;&gt;backpressure&lt;/a&gt;.
Hence, if you build a simulation environment from “good” working components,
you aren’t likely to see much
&lt;a href=&quot;https://en.wikipedia.org/wiki/Back_pressure&quot;&gt;backpressure&lt;/a&gt;.  How then should a
component’s &lt;a href=&quot;https://en.wikipedia.org/wiki/Back_pressure&quot;&gt;backpressure&lt;/a&gt;
capability be tested?&lt;/p&gt;

&lt;p&gt;My current solution here is to test
&lt;a href=&quot;https://en.wikipedia.org/wiki/Back_pressure&quot;&gt;backpressure&lt;/a&gt;
via formal methods, with the unfortunate consequence that some conditions
will never get tested under simulation.  The result is that I’ll never get
to 100% coverage with this approach.&lt;/p&gt;

&lt;p&gt;A second problem with coverage regards the unused signals.  For example,
AXI-Lite has two signals, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AWPROT&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ARPROT&lt;/code&gt;, that are rarely used by
any of my designs.  However, they are official AXI-Lite (and AXI) signals.
As a result,
&lt;a href=&quot;/zipcpu/2017/10/05/autofpga-intro.html&quot;&gt;AutoFPGA&lt;/a&gt;
will always try to connect them to an AXI-Lite (or AXI) port, yet none of my
designs use these.  This leads to another set of exceptions that needs to be
made when measuring coverage.&lt;/p&gt;

&lt;p&gt;So, coverage metrics aren’t perfect.  Still, they can help me find
what parts of the design are (and are not) being tested well.  This can then
help feed into better (and more complete) test design.&lt;/p&gt;

&lt;p&gt;That’s the good news.  Now let’s talk about some of the not so good parts.&lt;/p&gt;

&lt;p&gt;When learning formal verification, I spent some time formally verifying
Xilinx IP.  After finding several bugs, I spoke to a Xilinx executive
regarding how they verified their IP.  Did they use formal methods?  No.
Did they use their own AXI Verification IP?  No.  Yet, they were very proud of
how well they had verified their IP.  Specifically, their executive bragged
about how good their coverage metrics were, and the number of test points
checked for each IP.&lt;/p&gt;

&lt;p&gt;Hmm.&lt;/p&gt;

&lt;p&gt;So, let me get this straight: Xilinx IP gets good coverage metrics, and hits
a large number of test points, yet still has bugs within it that I can find
via formal methods?&lt;/p&gt;

&lt;p&gt;Okay, so … how severe are these bugs?  In one case, the bugs would totally
break the AXI bus and bring the system containing the IP down to a screeching
halt–if the bug were ever tripped.  For example, if the system requested both
a read burst and a write burst at the same time, one particular slave might
accomplish the read burst with the length of the write burst–or vice versa.
(It’s been a while, so I’d have to look up the details to be exact regarding
them.)  In another case dealing with a network controller, it was possible
to receive a network packet, capture that packet correctly, and then return
a corrupted packet simply because the &lt;a href=&quot;/blog/2021/08/28/axi-rules.html&quot;&gt;AXI bus
handshakes&lt;/a&gt; weren’t properly
implemented.  To this day this bugs have not been fixed, and it’s nearly five
years later.&lt;/p&gt;

&lt;p&gt;Put simply, if it is possible for an IP to lock up your system completely,
then that IP shouldn’t be trusted until the bug is fixed.&lt;/p&gt;

&lt;p&gt;How then did Xilinx manage to convince themselves that their IP was high
quality?  By “good” coverage metrics.&lt;/p&gt;

&lt;p&gt;Lesson learned?  &lt;a href=&quot;https://en.wikipedia.org/wiki/Code_coverage&quot;&gt;Coverage
checking&lt;/a&gt; is a good thing, and it
can reveal holes in any simulation-based verification suite.  It’s just not
good enough on its own to find all of what you are missing.&lt;/p&gt;

&lt;p&gt;My conclusion?  Formal verification, followed by a simulation test suite that
evaluates coverage statistics is something to pay attention to, but not the
end all be-all.  One tool isn’t enough.  Many tools are required.&lt;/p&gt;

&lt;h2 id=&quot;self-checking-testbenches&quot;&gt;Self-Checking Testbenches&lt;/h2&gt;

&lt;p&gt;I then got involved with ASIC design.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/blog/2017/10/13/fpga-v-asic.html&quot;&gt;ASIC design differs from FPGA design in a couple of
ways&lt;/a&gt;.  Chief among them
is the fact that the ASIC design must work the first time.  There’s little to
no room for error.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;caption&gt;Fig 1. A typical verification environment&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/vjourney/verilogtb.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;When working with my first ASIC design, I was introduced to a more formalized
simulation flow.  Let me explain it this way, looking at Fig. 1.  Designs
tend to have two interfaces: a bus interface, together with a device I/O
interface.  A test script can then be used to drive some form of bus functional
model, which will then control the design under test via its bus interface.  A
device model would then mimic the device the design was intended to talk to.
When done well, the test script would evaluate the values returned by the
design–after interacting with the device, and declare “success” or “failure”.&lt;/p&gt;

&lt;p&gt;Here’s the key to this setup: I can run many different tests from this starting
point by simply changing the test script and nothing else.&lt;/p&gt;

&lt;p&gt;For example, let’s imagine an external memory controller.  A “good” memory
controller should be able to accept any bus request, convert it into
I/O wires to interact with the external memory, and then return a response from
the memory.  Hence, it should be possible to first write to the external memory
and then (later) read from the same external memory.  Whatever is then read
should match what was written previously.  This is the minimum test
case–measuring the “contract” with the memory.&lt;/p&gt;

&lt;p&gt;Other test cases might evaluate this contract across all of the modes the
memory supports.  Still other cases might attempt to trigger all of the faults
the design is supposed to be able to handle.  The only difference between these
many test cases would then be their test scripts.  Again, you can measure
whether or not the test cases are sufficient using coverage measures.&lt;/p&gt;

&lt;p&gt;The key here is that all of the test cases must produce either a “pass” or
“fail” result.  That is, they must be self-checking.  Now, using self checking
test cases, I can verify (via simulation) something like the 
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; across all of its instructions,
in SMP and single CPU environments, using the DMA (or not), and so forth.
Indeed, the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;’s test environment
takes this approach one step farther, by not just changing the test script
(in this case a &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; software program)
but also the configuration of the test environment as well.  This allows me
to make sure the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; will continue
to work in 32b, 64b, or even wider bus environments in a single test suite.&lt;/p&gt;

&lt;p&gt;Yes, this was a problem I was having before I adopted this methodology: I’d
test the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; with a 32b bus, and then
deploy the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; to a board whose
memory was 64b wide or wider.  The &lt;a href=&quot;https://github.com/ZipCPU/kimos&quot;&gt;Kimos
project&lt;/a&gt;, for example, has a 512b bus.  Now
that I run test cases on multiple bus widths, I have the confidence that I
can easily adjust the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; from one
bus width to another.&lt;/p&gt;

&lt;p&gt;This is now as far as I’ve now come in my verification journey.  I now use
formal tests, simulation tests, coverage checking, and a self-checking test
suite on new design components.  Is this perfect?  No, but at least its more
rigorous and repeatable than where I started from.&lt;/p&gt;

&lt;h2 id=&quot;next-steps-softwarehardware-interaction&quot;&gt;Next Steps: Software/Hardware interaction&lt;/h2&gt;

&lt;p&gt;The testing regiment discussed above continues to have a very large and
significant hole: I can’t test software drivers very well.&lt;/p&gt;

&lt;p&gt;Consider as an example my &lt;a href=&quot;https://github.com/ZipCPU/sdsdpi&quot;&gt;SD card
controller&lt;/a&gt;.  The 
&lt;a href=&quot;https://github.com/ZipCPU/sdsdpi&quot;&gt;repository&lt;/a&gt; actually contains three
controllers: &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdspi.v&quot;&gt;one for interacting with SD cards via their SPI
interface&lt;/a&gt;, &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdio_top.v&quot;&gt;one via
the SDIO interface&lt;/a&gt;,
and a third for use with eMMC cards (&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdio_top.v&quot;&gt;using the SDIO
interface&lt;/a&gt;).
The &lt;a href=&quot;https://github.com/ZipCPU/sdsdpi&quot;&gt;repository&lt;/a&gt; contains formal proofs
for all leaf modules, and two types of SD card models–a &lt;a href=&quot;https://github.com/ZipCPU/blob/master/bench/cpp/sdspi.cpp&quot;&gt;C++ model for
SPI&lt;/a&gt; and all Verilog
models for
&lt;a href=&quot;https://github.com/ZipCPU/blob/master/sdspi/bench/verilog/mdl_sdio.v&quot;&gt;SDIO&lt;/a&gt; and
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/bench/verilog/mdl_emmc.v&quot;&gt;eMMC&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This controller IP also contains a set of &lt;a href=&quot;https://github.com/ZipCPU/sdspi/tree/master/sw&quot;&gt;software
drivers&lt;/a&gt; for use when working
with SD cards.  Ideally, these drivers should be tested together with the
&lt;a href=&quot;https://github.com/ZipCPU/sdsdpi&quot;&gt;SD card controller(s)&lt;/a&gt;, so they could be
verified together.&lt;/p&gt;

&lt;p&gt;Recently, for example, I added a &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sddma.v&quot;&gt;DMA
capability&lt;/a&gt; to the
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;
version of &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdio.v&quot;&gt;the SDIO (and eMMC)
controller(s)&lt;/a&gt;.  This
(new) &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sddma.v&quot;&gt;DMA
capability&lt;/a&gt;
then necessitated quite a few changes to the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/tree/master/sw&quot;&gt;control software&lt;/a&gt;, so that it
could take advantage of it.  With no tests, how well do you think
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/sw/sdiodrv.c&quot;&gt;this software&lt;/a&gt;
worked when I first tested it in hardware?&lt;/p&gt;

&lt;p&gt;It didn’t.&lt;/p&gt;

&lt;p&gt;So, for now, the &lt;a href=&quot;https://github.com/ZipCPU/sdspi/tree/master/sw&quot;&gt;software
directory&lt;/a&gt; simply holds the
software I will copy to other designs and test in actual hardware.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;caption&gt;Fig 2. Software driven test bench&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/cpusim/softwaretb.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;The problem is, testing the &lt;a href=&quot;https://github.com/ZipCPU/sdspi/tree/master/sw&quot;&gt;software
directory&lt;/a&gt; requires many
design components beyond just the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SD card controllers&lt;/a&gt; that would be under test.
It requires memory, a console port, a CPU, and the CPU’s tool chain–all in
addition to the &lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;design&lt;/a&gt; under test.
These extra components aren’t a part of the &lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SD controller
repository&lt;/a&gt;, nor perhaps should they be.  How
then should these &lt;a href=&quot;https://github.com/ZipCPU/sdspi/tree/master/sw&quot;&gt;software
drivers&lt;/a&gt; be tested?&lt;/p&gt;

&lt;p&gt;Necessity breeds invention, so I’m sure I’ll eventually solve this problem.
This is just as far as I’ve gotten so far.&lt;/p&gt;

&lt;h2 id=&quot;automated-testing&quot;&gt;Automated testing&lt;/h2&gt;

&lt;p&gt;At any rate, I submitted this
&lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;repository&lt;/a&gt; to an automated continuous
integration facility the team I was working with was testing.  The utility
leans heavily on the existence of a variety of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;make test&lt;/code&gt; capabilities within
the &lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;repository&lt;/a&gt;, and so the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SD Card repository&lt;/a&gt; was a good fit for
testing.  Along the way, I needed some help from the test facility engineer to
get &lt;a href=&quot;https://github.com/YosysHQ/sby&quot;&gt;SymbiYosys&lt;/a&gt;,
&lt;a href=&quot;https://github.com/steveicarus/iverilog&quot;&gt;IVerilog&lt;/a&gt; and
&lt;a href=&quot;https://www.veripool.org/verilator/&quot;&gt;Verilator&lt;/a&gt; capabilities installed.  His
response?&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;It’s literally the first time I get to know a good hardware project needs
such many verifications and testings!  There’s even a real SD card
simulation model and RW test…&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Yeah.  Actually, there’s three SD card models–as discussed above.  It’s been
a long road to get to this point, and I’ve certainly learned a lot along the
way.&lt;/p&gt;
&lt;hr /&gt;&lt;p&gt;&lt;em&gt;Watch therefore: for ye know not what hour your Lord doth come. (Matt 24:42)&lt;/em&gt;</description>
        <pubDate>Sat, 06 Jul 2024 00:00:00 -0400</pubDate>
        <link>https://zipcpu.com/formal/2024/07/06/verifjourney.html</link>
        <guid isPermaLink="true">https://zipcpu.com/formal/2024/07/06/verifjourney.html</guid>
        
        
        <category>formal</category>
        
      </item>
    
      <item>
        <title>Debugging video from across the ocean</title>
        <description>&lt;p&gt;I’ve come across two approaches to video synchronization.  The first, used by
a lot of the Xilinx IP I’ve come across, is to hold the video pipeline in
reset until everything is ready and then release the resets (in the right and
proper order) to get the design started.  If something goes wrong, however,
there’s no room for recovery.  The second approach is the approach I like to
use, which is to &lt;a href=&quot;/video/2022/03/14/axis-video.html&quot;&gt;build video components that are inherently
“stable”&lt;/a&gt;: 1) if they
ever lose synchronization, they will naturally work their way back into
synchronization, and 2) once synchronized they will not get out of sync.&lt;/p&gt;

&lt;p&gt;At least that’s the goal.  It’s a great goal, too–when it works.&lt;/p&gt;

&lt;p&gt;Today’s story is about what happens when a “robust” video display isn’t.&lt;/p&gt;

&lt;h2 id=&quot;system-overview&quot;&gt;System Overview&lt;/h2&gt;

&lt;p&gt;Let’s start at the top level: I’m working on building a SONAR device.&lt;/p&gt;

&lt;p&gt;This device will be placed in the water, and it will sample acoustic data.
All of the electronics will be contained within a pressure chamber, with
the only interface to the outside world being a single cable providing both
Ethernet and power.&lt;/p&gt;

&lt;p&gt;Here’s the picture I used to capture this idea when &lt;a href=&quot;/blog/2022/08/24/protocol-design.html&quot;&gt;we discussed the network
protocols that would be required to debug this
device&lt;/a&gt;.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 1. Controlling an Underwater FPGA&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/netbus/sysdesign.svg&quot; alt=&quot;&quot; width=&quot;780&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;This “wet” device will then connect to a “dry” device (kept on land, via
Ethernet) where the sampled data can then be read, stored and processed.&lt;/p&gt;

&lt;p&gt;Now into today’s detail: while my customer has provided no requirement for
real-time processing, there’s arguably a need for it during development testing.
Even if there’s no need for real-time processing in the final delivery, there’s
arguably a need for it in the lab leading up to that final delivery.  That is,
I’d like to be able to just glance at my lab setup and know (at a glance or
two) that things are working.  For this reason, I’d like some real time
displays that I can read, at a glance, and know that things are working.&lt;/p&gt;

&lt;p&gt;So, what do we have available to us to get us closer?&lt;/p&gt;

&lt;h2 id=&quot;display-architecture&quot;&gt;Display Architecture&lt;/h2&gt;

&lt;p&gt;Some time ago, I built several RTL “display” modules to use for this
lab-testing purpose.  In general, these modules take an &lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;AXI stream of incoming
data&lt;/a&gt;,
and they produce an &lt;a href=&quot;/video/2022/03/14/axis-video.html&quot;&gt;AXI video stream for
display&lt;/a&gt;.  At present,
there are only five of these graphics display modules:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_histogram.v&quot;&gt;A histogram display&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;&lt;a href=&quot;/dsp/2019/12/21/histogram.html&quot;&gt;Histograms are exceptionally useful for diagnosing any A/D collection
issues&lt;/a&gt;, so having a live
&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_histogram.v&quot;&gt;histogram display&lt;/a&gt;
to provide insight into the sampled data distribution just makes sense.&lt;/p&gt;

    &lt;p&gt;However, &lt;a href=&quot;/dsp/2019/12/21/histogram.html&quot;&gt;histogram&lt;/a&gt;
&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_histogram.v&quot;&gt;displays&lt;/a&gt;
need a tremendous dynamic range.  How do you handle that in hardware?  Yeah,
that was part of the challenge when building this display.  It involved
figuring out how to build multiplies and divides without doing either
multiplication or division.  A fun project, though.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_trace.v&quot;&gt;A trace module&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;By “trace”, I mean something to show the time series, such as a plot of
voltage against time.  My big challenge with this display so far has been
the reality that the SONAR A/D chips can produce more data than they eye can
quickly process.&lt;/p&gt;

    &lt;p&gt;Now that we’ve been through a test or two with the hardware, I have a better
idea of what would be valuable here.  As a result, I’m likely going to take
the absolute value of voltages across a significant fraction of a second,
and then use that approach to display a couple of seconds worth of data on
the screen.  Thankfully, my &lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_trace.v&quot;&gt;trace display
module&lt;/a&gt; is
quite flexible, and should be able to display anything you give to it by way
of an AXI Stream input.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_waterfall.v&quot;&gt;A falling raster&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;The very first time my wife came to a family day at the office, way back
in the 1995-96 time frame or so, the office had a display set up with a
microphone and a sliding spectral raster.  I was in awe!  You could speak,
and see what your voice “looked” like spectrally over time.  You could hit
the table, whistle, bark, whatever, and every sound you made would look
different.&lt;/p&gt;

    &lt;p&gt;I’ve since &lt;a href=&quot;https://github.com/ZipCPU/fftdemo&quot;&gt;built this kind of capability&lt;/a&gt;
many times over, and even &lt;a href=&quot;/dsp/2020/11/21/spectrogram.html&quot;&gt;studied the best ways to do it from a
mathematical standpoint&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;In the SONAR world, you’ll find this sort of thing really helps you visualize
what’s going on in your data streams–what sounds are your sensors picking
up, what frequencies are they at, etc.  A good raster will let you “see”
motors in the water–all very valuable.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;A spectrogram, via the same &lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_trace.v&quot;&gt;trace
module&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;This primarily involves plotting the absolute values of the data coming out
of an &lt;a href=&quot;/dsp/2018/10/02/fft.html&quot;&gt;FFT&lt;/a&gt;,
applied to the incoming data.  Thankfully, the &lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_trace.v&quot;&gt;trace
module&lt;/a&gt;
is robust enough to handle this kind of input as well.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_split.v&quot;&gt;A split screen display&lt;/a&gt;,
that can place both an &lt;a href=&quot;/dsp/2018/10/02/fft.html&quot;&gt;FFT&lt;/a&gt;
&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_trace.v&quot;&gt;trace&lt;/a&gt;
and a falling raster on the same screen.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’ll come back to the split screen display in a bit.  In general, however,
the processing components used within it look (roughly) like Fig.  2 below.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 2. Split display video processing pipeline&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/qoi-debug/split-pipeline.svg&quot; alt=&quot;&quot; width=&quot;780&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Making this happen required some other behind the scenes components as well,
to include:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_empty.v&quot;&gt;An empty video generator&lt;/a&gt;–to
generate an &lt;a href=&quot;/video/2022/03/14/axis-video.html&quot;&gt;AXI video
stream&lt;/a&gt; from scratch.
The video out of this device is a constant color (typically black).  This
then forms a “canvas” (via the &lt;a href=&quot;/video/2022/03/14/axis-video.html&quot;&gt;AXI video
stream protocol&lt;/a&gt;)
that other things can be overlaid on top of.&lt;/p&gt;

    &lt;p&gt;This generator leaves &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TVALID&lt;/code&gt; high, for reasons we’ve
&lt;a href=&quot;/video/2022/03/14/axis-video.html&quot;&gt;discussed before&lt;/a&gt;,
and that we’ll get to again in a moment.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_mux.v&quot;&gt;A video multiplexer&lt;/a&gt;–to
select between one of the various “displays”, and send only one to the
outgoing video display.&lt;/p&gt;

    &lt;p&gt;One of the things newcomers to the hardware world often don’t realize is that
the hardware used for a display can often not be reused when you switch
display types.  This is sort of like an ALU–the CPU will include support
for ADD, OR, XOR, and AND instructions, even if only one of the results is
selected on each clock cycle.  The same is true here.  Each of the various
displays listed
above is built in hardware, occupies a separate area of the FPGA (whether used
or not), and so something is needed to select between the various outputs to
choose which we’d like.&lt;/p&gt;

    &lt;p&gt;It did take some thought to figure out how to maintaining video
synchronization while multiplexing multiple video streams together.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/axisvoverlay.v&quot;&gt;A video overlay module&lt;/a&gt;–to merge two displays together, creating a result that
looks like it has multiple independent “windows” all displaying real time
data.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I wrote these modules years ago.  They’ve all worked beautifully–in simulation.
So far, these have only been designed to be engineering displays, and not
necessarily great finished products.  Their biggest design problem?  None of
them display any units.  Still, they promise a valuable debugging
capability–provided they work.&lt;/p&gt;

&lt;p&gt;Herein lies the rub.  Although these display modules have worked nicely in
simulation, and although many have been formally verified, for some reason
I’ve had troubles with these modules when placed into actual hardware.&lt;/p&gt;

&lt;p&gt;Debugging this video chain is the topic of today’s discussion.&lt;/p&gt;

&lt;h2 id=&quot;axi-video-rules&quot;&gt;AXI Video Rules&lt;/h2&gt;

&lt;p&gt;For some more background, each of these modules produces an AXI video stream.
In general, these components would take data input, and produce a video
stream as output–much like Fig. 3 below.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;caption&gt;Fig 3. General AXI Stream Video component&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/qoi-debug/gendisplay.svg&quot; alt=&quot;&quot; width=&quot;420&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;In this figure, acoustic data arrives on the left, and video data comes out on
the right.  Both use AXI streams.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;/video/2022/03/14/axis-video.html&quot;&gt;AXI stream protocol, however, isn’t necessarily a good fit for video
proccessing&lt;/a&gt;.
You really have to be aware of who drives the pixel clock,
and where the blanking intervals in your design are handled.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Sink&lt;/p&gt;

    &lt;p&gt;If video comes into your device, the pixel clock is driven by that video
 source.  The source will also determine when blanking intervals need to
 take place and how long they should be.  This will be controlled via the
 video’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VALID&lt;/code&gt; signal.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Source&lt;/p&gt;

    &lt;p&gt;Otherwise, if you are not consuming incoming video but producing video out,
 then the pixel clock and blanking intervals will be driven by the video
 controller.  This will be controlled by the display controllers &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;READY&lt;/code&gt;
 signal.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In our case, these intermediate display modules also need to be aware that
there’s often &lt;em&gt;no&lt;/em&gt; buffering for the input.  If you drop the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SRC_READY&lt;/code&gt; line,
data will be lost.  Acoustic sensor data is coming at the design whether you
are ready for it or not.  Likewise, the &lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;video output data needs to get to the
display module, and there’s no room in the HDMI standard for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VALID&lt;/code&gt; dropping
when a pixel needs to be
produced&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Put simply, there are two constraints to these controllers: 1) the source can’t
handle &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VALID &amp;amp;&amp;amp; !READY&lt;/code&gt;, and 2) the display controller at the end of the video
processing chain can’t handle &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;READY &amp;amp;&amp;amp; !VALID&lt;/code&gt;.  Any IP in the middle needs
to do what it can to avoid these conditions.&lt;/p&gt;

&lt;p&gt;This leads to some self-imposed criteria, that I’ve “added” to the AXI stream
protocol.  Here are my extra rules for processing AXI video stream data:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;All video processing components should keep READY high.&lt;/p&gt;

    &lt;p&gt;Specifically, nothing &lt;em&gt;within&lt;/em&gt;
the module should ever drop the ready signal.  Only the downstream display
driver should ever drop READY by more than a cycle or two between lines.
This drop in READY then needs to propagate through all the way through any
video processing chain.&lt;/p&gt;

    &lt;p&gt;My &lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_mux.v&quot;&gt;video multiplexer&lt;/a&gt;
module is an example of an exception to this rule: It drops READY on all
of the video streams that aren’t currently active.  By waiting until the
end of a frame before adjusting/swapping which source is active, it can keep
all sources synchronized with the output.  This component will fail,
however, if one of those incoming streams is a true video source.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Keep VALID high as much as possible.&lt;/p&gt;

    &lt;p&gt;Only an upstream video source, such as a camera, should ever drop VALID by
more than a cycle or two between lines.  As with READY, this drop in VALID
should then propagate through the video processing chain.&lt;/p&gt;

    &lt;p&gt;In my case, There’s no such camera in this design, and so I’m never starting
from a live video source.  However, for reuse purposes in case I ever wish
to merge any of these components with a live feed, I try to keep VALID high
as much as possible.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Expect the environment to do something crazy.  Deal with it.  If your
algorithm depends on the image size, and that size changes, deal with it.&lt;/p&gt;

    &lt;p&gt;For example, if you are doing an overlay, and the overlay position changes,
you’ll need to move it.  If a video being overlaid isn’t VALID by the time
it’s needed, then you’ll have to diable the overlay operation and wait for
the overlay video source to get to the end of its frame before stalling it,
and then forcing it to wait until the time required for its first pixel
comes around again.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If your algorithm has a memory dependency, then there is always the
possibility that the memory cannot keep up with the videos requirements.
Prepare for this.  Expect it.  Plan for it.  Know how to deal with it.&lt;/p&gt;

    &lt;p&gt;For example, if you are reading memory from a frame buffer to generate a
video image, and the memory doesn’t respond in time then, again, you have
to deal with it.  Your algorithm should do something “smart”, fail
gracefully, and then be able to resynchronize again later.  Perhaps
something else, such as a disk-drive DMA, was using memory and kept the
frame buffer from meeting its real-time requirements.  Perhaps it will be
gone later.  Deal with it, and recover.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In my case, I was building a falling raster.  I had two real-time requirements.&lt;/p&gt;

&lt;p&gt;First, data comes from the SONAR device at some incoming rate.  There’s no
room to slow it down.  You either handle it in time, or you don’t. In my case,
SONAR data is slow, so this isn’t really an issue.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;padding: 25px; float: left&quot;&gt;&lt;caption&gt;Fig 4. AXI Stream Video &quot;Rules&quot;&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/qoi-debug/vidrules.svg&quot; alt=&quot;&quot; width=&quot;420&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;This data then goes through an
&lt;a href=&quot;/dsp/2018/10/02/fft.html&quot;&gt;FFT&lt;/a&gt;,
and possibly a logarithm or an averager,
before coming to the first half of the raster.  This component then writes
data to memory, one &lt;a href=&quot;/dsp/2018/10/02/fft.html&quot;&gt;FFT&lt;/a&gt;
line at a time.  (See Fig. 2 above.)  If the memory is
too slow here, data may be catastrophically dropped.  This is bad, but rare.&lt;/p&gt;

&lt;p&gt;Second, the waterfall display data must be produced at a known rate.  VALID
must be held high as much as possible so that the downstream display driver
at the end of the processing chain can rate limit the pipeline as necessary.
That means the waterfall must be read from memory as often as the downstream
display driver needs it.  If the memory can’t keep up, the display goes on.
You can’t allow these to get out of sync, but if they do they have to be able
to resynchronize automatically.&lt;/p&gt;

&lt;p&gt;Those are my rules for AXI video.  I’ve also summarized them in Fig. 4.&lt;/p&gt;

&lt;h2 id=&quot;debugging-challenge&quot;&gt;Debugging Challenge&lt;/h2&gt;

&lt;p&gt;Now let’s return to my SONAR project, where one of the big challenges was that
the SONAR device wasn’t on my desktop.  It’s being developed on the other side
of the Atlantic from where I’m at.  It has no JTAG connection to Vivado.
There’s no ILA, although my &lt;a href=&quot;https://github.com/ZipCPU/wbscope&quot;&gt;Wishbone scope&lt;/a&gt;
works fine.  The bottom line here, though, is that I can’t just glance at the
device (like I’d like) to see if the display is working.&lt;/p&gt;

&lt;p&gt;I’ve therefore spent countless hours using both formal methods and video
simulations to verify that each of these display components work.  Each of
these displays has passed a lint check, a formal check, and a simulation check.
Therefore, they should all be working … right?&lt;/p&gt;

&lt;p&gt;Except that when I tried to deploy these “working” modules to the
hardware … they didn’t work.&lt;/p&gt;

&lt;p&gt;The classic example of “not working” was the split screen spectrum/waterfall
display.  This screen was supposed to display the current spectrum of the
input data on top, with a waterfall synchronized to the same data falling down
beneath it.  It’s a nice effect–when it works.  However, we had problems
where the two would get out of sync.  1) The waterfall would show energy in
locations separate from the spectral energy, 2) the waterfall could be seen
“jumping” horizontally across the screen–just like the old TVs would do when
they lost sync.&lt;/p&gt;

&lt;p&gt;This never happened in any of my simulations.  Never.  Not even once.&lt;/p&gt;

&lt;p&gt;Sadly, my integrated SONAR simulation environment isn’t perfect.  It has
some challenges.  Of course, there’s the obvious challenge that my simulation
isn’t connected to “real” data.  Instead, I tend to drive it with various sine
waves.  These tend to be good for testing.  I suppose I could fix this somewhat
by replaying collected data, but that’s only on my “To-Do” list for now.  Then
there’s the challenge that &lt;a href=&quot;https://github.com/ZipCPU/zbasic/blob/e7b39a56ee515d1cabe8427f30c7add0592bfab1/sim/verilated/memsim.cpp&quot;&gt;my memory simulation
model&lt;/a&gt;
doesn’t typically match Xilinx’s MIG DDR3 performance.  (No, I’m not simulating
the entire DDR3 memory–although perhaps I should.) Finally, I can only
simulate about 5-15 frames of video data.  It just doesn’t take very long
before the &lt;a href=&quot;/blog/2017/07/31/vcd.html&quot;&gt;VCD trace file&lt;/a&gt;
exceeds 100GB, and then &lt;a href=&quot;https://gtkwave.sourceforge.net/&quot;&gt;my
tools&lt;/a&gt; struggle.&lt;/p&gt;

&lt;p&gt;Bottom line: &lt;a href=&quot;/blog/2018/08/04/sim-mismatch.html&quot;&gt;works in simulation, fails hard in
hardware&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now, how to figure this one out?&lt;/p&gt;

&lt;h2 id=&quot;first-step-formal-verification&quot;&gt;First Step: Formal verification&lt;/h2&gt;

&lt;p&gt;I know I said everything was formally verified.  That wasn’t quite true
initially.  Initially, the &lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/axisvoverlay.v&quot;&gt;overlay
module&lt;/a&gt;
wasn’t formally verified.&lt;/p&gt;

&lt;p&gt;In general, I like to develop with formal methods as my guide.  Barring that,
if I ever run into problems then formal verification is my first approach to
debugging.  I find that I can find problems faster when using the formal tools.
It tends to condense debugging very quickly.  Further, the formal tools aren’t
constrained by the requirement that the simulation environment needs to make
sense.  As a result, I tend to check my designs against a much richer
environment when checking them formally than I would via simulation.&lt;/p&gt;

&lt;p&gt;In this case, I was tied up with other problems, so I had someone else do the
formal verification for me.  He was somewhat new to formal verification, and
this particular module was quite the challenge–there are just so many cases
that had to be considered:&lt;/p&gt;

&lt;p&gt;We can start with the typical design, where the overlaid image lands nicely
within the main image window.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;caption&gt;Fig 5. Overlaid window in video&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/qoi-debug/midoverlay.svg&quot; alt=&quot;&quot; width=&quot;420&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;This is what I typically think of when I set up an overlay of some type.&lt;/p&gt;

&lt;p&gt;This isn’t as simple as it sounds, though, since the IP needs to know that
the overlay window has finished its line, and so it shouldn’t start on the
next line until the main window gets to the left corner of the overlay
window for the next line.&lt;/p&gt;

&lt;p&gt;What happens, though, when the overlay window scrolls off to once side and
wraps back onto the main window?&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;padding: 25px; float: left&quot;&gt;&lt;caption&gt;Fig 6. Clipping the Overlaid video&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/qoi-debug/overlay.svg&quot; alt=&quot;&quot; width=&quot;420&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;It might also scroll off the bottom as well.&lt;/p&gt;

&lt;p&gt;In both cases, the overlay video should be clipped.  This is not something my
simulation environment ever really checked, but it is something we had no
end of challenges when checking via formal tools.&lt;/p&gt;

&lt;p&gt;These clipped examples are okay.  There’s nothing wrong with them–they just
never look right with only a couple clock cycles of trace.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;caption&gt;Fig 7. Overlay not ready&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/qoi-debug/overlay-block.svg&quot; alt=&quot;&quot; width=&quot;420&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;There’s also the possibility of what happens when the overlay window isn’t
ready when the main window is, as illustrated in Fig. 7 on the right.&lt;/p&gt;

&lt;p&gt;Remember our video rules.  Together, these rules require that VALID and READY
be propagated through the module–but never dropped internal to the module.
That means there’s no time to wait.  If the overlay module isn’t ready when
it’s time, then the image will be corrupted.  We can’t wait or the hardware
display will lose sync.  The overlay has to be ready or the image will be
corrupted.&lt;/p&gt;

&lt;p&gt;So, how to deal with situations like this?&lt;/p&gt;

&lt;p&gt;Yeah.&lt;/p&gt;

&lt;p&gt;Yes, my helper learned a lot during this process.  Eventually, we got to the
point of pictorially drawing out what was going on each time the formal engine
presented us with another verification failure, just so we could follow what
was going on.  Yes, our drawings started looking like Fig. 5 or 6 above.&lt;/p&gt;

&lt;p&gt;Yes, formal verification is where I turn when things don’t work.  Typically
there’s some hardware path I’m not expecting, and formal tends to find all
such paths to make sure the logic considers them properly.&lt;/p&gt;

&lt;p&gt;In this case, it wasn’t enough.  Even though I formally verified all of these
components, the displays still weren’t working.  Unfortunately, in order to
know this, I had to ask an engineer in a European time zone to connect a
monitor and … he told me it wasn’t working.  Sure, he was more helpful than
that: he provided me pictures of the failures.  (They were nasty.  These were
ugly looking failures.)  Unfortunately, these told me nothing of what needed
to be adjusted, and it was also costly in terms of requiring a team effort–I
would need to arrange for his availability, (potentially) his cost, all for
something that wasn’t (yet) a customer requirement.&lt;/p&gt;

&lt;p&gt;I needed a better approach.&lt;/p&gt;

&lt;p&gt;What I needed was a way to “see” what was going on, without being there.
I needed a digital method of screen capture.&lt;/p&gt;

&lt;p&gt;Building something like this, however, is quite the challenge: the waterfall
displays all use my memory bandwidth–they can even use a (potentially)
significant memory bandwidth.  Debugging meant that I was going to need a
means of capturing the screen headed to the display that wouldn’t
(significantly) impact my memory bandwidth–otherwise my test infrastructure
(i.e. any debugging screen capture) would impact what I was trying to test.
That might lead to chasing down phantom bugs, or believing things were still
broken even after they’d been fixed.&lt;/p&gt;

&lt;p&gt;This left me at an impass for some time–knowing there were bugs in the video,
but unable to do anything about them.&lt;/p&gt;

&lt;h2 id=&quot;enter-qoi-compression&quot;&gt;Enter QOI Compression&lt;/h2&gt;

&lt;p&gt;Some time ago, I remember reading about &lt;a href=&quot;https://qoiformat.org&quot;&gt;QOI
compression&lt;/a&gt;.  It captured my attention, as a fun
underdog story.&lt;/p&gt;

&lt;p&gt;Yes, I’d implemented my own &lt;a href=&quot;https://en.wikipedia.org/wiki/GIF&quot;&gt;GIF&lt;/a&gt;
compression/decompression in time past.  This was back when I was still focused
on software, and thus before I started doing any hardware design. I’d even
looked up how to compress images with &lt;a href=&quot;https://en.wikipedia.org/wiki/PNG&quot;&gt;PNG&lt;/a&gt;
and how &lt;a href=&quot;https://en.wikipedia.org/wiki/Bzip2&quot;&gt;BZip2&lt;/a&gt; could compress files.
Frankly, over the course of 30 years working in this industry, compression is
kind of hard to avoid.  That said, none of these compression methods is
really suitable for FPGA work.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://qoiformat.org&quot;&gt;QOI&lt;/a&gt; is different.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://qoiformat.org&quot;&gt;QOI&lt;/a&gt; is &lt;em&gt;much&lt;/em&gt; simpler than
&lt;a href=&quot;https://en.wikipedia.org/wiki/GIF&quot;&gt;GIF&lt;/a&gt;,
&lt;a href=&quot;https://en.wikipedia.org/wiki/PNG&quot;&gt;PNG&lt;/a&gt;,
or &lt;a href=&quot;https://en.wikipedia.org/wiki/Bzip2&quot;&gt;BZip2&lt;/a&gt;.  &lt;em&gt;Much&lt;/em&gt; simpler.  It’s so
simple, it can be implemented in hardware without too many challenges.  It’s so
simple, it can be implemented in 700 Xilinx 6-LUTs.  Not only that, it claims
better performance than &lt;a href=&quot;https://en.wikipedia.org/wiki/PNG&quot;&gt;PNG&lt;/a&gt;
across &lt;a href=&quot;https://qoiformat.org/benchmark/&quot;&gt;many (not all) benchmarks&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Yeah, now I’m interested.&lt;/p&gt;

&lt;p&gt;With a little bit of work, I was able to implement a &lt;a href=&quot;https://github.com/ZipCPU/qoiimg/blob/master/rtl/qoi_compress.v&quot;&gt;QOI compression
module&lt;/a&gt;.  A
&lt;a href=&quot;https://github.com/ZipCPU/qoiimg/blob/master/rtl/qoi_encoder.v&quot;&gt;small wrapper&lt;/a&gt;
could encode and attach a small “file” header and trailer onto the compressed
stream.  This could then be followed by a &lt;a href=&quot;https://github.com/ZipCPU/qoiimg/blob/master/rtl/qoi_recorder.v&quot;&gt;QOI image capture
module&lt;/a&gt;
which I could then use to capture a series of subsequent video frames.&lt;/p&gt;

&lt;p&gt;This led to a debugging plan that was starting to take shape.  You can see how
this plan would work in Fig. 8 below.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 8. Video debug plan using QOI compression&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/qoi-debug/qoiplan.svg&quot; alt=&quot;&quot; width=&quot;780&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;If all went well, video data would be siphoned off from between the video
multiplexer and the display driver generating the HDMI output.  This video
would be (nominally) at around (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;800*600*3*60&lt;/code&gt;) 82MB/s.  If the compression
works well, the data rate should drop to about 1MB/s–but we’ll see.&lt;/p&gt;

&lt;p&gt;Of course, as with anything, nothing works out of the box.  Worse, if you are
going to rely on something for “test”, it really needs to be better than
the device under test.  If not, you’ll never know which item is the cause of
an observation: the device under test, or the test infrastructure used to
measure it.&lt;/p&gt;

&lt;p&gt;Therefore, I set up a basic simulation test on my desktop.  I’d run the
SONAR simulation, visually inspect the HDMI output, and capture three frames
of data.  I’d then &lt;a href=&quot;https://github.com/phoboslab/qoi&quot;&gt;convert&lt;/a&gt; these three
frames of data to &lt;a href=&quot;https://en.wikipedia.org/wiki/PNG&quot;&gt;PNG&lt;/a&gt;s.  If the resulting
&lt;a href=&quot;https://en.wikipedia.org/wiki/PNG&quot;&gt;PNG&lt;/a&gt;s visually matched, then I had
a &lt;del&gt;strong&lt;/del&gt; confidence the
&lt;a href=&quot;https://qoiformat.org&quot;&gt;QOI&lt;/a&gt;
&lt;a href=&quot;https://github.com/ZipCPU/qoiimg/blob/master/rtl/qoi_compress.v&quot;&gt;compression&lt;/a&gt;,
&lt;a href=&quot;https://github.com/ZipCPU/qoiimg/blob/master/rtl/qoi_encoder.v&quot;&gt;encoder&lt;/a&gt;, and
&lt;a href=&quot;https://github.com/ZipCPU/qoiimg/blob/master/rtl/qoi_recorder.v&quot;&gt;recorder&lt;/a&gt;
were working.&lt;/p&gt;

&lt;p&gt;Note that I had to cross out the word “strong” there.  Unless and until an
IP can be tested through &lt;em&gt;every&lt;/em&gt; logic path, you really don’t have any “strong”
confidence something is working.  Still, it was enough to get me off the ground.&lt;/p&gt;

&lt;p&gt;The challenge here is that tracing the design through simulation while it
records three images can generate a 120GB+
&lt;a href=&quot;/blog/2017/07/31/vcd.html&quot;&gt;VCD file&lt;/a&gt;,
and took longer to test
in simulation than it did to build the hardware design, load the hardware
design, and capture images from hardware.  As a result, I often found myself
debugging both the &lt;a href=&quot;https://github.com/ZipCPU/qoiimg&quot;&gt;QOI processing system&lt;/a&gt;
and the (buggy) video processing system jointly, &lt;a href=&quot;/blog/2017/06/02/design-process.html&quot;&gt;in hardware, at the same
time&lt;/a&gt;.
No, it’s not ideal, but it did work.&lt;/p&gt;

&lt;h2 id=&quot;the-first-bug-never-getting-back-in-sync&quot;&gt;The First Bug: Never getting back in sync&lt;/h2&gt;

&lt;p&gt;I started my debugging with the default display, a &lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_split.v&quot;&gt;split
screen&lt;/a&gt;
&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_trace.v&quot;&gt;spectrogram&lt;/a&gt;
and &lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_waterfall.v&quot;&gt;waterfall&lt;/a&gt;.
Using my newfound capability, I quickly received an image that looked something
like the figure below.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 9. First QOI capture -- no waterfall&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/qoi-debug/20240527-qoi-before.png&quot; alt=&quot;&quot; width=&quot;800&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;This figure shows what &lt;em&gt;should&lt;/em&gt; be a
&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_split.v&quot;&gt;split screen&lt;/a&gt;
&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_trace.v&quot;&gt;spectrogram&lt;/a&gt;
and
&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_waterfall.v&quot;&gt;waterfall&lt;/a&gt;
display.  The
&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_trace.v&quot;&gt;spectrum&lt;/a&gt;
on top appears about right, however the
&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_waterfall.v&quot;&gt;waterfall&lt;/a&gt;
that’s supposed to exist in the bottom half of the display is completely absent.&lt;/p&gt;

&lt;p&gt;Well, the good news is that I could at least capture a bug.&lt;/p&gt;

&lt;p&gt;The next step was to walk this bug backwards through the design.  In this case,
we’re walking backwards through Fig. 2 above and the first component to look at
is the &lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/axisvoverlay.v&quot;&gt;overlay
module&lt;/a&gt;.  It
is possible for the &lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/axisvoverlay.v&quot;&gt;overlay
module&lt;/a&gt;
to lose synchronization.  This typically means either the overlay isn’t
ready when the primary display is ready for it, or that the overlay is still
displaying some (other) portion of its video.  Once out of sync, you can no
longer merge the two displays.  The two streams then need to be resynchronized.
That is, the
&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/axisvoverlay.v&quot;&gt;overlay&lt;/a&gt;
module would need to wait for the end of the secondary image (the image to be
overlaid on top of the primary), and then it would need to stall the secondary
image until the primary display was ready for it again.&lt;/p&gt;

&lt;p&gt;However, the &lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/axisvoverlay.v&quot;&gt;overlay
module&lt;/a&gt;
wasn’t losing synchronization.&lt;/p&gt;

&lt;p&gt;No?&lt;/p&gt;

&lt;p&gt;This was a complete surprise to me.  This was where I was expecting the bug,
and where most of my debugging efforts had been (blindly) focused up until this
point.&lt;/p&gt;

&lt;p&gt;Okay, so … let’s move back one more step.  (See Fig. 2)&lt;/p&gt;

&lt;p&gt;It is possible for the &lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/vid_waterfall_r.v&quot;&gt;video waterfall
reader&lt;/a&gt;
to get out of sync between its two clocks.  Specifically, one portion of the
&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/vid_waterfall_r.v&quot;&gt;reader&lt;/a&gt;
reads data, one line at a time, from the bus and stuffs it into
first a synchronous FIFO, and then
an &lt;a href=&quot;/blog/2018/07/06/afifo.html&quot;&gt;asynchronous one&lt;/a&gt;.  This
half operates at whatever speed the bus is at, and that’s defined by the
memory’s speed.  The second half of the
&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/vid_waterfall_r.v&quot;&gt;reader&lt;/a&gt;
takes this data from the
&lt;a href=&quot;/blog/2018/07/06/afifo.html&quot;&gt;asynchronous FIFO&lt;/a&gt; and attempts
to create an AXI stream video output from it–this time at the pixel clock rate.
Because we are not allowed to stall this video output to wait for memory, it
is possible for the two to get out of sync.  In this case, the
&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/vid_waterfall_r.v&quot;&gt;reader&lt;/a&gt;
(pixel clock domain) is supposed to wait for an end of frame indication from
the memory
&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/vid_waterfall_r.v&quot;&gt;reader&lt;/a&gt;
(bus clock domain, via the
&lt;a href=&quot;/blog/2018/07/06/afifo.html&quot;&gt;asynchronous FIFO&lt;/a&gt;),
and then it is to stall the memory
&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/vid_waterfall_r.v&quot;&gt;reader&lt;/a&gt;
(by not reading from the
&lt;a href=&quot;/blog/2018/07/06/afifo.html&quot;&gt;asynchronous FIFO&lt;/a&gt;)
until it receives an end of video frame indication from its own video
reconstruction logic.&lt;/p&gt;

&lt;p&gt;A quick check revealed that yes, these two were getting out of sync.&lt;/p&gt;

&lt;p&gt;Here’s how the “out-of-sync” detection was taking place:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;initial&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;px_lost_sync&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pix_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pix_reset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;px_lost_sync&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;M_VID_TVALID&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;M_VID_TREADY&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;M_VID_HLAST&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;// Check when sending the last pixel of a line.  On this last&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;// pixel, the data read from memory (px_hlast) must also&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;// indicate that it is the last pixel in a line.  Further,&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;// if this is also the last line in a frame, then both the&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;// memory indicator of the last line in a frame (px_vlast)&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;// and the outgoing video indicator (M_VID_VLAST) must match.&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;px_hlast&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;M_VID_VLAST&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;px_vlast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;px_lost_sync&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;1'b1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;px_lost_sync&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;M_VID_VLAST&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;px_vlast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;c1&quot;&gt;// We can resynchronize once both memory and&lt;/span&gt;
			&lt;span class=&quot;c1&quot;&gt;// outgoing video streams have both reached the end of&lt;/span&gt;
			&lt;span class=&quot;c1&quot;&gt;// a frame.&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;px_lost_sync&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;1'b0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Following any reset, the entire design should be synchronized.  That’s the
easy part.&lt;/p&gt;

&lt;p&gt;Next, if the output of the &lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/axisvoverlay.v&quot;&gt;overlay
module&lt;/a&gt;
(that’s the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;M_VID_*&lt;/code&gt; prefix values) is ready to produce the last pixel of a
line, then we check if the FIFO signals line up.  In our example, we have two
sets of synchronization signals.  First, there are the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;M_VID_HLAST&lt;/code&gt; and
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;M_VID_VLAST&lt;/code&gt; signals.  These are generated blindly based upon the frame size.
These indicate the last pixel in a line (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;M_VID_HLAST&lt;/code&gt;) and the end of a frame
(&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;M_VID_VLAST&lt;/code&gt;) respectively–from the perspective of the video stream.  Two
other signals, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;px_hlast&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;px_vlast&lt;/code&gt;, come through the
&lt;a href=&quot;/blog/2018/07/06/afifo.html&quot;&gt;asynchronous FIFO&lt;/a&gt;.  These
are used to indicate the last bus word in a line and the end of a frame from
the perspective of the data found within the
&lt;a href=&quot;/blog/2018/07/06/afifo.html&quot;&gt;asynchronous FIFO&lt;/a&gt;
containing the samples read from memory–one bus word (not one pixel) at a time.
If these two ever get out of sync, then perhaps memory hasn’t kept up with the
display or perhaps something else has gone wrong.&lt;/p&gt;

&lt;p&gt;So, to determine if we’ve lost sync, we check for it on the last pixel of any
line.  That is, when &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;M_VID_HLAST&lt;/code&gt; is true to indicate the last pixel in a
line, then &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;px_last&lt;/code&gt; should also be true–both should be synchronized.
Likewise, when &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;M_VID_VLAST&lt;/code&gt; (last line of frame) is true, then &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;px_vlast&lt;/code&gt;
should also be true–or the two have come out of sync.&lt;/p&gt;

&lt;p&gt;Because I’m also doing 128b bus word to 8b pixel conversions here, the two
signals don’t directly correspond.  That is, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;px_hlast&lt;/code&gt; might be true (last
bus word of a line), even though &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;M_VID_HLAST&lt;/code&gt; isn’t true yet (last pixel of a
line).  Hence, I only check these values if &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;M_VID_HLAST&lt;/code&gt; is true–on the last
&lt;em&gt;pixel&lt;/em&gt; of the line.&lt;/p&gt;

&lt;p&gt;That’s how we know if we’re out of sync.  But … how do we get synchronized
again?&lt;/p&gt;

&lt;p&gt;For this, the plan is to read from the memory reader as fast as possible until
the end of the frame.  Once we get to the end of the frame, we’ll stop reading
from memory and wait for the video (pixel clock) to get to the end of the
frame.  Once both are synchronized at the end of a frame, the plan is to then
release both together and we’ll be synchronized again.&lt;/p&gt;

&lt;p&gt;At least, that’s how this is &lt;em&gt;supposed&lt;/em&gt; to work.&lt;/p&gt;

&lt;p&gt;The key (broken) signal was the signal to read from the &lt;a href=&quot;/blog/2018/07/06/afifo.html&quot;&gt;asynchronous
FIFO&lt;/a&gt;.
This signal, called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;afifo_read&lt;/code&gt;, is shown below.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;afifo_read&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;px_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;PW&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;px_valid&lt;/span&gt;
			&lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;M_VID_TVALID&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;M_VID_TREADY&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;M_VID_HLAST&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;M_VID_TVALID&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;M_VID_TREADY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;afifo_read&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;1'b0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

		&lt;span class=&quot;c1&quot;&gt;// Always read if we are out of sync&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;px_lost_sync&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;px_hlast&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;px_vlast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;afifo_read&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;1'b1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Basically, we want to read from the
&lt;a href=&quot;/blog/2018/07/06/afifo.html&quot;&gt;asynchronous FIFO&lt;/a&gt;
any time we don’t have a full pixel’s width left in our bus width to pixel
gearbox, any time we don’t have a valid buffer, or any time we reach the end
of the line–where we would flush the gearbox’s buffer.  The exception to this
is if the outgoing AXI stream is stalled.  This is how the
&lt;a href=&quot;/blog/2018/07/06/afifo.html&quot;&gt;FIFO&lt;/a&gt; read signal is supposed
to work normally.  There’s one exception here, and that is if the two are out
of sync.  In that case, we will always read from
the &lt;a href=&quot;/blog/2018/07/06/afifo.html&quot;&gt;FIFO&lt;/a&gt;
until the last pixel in a line on the last line of the frame.&lt;/p&gt;

&lt;p&gt;This all sounds good.  It looked good on a desk check too.  II passed over this
many times, reading it, convincing myself that this was right.&lt;/p&gt;

&lt;p&gt;The problem is this was the logic that was broken.&lt;/p&gt;

&lt;p&gt;If you look closely, you might notice that this logic would never allow us to
get back in sync.  Once we lose synchronization, we’ll read until the end of
the frame and then stop, only to read again when any of the original criteria
are true–the ones assuming synhronization.&lt;/p&gt;

&lt;p&gt;Yeah, that’s not right.&lt;/p&gt;

&lt;p&gt;This also explains why all my hardware traces showed the waterfall never
resynchronizing with the outgoing video stream.&lt;/p&gt;

&lt;p&gt;One missing condition fixes this.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;// ...&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;px_lost_sync&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;px_hlast&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;px_vlast&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;M_VID_HLAST&lt;/span&gt;
				&lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;M_VID_VLAST&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;afifo_read&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;1'b0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;This last condition states that, if we are out of sync and we’ve reached the
last pixel in a frame, then we need to wait until the outgoing frame matches
our sync.  Only then can we read again.&lt;/p&gt;

&lt;p&gt;Once I fixed this, things got better.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 10. QOI capture, showing an attempted waterfall display&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/qoi-debug/20240528-qoi-promising.png&quot; alt=&quot;&quot; width=&quot;800&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;I could now get through a significant fraction of a frame before losing
synchronization for the rest of it.  In other words, I had found and fixed
the cause of why the design wasn’t recovering, just not the cause of what
caused it to get out of sync in the first place.&lt;/p&gt;

&lt;p&gt;The waterfall background is also supposed to be &lt;em&gt;black&lt;/em&gt;, not &lt;em&gt;blue&lt;/em&gt;–so I
needed to dig into that as well.  (That turned out to be a bug in the &lt;a href=&quot;https://github.com/ZipCPU/qoiimg/blob/master/rtl/qoi_compress.v&quot;&gt;QOI
compression
module&lt;/a&gt;.  I
could just about guess this bug, if I watched how the official decoder worked.)&lt;/p&gt;

&lt;p&gt;So, back I went to the &lt;a href=&quot;https://github.com/ZipCPU/wbscope&quot;&gt;Wishbone scope&lt;/a&gt;,
this time &lt;a href=&quot;/blog/2017/06/08/simple-scope.html&quot;&gt;triggering the
scope&lt;/a&gt;
on a loss of sync event.  I needed to find out why this design lost sync in
the first place.&lt;/p&gt;

&lt;h2 id=&quot;the-second-bug-how-did-we-lose-sync-in-the-first-place&quot;&gt;The Second Bug: How did we lose sync in the first place?&lt;/h2&gt;

&lt;p&gt;Years ago, I wrote &lt;a href=&quot;/blog/2018/11/29/llvga.html&quot;&gt;an article that argued that good and correct video handling
was all captured by a pair of
counters&lt;/a&gt;.  You needed one
counter for the horizontal pixel, and another for the vertical pixel.  Once
these got to the raw width and height of the image, the counters would be
reset and start over.&lt;/p&gt;

&lt;p&gt;When dealing with memory, things are a touch different–at least for this
design.&lt;/p&gt;

&lt;p&gt;As hinted above, the bus portion of the &lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/vid_waterfall_r.v&quot;&gt;waterfall
reader&lt;/a&gt;
works off of &lt;em&gt;bus words&lt;/em&gt;, not pixels.  It reads one line at a time from the
bus, reading as many bus words as are necessary to make up a line.  In the case
of this system, a bus word on the &lt;a href=&quot;https://store.digilentinc.com/nexys-video-artix-7-fpga-trainer-board-for-multimedia-applications&quot;&gt;Nexys Video
board&lt;/a&gt;
is 128-bits long–the natural width of the DDR3 SDRAM memory.  (Our &lt;a href=&quot;https://www.enclustra.com/en/products/fpga-modules/mercury-kx2/&quot;&gt;next
hardware
platform&lt;/a&gt;
will increase this to 512-bits.)  Likewise, the waterfall pixel size is only
8-bits–since it has no color, and a &lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/48de0f29c1cb91fabb0ef4d0cba4829c4a43651c/rtl/gfx/vid_clrmap.v&quot;&gt;false
color&lt;/a&gt;
will be provided later.  Hence, to read an 800 pixel line, the bus master must
read 50 bus words (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;800*8/128&lt;/code&gt;).  The last word will then be marked as the last
in the line, possibly also the last in the frame, and the result will be
stuffed into the &lt;a href=&quot;/blog/2018/07/06/afifo.html&quot;&gt;asynchronous
FIFO&lt;/a&gt;.
Once the last word in a line is requested of the bus, the bus master
needs to increment his line pointer address to the next line&lt;/p&gt;

&lt;p&gt;However, there’s a problem with bus mastering: the logic that makes &lt;em&gt;requests&lt;/em&gt;
of a bus has to take place many clocks before the logic that &lt;em&gt;receives&lt;/em&gt; the bus
responses.  The difference is not really that important, but it typically ends
up around 30 clock cycles or so.  That means this design needs two sets of
X and Y counters: one when making requests, to know when a full line (or frame)
has been requested and that it is time to advance to the next line (or frame),
and a second set to keep track of when the line (or frame) ends with respect
to the values &lt;em&gt;returned&lt;/em&gt; from the bus.  This second set controls the end of
line and frame markers that go into the synchronous and then &lt;a href=&quot;/blog/2018/07/06/afifo.html&quot;&gt;asynchronous
FIFO&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let’s walk through this logic to see if I can clarify it at all.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;First, there’s both an synchronous FIFO and an
&lt;a href=&quot;/blog/2018/07/06/afifo.html&quot;&gt;asynchronous one&lt;/a&gt;–since
it can be a challenge to know the &lt;em&gt;fill&lt;/em&gt; of the
&lt;a href=&quot;/blog/2018/07/06/afifo.html&quot;&gt;asynchronous FIFO&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Once the &lt;em&gt;synchronous&lt;/em&gt; FIFO is at least half empty, the reader begins a bus
transaction.  For a &lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone
bus&lt;/a&gt;, this means both &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CYC&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;STB&lt;/code&gt; need to
be raised.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;For every &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;STB &amp;amp;&amp;amp; !STALL&lt;/code&gt;, a request is made of the bus.  At this time, we
also subtract one from a counter keeping track of the number of available
(i.e. uncommitted) entries in the synchronous FIFO.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Likewise, for every &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;STB &amp;amp;&amp;amp; !STALL&lt;/code&gt;, the IP increments the requested memory
address.&lt;/p&gt;

    &lt;p&gt;Once you get to the end of the line, set the next address to the last line
start address &lt;em&gt;minus&lt;/em&gt; one line of memory.  Remember, we are creating a
&lt;em&gt;falling&lt;/em&gt; raster, where we go from most recent
&lt;a href=&quot;/dsp/2018/10/02/fft.html&quot;&gt;FFT&lt;/a&gt; data to oldest
&lt;a href=&quot;/dsp/2018/10/02/fft.html&quot;&gt;FFT&lt;/a&gt; data.
Hence we read &lt;em&gt;backwards&lt;/em&gt; through memory, one line at a time.&lt;/p&gt;

    &lt;p&gt;Once we get to the beginning of our assigned memory area, we wrap back
to the end of our assigned memory area minus one line.&lt;/p&gt;

    &lt;p&gt;Once we get to the end of the &lt;em&gt;frame&lt;/em&gt;, we need to reset the address to
the last line the
&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/vid_waterfall_w.v&quot;&gt;writer&lt;/a&gt;
has just completed.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;On evey &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ACK&lt;/code&gt;, the returned by data gets stored into the synchronous FIFO.
With each result stored in the FIFO, we also add an indication of whether
this return was associated with the end of a line or the end of a frame.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Once the
&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/vid_waterfall_r.v&quot;&gt;reader&lt;/a&gt;
gets to the end of the line, we restart the (horizontal) &lt;del&gt;pixel&lt;/del&gt; bus
word counter and increment the line counter.  When it gets to the end of
the frame, we reset the line counter as well.&lt;/p&gt;

    &lt;p&gt;Just to make sure that these two sets of counters (request and return)
remain synchronized, the return counters to set to equal the request
counters any time the bus is idle.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The IP then continues making requests until there would be no more room in
the FIFO for the returned data.  At this point, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;STB&lt;/code&gt; gets dropped and we
wait for the last request to be returned.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Once all requests have been returned, drop &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CYC&lt;/code&gt; and wait again.&lt;/p&gt;

    &lt;p&gt;The rule of the bus is also the rule of the boarding house bathroom:
do your business, and get out of there.  Once you are done with any bus
transactions, it’s therefore important to get off the bus.  Even if we could
(now) make more requests, we’ll get off the bus and wait for the FIFO to
become less than half full again–that way other (potential) bus masters
can have a chance to access memory.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And … right there is the foundation for this bug.&lt;/p&gt;

&lt;p&gt;The actual bug was how I determined whether or not the last request was being
returned.  Let’s look at that logic for a moment, shall we?  Here’s what it
looked like (when broken):  (Watch for what clears &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_wb_cyc&lt;/code&gt; …)&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;initial&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o_wb_cyc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o_wb_stb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;2'b00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;wb_reset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;// Halt any requests on reset&lt;/span&gt;
		&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o_wb_cyc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o_wb_stb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;2'b00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;o_wb_cyc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;o_wb_stb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_wb_stall&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;c1&quot;&gt;// Drop the strobe signal on the last request.  Never&lt;/span&gt;
			&lt;span class=&quot;c1&quot;&gt;// raise it again during this cycle.&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;o_wb_stb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;last_request&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_wb_ack&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;o_wb_stb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_wb_stall&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
					&lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;last_request&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;last_ack&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;c1&quot;&gt;// Drop ACK once the last return has been received.&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;o_wb_cyc&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;1'b0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fifo_fill&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;LGFIFO&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;LGBURST&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;// Start requests when the FIFO has less than a burst's size&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;// within it.&lt;/span&gt;
		&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o_wb_cyc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o_wb_stb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;2'b11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

	&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_reset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;o_wb_cyc&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_wb_err&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;last_ack&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;last_ack&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;wb_outstanding&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;o_wb_stb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
				&lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_wb_ack&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Look specifically at the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;last_ack&lt;/code&gt; signal.&lt;/p&gt;

&lt;p&gt;Depending upon the pipeline, this signal can be off by one clock cycle.&lt;/p&gt;

&lt;p&gt;This was the bug.  Because the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;last_ack&lt;/code&gt; signal, indicating that there’s only
one more acknowledgement left, compared the number of outstanding requests
against &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;2&lt;/code&gt; plus the current acknowledgment, and because the signal was
&lt;em&gt;registered&lt;/em&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;last_ack&lt;/code&gt; might be set if there were two requests outstanding
&lt;em&gt;and&lt;/em&gt; nothing was returned on the current cycle.&lt;/p&gt;

&lt;p&gt;Since all requests would’ve been made by this time, the X and Y &lt;del&gt;pixel&lt;/del&gt;
bus word counters for the &lt;em&gt;request&lt;/em&gt; would reflect that we’d just requested a
line of data.  The &lt;em&gt;return&lt;/em&gt; counters, on the other hand, would be off by one
if &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CYC&lt;/code&gt; ever dropped a cycle early.  These return counters would then get
reset to equal the &lt;em&gt;request&lt;/em&gt; counters any time &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CYC&lt;/code&gt; was zero.  Hence,
dropping the bus line one cycle early would result in a line
of pixels (well, bus words representing pixels …) going into the FIFO
that didn’t have enough pixels within it–or perhaps the LAST signal might
be missing entirely.  Whatever the case, it didn’t line up.&lt;/p&gt;

&lt;p&gt;This particular design was formally verified.  Shouldn’t this bug have shown
up in a formal test?  Sadly, no.  &lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;It’s &lt;em&gt;legal&lt;/em&gt; to drop &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CYC&lt;/code&gt;
early&lt;/a&gt;, so there’s
no protocol violation there.  Further, my acknowledgment counter was off by
one in such that the formal properties allowed it.  If I added an assertion
that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CYC&lt;/code&gt; would never be dropped early (which I did once I discovered this
bug), the design would then immediately (and appropriately) fail.&lt;/p&gt;

&lt;p&gt;There’s one more surprise to this story though.  Why didn’t this bug show up in
simulation?&lt;/p&gt;

&lt;p&gt;Ahh, now there’s a very interesting lesson to be learned.&lt;/p&gt;

&lt;h2 id=&quot;reality-why-didnt-the-bugs-show-up-in-simulation&quot;&gt;Reality: Why didn’t the bug(s) show up in simulation?&lt;/h2&gt;

&lt;p&gt;Why didn’t the bug show up earlier?  Because of Xilinx’s DDR3 SDRAM controller,
commonly known as “The MIG”.&lt;/p&gt;

&lt;p&gt;I don’t normally simulate DDR3 memories.  A DDR3 SDRAM memory controller
requires a lot of hardware specific components, components that aren’t
necessarily easy to simulate, and it also requires a DDR3 SDRAM simulation
model.  I tend to simplify all of this and just simulate my designs with an
&lt;a href=&quot;https://github.com/ZipCPU/zbasic/blob/e7b39a56ee515d1cabe8427f30c7add0592bfab1/sim/verilated/memsim.cpp&quot;&gt;alternate SDRAM model–a model that looks and acts “about” right, but one that
isn’t exact&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It was the difference between &lt;a href=&quot;https://github.com/ZipCPU/zbasic/blob/e7b39a56ee515d1cabe8427f30c7add0592bfab1/sim/verilated/memsim.cpp&quot;&gt;my simulation
model&lt;/a&gt;,
which wouldn’t trigger any of the bugs, and Xilinx’s MIG reality that
ended up triggering the bug.&lt;/p&gt;

&lt;p&gt;Fig. 11, for example, shows what the &lt;a href=&quot;https://github.com/ZipCPU/wbscope&quot;&gt;Wishbone
scope&lt;/a&gt; returned when documenting the
&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/vid_waterfall_r.v&quot;&gt;waterfall
reader&lt;/a&gt;’s
transactions with the MIG.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 11.  The Waterfall reader's view of Wishbone bus handshaking when accessing memory&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/qoi-debug/20240605-migfail.png&quot; alt=&quot;&quot; width=&quot;800&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Focus your attention on first the stall (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_stall&lt;/code&gt;) and then the
acknowledgment (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_ack&lt;/code&gt;) lines.&lt;/p&gt;

&lt;p&gt;First, stall is high immediately as part of the beginning of the transaction.
This is to be expected.  With the exception of filling a minimal buffer, any
bus master requesting transactions of the bus is going to need to wait for
&lt;a href=&quot;/blog/2019/07/17/crossbar.html&quot;&gt;arbitration&lt;/a&gt;.
This only takes a clock or two.  Once
&lt;a href=&quot;/blog/2019/07/17/crossbar.html&quot;&gt;arbitration&lt;/a&gt; is received, the
&lt;a href=&quot;/blog/2019/07/17/crossbar.html&quot;&gt;interconnect&lt;/a&gt;
won’t stall the design again during this bus cycle.&lt;/p&gt;

&lt;p&gt;Only the stall line gets raised again after that–several times even.  These
stalls are all due to the MIG.&lt;/p&gt;

&lt;p&gt;Let’s back up a touch.&lt;/p&gt;

&lt;p&gt;There are a lot of rules to SDRAM interaction.  Most SDRAM’s are configured in
memory &lt;em&gt;banks&lt;/em&gt;.  Banks are read and written in &lt;em&gt;rows&lt;/em&gt;.  The data in each row
is stored in a set of capacitors.  This allows for maximum data packing in
minimal area (cost).  However, you can’t read from a row of capacitors.  To
read from the memory, that row first needs to be copied to a row of fast
memory.  This is called
“activating” the row.  Once a row is activated, it can be read from or written
to.  Once you are done with one row, it must be “precharged” (i.e. put back),
before a different row can be activated.  All of this takes time.  If the
row you want isn’t activated, you’ll need to switch rows.  That will cause a
stall as the old row needs to be precharged and the new row activated.  Hence,
when making a long string of read or a long string of write requests, you’ll
suffer from a stall every time you cross rows.&lt;/p&gt;

&lt;p&gt;Xilinx’s MIG has another rule.  Because of how their architecture uses an IO
trained PLL (Xilinx calls this a “phasor”), the MIG needs to regularly read
from memory to keep this PLL trained.  During this time the memory must also
stall.  (Why the MIG can’t train on &lt;em&gt;my&lt;/em&gt; memory reads, but needs its own–I
don’t know.)  These stalls are very periodic, and if you dig a bit you can
find this taking place within their controller.&lt;/p&gt;

&lt;p&gt;Then the part of the trace showing a long stalled section reflects the reality
that, every now and again, the memory needs to be taken entirely off line for
a period of time so that the capacitors can be recharged.  This requires a
longer time period, as highlighted in Fig. 12 below.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 12.  SDRAM refresh cycles force long stalls&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/qoi-debug/20240605-migrefresh.png&quot; alt=&quot;&quot; width=&quot;800&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Once it’s time for a refresh cycle like this, several steps need to take place
in the memory controller–in this case the MIG.  First, any active rows need
to be precharged.  Then, the memory is refreshed.  Finally, you’ll need to
re-activate the row you need.  This takes time as well–as shown in Fig. 12.&lt;/p&gt;

&lt;p&gt;That’s part one–the stall signal.  &lt;a href=&quot;https://github.com/ZipCPU/zbasic/blob/e7b39a56ee515d1cabe8427f30c7add0592bfab1/sim/verilated/memsim.cpp&quot;&gt;My over-simplified SDRAM memory
model&lt;/a&gt;
doesn’t simulate any of these practical memory realities.&lt;/p&gt;

&lt;p&gt;Part two is the acknowledgments.  From these traces, you can see that there’s
about a 30 cycle latency (300ns) from the first request to the first
acknowledgment.  However, unlike my &lt;a href=&quot;https://github.com/ZipCPU/zbasic/blob/e7b39a56ee515d1cabe8427f30c7add0592bfab1/sim/verilated/memsim.cpp&quot;&gt;over-simplified memory
model&lt;/a&gt;,
the acknowledgments also come back broken due to the stalls.  This makes sense.
If every request takes 30 cycles, and some get stalled, then it only makes
sense that the stalled requests would get acknowledged later the ones that
didn’t get stalled.&lt;/p&gt;

&lt;p&gt;Put together, this is why my waterfall display worked in simulation, but not
in hardware.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Wow, that was a long story!&lt;/p&gt;

&lt;p&gt;Yeah.  It was long from my perspective too.  Although the “bugs” amounted to
only 2-5 lines of Verilog, it took a lot of work to find those bugs.&lt;/p&gt;

&lt;p&gt;Here are some key takeaways to consider:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;All of this was predicated on a &lt;a href=&quot;/blog/2018/08/04/sim-mismatch.html&quot;&gt;simulation vs hardware
mismatch&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;Because the SDRAM simulation did not match the SDRAM reality, cycle for
cycle, a key hardware reality was missed in testing.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;This should’ve been caught via formal methods.&lt;/p&gt;

    &lt;p&gt;From now on, I’m going to have to make certain I check that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CYC&lt;/code&gt; is only
ever dropped either following either a reset, an error, or the last
acknowledgment.  There should be zero requests outstanding when &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CYC&lt;/code&gt; is
dropped.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Why wasn’t the pixel resynchronization bug caught via formal?&lt;/p&gt;

    &lt;p&gt;Because … FIFOs.  It can be a challenge to formally verify a design
containing a FIFO.  Rather than deal with this properly, I allowed the two
halves of the design to be somewhat independent–and so the formal tool
never really examined whether or not the design could (or would) properly
recover from a lost sync.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Did formally verifying the &lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/axisvoverlay.v&quot;&gt;overlay
module&lt;/a&gt;
help?&lt;/p&gt;

    &lt;p&gt;Yes.  When we went through it, we found bugs in it.  Once the &lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/axisvoverlay.v&quot;&gt;overlay
module&lt;/a&gt;
was formally verified, the result stopped &lt;em&gt;jumping&lt;/em&gt;.  Instead, the
overlay might just note a problem and stop showing the overlaid image.
Even better, unlike before the &lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/axisvoverlay.v&quot;&gt;overlay
module&lt;/a&gt;
was properly verified, I haven’t had any more instances of the top and
bottom pictures getting out of sync with each other.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;What about that blue field?&lt;/p&gt;

    &lt;p&gt;Yes, the waterfall background should be black when no signal was present.
The blue field turned out to be caused by a bug in the &lt;a href=&quot;https://github.com/ZipCPU/qoiimg/blob/master/rtl/qoi_compress.v&quot;&gt;QOI compression
module&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;Once fixed, the captured image looked like Fig. 13 below.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 13.  SDRAM refresh cycles force long stalls&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/qoi-debug/20240528-qoi-working.png&quot; alt=&quot;&quot; width=&quot;800&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;This was easily found and fixed.  (It had to deal with a race condition on the
pixel index when writing to the compression table, if I recall correctly …)&lt;/p&gt;

&lt;ol start=&quot;6&quot;&gt;
  &lt;li&gt;
    &lt;p&gt;How about that &lt;a href=&quot;https://github.com/ZipCPU/qoiimg&quot;&gt;QOI module&lt;/a&gt;?&lt;/p&gt;

    &lt;p&gt;The thing worked like a champ!  I love the simplicity of the
&lt;a href=&quot;https://qoiformat.org&quot;&gt;QOI&lt;/a&gt;
encoding, enough so that I’m likely to use it again and again!&lt;/p&gt;

    &lt;p&gt;Okay, perhaps I’m overselling this.  It wasn’t perfect at first.  This is,
in many ways to be expected–this was the first time it was ever used.
However, it was small and cheap, and worked well enough to get the job done.&lt;/p&gt;

    &lt;p&gt;Some time later, I managed to formally verify the
&lt;a href=&quot;https://github.com/ZipCPU/qoiimg/blob/master/rtl/qoi_compress.v&quot;&gt;compression&lt;/a&gt;
engine, and I found another bug or two that had been missed in my hardware
testing.&lt;/p&gt;

    &lt;p&gt;That’s &lt;a href=&quot;https://github.com/ZipCPU/qoiimg/blob/master/rtl/qoi_compress.v&quot;&gt;compression&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;Decompression?  That’s another story.  I think I’ve convinced myself that I
can do decompression in hardware, but the
&lt;a href=&quot;https://github.com/ZipCPU/qoiimg/blob/master/rtl/qoi_decompress.v&quot;&gt;algorithm&lt;/a&gt;
(while cheap) isn’t
really straightforward any more.  At issue is the reality that it will take
several clock cycles (i.e. pipeline stages) to determine the table index for
storing colors into, yet the very next pixel might be dependent upon the
result of reading from the table.  Scheduling the pipeline isn’t
straightforward.  (Worse, I have simulation test cases showing that the
&lt;a href=&quot;https://github.com/ZipCPU/qoiimg/blob/master/rtl/qoi_decompress.v&quot;&gt;decompression logic I have&lt;/a&gt;
doesn’t work yet.)&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Are the displays ready for prime time?&lt;/p&gt;

    &lt;p&gt;I’d love to say so, but they don’t have labeled axes.  They really need
labeled axes to be proper &lt;em&gt;professional&lt;/em&gt; displays.  Perhaps a
&lt;a href=&quot;https://qoiformat.org&quot;&gt;QOI&lt;/a&gt; &lt;a href=&quot;https://github.com/ZipCPU/qoiimg/blob/master/rtl/qoi_decompress.v&quot;&gt;decompression
algorithm&lt;/a&gt;
can take labeled image data from memory and overlay it onto the display as
well.  However, to do this I’m going to have to redesign how I handle
scaling, otherwise the labels won’t match the image.&lt;/p&gt;

    &lt;p&gt;Worse, &lt;a href=&quot;https://x.com/Dg3Yev/status/1797779997190443498&quot;&gt;[DG3YEV Tobias] recently put my waterfall display to
shame&lt;/a&gt;.  My basic displays
are much too simple.  So, it looks like I might need to up my game.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I should point out, in passing, that the &lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 SDRAM
controller&lt;/a&gt; doesn’t nearly have as
many stall cycles as Xilinx’s MIG.  It doesn’t use the (undocumented) hardware
phasors, so it doesn’t have to take the memory offline periodically.  Further,
it can schedule the row precharge and activation cycles so as to avoid
bus stalls (when accessing memory sequentially).  As such, it operates about
10% faster than the MIG.  It even gets a lower latency.  These details,
however, really belong in an article to themselves.&lt;/p&gt;

&lt;p&gt;I suppose the bottom line question is whether or not these displays are ready
for our next testing session.  The answer is a solid, No.  Not yet.  I still
need to do some more testing with them.  However, these displays are a lot
closer now than they’ve been for the last two years.&lt;/p&gt;
&lt;hr /&gt;&lt;p&gt;&lt;em&gt;Seest thou a man diligent in his business? he shall stand before kings; he shall not stand before mean men. (Prov 22:29)&lt;/em&gt;</description>
        <pubDate>Sat, 22 Jun 2024 00:00:00 -0400</pubDate>
        <link>https://zipcpu.com/video/2024/06/22/vidbug.html</link>
        <guid isPermaLink="true">https://zipcpu.com/video/2024/06/22/vidbug.html</guid>
        
        
        <category>video</category>
        
      </item>
    
      <item>
        <title>Bringing up Kimos</title>
        <description>&lt;p&gt;Ever had one of those problems where you were stuck for weeks?&lt;/p&gt;

&lt;p&gt;It’s not supposed to happen, but … it does.&lt;/p&gt;

&lt;p&gt;Let me tell you about the Kimos story so far.&lt;/p&gt;

&lt;h2 id=&quot;what-is-kimos&quot;&gt;What is Kimos?&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/kimos&quot;&gt;Kimos is the name of one of the current open source
projects&lt;/a&gt; I’m working on.  The project is
officially named the “Kintex-7 Memory controller, Open Source toolchain”, but
the team shortened that to “KiMOS” and I’ve gotten to the point where I just
call it “Kimos” (pronounced KEE-mos).  The goals of the project are twofold.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Test an &lt;a href=&quot;https://github.com/AngeloJacobo/uberDDR3&quot;&gt;Open Source DDR3 SDRAM memory
controller&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;This includes both performance testing, and performance comparisons against
Xilinx’s MIG controller.&lt;/p&gt;

    &lt;p&gt;Just as a note, &lt;a href=&quot;https://github.com/AngeloJacobo/uberDDR3&quot;&gt;Angelo’s
controller&lt;/a&gt; has a couple of
differences with Xilinx’s controller.  One of them is a simpler
“native” interface: Wishbone, with an option for one (or more)
auxilliary wire(s).  The auxilliary wire(s) are designed to simplify
&lt;a href=&quot;https://github.com/ZipCPU/wb2axip/blob/master/rtl/axim2wbsp.v&quot;&gt;wrapping this controller with a full AXI
interface&lt;/a&gt;.
Another difference is the fact that &lt;a href=&quot;https://github.com/AngeloJacobo/uberDDR3&quot;&gt;Angelo’s
controller&lt;/a&gt; is built using
documented Xilinx IO capabilities only–rather than the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PHY_CONTROL&lt;/code&gt; and
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PHASER*&lt;/code&gt; constructs that Xilinx used and chose not to document.&lt;/p&gt;

    &lt;p&gt;My hypothesis is that these differences, together with some internal
structural differences that I encouraged Angelo to make, will make his a
faster memory controller.  This test will tell.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Once the memory controller works, our goal is to test Kimos using an
entirely open source tool flow.&lt;/p&gt;

    &lt;p&gt;&lt;em&gt;This open source tool flow would replace Vivado.&lt;/em&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The project hardware itself is built by &lt;a href=&quot;https://www.enclustra.com&quot;&gt;Enclustra&lt;/a&gt;.
It consists of two boards: a &lt;a href=&quot;https://www.enclustra.com/en/products/base-boards/mercury-st1/&quot;&gt;Mercury+ ST1
baseboard&lt;/a&gt;,
and an associated &lt;a href=&quot;https://www.enclustra.com/en/products/fpga-modules/mercury-kx2/&quot;&gt;KX2
daughterboard&lt;/a&gt;.
Together, these boards provide some nice hardware capability in one place:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;There’s a large DDR3 SDRAM memory, with a 64b data width.  Ultimately,
this means we should be able to transfer 512b per FPGA clock.  In the case
of this project, that’ll be 512b for every 10ns (i.e. a 100MHz FPGA system
clock)–even though the memory itself can be clocked faster.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The board also has two Gb Ethernet interfaces, although I only have plans
for one of them.&lt;/p&gt;

    &lt;p&gt;Each interface (naturally) includes an &lt;a href=&quot;https://en.wikipedia.org/wiki/Management_Data_Input/Output&quot;&gt;MDIO management
interface&lt;/a&gt;.
Although I might be tempted to take this interface for granted, it
shouldn’t be.  It was via the &lt;a href=&quot;https://en.wikipedia.org/wiki/Management_Data_Input/Output&quot;&gt;MDIO
interface&lt;/a&gt;
that I was able to tell which of the two hardware interfaces corresponded
to ETH0 on the schematic and which was ETH1.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;There’s an SD card slot on the board, so I’ve already started using it to
test my &lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SDIO controller&lt;/a&gt; and it’s new
DMA capability.  Once tested, the &lt;a href=&quot;https://github.com/ZipCPU/sdspi/tree/dev&quot;&gt;dev branch (containing the
DMA)&lt;/a&gt; will have been “tested” and
“hardware proven”, and so I’ll be able to then merge it into the &lt;a href=&quot;https://github.com/ZipCPU/sdspi/tree/master&quot;&gt;master
branch&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;I’m likely to use the FMC interface to test a &lt;a href=&quot;https://github.com/ZipCPU/wbsata&quot;&gt;new SATA
controller&lt;/a&gt; I’m working on.  A nice
&lt;a href=&quot;https://www.fpgadrive.com/&quot;&gt;FPGA Drive daughter board&lt;/a&gt;
from &lt;a href=&quot;https://www.ospero.com&quot;&gt;Ospero Electronic Design, Inc.,&lt;/a&gt; will help to
make this happen.&lt;/p&gt;

    &lt;p&gt;Do note, though, that &lt;a href=&quot;https://github.com/ZipCPU/wbsata&quot;&gt;this controller&lt;/a&gt;,
although posted, is most certainly broken and broken badly at present–it’s
just not that far along in its development to have any reliability to it.
The plan is to first build a SATA Verilog model, get the controller running
in simulation, and then to get it running on this Enclustra hardware.  It’s
just got a long way to go in its process at present.  The good news is that
the project is funded, so if you are interested in it, come back and check
in on it later–after I’ve had the chance to prove (and therefore fix) it.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The device also has some I2C interfaces, which I might investigate for
testing my &lt;a href=&quot;/blog/2021/11/15/ultimate-i2c.html&quot;&gt;ultimate I2C
controller&lt;/a&gt; on.  The
main I2C bus has three chips connected to it: an &lt;a href=&quot;https://media.digikey.com/pdf/Data%20Sheets/Silicon%20Laboratories%20PDFs/Si5338.pdf&quot;&gt;Si5338
clock controller&lt;/a&gt;
(which isn’t needed for any of my applications), an encrypted hash chip
(with … poor documentation–not recommended), and a &lt;a href=&quot;https://www.renesas.com/us/en/document/dst/isl12020m-datasheet&quot;&gt;real time
clock&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The design also has some of the more standard interfaces that everything
relies on, to include
&lt;a href=&quot;/blog/2019/03/27/qflexpress.html&quot;&gt;Flash&lt;/a&gt; and
&lt;a href=&quot;/formal/2019/02/21/txuart.html&quot;&gt;UART&lt;/a&gt;–both
of which I have controllers for already.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Although the
&lt;a href=&quot;https://www.enclustra.com/en/products/base-boards/mercury-st1/&quot;&gt;baseboard&lt;/a&gt;
has HDMI capabilities, Enclustra never connected the HDMI on the
&lt;a href=&quot;https://www.enclustra.com/en/products/base-boards/mercury-st1/&quot;&gt;baseboard&lt;/a&gt;
to the &lt;a href=&quot;https://www.enclustra.com/en/products/fpga-modules/mercury-kx2/&quot;&gt;KX2 daughterboard&lt;/a&gt;.
Hence, if I want video, I’ll need to use the DisplayPort hardware–something
I haven’t done before, but … it does have potential (just not funding).&lt;/p&gt;

    &lt;p&gt;This is a shame, because I have a bunch of live HDMI displays that I’d love
to port to this project that … just aren’t likely to happen.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Eventually, my plan is to port my SONAR work to this hardware–but that remains
a far off vision at this point.&lt;/p&gt;

&lt;p&gt;The project is currently a work in progress, so I have not gotten to the point
of completing either of the open source objectives.  (Since I initially drafted
this, &lt;a href=&quot;https://github.com/AngeloJacobo/uberDDR3&quot;&gt;Angelo’s controller&lt;/a&gt; has
now been ported, and appears to be working–it’s performance just hasn’t
been measured yet.)&lt;/p&gt;

&lt;p&gt;I have, however, completed a first milestone: getting the design working with
Xilinx’s MIG controller.  For a task that should’ve taken no longer than a
couple of days, this portion of the task has taken a month and a half–leaving
me stuck in &lt;a href=&quot;/fpga-hell.html&quot;&gt;FPGA Hell&lt;/a&gt; for most of this
time.&lt;/p&gt;

&lt;p&gt;Now that I have Xilinx’s MIG working, I’d like to share a brief description of
what went wrong, and why this took so long.  Perhaps others may learn from my
failures as well.&lt;/p&gt;

&lt;h2 id=&quot;the-challenges-with-board-bringup&quot;&gt;The challenges with board bringup&lt;/h2&gt;

&lt;p&gt;The initial steps in board bringup went quickly: I could get the LEDs and
serial port up and running with no problems.  From there I could
&lt;a href=&quot;https://github.com/ZipCPU/kimos/blob/master/sw/board/cputest.c&quot;&gt;test&lt;/a&gt; the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; (running out of block RAM), and
things looked good.  At this point, a year or so ago, I put the board on the
shelf to come back to it later when I had more time and motivation (i.e.
funding).&lt;/p&gt;

&lt;p&gt;I wasn’t worried about the next steps.  I already had controllers for the
main hardware components necessary to move forward.  I had &lt;a href=&quot;https://github.com/ZipCPU/kimos/blob/master/rtl/migsdram.v&quot;&gt;a controller that
would work nicely with Xilinx’s
MIG&lt;/a&gt;, &lt;a href=&quot;https://github.com/ZipCPU/kimos/blob/master/rtl/net/enetstream.v&quot;&gt;another
that would handle the Gb
Ethernet&lt;/a&gt;, &lt;a href=&quot;https://github.com/ZipCPU/kimos/blob/master/rtl/qflexpress.v&quot;&gt;a
flash controller&lt;/a&gt;,
and so on.  These were all proven controllers, so it was just a matter of
integrating them and making sure things worked (again) as expected.&lt;/p&gt;

&lt;p&gt;Once the &lt;a href=&quot;https://github.com/ZipCPU/kimos&quot;&gt;Kimos project&lt;/a&gt; kicked off, with the
goals listed above, I added these components to the project and immediately
had problems.&lt;/p&gt;

&lt;h3 id=&quot;the-done-led&quot;&gt;The DONE LED&lt;/h3&gt;

&lt;p&gt;The first problem was that the “DONE” LED wouldn’t light.  Or, rather, it would
light just fine until I tried to include Xilinx’s MIG controller.  Once I
included Xilinx’s MIG controller into the design the LED would no longer light.&lt;/p&gt;

&lt;p&gt;Now … how do you fix that one?  I mean, where do you even start?&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/kimos/one-bug.svg&quot; width=&quot;420&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;I started by reducing the design as much as possible.  I removed components
from the design, and adjusted which components were in the design and which
were not.  With a bit of work, I was able to prove–as mentioned above–that
the design would work as long as Xilinx’s MIG (DDR3 SDRAM) controller was not
a part of the design.  The moment I added Xilinx’s MIG, the design stopped
working.&lt;/p&gt;

&lt;p&gt;Ouch.  What would cause that?  Is there a short circuit on the board somewhere?
Did I mess up the XDC file?  The MIG configuration?&lt;/p&gt;

&lt;p&gt;With some help from some other engineers, we traced the first problem to the
open source FPGA loader I was using:
&lt;a href=&quot;https://github.com/trabucayre/openFPGALoader&quot;&gt;openFPGALoader&lt;/a&gt;.  As it turns
out, this &lt;a href=&quot;https://github.com/trabucayre/openFPGALoader/issues/229&quot;&gt;loader struggles to load large/complex designs at high JTAG
frequencies&lt;/a&gt;.  However,
if you drop the frequency down from 4MHz to &lt;a href=&quot;https://github.com/ZipCPU/kimos/blob/804559a307bd58e34527ebb3f791ad414fe71803/Makefile#L283&quot;&gt;3.75MHz, the loader will “just”
work&lt;/a&gt;
and the DONE LED will get lit.&lt;/p&gt;

&lt;p&gt;The problem goes a bit deeper, and highlights a problem I’ve had personally as
well: since the developer of the
&lt;a href=&quot;https://github.com/trabucayre/openFPGALoader&quot;&gt;openFPGALoader&lt;/a&gt;
component can’t replicate the problem with the hardware he has, he can’t really
test fixes.  Hence, although a valid fix has been proposed, the developer
is uncertain of it.  Still, without help, I wouldn’t have made it this far.&lt;/p&gt;

&lt;p&gt;Sadly, now that the DONE LED lit up for my design, it still didn’t work.
Worse, I no longer trusted the
&lt;a href=&quot;https://github.com/trabucayre/openFPGALoader&quot;&gt;FPGA loader&lt;/a&gt;.
This left me always looking over my shoulder for another loading option.&lt;/p&gt;

&lt;p&gt;For example, I tried programming the design into
&lt;a href=&quot;https://github.com/ZipCPU/kimos/blob/master/rtl/qflexpress.v&quot;&gt;flash&lt;/a&gt;
and then using my &lt;a href=&quot;https://github.com/ZipCPU/kimos/rtl/wbicapetwo.v&quot;&gt;internal configuration access port (ICAPE)
controller&lt;/a&gt; to
load the design from
&lt;a href=&quot;https://github.com/ZipCPU/kimos/blob/master/rtl/qflexpress.v&quot;&gt;flash&lt;/a&gt;.
This didn’t work, until I first took the
&lt;a href=&quot;https://github.com/ZipCPU/kimos/blob/master/rtl/qflexpress.v&quot;&gt;flash&lt;/a&gt; out of
eXecute in Place (XiP) mode.  (Would I have known that, if I hadn’t been the
one to build the &lt;a href=&quot;https://github.com/ZipCPU/kimos/blob/master/rtl/qflexpress.v&quot;&gt;flash
controller&lt;/a&gt;
and put it into XiP mode in the first place?  I’m not sure.)
However, if I first told the
&lt;a href=&quot;https://github.com/ZipCPU/kimos/blob/master/rtl/qflexpress.v&quot;&gt;flash&lt;/a&gt;
to leave XiP mode, I could then specify a warm-boot address to my
&lt;a href=&quot;https://github.com/ZipCPU/kimos/rtl/wbicapetwo.v&quot;&gt;ICAPE&lt;/a&gt; controller,
followed by an IPROG command, which could then load any design that …
didn’t include Xilinx’s MIG DDR3 SDRAM controller.&lt;/p&gt;

&lt;p&gt;At this point, I had proved that my problem was no longer the
&lt;a href=&quot;https://github.com/trabucayre/openFPGALoader&quot;&gt;openFPGALoader&lt;/a&gt;.  That was
the good news.  The bad news was that the design still wasn’t working whenever
I included the MIG.&lt;/p&gt;

&lt;h3 id=&quot;jtaguart-not-working&quot;&gt;JTAG/UART not working&lt;/h3&gt;

&lt;p&gt;If the design loads, the place I want to go next is to get an internal logic
analyzer up and running.  Here, I have two options:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Xilinx’s ILA requires a JTAG connection.&lt;/p&gt;

    &lt;p&gt;Without a Xilinx compatible JTAG connector, I can’t use Xilinx’s ILA.&lt;/p&gt;

    &lt;p&gt;At one point I purchased a USB based JTAG controller.  I … just didn’t
manage to purchase the right one, and so the pins never fit.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;I typically do my &lt;a href=&quot;/blog/2017/06/28/dbgbus-goal.html&quot;&gt;debugging over
UART&lt;/a&gt;, using a
&lt;a href=&quot;https://github.com/ZipCPU/wbscope&quot;&gt;Wishbone scope&lt;/a&gt;–something we’ve
&lt;a href=&quot;/blog/2017/07/08/getting-started-with-wbscope.html&quot;&gt;already discussed on the
blog&lt;/a&gt;.
Using this method I can quickly find and debug problems.&lt;/p&gt;

    &lt;p&gt;However, with this particular design, any time I added the MIG SDRAM
controller to the design my &lt;a href=&quot;/blog/2017/06/28/dbgbus-goal.html&quot;&gt;UART debugging
port&lt;/a&gt; would stop
working–together with the rest of the design.  That left me with no UART,
and no JTAG.  Indeed, I could’ve ping’d the board via the Gb Ethernet unless
and until I added the MIG.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Something was seriously wrong.  This is definitely &lt;em&gt;not&lt;/em&gt;
&lt;a href=&quot;https://english.stackexchange.com/questions/25897/origin-of-the-phrase-now-were-cooking-with&quot;&gt;“cooking with gas”&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;So how then do you debug something this?  LEDs!&lt;/p&gt;

&lt;h3 id=&quot;leds-not-working&quot;&gt;LEDs not working&lt;/h3&gt;

&lt;p&gt;Debugging by LED is slow.  It can take 10+ minutes to make a change to a design,
and each LED will only (at best) give you one bit of output.  So the feedback
isn’t that great.  Still, they are an important part of debugging early design
configuration issues.  In this case, the &lt;a href=&quot;https://www.enclustra.com/en/products/fpga-modules/mercury-kx2/&quot;&gt;Enclustra KX2
daughterboard&lt;/a&gt;
has four LEDs on it, and the &lt;a href=&quot;https://www.enclustra.com/en/products/base-boards/mercury-st1/&quot;&gt;Mercury+ ST1
baseboard&lt;/a&gt;
has another 4 LEDs.  Perhaps they could be used to debug the next steps?&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/kimos/side-by-side.svg&quot; width=&quot;420&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Normally, I build my designs with a &lt;a href=&quot;/blog/2017/05/20-knight-rider.html&quot;&gt;“Knight Rider” themed LED
display&lt;/a&gt;.  This helps me
know that my FPGA design has loaded properly.  There are two parts to this
display.  First, there’s an “active” LED that moves from one end of the LED
string to the other and then back again.  This “active” LED is ON with full
brightness–whatever that means for an individual design.  Then, once the
“active” LED moves on to the next LED in the string, a PWM (actually
&lt;a href=&quot;/dsp/2017/09/04/pwm-reinvention.html&quot;&gt;PDM&lt;/a&gt;)
signal is used to “dim” the LED in a decaying fashion.  Of course, &lt;a href=&quot;/zipcpu/2019/02/09/cpu-blinky.html&quot;&gt;the
CPU can easily override this
display&lt;/a&gt; as necessary.&lt;/p&gt;

&lt;p&gt;My problem was that, even though the “DONE” LED would (now) light up when
loading a design containing the MIG, these user LEDs were not doing anything.&lt;/p&gt;

&lt;p&gt;Curiously, if I overrode the LEDs at the top level of the design, I could make
them turn either on or off.  I just couldn’t get my internal design to control
these LEDs properly.  (I call this an “override” method because the
&lt;a href=&quot;https://github.com/ZipCPU/kimos/blob/804559a307bd58e34527ebb3f791ad414fe71803/rtl/toplevel.v#L474-L477&quot;&gt;top level&lt;/a&gt;
of my design is generated by
&lt;a href=&quot;/zipcpu/2017/10/05/autofpga-intro.html&quot;&gt;AutoFPGA&lt;/a&gt;, and I
wasn’t going so far as to adjust the &lt;a href=&quot;https://github.com/ZipCPU/kimos/blob/804559a307bd58e34527ebb3f791ad414fe71803/autodata/spio.txt#L64-L67&quot;&gt;original source&lt;/a&gt;s
describing how these LEDs should ultimately operate.) Still, using this
top-level override method, I was able to discover that I could see LEDs 4-7
from my desk chair, that these were how I had wired up the LEDs on the
baseboard (a year earlier), and that LEDs 6 and 7 had an opposite polarity
from all of the other LEDs on the board.&lt;/p&gt;

&lt;p&gt;All useful, it just didn’t help.&lt;/p&gt;

&lt;p&gt;At one point, I noticed that the LEDs were configured to use the IO standard
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SSTL15&lt;/code&gt; instead of the normal &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LVCMOS15&lt;/code&gt; standard I normally use.  Once I
switched from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SSTL15&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LVCMOS15&lt;/code&gt;, my &lt;a href=&quot;/blog/2017/05/20-knight-rider.html&quot;&gt;knight-rider
display&lt;/a&gt; worked.&lt;/p&gt;

&lt;p&gt;Unfortunately, neither the serial port nor the Ethernet port worked.  Both of
these continued to work if the MIG controller wasn’t included in the design,
just not if the MIG controller was included.&lt;/p&gt;

&lt;h3 id=&quot;voodoo-engineering&quot;&gt;Voodoo Engineering&lt;/h3&gt;

&lt;p&gt;I like to define Voodoo engineering as “Changing what isn’t broken, in an
attempt to fix what is.”  Not knowing what else to try, I spent a lot of time
doing Voodoo engineering just trying to get the design working.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;With the help of a hardware friend and his lab, we examined all of the power
rails.  Could it be that the design was losing power during the startup
sequence, and so not starting properly even though the “DONE” LED was
lighting up?&lt;/p&gt;

    &lt;p&gt;No.&lt;/p&gt;

    &lt;p&gt;After a lot of work with various probes, all we discovered was that the
design used about 50% more power when the MIG was included.  Did this mean
there was a short circuit somewhere?&lt;/p&gt;

    &lt;p&gt;Curiously, it was the FPGA that got warmer, not the DDR3 SDRAM.&lt;/p&gt;

    &lt;p&gt;I left this debug session convinced I needed to look for a bug in my XDC
file somewhere.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;I spent a lot of time comparing the schematic to the XDC file.  I discovered
some rather important things:&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;
        &lt;p&gt;Some banks required internal voltage references.  These were not declared
in any of the reference designs.&lt;/p&gt;
      &lt;/li&gt;
      &lt;li&gt;
        &lt;p&gt;Two banks needed DCI cascade support, but the reference design only had
one bank using it.&lt;/p&gt;
      &lt;/li&gt;
      &lt;li&gt;
        &lt;p&gt;The design required a voltage select pin that I wasn’t setting.  This pin
needed to be set to high impedance.&lt;/p&gt;
      &lt;/li&gt;
      &lt;li&gt;
        &lt;p&gt;I had the DDR3 CKE IO mapped to the wrong pin.&lt;/p&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The &lt;a href=&quot;https://www.enclustra.com/en/products/base-boards/mercury-st1/&quot;&gt;Enclustra ST1
baseboard&lt;/a&gt;
can support multiple IO voltages.  These need to be configured via a set of
user jumpers, and the constraints regarding how these IO voltages are to be
set are … complex.  Eventually, I set banks A and B to 1.8V and bank C
to 1.2V.&lt;/p&gt;

    &lt;p&gt;Sadly, nothing but the LEDs were using banks B and C, so … none of these 
changes helped.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I suppose I should be careful here: I was probably fixing actual bugs during
these investigations.  However, none of the bugs I fixed actually helped move
me forward.  Fixing these bugs didn’t get the
&lt;a href=&quot;/formal/2019/02/21/txuart.html&quot;&gt;UART&lt;/a&gt;+SDRAM working, nor did
they get the network interface working whenever the SDRAM was included.  Both
of these interfaces worked without the SDRAM as part of the design, they just
didn’t work when connecting the MIG SDRAM controller to the design.&lt;/p&gt;

&lt;p&gt;Was there some short circuit connection between SDRAM pins and something
on the &lt;a href=&quot;/formal/2019/02/21/txuart.html&quot;&gt;UART&lt;/a&gt; or network IO
banks?  There shouldn’t be, I mean, both of these peripherals were on
separate IO banks from the DDR3 SDRAM.&lt;/p&gt;

&lt;h3 id=&quot;reference-design&quot;&gt;Reference design&lt;/h3&gt;

&lt;p&gt;At this point, I needed to use the reference design to make certain the
hardware still worked.  I’d had weeks of problems where the DONE pin wasn’t
going high.  Did this mean I’d short circuited or otherwise damaged the board?
The design was using a lot more power when configured to use the SDRAM.  Did
this mean there was a short circuit damaging the board?  Had my board been
broken?  Was there a manufacturing defect?&lt;/p&gt;

&lt;p&gt;Normally, this is where you’d use a reference design.  Indeed, this was
&lt;a href=&quot;https://www.enclustra.com&quot;&gt;Enclustra&lt;/a&gt;’s recommendation to me.  Normally this
would be a good recommendation.  They recommended I use their reference design,
prove that the hardware worked, and then slowly migrate that design to my
needs.  My problem with this approach was that their reference design wasn’t
written in RTL.  It was written in TCL with a Verilog wrapper.  Worse, their
TCL Ethernet implementation depended upon an Ethernet controller from Xilinx
that … required a license.  Not only that,
&lt;a href=&quot;https://www.enclustra.com&quot;&gt;Enclustra&lt;/a&gt; did not provide any master XDC file(s).
(They did provide schematics and a .PRJ file with many of the IOs declared
within it.)  Still, how do you “slowly migrate” TCL to RTL?  That left me with
just their MIG PRJ file to reference and … I still had a bug.&lt;/p&gt;

&lt;p&gt;There were a couple of differences between &lt;a href=&quot;https://github.com/ZipCPU/kimos/blob/804559a307bd58e34527ebb3f791ad414fe71803/doc/mig.prj&quot;&gt;my MIG PRJ configuration
file&lt;/a&gt;
and their reference MIG configuration.  My &lt;a href=&quot;https://github.com/ZipCPU/kimos/blob/804559a307bd58e34527ebb3f791ad414fe71803/doc/mig.prj&quot;&gt;MIG PRJ configuration
file&lt;/a&gt;
used a 100MHz user clock, and hence a 400MHz DDR3 clock, whereas their
reference file used an 800MHz DDR3 clock.  (My design wouldn’t close timing at
200MHz, so I was backing away to 100MHz.)  Could this be the difference?&lt;/p&gt;

&lt;p&gt;Upon request, one of my teammates built a LiteX design for this board.  (It
took him less than 2hrs.  I’d been stuck for weeks!  How’d he get it going so
fast?  Dare I mention I was jealous?)  This LiteX design had no problems with
the DDR3 SDRAM–although it doesn’t use Xilinx’s MIG.  I even had him
configure this LiteX demo for the 400MHz DDR3 clock, and … there were no
problems.&lt;/p&gt;

&lt;p&gt;Given that the LiteX design “just worked”, I knew the hardware on my board
still worked.  I just didn’t know what I was doing wrong.&lt;/p&gt;

&lt;h3 id=&quot;the-final-bug-the-reset-polarity&quot;&gt;The final bug: the reset polarity&lt;/h3&gt;

&lt;p&gt;One difference between the MIG driven design and the non-MIG design (i.e. my
design without a DDR3 SDRAM controller) is that the MIG controller wants to
deliver both system clock and the system reset to the rest of the design.  Any
failure to get either a system clock or a system reset from the MIG controller
could break the whole design.&lt;/p&gt;

&lt;p&gt;So, I went back to the top level LEDs again.  I re-examined the logic, and
made sure LED[7] would blink if the MIG was held in reset, and LED[6] would
blink if the clocks didn’t lock.  This lead me to two problems.  The first
problem was based upon where I had my board set up: I couldn’t see LED[7]
from my desk top with a casual glance.  I had to make sure I leaned forward
in my desk chair to see it.  (Yes, this cost me a couple of debug cycles before
I realized I couldn’t see all of the LEDs without leaning forward.)  Once I
could see it, however, I discovered the system reset wire was being held high.&lt;/p&gt;

&lt;p&gt;Well, that would be a problem.&lt;/p&gt;

&lt;p&gt;Normally, when I use the MIG controller, I use an active high reset.  This
time, in order weed out all of the possible bugs, I’d been trying to make my
MIG configuration as close to the example/reference configuration I’d been
given.  That meant I set the design up to use an active-low reset–like the
reference design.  I had assumed that, if the MIG were given an active low
reset it would produce an active low user reset for the design.&lt;/p&gt;

&lt;p&gt;Apparently, I was wrong.  Indeed, after searching out the Xilinx user guide,
I can confirm I was definitely wrong.  The synchronous user reset was active
high.&lt;/p&gt;

&lt;p&gt;Once I switched to an active high reset things started working.  My serial
port now worked.  I could now read from memory over the UART interface, and
“ping” the network interface of the device.  Even better, my debugging
interface now worked.  That meant I could use my
&lt;a href=&quot;https://github.com/ZipCPU/wbscope&quot;&gt;Wishbone scope&lt;/a&gt; again.&lt;/p&gt;

&lt;p&gt;I was now &lt;a href=&quot;https://english.stackexchange.com/questions/25897/origin-of-the-phrase-now-were-cooking-with&quot;&gt;“cooking with gas”&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;cleaning-up&quot;&gt;Cleaning up&lt;/h3&gt;

&lt;p&gt;From here on out, things went quickly.  Sure, there were more bugs, but these
were easily found, identified, and thus fixed quickly.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;While the design came up and I could (now) read from memory, I couldn’t write
to memory without hanging up the design.  &lt;a href=&quot;https://github.com/ZipCPU/kimos/blob/96a24e5756a9e9a363d3d47c7962303afb2f65bd/rtl/migsdram.v#L354&quot;&gt;After tracing it, this bug turned
out to be a simple copy
error&lt;/a&gt;.  It was part of some logic I was getting ready
to test which would’ve ran the MIG at 200MHz, and the design at 100MHz–just
in case that was the issue.&lt;/p&gt;

    &lt;p&gt;This bug was found by adding a &lt;a href=&quot;https://github.com/ZipCPU/wbscope&quot;&gt;Wishbone
scope&lt;/a&gt; to the design, and then
seeing the MIG accept a request that never got acknowledged.&lt;/p&gt;

    &lt;p&gt;Yeah, that’d lock a bus up real quick.&lt;/p&gt;

    &lt;p&gt;I should point out that, because I use Wishbone and because Wishbone has the
ability to &lt;em&gt;abort&lt;/em&gt; an ongoing transaction, I was able to rescue my
connection to the board, and therefore my connection to the bus, even after
this fault.  No, I couldn’t rescue my connection to the SDRAM without a
full reset, but I could still talk to the board and hence I could still use
my &lt;a href=&quot;https://github.com/ZipCPU/wbscope&quot;&gt;Wishbone scope&lt;/a&gt; to debug the problem.
Had this been an AXI bus, I would not have had this capability without using
some form of &lt;a href=&quot;/formal/2020/05/16/firewall.html&quot;&gt;protocol
firewall&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Other bugs were found in the &lt;a href=&quot;https://github.com/ZipCPU/kimos/blob/804559a307bd58e34527ebb3f791ad414fe71803/sw/host/nexbus.cpp&quot;&gt;network software&lt;/a&gt;.
This was fairly new software, never used before, so finding bugs here were
not really all that surprising.&lt;/p&gt;

    &lt;p&gt;At least with these bugs, I could use my &lt;a href=&quot;https://github.com/ZipCPU/kimos/blob/804559a307bd58e34527ebb3f791ad414fe71803/sw/host/nexbus.cpp&quot;&gt;network
software&lt;/a&gt;
together with my Verilator-based simulation environment.  Indeed, my &lt;a href=&quot;https://github.com/ZipCPU/kimos/blob/master/sim/netsim.cpp&quot;&gt;C++
network model&lt;/a&gt;
allows me to send/receive UDP packets to the simulated design, and receive
back what the design would return.&lt;/p&gt;

    &lt;p&gt;Like I said, by this point I was &lt;a href=&quot;https://english.stackexchange.com/questions/25897/origin-of-the-phrase-now-were-cooking-with&quot;&gt;“cooking with gas”&lt;/a&gt;.
It took about two days (out of 45) to get this portion up and running.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The one bug that was a bit surprising was due to a network access test that
set the host software into an infinite loop.  During this infinite loop, the
software would keep writing to a debug dump, which I was hoping to later use
to debug any issues.  The surprise came from the fact that I wasn’t expecting
this issue, so I had let the test run while I stepped away for some family
time.  (Supper and a movie with the kids may have been involved here …)
When I discovered the bug, the debug dump file had grown to over 270GB!
Still, fixing this bug was pretty routine, and there’s not a lot to share
other than it was just another bug.&lt;/p&gt;

&lt;h2 id=&quot;lessons-learned&quot;&gt;Lessons learned&lt;/h2&gt;

&lt;p&gt;There are a lot of lessons to be learned here, some of which I’ve done to
myself.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;All RTL&lt;/p&gt;

    &lt;p&gt;I like all RTL designs.  I prefer all RTL designs.  I can debug an all RTL
design.  I can adjust an all RTL design.  I can version control an all RTL
design.&lt;/p&gt;

    &lt;p&gt;I can’t do this with a TCL design that references opaque components that
may get upgraded or updated any time I turn around.  Worse, I can’t fix
an opaque component–and Xilinx isn’t known for fixing the bugs in their
designs.  As an example, the following bug has been lived in Xilinx’s
Ethernet-Lite controller for years:&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;table align=&quot;center&quot; style=&quot;float: center&quot;&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/xilinx-axi-ethernetlite/2022.1-rvalid.png&quot; width=&quot;749&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;I reported this in 2019.  This is only one of several bugs I found.  The logic
above is as of Vivado 2022.1.  In this snapshot, you can see how they commented
the originally broken code.  As a result, the current design now looks like
they tried to fix it and … it’s still broken on its face.  (i.e. RVALID
shouldn’t be adjusted or dropped unless RREADY is known to be true …)&lt;/p&gt;

&lt;p&gt;Or what about RDATA?&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: center&quot;&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/xilinx-axi-ethernetlite/2022.1-check.png&quot; width=&quot;749&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;This also violates the first principles of &lt;a href=&quot;/blog/2021/08/28/axi-rules.html&quot;&gt;AXI
handshaking&lt;/a&gt;.  Notice that
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RDATA&lt;/code&gt; might not get set if &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;!RVALID &amp;amp;&amp;amp; !RREADY&lt;/code&gt;–hence the first &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RDATA&lt;/code&gt;
value read from this device might be in error.&lt;/p&gt;

&lt;p&gt;Yeah, … no.  I’m not switching to Xilinx IP any time soon if I can avoid it.
At least with my own IP I can fix any problems–once I find them.&lt;/p&gt;

&lt;p&gt;For all of these reasons, I would want an all HDL reference design from any
vendor I purchase hardware from.  At least in this case, you can now find
an &lt;a href=&quot;https://github.com/ZipCPU/kimos&quot;&gt;all-Verilog reference design for the ST1+KX2 boards in my Kimos
project&lt;/a&gt;–to include a working (and now open
source) &lt;a href=&quot;https://github.com/AngeloJacobo/uberDDR3&quot;&gt;DDR3 SDRAM controller&lt;/a&gt;.&lt;/p&gt;

&lt;ol start=&quot;2&quot;&gt;
  &lt;li&gt;
    &lt;p&gt;Simulation.&lt;/p&gt;

    &lt;p&gt;Perhaps my biggest problem was that I didn’t have an all-Verilog simulation
environment set up for this design from the top level on down.  Such an
environment should’ve found this reset bug at the top level of the design
immediately.  Instead, what I have is a joint Verilog/C++ environment
designed to debug the design from just below the top level using Verilator.
This kept me from finding and identifying the reset bug–something that
could have (and perhaps should have) been found in simulation.&lt;/p&gt;

    &lt;p&gt;In the end, after finding the reset bug, I did break down and I found a
&lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3/blob/main/testbench/ddr3.sv&quot;&gt;Micron model of a DDR3 memory&lt;/a&gt;.
This was enough to debug some issues associated with getting the &lt;a href=&quot;https://github.com/ZipCPU/wbscope&quot;&gt;Wishbone
scope&lt;/a&gt; working inside the memory
controller, although it’s not really a permanent solution.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left; padding: 25px&quot;&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/kimos/open-sim.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Still, this is a big enough problem that I’ve been shopping around the idea
   of an open source all-Verilog simulation environment–something faster than
   Iverilog, with more capability.  If you are interested in working on
   building such a capability–let me know.&lt;/p&gt;

&lt;ol start=&quot;3&quot;&gt;
  &lt;li&gt;
    &lt;p&gt;Finger pointing&lt;/p&gt;

    &lt;p&gt;As is always the case, I tend to point the finger everywhere else when I
can’t find a bug.  This seems to be a common trait among engineers.  For
the longest time I was convinced that my design was creating a short
circuit on the board.  As is typically the case, I often have to come back
to reality once I do find the bugs.&lt;/p&gt;

    &lt;p&gt;I guess the bottom line here is that I have more than enough humble pie to
share.  Feel free to join me.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Since writing this, the project has moved forward quite significantly.  The
design now appears to work with both the MIG and with the
&lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3&lt;/a&gt; controller–although I
made some more beginner mistakes in the clock setup while getting that
controller up and running.  Still, it’s up and running now, so my next task
will be running some performance metrics to see which controller runs
faster/better/cheaper.  (Hint: the
&lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3&lt;/a&gt; controller uses about
30% less logic, so there’s at least one difference right off the bat.)&lt;/p&gt;

&lt;p&gt;Stay tuned, and I’ll keep you posted regarding how the two controllers compare
against each other.&lt;/p&gt;
&lt;hr /&gt;&lt;p&gt;&lt;em&gt;For I am not ashamed of the gospel of Christ: for it is the power of God unto salvation to every one that believeth; to the Jew first, and also to the Greek. (Romans 1:16)&lt;/em&gt;</description>
        <pubDate>Thu, 13 Jun 2024 00:00:00 -0400</pubDate>
        <link>https://zipcpu.com/blog/2024/06/13/kimos.html</link>
        <guid isPermaLink="true">https://zipcpu.com/blog/2024/06/13/kimos.html</guid>
        
        
        <category>blog</category>
        
      </item>
    
      <item>
        <title>Chasing resets</title>
        <description>&lt;p&gt;A true story.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right; padding: 20px&quot;&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/chasing-resets/cost-estimate.svg&quot; width=&quot;240&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;Some years ago, given a customer’s honest need and request, I proposed a
change to a client’s &lt;a href=&quot;/blog/2021/03/06/asic-lsns.html&quot;&gt;ASIC&lt;/a&gt;
IP.  Specifically, I wanted to add &lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC
checking&lt;/a&gt;,
based upon a &lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;
kept in an &lt;a href=&quot;https://en.wikipedia.org/wiki/Out-of-band&quot;&gt;out-of-band memory
region&lt;/a&gt;, to verify the ability
to properly read memory regions error free.  I said the change shouldn’t take
more than about two weeks, and I’d clean up some other problems I was aware of
in the mean time.  This change solved an urgent problem, so he agreed
to it.&lt;/p&gt;

&lt;p&gt;By the time I was done, my 80 hr proposal had turned into 270+ hrs of work.&lt;/p&gt;

&lt;h2 id=&quot;build-it-well&quot;&gt;Build it well&lt;/h2&gt;

&lt;p&gt;I’d like to start my discussion of what went wrong with a list of good
practices to follow.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left; padding: 20px;&quot;&gt;&lt;caption&gt;Fig 1. Basic test bench components&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/chasing-resets/verilogtb.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;Just as a background, a general test bench follows the format shown in Fig. 1,
on the right.  The “test bench” itself is composed of a series of scripts.
These scripts then interact with a common test bench “library”, which then
makes requests of an AXI bus via a “bus functional model”.  This project
was designed to make minor changes to the device under test.&lt;/p&gt;

&lt;p&gt;With that vocabulary under our belt, here are some of the good practices
I would expect to find in a well built design.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Avoid &lt;a href=&quot;https://en.wikipedia.org/wiki/Magic_number_(programming)&quot;&gt;magic numbers&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;Yes, I harp on &lt;a href=&quot;https://en.wikipedia.org/wiki/Magic_number_(programming)&quot;&gt;magic
numbers&lt;/a&gt;
a lot.  There’s a reason for it.  While it wasn’t hard at all to make the
requested changes, I had to come back later and spend more than two weeks
chasing down
&lt;a href=&quot;https://en.wikipedia.org/wiki/Magic_number_(programming)&quot;&gt;magic numbers&lt;/a&gt;
buried in the test bench.&lt;/p&gt;

    &lt;p&gt;Specifically, I wanted to add a hardware capability to calculate and store
a &lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt; in an &lt;a href=&quot;https://en.wikipedia.org/wiki/Out-of-band&quot;&gt;out
of band&lt;/a&gt; area on a storage
device, and then to check those values again when reading the data back.
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;s
can be calculated and checked
quickly and efficiently in hardware–especially if the data is already
moving.  Unfortunately, the test bench had hard coded locations where
everything was supposed to land in the hardware, and as a result all of
these locations needed updating in order to add room for the
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;I spent quite a bit of time chasing down all of these
&lt;a href=&quot;https://en.wikipedia.org/wiki/Magic_number_(programming)&quot;&gt;magic numbers&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;This applies to register address names as well–but we’ll come back to
these in a moment.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The &lt;a href=&quot;https://en.wikipedia.org/wiki/Rule_of_three_(computer_programming)&quot;&gt;“Rule of three”: If you have to write the same thing three times,
refactor it&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;If the &lt;a href=&quot;https://en.wikipedia.org/wiki/Magic_number_(programming)&quot;&gt;magic
numbers&lt;/a&gt;
were confined to one or two places, that would be
one thing.  Unfortunately, they were found throughout the test library
copied from place to place to place.  Every one of those copies then needed
personal attention to double check, in order to answer the question of
whether or not the “copied” number was truly a copied number that could
be modified or removed.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Name your register addresses.  It makes moving them easier.&lt;/p&gt;

    &lt;p&gt;Or, in this case, four versions of this IP earlier someone had removed a
control register from the IP.  The address was then reallocated for another
purpose.  No one noticed the test scripts were still accessing the old
register until I came along and tried to assign names to all of the
registers within the IP.  I then asked, where is the XYZ register?  It’s
not at this address …&lt;/p&gt;

    &lt;p&gt;I hate coming across situations like this.  “Fixing” such situations
always risks making a change (which needs to be made) that then might break
something else later.  (Yes, that happens too …)&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;There’s a benefit to naming even one bit &lt;a href=&quot;https://en.wikipedia.org/wiki/Magic_number_(programming)&quot;&gt;magic
numbers&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;Not to get side tracked, but in another design there was a one-bit
number to indicate data direction.  Throughout the logic, you’d find
expressions like: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;if (direction)&lt;/code&gt;, or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;if (!direction)&lt;/code&gt;.  While you
might think this was okay, the designer wrote the design for the wrong
sense.&lt;/p&gt;

    &lt;p&gt;I then came along and then wanted to “fix” things.&lt;/p&gt;

    &lt;p&gt;Not knowing how deep the corruption lie, or whether or not I was getting
the direction mapping right in the first place, I changed all of these
expressions to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;if (direction == DIR_SOURCE)&lt;/code&gt; or
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;if (direction == DIR_SINK)&lt;/code&gt;.  This way, if necessary, I could come back
later and change &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DIR_SOURCE&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DIR_SINK&lt;/code&gt; at one location (okay, one per
file …) and then trust that everything would change consistently
throughout the design.&lt;/p&gt;

    &lt;p&gt;I got things “mostly” right on my first pass.  The place where I struggled
was in the test bench, where things were named backwards.  Why?  Because if
the design was the &lt;em&gt;source&lt;/em&gt;, the test bench needed to be the &lt;em&gt;sink&lt;/em&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;That reset delay.&lt;/p&gt;

    &lt;p&gt;This is really what I want to discuss today.  How long should a design be
held in reset before being released?&lt;/p&gt;

    &lt;p&gt;My personal answer?  No longer than it needs to be.  Xilinx asks for a 16
clock period AXI reset.  Most designs don’t need this.  Indeed, most
digital designs can reset themselves in a single clock period, although some
require two.&lt;/p&gt;

    &lt;p&gt;Some designs do very validly need a long reset.  I’ve come across this often
where an analog tracking circuit needs to start and lock before the digital
logic should start working with the results of that circuit.  This
make sense, I can understand it, and I’ve built this sort of thing before
when the hardware requires it.  SDRAMs often require long resets as well,
on the order of 200us.&lt;/p&gt;

    &lt;p&gt;In the case of today’s example and lesson learned story, the test bench for
the digital portion of the design was using a 1,000 clock reset.  That is,
the test bench held the design in reset for 1,000 clock cycles.  Why?  That’s
a good question.  Nothing in the IP required such a long reset.  So, I
changed it to 3 cycles.  Three cycles was still overkill–one cycle
should’ve been sufficient, but simulation time can be expensive.  Why
waste simulation time if you don’t need to?&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right; padding: 20px&quot;&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/chasing-resets/initial-turn-in.svg&quot; width=&quot;240&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;After changing to a 3 cycle reset, the design worked fine and passed its
   test cases.  I turned my work in, and counted the project done.  All my
   work had been completed in (roughly) the 80 hours I had projected.
   Nice.&lt;/p&gt;

&lt;p&gt;(Okay, my notes say my initial turn in took closer to 120hrs, but I’m going
   to tell the story and pretend my cost estimate was 80hrs.  I can eat a
   40hr overrun on an 80hr contract if I have to–especially if it’s an
   overrun in what I had proposed to do.)&lt;/p&gt;

&lt;ol start=&quot;6&quot;&gt;
  &lt;li&gt;
    &lt;p&gt;Constants should be constant.  Parameters are there for that purpose.&lt;/p&gt;

    &lt;p&gt;If a design has a startup constant, something it depends upon, then that
constant should be set on &lt;em&gt;startup&lt;/em&gt;–before the first clock tick is
over, and not later.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left&quot;&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/chasing-resets/parameters.svg&quot; width=&quot;420&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Some engineers like to specify fixed design parameters via input ports
   rather than parameters.  While there are good reasons for doing
   this–especially in
   &lt;a href=&quot;/blog/2021/03/06/asic-lsns.html&quot;&gt;ASIC&lt;/a&gt; designs,
   those fixed constants should be set before the first clock cycle.  If they
   are supposed to be equivalent to wires that are hardwired to either power
   or ground, then they should act like it.&lt;/p&gt;

&lt;p&gt;Personally, I think this purpose is better served by parameters rather
   than hard wired constants, but I can understand a need to build an
   &lt;a href=&quot;/blog/2021/03/06/asic-lsns.html&quot;&gt;ASIC&lt;/a&gt; that
   can then be reconfigured in the field via hard switches.  For example,
   consider how switches can be used to adjust the FPGA wires controlling the
   boot source.  In other words, there is a time for configuring a design via
   input wires.  Just … make those values constants from startup for
   simulation purposes.&lt;/p&gt;

&lt;ol start=&quot;7&quot;&gt;
  &lt;li&gt;
    &lt;p&gt;Calculated values should be &lt;em&gt;calculated&lt;/em&gt;, not set in fixed macros.&lt;/p&gt;

    &lt;p&gt;This particular design depended upon a set of macros, and one test
configuration required one set of macros whereas another test configuration
might require another set of macros.&lt;/p&gt;

    &lt;p&gt;These macros contained all kinds of computed constants.  For instance, if
the design had 512 byte ECC blocks, then the block boundaries were things
like bytes 0-511, 512-1023, 1024-1535, etc–all captured in macros used by
the test bench, and all dependent on the devices page size.  Further
constants captured things like where the ECC would be located in a page, or
how many ECC bytes were used for the given ECC size–which was also a macro.&lt;/p&gt;

    &lt;p&gt;These constants got even worse when it came time to test the ECC.  In this
case, there were macros specifying where to place the errors.  So, for
example, the test bench for a 4bit ECC might generate one error in bytes
0-63, one in bytes 64-127, and macros existed defining these ranges all the
way up to the (macro-defined) size of the page which could be 2kB, 4kB, 8kB,
etc.&lt;/p&gt;

    &lt;p&gt;Sadly, the test script would only run a set of 30 test cases for &lt;em&gt;one&lt;/em&gt; set
of macros.  The design then needed to be reconfigured if you wanted to run
it with another set of macros.  Specifically, every time you needed to change
which ECC option you were testing, or which device model you wished to test
against, then you needed to switch macro sets.  In all, there were over 50
sets of macros, and each macro set contained between 40-150 macros the
design required in order to operate.  Worse, many of those macros were
externally calculated.  Running all tests required starting and restarting
the test driver, one macro set at a time.&lt;/p&gt;

    &lt;p&gt;Here was the problem:  What happens when a macro set configures the IP
to run in one fashion, and you need to reconfigure your operations
mid-sim-runtime to another macro set?  More specifically, what happens when
you need to boot with one ECC option (defined as a macro), and then switch
to another?  In this case, the macro set determined how memory was laid
out, and the customer wanted to change the memory layout in the middle of
a test run.  (He then couldn’t figure out why this was a problem for us …)&lt;/p&gt;

    &lt;p&gt;Lesson learned?  When some configuration points are dependent upon others,
use functions and calculate them within the IP.  That way, if you switch
things around later–or even at runtime, those test-library functions can
still capture all the necessary dependencies.&lt;/p&gt;

    &lt;p&gt;Second lesson learned?  IP should be configured via &lt;em&gt;parameters&lt;/em&gt;, not
macros, and those parameters should all be able to be scripted by the test
driver.  Perhaps you may recall how I discussed handling this in an article
discussing an upgrade to the &lt;a href=&quot;/zipcpu/2022/07/04/zipsim.html&quot;&gt;ZipCPU’s test
infrastructure&lt;/a&gt; some time
back.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If requirements are in flux, the IP can’t be delivered.&lt;/p&gt;

    &lt;p&gt;This should be a simple given, a basic no-brainer–it’s really basic
engineering 101.  If you don’t know what you want built, you shouldn’t
hire someone to build it until you have solid requirements.  If you want
to change things mid-task, any rework that will be required is going to be
charged against your bottom line.&lt;/p&gt;

    &lt;p&gt;In this case, the end customer of this IP discovered how I was intending
to meet their requirement by adding a
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;.
They then wanted things done
in a different manner.  Specifically, they wanted the
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;s
stored somewhere else.  Of course, this didn’t take place until after I’d
already proposed a fixed price contract based upon 80 hours of work, and
accomplished most
of that work.  Sure, I can support some changes–if the changes are minor.
For example, I initially built a 32b
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;
capability and they then wanted a 16b
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;
capability.  I figured that’d be a cheap change–since the design was (now)
well parameterized, only two parameters needed to change to adjust.
In this case, however, their simple desire to switch
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;
sizes from 32b to 16b now doubled the time spent in verification–since we
now needed to run the verification test suite twice–once for a 32b
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt; and once again
for the 16b &lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt; they
wanted.  Their other change request, moving the
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;
storage elsewhere, was major enough that it couldn’t be done without
starting the entire update over from scratch.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right; padding: 20px&quot;&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/chasing-resets/crcsz.svg&quot; width=&quot;360&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Change is normal.  Customers don’t always know what they want.  I get that.
   The problem here was that as long as requirements were in flux I wasn’t
   going to deliver any capability.  Let’s agree on what we’re going to deliver
   first, then I’ll deliver that.&lt;/p&gt;

&lt;p&gt;Then the customer started asking why it was taking so long to deliver the
   promised changes, when could we deliver the IP, and they had a hard RTL
   freeze deadline, and …  Yes, this became quite contradictory: 1) They
   wanted me to make a change that would force me to start my work all over
   from scratch, but at the same time 2) wanted all of my changes delivered
   immediately to meet their hard deadline.&lt;/p&gt;

&lt;p&gt;You can’t make this stuff up.&lt;/p&gt;

&lt;ol start=&quot;9&quot;&gt;
  &lt;li&gt;
    &lt;p&gt;If a design can fail, then a simulation test case should exist that can
trigger that failure.&lt;/p&gt;

    &lt;p&gt;This is especially true of &lt;a href=&quot;/blog/2021/03/06/asic-lsns.html&quot;&gt;ASIC&lt;/a&gt; designs, and a lesson I’m needing to learn
in a hard way.  In my case, I knew that I could properly calculate and
detect &lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;
errors.  I had formally proven that.&lt;/p&gt;

    &lt;p&gt;However, because I didn’t (initially) generate a simulation test to verify
what would happen on a
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;
failure, no one noticed how complicated the register handling for these
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;
failures had become.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Test bench drivers should mirror software&lt;/p&gt;

    &lt;p&gt;At some point in time, someone’s going to need to build control software.
They’ll start with the test bench driver.  The closer that test bench
driver looks to real software, the easier their task will be.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;so-what-happened&quot;&gt;So what happened?&lt;/h2&gt;

&lt;p&gt;Okay, ready for the story?&lt;/p&gt;

&lt;p&gt;Here’s what happened: I made my changes inside my promised two weeks.  I
merged and delivered the changes the customer had requested.  Everything worked.&lt;/p&gt;

&lt;p&gt;Life was good.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right; padding: 20px&quot;&gt;&lt;caption&gt;Fig 2. Everything fell apart when merging&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/chasing-resets/merge-failures.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;Then my client then said, oops, we’re sorry, you made the changes to the wrong
version of the IP.  The end customer had asked us to make a simple change to
allow the software to read a sector from non-volatile memory to boot from on
startup.  Here’s the correct version to change.&lt;/p&gt;

&lt;p&gt;The changes appeared minor, so I merged my changes and re-submitted.  This
time, many of the tests now failed.&lt;/p&gt;

&lt;p&gt;What went wrong?&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left; padding: 25px&quot;&gt;&lt;caption&gt;Fig 3. I now use watchdog timers in my test benches&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/chasing-resets/watchdogs.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;The first problem was the reset.  Remember how I removed that 1,000 clock reset,
because it wasn’t needed?  One of the test cases was waiting 100 clock cycles,
and then calling a startup task which would then set the “constant” input values
that were only sampled during reset.  This value would determine whether
the new bootloader capability would be run on startup or not.  The test bench
would then wait on the signal that the bootloader had completed its task.
However, with a 3 cycle reset, the boot on startup constant was never set
before the end of the reset period, so the bootloader never started and the
test bench then hung waiting for the bootloader to complete.  (Waiting on a
non-existent boot loader wasn’t a part of the design I started with.)&lt;/p&gt;

&lt;p&gt;It didn’t help that the test script (in file #1) called a task (in file #2),
that set a value (in file #3), that was checked elsewhere (in file #4), that
was … In other words, there was so much indirection on this reset between
where it was set and its ultimate consequence that it took quite a bit of time
to sort through.  No, it didn’t help that I hadn’t written this IP, nor its
test bench, nor its test scripts, nor its test libraries in the first place.&lt;/p&gt;

&lt;p&gt;Unfortunately, that was only the first problem.&lt;/p&gt;

&lt;p&gt;The second problem was due to an implied requirement that, if your test bench
reads from memory on bootup, there must be an initial set of valid data in
memory for it to boot from–especially if you are checking for valid
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;s and failing a
test if any &lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt; failed.
This requirement didn’t exist in either branch, but became an implied
requirement once the boot up and
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;
branches were merged together.  We hadn’t forseen that one coming either.&lt;/p&gt;

&lt;p&gt;A third problem came from how fault detection was handled.  In the case of
a fault, an &lt;a href=&quot;https://en.wikipedia.org/wiki/Interrupt&quot;&gt;interrupt&lt;/a&gt;
 would be generated.  The test bench would wait for that
&lt;a href=&quot;https://en.wikipedia.org/wiki/Interrupt&quot;&gt;interrupt&lt;/a&gt;, read the
&lt;a href=&quot;https://en.wikipedia.org/wiki/Interrupt&quot;&gt;interrupt&lt;/a&gt; register from the IP, and
then handle each active &lt;a href=&quot;https://en.wikipedia.org/wiki/Interrupt&quot;&gt;interrupt&lt;/a&gt;
as appropriate.&lt;/p&gt;

&lt;p&gt;In order to properly handle a
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;
failure, I needed to adjust how
&lt;a href=&quot;https://en.wikipedia.org/wiki/Interrupt&quot;&gt;interrupt&lt;/a&gt;s were handled in the test
library.  That’s fair.  Let’s look at that logic for a moment.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Interrupt&quot;&gt;Interrupt&lt;/a&gt;s were handled in the test
library within a Verilog task.  The relevant portion of this task read
something like:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;do&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;wait&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;interrupt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;read&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;interrupt_register&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;interrupt_register&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;8'h01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
			&lt;span class=&quot;c1&quot;&gt;// Handle interrupt #1&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;interrupt_register&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;8'h02&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
			&lt;span class=&quot;c1&quot;&gt;// Handle interrupt #2&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;interrupt_register&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;8'h03&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
			&lt;span class=&quot;c1&quot;&gt;// Handle interrupts #1 and #2&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// ...&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;while&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;task_not_done&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;This was a hidden violation of the &lt;a href=&quot;https://en.wikipedia.org/wiki/Rule_of_three_(computer_programming)&quot;&gt;rule of
three&lt;/a&gt;,
since you’d find the same &lt;a href=&quot;https://en.wikipedia.org/wiki/Interrupt&quot;&gt;interrupt&lt;/a&gt;
handler for &lt;a href=&quot;https://en.wikipedia.org/wiki/Interrupt&quot;&gt;interrupt&lt;/a&gt; #1 following
a check for the &lt;a href=&quot;https://en.wikipedia.org/wiki/Interrupt&quot;&gt;interrupt&lt;/a&gt; register
equalling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;8'h01&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;8'h03&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;8'h05&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;8'h07&lt;/code&gt;, etc.&lt;/p&gt;

&lt;p&gt;Worse, the &lt;a href=&quot;https://en.wikipedia.org/wiki/Interrupt&quot;&gt;interrupt&lt;/a&gt; handlers didn’t
just handle &lt;a href=&quot;https://en.wikipedia.org/wiki/Interrupt&quot;&gt;interrupt&lt;/a&gt;s.  They would
also issue commands, reset the interrupt register, use delays, etc., so that
handling &lt;a href=&quot;https://en.wikipedia.org/wiki/Interrupt&quot;&gt;interrupt&lt;/a&gt; #1 wasn’t the
same between a reading of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;8'h01&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;8'h05&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;My solution was to spend about two days refactoring this, so that every
&lt;a href=&quot;https://en.wikipedia.org/wiki/Interrupt&quot;&gt;interrupt&lt;/a&gt; would be given its own
independent handler properly.  The result looked something like the logic below.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;do&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;wait&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;interrupt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;read&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;interrupt_register&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;interrupt_register&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
			&lt;span class=&quot;c1&quot;&gt;// Handle interrupt #1&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;

		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;interrupt_register&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
			&lt;span class=&quot;c1&quot;&gt;// Handle interrupt #2&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;

		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;interrupt_register&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
			&lt;span class=&quot;c1&quot;&gt;// Handle interrupt #3&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// ...&lt;/span&gt;

		&lt;span class=&quot;n&quot;&gt;clear_interrupts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// and adjust the mask if necessary&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;while&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;task_not_done&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Among other things, I removed all of the register accesses from the
&lt;a href=&quot;https://en.wikipedia.org/wiki/Interrupt&quot;&gt;interrupt&lt;/a&gt; “handling” routines,
capturing their needs instead in some registers so the accesses could all
happen at the end.  As a result, &lt;em&gt;nothing&lt;/em&gt; took simulation time during these
handlers and things truly could be merged properly.&lt;/p&gt;

&lt;p&gt;I was proud of this update.  The portion of the test library handling
&lt;a href=&quot;https://en.wikipedia.org/wiki/Interrupt&quot;&gt;interrupt&lt;/a&gt;s now “made sense”.&lt;/p&gt;

&lt;p&gt;So, I sent the design off to the test team again only to have it come back to
me again a couple days later.  It had failed another test case.  Where?  In a
second &lt;em&gt;copy&lt;/em&gt; of the same broken
&lt;a href=&quot;https://en.wikipedia.org/wiki/Interrupt&quot;&gt;interrupt&lt;/a&gt; handler that I had just
refactored.&lt;/p&gt;

&lt;p&gt;While I might argue that the &lt;a href=&quot;https://en.wikipedia.org/wiki/Rule_of_three_(computer_programming)&quot;&gt;rule of
three&lt;/a&gt;
should’ve applied to this second copy, you could also argue that it didn’t
simply because it was a &lt;em&gt;second&lt;/em&gt; copy of the same
&lt;a href=&quot;https://en.wikipedia.org/wiki/Interrupt&quot;&gt;interrupt&lt;/a&gt; handler and not
a &lt;em&gt;third&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I could go on.&lt;/p&gt;

&lt;p&gt;As I mentioned in the beginning, a basic 80 hour task became a 270+ hour task.
Further, the task went from being &lt;em&gt;on time&lt;/em&gt; to late very suddenly.  Yes,
this was how I spent my Thanksgiving weekend that year.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;A good design plus test bench &lt;em&gt;should&lt;/em&gt; be easy to adjust and modify.&lt;/p&gt;

&lt;p&gt;Building a poor design, a poor test bench, or (worse) both constitutes taking
out a loan from your future self.  This is often called “&lt;a href=&quot;https://en.wikipedia.org/wiki/Technical_debt&quot;&gt;technical
debt&lt;/a&gt;.”
If this is a prototype you are willing to throw away later, then perhaps this
is okay.  If not, then you will end up paying that loan back later, with
interest, at a time you are not expecting to pay it.  It will cost you more
than you want to pay, at a time when you aren’t expecting a delay.&lt;/p&gt;

&lt;p&gt;What about formal methods?  Certainly formal methods might have helped, no?&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left; padding: 25px&quot;&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/vlog-wait/rule-of-gold.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;I suppose so.  Indeed, all of my updates were formally verified.  Better yet,
everything that was formally verified worked right the first time.  What about
the stuff that failed?  None of it had ever seen a formal tool.  Test bench
scripts, libraries, and device models, for example, tend not to be formally
verified.  Further, why would you formally verify a “working” design that you
were handed?  Unless, of course, it was never truly “working” in the first
place.&lt;/p&gt;

&lt;p&gt;Remember, well verified, well tested RTL designs are gold in this business.
Build them well, and you can sell or re-use them for years to come.&lt;/p&gt;
&lt;hr /&gt;&lt;p&gt;&lt;em&gt;For yet a little while, and he that shall come will come, and will not tarry.  (Heb 10:37)&lt;/em&gt;</description>
        <pubDate>Mon, 01 Apr 2024 00:00:00 -0400</pubDate>
        <link>https://zipcpu.com/blog/2024/04/01/chasing-resets.html</link>
        <guid isPermaLink="true">https://zipcpu.com/blog/2024/04/01/chasing-resets.html</guid>
        
        
        <category>blog</category>
        
      </item>
    
      <item>
        <title>2023, Year in review</title>
        <description>&lt;p&gt;It should come as no surprise that a blog with &lt;a href=&quot;https://zipcpu.com/blog/2017/08/01/advertising.html&quot;&gt;no
advertisements&lt;/a&gt; has never
paid my bills–at least not directly.  I blog for fun, and to some extent for
&lt;a href=&quot;https://en.wikipedia.org/wiki/Rubber_duck_debugging&quot;&gt;rubber duck debugging&lt;/a&gt;.
As I learn new concepts, I enjoy sharing them here.  Going through the rigor
to write about a topic also helps to make sure I understand the topic as well.&lt;/p&gt;

&lt;p&gt;Why are there &lt;a href=&quot;https://zipcpu.com/blog/2017/08/01/advertising.html&quot;&gt;no
advertisements&lt;/a&gt;?  For
two reasons.  First, because I’m not doing this to make money.  Second, because
because I want more control over any advertising from this site than
most advertisers want to provide.  Perhaps some day the site will be supported
by advertising.  Until then, the web site works fine without advertisements.&lt;/p&gt;

&lt;p&gt;So how then does the blog fit into my business model?  Simply because the blog
helps me find customers via those who read articles here and write to me.&lt;/p&gt;

&lt;h2 id=&quot;business-projects&quot;&gt;Business Projects&lt;/h2&gt;

&lt;p&gt;So, if the blog doesn’t pay my bills, then what does?&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;caption&gt;2023 Projects&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/2023-review/2023-funding.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Well, six projects have paid the bills this year.  Three of these have been
ASIC projects, to include the &lt;a href=&quot;https://www.arasan.com/product/xspi-psram-master/&quot;&gt;PSRAM/NOR flash
controller&lt;/a&gt;, and an &lt;a href=&quot;https://www.arasan.com/product/onfi-4-2-controller-phy/&quot;&gt;ONFI
NAND flash controller&lt;/a&gt;.
Three other projects this year have been FPGA projects, to include an &lt;a href=&quot;/blog/2023/11/25/eth10g.html&quot;&gt;open
source 10Gb Ethernet switch&lt;/a&gt;
and a SONAR front end based upon my
&lt;a href=&quot;https://github.com/ZipCPU/videozip&quot;&gt;VideoZip&lt;/a&gt; design–after including several
very significant upgrades, such as handling ARP and ICMP requests in hardware.
That’s four of the six projects from this year.  Once the other two projects
become a bit more marketable, I may mention them here as well.&lt;/p&gt;

&lt;p&gt;Since I’ve already discussed the &lt;a href=&quot;/blog/2023/11/25/eth10g.html&quot;&gt;10Gb Ethernet
design&lt;/a&gt;, let me take a moment
and discuss the &lt;strong&gt;&lt;a href=&quot;https://github.com/ZipCPU/wbi2c&quot;&gt;I2C controller&lt;/a&gt;&lt;/strong&gt; within it.
The &lt;a href=&quot;https://github.com/ZipCPU/wbi2c&quot;&gt;I2C controller&lt;/a&gt; was originally designed
to support the SONAR project.  Perhaps you may remember the &lt;a href=&quot;/blog/2021/11/15/ultimate-i2c.html&quot;&gt;initial article,
outlining the design goals for this
controller&lt;/a&gt;.  Thankfully,
it’s met all of these goals and more–but we’ll get to that in a moment.  As
part of the SONAR project, its purpose was to sample various non-acoustic
telemetry data: temperature, power supply voltage, current usage, humidity
within the enclosure, and more.  All of these needed to be sampled at regular
intervals.  At first glance, &lt;a href=&quot;https://www.reddit.com/r/FPGA/comments/13ti5zx/when_do_you_solve_a_problem_in_software_instead/&quot;&gt;this sounds like a software
task&lt;/a&gt;–that
is until you start adding real-time requirements to it such as the need to
shut down the SONAR transmitter if it starts overheating, or using so much
power that the FPGA itself will brown out shortly.  So, the
&lt;a href=&quot;https://github.com/ZipCPU/wbi2c&quot;&gt;I2C controller&lt;/a&gt; was designed to generate
(AXI stream) data packets automatically, without CPU intervention, which could
then be forwarded … somewhere.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left; padding: 25px&quot;&gt;&lt;caption&gt;An example I2C-driven OLED output&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/2023-review/ssdlogo-demo.jpg&quot; width=&quot;260&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/wbi2c&quot;&gt;This design&lt;/a&gt; was then incorporated into the
&lt;a href=&quot;/blog/2023/11/25/eth10g.html&quot;&gt;10Gb Ethernet design&lt;/a&gt;.  There
it provided the team the ability to 1) read the DDR3 memory stick
configuration–useful for making sure the &lt;a href=&quot;https://github.com/AngeloJacobo/DDR3_Controller&quot;&gt;DDR3
controller&lt;/a&gt; was properly
configured, 2) read the SFP+ configuration–and discover that we were using
1GbE SFP+ connectors initially instead of 10GbE connectors (Oops!), 3) read the
&lt;a href=&quot;https://en.wikipedia.org/wiki/Extended_Display_Identification_Data&quot;&gt;Extended Display Identification Data
(EDID)&lt;/a&gt;
from the downstream &lt;a href=&quot;https://en.wikipedia.org/wiki/HDMI&quot;&gt;HDMI&lt;/a&gt;
monitor, 4) configure and verify the &lt;a href=&quot;https://www.skyworksinc.com/-/media/Skyworks/SL/documents/public/data-sheets/Si5324.pdf&quot;&gt;Si5324&lt;/a&gt;’s register
settings, 5) draw a logo onto a
&lt;a href=&quot;https://www.amazon.com/Teyleten-Robot-Display-SSD1306-Raspberry/dp/B08ZY4YBHL/&quot;&gt;small OLED display&lt;/a&gt;,
all in addition to 6) actively monitoring hardware temperature.&lt;/p&gt;

&lt;p&gt;Supporting these additional tasks required two fundamental changes to the
&lt;a href=&quot;/blog/2021/11/15/ultimate-i2c.html&quot;&gt;initial vision for this I2C
controller&lt;/a&gt;.  First, I
needed an &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/wbi2c/wbi2cdma.v&quot;&gt;I2C
DMA&lt;/a&gt;, to
quietly transfer results read from the device to memory.  Only once I had
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/wbi2c/wbi2cdma.v&quot;&gt;this DMA&lt;/a&gt;
could the CPU then inspect and/or report on the results.  (It was probably one
of the easiest DMA’s I’ve written, since &lt;a href=&quot;/blog/2021/11/15/ultimate-i2c.html&quot;&gt;I2C is a rather slow
protocol&lt;/a&gt;.)
Second, each packet needed a designated &lt;em&gt;destination&lt;/em&gt; channel, so the design
could know where to forward the results.  This was useful for knowing if the
I2C information should be forwarded to &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/wbi2c/wbi2cdma.v&quot;&gt;the
DMA&lt;/a&gt;, for
storing in memory, or the &lt;a href=&quot;https://en.wikipedia.org/wiki/HDMI&quot;&gt;HDMI&lt;/a&gt;
slave controller, for forwarding the downstream monitor’s
&lt;a href=&quot;https://en.wikipedia.org/wiki/Extended_Display_Identification_Data&quot;&gt;EDID&lt;/a&gt;
to the upstream monitor.  The fact that &lt;a href=&quot;https://github.com/ZipCPU/wbi2c&quot;&gt;this
controller&lt;/a&gt;, designed for completely separate
project, in a completely different domain (i.e. SONAR), ended up working so
well in an &lt;a href=&quot;/blog/2023/11/25/eth10g.html&quot;&gt;10Gb Ethernet
design&lt;/a&gt; project
is a basic testament to a well designed interface.&lt;/p&gt;

&lt;p&gt;The year has also included some internally funded projects.  These include
a new &lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SDIO/eMMC controller&lt;/a&gt;, a (to-be-posted)
upgrade to &lt;a href=&quot;/blog/2017/06/05/wb-bridge-overview.html&quot;&gt;my standard debugging
bus&lt;/a&gt;, and a
&lt;a href=&quot;/zipcpu/2023/05/29/zipcpu-3p0.html&quot;&gt;ZipCPU upgrade&lt;/a&gt;.  Allow
me to take a moment to discuss these three (unfunded) projects in a bit more
detail.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;&lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SDIO/eMMC controller&lt;/a&gt;&lt;/strong&gt; is new.  By
using all four data lanes and a higher clock rate, this upgrade offers a
minimum 8x transfer rate performance improvement over my prior SPI-only
version.  That’s kind of exciting.  Even better, the IP has been tested on
both an SD card as well as an &lt;a href=&quot;http://www.skyhighmemory.com/download/eMMC_4GB_SML_PKG_S40FC004_002_01112.pdf&quot;&gt;eMMC
chip&lt;/a&gt;
as part of the &lt;a href=&quot;/blog/2023/11/25/eth10g.html&quot;&gt;KlusterLab (i.e. 10Gb Ethernet
board)&lt;/a&gt; design.  The IP, &lt;a href=&quot;https://github.com/ZipCPU/sdspi/tree/master/sw&quot;&gt;plus
software&lt;/a&gt;, is so awesome I’m
likely to add it to any future designs I have with SD cards or eMMC chips in
the future.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;The difference between SPI and SDIO: Speed&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/2023-review/sdiovspi.svg&quot; width=&quot;640&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;That’s just the beginning, too.  Just because &lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;this new SDIO
controller&lt;/a&gt; works on hardware, doesn’t mean
it works in all modes.  Since its original posting, I’ve added verification to
support all the modes our hardware doesn’t (yet) support.  I’ve also started
adding eMMC BOOT mode support, and I expect I’ll be (eventually) adding DMA
support to this IP as well.  My goal is also to make sure I can support
multiple sector read or write commands–something the SPI only version couldn’t
support, and something that’s supposed to be supported in this new version but
isn’t tested (yet).  (Remember, &lt;a href=&quot;/zipcpu/2022/07/04/zipsim.html&quot;&gt;if it’s not tested it doesn’t
work&lt;/a&gt;.)  In other
words, despite declaring this IP as “working”, it remains under very active
development.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;caption&gt;I will use Slave/Master Terms where appropriate&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/2023-review/slave.svg&quot; width=&quot;480&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;!-- (COMSONICS, SoundWire) --&gt;

&lt;p&gt;Then there’s the upgrade to the &lt;strong&gt;&lt;a href=&quot;https://github.com/ZipCPU/dbgbus&quot;&gt;debuging
bus&lt;/a&gt;&lt;/strong&gt;.  This has been in the works now for
quite a while.  My current/best debugging bus implementation
uses six printable characters to transmit a control code (read request, write
data, or new address) plus 32-bits of data.  At six data bits per 8-bit
character transmitted, this meant six characters would need to be sent
(minimum) in order to send either a 32-bit address or 32-bit data word,
leading to a 36b internal word.  It also required &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;10*6&lt;/code&gt; baud periods (10 baud
periods times six characters) for every uncompressed 32b of data transferred,
for a best case efficiency of 53%.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;The debugging bus multiplexes console and bus channels&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/2023-review/dbgbus.svg&quot; width=&quot;640&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Since then, I’ve slowly been working on an upgrade to this protocol that will
use five (not necessarily printable) characters to transmit 32-bits of data
plus a control code.  This upgrade should achieve an overall 64% worst case
(i.e.  uncompressed) efficiency, for a speed improvement of about 16% over the
prior controller in worst case conditions.  The upgrade comes with some
synchronization challenges, but currently passes all of its simulation
checks–so at this point it’s ready for hardware testing.  My only problem
is … this upgrade isn’t paid for.  Inserting it into one of my business
projects is likely to increase the cost of that project–both in terms of
integration time as well as verification while chasing down any new bugs
introduced by this new implementation–at least until the upgraded bus is
verified.  This has kept this debugging bus upgrade at a lower priority to the
other paying projects.  Well, that and the fact that I only expect a 16%
improvement over the prior implementation.  As a result, the upgrade isn’t
likely to pay for itself for a long time.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Moving from 6 characters to 5 characters to send 32bits&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/2023-review/exbus.svg&quot; width=&quot;640&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Finally, let’s discuss the &lt;a href=&quot;/zipcpu/2023/05/29/zipcpu-3p0.html&quot;&gt;ZipCPU’s big
upgrades&lt;/a&gt;.  As with the
other upgrades, these were also internally funded.  However, the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; has now formed a backdrop to a
majority of my projects.  Indeed, it’s &lt;a href=&quot;/zipcpu/2021/07/23/cpusim.html&quot;&gt;helped me verify ASIC IP in both
simulation&lt;/a&gt; and FPGA
contexts.  One upgrade in particular will keep on giving, and that is the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;’s &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/tree/master/rtl/zipdma&quot;&gt;new DMA
controller&lt;/a&gt;.  I’ve
already managed to integrate it into a &lt;a href=&quot;https://github.com/ZipCPU/wbsata&quot;&gt;(work in progress) SATA
controller&lt;/a&gt;, and I’m likely to retarget this
DMA engine (plus a small state machine) to meet the DMA needs of my new
&lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SDIO/eMMC controller&lt;/a&gt;.  Indeed, it is so
versatile that I’m likely to use this controller across a lot of projects.
Better yet, at this rate, I’m likely to build an AXI version of this new DMA
supporting all of these features as well.  It’s just that good.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left; padding: 25px&quot;&gt;&lt;caption&gt;All labour is profitable, whether or not it's paid for&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/tweets/bible/all-labour.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;As for dollars?  Well, let’s put it this way: the year is now over, and I’m
still in business.  Not only that, but I’ve also managed to keep two kids in
college this year.  More specifically, I expect my third child to graduate
from college this year.  (Five to go …)  So, I’ve been hanging in there,
and I thank my God that my bills have been paid.&lt;/p&gt;

&lt;h2 id=&quot;articles&quot;&gt;Articles&lt;/h2&gt;

&lt;p&gt;2023 has been a slower year for articles than past years.  Much of this is due
to the fact that my time has been so well spent on other paying projects.
That’s left less time for blogging.  (No, it doesn’t help that my family
has fallen in love with Football, and that my major blogging times have been
spent watching my son’s high school games, Air Force Academy Falcon’s football,
the Kansas City Chiefs, Miami Dolphins, Philadelphia Eagles, and my own home
team–the disappointing Minnesota Vikings.) Still, I have managed to push out
seven new articles this year.  Let’s look at each, and see how easy they can
be found using DuckDuckGo.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/2023-review/vikings.svg&quot; width=&quot;320&quot; alt=&quot;What does a Vikings fan do after watching the Vikings win the super bowl?  He turns off the play-station 4.&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;/blog/2023/02/13/eccdbg.html&quot;&gt;Debugging the hard stuff&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;This article discusses some of the challenges I went through when debugging
modifications I made to a working ECC algorithm.  ECC, of course, is one of
those “hard” problems to debug since the intermediate data tends to look
meaningless when viewed.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;DuckDuckGo Ranking:&lt;/strong&gt; A search for “FPGA ECC Debugging” brings up the
&lt;a href=&quot;&quot;&gt;ZipCPU home page&lt;/a&gt; as return #111.&lt;/p&gt;

    &lt;p&gt;That’s kind of disappointing.  Let’s try a search using Google.  Google
finds &lt;a href=&quot;/blog/2023/02/13/eccdbg.html&quot;&gt;the correct page&lt;/a&gt;
immediately as its #1 result.  At first I thought the difference was because
Google knew I was interested in &lt;a href=&quot;&quot;&gt;ZipCPU&lt;/a&gt; results.  Then
I asked my daughter to repeat my test on her phone in private mode.  (She
has no interest in FPGA anything, so this would be a first for her.)  Her
Google ranking came up identical, so maybe I can trust this Google ranking.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;/zipcpu/2023/03/13/swic.html&quot;&gt;What is a SwiC&lt;/a&gt;?&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left; padding: 25px&quot;&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/swic/barecpu.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;The &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; was originally designed
  to be a System within a Chip, or a SwiC as I called it.  This article
  discusses what a SwiC is, and tries to answer the question of whether or not
  a SwiC makes sense, or equivalently whether or not the
  &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; made for a good SwiC in the
  first place.  In many ways, this article was a review of whether or not the
  &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;’s
  design goals were appropriate, and whether or not they’ve been met.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DuckDuckGo Ranking:&lt;/strong&gt; Searches on SwiC return all kinds of
  irrelevant results, and searches on “System within a Chip” return all kinds
  of results for “Systems on a Chip”.  If you cheat and search for “ZipCPU
  SwiC”, you get the &lt;a href=&quot;&quot;&gt;ZipCPU&lt;/a&gt; web site as the #1 page.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;/blog/2023/04/08/vpktfifo.html&quot;&gt;What is a Virtual Packet FIFO&lt;/a&gt;?&lt;/p&gt;

    &lt;p&gt;A virtual FIFO is a first-in, first-out data structure built in hardware, but
using &lt;em&gt;external&lt;/em&gt; memory–such as a DDR3 SDRAM–for its memory.  A virtual
packet FIFO is a virtual FIFO that guarantees completed packets and packet
boundaries, in spite of any back pressure that might otherwise cause the FIFO
to fill or overflow.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/vfifo/pktvfifo.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;&lt;a href=&quot;/blog/2023/04/08/vpktfifo.html&quot;&gt;This article&lt;/a&gt;
  goes over the why’s and how’s of a virtual packet FIFO: why you
  might need it, how to use it, and how it works.&lt;/p&gt;

&lt;p&gt;Since writing this article, I’ve now built and tested a &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/pktvfifo.v&quot;&gt;Wishbone based
  virtual packet FIFO as part of the 10Gb Ethernet
  project&lt;/a&gt;.
  Conclusion?  First, verifying the FIFO is a pain.  Second, I might be able to
  tune its memory usage with some better buffering.  But, overall, the FIFO
  itself works quite nicely in all kinds of environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DuckDuckGo ranking:&lt;/strong&gt;  The &lt;a href=&quot;&quot;&gt;ZipCPU blog&lt;/a&gt; comes up as
  the #2 ranking on DuckDuckGo following a search for “Virtual Packet FIFO”.
  The &lt;a href=&quot;https://www.reddit.com/r/ZipCPU&quot;&gt;ZipCPU reddit page&lt;/a&gt; comes up as the #7
  ranking.  The page itself?  Not listed.  However, both of the prior pages
  point to this article, so I’m going to give this a DuckDuckGo ranking of #2.
  Sadly, most of DuckDuckGo’s other results are completely irrelevant to a
  Virtual Packet FIFO.  In general, they’re about Virtual FIFOs–not
  Virtual &lt;em&gt;Packet&lt;/em&gt; FIFOs.  As before, though, Google gets the right article
  as it’s number one search result.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;/zipcpu/2023/05/29/zipcpu-3p0.html&quot;&gt;Introducing the ZipCPU 3.0&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;After years of updates, &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; 3.0 is
here!  This means that the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;
now has support for multiple bus structures, wide bus widths, clock stopping,
and a brand new DMA.  &lt;a href=&quot;/zipcpu/2023/05/29/zipcpu-3p0.html&quot;&gt;The
article&lt;/a&gt; announces this
new release, and discusses the importance of each of these major upgrades.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;DuckDuckGo Ranking:&lt;/strong&gt; A search for “ZipCPU” on DuckDuckGo yields
&lt;a href=&quot;&quot;&gt;ZipCPU.com&lt;/a&gt; as the #1 search result.  That’s good
enough for me.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;/blog/2023/06/28/sdiopkt.html&quot;&gt;Using a Verilog task to simulate a packet generator for an SDIO
controller&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;I haven’t written a lot about either Verilog test benches, or how to build
them, so this is a bit of a new topic for me.  Specifically, the question
involved was how to make your test bench generate properly synchronous
stimuli.  No, the correct answer is &lt;em&gt;NOT&lt;/em&gt; to generate your stimulus on the
negative edge of the clock.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;DuckDuckGo Ranking:&lt;/strong&gt; A search for “SDIO Verilog Tasks” on DuckDuckGo
yields the &lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SDIO repository&lt;/a&gt; as the #31
search result.  (Google returns the correct article, after searching for
“SDIO Verilog” at #3.)&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;/formal/2023/07/18/sdrxframe.html&quot;&gt;SDIO RX: Bugs found with formal methods&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;If you’ve read my blog often enough, you’ll know that I’m known for formally
verifying my designs.  In the case of the new &lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SDIO/eMMC
controller&lt;/a&gt;, I had it “working” on hardware
before either the formal verification or the full simulation model were
complete.  This leaves open the question, how many bugs were missed by my
hardware and (partial) simulation testing?&lt;/p&gt;

    &lt;p&gt;The article spends a lot of time also discussing “why” proper verification,
whether formal or simulation, is so important.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;DuckDuckGo Ranking:&lt;/strong&gt; A search for “SDIO formal verification” turns up the
&lt;a href=&quot;&quot;&gt;ZipCPU blog&lt;/a&gt; as result #69.  Adding “verilog” to the
search terms, returns the blog as number #46.  As before, Google returns
the right article as the #1 search result after only searching for “SDIO
formal”.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;/blog/2023/11/25/eth10g.html&quot;&gt;An Overview of a 10Gb Ethernet Switch&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;As I mentioned above, one of the big projects of mine this year was a &lt;a href=&quot;https://github.com/ZipCPU/eth10g&quot;&gt;10Gb
Ethernet switch&lt;/a&gt;.  This article goes over
the basics of the switch, and how the various data paths within the design
move data around.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;DuckDuckGo Ranking:&lt;/strong&gt; A search for “10Gb Ethernet Switch FPGA” turns up
the &lt;a href=&quot;https://github.com/ZipCPU/eth10g&quot;&gt;Ethernet design&lt;/a&gt; as the #16 result,
and a search on “10Gb Ethernet Switch Verilog” returns the same github result
as the #1 result.  Curiously, the &lt;a href=&quot;https://github.com/ZipCPU/blob/master/bench/rtl/tbenet.v&quot;&gt;10Gb Ethernet test bench
model&lt;/a&gt; for the same
repository comes up as the #2 result.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For all those who like to spam my email account, my conclusions from these
numbers are simple: 1) the &lt;a href=&quot;&quot;&gt;ZipCPU blog&lt;/a&gt; holds its own just
fine on a Google ranking, and 2) DuckDuckGo’s search engine needs work.  &lt;a href=&quot;/blog/2022/11/12/honesty.html&quot;&gt;If
you want to sell me web-based services and don’t know
this&lt;/a&gt;, I’ll assume you haven’t
done your homework and leave your email in my spam box.&lt;/p&gt;

&lt;h2 id=&quot;upcoming-projects&quot;&gt;Upcoming Projects&lt;/h2&gt;

&lt;p&gt;So, what’s next for 2024?  Here are some of the things I know of.  Some of
these are paid for, others still need funding.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float:none&quot;&gt;&lt;caption&gt;2024 Projects&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/2023-review/2024-funding.svg&quot; width=&quot;640&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Still, this is a good list to start from:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;One of my ASIC projects is in the middle of a massive speed upgrade.  This is
not a clock upgrade, or a fastest supported frequency upgrade, but rather an
upgrade to adjust the internal state machine.  I’m anticipating an additional
speed up of between 8x and 256x as a result of this upgrade.&lt;/p&gt;

    &lt;p&gt;Status?  &lt;strong&gt;Funded.&lt;/strong&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;My brand new &lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SDIO/eMMC controller&lt;/a&gt;
has neither eMMC boot support, nor DMA support.  Boot support might allow me
to boot the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; directly from
an eMMC card, whereas DMA support would allow the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;
to read lots of data from the card without CPU interaction.
Both may be on the near-term horizon, although neither upgrade is funded.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left; padding: 25px&quot;&gt;&lt;caption&gt;Laptop projects have additional requirements&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/2023-review/laptop.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Status?  Not funded.  On the other hand, this project fits quite nicely on
  my laptop for those days when I have the opportunity to take my son to his
  basketball practice … (He’s a 6’4” high school freshman, who is new to
  the sport as of this year …)&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/AutoFPGA&quot;&gt;AutoFPGA&lt;/a&gt; is now, and has for some time,
been a backbone of any of my designs.  I use it for everything.  It makes
adding and removing IP components easy.  One of its key capabilities is
&lt;a href=&quot;/zipcpu/2019/09/03/address-assignment.html&quot;&gt;address assignment (and adjustment)&lt;/a&gt;.
Sadly, it’s worked so well that it now needs some maintenance.  Specifically,
I’d like to upgrade it so that it can handle partially fixed addressing, such
as when some addresses are given and fixed while others are allowed to change
from one design to the next.  This is only a precursor, though, to supporting
2GB memories where the memory address range overlaps one of the ZipSystem’s
fixed address ranges.&lt;/p&gt;

    &lt;p&gt;Status?  A &lt;strong&gt;funded&lt;/strong&gt; (SONAR) project requires these upgrades.  Unlike my
current SONAR project, built around &lt;a href=&quot;https://store.digilentinc.com/nexys-video-artix-7-fpga-trainer-board-for-multimedia-applications/&quot;&gt;Digilent’s Nexys Video
board&lt;/a&gt;,
this one will be built around &lt;a href=&quot;https://www.enclustra.com/en/products/fpga-modules/mercury-kx2/&quot;&gt;Enclustra’s Mercury
KX2&lt;/a&gt;, and
so either &lt;a href=&quot;https://github.com/ZipCPU/AutoFPGA&quot;&gt;AutoFPGA&lt;/a&gt; gets upgraded or
I can’t use the full memory range.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;’s GCC backend urgently
needs a fix.  Specifically, it has a problem with &lt;a href=&quot;https://en.wikipedia.org/wiki/Tail_call&quot;&gt;tail (sibling)
calls&lt;/a&gt; that jump to register
addresses.  This problem was revealed when testing the &lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SDIO/eMMC software
drivers&lt;/a&gt;, and needs a proper fix before I
can make any more progress on upgrading the
&lt;a href=&quot;https://zipcpu.com/zipcpu/2021/03/18/zipos.html&quot;&gt;ZipOS&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;Did I mention working on the
&lt;a href=&quot;https://zipcpu.com/zipcpu/2021/03/18/zipos.html&quot;&gt;ZipOS&lt;/a&gt;?  Indeed.
realistically, further work on the &lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SDIO/eMMC
software&lt;/a&gt; really wants a proper OS of some
type, so … this may be a future and upcoming task.&lt;/p&gt;

    &lt;p&gt;Status?  This project isn’t likely to get any funding, but other projects
are likely to require this fix.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;As another potential project, an old friend is looking into building a
“see-in-the-dark” capability–kind of like a “better” version of
night-vision goggles.  He’s currently arranging for funding, and after all of
my video work I might finally find a customer for it.  Yes, his work will
require some secret sauce processing–but it’s all quite doable, and could
easily fit nicely into this years upcoming work.&lt;/p&gt;

    &lt;p&gt;Status?  If this moves forward, it will be &lt;strong&gt;funded&lt;/strong&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;I’d also like to continue my work on a &lt;a href=&quot;https://github.com/ZipCPU/wbsata&quot;&gt;Wishbone controlled SATA
controller&lt;/a&gt; this year.  I started working
on this controller under the assumption that it would be required by my
SONAR project, and so funded.  Now it no longer looks like it will be funded
under this vehicle.  Still, the controller is now written, even though the
verification work is far from complete.  Specifically, I’ll need to work on
my &lt;a href=&quot;https://github.com/ZipCPU/wbsata&quot;&gt;SATA (Verilog) Verification IP&lt;/a&gt;, until
it’s sufficient enough to get me past knowing if I have the Xilinx GTX
transceivers modeled correctly or not.  Once I get that far, I can both
start testing against actual hardware (on my desk), as well as against
&lt;a href=&quot;/blog/2017/06/21/looking-at-verilator.html&quot;&gt;Verilator&lt;/a&gt;
models.&lt;/p&gt;

    &lt;p&gt;Status?  Funding has been applied for.  Sadly, it’s not likely to be enough
to pay for my hours, but perhaps I can have a junior engineer work on this.
Still, whether or not the funding comes through remains to be determined.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Did I mention that the new debugging bus upgrades are on my list to be
tested?  Who knows, I may test their AXI counterparts first, or I may test
the UDP version first, or …  Only the Good Lord knows how this task will
move forward.&lt;/p&gt;

    &lt;p&gt;Status?  Not funded at all.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;I am looking into getting some funding for a second version of an Ethernet
based Memory controller.  The SONAR project required a &lt;a href=&quot;https://zipcpu.com/blog/2022/08/24/protocol-design.html&quot;&gt;first version of this
controller&lt;/a&gt;,
and it smokes &lt;a href=&quot;/blog/2017/06/05/wb-bridge-overview.html&quot;&gt;my serial port based debugging
controller&lt;/a&gt;.  A
second version of this controller, designed for resource constrained FPGAs,
designed for speed, designed for throughput from the ground up … could
easily become a highly desired product.&lt;/p&gt;

    &lt;p&gt;We’ll see.&lt;/p&gt;

    &lt;p&gt;Status?  Sounds fun, but not (yet) funded.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Finally, I have an outstanding task to test an open source memory controller,
using an open source synthesis, and place and route tool, for both Artix-7
and Kintex-7 devices.  I’ll let you know how that works out.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since these are business predictions about the future, I am required by the
Good Lord to add that these are subject to whether or not I live and the
Lord wills.  (See &lt;a href=&quot;https://www.blueletterbible.org/kjv/jam/4/13-15&quot;&gt;James
4:13-15&lt;/a&gt; for an explanation.)&lt;/p&gt;

&lt;p&gt;As always, let me know if you are interested in any of these projects, and
especially let me know if you are interested in funding one or more of them.
Either way, the upcoming year looks like it will be quite busy and it’s only
January.&lt;/p&gt;

&lt;p&gt;“My cup runneth over (&lt;a href=&quot;https://blueletterbible.org/kjv/psa/23/5&quot;&gt;Ps 23:5&lt;/a&gt;)”, and
so I shall also pray that God grants you the many blessings He has given me.&lt;/p&gt;

&lt;hr /&gt;&lt;p&gt;&lt;em&gt;Let every thing that hath breath praise the LORD.  Praise ye the LORD. (Ps 150:6)&lt;/em&gt;</description>
        <pubDate>Sat, 20 Jan 2024 00:00:00 -0500</pubDate>
        <link>https://zipcpu.com/blog/2024/01/20/2023-in-review.html</link>
        <guid isPermaLink="true">https://zipcpu.com/blog/2024/01/20/2023-in-review.html</guid>
        
        
        <category>blog</category>
        
      </item>
    
      <item>
        <title>An Overview of a 10Gb Ethernet Switch</title>
        <description>&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;caption&gt;Fig 1. The KlusterLab board used for the 10Gb Ethernet Switch testing&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/eth10g/klusterlab_1.0.jpeg&quot; alt=&quot;&quot; width=&quot;360&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;I’ve now been working with &lt;a href=&quot;https://www.symbioticeda.com/&quot;&gt;Symbiotic EDA&lt;/a&gt;
and &lt;a href=&quot;https://www.pcb-arts.com&quot;&gt;PCB Arts&lt;/a&gt; on a &lt;a href=&quot;https://github.com/ZipCPU/eth10g&quot;&gt;10Gb Ethernet switch
project&lt;/a&gt; for
&lt;a href=&quot;https://www.netidee.at/fastopenswitch&quot;&gt;NetIdee&lt;/a&gt; for some time.  Indeed,
I’ve discussed &lt;a href=&quot;https://github.com/ZipCPU/eth10g&quot;&gt;this project&lt;/a&gt; several times
on the blog.  I first brought it up in the context of building a &lt;a href=&quot;/blog/2023/04/08/vpktfifo.html&quot;&gt;Virtual Packet
FIFO&lt;/a&gt;.  The topic then came
up again during two articles on building an &lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SDIO (SD-Card)
controller&lt;/a&gt;: first when &lt;a href=&quot;/blog/2023/06/28/sdiopkt.html&quot;&gt;discussing how to
build a Verilog test bench for
it&lt;/a&gt;, and then again when
&lt;a href=&quot;/formal/2023/07/18/sdrxframe.html&quot;&gt;discussing what bugs managed to slip past the verification, which then had to
be caught in hardware&lt;/a&gt;.
What we haven’t yet discussed is the switch itself, and how it works.&lt;/p&gt;

&lt;p&gt;When discussing this project, I’ve often shown the &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/doc/eth10g-busblocks.png&quot;&gt;component
diagram&lt;/a&gt; below.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 2. The ETH10G project, from a bus component viewpoint&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;/img/eth10g/eth10g-busblocks.svg&quot;&gt;&lt;img src=&quot;/img/eth10g/eth10g-busblocks.svg&quot; alt=&quot;&quot; width=&quot;780&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;This illustrates the design based upon how the bus views the design–where the
two &lt;a href=&quot;/blog/2019/07/17/crossbar.html&quot;&gt;crossbars&lt;/a&gt;
are, and what components connect to them.  I’ve also used diagram
this from a management context to show how far the project is along.  Each
component was shaded in red initially, and it’s color slowly adjusted as the
project moved along.  It’s a color code I’ve used often to help communicate
project progress with others (i.e. customers), and it’s now worked nicely
across many projects.  The basic color legend works as:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Red boxes indicate components that either haven’t yet been designed, or
whose design isn’t complete.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Once a component has been designed and passes a basic Verilator lint check,
its color changes from red to yellow.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Components with formal proofs are then colored green.&lt;/p&gt;

    &lt;p&gt;Sometimes, I might also color as green components that aren’t going to
be formally verified, but that pass a test bench based simulation.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Once a component gets tested in hardware, an outline is given to it.  A
red outline indicates the component has failed hardware testing.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;A dark green outline is used to indicate a component that has passed all
hardware testing.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The diagram has been modified a bit from my basic encoding with annotations
underneath the components to indicate if a
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;
scope (a type of
internal logic analyzer) had been connected to the component, or if software
had been built for that component or not.  As you can see, I still have a bit
of work left to do on the &lt;a href=&quot;https://github.com/ZipCPU/wbsata&quot;&gt;SATA controller&lt;/a&gt;,
and the PCB still needs to go through another revision.  What is new, however,
is that the networking components of this design are now working.  The design
now functions as a switch.  Therefore, I thought this might be a good time to
discuss the network switch portion of this design, how it accepts 10Gb Ethernet
packets, processes them, and forwards the same 10Gb Ethernet packets on.&lt;/p&gt;

&lt;p&gt;We’ll work our way from the edges of this design, where the Ethernet packets
come in, all the way to the routing algorithm used by the switch.  We’ll
then discuss a special bus arbiter I needed to write, go through the network
components of this design that I expect will be reusable, and then discuss
some lessons learned from building the networking components.  Before we
get there, though, we have to take a moment to remind our readers of the
internal network protocol that is making all of this possible.&lt;/p&gt;

&lt;h2 id=&quot;the-internal-protocol&quot;&gt;The Internal Protocol&lt;/h2&gt;

&lt;p&gt;The entire packet processing system of this 10Gb switch is built around an
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/doc/axin.pdf&quot;&gt;AXI Network (AXIN) packet
protocol&lt;/a&gt;.  I’ve &lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;discussed
this some time before on this
blog&lt;/a&gt;.  However, since &lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;this
protocol&lt;/a&gt; is so central
to everything that follows, it only makes sense to take a moment to quickly
review it here.  Specifically, let’s note the differnces between &lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;this
AXIN protocol&lt;/a&gt; and the
&lt;a href=&quot;/doc/axi-stream.pdf&quot;&gt;standard AXI stream protocol&lt;/a&gt; it
was based upon.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left&quot;&gt;&lt;caption&gt;Fig 3. AXI Network protocol signals&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/eth10g/axinsignals.png&quot; alt=&quot;&quot; width=&quot;472&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Let’s start with why the
&lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;AXIN protocol&lt;/a&gt;
is necessary.  To put it in one word,
&lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;AXIN&lt;/a&gt;
is necessary because of &lt;em&gt;backpressure&lt;/em&gt;.  A &lt;a href=&quot;/doc/axi-stream.pdf&quot;&gt;true AXI
stream&lt;/a&gt; implementation
requires backpressure support–where a slave can tell the source it is not
ready, and hold the ready line false indefinitely.  However, in a network
context the incoming network source has a limited buffer to support any
backpressure.  Hence &lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;the need for a new
protocol&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Many have argued that this isn’t a sufficient need to justify creating a new
protocol.  Why not, they have argued, just buffer any incoming packets until
a whole packet is in the buffer before forwarding it with full backpressure
support?  I’ve now seen several FIFO implementations which can do this sort of
thing: buffer until a full packet has been received, then forward the packet
via traditional AXI stream.  The reason why I haven’t chosen this approach is
because I had a customer ask for jumbo packet support (64kB+ packet sizes).
True “jumbo packets” will be larger than my largest incoming FIFO, so I haven’t
chosen this approach either.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;AXIN protocol&lt;/a&gt; differs
from &lt;a href=&quot;/doc/axi-stream.pdf&quot;&gt;AXI stream&lt;/a&gt; in two key respects.
The first is the addition of an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ABORT&lt;/code&gt; signal.  The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ABORT&lt;/code&gt; signal can be
raised by an &lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;AXIN master&lt;/a&gt;
master at any time–even when &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VALID &amp;amp;&amp;amp; !READY&lt;/code&gt;.  If raised, it signals that
to the &lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;AXIN slave&lt;/a&gt;
that the current packet needs to be dropped in its entirety.
Many reasons might cause the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ABORT&lt;/code&gt; to be asserted.  For example, if the
initial packet source can’t handle the slave’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;!READY&lt;/code&gt; signal, it could abort
the packet.  If the packet CRC doesn’t match, the packet might also be aborted.&lt;/p&gt;

&lt;p&gt;The second differences is the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BYTES&lt;/code&gt; field.  The &lt;a href=&quot;/doc/axi-stream.pdf&quot;&gt;original AXI stream
protocol&lt;/a&gt;
identified valid bytes via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TKEEP&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TSTRB&lt;/code&gt;.  This allows packet data to
be discontinuous, and requires processing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;2*WIDTH/8&lt;/code&gt; signals per beat.  The
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/doc/axin.pdf&quot;&gt;AXIN protocol&lt;/a&gt;
instead uses a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BYTES&lt;/code&gt; field having only &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$clog2(WIDTH/8)&lt;/code&gt; bits.
This &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BYTES&lt;/code&gt; field has the requirement that it must be zero (all bytes valid)
for all but the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LAST&lt;/code&gt; beat, where it can be anything.  This also carries
the implied requirement that all beats prior to the last beat must be full,
and the last beat must be right (or left) justified.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;caption&gt;Fig 4. Watch out for mixed endianness!&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/eth10g/endianbug.svg&quot; alt=&quot;&quot; width=&quot;420&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;When defining &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/doc/axin.pdf&quot;&gt;this
protocol&lt;/a&gt;, I didn’t define
whether or not it was little endian or big endian.  As a consequence, some of
the components of this design are little endian (Ethernet is little endian by
nature), and others are big endian–since I implement both
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt; and the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; in a big-endian fashion.&lt;/p&gt;

&lt;h2 id=&quot;gtx-phy-front-end&quot;&gt;GTX PHY Front End&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/eth10g&quot;&gt;This design&lt;/a&gt; &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/xgtxphy.v&quot;&gt;uses Xilinx’s GTX IO
controllers&lt;/a&gt;
to generate and ingest the 10Gb/s links.  As &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/xgtxphy.v&quot;&gt;I’ve configured
them&lt;/a&gt;, the GTX
IO controllers within this project act like 1:32 and 32:1 I/OSERDES macros,
simplifying their implementation.  Some key features of these components is
that they can recover the clock when receiving and they
can perform some amount of receive decision feedback channel equalization (DFE).&lt;/p&gt;

&lt;p&gt;Although these transceivers appear amazing in capability, this great capability
seems also to be their Achilles heel.  I found myself reading, re-reading, and
then re-reading their user guide again and again only to find myself confused
regarding how to configure them.  It didn’t help that many of the configuration
options said that the Xilinx wizard’s configuration was to be used, and nothing
more explained about the respective option.  Nor did it help that many of the
ports remain as legacy from previous versions of the controller, and the user
guide suggests that they should not be used anymore.  The GTX transceivers have
many, many options associated with them, many of which are either not
documented at all or are only poorly documented within the user guide.  The
“official” solution to this problem is to use a Vivado wizard for that purpose,
and then to use one of the canned configurations Vivado offers.  Perhaps this
is ideal for Xilinx, who sells a fixed number of IP components based upon these
configurations.  However, this doesn’t really work well when you wish to
post or share an all–RTL design.  It also fails when you want to step off
of the beaten path.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left&quot;&gt;&lt;caption&gt;Fig 5. The next GTX project will bee SATA&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/eth10g/nextgtx.svg&quot; alt=&quot;&quot; width=&quot;420&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;As a result, I often ended up rather confused when configuring the GTX
components.&lt;/p&gt;

&lt;p&gt;The most obvious consequence of this is that I relied heavily on the GTX
simulation models.  This meant that I needed to &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/bench/rtl/tbenet.v&quot;&gt;model a 10Gb Ethernet link
in Verilog&lt;/a&gt;
to stimulate those models.  If you look, you’ll notice I’ve
got quite the &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/fc846af41987236f1c00886e5f87e58ca7e6ba51/bench/rtl/tbenet.v#L162-L306&quot;&gt;clock recovery circuit&lt;/a&gt;
written there in Verilog as well, so
that I can recover the clock from the signal generated by the GTX simulation
model.  For those who know me, this is also a rather drastic departure from my
all-&lt;a href=&quot;/blog/2017/06/21/looking-at-verilator.html&quot;&gt;Verilator&lt;/a&gt;
approach to simulation.  This is understandable, though, given both the
complexity of these components and the fact that Xilinx did not provide a
simulation model that would work with
&lt;a href=&quot;/blog/2017/06/21/looking-at-verilator.html&quot;&gt;Verilator&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A second consequence is that I was never able to get the GTX’s 64/66b
encoder/decoder working.  I’m sure the interface was quite intuitive to
whoever designed it.  I just couldn’t get it working–not even in simulation.
Then again, schedule pressure being what it was, it was just simpler to build
(and formally verify) my own
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/p66brxgears.v&quot;&gt;32:66&lt;/a&gt;
and &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/p66btxgears.v&quot;&gt;66:32&lt;/a&gt;
gearboxes, and to use my own &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/p66brxgears.v&quot;&gt;66b synchronization
module&lt;/a&gt;.
Perhaps I might’ve figured out how to do this with their GTX transceiver if I
had another month to work on it.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;caption&gt;Fig 6. Note to Xilinx's designers&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/eth10g/nowizard.svg&quot; alt=&quot;&quot; width=&quot;420&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Were I to offer any suggestions to Xilinx regarding their GTX design, I
would simply suggest that they simplify it drastically.  A good hardware IO
module on an FPGA should handle any required high speed IO, while also leaving
the protocol processing to be implemented in the FPGA fabric.  It should also
be versatile enough to continue to support the same protocols it currently
supports (and more), without doing protocol specific handling, such as 64/66b
or 8b/10b encoding and decoding, in the PHY components themselves.  This
means I’d probably remove phase alignment and “comma detection” from the PHY
(I wasn’t using either of them for this project anyway) and force them back
into the FPGA fabric.  (Ask me again about these features, though, after I’ve
had to use the GTX for a project that requires them.)&lt;/p&gt;

&lt;p&gt;So, in the end, I used the GTX transceiver simply as a combined 32:1 OSERDES
(for the transmit side) and a 1:32 ISERDES with clock recovery (for receive).
Everything else was relegated to the fabric.&lt;/p&gt;

&lt;h2 id=&quot;the-digital-front-end&quot;&gt;The Digital Front End&lt;/h2&gt;

&lt;p&gt;Ethernet processing in FPGA Logic was split into two parts.  The first part,
what I call the “digital front end”, converts the 32b data interface
required by the PHY to the
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/doc/axin.pdf&quot;&gt;AXIN interfaces&lt;/a&gt; used by
everything else.  That’s all done in my
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/netpath.v&quot;&gt;netpath&lt;/a&gt;
module.&lt;/p&gt;

&lt;p&gt;Early on in this project, I diagrammed out this path, as shown in Fig. 7 below.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 7. Networking flow block diagram&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/eth10g/eth10g-blocks.svg&quot; alt=&quot;&quot; width=&quot;780&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;The diagram begin as a simple block diagram of networking components that
needed to be built and verified for the switch to work.  (The color legend
for this component is roughly the same as the color legend for the bus
components shown in Fig. 2 above.)&lt;/p&gt;

&lt;p&gt;Let’s walk through that module briefly here, working through first the receive
chain followed by the transmit chain.  In each case, processing is accomplished
a discrete number of steps.&lt;/p&gt;

&lt;h3 id=&quot;the-incoming-rx-chain&quot;&gt;The (Incoming) RX Chain&lt;/h3&gt;

&lt;p&gt;The receive data processing chain starts with 32b words, sampled at
322MHz, and converts them to 128b wide &lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;AXIN packet
streams&lt;/a&gt; sampled at 100MHz.
(Why 100MHz?  Because that was the DDR3 SDRAM memory controller’s speed in this
design.)&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/p66brxgears.v&quot;&gt;32b/66b Gearbox&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;As I mentioned in the GTX section above, I could never figure out how to
get the 64/66b conversion working in the GTX front end.  I’m sure Xilinx’s
interface made sense to some, it just never quite made sense to me.  As a
result, it was just easier to use the PHY as a 32b ISERDES and process
everything from there.  This way I could control the signaling and gearbox
handling.  Even better, I could formally verify that I was doing things
right using the signals I had that were under my own control.&lt;/p&gt;

    &lt;p&gt;That means that the first step is a &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/p66brxgears.v&quot;&gt;32:66b
gearbox&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;This gearbox is also where the 66b synchronization happens.  For those not
familiar with the 64/66b protocol, the extra two bits are used for
synchronization.  These two bits are guaranteed (by protocol) to be
different.  One bit combination will indicate the presence of packet data,
the other indicates control information.&lt;/p&gt;

    &lt;p&gt;The &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/fb19d04f46d5bb485020e2d8664c5606c6645612/rtl/net/p66brxgears.v#L114-L138&quot;&gt;alignment algorithm is fairly straightforward, and centers on an
alignment counter&lt;/a&gt;.
We first assume some arbitrary shift will produce aligned data.  &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/fb19d04f46d5bb485020e2d8664c5606c6645612/rtl/net/p66brxgears.v#L124&quot;&gt;If the two
control bits differ&lt;/a&gt;,
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/fb19d04f46d5bb485020e2d8664c5606c6645612/rtl/net/p66brxgears.v#L127&quot;&gt;this counter is incremented by one&lt;/a&gt;.
Once the counter sets the MSB, this particular shift is declared
to be aligned.  If the two control bits are the same–that is if they are
invalid, then &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/fb19d04f46d5bb485020e2d8664c5606c6645612/rtl/net/p66brxgears.v#L128-L129&quot;&gt;the counter is decremented by three&lt;/a&gt;.
Once the counter gets to zero, an alignment failure is declared and
the algorithm moves on to check the next potential alignment by &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/fb19d04f46d5bb485020e2d8664c5606c6645612/rtl/net/p66brxgears.v#L131-L136&quot;&gt;incrementing
the shift amount&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;From a rate standpoint, data comes in at 32b/clk, and leaves &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/p66brxgears.v&quot;&gt;this
gearbox&lt;/a&gt;
via an &lt;a href=&quot;/doc/axi-stream.pdf&quot;&gt;AXI stream protocol&lt;/a&gt;
that requires &lt;em&gt;READY&lt;/em&gt; to be held high.  There’s no room to support any
backpressure here.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/axincdc.v&quot;&gt;CDC&lt;/a&gt; from 322.3MHz to 200MHz&lt;/p&gt;

    &lt;p&gt;The problem with data coming in from the GTX PHY is that the received data
is on a derived clock.  As with any other externally provided clock, this
means that the incoming clock (after recovery) will appear to be &lt;em&gt;near&lt;/em&gt;
its anticipated 10.3125GHz rate, but is highly unlikely to match any
internal reference we have to 10.3125GHz exactly.  Even when we divide it
down by 32x, it will only be somewhere &lt;em&gt;near&lt;/em&gt; a local 322.3MHz reference.
For this reason, we want to cross from the per-channel clock rate to a
common rate that can be used within our design–one allowing some timing
overhead.  In this case, we convert to a 200MHz clock.&lt;/p&gt;

    &lt;p&gt;The CDC is accomplished via a &lt;a href=&quot;/blog/2018/07/06/afifo.html&quot;&gt;basic asynchronous
FIFO&lt;/a&gt;, with the exception that
the &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/axincdc.v&quot;&gt;FIFO has been modified to recognize and propagate ABORT
signals&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left&quot;&gt;&lt;caption&gt;Fig 8. Turning the scrambling off?&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/eth10g/scrambleon.svg&quot; alt=&quot;&quot; width=&quot;420&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/p64bscrambler.v&quot;&gt;Descrambler&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Embedding clock and data together, as described above, requires having
data that is sufficiently pseudorandom–otherwise there might not be enough
bit transitions to reconstruct the clock signal.  Likewise, if the
incoming data isn’t sufficiently random, the 66b frame detector might suffer
from “false locks” and fail to properly detect packet data.  The 10Gb
Ethernet specification describes a feedthrough scrambling algorithm that is
to be applied to the 64 data bits of every 66b code word.  Our first task,
therefore, before we can process the 66b code word is to remove this
scrambling.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/p642pkt.v&quot;&gt;Packet delimiter and fault detector&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;The next step is to convert these descrambled 66b codewords into our
internal AXI network (AXIN) packet stream format.&lt;/p&gt;

    &lt;p&gt;For those familiar with Ethernet, they may recall that the Ethernet
specification describes an &lt;a href=&quot;https://en.wikipedia.org/wiki/Media-independent_interface#XGMII&quot;&gt;XGMII “10Gigabit Media Independent
Interface”&lt;/a&gt;,
and suggests this interface should be used to feed the link.  This project
didn’t use the XGMII interface at all.  Why not?  Because the 66b encoding
describes a set of either 64b data or 64b codewords.  As a result, it makes
sense to process these words 64b at a time–rather than the 32b at a time
used by the XGMII protocol.  Transitioning from 64b at a time to the
32b/clk of the XGMII protocol would require either processing data at
the original 322.3MHz, or doing another clock domain crossing.  It was just
easier to go straight from the 66b format of the 10Gb interface directly
to a 64b &lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;AXIN&lt;/a&gt; format.&lt;/p&gt;

    &lt;p&gt;Following this converter, we are now in a standard packet format.
Everything from here until a (roughly) equivalent point in the transmit
path takes place in this AXIN packet protocol.&lt;/p&gt;

    &lt;p&gt;While &lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;AXIN&lt;/a&gt;
backpressure is supported from here on out, any backpressure will
likely cause a dropped packet.&lt;/p&gt;

    &lt;p&gt;This is the first place in our processing chain where an ABORT may be
generated.  Any ABORTs that follow in the receive chain will either be
propagated from this one, or due to subsequently detected errors in the
packet stream.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/dropshort.v&quot;&gt;Cull short packets&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Our next two steps massage our data just a little bit.  This first step
drops any packets shorter than the Ethernet minimum packet length: 64Bytes.
These packets are dropped via the
&lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;AXIN ABORT&lt;/a&gt; signal.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/crc_axin.v&quot;&gt;CRC Checking&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;From here, we check packet CRCs.  Any packet whose CRC doesn’t match will
be dropped via an ABORT signal.&lt;/p&gt;

    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/01b6322d7950cf29035c6dfcfca077ebddf1669c/rtl/net/crc_axin.v#L124-L141&quot;&gt;Although a basic CRC algorithm is quite straightforward&lt;/a&gt;,
getting this algorithm to pass timing was a bit harder.  Remember, packets
arrive at 64b/clk, and can have any length.  Therefore, the four byte CRC
may be found on any byte boundary.  That means we need to be prepared to
check for any one of eight possible locations for a final CRC.&lt;/p&gt;

    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/01b6322d7950cf29035c6dfcfca077ebddf1669c/rtl/net/crc_axin.v&quot;&gt;Our first attempt to check CRCs&lt;/a&gt;
generated the correct CRC for each byte cut based upon the CRC from the
last byte cut.  This didn’t pass timing at 200MHz.&lt;/p&gt;

    &lt;p&gt;So, let’s take a moment to analyze the basic CRC algorithm.  You can &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/01b6322d7950cf29035c6dfcfca077ebddf1669c/rtl/net/crc_axin.v#L124-L141&quot;&gt;see
it summarized in Verilog here&lt;/a&gt;.
It works based upon a register I’ll call the “fill” (it’s called “current”
in &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/01b6322d7950cf29035c6dfcfca077ebddf1669c/rtl/net/crc_axin.v#L124-L141&quot;&gt;this Verilog summary&lt;/a&gt;).
On each new bit, this “fill” is
shifted right by one.  The bit that then falls off the end of the register
is exclusive OR’d with the incoming bit.  If the result is a ‘1’, then a
32b value is added to the register, otherwise the result is left as a
straight shift.&lt;/p&gt;

    &lt;p&gt;The important conclusion to draw from this is that the CRC “fill” register
is simply propagated via linear algebra over GF2.  &lt;em&gt;It’s just a linear
system!&lt;/em&gt;  This is important to understand when working with any algorithm
of this type.&lt;/p&gt;

    &lt;p&gt;Not only is this a linear system, but 1) the equations for each bit have
fixed coefficients, and 2) they are roughly pseudorandom.  That means that
it should be possible to calculate any bit in the next CRC value based upon
the previous fill and the new data using only 96 input bits, of which only
a rough half of them will be non-zero (due to the pseudorandomness of the
operator).  Exclusive OR’ing 48bits together can be done with only 12 LUT6s
and three levels of logic.  (My out-of-date version of Yosys maps this to
20 LUT6s and 4 LUT4s.)  My point is this: if you can convince your synthesis
tool to remap this into a set of independnt linear equations, each will
cheap and easy to implement.&lt;/p&gt;

    &lt;p&gt;Rewriting the CRC checker so that it first transformed the problem into a
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/crc_eqn.v&quot;&gt;linear equation set&lt;/a&gt;,
turned out to be just the juice needed to pass timing at 200MHz.&lt;/p&gt;

    &lt;p&gt;The result of this &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/crc_axin.v&quot;&gt;check
module&lt;/a&gt;
is that packets with failing CRCs will be ABORTed.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/axinwidth.v&quot;&gt;Resize from 64b/clk to 128b/clk&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Since the system clock for this design is 100MHz, we need to cross clock
domains one more time.  This requires more parallelism, so we first increase
our packet width from 64b/clk to 128b/clk in preparation of this final
clock domain crossing.&lt;/p&gt;

    &lt;p&gt;The &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/axinwidth.v&quot;&gt;width converter&lt;/a&gt;
used for this purpose has been designed to be very generic.  As a result,
you’ll find it used many times over in this design.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/axincdc.v&quot;&gt;CDC&lt;/a&gt; from 200MHz to 100MHz – the bus clock speed&lt;/p&gt;

    &lt;p&gt;The last step of our incoming packet processor is to
&lt;a href=&quot;/blog/2017/10/20/cdc.html&quot;&gt;cross clock domains&lt;/a&gt;
from our intermediate clock speed of 200MHz to 100MHz.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You may notice two additional blocks in Fig. 7 that aren’t connected to the
data stream.  These
are outlined with a dashed line to indicate that they are optional.  I placed
the components in the chain because I have used a similar component in previous
designs.  These components might check the IP version, and potentially check
that the header checksum or the packet length matches the one arriving.
In other designs, these components would also verify that arriving packets
were properly addressed to &lt;em&gt;this&lt;/em&gt; destination.  However, in
&lt;a href=&quot;https://github.com/ZipCPU/eth10g&quot;&gt;this project&lt;/a&gt;, a choice was made early on
that these IP-specific components wouldn’t be required for an &lt;em&gt;Ethernet&lt;/em&gt; (not
IP) switch, and so they have never been either built or integrated into the
design.&lt;/p&gt;

&lt;p&gt;From here on out, all processing takes place via the
&lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;AXIN protocol&lt;/a&gt;.  It is
possible for a long packet to come through this portion of the interface, only
to be ABORTed right at the end.&lt;/p&gt;

&lt;h3 id=&quot;the-outgoing-tx-chain&quot;&gt;The (Outgoing) TX Chain&lt;/h3&gt;

&lt;p&gt;The other half of the digital front end takes packets incoming, via the
&lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;AXIN protocol&lt;/a&gt;,
and converts them to a set of 32b words for the GTX PHY.  As with
the receive processing chain, this is also accomplish in a series of discrete
steps.  Both halves are found within the
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/netpath.v&quot;&gt;netpath.v&lt;/a&gt;
module.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/axincdc.v&quot;&gt;CDC&lt;/a&gt;
from 100MHz to 200MHz&lt;/p&gt;

    &lt;p&gt;Our first step is to cross from the 100MHz bus clock speed to a 200MHz
intermediate clock speed.  This simply moves us closer to the clock we
ultimately need to be on, while reducing the amount of logic required on
each step.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/axinwidth.v&quot;&gt;Resize&lt;/a&gt; from 128b/clk to 64b/clk&lt;/p&gt;

    &lt;p&gt;Our first step, once we cross back into the 200MHz clock domain, is to
move from 128b/clk back to 64b/clk.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/pktgate.v&quot;&gt;Packet gate and FIFO&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;One problem with Ethernet transmission is that there’s no way to pause an
outgoing packet.  If the data isn’t ready when it’s time to send, the
packet will need to be dropped.  In order to keep this from happening, I’ve
inserted what I call a “&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/pktgate.v&quot;&gt;packet gate&lt;/a&gt;”.  This component first
loads incoming data into an &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/netfifo.v&quot;&gt;AXIN FIFO&lt;/a&gt;, and then it holds up each individual
packet until either 1) the entire packet has entered the
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/netfifo.v&quot;&gt;FIFO&lt;/a&gt;, or
2) the &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/netfifo.v&quot;&gt;FIFO&lt;/a&gt;
becomes full. (Remember, backpressure is fully supported on the input.)
This way we can have confidence, going forward, that any memory delays from
the &lt;a href=&quot;/blog/2023/04/08/vpktfifo.html&quot;&gt;virtual FIFO&lt;/a&gt;
feeding us our data from DDR3 memory will not cause us
to drop packets.  For shorter packets, this is a guarantee.  For longer
packets, the
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/netfifo.v&quot;&gt;FIFO&lt;/a&gt;
just mitigates any potential problems.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Spoiler&lt;/p&gt;

    &lt;p&gt;According to the Ethernet standard, I should have a packet spoiler at my
next step.  This spoiler would guarantee that any packets that must be
ABORTed (for whatever reason) have failing CRCs.&lt;/p&gt;

    &lt;p&gt;This spoiler is not (yet) a part of my design.  As a result, there is a
(low) risk of a corrupt packet crossing the interface and (somehow) having
a valid CRC on the other end.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/pkt2p64b.v&quot;&gt;Packet assembler&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;At this point, it’s time to switch from the
&lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;AXIN protocol&lt;/a&gt;
back to the 66b/clk Ethernet network protocol.  This includes inserting idle
indications, link error (remote fault) indications, start and end of packet
indications, as well as the packet data itself.&lt;/p&gt;

    &lt;p&gt;The big thing to remember here is that, once a packet enters this component,
the packet data stream cannot be allowed to run dry without corrupting the
outgoing data stream.&lt;/p&gt;

    &lt;p&gt;The result of this stage is a 66b AXI stream, whose &lt;em&gt;VALID&lt;/em&gt; signal must be
held high.  &lt;em&gt;READY&lt;/em&gt; will not be constantly high, but will be adjusted as
necessary to match the speed of the ultimate transmit clock.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/p64bscrambler.v&quot;&gt;Scrambler&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Just like the incoming packet data, we need to apply the same feedthrough
scrambler to the data going out.&lt;/p&gt;

    &lt;p&gt;The result of this stage remains a 66b AXI stream.  As with the prior
stage, &lt;em&gt;VALID&lt;/em&gt; needs to be held high.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/p66btxgears.v&quot;&gt;66/64b gearbox&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;As I mentioned earlier, the Xilinx GTX interface was a challenge to use
and get working.  I only use, therefore, the 32b interface as either a
ISERDES or in this case an OSERDES type of operator.  That means I need
to move from 66b/clk to 32b/clk.  The first half of this conversion takes
place at the 200MHz clock rate, converting to 64b/clock.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/axincdc.v&quot;&gt;CDC&lt;/a&gt; from 200MHz to 322.3MHz&lt;/p&gt;

    &lt;p&gt;From here, we can cross from our intermediate clock rate to the clock rate
of the GTX transmitter.  This is done via a &lt;a href=&quot;/blog/2018/07/06/afifo.html&quot;&gt;standard asynchronous FIFO–such
as we’ve written about on this blog
before&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;64/32b gearbox&lt;/p&gt;

    &lt;p&gt;The last step in the digital front end processor is to switch from 64b
at a time to 32b at a time.  Every other clock cycle reads a new 64b from
the asynchronous FIFO and sends 32b of those 64b, whereas the other 32b are
sent on the next clock.&lt;/p&gt;

    &lt;p&gt;Why not match the receive handler, and place the full 64/66b gearbox at
the 322.3MHz rate?  I tried.  It didn’t pass timing.  This two step approach,
first going from &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/p66btxgears.v&quot;&gt;66b to
64b&lt;/a&gt;,
and then from 64b to 32b is a compromise that works well enough.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That’s the two digital components of the PHY.  Ever after these two components,
everything takes place using the
&lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;AXIN protocol&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;the-switch&quot;&gt;The Switch&lt;/h2&gt;

&lt;p&gt;Now that we’ve moved to a common protocol, everything else can be handled via
more generic components.  This brings us to the core of the project, the 4x4
Ethernet switch function itself.  This switch function is built in two parts.
The first part is all about buffering incoming packets, and then routing them
to their ultimate destinations.  The second component, the routing algorithm,
we’ll discuss in the next section.&lt;/p&gt;

&lt;p&gt;The challenge of the switch is simply that packets may arrive at any time on
any of our interfaces, and we want to make sure those packets can be properly
routed to any outgoing interface–even if that outgoing interface is currently
busy.  While one approach might be to drop packets if the outgoing interface
is busy, I chose the approach in this project of instead trying to buffer
packets in memory via the &lt;a href=&quot;/blog/2023/04/08/vpktfifo.html&quot;&gt;virtual packet
FIFOs&lt;/a&gt;.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Notify all routing tables of the incoming MAC&lt;/p&gt;

    &lt;p&gt;The first step in the switch is to grab the MAC source address from the
incoming, and to notify all of the per-port routing tables of this MAC
address.&lt;/p&gt;

    &lt;p&gt;We’ll discuss the routing algorithm more in the next section.  It’s based
upon first observing the Ethernet MAC addresses of where packets come from,
and then routing packets to those ports when the addresses are known, or
to all ports when the destination port is unknown.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;/blog/2023/04/08/vpktfifo.html&quot;&gt;Virtual packet FIFOs&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;The second step is to buffer the place the incoming packet into either a
&lt;a href=&quot;/blog/2023/04/08/vpktfifo.html&quot;&gt;Virtual packet FIFO&lt;/a&gt;,
or (based upon a configuration choice) just run it through a block RAM
based &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/netfifo.v&quot;&gt;NetFIFO&lt;/a&gt;.&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;
        &lt;p&gt;Resize from 128b/clk to 512b/clk&lt;/p&gt;

        &lt;p&gt;Pushing all incoming packets into memory requires some attention be paid
to memory bandwidth.  At 128b/clk, there’s not enough bandwidth to push
more than one incoming packet to memory, much less to read it back out
again.  For this reason, the DDR3 SDRAM width was selected to be wide
enough to handle 512b/clock, or (equivalently) just over four incoming
packets at once.&lt;/p&gt;

        &lt;p&gt;Sadly, that means we need to convert our packet width again.  This time,
we convert from 128b/clk to 512b/clk.&lt;/p&gt;
      &lt;/li&gt;
      &lt;li&gt;
        &lt;p&gt;Incoming &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/netfifo.v&quot;&gt;NetFIFO&lt;/a&gt;&lt;/p&gt;

        &lt;p&gt;SDRAM memory is not known for its predictable timing.  Lots of things can
happen that would make an incoming packet stream suffer.  It could be that
the CPU is currently using the memory and the network needs to wait.  It
might be that the memory is in the middle of a refresh cycle, and so all 
users need to wait.  To make sure we can ride through any of these delays,
the first step is to push the incoming packet into a &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/netfifo.v&quot;&gt;block-RAM based
AXIN FIFO&lt;/a&gt;.&lt;/p&gt;
      &lt;/li&gt;
      &lt;li&gt;
        &lt;p&gt;Write packets to memory, then prefixing it with a 32b length word&lt;/p&gt;

        &lt;p&gt;Packet data is then written to memory.&lt;/p&gt;

        &lt;p&gt;My &lt;a href=&quot;/blog/2023/04/08/vpktfifo.html&quot;&gt;Virtual packet FIFO&lt;/a&gt;
implementations work on a 32b alignment.  This means that the incoming
packet data may need to be realigned to avoid overwriting valid data that
might already be in the FIFO.&lt;/p&gt;

        &lt;p&gt;Once the entire packet has been written to memory, the 32b word following
is set to zero and the 32b word preceding the packet is then written with
the packets’ size.  This creates sort of a linked-list structure in memory,
but one that only gets updated once a completed packet has been written to
memory.&lt;/p&gt;

        &lt;p&gt;If the incoming packet is ever ABORTed, then the length word for the packet
is kept at zero, and the next packet is just written to the place this
one would’ve been written to.&lt;/p&gt;

        &lt;p&gt;Once the packet size word has been written to memory, and the memory
has acknowledged it, the &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/pktvfifowr.v&quot;&gt;virtual packet FIFO writer&lt;/a&gt;
then notifies the reader of its new write pointer.  This signals to the
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/pktvfiford.v&quot;&gt;reader&lt;/a&gt;
that a new packet may be available.&lt;/p&gt;
      &lt;/li&gt;
      &lt;li&gt;
        &lt;p&gt;Read packets back from memory, length word first&lt;/p&gt;

        &lt;p&gt;Once a packet has been written to memory, and hence once the write pointer
changes, that packet can then be read from memory.  This task works by
first reading the packet length from memory, and then reading that many
bytes from memory to form an outgoing packet.&lt;/p&gt;

        &lt;p&gt;One thing to beware of is that
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;
has no ability to offer read
backpressure on the bus.  (AXI has the ability, but for performance and
potential &lt;a href=&quot;https://en.wikipedia.org/wiki/Deadlock&quot;&gt;deadlock&lt;/a&gt;
reasons it really shouldn’t be used.)  This means
that the part of the bus handler requesting data needs to be very aware
of the number of words both requested by the bus as well as those contained
in the synchronous FIFO to follow.  Bus requests should not be issued
unless room in the subsequent FIFO can be guaranteed.&lt;/p&gt;

        &lt;p&gt;I’ve also made an attempt to guarantee that FIFO performance will be
handled on a &lt;em&gt;burst&lt;/em&gt; basis.  Hence,
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;
requests will not be made
unless the FIFO is less than half full, and they won’t stop being made
until the FIFO is either fully committed, or the end of the packet has
been reached.&lt;/p&gt;
      &lt;/li&gt;
      &lt;li&gt;
        &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/sfifo.v&quot;&gt;Synchronous FIFO&lt;/a&gt;&lt;/p&gt;

        &lt;p&gt;This is just a basic &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/sfifo.v&quot;&gt;synchronous FIFO&lt;/a&gt;
with nothing special about it.  It’s almost identical to the &lt;a href=&quot;/blog/2017/07/29/fifo.html&quot;&gt;FIFO I’ve
written about before&lt;/a&gt;,
save that &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/sfifo.v&quot;&gt;this one&lt;/a&gt;
has been formally verified and gets used any time I need a FIFO.&lt;/p&gt;
      &lt;/li&gt;
      &lt;li&gt;
        &lt;p&gt;Width adjustment, from 512b/clk back to 128b/clk&lt;/p&gt;

        &lt;p&gt;The last step in the &lt;a href=&quot;/blog/2023/04/08/vpktfifo.html&quot;&gt;virtual packet
FIFO&lt;/a&gt; is to convert
the bit width back from 512b/clk to 128b/clk.&lt;/p&gt;

        &lt;p&gt;Remember, 128b/clk is just barely enough to keep up at a 100MHz clock.
We needed the 512b/clk width to use the memory.  Now that we’re done with
the memory transactions, we can go back to the 128b/clk rate to lower
our design’s LUT count.&lt;/p&gt;

        &lt;p&gt;From this point forward, we can support as much backpressure as necessary.
In many ways, this is the purpose of the
&lt;a href=&quot;/blog/2023/04/08/vpktfifo.html&quot;&gt;virtual packet FIFOs&lt;/a&gt;:
allowing us to buffer unlimited packet sizes in memory, so that we can
guarantee full backpressure support following this point.&lt;/p&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Get the packet’s destination MAC&lt;/p&gt;

    &lt;p&gt;Once a packet comes back out of the
&lt;a href=&quot;/blog/2023/04/08/vpktfifo.html&quot;&gt;FIFO&lt;/a&gt;, the next step
is to look up its MAC destination address.  &lt;a href=&quot;https://en.wikipedia.org/wiki/Ethernet_frame&quot;&gt;This can be found in the first
6 octets&lt;/a&gt; of the packet.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Look up a destination for this MAC&lt;/p&gt;

    &lt;p&gt;We then send this MAC address to the routing algorithm, to look up where
it should be sent to.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Broadcast the packet to all desired destinations&lt;/p&gt;

    &lt;p&gt;The final step is to &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/axinbroadcast.v&quot;&gt;broadcast&lt;/a&gt;
this packet to all of its potential destination ports at once.&lt;/p&gt;

    &lt;p&gt;This is built off of an &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/axinbroadcast.v&quot;&gt;AXIN broadcast&lt;/a&gt;
component to create many
&lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;AXIN&lt;/a&gt;
streams from one, followed by an
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/axinarbiter.v&quot;&gt;AXIN arbiter&lt;/a&gt;
used to select from many sources to determine which source should be
output at a given time.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That’s how most packets are handled.  Packets to and from the CPU are handled
differently.  Unlike the regular packet paths, the CPU’s virtual FIFO
involves the CPU.  This changes the logic for this path subtly.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;The CPUNet acts like its own port to the switch.  That means that,
internally, the switch is a 5x5 switch and not a 4x4 switch.  It still has
four physical ports, but it also has a virtual internal port going to
the CPU.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The CPU port doesn’t require a
&lt;a href=&quot;/blog/2023/04/08/vpktfifo.html&quot;&gt;virtual packet FIFO&lt;/a&gt;
within the &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/routecore.v&quot;&gt;router core&lt;/a&gt;.  Instead, its
&lt;a href=&quot;/blog/2023/04/08/vpktfifo.html&quot;&gt;virtual packet FIFO&lt;/a&gt;
is kept external to the &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/routecore.v&quot;&gt;router
core&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;As a result, packets come straight in to the &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/routecore.v&quot;&gt;switch
component&lt;/a&gt;,
get their MAC source recorded in the routing table, and then do all of
the route processing other packets go through, save that they don’t need
to through a &lt;a href=&quot;/blog/2023/04/08/vpktfifo.html&quot;&gt;virtual packet
FIFO&lt;/a&gt; since they come
from &lt;a href=&quot;/blog/2023/04/08/vpktfifo.html&quot;&gt;a virtual packet
FIFO&lt;/a&gt; in the first place.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Likewise, in the reverse direction, packets routed to the CPU leave the
router and go straight to the CPU’s
&lt;a href=&quot;/blog/2023/04/08/vpktfifo.html&quot;&gt;virtual packet FIFO&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;On entry, the CPU path has the option to filter out packets not addressed
to its MAC–whatever assignment it is given.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;However, for packet inspection and testing, this extra filter has often been
turned off.  Indeed, a special routing extension has been added to allow
the CPU to “see” all ports coming in, and so it can inspect any packet
going through the switch if bandwidth allows.  (If bandwidth doesn’t allow
this, packets in the switch may be dropped as well.)&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;the-routing-algorithm&quot;&gt;The Routing Algorithm&lt;/h2&gt;

&lt;p&gt;The very first thing I built was the routing algorithm.  This I felt was the
core of the design, the soul and spirit of everything else.  Without routing,
there would be no switch.&lt;/p&gt;

&lt;p&gt;The problem was simply that I’d never built a routing algorithm before.&lt;/p&gt;

&lt;p&gt;My first and foremost goal, therefore, was to just build something that works.
I judged that I could always come back later and build something better.  Even
better, because this system is open source, released under the Apache 2.0
license, someone else is always welcome to come back later and build something
better.  That’s how open source worked, right?&lt;/p&gt;

&lt;p&gt;So let’s go through the basic requirements.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;All packets must be routed.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Hence, if the router can’t tell which destination to send a packet to, 
then that packet should be broadcast to all destinations.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Packets should not be looped back upon their source.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Packets sent to broadcast addresses must be broadcast.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In hindsight, I should have built the table with some additional requirements:&lt;/p&gt;

&lt;ol start=&quot;4&quot;&gt;
  &lt;li&gt;
    &lt;p&gt;It should be possible for the internal CPU to read the state of the table at
any time.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The CPU should be able to generate configure fixed routes.&lt;/p&gt;

    &lt;p&gt;These include both routes where all packets from a given port go to a
given destination port or set of ports, but also routes where a given MAC
address goes to a given port.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;In hindsight, my routing algorithm used a lot of internal logic resources.
Perhaps a better solution might be to share the routing algorithm across
ports.  I didn’t do that.  Instead, each port had its own routing table.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Given these requiremnts, I chose to build the routing algorithm around an
internal &lt;em&gt;routing table&lt;/em&gt;.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;caption&gt;Fig 9. Routing table columns&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/eth10g/routetbl.svg&quot; alt=&quot;&quot; width=&quot;480&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Each network port was given its own routing table.  As packets arrived,
the source MAC from the packet was isolated and then a doublet of source
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[PORT, MAC]&lt;/code&gt; was forwarded to the routing tables in the design
associated with all of the other ports.&lt;/p&gt;

&lt;p&gt;Each table entry has four components:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Each entry has a &lt;em&gt;valid&lt;/em&gt; flag.&lt;/p&gt;

    &lt;p&gt;When a new source MAC doublet arrives, it will be placed into the first
table entry without a valid entry within it.  That entry will then become
valid.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Each entry has a &lt;em&gt;MAC address&lt;/em&gt; associated with it.  This is the source address
of the packet used to create this entry.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Each entry has a &lt;em&gt;port number&lt;/em&gt; associated with it.  This is the number of the
port a packet with the given source address last arrived at.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Each entry has an &lt;em&gt;age&lt;/em&gt;.  (It’s really a timeout value …)&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;
        &lt;p&gt;The age of any new entry to the table is given this timeout value.  The
timeout then counts down by one every clock cycle.&lt;/p&gt;
      &lt;/li&gt;
      &lt;li&gt;
        &lt;p&gt;If a MAC declaration arrives for an existing entry, its timeout is reset.
It will then start counting down from its full timeout value.&lt;/p&gt;
      &lt;/li&gt;
      &lt;li&gt;
        &lt;p&gt;If 1) a new MAC declaration arrives for an entry that isn’t in the table,
and , and 2) all entries are full, then 4) the oldest entry will be
rewritten with this new entry.&lt;/p&gt;

        &lt;p&gt;Calculating “oldest” turned out to be one of the more difficult parts of
this algorithm.&lt;/p&gt;
      &lt;/li&gt;
      &lt;li&gt;
        &lt;p&gt;After a period of time, if no packets arrive from a particular source,
then the entry will die of old age.&lt;/p&gt;

        &lt;p&gt;This allows us to accommodate routes that may need to change over time.&lt;/p&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that no effort is made to sort this table one way or other.&lt;/p&gt;

&lt;p&gt;Now that we have this table, we can look up the destination port for a given
packet’s destination MAC address as follows:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;If the packet’s destination MAC matches one from the table, the packet will
then be sent sent to the port associated with that MAC.&lt;/p&gt;

    &lt;p&gt;This basically sends packets to the last source producing a packet with that
MAC address.  This will fail if a particular source sends to a particular
destination, but that destination never responds.&lt;/p&gt;

    &lt;p&gt;Still, this constitutes a successful MAC table lookup.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If the destination MAC can’t be found in the table, then it will be
forwarded to all possible destinations.&lt;/p&gt;

    &lt;p&gt;This constitutes a failed MAC lookup.  In the worst case, this will increase
network traffic going out by a factor of 4x in a 4x4 router.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This algorithm made a workable draft algorithm, up and until the network links
needed to be debugged.&lt;/p&gt;

&lt;p&gt;Just to make certain everything was working, the hardware was set up with the
bench-top configuration shown in Fig. 10.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;caption&gt;Fig 10. Testing configuration&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/eth10g/netbench.svg&quot; alt=&quot;&quot; width=&quot;420&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Ports #0 and #1 were connected in a loopback fashion.  Packets sent to port
#0 would be received at port #1, and vice versa–packets sent to port #1
would be received at port #0.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Port #2 was connected to an external computer via a coaxial cable.  We’ll
call this PC#2.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Port #3 was connected to an external computer via an optical fiber.  We’ll
call this PC#3.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, let me ask, what’s that loopback going to do to our routing algorithm?
Packets arriving on interface #2 for an unknown destination (PC#3’s address)
will then be forwarded to interfaces #0 and #1 in addition to port #3.
Ports #0 and #1 will then generate two incoming packets to be sent to the same
unknown MAC address (PC#3’s address), which will then cause our table to learn
that the MAC address generated by PC#2 comes from either port #0 or port #1.
(A race condition will determine which port it gets registered to.)  This
is also going to flood our virtual FIFOs with a never ending number of packets.&lt;/p&gt;

&lt;p&gt;This is not a good thing.&lt;/p&gt;

&lt;p&gt;I tried a quick patch to solve this issue.  The patch involved two new
parameters, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OPT_ALWAYS&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OPT_NEVER&lt;/code&gt;.  Using these two options, I was
able to adjust the outgoing port so that it read:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;parameter&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;NETH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;		&lt;span class=&quot;c1&quot;&gt;// Number of ports&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;parameter&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NETH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;OPT_NEVER&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;4'h3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;parameter&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NETH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;OPT_ALWAYS&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;4'h0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;c1&quot;&gt;//&lt;/span&gt;
	&lt;span class=&quot;c1&quot;&gt;// ...&lt;/span&gt;
	&lt;span class=&quot;c1&quot;&gt;//&lt;/span&gt;

	&lt;span class=&quot;n&quot;&gt;TX_PORT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lkup_port&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;~&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;OPT_NEVER&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;OPT_ALWAYS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;I could then use this to keep my design from forwarding packets to
port #0 or #1.&lt;/p&gt;

&lt;p&gt;It wasn’t good enough.&lt;/p&gt;

&lt;p&gt;For some reason, I was able to receive ARP requests from either ports #2
or #3, but they would never acknowledge any packets sent to them.  So … I
started instrumenting everything.  I wondered if things were misrouted,
so I tested sending packets everywhere arbitrarily.  Suddenly, ports #2
and #3 started acknowledging pings!  But when I cleaned up my arbitrary
routing, they no longer acknowledged pings anymore.&lt;/p&gt;

&lt;p&gt;This lead me into a cycle of adjusting &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OPT_NEVER&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OPT_ALWAYS&lt;/code&gt; over
and over again, and rebuilding every time.  Eventually, this got so painful
that I turned these into run-time configurable registers.  (Lesson learned …)&lt;/p&gt;

&lt;p&gt;I’ll admit, I started getting pretty frustrated over this one bug.  The design
worked in simulation.  I could transmit packets from port #0 in simulation to
port #3, going through the router, so I &lt;em&gt;knew&lt;/em&gt; things should work–they just
weren’t working.  Like any good engineer, I started blaming the PCB designer
for miswiring the board.  To do that, though, I needed a good characterization
of what was taking place on the board.  So I started meticulously running
tests and drawing things out.  This lead to the following picture:&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 11. The nightmare routing bug&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/eth10g/20231122-nightmare.svg&quot; alt=&quot;&quot; width=&quot;780&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;This showed that packets sent to either ports #2 or #3 weren’t arriving.  If
I used the CPU to send to port #0, I could see the result on port #1.  However,
if I sent to port #1, nothing would be received on port #0.  Just to make
things really weird, if I sent to port #0 then ports #3 and #2 would see the
packet, but I couldn’t send packets to ports #2 or #3.&lt;/p&gt;

&lt;p&gt;Drawing the figure above out really helped.  Perhaps you can even see the bug
from the figure.&lt;/p&gt;

&lt;p&gt;If not, here it was: I used &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/xgtxphy.v&quot;&gt;one module to control all four GTX
transceivers&lt;/a&gt;.
This module had, as its input therefore, a value &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S_DATA&lt;/code&gt; containing 32-bits
for every port.  This module was not subject to lint testing, however, since
… the GTX PHY couldn’t be Verilated–or I might’ve noticed what was going
on.  I was then forwarding bits [63:0] of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S_DATA&lt;/code&gt; to &lt;em&gt;every&lt;/em&gt; outgoing port,
rather than 32-bits of zero followed by bits &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[32*gk +: 32]&lt;/code&gt; of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S_DATA&lt;/code&gt; to
port &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gk&lt;/code&gt;.  Because my network bench tests were limited to only look at
selected outputs, I then never noticed this issue.  The result was that anything
sent to port #0 would be sent to &lt;em&gt;all&lt;/em&gt; ports: #0-#3, and anything sent to
ports #1-#3 would go nowhere within the design.&lt;/p&gt;

&lt;p&gt;So let me apologize now to the PCB designer for this board.  This one was my
bug after all.&lt;/p&gt;

&lt;p&gt;Let me also suggest that this bug would’ve been much easier to find if I had
designed the routing algorithm from the beginning so that I could test specific
hardware paths.  The CPU, for example, should be able to override any and all
routing paths.  Likewise, the CPU should be able to read (and so verify) any
current routing paths.&lt;/p&gt;

&lt;p&gt;A second problem with the router algorithm was that it consumed too many
resources.  The routing tables were first designed to have 64 entries in each
of the four tables.  When space got tight, this was dropped to 32 entries,
and then to 16, and then down to 8.  Any rewrite of this algorithm should
therefore address the space used by the table–in addition to making it easier
for the CPU to read and modify the table at will.&lt;/p&gt;

&lt;h2 id=&quot;the-special-bus-arbiter&quot;&gt;The Special Bus Arbiter&lt;/h2&gt;

&lt;p&gt;This design required a special
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;
arbiter.  To explain this need, though, I’ll need to do a little bit of math.&lt;/p&gt;

&lt;p&gt;Suppose a port is receiving data at 10Gb/s.  By the time we’ve adjusted the
data down so that the stream is moving 512b per clock at a 10ns clock period,
that means one 512b word will need to be written to memory 51.2ns.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left&quot;&gt;&lt;caption&gt;Fig 12. Bus width.  We had to go wide&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/eth10g/buswidth.svg&quot; alt=&quot;&quot; width=&quot;360&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;The typical DDR3 SDRAM access takes about 20 clock cycles of latency or so,
with longer latency requirements if the SDRAM requires either a refresh cycle
or a bank swapping cycle.  Bank swapping can be avoided if the SDRAM accesses
are burst together in a group–a good reason to use a FIFO, but there’s no easy
way to avoid refresh cycles.&lt;/p&gt;

&lt;p&gt;That’s only the first problem.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 13. Wishbone can be very inefficient&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;/img/eth10g/b4drom.svg&quot;&gt;&lt;img src=&quot;/img/eth10g/b4drom.svg&quot; alt=&quot;&quot; width=&quot;780&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;The second problem is illustrated in Fig. 13 above.  In this diagram, the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S1_*&lt;/code&gt; signals are from the first source, while the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S2_*&lt;/code&gt; signals are from the second source.  The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;M_*&lt;/code&gt; signals would then go
to the downstream device–such as the DDR3 SDRAM or, in this case, a second
arbiter (shown above in Fig. 2) to then go to the DDR3 SDRAM.  (Yeah, I know,
there’s a &lt;em&gt;lot&lt;/em&gt; of levels to bus processing …)&lt;/p&gt;

&lt;p&gt;The problem in this illustration is associated with how
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;
normally works: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CYC &amp;amp;&amp;amp; STB&lt;/code&gt; are raised on the first beat of any transaction
from a given master, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;STB&lt;/code&gt; is dropped once all requests have been accepted,
and then &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CYC&lt;/code&gt; is dropped once all &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ACK&lt;/code&gt;s have been received.  The arbiter
then takes a cycle to know that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CYC&lt;/code&gt; has dropped, before allowing the
second master to have access to the bus.  This creates a great inefficiency,
since the master isn’t likely to make any requests between when it drops &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;STB&lt;/code&gt;
and when it drops &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CYC&lt;/code&gt;.  Given what we know of DDR3 SDRAM, this inefficiency
could cost about 200ns per access.&lt;/p&gt;

&lt;p&gt;However, if all four ports are receiving at the same time, then 512b will
need to be written four times every 51ns, from each of four masters.&lt;/p&gt;

&lt;p&gt;This requirement was impossible to meet with my normal
&lt;a href=&quot;/blog/2019/07/17/crossbar.html&quot;&gt;crossbar&lt;/a&gt;.
That &lt;a href=&quot;/blog/2019/07/17/crossbar.html&quot;&gt;crossbar&lt;/a&gt;
would wait until a master dropped &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CYC&lt;/code&gt;, before allowing a second
master to access the bus–suffering the inefficiency each time.  This is all
illustrated in Fig. 13 above.&lt;/p&gt;

&lt;p&gt;The solution that I’ve chosen to use is to create a &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/wbmarbiter.v&quot;&gt;special Wishbone
master arbiter&lt;/a&gt;:
one that multiplexes wishbone accesses together from multiple masters to a
single slave.  This way I could transition from one master’s bus requests to
a second master’s bus requests, and then just route the return data to the
next master looking for return data.&lt;/p&gt;

&lt;p&gt;You can see how this might save time in Fig. 14 below.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 14. Merging WB requests within the same cycle&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;/img/eth10g/aftrdrom.svg&quot;&gt;&lt;img src=&quot;/img/eth10g/aftrdrom.svg&quot; alt=&quot;&quot; width=&quot;780&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Notice how, in this figure, the requests get tightly packed together going to
memory.&lt;/p&gt;

&lt;p&gt;(Yes, this is what AXI was designed to do naturally, even though my AXI
&lt;a href=&quot;/blog/2019/07/17/crossbar.html&quot;&gt;crossbars&lt;/a&gt; don’t yet do
this.)&lt;/p&gt;

&lt;p&gt;There are some consequences to this approach, however, that really keep
it from being used generally.  One consequence is the loss of the ability
to lock the bus and do atomic transactions by simply holding &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CYC&lt;/code&gt; high.
(The virtual packet FIFOs don’t need atomic transactions, so this isn’t
an issue.)
A second consequence is in error handling.  Because the
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt; bus aborts
any outstanding transactions on any error return, all outstanding requests
need to be aborted should any master receive a &lt;a href=&quot;https://en.wikipedia.org/wiki/Bus_error&quot;&gt;bus
error&lt;/a&gt; return.  As a result, all
masters will report a &lt;a href=&quot;https://en.wikipedia.org/wiki/Bus_error&quot;&gt;bus error&lt;/a&gt;
return even if only one master caused the
&lt;a href=&quot;https://en.wikipedia.org/wiki/Bus_error&quot;&gt;error&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Given that the &lt;a href=&quot;/blog/2023/04/08/vpktfifo.html&quot;&gt;virtual packet
FIFO&lt;/a&gt;
bus masters would only ever interact with memory, this solution works in this
circumstance.  I may not be able to use it again.&lt;/p&gt;

&lt;h2 id=&quot;reusable-components&quot;&gt;Reusable Components&lt;/h2&gt;

&lt;p&gt;We’ve discussed &lt;a href=&quot;/blog/2020/01/13/reuse.html&quot;&gt;hardware reuse&lt;/a&gt;
on this blog before, and we’ll probably come back and do so again.  I’ve said
before, I will say again: well designed, well verified IP is gold in this
business.  It’s gold because it can be reused over and over again.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left&quot;&gt;&lt;caption&gt;Fig 15. Good as gold&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/eth10g/ipgold.svg&quot; alt=&quot;&quot; width=&quot;360&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;As examples, the network portion of this project alone has reused many
components,
to include the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;, my &lt;a href=&quot;/blog/2018/07/06/afifo.html&quot;&gt;asynchronous
FIFO&lt;/a&gt;, my &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/sfifo.v&quot;&gt;synchronous
FIFO&lt;/a&gt;,
my &lt;a href=&quot;/blog/2019/05/22/skidbuffer.html&quot;&gt;skidbuffer&lt;/a&gt;, and my
&lt;a href=&quot;/blog/2019/07/17/crossbar.html&quot;&gt;Wishbone crossbar&lt;/a&gt;.  If you
look over the bus diagram in Fig. 2 above, you’ll also see many other
components that may have either been reused, or are likely to be reused
again–to include my &lt;a href=&quot;/blog/2019/03/27/qflexpress.html&quot;&gt;Quad SPI flash
controller&lt;/a&gt;,
&lt;a href=&quot;/blog/2018/11/29/llvga.html&quot;&gt;HDMI Controller(s)&lt;/a&gt;,
&lt;a href=&quot;https://github.com/ZipCPU/wbicapetwo&quot;&gt;ICAPE2 controller&lt;/a&gt;, &lt;a href=&quot;/blog/2021/11/15/ultimate-i2c.html&quot;&gt;I2C
Controller&lt;/a&gt;, and now my
new &lt;a href=&quot;/formal/2023/07/18/sdrxframe.html&quot;&gt;SD-Card Controller&lt;/a&gt;–but
I’m trying to limit today’s focus on the network specific components for now.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;caption&gt;Fig 16. A list of my various network designs&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/eth10g/netprojs.svg&quot; alt=&quot;&quot; width=&quot;520&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Sadly, this is my fourth network design and many of the components from my
first three network designs cannot be reused.  My first network design used
a 100M/s link.  The next two network designs were based off of a 1Gb/s
Ethernet, and so used either 8b/clk or 32b/clk–depending on which clock was
being used.  Any component specific to the bit-width of the network could
(sadly) not be reused.  Likewise, the first two of these three previous designs
didn’t use the
&lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;AXIN protocol&lt;/a&gt;–limiting
their potential reuse.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left&quot;&gt;&lt;caption&gt;Fig 17. What is a &quot;store+notify&quot; design?&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/eth10g/storenotify.svg&quot; alt=&quot;&quot; width=&quot;420&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Perhaps this is to be expected.  The first time you design a solution to a
problem, you are likely to make a lot of mistakes.  The second time you’ll
get a lot closer.  The third time even closer, etc.  This is one of the reasons
I like to tell students to “Plan for failure.”  Why?  Because “success” always
comes after failure, and sometimes many “failures” are required to get there.
Therefore, you need to start projects early enough that you have time to get
past your early failures.  Hence, you should always plan for these failures
if you wish to succeed.&lt;/p&gt;

&lt;p&gt;That said, this &lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;AXIN
protocol&lt;/a&gt; was enough of a
success on my third network project that I’m likely to use it again and again.
The fact that I chose to use it here attests to the fact that it worked well
enough on the last project that it didn’t need to be rebuilt at all.&lt;/p&gt;

&lt;p&gt;That also means I’m likely going to be using and reusing many of the generic
&lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;AXIN&lt;/a&gt; components from this
project to the next.  These include:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/netfifo.v&quot;&gt;NetFIFO&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;This is a basic FIFO, but for the
&lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;AXIN protocol&lt;/a&gt;.
Packets that have been completely received by the
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/netfifo.v&quot;&gt;FIFO&lt;/a&gt;
will not be aborted–no matter how the master adjusts the ABORT signal.
What makes this component different from other network FIFOs I’ve seen is
that the ABORT signal may still propagate from the input to the output prior
to a full packet being recorded in the
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/netfifo.v&quot;&gt;FIFO&lt;/a&gt;.
This means that &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/netfifo.v&quot;&gt;this particular network FIFO implementation&lt;/a&gt;
is able to handle packet sizes that might be much larger than the
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/netfifo.v&quot;&gt;FIFO&lt;/a&gt; itself.&lt;/p&gt;

    &lt;p&gt;Although &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/netfifo.v&quot;&gt;my particular
implementation&lt;/a&gt;
doesn’t include the &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/doc/axin.pdf&quot;&gt;AXIN
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BYTES&lt;/code&gt;&lt;/a&gt; signal, the data
width can be adjusted to include it alongside the data signal for no loss of
functionality.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/netskid.v&quot;&gt;NetSkid&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;This is a version of your
&lt;a href=&quot;/blog/2019/05/22/skidbuffer.html&quot;&gt;basic skidbuffer&lt;/a&gt;,
but this time applied to the
&lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;AXIN protocol&lt;/a&gt;.
As a result, it supports the ABORT signal, but in all other ways it’s just a
basic &lt;a href=&quot;/blog/2019/05/22/skidbuffer.html&quot;&gt;skidbuffer&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;This particular component was easy enough to design, and it would make a good
classroom exercise for the student to design his own.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;/blog/2023/04/08/vpktfifo.html&quot;&gt;Virtual Packet FIFOs&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;The most challenging component to verify, and hence to build, has been the
&lt;a href=&quot;/blog/2023/04/08/vpktfifo.html&quot;&gt;virtual Packet FIFOs&lt;/a&gt;.
On the other hand, these are &lt;em&gt;very&lt;/em&gt; versatile, and so I’m likely to use them
again and again in the future.  This verification work, therefore, has been
time well spent.  You can read more about virtual packet FIFOs, what they are
and how they work, in &lt;a href=&quot;/blog/2023/04/08/vpktfifo.html&quot;&gt;this article&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;As of writing this, my
&lt;a href=&quot;/blog/2023/04/08/vpktfifo.html&quot;&gt;virtual Packet FIFO&lt;/a&gt;.
work isn’t quite done.  There’s still a bit of a signaling issue, and the
formal proof of the memory reader portion isn’t passing.  However, the
component is working in hardware so this is something I’ll just have to
come back to as I have the opportunity.&lt;/p&gt;

    &lt;p&gt;Once completed, the only real upgrade potential remaining might be to
convert these virutal FIFOs to AXI.  Getting &lt;a href=&quot;/blog/2020/06/16/axiaddr-limits.html&quot;&gt;AXI
bursting&lt;/a&gt; right
won’t be easy, but it would certainly add value to my &lt;a href=&quot;https://github.com/ZipCPU/wb2axip&quot;&gt;AXI
library&lt;/a&gt; to have such a component within
it.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/axinwidth.v&quot;&gt;Width converter&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;The &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/axinwidth.v&quot;&gt;AXIN width converter&lt;/a&gt;
will likely be reused.  It should be generic enough
to convert &lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;AXIN streams&lt;/a&gt;
from any power of two byte width to any other.  That
means it will likely be reused on my next network project again as well.
It’s just necessary network infrastructure, so that’s how it will be used
and reused.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/pktgate.v&quot;&gt;Packet Gate&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;This component buffers either a whole packet, or fills its buffer with a
full packet before releasing the packet downstream.  It’s really nothing more
than a &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/netfifo.v&quot;&gt;network
FIFO&lt;/a&gt;
with some additional logic added.  Still, this one is
generic enough that it can be reused in other projects as necessary.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/dropshort.v&quot;&gt;Dropping short packets&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;I’ve now written multiple components that can/will drop short packets.
This new one, however, should be versatile enough to be able to be reused
across projects.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/crc_axin.v&quot;&gt;CRC checker&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;As with the short packet detector, my CRC checker should work nicely across
projects–even when the widths are different.  This will then form the
basis for future network CRC projects.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;caption&gt;Fig 18. The AXIN broadcaster and AXIN arbiter work together&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/eth10g/bcastarb.svg&quot; alt=&quot;&quot; width=&quot;480&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/axinbroadcast.v&quot;&gt;AXIN Broadcaster&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;The &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/axinbroadcast.v&quot;&gt;AXIN broadcasting component&lt;/a&gt;
is only slightly modified from my &lt;a href=&quot;https://github.com/ZipCPU/wb2axip/blob/master/rtl/axisbroadcast.v&quot;&gt;AXI stream
broadcaster&lt;/a&gt;.  It’s a simple component: an
&lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;AXIN stream&lt;/a&gt;
comes in, specifying one (or more) destinations that this &lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;AXIN
stream&lt;/a&gt; should be
broadcast to.  The broadcaster then sends a beat to each
&lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;AXIN stream&lt;/a&gt;
and, as beats are accepted from all outgoing streams, sends more beats.&lt;/p&gt;

    &lt;p&gt;Or, rather, that was the initial design.&lt;/p&gt;

    &lt;p&gt;As it turns out, this approach had a serious design flaw in it contributing
to an honest to goodness, bona-fide
&lt;a href=&quot;https://en.wikipedia.org/wiki/Deadlock&quot;&gt;deadlock&lt;/a&gt;.  As a result, both this
component, and the corresponding arbiter that follows it, needed fixing.&lt;/p&gt;

    &lt;p&gt;The new/updated algorithm includes two non-&lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;AXIN&lt;/a&gt; signals:
CHREQ (channel request, from the master) and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ALLOC&lt;/code&gt; (channel allocated,
returned from the slave).  When “broadcasting” to a single port, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CHREQ&lt;/code&gt;
will equal &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VALID&lt;/code&gt;.  When broadcasting to multiple ports, it will first set
this &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CHREQ&lt;/code&gt; signal to indicate each of the downstream ports it is attempting
to broadcast to.  This will be done before setting &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VALID&lt;/code&gt; to forward any
data downstream.  The downstream ports, on seeing a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CHREQ&lt;/code&gt; signal, will
raise the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ALLOC&lt;/code&gt; (allocated channel) return signal if they aren’t busy.
The design then waits for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ALLOC&lt;/code&gt; to match &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CHREQ&lt;/code&gt;.  Once the two match, the
packet will be forwarded.  If, however, after 64 clock cycles, enough
channels haven’t been allocated to match the channel requests, then we may
have detected a &lt;a href=&quot;https://en.wikipedia.org/wiki/Deadlock&quot;&gt;deadlock&lt;/a&gt;.  In this
case, all requests will be dropped for a pseudorandom time period before
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CHREQ&lt;/code&gt; will be raised to try again.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/axinarbiter.v&quot;&gt;AXIN Arbiter&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;As shown in Fig. 18 above, the &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/axinarbiter.v&quot;&gt;AXIN
arbiter&lt;/a&gt;
is the other side of the &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/axinbroadcast.v&quot;&gt;AXIN broadcaster&lt;/a&gt;.
Whereas the &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/axinbroadcast.v&quot;&gt;AXIN
broadcaster&lt;/a&gt;
takes a single &lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;AXIN
stream&lt;/a&gt; input and
(potentially) forwards it to multiple output streams, the
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/axinarbiter.v&quot;&gt;arbiter&lt;/a&gt;
takes multiple
&lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;AXIN&lt;/a&gt; input streams
and selects from among them to source a single output stream.&lt;/p&gt;

    &lt;p&gt;As with the &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/axinbroadcast.v&quot;&gt;broadcaster&lt;/a&gt;,
the original design was quite simple: select from among many packet sources
based upon the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VALID&lt;/code&gt; signal, and then hold on to that selection until
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VALID &amp;amp;&amp;amp; READY &amp;amp;&amp;amp; LAST&lt;/code&gt;.&lt;/p&gt;

    &lt;p&gt;The &lt;a href=&quot;https://en.wikipedia.org/wiki/Deadlock&quot;&gt;deadlock&lt;/a&gt;
occurred when at least two simultaneous attempts were made to broadcast
packets to two (or more) separate destinations, as shown in FIG:.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 19. Deadlock!&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/eth10g/deadlock.svg&quot; alt=&quot;&quot; width=&quot;780&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;If one destination arbitrated to allow the first source to transmit but the
  second destination chose to allow the second source, then the whole system
  would lock up and fail.&lt;/p&gt;

&lt;p&gt;Now, the
  &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/axinarbiter.v&quot;&gt;arbiter&lt;/a&gt;
  selects its incoming signal based upon &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CHREQ&lt;/code&gt;, and then
  sets the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ALLOC&lt;/code&gt; return signal once arbitration has been granted to let
  the source know it has arbitration.  Arbitration is then lost once &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CHREQ&lt;/code&gt;
  is dropped–independent of whether a packet has (or has not) completed.
  In this way, a failed arbitration (i.e. detected
  &lt;a href=&quot;https://en.wikipedia.org/wiki/Deadlock&quot;&gt;deadlock&lt;/a&gt;)
  can be dropped and try again before any packet information gets lost.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/axincdc.v&quot;&gt;AXIN CDC&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;As long as there are multiple clocks in a design, there will be a need for
&lt;a href=&quot;/blog/2017/10/20/cdc.html&quot;&gt;crossing clock domains&lt;/a&gt;.
This is uniquely true when there’s a difference a data input clock, data
output, and memory clock domains.&lt;/p&gt;

    &lt;p&gt;&lt;a href=&quot;/blog/2017/10/20/cdc.html&quot;&gt;Crossing clock domains&lt;/a&gt;
with the &lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;AXIN protocol&lt;/a&gt;
isn’t as pretty as when using the &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/netfifo.v&quot;&gt;network
FIFO&lt;/a&gt;.
Indeed, there’s no easy way to ABORT and free the FIFO resources of a packet
that’s been only partially accepted by an &lt;a href=&quot;/blog/2018/07/06/afifo.html&quot;&gt;asynchronous
FIFO&lt;/a&gt;.
Instead, the ABORT signal is converted into a data wire and simply forwarded
as normal through a &lt;a href=&quot;/blog/2018/07/06/afifo.html&quot;&gt;standard asynchronous
FIFO&lt;/a&gt;.
It’s a simple enough approach, but it does nothing to free up
FIFO resources on an ABORTed packet.  However, freeing up FIFO resources
can still be (mostly) accomplished by placing a small
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/axincdc.v&quot;&gt;AXIN CDC&lt;/a&gt;
(i.e. an &lt;a href=&quot;/blog/2018/07/06/afifo.html&quot;&gt;asynchronous
FIFO&lt;/a&gt;)
back to back with the &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/netfifo.v&quot;&gt;standard network
FIFO&lt;/a&gt;.
discussed above.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/wbmarbiter.v&quot;&gt;WBMArbiter (Wishbone master arbiter)&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;I mentioned this one in the last section.  Although it’s not unique to
networking, I may yet use this component again.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left&quot;&gt;&lt;caption&gt;Fig 20. Gearboxes can (and should) be formally verified&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/eth10g/frmlgearbox.svg&quot; alt=&quot;&quot; width=&quot;420&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;There is another item worth mentioning in this list, and those are the two
gearboxes.
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/p66brxgears.v&quot;&gt;[1]&lt;/a&gt;,
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/p66btxgears.v&quot;&gt;[2]&lt;/a&gt;.
I’m not listing these with the reusable components above, simply
because I’m not likely to use these particular gearboxes outside of a 64/66b
encoding system, so they aren’t really all that generic.  However, the
lessons learned from building them, together with the methodology for
formally verifying them, is likely something I’ll remember long into the
future when building gearboxes of other ratios.&lt;/p&gt;

&lt;h2 id=&quot;lessons-learned&quot;&gt;Lessons Learned&lt;/h2&gt;

&lt;p&gt;Perhaps the biggest lesson (re)learned during this design is that you need
to plan for debugging from the beginning.  This includes some of the following
lessons:&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;caption&gt;Fig 21. For debugging purposes, count all ABORTs&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/eth10g/abortctr.svg&quot; alt=&quot;&quot; width=&quot;420&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Anytime a component generates an ABORT signal, that should be logged and
counted somewhere.  This is different from components propagating an ABORT
signal.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Testing requires that the router have static paths, at least until the
rest of the network interface works.&lt;/p&gt;

    &lt;p&gt;Port forwarding can also permit the CPU to inspect any packet crossing
through the interface.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Bit-order in the ethernet specification is quite confusing.&lt;/p&gt;

    &lt;p&gt;I had to build and then rebuild the &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/p642pkt.v&quot;&gt;66b to AXIN protocol
component&lt;/a&gt;
and its &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/pkt2p64b.v&quot;&gt;transmit
counterpart&lt;/a&gt;
many times over during this project–mostly because of misunderstandings
of the Ethernet specification.  This included confusions over how bytes
are numbered within the 66b message word, the size of control bytes (7b) in
66b message words vs the size of data bytes (8b), whether the synchronization
pattern was supposed to be &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;10&lt;/code&gt; for control words or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;01&lt;/code&gt; and so forth.
As a result, even though I was able to get the design working with GTX
transceivers in a 10Gb Ethernet test bench early on, I would later discover
more than once that the “working” test bench and design didn’t match
10Gb Ethernet hardware components on the market.&lt;/p&gt;

    &lt;p&gt;For example, the Ethernet specification indicates that the start of packet
indication must include bits, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;... 10101010 10101010&lt;/code&gt; followed by the octet
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;10101011&lt;/code&gt;.  What it doesn’t make clear is that these bits need to be read
right to left.  This meant that the final three bytes of any start of
packet delimeter needed to be &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0x55&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0x55&lt;/code&gt;, followed by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0xd5&lt;/code&gt; rather
than (as I first read the specification) &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0xaa&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0xaa&lt;/code&gt;, followed by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0xab&lt;/code&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The routing algorithm above doesn’t work well in loopback situations.&lt;/p&gt;

    &lt;p&gt;As I mentioned above, this particular problem turned into a debugging
nightmare.  Even now that I have a working algorithm, I’m not yet convinced
it does a good job handling this reality.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Look out for &lt;a href=&quot;https://en.wikipedia.org/wiki/Deadlock&quot;&gt;deadlocks&lt;/a&gt;!&lt;/p&gt;

    &lt;p&gt;Others who have dealt with
&lt;a href=&quot;https://en.wikipedia.org/wiki/Deadlock&quot;&gt;deadlocks&lt;/a&gt; before have told me
that 1) &lt;a href=&quot;https://en.wikipedia.org/wiki/Deadlock&quot;&gt;deadlocks&lt;/a&gt; are system
architecture level problems, and 2) that they are easy enough to spot
after you’ve had enough experience with them.  In my case, I studied them
back in my college days, but had never seen a
&lt;a href=&quot;https://en.wikipedia.org/wiki/Deadlock&quot;&gt;deadlocks&lt;/a&gt; in real life before.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left&quot;&gt;&lt;caption&gt;Fig 22. Deadlocks are very real&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/eth10g/bigfoot.svg&quot; alt=&quot;&quot; width=&quot;420&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;ol&gt;
  &lt;li&gt;Because the routing tables were written entirely in RTL using Flip-Flops,
and not using any block RAM, the CPU has no ability to read the tables back
out to verify their functionality.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Finally, I’ve really enjoyed &lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;this network (AXIN)
protocol&lt;/a&gt;.  It’s met my
needs now on multiple projects.  Better yet, because I’ve made this protocol
common across these projects, I can now re-use components between them.&lt;/p&gt;

&lt;p&gt;Perhaps that should be a final lesson learned as well: well designed internal
protocols facilitate design reuse.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;So the big question I imagine on everybody’s mind at this point is, now that
you have a 10Gb Ethernet switch, how well does it work?&lt;/p&gt;

&lt;p&gt;Sadly, my answer to that (at present) is … I don’t know.&lt;/p&gt;

&lt;p&gt;I’ve tested the design using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;scp&lt;/code&gt; to move files from one PC to another,
only to discover &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;scp&lt;/code&gt; has internal speed limitations within it.  This leaves
me and the project needing better test cases.&lt;/p&gt;

&lt;p&gt;How about the hardware?&lt;/p&gt;

&lt;p&gt;As with any PCB design, the original PCB design for this project has had some
issues.  The work presented above has been done with the &lt;em&gt;original&lt;/em&gt;, first/draft
board design.  Part of my task has been to build enough RTL design to identify
these issues.  As of the this date, I’ve now worked over all of the hardware
design save the SATA port.  Yes, there are some issues that will need to be
corrected in the next revision of the board.  These issues, however, weren’t
significant enough that they kept us from verifying the components on the board.
As a result, I can now say with certainty that the 10GbE components of the
board work–although, IIRC, we were going to change the polarity of some of
the LED controls associated with it.  Still, its enough to say it works.&lt;/p&gt;

&lt;p&gt;The flash, SD card, and eMMC interfaces?  Those will need some redesign work,
and so they should be fixed on the next revision of the board.  As for the
10GbE interfaces?  Those work.&lt;/p&gt;

&lt;p&gt;If you are interested in one of these boards to work with, then please
contact &lt;a href=&quot;mailto:edmund@symbioticeda.com&quot;&gt;Edmund, at Symbiotic EDA&lt;/a&gt;, for details.&lt;/p&gt;

&lt;p&gt;If the Lord wills, there are several components of this design that might
be fun to blog about.  These include the &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/bench/formal/faxin_slave.v&quot;&gt;formal property
set&lt;/a&gt;
that I’ve been using to verify AXIN components, the &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/netskid.v&quot;&gt;AXIN
Skidbuffer&lt;/a&gt;
or even the
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/netfifo.v&quot;&gt;NetFIFO&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For now, I’ll just note that the design is &lt;a href=&quot;https://github.com/ZipCPU/eth10g&quot;&gt;posted on
github&lt;/a&gt;, licensed under Apache 2.0, and
I will invite others to examine it and make comments on it.&lt;/p&gt;

&lt;hr /&gt;&lt;p&gt;&lt;em&gt;This also cometh from the LORD of Hosts, which is wonderful in counsel, and excellent in working.  (Is 28:29)&lt;/em&gt;</description>
        <pubDate>Sat, 25 Nov 2023 00:00:00 -0500</pubDate>
        <link>https://zipcpu.com/blog/2023/11/25/eth10g.html</link>
        <guid isPermaLink="true">https://zipcpu.com/blog/2023/11/25/eth10g.html</guid>
        
        
        <category>blog</category>
        
      </item>
    
      <item>
        <title>SDIO RX: Bugs found w/ Formal methods</title>
        <description>&lt;p&gt;This post is the second post regarding my new &lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SDIO/eMMC
controller&lt;/a&gt;.  The SDIO protocol is commonly
used on SD cards, and the eMMC protocol for eMMC chips.  The two protocols
are so similar that, when using
&lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;this controller&lt;/a&gt;, they will differ in
software only.  Today’s bottom line is that, although &lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;the
controller&lt;/a&gt; is still quite new and only
barely silicon proven, this week I had the chance to formally verify the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdrxframe.v&quot;&gt;receive portion&lt;/a&gt;
of the controller and so I thought I might write about what took place.
My goal will be to answer the question of whether this extra step of doing
formal verification was worth it or not.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;This SDIO controller&lt;/a&gt; is being written as
part of the &lt;a href=&quot;https://github.com/ZipCPU/eth10g&quot;&gt;Klusterlab project&lt;/a&gt;.  I’ve been
calling &lt;a href=&quot;https://github.com/ZipCPU/eth10g&quot;&gt;this same project the ETH10G
project&lt;/a&gt;, because at its core it is a &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/routecore.v&quot;&gt;10Gb
Ethernet switch&lt;/a&gt;.
The project team, however, has named it KlusterLab because of all of the
various hardware and IO interfaces that have been integrated into it.
This has given me plenty of opportunities for testing hardware components, and
for writing blog articles about them.  As a result, I’ve now written about the
development of this project a couple times.  First, I wrote about &lt;a href=&quot;/blog/2023/04/08/vpktfifo.html&quot;&gt;virtual
packet FIFOs&lt;/a&gt;,
then about &lt;a href=&quot;/blog/2023/06/28/sdiopkt.html&quot;&gt;using Verilog tasks to script the SDIO
transmitter&lt;/a&gt;, and now today
I want to discuss formally verifying the &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdrxframe.v&quot;&gt;SDIO
receiver&lt;/a&gt;
that will soon be tested on this board.  I’ll even go so far as to discuss
the remaining bugs that were found during hardware verification.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;caption&gt;Fig 1. SDIO progress&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/sdrxframe/progress.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;You might argue this &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdrxframe.v&quot;&gt;SDIO
receiver&lt;/a&gt;
had already been verified.  Indeed, you might argue that the &lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;entire
controller&lt;/a&gt; had been verified.  Let’s walk
through the various development steps.  At this point in its development, the
entire &lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SDIO controller&lt;/a&gt; has been written.  It
passes a &lt;a href=&quot;/blog/2017/06/21/looking-at-verilator&quot;&gt;Verilator&lt;/a&gt;
lint check.  I’ve also generated simulation test scripts to verify the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdckgen.v&quot;&gt;divided clock
generator&lt;/a&gt; and the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdtxframe.v&quot;&gt;transmitter&lt;/a&gt; in
isolation.  A final &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/bench/verilog/tb_sdio.v&quot;&gt;simulation
environment&lt;/a&gt;
drives the entire controller through its paces: starting up the
SDIO controller in a simulated environment all the way from sending the
CMD0 (GO IDLE) all the way through reading and writing a page of data
(CMD24 and CMD17).  It passes these simulation steps nicely.  I’ve integrated
the controller into the &lt;a href=&quot;https://github.com/ZipCPU/eth10g&quot;&gt;larger design&lt;/a&gt;, and
it passes both Vivado synthesis and timing.  Indeed, it’s been implemented in
hardware.  Most recently, it failed in hardware testing for PCB reasons, not
logic reasons–but that still counts as a &lt;em&gt;failure&lt;/em&gt; in hardware, so I’ve got
more work to do before I can call this silicon proven.  At the end of this
article, I’ll share the results of my next round silicon testing–once I
finished verifying the receiver.&lt;/p&gt;

&lt;p&gt;The question before us today is whether or not I’ve skipped any necessary
tests.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left; padding: 25px&quot;&gt;&lt;caption&gt;Fig 2. The rule of gold&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/vlog-wait/rule-of-gold.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;One might argue at this point that this is all the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;controller&lt;/a&gt; needs to do in practice, and so
I should stop here.  In the past, I might’ve done so.  However, the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;controller&lt;/a&gt; can do a lot more than I’ve
tested so far.  It’s designed to operate with either 1 data bit, 4 data bits
(SDIO/eMMC), or 8 data bits (eMMC only), in either SDR and DDR modes, and
both with (eMMC only) and without (SDIO/eMMC) a data strobe.  The &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdckgen.v&quot;&gt;clock
divider&lt;/a&gt; is designed
to allow the IOs to be driven at less than one
edge per clock cycle, two edges per clock cycle, or even four edges per
clock cycle.  That’s a lot of features, and due to the way the current board
is designed, I won’t be able to test all of them.  (Specifically, the PCB
design connected the clock line to the CCLK pin, and it doesn’t allow a
card voltage change from 3.3V to 1.8V, so I can’t use either
&lt;a href=&quot;/blog/2020/08/22/oddr.html&quot;&gt;ODDR&lt;/a&gt;
or OSERDES controllers to drive it any faster than than half the clock rate.)&lt;/p&gt;

&lt;p&gt;Here’s my problem with stopping here: I want to place this &lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SDIO/eMMC
controller&lt;/a&gt;
in my “library” of working components, and I’ve had too many experiences in the
past of pulling something out of my library only to end up debugging it when
I place it onto hardware.  In practice, it’s worse than that–because by the
time I pull it out of my library, I typically won’t remember that I only tested
&lt;em&gt;some&lt;/em&gt; of the modes the IP supports, or whether or not it’s been updated since
I last used it on hardware.  I just remember that it has “worked” in the past,
so I consider it a piece of “working” IP from my library.  What that means is
that, when things don’t work in hardware, I won’t be suspecting this piece of
IP.  Hence, I’ll find myself looking all over some large SOC design for a bug,
instrumenting everything and its brother, before I finally realize that a
“working” IP component from my library had been left with a bug in it.&lt;/p&gt;

&lt;p&gt;This is unsatisfactory.&lt;/p&gt;

&lt;p&gt;Debugging a large design is a painful process.  It takes a lot of time–often
time that’s been allocated for other purposes–you know, like the new
capabilities the design is supposed to have–capabilities the sponsor is paying
for.  It delays product delivery with unscheduled debugging.  Frankly, I don’t
like spending my time on unplanned projects like that.  As a result, I want a
solid assurance that every IP component in my “library” &lt;em&gt;works&lt;/em&gt; before I add
it to a larger design.  It’s not enough that it worked in silicon the last
time it was used.  I want to know if any updates made since that time still
work.  I want to know to all the features work, to include features that
haven’t yet been tested in silicon.&lt;/p&gt;

&lt;p&gt;This requires a more rigorous approach to IP verification than just
demonstrating the IP once in silicon.&lt;/p&gt;

&lt;p&gt;For me, that more vigorous approach involves formally verifying each leaf
component, and then simulating the library component as a whole.  When it
comes to &lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;this SDIO/eMMC controller&lt;/a&gt;,
I have formal proofs of &lt;em&gt;most&lt;/em&gt; of the major components.  I have a proof of the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdckgen.v&quot;&gt;clock generator&lt;/a&gt;,
the &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdwb.v&quot;&gt;Wishbone
controller&lt;/a&gt;, the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdcmd.v&quot;&gt;command wire handler&lt;/a&gt;
and the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdtxframe.v&quot;&gt;transmitter&lt;/a&gt;.
What I didn’t have, which we’ll be discussing today, is a formal proof of the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdrxframe.v&quot;&gt;receiver&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Yes, I now have a formal proof of the &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdrxframe.v&quot;&gt;SDIO/eMMC
receiver&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;So, before we get started, let me ask: how many bugs do you think I found
going through this process?&lt;/p&gt;

&lt;h2 id=&quot;fitting-the-receiver-into-the-design&quot;&gt;Fitting the Receiver into the Design&lt;/h2&gt;

&lt;p&gt;Let me take a moment, though, to introduce you to &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdrxframe.v&quot;&gt;this
subcomponent&lt;/a&gt; and
discuss how it is supposed to work, prior to discussing the problems it had.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;caption&gt;Fig 3. The SDIO/eMMC receive component&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/sdrxframe/pinlist.svg&quot; width=&quot;400&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;The &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdrxframe.v&quot;&gt;SDIO/eMMC receive
framer&lt;/a&gt;, as I
call it, is responsible for receiving a block of data, checking the
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;(s), and writing
that block of data to an external &lt;a href=&quot;https://en.wikipedia.org/wiki/Multiple_buffering#Double_buffering_in_computer_graphics&quot;&gt;ping-pong
buffer&lt;/a&gt;.
A separate &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdwb.v&quot;&gt;Wishbone
component&lt;/a&gt;
acts as its controller in two ways.  First, it tells
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdrxframe.v&quot;&gt;the receiver&lt;/a&gt;
what IO mode is in operation.  The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_cfg_width&lt;/code&gt; setting tells us if we are
using 1, 4 (SDIO/eMMC), or 8 (eMMC only) IO pins.  The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_cfg_ddr&lt;/code&gt; setting
controls whether or not we’ll need to check separate
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;s
for each clock edge.  Similarly, if we are using the data strobe pin, as
indicated by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_cfg_ds&lt;/code&gt;, then we’ll be accepting data via the asynchronous
data port from the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdfrontend.v&quot;&gt;PHY&lt;/a&gt;,
rather than the simpler synchronous port.&lt;/p&gt;

&lt;p&gt;When the user issues a CMD17 to read a block, the &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdwb.v&quot;&gt;Wishbone
controller&lt;/a&gt; will
raise the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_rx_en&lt;/code&gt; line to indicate a block of data is on the way.  It
will also set &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_length&lt;/code&gt; to the length of the block to be expected.  The
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_crc_en&lt;/code&gt; pin also allows us to receive things that may, or may not–counter
to protocol–have &lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;s
attached.  Since the protocol requires
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;s,
I may remove this (unused) configuration bit in the future.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdfrontend.v&quot;&gt;front end&lt;/a&gt;
provides two sets of inputs for us, of which we will pick and
choose only one.  The first set is the &lt;em&gt;synchronous&lt;/em&gt; path.  This is the path
used in all SDIO modes and most of the eMMC modes–the path that doesn’t
depend upon the data strobe return from the eMMC device.  The second path is
the &lt;em&gt;asynchronous&lt;/em&gt; path from the front end.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left; padding: 25px&quot;&gt;&lt;caption&gt;Fig 4. The PHY supports three operating modes&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/sdrxframe/phymodes.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Both of these paths come to us from a &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdfrontend.v&quot;&gt;front end
component&lt;/a&gt; that
I’m going to call the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdfrontend.v&quot;&gt;PHY&lt;/a&gt;.  The
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdfrontend.v&quot;&gt;PHY&lt;/a&gt; can be
built in one of three ways, as shown in Fig. 4.  First, it can be built in
“standard” mode, where the IO buffers are driven directly from logic.  It can
also be built where the IO buffers are all driven via
&lt;a href=&quot;/blog/2020/08/22/oddr.html&quot;&gt;ODDR&lt;/a&gt; components, and the
returns come back via IDDR sampling.  Finally, there’s a front end mode which
will drive the IOs via 8:1 OSERDES elements, and read the results back via
a 1:8 ISERDES.  Which IO mode is used controls the maximum clock speed.
Likewise, only the SERDES IO mode supports the data strobe.&lt;/p&gt;

&lt;p&gt;All data messages in the SDIO/eMMC protocol start with a zero start bit.  This
is used as a synchronization point.  In our case one of the key features of
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdfrontend.v&quot;&gt;the front end&lt;/a&gt;
is that it strips off the start bit.  It also samples our data for us–either
by sampling the outgoing clock edges to discover a sample point, or by
sampling data when the return data strobe is present (eMMC only).&lt;/p&gt;

&lt;p&gt;A second key feature is specific to the synchronous path.  In this case, the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdfrontend.v&quot;&gt;PHY&lt;/a&gt;
measures the outgoing clock signal (before it gets to the pins), and sets a
sample time some programmable delayed time afterwards.  In this way, for high
speed IO, we allow ourselves to sample the incoming data at a programmable
fraction of a clock cycle later than the outgoing clock itself, to allow
for any clock propagation time from our controller, through the PCB to the
SDIO/eMMC chip, and then coming back from the SDIO/eMMC chip through the PCB to
our FPGA.&lt;/p&gt;

&lt;p&gt;All this is to say that by the time we get any data, all the hard work of
discovering when to sample the various IO bits has been taken care of for us.&lt;/p&gt;

&lt;p&gt;On the synchronous interface, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_rx_strb&lt;/code&gt; signal will indicate whether we
have new data available.  It will either indicate no sampling clock edges
(&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_rx_strb==0&lt;/code&gt;), one edge (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_rx_strb==2'b10&lt;/code&gt;), or two edges
(&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_rx_strb==2'b11&lt;/code&gt;) of data.  If &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_rx_strb[1]&lt;/code&gt; is true, then either 1, 4, or
8 bits of data will be available on the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_rx_data[15:8]&lt;/code&gt; ports.  If
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_rx_strb[0]&lt;/code&gt; is also true, then 1, 4, or 8 bits of data will also be present
in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_rx_data[7:0]&lt;/code&gt; inputs.&lt;/p&gt;

&lt;p&gt;The asynchronous interface is even simpler to use.  If &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S_ASYNC_VALID&lt;/code&gt; is ever
true, then we’ll have 32-bits of incoming data available to us.  There will
never be less.  This is due in part due to how the front end IOs are set up,
and also in part due to the nature of how the data strobe line is used.
Specifically, it is never used in one or four bit modes–always 8 bit.
Likewise, the data strobe is only ever used in
&lt;a href=&quot;/blog/2020/08/22/oddr.html&quot;&gt;DDR&lt;/a&gt; mode, when data is
transmitted on both clock edges.&lt;/p&gt;

&lt;p&gt;That describes what comes into this &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdrxframe.v&quot;&gt;receive
component&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There are two interfaces on the output.  The first is the control interface.
For every request that is made, that is for every time &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_rx_en&lt;/code&gt; is enabled,
the controller will process a received packet.  Once the packet is complete,
this receiver will raise the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_done&lt;/code&gt; flag.  At that time, it will also
raise the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_err&lt;/code&gt; flag if there were any errors associated with the packet.
Such errors could either be 1) a
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;
mismatch, or 2) a watchdog timeout error.&lt;/p&gt;

&lt;p&gt;Let me pause here for a moment to point out, whenever you use a return
data strobe for sampling data coming back to a chip, you &lt;em&gt;always&lt;/em&gt; need to
add a watchdog timer.  This is to keep your controller from hanging in
the event you make a mistake and either 1) don’t properly wire up the
data strobe, or 2) make a mistake in your protocol handling so that the
downstream chip doesn’t return the number of data strobes you are expecting.
In our case, the watchdog timer will also generate a timeout if the start
bit isn’t received within its timeout window–something that will come back
to haunt us when we get to hardware testing.&lt;/p&gt;

&lt;p&gt;The last interface, coming out of this controller, is the memory interface.
This is designed to feed one of two &lt;a href=&quot;https://en.wikipedia.org/wiki/Multiple_buffering#Double_buffering_in_computer_graphics&quot;&gt;ping-pong
buffers&lt;/a&gt;.
My vision is that these &lt;a href=&quot;https://en.wikipedia.org/wiki/Multiple_buffering#Double_buffering_in_computer_graphics&quot;&gt;ping-pong
buffers&lt;/a&gt;
will be as wide as the bus, so they can be used in high speed DMA operations on
wider buses if necessary–although to date I’ve only tried them at 32-bits
each.  (Yes, the &lt;a href=&quot;https://github.com/ZipCPU/eth10g&quot;&gt;ETH10G
project&lt;/a&gt; uses a 512-bus, but I’m initially
only going to connect this to the 32-bit control bus portion of that design.)&lt;/p&gt;

&lt;p&gt;The memory interface has valid, strobe, and data lines.  If the valid line
is high, then the strobe lines will tell you which bytes within the data lines
to write.&lt;/p&gt;

&lt;p&gt;If all goes well, once &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_rx_en&lt;/code&gt; is set, memory will flow from the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdfrontend.v&quot;&gt;PHY&lt;/a&gt;,
get collected into bytes and/or words, and then sent out the memory interface.
Once complete, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_done&lt;/code&gt; signal will be raised and the controller will
then drop the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_rx_en&lt;/code&gt; line, and only raise it following another command from
the user–or perhaps the to-be written DMA.&lt;/p&gt;

&lt;p&gt;At least, that’s how this portion of the design is &lt;em&gt;supposed&lt;/em&gt; to work.&lt;/p&gt;

&lt;h2 id=&quot;outlining-the-formal-proof&quot;&gt;Outlining the Formal Proof&lt;/h2&gt;

&lt;p&gt;One of the reasons why &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdrxframe.v&quot;&gt;this
component&lt;/a&gt;
took so long to verify was because I had a sort of writer’s block when I first
looked at it.  I didn’t really know where to start.  &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdrxframe.v&quot;&gt;The
design&lt;/a&gt;,
I said to myself, was so simple–what could possibly be done to verify it?&lt;/p&gt;

&lt;p&gt;Yeah.&lt;/p&gt;

&lt;p&gt;As a result, &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdrxframe.v&quot;&gt;this
component&lt;/a&gt;
sat on the shelf for a week or two while I worked on other things.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;caption&gt;Fig 5. When do I get to the real stuff?&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/sdrxframe/realstuff.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;The &lt;a href=&quot;/formal/2020/06/12/four-keys.html&quot;&gt;minimum requirement&lt;/a&gt;
of any formal proof is something I call “The Contract.” The contract describes
how the IP is supposed to operate if everything is working.  In this case, the
contract is fairly easy to express in words: given an arbitrary byte, arriving
at an arbitrary position in the received data stream, formally &lt;em&gt;prove&lt;/em&gt; that
this arbitrary byte gets processed properly and sent to the output.&lt;/p&gt;

&lt;p&gt;As with most things in life, however, you need some sort of structure to hang
all of this verification logic off of.  You can think of it like a skeleton.
Just like a skeleton holds all your joints, ligaments, and muscles in place, a
good formal verification structure can be used to hold all of the formal
verification logic in place.&lt;/p&gt;

&lt;p&gt;I chose two pieces for my skeleton.  The first was a bit counter.  Starting
from the beginning of the operation, I would count the number of bits
arriving on our interface.  If we were in one bit mode, that would be the
number of ones arriving on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_rx_strb&lt;/code&gt;.  For the four or eight bit interface,
it would be four or eight times that much.  It simply counts how many bits of
valid data we’ve received.  The second key component was a memory counter.
This memory counter would count the number of bytes written to the &lt;a href=&quot;https://en.wikipedia.org/wiki/Multiple_buffering#Double_buffering_in_computer_graphics&quot;&gt;ping-pong
buffer&lt;/a&gt;
control outputs.&lt;/p&gt;

&lt;p&gt;I then needed some assertions to tie these two together.&lt;/p&gt;

&lt;p&gt;Those two counters alone were enough to find the first several bugs.&lt;/p&gt;

&lt;p&gt;They were also enough to allow me to build and express the contract.&lt;/p&gt;

&lt;p&gt;A third component of the skeleton that I added at a later time was a 1-bit
state machine.  This one bit state machine would become high upon a request
for operation–one cycle after &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_rx_en&lt;/code&gt; goes high, and then it would go low
once we completed our task.  I needed this to prove that the design wouldn’t
hang–especially since I was seeing it hang at the time.&lt;/p&gt;

&lt;p&gt;At this point, we can come back to our original question, and ask: how
many bugs did I find?&lt;/p&gt;

&lt;h2 id=&quot;bugs-discovered-via-formal-verification&quot;&gt;Bugs discovered via formal verification&lt;/h2&gt;

&lt;p&gt;Let’s count the bugs I found.  Since I’m using &lt;a href=&quot;https://git-scm.com/&quot;&gt;git&lt;/a&gt;,
it’s not all that hard.  I’m just doing a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;git diff&lt;/code&gt;, or rather
&lt;a href=&quot;https://meldmerge.org/?utm_source=Logiciels.Pro&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;meld&lt;/code&gt;&lt;/a&gt;
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdrxframe.v&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sdrxframe&lt;/code&gt;&lt;/a&gt; to
be more specific, and counting all of the differences between the commit before
verifying the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdrxframe.v&quot;&gt;receiver&lt;/a&gt; and
the changes after now that the verification now passes.  Let’s walk through
the differences, shall we?&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;The first difference doesn’t really count.  I discovered, via simulation
testing, that I had stripped off the start bit in two locations: first in the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdfrontend.v&quot;&gt;PHY&lt;/a&gt;,
and second in the &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdrxframe.v&quot;&gt;receive
framer&lt;/a&gt;.
The result was both lost data and a failing
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;, since the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdrxframe.v&quot;&gt;receive framer&lt;/a&gt;
would remove one or more clock cycles of data from the beginning of any
packet, while looking for that start bit.  This change just hadn’t made
it into my baseline commit.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The next big change is sort of borderline as to whether it should count
or not.  Since I had let the design sit for a couple weeks before coming
back to it to verify it, I came at it with fresh eyes and noticed a big
bug while simply desk checking: I never implemented the
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;s for the
negative clock edge.&lt;/p&gt;

    &lt;p&gt;When using the SDIO/eMMC protocol, each data wire used in transmission
gets its own &lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;
at the end of the data block.  These
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;s are each
16-bits in length, and they protect the entire data block.  That’s in single
data rate (SDR) mode.  When operating in dual data rate (DDR) mode and
sending data on each edge of the clock, there’s one
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt; for each data
wire on the positive edge of the clock, and a separate
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt; for each data
wire protecting the data sent on the negative edge of the clock.  Both
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;s
are 16-bits, and they are interleaved–so the positive edge
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;
will alternate transmission with the negative edge
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;
at the end of the packet.&lt;/p&gt;

    &lt;p&gt;When building the receive controller, however, I had only implemented the
positive edge
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;s.&lt;/p&gt;

    &lt;p&gt;Oops.&lt;/p&gt;

    &lt;p&gt;I’m not sure I’d call that a formal verification bug, though, since the
tools didn’t really find it.  I found it via a desk check.  That is, I found
it via a desk check that I was only doing because I was adding formal
properties to the design in order to verify it.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The next bug was associated with the logic for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_done&lt;/code&gt;.  This bug
didn’t really show up as an assertion failure, rather it showed up as I
was trying to formally describe how the logic was supposed to operate.&lt;/p&gt;

    &lt;p&gt;The first problem here was that I had &lt;em&gt;two&lt;/em&gt; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;done&lt;/code&gt; signals.  One was used
internally, and the other was my external signal.  Further, I couldn’t
really make out (from my own design even!) what the real difference was
between these two signals.  How were they supposed to relate?  Were they
supposed to be identical?&lt;/p&gt;

    &lt;p&gt;Let me back up and explain this a bit more.  I want to formally verify
the entire operation of the &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdrxframe.v&quot;&gt;receive
framer&lt;/a&gt;.
That means I want to verify,
formally, that it can properly receive 512 bytes of data in all modes,
to include the mode where it only receives one bit at a time for a minimum
total of 4096 clock cycles.  This is a minimum, however, because when I
operate the design at 100kHz (the slowest potential clock speed), there
will be
1,000 clock cycles between every bit.  Hence, a full operation will
take more than 4M clock cycles.  Most formal proofs will die on anything
over about 20 clock cycles, with the longest proof I have running at about
350 clock cycles.  There would be no way I’d verify 4M clock cycles of
operation, therefore, without using
&lt;a href=&quot;/blog/2018/03/10/induction-exercise.html&quot;&gt;Induction&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;&lt;a href=&quot;/blog/2018/03/10/induction-exercise.html&quot;&gt;Induction&lt;/a&gt;,
however, requires assertions to both verify and then guarantee
all of the relationships between registers.  That means I need assertions
to describe the differences between these two done registers.&lt;/p&gt;

    &lt;p&gt;Yeah, the second &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;done&lt;/code&gt; register was quickly dropped when I couldn’t decide
what it’s real purpose was.&lt;/p&gt;

    &lt;p&gt;Even that wasn’t enough, since there were several registers that needed
to act on the clock prior to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_done&lt;/code&gt;.  Therefore, I ended up creating
a signal I called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;w_done&lt;/code&gt; to indicate that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_done&lt;/code&gt; was about to be
set, and everything remaining should clean itself up.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The next bug was that
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt; errors (not
tested by my simulation), wouldn’t show up coincident with the done signal.
Yes, my &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;done&lt;/code&gt; logic was really messed up.  At first I was declaring the
design &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;done&lt;/code&gt; once all the data (not
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;s)
had been received.  Then I tried setting &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;done&lt;/code&gt; once all the
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;s
had been received, but not allowing for the last bit(s) to impact the
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;s,
nor for a test of whether or not the last
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;
successfully received its data.&lt;/p&gt;

    &lt;p&gt;In many ways this didn’t surprise me: I rarely test fault conditions in
simulation.  I should.  Indeed, I need to make it a habit of doing so, but
my simulation setup for this design was still somewhat new, so I hadn’t yet
verified failed &lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;
handling.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;At full speed, data would get written to the wrong memory address.&lt;/p&gt;

    &lt;p&gt;Remember how I said that the skeleton of the proof would help?  Well, it
turns out my address counting was messed up.  I would calculate the
next memory address on the cycle I wrote to memory.  That wasn’t a problem.
However, I’d then use that memory address to shift the next memory strobe
and data into position, and so my logic required a dead cycle between
memory writes in order to be successful.  That would be fine when operating
on four data bits (SDIO/eMMC) in SDR mode, when not using the OSERDES (i.e.
when using the CCLK pin).  In other words, it would work fine
the way both my simulation and my hardware were setup.  However, this
approach would fail quickly when/if I ever transitioned to one of the high
speed modes this IP was supposed to support.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left; padding: 25px&quot;&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/sdrxframe/pride-shame.svg&quot; width=&quot;240&quot; alt=&quot;When pride cometh, then cometh shame: but with the lowly is wisdom. (Prov 11:2)&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Remember how I commented earlier on my frustrations when taking a design
   out of my library to add to a larger SOC-based design?  This would’ve lead
   directly to one of those problems.  Had I not verified this IP, I would’ve
   run it on hardware and been really proud of it.  I’d put it, in my pride,
   into my library and declare it to be “working”, only to come back later,
   configure it for a (supposed to be supported) high frequency mode, only
   to discover that mode didn’t work.&lt;/p&gt;

&lt;p&gt;This is why I like formal methods.&lt;/p&gt;

&lt;ol start=&quot;6&quot;&gt;
  &lt;li&gt;
    &lt;p&gt;I also came across a bug whereby the receiver might ingest one too many
clock cycles.  What happens, for example, if you want to receive a five
byte packet, the data width is set to 8-bits, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_rx_strb==2'b11&lt;/code&gt; on
every clock cycle?  The answer is that, on the last clock cycle, the
data associated with the second clock edge would need to be discarded.&lt;/p&gt;

    &lt;p&gt;In this case, I needed to generate a new signal, one I called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;last_strb&lt;/code&gt;,
to keep the IP from ingesting more than one clock edge with the last data
set.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The next bug was associated with disabling
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;
checking.  When I built &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdrxframe.v&quot;&gt;this
receiver&lt;/a&gt;,
I built it with a mode for receiving something that doesn’t have a
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;.  This was to
support reading particular registers that weren’t
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt; protected.
In hindsight, however, I’m not really sure I need this mode–since 1) all
of those registers transfer their data over the CMD wire, and 2) even those
unprotected registers still have
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;s–they
just can’t be trusted.  Regardless of whether it is needed or not, however,
the formal tool decided to test it and found it broken.&lt;/p&gt;

    &lt;p&gt;Of course, my simulation didn’t check this mode.  There was no reason to.
All data transactions require
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;s.  However,
design lock ups are bad, and that was what the formal tool found.  If ever
the &lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;
checking was disabled, the design might accept its packet (but not the
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;)
and then hang waiting on the remaining
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;
that would never come.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The last bug was more serious.  It involved those cases where I might
receive data on two separate clock edges within a single clock cycle.
This might be the case when using either the
&lt;a href=&quot;/blog/2020/08/22/oddr.html&quot;&gt;ODDR&lt;/a&gt; component in DDR
mode, or the SERDES component in a multiple IO clock per system clock mode.
The bug would only be triggered if I receive data on one clock edge at
first (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_rx_strb==2'b10&lt;/code&gt;), and then ever after received data on both
clock edges (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_rx_strb==2'b11&lt;/code&gt;).  Not only that, it’s only triggered
in 8-bit mode.&lt;/p&gt;

    &lt;p&gt;Here’s how the bug works.  When the first 8-bits of data arrive,
those bits get written to bits [31:24] of the memory bus–assuming it’s
32-bits, which it is for these runs.  On all subsequent clock cycles,
16-bits arrive and get forwarded to the memory.  Hence you’d write to
bits [23:8] of memory on the second write, and then you’d want to write to
bits [7:0] of the current memory word and (oops) bits [31:24] of the
subsequent word.  This is called an &lt;em&gt;unaligned data access&lt;/em&gt;, and herein
lies the bug.  I didn’t account for writing unaligned data to memory.&lt;/p&gt;

    &lt;p&gt;Fixing this bug wasn’t hard, but it did require logic to handle the
unaligned memory write request.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;One of the tricks I often use when formally verifying components is to assume
difficult things won’t happen.  It helps the proof along, and can often help
me get through the simpler logic.  Of course, difficult things &lt;em&gt;do&lt;/em&gt; happen in
real life, and so these assumptions can easily render a proof invalid.  For
this reason, I make sure to place all such assumptions in a specially marked
block at the end of the file–a block I like to call “Careless assumptions,”
because of the likelihood that they will void a proof.  Over time, as I get
the opportunity, I’ll slowly work off these “careless assumptions” until none
remain.&lt;/p&gt;

&lt;p&gt;In this case, my “Careless assumptions” section held two assumptions for a while
that I needed to come back to.  The first was that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_rx_strb&lt;/code&gt; would only ever
be either &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;2'b11&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;2'b10&lt;/code&gt;.  This allowed me to get the proof to pass first,
and then come back later to handle the unaligned memory requests.  My second
assumption is that the watchdog timeout would never fire.  In both cases, I
had to come back later and work through removing these assumptions before the
proof could really be declared complete.&lt;/p&gt;

&lt;p&gt;Today, I can now say with confidence that this design and proof no longer
contains any “careless” assumptions.&lt;/p&gt;

&lt;h2 id=&quot;hardware-bringup&quot;&gt;Hardware bringup&lt;/h2&gt;

&lt;p&gt;Yes, but … &lt;em&gt;does it work?&lt;/em&gt;  Alternatively, I might ask, did all that formal
verification work actually make a dent when it came to how long it took to
bringup &lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;the controller&lt;/a&gt; in silicon to talk
to its first device?&lt;/p&gt;

&lt;p&gt;To answer that question, let’s go over the bugs found during hardware bringup.&lt;/p&gt;

&lt;p&gt;First, as background, the design did need a hardware change before starting.
The FPGA was driving the SD Card at 1.8Volts via a TI TXB0108 voltage
translator to 3.3V, and the voltage translator couldn’t handle the open
drain signaling required during startup.&lt;/p&gt;

&lt;p&gt;Second, I was quite pleased to see the card respond to the very first
command I gave it, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SEND_IF_COND&lt;/code&gt; command.  Not only did it respond, but
it also returned a valid response.  This helped to add momentum to the
subsequent testing, knowing that at least the interaction via the
command wire worked.&lt;/p&gt;

&lt;p&gt;Now let’s go over the bugs I found.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;The next command in the bringup of an SD card, following the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SEND_IF_COND&lt;/code&gt;
command, is to ask the card to send its Operating Conditions Register (OCR).
This is part of a voltage negotiation that takes place between &lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;the
controller&lt;/a&gt; and the card.  It’s supposed
to be a &lt;em&gt;handshake&lt;/em&gt;.  &lt;strong&gt;The bug:&lt;/strong&gt; in my first software drafts, I never
told the card what voltages I could provide.  Hence, from the card’s
perspective, we had never come to an agreement on the required voltage
and so the card never booted up.&lt;/p&gt;

    &lt;p&gt;Reading through the specification helped here.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt; This was a &lt;em&gt;software&lt;/em&gt; bug.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The next problem was that I couldn’t get the card to respond
to the next command, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ALL_SEND_CID&lt;/code&gt;.  This is where every card sharing the
bus sends its identification via an open-drain setup, and whoever sends a
‘0’ wins the bus for that bit and following.  It’s a part of the protocol
designed to allow multiple cards to share a bus–although I’ve never actually
seen this used in practice.  In this case, I just couldn’t get the card to
respond at &lt;em&gt;all&lt;/em&gt; to this request.  The card had responded fine to the
previous command, just not this one.&lt;/p&gt;

    &lt;p&gt;The problem here turned out to have nothing to do with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ALL_SEND_CID&lt;/code&gt;
command at all–it was how I handled the response to the reading the
OCR.  Bit [31] of the OCR is listed in the specification as “Card power up
status bit (busy)”.  So, I figured that once the bit was clear, the card
was no longer busy.  &lt;strong&gt;The bug:&lt;/strong&gt; Re-reading the specification revealed I
had the sense wrong–the bit needed to become a one before moving on.
Because I wasn’t waiting, the card hadn’t finished powering up when I
gave it its next command, hence it wasn’t responding.&lt;/p&gt;

    &lt;p&gt;I found this bug via simulation, once I tried increasing the power up time
in simulation to the point where it would have an impact.  Then, when &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/bench/verilog/tb_sdio.v&quot;&gt;the
simulation&lt;/a&gt;
didn’t match my software, I knew I was on to something.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt; This was a second &lt;em&gt;software&lt;/em&gt; bug.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left; padding: 25px&quot;&gt;&lt;caption&gt;Fig 6. Voodoo computing&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/sdrxframe/voodoo.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;ol start=&quot;3&quot;&gt;
  &lt;li&gt;
    &lt;p&gt;During this time I should point out I did a lot of return code debugging.
Specifically, &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdcmd.v&quot;&gt;my command wire
processor&lt;/a&gt; got a
lot of scrubbing to make sure I was getting the right return code for
any error I encountered.  I’m not sure I really found anything here, but I
did change a bunch of stuff in the process.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt; I’m not sure there was a bug here at all.  I think the
bottom line issue here was that I had forgotten, between when I wrote this
module and when I came back to it, exactly how the interface was supposed
to work.  So I ended up rewriting how errors should be reported, even though
they may have not been reported incorrectly in the first place.&lt;/p&gt;

    &lt;p&gt;Sadly, I also discovered that I had left a “Careless assumption” in my
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdcmd.v&quot;&gt;command wire processor&lt;/a&gt;,
an assumption that kept the formal proof of this processor from ever
examining a timeout situation.  So, I had to pause here to remove this
last assumption–especially since I was getting timeout errors, and I
had no confidence that these errors were correct.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;I then managed to get far enough to read several registers from the SD
card.  Two in particular, the Card Identification (CID) register and the
Card Specific Register (CSR) deserve some extra mention.  These are each
128’bit registers (including
the &lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;).
.  They follow what would normally be an echo of the
8’b command, and end with a 7-bit
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt; followed by a
stop bit for a total of 136 bits.  In testing, I could read these registers
just fine.  The bug the problem was that the &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdcmd.v&quot;&gt;command
wire processor&lt;/a&gt;
was indicating a &lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;
error every time it read from these registers.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;The bug:&lt;/strong&gt; Digging further, I discovered I had calculated the
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;
over the 8’bit prefix to the 120’bit data register, not just the 120’bit
data bits.  In this case, both &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/bench/verilog/mdl_sdio.v&quot;&gt;my SDIO
model&lt;/a&gt;
and &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdcmd.v&quot;&gt;my controller&lt;/a&gt;
were in error.  This is a classic example of building the wrong thing
right.&lt;/p&gt;

    &lt;p&gt;I’m not sure I would’ve found this apart from hardware testing.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt; This one was a hardware bug.  The hardware did everything
I had designed it to do and it did it all properly, I had just designed
it to do the wrong thing.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;I also came across a second problem with these 128-bit registers, and that
was that I could read them once, and once only.  Ever after that first
success, the register would always read zero.&lt;/p&gt;

    &lt;p&gt;To understand this bug, we have to look a bit deeper into the design.&lt;/p&gt;

    &lt;p&gt;The &lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SDIO controller&lt;/a&gt; is designed to
handle data transfer via two internal FIFOs–the &lt;a href=&quot;https://en.wikipedia.org/wiki/Multiple_buffering#Double_buffering_in_computer_graphics&quot;&gt;ping-pong
buffer&lt;/a&gt;s.
Normally, those buffers are only used for data transfer: Either software
writes a sector into them that is then forwarded to the SD Card, or the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;controller&lt;/a&gt; reads a sector from the SD
Card and places the results into the buffer for software to come back and
read once the operation is complete.  The exception to this rule is that
these 128-bit registers are also written to the &lt;a href=&quot;https://en.wikipedia.org/wiki/Multiple_buffering#Double_buffering_in_computer_graphics&quot;&gt;ping-pong
buffer&lt;/a&gt;s,
not by the &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdrxframe.v&quot;&gt;receive
framer&lt;/a&gt;, but by
the &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdcmd.v&quot;&gt;command wire
handler&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;The problem in this case had to do with the pointers to the FIFO.
&lt;strong&gt;The bug:&lt;/strong&gt; I wasn’t
resetting the read pointer when I issued a command to read these registers.
As a result, the first time I read the registers properly from addresses
0, 1, 2, and 3.  When I issued the command again, the pointers weren’t reset
and so I was attempting to read the 128-bit register value from addresses
4, 5, 6, and 7–after it had been stored in addresses 0, 1, 2, and 3.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt; While you might argue this was bad user interface design,
it required a hardware fix.  Therefore this falls into the category of a
hardware bug.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Once the card identifies itself, it is then picks its relative address and
the protocol clock can speed up from 400kHz to 25MHz.  Later, if I want
to restart things, I might wish to slow the clock back down to 400kHz.
&lt;strong&gt;Design bug:&lt;/strong&gt; Along the way, I discovered that my register design
provided me no way of knowing what the current clock speed was.  Hence,
I might change the clock speed, but never know how long to wait until that
new speed was active.&lt;/p&gt;

    &lt;p&gt;I solved this by adjusting the &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdwb.v&quot;&gt;Wishbone
controller&lt;/a&gt; and
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdckgen.v&quot;&gt;clock divider&lt;/a&gt; so
that &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdwb.v&quot;&gt;the controller&lt;/a&gt;
would return, upon a read request, the &lt;em&gt;current&lt;/em&gt;
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdckgen.v&quot;&gt;clock divider&lt;/a&gt;
setting, not necessarily the most recently commanded one.  Once the two
matched, I could then know the clock rate had properly changed and so I
could move on.&lt;/p&gt;

    &lt;p&gt;This still creates a sudden clock change.  Were the card to try to lock a
PLL to this clock, it wouldn’t have time to lock it before I was sending
the next command.  On the other hand, the specification does say that the
clock can be stopped or paused at any time if need be, a criteria that would
probably preclude such an implementation.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt; This was a flaw in my user interface design.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;SD Cards have two command sets.  There are regular commands, called CMDs
and followed by a decimal number, such as CMD0 (GO IDLE) or CMD17
(READ SECTOR).  There are also application specific commands, or ACMDs.
To send an ACMD, you first send a CMD55, and then the following command is
interpreted as an ACMD.  I now needed to issue an ACMD6 to set the bus width
to four bits.  However, much as I tried, I couldn’t get the card to respond
to my CMD55 at all.  &lt;strong&gt;The bug:&lt;/strong&gt; It was only after much frustration that I
looked up the CMD55’s argument, only to discover I was supposed to
address the card in the CMD55 via the card’s relative address–and I was just
setting the address field to zero.  This was appropriate earlier in the
setup, before the card had assigned itself a non-zero relative address, but
not once the address had been assigned.  No wonder it wasn’t responding–I
wasn’t addressing it.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt; This makes for a third software bug.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;At this point, I was finally at the point in the sequence where I could
issue a command to read a sector from the SD Card and … I got stuck here
again.  I kept issuing read commands, only to have them end in
a failure with a
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;
failing error code.  In the end, this turned out to be a couple of bugs.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;The bug:&lt;/strong&gt; The first problem was that I couldn’t tell the difference
between a read failure and a command response failure.  Both might return
the same &lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;
failure code, both shared the same three bits.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt; I really need to adjust the user interface here, so I can
tell the difference between failures on the command wire, and read failures
on the data lines–whether they be timeout errors or actual
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt; errors.&lt;/p&gt;

    &lt;p&gt;Now for the other problems …&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;I also made the mistake at one point of not enabling the FIFOs.  Sure
enough, by design, the read wasn’t enabled because the FIFOs hadn’t been
enabled.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt;  This was a software bug, caused by my thrashing around
trying to determine if I had a read error or a command wire error, and
so I had turned off the FIFOs to get the command to end early enough that
I might trigger the internal logic analyzer on something useful, and then
I later forgot that I had them turned off.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;At this point, I still wasn’t able to read a sector from the device,
and it took a bit longer to figure out why.  Not only that, I had to dig
into the trace from my internal logic analyzer to discover the next bug.
Remember how I said when discussing the &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdrxframe.v&quot;&gt;receiver
design&lt;/a&gt;, that
the &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdfrontend.v&quot;&gt;PHY&lt;/a&gt;
would remove the start bit?  Well, in order to do that, the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdfrontend.v&quot;&gt;PHY&lt;/a&gt;
needs to be told when to expect a packet so it can reset its start-bit
search algorithm.  Nothing in my internal interfaces allowed for this
communication–I just hadn’t foreseen the need.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt; This was definitely a hardware bug.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Then I got lucky.  &lt;strong&gt;The bug:&lt;/strong&gt; I just managed to (by chance) adjust the
scope enough that I could see there was a packet (i.e. the sector) coming
back across the interface, but the design just wasn’t seeing it.  This was
key, because it told me I wasn’t somehow messing up the command.  I had
the command sequence right, and the card was returning data, I just wasn’t
seeing it.&lt;/p&gt;

    &lt;p&gt;Was it a problem in the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdrxframe.v&quot;&gt;receiver&lt;/a&gt;?&lt;/p&gt;

    &lt;p&gt;No.  The formally verified
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdrxframe.v&quot;&gt;receiver&lt;/a&gt;
worked nicely as designed.&lt;/p&gt;

    &lt;p&gt;The problem was in my watchdog timer.  The timer was set, by a parameter in
the &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdrxframe.v&quot;&gt;receiver&lt;/a&gt;,
to wait a maximum of 8M clock cycles for the first data bit.  That timer was
overridden at the top level, so that it would only wait 64 clock cycles for
the first data bit.  Needless to say, the card didn’t respond that fast.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt; Yeah, this was another hardware bug.  This time, it was in
the design’s configuration.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;At this point, I’d like to write that all my formally verified modules worked
as intended.  Was this really the case?  Let’s work through the issues
identified above.  Issue three wasn’t clearly a bug.  Issues one, two,
seven, and nine were all software issues,
and issues six and eight were user interface design issues.&lt;/p&gt;

&lt;p&gt;That leaves four hardware issues that were revealed during bringup.  The
first was the 128-bit register CRC issue.  This flaw passed both formal
and simulation based verification–the design did what I told it to, I had
just told it to do the wrong thing.  The second issue, that of the FIFO
pointers, should’ve been caught when I verified the user interface, or at least
when I ran the whole design in simulation.  The last two issues, that of when
to start looking for the start bit and how long the watchdog timeout should
be, were both issues rooted in the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdfrontend.v&quot;&gt;PHY&lt;/a&gt;.
They were missed simply because 1) I didn’t formally verify the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdfrontend.v&quot;&gt;PHY&lt;/a&gt;,
and 2) my simulation never checked more than one packet, and 3) never
waited a significantly long period of time before returning a packet.
(Given how expensive simulation can be, I dislike waiting if I don’t have to.
Unfortunately, this led to missing two bugs in simulation that had to be caught 
later in hardware.)&lt;/p&gt;

&lt;p&gt;One item I haven’t yet mentioned is how long this hardware bringup session
took.  Once everything was formally verified, I managed to run through all the
hardware bringup over the course of two days.  The work often took place when
another project was running simulations, and it had to be paused for nearly
a whole day in the middle due to internet connectivity issues.  So, let’s
say hardware bringup–from synthesis to reading a sector took no more than
a day of work in total.  This is in contrast to my first attempts to bring
up &lt;a href=&quot;https://github.com/ZipCPU/sdsdpi/blob/master/sw/sdiodrv.c&quot;&gt;this controller&lt;/a&gt;,
taken before either simulation or formal were accomplished,
where I just embarrassed myself when not only did the controller not work but
I had no idea why not.  At least this time I had more confidence in what
was going on.&lt;/p&gt;

&lt;p&gt;I should also caveat this list by pointing out that I haven’t (yet) verified
either &lt;a href=&quot;https://github.com/ZipCPU/sdsdpi/blob/master/sw/sdiodrv.c&quot;&gt;my software
driver&lt;/a&gt;, or the
hardware’s ability to write a sector.  So far, I’ve only tested a piece of
well instrumented &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/dev/sw/zipcpu/board/sdiochk.c&quot;&gt;test
software&lt;/a&gt;–nothing
that can really be used beyond initial hardware verification.  My work,
therefore, isn’t complete yet.&lt;/p&gt;

&lt;h2 id=&quot;conclusions&quot;&gt;Conclusions&lt;/h2&gt;

&lt;p&gt;I like to ask myself, after going through all this pain, was it really worth
it?  Was it worth all the pain of going through a formal verification process?
Were the bugs I found ones that justified the extra work?&lt;/p&gt;

&lt;p&gt;In this case, let’s think of the alternative.  The alternative is that I
would run this design on silicon, convince myself over the next week or two
that it worked, and then put this design away, in my library, containing
bugs associated with features never tested in silicon.  I’d be proud of my
work, and pat myself on the back.  Then, sometime later–perhaps a year or
more, I’d come back to this design, remember how proud I was of it, pull it off
the shelf, and place it into a new design using an IO mode that had never been
properly tested, only to
discover things not working.  I’d then re-run the simulation, get a “success”
result, and convince myself that some other part of the design must be in error.
Then, after a painful week of debugging–perhaps even two–I’d be forced back
to this portion of the design only to kick myself for allowing such bugs to be
left in my “library”.&lt;/p&gt;

&lt;p&gt;This is definitely one of those cases where an ounce of prevention is worth
a pound of cure.  It’s certainly easier to debug a design shortly after
writing it than it is to come back to it years later wondering what’s wrong
with it.&lt;/p&gt;

&lt;hr /&gt;&lt;p&gt;&lt;em&gt;Prepare thy work without, and make it fit for thyself in the field; and afterwards build thine house. (Prov 24:27)&lt;/em&gt;</description>
        <pubDate>Tue, 18 Jul 2023 00:00:00 -0400</pubDate>
        <link>https://zipcpu.com/formal/2023/07/18/sdrxframe.html</link>
        <guid isPermaLink="true">https://zipcpu.com/formal/2023/07/18/sdrxframe.html</guid>
        
        
        <category>formal</category>
        
      </item>
    
      <item>
        <title>Using a Verilog task to simulate a packet generator for an SDIO controller</title>
        <description>&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;caption&gt;Fig 1. The KlusterLab board&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/sdiopkt/klusterlab_1.0.jpeg&quot; width=&quot;780&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;One of my current projects is to test and bring up a &lt;a href=&quot;https://github.com/ZipCPU/eth10g&quot;&gt;10Gb Ethernet
test board&lt;/a&gt;.  The board has been fondly named
the “KlusterLab”, because of all of the various interfaces present on it.
Among those interfaces are &lt;a href=&quot;https://github.com/pcbarts/fast-open-switch-hardware/blob/8c531c306c93a31ff30c5851e233dbd086ae79f1/fast-open-switch%20r1.0/Documentation/KlusterLab%20Schematics%20r1.0.pdf&quot;&gt;an SD port and an eMMC port&lt;/a&gt;.
Now, how shall I verify their functionality?&lt;/p&gt;

&lt;h2 id=&quot;building-an-sdioemmc-controller&quot;&gt;Building an SDIO/eMMC Controller&lt;/h2&gt;

&lt;p&gt;My first round of testing the SD port used my SPI-based SD card controller,
&lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SDSPI&lt;/a&gt;.  Using that controller, the card
responded in much the way I expected, save that the first sector of the card
wasn’t what &lt;a href=&quot;http://elm-chan.org/fsw/ff/00index_e.html&quot;&gt;FATFS&lt;/a&gt; was expecting.
Since this is simply board bringup, where we are just trying to verify that the
hardware is working, I signed the component off as working.  I could obviously
write to the device, and I could read from the device.  (Okay, I signed it
off too early, but we’re getting there.)&lt;/p&gt;

&lt;p&gt;The eMMC port, however, that was going to require more work.  I didn’t have
an eMMC controller.&lt;/p&gt;

&lt;p&gt;On the other hand, I had the beginning scribbles of an SDIO controller I’d
started to work on years earlier.  SDIO is actually the native SD card
protocol.  It includes a clock, a bidirectional command wire, and four
bidirectional data wires that can be used to send blocks of memory back and
forth.  Moreover, it the SDIO and eMMC protocols are electrically so similar
that a single hardware controller can work for both of them–even if they would
need different software drivers.  Still, maybe I could breathe some life into
my draft controller components?&lt;/p&gt;

&lt;p&gt;In the end, I didn’t breathe life into any dead corpse of a project.  I pretty
much started over.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;caption&gt;Fig 2. SDIO/eMMC IO modes&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/sdiopkt/iomodes.svg&quot; width=&quot;420&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;Why?  Well, I figured, if I was going to build a &lt;em&gt;new&lt;/em&gt; controller, I should
build a design to support all of the IO modes used by eMMC and SDIO.  This
meant I wanted to support both the open-drain IO these interfaces start out
at, as well as their faster push-pull speeds.  I wanted to support the
bidirectional command and data pins.  Since the ports will work with either 1,
4, or 8 (eMMC only) data pins, I wanted to support all of these modes.  Finally,
the clock speed could run anywhere from 100kHz at the slowest, all the way up
to 200MHz.  The faster protocols even ran in DDR mode, with the fastest
protocol requiring return data capture using a data strobe.&lt;/p&gt;

&lt;p&gt;Thankfully, this is all doable.  It just needs a little planning.&lt;/p&gt;

&lt;p&gt;Over the course of a weekend or two, I managed to draft all of the components
to the entire controller.  You can see the various parts and pieces of this
new controller in Fig. 3 below.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 3. SDIO/eMMC controller components&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/sdiopkt/sdioparts.svg&quot; width=&quot;780&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Once I built the front end hardware, I realized I couldn’t use an OSERDES
with the FPGAs CCLK pin, so I had to rebuild it without IO elements.  I
then built it again using DDR elements, and then built it again without
Xilinx IO elements at all.&lt;/p&gt;
&lt;table align=&quot;center&quot; style=&quot;float: left; padding: 25px&quot;&gt;&lt;caption&gt;Fig 4. Approaches to front-end IO&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/sdiopkt/sdfrontend.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;The first front end I built was for the highest speed modes.  This was to handle
anything above 50MHz, and it did so using both 8:1 OSERDES and 1:8 ISERDES
elements.  I allowed the return data path to be captured any number of 1.25ns
intervals from an outgoing clock edge (when not using the data strobe).  Even
better, all of the data strobe qualified signals were placed nicely into
appropriate asynchronous buffers.  Then I realized, the current
version of the board will drive the eMMC clock through the CCLK pin, so I have
no access to the OSERDES when driving this pin.  Hence I built a second version
of the front end that just did direct IO pin control–allowing nothing to move
at less than a clock cycle and limiting the controller to 50MHz.  This I built
both with and without Xilinx IO support–to make sure I could handle the speeds
CCLK would generate.  The final version, shown as #2 in Fig. 4, was a DDR based
version.  This allowed me to run at 50MHz DDR from a 100MHz primary clock.  It
also allowed me to simulate something quickly without needing to simulate
Xilinx primitives.&lt;/p&gt;

&lt;p&gt;(I know, Xilinx primitives aren’t all that hard to simulate in Vivado.
In fact, they’re pretty easy to simulate.  However, I was using open source
tools up to that point and wasn’t yet ready to fire Vivado up.)&lt;/p&gt;

&lt;p&gt;All that remained was to verify the controller worked.&lt;/p&gt;

&lt;p&gt;But, hey, why let that stand in the way of progress if you have hardware
available to you?  I mean, look at it this way, properly verifying this
brand new controller would require I have a simulation model of an SD Card.
Proper simulation models take at least as much time to develop as the IP itself,
whereas in this case I have an actual SD card to test against.  Why not test
against reality?&lt;/p&gt;

&lt;p&gt;So, I did.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;caption&gt;Fig 5. Initial hardware results&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/sdiopkt/initialfail.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;As you might expect, nothing worked.  I wasn’t really all that surprised.
Indeed, I sort of expected that.  (I’m not really a miracle worker …)  What
I wasn’t expecting was the controller to fail as early as when the board first
came up.  That was kind of embarrassing.&lt;/p&gt;

&lt;p&gt;The problem was even worse than that.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left; padding: 25px&quot;&gt;&lt;caption&gt;Fig 6. What failed?  I couldn't tell&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/sdiopkt/whichfail.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;My whole goal at this portion of the project was to tell if the PCB design
had bugs in it or not.  Since I had no confidence in my own controller, I
couldn’t tell.  Was it my controller failing or the PCB?&lt;/p&gt;

&lt;p&gt;Even more embarrassing was that I couldn’t read known constant values back from
my own controller’s &lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone
registers&lt;/a&gt;.  You’d think
after all these years I could draft a &lt;a href=&quot;/zipcpu/2017/05/29/simple-wishbone.html&quot;&gt;Wishbone register
handler&lt;/a&gt; and get it
right on the first try.  (I certainly thought so.)  Nope.  I didn’t even get
that part right.  Reading a register which was supposed to have a 4’hc in
bits [31:28] returned 4’h0 for those bits.  Writing to that register didn’t
have any effect.  Frankly, I had no idea what the IP was doing when I wrote
to or read from it.  Yikes!&lt;/p&gt;

&lt;p&gt;So that left me in the middle of the longer, slower process of verifying the
various components of this IP.  I’ve made some progress at this task, and
so I now have formal proofs for the divided clock generator, the
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;
controller and the command pin processor.  The proofs aren’t
&lt;a href=&quot;/formal/2020/06/12/four-keys.html&quot;&gt;complete&lt;/a&gt;, but
they were enough to get me moving a step closer to success.  They were also
enough to find some rather embarrassing bugs–like writing when I was supposed
to be reading, or not setting the output register properly at all.  That said,
I can (now) read and write registers like I expect to, even when testing in
hardware(–now).  It’s for this reason that I colored these components in
green in Fig. 3 above.&lt;/p&gt;

&lt;p&gt;Unfortunately, when I got to verifying the packet transmit component, I ran
into trouble.&lt;/p&gt;

&lt;p&gt;Specifically, I wanted to verify that any packet sent to the packet transmitter
would be faithfully forwarded to the front end, with a proper start bit,
CRC(s), and stop bit.  It needed to do this when transmitting 1bit, 2bits,
4bits, 8bits, 16bits, or 32bits per clock cycle.  However, once I got to the
point where the controller “passed” its &lt;a href=&quot;/blog/2018/03/10/induction-exercise.html&quot;&gt;formal induction
proof&lt;/a&gt;, I switched
to &lt;a href=&quot;/formal/2018/07/14/dev-cycle.html&quot;&gt;checking cover traces&lt;/a&gt;
and the formal solver couldn’t demonstrate the ability to complete a packet at
all.  Something in my logic was broken, and cover checks are horrible for not
telling you what the problem is.&lt;/p&gt;

&lt;p&gt;I needed to switch to simulation.&lt;/p&gt;

&lt;p&gt;This, now, is the background behind what I’m going to discuss next.&lt;/p&gt;

&lt;h2 id=&quot;simulation-tasks&quot;&gt;Simulation Tasks&lt;/h2&gt;

&lt;p&gt;A full and proper Verilog simulation test bench, such as the one shown in
Fig. 7, contains several specific components.  These include:&lt;/p&gt;
&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;caption&gt;Fig 7. Simulation components&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/sdiopkt/fulltb.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;A test bench environment.  This is the “top level” of the simulated design.
All other components will be submodules of this one.  The top level
test bench will typically define any and all clocks and resets used by
the design.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The device under test (DUT). In this case, it would be my new controller
with all of its pieces.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;A bus functional model (BFM) to send commands to the DUT.  This is one of
those build once use everywhere kinds of components.  In this case, it
would need to be a Wishbone BFM, since this controller has a &lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone
interface&lt;/a&gt;.  Indeed,
I have &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/bench/cpp/wb_tb.h&quot;&gt;such a model that I
use&lt;/a&gt;
often when controlling a design from
&lt;a href=&quot;/blog/2017/06/21/looking-at-verilator.html&quot;&gt;Verilator&lt;/a&gt;.
I just don’t (yet) have a Verilog one–and the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; doesn’t really count,
although &lt;a href=&quot;/zipcpu/2021/07/23/cpusim.html&quot;&gt;I have used it for that purpose in the
past&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;A device hardware model.  This would model the component you intend to
communicate with.  In this case, it would model either an eMMC or an SD
card.  It’s also going to be the most complex piece of all the components
I’m discussing here, since it will need to support all of the IO modes
used by the DUT, as well as responding autonomously to the DUT–just like
and SD card or eMMC chip would.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The final component of a test bench is a test script, designed to drive
the BFM so that it tests the DUT from the perspective of the user.&lt;/p&gt;

    &lt;p&gt;&lt;a href=&quot;/zipcpu/2022/07/04/zipsim.html&quot;&gt;I’ve learned&lt;/a&gt; that
this portion of the test really needs to be in its own file.  That way,
you can swap which test script you run from one test to the next and
verify and test different portions of the design at a time.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I colored the missing pieces as black in Fig. 7 above.  Unlike the draft and
untrusted components I often draw in red, these components haven’t even been
drafted.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left; padding: 25px&quot;&gt;&lt;caption&gt;Fig 8. Lessons learned when flagging failures&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/sdiopkt/lsnlearned.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;A good test bench will also include lots of checks within it, either within the
BFM or the device model, in order to make sure everything is done according
to protocol at all stages in the process.  Any failure in this whole
setup should be flagged as a failure (if it isn’t an expected and intended
failure), and should then be
&lt;a href=&quot;/zipcpu/2022/07/04/zipsim.html&quot;&gt;recorded&lt;/a&gt;
as a failed test case for later fault analysis.&lt;/p&gt;

&lt;p&gt;Putting all these components together takes a lot of work.  I just wanted to
know if my transmit controller worked, and what was causing its &lt;a href=&quot;/formal/2018/07/14/dev-cycle.html&quot;&gt;formal cover
check&lt;/a&gt; to fail.&lt;/p&gt;

&lt;p&gt;So I just built a quick and dirty test bench, as shown in Fig. 9.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;caption&gt;Fig 9. Quick and dirty simulation components&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/sdiopkt/verilogtb.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;I then wanted to share a piece of this test bench here, since it uses
Verilog tasks and I wanted to point out some key features of using tasks
together with AXI streams.&lt;/p&gt;

&lt;p&gt;The test bench component I wish to share below is a Verilog task designed to
send a packet, via an AXI stream port, to this transmit controller.  Further,
it controls one additional transmit control wire–the transmit enable signal,
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tx_en&lt;/code&gt;.  This is the signal from the
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;
controller to the transmit controller enabling its operation.&lt;/p&gt;

&lt;p&gt;Let’s walk through this task.&lt;/p&gt;

&lt;p&gt;It starts with a task declaration.  The task accepts one parameter: the
length of the packet to be transmitted in 32-bit words.  (It helps that
nothing in either the SDIO or the eMMC protocol will need to transmit
or receive something that’s not a multiple of 32-bits.)&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;&lt;span class=&quot;k&quot;&gt;task&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;send_packet&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;input&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;length&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
	&lt;span class=&quot;kt&quot;&gt;integer&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;As with all Verilog task inputs, or outputs for that matter, they are passed
by value.  Hence even if this length parameter were to change while the task
was operating, the value would remain constant within this environment.&lt;/p&gt;

&lt;p&gt;There are two other details of tasks to be noted here.  The first is that
this is a “static” task.  “Automatic” tasks are defined by placing the
word “automatic” between the task and its name.  “Automatic” tasks get
a new set of variables assigned to them every time they are called.  Since
this task is “static”, however, the “counter” value declared above will be
shared across all invocations of it.  This will also make it easier to debug,
since most Verilog simulators have an option to include these values in their
generated traces.&lt;/p&gt;

&lt;p&gt;The second thing to note is that this is a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;task&lt;/code&gt; and not a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;function&lt;/code&gt;.
Tasks are great when an operation needs to consume time.  As such, they
aren’t always synthesizable, whereas functions can be used to encapsulate
complex but synthesizable operations.&lt;/p&gt;

&lt;p&gt;Finally, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;integer counter&lt;/code&gt; declaration just gives us a register we can
work with.  It’s identical to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;reg signed [31:0] counter&lt;/code&gt;, but just a bit
simpler to declare.&lt;/p&gt;

&lt;p&gt;We’ll start our task off by initializing all the values we are going to
control.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;tx_en&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;S_VALID&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;S_DATA&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;S_LAST&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;No, these values are not declared within the task.  This is the task
controlling values defined within the task’s external context–something
I’ve only seen done in test bench contexts like this one.&lt;/p&gt;

&lt;p&gt;Do take careful note that all of these signals going into my DUT are clock
synchronous.  Clock synchronous inputs should be set on the clock edge,
and they should be set using non-blocking assignments.&lt;/p&gt;

&lt;p&gt;I come across a lot of designs that don’t do this.&lt;/p&gt;

&lt;p&gt;One common approach I find is where an engineer will attempt to set things
on the clock edge using blocking assignments, but without referencing the
clock itself.  These engineers might reason that, since the clock has
positive edges every 10ns, they should just set test bench inputs every 10ns.
Sadly, this leads to a race condition within the simulator, since the values
will either then be set before or after the clock edge and not at the clock
edge.  Worse, you can’t tell from the trace on which side of the clock
edge such values were set on.  While I’ve seen this mistake often, the lesson
learned here tends to come quickly.&lt;/p&gt;

&lt;p&gt;I’ve also seen things where engineers will offset their logic from the clock
edge.  Perhaps they might transition their signals on the negative edge of
the clock.  This, however, creates a coherence problem within the design,
making debugging a challenge.  (Yes, this is still the approach I use with
&lt;a href=&quot;/blog/2017/06/21/looking-at-verilator.html&quot;&gt;Verilator&lt;/a&gt;–it
is flawed, and I am likely to change how I handle such things in the future.)&lt;/p&gt;

&lt;p&gt;In one recent project I worked on, however, I did find that I needed to model
the hold time following the clock edge.  This was due to the partial modeling
of physical realities, to the point where the clock was arriving at multiple
times during the clock cycle.  My solution to this problem was to add a hold
time model, such as:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;tx_en&lt;/span&gt;   &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;HOLD_TIME&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;S_VALID&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;HOLD_TIME&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;S_DATA&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;HOLD_TIME&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;S_LAST&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;HOLD_TIME&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Had the right hand side of any of these expressions required calculation
with the current (before clock) data, this would guarantee that the correct
calculation was made, and the correct value set &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;#HOLD_TIME&lt;/code&gt; time units
following the clock edge.&lt;/p&gt;

&lt;p&gt;As a last note, let me point out that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tx_en&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S_VALID&lt;/code&gt; really need to
already be clear (zero) coming into this routine.  Otherwise, if they were
not, it is possible we might’ve just broken the AXI protocol by lowering
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S_VALID&lt;/code&gt; without checking &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S_READY&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now that we have this setup behind us, we can start up the DUT.  The first
step is to enable the transmitter.  Then one clock later, we’ll start the
packet stream.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;tx_en&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;S_VALID&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Why wait a clock cycle?  Simply because the actual packet generator will
take a clock cycle (or two) to read from its local block RAM before setting
the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S_VALID&lt;/code&gt; signal.  That’s just the way this particular controller works.&lt;/p&gt;

&lt;p&gt;The next step is to send the packet data.&lt;/p&gt;

&lt;p&gt;When using AXI streams, there are only two times when you are allowed to
set or adjust any stream values.  The first is on a reset, when the VALID
signal must be brought low.  Ever after that, stream information may only
be adjusted when either valid is low (which it won’t be here) or when ready
is high.  Hence, we check for these conditions before adjusting anything.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;do&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
		&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;S_VALID&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;S_READY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;S_VALID&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;S_LAST&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;S_DATA&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;S_LAST&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;length&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;

		&lt;span class=&quot;k&quot;&gt;wait&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;while&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;length&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Checking for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S_READY&lt;/code&gt; is very important for this controller.  Depending
on the IO settings, the transmitter might be able to send 32 bits per clock
cycle (i.e. 200MHz eMMC clock, 8 data lines, and DDR mode driven from a 100MHz
bus clock), or it might take 32,000 cycles to send 32-bits (i.e. 100kHz clock,
1 data line, and SDR mode), or any number of possibilities in between.&lt;/p&gt;

&lt;p&gt;The trick, however, is that we need to keep running this loop until all
data words have been sent.  If only sometimes through the loop the data
is accepted, then we need to make sure we keep looping–hence the while
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;counter &amp;lt; length&lt;/code&gt; condition.&lt;/p&gt;

&lt;p&gt;But what about the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wait(!clk)&lt;/code&gt; condition?&lt;/p&gt;

&lt;p&gt;That’s there because the non-blocking statements above don’t take any
simulation time.  You need to get past them to a point where the simulation
is forced to step forward for the statements above to take effect.  If we
didn’t have this wait condition, we might find ourselves looping back up
to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@(posedge clk)&lt;/code&gt; statement before our assignments had taken effect.
Waiting for the clock to become negative breaks this cycle, and doesn’t
really effect anything otherwise.&lt;/p&gt;

&lt;p&gt;Once all the data has been sent, there’s one more important step: clearing
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S_VALID&lt;/code&gt; at the end of the packet.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;do&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
		&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;S_READY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;S_VALID&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

		&lt;span class=&quot;k&quot;&gt;wait&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;while&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;S_VALID&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The rest of this task is just simulation maintenance.  For example, you
don’t want to return from sending a packet while the transmitter is still
busy, so we’ll wait for it to finish it’s task.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;wait&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tx_valid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Once the transmitter completes its task, we need to drop the transmit enable
line.  We’ll wait a clock cycle first.  Then, we have to drop it on a clock
edge, since this is a source synchronous signal.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
	&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;tx_en&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;1'b0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Lest we allow the controller to start immediately with the next packet,
we’ll force a minimum packet separation of another two clock cycles.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
	&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;endtask&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;This was enough to stimulate the DUT into sending a packet.  It was also enough
to find out why the cover check was failing–so I can now return to my
formal proof.&lt;/p&gt;

&lt;p&gt;It wasn’t enough to check that the packet CRC was valid.&lt;/p&gt;

&lt;p&gt;CRC checking can be a challenge with formal methods, so I wanted to check
it via simulation anyway.  Worse, there are many types of CRCs I need to check.
Specifically, both SDIO and eMMC protocol place CRCs at the end of every
packet, and those CRCs are one per data line.  That’s right.  The CRC on
data line zero covers only the data on data line zero.  The CRC on data line
7 covers the data transmitted on data line 7.  Moreover, when working in one
of the DDR modes, the CRC checks are split between the two clock edges.
There’s one CRC per data line per clock edge, so up to a maximum of 16 CRCs
that can be transmitted and then must be checked.&lt;/p&gt;

&lt;p&gt;Did I find bugs?  Definitely.  How many?  Too many for my ego.  I was convinced
I had a wonderful design, with only a few minor bugs before going through this
effort.  Those “few minor” bugs, however, were causing the transmitter to
lock up mid operation–something my formal proof wasn’t catching.  This
simulation, however, caught the bug nicely.&lt;/p&gt;

&lt;p&gt;Unfortunately, because this was an ad-hoc simulation setup, it fails several
of the criteria of a “good” simulation.  For example, most of the verification
that took place was done by looking over the resulting trace.  It wasn’t until
I got to CRC checking that I started adding assertions to my simulation
to automatically catch and flag bugs.  As a result, while this helped me get
further along in the project, it’s not really good enough to keep around long
term.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;So, did I find the initial failure?  Absolutely!  The design came up broken
because … the PCB design was broken.  The PCB design used an approach to
crossing from 1.8V to 3.3V that didn’t allow open drain signals to cross
properly.  As a result, once I had confidence that my IP was working, I
could tell that when the board started up it didn’t pull the data lines high
through the voltage translator.  Instead, the voltage translator sampled the
line low and maintained it at a low level in spite of the pull up resistors
that were intended to pull it high.  As a result, when the SD card processed
the first command I sent, it would notice that data line 3 was also low, and
it would go into SPI mode even if I wanted it to stay in SDIO mode.&lt;/p&gt;

&lt;p&gt;Yes, that’s right, it’s not just the SDIO controller that is a work in
progress, and I’m not the only one who makes mistakes when bringing things
up for the first time.&lt;/p&gt;

&lt;p&gt;If this is a piece of IP you are interested in, then watch &lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;this
space&lt;/a&gt;.  I intend to post it and maintain
it there once I’m done getting it to work.  Once done, it will have an
interface similar to the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SDSPI&lt;/a&gt; controller that’s currently there,
and I intend to maintain the two alongside each other.  The two big
improvements I’d still like to make are 1) adding a DMA, to push data around
to and from memory at the full speed of the bus, and 2) I’d also like to
add support for reading or writing a stream of blocks in a row–something the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SDSPI&lt;/a&gt; controller has no ability to do.&lt;/p&gt;

&lt;p&gt;If you are too impatient to wait, you can take a peek at the controller
under development
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/sdspi/sdio_top.v&quot;&gt;here&lt;/a&gt;,
but I’d caution you that if I were to offer any guarantees or warranties at
this point, it would be that there are still bugs in the design.  Frankly,
it’s a work in progress.&lt;/p&gt;

&lt;p&gt;I am tempted, though, to come back to this controller later and write another
article about how to process the data strobe returned by the eMMC controller.
Such data strobes are becoming commonplace in high speed memory controllers,
to such an extent that I’ve now had to build controllers that could handle
data strobes from &lt;a href=&quot;https://www.arasan.com/product/xspi-psram-master/&quot;&gt;NOR flash (xSPI protocol),
HyperRAM&lt;/a&gt;,
&lt;a href=&quot;https://www.arasan.com/product/onfi-4-2-controller-phy/&quot;&gt;NAND flash (ONFI)&lt;/a&gt;,
and now this eMMC interface.  Another project I’m involved in, where I’m only
mentoring an amazing engineer, is that of a &lt;a href=&quot;https://github.com/AngeloJacobo/DDR3_Controller&quot;&gt;Verilog DDR3
controller&lt;/a&gt; and the
DDR3 protocol includes return strobes as well.  All told, I may have now seen
four or five different ways of processing these return data strobe signals–so
it might be ripe for a future blog post and/or a discussion on the topic.&lt;/p&gt;
&lt;hr /&gt;&lt;p&gt;&lt;em&gt;Give instruction to a wise man, and he will be yet wiser: teach a just man, and he will increase in learning.  (Prov 9:9)&lt;/em&gt;</description>
        <pubDate>Wed, 28 Jun 2023 00:00:00 -0400</pubDate>
        <link>https://zipcpu.com/blog/2023/06/28/sdiopkt.html</link>
        <guid isPermaLink="true">https://zipcpu.com/blog/2023/06/28/sdiopkt.html</guid>
        
        
        <category>blog</category>
        
      </item>
    
      <item>
        <title>Introducing the ZipCPU v3.0</title>
        <description>&lt;p&gt;It’s time to announce a new version of the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;:
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; v3.0!&lt;/p&gt;

&lt;p&gt;For reference, here’s how the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;’s
development has taken place over the years:&lt;/p&gt;

&lt;h2 id=&quot;zipcpu-v01&quot;&gt;ZipCPU v0.1&lt;/h2&gt;

&lt;p&gt;Way back in the beginning, the 
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; had four bit opcodes and only
16x16-bit multiplies.  It truly had a very limited instruction set.  That said,
the instruction set design was too limited to be very functional.&lt;/p&gt;

&lt;p&gt;This original instruction set didn’t even last a half a year.&lt;/p&gt;

&lt;h2 id=&quot;zipcpu-v10&quot;&gt;ZipCPU v1.0&lt;/h2&gt;

&lt;p&gt;The &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;,
v1.0, had 32-bit bytes and no octet level access.  If you wanted to read or
write an octet (8bit value) in memory, you needed to read a 32b word, modify
the 8b value within it, and write the 32b word back.&lt;/p&gt;

&lt;p&gt;As a result, the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;, v1.0, did not
have C library support.&lt;/p&gt;

&lt;h2 id=&quot;zipcpu-v20&quot;&gt;ZipCPU v2.0&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; v2.0 provided 8-bit byte support,
better compiler support, and full C-library support.  The instruction set
also included changes to the supported condition codes as well.&lt;/p&gt;

&lt;h2 id=&quot;now-announcing-zipcpu-v30&quot;&gt;Now: Announcing ZipCPU v3.0&lt;/h2&gt;

&lt;p&gt;The good news is that the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;’s
instruction set, as shown in Fig. 1, has not changed as part of this release.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;caption&gt;Fig 1. The ZipCPU instruction set and encoding&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/zipcpuv3/nextgen.png&quot; alt=&quot;&quot; width=&quot;480&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Although there have been (essentially) no changes to the instruction set
with this release, it feels like everything else associated with the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; has changed:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;This upgrade started with a core refactor, so that the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;
could support more than just the
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone bus&lt;/a&gt;.  As a
result, the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; can now support
&lt;a href=&quot;/zipcpu/2018/03/21/dblfetch.html&quot;&gt;Wishbone&lt;/a&gt;,
&lt;a href=&quot;/zipcpu/2021/04/17/axilops.html&quot;&gt;AXI-Lite&lt;/a&gt;, and &lt;a href=&quot;/zipcpu/2021/09/30/axiops.html&quot;&gt;(full)
AXI&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The new memory interfaces are now bus width independent, allowing the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;
to work on buses larger than 32-bits.  Indeed, it’s since been used on
64-bit and 512-bit buses quite successfully.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The core refactor led to better formal proofs, since the memory components
can now be verified independently.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The DMA has also been rewritten for bus width independence.  This rewrite
provides even more capabilities along the way.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The debug port has been rewritten.  This change is the one really
necessitating a new major release, as it won’t even appear to be backwards
compatible with prior releases.  Instead of two registers, the
rewritten &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;
debug port is now accessed via 33 registers: a control
register and one debug register address per each of the 32 internal registers.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The companion core configuration file has been replaced by parameter settings
at the CPU wrapper level.  Parameter names have been formalized across
wrappers, so common names configure common capabilities.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; now has &lt;a href=&quot;/zipcpu/2022/07/04/zipsim.html&quot;&gt;its own
simulation infrastructure for CPU level
testing&lt;/a&gt;.
This new infrastructure makes it possible to 1) test multiple configurations
of the CPU, 2) test the CPU in multi-processor environments, 3) verify that
the clock can be stopped and restarted, 4) verify the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; in both
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;
and AXI configurations, 5) verify the lock instructions, and 6) verify the
CPU’s new debugging port.  This new simulation infrastructure also includes
the ability to measure test coverage.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;In perhaps the only downgrade of capabilities, the NOOP/SIM instructions
NEXIT and SEXIT have lost their ability to exit a simulation with a given
exit code.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Finally, the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;
has the ability to stop its clock if necessary.  Using this outside of the
simulator will likely require hardware level support, so for the time being
this may be a simulation only capability.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Put together, these are enough changes to warrant a new major release.&lt;/p&gt;

&lt;p&gt;Let’s take a moment to discuss these changes.&lt;/p&gt;

&lt;h2 id=&quot;expanding-the-bus&quot;&gt;Expanding the Bus&lt;/h2&gt;

&lt;p&gt;Perhaps the one reason driving this upgrade more than any other was the bus,
both in width and in type.  I needed to test a variety of AXI peripherals I
was building and wanted (needed, really) a CPU that could speak both AXI and
AXI-Lite.  Some of these peripherals required bus sizes wider than 32-bits.
Worse, the prior version of the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;’s
core included the memory ports within the CPU core itself forcing the CPU
to be &lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt; only.&lt;/p&gt;

&lt;p&gt;This proved to be a verification nightmare.  It meant that, in order to verify
the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;’s core functionality, I
needed to verify the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;
against every possible memory interface it might have.&lt;/p&gt;

&lt;p&gt;To make matters worse, which memory model the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;
used was determined not by parameter, but by macro.  This made the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;
harder to configure or adjust in any design.&lt;/p&gt;

&lt;p&gt;The fix was to refactor the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;.  In
the process, the interfaces to the prefetch, and the interface to the memory
unit, were both standardized as shown in Fig. 2 below.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 2. The ZipCPU's refactored architecture&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/tweets/zipcpu/cpu-verification.svg&quot; alt=&quot;&quot; width=&quot;780&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;The &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; core
was then verified against a pair of formal interface specifications,
as were the instruction fetch and memory units.  This made it possible to
formally verify those units separate from
the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; core.&lt;/p&gt;

&lt;p&gt;The refactor wasn’t quite seamless.  &lt;a href=&quot;/zipcpu/2021/09/30/axiops.html&quot;&gt;AXI exclusive
access&lt;/a&gt; required a different
interface to the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; than
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;
exclusive access (i.e. bus locking) required.  In
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;, all you need
to do is hold the cycle line high between any two accesses to do any sort of
“read-modify-write” routine.  In AXI, on the other hand, any
“read-modify-write” routine won’t know until the write return whether the
sequence was successful or not.  Then, if the sequence was not successful, the
“read-modify-write” routine needs to be repeated.  To add this new capability,
the core now provides the AXI module the instruction pointer at the beginning
of any “read-modify-write” sequence.  If the “read-modify-write” sequence then
fails, the memory module returns as if it were returning from a “load into the
program counter” access causing a jump to the beginning of the sequence.&lt;/p&gt;

&lt;p&gt;The good news is that, once I had the AXI interface I needed, &lt;a href=&quot;/zipcpu/2021/07/23/cpusim.html&quot;&gt;I could then
test and demonstrate ASIC IP using this
approach&lt;/a&gt;.  As a result, I’ve
used this upgraded &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; to test both
a &lt;a href=&quot;https://www.arasan.com/product/xspi-master-ip/&quot;&gt;NOR flash controller&lt;/a&gt;
and a &lt;a href=&quot;https://www.arasan.com/product/xspi-psram-master/&quot;&gt;hyperRAM controller&lt;/a&gt;
using my &lt;a href=&quot;http://store.digilentinc.com/arty-artix-7-fpga-development-board-for-makers-and-hobbyists&quot;&gt;Arty board&lt;/a&gt;,
as well as an &lt;a href=&quot;https://www.arasan.com/products/nand-flash/&quot;&gt;ONFI flash
controller&lt;/a&gt; (via simulation only).
This has provided me with the valuable ability of &lt;a href=&quot;/zipcpu/2021/07/23/cpusim.html&quot;&gt;debugging system software
entirely in simulation&lt;/a&gt;–and
thus being able to answer why the device did (or did not) respond as expected.&lt;/p&gt;

&lt;h2 id=&quot;upgrading-the-dma&quot;&gt;Upgrading the DMA&lt;/h2&gt;

&lt;p&gt;Some time ago, someone contacted me to ask if I’d be willing to work with them
to build an “ideal-DMA”.  They had noticed that it seemed like every IP
component they integrated into their SOC required a DMA, and so it felt like
they had DMA’s running all through their SOC.  Wouldn’t it make more sense,
they asked, if we could just build one “better”/”ideal” DMA and not to keep
building all these special purpose DMAs throughout their SOC?&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left&quot;&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/zipcpuv3/one-dma.svg&quot; alt=&quot;&quot; width=&quot;240&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;No, the deal didn’t go through.  I didn’t have the hours to spare at the time,
and we had some disagreements over the legal terms of working together.
However, this did leave me asking the question, what would constitute an ideal
DMA?&lt;/p&gt;

&lt;p&gt;Then I needed to use &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/8a05953d059158d4fa85eacd5ba86dfc84a4a1bf/rtl/peripherals/wbdmac.v&quot;&gt;my
DMA&lt;/a&gt;
with one of these wider bus sizes.  Specifically, I was working on a &lt;a href=&quot;https://github.com/ZipCPU/eth10g/&quot;&gt;project
requiring a 512b bus
width&lt;/a&gt;, and &lt;a href=&quot;/zipcpu/2018/02/12/zbasic-intro.html&quot;&gt;I use the DMA as part of
the process of loading CPU memory images from flash to RAM in the first
place&lt;/a&gt;.
At this point, my &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/8a05953d059158d4fa85eacd5ba86dfc84a4a1bf/rtl/peripherals/wbdmac.v&quot;&gt;one-32b-size-fits-all DMA&lt;/a&gt;
just couldn’t connect to the bus.
It &lt;em&gt;needed&lt;/em&gt; to be upgraded.  I no longer had a choice.&lt;/p&gt;

&lt;p&gt;So, let’s think of all the lessons I’ve learned over the last couple of years
using the last DMA.  What would a better DMA look like?  Specifically, a DMA
is designed to move data around without CPU intervention.  What kinds of data
moves are required?&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;The most obvious requirement is for a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memcpy()&lt;/code&gt; type of data move, that
moves memory from one location to somewhere else.  Such a capability needs
to move memory as fast as possible (it &lt;em&gt;is&lt;/em&gt; a DMA, right?), and so it really
needs to use the whole bus width.&lt;/p&gt;

    &lt;p&gt;It also needs octet level alignment in order to be relevant–unaligned
requests need to be expected, and handled appropriately.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;What about peripherals?  Consider audio peripherals, for example.&lt;/p&gt;

    &lt;p&gt;A microphone peripheral might capture 16b audio samples, generate an
interrupt after each sample is captured, and then need its sample to be read
and copied to memory.&lt;/p&gt;

    &lt;p&gt;A D/A peripheral might be similar: generating an interrupt whenever it
consumes a sample and needs another.  The DMA should then need to read the
new sample from memory and write it to this peripheral.&lt;/p&gt;

    &lt;p&gt;In both cases, the data source address for the microphone peripheral,
or the data destination address of the speaker, won’t change but the
address in memory will.&lt;/p&gt;

    &lt;p&gt;To make matters worse, the audio peripheral might require 8b or 16b values
which would be packed in memory.  Hence the DMA needs to be able to read
or write less than a full bus word at a time.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Years ago, I wrote a controller for an &lt;a href=&quot;https://github.com/ZipCPU/openarty/blob/master/rtl/wboledrgb.v&quot;&gt;RGB based OLED
peripheral&lt;/a&gt;.
I demonstrated the capability of this controller by alternately placing the
Gisselquist Technology logo and my own mug onto the display.  In this case,
the peripheral understood 32b command and data words, but the data had to
be transferred one word at a time.  An interrupt would then tell the CPU
when it was time to transfer the next 32b word.  To use a DMA, the DMA
would need to wait for the interrupt, transfer the next word from memory
to a constant destination address, then wait for the next interrupt again.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;How about block peripherals?  For example, I have an
&lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SDSPI peripheral&lt;/a&gt;
which allows me access to an SD card via its (optional) SPI interface.  The
peripheral has two 32b data ports for transfers.  Each port leads to a 512B
FIFO, and the controller is expected to ping-pong between the two ports
for speed.&lt;/p&gt;

    &lt;p&gt;Reading from the SD card will fill one of these FIFOs and then trigger
an interrupt.  At that point, a DMA needs to read (many times) from the
same 32b address, form wide bus words together, and then write the results
to memory.&lt;/p&gt;

    &lt;p&gt;Writing to the same FIFO is similar, only the interrupt works in a different
fashion.  The CPU would first call the DMA to transfer a block (typically
512 bytes) of memory to the data port FIFO.  This block would need to be
read at whatever the bus size is, and then packaged into 32bit writes to fill
the FIFO.  Once the transfer is done, the CPU should be interrupted, and
the CPU can then instruct the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SDSPI peripheral&lt;/a&gt;
to write the information to the external SD card.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left; padding: 25px&quot;&gt;&lt;caption&gt;Fig 3. Requirements of an upgraded DMA&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/zipcpuv3/betterdma.svg&quot; alt=&quot;&quot; width=&quot;480&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;From these requirements alone, what does a good DMA need to do?&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;(Optinally) Wait on an interrupt before starting any transfer.  Which
interrupt will need to be user selectable.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Transfer a given per-interrupt amount, perhaps less than the whole transfer.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Be able to read either 8b, 16b, 32b, or the full bus width at a time.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Be able to write either 8b, 16b, 32b, or the full bus width at a time.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Accessing peripheral memory may require that the DMA not increment the
source or destination address, whereas accesses to memory will require both
that the source/destination address increment and that accesses may (or may
not) be aligned.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/8a05953d059158d4fa85eacd5ba86dfc84a4a1bf/rtl/zipdma/zipdma.v&quot;&gt;new ZipDMA&lt;/a&gt;
now offers all these abilities.  It’s an awesome DMA capability.&lt;/p&gt;

&lt;p&gt;You can see the basic structure of this &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/8a05953d059158d4fa85eacd5ba86dfc84a4a1bf/rtl/zipdma/zipdma.v&quot;&gt;new
DMA&lt;/a&gt; in Fig. 4 below.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 4. The new ZipCPU DMA's structure&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/zipcpuv3/zipdma-blocks.svg&quot; alt=&quot;&quot; width=&quot;780&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Requests are made to the DMA, then sent to an FSM.  The FSM then breaks those
DMA requests into chunks.  Remember, unlike AXI,
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt; can only operate
in one direction at a time.  Therefore, all operations need to take place in
chunks where data is first read, then written.  In terms of the chunk processing
itself, there’s a memory to stream processor to read data from the bus.  This
will read 8b, 16b, 32b, or the full bus width of data per clock cycle.  Data
are then packed by a gearbox prior to going into a FIFO.  Coming out of the
FIFO, the same data words are now unpacked into the user’s desired transfer
width: 8b, 16b, 32b, or the full width of the bus.  As the final per-chunk step,
this data is placed onto the bus and written.&lt;/p&gt;

&lt;p&gt;The biggest problem with this new capability?  There’s only one ZipDMA.  If
it’s so good that every process needs it, there will be contention for it.
The second biggest problem?  This DMA capability is (currently) a
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;
&lt;em&gt;only&lt;/em&gt; capability.  I don’t (yet) have an AXI version of it.  Further work on
this DMA will
concentrate on making sure all of the various capabilities within it are
properly verified–as I don’t yet have a good set of DMA focused test cases
for that purpose.&lt;/p&gt;

&lt;h2 id=&quot;a-better-debugging-interface&quot;&gt;A Better Debugging Interface&lt;/h2&gt;

&lt;p&gt;The original &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; had only two
registers for its in-hardware debugging interface.  One register could be used
to &lt;a href=&quot;/zipcpu/2017/08/25/hw-debugging.html&quot;&gt;reset, start, stop, and step the
CPU&lt;/a&gt;.  This same
register could be used to select which internal
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; register the
second register would access.  Reads and writes to this second register would
then either read or update the actual (selected) register within the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;.  While this worked, it didn’t
work well.&lt;/p&gt;

&lt;p&gt;To illustrate the problem, consider Fig. 5 below.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;caption&gt;Fig 5. Before the update: two round-trips per request&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/zipcpuv3/slowdbg.svg&quot; alt=&quot;&quot; width=&quot;420&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Reading any register from this interface required &lt;em&gt;two&lt;/em&gt; accesses, and therefore
&lt;em&gt;two&lt;/em&gt; round trips through the &lt;a href=&quot;/blog/2017/06/05/wb-bridge-overview.html&quot;&gt;debugging
bus&lt;/a&gt; to the FPGA.&lt;/p&gt;

&lt;p&gt;This interface struggled when I tried to debug
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; programs over my serial port
“&lt;a href=&quot;/blog/2017/06/05/wb-bridge-overview.html&quot;&gt;debugging bus&lt;/a&gt;”.
The debugger wanted the ability to read all of the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;’s
registers.  The two register interface then required that I first write the
to the first register, that write then needed to complete, then I would read
the data register, and that read would need to complete, all before I could
move on to reading the second CPU register.  That required two round trip
transactions just to read one register, or sixty four round trip transactions
to read all of the CPU’s registers.  (There are 16 supervisor register and
16 user registers.)&lt;/p&gt;

&lt;p&gt;This could take a long time over a serial port.&lt;/p&gt;

&lt;p&gt;My &lt;a href=&quot;/blog/2017/06/05/wb-bridge-overview.html&quot;&gt;debugging bus&lt;/a&gt;
has another type of read command: one where you can read multiple sequential
addresses in a row.  This operation only requires sending
the request (read thirty two 32b registers) and then waiting for the results.
There’s no requirement for a round-trip handshake in the meantime.  Instead,
any handshaking is complete once the entire operation is complete.  The
CPU just needed a minor upgrade to provide enough addresses on the bus to do
this.&lt;/p&gt;

&lt;p&gt;You can see how this operation is different in Fig. 6 below.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left; padding: 25px&quot;&gt;&lt;caption&gt;Fig 6. After the update, one request yields all results&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/zipcpuv3/fastdbg.svg&quot; alt=&quot;&quot; width=&quot;420&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Even better, I formalized how the CPU was to respond to accesses.  Debug
register reads shouldn’t need to stop the CPU–that way you can monitor
registers while the CPU is running, at the risk of reading an incoherent set
of registers.  This might be useful to know that the CPU is running, or where
it might be in its processing.  Bus writes, on the other hand, do need to stop
the CPU.&lt;/p&gt;

&lt;h2 id=&quot;upgrading-the-simulation-environment&quot;&gt;Upgrading the Simulation Environment&lt;/h2&gt;

&lt;p&gt;The original &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;’s
purpose was to be absolutely as light on resources as
possible.  However, depending on the project, the “lightest resource” CPU might
have too little power.  I therefore quickly learned that the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;
would need the ability to expand or adjust its area to fit the available area,
while optimizing the CPU’s speed in that area.  Example configurations might
include whether or not the CPU used caches, or just very simple data accessing
routines, whether the hardware supported multiplies via hardware-specific
DSP elements, whether &lt;a href=&quot;/zipcpu/2021/07/03/slowmpy.html&quot;&gt;an all–RTL
multiply&lt;/a&gt; was required, or
whether the CPU should be built with no capability for multiplies at all.  All
this led to an early on requirement that the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; needed to be highly configurable.&lt;/p&gt;

&lt;p&gt;My original approach to all this configurability was to create a separate
configuration file containing a set of macros in it.  The CPU’s configuration
would then depend on if or how those macros were configured.  I also had a
separate project, one I called &lt;a href=&quot;https://github.com/ZipCPU/zbasic&quot;&gt;ZBasic&lt;/a&gt;,
which would be used to test the CPU.  Within the
&lt;a href=&quot;https://github.com/ZipCPU/zbasic&quot;&gt;ZBasic project&lt;/a&gt;
was a piece of CPU testing software that could then
be used to test whether each of the CPU’s instructions worked.  All of this put
together worked great for testing a single configuration–the one described by
the macro file.  However I kept running into problems where I’d port the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;
to some piece of hardware or other and it wouldn’t work.  Perhaps I had made
some change some time earlier, and only tested other configurations to prove
that change.  Whatever the cause, I was often left debugging the CPU in
hardware–the one place you don’t want to debug the CPU.&lt;/p&gt;

&lt;p&gt;As it turns out, &lt;a href=&quot;/zipcpu/2022/07/04/zipsim.html&quot;&gt;it takes some engineering thought to build a test
setup that can check all configurations of a highly configurable
CPU&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Version 3.0 of the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;
comes with such an infrastructure.  &lt;a href=&quot;/zipcpu/2022/07/04/zipsim.html&quot;&gt;I’ve written about
it before&lt;/a&gt;.
It centers around a &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/8a05953d059158d4fa85eacd5ba86dfc84a4a1bf/sim/sim_run.pl&quot;&gt;Perl script&lt;/a&gt;
and a &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/8a05953d059158d4fa85eacd5ba86dfc84a4a1bf/sim/rtl/sim_testcases.txt&quot;&gt;file describing a series of tests&lt;/a&gt;.
Each test specifies a canned configuration, a piece of CPU software,
and one of two environments: a &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/8a05953d059158d4fa85eacd5ba86dfc84a4a1bf/sim/rtl/wb_tb.v&quot;&gt;Wishbone
environment&lt;/a&gt;
and an &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/8a05953d059158d4fa85eacd5ba86dfc84a4a1bf/sim/rtl/axi_tb.v&quot;&gt;AXI environment&lt;/a&gt;.
Each test can also include any parameter overrides, so the default environments
can be overridden for the test.  I override these defaults, for example, to
adjust the bus width for non-32bit bus testing.  All told, there are 105 tests
that take just over an hour to run under
&lt;a href=&quot;https://www.veripool.org/verilator/&quot;&gt;Verilator&lt;/a&gt;, or just over five days when
using &lt;a href=&quot;https://steveicarus.github.io/iverilog/&quot;&gt;Icarus Verilog&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Simulation testing, however, is perhaps the one place where the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;’s
instruction set is now less capable than before.&lt;/p&gt;

&lt;p&gt;As background, the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; had two
special instructions, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NOOP&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SIM&lt;/code&gt; instructions, that took specialized
arguments when run in simulation.  In hardware, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NOOP&lt;/code&gt; instructions turned
into standard no-operation instructions, whereas the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SIM&lt;/code&gt; instructions turned
into illegal instructions.  That functionality alone left 22-bits of instruction
space which I could use for additional functionality,
from which I had carved out sub-instructions to write characters or even
register values to the simulation log.  One special instruction would dump the
entire register set.  Another special instruction, encoded as either &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NEXIT&lt;/code&gt;
(NOOP based) or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SEXIT&lt;/code&gt; (SIM based) was supposed to cause the simulation to
end with a given exit code.  It’s this x&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXIT&lt;/code&gt; code’s functionality that’s
been lost.&lt;/p&gt;

&lt;p&gt;The reason had to do with the implementation of these instructions.  They were
originally implemented by the
&lt;a href=&quot;https://www.veripool.org/verilator/&quot;&gt;Verilator&lt;/a&gt;
C++ wrapper, and that wrapper required the ability to take a sneak-peek into the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;’s internals to know when to
execute these instructions.  This lead to two problems.  First, the instructions
would never work in a more traditional simulator that didn’t have or need such
a wrapper–such as &lt;a href=&quot;https://steveicarus.github.io/iverilog/&quot;&gt;Icarus
Verilog&lt;/a&gt; or a commercial simulator.
The second problem was that the interface used by the wrapper kept changing.
Since the features it depended upon weren’t standard Verilog
but rather depended upon &lt;a href=&quot;https://www.veripool.org/verilator/&quot;&gt;Verilator&lt;/a&gt;’s
internals, the interface broke every time
&lt;a href=&quot;https://www.veripool.org/verilator/&quot;&gt;Verilator&lt;/a&gt;
changed its internal data structure.&lt;/p&gt;

&lt;p&gt;In the end, I resolved these problems by rewriting how the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NOOP&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SIM&lt;/code&gt;
instructions were handled, and the new rewrite was done entirely in Verilog.
Using Verilog only, I could guarantee that all Verilog compliant simulators
would correctly implement these instructions.  However, I could not properly
implement the exit code requirement of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NEXIT&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SEXIT&lt;/code&gt; instructions.
Hence, while the CPU has gained the capability of executing NOOP and SIM
instructions under a general purpose Verilog simulator, it has lost the
capability to exit the simulation with a specific exit code.&lt;/p&gt;

&lt;p&gt;In many ways, this is a small price to pay for better interoperability
between simulators, and the ability to simulate/test the CPU under a large
number of configurations.&lt;/p&gt;

&lt;h2 id=&quot;clock-gating&quot;&gt;Clock Gating&lt;/h2&gt;

&lt;p&gt;One of the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;’s
goals has always been low-logic.  A truly low logic CPU &lt;em&gt;should&lt;/em&gt; also be able
to be a low-power CPU.  In ASIC designs, low power often means
&lt;a href=&quot;/blog/2021/10/26/clkgate.html&quot;&gt;clock gating&lt;/a&gt;,
and the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; has had a plan for
&lt;a href=&quot;/blog/2021/10/26/clkgate.html&quot;&gt;clock gating&lt;/a&gt;
since the beginning.&lt;/p&gt;

&lt;p&gt;Here’s how it works: the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;
supports two special modes, a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;HALT&lt;/code&gt; mode and a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SLEEP&lt;/code&gt; mode.  &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SLEEP&lt;/code&gt; will
cause the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; to stop executing
instructions until the next interrupt.  &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;HALT&lt;/code&gt; causes the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; to stop executing instructions
at all, and thus to come to a complete halt until either the
&lt;a href=&quot;/zipcpu/2017/08/25/hw-debugging.html&quot;&gt;debug port&lt;/a&gt; or an
external reset restarts the CPU.  These modes were originally envisioned to
allow the clock to be stopped by the CPU.&lt;/p&gt;

&lt;p&gt;This &lt;a href=&quot;/blog/2021/10/26/clkgate.html&quot;&gt;clock gating capability&lt;/a&gt;
is now a reality–in simulation at least.&lt;/p&gt;

&lt;p&gt;I’ve also now used
&lt;a href=&quot;/blog/2021/10/26/clkgate.html&quot;&gt;clock gating&lt;/a&gt;
several times, although never in actual
hardware.  The biggest lesson I’ve learned?  The
&lt;a href=&quot;/zipcpu/2017/08/25/hw-debugging.html&quot;&gt;debug port&lt;/a&gt;
&lt;em&gt;must&lt;/em&gt; automatically restart the clock to handle requests.  There’s been more
than once when I’ve tried to load a program into the CPU externally from the
&lt;a href=&quot;/zipcpu/2017/08/25/hw-debugging.html&quot;&gt;debugging port&lt;/a&gt;,
only to find out later that the reason the CPU was non responsive was
because its clock was stopped.  The next biggest lesson?  Stopping the clock
&lt;em&gt;might&lt;/em&gt; lower simulation time, but this isn’t a given.&lt;/p&gt;

&lt;p&gt;The end result of this work is that the CPU now has a program to test its
ability to stop the clock.&lt;/p&gt;

&lt;h2 id=&quot;profiling&quot;&gt;Profiling&lt;/h2&gt;

&lt;p&gt;There was one more minor update to the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;, this one having to do with
profiling.&lt;/p&gt;

&lt;p&gt;Yes, I’ve profiled software–mostly benchmarks–running on the 
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;, although only in simulation.
My current approach involves recording, for every instruction in a given
program, both the number of times that instruction was executed and the
number of clock cycles used to execute that instruction.  The resulting data
has done wonders for speeding up the CPU.&lt;/p&gt;

&lt;p&gt;While I’ve been using this data for quite some time, my previous method of
collecting it involved examining
&lt;a href=&quot;https://www.veripool.org/verilator/&quot;&gt;Verilator&lt;/a&gt;’s internal data structures to
access it.  While that has worked in the past, it forces me to update the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; every time
&lt;a href=&quot;https://www.veripool.org/verilator/&quot;&gt;Verilator&lt;/a&gt; changes
their internal data structures.  (This was the same problem I had with the
simulation only instructions.)  The solution is to create a proper external
port, coming out of the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;,
containing this data.  It’s then there if you want to use it, or it can be
ignored if you do not.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;’s simulation monitor
program had the same problem, where it was also accessing values internal
to the design.  Such values tend to move or get renamed with
&lt;a href=&quot;https://www.veripool.org/verilator/&quot;&gt;Verilator&lt;/a&gt;
updates.  As with the profiler interface, this is easily solved by generating
proper Verilog ports to the CPU containing references to these values for
the monitor.&lt;/p&gt;

&lt;p&gt;In both cases, I expect my updated solution to handling these values will
need less maintenance as I use (and maintain) the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; over time.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;As with any project, the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;
remains a work in progress.  It will likely remain so for the foreseeable
future.  This is a good thing.  It means the CPU remains supported.&lt;/p&gt;

&lt;p&gt;In the meantime, I’ve now used the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; on a variety of commercial
projects.  I’ve written about some of them.  For example, it’s been used in
a couple of SONAR applications, and I’m now importing it into a
&lt;a href=&quot;https://github.com/ZipCPU/eth10g&quot;&gt;10Gb Ethernet switch&lt;/a&gt;
application.  I’ve also used it to test, via both simulation and hardware,
pre-ASIC IP cores.  (I.e. IP cores designed for ASICs, but tested in FPGAs
first.)&lt;/p&gt;

&lt;p&gt;At present, the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; has two
drawbacks that I’d still like to address in the future.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;It still doesn’t have a memory management unit (MMU) to give it access to
virtual memory.  Worse, the MMU I designed years ago for the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;
is now abandonware.  It needs to be rebuilt.  Since all of the internal
interfaces have changed between the core and the memory components, the
MMU’s required interfaces have changed as well.  Worse, the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;’s core infrastructure may
also need to be adjusted so that it can handle page faults.  For example,
what happens in a compressed instruction if the second half of the
instruction suffers from a page fault?  At present, compressed instructions
do not need to be, and therefore cannot be restarted mid-instruction.&lt;/p&gt;

    &lt;p&gt;This upgrade will be required before I can truly run Linux on the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;The good news is that I don’t have any applications that require such an MMU
at present.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Although the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; supports AXI,
it doesn’t really do so by the book.  AXI is, by the book,
&lt;a href=&quot;https://en.wikipedia.org/wiki/Endianness&quot;&gt;little endian&lt;/a&gt;
whereas the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; remains a
&lt;a href=&quot;https://en.wikipedia.org/wiki/Endianness&quot;&gt;big-endian&lt;/a&gt;
machine.  Yes, it now has options to run in a
&lt;a href=&quot;https://en.wikipedia.org/wiki/Endianness&quot;&gt;little endian&lt;/a&gt;
fashion, but the
&lt;a href=&quot;https://en.wikipedia.org/wiki/Endianness&quot;&gt;little endian&lt;/a&gt;
options within the tool chain haven’t been
tested, and so I have no confidence that they will work.  What this means
is that bytes within words are mis-ordered when using AXI.  The
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;
will still write byte zero to bits [31:24], and byte one to bits [23:16]
and so forth.&lt;/p&gt;

    &lt;p&gt;This has lead me to no end of troubles when testing AXI IP that is properly
ordered.  Not only that, but I now have an optimized software routine for
byte-reordering–a patch, written instead of a proper upgrade.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I’ve imagined a third upgrade over the years as well: adding a floating point
capability to the CPU.  Moreover, I’ve reserved several instruction op-codes to
support 32-bit single precision floating point operations.  In hindsight,
however, I’m not sure to what extent I would use these instructions even if I
did implement them.  I don’t normally use &lt;em&gt;single&lt;/em&gt; precision floating point.
I default to using &lt;em&gt;double&lt;/em&gt; precision floating point.  Not only that, but the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;
will never be a hard core floating point processing machine.  It just
doesn’t fit that role.  It will always be better as a fixed point system.
Hence, floating point is no longer one of my goals for the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This v3.0 release also marks the first time the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;
has synthesized (with caches) on a Kintex-7 device with &lt;em&gt;200MHz clock&lt;/em&gt;!&lt;/p&gt;

&lt;p&gt;Yes, the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;
has been well used (by me), and remains well loved.&lt;/p&gt;
&lt;hr /&gt;&lt;p&gt;&lt;em&gt;Every man's work shall be made manifest: for the day shall declare it, because it shall be revealed by fire; and the fire shall try every man's work of what sort it is.  (1Cor 3:13)&lt;/em&gt;</description>
        <pubDate>Mon, 29 May 2023 00:00:00 -0400</pubDate>
        <link>https://zipcpu.com/zipcpu/2023/05/29/zipcpu-3p0.html</link>
        <guid isPermaLink="true">https://zipcpu.com/zipcpu/2023/05/29/zipcpu-3p0.html</guid>
        
        
        <category>zipcpu</category>
        
      </item>
    
      <item>
        <title>What is a Virtual Packet FIFO?</title>
        <description>&lt;p&gt;I first came across virtual packet FIFOs in a &lt;a href=&quot;/blog/2022/04/29/proto-bringup.html&quot;&gt;SONAR
project&lt;/a&gt; by necessity.
The &lt;a href=&quot;/blog/2022/04/29/proto-bringup.html&quot;&gt;SONAR device&lt;/a&gt;’s
only means of communicating with the outside world was
via Gb Ethernet.  There was no
&lt;a href=&quot;/formal/2019/02/21/txuart.html&quot;&gt;UART&lt;/a&gt; and no JTAG.
Everything went over Ethernet.  Collected data went over Ethernet.
Device control was over Ethernet.  &lt;a href=&quot;/blog/2022/08/24/protocol-design.html&quot;&gt;Debugging had to be done over
Ethernet&lt;/a&gt;.  FPGA
reconfiguration and all software updates had to go over Ethernet.  Last of
all, the CPU needed to talk to the outside world over Ethernet.  This was
where I first came up with the idea of a virtual packet FIFO.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 1. A Virtual Packets FIFO&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/vfifo/pktvfifo.svg&quot; alt=&quot;&quot; width=&quot;560&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;The idea came from necessity, given how &lt;a href=&quot;https://github.com/ZipCPU/zipversa/blob/master/rtl/enet/enetpackets.v&quot;&gt;my previous network
controller&lt;/a&gt;
operated.  In &lt;a href=&quot;https://github.com/ZipCPU/zipversa/blob/master/rtl/enet/enetpackets.v&quot;&gt;that
controller&lt;/a&gt;,
packets would be received into a small block RAM connected the controller.
That block RAM could hold only one packet at a time.  Once a packet was
received, therefore, the network controller would be deaf until the CPU
processed the packet and then notified the controller it could use its
memory for another packet.  Likewise, when the CPU wished to transmit a
packet, it would write a single packet to the controller’s memory, notify
it that a packet was present, and then wait for the controller to finish
transmitting it before writing the next packet to memory.&lt;/p&gt;

&lt;p&gt;This works great–on a low bandwidth interface.  But what happens if two
packets arrive in short succession?  Or, similarly, what happens if
packets arrive that are larger than the &lt;a href=&quot;https://github.com/ZipCPU/zipversa/blob/master/rtl/enet/enetpackets.v&quot;&gt;controller’s internal
buffer&lt;/a&gt;?
What about “Jumbo packets”?&lt;/p&gt;

&lt;p&gt;All of these problems necessitated a new solution, and the solution I chose was
a virtual packet FIFO.  This solution has two big upgrades to the previous
one.  The first is a size upgrade.  A virtual packet FIFO can be &lt;em&gt;much&lt;/em&gt; larger
than its block RAM counterpart.  The second upgrade is the number of packets
that can be held.  Frankly, it doesn’t make much sense if you can hold
lots of data, if you can’t also fill that with either lots of packets or
a small number of jumbo packets.&lt;/p&gt;

&lt;p&gt;Since this is a neat idea, let’s take a moment and discuss it.&lt;/p&gt;

&lt;h2 id=&quot;packet-streams&quot;&gt;Packet streams&lt;/h2&gt;

&lt;p&gt;Some time ago, I discussed &lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;the problems with the AXI stream
protocol&lt;/a&gt;.
At the time, I based my discussion on three specific applications:
&lt;a href=&quot;/video/2022/03/14/axis-video.html&quot;&gt;video&lt;/a&gt;,
&lt;a href=&quot;/blog/2019/11/14/sdspi.html&quot;&gt;data capture&lt;/a&gt;, and network
packet handling.  In each of these applications, data would arrive at the
incoming interface independent of whether or not there was space available
to handle it.  &lt;a href=&quot;https://en.wikipedia.org/wiki/Back_pressure&quot;&gt;Backpressure&lt;/a&gt;, a
key feature of the AXI stream protocol, could not be supported properly
without risking data corruption.&lt;/p&gt;

&lt;p&gt;At that time, &lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;I suggested a new AXI stream field:
ABORT&lt;/a&gt;.
If the &lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;ABORT&lt;/a&gt; signal
was ever asserted from an upstream source, the rest of any data packet
would need to be dropped, and data handling would need to start over with
the first beat of the next packet.  This new
&lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;ABORT&lt;/a&gt;
signal has worked nicely in network packet handling constructs.  Indeed, it
has worked &lt;em&gt;very&lt;/em&gt; well.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Yes, it’s a bit harder to work with and harder to verify than straight AXI
streams.  This is to be expected.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;However, it was joy to watch the network design “just work” with this
protocol.  In particular, I watched network data get captured, formed
into packets, and then dropped as the design started up–because either
the network interface hadn’t finished its negotiation into 1Gb mode
(it could never keep up at less than 1Gb/s), or because the data hadn’t
been told where to go yet.  (Yes, it still needed a destination addresses
for the SONAR data, both IP and Ethernet, before it could send it out.)&lt;/p&gt;

    &lt;p&gt;Once configuration completed, the protocol started blasting captured
packets without a hitch.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I loved it!&lt;/p&gt;

&lt;p&gt;Others, however, have argued that my proposed
&lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;ABORT&lt;/a&gt;
field was unnecessary.  Why create a new protocol, they argued, vs. just using
straight AXI stream?  The answer to this is twofold:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Jumbo_frame&quot;&gt;Jumbo Frames&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

    &lt;p&gt;In order to use straight AXI stream, you have to first convert the incoming
network packet to AXI stream in the first place.  The follows simply because
that incoming network interface doesn’t know anything about
&lt;a href=&quot;https://en.wikipedia.org/wiki/Back_pressure&quot;&gt;backpressure&lt;/a&gt;.
To do this conversion, incoming packets need to first go into a buffer.
If there’s not sufficient space in the buffer, the packet is simply dropped.
If there’s sufficient space, the packet is “committed” and can then be read
out of the buffer via standard AXI stream.&lt;/p&gt;

    &lt;p&gt;The size of this buffer forces a limit on the maximum packet size that can
be handled.  Packets larger than the buffer size will need to be dropped.&lt;/p&gt;

    &lt;p&gt;While I was designing the original SONAR Ethernet controller, my customer
asked about &lt;a href=&quot;https://en.wikipedia.org/wiki/Jumbo_frame&quot;&gt;jumbo
frames&lt;/a&gt;–packets much larger than
the (otherwise) maximum
Ethernet packet size of 1500 Bytes.  How much larger?  They didn’t say.
All of a sudden, I could no longer size my buffer prior to hardware
layout (place and route).&lt;/p&gt;

    &lt;p&gt;The Virtual Packet FIFO we’ll discuss today can solve this problem of
converting an (otherwise) unsized packet to AXI stream proper.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Vendor Infrastructure&lt;/strong&gt;&lt;/p&gt;

    &lt;p&gt;If I used Xilinx (or any other vendor’s) AXI stream infrastructure, I might
be tied to that protocol.  The choice of whether or not to use AXI stream
is really a business decision: either rebuild the AXI stream infrastructure
from scratch to support a modified protocol, or stick to the AXI stream
protocol as is.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;table align=&quot;center&quot; style=&quot;padding: 25px; float: right&quot;&gt;&lt;caption&gt;Fig 2. Advantages to using your own IP&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/vfifo/personal-ip.svg&quot; alt=&quot;&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;If I rebuild the infrastructure from scratch, I incur additional costs
   above and beyond what I might have incurred had I used someone else’s
   (free) infrastructure.  I can release any IP I build under my chosen user
   license.  I can also formally verify anything I build.  I will also gain
   the ability (and responsibility, and cost associated with) debugging and
   maintaining it.  The good news, though, is that I can guarantee the quality
   of any IP I control.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;padding: 25px; float: left&quot;&gt;&lt;caption&gt;Fig 3. Advantages to using vendor IP&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/vfifo/vendor-ip.svg&quot; alt=&quot;&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;If I use a vendor’s infrastructure then I might save some money–while
   risking my project’s success on the vendor’s responsiveness to bugs found in
   their infrastructure.  Given that I’m aware of bugs that’ve lived in Xilinx
   IP for nearly 10 years, and given that I’m a small one-man nobody shop, I
   don’t have a strong confidence that they’ll fix anything that’s broken.&lt;/p&gt;

&lt;p&gt;Yes, I suppose this is a business decision.&lt;/p&gt;

&lt;p&gt;Frankly, I don’t use vendor infrastructure unless I have to.  It’s just the
   nature of how I’ve structured my own business at &lt;a href=&quot;/about/gisselquist-technology.html&quot;&gt;Gisselquist
   Technology&lt;/a&gt;.
   I’ve now built &lt;a href=&quot;/about/zipcpu.html&quot;&gt;my own CPU&lt;/a&gt;, my own
   GNU compiler and assembler back ends, &lt;a href=&quot;/blog/2019/07/17/crossbar.html&quot;&gt;my own bus
   interconnects&lt;/a&gt;,
   &lt;a href=&quot;https://github.com/ZipCPU/dspfilters&quot;&gt;my own DSP filters&lt;/a&gt;,
   &lt;a href=&quot;/dsp/2017/08/30/cordic.html&quot;&gt;CORDICs&lt;/a&gt;,
   &lt;a href=&quot;/dsp/2018/10/02/fft.html&quot;&gt;FFTs&lt;/a&gt;, etc.  So it should come
   as no surprise that I’d have no problems building &lt;a href=&quot;https://github.com/ZipCPU/eth10g/tree/master/rtl/net&quot;&gt;an AXI stream
   infrastructure&lt;/a&gt; based
   around a new “&lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;ABORT&lt;/a&gt;”
   field.&lt;/p&gt;

&lt;p&gt;Yes, there are risks with this approach.  One common risk is that I might
   need to interface with a vendor protocol, so I often have &lt;a href=&quot;https://github.com/ZipCPU/wb2axip&quot;&gt;conversion
   routines&lt;/a&gt; available to move back and
   forth between one protocol and another when necessary.  For example, it’s
   not enough to use
   &lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt; if you need
   to interact with Xilinx’s MIG–so I use a &lt;a href=&quot;/blog/2020/03/23/wbm2axisp.html&quot;&gt;bridge from one protocol to the
   other&lt;/a&gt;.  I &lt;a href=&quot;https://github.com/ZipCPU/wb2axip&quot;&gt;also
   have&lt;/a&gt; sufficient infrastructure to
   use AXI without bridges if necessary.&lt;/p&gt;

&lt;p&gt;Still, AXI stream is a really simple protocol, and &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/doc/axin.pdf&quot;&gt;this modified AXI
   network stream
   protocol&lt;/a&gt;,
   while more complex, isn’t really that much more difficult to deal with.&lt;/p&gt;

&lt;p&gt;Since writing &lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;that
article&lt;/a&gt;, I’ve had great
success with this new
&lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;ABORT&lt;/a&gt; field.
Indeed, I’ve had so much success, that I’m now rebuilding all of my network
data handling components to use it.&lt;/p&gt;

&lt;p&gt;However, there is one (more) problem this &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/doc/axin.pdf&quot;&gt;new
protocol&lt;/a&gt; needs
to address: stream widths.&lt;/p&gt;

&lt;p&gt;When working with 1Gb Ethernet, I could operate at 8b/clock at 125MHz, and
stream widths weren’t really a problem–every beat contained exactly one byte.
Well, not quite.  Stream widths became a bit of a problem when crossing clock
domains, since I would need to guarantee sufficient handling width.  To
handle the CDC case, I first converted to a wider (32b) AXI stream, and
then prepended a 32b packet length to the packet.  This kept me from
supporting &lt;a href=&quot;https://en.wikipedia.org/wiki/Jumbo_frame&quot;&gt;jumbo frames&lt;/a&gt;,
so when rebuilding for a 10Gb interface, I needed a new solution.&lt;/p&gt;

&lt;p&gt;Standard AXI stream solves this problem with their TSTRB and TKEEP fields.
Each field has one bit per byte per beat within it, and allows the stream
processor to handle less than a full beat of information.  For example, when
dealing with a 32-bit interface, a 16-bit value might contain two NULL bytes,
where a NULL byte is defined as one where TKEEP and TSTRB are both low.&lt;/p&gt;

&lt;p&gt;This seemed insufficient for me for a variety of reasons.  In general, to use
an AXI stream of this type, you’d first want to pack it and remove all
NULL bytes.  This would force any unused bytes into the last beat, while also
requiring that the last beat had at least one valid byte.  The last beat would
also need to be packed, so that all used bytes would be on the low end–when
using little endian semantics, or the high end otherwise.  Further, I never
saw a reason for keeping “position” bytes (TKEEP &amp;amp;&amp;amp; !TSTRB) around.  The
result was that TKEEP and TSTRB contained too many bits for my purpose.&lt;/p&gt;

&lt;p&gt;So I created a new field: BYTES.  At first, the BYTES field had &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$clog2(DW/8+1)&lt;/code&gt;
bits to it, where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DW&lt;/code&gt; is the number of bits in the DATA field–sometimes
called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;C_AXIS_DATA_WIDTH&lt;/code&gt;.  This BYTES field would then be equal to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DW/8&lt;/code&gt;
for every beat prior to the last one, and between one and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DW/8&lt;/code&gt; inclusive for
the last beat.  (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0 &amp;lt; BYTES &amp;lt;= DW/8&lt;/code&gt;)  Then, on second thought, I realized the
top bit of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BYTES&lt;/code&gt; was irrelevant: Since &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BYTES&lt;/code&gt; was never zero, and never
more than &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DW/8&lt;/code&gt;, I could map the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DW/8&lt;/code&gt; value to zero and drop a bit.  So,
now, BYTES has &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$clog2(DW/8)&lt;/code&gt; bits and a value of zero (representing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DW/8&lt;/code&gt;
bytes) for all but the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LAST&lt;/code&gt; beat where it might represent fewer bytes per
beat.&lt;/p&gt;

&lt;p&gt;So, in summary, to support packet data I made the following changes to the
AXI stream protocol:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;&lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;ABORT&lt;/a&gt;&lt;/strong&gt;: A new
field, indicating that the upstream processor needed to drop the packet
for any reason.  Possible reasons I’ve come across include: 1) CRC errors,
2) protocol errors, 3) hardware errors, or even 4) insufficient memory
for handling &lt;a href=&quot;https://en.wikipedia.org/wiki/Back_pressure&quot;&gt;backpressure&lt;/a&gt;,
from downstream.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;TKEEP/TSTRB&lt;/strong&gt;: I dropped both of these fields.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;BYTES&lt;/strong&gt;: A new field to replace the TSTRB/TKEEP fields, while still
indicating how many bytes are active in a given beat.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;And, of course, all beats are fully packed.  Hence, all but the LAST
beat will have &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DW/8&lt;/code&gt; valid bytes in it.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I’ve named &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/doc/axin.pdf&quot;&gt;this (new) protocol the AXI-networking, or AXIN,
protocol&lt;/a&gt;, for lack
of a better name.  As a result, if you look through &lt;a href=&quot;https://github.com/ZipCPU/eth10g/tree/master/rtl/net&quot;&gt;the designs I’ve built
to use this protocol&lt;/a&gt;,
you’ll find “AXIN” in a lot of the names.&lt;/p&gt;

&lt;p&gt;I also have a lot of infrastructure for this new protocol, and that
infrastructure is growing on a daily basis.  For example, I have AXIN
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/netskid.v&quot;&gt;skidbuffers&lt;/a&gt;,
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/axincdc.v&quot;&gt;asynchronous&lt;/a&gt;
and &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/netfifo.v&quot;&gt;synchronous
FIFO&lt;/a&gt;s,
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/axinbroadcast.v&quot;&gt;broadcasters&lt;/a&gt;,
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/axinarbiter.v&quot;&gt;arbiters&lt;/a&gt;,
a &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/axinwidth.v&quot;&gt;width
converter&lt;/a&gt;,
and more.  (A CRC checker is still being verified, but will likely be posted
soon.)&lt;/p&gt;

&lt;p&gt;This brings us to the topic of virtual FIFOs.&lt;/p&gt;

&lt;h2 id=&quot;what-is-a-virtual-fifo&quot;&gt;What is a virtual FIFO?&lt;/h2&gt;

&lt;p&gt;A virtual FIFO is simply a FIFO that uses external instead of internal memory.
That external memory is typically accessed via a bus, shared among many
potential users, and commonly exists off-chip.  A classic example would be a
DDR3 SDRAM memory accessed via an AXI (or
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;) bus.  You can
see the difference between a traditional and a virtual FIFO in Fig. 4.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;padding: 25px; float: right&quot;&gt;&lt;caption&gt;Fig 4. Difference between a FIFO and a Virtual FIFO&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/vfifo/fifo-comparison.svg&quot; alt=&quot;&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Some time ago, I built a &lt;a href=&quot;https://github.com/ZipCPU/wb2axip/blob/master/rtl/axivfifo.v&quot;&gt;Virtual FIFO for the AXI
protocol&lt;/a&gt;.  The
flow went as follows:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;The first step was to &lt;a href=&quot;https://github.com/ZipCPU/wb2axip/blob/49f06ea0219c48a1010f95d72d78ba535b075217/rtl/axivfifo.v#L486-L490&quot;&gt;buffer a burst of data&lt;/a&gt;
into a &lt;a href=&quot;https://github.com/ZipCPU/wb2axip/blob/master/rtl/sfifo.v&quot;&gt;synchronous
FIFO&lt;/a&gt;.  To work
smoothly, the &lt;a href=&quot;https://github.com/ZipCPU/wb2axip/blob/master/rtl/sfifo.v&quot;&gt;synchronous
FIFO&lt;/a&gt; needed
space for at least two AXI bursts.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Once a full burst’s worth of data was available in the &lt;a href=&quot;https://github.com/ZipCPU/wb2axip/blob/49f06ea0219c48a1010f95d72d78ba535b075217/rtl/axivfifo.v#L486-L490&quot;&gt;local
FIFO&lt;/a&gt;,
that data would be &lt;a href=&quot;https://github.com/ZipCPU/wb2axip/blob/49f06ea0219c48a1010f95d72d78ba535b075217/rtl/axivfifo.v#L486-L490&quot;&gt;burst to the AXI
bus&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;When using &lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;,
I’d do the same thing, save that the burst would only
end after the incoming FIFO was completely drained.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/wb2axip/blob/49f06ea0219c48a1010f95d72d78ba535b075217/rtl/axivfifo.v#L409-L427&quot;&gt;Once BVALID was then received&lt;/a&gt;,
we would know that a full AXI burst’s
worth of memory was now available in the external RAM to be &lt;a href=&quot;https://github.com/ZipCPU/wb2axip/blob/49f06ea0219c48a1010f95d72d78ba535b075217/rtl/axivfifo.v#L681-L735&quot;&gt;read
back&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;In the case of
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;,
I’d count data words, not burst sizes, but it’s the same principle.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;There was again another (local, block RAM) &lt;a href=&quot;https://github.com/ZipCPU/wb2axip/blob/master/rtl/sfifo.v&quot;&gt;synchronous
FIFO&lt;/a&gt; on the read
memory side.  Like the first FIFO, &lt;a href=&quot;https://github.com/ZipCPU/wb2axip/blob/49f06ea0219c48a1010f95d72d78ba535b075217/rtl/axivfifo.v#L788-L796&quot;&gt;this
one&lt;/a&gt;
also required enough room to contain at least two AXI bursts.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Once a burst’s worth of data has been placed into the external RAM, &lt;em&gt;and&lt;/em&gt;
there is sufficient (uncommitted) data in the second FIFO to hold it,
a &lt;a href=&quot;https://github.com/ZipCPU/wb2axip/blob/49f06ea0219c48a1010f95d72d78ba535b075217/rtl/axivfifo.v#L681-L735&quot;&gt;burst read request would be issued&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;Again, when using
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;,
I’d make requests until the entire FIFO’s size was committed–not just the
initial burst size.  Hence, as reads might be made from the FIFO while
requesting data from the bus, additional reads would be made.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Data read back from memory would then get sent straight into &lt;a href=&quot;https://github.com/ZipCPU/wb2axip/blob/49f06ea0219c48a1010f95d72d78ba535b075217/rtl/axivfifo.v#L788-L796&quot;&gt;this second
buffer&lt;/a&gt;
once it returned from the bus.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The final, outgoing AXI stream, would then be fed &lt;a href=&quot;https://github.com/ZipCPU/wb2axip/blob/49f06ea0219c48a1010f95d72d78ba535b075217/rtl/axivfifo.v#L786&quot;&gt;straight from this
second buffer&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Only when the incoming FIFO is full would
&lt;a href=&quot;https://en.wikipedia.org/wiki/Back_pressure&quot;&gt;backpressure&lt;/a&gt;,
attempt to slow down the upstream source.&lt;/p&gt;

    &lt;p&gt;The incoming FIFO would be “full” if it wasn’t getting emptied.  This
would happen if either 1) &lt;a href=&quot;https://github.com/ZipCPU/wb2axip/blob/49f06ea0219c48a1010f95d72d78ba535b075217/rtl/axivfifo.v#L535-L536&quot;&gt;the memory was full&lt;/a&gt;,
or 2) the &lt;a href=&quot;https://github.com/ZipCPU/wb2axip/blob/49f06ea0219c48a1010f95d72d78ba535b075217/rtl/axivfifo.v#L540-L541&quot;&gt;FIFO couldn’t write to memory fast
enough&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Success, when using this technique, required that the stream bandwidth be
less than 50% of the memory bandwidth.  This will often require that any stream
necessitating a high throughput might first need to be resized to a wider
width–just to reduce the throughput to something the memory can handle.
Remember, when sizing memory bandwidth, there are lots of things that can
use up your bandwidth:&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left; padding: 25px&quot;&gt;&lt;caption&gt;Fig 5. Calculating memory bandwidth&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/vfifo/mem-bw.svg&quot; alt=&quot;&quot; width=&quot;480&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;You’ll need bandwidth for the data to come in and get written to memory&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;You’ll need that much again to read the data back out&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;You’ll need to allocate time for bus latency.&lt;/p&gt;

    &lt;p&gt;This can be worse for any bus that needs to stop in order to switch
directions.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;A memory can only read or write on any given clock cycle, and also needs
a couple cycles to switch from reading to writing and back again.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Don’t forget, you’ll also need to allocate some number of memory clock
cycles for refresh.  How many cycles will be required here depends upon
your memory, your bus structure, and whether or not your memory allows
pulling refresh cycles or whether such pulled cycles are supported in your
controller.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Xilinx’s MIG controller also uses a couple of clock cycles per refresh
to keep it’s IO PLL locked on the memories DQS strobe signal(s).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;You’ll also need memory bandwidth for everything else that might use the
memory.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In short, &lt;em&gt;it depends&lt;/em&gt;.  The best way to size memory bandwidth requirements is
to calculate how many beats per second you will need, and then make sure
your memory can support perhaps four times that amount.&lt;/p&gt;

&lt;p&gt;A key problem with the standard virtual FIFO, described above, is that there’s
no (good) way to store non-data information such as packet boundaries in
memory.  Either you increase the memory storage requirement to hold a LAST bit
(often by 2x!), or it just gets dropped.  Indeed, my &lt;a href=&quot;https://github.com/ZipCPU/wb2axip/blob/master/rtl/axivfifo.v&quot;&gt;basic AXI virtual
FIFO&lt;/a&gt;
implementation simply drops this data.  As a result, it works well on a
proper &lt;em&gt;stream&lt;/em&gt; interface, but not very well on a &lt;em&gt;packet&lt;/em&gt; interface.&lt;/p&gt;

&lt;p&gt;I have a &lt;a href=&quot;https://github.com/ZipCPU/wbscope/blob/master/rtl/memscope.v&quot;&gt;separate virtual FIFO that I’ve built for my
&lt;em&gt;scope&lt;/em&gt;&lt;/a&gt;,
sometimes called an internal logic analyzer.  (This one’s been formally
verified, but never used in any practical context.  It was fun to build, and
a good learning exercise, it just hasn’t fit into any important usage
scenarios … yet.)  In this case, if the
&lt;a href=&quot;/blog/2017/06/08/simple-scope.html&quot;&gt;scope&lt;/a&gt;
ever gets overrun and can’t keep up, all the data will be dropped
and it will start collecting all over again with new data.&lt;/p&gt;

&lt;p&gt;Again, the problem with the stream protocol is
&lt;em&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Back_pressure&quot;&gt;backpressure&lt;/a&gt;&lt;/em&gt;,
and what to do if you overrun the FIFO, and where/when in your stream processing
will that information be known.  When dealing with packets, the rule
is data needs to be dropped at packet boundaries.  That information needs
to be communicated to the place where the decision can be made.&lt;/p&gt;

&lt;p&gt;So how do we mix the &lt;em&gt;packet&lt;/em&gt; concept with the concept of a &lt;em&gt;virtual FIFO&lt;/em&gt;?
The answer is a virtual packet FIFO.&lt;/p&gt;

&lt;h2 id=&quot;what-is-a-virtual-packet-fifo&quot;&gt;What is a virtual packet FIFO?&lt;/h2&gt;

&lt;p&gt;A virtual packet FIFO is simply a virtual FIFO that maintains packet
boundaries.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 6. Virtual FIFO definitions&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/vfifo/vpktdefns.svg&quot; alt=&quot;&quot; width=&quot;560&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;That means two things.&lt;/p&gt;

&lt;p&gt;First, it means we have to preserve packet boundaries.  That &lt;em&gt;LAST&lt;/em&gt; signal
is important when working with packets.  Moreover, packet boundaries need
&lt;em&gt;octet&lt;/em&gt; level precision.&lt;/p&gt;

&lt;p&gt;Second, it means that packets are written to the FIFO before being
&lt;em&gt;committed&lt;/em&gt; to the FIFO.  Only after a full packet has been written to
the FIFO can it ever get committed.&lt;/p&gt;

&lt;p&gt;To handle both of these requirements, I rearranged how I stored packets
in memory.  Instead of storing packet data alone, or packet data plus a LAST
bit, or packet data plus some number of ancillary bits (TSTRB, TKEEP, TUSER,
and TLAST), I store the length of the packet &lt;em&gt;before&lt;/em&gt; the packet.&lt;/p&gt;

&lt;p&gt;Fig. 6 shows this pictorally.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;padding: 25px; float: right&quot;&gt;&lt;caption&gt;Fig 6. Virtual packets in memory&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/vfifo/vpktmem.svg&quot; alt=&quot;&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Specifically:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;All packet length fields precede the packet they describe.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;All packet lengths are 32’bits.  Yes, this is an arbitrary length.
However, 1) this seems to be the smallest bus size I’ve ever needed to work
with.  2) I rarely have more than 4GB of memory, so this seems sufficient.
3) It allows for jumbo packet sizes up to whatever memory size I have on hand.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;I also force a minimum 32-bit alignment on all accesses.  So, for a 128-bit
wide bus, this word will be aligned on a 32-bit boundary.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Before a packet is committed, its packet length shall be NULL.  (i.e. 32’h0)
You can think of this like the NULL pointer at the end of a linked list.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The packet data is written to memory using the full width of the bus.&lt;/p&gt;

    &lt;p&gt;In the context of the &lt;a href=&quot;https://github.com/ZipCPU/eth10g&quot;&gt;10Gb Ethernet switch&lt;/a&gt;
I’m working on, maintaining memory throughput is important.  As a result, I
need to use the full memory width (512 bits) as often as possible.  Anything
less would reduce my memory bandwidth.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If ever a packet gets dropped, the &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/pktvfifowr.v&quot;&gt;packet
writer&lt;/a&gt;
just goes back to the beginning of the packet data area and starts over
following the NULL packet length word when the next packet starts.&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;
        &lt;p&gt;Packets can be dropped for any upstream reason.&lt;/p&gt;
      &lt;/li&gt;
      &lt;li&gt;
        &lt;p&gt;Packets are also dropped if the virtual packet FIFO runs out of room.&lt;/p&gt;

        &lt;p&gt;This is a necessary criteria to prevent a deadlock created if the upstream
source never needs to abort, and yet there’s no room in memory to hold
the last of the packet in memory.&lt;/p&gt;

        &lt;p&gt;In order to support packet length pointers, a packet may not be completed
unless there’s room for both the packet length before and the packet
length of the packet to follow.&lt;/p&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As I mentioned, I’m now working on building a Virtual Packet FIFO.  It hasn’t
yet been verified, or I’d present the entire FIFO here.  For now, let me point
out the three major components and discuss how they work together:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;There’s the &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/pktvfifo.v&quot;&gt;Controller&lt;/a&gt;,&lt;/li&gt;
  &lt;li&gt;The &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/pktvfifowr.v&quot;&gt;writer&lt;/a&gt;, and&lt;/li&gt;
  &lt;li&gt;The &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/pktvfiford.v&quot;&gt;reader&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You may expect these components to change as they eventually get verified,
and then tested and proven in hardware.  (As of today, only the
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/pktvfifowr.v&quot;&gt;writer&lt;/a&gt;
has passed a formal check, and even that check didn’t properly include the
&lt;a href=&quot;/formal/2020/06/12/four-keys.html&quot;&gt;contract&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;Let’s discuss each of these components briefly.&lt;/p&gt;

&lt;h3 id=&quot;the-virtual-packet-fifo-controller&quot;&gt;The Virtual Packet FIFO Controller&lt;/h3&gt;

&lt;p&gt;The controller is responsible for setting the base address and memory size
allocated to the virtual FIFO.  These two values are then propagated down to
both &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/pktvfifowr.v&quot;&gt;writer&lt;/a&gt;
and &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/pktvfiford.v&quot;&gt;reader&lt;/a&gt;.
It’s also responsible for resetting the FIFO, and (depending on the
configuration) releasing it from reset.&lt;/p&gt;

&lt;p&gt;Even though this is only my second virtual packet FIFO, I’ve already had several
diverse needs for this controller.  In one design, the controller was given a
fixed memory allocation.  This is appropriate if the controller is required to
start up and operate without any CPU intervention.  In another design, the CPU
could allocate memory for the FIFO and then enable or disable the FIFO.  When
I can’t decide which of the two I want, sometimes I will generate a combination
of the two, so that the FIFO may start with a default allocation that the CPU
can come back to and adjust later if necessary.&lt;/p&gt;

&lt;p&gt;What happens if the CPU gives it a bad allocation?  One of the challenges of
controller design is determining how the virtual packet FIFO should handle
&lt;a href=&quot;https://en.wikipedia.org/wiki/Bus_error&quot;&gt;bus errors&lt;/a&gt;.
In general, a &lt;a href=&quot;https://en.wikipedia.org/wiki/Bus_error&quot;&gt;bus error&lt;/a&gt; indicates
that the FIFO has a bad memory allocation.  This might be the case if the
CPU has allocated memory to the FIFO that isn’t present in the system.  In
this case, it makes the most sense for the FIFO to shut down with some type of
error condition, and to then wait for the CPU to correct its memory allocation.
On the other hand, if the CPU will get its instructions for “fixing” any faults
from the network, then the network must be able to heal itself without any
CPU intervention.&lt;/p&gt;

&lt;p&gt;A similar problem might be generated by the
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/pktvfiford.v&quot;&gt;reader&lt;/a&gt;
if it ever comes across an invalid packet length word.  Such a packet length
migth be equal to zero, greater than the total size of the allocated memory,
or perhaps just big enough to pass the write pointer.&lt;/p&gt;

&lt;p&gt;In both of these cases, either a
&lt;a href=&quot;https://en.wikipedia.org/wiki/Bus_error&quot;&gt;bus error&lt;/a&gt; or an invalid
packet length, there should be an appropriate way to fix the situation.
In a hands-off implementation, the FIFO will need to just reset
itself–hopefully in that case memory allocation issues will be handled before
implementation.  In another case, the FIFO will wait for the CPU to issue a
new address before releasing itself from reset.  The same could be done with
the read packet length, or alternatively the &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/pktvfiford.v&quot;&gt;packet
reader&lt;/a&gt;
might just skip to where the write pointer is at–skipping any packets in the
way.&lt;/p&gt;

&lt;p&gt;Which method of resolving faults is appropriate depends upon the particular
design requirements.&lt;/p&gt;

&lt;h3 id=&quot;the-write-state-machine&quot;&gt;The Write State Machine&lt;/h3&gt;

&lt;p&gt;Once the base address and memory size are known, and once the FIFO has been
released from reset, incoming packets may be written to memory.&lt;/p&gt;

&lt;p&gt;This takes place in several discrete steps.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;First, prior to any packet, &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/8e4af12717cfc96611e47b260236e52f5412d95c/rtl/net/pktvfifowr.v#L397-L423&quot;&gt;the packet’s length word must be written as
zero&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Then, as packet data enters the FIFO, its data &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/8e4af12717cfc96611e47b260236e52f5412d95c/rtl/net/pktvfifowr.v#L424-L464&quot;&gt;gets written directly to
memory&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;Unfortunately, the 32-bit packet length word guarantees that further writes
to memory can not be guaranteed to have any particular alignment.  Incoming
data must then be &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/8e4af12717cfc96611e47b260236e52f5412d95c/rtl/net/pktvfifowr.v#L450-L457&quot;&gt;realigned as it is written to memory&lt;/a&gt;.
This also means that there may need to be &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/8e4af12717cfc96611e47b260236e52f5412d95c/rtl/net/pktvfifowr.v#L483-L488&quot;&gt;N+1 memory writes for
every N memory words&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;A second problem here is associated with the data pointer.  Specifically,
pointers wrap.  Hence, &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/8e4af12717cfc96611e47b260236e52f5412d95c/rtl/net/pktvfifowr.v#L226-L237&quot;&gt;any calculation of the next memory
address&lt;/a&gt;
must include &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/8e4af12717cfc96611e47b260236e52f5412d95c/rtl/net/pktvfifowr.v#L232-L234&quot;&gt;a check against the last memory address, and a wrap back
to the first address if it passes the end of
memory&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;Finally, &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/8e4af12717cfc96611e47b260236e52f5412d95c/rtl/net/pktvfifowr.v#L461-L462&quot;&gt;this is the only place where committed memory space is
checked&lt;/a&gt;.
If a packet uses all of the available memory space, not just the remaining
memory space, then it must be aborted locally.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;On any packet &lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;ABORT&lt;/a&gt;s,
the &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/8e4af12717cfc96611e47b260236e52f5412d95c/rtl/net/pktvfifowr.v#L435-L440&quot;&gt;write pointer is set to follow the prior NULL
length&lt;/a&gt;.  On a local
packet &lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;ABORT&lt;/a&gt;, such
as might take place if the packet overflowed memory, then we need to
resync to the beginning of the next packet.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Once a packet is complete, the next word becomes the length field of the
next packet.  &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/8e4af12717cfc96611e47b260236e52f5412d95c/rtl/net/pktvfifowr.v#L495-L523&quot;&gt;It is set to
NULL&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;After this next word has been set to NULL, the &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/pktvfifowr.v&quot;&gt;FIFO
writer&lt;/a&gt;
can then &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/8e4af12717cfc96611e47b260236e52f5412d95c/rtl/net/pktvfifowr.v#L524-L557&quot;&gt;go back and write the length of the current (just written) packet
into memory&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;This is what actually commits the packet to memory.  We can know the
packet has been committed once all bus requests have been completed.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Once the bus becomes idle, we tell the
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/pktvfiford.v&quot;&gt;reader&lt;/a&gt;
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/8e4af12717cfc96611e47b260236e52f5412d95c/rtl/net/pktvfifowr.v#L579&quot;&gt;our new start-of-packet pointer&lt;/a&gt;
and go back to step #2 above to handle the next packet.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All of this is handled via a (monster) state machine that can master the bus.&lt;/p&gt;

&lt;h3 id=&quot;the-read-state-machine&quot;&gt;The Read State Machine&lt;/h3&gt;

&lt;p&gt;Once a packet is committed, a second state machine can then read the packet
back from the bus.  (This one still needs a lot of verification work …)&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;First, the
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/pktvfiford.v&quot;&gt;reader&lt;/a&gt;
reads the length word from memory.&lt;/p&gt;

    &lt;p&gt;Knowing when to read this length word is a bit of a problem.  Were this a
piece of CPU software, we might poll this memory word.  If the memory word
was ever non-zero, we’d know a packet was present.  However, this design is
intended for a hardware implementation.  Hardware can poll memory on every
clock cycle, so much so that the
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/pktvfifowr.v&quot;&gt;writer&lt;/a&gt;
wouldn’t have any cycles left to write the packet to memory.  To prevent
this, we’d need to specify a polling interval, which would then increase our
latency.  Supporting minimum latency requires a different solution.&lt;/p&gt;

    &lt;p&gt;My solution to this problem has been to use an out-of-band communication
scheme through the controller.  In this scheme, the
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/pktvfifowr.v&quot;&gt;writer&lt;/a&gt;
tells the
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/pktvfiford.v&quot;&gt;reader&lt;/a&gt;
a pointer to the length word of the last packet committed.  If this address
doesn’t match the
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/pktvfiford.v&quot;&gt;reader&lt;/a&gt;’s
last memory address, then a packet is present that may be read.
In another version of this FIFO, one with the CPU in the middle, the CPU
provides the reader with the same pointer.  Again, this tells the
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/pktvfiford.v&quot;&gt;reader&lt;/a&gt;
when it’s safe to go and read the packet length counter for the next packet.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Once returned, the
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/pktvfiford.v&quot;&gt;reader&lt;/a&gt;
then verifies the packet length word.&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;It’s not allowed to be zero.&lt;/li&gt;
      &lt;li&gt;The packet length may not pass the write pointer.  This would indicate
a memory overrun condition.&lt;/li&gt;
      &lt;li&gt;The packet length must be less than the size of memory.&lt;/li&gt;
    &lt;/ul&gt;

    &lt;p&gt;On any failure, we can either reset the entire FIFO, or (alternatively) just
drop all packets between our current location and the write pointer.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If the length pointer is good, we start reading from memory.&lt;/p&gt;

    &lt;p&gt;There are two challenges with this task.  The first challenge is that
the memory will (in general) be misaligned.  The second challenge is that
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt; has no
concept of &lt;a href=&quot;https://en.wikipedia.org/wiki/Back_pressure&quot;&gt;backpressure&lt;/a&gt;,
and our outgoing stream interface may require it.&lt;/p&gt;

    &lt;p&gt;To handle misalignment, we need to keep track of the previously read
memory word.  That, plus the current memory word, both shifted
appropriately, we’ll yield an outgoing stream word.  The trick is that
we may need to read an additional word to get all of the outgoing
stream data associated with this packet.&lt;/p&gt;

    &lt;p&gt;The way to handle
&lt;a href=&quot;https://en.wikipedia.org/wiki/Back_pressure&quot;&gt;backpressure&lt;/a&gt; when using
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt; is to
guarantee that we don’t issue a read request in the first place unless
there’s space available in the following outgoing FIFO for a packet word.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;One of the nice things about the
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/pktvfiford.v&quot;&gt;reader&lt;/a&gt;,
is that we don’t need to generate any
&lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;ABORT&lt;/a&gt;s.  That’s
a pleasant simplification.  Indeed, at this point in our return processing, we
could finally handle (infinite)
&lt;a href=&quot;https://en.wikipedia.org/wiki/Back_pressure&quot;&gt;backpressure&lt;/a&gt; if need be.&lt;/p&gt;

&lt;h3 id=&quot;a-new-interconnect&quot;&gt;A New Interconnect&lt;/h3&gt;

&lt;p&gt;One piece I wasn’t expecting in this new architecture was an updated/better
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;
&lt;a href=&quot;/blog/2019/07/17/crossbar.html&quot;&gt;interconnect&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As perspective, &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/pktvfifo.v&quot;&gt;this virtual packet
FIFO&lt;/a&gt;, is
designed to support a &lt;a href=&quot;https://github.com/ZipCPU/eth10g&quot;&gt;10Gb, 4-way Ethernet
switch&lt;/a&gt;.  That means I want to be able to
support 10Gb arriving (and departing) on each of the 4 interfaces at the same
time.  When using our planned hardware, the memory will run on a 200MHz clock,
reading (or writing) 512-bits (64-bytes) of data per clock cycle.  However, a
10Gb Ethernet switch will generate one 512-bit word every 51.s ns, or (roughly)
once every 11 clocks at 200MHz.  Hence, when the interface is running full
speed, we’ll be getting requests from rotating controllers.  The first
controller might want a beat, but then not need anything for another 10 beats
while the second controller wants a beat, etc.&lt;/p&gt;

&lt;p&gt;Typically, I run
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt; in a fashion
where I burst data to the bus (i.e. to memory) and then wait for the response
before shutting the interface down.  When using Xilinx’s MIG, this can take
up 20 clock cycles of latency.  If I did that here, I’d never have enough
memory bandwidth to keep up.&lt;/p&gt;

&lt;p&gt;My solution to this problem is to use a &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/wbmarbiter.v&quot;&gt;special type of
interconnect&lt;/a&gt;–one
I first developed for an AXI project.  When using &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/wbmarbiter.v&quot;&gt;this
interconnect&lt;/a&gt;, N
masters may request bus accesses of a single slave.  In this case, as each bus
master makes its request, the master’s ID is placed in a FIFO.  Since
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt; requests are
always returned in the order they are received, I can then use this FIFO to
route responses back to the appropriate master.  This will allow me to
interleave requests from multiple masters together on their way to memory.&lt;/p&gt;

&lt;p&gt;That’s the good news–more bandwidth.  The bad news is that this N:1 arbiter
will break &lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt; in
two ways.  First, since there’s no guaranteed concept of the end of a
transaction, there’s no way to know when to lock the bus.  Second, as I
implement &lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;, a
&lt;a href=&quot;https://en.wikipedia.org/wiki/Bus_error&quot;&gt;bus error&lt;/a&gt; terminates any ongoing
transaction.  This means that if N masters are active and only one of those
masters receives a &lt;a href=&quot;https://en.wikipedia.org/wiki/Bus_error&quot;&gt;bus error&lt;/a&gt;–in
response to some erroneous transaction, then all
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt; masters will
receive a &lt;a href=&quot;https://en.wikipedia.org/wiki/Bus_error&quot;&gt;bus error&lt;/a&gt; in return to
their ongoing operations.  For now, this will work: 1) these virtual packet
FIFOs will not be locking the bus, and 2) any
&lt;a href=&quot;https://en.wikipedia.org/wiki/Bus_error&quot;&gt;bus errors&lt;/a&gt; should be rare or
even non-existent.  Still, it’s a risk, and I’ll need to make sure it’s
well documented throughout the project.&lt;/p&gt;

&lt;h2 id=&quot;conclusions&quot;&gt;Conclusions&lt;/h2&gt;

&lt;p&gt;This is now the &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/pktvfifo.v&quot;&gt;second virtual packet FIFO I’ve
created&lt;/a&gt;.
If any design becomes so useful that you need to build it more than once,
then it’s going to become useful again.&lt;/p&gt;

&lt;p&gt;In this case, this &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/pktvfifo.v&quot;&gt;virtual packet
FIFO&lt;/a&gt; will
play an important part of the &lt;a href=&quot;https://github.com/ZipCPU/eth10g&quot;&gt;10Gb Ethernet
switch&lt;/a&gt;) I’m working on.  As packets arrive
from the PHY, their CRC’s will be validated, their &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/axinwidth.v&quot;&gt;stream width
expanded&lt;/a&gt;,
they’ll then &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/axincdc.v&quot;&gt;cross clock
domains&lt;/a&gt;,
their source MACs will recorded in the router, and they will enter &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/pktvfifo.v&quot;&gt;this virtual
packet FIFO&lt;/a&gt;.
Once these packets come out of
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/pktvfifo.v&quot;&gt;the FIFO&lt;/a&gt;,
the’ll go into a separate &lt;a href=&quot;https://github.com/ZipCPU/wb2axip/blob/master/rtl/sfifo.v&quot;&gt;synchronous
FIFO&lt;/a&gt;, have
their destination MACs checked, get routed to an outgoing interface, &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/axincdc.v&quot;&gt;cross
clock domains&lt;/a&gt;
(again), have &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/axinwidth.v&quot;&gt;their widths
adjusted&lt;/a&gt;
back to the interface width, and finally get blasted out the network.  Feel
free to check out
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/doc/eth10g-blocks.png&quot;&gt;this picture&lt;/a&gt; to see
an overview of this entire operation, as well as the status of the various
components required of this project.&lt;/p&gt;

&lt;p&gt;For now, however, the project is still draft.  The hardware, while drafted,
isn’t yet built and I’m still working on the RTL components within it.
&lt;a href=&quot;https://www.blueletterbible.org/kjv/jas/4/15/&quot;&gt;Lord willing&lt;/a&gt;,
I’ll have the RTL done by the time the hardware is available.&lt;/p&gt;

&lt;p&gt;Still, the overall concept of a Virtual Packet FIFO was one I felt would
be worth sharing.&lt;/p&gt;
&lt;hr /&gt;&lt;p&gt;&lt;em&gt;There is that scattereth, and yet increaseth; and there is that withholdeth more than is meet, but it tendeth to poverty.  (Prov 11:24)&lt;/em&gt;</description>
        <pubDate>Sat, 08 Apr 2023 00:00:00 -0400</pubDate>
        <link>https://zipcpu.com/blog/2023/04/08/vpktfifo.html</link>
        <guid isPermaLink="true">https://zipcpu.com/blog/2023/04/08/vpktfifo.html</guid>
        
        
        <category>blog</category>
        
      </item>
    
  </channel>
</rss>