This repository has been archived by the owner on Aug 20, 2024. It is now read-only.

RFC: Emit sync-read mems intact, with readwrite ports if applicable #2092

Closed
wants to merge 1 commit

Conversation

albert-magyar
Contributor

The default handling of sync-read memories by the FIRRTL compiler has been mentioned in several issues, including recently in chipsalliance/chisel#1788. While --repl-seq-mem can be used along with Verilog memory-generation flows, the emission of native FIRRTL memories has been a point of surprise and an FPGA QoR hurdle for new Chisel/FIRRTL users.

I think arguments can be made either way about adding this feature, so I'd like to hear what anyone thinks about:

  • Adding this as an emission option
  • Emitting sync-read mems this way by default (note that this will not affect flows with --repl-seq-mem).

If there's strong sentiment either way, it's easy to tweak this for different disabled/address-out-of-range behavior.

Type of improvement: Verilog emission
API impact: No public API changes, but passes downstream of VerilogMemDelays may now see sync-read mems.
Backend code-generation impact: most sync-read mems will now (potentially optionally) be emitted directly from FIRRTL to Verilog without splitting readwrite ports or adding pipelining registers.

Contributor Checklist

  • Did you add Scaladoc to every public function/method?
  • Did you update the FIRRTL spec to include every new feature/behavior?
  • Did you add at least one test demonstrating the PR?
  • Did you delete any extraneous printlns/debugging code?
  • Did you specify the type of improvement?
  • Did you state the API impact?
  • Did you specify the code generation impact?
  • Did you request a desired merge strategy?
  • Did you add text to be included in the Release Notes for this change?

Reviewer Checklist (only modified by reviewer)

  • Did you add the appropriate labels?
  • Did you mark the proper milestone (1.2.x, 1.3.0, 1.4.0) ?
  • Did you review?
  • Did you check whether all relevant Contributor checkboxes have been checked?
  • Did you mark as Please Merge?

@ekiwi
Contributor

ekiwi commented Feb 26, 2021

I think this is a great idea, and I am in general for emitting sync-read mems directly by default (unless there are significant behavioral differences, but I don't think there are, right?).

Could you add an example to your PR that shows how the Verilog changes?

@albert-magyar
Contributor Author

Here is a basic example of a 1R1W sync-read memory. The details around disabled behavior are easy to change, and there isn't really a single answer for the right level of conservative behavior. Though I consider that and random initialization to be details subject to change once it's decided whether native sync-read mem emission is a good idea, these aspects do tend to have a heavy influence on the visual difference in the code.

; FIRRTL
circuit syncreadmem:
  module syncreadmem:
    input clk: Clock
    input addr: UInt<6>
    input we: UInt<1>
    input wdata: UInt<4>
    output rdata: UInt<4>

    mem m:
      data-type => UInt<4>
      depth => 64
      reader => r
      writer => w
      read-latency => 1
      write-latency => 1
      read-under-write => undefined

    m.r.clk <= clk
    m.r.addr <= addr
    m.r.en <= UInt(1)
    rdata <= m.r.data

    m.w.clk <= clk
    m.w.addr <= addr
    m.w.en <= we
    m.w.mask <= we
    m.w.data <= wdata
// Current Verilog:
module syncreadmem(
  input        clk,
  input  [5:0] addr,
  input        we,
  input        re,
  input  [3:0] wdata,
  output [3:0] rdata
);
  reg [3:0] m [0:63];
  wire [3:0] m_r_data;
  wire [5:0] m_r_addr;
  wire [3:0] m_w_data;
  wire [5:0] m_w_addr;
  wire  m_w_mask;
  wire  m_w_en;
  reg [5:0] m_r_addr_pipe_0;
  assign m_r_addr = m_r_addr_pipe_0;
  assign m_r_data = m[m_r_addr];
  assign m_w_data = wdata;
  assign m_w_addr = addr;
  assign m_w_mask = we;
  assign m_w_en = we;
  assign rdata = m_r_data;
  always @(posedge clk) begin
    if(m_w_en & m_w_mask) begin
      m[m_w_addr] <= m_w_data;
    end
    m_r_addr_pipe_0 <= addr;
  end
endmodule

// New Verilog:
module syncreadmem(
  input        clk,
  input  [5:0] addr,
  input        re,
  input        we,
  input  [3:0] wdata,
  output [3:0] rdata
);
  reg [3:0] m [0:63];
  wire m_r_en = re;
  reg [5:0] m_r_addr; // note that this is now a reg
  wire [3:0] m_r_data;
  wire [3:0] m_w_data = wdata;
  wire [5:0] m_w_addr = addr;
  wire  m_w_mask = we;
  wire  m_w_en = we;
  assign m_r_data = m[m_r_addr];
  assign rdata = m_r_data;
  always @(posedge clk) begin
    if(m_r_en) begin
      m_r_addr <= addr;
    end
    if(m_w_en & m_w_mask) begin
      m[m_w_addr] <= m_w_data;
    end
  end
endmodule

@ekiwi
Contributor

ekiwi commented Feb 26, 2021

I just ran the example through yosys and synth_ice40.

For the old style emission I get:

=== oldmem ===

   Number of wires:                 16
   Number of wire bits:             61
   Number of public wires:          16
   Number of public wire bits:      61
   Number of memories:               0
   Number of memory bits:            0
   Number of processes:              0
   Number of cells:                 10
     SB_DFF                          5
     SB_LUT4                         4
     SB_RAM40_4K                     1

For the new emission:

=== syncreadmem ===

   Number of wires:                 26
   Number of wire bits:             79
   Number of public wires:          26
   Number of public wire bits:      79
   Number of memories:               0
   Number of memory bits:            0
   Number of processes:              0
   Number of cells:                 26
     SB_DFF                          4
     SB_DFFE                         6
     SB_DFFSR                        1
     SB_LUT4                        14
     SB_RAM40_4K                     1

New emission with read_en set to always be true (this suggests that maybe we should try and find a way to indicate to yosys that when read_en is false, the result is actually a DontCare):

=== syncreadmem ===

   Number of wires:                 17
   Number of wire bits:             62
   Number of public wires:          17
   Number of public wire bits:      62
   Number of memories:               0
   Number of memory bits:            0
   Number of processes:              0
   Number of cells:                 10
     SB_DFF                          5
     SB_LUT4                         4
     SB_RAM40_4K                     1

So for yosys's ice40 flow, the new emission style does not seem to make a difference.
Do you get different results when using Vivado?

Sorry if these questions are naive, but I don't have a lot of experience with BRAM inference and I am trying to figure out in which situations the old emission style was sub-optimal.

@albert-magyar
Contributor Author

Sorry if these questions are naive, but I don't have a lot of experience with BRAM inference and I am trying to figure out in which situations the old emission style was sub-optimal.

That makes complete sense; I should have posted an example that more clearly motivates the problem. For reference, this is the solution to the issue presented in chipsalliance/chisel#1788. I will add a two-RW example in this thread. Unfortunately, I don't think it will be possible to highlight this issue with Yosys, since iCE40 Embedded Block Rams (EBRs) support either 1R1W or single-RW port configurations, so splitting an RW port is not as meaningful a barrier to efficient synthesis in an iCE40-based flow as it would be in a modern Xilinx-based flow.
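As a rough illustration of why this matters on Xilinx parts (this is a hedged sketch, not the promised example and not this PR's exact output; the module and port names are hypothetical), a memory with two intact readwrite ports matches the true-dual-port shape that a single BRAM can implement, whereas splitting each RW port yields a 2R2W memory that a single BRAM primitive cannot:

// Illustrative only: a 2RW memory kept as two readwrite ports (a/b).
module mem_2rw (
  input            clk,
  input      [5:0] a_addr, b_addr,
  input            a_en, b_en,
  input            a_wmode, b_wmode,
  input      [3:0] a_wdata, b_wdata,
  output reg [3:0] a_rdata, b_rdata
);
  reg [3:0] m [0:63];
  always @(posedge clk) begin
    if (a_en) begin
      if (a_wmode) m[a_addr] <= a_wdata;  // port a writes this cycle
      else a_rdata <= m[a_addr];          // or does a sync read
    end
    if (b_en) begin
      if (b_wmode) m[b_addr] <= b_wdata;
      else b_rdata <= m[b_addr];
    end
  end
endmodule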

New emission with read_en set to always be true (this suggests that maybe we should try and find a way to indicate to yosys that when read_en is false, the result is actually a DontCare):

As an aside, there are many competing arguments for many different variants of "what to do with disabled ports." The "disable some synchronous element" variant might cost some small amount of logic, but it can often save power. It's completely possible to make either approach behave nearly any way; the current disabled-port behavior is more-or-less arbitrarily dictated by VerilogMemDelays and does not necessarily represent an optimum.
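To make the trade-off concrete, here are two hedged Verilog fragments (signal names hypothetical) showing common variants of "what to do with a disabled read port" — the first matches the enable-gated address register in the example above, the second is one way to express the DontCare idea mentioned earlier:

// Variant 1: a disabled read port holds its previous address.
// Costs an enable signal, but can save read power.
always @(posedge clk)
  if (r_en) r_addr_reg <= r_addr;

// Variant 2: a disabled read yields an explicit don't-care,
// leaving the disabled case unconstrained for synthesis.
always @(posedge clk)
  if (r_en) rdata_reg <= m[r_addr];
  else      rdata_reg <= 4'bxxxx;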

There are three primary concerns: synthesis mapping, power, and verification pessimism. There are also three approaches to balancing them: compromise-striking emission patterns, specialized backends, and conditional compilation. In general, I would contend that retaining memories intact all the way to the emitter (the real effect of this PR) gives us more semantic information to strike a better balance, but the potential for increased Scala implementation complexity creates risk.

@ekiwi
Contributor

ekiwi commented Feb 26, 2021

There are three primary concerns: synthesis mapping, power, and verification pessimism. There are also three approaches to balancing them: compromise-striking emission patterns, specialized backends, and conditional compilation.

  • for synthesis: I would assume that if you are targeting an ASIC you will use --repl-seq-mems and if you are targeting an FPGA then you might just use the generated Verilog (the underlying assumption here is that the FPGA tools are better at memory inference)
  • for power: I feel like if you want a specific behavior to preserve power, we should enable people to write firrtl passes (if possible) but this might not need to be a focus for the default emission
  • for verification: I feel like we will eventually have a switch for simulation/verification emission which will try to model more pessimistic behavior. I am already doing that for the formal backend, but in half a year or so I hope to be able to make some of the more pessimistic modelling available for simulation purposes as well.

In general, I would contend that retaining memories intact to the emitter (the real effect of this PR) gives us more semantic information to strike a better balance, but the potential for increased Scala implementation complexity creates risk.

Funny enough, I initially wrote the formal backend without using VerilogMemDelays but I am currently reconsidering because I want to lift the more pessimistic modelling of undefined behavior in firrtl memories to the firrtl level in order to re-use it for more pessimistic simulations.

What was the initial reason for splitting read/write ports into two? Could we change VerilogMemDelays to not do that?

The other big problem I know about wrt. memory emission is that we split up non-ground-type memories. This prevents synthesis tools from inferring write masks properly.

@albert-magyar
Contributor Author

What was the initial reason for splitting read/write ports into two?

I don't know what the original justification was, but I assume VerilogMemDelays owes its existence to Scala implementation simplicity. FIRRTL supports a very broad range of memory behaviors, and handling them all in VerilogMemDelays is simpler than emitting them.

Could we change VerilogMemDelays to not do that?

Changing VerilogMemDelays to not do that is the core feature added by this PR. To elaborate: keeping RW ports intact prevents lowering to combinational-read memories, so it necessitates adding support for both sync-read-mem and RW-port emission to the emitter. This PR adds that support and modifies VerilogMemDelays to be less invasive in its modification of sync-read memories. Deciding which memories to apply this to and which flags to expose to the user is a problem downstream of the technical emitter changes that enable it at all.
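For readers unfamiliar with FIRRTL readwrite ports, here is a hedged sketch (not this PR's exact output; names are illustrative) of what emitting one intact might look like — the classic single-port sync-read RAM shape that synthesis tools infer directly:

module rw_mem (
  input            clk,
  input      [5:0] rw_addr,
  input            rw_en, rw_wmode, rw_wmask,
  input      [3:0] rw_wdata,
  output reg [3:0] rw_rdata
);
  reg [3:0] m [0:63];
  always @(posedge clk) begin
    if (rw_en) begin
      if (rw_wmode & rw_wmask)
        m[rw_addr] <= rw_wdata;  // write through the shared port
      else
        rw_rdata <= m[rw_addr];  // synchronous read, 1-cycle latency
    end
  end
endmodule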

The other big problem I know about wrt. memory emission is that we split up non-ground-type memories. This prevents synthesis tools from inferring write masks properly.

The bigger problem is not just that write masks aren't inferred properly: the splitting often prevents memories whose mask is never meaningfully used from being inferred as a single memory at all. There currently exists a SimplifyMems pass in the FIRRTL compiler that can be added as a dependency to avoid this splitting, potentially improving memory resource efficiency in FPGAs for "un-masked" memories. When it comes to actually using write masks, a whole host of problems emerges, not least of which is that most FPGA memories do not have bitmasks. Improving mask emission is a logical follow-on from this PR, but is likely only to be beneficial for the case of memories of VectorType of ~8b elements.
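For reference, a hedged sketch (names hypothetical, not this PR's output) of the kind of single-array, byte-enable write pattern FPGA tools can map to BRAM byte write enables, as opposed to one split memory per element:

module masked_mem (
  input         clk,
  input  [5:0]  w_addr,
  input         w_en,
  input  [3:0]  w_mask,   // one mask bit per byte lane
  input  [31:0] w_data
);
  reg [31:0] m [0:63];
  integer i;
  always @(posedge clk) begin
    if (w_en)
      // Per-lane write keeps the memory a single array, so tools
      // can infer byte write enables instead of four memories.
      for (i = 0; i < 4; i = i + 1)
        if (w_mask[i])
          m[w_addr][8*i +: 8] <= w_data[8*i +: 8];
  end
endmodule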

@seldridge
Member

High level: this makes sense. The memory lowering in FIRRTL (without --repl-seq-mem) is pretty sub-optimal, though functionally correct for simulation/testing purposes. Better, FPGA resource-efficient code gen would be amazing.

An immediate related concern is #856 (--repl-seq-mem not supporting anything but 1r1w SyncReadMem). A solution for this is an alternative avenue to better FPGA tooling integration.

Longer term, BRAM compatibility with a specific target vendor would be ideal, but may be tedious in practice. (I don't know about what differences, if any, exist between Xilinx, Altera, or Yosys tooling with respect to BRAM inference, but I expect there are things which are not compatible and I expect write masks are weird. Some Googling got me to: https://danstrother.com/2010/09/11/inferring-rams-in-fpgas/.) This is exacerbated by the fact that the Verilog Emitter is not particularly flexible...

Towards a more flexible Verilog Emitter... I've been doing a lot of work on llvm/circt memory lowering to try and get it to be compatible with what the FIRRTL compiler is doing (llvm/circt#488, llvm/circt#585, and llvm/circt#691 put it almost there). This may be a better backend for more featureful, flexible memory emission as that already has support for SystemVerilog structs, n-dimensional arrays, etc. and may make it easier to think about lowering into Verilog that is optimized for vendor, tooling, etc.

Thanks for hacking on this @albert-magyar. I'll add this to the agenda for Monday's dev meeting to get some broader feedback on this RFC.

@ekiwi
Contributor

ekiwi commented Mar 1, 2021

An immediate related concern is #856 (--repl-seq-mem not supporting anything but 1r1w SyncReadMem). A solution for this is an alternative avenue to better FPGA tooling integration.

I don't believe that --repl-seq-mem makes for a good user experience for anyone following the Scala-first path. I.e., we want people to be able to write Chisel in a Scala IDE and then just call a function to generate Verilog that is ready to be synthesized onto an FPGA. --repl-seq-mem matches the development style where everything is integrated into an external make system, as used by some UCB-BAR and SiFive people. However, I do believe that this is a big hurdle for newcomers, and FPGA tools seem to have gotten good enough at inferring memories that I am confident we will be able to find an emission style that works well for the majority of tools, with Vivado and Yosys being the most important ones.

This is exacerbated by the fact that the Verilog Emitter is not particularly flexible...

I think this has mostly been a symptom of people being afraid to touch the Verilog emitter.

This may be a better backend for more featureful, flexible memory emission as that already has support for SystemVerilog structs, n-dimensional arrays, etc. and may make it easier to think about lowering into Verilog that is optimized for vendor, tooling, etc.

I believe that the C++ firrtl compiler (CFC? analogous to SFC?) will eventually replace the more researchy Scala implementation. However, until then there seem to be a bunch of deployment issues to be solved (like how do you ship a shared object with a Scala library?), so I think it makes sense to continue adding features to the SFC.

@seldridge
Member

seldridge commented Mar 1, 2021

@ekiwi wrote:

I think this has mostly been a symptom of people being afraid to touch the Verilog emitter.

The problem I see is that any non-trivial Verilog generation requires an actual Verilog IR to lower to. The current implementation, which roughly does FIRRTL IR => Seq[Seq[Any]], is super difficult to work with and not a scalable solution.

Adding Verilog IR or pushing more towards custom IR nodes is tractable (like what PadWidths does), but a ton of work, and I would argue out of scope for the Scala FIRRTL Compiler (SFC).

@ekiwi wrote:

I believe that the C++ firrtl compiler (CFC? analogous to SFC?) will eventually replace the more researchy Scala implementation.

Yeah, I don't know what to call this. I've been going with "MLIR FIRRTL Compiler" (MFC).

No reason to stop dev here. I do expect that eventually Verilog emission hacking will become far easier with the MFC, though.

(And I do have a published prototype that will let you do mixed lowering with the SFC to a point and then let the MFC take over, assuming that you have the MFC installed on your path: https://github.com/sifive/chisel-circt.)

@schoeberl

This is a very appreciated initiative! Besides Vivado and Yosys, we should also check Quartus. I can help with that.

In my experience with FPGA memories, all three read-during-write behaviors occur: old data (seldom), new data (sometimes implemented by the synthesis tool with forwarding), and undefined (which is probably a possible mix of old and new bits). AFAIK undefined is common, but cannot really be expressed in Verilog or VHDL. In most cases, I do an explicit forwarding to get the newly written data.
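The explicit-forwarding approach can be sketched as follows (a hedged illustration with hypothetical names, not anyone's exact code): detect a same-address write/read collision, and bypass the stale array read with the registered write data.

module fwd_mem (
  input         clk,
  input  [5:0]  r_addr, w_addr,
  input         w_en,
  input  [3:0]  w_data,
  output [3:0]  rdata
);
  reg [3:0] m [0:63];
  reg [3:0] rdata_mem, fwd_data;
  reg       fwd;
  always @(posedge clk) begin
    rdata_mem <= m[r_addr];                   // may return old data
    fwd       <= w_en && (w_addr == r_addr);  // collision this cycle?
    fwd_data  <= w_data;
    if (w_en) m[w_addr] <= w_data;
  end
  // On a collision, forward the newly written data instead.
  assign rdata = fwd ? fwd_data : rdata_mem;
endmodule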

@ekiwi
Contributor

ekiwi commented Mar 9, 2021

@albert-magyar Could you rebase this on the latest master branch please? I fear that since I added a dependency on VerilogMemDelays with some of my recent changes this PR might break the SMT backend.

@albert-magyar
Contributor Author

Superseded by #2111
