## Agile Hardware Design
***
# Decoupling

## Prof. Scott Beamer
### sbeamer@ucsc.edu

## [CSE 293](https://classes.soe.ucsc.edu/cse293/Spring21/)

## Plan for Today

* Scala case classes
* Decoupling blocks in Chisel
* Chisel Queue demo

## Loading The Chisel Library Into a Notebook

In [1]:
val path = System.getProperty("user.dir") + "/../resource/chisel_deps.sc"
interp.load.module(ammonite.ops.Path(java.nio.file.FileSystems.getDefault().getPath(path)))

path: String = "/Users/sbeamer/Spring 2021/CSE 293/lectures/07-decoup/../resource/chisel_deps.sc"

In [2]:
import chisel3._
import chisel3.util._
import chisel3.tester._
import chisel3.tester.RawTester.test


## Scala Case Classes

* Special type of class with additional features built in
  * Companion object with an `apply` factory method (don't need `new` to instantiate)
  * All constructor parameters are automatically public (don't need to mark them `val`)
  * Automatic implementations of `toString`, `equals`, `hashCode`, and `copy`
  * Great for pattern matching (future lecture)


In [6]:
case class Movie(name: String, year: Int, genre: String) {
    def decade(): String = (year - year%10) + "s"
}

val m1 = Movie("Gattaca", 1997, "drama")
val m2 = Movie("The Avengers", 1998, "action")
m2.copy(year=2012)
m2.decade()

defined class Movie
m1: Movie = Movie("Gattaca", 1997, "drama")
m2: Movie = Movie("The Avengers", 1998, "action")
res5_3: Movie = Movie("The Avengers", 2012, "action")
res5_4: String = "1990s"
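
To make the automatically generated methods concrete, here is a minimal sketch in plain Scala. `Movie` is redeclared (without `decade`) so the cell is self-contained, and `Ticket` is a hypothetical ordinary class added only for contrast.

```scala
// Case class: structural equality, a readable toString, and copy come for free.
case class Movie(name: String, year: Int, genre: String)

// Hypothetical plain class for contrast: == falls back to reference identity.
class Ticket(val name: String, val year: Int)

val a = Movie("Gattaca", 1997, "drama")
val b = Movie("Gattaca", 1997, "drama")

val sameMovie  = a == b                     // compares field by field
val sameTicket = new Ticket("Gattaca", 1997) == new Ticket("Gattaca", 1997) // compares references
val remake     = a.copy(year = 2012)        // new instance; `a` itself is unchanged
val shown      = a.toString
```

Field-by-field equality is what makes case classes convenient as hardware generator parameters: two parameter objects with the same contents compare as equal.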

## Using `case class` for Parameters in Chisel

In [3]:
case class CounterParams(limit: Int, start: Int = 0) {
    def width = log2Ceil(limit + 1)
}

class MyCounter(cp: CounterParams) extends Module {
    val io = IO(new Bundle {
        val en  = Input(Bool())
        val out = Output(UInt())
    })
    val count = RegInit(cp.start.U(cp.width.W))  // reset to the same value the counter wraps to
    when (io.en) {
        when (count < cp.limit.U) {
            count := count + 1.U
        } .otherwise {
            count := cp.start.U
        }
    }
    io.out := count
}

println(getVerilog(new MyCounter(CounterParams(15))))

Elaborating design...
Done elaborating.
module MyCounter(
  input        clock,
  input        reset,
  input        io_en,
  output [3:0] io_out
);
`ifdef RANDOMIZE_REG_INIT
  reg [31:0] _RAND_0;
`endif // RANDOMIZE_REG_INIT
  reg [3:0] count; // @[cmd2.sc 10:24]
  wire [3:0] _T_2 = count + 4'h1; // @[cmd2.sc 13:28]
  assign io_out = count; // @[cmd2.sc 18:12]
  always @(posedge clock) begin
    if (reset) begin // @[cmd2.sc 10:24]
      count <= 4'h0; // @[cmd2.sc 10:24]
    end else if (io_en) begin // @[cmd2.sc 11:18]
      if (count < 4'hf) begin // @[cmd2.sc 12:35]
        count <= _T_2; // @[cmd2.sc 13:19]
      end else begin
        count <= 4'h0; // @[cmd2.sc 15:19]
      end
    end
  end
// Register and memory initialization
`ifdef RANDOMIZE_GARBAGE_ASSIGN
`define RANDOMIZE
`endif
`ifdef RANDOMIZE_INVALID_ASSIGN
`define RANDOMIZE
`endif
`ifdef RANDOMIZE_REG_INIT
`define RANDOMIZE
`endif
`ifdef RANDOMIZE_MEM_INIT
`define RANDOMIZE
`endif
`ifndef RANDOM
`define RANDOM $random

defined class CounterParams
defined class MyCounter
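
A parameter case class is also a convenient place to validate arguments and derive values before any hardware is elaborated. The sketch below is plain Scala so it runs without Chisel: `log2CeilSketch` is a hypothetical stand-in for `chisel3.util.log2Ceil`, and the `require` checks are assumptions about what a valid counter configuration looks like, not part of the original `CounterParams`.

```scala
// Hypothetical stand-in for chisel3.util.log2Ceil: ceil(log2(n)) for n > 0.
def log2CeilSketch(n: Int): Int = {
    require(n > 0, "n must be positive")
    32 - Integer.numberOfLeadingZeros(n - 1)
}

case class CounterParams(limit: Int, start: Int = 0) {
    // Assumed sanity checks: fail at construction rather than deep inside elaboration.
    require(limit > 0, "limit must be positive")
    require(start >= 0 && start <= limit, "start must lie in [0, limit]")

    // Derived parameter: bits needed to hold every value in 0..limit.
    def width: Int = log2CeilSketch(limit + 1)
}

val small = CounterParams(15)          // counts 0..15
val wide  = small.copy(limit = 16)     // one more value needs one more bit
```

With `limit = 15` the derived width is 4 bits, which agrees with the `[3:0]` register in the generated Verilog above.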

## Motivation for Handshaking Protocol

* Correctly implementing one sequential component can already be difficult, but what about two sequential components interacting?

* For today, let's only focus on transferring data
  * A _producer_ sending data to a _consumer_

* _**Challenge:**_ recognize when a side is (or is not) able to send/receive data

<img src="images/producer.svg" alt="ready/valid schematic" style="width:75%;margin-left:auto;margin-right:auto"/>

## Best to Distribute Control

* When to use _centralized_ vs _distributed_ control?
  * Common tradeoff throughout systems
  * Centralized can be more efficient and easier to implement (for small scale)
  * Distributed (peer-to-peer) can scale to larger designs much more easily
  * _Common outcome:_ centralized within components and distributed between them
  * Thus, question: _"At what scale to switch from centralized to distributed?"_

* For data transfer between components, may need ...
  * Ability for producer to indicate no data is being sent
  * Ability for consumer to indicate inability to receive data (_back pressure_)

## Ready/Valid Protocol

* Common hardware design pattern for producer-consumer data transfer

* _**valid**_ - output from producer indicating it is sending data

* _**ready**_ - output from consumer indicating it is able to receive

* Transfer occurs when both _ready_ and _valid_ are asserted in the same cycle



<img src="images/readyValid.svg" alt="ready/valid schematic" style="width:75%;margin-left:auto;margin-right:auto"/>
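
The transfer rule can be checked with a small software model. This is a plain-Scala sketch, not Chisel: `Cycle` and `simulate` are hypothetical names, and the producer is assumed to hold each item on `bits` until the consumer accepts it.

```scala
// One cycle of a ready/valid link, as seen on the wires.
case class Cycle(valid: Boolean, ready: Boolean, bits: Int) {
    def fire: Boolean = valid && ready   // a transfer happens only when both are high
}

// Hypothetical model: the producer keeps offering the head item until it is accepted.
def simulate(items: List[Int], readyPattern: List[Boolean]): List[Cycle] = {
    var pending = items
    readyPattern.map { ready =>
        val valid = pending.nonEmpty
        val cycle = Cycle(valid, ready, pending.headOption.getOrElse(0))
        if (cycle.fire) pending = pending.tail   // item leaves the producer only on a transfer
        cycle
    }
}

val trace       = simulate(List(10, 20), List(false, true, false, true, true))
val transferred = trace.filter(_.fire).map(_.bits)
```

Cycles where only one of the two signals is high move no data: the producer simply keeps offering the same item, which is how back pressure appears on the wires.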

## Chisel Supports Ready/Valid

* Best to use standard library's support for these patterns
  * Less code to write, less chance of error, standardization improves readability
* To use, wrap data to transfer with desired protocol
  * Library will add needed additional signals & provide helper functions

### Valid - only `valid`

* Consumer can't say no
  * Must consume when sent
* Indicates the existence of data
  * Almost like a hardware equivalent of Scala's `Option`
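
The `Option` analogy can be spelled out in plain Scala. `ValidModel` is a hypothetical software stand-in for one cycle of a `Valid` interface, not a Chisel type.

```scala
// Hypothetical software analogue: `valid` says whether `bits` is meaningful this cycle.
case class ValidModel(valid: Boolean, bits: Int) {
    def toOption: Option[Int] = if (valid) Some(bits) else None
}

// Three cycles of traffic; the middle one carries no data.
val cycles   = List(ValidModel(true, 0), ValidModel(false, 1), ValidModel(true, 2))
val received = cycles.flatMap(_.toOption)   // keeps only the cycles that carried data
```

The analogy is only approximate: in hardware the `bits` wires always carry some value, and `valid` merely tells the consumer whether to look at it.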

### Decoupled - `ready & valid`

* Consumer can apply backpressure
* _**BEWARE**_ of _combinational loops_
  * Avoid using ready/valid input to combinationally create ready/valid output
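
One common way to respect the warning about combinational loops is to place a register stage between two decoupled interfaces. The sketch below assumes the Chisel 3 `Queue` factory from `chisel3.util` with its default `pipe = false` and `flow = false`; `BufferedStage` is a hypothetical module name.

```scala
import chisel3._
import chisel3.util._

// Hypothetical stage: a one-entry Queue sits between input and output, so neither
// ready nor valid passes combinationally from one side to the other.
class BufferedStage(n: Int) extends Module {
    val io = IO(new Bundle {
        val in  = Flipped(Decoupled(UInt(n.W)))
        val out = Decoupled(UInt(n.W))
    })
    // Queue(enq, entries) instantiates the queue, connects `io.in` to its enqueue side,
    // and returns its dequeue side.
    io.out <> Queue(io.in, 1)
}
```

The cost of this isolation is one cycle of latency, and with a single entry the stage accepts new data at most every other cycle.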

## Example: Using Chisel `Valid` (1/2)

In [28]:
class MakeValid(n: Int) extends Module {
    val io = IO(new Bundle {
        val en  = Input(Bool())
        val in  = Input(UInt(n.W))
        val out = Valid(UInt(n.W))
    })
    io.out.valid := io.en
    io.out.bits := io.in
}

println(getVerilog(new MakeValid(4)))

Elaborating design...
Done elaborating.
module MakeValid(
  input        clock,
  input        reset,
  input        io_en,
  input  [3:0] io_in,
  output       io_out_valid,
  output [3:0] io_out_bits
);
  assign io_out_valid = io_en; // @[cmd27.sc 7:18]
  assign io_out_bits = io_in; // @[cmd27.sc 8:17]
endmodule



defined class MakeValid

## Example: Using Chisel `Valid` (2/2)

In [29]:
class ValidReceiver(n: Int) extends Module {
    val io = IO(new Bundle {
        val in = Flipped(Valid(UInt(n.W)))
    })
    when (io.in.valid) {
        printf("  received %d\n", io.in.bits)
    }
}

// println(getVerilog(new ValidReceiver(4)))
test(new ValidReceiver(4)) { c =>
    for (cycle <- 0 until 8) {
        c.io.in.bits.poke(cycle.U)
        println(s"cycle: $cycle")
        c.io.in.valid.poke((cycle%2 == 0).B)
        c.clock.step()
    }
}

Elaborating design...
Done elaborating.
cycle: 0
  received   0
cycle: 1
cycle: 2
  received   2
cycle: 3
cycle: 4
  received   4
cycle: 5
cycle: 6
  received   6
cycle: 7
test ValidReceiver Success: 0 tests passed in 10 cycles in 0.007993 seconds 1251.12 Hz


defined class ValidReceiver

## Example: Using Chisel `Decoupled` (1/2)

In [30]:
class CountWhenReady(maxVal: Int) extends Module {
    val io = IO(new Bundle {
        val en  = Input(Bool())
        val out = Decoupled(UInt())
    })
    val advanceCounter = Wire(Bool())
    advanceCounter := io.en && io.out.ready
    val (count, wrap) = Counter(advanceCounter, maxVal)
    io.out.bits := count
    io.out.valid := io.en
}

println(getVerilog(new CountWhenReady(3)))

Elaborating design...
Done elaborating.
module CountWhenReady(
  input        clock,
  input        reset,
  input        io_en,
  input        io_out_ready,
  output       io_out_valid,
  output [1:0] io_out_bits
);
`ifdef RANDOMIZE_REG_INIT
  reg [31:0] _RAND_0;
`endif // RANDOMIZE_REG_INIT
  wire  advanceCounter = io_en & io_out_ready; // @[cmd29.sc 7:29]
  reg [1:0] count; // @[Counter.scala 60:40]
  wire  wrap_wrap = count == 2'h2; // @[Counter.scala 72:24]
  wire [1:0] _wrap_value_T_1 = count + 2'h1; // @[Counter.scala 76:24]
  assign io_out_valid = io_en; // @[cmd29.sc 10:18]
  assign io_out_bits = count; // @[cmd29.sc 9:17]
  always @(posedge clock) begin
    if (reset) begin // @[Counter.scala 60:40]
      count <= 2'h0; // @[Counter.scala 60:40]
    end else if (advanceCounter) begin // @[Counter.scala 118:17]
      if (wrap_wrap) begin // @[Counter.scala 86:20]
        count <= 2'h0; // @[Counter.scala 86:28]
      end else begin
        count <= _wrap_value_T_1; // @[Counte

defined class CountWhenReady

## Example: Using Chisel `Decoupled` (2/2)

In [7]:
class CountWhenReady(maxVal: Int) extends Module {
    val io = IO(new Bundle {
        val en  = Input(Bool())
        val out = Decoupled(UInt())
    })
    val (count, wrap) = Counter(io.out.fire, maxVal)
    io.out.valid := false.B
    io.out.bits := 0.U
    when (io.en) {
        io.out.enq(count)
    }
}

// println(getVerilog(new CountWhenReady(3)))

test(new CountWhenReady(3)) { c =>
    c.io.en.poke(true.B)
    for (cycle <- 0 until 7) {
        c.io.out.ready.poke((cycle%2 == 1).B)
        println(s"cycle: $cycle, count: ${c.io.out.bits.peek()}")
        c.clock.step()
    }
}

Elaborating design...
Done elaborating.
cycle: 0, count: UInt<1>(0)
cycle: 1, count: UInt<1>(0)
cycle: 2, count: UInt<1>(1)
cycle: 3, count: UInt<1>(1)
cycle: 4, count: UInt<2>(2)
cycle: 5, count: UInt<2>(2)
cycle: 6, count: UInt<1>(0)
test CountWhenReady Success: 0 tests passed in 9 cycles in 0.046964 seconds 191.64 Hz


defined class CountWhenReady

## Using Queues to Handle Backpressure

* If traffic is bursty, can use a _queue_ to smooth the traffic rate
  * Queue fills up when there is too much demand
  * When demand wanes, can drain the queue
* A queue can't solve a throughput mismatch
  * If the production rate always exceeds the consumption rate, a queue can't help
* A queue is a great place to use _decoupled_ interfaces
* Chisel's `util` provides a `Queue` generator

<img src="images/queue.svg" alt="ready/valid schematic" style="width:65%;margin-left:auto;margin-right:auto"/>
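
Both claims (a queue absorbs bursts, but cannot fix a sustained rate mismatch) can be illustrated with a rough cycle-level model in plain Scala. Everything here is a hypothetical sketch: `run` and the traffic patterns are made up for illustration, and the model lets a full queue accept a new item in the same cycle it drains one.

```scala
import scala.collection.mutable

// Hypothetical model. Each cycle the consumer drains one entry if it is ready, then the
// producer tries to enqueue one item if it is valid. Returns (accepted, stalled):
// items that entered the queue, and producer cycles refused because the queue was full.
def run(capacity: Int, producerValid: Seq[Boolean], consumerReady: Seq[Boolean]): (Int, Int) = {
    val q = mutable.Queue.empty[Int]
    var accepted = 0
    var stalled  = 0
    for ((valid, ready) <- producerValid.zip(consumerReady)) {
        if (ready && q.nonEmpty) q.dequeue()     // consumer takes the oldest entry
        if (valid) {
            if (q.size < capacity) { q.enqueue(accepted); accepted += 1 }
            else stalled += 1                    // back pressure: producer is refused
        }
    }
    (accepted, stalled)
}

val everyOther = Seq.tabulate(16)(i => i % 2 == 1)  // consumer ready on odd cycles
val burst      = Seq.tabulate(16)(i => i < 4)       // 4 items back to back, then idle
val always     = Seq.fill(16)(true)                 // producer never lets up

val smoothed   = run(capacity = 4, burst, everyOther)
val tooSmall   = run(capacity = 1, burst, everyOther)
val overloaded = run(capacity = 4, always, everyOther)
```

In this model the four-entry queue absorbs the burst without refusing the producer, the one-entry queue does not, and under sustained overload even the four-entry queue starts refusing once it has filled.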

## Using Chisel's `Queue`

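* Part of `chisel3.util`
* Specify the element type and number of entries: `Module(new Queue(UInt(4.W), 8))`
  * `pipe` - if true, a full queue can accept a new entry in the same cycle that one is dequeued (`enq.ready` depends combinationally on `deq.ready`)
  * `flow` - if true, an entry enqueued into an empty queue is visible at the output in the same cycle (`deq.valid` depends combinationally on `enq.valid`)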

<img src="images/queueReady.svg" alt="ready/valid schematic" style="width:85%;margin-left:auto;margin-right:auto"/>
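
The effect of the `pipe` and `flow` options is easiest to see on a one-entry queue. The plain-Scala model below is a hypothetical sketch of the behavior as I understand it from Chisel's `Queue`, not its actual implementation: `pipe` lets a full queue accept while it is being drained, and `flow` lets an empty queue forward its input directly to the output.

```scala
import scala.collection.mutable

// Hypothetical software model of a hardware queue with `pipe` and `flow` options.
class QueueModel(entries: Int, pipe: Boolean, flow: Boolean) {
    private val buf = mutable.Queue.empty[Int]
    def size: Int = buf.size

    // Simulates one clock cycle. Returns (enqReady, deqValid) as seen during the cycle,
    // then applies the enqueue/dequeue implied by the two handshakes.
    def cycle(enqValid: Boolean, enqBits: Int, deqReady: Boolean): (Boolean, Boolean) = {
        val empty    = buf.isEmpty
        val full     = buf.size == entries
        val enqReady = !full || (pipe && deqReady)            // pipe: accept while draining
        val deqValid = !empty || (flow && enqValid)           // flow: forward the input
        val bypass   = flow && empty && enqValid && deqReady  // data skips storage entirely
        if (deqReady && !empty) buf.dequeue()
        if (enqValid && enqReady && !bypass) buf.enqueue(enqBits)
        (enqReady, deqValid)
    }
}

// Producer always valid, consumer always ready, one entry of storage.
val plain   = new QueueModel(1, pipe = false, flow = false)
val plain0  = plain.cycle(true, 1, true)
val plain1  = plain.cycle(true, 2, true)   // full: the producer is refused this cycle

val piped   = new QueueModel(1, pipe = true, flow = false)
val piped0  = piped.cycle(true, 1, true)
val piped1  = piped.cycle(true, 2, true)   // full, but draining: the producer is accepted

val flowing  = new QueueModel(1, pipe = false, flow = true)
val flowing0 = flowing.cycle(true, 1, true) // empty: input is forwarded in the same cycle
```

Both options buy throughput or latency by adding a combinational path through the queue, which is exactly the kind of path the earlier warning about combinational loops asks you to watch.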

## Chisel `Queue` Demo (1/2)

In [18]:
class CountIntoQueue(maxVal: Int, numEntries: Int, pipe: Boolean, flow: Boolean) extends Module {
    val io = IO(new Bundle {
        val en  = Input(Bool())
        val out = Decoupled(UInt())
        val count = Output(UInt())
    })
    val q = Module(new Queue(UInt(), numEntries, pipe=pipe, flow=flow))
    val (count, wrap) = Counter(q.io.enq.fire, maxVal)
    q.io.enq.valid := io.en
    q.io.enq.bits := count
    io.out <> q.io.deq
    io.count := count // for visibility
}

// println(getVerilog(new CountIntoQueue(3,1,false,true)))

defined class CountIntoQueue

## Chisel `Queue` Demo (2/2)

In [24]:
test(new CountIntoQueue(3,1,false,true)) { c =>
    c.io.en.poke(true.B)
    c.io.out.ready.poke(false.B)
    for (cycle <- 0 until 4) {   // fill up queue
        println(s"f count:${c.io.count.peek()} out:${c.io.out.bits.peek()} v:${c.io.out.valid.peek()}")
        c.clock.step()
    }
    println()
    c.io.en.poke(false.B)
    c.io.out.ready.poke(true.B)
    for (cycle <- 0 until 4) {   // drain queue
        println(s"d count:${c.io.count.peek()} out:${c.io.out.bits.peek()} v:${c.io.out.valid.peek()}")
        c.clock.step()
    }
    println()
    c.io.en.poke(true.B)
    for (cycle <- 0 until 3) {   // simultaneous
        println(s"d count:${c.io.count.peek()} out:${c.io.out.bits.peek()} v:${c.io.out.valid.peek()}")
        c.clock.step()
    }
}

Elaborating design...
Done elaborating.
f count:UInt<1>(0) out:UInt<1>(0) v:Bool(true)
f count:UInt<1>(1) out:UInt<1>(0) v:Bool(true)
f count:UInt<1>(1) out:UInt<1>(0) v:Bool(true)
f count:UInt<1>(1) out:UInt<1>(0) v:Bool(true)

d count:UInt<1>(1) out:UInt<1>(0) v:Bool(true)
d count:UInt<1>(1) out:UInt<1>(1) v:Bool(false)
d count:UInt<1>(1) out:UInt<1>(1) v:Bool(false)
d count:UInt<1>(1) out:UInt<1>(1) v:Bool(false)

d count:UInt<1>(1) out:UInt<1>(1) v:Bool(true)
d count:UInt<2>(2) out:UInt<2>(2) v:Bool(true)
d count:UInt<1>(0) out:UInt<1>(0) v:Bool(true)
test CountIntoQueue Success: 0 tests passed in 13 cycles in 0.011434 seconds 1137.01 Hz
