## Agile Hardware Design
***
# Intro to Design Optimization & Memory

## Prof. Scott Beamer
### sbeamer@ucsc.edu

## [CSE 293](https://classes.soe.ucsc.edu/cse293/Winter22/)

## Plan for Today

* Intro to physical design (optimization)
* Limitations of memories
* Architectural interventions - banking, pipelining, ...

## Why Optimize Physical Design?

### Improve _Efficiency_!

### The real question: when is the effort worth it?
* Probably have _performance_ or _cost_ motivation, because otherwise writing software (to run on a processor) is probably much easier

* Will optimize in various phases of the design and in various ways
  * Keep an eye on cost/benefit ratio of this effort

## What Metrics to Optimize For?

### Power
* impacts battery life, thermals (heat), cost (energy bill and power supply)

### Performance
* how _fast_ it completes application goals 

### Area
* chip (or FPGA) resources required, i.e. _cost_

## Need a Target/Goal for Optimization

* Select an _implementation technology_ (e.g. specific ASIC process or FPGA)
  * Unlike software, many optimizations will not be portable
  * Tools also require some metrics as constraints (e.g. target clock rate or area)

* **Measure first** before optimizing heavily
  * Confirm optimization targets are actually problematic for PPA
  * Also consider how much could be gained from optimization effort

* A _hardware generator_ eases creation of many design alternatives
  * Hard to know in advance which design configuration will be best
  * A generator has a much better chance of being portable (i.e. useful) than a single design instance

* Making _tradeoffs_ across all 3 (PPA) metrics
  * Need way to weigh tradeoffs between metrics (more later in week)

## Some Differences Between Software Development & Chip Design

### "Easy" World of Software Development
* Much of effort goes into implementation (and verification)
* Typically only one metric to optimize: _performance_
  * Most performance optimizations are highly portable (e.g. most CPUs similar)
* Tools (i.e. compilers) are highly automated and almost never produce incorrect results

### "Hard" World of Chip Design
* After implementation, physical design optimization & verification are large hurdles
* Multiple metrics to optimize for: _power_, _performance_, _area_ (PPA)
  * Optimizations may be very technology dependent
* Tools (i.e. tool flows) require _significant_ human intervention and oversight

## Contrasting HW Design Philosophies
* Actual design flows may vary (or have aspects of both), but the two extremes are presented here

## Waterfall
![traditional loop](images/trad-hw.svg)

* Complete stage before moving on
* Late integration results in need to optimize & verify again

## Agile
![agile loop](images/agile-hw.svg)

* Integrate & verify design early
* Incrementally add features & optimize
* Go around loop as many times as necessary

## Main Stages of Chip Design Tool Flow

<img src="images/toolflow.svg" alt="toolflow phases" style="width:70%;margin-left:auto;margin-right:auto"/>

* Real tool flows much more complicated
* Tools at different stages interact frequently & re-run
  * Example: re-synthesize, place, and route a critical path
* Many more steps at end for handling physical details and aiding manufacturing
* Additionally, will want to verify design as it goes through these steps

## Why Worry About Memory?

### Memory can often contribute significantly to cost
* Off-chip - price of memory components as well as pins to interface
* On-chip - can consume significant area

### Memory can also limit performance
* Memory latency is well-known challenge for computer architecture
* Limited memory bandwidth can cap overall throughput

### What do we use memory for?
* Hold application data and intermediate state
* Think of memory as an all-purpose _"connector"_, i.e. a means to _communicate data in time_
  * May not know producer/consumer relationship in advance

## Memory Terminology

* _Capacity_ - size of storage (e.g. number of bits, bytes, words)
* _Latency_ - time to retrieve data (or complete a write)
* _Bandwidth_ - overall throughput (data/time e.g. bytes/second)
* _Request_ - command sent to memory (i.e. read or write)
* _Access width_ - amount of data transferred per request
* _Port_ - means of accessing memory with a request
  * Common types: read-only, write-only, or read/write
* _Requests in flight_ - in progress requests that have been sent to memory

<img src="images/terms.svg" alt="memory terminology" style="width:70%;margin-left:auto;margin-right:auto"/>
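The terms above combine into a simple upper bound on bandwidth. The sketch below is a back-of-the-envelope model, assuming every port accepts requests at a fixed rate and every request moves a full access width; the function name and the example numbers are hypothetical.

```python
def peak_bandwidth(num_ports, access_width_bytes, requests_per_second_per_port):
    """Upper bound on bandwidth in bytes/second.

    Assumes every port is busy all the time and each request
    transfers exactly one access width of data.
    """
    return num_ports * access_width_bytes * requests_per_second_per_port


# Hypothetical memory: 2 ports, 8-byte accesses, 1e9 requests/second per port.
bw = peak_bandwidth(2, 8, 1e9)
print(f"{bw / 1e9} GB/s")  # 16.0 GB/s
```

Real memories rarely sustain this bound, since ports sit idle whenever there are too few requests in flight.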

## Little's Law

### Parallelism = Throughput x Latency
* Helpful tool for architects to reason about tradeoffs
* All three terms are _averages_ (e.g. average latency)
* Often, one metric is fixed & you try to optimize another by improving the third
  * Example: latency is fixed but want to improve throughput => must increase parallelism
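As a worked example of the law, the sketch below computes how many requests must be in flight to hit a throughput target at a fixed latency, and what throughput a limited number of in-flight requests can reach. The function names and the numbers are illustrative assumptions, and all quantities are averages.

```python
def required_parallelism(throughput, latency):
    """Little's Law: average requests in flight = throughput x latency."""
    return throughput * latency


def achievable_throughput(parallelism, latency):
    """Little's Law rearranged: throughput = parallelism / latency."""
    return parallelism / latency


# Hypothetical memory: 100-cycle average latency, target of 1 request/cycle.
latency_cycles = 100
in_flight = required_parallelism(1.0, latency_cycles)
print(f"requests in flight needed: {in_flight}")  # 100.0

# With only 8 requests in flight, throughput is capped far below the target.
capped = achievable_throughput(8, latency_cycles)
print(f"throughput with 8 in flight: {capped} requests/cycle")  # 0.08
```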

## Practical Considerations & Constraints for Memory

_**Goal**_ - reduce cost and performance detriment of memory accesses

_**Common pain points**_
* Memory latency is too high
* Too many ports requested
* Can't provide desired bandwidth

_**Typical approaches**_ (often the architect's role)
* Reduce memory capacity demand and select densest (cheapest) technology to suffice
* Increase latency tolerance (when possible) in design
* Reduce bandwidth demands

_**Memory technologies**_ typically trade off cost/density for performance & energy efficiency
* Most expensive/fastest to cheapest/slowest: registers, SRAM, DRAM, PCM, flash

## Architectural Intervention: Banking Introduction

* _Problem_ - application desires greater request bandwidth than feasible to implement
  * Internally, the memory technology can only provide so many ports
  * The memory latency has already been reduced as far as is reasonable for that technology

* _Solution_ - _**banking**_
  * Break up memory into multiple _banks_
  * Each bank can service requests independently (increases request parallelism)
  * If implemented, can often be _parameterized_ nicely

<img src="images/banks-high.svg" alt="banking high-level" style="width:60%;margin-left:auto;margin-right:auto"/>

## Architectural Intervention: Banking Considerations

* How to divide requests across banks?
  * _Partition_ memory space across banks (e.g. N banks, each has 1/N of data)
    * Send requests to proper bank by hashing address
  * _Replicate_ data across banks (all banks hold same data)
    * Most effective for increasing number of _read_ ports

* Will banks have independent ports or share same port?
  * _Independent_ ports - still need to know how to select correct port
  * _Shared_ port - can time multiplex multiple requests on same port
    * Most effective when access latency >> request latency
    * Will need way to _tag_ requests to make responses clear

* What about very large memories?
  * A single memory has size limits, so banking is often inevitable
  * Some applications add memory for capacity and end up with surplus bandwidth
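To make the partitioning option concrete, here is a minimal behavioral sketch. It assumes the simplest possible address hash (address modulo number of banks) and that each bank services one request per cycle; both are illustrative assumptions rather than a prescribed design, and the function names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def bank_of(addr: int, num_banks: int) -> int:
    """Low-order interleaving: consecutive addresses map to consecutive banks."""
    return addr % num_banks


def cycles_to_service(addrs: Sequence[int], num_banks: int) -> int:
    """Cycles to service a batch when each bank handles one request per cycle.

    Banks work in parallel, so the most heavily loaded bank sets the time.
    """
    if not addrs:
        return 0
    load = Counter(bank_of(a, num_banks) for a in addrs)
    return max(load.values())


unit_stride = list(range(8))             # addresses 0, 1, ..., 7
stride_four = [4 * i for i in range(8)]  # addresses 0, 4, ..., 28

print(cycles_to_service(unit_stride, 1))  # 8: a single bank serializes everything
print(cycles_to_service(unit_stride, 4))  # 2: requests spread evenly over 4 banks
print(cycles_to_service(stride_four, 4))  # 8: every request lands in bank 0
```

The last case shows why the choice of hash matters: an access pattern whose stride matches the number of banks gets no benefit from banking under this mapping.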

## Smoothing Out Memory Bandwidth Demand

* _Problem_ - cost (beyond capacity) is often proportional to _peak_ bandwidth demand
  * Many applications are bursty in their use of memory bandwidth
  * Ideally, would pay for _average_ bandwidth rather than peak

* _Solution_ - _**smooth out**_ bandwidth demand over time
  * Reduce burstiness so application is continuously communicating
  * Common approaches: _pipelining_ & _double buffering_

<img src="images/traffic.svg" alt="traffic burstiness" style="width:70%;margin-left:auto;margin-right:auto"/>
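The gap between peak and average demand can be illustrated with a small model. The sketch below uses a simple queue as a stand-in for whatever mechanism does the smoothing (the slides name pipelining and double buffering); the demand trace, the `smooth` function, and the fixed service rate are all hypothetical.

```python
def smooth(demand, rate):
    """Serve a bursty demand trace at a fixed rate, queuing the excess.

    Returns (traffic sent per interval, largest backlog observed).
    """
    backlog = 0
    max_backlog = 0
    sent = []
    for d in demand:
        backlog += d
        s = min(backlog, rate)
        sent.append(s)
        backlog -= s
        max_backlog = max(max_backlog, backlog)
    while backlog > 0:  # drain anything still queued after the trace ends
        s = min(backlog, rate)
        sent.append(s)
        backlog -= s
    return sent, max_backlog


demand = [8, 0, 0, 0, 8, 0, 0, 0]     # bursty: units of data per interval
peak = max(demand)                    # 8
average = sum(demand) / len(demand)   # 2.0

sent, max_backlog = smooth(demand, rate=2)
print(peak, average)   # 8 2.0
print(sent)            # [2, 2, 2, 2, 2, 2, 2, 2]
print(max_backlog)     # 6
```

In this model, provisioning for the average (2) instead of the peak (8) costs buffering (a backlog of up to 6 units) and added delay, so the design has to tolerate the extra latency.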

## Arch. Intervention - Overlap Communication & Computation

* _Problem_ - compute is idle while memory is reading / writing

* _Solution_ - overlap memory accesses (communication) with their use (computation)
  * Definitely an example of _pipelining_
  * Often requires more _parallelism,_ as need additional requests to send to memory while computing on current data

<img src="images/overlap.svg" alt="overlapping communication and computation" style="width:80%;margin-left:auto;margin-right:auto"/>
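A rough timing model shows what overlap buys. The sketch below treats the work as a two-stage pipeline (fetch a block, then compute on it) with fixed per-block times, which is an idealized assumption; the function names and numbers are hypothetical.

```python
def serial_time(num_blocks, t_mem, t_compute):
    """No overlap: each block is fetched, then computed on, one after another."""
    return num_blocks * (t_mem + t_compute)


def overlapped_time(num_blocks, t_mem, t_compute):
    """Two-stage pipeline: fetch block i+1 while computing on block i.

    The first fetch and the last compute cannot be hidden; in between,
    the slower of the two stages sets the pace.
    """
    if num_blocks == 0:
        return 0
    return t_mem + (num_blocks - 1) * max(t_mem, t_compute) + t_compute


# Hypothetical workload: 10 blocks, 4 time units to fetch, 6 to compute.
print(serial_time(10, 4, 6))      # 100
print(overlapped_time(10, 4, 6))  # 64
```

Overlap hides the shorter of the two stages, so the best case (equal stage times, many blocks) approaches a 2x improvement under this model.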

## Arch. Intervention - Double Buffering

* _Problem_ - want to overlap communication & computation, but insufficient memory ports/bandwidth

* _Solution_ - use two memories (_buffers_)
  * Let compute work out of one memory
  * Perform needed communication out of other memory
  * Swap roles of memory when tasks complete

<img src="images/double.svg" alt="double buffering" style="width:90%;margin-left:auto;margin-right:auto"/>
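The role swapping can be sketched behaviorally. The model below is sequential software, so it only tracks which buffer plays which role at each step; in hardware the load and the compute within a step would run at the same time. The function name, the summing "computation", and the data are hypothetical.

```python
def run_double_buffered(blocks):
    """Sum each block using two buffers that swap roles every step.

    Returns (results, log), where each log entry is
    (buffer used for compute, buffer being loaded or None).
    """
    buffers = [None, None]
    results, log = [], []
    if not blocks:
        return results, log
    load_buf = 0
    buffers[load_buf] = list(blocks[0])  # prologue: nothing to overlap with yet
    for i in range(len(blocks)):
        compute_buf = load_buf           # the buffer just filled is now computed on
        load_buf = 1 - load_buf          # the other buffer becomes the load target
        has_next = i + 1 < len(blocks)
        if has_next:
            buffers[load_buf] = list(blocks[i + 1])  # communication
        results.append(sum(buffers[compute_buf]))    # computation
        log.append((compute_buf, load_buf if has_next else None))
    return results, log


results, log = run_double_buffered([[1, 2], [3, 4], [5, 6]])
print(results)  # [3, 7, 11]
print(log)      # [(0, 1), (1, 0), (0, None)]
```

Each buffer only ever serves one role per step, which is why a single port per memory can suffice.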

## Instantiating Memory in Practice

* _Off-chip memories_ (& interface) are carefully selected & planned

* On-chip memory for ASIC - typically registers or SRAM
  * SRAM arrays provided by foundry in preset sizes or from _memory compiler_
  * Deliberately instantiate needed memory cells
  * Will frequently want to codesign/tweak array sizes and architecture to match

* On-chip memory on FPGA - typically registers, LUT RAM, BRAM, URAM
  * Ideal: tools look at Verilog and _infer_ need for memory (more portable)
  * Sometimes need to use intrinsics or give tools a "nudge"

* Chisel describes behavior but not technology
  * Typically sufficient for registers, and on FPGAs for BRAM/URAM
  * For SRAM, typically instantiate _blackbox_ to wrap interface of provided array
  * Chisel can specify a read/write port

## Summary - Physical Design Intro + Memory Optimization

* Before designing hardware or optimizing it...
  * know why you need to build hardware rather than programming a CPU
  * measure PPA and compare to goal
* Close the loop early - find issues with design + backend tools early
  * More iterations through tools gives more opportunities to optimize
* Little's Law is a handy way to reason about latency, throughput, and parallelism
* Optimizing memory (off-chip or on-chip) should be done first
  * Can have a big impact on cost
  * Changing memory type (or organization) can require big changes, so don't want to do late in process
  * _Optimizations:_ banking, traffic shaping, comm. + comp. overlap, double buffering