# ZigZag-Project: Enabling Fast Architecture-Mapping DSE for Deep Learning Accelerators

#### **ISPASS2023 Tutorial**

Arne Symons, Linyan Mei, Guilherme Paim, Prof. Marian Verhelst

MICAS Labs, ESAT, KU Leuven, Belgium









#### **Tutorial Outline**



- Introduction (8:30 to 8:50)
- Lab 1: Assess HW performance of a DNN layer on an accelerator, with fixed temporal mapping (8:50 to 9:30)
- Lab 2: Automate temporal mapping optimization (9:30 to 10:00)
- Break (10:00 to 10:30)
- Lab 3: Understand the HW architecture definition (10:30 to 11:00)
- Lab 4: Explore layer-fused mappings on multi-core architectures using Stream (11:00 to 11:45)
- Concluding remarks (11:45 to 12:00)



# ZigZag Project



- Project started Dec. 2018
- Many contributors:
  - Prof. Marian Verhelst
  - Pouya Houshmand
  - Steven Colleman
  - Vikram Jain
  - Guilherme Paim
  - Koen Goetschalckx
  - Arne Symons
  - Linyan Mei
  - Victor JUNG (ETHz)
  - Sebastian Karl (TUM)


# A high-level view





# **DNN** Layer

| Loop               | Dimension              |
|--------------------|------------------------|
| for b = 0 to B-1   | B: I/O batch size      |
| for k = 0 to K-1   | K: O channel/W kernel  |
| for c = 0 to C-1   | C: I/W channel         |
| for oy = 0 to OY-1 | OY: O row              |
| for ox = 0 to OX-1 | OX: O column           |
| for fy = 0 to FY-1 | FY: W kernel row       |
| for fx = 0 to FX-1 | FX: W kernel column    |

$O[b][k][oy][ox] += I[b][c][oy+fy][ox+fx] \times W[k][c][fy][fx]$

|   | B | K | C | OY              | OX              | FY              | FX              |
|---|---|---|---|-----------------|-----------------|-----------------|-----------------|
| W | × | ✓ | ✓ | ×               | ×               | ✓               | ✓               |
| I | ✓ | × | ✓ | ?<sup>IY</sup>  | ?<sup>IX</sup>  | ?<sup>IY</sup>  | ?<sup>IX</sup>  |
| O | ✓ | ✓ | × | ✓               | ✓               | ×               | ×               |

✓ relevant (r)

× irrelevant (ir)

? partially relevant (pr)

?<sup>IX/IY</sup> partially relevant to IX/IY

#### A DNN Conv2D layer:

3D operand (W/I/O) space.

**7D** nested for-loop MAC operation.

Each Operand has its own (ir)relevant loop dimensions.

- r loops contribute to data size.
- ir loops contribute to data reuse.
- pr loops contribute to both data size and data reuse.
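The relation between loop relevance, data size, and data reuse can be made concrete with a small sketch. It assumes stride 1 and no padding (so IY = OY + FY - 1 and IX = OX + FX - 1, matching the index expression `oy+fy`, `ox+fx`); the function names and the example sizes are illustrative and not part of ZigZag.

```python
from math import prod

# Loop dimensions each operand depends on (the "r" loops of the table).
RELEVANT = {
    "W": ("K", "C", "FY", "FX"),
    "O": ("B", "K", "OY", "OX"),
}


def operand_stats(sizes):
    """Return {operand: (data size in elements, reuse factor)} for W and O."""
    total_macs = prod(sizes.values())
    stats = {}
    for operand, dims in RELEVANT.items():
        size = prod(sizes[d] for d in dims)          # r loops set the data size
        stats[operand] = (size, total_macs // size)  # ir loops set the reuse
    return stats


def input_size(sizes):
    """Input size for stride 1, no padding: OY/FY and OX/FX are pr loops."""
    iy = sizes["OY"] + sizes["FY"] - 1
    ix = sizes["OX"] + sizes["FX"] - 1
    return sizes["B"] * sizes["C"] * iy * ix


# Arbitrary small layer, chosen only for illustration.
sizes = {"B": 1, "K": 8, "C": 4, "OY": 6, "OX": 6, "FY": 3, "FX": 3}
stats = operand_stats(sizes)
```

For W, the reuse factor equals B × OY × OX (its irrelevant loops); for O, it equals C × FY × FX. The input size is smaller than B × C × OY × OX × FY × FX because the partially relevant loops overlap in the input feature map.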



# **DNN** Layer



| Workload                     | I/O batch size | O channel | I/W channel | O row | O column | W row | W column |
|------------------------------|----------------|-----------|-------------|-------|----------|-------|----------|
| Conv2D (right fig.)          | B              | K         | C           | OY    | OX       | FY    | FX       |
| Conv1D                       | B              | K         | C           | 1     | OX       | 1     | FX       |
| Depthwise Conv2D*            | B              | 1         | 1           | OY    | OX       | FY    | FX       |
| Pointwise Conv2D             | B              | K         | C           | OY    | OX       | 1     | 1        |
| Matrix-vector multiplication | 1              | K         | C           | 1     | 1        | 1     | 1        |
| Matrix-matrix multiplication | B              | K         | C           | 1     | 1        | 1     | 1        |

<sup>\*</sup> Repeat 'C' or 'K' times to finish one Depthwise Conv2D layer (C = K).

Most **ML** workloads fit into the regular **nested for-loop format**.

No data dependencies between the for-loops, so they can be reordered and split freely.
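As a small check of the table, the sketch below runs the 7D Conv2D loop nest with OY = OX = FY = FX = 1 and recovers a matrix-matrix multiplication. It is plain Python written for illustration (helper names are hypothetical), not ZigZag code.

```python
def conv2d_loop_nest(I, W, sizes):
    """Plain 7D loop nest: O[b][k][oy][ox] += I[b][c][oy+fy][ox+fx] * W[k][c][fy][fx]."""
    B, K, C, OY, OX, FY, FX = (sizes[d] for d in ("B", "K", "C", "OY", "OX", "FY", "FX"))
    O = [[[[0] * OX for _ in range(OY)] for _ in range(K)] for _ in range(B)]
    for b in range(B):
        for k in range(K):
            for c in range(C):
                for oy in range(OY):
                    for ox in range(OX):
                        for fy in range(FY):
                            for fx in range(FX):
                                O[b][k][oy][ox] += I[b][c][oy + fy][ox + fx] * W[k][c][fy][fx]
    return O


def matmul_as_conv(A, Wm):
    """Multiply A (B x C) with Wm (K x C, one row per output channel) via the conv nest."""
    B, C, K = len(A), len(A[0]), len(Wm)
    sizes = {"B": B, "K": K, "C": C, "OY": 1, "OX": 1, "FY": 1, "FX": 1}
    I = [[[[A[b][c]]] for c in range(C)] for b in range(B)]    # I[b][c][0][0]
    W = [[[[Wm[k][c]]] for c in range(C)] for k in range(K)]   # W[k][c][0][0]
    O = conv2d_loop_nest(I, W, sizes)
    return [[O[b][k][0][0] for k in range(K)] for b in range(B)]


result = matmul_as_conv([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Here `result[b][k]` is the dot product of row `b` of `A` with row `k` of `Wm`, which is exactly the B/K/C row of the table.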





#### **DNN** Accelerator





Large Design Degrees of Freedom!



# Mapping (a.k.a. Dataflow)





(Loop tiling, ordering)

#### **Data Stationarity**

| Loop               | Dimension              |
|--------------------|------------------------|
| for b = 0 to B-1   | B: I/O batch size      |
| for k = 0 to K-1   | K: O channel/W kernel  |
| for c = 0 to C-1   | C: I/W channel         |
| for oy = 0 to OY-1 | OY: O row              |
| for ox = 0 to OX-1 | OX: O column           |
| for fy = 0 to FY-1 | FY: W kernel row       |
| for fx = 0 to FX-1 | FX: W kernel column    |

$O[b][k][oy][ox] += I[b][c][oy+fy][ox+fx] \times W[k][c][fy][fx]$

**Operation Parallelism** 

Large Design Degrees of Freedom!

Spatial Mapping (Loop unrolling)
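To illustrate how temporal loop ordering leads to data stationarity, the sketch below uses a reduced two-loop nest, `O[k] += I[c] * W[k][c]`, and counts how often each operand's index changes between consecutive iterations. This is a simplified proxy for innermost-level fetches, written for illustration only; the function name and the sizes are assumptions.

```python
from itertools import product

# Loop dimensions each operand depends on in O[k] += I[c] * W[k][c].
RELEVANT = {"W": ("K", "C"), "I": ("C",), "O": ("K",)}


def count_fetches(order, sizes):
    """Count index changes per operand; `order` lists loops from outermost to innermost."""
    fetches = {op: 0 for op in RELEVANT}
    last = {op: None for op in RELEVANT}
    for values in product(*(range(sizes[d]) for d in order)):
        idx = dict(zip(order, values))
        for op, dims in RELEVANT.items():
            key = tuple(idx[d] for d in dims)
            if key != last[op]:   # operand index changed: a new element is needed
                fetches[op] += 1
                last[op] = key
    return fetches


sizes = {"K": 4, "C": 3}
output_stationary = count_fetches(("K", "C"), sizes)  # C innermost: O[k] stays put
input_stationary = count_fetches(("C", "K"), sizes)   # K innermost: I[c] stays put
```

Placing an operand's irrelevant loop innermost keeps that operand stationary: with C innermost the output is touched only K times, while with K innermost the input is touched only C times.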



# Loop Ordering and Splitting





# Co-Exploration







#### **Technology and Others**

**Technology**: 65nm/40nm/28nm/..., NVM, CIM, 3D IC, etc.

**Others**: Sparsity, various precisions, cross-layer execution, etc.

HUGE design space at each level & at combined levels.



# **Getting Started**



#### https://github.com/ZigZag-Project/zigzag

- \$ git clone git@github.com:ZigZag-Project/zigzag.git
- \$ cd zigzag
- \$ conda create --name my-zigzag-env python=3.10
- \$ conda activate my-zigzag-env
- \$ pip install -r requirements.txt
- \$ git checkout ispass2023-tutorial
- \$ code .





- Open lab1/main.py
- Expects three **arguments**:
  - accelerator
  - model
  - mapping
- Extracts names from the given arguments and sets inputs
- Defines the sequence of stages to be executed
- Runs the sequence of stages with inputs
- Plots the returned CostModelEvaluation (CME)
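The flow of such an entry script can be sketched as follows. This is a simplified stand-in, not the contents of lab1/main.py: the flag names (`--model`, `--accelerator`, `--mapping`), both helper functions, and the idea of stages as plain callables are assumptions made for illustration.

```python
import argparse
import os


def parse_inputs(argv=None):
    """Parse the three arguments and derive short names for output files."""
    parser = argparse.ArgumentParser(description="Hypothetical lab-style entry point")
    parser.add_argument("--model", required=True)
    parser.add_argument("--accelerator", required=True)
    parser.add_argument("--mapping", required=True)
    args = parser.parse_args(argv)
    names = {
        # "lab1/resnet18_first_layer.onnx" -> "resnet18_first_layer"
        "model": os.path.splitext(os.path.basename(args.model))[0],
        # "zigzag.inputs.examples.hardware.TPU_like" -> "TPU_like"
        "accelerator": args.accelerator.split(".")[-1],
        "mapping": args.mapping.split(".")[-1],
    }
    return args, names


def run_stages(stages, inputs):
    """Run placeholder stages in order; each takes and returns the shared dict."""
    for stage in stages:
        inputs = stage(inputs)
    return inputs


args, names = parse_inputs([
    "--model", "lab1/resnet18_first_layer.onnx",
    "--accelerator", "zigzag.inputs.examples.hardware.TPU_like",
    "--mapping", "mapping",
])
```

The extracted names are typically used to build output paths such as the dump file name and the plot path.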





#### Model (workload)

First layer of ResNet18 (ONNX format)



#### **Accelerator**



## **Mapping**

Defines mapping of layers onto accelerator










## First experiment:

- model = "lab1/resnet18\_first\_layer.onnx"
- accelerator = "zigzag.inputs.examples.hardware.TPU\_like"
- mapping = "mapping"
- Run lab1/main.py





#### Second experiment:

- Modify lab1/mapping.py:
  - Change temporal loop ordering
- Run lab1/main.py
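When editing the temporal loop ordering by hand, it helps to check that the loops still cover the whole layer. The sketch below assumes the ordering is written as a list of `(dimension, size)` tuples and ignores any spatial unrolling; both are assumptions for illustration, so check lab1/mapping.py for the actual format. The layer sizes are those of the first ResNet18 convolution (7×7 kernel, 3 input channels, 64 output channels, 112×112 output).

```python
# Loop sizes of the first ResNet18 convolution layer.
LAYER_DIMS = {"B": 1, "K": 64, "C": 3, "OY": 112, "OX": 112, "FY": 7, "FX": 7}


def covered_sizes(temporal_ordering):
    """Multiply the loop factors per dimension (a dimension may be split in several loops)."""
    sizes = {}
    for dim, factor in temporal_ordering:
        sizes[dim] = sizes.get(dim, 1) * factor
    return sizes


def is_valid(temporal_ordering, layer_dims):
    """True if the loops cover every layer dimension exactly (missing loops count as 1)."""
    covered = covered_sizes(temporal_ordering)
    known = set(covered) <= set(layer_dims)
    return known and all(covered.get(d, 1) == s for d, s in layer_dims.items())


# Two hypothetical orderings over the same loops, including a split OX loop.
ordering_a = [("FX", 7), ("FY", 7), ("C", 3), ("OX", 112), ("OY", 112), ("K", 64)]
ordering_b = [("K", 64), ("OX", 16), ("OY", 112), ("C", 3), ("OX", 7), ("FX", 7), ("FY", 7)]
```

Both orderings compute the same layer; they differ only in which data stays stationary at each memory level, which is what the experiment is meant to expose.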



# Lab 2: Automating temporal mapping



- Copy lab1/main.py → lab2/main.py
  - Replace TemporalOrderingConversionStage → LomaStage
  - Change dump\_filename\_pattern
  - Change plotting save\_path
- Copy lab1/mapping.py → lab2/mapping.py
  - Remove temporal\_ordering
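The idea behind searching temporal loop orderings automatically can be sketched with loop prime factors (LPFs): each loop size is split into its prime factors, and the candidate orderings are the distinct permutations of those factors. This is a rough illustration of the search space under that assumption, not the LomaStage implementation; all function names here are hypothetical.

```python
from itertools import permutations


def prime_factors(n):
    """Return the prime factors of n in ascending order, e.g. 12 -> [2, 2, 3]."""
    factors, p = [], 2
    while p * p <= n:
        while n % p == 0:
            factors.append(p)
            n //= p
        p += 1
    if n > 1:
        factors.append(n)
    return factors


def loop_prime_factors(loop_sizes):
    """Split every loop into (dimension, prime factor) pairs."""
    return [(dim, f) for dim, size in loop_sizes.items() for f in prime_factors(size)]


def distinct_orderings(loop_sizes):
    """Enumerate all distinct orderings of the LPFs (only feasible for tiny examples)."""
    return sorted(set(permutations(loop_prime_factors(loop_sizes))))


orderings = distinct_orderings({"K": 4, "C": 3})
```

The number of orderings grows factorially with the number of LPFs, which is why an automated stage with search limits is preferable to hand-written orderings for realistic layers.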






## First experiment:

- model = "lab2/resnet18\_first\_layer.onnx"
- accelerator = "zigzag.inputs.examples.hardware.TPU\_like"
- mapping = "mapping"
- Run lab2/main.py



#### Lab 2: User-defined workload



- Copy lab2/main.py → lab2/main\_user\_defined.py
  - Replace ONNXModelParserStage → WorkloadParserStage
  - Change dump\_filename\_pattern
  - Change plotting save\_path






## Second experiment:

- model = "resnet18\_first\_layer"
- accelerator = "zigzag.inputs.examples.hardware.TPU\_like"
- mapping = "mapping"
- Run lab2/main\_user\_defined.py
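A user-defined workload describes the layer directly instead of reading it from an ONNX file. The sketch below shows what such a definition might look like for the first ResNet18 layer. The dictionary layout and the field names `equation`, `dimension_relations`, and `loop_dim_size` are assumptions for illustration; consult the workload file in the repository for the exact format. The loop sizes and the stride of 2 are those of the standard ResNet18 first convolution.

```python
from math import prod

# Hypothetical user-defined workload: one entry per layer, keyed by layer id.
workload = {
    0: {
        "operator_type": "Conv",
        "equation": "O[b][k][oy][ox]+=W[k][c][fy][fx]*I[b][c][iy][ix]",
        # Stride 2: input coordinates as a function of output and kernel coordinates.
        "dimension_relations": ["ix=2*ox+1*fx", "iy=2*oy+1*fy"],
        "loop_dim_size": {"B": 1, "K": 64, "C": 3, "OY": 112, "OX": 112, "FY": 7, "FX": 7},
        "operand_source": {"W": [], "I": []},   # no producer layers
        "constant_operands": ["W", "I"],        # both come from outside the workload
    }
}


def total_macs(layer):
    """Total MAC operations: the product of all loop sizes."""
    return prod(layer["loop_dim_size"].values())


macs = total_macs(workload[0])
```

The MAC count is a quick sanity check that the loop sizes describe the intended layer before running the full evaluation.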

ZigZag is also distributed on PyPI

```
~/zigzag$ pip install zigzag-dse
```

API call for common use-case:

```
from zigzag.api import get_hardware_performance_zigzag

# Signature of the API function, with its default arguments:
# get_hardware_performance_zigzag(
#     workload,
#     accelerator,
#     mapping,
#     opt="latency",
#     dump_filename_pattern="outputs/{datetime}.json",
#     pickle_filename="outputs/list_of_cmes.pickle",
# )
```



## Break



- Lab 3 & 4 after the break
- Start at 10:30 AM



# Lab 3: Architectural impact



- Open lab3/inputs/hardware/...
  - Definition of multiplier array
  - Definition of memory hierarchy
  - Definition of core
- Open lab3/inputs/mapping/...
  - Definition of spatial mappings
- Open lab3/main.py
  - Uses API call for every core architecture mapping
  - Uses API call for architecture comparison plot






#### First experiment:

Run lab3/main.py



# Conclusion: ZigZag



- Hardware accelerator model based on array of multipliers and attached memory hierarchy
- Hardware performance estimation of DNN layer through analytical cost model
- Optimization of layer mapping through different stages
- Enables co-exploration of accelerator & mapping



# From ZigZag to Stream



| Framework | Workload       | Hardware                | Mapping                          |
|-----------|----------------|-------------------------|----------------------------------|
| ZigZag    | A NN layer     | Single-core accelerator | Single-layer mapping             |
| Stream    | One/more NN(s) | Multi-core accelerator  | Fine-grained layer-fused mapping |

Focus on Stream for the rest of the session



## **Traditional DNN Acceleration**



#### An example workload





#### An example accelerator





## Multi-core DNN Acceleration



#### An example workload





#### A multi-core accelerator





# Layer-fused processing



#### Tiled for layer-fused processing





#### A multi-core accelerator







#### Stream





#### Tiled for layer-fused scheduling





#### Schedule the workload to hardware accelerators






A Multi-core Arch

(TPU-like dataflow accelerator)

#### What can Stream do?



- ✓ Model single-core architecture (identical to ZigZag)
- ✓ Model different multi-core topologies
- ✓ Model different scheduling granularities
- ✓ Model different scheduling heuristics



#### Stream Overview







# **Getting Started**



#### https://github.com/ZigZag-Project/stream

- \$ git clone git@github.com:ZigZag-Project/stream.git
- \$ cd stream
- \$ conda create --name my-stream-env python=3.10
- \$ conda activate my-stream-env
- \$ pip install -r requirements.txt
- \$ git checkout ispass2023-tutorial
- \$ code .





- Open lab4/main\_fixed.py
- Defines inputs directly in file instead of arguments
- Extracts (from input names) and defines variables
- Sets up the sequence of stages
- Runs the stages
- Plots the results





#### **Workload**

- Open lab4/inputs/workload/duplicated\_resnet18\_layer\_fixed.py
- First layer of ResNet18 duplicated 4 times
- Dependencies between the layers (operand\_source)
- operator\_type overloaded for fixed mapping





#### **Accelerator**

- Open lab4/inputs/hardware/heterogeneous\_quadcore.py
- Imports the "computational cores"
- Imports the pooling and simd cores
- Imports the offchip core
- Creates a 2D mesh of these cores
- Defines the multi-core Accelerator object
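The 2D mesh step can be pictured as building a list of links between neighbouring cores. The sketch below is a plain-Python illustration with a hypothetical helper and row-major core ids; it is not the code in heterogeneous\_quadcore.py and does not model the pooling, simd, or offchip cores.

```python
def mesh_links(rows, cols):
    """Return directed links between horizontally and vertically adjacent cores."""

    def core_id(r, c):
        return r * cols + c  # row-major core numbering (an assumption)

    links = set()
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:  # neighbour to the right
                a, b = core_id(r, c), core_id(r, c + 1)
                links |= {(a, b), (b, a)}
            if r + 1 < rows:  # neighbour below
                a, b = core_id(r, c), core_id(r + 1, c)
                links |= {(a, b), (b, a)}
    return sorted(links)


quadcore_links = mesh_links(2, 2)
```

In a 2×2 mesh, diagonal cores (0 and 3, or 1 and 2) have no direct link, so data exchanged between them has to pass through an intermediate core.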





## **Mapping**

- Open lab4/inputs/mapping/mapping\_fixed.py
- Defines for each operator\_type the possible layer-core allocations

### First experiment:

- model = "...duplicated\_resnet18\_layer\_fixed"
- accelerator = "...heterogeneous\_quadcore"
- mapping = "...mapping\_fixed"
- Run lab4/main\_fixed.py





### Second experiment:

- What happens if we remove the layer dependencies?
- Remove the operand\_source and modify the constant\_operands
- Run lab4/main\_fixed.py
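The effect of removing the dependencies can be previewed by looking at the longest chain of dependent layers. The sketch below assumes that `operand_source` lists, per operand, the ids of the producer layers, and that a layer without producers treats its input as a constant operand; the helper names and the reduced layer dictionaries are illustrative only.

```python
def duplicated_layers(n, chained):
    """Build n identical layers, either chained through their input or independent."""
    workload = {}
    for i in range(n):
        has_producer = chained and i > 0
        workload[i] = {
            "operand_source": {"W": [], "I": [i - 1] if has_producer else []},
            "constant_operands": ["W"] if has_producer else ["W", "I"],
        }
    return workload


def chain_depth(workload):
    """Length (in layers) of the longest dependency chain."""
    depth = {}

    def visit(layer_id):
        if layer_id not in depth:
            producers = [p for src in workload[layer_id]["operand_source"].values() for p in src]
            depth[layer_id] = 1 + max((visit(p) for p in producers), default=0)
        return depth[layer_id]

    return max(visit(layer_id) for layer_id in workload)


with_deps = chain_depth(duplicated_layers(4, chained=True))
without_deps = chain_depth(duplicated_layers(4, chained=False))
```

With dependencies the four layers form a single chain and must run one after another; without them every layer is independent, so a multi-core schedule can run them in parallel.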



### Third experiment:

- Allow genetic algorithm to find best layer-core allocation
- model = "...duplicated\_resnet18\_layer"
  - Modified operator\_type
- mapping = "...mapping"
  - Modified for flexible layer-core allocation
- Run lab4/main\_layer\_by\_layer.py
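The difference between the fixed and the flexible mapping can be seen in the size of the allocation space the genetic algorithm has to search. The sketch below assumes a mapping dictionary keyed by operator type with a `core_allocation` list of candidate cores; the key names, the `Conv_0` to `Conv_3` operator types, and the core ids are hypothetical.

```python
from itertools import product


def candidate_allocations(layer_types, mapping):
    """Enumerate every layer-to-core assignment allowed by the mapping."""
    options = [mapping.get(t, mapping["default"])["core_allocation"] for t in layer_types]
    return list(product(*options))


# Fixed: each (overloaded) operator type is pinned to exactly one core.
fixed_mapping = {
    "default": {"core_allocation": [0]},
    "Conv_0": {"core_allocation": [0]},
    "Conv_1": {"core_allocation": [1]},
    "Conv_2": {"core_allocation": [2]},
    "Conv_3": {"core_allocation": [3]},
}
# Flexible: every layer may run on any of the four cores.
flexible_mapping = {"default": {"core_allocation": [0, 1, 2, 3]}}

fixed = candidate_allocations(["Conv_0", "Conv_1", "Conv_2", "Conv_3"], fixed_mapping)
flexible = candidate_allocations(["Conv"] * 4, flexible_mapping)
```

Four layers on four cores already give 256 candidate allocations; for full networks exhaustive enumeration becomes impractical, which motivates a heuristic search such as a genetic algorithm.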

## Layer Fusion: Computation Node (CN)







Computation node (CN) granularity impacts:

- Scheduling flexibility
- Core utilization
- Intra-CN data reuse
- Data loading overhead
- Control overhead
- and others...





- Open lab4/main\_layer\_fused.py
- hint\_loops defines the outer-CN loops
  - hint\_loops = [("OY", 2)] means 2 CNs per layer
  - hint\_loops = [("OY", "all")] means OY CNs per layer
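The two `hint_loops` settings can be read as a rule for splitting a layer into CNs. The sketch below follows that reading for a layer with OY = 112 and assumes the split divides the dimension evenly; the helper functions are hypothetical and do not reproduce Stream's implementation.

```python
def cns_per_layer(hint_loops, loop_dim_size):
    """Number of CNs a layer is split into for the given outer-CN loops."""
    total = 1
    for dim, split in hint_loops:
        size = loop_dim_size[dim]
        count = size if split == "all" else split
        if size % count != 0:
            raise ValueError(f"{dim}={size} is not divisible into {count} parts")
        total *= count
    return total


def row_ranges(n_cns, oy):
    """Output-row range [start, stop) handled by each CN when splitting along OY."""
    step = oy // n_cns
    return [(i * step, (i + 1) * step) for i in range(n_cns)]


dims = {"OY": 112, "OX": 112}
coarse = cns_per_layer([("OY", 2)], dims)      # two CNs of 56 rows each
fine = cns_per_layer([("OY", "all")], dims)    # one CN per output row
```

Finer CNs give the scheduler more freedom to overlap dependent layers, at the cost of the overheads listed above.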





### Fourth experiment:

- Assess the impact of layer-fused processing
- Run lab4/main\_layer\_fused.py





- Open lab4/main.py
- End-to-end ResNet18 onnx model
- Layer-by-layer





### Last experiment:

- End-to-end ResNet18 example
- Run lab4/main.py (layer-by-layer)
- If time permits:
  - Modify hint\_loops with ("OY", "all")
  - Re-run and analyze differences



### What's to come?



- Automatically infer optimal CN granularity
- Integrate inter-core connect energy estimation framework
- Automatically search for optimal multi-core architectures
- Add optimization constraints (e.g. max latency, area, ...)
- Code generation for existing accelerators
- Automatic hardware generation from hardware templates



## Conclusion



- ZigZag enables fast hardware performance estimation for specialized DNN accelerator architectures
- Mapping optimizations yield better performance
- Co-exploration of architecture with mapping
- Stream extends the capabilities to multi-core architectures employing layer-fused scheduling
- Unified hardware model for different architecture topologies
- Different scheduling granularities through computation node



# Materials of ZigZag-project



- Goal: Enabling Fast Architecture-Scheduling/Mapping DSE for Machine Learning Accelerators
- GitHub: https://github.com/ZigZag-Project
  - ZigZag
  - ZigZag-Demo
  - DeFiNES
  - Stream
- ZigZag documentation: https://zigzag-project.github.io/zigzag/
- Stream documentation: Underway (end of May)
- ZigZag-related publications: https://zigzag-project.github.io/zigzag/publications.html