# ZUMA User's Manual

Tobias Wiersema, Arne Bockhorn, Alexander D. Brant, Monica Keerthipati, Nithin S. Sabu, and Felix P. Jentzsch

30.05.2020

# Contents

- 1 Introduction
  - 1.1 Contributors and Brief History
- 2 Installation
  - 2.1 VTR flow
  - 2.2 Yosys
- 3 Getting started
  - 3.1 Running the Example
  - 3.2 ZUMA Tool Flow
  - 3.3 Bit to BLIF
- 4 Caveats and Restrictions
- 5 Including a ZUMA Overlay in a Project
  - 5.1 Including a ZUMA Overlay in a Xilinx Vivado Project
    - 5.1.1 Troubleshooting
- 6 Background and Advanced Usage
  - 6.1 Basic FPGA Model Used by ZUMA
    - 6.1.1 Global Structure
    - 6.1.2 Local Structure
    - 6.1.3 Structure Configuration Parameters
    - 6.1.4 Data Structures
  - 6.2 ZUMA Tool Flow Details
    - 6.2.1 Basic Flow
    - 6.2.2 Timing-augmented Flow
  - 6.3 Generated Data Structures and Files
    - 6.3.1 Overlay Description Graph
    - 6.3.2 Verilog file
    - 6.3.3 Configuration Bitstream
  - 6.4 ZUMA Overlay Configuration Process
    - 6.4.1 Module Description
    - 6.4.2 Configuration Details
    - 6.4.3 The Configuration Module fixed\_config
    - 6.4.4 Wrapper Module
    - 6.4.5 ZUMA\_custom\_generated module
- 7 Comprehensive List of all Configuration Parameters
- References

# 1 Introduction

This repository contains the ZUMA FPGA overlay architecture system that was introduced by Brant and Lemieux in 2012 [1, 2] and later extended by Wiersema, Bockhorn and Platzner [3, 4] and several students of Paderborn University. ZUMA is an open-source, cross-compatible embedded FPGA architecture that is intended as an overlay on top of an existing FPGA, in essence an "FPGA-on-an-FPGA." This approach of a virtual FPGA has a number of benefits, including bitstream compatibility between different vendors and parts, compatibility with open FPGA tool flows, and the ability to embed some programmable logic into systems on FPGAs without the need for releasing or recompiling the master netlist.

This manual provides an overview of the ZUMA system and contains the following elements:

- 1. Instructions on how to prepare the repository and external tools to be able to run the overlay generation flow (Section 2).
- 2. Instructions on how to get started with the basic tool flow of ZUMA (Section 3).
- 3. Instructions for including a generated overlay into a new or existing FPGA design (Section 5).
- 4. Details on advanced usage scenarios, an in-depth description of the underlying FPGA model of the virtual FPGA and how it corresponds to the build parameters of the ZUMA system, as well as an overview of the generated output files and their layout (Section 6). This should allow you to generate advanced, custom-built virtual FPGAs that exactly fit your needs.

The folders included in this repository contain a number of components needed to use ZUMA, as well as examples and tests. The directory structure is as follows:

| Path               | Contents                                                                                                              |
|--------------------|-----------------------------------------------------------------------------------------------------------------------|
| doc/               | Contains this documentation.                                                                                          |
| example/           | Contains a ZUMA preferences file, sample Verilog and timing SDF files, and a script to compile.                       |
| external/          | Required third party tools as GIT submodules.                                                                         |
| misc/              | Contains a patch that is required to use (the very old) VPR6 with ZUMA.                                               |
| source/            | Scripts to generate the ZUMA Verilog components and bitstreams.                                                       |
| tests/             | Included scripts used to test ZUMA components.                                                                        |
| tests/integration/ | Python unit tests to automatically assert the correct installation and behavior of the ZUMA scripts.                  |
| verilog/           | Verilog files used for building a ZUMA system, including platform-specific and simulation files.                      |
| license.txt        | The license under which ZUMA can be used.                                                                             |
| Makefile           | Global Makefile to prepare a working tool flow for overlay generation.                                                |
| toolpaths.py       | Global path setup to tie in the third party tools; can be adapted if the provided tool submodules shall not be used.  |

### 1.1 Contributors and Brief History

Many wonderful people have contributed to the ZUMA project. This section is a (very) brief history of how the project was shaped by them.

In **2012**, Alexander D. Brant wrote the first version of the ZUMA generator scripts as a graduate student under the guidance of Guy G. F. Lemieux, with whom he published the first paper on ZUMA [1].

In 2013 / 2014, Tobias Wiersema, as a research assistant to Marco Platzner, extended the implementation to also support sequential circuits, by extending the eLUT LUTRAM macro instantiation, added the optional permutation layer to the overlay, and embedded it into a ReconOS<sup>1</sup> hardware thread that can receive its configuration at runtime from a software thread through shared memory. Although this embedding is not bundled here, it was the basis for streamlining the configuration process within the test bench wrapper.

In the years 2014 - 2016, Arne Bockhorn rewrote much of the ZUMA code as a student research assistant to Tobias Wiersema. He extended it with the timing analysis features, realized the initial description of the Clos network-based IIBs, made it compatible with VTR1 / VPR7, wrote the Bitstream-to-BLIF back-translation tool, and helped with the ReconOS integration. In this period, Arne, Marco, and Tobias also published their two ZUMA papers [3, 4].

In 2017 / 2018, Nithin S. Sabu and Monica Keerthipati realized the first Xilinx Vivado version of ZUMA as graduate students together with their student project group ReCoTroy, supervised by Tobias Wiersema. They also contributed the corresponding guide in Section 5.1.

In 2019, Felix P. Jentzsch implemented the first Xilinx Vivado nested block design that contained a ZUMA overlay as a student research assistant to Tobias Wiersema. He contributed the corresponding instructions in Section 5.1.1.

In 2020, Arne Bockhorn rewrote the generator scripts again to make them more maintainable and compatible with version 8 of VTR / VPR, extended the timing analysis to a full virtual timing-driven placement & routing flow version, enabled a ZUMA overlay description in hierarchical HDL, and wrote integration tests to be able to ensure the correctness of features and to spot regressions. Tobias Wiersema aggregated all available ZUMA documentation, compiled the first versions of this user's manual, and enabled the integration of the external tools directly within the context of this GIT repository. Arne's and Tobias' goal was to provide the most modern, compatible, feature-complete, and comprehensively documented ZUMA version to render its usage and adoption for future projects as easy as possible.

<sup>1</sup>http://www.reconos.de

# 2 Installation

ZUMA scripts are written in Python and require a valid Python installation to run. The scripts are tested with Python 2.7. You will also need the Python libraries plumbum and numpy and, optionally, graphviz. To get started, simply run

`make`

This will fetch and build all required tools and run a unit test that asserts the correct behavior of the tool chain. If this finishes with an OK, then your ZUMA copy is ready to be used, and you can skip the rest of this section. If you run into build errors, please refer to the failing tool's GitHub site for help.
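If the build or the unit test fails early, one quick thing to rule out is a missing Python library. The following is a minimal sketch, not part of the ZUMA repository, that reports which of the libraries named above can be imported. The helper name `check_modules` is hypothetical, and the sketch assumes that each library's import name equals the name used in this manual (in particular for graphviz).

```python
import importlib


def check_modules(names):
    """Map each module name to True if it can be imported, else False."""
    result = {}
    for name in names:
        try:
            importlib.import_module(name)
            result[name] = True
        except ImportError:
            result[name] = False
    return result


if __name__ == "__main__":
    # plumbum and numpy are required; graphviz is optional.
    status = check_modules(["plumbum", "numpy", "graphviz"])
    for name, ok in sorted(status.items()):
        print("%-10s %s" % (name, "ok" if ok else "MISSING"))
```

The sketch only uses `importlib.import_module`, which is available in both Python 2.7 and Python 3.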

### 2.1 VTR flow

The VTR<sup>2</sup> tool set must also be installed in order to compile with ZUMA. ZUMA does not call the VTR flow directly; instead, it calls individual tools from it and requires files that they generate.

For convenience, a GIT submodule is located under 'external/vtr' that points to a VTR version of the official VTR GitHub repository<sup>3</sup> that is known to work with this ZUMA version. This included VTR version can be built by just issuing a standard make in ZUMA's top directory, or in the 'external' subdirectory.

Should you want to use an existing VTR installation with these scripts, you can adapt the variable VTR\_DIR that is used in scripts to find the VTR install. To change it globally, update the file 'toolpaths.py' in the base directory to point to your installation location.

VPR versions prior to 7 (which are thus very old by now) do not automatically dump the routing resource graph and lack a command line switch to do so. Since this file is needed by ZUMA, you need to activate the dumping of this file via a debug switch at compile time. For modern versions of VTR and VPR, you will not need to perform the following steps and can skip to the next section.

For VPR 6, a patch file is provided in ZUMA's 'misc/' directory. It can be applied to the file 'rr\_graph.c' in the directory '\$VTR\_DIR/vpr/SRC/route/' by calling:

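```
# Prepend the patch to rr_graph.c via a temporary file, because redirecting
# into a file that cat is still reading would truncate it first.
cat (ZUMA dir)/misc/patch.txt (VTR dir)/vpr/SRC/route/rr_graph.c > rr_graph.c.tmp
mv rr_graph.c.tmp (VTR dir)/vpr/SRC/route/rr_graph.c
```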

The patch instructs VPR to always dump its routing graph to the file 'rr\_graph.echo'. If using a different version, defining CREATE\_ECHO\_FILES in 'rr\_graph.c' will enable this functionality.

### 2.2 Yosys

To enable an automatic verification of the functional equivalence between the generated overlay configuration and the original HDL specification, you additionally need Yosys<sup>4</sup>.

For convenience, a GIT submodule is located under 'external/yosys' that points to a Yosys version of the official Yosys GitHub repository<sup>5</sup> that is known to work with this ZUMA version. This included Yosys version can be built by just issuing a standard make in ZUMA's top directory, or in the 'external' subdirectory.

Should you want to use an existing Yosys installation with these scripts, you can adapt the variable yosysDir that is used in scripts to find the Yosys install. To change it globally, update the file 'toolpaths.py' in the base directory to point to your installation location.
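As an illustration of such an adaptation, the sketch below shows how the two path variables could be set and sanity-checked before running the flow. It assumes that 'toolpaths.py' holds plain string assignments, which this manual does not confirm. Only the variable names VTR\_DIR and yosysDir come from this manual; the directory locations and the helper `missing_tool_dirs` are hypothetical.

```python
import os

# Hypothetical install locations; replace them with your own.
VTR_DIR = os.path.expanduser("~/tools/vtr-verilog-to-routing")
yosysDir = os.path.expanduser("~/tools/yosys")


def missing_tool_dirs(paths):
    """Return the sorted names whose configured directory does not exist."""
    return sorted(name for name, path in paths.items()
                  if not os.path.isdir(path))


if __name__ == "__main__":
    missing = missing_tool_dirs({"VTR_DIR": VTR_DIR, "yosysDir": yosysDir})
    print("missing tool directories: %s" % (", ".join(missing) or "none"))
```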

<sup>2</sup>https://verilogtorouting.org/

<sup>3</sup>https://github.com/verilog-to-routing/vtr-verilog-to-routing

<sup>4</sup>http://www.clifford.at/yosys/

<sup>5</sup>https://github.com/YosysHQ/yosys

# 3 Getting started

### 3.1 Running the Example

Calling the script

compile.sh test.v

will automatically build the ZUMA system Verilog 'ZUMA\_custom\_generated.v', and a bitstream hex file 'output.hex', which can be used to synthesize and configure a ZUMA system. By passing other circuit files, modifying the example ZUMA configuration file 'zuma\_config.py', or providing an alternative configuration file via the --config command line switch, custom architectures and bitstreams can be generated.

### 3.2 ZUMA Tool Flow

An overview of the flow of tools required to generate ZUMA overlays, and configurations for them, is depicted in Figure 1, adapted from [4].

It roughly works as follows: Starting with the ZUMA parameters in 'zuma\_config.py', the script 'compile.sh' uses the templates stored in 'source/templates/' to generate the architecture description of the overlay in VTR's XML format. Leveraging this architecture description and the virtual circuit, e.g., 'test.v', the VTR flow (or more specifically ODIN II, ABC, and VPR) can deduce the complete routing resources of the overlay, which are saved into a file (custom format in VTR $\leq$ 7 and XML in VTR 8), and can also synthesize, place, and route the virtual circuit onto the described architecture, resulting in descriptive files for the netlist, the placement, and the routing. The ZUMA scripts take all of these generated files and compute the correct configuration for each programmable entity of the overlay, i.e., eLUTs and programmable interconnect points, and save the complete virtual configuration as a ZUMA bitstream into the files 'output.hex' and 'output.hex.mif'. While the former is the correct bitstream version as defined for ZUMA, the latter is an undecorated collection of only the configuration bits, ready to use for inclusion by vendor tools as memory content. For a more in-depth, step-by-step discussion of the flow, and for using advanced features such as static timing analysis or timing-driven placement and routing of virtual circuits, see Section 6.2.

For the physical side of things, the ZUMA scripts generate a description of the complete overlay fabric in Verilog, 'ZUMA\_custom\_generated.v', using LUTRAM instantiation macros to define all programmable entities. This Verilog file can be included into a regular FPGA project to actually synthesize an overlay onto a physical device. For more details of this inclusion, see Section 5.

### 3.3 Bit to BLIF

You can also reverse ZUMA's virtual synthesis and (re-)build a BLIF file from the generated bitstream. To this end, you can call the script

>example/extract\_logic\_function.sh output.hex.mif output.blif HasClock HasReset

where 'output.hex.mif' is the bitstream you want to build your BLIF from, 'output.blif' the name of the BLIF file you want to create and HasClock and HasReset are two booleans [True/False] which indicate if the circuit uses a clock and / or a reset signal. Those two signal properties cannot be read from the bitstream so you have to specify them.

Additionally, the script requires the same 'zuma\_config.py' architecture parameter configuration that was used to build the 'output.hex.mif' initially, because the architecture details cannot be read from the bitstream either.



Figure 1: The tool flow to generate overlays and configurations.

# 4 Caveats and Restrictions

A constraint for the virtual circuit file is that the head of the model must have the following signature:

```
verilog-module-name ([clock], reset, [input-name-1, input-name-2, ...])
```

The clock and reset signals must have the given names and positions for the scripts to recognize their special behavior. The reset is treated as the first input on the FPGA. The declaration of a clock is optional.
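To catch a wrong port order before starting the flow, a small pre-check can help. The following is a minimal sketch and not part of the ZUMA scripts: the function `check_module_header` is hypothetical, and it assumes a module header that lists bare port names in parentheses, without comments or inline direction keywords.

```python
import re

# Matches "module <name> ( <port list> )" for a simple, comment-free header.
HEADER_RE = re.compile(r"\bmodule\s+(\w+)\s*\(([^)]*)\)")


def check_module_header(verilog_text):
    """Check the first module header; return (ok, message)."""
    match = HEADER_RE.search(verilog_text)
    if match is None:
        return False, "no module header found"
    ports = [p.strip() for p in match.group(2).split(",") if p.strip()]
    # The clock is optional, but if present it must come first.
    if ports and ports[0] == "clock":
        ports = ports[1:]
    # The reset must directly follow the optional clock.
    if not ports or ports[0] != "reset":
        return False, "'reset' must follow the optional 'clock'"
    if "clock" in ports[1:] or "reset" in ports[1:]:
        return False, "'clock' or 'reset' appears in the wrong position"
    return True, "ok"


if __name__ == "__main__":
    print(check_module_header("module test (clock, reset, a, b);"))
    print(check_module_header("module test (a, reset);"))
```

The first call accepts the header, while the second one is rejected because an ordinary input precedes the reset.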

The LUTRAMs used in the ZUMA architecture and contained in this repository are generated using Xilinx and Altera macros. To use a different LUTRAM or memory, instantiate it in the file 'lut\_custom.v', and define a new platform (e.g., PLATFORM\_STRATIXIV) in the file 'define.v' to select this LUTRAM.

Note that the Altera macros have not been maintained and that for the current ZUMA version thus only the Xilinx side is tested and guaranteed to work with current vendor tool flows. Especially support for sequential virtual circuits has so far only been implemented for Xilinx projects.

# 5 Including a ZUMA Overlay in a Project

Once the Verilog architecture is created, and a hex bitstream is generated, the ZUMA system can be compiled and used. The generated Verilog file that describes the virtual fabric, along with the files in the 'verilog/generic/' and 'verilog/platform/(platform)/' directories should be included in a new Xilinx / Altera project, although getting it to work for Altera devices might require a significant amount of additional work. We describe the intent and general process here; for specific details on how to include it in a Xilinx Vivado project, please refer to Section 5.1.

The generated hex file should be placed in the project directory, and specified as the initial contents of the ZUMA configuration memory. The top level file 'ZUMA\_TB\_wrapper' includes a memory block which references the hex file 'output.hex' that should be generated by the ZUMA tools and included with the project. Note that sequential circuits are so far only supported in Xilinx projects.

If you change the configuration memory size, or the configuration width, you have to create an appropriate new memory using the vendor tools. Runtime configuration of the ZUMA overlay is performed by loading each block of memory in the hex file to the port config\_data, along with its address to config\_addr, and asserting the corresponding config\_en port. Configuration completes when all data are loaded.
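The runtime configuration sequence described above can be sketched as a list of port writes. This is only an illustration under stated assumptions: the configuration words are taken as already parsed into integers (the hex file format is not covered here), addresses are assumed to be consecutive and to start at 0, and a single enable signal is assumed. The function `config_writes` is hypothetical.

```python
def config_writes(words, start_addr=0):
    """Yield one (config_addr, config_data, config_en) tuple per word."""
    for offset, word in enumerate(words):
        # Each word is presented together with its address while the
        # enable is asserted (1).
        yield (start_addr + offset, word, 1)


if __name__ == "__main__":
    # Made-up configuration words, not real bitstream content.
    for addr, data, en in config_writes([0x0F, 0xA5, 0x3C]):
        print("config_addr=%d config_data=0x%02X config_en=%d"
              % (addr, data, en))
```

Configuration is complete once the generator is exhausted, i.e., once every word has been written.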

The ports fpga\_inputs and fpga\_outputs provide the interface between the physical and virtual FPGA logic. As described in the original ZUMA paper [1] and Brant's master's thesis [2], the pins are located at the edges of the array, begin at the grid coordinate (0,1), and increase in the Y direction first. The pins can be fixed to correspond to those of the input Verilog by specifying a pin location file when running VPR placement.

If you want to use the timing analysis or want to build a BLIF from a hex file, see the timing and bit to BLIF readme files.

### 5.1 Including a ZUMA Overlay in a Xilinx Vivado Project

For the inclusion of a generated ZUMA overlay into a Xilinx Vivado project, we provide a more detailed explanation here along with screenshots. When following these steps, you should be able to synthesize a working, configurable ZUMA overlay with your project.

- 1. Create a new project in Vivado by including the source files in the following directories:
  - verilog/generic
  - verilog/platforms/xilinx

Do not copy the files into the project, as they will be common to all projects.



However, you might want to copy the file 'verilog/generic/ZUMA\_TB\_wrapper.v' into the Vivado project: it will be the top-level test bench for the project we are building here, so you will likely want to adapt it.



- 2. Click on Add sources and add the following source files also to the project:
  - example/ZUMA\_custom\_generated.v overlay description
  - example/output.hex.mif virtual configuration
  - example/def\_generated.vh generated header file

This time, you can choose whether or not to check *Copy sources into project*. These files change for each project and thus could be copied. Not copying them, however, forces Vivado to read them from the ZUMA directory, so that regenerated versions are picked up automatically. Should you generate overlays for multiple projects in your ZUMA directory, it is safer to copy them into the Vivado project.



3. Right click on 'define.vh' and click Set Global Include as shown. Do the same for 'def\_generated.vh'.



4. From the IP catalog, select the *Distributed Memory Generator* core as shown. Change the component name to lut\_xilinx. Set *Data Width* to 1. Change *Memory Type* to *Simple Dual Port RAM* and click *OK*. This will include the basic building block of ZUMA into the project – the LUTRAMs – which the overlay will instantiate a hundredfold.



5. In the Generate Output Products window that appears, set Synthesis Options to Out of context per IP and click Generate.



6. Since support for virtual sequential circuits was added, ZUMA requires a special version of this building block for its eLUTs. Hence, add the *Distributed Memory Generator* core a second time now, but this time with different settings. Change the component name to elut\_xilinx. Set *Data Width* again to 1, and change *Memory Type* also to *Simple Dual Port RAM*. Do NOT dismiss the dialog, but proceed to the next tab now.



In the Port config tab, set Output Options to Both. Proceed to the final tab.



In the RST and Initialization tab, check Reset QSDPO in Reset Options.



Now you can finally click OK and in the next window (Generate Output Products) click Generate again with Out of context per IP settings.

7. Modify the 'ZUMA\_TB\_wrapper.v' file based on your needs. It will be a good idea to remove the inputs and outputs from external ports and expose only part of it to the interface, since all ZUMA IO pins are general purpose IO pins and can thus be configured to be either an input or an output by the virtual configuration. The provided fpga\_inputs and fpga\_outputs are thus exactly twice the number of actually available IO pins, and there will not be enough pins to map all the inputs and outputs should you attempt to fully use both arrays for one configuration. Also, make sure to tie the unused inputs to ground, as otherwise, simulation will break.

#### 5.1.1 Troubleshooting

**Issue 1** Synthesis of the complete design is nearly impossible, since Vivado finds thousands of combinational loops.

This might happen when you have a project that is a Vivado block design containing several IP cores, where one of them acts as the wrapper and configuration controller for the ZUMA overlay, and you follow the guide in this section to instantiate the customized *Distributed Memory Generator* IP within. This will probably seem to work fine, but the synthesis of the complete design can be nearly impossible as Vivado complains about thousands of combinational loops, crashing after running out of memory. This happens with the block design set to global synthesis, as well as with out-of-context synthesis.

As it turns out, it is necessary to run out-of-context synthesis for each LUTRAM module. This way they are considered black boxes during final synthesis and timing loops are not reported. Unfortunately, Vivado has some serious limitations regarding nested block designs. The out-of-context synthesis products generated within the ZUMA wrapper IP are not recognized by Vivado, because nesting pre-synthesized IP cores in this manner is not supported. Some additional information can be found at https://forums.xilinx.com/t5/Design-Entry/Limitations-of-the-Block-Designs/td-p/553937

This problem can be solved by instantiating the LUTRAM modules from an .edif netlist. These can be generated by customizing the *Distributed Memory Generator* IP in a new Vivado project, generating the output products, opening the resulting .dcp file with Vivado, and using the write\_edif tcl command. This way the LUTRAM can be instantiated by simply including this .edif as a source file. An HDL stub definition of the module is needed though, at least for Verilog. This can be generated using the write\_verilog -mode port command. Details of this solution can be found at https://forums.xilinx.com/t5/Design-Entry/Adding-xilinx-IP-dcp-files-for-packaging-custom-IP-to-speed-up/td-p/603142 and also at https://www.xilinx.com/support/answers/54074.html.

# 6 Background and Advanced Usage

This section's purpose is to give you enough insight into the ZUMA overlay structure and the file generation, so that you can tweak the generated virtual FPGAs to best fit your needs.

### 6.1 Basic FPGA Model Used by ZUMA

Since ZUMA derives its FPGA model from architectures defined using the VTR tool flow, we will use their notions and general model division here. On the most abstract level, ZUMA thus uses an island-style FPGA layout as depicted in Figure 2, i.e., logic block islands floating on a sea of interconnect, which is also called the Toronto FPGA model. VTR usually denotes these logic blocks as configurable logic blocks (CLBs), and within the ZUMA material they are often simply called clusters.



Figure 2: General island-style FPGA layout. Source: [5].

For our explanations, we consider the virtual FPGA structure on two different levels:

- 1. The global structure and layout with all the interconnect between the CLBs.
- 2. The local structure and layout within each CLB.

#### 6.1.1 Global Structure

The outer, global structure of the generated virtual FPGAs consists of an  $X \times Y$  array of CLBs as depicted in Figure 3. The inputs and outputs of the virtual device are modeled on the edges of the grid, resulting in  $2 \cdot (X + Y)$  IO pads. For ZUMA, each of these IO pads comprises two general purpose IOs, i.e., IOs which can be configured to be either a global input or a global output, such that the virtual device has  $\#GIOs = 4 \cdot (X + Y)$ . Thus, the overlay can divide the #GIOs general purpose IOs between the global inputs and outputs as required by the current circuit.
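As a worked example of this arithmetic, the following sketch computes both counts for a given grid size. The function name `io_counts` is only illustrative.

```python
def io_counts(x, y):
    """Return (io_pads, general_purpose_ios) for an x-by-y CLB grid."""
    pads = 2 * (x + y)   # IO pads on the four edges of the grid
    gios = 2 * pads      # each pad comprises two general purpose IOs
    return pads, gios


if __name__ == "__main__":
    # A 2x2 grid has 8 IO pads and therefore 16 general purpose IOs.
    print(io_counts(2, 2))
```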



Figure 3: Global structure of a  $2\times 2$  virtual FPGA. Source: VPR manual.

The routing resources are organized as routing channels in x or y direction (*Chanx* and *Chany*), and they form a unidirectional interconnect network between the CLBs and the outer IOs. Each channel consists of a number of individual tracks that can carry one logical signal each. The channels and tracks are visualized in Figure 4.



Figure 4: Tracks of different lengths and directions in channels: The purple track connects resources vertically and has a length of 2, while red ones are horizontal and the upper one has a length of 1. Source: VPR.

The channels are connected to each other via switchboxes, which are shown in Figure 5. Within these areas, a selection of tracks from each channel can be connected to a selection of other tracks of different channels. Since allowing the complete connection of any track to any other track would be too area consuming, ZUMA overlays, like many other FPGA devices, employ a crossbar pattern known as Wilton routing [6] here.



Figure 5: Tracks of channels are connected by the switchboxes (green). Source: VPR.

Each CLB connects to some of the tracks of the surrounding channels of the global routing resources – these connection locations are typically called connect blocks (cp. Figure 2). Each CLB element thus has its own connect block to connect itself to the global routing resources, and a switchbox to actually realize the global routing. Therefore the global outer structure of a ZUMA virtual FPGA consists of the global routing network (channels, connect blocks, and switchboxes), CLBs (or clusters), and IO pads.

#### 6.1.2 Local Structure

Each CLB, or cluster, consists of an input interconnect block (IIB) for its intra-cluster input routing and N basic logic elements (BLEs). Each BLE in turn comprises one lookup table (LUT) with input width K and one flip flop (FF) that can be bypassed using a configurable MUX, see Figure 6.

The N bit output of the whole cluster is the combined output of the N individual BLEs. The IIB is fed with the I inputs from the connect block, i.e., the connected input tracks from the global routing resources, and also with the N feedback outputs from the BLE elements. Each of these (I+N) inputs must be routed to (almost) every pin of all N LUTs. Therefore the  $N\cdot K$  outputs of the IIB are connected to the different input pins of the BLE elements.



Figure 6: The structure of a ZUMA CLB, or cluster.
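The behavior of a single BLE as described above can be modelled in a few lines. This is a behavioral sketch for illustration only, not the generated Verilog: the class `BLE` and its method names are hypothetical, and the bit ordering of the LUT address (first input is the least significant bit) is an assumption.

```python
class BLE(object):
    """Behavioral model of one BLE: a K-input LUT plus a bypassable FF."""

    def __init__(self, k, lut_bits, use_ff):
        if len(lut_bits) != 2 ** k:
            raise ValueError("a %d-input LUT needs %d bits" % (k, 2 ** k))
        self.k = k
        self.lut_bits = list(lut_bits)
        self.use_ff = use_ff  # configuration of the bypass MUX
        self.ff = 0           # current flip flop content

    def lut(self, inputs):
        """Look up the LUT output; inputs[0] is the least significant bit."""
        if len(inputs) != self.k:
            raise ValueError("expected %d inputs" % self.k)
        index = sum(bit << pos for pos, bit in enumerate(inputs))
        return self.lut_bits[index]

    def output(self, inputs):
        """Return the FF content if registered, else the LUT output."""
        return self.ff if self.use_ff else self.lut(inputs)

    def clock_edge(self, inputs):
        """Capture the current LUT output in the flip flop."""
        self.ff = self.lut(inputs)


if __name__ == "__main__":
    and_gate = BLE(2, [0, 0, 0, 1], use_ff=False)
    print(and_gate.output([1, 1]))
```

With `use_ff=True`, the same LUT contents would only become visible at the output after a call to `clock_edge`.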

Currently the IIB is implemented using connected MUXes, and there are two different implementations available to choose from, each with a different MUX density:

- 1. The first IIB is a straightforward fully-connected crossbar between the (I + N) inputs and the  $N \cdot K$  outputs, which requires a considerable amount of MUXes to realize, but cannot suffer from congestion and is thus guaranteed to find a local routing for any configuration.
- 2. The second one is based on Clos networks [7] and uses fewer MUXes, but the local routing algorithm does not always find a valid interconnect routing, due to randomness in the current routing approach.
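
As a rough sizing aid for the fully-connected variant, the crossbar dimensions follow directly from the description above. The numeric values below are taken from the module parameters shown in Listing 1 (N = 8, I = 28, K = 6) and serve only as an illustration:

$$
\#\text{MUXes} = N \cdot K = 8 \cdot 6 = 48, \qquad \text{inputs per MUX} = I + N = 28 + 8 = 36.
$$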

#### 6.1.3 Structure Configuration Parameters

The structure-related ZUMA configuration parameters are thus as follows:

| Parameter            | Description                                                                               |
|----------------------|-------------------------------------------------------------------------------------------|
| Global structure     |                                                                                           |
| params.X             | Grid size in $x$ dimension                                                                |
| params.Y             | Grid size in $y$ dimension                                                                |
| params.L             | Length of the routing channels                                                            |
| params.W             | Number of tracks per routing channel                                                      |
| Local structure      |                                                                                           |
| params.I             | External cluster inputs (from connect block to IIB)                                       |
| params.N             | LUTs per cluster                                                                          |
| params.K             | LUT input width                                                                           |
| params.UseClos       | Whether the IIB should be Clos network-based (otherwise it is a fully connected crossbar) |
| Connect block        |                                                                                           |
| params.fc\_in        | From how many tracks of the connect block each of the $I$ cluster inputs can be driven    |
| params.fc\_in\_type  | Whether params.fc\_in is an absolute number or relative value                             |
| params.fc\_out       | How many tracks of the connect block each of the $N$ cluster outputs can drive            |
| params.fc\_out\_type | Whether params.fc\_out is an absolute number or relative value                            |

#### 6.1.4 Data Structures

The whole virtual FPGA structure is saved in different data structures, whose scope and purpose we will briefly discuss here.

Within the VTR flow, the overall layout and the local structure are stored in an XML architecture description file, e.g., 'ARCH.xml'. The expanded global structure of the FPGA is stored in the routing resource graph (RRG), which is also an XML file in VTR8, e.g., 'rr\_graph.xml'.

Within ZUMA, the whole structure is built into one complete graph, let us call it overlay description graph (ODG) here for easier reference. This ODG includes all virtual resources and structures, i.e., IO pads, routing MUXes, eLUTs, and FFs. To be able to use Xilinx' LUTRAM macros of a specific input width l (or the Altera alternative) to implement the overlay later, the ODG has to be made l-feasible, i.e., each node must have a fan-in of at most l and a fan-out of exactly 1. This process is one of the internal processes within the overlay generation, and the resulting l-feasible ODG is used for most processing steps within the flow. Within the rest of this document, we will call the former graph the ODG, and the latter one the technology-mapped ODG.

#### 6.2 ZUMA Tool Flow Details

To fully understand all generated files, we have to take a closer look into the inner workings of the ZUMA overlay generation, which we will do in this section. These are thus the flow details that happen automatically when running compile.sh.

#### 6.2.1 Basic Flow

The detailed steps to generate an overlay using the flow depicted in Figure 1 are as follows for the basic version without timing analysis or optimization (user interaction is only required for top-level items):

- 1. Set all values in 'zuma\_config.py' to fit your needs.
- 2. Run the ZUMA flow ('compile.sh' with your virtual circuit).
  - (a) 'generate\_buildfiles.py'
    - Generates the VPR architecture files and build scripts from templates, i.e., currently 'ARCH\_vpr8.xml', 'abccommands.vpr8', 'vpr8.sh', and 'vpr8\_timing.sh' in the 'build' directory, or their VPR 7 alternatives.
    - Applies the parameters from 'zuma\_config.py' to all generated files.
  - (b) VTR
    - Synthesizes the virtual circuit into a technology-mapped BLIF file.
    - Generates all local and global routing resources in VPR.
    - Saves the global ones as routing resource graph (RRG).
    - Saves the result of its packing, placement and routing steps as a netlist, placement and routing file.
  - (c) 'zuma\_build.py'
    - Parses the global routing resources from the RRG.
    - Builds up an internal overlay description graph (ODG, cp. Section 6.1.4) representing all global and local resources from the parsed global information and statically for the local ones.
    - Builds the technology-mapped ODG from the ODG by expanding nodes which use too many inputs to fit into the targeted LUTRAM macros.
    - Writes out the Verilog description for the complete overlay by going through all nodes and edges of the technology-mapped ODG and generating a correspondingly connected LUTRAM instantiation in Verilog.
    - Parses the information from the BLIF, netlist, placement, and routing files and correlates it with the technology-mapped ODG, i.e., which LUT node implements which functionality, which signal name is used where, how do the routing nodes need to be configured to pass along the signals the correct way, etc.
    - Builds the ZUMA bitstream for the virtual circuit by going through all nodes of the technology-mapped ODG and saving their configuration as the corresponding LUTRAM configuration.
  - (d) Verifies whether the generated circuit is functionally equivalent to the original specification.
- 3. Include the generated overlay in an FPGA design (cf. Section 5).
- 4. Run the device vendor's EDA tools.
- 5. Configure the ZUMA overlay with the bitstream.

#### 6.2.2 Timing-augmented Flow

Note: This section is currently only applicable to Xilinx devices.

The detailed steps to generate an overlay using the flow depicted in Figure 1 are as follows for the version with all bells and whistles, which allows for static timing analysis of the virtual circuit mapped to the physical device, as well as timing-driven virtual placement and routing. For the steps that are already part of the details presented in Section 6.2.1, we only give reduced details here, to focus on the differences (user interaction is again only required for top-level items, except for the SDF extraction):

- 1. Set all values in 'zuma\_config.py' to fit your needs.
- 2. Set params.vprAnnotation = False and params.sdf = False in 'zuma\_config.py'.
- 3. Run the ZUMA flow ('compile.sh' with dummy / any virtual circuit).
  - (a) 'generate\_buildfiles.py'
  - (b) VTR (with timing analysis off)
  - (c) 'zuma\_build.py'
    - Writes out the Verilog description for the complete overlay.
- 4. Include the generated overlay in an FPGA design (cf. Section 5).
- 5. Run the device vendor's EDA tools.
- 6. Extract and copy the standard delay format (SDF) file(s) describing the timing properties of the overlay.
  - For Vivado, generate the SDF file that contains the routing delay information by opening your top-level design and issuing the following command in the Tcl console:

```
>write_sdf Top_complete.sdf
```

    Copy the generated file 'Top\_complete.sdf' to your 'example/' directory. Set the parameter params.sdfFileName to this name in 'zuma\_config.py' (the parameter params.sdfFlipflopFileName is not used in the Vivado flow).

  - For ISE, also assuming that your top-level design has the name *Top*, generate the first SDF file that contains the routing delay information by issuing:

```
>netgen -s 1 -pcf Top.pcf -sdf_anno true -sdf_path "netgen/par" \
-ne -insert_glbl true -insert_pp_buffers false -w \
-dir netgen/par -ofmt verilog -sim Top.ncd Top_no_buffer.v
```

And then generate the second SDF file that holds the flip flop delays (port delay + Tshcko) like this:

```
>netgen -s 1 -pcf Top.pcf -sdf_anno true -sdf_path "netgen/par" \
-ne -insert_glbl true -insert_pp_buffers true -w \
-dir netgen/par -ofmt verilog -sim Top.ncd Top_with_buffer.v
```

Copy the generated file(s) 'Top\_no\_buffer.sdf' and 'Top\_with\_buffer.sdf', if applicable, to your 'example/' directory. Set the parameters params.sdfFileName and params.sdfFlipflopFileName to these names in 'zuma\_config.py'.

- 7. Set params.vprAnnotation = True and params.sdf = True in 'zuma\_config.py'.
- 8. Run the ZUMA flow for a second time ('compile.sh' with your actual virtual circuit) or multiple times thereafter for new virtual circuits. This will invoke two different subflows which will be explained in the following.
- 9. Subflow 1:
  - (a) 'generate\_buildfiles.py'
    - Generates 'ARCH\_vpr8.xml', 'abccommands.vpr8', 'vpr8.sh', and 'vpr8\_timing.sh' in the 'build' directory.
    - Within 'ARCH\_vpr8.xml', every CLB and every BLE is its own individual instance with zeroed timing.
  - (b) VTR (with timing analysis off)
  - (c) 'zuma\_build.py'
    - i. 'ReadSDF.py'
      - Parses the SDF file, identifies mappings of SDF cells and technology-mapped ODG (cp. Section 6.1.4) nodes and augments the ODG with the timing information for routing MUXes and LUTs (addLutCellDelayToMappedNode), as well as FFs (addFlipflopCellDelayToMappedNode).
      - Parses iopath and read port delays, as well as setup and hold times.
      - Parses only worst-case times, as VPR can only handle one delay entry per entity, and the worst case is then usually the most interesting one.
    - ii. 'TimingAnalysisSDF.py' performTimingAnalysis() calculates the critical path from the delay-augmented technology-mapped ODG.
    - iii. 'NodeGraphTiming.py'
      - Computes the delay information of the ODG nodes from the imported information in the technology-mapped ODG nodes, i.e., congregates the delay of expanded nodes into a delay for their supernode.
      - This step is necessary, as only the nodes of the original ODG correspond to the entities known to VPR.
    - iv. 'TimingAnnotation.py' annotateClusterTiming()
      - Computes delays of all paths within the clusters which connect entities to each other that are known to VPR.
    - v. 'TimingAnnotation.py' annotateBack() writes all computed delays from the augmented (original) ODG into the files 'ARCH\_vpr8\_timing.xml' and 'rr\_graph\_timing.xml'.
- 10. Subflow 2:
  - (a) 'generate\_buildfiles.py'
  - (b) VTR
    - Now using 'vpr8\_timing.sh' and thus with timing analysis on.
    - Uses the delay-augmented 'ARCH\_vpr8\_timing.xml' and 'rr\_graph\_timing.xml'.
    - Can thus perform (meaningful) timing-driven placement and routing.
  - (c) 'zuma\_build.py'
    - i. Now uses 'rr\_graph\_timing.xml' and RRG.
    - ii. 'ReadSDF.py'
    - iii. 'TimingAnalysisSDF.py' performTimingAnalysis() can now do a proper timing analysis for the current virtual circuit, giving a delay estimate with a corresponding maximum frequency  $f_{max}$ .
    - iv. Builds the ZUMA bitstream for the virtual circuit.
- 11. Set the clock for the virtual device to a frequency  $\leq f_{max}$ .
- 12. Configure the ZUMA overlay with the bitstream.

After performing these steps, you can run the ZUMA 'compile.sh' script with any virtual circuit to get its critical path. The path and resulting frequency  $f_{max}$  will be printed on the command line. To create a new timing-augmented overlay, repeat steps 1 through 6; to just map a new virtual circuit to an existing overlay, repeat steps 7 through 12.
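
To illustrate what a critical-path computation on a delay-annotated graph looks like in principle, the following is a minimal sketch and not the code of 'TimingAnalysisSDF.py'. The function names `critical_path` and `fmax_mhz`, the dictionary-based graph representation, and the example delays are all assumptions made for this illustration; a real analysis would also account for setup and hold times and for register boundaries.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def critical_path(preds, delay):
    """Longest path through a DAG with per-node delays (hypothetical helper).

    preds maps each node to the list of nodes driving it,
    delay maps each node to its delay in nanoseconds.
    Returns (total delay, list of nodes on the path).
    """
    arrival, via = {}, {}
    # static_order() yields every node after all of its predecessors.
    for node in TopologicalSorter(preds).static_order():
        best = max(preds.get(node, ()), key=lambda p: arrival[p], default=None)
        arrival[node] = delay[node] + (arrival[best] if best is not None else 0.0)
        via[node] = best
    end = max(arrival, key=arrival.get)
    total = arrival[end]
    path = []
    while end is not None:  # walk back along the slowest predecessors
        path.append(end)
        end = via[end]
    return total, path[::-1]


def fmax_mhz(delay_ns):
    """Maximum clock frequency in MHz for a critical path delay in ns."""
    return 1000.0 / delay_ns


# Made-up example: two routing MUXes feed a LUT, which feeds an FF.
preds = {"lut": ["mux_a", "mux_b"], "ff": ["lut"]}
delay = {"mux_a": 1.0, "mux_b": 2.5, "lut": 1.5, "ff": 0.5}
total, path = critical_path(preds, delay)
print(total, path, fmax_mhz(total))
```

In this toy example the slower MUX determines the path, and the virtual clock would have to stay at or below the frequency returned by `fmax_mhz`.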

#### 6.3 Generated Data Structures and Files

In this section we will now give an overview of the intermediate and final generated files and data structures.

For thorough descriptions of the architecture<sup>6</sup> and routing resource graph<sup>7</sup> files, see the documentation of the VTR flow.

#### 6.3.1 Overlay Description Graph

The first own data structure of the ZUMA flow is the ODG, the internal representation of all resources of the virtual FPGA. It is generated in two steps:

- 1. The global, outer structure of the virtual FPGA is generated from the RRG, consisting of the global routing network (channels and switchboxes) and clusters, but not the inner structure within such a cluster.
- 2. The inner structure of the different clusters (BLE elements and interconnect routing) is created from scratch analogously to the architecture file.

For the first step, Figure 7 shows an example of such an outer structure. To get such a structure the script parses the RRG file generated by VPR, which contains the outer structure visualized in Figure 7 as a graph. We load this RRG in the function load\_graph(filename) in the file 'InitFpga.py'. Because we can only obtain the outer, global structure from it, we have to add the nodes for the inner, local structure ourselves. In our example this only affects the cluster location 1\_1. Figure 8 shows the original cluster, while the updated cluster is shown in Figure 9. The graph was altered by the following steps of the function build\_inner\_structure() in the file 'InitFpga.py':

1. For each BLE add an eLUT (node 62) and a MUX (63), which decides whether the flip flop of that BLE is bypassed or not. The FF does not have to be added as a separate node as the eLUT and FF are grouped together into one entity for the generated Verilog file. Connect the new nodes appropriately.

<sup>6</sup> Architecture: https://docs.verilogtorouting.org/en/latest/arch/

<sup>7</sup> RRG: https://docs.verilogtorouting.org/en/latest/vpr/file_formats/#routing-resource-graph-file-format-xml




Figure 7: A small outer structure: There are 8 channels of length one (black arrows), 8 output pins (OPIN, red numbered blocks) on the IO pads, which route the inputs of the FPGA, and 8 input pins (IPIN, blue numbered blocks) on the IO pads, which route to the outputs of the FPGA. There is one used cluster input pin (IPIN, blue) and one cluster output pin (OPIN, red), as well as switchbox connections (green arrows).

- 2. Connect the MUX output with the cluster output pin (28) for routing the signal out of the cluster. To this end, we dismount the OPIN from the source node (25) because this node is not useful anymore and will be removed in the build of the Verilog file.
- 3. Add the nodes for the IIB in the function buildSimpleNetwork(): Add enough MUXes to route the cluster inputs and the BLE outputs to every pin of every LUT. Each pin therefore gets its own MUX (64 to 69), and the inputs of these MUXes are the cluster input node (27) and the output of the one BLE MUX (63).
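
The steps above can be sketched in a few lines of Python. This is an illustrative model of a fully-connected IIB only, not the code from 'InitFpga.py'; the function name `build_cluster_nodes`, the node naming scheme, and the dictionary layout are assumptions made for this example.

```python
def build_cluster_nodes(cluster_inputs, n_bles, k):
    """Create the local nodes of one cluster (hypothetical sketch).

    Returns {node_name: (kind, [names of driving nodes])}.
    """
    nodes = {}
    ble_outputs = [f"ffmux_{b}" for b in range(n_bles)]
    # Every LUT pin can be driven by any cluster input or any BLE output.
    iib_sources = list(cluster_inputs) + ble_outputs
    for b in range(n_bles):
        pins = []
        for p in range(k):
            mux = f"iib_mux_{b}_{p}"
            nodes[mux] = ("iib_mux", list(iib_sources))
            pins.append(mux)
        # One eLUT per BLE, fed by its K pin MUXes ...
        nodes[f"elut_{b}"] = ("elut", pins)
        # ... and one MUX that selects the registered or unregistered output.
        nodes[f"ffmux_{b}"] = ("ffmux", [f"elut_{b}"])
    return nodes


# The small example from the text: one cluster input, one BLE, K = 6.
nodes = build_cluster_nodes(["ipin_27"], n_bles=1, k=6)
for name, (kind, inputs) in nodes.items():
    print(name, kind, inputs)
```

For the small example this yields six pin MUXes with two inputs each, one eLUT, and one FF MUX, which mirrors the node counts described in the steps above.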

Furthermore we have to generate a  $k_{host}$ -feasible version of the ODG, i.e., we have to replace each node with more than  $k_{host}$  input edges (more than  $k_{host}$  pins) with several nodes, because the real FPGA hardware on which we want to synthesize our virtual FPGA has only  $k_{host}$  pins per LUT.  $k_{host}$  is implemented as the global constant globs.host\_size = 6.
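
One simple way to make a wide node feasible is to split it into a tree of smaller nodes. The following sketch shows this idea under stated assumptions: the function name `expand_node`, the helper naming scheme, and the plain tree decomposition are illustrative choices and not necessarily the expansion strategy that ZUMA itself uses.

```python
from itertools import count


def expand_node(name, inputs, k_host):
    """Split one wide node into a tree of nodes with at most k_host inputs.

    Returns a list of (node_name, input_names). The last entry is the root,
    which keeps the original node name (hypothetical sketch).
    """
    if k_host < 2:
        raise ValueError("k_host must be at least 2")
    fresh = count()
    nodes = []
    level = list(inputs)
    while len(level) > k_host:
        next_level = []
        for start in range(0, len(level), k_host):
            group = level[start:start + k_host]
            if len(group) == 1:
                # A single leftover signal needs no helper node.
                next_level.append(group[0])
                continue
            helper = f"{name}_x{next(fresh)}"
            nodes.append((helper, group))
            next_level.append(helper)
        level = next_level
    nodes.append((name, level))
    return nodes


# A 36-input MUX mapped to 6-input LUTs: six helper nodes plus one root.
tree = expand_node("mux", [f"in{i}" for i in range(36)], k_host=6)
print(len(tree), tree[-1])
```

Each resulting node has at most `k_host` inputs, so every one of them fits into a single host LUT.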

The generation of the possible LUT types is realized in:

- writeSimpleLut()
- writeTightlyLut()
- writeComplexLut()

In this document, we call this new, expanded overlay description graph the technology-mapped ODG.



Figure 8: An overlay description graph, after the RRG parsing.



Figure 9: An ODG with added nodes for the local structure.

#### 6.3.2 Verilog file

The central Verilog file 'ZUMA\_custom\_generated.v', which describes the complete configurable overlay, is generated in the function build\_global\_routing\_Verilog in the file 'BuildVerilog.py'. As explained before, the basic approach is to traverse the ODG and write its structure into a file, translated to Verilog primitives and Xilinx macros. We also perform several optimizations to omit the generation of useless nodes, which:

- are source/sink nodes,
- have no input, or
- have only one input, i.e., passthrough nodes. For these, we use Verilog assigns instead of generating routing MUXes. An exception are FF MUXes, because they will have two inputs on the real FPGA hardware, but not in our graph.
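
The decision logic behind these optimizations can be summarized in a small sketch. The function `emit_node` and its `kind` labels are hypothetical and only illustrate the rules listed above; the real generator in 'BuildVerilog.py' emits complete LUTRAM instantiations instead of the placeholder comment used here.

```python
def emit_node(node_id, inputs, kind="mux"):
    """Return the Verilog text for one ODG node, or None if it is omitted.

    kind is one of "source", "sink", "mux", "ffmux" (labels assumed here).
    """
    if kind in ("source", "sink") or not inputs:
        return None  # useless node, nothing is generated
    if len(inputs) == 1 and kind != "ffmux":
        # Passthrough node: a plain assign replaces the routing MUX.
        return f"assign node_{node_id} = node_{inputs[0]};"
    # Placeholder for a real LUTRAM macro instantiation.
    return f"// node_{node_id}: LUTRAM instance with {len(inputs)} graph input(s)"


print(emit_node(836, [986]))
print(emit_node(1317, [1316], kind="ffmux"))
print(emit_node(25, [], kind="source"))
```

Note how the FF MUX is kept even though it has only one input in the graph, in line with the exception named in the last list item.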

Listing 1 shows an example excerpt of a generated Verilog file, with some of the interesting elements:

- The outward Verilog module interface (which we will discuss in a later section).
- ODG nodes that are translated to wires.
- Passthrough nodes from the ODG that are realized using Verilog assigns.
- A LUTRAM macro instantiation for a routing MUX (mux\_337).
- A LUTRAM macro instantiation for an eLUT (c\_mux\_1316) and its FF MUX (c\_mux\_1317).
- The IO pad endpoints fpga\_outputs[i] and fpga\_inputs[i].
- The instantiation of the simple configuration controller.

Listing 1: Example Verilog excerpt from ZUMA\_custom\_generated.v.

```
`include "define.v"
//ZUMA global routing Entity
//automatically generated by script
module ZUMA_custom_generated
#(
    parameter N_NUMLUTS = 8,
    parameter I_CLINPUTS = 28,
    parameter K_LUTSIZE = 6,
    parameter CONFIG_WIDTH = 32
)
(
clk,
fpga_inputs,
fpga_outputs,
config_data,
config_en,
progress,
config_addr,
clk2,
ffrst
);
    // [..]
    wire node_0;
    wire node_1;
    wire node_2;
wire node_3;
wire node_4;
// [..]
//sbox driver x at (1, 2, 2, 2)
assign node_836 = node_986;
//sbox driver x at (1, 2, 2, 2)
assign node_837 = node_1298;
// [..]
//cluster input at (2, 1)
//size: 6
//inputs: [536, 537, 604, 605, 648, 649]
lut_custom mux_337 (
    .a(wr_addr), // input [5 : 0] a
    .d(wr_data[5]), // input [0 : 0]
    .dpra({node_536,node_537,node_604,node_605,node_648,node_649})
        , // input [5 : 0] dpra
    .clk(clk), // input clk
    .we(wren[4]), // input we
    .dpo(node_337));
// [..]
//internal cluster node (eLUT) at (1, 2)
//size: 6
//inputs: [1332, 1333, 1334, 1335, 1336, 1337]
elut_custom c_mux_1316 (
    .a(wr_addr), // input [5 : 0] a
    .d(wr_data[11]), // input [0 : 0]
    .dpra({node_1332,node_1333,node_1334,node_1335,node_1336,
       node_1337}), // input [5 : 0] dpra
    .clk(clk), // input clk
    .we(wren[27]), // input we
    .dpo(node_1316_unreg), // unregistered output
    .qdpo_clk(clk2), // run clk
    .qdpo_rst(ffrst), // input flip flop reset
    .qdpo(node_1316_reg)); // registered output
//internal cluster node (ffmux) at (1, 2)
//size: 1
//inputs: [1316]
lut_custom c_mux_1317 (
    .a(wr_addr), // input [5 : 0] a
    .d(wr_data[12]), // input [0 : 0]
    .dpra({node_1316_reg,node_1316_unreg,1'b0,1'b0,1'b0,1'b0}),
        // input [5 : 0] dpra
    .clk(clk), // input clk
    .we(wren[27]), // input we
    .dpo(node_1317));
// [..]
assign fpga_outputs[0] = node_0;
assign fpga_outputs[1] = node_3;
// More of these
assign node_1 = fpga_inputs[0];
assign node_4 = fpga_inputs[1];
// More of these
parameter NUM_CONFIG_STAGES = 78;
config_controller_simple
#(
    .WIDTH(CONFIG_WIDTH),
    .STAGES(NUM_CONFIG_STAGES),
    .LUTSIZE(K_LUTSIZE)
)
configuration_ctrl
(
    .clk(clk),
    .reset(1'b0),
    .wren_out(wren),
    .progress(progress),
    .wren_in(config_en),
    .addr_in(config_addr),
    .addr_out(wr_addr)
);
endmodule
```

#### 6.3.3 Configuration Bitstream

Using the files generated by VPR, the ZUMA scripts compute a routing for every MUX node in the ODG (which input to which output) and a LUT configuration for each LUT node, and write them into two bitstream files. The ZUMA bitstreams consist of several lines, where each line comprises several fields of specific sizes. *Note*: The bitstream values are hex values, so two hex digits represent one byte. The fields of each record are as follows:

| Field number | # Bytes  | Description |
|--------------|----------|-------------|
| 0            | 1        | A colon character (':') to indicate a new record. |
| 1            | 1        | Configuration data field size: the size of the later configuration data field in bytes. |
| 2            | 4        | Configuration address for the configuration data of this line. |
| 3            | 1        | Record type of this line:<br>0x00: Configuration data.<br>0x01: EOF, i.e., no configuration data in this line, and the size field (field 1) has the value 0x00.<br>0x02: IO configuration. Deprecated, but still generated. |
| 4            | variable | Configuration data for the ZUMA overlay. The size of this field is given by field 1, i.e., "Configuration data field size". |
| 5            | 1        | Checksum for the previous fields, calculated as $(256 - \sum \text{bytes of fields 1 to 4}) \bmod 256$. |

The bitstream file 'output.hex' is then a collection of configuration records (lines) such as this:

|   | Size            | Address       | Type            | Data        | Checksum |
|---|-----------------|---------------|-----------------|-------------|----------|
| : | 04              | 00 00 00 00   | 02              | 80 00 00 00 | F2       |
| : | 04              | 00 00 00 00   | 00              | 00 00 00 00 | FC       |
| : | 04              | 00 00 00 04   | 00              | 08 00 04 00 | EC       |
| : | 04              | 80 00 00 00   | 00              | 40 00 00 00 | B4       |
|   |                 |               |                 |             |          |
| : | 04              | 00 00 4D F8   | 00              | 83 04 00 00 | 30       |
| : | 04              | 00 00 4D FC   | 00              | 83 04 00 00 | 2C       |
| : | 00              | 00 00 00 00   | 01              |             | FF       |

The undecorated bitstream file 'output.hex.mif' is also a collection of lines, but it contains only the records with type 0x00, and from these only the *Configuration data* field.

In both files, the contained configuration data of the records correspond to routing MUX LUTs and eLUTs in a specific pattern, which is detailed in Section 6.4.
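
The record layout can be checked with a short parser. The sketch below is not part of the ZUMA scripts; the names `checksum`, `parse_record`, and `to_mif_lines` are made up for this example, and it assumes that the address field is stored most significant byte first, that whitespace between bytes is optional, and that each 'output.hex.mif' line is simply the data field in hex.

```python
def checksum(payload: bytes) -> int:
    """(256 - sum of the bytes) mod 256, as described for field 5."""
    return (256 - sum(payload)) % 256


def parse_record(line: str) -> dict:
    """Parse one record line and verify its length and checksum."""
    text = "".join(line.split())  # tolerate spaces between the bytes
    if not text.startswith(":"):
        raise ValueError("record must start with ':'")
    raw = bytes.fromhex(text[1:])
    if len(raw) < 7:
        raise ValueError("record too short")
    size = raw[0]
    # size (1) + address (4) + type (1) + data (size) + checksum (1)
    if len(raw) != 1 + 4 + 1 + size + 1:
        raise ValueError("length does not match the size field")
    if checksum(raw[:-1]) != raw[-1]:
        raise ValueError("checksum mismatch")
    return {
        "address": int.from_bytes(raw[1:5], "big"),  # byte order assumed
        "type": raw[5],
        "data": raw[6:6 + size],
    }


def to_mif_lines(records):
    """Keep only the data fields of type-0x00 records (assumed .mif layout)."""
    return [r["data"].hex().upper() for r in records if r["type"] == 0x00]


rec = parse_record(": 04 00 00 00 04 00 08 00 04 00 EC")
eof = parse_record(": 00 00 00 00 00 01 FF")
print(rec, eof, to_mif_lines([rec, eof]))
```

Both example lines are taken from the record table above; the parser rejects any line whose size field or checksum does not match its content.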

#### 6.4 ZUMA Overlay Configuration Process

In this section we will briefly explain which parts of the generated overlay are involved in the (re-)configuration of a ZUMA overlay, and how.

#### 6.4.1 Module Description

The ports of the module in the generated hardware overlay 'ZUMA\_custom\_generated.v' are as follows:

- clk: The clock which is only used to configure the virtual FPGA.
- clk2: The clock which is used for running the virtual circuit, i.e., which is connected to the FFs of the virtual FPGA.
- fpga\_inputs, fpga\_outputs: Global IOs of the virtual FPGA.
- config\_en: A signal to put the overlay into reconfiguration mode.
- config\_data: Input for configuration lines from the 'output.hex.mif' file. This is used to configure the LUTs in the virtual FPGA. Will be described later in detail.
- config\_addr: The address at which to write the current config\_data item. Starts at 0 and ends after processing the entire 'output.hex.mif'. Will be described later in detail.
- progress: Signals with a rising edge that the configuration is finished.

#### 6.4.2 Configuration Details

To configure a ZUMA overlay, you have to connect the module ZUMA\_custom\_generated from 'ZUMA\_custom\_generated.v' with a configurator module. We provide the file 'verilog/generic/ZUMA\_TB\_wrapper.v', which can be used as a template to perform such a configuration. In it we connect the ZUMA module with the configurator fixed\_config from the file 'verilog/platforms/xilinx/init\_config.v', as shown in Listing 2.

Listing 2: Excerpt from ZUMA\_TB\_wrapper.v.

```
// Reverse the retrieved config data, as the
// overlay requires it in reverse direction as stored
generate
    genvar i;
    for (i = 0; i < CONFIG_WIDTH; i = i + 1)
    begin: reverse
        assign cfg_in[CONFIG_WIDTH-1-i] = cfg[i];
    end
endgenerate

// Fetch config data for next address
fixed_config #(.LUT_SIZE(LUT_SIZE), .NUM_STAGES(NUM_STAGES))
    config_data (
        .address_in(next_address),
        .clock(clk),
        .q(cfg)
    );
```
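
The bit reversal performed by the generate loop in Listing 2 can be reproduced in software, e.g., to compare simulation traces with the content of the bitstream file. This is a minimal sketch; the function name `reverse_bits` is an assumption, and the default width of 32 corresponds to the CONFIG_WIDTH parameter shown in Listing 1.

```python
def reverse_bits(word, width=32):
    """Mirror the bit order of word, like cfg_in[width-1-i] = cfg[i]."""
    out = 0
    for i in range(width):
        out |= ((word >> i) & 1) << (width - 1 - i)
    return out


print(hex(reverse_bits(0x00000001)))
```

Applying the function twice returns the original word, so the same helper converts in both directions.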

#### 6.4.3 The Configuration Module fixed\_config

To load the configuration, fixed\_config contains a RAM block with the complete content of the file 'output.hex.mif'. At runtime, the module feeds each line of this file (one per clock cycle) to the overlay, which has been put into reconfiguration mode. The line address is given by the input address\_in, which runs from 0 to the number of lines in 'output.hex.mif' minus 1, and the addressed line is provided by the output q. This output is then directed to the ZUMA input config\_data by the wrapper.

#### 6.4.4 Wrapper Module

To start a configuration we therefore just have to raise the ZUMA input config\_en to high and start increasing the address input, which is connected to both the ZUMA and the fixed\_config module, starting at 0.

#### 6.4.5 ZUMA\_custom\_generated Module

To understand the configuration process, we must remember how LUTs are configured. Figure 10 depicts the configured output for each input of a 6-input LUT. For every input combination there is exactly one output bit assigned. To configure a LUT we thus only have to store this output vector.



Figure 10: LUT config vector

To speed up the reconfiguration process, ZUMA configures blocks of params.config\_width LUTs in parallel; the default block size is 32. Each such block is internally called a stage. For each stage, ZUMA configures each line of the 32 LUTs step-by-step as shown in Figure 11. Therefore, every one of the 32 bits of each config\_data value is used for a different LUT. ZUMA thus renames that input to wr\_data and assigns the proper bit wr\_data[i] as input to each LUT of a stage. To address only the LUTs of a single stage at a time during this configuration writing, ZUMA uses two address signals: wr\_addr and wren. These are assigned by the configuration controller module *config\_controller\_simple* as shown in Listing 3.

Figure 11: configuration line per line

Listing 3: Excerpt from config\_controller\_simple.v.

addr\_out consists directly of the lowest LUTSIZE bits of the incoming addr\_in signal of ZUMA. wren, however, is a bit more special: For each stage number i, only the i-th bit wren[i] is 1 while all the others are 0.

To summarize: The least significant LUTSIZE bits of the configuration address are thus used to walk through the lines of all 32 LUTs of a stage simultaneously, and the remaining upper bits are used to walk the stages. The configuration process thus fully configures (up to) 32 LUTs in parallel and then switches to the next stage, until there are no unconfigured stages left. Within each stage, the process configures all LUTs of that stage in parallel, line-by-line, until all  $2^{k_{host}}$  configuration bits are written. Hence, the bitstream is also made up of blocks of  $2^{k_{host}}$  lines of 32-bit values that directly correspond to this scheme.
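
The addressing scheme can be modeled in software as follows. This sketch is a behavioral illustration and not a translation of 'config\_controller\_simple.v'; the function names are hypothetical, the defaults (LUT size 6, 78 stages) are taken from Listing 1, and the assumption that bit i of a configuration word belongs to LUT i of the stage ignores the bit reversal applied by the wrapper.

```python
def decode_config_addr(addr, lutsize=6, num_stages=78):
    """Split a configuration address into (line, stage, wren)."""
    line = addr & ((1 << lutsize) - 1)  # lower bits: line within the LUTs
    stage = addr >> lutsize             # upper bits: stage number
    if stage >= num_stages:
        raise ValueError("address beyond the last stage")
    wren = 1 << stage                   # one-hot: only wren[stage] is 1
    return line, stage, wren


def pack_stage(truth_tables, lutsize=6):
    """Interleave the truth tables of one stage into configuration words.

    truth_tables holds one integer per LUT; bit j of such an integer is the
    LUT output for input combination j. Word j collects bit j of every LUT.
    """
    words = []
    for line in range(1 << lutsize):
        word = 0
        for i, table in enumerate(truth_tables):
            word |= ((table >> line) & 1) << i
        words.append(word)
    return words


print(decode_config_addr(67))
print(pack_stage([0b10, 0b11], lutsize=1))
```

With these defaults a full configuration consists of 78 blocks of 64 words, one block per stage.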

# 7 Comprehensive List of all Configuration Parameters

Table 1: A list of all ZUMA configuration parameters with brief explanations.

| Parameter                              | Type           | Description                                                                                                                    |
|----------------------------------------|----------------|--------------------------------------------------------------------------------------------------------------------------------|
|                                        | General pare   | ameters                                                                                                                        |
| params. vprVersion                     | Integer        | Version of VPR included in installed VTR. Versions 6, 7, and 8 are supported (cp. Section 2).                                  |
| ${\it params.} \textit{verifyOverlay}$ | Boolean        | Enables an equivalence check of the generated overlay with the initial user circuit. Requires VPR 8 and Yosys (cp. Section 2). |
| ${\it params.} use Clock$              | Boolean        | Needs to be correctly set for params. verify Overlay to inform it whether the virtual circuit uses a clock.                    |
| $params. \it packed Overlay$           | Boolean        | Enables an additional packed version of<br>the overlay, ready for verification and<br>building.                                |
|                                        | Structural par | rameters                                                                                                                       |
| Global structure                       |                |                                                                                                                                |
| params. $X$                            | Integer        | Grid size in $x$ dimension                                                                                                     |
| params. $Y$                            | Integer        | Grid size in $y$ dimension                                                                                                     |
| params. $L$                            | Integer        | Length of the routing channels                                                                                                 |
| params. $W$                            | Integer        | Number of tracks per routing channel                                                                                           |
| $params. \it ordered IO$               | Boolean        | Whether or not to build large permutation MUXes around all IOs to fix their position in the fpga_input/fpga_output arrays.     |
| Local structure                        |                |                                                                                                                                |
| params. $I$                            | Integer        | External cluster inputs (from connect block to IIB)                                                                            |
| params. $N$                            | Integer        | LUTs per cluster                                                                                                               |
| params. $K$                            | Integer        | LUT input width                                                                                                                |
| ${\it params.} \ UseClos$              | Boolean        | Whether the IIB should be Clos network-based (otherwise it is a fully connected crossbar).                                     |
| Connect block                          |                |                                                                                                                                |
| params.fc_in                           | Float          | Number of connect block tracks from which each of the $I$ cluster inputs can be driven.                                        |
| params.fc_in_type                      | String         | 'abs' or 'rel': whether params.fc_in is an absolute number or a relative value.                                                |
| params.fc_out                          | Float          | Number of connect block tracks that each of the $N$ cluster outputs can drive.                                                 |
| params.fc_out_type                     | String         | 'abs' or 'rel': whether params.fc_out is an absolute number or a relative value.                                               |
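
The interplay of the 'abs' and 'rel' connect block settings can be illustrated with a minimal Python sketch. The function name `tracks_per_pin` is hypothetical and not part of ZUMA. The sketch assumes that a relative value is a fraction of the channel width params.W, that the result is rounded to the nearest integer, and that it is clamped to the range 0 to W; the actual generator may handle these cases differently.

```python
def tracks_per_pin(fc: float, fc_type: str, W: int) -> int:
    """Return how many of the W channel tracks one cluster pin connects to.

    Hypothetical helper: the rounding and clamping rules are assumptions.
    """
    if fc_type == "abs":
        # Absolute setting: fc already is a track count.
        n = int(fc)
    elif fc_type == "rel":
        # Relative setting: assumed to be a fraction of the channel width W.
        n = round(fc * W)
    else:
        raise ValueError("fc_type must be 'abs' or 'rel'")
    # Assumption: a pin cannot connect to more tracks than the channel has.
    return max(0, min(n, W))


# Example: with W = 8 tracks, a relative value of 0.5 and an
# absolute value of 4 describe the same connectivity.
print(tracks_per_pin(0.5, "rel", 8))
print(tracks_per_pin(4, "abs", 8))
```

Under these assumptions, both calls describe the same connect block, so the choice between 'abs' and 'rel' mainly matters when params.W is varied between builds.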

Table 1 – continued from previous page

| Parameter                               | Type             | Description                                                                                                                                                                                                                                                                                                                                 |
|-----------------------------------------|------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|                                         | Configuration 1  | parameters                                                                                                                                                                                                                                                                                                                                  |
| $params. config\_width$                 | Integer          | Width of the configuration port (in Bits).                                                                                                                                                                                                                                                                                                  |
| $params. config\_addr\_width$           | Integer          | Width of the configuration addresses Bits).                                                                                                                                                                                                                                                                                                 |
| Para                                    | umeters for hier | rarchical builds                                                                                                                                                                                                                                                                                                                            |
| Module Separation                       |                  | Whether the resources of that type she be implemented hierarchically in the ditional overlay fabric ('packed0verly'). All combinations of turned on the are possible.                                                                                                                                                                       |
| params. hierarchyNode                   | Boolean          | One module per ODG node.                                                                                                                                                                                                                                                                                                                    |
| params. hierarchy Inter Connect         | Boolean          | One module per cluster interconnect block (IIB).                                                                                                                                                                                                                                                                                            |
| params. hierarchy Ble                   | Boolean          | One module per BLE.                                                                                                                                                                                                                                                                                                                         |
| params.hierarchyCluster                 | Boolean          | One module per cluster / CLB.                                                                                                                                                                                                                                                                                                               |
| Black box replacement                   |                  | Whether the resources of that type should be replaced by black boxes in the additional overlay fabric ('packedOverlayBlackBox.v'), which is useful for later reorganization with tools such as RapidWright. Not all combinations are possible, because the highest level will win: e.g., if both the cluster and an ODG node are to be replaced by black boxes, the cluster will win over the node. |
| params.blackBox                         | Boolean          | Enable black box replacement in general.                                                                                                                                                                                                                                                                                                    |
| params.blackBoxBle                      | Boolean          | Replace BLEs with black boxes.                                                                                                                                                                                                                                                                                                              |
| params. blackBoxCluster                 | Boolean          | Replace clusters / CLBs with black boxes.                                                                                                                                                                                                                                                                                                   |
| $params. {\it blackBoxInterconnect}$    | Boolean          | Replace the cluster interconnect bloc (IIB) with black boxes.                                                                                                                                                                                                                                                                               |
| Timing An                               | alysis and Opt   | imization parameters                                                                                                                                                                                                                                                                                                                        |
| Timing Analysis                         |                  | See Section 6.2.2                                                                                                                                                                                                                                                                                                                           |
| params.sdf                              | Boolean          | Whether to activate the timing analysis.                                                                                                                                                                                                                                                                                                    |
| params.sdfFileName                      | String           | Path / name of the first generated SDF file: 'Top_complete.sdf' (Vivado) or 'Top_no_buffer.sdf' (ISE).                                                                                                                                                                                                                                      |
| params.sdfFlipflopFileName              | String           | In case of an ISE flow: path / name of the second generated SDF file, 'Top_with_buffer.sdf'.                                                                                                                                                                                                                                                |
| params.timeScale                        | Float            | The time scale used in the provided files as float, e.g., 1.0/10000000000000000000000000000000000                                                                                                                                                                                                                                           |
| params.timeFormat                       | String           | The time scale used in the provided files as string, e.g., "ps".                                                                                                                                                                                                                                                                            |

Table 1 – continued from previous page

| Parameter                               | $rac{1-continued\ fro}{\mathbf{Type}}$ | Description                                                                                                                                                                                                                                                              |  |
|-----------------------------------------|-----------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|
| params. sdfUsedTool                     | String                                  | Indicator of the concrete SDF format. Implemented formats are the Xilinx flows "ise" and "vivado".                                                                                                                                                                       |  |
| ${\bf params.} sdfInterconnectCellType$ | String                                  | Vivado adds all interconnect to the top ZUMA cell, and hence we need its name to extract the delays. Typically should be zuma_wrapper.                                                                                                                                   |  |
| ${\it params.} in stance Pre {\it fix}$ | String                                  | The path prefix that can be used to identify overlay components in the SDF files. For instance, the component <code>zuma_top/zuma_i</code> is translated to <code>zuma_top_zuma_i</code> in the SDF file. For the standard TB wrapper, this would be <code>XUM/</code> . |  |
| Timing-driven placement & routin        | $\overline{g}$                          | -                                                                                                                                                                                                                                                                        |  |
| ${\it params.} \textit{vprAnnotation}$  | Boolean                                 | Whether VPR 8 should be used in timing-driven P&R mode in a second run with the timing from the SDF file(s). Requires VPR 8 (cp. Section 2).                                                                                                                             |  |
| $params. \it extract Setup Hold$        | Boolean                                 | Whether the setup and hold times should be extracted from the SDF file.                                                                                                                                                                                                  |  |
| ${\rm params.} setupTime$               | Numberstring                            | If setup times are not read from the SDF file, they can be manually entered here. Example: "4e-12".                                                                                                                                                                      |  |
| ${\rm params.} hold Time$               | Numberstring                            | If hold times are not read from the SDF file, they can be manually entered here. Example: "2.0e-9".                                                                                                                                                                      |  |
| $params. \it annotate Outer Routing$    | Boolean                                 | Whether the outer, global routing resources between CLBs should be annotated with their timing.                                                                                                                                                                          |  |
| $params. \it annotate Inner Routing$    | Boolean                                 | Whether the inner, local routing resources inside CLBs should be nnotated with their timing.                                                                                                                                                                             |  |
| Timing analysis components              |                                         | Which parts of the virtual FPGA should contribute to the critical path and thus $f_{max}$ calculation.                                                                                                                                                                   |  |
| params. skipOrderedLayerTiming          | Boolean                                 | If the delays of the ordering layer should be considered. These are only meaningful if params. ordered IO is set to True.                                                                                                                                                |  |
| params. skipOuterRoutingTiming          | Boolean                                 | Whether the outer, global routing resources between CLBs should be considered.                                                                                                                                                                                           |  |
| params. skipInnerRoutingTiming          | Boolean                                 | Whether the inner, local routing resources inside CLBs should be considered.                                                                                                                                                                                             |  |
| Debugging and Visualization parameters  |                                         |                                                                                                                                                                                                                                                                          |  |

Table 1 – continued from previous page

| Parameter                             | Type    | Description                                                                                                                                      |
|---------------------------------------|---------|--------------------------------------------------------------------------------------------------------------------------------------------------|
| params.dumpNodeGraph                  | Boolean | If set, the ODG and the technology-mapped ODG will be dumped into files in a readable format.                                                    |
| params.graphviz                       | Boolean | Makes params.dumpNodeGraph also dump graphical versions of the ODGs. WARNING: Graphviz could freeze the build process if the graph is too big.   |
| params.dumpUnconfiguredNodes          | Boolean | If set, a list of unconfigured nodes with their Verilog names will be generated.                                                                 |
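
Several of the timing analysis parameters in Table 1 depend on each other: the ISE flow needs a second SDF file, and manual setup and hold times are only used when they are not extracted from the SDF file. The following Python sketch shows how such dependencies could be checked before a build. The function `check_timing_params` is hypothetical and not part of ZUMA, and `SimpleNamespace` only stands in for the real params object; the checks encode one reading of the table, not the behavior of the actual scripts.

```python
from types import SimpleNamespace


def check_timing_params(params) -> list:
    """Return a list of human-readable problems (empty if none were found).

    Hypothetical helper: the rules below are derived from Table 1 only.
    """
    problems = []
    if not getattr(params, "sdf", False):
        return problems  # timing analysis is off, nothing to check

    tool = getattr(params, "sdfUsedTool", None)
    if tool not in ("ise", "vivado"):
        problems.append("sdfUsedTool must be 'ise' or 'vivado'")
    # The ISE flow produces two SDF files, so the second name is required.
    if tool == "ise" and not getattr(params, "sdfFlipflopFileName", ""):
        problems.append("ISE flow needs sdfFlipflopFileName")

    # Manual setup/hold times are number strings such as "4e-12".
    if not getattr(params, "extractSetupHold", False):
        for name in ("setupTime", "holdTime"):
            try:
                float(getattr(params, name))
            except (AttributeError, TypeError, ValueError):
                problems.append(name + " must be a number string")
    return problems


# Example: a Vivado configuration using the example values from Table 1.
params = SimpleNamespace(
    sdf=True,
    sdfUsedTool="vivado",
    sdfFileName="Top_complete.sdf",
    sdfInterconnectCellType="zuma_wrapper",
    extractSetupHold=False,
    setupTime="4e-12",
    holdTime="2.0e-9",
)
print(check_timing_params(params))
```

Returning a list of problems instead of raising on the first one lets a user fix all inconsistencies of a configuration in one pass.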

# References

- [1] Alexander D. Brant and Guy G. F. Lemieux. ZUMA: An open FPGA overlay architecture. In *International Symposium on Field-Programmable Custom Computing Machines (FCCM)*, pages 93–96. IEEE, 2012. doi:10.1109/FCCM.2012.25.
- [2] Alexander Dunlop Brant. Coarse and fine grain programmable overlay architectures for FPGAs, February 2013. URL http://hdl.handle.net/2429/43918.
- [3] Tobias Wiersema, Arne Bockhorn, and Marco Platzner. Embedding FPGA overlays into configurable systems-on-chip: ReconOS meets ZUMA. In *2014 International Conference on ReConFigurable Computing and FPGAs*, pages 1–6. IEEE, December 2014. doi:10.1109/ReConFig.2014.7032514.
- [4] Tobias Wiersema, Arne Bockhorn, and Marco Platzner. An architecture and design tool flow for embedding a virtual FPGA into a reconfigurable system-on-chip. *Computers and Electrical Engineering*, 55:112–122, 2016. doi:10.1016/j.compeleceng.2016.04.005.
- [5] Katherine Compton and Scott Hauck. Reconfigurable computing: A survey of systems and software. *ACM Computing Surveys*, 34(2):171–210, June 2002. ISSN 0360-0300. doi:10.1145/508352.508353.
- [6] Steven J. E. Wilton. Architectures and Algorithms for Field-Programmable Gate Arrays with Embedded Memories. PhD thesis, 1997.
- [7] Charles Clos. A study of non-blocking switching networks. *The Bell System Technical Journal*, 32(2):406–424, March 1953. doi:10.1002/j.1538-7305.1953.tb01433.x.