

### SCUOLA DI INGEGNERIA INDUSTRIALE E DELL'INFORMAZIONE

## Graph Neural Network Acceleration with SODA Framework

Master of Science Thesis in Computer Science and Engineering

Author: Giovanni Demasi

Student ID: 987062

Advisor: Prof. Fabrizio Ferrandi Co-advisors: Serena Curzel Academic Year: 2022-23

#### Abstract

Here goes the Abstract in English of your thesis followed by a list of keywords. The Abstract is a concise summary of the content of the thesis (single page of text) and a guide to the most important contributions included in your thesis. The Abstract is the very last thing you write. It should be a self-contained text and should be clear to someone who hasn't (yet) read the whole manuscript. The Abstract should contain the answers to the main scientific questions that have been addressed in your thesis. It needs to summarize the adopted motivations and the adopted methodological approach as well as the findings of your work and their relevance and impact. The Abstract is the part appearing in the record of your thesis inside POLITesi, the Digital Archive of PhD and Master Theses (Laurea Magistrale) of Politecnico di Milano. The Abstract will be followed by a list of four to six keywords. Keywords are a tool to help indexers and search engines to find relevant documents. To be relevant and effective, keywords must be chosen carefully. They should represent the content of your work and be specific to your field or sub-field. Keywords may be a single word or two to four words.

**Keywords:** here, the keywords, of your thesis

## Abstract in Lingua Italiana

Qui va l'Abstract in lingua italiana della tesi seguito dalla lista di parole chiave.

Parole chiave: qui, vanno, le parole chiave, della tesi

## Contents

| A            | bstra | ct                                                 | i   |
|--------------|-------|----------------------------------------------------|-----|
| $\mathbf{A}$ | bstra | ct in Lingua Italiana                              | iii |
| $\mathbf{C}$ | ontei | nts                                                | v   |
| 1            | Inti  | oduction                                           | 1   |
|              | 1.1   | Contributions                                      | 3   |
|              | 1.2   | Thesis structure                                   | 3   |
| <b>2</b>     | Bac   | kground                                            | 5   |
|              | 2.1   | Graphs                                             | 5   |
|              |       | 2.1.1 Graph Representation                         | 6   |
|              | 2.2   | Graph Neural Networks                              | 6   |
|              |       | 2.2.1 Graph Convolutional Network                  | 8   |
|              |       | 2.2.2 Graph Isomorphism Network                    | 9   |
|              | 2.3   | SODA Toolchain                                     | 10  |
|              |       | 2.3.1 SODA-OPT Frontend                            | 10  |
|              |       | 2.3.2 SODA Synthesizer Backend                     | 11  |
|              | 2.4   | Conclusion                                         | 11  |
| 3            | Rel   | ated Work                                          | 13  |
|              | 3.1   | Chapter structure                                  | 13  |
|              | 3.2   | Software frameworks                                | 13  |
|              | 3.3   | Hardware accelerators                              | 14  |
| 4            | Pro   | blem Formulation                                   | 15  |
| 5            | FP    | GA Toolchain for Graph Neural Network Acceleration | 17  |
| 6            | Exr   | perimental Results                                 | 19  |

| 7 Conclusions and Future Developments | 21 |
|---------------------------------------|----|
| Bibliography                          | 23 |
| List of Figures                       | 27 |
| List of Tables                        | 29 |
| List of Symbols                       | 31 |
| Acknowledgements                      | 33 |

# 1 Introduction

In recent years, deep learning has brought about a revolutionary transformation in various machine learning tasks, spanning from image classification and video processing to speech recognition and natural language understanding. Traditionally, these tasks have predominantly operated within the Euclidean space, where data is typically represented. Nevertheless, a growing number of applications now generate data from non-Euclidean domains, presenting it in the form of complex graphs with intricate relationships and interdependencies among objects. The inherent complexity of graph data has posed considerable challenges for existing machine learning algorithms. Consequently, there has been a surge of studies focusing on extending deep learning techniques to accommodate and leverage graph data.

Graph neural networks (GNNs) have been introduced in response to the growing demand for learning tasks involving graph data, which encompasses extensive relational information among its elements. These neural models effectively capture the interdependence among graph nodes by employing message passing mechanisms.

As Graph Neural Networks are increasingly employed, particularly in domains characterized by vast amounts of data, such as social networks and chemistry, a need arises to optimize and accelerate their capabilities. Inference in GNNs refers to the time the model takes to make predictions after training. The duration of the inference process determines the speed at which queries are answered, and researchers strive to minimize this time span.

In applications of deep learning that prioritize low latency, FPGAs outperform other computing devices, such as CPUs and GPUs, by providing superior performance. FPGAs offer the advantage of being fine-tuned to strike the optimal balance between power efficiency and meeting performance requirements.

Due to this reason, researchers have been actively pursuing the development of new FPGA accelerators for Graph Neural Networks (GNNs) in recent times.

The conventional approach to hardware design involves a combination of manual coding

1. Introduction

and automated processing. However, this method demands significant effort and relies heavily on the expertise of the designers, leading to varying quality of results.

To address these challenges, the objective of this thesis research study is to develop a comprehensive toolchain that, starting from PyTorch [22], a cutting-edge high-level programming framework for creating neural network algorithms based on the Python programming language, enables the automatic generation of a Graph Neural Networks (GNNs) FPGA accelerator with minimal effort required.

The suggested toolchain represents an enhancement of the SODA toolchain [3]. It operates by transforming the PyTorch model, provided as input, into a multi-level intermediate representation (MLIR) [19] utilizing Torch-MLIR [26], an MLIR based compiler toolkit for PyTorch programs. This MLIR representation is then passed to the SODA framework to conduct hardware/software partitioning of the algorithm specifications and architecture-independent optimizations. Following this, the framework generates a low-level IR (LLVM IR) specifically tailored for the hardware generation engine, PandA-Bambu [10].

In pursuit of the thesis goal, various optimizations were adopted throughout the process. Specifically, efforts were made to optimize specific computations in Graph Neural Networks during the experimental phase. As these networks often deal with massive graph sizes, the computation time and memory requirements are substantial. Consequently, a significant portion of the research focuses on optimizing the computation phase of Graph Neural Networks using tailored SODA optimizations, particularly matrix multiplication.

Furthermore, limitations and challenges have been encountered along the way. Another objective of this thesis is to analyze these limitations, ensuring they are clearly understood thoroughly. This analysis aims to provide valuable insights for future research endeavors, enabling the development of solutions to overcome these limitations and further enhance the proposed toolchain.

While the intended purpose of the toolchain is to be general, the experimental phase primarily focused on two specific types of Graph Neural Networks: Graph Isomorphism Networks (GIN) [29] and Graph Convolutional Networks (GCN) [18]. These models were sourced from reliable GitHub implementations and were modified as necessary.

The GCN model [25], designed for node classification task and written in pure PyTorch, held particular importance for the experimental phase as it served as the basis for the resulting accelerator. On the other hand, the GIN model [24], designed for graph classification task and written in PyTorch Geometric [11], a library built upon PyTorch for easier development and training of Graph Neural Networks, did not progress through the final

1. Introduction 3

step of the proposed toolchain. This was due to some incompatibilities between PyTorch Geometric and Torch-MLIR, which are integral parts of this thesis research.

#### 1.1 Contributions

#### 1.2 Thesis structure

Chapter 1 introduces the context of the thesis, its objective, and its goals, including a general overview of the research's focus, contributions, and outcome. Chapter 2 presents the background needed to understand the thesis's content deeply. In particular, it contains a summary of Graph Neural Networks, how they work, an explanation of the GNN types used in the experimental phase, and the type of tasks that they can perform, including some of their applications. Additionally, it presents the SODA framework, an important part of this thesis's proposed toolchain. Chapter 3 instead contains an overview of the related works. Other Graph Neural Network acceleration frameworks will be analyzed, underlying their differences compared to the research study done for this thesis and some limitations. Chapter 4 formulates the problem statement, summarizes the open issues of the research objective, and explains how the thesis goals can be helpful and their expected impact. Chapter 5 is the core chapter of the thesis, it clearly explains how the problem has been faced and what technologies have been used. It contains a detailed description of the proposed toolchain and its working method. Chapter 6 lists all the performed experiments, gives the necessary information to reproduce them and contains their outcomes and the issues and limitations encountered. Finally, Chapter 7 presents overall considerations of the study, both with the main achievements obtained and the most notable obstacles faced. Along with this, potential improvements for future studies are considered.

This chapter provides essential background information to enhance understanding of the thesis content and objectives. It begins by introducing the graph data structure, which is crucial for comprehending Graph Neural Networks. Additionally, the chapter provides an introduction to Graph Neural Networks, outlining their capabilities and exploring various applications. Furthermore, it introduces two essential tools, SODA and Bambu, which are integral parts of the SODA Toolchain that served as the foundation for this research study.

#### 2.1 Graphs

Graphs are a data structure representing a collection of objects, known as vertices or nodes, and a set of edges [33]. In a graph, the edges can be either directed or undirected, as shown in Figure 2.1, and they typically connect two vertices, which may or may not be distinct. The vertices represent entities or elements, and the edges represent their relationships or connections.

Graphs serve as a versatile tool for describing diverse forms of data. Molecules, the fundamental units of matter, are composed of atoms and electrons arranged in three-dimensional space. In this intricate structure, all particles interact with each other. However, when a pair of atoms are stably positioned at a specific distance, we refer to their connection as a covalent bond. These bonds with distinct atomic distances can vary in nature, such as



Figure 2.1: Example of directed and undirected graphs

single or double bonds. Representing this complex three-dimensional object as a graph offers a practical and widely adopted abstraction, where atoms are nodes and covalent bonds act as edges [9].

Social networks provide another domain where graphs find utility. They serve as valuable tools for examining patterns within the collective behavior of people, institutions, and organizations. By representing individuals as nodes and their relationships as edges, we can construct a graph that effectively captures groups of people and their interconnectedness.

#### 2.1.1 Graph Representation

Here I am going to talk about how graphs can be represented. Especially the ways encountered during the research: adjacency matrix, COO and CSR.

#### 2.2 Graph Neural Networks

Graph neural networks (GNNs) are deep learning techniques that operate on graphstructured data. Thanks to their impressive performance, GNNs have recently gained significant popularity as a widely adopted method for graph analysis.

Graph Neural Networks (GNNs) are designed to process graph data and consist of multiple interconnected layers. At its core, a GNN is an algorithm that exploits the connectivity within a graph to understand and represent the relationships between nodes. By relying on the graph's structure, the GNN iteratively processes input edge, vertex, and graph feature vectors, which encode known attributes and transforms them into output feature vectors that capture the desired predictions. Each Graph Neural Network typically encompasses three main stages: pre-processing, iterative updates and decoding or readout [1].

- Pre-processing: this initial step, while optional, involves transforming the input feature vectors and graph structure representation through a pre-processing procedure.
- 2. Iterative updates: following pre-processing, the feature vectors of each edge and vertex undergo iterative updates using aggregate-combine functions. For edge updates, attributes from the edge itself, connected vertices, and the graph are aggregated and combined to generate a new edge feature vector. Similarly, vertex updates involve aggregating feature vectors from neighboring vertices  $\mathcal{N}(v)$  and combining them to obtain a new feature vector. This iterative process gradually incorporates relationships between increasingly distant nodes and edges, allowing for multi-hop updates. Furthermore, the graph may coarsen through pooling [31] (i.e. selective

reduction or adjustment of either the graph structure or the neighborhood set of each node) in each subsequent layer, or the neighborhood set may change via layer sampling [13] (i.e. coarsening the graph from one layer to the next, leading to a reduction in the number of nodes that need to be processed during aggregation and combination steps).

3. **Decoding or readout**: once the graph possesses a global feature vector, it is updated once upon completion of edge and node updates. The final output can be an edge/node embedding, representing specific information about each edge or node in a low-dimensional feature vector format, or a graph embedding that summarizes the entire output graph.

Performing these stages on large and sparse graphs can introduce dynamic computational data flow and numerous irregular memory access patterns.

Similar to other neural networks, the processing of a GNN is influenced by its architecture. GNNs, as previously said, are structured into layers, each representing an iteration in the update process described earlier. This layering allows information to propagate across nodes, enabling the influence of distant nodes. Consequently, the appropriate number of layers in a GNN will vary depending on the significance of relationships among distant nodes in a specific application. The commonly adopted range for the number of GNN layers is 1 to 5, as an excessive number of layers can introduce undesired problems such as feature over-smoothing, vanishing gradients, or over-fitting [20].

Graph Neural Networks are a group of neural networks which are designed to solve different tasks. Prediction tasks on graphs can generally be classified into three categories: graph-level, node-level, and edge-level predictions [23].

In a graph-level task, the objective is to predict the property or characteristic of an entire graph. For instance, when considering a molecule represented as a graph, we might aim to predict attributes such as its likelihood of binding to a receptor associated with a specific disease. This assignment is comparable to image classification tasks, where the objective is to assign a label to an entire image. Similarly, in text analysis, sentiment analysis serves as a similar problem where the goal is to determine a complete sentence's overall mood or emotion in one go.

Node-level tasks involve predicting the identity or function of individual nodes within a graph. One example of a node-level task is node classification in a social network. Given a social network graph where nodes represent individuals and edges represent relationships between them, the task is to predict the demographic attributes or characteristics (e.g.,

age, gender, occupation) of each node based on their connection patterns and features. Drawing an analogy to image processing, node-level prediction problems can be compared to image segmentation tasks, where the objective is to assign labels to each pixel in an image based on its role. Similarly, in text analysis, a comparable task would involve predicting the parts of speech for each word in a sentence, such as identifying whether a word is a noun, verb, adverb, and so on.

The remaining prediction task in graphs pertains to edge prediction. One example of an edge-level task is link prediction in a social network. Given a graph representing a social network where, as before, in node-level tasks, nodes correspond to individuals and edges represent relationships between them, the edge-level task aims to predict missing or potential connections between nodes. This can involve predicting the likelihood of a future friendship or the probability of a collaboration between individuals based on their shared characteristics or mutual connections in the network.

Different popular Graph Neural Network architectures have been proposed recently, some of which are more suitable for some tasks than others. A summary of two types of GNNs used in the experimental phase is provided in the following sections.

#### 2.2.1 Graph Convolutional Network

A graph convolutional network (GCN) [8, 18] is a type of neural network architecture explicitly designed to operate on graph-structured data. GCNs aim to learn node representations by aggregating and combining information from neighboring nodes in the graph. The core idea behind GCNs is to perform convolution-like operations on the graph, where the convolutional filters are defined based on the graph's adjacency matrix or other graph-specific structures. This enables GCNs to capture and leverage the structural information encoded in the graph to make predictions or perform downstream tasks. GCNs have demonstrated effectiveness in various applications, including node classification, link prediction, and graph classification.

Given an undirected graph  $\mathcal{G} = (V, E)$ , where V represents the set of nodes (vertices), and E represents the set of edges, with an adjacency matrix  $\tilde{A} = A + I_N$ , where  $I_N$  is the identity matrix, the layer-wise propagation rule in a GCN can be expressed as:

$$H^{(l+1)} = f\left(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}H^{(l)}W^{(l)}\right)$$
(2.1)

Where  $H^{(l)} \in \mathbb{R}^{N \times D}$  is the input node features matrix,  $W^{(l)}$  is a layer-specific learnable

weight matrix,  $\tilde{D}$  is the degree matrix defined as  $\tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}$ , and  $f(\cdot)$  represents a non-linear activation function applied element-wise, such as  $ReLU(\cdot) = max(0, \cdot)$ . The equation above demonstrates the propagation of node features through graph convolution, where the adjacency matrix  $\tilde{A}$  captures the connectivity information of the graph,  $\tilde{D}^{-\frac{1}{2}}$  normalizes the adjacency matrix, and  $H^{(l)}W^{(l)}$  performs a linear transformation of node features. The resulting  $H^{(l+1)}$  represents the updated node representations after the graph convolution operation. In practice, multiple graph convolutional layers can be stacked to capture increasingly complex relationships and refine the node representations further.

#### 2.2.2 Graph Isomorphism Network

A Graph Isomorphism Network (GIN) [8, 29] is a type of neural network architecture designed to operate on graph-structured data by capturing graph isomorphism, which is the property of two graphs having the same structure, inspired by the Weisfeiler-Lehman (WL) graph isomorphism test [29]. GINs aim to learn node representations that are invariant under graph isomorphism, enabling them to generalize across different graphs with similar structures.

The learned vertex features from GIN-Conv can be directly utilized for tasks such as node classification and link prediction. It is possible to perform this model as:

$$h_v^{(k+1)} = MLP^{(k)} \left( \left( 1 + \epsilon^{(k)} \right) \cdot h_v^{(k)} + \sum_{u \in \mathcal{N}(v)} h_u^{(k)} \right)$$
 (2.2)

Where  $h_v^{(k)}$  represents the initial node representation of node v,  $\mathcal{N}(v)$  represents the neighborhood of node v,  $\epsilon$  is a learnable parameter or a fixed scalar,  $MLP(\cdot)$  represents a Multi Layer Perceptron and  $h_v^{(k+1)}$  represents the updated node representations.

In the neighborhood aggregation process of GINs, each node's representation is updated by considering its own representation and its neighbors' representations. The neighborhood aggregation is performed through the MLP operation, followed by non-linear activation.

GINs are trained using graph-level objectives, such as graph classification or property prediction, and aim to learn invariant representations under graph isomorphism, allowing them to generalize well to unseen graphs with similar structures. However, even if the node embeddings acquired through GIN can be directly applied to tasks such as node classification and link prediction, in the case of graph classification tasks, it is necessary to use a Readout function that takes individual node embeddings as input and produces the embedding representation for the entire graph.

The Readout function is then utilized to generate the overall representation of the graph, leveraging the individual vertex representations. By concatenating the results from all iterations of GINConv, the final graph representation is obtained as:

$$h_G = CONCAT \left( READOUT \left( \left\{ h_v^{(k)} | v \in G \right\} \right) | k = 0, 1, ..., K \right)$$
 (2.3)

Where READOUT in 2.2 can be replaced with a sum operator in order to generalize the WL test [29].

#### 2.3 SODA Toolchain

SODA [3] is a software-defined accelerator synthesizer. It enables the creation of highly specialized accelerators from algorithms designed in high-level programming frameworks. The synthesizer comprises a compiler-based frontend that interfaces with high-level programming frameworks, applying advanced optimizations. It also includes a compiler-based backend responsible for generating Verilog code and interfacing with external tools to compile the final design, which can be applied to application-specific integrated circuits (ASICs) or field-programmable gate arrays (FPGAs).

SODA's exceptional power lies in its ability to offer a fully automated end-to-end hardware compiler, eliminating the need for human intervention and any modifications to the input code. The SODA synthesizer framework comprises two main components: a compiler-based frontend and a compiler-based hardware generation engine. This framework seamlessly integrates with high-level Python frameworks by accepting their input descriptions, which are then translated by the frontend into a high-level intermediate representation (IR). Leveraging the multi-level intermediate representation (MLIR), the frontend facilitates hardware/software partitioning of algorithm specifications and performs architecture-independent optimizations. Following this, it generates a low-level IR (LLVM IR) that is utilized by the hardware generation engine, PandA-Bambu [10]. PandA-Bambu can accept LLVM IR as input, making it a cutting-edge open-source HLS tool. Throughout the entire SODA toolchain, compiler passes are employed to implement optimizations at all levels, greatly influencing the generated hardware designs' performance, area, and power characteristics.

#### 2.3.1 SODA-OPT Frontend

SODA-OPT, the high-level compiler frontend of the SODA synthesizer, performs search, outlining, optimization, dispatching, and acceleration pass on the input program. Its pri-

mary objective is to prepare the program for hardware synthesis, targeting either FPGAs or ASICs. To accomplish these tasks, SODA-OPT relies on and extends the MLIR framework. MLIR is a framework that facilitates the development of reusable, extensible, and modular compiler infrastructure by defining dialects. These dialects serve as self-contained intermediate representations (IRs) that adhere to the meta-IR syntax of MLIR. By utilizing dialects, code can be modeled at different levels of abstraction, allowing for specialized representations that aid in specific compiler optimizations.

Code regions selected for hardware acceleration undergo an optimization pipeline that progressively lowers them through various MLIR dialects until they are ultimately translated into an LLVM IR format tailored explicitly for hardware synthesis. On the other hand, the host module is lowered into an LLVM IR file containing runtime calls to control the generated custom accelerators.

#### 2.3.2 SODA Synthesizer Backend

Bambu, the SODA synthesizer backend, harnesses cutting-edge HLS techniques to produce accelerator designs using the low-level LLVM IR generated by the SODA frontend. Bambu boasts multiple frontends based on standard compilers such as GCC or CLANG. It constructs an internal IR to execute HLS steps and generates designs in HDL formats, such as Verilog or VHDL. In addition to synthesizable HDL, Bambu can automatically generate testbenches for verification purposes. Using Bambu, the SODA synthesizer can target both FPGAs and ASICs.

Bambu is optimized to handle a broad range of C and C++ constructs while also being able to process LLVM IR through its internal Clang frontend. Through SODA-OPT, Bambu can be connected with MLIR code. The LLVM IR generated after SODA-OPT's high-level optimizations undergoes explicit restructuring for HLS, resulting in more efficient accelerators than direct translation from MLIR to LLVM IR.

Bambu generates designs at the Register Transfer Level (RTL), adhering to the Finite State Machine with Datapath (FSMD) model. These generated accelerators can subsequently be integrated into larger system-level designs, with or without microcontrollers controlling the execution.

#### 2.4 Conclusion

This chapter has presented the foundational concepts necessary for understanding the subsequent contents of this thesis. It provided a concise overview of the broad domain

of Graphs and Graph Neural Networks, explicitly focusing on the architectures of Graph Convolutional Networks and Graph Isomorphism Networks. Additionally, the chapter introduced SODA and PandA-Bambu, which will be further investigated within the context of the proposed design flow for the creation of GNNs FPGA-based accelerators.

The following chapter, however, is dedicated to an extensive analysis of scientific literature on hardware acceleration for Graph Neural Networks. This analysis primarily focuses on publications concerning FPGA-based implementations and design flows that leverage High-Level Synthesis techniques.

# 3 Related Work

Accelerating Graph Neural Networks (GNNs) has become a subject of intense interest within the research community, encompassing the exploration of ASIC and FPGA accelerators. In this chapter, a comprehensive examination is conducted on cutting-edge Graph Neural Networks FPGA accelerators and design flows based on High-Level Synthesis (HLS). As explained in Chapter 6, particular emphasis has been placed on optimizing matrix-matrix multiplication during this thesis research study. Consequently, this chapter also delves into the relevant literature concerning various approaches to Matmul optimization.

#### 3.1 Chapter structure

#### 3.2 Software frameworks

The challenges posed by GNN processing have led to inefficiencies in traditional deep neural network (DNN) libraries and graph processing frameworks. This is primarily due to the alternating computational phases characteristic of GNNs. While DNN libraries excel in accelerating combination operations within vertices and edges, they need help with aggregation tasks. On the other hand, graph processing libraries effectively handle irregular memory accesses during graph traversal but assume simplistic operations at the vertices, which is not the case in GNNs. Recent research studies tried to bridge the gap by adapting the DNN libraries to overcome Graph Neural Network challenges.

The two main software frameworks trying to accelerate Graph Neural Networks computation are PyTorch Geometric [11] and Deep Graph Library [27]. They both provide a lot of examples and code for multiple GNN architectures providing optimizations that could work for the acceleration of both training and inference.

PyTorch Geometric is a PyTorch-based library specifically designed for deep learning on input data with irregular structures, including graphs, point clouds, and manifolds. In addition to offering comprehensive graph data structures and processing techniques, it in3. Related Work

corporates many state-of-the-art methods from relational learning and 3D data processing domains. PyTorch Geometric achieves remarkable data throughput by introducing efficient handling of mini-batches containing input examples of varying sizes and efficiently handling sparsity through specialized GPU scatter and gather kernels, which operate on all edges and nodes concurrently, as opposed to relying on sparse matrix multiplication kernels. A key aspect of PyG involves defining a message-passing interface encompassing message and update functions for neighborhood aggregation and combination and multiple pooling operations.

DGL is a recently developed library that seamlessly integrates with TensorFlow, PyTorch, or MXNet. It introduces three essential functions: message for aggregating edges, update and reduce for aggregating and combining at the nodes. DGL adopts a matrix multiplication approach to enhance performance and harnesses specialized kernels designed for GPUs or TPUs. Specifically, both sampled dense-dense and sparse matrix multiplications and options for node, edge, or feature parallelization are considered. DGL intelligently selects the optimal parallelization scheme using heuristics, considering various factors, including the input graph. It distills the computational patterns of GNNs into a set of generalized sparse tensor operations, which facilitate extensive parallelization. By prioritizing the graph as the central programming abstraction, DGL enables transparent optimizations. Furthermore, through a framework-neutral design philosophy, DGL allows users to effortlessly port and leverage existing components across multiple deep learning frameworks.

The approach used by DGL outperformed PyTorch Geometric in training Graph Neural Networks, as stated in their paper [27]. However, both libraries target CPU and GPU architectures. Knowing the extreme computational power of FPGA, the field of hardware accelerators started gaining more and more interest, with the expectation of having GNN hardware accelerators capable of outperforming the performance of CPU-GPU targeting libraries.

#### 3.3 Hardware accelerators

# 4 Problem Formulation

Problem formulated in a clear way, what we did and how, with open issues and thesis goals.

# 5 FPGA Toolchain for Graph Neural Network Acceleration

Introduction of the way I faced the problem, with the motivation for the followed approach. Explanation of the toolchain in a clear way.

# 6 Experimental Results

Chapter dedicated to the outcome of the results, what I have obtained and what limitations have been encountered. Explaining the still open issues and research suggestions.

# 7 Conclusions and Future Developments

Final chapter containing the main conclusions of my research and possible future developments.

- [1] S. Abadal, A. Jain, R. Guirado, J. López-Alonso, and E. Alarcón. Computing graph neural networks: A survey from algorithms to accelerators. *CoRR*, abs/2010.00130, 2020. URL https://arxiv.org/abs/2010.00130.
- [2] S. Abi-Karam, Y. He, R. Sarkar, L. Sathidevi, Z. Qiao, and C. Hao. Gengnn: A generic FPGA framework for graph neural network acceleration. *CoRR*, abs/2201.08475, 2022. URL https://arxiv.org/abs/2201.08475.
- [3] N. B. Agostini, S. Curzel, J. J. Zhang, A. Limaye, C. Tan, V. Amatya, M. Minutoli, V. G. Castellana, J. Manzano, D. Brooks, G.-Y. Wei, and A. Tumeo. Bridging python to silicon: The soda toolchain. *IEEE Micro*, 42(5):78–88, 2022. doi: 10.1109/MM. 2022.3178580.
- [4] A. Auten, M. Tomei, and R. Kumar. Hardware acceleration of graph neural networks. In 2020 57th ACM/IEEE Design Automation Conference (DAC), pages 1–6, 2020. doi: 10.1109/DAC18072.2020.9218751.
- [5] A. Bik, P. Koanantakool, T. Shpeisman, N. Vasilache, B. Zheng, and F. Kjolstad. Compiler support for sparse tensor computations in MLIR. ACM Transactions on Architecture and Code Optimization, 19(4):1–25, sep 2022. doi: 10.1145/3544559. URL https://doi.org/10.1145%2F3544559.
- [6] U. Bondhugula. High performance code generation in MLIR: an early case study with GEMM. CoRR, abs/2003.00532, 2020. URL https://arxiv.org/abs/2003.00532.
- [7] S. Böhm. How to optimize a cuda matmul kernel for cublas-like performance: a worklog, 2022. URL https://siboehm.com/articles/22/CUDA-MMM.
- [8] A. Daigavane, B. Ravindran, and G. Aggarwal. Understanding convolutions on graphs. *Distill*, 2021. doi: 10.23915/distill.00032. https://distill.pub/2021/understanding-gnns.
- [9] D. Duvenaud, D. Maclaurin, J. Aguilera-Iparraguirre, R. Gómez-Bombarelli, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams. Convolutional networks on graphs

for learning molecular fingerprints. CoRR, abs/1509.09292, 2015. URL http://arxiv.org/abs/1509.09292.

- [10] F. Ferrandi, V. G. Castellana, S. Curzel, P. Fezzardi, M. Fiorito, M. Lattuada, M. Minutoli, C. Pilato, and A. Tumeo. Invited: Bambu: an open-source research framework for the high-level synthesis of complex applications. In 2021 58th ACM/IEEE Design Automation Conference (DAC), pages 1327–1330, 2021. doi: 10.1109/DAC18074.2021.9586110.
- [11] M. Fey and J. E. Lenssen. Fast graph representation learning with pytorch geometric. CoRR, abs/1903.02428, 2019. URL http://arxiv.org/abs/1903.02428.
- [12] T. Geng, A. Li, T. Wang, C. Wu, Y. Li, A. Tumeo, and M. C. Herbordt. UWB-GCN: hardware acceleration of graph-convolution-network through runtime workload rebalancing. CoRR, abs/1908.10834, 2019. URL http://arxiv.org/abs/1908.10834.
- [13] W. L. Hamilton, R. Ying, and J. Leskovec. Inductive representation learning on large graphs. CoRR, abs/1706.02216, 2017. URL http://arxiv.org/abs/1706.02216.
- [14] L. He. Engn: A high-throughput and energy-efficient accelerator for large graph neural networks. CoRR, abs/1909.00155, 2019. URL http://arxiv.org/abs/1909. 00155.
- [15] W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec. Open graph benchmark: Datasets for machine learning on graphs. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 22118–22133. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper\_files/paper/2020/file/fb60d411a5c5b72b2e7d3527cfc84fd0-Paper.pdf.
- [16] Y. Hu, Y. Du, E. Ustun, and Z. Zhang. Graphlily: Accelerating graph linear algebra on hbm-equipped fpgas. In 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD), pages 1–9, 2021. doi: 10.1109/ICCAD51958.2021.9643582.
- [17] K. Kiningham, C. Ré, and P. A. Levis. GRIP: A graph neural network accelerator architecture. CoRR, abs/2007.13828, 2020. URL https://arxiv.org/abs/2007. 13828.
- [18] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. *CoRR*, abs/1609.02907, 2016. URL http://arxiv.org/abs/1609.02907.
- [19] C. Lattner, M. Amini, U. Bondhugula, A. Cohen, A. Davis, J. Pienaar, R. Riddle,

T. Shpeisman, N. Vasilache, and O. Zinenko. Mlir: Scaling compiler infrastructure for domain specific computation. In 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pages 2–14, 2021. doi: 10.1109/CGO51591. 2021.9370308.

- [20] Q. Li, Z. Han, and X. Wu. Deeper insights into graph convolutional networks for semi-supervised learning. CoRR, abs/1801.07606, 2018. URL http://arxiv.org/ abs/1801.07606.
- [21] S. Liang, C. Liu, Y. Wang, H. Li, and X. Li. Deepburning-gl: an automated framework for generating graph neural network accelerators. In 2020 IEEE/ACM International Conference On Computer Aided Design (ICCAD), pages 1–9, 2020.
- [22] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Z. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. Pytorch: An imperative style, high-performance deep learning library. CoRR, abs/1912.01703, 2019. URL http://arxiv.org/abs/1912.01703.
- [23] B. Sanchez-Lengeling, E. Reif, A. Pearce, and A. B. Wiltschko. A gentle introduction to graph neural networks. *Distill*, 2021. doi: 10.23915/distill.00033. https://distill.pub/2021/gnn-intro.
- [24] O. G. B. team and other contributors. Gnn models from open graph benchmark, 2020. URL https://github.com/snap-stanford/ogb/tree/master/examples/graphproppred/mol.
- [25] M. W. Thomas N. Kipf and other contributors. Gcn model in pytorch, 2017. URL https://github.com/tkipf/pygcn.
- [26] n. t. Torch-MLIR team and other contributors. Torch-mlir: Mlir based compiler toolkit for pytorch programs, 2021. URL https://github.com/llvm/torch-mlir.
- [27] M. Wang, L. Yu, D. Zheng, Q. Gan, Y. Gai, Z. Ye, M. Li, J. Zhou, Q. Huang, C. Ma, Z. Huang, Q. Guo, H. Zhang, H. Lin, J. Zhao, J. Li, A. J. Smola, and Z. Zhang. Deep graph library: Towards efficient and scalable deep learning on graphs. CoRR, abs/1909.01315, 2019. URL http://arxiv.org/abs/1909.01315.
- [28] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu. A comprehensive survey on graph neural networks. CoRR, abs/1901.00596, 2019. URL http://arxiv.org/ abs/1901.00596.

[29] K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How powerful are graph neural networks?, 2019.

- [30] M. Yan, L. Deng, X. Hu, L. Liang, Y. Feng, X. Ye, Z. Zhang, D. Fan, and Y. Xie. Hygcn: A GCN accelerator with hybrid architecture. CoRR, abs/2001.02514, 2020. URL http://arxiv.org/abs/2001.02514.
- [31] R. Ying, J. You, C. Morris, X. Ren, W. L. Hamilton, and J. Leskovec. Hierarchical graph representation learning with differentiable pooling. CoRR, abs/1806.08804, 2018. URL http://arxiv.org/abs/1806.08804.
- [32] B. Zhang, H. Zeng, and V. Prasanna. Hardware acceleration of large scale gcn inference. In 2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP), pages 61–68, 2020. doi: 10.1109/ASAP49362. 2020.00019.
- [33] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, and M. Sun. Graph neural networks: A review of methods and applications. *CoRR*, abs/1812.08434, 2018. URL http://arxiv.org/abs/1812.08434.

# List of Figures

| 2.1 | Example of | directed | and | undirected | graphs | s |  |  |  |  |  |  |  |  |  |  |  |  |  | 5 |
|-----|------------|----------|-----|------------|--------|---|--|--|--|--|--|--|--|--|--|--|--|--|--|---|
|-----|------------|----------|-----|------------|--------|---|--|--|--|--|--|--|--|--|--|--|--|--|--|---|

## List of Tables

# List of Symbols

| Notation                        | Description                                                |
|---------------------------------|------------------------------------------------------------|
| $\mathcal{G} = (V, E)$          | The input graph for the GNN                                |
| V                               | Set of vertices of the graph                               |
| E                               | Set of edges of the graph                                  |
| $\mathcal{N}(v)$                | Set of neighbors of vertex $v$                             |
| $A \in \mathbb{R}^{N \times N}$ | Adjacency matrix of $\mathcal{G}$ ( $N$ : number of nodes) |
| $	ilde{D}$                      | Degree matrix of the graph                                 |
| $W^{(l)}$                       | Weight matrix of the neural network $(l: layer)$           |
| $H^{(l)}$                       | Input node features matrix $(l: layer)$                    |
| $h_v$                           | Node representation of node $v$                            |
| $\epsilon$                      | Learnable parameter or fixed scalar                        |
| I                               | Identity matrix                                            |

## Acknowledgements

Acknowledgements here...