From 10746f8680554b307349b683b19166e01d58aeb1 Mon Sep 17 00:00:00 2001
From: Tat Dat Duong
Date: Sun, 27 Jun 2021 23:40:12 +0200
Subject: [PATCH] feat: final version.

---
 chapters/conclusion.tex         |  2 +-
 chapters/introduction.tex       |  4 ++--
 chapters/preliminaries/cuda.tex |  2 +-
 chapters/preliminaries/gpu.tex  |  2 +-
 chapters/preliminaries/tnl.tex  |  2 +-
 chapters/realisation/bnode.tex  |  6 ++++--
 chapters/state-of-art.tex       |  2 +-
 chapters/theory/b+tree.tex      |  6 +++---
 chapters/theory/b-link-tree.tex | 12 ++++++------
 chapters/theory/b-tree.tex      | 20 ++++++++++----------
 10 files changed, 30 insertions(+), 28 deletions(-)

diff --git a/chapters/conclusion.tex b/chapters/conclusion.tex
index 7f39939..60ec35b 100644
--- a/chapters/conclusion.tex
+++ b/chapters/conclusion.tex
@@ -11,7 +11,7 @@ \section{Goals and results}
 Multiple GoogleTest based unit test suites were written to ensure proper functionality. A complementary web tool has been developed to visualize operations done in a multithreaded environment.
 
-Finally, both of the implementations were measured and compared against STL and other state-of-the-art GPU and CPU B-Tree implementations. Measurements of both variants against \codecpp{std::map} show a substantial speedup: $\approx$\,$200\times$ when querying and $\approx$\,$25\times$ when inserting random sequences of length $2^{24}$--$2^{25}$. When compared to the GPU B-Tree implementation by Awad et al. \cite{awad}, both of the variants implemented in this thesis are slower. It should be noted that both the B$^+$Tree and B-Link-Tree perform similarly, suggesting the reduced instruction count and reduced unnecessary pointer chasing in B$^+$Tree overshadow the concurrency improvements of B-Link-Tree.
+Finally, both implementations were measured and compared against STL and other state-of-the-art GPU and CPU B-Tree implementations. Measurements of both variants against \codecpp{std::map} show a substantial speedup: $\approx$\,$200\times$ when querying and $\approx$\,$85\times$ when inserting random sequences of length $2^{24}$--$2^{25}$. When compared to the GPU B-Tree implementation by Awad et al. \cite{awad}, both of the variants implemented in this thesis are slower. It should be noted that the B$^+$Tree and the B-Link-Tree perform similarly, suggesting that the reduced instruction count and the avoidance of unnecessary pointer chasing in the B$^+$Tree overshadow the concurrency improvements of the B-Link-Tree.
 
 \section{Future work}
diff --git a/chapters/introduction.tex b/chapters/introduction.tex
index 1683873..3c88e94 100644
--- a/chapters/introduction.tex
+++ b/chapters/introduction.tex
@@ -16,8 +16,8 @@ \section{Structure of Work}
 The main goal of this thesis is to introduce an implementation of a B-Tree built on top of the Template Numerical Library (TNL). This implementation will have the ability to change the execution environment to either run on the GPU or the CPU. The solution will be implemented in \CC\ and will support CUDA, as these are the prerequisites of the used TNL library.
 
-First, in \cref{chapter:preliminaries}, hardware structure and software architecture will be introduced, with a general primer of both the TNL library and the B-Tree data structure. Next, in \cref{chapter:theory}, different variants and modifications of the data structure, such as B$^+$Tree or B-Link-Tree, are explained and compared against the original B-Tree.
+First, in \cref{chapter:preliminaries}, hardware structure and software architecture will be introduced, with a general primer of both the TNL library and the B-Tree data structure. Previous implementations of such a data structure and other state-of-the-art CPU solutions are discussed in \cref{chapter:state-of-art}.
 
-Previous implementations of such data structure and other state-of-the-art CPU solutions are discussed in \cref{chapter:state-of-art}. The realization section presented in \cref{chapter:realisation} will explain the design decision choices and various implementation and optimization details made for both of the GPU B-Tree variants shown in this thesis.
+Next, in \cref{chapter:theory}, different variants and modifications of the data structure, such as B$^+$Tree or B-Link-Tree, are explained and compared against the original B-Tree. The realization section presented in \cref{chapter:realisation} will explain the design decisions and various implementation and optimization details made for both of the GPU B-Tree variants shown in this thesis.
 
 Last but not least, implementation correctness, testing methodology, and the experimental benchmark results between the developed solution and other chosen CPU and GPU implementations are presented in \cref{chapter:testing}.
\ No newline at end of file
diff --git a/chapters/preliminaries/cuda.tex b/chapters/preliminaries/cuda.tex
index a04be45..d7bb5c9 100644
--- a/chapters/preliminaries/cuda.tex
+++ b/chapters/preliminaries/cuda.tex
@@ -1,6 +1,6 @@
 \section{CUDA programming model}
 \label{label:cuda}
 
-To simplify development on general-purpose GPUs, NVIDIA introduced the \acrfull{cuda} programming model in November 2006. As mentioned previously, GPUs are suited for parallel workloads optimizing in total throughput, sacrificing the performance of serial operations. However, not all programs are fully parallel in nature, and some problems are difficult, if not impossible, to formulate in a manner that would benefit from the use of a GPU. Thus, a sane idea would be to utilize both CPU and GPU, using GPU in workloads, where parallelism yields significant performance uplift. With CUDA, programmers can write applications run on the CPU and accelerate parallel workloads with the GPU while using familiar C/\CC\ programming language for both processors.
+To simplify development on general-purpose GPUs, NVIDIA introduced the \acrfull{cuda} programming model in November 2006. As mentioned previously, GPUs are suited for parallel workloads optimized for total throughput, sacrificing the performance of serial operations. However, not all programs are fully parallel in nature, and some problems are difficult, if not impossible, to formulate in a manner that would benefit from the use of a GPU. Thus, a sensible approach is to utilize both the CPU and the GPU, using the GPU for workloads where parallelism yields a significant performance uplift. With CUDA, programmers can write applications that run on the CPU and accelerate parallel workloads with the GPU while using the familiar C/\CC\ programming languages for both processors.
 
 In \acrshort{cuda}, the CPU and GPU and their memory are referred to as \textit{host} and \textit{device} respectively. A host manages the memory of both the device and the host itself, and it launches user-defined functions, called \textit{kernels}, which the device executes. A program thus usually executes serial code on the host and parallel code on the device.
diff --git a/chapters/preliminaries/gpu.tex b/chapters/preliminaries/gpu.tex
index a642e52..629683c 100644
--- a/chapters/preliminaries/gpu.tex
+++ b/chapters/preliminaries/gpu.tex
@@ -11,7 +11,7 @@
 \begin{figure}
     \centering
     \includegraphics[width=\textwidth]{components/figure/cpu-vs-gpu.png}
-    \caption[Comparison between a typical CPU vs GPU architecture]{Comparison between a typical CPU vs GPU architecture, the main difference being the increased amount of computing cores. \cite{cudaprog}}
+    \caption[Comparison between a typical CPU vs GPU architecture]{Comparison between a typical CPU vs GPU architecture, the main difference being the increased number of computing cores \cite{cudaprog}.}
     \label{figure:cpu-vs-gpu}
 \end{figure}
diff --git a/chapters/preliminaries/tnl.tex b/chapters/preliminaries/tnl.tex
index 25293f2..3425691 100644
--- a/chapters/preliminaries/tnl.tex
+++ b/chapters/preliminaries/tnl.tex
@@ -12,4 +12,4 @@ \subsection{Views}
 Data can be directly accessed only from a device where it was previously allocated. To read or write from a different device, copying of data must occur between memories. These cross-device operations are considerably expensive and should thus be used sparingly.
 
-One common problem when developing GPU-accelerated programs with TNL is the ability to supply an instance of Array containers. Object instances cannot be passed to the kernel by reference, and every object must be deep-copied. This implementation detail brings significant performance overhead but also raises the question: How to mirror the changes back to the CPU copy of the object? A companion class is introduced to solve this question: \codecpp{TNL::Container::ArrayView}. This class implements the proxy design pattern, substituting \codecpp{TNL::Container::Array}. This class allows the user to read and write into the array but permits the user from performing an operation, which may change memory allocation of the array, like resizing.
+One common problem when developing GPU-accelerated programs with TNL is the ability to supply an instance of Array containers. Object instances cannot be passed to the kernel by reference, and every object must be deep-copied. This implementation detail brings significant performance overhead but also raises the question: How to mirror the changes back to the CPU copy of the object? A companion class is introduced to solve this question: \codecpp{TNL::Container::ArrayView}. This class implements the proxy design pattern, substituting \codecpp{TNL::Container::Array}, and allows the user to read and write into the array but prevents the user from performing operations that may change the memory allocation of the array, such as resizing.
diff --git a/chapters/realisation/bnode.tex b/chapters/realisation/bnode.tex
index 0dc54d9..50b7d95 100644
--- a/chapters/realisation/bnode.tex
+++ b/chapters/realisation/bnode.tex
@@ -2,6 +2,10 @@ \section{Node structure}
 
 This section describes the proposed B$^+$Tree and B-Link-Tree node structure, found in the \cref{lst:node}. Comments and helpers methods were removed for the sake of brevity.
 
+\codecpp{uint16_t} has been chosen in favor of smaller data types, as most instructions in the \acrshort{isa} do not support operand types smaller than 16 bits and instead convert them to larger data types via a \mintinline{asm}{cvt} conversion statement \cite{ptxisa}.
+
+\pagebreak
+
 \begin{listing}
 \begin{minted}{cpp}
 template
@@ -36,6 +40,4 @@ \section{Node structure}
 \label{lst:node}
 \end{listing}
 
-\codecpp{uint16_t} have been chosen in favor of smaller data types, as most instructions in the \acrshort{isa} do not support operand types smaller than 16-bits and instead convert them to larger data types via a \mintinline{asm}{cvt} conversion statement \cite{ptxisa}.
-
 As seen in the \codecpp{BLinkNode} variant at line 25--26, \codecpp{mChildren} and \codecpp{mSibling} are both using the \codecpp{volatile} qualifier to avoid incorrect memory access optimization by the compiler. This qualifier tells the compiler to assume that the contents of the variable may be changed or used at any time by another thread. Therefore all references to this variable will compile into an actual memory read or write \cite{cudaprog}.
\ No newline at end of file
diff --git a/chapters/state-of-art.tex b/chapters/state-of-art.tex
index 490ad38..c2c99e1 100644
--- a/chapters/state-of-art.tex
+++ b/chapters/state-of-art.tex
@@ -10,7 +10,7 @@ \section{Prior Art}
 Sewall et al. \cite{palm} introduced novel modifications to B$^+$Tree operations in the proposed PALM technique. This technique uses the Bulk Synchronous Parallel model, where queries are grouped, and the work is distributed among threads. This work also optimizes synchronization by avoiding barriers in favor of communicating adjacent threads in a point-to-point manner. Latches are avoided under condition that all search queries have been completed before insertions, and a node may be written by exactly one thread.
 
-Previous projects related to GPU implementation of B-Trees focused on query throughput. Fix et al. \cite{fix2011accelerating} measured substantial performance speedup compared to sequential CPU execution by running independent queries in each thread block. Until recently, the general approach for updating is either to perform such updates on the CPU or to rebuild the structure from scratch, which is the case of Fix et al. implementation.
+Previous projects related to GPU implementations of B-Trees focused on query throughput. Fix et al. \cite{fix2011accelerating} measured a substantial performance speedup compared to sequential CPU execution by running independent queries in each thread block. Until recently, the general approach to updating has been either to perform such updates on the CPU or to rebuild the structure from scratch, which is the case for this implementation.
 
 Kim et al. proposed FAST \cite{fast}, a configurable high-performance tree optimized for SIMD and multi-core CPU and GPU systems. The structure can be configured towards the target hardware architecture by specifying the size of a cache line, SIMD register width, and memory page size. Similar to Fix et al., only bulk creation and querying are supported. Updates are done by rebuilding the tree.
diff --git a/chapters/theory/b+tree.tex b/chapters/theory/b+tree.tex
index 6b8434d..7eeb4cf 100644
--- a/chapters/theory/b+tree.tex
+++ b/chapters/theory/b+tree.tex
@@ -4,16 +4,16 @@ \section{B$^+$Tree}
 B$^+$Tree is a B-Tree where keys are stored exclusively in leaf nodes.
 \end{definition}
 
-Separators found in internal nodes can be freely chosen and may not match the actual keys in leaf nodes, as long as these separators split the tree into subtrees and preserve the ordering of the keys. Value pointers are also exclusively stored in leaf nodes, highlighted by blue arrows in \cref{figure:b-plus-tree}.
+Separators found in internal nodes can be freely chosen and may not match the actual keys in leaf nodes, as long as these separators split the tree into subtrees and preserve the ordering of the keys. Value pointers are also exclusively stored in leaf nodes, highlighted as blue lines in \cref{figure:b-plus-tree}.
 
-As the B$^+$Tree does not reuse the keys and may duplicate the keys found in the leaf nodes to use as separators in internal nodes, they do bear increased storage requirements. Compressing techniques on keys can be used to reduce the increased space complexity in exchange for increased complexity caused by compressing itself.
+As the B$^+$Tree does not reuse the keys and may duplicate the keys found in the leaf nodes to use as separators in internal nodes, it bears increased storage requirements. Compression techniques on keys can be used to mitigate the increased space complexity in exchange for a slight execution complexity increase due to compression itself.
 
 In most implementations, leaf nodes may include an additional pointer to a right sibling node, as seen in \cref{figure:b-plus-tree} highlighted as red arrows, enabling straightforward sequential querying, which is helpful for range querying.
 
 \begin{figure}
     \centering
     \input{components/figure/b+tree.tex}
-    \caption[B$^+$Tree with $\mathit{Order} = 3$.]{B$^+$Tree with $\mathit{Order} = 3$. Blue arrows indicate a pointer to a value, red arrows indicate an optional pointer to a sibling node.}
+    \caption[B$^+$Tree with $\mathit{Order} = 3$.]{B$^+$Tree with $\mathit{Order} = 3$. Blue lines indicate a pointer to a value; red arrows indicate an optional pointer to a sibling node.}
     \label{figure:b-plus-tree}
 \end{figure}
diff --git a/chapters/theory/b-link-tree.tex b/chapters/theory/b-link-tree.tex
index c0417e1..b6473ac 100644
--- a/chapters/theory/b-link-tree.tex
+++ b/chapters/theory/b-link-tree.tex
@@ -17,7 +17,7 @@ \section{B-Link-Tree}\label{section:b-link-tree}
 \begin{figure}[H]
     \centering
     \input{components/figure/b-link-tree.tex}
-    \caption[B-Link-Tree with $\mathit{Order} = 3$.]{B-Link-Tree with $\mathit{Order} = 3$. Blue arrows indicate a pointer to a value, red arrows indicate an optional pointer to a sibling node.}
+    \caption[B-Link-Tree with $\mathit{Order} = 3$.]{B-Link-Tree with $\mathit{Order} = 3$. Blue lines indicate a pointer to a value; red arrows indicate an optional pointer to a sibling node.}
     \label{figure:b-link-tree}
 \end{figure}
@@ -35,7 +35,7 @@ \subsection{Insertion and Search}
 \end{figure}
 
-Assuming node $x$ is a full node, which needs to be split, shown in step (a) of \cref{figure:b-link-insert-states}. When splitting the node $x$, a new right sibling node $x^{\prime\prime}$ is created, as seen in step (b). The node $y$ inherits the high key from the split node $x$, whereas the $x$ node is updated (marked as $x^\prime$) with a new $x.highkey = x^{\prime\prime}.key_0$. Thus, an internal node does exist without a parent in between the operations.
+Assume node $x$ is a full node that needs to be split, as shown in step (a) of \cref{figure:b-link-insert-states}. When splitting the node $x$, a new right sibling node $x^{\prime\prime}$ is created, as seen in step (b). The node $x^{\prime\prime}$ inherits the high key and the sibling pointer from the split node $x$, whereas the node $x$ is updated (marked as $x^\prime$) with a new high key $x.highkey = x^{\prime\prime}.key_0$ and a sibling pointer to $x^{\prime\prime}$. Thus, an internal node may temporarily exist without a parent in between the operations.
 
 As the final step of node splitting, both the separator key and the pointer to the newly split node $y$ are inserted in the parent node $p$, seen in step (c). Similar to the insertion in B-Tree, a split operation might trigger additional splitting in higher levels.
@@ -43,11 +43,11 @@ \subsection{Insertion and Search}
 \subsection{Proof of correctness}
 
-The following theorems need to be proven to prove the correctness of each operation performed on the B-Link-Tree:
+The following theorems need to be proven to show the correctness of each operation performed on the B-Link-Tree \cite{lehman}:
 
 \begin{itemize}
-    \item \textit{Deadlock freedom} --- threads performing operations on the B-Link-Tree cannot produce a deadlock,
-    \item \textit{Correct tree modifications} --- the tree must appear as a valid tree for all nodes at any time,
+    \item \textit{Deadlock freedom} --- threads performing operations on the B-Link-Tree cannot produce a deadlock.
+    \item \textit{Correct tree modifications} --- the tree must appear as a valid tree for all nodes at any time.
     \item \textit{Correct interactions} --- concurrent operations do not interfere with one another.
 \end{itemize}
@@ -119,7 +119,7 @@ \subsection{Proof of correctness}
 If the insertion happens on an internal node ($x.\mathit{leaf} = \mathit{false}$), a key-pointer pair created by splitting a lower-level node $z^{\prime\prime}$ is inserted into the node $x$. This scenario is the only one where a key-pointer pair could propagate upwards to node $x$. The operation $P$ will be able to utilize the link pointers $z.\mathit{sibling}$ to reach both the original node and the newly split node.
 
-    In the second and third scenarios, the process $I$ has split the node $x$ into two nodes $x^\prime$ and $x^{\prime\prime}$. If the process happens on a leaf node, $P$ will continue as if no insertion has occurred. Similar to the first scenario, the only possible scenario where the process $I$ needs to split is when a child node $z$ went through a split and a new separator key and a pointer to $z^{\prime\prime}$.
+    In the second and third scenarios, the process $I$ has split the node $x$ into two nodes $x^\prime$ and $x^{\prime\prime}$. If the process happens on a leaf node, $P$ will continue as if no insertion has occurred. Similar to the first scenario, the only possible case where the process $I$ needs to split a non-leaf node is when a child node $z$ went through a split and a new separator key and a pointer to $z^{\prime\prime}$ are being inserted into the node $x$.
 
 Both the insertion and search in the node $z^{\prime\prime}$ below node $x$ will be correct thanks to \cref{theorem:b-link-tree:modifications}. It only remains to prove the correctness of split operation on node $x$.
diff --git a/chapters/theory/b-tree.tex b/chapters/theory/b-tree.tex
index f873d9c..fa24216 100644
--- a/chapters/theory/b-tree.tex
+++ b/chapters/theory/b-tree.tex
@@ -4,7 +4,7 @@ \section{B-Tree}\label{section:b-tree}
 Since their invention 50 years ago \cite{bayer-org}, B-Trees have been already considered ubiquitous less than ten years later \cite{10.1145/356770.356776}. They can be found in various forms in databases (e.g., PostgreSQL \cite{postgresql}) and file systems (e.g., BTRFS \cite{btrfs}), where a performant self-balancing external index for large blocks of data is required. They can also be found in more applications such as data mining, decision support systems, and Online analytical processing (OLAP) \cite{olap,goetz-tech}.
 
-However, it raises the question of why B-Trees are used for on-disk data and binary search trees are used for in-memory data. The main reason behind this is the high overhead of data access in block-access storage, where byte access is not well supported. A typical example is disk storage, where a disk is divided into blocks. B-Trees exploit this behavior by having its node be as large as a whole block.
+However, this raises the question of why B-Trees are used for on-disk data while binary search trees are used for in-memory data. The main reason behind this is the high overhead of data access in block-access storage, where byte access is not well supported. A typical example is disk storage, where a disk is divided into blocks. B-Trees exploit this behavior by having their nodes be as large as a whole block.
 
 B-Trees are exceptionally useful for secondary disk-based storage, but it still yields significant improvements even when storing data in memory; as with CPU caches and memory line caches, the memory can be treated as a block-access device.
@@ -13,7 +13,7 @@ \section{B-Tree}\label{section:b-tree}
 \begin{figure}[H]
     \centering
     \input{components/figure/b-tree.tex}
-    \caption[B-Tree with $\mathit{Order} = 3$.]{B-Tree with $\mathit{Order} = 3$. Blue arrows indicate the presence of a pointer to a value.}
+    \caption[B-Tree with $\mathit{Order} = 3$.]{B-Tree with $\mathit{Order} = 3$. Blue lines indicate the presence of a pointer to a value.}
     \label{figure:b-tree}
 \end{figure}
@@ -24,22 +24,22 @@ \section{B-Tree}\label{section:b-tree}
     \begin{enumerate}
         \item $x.\mathit{size}$, the number of keys in a node,
         \item $x.\mathit{leaf}$, a boolean value indicating whether the node is a leaf node or not,
-        \item An list of $x.\mathit{size}$ keys $x.\mathit{key}_1, x.\mathit{key}_2, \dots, x.\mathit{key}_{x.size}$ sorted in ascending order ($x.\mathit{key}_1 \le x.\mathit{key}_2 \le \dots \le x.\mathit{key}_{x.\mathit{size}}$),
+        \item a list of $x.\mathit{size}$ keys $x.\mathit{key}_1, x.\mathit{key}_2, \dots, x.\mathit{key}_{x.size}$ sorted in ascending order ($x.\mathit{key}_1 \le x.\mathit{key}_2 \le \dots \le x.\mathit{key}_{x.\mathit{size}}$).
     \end{enumerate}
-    \item Every internal node $x$ has $x.size + 1$ pointers as its children \\($x.child_1, x.child_2, \dots, x.child_{x.size + 1}$),
-    \item All leaf nodes appear at the same depth, which is the height of tree $h$,
+    \item Every internal node $x$ has $x.size + 1$ pointers as its children \\($x.child_1, x.child_2, \dots, x.child_{x.size + 1}$).
+    \item All leaf nodes appear at the same depth, which is the height of the tree $h$.
     \item Nodes have upper and lower bounds, which limit the number of keys and children. Assuming $m$ is the order of a B-Tree:
     \begin{enumerate}
-        \item Every node other than the root has at least $\floor{\nicefrac{m}{2}}$ children, thus every node other than root has at least $\floor{\nicefrac{m}{2}} - 1$ keys,
-        \item The root node has at least two children, except if it is a leaf node,
+        \item Every node other than the root has at least $\floor{\nicefrac{m}{2}}$ children, thus every node other than the root has at least $\floor{\nicefrac{m}{2}} - 1$ keys.
+        \item The root node has at least two children, except if it is a leaf node.
         \item Every node may contain at most $m$ children, thus may contain at most $m - 1$ keys.
     \end{enumerate}
 \end{enumerate}
 \end{definition}
 
-An example B-Tree with $\mathit{Order} = 3$ can be seen in \cref{figure:b-tree}. In the case of B-Tree, each key may include a pointer to a specific value, highlighted as blue arrows, which is useful when implementing a map-like container.
+An example B-Tree with $\mathit{Order} = 3$ can be seen in \cref{figure:b-tree}. In the case of a B-Tree, each key may include a pointer to a specific value, highlighted as blue lines, which is useful when implementing a map-like container.
 
-As a note, B-Trees are a specialization of $(a,b)$-Trees, where a B-Tree is either an $(a, 2a)$-tree or $(a, 2a + 1)$-tree depending on the oddness / evenness of $a$. This is also why 2-4 trees (also known as \enquote{2-3-4-trees}) (which in turn are similar to RB-Trees) are B-Trees with an order of 3.
+As a note, B-Trees are a specialization of $(a,b)$-Trees, where a B-Tree is either an $(a, 2a)$-tree or an $(a, 2a + 1)$-tree depending on the parity of $a$. This is also why 2-4 trees (also known as \enquote{2-3-4-trees}, which in turn are similar to RB-Trees) are B-Trees with an order of 3.
 
 \begin{lemma}
 B-Tree $T$ of order $m$ with $n \ge 1$ keys has height $h = \Theta(\log{n})$.
@@ -117,7 +117,7 @@ \subsection{Insertion}
 If a node after insertion has subsequently become full after insertion, a split operation must occur to preserve the rules of the B-Tree \cref{def:btree}. The node is considered \textit{full} if the node contains exactly $m - 1$ keys. The median key is chosen as the separator, and the node is split into two smaller nodes based on that separator. The separator is inserted into the parent of the split node, which might trigger the split operation again.
 
-When the root node needs to be split, a new node with the separator as its only key and two subtrees as its children, which the definition \cref{def:btree} permits (only internal nodes must contain at least $\floor{\nicefrac{m}{2}}$ keys).
+When the root node needs to be split, a new root node is created with the separator as its only key and the two split halves as its children, which \cref{def:btree} permits (only non-root nodes must contain at least $\floor{\nicefrac{m}{2}} - 1$ keys).
 
 \subsection{Deletion}