Commit 10746f8

feat: final version.

dqbd committed Jun 27, 2021
1 parent 2021a5d commit 10746f8

Showing 10 changed files with 30 additions and 28 deletions.
2 changes: 1 addition & 1 deletion chapters/conclusion.tex
@@ -11,7 +11,7 @@ \section{Goals and results}

Multiple GoogleTest-based unit test suites were written to ensure proper functionality. A complementary web tool has been developed to visualize operations performed in a multithreaded environment.

-Finally, both of the implementations were measured and compared against STL and other state-of-the-art GPU and CPU B-Tree implementations. Measurements of both variants against \codecpp{std::map} show a substantial speedup: $\approx$\,$200\times$ when querying and $\approx$\,$25\times$ when inserting random sequences of length $2^{24}$--$2^{25}$. When compared to the GPU B-Tree implementation by Awad et al. \cite{awad}, both of the variants implemented in this thesis are slower. It should be noted that both the B$^+$Tree and B-Link-Tree perform similarly, suggesting the reduced instruction count and reduced unnecessary pointer chasing in B$^+$Tree overshadow the concurrency improvements of B-Link-Tree.
+Finally, both implementations were measured and compared against STL and other state-of-the-art GPU and CPU B-Tree implementations. Measurements of both variants against \codecpp{std::map} show a substantial speedup: $\approx$\,$200\times$ when querying and $\approx$\,$85\times$ when inserting random sequences of length $2^{24}$--$2^{25}$. When compared to the GPU B-Tree implementation by Awad et al. \cite{awad}, both of the variants implemented in this thesis are slower. It should be noted that the B$^+$Tree and the B-Link-Tree perform similarly, suggesting that the reduced instruction count and reduced unnecessary pointer chasing in the B$^+$Tree overshadow the concurrency improvements of the B-Link-Tree.

\section{Future work}

4 changes: 2 additions & 2 deletions chapters/introduction.tex
@@ -16,8 +16,8 @@ \section{Structure of Work}

The main goal of this thesis is to introduce an implementation of a B-Tree built on top of the Template Numerical Library (TNL). This implementation will be able to switch its execution environment to run on either the GPU or the CPU. The solution will be implemented in \CC\ and will support CUDA, as these are the prerequisites of the TNL library.

-First, in \cref{chapter:preliminaries}, hardware structure and software architecture will be introduced, with a general primer of both the TNL library and the B-Tree data structure. Next, in \cref{chapter:theory}, different variants and modifications of the data structure, such as B$^+$Tree or B-Link-Tree, are explained and compared against the original B-Tree.
+First, in \cref{chapter:preliminaries}, hardware structure and software architecture will be introduced, with a general primer on both the TNL library and the B-Tree data structure. Previous implementations of such a data structure and other state-of-the-art CPU solutions are discussed in \cref{chapter:state-of-art}.

-Previous implementations of such data structure and other state-of-the-art CPU solutions are discussed in \cref{chapter:state-of-art}. The realization section presented in \cref{chapter:realisation} will explain the design decision choices and various implementation and optimization details made for both of the GPU B-Tree variants shown in this thesis.
+Next, in \cref{chapter:theory}, different variants and modifications of the data structure, such as the B$^+$Tree or the B-Link-Tree, are explained and compared against the original B-Tree. The realization section presented in \cref{chapter:realisation} will explain the design decisions and various implementation and optimization details behind both of the GPU B-Tree variants shown in this thesis.

Last but not least, implementation correctness, testing methodology, and the experimental benchmark results comparing the developed solution with other chosen CPU and GPU implementations are presented in \cref{chapter:testing}.
2 changes: 1 addition & 1 deletion chapters/preliminaries/cuda.tex
@@ -1,6 +1,6 @@
\section{CUDA programming model} \label{label:cuda}

-To simplify development on general-purpose GPUs, NVIDIA introduced the \acrfull{cuda} programming model in November 2006. As mentioned previously, GPUs are suited for parallel workloads optimizing in total throughput, sacrificing the performance of serial operations. However, not all programs are fully parallel in nature, and some problems are difficult, if not impossible, to formulate in a manner that would benefit from the use of a GPU. Thus, a sane idea would be to utilize both CPU and GPU, using GPU in workloads, where parallelism yields significant performance uplift. With CUDA, programmers can write applications run on the CPU and accelerate parallel workloads with the GPU while using familiar C/\CC\ programming language for both processors.
+To simplify development on general-purpose GPUs, NVIDIA introduced the \acrfull{cuda} programming model in November 2006. As mentioned previously, GPUs are suited for parallel workloads, optimizing for total throughput while sacrificing the performance of serial operations. However, not all programs are fully parallel in nature, and some problems are difficult, if not impossible, to formulate in a manner that would benefit from the use of a GPU. Thus, a sensible idea is to utilize both the CPU and the GPU, employing the GPU for workloads where parallelism yields a significant performance uplift. With CUDA, programmers can write applications that run on the CPU and accelerate parallel workloads with the GPU while using the familiar C/\CC\ programming language for both processors.

In \acrshort{cuda}, the CPU and the GPU, together with their memory, are referred to as the \textit{host} and the \textit{device}, respectively. The host manages the memory of both the device and the host itself, and it launches user-defined functions, called \textit{kernels}, which the device executes. A program thus usually executes serial code on the host and parallel code on the device.
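The host/device split can be illustrated with a minimal sketch (not part of the thesis; the SAXPY kernel, launch sizes, and use of unified memory are illustrative choices):

\begin{minted}{cpp}
#include <cstdio>
#include <cuda_runtime.h>

// kernel: runs on the device, one thread per vector element
__global__ void saxpy( int n, float a, const float* x, float* y )
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if( i < n )
      y[ i ] = a * x[ i ] + y[ i ];
}

int main()
{
   const int n = 1 << 20;
   float *x, *y;
   // unified memory keeps the sketch short; explicit cudaMemcpy works too
   cudaMallocManaged( &x, n * sizeof( float ) );
   cudaMallocManaged( &y, n * sizeof( float ) );
   for( int i = 0; i < n; i++ ) {
      x[ i ] = 1.0f;
      y[ i ] = 2.0f;
   }

   // the host launches the kernel: n threads in blocks of 256
   saxpy<<< ( n + 255 ) / 256, 256 >>>( n, 2.0f, x, y );
   cudaDeviceSynchronize();

   printf( "y[0] = %f\n", y[ 0 ] );   // expected: 4.0
   cudaFree( x );
   cudaFree( y );
}
\end{minted}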

2 changes: 1 addition & 1 deletion chapters/preliminaries/gpu.tex
@@ -11,7 +11,7 @@
\begin{figure}
\centering
\includegraphics[width=\textwidth]{components/figure/cpu-vs-gpu.png}
-\caption[Comparison between a typical CPU vs GPU architecture]{Comparison between a typical CPU vs GPU architecture, the main difference being the increased amount of computing cores. \cite{cudaprog}}
+\caption[Comparison between a typical CPU vs GPU architecture]{Comparison between a typical CPU vs GPU architecture, the main difference being the increased number of computing cores \cite{cudaprog}.}
\label{figure:cpu-vs-gpu}
\end{figure}

2 changes: 1 addition & 1 deletion chapters/preliminaries/tnl.tex
@@ -12,4 +12,4 @@ \subsection{Views}

Data can be directly accessed only from the device where it was previously allocated. To read or write from a different device, data must be copied between memories. These cross-device operations are expensive and should thus be used sparingly.

-One common problem when developing GPU-accelerated programs with TNL is the ability to supply an instance of Array containers. Object instances cannot be passed to the kernel by reference, and every object must be deep-copied. This implementation detail brings significant performance overhead but also raises the question: How to mirror the changes back to the CPU copy of the object? A companion class is introduced to solve this question: \codecpp{TNL::Container::ArrayView}. This class implements the proxy design pattern, substituting \codecpp{TNL::Container::Array}. This class allows the user to read and write into the array but permits the user from performing an operation, which may change memory allocation of the array, like resizing.
+One common problem when developing GPU-accelerated programs with TNL is how to supply an instance of an Array container to a kernel. Object instances cannot be passed to the kernel by reference, and every object must be deep-copied. This implementation detail brings significant performance overhead but also raises the question: how to mirror the changes back to the CPU copy of the object? A companion class is introduced to solve this question: \codecpp{TNL::Container::ArrayView}. This class implements the proxy design pattern, substituting \codecpp{TNL::Container::Array}, and allows the user to read and write into the array but prevents the user from performing any operation that may change the memory allocation of the array, such as resizing.
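As a minimal sketch of this proxy pattern (the class and member names below are illustrative, not TNL's actual interface):

\begin{minted}{cpp}
#include <cuda_runtime.h>

// non-owning view: trivially copyable, safe to pass to a kernel by value
struct ArrayView
{
   float* data;
   int size;

   __host__ __device__ float& operator[]( int i ) { return data[ i ]; }
   // deliberately no resize(): a view cannot change the allocation
};

// owning container: manages device memory, cannot be passed to a kernel
struct Array
{
   float* data = nullptr;
   int size = 0;

   explicit Array( int n ) : size( n ) { cudaMalloc( &data, n * sizeof( float ) ); }
   ~Array() { cudaFree( data ); }

   ArrayView getView() { return { data, size }; }
};

// the kernel receives a cheap copy of the view; writes are visible to the
// owning Array, since both refer to the same device allocation
__global__ void fill( ArrayView view, float value )
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if( i < view.size )
      view[ i ] = value;
}
\end{minted}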
6 changes: 4 additions & 2 deletions chapters/realisation/bnode.tex
@@ -2,6 +2,10 @@ \section{Node structure}

This section describes the proposed B$^+$Tree and B-Link-Tree node structure, found in \cref{lst:node}. Comments and helper methods were removed for the sake of brevity.

+\codecpp{uint16_t} has been chosen in favor of smaller data types, as most instructions in the \acrshort{isa} do not support operand types smaller than 16 bits and instead convert them to larger data types via a \mintinline{asm}{cvt} conversion statement \cite{ptxisa}.

+\pagebreak

\begin{listing}
\begin{minted}{cpp}
template <typename KeyType, typename ValueType, size_t Order>
@@ -36,6 +40,4 @@ \section{Node structure}
\label{lst:node}
\end{listing}

-\codecpp{uint16_t} have been chosen in favor of smaller data types, as most instructions in the \acrshort{isa} do not support operand types smaller than 16-bits and instead convert them to larger data types via a \mintinline{asm}{cvt} conversion statement \cite{ptxisa}.

As seen in the \codecpp{BLinkNode} variant at lines 25--26, \codecpp{mChildren} and \codecpp{mSibling} both use the \codecpp{volatile} qualifier to avoid incorrect memory access optimization by the compiler. This qualifier tells the compiler to assume that the contents of the variable may be changed or used at any time by another thread. Therefore, all references to this variable will compile into an actual memory read or write \cite{cudaprog}.
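A simplified sketch of such a node follows (the field names mirror the listing, but the exact layout here is illustrative, not the thesis code):

\begin{minted}{cpp}
#include <cstdint>

template< typename KeyType, typename ValueType, std::size_t Order >
struct BLinkNode
{
   uint16_t mSize;                            // number of keys currently stored
   KeyType mKeys[ Order ];
   ValueType mValues[ Order ];                // used in leaf nodes only
   // volatile: the compiler must not cache these in registers, as a
   // concurrent split may rewrite them at any time
   BLinkNode* volatile mChildren[ Order + 1 ];
   BLinkNode* volatile mSibling;              // link pointer to the right sibling
   KeyType mHighKey;                          // upper bound of keys in this subtree
};
\end{minted}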
2 changes: 1 addition & 1 deletion chapters/state-of-art.tex
@@ -10,7 +10,7 @@ \section{Prior Art}

Sewall et al. \cite{palm} introduced novel modifications to B$^+$Tree operations in the proposed PALM technique. This technique uses the Bulk Synchronous Parallel model, where queries are grouped, and the work is distributed among threads. This work also optimizes synchronization by avoiding barriers in favor of point-to-point communication between adjacent threads. Latches are avoided under the condition that all search queries are completed before insertions and that a node may be written by exactly one thread.

-Previous projects related to GPU implementation of B-Trees focused on query throughput. Fix et al. \cite{fix2011accelerating} measured substantial performance speedup compared to sequential CPU execution by running independent queries in each thread block. Until recently, the general approach for updating is either to perform such updates on the CPU or to rebuild the structure from scratch, which is the case of Fix et al. implementation.
+Previous projects related to GPU implementations of B-Trees focused on query throughput. Fix et al. \cite{fix2011accelerating} measured a substantial performance speedup compared to sequential CPU execution by running independent queries in each thread block. Until recently, the general approach to updating was either to perform such updates on the CPU or to rebuild the structure from scratch, which is the case for this implementation.

Kim et al. proposed FAST \cite{fast}, a configurable high-performance tree optimized for SIMD and multi-core CPU and GPU systems. The structure can be configured towards the target hardware architecture by specifying the size of a cache line, SIMD register width, and memory page size. Similar to Fix et al., only bulk creation and querying are supported. Updates are done by rebuilding the tree.

6 changes: 3 additions & 3 deletions chapters/theory/b+tree.tex
@@ -4,16 +4,16 @@ \section{B$^+$Tree}
B$^+$Tree is a B-Tree where keys are stored exclusively in leaf nodes.
\end{definition}

-Separators found in internal nodes can be freely chosen and may not match the actual keys in leaf nodes, as long as these separators split the tree into subtrees and preserve the ordering of the keys. Value pointers are also exclusively stored in leaf nodes, highlighted by blue arrows in \cref{figure:b-plus-tree}.
+Separators found in internal nodes can be freely chosen and need not match the actual keys in leaf nodes, as long as these separators split the tree into subtrees and preserve the ordering of the keys. Value pointers are also exclusively stored in leaf nodes, highlighted as blue lines in \cref{figure:b-plus-tree}.

-As the B$^+$Tree does not reuse the keys and may duplicate the keys found in the leaf nodes to use as separators in internal nodes, they do bear increased storage requirements. Compressing techniques on keys can be used to reduce the increased space complexity in exchange for increased complexity caused by compressing itself.
+As the B$^+$Tree does not reuse the keys and may duplicate the keys found in the leaf nodes to serve as separators in internal nodes, it bears increased storage requirements. Compression techniques on keys can be used to mitigate the increased space complexity in exchange for a slight execution complexity increase due to the compression itself.

In most implementations, leaf nodes may include an additional pointer to the right sibling node, as seen in \cref{figure:b-plus-tree} highlighted as red arrows, enabling straightforward sequential traversal, which is helpful for range querying.
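A short sketch of such a range scan (the \codecpp{Leaf} layout is hypothetical, introduced only for illustration):

\begin{minted}{cpp}
#include <cstdio>

// hypothetical leaf layout: sorted keys, values, and a right-sibling pointer
struct Leaf
{
   int mKeys[ 3 ];
   int mValues[ 3 ];
   int mSize;
   Leaf* mSibling;
};

// visit all key-value pairs in [lo, hi]; 'leaf' holds the first key >= lo
// (located by an ordinary root-to-leaf descent, omitted here)
void rangeQuery( const Leaf* leaf, int lo, int hi )
{
   while( leaf != nullptr ) {
      for( int i = 0; i < leaf->mSize; i++ ) {
         if( leaf->mKeys[ i ] < lo )
            continue;
         if( leaf->mKeys[ i ] > hi )
            return;                      // past the range: done
         printf( "%d -> %d\n", leaf->mKeys[ i ], leaf->mValues[ i ] );
      }
      leaf = leaf->mSibling;             // follow the sibling pointer
   }
}
\end{minted}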

\begin{figure}
\centering
\input{components/figure/b+tree.tex}
-\caption[B$^+$Tree with $\mathit{Order} = 3$.]{B$^+$Tree with $\mathit{Order} = 3$. Blue arrows indicate a pointer to a value, red arrows indicate an optional pointer to a sibling node.}
+\caption[B$^+$Tree with $\mathit{Order} = 3$.]{B$^+$Tree with $\mathit{Order} = 3$. Blue lines indicate a pointer to a value; red arrows indicate an optional pointer to a sibling node.}
\label{figure:b-plus-tree}
\end{figure}

12 changes: 6 additions & 6 deletions chapters/theory/b-link-tree.tex
@@ -17,7 +17,7 @@ \section{B-Link-Tree}\label{section:b-link-tree}
\begin{figure}[H]
\centering
\input{components/figure/b-link-tree.tex}
-\caption[B-Link-Tree with $\mathit{Order} = 3$.]{B-Link-Tree with $\mathit{Order} = 3$. Blue arrows indicate a pointer to a value, red arrows indicate an optional pointer to a sibling node.}
+\caption[B-Link-Tree with $\mathit{Order} = 3$.]{B-Link-Tree with $\mathit{Order} = 3$. Blue lines indicate a pointer to a value; red arrows indicate an optional pointer to a sibling node.}
\label{figure:b-link-tree}
\end{figure}

@@ -35,19 +35,19 @@ \subsection{Insertion and Search}
\end{figure}


-Assuming node $x$ is a full node, which needs to be split, shown in step (a) of \cref{figure:b-link-insert-states}. When splitting the node $x$, a new right sibling node $x^{\prime\prime}$ is created, as seen in step (b). The node $y$ inherits the high key from the split node $x$, whereas the $x$ node is updated (marked as $x^\prime$) with a new $x.highkey = x^{\prime\prime}.key_0$. Thus, an internal node does exist without a parent in between the operations.
+Assume node $x$ is a full node that needs to be split, as shown in step (a) of \cref{figure:b-link-insert-states}. When splitting the node $x$, a new right sibling node $x^{\prime\prime}$ is created, as seen in step (b). The node $x^{\prime\prime}$ inherits the high key and the sibling pointer from the split node $x$, whereas the node $x$ is updated (marked as $x^\prime$) with a new sibling pointer and $x.highkey = x^{\prime\prime}.key_0$. Thus, an internal node can temporarily exist without a parent in between the operations.

As the final step of node splitting, both the separator key and the pointer to the newly created node $x^{\prime\prime}$ are inserted into the parent node $p$, as seen in step (c). Similar to insertion in a B-Tree, a split operation might trigger additional splitting at higher levels.
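The split protocol can be sketched as follows (a hypothetical single-threaded outline; latching and the parent insertion are omitted, and children of internal nodes would be moved analogously to the keys):

\begin{minted}{cpp}
#include <algorithm>

// hypothetical node layout used in the sketches below
struct Node
{
   static const int Order = 4;
   bool leaf;
   int size;                      // number of keys currently stored
   int keys[ Order ];
   Node* children[ Order + 1 ];   // unused in leaf nodes
   int highkey;                   // upper bound of keys under this node
   Node* sibling;                 // link pointer to the right sibling
};

// split the full node x; returns the new right sibling x''
Node* split( Node* x )
{
   Node* right = new Node;
   const int half = x->size / 2;

   // the new sibling takes the upper half of the keys and inherits
   // x's old high key and sibling pointer
   std::copy( x->keys + half, x->keys + x->size, right->keys );
   right->leaf = x->leaf;
   right->size = x->size - half;
   right->highkey = x->highkey;
   right->sibling = x->sibling;

   // x is shrunk in place: x.highkey = x''.key_0, sibling now points to x''
   x->size = half;
   x->highkey = right->keys[ 0 ];
   x->sibling = right;

   // the caller must still insert (x->highkey, right) into the parent
   return right;
}
\end{minted}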

Tree traversal is modified to honor $x.highkey$ by continuing to the node at $x.\mathit{sibling}$ when the target key $k$ is greater than or equal to $x.highkey$. With these additional attributes, it is possible to traverse the entire tree, even though some nodes (such as $x^{\prime\prime}$ in step (b) of \cref{figure:b-link-insert-states}) are yet to be inserted into the parent node, as shown in the sketch below.
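Reusing the hypothetical \codecpp{Node} layout from the previous sketch, the modified descent looks roughly like this (the exact separator comparison may differ from the thesis implementation):

\begin{minted}{cpp}
// descend towards key k, moving right whenever k lies at or beyond the
// current node's high key (i.e. a concurrent split has moved it right)
Node* traverse( Node* node, int k )
{
   while( ! node->leaf ) {
      if( node->sibling != nullptr && k >= node->highkey ) {
         node = node->sibling;     // follow the link pointer
         continue;
      }
      int i = 0;
      while( i < node->size && k >= node->keys[ i ] )
         i++;                      // pick the child subtree covering k
      node = node->children[ i ];
   }
   // at the leaf level, keep moving right the same way
   while( node->sibling != nullptr && k >= node->highkey )
      node = node->sibling;
   return node;
}
\end{minted}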

\subsection{Proof of correctness}

-The following theorems need to be proven to prove the correctness of each operation performed on the B-Link-Tree:
+The following theorems must be proven to establish the correctness of each operation performed on the B-Link-Tree \cite{lehman}:

\begin{itemize}
-\item \textit{Deadlock freedom} --- threads performing operations on the B-Link-Tree cannot produce a deadlock,
-\item \textit{Correct tree modifications} --- the tree must appear as a valid tree for all nodes at any time,
+\item \textit{Deadlock freedom} --- threads performing operations on the B-Link-Tree cannot produce a deadlock.
+\item \textit{Correct tree modifications} --- the tree must appear as a valid tree for all nodes at any time.
\item \textit{Correct interactions} --- concurrent operations do not interfere with one another.
\end{itemize}

@@ -119,7 +119,7 @@ \subsection{Proof of correctness}

If the insertion happens on an internal node ($x.\mathit{leaf} = \mathit{false}$), a key-pointer pair created by splitting a lower-level node $z$ into $z^\prime$ and $z^{\prime\prime}$ is inserted into the node $x$. This scenario is the only one where a key-pointer pair could propagate upwards to node $x$. The process $P$ will be able to utilize the link pointers $z.\mathit{sibling}$ to reach both the original node and the newly split node.

-In the second and third scenarios, the process $I$ has split the node $x$ into two nodes $x^\prime$ and $x^{\prime\prime}$. If the process happens on a leaf node, $P$ will continue as if no insertion has occurred. Similar to the first scenario, the only possible scenario where the process $I$ needs to split is when a child node $z$ went through a split and a new separator key and a pointer to $z^{\prime\prime}$.
+In the second and third scenarios, the process $I$ has split the node $x$ into two nodes $x^\prime$ and $x^{\prime\prime}$. If the split happens on a leaf node, $P$ will continue as if no insertion has occurred. Similar to the first scenario, the only possible case where the process $I$ needs to split a non-leaf node is when a child node $z$ went through a split and a new separator key and a pointer to $z^{\prime\prime}$ are being inserted into the node $x$.

Both the insertion and the search in the node $z^{\prime\prime}$ below node $x$ will be correct thanks to \cref{theorem:b-link-tree:modifications}. It only remains to prove the correctness of the split operation on node $x$.

Expand Down
Loading

0 comments on commit 10746f8

Please sign in to comment.