k-Nearest Neighbor algorithm
This project uses SDAccel to implement the k-Nearest Neighbor (KNN) algorithm on a Xilinx FPGA. The k-nearest neighbor algorithm finds the k nearest neighbors of a specified point among a set of unstructured data points. It is widely used in a diverse range of domains and applications, such as pattern recognition, machine learning, computer vision, and coding theory. For a set S of n reference data points in a d-dimensional space and a query point q, the algorithm finds the k closest points in S to q. This is illustrated in Figure 1 for k=3 and n=12: the red triangle represents the query point, while the blue diamonds represent the points in the reference data set.

![KNN search for k=3 among 12 reference points](./figures/knn.png)

Figure 1: Pictorial demonstration of the KNN search algorithm
The algorithm includes the following main steps:

- Compute n distances between the query point q and the n reference points of the set S. The distances in our case are the squared Euclidean distances, which for two 2-dimensional points (x1, y1) and (x2, y2) are:

  d = (x1 - x2)^2 + (y1 - y2)^2

- Identify the k reference points with the smallest distances to q; these are the k nearest neighbors.
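The distance step can be sketched as plain host-side C++ (an illustrative reference, not the repository's OpenCL kernel; the names `Point`, `squaredDist`, and `computeDistances` are assumptions):

```cpp
#include <cstddef>
#include <vector>

struct Point { float x, y; };

// Squared Euclidean distance between two 2-D points:
// d = (x1 - x2)^2 + (y1 - y2)^2. The square root is skipped,
// since it does not change the ordering of the neighbors.
float squaredDist(const Point& a, const Point& b) {
    float dx = a.x - b.x;
    float dy = a.y - b.y;
    return dx * dx + dy * dy;
}

// Compute all n distances between the query point q and the reference set S.
std::vector<float> computeDistances(const std::vector<Point>& S, const Point& q) {
    std::vector<float> dist(S.size());
    for (std::size_t i = 0; i < S.size(); ++i)
        dist[i] = squaredDist(S[i], q);
    return dist;
}
```

Skipping the square root is a common KNN simplification: the squared distance is monotonic in the true distance, so the k smallest squared distances identify the same k neighbors.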
The repository contains three directories, namely "data", "knn_cpu" and "knn_fpga". The "data" directory contains a file with a sample set of reference data points. The "knn_cpu" and "knn_fpga" directories contain two different implementations of the KNN algorithm, in which the nearest neighbor identification is done on the CPU and FPGA respectively. Each implementation contains the OpenCL code of the algorithm along with the required host code and the Tcl script used to run the example in SDAccel.
```
KNN
│   README.md
│
├── data
│   └── filelist.txt
│
├── knn_cpu
│   │   main_cpu.cpp
│   │   nn_cpu.cl
│   │   params.h
│   └── solution_cpu.tcl
│
└── knn_fpga
    │   main_fpga.cpp
    │   nn_fpga.cl
    │   params.h
    └── solution_fpga.tcl
```
| File/Directory | Description |
|----------------|-------------|
| knn_cpu | Contains the "single kernel implementation" of the algorithm; the nearest-neighbor identification is done on the host. |
| knn_fpga | Contains the "two kernel implementation" of the algorithm, which uses a memory architecture optimization provided by SDAccel: the global memories used to communicate between the kernels are implemented as streaming buffers in the FPGA block RAMs. The nearest-neighbor identification in this case is done on the FPGA. |
| filelist.txt | Contains the points of a reference data set (294912 points). |
| params.h | Constant parameters (see below). |
| main_cpu.cpp | The host code for the implementation with nearest-neighbor identification done on the host. |
| nn_cpu.cl | The kernel that calculates all the distances and writes them back to the host for nearest-neighbor identification. |
| solution_cpu.tcl | Script for running the "single kernel implementation" in SDAccel. |
| main_fpga.cpp | The host code for the "two kernel implementation", where the distances are calculated by multiple work-groups in one kernel and the neighbors are then identified by a single work-group (assuming that k is small) in another kernel. |
| nn_fpga.cl | The OpenCL code using global memory buffers to stream data between two kernels: one kernel calculates the distance between the query point and all points in the reference data set; the second identifies the k nearest neighbors using a single work-group. |
| solution_fpga.tcl | Script for running the "two kernel implementation" in SDAccel. |
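The neighbor-identification step performed on the host (in `knn_cpu`) or by the second kernel (in `knn_fpga`) can be sketched as a partial selection over the distance array (a host-side reference sketch, not the actual repository code; `kNearest` is an illustrative name):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Return the indices of the k smallest entries in dist, in ascending
// order of distance. This mirrors the neighbor-identification step,
// written as plain host code for clarity.
std::vector<std::size_t> kNearest(const std::vector<float>& dist, std::size_t k) {
    std::vector<std::size_t> idx(dist.size());
    for (std::size_t i = 0; i < idx.size(); ++i) idx[i] = i;
    // A partial sort is enough: only the first k positions must be ordered,
    // which is cheaper than sorting all n distances when k is small.
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](std::size_t a, std::size_t b) { return dist[a] < dist[b]; });
    idx.resize(k);
    return idx;
}
```

The same "k is small" assumption that lets a single work-group handle this step on the FPGA also makes the partial sort on the host cheap relative to the distance computation.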
The number of nearest neighbors to be returned, the required work-group size, and the maximum size of an input file line are defined in "params.h".

In the "two kernel" implementation, "params.h" also specifies the size of the on-chip global memory buffer used for inter-kernel communication. Note that SDAccel implements this buffer as a streaming buffer, so its exact size does not matter.
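A representative "params.h" might look like the following (the macro names and values here are illustrative, not copied from the repository):

```c
/* params.h -- illustrative sketch; the actual macro names and
 * values in the repository may differ. */
#ifndef PARAMS_H
#define PARAMS_H

#define K               3    /* number of nearest neighbors returned     */
#define WORK_GROUP_SIZE 256  /* required work-group size for the kernels */
#define MAX_LINE_LENGTH 128  /* maximum size of an input file line       */

/* Two-kernel implementation only: nominal size of the on-chip global
 * memory buffer between the kernels. SDAccel implements it as a
 * streaming buffer, so the exact value is not critical. */
#define BUFFER_SIZE     4096

#endif /* PARAMS_H */
```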
How to run an example
The code has been tested with Xilinx SDAccel 2015.1.
The script files for each test case are passed to the SDAccel executable (for example, `sdaccel solution_cpu.tcl` from the corresponding directory; the exact invocation may differ between SDAccel versions).
The k nearest neighbors, the given query point, and the total number of points are printed on standard output.
The output after a successful run looks like the following:

![Sample output of a successful run](./figures/samplerun.png)

Figure 2: Sample output after a successful run
We considered the following performance metrics:

- Initiation Interval (II): the number of clock cycles between the start of two consecutive iterations of a loop.
- Latency: the number of clock cycles required to complete the execution of the entire kernel.
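The two metrics combine in the standard pipelined-loop estimate: total cycles ≈ pipeline depth + (trip count − 1) × II, which is why driving II toward 1 reduces latency so sharply. A small worked example (the numbers are illustrative, not taken from these kernels):

```cpp
// Estimated cycle count of a pipelined loop: the first iteration takes
// `depth` cycles to flow through the pipeline, and each subsequent
// iteration starts `ii` cycles after the previous one.
long pipelinedCycles(long depth, long ii, long tripCount) {
    return depth + (tripCount - 1) * ii;
}
```

For instance, a loop with a pipeline depth of 8, II = 1, and 100 iterations completes in about 107 cycles, whereas the same loop unpipelined (effectively II = depth) would take roughly 800.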
Both metrics have a profound effect on the execution time and energy consumption of an FPGA implementation. The optimization options applied significantly reduce the overall latency of the different implementations. Both cases use the "reqd_work_group_size" attribute, which is necessary to allow performance optimization when generating custom logic for a kernel.
With work-item pipelining, the latency of the "knn_cpu" case drops from 7433 to 294 clock cycles. For the "knn_fpga" case, which uses on-chip block RAMs to implement the global memory buffers for inter-kernel communication, the best-case latency drops from 5901 clock cycles in the un-optimized case to 292. This improvement in overall latency shows up in the performance comparison tables below.
The FPGA vs. GPU performance comparison for the "knn_cpu" and "knn_fpga" implementations in terms of execution time, power, and energy consumption is described here. Note that in the "knn_cpu" case, the nearest-neighbor identification time on the host (the CPU) has been added to the distance calculation time on the FPGA to obtain the total execution time; this "sorting and neighbor identification time" for the "knn_cpu" case is also reported below. The execution time of each FPGA implementation has been extrapolated to account for the total number of work-groups. The devices used for comparison are the following:
- NVIDIA GeForce GTX 960 with 1024 cores
- NVIDIA Quadro K4200 with 1344 cores
- Xilinx Virtex 7 xc7vx690tffg1157-2
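The extrapolation and the energy figures reduce to simple arithmetic (a sketch with illustrative function names; the measured inputs come from the profiling runs):

```cpp
// Extrapolate the total kernel time from one work-group's measured
// time, assuming the work-groups execute back to back.
double totalTimeMs(double perWorkGroupMs, long numWorkGroups) {
    return perWorkGroupMs * numWorkGroups;
}

// Energy in millijoules: power (W) times time (ms), since W * ms = mJ.
double energyMj(double powerW, double timeMs) {
    return powerW * timeMs;
}
```

The energy column is where an FPGA typically wins even when its raw execution time is close to a GPU's, because its power draw is much lower.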
| Platform | Total Time | Sort Time (CPU) | Power (W) | Energy (mJ) |
|----------|------------|-----------------|-----------|-------------|