README.md

Getting Started Examples

This page contains examples for users who are new to Xilinx SDx OpenCL flows. The examples focus on code optimization for Xilinx devices. The table lists the example categories in the suggested order for working through them.

Prerequisites

  • The user is familiar with the basics of the OpenCL flow.
  • The user has gone through the SDx tutorial and is familiar with the basics of tool functionality and terminology.

S.No. Category Description
1 host OpenCL host code for optimized interfacing with Xilinx devices.
2 kernel_to_gmem Kernel to global memory access optimization.
3 kernel_opt Kernel optimization for performance.
4 dataflow Kernel optimization through macro-level pipelining.
5 clk_freq Improving kernel clock frequency through optimized code.
6 debug Debugging and profiling of kernels.
7 rtl_kernel RTL kernel based examples.
8 misc Miscellaneous OpenCL examples.
9 cpu_to_fpga Labs that showcase CPU-to-FPGA conversion with kernel optimizations.

Examples Table

Example Description Key Concepts / Keywords
host/concurrent_kernel_execution_ocl/ This example will demonstrate how to use multiple and out of order command queues to simultaneously execute multiple kernels on an FPGA. Key Concepts
- Concurrent execution
- Out of Order Command Queues
- Multiple Command Queues
Keywords
- CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE
- clSetEventCallback()
host/copy_buffer_ocl/ This example demonstrates how the contents of one buffer can be copied to another buffer. Key Concepts
- Copy Buffer
Keywords
- cl::CommandQueue::enqueueCopyBuffer()
host/data_transfer_ocl/ This example illustrates several ways to use the OpenCL API to transfer data to and from the FPGA. Key Concepts
- OpenCL API
- Data Transfer
- Write Buffers
- Read Buffers
- Map Buffers
- Async Memcpy
Keywords
- enqueueWriteBuffer()
- enqueueReadBuffer()
- enqueueMapBuffer()
- enqueueUnmapMemObject()
- enqueueMigrateMemObjects()
host/device_query_ocl/ This example prints the OpenCL properties of the platform and its devices. It also displays the limits and capabilities of the hardware. Key Concepts
- OpenCL API
- Querying device properties
Keywords
- clGetPlatformIDs()
- clGetPlatformInfo()
- clGetDeviceIDs()
- clGetDeviceInfo()
host/errors_ocl/ This example discusses the different causes of errors in OpenCL and how to handle them at runtime. Key Concepts
- OpenCL API
- Error handling
Keywords
- CL_SUCCESS
- CL_DEVICE_NOT_FOUND
- CL_DEVICE_NOT_AVAILABLE
host/helloworld_c/ This is a simple vector addition example that shows how to use HLS kernels in the SDx environment. It highlights concepts such as PIPELINE, which increases kernel performance. Key Concepts
- HLS C Kernel
- OpenCL Host APIs
Keywords
- gmem
- bundle
- #pragma HLS INTERFACE
- m_axi
- s_axi4lite
host/helloworld_ocl/ This example is a simple OpenCL application. It will highlight the basic flow of an OpenCL application. Key Concepts
- OpenCL API
host/host_global_bandwidth/ Host to global memory bandwidth test
host/kernel_swap_ocl/ This example shows how the host can swap kernels and share the same buffer between two kernels that exist in separate binary containers. Dynamic platforms do not persist buffer data, so the host has to migrate data from device to host memory before swapping in the next kernel. After the kernel swap, the host has to migrate the buffer back to the device. Key Concepts
- Handling Buffer sharing across multiple binaries
- Multiple Kernel Binaries
Keywords
- clEnqueueMigrateMemObjects()
- CL_MIGRATE_MEM_OBJECT_HOST
host/multiple_devices_ocl/ This example shows how to take advantage of multiple FPGAs on a system. It shows how to initialize an OpenCL context, allocate memory on the two devices, and execute a kernel on each FPGA. Key Concepts
- OpenCL API
- Multi-FPGA Execution
- Event Handling
Keywords
- cl_device_id
- clGetDeviceIDs()
host/overlap_ocl/ This example demonstrates techniques that allow the user to overlap host (CPU) and FPGA computation in an application. It covers asynchronous operations and event objects. Key Concepts
- OpenCL API
- Synchronize Host and FPGA
- Asynchronous Processing
- Events
- Asynchronous memcpy
Keywords
- cl_event
- clCreateCommandQueue
- CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE
- clEnqueueMigrateMemObjects
host/stream_access_ocl/ This is a simple example that demonstrates how to process an input stream of data for computation in an application. It shows how to perform asynchronous operations and event handling. Key Concepts
- OpenCL API
- Synchronize Host and FPGA
- Asynchronous Processing
- Events
- Asynchronous Data Transfer
Keywords
- cl::Event
- CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE
host/sub_devices_ocl/ This example demonstrates how to create OpenCL sub-devices. It uses a single kernel multiple times to show how to handle each instance independently, including independent buffers, command queues, and sequencing. Key Concepts
- Sub Devices
Keywords
- cl_device_partition_property
- createSubDevices
- CL_DEVICE_PARTITION_EQUALLY
host/errors_oclpp/ This example discusses the different causes of errors in OpenCL C++ and how to handle them at runtime. Key Concepts
- OpenCL C++ API
- Error handling
Keywords
- CL_SUCCESS
- CL_DEVICE_NOT_FOUND
- CL_DEVICE_NOT_AVAILABLE
- CL_INVALID_VALUE
- CL_INVALID_KERNEL_NAME
- CL_INVALID_BUFFER_SIZE
host/device_query_oclpp/ This example prints the OpenCL properties of the platform and its devices using the OpenCL C++ APIs. It also displays the limits and capabilities of the hardware. Key Concepts
- OpenCL API
- Querying device properties
kernel_to_gmem/burst_rw_c/ This is a simple example of using the AXI4 master interface for burst reads and writes. Key Concepts
- burst access
Keywords
- memcpy
- max_read_burst_length
- max_write_burst_length
kernel_to_gmem/burst_rw_ocl/ This is a simple example of using the AXI4 master interface for burst reads and writes. Key Concepts
- burst access
Keywords
- param:compiler.interfaceWrBurstLen
- param:compiler.interfaceRdBurstLen
kernel_to_gmem/custom_datatype_c/ This is a simple example of RGB to HSV conversion that demonstrates custom data type usage in a C-based kernel. The Xilinx HLS compiler supports custom data types both for operations and for the memory interface between the kernel and global memory. Key Concepts
- Custom Datatype
Keywords
- struct
- #pragma HLS data_pack
- #pragma HLS LOOP_TRIPCOUNT
kernel_to_gmem/custom_datatype_ocl/ This is a simple example of RGB to HSV conversion that demonstrates custom data type usage in an OpenCL-based kernel. The Xilinx HLS compiler supports custom data types both for operations and for the memory interface between the kernel and global memory. Key Concepts
- Dataflow
- Custom Datatype
Keywords
- struct
kernel_to_gmem/full_array_2d_c/ This is a simple example of accessing the full data of a 2D array. Key Concepts
- 2D data full array Access
kernel_to_gmem/full_array_2d_ocl/ This is a simple example of accessing the full data of a 2D array. Key Concepts
- 2D data full array Access
kernel_to_gmem/gmem_2banks_c/ This example demonstrates how to use a 2-DDR DSA and how to create buffers in each DDR bank. Key Concepts
- Multiple Banks
Keywords
- max_memory_ports
- misc:map_connect
- cl_mem_ext_ptr_t
- XCL_MEM_DDR_BANK0
- XCL_MEM_DDR_BANK1
- XCL_MEM_DDR_BANKx
- CL_MEM_EXT_PTR_XILINX
- HLS Interface m_axi bundle
kernel_to_gmem/gmem_2banks_ocl/ This example demonstrates how to use a 2-DDR DSA and how to create buffers in each DDR bank. Key Concepts
- Multiple Banks
Keywords
- max_memory_ports
- misc:map_connect
- cl_mem_ext_ptr_t
- XCL_MEM_DDR_BANK0
- XCL_MEM_DDR_BANK1
- XCL_MEM_DDR_BANKx
- CL_MEM_EXT_PTR_XILINX
kernel_to_gmem/kernel_global_bandwidth/ Bandwidth test of global to local memory.
kernel_to_gmem/memcoalesce_hang_c/ This example shows a memory coalesce deadlock/hang situation and how to handle it. The user can switch between the BAD and GOOD cases using the makefile variable KFLOW. Key Concepts
- Memory Coalesce
- Memory Deadlock/Hang
- Multiple Interfaces
Keywords
- HLS INTERFACE
- bundle
- m_axi
kernel_to_gmem/row_array_2d_c/ This is a simple example of accessing each row of data in a 2D array. Key Concepts
- Row of 2D data array access
Keywords
- hls::stream
kernel_to_gmem/row_array_2d_ocl/ This is a simple example of accessing each row of data in a 2D array. Key Concepts
- Row of 2D data array access
Keywords
- xcl_dataflow
- xcl_pipeline_loop
kernel_to_gmem/wide_mem_rw_c/ This is a simple example of vector addition that demonstrates wide memory access using the ap_uint<512> data type. Based on the input argument type, the xocc compiler figures out the memory data width between global memory and the kernel. For this example, the ap_uint<512> data type is used, so the memory data width is 16 x (integer bit size) = 16 x 32 = 512 bits. Key Concepts
- Kernel to DDR
- wide memory access
- burst read and write
Keywords
- ap_uint<>
- ap_int.h
kernel_to_gmem/wide_mem_rw_ocl/ This is a simple example of vector addition that demonstrates wide memory access using the uint16 data type. Based on the input argument type, the xocc compiler figures out the memory data width between global memory and the kernel. For this example, the uint16 data type is used, so the memory data width is 16 x (integer bit size) = 16 x 32 = 512 bits. Key Concepts
- Kernel to DDR
- wide memory access
- burst read and write
Keywords
- uint16
- xcl_pipeline_loop
kernel_to_gmem/window_array_2d_c/ This is a simple example of accessing each window of data in a 2D array. Key Concepts
- window of 2D data array access
Keywords
- #pragma HLS DATAFLOW
- #pragma HLS PIPELINE
- #pragma HLS stream
kernel_to_gmem/window_array_2d_ocl/ This is a simple example of accessing each window of data in a 2D array. Key Concepts
- window/tile of 2D data array access
Keywords
- pipe
- xcl_pipeline_loop
- xcl_reqd_pipe_depth
kernel_opt/aos_vs_soa_ocl/ This example demonstrates how data layout can impact the performance of certain kernels. It shows how using the Structure of Arrays data layout can affect certain data-parallel problems. Key Concepts
- Kernel Optimization
- Data Layout
kernel_opt/array_partition_ocl/ This example shows how to use array partitioning to improve the performance of a kernel. Key Concepts
- Kernel Optimization
- Array Partitioning
Keywords
- xcl_array_partition
- complete
kernel_opt/lmem_2rw_c/ This is a simple example of vector addition that demonstrates how to utilize both ports of local memory. Key Concepts
- Kernel Optimization
- 2port BRAM Utilization
- two read/write Local Memory
Keywords
- #pragma HLS UNROLL FACTOR=2
kernel_opt/lmem_2rw_ocl/ This is a simple example of vector addition that demonstrates how to utilize both ports of local memory. Key Concepts
- Kernel Optimization
- 2port BRAM Utilization
- two read/write Local Memory
Keywords
- opencl_unroll_hint(2)
kernel_opt/loop_fusion_c/ This example will demonstrate how to fuse two loops into one to improve the performance of an OpenCL C/C++ Kernel. Key Concepts
- Kernel Optimization
- Loop Fusion
- Loop Pipelining
Keywords
- #pragma HLS PIPELINE
kernel_opt/loop_fusion_ocl/ This example will demonstrate how to fuse two loops into one to improve the performance of an OpenCL kernel. Key Concepts
- Kernel Optimization
- Loop Fusion
- Loop Pipelining
Keywords
- xcl_pipeline_loop
kernel_opt/loop_pipeline_ocl/ This example demonstrates how loop pipelining can be used to improve the performance of a kernel. Key Concepts
- Kernel Optimization
- Loop Pipelining
Keywords
- xcl_pipeline_loop
kernel_opt/loop_reorder_c/ This is a simple example of matrix multiplication (Row x Col) that demonstrates how to achieve a better pipeline initiation interval (II) by loop reordering. Key Concepts
- Kernel Optimization
- Loop reorder to improve II
Keywords
- #pragma HLS PIPELINE
- #pragma HLS ARRAY_PARTITION
kernel_opt/loop_reorder_ocl/ This is a simple example of matrix multiplication (Row x Col) that demonstrates how to achieve a better pipeline initiation interval (II) by loop reordering. Key Concepts
- Kernel Optimization
- Loop reorder to improve II
Keywords
- xcl_pipeline_loop
- xcl_array_partition(complete, 2)
kernel_opt/partition_cyclicblock_c/ This example shows how to use block and cyclic array partitioning to improve the performance of a kernel. Key Concepts
- Kernel Optimization
- Array Partitioning
- Block Partition
- Cyclic Partition
Keywords
- #pragma HLS ARRAY_PARTITION
- cyclic
- block
- factor
- dim
kernel_opt/partition_cyclicblock_ocl/ This example shows how to use block and cyclic array partitioning to improve the performance of a kernel. Key Concepts
- Kernel Optimization
- Array Partitioning
- Block Partition
- Cyclic Partition
Keywords
- xcl_array_partition
- cyclic
- block
kernel_opt/shift_register_c/ This example demonstrates how to shift values in registers in each clock cycle. Key Concepts
- Kernel Optimization
- Shift Register
- FIR
Keywords
- #pragma HLS ARRAY_PARTITION
kernel_opt/shift_register_ocl/ This example demonstrates how to shift values in registers in each clock cycle. Key Concepts
- Kernel Optimization
- Shift Register
- FIR
Keywords
- xcl_array_partition
- getProfilingInfo()
kernel_opt/systolic_array_c/ This is a simple example of matrix multiplication (Row x Col) to help developers learn systolic array based algorithm design. Note: Systolic array based algorithm design is well suited for FPGA.
kernel_opt/systolic_array_ocl/ This is a simple example of matrix multiplication (Row x Col) to help developers learn systolic array based algorithm design. Note: Systolic array based algorithm design is well suited for FPGA.
kernel_opt/vectorization_memorycoalescing_ocl/ This example is a simple OpenCL application which highlights the vectorization concept. It provides a basis for calculating the bandwidth utilization when the compiler is looking to vectorize. Key Concepts
- Vectorization
- Memory Coalescing
Keywords
- vec_type_hint
dataflow/dataflow_func_ocl/ This is a simple example of vector addition that demonstrates dataflow functionality in an OpenCL kernel. OpenCL dataflow allows the user to run multiple functions together to achieve higher throughput. Key Concepts
- Function/Task Level Parallelism
Keywords
- xcl_dataflow
- xclDataflowFifoDepth
dataflow/dataflow_loop_c/ This is a simple example of vector addition that demonstrates the loop dataflow functionality of HLS. HLS dataflow allows the user to schedule multiple sequential loops concurrently to achieve higher throughput. Key Concepts
- Loop Dataflow
Keywords
- dataflow
- hls::stream<>
dataflow/dataflow_loop_ocl/ This is a simple example of vector addition that demonstrates loop dataflow functionality. OpenCL dataflow allows the user to schedule multiple sequential loops to run concurrently to achieve higher throughput. Key Concepts
- Loop Dataflow
Keywords
- xcl_dataflow
- xclDataflowFifoDepth
dataflow/dataflow_pipes_ocl/ This is a simple example of vector addition that demonstrates OpenCL pipe memory usage. OpenCL pipe memory allows the user to achieve kernel-to-kernel data transfer without using global memory. Key Concepts
- Dataflow
- kernel to kernel pipes
Keywords
- pipe
- xcl_reqd_pipe_depth
- read_pipe_block()
- write_pipe_block()
dataflow/dataflow_stream_array_c/ This is a simple example of multi-stage vector addition that demonstrates the use of an array of streams in HLS C kernel code. Key Concepts
- Array of Stream
Keywords
- dataflow
- hls::stream<>
dataflow/dataflow_stream_c/ This is a simple example of vector addition that demonstrates the dataflow functionality of HLS. HLS dataflow allows the user to schedule multiple tasks together to achieve higher throughput. Key Concepts
- Task Level Parallelism
Keywords
- dataflow
- hls::stream<>
dataflow/dataflow_subfunc_ocl/ This is a simple example of vector addition that demonstrates how OpenCL dataflow allows the user to run multiple sub-functions together to achieve higher throughput. Key Concepts
- SubFunction Level Parallelism
Keywords
- xcl_dataflow
- xclDataflowFifoDepth
clk_freq/critical_path_ocl/ This example shows a typical coding style that can lead to critical path issues and degraded design timing. It also contains a better coding style that can improve design timing. Key Concepts
- Critical Path handling
- Improve Timing
clk_freq/large_loop_c/ This is a CNN (Convolutional Neural Network) based example that focuses mainly on the convolution operation of a CNN. The goal of this example is to demonstrate a method for overcoming kernel design timing failures. It also presents the effectiveness of using multiple compute units to improve performance. Key Concepts
- Clock Frequency
- Multiple Compute Units
- Convolutional Neural Networks
Keywords
- #pragma HLS ARRAY_PARTITION
- #pragma HLS PIPELINE
- #pragma HLS INLINE
clk_freq/large_loop_ocl/ This is a CNN (Convolutional Neural Network) based example that focuses mainly on the convolution operation of a CNN. The goal of this example is to demonstrate a method for overcoming kernel design timing failures. It also presents the effectiveness of using multiple compute units to improve performance. Key Concepts
- Clock Frequency
- Multiple Compute Units
- Convolutional Neural Networks
Keywords
- xcl_array_partition
- xcl_pipeline_loop
- always_inline
clk_freq/split_kernel_c/ This is a multi-filter image processing application that showcases the effectiveness of using Dataflow/Streams. This example is intended to help developers break down complex kernels into multiple sub-functions using HLS Dataflow/Streams. It presents a way to concurrently execute multiple functions with better area utilization compared to a complex single kernel implementation. The main objective of this example is to showcase a way to build an optimal FPGA design that achieves maximum frequency with optimal resource utilization and better performance than a single complex kernel implementation. Key Concepts
- Dataflow
- Stream
Keywords
- #pragma HLS DATAFLOW
- hls::stream
- #pragma HLS INLINE
- #pragma HLS ARRAY_PARTITION
- #pragma HLS PIPELINE
clk_freq/split_kernel_ocl/ This is a multi-filter image processing application that showcases the effectiveness of using Dataflow/Streams. This example is intended to help developers break down a complex kernel into multiple sub-functions using OpenCL Dataflow. It presents a way to concurrently execute multiple functions with better area utilization compared to a complex single kernel implementation. The main objective of this example is to showcase a way to build an optimal FPGA design that achieves maximum frequency with optimal resource utilization and better performance than a single kernel implementation. Key Concepts
- Dataflow
- Stream
Keywords
- xcl_dataflow
- xcl_array_partition
- xcl_pipeline_loop
clk_freq/too_many_cu_c/ This is a simple example of vector addition that demonstrates the effectiveness of using a single compute unit with a heavy workload to achieve better performance. The bad example uses multiple compute units to achieve good performance, but this results in heavy usage of FPGA resources and area, which causes the design to fail timing. The good example uses a single compute unit with a heavier workload, which reduces resource utilization and also helps kernel scalability. To switch between the good and bad cases, use the flag provided in the makefile. Key Concepts
- Clock Frequency
- Data Level Parallelism
- Multiple Compute Units
Keywords
- #pragma HLS PIPELINE
- #pragma HLS ARRAY_PARTITION
clk_freq/too_many_cu_ocl/ This is a simple example of vector addition that demonstrates the effectiveness of using a single compute unit with a heavy workload to achieve better performance. The bad example uses multiple compute units to achieve good performance, but this results in heavy usage of FPGA resources and area, which causes the design to fail timing. The good example uses a single compute unit with a heavier workload, which reduces resource utilization and also helps kernel scalability. To switch between the good and bad cases, use the flag provided in the makefile. Key Concepts
- Clock Frequency
- Data Level Parallelism
- Multiple Compute Units
Keywords
- xcl_array_partition(complete, 1)
- xcl_pipeline_loop
debug/debug_printf_ocl/ This is a simple example of vector addition that prints the computed result (the addition). It is based on vector addition and demonstrates printing of work-item data (integer product in this case). Key Concepts
- Use of print statements for debugging
Keywords
- printf
- param:compiler.enableAutoPipelining=false
debug/debug_profile_ocl/ This is a simple example of vector addition that prints profile data (wall clock time taken between start and stop). It also dumps a waveform file which can be loaded into Vivado to view the waveform. Run the command 'vivado -source ./scripts/open_waveform.tcl -tclargs <device_name>-<kernel_name>..<device_name>.wdb' to launch the waveform viewer. The user can also change batch to gui in the sdaccel.ini file to see the live waveform while running the application. Key Concepts
- Use of Profile API
- Waveform Dumping and loading
rtl_kernel/rtl_adder_pipes/ This example shows an adder with pipes using 3 RTL kernels. Key Concepts
- RTL Kernel
- Multiple RTL Kernels
rtl_kernel/rtl_vadd/ Simple example of vector addition using an RTL kernel. Key Concepts
- RTL Kernel
rtl_kernel/rtl_vadd_2clks/ This example shows vector addition with 2 kernel clocks using RTL Kernel. Key Concepts
- RTL Kernel
- Multiple Kernel Clocks
Keywords
- --kernel_frequency
rtl_kernel/rtl_vadd_2kernels/ This example has two RTL Kernels. Both Kernel_0 and Kernel_1 perform vector addition. The Kernel_1 reads the output from Kernel_0 as one of two inputs. Key Concepts
- Multiple RTL Kernels
rtl_kernel/rtl_vadd_hw_debug/ This example showcases hardware debug of the vector addition RTL kernel. Key Concepts
- RTL Kernel Debug
rtl_kernel/rtl_vadd_mixed_cl_vadd/ This example has one RTL kernel and one CL kernel. Both RTL kernel and CL kernel perform vector addition. The CL kernel reads the output from RTL kernel as one of two inputs. Key Concepts
- Mixed Kernels
misc/sum_scan/ Example of parallel prefix sum.
misc/vadd/ Simple example of vector addition.
misc/vdotprod/ Simple example of vector dot-product.
cpu_to_fpga/00_cpu/ This is a simple example of matrix multiplication (Row x Col).
cpu_to_fpga/01_ocl/ This is a simple example of OpenCL matrix multiplication (Row x Col). Key Concepts
- OpenCL APIs
cpu_to_fpga/02_lmem_ocl/ This is a simple example of matrix multiplication (Row x Col) that demonstrates how to reduce the number of memory accesses using local memory. Key Concepts
- Kernel Optimization
- Local Memory
cpu_to_fpga/03_burst_rw_ocl/ This is a simple example of matrix multiplication (Row x Col) that demonstrates how to achieve better pipelining with burst reads from DDR into local memory and burst writes from local memory back to DDR. Key Concepts
- Kernel Optimization
- Burst Read/Write
cpu_to_fpga/04_partition_ocl/ This is a simple example of matrix multiplication (Row x Col) to demonstrate how to achieve better performance by array partitioning and loop unrolling. Key Concepts
- Array Partition
- Loop Unroll
Keywords
- xcl_pipeline_loop
- xcl_array_partition(complete, dim)
- opencl_unroll_hint
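
The cpu_to_fpga labs listed above all build on the same matrix multiplication (Row x Col). For orientation, here is a minimal plain C++ sketch of that computation. It is not the repository's code: the names `Matrix`, `mmult_baseline`, and `mmult_local` are hypothetical, the matrices are assumed to be square and stored row-major, and the "local buffer" variant only mimics on a CPU the access pattern that the local-memory and burst read/write labs describe (sequential copy in, compute on the local copy, sequential copy out).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical type alias: an n x n matrix stored row-major in a flat vector.
using Matrix = std::vector<int>;

// Baseline: each multiply-accumulate reads directly from the input arrays.
Matrix mmult_baseline(const Matrix& a, const Matrix& b, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    Matrix c(n * n, 0);
    for (std::size_t row = 0; row < n; ++row) {
        for (std::size_t col = 0; col < n; ++col) {
            int acc = 0;
            for (std::size_t k = 0; k < n; ++k) {
                acc += a[row * n + k] * b[k * n + col];
            }
            c[row * n + col] = acc;
        }
    }
    return c;
}

// Variant: copy the inputs into local buffers with one sequential pass each,
// compute only on the local copies, then copy the result out sequentially.
// On a CPU this changes nothing functionally; it only illustrates the
// structure (read phase, compute phase, write phase).
Matrix mmult_local(const Matrix& a, const Matrix& b, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    Matrix local_a(n * n), local_b(n * n), local_c(n * n, 0);

    // Read phase: sequential copies of the inputs.
    for (std::size_t i = 0; i < n * n; ++i) local_a[i] = a[i];
    for (std::size_t i = 0; i < n * n; ++i) local_b[i] = b[i];

    // Compute phase: touches only the local buffers.
    for (std::size_t row = 0; row < n; ++row) {
        for (std::size_t col = 0; col < n; ++col) {
            int acc = 0;
            for (std::size_t k = 0; k < n; ++k) {
                acc += local_a[row * n + k] * local_b[k * n + col];
            }
            local_c[row * n + col] = acc;
        }
    }

    // Write phase: sequential copy of the result.
    Matrix c(n * n);
    for (std::size_t i = 0; i < n * n; ++i) c[i] = local_c[i];
    return c;
}
```

Both functions take the two input matrices and the dimension `n` and return the product, so they can be compared directly on the same inputs. Separating the read, compute, and write phases is what lets the later labs apply their optimizations to each phase independently.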