accessor-BLAS

The purpose of this repository is to both showcase the performance of the Ginkgo accessor and to serve as an integration example.

Including Ginkgo's accessor in your project

Integration into the build system

To use Ginkgo's accessor, you need to:

  1. Use C++14 or higher in your own project
  2. Use ${GINKGO_DIR} as an include directory (you only need ${GINKGO_DIR}/accessor from Ginkgo); see the include sketch below
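
Once the include directory is set up, the accessor headers can be included directly. This is a sketch; the exact header file names are assumptions based on the layout of Ginkgo's accessor directory:

// Header paths are assumptions, based on ${GINKGO_DIR} being an include
// directory and the headers living in ${GINKGO_DIR}/accessor.
#include <accessor/range.hpp>
#include <accessor/reduced_row_major.hpp>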

In this repository, we use CMake, which makes the integration straightforward. We give users the option to either specify a local copy of the Ginkgo repository or to have it cloned automatically into the build directory. We achieve both with these few lines of CMake:

string(COMPARE EQUAL "${GINKGO_DIR}" "" empty_ginkgo_dir)
# If no directory was specified, clone Ginkgo into the build directory
if(empty_ginkgo_dir)
    message(STATUS "GINKGO_DIR was not specified. Cloning Ginkgo into build directory...")
    find_package(Git REQUIRED)
    execute_process(
        COMMAND "${GIT_EXECUTABLE}" clone --branch develop
                "https://github.com/ginkgo-project/ginkgo.git" "Ginkgo"
        WORKING_DIRECTORY "${accessor_example_BINARY_DIR}"
        RESULT_VARIABLE result
        )
    if(result)
        message(FATAL_ERROR "Failed to clone Ginkgo repository. Error: ${result}")
    endif()
    set(GINKGO_DIR "${accessor_example_BINARY_DIR}/Ginkgo" CACHE PATH "${GINKGO_DIR_INFO}" FORCE)
    message(STATUS "Ginkgo successfully cloned into \"${GINKGO_DIR}\"")
endif()

# This is everything that is needed to use the Ginkgo accessor: C++14 and the
# Ginkgo directory as include directory to have access to the accessor headers.
function(example_apply_default_target_settings target)
    target_compile_features("${target}" PUBLIC cxx_std_14)
    target_include_directories("${target}" PRIVATE
        "${GINKGO_DIR}"
        )
endfunction()
Now, we only need to call this function for every target where we use the accessor.

Creating a range with an accessor

In this repository, we only use the reduced_row_major accessor, but all other accessors work analogously.

For the reduced_row_major accessor, you need to specify:

  1. the dimensionality of the range (we specify 2D, even for vectors, so we can access vectors with a stride)
  2. the arithmetic and the storage type

This type can then be used to create the range<reduced_row_major<...>> by specifying the size, stride, and storage pointer.

We showcase the creation of both constant and non-constant ranges with reduced_row_major accessors here:

constexpr std::size_t dimensionality{2};
std::array<gko::acc::size_type, dimensionality - 1> m_stride{m_info.stride};
std::array<gko::acc::size_type, dimensionality - 1> x_stride{x_info.stride};
std::array<gko::acc::size_type, dimensionality - 1> res_stride{res_info.stride};

using accessor = gko::acc::reduced_row_major<dimensionality, ArType, StType>;
using range = gko::acc::range<accessor>;
using c_range = gko::acc::range<typename accessor::const_accessor>;

auto m_acc = c_range(m_info.size, mtx, m_stride);
auto x_acc = c_range(x_info.size, x, x_stride);
auto res_acc = range(res_info.size, res, res_stride);
We utilize the constant accessors for the matrix and the x vector since both storage pointers are also constant.
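
To make this concrete, here is a minimal, self-contained host-side sketch (the data, sizes, and header paths are hypothetical; it views a 2x3 matrix stored in single precision through a double-precision range):

#include <array>
#include <iostream>

// Header paths are assumptions based on the include setup described above.
#include <accessor/range.hpp>
#include <accessor/reduced_row_major.hpp>

int main()
{
    // Storage: 2 rows, 3 columns, row-major with a stride (row pitch) of 4.
    std::array<float, 8> data{{1.f, 2.f, 3.f, 0.f, 4.f, 5.f, 6.f, 0.f}};

    // double arithmetic on top of float storage.
    using accessor = gko::acc::reduced_row_major<2, double, float>;
    using range = gko::acc::range<accessor>;

    std::array<gko::acc::size_type, 2> size{{2, 3}};
    std::array<gko::acc::size_type, 1> stride{{4}};
    range r(size, data.data(), stride);

    r(1, 2) = 7.0;  // the write converts double -> float on the fly
    std::cout << r(0, 1) + r(1, 2) << '\n';  // reads are performed in double
}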

Utilizing the range/accessor in a CUDA kernel

Utilizing the range in a kernel (it works the same way on CPUs) is straightforward:

  1. Use a templated kernel argument in order to accept all kinds of ranges
  2. Perform read and write operations through the function call operator, operator()

To know which arithmetic type is used, we can either query the accessor::arithmetic_type property or detect which type arithmetic operations result in. In this example, we use the second option:

using ar_type = decltype(alpha * mtx(0, 0) * x(0, 0) + beta * res(0, 0));
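
The first option could look like this (a sketch; it assumes the range type exposes its accessor through a member alias named accessor):

// Alternative: query the arithmetic type instead of deducing it. The
// `accessor` member alias is an assumption about the range interface.
using ar_type = typename MtxRange::accessor::arithmetic_type;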

Read and write operations can be observed in GEMV here:

res(row_idx, 0) = alpha * shared[local_id] + beta * res(row_idx, 0);

Difference between using plain pointers and using the range/accessor

Here, we compare the GEMV kernel written with plain pointers:

template <std::int64_t block_size, typename ValueType>
__global__ __launch_bounds__(block_size) void gemv(
    const matrix_info m_info, ValueType alpha, const ValueType *__restrict__ mtx,
    const matrix_info x_info, const ValueType *__restrict__ x,
    const matrix_info res_info, ValueType beta, ValueType *__restrict__ res)
{
    // expect x_info.size[1] == 1
    const std::int64_t row_idx{blockIdx.x};
    if (row_idx >= m_info.size[0]) {
        return;
    }
    const std::int64_t row_start = row_idx * m_info.stride;
    __shared__ char shared_impl[sizeof(ValueType) * block_size];
    auto shared = reinterpret_cast<ValueType *>(shared_impl);
    ValueType local_result{};
    const auto group = cg::this_thread_block();
    const auto local_id = group.thread_rank();
    for (std::int64_t col = local_id; col < m_info.size[1]; col += block_size) {
        const auto mtx_val = mtx[row_start + col];
        const auto x_val = x[col * x_info.stride];
        local_result += mtx_val * x_val;
    }
    shared[local_id] = local_result;
    reduce(group, shared, [](ValueType a, ValueType b) { return a + b; });
    if (local_id == 0) {
        const auto res_idx = row_idx * res_info.stride;
        if (beta == ValueType{0}) {
            res[res_idx] = alpha * shared[local_id];
        } else {
            res[res_idx] = alpha * shared[local_id] + beta * res[res_idx];
        }
    }
}

and using the range/accessor:

template <std::int64_t block_size, typename MtxRange, typename XRange,
          typename ResRange, typename ArType>
__global__ __launch_bounds__(block_size) void acc_gemv(ArType alpha,
                                                       MtxRange mtx, XRange x,
                                                       ArType beta,
                                                       ResRange res)
{
    using ar_type = decltype(alpha * mtx(0, 0) * x(0, 0) + beta * res(0, 0));
    static_assert(std::is_same<ArType, ar_type>::value, "Types must be equal!");
    // expect x.length(1) == 1
    const std::int64_t row_idx{blockIdx.x};
    if (row_idx >= mtx.length(0)) {
        return;
    }
    const auto num_cols = mtx.length(1);
    __shared__ char shared_impl[sizeof(ar_type) * block_size];
    auto shared = reinterpret_cast<ar_type *>(shared_impl);
    ar_type local_result{};
    const auto group = cg::this_thread_block();
    const auto local_id = group.thread_rank();
    for (std::int64_t col = local_id; col < num_cols; col += block_size) {
        local_result += mtx(row_idx, col) * x(col, 0);
    }
    shared[local_id] = local_result;
    reduce(group, shared, [](ar_type a, ar_type b) { return a + b; });
    if (local_id == 0) {
        if (beta == ArType{0}) {
            res(row_idx, 0) = alpha * shared[local_id];
        } else {
            res(row_idx, 0) = alpha * shared[local_id] + beta * res(row_idx, 0);
        }
    }
}

The main differences between these are:

  1. We have fewer parameters in the range/accessor kernel because stride and size information are integrated into the ranges.
  2. We don't need to compute a 1D index with the range/accessor because the indices for both dimensions are fed into the function call operator operator().
  3. For debug purposes (and showcase), we extract the arithmetic type from the accessor.
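
Both kernels are launched in the same way: a range is a small, trivially copyable object, so it is simply passed to the kernel by value. A sketch of such a launch, reusing the ranges created earlier (the grid and block configuration is an illustrative assumption, not taken from this repository):

// Illustrative launch only: one thread block per matrix row; the
// block_size value is an assumption.
constexpr std::int64_t block_size = 512;
const auto num_rows = static_cast<unsigned>(m_acc.length(0));
acc_gemv<block_size><<<num_rows, static_cast<unsigned>(block_size)>>>(
    alpha, m_acc, x_acc, beta, res_acc);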

Plots

So far, we have run the benchmarks and the error analysis on an NVIDIA A100 and on an NVIDIA V100 GPU.

Evaluation on the A100

[Plots: A100 dot error, A100 dot flops, A100 gemv error, A100 gemv flops]

Evaluation on the V100

[Plots: V100 dot error, V100 dot flops, V100 gemv error, V100 gemv flops]
