DPCPP cooperative group #757
Conversation
Thanks Mike for creating the PR. I'm trying to see what I can do with a templated configuration type as a possible improvement; I'll let you know whether it works.
I didn't go into the config right now, but the rest looks good
Codecov Report
@@ Coverage Diff @@
## develop #757 +/- ##
========================================
Coverage 94.17% 94.17%
========================================
Files 400 400
Lines 31051 31080 +29
========================================
+ Hits 29241 29270 +29
Misses 1810 1810
Continue to review full report at Codecov.
Overall, this looks good!
The main point of interest for me is the choice of 32 for the warp size. Could you document somewhere, maybe just in this PR, why you chose 32? Will shuffles be expected to work for 32 work-items independent of vector architecture? Can we get away with a smaller warp size? (I think a smaller warp size, if it still fully utilizes the vector units, can enable more flexible parallelism.)
Apart from that I have a few other points. I guess the sync implementation and tests still remain, and documentation can be improved in several places.
__dpct_inline__ ValueType shfl_up(ValueType var,
                                  SelectorType selector) const noexcept
{
    const auto result = this->shuffle_up(var, selector);
Do we need this-> here? If not, it's probably better to remove it. Maybe:
const auto result = this->shuffle_up(var, selector);
const auto result = sub_group::shuffle_up(var, selector);
I think it needs this-> to use the generated subgroup information.
__dpct_inline__ ValueType ShflOpName(ValueType var, SelectorType selector) \
    const noexcept                                                         \
{                                                                          \
    return this->ShflOp(var, selector);                                   \
Maybe just ShflOp is enough here?
return this->ShflOp(var, selector); \
return ShflOp(var, selector); \
When I call the member function, I will use this->
Member functions are sometimes not found in a templated member function unless you use this->, which is also why all of our templated tests need to use it.
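For context, a minimal standalone illustration of that two-phase lookup rule (hypothetical types, not code from this PR): unqualified names are not looked up in a base class that depends on a template parameter, so this-> (or a qualified name) is required.

template <typename T>
struct base {
    T shuffle_up(T var, int delta) const { return var + delta; }
};

template <typename T>
struct derived : base<T> {
    T shfl_up(T var, int delta) const
    {
        // return shuffle_up(var, delta);     // error: not found, because
        //                                    // base<T> is a dependent base
        return this->shuffle_up(var, delta);  // OK: lookup is deferred to
                                              // instantiation time
    }
};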
Typically, that is not needed for inline definitions (in the class body rather than outside). I was just thinking of avoiding an explicit pointer indirection if we can help it. I understand it's probably fine because it should be optimized out.
// Enable the group type to use group functions directly
__SYCL_INLINE_NAMESPACE(cl)
Can you give an example of a function this will enable? In general it seems like a bad idea for us to depend on sycl's detail namespace - I guess they can change it any time.
For example, we use the group algorithm reduce in ballot. If we do not do it this way, then whenever we need to use the oneAPI implementation on gko's subgroup, we need to add a static_cast to the subgroup type. I agree with you, they might change it at some point.
Yes, if it works with static casting to sycl::ONEAPI::sub_group* when you call reduce, perhaps we should just do that.
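A hedged sketch of that alternative, assuming the oneAPI 2021-era free-function group algorithms (group, partial, and the wrapper layout are hypothetical; later SYCL revisions renamed these functions):

// Inside a kernel: `group` is the cooperative-group wrapper deriving from
// the native sub-group type. The cast makes the native type visible to the
// algorithm, instead of relying on the detail-namespace injection.
float partial = /* per-work-item value */ 1.0f;
auto sg = static_cast<sycl::ONEAPI::sub_group>(group);
float sum = sycl::ONEAPI::reduce(sg, partial, sycl::ONEAPI::plus<float>());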
I will stay with this version for now because it still works and it makes the ginkgo cooperative group feel like an extension rather than another object. But if they change it frequently and it is hard to keep up with them, we can delete this and use static_cast.
// specialization for 1
template <>
class thread_block_tile<1> {
Why is this specialization needed? Do you expect we'll need to use thread_block_tiles of size 1 in algorithms? Otherwise, if this case is correctly but inefficiently handled by the generic implementation, I guess we don't need this specialization. If it's needed, it would be nice to have unit tests for this too.
It is for those kernels implemented in a warp-wise sense. Providing it means we can use the same kernel on subgroup(1) to get the single-thread implementation.
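For illustration, a rough sketch of why a size-1 specialization is cheap to provide (hypothetical, not the PR's exact code): with a single thread per tile, every shuffle is an identity operation and synchronization is a no-op.

template <unsigned Size>
class thread_block_tile;  // primary template defined elsewhere

template <>
class thread_block_tile<1> {
public:
    static constexpr unsigned size() noexcept { return 1; }
    static constexpr unsigned thread_rank() noexcept { return 0; }

    // There is no other thread to exchange data with, so all shuffles
    // simply return the input value.
    template <typename ValueType>
    ValueType shfl(ValueType var, int) const noexcept { return var; }
    template <typename ValueType>
    ValueType shfl_up(ValueType var, unsigned) const noexcept { return var; }
    template <typename ValueType>
    ValueType shfl_down(ValueType var, unsigned) const noexcept { return var; }

    // Synchronizing a single thread is a no-op.
    void sync() const noexcept {}
};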
I added a few default helpers to reduce the internal layer code. TODO:
@ginkgo-project/reviewers this PR can be reviewed. Also, any name suggestions for these things are welcome.
LGTM
/**
 * The type containing a bitmask over all lanes of a warp.
 */
using lane_mask_type = uint64;

/**
 * The bitmask of the entire warp.
 */
static constexpr auto full_lane_mask = ~zero<lane_mask_type>();
Will we ever need this? I think we cannot have masked operations anyway. Maybe all of the config can be removed to begin with?
I also think the full lane mask should have the same number of bits as a full lane has threads. Is that true here? Otherwise, things related to popcnt and ballot might break.
You need to use mask<subgroup_size, lane_mask_type> to get the last subgroup_size bits activated in lane_mask_type.
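A minimal sketch of what such a helper computes (hypothetical implementation, C++14; the PR's actual mask may differ): the lowest subgroup_size bits of the mask type set to one.

#include <cstdint>

using lane_mask_type = std::uint64_t;

// Sets the lowest `size` bits of MaskType. The ternary avoids the
// undefined behavior of shifting by the full bit width when `size`
// covers all bits of MaskType.
template <int size, typename MaskType>
constexpr MaskType mask()
{
    return size >= static_cast<int>(sizeof(MaskType) * 8)
               ? ~MaskType{}
               : (MaskType{1} << size) - 1;
}

// Example: mask<32, lane_mask_type>() == 0x00000000FFFFFFFF, so
// popcnt(ballot_result & mask<32, lane_mask_type>()) only counts
// lanes that actually exist in the subgroup.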
bool allowed = false;
for (auto &i : subgroup_sizes) {
    allowed |= (i == warpsize);
}
return allowed && (blocksize <= max_workgroup_size);
Could you use validate_function instead? Also, why are these protected? How do you use them?
The protected structure is the original one; I added a public function to get the const exec_info. Do you have any comments on using the function directly versus the structure's function?
Looks mostly good to me, I would like to see if we can remove a few CUDA-isms though, since they only relate to kernel launch, not the actual kernel implementation.
#if defined(_MSC_VER)
#define __dpct_align__(n) __declspec(align(n))
#define __dpct_inline__ __forceinline
#else
#define __dpct_align__(n) __attribute__((aligned(n)))
#define __dpct_inline__ __inline__ __attribute__((always_inline))
#endif
Are we programming for DPC++ or SYCL? For the former, can't we include dpct.hpp directly? For the latter, is this portable?
This is due to earlier issues with dpct. For example, including the dpct file gives errors, and their atomic_add has some issues, so we also have our own atomic_add implementation for real numbers in another PR.
 * dim3 is a cuda-like dim3 for sycl-range, which provides the same ordering as
 * cuda and gets the sycl-range in reverse ordering.
 */
struct dim3 {
We are building a lot of complexity here (also the config selection dim3 integration), so I want to ask: Why do we need this additional wrapper? Can't we use SYCL primitives directly?
Yes, we can use the native sycl::range<3>, but it gives a different view on the kernel launch and on building the sycl::range. For example, where CUDA uses kernel<<<32, 32>>>, SYCL needs
kernel(sycl::nd_range<3>(sycl::range<3>(1, 1, 32) * sycl::range<3>(1, 1, 32), sycl::range<3>(1, 1, 32)))
but with dim3 we can still write kernel(32, 32). In the beginning, I would also like to reduce the difference between cuda and dpcpp.
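A minimal sketch of such a wrapper, assuming the reverse-order mapping described above (simplified; the PR's dim3 carries more functionality):

#include <CL/sycl.hpp>

// CUDA-style dim3: x is the fastest-varying dimension. SYCL's
// fastest-varying dimension is the last one, so the conversion
// reverses the order.
struct dim3 {
    unsigned int x;
    unsigned int y;
    unsigned int z;

    dim3(unsigned int x, unsigned int y = 1, unsigned int z = 1)
        : x{x}, y{y}, z{z}
    {}

    sycl::range<3> get_range() const { return sycl::range<3>(z, y, x); }
};

// With this, a CUDA launch kernel<<<grid, block>>> maps to
// sycl::nd_range<3>(grid.get_range() * block.get_range(),
//                   block.get_range())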
In which places does that actually matter? Most of the time, we are using one-dimensional kernels, except for SpMV (multiple columns), some Dense kernels (2D) and Jacobi. Wouldn't you use sycl::nd_range<1>(...) directly?
I think the mapping will be different. At least, it would require us to develop our own dpct to avoid conversions like threadIdx -> get_local_id(3).
And it gives an inconsistent index sense between 1D/2D/3D kernels:
with range<1>(x), get_local_id(0) is x, and x is contiguous as in cuda;
with range<2>(x, y), get_local_id(0) is still x, but x is not contiguous;
with range<2>(y, x), x is contiguous, but x needs get_local_id(1).
But that only matters on the launch side, right? I would assume that the reason for this is that it maps more cleanly to the nested for-loop model
for (; i < range[0]; i++)
for (; j < range[1]; j++)
for (; k < range[2]; k++)
It affects not only the launch side but also the kernel index for threads.
Excellent work! It's pretty cool, e.g. how you handle stuff like __WG_BOUNDS__. I guess you want to make DPC++ as close to CUDA as possible, which is what a lot of the code here seems to be doing.
Below, I have some concerns around the ConfigSet stuff and other comments / suggestions.
include/ginkgo/core/base/types.hpp
 *
 * @note this is the last case of nested template
 */
template <int num_groups, int current_shift>
template <int num_groups, int current_shift>
template <int current_shift, int num_groups>
This way, I think num_groups can be inferred and we'll need to provide only current_shift while using this.
I tried it before, but it did not get the num_groups information from the array.
Oh I see. Even with the switched order? Then never mind.
It needs the switched order to allow not setting it explicitly. I will post my attempted code later. It is also not what I expected, so maybe I did something wrong there.
Did you try adding a deduction guide? Neat little C++17 feature that might help here.
I think ConfigSet is also a general solution for other kernels (not only dpcpp), so we still need to stay with C++14. I put the related code here: https://godbolt.org/z/oa5d4arMs. It switches the order but cannot omit num_groups.
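One hedged guess at why the deduction fails (based on the signature quoted below, not verified against the PR): std::array declares its size parameter as std::size_t, and in C++14 a non-type template parameter is only deduced when its declared type matches the corresponding parameter exactly, so an int num_groups cannot be deduced from a std::array argument.

#include <array>

// Deduction of num_groups fails: std::array's size parameter has type
// std::size_t, which does not match the declared `int num_groups`.
template <int current_shift, int num_groups>
constexpr int shift(const std::array<char, num_groups>& bits);

// Declaring it as std::size_t would allow the deduction:
template <int current_shift, std::size_t num_groups>
constexpr int shift(const std::array<char, num_groups>& bits);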
include/ginkgo/core/base/types.hpp
 *
 * @note this is the usual case of nested template
 */
template <int num_groups, int current_shift>
template <int num_groups, int current_shift>
template <int current_shift, int num_groups>
include/ginkgo/core/base/types.hpp
    const std::array<char, num_groups> &bits)
{
    return bits[current_shift + 1] +
           shift<num_groups, (current_shift + 1)>(bits);
Then this can simply be:
shift<num_groups, (current_shift + 1)>(bits);
shift<current_shift + 1>(bits);
include/ginkgo/core/base/types.hpp
/**
 * ConfigSet is a way to embed several information into one integer by given
 * certain bits. The usage will be the following
 * certain bits. The usage will be the following
 * certain bits.
 *
 * The usage will be the following:
include/ginkgo/core/base/types.hpp
class ConfigSet {
public:
    static constexpr size_type num_groups = sizeof...(num_bits);
    static constexpr std::array<char, num_groups> bits{num_bits...};
You want to put ints into a char array?
include/ginkgo/core/base/types.hpp
 * The encoded result will use 32 bits to record
 * rrrrr1..12....2...k..k, where 1/2/k means the bits store the information
 * for the 1/2/k position and r is for the rest of the unused bits.
 *
Somewhat crazy suggestion but might be great to have: maybe you could include a mathematical proof that this coding scheme is indeed a unique map when the numbers to encode are small enough. You could precisely state (in the assumptions) the maximum values of the arguments of encode for the coding to work, and then the proof would be a couple of lines to prove: for integer vectors x and y with all components less than the respective maxima, x != y implies encode(x...) != encode(y...).
If you say that is unnecessary, that is completely fine.
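For reference, a sketch of that argument (my own, hedged; writing $b_i$ for the bit count of field $i$ and assuming $0 \le x_i < 2^{b_i}$):

\[
  \operatorname{encode}(x_1,\dots,x_k) = \sum_{i=1}^{k} x_i\, 2^{s_i},
  \qquad s_i = \sum_{j>i} b_j .
\]
The lower-order fields can never spill into field $i$'s bits, since
\[
  \sum_{j>i} x_j\, 2^{s_j} \le \sum_{j>i} \bigl(2^{b_j}-1\bigr)\, 2^{s_j}
  = 2^{s_i} - 1 ,
\]
so every component is recovered uniquely by
\[
  x_i = \bigl\lfloor \operatorname{encode}(x) / 2^{s_i} \bigr\rfloor \bmod 2^{b_i} .
\]
A map with a left inverse is injective, hence $x \ne y$ implies $\operatorname{encode}(x) \ne \operatorname{encode}(y)$.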
using namespace gko::kernels::dpcpp;
using KCfg = gko::ConfigSet<12, 7>;
I guess 12 and 7 are arbitrarily chosen, or do you like 12 and 7 for some specific reason?
I used the wrong numbers here. It should be 11, 7: log_2(1024) + 1 for workgroup_size and log_2(64) + 1 for subgroup_size.
That appears to make sense, but I don't really see it. It seems like you want the bits array to contain the maximum number of bits needed by each position. But I guess it will work correctly for many combinations, depending on which numbers are given as parameters to the encode function.
It is described in the ConfigSet note: the number of bits should be log_2(max) + 1.
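To make the arithmetic concrete, a hedged usage sketch (encode/decode spelled as in the discussion above; the exact ConfigSet interface may differ): 1024 needs log_2(1024) + 1 = 11 bits because the value 1024 itself must be representable, and likewise 64 needs 7 bits.

// 11 bits for workgroup_size (values up to 1024) and 7 bits for
// subgroup_size (values up to 64).
using KCfg = gko::ConfigSet<11, 7>;

// Pack both values into a single integer, e.g. for kernel selection:
constexpr auto cfg = KCfg::encode(256, 32);

// Recover the individual fields by position:
constexpr auto workgroup_size = KCfg::decode<0>(cfg);  // 256
constexpr auto subgroup_size = KCfg::decode<1>(cfg);   // 32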
LGTM - nice job! I am still not entirely happy about the dim3 thing, since it means adding CUDA-isms to our SYCL code, but I guess it would only be used in a small number of places anyway, so it will be okay.
core/test/base/types.cpp
ASSERT_EQ((std::is_same<decltype(mask3_u), const unsigned int>::value),
          true);
ASSERT_EQ((std::is_same<decltype(fullmask_u), const unsigned int>::value),
          true);
ASSERT_TRUE((std::is_same<decltype(mask3_u), const unsigned int>::value));
ASSERT_TRUE((std::is_same<decltype(fullmask_u), const unsigned int>::value));
Not sure if you even need the additional parentheses then?
It still needs the additional parentheses so that the macro unpacks the parameters correctly.
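A tiny illustration of the macro issue (hypothetical snippet): the preprocessor splits macro arguments at top-level commas, including the one inside the template argument list, so the extra parentheses are load-bearing.

// Without the extra parentheses the preprocessor sees two arguments,
// "std::is_same<decltype(mask3_u)" and "const unsigned int>::value":
// ASSERT_TRUE(std::is_same<decltype(mask3_u), const unsigned int>::value);

// Parenthesizing the expression makes it a single macro argument:
ASSERT_TRUE((std::is_same<decltype(mask3_u), const unsigned int>::value));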
core/test/base/types.cpp
ASSERT_EQ((std::is_same<decltype(mask3_u64), const std::uint64_t>::value),
          true);
ASSERT_EQ(
    (std::is_same<decltype(fullmask_u64), const std::uint64_t>::value),
    true);
same here and the following
@@ -266,9 +266,6 @@ fi
# Arrange the remain files and give
if [ -f "${CONTENT}" ]; then
    add_regroup
    if [ "${HAS_HIP_RUNTIME}" = "true" ]; then
What is the reason for these changes again? Do we no longer need it?
The behavior still exists. DPCPP produces more additional headers than just <hip/runtime.h>, so I moved all the additional content before the LICENSE into the header section, not just hip/runtime.h.
It depends on how much effort we would like to put into DPCPP in the beginning.
Co-authored-by: Terry Cojean <terry.cojean@kit.edu>
Co-authored-by: Aditya Kashi <aditya.kashi@kit.edu> Co-authored-by: Tobias Ribizel <ribizel@kit.edu>
I swapped the order of the shift parameters, although we do not have auto deduction there.
Force-pushed from 5067e93 to 32a7c19.
delete throw in constexpr because it fails in gcc <= 5.x Co-authored-by: Aditya Kashi <aditya.kashi@kit.edu> Co-authored-by: Terry Cojean <terry.cojean@kit.edu> Co-authored-by: Tobias Ribizel <ribizel@kit.edu>
Kudos, SonarCloud Quality Gate passed!
Ginkgo release 1.4.0

The Ginkgo team is proud to announce the new Ginkgo minor release 1.4.0. This release brings most of the Ginkgo functionality to the Intel DPC++ ecosystem which enables Intel-GPU and CPU execution. The only Ginkgo features which have not been ported yet are some preconditioners.

Ginkgo's mixed-precision support is greatly enhanced thanks to:
1. The new Accessor concept, which allows writing kernels featuring on-the-fly memory compression, among other features. The accessor can be used as header-only, see the [accessor BLAS benchmarks repository](https://github.com/ginkgo-project/accessor-BLAS/tree/develop) as a usage example.
2. All LinOps now transparently support mixed-precision execution. By default, this is done through a temporary copy which may have a performance impact but already allows mixed-precision research. Native mixed-precision ELL kernels are implemented which do not see this cost.

The accessor is also leveraged in a new CB-GMRES solver which allows for performance improvements by compressing the Krylov basis vectors. Many other features have been added to Ginkgo, such as reordering support, a new IDR solver, Incomplete Cholesky preconditioner, matrix assembly support (only CPU for now), machine topology information, and more!

Supported systems and requirements:
+ For all platforms, cmake 3.13+
+ C++14 compliant compiler
+ Linux and MacOS
  + gcc: 5.3+, 6.3+, 7.3+, all versions after 8.1+
  + clang: 3.9+
  + Intel compiler: 2018+
  + Apple LLVM: 8.0+
  + CUDA module: CUDA 9.0+
  + HIP module: ROCm 3.5+
  + DPC++ module: Intel OneAPI 2021.3. Set the CXX compiler to `dpcpp`.
+ Windows
  + MinGW and Cygwin: gcc 5.3+, 6.3+, 7.3+, all versions after 8.1+
  + Microsoft Visual Studio: VS 2019
  + CUDA module: CUDA 9.0+, Microsoft Visual Studio
  + OpenMP module: MinGW or Cygwin.

Algorithm and important feature additions:
+ Add a new DPC++ Executor for SYCL execution and other base utilities [#648](#648), [#661](#661), [#757](#757), [#832](#832)
+ Port matrix formats, solvers and related kernels to DPC++. For some kernels, also make use of a shared kernel implementation for all executors (except Reference). [#710](#710), [#799](#799), [#779](#779), [#733](#733), [#844](#844), [#843](#843), [#789](#789), [#845](#845), [#849](#849), [#855](#855), [#856](#856)
+ Add accessors which allow multi-precision kernels, among other things. [#643](#643), [#708](#708)
+ Add support for mixed precision operations through apply in all LinOps. [#677](#677)
+ Add incomplete Cholesky factorizations and preconditioners as well as some improvements to ILU. [#672](#672), [#837](#837), [#846](#846)
+ Add an AMGX implementation and kernels on all devices but DPC++. [#528](#528), [#695](#695), [#860](#860)
+ Add a new mixed-precision capability solver, Compressed Basis GMRES (CB-GMRES). [#693](#693), [#763](#763)
+ Add the IDR(s) solver. [#620](#620)
+ Add a new fixed-size block CSR matrix format (for the Reference executor). [#671](#671), [#730](#730)
+ Add native mixed-precision support to the ELL format. [#717](#717), [#780](#780)
+ Add Reverse Cuthill-McKee reordering [#500](#500), [#649](#649)
+ Add matrix assembly support on CPUs. [#644](#644)
+ Extends ISAI from triangular to general and spd matrices. [#690](#690)

Other additions:
+ Add the possibility to apply real matrices to complex vectors. [#655](#655), [#658](#658)
+ Add functions to compute the absolute of a matrix format. [#636](#636)
+ Add symmetric permutation and improve existing permutations. [#684](#684), [#657](#657), [#663](#663)
+ Add a MachineTopology class with HWLOC support [#554](#554), [#697](#697)
+ Add an implicit residual norm criterion. [#702](#702), [#818](#818), [#850](#850)
+ Row-major accessor is generalized to more than 2 dimensions and a new "block column-major" accessor has been added. [#707](#707)
+ Add an heat equation example. [#698](#698), [#706](#706)
+ Add ccache support in CMake and CI. [#725](#725), [#739](#739)
+ Allow tuning and benchmarking variables non intrusively. [#692](#692)
+ Add triangular solver benchmark [#664](#664)
+ Add benchmarks for BLAS operations [#772](#772), [#829](#829)
+ Add support for different precisions and consistent index types in benchmarks. [#675](#675), [#828](#828)
+ Add a Github bot system to facilitate development and PR management. [#667](#667), [#674](#674), [#689](#689), [#853](#853)
+ Add Intel (DPC++) CI support and enable CI on HPC systems. [#736](#736), [#751](#751), [#781](#781)
+ Add ssh debugging for Github Actions CI. [#749](#749)
+ Add pipeline segmentation for better CI speed. [#737](#737)

Changes:
+ Add a Scalar Jacobi specialization and kernels. [#808](#808), [#834](#834), [#854](#854)
+ Add implicit residual log for solvers and benchmarks. [#714](#714)
+ Change handling of the conjugate in the dense dot product. [#755](#755)
+ Improved Dense stride handling. [#774](#774)
+ Multiple improvements to the OpenMP kernels performance, including COO, an exclusive prefix sum, and more. [#703](#703), [#765](#765), [#740](#740)
+ Allow specialization of submatrix and other dense creation functions in solvers. [#718](#718)
+ Improved Identity constructor and treatment of rectangular matrices. [#646](#646)
+ Allow CUDA/HIP executors to select allocation mode. [#758](#758)
+ Check if executors share the same memory. [#670](#670)
+ Improve test install and smoke testing support. [#721](#721)
+ Update the JOSS paper citation and add publications in the documentation. [#629](#629), [#724](#724)
+ Improve the version output. [#806](#806)
+ Add some utilities for dim and span. [#821](#821)
+ Improved solver and preconditioner benchmarks. [#660](#660)
+ Improve benchmark timing and output. [#669](#669), [#791](#791), [#801](#801), [#812](#812)

Fixes:
+ Sorting fix for the Jacobi preconditioner. [#659](#659)
+ Also log the first residual norm in CGS [#735](#735)
+ Fix BiCG and HIP CSR to work with complex matrices. [#651](#651)
+ Fix Coo SpMV on strided vectors. [#807](#807)
+ Fix segfault of extract_diagonal, add short-and-fat test. [#769](#769)
+ Fix device_reset issue by moving counter/mutex to device. [#810](#810)
+ Fix `EnableLogging` superclass. [#841](#841)
+ Support ROCm 4.1.x and breaking HIP_PLATFORM changes. [#726](#726)
+ Decreased test size for a few device tests. [#742](#742)
+ Fix multiple issues with our CMake HIP and RPATH setup. [#712](#712), [#745](#745), [#709](#709)
+ Cleanup our CMake installation step. [#713](#713)
+ Various simplification and fixes to the Windows CMake setup. [#720](#720), [#785](#785)
+ Simplify third-party integration. [#786](#786)
+ Improve Ginkgo device arch flags management. [#696](#696)
+ Other fixes and improvements to the CMake setup. [#685](#685), [#792](#792), [#705](#705), [#836](#836)
+ Clarification of dense norm documentation [#784](#784)
+ Various development tools fixes and improvements [#738](#738), [#830](#830), [#840](#840)
+ Make multiple operators/constructors explicit. [#650](#650), [#761](#761)
+ Fix some issues, memory leaks and warnings found by MSVC. [#666](#666), [#731](#731)
+ Improved solver memory estimates and consistent iteration counts [#691](#691)
+ Various logger improvements and fixes [#728](#728), [#743](#743), [#754](#754)
+ Fix for ForwardIterator requirements in iterator_factory. [#665](#665)
+ Various benchmark fixes. [#647](#647), [#673](#673), [#722](#722)
+ Various CI fixes and improvements. [#642](#642), [#641](#641), [#795](#795), [#783](#783), [#793](#793), [#852](#852)

Related PR: #857
Release 1.4.0 to master: same release notes as above. Related PR: #866
This PR adds cooperative groups in dpcpp to keep the same interface as cuda/hip.
Note: the subgroup seems to correspond to a warp in cuda, not a subwarp.
Reference: https://intel.github.io/llvm-docs/cuda/opencl-subgroup-vs-cuda-crosslane-op.html
This PR is WIP, but I would like to bring it to review to check whether the config selection is acceptable for ginkgo.
For the config selection, please check cg_shuffle_config_call in dpcpp/test/components/cooperative_groups_kernels.dp.cpp.
Summary:
- helper gives a default implementation macro for simple kernel cases (no explicit template parameter and 1d block)
- __WG_BOUND__ gives something like __launch_bound__, but it needs the 3d information, not the product
- __WG_BOUND_CONFIG__ can use ConfigSet for easy unpacking

TODO:
- also set the cooperative group test result individually in cuda/hip? I decided to handle the cuda/hip cooperative group in another PR because it contains another subwarp test.