DPCPP cooperative group #757
Conversation
Thanks Mike for creating the PR. I'm trying to see what I can do with a templated configuration type as a possible improvement; I'll let you know whether it works.
I didn't go into the config right now, but the rest looks good
Codecov Report
@@ Coverage Diff @@
## develop #757 +/- ##
========================================
Coverage 94.17% 94.17%
========================================
Files 400 400
Lines 31051 31080 +29
========================================
+ Hits 29241 29270 +29
Misses 1810 1810
Continue to review full report at Codecov.
Overall, this looks good!
The main point of interest for me is the choice of 32 for the warp size. Could you document somewhere, maybe just in this PR, why you chose 32? Will shuffles be expected to work for 32 work-items independent of vector architecture? Can we get away with a smaller warp size? (I think a smaller warp size, if it still fully utilizes the vector units, can enable more flexible parallelism.)
Apart from that I have a few other points. I guess the sync implementation and tests still remain, and documentation can be improved in several places.
__dpct_inline__ ValueType shfl_up(ValueType var,
                                  SelectorType selector) const noexcept
{
    const auto result = this->shuffle_up(var, selector);
Do we need this-> here? If not, it's probably better to remove it. Maybe:
const auto result = this->shuffle_up(var, selector);
const auto result = sub_group::shuffle_up(var, selector);
I think it needs this-> to use the generated subgroup information.
__dpct_inline__ ValueType ShflOpName(ValueType var, SelectorType selector) \
    const noexcept                                                         \
{                                                                          \
    return this->ShflOp(var, selector);                                   \
Maybe just ShflOp is enough here?
return this->ShflOp(var, selector); \
return ShflOp(var, selector); \
When I call the member function, I will use this->
Member functions are sometimes not found in a templated member function unless you use this->, which is also why all of our templated tests need to use it.
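For context, a minimal standalone illustration of that two-phase lookup rule (hypothetical types, not code from this PR): unqualified names are not looked up in a base class that depends on a template parameter, so this-> (or a qualified name) is required.

template <typename T>
struct base {
    T shuffle_up(T var, int delta) const { return var + delta; }
};

template <typename T>
struct derived : base<T> {
    T shfl_up(T var, int delta) const
    {
        // return shuffle_up(var, delta);     // error: not found, because
        //                                    // base<T> is a dependent base
        return this->shuffle_up(var, delta);  // OK: lookup is deferred to
                                              // instantiation time
    }
};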
Typically, that is not needed for inline definitions (in the class body rather than outside). I was just thinking of avoiding an explicit pointer indirection if we can help it. I understand it's probably fine because it should be optimized out.
// Enable the group type to use group functions directly
__SYCL_INLINE_NAMESPACE(cl)
Can you give an example of a function this will enable? In general it seems like a bad idea for us to depend on sycl's detail namespace - I guess they can change it any time.
For example, we use the group algorithm reduce in ballot. If we do not do it this way, then whenever we need to use the oneAPI implementation on gko's subgroup, we need to add a static_cast to the subgroup type. I agree with you, they might change it at some point.
Yes, if it works with static casting to sycl::ONEAPI::sub_group* when you call reduce, perhaps we should just do that.
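A hedged sketch of that alternative, assuming the oneAPI 2021-era free-function group algorithms (group, partial, and the wrapper layout are hypothetical; later SYCL revisions renamed these functions):

// Inside a kernel: `group` is the cooperative-group wrapper deriving from
// the native sub-group type. The cast makes the native type visible to the
// algorithm, instead of relying on the detail-namespace injection.
float partial = /* per-work-item value */ 1.0f;
auto sg = static_cast<sycl::ONEAPI::sub_group>(group);
float sum = sycl::ONEAPI::reduce(sg, partial, sycl::ONEAPI::plus<float>());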
I will stay with this version for now because it still works and it makes the ginkgo cooperative group feel like an extension rather than another object. But if they change it frequently and it is hard to keep up with them, we can delete this and use static_cast.
// specialization for 1
template <>
class thread_block_tile<1> {
Why is this specialization needed? Do you expect we'll need to use thread_block_tiles of size 1 in algorithms? Otherwise, if this case is correctly but inefficiently handled by the generic implementation, I guess we don't need this specialization. If it's needed, it would be nice to have unit tests for this too.
It is for those kernels implemented in a warp-wise sense. Providing it means we can use the same kernel on subgroup(1) to get the single-thread implementation.
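For illustration, a rough sketch of why a size-1 specialization is cheap to provide (hypothetical, not the PR's exact code): with a single thread per tile, every shuffle is an identity operation and synchronization is a no-op.

template <unsigned Size>
class thread_block_tile;  // primary template defined elsewhere

template <>
class thread_block_tile<1> {
public:
    static constexpr unsigned size() noexcept { return 1; }
    static constexpr unsigned thread_rank() noexcept { return 0; }

    // There is no other thread to exchange data with, so all shuffles
    // simply return the input value.
    template <typename ValueType>
    ValueType shfl(ValueType var, int) const noexcept { return var; }
    template <typename ValueType>
    ValueType shfl_up(ValueType var, unsigned) const noexcept { return var; }
    template <typename ValueType>
    ValueType shfl_down(ValueType var, unsigned) const noexcept { return var; }

    // Synchronizing a single thread is a no-op.
    void sync() const noexcept {}
};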
I added a few default helpers to reduce the internal layer code. TODO:
@ginkgo-project/reviewers this PR can be reviewed. Also, any name suggestions for these things are welcome.
LGTM
/**
 * The type containing a bitmask over all lanes of a warp.
 */
using lane_mask_type = uint64;

/**
 * The bitmask of the entire warp.
 */
static constexpr auto full_lane_mask = ~zero<lane_mask_type>();
Will we ever need this? I think we cannot have masked operations anyway. Maybe all of the config can be removed to begin with?
I also think the full lane mask should have the same number of bits as a full lane has threads. Is that true here? Otherwise, things related to popcnt and ballot might break.
You need to use mask<subgroup_size, lane_mask_type> to get the last subgroup_size bits activated in lane_mask_type.
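A minimal sketch of what such a helper computes (hypothetical implementation, C++14; the PR's actual mask may differ): the lowest subgroup_size bits of the mask type set to one.

#include <cstdint>

using lane_mask_type = std::uint64_t;

// Sets the lowest `size` bits of MaskType. The ternary avoids the
// undefined behavior of shifting by the full bit width when `size`
// covers all bits of MaskType.
template <int size, typename MaskType>
constexpr MaskType mask()
{
    return size >= static_cast<int>(sizeof(MaskType) * 8)
               ? ~MaskType{}
               : (MaskType{1} << size) - 1;
}

// Example: mask<32, lane_mask_type>() == 0x00000000FFFFFFFF, so
// popcnt(ballot_result & mask<32, lane_mask_type>()) only counts
// lanes that actually exist in the subgroup.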
bool allowed = false;
for (auto &i : subgroup_sizes) {
    allowed |= (i == warpsize);
}
return allowed && (blocksize <= max_workgroup_size);
Could you use validate_function instead? Also, why are these protected? How do you use them?
The protected structure is the original one; I added a public function to get the const exec_info. Do you have any comments on using the function directly versus the structure's function?
Looks mostly good to me, I would like to see if we can remove a few CUDA-isms though, since they only relate to kernel launch, not the actual kernel implementation.
#if defined(_MSC_VER)
#define __dpct_align__(n) __declspec(align(n))
#define __dpct_inline__ __forceinline
#else
#define __dpct_align__(n) __attribute__((aligned(n)))
#define __dpct_inline__ __inline__ __attribute__((always_inline))
#endif
Are we programming for DPC++ or SYCL? For the former, can't we include dpct.hpp directly? For the latter, is this portable?
This is due to earlier issues with dpct. For example, including the dpct file gives errors, and their atomic_add has some issues, so we also have our own atomic_add implementation for real numbers in another PR.
 * dim3 is a cuda-like dim3 for sycl-range, which provides the same ordering as
 * cuda and gets the sycl-range in reverse ordering.
 */
struct dim3 {
We are building a lot of complexity here (also the config selection dim3 integration), so I want to ask: Why do we need this additional wrapper? Can't we use SYCL primitives directly?
Yes, we can use the native sycl::range<3>, but it gives a different view on the kernel launch and on building the sycl::range. For example, where CUDA uses kernel<<<32, 32>>>, SYCL needs
kernel(sycl::nd_range<3>(sycl::range<3>(1, 1, 32) * sycl::range<3>(1, 1, 32), sycl::range<3>(1, 1, 32)))
but with dim3 we can still write kernel(32, 32). In the beginning, I would also like to reduce the difference between cuda and dpcpp.
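A minimal sketch of such a wrapper, assuming the reverse-order mapping described above (simplified; the PR's dim3 carries more functionality):

#include <CL/sycl.hpp>

// CUDA-style dim3: x is the fastest-varying dimension. SYCL's
// fastest-varying dimension is the last one, so the conversion
// reverses the order.
struct dim3 {
    unsigned int x;
    unsigned int y;
    unsigned int z;

    dim3(unsigned int x, unsigned int y = 1, unsigned int z = 1)
        : x{x}, y{y}, z{z}
    {}

    sycl::range<3> get_range() const { return sycl::range<3>(z, y, x); }
};

// With this, a CUDA launch kernel<<<grid, block>>> maps to
// sycl::nd_range<3>(grid.get_range() * block.get_range(),
//                   block.get_range())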
In which places does that actually matter? Most of the time, we are using one-dimensional kernels, except for SpMV (multiple columns), some Dense kernels (2D) and Jacobi. Wouldn't you use sycl::nd_range<1>(...) directly?
I think the mapping will be different. At least, it would require us to develop our own dpct to avoid conversions like threadIdx -> get_local_id(3).
And it gives an inconsistent index sense between 1D/2D/3D kernels:
with range<1>(x), get_local_id(0) is x, and x is contiguous as in cuda;
with range<2>(x, y), get_local_id(0) is still x, but x is not contiguous;
with range<2>(y, x), x is contiguous, but x needs get_local_id(1).
But that only matters on the launch side, right? I would assume that the reason for this is that it maps more cleanly to the nested for-loop model
for (; i < range[0]; i++)
for (; j < range[1]; j++)
for (; k < range[2]; k++)
It affects not only the launch side but also the kernel index for threads.
Excellent work! It's pretty cool, e.g. how you handle stuff like __WG_BOUNDS__. I guess you want to make DPC++ as close to CUDA as possible, which is what a lot of the code here seems to be doing.
Below, I have some concerns around the ConfigSet stuff and other comments / suggestions.
include/ginkgo/core/base/types.hpp
 *
 * @note this is the last case of nested template
 */
template <int num_groups, int current_shift>
template <int num_groups, int current_shift>
template <int current_shift, int num_groups>
This way, I think num_groups can be inferred and we'll need to provide only current_shift while using this.
I tried it before, but it did not get the num_groups information from the array.
Oh I see. Even with the switched order? Then never mind.
It needs the switched order to allow not setting it explicitly. I will post my attempted code later. It is also not what I expected, so maybe I did something wrong there.
Did you try adding a deduction guide? Neat little C++17 feature that might help here.
I think ConfigSet is also a general solution for other kernels (not only dpcpp), so we still need to stay with C++14. I put the related code here: https://godbolt.org/z/oa5d4arMs. It switches the order but cannot omit num_groups.
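One hedged guess at why the deduction fails (based on the signature quoted below, not verified against the PR): std::array declares its size parameter as std::size_t, and in C++14 a non-type template parameter is only deduced when its declared type matches the corresponding parameter exactly, so an int num_groups cannot be deduced from a std::array argument.

#include <array>

// Deduction of num_groups fails: std::array's size parameter has type
// std::size_t, which does not match the declared `int num_groups`.
template <int current_shift, int num_groups>
constexpr int shift(const std::array<char, num_groups>& bits);

// Declaring it as std::size_t would allow the deduction:
template <int current_shift, std::size_t num_groups>
constexpr int shift(const std::array<char, num_groups>& bits);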
include/ginkgo/core/base/types.hpp
 *
 * @note this is the usual case of nested template
 */
template <int num_groups, int current_shift>
template <int num_groups, int current_shift>
template <int current_shift, int num_groups>
include/ginkgo/core/base/types.hpp
    const std::array<char, num_groups> &bits)
{
    return bits[current_shift + 1] +
           shift<num_groups, (current_shift + 1)>(bits);
Then this can simply be:
shift<num_groups, (current_shift + 1)>(bits);
shift<current_shift + 1>(bits);
include/ginkgo/core/base/types.hpp
/**
 * ConfigSet is a way to embed several information into one integer by given
 * certain bits. The usage will be the following
 * certain bits. The usage will be the following
 * certain bits.
 *
 * The usage will be the following:
include/ginkgo/core/base/types.hpp
class ConfigSet {
public:
    static constexpr size_type num_groups = sizeof...(num_bits);
    static constexpr std::array<char, num_groups> bits{num_bits...};
You want to put ints into a char array?
include/ginkgo/core/base/types.hpp
 * The encoded result will use 32 bits to record
 * rrrrr1..12....2...k..k, where 1/2/k means the bits store the information
 * for the 1/2/k position and r is for the rest of the unused bits.
 *
Somewhat crazy suggestion but might be great to have: maybe you could include a mathematical proof that this coding scheme is indeed a unique map when the numbers to encode are small enough. You could precisely state (in the assumptions) the maximum values of the arguments of encode for the coding to work, and then the proof would be a couple of lines to prove: for integer vectors x and y with all components less than the respective maxima, x != y implies encode(x...) != encode(y...).
If you say that is unnecessary, that is completely fine.
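For reference, a sketch of that argument (my own, hedged; writing $b_i$ for the bit count of field $i$ and assuming $0 \le x_i < 2^{b_i}$):

\[
  \operatorname{encode}(x_1,\dots,x_k) = \sum_{i=1}^{k} x_i\, 2^{s_i},
  \qquad s_i = \sum_{j>i} b_j .
\]
The lower-order fields can never spill into field $i$'s bits, since
\[
  \sum_{j>i} x_j\, 2^{s_j} \le \sum_{j>i} \bigl(2^{b_j}-1\bigr)\, 2^{s_j}
  = 2^{s_i} - 1 ,
\]
so every component is recovered uniquely by
\[
  x_i = \bigl\lfloor \operatorname{encode}(x) / 2^{s_i} \bigr\rfloor \bmod 2^{b_i} .
\]
A map with a left inverse is injective, hence $x \ne y$ implies $\operatorname{encode}(x) \ne \operatorname{encode}(y)$.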
using namespace gko::kernels::dpcpp;
using KCfg = gko::ConfigSet<12, 7>;
I guess 12 and 7 are arbitrarily chosen, or do you like 12 and 7 for some specific reason?
I used the wrong numbers here. It should be 11, 7: log_2(1024) + 1 for workgroup_size and log_2(64) + 1 for subgroup_size.
That appears to make sense, but I don't really see it. It seems like you want the bits array to contain the maximum number of bits needed by each position. But I guess it will work correctly for many combinations, depending on which numbers are given as parameters to the encode function.
It is described in the ConfigSet note: the number of bits should be log_2(max) + 1.
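To make the arithmetic concrete, a hedged usage sketch (encode/decode spelled as in the discussion above; the exact ConfigSet interface may differ): 1024 needs log_2(1024) + 1 = 11 bits because the value 1024 itself must be representable, and likewise 64 needs 7 bits.

// 11 bits for workgroup_size (values up to 1024) and 7 bits for
// subgroup_size (values up to 64).
using KCfg = gko::ConfigSet<11, 7>;

// Pack both values into a single integer, e.g. for kernel selection:
constexpr auto cfg = KCfg::encode(256, 32);

// Recover the individual fields by position:
constexpr auto workgroup_size = KCfg::decode<0>(cfg);  // 256
constexpr auto subgroup_size = KCfg::decode<1>(cfg);   // 32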
LGTM - nice job! I am still not entirely happy about the dim3 thing, since it means adding CUDA-isms to our SYCL code, but I guess it would only be used in a small number of places anyway, so it will be okay.
core/test/base/types.cpp
ASSERT_EQ((std::is_same<decltype(mask3_u), const unsigned int>::value),
          true);
ASSERT_EQ((std::is_same<decltype(fullmask_u), const unsigned int>::value),
          true);
ASSERT_TRUE((std::is_same<decltype(mask3_u), const unsigned int>::value));
ASSERT_TRUE((std::is_same<decltype(fullmask_u), const unsigned int>::value));
Not sure if you even need the additional parentheses then?
It still needs the additional parentheses so that the macro unpacks the parameters correctly.
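A tiny illustration of the macro issue (hypothetical snippet): the preprocessor splits macro arguments at top-level commas, including the one inside the template argument list, so the extra parentheses are load-bearing.

// Without the extra parentheses the preprocessor sees two arguments,
// "std::is_same<decltype(mask3_u)" and "const unsigned int>::value":
// ASSERT_TRUE(std::is_same<decltype(mask3_u), const unsigned int>::value);

// Parenthesizing the expression makes it a single macro argument:
ASSERT_TRUE((std::is_same<decltype(mask3_u), const unsigned int>::value));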
core/test/base/types.cpp
ASSERT_EQ((std::is_same<decltype(mask3_u64), const std::uint64_t>::value),
          true);
ASSERT_EQ(
    (std::is_same<decltype(fullmask_u64), const std::uint64_t>::value),
    true);
same here and the following
@@ -266,9 +266,6 @@ fi
# Arrange the remain files and give
if [ -f "${CONTENT}" ]; then
    add_regroup
    if [ "${HAS_HIP_RUNTIME}" = "true" ]; then
What is the reason for these changes again? Do we no longer need it?
The behavior still exists. DPCPP produces more additional headers than just <hip/runtime.h>, so I moved all the additional content before the LICENSE into the header section, not just hip/runtime.h.
It depends on how much effort we would like to put into DPCPP in the beginning.
Co-authored-by: Terry Cojean <terry.cojean@kit.edu>
Co-authored-by: Aditya Kashi <aditya.kashi@kit.edu> Co-authored-by: Tobias Ribizel <ribizel@kit.edu>
I swapped the order of the shift parameters, although we do not have auto deduction there.
Force-pushed from 5067e93 to 32a7c19.
delete throw in constexpr because it fails in gcc <= 5.x Co-authored-by: Aditya Kashi <aditya.kashi@kit.edu> Co-authored-by: Terry Cojean <terry.cojean@kit.edu> Co-authored-by: Tobias Ribizel <ribizel@kit.edu>
Kudos, SonarCloud Quality Gate passed!
Ginkgo release 1.4.0

The Ginkgo team is proud to announce the new Ginkgo minor release 1.4.0. This release brings most of the Ginkgo functionality to the Intel DPC++ ecosystem which enables Intel-GPU and CPU execution. The only Ginkgo features which have not been ported yet are some preconditioners.

Ginkgo's mixed-precision support is greatly enhanced thanks to:
1. The new Accessor concept, which allows writing kernels featuring on-the-fly memory compression, among other features. The accessor can be used as header-only, see the [accessor BLAS benchmarks repository](https://github.com/ginkgo-project/accessor-BLAS/tree/develop) as a usage example.
2. All LinOps now transparently support mixed-precision execution. By default, this is done through a temporary copy which may have a performance impact but already allows mixed-precision research. Native mixed-precision ELL kernels are implemented which do not see this cost.

The accessor is also leveraged in a new CB-GMRES solver which allows for performance improvements by compressing the Krylov basis vectors. Many other features have been added to Ginkgo, such as reordering support, a new IDR solver, Incomplete Cholesky preconditioner, matrix assembly support (only CPU for now), machine topology information, and more!

Supported systems and requirements:
+ For all platforms, cmake 3.13+
+ C++14 compliant compiler
+ Linux and MacOS
  + gcc: 5.3+, 6.3+, 7.3+, all versions after 8.1+
  + clang: 3.9+
  + Intel compiler: 2018+
  + Apple LLVM: 8.0+
  + CUDA module: CUDA 9.0+
  + HIP module: ROCm 3.5+
  + DPC++ module: Intel OneAPI 2021.3. Set the CXX compiler to `dpcpp`.
+ Windows
  + MinGW and Cygwin: gcc 5.3+, 6.3+, 7.3+, all versions after 8.1+
  + Microsoft Visual Studio: VS 2019
  + CUDA module: CUDA 9.0+, Microsoft Visual Studio
  + OpenMP module: MinGW or Cygwin.

Algorithm and important feature additions:
+ Add a new DPC++ Executor for SYCL execution and other base utilities [#648](#648), [#661](#661), [#757](#757), [#832](#832)
+ Port matrix formats, solvers and related kernels to DPC++. For some kernels, also make use of a shared kernel implementation for all executors (except Reference). [#710](#710), [#799](#799), [#779](#779), [#733](#733), [#844](#844), [#843](#843), [#789](#789), [#845](#845), [#849](#849), [#855](#855), [#856](#856)
+ Add accessors which allow multi-precision kernels, among other things. [#643](#643), [#708](#708)
+ Add support for mixed precision operations through apply in all LinOps. [#677](#677)
+ Add incomplete Cholesky factorizations and preconditioners as well as some improvements to ILU. [#672](#672), [#837](#837), [#846](#846)
+ Add an AMGX implementation and kernels on all devices but DPC++. [#528](#528), [#695](#695), [#860](#860)
+ Add a new mixed-precision capability solver, Compressed Basis GMRES (CB-GMRES). [#693](#693), [#763](#763)
+ Add the IDR(s) solver. [#620](#620)
+ Add a new fixed-size block CSR matrix format (for the Reference executor). [#671](#671), [#730](#730)
+ Add native mixed-precision support to the ELL format. [#717](#717), [#780](#780)
+ Add Reverse Cuthill-McKee reordering [#500](#500), [#649](#649)
+ Add matrix assembly support on CPUs. [#644](#644)
+ Extends ISAI from triangular to general and spd matrices. [#690](#690)

Other additions:
+ Add the possibility to apply real matrices to complex vectors. [#655](#655), [#658](#658)
+ Add functions to compute the absolute of a matrix format. [#636](#636)
+ Add symmetric permutation and improve existing permutations. [#684](#684), [#657](#657), [#663](#663)
+ Add a MachineTopology class with HWLOC support [#554](#554), [#697](#697)
+ Add an implicit residual norm criterion. [#702](#702), [#818](#818), [#850](#850)
+ Row-major accessor is generalized to more than 2 dimensions and a new "block column-major" accessor has been added. [#707](#707)
+ Add an heat equation example. [#698](#698), [#706](#706)
+ Add ccache support in CMake and CI. [#725](#725), [#739](#739)
+ Allow tuning and benchmarking variables non intrusively. [#692](#692)
+ Add triangular solver benchmark [#664](#664)
+ Add benchmarks for BLAS operations [#772](#772), [#829](#829)
+ Add support for different precisions and consistent index types in benchmarks. [#675](#675), [#828](#828)
+ Add a Github bot system to facilitate development and PR management. [#667](#667), [#674](#674), [#689](#689), [#853](#853)
+ Add Intel (DPC++) CI support and enable CI on HPC systems. [#736](#736), [#751](#751), [#781](#781)
+ Add ssh debugging for Github Actions CI. [#749](#749)
+ Add pipeline segmentation for better CI speed. [#737](#737)

Changes:
+ Add a Scalar Jacobi specialization and kernels. [#808](#808), [#834](#834), [#854](#854)
+ Add implicit residual log for solvers and benchmarks. [#714](#714)
+ Change handling of the conjugate in the dense dot product. [#755](#755)
+ Improved Dense stride handling. [#774](#774)
+ Multiple improvements to the OpenMP kernels performance, including COO, an exclusive prefix sum, and more. [#703](#703), [#765](#765), [#740](#740)
+ Allow specialization of submatrix and other dense creation functions in solvers. [#718](#718)
+ Improved Identity constructor and treatment of rectangular matrices. [#646](#646)
+ Allow CUDA/HIP executors to select allocation mode. [#758](#758)
+ Check if executors share the same memory. [#670](#670)
+ Improve test install and smoke testing support. [#721](#721)
+ Update the JOSS paper citation and add publications in the documentation. [#629](#629), [#724](#724)
+ Improve the version output. [#806](#806)
+ Add some utilities for dim and span. [#821](#821)
+ Improved solver and preconditioner benchmarks. [#660](#660)
+ Improve benchmark timing and output. [#669](#669), [#791](#791), [#801](#801), [#812](#812)

Fixes:
+ Sorting fix for the Jacobi preconditioner. [#659](#659)
+ Also log the first residual norm in CGS [#735](#735)
+ Fix BiCG and HIP CSR to work with complex matrices. [#651](#651)
+ Fix Coo SpMV on strided vectors. [#807](#807)
+ Fix segfault of extract_diagonal, add short-and-fat test. [#769](#769)
+ Fix device_reset issue by moving counter/mutex to device. [#810](#810)
+ Fix `EnableLogging` superclass. [#841](#841)
+ Support ROCm 4.1.x and breaking HIP_PLATFORM changes. [#726](#726)
+ Decreased test size for a few device tests. [#742](#742)
+ Fix multiple issues with our CMake HIP and RPATH setup. [#712](#712), [#745](#745), [#709](#709)
+ Cleanup our CMake installation step. [#713](#713)
+ Various simplification and fixes to the Windows CMake setup. [#720](#720), [#785](#785)
+ Simplify third-party integration. [#786](#786)
+ Improve Ginkgo device arch flags management. [#696](#696)
+ Other fixes and improvements to the CMake setup. [#685](#685), [#792](#792), [#705](#705), [#836](#836)
+ Clarification of dense norm documentation [#784](#784)
+ Various development tools fixes and improvements [#738](#738), [#830](#830), [#840](#840)
+ Make multiple operators/constructors explicit. [#650](#650), [#761](#761)
+ Fix some issues, memory leaks and warnings found by MSVC. [#666](#666), [#731](#731)
+ Improved solver memory estimates and consistent iteration counts [#691](#691)
+ Various logger improvements and fixes [#728](#728), [#743](#743), [#754](#754)
+ Fix for ForwardIterator requirements in iterator_factory. [#665](#665)
+ Various benchmark fixes. [#647](#647), [#673](#673), [#722](#722)
+ Various CI fixes and improvements. [#642](#642), [#641](#641), [#795](#795), [#783](#783), [#793](#793), [#852](#852)

Related PR: #857
Release 1.4.0 to master: same release notes as above. Related PR: #866
This PR adds cooperative groups in dpcpp to keep the same interface as cuda/hip.
Note: the subgroup seems to correspond to a warp in cuda, not a subwarp.
Reference: https://intel.github.io/llvm-docs/cuda/opencl-subgroup-vs-cuda-crosslane-op.html
This PR is WIP, but I would like to bring it to review to check whether the config selection is acceptable for ginkgo.
For the config selection, please check cg_shuffle_config_call in dpcpp/test/components/cooperative_groups_kernels.dp.cpp.
Summary:
- helper gives a default implementation macro for simple kernel cases (no explicit template parameter and 1d block)
- __WG_BOUND__ gives something like __launch_bound__, but it needs the 3d information, not the product
- __WG_BOUND_CONFIG__ can use ConfigSet for easy unpacking

TODO:
- also set the cooperative group test result individually in cuda/hip? I decided to handle the cuda/hip cooperative group in another PR because it contains another subwarp test.