
DPCPP cooperative group #757

Merged: 14 commits merged from oneapi_coopgp into develop on May 28, 2021
Conversation

@yhmtsai (Member) commented May 3, 2021

This PR adds cooperative groups to dpcpp, keeping the same interface as cuda/hip.
Note: the subgroup appears to correspond to the warp in cuda, not a subwarp.
Reference: https://intel.github.io/llvm-docs/cuda/opencl-subgroup-vs-cuda-crosslane-op.html

This PR is WIP, but I would like to bring it up for review already, mainly to check whether the config selection is acceptable for ginkgo.
For the config selection, please check cg_shuffle_config_call in dpcpp/test/components/cooperative_groups_kernels.dp.cpp.

Summary:

  • dim3 and sycl_nd_range: cuda-like wrappers for the sycl range and nd_range, with tests (see the sketch after this list)
  • helper: provides a default-implementation macro for simple kernel cases (no explicit template parameter, 1d block)
  • __WG_BOUND__ gives something like cuda's __launch_bounds__, but it needs the full 3d information, not just the product;
    __WG_BOUND_CONFIG__ can use a ConfigSet for easy unpacking
  • cooperative group implementation, with test results set individually
  • another selection mechanism for configs (it allows bool, int, and size_type template parameters by roughly going through all kernel templates)
  • update format_header so that it can handle the generated dpcpp files (the script is not in this PR yet)
  • ConfigSet: @tcojean provides a general way to embed several pieces of information into one encoded integer
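As a rough, hypothetical sketch of the dim3/sycl_nd_range mapping described above (the actual implementation in dpcpp/base/dim3.dp.hpp may differ in details):

#include <CL/sycl.hpp>

// Cuda-like dim3: x is the fastest-varying dimension, as in cuda.
struct dim3 {
    unsigned int x;
    unsigned int y;
    unsigned int z;

    dim3(unsigned int xval, unsigned int yval = 1, unsigned int zval = 1)
        : x(xval), y(yval), z(zval)
    {}

    // sycl::range uses the reverse ordering (slowest dimension first),
    // so the cuda x dimension becomes the last sycl dimension.
    sycl::range<3> get_range() const { return sycl::range<3>(z, y, x); }
};

// Builds a sycl::nd_range from cuda-style grid and block dimensions, so a
// cuda launch kernel<<<grid, block>>> maps to roughly
// queue.parallel_for(sycl_nd_range(grid, block), ...).
inline sycl::nd_range<3> sycl_nd_range(dim3 grid, dim3 block)
{
    const auto local = block.get_range();
    return sycl::nd_range<3>(grid.get_range() * local, local);
}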

TODO:
Should we also set the cooperative group test results individually in cuda/hip?
I decided to handle the cuda/hip cooperative groups in another PR, because that involves an additional subwarp test.

@yhmtsai yhmtsai added the 1:ST:WIP This PR is a work in progress. Not ready for review. label May 3, 2021
@yhmtsai yhmtsai self-assigned this May 3, 2021
@ginkgo-bot ginkgo-bot added mod:core This is related to the core module. mod:dpcpp This is related to the DPC++ module. reg:build This is related to the build system. reg:helper-scripts This issue/PR is related to the helper scripts mainly concerned with development of Ginkgo. reg:testing This is related to testing. labels May 3, 2021
@tcojean (Member) commented May 3, 2021

Thanks Mike for creating the PR. I'm trying to see what I can do with a templated configuration type as a possible improvement; I'll let you know whether it works.

@upsj (Member) left a comment:

I didn't go into the config for now, but the rest looks good.

dpcpp/components/cooperative_groups.dp.hpp (3 outdated threads, resolved)
dpcpp/test/utils.hpp (outdated, resolved)
@codecov (bot) commented May 3, 2021

Codecov Report

Merging #757 (7f72418) into develop (95c7652) will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff            @@
##           develop     #757   +/-   ##
========================================
  Coverage    94.17%   94.17%           
========================================
  Files          400      400           
  Lines        31051    31080   +29     
========================================
+ Hits         29241    29270   +29     
  Misses        1810     1810           
Impacted Files Coverage Δ
include/ginkgo/core/base/types.hpp 92.59% <ø> (ø)
core/test/base/types.cpp 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 95c7652...7f72418.

@yhmtsai (Member, Author) commented May 3, 2021

test_all_subgroup will run sizes 4, 8, 16 on the cpu and 8, 16, 32 on the intel gpu.
Search for dpcpp/test/components/cooperative_groups_kernels in the dgpu-job or cpu-job CI logs.

@upsj upsj added this to the Ginkgo 1.4.0 milestone May 5, 2021
@Slaedr (Contributor) left a comment:

Overall, this looks good!

The main point of interest for me is the choice of 32 for the warp size. Could you document somewhere, maybe just in this PR, why you chose 32? Will shuffles be expected to work for 32 work-items independent of vector architecture? Can we get away with a smaller warp size? (I think a smaller warp size, if it still fully utilizes the vector units, can enable more flexible parallelism.)

Apart from that I have a few other points. I guess the sync implementation and tests still remain, and documentation can be improved in several places.

dpcpp/base/config.hpp (outdated, resolved)
dpcpp/test/utils.hpp (outdated, resolved)
include/ginkgo/core/base/types.hpp (outdated, resolved)
dpcpp/test/components/cooperative_groups_kernels.dp.cpp (2 outdated threads, resolved)
__dpct_inline__ ValueType shfl_up(ValueType var,
                                  SelectorType selector) const noexcept
{
    const auto result = this->shuffle_up(var, selector);
Contributor:
Do we need this-> here? If not, it's probably better to remove it. Maybe:

Suggested change:
-    const auto result = this->shuffle_up(var, selector);
+    const auto result = sub_group::shuffle_up(var, selector);

Member Author:

I think it needs this-> to access the generated subgroup information.

dpcpp/components/cooperative_groups.dp.hpp (resolved)
__dpct_inline__ ValueType ShflOpName(ValueType var, SelectorType selector) \
    const noexcept                                                         \
{                                                                          \
    return this->ShflOp(var, selector);                                    \
Contributor:

Maybe just ShflOp is enough here?

Suggested change:
-    return this->ShflOp(var, selector); \
+    return ShflOp(var, selector); \

Member Author:

When I call a member function, I use this->.

Member:

member functions are sometimes not found in a templated member function unless you use this->, which is also why all of our templated tests need to use it.
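As a standalone illustration of this two-phase-lookup rule (generic C++, not PR code):

template <typename T>
struct base {
    T shuffle_up(T var, int delta) const { return var + delta; }
};

template <typename T>
struct tile : base<T> {
    T shfl_up(T var, int delta) const
    {
        // return shuffle_up(var, delta);   // error: not found, because the
        // base class depends on T, and unqualified lookup ignores it.
        return this->shuffle_up(var, delta);  // dependent name: looked up at
                                              // instantiation time, so it works.
    }
};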

Contributor:

Typically, that is not needed for inline definitions (in the class body rather than outside). I was just thinking of avoiding an explicit pointer indirection if we can help it. I understand it's probably fine because it should be optimized out.



// Enable group can directly use group function
__SYCL_INLINE_NAMESPACE(cl)
Contributor:

Can you give an example of a function this will enable? In general it seems like a bad idea for us to depend on sycl's detail namespace - I guess they can change it at any time.

Member Author:

For example, we use the group algorithm reduce in ballot.
If we did not do it this way, we would need to add a static_cast to the subgroup type whenever we use the oneAPI implementation on gko's subgroup.
I agree with you that they might change it at some point.
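For illustration, a hedged sketch of the cast-based alternative being discussed (the namespace and call follow the sycl::ONEAPI naming mentioned in this thread; the exact signatures are assumptions, not PR code):

// Without the namespace trick, each call site would need a cast like this
// to hand gko's subgroup type to the oneAPI group algorithms:
template <typename GkoSubgroup>
unsigned count_set(const GkoSubgroup& group, bool predicate)
{
    return sycl::ONEAPI::reduce(
        static_cast<const sycl::ONEAPI::sub_group&>(group),
        predicate ? 1u : 0u, sycl::ONEAPI::plus<unsigned>());
}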

Contributor (@Slaedr, May 26, 2021):

Yes, if it works with static casting to sycl::ONEAPI::sub_group* when you call reduce, perhaps we should just do that.

Member Author:

I'll stay with this version for now because it still works, and it makes ginkgo's cooperative group feel like an extension rather than a separate object.
But if they change it frequently and it becomes hard to keep up, we can delete this and use static_cast.


// specialization for 1
template <>
class thread_block_tile<1> {
Contributor:

Why is this specialization needed? Do you expect we'll need to use thread_block_tiles of size 1 in algorithms? Otherwise, if this case is correctly but inefficiently handled by the generic implementation, I guess we don't need this specialization. If it's needed, it would be nice to have unit tests for this too.

Member Author:

It is for kernels implemented in a warp sense:
providing this specialization lets us run the same kernel with subgroup(1) to get the single-thread implementation.

@yhmtsai (Member, Author) commented May 18, 2021

I added a few default helpers to reduce the internal-layer code.
GKO_ENABLE_DEFAULT_HOST_CONFIG(name, kernel) generates the default implementation for kernels without explicit template parameters.
GKO_ENABLE_DEFAULT_CONFIG_CALL(name, callable, cfg, list): cfg describes how to decode the int, and list gives the available values for selection.
Also, @tcojean provided a general way, ConfigSet, to embed several pieces of information into one int (a minimal sketch follows below).
It can be used not only in dpcpp but also in cuda/hip, e.g. in ellpack, where we have 2 (3) parameters for selection.
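For context, a minimal two-field sketch of the ConfigSet bit-packing idea (the real implementation is variadic and more general; this is only an illustration under those assumptions):

#include <cstdint>

// Packs two values into one integer: v1 in the upper bits1 bits,
// v2 in the lowest bits2 bits.
template <int bits1, int bits2>
struct config_set {
    using encoded = std::uint32_t;

    static constexpr encoded encode(encoded v1, encoded v2)
    {
        return (v1 << bits2) | v2;
    }

    // Recovers the value at `position` by shifting and masking.
    template <int position>
    static constexpr encoded decode(encoded cfg)
    {
        return position == 0 ? (cfg >> bits2) & ((encoded{1} << bits1) - 1)
                             : cfg & ((encoded{1} << bits2) - 1);
    }
};

// 11 bits hold a workgroup size up to 1024, 7 bits a subgroup size up to 64.
using cfg = config_set<11, 7>;
static_assert(cfg::decode<0>(cfg::encode(256, 32)) == 256, "");
static_assert(cfg::decode<1>(cfg::encode(256, 32)) == 32, "");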

TODO:

  • add documentation

@yhmtsai yhmtsai added 1:ST:ready-for-review This PR is ready for review and removed 1:ST:WIP This PR is a work in progress. Not ready for review. labels May 18, 2021
@yhmtsai yhmtsai requested a review from Slaedr May 18, 2021 22:11
@yhmtsai (Member, Author) commented May 18, 2021

@ginkgo-project/reviewers this PR can be reviewed. Naming suggestions for any of these things are also welcome.

@tcojean (Member) left a comment:

LGTM

dev_tools/scripts/format_header.sh (2 outdated threads, resolved)
Comment on lines +47 to +55
/**
* The type containing a bitmask over all lanes of a warp.
*/
using lane_mask_type = uint64;

/**
* The bitmask of the entire warp.
*/
static constexpr auto full_lane_mask = ~zero<lane_mask_type>();
Member:

Will we ever need this? I think we cannot have masked operations anyway. Maybe all of the config can be removed to begin with?

Member:

I also think the full lane mask should have the same number of bits as a full lane has threads. Is that true here? Otherwise things related to popcnt and ballot might break.

Member Author:

You need to use mask<subgroup_size, lane_mask_type> to get a mask with the last subgroup_size bits activated in lane_mask_type.
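A hypothetical sketch of such a mask helper (the name and parameters follow the comment above; the actual ginkgo helper may look different):

#include <cstdint>

// Returns a mask with the lowest n bits of MaskType set.
template <int n, typename MaskType>
constexpr MaskType mask()
{
    // Guard the full-width case, since shifting by the full bit width
    // would be undefined behavior.
    return n == static_cast<int>(sizeof(MaskType) * 8)
               ? ~MaskType{}
               : static_cast<MaskType>((MaskType{1} << n) - 1);
}

// Only the 32 lanes of the subgroup are marked active inside the
// 64-bit lane_mask_type:
static_assert(mask<32, std::uint64_t>() == 0xffffffffu, "");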

dpcpp/base/helper.hpp (2 outdated threads, resolved)
dpcpp/components/cooperative_groups.dp.hpp (outdated, resolved)
dpcpp/base/dim3.dp.hpp (outdated, resolved)
Comment on lines 820 to 824
bool allowed = false;
for (auto &i : subgroup_sizes) {
    allowed |= (i == warpsize);
}
return allowed && (blocksize <= max_workgroup_size);
Member:

You can use validate_function instead?

Also, why are these protected? How do you use them?

Member Author:

The protected structure is the original one.
I added a public function to get the const exec_info.
Do you have any comments on using the function directly versus the structure function?

dpcpp/test/components/cooperative_groups_kernels.dp.cpp (2 outdated threads, resolved)
@upsj (Member) left a comment:

Looks mostly good to me, I would like to see if we can remove a few CUDA-isms though, since they only relate to kernel launch, not the actual kernel implementation.


Comment on lines +41 to +47
#if defined(_MSC_VER)
#define __dpct_align__(n) __declspec(align(n))
#define __dpct_inline__ __forceinline
#else
#define __dpct_align__(n) __attribute__((aligned(n)))
#define __dpct_inline__ __inline__ __attribute__((always_inline))
#endif
Member:

Are we programming for DPC++ or SYCL? For the former, can't we include dpct.hpp directly? For the latter, is this portable?

Member Author:

This is due to earlier issues with dpct. For example, including the dpct header gave errors, and their atomic_add has some issues, so we also have our own atomic_add implementation for real numbers in another PR.

* dim3 is a cuda-like dim3 for sycl-range, which provides the same ordering as
* cuda and gets the sycl-range in reverse ordering.
*/
struct dim3 {
Member:

We are building a lot of complexity here (also the config selection dim3 integration), so I want to ask: Why do we need this additional wrapper? Can't we use SYCL primitives directly?

Member Author:

Yes, we could use the native sycl::range<3>,
but it gives a different view of the kernel launch and of building the sycl::range.
For example, where cuda uses kernel<<<32, 32>>>, sycl needs
kernel(sycl::nd_range<3>(sycl::range<3>(1, 1, 32) * sycl::range<3>(1, 1, 32), sycl::range<3>(1, 1, 32))),
but with dim3, we can still write kernel(32, 32).
Also, in the beginning I wanted to keep the difference between cuda and dpcpp small.

Member:

In which places does that actually matter? Most of the time we are using one-dimensional kernels, except for SpMV (multiple columns), some Dense kernels (2D) and Jacobi. Wouldn't you use sycl::nd_range<1>(...) directly?

Member Author:

I think the mapping would be different. At the least, it would require us to develop our own dpct to avoid any conversion like threadIdx -> get_local_id(3).

Member Author:

And it gives an inconsistent sense of indexing between 1D/2D/3D kernels:
with range<1>(x), get_local_id(0) is x, and x is contiguous, as in cuda;
with range<2>(x, y), get_local_id(0) is still x, but x is no longer contiguous;
with range<2>(y, x), x is contiguous, but x needs get_local_id(1).
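A small generic-SYCL example of that mapping (my own illustration, not PR code):

#include <CL/sycl.hpp>

// In a 2D nd_range the *last* dimension is the contiguous one, so it only
// plays the role of cuda's threadIdx.x if the range is built as (y, x).
void scale_rows(sycl::queue& queue, int* data, int nx, int ny)
{
    queue.parallel_for(
        sycl::nd_range<2>(sycl::range<2>(ny, nx), sycl::range<2>(1, nx)),
        [=](sycl::nd_item<2> item) {
            const auto x = item.get_local_id(1);  // contiguous, like threadIdx.x
            const auto y = item.get_group(0);     // like blockIdx.y
            data[y * nx + x] *= 2;
        });
}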

Member:

But that only matters on the launch side, right? I would assume that the reason for this is that it maps more cleanly to the nested for-loop model

for (; i < range[0]; i++)
    for (; j < range[1]; j++)
        for (; k < range[2]; k++)

Member Author (@yhmtsai, May 19, 2021):

It affects not only the launch side but also the thread indexing inside the kernel.


@Slaedr (Contributor) left a comment:

Excellent work! It's pretty cool, e.g. how you handle stuff like __WG_BOUNDS__. I guess you want to make DPC++ as close to CUDA as possible, which is what a lot of the code here seems to be doing.

Below, I have some concerns around the ConfigSet stuff and other comments / suggestions.

*
* @note this is the last case of nested template
*/
template <int num_groups, int current_shift>
Contributor:

Suggested change:
-    template <int num_groups, int current_shift>
+    template <int current_shift, int num_groups>

This way, I think num_groups can be inferred, and we'll only need to provide current_shift when using this.

Member Author:

I tried that before, but it did not infer num_groups from the array.

Contributor:

Oh I see. Even with the switched order? Then never mind.

Member Author:

It needs the switched order to make it possible to omit the parameter.
I will post my attempt later.
It is also not what I expected, so maybe I did something wrong there.

Member:

Did you try adding a deduction guide? Neat little C++17 feature that might help here.

Member Author:

I think ConfigSet is also a general solution for other kernels (not just dpcpp), so we still need to stay with C++14.
I put the related code here: https://godbolt.org/z/oa5d4arMs
It switches the order, but num_groups cannot be omitted.

*
* @note this is the usual case of nested template
*/
template <int num_groups, int current_shift>
Contributor:

Suggested change:
-    template <int num_groups, int current_shift>
+    template <int current_shift, int num_groups>

                  const std::array<char, num_groups> &bits)
{
    return bits[current_shift + 1] +
           shift<num_groups, (current_shift + 1)>(bits);
Contributor:

Then this can simply be:

Suggested change:
-    shift<num_groups, (current_shift + 1)>(bits);
+    shift<current_shift + 1>(bits);


/**
* ConfigSet is a way to embed several information into one integer by given
* certain bits. The usage will be the following
Contributor:

Suggested change:
-    * certain bits. The usage will be the following
+    * certain bits.
+    *
+    * The usage will be the following:

class ConfigSet {
public:
    static constexpr size_type num_groups = sizeof...(num_bits);
    static constexpr std::array<char, num_groups> bits{num_bits...};
Contributor:

You want to put ints into a char array?

* The encoded result will use 32 bits to record
* rrrrr1..12....2...k..k, which 1/2/k means the bits store the information for
* 1/2/k position and r is for rest of unused bits.
*
Contributor:

Somewhat crazy suggestion but might be great to have: maybe you could include a mathematical proof that this coding scheme is indeed a unique map when the numbers to encode are small enough. You could precisely state (in the assumptions) the maximum values of the arguments of encode for the coding to work, and then the proof would be couple of lines to prove:
For integer vectors x and y with all components less than the respective maxima, x != y implies encode(x...) != encode(y...).

If you say that is unnecessary, that is completely fine.
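For reference, one way such an argument could go (my own sketch, not from the PR):

Let $b_1, \dots, b_k$ be the bit widths, $s_i = \sum_{j > i} b_j$ the shift of position $i$, and assume $0 \le x_i < 2^{b_i}$. Then

\[ \mathrm{encode}(x_1, \dots, x_k) = \sum_{i=1}^{k} x_i \, 2^{s_i}, \]

which is exactly the mixed-radix representation of the result with digit $x_i$ at position $i$. Mixed-radix representations with digits below their radix are unique, so $x \ne y$ implies $\mathrm{encode}(x) \ne \mathrm{encode}(y)$.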

dpcpp/components/cooperative_groups.dp.hpp (resolved)





using namespace gko::kernels::dpcpp;
using KCfg = gko::ConfigSet<12, 7>;
Contributor:

I guess 12 and 7 are arbitrarily chosen, or do you like 12 and 7 for some specific reason?

Member Author:

I used the wrong numbers here; it should be 11, 7.
They come from log2(1024) + 1 for workgroup_size and log2(64) + 1 for subgroup_size.

Contributor:

That appears to make sense, but I don't really see it. It seems like you want the bits array to contain the maximum number of bits needed by each position. But I guess it will work correctly for many combinations, depending on which numbers are given as parameters to the encode function.

Member Author:

It is described in the ConfigSet note: the number of bits should be log2(max) + 1, e.g. a maximum value of 1024 needs log2(1024) + 1 = 11 bits.

@upsj (Member) left a comment:

LGTM - nice job! I am still not entirely happy about the dim3 thing, since it means adding CUDA-isms to our SYCL code, but it would only be used in a small number of places anyway, so I guess it will be okay.

Comment on lines 117 to 120
ASSERT_EQ((std::is_same<decltype(mask3_u), const unsigned int>::value),
          true);
ASSERT_EQ((std::is_same<decltype(fullmask_u), const unsigned int>::value),
          true);
Member:

Suggested change:
-    ASSERT_EQ((std::is_same<decltype(mask3_u), const unsigned int>::value),
-              true);
-    ASSERT_EQ((std::is_same<decltype(fullmask_u), const unsigned int>::value),
-              true);
+    ASSERT_TRUE((std::is_same<decltype(mask3_u), const unsigned int>::value));
+    ASSERT_TRUE((std::is_same<decltype(fullmask_u), const unsigned int>::value));

Not sure if you even need the additional parentheses then?

Member Author:

It still needs the additional parentheses so that the macro unpacks its parameters correctly.
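A standalone illustration of why the extra parentheses are needed (generic preprocessor behavior, not PR code): the comma inside the template argument list would otherwise be treated as a macro argument separator.

#include <cassert>
#include <type_traits>

#define MY_ASSERT_TRUE(cond) assert(cond)

int main()
{
    // MY_ASSERT_TRUE(std::is_same<int, int>::value);
    // ^ error: the macro sees two arguments, split at the comma in <int, int>

    // Wrapping the expression in parentheses makes it a single argument:
    MY_ASSERT_TRUE((std::is_same<int, int>::value));
}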

Comment on lines 123 to 127
ASSERT_EQ((std::is_same<decltype(mask3_u64), const std::uint64_t>::value),
          true);
ASSERT_EQ(
    (std::is_same<decltype(fullmask_u64), const std::uint64_t>::value),
    true);
Member:

same here and the following

@@ -266,9 +266,6 @@ fi
# Arrange the remain files and give
if [ -f "${CONTENT}" ]; then
add_regroup
if [ "${HAS_HIP_RUNTIME}" = "true" ]; then
Member:

What is the reason for these changes again? Do we no longer need it?

Member Author:

The behavior still exists. DPCPP produces more additional headers than just <hip/runtime.h>, so I moved all additional headers before the LICENSE into the header section, not just hip/runtime.h.


@yhmtsai (Member, Author) commented May 27, 2021

It depends on how much effort we want to invest when starting DPCPP.
If we don't want cuda-isms in dpcpp, we should reimplement everything from scratch rather than porting it.
Using dpct dictates the index structure, and the global range cannot be the same as the grid.
Without dim3, we would still have to do the same thing manually and keep switching our mindset between SYCL and HIP/CUDA.

@yhmtsai yhmtsai added 1:ST:ready-to-merge This PR is ready to merge. and removed 1:ST:ready-for-review This PR is ready for review labels May 27, 2021
@yhmtsai (Member, Author) commented May 27, 2021

I swapped the order of shift's template parameters, although we do not have automatic deduction there.
But if there is some way to achieve it, or we can use C++17, this template order can stay unchanged.
I added some comments to show that encode/decode is a bijection.
Also, ConfigSet now fails to compile or throws an error when:

  • the total number of bits is larger than the bits of ConfigSetType
  • a value is larger than its bit representation allows (deleted, because gcc 5.x does not support conditional throw in constexpr)
  • encode is called with the wrong number of arguments
  • decode is given an out-of-range position

@yhmtsai yhmtsai force-pushed the oneapi_coopgp branch 2 times, most recently from 5067e93 to 32a7c19 Compare May 28, 2021 16:12
delete throw in constexpr because it fails in gcc <= 5.x

Co-authored-by: Aditya Kashi <aditya.kashi@kit.edu>
Co-authored-by: Terry Cojean <terry.cojean@kit.edu>
Co-authored-by: Tobias Ribizel <ribizel@kit.edu>
@sonarcloud (bot) commented May 28, 2021

Kudos, SonarCloud Quality Gate passed!

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 1 Code Smell

No Coverage information
3.4% Duplication

@yhmtsai yhmtsai merged commit da19a97 into develop May 28, 2021
@yhmtsai yhmtsai deleted the oneapi_coopgp branch May 28, 2021 23:16
tcojean added a commit that referenced this pull request Aug 20, 2021
Ginkgo release 1.4.0

The Ginkgo team is proud to announce the new Ginkgo minor release 1.4.0. This
release brings most of the Ginkgo functionality to the Intel DPC++ ecosystem
which enables Intel-GPU and CPU execution. The only Ginkgo features which have
not been ported yet are some preconditioners.

Ginkgo's mixed-precision support is greatly enhanced thanks to:
1. The new Accessor concept, which allows writing kernels featuring on-the-fly
memory compression, among other features. The accessor can be used as
header-only, see the [accessor BLAS benchmarks repository](https://github.com/ginkgo-project/accessor-BLAS/tree/develop) as a usage example.
2. All LinOps now transparently support mixed-precision execution. By default,
this is done through a temporary copy which may have a performance impact but
already allows mixed-precision research.

Native mixed-precision ELL kernels are implemented which do not see this cost.
The accessor is also leveraged in a new CB-GMRES solver which allows for
performance improvements by compressing the Krylov basis vectors. Many other
features have been added to Ginkgo, such as reordering support, a new IDR
solver, Incomplete Cholesky preconditioner, matrix assembly support (only CPU
for now), machine topology information, and more!

Supported systems and requirements:
+ For all platforms, cmake 3.13+
+ C++14 compliant compiler
+ Linux and MacOS
  + gcc: 5.3+, 6.3+, 7.3+, all versions after 8.1+
  + clang: 3.9+
  + Intel compiler: 2018+
  + Apple LLVM: 8.0+
  + CUDA module: CUDA 9.0+
  + HIP module: ROCm 3.5+
  + DPC++ module: Intel OneAPI 2021.3. Set the CXX compiler to `dpcpp`.
+ Windows
  + MinGW and Cygwin: gcc 5.3+, 6.3+, 7.3+, all versions after 8.1+
  + Microsoft Visual Studio: VS 2019
  + CUDA module: CUDA 9.0+, Microsoft Visual Studio
  + OpenMP module: MinGW or Cygwin.


Algorithm and important feature additions:
+ Add a new DPC++ Executor for SYCL execution and other base utilities
  [#648](#648), [#661](#661), [#757](#757), [#832](#832)
+ Port matrix formats, solvers and related kernels to DPC++. For some kernels,
  also make use of a shared kernel implementation for all executors (except
  Reference). [#710](#710), [#799](#799), [#779](#779), [#733](#733), [#844](#844), [#843](#843), [#789](#789), [#845](#845), [#849](#849), [#855](#855), [#856](#856)
+ Add accessors which allow multi-precision kernels, among other things.
  [#643](#643), [#708](#708)
+ Add support for mixed precision operations through apply in all LinOps. [#677](#677)
+ Add incomplete Cholesky factorizations and preconditioners as well as some
  improvements to ILU. [#672](#672), [#837](#837), [#846](#846)
+ Add an AMGX implementation and kernels on all devices but DPC++.
  [#528](#528), [#695](#695), [#860](#860)
+ Add a new mixed-precision capability solver, Compressed Basis GMRES
  (CB-GMRES). [#693](#693), [#763](#763)
+ Add the IDR(s) solver. [#620](#620)
+ Add a new fixed-size block CSR matrix format (for the Reference executor).
  [#671](#671), [#730](#730)
+ Add native mixed-precision support to the ELL format. [#717](#717), [#780](#780)
+ Add Reverse Cuthill-McKee reordering [#500](#500), [#649](#649)
+ Add matrix assembly support on CPUs. [#644](#644)
+ Extend ISAI from triangular to general and spd matrices. [#690](#690)

Other additions:
+ Add the possibility to apply real matrices to complex vectors.
  [#655](#655), [#658](#658)
+ Add functions to compute the absolute of a matrix format. [#636](#636)
+ Add symmetric permutation and improve existing permutations.
  [#684](#684), [#657](#657), [#663](#663)
+ Add a MachineTopology class with HWLOC support [#554](#554), [#697](#697)
+ Add an implicit residual norm criterion. [#702](#702), [#818](#818), [#850](#850)
+ Row-major accessor is generalized to more than 2 dimensions and a new
  "block column-major" accessor has been added. [#707](#707)
+ Add a heat equation example. [#698](#698), [#706](#706)
+ Add ccache support in CMake and CI. [#725](#725), [#739](#739)
+ Allow tuning and benchmarking variables non intrusively. [#692](#692)
+ Add triangular solver benchmark [#664](#664)
+ Add benchmarks for BLAS operations [#772](#772), [#829](#829)
+ Add support for different precisions and consistent index types in benchmarks.
  [#675](#675), [#828](#828)
+ Add a Github bot system to facilitate development and PR management.
  [#667](#667), [#674](#674), [#689](#689), [#853](#853)
+ Add Intel (DPC++) CI support and enable CI on HPC systems. [#736](#736), [#751](#751), [#781](#781)
+ Add ssh debugging for Github Actions CI. [#749](#749)
+ Add pipeline segmentation for better CI speed. [#737](#737)


Changes:
+ Add a Scalar Jacobi specialization and kernels. [#808](#808), [#834](#834), [#854](#854)
+ Add implicit residual log for solvers and benchmarks. [#714](#714)
+ Change handling of the conjugate in the dense dot product. [#755](#755)
+ Improved Dense stride handling. [#774](#774)
+ Multiple improvements to the OpenMP kernels performance, including COO,
an exclusive prefix sum, and more. [#703](#703), [#765](#765), [#740](#740)
+ Allow specialization of submatrix and other dense creation functions in solvers. [#718](#718)
+ Improved Identity constructor and treatment of rectangular matrices. [#646](#646)
+ Allow CUDA/HIP executors to select allocation mode. [#758](#758)
+ Check if executors share the same memory. [#670](#670)
+ Improve test install and smoke testing support. [#721](#721)
+ Update the JOSS paper citation and add publications in the documentation.
  [#629](#629), [#724](#724)
+ Improve the version output. [#806](#806)
+ Add some utilities for dim and span. [#821](#821)
+ Improved solver and preconditioner benchmarks. [#660](#660)
+ Improve benchmark timing and output. [#669](#669), [#791](#791), [#801](#801), [#812](#812)


Fixes:
+ Sorting fix for the Jacobi preconditioner. [#659](#659)
+ Also log the first residual norm in CGS [#735](#735)
+ Fix BiCG and HIP CSR to work with complex matrices. [#651](#651)
+ Fix Coo SpMV on strided vectors. [#807](#807)
+ Fix segfault of extract_diagonal, add short-and-fat test. [#769](#769)
+ Fix device_reset issue by moving counter/mutex to device. [#810](#810)
+ Fix `EnableLogging` superclass. [#841](#841)
+ Support ROCm 4.1.x and breaking HIP_PLATFORM changes. [#726](#726)
+ Decreased test size for a few device tests. [#742](#742)
+ Fix multiple issues with our CMake HIP and RPATH setup.
  [#712](#712), [#745](#745), [#709](#709)
+ Cleanup our CMake installation step. [#713](#713)
+ Various simplification and fixes to the Windows CMake setup. [#720](#720), [#785](#785)
+ Simplify third-party integration. [#786](#786)
+ Improve Ginkgo device arch flags management. [#696](#696)
+ Other fixes and improvements to the CMake setup.
  [#685](#685), [#792](#792), [#705](#705), [#836](#836)
+ Clarification of dense norm documentation [#784](#784)
+ Various development tools fixes and improvements [#738](#738), [#830](#830), [#840](#840)
+ Make multiple operators/constructors explicit. [#650](#650), [#761](#761)
+ Fix some issues, memory leaks and warnings found by MSVC.
  [#666](#666), [#731](#731)
+ Improved solver memory estimates and consistent iteration counts [#691](#691)
+ Various logger improvements and fixes [#728](#728), [#743](#743), [#754](#754)
+ Fix for ForwardIterator requirements in iterator_factory. [#665](#665)
+ Various benchmark fixes. [#647](#647), [#673](#673), [#722](#722)
+ Various CI fixes and improvements. [#642](#642), [#641](#641), [#795](#795), [#783](#783), [#793](#793), [#852](#852)


Related PR: #857
tcojean added a commit that referenced this pull request Aug 23, 2021
Release 1.4.0 to master

(Same release notes as the Ginkgo 1.4.0 commit above.)

Related PR: #866