Check if executors share the same memory #670

Merged
tcojean merged 5 commits into develop from exec_equal_to on Dec 11, 2020

Conversation

Member

@tcojean tcojean commented Nov 24, 2020

This PR contains only the first commit of #652.

This implements a new executor function, memory_accessible, which checks whether two executors share the same memory. Currently, this functionality is used in the temporary_clone function to avoid unnecessary copies.
This currently means that:

  • Any OpenMP executor has the same memory as any other OpenMP executor.
  • The same holds for Reference executors.
  • The same holds for CUDA executors if they have the same device id, and likewise for HIP executors.
  • OpenMP and Reference also share the same memory, and so does a host-side DPC++ executor.
  • HIP and CUDA have the same memory if they have the same device id and HIP actually uses the CUDA backend.
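
The rules listed above can be summarised in a small self-contained model. This is only a sketch under stated assumptions: `ExecKind`, `Exec`, `is_host` and the free function `memory_accessible` are hypothetical stand-ins written for illustration, not Ginkgo's classes, and the `hip_on_cuda` flag stands in for "HIP actually uses the CUDA backend". Device-side DPC++ executors are not covered by the description, so they are left out.

```cpp
#include <cassert>

// Hypothetical stand-ins for the executor kinds named in the description.
enum class ExecKind { Reference, Omp, DpcppHost, Cuda, Hip };

struct Exec {
    ExecKind kind;
    int device_id;     // only meaningful for Cuda and Hip
    bool hip_on_cuda;  // only meaningful for Hip: built on the CUDA backend

    Exec(ExecKind k, int id = 0, bool on_cuda = false)
        : kind(k), device_id(id), hip_on_cuda(on_cuda)
    {}
};

// True for executors whose data lives in host memory.
inline bool is_host(const Exec& e)
{
    return e.kind == ExecKind::Reference || e.kind == ExecKind::Omp ||
           e.kind == ExecKind::DpcppHost;
}

// Sketch of the rules in the bullet list (an assumption, not Ginkgo's code).
inline bool memory_accessible(const Exec& a, const Exec& b)
{
    if (is_host(a) && is_host(b)) {
        return true;  // Reference, OpenMP and host-side DPC++ share memory
    }
    if (is_host(a) || is_host(b)) {
        return false;  // host memory versus device memory
    }
    if (a.device_id != b.device_id) {
        return false;  // different devices never share memory
    }
    if (a.kind == b.kind) {
        return true;  // CUDA/CUDA or HIP/HIP on the same device
    }
    // Mixed CUDA/HIP pair: shared only if HIP runs on the CUDA backend.
    const Exec& hip = (a.kind == ExecKind::Hip) ? a : b;
    return hip.hip_on_cuda;
}

inline void memory_rules_example()
{
    const Exec omp(ExecKind::Omp);
    const Exec ref(ExecKind::Reference);
    const Exec cuda0(ExecKind::Cuda, 0);
    const Exec cuda1(ExecKind::Cuda, 1);
    const Exec hip0_on_cuda(ExecKind::Hip, 0, true);
    const Exec hip0_native(ExecKind::Hip, 0, false);

    assert(memory_accessible(omp, ref));
    assert(!memory_accessible(omp, cuda0));
    assert(!memory_accessible(cuda0, cuda1));
    assert(memory_accessible(cuda0, hip0_on_cuda));
    assert(!memory_accessible(cuda0, hip0_native));
}
```

The check is symmetric in this model; whether that should hold for every executor pair is a design decision the description does not spell out.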

@tcojean tcojean added is:enhancement An improvement of an existing feature. mod:core This is related to the core module. 1:ST:ready-for-review This PR is ready for review labels Nov 24, 2020
@tcojean tcojean self-assigned this Nov 24, 2020
@tcojean tcojean changed the title Implement executor equal_to (==) Check if executors share the same memory Nov 24, 2020
Member

@yhmtsai yhmtsai left a comment

How about we still isolate the reference executor?
If we optimize the omp executor with some additional information, like csr in cuda/hip,
considering omp/reference as the same memory will make the conversion fail.

Member

upsj commented Nov 25, 2020

Awesome idea, this might also help later on with the MPI executor. Two comments before I review it in more detail:

  1. This slightly changes the semantics of make_temporary_clone since the returned object no longer necessarily exists on the executor we passed in. Do we use this other executor somewhere?
  2. I would prefer if we used a member function memory_accessible or something like that instead of overloading operator==, since that might prevent us from implementing full executor equality checks later.

Member Author

tcojean commented Nov 26, 2020

> If we optimize the omp executor with some additional information, like csr in cuda/hip,
> considering omp/reference as the same memory will make the conversion fail.

I don't think this is such a big issue. If the algorithm needs to clone anyway, gko::clone will still work. If it needs to transform the memory without copies, it can do so as long as it doesn't break anything, or as long as the memory is put back into its previous state. In that situation, you could always call a make_srow() equivalent to update the strategy-related data. Also, it doesn't seem like we have such a case for now.

> This slightly changes the semantics of make_temporary_clone since the returned object no longer necessarily exists on the executor we passed in. Do we use this other executor somewhere?

Indeed, it does in two ways: 1) after this PR, a temporary clone is only made when there is no direct access to the memory, and 2) we no longer ensure that LinOps have exactly the same executors when calling a function like apply(), or however else we combine them. So far, in terms of tests, that seems to work, but I'm indeed not sure whether our testing of all combinations is extensive enough to prove that at a 100% level. I believe that's where integration tests would be useful.

> I would prefer if we used a member function memory_accessible or something like that instead of overloading operator==, since that might prevent us from implementing full executor equality checks later.

Sure, I don't see how else we would use operator== than this, but that does make it more future proof, so I made this change.

@tcojean tcojean force-pushed the exec_equal_to branch 2 times, most recently from 50361e6 to 6acf8eb Compare November 27, 2020 08:58
Member

@pratikvn pratikvn left a comment

LGTM! But I have one question: this PR changes the behaviour of make_temporary_clone, which seems to be part of the public interface.

Doesn't that mean we could be breaking existing code?

include/ginkgo/core/base/executor.hpp
Comment on lines +1495 to +1496
std::for_each(device_type_.begin(), device_type_.end(),
[](char &c) { c = std::tolower(c); });
Member

:)

Member Author

tcojean commented Nov 27, 2020

I don't think this change to temporary clone is interface breaking, since the point of this function is to do a copy when you need it. This still happens now; all that changes is that you get fewer copies when the data is already sitting in memory that you can access.
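
The behaviour change under discussion can be illustrated with a toy model. This is a minimal sketch, assuming hypothetical types `ToyExec`, `ToyArray` and `ToyTemporaryClone` that only mimic the idea; it is not Ginkgo's `make_temporary_clone`, and copying data back to the original is deliberately not modelled.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical executor stand-in: `kind` names the executor type, and two
// executors can reach each other's memory when their memory_id matches.
struct ToyExec {
    int kind;
    int memory_id;

    bool memory_accessible(const ToyExec& other) const
    {
        return memory_id == other.memory_id;
    }
};

// Hypothetical data container bound to an executor.
struct ToyArray {
    ToyExec exec;
    std::vector<double> data;
};

// Holds either the original object (no copy) or a copy bound to `target`.
class ToyTemporaryClone {
public:
    ToyTemporaryClone(const ToyExec& target, ToyArray* original)
    {
        if (original->exec.memory_accessible(target)) {
            ptr_ = original;  // memory is already reachable: reuse it
        } else {
            owned_ = std::make_unique<ToyArray>(
                ToyArray{target, original->data});
            ptr_ = owned_.get();
        }
    }

    ToyArray* get() const { return ptr_; }
    bool is_copy() const { return owned_ != nullptr; }

private:
    std::unique_ptr<ToyArray> owned_;
    ToyArray* ptr_ = nullptr;
};

inline void temporary_clone_example()
{
    const ToyExec omp{0, 0};  // host executor
    const ToyExec ref{1, 0};  // another host executor, same memory
    const ToyExec gpu{2, 1};  // device executor, separate memory
    ToyArray on_omp{omp, {1.0, 2.0}};

    ToyTemporaryClone shared(ref, &on_omp);
    assert(!shared.is_copy());
    assert(shared.get() == &on_omp);
    // The object handed back still belongs to omp, not to ref.
    assert(shared.get()->exec.kind == omp.kind);

    ToyTemporaryClone copied(gpu, &on_omp);
    assert(copied.is_copy());
    assert(copied.get()->exec.kind == gpu.kind);
    assert(copied.get()->data == on_omp.data);
}
```

The first case is the one raised in review: the caller asked for `ref` but gets back an object that still reports `omp` as its executor.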

Collaborator

@Slaedr Slaedr left a comment

Wow, this is an intricate piece of work.

My understanding of why we need a two-level dispatch:

Because we do not want to template Executor on the concrete executor type, we have an ExecutorBase template with CRTP, and this is a friend of Executor for all CRTP values. The verify_memory_to functions, which do the actual work, depend on being overloaded on the concrete executor types. This means calling one of them requires knowledge of the concrete type. Therefore, Executor cannot directly call verify_memory_to, so it calls verify_memory_from. This second function is not overloaded but is implemented in ExecutorBase and thus has access to its concrete type, and can call verify_memory_to of its argument, passing its concrete self as the argument in turn.

Is that correct? Does this get affected when actual memory space classes are introduced, or has the plan for that changed?

I have a minor reservation which is a point that @pratikvn raised, about consistency in using pointers and references for verify_memory_to and verify_memory_from.

Member Author

tcojean commented Nov 30, 2020

@Slaedr yes, you are pretty much exactly correct about that design; this is how it works. The idea of this double dispatch with the CRTP-enabled class in the middle is to resolve the actual executor type, so that we can call the proper verify_memory_to for the proper target executor.

About the memory space classes, either this will stay with memory spaces instead of executors, or, more likely, we build the memory spaces properly so that they are shared when they should be, and anyway we don't need all of this anymore.

About using pointers, yes I will change to that, this is a leftover of the previous design which I started changing in some recent commits, and had not done so for the deepest levels.
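
The two-level dispatch described in this exchange can be sketched as follows. The function names `memory_accessible`, `verify_memory_from` and `verify_memory_to` come from the discussion; the classes `ToyExecutor`, `ToyExecutorBase`, `ToyOmp` and `ToyCuda` are hypothetical and much simpler than the real executor hierarchy, so treat this as an illustration of the pattern rather than the merged code.

```cpp
#include <cassert>

class ToyOmp;
class ToyCuda;

template <typename Concrete>
class ToyExecutorBase;

class ToyExecutor {
    // The CRTP layer needs access to the protected hooks of any executor.
    template <typename Concrete>
    friend class ToyExecutorBase;

public:
    virtual ~ToyExecutor() = default;

    // Public entry point; only sees the abstract type of `other`.
    bool memory_accessible(const ToyExecutor* other) const
    {
        return this->verify_memory_from(other);
    }

protected:
    // First dispatch: implemented once in the CRTP layer.
    virtual bool verify_memory_from(const ToyExecutor* src) const = 0;

    // Second dispatch: overloaded on the concrete executor type.
    // By default, memory is not shared.
    virtual bool verify_memory_to(const ToyOmp*) const { return false; }
    virtual bool verify_memory_to(const ToyCuda*) const { return false; }
};

template <typename Concrete>
class ToyExecutorBase : public ToyExecutor {
protected:
    bool verify_memory_from(const ToyExecutor* src) const override
    {
        // `this` is known to be a Concrete here, so the right overload of
        // verify_memory_to is selected at compile time.
        return src->verify_memory_to(static_cast<const Concrete*>(this));
    }
};

class ToyOmp : public ToyExecutorBase<ToyOmp> {
protected:
    using ToyExecutor::verify_memory_to;
    bool verify_memory_to(const ToyOmp*) const override { return true; }
};

class ToyCuda : public ToyExecutorBase<ToyCuda> {
public:
    explicit ToyCuda(int device_id) : device_id_(device_id) {}

protected:
    using ToyExecutor::verify_memory_to;
    bool verify_memory_to(const ToyCuda* dest) const override
    {
        return device_id_ == dest->device_id_;
    }

private:
    int device_id_;
};

inline void double_dispatch_example()
{
    ToyOmp omp;
    ToyOmp other_omp;
    ToyCuda cuda0(0);
    ToyCuda also_cuda0(0);
    ToyCuda cuda1(1);

    assert(omp.memory_accessible(&other_omp));
    assert(cuda0.memory_accessible(&also_cuda0));
    assert(!cuda0.memory_accessible(&cuda1));
    assert(!omp.memory_accessible(&cuda0));
    assert(!cuda0.memory_accessible(&omp));
}
```

Adding a new executor type in this scheme means adding one `verify_memory_to` overload to the base class and overriding only the pairs that do share memory; every other pair falls back to `false`.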

Collaborator

Slaedr commented Nov 30, 2020

> About the memory space classes, either this will stay with memory spaces instead of executors, or, more likely, we build the memory spaces properly so that they are shared when they should be, and anyway we don't need all of this anymore.

So all this is only a temporary measure until the full memory spaces are implemented? How much of this will be retained?

Member Author

tcojean commented Nov 30, 2020

> > About the memory space classes, either this will stay with memory spaces instead of executors, or, more likely, we build the memory spaces properly so that they are shared when they should be, and anyway we don't need all of this anymore.
>
> So all this is only a temporary measure until the full memory spaces are implemented? How much of this will be retained?

That is hard to say. First of all, memory spaces are definitely interface breaking, so they could still be a while away, whereas this is not. Removing this and replacing it with memory spaces will be interface breaking as well, but not much more so, since memory spaces are interface breaking to begin with. Second, as I said, it depends on how the memory space implementation is done. You could keep this together with memory spaces: if we somehow want to keep a CudaMemory and a HipMemory, for example, you would still need to say Cuda == HIP when HIP is run on CUDA and both have the same device id. If it's done in another way, where the HIP executor gets a CudaMemory space while using a CUDA backend, then you don't need the verify memory overload.
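
The second alternative mentioned here, where the HIP executor is handed a CUDA memory space, might look roughly like the sketch below. Everything in it is hypothetical (`SpaceKind`, `ToyMemorySpace`, `ToySpaceExec`, `make_toy_cuda`, `make_toy_hip`); no memory space design is given in this thread, so this only illustrates why the per-pair overloads would become unnecessary.

```cpp
#include <cassert>
#include <memory>

// Hypothetical memory-space description: which kind of memory, which device.
enum class SpaceKind { Host, Cuda, Hip };

struct ToyMemorySpace {
    SpaceKind kind;
    int device_id;
};

inline bool same_space(const ToyMemorySpace& a, const ToyMemorySpace& b)
{
    if (a.kind != b.kind) {
        return false;
    }
    // All host memory is treated as one space in this sketch.
    return a.kind == SpaceKind::Host || a.device_id == b.device_id;
}

// Hypothetical executor that holds a handle to its memory space.
struct ToySpaceExec {
    std::shared_ptr<const ToyMemorySpace> space;

    bool memory_accessible(const ToySpaceExec& other) const
    {
        return same_space(*space, *other.space);
    }
};

inline ToySpaceExec make_toy_cuda(int device_id)
{
    return {std::make_shared<ToyMemorySpace>(
        ToyMemorySpace{SpaceKind::Cuda, device_id})};
}

// A HIP executor on the CUDA backend is handed a Cuda memory space, so the
// generic comparison already gives the right answer.
inline ToySpaceExec make_toy_hip(int device_id, bool on_cuda_backend)
{
    const SpaceKind kind = on_cuda_backend ? SpaceKind::Cuda : SpaceKind::Hip;
    return {std::make_shared<ToyMemorySpace>(ToyMemorySpace{kind, device_id})};
}

inline void memory_space_example()
{
    const ToySpaceExec cuda = make_toy_cuda(0);
    assert(cuda.memory_accessible(make_toy_hip(0, true)));
    assert(!cuda.memory_accessible(make_toy_hip(0, false)));
    assert(!cuda.memory_accessible(make_toy_hip(1, true)));
}
```

In this arrangement the executor type no longer matters for the check; only the memory space it was constructed with does.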

@tcojean tcojean force-pushed the exec_equal_to branch 2 times, most recently from 78f8a62 to c9d73fd Compare December 1, 2020 16:18

github-actions bot commented Dec 1, 2020

Error: The following files need to be formatted:

core/device_hooks/dpcpp_hooks.cpp
cuda/test/base/array.cu
include/ginkgo/core/base/dim.hpp
include/ginkgo/core/base/exception_helpers.hpp
include/ginkgo/core/base/executor.hpp
include/ginkgo/core/base/temporary_clone.hpp
include/ginkgo/core/base/utils.hpp
include/ginkgo/core/base/utils_helper.hpp
include/ginkgo/core/log/logger.hpp
include/ginkgo/ginkgo.hpp

You can find a formatting patch under Artifacts here or run format! if you have write access to Ginkgo

Member Author

tcojean commented Dec 2, 2020

rebase!


github-actions bot commented Dec 2, 2020

Error: The following files need to be formatted:

core/device_hooks/dpcpp_hooks.cpp
cuda/test/base/array.cu
include/ginkgo/core/base/dim.hpp
include/ginkgo/core/base/exception_helpers.hpp
include/ginkgo/core/base/executor.hpp
include/ginkgo/core/base/temporary_clone.hpp
include/ginkgo/core/base/utils.hpp
include/ginkgo/core/base/utils_helper.hpp
include/ginkgo/core/log/logger.hpp
include/ginkgo/ginkgo.hpp

You can find a formatting patch under Artifacts here or run format! if you have write access to Ginkgo

Member Author

tcojean commented Dec 2, 2020

format!


codecov bot commented Dec 2, 2020

Codecov Report

Merging #670 (018244f) into develop (b8a705c) will decrease coverage by 0.10%.
The diff coverage is 74.66%.

@@             Coverage Diff             @@
##           develop     #670      +/-   ##
===========================================
- Coverage    93.00%   92.89%   -0.11%     
===========================================
  Files          332      333       +1     
  Lines        24187    24265      +78     
===========================================
+ Hits         22495    22541      +46     
- Misses        1692     1724      +32     

| Impacted Files | Coverage Δ |
|---|---|
| core/device_hooks/dpcpp_hooks.cpp | 29.03% <0.00%> (-5.76%) ⬇️ |
| include/ginkgo/core/base/dim.hpp | 88.88% <ø> (ø) |
| include/ginkgo/core/base/exception_helpers.hpp | 90.90% <ø> (ø) |
| include/ginkgo/core/log/logger.hpp | 92.10% <ø> (ø) |
| core/test/base/executor.cpp | 91.58% <47.05%> (-8.42%) ⬇️ |
| include/ginkgo/core/base/executor.hpp | 83.67% <73.07%> (-2.28%) ⬇️ |
| include/ginkgo/core/base/utils_helper.hpp | 87.17% <87.17%> (ø) |
| core/devices/cuda/executor.cpp | 75.00% <100.00%> (+25.00%) ⬆️ |
| core/devices/hip/executor.cpp | 71.42% <100.00%> (+38.09%) ⬆️ |
| core/test/base/array.cpp | 100.00% <100.00%> (ø) |

... and 8 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b8a705c...018244f. Read the comment docs.

Collaborator

@Slaedr Slaedr left a comment

LGTM! Just one minor question.

yhmtsai and others added 3 commits December 4, 2020 16:19
Co-authored-with: Terry Cojean <terry.cojean@kit.edu>
+ Do not use `operator==`, but a function `memory_accessible` instead.
+ Make DPC++ host and CPU be memory compatible.
+ Use pointers for the interface instead of references.
+ Ensure DPC++ tests always work.
+ Fix some typos.

Co-authored-by: Yuhsiang M. Tsai <yhmtsai@gmail.com>
Co-authored-by: Tobias Ribizel <ribizel@kit.edu>
Co-authored-by: Pratik Nayak <pratikvn@protonmail.com>
Co-authored-by: Aditya Kashi <aditya.kashi@kit.edu>
Co-authored-by: tcojean <tcojean@users.noreply.github.com>
Member

@thoasm thoasm left a comment

LGTM, some minor comments.

Member Author

@tcojean tcojean left a comment

Thanks for the comments, I will fix the issues pointed out.

Some code style issues.

Co-authored-by: Thomas Grützmacher <thomas.gruetzmacher@kit.edu>
Member

@yhmtsai yhmtsai left a comment

LGTM. Thanks for handling this issue and adding a lot of tests.
I have one question on the test:
we will make the reference executor isolated, right?

Collaborator

Slaedr commented Dec 9, 2020

Could you remind me why we need the reference executor to have a different memory space from the omp executor? Do we document this somewhere?

@tcojean tcojean force-pushed the exec_equal_to branch 2 times, most recently from 9dd05db to 80297f3 Compare December 9, 2020 10:46
Member

yhmtsai commented Dec 9, 2020

We have some executor-specific data, like the csr srow.
If we consider omp and reference to be in the same memory, then when omp applies on reference data, the data will not be converted to the omp version. Thus, the omp operation may access unallocated memory or wrong data.
Note: we currently only have this in cuda/hip, not in omp/reference.

Another point comes from testing or design.
We use the reference executor/operations as the correct result against which we compare the other executors.
Would it make sense to isolate the reference executor to avoid any unexpected operations from other executors?

Member

@upsj upsj left a comment

LGTM! Just a minor question


sonarcloud bot commented Dec 9, 2020

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 11 (rating A)

Coverage: 32.7%
Duplication: 0.0%

@tcojean tcojean added 1:ST:ready-to-merge This PR is ready to merge. and removed 1:ST:ready-for-review This PR is ready for review labels Dec 10, 2020
@tcojean tcojean merged commit 2a951ac into develop Dec 11, 2020
@tcojean tcojean deleted the exec_equal_to branch December 11, 2020 08:49
tcojean added a commit that referenced this pull request Aug 20, 2021
Ginkgo release 1.4.0

The Ginkgo team is proud to announce the new Ginkgo minor release 1.4.0. This
release brings most of the Ginkgo functionality to the Intel DPC++ ecosystem
which enables Intel-GPU and CPU execution. The only Ginkgo features which have
not been ported yet are some preconditioners.

Ginkgo's mixed-precision support is greatly enhanced thanks to:
1. The new Accessor concept, which allows writing kernels featuring on-the-fly
memory compression, among other features. The accessor can be used as
header-only, see the [accessor BLAS benchmarks repository](https://github.com/ginkgo-project/accessor-BLAS/tree/develop) as a usage example.
2. All LinOps now transparently support mixed-precision execution. By default,
this is done through a temporary copy which may have a performance impact but
already allows mixed-precision research.

Native mixed-precision ELL kernels are implemented which do not see this cost.
The accessor is also leveraged in a new CB-GMRES solver which allows for
performance improvements by compressing the Krylov basis vectors. Many other
features have been added to Ginkgo, such as reordering support, a new IDR
solver, Incomplete Cholesky preconditioner, matrix assembly support (only CPU
for now), machine topology information, and more!

Supported systems and requirements:
+ For all platforms, cmake 3.13+
+ C++14 compliant compiler
+ Linux and MacOS
  + gcc: 5.3+, 6.3+, 7.3+, all versions after 8.1+
  + clang: 3.9+
  + Intel compiler: 2018+
  + Apple LLVM: 8.0+
  + CUDA module: CUDA 9.0+
  + HIP module: ROCm 3.5+
  + DPC++ module: Intel OneAPI 2021.3. Set the CXX compiler to `dpcpp`.
+ Windows
  + MinGW and Cygwin: gcc 5.3+, 6.3+, 7.3+, all versions after 8.1+
  + Microsoft Visual Studio: VS 2019
  + CUDA module: CUDA 9.0+, Microsoft Visual Studio
  + OpenMP module: MinGW or Cygwin.


Algorithm and important feature additions:
+ Add a new DPC++ Executor for SYCL execution and other base utilities
  [#648](#648), [#661](#661), [#757](#757), [#832](#832)
+ Port matrix formats, solvers and related kernels to DPC++. For some kernels,
  also make use of a shared kernel implementation for all executors (except
  Reference). [#710](#710), [#799](#799), [#779](#779), [#733](#733), [#844](#844), [#843](#843), [#789](#789), [#845](#845), [#849](#849), [#855](#855), [#856](#856)
+ Add accessors which allow multi-precision kernels, among other things.
  [#643](#643), [#708](#708)
+ Add support for mixed precision operations through apply in all LinOps. [#677](#677)
+ Add incomplete Cholesky factorizations and preconditioners as well as some
  improvements to ILU. [#672](#672), [#837](#837), [#846](#846)
+ Add an AMGX implementation and kernels on all devices but DPC++.
  [#528](#528), [#695](#695), [#860](#860)
+ Add a new mixed-precision capability solver, Compressed Basis GMRES
  (CB-GMRES). [#693](#693), [#763](#763)
+ Add the IDR(s) solver. [#620](#620)
+ Add a new fixed-size block CSR matrix format (for the Reference executor).
  [#671](#671), [#730](#730)
+ Add native mixed-precision support to the ELL format. [#717](#717), [#780](#780)
+ Add Reverse Cuthill-McKee reordering [#500](#500), [#649](#649)
+ Add matrix assembly support on CPUs. [#644](#644)
+ Extend ISAI from triangular to general and spd matrices. [#690](#690)

Other additions:
+ Add the possibility to apply real matrices to complex vectors.
  [#655](#655), [#658](#658)
+ Add functions to compute the absolute of a matrix format. [#636](#636)
+ Add symmetric permutation and improve existing permutations.
  [#684](#684), [#657](#657), [#663](#663)
+ Add a MachineTopology class with HWLOC support [#554](#554), [#697](#697)
+ Add an implicit residual norm criterion. [#702](#702), [#818](#818), [#850](#850)
+ Row-major accessor is generalized to more than 2 dimensions and a new
  "block column-major" accessor has been added. [#707](#707)
+ Add a heat equation example. [#698](#698), [#706](#706)
+ Add ccache support in CMake and CI. [#725](#725), [#739](#739)
+ Allow tuning and benchmarking variables non intrusively. [#692](#692)
+ Add triangular solver benchmark [#664](#664)
+ Add benchmarks for BLAS operations [#772](#772), [#829](#829)
+ Add support for different precisions and consistent index types in benchmarks.
  [#675](#675), [#828](#828)
+ Add a Github bot system to facilitate development and PR management.
  [#667](#667), [#674](#674), [#689](#689), [#853](#853)
+ Add Intel (DPC++) CI support and enable CI on HPC systems. [#736](#736), [#751](#751), [#781](#781)
+ Add ssh debugging for Github Actions CI. [#749](#749)
+ Add pipeline segmentation for better CI speed. [#737](#737)


Changes:
+ Add a Scalar Jacobi specialization and kernels. [#808](#808), [#834](#834), [#854](#854)
+ Add implicit residual log for solvers and benchmarks. [#714](#714)
+ Change handling of the conjugate in the dense dot product. [#755](#755)
+ Improved Dense stride handling. [#774](#774)
+ Multiple improvements to the OpenMP kernels performance, including COO,
  an exclusive prefix sum, and more. [#703](#703), [#765](#765), [#740](#740)
+ Allow specialization of submatrix and other dense creation functions in solvers. [#718](#718)
+ Improved Identity constructor and treatment of rectangular matrices. [#646](#646)
+ Allow CUDA/HIP executors to select allocation mode. [#758](#758)
+ Check if executors share the same memory. [#670](#670)
+ Improve test install and smoke testing support. [#721](#721)
+ Update the JOSS paper citation and add publications in the documentation.
  [#629](#629), [#724](#724)
+ Improve the version output. [#806](#806)
+ Add some utilities for dim and span. [#821](#821)
+ Improved solver and preconditioner benchmarks. [#660](#660)
+ Improve benchmark timing and output. [#669](#669), [#791](#791), [#801](#801), [#812](#812)


Fixes:
+ Sorting fix for the Jacobi preconditioner. [#659](#659)
+ Also log the first residual norm in CGS [#735](#735)
+ Fix BiCG and HIP CSR to work with complex matrices. [#651](#651)
+ Fix Coo SpMV on strided vectors. [#807](#807)
+ Fix segfault of extract_diagonal, add short-and-fat test. [#769](#769)
+ Fix device_reset issue by moving counter/mutex to device. [#810](#810)
+ Fix `EnableLogging` superclass. [#841](#841)
+ Support ROCm 4.1.x and breaking HIP_PLATFORM changes. [#726](#726)
+ Decreased test size for a few device tests. [#742](#742)
+ Fix multiple issues with our CMake HIP and RPATH setup.
  [#712](#712), [#745](#745), [#709](#709)
+ Cleanup our CMake installation step. [#713](#713)
+ Various simplification and fixes to the Windows CMake setup. [#720](#720), [#785](#785)
+ Simplify third-party integration. [#786](#786)
+ Improve Ginkgo device arch flags management. [#696](#696)
+ Other fixes and improvements to the CMake setup.
  [#685](#685), [#792](#792), [#705](#705), [#836](#836)
+ Clarification of dense norm documentation [#784](#784)
+ Various development tools fixes and improvements [#738](#738), [#830](#830), [#840](#840)
+ Make multiple operators/constructors explicit. [#650](#650), [#761](#761)
+ Fix some issues, memory leaks and warnings found by MSVC.
  [#666](#666), [#731](#731)
+ Improved solver memory estimates and consistent iteration counts [#691](#691)
+ Various logger improvements and fixes [#728](#728), [#743](#743), [#754](#754)
+ Fix for ForwardIterator requirements in iterator_factory. [#665](#665)
+ Various benchmark fixes. [#647](#647), [#673](#673), [#722](#722)
+ Various CI fixes and improvements. [#642](#642), [#641](#641), [#795](#795), [#783](#783), [#793](#793), [#852](#852)


Related PR: #857
tcojean added a commit that referenced this pull request Aug 23, 2021
Release 1.4.0 to master

Related PR: #866
Labels
1:ST:ready-to-merge This PR is ready to merge. is:enhancement An improvement of an existing feature. mod:core This is related to the core module.
7 participants