
Fix dpcpp memory issue and capture exception in raw_free #832

Merged

5 commits merged into develop on Jul 16, 2021

Conversation

yhmtsai
Member

@yhmtsai yhmtsai commented Jul 12, 2021

This PR restricts direct memcpy to within the same backend and device, adds a test that frees memory after a kernel launch, and captures exceptions in raw_free.

The memcpy issue between backends on the same device is shown in https://gitlab.com/ginkgo-project/ginkgo-public-ci/-/jobs/1413646244
The raw_free issue is shown in https://gitlab.com/ginkgo-project/ginkgo-public-ci/-/jobs/1415200944

codeplay_host_task does not work as expected yet; it needs further investigation and the new API.
Thus, we keep the workaround to ensure the free happens after the kernel.

@yhmtsai yhmtsai added the 1:ST:ready-for-review This PR is ready for review label Jul 12, 2021
@yhmtsai yhmtsai added this to the Ginkgo 1.4.0 milestone Jul 12, 2021
@yhmtsai yhmtsai requested a review from a team July 12, 2021 08:59
@yhmtsai yhmtsai self-assigned this Jul 12, 2021
@ginkgo-bot ginkgo-bot added mod:core This is related to the core module. mod:dpcpp This is related to the DPC++ module. reg:ci-cd This is related to the continuous integration system. reg:testing This is related to testing. labels Jul 12, 2021
Copy link
Member

@upsj upsj left a comment


LGTM!

if (err_code == 0) {
err_code = 1;
}
std::exit(err_code);
Member

@upsj upsj Jul 12, 2021

We need to think about other ways to approach this at some point, since killing the process may not be desirable for application users.
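One possible direction (a sketch only, not Ginkgo's actual API): instead of calling std::exit in the asynchronous error handler, collect the exceptions and let the application query them. The class below mirrors the shape of a SYCL async_handler (which receives a list of exceptions) using only the standard library; `error_collector` is an illustrative name.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical alternative to exiting: collect asynchronous errors so the
// application can decide how to react. Mirrors the shape of a SYCL
// async_handler, but uses only the standard library.
class error_collector {
public:
    void operator()(const std::vector<std::exception_ptr>& exceptions)
    {
        for (const auto& e : exceptions) {
            try {
                std::rethrow_exception(e);
            } catch (const std::exception& ex) {
                messages_.push_back(ex.what());
            }
        }
    }

    const std::vector<std::string>& messages() const { return messages_; }

private:
    std::vector<std::string> messages_;  // recorded error descriptions
};
```

The application could then inspect `messages()` after synchronization and decide whether to abort, retry, or report.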

Member Author

Probably. I am not sure whether applications can tolerate this error situation or not.

});
}
// to ensure everything on queue is finished.
dpcpp->synchronize();
Member

Is this necessary? The case we want to catch is the one without synchronization, isn't it?

Member Author

Honestly, I am not sure. I do not know whether the queue waits for all kernels to finish when it is destroyed.
If it behaves like CUDA, it just terminates. It is also possible that we only get allocation and free but no execution.
I use curly brackets to scope the Array allocation, to make sure the free is called before the synchronize.
Yes, it catches the case where the free does not happen after the kernel finishes.

Member

When the array gets freed, we wait on the queue; isn't that sufficient?

Member Author

Yes, that is the fixed version. When the free does not contain a synchronization, we will probably not wait for the kernel.

Member

Can you remove this line to make sure the program can actually fail due to use-after-free issues?

Member Author

It is shown in https://gitlab.com/ginkgo-project/ginkgo-public-ci/-/jobs/1415200944
and the code containing the sync is https://gitlab.com/ginkgo-project/ginkgo-public-ci/-/commit/f9da28bc5b8f6847b4d587b0283b2cc7f7e2f60a
By "fixed version" I mean that we wait on the queue when we call free.
(the following description is based on CUDA behavior)
Without a sync before the program ends:

  1. when free does not wait:
    submit job (not run) -> free -> program end (the job may never run, so we may not hit the use-after-free)
  2. when free does wait (fixed version):
    submit job (not run) -> run job (because free waits) -> free -> program end
    this case does not need a sync at the end
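The two orderings above can be mimicked with a toy in-order queue that only runs submitted jobs when wait() is called (plain C++, no SYCL; `toy_queue` and `safe_free` are illustrative names, not Ginkgo code):

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <string>
#include <vector>

// Toy in-order queue: submitted jobs only run when wait() is called,
// mimicking the deferred execution described above.
struct toy_queue {
    std::queue<std::function<void()>> jobs;
    void submit(std::function<void()> job) { jobs.push(std::move(job)); }
    void wait()
    {
        while (!jobs.empty()) {
            jobs.front()();
            jobs.pop();
        }
    }
};

// The "fixed version": free waits on the queue first, so any pending
// kernel runs before the memory is released.
void safe_free(toy_queue& q, std::vector<int>*& data,
               std::vector<std::string>& log)
{
    q.wait();  // flush pending kernels before releasing memory
    delete data;
    data = nullptr;
    log.push_back("free");
}
```

Submitting a "kernel" that touches the data and then calling `safe_free` yields the order kernel -> free; dropping the internal `q.wait()` would reproduce ordering 1, where the free can overtake the kernel.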

Member

I see, thanks! I completely missed that the arrays have their own scope, so this is fine by me.

@codecov

codecov bot commented Jul 12, 2021

Codecov Report

Merging #832 (883dd5c) into develop (26ee87e) will decrease coverage by 0.04%.
The diff coverage is 0.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #832      +/-   ##
===========================================
- Coverage    94.29%   94.25%   -0.05%     
===========================================
  Files          408      408              
  Lines        32742    32756      +14     
===========================================
- Hits         30875    30874       -1     
- Misses        1867     1882      +15     
Impacted Files Coverage Δ
core/test/base/executor.cpp 87.60% <0.00%> (-4.58%) ⬇️
include/ginkgo/core/base/executor.hpp 79.47% <0.00%> (-0.85%) ⬇️
include/ginkgo/core/base/math.hpp 100.00% <ø> (ø)
core/base/extended_float.hpp 91.26% <0.00%> (-0.98%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 26ee87e...883dd5c. Read the comment docs.

@yhmtsai
Member Author

yhmtsai commented Jul 12, 2021

After more testing, different device combinations give different results:

  1. opencl:gpu vs opencl:cpu -> the opencl:gpu queue can access the opencl:cpu memory, but not the reverse (opencl:cpu behaves like normal host memory)
  2. opencl:gpu vs opencl:gpu -> no matter whether the two queues map to the same device, they cannot transfer data to each other
  3. level_zero:gpu vs opencl:cpu -> similar to 1.
  4. level_zero:gpu vs level_zero:gpu -> if they are on the same device, they can transfer memory to each other and use each other's memory
  5. opencl:cpu vs opencl:cpu -> can use memory from a different queue and transfer memory

The host option does not work in filter_selector (ref), so I need to check it.

I only found one way to get the backend: using dynamic_cast (with the backend type) on the type provided by level_zero.
A queue also has get_backend(), which returns an enum value.
Alternatively, we could always route data through host memory when the queues are different (to handle the OpenCL cases).

I put the code here so we can use it to check.
Compile: dpcpp <file>
Run: ./a.out <first_device_filter> <second_device_filter>

#include <CL/sycl.hpp>
#include <iostream>

int main(int argc, char *argv[])
{
    if (argc != 3) {
        std::cout << "Usage: " << argv[0] << " first_filter second_filter"
                  << std::endl;
        std::exit(1);
    }
    auto first_filter = sycl::ONEAPI::filter_selector(argv[1]);
    auto second_filter = sycl::ONEAPI::filter_selector(argv[2]);
    auto *first_queue =
        new sycl::queue{first_filter, sycl::property::queue::in_order{}};
    auto *second_queue =
        new sycl::queue{second_filter, sycl::property::queue::in_order{}};
    std::cout << "First queue device name: "
              << first_queue->get_device().get_info<sycl::info::device::name>()
              << std::endl;
    std::cout << "Second queue device name: "
              << second_queue->get_device().get_info<sycl::info::device::name>()
              << std::endl;
    const int n = 32;

    float data[n];
    for (int i = 0; i < n; i++) {
        data[i] = static_cast<float>(i) / n;
    }
    float *first_data = sycl::malloc_device<float>(n, *first_queue);
    float *first_data2 = sycl::malloc_device<float>(n, *first_queue);
    float *second_data = sycl::malloc_device<float>(n, *second_queue);
    float *second_data2 = sycl::malloc_device<float>(n, *second_queue);
    std::cout << "First queue test:" << std::endl;
    std::cout << "From Host to Device" << std::endl;
    first_queue->memcpy(first_data, data, n * sizeof(float)).wait();
    std::cout << "From Device to Device (self-queue)" << std::endl;
    first_queue->memcpy(first_data2, first_data, n * sizeof(float)).wait();
    std::cout << "From Device to Host" << std::endl;
    first_queue->memcpy(data, first_data2, n * sizeof(float)).wait();
    first_queue->wait_and_throw();
    std::cout << "Second queue test:" << std::endl;
    std::cout << "From Host to Device" << std::endl;
    second_queue->memcpy(second_data, data, n * sizeof(float)).wait();
    std::cout << "From Device to Device (self-queue)" << std::endl;
    second_queue->memcpy(second_data2, second_data, n * sizeof(float)).wait();
    std::cout << "From Device to Host" << std::endl;
    second_queue->memcpy(data, second_data2, n * sizeof(float)).wait();
    second_queue->wait_and_throw();
    bool same_device =
        (first_queue->get_device() == second_queue->get_device());
    bool same_context =
        (first_queue->get_context() == second_queue->get_context());
    std::cout << "device check: " << same_device << std::endl;
    std::cout << "context check: " << same_context << std::endl;
    if (same_device) {
        std::cout << "Access the data from different queue on the same device"
                  << std::endl;
        std::cout << "Access the second queue data from first queue"
                  << std::endl;
        first_queue->memcpy(second_data2, second_data, n * sizeof(float))
            .wait();
        std::cout << "Access the first queue data from second queue"
                  << std::endl;
        second_queue->memcpy(first_data2, first_data, n * sizeof(float)).wait();
    }
    std::cout << "Communication between first/second queue:" << std::endl;
    std::cout << "From second to first (handled by first queue)" << std::endl;
    first_queue->memcpy(first_data, second_data, n * sizeof(float)).wait();
    std::cout << "From first to second (handled by second queue)" << std::endl;
    second_queue->memcpy(second_data, first_data, n * sizeof(float)).wait();
    first_queue->wait_and_throw();
    second_queue->wait_and_throw();
    sycl::free(first_data, first_queue->get_context());
    sycl::free(first_data2, first_queue->get_context());
    sycl::free(second_data, second_queue->get_context());
    sycl::free(second_data2, second_queue->get_context());
    return 0;
}

update the accessibility check

@upsj
Member

upsj commented Jul 12, 2021

A round-trip via host memory sounds sensible; we do the same thing for ROCm <-> CUDA copies.
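The round-trip can be sketched in plain C++ (std::memcpy stands in for the queue->memcpy device<->host transfers; this illustrates the idea under that assumption, it is not the Ginkgo implementation):

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Copy between two allocations that cannot address each other directly
// (as in the OpenCL cases above) by staging the data in host memory.
// std::memcpy stands in for the device<->host queue->memcpy calls.
void copy_via_host(float* dst, const float* src, std::size_t n)
{
    std::vector<float> staging(n);                        // host buffer
    std::memcpy(staging.data(), src, n * sizeof(float));  // device -> host
    std::memcpy(dst, staging.data(), n * sizeof(float));  // host -> device
}
```

The cost is one extra host allocation and two transfers instead of one, which is why a debug-level "this copy is not optimized" message was suggested.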

@yhmtsai
Member Author

yhmtsai commented Jul 12, 2021

I forgot to update the level_zero information.
The level_zero backend allows transferring data on the same device, but opencl does not.
get_device() returns the same device, but get_context() differs.

Member

@tcojean tcojean left a comment

LGTM. Some small comments.

dest->get_queue()->memcpy(dest_ptr, src_ptr, num_bytes).wait();
} else {
// memcpy only supports host<->device copies or copies within the same executor
GKO_NOT_SUPPORTED(dest);
Member

I think we should do a copy through the host here like in the ROCM <-> NVIDIA cases. We should also add a debug level message to state that this might not be optimized. I expect that later direct GPU copies will be supported?

Member Author

Should we add the message?
For direct GPU copies, it seems the context takes care of (identifies) the memory.
If I pass the context to the other queue, they can communicate without a segmentation fault.
But I can also pass the opencl context to a level_zero queue and then they communicate without a segfault, which is weird.
A context should live under a platform, and opencl and level zero should be different platforms.

Member Author

I use #if (GKO_VERBOSE_LEVEL >= 1) && !defined(NDEBUG) to enable the output to clog
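A minimal sketch of such a guard (only the preprocessor condition comes from the comment above; GKO_WARN_SLOW_COPY is a hypothetical macro name and the message text is illustrative):

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>

#ifndef GKO_VERBOSE_LEVEL
#define GKO_VERBOSE_LEVEL 1  // illustrative default for this sketch
#endif

// The warning is compiled in only for verbose builds without NDEBUG,
// matching the condition quoted above; otherwise it expands to nothing.
#if (GKO_VERBOSE_LEVEL >= 1) && !defined(NDEBUG)
#define GKO_WARN_SLOW_COPY()                                              \
    std::clog << "Warning: cross-queue copy goes through host memory\n"
#else
#define GKO_WARN_SLOW_COPY()
#endif
```

In release builds (NDEBUG defined) or with the verbose level at 0, the macro has no runtime cost at all.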

Member

@tcojean tcojean Jul 15, 2021

To make it work as you say, we would need a global context that is used by all executors? Or can you have one context per device and use the correct context so that the copy works?

Member Author

I think it should be one context per platform,
but I need to check the level_zero/opencl context behavior.
Two queues with the same context can communicate.

@yhmtsai yhmtsai added 1:ST:ready-to-merge This PR is ready to merge. and removed 1:ST:ready-for-review This PR is ready for review labels Jul 15, 2021
@yhmtsai
Member Author

yhmtsai commented Jul 15, 2021

rebase!

@sonarcloud

sonarcloud bot commented Jul 16, 2021

Kudos, SonarCloud Quality Gate passed!

Bug: A (0 Bugs)
Vulnerability: A (0 Vulnerabilities)
Security Hotspot: A (0 Security Hotspots)
Code Smell: A (1 Code Smell)

No Coverage information
0.0% Duplication

@yhmtsai yhmtsai merged commit 8b49b65 into develop Jul 16, 2021
@yhmtsai yhmtsai deleted the fix_dpcpp_cp branch July 16, 2021 10:47
tcojean added a commit that referenced this pull request Aug 20, 2021
Ginkgo release 1.4.0

The Ginkgo team is proud to announce the new Ginkgo minor release 1.4.0. This
release brings most of the Ginkgo functionality to the Intel DPC++ ecosystem
which enables Intel-GPU and CPU execution. The only Ginkgo features which have
not been ported yet are some preconditioners.

Ginkgo's mixed-precision support is greatly enhanced thanks to:
1. The new Accessor concept, which allows writing kernels featuring on-the-fly
memory compression, among other features. The accessor can be used as
header-only, see the [accessor BLAS benchmarks repository](https://github.com/ginkgo-project/accessor-BLAS/tree/develop) as a usage example.
2. All LinOps now transparently support mixed-precision execution. By default,
this is done through a temporary copy which may have a performance impact but
already allows mixed-precision research.

Native mixed-precision ELL kernels are implemented which do not see this cost.
The accessor is also leveraged in a new CB-GMRES solver which allows for
performance improvements by compressing the Krylov basis vectors. Many other
features have been added to Ginkgo, such as reordering support, a new IDR
solver, Incomplete Cholesky preconditioner, matrix assembly support (only CPU
for now), machine topology information, and more!

Supported systems and requirements:
+ For all platforms, cmake 3.13+
+ C++14 compliant compiler
+ Linux and MacOS
  + gcc: 5.3+, 6.3+, 7.3+, all versions after 8.1+
  + clang: 3.9+
  + Intel compiler: 2018+
  + Apple LLVM: 8.0+
  + CUDA module: CUDA 9.0+
  + HIP module: ROCm 3.5+
  + DPC++ module: Intel OneAPI 2021.3. Set the CXX compiler to `dpcpp`.
+ Windows
  + MinGW and Cygwin: gcc 5.3+, 6.3+, 7.3+, all versions after 8.1+
  + Microsoft Visual Studio: VS 2019
  + CUDA module: CUDA 9.0+, Microsoft Visual Studio
  + OpenMP module: MinGW or Cygwin.


Algorithm and important feature additions:
+ Add a new DPC++ Executor for SYCL execution and other base utilities
  [#648](#648), [#661](#661), [#757](#757), [#832](#832)
+ Port matrix formats, solvers and related kernels to DPC++. For some kernels,
  also make use of a shared kernel implementation for all executors (except
  Reference). [#710](#710), [#799](#799), [#779](#779), [#733](#733), [#844](#844), [#843](#843), [#789](#789), [#845](#845), [#849](#849), [#855](#855), [#856](#856)
+ Add accessors which allow multi-precision kernels, among other things.
  [#643](#643), [#708](#708)
+ Add support for mixed precision operations through apply in all LinOps. [#677](#677)
+ Add incomplete Cholesky factorizations and preconditioners as well as some
  improvements to ILU. [#672](#672), [#837](#837), [#846](#846)
+ Add an AMGX implementation and kernels on all devices but DPC++.
  [#528](#528), [#695](#695), [#860](#860)
+ Add a new mixed-precision capability solver, Compressed Basis GMRES
  (CB-GMRES). [#693](#693), [#763](#763)
+ Add the IDR(s) solver. [#620](#620)
+ Add a new fixed-size block CSR matrix format (for the Reference executor).
  [#671](#671), [#730](#730)
+ Add native mixed-precision support to the ELL format. [#717](#717), [#780](#780)
+ Add Reverse Cuthill-McKee reordering [#500](#500), [#649](#649)
+ Add matrix assembly support on CPUs. [#644](#644)
+ Extend ISAI from triangular to general and spd matrices. [#690](#690)

Other additions:
+ Add the possibility to apply real matrices to complex vectors.
  [#655](#655), [#658](#658)
+ Add functions to compute the absolute of a matrix format. [#636](#636)
+ Add symmetric permutation and improve existing permutations.
  [#684](#684), [#657](#657), [#663](#663)
+ Add a MachineTopology class with HWLOC support [#554](#554), [#697](#697)
+ Add an implicit residual norm criterion. [#702](#702), [#818](#818), [#850](#850)
+ Row-major accessor is generalized to more than 2 dimensions and a new
  "block column-major" accessor has been added. [#707](#707)
+ Add a heat equation example. [#698](#698), [#706](#706)
+ Add ccache support in CMake and CI. [#725](#725), [#739](#739)
+ Allow tuning and benchmarking variables non intrusively. [#692](#692)
+ Add triangular solver benchmark [#664](#664)
+ Add benchmarks for BLAS operations [#772](#772), [#829](#829)
+ Add support for different precisions and consistent index types in benchmarks.
  [#675](#675), [#828](#828)
+ Add a Github bot system to facilitate development and PR management.
  [#667](#667), [#674](#674), [#689](#689), [#853](#853)
+ Add Intel (DPC++) CI support and enable CI on HPC systems. [#736](#736), [#751](#751), [#781](#781)
+ Add ssh debugging for Github Actions CI. [#749](#749)
+ Add pipeline segmentation for better CI speed. [#737](#737)


Changes:
+ Add a Scalar Jacobi specialization and kernels. [#808](#808), [#834](#834), [#854](#854)
+ Add implicit residual log for solvers and benchmarks. [#714](#714)
+ Change handling of the conjugate in the dense dot product. [#755](#755)
+ Improved Dense stride handling. [#774](#774)
+ Multiple improvements to the OpenMP kernels performance, including COO,
an exclusive prefix sum, and more. [#703](#703), [#765](#765), [#740](#740)
+ Allow specialization of submatrix and other dense creation functions in solvers. [#718](#718)
+ Improved Identity constructor and treatment of rectangular matrices. [#646](#646)
+ Allow CUDA/HIP executors to select allocation mode. [#758](#758)
+ Check if executors share the same memory. [#670](#670)
+ Improve test install and smoke testing support. [#721](#721)
+ Update the JOSS paper citation and add publications in the documentation.
  [#629](#629), [#724](#724)
+ Improve the version output. [#806](#806)
+ Add some utilities for dim and span. [#821](#821)
+ Improved solver and preconditioner benchmarks. [#660](#660)
+ Improve benchmark timing and output. [#669](#669), [#791](#791), [#801](#801), [#812](#812)


Fixes:
+ Sorting fix for the Jacobi preconditioner. [#659](#659)
+ Also log the first residual norm in CGS [#735](#735)
+ Fix BiCG and HIP CSR to work with complex matrices. [#651](#651)
+ Fix Coo SpMV on strided vectors. [#807](#807)
+ Fix segfault of extract_diagonal, add short-and-fat test. [#769](#769)
+ Fix device_reset issue by moving counter/mutex to device. [#810](#810)
+ Fix `EnableLogging` superclass. [#841](#841)
+ Support ROCm 4.1.x and breaking HIP_PLATFORM changes. [#726](#726)
+ Decreased test size for a few device tests. [#742](#742)
+ Fix multiple issues with our CMake HIP and RPATH setup.
  [#712](#712), [#745](#745), [#709](#709)
+ Cleanup our CMake installation step. [#713](#713)
+ Various simplification and fixes to the Windows CMake setup. [#720](#720), [#785](#785)
+ Simplify third-party integration. [#786](#786)
+ Improve Ginkgo device arch flags management. [#696](#696)
+ Other fixes and improvements to the CMake setup.
  [#685](#685), [#792](#792), [#705](#705), [#836](#836)
+ Clarification of dense norm documentation [#784](#784)
+ Various development tools fixes and improvements [#738](#738), [#830](#830), [#840](#840)
+ Make multiple operators/constructors explicit. [#650](#650), [#761](#761)
+ Fix some issues, memory leaks and warnings found by MSVC.
  [#666](#666), [#731](#731)
+ Improved solver memory estimates and consistent iteration counts [#691](#691)
+ Various logger improvements and fixes [#728](#728), [#743](#743), [#754](#754)
+ Fix for ForwardIterator requirements in iterator_factory. [#665](#665)
+ Various benchmark fixes. [#647](#647), [#673](#673), [#722](#722)
+ Various CI fixes and improvements. [#642](#642), [#641](#641), [#795](#795), [#783](#783), [#793](#793), [#852](#852)


Related PR: #857
tcojean added a commit that referenced this pull request Aug 23, 2021
Release 1.4.0 to master

(release notes identical to the Ginkgo release 1.4.0 commit above)
Related PR: #866