
Fix dpcpp memory issue and capture exception in raw_free #832

Merged

5 commits merged into develop on Jul 16, 2021

Conversation

yhmtsai
Member

@yhmtsai yhmtsai commented Jul 12, 2021

This PR restricts direct memcpy to within the same backend and device, adds a test that frees memory after a kernel launch, and captures exceptions in raw_free.

The memcpy issue between backends on the same device is shown in https://gitlab.com/ginkgo-project/ginkgo-public-ci/-/jobs/1413646244
The raw_free issue is shown in https://gitlab.com/ginkgo-project/ginkgo-public-ci/-/jobs/1415200944

codeplay_host_task does not work as expected yet; it needs further investigation and the new API.
Thus, we keep the workaround to ensure the free happens after the kernel.

@yhmtsai yhmtsai added the 1:ST:ready-for-review This PR is ready for review label Jul 12, 2021
@yhmtsai yhmtsai added this to the Ginkgo 1.4.0 milestone Jul 12, 2021
@yhmtsai yhmtsai requested a review from a team July 12, 2021 08:59
@yhmtsai yhmtsai self-assigned this Jul 12, 2021
@ginkgo-bot ginkgo-bot added mod:core This is related to the core module. mod:dpcpp This is related to the DPC++ module. reg:ci-cd This is related to the continuous integration system. reg:testing This is related to testing. labels Jul 12, 2021
Copy link
Member

@upsj upsj left a comment


LGTM!

if (err_code == 0) {
err_code = 1;
}
std::exit(err_code);
Member

@upsj upsj Jul 12, 2021

We need to think about other ways to approach this at some point, since killing the process may not be desirable for application users.
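One possible direction (a sketch only, not Ginkgo's actual API): instead of calling std::exit in the asynchronous error handler, collect the exceptions and let the application query them. The class below mirrors the shape of a SYCL async_handler (which receives a list of exceptions) using only the standard library; `error_collector` is an illustrative name.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical alternative to exiting: collect asynchronous errors so the
// application can decide how to react. Mirrors the shape of a SYCL
// async_handler, but uses only the standard library.
class error_collector {
public:
    void operator()(const std::vector<std::exception_ptr>& exceptions)
    {
        for (const auto& e : exceptions) {
            try {
                std::rethrow_exception(e);
            } catch (const std::exception& ex) {
                messages_.push_back(ex.what());
            }
        }
    }

    const std::vector<std::string>& messages() const { return messages_; }

private:
    std::vector<std::string> messages_;  // recorded error descriptions
};
```

The application could then inspect `messages()` after synchronization and decide whether to abort, retry, or report.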

Member Author

Probably. I am not sure whether applications can tolerate this error situation or not.

});
}
// to ensure everything on queue is finished.
dpcpp->synchronize();
Member

Is this necessary? The case we want to catch is the one without synchronization, isn't it?

Member Author

Honestly, I am not sure. I do not know whether the queue waits for all kernels to finish when it is destroyed.
If it behaves like CUDA, it just terminates. It is also possible that we only get allocation and free but no execution.
I use curly brackets to scope the Array allocation, to make sure the free is called before the synchronize.
Yes, it catches the case where the free does not happen after the kernel finishes.

Member

When the array gets freed, we wait on the queue; isn't that sufficient?

Member Author

Yes, that is the fixed version. When the free does not contain a synchronization, we will probably not wait for the kernel.

Member

Can you remove this line to make sure the program can actually fail due to use-after-free issues?

Member Author

It is shown in https://gitlab.com/ginkgo-project/ginkgo-public-ci/-/jobs/1415200944
and the code containing the sync is https://gitlab.com/ginkgo-project/ginkgo-public-ci/-/commit/f9da28bc5b8f6847b4d587b0283b2cc7f7e2f60a
By "fixed version" I mean that we wait on the queue when we call free.
(the following description is based on CUDA behavior)
Without a sync before the program ends:

  1. when free does not wait:
    submit job (not run) -> free -> program end (the job may never run, so we may not hit the use-after-free)
  2. when free does wait (fixed version):
    submit job (not run) -> run job (because free waits) -> free -> program end
    this case does not need a sync at the end
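The two orderings above can be mimicked with a toy in-order queue that only runs submitted jobs when wait() is called (plain C++, no SYCL; `toy_queue` and `safe_free` are illustrative names, not Ginkgo code):

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <string>
#include <vector>

// Toy in-order queue: submitted jobs only run when wait() is called,
// mimicking the deferred execution described above.
struct toy_queue {
    std::queue<std::function<void()>> jobs;
    void submit(std::function<void()> job) { jobs.push(std::move(job)); }
    void wait()
    {
        while (!jobs.empty()) {
            jobs.front()();
            jobs.pop();
        }
    }
};

// The "fixed version": free waits on the queue first, so any pending
// kernel runs before the memory is released.
void safe_free(toy_queue& q, std::vector<int>*& data,
               std::vector<std::string>& log)
{
    q.wait();  // flush pending kernels before releasing memory
    delete data;
    data = nullptr;
    log.push_back("free");
}
```

Submitting a "kernel" that touches the data and then calling `safe_free` yields the order kernel -> free; dropping the internal `q.wait()` would reproduce ordering 1, where the free can overtake the kernel.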

Member

I see, thanks! I completely missed that the arrays have their own scope, so this is fine by me.

@codecov

codecov bot commented Jul 12, 2021

Codecov Report

Merging #832 (883dd5c) into develop (26ee87e) will decrease coverage by 0.04%.
The diff coverage is 0.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #832      +/-   ##
===========================================
- Coverage    94.29%   94.25%   -0.05%     
===========================================
  Files          408      408              
  Lines        32742    32756      +14     
===========================================
- Hits         30875    30874       -1     
- Misses        1867     1882      +15     
Impacted Files Coverage Δ
core/test/base/executor.cpp 87.60% <0.00%> (-4.58%) ⬇️
include/ginkgo/core/base/executor.hpp 79.47% <0.00%> (-0.85%) ⬇️
include/ginkgo/core/base/math.hpp 100.00% <ø> (ø)
core/base/extended_float.hpp 91.26% <0.00%> (-0.98%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 26ee87e...883dd5c. Read the comment docs.

@yhmtsai
Member Author

yhmtsai commented Jul 12, 2021

After more testing, different device combinations give different results:

  1. opencl:gpu vs opencl:cpu -> the opencl:gpu queue can access the opencl:cpu memory, but not the reverse (opencl:cpu behaves like normal host memory)
  2. opencl:gpu vs opencl:gpu -> no matter whether the two queues map to the same device, they cannot transfer data to each other
  3. level_zero:gpu vs opencl:cpu -> similar to 1.
  4. level_zero:gpu vs level_zero:gpu -> if they are on the same device, they can transfer memory to each other and use each other's memory
  5. opencl:cpu vs opencl:cpu -> can use memory from a different queue and transfer memory

The host option does not work in filter_selector (ref), so I need to check it.

I only found one way to get the backend: using dynamic_cast (with the backend type) on the type provided by level_zero.
A queue also has get_backend(), which returns an enum value.
Alternatively, we could always route data through host memory when the queues are different (to handle the OpenCL cases).

I put the code here so we can use it to check.
Compile: dpcpp <file>
Run: ./a.out <first_device_filter> <second_device_filter>

#include <CL/sycl.hpp>
#include <iostream>

int main(int argc, char *argv[])
{
    if (argc != 3) {
        std::cout << "Usage: " << argv[0] << " first_filter second_filter"
                  << std::endl;
        std::exit(1);
    }
    auto first_filter = sycl::ONEAPI::filter_selector(argv[1]);
    auto second_filter = sycl::ONEAPI::filter_selector(argv[2]);
    auto *first_queue =
        new sycl::queue{first_filter, sycl::property::queue::in_order{}};
    auto *second_queue =
        new sycl::queue{second_filter, sycl::property::queue::in_order{}};
    std::cout << "First queue device name: "
              << first_queue->get_device().get_info<sycl::info::device::name>()
              << std::endl;
    std::cout << "Second queue device name: "
              << second_queue->get_device().get_info<sycl::info::device::name>()
              << std::endl;
    const int n = 32;

    float data[n];
    for (int i = 0; i < n; i++) {
        data[i] = static_cast<float>(i) / n;
    }
    float *first_data = sycl::malloc_device<float>(n, *first_queue);
    float *first_data2 = sycl::malloc_device<float>(n, *first_queue);
    float *second_data = sycl::malloc_device<float>(n, *second_queue);
    float *second_data2 = sycl::malloc_device<float>(n, *second_queue);
    std::cout << "First queue test:" << std::endl;
    std::cout << "From Host to Device" << std::endl;
    first_queue->memcpy(first_data, data, n * sizeof(float)).wait();
    std::cout << "From Device to Device (self-queue)" << std::endl;
    first_queue->memcpy(first_data2, first_data, n * sizeof(float)).wait();
    std::cout << "From Device to Host" << std::endl;
    first_queue->memcpy(data, first_data2, n * sizeof(float)).wait();
    first_queue->wait_and_throw();
    std::cout << "Second queue test:" << std::endl;
    std::cout << "From Host to Device" << std::endl;
    second_queue->memcpy(second_data, data, n * sizeof(float)).wait();
    std::cout << "From Device to Device (self-queue)" << std::endl;
    second_queue->memcpy(second_data2, second_data, n * sizeof(float)).wait();
    std::cout << "From Device to Host" << std::endl;
    second_queue->memcpy(data, second_data2, n * sizeof(float)).wait();
    second_queue->wait_and_throw();
    bool same_device =
        (first_queue->get_device() == second_queue->get_device());
    bool same_context =
        (first_queue->get_context() == second_queue->get_context());
    std::cout << "device check: " << same_device << std::endl;
    std::cout << "context check: " << same_context << std::endl;
    if (same_device) {
        std::cout << "Access the data from different queue on the same device"
                  << std::endl;
        std::cout << "Access the second queue data from first queue"
                  << std::endl;
        first_queue->memcpy(second_data2, second_data, n * sizeof(float))
            .wait();
        std::cout << "Access the first queue data from second queue"
                  << std::endl;
        second_queue->memcpy(first_data2, first_data, n * sizeof(float)).wait();
    }
    std::cout << "Communication between first/second queue:" << std::endl;
    std::cout << "From second to first (handled by first queue)" << std::endl;
    first_queue->memcpy(first_data, second_data, n * sizeof(float)).wait();
    std::cout << "From first to second (handled by second queue)" << std::endl;
    second_queue->memcpy(second_data, first_data, n * sizeof(float)).wait();
    first_queue->wait_and_throw();
    second_queue->wait_and_throw();
    sycl::free(first_data, first_queue->get_context());
    sycl::free(first_data2, first_queue->get_context());
    sycl::free(second_data, second_queue->get_context());
    sycl::free(second_data2, second_queue->get_context());
    return 0;
}

update the accessibility check

@upsj
Member

upsj commented Jul 12, 2021

A round-trip via host memory sounds sensible; we do the same thing for ROCm <-> CUDA copies.
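The round-trip can be sketched in plain C++ (std::memcpy stands in for the queue->memcpy device<->host transfers; this illustrates the idea under that assumption, it is not the Ginkgo implementation):

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Copy between two allocations that cannot address each other directly
// (as in the OpenCL cases above) by staging the data in host memory.
// std::memcpy stands in for the device<->host queue->memcpy calls.
void copy_via_host(float* dst, const float* src, std::size_t n)
{
    std::vector<float> staging(n);                        // host buffer
    std::memcpy(staging.data(), src, n * sizeof(float));  // device -> host
    std::memcpy(dst, staging.data(), n * sizeof(float));  // host -> device
}
```

The cost is one extra host allocation and two transfers instead of one, which is why a debug-level "this copy is not optimized" message was suggested.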

@yhmtsai
Member Author

yhmtsai commented Jul 12, 2021

I forgot to update the level_zero information.
The level_zero backend allows transferring data on the same device, but opencl does not.
get_device() returns the same device, but get_context() differs.

Member

@tcojean tcojean left a comment

LGTM. Some small comments.

dest->get_queue()->memcpy(dest_ptr, src_ptr, num_bytes).wait();
} else {
// memcpy only supports host<->device copies or copies within the same executor
GKO_NOT_SUPPORTED(dest);
Member

I think we should do a copy through the host here like in the ROCM <-> NVIDIA cases. We should also add a debug level message to state that this might not be optimized. I expect that later direct GPU copies will be supported?

Member Author

Should we add the message?
For direct GPU copies, it seems the context takes care of (identifies) the memory.
If I pass the context to the other queue, they can communicate without a segmentation fault.
But I can also pass the opencl context to a level_zero queue and then they communicate without a segfault, which is weird.
A context should live under a platform, and opencl and level zero should be different platforms.

Member Author

I use #if (GKO_VERBOSE_LEVEL >= 1) && !defined(NDEBUG) to enable the output to clog
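A minimal sketch of such a guard (only the preprocessor condition comes from the comment above; GKO_WARN_SLOW_COPY is a hypothetical macro name and the message text is illustrative):

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>

#ifndef GKO_VERBOSE_LEVEL
#define GKO_VERBOSE_LEVEL 1  // illustrative default for this sketch
#endif

// The warning is compiled in only for verbose builds without NDEBUG,
// matching the condition quoted above; otherwise it expands to nothing.
#if (GKO_VERBOSE_LEVEL >= 1) && !defined(NDEBUG)
#define GKO_WARN_SLOW_COPY()                                              \
    std::clog << "Warning: cross-queue copy goes through host memory\n"
#else
#define GKO_WARN_SLOW_COPY()
#endif
```

In release builds (NDEBUG defined) or with the verbose level at 0, the macro has no runtime cost at all.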

Member

@tcojean tcojean Jul 15, 2021

To make it work as you say, we would need a global context that is used by all executors? Or can you have one context per device and use the correct context so that the copy works?

Member Author

I think it should be one context per platform,
but I need to check the level_zero/opencl context behavior.
Two queues with the same context can communicate.

@yhmtsai yhmtsai added 1:ST:ready-to-merge This PR is ready to merge. and removed 1:ST:ready-for-review This PR is ready for review labels Jul 15, 2021
@yhmtsai
Member Author

yhmtsai commented Jul 15, 2021

rebase!

@sonarcloud

sonarcloud bot commented Jul 16, 2021

Kudos, SonarCloud Quality Gate passed!

Bug: A (0 Bugs)
Vulnerability: A (0 Vulnerabilities)
Security Hotspot: A (0 Security Hotspots)
Code Smell: A (1 Code Smell)

No Coverage information
0.0% Duplication

@yhmtsai yhmtsai merged commit 8b49b65 into develop Jul 16, 2021
@yhmtsai yhmtsai deleted the fix_dpcpp_cp branch July 16, 2021 10:47
tcojean added a commit that referenced this pull request Aug 20, 2021
Ginkgo release 1.4.0

The Ginkgo team is proud to announce the new Ginkgo minor release 1.4.0. This
release brings most of the Ginkgo functionality to the Intel DPC++ ecosystem
which enables Intel-GPU and CPU execution. The only Ginkgo features which have
not been ported yet are some preconditioners.

Ginkgo's mixed-precision support is greatly enhanced thanks to:
1. The new Accessor concept, which allows writing kernels featuring on-the-fly
memory compression, among other features. The accessor can be used as
header-only, see the [accessor BLAS benchmarks repository](https://github.com/ginkgo-project/accessor-BLAS/tree/develop) as a usage example.
2. All LinOps now transparently support mixed-precision execution. By default,
this is done through a temporary copy which may have a performance impact but
already allows mixed-precision research.

Native mixed-precision ELL kernels are implemented which do not see this cost.
The accessor is also leveraged in a new CB-GMRES solver which allows for
performance improvements by compressing the Krylov basis vectors. Many other
features have been added to Ginkgo, such as reordering support, a new IDR
solver, Incomplete Cholesky preconditioner, matrix assembly support (only CPU
for now), machine topology information, and more!

Supported systems and requirements:
+ For all platforms, cmake 3.13+
+ C++14 compliant compiler
+ Linux and MacOS
  + gcc: 5.3+, 6.3+, 7.3+, all versions after 8.1+
  + clang: 3.9+
  + Intel compiler: 2018+
  + Apple LLVM: 8.0+
  + CUDA module: CUDA 9.0+
  + HIP module: ROCm 3.5+
  + DPC++ module: Intel OneAPI 2021.3. Set the CXX compiler to `dpcpp`.
+ Windows
  + MinGW and Cygwin: gcc 5.3+, 6.3+, 7.3+, all versions after 8.1+
  + Microsoft Visual Studio: VS 2019
  + CUDA module: CUDA 9.0+, Microsoft Visual Studio
  + OpenMP module: MinGW or Cygwin.


Algorithm and important feature additions:
+ Add a new DPC++ Executor for SYCL execution and other base utilities
  [#648](#648), [#661](#661), [#757](#757), [#832](#832)
+ Port matrix formats, solvers and related kernels to DPC++. For some kernels,
  also make use of a shared kernel implementation for all executors (except
  Reference). [#710](#710), [#799](#799), [#779](#779), [#733](#733), [#844](#844), [#843](#843), [#789](#789), [#845](#845), [#849](#849), [#855](#855), [#856](#856)
+ Add accessors which allow multi-precision kernels, among other things.
  [#643](#643), [#708](#708)
+ Add support for mixed precision operations through apply in all LinOps. [#677](#677)
+ Add incomplete Cholesky factorizations and preconditioners as well as some
  improvements to ILU. [#672](#672), [#837](#837), [#846](#846)
+ Add an AMGX implementation and kernels on all devices but DPC++.
  [#528](#528), [#695](#695), [#860](#860)
+ Add a new mixed-precision capability solver, Compressed Basis GMRES
  (CB-GMRES). [#693](#693), [#763](#763)
+ Add the IDR(s) solver. [#620](#620)
+ Add a new fixed-size block CSR matrix format (for the Reference executor).
  [#671](#671), [#730](#730)
+ Add native mixed-precision support to the ELL format. [#717](#717), [#780](#780)
+ Add Reverse Cuthill-McKee reordering [#500](#500), [#649](#649)
+ Add matrix assembly support on CPUs. [#644](#644)
+ Extend ISAI from triangular to general and spd matrices. [#690](#690)

Other additions:
+ Add the possibility to apply real matrices to complex vectors.
  [#655](#655), [#658](#658)
+ Add functions to compute the absolute of a matrix format. [#636](#636)
+ Add symmetric permutation and improve existing permutations.
  [#684](#684), [#657](#657), [#663](#663)
+ Add a MachineTopology class with HWLOC support [#554](#554), [#697](#697)
+ Add an implicit residual norm criterion. [#702](#702), [#818](#818), [#850](#850)
+ Row-major accessor is generalized to more than 2 dimensions and a new
  "block column-major" accessor has been added. [#707](#707)
+ Add a heat equation example. [#698](#698), [#706](#706)
+ Add ccache support in CMake and CI. [#725](#725), [#739](#739)
+ Allow tuning and benchmarking variables non intrusively. [#692](#692)
+ Add triangular solver benchmark [#664](#664)
+ Add benchmarks for BLAS operations [#772](#772), [#829](#829)
+ Add support for different precisions and consistent index types in benchmarks.
  [#675](#675), [#828](#828)
+ Add a Github bot system to facilitate development and PR management.
  [#667](#667), [#674](#674), [#689](#689), [#853](#853)
+ Add Intel (DPC++) CI support and enable CI on HPC systems. [#736](#736), [#751](#751), [#781](#781)
+ Add ssh debugging for Github Actions CI. [#749](#749)
+ Add pipeline segmentation for better CI speed. [#737](#737)


Changes:
+ Add a Scalar Jacobi specialization and kernels. [#808](#808), [#834](#834), [#854](#854)
+ Add implicit residual log for solvers and benchmarks. [#714](#714)
+ Change handling of the conjugate in the dense dot product. [#755](#755)
+ Improved Dense stride handling. [#774](#774)
+ Multiple improvements to the OpenMP kernels performance, including COO,
an exclusive prefix sum, and more. [#703](#703), [#765](#765), [#740](#740)
+ Allow specialization of submatrix and other dense creation functions in solvers. [#718](#718)
+ Improved Identity constructor and treatment of rectangular matrices. [#646](#646)
+ Allow CUDA/HIP executors to select allocation mode. [#758](#758)
+ Check if executors share the same memory. [#670](#670)
+ Improve test install and smoke testing support. [#721](#721)
+ Update the JOSS paper citation and add publications in the documentation.
  [#629](#629), [#724](#724)
+ Improve the version output. [#806](#806)
+ Add some utilities for dim and span. [#821](#821)
+ Improved solver and preconditioner benchmarks. [#660](#660)
+ Improve benchmark timing and output. [#669](#669), [#791](#791), [#801](#801), [#812](#812)


Fixes:
+ Sorting fix for the Jacobi preconditioner. [#659](#659)
+ Also log the first residual norm in CGS [#735](#735)
+ Fix BiCG and HIP CSR to work with complex matrices. [#651](#651)
+ Fix Coo SpMV on strided vectors. [#807](#807)
+ Fix segfault of extract_diagonal, add short-and-fat test. [#769](#769)
+ Fix device_reset issue by moving counter/mutex to device. [#810](#810)
+ Fix `EnableLogging` superclass. [#841](#841)
+ Support ROCm 4.1.x and breaking HIP_PLATFORM changes. [#726](#726)
+ Decreased test size for a few device tests. [#742](#742)
+ Fix multiple issues with our CMake HIP and RPATH setup.
  [#712](#712), [#745](#745), [#709](#709)
+ Cleanup our CMake installation step. [#713](#713)
+ Various simplification and fixes to the Windows CMake setup. [#720](#720), [#785](#785)
+ Simplify third-party integration. [#786](#786)
+ Improve Ginkgo device arch flags management. [#696](#696)
+ Other fixes and improvements to the CMake setup.
  [#685](#685), [#792](#792), [#705](#705), [#836](#836)
+ Clarification of dense norm documentation [#784](#784)
+ Various development tools fixes and improvements [#738](#738), [#830](#830), [#840](#840)
+ Make multiple operators/constructors explicit. [#650](#650), [#761](#761)
+ Fix some issues, memory leaks and warnings found by MSVC.
  [#666](#666), [#731](#731)
+ Improved solver memory estimates and consistent iteration counts [#691](#691)
+ Various logger improvements and fixes [#728](#728), [#743](#743), [#754](#754)
+ Fix for ForwardIterator requirements in iterator_factory. [#665](#665)
+ Various benchmark fixes. [#647](#647), [#673](#673), [#722](#722)
+ Various CI fixes and improvements. [#642](#642), [#641](#641), [#795](#795), [#783](#783), [#793](#793), [#852](#852)


Related PR: #857
tcojean added a commit that referenced this pull request Aug 23, 2021
Release 1.4.0 to master

(release notes identical to the Ginkgo release 1.4.0 commit above)
Related PR: #866