Race condition in CUDA blocking queue #850
This is somewhat related to #768, #783 and #791, where we added support for using CPU queues from multiple host threads.
What we implemented for the blocking CPU queue is the usage of a …
The blocking GPU queue does not even achieve this "only one task in the queue is executed at any time" property when called from multiple threads, because this was not a requirement up until now. If you enqueue two memcpys from two host threads, they are executed in parallel. Also the …
To fix the issue that … There would still be an issue that multiple tasks that are enqueued in parallel into a blocking GPU queue may overlap their calls to … being received by the CUDA queue. This would be a pessimization, because host thread 1 would wait for the tasks enqueued by both host thread 1 and host thread 2. To prevent this we would have to guard the access to the internal CUDA queue with host-side synchronization (…). This would on the other hand introduce the issue that …
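A minimal sketch of what such host-side guarding could look like, written in plain CUDA; the struct, the member names and the use of `std::mutex` are illustrative assumptions, not the actual alpaka implementation:

```cpp
#include <cstddef>
#include <mutex>
#include <cuda_runtime.h>

// Hypothetical blocking-queue wrapper: a host-side mutex serializes the
// "enqueue + wait" sequence, so tasks coming from different host threads
// cannot interleave on the internal CUDA stream.
struct BlockingQueueSketch
{
    cudaStream_t stream{};
    std::mutex guard;

    void memcpyBlocking(void* dst, void const* src, std::size_t bytes)
    {
        std::lock_guard<std::mutex> lock(guard);
        cudaMemcpyAsync(dst, src, bytes, cudaMemcpyHostToDevice, stream);
        cudaStreamSynchronize(stream); // blocks only the calling host thread
    }
};
```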
Maybe I understood you wrong, but we need to fix it in 0.3.5. The example shows that we enqueue the kernel and the memcpy into a stream, but since we ignore the stream in the implementation of enqueue in the blocking queue we got the data race. For 0.4.0 I will provide a fix on Monday. To have the same behaviour on CPU and GPU we should always use the async API and wait after each memcopy.
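A sketch of that "async API plus explicit wait" pattern in plain CUDA, assuming a non-blocking stream as alpaka currently creates it; buffer names and sizes are placeholders:

```cpp
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

int main()
{
    const std::size_t size = 900 * 5 * 5;
    std::vector<double> host(size, 1.0);
    void* dev = nullptr;
    cudaStream_t stream;

    cudaSetDevice(0);
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
    cudaMalloc(&dev, size * sizeof(double));

    // asynchronous copy into the user-requested stream (instead of
    // cudaMemcpy, which would be ordered into the default stream) ...
    cudaMemcpyAsync(dev, host.data(), size * sizeof(double),
                    cudaMemcpyHostToDevice, stream);
    // ... followed by an explicit wait to keep the blocking queue semantics
    cudaStreamSynchronize(stream);

    cudaFree(dev);
    cudaStreamDestroy(stream);
    return 0;
}
```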
You can do this, but as you already wrote above, this will have an undesired side effect that could also be classified as a bug. This will fix a bug for some people but introduce a bug for others. Where exactly is the bug that you described? In your example you only have one stream and only one host CPU thread. In an abstract way, what I see is the following order of tasks called on a StreamCudaRtSync:
The …
You missed that cudaMemcpy does not use any stream and is therefore enqueued into the CUDA default stream instead of the user-requested stream/queue. Because of that, the memcpy is performed in parallel with (asynchronously to) the kernel.
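A condensed sketch of that ordering problem in plain CUDA; the kernel and variable names are made up for illustration:

```cpp
#include <cuda_runtime.h>

__global__ void writeOne(double* p) { p[0] = 1.0; }

int main()
{
    double hostValue = 2.0;
    double* dev = nullptr;
    cudaStream_t stream;

    cudaMalloc(&dev, sizeof(double));
    // non-blocking stream: no implicit synchronization with the legacy
    // default stream that cudaMemcpy is ordered into
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

    writeOne<<<1, 1, 0, stream>>>(dev);          // user-requested stream
    cudaMemcpy(dev, &hostValue, sizeof(double),  // default stream
               cudaMemcpyHostToDevice);
    // kernel and copy are unordered with respect to each other -> data race;
    // with a stream created via cudaStreamCreate() they would be serialized

    cudaDeviceSynchronize();
    cudaFree(dev);
    cudaStreamDestroy(stream);
    return 0;
}
```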
I looked again at my CUDA example above and realised that I copied the version where I used cudaMemcpyAsync instead of cudaMemcpy. Maybe that is why you cannot see the issue, sorry for that. I will fix it in a few seconds.
Hmm, now I am also confused. Normally cudaMemcpy must block the executor thread until the memcpy has finished. It looks like that is not the case here. I will have another look at this issue on Monday.
This is exactly what I am wondering as well. My understanding was that …
Maybe alpaka calls …
@krzikalla no, in that code snippet both CUDA and alpaka do not work due to a race condition (so it might sometimes run fine, but that is not guaranteed). And they should not, because of …
fix alpaka-group#850: By using `cudaStreamCreateWithFlags(..., cudaStreamNonBlocking)` we disable the blocking behavior of `cudaMemcpy` and other blocking CUDA API calls. By using `cudaStreamDefault` we can enforce the old `legacy` behavior.
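A sketch of what that stream-creation change amounts to; only the flag matters here, the surrounding code is illustrative:

```cpp
#include <cuda_runtime.h>

int main()
{
    cudaStream_t stream;
    // cudaStreamDefault (the same as plain cudaStreamCreate(&stream)) keeps
    // the implicit synchronization with the legacy default stream, so
    // blocking calls such as cudaMemcpy are ordered after work that was
    // already enqueued into `stream`
    cudaStreamCreateWithFlags(&stream, cudaStreamDefault);
    cudaStreamDestroy(stream);
    return 0;
}
```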
What is the plan for this issue for the 0.4.0 release?
My plan is to switch fully to the async API to have the same behavior for all backends.
Test code for the upcoming alpaka 0.4.0 (added all renamings):

```cpp
#include <stdio.h>
#include <cassert>   // assert() used in the kernels
#include <cstdlib>   // exit()
#include <utility>   // std::declval
#include <vector>    // std::vector
#include "alpaka/alpaka.hpp"
//#define NATIVE_CUDA 0
#ifdef NATIVE_CUDA
#define CUDA_ASSERT(x) if ((x) != cudaSuccess) { printf("cuda fail on line %d", __LINE__); exit(1); }
__global__ void emptyCudaKernel(int threadElementExtent)
{
assert(threadElementExtent == 1);
}
__global__ void myCudaKernel(const double* sourceData)
{
int i = blockIdx.x*blockDim.x + threadIdx.x;
constexpr size_t size = 900 * 5 * 5;
// note: here we are supposed to check that i is in the array range
// but this is not what is causing the issue
if (i < size && sourceData[i] != 1.0)
printf("%u %u %u: %f\n",blockDim.x, blockIdx.x, threadIdx.x, sourceData[i]);
}
int main(int argc, char* argv[])
{
cudaStream_t cudaStream;
void * memPtr;
const size_t size = 900 * 5 * 5;
CUDA_ASSERT(cudaSetDevice(0));
CUDA_ASSERT(cudaStreamCreateWithFlags(&cudaStream, cudaStreamNonBlocking));
CUDA_ASSERT(cudaMalloc(&memPtr, size * sizeof(double)));
// note: here we assume size is not a multiple of 64, this is unrelated to the issue
dim3 gridDim(std::size_t(((size - 1) / 64) + 1), 1u, 1u);
dim3 blockDim(64u, 1u, 1u);
emptyCudaKernel<<<gridDim, blockDim, 0, cudaStream>>>(argc);
CUDA_ASSERT(cudaStreamSynchronize(cudaStream));
std::vector<double> sourceMemHost(size, 1.0);
CUDA_ASSERT(cudaMemcpy(memPtr, sourceMemHost.data(), size * sizeof(double), cudaMemcpyHostToDevice));
cudaStreamSynchronize(cudaStream);
myCudaKernel<<<gridDim, blockDim, 0, cudaStream>>>((double*)memPtr);
CUDA_ASSERT(cudaStreamSynchronize(cudaStream));
CUDA_ASSERT(cudaStreamDestroy(cudaStream));
return 0;
}
#else
struct MyKernel
{
template<typename Acc>
ALPAKA_FN_ACC void operator()(Acc const & acc, const double* sourceData) const
{
constexpr size_t size = 900 * 5 * 5;
int i = alpaka::idx::getIdx<alpaka::Grid, alpaka::Threads>(acc)[0u];
// note (same as for CUDA): here we are supposed to check that i is in the array range
// but this is not what is causing the issue
if(i < size && sourceData[i] != 1.0)
printf("%u %u %u %lu\n",blockDim.x, blockIdx.x, threadIdx.x, sourceData[i]);
}
};
struct EmptyKernel
{
template<typename Acc>
ALPAKA_FN_ACC void operator()(Acc const & acc, int threadElementExtent) const
{
assert(threadElementExtent == 1);
}
};
int main(int argc, char* argv[])
{
const size_t size = 900 * 5 * 5;
using ComputeAccelerator = alpaka::acc::AccGpuCudaRt<alpaka::dim::DimInt<1>, std::size_t>;
using ComputeDevice = alpaka::dev::Dev<ComputeAccelerator>;
using ComputeStream = alpaka::queue::QueueCudaRtBlocking;
ComputeDevice computeDevice(alpaka::pltf::getDevByIdx<alpaka::pltf::Pltf<ComputeDevice> >(0));
ComputeStream computeStream (computeDevice);
using V = alpaka::vec::Vec<alpaka::dim::DimInt<1>, std::size_t>;
using WorkDivision = alpaka::workdiv::WorkDivMembers<alpaka::dim::DimInt<1>, std::size_t>;
WorkDivision wd(V(std::size_t(((size - 1) / 64) + 1)), V(std::size_t(64)), V(std::size_t(1)));
using HostAccelerator = alpaka::acc::AccCpuOmp2Blocks<alpaka::dim::DimInt<1>, std::size_t>;
using HostDevice = alpaka::dev::Dev<HostAccelerator>;
alpaka::vec::Vec<alpaka::dim::DimInt<1>, size_t> bufferSize (size);
using HostBufferType = decltype(
alpaka::mem::buf::alloc<double, size_t>(std::declval<HostDevice>(), bufferSize));
using HostViewType = alpaka::mem::view::ViewPlainPtr<alpaka::dev::Dev<HostBufferType>,
alpaka::elem::Elem<HostBufferType>, alpaka::dim::Dim<HostBufferType>, alpaka::idx::Idx<HostBufferType> >;
HostDevice hostDevice(alpaka::pltf::getDevByIdx<alpaka::pltf::Pltf<HostDevice> >(0u));
auto sourceMem = alpaka::mem::buf::alloc<double, size_t>(computeDevice, size);
alpaka::queue::enqueue(computeStream, alpaka::kernel::createTaskKernel<ComputeAccelerator>(wd, EmptyKernel(), argc));
std::vector<double> sourceMemHost(size, 1.0);
HostViewType hostBufferView(sourceMemHost.data(), hostDevice, bufferSize);
alpaka::mem::view::copy(computeStream, sourceMem, hostBufferView, bufferSize);
alpaka::wait::wait(computeStream);
alpaka::queue::enqueue(computeStream,
alpaka::kernel::createTaskKernel<ComputeAccelerator>(wd, MyKernel(), alpaka::mem::view::getPtrNative(sourceMem)));
alpaka::wait::wait(computeStream);
return 0;
}
#endif
```
@krzikalla provided me with a test case where we have data races between a synchronous memcopy and a kernel.
The problem is that we create the streams with `cudaStreamCreateWithFlags(..., cudaStreamNonBlocking)`. The CUDA documentation says that `cudaStreamNonBlocking` disables the blocking behavior with respect to the default stream `0`. In alpaka we use blocking memcopy operations, e.g. `cudaMemcpy`, if we use the `QueueCudaRtBlocking`. The result is that our memcopy operations run non-blocking with respect to all enqueued kernels.
This bug is also present in the last release, 0.3.5. This means we need to do a bugfix release even though we will release 0.4.0 soon.
We have different possibilities to solve it, e.g. creating the streams with `cudaStreamCreate()` instead.