
cupy raw kernel cannot handle view of cupy ndarray #7267

Closed
kanglcn opened this issue Dec 28, 2022 · 5 comments

@kanglcn

kanglcn commented Dec 28, 2022

Description

When feeding a view of a CuPy ndarray into a raw kernel (for example, a slice of a big ndarray), the result looks like the kernel reads from the original big ndarray, not the slice.

To Reproduce

import cupy as cp
x = cp.arange(10, dtype=cp.complex64).reshape(2,5)

show = cp.RawKernel(r'''
#include <cuComplex.h>

extern "C" __global__
void show(const cuFloatComplex* x, const int N){
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  
  if(i == 0 ){
      printf("%f\n",cuCrealf(x[N]));
  }
}
''', 'show')

When calling the kernel:

show((2,), (5,), (x,cp.int32(6)))
cp.cuda.runtime.deviceSynchronize()

It will print:

6.000000

But if a slice of x is fed:

x_slice = x[:,:4]
show((2,), (5,), (x_slice,cp.int32(6)))
cp.cuda.runtime.deviceSynchronize()

It also prints:

6.000000

which is not what is wanted (element 6 of the flattened slice is 7+0j).

However, if a copy is fed:

x_slice = x[:,:4].copy()
show((2,), (5,), (x_slice,cp.int32(6)))
cp.cuda.runtime.deviceSynchronize()

It prints:

7.000000

as expected.

Installation

Conda-Forge (conda install ...)

Environment

OS                           : Linux-3.10.0-1127.18.2.el7.x86_64-x86_64-with-glibc2.17
Python Version               : 3.9.13
CuPy Version                 : 11.2.0
CuPy Platform                : NVIDIA CUDA
NumPy Version                : 1.23.4
SciPy Version                : 1.9.3
Cython Build Version         : 0.29.32
Cython Runtime Version       : None
CUDA Root                    : /users/kangl/miniconda3/envs/rapids-22.10
nvcc PATH                    : /users/kangl/miniconda3/envs/rapids-22.10/bin/nvcc
CUDA Build Version           : 11020
CUDA Driver Version          : 11040
CUDA Runtime Version         : 11070
cuBLAS Version               : (available)
cuFFT Version                : 10702
cuRAND Version               : 10301
cuSOLVER Version             : (11, 4, 0)
cuSPARSE Version             : (available)
NVRTC Version                : (11, 7)
Thrust Version               : 101000
CUB Build Version            : 101000
Jitify Build Version         : 343be31
cuDNN Build Version          : None
cuDNN Version                : None
NCCL Build Version           : 21403
NCCL Runtime Version         : 21403
cuTENSOR Version             : None
cuSPARSELt Build Version     : None
Device 0 Name                : Tesla V100-SXM2-32GB
Device 0 Compute Capability  : 70
Device 0 PCI Bus ID          : 0000:15:00.0

Additional Information

No response

@kanglcn kanglcn added the cat:bug Bugs label Dec 28, 2022
@leofang
Member

leofang commented Dec 29, 2022

This is expected. When you only pass a raw pointer to a kernel, it's assumed that the array is contiguous in memory. If you want to handle views, you should also pass the array shape/strides and distribute the workload to each thread yourself. CuPy has the machinery for doing so, but I am not sure if it is considered public API. If it's a simple elementwise operation, it's better to just use ElementwiseKernel instead.

@kanglcn
Author

kanglcn commented Dec 30, 2022

Thank you @leofang for your comments! I couldn't find the public API in the CuPy documentation, so I will make a deep copy before feeding arrays to my RawKernel. I want each thread to work on a small patch of a big array, and I couldn't find a way to express that with ElementwiseKernel. If there is any smart way to achieve this, please let me know! I appreciate it!

@leofang
Member

leofang commented Dec 31, 2022

I want each thread work on a small patch of a big array.

It's unclear to me what you intended to do. If you can show us a version that you expect to work with contiguous arrays, we might be able to suggest alternative approaches.

@kanglcn
Author

kanglcn commented Jan 2, 2023

Thanks for your nice help! I finally found a way to implement it with an ElementwiseKernel using the raw keyword, so I don't need to bother you with this question anymore.

I wonder if there is an easier way to index a multidimensional array. For example, an element on the diagonal of an N*N matrix a can easily be indexed as a[i,i] in Python, but in the kernel I have to use a[i*N+i], which is less intuitive. I have tried the reduce_dims parameter of ElementwiseKernel() but it doesn't work. Thanks!

@leofang
Member

leofang commented Jan 2, 2023

Glad to know it works for you! AFAIK you need to use 1D indexing with ElementwiseKernel. It has to do with how the internal machinery is designed to handle array views. Moreover, 1D indexing should be familiar to most C/C++ programmers 🙂

@asi1024 asi1024 added issue-checked and removed cat:bug Bugs labels Jan 4, 2023
@kanglcn kanglcn closed this as completed Jan 5, 2023