torch.cuda.current_device() is changed by CuPy after 10.0 #6569

Closed
fangwei123456 opened this issue Mar 20, 2022 · 6 comments

fangwei123456 commented Mar 20, 2022

Description

Recently, @Yanqi-Chen reported a bug when using a PyTorch module accelerated by CuPy in Distributed Data Parallel (DDP) training.

In DDP training, each process uses torch.cuda.current_device() as its default device. He found that CuPy changes torch.cuda.current_device(). For example, when training with 4 GPUs, torch.cuda.current_device() should be 0, 1, 2, 3 across the four processes, but after using CuPy, every process's torch.cuda.current_device() outputs 0.
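For context, a minimal sketch of the per-rank device setup that DDP training relies on (a hypothetical launcher-driven script; the backend and rank-to-GPU mapping here are assumptions, not part of the original report):

import torch
import torch.distributed as dist

# One process per GPU; the launcher (e.g. torchrun) assigns ranks.
dist.init_process_group(backend='nccl')
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# DDP code then relies on the current device staying put:
assert torch.cuda.current_device() == local_rank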

To Reproduce

I ran the following code to reproduce the problem:

import torch
import cupy
kernel_code = r'''
extern "C" __global__
void relu(const float* x, float* y, const int& N)
{
    const int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < N)
    {
        y[index] = (float) (x[index] >= 0.0f);
    }
}
'''

def relu(x: torch.Tensor):
    device_id = x.get_device()
    torch.cuda.set_device(device_id)
    print('1:', torch.cuda.current_device())
    y = torch.zeros_like(x)
    assert device_id >= 0

    with cupy.cuda.Device(device_id):
        kernel = cupy.RawKernel(kernel_code, 'relu')
        threads = 1024
        N = x.numel()

        blocks = (N + threads - 1) // threads
        x = x.contiguous()
        y = y.contiguous()
        N = cupy.asarray(N)
        kernel((blocks,), (threads,), (x.data_ptr(), y.data_ptr(), N))
    print('2:', torch.cuda.current_device())
    return y

device = 'cuda:1'
x = torch.rand([8], device=device) - 0.5
y = relu(x)
print(f'x={x}')
print(f'y={y}')

On machine A, I get the following output:

(pytorch-env) wfang@ubuntu:~/temp_dir$ python test.py 
1: 1
2: 0
x=tensor([-0.1473, -0.3093, -0.0547, -0.1389,  0.4446, -0.3286, -0.4435,  0.0105],
       device='cuda:1')
y=tensor([0., 0., 0., 0., 1., 0., 0., 1.], device='cuda:1')

You can see that the current device changes from 1 to 0 after the with cupy.cuda.Device(device_id) block.

However, on machine B, I get:

(pytorch-env) wfang@Precision-5820-Tower-X-Series:~/tempdir$ python test.py 
1: 1
2: 1
x=tensor([ 0.0060,  0.0141,  0.4118, -0.4813,  0.4609, -0.3557,  0.3739, -0.3464],
       device='cuda:1')
y=tensor([1., 1., 1., 0., 1., 0., 1., 0.], device='cuda:1')

On machine B, the current device is not changed.

Installation

Source (pip install cupy)

Environment

On machine A:

(pytorch-env) wfang@ubuntu:~/temp_dir$ conda list torch
# packages in environment at /home/wfang/anaconda3/envs/pytorch-env:
#
# Name                    Version                   Build  Channel
pytorch                   1.10.1          py3.9_cuda11.3_cudnn8.2.0_0    pytorch
pytorch-mutex             1.0                        cuda    pytorch
torchaudio                0.10.1               py39_cu113    pytorch
torchvision               0.11.2               py39_cu113    pytorch

(pytorch-env) wfang@ubuntu:~/temp_dir$ conda list cu
# packages in environment at /home/wfang/anaconda3/envs/pytorch-env:
#
# Name                    Version                   Build  Channel
cudatoolkit               11.3.1               h2bc3f7f_2    defaults
cupy-cuda113              10.2.0                   pypi_0    pypi
ncurses                   6.3                  h7f8727e_2    defaults

(pytorch-env) wfang@ubuntu:~/temp_dir$ nvidia-smi
Sun Mar 20 19:41:32 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.74       Driver Version: 470.74       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:18:00.0 Off |                    0 |
| N/A   56C    P0   241W / 400W |  13828MiB / 81251MiB |     81%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   61C    P0   218W / 400W |  13830MiB / 81251MiB |     97%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:86:00.0 Off |                   98 |
| N/A   56C    P0   222W / 400W |  13828MiB / 81251MiB |     96%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:AF:00.0 Off |                    0 |
| N/A   57C    P0   215W / 400W |  13828MiB / 81251MiB |     88%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     20830      C   ...vs/pytorch-env/bin/python    13817MiB |
|    1   N/A  N/A     20831      C   ...vs/pytorch-env/bin/python    13819MiB |
|    2   N/A  N/A     20832      C   ...vs/pytorch-env/bin/python    13817MiB |
|    3   N/A  N/A     20833      C   ...vs/pytorch-env/bin/python    13817MiB |
+-----------------------------------------------------------------------------+

(pytorch-env) wfang@ubuntu:~/temp_dir$ gpustat 
ubuntu                   Sun Mar 20 19:44:07 2022  470.74
[0] NVIDIA A100-SXM-80GB | 58'C, 100 % | 13828 / 81251 MB | wfang(13817M)
[1] NVIDIA A100-SXM-80GB | 63'C,  81 % | 13830 / 81251 MB | wfang(13819M)
[2] NVIDIA A100-SXM-80GB | 57'C,  91 % | 13828 / 81251 MB | wfang(13817M)
[3] NVIDIA A100-SXM-80GB | 59'C,  81 % | 13828 / 81251 MB | wfang(13817M)

(pytorch-env) wfang@ubuntu:/usr/local/cuda/bin$ ./nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0

On machine B:

(pytorch-env) wfang@Precision-5820-Tower-X-Series:~/tempdir$ conda list torch
# packages in environment at /home/wfang/anaconda3/envs/pytorch-env:
#
# Name                    Version                   Build  Channel
pytorch                   1.10.1          py3.9_cuda11.3_cudnn8.2.0_0    pytorch
pytorch-mutex             1.0                        cuda    pytorch
torch-tb-profiler         0.3.1                    pypi_0    pypi
torchaudio                0.10.1               py39_cu113    pytorch
torchvision               0.11.2               py39_cu113    pytorch

(pytorch-env) wfang@Precision-5820-Tower-X-Series:~/tempdir$ conda list cu
# packages in environment at /home/wfang/anaconda3/envs/pytorch-env:
#
# Name                    Version                   Build  Channel
cudatoolkit               11.3.1               h2bc3f7f_2    defaults
cupy-cuda111              9.4.0                    pypi_0    pypi
ncurses                   6.2                  h58526e2_4    conda-forge

(pytorch-env) wfang@Precision-5820-Tower-X-Series:~/tempdir$ nvidia-smi
Sun Mar 20 19:42:23 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:17:00.0 Off |                  N/A |
| 19%   36C    P8    16W / 250W |   1448MiB / 11011MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:B3:00.0 Off |                  N/A |
| 18%   36C    P8    21W / 250W |      3MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   2335870      C   python                           1445MiB |
+-----------------------------------------------------------------------------+

(pytorch-env) wfang@Precision-5820-Tower-X-Series:~/tempdir$ gpustat 
Precision-5820-Tower-X-Series  Sun Mar 20 19:44:46 2022  465.19.01
[0] NVIDIA GeForce RTX 2080 Ti | 36'C,   0 % |  1448 / 11011 MB | wfang(1445M)
[1] NVIDIA GeForce RTX 2080 Ti | 36'C,   0 % |     3 / 11019 MB |

(pytorch-env) wfang@Precision-5820-Tower-X-Series:/usr/local/cuda/bin$ ./nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Mar_21_19:15:46_PDT_2021
Cuda compilation tools, release 11.3, V11.3.58
Build cuda_11.3.r11.3/compiler.29745058_0

Additional Information

No response

fangwei123456 added the cat:bug label on Mar 20, 2022
fangwei123456 changed the title from "torch.cuda.current_device() is changed by CuPy in some machines" to "torch.cuda.current_device() is changed by CuPy" on Mar 21, 2022

Yanqi-Chen commented Mar 21, 2022

I further tested multiple versions of CuPy and confirmed that versions 10.0.0, 10.1.0, and 10.2.0 exhibit this behavior (torch.cuda.current_device() unexpectedly falls back to cuda:0), while versions 9.4.0 and 9.6.0 do not.

fangwei123456 changed the title from "torch.cuda.current_device() is changed by CuPy" to "torch.cuda.current_device() is changed by CuPy after 10.0" on Mar 21, 2022
kmaehashi (Member) commented

Remove this line and it should work:

with cupy.cuda.Device(device_id):

The "current device" is semantics provided by CUDA and not by each library. torch.cuda.set_device() will change the current device of the current thread, so it will take effect on CuPy as well. Mixing multiple libraries to switch the current device may cause unexpected behavior.

fangwei123456 (Author) commented

Thanks, it works well. But if I also remove this line:

torch.cuda.set_device(device_id)

def relu(x: torch.Tensor):
    device_id = x.get_device()
    # torch.cuda.set_device(device_id)
    print('1:', torch.cuda.current_device())
    y = torch.zeros_like(x)
    assert device_id >= 0

    # with cupy.cuda.Device(device_id):

    kernel = cupy.RawKernel(kernel_code, 'relu')
    threads = 1024
    N = x.numel()

    blocks = (N + threads - 1) // threads
    x = x.contiguous()
    y = y.contiguous()
    N = cupy.asarray(N)
    print(f'N.device={N.device}')
    kernel((blocks,), (threads,), (x.data_ptr(), y.data_ptr(), N))
    print('2:', torch.cuda.current_device())
    return y

device = 'cuda:1'
x = torch.rand([8], device=device) - 0.5
y = relu(x)
print(f'x={x}')
print(f'y={y}')

Then I get incorrect output:

(pytorch-env) wfang@ubuntu:~/temp_dir$ python test.py 
1: 0
N.device=<CUDA Device 0>
2: 0
x=tensor([-0.4328, -0.0702, -0.1543, -0.1021, -0.2400, -0.0898, -0.3499, -0.0559],
       device='cuda:1')
y=tensor([0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:1')

How can I place a CuPy array (e.g., N = cupy.asarray(N)) on a specific device without using with cupy.cuda.Device(device_id)?

fangwei123456 (Author) commented

Or is the best way to call torch.cuda.set_device(device_id) before creating a new CuPy array?

kmaehashi (Member) commented

Or is the best way to call torch.cuda.set_device(device_id) before creating a new CuPy array?

Yes, the point is to use the same library to switch the current device across the codebase.
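Putting this together, a sketch of the relu helper from above with PyTorch as the only device switcher (same kernel_code as in the original report; this version is not from the thread itself):

def relu(x: torch.Tensor):
    device_id = x.get_device()
    assert device_id >= 0
    # Switch the current device with PyTorch only; CuPy follows the
    # same per-thread CUDA runtime state, so no cupy.cuda.Device
    # block is needed.
    torch.cuda.set_device(device_id)
    y = torch.zeros_like(x)

    kernel = cupy.RawKernel(kernel_code, 'relu')
    threads = 1024
    N = x.numel()
    blocks = (N + threads - 1) // threads
    x = x.contiguous()
    N = cupy.asarray(N)  # created on the current device, i.e. x's device
    kernel((blocks,), (threads,), (x.data_ptr(), y.data_ptr(), N))
    return y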

fangwei123456 (Author) commented

OK, thanks!
