Could you kindly provide a list of the environment configurations? #8

Open

leozjr opened this issue Mar 29, 2024 · 1 comment

leozjr commented Mar 29, 2024

Such as an environment.yaml file for Conda, if possible.
It seems that there are some issues with my environment, preventing me from starting the training properly.

My environment (2080 Ti x 8):

python                    3.8.16
cudatoolkit               11.8.0
cudnn                     8.4.1.50
pytorch                   1.12.1
pytorch-gpu               1.12.1
torchlight                0.0.1
torchlights               0.4.0
torchvision               0.14.1
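
In case it helps, here is roughly how I double-check what torch actually reports at runtime (a small diagnostic sketch of my own, not code from this repository):

import torch

# Print the versions torch itself reports, to cross-check the conda list above.
print("torch:", torch.__version__)
print("built-with CUDA:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("CUDA available:", torch.cuda.is_available())
print("visible GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i}:", torch.cuda.get_device_name(i))

(If I am not mistaken, torchvision 0.14.1 normally pairs with torch 1.13.x rather than 1.12.1, so that pairing in the list above may itself be worth double-checking.)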

I can train my other code normally, and the testing process runs fine, but training fails with errors like the following:

RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/arch.py", line 111, in forward
    out = self.encoder(out, xs)
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/arch.py", line 56, in forward
    x = self.layers[i](x)
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/attention.py", line 244, in forward
    r, _ = self.attn(inputs)
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/attention.py", line 85, in forward
    q = torch.matmul(attn, v)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`

or

packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/attention.py", line 246, in forward
    r = self.ffn(r)
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/attention.py", line 183, in forward
    x2 = x * torch.sigmoid(w)
RuntimeError: CUDA error: misaligned address
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.

or

RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/arch.py", line 111, in forward
    out = self.encoder(out, xs)
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/arch.py", line 56, in forward
    x = self.layers[i](x)
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/attention.py", line 246, in forward
    r = self.ffn(r)
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/attention.py", line 178, in forward
    x = F.gelu(x)
RuntimeError: CUDA error: misaligned address
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.

or

  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/functional.py", line 2438, in batch_norm
    return torch.batch_norm(
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

This should be unrelated to memory or batch size, since I still encounter the issue even with the smallest model.

Could you kindly share your environment configuration? It might also be related to the versions of torch and cudnn.

Or it may stem from the use of sync_batchnorm.
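
Since the warning in the traces above says the stack trace may be misattributed (CUDA errors are reported asynchronously), I will also re-run with synchronous kernel launches to get a more precise failure point. A minimal sketch of what I mean (my own debugging step, not repository code):

import os

# Must be set before CUDA is initialized, so that kernel launches become
# synchronous and the reported stack trace points at the op that actually failed.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after the environment variable is set

(Equivalently, the training command can be prefixed with CUDA_LAUNCH_BLOCKING=1 in the shell.)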


leozjr commented Mar 29, 2024

Well, I found that if I set batch_size to 4 and use 4 GPUs it runs, but only about 1/6 of the CUDA memory is used, and if I increase the batch size to 8 or larger, the error is reported again.
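
For reference, this is roughly how I looked at the per-GPU memory usage (a small diagnostic sketch of mine, not part of the training code):

import torch

# Report how much memory is currently allocated on each visible GPU
# versus its total capacity.
for i in range(torch.cuda.device_count()):
    alloc = torch.cuda.memory_allocated(i) / 1024 ** 3
    total = torch.cuda.get_device_properties(i).total_memory / 1024 ** 3
    print(f"cuda:{i}: {alloc:.2f} GiB allocated / {total:.2f} GiB total")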

leozjr closed this as completed Mar 29, 2024
leozjr reopened this Mar 29, 2024