Could you kindly provide a list of the environment configurations? #8

Open

leozjr opened this issue Mar 29, 2024 · 1 comment

leozjr commented Mar 29, 2024

Such as an environment.yaml file for Conda, if possible.
It seems that there are some issues with my environment, preventing me from starting the training properly.

My environment (2080 Ti x 8):

python                    3.8.16
cudatoolkit               11.8.0
cudnn                     8.4.1.50
pytorch                   1.12.1
pytorch-gpu               1.12.1
torchlight                0.0.1
torchlights               0.4.0
torchvision               0.14.1
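
In case it helps, here is roughly how I double-check what torch actually reports at runtime (a small diagnostic sketch of my own, not code from this repository):

import torch

# Print the versions torch itself reports, to cross-check the conda list above.
print("torch:", torch.__version__)
print("built-with CUDA:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("CUDA available:", torch.cuda.is_available())
print("visible GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i}:", torch.cuda.get_device_name(i))

(If I am not mistaken, torchvision 0.14.1 normally pairs with torch 1.13.x rather than 1.12.1, so that pairing in the list above may itself be worth double-checking.)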

I can train my other code normally, and the testing process runs fine, but training fails with errors like the following:

RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/arch.py", line 111, in forward
    out = self.encoder(out, xs)
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/arch.py", line 56, in forward
    x = self.layers[i](x)
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/attention.py", line 244, in forward
    r, _ = self.attn(inputs)
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/attention.py", line 85, in forward
    q = torch.matmul(attn, v)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`

or

packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/attention.py", line 246, in forward
    r = self.ffn(r)
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/attention.py", line 183, in forward
    x2 = x * torch.sigmoid(w)
RuntimeError: CUDA error: misaligned address
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.

or

RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/arch.py", line 111, in forward
    out = self.encoder(out, xs)
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/arch.py", line 56, in forward
    x = self.layers[i](x)
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/attention.py", line 246, in forward
    r = self.ffn(r)
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/attention.py", line 178, in forward
    x = F.gelu(x)
RuntimeError: CUDA error: misaligned address
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.

or

  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/functional.py", line 2438, in batch_norm
    return torch.batch_norm(
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

This should be unrelated to memory or batch size, since I still encounter the issue even with the smallest model.

Could you kindly share your environment configuration? It might also be related to the versions of torch and cudnn.

Or it may stem from the use of sync_batchnorm.
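
Since the warning in the traces above says the stack trace may be misattributed (CUDA errors are reported asynchronously), I will also re-run with synchronous kernel launches to get a more precise failure point. A minimal sketch of what I mean (my own debugging step, not repository code):

import os

# Must be set before CUDA is initialized, so that kernel launches become
# synchronous and the reported stack trace points at the op that actually failed.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after the environment variable is set

(Equivalently, the training command can be prefixed with CUDA_LAUNCH_BLOCKING=1 in the shell.)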


leozjr commented Mar 29, 2024

Well, I found that if I set batch_size to 4 and use 4 GPUs it runs, but only about 1/6 of the CUDA memory is used, and if I increase the batch size to 8 or larger, the error is reported again.
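
For reference, this is roughly how I looked at the per-GPU memory usage (a small diagnostic sketch of mine, not part of the training code):

import torch

# Report how much memory is currently allocated on each visible GPU
# versus its total capacity.
for i in range(torch.cuda.device_count()):
    alloc = torch.cuda.memory_allocated(i) / 1024 ** 3
    total = torch.cuda.get_device_properties(i).total_memory / 1024 ** 3
    print(f"cuda:{i}: {alloc:.2f} GiB allocated / {total:.2f} GiB total")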

leozjr closed this as completed Mar 29, 2024
leozjr reopened this Mar 29, 2024