SpecAug/MaskAlongAxis: RuntimeError: random_ expects 'from' to be less than 'to' #5628

Closed
albertz opened this issue Jan 21, 2024 · 2 comments
Labels
Bug bug should be fixed

Comments

Contributor

albertz commented Jan 21, 2024

Describe the bug

I got the exception RuntimeError: random_ expects 'from' to be less than 'to', but got from=0 >= to=-4752614986133393697 in SpecAug/MaskAlongAxis.

Basic environments:

  • OS information: Linux 5.15.0-46-generic #49-Ubuntu SMP Thu Aug 4 18:03:25 UTC 2022 x86_64
  • python version: 3.11.2 (main, Feb 7 2023, 13:52:42) [GCC 11.3.0]
  • espnet version: espnet 202310
  • pytorch version: pytorch 2.1.0+cu121
  • Git hash: 35c2e2b
    • Commit date: Fri Jan 19 08:44:17 2024 -0500

Environments from torch.utils.collect_env:

Collecting environment information...
PyTorch version: 2.1.0+cu121      
Is debug build: False               
CUDA used to build PyTorch: 12.1                                                
ROCM used to build PyTorch: N/A                                                 
                                                                                
OS: Ubuntu 22.04.3 LTS (x86_64)                                                 
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0                          
Clang version: 15.0.7                                                           
CMake version: Could not collect                                                                                                                                
Libc version: glibc-2.35                                                                                                                                        
                                                                                                                                                                
Python version: 3.11.2 (main, Feb  7 2023, 13:52:42) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-46-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce GTX 980
Nvidia driver version: 530.41.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True  

...

Task information:

Using RETURNN for training on Librispeech.

To Reproduce

(Not sure yet whether this always reproduces or whether it was a random hiccup, maybe related to a hardware issue...)

This happened while using RETURNN to train an ESPnet model, with exactly the egs2/librispeech/asr1/conf/tuning/train_asr_e_branchformer.yaml config for the model.

Error logs

Detailed stacktrace:

...
ep 147 train, step 699, total 0.872, loss_ctc 30.930, loss_att 18.760, acc 0.893, loss 22.411, num_seqs 7, max_size:time 168344, max_size:out-spatial 31, mem_usage:cuda:3 7.6GB, 199.871 sec/step
ep 147 train, step 699, total 1.190, loss_ctc 36.713, loss_att 23.028, acc 0.853, loss 27.134, num_seqs 10, max_size:time 124521, max_size:out-spatial 35, mem_usage:cuda:2 7.6GB, 199.980 sec/step
ep 147 train, step 699, total 0.662, loss_ctc 27.851, loss_att 19.407, acc 0.911, loss 21.940, num_seqs 6, max_size:time 192105, max_size:out-spatial 44, mem_usage:cuda:1 7.7GB, 200.795 sec/step
ep 147 train, step 700, total 1.486, loss_ctc 57.620, loss_att 31.015, acc 0.798, loss 38.996, num_seqs 8, max_size:time 152240, max_size:out-spatial 33, mem_usage:cuda:0 7.6GB, 1.688 sec/step
ep 147 train, step 700, total 1.355, loss_ctc 38.556, loss_att 26.460, acc 0.800, loss 30.088, num_seqs 10, max_size:time 122560, max_size:out-spatial 27, mem_usage:cuda:2 7.6GB, 1.657 sec/step
ep 147 train, step 700, total 1.612, loss_ctc 60.367, loss_att 40.237, acc 0.766, loss 46.276, num_seqs 7, max_size:time 177848, max_size:out-spatial 32, mem_usage:cuda:3 7.6GB, 1.755 sec/step
RuntimeError: random_ expects 'from' to be less than 'to', but got from=0 >= to=-4752614986133393697
Unhandled exception <class 'RuntimeError'> in thread <_MainThread(MainThread, started 140128568459264)>, proc 188265.

...
  File "/u/zeyer/setups/combined/2021-05-31/recipe/i6_experiments/users/zeyer/experiments/exp2023_04_25_rf/espnet.py", line 413, in from_scratch_training 
    line: loss, stats, weight = model(
              speech=data.raw_tensor,
              speech_lengths=data_spatial_dim.dyn_size,
              text=targets.raw_tensor.to(torch.int64),
              text_lengths=targets_spatial_dim.dyn_size,
          )
    locals:
      loss = <not found>
      stats = <not found>
      weight = <not found>
      model = <local> ESPnetASRModel( 
                        (frontend): DefaultFrontend( 
                          (stft): Stft(n_fft=512, win_length=512, hop_length=160, center=True, normalized=False, onesided=True)
                          (frontend): Frontend()
                          (logmel): LogMel(sr=16000, n_fft=512, n_mels=80, fmin=0, fmax=8000.0, htk=False)
                        )
                        (specaug): SpecAug(
                          (t...
      speech = <not found>
      data = <local> Tensor{'data', [B?,T|'time'[B?]]}
      data.raw_tensor = <local> tensor[6, 195097] n=1170582 (4.5Mb) x∈[-0.051, 0.252] μ=-0.051 σ=0.252 cuda:1
      speech_lengths = <not found>
      data_spatial_dim = <local> Dim{'time'[B?]}
      data_spatial_dim.dyn_size = <local> tensor[6] i32 x∈[177840, 195097] μ=1.910e+05 σ=6.534e+03 [192369, 192809, 193073, 194745, 195097, 177840]
      text = <not found>
      targets = <local> Tensor{'classes', [B?,T|'out-spatial'[B?]], dtype='int32', sparse_dim=Dim{'vocab'(10025)}}
      targets.raw_tensor = <local> tensor[6, 42] i32 n=252 x∈[-1178451361, 1057658400] μ=-1.123e+08 σ=1.073e+09 cuda:1
      targets.raw_tensor.to = <local> <built-in method to of Tensor object at 0x7f709e1c61b0>
      torch = <local> <module 'torch' from '/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/__init__.py'>
      torch.int64 = <local> torch.int64
      text_lengths = <not found>
      targets_spatial_dim = <local> Dim{'out-spatial'[B?]}
      targets_spatial_dim.dyn_size = <local> tensor[6] i32 x∈[25, 42] μ=33.167 σ=5.565 [25, 42, 34, 35, 32, 31]
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in Module._wrapped_call_impl
    line: return self._call_impl(*args, **kwargs)
    locals:
      self = <local> ESPnetASRModel(
                       (frontend): DefaultFrontend(
                         (stft): Stft(n_fft=512, win_length=512, hop_length=160, center=True, normalized=False, onesided=True)
                         (frontend): Frontend()
                         (logmel): LogMel(sr=16000, n_fft=512, n_mels=80, fmin=0, fmax=8000.0, htk=False)
                       )
                       (specaug): SpecAug(
                         (t...
      self._call_impl = <local> <bound method Module._call_impl of ESPnetASRModel(
                                  (frontend): DefaultFrontend(
                                    (stft): Stft(n_fft=512, win_length=512, hop_length=160, center=True, normalized=False, onesided=True)
                                    (frontend): Frontend()
                                    (logmel): LogMel(sr=16000, n_fft=512, n_mels=80, fmin=0, fmax=8000.0, htk=Fals...
      args = <local> ()
      kwargs = <local> {'speech': tensor[6, 195097] n=1170582 (4.5Mb) x∈[-0.051, 0.252] μ=-0.051 σ=0.252 cuda:1, 'speech_lengths': tensor[6] i32 x∈[177840, 195097] μ=1.910e+05 σ=6.534e+03 [192369, 192809, 193073, 194745, 195097, 177840], 'text': tensor[6, 42] i64 n=252 (2.0Kb) x∈[-5019353784009342738, 45166044071332232
...
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in Module._call_impl
    line: return forward_call(*args, **kwargs)
    locals:
      forward_call = <local> <bound method ESPnetASRModel.forward of ESPnetASRModel(
                               (frontend): DefaultFrontend(
                                 (stft): Stft(n_fft=512, win_length=512, hop_length=160, center=True, normalized=False, onesided=True)
                                 (frontend): Frontend()
                                 (logmel): LogMel(sr=16000, n_fft=512, n_mels=80, fmin=0, fmax=8000.0, htk...
      args = <local> ()
      kwargs = <local> {'speech': tensor[6, 195097] n=1170582 (4.5Mb) x∈[-0.051, 0.252] μ=-0.051 σ=0.252 cuda:1, 'speech_lengths': tensor[6] i32 x∈[177840, 195097] μ=1.910e+05 σ=6.534e+03 [192369, 192809, 193073, 194745, 195097, 177840], 'text': tensor[6, 42] i64 n=252 (2.0Kb) x∈[-4922713914584930169, 45401658353551242
...
  File "/u/zeyer/setups/combined/2021-05-31/tools/espnet/espnet2/asr/espnet_model.py", line 237, in ESPnetASRModel.forward
    line: encoder_out, encoder_out_lens = self.encode(speech, speech_lengths)
    locals:
      encoder_out = <not found>
      encoder_out_lens = <not found>
      self = <local> ESPnetASRModel(
                       (frontend): DefaultFrontend(
                         (stft): Stft(n_fft=512, win_length=512, hop_length=160, center=True, normalized=False, onesided=True)
                         (frontend): Frontend()
                         (logmel): LogMel(sr=16000, n_fft=512, n_mels=80, fmin=0, fmax=8000.0, htk=False)
                       )
                       (specaug): SpecAug(
                         (t...
      self.encode = <local> <bound method ESPnetASRModel.encode of ESPnetASRModel(
                              (frontend): DefaultFrontend(
                                (stft): Stft(n_fft=512, win_length=512, hop_length=160, center=True, normalized=False, onesided=True)
                                (frontend): Frontend()
                                (logmel): LogMel(sr=16000, n_fft=512, n_mels=80, fmin=0, fmax=8000.0, htk=...
      speech = <local> tensor[6, 195097] n=1170582 (4.5Mb) x∈[-0.051, 0.252] μ=-0.051 σ=0.252 cuda:1
      speech_lengths = <local> tensor[6] i32 x∈[177840, 195097] μ=1.910e+05 σ=6.534e+03 [192369, 192809, 193073, 194745, 195097, 177840]
  File "/u/zeyer/setups/combined/2021-05-31/tools/espnet/espnet2/asr/espnet_model.py", line 382, in ESPnetASRModel.encode
    line: feats, feats_lengths = self.specaug(feats, feats_lengths)
    locals:
      feats = <local> tensor[6, 1220, 80] n=585600 (2.2Mb) x∈[-0.051, 0.252] μ=-0.051 σ=0.252 cuda:1
      feats_lengths = <local> tensor[6] i32 x∈[1112, 1220] μ=1.194e+03 σ=40.913 [1203, 1206, 1207, 1218, 1220, 1112]
      self = <local> ESPnetASRModel(
                       (frontend): DefaultFrontend(
                         (stft): Stft(n_fft=512, win_length=512, hop_length=160, center=True, normalized=False, onesided=True)
                         (frontend): Frontend()
                         (logmel): LogMel(sr=16000, n_fft=512, n_mels=80, fmin=0, fmax=8000.0, htk=False)
                       )
                       (specaug): SpecAug(
                         (t...
      self.specaug = <local> SpecAug(
                               (time_warp): TimeWarp(window=5, mode=bicubic)
                               (freq_mask): MaskAlongAxis(mask_width_range=[0, 27], num_mask=2, axis=freq)
                               (time_mask): MaskAlongAxisVariableMaxWidth(mask_width_ratio_range=[0.0, 0.05], num_mask=10, axis=time)
                             )
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in Module._wrapped_call_impl
    line: return self._call_impl(*args, **kwargs)
    locals:
      self = <local> SpecAug(
                       (time_warp): TimeWarp(window=5, mode=bicubic)
                       (freq_mask): MaskAlongAxis(mask_width_range=[0, 27], num_mask=2, axis=freq)
                       (time_mask): MaskAlongAxisVariableMaxWidth(mask_width_ratio_range=[0.0, 0.05], num_mask=10, axis=time)
                     )
      self._call_impl = <local> <bound method Module._call_impl of SpecAug(
                                  (time_warp): TimeWarp(window=5, mode=bicubic)
                                  (freq_mask): MaskAlongAxis(mask_width_range=[0, 27], num_mask=2, axis=freq)
                                  (time_mask): MaskAlongAxisVariableMaxWidth(mask_width_ratio_range=[0.0, 0.05], num_mask=10, axis=time)
                                )>
      args = <local> (tensor[6, 1220, 80] n=585600 (2.2Mb) x∈[-0.051, 0.252] μ=-0.051 σ=0.252 cuda:1, tensor[6] i32 x∈[1112, 1220] μ=1.194e+03 σ=40.913 [1203, 1206, 1207, 1218, 1220, 1112])
      kwargs = <local> {}
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in Module._call_impl
    line: return forward_call(*args, **kwargs)
    locals:
      forward_call = <local> <bound method SpecAug.forward of SpecAug(
                               (time_warp): TimeWarp(window=5, mode=bicubic)
                               (freq_mask): MaskAlongAxis(mask_width_range=[0, 27], num_mask=2, axis=freq)
                               (time_mask): MaskAlongAxisVariableMaxWidth(mask_width_ratio_range=[0.0, 0.05], num_mask=10, axis=time)
                             )>
      args = <local> (tensor[6, 1220, 80] n=585600 (2.2Mb) x∈[-0.051, 0.252] μ=-0.051 σ=0.252 cuda:1, tensor[6] i32 x∈[1112, 1220] μ=1.194e+03 σ=40.913 [1203, 1206, 1207, 1218, 1220, 1112])
      kwargs = <local> {}
  File "/u/zeyer/setups/combined/2021-05-31/tools/espnet/espnet2/asr/specaug/specaug.py", line 93, in SpecAug.forward
    line: x, x_lengths = self.freq_mask(x, x_lengths)
    locals:
      x = <local> tensor[6, 1220, 80] n=585600 (2.2Mb) x∈[-0.051, 0.252] μ=-0.051 σ=0.252 cuda:1
      x_lengths = <local> tensor[6] i32 x∈[1112, 1220] μ=1.194e+03 σ=40.913 [1203, 1206, 1207, 1218, 1220, 1112]
      self = <local> SpecAug(
                       (time_warp): TimeWarp(window=5, mode=bicubic)
                       (freq_mask): MaskAlongAxis(mask_width_range=[0, 27], num_mask=2, axis=freq)
                       (time_mask): MaskAlongAxisVariableMaxWidth(mask_width_ratio_range=[0.0, 0.05], num_mask=10, axis=time)
                     )
      self.freq_mask = <local> MaskAlongAxis(mask_width_range=[0, 27], num_mask=2, axis=freq)
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in Module._wrapped_call_impl
    line: return self._call_impl(*args, **kwargs)
    locals:
      self = <local> MaskAlongAxis(mask_width_range=[0, 27], num_mask=2, axis=freq)
      self._call_impl = <local> <bound method Module._call_impl of MaskAlongAxis(mask_width_range=[0, 27], num_mask=2, axis=freq)>
      args = <local> (tensor[6, 1220, 80] n=585600 (2.2Mb) x∈[-0.051, 0.252] μ=-0.051 σ=0.252 cuda:1, tensor[6] i32 x∈[1112, 1220] μ=1.194e+03 σ=40.913 [1203, 1206, 1207, 1218, 1220, 1112])
      kwargs = <local> {}
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in Module._call_impl
    line: return forward_call(*args, **kwargs)
    locals:
      forward_call = <local> <bound method MaskAlongAxis.forward of MaskAlongAxis(mask_width_range=[0, 27], num_mask=2, axis=freq)>
      args = <local> (tensor[6, 1220, 80] n=585600 (2.2Mb) x∈[-0.051, 0.252] μ=-0.051 σ=0.252 cuda:1, tensor[6] i32 x∈[1112, 1220] μ=1.194e+03 σ=40.913 [1203, 1206, 1207, 1218, 1220, 1112])
      kwargs = <local> {}
  File "/u/zeyer/setups/combined/2021-05-31/tools/espnet/espnet2/layers/mask_along_axis.py", line 122, in MaskAlongAxis.forward
    line: return mask_along_axis(
              spec,
              spec_lengths,
              mask_width_range=self.mask_width_range,
              dim=self.dim,
              num_mask=self.num_mask,
              replace_with_zero=self.replace_with_zero,
          )
    locals:
      mask_along_axis = <global> <function mask_along_axis at 0x7f70f6c4e8e0>
      spec = <local> tensor[6, 1220, 80] n=585600 (2.2Mb) x∈[-0.051, 0.252] μ=-0.051 σ=0.252 cuda:1
      spec_lengths = <local> tensor[6] i32 x∈[1112, 1220] μ=1.194e+03 σ=40.913 [1203, 1206, 1207, 1218, 1220, 1112]
      mask_width_range = <not found>
      self = <local> MaskAlongAxis(mask_width_range=[0, 27], num_mask=2, axis=freq)
      self.mask_width_range = <local> [0, 27]
      dim = <not found>
      self.dim = <local> 2
      num_mask = <not found>
      self.num_mask = <local> 2
      replace_with_zero = <not found>
      self.replace_with_zero = <local> True
  File "/u/zeyer/setups/combined/2021-05-31/tools/espnet/espnet2/layers/mask_along_axis.py", line 41, in mask_along_axis
    line: mask_pos = torch.randint(
              0, max(1, D - mask_length.max()), (B, num_mask), device=spec.device
          ).unsqueeze(2)
    locals:
      mask_pos = <not found>
      torch = <global> <module 'torch' from '/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/__init__.py'>
      torch.randint = <global> <built-in method randint of type object at 0x7f71cebaeaa0>
      max = <builtin> <built-in function max>
      D = <local> 80
      mask_length = <local> tensor[6, 2, 1] i64 n=12 x∈[-4825293490701537652, 4474139002465713060] μ=-9.419e+17 σ=4.744e+18 cuda:1
      mask_length.max = <local> <built-in method max of Tensor object at 0x7f709e1c4d70>
      B = <local> 6
      num_mask = <local> 2
      device = <not found>
      spec = <local> tensor[6, 1220, 80] n=585600 (2.2Mb) x∈[-0.051, 0.252] μ=-0.051 σ=0.252 cuda:1
      spec.device = <local> device(type='cuda', index=1)
      unsqueeze = <not found>
RuntimeError: random_ expects 'from' to be less than 'to', but got from=0 >= to=-4752614986133393697

Module call stack:
(ESPnetASRModel.forward) (root)
(ESPnetASRModel.encode) (root)
(SpecAug.forward) specaug
(MaskAlongAxis.forward) specaug.freq_mask
@albertz albertz added the Bug bug should be fixed label Jan 21, 2024
Contributor Author

albertz commented Jan 21, 2024

It might be a hardware issue. On that node, I now get:

zeyer@cn-244 ~ % nvidia-smi                        
Unable to determine the device handle for GPU0000:03:00.0: Unknown Error

And dmesg:

[  +0.000844] nvidia 0000:03:00.0: AER: can't recover (no error_detected callback)
[  +0.000001] snd_hda_intel 0000:03:00.1: AER: can't recover (no error_detected callback)
[  +0.000008] pcieport 0000:00:03.0: AER: device recovery failed
[  +0.000001] pcieport 0000:00:03.0: AER: Multiple Uncorrected (Fatal) error received: 0000:00:03.0
[  +0.000004] pcieport 0000:00:03.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
[  +0.000898] pcieport 0000:00:03.0:   device [8086:6f08] error status/mask=00004020/00000000
[  +0.000889] pcieport 0000:00:03.0:    [ 5] SDES                  
[  +0.000894] pcieport 0000:00:03.0:    [14] CmpltTO                (First)
[  +0.000921] nvidia 0000:03:00.0: AER: can't recover (no error_detected callback)
[  +0.000002] snd_hda_intel 0000:03:00.1: AER: can't recover (no error_detected callback)
[  +1.050439] pcieport 0000:00:03.0: AER: Root Port link has been reset (0)
[  +0.000041] pcieport 0000:00:03.0: AER: device recovery failed
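
When checking several nodes, log lines like these can be filtered programmatically. The following is a minimal sketch, assuming the dmesg text has already been captured as a string; fatal_aer_lines is a hypothetical helper, and the substring it matches is simply the one that appears in the output above.

```python
def fatal_aer_lines(dmesg_text):
    """Return lines reporting an uncorrected fatal PCIe error (hypothetical helper)."""
    return [
        line.strip()
        for line in dmesg_text.splitlines()
        if "Uncorrected (Fatal)" in line
    ]


# Three lines copied from the dmesg output above.
sample = """\
[  +0.000001] pcieport 0000:00:03.0: AER: Multiple Uncorrected (Fatal) error received: 0000:00:03.0
[  +0.000004] pcieport 0000:00:03.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
[  +1.050439] pcieport 0000:00:03.0: AER: Root Port link has been reset (0)
"""

for line in fatal_aer_lines(sample):
    print(line)
```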

Contributor Author

albertz commented Jan 21, 2024

I was just looking at the code. There is:

mask_length = torch.randint(
    mask_width_range[0],
    mask_width_range[1],
    (B, num_mask),
    device=spec.device,
)

And it calls the function like this (as you see from the stacktrace):

    line: return mask_along_axis(
              spec,
              spec_lengths,
              mask_width_range=self.mask_width_range,
              dim=self.dim,
              num_mask=self.num_mask,
              replace_with_zero=self.replace_with_zero,
          )
    locals:
      spec = <local> tensor[6, 1220, 80] n=585600 (2.2Mb) x∈[-0.051, 0.252] μ=-0.051 σ=0.252 cuda:1
      spec_lengths = <local> tensor[6] i32 x∈[1112, 1220] μ=1.194e+03 σ=40.913 [1203, 1206, 1207, 1218, 1220, 1112]
      self = <local> MaskAlongAxis(mask_width_range=[0, 27], num_mask=2, axis=freq)
      self.mask_width_range = <local> [0, 27]
      self.dim = <local> 2
      self.num_mask = <local> 2
      self.replace_with_zero = <local> True

So mask_length should only contain values between 0 and 26 (inclusive), since the upper bound of torch.randint is exclusive.
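
The half-open range can be illustrated without a GPU. The sketch below uses Python's random.randrange as a stand-in for torch.randint (both exclude the upper bound); the helper name sample_mask_lengths is hypothetical and not part of ESPnet.

```python
import random


def sample_mask_lengths(low, high, batch, num_mask, seed=0):
    """Hypothetical stand-in for torch.randint(low, high, (batch, num_mask)).

    random.randrange(low, high) draws from [low, high), so high is excluded.
    """
    rng = random.Random(seed)
    return [[rng.randrange(low, high) for _ in range(num_mask)] for _ in range(batch)]


# Values taken from the stack trace: mask_width_range=[0, 27], B=6, num_mask=2.
lengths = sample_mask_lengths(0, 27, batch=6, num_mask=2)
flat = [v for row in lengths for v in row]
print(min(flat) >= 0 and max(flat) <= 26)
```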

But then you see later in the stacktrace:

      mask_length = <local> tensor[6, 2, 1] i64 n=12 x∈[-4825293490701537652, 4474139002465713060] μ=-9.419e+17 σ=4.744e+18 cuda:1

So it seems clear that this is a hardware issue rather than an ESPnet bug, and we can close this.
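
As a rough sanity check of that conclusion, the upper bound passed to torch.randint on the failing line can be recomputed with plain Python integers. This is a minimal sketch, not the ESPnet code: position_upper_bound and check_mask_lengths are hypothetical helpers, and plain ints stand in for the tensors.

```python
def position_upper_bound(feature_dim, max_mask_length):
    """Mirror the expression max(1, D - mask_length.max()) with plain ints."""
    return max(1, feature_dim - max_mask_length)


def check_mask_lengths(lengths, low, high):
    """Hypothetical guard: raise if any sampled length falls outside [low, high)."""
    bad = [v for v in lengths if not low <= v < high]
    if bad:
        raise ValueError(f"mask lengths outside [{low}, {high}): {bad}")


# With valid lengths (at most 26) and D=80, the bound is 80 - 26 or larger.
print(position_upper_bound(80, 26))

# Even for the garbage maximum shown in the trace, the clamp keeps the bound at 1.
print(position_upper_bound(80, 4474139002465713060))

# The guard flags a corrupted value immediately.
try:
    check_mask_lengths([3, -4825293490701537652], 0, 27)
except ValueError as exc:
    print("detected:", exc)
```

In integer arithmetic the clamp can never produce a negative bound, so the to=-4752614986133393697 in the error message is hard to explain from the code alone, which fits the hardware explanation.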

@albertz albertz closed this as completed Jan 21, 2024