SpecAug/MaskAlongAxis: RuntimeError: random_ expects 'from' to be less than 'to' #5628

albertz · 2024-01-21T11:06:58Z

Describe the bug

I got the exception RuntimeError: random_ expects 'from' to be less than 'to', but got from=0 >= to=-4752614986133393697 in SpecAug/MaskAlongAxis.

Basic environments:

OS information: Linux 5.15.0-46-generic #49-Ubuntu SMP Thu Aug 4 18:03:25 UTC 2022 x86_64
python version: 3.11.2 (main, Feb 7 2023, 13:52:42) [GCC 11.3.0]
espnet version: espnet 202310
pytorch version: pytorch 2.1.0+cu121
Git hash: 35c2e2b
- Commit date: Fri Jan 19 08:44:17 2024 -0500

Environments from torch.utils.collect_env:

Collecting environment information...
PyTorch version: 2.1.0+cu121      
Is debug build: False               
CUDA used to build PyTorch: 12.1                                                
ROCM used to build PyTorch: N/A                                                 
                                                                                
OS: Ubuntu 22.04.3 LTS (x86_64)                                                 
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0                          
Clang version: 15.0.7                                                           
CMake version: Could not collect                                                                                                                                
Libc version: glibc-2.35                                                                                                                                        
                                                                                                                                                                
Python version: 3.11.2 (main, Feb  7 2023, 13:52:42) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-46-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce GTX 980
Nvidia driver version: 530.41.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True  

...

Task information:

Using RETURNN for training on Librispeech.

To Reproduce

(Not sure yet whether this always reproduces or whether it was a random hiccup, perhaps related to a hardware issue.)

This was using RETURNN to train an ESPnet model, with exactly the egs2/librispeech/asr1/conf/tuning/train_asr_e_branchformer.yaml config for the model.

Error logs

Detailed stacktrace:

...
ep 147 train, step 699, total 0.872, loss_ctc 30.930, loss_att 18.760, acc 0.893, loss 22.411, num_seqs 7, max_size:time 168344, max_size:out-spatial 31, mem_usage:cuda:3 7.6GB, 199.871 sec/step
ep 147 train, step 699, total 1.190, loss_ctc 36.713, loss_att 23.028, acc 0.853, loss 27.134, num_seqs 10, max_size:time 124521, max_size:out-spatial 35, mem_usage:cuda:2 7.6GB, 199.980 sec/step
ep 147 train, step 699, total 0.662, loss_ctc 27.851, loss_att 19.407, acc 0.911, loss 21.940, num_seqs 6, max_size:time 192105, max_size:out-spatial 44, mem_usage:cuda:1 7.7GB, 200.795 sec/step
ep 147 train, step 700, total 1.486, loss_ctc 57.620, loss_att 31.015, acc 0.798, loss 38.996, num_seqs 8, max_size:time 152240, max_size:out-spatial 33, mem_usage:cuda:0 7.6GB, 1.688 sec/step
ep 147 train, step 700, total 1.355, loss_ctc 38.556, loss_att 26.460, acc 0.800, loss 30.088, num_seqs 10, max_size:time 122560, max_size:out-spatial 27, mem_usage:cuda:2 7.6GB, 1.657 sec/step
ep 147 train, step 700, total 1.612, loss_ctc 60.367, loss_att 40.237, acc 0.766, loss 46.276, num_seqs 7, max_size:time 177848, max_size:out-spatial 32, mem_usage:cuda:3 7.6GB, 1.755 sec/step
RuntimeError: random_ expects 'from' to be less than 'to', but got from=0 >= to=-4752614986133393697
Unhandled exception <class 'RuntimeError'> in thread <_MainThread(MainThread, started 140128568459264)>, proc 188265.

...
  File "/u/zeyer/setups/combined/2021-05-31/recipe/i6_experiments/users/zeyer/experiments/exp2023_04_25_rf/espnet.py", line 413, in from_scratch_training 
    line: loss, stats, weight = model(
              speech=data.raw_tensor,
              speech_lengths=data_spatial_dim.dyn_size,
              text=targets.raw_tensor.to(torch.int64),
              text_lengths=targets_spatial_dim.dyn_size,
          )
    locals:
      loss = <not found>
      stats = <not found>
      weight = <not found>
      model = <local> ESPnetASRModel( 
                        (frontend): DefaultFrontend( 
                          (stft): Stft(n_fft=512, win_length=512, hop_length=160, center=True, normalized=False, onesided=True)
                          (frontend): Frontend()
                          (logmel): LogMel(sr=16000, n_fft=512, n_mels=80, fmin=0, fmax=8000.0, htk=False)
                        )
                        (specaug): SpecAug(
                          (t...
      speech = <not found>
      data = <local> Tensor{'data', [B?,T|'time'[B?]]}
      data.raw_tensor = <local> tensor[6, 195097] n=1170582 (4.5Mb) x∈[-0.051, 0.252] μ=-0.051 σ=0.252 cuda:1
      speech_lengths = <not found>
      data_spatial_dim = <local> Dim{'time'[B?]}
      data_spatial_dim.dyn_size = <local> tensor[6] i32 x∈[177840, 195097] μ=1.910e+05 σ=6.534e+03 [192369, 192809, 193073, 194745, 195097, 177840]
      text = <not found>
      targets = <local> Tensor{'classes', [B?,T|'out-spatial'[B?]], dtype='int32', sparse_dim=Dim{'vocab'(10025)}}
      targets.raw_tensor = <local> tensor[6, 42] i32 n=252 x∈[-1178451361, 1057658400] μ=-1.123e+08 σ=1.073e+09 cuda:1
      targets.raw_tensor.to = <local> <built-in method to of Tensor object at 0x7f709e1c61b0>
      torch = <local> <module 'torch' from '/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/__init__.py'>
      torch.int64 = <local> torch.int64
      text_lengths = <not found>
      targets_spatial_dim = <local> Dim{'out-spatial'[B?]}
      targets_spatial_dim.dyn_size = <local> tensor[6] i32 x∈[25, 42] μ=33.167 σ=5.565 [25, 42, 34, 35, 32, 31]
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in Module._wrapped_call_impl
    line: return self._call_impl(*args, **kwargs)
    locals:
      self = <local> ESPnetASRModel(
                       (frontend): DefaultFrontend(
                         (stft): Stft(n_fft=512, win_length=512, hop_length=160, center=True, normalized=False, onesided=True)
                         (frontend): Frontend()
                         (logmel): LogMel(sr=16000, n_fft=512, n_mels=80, fmin=0, fmax=8000.0, htk=False)
                       )
                       (specaug): SpecAug(
                         (t...
      self._call_impl = <local> <bound method Module._call_impl of ESPnetASRModel(
                                  (frontend): DefaultFrontend(
                                    (stft): Stft(n_fft=512, win_length=512, hop_length=160, center=True, normalized=False, onesided=True)
                                    (frontend): Frontend()
                                    (logmel): LogMel(sr=16000, n_fft=512, n_mels=80, fmin=0, fmax=8000.0, htk=Fals...
      args = <local> ()
      kwargs = <local> {'speech': tensor[6, 195097] n=1170582 (4.5Mb) x∈[-0.051, 0.252] μ=-0.051 σ=0.252 cuda:1, 'speech_lengths': tensor[6] i32 x∈[177840, 195097] μ=1.910e+05 σ=6.534e+03 [192369, 192809, 193073, 194745, 195097, 177840], 'text': tensor[6, 42] i64 n=252 (2.0Kb) x∈[-5019353784009342738, 45166044071332232
...
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in Module._call_impl
    line: return forward_call(*args, **kwargs)
    locals:
      forward_call = <local> <bound method ESPnetASRModel.forward of ESPnetASRModel(
                               (frontend): DefaultFrontend(
                                 (stft): Stft(n_fft=512, win_length=512, hop_length=160, center=True, normalized=False, onesided=True)
                                 (frontend): Frontend()
                                 (logmel): LogMel(sr=16000, n_fft=512, n_mels=80, fmin=0, fmax=8000.0, htk...
      args = <local> ()
      kwargs = <local> {'speech': tensor[6, 195097] n=1170582 (4.5Mb) x∈[-0.051, 0.252] μ=-0.051 σ=0.252 cuda:1, 'speech_lengths': tensor[6] i32 x∈[177840, 195097] μ=1.910e+05 σ=6.534e+03 [192369, 192809, 193073, 194745, 195097, 177840], 'text': tensor[6, 42] i64 n=252 (2.0Kb) x∈[-4922713914584930169, 45401658353551242
...
  File "/u/zeyer/setups/combined/2021-05-31/tools/espnet/espnet2/asr/espnet_model.py", line 237, in ESPnetASRModel.forward
    line: encoder_out, encoder_out_lens = self.encode(speech, speech_lengths)
    locals:
      encoder_out = <not found>
      encoder_out_lens = <not found>
      self = <local> ESPnetASRModel(
                       (frontend): DefaultFrontend(
                         (stft): Stft(n_fft=512, win_length=512, hop_length=160, center=True, normalized=False, onesided=True)
                         (frontend): Frontend()
                         (logmel): LogMel(sr=16000, n_fft=512, n_mels=80, fmin=0, fmax=8000.0, htk=False)
                       )
                       (specaug): SpecAug(
                         (t...
      self.encode = <local> <bound method ESPnetASRModel.encode of ESPnetASRModel(
                              (frontend): DefaultFrontend(
                                (stft): Stft(n_fft=512, win_length=512, hop_length=160, center=True, normalized=False, onesided=True)
                                (frontend): Frontend()
                                (logmel): LogMel(sr=16000, n_fft=512, n_mels=80, fmin=0, fmax=8000.0, htk=...
      speech = <local> tensor[6, 195097] n=1170582 (4.5Mb) x∈[-0.051, 0.252] μ=-0.051 σ=0.252 cuda:1
      speech_lengths = <local> tensor[6] i32 x∈[177840, 195097] μ=1.910e+05 σ=6.534e+03 [192369, 192809, 193073, 194745, 195097, 177840]
  File "/u/zeyer/setups/combined/2021-05-31/tools/espnet/espnet2/asr/espnet_model.py", line 382, in ESPnetASRModel.encode
    line: feats, feats_lengths = self.specaug(feats, feats_lengths)
    locals:
      feats = <local> tensor[6, 1220, 80] n=585600 (2.2Mb) x∈[-0.051, 0.252] μ=-0.051 σ=0.252 cuda:1
      feats_lengths = <local> tensor[6] i32 x∈[1112, 1220] μ=1.194e+03 σ=40.913 [1203, 1206, 1207, 1218, 1220, 1112]
      self = <local> ESPnetASRModel(
                       (frontend): DefaultFrontend(
                         (stft): Stft(n_fft=512, win_length=512, hop_length=160, center=True, normalized=False, onesided=True)
                         (frontend): Frontend()
                         (logmel): LogMel(sr=16000, n_fft=512, n_mels=80, fmin=0, fmax=8000.0, htk=False)
                       )
                       (specaug): SpecAug(
                         (t...
      self.specaug = <local> SpecAug(
                               (time_warp): TimeWarp(window=5, mode=bicubic)
                               (freq_mask): MaskAlongAxis(mask_width_range=[0, 27], num_mask=2, axis=freq)
                               (time_mask): MaskAlongAxisVariableMaxWidth(mask_width_ratio_range=[0.0, 0.05], num_mask=10, axis=time)
                             )
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in Module._wrapped_call_impl
    line: return self._call_impl(*args, **kwargs)
    locals:
      self = <local> SpecAug(
                       (time_warp): TimeWarp(window=5, mode=bicubic)
                       (freq_mask): MaskAlongAxis(mask_width_range=[0, 27], num_mask=2, axis=freq)
                       (time_mask): MaskAlongAxisVariableMaxWidth(mask_width_ratio_range=[0.0, 0.05], num_mask=10, axis=time)
                     )
      self._call_impl = <local> <bound method Module._call_impl of SpecAug(
                                  (time_warp): TimeWarp(window=5, mode=bicubic)
                                  (freq_mask): MaskAlongAxis(mask_width_range=[0, 27], num_mask=2, axis=freq)
                                  (time_mask): MaskAlongAxisVariableMaxWidth(mask_width_ratio_range=[0.0, 0.05], num_mask=10, axis=time)
                                )>
      args = <local> (tensor[6, 1220, 80] n=585600 (2.2Mb) x∈[-0.051, 0.252] μ=-0.051 σ=0.252 cuda:1, tensor[6] i32 x∈[1112, 1220] μ=1.194e+03 σ=40.913 [1203, 1206, 1207, 1218, 1220, 1112])
      kwargs = <local> {}
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in Module._call_impl
    line: return forward_call(*args, **kwargs)
    locals:
      forward_call = <local> <bound method SpecAug.forward of SpecAug(
                               (time_warp): TimeWarp(window=5, mode=bicubic)
                               (freq_mask): MaskAlongAxis(mask_width_range=[0, 27], num_mask=2, axis=freq)
                               (time_mask): MaskAlongAxisVariableMaxWidth(mask_width_ratio_range=[0.0, 0.05], num_mask=10, axis=time)
                             )>
      args = <local> (tensor[6, 1220, 80] n=585600 (2.2Mb) x∈[-0.051, 0.252] μ=-0.051 σ=0.252 cuda:1, tensor[6] i32 x∈[1112, 1220] μ=1.194e+03 σ=40.913 [1203, 1206, 1207, 1218, 1220, 1112])
      kwargs = <local> {}
  File "/u/zeyer/setups/combined/2021-05-31/tools/espnet/espnet2/asr/specaug/specaug.py", line 93, in SpecAug.forward
    line: x, x_lengths = self.freq_mask(x, x_lengths)
    locals:
      x = <local> tensor[6, 1220, 80] n=585600 (2.2Mb) x∈[-0.051, 0.252] μ=-0.051 σ=0.252 cuda:1
      x_lengths = <local> tensor[6] i32 x∈[1112, 1220] μ=1.194e+03 σ=40.913 [1203, 1206, 1207, 1218, 1220, 1112]
      self = <local> SpecAug(
                       (time_warp): TimeWarp(window=5, mode=bicubic)
                       (freq_mask): MaskAlongAxis(mask_width_range=[0, 27], num_mask=2, axis=freq)
                       (time_mask): MaskAlongAxisVariableMaxWidth(mask_width_ratio_range=[0.0, 0.05], num_mask=10, axis=time)
                     )
      self.freq_mask = <local> MaskAlongAxis(mask_width_range=[0, 27], num_mask=2, axis=freq)
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in Module._wrapped_call_impl
    line: return self._call_impl(*args, **kwargs)
    locals:
      self = <local> MaskAlongAxis(mask_width_range=[0, 27], num_mask=2, axis=freq)
      self._call_impl = <local> <bound method Module._call_impl of MaskAlongAxis(mask_width_range=[0, 27], num_mask=2, axis=freq)>
      args = <local> (tensor[6, 1220, 80] n=585600 (2.2Mb) x∈[-0.051, 0.252] μ=-0.051 σ=0.252 cuda:1, tensor[6] i32 x∈[1112, 1220] μ=1.194e+03 σ=40.913 [1203, 1206, 1207, 1218, 1220, 1112])
      kwargs = <local> {}
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in Module._call_impl
    line: return forward_call(*args, **kwargs)
    locals:
      forward_call = <local> <bound method MaskAlongAxis.forward of MaskAlongAxis(mask_width_range=[0, 27], num_mask=2, axis=freq)>
      args = <local> (tensor[6, 1220, 80] n=585600 (2.2Mb) x∈[-0.051, 0.252] μ=-0.051 σ=0.252 cuda:1, tensor[6] i32 x∈[1112, 1220] μ=1.194e+03 σ=40.913 [1203, 1206, 1207, 1218, 1220, 1112])
      kwargs = <local> {}
  File "/u/zeyer/setups/combined/2021-05-31/tools/espnet/espnet2/layers/mask_along_axis.py", line 122, in MaskAlongAxis.forward
    line: return mask_along_axis(
              spec,
              spec_lengths,
              mask_width_range=self.mask_width_range,
              dim=self.dim,
              num_mask=self.num_mask,
              replace_with_zero=self.replace_with_zero,
          )
    locals:
      mask_along_axis = <global> <function mask_along_axis at 0x7f70f6c4e8e0>
      spec = <local> tensor[6, 1220, 80] n=585600 (2.2Mb) x∈[-0.051, 0.252] μ=-0.051 σ=0.252 cuda:1
      spec_lengths = <local> tensor[6] i32 x∈[1112, 1220] μ=1.194e+03 σ=40.913 [1203, 1206, 1207, 1218, 1220, 1112]
      mask_width_range = <not found>
      self = <local> MaskAlongAxis(mask_width_range=[0, 27], num_mask=2, axis=freq)
      self.mask_width_range = <local> [0, 27]
      dim = <not found>
      self.dim = <local> 2
      num_mask = <not found>
      self.num_mask = <local> 2
      replace_with_zero = <not found>
      self.replace_with_zero = <local> True
  File "/u/zeyer/setups/combined/2021-05-31/tools/espnet/espnet2/layers/mask_along_axis.py", line 41, in mask_along_axis
    line: mask_pos = torch.randint(
              0, max(1, D - mask_length.max()), (B, num_mask), device=spec.device
          ).unsqueeze(2)
    locals:
      mask_pos = <not found>
      torch = <global> <module 'torch' from '/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/__init__.py'>
      torch.randint = <global> <built-in method randint of type object at 0x7f71cebaeaa0>
      max = <builtin> <built-in function max>
      D = <local> 80
      mask_length = <local> tensor[6, 2, 1] i64 n=12 x∈[-4825293490701537652, 4474139002465713060] μ=-9.419e+17 σ=4.744e+18 cuda:1
      mask_length.max = <local> <built-in method max of Tensor object at 0x7f709e1c4d70>
      B = <local> 6
      num_mask = <local> 2
      device = <not found>
      spec = <local> tensor[6, 1220, 80] n=585600 (2.2Mb) x∈[-0.051, 0.252] μ=-0.051 σ=0.252 cuda:1
      spec.device = <local> device(type='cuda', index=1)
      unsqueeze = <not found>
RuntimeError: random_ expects 'from' to be less than 'to', but got from=0 >= to=-4752614986133393697

Module call stack:
(ESPnetASRModel.forward) (root)
(ESPnetASRModel.encode) (root)
(SpecAug.forward) specaug
(MaskAlongAxis.forward) specaug.freq_mask

albertz · 2024-01-21T11:15:32Z

It might be a hardware issue. On that node, I now get:

zeyer@cn-244 ~ % nvidia-smi                        
Unable to determine the device handle for GPU0000:03:00.0: Unknown Error

And dmesg:

[  +0.000844] nvidia 0000:03:00.0: AER: can't recover (no error_detected callback)
[  +0.000001] snd_hda_intel 0000:03:00.1: AER: can't recover (no error_detected callback)
[  +0.000008] pcieport 0000:00:03.0: AER: device recovery failed
[  +0.000001] pcieport 0000:00:03.0: AER: Multiple Uncorrected (Fatal) error received: 0000:00:03.0
[  +0.000004] pcieport 0000:00:03.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
[  +0.000898] pcieport 0000:00:03.0:   device [8086:6f08] error status/mask=00004020/00000000
[  +0.000889] pcieport 0000:00:03.0:    [ 5] SDES                  
[  +0.000894] pcieport 0000:00:03.0:    [14] CmpltTO                (First)
[  +0.000921] nvidia 0000:03:00.0: AER: can't recover (no error_detected callback)
[  +0.000002] snd_hda_intel 0000:03:00.1: AER: can't recover (no error_detected callback)
[  +1.050439] pcieport 0000:00:03.0: AER: Root Port link has been reset (0)
[  +0.000041] pcieport 0000:00:03.0: AER: device recovery failed

albertz · 2024-01-21T12:33:03Z

I was just looking at the code. There is:

mask_length = torch.randint(
    mask_width_range[0],
    mask_width_range[1],
    (B, num_mask),
    device=spec.device,
)

And it calls the function like this (as you see from the stacktrace):

    line: return mask_along_axis(
              spec,
              spec_lengths,
              mask_width_range=self.mask_width_range,
              dim=self.dim,
              num_mask=self.num_mask,
              replace_with_zero=self.replace_with_zero,
          )
    locals:
      spec = <local> tensor[6, 1220, 80] n=585600 (2.2Mb) x∈[-0.051, 0.252] μ=-0.051 σ=0.252 cuda:1
      spec_lengths = <local> tensor[6] i32 x∈[1112, 1220] μ=1.194e+03 σ=40.913 [1203, 1206, 1207, 1218, 1220, 1112]
      self = <local> MaskAlongAxis(mask_width_range=[0, 27], num_mask=2, axis=freq)
      self.mask_width_range = <local> [0, 27]
      self.dim = <local> 2
      self.num_mask = <local> 2
      self.replace_with_zero = <local> True

So mask_length should only contain values between 0 and 26 (inclusive), since the upper bound of torch.randint is exclusive.

But then you see later in the stacktrace:

      mask_length = <local> tensor[6, 2, 1] i64 n=12 x∈[-4825293490701537652, 4474139002465713060] μ=-9.419e+17 σ=4.744e+18 cuda:1

So I guess it's clear that this is a hardware issue, and we can close this.
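
The argument can be checked with plain integer arithmetic. The following is a minimal sketch, not ESPnet code: `randint_upper_bound` is a hypothetical helper that mirrors the `max(1, D - mask_length.max())` expression from the stack trace on Python ints, and the constants are copied from the locals printed there. Those locals were captured after the exception was raised, so they may not be the exact values that were in GPU memory when the expression was evaluated.

```python
# Values copied from the stack trace in this issue.
D = 80                       # size of the frequency axis (spec is [6, 1220, 80])
MASK_WIDTH_RANGE = (0, 27)   # torch.randint excludes the upper bound


def randint_upper_bound(dim_size: int, max_mask_length: int) -> int:
    """Hypothetical helper: max(1, D - mask_length.max()) on plain ints."""
    return max(1, dim_size - max_mask_length)


# For every valid mask length (0..26) the bound stays positive.
valid_bounds = [randint_upper_bound(D, m) for m in range(*MASK_WIDTH_RANGE)]
assert min(valid_bounds) == 54 and max(valid_bounds) == 80

# Even with the garbage maximum printed for mask_length, the clamp would yield 1.
garbage_max = 4474139002465713060
assert D - garbage_max == -4474139002465712980
assert randint_upper_bound(D, garbage_max) == 1

# The bound reported in the RuntimeError is negative and matches neither case.
reported_to = -4752614986133393697
assert reported_to < 0
assert reported_to != D - garbage_max
```

If the printed values can be trusted, neither the out-of-range mask_length nor the negative `to` can come from a correct evaluation of this code, which fits the hardware explanation.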

albertz added the Bug (bug should be fixed) label Jan 21, 2024

albertz mentioned this issue Jan 21, 2024

Hang rwth-i6/returnn#1500

Closed

albertz closed this as completed Jan 21, 2024
