Open
Description
- GPU resources are not released properly when a DeepSpeed run is canceled.
- env: Driver Version: 535.54.03, CUDA Version: 12.2
- log
Jul 20 14:54:55 gpu02 kernel: INFO: rcu_sched detected stalls on CPUs/tasks: { 17} (detected by 40, t=60003 jiffies, g=86287, c=86286, q=237653)
Jul 20 14:54:55 gpu02 kernel: Task dump for CPU 17:
Jul 20 14:54:55 gpu02 kernel: AwsEventLoop 50 R running task 0 54782 1 0x0000008e
Jul 20 14:54:55 gpu02 kernel: Call Trace:
Jul 20 14:54:55 gpu02 kernel: [] unmap_sg+0x5f/0x70
Jul 20 14:54:55 gpu02 kernel: [] nv_unmap_dma_map_scatterlist+0x8e/0xb0 [nvidia]
Jul 20 14:54:55 gpu02 kernel: [] nv_dma_unmap_pages+0x115/0x120 [nvidia]
Jul 20 14:54:55 gpu02 kernel: [] nv_dma_unmap_alloc+0x3d/0x60 [nvidia]
Jul 20 14:54:55 gpu02 kernel: [] _nv039629rm+0xc5/0x1d0 [nvidia]
Jul 20 14:54:55 gpu02 kernel: [] ? _nv031422rm+0x6f/0x90 [nvidia]
Jul 20 14:54:55 gpu02 kernel: [] ? _nv034869rm+0x134/0x3a0 [nvidia]
Jul 20 14:54:55 gpu02 kernel: [] ? _nv034834rm+0x6b/0x130 [nvidia]
Jul 20 14:54:55 gpu02 kernel: [] ? _nv002511rm+0xd/0x20 [nvidia]
Jul 20 14:54:55 gpu02 kernel: [] ? _nv004074rm+0x19/0xb0 [nvidia]
Jul 20 14:54:55 gpu02 kernel: [] ? _nv016053rm+0x51c/0x620 [nvidia]
Jul 20 14:54:55 gpu02 kernel: [] ? _nv043216rm+0xab/0xe0 [nvidia]
Jul 20 14:54:55 gpu02 kernel: [] ? _nv044933rm+0xac/0x130 [nvidia]
Jul 20 14:54:55 gpu02 kernel: [] ? _nv044932rm+0x2ef/0x690 [nvidia]
Jul 20 14:54:55 gpu02 kernel: [] ? _nv044928rm+0xa7/0x1a0 [nvidia]
Jul 20 14:54:55 gpu02 kernel: [] ? rm_cleanup_file_private+0x1db/0x200 [nvidia]
Jul 20 14:54:55 gpu02 kernel: [] ? nvidia_close+0x13b/0x2f0 [nvidia]
Jul 20 14:54:55 gpu02 kernel: [] ? nvidia_frontend_close+0x2f/0x50 [nvidia]
Jul 20 14:54:55 gpu02 kernel: [] ? __fput+0xec/0x260
Jul 20 14:54:55 gpu02 kernel: [] ? ____fput+0xe/0x10
Jul 20 14:54:55 gpu02 kernel: [] ? task_work_run+0xbb/0xe0
Jul 20 14:54:55 gpu02 kernel: [] ? do_exit+0x2d4/0xa50
Jul 20 14:54:55 gpu02 kernel: [] ? timerqueue_del+0x24/0x70
Jul 20 14:54:55 gpu02 kernel: [] ? __remove_hrtimer+0x3f/0xb0
Jul 20 14:54:55 gpu02 kernel: [] ? do_group_exit+0x3f/0xa0
Jul 20 14:54:55 gpu02 kernel: [] ? get_signal_to_deliver+0x1ce/0x5e0
Jul 20 14:54:55 gpu02 kernel: [] ? do_signal+0x57/0x6f0
Jul 20 14:54:55 gpu02 kernel: [] ? ep_poll+0x31e/0x360
Jul 20 14:54:55 gpu02 kernel: [] ? wake_up_state+0x20/0x20
Jul 20 14:54:55 gpu02 kernel: [] ? do_notify_resume+0x72/0xc0
Jul 20 14:54:55 gpu02 kernel: [] ? int_signal+0x12/0x17
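To make the trace above easier to read, here is a minimal sketch that pulls the symbol, owning module, and reliability marker out of each call-trace line. The names `FRAME_RE` and `parse_frames` are hypothetical helpers written for this illustration and are not part of DeepSpeed or the NVIDIA driver; the regular expression assumes the syslog layout shown in this report (with the frame addresses already stripped to `[]`), and the sample lines are copied from the log above.

```python
import re

# Matches call-trace frames such as:
#   "... kernel: [] ? nvidia_close+0x13b/0x2f0 [nvidia]"
# The bracketed address may be empty, as it is in this report.
FRAME_RE = re.compile(
    r"kernel: \[[^\]]*\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(log_text):
    """Return (symbol, module, reliable) tuples, innermost frame first.

    Frames without a "[module]" suffix are attributed to the core kernel.
    A leading "?" means the unwinder could not verify the frame.
    """
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append((
                match.group("symbol"),
                match.group("module") or "kernel",
                match.group("unreliable") is None,
            ))
    return frames


# A few lines copied from the log in this report.
SAMPLE = """\
Jul 20 14:54:55 gpu02 kernel: Call Trace:
Jul 20 14:54:55 gpu02 kernel: [] unmap_sg+0x5f/0x70
Jul 20 14:54:55 gpu02 kernel: [] nv_unmap_dma_map_scatterlist+0x8e/0xb0 [nvidia]
Jul 20 14:54:55 gpu02 kernel: [] ? nvidia_close+0x13b/0x2f0 [nvidia]
Jul 20 14:54:55 gpu02 kernel: [] ? do_exit+0x2d4/0xa50
Jul 20 14:54:55 gpu02 kernel: [] ? do_signal+0x57/0x6f0
"""

if __name__ == "__main__":
    for symbol, module, reliable in parse_frames(SAMPLE):
        marker = " " if reliable else "?"
        print(f"{marker} {module:8s} {symbol}")
```

Read from bottom to top, the frames suggest that the task was handling a signal (`do_signal`), entered process exit (`do_exit`), and was closing its NVIDIA device file (`nvidia_close`, `rm_cleanup_file_private`) when CPU 17 stalled while unmapping DMA pages (`unmap_sg`). The frames marked `?` were not verified by the unwinder, so this reading is tentative.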