
[BUG] GPU resources are not released properly when a DeepSpeed run is canceled (GPU: A100 40G, Driver Version: 535.54.03, CUDA Version: 12.2) #4003

@Dripman

Description

  1. Problem: GPU resources are not released properly when a DeepSpeed run is canceled.
  2. Environment: A100 40G GPU, Driver Version: 535.54.03, CUDA Version: 12.2
  3. Kernel log:
    Jul 20 14:54:55 gpu02 kernel: INFO: rcu_sched detected stalls on CPUs/tasks: { 17} (detected by 40, t=60003 jiffies, g=86287, c=86286, q=237653)
    Jul 20 14:54:55 gpu02 kernel: Task dump for CPU 17:
    Jul 20 14:54:55 gpu02 kernel: AwsEventLoop 50 R running task 0 54782 1 0x0000008e
    Jul 20 14:54:55 gpu02 kernel: Call Trace:
    Jul 20 14:54:55 gpu02 kernel: [] unmap_sg+0x5f/0x70
    Jul 20 14:54:55 gpu02 kernel: [] nv_unmap_dma_map_scatterlist+0x8e/0xb0 [nvidia]
    Jul 20 14:54:55 gpu02 kernel: [] nv_dma_unmap_pages+0x115/0x120 [nvidia]
    Jul 20 14:54:55 gpu02 kernel: [] nv_dma_unmap_alloc+0x3d/0x60 [nvidia]
    Jul 20 14:54:55 gpu02 kernel: [] _nv039629rm+0xc5/0x1d0 [nvidia]
    Jul 20 14:54:55 gpu02 kernel: [] ? _nv031422rm+0x6f/0x90 [nvidia]
    Jul 20 14:54:55 gpu02 kernel: [] ? _nv034869rm+0x134/0x3a0 [nvidia]
    Jul 20 14:54:55 gpu02 kernel: [] ? _nv034834rm+0x6b/0x130 [nvidia]
    Jul 20 14:54:55 gpu02 kernel: [] ? _nv002511rm+0xd/0x20 [nvidia]
    Jul 20 14:54:55 gpu02 kernel: [] ? _nv004074rm+0x19/0xb0 [nvidia]
    Jul 20 14:54:55 gpu02 kernel: [] ? _nv016053rm+0x51c/0x620 [nvidia]
    Jul 20 14:54:55 gpu02 kernel: [] ? _nv043216rm+0xab/0xe0 [nvidia]
    Jul 20 14:54:55 gpu02 kernel: [] ? _nv044933rm+0xac/0x130 [nvidia]
    Jul 20 14:54:55 gpu02 kernel: [] ? _nv044932rm+0x2ef/0x690 [nvidia]
    Jul 20 14:54:55 gpu02 kernel: [] ? _nv044928rm+0xa7/0x1a0 [nvidia]
    Jul 20 14:54:55 gpu02 kernel: [] ? rm_cleanup_file_private+0x1db/0x200 [nvidia]
    Jul 20 14:54:55 gpu02 kernel: [] ? nvidia_close+0x13b/0x2f0 [nvidia]
    Jul 20 14:54:55 gpu02 kernel: [] ? nvidia_frontend_close+0x2f/0x50 [nvidia]
    Jul 20 14:54:55 gpu02 kernel: [] ? __fput+0xec/0x260
    Jul 20 14:54:55 gpu02 kernel: [] ? ____fput+0xe/0x10
    Jul 20 14:54:55 gpu02 kernel: [] ? task_work_run+0xbb/0xe0
    Jul 20 14:54:55 gpu02 kernel: [] ? do_exit+0x2d4/0xa50
    Jul 20 14:54:55 gpu02 kernel: [] ? timerqueue_del+0x24/0x70
    Jul 20 14:54:55 gpu02 kernel: [] ? __remove_hrtimer+0x3f/0xb0
    Jul 20 14:54:55 gpu02 kernel: [] ? do_group_exit+0x3f/0xa0
    Jul 20 14:54:55 gpu02 kernel: [] ? get_signal_to_deliver+0x1ce/0x5e0
    Jul 20 14:54:55 gpu02 kernel: [] ? do_signal+0x57/0x6f0
    Jul 20 14:54:55 gpu02 kernel: [] ? ep_poll+0x31e/0x360
    Jul 20 14:54:55 gpu02 kernel: [] ? wake_up_state+0x20/0x20
    Jul 20 14:54:55 gpu02 kernel: [] ? do_notify_resume+0x72/0xc0
    Jul 20 14:54:55 gpu02 kernel: [] ? int_signal+0x12/0x17
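The trace above is easier to read once the frames are separated by module. The following is a minimal sketch, not part of the original report, that pulls the function names out of log lines in this format and marks which ones belong to the `[nvidia]` module. The names `FRAME_RE`, `parse_frames`, and `SAMPLE` are hypothetical helpers introduced only for illustration, and the regular expression assumes the exact line layout shown in this log (including the empty `[]` address field).

```python
import re

# Matches the frame part of a kernel call-trace line in the layout above, e.g.
#   "kernel: [] ? nvidia_close+0x13b/0x2f0 [nvidia]"
# Assumption: the address field is whatever sits between the first brackets.
FRAME_RE = re.compile(
    r"kernel: \[[^\]]*\]\s+(?P<uncertain>\? )?"
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)\])?"
)

# A few lines copied from the log above, used as sample input.
SAMPLE = """\
Jul 20 14:54:55 gpu02 kernel: Call Trace:
Jul 20 14:54:55 gpu02 kernel: [] unmap_sg+0x5f/0x70
Jul 20 14:54:55 gpu02 kernel: [] nv_unmap_dma_map_scatterlist+0x8e/0xb0 [nvidia]
Jul 20 14:54:55 gpu02 kernel: [] ? nvidia_close+0x13b/0x2f0 [nvidia]
Jul 20 14:54:55 gpu02 kernel: [] ? do_exit+0x2d4/0xa50
"""


def parse_frames(log_text):
    """Return (function, module, reliable) tuples for each stack frame line.

    module is "kernel" when the line carries no [module] suffix.
    reliable is False for frames printed with a leading "?".
    """
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append((
                match.group("func"),
                match.group("module") or "kernel",
                match.group("uncertain") is None,
            ))
    return frames


if __name__ == "__main__":
    for func, module, reliable in parse_frames(SAMPLE):
        marker = " " if reliable else "?"
        print(f"{marker} {module:8s} {func}")
```

Run against the full log, this kind of summary makes the sequence visible at a glance: signal delivery (`do_signal`, `get_signal_to_deliver`) leads to `do_exit`, which closes the NVIDIA device file (`nvidia_close`, `rm_cleanup_file_private`), and the CPU is reported stalled in the driver's DMA unmap path (`nv_dma_unmap_pages`, `unmap_sg`).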


Labels: bug, training
