amdgpu_plugin: Failed to dump (ret:-22) #2248

rst0git · 2023-08-20T17:08:26Z

The following two errors occur when checkpointing GPU applications with the AMD GPU plugin for CRIU.

When checkpointing a CRI-O container running an AlexNet CNN on an Ubuntu 20.04 system with an MI100 GPU, CRIU fails with

(00.208098) amdgpu_plugin: Thread[0x5bb8] started
(00.208503) amdgpu_plugin: amdgpu-pages-252-5bb8.img:Opened file for write with size:33158160384
(02.766607) Error (criu/parasite-syscall.c:88): si_code=2 si_pid=1752711 si_status=9
(02.767880) Error (criu/parasite-syscall.c:93): 1752767 was killed by 9 unexpectedly: Killed

K8s yaml file: alexnet.yaml
Full CRIU log file: criu.log
Hardware configuration: lshw.txt

When attempting to checkpoint CRI-O containers on a CentOS 9 system with two MI210 GPUs, CRIU fails with
(the CRIU logs were generated with the patch below, which prints the values of id_map->src and src_id)

(00.171135) amdgpu_plugin: Number of CPUs:3 GPUs:1
(00.171143) id_map->src: 9704; id_map->dest: 9704; src_id: 39309
(00.171147) Error (amdgpu_plugin.c:322): amdgpu_plugin: maps_get_dest_gpu failed 0
(00.171157) amdgpu_plugin: Dumped devices Failed (ret:-22)
(00.171179) amdgpu_plugin: Process unpaused Ok (ret:0)
(00.171243) Error (amdgpu_plugin.c:1456): amdgpu_plugin: Failed to dump (ret:-22)
(00.171296) ----------------------------------------
(00.171417) Error (criu/cr-dump.c:1669): Dump files (pid: 646845) failed with -1

K8s yaml file: binomial-option.yaml
Full CRIU log file: criu.log
Hardware configuration: lshw.txt

--- a/plugins/amdgpu/amdgpu_plugin_topology.c
+++ b/plugins/amdgpu/amdgpu_plugin_topology.c
@@ -265,6 +265,7 @@ uint32_t maps_get_dest_gpu(const struct device_maps *maps, const uint32_t src_id
        struct id_map *id_map;
 
        list_for_each_entry(id_map, &maps->gpu_maps, listm) {
+               pr_debug("id_map->src: %d; id_map->dest: %d; src_id: %d\n", id_map->src, id_map->dest, src_id);
                if (id_map->src == src_id)
                        return id_map->dest;
        }

In both cases, we use Kubernetes v1.27.4, CRI-O v1.26.0, AMD GPU device plugin, and CRIU compiled from the criu-dev branch.
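
To make the second failure concrete, here is a minimal, standalone sketch of the lookup that the debug patch instruments. It is not the plugin's code: `struct gpu_id_pair` and `lookup_dest_gpu` are hypothetical names, an array stands in for the plugin's linked list, and returning 0 for "no match" is an assumption based on the `maps_get_dest_gpu failed 0` log line.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for one src -> dest GPU ID mapping. */
struct gpu_id_pair {
	uint32_t src;
	uint32_t dest;
};

/*
 * Return the dest ID mapped to src_id, or 0 when no entry matches.
 * (0 as the "not found" value is an assumption, not confirmed plugin behavior.)
 */
static uint32_t lookup_dest_gpu(const struct gpu_id_pair *pairs, size_t n, uint32_t src_id)
{
	for (size_t i = 0; i < n; i++) {
		if (pairs[i].src == src_id)
			return pairs[i].dest;
	}
	return 0;
}
```

With the values from the log, the map holds a single entry (src 9704, dest 9704), so a lookup for src_id 39309 finds nothing and the dump is aborted.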


github-actions · 2023-09-20T00:37:17Z

A friendly reminder that this issue had no activity for 30 days.

fdavid-amd · 2023-10-10T18:34:37Z

CRIU gets data about the system's GPUs from two places: the PROCESS_INFO CRIU ioctl and the /sys/class/kfd/kfd/topology sysfs entry. Somehow, on this system, the two disagree about how many devices there are and what their IDs are.
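
One way to see what the sysfs side reports is to list the `gpu_id` file of every topology node and compare the result with the IDs in the CRIU log. The sketch below is a rough illustration, not code from the plugin: `read_topology_gpu_ids` is a hypothetical helper, the `nodes/<N>/gpu_id` layout under the topology directory is assumed, and it assumes CPU-only nodes report a `gpu_id` of 0.

```c
#include <dirent.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Collect the non-zero gpu_id values found in <topology_root>/nodes/<N>/gpu_id.
 * Returns the number of IDs stored in ids (at most max_ids), or -1 if the
 * nodes directory cannot be opened. Nodes with gpu_id 0 are skipped on the
 * assumption that they are CPU-only nodes.
 */
static int read_topology_gpu_ids(const char *topology_root, uint32_t *ids, int max_ids)
{
	char path[512];
	struct dirent *ent;
	DIR *dir;
	int count = 0;

	snprintf(path, sizeof(path), "%s/nodes", topology_root);
	dir = opendir(path);
	if (!dir)
		return -1;

	while (count < max_ids && (ent = readdir(dir)) != NULL) {
		unsigned int gpu_id = 0;
		FILE *f;

		/* Skip "." and ".." (and any other hidden entries). */
		if (ent->d_name[0] == '.')
			continue;

		snprintf(path, sizeof(path), "%s/nodes/%s/gpu_id", topology_root, ent->d_name);
		f = fopen(path, "r");
		if (!f)
			continue;
		if (fscanf(f, "%u", &gpu_id) == 1 && gpu_id != 0)
			ids[count++] = gpu_id;
		fclose(f);
	}

	closedir(dir);
	return count;
}
```

Calling it with `/sys/class/kfd/kfd/topology` from inside the container and again on the host would show whether both see the same set of IDs as the `id_map->src` and `src_id` values printed in the log.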

rst0git added gpu/amd bug labels Aug 20, 2023

github-actions bot added the stale-issue label Sep 20, 2023

rst0git added no-auto-close Don't auto-close as a stale issue and removed stale-issue labels Sep 20, 2023

rst0git mentioned this issue Oct 4, 2023

Clarify behavioral guarantees for plugin api #2277

Open

avagin assigned dayatsin-amd and fdavid-amd Oct 4, 2023
