amdgpu_plugin: Failed to dump (ret:-22) #2248

Open
rst0git opened this issue Aug 20, 2023 · 2 comments
Labels
bug gpu/amd no-auto-close Don't auto-close as a stale issue

Comments

@rst0git
Member

rst0git commented Aug 20, 2023

The following two errors occur when checkpointing GPU applications with the AMD GPU plugin for CRIU.

  1. When checkpointing a CRI-O container running an AlexNet CNN on an Ubuntu 20.04 system with an MI100 GPU, CRIU fails with:
(00.208098) amdgpu_plugin: Thread[0x5bb8] started
(00.208503) amdgpu_plugin: amdgpu-pages-252-5bb8.img:Opened file for write with size:33158160384
(02.766607) Error (criu/parasite-syscall.c:88): si_code=2 si_pid=1752711 si_status=9
(02.767880) Error (criu/parasite-syscall.c:93): 1752767 was killed by 9 unexpectedly: Killed

K8s yaml file: alexnet.yaml
Full CRIU log file: criu.log
Hardware configuration: lshw.txt

  2. When attempting to checkpoint CRI-O containers on a CentOS 9 system with two MI210 GPUs, CRIU fails with the following errors
    (the CRIU logs were generated with the patch below to show the values of id_map->src, id_map->dest, and src_id):
(00.171135) amdgpu_plugin: Number of CPUs:3 GPUs:1
(00.171143) id_map->src: 9704; id_map->dest: 9704; src_id: 39309
(00.171147) Error (amdgpu_plugin.c:322): amdgpu_plugin: maps_get_dest_gpu failed 0
(00.171157) amdgpu_plugin: Dumped devices Failed (ret:-22)
(00.171179) amdgpu_plugin: Process unpaused Ok (ret:0)
(00.171243) Error (amdgpu_plugin.c:1456): amdgpu_plugin: Failed to dump (ret:-22)
(00.171296) ----------------------------------------
(00.171417) Error (criu/cr-dump.c:1669): Dump files (pid: 646845) failed with -1

K8s yaml file: binomial-option.yaml
Full CRIU log file: criu.log
Hardware configuration: lshw.txt

--- a/plugins/amdgpu/amdgpu_plugin_topology.c
+++ b/plugins/amdgpu/amdgpu_plugin_topology.c
@@ -265,6 +265,7 @@ uint32_t maps_get_dest_gpu(const struct device_maps *maps, const uint32_t src_id
        struct id_map *id_map;
 
        list_for_each_entry(id_map, &maps->gpu_maps, listm) {
+               pr_debug("id_map->src: %d; id_map->dest: %d; src_id: %d\n", id_map->src, id_map->dest, src_id);
                if (id_map->src == src_id)
                        return id_map->dest;
        }

In both cases, we use Kubernetes v1.27.4, CRI-O v1.26.0, AMD GPU device plugin, and CRIU compiled from the criu-dev branch.

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@fdavid-amd
Contributor

CRIU gets data about the system's GPUs from two places: the PROCESS_INFO CRIU ioctl and the /sys/class/kfd/kfd/topology sysfs entry. Somehow, on this system, the two disagree about how many devices there are and what their IDs are.
