When attempting to checkpoint CRI-O containers on a CentOS 9 system with two MI210 GPUs, CRIU fails with the following error
(the CRIU logs were generated with the patch below to show the values of id_map->src and src_id):
(00.171135) amdgpu_plugin: Number of CPUs:3 GPUs:1
(00.171143) id_map->src: 9704; id_map->dest: 9704; src_id: 39309
(00.171147) Error (amdgpu_plugin.c:322): amdgpu_plugin: maps_get_dest_gpu failed 0
(00.171157) amdgpu_plugin: Dumped devices Failed (ret:-22)
(00.171179) amdgpu_plugin: Process unpaused Ok (ret:0)
(00.171243) Error (amdgpu_plugin.c:1456): amdgpu_plugin: Failed to dump (ret:-22)
(00.171296) ----------------------------------------
(00.171417) Error (criu/cr-dump.c:1669): Dump files (pid: 646845) failed with -1
CRIU gets data about the system's GPUs from two places: the PROCESS_INFO CRIU ioctl and the /sys/class/kfd/kfd/topology sysfs entry. Somehow, on this system, the two disagree about how many devices there are and what their IDs are: the log shows a map entry for ID 9704 but a lookup for src_id 39309.
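To see what the sysfs side reports, a small script along these lines can list the GPU IDs under the topology directory and check whether a given ID is among them. This is only a sketch, not part of CRIU or the plugin: the function names `read_kfd_gpu_ids` and `unmapped_ids` are made up for illustration, and it assumes the usual KFD layout where each `nodes/<N>/` directory contains a `gpu_id` file that reads 0 for CPU-only nodes. It does not query the PROCESS_INFO ioctl; the ID to check is taken from the log by hand.

```python
from pathlib import Path

# Assumed standard location of the KFD topology nodes in sysfs.
KFD_NODES = Path("/sys/class/kfd/kfd/topology/nodes")


def read_kfd_gpu_ids(nodes_dir=KFD_NODES):
    """Return {node_name: gpu_id} for each readable topology node.

    CPU-only nodes are expected to report gpu_id 0. Nodes whose gpu_id
    file is missing, unreadable, or not an integer are skipped.
    """
    ids = {}
    for node in sorted(Path(nodes_dir).iterdir()):
        try:
            ids[node.name] = int((node / "gpu_id").read_text().strip())
        except (OSError, ValueError):
            continue
    return ids


def unmapped_ids(topology_ids, process_ids):
    """Return the IDs in process_ids that match no GPU in the topology."""
    known = {gpu_id for gpu_id in topology_ids.values() if gpu_id != 0}
    return sorted(set(process_ids) - known)


if __name__ == "__main__" and KFD_NODES.is_dir():
    topo = read_kfd_gpu_ids()
    gpus = [g for g in topo.values() if g != 0]
    print(f"nodes: {len(topo)}, GPUs: {len(gpus)}, gpu_ids: {gpus}")
    # 39309 is the src_id printed in the CRIU log above.
    print("IDs missing from sysfs topology:", unmapped_ids(topo, [39309]))
```

Running it both on the host and inside the container being checkpointed may help show whether the two environments see different sets of GPU IDs.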
The following two errors occur when checkpointing GPU applications with the AMD GPU plugin for CRIU.
K8s yaml file: alexnet.yaml
Full CRIU log file: criu.log
Hardware configuration: lshw.txt
(the CRIU logs were generated with the patch below to show the values of id_map->src and src_id)
K8s yaml file: binomial-option.yaml
Full CRIU log file: criu.log
Hardware configuration: lshw.txt
In both cases, we use Kubernetes v1.27.4, CRI-O v1.26.0, the AMD GPU device plugin, and CRIU compiled from the criu-dev branch.