RFC: AMD ROCm support with plugin #1519
Conversation
This recently introduced commit breaks amdgpu_plugin and possibly other such consumers: the CR operation fails on a BERT PyTorch workload on systems with more than one GPU. I am not sure how to handle the systemcfg file, but the current solution looks like a workaround, so revert it. This reverts commit ea8921b. Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Unsupported VMAs that are backed by special plugins can be treated as regular VMAs, and criu should save their metadata in the dump files. Several special backup plugins may run at different times during checkpoint and restore. To checkpoint tasks that have opened AMD KFD/renderDXXX device files, either directly or via the AMD ROCt library, we can use special plugins such as kfd_plugin.so that implement various EXT and other new hooks to support checkpoint and restore. Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com>
kfd_ioctl.h contains the definitions of the APIs and the arguments required to call the ioctls, so simply copy the header as-is for the amdgpu plugin. Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com>
This applies to checkpoint/restore operations on a device file. When restoring VMAs for device files, criu runs sys_mmap in the PIE restore context, but the offsets and the file path within a device file may change during the restore operation, so they need to be adjusted properly. A future patch will add a consumer for this plugin hook. Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com>
To support checkpoint/restore with AMD GPUs for ROCm workloads, introduce a new plugin that assists CRIU with the help of the AMD KFD kernel driver. This initial commit just provides the basic framework to build further capabilities on. Like CRIU, the amdgpu plugin also uses protobuf to serialize and save the amdkfd data, which is mostly VRAM contents plus some metadata. We generate a data file "kfd.<id>.img" during the dump stage. On restore, this file is read and extracted to re-create the various types of buffer objects that belonged to the previously checkpointed process. Upon restore, the mmap page offset within a device file might change, so we use the new hook to update and adjust the mmap offsets for the newly created target process. This is needed for the sys_mmap call in the PIE restorer phase. With the current implementation (amdgpu_plugin), we support:
- compute (non-Gfx) workloads only
- a single GPU
- a GPU visible inside a container
- the AMD GPU Gfx9 family
- PyTorch benchmarks such as BERT Base
Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Userptr buffer objects are system memory that criu already handles, so there is no need to allocate space for such buffer objects. We should allocate space only for GTT and VRAM memory, since that is what we save in our kfd image file. The contents of userptr buffer objects will be drained by criu via vmsplice, since they are paged system memory, and saved as regular pages-*.img files. Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
The PROCPIDMEM interface has a memory bandwidth bottleneck during checkpoint, so it should be used only when direct mmap/memcpy is not possible. Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
During the criu restore phase, when a device is being restored with the help of a plugin, some device-specific operations need to be delayed until criu finalizes the VMA placements in the address space of the target process. But by the time criu finalizes this, it is too late: the PIE phase is over and control is back with the criu master process. This change allows an external trigger for each resuming task to check whether it has a device-specific operation pending, such as issuing an ioctl call. Since this is called from the criu master process context, supply the pid of the target process and give each registered plugin a chance to run its device-specific operation if the target pid is valid. A consumer for this will be added in a separate patch. Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
To restart MMU notifiers and kick-start the evicted queues that were created during a previous criu restore plugin operation, we need an external trigger that runs outside the target process context, i.e. from within the master restorer context, for each target (restored) pid. This external trigger is required because we need to wait until criu PIE finalizes the VMA relocations within the target process's address space. When kfd receives this ioctl, it starts the userptr worker and resumes the MMU notifiers. Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
This adds the capability to checkpoint and restore the user-mode queues that are associated with a typical ROCm process, such as memory queue descriptors, control stacks, and compute unit masks. Also, serialize the queue information returned during the dump ioctl and store it in the image file. Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
The plugin should log extra debug data only with -DDEBUG. Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Use fopen/fwrite/fread to be able to handle bigger files. Try to open /dev/kfd earlier during restore, so we can fail earlier if the amdgpu driver is not loaded. Memcpy data directly into the bo_info_test structure during dump to reduce the number of memcpy calls. Move some variable declarations so that the variables are only valid in the scope where they are used. Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Add support for dumping and restoring events during CRIU checkpoint and restore. Events are used to send notifications to applications when certain tasks are completed or exceptions have occurred. Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
- placeholder and initial documentation
- keep with the CRIU main documentation
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com>
The doorbell offset depends on the gpu_id. When restoring on a different node, the gpu_id may be different, so we need to re-adjust the doorbell offsets. Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Add a Dockerfile to set up CRIU with its dependencies in an official ROCm PyTorch container. Support for the plugin will be added in a future patch. Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
This sets up the PyTorch environment for BERT Transformers and also sets up CRIU along with all its dependencies, including the amdgpu plugin, to support CR with AMD GPUs. Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
The restore operation fails when we perform a CR operation on multiple independent kfd processes, because criu caches the ids of device files with the same (mnt_id, inode) pair. This might not be the optimal solution, but a unique id is required for each kfd process for a successful restore. Decoded fdinfo files also confirm a duplicate id for amdgpu devices; the cache lookup during checkpoint therefore prevents the actual metadata for the device VMA from being written to the image files, so on a subsequent restore the desired id is not found and the restore fails even though the actual image file, e.g. renderDXXX.img, exists. This can be tested with any simple kfd application that uses the AMD ROCt library and creates a VRAM buffer object. CR on such individual processes works fine, but when we invoke more than one process from a shell script and perform CR on that shell script, the restore operation fails without this change. Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
- Make the ROCm amdgpu plugin part of the standard criu build, install, and uninstall process. Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Parse the local system topology in /sys/class/kfd/kfd/topology/nodes/ and store the properties of each GPU in the CRIU image files. The GPU properties can then be used later during restore to make sure the process is restored on GPUs with similar properties. Signed-off-by: David Yat Sin <david.yatsin@amd.com>
The device topology on the restore node can be different from the topology on the checkpointed node. The GPUs on the restore node may have different gpu_ids or minor numbers, or some GPUs may have different properties than on the checkpointed node. During restore, the CRIU plugin determines the target GPUs, to avoid restore failures caused by trying to restore a process on a GPU that differs. Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Add optional parameters to override default behavior during restore. These parameters are passed in as environment variables before executing CRIU. List of parameters:
KFD_FW_VER_CHECK - disable the firmware version check
KFD_SDMA_FW_VER_CHECK - disable the SDMA firmware version check
KFD_CACHES_COUNT_CHECK - disable the caches count check
KFD_NUM_GWS_CHECK - disable the num_gws check
KFD_VRAM_SIZE_CHECK - disable the VRAM size check
KFD_NUMA_CHECK - preserve NUMA regions
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Add unit tests for the GPU remapping code used when checkpointing and restoring on different nodes with different topologies. Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Implement multi-threaded code to read and write the contents of each GPU's VRAM BOs in parallel, in order to speed up the dumping process when using multiple GPUs. Move the opening and closing of file descriptors outside the loop so that they are performed only once per GPU. Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Libhsakmt (thunk) uses a shared memory file in /dev/shm/hsakmt_shared_mem and an associated semaphore file in /dev/shm. Add a check during checkpoint to see whether these two files exist. If they do, the plugin will try to restore them during restore. Signed-off-by: David Yat Sin <david.yatsin@amd.com>
I think, from my point of view, it would make more sense to have multiple pull requests. Start with a smaller pull request setting up the CRIU build-system infrastructure for plugins. Provide a minimal dummy plugin as an example. Also, as you mentioned that the plugin requires additional dependencies, the plugin should not be built automatically just by typing make. Existing CI runs should of course be successful. @checkpoint-restore/maintainers and others, thoughts? Everyone happy with the location of the plugins? I suggested
It fails to compile on Fedora 34:
$ make amdgpu_plugin
protoc-c --proto_path=. --c_out=. criu-amdgpu.proto
gcc -g -Wall -Werror -D _GNU_SOURCE -shared -nostartfiles -fPIC amdgpu_plugin.c amdgpu_plugin_topology.c criu-amdgpu.pb-c.c -o amdgpu_plugin.so -iquote../../../criu/include -iquote ../../criu/include -lpthread -lrt
amdgpu_plugin_topology.c: In function ‘parse_topo_node_mem_banks’:
amdgpu_plugin_topology.c:466:40: error: ‘%s’ directive writing up to 255 bytes into a region of size between 0 and 299 [-Werror=format-overflow=]
466 | sprintf(bank_path, "%s/%s", path, dirent_node->d_name);
| ^~
amdgpu_plugin_topology.c:466:17: note: ‘sprintf’ output between 2 and 556 bytes into a destination of size 300
466 | sprintf(bank_path, "%s/%s", path, dirent_node->d_name);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
amdgpu_plugin_topology.c:475:53: error: ‘/properties’ directive writing 11 bytes into a region of size between 1 and 300 [-Werror=format-overflow=]
475 | sprintf(properties_path, "%s/properties", bank_path);
| ^~~~~~~~~~~
amdgpu_plugin_topology.c:475:25: note: ‘sprintf’ output between 12 and 311 bytes into a destination of size 300
475 | sprintf(properties_path, "%s/properties", bank_path);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
amdgpu_plugin_topology.c: In function ‘parse_topo_node_iolinks’:
amdgpu_plugin_topology.c:553:42: error: ‘%s’ directive writing up to 255 bytes into a region of size between 0 and 299 [-Werror=format-overflow=]
553 | sprintf(iolink_path, "%s/%s", path, dirent_node->d_name);
| ^~
amdgpu_plugin_topology.c:553:17: note: ‘sprintf’ output between 2 and 556 bytes into a destination of size 300
553 | sprintf(iolink_path, "%s/%s", path, dirent_node->d_name);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
amdgpu_plugin_topology.c:562:53: error: ‘/properties’ directive writing 11 bytes into a region of size between 1 and 300 [-Werror=format-overflow=]
562 | sprintf(properties_path, "%s/properties", iolink_path);
| ^~~~~~~~~~~
amdgpu_plugin_topology.c:562:25: note: ‘sprintf’ output between 12 and 311 bytes into a destination of size 300
562 | sprintf(properties_path, "%s/properties", iolink_path);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
make[1]: *** [Makefile:20: amdgpu_plugin.so] Error 1
make: *** [Makefile:334: amdgpu_plugin] Error 2
On Ubuntu 20.04 and Fedora 34, some warnings are treated as errors. This fixes those format-overflow and truncation warnings. Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
To build amdgpu_plugin, we need the include headers for libdrm; add them to the alpine, openjdk, focal, ppc, mips, etc. containers. Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
I have force-pushed the branch and added two commits for testing purposes. There are still 2-3 more fixes required in the plugin to fix the lint and CentOS errors.
@@ -452,7 +452,7 @@ static int parse_topo_node_mem_banks(struct tp_node *node, const char *dir_path)

 while ((dirent_node = readdir(d_node)) != NULL) {
 	char line[300];
-	char bank_path[300];
+	char bank_path[1024];
Could you please apply these changes in the "criu/plugin: Support AMD ROCm Checkpoint Restore with KFD" commit, which introduced amdgpu_plugin_topology.c?
This could be done, for example, with git rebase -i 311aee4ff^, then changing pick to edit on the first line and saving:
edit 311aee4ff criu/plugin: Support AMD ROCm Checkpoint Restore with KFD
After the changes have been applied you can use git commit -a --amend and git rebase --continue.
Could you please advise whether this (f81a453) approach to fixing some container build dependencies is OK?
The build dependencies for amdgpu_plugin should be optional, and CRIU should work even if libdrm-dev is not installed. As far as I know, the current CI doesn't have an AMD GPU available to run tests for the plugin, but it would be good to have tests for it.
I am not sure if I understood you here. I made the change to the CI Docker scripts, but the tests were not triggered. I am not sure whether maintainers need to trigger those manually.
Currently CRIU would fail to compile if libdrm is not installed. Instead, when libdrm is not installed, it should skip building amdgpu_plugin.
I made the change to the CI docker scripts but the tests were not triggered.
The CI tests are triggered on git push; for example, you can see in the Alpine test logs that libdrm-dev has been installed:
Step 3/11 : RUN apk update && apk add $CC bash build-base coreutils git gnutls-dev libaio-dev libcap-dev libnet-dev libnl3-dev nftables nftables-dev pkgconfig protobuf-c-dev protobuf-dev py3-pip py3-protobuf python3 sudo libdrm-dev
...
(59/113) Installing libdrm (2.4.106-r0)
(60/113) Installing libpciaccess-dev (0.16-r0)
(61/113) Installing libdrm-dev (2.4.106-r0)
...
@@ -13,6 +13,7 @@ endif
 FOOTER := footer.txt
 SRC1 += crit.txt
 SRC1 += compel.txt
+SRC1 += amdgpu_plugin.txt
The name amdgpu_plugin (i.e., man amdgpu_plugin) is global. What do you think about using something like criu-amdgpu-plugin instead?
#RUN mkdir -p /home/criu_build_dir
WORKDIR /root/criu_build_dir
#RUN cd /home/criu_build_dir
RUN git clone https://github.com/checkpoint-restore/criu.git
A Dockerfile like this one (and the one in commit "criu/plugin: Dockerfile for AMD criu repo") would be more useful as a test in scripts/build/, but if it is part of the repository it could use the following instead of git clone:
COPY . /criu
WORKDIR /criu
ack
@@ -0,0 +1,275 @@
+Supporting ROCm with CRIU
+=========================
This file should be added with commit "criu/plugin: Support AMD ROCm Checkpoint Restore with KFD".
ack
@@ -147,7 +147,7 @@ HOSTCFLAGS += $(WARNINGS) $(DEFINES) -iquote include/
 export AFLAGS CFLAGS USERCLFAGS HOSTCFLAGS

 # Default target
-all: flog criu lib crit
+all: flog criu lib crit amdgpu_plugin
As Adrian mentioned in his comment, the plugin should not be built automatically by typing make. This is because it requires additional dependencies that are optional for CRIU. To enable the build we could use something like AMDGPU_PLUGIN=1 make instead?
ack, I've added an nmk-based dependency check on libdrm in the Makefile
Can we please have a plain (linear, without merge commits) commit tree for this pull request? It is hard to review this kind of thing: output of `git log --graph --oneline`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a big piece of work, nice! But I would agree with @adrianreber that splitting it into separate PRs would make review easier.
@@ -593,14 +593,49 @@ static int handle_vma(pid_t pid, struct vma_area *vma_area,
 	}
 } else if (*vm_file_fd >= 0) {
 	struct stat *st_buf = vma_area->vmst;
 	struct stat st_kfd, st_dri_min;
 	char img_path[128];
 	int ret;

 	if (S_ISREG(st_buf->st_mode))
 		/* regular file mapping -- supported */;
 	else if (S_ISCHR(st_buf->st_mode) && (st_buf->st_rdev == DEVZERO))
 		/* devzero mapping -- also makes sense */;
 	else {
- Can we please fix the coding style while we're at it: if/else/else-if branches can't have mixed braces/no-braces formatting, so in this case we should have brackets everywhere. https://www.kernel.org/doc/html/v4.10/process/coding-style.html#placing-braces-and-spaces
- Can we please put the code from the branch below into a new bool function (for instance, let's name it handle_plugin_vma), so as not to overgrow handle_vma(). Probably something like this would be ok:
else if (handle_plugin_vma(...)) {
	/* mapping would be handled by external plugin */;
} else {
	pr_err("Can't handle non-regular mapping on %d's map %"PRIx64"\n", pid, vma_area->e->start);
	goto err;
}
- We should probably check plugin availability here; if there is no proper plugin, we should not just skip this mapping - we should fail. Or even: you could detect that a mapping is plugin-dumpable by calling the plugin itself via another new hook; that would be a nice symmetric design.
goto err;
ret = stat("/dev/kfd", &st_kfd);
if (ret == -1) {
	pr_perror("fstat error for /dev/kfd\n");
In pr_perror() calls everywhere, please remove the excess newline specifiers '\n'; pr_perror() adds '\n' by itself.
Ack
@Snorch @rst0git Thanks for your review comments. I will rebase and re-upload the branch, but I am not sure how to address the build errors for some of the automation tests, e.g. the OpenJ9 failure. I was hoping that the dependency on drm.h would be addressed when the container is built with the libdrm dependency, and hence I added that in the Dockerfile. I understand that I'll still need to decouple building the plugin from the main criu build, but could you please explain what I am missing for the build failures other than the clang and lint errors, especially for the cross compilation and other architectures?
@rajbhar I can replicate the Alpine build error locally as follows:
docker build -t criu-alpine -f scripts/build/Dockerfile.alpine .
kfd_ioctl.h:26:10: fatal error: drm/drm.h: No such file or directory
26 | #include <drm/drm.h>
| ^~~~~~~~~~~
compilation terminated.
It can be fixed with:
--- a/plugins/amdgpu/kfd_ioctl.h
+++ b/plugins/amdgpu/kfd_ioctl.h
@@ -23,7 +23,7 @@
 #ifndef KFD_IOCTL_H_INCLUDED
 #define KFD_IOCTL_H_INCLUDED
-#include <drm/drm.h>
+#include <libdrm/drm.h>
 #include <linux/ioctl.h>
 /*
Thanks. On Alpine, I'll also need to include the limits.h header in addition to this. The two drm.h headers are slightly different, but it probably won't matter for our use case. Ideally, the kernel src package should have brought in drm/drm.h the way the kernel expects it, and other distros do have it, but Alpine seems to differ, so I have copied libdrm/* using the Dockerfile for Alpine. We can't change the include in this header, as that wouldn't be accepted upstream.
@adrianreber, @Snorch, @rst0git - I really appreciate your feedback on this PR. I have tried to address your suggestions in two new PRs, #1556 and #1557, so closing this PR. Thanks again for your time and valuable feedback!
Hello CRIU Community and Maintainers,
Can you please review this "RFC" pull request, as we are still adding more features and making some changes to our API/ioctl design. Further to our earlier announcement https://www.spinics.net/lists/amd-gfx/msg62112.html , we have made some more enhancements in terms of functionality, based on feedback we received on the mailing list with regard to the placement of the plugin, how it should be built, etc.
This PR includes another important enhancement that brings in multi-GPU support. The checkpoint/restore time is improved now compared to what we had when we made the announcement, as we added multithreading support, and we are further optimizing it by using the DMA engines to copy VRAM contents to image files. This will be added as soon as it is ready, probably on top of the feedback we receive here. There are some changes made to core CRIU itself, and one possible recent regression was noticed during our testing, so we seek your guidance, as we are not sure we understand it well enough.
This work was tested on multi-GPU AMD systems with real PyTorch workloads such as BERT (Transformers) and BERT Large. We also tested some TensorFlow workloads and found a probable limitation of protobuf: it appears unable to handle buffer objects larger than 4 GB. We are still investigating this, but would like to ask here whether any such issue has been observed or reported before.
Link to the kernel changes: https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/commits/fxkamd/criu-wip
We again thank you for your time and appreciate your feedback.