RFC: AMD ROCm support with plugin #1519
Conversation
This recently introduced commit breaks amdgpu_plugin and possibly other such consumers: the CR operation fails on a BERT PyTorch workload on systems with more than one GPU. I am not sure how to handle the systemcfg file, but the current solution looks like a workaround, so revert it. This reverts commit ea8921b. Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Unsupported VMAs that are backed by special plugins can be treated as regular VMAs, and criu should save their metadata in the dump files. Several special backup plugins may run at different times during checkpoint and restore. To checkpoint tasks that have opened AMD KFD/renderDXXX device files, either directly or via the AMD ROCt library, we can use special plugins such as kfd_plugin.so that implement various EXT and other new hooks to support checkpoint and restore. Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com>
kfd_ioctl.h contains the definitions of the APIs and the arguments required to call the ioctls, so simply copy the header as-is for the amdgpu plugin. Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com>
This applies to checkpoint/restore operations on a device file. When restoring VMAs for device files, criu runs sys_mmap in the PIE restore context, but the offsets and the file path within a device file may change during the restore operation, so they need to be adjusted properly. A future patch will add a consumer for this plugin hook. Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com>
To support checkpoint/restore with AMD GPUs for ROCm workloads, introduce a new plugin that assists CRIU with the help of the AMD KFD kernel driver. This initial commit just provides the basic framework to build further capabilities on. Like CRIU, the amdgpu plugin also uses protobuf to serialize and save the amdkfd data, which is mostly VRAM contents plus some metadata. We generate a data file "kfd.<id>.img" during the dump stage. On restore, this file is read and extracted to re-create the various types of buffer objects that belonged to the previously checkpointed process. Upon restore, the mmap page offset within a device file might change, so we use the new hook to update and adjust the mmap offsets for the newly created target process. This is needed for the sys_mmap call in the PIE restorer phase. With the current implementation (amdgpu_plugin), we support:
- compute (non-Gfx) workloads only
- a single GPU
- a GPU visible inside a container
- the AMD GPU Gfx9 family
- PyTorch benchmarks such as BERT Base
Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Userptr buffer objects are system memory that criu already handles, so there is no need to allocate space for such buffer objects. We should allocate space only for GTT and VRAM memory, since that is what we save in our kfd image file. The contents of userptr buffer objects will be drained by criu via vmsplice, since they are paged system memory, and saved as regular pages-*.img files. Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
The PROCPIDMEM interface has a memory bandwidth bottleneck during checkpoint, so it should be used only when direct mmap/memcpy is not possible. Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
During the criu restore phase, when a device is being restored with the help of a plugin, some device-specific operations need to be delayed until criu finalizes the VMA placements in the address space of the target process. But by the time criu finalizes this, it is too late: the PIE phase is over and control is back with the criu master process. This change allows an external trigger for each resuming task to check whether it has a device-specific operation pending, such as issuing an ioctl call. Since this is called from the criu master process context, supply the pid of the target process and give each registered plugin a chance to run its device-specific operation if the target pid is valid. A consumer for this will be added in a separate patch. Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
To restart MMU notifiers and kick-start the evicted queues that were created during a previous criu restore plugin operation, we need an external trigger that runs outside the target process context, i.e. from within the master restorer context, for each target (restored) pid. This external trigger is required because we need to wait until criu PIE finalizes the VMA relocations within the target process's address space. When kfd receives this ioctl, it starts the userptr worker and resumes the MMU notifiers. Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
This adds the capability to checkpoint and restore the user-mode queues that are associated with a typical ROCm process, such as memory queue descriptors, control stacks, and compute unit masks. Also, serialize the queue information returned during the dump ioctl and store it in the image file. Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
The plugin should log extra debug data only with -DDEBUG. Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Use fopen/fwrite/fread to be able to handle bigger files. Try to open /dev/kfd earlier during restore, so we can fail earlier if the amdgpu driver is not loaded. Memcpy data directly into the bo_info_test structure during dump to reduce the number of memcpy calls. Move some variable declarations so that the variables are only valid in the scope where they are used. Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Add support for dumping and restoring events during CRIU checkpoint and restore. Events are used to send notifications to applications when certain tasks are completed or exceptions have occurred. Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
- placeholder and initial documentation
- keep with the CRIU main documentation
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com>
The doorbell offset depends on the gpu_id. When restoring on a different node, the gpu_id may be different, so we need to re-adjust the doorbell offsets. Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Add a Dockerfile to set up CRIU with its dependencies in an official ROCm PyTorch container. Support for the plugin will be added in a future patch. Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
This sets up the PyTorch environment for BERT Transformers and also sets up CRIU along with all its dependencies, including the amdgpu plugin, to support CR with AMD GPUs. Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
The restore operation fails when we perform a CR operation on multiple independent kfd processes, because criu caches the ids of device files with the same (mnt_id, inode) pair. This might not be the optimal solution, but a unique id is required for each kfd process for a successful restore. Decoded fdinfo files also confirm a duplicate id for amdgpu devices; the cache lookup during checkpoint therefore prevents the actual metadata for the device VMA from being written to the image files, so on a subsequent restore the desired id is not found and the restore fails even though the actual image file, e.g. renderDXXX.img, exists. This can be tested with any simple kfd application that uses the AMD ROCt library and creates a VRAM buffer object. CR on such individual processes works fine, but when we invoke more than one process from a shell script and perform CR on that shell script, the restore operation fails without this change. Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
- Make the ROCm amdgpu plugin part of the standard criu build, install, and uninstall process. Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Parse the local system topology in /sys/class/kfd/kfd/topology/nodes/ and store the properties of each GPU in the CRIU image files. The GPU properties can then be used later during restore to make sure the process is restored on GPUs with similar properties. Signed-off-by: David Yat Sin <david.yatsin@amd.com>
The device topology on the restore node can be different from the topology on the checkpointed node. The GPUs on the restore node may have different gpu_ids or minor numbers, or some GPUs may have different properties than on the checkpointed node. During restore, the CRIU plugin determines the target GPUs, to avoid restore failures caused by trying to restore a process on a GPU that differs. Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Add optional parameters to override default behavior during restore. These parameters are passed in as environment variables before executing CRIU. List of parameters:
KFD_FW_VER_CHECK - disable the firmware version check
KFD_SDMA_FW_VER_CHECK - disable the SDMA firmware version check
KFD_CACHES_COUNT_CHECK - disable the caches count check
KFD_NUM_GWS_CHECK - disable the num_gws check
KFD_VRAM_SIZE_CHECK - disable the VRAM size check
KFD_NUMA_CHECK - preserve NUMA regions
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Add unit tests for the GPU remapping code used when checkpointing and restoring on different nodes with different topologies. Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Implement multi-threaded code to read and write the contents of each GPU's VRAM BOs in parallel, in order to speed up the dumping process when using multiple GPUs. Move the opening and closing of file descriptors outside the loop so that they are performed only once per GPU. Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Libhsakmt (thunk) uses a shared memory file in /dev/shm/hsakmt_shared_mem and an associated semaphore file in /dev/shm. Add a check during checkpoint to see whether these two files exist. If they do, the plugin will try to restore them during restore. Signed-off-by: David Yat Sin <david.yatsin@amd.com>
I think, from my point of view, it would make more sense to have multiple pull requests. Start with a smaller pull request setting up the CRIU build-system infrastructure for plugins. Provide a minimal dummy plugin as an example. Also, as you mentioned that the plugin requires additional dependencies, the plugin should not be built automatically just by typing make. Existing CI runs should of course be successful. @checkpoint-restore/maintainers and others, thoughts? Everyone happy with the location of the plugins? I suggested
It fails to compile on Fedora 34:
$ make amdgpu_plugin
protoc-c --proto_path=. --c_out=. criu-amdgpu.proto
gcc -g -Wall -Werror -D _GNU_SOURCE -shared -nostartfiles -fPIC amdgpu_plugin.c amdgpu_plugin_topology.c criu-amdgpu.pb-c.c -o amdgpu_plugin.so -iquote../../../criu/include -iquote ../../criu/include -lpthread -lrt
amdgpu_plugin_topology.c: In function ‘parse_topo_node_mem_banks’:
amdgpu_plugin_topology.c:466:40: error: ‘%s’ directive writing up to 255 bytes into a region of size between 0 and 299 [-Werror=format-overflow=]
466 | sprintf(bank_path, "%s/%s", path, dirent_node->d_name);
| ^~
amdgpu_plugin_topology.c:466:17: note: ‘sprintf’ output between 2 and 556 bytes into a destination of size 300
466 | sprintf(bank_path, "%s/%s", path, dirent_node->d_name);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
amdgpu_plugin_topology.c:475:53: error: ‘/properties’ directive writing 11 bytes into a region of size between 1 and 300 [-Werror=format-overflow=]
475 | sprintf(properties_path, "%s/properties", bank_path);
| ^~~~~~~~~~~
amdgpu_plugin_topology.c:475:25: note: ‘sprintf’ output between 12 and 311 bytes into a destination of size 300
475 | sprintf(properties_path, "%s/properties", bank_path);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
amdgpu_plugin_topology.c: In function ‘parse_topo_node_iolinks’:
amdgpu_plugin_topology.c:553:42: error: ‘%s’ directive writing up to 255 bytes into a region of size between 0 and 299 [-Werror=format-overflow=]
553 | sprintf(iolink_path, "%s/%s", path, dirent_node->d_name);
| ^~
amdgpu_plugin_topology.c:553:17: note: ‘sprintf’ output between 2 and 556 bytes into a destination of size 300
553 | sprintf(iolink_path, "%s/%s", path, dirent_node->d_name);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
amdgpu_plugin_topology.c:562:53: error: ‘/properties’ directive writing 11 bytes into a region of size between 1 and 300 [-Werror=format-overflow=]
562 | sprintf(properties_path, "%s/properties", iolink_path);
| ^~~~~~~~~~~
amdgpu_plugin_topology.c:562:25: note: ‘sprintf’ output between 12 and 311 bytes into a destination of size 300
562 | sprintf(properties_path, "%s/properties", iolink_path);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
make[1]: *** [Makefile:20: amdgpu_plugin.so] Error 1
make: *** [Makefile:334: amdgpu_plugin] Error 2
On Ubuntu 20.04 and Fedora 34, some warnings are treated as errors. This fixes those format-overflow and truncation warnings. Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
To build amdgpu_plugin, we need the include headers for libdrm; add them to the alpine, openjdk, focal, ppc, mips, etc. containers. Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
I have force-pushed the branch and added two commits for testing purposes. There are still 2-3 more fixes required in the plugin to fix the lint and CentOS errors.
@@ -452,7 +452,7 @@ static int parse_topo_node_mem_banks(struct tp_node *node, const char *dir_path)

 while ((dirent_node = readdir(d_node)) != NULL) {
 	char line[300];
-	char bank_path[300];
+	char bank_path[1024];
Could you please apply these changes in the "criu/plugin: Support AMD ROCm Checkpoint Restore with KFD" commit, which introduced amdgpu_plugin_topology.c?
This could be done, for example, with git rebase -i 311aee4ff^, then changing pick to edit on the first line and saving:
edit 311aee4ff criu/plugin: Support AMD ROCm Checkpoint Restore with KFD
After the changes have been applied you can use git commit -a --amend and git rebase --continue.
Could you please advise whether this (f81a453) approach to fixing some container build dependencies is OK?
The build dependencies for amdgpu_plugin should be optional, and CRIU should work even if libdrm-dev is not installed. As far as I know, the current CI doesn't have an AMD GPU available to run tests for the plugin, but it would be good to have tests for it.
I am not sure if I understood you here. I made the change to the CI Docker scripts, but the tests were not triggered. I am not sure whether maintainers need to trigger those manually.
Currently CRIU would fail to compile if libdrm is not installed. Instead, when libdrm is not installed, it should skip building amdgpu_plugin.
I made the change to the CI docker scripts but the tests were not triggered.
The CI tests are triggered on git push; for example, you can see in the Alpine test logs that libdrm-dev has been installed:
Step 3/11 : RUN apk update && apk add $CC bash build-base coreutils git gnutls-dev libaio-dev libcap-dev libnet-dev libnl3-dev nftables nftables-dev pkgconfig protobuf-c-dev protobuf-dev py3-pip py3-protobuf python3 sudo libdrm-dev
...
(59/113) Installing libdrm (2.4.106-r0)
(60/113) Installing libpciaccess-dev (0.16-r0)
(61/113) Installing libdrm-dev (2.4.106-r0)
...
@@ -13,6 +13,7 @@ endif
 FOOTER := footer.txt
 SRC1 += crit.txt
 SRC1 += compel.txt
+SRC1 += amdgpu_plugin.txt
The name amdgpu_plugin (i.e., man amdgpu_plugin) is global. What do you think about using something like criu-amdgpu-plugin instead?
#RUN mkdir -p /home/criu_build_dir
WORKDIR /root/criu_build_dir
#RUN cd /home/criu_build_dir
RUN git clone https://github.com/checkpoint-restore/criu.git
A Dockerfile like this one (and the one in commit "criu/plugin: Dockerfile for AMD criu repo") would be more useful as a test in scripts/build/, but if it is part of the repository it could use the following instead of git clone:
COPY . /criu
WORKDIR /criu
ack
@@ -0,0 +1,275 @@
+Supporting ROCm with CRIU
+=========================
This file should be added with commit "criu/plugin: Support AMD ROCm Checkpoint Restore with KFD".
ack
@@ -147,7 +147,7 @@ HOSTCFLAGS += $(WARNINGS) $(DEFINES) -iquote include/
 export AFLAGS CFLAGS USERCLFAGS HOSTCFLAGS

 # Default target
-all: flog criu lib crit
+all: flog criu lib crit amdgpu_plugin
As Adrian mentioned in his comment, the plugin should not be built automatically by typing make. This is because it requires additional dependencies that are optional for CRIU. To enable the build we could use something like AMDGPU_PLUGIN=1 make instead?
ack, I've added an nmk-based dependency check on libdrm in the Makefile
Can we please have a plain (linear, without merge commits) commit tree for this pull request? It is hard to review this kind of thing: output of `git log --graph --oneline`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a big piece of work, nice! But I would agree with @adrianreber that splitting it into separate PRs would make review easier.
@@ -593,14 +593,49 @@ static int handle_vma(pid_t pid, struct vma_area *vma_area,
 	}
 } else if (*vm_file_fd >= 0) {
 	struct stat *st_buf = vma_area->vmst;
 	struct stat st_kfd, st_dri_min;
 	char img_path[128];
 	int ret;

 	if (S_ISREG(st_buf->st_mode))
 		/* regular file mapping -- supported */;
 	else if (S_ISCHR(st_buf->st_mode) && (st_buf->st_rdev == DEVZERO))
 		/* devzero mapping -- also makes sense */;
 	else {
- Can we please fix the coding style while we're at it: if/else/else-if branches can't have mixed braces/no-braces formatting, so in this case we should have brackets everywhere. https://www.kernel.org/doc/html/v4.10/process/coding-style.html#placing-braces-and-spaces
- Can we please put the code from the branch below into a new bool function (for instance, let's name it handle_plugin_vma), so as not to overgrow handle_vma(). Probably something like this would be ok:
else if (handle_plugin_vma(...)) {
	/* mapping would be handled by external plugin */;
} else {
	pr_err("Can't handle non-regular mapping on %d's map %"PRIx64"\n", pid, vma_area->e->start);
	goto err;
}
- We should probably check plugin availability here; if there is no proper plugin, we should not just skip this mapping - we should fail. Or even: you could detect that a mapping is plugin-dumpable by calling the plugin itself via another new hook; that would be a nice symmetric design.
goto err;
ret = stat("/dev/kfd", &st_kfd);
if (ret == -1) {
	pr_perror("fstat error for /dev/kfd\n");
In pr_perror() calls everywhere, please remove the excess newline specifiers '\n'; pr_perror() adds '\n' by itself.
Ack
@Snorch @rst0git Thanks for your review comments. I will rebase and re-upload the branch, but I am not sure how to address the build errors for some of the automation tests, e.g. the OpenJ9 failure. I was hoping that the dependency on drm.h would be addressed when the container is built with the libdrm dependency, and hence I added that in the Dockerfile. I understand that I'll still need to decouple building the plugin from the main criu build, but could you please explain what I am missing for the build failures other than the clang and lint errors, especially for the cross compilation and other architectures?
@rajbhar I can replicate the Alpine build error locally as follows:
docker build -t criu-alpine -f scripts/build/Dockerfile.alpine .
kfd_ioctl.h:26:10: fatal error: drm/drm.h: No such file or directory
26 | #include <drm/drm.h>
| ^~~~~~~~~~~
compilation terminated.
It can be fixed with:
--- a/plugins/amdgpu/kfd_ioctl.h
+++ b/plugins/amdgpu/kfd_ioctl.h
@@ -23,7 +23,7 @@
 #ifndef KFD_IOCTL_H_INCLUDED
 #define KFD_IOCTL_H_INCLUDED
-#include <drm/drm.h>
+#include <libdrm/drm.h>
 #include <linux/ioctl.h>
 /*
Thanks. On Alpine, I'll also need to include the limits.h header in addition to this. The two drm.h headers are slightly different, but it probably won't matter for our use case. Ideally, the kernel src package should have brought in drm/drm.h the way the kernel expects it, and other distros do have it, but Alpine seems to differ, so I have copied libdrm/* using the Dockerfile for Alpine. We can't change the include in this header, as that wouldn't be accepted upstream.
@adrianreber, @Snorch, @rst0git - I really appreciate your feedback on this PR. I have tried to address your suggestions in two new PRs, #1556 and #1557, so closing this PR. Thanks again for your time and valuable feedback!
Hello CRIU Community and Maintainers,
Can you please review this "RFC" pull request, as we are still adding more features and making some changes to our API/ioctl design. Further to our earlier announcement https://www.spinics.net/lists/amd-gfx/msg62112.html , we have made some more enhancements in terms of functionality, based on feedback we received on the mailing list with regard to the placement of the plugin, how it should be built, etc.
This PR includes another important enhancement that brings in multi-GPU support. The checkpoint/restore time is improved now compared to what we had when we made the announcement, as we added multithreading support, and we are further optimizing it by using the DMA engines to copy VRAM contents to image files. This will be added as soon as it is ready, probably on top of the feedback we receive here. There are some changes made to core CRIU itself, and one possible recent regression was noticed during our testing, so we seek your guidance, as we are not sure we understand it well enough.
This work was tested on multi-GPU AMD systems with real PyTorch workloads such as BERT (Transformers) and BERT Large. We also tested some TensorFlow workloads and found a probable limitation of protobuf: it appears unable to handle buffer objects larger than 4 GB. We are still investigating this, but would like to ask here whether any such issue has been observed or reported before.
Link to the kernel changes: https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/commits/fxkamd/criu-wip
We again thank you for your time and appreciate your feedback.