criu/plugin: Support AMD ROCm Checkpoint Restore with KFD
To support Checkpoint Restore with AMD GPUs for ROCm workloads, introduce a new plugin to assist CRIU with the help of the AMD KFD kernel driver. This initial commit just provides the basic framework to build up further capabilities.

Like CRIU, the amdgpu plugin also uses protobuf to serialize and save the amdkfd data, which is mostly VRAM contents with some metadata. We generate a data file "amdgpu-kfd-<id>.img" during the dump stage. On restore, this file is read and extracted to re-create the various types of buffer objects that belonged to the previously checkpointed process. Upon restore, the mmap page offset within a device file might change, so we use the new hook to update and adjust the mmap offsets for the newly created target process. This is needed for the sys_mmap call in the pie restorer phase.

Support for queues and events is added in future patches of this series.

With the current implementation (amdgpu_plugin), we support:
- Only compute workloads (non-Gfx)
- GPU visible inside a container
- AMD GPU Gfx 9 family
- PyTorch benchmarks such as BERT Base

The amdgpu plugin depends on libdrm and libdrm_amdgpu, which are typically installed with the libdrm-dev package. We build amdgpu_plugin only when the dependencies are met on the target system and when the user intends to install the amdgpu plugin, not by default with the criu build.

Suggested-by: Felix Kuehling <felix.kuehling@amd.com>
Co-authored-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
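The mmap offset adjustment described in the commit message can be pictured with a small lookup table. The following is only a minimal sketch under stated assumptions: `struct bo_offset_map` and `remap_mmap_offset` are hypothetical names invented for illustration and are not taken from the plugin source or from CRIU's plugin API. It assumes the plugin has recorded, for each buffer object, the offset seen at dump time and the offset handed out when the object was re-created on restore.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical record (an assumption, not the plugin's real data layout):
 * the mmap offset a buffer object had at dump time, and the offset it
 * received when it was re-created on restore.
 */
struct bo_offset_map {
	uint64_t old_pgoff;
	uint64_t new_pgoff;
};

/*
 * Look up old_pgoff in the table. On a match, store the restore-time
 * offset in *new_pgoff and return 1. If the offset is unknown, leave
 * *new_pgoff untouched and return 0 so the caller keeps the old value.
 */
int remap_mmap_offset(const struct bo_offset_map *map, size_t n,
		      uint64_t old_pgoff, uint64_t *new_pgoff)
{
	assert(new_pgoff != NULL);

	for (size_t i = 0; i < n; i++) {
		if (map[i].old_pgoff == old_pgoff) {
			*new_pgoff = map[i].new_pgoff;
			return 1;
		}
	}
	return 0;
}
```

A linear scan keeps the sketch short; a real implementation might index the records differently, and the actual hook signature should be taken from CRIU's plugin header rather than from this example.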
Showing 9 changed files with 1,065 additions and 52 deletions.
@@ -0,0 +1,45 @@
ROCM Support(1)
===============

NAME
----
amdgpu_plugin - A plugin extension to CRIU to support checkpoint/restore in
userspace for AMD GPUs.


CURRENT SUPPORT
---------------
Single GPU systems (Gfx9)
Checkpoint / Restore on same system
Checkpoint / Restore inside a docker container
Pytorch

DESCRIPTION
-----------
Though *criu* is a great tool for checkpointing and restoring running
applications, it has certain limitations such as it cannot handle
applications that have device files open. In order to support *ROCm* based
workloads with *criu* we need to augment criu's core functionality with a
plugin based extension mechanism. *amdgpu_plugin* provides the necessary support
to criu to allow Checkpoint / Restore with ROCm.


Dependencies
~~~~~~~~~~~~~~
*amdkfd support*::
In order to snapshot the *VRAM* and other *GPU* device states, we require
an updated version of amdkfd(amdgpu) driver. The kernel patches are under
review currently.

*criu 3.16*::
This work is rebased on latest criu release available at this time.


AUTHOR
------
The AMDKFD team.


COPYRIGHT
---------
Copyright \(C) 2020-2021, Advanced Micro Devices, Inc. (AMD)
@@ -1,13 +1,49 @@
all: dummy_plugin.so
PLUGIN_NAME := amdgpu_plugin
PLUGIN_SOBJ := amdgpu_plugin.so

dummy_plugin.so: dummy_plugin.c
	gcc -g -Werror -D _GNU_SOURCE -Wall -shared -nostartfiles dummy_plugin.c -o dummy_plugin.so -iquote ../../../criu/include -iquote ../../criu/include -fPIC
PLUGIN_INC := ../../../criu/include
PLUGIN_INC_EXTRA := ../../criu/include
PLUGIN_INCLUDE := -iquote$(PLUGIN_INC) -iquote$(PLUGIN_INC_EXTRA)
LIBDRM_INC := -I/usr/include/libdrm
DEPS_OK := amdgpu_plugin.so
DEPS_NOK := ;

include $(__nmk_dir)msg.mk

CC := gcc
PLUGIN_CFLAGS := -g -Wall -Werror -D _GNU_SOURCE -shared -nostartfiles -fPIC

ifeq ($(CONFIG_AMDGPU),y)
all: $(DEPS_OK)
else
all: $(DEPS_NOK)
endif

criu-amdgpu.pb-c.c: criu-amdgpu.proto
	protoc-c --proto_path=. --c_out=. criu-amdgpu.proto

amdgpu_plugin.so: amdgpu_plugin.c criu-amdgpu.pb-c.c
	$(CC) $(PLUGIN_CFLAGS) $^ -o $@ $(PLUGIN_INCLUDE)

amdgpu_plugin_clean:
	$(call msg-clean, $@)
	$(Q) $(RM) amdgpu_plugin.so criu-amdgpu.pb-c*
.PHONY: amdgpu_plugin_clean
clean: amdgpu_plugin_clean

mrproper: clean

clean:
	$(Q) $(RM) dummy_plugin.so
install:
	$(Q) mkdir -p $(PLUGINDIR)
	$(Q) install -m 644 dummy_plugin.so $(PLUGINDIR)
ifeq ($(CONFIG_AMDGPU),y)
	$(E) " INSTALL " $(PLUGIN_NAME)
	$(Q) install -m 644 $(PLUGIN_SOBJ) $(PLUGINDIR)
endif
.PHONY: install

uninstall:
	$(Q) $(RM) $(PLUGINDIR)/dummy_plugin.so
ifeq ($(CONFIG_AMDGPU),y)
	$(E) " UNINSTALL" $(PLUGIN_NAME)
	$(Q) $(RM) $(PLUGINDIR)/$(PLUGIN_SOBJ)
endif
.PHONY: uninstall