criu/plugin: Support AMD ROCm Checkpoint Restore with KFD
To support checkpoint/restore of ROCm workloads on AMD GPUs, introduce
a new plugin that assists CRIU with the help of the AMD KFD kernel
driver. This initial commit provides only the basic framework on which
further capabilities will be built. Like CRIU, the amdgpu plugin uses
protobuf to serialize and save the amdkfd data, which is mostly VRAM
contents plus some metadata. During the dump stage we generate a data
file, "amdgpu-kfd-<id>.img". On restore, this file is read and its
contents are used to re-create the various types of buffer objects that
belonged to the checkpointed process. The mmap page offset within a
device file might change on restore, so we use the new hook to update
the mmap offsets for the newly created target process. This is needed
for the sys_mmap call in the PIE restorer phase. Support for queues and
events is added in later patches of this series.
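
The offset adjustment described above can be pictured as a lookup table
that is filled in while buffer objects are re-created on restore. The
following is a minimal sketch under stated assumptions, not the plugin's
actual code: `struct offset_remap`, `remap_mmap_offset` and
`kfd_image_name` are hypothetical names chosen for illustration, and the
`<id>` in the image file name is assumed to be a plain integer.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical record: one entry per restored buffer object. */
struct offset_remap {
	uint64_t old_offset; /* mmap offset recorded at dump time */
	uint64_t new_offset; /* mmap offset assigned on restore */
};

/*
 * Look up the restore-time offset for a dump-time offset.
 * Returns 0 and stores the result in *out on success, or -1 when the
 * offset is not in the table (the caller would keep the old offset).
 */
int remap_mmap_offset(const struct offset_remap *tbl, size_t n,
		      uint64_t old_offset, uint64_t *out)
{
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].old_offset == old_offset) {
			*out = tbl[i].new_offset;
			return 0;
		}
	}
	return -1;
}

/*
 * Build an image file name following the "amdgpu-kfd-<id>.img" pattern
 * from the commit message (assumption: <id> is an integer).
 * Returns 0 on success, or -1 if the buffer is too small.
 */
int kfd_image_name(char *buf, size_t len, int id)
{
	int n = snprintf(buf, len, "amdgpu-kfd-%d.img", id);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A linear scan is used here only to keep the sketch short; a process with
many buffer objects would likely want a sorted array or a hash table.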

With the current implementation (amdgpu_plugin), we support:
     - Compute-only (non-Gfx) workloads
     - GPUs visible inside a container
     - AMD GPU Gfx 9 family
     - PyTorch benchmarks such as BERT Base

The amdgpu plugin depends on libdrm and libdrm_amdgpu, which are
typically installed with the libdrm-dev package. We build amdgpu_plugin
only when these dependencies are met on the target system and the user
intends to install the amdgpu plugin; it is not built by default with
the criu build.

Suggested-by: Felix Kuehling <felix.kuehling@amd.com>
Co-authored-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
2 people authored and avagin committed Apr 29, 2022
1 parent 71ff9cc commit 55a5993
Showing 9 changed files with 1,065 additions and 52 deletions.
1 change: 1 addition & 0 deletions Documentation/Makefile
@@ -16,6 +16,7 @@ ifeq ($(PYTHON),python3)
SRC1 += criu-ns.txt
endif
SRC1 += compel.txt
SRC1 += amdgpu_plugin.txt
SRC8 += criu.txt
SRC := $(SRC1) $(SRC8)
XMLS := $(patsubst %.txt,%.xml,$(SRC))
45 changes: 45 additions & 0 deletions Documentation/amdgpu_plugin.txt
@@ -0,0 +1,45 @@
ROCM Support(1)
===============

NAME
----
amdgpu_plugin - A plugin extension to CRIU to support checkpoint/restore in
userspace for AMD GPUs.


CURRENT SUPPORT
---------------
Single GPU systems (Gfx9)
Checkpoint / Restore on same system
Checkpoint / Restore inside a docker container
Pytorch

DESCRIPTION
-----------
Though *criu* is a great tool for checkpointing and restoring running
applications, it has certain limitations; for example, it cannot handle
applications that have device files open. To support *ROCm* based
workloads with *criu*, we need to augment criu's core functionality with a
plugin based extension mechanism. *amdgpu_plugin* provides the necessary
support to criu to allow Checkpoint / Restore with ROCm.


Dependencies
~~~~~~~~~~~~~~
*amdkfd support*::
In order to snapshot the *VRAM* and other *GPU* device states, we require
an updated version of amdkfd(amdgpu) driver. The kernel patches are under
review currently.

*criu 3.16*::
This work is rebased on latest criu release available at this time.


AUTHOR
------
The AMDKFD team.


COPYRIGHT
---------
Copyright \(C) 2020-2021, Advanced Micro Devices, Inc. (AMD)
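
The DESCRIPTION in amdgpu_plugin.txt says that criu's core cannot handle
open device files and relies on a plugin based extension mechanism
instead. As a rough illustration of that idea only, the sketch below
shows a core loop offering a device path to each registered plugin until
one claims it. Everything here is hypothetical (`struct plugin`,
`dump_device_hook`, `toy_amdgpu_dump`, `dispatch_dump`); it does not
reproduce CRIU's real plugin API. `/dev/kfd` is used as the example
device node for the KFD driver.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical hook signature: return 0 if the plugin handled the
 * device, or -1 if the device does not belong to this plugin.
 */
typedef int (*dump_device_hook)(const char *path);

/* Hypothetical plugin descriptor with a single optional hook. */
struct plugin {
	const char *name;
	dump_device_hook dump_device;
};

/* Toy handler that claims only the KFD device node. */
int toy_amdgpu_dump(const char *path)
{
	return strcmp(path, "/dev/kfd") == 0 ? 0 : -1;
}

/*
 * Offer the device to each plugin in turn. Returns the name of the
 * plugin that handled it, or NULL if no plugin did (in which case the
 * core would have to fail the dump).
 */
const char *dispatch_dump(const struct plugin *plugins, size_t n,
			  const char *path)
{
	for (size_t i = 0; i < n; i++) {
		if (plugins[i].dump_device &&
		    plugins[i].dump_device(path) == 0)
			return plugins[i].name;
	}
	return NULL;
}
```

With a table holding one `{ "amdgpu_plugin", toy_amdgpu_dump }` entry,
`dispatch_dump` returns that plugin's name for `/dev/kfd` and NULL for
any other path.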
13 changes: 7 additions & 6 deletions Makefile
@@ -284,19 +284,19 @@ clean mrproper:
$(Q) $(MAKE) $(build)=crit $@
.PHONY: clean mrproper

clean-dummy_amdgpu_plugin:
clean-amdgpu_plugin:
$(Q) $(MAKE) -C plugins/amdgpu clean
.PHONY: clean dummy_amdgpu_plugin
.PHONY: clean-amdgpu_plugin

clean-top:
$(Q) $(MAKE) -C Documentation clean
$(Q) $(MAKE) $(build)=test/compel clean
$(Q) $(RM) .gitid
.PHONY: clean-top

clean: clean-top clean-dummy_amdgpu_plugin
clean: clean-top clean-amdgpu_plugin

mrproper-top: clean-top clean-dummy_amdgpu_plugin
mrproper-top: clean-top clean-amdgpu_plugin
$(Q) $(RM) $(CONFIG_HEADER)
$(Q) $(RM) $(VERSION_HEADER)
$(Q) $(RM) $(COMPEL_VERSION_HEADER)
@@ -324,9 +324,9 @@ test: zdtm
$(Q) $(MAKE) -C test
.PHONY: test

dummy_amdgpu_plugin:
amdgpu_plugin: criu
$(Q) $(MAKE) -C plugins/amdgpu all
.PHONY: dummy_amdgpu_plugin
.PHONY: amdgpu_plugin

#
# Generating tar requires tag matched CRIU_VERSION.
@@ -408,6 +408,7 @@ help:
@echo ' unittest - Run unit tests'
@echo ' lint - Run code linters'
@echo ' indent - Indent C code'
@echo ' amdgpu_plugin - Make AMD GPU plugin'
.PHONY: help

lint:
8 changes: 8 additions & 0 deletions Makefile.config
@@ -21,6 +21,14 @@ ifeq ($(call pkg-config-check,libbpf),y)
export CONFIG_HAS_LIBBPF := y
endif

ifeq ($(call pkg-config-check,libdrm),y)
export CONFIG_AMDGPU := y
$(info Note: Building criu with amdgpu_plugin.)
else
$(info Note: Building criu without amdgpu_plugin.)
$(info Note: libdrm and libdrm_amdgpu are required to build amdgpu_plugin.)
endif

ifeq ($(NO_GNUTLS)x$(call pkg-config-check,gnutls),xy)
LIBS_FEATURES += -lgnutls
export CONFIG_GNUTLS := y
6 changes: 3 additions & 3 deletions Makefile.install
@@ -41,16 +41,16 @@ install-criu: criu
$(Q) $(MAKE) $(build)=criu install
.PHONY: install-criu

install-dummy_amdgpu_plugin: dummy_amdgpu_plugin
install-amdgpu_plugin: amdgpu_plugin
$(Q) $(MAKE) -C plugins/amdgpu install
.PHONY: install-dummy_amdgpu_plugin
.PHONY: install-amdgpu_plugin

install-compel: $(compel-install-targets)
$(Q) $(MAKE) $(build)=compel install
$(Q) $(MAKE) $(build)=compel/plugins install
.PHONY: install-compel

install: install-man install-lib install-criu install-compel ;
install: install-man install-lib install-criu install-compel install-amdgpu_plugin ;
.PHONY: install

uninstall:
50 changes: 43 additions & 7 deletions plugins/amdgpu/Makefile
@@ -1,13 +1,49 @@
all: dummy_plugin.so
PLUGIN_NAME := amdgpu_plugin
PLUGIN_SOBJ := amdgpu_plugin.so

dummy_plugin.so: dummy_plugin.c
gcc -g -Werror -D _GNU_SOURCE -Wall -shared -nostartfiles dummy_plugin.c -o dummy_plugin.so -iquote ../../../criu/include -iquote ../../criu/include -fPIC
PLUGIN_INC := ../../../criu/include
PLUGIN_INC_EXTRA := ../../criu/include
PLUGIN_INCLUDE := -iquote$(PLUGIN_INC) -iquote$(PLUGIN_INC_EXTRA)
LIBDRM_INC := -I/usr/include/libdrm
DEPS_OK := amdgpu_plugin.so
DEPS_NOK := ;

include $(__nmk_dir)msg.mk

CC := gcc
PLUGIN_CFLAGS := -g -Wall -Werror -D _GNU_SOURCE -shared -nostartfiles -fPIC

ifeq ($(CONFIG_AMDGPU),y)
all: $(DEPS_OK)
else
all: $(DEPS_NOK)
endif

criu-amdgpu.pb-c.c: criu-amdgpu.proto
protoc-c --proto_path=. --c_out=. criu-amdgpu.proto

amdgpu_plugin.so: amdgpu_plugin.c criu-amdgpu.pb-c.c
$(CC) $(PLUGIN_CFLAGS) $^ -o $@ $(PLUGIN_INCLUDE)

amdgpu_plugin_clean:
$(call msg-clean, $@)
$(Q) $(RM) amdgpu_plugin.so criu-amdgpu.pb-c*
.PHONY: amdgpu_plugin_clean
clean: amdgpu_plugin_clean

mrproper: clean

clean:
$(Q) $(RM) dummy_plugin.so
install:
$(Q) mkdir -p $(PLUGINDIR)
$(Q) install -m 644 dummy_plugin.so $(PLUGINDIR)
ifeq ($(CONFIG_AMDGPU),y)
$(E) " INSTALL " $(PLUGIN_NAME)
$(Q) install -m 644 $(PLUGIN_SOBJ) $(PLUGINDIR)
endif
.PHONY: install

uninstall:
$(Q) $(RM) $(PLUGINDIR)/dummy_plugin.so
ifeq ($(CONFIG_AMDGPU),y)
$(E) " UNINSTALL" $(PLUGIN_NAME)
$(Q) $(RM) $(PLUGINDIR)/$(PLUGIN_SOBJ)
endif
.PHONY: uninstall