RFC: AMD ROCm support with plugin #1519
Changes from 27 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,99 @@ | ||
ROCM Support(1) | ||
=============== | ||
|
||
NAME | ||
---- | ||
amdgpu_plugin - A plugin extension to CRIU to support checkpoint/restore in | ||
userspace for AMD GPUs. | ||
|
||
|
||
CURRENT SUPPORT | ||
--------------- | ||
Single and Multi GPU systems (Gfx9) | ||
Checkpoint / Restore on same system | ||
Checkpoint / Restore inside a docker container | ||
PyTorch | ||
|
||
DESCRIPTION | ||
----------- | ||
Though *criu* is a great tool for checkpointing and restoring running | ||
applications, it has certain limitations; for example, it cannot handle | ||
applications that have device files open. To support *ROCm*-based | ||
workloads with *criu*, we need to augment criu's core functionality with a | ||
plugin-based extension mechanism. *amdgpu_plugin* provides the support | ||
criu needs to checkpoint and restore ROCm workloads. | ||
|
||
|
||
Dependencies | ||
~~~~~~~~~~~~~~ | ||
*amdkfd support*:: | ||
To snapshot the *VRAM* and other *GPU* device state, an updated | ||
version of the amdkfd (amdgpu) driver is required. The kernel patches are | ||
currently under review. | ||
|
||
*criu 3.15*:: | ||
This work is rebased on criu 3.15, the latest release available at the time of writing. | ||
|
||
|
||
OPTIONS | ||
------- | ||
Optional parameters can be passed as environment variables set before | ||
executing the criu command. | ||
|
||
*KFD_FW_VER_CHECK*:: | ||
Enable or disable the firmware version check. | ||
If enabled, the firmware version on the restored GPU must be greater than or | ||
equal to the firmware version on the checkpointed GPU. Default: Enabled | ||
|
||
E.g: | ||
KFD_FW_VER_CHECK=0 | ||
|
||
*KFD_SDMA_FW_VER_CHECK*:: | ||
Enable or disable the SDMA firmware version check. | ||
If enabled, the SDMA firmware version on the restored GPU must be greater than or | ||
equal to the SDMA firmware version on the checkpointed GPU. Default: Enabled | ||
|
||
E.g: | ||
KFD_SDMA_FW_VER_CHECK=0 | ||
|
||
*KFD_CACHES_COUNT_CHECK*:: | ||
Enable or disable the cache count check. If enabled, the cache count on the | ||
restored GPU must be greater than or equal to the cache count on the checkpointed | ||
GPU. Default: Enabled | ||
|
||
E.g: | ||
KFD_CACHES_COUNT_CHECK=0 | ||
|
||
*KFD_NUM_GWS_CHECK*:: | ||
Enable or disable the num_gws check. If enabled, num_gws on the | ||
restored GPU must be greater than or equal to num_gws on the checkpointed | ||
GPU. Default: Enabled | ||
|
||
E.g: | ||
KFD_NUM_GWS_CHECK=0 | ||
|
||
*KFD_VRAM_SIZE_CHECK*:: | ||
Enable or disable the VRAM size check. If enabled, the VRAM size on the | ||
restored GPU must be greater than or equal to the VRAM size on the checkpointed | ||
GPU. Default: Enabled | ||
|
||
E.g: | ||
KFD_VRAM_SIZE_CHECK=0 | ||
|
||
*KFD_NUMA_CHECK*:: | ||
Enable or disable the NUMA CPU region check. If enabled, the plugin will restore | ||
GPUs that belong to one CPU NUMA region to the same CPU NUMA region. | ||
Default: Enabled | ||
|
||
E.g: | ||
KFD_NUMA_CHECK=0 | ||
|
||
|
||
AUTHOR | ||
------ | ||
The AMDKFD team. | ||
|
||
|
||
COPYRIGHT | ||
--------- | ||
Copyright \(C) 2020-2021, Advanced Micro Devices, Inc. (AMD) |
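The option semantics documented in this file (each check is enabled by default, is switched off by setting its variable to 0, and otherwise requires the restored GPU's value to be greater than or equal to the checkpointed one) can be sketched in shell. This is an illustration only: the helper names `kfd_check_enabled` and `kfd_value_ok` are hypothetical, and treating any value other than `0` as enabled is an assumption, not the plugin's confirmed parsing.

```shell
#!/bin/sh
# Illustrative sketch only; not the plugin's actual code.

# kfd_check_enabled VAR
# Succeeds unless the variable named VAR is set to 0.
# Assumption: unset, empty, or any other value means "enabled".
kfd_check_enabled() {
    eval "kfd_val=\${$1:-1}"
    [ "$kfd_val" != "0" ]
}

# kfd_value_ok VAR CHECKPOINTED RESTORED
# Succeeds if the check named VAR is disabled, or if the restored
# value is greater than or equal to the checkpointed value.
kfd_value_ok() {
    if ! kfd_check_enabled "$1"; then
        return 0
    fi
    [ "$3" -ge "$2" ]
}

# Example: a restored GPU with less VRAM (in MiB) than the checkpointed one.
if kfd_value_ok KFD_VRAM_SIZE_CHECK 16384 8192; then
    echo "restore allowed"
else
    echo "restore refused: VRAM too small"
fi
```

On the command line, a variable can be set for a single invocation by prefixing the command, for example `KFD_VRAM_SIZE_CHECK=0 criu restore -D <images-dir>`, where the images directory is a placeholder.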
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -147,7 +147,7 @@ HOSTCFLAGS += $(WARNINGS) $(DEFINES) -iquote include/ | |
export AFLAGS CFLAGS USERCLFAGS HOSTCFLAGS | ||
|
||
# Default target | ||
all: flog criu lib crit | ||
all: flog criu lib crit amdgpu_plugin | ||
Review comment: As Adrian mentioned in his comment, the plugin should not be built automatically by typing [...]. To enable the build we could use something like [...].
Reply: Ack, I've added an nmk-based dependency check on libdrm in the Makefile. |
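One way to read this exchange is that the plugin target is added to the build only when libdrm is detected. A minimal shell sketch of that idea follows; it is not the nmk-based check the author refers to. The function names are hypothetical, and probing with `pkg-config --exists libdrm` is an assumption about how libdrm could be detected.

```shell
#!/bin/sh
# Illustrative sketch only; the PR's real check lives in the Makefile (nmk).

# have_libdrm: succeeds if pkg-config is installed and knows about libdrm.
have_libdrm() {
    command -v pkg-config >/dev/null 2>&1 && pkg-config --exists libdrm
}

# build_targets PROBE...
# Prints the default targets from the Makefile, appending amdgpu_plugin
# only when the probe command succeeds.
build_targets() {
    targets="flog criu lib crit"
    if "$@"; then
        targets="$targets amdgpu_plugin"
    fi
    echo "$targets"
}

# Show which targets this machine would build.
build_targets have_libdrm
```

The printed list could then be handed to make, for example `make $(build_targets have_libdrm)`, so that a machine without libdrm still builds the core targets.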
||
.PHONY: all | ||
|
||
# | ||
|
@@ -290,15 +290,19 @@ clean mrproper: | |
$(Q) $(MAKE) $(build)=crit $@ | ||
.PHONY: clean mrproper | ||
|
||
clean-amdgpu_plugin: | ||
$(Q) $(MAKE) -C plugins/amdgpu clean | ||
.PHONY: clean-amdgpu_plugin | ||
|
||
clean-top: | ||
$(Q) $(MAKE) -C Documentation clean | ||
$(Q) $(MAKE) $(build)=test/compel clean | ||
$(Q) $(RM) .gitid | ||
.PHONY: clean-top | ||
|
||
clean: clean-top | ||
clean: clean-top clean-amdgpu_plugin | ||
|
||
mrproper-top: clean-top | ||
mrproper-top: clean-top clean-amdgpu_plugin | ||
$(Q) $(RM) $(CONFIG_HEADER) | ||
$(Q) $(RM) $(VERSION_HEADER) | ||
$(Q) $(RM) $(COMPEL_VERSION_HEADER) | ||
|
@@ -326,6 +330,10 @@ test: zdtm | |
$(Q) $(MAKE) -C test | ||
.PHONY: test | ||
|
||
amdgpu_plugin: | ||
$(Q) $(MAKE) -C plugins/amdgpu all | ||
.PHONY: amdgpu_plugin | ||
|
||
# | ||
# Generating tar requires tag matched CRIU_VERSION. | ||
# If not found then simply use GIT's describe with | ||
|
@@ -403,6 +411,7 @@ help: | |
@echo ' cscope - Generate cscope database' | ||
@echo ' test - Run zdtm test-suite' | ||
@echo ' gcov - Make code coverage report' | ||
@echo ' amdgpu_plugin - Make AMD GPU plugin' | ||
.PHONY: help | ||
|
||
lint: | ||
|
The name amdgpu_plugin (i.e., man amdgpu_plugin) is global. What do you think about using something like criu-amdgpu-plugin instead?