RFC: AMD ROCm support with plugin #1519

Closed
29 commits
51efa5c
Revert "Allow systemcfg proc file to be dumped"
rajbhar May 15, 2021
fbfa556
criu/parse: Treat some unsupported VMAs as regular
rajbhar Nov 20, 2020
7146313
criu/plugin: Initialize AMD KFD header
rajbhar May 4, 2021
4028ddc
criu/files-reg: Add offset and file path plugin
rajbhar Apr 15, 2021
311aee4
criu/plugin: Support AMD ROCm Checkpoint Restore with KFD
rajbhar Apr 15, 2021
4c1b8f3
criu/plugin: Optimize the proto image size
rajbhar Feb 3, 2021
f55fe48
criu/plugin: optimization for large bar read
rajbhar Feb 26, 2021
ca99c15
criu/restore: Introduce restore late stage hook
rajbhar Apr 15, 2021
ffeb86b
criu/plugin: Implement restore late hook for kfd
rajbhar Apr 15, 2021
0a6771f
criu/plugin: Add support for dumping and restoring queues
dayatsin-amd Jan 26, 2021
c08db4a
criu/plugin: dump debug logs selectively
rajbhar Feb 12, 2021
24a6761
criu/plugin: Support larger memory footprints
dayatsin-amd Feb 16, 2021
0c32304
criu/plugin: Dump and restore events
dayatsin-amd Apr 15, 2021
9ff8973
criu/plugin: Add initial documentation for ROCm support.
rajbhar Mar 18, 2021
e4819aa
criu/plugin: Re-adjust doorbell offset for queues
dayatsin-amd Mar 22, 2021
1199c20
criu/plugin: Pytorch container with criu
rajbhar Apr 13, 2021
95c9258
criu/plugin: Dockerfile for AMD criu repo
rajbhar Apr 20, 2021
8268b61
criu/files: *RFC* Don't cache fd for amdgpu devices
rajbhar Apr 27, 2021
d83ddd5
criu/plugin: Add whitepaper document
fxkamd Apr 30, 2021
274aabd
criu/plugin: Add build options for amdgpu plugin
rajbhar May 12, 2021
84135f4
criu/plugin: Implement system topology parsing
dayatsin-amd Apr 20, 2021
16778cc
criu/plugin: Remap GPUs on checkpoint restore
dayatsin-amd Apr 20, 2021
98eddc9
criu/plugin: Add parameters to override mapping
dayatsin-amd Apr 20, 2021
c87fdf5
criu/plugin: Add unit tests for GPU remapping
dayatsin-amd May 18, 2021
ee928e1
criu/plugin: Read and write BO contents in parallel
dayatsin-amd May 18, 2021
d838942
criu/plugin: Restore libhsakmt shared memory files
dayatsin-amd Jun 10, 2021
a5df3ad
criu/plugin: fix build warnings
rajbhar Jun 25, 2021
f81a453
script/builds: add build dependepncy for libdrm
rajbhar Jun 25, 2021
4f864a1
Merge branch 'criu-dev' into criu-dev
rajbhar Jun 25, 2021
18 changes: 9 additions & 9 deletions plugins/amdgpu/amdgpu_plugin_topology.c
@@ -452,7 +452,7 @@ static int parse_topo_node_mem_banks(struct tp_node *node, const char *dir_path)

 	while ((dirent_node = readdir(d_node)) != NULL) {
 		char line[300];
-		char bank_path[300];
+		char bank_path[1024];
Member

Could you please apply these changes in the "criu/plugin: Support AMD ROCm Checkpoint Restore with KFD" commit, which introduced amdgpu_plugin_topology.c?
This could be done, for example, with git rebase -i 311aee4ff^, then change pick to edit on the first line and save.

edit 311aee4ff criu/plugin: Support AMD ROCm Checkpoint Restore with KFD

After the changes have been applied, you can use git commit -a --amend and git rebase --continue.

Contributor Author

> Could you please advise whether this approach (f81a453) to fixing some container build dependencies is OK?

> The build dependencies for amdgpu_plugin should be optional and CRIU should work even if libdrm-dev is not installed. As far as I know, the current CI doesn't have AMD GPU available to run tests for the plugin, but it would be good to have tests for it.

I am not sure I understood you here. I made the change to the CI docker scripts but the tests were not triggered. I am not sure whether maintainers need to trigger them manually.

Member

Currently, CRIU fails to compile if libdrm is not installed. Instead, when libdrm is not installed, the build should skip amdgpu_plugin.

> I made the change to the CI docker scripts but the tests were not triggered.

The CI tests are triggered on git push. For example, you can see in the Alpine test logs that libdrm-dev has been installed:

Step 3/11 : RUN apk update && apk add 	$CC 	bash 	build-base 	coreutils 	git 	gnutls-dev 	libaio-dev 	libcap-dev 	libnet-dev 	libnl3-dev 	nftables 	nftables-dev 	pkgconfig 	protobuf-c-dev 	protobuf-dev 	py3-pip 	py3-protobuf 	python3 	sudo 	libdrm-dev
...
(59/113) Installing libdrm (2.4.106-r0)
(60/113) Installing libpciaccess-dev (0.16-r0)
(61/113) Installing libdrm-dev (2.4.106-r0)
...

 		struct stat st;
 		int id;

@@ -463,16 +463,16 @@ static int parse_topo_node_mem_banks(struct tp_node *node, const char *dir_path)
 		if (sscanf(dirent_node->d_name, "%d", &id) != 1)
 			continue;

-		sprintf(bank_path, "%s/%s", path, dirent_node->d_name);
+		snprintf(bank_path, sizeof(bank_path), "%s/%s", path, dirent_node->d_name);
 		if (stat(bank_path, &st)) {
 			pr_err("Cannot to access %s\n", path);
 			ret = -EACCES;
 			goto fail;
 		}
 		if ((st.st_mode & S_IFMT) == S_IFDIR) {
-			char properties_path[300];
+			char properties_path[PATH_MAX];

-			sprintf(properties_path, "%s/properties", bank_path);
+			snprintf(properties_path, sizeof(properties_path), "%s/properties", bank_path);

 			file = fopen(properties_path, "r");
 			if (!file) {
@@ -529,7 +529,7 @@ static int parse_topo_node_iolinks(struct tp_node *node, const char *dir_path)
 	FILE *file = NULL;
 	int ret = 0;

-	sprintf(path, "%s/io_links", dir_path);
+	snprintf(path, sizeof(path), "%s/io_links", dir_path);

 	d_node = opendir(path);
 	if (!d_node) {
@@ -539,7 +539,7 @@ static int parse_topo_node_iolinks(struct tp_node *node, const char *dir_path)

 	while ((dirent_node = readdir(d_node)) != NULL) {
 		char line[300];
-		char iolink_path[300];
+		char iolink_path[1024];
 		struct stat st;
 		int id;

@@ -550,16 +550,16 @@ static int parse_topo_node_iolinks(struct tp_node *node, const char *dir_path)
 		if (sscanf(dirent_node->d_name, "%d", &id) != 1)
 			continue;

-		sprintf(iolink_path, "%s/%s", path, dirent_node->d_name);
+		snprintf(iolink_path, sizeof(iolink_path), "%s/%s", path, dirent_node->d_name);
 		if (stat(iolink_path, &st)) {
 			pr_err("Cannot to access %s\n", path);
 			ret = -EACCES;
 			goto fail;
 		}
 		if ((st.st_mode & S_IFMT) == S_IFDIR) {
-			char properties_path[300];
+			char properties_path[PATH_MAX];

-			sprintf(properties_path, "%s/properties", iolink_path);
+			snprintf(properties_path, sizeof(properties_path), "%s/properties", iolink_path);

 			file = fopen(properties_path, "r");
 			if (!file) {
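
A note on the sprintf() to snprintf() change in this diff: snprintf() bounds the write, but it truncates silently, so a path that is too long would still reach stat() or fopen() in shortened form. The following is only a minimal sketch of how the return value could be checked. join_path() is a hypothetical helper introduced here for illustration; it is not part of this PR or of CRIU, and the error codes it returns are an assumption.

```c
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of CRIU or this PR): write "dir/name"
 * into buf. Returns 0 on success, -ENAMETOOLONG if the result did not
 * fit in size bytes, or -EINVAL if snprintf() reported an error.
 */
static int join_path(char *buf, size_t size, const char *dir, const char *name)
{
	int len = snprintf(buf, size, "%s/%s", dir, name);

	if (len < 0)
		return -EINVAL;
	/*
	 * snprintf() returns the length the full string would have had,
	 * excluding the terminating NUL, so len >= size means truncation.
	 */
	if ((size_t)len >= size)
		return -ENAMETOOLONG;
	return 0;
}
```

Under these assumptions, a caller in the readdir() loop could test the result of join_path(bank_path, sizeof(bank_path), path, dirent_node->d_name) and take the existing fail path on a non-zero return, instead of passing a truncated path to stat().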