Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

criu/plugin: Add NVIDIA CUDA plugin #2416

Merged
merged 3 commits into from
Jul 1, 2024

Conversation

jesus-ramos
Copy link
Contributor

Adding support for the NVIDIA cuda-checkpoint utility, requires the use of an r555 or higher driver along with the cuda-checkpoint binary.

@rst0git rst0git self-assigned this Jun 4, 2024
@adrianreber
Copy link
Member

adrianreber commented Jun 5, 2024

If you rebase the "CentOS Stream 8" test error should disappear as we removed it (it is EOL).

A README.md with some information about what is going on and how to use it would be helpful in the plugin directory. Maybe a link to the cuda-checkpoint tool.

I would also recommend to at least have two commits here. One with the changes to CRIU and a more detailed commit message why you need the changes and one commits with actual plugin.

@jesus-ramos
Copy link
Contributor Author

For the CRIU changes commit would you just like the 2 changes to cr-dump.c and cr-restore.c or should I include the new plugin hooks/calls in that one as well?

@adrianreber
Copy link
Member

For the CRIU changes commit would you just like the 2 changes to cr-dump.c and cr-restore.c or should I include the new plugin hooks/calls in that one as well?

One goal is to have each commit still compile. If your changes are not directly dependent on each other moving it into extra commits sounds correct. It really depends and is not always a yes or no decision. But if you already ask the question then it would probably make sense to split it, because that seems to indicate to me you also see it as two independent things.

@jesus-ramos jesus-ramos force-pushed the cuda_plugin branch 2 times, most recently from 5a6a20a to c49326c Compare June 7, 2024 18:52
…and run the plugin finalizer later in the dump sequence

Restore rseq_cs state before calling RESUME_DEVICES_LATE as the CUDA plugin will
temporarily unfreeze a thread during the plugin hook to assist with device
restore

Run the plugin finalizer later in the dump sequence since the finalizer is used
by the CUDA plugin to handle some process cleanup

Signed-off-by: Jesus Ramos <jeramos@nvidia.com>
…DEVICES to be used during pstree collection

PAUSE_DEVICES is called before a process is frozen and is used by the CUDA
plugin to place the process in a state that's ready to be checkpointed and
quiesce any pending work

CHECKPOINT_DEVICES is called after all processes in the tree have been frozen
and PAUSE'd and performs the actual checkpointing operation for CUDA
applications

Signed-off-by: Jesus Ramos <jeramos@nvidia.com>
@jesus-ramos jesus-ramos force-pushed the cuda_plugin branch 2 times, most recently from 5cdb90e to 30b4df8 Compare June 10, 2024 18:02
plugins/cuda/README.md Outdated Show resolved Hide resolved
plugins/cuda/README.md Outdated Show resolved Hide resolved
Makefile Show resolved Hide resolved
plugins/cuda/Makefile Outdated Show resolved Hide resolved
@rst0git
Copy link
Member

rst0git commented Jun 11, 2024

@jesus-ramos It would be great if we could add tests for this functionality. We currently don't have GPUs available in CI, but we could add tests to verify that the cuda-checkpoint tool is called with the correct options during checkpoint/restore.

plugins/cuda/cuda_plugin.c Outdated Show resolved Hide resolved
plugins/cuda/cuda_plugin.c Outdated Show resolved Hide resolved
plugins/cuda/cuda_plugin.c Outdated Show resolved Hide resolved
plugins/cuda/cuda_plugin.c Outdated Show resolved Hide resolved
plugins/cuda/cuda_plugin.c Outdated Show resolved Hide resolved
@avagin
Copy link
Member

avagin commented Jun 14, 2024

Thank you for contributing to the project. Overall, this PR looks good to me.

plugins/cuda/cuda_plugin.c Outdated Show resolved Hide resolved
@avagin
Copy link
Member

avagin commented Jun 14, 2024

Here is a POC patch how we can test the cuda plugin with a mocked cuda-checkpoint tool:
avagin@89df45d

plugins/cuda/cuda_plugin.c Outdated Show resolved Hide resolved
@jesus-ramos
Copy link
Contributor Author

#11 17.04 In file included from ../../compel/include/uapi/compel/infect.h:6,
#11 17.04 from ../../criu/include/proc_parse.h:6,
#11 17.04 from cuda_plugin.c:6:
#11 17.04 ../../compel/include/uapi/compel/asm/sigframe.h:5:10: fatal error: asm/elf.h: No such file or directory
#11 17.04 5 | #include <asm/elf.h>
#11 17.04 | ^~~~~~~~~~~
#11 17.04 compilation terminated.

For the ppc and arm builds do I need to add anything to the plugin Makefile to include the proper directory for this?

@rst0git
Copy link
Member

rst0git commented Jun 28, 2024

For the ppc and arm builds do I need to add anything to the plugin Makefile to include the proper directory for this?

This error does not seem to be related to the changes in this pull request. It also occurs on the criu-dev branch.

plugins/cuda/cuda_plugin.c Outdated Show resolved Hide resolved
Adding support for the NVIDIA cuda-checkpoint utility, requires the use of an
r555 or higher driver along with the cuda-checkpoint binary.

Signed-off-by: Jesus Ramos <jeramos@nvidia.com>
@avagin
Copy link
Member

avagin commented Jul 1, 2024

@ jesus-ramos I think we can merge this pr and do other improvements in separate pr-s. Does it sound good to you?

@jesus-ramos
Copy link
Contributor Author

That works for me.

@avagin avagin merged commit c0708cb into checkpoint-restore:criu-dev Jul 1, 2024
8 of 10 checks passed
Copy link
Member

@rst0git rst0git left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Snorch
Copy link
Member

Snorch commented Jul 2, 2024

It breaks build in CI:

#11 17.14 In file included from ../../compel/include/uapi/compel/asm/sigframe.h:4,
#11 17.14                  from ../../compel/include/uapi/compel/infect.h:6,
#11 17.14                  from ../../criu/include/proc_parse.h:6,
#11 17.14                  from cuda_plugin.c:6:
#11 17.14 /usr/include/x86_64-linux-gnu/asm/sigcontext.h:40:8: error: redefinition of 'struct _fpx_sw_bytes'
#11 17.14    40 | struct _fpx_sw_bytes {
#11 17.14       |        ^~~~~~~~~~~~~
#11 17.14 In file included from /usr/include/signal.h:301,
#11 17.14                  from ../../criu/include/util.h:7,
#11 17.14                  from cuda_plugin.c:3:

https://github.com/checkpoint-restore/criu/actions/runs/9754988212/job/26922763707?pr=2427

@jesus-ramos
Copy link
Contributor Author

Is it an issue with the include order or something else? I can see about getting some time on an aarch64 machine to check it out.

@Snorch
Copy link
Member

Snorch commented Jul 2, 2024

Is it an issue with the include order or something else? I can see about getting some time on an aarch64 machine to check it out.

Likely, in ./plugins/cuda/cuda_plugin.c:

#include "criu-log.h"
#include "plugin.h"
#include "util.h"
#include "cr_options.h"
#include "pid.h"
#include "proc_parse.h"

#include <common/list.h>
#include <compel/infect.h>

#include <ctype.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

criu headers should probably go AFTER system headers, but I'm not fully sure how it causes the problem.

@rst0git
Copy link
Member

rst0git commented Jul 2, 2024

It breaks build in CI:

Andrei opened a pull request to fix this problem: #2428

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants