New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add pidfd based pid reuse detection #1341
Add pidfd based pid reuse detection #1341
Conversation
Hi, sorry for a late response! From the first look it seems nice. I will look at it more closer, it may take some time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 Please split " criu: add pidfd based pid reuse detection" to several separate patches for easier code review.
See more info https://www.kernel.org/doc/html/v4.10/process/submitting-patches.html#separate-your-changes
You can split the "add new options" part from actual use of opts.pidfd_out_fd by adding just a "stab" options first and then using them in separate patch.
You can also split actual detect_pid_reuse part from part which fills the hash list and probably criu rpc part can be separated too.
2 I think we can also improve the overall design in two things:
-
Why use two sockets? - We can do the same with one unix socket, see how fdstore works (see fdstore_get, fdstore_add and MSG_PEEK). We can read old fds from socket without removing them from it we can add new pidfds to socket and replace old ones in case of pid-reuse.
-
Inheriting sockets works badly with criu swrk because we can use one criu swrk to dump multiple containers and it would be better to be able to change pid-reuse store socket on each request. You can take a look on how work_dir_fd is passed over rpc request.
criu/kerndat.c
Outdated
} | ||
|
||
if (close(ret)) | ||
return -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case of not supported SYS_pidfd_open you would call close(-1) and would get error from close and fail, probably it's not what you want. Please consider to use close_safe() helper.
Also please rename "ret" to "fd" to make your code easier to read.
criu/mem.c
Outdated
struct pidfd_entry { | ||
pid_t pid; | ||
int pidfd; | ||
struct hlist_node hash; /* To lookup pidfd by pid */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strange offsets -> Please left-allign names of struct members with tabs.
|
||
for (i = 0; i < PIDFD_HASH_SIZE; i++) { | ||
hlist_for_each_entry(entry, &pidfd_hash[i], hash) | ||
xfree(entry); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is wrong, You don't want to cleanup hlist like this.
You should use hlist_for_each_entry_safe and delete entries from hlist for consistency before you free them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ZeyadYasser could you address this comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@avagin I not sure why github didn't reflect the change in the conversation, but I updated that part using hlist_for_each_entry_safe
. Also I just noticed that I need to reset the heads of each hlist INIT_HLIST_HEAD
, thanks for pointing it out.
https://github.com/checkpoint-restore/criu/pull/1341/files#diff-b6aa45df2259d449d7828c7a5ed410a3282d977aa73607e6111808016dd3e2d3R150
*/ | ||
if (is_pid_closed(entry->pidfd)) { | ||
pr_err("Pid reuse detected for pid %d\n", item->pid->real); | ||
return -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should "return 1;" instead of "return -1;" in both cases above. If new process arrives which was not noticed in previous predump we should just make a full memory image for it without using previous predumps, and we don't want to fail in this case.
Also please use pr_warn instead of pr_err.
"return -1;" is used to report some internal error, 0 is for non-reuse, 1 for reuse cases.
@Snorch Thanks for such a detailed review. I really like the elegance of fdstore_get/fdstore_add utilizing the same socket as the fd storage but I faced a small problem. I am inclined towards using pidfd_getfd() given the elegance it would provide similar to what is done in fdstore_get/fdstore_add. |
It looks like you are right about this problem, sorry I didn't thought about
Some more options:
But probably it's a bit over-complicated.
Probably (3) is the most promissing one, please consider =) (upd: ah I see #1339, that is sad...) |
Sorry for such a late response. I am working on it. I am in the middle of my exams so It might take some time.
Yes, That would have been great doing that without modifying p.haul, It was what I started with at first but unfortunately it wasn't feasible. |
c528687
to
ce5330e
Compare
Sorry for taking too long. I hope I was able to address all the issues in the previous version. I ended up using a single socket to store the pidfds as @Snorch recommended, and now this pidfd_store socket is passed as an RPC option instead of being inherited in swrk mode which comes with the unwanted side effects that @Snorch pointed out. A new syscall called pidfd_getfd allows criu to steal the pidfd_store socket while still keeping the socket alive if criu is closed as it still referenced by a fd from the migration tool. Any feedback would be highly appreciated. Thanks |
9163a29
to
1d62dcc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ZeyadYasser Nice work! Please see inline comments.
One more thing, do we have a test for it? Can we provide pidfd_store_sk in zdtm with iterations.
And yes, sorry for a late response...
@@ -100,3 +100,5 @@ __NR_fsopen 430 sys_fsopen (char *fsname, unsigned int flags) | |||
__NR_fsconfig 431 sys_fsconfig (int fd, unsigned int cmd, const char *key, const char *value, int aux) | |||
__NR_fsmount 432 sys_fsmount (int fd, unsigned int flags, unsigned int attr_flags) | |||
__NR_clone3 435 sys_clone3 (struct clone_args *uargs, size_t size) | |||
__NR_pidfd_open 434 sys_pidfd_open (pid_t pid, unsigned int flags) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have some formatting problems (likely missing tab):
__NR_clone3 435 sys_clone3 (struct clone_args *uargs, size_t size)
__NR_pidfd_open 434 sys_pidfd_open (pid_t pid, unsigned int flags)
__NR_pidfd_getfd 438 sys_pidfd_getfd (int pidfd, int targetfd, unsigned int flags)
Same applies to all definitions of __NR_pidfd_open in other architectures. Note that in previous version you had this tab.
criu/kerndat.c
Outdated
@@ -1185,6 +1262,14 @@ int kerndat_init(void) | |||
pr_err("has_time_namespace failed when initializing kerndat.\n"); | |||
ret = -1; | |||
} | |||
if (!ret && kerndat_has_pidfd_open()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your function always return 0, so you can make it void and remove error checking like:
if (!ret)
kerndat_has_pidfd_open();
criu/kerndat.c
Outdated
if (val_b == val_a) { | ||
kdat.has_pidfd_getfd = true; | ||
} else { | ||
kdat.has_pidfd_getfd = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When val_b != val_a, that means tha some unexpected problem unrelated to pidfd_getfd support happened, would be better to set ret = -1 instead of has_pidfd_getfd = false.
criu/kerndat.c
Outdated
} | ||
|
||
val_a = 1984; | ||
if (write(fds[0], &val_a, sizeof(val_a)) == -1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's check that we wrote the right size:
if (write(fd, &val_a, sizeof(val_a)) != sizeof(val_a)) {
Same applies to read() below.
criu/mem.c
Outdated
} | ||
|
||
pidfd_store_sk = syscall(SYS_pidfd_getfd, pidfd, sk, 0); | ||
close(pidfd); /* Not needed anymore */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This close overrides the errno of syscall. And the pr_perror below would print wrong errno. Probably it's better to instead do:
+ pidfd_store_sk = syscall(SYS_pidfd_getfd, pidfd, sk, 0);
+ if (pidfd_store_sk == -1) {
+ pr_perror("Can't steal fd %d using pidfd_getfd", sk);
+ close(pidfd);
+ goto err;
+ }
+ close(pidfd);
goto err; | ||
} | ||
|
||
pidfd_store_sk = syscall(SYS_pidfd_getfd, pidfd, sk, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer doing close_safe(&pidfd_store_sk) before opening it, because we can have some leftover from pervious iterations of pre-dump/dump, I see that you do close_safe(&pidfd_store_sk); in free_pidfd_store() but it is only done in the last patch, so after "cr-service: add pidfd_store_sk option to rpc.proto" but before "criu: add pidfd based pid reuse detection for RPC clients" we can see situations where this pidfd_store_sk is not -1 and is open fd...
while (1) { | ||
ret = poll(&pollfd, 1, 0); | ||
if (errno == EINTR) | ||
continue; /* restart polling */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably some reasonable limit on polling restarts would be nice here.
criu/mem.c
Outdated
if (pidfd_store_sk == -1) | ||
return 0; | ||
|
||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Space at the end of line.
} | ||
|
||
if (send_fds(pidfd_store_sk, NULL, 0, &pidfd, 1, &pid, sizeof(pid_t))) { | ||
pr_perror("Can't send pidfd %d of pid %d", pidfd, pid); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (!entry)
close(pidfd);
pr_perror("Can't send pidfd %d of pid %d", pidfd, pid); | ||
return -1; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (!entry)
close(pidfd);
Else you would leak it.
11d367d
to
37d56d6
Compare
Codecov Report
@@ Coverage Diff @@
## criu-dev #1341 +/- ##
============================================
- Coverage 68.72% 68.50% -0.23%
============================================
Files 133 133
Lines 32279 32468 +189
============================================
+ Hits 22185 22242 +57
- Misses 10094 10226 +132
Continue to review full report at Codecov.
|
@Snorch Thank you for such detailed review. I hope I have addressed the issues appropriately. |
37d56d6
to
cc7bc6b
Compare
I just noticed that the new test was never triggered in CI because it requires a recent kernel (>= 5.6) so I added the test to the Fedora 33 vagrant test because it needs to be run manually for RPC mode to work. I am not sure if I could tell the test to run in RPC mode in the .desc file. |
I think that both are ok: a) If criu didn't send the pidfd in the last iteration, it probably does not see this pid at that time e.g.: a new task forked in migrated container. b) If someone sends empty pidfd_store on non-first iteration - yes that's probably a problem, but it's on the sender side.. Though probably in this case we can fail in init_pidfd_store_hash() if the store is completely empty and at the same time criu has previous images dir option (--prev-images-dir) set. (Or at least print warning.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All comments fixed, and test is working right. Thanks!
Reviewed-by: Pavel Tikhomirov ptikhomirov@virtuozzo.com
cc7bc6b
to
c97ae41
Compare
pidfd_open syscall will be needed later to send pidfds between pre-dump/dump iterations for pid reuse detection. v2: - make kerndat_has_pidfd_open void since 0 is always returned - fix missing tabs in syscall tables Signed-off-by: Zeyad Yasser <zeyady98@gmail.com>
pidfd_getfd syscall will be needed later to send pidfds between pre-dump/dump iterations for pid reuse detection. v2: - check size written/read of val_a/val_b is correct - return with error when val_a != val_b Signed-off-by: Zeyad Yasser <zeyady98@gmail.com>
pidfd_store_sk option will be used later to store tasks pidfds between predumps to detect pid reuse reliably. pidfd_store_sk should be a fd of a connectionless unix socket. init_pidfd_store_sk() steals the socket from the RPC client using pidfd_getfd, checks that it is a connectionless unix socket and checks if it is not initialized before (i.e. unnamed socket). If not initialized the socket is first bound to an abstract name (combination of the real pid/fd to avoid overlap), then it is connected to itself hence allowing us to store the pidfds in the receive queue of the socket (this is similar to how fdstore_init() works). v2: - avoid close(pidfd) overriding errno of SYS_pidfd_open in init_pidfd_store_sk() - close pidfd_store_sk because we might have leftover from previous iterations Signed-off-by: Zeyad Yasser <zeyady98@gmail.com>
pidfd_store which will be used for reliable pidfd based pid reuse detection for RPC clients requires two recent syscalls (pidfd_open and pidfd_getfd). We allow checking if pidfd_store is supported using: 1. CLI: criu check --feature pidfd_store 2. RPC: CRIU_REQ_TYPE__FEATURE_CHECK and set pidfd_store to true in the "features" field of the request Signed-off-by: Zeyad Yasser <zeyady98@gmail.com>
c97ae41
to
3697cb7
Compare
I added a check that fails when the pidfd store is empty after the first pre-dump. |
pollfd.events = POLLIN; | ||
|
||
while (1) { | ||
ret = poll(&pollfd, 1, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
poll with the zero timeout can't return EINTR, can it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@avagin I encountered this behavior before once when polling with zero timeout in Golang, This was most likely due to the Go runtime bombarding the thread with SIGURG
. I am not sure if there is a scenario where CRIU would receive a high frequency of signals that would cause zero timeout poll to interrupt, so I went with the safe route.
Your perspective on this would be highly valuble.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@avagin Looks like it actually can return EINTR.
do_poll()
{
...
if (!count) {
count = wait->error;
if (signal_pending(current))
count = -ERESTARTNOHAND;
}
}
https://github.com/torvalds/linux/blob/8124c8a6b35386f73523d27eacb71b5364a68c4c/fs/select.c#L935
If signal_pending is true. I don't see any explicit protection from it. I not clearly, but also remember that I faced this kinda thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I had to be more detailed. I actually mean that it will not return EINTR if there is any event.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ZeyadYasser Anyway, the code in check_pidfd_entry_state() is badly broken. Shame on me that I initially missed it...
!!!!One should never rely on errno if libc syscall is successful!!!!
Imagine on the first iteration you get no event and a signal and thus get errno == EINTR or errno was already EINTR before we even started poll-ing because of previous calls, then all later iterations would be successful but errno would still continue to be EINTR, and we would cycle 10 times for no reason.
Also it is known practice for glibc to set errno to some error even in case of successful glibc syscall, because it can do "retries" inside glibc and first try may set errno, while second try would be successful.
Probably it was the case with Golang too, I've seen places in docker code where they forget to check retcode and wrongly rely on errno =)
(upd) So we should check ret too, will send a patch.
(upd2) Patch #1494
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad, I didn't know that a call could be successful and errno not changing.
Thank you
*/ | ||
cnt = 0; | ||
while (1) { | ||
entry = xmalloc(sizeof(struct pidfd_entry)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (entry == NULL) ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@avagin I don't understand what is the problem, could you please clarify?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ZeyadYasser you should check error on allocation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e.g., something like this
entry = xmalloc(sizeof(struct pidfd_entry));
if (entry == NULL) {
return -1;
}
Closes: checkpoint-restore#717 This increases the reliability of pid reuse detection using pidfds, currently through RPC migration tools like P.Haul. A connectionless unix socket is passed to criu in RPC mode through the RPC option pidfd_store_sk. If this option is set, the socket is initialized in init_pidfd_store_sk() to be used as a queue for task pidfds. criu then sends tasks pidfds to this socket in send_pidfd_entry() and receives them in the next pre-dump/dump iteration to build the pidfds hashtable in init_pidfd_store_hash(). These pidfds will be used later in detect_pid_reuse(). How it should be used in migration tools like P.Haul: - Open a connectionless unix socket - Pass the socket fd in the RPC option pidfd_store_sk when doing a pre-dump or dump This will fail if the kernel does not support pidfd_open or pidfd_getfd syscalls, so pidfd_store_sk should not be set if the kernel does not support pidfd_open. This could be checked with: CLI: criu check --feature pidfd_store RPC: CRIU_REQ_TYPE__FEATURE_CHECK and set pidfd_store to true in the "features" field of the request v2: - add reasonable polling restart limit in check_pidfd_entry_state to avoid getting stuck - avoid leaking pidfd in send_pidfd_entry when entry is NULL, otherwise pidfds are freed in free_pidfd_store v3: - check that the passed pidfd store is not empty after the first iteration (i.e. --prev-images-dir option set). v4: - clear pidfd_hash heads - check entry allocation error in init_pidfd_store_hash() Signed-off-by: Zeyad Yasser <zeyady98@gmail.com>
When testing pid reuse using pidfd_store feature in RPC mode we need to pass a unix socket fd used to CRIU in the RPC option pidfd_store_sk to store the pidfds between predump/dump iterations. Signed-off-by: Zeyad Yasser <zeyady98@gmail.com>
This is just a symlink to the original transition/pid_reuse test with the right options passed to trigger the pidfd store based pid reuse detection code path. Pidfd store based detection is supported only in RPC mode which requires passing a unix socket fd to be used as pidfd store and the kernel should support pidfd_open and pidfd_getfd syscalls {'feature': 'pidfd_store'} for this test to work. Signed-off-by: Zeyad Yasser <zeyady98@gmail.com>
3697cb7
to
aa3e624
Compare
@ZeyadYasser I decided to merge this PR in criu-dev. Thank you for this work! We need to do a few actions to merge it into the master branch: https://criu.org/New_features_checklist. As for the code, I think it would be better to move it into a separate file. |
This PR closes #717.
This increases the reliability of pid reuse detection using pidfds, mainly during migration like in P.Haul.
A unix domain socket pair is passed to criu (--pidfd-in-fd and --pidfd-in-fd) in CLI or RPC mode.
If those options are set, criu sends tasks pidfds to pidfd_out_fd in send_pidfd_entry() and receives them in the next pre-dump/dump iteration from pidfd_in_fd in init_pidfd_hash().
These pidfds will be used later in detect_pid_reuse().
When pidfd based pid reuse detection is used, the output of detect_pid_reuse() is deterministic hence it returns -1 if pid reuse is detected (task associated with pidfd is closed) which causes criu to exit.
How it should be used in migration tools like P.Haul:
- Open unix domain socket pair
- Launch criu in SWRK mode so that it inherits the socket pair fds
- Pass the socket pair fds in the RPC opts pidfd_in_fd and pidfd_out_fd when doing a pre-dump or dump
This will fail if the kernel does not support pidfd_open syscall, so those options should not be set if the kernel does not support pidfd_open.
This could be checked with:
criu check --feature pidfd_open
I have already tried this with P.Haul and it is working as expected. As soon as this gets merged I will open a PR for P.Haul to enable this feature.
Thanks @Snorch for giving me an idea on how to approach this.