Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Socket file missing after successful checkpoint and restore step in postgres image #1914

Closed
shashank-sharma opened this issue Jun 14, 2022 · 8 comments
Assignees

Comments

@shashank-sharma
Copy link

Description

After checkpoint and restore of postgres image (14.3), criu is not able to restore one socket file. I observed that after checkpoint, socket file is still there (I checked by mounting the directory and seeing if it is still there), after restoration is when the socket file is missing

Steps to reproduce the issue:

  1. Pull postgres:14.3 image through docker and export OCI bundle
  2. Start runc container and test if it's working fine
  3. Do checkpoint and then restore it, you'll notice after restore, it doesn't restore .s.PGSQL.5432 socket file

Detailed steps can be found here: https://github.com/shashank-sharma/psql-testing/blob/master/pdb/start.sh
And it is possible to reproduce, I have created minimal example:

https://github.com/shashank-sharma/psql-testing

Describe the results you received:

 ?column?                          <<<<<<<< Showing that postgres working fine
----------
        1
(1 row)

total 12                                        <<<<<<<< ls -la /var/run/postgresql
drwxrwsr-x 2 puser puser 4096 Jun 14 00:50 .
drwxr-xr-x 4 root  root  4096 May 28 11:51 ..
srwxrwxrwx 1 puser puser    0 Jun 14 00:50 .s.PGSQL.5432
-rw------- 1 puser puser   59 Jun 14 00:50 .s.PGSQL.5432.lock

Checkpoint starting                            <<<<<<<< showing runc list
ID          PID         STATUS      BUNDLE      CREATED     OWNER

Restoring
total 12                                       <<<<<<<< After restore, ls -la /var/run/postgresql
drwxrwsr-x 2 puser puser 4096 Jun 14 00:50 .
drwxr-xr-x 4 root  root  4096 May 28 11:51 ..
-rw------- 1 puser puser   59 Jun 14 00:50 .s.PGSQL.5432.lock

Describe the results you expected:

Expected .s.PGSQL.5432 after restoration so that it is possible to use after restore

Additional information you deem important (e.g. issue happens only occasionally):

CRIU logs and information:

CRIU full dump/restore logs:

Log here

Output of `criu --version`:

Version: 3.14

Output of `criu check --all`:

Error (criu/uffd.c:272): uffd: Lazy pages are not available: Function not implemented
Warn  (criu/kerndat.c:869): Can't keep kdat cache on non-tempfs
Warn  (criu/cr-check.c:850): Dirty tracking is OFF. Memory snapshot will not work.
Error (criu/cr-check.c:1207): UFFD is not supported
Error (criu/cr-check.c:1207): UFFD is not supported
Error (criu/cr-check.c:1033): failed to mount autofs: No such device
Looks good but some kernel features are missing
which, depending on your process tree, may cause
dump or restore failure.

Additional environment details:

@avagin
Copy link
Member

avagin commented Jun 14, 2022

(00.469809)      7: unix: ghost: bind id 0x38 ino 621859 addr /var/run/postgresql/.s.PGSQL.5432
(00.469841)      7: unix: ghost: socket id 0x38 ino 621859 name /var/run/postgresql/.s.PGSQL.5432 detected F_OK /var/run/postgresql
(00.469850)      7: unix: ghost: socket id 0x38 ino 621859 name /var/run/postgresql/.s.PGSQL.5432 creating /var/run/postgresql
(00.469906)      7: unix: ghost: id 0x38 621859 fdstore_id 2 /var/run/postgresql/.s.PGSQL.5432

@Snorch could you take a look at why criu decided that it is a ghost socket?

@Snorch Snorch self-assigned this Jun 15, 2022
@Snorch
Copy link
Member

Snorch commented Jun 15, 2022

I checked it in Virtuozzo container and it .s.PGSQL.5432 do not dissapear, It can indicate that we need to port something related to unixsk from vzcriu.

I guess it can be due to overmounted files support which we haven't ported yet on top of mount-v2. (jfyi: in mount-v1 overmounted files are restored as link-remap with rmdir on path).

Will continue investigating.

@avagin
Copy link
Member

avagin commented Jun 15, 2022

@Snorch could you write instructions how to dump/restore with the mount-v1 engine. Maybe it will work with it.

@rst0git
Copy link
Member

rst0git commented Jun 15, 2022

could you write instructions how to dump/restore with the mount-v1 engine.

$ criu --help
...
  --mntns-compat-mode   Use mount engine in compatibility mode. By default criu
                        tries to use mount-v2 mode with more reliable algorithm
                        based on MOVE_MOUNT_SET_GROUP kernel feature

@Snorch
Copy link
Member

Snorch commented Jun 16, 2022

Reproduce is on criu 3.14, so mount-v2 is not involved for sure. https://github.com/shashank-sharma/psql-testing/blob/master/Dockerfile#L11

@Snorch
Copy link
Member

Snorch commented Jun 17, 2022

JFYI: Found this problem in mount-v2 when trying to reproduce with newer criu:
#1918

@Snorch
Copy link
Member

Snorch commented Jun 17, 2022

@shashank-sharma I've updated your reproduce to use latest criu/docker and it works smoothly now:

https://github.com/Snorch/psql-testing/tree/updated-reproduce

Runc starting
 ?column? 
----------
        1
(1 row)

total 4
drwxrwsr-x 1 puser puser 62 Jun 17 12:28 .
drwxr-xr-x 1 root  root  36 May 28 11:51 ..
srwxrwxrwx 1 puser puser  0 Jun 17 12:28 .s.PGSQL.5432
-rw------- 1 puser puser 59 Jun 17 12:28 .s.PGSQL.5432.lock
Checkpoint starting
ID          PID         STATUS      BUNDLE      CREATED     OWNER
Restoring
WARN[0000] Kernel memory settings are ignored and will be removed 
total 4
drwxrwsr-x 1 puser puser 62 Jun 17 12:28 .
drwxr-xr-x 1 root  root  36 May 28 11:51 ..
srwxrwxrwx 1 puser puser  0 Jun 17 12:28 .s.PGSQL.5432
-rw------- 1 puser puser 59 Jun 17 12:28 .s.PGSQL.5432.lock

So I believe problem is already solved.

As you can see there are quiet a lot of unix-socket changes in criu between v3.14 which you use and latest criu-dev branch:

[snorch@turmoil criu]$ git log --oneline v3.14..criu-dev criu/sk-unix.c
baa4516e6 sk-unix: make add_fake_unix_queuers earier and rework find_queuer_for
d14dbb8c7 sk-unix: rework bind_on_deleted() return codes
9e7473516 sk-unix: fix e_str leak in unix_sk_id_add
1b4a9df9c sk-unix: fix uint32_t id variable printf format specifier
530ad9c89 sk-unix: Add support for SOCK_SEQPACKET unix sockets
94111596f sk-unix: Fix TCP_ESTABLISHED checks in unix sockets
26db7adbb clang-format: do automatic comment fixups
c6b5e7d92 sk-unix: fix prep_unix_sk_cwd root and cwd restoring
93dd984ca Run 'make indent' on all C files
2609e98ee sk-unix: ghost: fix deadlock between peer_fle->stage and fds wake up
f10425e05 criu: end pr_(err|warn|msg|info|debug) with \n
878223560 sk-unix: check whether a socket name is NULL before printing it
7c5c81366 sk-unix: rework unix_resolve_name
d0308e5ec sk-unix: make criu respect existing files while restoring ghost unix socket fd
34024dfdc util: move open_proc_self_fd to service_fd
04d7b7157 sk-unix: ignore coverity chroot() warning
9b1921fb7 sk-unix: do not overwrite function parameter
5b751fbaf criu: the type of a socket inode has to be "unsigned int"
e57e74a18 criu: optimize find_unix_sk_by_ino()

Please try latest, hope it works for you too.

@shashank-sharma
Copy link
Author

@Snorch Thanks for the quick update, I just tested it out (3.17) and you are right, it works fine with the latest version.

Closing this issue, Resolved

openvz-integrator pushed a commit to OpenVZ/vzkernel that referenced this issue Jul 12, 2022
…arenting

find_new_reaper() assumes that "has_child_subreaper" logic is safe as
long as we are not the exiting ->child_reaper and this is doubly wrong:

1. In fact it is safe if "pid_ns->child_reaper == father"; there must
   be no children after zap_pid_ns_processes() returns, so it doesn't
   matter what we return in this case and even pid_ns->child_reaper is
   wrong otherwise: we can't reparent to ->child_reaper == current.

   This is not a bug, but this is confusing.

2. It is not safe if we are not pid_ns->child_reaper but from the same
   thread group. We drop tasklist_lock before zap_pid_ns_processes(),
   so another thread can lock it and choose the new reaper from the
   upper namespace if has_child_subreaper == T, and this is obviously
   wrong.

   This is not that bad, zap_pid_ns_processes() won't return until the
   the new reaper reaps all zombies, but this should be fixed anyway.

We could change for_each_thread() loop to use ->exit_state instead of
PF_EXITING which we had to use until 8aac627, or we could change
copy_signal() to check CLONE_NEWPID before setting has_child_subreaper,
but lets change this code so that it is clear we can't look outside of
our namespace, otherwise same_thread_group(reaper, child_reaper) check
will look wrong and confusing anyway.

We can simply start from "father" and fix the problem. We can't wrongly
return a thread from the same thread group if ->is_child_subreaper == T,
we know that all threads have PF_EXITING set.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from ms commit 7d24e2d)
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

=================
Patchset description:
vz7: fix child-reaper reparenting

Forth patch is needed as kernel can reparent process to a dead thread
which is wrong.

Third patch is needed as kernel could reparent process from father from
one pidns to process from different pidns, which creates configurations
not supported by CRIU. Found it when reproducing problem from CRIU
mainstream issue in VZ7 ct.

checkpoint-restore/criu#1914

First and Second are just to make it apply cleaner.

Oleg Nesterov (4):
  exit: reparent: fix the cross-namespace PR_SET_CHILD_SUBREAPER
    reparenting
  exit: reparent: document the ->has_child_subreaper checks
  exit: fix the setns() && PR_SET_CHILD_SUBREAPER interaction
  exit: reparent: fix the dead-parent PR_SET_CHILD_SUBREAPER reparenting
openvz-integrator pushed a commit to OpenVZ/vzkernel that referenced this issue Jul 12, 2022
Swap the "init_task" and same_thread_group() checks.  This way it is more
simple to document these checks and we can remove the link to the previous
discussion on lkml.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from ms commit 175aed3)
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

=================
Patchset description:
vz7: fix child-reaper reparenting

Forth patch is needed as kernel can reparent process to a dead thread
which is wrong.

Third patch is needed as kernel could reparent process from father from
one pidns to process from different pidns, which creates configurations
not supported by CRIU. Found it when reproducing problem from CRIU
mainstream issue in VZ7 ct.

checkpoint-restore/criu#1914

First and Second are just to make it apply cleaner.

Oleg Nesterov (4):
  exit: reparent: fix the cross-namespace PR_SET_CHILD_SUBREAPER
    reparenting
  exit: reparent: document the ->has_child_subreaper checks
  exit: fix the setns() && PR_SET_CHILD_SUBREAPER interaction
  exit: reparent: fix the dead-parent PR_SET_CHILD_SUBREAPER reparenting
openvz-integrator pushed a commit to OpenVZ/vzkernel that referenced this issue Jul 12, 2022
find_new_reaper() checks same_thread_group(reaper, child_reaper) to
prevent the cross-namespace reparenting but this is not enough if the
exiting parent was injected by setns() + fork().

Suppose we have a process P in the root namespace and some namespace X.
P does setns() to enter the X namespace, and forks the child C.
C forks a grandchild G and exits.

The grandchild G should be re-parented to X->child_reaper, but in this
case the ->real_parent chain does not lead to ->child_reaper, so it will
be wrongly reparanted to P's sub-reaper or a global init.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

(cherry picked from ms commit c6c70f4)
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

=================
Patchset description:
vz7: fix child-reaper reparenting

Forth patch is needed as kernel can reparent process to a dead thread
which is wrong.

Third patch is needed as kernel could reparent process from father from
one pidns to process from different pidns, which creates configurations
not supported by CRIU. Found it when reproducing problem from CRIU
mainstream issue in VZ7 ct.

checkpoint-restore/criu#1914

First and Second are just to make it apply cleaner.

Oleg Nesterov (4):
  exit: reparent: fix the cross-namespace PR_SET_CHILD_SUBREAPER
    reparenting
  exit: reparent: document the ->has_child_subreaper checks
  exit: fix the setns() && PR_SET_CHILD_SUBREAPER interaction
  exit: reparent: fix the dead-parent PR_SET_CHILD_SUBREAPER reparenting
openvz-integrator pushed a commit to OpenVZ/vzkernel that referenced this issue Jul 12, 2022
…ting

The ->has_child_subreaper code in find_new_reaper() finds alive "thread"
but returns another "reaper" thread which can be dead.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from ms commit 8a1296a)
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

=================
Patchset description:
vz7: fix child-reaper reparenting

Forth patch is needed as kernel can reparent process to a dead thread
which is wrong.

Third patch is needed as kernel could reparent process from father from
one pidns to process from different pidns, which creates configurations
not supported by CRIU. Found it when reproducing problem from CRIU
mainstream issue in VZ7 ct.

checkpoint-restore/criu#1914

First and Second are just to make it apply cleaner.

Oleg Nesterov (4):
  exit: reparent: fix the cross-namespace PR_SET_CHILD_SUBREAPER
    reparenting
  exit: reparent: document the ->has_child_subreaper checks
  exit: fix the setns() && PR_SET_CHILD_SUBREAPER interaction
  exit: reparent: fix the dead-parent PR_SET_CHILD_SUBREAPER reparenting
openvz-integrator pushed a commit to OpenVZ/vzkernel that referenced this issue Sep 23, 2023
…arenting

find_new_reaper() assumes that "has_child_subreaper" logic is safe as
long as we are not the exiting ->child_reaper and this is doubly wrong:

1. In fact it is safe if "pid_ns->child_reaper == father"; there must
   be no children after zap_pid_ns_processes() returns, so it doesn't
   matter what we return in this case and even pid_ns->child_reaper is
   wrong otherwise: we can't reparent to ->child_reaper == current.

   This is not a bug, but this is confusing.

2. It is not safe if we are not pid_ns->child_reaper but from the same
   thread group. We drop tasklist_lock before zap_pid_ns_processes(),
   so another thread can lock it and choose the new reaper from the
   upper namespace if has_child_subreaper == T, and this is obviously
   wrong.

   This is not that bad, zap_pid_ns_processes() won't return until the
   the new reaper reaps all zombies, but this should be fixed anyway.

We could change for_each_thread() loop to use ->exit_state instead of
PF_EXITING which we had to use until 8aac627, or we could change
copy_signal() to check CLONE_NEWPID before setting has_child_subreaper,
but lets change this code so that it is clear we can't look outside of
our namespace, otherwise same_thread_group(reaper, child_reaper) check
will look wrong and confusing anyway.

We can simply start from "father" and fix the problem. We can't wrongly
return a thread from the same thread group if ->is_child_subreaper == T,
we know that all threads have PF_EXITING set.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from ms commit 7d24e2d)
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

=================
Patchset description:
vz7: fix child-reaper reparenting

Forth patch is needed as kernel can reparent process to a dead thread
which is wrong.

Third patch is needed as kernel could reparent process from father from
one pidns to process from different pidns, which creates configurations
not supported by CRIU. Found it when reproducing problem from CRIU
mainstream issue in VZ7 ct.

checkpoint-restore/criu#1914

First and Second are just to make it apply cleaner.

Oleg Nesterov (4):
  exit: reparent: fix the cross-namespace PR_SET_CHILD_SUBREAPER
    reparenting
  exit: reparent: document the ->has_child_subreaper checks
  exit: fix the setns() && PR_SET_CHILD_SUBREAPER interaction
  exit: reparent: fix the dead-parent PR_SET_CHILD_SUBREAPER reparenting
openvz-integrator pushed a commit to OpenVZ/vzkernel that referenced this issue Sep 23, 2023
Swap the "init_task" and same_thread_group() checks.  This way it is more
simple to document these checks and we can remove the link to the previous
discussion on lkml.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from ms commit 175aed3)
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

=================
Patchset description:
vz7: fix child-reaper reparenting

Forth patch is needed as kernel can reparent process to a dead thread
which is wrong.

Third patch is needed as kernel could reparent process from father from
one pidns to process from different pidns, which creates configurations
not supported by CRIU. Found it when reproducing problem from CRIU
mainstream issue in VZ7 ct.

checkpoint-restore/criu#1914

First and Second are just to make it apply cleaner.

Oleg Nesterov (4):
  exit: reparent: fix the cross-namespace PR_SET_CHILD_SUBREAPER
    reparenting
  exit: reparent: document the ->has_child_subreaper checks
  exit: fix the setns() && PR_SET_CHILD_SUBREAPER interaction
  exit: reparent: fix the dead-parent PR_SET_CHILD_SUBREAPER reparenting
openvz-integrator pushed a commit to OpenVZ/vzkernel that referenced this issue Sep 23, 2023
find_new_reaper() checks same_thread_group(reaper, child_reaper) to
prevent the cross-namespace reparenting but this is not enough if the
exiting parent was injected by setns() + fork().

Suppose we have a process P in the root namespace and some namespace X.
P does setns() to enter the X namespace, and forks the child C.
C forks a grandchild G and exits.

The grandchild G should be re-parented to X->child_reaper, but in this
case the ->real_parent chain does not lead to ->child_reaper, so it will
be wrongly reparanted to P's sub-reaper or a global init.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

(cherry picked from ms commit c6c70f4)
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

=================
Patchset description:
vz7: fix child-reaper reparenting

Forth patch is needed as kernel can reparent process to a dead thread
which is wrong.

Third patch is needed as kernel could reparent process from father from
one pidns to process from different pidns, which creates configurations
not supported by CRIU. Found it when reproducing problem from CRIU
mainstream issue in VZ7 ct.

checkpoint-restore/criu#1914

First and Second are just to make it apply cleaner.

Oleg Nesterov (4):
  exit: reparent: fix the cross-namespace PR_SET_CHILD_SUBREAPER
    reparenting
  exit: reparent: document the ->has_child_subreaper checks
  exit: fix the setns() && PR_SET_CHILD_SUBREAPER interaction
  exit: reparent: fix the dead-parent PR_SET_CHILD_SUBREAPER reparenting
openvz-integrator pushed a commit to OpenVZ/vzkernel that referenced this issue Sep 23, 2023
…ting

The ->has_child_subreaper code in find_new_reaper() finds alive "thread"
but returns another "reaper" thread which can be dead.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from ms commit 8a1296a)
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

=================
Patchset description:
vz7: fix child-reaper reparenting

Forth patch is needed as kernel can reparent process to a dead thread
which is wrong.

Third patch is needed as kernel could reparent process from father from
one pidns to process from different pidns, which creates configurations
not supported by CRIU. Found it when reproducing problem from CRIU
mainstream issue in VZ7 ct.

checkpoint-restore/criu#1914

First and Second are just to make it apply cleaner.

Oleg Nesterov (4):
  exit: reparent: fix the cross-namespace PR_SET_CHILD_SUBREAPER
    reparenting
  exit: reparent: document the ->has_child_subreaper checks
  exit: fix the setns() && PR_SET_CHILD_SUBREAPER interaction
  exit: reparent: fix the dead-parent PR_SET_CHILD_SUBREAPER reparenting
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants