rootless service not work with systemd socket activation #9280

Closed
pendulm opened this issue Feb 9, 2021 · 21 comments · Fixed by #9928

@pendulm
Contributor

pendulm commented Feb 9, 2021

/kind bug

Description

podman system service is broken in rootless mode: it does not get the socket from the systemd user session, and the system log fills up with noise because the service restarts continuously.

Steps to reproduce the issue:

Run the commands as a non-root user, so that podman runs in rootless mode:

  1. $ systemctl --user start podman.socket

  2. $ nc -U /run/user/$UID/podman/podman.sock

  3. $ journalctl --user -f (then watch the log)

Describe the results you received:
The log shows level=info msg="using API endpoint: 'unix:/run/user/1000/podman/podman.sock'", which means socket activation is not working, and systemd keeps starting the podman service over and over.

Describe the results you expected:
We should see level=info msg="using systemd socket activation to determine API endpoint" in the log.

Additional information you deem important (e.g. issue happens only occasionally):
Running podman system service as root works: the service gets the unix socket from systemd correctly.

Output of podman version:

Version:      3.0.0-dev
API Version:  3.0.0
Go Version:   go1.16rc1
Built:        Wed Feb  3 12:07:25 2021
OS/Arch:      linux/amd64

Package info (e.g. output of rpm -q podman or apt list podman):

podman-3.0.0-0.204.dev.gita086f60.fc34.x86_64

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide?

Yes

Additional environment details (AWS, VirtualBox, physical, etc.):
System: Fedora rawhide

@openshift-ci-robot openshift-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Feb 9, 2021
@afbjorklund
Contributor

afbjorklund commented Feb 9, 2021

I think this is normal, since the podman service has a 5 second timeout by default.

podman system service --timeout 5000

EDIT: Never mind, you meant that it fails to hand over the socket within that timeout ?

@pendulm
Contributor Author

pendulm commented Feb 9, 2021

I dug into this bug further and found the root cause in pkg/systemd/activation.go.

In rootless mode, podman forks and execs a podman process in a new user namespace, but systemd sets LISTEN_PID to the PID of the parent, so this check fails in the child:

	if err != nil || p != os.Getpid() {
		return false
	}

In root mode no subprocess is forked, so socket activation works.
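
To make the failure mode concrete, here is a minimal sketch of a LISTEN_PID style check. It is not the actual contents of pkg/systemd/activation.go: the helper name activatedForPID is hypothetical, the PIDs are made up, and the values are passed in as arguments (rather than read from the environment) so the logic can be tried without systemd.

```go
package main

import (
	"fmt"
	"strconv"
)

// activatedForPID reports whether the given LISTEN_PID and LISTEN_FDS values
// describe a socket-activation handoff addressed to the process selfPID.
// Hypothetical helper for illustration only.
func activatedForPID(listenPID, listenFDs string, selfPID int) bool {
	p, err := strconv.Atoi(listenPID)
	if err != nil || p != selfPID {
		// LISTEN_PID is missing, malformed, or names another process.
		return false
	}
	n, err := strconv.Atoi(listenFDs)
	return err == nil && n > 0
}

func main() {
	// Process started directly by systemd: LISTEN_PID matches its own PID.
	fmt.Println(activatedForPID("100", "1", 100))
	// Re-executed child: it inherits LISTEN_PID=100 but has a new PID,
	// so the same check rejects the handoff.
	fmt.Println(activatedForPID("100", "1", 101))
}
```

The second call mirrors the rootless case described above: the environment is inherited unchanged, the PID is not.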

@pendulm
Contributor Author

pendulm commented Feb 9, 2021

> I think this is normal, since the podman service has a 5 second timeout by default.
>
> podman system service --timeout 5000
>
> EDIT: Never mind, you meant that it fails to hand over the socket within that timeout ?

Yep. podman exiting after the timeout is normal and expected:
when traffic comes in, systemd activates the service, and when the service is idle, it shuts down.

But with the bug, things go wrong:

  1. When traffic comes in, systemd activates the service, but the podman service fails to get the socket, so it creates a socket by itself.
  2. No traffic arrives on the newly created socket, so the podman service shuts down after the timeout.
  3. systemd sees that the service has stopped and that the traffic on the original socket has not been accepted, so it starts a new podman service.
  4. The podman service keeps restarting until podman.socket is stopped manually.
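
The loop above comes down to which socket the service ends up accepting on. The following is a rough sketch, assuming the usual socket-activation convention that the listening socket reaches the service as an inherited file descriptor (fd 3 for the first one). The names listenerFromFile and demoHandoff are hypothetical, and the demo simulates the handoff inside a single process with a duplicated descriptor instead of involving systemd.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"path/filepath"
)

// listenerFromFile wraps an already-listening socket file in a net.Listener.
// Under real socket activation the file would come from an inherited fd.
func listenerFromFile(f *os.File) (net.Listener, error) {
	return net.FileListener(f)
}

// demoHandoff shows that a client connecting to the original socket path is
// served by whoever accepts on the inherited descriptor.
func demoHandoff() error {
	dir, err := os.MkdirTemp("", "activation-demo")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "demo.sock")

	// Stand-in for the socket created from the .socket unit.
	orig, err := net.Listen("unix", path)
	if err != nil {
		return err
	}
	defer orig.Close()

	// Stand-in for the descriptor the service would inherit.
	f, err := orig.(*net.UnixListener).File()
	if err != nil {
		return err
	}
	defer f.Close()

	inherited, err := listenerFromFile(f)
	if err != nil {
		return err
	}
	defer inherited.Close()

	// The client only knows the original path.
	client, err := net.Dial("unix", path)
	if err != nil {
		return err
	}
	defer client.Close()

	// The pending connection is picked up through the inherited listener.
	conn, err := inherited.Accept()
	if err != nil {
		return err
	}
	return conn.Close()
}

func main() {
	if err := demoHandoff(); err != nil {
		fmt.Println("handoff failed:", err)
		return
	}
	fmt.Println("connection accepted on the inherited listener")
}
```

A service that ignores the inherited descriptor and listens on a socket of its own never sees that pending connection, which matches steps 1 to 3.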

@mheon
Member

mheon commented Feb 9, 2021

@jwhonce PTAL

@rhatdan
Member

rhatdan commented Feb 9, 2021

@giuseppe PTAL

@giuseppe
Member

can we just drop this check?

	p, err := strconv.Atoi(pid)
	if err != nil || p != os.Getpid() {
		return false
	}

Does it break anything with root?

@afbjorklund
Contributor

afbjorklund commented Feb 11, 2021

systemd is quite vague about it.

        e = getenv("LISTEN_PID");
        if (!e) {
                r = 0;
                goto finish;
        }

        r = parse_pid(e, &pid);
        if (r < 0)
                goto finish;

        /* Is this for us? */
        if (getpid_cached() != pid) {
                r = 0;
                goto finish;
        }

From sd-daemon.c

https://github.com/systemd/systemd/blob/v247/src/libsystemd/sd-daemon/sd-daemon.c#L46_L60

@pendulm
Contributor Author

pendulm commented Feb 12, 2021

Just dropping the check won't work, and neither will relaxing it like this:

	if err != nil || !(p == os.Getpid() || p == os.Getppid()) {
		return false
	}

because the same logic is checked in two different places:
pkg/systemd/activation.go and vendor/github.com/coreos/go-systemd/activation/files.go, the latter of which is vendored code.

@giuseppe
Member

it seems we have a check in rootless_linux to rewrite LISTEN_PID to the current pid.

Would it be possible for you to check why it is failing?

@pendulm
Contributor Author

pendulm commented Feb 16, 2021

> it seems we have a check in rootless_linux to rewrite LISTEN_PID to the current pid.
>
> Would it be possible for you to check why it is failing?

Sure, I'll look into it and open a PR once I have found the bug.

@cmonty14

On Debian I get a different error when starting podman.socket in user mode.

locadmin@pc4-pve:~$ systemctl --user status podman.socket 
● podman.socket
   Loaded: masked (Reason: Unit podman.socket is masked.)
   Active: inactive (dead)

locadmin@pc4-pve:~$ systemctl --user start podman.socket 
Failed to start podman.socket: Unit podman.socket is masked.

@rhatdan
Member

rhatdan commented Mar 15, 2021

@vrothberg @giuseppe Any thoughts?

@vrothberg
Member

Try systemctl --user unmask

@yan12125
Contributor

I noticed that the issue happens only if there is already a podman process running.

Some tests:

  1. Kill all podman processes
  2. systemctl --user start podman.socket
  3. Run curl -v --unix-socket $XDG_RUNTIME_DIR/podman/podman.sock http://localhost/version; it works
  4. Wait 5 seconds for the daemon to exit
  5. Run the same curl command as in step 3; it hangs

I'm on Arch Linux with podman 3.0.1. From the build script, apparently podman.socket and podman.service are from upstream (this repo).

@Krejza9

Krejza9 commented Mar 18, 2021

@pendulm
Any news?

@yan12125
Contributor

I dug a little deeper. The issue seems to be a PID mismatch between parent and child. I inserted this line before https://github.com/containers/podman/blob/master/pkg/systemd/activation.go#L17:

logrus.Warnf("PID mismatch: %d %d", p, os.Getpid())

And here are relevant systemd logs:

 Mar 22 02:28:45 PC951 podman[981610]: time="2021-03-22T02:28:45+08:00" level=warning msg="PID mismatch: 981598 981610"

When that warning is issued, there are three podman processes:

 974206 yen        20   0 40376  1168  1008 S  0.0  0.0  0:00.00 │  ├─ /home/yen/tmp/podman-src/bin/podman
 981598 yen        20   0 1453M 37808 29176 S  0.0  0.2  0:00.03 │  └─ /home/yen/tmp/podman-src/bin/podman --log-level=info system service
 981610 yen        20   0 1597M 38156 29232 S  0.0  0.2  0:00.05 │     └─ /home/yen/tmp/podman-src/bin/podman --log-level=info system service

@pendulm
Contributor Author

pendulm commented Apr 1, 2021

Commit 3f60dc0 adjusted LISTEN_PID to the child PID in reexec_in_user_namespace. But when a pause process has been left running to pin the namespace (see #7133), the code path is reexec_userns_join instead, so we should adjust LISTEN_PID in that function as well.
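
The idea of adjusting LISTEN_PID can be illustrated with a small environment rewrite. This is a sketch only and not the actual patch: the real change belongs in the rootless re-exec code and may look quite different. withListenPID is a hypothetical helper, and the PIDs are the ones from the "PID mismatch" log line above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// withListenPID returns a copy of env (KEY=VALUE entries, as returned by
// os.Environ) in which an existing LISTEN_PID entry is replaced by pid.
// If LISTEN_PID is not set, the copy is returned unchanged.
func withListenPID(env []string, pid int) []string {
	out := make([]string, 0, len(env))
	for _, kv := range env {
		if strings.HasPrefix(kv, "LISTEN_PID=") {
			kv = "LISTEN_PID=" + strconv.Itoa(pid)
		}
		out = append(out, kv)
	}
	return out
}

func main() {
	// Environment as inherited from the parent that systemd started.
	env := []string{"LISTEN_FDS=1", "LISTEN_PID=981598"}
	// Environment as the re-executed child would need to see it.
	fmt.Println(withListenPID(env, 981610))
}
```

With LISTEN_PID rewritten to the child's own PID, a check of the form p != os.Getpid() passes in the child and the inherited socket can be used.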

@mheon
Member

mheon commented Apr 1, 2021

Would you be willing to submit a patch?

@giuseppe
Member

giuseppe commented Apr 2, 2021

I agree we should fix this. There was another issue with socket activation, could you see if #9855 helps?

@pendulm
Contributor Author

pendulm commented Apr 2, 2021

#9855 does not fix this.
@giuseppe @mheon I'll take this, though right now I'm overwhelmed by the complexity of the rootless setup code.

@Krejza9

Krejza9 commented Apr 4, 2021

> #9855 not fix this.
> @giuseppe @mheon I'll take this, now I'm overwhelming by the complex code of rootless setup

Hello @pendulm, I tested your pull request and it works like a charm. Good job, man.

@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 22, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 22, 2023