Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API server crashes with race condition in logging code #12904

Closed
riyad opened this issue Jan 18, 2022 · 4 comments · Fixed by #12909
Closed

API server crashes with race condition in logging code #12904

riyad opened this issue Jan 18, 2022 · 4 comments · Fixed by #12909
Labels
kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.

Comments

@riyad
Copy link
Contributor

riyad commented Jan 18, 2022

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

Running the docker-py integration tests against Podman's API it crashes with the following error roughly 3 out of 4 times:

panic: send on closed channel

goroutine 1199 [running]:
github.com/containers/podman/v3/libpod.(*Container).readFromJournal.func1()
/home/riyad/src/podman/libpod/container_log_linux.go:242 +0x5ae
created by github.com/containers/podman/v3/libpod.(*Container).readFromJournal
/home/riyad/src/podman/libpod/container_log_linux.go:118 +0x812

All following requests receive 111 Connection refused responses.

When it happens it seems to always happen on the AttachContainerTest.test_run_container_reading_socket test.
Sadly only running this test in isolation doesn't trigger the crash. I tried including some of the preceding tests (see below for reproduction) until it crashes roughly 1/3 of the time.

Steps to reproduce the issue:

Run Podman API server

  1. ./bin/podman system service -t 0

Run tests against API

  1. DOCKER_HOST=unix:/run/user/1000/podman/podman.sock pytest -v -k 'RestartContainerTest or AttachContainerTest' tests/integration/api_container_test.py

Triggers the race condition roughly 1 out of 3 times.

Describe the results you received:

Podman API server crashes with:

panic: send on closed channel

goroutine 1199 [running]:
github.com/containers/podman/v3/libpod.(*Container).readFromJournal.func1()
/home/riyad/src/podman/libpod/container_log_linux.go:242 +0x5ae
created by github.com/containers/podman/v3/libpod.(*Container).readFromJournal
/home/riyad/src/podman/libpod/container_log_linux.go:118 +0x812

All following requests receive 111 Connection refused responses.

Describe the results you expected:

Should not crash.

Additional information you deem important (e.g. issue happens only occasionally):

Looking around in old docker-py integration test logs the first occurrence of this crash I can find is running against 49cfed7 ... which kind of coincides with #11263 being merged the day before. I don't know whether it may have caused or just uncovered this race condition. 🤔
Maybe it helps narrowing it down.

Output of podman version:

Last tested on:

Client:       Podman Engine
Version:      4.0.0-dev
API Version:  4.0.0-dev
Go Version:   go1.17.6
Git Commit:   d7492f1514674825a63c89ae635e57f535800c8d
Built:        Mon Jan 17 22:17:30 2022
OS/Arch:      linux/amd64
@openshift-ci openshift-ci bot added the kind/bug Categorizes issue or PR as related to a bug. label Jan 18, 2022
@vrothberg
Copy link
Member

Thanks for the report, I will take a look.

vrothberg added a commit to vrothberg/libpod that referenced this issue Jan 18, 2022
Waiting on an initialized sync.WaitGroup returns immediately.
Hence, move the goroutine to wait and close *after* reading
the logs.

Fixes: containers#12904
Signed-off-by: Valentin Rothberg <rothberg@redhat.com>
@vrothberg
Copy link
Member

Fix for the race is here: #12909

Thanks for providing the excellent reproducer, that really helped tracking the bug down.

vrothberg added a commit to vrothberg/libpod that referenced this issue Jan 19, 2022
Add a regression test for issue containers#12904 to make sure that attaching with
logs=true to the compact endpoint does not blow up.  Note that I did not
find a way to test the output (i.e., '123'); logs are sent in a binary
format and I did not find a way to compare the control characters.

Signed-off-by: Valentin Rothberg <rothberg@redhat.com>
openshift-merge-robot added a commit that referenced this issue Jan 19, 2022
apiv2 test: add regression test for #12904
@riyad
Copy link
Contributor Author

riyad commented Jan 19, 2022

thanks for the fix 🙂

@vrothberg
Copy link
Member

Thanks for the great report! Feel free to open more issues in case you hit some :)

@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 21, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 21, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants