
race condition in logs -f with journald driver #10323

Closed
vrothberg opened this issue May 12, 2021 · 4 comments · Fixed by #10431
Assignees: vrothberg
Labels: In Progress (this issue is actively being worked by the assignee; please do not work on it at this time), locked - please file new issue/PR (assists humans wanting to comment on an old issue or PR with locked comments)

Comments

@vrothberg
Member

Reproducer from #10222 (comment):

$ while :;do ./bin/podman run --log-driver=journald -d --name foo quay.io/libpod/testimage:20210427 sh -c 'echo hi;sleep 2;echo bye';./bin/podman logs -f foo;./bin/podman rm foo;done
caf4dce7ce3ac5d08da339a8c0ecd4ec97df498cd472fbff61182352e71cf831
hi     <---- there is no subsequent bye!
caf4dce7ce3ac5d08da339a8c0ecd4ec97df498cd472fbff61182352e71cf831
08f58203f738027300d651fed861f1673eb8a1a953b869c788d905664b709938
hi
bye

I know where the error is and will drop a // FIXME in libpod/container_logs_linux.go (and a link to the code once #10222 is merged).

@vrothberg
Member Author

@rhatdan FYI

@vrothberg
Member Author

As pointed out in https://github.com/containers/podman/pull/10222/files#diff-20cc30e1cdf302ef7404e5923eada3912c68c8b8943c0a7a0a834b29236eba69R92, using the Follow API is racy. To get it right, we have to implement a custom follow function that forwards everything from stdout and stderr UNTIL we read on the journal that the container died (i.e., we get the died event).

I looked at the journal, and the died event is always printed after the logs are written. The problem at the moment is that we have too many things running concurrently. Having one goroutine read the log and filter out what's necessary until it reads the died event seems like one way to avoid that race.
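
For illustration, a minimal sketch of that single-goroutine loop, using go-systemd's sdjournal bindings. The journal field names (CONTAINER_ID_FULL for conmon's log lines, PODMAN_ID/PODMAN_EVENT for podman's events) mirror what podman and conmon write, but treat them and the loop itself as assumptions for illustration, not the actual libpod code:

package main

import (
	"fmt"
	"os"

	"github.com/coreos/go-systemd/v22/sdjournal"
)

// followLogs streams a container's log lines from the journal and
// returns once it reads the container's "died" event. Because the logs
// and the events are read from a single journal cursor in one
// goroutine, the died event cannot overtake unread log lines.
func followLogs(containerID string) error {
	j, err := sdjournal.NewJournal()
	if err != nil {
		return err
	}
	defer j.Close()

	// Match the container's log entries OR its lifecycle events
	// (field names are assumptions, see above).
	if err := j.AddMatch("CONTAINER_ID_FULL=" + containerID); err != nil {
		return err
	}
	if err := j.AddDisjunction(); err != nil {
		return err
	}
	if err := j.AddMatch("PODMAN_ID=" + containerID); err != nil {
		return err
	}

	for {
		n, err := j.Next()
		if err != nil {
			return err
		}
		if n == 0 {
			// Caught up with the journal; block until it grows.
			j.Wait(sdjournal.IndefiniteWait)
			continue
		}
		entry, err := j.GetEntry()
		if err != nil {
			return err
		}
		if event, ok := entry.Fields["PODMAN_EVENT"]; ok {
			if event == "died" {
				return nil // all logs were written before this event
			}
			continue // other lifecycle events are not log lines
		}
		if msg, ok := entry.Fields["MESSAGE"]; ok {
			fmt.Println(msg)
		}
	}
}

func main() {
	if err := followLogs(os.Args[1]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}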

@mheon
Member

mheon commented May 12, 2021 via email

@vrothberg
Member Author

vrothberg commented May 13, 2021 via email

@vrothberg vrothberg self-assigned this May 21, 2021
@vrothberg vrothberg added the In Progress label May 21, 2021
vrothberg added a commit to vrothberg/libpod that referenced this issue May 21, 2021
Fix a race in the journald driver.  Following the logs implies
streaming until the container is dead.  Streaming happened in one
goroutine; waiting for the container to exit/die and signaling that
event happened in another goroutine.

Having two goroutines running simultaneously is pretty much the core
of the race condition.  When the streaming goroutine received the
signal that the container had exited, it may not yet have read and
written all of the container's logs.

Fix this race by reading both the logs and the events of the
container, and stop streaming once the died/exited event has been
read.  The died event is guaranteed to come after all logs in the
journal, which guarantees not only consistency but also deterministic
behavior.

Note that the journald log driver now requires the journald event
backend to be set.

Fixes: containers#10323
Signed-off-by: Valentin Rothberg <rothberg@redhat.com>
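
For reference, the journald event backend mentioned in the commit message is selected via the events_logger option in containers.conf; a minimal sketch (the file location varies, e.g. /etc/containers/containers.conf system-wide or ~/.config/containers/containers.conf per user):

[engine]
# Write podman events to the journal so that `podman logs -f` can read
# the container's died event from the same journal it reads logs from.
events_logger = "journald"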
vrothberg added further commits to vrothberg/libpod that referenced this issue on May 21, May 25, and May 26, 2021, and rugk pushed a commit to rugk/podman that referenced this issue Jul 9, 2021, all with the same commit message as above.
@github-actions github-actions bot added the locked - please file new issue/PR label Sep 21, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 21, 2023