Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

engine: fix panic on waitgroup #3233

Merged
merged 1 commit into from Jun 14, 2023

Conversation

josedonizetti
Copy link
Collaborator

@josedonizetti josedonizetti commented Jun 13, 2023

1. Explain what the PR does

Wait must be synchronized with the first Add.

This seems to be the case here, as we call Add from inside the goroutine, for a short lived tracee-rules process there might be a race between checkCompletion and signatureStart triggering the error reported on the issues #3230 and #3151

2. Explain how to test it

3. Other comments

Copy link
Member

@geyslan geyslan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. LGTM.

@josedonizetti josedonizetti marked this pull request as draft June 13, 2023 21:28
@yanivagman
Copy link
Collaborator

Didn't we want to merge this before the v0.15.1 release?

@josedonizetti
Copy link
Collaborator Author

@yanivagman I think the PR is right, but I'm trying to be sure by simulating the issue which until now wasn't possible. As we wanted to release yesterday, to avoid blocking I moved this to 0.16.0 as it was before. Not a critical bug.

@OriGlassman
Copy link
Collaborator

Are you guys sure this can only happen by sending a SIGINT? This error was observed on a lambda (where it shouldn't have received a SIGINT)

@josedonizetti
Copy link
Collaborator Author

josedonizetti commented Jun 14, 2023

@OriGlassman It is not because of SIGINT, but a short-lived tracee-rules, which might happen due to different reasons, a signal as one of the issues, or if you are using it to analyze events from a file, maybe the file doesn't have enough events.

The race seems to happen because we close the inputs.Tracee channel to quickly, due to the lack of events. Triggering the stop of the engine consumeSources loop, which starts a Wait op on the waitgroup, but the go routines for the signatures still didn't had the time to setup the waitgroup.

A simple way to simulate it is by creating a file with just one event:

test.json

{"timestamp":1686775048087416430,"threadStartTime":1686754169760481353,"processorId":7,"processId":1,"cgroupId":10999,"threadId":12,"parentProcessId":3293,"hostProcessId":3374,"hostThreadId":3480,"hostParentProcessId":3293,"userId":65534,"mountNamespace":4026533123,"pidNamespace":4026533127,"processName":"prometheus","hostName":"josedonizetti-x","containerId":"632b2b487644f2627ffd0fc148b0c83b1efd43a8a2236975071fb85ae932d922","container":{"id":"632b2b487644f2627ffd0fc148b0c83b1efd43a8a2236975071fb85ae932d922"},"kubernetes":{},"eventId":"2010","eventName":"net_packet_http_request","matchedPolicies":[""],"argsNum":2,"returnValue":0,"syscall":"write","stackAddresses":[0],"contextFlags":{"containerStarted":true,"isCompat":false},"args":[{"name":"metadata","type":"trace.PktMeta","value":{"src_ip":"127.0.0.1","dst_ip":"127.0.0.1","src_port":46122,"dst_port":9100,"protocol":6,"packet_len":336,"iface":"any"}},{"name":"http_request","type":"trace.ProtoHTTPRequest","value":{"method":"GET","protocol":"HTTP/1.1","host":"localhost:9100","uri_path":"/metrics","headers":{"Accept":["application/openmetrics-text;version=1.0.0,application/openmetrics-text;version=0.0.1;q=0.75,text/plain;version=0.0.4;q=0.5,*/*;q=0.1"],"Accept-Encoding":["gzip"],"User-Agent":["Prometheus/2.44.0"],"X-Prometheus-Scrape-Timeout-Seconds":["5"]},"content_length":0}}]}

Then, starting tracee-rules to analyze it in a loop:

while true; do sudo ./dist/tracee-rules --allcaps --input-tracee format:json --input-tracee file:./test.log 2>&1 | grep -v info; done

It should hit the issue every X time that we run tracee-rules.

panic: sync: WaitGroup is reused before previous Wait has returned

goroutine 1 [running]:
sync.(*WaitGroup).Wait(0xc0000f4600?)
        /snap/go/10198/src/sync/waitgroup.go:141 +0x85
github.com/aquasecurity/tracee/pkg/signatures/engine.(*Engine).checkCompletion(0xc0000f4600)
        /home/josedonizetti/code/aquasecurity/tracee/pkg/signatures/engine/engine.go:137 +0x3f
github.com/aquasecurity/tracee/pkg/signatures/engine.(*Engine).consumeSources(0xc0000f4600, {0x17733c8, 0xc0001102c0})
        /home/josedonizetti/code/aquasecurity/tracee/pkg/signatures/engine/engine.go:215 +0x3ea
github.com/aquasecurity/tracee/pkg/signatures/engine.(*Engine).Start(0xc0000f4600, {0x17733c8, 0xc0001102c0})
        /home/josedonizetti/code/aquasecurity/tracee/pkg/signatures/engine/engine.go:112 +0x1e7
main.main.func1(0xc00038df00)
        /home/josedonizetti/code/aquasecurity/tracee/cmd/tracee-rules/main.go:166 +0xb17
github.com/urfave/cli/v2.(*App).RunContext(0xc0003a04e0, {0x1772638?, 0xc0001a0168}, {0xc0001a4120, 0x6, 0x6})
        /home/josedonizetti/go/pkg/mod/github.com/urfave/cli/v2@v2.3.0/app.go:322 +0x97c
github.com/urfave/cli/v2.(*App).Run(...)
        /home/josedonizetti/go/pkg/mod/github.com/urfave/cli/v2@v2.3.0/app.go:224
main.main()
        /home/josedonizetti/code/aquasecurity/tracee/cmd/tracee-rules/main.go:257 +0x97a

or

panic: sync: WaitGroup misuse: Add called concurrently with Wait
goroutine 58 [running]:
sync.(*WaitGroup).Add(0x0?, 0x0?)
        /snap/go/10198/src/sync/waitgroup.go:97 +0xb4
sync.(*WaitGroup).Done(...)
        /snap/go/10198/src/sync/waitgroup.go:108
github.com/aquasecurity/tracee/pkg/signatures/engine.signatureStart({0x7f69ed0f8fa0, 0x7f69ed2b3960}, 0x0?, 0xc00012c6c0?)
        /home/josedonizetti/code/aquasecurity/tracee/pkg/signatures/engine/engine.go:98 +0x29f
created by github.com/aquasecurity/tracee/pkg/signatures/engine.(*Engine).Start
        /home/josedonizetti/code/aquasecurity/tracee/pkg/signatures/engine/engine.go:109 +0xec
panic: sync: WaitGroup misuse: Add called concurrently with Wait

@josedonizetti josedonizetti marked this pull request as ready for review June 14, 2023 21:06
@josedonizetti
Copy link
Collaborator Author

josedonizetti commented Jun 14, 2023

@yanivagman @rafaeldtinoco @geyslan as per the comment above ˆˆ, after several tests, this PR indeed fixes the bug.

Copy link
Contributor

@rafaeldtinoco rafaeldtinoco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rafaeldtinoco rafaeldtinoco merged commit 4ceebab into aquasecurity:main Jun 14, 2023
26 checks passed
@josedonizetti josedonizetti deleted the fix-engine-panic branch June 14, 2023 21:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

tracee-rules panic races when loading/unloading tracee signatures (ctrl+c during initialization).
5 participants