
enrich: fixes post control plane #3285

Merged

Conversation

NDStrahilevitz (Collaborator)

The container_create argument enrichment was broken by a race between the control plane and the eBPF event pipeline: the control plane would trigger, but a context switch could let the pipeline continue first. Then, when enrichment for the cgroup_mkdir event triggered, the cgroup info would be missing from the Containers struct, since the control plane now populates it.

The root cause is that the locking scheme in Containers is non-transactional. Now that we have multiple pipelines writing to the Containers struct, the locking should align with that situation. This PR does that.
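To make the race concrete, here is a minimal Go sketch (names and types are illustrative, not tracee's actual API) of why short, per-operation locks are non-transactional, and what the large-lock version looks like:

```go
package containers

import "sync"

// CgroupInfo is an illustrative stand-in for the per-cgroup data tracee keeps.
type CgroupInfo struct{ Path string }

type Containers struct {
	mtx     sync.RWMutex
	cgroups map[uint64]CgroupInfo
}

// Racy: the lock is released between the existence check and the insert,
// so the control plane and the event pipeline can interleave in between.
func (c *Containers) enrichShortLocks(id uint64, info CgroupInfo) {
	c.mtx.RLock()
	_, ok := c.cgroups[id]
	c.mtx.RUnlock()
	// <-- another pipeline may add or remove the entry right here
	if !ok {
		c.mtx.Lock()
		c.cgroups[id] = info
		c.mtx.Unlock()
	}
}

// "Transactional": one large write lock covers the whole read-modify-write,
// so concurrent writers always see a consistent view.
func (c *Containers) enrich(id uint64, info CgroupInfo) {
	c.mtx.Lock()
	defer c.mtx.Unlock()
	if _, ok := c.cgroups[id]; !ok {
		c.cgroups[id] = info
	}
}
```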

containers: change locking scheme

    Locks in the container package were as short as possible.
    This caused transaction issues once a few actors interacted with it.

    Changed the locking scheme to be more "transactional", with large locks,
    instead of combining read locks and write locks in the same transaction.

controlplane: chore fixes

    1. Initialize the enrichEnabled field.
    2. Run started a goroutine internally, which was redundant (sketched below).
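As a hypothetical illustration of the second chore (the real control-plane code differs), a Run method that spawns its own goroutine forces an extra layer of indirection on callers that already invoke it with `go`; the fix is to let Run block and leave concurrency to the call site:

```go
package controlplane

// Controller is an illustrative stand-in for the control plane.
type Controller struct {
	done chan struct{}
}

// Before: Run returned immediately after spawning an internal goroutine,
// so a caller writing `go ctrl.Run()` ended up one goroutine deeper than
// needed.
func (c *Controller) runWithInternalGoroutine() {
	go func() {
		for range c.done { // stand-in for the real event loop
		}
	}()
}

// After: Run owns the loop and blocks; the caller decides whether to wrap
// it in a goroutine (`go ctrl.Run()`).
func (c *Controller) Run() {
	for range c.done {
	}
}
```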

Fix #3282

@rafaeldtinoco (Contributor) left a comment


Things I could spot:

  1. Can we call the mutex "cgroupMapMutex" instead of just "mtx"? We're only protecting the map and I think it's worth making that clear.

     Update: it protects "deleted" as well... still, it's cgroup-map related. Either way, unimportant, up to you.

  2. RemoveFromBpfMap (which should be named RemoveFromBPFMap) can be called in parallel: there is no logical error, since the map deletion would simply fail, but the second (slightly later) concurrent caller will get back a concurrency error (unable to delete the cgroupID); see the sketch below.

Questions:

  1. Is there any plan to de-duplicate the repeated code between *Tracee.processCgroupMkdir and *Controller.processCgroupMkdir?

All the rest LGTM, but do check the possibility of a concurrent error from RemoveFromBpfMap, and, if you change it, please rename that method to use capital BPF.

Thank you!
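To illustrate the concurrency error described in item 2, here is a runnable toy (the BPF map is modeled with a plain Go map; names and error text are assumptions of this sketch, not libbpfgo's actual behavior): two parallel callers both try to delete the same key, and the loser gets an error back.

```go
package main

import (
	"fmt"
	"sync"
)

// fakeBPFMap models the kernel-side BPF map with a mutex-guarded Go map.
type fakeBPFMap struct {
	mu sync.Mutex
	m  map[uint64]struct{}
}

// DeleteKey fails if the key is already gone, mimicking a BPF map delete.
func (f *fakeBPFMap) DeleteKey(k uint64) error {
	f.mu.Lock()
	defer f.mu.Unlock()
	if _, ok := f.m[k]; !ok {
		return fmt.Errorf("key %d not found", k) // what the losing caller sees
	}
	delete(f.m, k)
	return nil
}

func main() {
	m := &fakeBPFMap{m: map[uint64]struct{}{42: {}}}
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if err := m.DeleteKey(42); err != nil {
				fmt.Println("concurrent delete:", err) // exactly one caller prints this
			}
		}()
	}
	wg.Wait()
}
```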

@NDStrahilevitz (Collaborator, Author)

  1. I can do the rename.
  2. It can be, but it isn't: I haven't removed the processing function for CgroupMkdir, but I did unregister it in the control plane PR, so it's not running and no race condition will actively happen. (Sidenote: there is a race condition between that function and the cgroup_rmdir eBPF program; for short-lived cgroups the eBPF program will sometimes run before the processing logic, and RemoveFromBpfMap will then return an error. One way to tolerate that is sketched below.)

I will remove the processing code so there's no confusion and do the renames. Then I'll squash and open a backport PR (I will wait for your review so you can confirm the formatting is right).
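For the sidenote's race, one way to tolerate it, sketched under the assumption that the delete error is distinguishable (the sentinel below is hypothetical, not the real error value), is to treat "already gone" as success, since for short-lived cgroups the eBPF side may legitimately win:

```go
package main

import (
	"errors"
	"fmt"
)

// errKeyNotFound stands in for whatever "key not found" error the real BPF
// map delete returns; the exact error value is an assumption of this sketch.
var errKeyNotFound = errors.New("key not found")

// removeCgroup is idempotent: if the cgroup_rmdir eBPF program (or another
// pipeline) already deleted the entry, the miss is not reported as a failure.
func removeCgroup(deleteKey func(uint64) error, cgroupID uint64) error {
	if err := deleteKey(cgroupID); err != nil {
		if errors.Is(err, errKeyNotFound) {
			return nil // short-lived cgroup: the other actor won the race
		}
		return fmt.Errorf("removing cgroup %d: %w", cgroupID, err)
	}
	return nil
}

func main() {
	lostRace := func(uint64) error { return errKeyNotFound } // simulate losing the race
	fmt.Println(removeCgroup(lostRace, 7)) // prints <nil>: the miss is tolerated
}
```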

1. Rename mutex in cgroups code
2. Capitalize BPF in RemoveFromBpfMap
3. Delete unused processing code for cgroup events.
@NDStrahilevitz NDStrahilevitz merged commit f638b4e into aquasecurity:main Jun 29, 2023
25 checks passed
NDStrahilevitz added a commit to NDStrahilevitz/tracee that referenced this pull request Jun 29, 2023
Locks in the container package were as short as possible.
This caused transaction issues once a few actors interacted with it.

Change the locking scheme to be more "transactional" with large locks,
instead of combining read locks and write locks in the same transaction.

Additional chores in the scope of the commit:
1. Initialize enrichEnabled field in control plane.
2. Control Plane's Run started a goroutine internally, which was
    redundant.
3. Rename mutex in cgroups code to cgroupsMutex.
4. Capitalize BPF in RemoveFromBpfMap.
5. Delete the unused processing code for cgroup events.

commit: f638b4e (main), cherry-pick
NDStrahilevitz added a commit that referenced this pull request Jun 29, 2023
(same commit message as above)

commit: f638b4e (main), cherry-pick
@rafaeldtinoco (Contributor)

So what is the control plane left with?

And about backporting multiple commits as a single commit and merging into v0.16.0, I think it's a bad idea for some sustaining reasons:

  • Usually cherry-picks and backports should be 1:1 patches from main, or else things will get out of control really fast (trust me).

  • Changes in v0.16.0 should be minimal and fixes only; we should not rely on big code changes (those should be an exception of exceptions) and consider them "fixes". Usually fixes are easy to spot, with not so many code changes.

With those changes, I believe you should have waited for another review of mine. And you also did not wait for my take on the cherry-picking into v0.16.0.

@NDStrahilevitz (Collaborator, Author)

> So what is the control plane left with?

All the logic is in the control plane; there was just leftover (and disconnected) logic in the event processing file. The control plane was already doing all the work, I just had to clean up the leftovers from the event processing.

> And about backporting multiple commits as a single commit and merging into v0.16.0, I think it's a bad idea for some sustaining reasons:
>
>   • Usually cherry-picks and backports should be 1:1 patches from main, or else things will get out of control really fast (trust me).
>   • Changes in v0.16.0 should be minimal and fixes only; we should not rely on big code changes (those should be an exception of exceptions) and consider them "fixes". Usually fixes are easy to spot, with not so many code changes.

I squashed the 3 commits when merging to main, then cherry-picked the squashed commit to v0.16.0. So there is no difference from what you described: it's the same commit in both places (with the addition of the reference in the cherry-picked one).
If the issue is that I decided to squash-merge this PR (which went to main), then there was no indication that I should specifically rebase-merge it; on the contrary, other PRs which were backported were also squash-merged, not rebase-merged. And since I squash-merged here, obviously I would not cherry-pick the PR's 3 original commits one by one to the release branch.

> With those changes, I believe you should have waited for another review of mine. And you also did not wait for my take on the cherry-picking into v0.16.0.

I assumed Yaniv's +1 was enough, since we already had examples of backports which he would also have read and known about, and he reviewed this one of his own volition.

@rafaeldtinoco (Contributor)

I saw the code removal, and I would have said not to do it in the cherry-pick/backport patch in the release tree.

We will eventually agree that the release tree has to carry only fixes (and sometimes an ugly fix, not the fix done in main) so we don't introduce bugs, even with something as simple as removing unused code.

About Yaniv's +1 being enough: you are right. It is enough, and I spoke to him about this just now. We will coordinate better and discuss with you (offline) how we want this tree to be maintained.

I'm writing to you in private with the rest of this message =).
