
enrich: fixes post control plane #3285

Merged

Conversation

NDStrahilevitz (Collaborator)

The container_create argument enrichment was broken by a race between the control plane and the eBPF event pipeline: the control plane would trigger, but a context switch could let the pipeline continue first. Then, when enrichment for the cgroup_mkdir event triggered, the cgroup info would be missing from the Containers struct, since the control plane now populates it.

The root cause is that the locking scheme in Containers is non-transactional. Now that we have multiple pipelines writing to the Containers struct, the locking should align with that situation. This PR does that.
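To make the race concrete, here is a minimal Go sketch (names and types are illustrative, not tracee's actual API) of why short, per-operation locks are non-transactional, and what the large-lock version looks like:

```go
package containers

import "sync"

// CgroupInfo is an illustrative stand-in for the per-cgroup data tracee keeps.
type CgroupInfo struct{ Path string }

type Containers struct {
	mtx     sync.RWMutex
	cgroups map[uint64]CgroupInfo
}

// Racy: the lock is released between the existence check and the insert,
// so the control plane and the event pipeline can interleave in between.
func (c *Containers) enrichShortLocks(id uint64, info CgroupInfo) {
	c.mtx.RLock()
	_, ok := c.cgroups[id]
	c.mtx.RUnlock()
	// <-- another pipeline may add or remove the entry right here
	if !ok {
		c.mtx.Lock()
		c.cgroups[id] = info
		c.mtx.Unlock()
	}
}

// "Transactional": one large write lock covers the whole read-modify-write,
// so concurrent writers always see a consistent view.
func (c *Containers) enrich(id uint64, info CgroupInfo) {
	c.mtx.Lock()
	defer c.mtx.Unlock()
	if _, ok := c.cgroups[id]; !ok {
		c.cgroups[id] = info
	}
}
```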

containers: change locking scheme

    Locks in the container package were as short as possible.
    This caused transaction issues once a few actors interacted with it.

    Changed the locking scheme to be more "transactional", with large locks,
    instead of combining read locks and write locks in the same transaction.

controlplane: chore fixes

    1. Initialize the enrichEnabled field.
    2. Run started a goroutine internally, which was redundant (sketched below).
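As a hypothetical illustration of the second chore (the real control-plane code differs), a Run method that spawns its own goroutine forces an extra layer of indirection on callers that already invoke it with `go`; the fix is to let Run block and leave concurrency to the call site:

```go
package controlplane

// Controller is an illustrative stand-in for the control plane.
type Controller struct {
	done chan struct{}
}

// Before: Run returned immediately after spawning an internal goroutine,
// so a caller writing `go ctrl.Run()` ended up one goroutine deeper than
// needed.
func (c *Controller) runWithInternalGoroutine() {
	go func() {
		for range c.done { // stand-in for the real event loop
		}
	}()
}

// After: Run owns the loop and blocks; the caller decides whether to wrap
// it in a goroutine (`go ctrl.Run()`).
func (c *Controller) Run() {
	for range c.done {
	}
}
```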

Fix #3282

@rafaeldtinoco (Contributor) left a comment


Things I could spot:

  1. Can we call the mutex "cgroupMapMutex" instead of just "mtx"? We're only protecting the map and I think it's worth making that clear.

     Update: it protects "deleted" as well... still, it's cgroup-map related. Either way, unimportant, up to you.

  2. RemoveFromBpfMap (which should be named RemoveFromBPFMap) can be called in parallel: there is no logical error, since the map deletion would simply fail, but the second (slightly later) concurrent caller will get back a concurrency error (unable to delete the cgroupID); see the sketch below.

Questions:

  1. Is there any plan to de-duplicate the repeated code between *Tracee.processCgroupMkdir and *Controller.processCgroupMkdir?

All the rest LGTM, but do check the possibility of a concurrent error from RemoveFromBpfMap, and, if you change it, please rename that method to use capital BPF.

Thank you!
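To illustrate the concurrency error described in item 2, here is a runnable toy (the BPF map is modeled with a plain Go map; names and error text are assumptions of this sketch, not libbpfgo's actual behavior): two parallel callers both try to delete the same key, and the loser gets an error back.

```go
package main

import (
	"fmt"
	"sync"
)

// fakeBPFMap models the kernel-side BPF map with a mutex-guarded Go map.
type fakeBPFMap struct {
	mu sync.Mutex
	m  map[uint64]struct{}
}

// DeleteKey fails if the key is already gone, mimicking a BPF map delete.
func (f *fakeBPFMap) DeleteKey(k uint64) error {
	f.mu.Lock()
	defer f.mu.Unlock()
	if _, ok := f.m[k]; !ok {
		return fmt.Errorf("key %d not found", k) // what the losing caller sees
	}
	delete(f.m, k)
	return nil
}

func main() {
	m := &fakeBPFMap{m: map[uint64]struct{}{42: {}}}
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if err := m.DeleteKey(42); err != nil {
				fmt.Println("concurrent delete:", err) // exactly one caller prints this
			}
		}()
	}
	wg.Wait()
}
```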

@NDStrahilevitz (Collaborator, Author)

  1. I can do the rename.
  2. It can be, but it isn't: I haven't removed the processing function for CgroupMkdir, but I did unregister it in the control plane PR, so it's not running and no race condition will actively happen. (Sidenote: there is a race condition between that function and the cgroup_rmdir eBPF program; for short-lived cgroups the eBPF program will sometimes run before the processing logic, and RemoveFromBpfMap will then return an error. One way to tolerate that is sketched below.)

I will remove the processing code so there's no confusion and do the renames. Then I'll squash and open a backport PR (I will wait for your review so you can confirm the formatting is right).
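For the sidenote's race, one way to tolerate it, sketched under the assumption that the delete error is distinguishable (the sentinel below is hypothetical, not the real error value), is to treat "already gone" as success, since for short-lived cgroups the eBPF side may legitimately win:

```go
package main

import (
	"errors"
	"fmt"
)

// errKeyNotFound stands in for whatever "key not found" error the real BPF
// map delete returns; the exact error value is an assumption of this sketch.
var errKeyNotFound = errors.New("key not found")

// removeCgroup is idempotent: if the cgroup_rmdir eBPF program (or another
// pipeline) already deleted the entry, the miss is not reported as a failure.
func removeCgroup(deleteKey func(uint64) error, cgroupID uint64) error {
	if err := deleteKey(cgroupID); err != nil {
		if errors.Is(err, errKeyNotFound) {
			return nil // short-lived cgroup: the other actor won the race
		}
		return fmt.Errorf("removing cgroup %d: %w", cgroupID, err)
	}
	return nil
}

func main() {
	lostRace := func(uint64) error { return errKeyNotFound } // simulate losing the race
	fmt.Println(removeCgroup(lostRace, 7)) // prints <nil>: the miss is tolerated
}
```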

1. Rename mutex in cgroups code
2. Capitalize BPF in RemoveFromBpfMap
3. Delete unused processing code for cgroup events.
@NDStrahilevitz NDStrahilevitz merged commit f638b4e into aquasecurity:main Jun 29, 2023
25 checks passed
NDStrahilevitz added a commit to NDStrahilevitz/tracee that referenced this pull request Jun 29, 2023
Locks in the container package were as short as possible.
This caused transaction issues once a few actors interacted with it.

Change the locking scheme to be more "transactional" with large locks,
instead of combining read locks and write locks in the same transaction.

Additional chores in the scope of the commit:
1. Initialize enrichEnabled field in control plane.
2. Control Plane's Run started a goroutine internally, which was
    redundant.
3. Rename mutex in cgroups code to cgroupsMutex.
4. Capitalize BPF in RemoveFromBpfMap.
5. Delete the unused processing code for cgroup events.

commit: f638b4e (main), cherry-pick
NDStrahilevitz added a commit that referenced this pull request Jun 29, 2023
(same commit message as above)

commit: f638b4e (main), cherry-pick
@rafaeldtinoco (Contributor)

So what is the control plane left with?

And about backporting multiple commits as a single commit and merging into v0.16.0, I think it's a bad idea for some sustaining reasons:

  • Usually cherry-picks and backports should be 1:1 patches from main, or else things will get out of control really fast (trust me).

  • Changes in v0.16.0 should be minimal and fixes only; we should not rely on big code changes (those should be an exception of exceptions) and consider them "fixes". Usually fixes are easy to spot, with not so many code changes.

With those changes, I believe you should have waited for another review of mine. And you also did not wait for my take on the cherry-picking into v0.16.0.

@NDStrahilevitz (Collaborator, Author)

> So what is the control plane left with?

All the logic is in the control plane; there was just leftover (and disconnected) logic in the event processing file. The control plane was already doing all the work, I just had to clean up the leftovers from the event processing.

> And about backporting multiple commits as a single commit and merging into v0.16.0, I think it's a bad idea for some sustaining reasons:
>
>   • Usually cherry-picks and backports should be 1:1 patches from main, or else things will get out of control really fast (trust me).
>   • Changes in v0.16.0 should be minimal and fixes only; we should not rely on big code changes (those should be an exception of exceptions) and consider them "fixes". Usually fixes are easy to spot, with not so many code changes.

I squashed the 3 commits when merging to main, then cherry-picked the squashed commit to v0.16.0. So there is no difference from what you described: it's the same commit in both places (with the addition of the reference in the cherry-picked one).
If the issue is that I decided to squash-merge this PR (which went to main), then there was no indication that I should specifically rebase-merge it; on the contrary, other PRs which were backported were also squash-merged, not rebase-merged. And since I squash-merged here, obviously I would not cherry-pick the PR's 3 original commits one by one to the release branch.

> With those changes, I believe you should have waited for another review of mine. And you also did not wait for my take on the cherry-picking into v0.16.0.

I assumed Yaniv's +1 was enough, since we already had examples of backports which he would also have read and known about, and he reviewed this one of his own volition.

@rafaeldtinoco (Contributor)

I saw the code removal, and I would have said not to do it in the cherry-pick/backport patch in the release tree.

We will eventually agree that the release tree has to carry only fixes (and sometimes an ugly fix, not the fix done in main) so we don't introduce bugs, even with something as simple as removing unused code.

About Yaniv's +1 being enough: you are right. It is enough, and I spoke to him about this just now. We will coordinate better and discuss with you (offline) how we want this tree to be maintained.

I'm writing to you in private with the rest of this message =).
