Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/sort events by timestamp #1103

Merged

Conversation

AlonZivony
Copy link
Contributor

@AlonZivony AlonZivony commented Oct 21, 2021

Created a method to reorder events and passing them forward sorted by chronological order.
For more information about the issue, see #1113 .

Algorithm

Events from the kernel are received one-by-one from the data channel.
We can rely on the fact that for each CPU, all the events received from it should be ordered by timestamps (except for syscalls which are not ordered because of the way we create the event in 2 steps).
So by adding to each event the source CPU, we can put it in its own CPU queue of events and get all the events to be almost ordered.

To avoid syscalls sorting problem, we can find the appropriate place to put the event in the queue from the end of it, and with maximum of 3 iterations we should find the matching place chronologically.

After we put all the events in queues according to the source CPU, we can extract each time the oldest first-in-queue event from all CPUs and send it forward.

To be able to promise that all events prior to oldest first-in-queue event arrived to other CPU's queues, we can use the fact that the CPUs' queues are ordered and be sure that if all queues sent events after given timestamp, all events prior to that timestamp have arrived. So, we can check for each CPU what is the most recent event it sent (for CPUs that sent new events). From those, we can check which one has the most ancient timestamp. Then, we can be sure that all events with older timestamps than that timestamp were received - and send them.

However, because of the other sorting problem cases (syscalls case and the vCPU case) we cannot send them right away, we need a time buffer. This part is a bit complicated - the vCPU case makes us wait at least 100ms to be sure that a vCPU didn't send a new event right after the previous (the other case is that it has no events to send). The syscalls case makes us wait about 3ms to be sure that there are no new events older than last event received in the CPU. Because the vCPU delayed event can be a syscall event (with older timestamp than the newest event in the CPU's queue), this make things event more complicated.
The solution to the 2 sorting problem cases is to wait at least 100ms until we send the events up to the decided timestamp. This way we can be sure that there won't be any new events received with older timestamp than the chosen one.

To summarize the algorithm - we have a CPU queue for each CPU. We insert new events to the matching CPU's queue, and follow which CPU was updated with new event. Each interval, we check from the most recent events from each CPU which has the oldest timestamp. After a delay of at least 100ms, we send all events from queues up to that timestamp in an ordered way. This way, we can be sure that all sent events are sorted.

Concepts Used

Queues

I implemented a queue structure myself for this PR, because I need an access to the internal queue to be able to insert new events not at the tail of the queue (in the case of syscalls events that are received in unsorted way).

Pools

To reduce allocations and freeing amount, I introduced a Pool struct in this PR. The struct is used to make through it alloc and free of new event nodes which are used in the CPU queues. The pool save the freed nodes, and when alloc is required it return a saved free node if there is one saved, or alloc new one.
To avoid pooling large amount of nodes, the max pooled amount is the number of allocated nodes. If the number exceeds it, the Pool will free half of the amount pooled in it.

@itaysk
Copy link
Collaborator

itaysk commented Oct 21, 2021

Can you please open an issue explaining what this is and what is the problem you were trying to solve

@AlonZivony
Copy link
Contributor Author

The current solution does not seem to solve the problem of disordered events from all sources, just from Perf Buffer CPUs union disordering. It seems that there is another reason for the disordering right now from tests I made, so this PR should stay as a draft until I will give global solution.

@mtcherni95
Copy link
Contributor

mtcherni95 commented Oct 31, 2021

The current solution does not seem to solve the problem of disordered events from all sources, just from Perf Buffer CPUs union disordering. It seems that there is another reason for the disordering right now from tests I made, so this PR should stay as a draft until I will give global solution.

Hi Alon, great work so far! maybe the 'disordering' between events is related to multi core cpu? if so then that would be logical as the libbpf implements a buffer per core. If that's true then we shouldn't expect disordering in single core cpu. What do you think?

@AlonZivony
Copy link
Contributor Author

The current solution does not seem to solve the problem of disordered events from all sources, just from Perf Buffer CPUs union disordering. It seems that there is another reason for the disordering right now from tests I made, so this PR should stay as a draft until I will give global solution.

Hi Alon, great work so far! maybe the 'disordering' between events is related to multi core cpu? if so then that would be logical as the libbpf implements a buffer per core. If that's true then we shouldn't expect disordering in single core cpu. What do you think?

Hi Michael!
To you question, I just discovered with @yanivagman that the reason for the disordering is the result of the way we create events from syscalls. It occurs with single core CPU as well, so using one will not solve all the disordering issues, but will solve the more severe ones. You can take a look at the connected issue I opened to see all the disordering reasons I found.
Anyway, this PR will open soon for CR and merge, so I hope you will have the solution for your problem soon enough :)

@AlonZivony AlonZivony force-pushed the feature/sort-events-by-timestamp branch from c703184 to 692e071 Compare November 4, 2021 13:32
@AlonZivony
Copy link
Contributor Author

AlonZivony commented Nov 4, 2021

After testing the performance of tracee-ebpf using my new feature, this is the results:
The test were taken in an isolated virtual machine, where only tracee-ebpf and the performance tool were used.
The results are the user-mode CPU usage in %, because the feature is user-mode feature.
The results are:

  • Without the feature (main branch) - 20.0%
  • With perf-buffer solution ordering - 23.2%
  • With sort-guaranteed solution (for all 3 ordering issues) - 25.4%

The standard deviation of the results is about +-2% of CPU usage.

The result is that the complete feature increase CPU usage by 27%, and the perf-buffer specific solution increase by 16%.
This is huge performance issue.
What do you think we should do?

@AlonZivony AlonZivony force-pushed the feature/sort-events-by-timestamp branch from 0af0c36 to 18b5cf9 Compare November 9, 2021 13:38
@AlonZivony AlonZivony marked this pull request as ready for review November 9, 2021 13:40
@yanivagman yanivagman linked an issue Nov 9, 2021 that may be closed by this pull request
tracee-ebpf/main.go Outdated Show resolved Hide resolved
tracee-ebpf/tracee/events.go Outdated Show resolved Hide resolved
tracee-ebpf/tracee/events_sorter_test.go Outdated Show resolved Hide resolved
@AlonZivony AlonZivony marked this pull request as draft November 15, 2021 11:41
@AlonZivony AlonZivony marked this pull request as ready for review November 21, 2021 16:45
@AlonZivony AlonZivony force-pushed the feature/sort-events-by-timestamp branch from 70a46b9 to 98fac21 Compare November 21, 2021 16:58
@AlonZivony AlonZivony force-pushed the feature/sort-events-by-timestamp branch from 245cb39 to 390802c Compare November 23, 2021 09:15
@AlonZivony AlonZivony force-pushed the feature/sort-events-by-timestamp branch from 390802c to bed37dd Compare December 2, 2021 12:31
@AlonZivony AlonZivony force-pushed the feature/sort-events-by-timestamp branch 3 times, most recently from 9625664 to 754343c Compare December 19, 2021 10:45
@AlonZivony AlonZivony force-pushed the feature/sort-events-by-timestamp branch from 754343c to 515dd0e Compare December 27, 2021 15:07
pkg/external/external.go Outdated Show resolved Hide resolved
tracee-ebpf/tracee/events_sorter.go Outdated Show resolved Hide resolved
tracee-ebpf/tracee/sorting/events_sorter.go Outdated Show resolved Hide resolved
tracee-ebpf/tracee/sorting/events_sorter.go Outdated Show resolved Hide resolved
tracee-ebpf/tracee/sorting/events_sorter.go Outdated Show resolved Hide resolved
tracee-ebpf/tracee/sorting/events_sorter.go Outdated Show resolved Hide resolved
tracee-ebpf/tracee/sorting/events_sorter_test.go Outdated Show resolved Hide resolved
tracee-ebpf/tracee/sorting/events_sorter.go Outdated Show resolved Hide resolved
tracee-ebpf/tracee/sorting/events_sorter.go Outdated Show resolved Hide resolved
tracee-ebpf/tracee/sorting/events_sorter.go Outdated Show resolved Hide resolved
@yanivagman
Copy link
Collaborator

Also, let's not make this feature the default for now as it brings some overhead, and needs to be experimented first

@AlonZivony AlonZivony force-pushed the feature/sort-events-by-timestamp branch 3 times, most recently from 6627ee2 to 08fe4de Compare January 3, 2022 14:29
@yanivagman
Copy link
Collaborator

related #1360 - maybe we can fix it there?

@AlonZivony
Copy link
Contributor Author

related #1360 - maybe we can fix it there?

I am not sure that is the responsibility of that PR but it is close in the logic.
I will open an issue and if @liamg will think he can put it in the PR in an appropriate way he will fix it.

@AlonZivony
Copy link
Contributor Author

I will mark this PR as a draft until I am sure the bug is solved (currently working on fixing it)

@AlonZivony AlonZivony marked this pull request as draft January 23, 2022 12:16
@AlonZivony AlonZivony force-pushed the feature/sort-events-by-timestamp branch from bbaa54a to 2bf42f9 Compare January 23, 2022 12:54
@AlonZivony AlonZivony marked this pull request as ready for review January 23, 2022 12:56
@rafaeldtinoco
Copy link
Contributor

@AlonZivony is this good for a new review ?

@AlonZivony
Copy link
Contributor Author

@AlonZivony is this good for a new review ?

It's great.
I will address the CR soon, I just have few other important issues to address that are more prioritized for this moment but I will try to address the CR as soon as I can.

@rafaeldtinoco
Copy link
Contributor

rafaeldtinoco commented Jan 27, 2022

Okay @AlonZivony

I think PR could have the following commits:

1) main: external.Event passed as pointer                # change to main.go, tracee.go and events_pipeline.go
2) main, external & tracee.bpf: introduce ProcessorID    # change to tracee.go, tracee.bpf.c and events_pipeline.go
3) utils/environment: parse the number of CPUs available # cpu_analyzing.go
4) events/sorting: introduce event pool                  # pool.go
5) events/sorting: introduce queue & cpuqueue            # queue.go & cpu_queue.go
6) events/sorting: introduce events sorter               # sorting.go
7) events/sorting: introduce events sorter tests         # sorting_test.go
8) events_pipeline: sort events optionally               # change to events_pipeline.go
9) main_test: add sort-events to tests                   # change to main_test.go

And, like showed in comments after suggested titles, most of them are single file commits (or small changes to existing files) so I think that, with these commits, your rebasing commits work will be minimal and, still, allow good maintainability for cherry-picks and further fixes.

Also, are you addressing things I mentioned in comments ? Like adding the algorithm description to the source code, some changes to error messages, a missing link to next/previous for empty pool nodes (iirc).

Looking forward to the final PR so I can +1 it officially.

@AlonZivony AlonZivony force-pushed the feature/sort-events-by-timestamp branch 4 times, most recently from c5c47f1 to 746924e Compare February 2, 2022 12:48
tracee-ebpf/tracee/events.go Outdated Show resolved Hide resolved
tracee-ebpf/tracee/events.go Outdated Show resolved Hide resolved
tracee-ebpf/tracee/events.go Outdated Show resolved Hide resolved
cmd/tracee-ebpf/internal/flags/flags-output.go Outdated Show resolved Hide resolved
@AlonZivony AlonZivony force-pushed the feature/sort-events-by-timestamp branch from 746924e to 8f06cb7 Compare February 2, 2022 17:46
defer eq.mutex.Unlock()
if eq.head == nil {
if eq.tail != nil {
return nil, fmt.Errorf("BUG: TAIL without a HEAD")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say "discrepancy: tail without head"

Copy link
Collaborator

@yanivagman yanivagman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rafaeldtinoco rafaeldtinoco self-requested a review February 2, 2022 19:31
Copy link
Contributor

@rafaeldtinoco rafaeldtinoco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM @AlonZivony, awesome job.

@rafaeldtinoco rafaeldtinoco merged commit 0ba181b into aquasecurity:main Feb 2, 2022
@AlonZivony
Copy link
Contributor Author

Thanks!

@AlonZivony AlonZivony deleted the feature/sort-events-by-timestamp branch February 3, 2022 08:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Events arrive unsorted by timestamp occasionally
6 participants