Feature/sort events by timestamp #1103

AlonZivony · 2021-10-21T13:06:02Z

Created a method to reorder events and passing them forward sorted by chronological order.
For more information about the issue, see #1113 .

Algorithm

Events from the kernel are received one-by-one from the data channel.
We can rely on the fact that for each CPU, all the events received from it should be ordered by timestamps (except for syscalls which are not ordered because of the way we create the event in 2 steps).
So by adding to each event the source CPU, we can put it in its own CPU queue of events and get all the events to be almost ordered.

To avoid syscalls sorting problem, we can find the appropriate place to put the event in the queue from the end of it, and with maximum of 3 iterations we should find the matching place chronologically.

After we put all the events in queues according to the source CPU, we can extract each time the oldest first-in-queue event from all CPUs and send it forward.

To be able to promise that all events prior to oldest first-in-queue event arrived to other CPU's queues, we can use the fact that the CPUs' queues are ordered and be sure that if all queues sent events after given timestamp, all events prior to that timestamp have arrived. So, we can check for each CPU what is the most recent event it sent (for CPUs that sent new events). From those, we can check which one has the most ancient timestamp. Then, we can be sure that all events with older timestamps than that timestamp were received - and send them.

However, because of the other sorting problem cases (syscalls case and the vCPU case) we cannot send them right away, we need a time buffer. This part is a bit complicated - the vCPU case makes us wait at least 100ms to be sure that a vCPU didn't send a new event right after the previous (the other case is that it has no events to send). The syscalls case makes us wait about 3ms to be sure that there are no new events older than last event received in the CPU. Because the vCPU delayed event can be a syscall event (with older timestamp than the newest event in the CPU's queue), this make things event more complicated.
The solution to the 2 sorting problem cases is to wait at least 100ms until we send the events up to the decided timestamp. This way we can be sure that there won't be any new events received with older timestamp than the chosen one.

To summarize the algorithm - we have a CPU queue for each CPU. We insert new events to the matching CPU's queue, and follow which CPU was updated with new event. Each interval, we check from the most recent events from each CPU which has the oldest timestamp. After a delay of at least 100ms, we send all events from queues up to that timestamp in an ordered way. This way, we can be sure that all sent events are sorted.

Concepts Used

Queues

I implemented a queue structure myself for this PR, because I need an access to the internal queue to be able to insert new events not at the tail of the queue (in the case of syscalls events that are received in unsorted way).

Pools

To reduce allocations and freeing amount, I introduced a Pool struct in this PR. The struct is used to make through it alloc and free of new event nodes which are used in the CPU queues. The pool save the freed nodes, and when alloc is required it return a saved free node if there is one saved, or alloc new one.
To avoid pooling large amount of nodes, the max pooled amount is the number of allocated nodes. If the number exceeds it, the Pool will free half of the amount pooled in it.

itaysk · 2021-10-21T23:26:41Z

Can you please open an issue explaining what this is and what is the problem you were trying to solve

AlonZivony · 2021-10-31T06:57:29Z

The current solution does not seem to solve the problem of disordered events from all sources, just from Perf Buffer CPUs union disordering. It seems that there is another reason for the disordering right now from tests I made, so this PR should stay as a draft until I will give global solution.

mtcherni95 · 2021-10-31T09:59:09Z

The current solution does not seem to solve the problem of disordered events from all sources, just from Perf Buffer CPUs union disordering. It seems that there is another reason for the disordering right now from tests I made, so this PR should stay as a draft until I will give global solution.

Hi Alon, great work so far! maybe the 'disordering' between events is related to multi core cpu? if so then that would be logical as the libbpf implements a buffer per core. If that's true then we shouldn't expect disordering in single core cpu. What do you think?

AlonZivony · 2021-10-31T10:04:44Z

The current solution does not seem to solve the problem of disordered events from all sources, just from Perf Buffer CPUs union disordering. It seems that there is another reason for the disordering right now from tests I made, so this PR should stay as a draft until I will give global solution.

Hi Alon, great work so far! maybe the 'disordering' between events is related to multi core cpu? if so then that would be logical as the libbpf implements a buffer per core. If that's true then we shouldn't expect disordering in single core cpu. What do you think?

Hi Michael!
To you question, I just discovered with @yanivagman that the reason for the disordering is the result of the way we create events from syscalls. It occurs with single core CPU as well, so using one will not solve all the disordering issues, but will solve the more severe ones. You can take a look at the connected issue I opened to see all the disordering reasons I found.
Anyway, this PR will open soon for CR and merge, so I hope you will have the solution for your problem soon enough :)

AlonZivony · 2021-11-04T13:59:57Z

After testing the performance of tracee-ebpf using my new feature, this is the results:
The test were taken in an isolated virtual machine, where only tracee-ebpf and the performance tool were used.
The results are the user-mode CPU usage in %, because the feature is user-mode feature.
The results are:

Without the feature (main branch) - 20.0%
With perf-buffer solution ordering - 23.2%
With sort-guaranteed solution (for all 3 ordering issues) - 25.4%

The standard deviation of the results is about +-2% of CPU usage.

The result is that the complete feature increase CPU usage by 27%, and the perf-buffer specific solution increase by 16%.
This is huge performance issue.
What do you think we should do?

tracee-ebpf/main.go

tracee-ebpf/tracee/events.go

tracee-ebpf/tracee/events_sorter_test.go

pkg/external/external.go

tracee-ebpf/tracee/events_sorter.go

tracee-ebpf/tracee/sorting/events_sorter.go

tracee-ebpf/tracee/sorting/events_sorter_test.go

tracee-ebpf/tracee/sorting/events_sorter.go

yanivagman · 2021-12-29T14:44:35Z

Also, let's not make this feature the default for now as it brings some overhead, and needs to be experimented first

yanivagman · 2022-01-23T08:52:47Z

related #1360 - maybe we can fix it there?

AlonZivony · 2022-01-23T09:56:00Z

related #1360 - maybe we can fix it there?

I am not sure that is the responsibility of that PR but it is close in the logic.
I will open an issue and if @liamg will think he can put it in the PR in an appropriate way he will fix it.

AlonZivony · 2022-01-23T12:16:25Z

I will mark this PR as a draft until I am sure the bug is solved (currently working on fixing it)

rafaeldtinoco · 2022-01-26T16:04:06Z

@AlonZivony is this good for a new review ?

AlonZivony · 2022-01-27T14:13:59Z

@AlonZivony is this good for a new review ?

It's great.
I will address the CR soon, I just have few other important issues to address that are more prioritized for this moment but I will try to address the CR as soon as I can.

rafaeldtinoco · 2022-01-27T18:58:48Z

Okay @AlonZivony

I think PR could have the following commits:

1) main: external.Event passed as pointer                # change to main.go, tracee.go and events_pipeline.go
2) main, external & tracee.bpf: introduce ProcessorID    # change to tracee.go, tracee.bpf.c and events_pipeline.go
3) utils/environment: parse the number of CPUs available # cpu_analyzing.go
4) events/sorting: introduce event pool                  # pool.go
5) events/sorting: introduce queue & cpuqueue            # queue.go & cpu_queue.go
6) events/sorting: introduce events sorter               # sorting.go
7) events/sorting: introduce events sorter tests         # sorting_test.go
8) events_pipeline: sort events optionally               # change to events_pipeline.go
9) main_test: add sort-events to tests                   # change to main_test.go

And, like showed in comments after suggested titles, most of them are single file commits (or small changes to existing files) so I think that, with these commits, your rebasing commits work will be minimal and, still, allow good maintainability for cherry-picks and further fixes.

Also, are you addressing things I mentioned in comments ? Like adding the algorithm description to the source code, some changes to error messages, a missing link to next/previous for empty pool nodes (iirc).

Looking forward to the final PR so I can +1 it officially.

tracee-ebpf/tracee/events.go

cmd/tracee-ebpf/internal/flags/flags-output.go

mtcherni95 · 2022-02-02T18:47:09Z

pkg/events/sorting/queue.go

+	defer eq.mutex.Unlock()
+	if eq.head == nil {
+		if eq.tail != nil {
+			return nil, fmt.Errorf("BUG: TAIL without a HEAD")


I would say "discrepancy: tail without head"

yanivagman

LGTM

rafaeldtinoco

LGTM @AlonZivony, awesome job.

AlonZivony · 2022-02-02T21:32:31Z

Thanks!

AlonZivony marked this pull request as draft October 24, 2021 12:28

AlonZivony mentioned this pull request Oct 25, 2021

Events arrive unsorted by timestamp occasionally #1113

Closed

AlonZivony force-pushed the feature/sort-events-by-timestamp branch from c703184 to 692e071 Compare November 4, 2021 13:32

AlonZivony force-pushed the feature/sort-events-by-timestamp branch from 0af0c36 to 18b5cf9 Compare November 9, 2021 13:38

AlonZivony marked this pull request as ready for review November 9, 2021 13:40

yanivagman linked an issue Nov 9, 2021 that may be closed by this pull request

Events arrive unsorted by timestamp occasionally #1113

Closed

mtcherni95 reviewed Nov 10, 2021

View reviewed changes

tracee-ebpf/main.go Outdated Show resolved Hide resolved

tracee-ebpf/tracee/events.go Outdated Show resolved Hide resolved

tracee-ebpf/tracee/events_sorter_test.go Outdated Show resolved Hide resolved

AlonZivony marked this pull request as draft November 15, 2021 11:41

AlonZivony marked this pull request as ready for review November 21, 2021 16:45

AlonZivony force-pushed the feature/sort-events-by-timestamp branch from 70a46b9 to 98fac21 Compare November 21, 2021 16:58

AlonZivony requested review from itaysk and mtcherni95 November 23, 2021 08:36

AlonZivony force-pushed the feature/sort-events-by-timestamp branch from 245cb39 to 390802c Compare November 23, 2021 09:15

AlonZivony force-pushed the feature/sort-events-by-timestamp branch from 390802c to bed37dd Compare December 2, 2021 12:31

AlonZivony force-pushed the feature/sort-events-by-timestamp branch 3 times, most recently from 9625664 to 754343c Compare December 19, 2021 10:45

AlonZivony force-pushed the feature/sort-events-by-timestamp branch from 754343c to 515dd0e Compare December 27, 2021 15:07

yanivagman reviewed Dec 29, 2021

View reviewed changes

AlonZivony force-pushed the feature/sort-events-by-timestamp branch 3 times, most recently from 6627ee2 to 08fe4de Compare January 3, 2022 14:29

AlonZivony requested a review from yanivagman January 4, 2022 15:43

yanivagman mentioned this pull request Jan 5, 2022

tracee-ebpf: reorder events processing #1322

Merged

AlonZivony marked this pull request as draft January 23, 2022 12:16

AlonZivony force-pushed the feature/sort-events-by-timestamp branch from bbaa54a to 2bf42f9 Compare January 23, 2022 12:54

AlonZivony marked this pull request as ready for review January 23, 2022 12:56

AlonZivony force-pushed the feature/sort-events-by-timestamp branch 4 times, most recently from c5c47f1 to 746924e Compare February 2, 2022 12:48

AlonZivony added 6 commits February 2, 2022 19:08

main, external & tracee.bpf: introduce ProcessorID

7aca002

utils/environment: parse the number of CPUs available

474e3ac

events/sorting: introduce events pool

573001a

events/sorting: introduce queue & cpu-queue

51ca84e

events/sorting: introduce events sorter

2554f02

events/sorting: introduce events sorter tests

f16eedf

mtcherni95 reviewed Feb 2, 2022

View reviewed changes

tracee-ebpf/tracee/events.go Outdated Show resolved Hide resolved

tracee-ebpf/tracee/events.go Outdated Show resolved Hide resolved

tracee-ebpf/tracee/events.go Outdated Show resolved Hide resolved

cmd/tracee-ebpf/internal/flags/flags-output.go Outdated Show resolved Hide resolved

AlonZivony added 2 commits February 2, 2022 19:45

tracee & events_pipeline: sort events optionally

d83f08a

main_test: add sort-events to tests

8f06cb7

AlonZivony force-pushed the feature/sort-events-by-timestamp branch from 746924e to 8f06cb7 Compare February 2, 2022 17:46

mtcherni95 reviewed Feb 2, 2022

View reviewed changes

yanivagman approved these changes Feb 2, 2022

View reviewed changes

rafaeldtinoco self-requested a review February 2, 2022 19:31

rafaeldtinoco approved these changes Feb 2, 2022

View reviewed changes

rafaeldtinoco merged commit 0ba181b into aquasecurity:main Feb 2, 2022

AlonZivony deleted the feature/sort-events-by-timestamp branch February 3, 2022 08:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature/sort events by timestamp #1103

Feature/sort events by timestamp #1103

AlonZivony commented Oct 21, 2021 •

edited

Loading

itaysk commented Oct 21, 2021

AlonZivony commented Oct 31, 2021

mtcherni95 commented Oct 31, 2021 •

edited

Loading

AlonZivony commented Oct 31, 2021

AlonZivony commented Nov 4, 2021 •

edited

Loading

yanivagman commented Dec 29, 2021

yanivagman commented Jan 23, 2022

AlonZivony commented Jan 23, 2022

AlonZivony commented Jan 23, 2022

rafaeldtinoco commented Jan 26, 2022

AlonZivony commented Jan 27, 2022

rafaeldtinoco commented Jan 27, 2022 •

edited

Loading

mtcherni95 Feb 2, 2022

yanivagman left a comment

rafaeldtinoco left a comment

AlonZivony commented Feb 2, 2022

Feature/sort events by timestamp #1103

Feature/sort events by timestamp #1103

Conversation

AlonZivony commented Oct 21, 2021 • edited Loading

Algorithm

Concepts Used

Queues

Pools

itaysk commented Oct 21, 2021

AlonZivony commented Oct 31, 2021

mtcherni95 commented Oct 31, 2021 • edited Loading

AlonZivony commented Oct 31, 2021

AlonZivony commented Nov 4, 2021 • edited Loading

yanivagman commented Dec 29, 2021

yanivagman commented Jan 23, 2022

AlonZivony commented Jan 23, 2022

AlonZivony commented Jan 23, 2022

rafaeldtinoco commented Jan 26, 2022

AlonZivony commented Jan 27, 2022

rafaeldtinoco commented Jan 27, 2022 • edited Loading

mtcherni95 Feb 2, 2022

Choose a reason for hiding this comment

yanivagman left a comment

Choose a reason for hiding this comment

rafaeldtinoco left a comment

Choose a reason for hiding this comment

AlonZivony commented Feb 2, 2022

AlonZivony commented Oct 21, 2021 •

edited

Loading

mtcherni95 commented Oct 31, 2021 •

edited

Loading

AlonZivony commented Nov 4, 2021 •

edited

Loading

rafaeldtinoco commented Jan 27, 2022 •

edited

Loading