Alpha test on sensor fleet(s) #74

Closed
els0r opened this issue Mar 8, 2023 · 9 comments

els0r commented Mar 8, 2023

Preliminary test done as part of #131.

| Host ID | # Interfaces | # Cores | Memory (GB) | RB Block Size | RB Num Blocks | Classification | Comments |
|---|---|---|---|---|---|---|---|
| 85f74f66 | 75 | 8 | 32 | 1 MiB | 2 | Mid-range volume, mid-range number of interfaces | Details in #131 |
| 86c3efe2 | 6 | 32 | 128 | 1-2 MiB | 2-4 | High-volume host | Drops experienced consistently on the lower ring buffer setting; fewer observed when set to more blocks and a higher block size. Came at the expense of a higher memory footprint (1.1-2.0 GB vs. 900 MB-1.2 GB). Still seems to be an issue for traffic bursts. |
| 765bd9af | 337 | 16 | 32 | 2 MiB | 4 | Mid-range volume, high number of interfaces | Went directly to a larger ring buffer size and number of blocks due to many drops observed across the band. With the default setting of 1 MiB and 2 blocks, the drops were in the thousands; with the current setting, spot checks showed no drops. |
| 77c4f356 | 4 | 48 | 128 | 1 MiB | 4 | High-volume host | No drops |
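
For context on the memory cost of these settings, here is a rough, hedged sketch (assuming standard AF_PACKET/TPACKET_V3 semantics, where the kernel maps block size * num blocks of ring memory per captured interface; this is not goProbe code, and it only accounts for the ring buffers' share of the footprint):

```go
// Back-of-the-envelope helper: per-interface ring memory is block_size * num_blocks,
// so the same settings have a very different cost on a 6-interface host than on a
// 337-interface one. Host parameters are taken from the table above.
package main

import "fmt"

type host struct {
	name       string
	interfaces int
	blockSize  int // bytes
	numBlocks  int
}

func main() {
	hosts := []host{
		{"86c3efe2", 6, 2 << 20, 4},   // high volume, few interfaces
		{"765bd9af", 337, 2 << 20, 4}, // mid-range volume, many interfaces
	}
	for _, h := range hosts {
		totalMiB := h.interfaces * h.blockSize * h.numBlocks >> 20
		fmt.Printf("%s: %d ifaces x %d MiB x %d blocks = %d MiB of ring memory\n",
			h.name, h.interfaces, h.blockSize>>20, h.numBlocks, totalMiB)
	}
	// Output:
	// 86c3efe2: 6 ifaces x 2 MiB x 4 blocks = 48 MiB of ring memory
	// 765bd9af: 337 ifaces x 2 MiB x 4 blocks = 2696 MiB of ring memory
}
```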

Next steps:

  • deploy and leave running on production hosts (3 flavors)
els0r added this to the v4 Release milestone Mar 8, 2023
els0r mentioned this issue Apr 10, 2023
els0r added the performance label Jun 20, 2023
els0r self-assigned this Jun 23, 2023

els0r commented Jun 23, 2023

Host 86c3efe2. Some interesting stuff right out of the gate:

╭──────────────────────────────────────────────────────────────────╮
│                        Interface Statuses                        │
├───────┬──────────┬────────────┬───────────┬────────────┬─────────┤
│       │    total │            │     total │            │         │
│ iface │ received │ received   │ processed │ processed  │ dropped │
├───────┼──────────┼────────────┼───────────┼────────────┼─────────┤
│ A     │   8.97 M │ + 1.68 M   │    8.87 M │ + 1.68 M   │ 46      │
│ B     │   4.53 M │ + 969.82 k │    4.52 M │ + 970.25 k │ 39      │
│ C     │   3.94 M │ + 674.14 k │    3.93 M │ + 674.08 k │ 17      │
│ D     │   3.74 M │ + 810.24 k │    3.73 M │ + 810.65 k │ 33      │
│ E     │   5.91 k │ + 701.00   │    5.90 k │ + 698.00   │ 0       │
│ F     │ 908.41 k │ + 196.81 k │  906.76 k │ + 196.65 k │ 4       │
│ G     │   2.16 M │ + 449.68 k │    2.16 M │ + 449.60 k │ 14      │
╰───────┴──────────┴────────────┴───────────┴────────────┴─────────╯

This is after about 5 minutes of runtime: lots of traffic, but consistent drops on almost all interfaces. Memory:

 4313 root      20   0 4611292 974784  24072 S   6.2   0.7   0:23.66 goProbe

fako1024 commented:

@els0r From the perspective of how the buffers work, my 5c would be to keep the number of blocks at 4 (globally) and steer the rest mostly via the block size, for the following reasons:

  • Using a block number of 2 (less doesn't make sense, obviously) could introduce some strange "dead" times where the kernel might not yet write into the block it currently "owns" because it's in the process of being released from userland.
  • Using a block number significantly higher than 4 probably won't do any good (simply because we're pulling from a single block with a single thread as fast as we can, so having more blocks the kernel is allowed to write to won't help because it owns all but one block at all times, except maybe for that short period of time where we "flip", see first point).
  • Looking at the profile, we do spend quite some time in the PPOLL() call, so increasing the block size might/will help (because it reduces the number of blocking calls per byte read); a minimal sketch of this consume/poll loop follows below.
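
To make the single-consumer argument concrete, here is a minimal, hedged sketch of such a loop in Go (assumed structure only, not slimcap's actual implementation; `parseBlock` is a hypothetical callback that processes one released block and reports whether another is ready):

```go
// Minimal sketch (assumed TPACKET_V3-style semantics, not slimcap code) of the
// single-threaded consumer described above: userland drains exactly one block at a
// time and only goes back to sleep in PPOLL once nothing is ready, so extra blocks
// beyond ~4 buy little, while larger blocks mean fewer PPOLL wakeups per byte read.
package capture

import "golang.org/x/sys/unix"

// pollBlock blocks until the kernel has handed at least one ring buffer block
// over to userland (POLLIN on the AF_PACKET socket).
func pollBlock(fd int32) error {
	fds := []unix.PollFd{{Fd: fd, Events: unix.POLLIN}}
	for {
		n, err := unix.Ppoll(fds, nil, nil)
		if err == unix.EINTR {
			continue // interrupted by a signal, retry
		}
		if err != nil {
			return err
		}
		if n > 0 {
			return nil
		}
	}
}

// consume runs the capture loop. parseBlock is a hypothetical callback that
// processes the next released block (returning false when none is available)
// and hands it back to the kernel afterwards.
func consume(fd int32, parseBlock func() bool) error {
	for {
		for parseBlock() {
			// The kernel may only write into the blocks it still owns, i.e.
			// all but the one currently being parsed here.
		}
		if err := pollBlock(fd); err != nil {
			return err
		}
	}
}
```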


els0r commented Jun 28, 2023

New profiles incoming for 765bd9af, based on commit ee565f6:

Runtime info:

            Running since: 2023-06-28 21:38:52 (2h3m18s ago)
  Last scheduled writeout: 2023-06-28 23:40:00 (2m10s ago)

Totals:

    Packets
       Received: 2.11 G / + 31.19 M
      Processed: 2.11 G / + 31.20 M
        Dropped:      + 0


fako1024 commented Jun 29, 2023

@els0r The profiles look really good to me (in particular, all interface up/down usage is now gone, as I hoped). There are some stray fmt. allocations from the message regarding packet fragmentation, but those will be removed in #55 anyway. Aside from that, I don't think there are any obvious paths left that we can significantly improve upon at present (at least none come to mind), so I guess the next step would be to try out PGO in #138 (see my comment there) and of course continue testing and collecting more samples.

Nice one, way to go! 💪

Sidenotes:

  • We might have to think about tracking the overall number of dropped packets; otherwise it's going to be hard to assess, both now and later in production (see the sketch after this list).
  • Since at least here I don't see any drops: do you want to try reducing the block size a little again (as part of #98, "Assess reasonable ring buffer sizes")?
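
Regarding the first point, a hedged sketch of what such tracking could look like (metric name and wiring are assumptions, not goProbe's actual metrics code):

```go
// Hedged sketch: expose cumulative packet drops per interface as a Prometheus
// counter so they can be assessed both now and later in production.
package metrics

import "github.com/prometheus/client_golang/prometheus"

var packetsDropped = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "goprobe_capture_packets_dropped_total", // hypothetical metric name
		Help: "Total number of packets dropped on the capture ring, per interface.",
	},
	[]string{"iface"},
)

func init() {
	prometheus.MustRegister(packetsDropped)
}

// AddDrops records the drops observed since the last poll of the capture stats.
func AddDrops(iface string, drops float64) {
	packetsDropped.WithLabelValues(iface).Add(drops)
}
```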


els0r commented Aug 8, 2023

Operational Data

With #174 deployed on the sensor fleet, there's some interesting data to look at. Bottom line: we have to take a look at drops and errors. They occur at a non-negligible rate, especially on the high-throughput hosts.

What's also very interesting is how long the writeout takes on the high-throughput hosts. I will adjust the buckets as part of #174.
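
For illustration, a hedged sketch of a writeout-duration histogram with buckets stretched toward the multi-second range (name and bucket boundaries are assumptions, not what #174 actually ships):

```go
// Hedged sketch: time each scheduled writeout and record it in a histogram whose
// buckets extend far enough to resolve the slow writeouts seen on high-throughput
// hosts.
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var writeoutDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "goprobe_writeout_duration_seconds", // hypothetical metric name
	Help:    "Duration of a scheduled flow writeout.",
	Buckets: []float64{0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30},
})

func init() {
	prometheus.MustRegister(writeoutDuration)
}

// TimeWriteout wraps a writeout and observes how long it took.
func TimeWriteout(writeout func() error) error {
	start := time.Now()
	err := writeout()
	writeoutDuration.Observe(time.Since(start).Seconds())
	return err
}
```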

Mid-Range Traffic, Many Interfaces

[chart: 85f74f66 and 765bd9af]

High Traffic, Few Interfaces

[chart: a and b]

@fako1024: what do you think? Looking at #162 might be a good next course of action. Or beefing up the buffer size. Or one after the other.


fako1024 commented Aug 8, 2023

@els0r Thanks for the charts, looks quite interesting indeed. But please let's not touch the buffer size until we understand exactly what's going on (in particular because we already know there is a reason why we have drops). Let's start by looking at #162 first. In addition, metrics aside, what are the errors (they should show up somewhere in the logs, no)? Could be related to #55 (if it's just fragmentation then they will go away) or maybe even fako1024/slimcap#6 ...

I suggest the following tasks:

Maybe to elaborate on why I'm so pesky about not touching the buffer: it won't really help, it will just hide an actual challenge / imperfection in our concept. Remember that a buffer is just a buffer (is just a buffer) that is there to cover bursts and periods of being blocked. There are only two scenarios here:

  1. We can't sustain continuous capture on these hosts / interfaces (and hence we're screwed, because increasing the buffer will only make it a little better, but if the host is a bit under pressure or whatever there's still going to be drops). I don't like that scenario.
  2. We get a buffer overflow because capturing is blocked for too long. This must be fixed (hence #162, "[feature] Prototype local buffer implementation for uninterrupted packet capture during rotation"). Similar to scenario 1, increasing the buffer size will only work until something puts a bit too much pressure on the host or the writeout is a bit slower for some reason and we get drops again; see the back-of-the-envelope sketch after this list.
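
As a back-of-the-envelope illustration of scenario 2 (the capture rate is an assumption; the block settings are the ones from the table at the top):

```go
// A ring buffer only buys a fixed amount of blocked time before it overflows:
// total ring bytes divided by the sustained capture rate.
package main

import "fmt"

func main() {
	const (
		blockSize   = 2 << 20  // 2 MiB per block
		numBlocks   = 4        // current setting on the larger hosts
		bytesPerSec = 50 << 20 // assumed sustained capture rate of ~50 MiB/s
	)
	ringBytes := float64(blockSize * numBlocks)
	fmt.Printf("ring absorbs roughly %.2f s of blocked capture\n", ringBytes/bytesPerSec)
	// => roughly 0.16 s: if rotation blocks capture for longer than that, drops
	// follow no matter how generously the ring is sized.
}
```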

This also leads me to a suggestion: once we have figured out #162, we could add Prometheus metrics for how long an interface is blocked (i.e. how long it is in the state between entering the lock and returning to its normal capturing flow), how many elements are in the temporary buffer, and how long it takes to drain them. This should give us a continuous picture of the behavior of the new local buffer.
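
A hedged sketch of what those metrics could look like (names are placeholders; the actual wiring would depend on how #162 lands):

```go
// Hedged sketch: instrument the rotation path with (1) how long capture is blocked,
// (2) how many packets pile up in the local buffer, and (3) how long the drain takes.
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	captureBlocked = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name: "goprobe_capture_blocked_duration_seconds", // hypothetical
		Help: "Time between entering the rotation lock and resuming normal capture.",
	}, []string{"iface"})

	localBufferPackets = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "goprobe_local_buffer_packets", // hypothetical
		Help: "Packets currently held in the temporary local buffer.",
	}, []string{"iface"})

	localBufferDrain = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name: "goprobe_local_buffer_drain_duration_seconds", // hypothetical
		Help: "Time taken to drain the local buffer after a rotation.",
	}, []string{"iface"})
)

func init() {
	prometheus.MustRegister(captureBlocked, localBufferPackets, localBufferDrain)
}
```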


els0r commented Aug 8, 2023

> @els0r Thanks for the charts, looks quite interesting indeed. But please let's not touch the buffer size until we understand exactly what's going on (in particular because we already know there is a reason why we have drops). Let's start by looking at #162 first. In addition, metrics aside, what are the errors (they should show up somewhere in the logs, no)? Could be related to #55 (if it's just fragmentation then they will go away) or maybe even fako1024/slimcap#6 ...
>
> I suggest the following tasks:

Sounds good. As for the first point, we would need #178 to inspect them properly, since they are neither logged nor inspectable via the errorsMap at the moment. I remember that we explicitly decided against logging since this is in the critical path. Oh well, so much for that 😓.
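
For reference, a hedged sketch of the "count in memory, inspect on demand" pattern (the real errorsMap in goProbe may be shaped differently; this only illustrates why it needs an API like #178 to become visible):

```go
// Hedged sketch: errors are counted on the hot path without any I/O and only
// serialized when something (e.g. the API from #178) asks for a snapshot.
package capture

import "sync"

// ErrTracker counts capture/parsing errors by reason instead of logging them
// from the critical path.
type ErrTracker struct {
	mu     sync.Mutex
	counts map[string]uint64
}

func NewErrTracker() *ErrTracker {
	return &ErrTracker{counts: make(map[string]uint64)}
}

// Add increments the counter for a given error reason (cheap, no I/O).
func (e *ErrTracker) Add(reason string) {
	e.mu.Lock()
	e.counts[reason]++
	e.mu.Unlock()
}

// Snapshot returns a copy of the counters for inspection via a status endpoint.
func (e *ErrTracker) Snapshot() map[string]uint64 {
	e.mu.Lock()
	defer e.mu.Unlock()
	out := make(map[string]uint64, len(e.counts))
	for k, v := range e.counts {
		out[k] = v
	}
	return out
}
```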

Second point: once #174 is merged, that should be visible immediately. Will focus on that right out of the gate.


els0r commented Sep 3, 2023

I will conclude this issue with one more test of the develop branch on the sensor hosts. The remainder (the beta test) will follow post-v4 on the Open Systems fleet.


els0r commented Sep 6, 2023

Blocked by Go 1.21 not yet being available internally. Closing this issue; any further findings will be reported in follow-up issues.

els0r closed this as completed Sep 6, 2023