Alpha test on sensor fleet(s) #74
Host 86c3efe2. Some interesting stuff right out of the gate:
After about 5 minutes of runtime: lots of traffic, but consistent drops for almost all ifaces. Memory:
@els0r From the perspective of how the buffers work, my 5c would be to keep the number of blocks at 4 (globally) and steer the rest mostly via the block size, for the following reasons:
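To make the blocks-vs-block-size trade-off concrete, here is a minimal Go sketch; the type, field names, and values are illustrative assumptions, not slimcap's actual API. The total ring buffer per interface is the product of the two settings, so fixing the block count at 4 leaves the block size as the single tuning knob.

```go
package main

import "fmt"

// ringConfig is a hypothetical stand-in for the capture ring buffer settings
// discussed above: the total buffer is numBlocks * blockSize, so keeping the
// block count fixed and tuning only the block size steers memory usage in a
// single dimension.
type ringConfig struct {
	numBlocks int // kept constant (4) across all interfaces
	blockSize int // per-block size in bytes, the knob that is actually tuned
}

func (c ringConfig) totalBufferBytes() int {
	return c.numBlocks * c.blockSize
}

func main() {
	// Example: 4 blocks of 1 MiB each yield a 4 MiB ring per interface.
	cfg := ringConfig{numBlocks: 4, blockSize: 1 << 20}
	fmt.Printf("total ring buffer: %d bytes\n", cfg.totalBufferBytes())
}
```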
New profiles incoming for 765bd9af, based on commit ee565f6:
@els0r The profiles look really good to me (in particular, all interface up/down usage is now gone, as I hoped); there's still some stray usage. Nice one, way to go! 💪 Sidenotes:
Operational Data

With #174 deployed on the sensor fleet, there's some interesting data to look at. Bottom line: we have to take a look at drops and errors. They occur at a non-negligible rate, especially on the high-throughput hosts. What's also very interesting is how long the writeout takes on the high-throughput hosts. Will adjust the buckets as part of #174.

Mid-Range Traffic, Many Interfaces

High Traffic, Few Interfaces

@fako1024: what do you think? Looking at #162 might be a good next course of action. Or beefing up the buffer size. Or one after the other.
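As a rough illustration of what adjusting the buckets could look like, here is a hedged sketch using the Prometheus Go client. The metric name, helper function, and bucket layout are assumptions for illustration, not the actual metrics shipped in #174; wider exponential buckets keep the long writeouts on high-throughput hosts from all landing in the top bucket.

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical writeout-duration histogram; name and buckets are illustrative.
var writeoutDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name: "goprobe_writeout_duration_seconds",
	Help: "Duration of flow map writeouts.",
	// 10ms, 20ms, 40ms, ... up to ~20s (12 exponential buckets).
	Buckets: prometheus.ExponentialBuckets(0.01, 2, 12),
})

// observeWriteout wraps a writeout call and records its duration.
func observeWriteout(writeout func() error) error {
	start := time.Now()
	err := writeout()
	writeoutDuration.Observe(time.Since(start).Seconds())
	return err
}

func main() {
	prometheus.MustRegister(writeoutDuration)

	// Example: a dummy writeout taking ~50ms.
	_ = observeWriteout(func() error {
		time.Sleep(50 * time.Millisecond)
		return nil
	})
}
```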
@els0r Thanks for the charts, they look quite interesting indeed. But please let's not touch the buffer size until we understand exactly what's going on (in particular because we already know there is a reason why we have drops). Let's start by looking at #162 first. In addition, metrics aside, what are the errors (they should show up somewhere in the logs, no)? Could be related to #55 (if it's just fragmentation they will go away) or maybe even fako1024/slimcap#6 ... I suggest the following tasks:
Maybe to elaborate on why I'm so pesky about not touching the buffer: it won't really help, it will just hide an actual challenge / imperfection in our concept. Remember that a buffer is just a buffer (is just a buffer) that is there to cover bursts and periods of being blocked. There are only two scenarios here:
This also leads me to a suggestion: once we have figured out #162 we could add Prometheus metrics on how long an interface is blocked (i.e. how long it is in the state between entering the lock and returning to its normal capturing flow), how many elements are in the temporary buffer, and how long it takes to drain them. This should give us continuous insight into the behavior of the new local buffer.
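A minimal sketch of what those three metrics could look like with the Prometheus Go client; the metric names, label set, and buckets are assumptions for illustration, not an actual goProbe implementation.

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical metrics for the local-buffer behavior; names and labels are
// illustrative assumptions only.
var (
	// How long an interface is blocked, i.e. the time between entering the
	// lock and returning to its normal capturing flow.
	ifaceBlockedDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "goprobe_iface_blocked_duration_seconds",
		Help:    "Time an interface spends blocked during rotation.",
		Buckets: prometheus.ExponentialBuckets(0.001, 2, 14), // 1ms .. ~8s
	}, []string{"iface"})

	// How many elements are parked in the temporary (local) buffer.
	localBufferElements = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "goprobe_local_buffer_elements",
		Help: "Packets currently held in the temporary local buffer.",
	}, []string{"iface"})

	// How long it takes to drain the temporary buffer again.
	localBufferDrainDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "goprobe_local_buffer_drain_duration_seconds",
		Help:    "Time taken to drain the temporary local buffer.",
		Buckets: prometheus.ExponentialBuckets(0.001, 2, 14),
	}, []string{"iface"})
)

func main() {
	prometheus.MustRegister(ifaceBlockedDuration, localBufferElements, localBufferDrainDuration)

	// Example observations for a single interface.
	ifaceBlockedDuration.WithLabelValues("eth0").Observe(0.012)
	localBufferElements.WithLabelValues("eth0").Set(128)
	localBufferDrainDuration.WithLabelValues("eth0").Observe(0.004)
}
```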
Sounds good. As for the first point, we would need #178 to inspect them properly, since they are neither logged nor is the errorsMap inspectable at the moment. I remember that we explicitly decided against logging since this is in the critical path. Oh well, so much for that 😓. Second point: once #174 is merged, that should be visible immediately. Will focus on that right out of the gate.
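For the inspection side, here is a hedged sketch of one way an error-counter map could be made inspectable without logging in the critical path; the type, endpoint path, and port are assumptions for illustration, not the design of #178.

```go
package main

import (
	"encoding/json"
	"net/http"
	"sync"
)

// errorsMap is a hypothetical stand-in for the per-interface parsing error
// counters mentioned above; the real structure in goProbe may differ.
type errorsMap struct {
	mu     sync.Mutex
	counts map[string]uint64 // error type -> count
}

func (e *errorsMap) inc(errType string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.counts == nil {
		e.counts = make(map[string]uint64)
	}
	e.counts[errType]++
}

// snapshot copies the counters so readers never block the capture hot path.
func (e *errorsMap) snapshot() map[string]uint64 {
	e.mu.Lock()
	defer e.mu.Unlock()
	out := make(map[string]uint64, len(e.counts))
	for k, v := range e.counts {
		out[k] = v
	}
	return out
}

func main() {
	em := &errorsMap{}
	em.inc("fragmented_packet")

	// Expose the counters on demand instead of logging in the critical path.
	http.HandleFunc("/debug/capture-errors", func(w http.ResponseWriter, _ *http.Request) {
		_ = json.NewEncoder(w).Encode(em.snapshot())
	})
	_ = http.ListenAndServe("localhost:6060", nil)
}
```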
I will conclude this issue with one more test of
Blocked by Go 1.21 not yet being available internally. Will close this issue and report any further findings in follow-up issues.
Preliminary test done as part of #131.
Next steps: