
Allow admins to define bulk walk repetition sizes #4105

Closed
eschoeller opened this issue Jan 28, 2021 · 33 comments
Labels: bug (Undesired behaviour) · enhancement (General tag for an enhancement) · resolved (A fixed issue)
Milestone: v1.2.17

@eschoeller

I'm currently running on 1.2.14. I'm experiencing a problem with the 'SNMP Bulkwalk Fetch Size' parameter. Somehow this was set to 60. I had a device which could not run any data queries, and they would all produce this error:

ERROR: Data Query returned no indexes.

Through my debugging process I started looking at TCP traffic, and found these notable items:

00:47:25.417177 IP 1.1.1.70.48162 > 1.1.1.14.snmp:  C=public GetBulk(32)  N=0 M=60 .1.3.6.1.4.1.3854.3.5.3.1.1
00:47:26.028596 IP 1.1.1.14.snmp > 1.1.1.70.48162:  C=public GetResponse(32)  tooBig[errorIndex==0] .1.3.6.1.4.1.3854.3.5.3.1.1=
00:47:26.028850 IP 1.1.1.70.48162 > 1.1.1.14.snmp:  C=public GetRequest(32)  .1.3.6.1.4.1.3854.3.5.3.1.1
00:47:26.030905 IP 1.1.1.14.snmp > 1.1.1.70.48162:  C=public GetResponse(32)  noSuchName@1 .1.3.6.1.4.1.3854.3.5.3.1.1=

I realized the M=60 looked like a query size, and it matched what 'Maximum OIDs Per Get Request' was set to for that device, so I changed that to 10. That didn't fix it. Then I stumbled upon the global 'SNMP Bulkwalk Fetch Size', which was also set to 60. I tinkered with it and found that the maximum value this device would accept was 30. While troubleshooting, I also found that setting it to 30 yielded a 35-second runtime for the verbose data query, while setting it to 10 cut that query time by more than half, to just 16 seconds.

So, I don't really understand where the performance gains are with this setting. It seems to fetch the same information over and over again; my script query actually executed this GetBulk operation 19 times during a verbose query. There was also no evidence of a problem in the logs, and the one error message I did get, about having no indexes, was very misleading. This "tooBig" response should probably be caught by Cacti somewhere and alerted on. Lastly, I'm not sure it's the best idea to have this defined as a global setting; it should probably be per-host, just like 'Maximum OIDs Per Get Request' is.

But, my problem is solved for now. I have my data query working. I am going to try and make a mental note not to ever increase this Bulkwalk Fetch Size.

eschoeller added the bug (Undesired behaviour) and unverified (Some days we don't have a clue) labels on Jan 28, 2021
@eschoeller (Author)

I can partly see what happened here in #1281.

@TheWitness (Member) commented Jan 28, 2021

@eschoeller, what happens when you do the bulkwalk from net-snmp directly with no options?

Also, I did make a change recently, since max OIDs should not have any control over bulk walks. It's designed for gets only.
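
In other words (purely illustrative host, community, and OIDs), the two settings apply to different PDU types. 'Maximum OIDs Per Get Request' caps how many OIDs get packed into a single GET:

snmpget -v2c -c public 10.2.9.8 .1.3.6.1.2.1.1.3.0 .1.3.6.1.2.1.1.5.0 .1.3.6.1.2.1.1.6.0

The max-repetitions value only applies to GETBULK, i.e. snmpbulkwalk/snmpbulkget:

snmpbulkwalk -v2c -c public -Cr10 10.2.9.8 .1.3.6.1.4.1.3854.3.5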

@eschoeller (Author) commented Jan 30, 2021

$ snmpbulkwalk -m ALL -v2c -c public 10.2.9.8 .1.3.6.1.4.1.3854.3.5

18:18:30.170040 IP 172.20.5.6.41763 > 10.2.9.8.snmp:  C=public GetBulk(37)  N=0 M=10 .1.3.6.1.4.1.3854.3.5.4.1.61.0.0.2.8.0
18:18:30.199253 IP 10.2.9.8.snmp > 172.20.5.6.41763:  C=public GetResponse(258)  .1.3.6.1.4.1.3854.3.5.4.1.61.0.0.2.9.0=14242639 .1.3.6.1.4.1.3854.3.5.4.1.70.0.0.2.0.0=0 .1.3.6.1.4.1.3854.3.5.4.1.70.0.0.2.1.0=0 .1.3.6.1.4.1.3854.3.5.4.1.70.0.0.2.2.0=0 .1.3.6.1.4.1.3854.3.5.4.1.70.0.0.2.3.0=0 .1.3.6.1.4.1.3854.3.5.4.1.70.0.0.2.4.0=0 .1.3.6.1.4.1.3854.3.5.4.1.70.0.0.2.5.0=0 .1.3.6.1.4.1.3854.3.5.4.1.70.0.0.2.6.0=0 .1.3.6.1.4.1.3854.3.5.4.1.70.0.0.2.7.0=0 .1.3.6.1.4.1.3854.3.5.4.1.70.0.0.2.8.0=0
18:18:30.199874 IP 172.20.5.6.41763 > 172.20.9.8.snmp:  C=public GetBulk(37)  N=0 M=10 .1.3.6.1.4.1.3854.3.5.4.1.70.0.0.2.8.0
18:18:30.225547 IP 10.2.9.8.snmp > 172.20.5.6.41763:  C=public GetResponse(283)  .1.3.6.1.4.1.3854.3.5.4.1.70.0.0.2.9.0=0 .1.3.6.1.4.1.3854.3.5.4.1.1000.0.0.2.0.0=131072 .1.3.6.1.4.1.3854.3.5.4.1.1000.0.0.2.1.0=131328 .1.3.6.1.4.1.3854.3.5.4.1.1000.0.0.2.2.0=131584 .1.3.6.1.4.1.3854.3.5.4.1.1000.0.0.2.3.0=131840 .1.3.6.1.4.1.3854.3.5.4.1.1000.0.0.2.4.0=132096 .1.3.6.1.4.1.3854.3.5.4.1.1000.0.0.2.5.0=132352 .1.3.6.1.4.1.3854.3.5.4.1.1000.0.0.2.6.0=132608 .1.3.6.1.4.1.3854.3.5.4.1.1000.0.0.2.7.0=132864 .1.3.6.1.4.1.3854.3.5.4.1.1000.0.0.2.8.0=133120

etc....

Looks like it's using M=10 from the command line with no options.

@TheWitness (Member)

Is bulkwalk more performant without options, or with?

@TheWitness (Member)

@eschoeller Bump!

@eschoeller (Author)

OK. So, running snmpbulkwalk with the following -Cr values yields the following runtimes when walking the entire .1.3.6.1.4.1.3854.3.5 tree, which is 701 objects:

1: 0m2.750s
5: 0m4.853s
10: 0m8.405s
20: 0m16.706s
30: 0m35.496s
40: Reason: (tooBig) Response message would have been too large.

So, is that the expected result? The default of 10 really sucks here, and anything more than that is even worse. I don't even really understand what "max-repetitions field in the GETBULK PDUs. This specifies the maximum number of iterations over the repeating variables." means in general. Does this device have a "broken" SNMP agent?
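
For anyone who wants to reproduce this, a loop along these lines is all it takes (host, community, and OID subtree are just placeholders; the -Cr40 pass simply prints the tooBig error):

for r in 1 5 10 20 30 40; do
    echo "-Cr$r"
    time snmpbulkwalk -v2c -c public -Cr"$r" 10.2.9.8 .1.3.6.1.4.1.3854.3.5 > /dev/null
done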

@eschoeller (Author)

I guess the idea of the bulkwalk is to cut down on the number of SNMP requests, and the number of packets in general ... and using a setting of -Cr1 essentially negates any benefits of that over just a regular snmpwalk?

@bmfmancini (Member) commented Feb 10, 2021 via email

@eschoeller (Author)

Yeah, a lot of my devices are actually embedded devices, like in PDUs, UPSes, and other really odd funky things, with very rudimentary SNMP implementations. Almost no switches, and a random smattering of Linux machines. So, I have a feeling I might try tinkering with this setting across some of my device classes and see some more interesting results.
For Cacti in general, I'm not really sure what this means, but if there's a way to capture the "tooBig" response and alarm on it, that would probably be a good starting point to help people get to where I've ended up.

@TheWitness (Member) commented Feb 10, 2021

So, if you leave off the -CrX, what is the response like, @eschoeller? The tooBig comment means it's forced to use UDP and the total packet size ends up over the 1500-byte barrier; in fact, some VPNs clip the snot out of the MTU, and UDP doesn't fragment well, ever. I'm just debating leaving that setting off entirely.

@eschoeller (Author)

With no -CrX the runtime is 8.4s. Same as -Cr10, which makes sense since that's the default.

If I had all the time in the world I'd take a sampling of our entire infrastructure and run these scenarios on each type of device to get a better idea of what benefit (if any) increasing the value does, in terms of runtime performance. Since I'm on a 1 minute poller, I'm always chasing after polling speed increases. I've had to distribute across 5 polling machines to keep from exceeding our runtime on the main poller. I don't have a massive amount of data sources, but as I mentioned, I have lots of devices with tiny embedded ASICs which get clobbered pretty easily. Point well taken about VPNs and firewalls. I am traversing some firewalls between our polling machines and a lot of our devices. But, I'm on a high-speed campus network and not dealing with WAN latencies, thankfully. Others might find this setting useful in their environment, I'm not sure.

@bmfmancini (Member)

Here are my results; this is walking the entire device tree. Above -Cr20, though, the device was not having it.

 time snmpbulkwalk -Cr1  

real    1m50.521s
user    0m0.109s
sys     0m0.092s
 time snmpbulkwalk -Cr10  

real    0m20.095s
user    0m0.036s
sys     0m0.017s
 time snmpbulkwalk -Cr20

real    0m12.107s
user    0m0.032s
sys     0m0.006s

@TheWitness (Member)

Okay, so in summary, max_oids was designed for get requests.

Please vote on one of the following proposals:

  1. An option to either use bulk walk for v2 and v3 devices, or skip it (essentially force -Cr1)
  2. Provide a bulk walk size for use with Data Queries only.

Let me know which one you two prefer. Anyone else following this issue should chime in as well.

@bmfmancini (Member)

I think 1 is a good option

@eschoeller (Author) commented Feb 13, 2021 via email

@TheWitness (Member)

@eschoeller bump!

@eschoeller (Author)

My bad! I got tied up, and now I'm leaving again for the weekend.
What I want to do is run some tests against each type of device I have and get an idea of what I'm working with. Then, if I see a difference - i.e. some devices work well with -Cr1 and others with -Cr10 - I'd suggest we note that in the tool-tip, with a suggestion on how to determine which setting to use. Then, perhaps in the future, we could create an "Auto Optimize" feature in Cacti which would run a bulkwalk with various settings against a given device and automagically determine which settings to use to poll the device effectively. You could then apply those automatically optimized settings to all devices using the same device template. But if I find nothing that runs well with -Cr[10-60], then there's no point in the auto-optimize. And perhaps there are other settings which would benefit from an "Auto Optimize" type feature.

@bmfmancini (Member) commented Feb 19, 2021 via email

@TheWitness (Member)

Okay, last call for alcohol. Someone summarize in a single sentence, and not a run-on one either, what you would like done.

TheWitness added the enhancement (General tag for an enhancement) label and removed the unverified (Some days we don't have a clue) label on Feb 23, 2021
@bmfmancini (Member)

I think "An option to either use bulk walk for v2 and v3 devices, or skip it (essentially force -Cr1)" is the best route.

@eschoeller (Author) commented Feb 24, 2021

(sorry this is not a one-liner)
I spent about an hour or so on some data analysis. The results are attached. Some of my first thoughts:

  1. The setting is important.
  2. There is no way to generalize the value, but -Cr5 looks good. Performance results were unexpected in many cases.
  3. There appear to be some serious performance gains from tuning it correctly.
  4. In the end, we most likely need some kind of auto-optimize feature. I can't begin to think about how end-users would tune this.

bulk.txt

And, I apologize, I realize I didn't directly answer your question. But I am very intrigued by these findings. Take a look at the data and let me know what you guys think. My next step is to go back and review the configuration of my devices ... and then apply these "performance tuning" options and see if there's an improvement in polling times.

@TheWitness (Member) commented Feb 25, 2021

So, I read this as two possible approaches. The first:

  1. Provide a bulk walk size.
  2. Provide a way, during re-index or when the device is first added, to attempt to determine what that value should be.
  3. On the description for the setting, let the user know that it will be 'auto' selected during re-index (so don't touch it).

Or, more George Jetson style:

  1. Do the tuning.
  2. Feed the user mushrooms, and keep them in the dark.

I'm good with either. I may have gotten some of that wrong; likely just a replacement food substance rather than mushrooms, though.

@eschoeller (Author)

Circling back to the original issue: is there a way we can get better logging if we hit the tooBig[errorIndex==0] error, indicating that the bulk walk setting should be decreased?

@TheWitness (Member)

Post the exact error you get from the CLI.

@TheWitness (Member)

Just about to drop this in...

[screenshot of the new bulk walk repetitions setting]

Verbose Query looks like the following when Auto Detect/Set is chosen.

[screenshot of the verbose Data Query output with Auto Detect/Set chosen]

I chose to increment one at a time through 1, 5, 10, 15, 20, 30, 40, 50, 60 instead of taking a binary-search approach, mainly because I did not want to have to think that hard.
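
Roughly, the probe works like this (just a sketch of the idea, not the actual Cacti code; host, community, and OID are placeholders):

best=1
for r in 1 5 10 15 20 30 40 50 60; do
    out=$(snmpbulkwalk -v2c -c public -Cr"$r" 10.2.9.8 .1.3.6.1.4.1.3854.3.5 2>&1)
    if [ $? -ne 0 ] || grep -q 'tooBig' <<< "$out"; then
        break   # the device balked at this size, so keep the previous one
    fi
    best=$r
done
echo "max-repetitions = $best"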

TheWitness added a commit that referenced this issue Mar 21, 2021
- Issue with SNMP Bulkwalk Repetitions Size
- This enhancement adds a new column to the host table for bulk walk size us
- Detection can happen on each re-index, set manually, or detected and set one time.
- Changed Max OID's to a dropdown
- Max Repetitions is also a dropdown now
- This setting was also carried to the Automation portion of Cacti and the Add Device CLI
@TheWitness (Member)

Please review in your labs and provide feedback.

@bmfmancini (Member) commented Mar 21, 2021 via email

TheWitness added this to the v1.2.17 milestone and added the resolved (A fixed issue) label on Mar 21, 2021
@eschoeller (Author)

Wow, that's cool, you did it!! I think you just need 'Repetitions' with no apostrophe in the drop-down, though ;)

For SNMPv2 and SNMPv3 Devices, the SNMP Bulk Walk chunk size. For very large switches, or for high latency WAN connections, increasing this value to lead to more repid Data Query Re-Index operations. However, some devices to not operate well to large Bulk Walk sizes. Cacti can \'auto tune\' this value upon request.

I would tweak this description to:

For SNMPv2 and SNMPv3 Devices, the SNMP Bulk Walk max-repetitions size. The default is 10. For very large switches, high performance servers, Jumbo Frame Networks, or high latency WAN connections, increasing this value may increase poller performance. More data is packed into a single SNMP packet, which can reduce data query run time. However, some devices may completely refuse to respond to packets with a max-repetitions size which is set too large. This can be especially true for lower-powered IoT type devices or smaller embedded IT appliances. The overall network path MTU should also be considered, since setting a value which is too high could lead to packet fragmentation. Cacti can attempt to \'auto tune\' this value upon request.

In my research on the subject I certainly saw 'chunk size' referenced, but the man page that I'm seeing says:

-Cr<NUM>
    Set the max-repetitions field in the GETBULK PDUs. This specifies the maximum number of iterations over the repeating variables. The default is 10.

So I think that might be the newer nomenclature for it. I decided to go read RFC 3416 and here is the relevant section:

One of the aims of the GetBulkRequest-PDU, specified in this 
protocol, is to minimize the number of protocol exchanges required to
retrieve a large amount of management information.  As such, this PDU
type allows an SNMP entity supporting command generator applications
to request that the response be as large as possible given the
constraints on message sizes.  These constraints include the limits
on the size of messages which the SNMP entity supporting command
responder applications can generate, and the SNMP entity supporting
command generator applications can receive.

However, it is possible that such maximum sized messages may be
larger than the Path MTU of the path across the network traversed by
the messages.  In this situation, such messages are subject to
fragmentation.  Fragmentation is generally considered to be harmful
[FRAG], since among other problems, it leads to a decrease in the
reliability of the transfer of the messages.  Thus, an SNMP entity
which sends a GetBulkRequest-PDU must take care to set its parameters
accordingly, so as to reduce the risk of fragmentation.  In
particular, under conditions of network stress, only small values
should be used for max-repetitions.

Some additional good info there, which I included in the tool-tip description.
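
On the fragmentation point, a quick sanity check of the path MTU before raising max-repetitions might look like this on Linux (host is a placeholder; 1472 = 1500 minus 28 bytes of IP and ICMP headers):

ping -c 3 -M do -s 1472 10.2.9.8
tracepath -n 10.2.9.8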

In regard to your question 7 days ago: I think the only message I got back in the Cacti log was

ERROR: Data Query returned no indexes.

That was pretty generic and led to the wild goose chase on this. I don't think Cacti was really aware of the SNMP-level issue of the device responding with "C=public GetResponse(32) tooBig[errorIndex==0]", and I don't know whether Cacti would really be able to capture that SNMP response and throw a specific alert in the log saying "Hey! Your bulk walk size is too big, reduce it!"
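
At the command line, at least, the condition is easy to spot in the tool's output, so a check along these lines (placeholder host and community) could flag it:

snmpbulkwalk -v2c -c public -Cr60 10.2.9.8 .1.3.6.1.4.1.3854.3.5 2>&1 \
    | grep -q 'tooBig' && echo "bulk walk size too large, reduce max-repetitions"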

Thanks again for looking at this!

TheWitness added a commit that referenced this issue Mar 21, 2021
- Thanks Eric!
@TheWitness (Member)

It can return an error, but we have to write code for that. There is a replacement SNMP module on the way; one thing at a time, though.

@TheWitness (Member)

Can you please post the exact error message you get when you exceed the max-repetitions supported by your device? I cannot get it to break on my Linux box.

@TheWitness (Member)

Oh, use the network, Luke, use the network:

[root@vmhost3 cacti]# snmpbulkwalk -Cr10000000000 -v2c -c public 192.168.11.1
Error in packet.
Reason: (tooBig) Response message would have been too large.
SNMPv2-SMI::mib-2 = No Such Object available on this agent at this OID

TheWitness added a commit that referenced this issue Mar 21, 2021
This will help people who set their snmpbulkwalk size too large by providing a better log message for them.
@TheWitness (Member)

Okay, that should be fixed now too. See if you can get it to report the error.

@eschoeller (Author)

Lol @ -Cr10000000000 that's awesome!! The force is strong with this one!

netniV changed the title from "Issue with SNMP Bulkwalk Fetch Size" to "Allow admins to define bulk walk repetition sizes" on Apr 30, 2021
The github-actions bot locked and limited the conversation to collaborators on Jul 30, 2021