Allow admins to define bulk walk repetition sizes #4105
Comments
I can see partly what happened here in #1281 |
@eschoeller, what happens when you do the bulkwalk from net-snmp directly with no options? Also, I did make a change recently as maxoids should not have any control of bulkwalks. It's designed for gets only. |
$ snmpbulkwalk -m ALL -v2c -c public 10.2.9.8 .1.3.6.1.4.1.3854.3.5
etc.... Looks like it's using M=10 from the command line with no options. |
Is bulkwalk more performant without options or with them? |
@eschoeller Bump! |
OK. So, running 1: 0m2.750s So, uh, is that the expected result? The default of 10 really ... sucks here .... And anything more than that is even worse. I don't even really understand what "max-repetitions field in the GETBULK PDUs. This specifies the maximum number of iterations over the repeating variables." means in general. Does this device have a "broken" snmp agent? |
I guess the idea of the bulkwalk is to cut down on the number of SNMP requests, and the number of packets in general ... and using a setting of -Cr1 essentially negates any benefits of that over just a regular snmpwalk? |
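To make the max-repetitions question concrete, here is a toy model of how a bulk walk pages through a subtree. It is only a sketch: the `mib` dictionary, the row values, and the function names are made up for illustration, it assumes non-repeaters is 0 and a single repeating variable, and it ignores tooBig and timeouts entirely. Each GETBULK asks the agent for up to max-repetitions successors of the last OID seen, so a larger value means fewer round trips but a bigger response per request.

```python
from bisect import bisect_right


def get_bulk(mib, start_oid, max_repetitions):
    """Toy GETBULK: up to max_repetitions (oid, value) pairs after start_oid."""
    oids = sorted(mib)  # tuples of ints sort in OID (lexicographic) order
    first = bisect_right(oids, start_oid)
    return [(oid, mib[oid]) for oid in oids[first:first + max_repetitions]]


def bulk_walk(mib, root, max_repetitions):
    """Walk the subtree under root; return (rows, number_of_requests)."""
    rows, requests, cursor = [], 0, root
    while True:
        requests += 1
        batch = get_bulk(mib, cursor, max_repetitions)
        in_tree = [(oid, value) for oid, value in batch if oid[:len(root)] == root]
        rows.extend(in_tree)
        # Stop when the agent runs out of data or answers past the subtree.
        if not batch or len(in_tree) < len(batch):
            break
        cursor = batch[-1][0]
    return rows, requests


if __name__ == "__main__":
    root = (1, 3, 6, 1, 4, 1, 3854, 3, 5)
    # Hypothetical agent contents: 25 rows in the subtree, one OID after it.
    mib = {root + (1, i): f"row {i}" for i in range(1, 26)}
    mib[(1, 3, 6, 1, 4, 1, 3854, 3, 6, 1)] = "next subtree"
    for size in (1, 10):
        rows, requests = bulk_walk(mib, root, size)
        print(size, len(rows), requests)  # prints "1 25 26", then "10 25 3"
```

In this model a size of 1 behaves like a plain walk (one request per row, plus one to discover the end of the subtree), while 10 covers the same 25 rows in 3 requests. Whether fewer, larger responses are actually faster depends on the agent, which is what the timings in this thread are about.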
Yeah, that's an interesting finding, but I think it may be device-specific.
On our switches, 10 goes through with no problem; when I compare 1 to 10, 1 is
slower.
But on slower devices, 10 slows things down and we have to set it to 1.
There is a trade-off, though: in larger setups, more SNMP packets can easily
kill an NNI or some firewalls if you need to traverse a firewall to get to
your devices.
Another thing: if you only set 1, have something like 30k data sources, and go
through PAT, your PAT port space can be strained.
|
Yeah, a lot of my devices are actually embedded devices, like in PDUs, UPSes, and other really odd funky things, with very rudimentary SNMP implementations. Almost no switches, and a random smattering of Linux machines. So, I have a feeling I might try tinkering with this setting across some of my device classes and see some more interesting results. |
So, if you leave off the -CrX, what is the response like, @eschoeller? The tooBig comment comes down to the fact that it's forced to use UDP and the total packet size ends up over the 1500-byte barrier; in fact, some VPNs clip the snot out of the MTU. UDP never fragments well. I'm just debating leaving that setting off entirely. |
With no -CrX the runtime is 8.4s. Same as -Cr10, which makes sense since that's the default. If I had all the time in the world I'd take a sampling of our entire infrastructure and run these scenarios on each type of device to get a better idea of what benefit (if any) increasing the value does, in terms of runtime performance. Since I'm on a 1 minute poller, I'm always chasing after polling speed increases. I've had to distribute across 5 polling machines to keep from exceeding our runtime on the main poller. I don't have a massive amount of data sources, but as I mentioned, I have lots of devices with tiny embedded ASICs which get clobbered pretty easily. Point well taken about VPNs and firewalls. I am traversing some firewalls between our polling machines and a lot of our devices. But, I'm on a high-speed campus network and not dealing with WAN latencies, thankfully. Others might find this setting useful in their environment, I'm not sure. |
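For anyone who wants to repeat this comparison across device types, a small timing harness along these lines might help. It is a sketch, not part of Cacti: `build_command` and `time_walks` are hypothetical names, the community string defaults to `public` as in the commands above, and it only records wall-clock time and the exit code of each `snmpbulkwalk -CrN` run. The `run` and `clock` parameters exist so the loop can be exercised without a real device.

```python
import subprocess
import time


def build_command(host, oid, repetitions, community="public"):
    """Build the same snmpbulkwalk invocation used in this thread."""
    return ["snmpbulkwalk", f"-Cr{repetitions}", "-v2c", "-c", community, host, oid]


def time_walks(host, oid, sizes, run=subprocess.run, clock=time.perf_counter):
    """Run one bulk walk per size; return {size: (seconds, returncode)}."""
    results = {}
    for size in sizes:
        started = clock()
        completed = run(build_command(host, oid, size), capture_output=True, text=True)
        results[size] = (clock() - started, completed.returncode)
    return results


if __name__ == "__main__":
    # Host and OID taken from the example command earlier in the thread.
    for size, (seconds, code) in time_walks("10.2.9.8", ".1.3.6.1.4.1.3854.3.5", [1, 10, 30]).items():
        print(f"-Cr{size}: {seconds:.2f}s (exit code {code})")
```

A single run per size is noisy, so repeating each size a few times and comparing medians would give a fairer picture on a busy poller.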
Here are my results and this is walking the entire device tree
|
Okay, so in summary, max_oids was designed for get requests. Please vote on one of the following proposals:
Let me know which one you two prefer. Anyone else trolling this issue should chime in as well. |
I think 1 is a good option |
I have some ideas on this I’m going to take a deeper dive on Monday
Eric.
|
@eschoeller bump! |
My bad! I got tied up, and now I'm leaving again for the weekend. |
Yes! That would be cool
Especially on networks with a mixed bag of devices, some device types may be
better off with, say, 1 OID, but currently it's a global setting for the device
default
…On Thu., Feb. 18, 2021, 23:25 Eric Schoeller, ***@***.***> wrote:
My bad! I got tied up, and now I'm leaving again for the weekend.
What I want to do is run some tests against each type of device I have.
Get an idea of what I'm working with. Then, If I see a difference - ie.
some devices work well with -Cr1 and others with -Cr10, I'd suggest we make
note of that in the tool-tip, with a suggestion on how to determine what
setting to use. Then, perhaps in the future, possibly create an "Auto
Optimize" feature in Cacti which will run a bulkwalk with various settings
against a given device and then automagically determine which settings to
use to poll the device effectively. You could then apply those
automatically optimized settings to all devices using the same device
template. But.. if I find *nothing* that runs well with Cr[10-60] then
there's no point in the auto-optimize. And, perhaps, there are other
settings which would benefit from an "Auto Optimize" type feature.
|
Okay, last call for alcohol. Someone summarize in a single sentence, and not a run-on one either, what you would like done. |
I think an option to either use bulk walk for v2 and v3 devices or skip it (essentially forcing -Cr1) is the best route. |
(sorry this is not a one-liner)
And, I apologize, I realize I didn't directly answer your question. But I am very intrigued by these findings. Take a look at the data and let me know what you guys think. My next step is to go back and review the configuration of my devices ... and then apply these "performance tuning" options and see if there's an improvement in polling times. |
So, I read this then as two answers.
Or more George Jetson style:
I'm good with either. I may have gotten some of that wrong, likely just a replacement food substance rather than mushrooms though. |
I think circling back to the original issue - is there a way we can get better logging if we hit the tooBig[errorIndex==0] error? Indicating the bulkwalk setting should be decreased? |
Post the exact error you get from the cli. |
- Issue with SNMP Bulkwalk Repetitions Size
- This enhancement adds a new column to the host table for bulk walk size us
- Detection can happen on each re-index, set manually, or detected and set one time.
- Changed Max OID's to a dropdown
- Max Repetitions is also a dropdown now
- This setting was also carried to the Automation portion of Cacti and the Add Device CLI
Please review in your labs and provide feedback. |
That looks really good Larry
…On Sun., Mar. 21, 2021, 14:26 TheWitness, ***@***.***> wrote:
Just about to drop this in...
[image: image]
<https://user-images.githubusercontent.com/1439914/111916406-299bea00-8a51-11eb-84ff-5af3cde76877.png>
Verbose Query looks like the following when Auto Detect/Set is chosen.
[image: image]
<https://user-images.githubusercontent.com/1439914/111916447-56e89800-8a51-11eb-8c38-f340696c17fa.png>
I chose to increment one at a time from: 1, 5, 10, 15, 20, 30, 40, 50, 60
instead of taking a btree approach, mainly cause I did not want to have to
think that hard.
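The stepped search described above could look roughly like the following. This is a sketch under assumptions, not the code that went into Cacti: the candidate list comes from the comment, but `pick_repetitions`, the `probe` callback, and the rule "keep the fastest size that works and stop at the first failure" are illustrative guesses at how such a detection could behave.

```python
CANDIDATES = (1, 5, 10, 15, 20, 30, 40, 50, 60)


def pick_repetitions(probe, candidates=CANDIDATES):
    """Try each size in ascending order and return the fastest working one.

    probe(size) is a caller-supplied function that runs a bulk walk with the
    given max-repetitions and returns the elapsed seconds, or None if the
    walk failed (for example with tooBig). Returns None if nothing worked.
    """
    best_size, best_time = None, None
    for size in candidates:
        elapsed = probe(size)
        if elapsed is None:
            break  # assumption: once a size fails, larger sizes fail too
        if best_time is None or elapsed < best_time:
            best_size, best_time = size, elapsed
    return best_size
```

With the timings reported earlier in this thread (about 2.75 s at -Cr1 against 8.4 s at -Cr10), a probe like this would settle on 1 for that device. A linear pass also fits the problem better than a binary search, since runtime is not guaranteed to change monotonically with the size.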
|
Wow, that's cool, you did it!! I think you just need 'Repetitions' with no apostrophe in the drop-down though;)
I would tweak this description to:
In my research on the subject I certainly saw 'chunk size' referenced, but the man page that I'm seeing says:
So I think that might be the newer nomenclature for it. I decided to go read RFC 3416 and here is the relevant section:
Some additional good info there, which I included in the tool-tip description. In regards to your question 7 days ago. I think the only message I got back in the cacti log was
Which was pretty generic and led to the wild goose chase on this. I don't think Cacti was really aware of the snmp-network-level issue of the device responding back with "C=public GetResponse(32) tooBig[errorIndex==0]". And I don't know if Cacti would really be able to capture that SNMP response and throw a specific alert in the log saying "Hey! Your bulkwalk size is too big, reduce it!!" Thanks again for looking at this! |
It can return an error, but we have to write code for that. There is a replacement SNMP module on the way, one thing at a time though. |
Can you please post the exact error message you get when you exceed the max-repetitions supported by your device? I cannot get it to break on my Linux box. |
Oh, use the network, Luke, use the network:
[root@vmhost3 cacti]# snmpbulkwalk -Cr10000000000 -v2c -c public 192.168.11.1
Error in packet.
Reason: (tooBig) Response message would have been too large.
SNMPv2-SMI::mib-2 = No Such Object available on this agent at this OID |
This will help people who set their snmpbulkwalk size too large by providing a better log message for them.
Okay, that should be fixed now too. See if you can get it to report the error. |
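As an illustration of the kind of check being discussed, the sketch below turns the CLI output shown above into a more actionable log line. It is not the change that was committed to Cacti; `diagnose_bulkwalk_output` and the message wording are hypothetical, and the only thing it relies on is that the string `tooBig` appears in the output, as it does in both the `snmpbulkwalk` error and the packet capture quoted in this issue.

```python
def diagnose_bulkwalk_output(output, repetitions):
    """Return a hint if the walk output reports tooBig, otherwise None."""
    if "tooBig" in output:
        return (
            f"SNMP bulk walk failed with tooBig at max-repetitions={repetitions}; "
            "try a smaller bulk walk size for this device"
        )
    return None


if __name__ == "__main__":
    sample = (
        "Error in packet.\n"
        "Reason: (tooBig) Response message would have been too large.\n"
    )
    print(diagnose_bulkwalk_output(sample, 60))
```

A message like this would point straight at the bulk walk size instead of the generic "no indexes" style error that started this issue.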
Lol @ -Cr10000000000 that's awesome!! The force is strong with this one! |
I'm currently running on 1.2.14. I'm experiencing a problem with the 'SNMP Bulkwalk Fetch Size' parameter. Somehow this was set to 60. I had a device which could not run any data queries, and they would all produce this error:
Through my debugging process I started looking at TCP traffic, and found these notable items:
I realized the M=60 seemed like a query size, and that was actually what the 'Maximum OIDs Per Get Request' was set to for that device. So, I changed that to 10. But that didn't fix it. So, then I stumbled upon the global 'SNMP Bulkwalk Fetch Size' which was also set at 60. I tinkered with this and found that the maximum value I could use with this device was 30. But while I was troubleshooting this, I found that setting it to 30 yielded a 35 second runtime of the verbose data query. Setting it to 10 cut that query time in half, to just 16 seconds.
So, I don't really understand where the performance gains are with this setting. It seems to fetch the same information over and over again. My script query actually executed this GetBulk operation 19 times during a verbose query. Plus, there was no evidence of a problem in the logs, and the one error message I did get about having no indexes was very misleading. This "tooBig" response should likely be caught by Cacti somewhere and alerted on. Lastly, I'm not sure it's the best idea to have this defined as a global setting. It should likely be per-host, just like 'Maximum OIDs Per Get Request' is.
But, my problem is solved for now. I have my data query working. I am going to try and make a mental note not to ever increase this Bulkwalk Fetch Size.