Cache VDEV Enumeration and Small Suggestions #1

Closed
sempervictus opened this issue Sep 12, 2016 · 10 comments

sempervictus commented Sep 12, 2016

Thank you for this zenpack - it's a lifesaver in our environment. At the latest version (0.7.0), cache drive vdev enumeration fails, and I've had to comment it out (https://github.com/daviswr/ZenPacks.daviswr.ZFS/blob/master/ZenPacks/daviswr/ZFS/modeler/plugins/daviswr/cmd/ZPool.py#L153). I'll spin up a lab system to replicate the error, but it was along the lines of "NoneType has no member named 'dev'".

Separately, I've added local thresholds for pool capacity notification - it may be useful to have them in the zenpack. A pool at 90% is something to be concerned about (especially with automated snapshots or heavy use). It would also be very useful to have a configuration option to disable enumeration of snapshots. Some of our systems have thousands of snapshots across datasets, and it gets painful pretty quickly (we're only monitoring pools for now anyway, but DS usage and ZVOL IO would be nice).

daviswr self-assigned this Sep 13, 2016

daviswr commented Sep 13, 2016

Hey, thanks! I'm glad you're getting some use out of it; I'm surprised anyone noticed this repo so quickly. 0.7.0 as a version was a bit arbitrary, based on guidelines from the Managing ZenPacks document (http://zenpacklib.zenoss.com/en/1.0.3/managing-zenpacks.html), and it's the "seems to work with my one box" release.

I think the cache device enumeration problem in the ZPool modeler was from trying to pull the device name from a non-matching regex and should be fixed now (https://github.com/daviswr/ZenPacks.daviswr.ZFS/blob/master/ZenPacks/daviswr/ZFS/modeler/plugins/daviswr/cmd/ZPool.py#L155). Mind grabbing the 0.7.1 egg and giving it a shot?
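
For reference, the guard really just amounts to checking the regex result before using it. Here's a rough standalone sketch, with an illustrative pattern and line format rather than the zenpack's actual code:

```python
import re

# Illustrative pattern; the modeler's real regex may differ.
DEV_RE = re.compile(r'^\s+(?P<dev>\S+)\s+(?P<state>\S+)')

def parse_cache_devs(zpool_status_lines):
    """Return device names listed under the 'cache' heading of `zpool status`."""
    devs = []
    in_cache = False
    for line in zpool_status_lines:
        if line.strip() == 'cache':
            in_cache = True
            continue
        if in_cache:
            match = DEV_RE.match(line)
            if match is None:
                # Blank lines, 'errors:', or the next section heading don't
                # match; end the cache section instead of calling
                # match.group('dev') on None.
                in_cache = False
                continue
            devs.append(match.group('dev'))
    return devs

sample = [
    '\tcache',
    '\t  sdc      ONLINE       0     0     0',
    'errors: No known data errors',
]
print parse_cache_devs(sample)  # ['sdc']
```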

I really like your ideas and will add thresholds and "ignore" zProperties in the future. Could you elaborate on what you mean by DS usage?

The zfs command doesn't have a parameter like iostat; is there a way to monitor IO on a dataset? At first glance, Linux doesn't appear to have anything quite like fsstat on Solaris (thanks, random thread: https://serverfault.com/questions/278652/getting-zfs-per-dataset-io-statistics-or-nfs-per-export-io-statistics).

Any other feedback or input you've got would be appreciated, too!

@sempervictus

Thanks for getting back so quickly. DS usage means dataset usage. As far as I/O tracing goes, see the netlink iostat PR in ZoL's GitHub (I'll link it when I'm not replying by email from a phone)... In proper ZFS, there's no I/O tracing outside of the pool stats and zvols, AFAIK.


daviswr commented Sep 14, 2016

I've added zProperties for ignoring datasets based on name/type and pool names. There's a section about it in the readme, but the types are pretty straightforward. Also included 80% (sev 3) and 90% (sev 4) capacity thresholds in the ZPool performance template.
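
Roughly, the ignore logic just matches dataset names and types against lists. A standalone sketch of the idea, with made-up property names and patterns that aren't necessarily the zenpack's actual zProperty names:

```python
import re

# Hypothetical ignore-list values, e.g. set via zProperties on a device class.
ignore_name_patterns = [r'.*@zfs-auto-snap.*', r'rpool/swap']
ignore_types = ['snapshot']

def keep_dataset(name, ds_type):
    """Return False for datasets that shouldn't be modeled."""
    if ds_type in ignore_types:
        return False
    return not any(re.match(p, name) for p in ignore_name_patterns)

# Example: (name, type) pairs as reported by `zfs list -H -o name,type`
datasets = [
    ('tank/vm01', 'volume'),
    ('tank/home', 'filesystem'),
    ('tank/home@zfs-auto-snap_hourly-2016-09-14', 'snapshot'),
]
modeled = [d for d in datasets if keep_dataset(*d)]  # drops the snapshot
```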

Mind seeing if 0.7.2 works for you?

@sempervictus

Seems to work, but you probably want to drop the debug output ("ignoring X") so as not to fill log files with thousands of entries :). I also made 90% capacity a critical-level event.


daviswr commented Sep 14, 2016

Ah, yeah, log.info() probably would generate a lot of noise with the number of snaps you've got. I'll crank that down to debug.
I was on the fence about the severity levels; maybe I can make that more configurable via zProperties.

@sempervictus

This looks a lot better; now we're down to warnings being generated about command execution timeouts:

Datasource ZFSDataset/zfs-get command timed out

I'm seeing the same for zpool-get and zpool-iostat on a system with tens of thousands of snaps (a zfs send target for multiple SAN systems).
I've bumped zCommandTimeout to 60s and timed the execution manually; it doesn't usually go above 20s (lots of snaps, sometimes requiring lots of on-disk metadata lookups). Does the zenpack respect that property? Maybe make the timeout adjustable for these (zfs-get, zpool-get, zpool-iostat) if it doesn't?


daviswr commented Sep 14, 2016

AFAIK, the zencommand daemon handles the SSH connections and datasource command executions, and returns the output to the datasources' parsers.

Right now, the zfs-get, zpool-get, and zpool-iostat datasources execute every minute. The -gets could probably be fine running every 5 minutes, but since the zpool-iostat datasource is a point-in-time gauge, it's probably not useful at a 5-minute interval. "zpool iostat" is probably a little less intensive than "zfs get all", too.

Might be worth checking to see if the zSshConcurrentSessions value is less than what MaxSessions is configured to on your server, assuming it's running OpenSSH. Default's 10 for both.

I also understand that there's a bug in Zenoss 5.1.5, to be fixed in 5.1.7, where zencommand just keeps adding to its queue if an SSH session times out. I'm still on 4.2.5.

All the datasource commands are the same no matter the component, so zencommand should just be running each once, but that also means the zfs.get parser is spending time processing output for ignored DSes. It might be better if the zfs-get datasource passes the ZFSDataset component title to zfs, so it runs once per component. That's probably a little more SSH traffic, but it might finish faster if it's not asking for stats on snaps you're not modeling. I'll see if I can make that change soon.
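
To put the parser overhead in perspective, here's a rough standalone sketch, not the zenpack's actual parser, of dropping `zfs get -H all`-style output for datasets that aren't modeled before doing any per-property parsing:

```python
def filter_zfs_get_output(output, modeled_datasets):
    """Yield parsed fields only for datasets we model.

    With -H, `zfs get` emits tab-separated lines whose fields are
    dataset, property, value, and source, so the dataset name is
    always the first field.
    """
    wanted = set(modeled_datasets)
    for line in output.splitlines():
        fields = line.split('\t')
        if len(fields) == 4 and fields[0] in wanted:
            yield fields

# With tens of thousands of snapshots, most lines are discarded here rather
# than being parsed into datapoints for components that were never modeled.
sample = 'tank\tused\t1234567\t-\ntank/home@snap1\tused\t0\t-'
for dataset, prop, value, source in filter_zfs_get_output(sample, ['tank']):
    print dataset, prop, value
```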

@sempervictus

This is all 4.2.5, so no feedback on the 5.1 stuff.
The changes to allow zProperties to set the thresholds are raising:

zenhub|User-supplied Python expression (device.zZPoolThresholdCritical) for maximum value caused error: ['zpool-get_capacity']

for all three properties, registering as warnings on the Zenoss host (not the ZFS host).


daviswr commented Sep 16, 2016

Yeah, I got a few of the same overnight. Small number, though, not from every collection cycle. Odd.

Digging around in zendmd, I found that the device's zProperties are inherited by the component, so I've changed the ZFSStoragePool template's thresholds to here.zZPoolThreshold.... I'm going to let that run for a while before I push the changed YAML.
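
If anyone wants to double-check the acquisition behavior, something along these lines in zendmd shows a pool component picking up the device's zProperty. The device id is a placeholder and the meta_type check is an assumption about the component class name:

```python
# Run inside zendmd on the Zenoss master.
dev = find('zfs-host.example.com')          # placeholder device id
for comp in dev.getDeviceComponents():
    # Assumes the pool components' meta_type matches the class name.
    if comp.meta_type == 'ZFSStoragePool':
        # zProperties resolve by acquisition: component -> device -> device class.
        print comp.id, comp.zZPoolThresholdCritical
```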


daviswr commented Oct 8, 2016

If the performance templates are working for you, I'd like to close this issue. Or are you still getting timeouts?

As for ZVol I/O, would you mind opening a separate issue for it? I don't honestly know if it's something I'll be able to implement, but it might not be a bad idea to have it in its own thread.

daviswr closed this as completed Dec 22, 2016