cmd/scollector: add a kill switch for total memory used by scollector #1866

Merged
merged 1 commit into from Aug 16, 2016

Conversation

Projects
None yet
2 participants
@gbrayut
Contributor

gbrayut commented Aug 12, 2016

This should help prevent memory leaks in CGO or WMI from causing scollector to consume too much memory.

@@ -245,17 +245,27 @@ func main() {
}
collect.MaxQueueLen = conf.MaxQueueLen
}
- maxMemMegaBytes := uint64(500)
+ maxMemMB := uint64(500)

This comment has been minimized.

@captncraig

captncraig Aug 12, 2016

Contributor

I'm not sure I love the default of having the kill switch active. May surprise some people.

@captncraig

captncraig Aug 12, 2016

Contributor

I'm not sure I love the default of having the kill switch active. May surprise some people.

This comment has been minimized.

@gbrayut

gbrayut Aug 12, 2016

Contributor

It has always been active, but only for runtime memory. This adds a second check for total memory usage.

We were using over 1GB on many systems (around 80GB total on all systems) due to a WMI leak, which could have caused other systems to fail if we hadn't noticed.

@gbrayut

gbrayut Aug 12, 2016

Contributor

It has always been active, but only for runtime memory. This adds a second check for total memory usage.

We were using over 1GB on many systems (around 80GB total on all systems) due to a WMI leak, which could have caused other systems to fail if we hadn't noticed.

cmd/scollector/main.go
+ if allocMB > maxMemMB {
+ panic(fmt.Sprintf("memory max runtime reached: (current alloc: %v megabytes, max: %v megabytes)", allocMB, maxMemMB))
+ }
+ //See proccess_windows.go and process_linux.go for total process memory usage.

This comment has been minimized.

@captncraig

captncraig Aug 12, 2016

Contributor

Is it true that this will only work if they happen to be monitoring the scollector process? Or maybe just any process monitoring enabled at all?

@captncraig

captncraig Aug 12, 2016

Contributor

Is it true that this will only work if they happen to be monitoring the scollector process? Or maybe just any process monitoring enabled at all?

This comment has been minimized.

@captncraig

captncraig Aug 12, 2016

Contributor

Have you tried this with the wmi leaky build and seen it trigger at the appropriate time?

@captncraig

captncraig Aug 12, 2016

Contributor

Have you tried this with the wmi leaky build and seen it trigger at the appropriate time?

This comment has been minimized.

@gbrayut

gbrayut Aug 12, 2016

Contributor

I tested by removing all the defer calls in the WMI package. It took 38 minutes for scollector to panic:

image

@gbrayut

gbrayut Aug 12, 2016

Contributor

I tested by removing all the defer calls in the WMI package. It took 38 minutes for scollector to panic:

image

@captncraig

This comment has been minimized.

Show comment
Hide comment
@captncraig

captncraig Aug 12, 2016

Contributor

very nice. LGTM

Contributor

captncraig commented Aug 12, 2016

very nice. LGTM

@gbrayut

This comment has been minimized.

Show comment
Hide comment
@gbrayut

gbrayut Aug 12, 2016

Contributor

This only works if the process monitoring collectors are enabled, but it matches based on the os.Getpid call, so I don't think you need to be explicitly monitoring scollector.

I'm going to manually deploy this to a few systems including ny-bosun01 (largest scollector memory usage on linux in our environment) to make sure there aren't any issues. Can then merge it on Monday.

Contributor

gbrayut commented Aug 12, 2016

This only works if the process monitoring collectors are enabled, but it matches based on the os.Getpid call, so I don't think you need to be explicitly monitoring scollector.

I'm going to manually deploy this to a few systems including ny-bosun01 (largest scollector memory usage on linux in our environment) to make sure there aren't any issues. Can then merge it on Monday.

@gbrayut

This comment has been minimized.

Show comment
Hide comment
@gbrayut

gbrayut Aug 12, 2016

Contributor

So it appears we don't log panics on Windows so I changed the error from a panic to a fatal. I also added #1867 to find a way to log panics.

The fatal error will look like this:

image

Contributor

gbrayut commented Aug 12, 2016

So it appears we don't log panics on Windows so I changed the error from a panic to a fatal. I also added #1867 to find a way to log panics.

The fatal error will look like this:

image

@gbrayut

This comment has been minimized.

Show comment
Hide comment
@gbrayut

gbrayut Aug 16, 2016

Contributor

^ was just a rebase and a missing multiplier:
image

Merging to master after travisci builds

Contributor

gbrayut commented Aug 16, 2016

^ was just a rebase and a missing multiplier:
image

Merging to master after travisci builds

@gbrayut gbrayut merged commit aad1661 into master Aug 16, 2016

1 check was pending

continuous-integration/travis-ci/pr The Travis CI build is in progress
Details

@gbrayut gbrayut deleted the scollector_mem branch Aug 16, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment