
Possible goroutine leak #153

Closed
andrew-edgar opened this issue Jul 30, 2019 · 7 comments

@andrew-edgar
Contributor

Thanks for submitting an issue to routing-release. We are always trying to improve! To help us, please fill out the following template.

Issue

We are seeing the gorouter.numGoRoutines metric grow over time, to as high as 400K goroutines. We need to know whether this is an issue and what is getting "stuck".

Context

Using routing-release 0.188.0.

Steps to Reproduce

Unsure.

Expected result

A steady-state number of goroutines that does not grow linearly over time.

Current result

The goroutine count keeps growing. Nothing seems to be an issue yet, but it could become one depending on how long the gorouter has been running.

Possible Fix

[screenshot attached]

I will see how I can attach a stack dump performed on one of the gorouters.

@cf-gitbot

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/167596191

The labels on this github issue will be updated when the story is started.

@andrew-edgar
Contributor Author

I have attached a dump from one of the servers (curl http://127.0.0.1:17002/debug/pprof/goroutine?debug=2 > /tmp/dump.out)

This is from one of our production environments where the gorouter has been running for 22 days. It is not our largest environment; in an even larger one we see 400K goroutines in the metric. Even so, this dump shows over 50K, which is bad enough. Perhaps you can see what is "stuck", or point us to how this could be happening and whether it is normal.

goroutinedump.zip
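
In case it helps with triage, here is a minimal sketch of how a dump like this can be summarized by grouping the "goroutine N [state, T minutes]:" header lines by their blocking state. The endpoint matches the curl command above; the program itself is only illustrative and not part of gorouter:

package main

import (
	"bufio"
	"fmt"
	"net/http"
	"regexp"
	"sort"
)

func main() {
	// Same pprof endpoint as the curl command above; adjust host/port as needed.
	resp, err := http.Get("http://127.0.0.1:17002/debug/pprof/goroutine?debug=2")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Stack headers look like: goroutine 57656398 [chan send, 29506 minutes]:
	// Capture just the state ("chan send", "IO wait", ...) before any comma.
	header := regexp.MustCompile(`^goroutine \d+ \[([^,\]]+)`)
	counts := map[string]int{}

	scanner := bufio.NewScanner(resp.Body)
	scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	for scanner.Scan() {
		if m := header.FindStringSubmatch(scanner.Text()); m != nil {
			counts[m[1]]++
		}
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}

	// Print the states sorted by how many goroutines are parked in each.
	states := make([]string, 0, len(counts))
	for s := range counts {
		states = append(states, s)
	}
	sort.Slice(states, func(i, j int) bool { return counts[states[i]] > counts[states[j]] })
	for _, s := range states {
		fmt.Printf("%6d  %s\n", counts[s], s)
	}
}

Whichever state dominates the output is usually where to start looking for the leak.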

@KauzClay
Contributor

KauzClay commented Aug 19, 2019

We looked at the dump and saw tons of this block:

goroutine 57656398 [chan send, 29506 minutes]:
code.cloudfoundry.org/gorouter/proxy/handler.(*Forwarder).ForwardIO.func2(0xb904e0, 0xc01829d1e0, 0xc023967880, 0xc016613640, 0xc028626240)
	/var/vcap/data/compile/gorouter/src/code.cloudfoundry.org/gorouter/proxy/handler/forwarder.go:42 +0xcd
created by code.cloudfoundry.org/gorouter/proxy/handler.(*Forwarder).ForwardIO
	/var/vcap/data/compile/gorouter/src/code.cloudfoundry.org/gorouter/proxy/handler/forwarder.go:40 +0x1f7

This function is used in websockets. Do you expect many long lived websocket connections on your platform?

@KauzClay and @tcdowney, CF Networking Program Members

@andrew-edgar
Contributor Author

Yes, in this environment we do have many websockets, but the goroutine count is much higher than the number of established websocket connections. I can look at the metrics and give you an idea of how many we have; perhaps there is an issue with cleaning these up when the socket closes.

@tcdowney
Member

@andrew-edgar

We believe we've narrowed down the issue. We were able to reproduce this behavior by pushing a slow websocket server such that the Gorouter would time out the request with a BackendReadTimeout (502).

The bug is here:

	go func() {
		resp, err = http.ReadResponse(bufio.NewReader(teedReader), nil)
		headerWasRead <- struct{}{}
	}()

	select {
	case <-headerWasRead:
		if err != nil {
			return 0
		}
	case <-time.After(f.BackendReadTimeout):
		f.Logger.Error("websocket-forwardio", zap.Error(errors.New("timeout waiting for http response from backend")))
		return 0
	}

When that timeout occurs, nothing ever receives from the headerWasRead channel, so the spawned goroutine blocks forever on its send. Those leaked goroutines end up looking like the ones in your dump:

goroutine 351 [chan send, 64 minutes]:
code.cloudfoundry.org/gorouter/proxy/handler.(*Forwarder).ForwardIO.func2(0xb94a60, 0xc00032d560, 0xc000010810, 0xc000394c60, 0xc0003a1620)
        /Users/pivotal/workspace/routing-release/src/code.cloudfoundry.org/gorouter/proxy/handler/forwarder.go:42 +0xcd
created by code.cloudfoundry.org/gorouter/proxy/handler.(*Forwarder).ForwardIO
        /Users/pivotal/workspace/routing-release/src/code.cloudfoundry.org/gorouter/proxy/handler/forwarder.go:40 +0x1f7

We're working on a fix since we definitely shouldn't be leaking goroutines like this, but I'm curious to know if you're seeing any performance implications from this in your environments? I'd imagine higher memory usage... are you also seeing higher CPU usage on your gorouters?

Thanks,
@tcdowney and @tstannard, CF Networking Program Members

@andrew-edgar
Contributor Author

We didn't see a lot of problems, but we saw this in the metrics, and in theory it could become a problem in the future. It also makes diagnosing other issues harder, since debugging with so many goroutines in a dump is difficult. In practice the gorouter is working well, and this has very little impact beyond some extra memory usage.

@tcdowney
Member

Good to know! 👍

We pushed a commit that should resolve the goroutine leak and are currently working on refactoring this code a bit in general.

cloudfoundry/gorouter@a165083
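
For readers following along, the general shape of a fix for this kind of leak is to give the reader goroutine a non-blocking way to finish its send. A minimal sketch, reusing the names from the snippet quoted earlier (see the linked commit for the actual change, which may differ):

	// Buffered so the send below never blocks, even if the select has
	// already taken the timeout branch and nobody will ever receive.
	headerWasRead := make(chan struct{}, 1)

	go func() {
		resp, err = http.ReadResponse(bufio.NewReader(teedReader), nil)
		headerWasRead <- struct{}{} // completes immediately; goroutine can exit
	}()

	select {
	case <-headerWasRead:
		if err != nil {
			return 0
		}
	case <-time.After(f.BackendReadTimeout):
		f.Logger.Error("websocket-forwardio", zap.Error(errors.New("timeout waiting for http response from backend")))
		return 0
	}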
