proxy: add handling for subprocess cleanup, retries #311
Conversation
Adding some self comments to try to direct the review.
config/relay.go
Outdated
@@ -30,6 +31,11 @@ func (r *Relay) Default() error {
		r.ListenTimeout = 15
	}

	if r.ListenTimeout >= 30 {
		err := fmt.Errorf("listen_timeout should be less than 30s (APIGateway integration limit)")
		return errors.Wrap(err, ".listen_timeout")
Not sure of the vision here, but it did feel a little dirty to assume APIGateway in this part of the code.
We could move that to a Validate() method in ./config/lambda.go, I suppose. We should give it a little breathing room to ensure the error surfaces; API GW-level timeouts are a bit more annoying to troubleshoot.

I didn't realize API GW's limit was so low. It would definitely be nice to cap lambda.timeout to that as well.
Hmm, I couldn't really find a great way to accomplish that without an import cycle. I went ahead and made Relay aware of which platform it is deployed on, so that it can validate itself based on that.
// TODO: add timeout
// TODO: scope all plugin logs to their plugin name
// TODO: utilize BufferPool
// TODO: if the first Start() fails then bail
This one was already done as far as I could tell; relay.New implements this.
}

if err := p.Start(); err != nil {
	return nil, err
}

// Launch a goroutine for cleaning up the old commands as they are abandoned
go p.cleanupAbandoned()
To stay lightweight, we could also launch this on demand.
}

p.port = port
p.target = target
I was trying to keep the property that if we return an error, p is not mutated.
SGTM!
tightened this up even more.
just kidding, this caused a pretty bad bug - reverting.
http/relay/relay.go
Outdated
case <-time.After(10 * time.Second):
	ctx.Warnf("proxy (pid=%d) sending SIGKILL", cmd.Process.Pid)
	cmd.Process.Kill()
	<-waitDone
No matter what, we have to wait for the Wait to return; otherwise we don't know if the resources were actually released.
t.Run("GET simple", func(t *testing.T) {
	// newRequestBody is a helper to fetch the body from a child process route
	newRequestBody := func(t *testing.T, method, path string, body io.Reader) string {
		t.Helper()
Not sure if you're aware, but if you use this and have failing tests, t.Helper() tells Go to skip this stack frame when reporting error line numbers and use the parent stack frame instead. VERY helpful for knowing which tests actually failed.
nice! yeah, I had read about it in the release notes but I haven't tried it
t.Run("child process cleanup", func(t *testing.T) {

	// In this test, we test that a child process that stops accepting network connections
	// (but hasn't crashed) is gracefully asked to shut down
This is effectively the test case that I was running into in prod: up-proxy thought my process failed (because of a network error), when in actuality it hadn't. Therefore the process stayed around forever and continued to use resources on the Lambda container.
})

// In this test, we test that a child process that swallows the "nice" shutdown signal
// will eventually be sent a SIGKILL and shut down
I think this is a nice thing up can do if your clean shutdown isn't working. I think all Lambda processes need to be kill -9 safe anyway.
pair := strings.SplitN(string(b), ":", 2)
return pair[0], pair[1], nil
}
unused?
let server;

const routes = {};
refactored this without adding deps, it was getting complex.

added:
- /env to fetch a key from process.env
- /close to stop accepting connections without killing the server
- /pid to fetch the pid
- /swallowSignals to install signal handlers
woah awesome! Should be able to get to this today
cool LGTM! thanks for digging into this, I don't see any issues, I'll pull it down and test it out.
the additional timeout option sounds good to me too
b := p.config.Proxy.Backoff.Backoff()

retries := 0
I think b has b.Attempts, which we could utilize for this.
I also defaulted maxRetries to config.Proxy.Backoff.Attempts.
Ok, I think this is ready for re-review.
did you have any issues running the tests from root? oddly they work every time in http/relay, but fail every time from root haha checking it out
nvm, seems to be that
awesome 💥, big thanks man!
As a result of #308, I spent quite a long time debugging up plus my application code.
In the end there were errors in both places; for up, I tried to fix them and prevent this debugging session for anyone in the future.
- listen_timeout can guard against nonsense usage (longer than 30 seconds, when 30 seconds is the APIGateway limit).
- basic in internal/proxy/request.go

It seems that most of these have already been identified as issues or TODOs, so likely nothing surprising here. I know that I didn't follow the normal "discuss features before implementing them" process, so feel free to throw all this code away if you like.
Implementation Strategy
When the child http server encounters an error, asynchronously ensure it is cleanly shut down by sending a nice SIGINT followed by a SIGKILL after 10 seconds of non-response. This shutdown process is done in parallel with setting up a new child server (on a separate goroutine) in order to allow the proxy to serve traffic as fast as possible. However, in order to prevent slow shutdowns from causing resource exhaustion like in #308, we make the channel blocking with N=3. In addition, I tried to add sufficient logging to help future developers debug and understand this.
Known Bugs
none
Notable Omissions