proxy: add handling for subprocess cleanup, retries #311

Closed
wants to merge 9 commits

Conversation

@skabbes (Contributor) commented on Sep 6, 2017

As a result of #308, I spent quite a long time debugging both up and my application code.

In the end there were errors in both places; on the up side, I tried to fix them and spare anyone else this debugging session in the future.

  1. A network error was assumed to be a process crash, which is not always the case (in my case this was causing the Lambda container to run out of file descriptors for processes).
  2. A "bad route" would back off and retry forever.
  3. listen_timeout now guards against nonsensical values (longer than 30 seconds, which is the API Gateway integration limit).
  4. Removed the unused function basic in internal/proxy/request.go.

It seems that most of these have already been identified as issues or TODOs, so there is likely nothing surprising here. I know I didn't follow the usual "discuss features before implementing them" process, so feel free to throw all this code away if you like.

Implementation Strategy
When the child HTTP server encounters an error, we asynchronously ensure it is cleanly shut down by sending a polite SIGINT, followed by a SIGKILL after 10 seconds of non-response. This shutdown runs in parallel with setting up a new child server (on a separate goroutine) so the proxy can resume serving traffic as quickly as possible. However, to prevent slow shutdowns from causing resource exhaustion as in #308, the cleanup channel is bounded at N=3 and blocks beyond that. I also tried to add enough logging to help future developers debug and understand this behavior.
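
A rough sketch of that restart path, with illustrative names (the Proxy fields, the restart method, and the placeholder child command are assumptions, not the PR's actual code):

package relay

import "os/exec"

// Illustrative sketch of the restart path described above; field and method
// names are assumptions, not necessarily those used in the PR.
type Proxy struct {
	cmd       *exec.Cmd
	abandoned chan *exec.Cmd // created as make(chan *exec.Cmd, 3)
}

// restart is called from the request path when the child looks unhealthy.
func (p *Proxy) restart() error {
	// Hand the failing child to the cleanup goroutine. Because the channel is
	// buffered with capacity 3, this send blocks once three shutdowns are
	// already pending, providing the back-pressure that prevents the file
	// descriptor exhaustion seen in #308.
	p.abandoned <- p.cmd

	// Start a fresh child in parallel with the shutdown so the proxy can
	// serve traffic again as soon as possible.
	cmd := exec.Command("node", "app.js") // placeholder child command
	if err := cmd.Start(); err != nil {
		return err
	}
	p.cmd = cmd
	return nil
}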

Known Bugs
none

Notable Omissions

  • Even though process shutdown is async from the HTTP serving code, it is still serialized (failing processes aren't cleaned up concurrently with each other).
  • The SIGKILL delay is not configurable (naming suggestion: shutdown_timeout?). This would be particularly helpful in tests, since I have added a 10-second test :(
  • A failed child process cannot be recognized until another request is made.
  • Clean shutdown of relay.Proxy (and its cleanup goroutine) is not implemented.
  • Concurrent execution: the whole proxy is serialized on the ServeHTTP entry point, since Lambda operates this way anyway. I don't see a strong reason to layer in this complexity until a FaaS adds actual concurrent execution.

@skabbes (Contributor, Author) left a comment:
Adding some self comments to try to direct the review.

config/relay.go Outdated
@@ -30,6 +31,11 @@ func (r *Relay) Default() error {
		r.ListenTimeout = 15
	}

	if r.ListenTimeout >= 30 {
		err := fmt.Errorf("listen_timeout should be less than 30s (APIGateway integration limit)")
		return errors.Wrap(err, ".listen_timeout")
skabbes (Contributor, Author):
Not sure the vision here, but it did feel a little dirty to assume APIGateway in this part of the code.

tj (Member):

We could move that to a Validate() method in ./config/lambda.go. I suppose we should give it a little breathing room to ensure the error surfaces; API GW-level timeouts are a bit more annoying to troubleshoot.

I didn't realize API GW's limit was so low; it would definitely be nice to cap lambda.timeout to that as well.

skabbes (Contributor, Author):
Hmm, I couldn't really find a great way to accomplish that without an import cycle. I went ahead and made Relay aware of what platform it is deployed on, so that it can validate itself based on that.
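
Roughly, that kind of platform-aware validation might look like the sketch below; the field names and the "lambda" platform string are assumptions for illustration, not the PR's actual structure:

package config

import (
	"fmt"

	"github.com/pkg/errors"
)

// Relay configuration, trimmed to the fields relevant here.
type Relay struct {
	ListenTimeout int `json:"listen_timeout"`
	platform      string
}

// Validate checks limits that depend on where the relay is deployed.
func (r *Relay) Validate() error {
	// Only Lambda-backed deployments are subject to the API Gateway 30s
	// integration limit, so the check is conditional on the platform.
	if r.platform == "lambda" && r.ListenTimeout >= 30 {
		err := fmt.Errorf("listen_timeout should be less than 30s (APIGateway integration limit)")
		return errors.Wrap(err, ".listen_timeout")
	}
	return nil
}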

// TODO: add timeout
// TODO: scope all plugin logs to their plugin name
// TODO: utilize BufferPool
// TODO: if the first Start() fails then bail
skabbes (Contributor, Author):
This one was already done as far as I could tell. relay.New implements this.

	}

	if err := p.Start(); err != nil {
		return nil, err
	}

	// Launch a goroutine for cleaning up the old commands as they are abandoned
	go p.cleanupAbandoned()
skabbes (Contributor, Author):
To stay lightweight, we could also launch this on demand.
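
A sketch of what launching it on demand could look like; sync.Once and the abandon helper here are illustrative, not part of the PR:

package relay

import (
	"os/exec"
	"sync"
)

// Illustrative: start the cleanup goroutine lazily, only when the first
// command is actually abandoned, instead of at construction time.
type Proxy struct {
	abandoned   chan *exec.Cmd
	cleanupOnce sync.Once
}

func (p *Proxy) abandon(cmd *exec.Cmd) {
	// sync.Once makes the lazy launch safe even if abandon were ever called
	// from multiple goroutines.
	p.cleanupOnce.Do(func() { go p.cleanupAbandoned() })
	p.abandoned <- cmd
}

func (p *Proxy) cleanupAbandoned() {
	for cmd := range p.abandoned {
		_ = cmd // shut the command down here (SIGINT, then SIGKILL on timeout)
	}
}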

}

p.port = port
p.target = target
skabbes (Contributor, Author):
I was trying to keep the property that if we return an error, p is not mutated.

tj (Member) commented on Sep 6, 2017:
SGTM!

skabbes (Contributor, Author):
tightened this up even more.

skabbes (Contributor, Author):
just kidding, this caused a pretty bad bug - reverting.

	case <-time.After(10 * time.Second):
		ctx.Warnf("proxy (pid=%d) sending SIGKILL", cmd.Process.Pid)
		cmd.Process.Kill()
		<-waitDone
skabbes (Contributor, Author):
No matter what, we have to wait for Wait to return; otherwise we don't know whether the resources were actually released.
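
As a standalone sketch of the sequence (it mirrors the waitDone pattern in the diff above, but the function itself is illustrative):

package relay

import (
	"os/exec"
	"syscall"
	"time"
)

// shutdown asks the child to exit with SIGINT, escalates to SIGKILL after
// 10 seconds of non-response, and in every branch waits for Wait to return,
// since only that guarantees the process's resources have been released.
func shutdown(cmd *exec.Cmd) error {
	if err := cmd.Process.Signal(syscall.SIGINT); err != nil {
		return err
	}

	waitDone := make(chan error, 1)
	go func() { waitDone <- cmd.Wait() }()

	select {
	case err := <-waitDone:
		return err // exited cleanly in time
	case <-time.After(10 * time.Second):
		cmd.Process.Kill() // SIGKILL
		return <-waitDone  // still wait for the resources to be released
	}
}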

t.Run("GET simple", func(t *testing.T) {
// newRequestBody is a helper to fetch the body from a child process route
newRequestBody := func(t *testing.T, method, path string, body io.Reader) string {
t.Helper()
skabbes (Contributor, Author):
Not sure if you're aware, but if you use this and have failing tests, t.Helper() tells Go to skip this stack frame when reporting error line numbers and to use the parent stack frame instead. VERY helpful for knowing which tests actually failed.

tj (Member):
nice! yea I had read about it in the release notes but I haven't tried it
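
For anyone else reading along, a tiny self-contained example of the pattern; the handler and helper here are made up for illustration:

package relay_test

import (
	"net/http"
	"net/http/httptest"
	"testing"
)

// assertStatus is a test helper. The t.Helper() call makes a failure be
// reported at the caller's line, not at the t.Fatalf line inside the helper.
func assertStatus(t *testing.T, h http.Handler, path string, want int) {
	t.Helper()
	r := httptest.NewRequest("GET", path, nil)
	w := httptest.NewRecorder()
	h.ServeHTTP(w, r)
	if w.Code != want {
		t.Fatalf("GET %s: got status %d, want %d", path, w.Code, want)
	}
}

func TestHelperReporting(t *testing.T) {
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// If this assertion fails, the error points at this line thanks to t.Helper().
	assertStatus(t, ok, "/", http.StatusOK)
}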

t.Run("child process cleanup", func(t *testing.T) {

// In this test, we test that a child process who stops accepting network connections
// (but hasn't crashed) is gracefully asked to be shut down
skabbes (Contributor, Author):
This is effectively the case I was running into in prod: up's proxy thought my process had failed (because of a network error), when in actuality it hadn't. The process therefore stayed around forever and continued to use resources on the Lambda container.

})

// In this test, we test that a child process who swallows the "nice" shutdown signal
// will eventually be sent a SIGKILL and shut down
skabbes (Contributor, Author):
I think this is a nice thing up can do if your clean shutdown isn't working. I think all Lambda processes need to be kill -9 safe anyway.


pair := strings.SplitN(string(b), ":", 2)
return pair[0], pair[1], nil
}
skabbes (Contributor, Author):
unused?


let server;

const routes = {};
skabbes (Contributor, Author):
refactored this without adding deps, it was getting complex.

added:

  • /env to fetch a key from process.env
  • /close to stop accepting connections without killing the server
  • /pid to fetch the pid
  • /swallowSignals to install signal handlers

tj (Member) commented on Sep 6, 2017

woah awesome! Should be able to get to this today

@tj (Member) left a comment:

cool LGTM! thanks for digging into this, I don't see any issues, I'll pull it down and test it out.

the additional timeout option sounds good to me too


b := p.config.Proxy.Backoff.Backoff()

retries := 0
tj (Member):
I think b has b.Attempts we could utilize for this

skabbes (Contributor, Author):
I also defaulted maxRetries to be config.Proxy.Backoff.Attempts.
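
For illustration, the capped-retry shape being discussed might look roughly like this; the backoff type and its fields here are stand-ins, not the actual backoff package or config used by up:

package relay

import "time"

// backoff is an illustrative stand-in for the configured backoff values.
type backoff struct {
	Min, Max time.Duration
	Attempts int
}

// duration returns an exponential delay for the given attempt, capped at Max.
func (b backoff) duration(attempt int) time.Duration {
	d := b.Min << uint(attempt)
	if d > b.Max || d <= 0 {
		d = b.Max
	}
	return d
}

// startWithRetries keeps trying to (re)start the child, but gives up after
// b.Attempts retries instead of backing off and retrying forever on a bad route.
func startWithRetries(b backoff, start func() error) error {
	var err error
	for retries := 0; retries <= b.Attempts; retries++ {
		if err = start(); err == nil {
			return nil
		}
		time.Sleep(b.duration(retries))
	}
	return err
}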


@skabbes (Contributor, Author) left a comment:
Ok, I think this is ready for re-review.

tj (Member) commented on Sep 10, 2017

did you have any issues running the tests from root? oddly they work every time in http/relay, but fail every time from root haha checking it out

tj (Member) commented on Sep 10, 2017

nvm, seems to be that gt program, go test works fine

tj (Member) commented on Sep 10, 2017

awesome 💥 , big thanks man!

tj closed this in badf351 on Sep 10, 2017