Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop gracefully on SIGTERM #21

Closed
drolando opened this issue Mar 5, 2018 · 4 comments

Comments

Projects
None yet
2 participants
@drolando
Copy link
Member

commented Mar 5, 2018

When an instance fails the healthcheck, paasta sends a SIGTERM, waits a bit and then kills the process if it hasn't stopped yet.

Right now we don't catch SIGTERM, so the process dies as soon as it receives it. All in-flight requests are lost and clients see this as a 503.

@drolando drolando added the bug label Mar 5, 2018

@ArnaudBrousseau

This comment has been minimized.

Copy link
Contributor

commented Mar 6, 2018

@drolando: nice find 🥇

That explains why we're seeing spurious 503 spikes. graceful_mesos_shutdown is not the root cause as I initially thought, it's just what exposes this bug.

@ArnaudBrousseau ArnaudBrousseau self-assigned this Mar 6, 2018

@ArnaudBrousseau

This comment has been minimized.

Copy link
Contributor

commented Mar 6, 2018

So far I haven't able to solve this problem, but here's a repro:

  1. make dev
  2. curl -o /dev/null -iv -H 'X-Smartstack-Destination: yelp-main.internalapi' -H 'X-Smartstack-Source: spectre.main' -H 'X-Ctx-Tarpit: yelp-main.internalapi' 'localhost:32927/status'. This triggers a slow request to internalapi (timeout of 15s.)
  3. (in a different session): docker ps | grep abrousse | cut -d' ' -f1 | xargs docker stop --time 30. This shuts down the Casper container by sending SIGTERM first, then SIGKILL after 30s.

Expected:

The request to /status finishes then nginx/docker shuts down gracefully

Actual:

Connection is aborted and we get "curl: (52) Empty reply from server", docker shuts down at the same time


Some things that might be useful to note here

  • process hierarchy:
dumb-init start.sh
|___start.sh
    |___nginx (master)
        |___nginx (worker)
  • From dumb-init's docs: "In its default mode, dumb-init establishes a session rooted at the child, and sends signals to the entire process group. This is useful if you have a poorly-behaving child (such as a shell script) which won't normally signal its children before dying."
  • From nginx's docs: SIGTERM causes a fast shutdown, and SIGQUIT causes a graceful shutdown (that's what we want)
  • To check that SIGQUIT was the right thing I hoped onto the container, fired up a slow request to /status and ran /usr/local/openresty/nginx/sbin/nginx -s quit. The slow request finished, then nginx exited and the container shut down.
@drolando

This comment has been minimized.

Copy link
Member Author

commented Mar 6, 2018

We can use dumb-init to change the SIGTERM in a SIGQUIT

dumb-init --rewrite 15:3

@ArnaudBrousseau

This comment has been minimized.

Copy link
Contributor

commented Mar 6, 2018

We can use dumb-init to change the SIGTERM in a SIGQUIT

Nice. Just verified, it totally works (CMD ["dumb-init", "--rewrite", "15:3", "/code/start.sh"])

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.