Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JUJU-1887] Introduces service grace periods for shutdown and pebble shut down signals. #184

Closed
wants to merge 1 commit into from

Conversation

tlm
Copy link
Contributor

@tlm tlm commented Jan 25, 2023

  • This commit adds a new cmd arg to pebble run that allows the user to specify the behaviour to take place when Pebble receives sigterm. The two supported values are ignore and terminate.

    This change is needed for Kubernetes and Juju to address bug lp1951415. When Kubernetes decides to remove a pod from the cluster to reschedule else where it sends a sigterm to all containers in the pod at once. Having Pebble ignore this sigterm and giving the charm a chance to run it's tear down hooks to completion before the workload under control exits is critical for the day to day operations of the charm.

  • The second change in this commit adds a new api endpoint to Pebble called shutdown. If Pebble is ignoring sigterm messages from controlling applications such as Kubernetes then we need a way to signal to Pebble to shutdown.

  • Third change is the ability to define a grace period for services and how long they receive to handle SIGTERM and shut down before being killed by Pebble.

This work is being proposed as part of the overall fix in Juju for https://bugs.launchpad.net/juju/+bug/1951415

Testing:

The easiest way to perform manual testing against this change is to create a pebble service definition and run this with pebble run --sigterm=ignore. From there you should be able to send term to the process and watch the log output indicate that it is ignoring the signal. kill -s TERM <pid>

@cjdcordeiro
Copy link
Collaborator

this is interesting.

While spec-ing Pebble for ROCKs I came across a similar dilemma.

Although your proposal does the job for the given scenario, I wonder if the following implementation would make a bit more sense and serve a more general purpose:

what about a --stop-signal option? I.e. pebble run --stop-signal SIGINT. Container runtime managers (like Docker) already have this option, which means the "stopping" of a container doesn't always mean a SIGTERM will be sent to the container application, aka Pebble

@benhoyt
Copy link
Contributor

benhoyt commented Jan 25, 2023

@cjdcordeiro Thanks for the pointer to Docker's --stop-signal (docs here). Yeah, I think that's a nice design. Just like Docker, our default is SIGTERM. From their docs:

The --stop-signal flag sets the system call signal that will be sent to the container to exit. This signal can be a signal name in the format SIG, for instance SIGKILL, or an unsigned number that matches a position in the kernel’s syscall table, for instance 9.

The default is SIGTERM if not specified.

Pebble calls this "shutdown" (for example, the "shutdown" action in the Pebble config), and the term "stop signal" in Pebble's case would more mean the signal Pebble sends to services to stop them, which is a different thing. So I suggest we call this --shutdown-signal and have it default to SIGTERM (allow names or numbers).

To disable (as Tom wants here), we could just set it to an arbitrary signal like --shutdown-signal=SIGUSR1, or perhaps better, allow --shutdown-signal=none as a special case.

client/pebble.go Outdated Show resolved Hide resolved
client/pebble.go Outdated Show resolved Hide resolved
client/pebble.go Outdated Show resolved Hide resolved
cmd/pebble/cmd_run.go Outdated Show resolved Hide resolved
cmd/pebble/cmd_run.go Outdated Show resolved Hide resolved
internal/daemon/api_pebble_test.go Outdated Show resolved Hide resolved
Copy link
Member

@hpidcock hpidcock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this.

cmd/pebble/cmd_run.go Outdated Show resolved Hide resolved
cmd/pebble/cmd_run.go Outdated Show resolved Hide resolved
cmd/pebble/cmd_run.go Outdated Show resolved Hide resolved
internal/daemon/api_pebble.go Outdated Show resolved Hide resolved
@tlm tlm force-pushed the sigterm-ignore branch 2 times, most recently from c648bcc to 2f4dace Compare January 31, 2023 23:36
@tlm tlm changed the title Introduces the ability to ignore sigterm. [JUJU-1887] Introduces the ability to ignore sigterm. Feb 1, 2023
Copy link
Contributor

@benhoyt benhoyt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the --shutdown-signals flag with the list, thanks. I've left a few comments for potential improvements/simplifications.

cmd/pebble/cmd_run.go Outdated Show resolved Hide resolved
cmd/pebble/cmd_run.go Outdated Show resolved Hide resolved
cmd/pebble/cmd_run.go Outdated Show resolved Hide resolved
cmd/pebble/cmd_run.go Outdated Show resolved Hide resolved
cmd/pebble/cmd_run.go Outdated Show resolved Hide resolved
cmd/pebble/cmd_run.go Outdated Show resolved Hide resolved
cmd/pebble/cmd_run.go Outdated Show resolved Hide resolved
cmd/pebble/cmd_run.go Outdated Show resolved Hide resolved
cmd/pebble/cmd_run.go Outdated Show resolved Hide resolved
cmd/pebble/cmd_run.go Outdated Show resolved Hide resolved
Copy link
Contributor

@benhoyt benhoyt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks: re-reviewed and added a few more comments. I think there's a few things we need to fix yet, particularly the handling of failWait and other uses of killWait.

README.md Outdated Show resolved Hide resolved
cmd/pebble/cmd_run.go Outdated Show resolved Hide resolved
cmd/pebble/cmd_run.go Outdated Show resolved Hide resolved
CreateDirs bool `long:"create-dirs"`
Hold bool `long:"hold"`
HTTP string `long:"http"`
ShutdownSignals string `long:"shutdown-signals" defaults:"INT,QUIT,TERM"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default tag isn't actually being used at the moment, because it's misspelled as defaults instead of default. Hmmm, might be tricky (might have to use os/exec to start a subprocess), but I wonder if we need a cmd_run test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about it tests. I'll dive into it and see what I can come up with.

cmd/pebble/cmd_run.go Outdated Show resolved Hide resolved
internal/daemon/api_shutdown.go Outdated Show resolved Hide resolved
internal/overlord/servstate/handlers.go Outdated Show resolved Hide resolved
internal/overlord/servstate/handlers.go Outdated Show resolved Hide resolved
internal/plan/plan.go Outdated Show resolved Hide resolved
internal/plan/plan.go Show resolved Hide resolved
@tlm tlm force-pushed the sigterm-ignore branch 2 times, most recently from 4d2a07d to 176632e Compare February 7, 2023 02:31
Copy link
Contributor

@benhoyt benhoyt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me now, thanks! Just one small comment re the short*Wait constants.

return func() {
killWait, failWait = old1, old2
defaultGracePeriod, killWaitDuration = old1, old2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, thanks. One more thing. These (at the top of servstate/manager_test.go):

	shortKillWait = 100 * time.Millisecond
	shortFailWait = 200 * time.Millisecond

Should probably be changed to shortGracePeriod and shortKillWaitDuration, and the latter could be changed to 100 * time.Millisecond to match what it was before (now that it's additive).

@tlm tlm changed the title [JUJU-1887] Introduces the ability to ignore sigterm. [JUJU-1887] Introduces service grace periods for shutdown and pebble shut down signals. Feb 7, 2023
- This commit adds a new cmd arg to pebble run that allows the user to
  specify the behaviour to take place when Pebble receives sigterm. The
  two supported values are ignore and terminate.

  This change is needed for Kubernetes and Juju to address bug
  lp1951415. When Kubernetes decides to remove a pod from the cluster to
  reschedule else where it sends a sigterm to all containers in the pod
  at once. Having Pebble ignore this sigterm and giving the charm a
  chance to run it's tear down hooks to completion before the workload
  under control exits is critical for the day to day operations of the
  charm.

- The second change in this commit adds a new api endpoint to Pebble
  called shutdown. If Pebble is ignoring sigterm messages from
  controlling applications such as Kubernetes then we need a way to
  signal to Pebble to shutdown.

- Third change is the ability to define a grace period for services and
  how long they receive to handle SIGTERM and shut down before being
  killed by Pebble
@tlm
Copy link
Contributor Author

tlm commented Feb 8, 2023

Closing in favor of #190

@tlm tlm closed this Feb 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants