Implement drain scripts for Postgres to properly use the fast shutdown mode #2320

mvach · 2021-09-08T13:16:56Z

What is this change about?

Here we propose to use SIGINT in order to shut down the Postgres daemon with the Fast Shutdown mode. See the Shutting Down the Server documentation chapter for more details.

This Fast Shutdown mode is implemented here in the drain script.

Please provide contextual information.

While we were upgrading from Postgres v10 to Postgres v13 at SAP, we found out that Postgres 10 was not properly shut down, which led to a blocking issue with the migration. The error message was “The source cluster was not shut down cleanly”.

When SIGTERM is sent to Postgres, the daemon enters the Smart Shutdown mode and can take a while to shut down. In our tests, we stopped waiting after 7 minutes: Postgres hadn't shut down yet!

This explains why the postgres-release is sending a SIGINT at monit stop.
See: https://github.com/cloudfoundry/postgres-release/blob/develop/jobs/postgres/templates/postgres_ctl.sh.erb#L58

This Fast Shutdown mode is implemented here in the drain script. We could not get BPM to properly send a SIGINT because the SIGTERM is hardcoded in the BPM Golang code. An upcoming PR in BPM should address this issue and provide proper configuration for the actual first signal to send to the process.

In its hardcoded shutdown sequence, BPM then sends a SIGQUIT which means “Immediate Shutdown” mode for Postgres. That's where bad things happen, and the cluster state doesn't reach the expected shut down status (see below).
See also https://bosh.io/docs/job-lifecycle/#stop where the hardcoded BPM shutdown sequence is now documented.
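
For reference, the signal-to-mode mapping discussed above can be written down as a small shell helper. This is only an illustrative sketch: the function name `signal_for_shutdown_mode` is hypothetical and not part of this PR; the mapping itself follows the Shutting Down the Server chapter of the Postgres documentation.

```shell
#!/usr/bin/env bash
# Map a Postgres shutdown mode to the signal that triggers it.
# The function name is hypothetical; the mapping is the documented one.
signal_for_shutdown_mode() {
  case "$1" in
    smart)     echo TERM ;;  # waits for existing sessions to end on their own
    fast)      echo INT  ;;  # aborts open transactions, then shuts down cleanly
    immediate) echo QUIT ;;  # no clean shutdown; recovery runs at next start
    *)         echo "unknown shutdown mode: $1" >&2; return 1 ;;
  esac
}
```

With such a helper, the signal for a fast shutdown would be obtained with `signal_for_shutdown_mode fast`, which prints `INT`.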

What tests have you run against this PR?

When running the pg_controldata utility, the “Database cluster state” is expected to be “shut down” for the v10-to-v13 migration to succeed.

But with the BPM sequence SIGTERM/15/SIGQUIT/2/SIGKILL, which turns out to be inappropriate for Postgres, we get an unexpected “Database cluster state” of “in production”, and the migration fails.

Here is how we run the pg_controldata utility:

su - vcap -c "/var/vcap/packages/postgres-10/bin/pg_controldata  -D /var/vcap/store/postgres-10" \
    | grep -F "Database cluster state"

Test sequence:

After implementing the drain script, we could reproduce the issue by commenting out our new code, running bosh stop, and then checking the cluster state with the pg_controldata utility.
After that, with the drain script properly in place, we could run bosh stop and then confirm with the pg_controldata utility that the cluster state was correct.
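
The cluster state check used in this test sequence could be scripted roughly as follows. This is a minimal sketch, assuming the usual `Database cluster state:` line in the pg_controldata output; the function names `cluster_state` and `is_cleanly_shut_down` are hypothetical and not part of this PR.

```shell
#!/usr/bin/env bash
# Read pg_controldata output on stdin and print only the cluster state value.
cluster_state() {
  sed -n 's/^Database cluster state:[[:space:]]*//p'
}

# Read pg_controldata output on stdin and succeed only when the cluster
# state is "shut down", which is what the v10-to-v13 migration expects.
is_cleanly_shut_down() {
  [ "$(cluster_state)" = "shut down" ]
}
```

The output of the pg_controldata command shown above would be piped into `is_cleanly_shut_down`, whose exit status then tells whether the migration precondition is met.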

How should this change be described in bosh release notes?

Implemented drain scripts for the postgres-9.4, postgres-10 and postgres jobs, in order to implement a proper “fast shutdown” sequence for the Postgres daemon.

Does this PR introduce a breaking change?

No. Just fixes an issue that had been there for a long time.

Tag your pair, your PM, and/or team!

Co-authored-by: Benjamin Gandon

linux-foundation-easycla · 2021-09-08T13:17:00Z

The committers are authorized under a signed CLA.

✅ mvach (060ea8e)
✅ Benjamin Gandon (a388874)

bgandon · 2021-09-08T14:13:41Z

Matthias is going to fix the EasyCLA issue by the way.

Co-authored-by: Matthias Vach <matthias.vach@sap.com>

mvach · 2021-09-22T08:23:10Z

Hi, is there any chance to get this merged?

ramonskie · 2021-09-24T09:15:30Z

I don't see any breaking changes, so for me this can be accepted.

rkoster

Thanks for the PR, like the care that was taken around formatting the log output!

I'm however not fully convinced this approach solves the problem fully since we don't wait for the fast shut down to be completed (or maybe I'm missing something). If this is indeed the case I would propose to use the pg_ctl to perform a fast shutdown instead.

rkoster · 2021-09-24T10:48:49Z

jobs/postgres-10/templates/drain.erb

+
+
+postgres_pid=$(/var/vcap/packages/bpm/bin/bpm pid postgres-10)
+kill -s SIGINT "${postgres_pid}"


This returns immediately, right? If so, there is a chance on a busy instance that bosh thinks draining is done and continues with sending a stop to bpm via monit.

An alternative would be to change the packaging script to also compile the pg_ctl program:

pushd src/bin/pg_ctl
  make
  make install
popd

And use the same call as the Postgres release: https://github.com/cloudfoundry/postgres-release/blob/develop/jobs/postgres/templates/postgres_ctl.sh.erb#L58

This should give more certainty that the process is actually stopping.

No, we only need to fire-and-forget a SIGINT because BPM is currently not able to do it, and there is no need for any other complications.

With the drain script, here is the resulting sequence:

Monit is told not to restart Postgres if it stops.

The drain script sends a SIGINT and Postgres starts the fast shutdown. No need for a wait loop or any timeout at this point. The drain script exits immediately. The point here is only to start the fast shutdown sequence of Postgres.

monit stop delegates to bpm stop. If the Postgres process is still alive, BPM sends a SIGTERM but Postgres is already doing a fast shutdown, so the signal is ignored.

BPM waits for 15 seconds and Postgres properly stops within this delay, whereas sending a SIGTERM only could have it hang for 5+ minutes.

If the 15-second delay ever expires, BPM would send a SIGQUIT, and this would be perfectly legal because a fast shutdown is way shorter than that.

This explains why only sending a SIGINT from the drain script is enough.
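
The fire-and-forget step of this sequence can be sketched as below. This is an illustration rather than the exact script in this PR: the function names `send_fast_shutdown` and `drain` are hypothetical, and the final `echo 0` assumes the BOSH drain contract in which a drain script prints an integer on stdout and `0` means draining is complete.

```shell
#!/usr/bin/env bash
# send_fast_shutdown PID: ask Postgres for a fast shutdown and return at once.
send_fast_shutdown() {
  local pid="$1"
  # Nothing to signal when no PID could be found.
  [ -n "${pid}" ] || return 0
  # SIGINT triggers the fast shutdown; a process that is already gone is not an error.
  kill -s INT "${pid}" 2>/dev/null || true
}

# drain PID: start the fast shutdown without waiting for it, then report to BOSH.
drain() {
  send_fast_shutdown "$1"
  # Assumption: printing 0 tells BOSH that draining is done.
  echo 0
}
```

In the postgres-10 job it would be called as `drain "$(/var/vcap/packages/bpm/bin/bpm pid postgres-10)"`, reusing the PID lookup from the diff above.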

Please also consider that this drain script is a fast-path solution. The better way to proceed is having BPM send a SIGINT directly, just like the Postgres release does. The PR is already pending, but this will take more time and we are in a hurry for our Postgres migration that has been blocked for 2+ weeks.

I'm still confused about “The better way to proceed is having BPM send a SIGINT directly, just like the Postgres release does”, since I could find no reference to SIGINT in the postgres-release.

In the Postgres release they do (source):

su - vcap -c "${PACKAGE_DIR}/bin/pg_ctl stop -m fast -w -D ${DATA_DIR}"

-m --mode=mode Specifies the shutdown mode. mode can be smart, fast, or immediate, or the first letter of one of these three. If this option is omitted, fast is the default.

-w --wait Wait for the operation to complete. This is supported for the modes start, stop, restart, promote, and register, and is the default for those modes. When waiting, pg_ctl repeatedly checks the server's PID file, sleeping for a short amount of time between checks. Startup is considered complete when the PID file indicates that the server is ready to accept connections. Shutdown is considered complete when the server removes the PID file. pg_ctl returns an exit code based on the success of the startup or shutdown. If the operation does not complete within the timeout (see option -t), then pg_ctl exits with a nonzero exit status. But note that the operation might continue in the background and eventually succeed.

source: https://www.postgresql.org/docs/10/app-pg-ctl.html
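
The waiting behavior described in the quoted `-w` documentation can be approximated in plain shell as follows. This is a rough sketch of the idea only, not how pg_ctl is implemented: the function name `wait_for_pidfile_removal` is hypothetical, and it assumes the server's PID file is the `postmaster.pid` file in the data directory.

```shell
#!/usr/bin/env bash
# wait_for_pidfile_removal FILE TIMEOUT_SECONDS
# Returns 0 once FILE no longer exists, or 1 if it is still there
# after TIMEOUT_SECONDS one-second polls.
wait_for_pidfile_removal() {
  local pidfile="$1" timeout="$2" waited=0
  while [ -e "${pidfile}" ]; do
    if [ "${waited}" -ge "${timeout}" ]; then
      return 1  # timed out; the shutdown may still complete in the background
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

A caller would send the SIGINT first and then run something like `wait_for_pidfile_removal "${DATA_DIR}/postmaster.pid" 60`, where the 60-second timeout is an arbitrary value chosen for the example.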

No problem, it's only that pg_ctl stop -m fast sends a SIGINT signal to Postgres, as detailed in the Shutting Down the Server documentation.

The pg_ctl program provides a convenient interface for sending these signals to shut down the server

rkoster

Has this change been tested in combination with the director drain script? The workers won't be able to complete their active tasks while Postgres performs a fast stop.

rkoster · 2021-09-24T19:35:13Z

Co-authored-by: Benjamin Gandon <benjamin.gandon@sap.com> Co-authored-by: Ramon Makkelie <ramon.makkelie@sap.com>

bgandon · 2021-09-27T10:00:16Z

Has this change been tested in combination with the director drain script? The workers won't be able to complete their active tasks while Postgres performs a fast stop.

Good catch indeed, this is definitely a concern to take into account.
As a result, here is what we've modified:

Remove Postgres drain scripts
Create custom stop scripts to be called at monit stop postgres
Have those custom stop scripts both send a SIGINT and call bpm stop postgres in sequence.

With these changes, we'll leave enough time for the Director drain script to nicely stop the workers and we'll properly stop the database afterwards with the fast shutdown mode.

Then if you want us to rely on pg_ctl instead of killing the process directly, we're open to that as well.
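
For illustration, a custom stop script along the lines of the three changes listed above could look like this. It is a minimal sketch, not the script from this PR: the function name `stop_postgres` and the overridable `BPM` variable are hypothetical, while the `bpm pid` and `bpm stop` calls and the BPM path are the ones mentioned in this conversation.

```shell
#!/usr/bin/env bash
# BPM path as used in this PR; overridable only so the sketch can run without BPM.
BPM="${BPM:-/var/vcap/packages/bpm/bin/bpm}"

# stop_postgres JOB: start a fast shutdown, then let BPM run its usual stop sequence.
stop_postgres() {
  local job="$1" pid
  # An empty PID means the process is not running; skip the signal in that case.
  pid="$("${BPM}" pid "${job}" 2>/dev/null)" || pid=""
  if [ -n "${pid}" ]; then
    # SIGINT puts Postgres in fast shutdown mode; ignore a process that already exited.
    kill -s INT "${pid}" 2>/dev/null || true
  fi
  # BPM's own SIGTERM then arrives while the fast shutdown is already under way.
  "${BPM}" stop "${job}"
}
```

Because this runs at monit stop rather than at drain time, the Director drain script gets to finish with the workers before the database receives the SIGINT.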

rkoster

LGTM 👍

Implement a drain script for postgres 10 properly use the fast shutdown mode

060ea8e

Co-authored-by: Benjamin Gandon <benjamin.gandon@sap.com>

cf-gitbot added the unscheduled label Sep 8, 2021

mvach force-pushed the postgres-drain branch 2 times, most recently from 886c8b7 to c8ac7a8 on September 9, 2021 06:27

Implement same drain script for all postgres jobs

a388874

Co-authored-by: Matthias Vach <matthias.vach@sap.com>

mvach force-pushed the postgres-drain branch from c8ac7a8 to a388874 on September 9, 2021 10:00

bgandon mentioned this pull request Sep 13, 2021

Customize shutdown signal to send SIGINT to Postgres cloudfoundry/bpm-release#152

Merged

bgandon changed the title ~~Implement a drain script for postgres 10 properly use the fast shutdown mode~~ Implement drain scripts for Postgres to properly use the fast shutdown mode Sep 22, 2021

ramonskie self-assigned this Sep 24, 2021

ramonskie self-requested a review September 24, 2021 09:15

ramonskie approved these changes Sep 24, 2021

rkoster requested changes Sep 24, 2021

rkoster reviewed Sep 24, 2021

Use custom bpm stop script to shutdown postgres after draining is done

bcbaddf

Co-authored-by: Benjamin Gandon <benjamin.gandon@sap.com> Co-authored-by: Ramon Makkelie <ramon.makkelie@sap.com>

rkoster approved these changes Sep 27, 2021

rkoster merged commit 607bf89 into master Sep 27, 2021

cf-gitbot removed the unscheduled label Sep 27, 2021

rkoster deleted the postgres-drain branch September 27, 2021 13:19

bgandon mentioned this pull request Nov 1, 2021

Leverage the new BPM shutdown_signal feature in Postgres jobs #2334

Merged
