Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement drain scripts for Postgres to properly use the fast shutdown mode #2320

Merged
merged 3 commits into from Sep 27, 2021

Conversation

mvach
Copy link
Contributor

@mvach mvach commented Sep 8, 2021

What is this change about?

Here we propose to use SIGINT in order to shut down the Postgres daemon with the Fast Shutdown mode. See the Shutting Down the Server documentation chapter for more details.

This Fast Shutdown mode is implemented here in the drain script.

Please provide contextual information.

While we were upgrading from Postgres v10 to Postgres v13 at SAP, we found out that Postgres 10 was not properly shut down, which leaded to blocking issue with the migration. The error message was “The source cluster was not shut down cleanly”.

When SIGTERM is send to Postgres, the daemon enters the Smart Shutdown mode and can take a while to shut down. In our tests, we stopped waiting after 7 minutes: Postgres hadn't shut down yet!

This explains why the postgres-release is sending a SIGINT at monit stop.
See: https://github.com/cloudfoundry/postgres-release/blob/develop/jobs/postgres/templates/postgres_ctl.sh.erb#L58

This Fast Shutdown mode is implemented here in the drain script. We could not get BPM to properly send a SIGINT because the SIGTERM is hardcoded in the BPM Golang code. An upcoming PR in BPM should address this issue and provide proper configuration for the actual first signal to send to the process.

In its hardcoded shutdown sequence, BPM then sends a SIGQUIT which means “Immediate Shutdown” mode for Postgres. That's where bad things happen, and the cluster state doesn't reach the expected shut down status (see below).
See also https://bosh.io/docs/job-lifecycle/#stop where the hardcoded BPM shutdown sequence is now documented.

What tests have you run against this PR?

When running the pg_controldata utility, the “Database cluster state” is expected to be shut down for the v10-to-v13 migration to succeed.

But with the BPM sequence SIGTERM/15/SIGQUIT/2/SIGKILL, that turns out to be inappropriate for Postgres, we get an unexpected “Database cluster state” of in production. And the migration fails.

Here is how we run the pg_controldata utility:

su - vcap -c "/var/vcap/packages/postgres-10/bin/pg_controldata  -D /var/vcap/store/postgres-10" \
    | grep -F "Database cluster state"

Test sequence:

  1. After implementing the drain script, we could reproduce the issue commenting out our new code and running bosh stop then checking the cluster state with the pg_controldata utility.

  2. After that, when properly running the drain script, we could run bosh stop and then conclude the cluster state was correct with the pg_controldata utility.

How should this change be described in bosh release notes?

  • Implemented drain scripts for the postgres-9.4, postgres-10 and postgres jobs, in order to implement a proper “fast shutdown” sequence for the Postgres daemon.

Does this PR introduce a breaking change?

No. Just fixes an issue that had been there for a long time.

Tag your pair, your PM, and/or team!

Co-authored-by: Benjamin Gandon

…wn mode

Co-authored-by: Benjamin Gandon <benjamin.gandon@sap.com>
@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Sep 8, 2021

CLA Signed

The committers are authorized under a signed CLA.

@bgandon
Copy link
Contributor

bgandon commented Sep 8, 2021

Matthias is going to fix the EasyCLA issue by the way.

@mvach mvach force-pushed the postgres-drain branch 2 times, most recently from 886c8b7 to c8ac7a8 Compare September 9, 2021 06:27
Co-authored-by: Matthias Vach <matthias.vach@sap.com>
@mvach
Copy link
Contributor Author

mvach commented Sep 22, 2021

Hi, is there any chance to get this merged?

@bgandon bgandon changed the title Implement a drain script for postgres 10 properly use the fast shutdown mode Implement drain scripts for Postgres to properly use the fast shutdown mode Sep 22, 2021
@ramonskie ramonskie self-assigned this Sep 24, 2021
@ramonskie ramonskie self-requested a review September 24, 2021 09:15
@ramonskie
Copy link
Contributor

i don't see any breaking changes so for me this can be accepted

Copy link
Contributor

@rkoster rkoster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, like the care that was taken around formatting the log output!

I'm however not fully convinced this approach solves the problem fully since we don't wait for the fast shut down to be completed (or maybe I'm missing something). If this is indeed the case I would propose to use the pg_ctl to perform a fast shutdown instead.



postgres_pid=$(/var/vcap/packages/bpm/bin/bpm pid postgres-10)
kill -s SIGINT "${postgres_pid}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This return immediately right? If so there is a change on a busy instance that bosh thinks draining is done and continues with sending a stop to bpm via monit.

An alternative would be to change to packaging script to also compile the pg_ctl script:

  pushd src/bin/pg_ctl
    make
    make install
  popd

And use the same call as the Postgres release: https://github.com/cloudfoundry/postgres-release/blob/develop/jobs/postgres/templates/postgres_ctl.sh.erb#L58

This should give more certainty that the process is actually stopping.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we only need to fire-and-forget a SIGINT because BPM is currently not able to do it, and there is no need for any other complications.

With the drain script, here is the resulting sequence:

  1. Monit is told not to restart Postgres if it stops.
  2. the drain script sends a SIGINT and Postgres starts the fast shutdown. No need for wait loop or any timeout at this point. The drain script exits immediately. The point here is only to start the fast shutdown sequence of Postgres.
  3. monit stop delegates to bpm stop. If the Postgres process is still alive, BPM sends a SIGTERM but Postgres is already doing a fast shutdown, so the signal is ignored.
  4. BPM waits for 15 seconds and Postgres properly stop within this delay, whereas sending a SIGTERM only could have it hag for 5+ minutes.
  5. If ever the 15 seconds delay expires, BPM would send a SIGQUIT and this would be perfectly legal because a fast shutdown is way shorter than that.

This explains why only sending a SIGINT from the drain script is enough.

Please also consider that this drain script is a fast-path solution. The better way to proceed is having BPM send a SIGINT directly, just like the Postgres release does. The PR is already pending, but this will take more time and we are in a hurry for our Postgres migration that has been blocked for 2+ weeks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still confused about The better way to proceed is having BPM send a SIGINT directly, just like the Postgres release does. Since I could find no reference to SIGINT in the postgres-release.

In the Postgres release they do (source):

su - vcap -c "${PACKAGE_DIR}/bin/pg_ctl stop -m fast -w -D ${DATA_DIR}"
-m
--mode=mode
  Specifies the shutdown mode. mode can be smart, fast, or immediate, or the first letter of one of these three. If this option is omitted, fast is the default.
-w
--wait
Wait for the operation to complete. This is supported for the modes start, stop, restart, promote, and register, and is the default for those modes.

When waiting, pg_ctl repeatedly checks the server's PID file, sleeping for a short amount of time between checks. Startup is considered complete when the PID file indicates that the server is ready to accept connections. Shutdown is considered complete when the server removes the PID file. pg_ctl returns an exit code based on the success of the startup or shutdown.

If the operation does not complete within the timeout (see option -t), then pg_ctl exits with a nonzero exit status. But note that the operation might continue in the background and eventually succeed.

source: https://www.postgresql.org/docs/10/app-pg-ctl.html

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem, it's only that pg_ctl stop -m fast sends a SIGINT signal to Postgres, as detailed the Shutting Down the Server documentation.

The pg_ctl program provides a convenient interface for sending these signals to shut down the server

Copy link
Contributor

@rkoster rkoster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has this change been tested in combination with the director drain script? The workers won't be able to complete their active tasks while Postgres performs a fast stop.



postgres_pid=$(/var/vcap/packages/bpm/bin/bpm pid postgres-10)
kill -s SIGINT "${postgres_pid}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still confused about The better way to proceed is having BPM send a SIGINT directly, just like the Postgres release does. Since I could find no reference to SIGINT in the postgres-release.

In the Postgres release they do (source):

su - vcap -c "${PACKAGE_DIR}/bin/pg_ctl stop -m fast -w -D ${DATA_DIR}"
-m
--mode=mode
  Specifies the shutdown mode. mode can be smart, fast, or immediate, or the first letter of one of these three. If this option is omitted, fast is the default.
-w
--wait
Wait for the operation to complete. This is supported for the modes start, stop, restart, promote, and register, and is the default for those modes.

When waiting, pg_ctl repeatedly checks the server's PID file, sleeping for a short amount of time between checks. Startup is considered complete when the PID file indicates that the server is ready to accept connections. Shutdown is considered complete when the server removes the PID file. pg_ctl returns an exit code based on the success of the startup or shutdown.

If the operation does not complete within the timeout (see option -t), then pg_ctl exits with a nonzero exit status. But note that the operation might continue in the background and eventually succeed.

source: https://www.postgresql.org/docs/10/app-pg-ctl.html

Co-authored-by: Benjamin Gandon <benjamin.gandon@sap.com>
Co-authored-by: Ramon Makkelie <ramon.makkelie@sap.com>
@bgandon
Copy link
Contributor

bgandon commented Sep 27, 2021

Has this change been tested in combination with the director drain script? The workers won't be able to complete their active tasks while Postgres performs a fast stop.

Good catch indeed, this is definitely a concern to take into account.
As a result, here is what we've modified:

  1. Remove Postgres drain scripts
  2. Create custom stop scripts to be called at monit stop postgres
  3. Have those custom stop scripts both send a SIGINT and call bpm stop postgres in sequence.

With these changes, we'll leave enough time for the Director drain script to nicely stop the workers and we'll properly stop the database afterwards with the fast shutdown mode.

Then if you want us to rely on pg_ctl instead of killing the process directly, we're open to that either.

Copy link
Contributor

@rkoster rkoster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants