New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement drain scripts for Postgres to properly use the fast shutdown mode #2320
Conversation
…wn mode Co-authored-by: Benjamin Gandon <benjamin.gandon@sap.com>
Matthias is going to fix the EasyCLA issue by the way. |
886c8b7
to
c8ac7a8
Compare
Co-authored-by: Matthias Vach <matthias.vach@sap.com>
c8ac7a8
to
a388874
Compare
Hi, is there any chance to get this merged? |
i don't see any breaking changes so for me this can be accepted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR, like the care that was taken around formatting the log output!
I'm however not fully convinced this approach solves the problem fully since we don't wait for the fast shut down to be completed (or maybe I'm missing something). If this is indeed the case I would propose to use the pg_ctl
to perform a fast shutdown instead.
jobs/postgres-10/templates/drain.erb
Outdated
|
||
|
||
postgres_pid=$(/var/vcap/packages/bpm/bin/bpm pid postgres-10) | ||
kill -s SIGINT "${postgres_pid}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This return immediately right? If so there is a change on a busy instance that bosh thinks draining is done and continues with sending a stop to bpm via monit.
An alternative would be to change to packaging script to also compile the pg_ctl script:
pushd src/bin/pg_ctl
make
make install
popd
And use the same call as the Postgres release: https://github.com/cloudfoundry/postgres-release/blob/develop/jobs/postgres/templates/postgres_ctl.sh.erb#L58
This should give more certainty that the process is actually stopping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, we only need to fire-and-forget a SIGINT because BPM is currently not able to do it, and there is no need for any other complications.
With the drain script, here is the resulting sequence:
- Monit is told not to restart Postgres if it stops.
- the drain script sends a SIGINT and Postgres starts the fast shutdown. No need for wait loop or any timeout at this point. The drain script exits immediately. The point here is only to start the fast shutdown sequence of Postgres.
monit stop
delegates tobpm stop
. If the Postgres process is still alive, BPM sends a SIGTERM but Postgres is already doing a fast shutdown, so the signal is ignored.- BPM waits for 15 seconds and Postgres properly stop within this delay, whereas sending a SIGTERM only could have it hag for 5+ minutes.
- If ever the 15 seconds delay expires, BPM would send a SIGQUIT and this would be perfectly legal because a fast shutdown is way shorter than that.
This explains why only sending a SIGINT from the drain script is enough.
Please also consider that this drain script is a fast-path solution. The better way to proceed is having BPM send a SIGINT directly, just like the Postgres release does. The PR is already pending, but this will take more time and we are in a hurry for our Postgres migration that has been blocked for 2+ weeks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still confused about The better way to proceed is having BPM send a SIGINT directly, just like the Postgres release does.
Since I could find no reference to SIGINT in the postgres-release.
In the Postgres release they do (source):
su - vcap -c "${PACKAGE_DIR}/bin/pg_ctl stop -m fast -w -D ${DATA_DIR}"
-m
--mode=mode
Specifies the shutdown mode. mode can be smart, fast, or immediate, or the first letter of one of these three. If this option is omitted, fast is the default.
-w
--wait
Wait for the operation to complete. This is supported for the modes start, stop, restart, promote, and register, and is the default for those modes.
When waiting, pg_ctl repeatedly checks the server's PID file, sleeping for a short amount of time between checks. Startup is considered complete when the PID file indicates that the server is ready to accept connections. Shutdown is considered complete when the server removes the PID file. pg_ctl returns an exit code based on the success of the startup or shutdown.
If the operation does not complete within the timeout (see option -t), then pg_ctl exits with a nonzero exit status. But note that the operation might continue in the background and eventually succeed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem, it's only that pg_ctl stop -m fast
sends a SIGINT signal to Postgres, as detailed the Shutting Down the Server documentation.
The
pg_ctl
program provides a convenient interface for sending these signals to shut down the server
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Has this change been tested in combination with the director drain script? The workers won't be able to complete their active tasks while Postgres performs a fast stop.
jobs/postgres-10/templates/drain.erb
Outdated
|
||
|
||
postgres_pid=$(/var/vcap/packages/bpm/bin/bpm pid postgres-10) | ||
kill -s SIGINT "${postgres_pid}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still confused about The better way to proceed is having BPM send a SIGINT directly, just like the Postgres release does.
Since I could find no reference to SIGINT in the postgres-release.
In the Postgres release they do (source):
su - vcap -c "${PACKAGE_DIR}/bin/pg_ctl stop -m fast -w -D ${DATA_DIR}"
-m
--mode=mode
Specifies the shutdown mode. mode can be smart, fast, or immediate, or the first letter of one of these three. If this option is omitted, fast is the default.
-w
--wait
Wait for the operation to complete. This is supported for the modes start, stop, restart, promote, and register, and is the default for those modes.
When waiting, pg_ctl repeatedly checks the server's PID file, sleeping for a short amount of time between checks. Startup is considered complete when the PID file indicates that the server is ready to accept connections. Shutdown is considered complete when the server removes the PID file. pg_ctl returns an exit code based on the success of the startup or shutdown.
If the operation does not complete within the timeout (see option -t), then pg_ctl exits with a nonzero exit status. But note that the operation might continue in the background and eventually succeed.
Co-authored-by: Benjamin Gandon <benjamin.gandon@sap.com> Co-authored-by: Ramon Makkelie <ramon.makkelie@sap.com>
Good catch indeed, this is definitely a concern to take into account.
With these changes, we'll leave enough time for the Director drain script to nicely stop the workers and we'll properly stop the database afterwards with the fast shutdown mode. Then if you want us to rely on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
What is this change about?
Here we propose to use
SIGINT
in order to shut down the Postgres daemon with the Fast Shutdown mode. See the Shutting Down the Server documentation chapter for more details.This Fast Shutdown mode is implemented here in the drain script.
Please provide contextual information.
While we were upgrading from Postgres v10 to Postgres v13 at SAP, we found out that Postgres 10 was not properly shut down, which leaded to blocking issue with the migration. The error message was “The source cluster was not shut down cleanly”.
When SIGTERM is send to Postgres, the daemon enters the Smart Shutdown mode and can take a while to shut down. In our tests, we stopped waiting after 7 minutes: Postgres hadn't shut down yet!
This explains why the postgres-release is sending a SIGINT at
monit stop
.See: https://github.com/cloudfoundry/postgres-release/blob/develop/jobs/postgres/templates/postgres_ctl.sh.erb#L58
This Fast Shutdown mode is implemented here in the drain script. We could not get BPM to properly send a SIGINT because the SIGTERM is hardcoded in the BPM Golang code. An upcoming PR in BPM should address this issue and provide proper configuration for the actual first signal to send to the process.
In its hardcoded shutdown sequence, BPM then sends a SIGQUIT which means “Immediate Shutdown” mode for Postgres. That's where bad things happen, and the cluster state doesn't reach the expected
shut down
status (see below).See also https://bosh.io/docs/job-lifecycle/#stop where the hardcoded BPM shutdown sequence is now documented.
What tests have you run against this PR?
When running the
pg_controldata
utility, the “Database cluster state” is expected to beshut down
for the v10-to-v13 migration to succeed.But with the BPM sequence SIGTERM/15/SIGQUIT/2/SIGKILL, that turns out to be inappropriate for Postgres, we get an unexpected “Database cluster state” of
in production
. And the migration fails.Here is how we run the
pg_controldata
utility:Test sequence:
After implementing the
drain
script, we could reproduce the issue commenting out our new code and runningbosh stop
then checking the cluster state with thepg_controldata
utility.After that, when properly running the
drain
script, we could runbosh stop
and then conclude the cluster state was correct with thepg_controldata
utility.How should this change be described in bosh release notes?
drain
scripts for thepostgres-9.4
,postgres-10
andpostgres
jobs, in order to implement a proper “fast shutdown” sequence for the Postgres daemon.Does this PR introduce a breaking change?
No. Just fixes an issue that had been there for a long time.
Tag your pair, your PM, and/or team!
Co-authored-by: Benjamin Gandon