Skip to content

postgres fails to restart after VM reboot #33

@jsievers

Description

@jsievers

Steps to reproduce

  • use bosh to deploy a postgres-release v23 VM
  • reboot the VM on IaaS level (e.g. nova reboot <VM_CID> if on openstack)

Observed behaviour

  • postgres consistently keeps failing to start because the pid file directory /var/vcap/sys/run/postgres/ does not exist.

from postgres_ctl.err.log :

[2018-01-05 12:53:31+0000] + main start
[2018-01-05 12:53:31+0000] + local action
[2018-01-05 12:53:31+0000] + action=start
[2018-01-05 12:53:31+0000] + source /var/vcap/jobs/postgres/bin/pgconfig.sh
[2018-01-05 12:53:31+0000] ++ current_version=9.6.4
[2018-01-05 12:53:31+0000] ++ pgversion_current=postgres-9.6.4
[2018-01-05 12:53:31+0000] ++ pgversion_old=postgres-9.6.3
[2018-01-05 12:53:31+0000] ++ pgversion_older=postgres-9.6.2
[2018-01-05 12:53:31+0000] ++ JOB_DIR=/var/vcap/jobs/postgres
[2018-01-05 12:53:31+0000] ++ PACKAGE_DIR=/var/vcap/packages/postgres-9.6.4
[2018-01-05 12:53:31+0000] ++ STORE_DIR=/var/vcap/store
[2018-01-05 12:53:31+0000] ++ PG_STORE_DIR=/var/vcap/store/postgres
[2018-01-05 12:53:31+0000] ++ DATA_DIR=/var/vcap/store/postgres/postgres-9.6.4
[2018-01-05 12:53:31+0000] ++ DATA_DIR_PREVIOUS=/var/vcap/store/postgres/postgres-previous
[2018-01-05 12:53:31+0000] ++ DATA_DIR_OLD=/var/vcap/store/postgres/postgres-unknown
[2018-01-05 12:53:31+0000] ++ PACKAGE_DIR_OLD=/var/vcap/packages/postgres-unknown
[2018-01-05 12:53:31+0000] ++ POSTGRES_UPGRADE_LOCK=/var/vcap/store/postgres/POSTGRES_UPGRADE_LOCK
[2018-01-05 12:53:31+0000] ++ pgversion_upgrade_from=postgres-unknown
[2018-01-05 12:53:31+0000] ++ '[' -d /var/vcap/store/postgres/postgres-9.6.3 -a -f /var/vcap/store/postgres/postgres-9.6.3/postgresql.conf ']'
[2018-01-05 12:53:31+0000] ++ '[' -d /var/vcap/store/postgres/postgres-9.6.2 -a -f /var/vcap/store/postgres/postgres-9.6.2/postgresql.conf ']'
[2018-01-05 12:53:31+0000] ++ RUN_DIR=/var/vcap/sys/run/postgres
[2018-01-05 12:53:31+0000] ++ LOG_DIR=/var/vcap/sys/log/postgres
[2018-01-05 12:53:31+0000] ++ PIDFILE=/var/vcap/sys/run/postgres/postgres.pid
[2018-01-05 12:53:31+0000] ++ CONTROL_JOB_PIDFILE=/var/vcap/sys/run/postgres/postgresctl.pid
[2018-01-05 12:53:31+0000] ++ HOST=0.0.0.0
[2018-01-05 12:53:31+0000] ++ PORT=5524
[2018-01-05 12:53:31+0000] ++ [[ -n '' ]]
[2018-01-05 12:53:31+0000] ++ LD_LIBRARY_PATH=/var/vcap/packages/postgres-9.6.4/lib
[2018-01-05 12:53:31+0000] + set +u
[2018-01-05 12:53:31+0000] + source /var/vcap/packages/postgres-common/utils.sh
[2018-01-05 12:53:31+0000] + set -u
[2018-01-05 12:53:31+0000] + case "${action}" in
[2018-01-05 12:53:31+0000] + pid_guard /var/vcap/sys/run/postgres/postgresctl.pid 'Postgres control job' false
[2018-01-05 12:53:31+0000] + local pidfile=/var/vcap/sys/run/postgres/postgresctl.pid
[2018-01-05 12:53:31+0000] + local 'name=Postgres control job'
[2018-01-05 12:53:31+0000] + local logall=false
[2018-01-05 12:53:31+0000] + '[' false == true ']'
[2018-01-05 12:53:31+0000] + '[' -f /var/vcap/sys/run/postgres/postgresctl.pid ']'
[2018-01-05 12:53:31+0000] + echo 1315
[2018-01-05 12:53:31+0000] /var/vcap/jobs/postgres/bin/postgres_ctl: line 18: /var/vcap/sys/run/postgres/postgresctl.pid: No such file or directory
[2018-01-05 12:55:11+0000] + main start
[2018-01-05 12:55:11+0000] + local action
[2018-01-05 12:55:11+0000] + action=start
[2018-01-05 12:55:11+0000] + source /var/vcap/jobs/postgres/bin/pgconfig.sh
[2018-01-05 12:55:11+0000] ++ current_version=9.6.4
[2018-01-05 12:55:11+0000] ++ pgversion_current=postgres-9.6.4
[2018-01-05 12:55:11+0000] ++ pgversion_old=postgres-9.6.3
[2018-01-05 12:55:11+0000] ++ pgversion_older=postgres-9.6.2
[2018-01-05 12:55:11+0000] ++ JOB_DIR=/var/vcap/jobs/postgres
[2018-01-05 12:55:11+0000] ++ PACKAGE_DIR=/var/vcap/packages/postgres-9.6.4
[2018-01-05 12:55:11+0000] ++ STORE_DIR=/var/vcap/store
[2018-01-05 12:55:11+0000] ++ PG_STORE_DIR=/var/vcap/store/postgres
[2018-01-05 12:55:11+0000] ++ DATA_DIR=/var/vcap/store/postgres/postgres-9.6.4
[2018-01-05 12:55:11+0000] ++ DATA_DIR_PREVIOUS=/var/vcap/store/postgres/postgres-previous
[2018-01-05 12:55:11+0000] ++ DATA_DIR_OLD=/var/vcap/store/postgres/postgres-unknown
[2018-01-05 12:55:11+0000] ++ PACKAGE_DIR_OLD=/var/vcap/packages/postgres-unknown
[2018-01-05 12:55:11+0000] ++ POSTGRES_UPGRADE_LOCK=/var/vcap/store/postgres/POSTGRES_UPGRADE_LOCK
[2018-01-05 12:55:11+0000] ++ pgversion_upgrade_from=postgres-unknown
[2018-01-05 12:55:11+0000] ++ '[' -d /var/vcap/store/postgres/postgres-9.6.3 -a -f /var/vcap/store/postgres/postgres-9.6.3/postgresql.conf ']'
[2018-01-05 12:55:11+0000] ++ '[' -d /var/vcap/store/postgres/postgres-9.6.2 -a -f /var/vcap/store/postgres/postgres-9.6.2/postgresql.conf ']'
[2018-01-05 12:55:11+0000] ++ RUN_DIR=/var/vcap/sys/run/postgres
[2018-01-05 12:55:11+0000] ++ LOG_DIR=/var/vcap/sys/log/postgres
[2018-01-05 12:55:11+0000] ++ PIDFILE=/var/vcap/sys/run/postgres/postgres.pid
[2018-01-05 12:55:11+0000] ++ CONTROL_JOB_PIDFILE=/var/vcap/sys/run/postgres/postgresctl.pid
[2018-01-05 12:55:11+0000] ++ HOST=0.0.0.0
[2018-01-05 12:55:11+0000] ++ PORT=5524
[2018-01-05 12:55:11+0000] ++ [[ -n '' ]]
[2018-01-05 12:55:11+0000] ++ LD_LIBRARY_PATH=/var/vcap/packages/postgres-9.6.4/lib
[2018-01-05 12:55:11+0000] + set +u
[2018-01-05 12:55:11+0000] + source /var/vcap/packages/postgres-common/utils.sh
[2018-01-05 12:55:11+0000] + set -u
[2018-01-05 12:55:11+0000] + case "${action}" in
[2018-01-05 12:55:11+0000] + pid_guard /var/vcap/sys/run/postgres/postgresctl.pid 'Postgres control job' false
[2018-01-05 12:55:11+0000] + local pidfile=/var/vcap/sys/run/postgres/postgresctl.pid
[2018-01-05 12:55:11+0000] + local 'name=Postgres control job'
[2018-01-05 12:55:11+0000] + local logall=false
[2018-01-05 12:55:11+0000] + '[' false == true ']'
[2018-01-05 12:55:11+0000] + '[' -f /var/vcap/sys/run/postgres/postgresctl.pid ']'
[2018-01-05 12:55:11+0000] + echo 1473
[2018-01-05 12:55:11+0000] /var/vcap/jobs/postgres/bin/postgres_ctl: line 18: /var/vcap/sys/run/postgres/postgresctl.pid: No such file or directory

the important line from log above is

[2018-01-05 12:53:31+0000] /var/vcap/jobs/postgres/bin/postgres_ctl: line 18: /var/vcap/sys/run/postgres/postgresctl.pid: No such file or directory

Expected behaviour

postgres is able to survive a VM reboot.

Workaround

bosh ssh into the postgres VM and manually execute

mkdir -p /var/vcap/sys/run/postgres/
chown -R vcap:vcap /var/vcap/sys/run/
monit start postgres

Possible fix

there are similar bugs already fixed in other bosh releases, see e.g. cloudfoundry/capi-release#25

I think the problem is that the control scripts rely on prestart to create $RUN_DIR:

mkdir -p "${RUN_DIR}"
chown -R vcap:vcap "${RUN_DIR}"

and in case of external VM reboot (i.e. outside of any bosh operation), prestart is not executed.
/var/vcap/data/sys/run/ is a temp file system

$ mount | grep sys/run
tmpfs on /var/vcap/data/sys/run type tmpfs (rw,size=1m)

and thus is empty after VM reboot (which is fine).

although there is an mkdir -p $RUN_DIR in

, this is too late in case of VM reboot because of this line executed earlier:

echo $$ > ${CONTROL_JOB_PIDFILE}

which assumes the directory $RUN_DIR already exists.

Initialization of $RUN_DIR should probably happen as the very first thing when start) is invoked, see the fix for a similar problem in cloud_controller_ng

In our scenario (we run CCDB on postgres-release) this bug has caused an extended Cloudfoundry downtime triggered by forced VM reboots which were required to deploy urgent security fixes on the IaaS level.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions