Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only exit at upgrade when the on-disk binary is ready. #771

Conversation

DimCitus
Copy link
Collaborator

See #749.

@DimCitus DimCitus added enhancement New feature or request Size: S Effort Estimate: Small labels Jul 12, 2021
@DimCitus DimCitus added this to the Sprint 2021 W31 W32 milestone Jul 12, 2021
@DimCitus DimCitus requested a review from JelteF July 12, 2021 12:47
@DimCitus DimCitus self-assigned this Jul 12, 2021
src/bin/pg_autoctl/keeper.c Show resolved Hide resolved
src/bin/pg_autoctl/keeper.c Outdated Show resolved Hide resolved
src/bin/pg_autoctl/keeper.c Outdated Show resolved Hide resolved
src/bin/pg_autoctl/keeper.c Outdated Show resolved Hide resolved
keeperVersion.pg_autoctl_version,
keeperVersion.required_extension_version);
exit(EXIT_CODE_MONITOR);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we do anything here in the else case? e.g. log something and/or return false?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point indeed. I have added extra logging (but that will happen every second) and returned false in that case, so that we don't try to use a monitor protocol version that we don't know. I have yet to figure a nice way to test this code path now (if I just upgrade the binary locally in a make cluster environment, because it's the same binary for every node, the right thing will happen...).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My trick for testing this locally was to change the version number everywhere to 1.7, but change this to compare against hardcoded "1.6" instead of the PG_AUTOCTL_EXTENSION_VERSION constant:
https://github.com/citusdata/pg_auto_failover/blob/0fecd653f9ef43019e4d76e347afce378adcc481/src/bin/pg_autoctl/keeper.c#L1405

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually no, I think it was a different line where I did this trick. I made pg_autoctl version always return 1.6 I think.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I added preliminary support for running an upgrade from within a docker compose environment, where it's possible to use the following command to replace a running container with a new container freshly built from the repository:

$ docker compose up -d --no-deps --build monitor

I added a little Makefile in tests/upgrade along with the docker-compose file to use. Then the idea would be to install such a patch as the following, and run make upgrade, which replaces the monitor and we can see the new error message about the upgrade to be done on the other nodes.

The docker compose network is sometimes acting up though, and I have had other errors, such as the following, making it pretty hard to reliably test for upgrade paths. Still, it's a promising work in progress, I believe.

monitor_1  | 21:29:44 25 WARN  Failed to resolve DNS name "79a3a3ebe79e": Temporary failure in name resolution
monitor_1  | 21:29:44 25 WARN  Failed to find IP address for hostname "79a3a3ebe79e", see above for details
monitor_1  | 21:29:44 25 WARN  Failed to grant connection to local network.
monitor_1  | 21:29:44 1 ERROR pg_autoctl service listener exited with exit status 12
monitor_1  | 21:29:44 1 INFO  Restarting service listener
diff --git a/src/bin/pg_autoctl/defaults.h b/src/bin/pg_autoctl/defaults.h
index 8e7e2eaf..643dc96a 100644
--- a/src/bin/pg_autoctl/defaults.h
+++ b/src/bin/pg_autoctl/defaults.h
@@ -17,7 +17,7 @@
 #define PG_AUTOCTL_VERSION "1.6.1"
 
 /* version of the extension that we requite to talk to on the monitor */
-#define PG_AUTOCTL_EXTENSION_VERSION "1.6"
+#define PG_AUTOCTL_EXTENSION_VERSION "1.7"
 
 /* environment variable to use to make DEBUG facilities available */
 #define PG_AUTOCTL_DEBUG "PG_AUTOCTL_DEBUG"
diff --git a/src/monitor/Makefile b/src/monitor/Makefile
index 8e6bf321..478fb2a7 100644
--- a/src/monitor/Makefile
+++ b/src/monitor/Makefile
@@ -2,7 +2,7 @@
 # Licensed under the PostgreSQL License.
 
 EXTENSION = pgautofailover
-EXTVERSION = 1.6
+EXTVERSION = 1.7
 
 SRC_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
 
diff --git a/src/monitor/metadata.h b/src/monitor/metadata.h
index 9e50ab72..0f8b39a3 100644
--- a/src/monitor/metadata.h
+++ b/src/monitor/metadata.h
@@ -15,7 +15,7 @@
 
 #include "storage/lockdefs.h"
 
-#define AUTO_FAILOVER_EXTENSION_VERSION "1.6"
+#define AUTO_FAILOVER_EXTENSION_VERSION "1.7"
 #define AUTO_FAILOVER_EXTENSION_NAME "pgautofailover"
 #define AUTO_FAILOVER_SCHEMA_NAME "pgautofailover"
 #define AUTO_FAILOVER_FORMATION_TABLE "pgautofailover.formation"
diff --git a/src/monitor/pgautofailover.control b/src/monitor/pgautofailover.control
index a649eb76..d0504410 100644
--- a/src/monitor/pgautofailover.control
+++ b/src/monitor/pgautofailover.control
@@ -1,5 +1,5 @@
 comment = 'pg_auto_failover'
-default_version = '1.6'
+default_version = '1.7'
 module_pathname = '$libdir/pgautofailover'
 relocatable = false
 requires = 'btree_gist'

Copy link
Collaborator Author

@DimCitus DimCitus Jul 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what happened with the DNS resolution is actually that on the first run in the docker compose environment, the hostname would be set to a dynamic hash (the container hash/name). We would store that as the default --hostname in our pg_autoctl configuration file. And at the next start-up of course the container name/hash would change, but pg_autoctl would still try to resolve the previous container name, which the DNS now knows nothing about.

I have added a retry loop to the DNS temporary resolution failures, and more importantly fixed the docker compose file to use known hostnames thanks to the hostname: service parameter.

I have also improved the tests/upgrade/Makefile to handle the patching of the source code for the 1.7 version, which we would have to maintain for other versions later of course. And added a README that makes it easy enough to navigate the tooling.

Oh, in passing, I have validated that the current PR actually works as intended. The Makefile also allows to run a full upgrade of all the nodes, using the docker compose commands.

src/bin/pg_autoctl/ipaddr.c Outdated Show resolved Hide resolved
@DimCitus DimCitus force-pushed the fix/do-not-restart-for-upgrade-when-version-on-disk-has-not-changed branch from f3f2308 to 5544491 Compare August 3, 2021 18:51
DimCitus and others added 12 commits August 19, 2021 17:23
This happens when e.g. the HBA rules have not been set yet on the monitor.
We have a timing where we could ping successfully and then fail to connect,
and in that situation we had a spurious call to pgsql_finish() that would
reset the connection statement type to SINGLE, even when it was before set
to SINGLE.
The new directory tests/upgrade contains a docker compose file that allows
running a monitor and three nodes in a way that will allow testing upgrades
thanks to the following command:

  $ docker-compose up --build monitor

This will rebuild the docker image used in the docker compose environment
from the local repository checkout and restart the monitor with the new
binaries.

To test upgrades from earlier versions of pg_auto_failover though, hacking
the docker compose file to its argument-based style (instead of the current
environment based style) would be necessary, because this PR introduces new
code to enhance our support for environment variables at create time.
Only allow connections to be made to the monitor when everything else has
been checked and is known to be ready.
Debian bullseye got released 5 days ago and this broke the docker build
file.
@DimCitus DimCitus force-pushed the fix/do-not-restart-for-upgrade-when-version-on-disk-has-not-changed branch from 34708ad to c827134 Compare August 19, 2021 15:23
@DimCitus DimCitus merged commit 9284de8 into master Aug 19, 2021
@DimCitus DimCitus deleted the fix/do-not-restart-for-upgrade-when-version-on-disk-has-not-changed branch August 19, 2021 15:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request Size: S Effort Estimate: Small
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants