Plug connection leaks found during profiling #582

gkokolatos · 2021-02-04T19:20:46Z

It seems that pgsql_execute_with_params() has been altered inconsistently
during its lifetime. The latest version notes in the comments that the
connection is not persistent, to facilitate error handling. However, that was
not entirely true, and several parts of the code assumed it not to be true.
Others assumed it to be true and failed to release the connection once used.

For the sake of clarity, the function will now explicitly close the connection
that it has used, regardless of whether it is a new or existing connection. That
simplifies most of the code and plugs the connection leaks.

It also uncovers an inconsistency in the connections used for notification. The
code mixed the connection it was using to listen to events from the monitor
with others. A new PGconn member has been added in the monitor struct to
distinguish between the two distinct cases.

DimCitus · 2021-02-05T10:36:47Z

Hey Georgios,

You're making good points here, and the analysis is solid, but as you saw our goal has shifted and the code has not been maintained well enough to show our current intentions. Sorry about that.

> It seems that pgsql_execute_with_params() during its lifetime has been
> inconsistently altered. The latest version notes in the comments that the
> connection is not persistent to facilitate error handling. However that was not
> entirely true and several parts of the code assumed it to not be true. Others
> assumed it to be true and failed to release the connection once used.

If you run pg_autoctl in DEBUG mode (using -vv for very verbose) you will see connections and disconnections made in the log messages. The idea is that we should refrain from a series of connect, disconnect, connect, disconnect, connect, disconnect within the same work unit. That's not the best way to use Postgres; we should be smart enough to re-use a connection and manage the client-side libpq clean-up that is necessary.

> For the sake of clarity, the function will now explicitly close the connection
> that it has used, regardless of whether it is a new or existing connection. That
> simplifies most of the code and plugs the connection leaks.

We started from that much simpler implementation yes, and then made it more complex to be able to re-use a previously established connection.

> It also uncovers an inconsistency in the connections used for notification. The
> code mixed the connection it was using to listen to events from the monitor
> with others. A new PGconn member has been added in the monitor struct to
> distinguish between the two distinct cases.

What's wrong with running queries in the same connection when you LISTEN for changes?

gkokolatos · 2021-02-05T12:04:44Z

On Friday, February 5, 2021 11:37 AM, Dimitri Fontaine wrote:

> Hey Georgios,
>
> You're making good points here, and the analysis is solid, but as you saw our goal has shifted and the code has not been maintained well enough to show our current intentions. Sorry about that.

It happens to all software :) So the real problem is leaking connections and connection structs. Without this patch, several tens of kB were leaked (definitely lost) every few loops under some test scenarios. This can become a substantial memory leak in a long-running process. That is what this PR is plugging.

> > It seems that pgsql_execute_with_params() during its lifetime has been inconsistently altered. The latest version notes in the comments that the connection is not persistent to facilitate error handling. However that was not entirely true and several parts of the code assumed it to not be true. Others assumed it to be true and failed to release the connection once used.
>
> If you run `pg_autoctl` in DEBUG mode (using `-vv` for very verbose) you will see connections and disconnections made in the log messages. The idea is that we should refrain from a series of connect, disconnect, connect, disconnect, connect, disconnect within the same work unit. That's not the best way to use Postgres; we should be smart enough to re-use a connection and manage the client-side libpq clean-up that is necessary.

Absolutely. You should not leak memory though. My rough read is that in order to achieve that, a small redesign of the interface will be needed. IMHO, it might make sense to prevent the memleaks while the redesign is taking place which can end up being a while. Again, only an opinion :)

> > For the sake of clarity, the function will now explicitly close the connection that it has used, regardless of whether it is a new or existing connection. That simplifies most of the code and plugs the connection leaks.
>
> We started from that much simpler implementation yes, and then made it more complex to be able to re-use a previously established connection.
>
> > It also uncovers an inconsistency in the connections used for notification. The code mixed the connection it was using to listen to events from the monitor with others. A new PGconn member has been added in the monitor struct to distinguish between the two distinct cases.
>
> What's wrong with running queries in the same connection when you LISTEN for changes?

Nothing, providing you are not leaking. If you try to plug the leaks with the current interface, you will stop listening to events in parts of the code where it is not desired to stop listening to events. Or that has been my understanding.


DimCitus · 2021-02-05T12:29:41Z

> Absolutely. You should not leak memory though. My rough read is that in order to achieve that, a small redesign of the interface will be needed. IMHO, it might make sense to prevent the memleaks while the redesign is taking place which can end up being a while. Again, only an opinion :)

Okay, between your message and a chat with @JelteF I am now convinced that we should 1. plug the leak and 2. find a principled way to be smart about re-using connections when that's better.

> Nothing, providing you are not leaking. If you try to plug the leaks with the current interface, you will stop listening to events in parts of the code where it is not desired to stop listening to events. Or that has been my understanding.

Yeah that's a good point too.

DimCitus

Please rebase to current master and let's see about the CI before merging!

Thinking more about it, I think this breaks some of our retry loops, such as the one in keeper_register_and_init in https://github.com/citusdata/pg_auto_failover/blob/master/src/bin/pg_autoctl/keeper.c#L1231

DimCitus · 2021-02-05T14:53:12Z

src/bin/pg_autoctl/keeper_pg_init.c

@@ -546,6 +546,7 @@ wait_until_primary_is_ready(Keeper *keeper,
 			KeeperStateData *keeperState = &(keeper->state);
 			int timeoutMs = PG_AUTOCTL_KEEPER_SLEEP_TIME * 1000;

+			(void) pgsql_listen(&(monitor->notificationClient), &((char *){ 0 }));


We need at least a comment that explains why it's okay to send an empty list of channels here, and then I would rather avoid this advanced notation and just use a const char *channels[] = { 0 }; variable.

Yeah, that's clearly a hack. I only added it as a quick way to open a connection as there was no proper interface available.
Instead of hacking pgsql_listen(), how about exposing a lower level interface like pgsql_open_connection()?

Yeah we don't want to expose the lower-level interface. The whole point is that ensuring libpq-level clean-up and “lifetime management” is pretty hard, so we want to have that all sit in the same place and remain kind of opaque to the higher levels.

Also I'm not sure why we now have an explicit call to pgsql_listen that we didn't have before. Maybe that's where the new API could happen?
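
The more explicit notation asked for in this thread could look roughly like the sketch below. It is only an illustration and not the code from the PR: `count_channels` is a hypothetical helper standing in for whatever loop walks the channel list, and the sketch assumes that the list is a NULL-terminated array of strings, as the `{ 0 }` initialiser suggests.

```c
#include <stddef.h>

/*
 * Hypothetical helper: count the entries of a NULL-terminated list of
 * channel names. It stands in for the loop a listen routine would run over
 * its channels; it is not part of pg_autoctl.
 */
static int
count_channels(const char *channels[])
{
	int count = 0;

	while (channels[count] != NULL)
	{
		count++;
	}
	return count;
}

/*
 * A named, empty, NULL-terminated list. Passing it means "open the
 * connection but LISTEN to nothing", which the compound literal
 * &((char *){ 0 }) expressed in a less readable way.
 */
static const char *emptyChannelList[] = { NULL };

/* For contrast, a list that holds a single channel name. */
static const char *stateChannelList[] = { "state", NULL };
```

A named variable also gives an obvious place for the comment explaining why an empty list is acceptable at that call site.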

DimCitus · 2021-02-05T14:53:45Z

src/bin/pg_autoctl/service_keeper.c

+	 * Finally make establish a connection for notifications in case it had
+	 * closed before
+	 */
+	(void) pgsql_listen(&(keeper->monitor.notificationClient), &((char *){ 0 }));


Same as before, I'm not happy with this notation, let's make it a whole lot more explicit please?

DimCitus

I took time to review the changes and the incompatibility with our manual transaction handling that we do in places. Do you want to update the PR?

DimCitus · 2021-03-15T12:16:45Z

src/bin/pg_autoctl/pgsql.c

+	PQfinish(pgsql->connection);
+	pgsql->connection = NULL;


In a couple of places in the code we are handling transactions to sync local state file creation with transaction commit on the monitor. When the local activity fails, we ROLLBACK the transaction on the monitor. I think we need to track if an explicit transaction is being used in our PGSQL object and provide pgsql_begin, pgsql_commit, and pgsql_rollback functions, and then have the PQfinish call in pgsql_execute_with_params depend on whether an explicit transaction is in flight or not.
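
A rough sketch of that idea follows, under loud assumptions: every name in it (`SketchPGSQL`, `sketch_begin`, `sketch_end`, `sketch_execute`, `sketch_finish`) is hypothetical, and a boolean replaces the real PGconn pointer so that the example compiles without libpq. It only shows the control flow, where the connection is released after each statement unless an explicit transaction is in flight.

```c
#include <stdbool.h>

/*
 * Hypothetical stand-in for the PGSQL object. "connected" replaces the real
 * PGconn pointer so that the sketch compiles without libpq.
 */
typedef struct SketchPGSQL
{
	bool connected;         /* true while a connection is open */
	bool inTransaction;     /* true between begin and commit/rollback */
	int closeCount;         /* how many times the connection was closed */
} SketchPGSQL;

/* Where the real code would call PQfinish() and reset the pointer. */
static void
sketch_finish(SketchPGSQL *pgsql)
{
	if (pgsql->connected)
	{
		pgsql->connected = false;
		pgsql->closeCount++;
	}
}

/* Run one statement; close afterwards unless a transaction is in flight. */
static bool
sketch_execute(SketchPGSQL *pgsql, const char *sql)
{
	(void) sql;                 /* the sketch does not talk to a server */
	pgsql->connected = true;    /* open, or re-use, the connection */

	if (!pgsql->inTransaction)
	{
		sketch_finish(pgsql);
	}
	return true;
}

/* Refuse to nest: a second begin while one is in flight returns false. */
static bool
sketch_begin(SketchPGSQL *pgsql)
{
	if (pgsql->inTransaction)
	{
		return false;
	}
	pgsql->inTransaction = true;
	return sketch_execute(pgsql, "BEGIN");
}

/* End the transaction (COMMIT or ROLLBACK), then release the connection. */
static bool
sketch_end(SketchPGSQL *pgsql, bool commit)
{
	if (!pgsql->inTransaction)
	{
		return false;
	}

	bool ok = sketch_execute(pgsql, commit ? "COMMIT" : "ROLLBACK");

	pgsql->inTransaction = false;
	sketch_finish(pgsql);
	return ok;
}
```

With this shape, call sites that run a single statement would not change at all; only the places that issue manual BEGIN/COMMIT/ROLLBACK would switch to the explicit functions.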

gkokolatos · 2021-03-15T13:46:47Z

On Monday, March 15, 2021 1:17 PM, Dimitri Fontaine wrote:

> @DimCitus requested changes on this pull request.
>
> I took time to review the changes and the incompatibility with our manual transaction handling that we do in places. Do you want to update the PR?

Thank you. Let me try to get a fresh look at it.

> In src/bin/pg_autoctl/pgsql.c:
>
> > +	PQfinish(pgsql->connection);
> > +	pgsql->connection = NULL;
>
> In a couple of places in the code we are handling transactions to sync local state file creation with transaction commit on the monitor. When the local activity fails, we ROLLBACK the transaction on the monitor. I think we need to track if an explicit transaction is being used in our PGSQL object and provide pgsql_begin, pgsql_commit, and pgsql_rollback functions, and then have the PQfinish call in pgsql_execute_with_params depend on whether an explicit transaction is in flight or not.

Yeah, it seems that the naive, 'use once' approach will not cut it if a transaction has to be open. Let me try some more targeted valgrind runs and see if I can catch the actual offender(s) before attempting a re-write of the API.


DimCitus · 2021-03-15T16:58:09Z

> Yeah, it seems that the naive, 'use once' approach will not cut it if a transaction has to be open. Let me try some more targeted valgrind runs and see if I can catch the actual offender(s) before attempting a re-write of the API.

In case that's needed, I think most call sites would remain exactly the same as today. Only those where we issue manual BEGIN/ROLLBACK/COMMIT instructions would have to change. That's not too many, I think I can only find one...

gkokolatos · 2021-03-19T16:01:19Z

On Monday, March 15, 2021 5:58 PM, Dimitri Fontaine wrote:

> > Yeah, it seems that the naive, 'use once' approach will not cut it if a transaction has to be open. Let me try some more targeted valgrind runs and see if I can catch the actual offender(s) before attempting a re-write of the API.
>
> In case that's needed, I think most call sites would remain exactly the same as today. Only those where we issue manual BEGIN/ROLLBACK/COMMIT instructions would have to change. That's not too many, I think I can only find one...

Thank you for looking and apologies for the delay. I rebased the current (I know it is public but I am not assuming anyone to have used it :)) and tried to address the comments. Valgrind seems happy. However I am getting some flakiness in the test_022_detect_network_partition test. Let the infra run the tests and see what it says. //Georgios


gkokolatos · 2021-03-24T09:35:00Z

@DimCitus Now that was a bit embarrassing. I had previously pushed only the rebase and not the changes.

Please find a fresh push with a new rebase and the requested changes.

Valgrind is still happy and the tests do pass locally. Let us wait for the CI to conclude.

DimCitus · 2021-03-24T10:39:37Z

> @DimCitus Now that was a bit embarrassing. I had previously pushed only the rebase and not the changes.
> Please find a fresh push with a new rebase and the requested changes.

Thanks @gkokolatos ! Your approach/API looks better than my own attempt yesterday. I think it'd be good to rename and improve the emptyChannels to char *emptyChannelsList[] = { NULL }; but that's a minor issue.

> Valgrind is still happy and the tests do pass locally. Let us wait for the CI to conclude.

Some of the failures I see seem related to a merge error, typically:

======================================================================
FAIL: test_extension_update.test_001_update_extension
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/python/3.7.6/lib/python3.7/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/travis/build/citusdata/pg_auto_failover/tests/test_extension_update.py", line 41, in test_001_update_extension
    eq_(results, [("dummy",)])
AssertionError: [('1.5.0.2',)] != [('dummy',)]

----------------------------------------------------------------------
Ran 34 tests in 74.179s

And then you need to run make indent again apparently.

And another one is the Travis infamous one, where your PR is running some processes in DEBUG mode:

The job exceeded the maximum log length, and has been terminated.

gkokolatos · 2021-03-24T11:51:25Z

On Wednesday, March 24, 2021 11:40 AM, Dimitri Fontaine wrote:

> > @DimCitus Now that was a bit embarrassing. I had previously pushed only the rebase and not the changes.
> > Please find a fresh push with a new rebase and the requested changes.
>
> Thanks @gkokolatos ! Your approach/API looks better than my own attempt yesterday. I think it'd be good to rename and improve the `emptyChannels` to `char *emptyChannelsList[] = { NULL };` but that's a minor issue.

Thank you! Fixed.

> > Valgrind is still happy and the tests do pass locally. Let us wait for the CI to conclude.
>
> Some of the failures I see seem related to a merge error, typically:
>
> ======================================================================
> FAIL: test_extension_update.test_001_update_extension
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/opt/python/3.7.6/lib/python3.7/site-packages/nose/case.py", line 198, in runTest
>     self.test(*self.arg)
>   File "/home/travis/build/citusdata/pg_auto_failover/tests/test_extension_update.py", line 41, in test_001_update_extension
>     eq_(results, [("dummy",)])
> AssertionError: [('1.5.0.2',)] != [('dummy',)]
>
> ----------------------------------------------------------------------
> Ran 34 tests in 74.179s

Let me have a look at that.

> And then you need to run `make indent` again apparently.

Too true. Fixed


gkokolatos · 2021-03-24T12:19:47Z


There exists some flakiness for sure. What really caught my eye was test_multi_ifdown.test_011_prepare_candidate_priorities, which had many successful runs (e.g. 2887.1 and 2887.2), but in 2887.3 a deadlock was detected. The log follows; it can also be seen here (https://travis-ci.com/github/citusdata/pg_auto_failover/jobs/493356759):

12:01:24 28140 ERROR Monitor ERROR: deadlock detected
12:01:24 28140 ERROR Monitor DETAIL: Process 28157 waits for ShareLock on transaction 3012; blocked by process 28156.
12:01:24 28140 ERROR Monitor Process 28156 waits for ExclusiveLock on advisory lock [16385,822708183,0,11]; blocked by process 28157.
12:01:24 28140 ERROR Monitor HINT: See server log for query details.
12:01:24 28140 ERROR Monitor CONTEXT: while updating tuple (0,17) in relation "node"
12:01:24 28140 ERROR Monitor SQL statement "UPDATE pgautofailover.node SET candidatepriority = $1, replicationquorum = $2 WHERE nodeid = $3 and nodehost = $4 AND nodeport = $5"
12:01:24 28140 ERROR Failed to update node candidate priority on node "node_2"in formation "default" for candidate_priority: "100"
12:01:24 28140 ERROR Failed to set "candidate-priority" to "100".

I am not certain if and how this is related to the current PR, but it seems serious enough to demand some scrutiny. I will be taking a look but feel free to weigh in, if you want.


DimCitus

I did another round of review, focusing on some details that seem to require more attention.

DimCitus · 2021-03-24T14:14:15Z

src/bin/pg_autoctl/keeper_pg_init.c

+			char *emptyChannelList[] = { NULL };

+			(void) pgsql_listen(&(monitor->notificationClient), emptyChannelList);


By the way, how does it work now? We are listening to no channel at all, how do we expect to get a notification that some state change happened in our formation and group, or even for our own node?

DimCitus · 2021-03-24T14:16:49Z

src/bin/pg_autoctl/pgsql.c

+	pgsql->connectionStatementType = PGSQL_CONNECTION_MULTI_STATEMENT;
+	connection = pgsql_open_connection(pgsql);
+	if (connection == NULL)


Could we check for "transaction already in progress" errors and report it as a BUG, possibly forcing an exit?

DimCitus · 2021-03-24T14:19:56Z

src/bin/pg_autoctl/service_keeper.c

+				monitor->notificationClient.connectionStatementType ==
+				PGSQL_CONNECTION_SINGLE_STATEMENT)


That's always false, so that we never close the connection, right?

DimCitus · 2021-03-24T14:25:53Z

src/bin/pg_autoctl/service_keeper.c

+	/* Finally establish a connection for notifications if none present */
+	(void) pgsql_listen(&(keeper->monitor.notificationClient), emptyChannelList);
+


I believe we don't need that here: the monitor_wait_for_state_change at the beginning of the main loop reconnects if needed. We now have a spurious "Lost connection" warning in monitor_wait_for_state_change that we should probably get rid of. That said, I believe it's now expected to have to establish a connection every once in a while from that point.

gkokolatos · 2021-03-24T14:42:11Z

On Wednesday, March 24, 2021 3:29 PM, Dimitri Fontaine wrote:

> @DimCitus requested changes on this pull request.
>
> I did another round of review, focusing on some details that seem to require more attention.

Excellent, thank you!

> In src/bin/pg_autoctl/keeper_pg_init.c:
>
> > +			char *emptyChannelList[] = { NULL };
> > +			(void) pgsql_listen(&(monitor->notificationClient), emptyChannelList);
>
> By the way, how does it work now? We are listening to no channel at all, how do we expect to get a notification that some state change happened in our formation and group, or even for our own node?

It works as it did before. The following call, to monitor_wait_for_state_change, does this right at the top:

```
PGconn *connection = monitor->notificationClient.connection;

WaitForStateChangeNotificationContext context = {
	(char *) formation,
	groupId,
	nodeId,
	false               /* stateHasChanged */
};

char *channels[] = { "state", NULL };

if (connection == NULL)
{
	log_warn("Lost connection.");
	return false;
}
```

which means that it demands to have a connection open. The call to pgsql_listen() with an empty list does exactly that: it opens a connection. It is needed because the lower level function pgsql_open_connection() is not exposed. Of course we could expose that one, but you had objected when I suggested it.

> In src/bin/pg_autoctl/pgsql.c:
>
> > +	pgsql->connectionStatementType = PGSQL_CONNECTION_MULTI_STATEMENT;
> > +	connection = pgsql_open_connection(pgsql);
> > +	if (connection == NULL)
>
> Could we check for "transaction already in progress" errors and report it as a BUG, possibly forcing an exit?

I think that pgsql_open_connection() should take care of that. Or doesn't it?

> In src/bin/pg_autoctl/service_keeper.c:
>
> > +				monitor->notificationClient.connectionStatementType ==
> > +				PGSQL_CONNECTION_SINGLE_STATEMENT)
>
> That's always false, so that we never close the connection, right?

It is set as single statement from the listen call earlier. So it is always true. It is added in the code in order to enforce expectations.

> In src/bin/pg_autoctl/service_keeper.c:
>
> > +	/* Finally establish a connection for notifications if none present */
> > +	(void) pgsql_listen(&(keeper->monitor.notificationClient), emptyChannelList);
> > +
>
> I believe we don't need that here: the `monitor_wait_for_state_change` at the beginning of the main loop reconnects if needed. We now have a spurious "Lost connection" warning in `monitor_wait_for_state_change` that we should probably get rid of. That said, I believe it's now expected to have to establish a connection every once in a while from that point.

That was necessary in the first iteration of the code. Let me recheck it. Thanks.


gkokolatos · 2021-03-24T15:28:46Z


I did recheck, and it is needed because keeper_node_active_loop calls monitor_wait_for_state_change, which by itself expects an open connection. Prior to the PR, the connection would be closed and then reopened (intentionally leaked) in keeper_node_active. The call was added there, instead of in the tight loop before monitor_wait_for_state_change(), in order to closely resemble the previous location of opening connections. Now I am a bit curious: did you successfully manage to run a node without that specific pgsql_listen() call? I failed every single time in every scenario, but if you did manage, then it would be very useful to try to reproduce.
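
The dependency described here can be pictured with the small sketch below. It is an assumption-laden illustration, not pg_autoctl code: `SketchClient` and the `sketch_*` functions are hypothetical names, a boolean replaces the PGconn pointer, and the only behaviour borrowed from the thread is that the wait gives up at once when no connection is open.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the notification client; no libpq involved. */
typedef struct SketchClient
{
	bool connected;     /* replaces the PGconn pointer check */
	int openCount;      /* number of connections opened so far */
} SketchClient;

/* Open a connection only when none is present. */
static void
sketch_ensure_connection(SketchClient *client)
{
	if (!client->connected)
	{
		client->connected = true;
		client->openCount++;
	}
}

/*
 * Mirrors the guard quoted in this thread: without an open connection the
 * wait gives up immediately and reports failure.
 */
static bool
sketch_wait_for_state_change(SketchClient *client)
{
	if (!client->connected)
	{
		return false;   /* the real code logs "Lost connection." here */
	}
	return true;
}

/* One loop iteration: connect if needed, wait, then close. */
static bool
sketch_loop_iteration(SketchClient *client)
{
	sketch_ensure_connection(client);

	bool ok = sketch_wait_for_state_change(client);

	client->connected = false;  /* release the connection after the wait */
	return ok;
}
```

Whichever function ends up owning the "connect if needed" step, the wait only succeeds when that step has run first, which is why removing the call without moving it elsewhere makes the node fail.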


DimCitus · 2021-03-30T08:56:37Z

I think this PR needs to be rebased/merged on top of the current master branch again. Recent changes in the release numbers (expected files for the monitor extension should mention extension version 1.5 now) and the Pyroute2 integration in the test framework are the main changes.


gkokolatos · 2021-03-30T09:52:35Z

On Tuesday, March 30, 2021 10:56 AM, Dimitri Fontaine wrote:

> I think this PR needs to be rebased/merged on top of the current master branch again. Recent changes in the release numbers (expected files for the monitor extension should mention extension version 1.5 now) and the Pyroute2 integration in the test framework are the main changes.

Sure. Force pushed a rebased version.


DimCitus · 2021-03-30T11:53:19Z

Thanks! I'm not sure why we still have the following error:

======================================================================
FAIL: test_extension_update.test_001_update_extension
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/python/3.7.6/lib/python3.7/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/travis/build/citusdata/pg_auto_failover/tests/test_extension_update.py", line 41, in test_001_update_extension
    eq_(results, [("dummy",)])
AssertionError: [('1.5',)] != [('dummy',)]

----------------------------------------------------------------------

Can you reproduce it locally and get more logs maybe? I will have a look later. Is it possible that the changes in this PR are somehow preventing the automated ALTER EXTENSION UPDATE mechanics at startup of the pg_autoctl of the monitor?

To repro, with a docker environment available, simply do:

make TEST=test_extension_update run-test

gkokolatos · 2021-03-30T11:59:27Z

On Tue, Mar 30, 2021 at 13:53, Dimitri Fontaine wrote:

> Thanks! I'm not sure why we still have the following error:
>
> ======================================================================
> FAIL: test_extension_update.test_001_update_extension
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/opt/python/3.7.6/lib/python3.7/site-packages/nose/case.py", line 198, in runTest
>     self.test(*self.arg)
>   File "/home/travis/build/citusdata/pg_auto_failover/tests/test_extension_update.py", line 41, in test_001_update_extension
>     eq_(results, [("dummy",)])
> AssertionError: [('1.5',)] != [('dummy',)]
>
> ----------------------------------------------------------------------
>
> Can you reproduce it locally and get more logs maybe? Will have a look later, is it possible that the changes in this PR are somehow preventing the automated ALTER EXTENSION UPDATE mechanics at startup of the pg_autoctl of the monitor?

I saw that. I am currently looking at it. I will give an update by eod if I am still in doubt.


Addresses CI failure of test_extension_update case.

gkokolatos · 2021-03-30T15:02:45Z

The latest commit seems to address the test_extension_update.test_001_update_extension failure.

There seems to be one more failure as per https://travis-ci.com/github/citusdata/pg_auto_failover/jobs/494941418
which seems to be indent related. Apologies for the chatter.

DimCitus · 2021-03-30T15:20:31Z

It seems Travis is confused, can you push a meaningless commit to trigger another build?

DimCitus · 2021-03-30T17:18:46Z

Merged! Thanks for your contribution!

gkokolatos · 2021-03-31T06:37:22Z

On Tuesday, March 30, 2021 7:19 PM, Dimitri Fontaine wrote:

> Merged! Thanks for your contribution!

Awesome! Thank you for carrying it across the line.


DimCitus self-requested a review February 5, 2021 10:29

DimCitus added the bug Something isn't working label Feb 5, 2021

DimCitus requested a review from JelteF February 5, 2021 12:09

DimCitus previously approved these changes Feb 5, 2021

DimCitus added this to the Sprint 2021 W4 W5 milestone Feb 5, 2021

DimCitus requested changes Feb 5, 2021

DimCitus modified the milestones: Sprint 2021 W4 W5, Sprint 2021 W6 W7 Feb 8, 2021

DimCitus modified the milestones: Sprint 2021 W6 W7, Sprint 2021 W9 W10 Mar 1, 2021

DimCitus modified the milestones: Sprint 2021 W9 W10, Sprint 2021 W11 W12 Mar 15, 2021

DimCitus requested changes Mar 15, 2021

gkokolatos mentioned this pull request Mar 16, 2021

Do not leak psycopg2 connections during testing which can lead to fla… #628

Merged

gkokolatos force-pushed the pg_autoctl-open_connection branch from 78c1e76 to 8b61e50 Compare March 19, 2021 15:55

DimCitus mentioned this pull request Mar 23, 2021

Fix/pgsql connection leaks #637

Closed

gkokolatos force-pushed the pg_autoctl-open_connection branch from 8b61e50 to ccec708 Compare March 24, 2021 09:32

DimCitus requested changes Mar 24, 2021

DimCitus modified the milestones: Sprint 2021 W11 W12, Sprint 2021 W13 W14 Mar 29, 2021

Georgios Kokolatos added 6 commits March 30, 2021 09:32

Attempt to address review comments

118aa7d

Rename emptyChannel to emptyChannelList as per review comments

087135b

Perform an indent run

256d2c4

Use monitor.notificationClient instead of pgsql where appropriate and close that connection when appropriate

ccf33e8

Address review comments.

0ed0cd0

* Remove confusing pgsql_listen calls in favour of a new user friendly call
* Use the same tight conn loop around monitor_wait_for_state_change everywhere
* Correctly close the connections after it.
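
The pattern this commit message describes (listen once, poll in a tight loop, then close the connection on every exit path) might look roughly like the sketch below. It is only an illustration, not the code from this PR: `SketchClient` and every `sketch_*` function are hypothetical stand-ins that merely count opens and closes, so that a leak would show up as a mismatch between the two counters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the client struct; not the real pg_autoctl type. */
typedef struct SketchClient
{
	bool connected;     /* true while the (fake) connection is open */
	int opened;         /* number of connections opened so far */
	int closed;         /* number of connections closed so far */
} SketchClient;

/* Hypothetical: open the notification connection only if none is present. */
bool
sketch_listen(SketchClient *client, char *channels[])
{
	assert(client != NULL && channels != NULL);

	if (!client->connected)
	{
		client->connected = true;
		client->opened++;
	}
	return true;
}

/* Hypothetical: pretend the expected state is reached on poll number readyAt. */
bool
sketch_wait_for_state_change(SketchClient *client, int poll, int readyAt)
{
	return client->connected && poll >= readyAt;
}

/* Hypothetical: close the connection if it is open. */
void
sketch_finish(SketchClient *client)
{
	if (client->connected)
	{
		client->connected = false;
		client->closed++;
	}
}

/* Listen once, poll in a tight loop, and always close before returning. */
bool
sketch_wait_and_close(SketchClient *client, int maxPolls, int readyAt)
{
	char *emptyChannelList[] = { NULL };
	bool reached = false;

	if (!sketch_listen(client, emptyChannelList))
	{
		return false;
	}

	for (int poll = 1; poll <= maxPolls && !reached; poll++)
	{
		reached = sketch_wait_for_state_change(client, poll, readyAt);
	}

	/* Single exit path after the loop: the connection is closed either way. */
	sketch_finish(client);
	return reached;
}
```

Keeping the close call on the single path after the loop means that both the success case and the give-up case leave `opened` equal to `closed`, which is the property the leak fix is after.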

gkokolatos force-pushed the pg_autoctl-open_connection branch from 9462288 to 0ed0cd0 Compare March 30, 2021 09:49

Extension version functions require a multi statement connection

2671c3e

Addresses CI failure of test_extension_update case.

Indent run

104d9bb

DimCitus approved these changes Mar 30, 2021

DimCitus merged commit 8351581 into hapostgres:master Mar 30, 2021

		char *emptyChannelList[] = { NULL };

		(void) pgsql_listen(&(monitor->notificationClient), emptyChannelList);

		monitor->notificationClient.connectionStatementType ==
		PGSQL_CONNECTION_SINGLE_STATEMENT)

		/* Finally establish a connection for notifications if none present */
		(void) pgsql_listen(&(keeper->monitor.notificationClient), emptyChannelList);
