Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

node-postgres handles server disconnection differently on macOS and Linux #1942

Open
jmacmahon opened this issue Aug 6, 2019 · 24 comments
Open

Comments

@jmacmahon
Copy link

jmacmahon commented Aug 6, 2019

Hello, the company I work for has been using pg-promise for our database service and we've run into an issue with failover, which we believe is an error in the underlying node-postgres library. We believe the issue is due to the way Linux handles socket timeout events differently to macOS.

Steps to reproduce:

  • Connect to a Postgres server. We used one hosted in AWS RDS with MultiAZ failover enabled.
  • Run a query every N seconds
  • Reboot the server such that the connection is dropped without a TCP FIN packet. We did a reboot with failover in AWS RDS.

Note: we believe this scenario is not specific to RDS, but rather any network outage or server failure which does not send a TCP FIN packet.

Expected outcome and actual outcome on macOS:

  • The next query fails and the failing client is removed from the pool.
  • Subsequent queries use a new client which tries to establish a fresh connection.
  • When the server reboot/failover is complete, these queries will succeed.

Actual outcome on Linux:

  • The next query fails, but the bad client is not removed from the pool.
  • Subsequent queries try to re-use the bad client and fail even after the reboot/failover is complete.

Detailed order of events

macOS

  • Successful query
  • Reboot DB
  • DB stops listening on original IP
  • Client begins a further query
  • TCP sends query, does not recieve an ACK
  • TCP begins retransmission, does not receive an ACK
  • Approximately 18 seconds after sending, TCP ceases retransmission and sends a RST
  • Client rejects query promise with "Error: read ETIMEDOUT"
  • Immediately after the error the "connection" remains in node-pg's pool
  • Almost immediately afterwards the pool emits an "error" event
  • The "connection" is removed from node-pg's pool
  • Client begins another query
  • DNS fetches the new IP
  • TCP successfully submits and retrieves the query to the new IP

Linux

  • Successful query
  • Reboot DB
  • DB stops listening on original IP
  • Client begins a further query
  • TCP sends query, does not receive an ACK
  • TCP begins retransmission, does not receive an ACK
  • Approximately 18 seconds after sending, TCP ceases retransmission and sends a RST
  • Client rejects query promise with a "Error: Connection terminated unexpectedly"
    error Error: Connection terminated unexpectedly
      at Connection.con.once (/src/node_modules/pg-promise/node_modules/pg/lib/client.js:235:9)
      at Object.onceWrapper (events.js:313:30)
      at emitNone (events.js:106:13)
      at Connection.emit (events.js:208:7)
      at Socket.<anonymous> (/src/node_modules/pg-promise/node_modules/pg/lib/connection.js:131:10)
      at emitNone (events.js:111:20)
      at Socket.emit (events.js:208:7)
      at endReadableNT (_stream_readable.js:1056:12)
      at _combinedTickCallback (internal/process/next_tick.js:138:11)
      at process._tickDomainCallback (internal/process/next_tick.js:218:9)
    
  • After the error, the "connection" remains in node-pg's pool
  • Subsequent queries fail immediately without sending any data with the following error:
    error Error: Client has encountered a connection error and is not queryable
      at process.nextTick (/src/node_modules/pg-promise/node_modules/pg/lib/client.js:500:25)
      at _combinedTickCallback (internal/process/next_tick.js:131:7)
      at process._tickDomainCallback (internal/process/next_tick.js:218:9)
    
@jmacmahon jmacmahon changed the title node-postgres handles server disconnection differently on MacOS and Linux node-postgres handles server disconnection differently on macOS and Linux Aug 6, 2019
@charmander
Copy link
Collaborator

The pool removes clients pretty aggressively (e.g. even when the query error is an error in the query). Are you using pool.query or acquiring clients manually? What does the error handling look like there?

@jmacmahon
Copy link
Author

We're using pg-promise's .any method, which from a quick look at the code looks like it uses the pool.

@charmander
Copy link
Collaborator

It looks like a pg-promise bug to me. https://github.com/vitaly-t/pg-promise/blob/3d2c670af42e9e1d93d4dc23095061550b40a7c3/lib/database.js#L333 The connection is returned to the pool even if the query fails because the connection dropped.

@vitaly-t
Copy link
Contributor

vitaly-t commented Aug 8, 2019

@charmander Could you, please point anywhere in documentation or otherwise, that a dropped connection object cannot be returned back to the pool? It's just this is the first time I hear something like that.

@sehrope
Copy link
Contributor

sehrope commented Aug 8, 2019

@vitaly-t It's in the docs: https://node-postgres.com/api/pool#releaseCallback

The API has been like that for as long as I can remember.

@vitaly-t
Copy link
Contributor

vitaly-t commented Aug 8, 2019

@sehrope What is in the docks about not allowing to release a dropped connection back to the pool? I don't see anything.

@sehrope
Copy link
Contributor

sehrope commented Aug 8, 2019

The pool is not aware of the connection status. It's up to the user to inform the pool that the connection is either still valid (via client.release()) or invalid (via client.release(err)).

From the docs:

The releaseCallback releases an acquired client back to the pool. If you pass a truthy value in the err position to the callback, instead of releasing the client to the pool, the pool will be instructed to disconnect and destroy this client, leaving a space within itself for a new client.

Using pool.query(...) does this automatically. If you manage the connection yourself or provide your own wrapper around pool.connect() then you need to ensure that errant connections are evicted via client.release(err).

A "smarter" wrapper might differentiate between transient errors (ex: UNIQUE key violation) or permanent ones (ex: connection killed), but a simple default is to design applications with queries that do not fail and evict connections from the pool when they do.

@vitaly-t
Copy link
Contributor

vitaly-t commented Aug 8, 2019

@sehrope Thanks! I may need to make some changes in pg-promise now :)

@joelmukuthu
Copy link

joelmukuthu commented Aug 8, 2019

Hi, I also experienced the same issue some time back.

  1. I think the docs should explicitly mention that connection errors (and query errors, from my findings) should be passed to client.release. We can send a PR for that now.
  2. I never looked into it too much the last time, but it seemed that some internal state in Client didn’t get restored properly when a query ended with an error. Was that a wrong finding? If not, does someone know more on how it could be fixed so that clients are still reusable even after an error?

EDIT: @sehrope I found that clients couldn’t be reused even after a UNIQUE key violation, but perhaps that was just something wrong with my setup. I’m afraid I can’t confirm it right now (away from computer).

Best,
Joel

@sehrope
Copy link
Contributor

sehrope commented Aug 8, 2019

  • I think the docs should explicitly mention that connection errors (and query errors, from my findings) should be passed to client.release. We can send a PR for that now.

+1

  • I never looked into it too much the last time, but it seemed that some internal state in Client didn’t get restored properly when a query ended with an error. Was that a wrong finding? If not, does someone know more on how it could be fixed so that clients are still reusable even after an error?

A client should be reusable after running an errant command. If you're within a transaction you'll need to issue a ROLLBACK before the server let's you run anything else. But the connection itself should still be usable and allow you to issue further commands.

If you encounter an error where that is not the case please provide an example as that would be a bug.

@sehrope
Copy link
Contributor

sehrope commented Aug 8, 2019

@sehrope I found that clients couldn’t be reused even after a UNIQUE key violation, but perhaps that was just something wrong with my setup. I’m afraid I can’t confirm it right now (away from computer).

This is likely due to being within a transaction. You'll need to either:

  1. ROLLBACK on failure before returning the connection to the pool.
  2. Wrap the possibly errant block in a SAVEPOINT (and roll back to the save point if it fails and you want to proceed).
  3. Or, change your code to not cause the error (ex: change to using INSERT ... ON CONFLICT ...).

@joelmukuthu
Copy link

@sehrope thank you. It wasn’t in a transaction though; I’ll try to reproduce it in a minimal setup when I get back to computer.

@vitaly-t
Copy link
Contributor

vitaly-t commented Aug 13, 2019

Here's what troubles me... this documentation implies that the client is supposed to be able to identify a dropped connection from all other types of errors, if it is to pass a truthy value into the release function.

But this is far from trivial, especially when executing a transaction, there are so many different types of failures that may occur, and how to filter them out from what actually represents an issue in communication with the server.

Some time ago I had to write a complex function to detect just that:

https://github.com/vitaly-t/pg-promise/blob/e7a7c36eb668a0cb633253ae694ef7c85f08455d/lib/utils/index.js#L205

Which I still use to check whether or not to attempt executing ROLLBACK on a failed transaction.

And this is far from a friendly approach that can be considered for the client-side code.

My understanding thus far, the driver must be able to identify connectivity issues on its own, and mark the client as disconnected, because it is too complex to expect the client to do so. And this is the piece of information that's totally missing in documentation here.

So how should this be approached?

@johanneswuerbach
Copy link
Contributor

johanneswuerbach commented Aug 14, 2019

A while back I proposed brianc/node-pg-pool#119, which we use internally. This won’t catch everything, but at least handle those cases more nicely. Obviously those internal properties shouldn’t be accessed like that, so the client would need an external API for that.

@vitaly-t
Copy link
Contributor

vitaly-t commented Aug 14, 2019

@johanneswuerbach If it does throw away broken connections from the pool in a timely fashion, then it's all we need. And as per my previous post, I fully support that broken connections must be handled between the driver and the pool, and not not by the client. The client should only receive a notification when the connection dies, as it does now (in part), and that's it.

@sehrope
Copy link
Contributor

sehrope commented Aug 14, 2019

@vitaly-t wrote:

Here's what troubles me... this documentation implies that the client is supposed to be able to identify a dropped connection from all other types of errors, if it is to pass a truthy value into the release function.

But this is far from trivial, especially when executing a transaction, there are so many different types of failures that may occur, and how to filter them out from what actually represents an issue in communication with the server.

The simplest approach is to treat any error as non-recoverable and instruct the pool to evict the connection. That will always be correct though it could lead to trashing your pool if errors are common enough in normal routes. For example if you try to create a resource and get a duplicate key error.

A more advanced approach would be to default to discarding on errors, but if you're wrapped in a transaction then attempt a ROLLBACK. If the ROLLBACK succeeds then consider the connection salvaged. Otherwise, discard it. This will handle the previous thrashing situation yet still evict the connection if it is totally broken (ex: TCP error).

Some time ago I had to write a complex function to detect just that:

I wouldn't bother with anything general like that. In theory you could check for a FATAL error from the server or something similar from the internal driver, but much safer and easier to default to evicting and not try to be too clever.

My understanding thus far, the driver must be able to identify connectivity issues on its own, and mark the client as disconnected, because it is too complex to expect the client to do so. And this is the piece of information that's totally missing in documentation here.

You always need a way for the client to inform the driver that the connection should be evicted, even if the connection is still usable. Say you had a route that alters the session in some way that is not recoverable (some SET ... or executing some PL that mutates internal state). You would not want the pool to give back that connection to anyone else so you could do client.release(true) to make sure that does not happen.

If it does throw away broken connections from the pool in a timely fashion, then it's all we need.

The failures could happen at any time so the end user must always deal with errors directly. Even a "test after checkout from pool" could fail between the test and the user's actual DB work. It might work the majority of the time but there's a race condition if it's killed in between the test completing and the user performing the actual work.

@johanneswuerbach
Copy link
Contributor

johanneswuerbach commented Aug 14, 2019

brianc/node-pg-pool#119 throws away the connections when the library user calls client.release and the pg client is not in a queryable state. The queryable state is determined by the driver itself and should already be a good source of truth.

Non-queryable clients directly return errors if you try to call client.query already today, so some of that logic is already out there and tested.

The only thing not detectable so far is returning a client in an open and failed, but not rolled back transaction as that condition is currently not detect by the driver.

@vitaly-t
Copy link
Contributor

I wouldn't bother with anything general like that. In theory you could check for a FATAL error from the server or something similar from the internal driver, but much safer and easier to default to evicting and not try to be too clever.

If we are not clever enough, we may dispense with alive connection while inside a transaction by mistake. When in failure inside a transaction, and before executing ROLLBACK we need to know if it is at all possible to do ROLLBACK on the current connection.

Otherwise, if we just start throwing away live connections, we will end up with transactions locked on the server, without a good reason.

@vitaly-t
Copy link
Contributor

@johanneswuerbach Are you still using pg-promise or switched to the driver directly? I'm asking in case I start playing with some changes for this, whether you would be interested in helping...

@sehrope
Copy link
Contributor

sehrope commented Aug 14, 2019

If we are not clever enough, we may dispense with alive connection while inside a transaction by mistake.

If you're the one handling transaction management then you know if you're in a transaction and should attempt a ROLLBACK for a failed unit of work.

If you're not the one handling transaction management then you're not handling transaction management and should not attempt anything.

Are you trying to infer if the user of your library failed a transaction outside of a library-level helper and trying to recover from it?

When in failure inside a transaction, and before executing ROLLBACK we need to know if it is at all possible to do ROLLBACK on the current connection.

If you're the one managing the transaction then just try to do the ROLLBACK. It either works (connection is good) or it doesn't work (connection is definitely no good).

It's even possible that the connection is in a workable state (i.e. TCP is fine, DB is still open) when you "verify" that a ROLLBACK is possible and it gets killed before you actually ROLLBACK. Best to just try it and see if it worked.

Otherwise, if we just start throwing away live connections, we will end up with transactions locked on the server, without a good reason.

The PostgreSQL server will ROLLBACK any open transactions for closed connections. Otherwise you could permanently lock up any server by running:

BEGIN;
SELECT * FROM some_table FOR UPDATE;
-- Exit the connection...

@vitaly-t
Copy link
Contributor

If you're the one handling transaction management then you know if you're in a transaction and should attempt a ROLLBACK for a failed unit of work.

It is not that simple. See the original issue, which forced me to start using that error verification.

Presumably, it is now fixed. I'm gonna do some good re-testing of this again. And if it is indeed fixed, it does simplify things a bit.

Previously, we could often end up with the driver's query method suddenly dead-lock when trying to execute ROLLBACK while using a broken connection.

I will get back on this after some more testing.

@sehrope
Copy link
Contributor

sehrope commented Aug 14, 2019

@vitaly-t Okay that makes sense as a workaround but only because not bubbling up the connection error was a bug. Assuming that's fixed you shouldn't have to do anything like that and blindly issuing a ROLLBACK on error should work.

@johanneswuerbach
Copy link
Contributor

@vitaly-t no, I'm still using pg-promise and happy to help. Not sure how much time I've, but I can definitely test any changes against our internal test suite, which exercises various failover scenarios.

@vitaly-t
Copy link
Contributor

@jmacmahon At this point I suggest that you follow it up here instead ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants