Attempt to handle a stream being silently killed in the background #2121

Open · brianc wants to merge 2 commits into master

Conversation

@brianc (Owner) commented Feb 26, 2020

I think this will fix the issue outlined in #2112

I pounded out this initial version pretty quickly, and I'm not 100% sure it'll fix the issue yet, since I'm not sure my test exactly replicates what's going on in lambda. In the test I'm "cleanly" severing the connected stream to the backend, which doesn't emit an error anywhere, so the client/pool still thinks it's okay.

Could someone from #2112 try applying this patch to their code in lambda and see if it fixes the issue? If so I'll polish this up, write a couple more tests, and get a patch version fix out.
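
Here's a rough sketch of the kind of "clean severing" I mean — this is an illustration, not the actual test in this PR. It reaches into pg's internal client.connection.stream socket, which isn't public API, and depending on the pg/node versions the severed socket may or may not surface an 'error' event:

// Sketch only: simulate a backend connection dying silently in the background.
const { Pool } = require('pg')

async function main() {
  const pool = new Pool()

  const client = await pool.connect()
  client.release()

  // Kill the underlying socket without an error argument: no 'error' is
  // emitted, so the pool still believes this idle client is healthy.
  client.connection.stream.destroy()

  // Without the fix, the pool hands the dead client back out here.
  console.log(await pool.query('SELECT NOW()'))
  await pool.end()
}

main().catch(console.error)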

@briandamaged commented on the diff:

await pool.query('SELECT NOW()')
await pool.query('SELECT NOW()')
await pool.query('SELECT NOW()')

Perhaps you should also add something like:

const client2 = await pool.connect();
try {
  // Verify that `client` was removed from the pool and replaced w/ a new client
  expect(client2).to.not.equal(client);
} finally {
  await client2.release();
}


(Otherwise, it's possible that client just magically fixed itself through some other means. So, adding this check helps to rule out that possibility)

@brianc (Owner Author) replied:

yah that's not a bad idea! I'd like to hold off on merging this until I get some confirmation it's actually fixing things in aws lambda. It also unfortunately doesn't work w/ the native bindings yet, so I'll need to somehow figure that out.

@briandamaged replied:

Sounds good.

As I mentioned in this thread, I suspect that both "pg" and "knex" were encountering the same issue within their respective pool implementations. So, we'll most likely need to implement a similar fix on the "knex" side.

@briandamaged commented on the diff:

// remove this client, it's died in the background - this can happen
// in aws lambdas - they go idle and the streams are closed without an error
if (client.connection && client.connection.stream.readyState !== 'open') {
  this._remove(client)

In the current logic: if it notices that the next-available client is already dead, then it removes it from the pool. Afterwards, it instantiates a new client to replace it.

Would there be any advantage to looping on the client eviction step? In pseudo-code:

while (this._idle.length) {
  const client = this._idle.pop();

  // ... Other important details I'm ignoring for the moment...

  if (isDead(client)) {
    this._remove(client);
  } else {
    return client;
  }
}

if (!this._isFull()) {
  return this.newClient(...);
}

throw new Error('unexpected condition');

This seems like it helps in 2 ways:

  1. It clears out the dead clients more quickly, and
  2. It avoids the need to wait for this.newClient(..) when there is already another client available.
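
For reference, here's a more concrete version of that sketch against pg-pool's internals as I understand them (this._idle holds IdleItem wrappers carrying the client, its idleListener, and an idle timeoutId) — still just a sketch, not this PR's actual change:

while (this._idle.length) {
  const idleItem = this._idle.pop()
  clearTimeout(idleItem.timeoutId)
  const client = idleItem.client

  if (client.connection && client.connection.stream.readyState !== 'open') {
    // died silently in the background: evict it and keep scanning
    this._remove(client)
  } else {
    // found a live client: hand it out immediately
    return this._acquireClient(client, pendingItem, idleItem.idleListener, false)
  }
}

if (!this._isFull()) {
  return this.newClient(pendingItem)
}

throw new Error('unexpected condition')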

@brianc (Owner Author) replied:

probably, yeah - good suggestions!

@592da commented:

hi guys, I want to test this branch, but npm says it's missing the version... tried to fork it, put the version in (https://github.com/592da/node-postgres), and then compiling failed: Cannot find module 'pg'

@briandamaged replied:

Hmm... I suspect that's because this is a mono-repo that is built via Lerna. You might need to build it locally first and then link to the build artifacts. (@brianc might know another approach. I haven't actually worked with this codebase before.)

I don't remember the exact build steps for Lerna off the top of my head, but I think it's something like:

  1. Install Lerna
  2. Navigate to the root of the repo
  3. Run the command: lerna bootstrap

@brianc (Owner Author) replied:

> hi guys, I want to test this branch, but npm says it's missing the version... tried to fork it, put the version in (https://github.com/592da/node-postgres), and then compiling failed: Cannot find module 'pg'

if you wanna test it, it's probably easiest to just apply the diff of the one file I changed (index.js inside your node_modules/pg-pool folder), or just copy/paste the changes.

@592da replied:

thanks for your reply @briandamaged, unfortunately it didn't work, maybe because it's also imported from typeorm

@592da commented:

@brianc, having the same issue...

2020-02-26T21:38:37.302Z	da5bbf66-06b7-4c69-9c85-ab84b1b777a1	INFO	Error: Connection terminated unexpectedly
    at Connection.<anonymous> (/var/task/src/rest/webpack:/node_modules/pg/lib/client.js:255:1)
    at Object.onceWrapper (events.js:312:28)
    at Connection.emit (events.js:223:5)
    at Connection.EventEmitter.emit (domain.js:475:20)
    at Socket.<anonymous> (/var/task/src/rest/webpack:/node_modules/pg/lib/connection.js:139:1)
    at Socket.emit (events.js:228:7)
    at Socket.EventEmitter.emit (domain.js:475:20)
    at endReadableNT (_stream_readable.js:1185:12)
    at processTicksAndRejections (internal/process/task_queues.js:81:21)

@briandamaged commented on the diff:

return this._acquireClient(client, pendingItem, idleListener, false)
// remove this client, it's died in the background - this can happen
// in aws lambdas - they go idle and the streams are closed without an error
if (client.connection && client.connection.stream.readyState !== 'open') {

I'm seeing some mention online that stream.readyState is deprecated/unsupported. It might be safer to do this instead:

(client.connection.stream.readable && client.connection.stream.writable)

When experimenting w/ things locally, I'm finding that these properties change as soon as you call stream.end().

@brianc (Owner Author) replied:

yeah, those properties don't exist in all versions of node that pg supports. stream.readyState does exist in all supported versions.

@brianc (Owner Author) replied:

> When experimenting w/ things locally, I'm finding that these properties change as soon as you call stream.end().

you should check if readyState changes as well (I think it does) - reason is any official patch is gonna need to work in node 8, 10, 12, and 13. So if it's "officially deprecated" in node but still works, I'd use it (as it probably wasn't deprecated in older versions of node). Typically I do that, write a bunch of tests for the feature, and then if a test breaks on an edge release of node I either decide to drop old version support at that point or make a compatibility branch in the code that will handle both old and new versions of node.

@briandamaged replied:

Sure -- I'll give that a try after the initial experiment. Speaking of which: you can find the experiment within the knex pooling here:

https://github.com/briandamaged/knex/tree/experiment/aws-disconnections

And, here's the link to the specific commit:

briandamaged/knex@1490611

So, I think this roughly aligns w/ your implementation. (Aside from the readyState piece)

@brianc (Owner Author) replied:

yah totally, actually I don't wanna burden you w/ backwards compatibility - if you find a way to make it work in lambda w/ a version of node, I'll do my best to back port it and "production-ify" it for this library. Mostly it's just hard for me to isolate & test since I don't have a lambda nor much experience w/ them, so me setting all that testing infrastructure up is a huge time sink. But if you figure out the root cause, I'll help any way I can to get a fix out.

@briandamaged replied:

Ah -- yeah, I currently don't have a way to recreate the issue, either. I just know that Knex's current implementation relies upon Events to detect the closed connections. So, this just seems like a possible way to work around that issue.

Hopefully we'll hear some good news from this experiment! If so, then we can figure out how to formulate the concepts into something that is production-worthy. (One thought: we might even need to swap out implementations depending upon Node version. Ex: use readable / writable if the Node version supports those properties; otherwise, fall back to readyState)
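
For instance, something like this hypothetical helper (a sketch only; isSocketUsable is my name for it, and the readyState / readable / writable semantics differ across Node versions):

// Sketch of the version-dependent liveness check described above.
function isSocketUsable(stream) {
  if (typeof stream.readable === 'boolean' && typeof stream.writable === 'boolean') {
    // Newer Node: both halves of the socket must still be open
    return stream.readable && stream.writable
  }
  // Older Node: fall back to the (deprecated) readyState property
  return stream.readyState === 'open'
}

The pool check would then read: if (client.connection && !isSocketUsable(client.connection.stream)).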

@brianc (Owner Author) commented Feb 27, 2020

I put this up as a proof of concept fix. If this doesn't fix the issue please lemme know & I'll close it. If it does fix the issue, let me know & I'll clean it up and land it for real.

If this doesn't end up fixing it I hope this can be a starting point for someone to do more investigation and eventually submit a pull request. As I don't have a lambda or a way to reproduce the problem I don't think my time is best spent trying to track this particular issue down. So...I'm open to a PR to fix the issue & supplying ideas and guidance but I'm likely not going to push on fixing this issue myself until I take care of other higher priority issues for the project.

@michaelseibt commented:
I'll give this a try. Since we're not able to reproduce the issue reliably, it could take a while before I can get back to you. Thanks for your patience!

@michaelseibt commented Feb 28, 2020

Here is what I tried: I've added a couple of logging statements around the lines of code touched by this PR:

console.log("[PG-NODE] Client state is", client.connection.stream.readyState);
      if (client.connection && client.connection.stream.readyState !== 'open') {
        console.warn("[PG-NODE] Removing faulty client");
        this._remove(client)
      } else {
        const idleListener = idleItem.idleListener
        console.warn("[PG-NODE] Returning client");
        return this._acquireClient(client, pendingItem, idleListener, false)
      }

Built it, shipped it, waited. After a while, these logs came up:

08:47:13 INFO	[PG-NODE] Client state is open
08:47:13 WARN	[PG-NODE] Returning client
08:56:14 INFO	[PG-NODE] Client state is open
08:56:14 WARN	[PG-NODE] Returning client
08:56:14 ERROR	Uncaught Exception [...] Connection terminated unexpectedly
08:59:18 INFO	[PG-NODE] Client state is open
08:59:18 WARN	[PG-NODE] Returning client
... repeats a couple of times until ...
09:09:24 INFO	[PG-NODE] Client state is open
09:09:24 WARN	[PG-NODE] Returning client
09:09:24 ERROR	Uncaught Exception [...] Connection terminated unexpectedly

Is there something else I could provide to help debug this?

Edit: I noticed that two different Lambdas (not two instances of the same Lambda) had the same error logged at the same time: 09:09:24.78 and 09:09:24.518

Two messages are also logged in RDS:

2020-02-28 09:24:44 UTC:[IP_REDACTED](42612):[DB_REDACTED]:[29454]:LOG: could not receive data from client: Connection reset by peer
2020-02-28 09:24:44 UTC:[IP_REDACTED](59061):[DB_REDACTED]:[29453]:LOG: could not receive data from client: Connection reset by peer

@592da commented Mar 5, 2020

any news regarding this?

@osdiab commented Mar 31, 2020

looks like this PR is stalling? any way to help?

@brianc (Owner Author) commented Mar 31, 2020

> looks like this PR is stalling? any way to help?

> any news regarding this?

I don't really use lambda - the best way to help would be to contribute a PR to fix the issue - without me having access to lambda I'm likely not going to be very helpful in finding a fix. Will definitely be down to review & merge any fix if you find it!

@dgobaud commented May 7, 2020

we are having this same problem in Google Cloud, it seems

@bryantbiggs commented:
bump - hoping for some resolution 😬 🙏🏽

@mikelax commented Nov 5, 2021

I am also seeing the same issues reported in #2112, so I would love to help out however I can if this is a possible fix. Given that more than a year has elapsed, I was wondering if this fix becomes easier if you don't need to support older versions of Node.js that are now out of support.

You should now be able to say the minimum supported version is v12, and drop support for v8 and v10 (lts/dubnium).
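
With v12 as the floor, the liveness check could likely drop readyState entirely in favor of the stable stream properties — again just a sketch, not a committed change:

// Hypothetical dead-client check for Node >= 12, where destroyed,
// readable, and writable are all stable socket properties.
function isDead(stream) {
  return stream.destroyed || !(stream.readable && stream.writable)
}

// in pg-pool's checkout path:
if (client.connection && isDead(client.connection.stream)) {
  this._remove(client)
}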
