Gradual rise in postgres processes, leading to memory exhaustion #2075
Comments
A few stats, taken 60s apart, from node n1:
Ah, I think I might have found a possible culprit for memory use: my last test left 1593 postgres processes running:
Doing a little more experimenting: it looks like the number of postgres processes gradually rises through the course of a test, at a rate of roughly 1.5 processes per second across the whole cluster. Killing the client doesn't cause those processes to exit, and they also survive the death of … Collectively, ~120 postgres processes on one node used about 500MB of memory (as measured by …).
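For anyone who wants to reproduce this kind of measurement, the sketch below is one way to count postgres backends and total their resident memory by walking /proc. It is a hypothetical diagnostic, not something from this issue, and summing per-process RSS double-counts shared pages, so it overstates the true footprint.

```c
/* Hypothetical diagnostic (not from this issue): count processes whose name is
 * "postgres" and total their resident set size by walking /proc.  Linux-only.
 * Caveat: per-process RSS double-counts shared pages, so the sum overstates
 * the true footprint. */
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    DIR *proc = opendir("/proc");
    if (!proc) {
        perror("opendir /proc");
        return 1;
    }

    long count = 0, rss_kb = 0;
    struct dirent *ent;

    while ((ent = readdir(proc)) != NULL) {
        if (!isdigit((unsigned char)ent->d_name[0]))
            continue;                          /* only numeric PID entries */

        char path[300], line[256];
        snprintf(path, sizeof path, "/proc/%s/status", ent->d_name);
        FILE *f = fopen(path, "r");
        if (!f)
            continue;                          /* process already exited */

        int is_postgres = 0;
        long vmrss = 0;
        while (fgets(line, sizeof line, f)) {
            if (strncmp(line, "Name:", 5) == 0 && strstr(line, "postgres"))
                is_postgres = 1;
            else if (strncmp(line, "VmRSS:", 6) == 0)
                sscanf(line + 6, "%ld", &vmrss); /* reported in kB */
        }
        fclose(f);

        if (is_postgres) {
            count++;
            rss_kb += vmrss;
        }
    }
    closedir(proc);

    printf("postgres processes: %ld, total RSS: %ld MB\n", count, rss_kb / 1024);
    return 0;
}
```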
Initial finding:
it looks like when yb-tserver terminates, it terminates the postmaster process, but the per-connection postgres processes are not being terminated.
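One possible direction, sketched below under the assumption that the postmaster is a process-group leader (it calls setsid() at startup on most platforms): have the tserver signal the whole process group rather than just the postmaster PID, so the per-connection backends are reached even if the postmaster can no longer relay the signal. This is only an illustration of the idea, not YugaByte's actual shutdown code.

```c
/* Sketch only (not YugaByte's actual shutdown path): terminate the postmaster
 * and its per-connection backends by signalling the whole process group.
 * Assumes the postmaster is a process-group leader (it calls setsid() at
 * startup on most platforms), so kill(-pid, ...) reaches every backend. */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s <postmaster-pid>\n", argv[0]);
        return 2;
    }
    pid_t postmaster = (pid_t)atol(argv[1]);

    /* SIGTERM asks the postmaster for a "smart" shutdown; while it is alive
     * it relays the signal to its children itself. */
    if (kill(postmaster, SIGTERM) != 0)
        perror("kill(postmaster, SIGTERM)");

    /* Also signal the group, so the backends are reached even if the
     * postmaster has already died and can no longer relay anything. */
    if (kill(-postmaster, SIGTERM) != 0) {
        perror("kill(process group, SIGTERM)");
        return 1;
    }
    return 0;
}
```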
I've been thinking about this a bit, and even if yb-tserver did kill every postgres process when it exited, that wouldn't solve the underlying problem, would it? Since this problem persists even after killing the Jepsen process (which is where all the clients are), it feels like there should be some code that detects that the connection is closed and terminates the postgres worker process, yeah? Is it possible that the postgres processes are locked up in some way that's preventing them from exiting correctly, perhaps because of their 'idle in transaction' state?
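For reference, the usual way a per-connection server process notices a vanished client is that the socket becomes readable and the read returns 0 (EOF); when a client process is killed, the kernel closes its sockets, so the FIN still arrives. The sketch below shows that check in isolation; it is not the PostgreSQL or YugaByte code, just the underlying mechanism.

```c
/* Sketch of client-disconnect detection on a connected socket: if the socket
 * is readable and a peek returns 0 bytes, the peer has closed (EOF).  The
 * main() below demonstrates it with a socketpair standing in for the client
 * connection. */
#include <poll.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if the peer closed the connection, 0 if it still looks alive,
 * -1 on error.  A real backend notices the EOF in its normal read loop
 * rather than through a separate probe like this. */
static int client_has_disconnected(int sock_fd) {
    struct pollfd pfd = { .fd = sock_fd, .events = POLLIN };

    int ready = poll(&pfd, 1, 0);            /* zero timeout: just a probe */
    if (ready < 0)
        return -1;
    if (ready == 0)
        return 0;                            /* nothing pending, still idle */

    char buf[1];
    ssize_t n = recv(sock_fd, buf, sizeof buf, MSG_PEEK);
    if (n == 0)
        return 1;                            /* orderly FIN from the peer */
    if (n < 0)
        return -1;                           /* connection reset, etc. */
    return 0;                                /* real data is waiting */
}

int main(void) {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
        perror("socketpair");
        return 1;
    }

    printf("before close: disconnected=%d\n", client_has_disconnected(sv[0]));
    close(sv[1]);                            /* the "client" goes away */
    printf("after close:  disconnected=%d\n", client_has_disconnected(sv[0]));

    close(sv[0]);
    return 0;
}
```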
Per the Postgres docs (https://www.postgresql.org/docs/11/server-shutdown.html), SIGKILL is best avoided when shutting down the server: it kills the postmaster without letting it relay the signal to its subprocesses, so the individual backends may have to be killed by hand. We use SIGKILL in two places currently, and we should fix both. In this case, however, it does not look like the second one is the cause, because we log when we send SIGKILL to the process, and I don't see that log line in the uploaded logs.
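A hedged sketch of the "fix both" direction: shut the postmaster down with the documented signals first and fall back to SIGKILL only after a grace period. The signal meanings (SIGINT for a fast shutdown, SIGQUIT for an immediate one) come from the PostgreSQL docs linked above; the escalation logic itself is illustrative, not the actual yb-tserver code.

```c
/* Sketch of an escalating shutdown: ask with the documented signals first and
 * reach for SIGKILL only as a last resort.  Per the PostgreSQL docs, SIGINT
 * requests a fast shutdown and SIGQUIT an immediate one; the escalation here
 * is illustrative, not the actual yb-tserver implementation. */
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Wait up to timeout_sec for a direct child to exit; returns 1 if it did. */
static int wait_for_exit(pid_t pid, int timeout_sec) {
    for (int i = 0; i < timeout_sec; i++) {
        pid_t r = waitpid(pid, NULL, WNOHANG);
        if (r == pid || (r < 0 && errno == ECHILD))
            return 1;                        /* exited (or already reaped) */
        sleep(1);
    }
    return 0;
}

static void shutdown_child_postmaster(pid_t pid) {
    kill(pid, SIGINT);                       /* fast shutdown: abort, exit */
    if (wait_for_exit(pid, 10))
        return;

    kill(pid, SIGQUIT);                      /* immediate shutdown */
    if (wait_for_exit(pid, 5))
        return;

    fprintf(stderr, "pid %d ignored SIGINT/SIGQUIT, sending SIGKILL\n",
            (int)pid);
    kill(pid, SIGKILL);                      /* last resort only */
    waitpid(pid, NULL, 0);
}

int main(void) {
    pid_t child = fork();
    if (child < 0) {
        perror("fork");
        return 1;
    }
    if (child == 0) {                        /* stand-in for the postmaster */
        pause();
        _exit(0);
    }
    shutdown_child_postmaster(child);
    return 0;
}
```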
On Linux, we tell tserver to send SIGINT to its child process when it terminates:
Some PostgreSQL links on "idle in transaction (aborted)":
From the second link:
Presumably, all client processes have terminated by this point, so all the connections should be closed. We need to double-check that we are detecting this and stopping the backend properly.
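The mechanism referred to here is Linux's parent-death signal, prctl(PR_SET_PDEATHSIG, ...). Below is a minimal sketch of how a child arms it, including the usual check for the race where the parent has already exited; it shows the general mechanism rather than the YugaByte source.

```c
/* Minimal sketch of Linux's parent-death signal: the child asks the kernel to
 * deliver SIGINT to it when its parent dies.  Note the setting is cleared in
 * children created by a further fork(), so each process must arm it for
 * itself. */
#include <signal.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>

int main(void) {
    pid_t parent_pid = getpid();             /* remembered for the race check */

    pid_t child = fork();
    if (child < 0) {
        perror("fork");
        return 1;
    }

    if (child == 0) {
        /* Child: die with SIGINT when the parent goes away. */
        if (prctl(PR_SET_PDEATHSIG, SIGINT) != 0) {
            perror("prctl(PR_SET_PDEATHSIG)");
            _exit(1);
        }
        /* Close the race: the parent may have exited before prctl() ran, in
         * which case we were already reparented and must act on it now. */
        if (getppid() != parent_pid)
            raise(SIGINT);

        pause();                             /* stand-in for the work loop */
        _exit(0);
    }

    /* Parent: exit straight away; the kernel then signals the child. */
    printf("parent exiting; child %d should now receive SIGINT\n", (int)child);
    return 0;
}
```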
This issue is indeed present with v1.3.1.0, but it is no longer reproducible with
@frozenspider, let's try to figure out which commit fixed the issue.
Thank you very much, @frozenspider, for confirming the commit!
Long-running Jepsen append tests (~1200 seconds) of YugaByte DB 1.3.1.0 have seen repeated tserver crashes, due to both file limits and memory exhaustion. There are no forced crashes, partitions, or other nemesis activities in this test: it's just a healthy happy cluster trying to serve transactions. I've attached memory, rpc, and thread stats from the admin interface for the first node in one such test, while it was allocating ~5MB/sec of memory, but before it'd crashed.
Full Jepsen & node logs (16GB uncompressed)
stats.tar.gz
The workload in question is a variant of the append test, where our reads perform a select on an un-indexed secondary key field (k2), rather than selecting by primary key. Updates still use primary keys in this particular test. You can reproduce this issue using Jepsen 0f46d2ea72b437a7c851bdce71757feaf2a2092f, and running (e.g. on EC2):
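The actual Jepsen command has been omitted above. Independent of that, the query shape the paragraph describes looks roughly like the libpq sketch below; the table, values, and connection settings are hypothetical, and only the un-indexed secondary key column k2 comes from the issue.

```c
/* Illustration only of the query shape described above, using libpq: reads go
 * through the un-indexed secondary key k2, while appends still update via the
 * primary key.  The table, column v, values, and connection settings are
 * hypothetical; only k2 comes from the issue.  Build with: cc append.c -lpq */
#include <libpq-fe.h>
#include <stdio.h>
#include <stdlib.h>

static void die(PGconn *conn, const char *msg) {
    fprintf(stderr, "%s: %s\n", msg, PQerrorMessage(conn));
    PQfinish(conn);
    exit(1);
}

int main(void) {
    /* 5433 is the usual YSQL port; adjust host/db/user for your cluster. */
    PGconn *conn = PQconnectdb("host=127.0.0.1 port=5433 dbname=jepsen");
    if (PQstatus(conn) != CONNECTION_OK)
        die(conn, "connection failed");

    /* Read path: select by the un-indexed secondary key k2 (forces a scan). */
    PGresult *res = PQexec(conn, "SELECT k, v FROM append_test WHERE k2 = 42");
    if (PQresultStatus(res) != PGRES_TUPLES_OK)
        die(conn, "read by k2 failed");
    PQclear(res);

    /* Write path: the append still updates through the primary key k. */
    res = PQexec(conn, "UPDATE append_test SET v = v || ',7' WHERE k = 42");
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        die(conn, "update by primary key failed");
    PQclear(res);

    PQfinish(conn);
    return 0;
}
```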