[YSQL] Gradual rise in postgres processes, leading to memory exhaustion #2075
Long-running Jepsen append tests (~1200 seconds) of YugaByte DB 188.8.131.52 have seen repeated tserver crashes, due to both file limits and memory exhaustion. There are no forced crashes, partitions, or other nemesis activities in this test: it's just a healthy happy cluster trying to serve transactions. I've attached memory, rpc, and thread stats from the admin interface for the first node in one such test, while it was allocating ~5MB/sec of memory, but before it'd crashed.
The workload in question is a variant of the append test, where our reads perform a select on an un-indexed secondary key field (
Ah, I think I might have found a possible culprit for memory use: my last test left 1593 postgres processes running:
Doing a little more experimenting: it looks like the number of postgres processes gradually rises over the course of a test, at a rate of roughly 1.5 processes per second across the whole cluster. Killing the client doesn't cause those processes to exit, and they also survive the death of the yb-tserver itself.
Collectively, ~120 postgres processes on one node used about 500MB of memory (as measured by
It looks like when yb-tserver terminates, it terminates the postmaster process, but the per-connection postgres processes are left running.
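To make the failure mode concrete: if the tserver signals only the postmaster's PID, any backends the postmaster never gets to reap will live on. Here's a minimal C++ sketch of the distinction, assuming the postmaster runs in its own process group (postgres arranges this via setsid()); the PID and signal choice are placeholders, not YugaByte's actual code:

```cpp
#include <signal.h>
#include <sys/types.h>
#include <cstdio>

// Hypothetical postmaster PID; for illustration only.
static const pid_t kPostmasterPid = 12345;

int main() {
  // Signaling just the postmaster: per-connection backends survive if
  // the postmaster dies before relaying the shutdown to its children.
  if (kill(kPostmasterPid, SIGINT) != 0) perror("kill(postmaster)");

  // Signaling the whole process group: the forked backends inherit the
  // postmaster's process group, so killpg() reaches them even when the
  // postmaster itself is already gone.
  if (killpg(kPostmasterPid, SIGINT) != 0) perror("killpg(postmaster group)");
  return 0;
}
```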
I've been thinking about this a bit, and even if yb-tserver did kill every postgres process when it exited, that wouldn't solve the underlying problem, would it? Since the problem persists even after killing the Jepsen process (which is where all the clients live), it feels like there should be some code that detects that the connection is closed and terminates the postgres worker process, yeah? Is it possible that the postgres processes are locked up in some way that's preventing them from exiting correctly, perhaps because of their 'idle in transaction' state?
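For background on the "detect the closed connection" idea: a process that is actually blocked reading its client socket sees end-of-file as soon as the peer closes. A minimal C++ sketch of that check, not YugaByte's or postgres's actual code; the descriptor is assumed to be a connected stream socket:

```cpp
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

// Returns true if the client on `fd` has hung up. Illustrative only.
bool ClientGone(int fd) {
  struct pollfd pfd;
  pfd.fd = fd;
  pfd.events = POLLIN;
  pfd.revents = 0;
  if (poll(&pfd, 1, 0) <= 0) return false;  // nothing to report yet
  if (pfd.revents & (POLLHUP | POLLERR)) return true;

  // Readable: a zero-byte result means the peer closed the connection.
  // MSG_PEEK avoids consuming real protocol data.
  char buf[1];
  return recv(fd, buf, sizeof(buf), MSG_PEEK) == 0;
}
```

The catch, which the 'idle in transaction' observation hints at: this only works if something is actually polling or reading the socket; a backend wedged elsewhere never sees the EOF.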
Per the postgres docs (https://www.postgresql.org/docs/11/server-shutdown.html), the postmaster supports three shutdown modes, each triggered by a signal: SIGTERM (smart: wait for clients to disconnect), SIGINT (fast: roll back active transactions and exit), and SIGQUIT (immediate: abort and recover on restart). The docs specifically advise against SIGKILL, because the postmaster cannot relay it to its subprocesses, which may then have to be killed by hand.
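As a concrete illustration of those modes (placeholder PID, not YugaByte code):

```cpp
#include <signal.h>
#include <sys/types.h>

// Each kill() below selects a different postgres shutdown mode; a real
// caller would pick exactly one.
void ShutdownPostmaster(pid_t postmaster) {
  kill(postmaster, SIGTERM);     // smart: wait for clients to disconnect
  // kill(postmaster, SIGINT);   // fast: roll back transactions, then exit
  // kill(postmaster, SIGQUIT);  // immediate: abort now, recover on restart
  // SIGKILL is the one to avoid: the postmaster cannot relay it to its
  // backends, which is exactly how orphaned postgres processes arise.
}
```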
We currently use SIGKILL in two places:
We should fix both. In this case, however, it does not look like (2) is the cause: we log when we send SIGKILL to a process, and I don't see that log line in the uploaded logs.
On Linux, we arrange for the tserver's child process to receive SIGINT when the tserver terminates:
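For readers unfamiliar with the mechanism: the usual Linux way to get "signal the child when the parent dies" is the parent-death signal, set in the child between fork() and exec(). A hedged sketch, assuming that is what's in play here; the postgres binary path is a placeholder:

```cpp
#include <signal.h>
#include <sys/prctl.h>
#include <unistd.h>

int main() {
  pid_t pid = fork();
  if (pid == 0) {
    // Child: ask the kernel to deliver SIGINT to us when our parent dies.
    prctl(PR_SET_PDEATHSIG, SIGINT);
    // Classic race: the parent may have exited between fork() and the
    // prctl() call, in which case no signal will ever arrive.
    if (getppid() == 1) _exit(1);

    // exec the postmaster (path is illustrative).
    execl("/usr/local/pgsql/bin/postgres", "postgres", (char*)nullptr);
    _exit(127);  // exec failed
  }
  // Parent (the tserver) carries on.
  return 0;
}
```

Worth noting: the parent-death signal reaches only the direct child (the postmaster), not the backends it subsequently forks; so if the postmaster is ever killed outright rather than allowed to run a fast shutdown, those backends have nothing watching over them.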
Some PostgreSQL links on "idle in transaction (aborted)":
From the second link:
Presumably, all client processes have terminated by this point, so all the connections should be closed. We need to double-check that we are detecting this and stopping the backends properly.
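One hedged aside that may help with the double-checking: an idle backend (including one 'idle in transaction') that isn't actively reading only notices an abruptly vanished peer via TCP keepalives, which PostgreSQL exposes as the tcp_keepalives_idle, tcp_keepalives_interval, and tcp_keepalives_count settings; separately, the idle_in_transaction_session_timeout GUC (PostgreSQL 9.6+) force-terminates sessions that sit mid-transaction too long. A sketch of the underlying Linux socket options, illustrative rather than anyone's actual code:

```cpp
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Enable keepalive probes on a connected client socket so a dead peer
// is eventually detected even while the backend sits idle. The timing
// values mirror what postgres's tcp_keepalives_* settings control
// (TCP_KEEPIDLE and friends are Linux-specific).
bool EnableKeepalive(int fd) {
  int on = 1;
  int idle = 60;      // seconds of idleness before the first probe
  int interval = 10;  // seconds between probes
  int count = 3;      // unanswered probes before the kernel drops the peer
  return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) == 0 &&
         setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) == 0 &&
         setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof(interval)) == 0 &&
         setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count)) == 0;
}
```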