Connection leaking in 0.9.0 #110

Closed
yaroot opened this issue Sep 26, 2020 · 28 comments

@yaroot

yaroot commented Sep 26, 2020

Describe the bug

Clients couldn't get new connections after upgrading to 0.9.0; rolling back to 0.8.2 resolves the issue.
All clients use connection pooling (either JDBC- or libpq-based) with a pool size of 2~3 and a connection lifetime of 60s, so the number of active connections is very stable.

active_connections:
Screenshot_2020-09-26T19:10:40+09:00

total_connections:
Screenshot_2020-09-26T19:15:35+09:00

To Reproduce

Steps to reproduce the behavior.

Version

What is the version of pgagroal ?

0.9.0

PostgreSQL

What is the version of PostgreSQL ?

12.4-1.pgdg18.04+1

libev

What is the version of libev ?

1:4.22-1

OpenSSL

What is the version of OpenSSL ?

1.1.1-1ubuntu2.1~18.04.6

Access method

Which security access method is used (trust, password, md5, scram-sha-256) ?

md5

OS

Which Operating System (OS) is used ?

ubuntu 18.04

ulimit

What is the output from ulimit -a ?
5000

Configuration

Can you provide the configuration pgagroal ?

  • pgagroal.conf
[pgagroal]
host = 0.0.0.0
port = 5555
metrics = 5556
management = 5557

log_type = console
log_level = info
log_path = 

max_connections = 120
idle_timeout = 1
validation = off
unix_socket_dir = /tmp/.s.pgagroal

pipeline = session

[primary]
host = prod-pg01-a
port = 5432

Debug logs

Can you provide any debug logs (log_level = debug5) of the issue ?

Tip

I compiled pgagroal myself; packages are up-to-date.

apt install wget libev4 libev-dev libssl1.1 libssl-dev systemd libsystemd-dev python-docutils gcc-8 cmake -y
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr
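
For completeness, the cmake line above is abbreviated; the full build is roughly the usual out-of-source CMake build (paths here are illustrative, not my exact commands):

git clone https://github.com/agroal/pgagroal.git
cd pgagroal && mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr ..
make
sudo make install
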
@yaroot yaroot added the bug Something isn't working label Sep 26, 2020
@jesperpedersen jesperpedersen self-assigned this Sep 28, 2020
@jesperpedersen
Collaborator

Do you have any hints from pgagroal-cli -c pgagroal.conf details when you see the limit reached ?

Are you using both a pgagroal_users.conf file and pgagroal_databases.conf file ? If so, how is the latter configured ?

Using idle_timeout = 1 is very aggressive - would 60 be a better setting ?

In 0.9.x, unix_socket_dir changed to be a real directory, so unix_socket_dir = /tmp would be a better setting.

Locally I'm seeing connections being timed out so I'm trying to replicate what you are seeing.

Thanks in advance for any additional feedback / information you can provide !
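
Concretely, the two configuration suggestions above would look like this in the [pgagroal] section (just a sketch of the relevant lines, not a complete file):

idle_timeout = 60
unix_socket_dir = /tmp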

@yaroot
Author

yaroot commented Sep 28, 2020

Ah yes, I wasn't thinking about idle_timeout. There are no users/databases settings, and unix_socket_dir was set to /tmp when using 0.9.0.
I did not check the details command, but according to pgagroal_connection at the time (around 09:30 on the graph) only 44 connections were active, while pgagroal_active_connections = 120 and pgagroal_total_connections was ~50.

@jesperpedersen
Collaborator

It would be great if you could verify with pgagroal-cli details - maybe it is a statistics problem ?

You can check with ps aux | grep pgagroal and ps aux | grep postgres on the servers, and compare that to the output from the statistics and details.
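
For example, a quick way to get the counts to compare against the statistics (the grep patterns are just illustrative):

ps aux | grep '[p]gagroal' | wc -l    # pgagroal processes on the pooler host
ps aux | grep '[p]ostgres' | wc -l    # postgres backends on the database host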

@jesperpedersen
Collaborator

Could you verify the raw output from /metrics and SELECT * FROM pg_stat_activity; too ?
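
For example, something along these lines would capture both (port 5556 is the metrics port from your configuration; the psql connection options are assumptions):

curl -s http://localhost:5556/metrics > metrics.txt
psql -h prod-pg01-a -U postgres -c 'SELECT * FROM pg_stat_activity;' > pg_stat_activity.txt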

@yaroot
Author

yaroot commented Sep 29, 2020

I'm sorry, the issue was that clients could not get new connections, so I think the metrics were correct. I'll try to reproduce it locally.

@yaroot
Author

yaroot commented Sep 29, 2020

Here's the Python script I use for testing (test.txt); I ran it several times and tried both 1 and 60 for idle_timeout.

The first issue I found is ERROR: DISCARD ALL cannot run inside a transaction block in the PostgreSQL server logs, caused by killing the client without committing the transaction. It may be safer to issue 'ROLLBACK; DISCARD ALL;'. This is probably not the issue I encountered in production.

Metrics show that both active and total connections consistently went above 60.

Screenshot_2020-09-29T11:36:55+09:00

Screenshot_2020-09-29T12:25:16+09:00

The details output matches metrics and pg_stat_activity.

I set idle_timeout to 1 because the RAM usage of several postgres processes could grow well above 1 GB and cause OOM, so I want all clients to disconnect regularly to reclaim that RAM.

@jesperpedersen
Collaborator

I have added a commit that guards against an active transaction during DISCARD ALL. Currently we are not tracking transactions for the session pool, but that is something to consider. I agree that this is probably not what you are seeing though.

I'll keep looking based on your feedback - it is interesting that active_connections is much higher than total_connections, and that you are hitting max_connections.

Thanks for your test case, and let me know if you find more information to share !

@jesperpedersen
Collaborator

For your test case - did you have max_connections = 120 still ? Because active/total connections also include connections in states that the pool is currently processing for possible deletion.

Their state is shown in the details report, so it would be helpful to know if you are seeing something weird there (/metrics will show it too).

@jesperpedersen
Collaborator

You said "connection life-time set to 60s" - what did you mean by that ?

@yaroot
Author

yaroot commented Sep 30, 2020

Yes, I left max_connections at 120. There was nothing weird in pgagroal_connection: half of the pool was open, most of it in_use. The only weird thing is pgagroal_active_connections hitting the max value and clients reporting connection timeouts.

Connection life-time limits how long a connection stays in the pool; it is configured with maxLifetime for HikariCP and HASURA_GRAPHQL_PG_CONN_LIFETIME for graphql-engine.
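
For reference, a 60 second lifetime corresponds roughly to these settings (a sketch; as far as I know the unit is milliseconds for HikariCP and seconds for graphql-engine):

# HikariCP (JDBC) pool property
maxLifetime=60000

# graphql-engine environment variable
HASURA_GRAPHQL_PG_CONN_LIFETIME=60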

@yaroot
Author

yaroot commented Sep 30, 2020

There's another issue I think is related: sometimes one pgagroal process will hog an entire CPU. I don't have any screenshots, but yesterday in my testing I triggered 3 of them (my testing VM has only 1 CPU, so each shows up at about 30% CPU usage in htop).

@jesperpedersen
Collaborator

How many pgagroal processes are active once pgagroal_active_connections hits 120 ? Clients will time out once max_connections is reached.

I'll try and see if I can make a reproducer with hikaricp.

@jesperpedersen
Collaborator

If you see high CPU for a pgagroal process could you attach gdb to it and create a backtrace ?

gdb /usr/bin/pgagroal <pid>
(gdb) bt

Or profile the system using perf to create a flamegraph ?
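
A rough perf recipe for that (the FlameGraph scripts and their location are assumptions):

perf record -F 99 -g -p <pid> -- sleep 30
perf script > out.perf
# fold and render with Brendan Gregg's FlameGraph scripts
./stackcollapse-perf.pl out.perf > out.folded
./flamegraph.pl out.folded > pgagroal.svg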

@yaroot
Author

yaroot commented Oct 2, 2020

How many pgagroal processes are active once pgagroal_active_connections hits 120 ? Clients will time out once max_connections is reached.

About half are not_init, a few are free, and the rest are in_use.

@yaroot
Author

yaroot commented Oct 2, 2020

Okay, I think I triggered it again with HikariCP; metrics.txt shows pgagroal_active_connections at 111.

details is consistent with metrics except for Active connections, and there are 84 pgagroal processes:

Active connections:  0
Total connections:   83
Max connections:     120

@yaroot
Author

yaroot commented Oct 2, 2020

The Active connections count sometimes goes up to 80 but mostly stays at 0:

Active connections: 80

@yaroot
Author

yaroot commented Oct 2, 2020

I also run flush-idle and flush-gracefully regularly. Here is the code I use.
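
(The attached script isn't shown above; roughly, it is a periodic loop like the following illustrative sketch, not the exact code, and the interval is a placeholder:)

while true; do
    pgagroal-cli -c pgagroal.conf flush-idle
    pgagroal-cli -c pgagroal.conf flush-gracefully
    sleep 60
done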

@yaroot
Author

yaroot commented Oct 2, 2020

Okay, gracefully doesn't count as active in the command-line output but does in metrics; either way, pgagroal_active_connections is higher than it should be.

@jesperpedersen
Collaborator

I have fixed the gracefully status in #113 - thanks for reporting !

The problem is still pgagroal_active_connections (the internal counter), so I will keep looking at that. My test case for HikariCP shows that "additional" PostgreSQL packets are sent which are not needed (using 3.4.5). My Agroal test case is ok. I'll need to investigate further what can be done about this.

@yaroot
Author

yaroot commented Oct 3, 2020

Wait, a connection pool could do that?

@jesperpedersen
Collaborator

Connection pools deal with the connection lifecycle, and there are various properties that influence the protocol interaction.

@jesperpedersen
Collaborator

I have added an assert on the master branch. Could you do a run with a Debug build ? Note that max_connections is 8 in that case.

Getting the log

log_type = file
log_level = debug5
log_path = pgagroal.log

would be good if the assert triggers.
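
For reference, a Debug build is just the standard CMake switch (same build steps as before, paths assumed):

cmake -DCMAKE_BUILD_TYPE=Debug -DCMAKE_INSTALL_PREFIX=/usr ..
make && sudo make install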

@jesperpedersen
Collaborator

Maybe try a run with

non_blocking = off

as well.

@jesperpedersen
Collaborator

I have added a connection tracker in #114 that will hopefully help track down this issue further. You can use the master branch with a Release build; just set

log_type = file
log_level = info
log_path = pgagroal.log
tracker = on

and it will output the needed information. Please, continue your run until you see an issue - either connections timing out, or a mismatch in the statistics.

Send me the log via my email.

Thanks in advance !

@yaroot
Author

yaroot commented Oct 11, 2020

I let it run for a few hours (Release build on master 56cc3b2):

pg8000.exceptions.ProgrammingError: {'S': 'FATAL', 'V': 'FATAL', 'C': '53300', 'M': 'connection pool is full'}
java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 1000ms.

metrics.txt
conf.txt
pgagroal.log.gz

Screenshot_2020-10-11T14:19:03+09:00

Screenshot_2020-10-11T14:18:53+09:00

@jesperpedersen
Collaborator

Thank you ! This is very helpful :)

@jesperpedersen
Collaborator

I have fixed an issue dealing with bad file descriptors when the connection is obtained.

Any chance you can rerun your test with the latest master branch and attach the log ?

@jesperpedersen
Collaborator

I went ahead and released 0.9.1 with the fixes for the issues found so far in this thread.

Please, open a new issue if you are still seeing problems with the tracker log.

Thanks for your help on this !
