Connection leaking in 0.9.0 #110

Closed
yaroot opened this issue Sep 26, 2020 · 28 comments

@yaroot

yaroot commented Sep 26, 2020

Describe the bug

Clients couldn't get new connections after upgrading to 0.9.0; rolling back to 0.8.2 resolves the issue.
All clients use connection pooling (either JDBC- or libpq-based) with a pool size of 2~3 and a connection lifetime of 60s, so the number of active connections is very stable.

active_connections:
Screenshot_2020-09-26T19:10:40+09:00

total_connections:
Screenshot_2020-09-26T19:15:35+09:00

To Reproduce

Steps to reproduce the behavior.

Version

What is the version of pgagroal ?

0.9.0

PostgreSQL

What is the version of PostgreSQL ?

12.4-1.pgdg18.04+1

libev

What is the version of libev ?

1:4.22-1

OpenSSL

What is the version of OpenSSL ?

1.1.1-1ubuntu2.1~18.04.6

Access method

Which security access method is used (trust, password, md5, scram-sha-256) ?

md5

OS

Which Operating System (OS) is used ?

ubuntu 18.04

ulimit

What is the output from ulimit -a ?
5000

Configuration

Can you provide the configuration pgagroal ?

  • pgagroal.conf
[pgagroal]
host = 0.0.0.0
port = 5555
metrics = 5556
management = 5557

log_type = console
log_level = info
log_path = 

max_connections = 120
idle_timeout = 1
validation = off
unix_socket_dir = /tmp/.s.pgagroal

pipeline = session

[primary]
host = prod-pg01-a
port = 5432

Debug logs

Can you provide any debug logs (log_level = debug5) of the issue ?

Tip

I compiled pgagroal myself; packages are up-to-date.

apt install wget libev4 libev-dev libssl1.1 libssl-dev systemd libsystemd-dev python-docutils gcc-8 cmake -y
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr
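
For completeness, the cmake line above is abbreviated; the full build is roughly the usual out-of-source CMake build (paths here are illustrative, not my exact commands):

git clone https://github.com/agroal/pgagroal.git
cd pgagroal && mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr ..
make
sudo make install
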
@yaroot yaroot added the bug Something isn't working label Sep 26, 2020
@jesperpedersen jesperpedersen self-assigned this Sep 28, 2020
@jesperpedersen
Collaborator

Do you have any hints from pgagroal-cli -c pgagroal.conf details when you see the limit reached ?

Are you using both a pgagroal_users.conf file and pgagroal_databases.conf file ? If so, how is the latter configured ?

Using idle_timeout = 1 is very aggressive - would 60 be a better setting ?

In 0.9.x, unix_socket_dir changed to be a real directory, so unix_socket_dir = /tmp would be a better setting.

Locally I'm seeing connections being timed out so I'm trying to replicate what you are seeing.

Thanks in advance for any additional feedback / information you can provide !
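
Concretely, the two configuration suggestions above would look like this in the [pgagroal] section (just a sketch of the relevant lines, not a complete file):

idle_timeout = 60
unix_socket_dir = /tmp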

@yaroot
Author

yaroot commented Sep 28, 2020

Ah yes, I wasn't thinking about idle_timeout. There are no users/databases settings, and unix_socket_dir was set to /tmp when using 0.9.0.
I did not check the details command, but according to pgagroal_connection at the time (around 09:30 on the graph) only 44 connections were active, while pgagroal_active_connections = 120 and pgagroal_total_connections was ~50.

@jesperpedersen
Collaborator

It would be great if you could verify with pgagroal-cli details - maybe it is a statistics problem ?

You can check with ps aux | grep pgagroal and ps aux | grep postgres on the servers, and compare that to the output from the statistics and details.
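
For example, a quick way to get the counts to compare against the statistics (the grep patterns are just illustrative):

ps aux | grep '[p]gagroal' | wc -l    # pgagroal processes on the pooler host
ps aux | grep '[p]ostgres' | wc -l    # postgres backends on the database host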

@jesperpedersen
Collaborator

Could you verify the raw output from /metrics and SELECT * FROM pg_stat_activity; too ?
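
For example, something along these lines would capture both (port 5556 is the metrics port from your configuration; the psql connection options are assumptions):

curl -s http://localhost:5556/metrics > metrics.txt
psql -h prod-pg01-a -U postgres -c 'SELECT * FROM pg_stat_activity;' > pg_stat_activity.txt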

@yaroot
Author

yaroot commented Sep 29, 2020

I'm sorry, the issue was that clients could not get new connections, so I think the metrics were correct. I'll try to reproduce it locally.

@yaroot
Author

yaroot commented Sep 29, 2020

Here's the Python script I use for testing (test.txt); I ran it several times and tried both 1 and 60 for idle_timeout.

The first issue I found is ERROR: DISCARD ALL cannot run inside a transaction block in the PostgreSQL server logs, caused by killing the client without committing the transaction. It may be safer to issue 'ROLLBACK; DISCARD ALL;'. This is probably not the issue I encountered in production.

Metrics show that both active and total connections consistently went above 60.

Screenshot_2020-09-29T11:36:55+09:00

Screenshot_2020-09-29T12:25:16+09:00

The details output matches metrics and pg_stat_activity.

I set idle_timeout to 1 because the RAM usage of several postgres processes could grow well above 1 GB and cause OOM, so I want all clients to disconnect regularly to reclaim that RAM.

@jesperpedersen
Collaborator

I have added a commit that guards against an active transaction during DISCARD ALL. Currently we are not tracking transactions for the session pool, but that is something to consider. I agree that this is probably not what you are seeing though.

I'll keep looking based on your feedback - it is interesting that active_connections is much higher than total_connections, and that you are hitting max_connections.

Thanks for your test case, and let me know if you find more information to share !

@jesperpedersen
Collaborator

For your test case - did you have max_connections = 120 still ? Because active/total connections also include connections in states that the pool is currently processing for possible deletion.

Their state is shown in the details report, so it would be helpful to know if you are seeing something weird there (/metrics will show it too).

@jesperpedersen
Collaborator

You said "connection life-time set to 60s" - what did you mean by that ?

@yaroot
Author

yaroot commented Sep 30, 2020

Yes, I left max_connections at 120. There was nothing weird in pgagroal_connection: half of the pool was open, most of it in_use. The only weird thing is pgagroal_active_connections hitting the max value and clients reporting connection timeouts.

Connection life-time limits how long a connection stays in the pool; it is configured with maxLifetime for HikariCP and HASURA_GRAPHQL_PG_CONN_LIFETIME for graphql-engine.
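
For reference, a 60 second lifetime corresponds roughly to these settings (a sketch; as far as I know the unit is milliseconds for HikariCP and seconds for graphql-engine):

# HikariCP (JDBC) pool property
maxLifetime=60000

# graphql-engine environment variable
HASURA_GRAPHQL_PG_CONN_LIFETIME=60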

@yaroot
Author

yaroot commented Sep 30, 2020

There's another issue I think is related: sometimes one pgagroal process will hog an entire CPU. I don't have any screenshots, but yesterday in my testing I triggered 3 of them (my testing VM has only 1 CPU, so each shows up at about 30% CPU usage in htop).

@jesperpedersen
Collaborator

How many pgagroal processes are active once pgagroal_active_connections hits 120 ? Clients will time out once max_connections is reached.

I'll try and see if I can make a reproducer with hikaricp.

@jesperpedersen
Collaborator

If you see high CPU for a pgagroal process could you attach gdb to it and create a backtrace ?

gdb /usr/bin/pgagroal <pid>
(gdb) bt

Or profile the system using perf to create a flamegraph ?
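
A rough perf recipe for that (the FlameGraph scripts and their location are assumptions):

perf record -F 99 -g -p <pid> -- sleep 30
perf script > out.perf
# fold and render with Brendan Gregg's FlameGraph scripts
./stackcollapse-perf.pl out.perf > out.folded
./flamegraph.pl out.folded > pgagroal.svg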

@yaroot
Author

yaroot commented Oct 2, 2020

How many pgagroal processes are active once pgagroal_active_connections hits 120 ? Clients will time out once max_connections is reached.

About half are not_init, a few are free, and the rest are in_use.

@yaroot
Author

yaroot commented Oct 2, 2020

Okay, I think I triggered it again with HikariCP; metrics.txt shows pgagroal_active_connections at 111.

details is consistent with metrics except for Active connections, and there are 84 pgagroal processes:

Active connections:  0
Total connections:   83
Max connections:     120

@yaroot
Author

yaroot commented Oct 2, 2020

The Active connections count sometimes goes up to 80 but mostly stays at 0:

Active connections: 80

@yaroot
Author

yaroot commented Oct 2, 2020

I also run flush-idle and flush-gracefully regularly. Here is the code I use.
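
(The attached script isn't shown above; roughly, it is a periodic loop like the following illustrative sketch, not the exact code, and the interval is a placeholder:)

while true; do
    pgagroal-cli -c pgagroal.conf flush-idle
    pgagroal-cli -c pgagroal.conf flush-gracefully
    sleep 60
done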

@yaroot
Author

yaroot commented Oct 2, 2020

Okay, gracefully doesn't count as active in the command-line output but does in metrics; either way, pgagroal_active_connections is higher than it should be.

@jesperpedersen
Collaborator

I have fixed the gracefully status in #113 - thanks for reporting !

The problem is still pgagroal_active_connections (the internal counter), so I will keep looking at that. My test case for HikariCP shows that "additional" PostgreSQL packets are sent which are not needed (using 3.4.5). My Agroal test case is ok. I'll need to investigate further what can be done about this.

@yaroot
Author

yaroot commented Oct 3, 2020

Wait, a connection pool could do that?

@jesperpedersen
Collaborator

Connection pools deal with the connection lifecycle, and there are various properties that influence the protocol interaction.

@jesperpedersen
Collaborator

I have added an assert on the master branch. Could you do a run with a Debug build ? Note that max_connections is 8 in that case.

Getting the log

log_type = file
log_level = debug5
log_path = pgagroal.log

would be good if the assert triggers.
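
For reference, a Debug build is just the standard CMake switch (same build steps as before, paths assumed):

cmake -DCMAKE_BUILD_TYPE=Debug -DCMAKE_INSTALL_PREFIX=/usr ..
make && sudo make install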

@jesperpedersen
Collaborator

Maybe try a run with

non_blocking = off

as well.

@jesperpedersen
Collaborator

I have added a connection tracker in #114 that will hopefully help track down this issue further. You can use the master branch with a Release build; just set

log_type = file
log_level = info
log_path = pgagroal.log
tracker = on

and it will output the needed information. Please, continue your run until you see an issue - either connections timing out, or a mismatch in the statistics.

Send me the log via my email.

Thanks in advance !

@yaroot
Author

yaroot commented Oct 11, 2020

I let it run for a few hours (Release build on master 56cc3b2):

pg8000.exceptions.ProgrammingError: {'S': 'FATAL', 'V': 'FATAL', 'C': '53300', 'M': 'connection pool is full'}
java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 1000ms.

metrics.txt
conf.txt
pgagroal.log.gz

Screenshot_2020-10-11T14:19:03+09:00

Screenshot_2020-10-11T14:18:53+09:00

@jesperpedersen
Collaborator

Thank you ! This is very helpful :)

@jesperpedersen
Collaborator

I have fixed an issue dealing with bad file descriptors when the connection is obtained.

Any chance you can rerun your test with the latest master branch and attach the log ?

@jesperpedersen
Collaborator

I went ahead and released 0.9.1 with the fixes for the issues found so far in this thread.

Please, open a new issue if you are still seeing problems with the tracker log.

Thanks for your help on this !
