
fides >= 2.18.0 maintains a large number (~50) of open idle connections to hosted database #3918

Open

daveqnet opened this issue Aug 14, 2023 · 13 comments

Labels: bug (Something isn't working)
@daveqnet
Contributor

daveqnet commented Aug 14, 2023

Bug Description

Since August 3rd, the nightly build (and, to a lesser degree, the RC build) of fides on our staging environment has been keeping a large number of idle database connections open. It's roughly a 3x increase.

[edit: I've reproduced this on fides 2.18.0]

Occasionally this can overwhelm our database instance, making it unavailable for other staging tenants.

See additional context section below for more detail.

Confidential & internal data about the infra that I don't wish to include on a public issue can be viewed in the associated OPS ticket.

Steps to Reproduce

  1. Deploy the nightly build of fides on a Kubernetes cluster.
  2. Check the number of idle connections fides keeps open to the hosted PostgreSQL database.

Expected behavior

The number of connections would be much lower, e.g. roughly one third of what is observed.

Screenshots

n/a

Environment

  • Version: >= 2.18.0
  • OS: Kubernetes (AWS EKS v1.27.3-eks-a5565ad)
  • Python Version: 3.10.12
  • Docker Version: containerd://1.6.19

Additional context

Query: how many connections are open?

SELECT usename, state, count(*) AS connection_count
FROM pg_stat_activity
WHERE usename = 'fides_nightly'
GROUP BY usename, state
ORDER BY connection_count DESC;
usename      |state|connection_count|
-------------+-----+----------------+
fides_nightly|idle |              57|

Query: how many connections are active?

SELECT usename, state, query, backend_start
FROM pg_stat_activity
WHERE usename = 'fides_nightly' AND state <> 'idle'
ORDER BY backend_start DESC;
usename|state|query|backend_start|
-------+-----+-----+-------------+

Query: how long have the connections been open?

SELECT usename, state, age(now(), backend_start) AS connection_duration
FROM pg_stat_activity
WHERE usename = 'fides_nightly'
ORDER BY backend_start;
usename      |state|connection_duration|
-------------+-----+-------------------+
fides_nightly|idle |    03:01:21.410322|
fides_nightly|idle |    01:27:42.397356|
fides_nightly|idle |    01:26:45.455808|
fides_nightly|idle |    01:26:15.463816|
fides_nightly|idle |    01:25:45.460022|
fides_nightly|idle |    01:25:15.452283|
fides_nightly|idle |    01:24:45.458899|
fides_nightly|idle |    01:24:15.461488|
fides_nightly|idle |    01:23:45.460777|
fides_nightly|idle |    01:23:15.462621|
fides_nightly|idle |    01:22:52.730462|
fides_nightly|idle |    01:19:15.462048|
fides_nightly|idle |    01:15:56.910751|
fides_nightly|idle |    01:15:26.881626|
fides_nightly|idle |    01:14:56.860062|
fides_nightly|idle |    01:13:56.828506|
fides_nightly|idle |    01:12:56.781522|
fides_nightly|idle |    01:12:26.755376|
fides_nightly|idle |    01:11:56.732336|
fides_nightly|idle |    01:11:26.706441|
fides_nightly|idle |    01:10:56.692042|
fides_nightly|idle |    01:10:26.656191|
fides_nightly|idle |    01:08:56.554017|
fides_nightly|idle |    01:06:27.388653|
fides_nightly|idle |    01:00:45.462934|
fides_nightly|idle |    01:00:15.460313|
fides_nightly|idle |     00:59:45.46394|
fides_nightly|idle |    00:58:51.692505|
fides_nightly|idle |    00:57:56.092717|
fides_nightly|idle |    00:57:26.078899|
fides_nightly|idle |    00:56:56.045336|
fides_nightly|idle |    00:56:26.010928|
fides_nightly|idle |    00:55:55.995537|
fides_nightly|idle |    00:55:25.987037|
fides_nightly|idle |    00:54:55.937741|
fides_nightly|idle |    00:54:25.927437|
fides_nightly|idle |    00:53:25.895177|
fides_nightly|idle |    00:52:55.864517|
fides_nightly|idle |    00:52:25.849862|
fides_nightly|idle |    00:51:05.467596|
fides_nightly|idle |    00:50:35.458049|
fides_nightly|idle |    00:48:10.654575|
fides_nightly|idle |     00:43:55.40883|
fides_nightly|idle |    00:40:10.257298|
fides_nightly|idle |    00:37:50.700027|
fides_nightly|idle |    00:34:05.417863|
fides_nightly|idle |    00:33:05.481834|
fides_nightly|idle |    00:25:54.559949|
fides_nightly|idle |    00:18:04.771576|
fides_nightly|idle |    00:16:35.457527|
fides_nightly|idle |    00:15:35.457228|
fides_nightly|idle |    00:15:05.458018|
fides_nightly|idle |    00:14:35.462944|
fides_nightly|idle |    00:09:35.419186|

Query: where are the connections being made from?

SELECT usename, client_addr, count(*) AS connection_count
FROM pg_stat_activity
WHERE usename = 'fides_nightly'
GROUP BY usename, client_addr
ORDER BY connection_count DESC;

Note: I've redacted the IP address, but it's a single address.

usename      |client_addr|connection_count|
-------------+-----------+----------------+
fides_nightly|x.x.x.x    |              57|

Query: What are non-idle connections being used for?

SELECT usename, state, query, backend_start
FROM pg_stat_activity
WHERE usename = 'fides_nightly' AND state <> 'idle'
ORDER BY backend_start DESC;
usename|state|query|backend_start|
-------+-----+-----+-------------+
daveqnet added the bug (Something isn't working) label on Aug 14, 2023
@daveqnet
Contributor Author

This might be related to /health checks:

2023-08-14 14:20:28.142 | ERROR    | fides.api.db.database:get_db_health:90 - Unable to reach the database: Exception: Database connection failed with engine:
Engine(postgresql+psycopg2://fides_nightly:***@redacted-host:redacted-port/redacted-db-name)!

@ThomasLaPiana, the date when you merged #3884 roughly lines up with the DB connection count increase I'm seeing. Do you think the changes you made could be causing a lot of idle open DB connections?

@ThomasLaPiana
Contributor

@daveqnet absolutely, it should be a one-line fix!

@ThomasLaPiana
Contributor

@daveqnet I'm hopefully fixing this on the application side, but can we also configure Postgres to recycle connections more often? Defense on both sides seems prudent.

@ThomasLaPiana
Contributor

@daveqnet I'm having a hard time reproducing this locally. When I run off of the current code I see ~5 connections maintained. The Engine we are using will by default keep a connection pool open.

I am seeing, however, that the application has a pool limit of 50 with an additional 50 set as overflow. Maybe something is keeping application connections open? Is anything happening on that instance other than health checks?
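
For reference, a generic Postgres query (not fides-specific) to compare that per-process pool ceiling against the server's overall connection budget:

-- how much of the server-wide connection budget is currently in use
SELECT (SELECT setting::int FROM pg_settings WHERE name = 'max_connections') AS max_connections,
       (SELECT count(*) FROM pg_stat_activity) AS current_connections;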

@daveqnet
Contributor Author

> @daveqnet I'm hopefully fixing this on the application side, but can we also configure Postgres to recycle connections more often? Defense on both sides seems prudent.

@ThomasLaPiana From an architecture perspective we'll probably have to put something like AWS RDS Proxy in front of shared DB instances for multiple tenants, but that'll take a while to deliver.

In the meantime, yes, there are absolutely some config changes that can be made on the shared Postgres DB side.

For example, the idle connection timeout is currently set to 24 hours (86400000 ms). On local fides deployments this is set to 0 (no timeout limit).

Get timeout

SELECT name, setting, unit
FROM pg_settings
WHERE name = 'idle_in_transaction_session_timeout';
name                               |setting |unit|
-----------------------------------+--------+----+
idle_in_transaction_session_timeout|86400000|ms  |

I can set this timeout to, say, 30 minutes (1800000 ms). Based on the DB connection durations I'm seeing, this would reduce the connection count by about 90%.
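
A minimal sketch of what that could look like, scoped to the fides_nightly role used in the queries above (an ALTER DATABASE or postgresql.conf change would also work):

-- only affects sessions opened by this role after the change
ALTER ROLE fides_nightly SET idle_in_transaction_session_timeout = '30min';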

@daveqnet
Contributor Author

> @daveqnet I'm having a hard time reproducing this locally. When I run off of the current code I see ~5 connections maintained. The Engine we are using will by default keep a connection pool open.
>
> I am seeing, however, that the application has a pool limit of 50 with an additional 50 set as overflow. Maybe something is keeping application connections open? Is anything happening on that instance other than health checks?

@ThomasLaPiana Honestly the only requests I'm seeing in the fides webserver logs are automated check requests to /health and to /. Nightly doesn't seem to have many real users at the moment.

However, nightly is a little unusual in that it has two workers. Do workers connect to the hosted DB, or just the webserver?
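
One way to check from the database side, assuming the webserver and worker pods show up with distinct client addresses or application names, would be something like:

SELECT application_name, client_addr, state, count(*) AS connection_count
FROM pg_stat_activity
WHERE usename = 'fides_nightly'
GROUP BY application_name, client_addr, state
ORDER BY connection_count DESC;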

@daveqnet
Contributor Author

Just to add some additional info:

  • The same issue is present in fides 2.18.0
  • A valid workaround is to reduce idle_in_transaction_session_timeout to a low value on the database side
  • Workers do indeed connect to the hosted DB. Here's an example of a worker error log when the DB is overloaded.
worker error log
[2023-08-16 07:24:53,505: ERROR/ForkPoolWorker-1] Task fides.api.service.privacy_request.request_runner_service.run_privacy_request[371982a4-8b32-45fe-82cc-92c468747303] raised unexpected: OperationalError('(psycopg2.OperationalError) connection to server at "redacted.rds.amazonaws.com" (x.x.x.x), port xxxx failed: FATAL:  remaining connection slots are reserved for non-replication superuser and rds_superuser connections\n')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/fides/api/util/wrappers.py", line 12, in wrap
    loop = asyncio.get_running_loop()
RuntimeError: no running event loop

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3243, in _wrap_pool_connect
    return fn()
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 310, in connect
    return _ConnectionFairy._checkout(self)
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 868, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 476, in checkout
    rec = pool._do_get()
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/impl.py", line 145, in _do_get
    with util.safe_reraise():
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
    raise exception
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/impl.py", line 143, in _do_get
    return self._create_connection()
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 256, in _create_connection
    return _ConnectionRecord(self)
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 371, in __init__
    self.__connect()
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 665, in __connect
    with util.safe_reraise():
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
    raise exception
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 661, in __connect
    self.dbapi_connection = connection = pool._invoke_creator(self)
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/create.py", line 590, in connect
    return dialect.connect(*cargs, **cparams)
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 584, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/usr/local/lib/python3.10/site-packages/psycopg2/__init__.py", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: connection to server at "redacted.rds.amazonaws.com" (x.x.x.x), port xxxx failed: FATAL:  remaining connection slots are reserved for non-replication superuser and rds_superuser connections


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/celery/app/trace.py", line 451, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/celery/app/trace.py", line 734, in __protected_call__
    return self.run(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/fides/api/util/wrappers.py", line 15, in wrap
    return asyncio.run(func(*args, **kwargs))
  File "/usr/local/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.10/site-packages/fides/api/service/privacy_request/request_runner_service.py", line 301, in run_privacy_request
    privacy_request = PrivacyRequest.get(db=session, object_id=privacy_request_id)
  File "/usr/local/lib/python3.10/site-packages/fides/api/db/base_class.py", line 152, in get
    return db.query(cls).get(object_id)
  File "<string>", line 2, in get
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/util/deprecations.py", line 401, in warned
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 943, in get
    return self._get_impl(ident, loading.load_on_pk_identity)
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 947, in _get_impl
    return self.session._get_impl(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 2874, in _get_impl
    return db_load_fn(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/loading.py", line 530, in load_on_pk_identity
    session.execute(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1688, in execute
    conn = self._connection_for_bind(bind)
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1529, in _connection_for_bind
    return self._transaction._connection_for_bind(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 747, in _connection_for_bind
    conn = bind.connect()
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3197, in connect
    return self._connection_cls(self, close_with_result=close_with_result)
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 96, in __init__
    else engine.raw_connection()
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3276, in raw_connection
    return self._wrap_pool_connect(self.pool.connect, _connection)
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3246, in _wrap_pool_connect
    Connection._handle_dbapi_exception_noconnection(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 2100, in _handle_dbapi_exception_noconnection
    util.raise_(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
    raise exception
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3243, in _wrap_pool_connect
    return fn()
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 310, in connect
    return _ConnectionFairy._checkout(self)
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 868, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 476, in checkout
    rec = pool._do_get()
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/impl.py", line 145, in _do_get
    with util.safe_reraise():
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
    raise exception
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/impl.py", line 143, in _do_get
    return self._create_connection()
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 256, in _create_connection
    return _ConnectionRecord(self)
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 371, in __init__
    self.__connect()
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 665, in __connect
    with util.safe_reraise():
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
    raise exception
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 661, in __connect
    self.dbapi_connection = connection = pool._invoke_creator(self)
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/create.py", line 590, in connect
    return dialect.connect(*cargs, **cparams)
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 584, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/usr/local/lib/python3.10/site-packages/psycopg2/__init__.py", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) connection to server at "redacted.rds.amazonaws.com" (x.x.x.x), port xxxx failed: FATAL:  remaining connection slots are reserved for non-replication superuser and rds_superuser connections

(Background on this error at: https://sqlalche.me/e/14/e3q8)

daveqnet changed the title to "fides >= 2.18.0 maintaining a large number of open idle connections to hosted database" on Aug 16, 2023, reverted it the same day, and changed it to "fides >= 2.18.0 maintains a large number (~50) of open idle connections to hosted database" on Aug 17, 2023.
@daveqnet
Contributor Author

> A valid workaround is to reduce idle_in_transaction_session_timeout to a low value on the database side

I was wrong on this. This timeout only ends connections in the idle in transaction state; it doesn't time out connections in the plain idle state. From the database side, it looks like the best way we can solve this is by introducing a connection pooler like AWS RDS Proxy between the tenants and the database.
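
As a stop-gap on the database side, long-idle sessions could in principle be cleared manually with something like the following (a rough sketch reusing the fides_nightly role from above; pg_terminate_backend closes the session, and the client pool reconnects on next use):

-- terminate fides sessions that have been idle for more than 30 minutes
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE usename = 'fides_nightly'
  AND state = 'idle'
  AND state_change < now() - interval '30 minutes';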

@ThomasLaPiana, there's no config we can set on the fides containers to limit their connections client-side, is there?

@ThomasLaPiana
Contributor

ThomasLaPiana commented Aug 17, 2023

@daveqnet our docs say there is: https://ethyca.github.io/fides/2.18.0/config/ (Ctrl+F for "database" and find the documentation in the code embed)

We default to 50, which seems kind of high.

@daveqnet
Contributor Author

Aha, yes, I should probably check the config docs before asking a config question!! 🤦

Thanks @ThomasLaPiana 🙏

@ThomasLaPiana
Contributor

I worked hard to write a script to auto-generate that documentation 🥲

@daveqnet
Contributor Author

> I worked hard to write a script to auto-generate that documentation 🥲

I'm genuinely going to add it to my bookmarks toolbar 😆

@ThomasLaPiana
Contributor

> > I worked hard to write a script to auto-generate that documentation 🥲
>
> I'm genuinely going to add it to my bookmarks toolbar 😆

😂 ❤️ 💯
