
fides >= 2.18.0 maintains a large number (~50) of open idle connections to hosted database #3918

Open

daveqnet opened this issue Aug 14, 2023 · 13 comments

Labels: bug (Something isn't working)
@daveqnet
Contributor

daveqnet commented Aug 14, 2023

Bug Description

Since August 3rd, the nightly build (and, to a lesser degree, the RC build) of fides on our staging environment has been keeping a large number of idle database connections open. It's roughly a 3x increase.

[edit: I've reproduced this on fides 2.18.0]

Occasionally this can overwhelm our database instance, making it unavailable for other staging tenants.

See additional context section below for more detail.

Confidential & internal data about the infra that I don't wish to include on a public issue can be viewed in the associated OPS ticket.

Steps to Reproduce

  1. Deploy the nightly build of fides on a Kubernetes cluster.
  2. Check the number of idle connections fides keeps open to the hosted PostgreSQL database.

Expected behavior

The number of connections would be much lower, e.g. roughly one third of what is observed.

Screenshots

n/a

Environment

  • Version: >= 2.18.0
  • OS: Kubernetes (AWS EKS v1.27.3-eks-a5565ad)
  • Python Version: 3.10.12
  • Docker Version: containerd://1.6.19

Additional context

Query: how many connections are open?

SELECT usename, state, count(*) AS connection_count
FROM pg_stat_activity
WHERE usename = 'fides_nightly'
GROUP BY usename, state
ORDER BY connection_count DESC;
usename      |state|connection_count|
-------------+-----+----------------+
fides_nightly|idle |              57|

Query: how many connections are active?

SELECT usename, state, query, backend_start
FROM pg_stat_activity
WHERE usename = 'fides_nightly' AND state <> 'idle'
ORDER BY backend_start DESC;
usename|state|query|backend_start|
-------+-----+-----+-------------+

Query: how long have the connections been open?

SELECT usename, state, age(now(), backend_start) AS connection_duration
FROM pg_stat_activity
WHERE usename = 'fides_nightly'
ORDER BY backend_start;
usename      |state|connection_duration|
-------------+-----+-------------------+
fides_nightly|idle |    03:01:21.410322|
fides_nightly|idle |    01:27:42.397356|
fides_nightly|idle |    01:26:45.455808|
fides_nightly|idle |    01:26:15.463816|
fides_nightly|idle |    01:25:45.460022|
fides_nightly|idle |    01:25:15.452283|
fides_nightly|idle |    01:24:45.458899|
fides_nightly|idle |    01:24:15.461488|
fides_nightly|idle |    01:23:45.460777|
fides_nightly|idle |    01:23:15.462621|
fides_nightly|idle |    01:22:52.730462|
fides_nightly|idle |    01:19:15.462048|
fides_nightly|idle |    01:15:56.910751|
fides_nightly|idle |    01:15:26.881626|
fides_nightly|idle |    01:14:56.860062|
fides_nightly|idle |    01:13:56.828506|
fides_nightly|idle |    01:12:56.781522|
fides_nightly|idle |    01:12:26.755376|
fides_nightly|idle |    01:11:56.732336|
fides_nightly|idle |    01:11:26.706441|
fides_nightly|idle |    01:10:56.692042|
fides_nightly|idle |    01:10:26.656191|
fides_nightly|idle |    01:08:56.554017|
fides_nightly|idle |    01:06:27.388653|
fides_nightly|idle |    01:00:45.462934|
fides_nightly|idle |    01:00:15.460313|
fides_nightly|idle |     00:59:45.46394|
fides_nightly|idle |    00:58:51.692505|
fides_nightly|idle |    00:57:56.092717|
fides_nightly|idle |    00:57:26.078899|
fides_nightly|idle |    00:56:56.045336|
fides_nightly|idle |    00:56:26.010928|
fides_nightly|idle |    00:55:55.995537|
fides_nightly|idle |    00:55:25.987037|
fides_nightly|idle |    00:54:55.937741|
fides_nightly|idle |    00:54:25.927437|
fides_nightly|idle |    00:53:25.895177|
fides_nightly|idle |    00:52:55.864517|
fides_nightly|idle |    00:52:25.849862|
fides_nightly|idle |    00:51:05.467596|
fides_nightly|idle |    00:50:35.458049|
fides_nightly|idle |    00:48:10.654575|
fides_nightly|idle |     00:43:55.40883|
fides_nightly|idle |    00:40:10.257298|
fides_nightly|idle |    00:37:50.700027|
fides_nightly|idle |    00:34:05.417863|
fides_nightly|idle |    00:33:05.481834|
fides_nightly|idle |    00:25:54.559949|
fides_nightly|idle |    00:18:04.771576|
fides_nightly|idle |    00:16:35.457527|
fides_nightly|idle |    00:15:35.457228|
fides_nightly|idle |    00:15:05.458018|
fides_nightly|idle |    00:14:35.462944|
fides_nightly|idle |    00:09:35.419186|

Query: where are the connections being made from?

SELECT usename, client_addr, count(*) AS connection_count
FROM pg_stat_activity
WHERE usename = 'fides_nightly'
GROUP BY usename, client_addr
ORDER BY connection_count DESC;

Note: I've redacted the IP address, but it's a single address.

usename      |client_addr|connection_count|
-------------+-----------+----------------+
fides_nightly|x.x.x.x    |              57|

Query: What are non-idle connections being used for?

SELECT usename, state, query, backend_start
FROM pg_stat_activity
WHERE usename = 'fides_nightly' AND state <> 'idle'
ORDER BY backend_start DESC;
usename|state|query|backend_start|
-------+-----+-----+-------------+
daveqnet added the bug (Something isn't working) label on Aug 14, 2023
@daveqnet
Contributor Author

This might be related to /health checks:

2023-08-14 14:20:28.142 | ERROR    | fides.api.db.database:get_db_health:90 - Unable to reach the database: Exception: Database connection failed with engine:
Engine(postgresql+psycopg2://fides_nightly:***@redacted-host:redacted-port/redacted-db-name)!

@ThomasLaPiana, the date when you merged #3884 roughly lines up with the DB connection count increase I'm seeing. Do you think the changes you made could be causing a lot of idle open DB connections?

@ThomasLaPiana
Contributor

@daveqnet absolutely, it should be a one-line fix!

@ThomasLaPiana
Contributor

@daveqnet I'm hopefully fixing this on the application side, but can we also configure Postgres to recycle connections more often? Defense on both sides seems prudent.

@ThomasLaPiana
Contributor

@daveqnet I'm having a hard time reproducing this locally. When I run off of the current code I see ~5 connections maintained. The Engine we are using will by default keep a connection pool open.

I am seeing, however, that the application has a pool limit of 50 with an additional 50 set as overflow. Maybe something is keeping application connections open? Is anything happening on that instance other than health checks?
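
For reference, a generic Postgres query (not fides-specific) to compare that per-process pool ceiling against the server's overall connection budget:

-- how much of the server-wide connection budget is currently in use
SELECT (SELECT setting::int FROM pg_settings WHERE name = 'max_connections') AS max_connections,
       (SELECT count(*) FROM pg_stat_activity) AS current_connections;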

@daveqnet
Contributor Author

> @daveqnet I'm hopefully fixing this on the application side, but can we also configure Postgres to recycle connections more often? Defense on both sides seems prudent.

@ThomasLaPiana From an architecture perspective we'll probably have to put something like AWS RDS Proxy in front of shared DB instances for multiple tenants, but that'll take a while to deliver.

In the meantime, yes, there are absolutely some config changes that can be made on the shared Postgres DB side.

For example, the idle connection timeout is currently set to 24 hours (86400000 ms). On local fides deployments this is set to 0 (no timeout limit).

Get timeout

SELECT name, setting, unit
FROM pg_settings
WHERE name = 'idle_in_transaction_session_timeout';
name                               |setting |unit|
-----------------------------------+--------+----+
idle_in_transaction_session_timeout|86400000|ms  |

I can set this timeout to, say, 30 minutes (1800000 ms). Based on the DB connection durations I'm seeing, this would reduce the connection count by about 90%.
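
A minimal sketch of what that could look like, scoped to the fides_nightly role used in the queries above (an ALTER DATABASE or postgresql.conf change would also work):

-- only affects sessions opened by this role after the change
ALTER ROLE fides_nightly SET idle_in_transaction_session_timeout = '30min';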

@daveqnet
Contributor Author

> @daveqnet I'm having a hard time reproducing this locally. When I run off of the current code I see ~5 connections maintained. The Engine we are using will by default keep a connection pool open.
>
> I am seeing, however, that the application has a pool limit of 50 with an additional 50 set as overflow. Maybe something is keeping application connections open? Is anything happening on that instance other than health checks?

@ThomasLaPiana Honestly the only requests I'm seeing in the fides webserver logs are automated check requests to /health and to /. Nightly doesn't seem to have many real users at the moment.

However, nightly is a little unusual in that it has two workers. Do workers connect to the hosted DB, or just the webserver?
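
One way to check from the database side, assuming the webserver and worker pods show up with distinct client addresses or application names, would be something like:

SELECT application_name, client_addr, state, count(*) AS connection_count
FROM pg_stat_activity
WHERE usename = 'fides_nightly'
GROUP BY application_name, client_addr, state
ORDER BY connection_count DESC;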

@daveqnet
Contributor Author

Just to add some additional info:

  • The same issue is present in fides 2.18.0
  • A valid workaround is to reduce idle_in_transaction_session_timeout to a low value on the database side
  • Workers do indeed connect to the hosted DB. Here's an example of a worker error log when the DB is overloaded.
worker error log
[2023-08-16 07:24:53,505: ERROR/ForkPoolWorker-1] Task fides.api.service.privacy_request.request_runner_service.run_privacy_request[371982a4-8b32-45fe-82cc-92c468747303] raised unexpected: OperationalError('(psycopg2.OperationalError) connection to server at "redacted.rds.amazonaws.com" (x.x.x.x), port xxxx failed: FATAL:  remaining connection slots are reserved for non-replication superuser and rds_superuser connections\n')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/fides/api/util/wrappers.py", line 12, in wrap
    loop = asyncio.get_running_loop()
RuntimeError: no running event loop

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3243, in _wrap_pool_connect
    return fn()
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 310, in connect
    return _ConnectionFairy._checkout(self)
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 868, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 476, in checkout
    rec = pool._do_get()
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/impl.py", line 145, in _do_get
    with util.safe_reraise():
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
    raise exception
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/impl.py", line 143, in _do_get
    return self._create_connection()
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 256, in _create_connection
    return _ConnectionRecord(self)
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 371, in __init__
    self.__connect()
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 665, in __connect
    with util.safe_reraise():
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
    raise exception
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 661, in __connect
    self.dbapi_connection = connection = pool._invoke_creator(self)
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/create.py", line 590, in connect
    return dialect.connect(*cargs, **cparams)
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 584, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/usr/local/lib/python3.10/site-packages/psycopg2/__init__.py", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: connection to server at "redacted.rds.amazonaws.com" (x.x.x.x), port xxxx failed: FATAL:  remaining connection slots are reserved for non-replication superuser and rds_superuser connections


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/celery/app/trace.py", line 451, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/celery/app/trace.py", line 734, in __protected_call__
    return self.run(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/fides/api/util/wrappers.py", line 15, in wrap
    return asyncio.run(func(*args, **kwargs))
  File "/usr/local/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.10/site-packages/fides/api/service/privacy_request/request_runner_service.py", line 301, in run_privacy_request
    privacy_request = PrivacyRequest.get(db=session, object_id=privacy_request_id)
  File "/usr/local/lib/python3.10/site-packages/fides/api/db/base_class.py", line 152, in get
    return db.query(cls).get(object_id)
  File "<string>", line 2, in get
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/util/deprecations.py", line 401, in warned
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 943, in get
    return self._get_impl(ident, loading.load_on_pk_identity)
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 947, in _get_impl
    return self.session._get_impl(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 2874, in _get_impl
    return db_load_fn(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/loading.py", line 530, in load_on_pk_identity
    session.execute(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1688, in execute
    conn = self._connection_for_bind(bind)
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1529, in _connection_for_bind
    return self._transaction._connection_for_bind(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 747, in _connection_for_bind
    conn = bind.connect()
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3197, in connect
    return self._connection_cls(self, close_with_result=close_with_result)
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 96, in __init__
    else engine.raw_connection()
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3276, in raw_connection
    return self._wrap_pool_connect(self.pool.connect, _connection)
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3246, in _wrap_pool_connect
    Connection._handle_dbapi_exception_noconnection(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 2100, in _handle_dbapi_exception_noconnection
    util.raise_(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
    raise exception
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3243, in _wrap_pool_connect
    return fn()
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 310, in connect
    return _ConnectionFairy._checkout(self)
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 868, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 476, in checkout
    rec = pool._do_get()
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/impl.py", line 145, in _do_get
    with util.safe_reraise():
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
    raise exception
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/impl.py", line 143, in _do_get
    return self._create_connection()
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 256, in _create_connection
    return _ConnectionRecord(self)
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 371, in __init__
    self.__connect()
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 665, in __connect
    with util.safe_reraise():
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
    raise exception
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 661, in __connect
    self.dbapi_connection = connection = pool._invoke_creator(self)
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/create.py", line 590, in connect
    return dialect.connect(*cargs, **cparams)
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 584, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/usr/local/lib/python3.10/site-packages/psycopg2/__init__.py", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) connection to server at "redacted.rds.amazonaws.com" (x.x.x.x), port xxxx failed: FATAL:  remaining connection slots are reserved for non-replication superuser and rds_superuser connections

(Background on this error at: https://sqlalche.me/e/14/e3q8)

daveqnet changed the title to "fides >= 2.18.0 maintaining a large number of open idle connections to hosted database" on Aug 16, 2023, reverted it the same day, and changed it to "fides >= 2.18.0 maintains a large number (~50) of open idle connections to hosted database" on Aug 17, 2023.
@daveqnet
Contributor Author

> A valid workaround is to reduce idle_in_transaction_session_timeout to a low value on the database side

I was wrong on this. This timeout only ends connections in the idle in transaction state; it doesn't time out connections in the plain idle state. From the database side, it looks like the best way we can solve this is by introducing a connection pooler like AWS RDS Proxy between the tenants and the database.
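
As a stop-gap on the database side, long-idle sessions could in principle be cleared manually with something like the following (a rough sketch reusing the fides_nightly role from above; pg_terminate_backend closes the session, and the client pool reconnects on next use):

-- terminate fides sessions that have been idle for more than 30 minutes
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE usename = 'fides_nightly'
  AND state = 'idle'
  AND state_change < now() - interval '30 minutes';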

@ThomasLaPiana, there's no config we can set on the fides containers to limit their connections client-side, is there?

@ThomasLaPiana
Contributor

ThomasLaPiana commented Aug 17, 2023

@daveqnet our docs say there is: https://ethyca.github.io/fides/2.18.0/config/ (Ctrl+F for "database" and find the documentation in the code embed)

We default to 50, which seems kind of high.

@daveqnet
Contributor Author

Aha, yes, I should probably check the config docs before asking a config question!! 🤦

Thanks @ThomasLaPiana 🙏

@ThomasLaPiana
Contributor

I worked hard to write a script to auto-generate that documentation 🥲

@daveqnet
Contributor Author

> I worked hard to write a script to auto-generate that documentation 🥲

I'm genuinely going to add it to my bookmarks toolbar 😆

@ThomasLaPiana
Contributor

> > I worked hard to write a script to auto-generate that documentation 🥲
>
> I'm genuinely going to add it to my bookmarks toolbar 😆

😂 ❤️ 💯
