SPIKE: Connection Pooling and RDS Proxy Costs

Addressing issue: SPIKE: Explore the option of not using RDS Proxy to Reduce FAM Cost

Background

Our backend components (the API and the Auth function) are implemented as AWS Lambda functions. Because Lambda functions are stateless, a request cannot count on an open database connection being available and may need to create one in order to be fulfilled. Creating a database connection is expensive in terms of both time and memory, and a high number of concurrent connections can degrade database performance. For these reasons, connection pooling is required.
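
To make the effect of pooling concrete, the toy simulation below compares the peak number of open database connections with and without a pool. It is only a sketch: `FakeDatabase`, `ConnectionPool`, `peak_without_pool`, and `peak_with_pool` are hypothetical names, no real database is involved, and the request counts are made up.

```python
import queue


class FakeDatabase:
    """Stand-in for a database server that only counts open connections."""

    def __init__(self):
        self.open_connections = 0
        self.peak_connections = 0

    def connect(self):
        self.open_connections += 1
        self.peak_connections = max(self.peak_connections, self.open_connections)
        return object()  # opaque connection handle

    def close(self, conn):
        self.open_connections -= 1


class ConnectionPool:
    """Fixed-size pool: callers borrow a connection instead of opening one."""

    def __init__(self, db, size):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(db.connect())

    def acquire(self):
        return self._idle.get()  # would block if every connection were in use

    def release(self, conn):
        self._idle.put(conn)


def peak_without_pool(concurrent_requests):
    """Every in-flight request opens and holds its own connection."""
    db = FakeDatabase()
    conns = [db.connect() for _ in range(concurrent_requests)]
    for conn in conns:
        db.close(conn)
    return db.peak_connections


def peak_with_pool(concurrent_requests, pool_size):
    """Requests share pool_size connections and are served in batches."""
    db = FakeDatabase()
    pool = ConnectionPool(db, pool_size)
    remaining = concurrent_requests
    while remaining > 0:
        batch = min(remaining, pool_size)
        borrowed = [pool.acquire() for _ in range(batch)]
        for conn in borrowed:
            pool.release(conn)
        remaining -= batch
    return db.peak_connections


print(peak_without_pool(200))   # 200
print(peak_with_pool(200, 10))  # 10
```

In this model the database never sees more connections than the pool size, at the price of requests waiting their turn when the pool is busy.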

Current Situation

We currently use RDS Proxy. This managed AWS service performs well and requires very little maintenance effort. Unfortunately, it is quite costly for us: RDS Proxy is charged per hour based on the number of ACUs (Aurora Capacity Units) configured, and the minimum configurable size is 8 ACUs, which is much higher than our current solution requires.

Additionally, we run four different environments (PROD, TEST, DEV, and TOOLS). Each of these environments is currently running at all times, accumulating costs for the RDS Proxy component 24/7.
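
As a rough illustration of how the always-on cost adds up, the sketch below multiplies the 8 ACU minimum by an hourly price and by the four environments. The price per ACU-hour is a placeholder assumption, not a quoted AWS rate; substitute the current price for our region before drawing conclusions.

```python
HOURS_PER_MONTH = 730  # average month: 8760 hours / 12


def monthly_proxy_cost(billed_acus, price_per_acu_hour, hours=HOURS_PER_MONTH):
    """Cost of one RDS Proxy that is billed for every hour of the month."""
    return billed_acus * price_per_acu_hour * hours


ENVIRONMENTS = ["PROD", "TEST", "DEV", "TOOLS"]
MIN_BILLED_ACUS = 8          # minimum size stated above
PRICE_PER_ACU_HOUR = 0.015   # ASSUMPTION: placeholder price in USD

per_env = monthly_proxy_cost(MIN_BILLED_ACUS, PRICE_PER_ACU_HOUR)
total = per_env * len(ENVIRONMENTS)
print(f"per environment: ${per_env:.2f}/month")
print(f"all {len(ENVIRONMENTS)} environments: ${total:.2f}/month")
```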

Further Considerations

We currently have a single database proxy user that is shared by both the Auth function and the API. There is a security requirement to have a separate proxy user for each security profile. We conducted a spike (SPIKE: RDS proxy usage refinement) to determine whether RDS Proxy can be configured with multiple proxy accounts. The spike found that this should not be a problem.

The other services evaluated (PgBouncer and Pgpool-II) also support multiple proxy accounts, so this turns out to be a non-issue. We assume that separate proxy accounts will be used with whichever connection pooling solution is chosen.

Options Analysis

Option 1: Control RDS Proxy Costs by Spinning Down Environments

Obviously, PROD needs to be operating continuously. There is no opportunity for cost savings by turning it off periodically.

TEST, DEV, and TOOLS are only in use periodically. Because the provisioning of these environments is managed by Terraform, it is possible to destroy the environment (or at least the RDS Proxy) intermittently.

Pros:

Simplest way to save money. No new engineering work is required to start doing this (just ongoing time and effort).
Only pay for RDS Proxy during the hours when TEST, DEV, and TOOLS are actively running.
High availability in PROD. Good performance with no effort.
The minimum configuration for RDS Proxy (8 ACU) will likely never have to be increased to meet future production load.
RDS Proxy has a very high level of security (based on AWS Secrets Manager).

Cons:

Takes team time and effort to manage the state of all the different environments. That time has an opportunity cost: it could otherwise be spent developing new features and accomplishing other work.
There are some issues with spinning up environments from scratch. It takes time, and there is at least one known race condition in the Terraform configuration that requires the affected GitHub Action ("Destroy" or "Deploy") to be re-run.
Still expensive to run the RDS Proxy in production.
The "Destroy" and "Deploy" scripts can be scheduled to run automatically, but RDS Proxy is still billed at the full rate while they are running.
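
The trade-off in Option 1 can be estimated with a small calculation. The sketch below assumes one "Deploy"/"Destroy" cycle per active day and treats the time those scripts take as billed overhead; the schedule and overhead figures are hypothetical and should be replaced with measured values.

```python
HOURS_PER_WEEK = 7 * 24


def billable_hours_per_week(active_hours_per_day, days_per_week, overhead_hours_per_cycle):
    """Hours per week the proxy exists and is billed.

    Assumes one Deploy/Destroy cycle per active day, with the script run time
    (overhead_hours_per_cycle) billed at the full rate.
    """
    return days_per_week * (active_hours_per_day + overhead_hours_per_cycle)


def weekly_saving_fraction(active_hours_per_day, days_per_week, overhead_hours_per_cycle):
    """Fraction of the always-on proxy cost avoided by the schedule."""
    billed = billable_hours_per_week(
        active_hours_per_day, days_per_week, overhead_hours_per_cycle
    )
    return 1 - billed / HOURS_PER_WEEK


# Hypothetical schedule: 10 hours a day, 5 days a week, 30 minutes of script time per cycle.
saving = weekly_saving_fraction(10, 5, 0.5)
print(f"estimated saving for one environment: {saving:.1%}")
```

The overhead term is what makes very short or very frequent cycles less attractive: every extra cycle adds billed time without adding usable time.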

Option 2: Connect Directly to Database in non-PROD Environments

It's possible to introduce control logic in Terraform to not provision the RDS Proxy service in lower environments. In this case, the Auth function and the API would connect directly to the database service (still using the database proxy accounts).
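
On the application side, this could be as simple as choosing the database host from configuration. The sketch below assumes Terraform passes the endpoints to the Lambda functions as environment variables; the variable names `DB_PROXY_ENDPOINT` and `DB_CLUSTER_ENDPOINT` and the host names are hypothetical, not taken from the FAM code.

```python
import os


def resolve_db_host(env=None):
    """Return the RDS Proxy endpoint when one is provisioned, else the database endpoint.

    ASSUMPTION: DB_PROXY_ENDPOINT is only set (non-empty) in environments where
    Terraform provisions the proxy; DB_CLUSTER_ENDPOINT is always set.
    """
    env = os.environ if env is None else env
    proxy_endpoint = env.get("DB_PROXY_ENDPOINT", "").strip()
    if proxy_endpoint:
        return proxy_endpoint
    return env["DB_CLUSTER_ENDPOINT"]


# Example: PROD has a proxy, DEV does not.
prod = {
    "DB_PROXY_ENDPOINT": "proxy.example.internal",
    "DB_CLUSTER_ENDPOINT": "cluster.example.internal",
}
dev = {"DB_CLUSTER_ENDPOINT": "cluster.example.internal"}

print(resolve_db_host(prod))  # proxy.example.internal
print(resolve_db_host(dev))   # cluster.example.internal
```

Keeping the decision in one function means the rest of the code does not need to know whether a proxy exists in the current environment.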

Pros:

No cost incurred in lower environments for RDS Proxy.
High availability in PROD. Good performance with no effort.
The minimum configuration for RDS Proxy (8 ACU) will likely never have to be increased to meet future production load.
RDS Proxy has a very high level of security (based on AWS Secrets Manager), which we would still benefit from in production.

Cons:

TEST, DEV, and TOOLS would no longer be prod-like environments in this respect. Testing in these environments will not find defects that are related to the RDS Proxy configuration.
Requests that connect directly to the database will run slower and could potentially time out. Performance testing in lower environments will not be possible. The responsiveness of the web application will probably be impacted, and it will be difficult to tell during testing whether poor performance is being caused by the database connection setup or for some other reason.
The PROD environment costs will remain the same.
We will most likely need to use the RDS Proxy in the TEST environment at least some of the time in order to get realistic data for UAT and release testing.
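
One way to reduce the diagnosis problem mentioned above would be to record connection setup time separately from query time. The sketch below is an assumption about how that could look, not existing FAM code: `timed` and `handle_request` are hypothetical helpers, and `connect` and `run_query` stand in for whatever database driver calls the application actually makes.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, timings):
    """Store the wall-clock duration of the wrapped block in timings[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


def handle_request(connect, run_query):
    """Run one request and report connection setup and query time separately."""
    timings = {}
    with timed("connect_s", timings):
        conn = connect()
    with timed("query_s", timings):
        result = run_query(conn)
    return result, timings


def slow_connect():
    """Fake connection setup that takes about 20 ms."""
    time.sleep(0.02)
    return "fake-connection"


result, timings = handle_request(slow_connect, lambda conn: 42)
print(result, {name: round(value, 3) for name, value in timings.items()})
```

Logging both numbers per request would show whether slow responses in a proxy-less environment come from connection setup or from something else.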

Option 3: Connect Directly to Database in DEV and TOOLS, Leaving TEST as a PROD-like Environment

Same as Option 2, but keep the RDS Proxy in TEST for performance testing and prod-like UI performance.

Pros:

No cost incurred in DEV and TOOLS environments for RDS Proxy.
High availability in PROD. Good performance with no effort.
TEST environment is still PROD-like for testing purposes.
The minimum configuration for RDS Proxy (8 ACU) will likely never have to be increased to meet future production load.
RDS Proxy has a very high level of security (based on AWS Secrets Manager).

Cons:

DEV and TOOLS would no longer be prod-like environments in this respect. Testing in these environments will not find defects that are related to the RDS Proxy configuration.
Requests that connect directly to the database will run slower and could potentially time out. Performance testing in DEV and TOOLS will not be possible. The responsiveness of the web application in those environments will probably be impacted, and it will be difficult to tell during testing whether poor performance is being caused by the database connection setup or by something else.
The PROD environment costs will remain the same.
Costs for the TEST environment will continue to be the same as today (same as PROD).

Option 4: Implement a Third-Party Connection Pool

Options for PostgreSQL connection pooling tools are outlined in this article.

The idea would be to provision one of the available open-source connection pooling products to fulfil the same function as is currently being performed by the RDS Proxy. These products would run on a persistent containerized server in the AWS environment (probably an EC2 server or an ECS cluster).

The two main candidates are Pgpool-II and PgBouncer. They each have their own pros and cons, but they are similar in that both require some kind of Linux-based hosting. The cost for either of these would be limited to the cost of the infrastructure (EC2 or ECS).

General Pros:

More flexible in terms of costs. The smallest possible configuration could be quite cheap.
Can be scaled differently in each environment if necessary to save costs.

General Cons:

Lots of tricky engineering work to set it up.
Supporting high availability is even trickier.
Less secure as it requires more components to harden.
Uncertain how much scaling is required and what the ultimate cost would turn out to be at the high end.
Account password management in these tools does not integrate with AWS Secrets Manager, so passwords would be stored in a less secure fashion.

Option 4a: PgBouncer

PgBouncer Pros:

Simpler than Pgpool-II.
Only feature is connection pooling, so less attack surface for security risks.

PgBouncer Cons:

Impossible to protect passwords in a way that is as secure as AWS Secrets Manager.
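
For a sense of what the smallest PgBouncer setup involves, the sketch below renders a minimal `pgbouncer.ini` from Python. The setting names are standard PgBouncer options, but every value (database alias, host, ports, pool sizes, file path) is a hypothetical placeholder, and the result has not been tested against a running PgBouncer or our database. The `auth_file` line also shows the weakness noted above: the proxy account credentials end up in a file on the host.

```python
import configparser
import io


def render_pgbouncer_ini(db_alias, db_host, db_name, pool_size):
    """Render a minimal pgbouncer.ini as a string (all values are placeholders)."""
    config = configparser.ConfigParser(interpolation=None)
    config["databases"] = {
        # Clients connect to db_alias; PgBouncer forwards to the real database.
        db_alias: f"host={db_host} port=5432 dbname={db_name}",
    }
    config["pgbouncer"] = {
        "listen_addr": "0.0.0.0",
        "listen_port": "6432",
        "auth_type": "scram-sha-256",
        "auth_file": "/etc/pgbouncer/userlist.txt",  # one entry per proxy account
        "pool_mode": "transaction",
        "max_client_conn": "200",
        "default_pool_size": str(pool_size),
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


ini_text = render_pgbouncer_ini("fam", "cluster.example.internal", "fam", 10)
print(ini_text)
```

Transaction pooling is chosen here on the assumption that Lambda requests are short and do not depend on session-level state; if they do, session pooling would be the safer setting.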

Option 4b: Pgpool-II

Pgpool-II Pros:

More options for authentication.

Pgpool-II Cons:

Includes features for managing PostgreSQL clusters (replication, load balancing, etc.). We don't need any of these features since high availability is inherent in RDS Aurora Serverless. Extra features create a larger attack surface for security risks.
Impossible to protect passwords in a way that is as secure as AWS Secrets Manager.

Recommendation:

Option 3: Connect Directly to Database in DEV and TOOLS, Leaving TEST as a PROD-like Environment

Rationale:

The password management that RDS Proxy provides through AWS Secrets Manager is more secure than what the alternatives offer, and the performance in production is excellent.
The cost is somewhat high, but the number of hours it would take to set up and operate a third-party solution makes the monthly cost of the service look reasonable by comparison. A third-party solution would also incur infrastructure costs (unknown at this time).
We really don't need the RDS Proxy in DEV and TOOLS, so it makes sense to not use it there. The benefits of keeping it running in TEST outweigh the costs.
If we really want to save the money of having RDS Proxy in TEST, we can always pull it out (like DEV and TOOLS) or spin down the TEST environment when we are not doing release testing.
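
The cost argument in the rationale can be checked with a simple break-even estimate. The sketch below is illustrative only: `break_even_months` is a hypothetical helper, and every number in the example (hours, hourly rate, monthly costs) is invented, since the real infrastructure cost of a third-party pool is unknown at this time.

```python
def break_even_months(setup_hours, hourly_rate, proxy_monthly_cost,
                      replacement_monthly_cost, upkeep_hours_per_month=0.0):
    """Months until a third-party pool pays back its setup cost.

    Returns None when the replacement never pays back, that is, when its
    infrastructure plus upkeep costs at least as much per month as RDS Proxy.
    """
    monthly_saving = (
        proxy_monthly_cost
        - replacement_monthly_cost
        - upkeep_hours_per_month * hourly_rate
    )
    if monthly_saving <= 0:
        return None
    return setup_hours * hourly_rate / monthly_saving


# Hypothetical inputs: 80 setup hours at $100/hour, $350/month for RDS Proxy,
# $50/month for replacement infrastructure, 2 upkeep hours per month.
print(break_even_months(80, 100, 350, 50, 2))  # 80.0
print(break_even_months(80, 100, 350, 50, 4))  # None
```

With inputs like these the payback period is measured in years, and a small increase in upkeep removes the saving entirely, which is the intuition behind preferring RDS Proxy.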

FDS Access Management (FAM)
