
Mass extinction of connections impacting AWS Aurora reader endpoint load balancing #1247

Open
gedl opened this issue Sep 29, 2018 · 6 comments

@gedl

gedl commented Sep 29, 2018

Environment

HikariCP version: 3.2.0
JDK version     : 1.8.0_162
Database        : Aurora MySQL 5.6.10a
Driver version  : mysql-connector-java 8.0.12

Extra info: connection pool size: 10, max idle unset.

Having set maxLifetime to 30m (the default 1,800,000 ms), I would expect the behaviour described in #480 to recycle connections "out of phase" and so avoid mass extinction of the pool.
What I am observing instead (see the attached graph) is that the AWS Aurora reader endpoint dispatches most connections to the same read replica (note that on the x-axis, the interval between the y-axis's upward and downward jumps is almost exactly 30 minutes). The graph represents the >40th generation of connections.
The result is that Aurora, presumably because it somehow caches the number of connections on each replica, assigns a whole pool to one replica. Eventually (we have 3 replicas) one replica ends up with nearly all the connections from all application servers, while the other 2 sit almost idle for a generation's lifetime.

I would expect the changes in #480 to gradually scatter the recycling of the pool, up to a maximum of 18s of variance after some generations. Admittedly the x-axis of the graph is not granular enough to tell exactly how far apart the connections reach the Aurora cluster, but they do not appear to be spread in any material way.
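
For reference, here is a minimal sketch of the setup described above (the endpoint is a placeholder and credentials are omitted, so this won't actually connect as written):

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import java.util.concurrent.TimeUnit;

public class ReaderPoolExample {
    public static void main(String[] args) {
        HikariConfig config = new HikariConfig();
        // Placeholder Aurora reader endpoint; real host/credentials omitted.
        config.setJdbcUrl("jdbc:mysql://cluster.cluster-ro-xxxx.eu-west-1.rds.amazonaws.com/db");
        config.setMaximumPoolSize(10);                        // pool size from this report
        config.setMaxLifetime(TimeUnit.MINUTES.toMillis(30)); // 1,800,000 ms, the default
        try (HikariDataSource ds = new HikariDataSource(config)) {
            // application uses ds here; every ~30m the whole pool retires together
        }
    }
}
```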

I have 3 questions:
1 - Has anyone observed this phenomenon with this or a similar setup?
2 - What is my best logging option on the Hikari side to observe the application-side lifecycle of each generation of connections?
3 - Is there any way to directly set the amount of variance desired to avoid mass extinction?

Thank you very much.

@goughy000
Contributor

goughy000 commented Oct 6, 2018

Any reason you are using the MySQL connector rather than MariaDB? AWS suggests the MariaDB connector in its documentation, and that driver also contains extra functionality to handle Aurora clusters more effectively than the MySQL connector does.

If you make the switch, ensure you activate the functionality in the JDBC URL and use the cluster endpoint address:
jdbc:mysql:aurora://cluster.cluster-xxxx.eu-west-1.rds.amazonaws.com/db

Important: you must use the *.amazonaws.com address; if you wrap it in a custom CNAME, it won't work effectively with the MariaDB driver. See here

More info: https://mariadb.com/kb/en/library/failover-and-high-availability-with-mariadb-connector-j/#specifics-for-amazon-aurora
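
If it helps, a hedged sketch of what the switch could look like in HikariCP (the host is a placeholder; the explicit driver class guards against the MySQL connector being picked up if both drivers are on the classpath):

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class AuroraPoolExample {
    public static void main(String[] args) {
        HikariConfig config = new HikariConfig();
        // Pin the MariaDB driver in case mysql-connector-java is also present.
        config.setDriverClassName("org.mariadb.jdbc.Driver");
        // The "aurora" keyword activates the driver's cluster-aware handling;
        // note the *.amazonaws.com cluster endpoint, not a custom CNAME.
        config.setJdbcUrl("jdbc:mysql:aurora://cluster.cluster-xxxx.eu-west-1.rds.amazonaws.com/db");
        try (HikariDataSource ds = new HikariDataSource(config)) {
            // use ds...
        }
    }
}
```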

@brettwooldridge
Owner

@gedl have you tried the above suggestion? Any update on this issue?

@gedl
Author

gedl commented Nov 3, 2018

Hey, sorry for the delay.

Inspired by the MariaDB driver, we ended up implementing a generic JDBC driver that works on top of any Aurora cluster, fully supporting MySQL and PostgreSQL, and presumably future Aurora flavours of other JDBC-accessible RDBMSs.

We've also made it open source: https://github.com/DiceTechnology/dice-fairlink

It works via the AWS RDS API and therefore does not rely on amazonaws.com sub-domains.

Should I close this issue, or do you want to pursue my question 3?

@brettwooldridge
Owner

First of all, awesome. Just awesome. 👏

I love to see open source contributions like this.

Let’s leave this open for the time being. If retirements from the pool are not distributed well enough, I think we need a better algorithm. A deterministic one would also be better than relying on a pseudo-random distribution to avoid extinction events.
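
To illustrate the idea (this is not HikariCP's actual code, and the 2.5% spread window is an assumption based on the variance HikariCP currently draws at random), a deterministic stagger could space retirements evenly across the pool:

```java
import java.util.concurrent.TimeUnit;

public class DeterministicStagger {
    /**
     * Illustrative only: spread connection retirements evenly over a fixed
     * window instead of drawing a random variance per connection.
     */
    static long lifetimeFor(long maxLifetimeMs, int connectionIndex, int poolSize) {
        long window = maxLifetimeMs / 40;           // assumed 2.5% spread window
        long step = window / Math.max(poolSize, 1); // evenly spaced offsets
        return maxLifetimeMs - (connectionIndex % poolSize) * step;
    }

    public static void main(String[] args) {
        long maxLifetime = TimeUnit.MINUTES.toMillis(30);
        for (int i = 0; i < 10; i++) {
            // With a 10-connection pool, retirements land ~4.5s apart.
            System.out.println(lifetimeFor(maxLifetime, i, 10));
        }
    }
}
```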

Again, really impressed and inspired by your team’s initiative in taking the bull by the horns re: Aurora.

@gedl
Author

gedl commented Dec 9, 2018

Only noticed your response now.

We were surprised by the lack of solutions for what seems to be a common problem with such a popular combination (HikariCP + RDS/Aurora) and thought this could be useful.

We've taken so much from open source, and we'd hate to see people waste time on these "details" instead of making their products great, so open-sourcing it was the only acceptable thing to do.

It has been working well in production since then, though I'd like to see it spread the connections even better. There are a couple of edge cases related to the arithmetic (a connection pool size not divisible by the number of replicas, etc., as illustrated below), but it's much better than before.
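
To make that edge case concrete with the numbers from this thread, a 10-connection pool over 3 replicas can at best split 4/3/3:

```java
public class SplitArithmetic {
    public static void main(String[] args) {
        // A 10-connection pool over 3 replicas cannot split evenly.
        int poolSize = 10, replicas = 3;
        System.out.println(poolSize / replicas); // 3 connections per replica...
        System.out.println(poolSize % replicas); // ...with 1 replica carrying an extra (4/3/3)
    }
}
```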

@gedl
Author

gedl commented Aug 19, 2019

Because this thread is still open, I think it's relevant to note that dice-fairlink versions 1.x.x had a scalability problem: they would be rate limited by the RDS API if many client applications were deployed in the same AWS account (they would all hit the RDS API at roughly the same time).

Versions 2.x.x work around these undocumented limits imposed by AWS.
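
For anyone hitting the same limits on 1.x.x, one generic mitigation (a sketch of the general technique, not dice-fairlink's actual implementation; discoverReplicas is a hypothetical callback) is to de-synchronise the periodic discovery calls with a random initial delay:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

public class JitteredDiscovery {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    /**
     * Schedule replica discovery with a random initial delay so that many
     * applications in the same AWS account don't all call the RDS API at
     * the same instant.
     */
    public void start(Runnable discoverReplicas, long periodSeconds) {
        long initialJitter = ThreadLocalRandom.current().nextLong(periodSeconds);
        scheduler.scheduleAtFixedRate(discoverReplicas, initialJitter, periodSeconds, TimeUnit.SECONDS);
    }
}
```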
