Akka Http Client pool connections are not reestablished after DNS positive-ttl #1226

wojda opened this issue Jun 23, 2017 · 25 comments

@wojda

wojda commented Jun 23, 2017

We have found that under some circumstances, Akka's Http client does not honor the positive-ttl expiry value and does not pick up new DNS entries.

It looks as if, under load and using the default http connection pool, the client will never re-resolve the DNS entry as long as the connection is not closed, regardless of the positive-ttl value.

Steps to reproduce:

  0. Using akka-http 10.0.6 and akka 2.5.2.
  1. DNS resolves test.com to server_A with ip_A.
  2. Run an akka-http application with the following settings:

       dns.inet-address {
         positive-ttl = 30s
         negative-ttl = 30s
       }

  3. Run load continuously against this akka-http app, which makes requests to test.com (see the sketch after this list).
  4. Change the DNS entry (in /etc/hosts, for instance) to point to ip_B. NOTE: server_A with ip_A is still running.
  5. Wait for the positive-ttl DNS cache expiry (30 seconds, in this example).
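
For reference, a minimal sketch of the load loop from step 3 (assuming akka-http 10.0.x; the URI and request rate are illustrative):

    import akka.actor.ActorSystem
    import akka.http.scaladsl.Http
    import akka.http.scaladsl.model.HttpRequest
    import akka.stream.ActorMaterializer

    import scala.concurrent.duration._

    object LoadLoop extends App {
      implicit val system = ActorSystem("client-system")
      implicit val materializer = ActorMaterializer()
      import system.dispatcher

      // Fire requests continuously so the default host connection pool
      // for test.com never has idle connections.
      system.scheduler.schedule(0.millis, 10.millis) {
        Http().singleRequest(HttpRequest(uri = "http://test.com/"))
          .foreach(_.discardEntityBytes())
      }
    }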

Expected behaviour:

  • Once the DNS cache expires, every new request should be sent to ip_B.

Current behaviour:

  • New requests after the DNS cache expiry are still going to ip_A.
  • The only way to get the akka-http application to pick up the new DNS entry is to restart it.
@jrudolph
Member

That seems to be the case because akka's DNS resolver is based on the JVM's InetAddress.getAllByName, which introduces another layer of caching.

You can observe the behavior by just calling java.net.InetAddress.getAllByName("...") and changing /etc/hosts entries in between.

It seems the JVM DNS caching layer is configured using java.security.Security properties, which are defined in a security file if a SecurityManager is installed; otherwise it can be overridden (or, in this case, turned off) by setting the JVM property -Dsun.net.inetaddr.ttl=0. Can you see if that works for you?
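
For illustration, the same can also be set programmatically before the first lookup (a sketch; networkaddress.cache.ttl is the standard java.security property behind the sun.net.inetaddr.ttl shortcut):

    import java.net.InetAddress
    import java.security.Security

    // Disable the JVM-level positive DNS cache (0 = no caching).
    // Must run before the first lookup, since the value is read once.
    Security.setProperty("networkaddress.cache.ttl", "0")

    // Repeated lookups should now reflect changes to /etc/hosts:
    InetAddress.getAllByName("test.com").foreach(println)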

@jrudolph jrudolph added the 0 - new Ticket is unclear on its purpose or if it is valid or not label Jun 26, 2017
@wojda
Author

wojda commented Jul 2, 2017

Thank you @jrudolph for the quick response. Unfortunately, the problem is not related to the JVM's DNS resolver; that would have been easy to fix.
I've written a failing test that demonstrates the issue: https://github.com/wojda/AkkaHttpDnsInvestigation. To make sure it's not a problem with the JVM or other system config, the test starts three docker containers: one with the akka-http client, and two with the same server (but different IPs). You can build and run the test with one command; please check readme.md. I hope the test will be useful.

I've done a quick investigation too. According to Akka's logs, a hostname is only resolved when a new connection is created. Example:

[DEBUG] [akka://client-system/system/IO-TCP/selectors/$a/7] Resolving server.com before connecting
[DEBUG] [akka://client-system/system/IO-TCP/selectors/$a/7] Attempting connection to [server.com/172.17.0.3:8080]
[DEBUG] [akka://client-system/system/IO-TCP/selectors/$a/7] Connection established to [server.com:8080]

In my case, because of high TPS (no idle connections) and the fact that the deprecated server_A is still running and responding, a new connection is never created. After the DNS entry changes, the Akka Http client keeps using the existing connection pool. Please correct me if I'm wrong, but in that case 'positive-ttl' has no effect, because the Akka Http client does not create new connections and therefore never re-resolves the hostname.

@jrudolph
Member

jrudolph commented Jul 5, 2017

the Akka Http client does not create new connections and therefore never re-resolves the hostname.

Yes, that's correct. I guess if you need connections to be refreshed, we would need to add another feature to restrict the lifetime of persistent connections (which could be a reasonable thing to do).

@mdedetrich
Contributor

I suspect that this is causing issues on our end, where the underlying host isn't getting updated because the DNS timeout is not being honoured. Is there a workaround for this?

@jrudolph jrudolph added discuss Tickets that need some discussion before proceeding t:client Issues related to the HTTP Client 1 - triaged Tickets that are safe to pick up for contributing in terms of likeliness of being accepted and removed 0 - new Ticket is unclear on it's purpose or if it is valid or not labels Jul 19, 2017
@jrudolph jrudolph changed the title DNS positive-ttl is not honored by Akka Http Client Akka Http Client pool connections are not reestablished after DNS positive-ttl Jul 19, 2017
@jrudolph
Member

@mdedetrich I think so far it isn't clear what could or should be done on the Akka HTTP level.

So far, the only confirmed "issue" in Akka Http is that it keeps active persistent connections open for as long as possible. I'd say that's pretty reasonable behavior. Why make a new connection (potentially to a new IP address) when the old one is still alive and serving requests? Or are you seeing something different? Can the server be changed to close connections after a while?

Are there any other HTTP clients that actually couple DNS lifetimes with lifetimes of pool connections?

@jrudolph
Member

That said, we might want to add an API to give users more control over the pools. This could e.g. be a method that closes all connections to a given host without shutting down that pool completely.

@jrudolph
Member

jrudolph commented Jul 19, 2017

square/okhttp#3374 also suggested solving this on the server side / load balancer.

@randomstatistic

Why make a new connection (potentially to a new IP address) when the old one is still alive and serving requests?

This is the reason I got interested in this thread. Regardless of the mechanism you use to convert a "host reference" into a pool of servers (DNS, LB), you end up with a pool of persistent connections.

So let's say you have two servers, A and B. Your client establishes a connection pool with roughly the same number of connections to each of the two, because balancing load is what your "host reference" is for.
Now B goes down; maybe you just need to restart it. All the connections to B are broken, and the client establishes replacement connections to A to get the pool back up to the desired size.
Now B comes back up, but there's no way (unless I'm missing something) to instruct the client to rebalance the persistent connections. All the traffic now goes to A until A closes its connections.

A connection lifetime (either a duration or a request count) would help solve this by gradually rebalancing the connection pool.

@jrudolph
Member

A connection lifetime (either a duration or a request count) would help solve this by gradually rebalancing the connection pool.

I agree that this would probably help. But also note that you are pushing a backend issue to the client here. I think this issue can be seen as evidence that this is a brittle solution that requires full control over all sides of the connections.

@mdedetrich
Contributor

@jrudolph My issue was actually unrelated, so you can ignore my earlier comment

@sergeykolbasov

sergeykolbasov commented Jul 20, 2017

Hi there

I guess it would be nice and meaningful to have behaviour similar to what Finagle does for its client.

  • A watermark connection pool with low and high marks
  • Graceful rotation of connections by TTL: say, every few minutes (or any other configurable value) a new connection is pushed into the pool while an old one is popped, respecting the low mark.

@jrudolph
Member

A watermark connection pool with low and high marks

@imliar could you explain how this is related to this ticket? I tried to understand the documentation, but at a glance I couldn't tell what it is about. Maybe that's because Finagle is about services in general while akka-http is only concerned with HTTP?

Graceful rotation of connections by TTL: say, every few minutes (or any other configurable value) a new connection is pushed into the pool while an old one is popped, respecting the low mark.

I guess you mean .withSession.maxLifeTime(20.seconds), which seems to be similar to the suggestion above.
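
For comparison, a rough sketch of the Finagle configuration being discussed (API names from Finagle's stack-client configuration; import paths vary between Finagle versions):

    import com.twitter.conversions.DurationOps._
    import com.twitter.finagle.Http

    // Watermark-style pool bounds plus a capped session (connection)
    // lifetime, so connections rotate while the pool stays warm.
    val client = Http.client
      .withSessionPool.minSize(5)
      .withSessionPool.maxSize(20)
      .withSession.maxLifeTime(2.minutes)
      .newService("test.com:80")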

@sergeykolbasov

sergeykolbasov commented Jul 20, 2017

@jrudolph Yes, but no.

A watermark connection pool is just one pooling mechanism: you have a minimum and a maximum number of connections in the pool, and as long as there is more load than the minimum number of connections can serve, it grows them up to the high mark.

Connection shutdown could be achieved with any pooling, but with watermarks the number of connections never goes down to zero (unless specifically configured to), which avoids a cold connection pool. That could be a different topic, of course.

@jrudolph
Member

Sounds like our min-connections / max-connections settings.
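
For reference, a sketch of those settings in application.conf (the values are illustrative):

    akka.http.host-connection-pool {
      # floor of warm connections kept open, similar to a low watermark
      min-connections = 2
      # upper bound the pool may grow to under load, similar to a high watermark
      max-connections = 32
    }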

@avietrov

avietrov commented Jul 29, 2017

But also note that you are pushing a backend issue to the client here. I think this issue can be seen as evidence that this is a brittle solution that requires full control over all sides of the connections.

@jrudolph an example that doesn't involve any backends failing is a gradual traffic switch. If that is achieved by having two load balancers and weighted DNS resolution (e.g. how AWS Route 53 does it), then the issue cannot be solved at the load balancer level (as suggested in the okhttp reference), since the traffic is actually being diverted from one LB to the other.

In this case the only solution I'm aware of is to forcefully kill the "old" LB, thus throwing 5xxs, which kills connections on the client's side and forces akka-http to establish new connections, which in turn re-resolves DNS. Doing so at high load results in a significant number of errors and most likely opens a circuit breaker. And neither client nor server is happy about that.

@jrudolph
Member

jrudolph commented Jul 29, 2017 via email

If it's gradual the load balancer can start to close idle persistent connections

@alivanni

alivanni commented Oct 5, 2017

If it's gradual the load balancer can start to close idle persistent connections

@jrudolph I feel like there is a chicken-and-egg problem here. Connections will never become idle, because the client will never move traffic away from the old LB / stack. That is exactly what we are trying to achieve: force the client to start sending requests to the new stack.

@agorski

agorski commented Apr 20, 2018

@jrudolph are you considering any solution for this issue?
Caching DNS entries forever is not the best idea in the cloud: servers can be added or removed dynamically, so such caching will not work.

Do you have any idea at least for a workaround?

@MikhailGolubtsov

MikhailGolubtsov commented Jun 21, 2018

@jrudolph I agree with @agorski and @alivanni and don't see a way to work around this outside of akka-http. I cannot add further arguments, but please consider it a real issue: it's critical to my team, causing trouble in production, and if there is no solution we unfortunately have to migrate away from the akka-http client (I know of another team at Zalando that already did, also because of this).

@johanandren
Member

There is a PR in progress which adds max-connection-keep-alive-time.

@raboof
Member

raboof commented Jun 26, 2018

Related to #1768. We are aware of this and agree it's an area we plan to improve. We're currently working on improving our DNS infrastructure, and one of the next steps is to also take the TTL into account.

@AnanyaDeb

AnanyaDeb commented Sep 27, 2021

My team is facing a related problem and we need some guidance. We are using the scredis library to connect to AWS Redis. When the IP of the Redis node changes, re-resolution of the host is not attempted.

The relevant section of our application conf is:

            "negative-ttl" : "never",
            "positive-ttl" : "never",
            "provider-object" : "akka.io.dns.internal.AsyncDnsProvider",
            "resolve-timeout" : "5s",
            "search-domains" : "default"
        },
        "dispatcher" : "akka.actor.internal-dispatcher",
        "resolver" : "async-dns"
    }

The relevant versions we are using are:

    akkaHttp = "10.2.6"
    scredis = "2.4.3"
    akkaActor = "2.6.16"

From the logs we see that when the old IP address ceases to respond, the reconnection attempt goes to the old IP rather than re-resolving the hostname:

[INFO] [akka://scredis/user/<$hostname>-6379-listener-actor] Connection has been shutdown abruptly
[INFO] [scredis-scredis.io.akka.io-dispatcher-20] [akka://scredis/user/<$hostname>-6379-listener-actor/<$hostname>-6379-io-actor-2] Connecting to <$hostname>/<old_ip>:6379
[ERROR][scredis-scredis.io.akka.io-dispatcher-20] [akka://scredis/user/<$hostname>-6379-listener-actor/<$hostname>-6379-io-actor-2] Could not connect to <$hostname>/<old_ip>:6379: Command failed

So in spite of the connection being shut down, the new connection does not re-resolve the hostname. Could you point us to what the reason might be?

@jrudolph
Member

jrudolph commented Nov 8, 2021

You can enable debug logging to see the AsyncDnsResolver in action. I think by now this issue is largely resolved by using the AsyncDnsResolver with appropriate TTL settings for DNS, plus akka.http.host-connection-pool.max-connection-lifetime for the pool. If you need more control than that, you can also change ClientConnectionSettings and set a custom ClientTransport to implement whatever resolution logic is right.
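
Putting those pieces together, an application.conf along these lines (a sketch; the values are illustrative, not recommendations):

    akka.io.dns {
      resolver = async-dns
      async-dns {
        # cap how long positive results may be cached
        positive-ttl = 30s
        negative-ttl = 10s
      }
    }

    # close and re-establish pooled connections periodically, so that
    # new connections pick up fresh DNS entries
    akka.http.host-connection-pool.max-connection-lifetime = 60s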

@jrudolph jrudolph added the help wanted Identifies issues that the core team will likely not have time to work on label Nov 8, 2021
@jrudolph
Member

jrudolph commented Nov 8, 2021

A more automatic solution (implementing what the title of this issue says) could be to add logic that does DNS resolution directly in the pool using the AsyncDnsResolver, and to use the returned TTLs to automatically apply a max-connection-lifetime per connection.
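
A rough sketch of the building block that idea would use, querying the resolver through the Dns extension and inspecting per-record TTLs (assumes the async-dns resolver is configured):

    import akka.actor.ActorSystem
    import akka.io.{Dns, IO}
    import akka.io.dns.{ARecord, DnsProtocol}
    import akka.pattern.ask
    import akka.util.Timeout

    import scala.concurrent.duration._

    implicit val system: ActorSystem = ActorSystem()
    implicit val timeout: Timeout = Timeout(5.seconds)
    import system.dispatcher

    // Resolve the host and look at the TTL of each A record; a pool could
    // derive a per-connection max-connection-lifetime from these values.
    (IO(Dns) ? DnsProtocol.Resolve("test.com"))
      .mapTo[DnsProtocol.Resolved]
      .foreach { resolved =>
        resolved.records.collect { case a: ARecord =>
          println(s"${a.ip} ttl=${a.ttl}")
        }
      }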
