Artery memory leak #28390

Closed
gakesson opened this issue Dec 17, 2019 · 18 comments
@gakesson
Contributor

We have an application running on Akka 2.5.19 and we use Akka Artery (not Cluster) to communicate between two remote actor systems.

We noticed a memory leak in the Artery Association when we had an incorrect network routing setup between two actor systems (messages were dropped in one direction), which caused a Scala collection in AssociationState to pile up with entries that never get released. Eventually the application goes into continuous GC (CMS in our case) until it is restarted.

When trying to connect to the other actor system, the error below is logged continuously:

Outbound message stream to [akka://AS-A13600@<host>:<port>] failed. Restarting it. Handshake with [akka://AS-A13600@<host>:<port>] did not complete within 20000 ms
akka.remote.artery.OutboundHandshake$HandshakeTimeoutException: Handshake with [akka://AS-A13600@<host>:<port>] did not complete within 20000 ms

We have no specific remote Artery configuration (except for security), i.e. we use the default settings.
The issue is reproducible every time by having a host try to communicate with a remote actor system:

  • Send 10 messages per second to 3 different remote actors using ActorSelection (see the sketch right after this list)
  • On the remote host, drop network packets using /sbin/iptables -A INPUT -p tcp --destination-port <port> -j DROP
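
A minimal sender along these lines should exercise the same traffic pattern (this is a sketch, not our actual application; the system name, actor paths, host and port are placeholders, and Artery over TCP is assumed to be enabled in application.conf):

    // Hypothetical reproduction sketch (Akka 2.5 classic API, placeholder addresses).
    // Assumes akka.remote.artery.enabled = on with the tcp transport in application.conf.
    import scala.concurrent.duration._
    import akka.actor.ActorSystem

    object LeakRepro extends App {
      val system = ActorSystem("LeakRepro")
      import system.dispatcher

      // Three remote actors on the host whose inbound traffic is dropped by iptables.
      val selections = (1 to 3).map { i =>
        system.actorSelection(s"akka://AS-A13600@<host>:<port>/user/receiver-$i")
      }

      // Roughly 10 messages per second to each selection.
      system.scheduler.schedule(0.seconds, 100.millis) {
        selections.foreach(_ ! "ping")
      }
    }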

The issue becomes visible after a few hours on a 384 MB heap (260 MB old generation).
Attaching a snippet of the heap dump taken (3 different Association instances consuming >99% of the retained heap).
[image: association_leak]

@patriknw
Member

@gakesson Thanks for reporting and describing how to reproduce. Will look into it.

@patriknw patriknw added the bug, t:remoting:artery and 1 - triaged labels Dec 17, 2019
@patriknw
Member

By the way, is it with the tcp or aeron-udp transport?

@gakesson
Contributor Author

Thanks Patrik!
We're running TCP.

@patriknw
Member

Do you see anything more that could be interesting in the logs? Anything about quarantine? Could you share logs for akka.remote?

I haven't been able to reproduce it yet. It stays at one instance of AssociationState.

Tried on macOS, so using pfctl instead of iptables, but it should be the same.

@gakesson
Contributor Author

It is not the AssociationState instances that are leaking but the Promise$DefaultPromise member of the AssociationState.
In the heap dump snippet there are three AssociationState instances that suffer from leaking - one for each host that drops messages - and each of the AssociationState instances contains this Promise data structure that is piling up.

If you're unable to reproduce that too, I will try to create a stand-alone application replicating the usage of our production application. Please let me know!

@patriknw
Member

OK, I was looking for the wrong thing then. I'll try some more tomorrow.

@He-Pin
Member

He-Pin commented Dec 17, 2019

@gakesson It would be better if you could share the heap dump with the team.

@gakesson
Contributor Author

@hepin1989 I agree but I'm not allowed to do that :-|
If you're unable to reproduce it using the procedure I provided, I will create a stand-alone application.

@He-Pin
Member

He-Pin commented Dec 18, 2019

@gakesson Yes, I forgot that. A reproducer would be really helpful, I think.

@patriknw
Member

I think I understand how this can happen. I haven't been able to reproduce with a "real" scenario so a stand-alone application would be very useful to ensure that we are fixing the right thing.

I think the reason is that the handshake isn't completed, the stream is restarted, and that builds up callbacks on the uniqueRemoteAddressPromise, keeping references to the old incarnations of the streams. It looks like this:

[image: Screenshot 2019-12-18 at 08 52 45]

Does that match your heap dump if you drill down into the DefaultPromise?
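
To illustrate the mechanism (a stand-alone sketch with made-up names, not the actual Akka code): each restarted stream registers another onComplete callback on the same promise, and because the promise never completes, those callbacks, and everything they capture, stay reachable:

    // Illustration only: callbacks pile up on a promise that is never completed.
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.Promise

    object PromiseCallbackBuildUp extends App {
      final class HandshakeStage(val restartCount: Int) // stand-in for an OutboundHandshake stage

      // Stays incomplete for as long as the handshake keeps timing out.
      val uniqueRemoteAddressPromise = Promise[String]()

      def streamRestart(n: Int): Unit = {
        val stage = new HandshakeStage(n)
        // The closure captures `stage`; since the promise never completes,
        // the callback is never released and the old stage stays reachable.
        uniqueRemoteAddressPromise.future.onComplete { _ =>
          println(s"handshake completed, stage from restart ${stage.restartCount}")
        }
      }

      (1 to 1000).foreach(streamRestart) // 1000 retained HandshakeStage instances
    }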

@patriknw patriknw self-assigned this Dec 18, 2019
@patriknw patriknw added the 3 - in progress label and removed the 1 - triaged label Dec 18, 2019
@gakesson
Contributor Author

Yes, it sure looks like the same thing.
Attaching a histogram view which shows something similar to yours.
[image: leak_histogram]

@patriknw
Member

Good, I have an easy way to reproduce it now, so a stand-alone application might not be needed, but I'd like you to test with a snapshot when I have something ready.

@gakesson
Contributor Author

gakesson commented Dec 18, 2019

Sure, sounds good, but I think the snapshot needs to be based on Akka 2.5.x (we just recently moved to 2.5.26, and 2.6.x would require some code changes on our side).

patriknw added a commit that referenced this issue Dec 18, 2019
* If the handshake doesn't complete, the Promise in AssociationState is never
  completed, and each new restarted stream added a future callback to it from
  the OutboundHandshake stage. Those references are kept by the promise and
  therefore the old OutboundHandshake (and probably the entire stream) couldn't
  be garbage collected.
* Using our own notification mechanism to have control of listener deregistration
  from postStop instead of using Promise/Future.
* Trying to create a new Promise after failure/restart would be difficult because
  the same AssociationState Promise is accessed from several outbound streams,
  all possibly restarted individually. Therefore it is easier to clean up
  from postStop.
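
Roughly, the notification mechanism mentioned in the second bullet has this shape (hypothetical names, not the actual implementation): listeners are kept in a mutable set so a stage can deregister itself from postStop, which is not possible with Future.onComplete callbacks.

    // Hypothetical sketch of a notifier with explicit deregistration (not Akka's code).
    import java.util.concurrent.ConcurrentHashMap

    final class CompletionNotifier[A] {
      private val listeners = ConcurrentHashMap.newKeySet[A => Unit]()
      @volatile private var result: Option[A] = None

      def addListener(listener: A => Unit): Unit = {
        listeners.add(listener)
        result.foreach(listener) // already completed: notify immediately
      }

      // Called from postStop so a stopped stage no longer keeps itself reachable.
      def removeListener(listener: A => Unit): Unit = listeners.remove(listener)

      def complete(value: A): Unit = {
        result = Some(value)
        listeners.forEach(l => l(value))
      }
    }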
@patriknw
Member

@gakesson You can try the fix with Akka version 2.5-20191220-123947. You need to add the snapshot repository to your build, see https://doc.akka.io/docs/akka/current/project/links.html#snapshots-repository

It's on top of latest 2.5.27.
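
In sbt that would look roughly like this (the repository URL is a placeholder; take the real one from the snapshots page linked above):

    // build.sbt sketch; <snapshot-repository-url> is a placeholder, not the real URL.
    resolvers += "akka-snapshots" at "<snapshot-repository-url>"

    libraryDependencies += "com.typesafe.akka" %% "akka-remote" % "2.5-20191220-123947"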

If you want to review the changes you find them in PR #28407

@gakesson
Contributor Author

gakesson commented Jan 7, 2020

@patriknw Sorry for the delay, but I've been away on holiday.
I will start this verification today and let you know (it takes a while for the issue to materialize).

@gakesson
Contributor Author

gakesson commented Jan 8, 2020

@patriknw
The correction works for our application.

The test has been running for 24 hours (previously the issue materialized after ~8 hours) and there is no build-up in object retention.

A heap dump also confirms that the memory leak is resolved.

Thanks for the swift help with this issue! May I ask when the next 2.5.x release is planned (and whether this correction will be included in it)?

@patriknw
Member

patriknw commented Jan 8, 2020

It will be included in the upcoming 2.5.28 and 2.6.2 releases within a few weeks. You should be able to use the timestamped version until then.

@patriknw patriknw added this to the 2.5.28 milestone Jan 8, 2020
@gakesson
Contributor Author

gakesson commented Jan 8, 2020

Thank you!

patriknw added a commit that referenced this issue Jan 13, 2020 (same commit message as the Dec 18 commit above)
patriknw added a commit that referenced this issue Jan 13, 2020 (same commit message; cherry picked from commit 3b7e609)
patriknw added a commit that referenced this issue Jan 20, 2020
* Fix memory leak of restarting Artery outbound stream, #28390 (#28407)

(cherry picked from commit 3b7e609)

* move mima filter to 2.6.1.backwards.excludes
@patriknw patriknw removed the 3 - in progress label Jan 20, 2020