Artery memory leak #28390

Closed
gakesson opened this issue Dec 17, 2019 · 18 comments
@gakesson
Contributor

We have an application running on Akka 2.5.19 and we use Akka Artery (not Cluster) to communicate between two remote actor systems.

We noticed a memory leak in the Artery Association when we had an incorrect network routing setup between two actor systems (messages were dropped in one direction), which caused a Scala collection in AssociationState to pile up with entries that never get released. Eventually the application goes into continuous GC (CMS in our case) until it is restarted.

When trying to connect to the other actor system, the error below is logged continuously:

Outbound message stream to [akka://AS-A13600@<host>:<port>] failed. Restarting it. Handshake with [akka://AS-A13600@<host>:<port>] did not complete within 20000 ms
akka.remote.artery.OutboundHandshake$HandshakeTimeoutException: Handshake with [akka://AS-A13600@<host>:<port>] did not complete within 20000 ms

We have no specific remote Artery configuration (except for security), i.e. we use the default settings.
The issue is reproducible every time by having a host try to communicate with a remote actor system:

  • Send 10 messages per second to 3 different remote actors using ActorSelection (see the sketch right after this list)
  • On the remote host, drop network packets using /sbin/iptables -A INPUT -p tcp --destination-port <port> -j DROP
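
A minimal sender along these lines should exercise the same traffic pattern (this is a sketch, not our actual application; the system name, actor paths, host and port are placeholders, and Artery over TCP is assumed to be enabled in application.conf):

    // Hypothetical reproduction sketch (Akka 2.5 classic API, placeholder addresses).
    // Assumes akka.remote.artery.enabled = on with the tcp transport in application.conf.
    import scala.concurrent.duration._
    import akka.actor.ActorSystem

    object LeakRepro extends App {
      val system = ActorSystem("LeakRepro")
      import system.dispatcher

      // Three remote actors on the host whose inbound traffic is dropped by iptables.
      val selections = (1 to 3).map { i =>
        system.actorSelection(s"akka://AS-A13600@<host>:<port>/user/receiver-$i")
      }

      // Roughly 10 messages per second to each selection.
      system.scheduler.schedule(0.seconds, 100.millis) {
        selections.foreach(_ ! "ping")
      }
    }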

The issue becomes visible after a few hours on a 384 MB heap (260 MB old generation).
Attaching a snippet of the heap dump taken (3 different Association instances consuming >99% of the retained heap).
[image: association_leak]

@patriknw
Member

@gakesson Thanks for reporting and describing how to reproduce. Will look into it.

@patriknw patriknw added the bug, t:remoting:artery and 1 - triaged labels Dec 17, 2019
@patriknw
Member

By the way, is it with the tcp or aeron-udp transport?

@gakesson
Contributor Author

Thanks Patrik!
We're running TCP.

@patriknw
Member

Do you see anything more that could be interesting in the logs? Anything about quarantine? Could you share logs for akka.remote?

I haven't been able to reproduce it yet. It stays at one instance of AssociationState.

Tried on macOS, so using pfctl instead of iptables, but it should be the same.

@gakesson
Contributor Author

It is not the AssociationState instances that are leaking but the Promise$DefaultPromise member of the AssociationState.
In the heap dump snippet there are three AssociationState instances that suffer from leaking - one for each host that drops messages - and each of the AssociationState instances contains this Promise data structure that is piling up.

If you're unable to reproduce that too, I will try to create a stand-alone application replicating the usage of our production application. Please let me know!

@patriknw
Member

OK, I was looking for the wrong thing then. I'll try some more tomorrow.

@He-Pin
Member

He-Pin commented Dec 17, 2019

@gakesson It would be better if you could share the heap dump with the team.

@gakesson
Contributor Author

@hepin1989 I agree but I'm not allowed to do that :-|
If you're unable to reproduce it using the procedure I provided, I will create a stand-alone application.

@He-Pin
Member

He-Pin commented Dec 18, 2019

@gakesson Yes, I forgot that. A reproducer would be really helpful, I think.

@patriknw
Member

I think I understand how this can happen. I haven't been able to reproduce with a "real" scenario so a stand-alone application would be very useful to ensure that we are fixing the right thing.

I think the reason is that the handshake isn't completed, the stream is restarted, and that builds up callbacks on the uniqueRemoteAddressPromise, keeping references to the old incarnations of the streams. It looks like this:

[image: Screenshot 2019-12-18 at 08 52 45]

Does that match your heap dump if you drill down into the DefaultPromise?
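
To illustrate the mechanism (a stand-alone sketch with made-up names, not the actual Akka code): each restarted stream registers another onComplete callback on the same promise, and because the promise never completes, those callbacks, and everything they capture, stay reachable:

    // Illustration only: callbacks pile up on a promise that is never completed.
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.Promise

    object PromiseCallbackBuildUp extends App {
      final class HandshakeStage(val restartCount: Int) // stand-in for an OutboundHandshake stage

      // Stays incomplete for as long as the handshake keeps timing out.
      val uniqueRemoteAddressPromise = Promise[String]()

      def streamRestart(n: Int): Unit = {
        val stage = new HandshakeStage(n)
        // The closure captures `stage`; since the promise never completes,
        // the callback is never released and the old stage stays reachable.
        uniqueRemoteAddressPromise.future.onComplete { _ =>
          println(s"handshake completed, stage from restart ${stage.restartCount}")
        }
      }

      (1 to 1000).foreach(streamRestart) // 1000 retained HandshakeStage instances
    }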

@patriknw patriknw self-assigned this Dec 18, 2019
@patriknw patriknw added the 3 - in progress label and removed the 1 - triaged label Dec 18, 2019
@gakesson
Contributor Author

Yes, it sure looks like the same thing.
Attaching a histogram view which shows something similar to yours.
[image: leak_histogram]

@patriknw
Member

Good, I have an easy way to reproduce it now, so a stand-alone application might not be needed, but I'd like you to test with a snapshot when I have something ready.

@gakesson
Contributor Author

gakesson commented Dec 18, 2019

Sure, sounds good, but I think the snapshot needs to be based on Akka 2.5.x (we just recently moved to 2.5.26, and 2.6.x would require some code changes on our side).

patriknw added a commit that referenced this issue Dec 18, 2019
* If the handshake doesn't complete, the Promise in AssociationState is never
  completed, and each new restarted stream added a future callback to it from
  the OutboundHandshake stage. Those references are kept by the promise and
  therefore the old OutboundHandshake (and probably the entire stream) couldn't
  be garbage collected.
* Using our own notification mechanism to have control of listener deregistration
  from postStop instead of using Promise/Future.
* Trying to create a new Promise after failure/restart would be difficult because
  the same AssociationState Promise is accessed from several outbound streams,
  all possibly restarted individually. Therefore it is easier to clean up
  from postStop.
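
Roughly, the notification mechanism mentioned in the second bullet has this shape (hypothetical names, not the actual implementation): listeners are kept in a mutable set so a stage can deregister itself from postStop, which is not possible with Future.onComplete callbacks.

    // Hypothetical sketch of a notifier with explicit deregistration (not Akka's code).
    import java.util.concurrent.ConcurrentHashMap

    final class CompletionNotifier[A] {
      private val listeners = ConcurrentHashMap.newKeySet[A => Unit]()
      @volatile private var result: Option[A] = None

      def addListener(listener: A => Unit): Unit = {
        listeners.add(listener)
        result.foreach(listener) // already completed: notify immediately
      }

      // Called from postStop so a stopped stage no longer keeps itself reachable.
      def removeListener(listener: A => Unit): Unit = listeners.remove(listener)

      def complete(value: A): Unit = {
        result = Some(value)
        listeners.forEach(l => l(value))
      }
    }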
@patriknw
Member

@gakesson You can try the fix with Akka version 2.5-20191220-123947. You need to add the snapshot repository to your build, see https://doc.akka.io/docs/akka/current/project/links.html#snapshots-repository

It's on top of latest 2.5.27.
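
In sbt that would look roughly like this (the repository URL is a placeholder; take the real one from the snapshots page linked above):

    // build.sbt sketch; <snapshot-repository-url> is a placeholder, not the real URL.
    resolvers += "akka-snapshots" at "<snapshot-repository-url>"

    libraryDependencies += "com.typesafe.akka" %% "akka-remote" % "2.5-20191220-123947"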

If you want to review the changes you find them in PR #28407

@gakesson
Contributor Author

gakesson commented Jan 7, 2020

@patriknw Sorry for the delay, but I've been away on holiday.
I will start this verification today and let you know (it takes a while for the issue to materialize).

@gakesson
Contributor Author

gakesson commented Jan 8, 2020

@patriknw
The correction works for our application.

The test has been running for 24 hours (previously the issue materialized after ~8 hours) and there is no build-up in object retention.

A heap dump also confirms that the memory leak is resolved.

Thanks for the swift help with this issue! May I ask when the next 2.5.x release is planned (and whether this correction will be included in it)?

@patriknw
Member

patriknw commented Jan 8, 2020

It will be included in the upcoming 2.5.28 and 2.6.2 releases within a few weeks. You should be able to use the timestamped version until then.

@patriknw patriknw added this to the 2.5.28 milestone Jan 8, 2020
@gakesson
Contributor Author

gakesson commented Jan 8, 2020

Thank you!

patriknw added a commit that referenced this issue Jan 13, 2020 (same commit message as the Dec 18 commit above)
patriknw added a commit that referenced this issue Jan 13, 2020 (same commit message; cherry picked from commit 3b7e609)
patriknw added a commit that referenced this issue Jan 20, 2020
* Fix memory leak of restarting Artery outbound stream, #28390 (#28407)

(cherry picked from commit 3b7e609)

* move mima filter to 2.6.1.backwards.excludes
@patriknw patriknw removed the 3 - in progress label Jan 20, 2020