Delegate Ref Counting to ByteBuf in Netty Transport #81096

Merged

Conversation

@original-brownbear (Member) commented Nov 29, 2021

Tracking down recent memory leaks was made unnecessarily hard by wrapping the `ByteBuf` ref counting with our own counter. This way, we would not record the increments and decrements on the Netty leak tracker, making it useless for identifying the concrete source of a leaked request: the logged leak only contains touch points up to our inbound handler code.

As a side note: it would also be nice to do the same on the REST layer, but it's quite a bit harder there since we don't really manage a ref count for REST content today; instead, we just delay releasing the content until we send the response for some messages.
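
Conceptually, the change means the transport's ref-counted wrapper forwards its increments and decrements to the underlying `ByteBuf` instead of keeping its own counter, so every retain/release is visible to Netty's `ResourceLeakDetector`. A minimal sketch of that idea follows; the class name and layout are illustrative, not the actual Elasticsearch code:

```java
import io.netty.buffer.ByteBuf;

/**
 * Sketch only: a ref-counted wrapper that delegates to the underlying
 * ByteBuf instead of keeping a separate counter, so Netty's leak
 * detector records every increment and decrement. Class and method
 * layout are illustrative, not the actual Elasticsearch code.
 */
final class ByteBufDelegatingRefCounted {

    private final ByteBuf buffer;

    ByteBufDelegatingRefCounted(ByteBuf buffer) {
        this.buffer = buffer;
    }

    public void incRef() {
        buffer.retain();          // recorded as a touch point by the leak tracker
    }

    public boolean decRef() {
        return buffer.release();  // true once the last reference is gone
    }

    public boolean hasReferences() {
        return buffer.refCnt() > 0;
    }
}
```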

@original-brownbear original-brownbear added >non-issue :Distributed/Network Http and internode communication implementations v8.0.0 v8.1.0 labels Nov 29, 2021
@elasticmachine elasticmachine added the Team:Distributed Meta label for distributed team label Nov 29, 2021
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

try {
    buffer.retain();
} catch (RuntimeException e) {
    assert refCount() == 0 : "should only fail if fully released but ref count was [" + refCount() + "]";
}
@original-brownbear (Member Author)

Not the most elegant solution, but Netty ref counting simply doesn't have this functionality. Also, it doesn't really matter: in practice the `tryIncRef` call doesn't seem to be used on the network buffers, so there's no performance-relevant impact here.

Contributor

Perhaps check if buffer.refCnt() == 0 first? We still need to catch an exception here but it'd be much rarer.

@original-brownbear (Member Author) Nov 29, 2021

++ added, makes sense
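
For reference, the shape being discussed would look roughly like the sketch below. The method name `tryIncRef` comes from the comment above and the `refCnt()` pre-check from the reviewer's suggestion; the surrounding class and the exact assertion are assumptions, not the merged code:

```java
import io.netty.buffer.ByteBuf;

// Sketch of the discussed pattern only; not the actual Elasticsearch code.
final class TryIncRefSketch {

    private final ByteBuf buffer;

    TryIncRefSketch(ByteBuf buffer) {
        this.buffer = buffer;
    }

    public boolean tryIncRef() {
        if (buffer.refCnt() == 0) {
            return false; // already fully released, don't bother calling retain()
        }
        try {
            buffer.retain();
        } catch (RuntimeException e) {
            // rare race: the buffer was released between the check and retain()
            assert buffer.refCnt() == 0 : "should only fail if fully released but ref count was [" + buffer.refCnt() + "]";
            return false;
        }
        return true;
    }
}
```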

public boolean hasReferences() {
    return true;
}

public int refCount() {
    return 1;
}
@DaveCTurner (Contributor) Nov 29, 2021

The point of hasReferences() is to avoid having to return a fake value like this for implementations that always leak. There's something weird about having a refcount which incRef() and decRef() don't affect.

Do we need to expose the exact refcount here? If not, can we put this back to how it was?

@original-brownbear (Member Author)

> is to avoid having to return a fake value like this for implementations that always leak

Right. Unfortunately, this would mean removing `org.elasticsearch.common.bytes.ReleasableBytesReference#refCount`, which is quite a large change (although we seemingly only make use of the number in tests), and doesn't really seem worth the effort to avoid a cosmetic issue like this?

I guess Netty does the same and uses a fake `1` for the ref count in `EmptyByteBuf` and the like, so I figured it's good enough for us here as well :)

Contributor

Ah I see. Still, half of those usages are `assert refCount() > 0;`, which just become `assert hasReferences();`, and the others in tests are pretty much all equally trivial. I almost made that change myself when introducing `hasReferences()`. I'd rather do the Right Thing here.

@original-brownbear (Member Author)

Hmm ok, let me try

@original-brownbear (Member Author)

Ok I concede that this was much easier than expected. 3cf250c ... effectively only a single spot where we lost some trivial coverage (ref count == 2 assertion)
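
Since that migration is only mentioned in passing, here is a self-contained sketch of what swapping the exact-count assertion for `hasReferences()` looks like at a call site. The `Resource` interface is a stand-in for illustration, not the actual `ReleasableBytesReference` API:

```java
// A self-contained sketch of the assertion migration discussed above.
// "Resource" is a stand-in interface, not the actual Elasticsearch API.
final class RefCountMigrationSketch {

    interface Resource {
        int refCount();          // the method the PR ends up removing
        boolean hasReferences(); // the replacement check
    }

    // Before: asserting on the exact count forces every implementation to
    // expose a real number, even ones that never release (which would
    // otherwise have to fake a count of 1).
    static void consumeBefore(Resource resource) {
        assert resource.refCount() > 0;
        // ... read from the resource ...
    }

    // After: callers only care whether the resource is still live, so a
    // ByteBuf-delegating implementation can simply answer refCnt() > 0.
    static void consumeAfter(Resource resource) {
        assert resource.hasReferences();
        // ... read from the resource ...
    }
}
```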

@DaveCTurner (Contributor) left a comment

LGTM

@@ -107,8 +107,6 @@ public void testDecode() throws IOException {
         final Object endMarker = fragments.get(1);

         assertEquals(messageBytes, content);
-        // Ref count is incremented since the bytes are forwarded as a fragment
-        assertEquals(2, releasable2.refCount());
Contributor

You could reasonably keep this coverage by releasing releasable2 and asserting that it still hasReferences().

@original-brownbear (Member Author)

++ brought this back
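
The restored coverage presumably follows the pattern suggested above. A rough sketch, assuming `releasable2` exposes the ref-counting methods discussed in this thread (this is not the literal test code from the PR):

```java
import static org.junit.Assert.assertTrue;

import org.elasticsearch.common.bytes.ReleasableBytesReference;

// Sketch only: assumes releasable2 exposes the decRef()/hasReferences()
// methods discussed in this thread; not the literal test code.
class FragmentRefCountCoverageSketch {

    static void assertFragmentKeepsBytesAlive(ReleasableBytesReference releasable2) {
        releasable2.decRef();                    // drop the test's own reference
        assertTrue(releasable2.hasReferences()); // the forwarded fragment still holds one
    }
}
```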

@original-brownbear original-brownbear added the auto-backport-and-merge Automatically create backport pull requests and merge when ready label Nov 29, 2021
@original-brownbear (Member Author)

Thanks David!

@original-brownbear original-brownbear merged commit 256521e into elastic:master Nov 29, 2021
@original-brownbear original-brownbear deleted the fix-netty-leak-tracking branch November 29, 2021 18:49
@elasticsearchmachine (Collaborator)

💔 Backport failed

Branch: 8.0
Result: Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running `backport --upstream elastic/elasticsearch --pr 81096`.

weizijun added a commit to weizijun/elasticsearch that referenced this pull request Nov 30, 2021
* upstream/master: (150 commits)
  Fix ComposableIndexTemplate equals when composed_of is null (elastic#80864)
  Optimize DLS bitset building for matchAll query (elastic#81030)
  URL option for BaseRunAsSuperuserCommand (elastic#81025)
  Less Verbose Serialization of Snapshot Failure in SLM Metadata (elastic#80942)
  Fix shadowed vars pt7 (elastic#80996)
  Fail shards early when we can detect a type missmatch (elastic#79869)
  Delegate Ref Counting to ByteBuf in Netty Transport (elastic#81096)
  Clarify `unassigned.reason` docs (elastic#81017)
  Strip blocks from settings for reindex targets (elastic#80887)
  Split off the values supplier for ScriptDocValues (elastic#80635)
  [ML] Switch message and detail for model snapshot deprecations (elastic#81108)
  [DOCS] Update xrefs for snapshot restore docs (elastic#81023)
  [ML] Updates visiblity of validate API (elastic#81061)
  Track histogram of transport handling times (elastic#80581)
  [ML] Fix datafeed preview with remote indices (elastic#81099)
  [ML] Fix acceptable model snapshot versions in ML deprecation checker (elastic#81060)
  [ML] Add logging for failing PyTorch test (elastic#81044)
  Extending the timeout waiting for snapshot to be ready (elastic#81018)
  [ML] Fix incorrect logging of unexpected model size error (elastic#81089)
  [ML] Make inference timeout test more reliable (elastic#81094)
  ...

# Conflicts:
#	server/src/main/java/org/elasticsearch/index/mapper/NumberFieldMapper.java
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Nov 30, 2021
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Nov 30, 2021
@original-brownbear original-brownbear restored the fix-netty-leak-tracking branch April 18, 2023 20:54
Labels
auto-backport-and-merge Automatically create backport pull requests and merge when ready :Distributed/Network Http and internode communication implementations >non-issue Team:Distributed Meta label for distributed team v8.0.0-rc2 v8.1.0