Delegate Ref Counting to ByteBuf in Netty Transport #81096

Merged

Conversation

@original-brownbear (Member) commented Nov 29, 2021

Tracking down recent memory leaks was made unnecessarily hard by wrapping the `ByteBuf` ref counting with our own counter. This way, we would not record the increments and decrements on the Netty leak tracker, making it useless for identifying the concrete source of a leaked request: the logged leak only contains touch points up to our inbound handler code.

As a side note: it would also be nice to do the same on the REST layer, but it's quite a bit harder there since we don't really manage a ref count for REST content today; instead, we just delay releasing the content until we send the response for some messages.
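
Conceptually, the change means the transport's ref-counted wrapper forwards its increments and decrements to the underlying `ByteBuf` instead of keeping its own counter, so every retain/release is visible to Netty's `ResourceLeakDetector`. A minimal sketch of that idea follows; the class name and layout are illustrative, not the actual Elasticsearch code:

```java
import io.netty.buffer.ByteBuf;

/**
 * Sketch only: a ref-counted wrapper that delegates to the underlying
 * ByteBuf instead of keeping a separate counter, so Netty's leak
 * detector records every increment and decrement. Class and method
 * layout are illustrative, not the actual Elasticsearch code.
 */
final class ByteBufDelegatingRefCounted {

    private final ByteBuf buffer;

    ByteBufDelegatingRefCounted(ByteBuf buffer) {
        this.buffer = buffer;
    }

    public void incRef() {
        buffer.retain();          // recorded as a touch point by the leak tracker
    }

    public boolean decRef() {
        return buffer.release();  // true once the last reference is gone
    }

    public boolean hasReferences() {
        return buffer.refCnt() > 0;
    }
}
```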

@original-brownbear original-brownbear added >non-issue :Distributed/Network Http and internode communication implementations v8.0.0 v8.1.0 labels Nov 29, 2021
@elasticmachine elasticmachine added the Team:Distributed Meta label for distributed team label Nov 29, 2021
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

try {
    buffer.retain();
} catch (RuntimeException e) {
    assert refCount() == 0 : "should only fail if fully released but ref count was [" + refCount() + "]";
}
@original-brownbear (Member Author)

Not the most elegant solution, but Netty ref counting simply doesn't have this functionality. Also, it doesn't really matter: in practice the `tryIncRef` call doesn't seem to be used on the network buffers, so there's no performance-relevant impact here.

Contributor

Perhaps check if buffer.refCnt() == 0 first? We still need to catch an exception here but it'd be much rarer.

@original-brownbear (Member Author) Nov 29, 2021

++ added, makes sense
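
For reference, the shape being discussed would look roughly like the sketch below. The method name `tryIncRef` comes from the comment above and the `refCnt()` pre-check from the reviewer's suggestion; the surrounding class and the exact assertion are assumptions, not the merged code:

```java
import io.netty.buffer.ByteBuf;

// Sketch of the discussed pattern only; not the actual Elasticsearch code.
final class TryIncRefSketch {

    private final ByteBuf buffer;

    TryIncRefSketch(ByteBuf buffer) {
        this.buffer = buffer;
    }

    public boolean tryIncRef() {
        if (buffer.refCnt() == 0) {
            return false; // already fully released, don't bother calling retain()
        }
        try {
            buffer.retain();
        } catch (RuntimeException e) {
            // rare race: the buffer was released between the check and retain()
            assert buffer.refCnt() == 0 : "should only fail if fully released but ref count was [" + buffer.refCnt() + "]";
            return false;
        }
        return true;
    }
}
```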

public boolean hasReferences() {
    return true;
}

public int refCount() {
    return 1;
}
@DaveCTurner (Contributor) Nov 29, 2021

The point of hasReferences() is to avoid having to return a fake value like this for implementations that always leak. There's something weird about having a refcount which incRef() and decRef() don't affect.

Do we need to expose the exact refcount here? If not, can we put this back to how it was?

@original-brownbear (Member Author)

> is to avoid having to return a fake value like this for implementations that always leak

Right. Unfortunately, this would mean removing `org.elasticsearch.common.bytes.ReleasableBytesReference#refCount`, which is quite a large change (although we seemingly only make use of the number in tests), and doesn't really seem worth the effort to avoid a cosmetic issue like this?

I guess Netty does the same and uses a fake `1` for the ref count in `EmptyByteBuf` and the like, so I figured it's good enough for us here as well :)

Contributor

Ah I see. Still, half of those usages are `assert refCount() > 0;`, which just become `assert hasReferences();`, and the others in tests are pretty much all equally trivial. I almost made that change myself when introducing `hasReferences()`. I'd rather do the Right Thing here.

@original-brownbear (Member Author)

Hmm ok, let me try

@original-brownbear (Member Author)

Ok I concede that this was much easier than expected. 3cf250c ... effectively only a single spot where we lost some trivial coverage (ref count == 2 assertion)
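
Since that migration is only mentioned in passing, here is a self-contained sketch of what swapping the exact-count assertion for `hasReferences()` looks like at a call site. The `Resource` interface is a stand-in for illustration, not the actual `ReleasableBytesReference` API:

```java
// A self-contained sketch of the assertion migration discussed above.
// "Resource" is a stand-in interface, not the actual Elasticsearch API.
final class RefCountMigrationSketch {

    interface Resource {
        int refCount();          // the method the PR ends up removing
        boolean hasReferences(); // the replacement check
    }

    // Before: asserting on the exact count forces every implementation to
    // expose a real number, even ones that never release (which would
    // otherwise have to fake a count of 1).
    static void consumeBefore(Resource resource) {
        assert resource.refCount() > 0;
        // ... read from the resource ...
    }

    // After: callers only care whether the resource is still live, so a
    // ByteBuf-delegating implementation can simply answer refCnt() > 0.
    static void consumeAfter(Resource resource) {
        assert resource.hasReferences();
        // ... read from the resource ...
    }
}
```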

@DaveCTurner (Contributor) left a comment

LGTM

@@ -107,8 +107,6 @@ public void testDecode() throws IOException {
         final Object endMarker = fragments.get(1);

         assertEquals(messageBytes, content);
-        // Ref count is incremented since the bytes are forwarded as a fragment
-        assertEquals(2, releasable2.refCount());
Contributor

You could reasonably keep this coverage by releasing releasable2 and asserting that it still hasReferences().

@original-brownbear (Member Author)

++ brought this back
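
The restored coverage presumably follows the pattern suggested above. A rough sketch, assuming `releasable2` exposes the ref-counting methods discussed in this thread (this is not the literal test code from the PR):

```java
import static org.junit.Assert.assertTrue;

import org.elasticsearch.common.bytes.ReleasableBytesReference;

// Sketch only: assumes releasable2 exposes the decRef()/hasReferences()
// methods discussed in this thread; not the literal test code.
class FragmentRefCountCoverageSketch {

    static void assertFragmentKeepsBytesAlive(ReleasableBytesReference releasable2) {
        releasable2.decRef();                    // drop the test's own reference
        assertTrue(releasable2.hasReferences()); // the forwarded fragment still holds one
    }
}
```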

@original-brownbear original-brownbear added the auto-backport-and-merge Automatically create backport pull requests and merge when ready label Nov 29, 2021
@original-brownbear (Member Author)

Thanks David!

@original-brownbear original-brownbear merged commit 256521e into elastic:master Nov 29, 2021
@original-brownbear original-brownbear deleted the fix-netty-leak-tracking branch November 29, 2021 18:49
@elasticsearchmachine (Collaborator)

💔 Backport failed

Branch: 8.0
Result: Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running `backport --upstream elastic/elasticsearch --pr 81096`.

weizijun added a commit to weizijun/elasticsearch that referenced this pull request Nov 30, 2021
* upstream/master: (150 commits)
  Fix ComposableIndexTemplate equals when composed_of is null (elastic#80864)
  Optimize DLS bitset building for matchAll query (elastic#81030)
  URL option for BaseRunAsSuperuserCommand (elastic#81025)
  Less Verbose Serialization of Snapshot Failure in SLM Metadata (elastic#80942)
  Fix shadowed vars pt7 (elastic#80996)
  Fail shards early when we can detect a type missmatch (elastic#79869)
  Delegate Ref Counting to ByteBuf in Netty Transport (elastic#81096)
  Clarify `unassigned.reason` docs (elastic#81017)
  Strip blocks from settings for reindex targets (elastic#80887)
  Split off the values supplier for ScriptDocValues (elastic#80635)
  [ML] Switch message and detail for model snapshot deprecations (elastic#81108)
  [DOCS] Update xrefs for snapshot restore docs (elastic#81023)
  [ML] Updates visiblity of validate API (elastic#81061)
  Track histogram of transport handling times (elastic#80581)
  [ML] Fix datafeed preview with remote indices (elastic#81099)
  [ML] Fix acceptable model snapshot versions in ML deprecation checker (elastic#81060)
  [ML] Add logging for failing PyTorch test (elastic#81044)
  Extending the timeout waiting for snapshot to be ready (elastic#81018)
  [ML] Fix incorrect logging of unexpected model size error (elastic#81089)
  [ML] Make inference timeout test more reliable (elastic#81094)
  ...

# Conflicts:
#	server/src/main/java/org/elasticsearch/index/mapper/NumberFieldMapper.java
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Nov 30, 2021
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Nov 30, 2021
@original-brownbear original-brownbear restored the fix-netty-leak-tracking branch April 18, 2023 20:54
Labels
auto-backport-and-merge Automatically create backport pull requests and merge when ready :Distributed/Network Http and internode communication implementations >non-issue Team:Distributed Meta label for distributed team v8.0.0-rc2 v8.1.0