Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-9478: Prevent DWPTDeleteQueue from referencing itself and leaking memory #1779

Merged
merged 4 commits into from
Aug 24, 2020

Conversation

s1monw
Copy link
Member

@s1monw s1monw commented Aug 24, 2020

In LUCENE-9304 we introduced some fixes that unfortunately hold on to the previous
DWPTDeleteQueue which is essentially leaking IW memory and cause applications to fail.
This fixes the memory leak and adds a test to ensure its not leaking memory.

…ing memory

In LUCENE-9304 we introduced some fixes that unfortunately hold on to the previous
DWPTDeleteQueue which is essentially leaking IW memory and cause applications to fail.
This fixes the memory leak and adds a test to ensure it's not leaking memory.
@s1monw
Copy link
Member Author

s1monw commented Aug 24, 2020

@uschindler can you please take a look at the test @mikemccand you should also look please

WeakAndNext weakAndNext = new WeakAndNext();
DocumentsWriterDeleteQueue next = weakAndNext.next;
assertNotNull(next);
System.gc();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is terrible but I don't know of a better way of doing this. this at least works on my machine :) and would have found the issue

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@uschindler @dweiss maybe you have better ideas

WeakAndNext weakAndNext = new WeakAndNext();
DocumentsWriterDeleteQueue next = weakAndNext.next;
assertNotNull(next);
System.gc();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked openjdk sources and they use System.gc() with a similar expectation that it will always try to flush all weak refs before returning, so this seems fine. The test itself could maybe be made more explicit by utilizing an explicit ReferenceQueue rather than asserting the reference has been cleared but this is a matter of taste - if this works, it will in both cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool! I think the simpler the better

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also use a ReferenceQueue to do the check. The weak.get() has a bad taste to me.
I did not check that System.gc() tries to cleanup weak refs, but if it does and we can rely on this, I am fine.

@s1monw s1monw requested a review from dweiss August 24, 2020 18:27
Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good lord what a subtle memory leak!!! Lambda's closure holding onto this even though this.nextSeqNo is already final!! Evil.

Can we bound the expected size of the leak? It is only the tail of the last delete queue on full flush? So you must have a number of deletes pending at full flush (commit, refresh) for this to matter?

Have we seen apps in practice where this is very costly?

I opened mikemccand/luceneutil#78 to figure out why our nightly benchmarks at least didn't notice this ...

@dnhatn
Copy link
Member

dnhatn commented Aug 24, 2020

Can we bound the expected size of the leak? It is only the tail of the last delete queue on full flush? So you must have a number of deletes pending at full flush (commit, refresh) for this to matter?

Each delete queue would leak around 500 bytes.

Copy link
Contributor

@uschindler uschindler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am OK with this code.

I would really like to see the lambda methods generated by javac. I will do a javap call before and after your patch.

IMHO, the test is fine, but it only tests if there's a reference to this in order to access nextSeqNo.

WeakAndNext weakAndNext = new WeakAndNext();
DocumentsWriterDeleteQueue next = weakAndNext.next;
assertNotNull(next);
System.gc();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also use a ReferenceQueue to do the check. The weak.get() has a bad taste to me.
I did not check that System.gc() tries to cleanup weak refs, but if it does and we can rely on this, I am fine.

@uschindler
Copy link
Contributor

uschindler commented Aug 24, 2020

This is a limitation of the Java Compiler javac. Whenever you access an instance field from a class (final or not doe snot matter), it created a lamda bound to this. The lambda method then does a getfield and invokes method. Here is the Bytecode of the lambda before this change:

  private long lambda$advanceQueue$1();
    Code:
       0: aload_0
       1: getfield      #20                 // Field nextSeqNo:Ljava/util/concurrent/atomic/AtomicLong;
       4: invokevirtual #75                 // Method java/util/concurrent/atomic/AtomicLong.get:()J
       7: lconst_1
       8: lsub
       9: lreturn

As you see it's a non-static method. The generated lambda class therefor has a reference to this (methodhandle bound to this).

After your patch, the lambda is static and instead gets one parameter:

  private static long lambda$getPrevMaxSeqIdSupplier$1(java.util.concurrent.atomic.AtomicLong);
    Code:
       0: aload_0
       1: invokevirtual #75                 // Method java/util/concurrent/atomic/AtomicLong.get:()J
       4: lconst_1
       5: lsub
       6: lreturn

The generated anonymous class will have just be bound to the AtomicLong and not this.

The method signature of both MethodHandles is identical, one parameter (this or AtomicLong
), returning long.

By analyzing the code we see that issue is fixed, test is not necessarily needed, but it's hard to check the bytecode :-)

+1 for the fix!

Copy link
Member

@dnhatn dnhatn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@s1monw s1monw merged commit 098f0dc into apache:master Aug 24, 2020
s1monw added a commit that referenced this pull request Aug 24, 2020
…ing memory (#1779)

In LUCENE-9304 we introduced some fixes that unfortunately hold on to the previous
DWPTDeleteQueue which is essentially leaking IW memory and cause applications to fail.
This fixes the memory leak and adds a test to ensure its not leaking memory.
s1monw added a commit that referenced this pull request Aug 24, 2020
…ing memory (#1779)

In LUCENE-9304 we introduced some fixes that unfortunately hold on to the previous
DWPTDeleteQueue which is essentially leaking IW memory and cause applications to fail.
This fixes the memory leak and adds a test to ensure its not leaking memory.
gus-asf pushed a commit to gus-asf/lucene-solr that referenced this pull request Sep 4, 2020
…ing memory (apache#1779)

In LUCENE-9304 we introduced some fixes that unfortunately hold on to the previous
DWPTDeleteQueue which is essentially leaking IW memory and cause applications to fail.
This fixes the memory leak and adds a test to ensure its not leaking memory.
@wojtek2kdev
Copy link

Hello, I'm newbie in Apache Lucene-Solr. Can someone explain me this bug as for kid? How can I reproduce it on my computer? I want to start participate by bug fixing, but at first I wanna check how previous issues was patched. Thanks in advance!

gsmiller pushed a commit to gsmiller/lucene-solr that referenced this pull request Mar 17, 2021
…ing memory (apache#1779)

In LUCENE-9304 we introduced some fixes that unfortunately hold on to the previous
DWPTDeleteQueue which is essentially leaking IW memory and cause applications to fail.
This fixes the memory leak and adds a test to ensure its not leaking memory.

cr https://code.amazon.com/reviews/CR-33310184
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants