Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rename GarbageCollectionLogger, optionally pause scans/compactions until memory usage is lower #3161

Merged
merged 25 commits into from Feb 16, 2023

Conversation

dlmarion
Copy link
Contributor

Accumulo has had a GarbageCollectionLogger for a long time. Currently it runs every 5 seconds in the server processes and logs a message that memory is low when the amount of free memory is less than 5% of the heap.

This PR modifies the GarbageCollectionLogger to optionally take a more active role. This PR:

  1. Renames the GarbageCollectionLogger to LowMemoryDetector.
  2. Adds a volatile boolean to the LowMemoryDetector that it sets to true when it detects that memory is low and it's not operating in a passive manner.
  3. Adds new properties for the server processes to set whether the LowMemoryDetector is active or passive, it's check interval, and the free memory threshold.
  4. Modifies the lookup and nextBatch methods in TabletBase such that:
    a. if low memory at the entrance to the method, it waits until that condition no longer exists
    b. if low memory while aggregating results, it returns the batch of results early (similar to time limit or batch limit being exceeded)
  5. Modified the LowMemoryDetector so that iterators that aggregate KV's or create a lot of garbage can use this information to return early if possible.
    a. Added isRunningLowOnMemory() to SKVI interface, default implementation returns false
    b. Modiifed WrappingIterator to implement isRunningLowOnMemory() using the LowMemoryDetector.
  6. Added two new metrics that are incremented when the start of a scan is paused or when the scan returns early due to low memory.

@dlmarion dlmarion self-assigned this Jan 13, 2023
@dlmarion dlmarion added this to In progress in 3.0.0 via automation Jan 13, 2023
@EdColeman
Copy link
Contributor

Did you look at the changes proposed in #2827? With the current version there can be a synchronization issue and with a quick look I didn't see you change the initialization. I'm going to pull this down to look in more detail, but maybe you already addressed it?

@dlmarion
Copy link
Contributor Author

I didn't look at #2827 in the context of these changes. I'll review.

@dlmarion
Copy link
Contributor Author

I didn't look at #2827 in the context of these changes. I'll review.

I didn't really change GarbageCollectionLogger too much. I didn't remove the lock, so if that's the synchronization issue you are referring to, then I believe it's not an issue.

@EdColeman
Copy link
Contributor

@dlmarion - I do not understand your response. The previous GC code gets flagged by findbugs at the next compliance level above the level that we currently run. The code change that I proposed fixed that warning. I don't think you changed the instantiation code from the original - so I think that your code will also have the same warning / issue. If you took my initialization and moved the GC instantiation into the context I think it would resolve the findbugs warning / potential synchronization issue. Because you also renamed the GC, most of my PR is OBE, but the initialization change should still be valid. Or maybe another approach?

@dlmarion
Copy link
Contributor Author

I didn't understand your comment about there being a synchronization issue. I think I was missing that The ... code gets flagged by findbugs at the next compliance level above the level that we currently run.. My PR does not address the findbugs issue.

@dlmarion
Copy link
Contributor Author

I will bump the compliance level on this branch and see what happens

@EdColeman
Copy link
Contributor

My point is you are changing the same code. Your code is an opportunity to correct the issue now rather than as a follow-on.

@dlmarion
Copy link
Contributor Author

Yep, I'm with you. I'll bump the compliance level on this branch and see what happens.

@dlmarion
Copy link
Contributor Author

Bumping the maxRank for FindBugs caused the build to fail in core, way before it reached the server code where the GarbageCollectionLogger code was. So, I removed the static modifiers from the code in the hopes that that was enough to prevent the SSD_DO_NOT_USE_INSTANCE_LOCK_ON_SHARED_STATIC_DATA error that you saw.

…the LowMemoryDetector object

in the other constructors of TabletIteratorEnvironment. These other constructors are used by minc
and majc processes. I didn't wire up the LowMemoryDetector to pause minc/majc compaction processes
at this point (could be done in FileCompactor.compact() for majc), but the information is available
to compaction iterators that extend WrappingIterator.
@dlmarion dlmarion changed the title Rename GarbageCollectionLogger, use it to pause scans until memory usage is lower Rename GarbageCollectionLogger, optionally pause scans/compactions until memory usage is lower Jan 18, 2023
@ctubbsii
Copy link
Member

In testing this, I have seen the ITs hang twice, for at least 8 hours each before I terminated the build. I have not investigated to determine what is causing the problem. It may be unrelated, and a mere coincidence, but I have not seen it happen with other builds recently.

@dlmarion
Copy link
Contributor Author

Kicked off a new IT build. I don't believe that your test runs had my latest changes. 2/3 of my IT's have passed as of this point. 🤞

@dlmarion
Copy link
Contributor Author

IT build passed

@dlmarion
Copy link
Contributor Author

2nd IT build passed based on commit 16b5531. @ctubbsii - not sure what the issue was yesterday.

@dlmarion
Copy link
Contributor Author

@keith-turner - could you take a look at this when you get a chance? I talked with @ctubbsii and he said he had not objection to including in 3.0, wanted to get your input before I merge.

@dlmarion
Copy link
Contributor Author

Resolved conflicts and kicked off another full IT build.

Made logging less verbose in the normal case and more specific in the
recovery case. Modified interface to throw InterruptedException in the
case where the action performed a sleep so that the calling thread
could handle / raise the exception. Change the use of uninterruptible
sleep to Thread.sleep in Action implementations.
@dlmarion
Copy link
Contributor Author

Full IT build passed

@dlmarion dlmarion merged commit 6d5e593 into apache:main Feb 16, 2023
3.0.0 automation moved this from In progress to Done Feb 16, 2023
@dlmarion dlmarion deleted the send-scan-results-when-low-on-memory branch February 16, 2023 18:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
3.0.0
  
Done
Development

Successfully merging this pull request may close these issues.

None yet

4 participants