
Avoid memory mapping hydrants after they are persisted & after they are merged for native batch ingestion #11123

Merged · 22 commits · May 11, 2021

Conversation

@loquisgon commented Apr 16, 2021

Description

Currently native batch ingestion may run out of memory while ingesting files with many logical segments (i.e. Sinks) and multiple physical segments (i.e. FireHydrants) per sink. Memory profiling indicates that one source of memory consumption is the references to memory-mapped files created as FireHydrants are created. This PR avoids memory mapping FireHydrants during segment creation for native batch ingestion and drops the memory mapping right after they are merged.

The fix is relatively simple: introduce a flag when the Appenderator is created to indicate whether it is working on a "real time" or a batch task. When it is working on a batch task, use the flag to avoid mapping the segments. In addition, drop the QueryableIndexSegment from the FireHydrant just after a given sink is merged.

I also added metrics to track sinks and hydrants periodically (when persisting and at the end of the segment creation phase). This provides information for debugging, since these data structures are the heaviest consumers of memory, and even though hydrants are no longer mapped during batch ingestion their references still accumulate.
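The per-hydrant lifecycle described above can be sketched roughly as follows. This is an illustrative stand-in, not the actual Druid API: the class, field, and method names here are hypothetical, and the real AppenderatorImpl/FireHydrant code differs.

```java
import java.io.File;

class HydrantSketch {
    private final boolean isRealTime; // flag passed in when the appenderator is created
    private Object mappedSegment;     // stands in for the memory-mapped QueryableIndexSegment

    HydrantSketch(boolean isRealTime) {
        this.isRealTime = isRealTime;
    }

    /** Called after a hydrant is persisted to disk. */
    void onPersist(File persistedFile) {
        if (isRealTime) {
            // realtime queries hit the intermediary segments, so the mapping is required
            mappedSegment = map(persistedFile);
        }
        // batch: skip mapping entirely; re-map lazily just before merge
    }

    /** Called after this hydrant's sink has been merged. */
    void onMerged() {
        if (!isRealTime) {
            mappedSegment = null; // drop the reference so the mapping can be reclaimed
        }
        // realtime: keep the mapping, queries can still arrive after merge
    }

    boolean isMapped() {
        return mappedSegment != null;
    }

    private static Object map(File f) {
        return new Object(); // placeholder for the actual mmap of the persisted file
    }
}
```

The later review threads below cover the two wrinkles this sketch glosses over: realtime queries still need the mapping after merge, and the heap-usage accounting must not trigger the mapping as a side effect.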


This PR has:

  • [x] been self-reviewed.
  • [ ] added documentation for new or modified features or behaviors.
  • [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • [ ] added or updated version, license, or notice information in licenses.yaml
  • [x] added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • [ ] added integration tests.
  • [x] been tested in a test Druid cluster.

@clintropolis (Member) left a comment:

I think it would be useful to add some tests to show that this has the intended effect (maybe also easier/possible after modifying the appenderator heap-usage calculation to consider whether or not the segment has been mapped).

Comment on lines 865 to 926
// Drop the queriable indexes behind the hydrants... they are not needed anymore
// and their mapped file references can generate OOMs during merge if enough of
// them are held back...
for (FireHydrant fireHydrant : sink) {
  fireHydrant.swapSegment(null);
}
Member:

I'm not sure this is true, I think realtime queries still use the pre-merge intermediary segments and we do not do any sort of swap and replace to the merged segment

loquisgon (Author):

It turns out that queries can still happen for realtime after hydrants are merged... I just added code to deal with this case. For realtime ingestion, memory mappings must remain after merge.

Comment on lines 32 to 33
private final Supplier<QueryableIndex> indexSupplier;
private final Supplier<QueryableIndexStorageAdapter> queryableIndexStorageAdapterSupplier;
Member:

I wonder if there is a way to limit the supplier to being part of the FireHydrant instead of doing this modification here?

While I can't think of any ill side effect that this might cause, since these suppliers shouldn't be called too many times in the scheme of things, this change also has a huge surface area because it happens here instead of being limited to ingestion.

loquisgon (Author):

I rolled back the memoization code....

@@ -347,7 +348,8 @@ public AppenderatorAddResult add(
}
}

if (!skipBytesInMemoryOverheadCheck && bytesCurrentlyInMemory.get() - bytesToBePersisted > maxBytesTuningConfig) {
if (!skipBytesInMemoryOverheadCheck
Member:

Hmm, I think the memory overhead calculation checks are going to end up triggering the supplier which will map the segment? Specifically calculateMMappedHydrantMemoryInUsed, which is going to try to get the storage adapter to count the number of columns. I think FireHydrant is going to need some way to track whether or not the segment has been mapped (related to my earlier thread on why the supplier might also be more suitable to live in FireHydrant somehow, so that we can have some side effect to let the hydrant know the mapping has happened)

loquisgon (Author):

Good catch... I did not see memory being mapped in my tests but it was because I had "skipBytesInMemoryOverheadCheck": true in the ingestion spec....

loquisgon (Author):

I refactored the memory calculations a little in my last commit, enough to make it work, but they still need more work (judging from my brief look at the heap dumps)

@loquisgon loquisgon marked this pull request as draft April 19, 2021 19:10
@loquisgon loquisgon marked this pull request as ready for review April 23, 2021 01:56
indexesToPersist.add(Pair.of(sink.swap(), identifier));
totalHydrantsPersisted.addAndGet(1);
Contributor:

Why is the count for hydrant increased by 1 here?

loquisgon (Author):

Because sink.swap() creates a new hydrant (which should not be counted) but returns the old hydrant (which needs to be counted)

@@ -558,35 +570,44 @@ public void clear() throws InterruptedException
final List<Pair<FireHydrant, SegmentIdWithShardSpec>> indexesToPersist = new ArrayList<>();
int numPersistedRows = 0;
long bytesPersisted = 0L;
AtomicLong totalHydrantsCount = new AtomicLong();
Contributor:

What's the reason for using AtomicLong here?

loquisgon (Author):

Because it is used inside the lambda, the variable needs to be effectively final. Since its value is set more than once, it cannot be a primitive type.
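For context, Java requires local variables captured by a lambda to be effectively final, which is why a mutable holder object such as AtomicLong (or, as settled below, MutableLong) is needed when the lambda must update a counter. A minimal illustration, not the Druid code itself:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

class LambdaCaptureDemo {
    static long sumViaHolder(List<Integer> xs) {
        // long total = 0; xs.forEach(x -> total += x);  // does not compile:
        // "local variables referenced from a lambda expression must be final or effectively final"
        AtomicLong total = new AtomicLong();
        xs.forEach(x -> total.addAndGet(x)); // the holder reference itself stays effectively final
        return total.get();
    }
}
```

AtomicLong works but signals concurrent access that isn't happening here, which is the reviewer's objection further down; a single-threaded holder like Apache Commons' MutableLong expresses the intent better.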

loquisgon (Author) commented May 4, 2021:

I forgot about SettableSupplier; I could use it rather than AtomicLong since there is no multi-threaded access to that variable. But because it is a counter, the built-in add in AtomicLong is convenient in this case.

Contributor:

Please do not use concurrent data structures where there is no concurrent access from multiple threads. It makes code very confusing.

Contributor:

I guess you can use MutableLong instead.

loquisgon (Author):

Changed to use MutableLong

}
}

private static SegmentIdWithShardSpec si(String interval, String version, int partitionNum)
Contributor:

Can you rename this to something more readable?

loquisgon (Author):

That is copied from the existing unit test for the real-time appenderator... but sure, I can rename it.

loquisgon (Author):

Renamed

);
}

static InputRow ir(String ts, String dim, Object met)
Contributor:

Can you rename this to something more readable?

loquisgon (Author):

sure

loquisgon (Author):

Renamed

);
}

private static <T> List<T> sorted(final List<T> xs)
Contributor:

Is this method required? Since SegmentIdWithShardSpec and DataSegment implement Comparable, is calling Collections.sort on the list directly possible?

loquisgon (Author):

Removed custom collector using stream sort instead (fix in next push).
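Since the element types implement Comparable, the helper can indeed collapse to a plain stream sort. A generic sketch of the replacement (the method name mirrors the test helper being discussed, but this is an illustration, not the actual test code):

```java
import java.util.List;
import java.util.stream.Collectors;

class SortedDemo {
    // Returns a sorted copy without mutating the input list,
    // relying on the elements' natural Comparable ordering.
    static <T extends Comparable<? super T>> List<T> sorted(List<T> xs) {
        return xs.stream().sorted().collect(Collectors.toList());
    }
}
```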

if (!isRealTime()) {
// sanity:
if (fireHydrant.getPersistedFile() == null) {
throw new ISE("Persisted file for batch hydrant is null!");
Contributor:

would you want any other info in this message to debug the exception? It's ok if not.

loquisgon (Author):

That should never happen... it is just sanity. Can you think of anything else to say?

loquisgon (Author):

Added hydrant (toString), a little more but not a lot more.

* @return The persisted segment id. This is needed to recreate mapped files before merging.
* It will be null for real time hydrants
*/
public @Nullable SegmentId getPersistedSegmentId()
Contributor:

can you also add a note about why persistedSegmentId is required in addition to getSegmentId()? I understand that the latter will throw an NPE?

loquisgon (Author) commented Apr 26, 2021:

Having persistedSegmentId in addition to segmentId is kind of ugly, but I was trying to avoid touching the existing code as much as possible. In order to use only the segment id I would have to introduce a "segmentId" private member that is sometimes set by the adapter and other times set from the "persisted segment id", and I am on the fence on whether doing that would be even more confusing. In any case, our discussion was that the next step is to split Appenderator into two implementations: one for batch and one for real time. The realtime appenderator would be practically the same implementation as before these changes. The batch appenderator would take the current incremental direction to its logical conclusion: avoid keeping in memory any data structures that are not needed for native batch, which would remove the OOMs caused by memory consumption scaling with data size. This code is therefore temporary, and we can have a cleaner implementation in the BatchAppenderatorImpl.

Contributor:

I don't understand what persistedSegmentId is. Is it ever different from segmentId?

loquisgon (Author):

The persistedSegmentId is the same as segmentId. However, segmentId is accessed indirectly through the adapter, and closing the QueryableIndex nullifies the reference to it inside the adapter, so after that point segmentId is no longer accessible. I decided to add a new data member and a new method to store the segmentId after the queryable index is closed. The segmentId is required to re-open the QueryableIndex in the merge phase.

loquisgon (Author) commented Apr 26, 2021:

In order to make the code less confusing I can still keep the persistedSegmentId reference but modify the getSegmentId method as follows:

public SegmentId getSegmentId()
{
  if (adapter.get() != null) {
    return adapter.get().getId();
  }
  return persistedSegmentId;
}

Would you prefer this instead? Personally I prefer the explicit way to obtain that reference but the fact that there are so many questions already makes it clear that it is confusing. So I am fine with an alternative. Let me know if the original way, or the above works or whether you have a better idea.

loquisgon (Author):

Moved persisted metadata to AppenderatorImpl as a map. Added comments to the Map explaining why it is necessary. I think this way is much cleaner. In next push.

Agustin Gonzalez added 3 commits April 26, 2021 12:08
// in order to facilitate the mapping of the QueryableIndex associated with a given hydrant
// at merge time. This is necessary since batch appenderator will not map the QueryableIndex
// at persist time in order to minimize its memory footprint.
private final Map<FireHydrant, Pair<File, SegmentId>> persistedHydrantMetadata = new HashMap<>();
Contributor:

Does FireHydrant have a hashCode implementation? Maybe using IdentityHashMap makes more sense here? Also, do we ever need to clear this map, e.g. when org.apache.druid.segment.realtime.appenderator.Appenderator#clear or org.apache.druid.segment.realtime.appenderator.Appenderator#drop is called?

loquisgon (Author):

Yeah, IdentityHashMap makes it explicit that the key is a Java reference. Also, now clearing the Map in the places you referenced above.
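The distinction being agreed on here shows up clearly in a small standalone example: two equal-but-distinct key objects collapse to one entry in a HashMap (which keys on equals/hashCode) but stay separate in an IdentityHashMap (which keys on reference identity, exactly what is wanted when the key is a particular FireHydrant instance):

```java
import java.util.Map;

class IdentityKeyDemo {
    // Inserts two keys that are equals() to each other but are distinct references,
    // then reports how many entries the map ended up with.
    static int entriesAfterTwoEqualKeys(Map<String, Integer> map) {
        String a = new String("hydrant");
        String b = new String("hydrant"); // equal to a, but a different object
        map.put(a, 1);
        map.put(b, 2);
        return map.size();
    }
}
```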

… fact that keys are Java references. Maintain persisted metadata when dropping/closing segments.
@abhishekagarwal87 (Contributor) left a comment:

LGTM

* in order to facilitate the mapping of the QueryableIndex associated with a given hydrant
* at merge time. This is necessary since batch appenderator will not map the QueryableIndex
* at persist time in order to minimize its memory footprint. This has to be synchronized since the
* map may be accessed from multiple threads.
Contributor:

typo

loquisgon (Author):

In next push

@@ -691,6 +739,8 @@ public Object call() throws IOException
if (sink.finishWriting()) {
totalRows.addAndGet(-sink.getNumRows());
}
// count hydrants for stats:
pushedHydrantsCount.addAndGet(IterableUtils.size(sink));
Contributor:

You can use Iterables.size instead, so you don't have to add a new dependency on commons-collections4.
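Guava's Iterables.size (already on Druid's classpath) walks any Iterable and counts its elements, which is all the hydrant-count metric needs. For illustration, a JDK-only equivalent can be written with the stream API:

```java
import java.util.stream.StreamSupport;

class IterableSizeDemo {
    // JDK-only equivalent of Guava's Iterables.size(Iterable):
    // stream over the iterable's spliterator and count the elements.
    static long size(Iterable<?> iterable) {
        return StreamSupport.stream(iterable.spliterator(), false).count();
    }
}
```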

loquisgon (Author):

In next push

@@ -1320,6 +1320,7 @@ Additional peon configs include:
|`druid.peon.mode`|Choices are "local" and "remote". Setting this to local means you intend to run the peon as a standalone process (Not recommended).|remote|
|`druid.indexer.task.baseDir`|Base temporary working directory.|`System.getProperty("java.io.tmpdir")`|
|`druid.indexer.task.baseTaskDir`|Base temporary working directory for tasks.|`${druid.indexer.task.baseDir}/persistent/task`|
|`druid.indexer.task.batchMemoryMappedIndex`|If false, native batch ingestion will not memory map indexes, thus saving heap space. This does not apply to streaming ingestion, just to batch.|`false`|
Contributor:

Can you add comment on how is this used, what's its purpose, and why would a user need to set it to true?

loquisgon (Author):

In next push

server/pom.xml Outdated
@@ -454,6 +454,11 @@
<version>1.3</version>
<scope>test</scope>
</dependency>
<dependency>
Contributor:

Not needed if we use Iterables.size instead.

loquisgon (Author):

In next push


5 participants