Avoid memory mapping hydrants after they are persisted & after they are merged for native batch ingestion #11123
Conversation
I think it would be useful to add some tests to show that this has the intended effect (maybe also easier/possible after modifying the appenderator heap usage calculation to consider whether or not the segment has been mapped).
// Drop the queriable indexes behind the hydrants... they are not needed anymore and their
// mapped file references can generate OOMs during merge if enough of them are held back...
for (FireHydrant fireHydrant : sink) {
  fireHydrant.swapSegment(null);
}
I'm not sure this is true; I think realtime queries still use the pre-merge intermediary segments, and we do not do any sort of swap-and-replace to the merged segment.
It turns out that queries can still happen for realtime after hydrants are merged... I just added code to deal with this case. For realtime ingestion, memory mappings must remain after merge.
private final Supplier<QueryableIndex> indexSupplier;
private final Supplier<QueryableIndexStorageAdapter> queryableIndexStorageAdapterSupplier;
I wonder if there is a way to limit the supplier to being part of the FireHydrant instead of doing this modification here?
While I can't think of any ill side effect that this might cause, since these suppliers shouldn't be called too many times in the scheme of things, this change also has a huge surface area because it happens here instead of being limited to ingestion.
I rolled back the memoization code....
@@ -347,7 +348,8 @@ public AppenderatorAddResult add(
    }
  }

  if (!skipBytesInMemoryOverheadCheck && bytesCurrentlyInMemory.get() - bytesToBePersisted > maxBytesTuningConfig) {
  if (!skipBytesInMemoryOverheadCheck
Hmm, I think the memory overhead calculation checks are going to end up triggering the supplier, which will map the segment? Specifically calculateMMappedHydrantMemoryInUsed, which is going to try to get the storage adapter to count the number of columns. I think FireHydrant is going to need some way to track whether or not the segment has been mapped (related to my earlier thread on why the supplier might also be more suitable to live in FireHydrant somehow, so that we can have some side effect to let the hydrant know the mapping has happened).
Good catch... I did not see memory being mapped in my tests, but that was because I had "skipBytesInMemoryOverheadCheck": true in the ingestion spec....
I refactored the memory calculations a little in my last commit, enough to make it work, but they still need more work (judging from my brief look at the heap dumps)
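The tracking idea discussed above could take a shape like the following. This is a hypothetical sketch, not Druid's actual FireHydrant API: a memoizing supplier wrapper that records whether the expensive work (e.g. memory mapping a segment) has already happened, so memory accounting can ask isResolved() instead of triggering the mapping as a side effect.

```java
import java.util.function.Supplier;

// Hypothetical wrapper; class and method names are illustrative only.
public final class TrackingSupplier<T> implements Supplier<T>
{
    private final Supplier<T> delegate;
    private T value;

    public TrackingSupplier(Supplier<T> delegate)
    {
        this.delegate = delegate;
    }

    @Override
    public synchronized T get()
    {
        if (value == null) {
            // In the scenario above, this is where the segment would be memory mapped.
            value = delegate.get();
        }
        return value;
    }

    // Lets callers (e.g. heap-usage calculations) check whether the mapping
    // already happened without forcing it.
    public synchronized boolean isResolved()
    {
        return value != null;
    }
}
```

A caller that only wants to account for already-mapped hydrants would branch on isResolved() and never call get().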
…ure tracking calculations
indexesToPersist.add(Pair.of(sink.swap(), identifier));
totalHydrantsPersisted.addAndGet(1);
Why is the hydrant count increased by 1 here?
Because sink.swap() creates a new hydrant (which should not be counted) but returns the old hydrant (which needs to be counted)
@@ -558,35 +570,44 @@ public void clear() throws InterruptedException
  final List<Pair<FireHydrant, SegmentIdWithShardSpec>> indexesToPersist = new ArrayList<>();
  int numPersistedRows = 0;
  long bytesPersisted = 0L;
  AtomicLong totalHydrantsCount = new AtomicLong();
What's the reason for using AtomicLong here?
Because it is being used inside the lambda, the variable needs to be effectively final. Since its value is set more than once, it cannot be a primitive type.
I forgot about SettableSupplier; I could use that rather than AtomicLong since there is no multi-threaded access to that variable. But because this is a counter, the built-in add in AtomicLong is convenient in this case.
Please do not use concurrent data structures where there is no concurrent access from multiple threads. It makes code very confusing.
I guess you can use MutableLong instead.
Changed to use MutableLong
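The constraint behind this exchange can be shown with a self-contained sketch. The holder class below is a minimal stand-in for Commons Lang's MutableLong, and the sink/hydrant shapes are illustrative only: a local variable captured by a lambda must be effectively final, so a mutable box is assigned once and mutated, instead of reassigning a primitive.

```java
import java.util.List;

public class LambdaCounterDemo
{
    // Minimal stand-in for MutableLong: the reference is effectively final,
    // so the lambda may capture it; only its contents change. No thread
    // safety is needed because a single thread touches it.
    static final class LongHolder
    {
        long value;

        void add(long delta)
        {
            value += delta;
        }
    }

    // Counts "hydrants" across "sinks" (here just nested lists for illustration).
    public static long countHydrants(List<List<String>> sinks)
    {
        final LongHolder total = new LongHolder();
        // Reassigning a plain `long total` inside this lambda would not compile.
        sinks.forEach(sink -> total.add(sink.size()));
        return total.value;
    }
}
```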
  }
}

private static SegmentIdWithShardSpec si(String interval, String version, int partitionNum)
Can you rename this to something more readable?
That is copied from the existing unit test for the realtime appenderator... but sure, I can rename it.
Renamed
);
}

static InputRow ir(String ts, String dim, Object met)
Can you rename this to something more readable?
sure
Renamed
);
}

private static <T> List<T> sorted(final List<T> xs)
Is this method required? Since SegmentIdWithShardSpec and DataSegment implement Comparable, is calling Collections.sort on the list directly possible?
Removed the custom collector, using a stream sort instead (fix in next push).
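A direct stream sort along those lines might look like the following generic sketch; the element types mentioned in the review (SegmentIdWithShardSpec, DataSegment) only need to implement Comparable for this to apply.

```java
import java.util.List;
import java.util.stream.Collectors;

public class SortDemo
{
    // Works for any Comparable element type without a custom comparator or
    // collector; the input list is left untouched.
    public static <T extends Comparable<? super T>> List<T> sorted(List<T> xs)
    {
        return xs.stream().sorted().collect(Collectors.toList());
    }
}
```

Collections.sort(list) would also work when mutating the list in place is acceptable.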
if (!isRealTime()) {
  // sanity:
  if (fireHydrant.getPersistedFile() == null) {
    throw new ISE("Persisted file for batch hydrant is null!");
Would you want any other info in this message to debug the exception? It's ok if not.
That should never happen... it is just a sanity check. Can you think of anything else to say?
Added the hydrant (toString); a little more, but not a lot more.
 * @return The persisted segment id. This is needed to recreate mapped files before merging.
 *         It will be null for real time hydrants
 */
public @Nullable SegmentId getPersistedSegmentId()
Can you also add a note about why persistedSegmentId is required in addition to getSegmentId()? I understand that the latter will throw an NPE?
Having persistedSegmentId in addition to segmentId is kind of ugly, but I was trying to avoid touching the existing code as much as possible. In order to use only the segment id, I would have to introduce a segmentId private member that is sometimes set by the adapter and at other times set from the persisted segment id. I am on the fence on whether doing this would be even more confusing. In any case, our discussion was that the next step is to split Appenderator into two implementations: one for batch and one for real time. The realtime appenderator would be practically the same implementation as before these changes. The batch appenderator would take the current incremental direction to its logical conclusion: avoid keeping in memory all data structures that are not needed for native batch (which would remove OOMs caused by the relation of data size to memory consumption). Then this code is temporary, and we can have a cleaner implementation in the BatchAppenderatorImpl.
I don't understand what persistedSegmentId is. Is it ever different from segmentId?
The persistedSegmentId is the same as segmentId. However, segmentId is being accessed indirectly through the adapter. Closing the QueryableIndex nullifies the reference to it inside the adapter, so after this action segmentId is no longer accessible. I decided to add a new data member and a new method to store the segmentId after the queryable index is closed. The segmentId is required to re-open the QueryableIndex in the merge phase.
In order to make the code less confusing I can still keep the persistedSegmentId reference but modify the getSegmentId method as follows:

public SegmentId getSegmentId()
{
  if (adapter.get() != null) {
    return adapter.get().getId();
  } else {
    return persistedSegmentId;
  }
}

Would you prefer this instead? Personally I prefer the explicit way to obtain that reference, but the number of questions already makes it clear that it is confusing, so I am fine with an alternative. Let me know whether the original way or the above works, or whether you have a better idea.
Moved the persisted metadata to AppenderatorImpl as a map. Added comments to the map explaining why it is necessary. I think this way is much cleaner. In next push.
…emoved superfluous differences and fix comment typo. Removed custom comparator
// in order to facilitate the mapping of the QueryableIndex associated with a given hydrant
// at merge time. This is necessary since batch appenderator will not map the QueryableIndex
// at persist time in order to minimize its memory footprint.
private final Map<FireHydrant, Pair<File, SegmentId>> persistedHydrantMetadata = new HashMap<>();
Does FireHydrant have a hashCode implementation? Maybe using IdentityHashMap makes more sense here? Also, do we ever need to clear this map, e.g. when org.apache.druid.segment.realtime.appenderator.Appenderator#clear or org.apache.druid.segment.realtime.appenderator.Appenderator#drop is called?
Yeah, IdentityHashMap makes it explicit that the key is a Java reference. Also, clearing the map in the places you referenced above.
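The difference matters because IdentityHashMap compares keys with reference equality (==) rather than equals/hashCode. A self-contained illustration, where the Hydrant class is a hypothetical stand-in for FireHydrant with value-based equality:

```java
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;

public class IdentityKeyDemo
{
    // Stand-in for FireHydrant: equal by content, distinct by reference.
    static final class Hydrant
    {
        final int count;

        Hydrant(int count)
        {
            this.count = count;
        }

        @Override
        public boolean equals(Object o)
        {
            return o instanceof Hydrant && ((Hydrant) o).count == count;
        }

        @Override
        public int hashCode()
        {
            return count;
        }
    }

    // Returns {size with HashMap, size with IdentityHashMap} after inserting
    // two equal-but-distinct keys.
    public static int[] entryCounts()
    {
        Hydrant a = new Hydrant(1);
        Hydrant b = new Hydrant(1); // equals(a), but a different object

        Map<Hydrant, String> byEquals = new HashMap<>();
        byEquals.put(a, "x");
        byEquals.put(b, "y"); // overwrites a's entry

        Map<Hydrant, String> byIdentity = new IdentityHashMap<>();
        byIdentity.put(a, "x");
        byIdentity.put(b, "y"); // kept as a separate entry

        return new int[]{byEquals.size(), byIdentity.size()};
    }
}
```

With IdentityHashMap, two hydrants that happen to compare equal still get their own metadata entries, which is the behavior wanted here.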
… fact that keys are Java references. Maintain persisted metadata when dropping/closing segments.
LGTM
…t this to "true" make code fallback to previous code path.
 * in order to facilitate the mapping of the QueryableIndex associated with a given hydrant
 * at merge time. This is necessary since batch appenderator will not map the QueryableIndex
 * at persist time in order to minimize its memory footprint. This has to be synchronized since the
 * map bay be accessed from multiple threads.
typo
In next push
@@ -691,6 +739,8 @@ public Object call() throws IOException
  if (sink.finishWriting()) {
    totalRows.addAndGet(-sink.getNumRows());
  }
  // count hydrants for stats:
  pushedHydrantsCount.addAndGet(IterableUtils.size(sink));
You can use Iterables.size instead so you don't have to add a new dependency on commons-collections4.
In next push
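For reference, counting an Iterable this way is cheap: Guava's Iterables.size takes the size directly when the argument is a Collection and otherwise walks the iterator. A plain-JDK equivalent of that behavior is short (the class name here is illustrative):

```java
import java.util.Collection;

public class IterableSizeDemo
{
    // Roughly what Guava's Iterables.size does: O(1) for collections,
    // O(n) iterator walk for anything else.
    public static int sizeOf(Iterable<?> iterable)
    {
        if (iterable instanceof Collection) {
            return ((Collection<?>) iterable).size();
        }
        int n = 0;
        for (Object ignored : iterable) {
            n++;
        }
        return n;
    }
}
```

Since a Sink is iterated per push, the O(n) path only costs one pass over its hydrants.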
docs/configuration/index.md
Outdated
@@ -1320,6 +1320,7 @@ Additional peon configs include:
|`druid.peon.mode`|Choices are "local" and "remote". Setting this to local means you intend to run the peon as a standalone process (Not recommended).|remote|
|`druid.indexer.task.baseDir`|Base temporary working directory.|`System.getProperty("java.io.tmpdir")`|
|`druid.indexer.task.baseTaskDir`|Base temporary working directory for tasks.|`${druid.indexer.task.baseDir}/persistent/task`|
|`druid.indexer.task.batchMemoryMappedIndex`|If false, native batch ingestion will not map indexes thus saving heap space. This does not apply to streaming ingestion, just to batch.|`false`|
Can you add comment on how is this used, what's its purpose, and why would a user need to set it to true?
In next push
server/pom.xml
Outdated
@@ -454,6 +454,11 @@
  <version>1.3</version>
  <scope>test</scope>
</dependency>
<dependency>
Not needed if we use Iterables.size instead.
In next push
…a dependency), and fixing a typo in a comment.
Description
Currently native batch ingestion may run out of memory while ingesting files with lots of logical segments (i.e. Sinks) and multiple physical segments (i.e. FireHydrants) per sink. Memory profiling indicates that one source of memory consumption is the references to memory mapped files being created as fire hydrants are created. This PR avoids memory mapping fire hydrants during segment creation for native batch ingestion and drops the memory mapping right after they are merged.
The fix is relatively simple. Introduce a flag when the Appenderator is created to indicate whether it is working on a "real time" or batch task. When it is working on a batch task, use the flag to avoid mapping the segments. In addition, just drop the QueriableIndexSegment from the fire hydrant after a given sink is merged.
I also added metrics to track sinks & hydrants periodically (when persisting and at the end of the segment creation phase). This is to have some information for debugging, given that these data structures are the heaviest consumers of memory, and even though hydrants are no longer mapped during batch their references still accumulate.
This PR has: