
Refactor index merging, replace Rowboats with RowIterators and RowPointers #5335

Merged: 27 commits, Apr 28, 2018

Conversation

leventov
Member

@leventov leventov commented Feb 2, 2018

What this PR does is explained in this issue: #4622.

Key interfaces added in this PR are TimeAndDimsIterator, TimeAndDimsPointer, RowIterator and RowPointer. They have very elaborate Javadocs; please read them to get a better idea of this PR.

Its performance effect is not as good as I expected:

BEFORE
Benchmark                    (numSegments)  (rollup)  (rowsPerSegment)  (schema)  Mode  Cnt        Score       Error  Units
IndexMergeBenchmark.mergeV9              5      true             75000     basic  avgt   25  3209755.018 ± 44128.873  us/op
IndexMergeBenchmark.mergeV9              5     false             75000     basic  avgt   25  3123278.657 ± 37920.421  us/op

AFTER
Benchmark                    (numSegments)  (rollup)  (rowsPerSegment)  (schema)  Mode  Cnt        Score       Error  Units
IndexMergeBenchmark.mergeV9              5      true             75000     basic  avgt   25  3302141.747 ± 31396.513  us/op
IndexMergeBenchmark.mergeV9              5     false             75000     basic  avgt   25  3115059.181 ± 36265.165  us/op

I.e. it makes index merging with rollup 3% slower. This is because this PR only starts to become really beneficial once garbage-free object metric ser/de is implemented, as mentioned in #5172. I wanted to do that before publishing this PR, but it turned out to be a Pandora's box and may take several months to complete. So I decided to publish this PR even though it doesn't improve performance yet, because the work is already done and I don't want to re-do it when master diverges.

Another reason why this PR doesn't seem so appealing yet is that the current way of storing data in the incremental index (dims is an array of objects; e.g. a double-typed dimension is stored as a Double object) doesn't penalize the current way of merging indexes (explained in #4622). I.e. merging just moves a Double object from one place to another; no new objects are created.

However, I'm going to replace the current Object[]-based way of storing data in the incremental index with one based on ByteBuffers, to reduce the memory footprint (and finally remove Aggregators, leaving only BufferAggregators that could work in both realtime and historical). Once that is done, the current way of index merging would need to create a lot of new objects, while the new index merging structures presented in this PR would be unaffected.

@leventov
Member Author

leventov commented Feb 5, 2018

I tagged Design Review because I want more people to learn about the new logic of index merging via review.

@leventov
Member Author

leventov commented Feb 5, 2018

Tagged Bug because previously, events without dimensions ({"index":100}) and events with an empty string dimension ({"market":"", "index":100}) were not coalesced, which seems to have been wrong. Those events can be found in druid.sample.json. See changes in SchemalessTestFullTest and SchemalessTestSimpleTest.

*/
EncodedKeyComponentType convertUnsortedEncodedKeyComponentToSortedEncodedKeyComponent(EncodedKeyComponentType key);

ColumnValueSelector convertUnsortedValuesToSorted(ColumnValueSelector selectorWithUnsortedValues);
Member Author

@jon-wei could you please help to create a good javadoc for this method? I forgot why we need to sort dimension values before index merging.

Member Author

Or @gianm?

Contributor

I don't remember either. Definitely they must be sorted after merging (dictionaries need to be sorted) but I forget if there is a specific reason why they need to be sorted before merging.

Contributor

Sorry, just saw this comment!

The writeDimValueAndSetupDimConversion step, which builds the merged dictionary across indexes and the conversion buffers used for converting per-index dictionary IDs to merged dictionary IDs, builds those conversions from each index's sorted dictionary, so the Rowboat iterable returned by IncrementalIndexAdapter also needs to use the sorted ID space.

DictionaryMergeIterator uses a priority queue to merge the per-index dictionary values, so I guess that's why the sorted IDs are used now.

Member Author

Now I don't understand why merging is not broken, if it compares IndexedInts (in this PR; or int[], before) based only on sorted IDs, without looking up the original values. E.g.
Index1:
null -> 0,
apple -> 1,
banana -> 2.

Index2:
null -> 0,
watermelon -> 1.

Then it will merge the [watermelon] row from the second index before the [banana] row from the first index, because its sorted ID is lower.

Contributor

writeDimValueAndSetupDimConversion would look at the actual String values to create an int buffer for each index to be merged, used as a mapping from sorted index-specific IDs to sorted merged IDs.

Using that example, the dictionary building step would create the merged dictionary:
null -> 0
apple -> 1
banana -> 2
watermelon -> 3

As part of that process, a conversion buffer for each index to be merged would be generated:
Index 1:
0 -> 0 (null)
1 -> 1 (apple)
2 -> 2 (banana)

Index 2:
0 -> 0 (null)
1 -> 3 (watermelon)

These conversions are applied by MMappedIndexRowIterable before the rows are fed into the row merging function.
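
To make that mapping concrete, here is a minimal, self-contained sketch of the idea in plain Java (illustrative names only, not Druid's actual DictionaryMergeIterator or conversion-buffer code): given per-index sorted dictionaries, it builds the merged sorted dictionary and a per-index conversion table from local sorted IDs to merged sorted IDs.

import java.util.List;
import java.util.TreeSet;

// Illustrative only: builds a merged, sorted dictionary from several per-index sorted
// dictionaries, plus one conversion table per index mapping local sorted IDs to merged
// sorted IDs (the role played by the conversion buffers described above).
class DictionaryConversionSketch
{
  static int[][] buildConversions(List<List<String>> sortedDictionaries, List<String> mergedOut)
  {
    TreeSet<String> merged = new TreeSet<>();
    for (List<String> dictionary : sortedDictionaries) {
      merged.addAll(dictionary);
    }
    mergedOut.addAll(merged); // the merged dictionary, in sorted order

    int[][] conversions = new int[sortedDictionaries.size()][];
    for (int i = 0; i < sortedDictionaries.size(); i++) {
      List<String> dictionary = sortedDictionaries.get(i);
      int[] conversion = new int[dictionary.size()];
      for (int localId = 0; localId < dictionary.size(); localId++) {
        // local sorted ID -> position of the same value in the merged sorted dictionary
        conversion[localId] = mergedOut.indexOf(dictionary.get(localId));
      }
      conversions[i] = conversion;
    }
    return conversions;
  }
}

For index 1 = [apple, banana] and index 2 = [watermelon] (the null entry omitted for simplicity), this yields the merged dictionary [apple, banana, watermelon] and the conversion tables [0, 1] and [2], matching the mapping described above.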

Member Author

Ok, thanks. I see now. Updated Javadoc comments.

However, now that I clearly see the whole picture, I have another question... #5526

@leventov
Member Author

leventov commented Feb 5, 2018

This PR is ready for review. It's going to conflict heavily with #5278, so #5278 should be merged first, but review doesn't need to wait for that.

@leventov leventov added this to the 0.13.0 milestone Feb 5, 2018
@leventov
Member Author

@nishantmonu51 could you please review this?

@gianm
Contributor

gianm commented Feb 13, 2018

Another reason why this PR doesn't seem so appealing yet is that the current way of storing data in the incremental index (dims is an array of objects; e.g. a double-typed dimension is stored as a Double object) doesn't penalize the current way of merging indexes (explained in #4622). I.e. merging just moves a Double object from one place to another; no new objects are created.

However, I'm going to replace the current Object[]-based way of storing data in the incremental index with one based on ByteBuffers, to reduce the memory footprint (and finally remove Aggregators, leaving only BufferAggregators that could work in both realtime and historical). Once that is done, the current way of index merging would need to create a lot of new objects, while the new index merging structures presented in this PR would be unaffected.

@leventov - in this PR you write that it may take several months to complete the series of changes you proposed, but you have also recently written elsewhere that you are "probably going to stop being involved in [Druid] development soon". Forgive me for asking, but are you going to be an active contributor long enough to complete this series of contributions and help with follow up? I ask because right now this area of the code is quite stable, and may become less stable as a result of major changes. Improvements are good, but it's always best if the original author is around to fix bugs, help with problems that may arise, comment on why things were done certain ways, and so on.

For example one potential bug could be a performance bug -- I have done experiments with replacing all uses of Aggregator with BufferAggregator before, and ran into problems with timeseries queries slowing down and growable aggregators in IncrementalIndex taking more memory than expected (I don't remember the exact details, but IIRC, some aggregator in datasketches has a buffer version that allocates the max upfront but the non-buffer versions can start small). If you are using the same approach I did in those experiments then your approach might run into the same problems. But even if not, that's just one example -- there are other potential bugs that could occur too.

@leventov
Member Author

leventov commented Feb 13, 2018

@leventov - in this PR you write that it may take several months to complete the series of changes you proposed, but you have also recently written elsewhere that you are "probably going to stop being involved in [Druid] development soon". Forgive me for asking, but are you going to be an active contributor long enough to complete this series of contributions and help with follow up? I ask because right now this area of the code is quite stable, and may become less stable as a result of major changes.

The amount of work is not that big. I envision that it may take several months because, for example, it will require proposing some changes to https://github.com/DataSketches/memory, waiting for them to be merged (with PR review discussions, etc.), and then waiting until the library with the needed changes is released to Maven Central. After that (not in parallel), the same in https://github.com/RoaringBitmap/RoaringBitmap. After that, some PR(s) in Druid. There might be some delays along the way.

Another reason why it may take several months is that I'm going to reduce my involvement in Druid. But it doesn't take several months of full-time work.

Improvements are good, but it's always best if the original author is around to fix bugs, help with problems that may arise, comment on why things were done certain ways, and so on.

I'm definitely going to be available for such things.


For example one potential bug could be a performance bug -- I have done experiments with replacing all uses of Aggregator with BufferAggregator before, and ran into problems with timeseries queries slowing down and growable aggregators in IncrementalIndex taking more memory than expected (I don't remember the exact details, but IIRC, some aggregator in datasketches has a buffer version that allocates the max upfront but the non-buffer versions can start small). If you are using the same approach I did in those experiments then your approach might run into the same problems. But even if not, that's just one example -- there are other potential bugs that could occur too.

Thanks for pointing this out; I hadn't thought about it, but I now see that it could be a problem if all dimensions and metrics are required to be stored in a single inflexible (because of the shared base) ByteBuffer. A possible solution could be to allow certain Object metrics to be stored in separate (not shared, hence resizable) ByteBuffers: when a metric Object requires a resize, it just abandons the former heap buffer, allocates a new, bigger one, and stores the data in it. The API may look like:

interface AggregatorFactory {
  /** A return value of true designates that metrics aggregated with this factory need to be
      stored in separate buffers. */
  default boolean needsFlexibleAggregate() { return false; }
}

interface BufferAggregator {
  void aggregate(ByteBuffer buf, int position); // this method already exists

  /** If the aggregated object doesn't fit the given buffer (from position 0 to capacity),
      this method needs to allocate a new, bigger buffer and return it. */
  default ByteBuffer aggregateFlexible(ByteBuffer buf) {
    aggregate(buf, 0);
    return buf;
  }
}
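
To illustrate how an aggregator could use that contract, here is a hypothetical sketch implementing the two-method BufferAggregator variant proposed above (GrowableSketchBufferAggregator and requiredSize() are made-up names, not real Druid classes): when the state outgrows the current per-metric buffer, the aggregator copies it into a bigger heap buffer and returns that buffer for the caller to keep.

import java.nio.ByteBuffer;

// Hypothetical implementation against the sketch interface above, purely to illustrate
// the proposed aggregateFlexible() contract.
class GrowableSketchBufferAggregator implements BufferAggregator
{
  @Override
  public void aggregate(ByteBuffer buf, int position)
  {
    // update the aggregation state stored at `position` in `buf` (details omitted)
  }

  @Override
  public ByteBuffer aggregateFlexible(ByteBuffer buf)
  {
    if (requiredSize() > buf.capacity()) {
      // Too small: abandon the old heap buffer, copy the state into a bigger one,
      // and return the bigger one so the caller stores it from now on.
      ByteBuffer bigger = ByteBuffer.allocate(Math.max(requiredSize(), buf.capacity() * 2));
      buf.rewind();
      bigger.put(buf);
      bigger.rewind();
      buf = bigger;
    }
    aggregate(buf, 0);
    return buf;
  }

  // Made-up helper: the size the serialized state would need after this aggregation.
  private int requiredSize()
  {
    return 64;
  }
}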

@nishantmonu51
Member

@leventov I am on vacation this week; I will start reviewing this next week.

@jihoonson
Contributor

I will start to review in a few days.

Contributor

@jihoonson left a comment

Reviewed up to ForwardingRowIterator.

return IndexMergerV9.createDoubleColumnSerializer(segmentWriteOutMedium, dimensionName, indexSpec);
Contributor

Looks like IndexMergerV9.createDoubleColumnSerializer() can be moved to DoubleDimensionHandler, and it would probably be better because type-specific handling could then be done only in that class. Same for other types. What do you think?

Member Author

I don't have a definite opinion. Your point makes sense, but it also makes sense to keep those methods (for Double, Float, and Long) together. I would not change this here, because it's out of the scope of this PR; those methods weren't put into IndexMergerV9 by this PR.

Contributor

Ok.

@fjy
Contributor

fjy commented Mar 21, 2018

I strongly think we need to be really careful with a large refactor of the ingestion code that doesn't seem to add much of a performance improvement, and carefully test that it doesn't break anything. My vote is not to merge this code unless we are confident about that.

@leventov
Member Author

@fjy this PR actually fixes a bug in index merging code.

@leventov
Member Author

@fjy I feel that the current unit and integration test suites cover index merging well. It wasn't the case that I wrote some code and it immediately passed all tests, far from it. I was actually surprised by how many subtle corner cases the unit tests revealed (and I had to fix my new code accordingly).

This PR doesn't improve performance directly, but it's part of a larger plan that will, for instance, allow getting rid of the Aggregator interface (with the current merging scheme, that can't be done reasonably efficiently). I explained that in detail above in this PR's comments and in #5335.

And on top of that, this PR fixes a bug.

* In other words, MergingRowIterator is an equivalent to {@link com.google.common.collect.Iterators#mergeSorted}, but
* for {@link RowIterator}s rather than simple {@link java.util.Iterator}s.
*
* Implementation detail: this class uses binary heap priority queue algorithm to sort pointers, but it also momoizes
Member

s/momoizes/memoizes/g

Member Author

Fixed
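
For reference, here is a minimal sketch of the k-way merge pattern that the Javadoc above compares MergingRowIterator to, written with plain java.util.Iterators and a binary-heap PriorityQueue (illustrative only, not Druid's RowIterator/RowPointer API):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative k-way merge of already-sorted iterators, analogous to Guava's
// Iterators.mergeSorted; MergingRowIterator applies the same heap-based idea to
// RowIterators and RowPointers instead of plain Iterators.
final class MergeSortedSketch
{
  static <T extends Comparable<T>> List<T> mergeSorted(List<Iterator<T>> iterators)
  {
    // Each heap entry holds the current element of one input iterator.
    PriorityQueue<Entry<T>> heap = new PriorityQueue<>(Comparator.comparing((Entry<T> e) -> e.value));
    for (Iterator<T> iterator : iterators) {
      if (iterator.hasNext()) {
        heap.add(new Entry<>(iterator.next(), iterator));
      }
    }
    List<T> result = new ArrayList<>();
    while (!heap.isEmpty()) {
      Entry<T> smallest = heap.poll();
      result.add(smallest.value);
      if (smallest.source.hasNext()) {
        // Advance the iterator the smallest element came from and re-insert its next element.
        heap.add(new Entry<>(smallest.source.next(), smallest.source));
      }
    }
    return result;
  }

  private static final class Entry<T>
  {
    final T value;
    final Iterator<T> source;

    Entry(T value, Iterator<T> source)
    {
      this.value = value;
      this.source = source;
    }
  }
}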


/**
* On {@link #moveToNext()} and {@link #mark()}, this class copies all column values into a set of {@link
* SettableColumnValueSelector} instances. Alternative approach was to save only offset in column and use they same
Member

s/they same/the same/g

Member Author

Fixed

public void inspectRuntimeShape(RuntimeShapeInspector inspector)
{
// nothing to inspect
}
Member

note to self: reviewed till here.

@nishantmonu51
Member

Hi @leventov, please summarize the changes in the PR description, especially the new concepts/interfaces introduced and their purpose; that would make the review easier.

Also, can you list all the future changes that are needed to make this PR useful and produce good results?

Finally, is this code tested anywhere in production? If not, can we get the branch tested on a test cluster to gain more confidence here?

@leventov
Member Author

leventov commented Apr 9, 2018

please summarize the changes in the PR description, especially the new concepts/interfaces introduced and their purpose; that would make the review easier.

I've added a second paragraph to the first message in this PR. You should basically read the Javadocs of the four added classes and interfaces; I don't see a point in copy-pasting the Javadoc here.

can you list all the future changes that are needed to make this PR useful and produce good results?

There is the ObjectStrategy refactoring; see the first message in this thread:

I.e. it makes index merging with rollup 3% slower. This is because this PR only starts to become really beneficial once garbage-free object metric ser/de is implemented, as mentioned in #5172. I wanted to do that before publishing this PR, but it turned out to be a Pandora's box and may take several months to complete. So I decided to publish this PR even though it doesn't improve performance yet, because the work is already done and I don't want to re-do it when master diverges.

Some more info: apache/datasketches-memory#14 (comment). It's already 70% done (and has been for several months); it awaits this PR and also some changes in DataSketches/memory. It will allow drastically reducing the amount of garbage produced when Object columns (like histogram or dataSketches) are involved in index merging.

Finally, is this code tested anywhere in production? If not, can we get the branch tested on a test cluster to gain more confidence here?

No, it's not used in production. I'm not going to test this change in production before it's part of a release, because we have moved away from that practice. Also see #5335 (comment)

@leventov
Member Author

@nishantmonu51 do you have more comments?

@nishantmonu51
Member

LGTM. 👍

@nishantmonu51
Member

@fjy: do you still have concerns about merging this PR?

@fjy
Contributor

fjy commented Apr 28, 2018

👍

@jihoonson jihoonson merged commit 9be0007 into apache:master Apr 28, 2018
@leventov
Member Author

Thanks for the reviews.

@leventov leventov deleted the index-merge-no-garbage branch April 29, 2018 13:34
sathishsri88 pushed a commit to sathishs/druid that referenced this pull request May 8, 2018
Refactor index merging, replace Rowboats with RowIterators and RowPointers (apache#5335)

* Refactor index merging, replace Rowboats with RowIterators and RowPointers

* Add javadocs

* Fix a bug in QueryableIndexIndexableAdapter

* Fixes

* Remove unused declarations

* Remove unused GenericColumn.isNull() method

* Fix test

* Address comments

* Rearrange some code in MergingRowIterator for more clarity

* Self-review

* Fix style

* Improve docs

* Fix docs

* Rename IndexMergerV9.writeDimValueAndSetupDimConversion to setUpDimConversion()

* Update Javadocs

* Minor fixes

* Doc fixes, more code comments, cleanup of RowCombiningTimeAndDimsIterator

* Fix doc link