Simplifying dimension merging #2094

navis · 2015-12-15T09:43:23Z

Currently, dimension merging is done in two stages: one for the dictionary and one for the index. If it can be done in a single stage, total processing time could be decreased.
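As a rough illustration of the single-stage idea (not the actual IndexMerger code in this PR), the sketch below merges several sorted, duplicate-free dictionaries in one pass and fills a per-input conversion buffer at the same time. The class and method names (`OnePassDictionaryMerge`, `merge`, `Result`) are hypothetical, and heap `IntBuffer`s are used only to keep the example short.

```java
import java.nio.IntBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; names are not taken from Druid's IndexMerger.
final class OnePassDictionaryMerge
{
  /** The merged dictionary plus one old-id to new-id conversion buffer per input. */
  static final class Result
  {
    final List<String> merged = new ArrayList<>();
    final List<IntBuffer> conversions = new ArrayList<>();
  }

  /**
   * Merges sorted, duplicate-free dictionaries in a single pass.
   * conversions.get(i).get(oldId) is the merged id of the value that had oldId in input i.
   */
  static Result merge(List<List<String>> dictionaries)
  {
    Result result = new Result();
    // Queue entries are {inputIndex, positionInInput}, ordered by the value they point at.
    PriorityQueue<int[]> queue = new PriorityQueue<>(
        (a, b) -> dictionaries.get(a[0]).get(a[1]).compareTo(dictionaries.get(b[0]).get(b[1]))
    );
    for (int i = 0; i < dictionaries.size(); i++) {
      result.conversions.add(IntBuffer.allocate(dictionaries.get(i).size()));
      if (!dictionaries.get(i).isEmpty()) {
        queue.add(new int[]{i, 0});
      }
    }
    while (!queue.isEmpty()) {
      int[] head = queue.poll();
      List<String> source = dictionaries.get(head[0]);
      String value = source.get(head[1]);
      int last = result.merged.size() - 1;
      // Append the value only the first time it is seen across all inputs.
      if (last < 0 || !result.merged.get(last).equals(value)) {
        result.merged.add(value);
      }
      // Record the conversion while the dictionary is being written.
      result.conversions.get(head[0]).put(head[1], result.merged.size() - 1);
      if (head[1] + 1 < source.size()) {
        queue.add(new int[]{head[0], head[1] + 1});
      }
    }
    return result;
  }
}
```

Because the conversion entries are written as each value is emitted, no second pass over the input dictionaries is needed to build them.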

navis · 2015-12-16T02:31:51Z

Reduced dim conversion time from 11.6 sec to 7.1 sec, measured with 12 indexes of 500K rows each.

binlijin · 2015-12-16T08:24:12Z

processing/src/main/java/io/druid/segment/IndexMerger.java

@@ -549,10 +553,10 @@ private File makeIndexFiles(
    IOPeon ioPeon = new TmpFileIOPeon();
    ArrayList<FileOutputSupplier> dimOuts = Lists.newArrayListWithCapacity(mergedDimensions.size());
    Map<String, Integer> dimensionCardinalities = Maps.newHashMap();
-    ArrayList<Map<String, IntBuffer>> dimConversions = Lists.newArrayListWithCapacity(indexes.size());
+    ArrayList<Map<String, int[]>> dimConversions = Lists.newArrayListWithCapacity(indexes.size());


How does the performance improvement change when using IntBuffer instead of int[]?

I don't think it will affect performance (I'll check that tomorrow). Is IntBuffer better than int[]? It is less intuitive for me.

I just do not know why IntBuffer was used; other people may be able to explain why.

Previously the IntBuffer was allocated off-heap; with this change it will be on-heap. Whether this is better or worse can be debated. Currently we rely on garbage collection to hopefully clean up our buffers before we run out of memory, but allocating on-heap may create a lot of heap pressure if those arrays are big and long-lived. It might be worth running some benchmarks to get a sense of how things will be affected.
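To make the two options concrete, here is a minimal sketch of allocating the same conversion buffer on-heap and off-heap using only `java.nio`. The class and method names (`ConversionBuffers`, `onHeap`, `offHeap`) are hypothetical and not part of the PR.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.IntBuffer;

// Hypothetical helper for illustration; not code from this PR.
final class ConversionBuffers
{
  /** Backed by an int[] on the Java heap, so it counts toward heap usage and GC work. */
  static IntBuffer onHeap(int cardinality)
  {
    return IntBuffer.wrap(new int[cardinality]);
  }

  /** Backed by direct (off-heap) memory; only the small buffer object lives on the heap. */
  static IntBuffer offHeap(int cardinality)
  {
    return ByteBuffer.allocateDirect(cardinality * Integer.BYTES)
                     .order(ByteOrder.nativeOrder())
                     .asIntBuffer();
  }
}
```

Both variants expose the same `IntBuffer` API, so the rest of the merge code does not need to know which one it was given; the difference is where the memory lives and how it is reclaimed.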

With a direct IntBuffer it took 8 seconds, so there is no notable difference in performance. I prefer int[], but I'm OK with a direct buffer. Opinions?

Heap pressure is already really bad during the making of index files, so it is important to know how this change impacts heap pressure during that time. During the merge and persist phase of realtime tasks we already have very high CPU usage, high enough that you have to be aware of how it impacts query performance. Adding more heap pressure during that phase should be done with great care.

There are also issues with heap size during the reduce portion of Hadoop tasks (or Spark batch tasks). So I'm curious whether adding more objects (int[]) affects the limit on the number of rows per segment (or whether it impacts high-cardinality dimensions).

Reverted to direct IntBuffer.

binlijin · 2015-12-16T09:22:59Z

👍

fjy · 2015-12-18T20:17:36Z

processing/src/main/java/io/druid/segment/IndexMerger.java

    for (String dimension : mergedDimensions) {
      final GenericIndexedWriter<String> writer = new GenericIndexedWriter<String>(
          ioPeon, dimension, GenericIndexed.STRING_STRATEGY
      );
      writer.open();

-      List<Indexed<String>> dimValueLookups = Lists.newArrayListWithCapacity(indexes.size());
-      DimValueConverter[] converters = new DimValueConverter[indexes.size()];
+      int counter = 0;


can we have a more descriptive name than counter?

numDimensions or something

renamed to numMergeIndex

fjy · 2015-12-29T19:14:38Z

👍 This looks good to me, but I think someone else familiar with this code needs to do a review

binlijin · 2016-01-06T12:23:47Z

+1

binlijin · 2016-01-07T00:44:14Z

Can you squash the commits?
If no one has any questions, I will merge.

navis · 2016-01-07T01:32:36Z

@binlijin squashed. thanks.

fjy · 2016-01-07T05:58:51Z

@binlijin do you mind holding off on merging? I just want 1 more pair of eyes to review this

@xvrl can you take a look?

binlijin · 2016-01-07T06:16:52Z

@fjy ok

navis · 2016-01-19T23:40:46Z

@binlijin It's becoming more and more painful to rebase. Do you have time to look into this?

binlijin · 2016-01-20T06:14:52Z

@navis, I am OK with the PR, but fjy said that @xvrl needs to take a look.

fjy · 2016-01-20T06:19:59Z

@navis yes, there are multiple teams working on the same code, which is why we need proposals to coordinate, and all the changes are important

fjy · 2016-01-21T00:37:05Z

processing/src/main/java/io/druid/segment/IndexMerger.java

@@ -556,142 +561,79 @@ protected File makeIndexFiles(
      dimConversions.add(Maps.<String, IntBuffer>newHashMap());
    }

+    final int TIME_WRITE = 0;


can these be static? also can we define them at the top of the file?

Removed the timer. It was only used to check the performance.

fjy · 2016-01-21T00:45:20Z

processing/src/main/java/io/druid/segment/IndexMerger.java

+    }
+  }
+
+  private static class Timer


why not use Guava's Stopwatch?

removed this part

jon-wei · 2016-01-22T02:02:44Z

@navis

Looks good to me, I don't have any additional comments beyond what @fjy has already noted.

Can you rebase? I'll do another review pass on this after.

navis · 2016-01-24T23:55:05Z

Rebased on master, barely. I'll address comments.

jon-wei · 2016-01-26T19:19:10Z

processing/src/main/java/io/druid/segment/IndexMerger.java

    }
  }

  /**
   * Get old dictId from new dictId, and only support access in order
   */
-  public static class DictIdSeeker
+  static class WithConversion implements IndexSeeker


Can you rename "WithConversion", e.g. to ConvertingIndexSeeker or IndexSeekerWithConversion? Currently it sounds a bit like a boolean parameter, and it's not immediately clear that it's a seeker.
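For readers unfamiliar with the "access in order" constraint in the Javadoc above, the following is a minimal sketch of a seeker that maps a new dictionary id back to an old one by scanning a conversion buffer forward only. It assumes the buffer stores, at position oldId, the corresponding new id in ascending order. The class name `InOrderSeeker` and the `NOT_EXIST` constant are hypothetical; this is not the class under review.

```java
import java.nio.IntBuffer;

// Hypothetical illustration of an in-order seeker; not the class from this PR.
final class InOrderSeeker
{
  static final int NOT_EXIST = -1;

  // Position = old id, value = new id; values are assumed to be strictly ascending.
  private final IntBuffer newIds;
  private int oldId = 0;
  private int lastSought = -1;

  InOrderSeeker(IntBuffer newIds)
  {
    this.newIds = newIds.asReadOnlyBuffer();
  }

  /** Returns the old id for newId, or NOT_EXIST. Calls must use non-decreasing ids. */
  int seek(int newId)
  {
    if (newId < lastSought) {
      throw new IllegalStateException("seek must be called with non-decreasing ids");
    }
    lastSought = newId;
    // Advance the cursor; it never moves backwards, so a full scan costs O(n) overall.
    while (oldId < newIds.limit() && newIds.get(oldId) < newId) {
      oldId++;
    }
    if (oldId < newIds.limit() && newIds.get(oldId) == newId) {
      return oldId;
    }
    return NOT_EXIST;
  }
}
```

The forward-only cursor is what makes out-of-order access unsupported: once the cursor has passed an entry, an earlier id can no longer be found.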

jon-wei · 2016-01-26T19:44:13Z

👍 looks good to me after addressing the Seeker renaming comments

fjy · 2016-01-27T03:43:52Z

@himanshug @xvrl do you want to take a look? I think this is getting close to ready and will merge unless there's more comments

himanshug · 2016-01-28T21:56:50Z

processing/src/main/java/io/druid/segment/IndexMerger.java

-    }
-
-    Iterable<Rowboat> theRows = rowMergerFn.apply(boats);
+    Iterable<Rowboat> theRows = makeRowIterable(


I'm assuming the code above was moved as-is into the method makeRowIterable(..)

Yes, made a method to be used in V9Merger.

himanshug · 2016-01-28T21:58:20Z

👍 pls squash.

navis · 2016-01-29T01:04:18Z

squashed

navis force-pushed the simplify-index-merge branch from 80a3377 to 72d4987 Compare December 16, 2015 02:30

binlijin reviewed Dec 16, 2015
navis force-pushed the simplify-index-merge branch 2 times, most recently from 98e5a91 to 82bab15 Compare December 17, 2015 02:59

fjy reviewed Dec 18, 2015
navis force-pushed the simplify-index-merge branch from 82bab15 to a278fe5 Compare December 21, 2015 01:06

navis force-pushed the simplify-index-merge branch from a278fe5 to 35bc224 Compare January 7, 2016 01:32

navis force-pushed the simplify-index-merge branch 3 times, most recently from 182a0e1 to c1b0f06 Compare January 18, 2016 02:51

fjy added this to the 0.9.0 milestone Jan 20, 2016

fjy reviewed Jan 21, 2016
navis force-pushed the simplify-index-merge branch from c1b0f06 to 05fc7dc Compare January 24, 2016 23:52

navis mentioned this pull request Jan 25, 2016

Replace string[] with int[] for dimensions #2085

Merged

jon-wei reviewed Jan 26, 2016
navis force-pushed the simplify-index-merge branch from d55fc5e to d74e526 Compare January 27, 2016 00:42

himanshug reviewed Jan 28, 2016
one-pass merging of dictionary & index

dd774ef

navis force-pushed the simplify-index-merge branch from d74e526 to dd774ef Compare January 29, 2016 01:04

himanshug added a commit that referenced this pull request Jan 29, 2016

Merge pull request #2094 from navis/simplify-index-merge

93c50d8

Simplifying dimension merging

himanshug merged commit 93c50d8 into apache:master Jan 29, 2016

fjy mentioned this pull request Feb 5, 2016

druid-0.9.0 release notes #2404

Closed

navis deleted the simplify-index-merge branch February 13, 2016 03:03
