build v9 directly #2138

KurtYoung · 2015-12-22T04:19:13Z

This PR tracks the feature of building v9 directly, which was discussed in https://groups.google.com/forum/#!topic/druid-development/0CxhljSGeeo

We can divide this PR into 3 main parts:

1. Changes to ColumnPartSerde's interface. I changed ColumnPartSerde's interface to separate the Serializer and the Deserializer, so we can have multiple serializers: one behaves the same as the old one, and another is the new streaming version. The Deserializer does the same thing as the old one.
2. Added the IndexedIntsWriter interface and some *Writers, which are responsible for writing dimension ids. In the hierarchy, I divided the writers into 2 main abstract sub-classes: SingleValueIndexedIntsWriter and MultiValueIndexedIntsWriter.
   Here are the classes that do the real work:
   VSizeIndexedIntsWriter (single value, vsize encoded, not compressed)
   CompressedIntsIndexedWriter (single value, not vsize encoded, compressed)
   CompressedVSizeIntsIndexedWriter (single value, vsize encoded, compressed)
   VSizeIndexedWriter (multi value, both offsets and values are vsized, not compressed)
   CompressedVSizeIndexedV3Writer (multi value, only values are vsized, compressed)
   More details can be found here: https://groups.google.com/forum/#!topic/druid-development/0CxhljSGeeo
3. GenericColumnSerializer and its sub-classes are for writing metrics. They are almost identical to the MetricColumnSerializer you are familiar with, except that GenericColumnSerializer can write to a Channel (which is allocated from the smoosher when building). The sub-classes are:
   LongColumnSerializer (writes long metrics)
   FloatColumnSerializer (writes float metrics)
   ComplexColumnSerializer (writes complex metrics)
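
For readers unfamiliar with the "vsize encoded" label used in the writer list, here is a minimal sketch of the general idea: every id is stored with the smallest fixed byte width that can hold the largest id. The class `VSizeSketch` and its methods are hypothetical names for illustration only; this is not the code of the writers in this PR, and the byte layout is an assumption.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration of variable-size ("vsize") int encoding.
// Not the PR's writer classes; the big-endian layout is an assumption.
final class VSizeSketch {
  // Smallest number of bytes (1 to 4) able to hold every value in [0, maxValue].
  static int numBytesForMax(int maxValue) {
    if (maxValue < 0) {
      throw new IllegalArgumentException("maxValue must be >= 0");
    }
    if (maxValue <= 0xFF) return 1;
    if (maxValue <= 0xFFFF) return 2;
    if (maxValue <= 0xFFFFFF) return 3;
    return 4;
  }

  // Writes each id using numBytesForMax(maxValue) bytes, most significant byte first.
  static byte[] encode(int[] ids, int maxValue) {
    int numBytes = numBytesForMax(maxValue);
    ByteArrayOutputStream out = new ByteArrayOutputStream(ids.length * numBytes);
    for (int id : ids) {
      if (id < 0 || id > maxValue) {
        throw new IllegalArgumentException("id out of range: " + id);
      }
      for (int shift = 8 * (numBytes - 1); shift >= 0; shift -= 8) {
        out.write((id >>> shift) & 0xFF);
      }
    }
    return out.toByteArray();
  }

  // Reads back the id stored at the given position.
  static int decode(byte[] encoded, int numBytes, int index) {
    int value = 0;
    int offset = index * numBytes;
    for (int i = 0; i < numBytes; i++) {
      value = (value << 8) | (encoded[offset + i] & 0xFF);
    }
    return value;
  }
}
```

With a dictionary of at most 65536 entries, each id takes 2 bytes instead of 4, which is the space saving that the vsize variants aim for.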

nishantmonu51 · 2015-12-22T09:02:35Z

processing/src/main/java/io/druid/segment/data/CompressedIntsIndexedWriter.java

+      flattener.write(StupidResourceHolder.create(endBuffer));
+    }
+    endBuffer = null;
+    flattener.close();


Can we move the close to a finally block?
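
A minimal sketch of the requested pattern, assuming a stand-in `RecordingWriter` in place of the real flattener (both `FinallyCloseSketch` and `RecordingWriter` are hypothetical names, not classes from this PR): the final write happens inside `try`, and `close()` runs in `finally` so the writer is closed even if that write throws.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; only the try/finally shape is the point here.
final class FinallyCloseSketch {
  static final class RecordingWriter implements Closeable {
    final List<String> written = new ArrayList<>();
    final boolean failOnWrite;
    boolean closed = false;

    RecordingWriter(boolean failOnWrite) {
      this.failOnWrite = failOnWrite;
    }

    void write(String chunk) throws IOException {
      if (failOnWrite) {
        throw new IOException("simulated write failure");
      }
      written.add(chunk);
    }

    @Override
    public void close() {
      closed = true;
    }
  }

  // Flushes the pending chunk (if any) and always closes the writer.
  static void flushAndClose(RecordingWriter flattener, String pending) throws IOException {
    try {
      if (pending != null) {
        flattener.write(pending);
      }
    } finally {
      flattener.close();
    }
  }
}
```

If the writer is only used within one method, try-with-resources would give the same guarantee with less code.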

fjy · 2015-12-23T05:55:49Z

@KurtYoung This is an awesome PR and one that I know several folks want to code review. Everything is going to be slow during the Christmas holidays, but this will get attention after new years.

nishantmonu51 · 2015-12-23T12:53:16Z

👍, seems good to go after a minor nitpick for closing in finally block.
@KurtYoung thanks for the awesome contribution.

KurtYoung · 2015-12-24T06:38:02Z

Added IndexMergerV9 and changed some low-level interfaces, but everything is fully compatible with the old way.
IndexMergerV9 has not been tested yet; I have only made the whole thing compile and the old tests pass.

Some explanation and thoughts here:

The change to ColumnPartSerde's interface separates the Serializer and the Deserializer, so we can have multiple serializers: one behaves the same as the old one, and another is the new streaming version.
All writers (time, dim, metrics) are backed by tmp files and copy to the final smoosh file directly.
GenericColumnSerializer and its sub-classes are almost identical to MetricColumnSerializer, except that GenericColumnSerializer can write to a Channel (which is allocated from the smoosher when building). MetricColumnSerializer is used for building the v8 index and can be removed later.
IndexedIntsWriter is responsible for writing dimension ids, which now have 4 types of formats; more details are here: https://groups.google.com/forum/#!topic/druid-development/0CxhljSGeeo
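
The "backed by tmp files, then copied to the final file" flow described above can be sketched roughly as follows. `TmpFileBackedWriterSketch` is a hypothetical class written for illustration, using only `java.nio`; it is not one of the writers in this PR, and the target channel here stands in for whatever channel the smoosher hands out.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: values are streamed to a temp file as they arrive,
// then copied in one pass to the destination channel.
final class TmpFileBackedWriterSketch implements AutoCloseable {
  private final Path tmpFile;
  private final FileChannel tmpChannel;

  TmpFileBackedWriterSketch() throws IOException {
    tmpFile = Files.createTempFile("column", ".tmp");
    tmpChannel = FileChannel.open(tmpFile, StandardOpenOption.WRITE);
  }

  // Appends one 4-byte value to the temp file; nothing is kept on the heap.
  void add(int value) throws IOException {
    ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES);
    buf.putInt(value);
    buf.flip();
    while (buf.hasRemaining()) {
      tmpChannel.write(buf);
    }
  }

  // Copies everything written so far to the target channel and returns the byte count.
  long writeTo(WritableByteChannel target) throws IOException {
    try (FileChannel in = FileChannel.open(tmpFile, StandardOpenOption.READ)) {
      long size = in.size();
      long position = 0;
      while (position < size) {
        position += in.transferTo(position, size - position, target);
      }
      return size;
    }
  }

  @Override
  public void close() throws IOException {
    try {
      tmpChannel.close();
    } finally {
      Files.deleteIfExists(tmpFile);
    }
  }
}
```

The design choice this illustrates is that memory use stays bounded by the buffer size rather than by the column size, at the cost of one extra copy from the temp file.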

And here are some points that came up while writing the code, which I think should be discussed with you guys:
1. About skippedDimensions in IndexMaker: dimensions with cardinality 0 are treated as skipped dimensions. But in the current version of IncrementalIndex, even a null dim value is converted to an empty string and treated as a valid value, so I assume all dimensions will be effective, and there is no logic for skipped dimensions.
2. About nullSet when building the inverted index, both in IndexMaker and when converting v8 to v9: it checks every dimension's null rows and makes sure the Dictionary and BitmapIndex contain these null rows. As mentioned in 1, null values are treated properly in IncrementalIndex, so I did not handle this situation either.

binlijin · 2015-12-24T07:53:32Z

processing/src/main/java/io/druid/segment/data/CompressedIntsIndexedWriter.java

+    if (!endBuffer.hasRemaining()) {
+      endBuffer.rewind();
+      flattener.write(StupidResourceHolder.create(endBuffer));
+      endBuffer.rewind();


Create two IntBuffers to reuse, and switch between them after each write, to make sure GenericIndexedWriter can sort correctly.

KurtYoung · 2015-12-25T08:14:11Z

Hmm... I see why the null dict value and null row set are handled both in IndexMaker and in IndexIO's conversion.

My previous decision about skippedDimension & nullSet was wrong, just ignore it.

Working on this now...

KurtYoung · 2015-12-26T05:46:08Z

Found a bug in merge & maker related to dimension order:
#2162
I will add some unit tests and mark them as Ignore; let's just fix it in another PR.

KurtYoung · 2015-12-26T13:00:10Z

Is it ok to insert null into every dimension's dictionary even if the dimension does not contain any null values?
By doing this, there would be no need to iterate over all rows just to determine whether a dimension contains a null value and then bump the dictionary ids accordingly.

Update: I found a way to deal with null values now, but it's a little tricky (there are comments in IncrementalIndexAdapter) and it is easy to create inconsistency (IncrementalIndexStorageAdapter also relies on IncrementalIndex but does not have this logic right now, or it just does not need it right now). I think the proposal above is a possible and easy solution; what do you guys think?
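
To make the proposal concrete, here is a minimal sketch assuming that null always takes the first slot of a sorted dictionary. `NullDictionarySketch` and its methods are hypothetical and are not the logic in IncrementalIndexAdapter or IndexMergerV9.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of "always put null in the dictionary".
final class NullDictionarySketch {
  // Builds a dictionary whose id 0 is reserved for null, followed by the sorted non-null values.
  static List<String> withLeadingNull(List<String> sortedNonNullValues) {
    List<String> dictionary = new ArrayList<>(sortedNonNullValues.size() + 1);
    dictionary.add(null);
    dictionary.addAll(sortedNonNullValues);
    return dictionary;
  }

  // Id of a non-null value given its position among the sorted non-null values.
  // The +1 shift is unconditional, so no scan over the rows is needed to decide it.
  static int idForSortedPosition(int positionAmongNonNull) {
    return positionAmongNonNull + 1;
  }
}
```

The trade-off is one extra dictionary entry per dimension, in exchange for never having to inspect the rows to decide whether ids must shift.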

navis · 2015-12-28T01:52:42Z

Ideally, -1 could be used for null, and the number of distinct values would be specified for each dimension in the metadata. Would that be possible?

nishantmonu51 · 2015-12-28T05:07:13Z

@KurtYoung: are there any corresponding changes in the filters/query path for null handling in case we add null to every dimension dictionary?

KurtYoung · 2015-12-28T05:17:15Z

@nishantmonu51 I'm aware of this. The current implementation does not add null to each dimension but handles null values in both IncrementalIndexAdapter and IndexMergerV9.

fjy · 2015-12-29T19:18:13Z

@KurtYoung there's been a lot of optimizations in the old index merger over the last few months. Are those optimizations incorporated in building the v9 segment directly?

fjy · 2015-12-29T19:19:01Z

...c/main/java/io/druid/query/aggregation/datasketches/theta/SketchMergeComplexMetricSerde.java

@@ -67,11 +65,10 @@ public Object extractValue(InputRow inputRow, String metricName)
  }

  @Override
-  public ColumnPartSerde deserializeColumn(ByteBuffer buffer, ColumnBuilder builder)
+  public void deserializeColumn(ByteBuffer buffer, ColumnBuilder builder)


since we are changing the behavior of this method, can we please add a comment on the interface about how the method is supposed to be used?

fjy · 2015-12-29T21:40:02Z

@KurtYoung I can't find the logic where you actually use index merger v9 instead of index merger

fjy · 2015-12-29T21:42:49Z

@KurtYoung I think the best way to think about how to handle nulls/empty strings in Druid is described in this PR: #995

fjy · 2015-12-29T21:45:31Z

I did a first pass over this PR but didn't go into detail for IndexMergerV9. Will look into it more once we know that it reasonably works. At a high level, I'm on board with the changes.

KurtYoung · 2015-12-30T01:16:03Z

@fjy Actually, I have not changed any logic to use IndexMergerV9 yet, but I switched the current IndexMerger's logic to IndexMergerV9's and made all the test cases pass.
I think it may be a good time to switch to IndexMergerV9 after you have approved this PR.

fjy · 2015-12-30T01:47:48Z

@KurtYoung can't seem to make any comments for indexmergerv9

fjy · 2015-12-30T02:50:09Z

processing/src/main/java/io/druid/segment/IndexMaker.java

+            ComplexColumnPartSerde.legacySerializerBuilder()
+                                  .withTypeName(complexType)
+                                  .withDelegate(metricColumn)
+                                  .build(),
            metBuilder,
            metric
        );


I am not able to make any comments for IndexMergerV9 below this line.

is IndexMergerV9 just a rename of IndexMerger.java?

If that's the case, did you remove the v8 to v9 conversion step in IndexMerger?

IndexMergerV9 makes v9 index files directly; the main steps are very similar to IndexMerger's, except that the v8 to v9 conversion step is no longer necessary.

fjy · 2016-01-11T19:08:04Z

@himanshug Any more comments?

@KurtYoung Can we add some way to switch between IndexMerger and IndexMergerV9 in the configuration? IndexMerger should be the default.

himanshug · 2016-01-11T19:15:06Z

@fjy I am 👍 once the configuration to switch to IndexMergerV9 is in place.

xvrl · 2016-01-11T19:46:33Z

I believe @cheddar should have a look at this one. He had a lot of opinions about format when I made changes to introduce dimension compression, and this one introduces even more changes.

I also agree with @gianm that we should be able to switch between implementations until it has been verified to be production ready.

KurtYoung · 2016-01-12T06:10:05Z

@fjy ok, i will add a config about this. Also looking forward to @cheddar 's comments :)

KurtYoung · 2016-01-13T06:46:13Z

Added "buildV9Directly" option to TuningConfig, docs are updated

fjy · 2016-01-15T01:17:24Z

@himanshug @xvrl we good to move forward?

himanshug · 2016-01-15T15:51:41Z

👍 for me

fjy · 2016-01-15T20:36:33Z

👍 for me too. Will leave this open until tomorrow to see if anyone else has comments.

@KurtYoung have you filled out the CLA: http://druid.io/community/cla.html

You guys might consider a corporate CLA.

himanshug · 2016-01-15T21:06:33Z

@fjy @KurtYoung pls squash the commits / cleanup the history, very useful contribution.

xvrl · 2016-01-15T21:50:46Z

@fjy @KurtYoung I might be wrong, but I still see some comments outstanding. Can we respond or address them?

KurtYoung · 2016-01-16T02:48:03Z

@fjy I have filled out the individual CLA; I don't know if I have the right to fill out a corporate CLA.

fjy · 2016-01-16T02:51:29Z

@KurtYoung thanks

@xvrl any more comments?

KurtYoung · 2016-01-16T02:52:25Z

@xvrl All these comments have been addressed and resolved.

add unit tests for IndexMergerV9 and fix some bugs
add more unit tests and fix bugs
handle null values and add more tests
minor changes & use LoggingProgressIndicator in IndexGeneratorReducer
make some static class public from IndexMerger
minor changes and add some comments
changes for comments

KurtYoung · 2016-01-16T03:31:48Z

squashed into 3 commits.

build v9 directly

xvrl · 2016-01-22T22:11:19Z

indexing-hadoop/src/main/java/io/druid/indexer/HadoopTuningConfig.java

@@ -191,6 +196,11 @@ public boolean getUseCombiner()
    return useCombiner;
  }

+  @JsonProperty
+  public Boolean getBuildV9Directly() {


since buildV9Directly is never null, this should probably be boolean instead of Boolean

and also renamed to isBuildV9Directly
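
A minimal sketch of what this suggestion could look like, assuming the option defaults to false (in line with the request earlier in this thread that IndexMerger stay the default). `TuningConfigSketch` and `mergerName` are hypothetical, and the `@JsonProperty` annotation from the diff is left out so the sketch needs no external library.

```java
// Hypothetical stand-in for a tuning config; not HadoopTuningConfig itself.
final class TuningConfigSketch {
  // Assumed default: keep the existing merger unless the user opts in.
  private static final boolean DEFAULT_BUILD_V9_DIRECTLY = false;

  private final boolean buildV9Directly;

  // The constructor accepts a nullable Boolean because the property may be
  // absent from the user's config; the null is resolved here, once.
  TuningConfigSketch(Boolean buildV9Directly) {
    this.buildV9Directly =
        buildV9Directly == null ? DEFAULT_BUILD_V9_DIRECTLY : buildV9Directly;
  }

  // Primitive return type and "is" prefix, as suggested in the review comment.
  boolean isBuildV9Directly() {
    return buildV9Directly;
  }

  // Illustrates the switch between the two implementations by name only.
  String mergerName() {
    return buildV9Directly ? "IndexMergerV9" : "IndexMerger";
  }
}
```

Resolving the default in the constructor means callers of the getter never have to handle null.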

nishantmonu51 reviewed Dec 22, 2015

KurtYoung force-pushed the feature-build-v9 branch from 54725a7 to 8daf783 on December 24, 2015 07:14

binlijin reviewed Dec 24, 2015

KurtYoung mentioned this pull request Dec 28, 2015

IndexMerger & IndexMaker make the wrong result if indexes have different dimension order #2162

Closed

fjy reviewed Dec 29, 2015

fjy changed the title from "build v9 directly" to "[WIP] build v9 directly" on Dec 29, 2015

fjy reviewed Dec 30, 2015

KurtYoung force-pushed the feature-build-v9 branch from 5d4423f to 75fdd58 on January 13, 2016 06:45

KurtYoung added 3 commits January 16, 2016 11:25

add some streaming writers

bb50d2a

add config for build v9 directly and update docs

82ff98c

KurtYoung force-pushed the feature-build-v9 branch from 75fdd58 to 82ff98c on January 16, 2016 03:30

fjy added a commit that referenced this pull request Jan 16, 2016

Merge pull request #2138 from KurtYoung/feature-build-v9

f6a1a4a

build v9 directly

fjy merged commit f6a1a4a into apache:master Jan 16, 2016

himanshug mentioned this pull request Jan 22, 2016

fixing regressions #2322

Merged

xvrl reviewed Jan 22, 2016

fjy modified the milestone: 0.9.0 Feb 4, 2016

fjy mentioned this pull request Feb 5, 2016

druid-0.9.0 release notes #2404

Closed

seoeun25 added a commit to seoeun25/incubator-druid that referenced this pull request Jan 10, 2020

apache#2138 Implment SQLMetadataStorageActionHandler

a4b1113
