Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

build v9 directly #2138

Merged
merged 3 commits into from
Jan 16, 2016
Merged

build v9 directly #2138

merged 3 commits into from
Jan 16, 2016

Conversation

KurtYoung
Copy link
Contributor

This PR tracks the feature of building v9 directly which had been discussed in https://groups.google.com/forum/#!topic/druid-development/0CxhljSGeeo

We can divide this PR into 3 main parts:

  1. Changing of ColumnPartSerde's interface. I had change ColumnPartSerde's interface to separate the Serializer and Deserializer. So we can have multiple serializers, one of it has the same behavior with old ones. And another one is the new streaming version. The Deserializer just do the same thing like old ones.
  2. Added IndexedIntsWriter interface and some *Writers which are responsible for writing dimension ids. You can see the hierarchy that i divided the writers to 2 main sub abstract classes: SingleValueIndexedIntsWriter and MultiValueIndexedIntsWriter.
    Here are the classed which are doing the real things:
    VSizeIndexedIntsWriter (single value, vsize encoded, not compressed)
    CompressedIntsIndexedWriter (single value, not vsize encoded, compressed)
    CompressedVSizeIntsIndexedWriter (single value, vsize encoded, compressed)
    VSizeIndexedWriter (multi value, both offset and values are vsized, not compressed)
    CompressedVSizeIndexedV3Writer (multi value, only values are vsized, compressed)
    More details can be found here: https://groups.google.com/forum/#!topic/druid-development/0CxhljSGeeo
  3. GenericColumnSerializer and its sub-classes are for writing metrics. They are almost identical to MetricColumnSerializer you familiar with, except the GenericColumnSerializer can write to a Channel(which is allocated from smoosher when building). The sub-classes are:
    LongColumnSerializer (write long metrics)
    FloatColumnSerializer (write float metrics)
    ComplexColumnSerializer (writer complex metrics)

flattener.write(StupidResourceHolder.create(endBuffer));
}
endBuffer = null;
flattener.close();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we move close to a finally block.

@fjy
Copy link
Contributor

fjy commented Dec 23, 2015

@KurtYoung This is an awesome PR and one that I know several folks want to code review. Everything is going to be slow during the Christmas holidays, but this will get attention after new years.

@nishantmonu51
Copy link
Member

👍, seems good to go after a minor nitpick for closing in finally block.
@KurtYoung thanks for the awesome contribution.

@KurtYoung
Copy link
Contributor Author

added IndexMergerV9, changed some low level interfaces but totally compatible with the old way.
IndexMergerV9 has not been tested, just make the whole thing compiled and old tests passed

Some explanation and thoughts here:

  1. Changes to ColumnPartSerde's interface is to separate the Serializer and Deserializer. So we can have multiple serializers, one of it has the same behavior with old ones. And another one is the new streaming version.
  2. All writers(time, dim, metrics) are backed by tmp files and copy to the final smoosh file directly.
  3. GenericColumnSerializer and its sub-classes are almost identical to MetricColumnSerializer, except the GenericColumnSerializer can write to a Channel(which is allocated from smoosher when building). The MetricColumnSerializer are used for build v8 index, and can be removed later.
  4. IndexedIntsWriter are responsible for writing dimension ids which now has 4 types of the formats, more details are here: https://groups.google.com/forum/#!topic/druid-development/0CxhljSGeeo

And here are some points I think should be discussed with your guys when writing the codes:
1. about skippedDimensions in IndexMaker: dimensions with cardinality 0 are treated as skipped dimensions. But in current version of IncrementalIndex, even null dim value are converted to empty string and treated as a valid value, so I assume all the dimensions will be effective so there are no logic about skipped dimensions.
2. about nullSet when building inverted index both in IndexMaker and converting v8 to v9, it checks all dimension's null rows and make sure Dictionary and BitmapIndex are having these null rows. As mentioned in 1, the null values are treated properly in IncrementalIndex, so I did not handle this situation also.

if (!endBuffer.hasRemaining()) {
endBuffer.rewind();
flattener.write(StupidResourceHolder.create(endBuffer));
endBuffer.rewind();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New two IntBuffer to reuse, when after write switch them to make sure GenericIndexedWriter can sort correct.

@KurtYoung
Copy link
Contributor Author

hmm...I see the point why the null dict value and null row set are handled both in IndexMaker and IndexIO's converting.

My previous decision about skippedDimension & nullSet are wrong, just ignore it.

Working on this now...

@KurtYoung
Copy link
Contributor Author

Found a bug of merge & maker about dimension orders:
#2162
I will add some unit tests and marked Ignore, let's just fix it in another PR

@KurtYoung
Copy link
Contributor Author

Is it ok to insert null to every dimension's dictionary even if the dimension did not contain any null values?
By doing this, there will be no need to iterate all rows just want to determine whether one dimension contains null value and to bump dictionary id if we did not.

Update: Found a way to deal with null value now, but it's a little tricky(there are comments in IncrementalIndexAdapter) and easy to create inconsistency(IncrementalIndexStorageAdapter also rely on IncrementalIndex but does not have this logic right now or it's just does not need this right now). I think proposal above is a possible and easy solution, what do you guys think?

@navis
Copy link
Contributor

navis commented Dec 28, 2015

Wishfully -1 could be used for null and number of distincts would be specified for each dimension in meta. Would be possible?

@nishantmonu51
Copy link
Member

@KurtYoung: are there any corresponding changes in the filters/query path for null handling in case we add null to every dimension dictionary.

@KurtYoung
Copy link
Contributor Author

@nishantmonu51 I'm also aware of this, the current implementation did not add null to each dimension but handled null value in both IncrementalIndexAdapter and IndexMergerV9.

@fjy
Copy link
Contributor

fjy commented Dec 29, 2015

@KurtYoung there's been a lot of optimizations in the old index merger over the last few months. Are those optimizations incorporated in building the v9 segment directly?

@@ -67,11 +65,10 @@ public Object extractValue(InputRow inputRow, String metricName)
}

@Override
public ColumnPartSerde deserializeColumn(ByteBuffer buffer, ColumnBuilder builder)
public void deserializeColumn(ByteBuffer buffer, ColumnBuilder builder)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we are changing the behavior of this method, can we please add a comment on the interface about how the method is supposed to be used?

@fjy
Copy link
Contributor

fjy commented Dec 29, 2015

@KurtYoung I can't find the logic where you actually use index merger v9 instead of index merger

@fjy
Copy link
Contributor

fjy commented Dec 29, 2015

@KurtYoung I think the best way to think about how to handle nulls/empty strings in Druid is described in this PR: #995

@fjy
Copy link
Contributor

fjy commented Dec 29, 2015

I did a first pass over this PR but didn't go into detail for IndexMergerV9. Will look into more once we that know it reasonably works. High level I'm on board with the changes.

@fjy fjy changed the title build v9 directly [WIP] build v9 directly Dec 29, 2015
@KurtYoung
Copy link
Contributor Author

@fjy Actually, I did not change any logic to use IndexMergerV9 now, but I switch the current IndexMerger's logic to IndexMergerV9's and make all the test cases passes.
I think maybe it's a good time to switch to IndexMergerV9 after you passed this PR.

@fjy
Copy link
Contributor

fjy commented Dec 30, 2015

@KurtYoung can't seem to make any comments for indexmergerv9

ComplexColumnPartSerde.legacySerializerBuilder()
.withTypeName(complexType)
.withDelegate(metricColumn)
.build(),
metBuilder,
metric
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not able to make any comments for IndexMergerV9 below this line.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is IndexMergerV9 just a rename of IndexMerger.java?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that's the case, did you remove the v8 to v9 conversion step in IndexMerger?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IndexMergerV9 make v9 index files directly, the main step are very like with IndexMerger except the v8 to v9 conversion step is no longer necessary.

@fjy
Copy link
Contributor

fjy commented Jan 11, 2016

@himanshug Any more comments?

@KurtYoung Can we add some way to switch between IndexMerger and IndexMerger v9 in the configuration? Index Merger should be the default.

@himanshug
Copy link
Contributor

@fjy I am 👍 once the configuration to switch to IndexMergerV9 is in place.

@xvrl
Copy link
Member

xvrl commented Jan 11, 2016

I believe @cheddar should have a look at this one. He had a lot of opinions about format when I made changes to introduce dimension compression, and this one introduces even more changes.

I also agree with @gianm we should be able to switch between implementations until it has been verified to be production ready.

@KurtYoung
Copy link
Contributor Author

@fjy ok, i will add a config about this. Also looking forward to @cheddar 's comments :)

@KurtYoung
Copy link
Contributor Author

Added "buildV9Directly" option to TuningConfig, docs are updated

@fjy
Copy link
Contributor

fjy commented Jan 15, 2016

@himanshug @xvrl we good to move forward?

@himanshug
Copy link
Contributor

👍 for me

@fjy
Copy link
Contributor

fjy commented Jan 15, 2016

👍 for me too. Will leave this open until tomm to see if anyone else has comments.

@KurtYoung have you filled out the CLA: http://druid.io/community/cla.html

You guys might consider a corporate CLA.

@himanshug
Copy link
Contributor

@fjy @KurtYoung pls squash the commits / cleanup the history, very useful contribution.

@xvrl
Copy link
Member

xvrl commented Jan 15, 2016

@KurtYoung
Copy link
Contributor Author

@fjy I have filled out individual CLA, don't know if i had the right to fill a corporate CLA.

@fjy
Copy link
Contributor

fjy commented Jan 16, 2016

@KurtYoung thanks

@xvrl any more comments?

@KurtYoung
Copy link
Contributor Author

@xvrl All these comments had been addressed and solved.

add unit tests for IndexMergerV9 and fix some bugs

add more unit tests and fix bugs

handle null values and add more tests

minor changes & use LoggingProgressIndicator in IndexGeneratorReducer

make some static class public from IndexMerger

minor changes and add some comments

changes for comments
@KurtYoung
Copy link
Contributor Author

squashed into 3 commits.

fjy added a commit that referenced this pull request Jan 16, 2016
@fjy fjy merged commit f6a1a4a into apache:master Jan 16, 2016
@himanshug himanshug mentioned this pull request Jan 22, 2016
@@ -191,6 +196,11 @@ public boolean getUseCombiner()
return useCombiner;
}

@JsonProperty
public Boolean getBuildV9Directly() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since buildV9Directly is never null, this should probably be boolean instead of Boolean

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and also renamed to isBuildV9Directly

@fjy fjy modified the milestone: 0.9.0 Feb 4, 2016
@fjy fjy mentioned this pull request Feb 5, 2016
seoeun25 added a commit to seoeun25/incubator-druid that referenced this pull request Jan 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants