-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
optimize create inverted indexes #2111
optimize create inverted indexes #2111
Conversation
Performance number1 : 2015-12-18 08:58:31,493 INFO [main] segment.IndexMerger (Logger.java:info(70)) - Starting dimension[nid] with cardinality[10,493,398] 2015-12-18 09:02:06,165 INFO [main] segment.IndexMerger (Logger.java:info(70)) - outDir[/tmp/base472859607006656847flush/merged/v8-tmp] completed inverted.drd in 386,635 millis. After: 2015-12-18 08:43:03,655 INFO [main] segment.IndexMerger (Logger.java:info(70)) - Starting dimension[nid] with cardinality[10,493,398] 2015-12-18 08:45:03,878 INFO [main] segment.IndexMerger (Logger.java:info(70)) - outDir[/tmp/base4666050658270672045flush/merged/v8-tmp] completed inverted.drd in 287,941 millis. |
Performance number2 : 2015-12-18 09:45:12,948 INFO [main] segment.IndexMerger (Logger.java:info(70)) - Starting dimension[nid] with cardinality[4,362,606] 2015-12-18 09:46:15,038 INFO [main] segment.IndexMerger (Logger.java:info(70)) - outDir[/tmp/base6193429426037721634flush/merged/v8-tmp] completed inverted.drd in 118,692 millis. After: 2015-12-18 09:28:52,253 INFO [main] segment.IndexMerger (Logger.java:info(70)) - Starting dimension[nid] with cardinality[4,362,606] 2015-12-18 09:29:33,492 INFO [main] segment.IndexMerger (Logger.java:info(70)) - outDir[/tmp/base5295145984422027811flush/merged/v8-tmp] completed inverted.drd in 96,796 millis. |
@binlijin just looking at your merging times, have you thought about sharding your data more? |
In any case, this is cool |
DictIdSeeker[] dictIdConverter = new DictIdSeeker[indexes.size()]; | ||
for (int j = 0; j < indexes.size(); j++) { | ||
IntBuffer dimConversion = dimConversions.get(j).get(dimension); | ||
if(dimConversion != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor formatting, need a space here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually there's a bunch of formatting stuff in this PR. Please make sure to use the style guide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I use eclipse and find the eclipse_formatting.xml is not useful, and i will try use IntelliJ.
@fjy, we have a big datasource and every day we need to build 30 billion records, so may be we need more big segment. |
@binlijin You can create multiple segments for the same time interval with different shard numbers. I think you should try to keep segments around 5M rows. This is what we did for 100+ billion records per day. |
@fjy, The big datasource we need to keep 15 day's data, and will do query on per day's data,so what if we have too much segments, do druid can handle? |
@fjy, what is the problem and why the travis fail? |
@fjy, thanks.. |
👍 this looks good to me now, but I think someone else who knows this code should do a review as well |
Related to #2138 |
@xvrl can you take a look? |
} | ||
|
||
final Indexed<String> dimSet = getDimValueLookup(dimension); | ||
|
||
// BitmapIndexSeeker is the main performance boost comes from. | ||
// In the previous version of index merge, during the creation of invert index, we do something like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we update the comments to explain how BitmapIndexHolder works?
ping @xvrl |
return new EmptyBitmapIndexSeeker(); | ||
if (dictId >= 0) { | ||
final Indexed<String> dimValues = getDimValueLookup(dimension); | ||
String value = Strings.nullToEmpty(dimValues.get(dictId)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason to call nullToEmpty here? this seems it might be an artifact of wrapping DimDim with NullValueConverterDimDim, however getBitmapIndex relies on the actual values stored in DimDim, not the values returned by the wrapper, it that correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are right, NullValueConverterDimDim will convert empty to null, we need convert it back to the actual values, because getBitmapIndex relies on the actual values stored in DimDim.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this warrants a comment, given that it took me a while to track down the reason for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, done, already add a comment for this.
@xvrl rebase |
rebase |
@himanshug can you take a look to help finish this off? |
if (dictId >= 0) { | ||
return new BitmapCompressedIndexedInts(bitmaps.getBitmap(dictId)); | ||
} else { | ||
return new EmptyIndexedInts(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe EmptyIndexedInts should be a singleton, it already has a static final instance, can you use that? also make the no arg constructor in that class be private.
@binlijin can you update the PR description with a summary of why this change improves performance, it will be helpful to anyone looking at PR. |
public static class DictIdSeeker | ||
{ | ||
static final int NOT_EXIST = -1; | ||
static final int NOT_INIT = -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can u make both static variables private as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, i see that they are used in other places.
@himanshug do update the PR description. |
rebase |
Assert.assertEquals(1, dictIdSeeker.seek(2)); | ||
try { | ||
dictIdSeeker.seek(1); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should add an Assert.fail(..) here or else the verification doesn't happen for the case when exception is not thrown.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, i will fix it.
👍 after #2111 (comment) is resolved. |
I'm still 👍 @binlijin feel free to merge this after you address @himanshug's comment |
rebase and fix test |
optimize create inverted indexes
In index persist or merge when "Create Inverted Indexes" phase, it iterate dim's every value, then get the value's dictionary id in each index to get the bitmap.
We can direct iterate value's dictionary id, and get the corresponding dictionary id in each index from dimConversion to get the bitmap.
This can improve performance much when dim's cardinality high.
Current i do not see any improvement when the data is small.
But we find when large data do hadoop batch ingest and with some high cardinality dimensions the create inverted indexes in Index merger takes the most time.
I will do the performance later with large data.