optimize create inverted indexes #2111

Merged
merged 1 commit into from
Jan 22, 2016
Conversation

binlijin
Contributor

During index persist or merge, the "Create Inverted Indexes" phase iterates over every value of a dimension and then looks up that value's dictionary ID in each index to get the bitmap.
We can instead iterate over the dictionary IDs directly and use dimConversion to get the corresponding dictionary ID in each index, and from that the bitmap.
This improves performance considerably when the dimension's cardinality is high.

So far I do not see any improvement when the data is small.
But we find that when a Hadoop batch ingest runs on large data with some high-cardinality dimensions, creating the inverted indexes in the index merger takes most of the time.
I will run a performance test with large data later.
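
To make the idea in the description concrete, here is a minimal sketch, not the code from this PR. It assumes a per-index conversion buffer in which position i holds the merged dictionary ID for that index's local ID i, in ascending order. The class name `SketchDictIdSeeker` and its layout are hypothetical. Because the merged IDs are requested in increasing order, one forward-moving cursor per index is enough and no per-value lookup is needed.

```java
import java.nio.IntBuffer;

// Hypothetical, simplified seeker for illustration only.
final class SketchDictIdSeeker {
    static final int NOT_EXIST = -1;

    // Assumption: position = local dictionary ID, value = merged dictionary ID, ascending.
    private final IntBuffer conversion;
    private int cursor = 0;          // next local ID that has not been consumed yet
    private int lastRequested = -1;  // last merged ID passed to seek()

    SketchDictIdSeeker(IntBuffer conversion) {
        this.conversion = conversion;
    }

    // Returns the local ID for mergedId in this index, or NOT_EXIST if the index lacks it.
    int seek(int mergedId) {
        if (conversion == null) {
            return NOT_EXIST; // this index does not contain the dimension at all
        }
        if (mergedId <= lastRequested) {
            throw new IllegalStateException("merged IDs must be requested in increasing order");
        }
        lastRequested = mergedId;
        // Skip local IDs whose merged ID is smaller than the one requested.
        while (cursor < conversion.limit() && conversion.get(cursor) < mergedId) {
            cursor++;
        }
        if (cursor < conversion.limit() && conversion.get(cursor) == mergedId) {
            return cursor++;
        }
        return NOT_EXIST;
    }

    public static void main(String[] args) {
        SketchDictIdSeeker seeker = new SketchDictIdSeeker(IntBuffer.wrap(new int[]{0, 2, 4}));
        for (int mergedId = 0; mergedId <= 4; mergedId++) {
            System.out.println(mergedId + " -> " + seeker.seek(mergedId));
        }
    }
}
```

In this example the index only holds merged IDs 0, 2 and 4, so merged ID 2 maps to local ID 1, while 1 and 3 are reported as absent. Each seeker walks its buffer at most once over the whole dimension.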

@binlijin
Contributor Author

Performance numbers, run 1:
Before:
2015-12-18 08:55:39,529 INFO [main] segment.IndexMerger (Logger.java:info(70)) - outDir[/tmp/base472859607006656847flush/merged/v8-tmp] completed walk through of 11,192,533 rows in 295,312 millis.

2015-12-18 08:58:31,493 INFO [main] segment.IndexMerger (Logger.java:info(70)) - Starting dimension[nid] with cardinality[10,493,398]
2015-12-18 08:59:57,578 INFO [main] segment.IndexMerger (Logger.java:info(70)) - Completed dimension[nid] in 86,085 millis.

2015-12-18 09:02:06,165 INFO [main] segment.IndexMerger (Logger.java:info(70)) - outDir[/tmp/base472859607006656847flush/merged/v8-tmp] completed inverted.drd in 386,635 millis.

After:
2015-12-18 08:40:15,936 INFO [main] segment.IndexMerger (Logger.java:info(70)) - outDir[/tmp/base4666050658270672045flush/merged/v8-tmp] completed walk through of 11,192,533 rows in 292,092 millis.

2015-12-18 08:43:03,655 INFO [main] segment.IndexMerger (Logger.java:info(70)) - Starting dimension[nid] with cardinality[10,493,398]
2015-12-18 08:43:22,763 INFO [main] segment.IndexMerger (Logger.java:info(70)) - Completed dimension[nid] in 19,108 millis.

2015-12-18 08:45:03,878 INFO [main] segment.IndexMerger (Logger.java:info(70)) - outDir[/tmp/base4666050658270672045flush/merged/v8-tmp] completed inverted.drd in 287,941 millis.

@binlijin
Contributor Author

Performance numbers, run 2:
Before:
2015-12-18 09:44:16,345 INFO [main] segment.IndexMerger (Logger.java:info(70)) - outDir[/tmp/base6193429426037721634flush/merged/v8-tmp] completed walk through of 4,477,564 rows in 112,079 millis.

2015-12-18 09:45:12,948 INFO [main] segment.IndexMerger (Logger.java:info(70)) - Starting dimension[nid] with cardinality[4,362,606]
2015-12-18 09:45:32,210 INFO [main] segment.IndexMerger (Logger.java:info(70)) - Completed dimension[nid] in 19,262 millis.

2015-12-18 09:46:15,038 INFO [main] segment.IndexMerger (Logger.java:info(70)) - outDir[/tmp/base6193429426037721634flush/merged/v8-tmp] completed inverted.drd in 118,692 millis.

After:
2015-12-18 09:27:56,696 INFO [main] segment.IndexMerger (Logger.java:info(70)) - outDir[/tmp/base5295145984422027811flush/merged/v8-tmp] completed walk through of 4,477,564 rows in 119,256 millis.

2015-12-18 09:28:52,253 INFO [main] segment.IndexMerger (Logger.java:info(70)) - Starting dimension[nid] with cardinality[4,362,606]
2015-12-18 09:28:58,954 INFO [main] segment.IndexMerger (Logger.java:info(70)) - Completed dimension[nid] in 6,701 millis.

2015-12-18 09:29:33,492 INFO [main] segment.IndexMerger (Logger.java:info(70)) - outDir[/tmp/base5295145984422027811flush/merged/v8-tmp] completed inverted.drd in 96,796 millis.
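
As a reading aid for the two runs above, the sketch below computes before/after ratios from the logged millisecond values. The figures are copied from the log lines; the class and method names are made up for this example, and each figure comes from a single run, so the ratios are only indicative.

```java
// Illustrative arithmetic on the logged timings; names are hypothetical.
final class SpeedupSketch {
    // Ratio of the "before" duration to the "after" duration.
    static double speedup(long beforeMillis, long afterMillis) {
        return (double) beforeMillis / afterMillis;
    }

    public static void main(String[] args) {
        // dimension[nid], run 1 and run 2
        System.out.printf("nid run 1: %.2fx%n", speedup(86_085, 19_108));
        System.out.printf("nid run 2: %.2fx%n", speedup(19_262, 6_701));
        // whole inverted.drd phase, run 1 and run 2
        System.out.printf("inverted.drd run 1: %.2fx%n", speedup(386_635, 287_941));
        System.out.printf("inverted.drd run 2: %.2fx%n", speedup(118_692, 96_796));
    }
}
```

For the nid dimension this works out to roughly 4.5x in run 1 and 2.9x in run 2. The gain for the whole inverted.drd phase is smaller, since the total also covers the other dimensions.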

@fjy
Contributor

fjy commented Dec 18, 2015

@binlijin just looking at your merging times, have you thought about sharding your data more?

@fjy
Contributor

fjy commented Dec 18, 2015

In any case, this is cool

DictIdSeeker[] dictIdConverter = new DictIdSeeker[indexes.size()];
for (int j = 0; j < indexes.size(); j++) {
  IntBuffer dimConversion = dimConversions.get(j).get(dimension);
  if(dimConversion != null) {
Contributor

minor formatting, need a space here

Contributor

Actually there's a bunch of formatting stuff in this PR. Please make sure to use the style guide.

Contributor Author

I use Eclipse and found that eclipse_formatting.xml does not work well, so I will try IntelliJ.

@binlijin
Contributor Author

@fjy, we have a big datasource and need to ingest 30 billion records every day, so maybe we need bigger segments.

@fjy
Contributor

fjy commented Dec 21, 2015

@binlijin You can create multiple segments for the same time interval with different shard numbers. I think you should try to keep segments around 5M rows. This is what we did for 100+ billion records per day.

@binlijin
Contributor Author

@fjy, for the big datasource we need to keep 15 days of data, and queries will run against one day of data at a time. If we end up with too many segments, can Druid handle that?

@binlijin binlijin closed this Dec 22, 2015
@binlijin binlijin reopened this Dec 22, 2015
@binlijin binlijin closed this Dec 28, 2015
@binlijin binlijin reopened this Dec 28, 2015
@binlijin binlijin closed this Dec 28, 2015
@binlijin binlijin reopened this Dec 28, 2015
@binlijin
Contributor Author

@fjy, what is the problem, and why does the Travis build fail?

@fjy
Contributor

fjy commented Dec 29, 2015

@binlijin there's a couple of non-deterministic unit tests

If you pull the latest master and merge in #2165, things should pass

@binlijin
Contributor Author

@fjy, thanks..

@binlijin binlijin closed this Dec 29, 2015
@binlijin binlijin reopened this Dec 29, 2015
@fjy
Contributor

fjy commented Dec 29, 2015

👍 this looks good to me now, but I think someone else who knows this code should do a review as well

@binlijin binlijin closed this Dec 30, 2015
@binlijin binlijin reopened this Dec 30, 2015
@binlijin
Contributor Author

binlijin commented Jan 7, 2016

Related to #2138

@binlijin
Contributor Author

binlijin commented Jan 7, 2016

@xvrl can you take a look?

}

final Indexed<String> dimSet = getDimValueLookup(dimension);

// BitmapIndexSeeker is the main performance boost comes from.
// In the previous version of index merge, during the creation of invert index, we do something like
Member

can we update the comments to explain how BitmapIndexHolder works?

@binlijin
Contributor Author

ping @xvrl

return new EmptyBitmapIndexSeeker();
if (dictId >= 0) {
  final Indexed<String> dimValues = getDimValueLookup(dimension);
  String value = Strings.nullToEmpty(dimValues.get(dictId));
Member

Is there a reason to call nullToEmpty here? This seems like it might be an artifact of wrapping DimDim with NullValueConverterDimDim; however, getBitmapIndex relies on the actual values stored in DimDim, not the values returned by the wrapper. Is that correct?

Contributor Author

Yes, you are right. NullValueConverterDimDim converts empty to null, so we need to convert it back to the actual value, because getBitmapIndex relies on the actual values stored in DimDim.
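
The round trip discussed in this thread can be sketched as follows. This is a simplified illustration under stated assumptions, not Druid code: the `emptyToNull` helper imitates the wrapper behaviour described above, the local `nullToEmpty` helper stands in for Guava's `Strings.nullToEmpty`, and the map is a hypothetical stand-in for a lookup keyed by the stored values.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only; all names here are hypothetical.
final class NullRoundTripSketch {
    // Stand-in for Guava's Strings.nullToEmpty.
    static String nullToEmpty(String s) {
        return s == null ? "" : s;
    }

    // Assumed wrapper behaviour: the stored empty string is exposed as null.
    static String emptyToNull(String s) {
        return (s == null || s.isEmpty()) ? null : s;
    }

    // Looks up a bitmap slot by the value as it is actually stored.
    static Integer lookup(Map<String, Integer> slotByStoredValue, String valueFromWrapper) {
        return slotByStoredValue.get(nullToEmpty(valueFromWrapper));
    }

    public static void main(String[] args) {
        Map<String, Integer> slotByStoredValue = new HashMap<>();
        slotByStoredValue.put("", 0);   // the empty value is stored as ""
        slotByStoredValue.put("a", 1);

        String seenThroughWrapper = emptyToNull("");
        // Without the conversion the lookup misses; with it, the stored entry is found.
        System.out.println(slotByStoredValue.get(seenThroughWrapper));
        System.out.println(lookup(slotByStoredValue, seenThroughWrapper));
    }
}
```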

Member

I think this warrants a comment, given that it took me a while to track down the reason for this.

Contributor Author

Yes, done. I have added a comment for this.

@binlijin binlijin closed this Jan 18, 2016
@binlijin
Contributor Author

@xvrl rebase

@binlijin binlijin reopened this Jan 18, 2016
@binlijin binlijin closed this Jan 20, 2016
@binlijin
Contributor Author

rebase

@binlijin binlijin reopened this Jan 20, 2016
@fjy fjy added this to the 0.9.0 milestone Jan 20, 2016
@fjy
Contributor

fjy commented Jan 21, 2016

@himanshug can you take a look to help finish this off?

if (dictId >= 0) {
  return new BitmapCompressedIndexedInts(bitmaps.getBitmap(dictId));
} else {
  return new EmptyIndexedInts();
Contributor

I believe EmptyIndexedInts should be a singleton. It already has a static final instance; can you use that? Also, make the no-arg constructor in that class private.
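
The shape being asked for here is the usual Java singleton: one shared immutable instance and a private constructor so no other instance can be created. A minimal sketch, using the hypothetical class `EmptyIntsSketch` rather than Druid's actual `EmptyIndexedInts`:

```java
// Hypothetical stand-in for an "empty" result type; not Druid's class.
final class EmptyIntsSketch {
    // The single shared instance; safe to share because the object holds no state.
    static final EmptyIntsSketch INSTANCE = new EmptyIntsSketch();

    // Private constructor prevents callers from allocating further instances.
    private EmptyIntsSketch() {
    }

    int size() {
        return 0;
    }

    public static void main(String[] args) {
        // Callers reuse INSTANCE instead of writing "new EmptyIntsSketch()".
        EmptyIntsSketch empty = EmptyIntsSketch.INSTANCE;
        System.out.println(empty.size());
    }
}
```

Returning the shared instance from the else branch avoids one allocation per missing dictionary ID, which can add up for high-cardinality dimensions.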

@himanshug
Contributor

@binlijin can you update the PR description with a summary of why this change improves performance? It will be helpful to anyone looking at the PR.

public static class DictIdSeeker
{
  static final int NOT_EXIST = -1;
  static final int NOT_INIT = -1;
Contributor

Can you make both static variables private as well?

Contributor

OK, I see that they are used in other places.

@binlijin
Contributor Author

@himanshug, done, I have updated the PR description.

@binlijin binlijin closed this Jan 21, 2016
@binlijin
Contributor Author

rebase

@binlijin binlijin reopened this Jan 21, 2016
Assert.assertEquals(1, dictIdSeeker.seek(2));
try {
  dictIdSeeker.seek(1);
}
Contributor

You should add an Assert.fail(..) here; otherwise the test verifies nothing in the case where the exception is not thrown.
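
The pattern behind this comment, sketched without a JUnit dependency so that it runs on its own: the failure is raised immediately after the call that is expected to throw, so a silent return fails the test. Here `throw new AssertionError(..)` plays the role of `Assert.fail(..)`, and the helper name `assertThrowsIllegalState` is hypothetical.

```java
// Plain-Java illustration of the "fail if no exception" test pattern.
final class ExpectExceptionSketch {
    // Passes only if the action throws IllegalStateException.
    static void assertThrowsIllegalState(Runnable action) {
        try {
            action.run();
            // Reached only when nothing was thrown; stands in for Assert.fail(..).
            throw new AssertionError("expected IllegalStateException was not thrown");
        } catch (IllegalStateException expected) {
            // Expected path. AssertionError is an Error, so it is not swallowed here.
        }
    }

    public static void main(String[] args) {
        // Passes: the action throws the expected exception.
        assertThrowsIllegalState(() -> {
            throw new IllegalStateException("out of order");
        });
        System.out.println("expected exception observed");
    }
}
```

Without the failure line, a try block like the one in the snippet above passes whether or not the call throws.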

Contributor Author

Good catch, I will fix it.

@himanshug
Contributor

👍 after #2111 (comment) is resolved.

@fjy
Contributor

fjy commented Jan 21, 2016

I'm still 👍

@binlijin feel free to merge this after you address @himanshug's comment

@binlijin binlijin closed this Jan 22, 2016
@binlijin
Contributor Author

rebase and fix test

@binlijin binlijin reopened this Jan 22, 2016
binlijin added a commit that referenced this pull request Jan 22, 2016
@binlijin binlijin merged commit 1d1f4d9 into apache:master Jan 22, 2016
@fjy fjy mentioned this pull request Feb 5, 2016
@binlijin binlijin deleted the optimize-create-inverted-indexes branch February 18, 2016 07:24