Simplifying dimension merging #2094
80a3377
to
72d4987
Compare
Reduced dim conversion time from 11.6 sec to 7.1 sec when merging 12 indexes of 500K rows each.
@@ -549,10 +553,10 @@ private File makeIndexFiles(
 IOPeon ioPeon = new TmpFileIOPeon();
 ArrayList<FileOutputSupplier> dimOuts = Lists.newArrayListWithCapacity(mergedDimensions.size());
 Map<String, Integer> dimensionCardinalities = Maps.newHashMap();
-ArrayList<Map<String, IntBuffer>> dimConversions = Lists.newArrayListWithCapacity(indexes.size());
+ArrayList<Map<String, int[]>> dimConversions = Lists.newArrayListWithCapacity(indexes.size());
What about the performance when using IntBuffer instead of int[]?
I don't think it will affect performance (I'll check that tomorrow). Is IntBuffer better than int[]? It's less intuitive to me.
I just don't know why IntBuffer was used; maybe someone else can explain.
Previously the IntBuffer was allocated off-heap; with this change it will be on-heap. Whether this is better or worse can be debated. Currently we rely on garbage collection to hopefully clean up our buffers before we run out of memory, but allocating them on-heap may create lots of heap pressure if those arrays are big and long-lived. It might be worth running some benchmarks to get a sense of how things will be affected.
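To make the on-heap versus off-heap distinction concrete, here is a minimal sketch using only `java.nio`. The class and method names (`ConversionBufferAllocation`, `onHeap`, `offHeap`) are hypothetical and are not taken from this PR; the sketch only illustrates the two allocation styles being discussed.

```java
import java.nio.ByteBuffer;
import java.nio.IntBuffer;

// Hypothetical helper, not code from this PR.
class ConversionBufferAllocation {
    // On-heap: backed by an int[] that lives in the Java heap and adds to heap pressure.
    static IntBuffer onHeap(int cardinality) {
        return IntBuffer.wrap(new int[cardinality]);
    }

    // Off-heap: an int view over a direct ByteBuffer. The memory sits outside the Java heap
    // and is only released after the buffer object itself has been garbage collected.
    static IntBuffer offHeap(int cardinality) {
        return ByteBuffer.allocateDirect(cardinality * Integer.BYTES).asIntBuffer();
    }

    public static void main(String[] args) {
        IntBuffer heap = onHeap(4);
        IntBuffer direct = offHeap(4);
        for (int i = 0; i < 4; i++) {
            // Absolute puts, so both buffers can be filled the same way.
            heap.put(i, i * 10);
            direct.put(i, i * 10);
        }
        System.out.println("heap:   isDirect=" + heap.isDirect() + ", hasArray=" + heap.hasArray());
        System.out.println("direct: isDirect=" + direct.isDirect() + ", hasArray=" + direct.hasArray());
    }
}
```

Both variants expose the same `IntBuffer` API, so the calling code does not need to change. What differs is where the memory lives and when it is reclaimed.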
With a direct IntBuffer it took 8 seconds, so there is no notable difference in performance. I prefer int[], but I'm OK with a direct buffer. Opinions?
Heap pressure is already really bad during the making of index files, and it is important that we know how this change affects heap pressure during that time. During the merge and persist phase of realtime tasks, CPU usage is already very high, enough that you have to be aware of how it impacts query performance. Adding more heap pressure during that phase should be done with great care.
There are also issues with heap size during the reduce portion of Hadoop tasks (or Spark batch tasks). So I'm curious whether adding more objects (int[]) affects the limit on the number of rows per segment, or whether it impacts high-cardinality dimensions.
Reverted to direct IntBuffer.
👍
98e5a91
to
82bab15
Compare
for (String dimension : mergedDimensions) {
  final GenericIndexedWriter<String> writer = new GenericIndexedWriter<String>(
      ioPeon, dimension, GenericIndexed.STRING_STRATEGY
  );
  writer.open();

  List<Indexed<String>> dimValueLookups = Lists.newArrayListWithCapacity(indexes.size());
  DimValueConverter[] converters = new DimValueConverter[indexes.size()];
  int counter = 0;
Can we have a more descriptive name than counter?
numDimensions or something
renamed to numMergeIndex
82bab15
to
a278fe5
Compare
👍 This looks good to me, but I think someone else familiar with this code needs to do a review.
+1
Can you squash the commits?
a278fe5
to
35bc224
Compare
@binlijin squashed. thanks.
@fjy ok
182a0e1
to
c1b0f06
Compare
@binlijin It's becoming more and more painful to rebase. Do you really have time to look into this?
@navis yes, there are multiple teams working on the same code, which is why we need proposals to coordinate, and all the changes are important.
@@ -556,142 +561,79 @@ protected File makeIndexFiles(
  dimConversions.add(Maps.<String, IntBuffer>newHashMap());
}

final int TIME_WRITE = 0;
Can these be static? Also, can we define them at the top of the file?
Removed the timer. It was only used to check the performance.
  }
}

private static class Timer
Why not use Guava's Stopwatch?
Removed this part.
c1b0f06
to
05fc7dc
Compare
Rebased on master, barely. I'll address comments.
   }
 }

 /**
  * Get old dictId from new dictId, and only support access in order
  */
-public static class DictIdSeeker
+static class WithConversion implements IndexSeeker
Can you rename "WithConversion", e.g. to ConvertingIndexSeeker or IndexSeekerWithConversion? Currently it sounds a bit like a boolean parameter, and it's not immediately clear that it's a seeker.
ok
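For readers unfamiliar with the seeker discussed above, the javadoc in the diff ("Get old dictId from new dictId, and only support access in order") can be illustrated with a small sketch. The class name `InOrderSeekerSketch`, the `NOT_EXIST` constant, and the buffer layout (position = old dictId, value = new dictId, values ascending) are assumptions made for this illustration and are not confirmed by the PR.

```java
import java.nio.IntBuffer;

// Hypothetical illustration, not the PR's DictIdSeeker / IndexSeeker implementation.
class InOrderSeekerSketch {
    static final int NOT_EXIST = -1;

    // Assumed layout: position = old dictId, value = new (merged) dictId, strictly ascending.
    private final IntBuffer newIds;
    private int cursor = 0;

    InOrderSeekerSketch(IntBuffer newIds) {
        this.newIds = newIds;
    }

    // Returns the old dictId that maps to newId, or NOT_EXIST if this index lacks the value.
    // Callers must pass non-decreasing newId values, because the cursor never moves backwards.
    int seek(int newId) {
        while (cursor < newIds.limit() && newIds.get(cursor) < newId) {
            cursor++;
        }
        if (cursor < newIds.limit() && newIds.get(cursor) == newId) {
            return cursor;
        }
        return NOT_EXIST;
    }

    public static void main(String[] args) {
        // Old ids 0, 1, 2 map to merged ids 1, 2, 3; merged id 0 is absent from this index.
        InOrderSeekerSketch seeker = new InOrderSeekerSketch(IntBuffer.wrap(new int[]{1, 2, 3}));
        for (int newId = 0; newId <= 3; newId++) {
            System.out.println(newId + " -> " + seeker.seek(newId));
        }
    }
}
```

Because the cursor only advances, a full pass over the merged ids costs time linear in the dictionary size, which is why such a seeker can only be used for in-order access.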
👍 Looks good to me after addressing the Seeker renaming comments.
d55fc5e
to
d74e526
Compare
@himanshug @xvrl do you want to take a look? I think this is getting close to ready and will merge unless there are more comments.
 }

-Iterable<Rowboat> theRows = rowMergerFn.apply(boats);
+Iterable<Rowboat> theRows = makeRowIterable(
I'm assuming the code above was moved as-is into the method makeRowIterable(..).
Yes, made a method to be used in V9Merger.
👍 pls squash.
d74e526
to
dd774ef
Compare
squashed
Simplifying dimension merging
Currently, dimension merging is processed in two stages: one for the dictionary and one for the index. If it can be processed in a single stage, total processing time could be decreased.
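As a rough illustration of the single-stage idea, the sketch below merges several sorted per-index dictionaries into one merged dictionary and fills the per-index conversion buffers (old dictId to new dictId) in the same pass. This is not the PR's code: the names `DictionaryMergeSketch`, `Merged`, and `merge` are hypothetical, and the sketch assumes each input dictionary is sorted, free of duplicates, and contains no null values.

```java
import java.nio.IntBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration of a one-pass dictionary merge, not the PR's implementation.
class DictionaryMergeSketch {
    // Merged dictionary plus, for each input index, a buffer mapping old dictId -> new dictId.
    static final class Merged {
        final List<String> dictionary = new ArrayList<>();
        final List<IntBuffer> conversions = new ArrayList<>();
    }

    static Merged merge(List<List<String>> sortedDictionaries) {
        Merged out = new Merged();
        // Heap entries are {indexNumber, positionInThatDictionary}, ordered by the value they point to.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            (a, b) -> sortedDictionaries.get(a[0]).get(a[1])
                .compareTo(sortedDictionaries.get(b[0]).get(b[1])));
        for (int i = 0; i < sortedDictionaries.size(); i++) {
            List<String> dict = sortedDictionaries.get(i);
            out.conversions.add(IntBuffer.allocate(dict.size()));
            if (!dict.isEmpty()) {
                heap.add(new int[]{i, 0});
            }
        }
        String last = null;
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            String value = sortedDictionaries.get(top[0]).get(top[1]);
            if (!value.equals(last)) {
                // First time this value is seen: it gets the next merged dictId.
                out.dictionary.add(value);
                last = value;
            }
            // Record the conversion for this index while the merged dictionary is being built.
            out.conversions.get(top[0]).put(top[1], out.dictionary.size() - 1);
            if (top[1] + 1 < sortedDictionaries.get(top[0]).size()) {
                heap.add(new int[]{top[0], top[1] + 1});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Merged m = merge(List.of(List.of("a", "c"), List.of("b", "c", "d")));
        System.out.println(m.dictionary);
        for (IntBuffer conversion : m.conversions) {
            System.out.println(java.util.Arrays.toString(conversion.array()));
        }
    }
}
```

The sketch uses on-heap buffers from `IntBuffer.allocate` for brevity; the review thread above settled on direct buffers for the real conversion maps.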