Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplifying dimension merging #2094

Merged
merged 1 commit into from
Jan 29, 2016
Merged

Conversation

navis
Copy link
Contributor

@navis navis commented Dec 15, 2015

Currently, dimension merging is processed by two stages. One for dictionary, one for index. If this can be processed by single stage, total processing time could be deceased.

@navis
Copy link
Contributor Author

navis commented Dec 16, 2015

Reduced dim conversion time from 11.6 sec to 7.1 sec, with 12 index with 500K rows each.

@@ -549,10 +553,10 @@ private File makeIndexFiles(
IOPeon ioPeon = new TmpFileIOPeon();
ArrayList<FileOutputSupplier> dimOuts = Lists.newArrayListWithCapacity(mergedDimensions.size());
Map<String, Integer> dimensionCardinalities = Maps.newHashMap();
ArrayList<Map<String, IntBuffer>> dimConversions = Lists.newArrayListWithCapacity(indexes.size());
ArrayList<Map<String, int[]>> dimConversions = Lists.newArrayListWithCapacity(indexes.size());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the performance improvement when use IntBuffer instead of int[]?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think it will affect performace (I'll check that tomorrow). Is IntBuffer is better than int[]? Less intuitive for me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just do not know why use IntBuffer,and other people can tell why.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

previously the IntBuffer was allocated off-heap, with this change it will be on-heap. Whether this is better or worse can be debated. Currently we rely on garbage collection to hopefully cleanup our buffers before we run out of memory, but allocating it on-heap may create lots of heap pressure if those arrays are big and long-lived. It might be worth some benchmarks to get a sense of how things will be affected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With direct IntBuffer, it took 8 seconds, not making any notable differences in performance. I prefer int[] but I'm ok with direct buffer. Opinion?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heap pressure is already really bad during the making of index files, and making sure we know how this change impacts heap pressure during that time is important. During the merge and persist phase of realtime tasks, we already have very high CPU usage, enough to where you have to be aware how it impacts query performance. Adding in more heap pressure during that phase should be done with great care.

There are also issues with heap size during the reduce portion of hadoop tasks (or spark batch tasks). So I'm curious if adding more objects (int[]) messes the limit of the number of rows per segment (or if it impacts high cardinality dimensions).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted to direct IntBuffer.

@binlijin
Copy link
Contributor

👍

@navis navis force-pushed the simplify-index-merge branch 2 times, most recently from 98e5a91 to 82bab15 Compare December 17, 2015 02:59
for (String dimension : mergedDimensions) {
final GenericIndexedWriter<String> writer = new GenericIndexedWriter<String>(
ioPeon, dimension, GenericIndexed.STRING_STRATEGY
);
writer.open();

List<Indexed<String>> dimValueLookups = Lists.newArrayListWithCapacity(indexes.size());
DimValueConverter[] converters = new DimValueConverter[indexes.size()];
int counter = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we have a more descriptive name than counter?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

numDimensions or something

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed to numMergeIndex

@fjy
Copy link
Contributor

fjy commented Dec 29, 2015

👍 This looks good to me, but I think someone else familiar with this code needs to do a review

@binlijin
Copy link
Contributor

binlijin commented Jan 6, 2016

+1

@binlijin
Copy link
Contributor

binlijin commented Jan 7, 2016

can you squash the commits ?
If no one have any question i will merge.

@navis
Copy link
Contributor Author

navis commented Jan 7, 2016

@binlijin squashed. thanks.

@fjy
Copy link
Contributor

fjy commented Jan 7, 2016

@binlijin do you mind holding off on merging? I just want 1 more pair of eyes to review this

@xvrl can you take a look?

@binlijin
Copy link
Contributor

binlijin commented Jan 7, 2016

@fjy ok

@navis navis force-pushed the simplify-index-merge branch 3 times, most recently from 182a0e1 to c1b0f06 Compare January 18, 2016 02:51
@navis
Copy link
Contributor Author

navis commented Jan 19, 2016

@binlijin It's becoming more and more painful to rebase. Do you really have a time to look into this?

@binlijin
Copy link
Contributor

@navis, i am ok with the PR,but fj tell need @xrvl to take a look.

1 similar comment
@binlijin
Copy link
Contributor

@navis, i am ok with the PR,but fj tell need @xrvl to take a look.

@fjy
Copy link
Contributor

fjy commented Jan 20, 2016

@navis yes, there's multiple teams working on the same code, which is why we need proposals to coordinate, and all the changes are important

@fjy fjy added this to the 0.9.0 milestone Jan 20, 2016
@@ -556,142 +561,79 @@ protected File makeIndexFiles(
dimConversions.add(Maps.<String, IntBuffer>newHashMap());
}

final int TIME_WRITE = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can these be static? also can we define them at the top of the file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove timer. just used to check the performance.

}
}

private static class Timer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not use Guava's Stopwatch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed this part

@jon-wei
Copy link
Contributor

jon-wei commented Jan 22, 2016

@navis

Looks good to me, I don't have any additional comments beyond what @fjy has already noted.

Can you rebase? I'll do another review pass on this after.

@navis
Copy link
Contributor Author

navis commented Jan 24, 2016

Rebased on master, barely. I'll address comments.

}
}

/**
* Get old dictId from new dictId, and only support access in order
*/
public static class DictIdSeeker
static class WithConversion implements IndexSeeker
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you rename "WithConversion", e.g. ConvertingIndexSeeker or IndexSeekerWithConversion? Currently it sounds a bit like a boolean parameter and it's not immediately clear that it's a seeker

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@jon-wei
Copy link
Contributor

jon-wei commented Jan 26, 2016

👍 looks good to me after addressing the Seeker renaming comments

@fjy
Copy link
Contributor

fjy commented Jan 27, 2016

@himanshug @xvrl do you want to take a look? I think this is getting close to ready and will merge unless there's more comments

}

Iterable<Rowboat> theRows = rowMergerFn.apply(boats);
Iterable<Rowboat> theRows = makeRowIterable(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assuming above code is as is moved into method makeRowIterable(..)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, made a method to be used in V9Merger.

@himanshug
Copy link
Contributor

👍 pls squash.

@navis
Copy link
Contributor Author

navis commented Jan 29, 2016

squashed

himanshug added a commit that referenced this pull request Jan 29, 2016
@himanshug himanshug merged commit 93c50d8 into apache:master Jan 29, 2016
@fjy fjy mentioned this pull request Feb 5, 2016
@navis navis deleted the simplify-index-merge branch February 13, 2016 03:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants