Adding range-based pruning to bloom index #232
Conversation
vinothchandar commented on Jul 31, 2017
- Keys are compared lexicographically using String::compareTo (see the sketch after this list)
- Range metadata is additionally written into parquet file footers
- Trimmed fat & a few optimizations to speed up indexing
- Added a param to control whether input should be cached, to speed up lookup
- Added a param to turn range pruning on/off
- Auto-computed parallelism now simply factors in the number of comparisons done
- More accurate parallelism computation when range pruning is on
- Tests added & hardened, docs updated
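To make the first bullet concrete, here is a minimal sketch of how the range check could gate pruning, assuming a BloomIndexFileInfo shaped like the one in this diff; isKeyInRange and the exact constructors are illustrative, not necessarily the PR's actual code:

```java
import java.io.Serializable;

// Sketch only: field names follow the diff below; the rest is assumed.
public class BloomIndexFileInfo implements Serializable {

    private final String fileName;
    private final String minRecordKey; // null when the footer has no range metadata
    private final String maxRecordKey;

    public BloomIndexFileInfo(String fileName, String minRecordKey, String maxRecordKey) {
        this.fileName = fileName;
        this.minRecordKey = minRecordKey;
        this.maxRecordKey = maxRecordKey;
    }

    // For files written before this change, i.e. without range info.
    public BloomIndexFileInfo(String fileName) {
        this(fileName, null, null);
    }

    // Lexicographic containment via String::compareTo; a file without
    // range metadata can never be pruned, so it always returns true.
    public boolean isKeyInRange(String recordKey) {
        if (minRecordKey == null || maxRecordKey == null) {
            return true;
        }
        return minRecordKey.compareTo(recordKey) <= 0
                && maxRecordKey.compareTo(recordKey) >= 0;
    }
}
```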
@ovj @alunarbeach Please test it out if you can. I'm planning to merge this in the next few days.
@vinothchandar Will try this change today.
@vinothchandar Will update you tomorrow.
private final String fileName;

private final String minRecordKey;
Should the min and max record keys be Optional, in case no pruning is used?
They are null, set via a different constructor. Optional's original intent is for use in return types.
Okay, I see no harm in using Optionals for member variables too. This makes sure anybody else using these variables has to account for the possibility of their absence (instead of running into a null pointer later). Thoughts?
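For illustration, a sketch of the Optional-based alternative under discussion (class and method shapes assumed, as above):

```java
import java.util.Optional;

// Sketch: Optional members force every caller to handle the
// "no range metadata" case explicitly at the call site.
public class BloomIndexFileInfo {

    private final String fileName;
    private final Optional<String> minRecordKey;
    private final Optional<String> maxRecordKey;

    public BloomIndexFileInfo(String fileName, String minRecordKey, String maxRecordKey) {
        this.fileName = fileName;
        this.minRecordKey = Optional.ofNullable(minRecordKey);
        this.maxRecordKey = Optional.ofNullable(maxRecordKey);
    }

    public boolean isKeyInRange(String recordKey) {
        // orElse(true): a file without range metadata is never pruned
        return minRecordKey.map(min -> min.compareTo(recordKey) <= 0).orElse(true)
                && maxRecordKey.map(max -> max.compareTo(recordKey) >= 0).orElse(true);
    }
}
```

One trade-off worth noting: java.util.Optional is itself not Serializable, which matters for objects shipped around in Spark jobs.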
  return filteredFiles.iterator();
}).collect();

if (config.getBloomIndexPruneByRanges()) {
Is there a way to avoid multiple if-elses? Using some sort of Handler is probably overkill, but we could read the min/max keys in both paths and mark minRecordKey and maxRecordKey as optional when they are not present (since we will be reading the footer in both cases anyway). Just a thought.
The first and second blocks need to be separated anyway, so not a lot of reuse is gained by doing so.
yeah, agree.
This looks simple and effective. Excited about this change.
@@ -32,7 +33,7 @@
  *
  * Also contains logic to roll-over the log file
  */
-public class HoodieLogFile {
+public class HoodieLogFile implements Serializable {
   public static final String DELTA_EXTENSION = ".log";

   private final Path path;
Path is not serializable AFAIK, so HoodieLogFile will fail on serialization.
As long as you use Kryo, it serializes. See the changes made for the tests.
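For reference, a minimal sketch of enabling Kryo in a Spark test, along the lines of what the test changes presumably do; the app name and RDD contents are made up, and the config key is standard Spark:

```java
import java.util.Arrays;

import org.apache.hadoop.fs.Path;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class KryoPathExample {

    public static void main(String[] args) {
        // Kryo does not require java.io.Serializable, so objects holding a
        // Hadoop Path can still be shipped around by Spark.
        SparkConf conf = new SparkConf()
                .setAppName("hoodie-kryo-example")
                .setMaster("local[2]")
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        jsc.parallelize(Arrays.asList(new Path("/tmp/sample.log")))
                .map(Path::toString)
                .collect()
                .forEach(System.out::println);

        jsc.stop();
    }
}
```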
try {
    String[] minMaxKeys = ParquetUtils.readMinMaxRecordKeys(ft._2().getFileStatus().getPath());
    return new Tuple2<>(ft._1(), new BloomIndexFileInfo(ft._2().getFileName(), minMaxKeys[0], minMaxKeys[1]));
} catch (MetadataNotFoundException me) {
We should probably have a tool that rewrites current parquet files and adds min/max keys. Can we have an issue to track this tool?
The current design is backwards compatible and will work on files without any range information as well. Such a migration tool would amount to a full read and bulk insert, so I'd like to leave that as-is if possible. Thoughts?
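To make the compatibility argument concrete, here is a sketch of the fallback path, using ParquetUtils.readMinMaxRecordKeys and MetadataNotFoundException from the diff above (the wrapper method and variable names are assumptions):

```java
// Sketch: read min/max keys when the footer has them, and degrade to a
// range-less entry (which is never pruned) for files written earlier.
BloomIndexFileInfo loadFileInfo(String fileName, org.apache.hadoop.fs.Path path) {
    try {
        String[] minMaxKeys = ParquetUtils.readMinMaxRecordKeys(path);
        return new BloomIndexFileInfo(fileName, minMaxKeys[0], minMaxKeys[1]);
    } catch (MetadataNotFoundException me) {
        // Pre-existing parquet file without range metadata: it skips
        // range pruning and always goes through the bloom filter check.
        return new BloomIndexFileInfo(fileName);
    }
}
```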
}

public Builder bloomIndexUseCaching(boolean useCaching) {
    props.setProperty(BLOOM_INDEX_PRUNE_BY_RANGES_PROP, String.valueOf(useCaching));
@vinothchandar It should be BLOOM_INDEX_USE_CACHING_PROP :)
Geez, ok. Will change. Good catch.
No issues.
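For clarity, the agreed fix would presumably look like the following; returning this follows the usual builder convention and is an assumption here:

```java
public Builder bloomIndexUseCaching(boolean useCaching) {
    // Fixed: previously set BLOOM_INDEX_PRUNE_BY_RANGES_PROP by mistake
    props.setProperty(BLOOM_INDEX_USE_CACHING_PROP, String.valueOf(useCaching));
    return this;
}
```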
Force-pushed from 116ba29 to d7542e6
Thanks @vinothchandar for the change. Ran the test below with and without range pruning and saw a large improvement for the "filterExists" query. Here is a rough idea of the dataset, which was purposely created to test it.
@ovj thanks for testing it out, looks like a win. @alunarbeach can you please share your experience as well? It would be helpful. @prazanna any other concerns before I merge this?
Ran the below test with and without range pruning. Partition fields: DayId, HourId. With range pruning, this is a huge win for the overall run time.
@alunarbeach this is awesome to know.