Adding range based pruning to bloom index #232

Merged
merged 1 commit into apache:master on Aug 4, 2017

Conversation

vinothchandar
Member

  • Keys compared lexicographically using String::compareTo (see the sketch after this list)
  • Range metadata additionally written into parquet file footers
  • Trimmed fat & a few optimizations to speed up indexing
  • Added a param to control whether input should be cached, to speed up lookup
  • Added a param to turn range pruning on/off
  • Auto-computed parallelism now simply factors in the number of comparisons done
  • More accurate parallelism computation when range pruning is on
  • Tests added & hardened, docs updated
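
To make the first bullet concrete, here is a minimal sketch of the pruning check (the class and method names are illustrative, not the actual Hudi code): a file can be skipped for a record key if the key falls outside the file's [minRecordKey, maxRecordKey] range, compared lexicographically with String::compareTo.

// Illustrative sketch only; names here do not match the actual Hudi classes.
import java.util.ArrayList;
import java.util.List;

class FileKeyRange {
  final String fileName;
  final String minRecordKey; // null when the footer carries no range metadata
  final String maxRecordKey;

  FileKeyRange(String fileName, String minRecordKey, String maxRecordKey) {
    this.fileName = fileName;
    this.minRecordKey = minRecordKey;
    this.maxRecordKey = maxRecordKey;
  }

  // True if the key could be in this file, i.e. the file cannot be pruned.
  boolean mayContain(String recordKey) {
    if (minRecordKey == null || maxRecordKey == null) {
      return true; // no range info, fall back to the bloom filter check
    }
    return recordKey.compareTo(minRecordKey) >= 0
        && recordKey.compareTo(maxRecordKey) <= 0;
  }
}

class RangePruningSketch {
  // Keep only the files whose key range could contain the record key.
  static List<FileKeyRange> candidateFiles(String recordKey, List<FileKeyRange> files) {
    List<FileKeyRange> candidates = new ArrayList<>();
    for (FileKeyRange f : files) {
      if (f.mayContain(recordKey)) {
        candidates.add(f);
      }
    }
    return candidates;
  }
}

Only the candidate files then need their bloom filters consulted, which is where the reported speedups come from.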

@vinothchandar
Member Author

@ovj @alunarbeach Please test it out once if you can. Planning to merge this in the next few days

@ovj
Contributor

ovj commented Jul 31, 2017

@vinothchandar will try this change today.

@alunarbeach
Contributor

@vinothchandar will update you tomorrow.


private final String fileName;

private final String minRecordKey;
Contributor

Should minRecordKey and maxRecordKey be Optional, in case no pruning is used?

Member Author

They are null in that case, set via a different constructor. Optional's original intention is to be used for return types.

Contributor

Okay. I see no harm in using Optionals for member variables too. This makes sure anybody else using these variables has to account for the possibility of their absence (instead of running into a null pointer later). Thoughts?
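
For readers following along, the two alternatives being discussed look roughly like this (a sketch only; the field names mirror the diff, while the class name and accessors are hypothetical):

import java.util.Optional;

// Sketch of the nullable-field approach (what the PR does) alongside the
// Optional-returning accessor the reviewer suggests; not the actual class.
class FileInfoSketch {
  private final String fileName;
  private final String minRecordKey; // null when pruning is off or metadata is missing
  private final String maxRecordKey;

  // Constructor used when no range metadata is available.
  FileInfoSketch(String fileName) {
    this(fileName, null, null);
  }

  FileInfoSketch(String fileName, String minRecordKey, String maxRecordKey) {
    this.fileName = fileName;
    this.minRecordKey = minRecordKey;
    this.maxRecordKey = maxRecordKey;
  }

  boolean hasKeyRanges() {
    return minRecordKey != null && maxRecordKey != null;
  }

  // Optional-returning accessor: callers must handle absence explicitly.
  Optional<String> getMinRecordKey() {
    return Optional.ofNullable(minRecordKey);
  }
}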

return filteredFiles.iterator();
}).collect();

if (config.getBloomIndexPruneByRanges()) {
Contributor

Is there a way to avoid multiple if-elses? I guess using some sort of handler is overkill, but maybe we could read the min/max keys in both paths and mark minRecordKey and maxRecordKey as optional if they are not present (since we will be reading the footer in both cases anyway)? Just a thought.

Member Author

The first and second blocks need to be separated anyway, so there's not a lot of reuse to be gained by doing so.

Contributor

yeah, agree.

Contributor

@prazanna prazanna left a comment

This looks simple and effective. Excited about this change.

@@ -32,7 +33,7 @@
*
* Also contains logic to roll-over the log file
*/
-public class HoodieLogFile {
+public class HoodieLogFile implements Serializable {
public static final String DELTA_EXTENSION = ".log";

private final Path path;
Contributor

Path is not serializable afaik, so HoodieLogFile will fail on serialization.

Member Author

As long as you use Kryo, it serializes. See the changes made for the tests.
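
For anyone wanting to reproduce this locally, the relevant bit is just the Spark serializer setting (a sketch assuming a plain Spark setup, not taken from the PR's test changes):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class KryoSerializationCheck {
  public static void main(String[] args) {
    // Kryo serializes objects by reflecting over fields, so classes that do not
    // implement java.io.Serializable (like org.apache.hadoop.fs.Path) still round-trip.
    SparkConf conf = new SparkConf()
        .setAppName("kryo-serialization-check")
        .setMaster("local[2]")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
    try (JavaSparkContext jsc = new JavaSparkContext(conf)) {
      // Build RDDs containing HoodieLogFile instances here and force a shuffle
      // to exercise serialization.
    }
  }
}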

try {
String[] minMaxKeys = ParquetUtils.readMinMaxRecordKeys(ft._2().getFileStatus().getPath());
return new Tuple2<>(ft._1(), new BloomIndexFileInfo(ft._2().getFileName(), minMaxKeys[0], minMaxKeys[1]));
} catch (MetadataNotFoundException me) {
Contributor

We should probably have a tool which rewrites current parquet files and adds the min/max keys. Can we have an issue to track this tool?

Member Author

The current design is backwards compatible and will work on files without any range information as well. Such a migration tool would amount to a full read and bulk insert, so I'd like to leave that as it is if possible. Thoughts?
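
The backwards compatibility being described is visible in the diff above: if the footer has no range metadata, the lookup falls back to a file info without ranges, so only the bloom filter is consulted. A minimal sketch, reusing the hypothetical FileKeyRange class from the earlier sketch (readRangeFromFooter and NoRangeMetadataException are stand-ins for ParquetUtils.readMinMaxRecordKeys and MetadataNotFoundException):

// Hypothetical stand-ins to illustrate the fallback for files written before this change.
class NoRangeMetadataException extends Exception {
}

class FooterRangeSketch {
  // Stand-in for reading [min, max] record keys from the parquet footer;
  // throws when the footer predates range metadata.
  static String[] readRangeFromFooter(String filePath) throws NoRangeMetadataException {
    throw new NoRangeMetadataException();
  }

  static FileKeyRange toFileInfo(String fileName, String filePath) {
    try {
      String[] minMax = readRangeFromFooter(filePath);
      return new FileKeyRange(fileName, minMax[0], minMax[1]);
    } catch (NoRangeMetadataException e) {
      // Old file without range metadata: no pruning, the bloom filter check still applies.
      return new FileKeyRange(fileName, null, null);
    }
  }
}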

}

public Builder bloomIndexUseCaching(boolean useCaching) {
props.setProperty(BLOOM_INDEX_PRUNE_BY_RANGES_PROP, String.valueOf(useCaching));
Contributor

@vinothchandar It should be BLOOM_INDEX_USE_CACHING_PROP :)

Member Author

geez ok. will change. good catch

Contributor

No Issues.
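
Presumably the fix acknowledged above ends up looking something like this (a sketch only; the second method name and the return statements are assumptions based on the usual builder pattern):

// Each builder setter should write to its own property.
public Builder bloomIndexUseCaching(boolean useCaching) {
  props.setProperty(BLOOM_INDEX_USE_CACHING_PROP, String.valueOf(useCaching));
  return this;
}

public Builder bloomIndexPruneByRanges(boolean pruneRanges) {
  props.setProperty(BLOOM_INDEX_PRUNE_BY_RANGES_PROP, String.valueOf(pruneRanges));
  return this;
}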

@ovj
Contributor

ovj commented Aug 4, 2017

Thanks @vinothchandar for the change.

Ran the test below with and without range pruning. Saw a large improvement for the "filterExists" query. A rough idea of the dataset, which was purposefully created to test this:

  • 1000 files with approximately 500M each.
  • Each file has ~2M records. (200M records were inserted with a parallelism of 100 for bulkInsert to create the dataset.)
  • Ran 2 workloads, one with range pruning and one without it. With the change it took ~15-30 min to do filterExists vs 2.5 hrs+ without it (existing code). Didn't notice any stage taking too much time or having very high parallelism.

@vinothchandar
Member Author

@ovj thanks for testing it out, looks like a win.

@alunarbeach can you please share your experience as well? It would be helpful.

@prazanna any other concerns before I merge this?

@prazanna prazanna merged commit 8620964 into apache:master Aug 4, 2017
@alunarbeach
Contributor

Ran the below test with and without range pruning.
Did a bulk insert of a dataset of ~474M records (~110 GB Snappy Parquet). The key is a combination of date + uuid. Then did an upsert of 57M records over the 474M records.

Partition fields - DayId, HourId

Without range pruning:
BULK_INSERT time - 1 hr 33 min
Upsert time - 1 hr 29 min

With range pruning:
BULK_INSERT time - 1 hr 32 min
Upsert time - 1 hr

This is a huge win for the overall run time.
Thanks @vinothchandar

@prazanna
Contributor

prazanna commented Aug 4, 2017

@alunarbeach this is awesome to know
