Adding range based pruning to bloom index #232

Merged
merged 1 commit into apache:master on Aug 4, 2017

Conversation

vinothchandar
Member

  • Keys compared lexicographically using String::compareTo (see the sketch after this list)
  • Range metadata additionally written into parquet file footers
  • Trimmed fat & a few optimizations to speed up indexing
  • Added a param to control whether input should be cached, to speed up lookup
  • Added a param to turn range pruning on/off
  • Auto-computed parallelism now simply factors in the number of comparisons done
  • More accurate parallelism computation when range pruning is on
  • Tests added & hardened, docs updated
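
To make the first bullet concrete, here is a minimal sketch of the pruning check (the class and method names are illustrative, not the actual Hudi code): a file can be skipped for a record key if the key falls outside the file's [minRecordKey, maxRecordKey] range, compared lexicographically with String::compareTo.

// Illustrative sketch only; names here do not match the actual Hudi classes.
import java.util.ArrayList;
import java.util.List;

class FileKeyRange {
  final String fileName;
  final String minRecordKey; // null when the footer carries no range metadata
  final String maxRecordKey;

  FileKeyRange(String fileName, String minRecordKey, String maxRecordKey) {
    this.fileName = fileName;
    this.minRecordKey = minRecordKey;
    this.maxRecordKey = maxRecordKey;
  }

  // True if the key could be in this file, i.e. the file cannot be pruned.
  boolean mayContain(String recordKey) {
    if (minRecordKey == null || maxRecordKey == null) {
      return true; // no range info, fall back to the bloom filter check
    }
    return recordKey.compareTo(minRecordKey) >= 0
        && recordKey.compareTo(maxRecordKey) <= 0;
  }
}

class RangePruningSketch {
  // Keep only the files whose key range could contain the record key.
  static List<FileKeyRange> candidateFiles(String recordKey, List<FileKeyRange> files) {
    List<FileKeyRange> candidates = new ArrayList<>();
    for (FileKeyRange f : files) {
      if (f.mayContain(recordKey)) {
        candidates.add(f);
      }
    }
    return candidates;
  }
}

Only the candidate files then need their bloom filters consulted, which is where the reported speedups come from.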

@vinothchandar
Member Author

@ovj @alunarbeach Please test it out once if you can. Planning to merge this in the next few days

@ovj
Contributor

ovj commented Jul 31, 2017

@vinothchandar will try this change today.

@alunarbeach
Contributor

@vinothchandar will update you tomorrow.


private final String fileName;

private final String minRecordKey;
Contributor

Should minRecordKey and maxRecordKey be Optional, in case no pruning is used?

Member Author

They are null in that case, set via a different constructor. Optional's original intention is to be used for return types.

Contributor

Okay. I see no harm in using Optionals for member variables too. This makes sure anybody else using these variables has to account for the possibility of their absence (instead of running into a null pointer later). Thoughts?
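
For readers following along, the two alternatives being discussed look roughly like this (a sketch only; the field names mirror the diff, while the class name and accessors are hypothetical):

import java.util.Optional;

// Sketch of the nullable-field approach (what the PR does) alongside the
// Optional-returning accessor the reviewer suggests; not the actual class.
class FileInfoSketch {
  private final String fileName;
  private final String minRecordKey; // null when pruning is off or metadata is missing
  private final String maxRecordKey;

  // Constructor used when no range metadata is available.
  FileInfoSketch(String fileName) {
    this(fileName, null, null);
  }

  FileInfoSketch(String fileName, String minRecordKey, String maxRecordKey) {
    this.fileName = fileName;
    this.minRecordKey = minRecordKey;
    this.maxRecordKey = maxRecordKey;
  }

  boolean hasKeyRanges() {
    return minRecordKey != null && maxRecordKey != null;
  }

  // Optional-returning accessor: callers must handle absence explicitly.
  Optional<String> getMinRecordKey() {
    return Optional.ofNullable(minRecordKey);
  }
}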

return filteredFiles.iterator();
}).collect();

if (config.getBloomIndexPruneByRanges()) {
Contributor

Is there a way to avoid multiple if-elses? I guess using some sort of handler is overkill, but maybe we could read the min/max keys in both paths and mark minRecordKey and maxRecordKey as optional if they are not present (since we will be reading the footer in both cases anyway)? Just a thought.

Member Author

The first and second blocks need to be separated anyway, so there's not a lot of reuse to be gained by doing so.

Contributor

yeah, agree.

Contributor

@prazanna prazanna left a comment

This looks simple and effective. Excited about this change.

@@ -32,7 +33,7 @@
*
* Also contains logic to roll-over the log file
*/
-public class HoodieLogFile {
+public class HoodieLogFile implements Serializable {
public static final String DELTA_EXTENSION = ".log";

private final Path path;
Contributor

Path is not serializable afaik, so HoodieLogFile will fail on serialization.

Member Author

As long as you use Kryo, it serializes. See the changes made for the tests.
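
For anyone wanting to reproduce this locally, the relevant bit is just the Spark serializer setting (a sketch assuming a plain Spark setup, not taken from the PR's test changes):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class KryoSerializationCheck {
  public static void main(String[] args) {
    // Kryo serializes objects by reflecting over fields, so classes that do not
    // implement java.io.Serializable (like org.apache.hadoop.fs.Path) still round-trip.
    SparkConf conf = new SparkConf()
        .setAppName("kryo-serialization-check")
        .setMaster("local[2]")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
    try (JavaSparkContext jsc = new JavaSparkContext(conf)) {
      // Build RDDs containing HoodieLogFile instances here and force a shuffle
      // to exercise serialization.
    }
  }
}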

try {
String[] minMaxKeys = ParquetUtils.readMinMaxRecordKeys(ft._2().getFileStatus().getPath());
return new Tuple2<>(ft._1(), new BloomIndexFileInfo(ft._2().getFileName(), minMaxKeys[0], minMaxKeys[1]));
} catch (MetadataNotFoundException me) {
Contributor

We should probably have a tool which rewrites current parquet files and adds the min/max keys. Can we have an issue to track this tool?

Member Author

The current design is backwards compatible and will work on files without any range information as well. Such a migration tool would amount to a full read and bulk insert, so I'd like to leave that as it is if possible. Thoughts?
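
The backwards compatibility being described is visible in the diff above: if the footer has no range metadata, the lookup falls back to a file info without ranges, so only the bloom filter is consulted. A minimal sketch, reusing the hypothetical FileKeyRange class from the earlier sketch (readRangeFromFooter and NoRangeMetadataException are stand-ins for ParquetUtils.readMinMaxRecordKeys and MetadataNotFoundException):

// Hypothetical stand-ins to illustrate the fallback for files written before this change.
class NoRangeMetadataException extends Exception {
}

class FooterRangeSketch {
  // Stand-in for reading [min, max] record keys from the parquet footer;
  // throws when the footer predates range metadata.
  static String[] readRangeFromFooter(String filePath) throws NoRangeMetadataException {
    throw new NoRangeMetadataException();
  }

  static FileKeyRange toFileInfo(String fileName, String filePath) {
    try {
      String[] minMax = readRangeFromFooter(filePath);
      return new FileKeyRange(fileName, minMax[0], minMax[1]);
    } catch (NoRangeMetadataException e) {
      // Old file without range metadata: no pruning, the bloom filter check still applies.
      return new FileKeyRange(fileName, null, null);
    }
  }
}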

}

public Builder bloomIndexUseCaching(boolean useCaching) {
props.setProperty(BLOOM_INDEX_PRUNE_BY_RANGES_PROP, String.valueOf(useCaching));
Contributor

@vinothchandar It should be BLOOM_INDEX_USE_CACHING_PROP :)

Member Author

geez ok. will change. good catch

Contributor

No Issues.
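
Presumably the fix acknowledged above ends up looking something like this (a sketch only; the second method name and the return statements are assumptions based on the usual builder pattern):

// Each builder setter should write to its own property.
public Builder bloomIndexUseCaching(boolean useCaching) {
  props.setProperty(BLOOM_INDEX_USE_CACHING_PROP, String.valueOf(useCaching));
  return this;
}

public Builder bloomIndexPruneByRanges(boolean pruneRanges) {
  props.setProperty(BLOOM_INDEX_PRUNE_BY_RANGES_PROP, String.valueOf(pruneRanges));
  return this;
}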

@ovj
Contributor

ovj commented Aug 4, 2017

Thanks @vinothchandar for the change.

Ran the test below with and without range pruning. Saw a large improvement for the "filterExists" query. A rough idea of the dataset, which was purposefully created to test this:

  • 1000 files with approximately 500M each.
  • Each file has ~2M records. (200M records were inserted with a parallelism of 100 for bulkInsert to create the dataset.)
  • Ran 2 workloads, one with range pruning and one without it. With the change it took ~15-30 min to do filterExists vs 2.5 hrs+ without it (existing code). Didn't notice any stage taking too much time or having very high parallelism.

@vinothchandar
Member Author

@ovj thanks for testing it out, looks like a win.

@alunarbeach can you please share your experience as well? It would be helpful.

@prazanna any other concerns before I merge this?

@prazanna prazanna merged commit 8620964 into apache:master Aug 4, 2017
@alunarbeach
Contributor

Ran the below test with and without range pruning.
Did a bulk insert of a dataset of ~474M records (~110 GB Snappy Parquet). The key is a combination of date + uuid. Then did an upsert of 57M records over the 474M records.

Partition fields - DayId, HourId

Without range pruning:
BULK_INSERT time - 1 hr 33 min
Upsert time - 1 hr 29 min

With range pruning:
BULK_INSERT time - 1 hr 32 min
Upsert time - 1 hr

This is a huge win for the overall run time.
Thanks @vinothchandar

@prazanna
Contributor

prazanna commented Aug 4, 2017

@alunarbeach this is awesome to know
