[WIP][CARBONDATA-2931][BloomDataMap] Optimize bloom datamap pruning #2713

kevinjmh · 2018-09-12T12:39:50Z

Re-use the shard pruning info from the default datamap. The file scan is now used only to determine whether the merged shard can be used, instead of collecting the full shard paths.
Create one BloomCoarseGrainDataMap object per segment instead of one per shard. (This also prepares for parallel segment pruning.) For a merged shard, there is no effect.
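
To illustrate the per-segment idea, here is a minimal sketch, not the code in this PR. It assumes the default datamap has already reported which shards survived pruning; `ShardHit` and `groupBySegment` are hypothetical names introduced only for this example. The shards are grouped by segment number so that a single datamap object per segment could serve all of that segment's shards.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SegmentShardGrouping {

  /** Hypothetical stand-in for one shard kept by the default datamap. */
  static final class ShardHit {
    final String segmentNo;
    final String shardName;

    ShardHit(String segmentNo, String shardName) {
      this.segmentNo = segmentNo;
      this.shardName = shardName;
    }
  }

  /**
   * Groups shard names by segment number, preserving encounter order.
   * Each map entry corresponds to one datamap object in the per-segment design.
   */
  static Map<String, List<String>> groupBySegment(List<ShardHit> hits) {
    Map<String, List<String>> bySegment = new LinkedHashMap<>();
    for (ShardHit hit : hits) {
      bySegment.computeIfAbsent(hit.segmentNo, k -> new ArrayList<>()).add(hit.shardName);
    }
    return bySegment;
  }
}
```

With this grouping, the number of datamap objects follows the number of segments rather than the number of shards.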

Be sure to do all of the following checklist to help us incorporate
your contribution quickly and easily:

Any interfaces changed?
Any backward compatibility impacted?
Document update required?
Testing done
Please provide details on
- Whether new unit test cases have been added or why no new tests are required?
- How it is tested? Please attach test report.
- Is it a performance related change? Please attach the performance test report.
- Any additional information to help reviewers in testing this change.
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

CarbonDataQA · 2018-09-12T12:49:26Z

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/257/

CarbonDataQA · 2018-09-12T13:58:18Z

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/426/

CarbonDataQA · 2018-09-12T14:01:16Z

Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.3/8496/

kevinjmh · 2018-09-13T00:39:44Z

retest this please

CarbonDataQA · 2018-09-13T00:49:11Z

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/262/

CarbonDataQA · 2018-09-13T01:57:30Z

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/431/

CarbonDataQA · 2018-09-13T02:02:31Z

Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.3/8501/

CarbonDataQA · 2018-09-26T15:43:45Z

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/8778/

CarbonDataQA · 2018-09-26T16:05:59Z

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/711/

CarbonDataQA · 2018-09-26T21:34:40Z

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/533/

CarbonDataQA · 2018-10-08T03:44:43Z

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/734/

CarbonDataQA · 2018-10-08T04:50:29Z

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/8999/

CarbonDataQA · 2018-10-08T05:12:15Z

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/931/

qiuchenjian · 2018-12-18T02:21:04Z

.../bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMapFactory.java

      }
+      segmentMap.put(segment.getSegmentNo(), shardPaths);


segmentMap was used to cache the shardPaths, but that cache is useless now, so I don't think it's necessary to get shardPaths.
It's OK to change segmentMap to a Set that holds the segment numbers.
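
A minimal sketch of that suggestion, assuming the factory only needs to remember which segments it has seen. `SegmentTracker` and its method names are hypothetical and not part of the existing factory class.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class SegmentTracker {

  // Thread-safe set of segment numbers; replaces a Map<String, Set<String>> of shard paths.
  private final Set<String> segments = ConcurrentHashMap.newKeySet();

  /** Records the segment; returns true if it was not tracked before. */
  boolean markSeen(String segmentNo) {
    return segments.add(segmentNo);
  }

  /** Returns true if the segment has been recorded. */
  boolean isSeen(String segmentNo) {
    return segments.contains(segmentNo);
  }

  /** Forgets the segment, e.g. when its datamap is cleared. */
  void clear(String segmentNo) {
    segments.remove(segmentNo);
  }
}
```

A concurrent set is used here only as a precaution for the planned parallel segment pruning; a plain HashSet would do if access stays single-threaded.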

qiuchenjian · 2018-12-18T07:36:20Z

.../bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMapFactory.java

+      String fileName = carbonFile.getName();
+      if (fileName.equals(BloomIndexFileStore.MERGE_BLOOM_INDEX_SHARD_NAME)) {
+        mergeShardExist = true;
+      } else if (fileName.equals(BloomIndexFileStore.MERGE_INPROGRESS_FILE)) {
        mergeShardInprogress = true;


If MERGE_INPROGRESS_FILE exists, the shard's index files will be deleted at some point, so this scenario needs attention. However, this problem existed before this PR.

Yes, you are right. We need to fix this. If we allow the bloom filter to be used while the index files are being merged, an IO exception may occur in later steps once the merge completes.

Some simple ideas for this:

The datamap does not choose bloom while a merge is in progress.

Make the pruning logic segment-independent: any datamap except the default datamap can reject or fail the pruning of a segment (by returning null or ?), and its result blocklets are then left out of the intersection, so the final result is not affected.

One more idea is to delay the deletion of the original shards until query time, similar to segment management. That is, when a query finds that mergeShard exists and there is no merge in-progress file, we can safely delete the original shards.
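
The condition shared by these ideas can be sketched as follows. This is only an illustration under assumptions: the two file-name constants are placeholder values standing in for `BloomIndexFileStore.MERGE_BLOOM_INDEX_SHARD_NAME` and `BloomIndexFileStore.MERGE_INPROGRESS_FILE`, and `MergeShardCheck` is a hypothetical helper that takes the file names found under a segment's index folder.

```java
import java.util.List;

class MergeShardCheck {

  // Placeholder values for illustration; the real constants live in BloomIndexFileStore.
  static final String MERGE_SHARD_NAME = "mergeShard";
  static final String MERGE_INPROGRESS_FILE = "mergeShard.inprogress";

  /**
   * Returns true only when the merged shard exists and no merge is in progress.
   * Under the ideas above, this is the state in which the merged shard can be
   * used and the original shards could be deleted safely.
   */
  static boolean mergeCompleted(List<String> fileNames) {
    boolean mergeShardExist = false;
    boolean mergeShardInprogress = false;
    for (String fileName : fileNames) {
      if (fileName.equals(MERGE_SHARD_NAME)) {
        mergeShardExist = true;
      } else if (fileName.equals(MERGE_INPROGRESS_FILE)) {
        mergeShardInprogress = true;
      }
    }
    return mergeShardExist && !mergeShardInprogress;
  }
}
```

When the in-progress file is present, the first idea corresponds to skipping the bloom datamap for that segment instead of reading shards that may be deleted mid-query.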

Actually, retrying the query should be OK.

CarbonDataQA · 2018-12-27T03:40:18Z

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1988/

CarbonDataQA · 2018-12-27T04:45:44Z

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2165/

CarbonDataQA · 2018-12-27T21:32:22Z

Build Success with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10240/

CarbonDataQA · 2019-07-24T09:06:48Z

Build Failed with Spark 2.1.0, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.1/59/

CarbonDataQA · 2019-07-24T09:37:29Z

Build Failed with Spark 2.3.2, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/58/

CarbonDataQA1 · 2020-02-03T07:03:12Z

Build Failed with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/59/

kevinjmh changed the title ~~[CARBONDATA-2931][BloomDataMap] Optimize bloom datamap pruning~~ [WIP][CARBONDATA-2931][BloomDataMap] Optimize bloom datamap pruning Sep 17, 2018

asfgit force-pushed the master branch from 23a9e7c to e07df44 Compare September 26, 2018 07:48

all shard of one segment use one datamap

88b0f67

kevinjmh force-pushed the bloom_shard_op branch from 8544ed6 to 88b0f67 Compare October 8, 2018 03:31

qiuchenjian reviewed Dec 18, 2018

apache deleted a comment from kevinjmh Dec 18, 2018

qiuchenjian reviewed Dec 18, 2018

kevinjmh closed this Mar 31, 2020
