Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP][CARBONDATA-2931][BloomDataMap] Optimize bloom datamap pruning #2713

Closed
wants to merge 1 commit into from

Conversation

kevinjmh
Copy link
Member

@kevinjmh kevinjmh commented Sep 12, 2018

  1. re-use shard pruning info from default datamap. File scan will be only used to determined whether we can use merged shard, instead of getting full shard path.
  2. create one BloomCoarseGrainDataMap object per segment instead of per shard. (This is also preparation for parallel segment pruning). For merged shard, no effect.

Be sure to do all of the following checklist to help us incorporate
your contribution quickly and easily:

  • Any interfaces changed?

  • Any backward compatibility impacted?

  • Document update required?

  • Testing done
    Please provide details on
    - Whether new unit test cases have been added or why no new tests are required?
    - How it is tested? Please attach test report.
    - Is it a performance related change? Please attach the performance test report.
    - Any additional information to help reviewers in testing this change.

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@CarbonDataQA
Copy link

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/257/

@CarbonDataQA
Copy link

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/426/

@CarbonDataQA
Copy link

Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.3/8496/

@kevinjmh
Copy link
Member Author

retest this please

@CarbonDataQA
Copy link

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/262/

@CarbonDataQA
Copy link

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/431/

@CarbonDataQA
Copy link

Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.3/8501/

@kevinjmh kevinjmh changed the title [CARBONDATA-2931][BloomDataMap] Optimize bloom datamap pruning [WIP][CARBONDATA-2931][BloomDataMap] Optimize bloom datamap pruning Sep 17, 2018
@CarbonDataQA
Copy link

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/8778/

@CarbonDataQA
Copy link

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/711/

@CarbonDataQA
Copy link

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/533/

@CarbonDataQA
Copy link

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/734/

@CarbonDataQA
Copy link

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/8999/

@CarbonDataQA
Copy link

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/931/

}
segmentMap.put(segment.getSegmentNo(), shardPaths);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

segmentMap is used cache the shardPaths, now it's useless,I don't think it's necessary to get shardPaths
it's ok to change segmentMap to a Set that add segment no

@apache apache deleted a comment from kevinjmh Dec 18, 2018
String fileName = carbonFile.getName();
if (fileName.equals(BloomIndexFileStore.MERGE_BLOOM_INDEX_SHARD_NAME)) {
mergeShardExist = true;
} else if (fileName.equals(BloomIndexFileStore.MERGE_INPROGRESS_FILE)) {
mergeShardInprogress = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If MERGE_INPROGRESS_FILE exists, shard's index file will be deleted sometime, so this scene need to be focus on, but this question shows up before this PR

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right. We need to fix this. If we allow to use bloom filter when the index files are merging, maybe any IO Exception will occur in following steps when the merging is done.

Some simple ideas for this:

  1. datamap do not choose bloom when merging is under action
  2. change the pruning logic to segment independent, any datamap excepts default datamap can reject or fail the segment pruning ( by return null or ?), and no more result blocklet intersection for this datamap, such that this does not affect final result

Copy link
Member Author

@kevinjmh kevinjmh Dec 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more idea is that we can delay the deletion of original shards in query, referring to segment management. That is when mergeShard exists and no merge inprogress file in a query, we can assure to delete original shards safely.

Actually, retry the query should be OK

@CarbonDataQA
Copy link

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1988/

@CarbonDataQA
Copy link

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2165/

@CarbonDataQA
Copy link

Build Success with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10240/

@CarbonDataQA
Copy link

Build Failed with Spark 2.1.0, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.1/59/

@CarbonDataQA
Copy link

Build Failed with Spark 2.3.2, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/58/

@CarbonDataQA1
Copy link

Build Failed with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/59/

@kevinjmh kevinjmh closed this Mar 31, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants