DRILL-4530: Optimize partition pruning with metadata caching for the … #519

Closed
wants to merge 7 commits into from


amansinha100

…single partition case.

  • Enhance PruneScanRule to detect single partitions based on referenced dirs in the filter.
  • Keep a new status of EXPANDED_PARTIAL for FileSelection.
  • Create separate .directories metadata file to prune directories first before files.
  • Introduce cacheFileRoot attribute to keep track of the parent directory of the cache file after partition pruning.
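The single-partition idea in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this pull request: the class `SinglePartitionSketch`, the method `commonCacheRoot`, and the representation of a partition as an array of directory values are all hypothetical. It assumes the filter's referenced directory levels are tracked in a bitset and that a common root exists only when those levels form a prefix (dir0, dir1, ...) with identical values across all surviving partitions.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Hypothetical helper, not Drill's PruneScanRule: it only illustrates the idea.
final class SinglePartitionSketch {

  /**
   * Returns the shared directory prefix (relative to the table root) when every
   * surviving partition has the same value at each referenced directory level,
   * or null when no single partition can be identified.
   *
   * @param partitions     directory values per partition, e.g. {"2016", "Q1"}
   * @param referencedDirs bit i is set when the filter references directory level i
   */
  static String commonCacheRoot(List<String[]> partitions, BitSet referencedDirs) {
    if (partitions.isEmpty() || referencedDirs.isEmpty()) {
      return null;
    }
    // Only a contiguous prefix of levels (dir0, dir0 + dir1, ...) maps to one subdirectory.
    int depth = referencedDirs.cardinality();
    if (referencedDirs.nextClearBit(0) != depth) {
      return null;
    }
    String[] first = partitions.get(0);
    for (String[] partition : partitions) {
      if (partition.length < depth) {
        return null;
      }
      for (int level = 0; level < depth; level++) {
        if (!first[level].equals(partition[level])) {
          return null; // partitions diverge, so there is no single partition
        }
      }
    }
    // Candidate parent directory of the cache file after pruning.
    return String.join("/", Arrays.copyOf(first, depth));
  }
}
```

In this sketch the returned prefix plays the role that the description assigns to cacheFileRoot: the directory under which the metadata cache would be read after pruning.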

if (mDirs.getDirectories().size() > 0) {
  FileSelection dirSelection = FileSelection.createFromDirectories(mDirs.getDirectories(), selection);
  dirSelection.setExpandedPartial();
  return new DynamicDrillTable(fsPlugin, storageEngineName, userName,
Contributor

Are you missing the isDirReadable() check here?

Author

I intentionally don't call isDirReadable() here: that method returns true if a metadata cache file exists, and I am doing a similar check for the directories file with fs.exists(dirMetaPath). If this check fails, we fall through to the old code path (line 223), which does check isDirReadable().
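The control flow described in this reply might look roughly like the sketch below. Everything here is an assumption made for illustration: the class name, the enum, and the two file names are hypothetical, `java.nio.file` stands in for the Hadoop `FileSystem` calls used in the patch, and the older isDirReadable() check is reduced to a plain existence test on the file metadata cache.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the fall-through; names and file layout are assumptions.
final class DirMetadataCheckSketch {
  static final String DIRS_META = ".directories_meta_example"; // assumed file name
  static final String FILES_META = ".files_meta_example";      // assumed file name

  enum PlanPath { DIRECTORIES_FIRST, FILE_METADATA, NOT_READABLE }

  /** Stand-in for the older readability check, reduced to "a cache file exists". */
  static boolean isDirReadableSketch(Path tableRoot) {
    return Files.exists(tableRoot.resolve(FILES_META));
  }

  static PlanPath choose(Path tableRoot) {
    // New path: the directories file is checked directly, so no separate
    // readability check is needed before using it.
    if (Files.exists(tableRoot.resolve(DIRS_META))) {
      return PlanPath.DIRECTORIES_FIRST;
    }
    // Old path: reached only when the directories file is missing.
    return isDirReadableSketch(tableRoot) ? PlanPath.FILE_METADATA : PlanPath.NOT_READABLE;
  }
}
```

The point of the sketch is only the ordering: the existence check on the directories file guards the new path, and the older check still runs whenever that guard fails.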

Contributor

A comment then, perhaps.

Author

Makes sense. I will add a comment. Thanks for reviewing.

@parthchandra
Contributor

The metadata cache file changes look good. (Longer term, we need to clean this up and maybe split metadata into smaller files and/or consider providing alternative metadata cache implementations, but that is another issue.)

if (checkForSingle &&
    partitions.get(0).isCompositePartition() /* apply single partition check only for composite partitions */) {
  // Inner loop: within each batch iterate over the PartitionLocations
  for (PartitionLocation part : partitions) {
Contributor

Do you think it's possible to refactor this part of the code so that only the file system prune scan has this logic? This single-partition optimization only applies to file system pruning. Also, the doOnMatch() method is getting bigger. Something like: the descriptor has supportsSinglePartOptimization(), and the prune scan rule has doSinglePartOpt(), which is a no-op by default and has an implementation for the file system prune rule.
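One possible shape for this suggestion is sketched below. Only the two method names, supportsSinglePartOptimization() and doSinglePartOpt(), come from the comment; the types, their members, and the simplistic "exactly one distinct directory" rule are hypothetical and do not mirror Drill's actual PartitionDescriptor or PruneScanRule classes.

```java
import java.util.HashSet;
import java.util.List;

// Hypothetical types for illustration; not Drill's real class hierarchy.
interface DescriptorSketch {
  // Descriptors opt in; the default is "not supported".
  default boolean supportsSinglePartOptimization() {
    return false;
  }
}

final class FileSystemDescriptorSketch implements DescriptorSketch {
  @Override
  public boolean supportsSinglePartOptimization() {
    return true;
  }
}

abstract class PruneRuleSketch {
  /** No-op by default: null means "no single-partition root found". */
  protected String doSinglePartOpt(List<String> survivingDirs) {
    return null;
  }

  /** Stand-in for the part of doOnMatch() that would delegate to the hook. */
  final String onMatch(DescriptorSketch descriptor, List<String> survivingDirs) {
    return descriptor.supportsSinglePartOptimization() ? doSinglePartOpt(survivingDirs) : null;
  }
}

final class FileSystemPruneRuleSketch extends PruneRuleSketch {
  @Override
  protected String doSinglePartOpt(List<String> survivingDirs) {
    // Simplified rule: a single partition exists when one distinct directory survives.
    return new HashSet<>(survivingDirs).size() == 1 ? survivingDirs.get(0) : null;
  }
}

// A rule for some other storage type simply inherits the no-op.
final class OtherPruneRuleSketch extends PruneRuleSketch {
}
```

With this split, the shared rule code calls one hook and the file-system-specific logic lives in a single subclass, which keeps the common method from growing further.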

Author

Yes, the rule has become too complex. I will go ahead and at least refactor this specific optimization.

Author

@jinfengni I started the refactoring to move this particular optimization into a partition-descriptor-specific class. However, I had to propagate several pieces of internal state to the doSinglePartOpt() method, such as the partition map, the referenced dirs bitset, the value vector array, and others. Also, since this optimization occurs inside an outer loop, it is not straightforward without a broader restructuring of the PruneScanRule.doOnMatch() method. Would you be OK with deferring this refactoring to an enhancement JIRA?

Contributor

Yes, it's OK to track the refactoring effort as an enhancement JIRA.

@amansinha100 amansinha100 force-pushed the DRILL-4530-1 branch 4 times, most recently from 5ab5157 to dff2857 Compare July 8, 2016 15:55
if (fileNames.isEmpty()) {
  List<String> finalFileNames;
  if (fileSet != null) {
    finalFileNames = Lists.newArrayList(fileSet);
Contributor

Can we use fileNames only? From lines 589 / 607, it seems fileSet is assigned from fileNames, so the two appear to be the same under both else branches.

Author

Actually, fileNames can contain duplicates, whereas fileSet holds only unique entries. fileSet was already present in the code and is used by removeUnneededRowGroups(). I am leveraging fileSet because expanding directories can produce duplicate files, which we don't want. I could perhaps add a comment here.

Contributor

I see the need to remove duplicates; the confusion comes from the old code. Is it reasonable to change the existing code and keep only fileSet in all cases? Keeping both fileNames and fileSet seems a bit confusing.

Author

Yes, I can try to keep only the fileSet and get rid of fileNames (although I would still need the finalFileNames list since FileSelection only accepts a List, not a Set).
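The set-then-list step discussed in this thread can be sketched as below. The class and method names are hypothetical, and the sketch uses only the JDK where the patch itself calls Guava's Lists.newArrayList(fileSet). It assumes insertion order should be preserved, which the thread does not state.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper illustrating de-duplication followed by conversion to a List.
final class FileListDedupSketch {

  static List<String> toFinalFileNames(List<String> expandedFiles) {
    // A LinkedHashSet drops duplicates while keeping first-seen order (an assumption here).
    Set<String> fileSet = new LinkedHashSet<>(expandedFiles);
    // The consumer in the discussion accepts a List, not a Set, so copy back into a list.
    return new ArrayList<>(fileSet);
  }
}
```

For example, passing a list with the entries `a.parquet`, `b.parquet`, `a.parquet` yields a two-element list with `a.parquet` followed by `b.parquet`.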

Author

Commit #ef37b77 addresses this comment.

@jinfengni
Contributor

Other than prior comments, the pruning logic looks good to me.

@amansinha100 amansinha100 force-pushed the DRILL-4530-1 branch 2 times, most recently from ce94e7f to 50e6c98 Compare July 16, 2016 01:33
Aman Sinha added 6 commits July 15, 2016 18:34
@amansinha100
Author

@jinfengni, thanks for the review. I have addressed your review comments. Please take another look.

@jinfengni
Contributor

+1

LGTM.

@asfgit asfgit closed this in 4f818d0 Jul 19, 2016