New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DRILL-4530: Optimize partition pruning with metadata caching for the … #519
Conversation
if (mDirs.getDirectories().size() > 0) { | ||
FileSelection dirSelection = FileSelection.createFromDirectories(mDirs.getDirectories(), selection); | ||
dirSelection.setExpandedPartial(); | ||
return new DynamicDrillTable(fsPlugin, storageEngineName, userName, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you missing the isDirReadable() check here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I intentionally don't call isDirReadable() here because that method returns true if a metadata cache file exists and I am doing a similar check for the directories file here with fs.exists(dirMetaPath). If this check fails, we will fall through to the old code path (line 223) which does check isDirReadable().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A comment then, perhaps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make sense.. I will add a comment. thanks for reviewing.
The metadata cache file changes look good. (Longer term, we need to clean this up and maybe split metadata into smaller files and/or consider providing alternative metadata cache implementations, but that is another issue.) |
if (checkForSingle && | ||
partitions.get(0).isCompositePartition() /* apply single partition check only for composite partitions */) { | ||
// Inner loop: within each batch iterate over the PartitionLocations | ||
for (PartitionLocation part : partitions) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it's possible to refactor this part of code such that only file system prune scan will have this logic? This simplePartition optimization only applies to file system prune. Also, doOnMatch() method is getting bigger. Something like: descriptor has supportsSinglePartOptimization(), and prune scan rule has doSinglePartOpt(), which is by default a no-op, and has implementation for file system prune rule.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the rule has become too complex. I will go ahead and at least refactor this specific optimization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jinfengni I started doing the refactoring to move this particular optimization into a partition descriptor specific class. However, I was having to propagate several internal states to the doSinglePartOpt() method...such as partition map, the referenced dirs bitset , the value vector array and others. Also, since this optimization occurs inside an outer loop, it is not straightforward without doing a broader restructuring of the PruneScanRule.doOnMatch() method. Would you be ok to defer this refactoring to an enhancement JIRA ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it's ok to put the refactoring effort as an enhancement JIR.
5ab5157
to
dff2857
Compare
if (fileNames.isEmpty()) { | ||
List<String> finalFileNames; | ||
if (fileSet != null) { | ||
finalFileNames = Lists.newArrayList(fileSet); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use fileNames only? From Line 589 / 607, seems fileSet is assigned from fileNames; seems they are same under two ELSE branches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, fileNames can contain duplicates whereas fileSet has a unique list. fileSet was already present in the code earlier and used by removeUnneededRowGroups(). I am leveraging fileSet because when expanding directories it may be possible to end up with duplicate files which we don't want. I could perhaps add a comment here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the need to remove duplicates and the confusion comes from old code. Is it reasonable to change existing code, and only keep fileSet in all cases? Keep fileName and fileSet seems to be a bit confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I can try to keep only the fileSet and get rid of fileNames (although I would still need the finalFileNames list since FileSelection only accepts a List, not a Set).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commit #ef37b77 addresses this comment.
Other than prior comments, the pruning logic looks good to me. |
ce94e7f
to
50e6c98
Compare
…single partition case. - Enhance PruneScanRule to detect single partitions based on referenced dirs in the filter. - Keep a new status of EXPANDED_PARTIAL for FileSelection. - Create separate .directories metadata file to prune directories first before files. - Introduce cacheFileRoot attribute to keep track of the parent directory of the cache file after partition pruning.
…rtition info is initialized.
…dd several unit tests.
…d instead leverage it from FileSelection.
50e6c98
to
5065fc6
Compare
…ss this state where needed.
5065fc6
to
ce509af
Compare
@jinfengni , thanks for the review. I have addressed your review comments. Pls take another look. |
+1 LGTM. |
…single partition case.