New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DRILL-6030: Managed sort should minimize number of batches in a k-way merge #1075
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks! Just a minor suggestion about changing the existing config option rather than adding a new hard-coded value.
@@ -84,7 +85,7 @@ public SortConfig(DrillConfig config) { | |||
if (limit > 0) { | |||
mergeLimit = Math.max(limit, MIN_MERGE_LIMIT); | |||
} else { | |||
mergeLimit = Integer.MAX_VALUE; | |||
mergeLimit = DEFAULT_MERGE_LIMIT; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The merge limit is already a config option. (I'd forgotten about that.) The comment on the config option says "Limit on the number of spilled batches that can be merged in a single pass." So, let's just set that default (in drill-override-conf
) to your new value of 128 and leave the code here unchanged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, it is better to change the default to avoid upgrade problems. In an upgrade scenario, users may simply overwrite drill-override.conf
from their prior installations and forget to set the merge limit. Is there a reason not to change the default merge limit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There may be a misunderstanding of how config options work. We define the defaults in Drill's own source code: drill-module.conf
in each module. (Here it is in java-exec
.)
To change the default option, we change the value in drill-module.conf
. In the highly unlikely case that a user has overridden this value in drill-override.conf
, their value will be used. But, the option is not documented in drill-override-example.conf
so it is very, very unlikely that anyone created an override. (The property is meant to be internal, for use in tests.)
So, rather than introducing yet another variable, we might as well use the existing config property. This has the added advantage that, if experience suggests that we need a smaller or larger limit for some scenarios, we can make the adjustment in the field via the config system.
One additional thought. This bug was found when sorting 18 GB of data in 8 GB of memory. That is, a case in which the sort must spill. What happens in the case in which the 18 GB of data is sorted in, say, 20 GB of memory (an in-memory sort)? We don't want the merge limit to force a spill in this case; kind of defeats the purpose of an in-memory sort. So:
One possible solution is to:
|
The scenario when all batches can be merged in memory is covered by 'if (canUseMemoryMerge()) |
@paul-rogers Please review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still LGTM
+1
… merge This closes apache#1075
@paul-rogers Please review