Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRILL-6030: Managed sort should minimize number of batches in a k-way merge #1075

Closed
wants to merge 1 commit into from
Closed

Conversation

vrozov
Copy link
Member

@vrozov vrozov commented Dec 18, 2017

@paul-rogers Please review

Copy link
Contributor

@paul-rogers paul-rogers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks! Just a minor suggestion about changing the existing config option rather than adding a new hard-coded value.

@@ -84,7 +85,7 @@ public SortConfig(DrillConfig config) {
if (limit > 0) {
mergeLimit = Math.max(limit, MIN_MERGE_LIMIT);
} else {
mergeLimit = Integer.MAX_VALUE;
mergeLimit = DEFAULT_MERGE_LIMIT;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The merge limit is already a config option. (I'd forgotten about that.) The comment on the config option says "Limit on the number of spilled batches that can be merged in a single pass." So, let's just set that default (in drill-override-conf) to your new value of 128 and leave the code here unchanged.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, it is better to change the default to avoid upgrade problems. In an upgrade scenario, users may simply overwrite drill-override.conf from their prior installations and forget to set the merge limit. Is there a reason not to change the default merge limit?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There may be a misunderstanding of how config options work. We define the defaults in Drill's own source code: drill-module.conf in each module. (Here it is in java-exec.)

To change the default option, we change the value in drill-module.conf. In the highly unlikely case that a user has overridden this value in drill-override.conf, their value will be used. But, the option is not documented in drill-override-example.conf so it is very, very unlikely that anyone created an override. (The property is meant to be internal, for use in tests.)

So, rather than introducing yet another variable, we might as well use the existing config property. This has the added advantage that, if experience suggests that we need a smaller or larger limit for some scenarios, we can make the adjustment in the field via the config system.

@paul-rogers
Copy link
Contributor

One additional thought. This bug was found when sorting 18 GB of data in 8 GB of memory. That is, a case in which the sort must spill.

What happens in the case in which the 18 GB of data is sorted in, say, 20 GB of memory (an in-memory sort)? We don't want the merge limit to force a spill in this case; kind of defeats the purpose of an in-memory sort.

So:

  1. Does the limit affect in memory sort? If so, we need to revise the solution.
  2. Does the in-memory sort suffer from a similar performance issue? If so, we need to revise the in memory sort.

One possible solution is to:

  1. Defer sorting of individual batches until necessary.
  2. Sort batches just before spilling.
  3. If all batches fit in memory, do a single, combined sort (using an SV4).

@vrozov
Copy link
Member Author

vrozov commented Dec 19, 2017

The scenario when all batches can be merged in memory is covered by 'if (canUseMemoryMerge())check inSortImpl.java:399`. The affected code path applies only to cases where merge between spilled and in-memory batches is necessary. Note that this is a short term fix to improve managed sort performance, in a long run, it is necessary to have an ability to merge all batches in memory (using SV4) without spilling and be able to merge it with the spilled data.

@vrozov
Copy link
Member Author

vrozov commented Dec 21, 2017

@paul-rogers Please review

Copy link
Contributor

@paul-rogers paul-rogers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
+1

Copy link
Contributor

@paul-rogers paul-rogers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still LGTM
+1

@asfgit asfgit closed this in dcaac1b Jan 12, 2018
@vrozov vrozov deleted the DRILL-6030 branch January 24, 2018 17:50
ilooner pushed a commit to ilooner/drill that referenced this pull request Jan 30, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants