
DRILL-6236: batch sizing for hash join #1227

Closed (wants to merge 2 commits)

Conversation

ppadma
Contributor

@ppadma ppadma commented Apr 19, 2018

No description provided.

@ppadma ppadma force-pushed the DRILL-6236 branch 2 times, most recently from ccce105 to aa7a3da Compare April 19, 2018 21:46
Contributor

@Ben-Zvi Ben-Zvi left a comment


In addition to the other comments: we will have some work to do when this code is merged with the Hash-Join spill code (PR coming soon). Some of the merge work would be purely mechanical (e.g., HashJoinProbeTemplate code was moved into HashJoinBatch), but other parts would require some thought. For example, the new spill code re-assigns the left/right incoming batches when reading from the spill files; should the Memory Manager be updated with that data, which was already read via the original left/right?

OUTPUT_BATCH_COUNT,
AVG_OUTPUT_BATCH_BYTES,
AVG_OUTPUT_ROW_BYTES,
OUTPUT_RECORD_COUNT;
Contributor


These metrics are also used by customers; is this information relevant for them? Is it too detailed (e.g., could it be logged instead)?

Contributor Author


It is relevant in the sense that these metrics provide a high-level picture of the amount of data being processed, memory usage, etc. for each operator. This is also helpful when debugging, to figure out what is going on.

Contributor


Putting these metrics inside the operator's Metric class will not work. For joins, these metrics were moved into the JoinBatchMemoryManager.Metric class since they are memory-manager metrics. So when you call updateBatchMemoryManagerStats(), it updates the operator stats using ordinals from the JoinBatchMemoryManager.Metric class, and the ordinal for LEFT_INPUT_BATCH_COUNT will be 0, not 4 (which is required).
I think we should improve our OperatorsMetricRegistry to register multiple Metric classes for an operator.
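To make the ordinal clash concrete, here is a minimal, hypothetical sketch (the enum members are illustrative and do not match Drill's actual Metric definitions): stats are recorded by enum ordinal, so the same metric name resolves to different slots in the two classes.

```java
// Hypothetical sketch of the ordinal clash described above; the enum
// contents are illustrative and do not match Drill's real Metric classes.
public class MetricOrdinalClash {

  // In the operator's own Metric enum, LEFT_INPUT_BATCH_COUNT would come
  // after four earlier operator metrics, so its ordinal is 4.
  enum OperatorMetric {
    NUM_BUCKETS, NUM_ENTRIES, NUM_RESIZING, RESIZING_TIME_MS, // ordinals 0..3
    LEFT_INPUT_BATCH_COUNT                                    // ordinal 4
  }

  // In JoinBatchMemoryManager.Metric it is the first entry, so its ordinal
  // is 0 -- and stats recorded by ordinal would land in the wrong slot.
  enum ManagerMetric {
    LEFT_INPUT_BATCH_COUNT // ordinal 0
  }
}
```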

// skip first batch if count is zero, as it may be an empty schema batch
if (isFurtherProcessingRequired(rightUpstream) && right.getRecordCount() == 0) {
for (final VectorWrapper<?> w : right) {
w.clear();
}
rightUpstream = next(right);
// For build side, use aggregate i.e. average row width across batches
batchMemoryManager.update(RIGHT_INDEX, 0, true);
Contributor


Why is update() being called when the right has zero rows? Shouldn't it be called for every new right incoming batch?

Contributor Author


There is a call to "next" right above the update.

@@ -346,6 +369,7 @@ public void executeBuildPhase() throws SchemaChangeException, ClassTransformatio
}
// Fall through
case OK:
batchMemoryManager.update(LEFT_INDEX);
Contributor


Should it be the RIGHT_INDEX here?

@@ -241,4 +261,41 @@ public int getOutputBatchSize() {
public int getOffsetVectorWidth() {
return UInt4Vector.VALUE_WIDTH;
}

public void allocateVectors(VectorContainer container) {
Contributor


Cleaner implementation: Just call the following method, with outputRowCount as the second parameter.

}

public void allocateVectors(List<ValueVector> valueVectors) {
// Allocate memory for the vectors.
Contributor


Same idea/comment as above; can avoid some duplicate code by calling allocateVectors(valueVectors, outputRecordCount)

if (inputBatchStats[index] == null) {
inputBatchStats[index] = new BatchStats();
}
updateIncomingStats(index);
}

public void setRecordBatchSizer(RecordBatchSizer sizer) {
Contributor


Can instead just call the above method with DEFAULT_INPUT_INDEX as the first parameter.

@@ -147,6 +149,12 @@ public int update(int inputIndex, int outputPosition) {
return getOutputRowCount();
}

public int update(int inputIndex, int outputPosition, boolean useAggregate) {
// by default just return the outputRowCount
return update(inputIndex, outputPosition, false);
Contributor


Is this an infinite recursive call ??
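For what it's worth, here is a minimal sketch of how the overload could avoid calling itself, assuming a hypothetical updateInternal() worker method (the real Drill signatures differ): the three-argument form must delegate to the worker, not back to update().

```java
// Minimal sketch, assuming a hypothetical updateInternal() worker; the
// real Drill method signatures differ.
public class UpdateOverloads {
  private int outputRowCount = 1024; // placeholder result

  // Hypothetical worker that does the actual sizing computation.
  private int updateInternal(int inputIndex, int outputPosition, boolean useAggregate) {
    return outputRowCount;
  }

  // Two-argument form defaults useAggregate to false.
  public int update(int inputIndex, int outputPosition) {
    return update(inputIndex, outputPosition, false);
  }

  // Three-argument form delegates to the worker; calling update(...) here
  // again would recurse forever, as flagged in the review comment above.
  public int update(int inputIndex, int outputPosition, boolean useAggregate) {
    return updateInternal(inputIndex, outputPosition, useAggregate);
  }
}
```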

@ppadma
Contributor Author

ppadma commented Apr 20, 2018

@Ben-Zvi Thanks a lot for the review. Updated the PR with the review comments taken care of. Please take a look.

Regarding spill files, here are my thoughts.
For the build side, I am using aggregate statistics, i.e. the average across all batches. On the probe side, I am using stats for each incoming batch and adjusting the output row count. So we can skip applying sizing for batches spilled from the build side and continue doing what I am doing on the probe side. Once your code is merged in, I will refactor the code as needed.
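A toy sketch of the two sizing strategies described above (class and method names are made up for illustration, not Drill's RecordBatchSizer API): the build side keeps a running average across all batches, while the probe side sizes from each batch individually.

```java
// Toy illustration of aggregate vs. per-batch row-width sizing; names are
// made up and do not correspond to Drill's actual API.
public class RowWidthStrategy {
  private long totalBytes;
  private long totalRows;

  // Build side: average row width across every batch seen so far.
  public int aggregateRowWidth(long batchBytes, long batchRows) {
    totalBytes += batchBytes;
    totalRows += batchRows;
    return totalRows == 0 ? 0 : (int) (totalBytes / totalRows);
  }

  // Probe side: row width of the current batch only.
  public int perBatchRowWidth(long batchBytes, long batchRows) {
    return batchRows == 0 ? 0 : (int) (batchBytes / batchRows);
  }
}
```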

@@ -157,14 +172,20 @@ public int metricId() {
}
}

public class HashJoinMemoryManager extends JoinBatchMemoryManager {
Contributor


Not required; you can directly use JoinBatchMemoryManager.


@ppadma
Contributor Author

ppadma commented Apr 20, 2018

@sohami Thanks for the review. Updated with the review comments addressed. Please take a look.

@@ -560,6 +554,40 @@ public void close() {
super.close();
}

@Override
protected void updateBatchMemoryManagerStats() {
stats.setLongStat(Metric.LEFT_INPUT_BATCH_COUNT, batchMemoryManager.getNumIncomingBatches(LEFT_INDEX));
Contributor


@ppadma - The main motivation for moving the metrics inside JoinBatchMemoryManager was to avoid duplicate definitions in all the binary record batches. I think we should improve OperatorMetricsRegistry rather than just overriding this method.
Or, if you want, you can create a new JIRA to track the metrics-related improvement.

Contributor Author


@sohami I will create a JIRA and address that in a separate PR. For now, I would like to override this method. Is that ok?

Contributor


Should be fine.

@Ben-Zvi
Contributor

Ben-Zvi commented Apr 20, 2018

The subject line of this PR needs updating: the '-' is missing between "DRILL" and "6236" (should be DRILL-6236). Because of this missing '-', the PR is not listed in the JIRA.

@ppadma ppadma changed the title Drill 6236: batch sizing for hash join Drill-6236: batch sizing for hash join Apr 20, 2018
@ppadma
Contributor Author

ppadma commented Apr 20, 2018

@Ben-Zvi My bad. I updated the title, but it has not updated the JIRA. Trying to figure this out.

@Ben-Zvi
Contributor

Ben-Zvi commented Apr 20, 2018

It needs to be "DRILL" in capital letters.

@ppadma ppadma changed the title Drill-6236: batch sizing for hash join DRILL-6236: batch sizing for hash join Apr 20, 2018
@ppadma ppadma force-pushed the DRILL-6236 branch 2 times, most recently from b781d04 to 5b6ba46 Compare April 25, 2018 22:00
@ppadma
Contributor Author

ppadma commented Apr 25, 2018

@Ben-Zvi I manually added the PR link to the JIRA. All code review comments are addressed. Can you look at the latest changes?

@ilooner
Contributor

ilooner commented Apr 26, 2018

@ppadma Please fix the Travis failure.

@ppadma
Contributor Author

ppadma commented May 27, 2018

@Ben-Zvi I merged in the spill-to-disk changes and fixed all issues. Please take a look.

@@ -234,6 +235,7 @@ public HashJoinState getState() {
private int recordsPerPartitionBatchBuild;
private int recordsPerPartitionBatchProbe;
private int outputBatchNumRecords;
private int outputBatchSize;
private Map<String, Long> buildValueSizes;
private Map<String, Long> probeValueSizes;
private Map<String, Long> keySizes;
Contributor


@ppadma I think outputBatchNumRecords, buildValueSizes, probeValueSizes, and keySizes are unused now since we are directly passing in the outputBatchSize. This is great since the calculator has been simplified. Could you also delete these unused parameters?

@ppadma ppadma force-pushed the DRILL-6236 branch 2 times, most recently from 0d388b2 to 76bbc59 Compare May 30, 2018 21:34
@ppadma
Contributor Author

ppadma commented May 30, 2018

@Ben-Zvi I rebased and updated the PR. Please review the latest diffs.

@ppadma ppadma force-pushed the DRILL-6236 branch 2 times, most recently from 45d2897 to 45f3b01 Compare May 30, 2018 22:21
@@ -448,8 +442,7 @@ private void calculateMemoryUsage()
safetyFactor,
reserveHash);

maxOutputBatchSize = computeMaxOutputBatchSize(buildValueSizes, probeValueSizes, keySizes,
outputBatchNumRecords, safetyFactor, fragmentationFactor);
maxOutputBatchSize = (long) (outputBatchSize * fragmentationFactor * safetyFactor);
Contributor


Maybe "outputBatchSize" needs to be cast to (double) to ensure that the whole multiplication is performed in double precision.

Contributor Author


done.
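For reference, a sketch of the cast discussed above (the factor values in the test are made up): the explicit (double) cast keeps the whole product in floating point before truncating back to long, even if the factor types ever change.

```java
// Sketch of the cast discussed above; not the exact Drill code.
public class BatchSizeCast {
  public static long maxOutputBatchSize(int outputBatchSize,
                                        double fragmentationFactor,
                                        double safetyFactor) {
    // The (double) cast guarantees the multiplication happens in double
    // precision before the final truncation to long.
    return (long) ((double) outputBatchSize * fragmentationFactor * safetyFactor);
  }
}
```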

@ppadma ppadma force-pushed the DRILL-6236 branch 2 times, most recently from 7c33cd0 to a5664f2 Compare May 30, 2018 23:32
Contributor

@Ben-Zvi Ben-Zvi left a comment


A few comments...

@@ -262,6 +272,7 @@ private void executeProbePhase() throws SchemaChangeException {
probeBatch.getSchema());
}
case OK:
setTargetOutputCount(outgoingJoinBatch.getBatchMemoryManager().update(probeBatch, LEFT_INDEX, outputRecords));
Contributor


This code is called when a new LEFT incoming batch is read. At this point the outgoing batch may be "half full". It looks like this call modifies the "targetOutputRecords" variable; if so, it would not match the allocated size of the outgoing batch. For example, if it is made bigger, the code above would try to add rows to the outgoing batch beyond the original allocation size!

Contributor Author

@ppadma ppadma May 31, 2018


It will not make it bigger. It will look at the remaining memory and adjust the row count based on that.

Here is the relevant code from updateInternal function:
final long remainingMemory = Math.max(configOutputBatchSize - memoryUsed, 0);
// These are number of rows we can fit in remaining memory based on new outgoing row width.
final int numOutputRowsRemaining = RecordBatchSizer.safeDivide(remainingMemory, newOutgoingRowWidth);
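A self-contained sketch of that adjustment (safeDivide here is a stand-in for RecordBatchSizer.safeDivide, assumed only to guard against a zero divisor): the target row count shrinks toward what still fits in the configured batch size, never below zero.

```java
// Sketch of the remaining-memory adjustment quoted above; safeDivide is a
// stand-in for RecordBatchSizer.safeDivide, assumed to guard a zero divisor.
public class OutputRowAdjust {

  static int safeDivide(long numerator, long denominator) {
    return denominator == 0 ? 0 : (int) (numerator / denominator);
  }

  public static int rowsRemaining(long configOutputBatchSize, long memoryUsed,
                                  int newOutgoingRowWidth) {
    // Never negative: a half-full outgoing batch cannot grow its allocation.
    final long remainingMemory = Math.max(configOutputBatchSize - memoryUsed, 0);
    // Rows that still fit at the new (possibly wider) outgoing row width.
    return safeDivide(remainingMemory, newOutgoingRowWidth);
  }
}
```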

@@ -85,13 +71,50 @@ public int update(int inputIndex, int outputPosition) {
}

@Override
public RecordBatchSizer.ColumnSize getColumnSize(String name) {
Contributor


Why is the overriding method deleted? It is used by Lateral Join and Merge Join. By deleting it, they will use the one from the superclass (RecordBatchMemoryManager).

Contributor Author


Because it is redundant; the superclass does the same thing.

RecordBatchSizer rightSizer = getRecordBatchSizer(RIGHT_INDEX);
public int update(int inputIndex, int outputPosition, boolean useAggregate) {
switch (inputIndex) {
case LEFT_INDEX:
Contributor


A cleanup suggestion: there are too many "update()" methods, and the LEFT never uses the aggregate while the RIGHT always does. So how about instead:

private int foo(RecordBatch batch, int inputIndex, boolean useAggregate) {
     setRecordBatchSizer(inputIndex, new RecordBatchSizer(batch));
     updateIncomingStats(inputIndex);
     return useAggregate ? (int) getAvgInputRowWidth(inputIndex) : getRecordBatchSizer(inputIndex).getRowAllocSize();
}

public int updateRight(RecordBatch batch, int outputPosition) {
   rightRowWidth = foo(batch, RIGHT_INDEX, true);
   return updateInternal(outputPosition);
}

public int updateLeft(RecordBatch batch, int outputPosition) {
   leftRowWidth = foo(batch, LEFT_INDEX, false);
   return updateInternal(outputPosition);
}

Contributor Author


I rearranged the code and got rid of "left" and "right". Instead, I am using an array called rowWidth, indexed by the input index, which is better now. Unfortunately, each operator calls update with different parameters, so we have different versions of the same function.

Also, the right side does not always "use aggregate"; it depends on the operator. For example, for merge join we do not use the aggregate; it is batch by batch.

Contributor

@Ben-Zvi Ben-Zvi left a comment


+1


4 participants