[SPARK-7081] Faster sort-based shuffle path using binary processing cache-aware sort #5868

Status: Closed. This pull request wanted to merge 96 commits.

The diff shown below reflects the changes from the first 18 commits.

Commits:
81d52c5
WIP on UnsafeSorter
JoshRosen Apr 29, 2015
abf7bfe
Add basic test case.
JoshRosen Apr 29, 2015
57a4ea0
Make initialSize configurable in UnsafeSorter
JoshRosen Apr 30, 2015
e900152
Add test for empty iterator in UnsafeSorter
JoshRosen May 1, 2015
767d3ca
Fix invalid range in UnsafeSorter.
JoshRosen May 1, 2015
3db12de
Minor simplification and sanity checks in UnsafeSorter
JoshRosen May 1, 2015
4d2f5e1
WIP
JoshRosen May 1, 2015
8e3ec20
Begin code cleanup.
JoshRosen May 1, 2015
253f13e
More cleanup
JoshRosen May 1, 2015
9c6cf58
Refactor to use DiskBlockObjectWriter.
JoshRosen May 1, 2015
e267cee
Fix compilation of UnsafeSorterSuite
JoshRosen May 1, 2015
e2d96ca
Expand serializer API and use new function to help control when new U…
JoshRosen May 1, 2015
d3cc310
Flag that SparkSqlSerializer2 supports relocation
JoshRosen May 1, 2015
87e721b
Renaming and comments
JoshRosen May 1, 2015
0748458
Port UnsafeShuffleWriter to Java.
JoshRosen May 2, 2015
026b497
Re-use a buffer in UnsafeShuffleWriter
JoshRosen May 2, 2015
1433b42
Store record length as int instead of long.
JoshRosen May 2, 2015
240864c
Remove PrefixComputer and require prefix to be specified as part of i…
JoshRosen May 2, 2015
bfc12d3
Add tests for serializer relocation property.
JoshRosen May 3, 2015
b8a09fe
Back out accidental log4j.properties change
JoshRosen May 3, 2015
c2fca17
Small refactoring of SerializerPropertiesSuite to enable test re-use:
JoshRosen May 3, 2015
f17fa8f
Add missing newline
JoshRosen May 3, 2015
8958584
Fix bug in calculating free space in current page.
JoshRosen May 3, 2015
595923a
Remove some unused variables.
JoshRosen May 3, 2015
5e100b2
Super-messy WIP on external sort
JoshRosen May 4, 2015
2776aca
First passing test for ExternalSorter.
JoshRosen May 4, 2015
f156a8f
Hacky metrics integration; refactor some interfaces.
JoshRosen May 4, 2015
3490512
Misc. cleanup
JoshRosen May 4, 2015
3aeaff7
More refactoring and cleanup; begin cleaning iterator interfaces
JoshRosen May 4, 2015
7ee918e
Re-order imports in tests
JoshRosen May 5, 2015
69232fd
Enable compressible address encoding for off-heap mode.
JoshRosen May 5, 2015
57f1ec0
WIP towards packed record pointers for use in optimized shuffle sort.
JoshRosen May 5, 2015
f480fb2
WIP in mega-refactoring towards shuffle-specific sort.
JoshRosen May 5, 2015
133c8c9
WIP towards testing UnsafeShuffleWriter.
JoshRosen May 5, 2015
4f70141
Fix merging; now passes UnsafeShuffleSuite tests.
JoshRosen May 5, 2015
aaea17b
Add comments to UnsafeShuffleSpillWriter.
JoshRosen May 6, 2015
b674412
Merge remote-tracking branch 'origin/master' into unsafe-sort
JoshRosen May 6, 2015
11feeb6
Update TODOs related to shuffle write metrics.
JoshRosen May 7, 2015
8a6fe52
Rename UnsafeShuffleSpillWriter to UnsafeShuffleExternalSorter
JoshRosen May 7, 2015
cfe0ec4
Address a number of minor review comments:
JoshRosen May 7, 2015
e67f1ea
Remove upper type bound in ShuffleWriter interface.
JoshRosen May 7, 2015
5e8cf75
More minor cleanup
JoshRosen May 7, 2015
1ce1300
More minor cleanup
JoshRosen May 7, 2015
b95e642
Refactor and document logic that decides when to spill.
JoshRosen May 7, 2015
9883e30
Merge remote-tracking branch 'origin/master' into unsafe-sort
JoshRosen May 8, 2015
722849b
Add workaround for transferTo() bug in merging code; refactor tests.
JoshRosen May 8, 2015
7cd013b
Begin refactoring to enable proper tests for spilling.
JoshRosen May 9, 2015
9b7ebed
More defensive programming RE: cleaning up spill files and memory aft…
JoshRosen May 9, 2015
e8718dd
Merge remote-tracking branch 'origin/master' into unsafe-sort
JoshRosen May 9, 2015
1929a74
Update to reflect upstream ShuffleBlockManager -> ShuffleBlockResolve…
JoshRosen May 9, 2015
01afc74
Actually read data in UnsafeShuffleWriterSuite
JoshRosen May 10, 2015
8f5061a
Strengthen assertion to check partitioning
JoshRosen May 10, 2015
67d25ba
Update Exchange operator's copying logic to account for new shuffle m…
JoshRosen May 10, 2015
fd4bb9e
Use own ByteBufferOutputStream rather than Kryo's
JoshRosen May 10, 2015
9d1ee7c
Fix MiMa excludes for ShuffleWriter change
JoshRosen May 10, 2015
fcd9a3c
Add notes + tests for maximum record / page sizes.
JoshRosen May 10, 2015
27b18b0
Test for inserting records AT the max record size.
JoshRosen May 10, 2015
4a01c45
Remove unnecessary log message
JoshRosen May 10, 2015
f780fb1
Add test demonstrating which compression codecs support concatenation.
JoshRosen May 11, 2015
b57c17f
Disable some overly-verbose logs that rendered DEBUG useless.
JoshRosen May 11, 2015
1ef56c7
Revise compression codec support in merger; test cross product of con…
JoshRosen May 11, 2015
b3b1924
Properly implement close() and flush() in DummySerializerInstance.
JoshRosen May 11, 2015
0d4d199
Bump up shuffle.memoryFraction to make tests pass.
JoshRosen May 11, 2015
ec6d626
Add notes on maximum # of supported shuffle partitions.
JoshRosen May 11, 2015
ae538dc
Document UnsafeShuffleManager.
JoshRosen May 11, 2015
ea4f85f
Roll back an unnecessary change in Spillable.
JoshRosen May 11, 2015
1e3ad52
Delete unused ByteBufferOutputStream class.
JoshRosen May 11, 2015
39434f9
Avoid integer multiplication overflow in getMemoryUsage (thanks FindB…
JoshRosen May 11, 2015
e1855e5
Fix a handful of misc. IntelliJ inspections
JoshRosen May 11, 2015
7c953f9
Add test that covers UnsafeShuffleSortDataFormat.swap().
JoshRosen May 11, 2015
8531286
Add tests that automatically trigger spills.
JoshRosen May 11, 2015
69d5899
Remove some unnecessary override vals
JoshRosen May 11, 2015
d4e6d89
Update to bit shifting constants
JoshRosen May 11, 2015
4f0b770
Attempt to implement proper shuffle write metrics.
JoshRosen May 12, 2015
e58a6b4
Add more tests for PackedRecordPointer encoding.
JoshRosen May 12, 2015
e995d1a
Introduce MAX_SHUFFLE_OUTPUT_PARTITIONS.
JoshRosen May 12, 2015
56781a1
Rename UnsafeShuffleSorter to UnsafeShuffleInMemorySorter
JoshRosen May 12, 2015
0ad34da
Fix off-by-one in nextInt() call
JoshRosen May 12, 2015
85da63f
Cleanup in UnsafeShuffleSorterIterator.
JoshRosen May 12, 2015
fdcac08
Guard against overflow when expanding sort buffer.
JoshRosen May 12, 2015
2d4e4f4
Address some minor comments in UnsafeShuffleExternalSorter.
JoshRosen May 12, 2015
57312c9
Clarify fileBufferSize units
JoshRosen May 12, 2015
6276168
Remove ability to disable spilling in UnsafeShuffleExternalSorter.
JoshRosen May 12, 2015
4a2c785
rename 'sort buffer' to 'pointer array'
JoshRosen May 12, 2015
e3b8855
Cleanup in UnsafeShuffleWriter
JoshRosen May 12, 2015
c2ce78e
Fix a missed usage of MAX_PARTITION_ID
JoshRosen May 12, 2015
d5779c6
Merge remote-tracking branch 'origin/master' into unsafe-sort
JoshRosen May 12, 2015
5e189c6
Track time spent closing / flushing files; split TimeTrackingOutputSt…
JoshRosen May 12, 2015
df07699
Attempt to clarify confusing metrics update code
JoshRosen May 12, 2015
de40b9d
More comments to try to explain metrics code
JoshRosen May 12, 2015
4023fa4
Add @Private annotation to some Java classes.
JoshRosen May 12, 2015
51812a7
Change shuffle manager sort name to tungsten-sort
JoshRosen May 13, 2015
52a9981
Fix some bugs in the address packing code.
JoshRosen May 13, 2015
d494ffe
Fix deserialization of JavaSerializer instances.
JoshRosen May 13, 2015
7610f2f
Add tests for proper cleanup of shuffle data.
JoshRosen May 13, 2015
ef0a86e
Fix scalastyle errors
JoshRosen May 13, 2015
1 change: 1 addition & 0 deletions .gitignore
@@ -29,6 +29,7 @@ conf/*.properties
 conf/*.conf
 conf/*.xml
 conf/slaves
+core/build/py4j/
 docs/_site
 docs/api
 target/
265 additions & 0 deletions UnsafeShuffleWriter.java (new file, package org.apache.spark.shuffle.unsafe)
@@ -0,0 +1,265 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.spark.shuffle.unsafe;

import scala.Option;
import scala.Product2;
import scala.reflect.ClassTag;
import scala.reflect.ClassTag$;

import java.io.File;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Iterator;
import java.util.LinkedList;

import com.esotericsoftware.kryo.io.ByteBufferOutputStream;

import org.apache.spark.Partitioner;
import org.apache.spark.ShuffleDependency;
import org.apache.spark.SparkEnv;
import org.apache.spark.TaskContext;
import org.apache.spark.executor.ShuffleWriteMetrics;
import org.apache.spark.scheduler.MapStatus;
import org.apache.spark.scheduler.MapStatus$;
import org.apache.spark.serializer.SerializationStream;
import org.apache.spark.serializer.Serializer;
import org.apache.spark.serializer.SerializerInstance;
import org.apache.spark.shuffle.IndexShuffleBlockManager;
import org.apache.spark.shuffle.ShuffleWriter;
import org.apache.spark.storage.BlockManager;
import org.apache.spark.storage.BlockObjectWriter;
import org.apache.spark.storage.ShuffleBlockId;
import org.apache.spark.unsafe.PlatformDependent;
import org.apache.spark.unsafe.memory.MemoryBlock;
import org.apache.spark.unsafe.memory.TaskMemoryManager;
import org.apache.spark.unsafe.sort.UnsafeSorter;
import static org.apache.spark.unsafe.sort.UnsafeSorter.*;

// IntelliJ gets confused and claims that this class should be abstract, but this actually compiles
public class UnsafeShuffleWriter<K, V> implements ShuffleWriter<K, V> {

private static final int PAGE_SIZE = 1024 * 1024; // TODO: tune this
private static final int SER_BUFFER_SIZE = 1024 * 1024; // TODO: tune this
private static final ClassTag<Object> OBJECT_CLASS_TAG = ClassTag$.MODULE$.Object();

private final IndexShuffleBlockManager shuffleBlockManager;
private final BlockManager blockManager = SparkEnv.get().blockManager();
private final int shuffleId;
private final int mapId;
private final TaskMemoryManager memoryManager;
private final SerializerInstance serializer;
private final Partitioner partitioner;
private final ShuffleWriteMetrics writeMetrics;
private final LinkedList<MemoryBlock> allocatedPages = new LinkedList<MemoryBlock>();
private final int fileBufferSize;
private MapStatus mapStatus = null;

private MemoryBlock currentPage = null;
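// Absolute write position within the current page (an offset into the page's base object).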
private long currentPagePosition = PAGE_SIZE;

/**
* Are we in the process of stopping? Because map tasks can call stop() with success = true
* and then call stop() with success = false if they get an exception, we want to make sure
* we don't try to delete files, etc., twice.
*/
private boolean stopping = false;

public UnsafeShuffleWriter(
IndexShuffleBlockManager shuffleBlockManager,
UnsafeShuffleHandle<K, V> handle,
int mapId,
TaskContext context) {
this.shuffleBlockManager = shuffleBlockManager;
this.mapId = mapId;
this.memoryManager = context.taskMemoryManager();
final ShuffleDependency<K, V, V> dep = handle.dependency();
this.shuffleId = dep.shuffleId();
this.serializer = Serializer.getSerializer(dep.serializer()).newInstance();
this.partitioner = dep.partitioner();
this.writeMetrics = new ShuffleWriteMetrics();
context.taskMetrics().shuffleWriteMetrics_$eq(Option.apply(writeMetrics));
this.fileBufferSize =
// Use getSizeAsKb (not bytes) to maintain backwards compatibility for units
(int) SparkEnv.get().conf().getSizeAsKb("spark.shuffle.file.buffer", "32k") * 1024;
}

public void write(scala.collection.Iterator<Product2<K, V>> records) {
try {
final long[] partitionLengths = writeSortedRecordsToFile(sortRecords(records));
shuffleBlockManager.writeIndexFile(shuffleId, mapId, partitionLengths);
mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
} catch (Exception e) {
Review comment (Contributor):

Do you want to catch Throwable or Exception?

Reply (JoshRosen, Contributor, Author):

Catching Throwable is risky if we can't re-throw it, since that might cause us to drop an OutOfMemoryError. Instead, I think that we can use a finally block that checks a boolean flag to see whether cleanup needs to be performed. This avoids the bad practice in the current code of throwing from a finally block.
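
A minimal sketch of the flag-plus-finally pattern described in the reply (doWrite and cleanupSpillFilesAndMemory are hypothetical names, not from the patch):

boolean writeSucceeded = false;
try {
  // The write may throw anything, including OutOfMemoryError.
  doWrite();
  writeSucceeded = true;
} finally {
  if (!writeSucceeded) {
    // Best-effort cleanup; the original error keeps propagating and is never swallowed,
    // and nothing is thrown from inside the finally block itself.
    cleanupSpillFilesAndMemory();
  }
}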

PlatformDependent.throwException(e);
}
}

private void ensureSpaceInDataPage(long requiredSpace) throws Exception {
if (requiredSpace > PAGE_SIZE) {
// TODO: throw a more specific exception?
throw new Exception("Required space " + requiredSpace + " is greater than page size (" +
PAGE_SIZE + ")");
} else if (currentPage == null ||
requiredSpace > PAGE_SIZE - (currentPagePosition - currentPage.getBaseOffset())) {
// Not enough room in the current page, so start a new one. Note that currentPagePosition
// is an absolute offset into the page's base object, so the page's base offset must be
// subtracted when computing how many bytes of the page have already been used.
currentPage = memoryManager.allocatePage(PAGE_SIZE);
currentPagePosition = currentPage.getBaseOffset();
allocatedPages.add(currentPage);
}
}

private void freeMemory() {
final Iterator<MemoryBlock> iter = allocatedPages.iterator();
while (iter.hasNext()) {
memoryManager.freePage(iter.next());
iter.remove();
}
}

private Iterator<RecordPointerAndKeyPrefix> sortRecords(
scala.collection.Iterator<? extends Product2<K, V>> records) throws Exception {
final UnsafeSorter sorter = new UnsafeSorter(
memoryManager,
RECORD_COMPARATOR,
PREFIX_COMPARATOR,
4096 // Initial size (TODO: tune this!)
);

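// Each record is first serialized into this fixed-size buffer, so a single record larger
// than SER_BUFFER_SIZE cannot be handled by this code path.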
final byte[] serArray = new byte[SER_BUFFER_SIZE];
final ByteBuffer serByteBuffer = ByteBuffer.wrap(serArray);
// TODO: we should not depend on this class from Kryo; copy its source or find an alternative
final SerializationStream serOutputStream =
serializer.serializeStream(new ByteBufferOutputStream(serByteBuffer));

while (records.hasNext()) {
final Product2<K, V> record = records.next();
final K key = record._1();
final int partitionId = partitioner.getPartition(key);
serByteBuffer.position(0);
serOutputStream.writeKey(key, OBJECT_CLASS_TAG);
serOutputStream.writeValue(record._2(), OBJECT_CLASS_TAG);
serOutputStream.flush();

final int serializedRecordSize = serByteBuffer.position();
assert (serializedRecordSize > 0);
// Need 4 bytes to store the record length.
ensureSpaceInDataPage(serializedRecordSize + 4);

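// Record layout within the page: [4-byte length][serialized key and value bytes].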
final long recordAddress =
memoryManager.encodePageNumberAndOffset(currentPage, currentPagePosition);
final Object baseObject = currentPage.getBaseObject();
PlatformDependent.UNSAFE.putInt(baseObject, currentPagePosition, serializedRecordSize);
currentPagePosition += 4;
PlatformDependent.copyMemory(
serArray,
PlatformDependent.BYTE_ARRAY_OFFSET,
baseObject,
currentPagePosition,
serializedRecordSize);
currentPagePosition += serializedRecordSize;

sorter.insertRecord(recordAddress, partitionId);
}

return sorter.getSortedIterator();
}

private long[] writeSortedRecordsToFile(
Iterator<RecordPointerAndKeyPrefix> sortedRecords) throws IOException {
final File outputFile = shuffleBlockManager.getDataFile(shuffleId, mapId);
final ShuffleBlockId blockId =
new ShuffleBlockId(shuffleId, mapId, IndexShuffleBlockManager.NOOP_REDUCE_ID());
final long[] partitionLengths = new long[partitioner.numPartitions()];

int currentPartition = -1;
BlockObjectWriter writer = null;

final byte[] arr = new byte[SER_BUFFER_SIZE];
while (sortedRecords.hasNext()) {
final RecordPointerAndKeyPrefix recordPointer = sortedRecords.next();
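// The key prefix is the record's partition id (see sortRecords()), so iterating in sorted
// order yields records grouped by partition.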
final int partition = (int) recordPointer.keyPrefix;
assert (partition >= currentPartition);
if (partition != currentPartition) {
// Switch to the new partition
if (currentPartition != -1) {
writer.commitAndClose();
partitionLengths[currentPartition] = writer.fileSegment().length();
}
currentPartition = partition;
writer =
blockManager.getDiskWriter(blockId, outputFile, serializer, fileBufferSize, writeMetrics);
}

final Object baseObject = memoryManager.getPage(recordPointer.recordPointer);
final long baseOffset = memoryManager.getOffsetInPage(recordPointer.recordPointer);
// The length was written with putInt(), so read it back as a 4-byte int (not a long).
final int recordLength = PlatformDependent.UNSAFE.getInt(baseObject, baseOffset);
PlatformDependent.copyMemory(
baseObject,
baseOffset + 4,
arr,
PlatformDependent.BYTE_ARRAY_OFFSET,
recordLength);
assert (writer != null); // To suppress an IntelliJ warning
writer.write(arr, 0, recordLength);
// TODO: add a test that detects whether we leave this call out:
writer.recordWritten();
}

if (writer != null) {
writer.commitAndClose();
partitionLengths[currentPartition] = writer.fileSegment().length();
}

return partitionLengths;
}

@Override
public Option<MapStatus> stop(boolean success) {
try {
if (stopping) {
return Option.apply(null);
} else {
stopping = true;
// Memory is freed by the finally block below, so no explicit freeMemory() call is needed here.
if (success) {
return Option.apply(mapStatus);
} else {
// The map task failed, so delete our output data.
shuffleBlockManager.removeDataByMap(shuffleId, mapId);
return Option.apply(null);
}
}
} finally {
freeMemory();
// TODO: increment the shuffle write time metrics
}
}

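// Records are sorted solely by their partition-id key prefix, so the record comparator never
// needs to break ties and can treat all records as equal.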
private static final RecordComparator RECORD_COMPARATOR = new RecordComparator() {
@Override
public int compare(
Object leftBaseObject, long leftBaseOffset, Object rightBaseObject, long rightBaseOffset) {
return 0;
}
};

private static final PrefixComparator PREFIX_COMPARATOR = new PrefixComparator() {
@Override
public int compare(long prefix1, long prefix2) {
// Compare explicitly instead of subtracting: subtraction can overflow for arbitrary longs.
// (The prefixes here are partition ids, but the explicit form is cheap and always safe.)
return prefix1 < prefix2 ? -1 : (prefix1 > prefix2 ? 1 : 0);
}
};
}
80 additions & 0 deletions UnsafeSortDataFormat.java (new file, package org.apache.spark.unsafe.sort)
@@ -0,0 +1,80 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.spark.unsafe.sort;

import static org.apache.spark.unsafe.sort.UnsafeSorter.RecordPointerAndKeyPrefix;
import org.apache.spark.util.collection.SortDataFormat;

/**
* Supports sorting an array of (record pointer, key prefix) pairs. Used in {@link UnsafeSorter}.
*
* Within each long[] buffer, position {@code 2 * i} holds a pointer to the record at
* index {@code i}, while position {@code 2 * i + 1} holds that record's 8-byte key prefix.
*/
final class UnsafeSortDataFormat extends SortDataFormat<RecordPointerAndKeyPrefix, long[]> {

public static final UnsafeSortDataFormat INSTANCE = new UnsafeSortDataFormat();

private UnsafeSortDataFormat() { }

@Override
public RecordPointerAndKeyPrefix getKey(long[] data, int pos) {
// Since we re-use keys, this method shouldn't be called.
throw new UnsupportedOperationException();
}

@Override
public RecordPointerAndKeyPrefix newKey() {
return new RecordPointerAndKeyPrefix();
}

@Override
public RecordPointerAndKeyPrefix getKey(long[] data, int pos, RecordPointerAndKeyPrefix reuse) {
reuse.recordPointer = data[pos * 2];
reuse.keyPrefix = data[pos * 2 + 1];
return reuse;
}

@Override
public void swap(long[] data, int pos0, int pos1) {
long tempPointer = data[pos0 * 2];
long tempKeyPrefix = data[pos0 * 2 + 1];
data[pos0 * 2] = data[pos1 * 2];
data[pos0 * 2 + 1] = data[pos1 * 2 + 1];
data[pos1 * 2] = tempPointer;
data[pos1 * 2 + 1] = tempKeyPrefix;
}

@Override
public void copyElement(long[] src, int srcPos, long[] dst, int dstPos) {
dst[dstPos * 2] = src[srcPos * 2];
dst[dstPos * 2 + 1] = src[srcPos * 2 + 1];
}

@Override
public void copyRange(long[] src, int srcPos, long[] dst, int dstPos, int length) {
System.arraycopy(src, srcPos * 2, dst, dstPos * 2, length * 2);
}

@Override
public long[] allocate(int length) {
assert (length < Integer.MAX_VALUE / 2) : "Length " + length + " is too large";
return new long[length * 2];
}

}
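
A minimal sketch of how this format might be driven by Spark's generic org.apache.spark.util.collection.Sorter (the helper method and comparator below are illustrative only; in this patch, UnsafeSorter owns the buffer and supplies its own comparators, and this sketch assumes package-level access to UnsafeSortDataFormat):

// Sorts the first numRecords (pointer, prefix) pairs of the buffer by key prefix.
static long[] sortPointerArray(long[] buffer, int numRecords) {
  Sorter<RecordPointerAndKeyPrefix, long[]> sorter =
    new Sorter<RecordPointerAndKeyPrefix, long[]>(UnsafeSortDataFormat.INSTANCE);
  sorter.sort(buffer, 0, numRecords, new java.util.Comparator<RecordPointerAndKeyPrefix>() {
    @Override
    public int compare(RecordPointerAndKeyPrefix a, RecordPointerAndKeyPrefix b) {
      return a.keyPrefix < b.keyPrefix ? -1 : (a.keyPrefix > b.keyPrefix ? 1 : 0);
    }
  });
  return buffer;
}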