[SPARK-18819][CORE] Double byte alignment on ARM platforms #16403

Closed
wants to merge 4 commits into
from

Conversation

Projects
None yet
4 participants
@michaelkamprath

What changes were proposed in this pull request?

On 32-bit ARM platforms, doubles require 8-byte alignment. The use of Java's sun.misc.Unsafe library for memory block management exposes Spark to misaligned doubles when using org.apache.spark.unsafe.Platform.getDouble() to extract a double from a memory block with arbitrary offset locations. For the alignment sensitive platforms, this proposed change creates an properly aligned buffer that the bytes to be cast to a double are first copied to, then the double the double is extracted via the sun.misc.Unsafe.getDouble() method. On all other platforms the previous behavior is maintained, which was to cast the double value from the memory block regardless of offset alignment .

How was this patch tested?

All unit tests pass, plus tested on both x86 and ARM71 platforms. For the ARM71 platform, it is verified that this fix removes the problem as described in SPARK-18819.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Dec 26, 2016

Can one of the admins verify this patch?

Can one of the admins verify this patch?

@srowen

I think the question will be whether this can be supported with virtually 0 overhead to the x64 case.

@@ -44,6 +49,9 @@
public static final int DOUBLE_ARRAY_OFFSET;
private static final boolean unaligned;
+
+ private static final MemoryBlock _doubleBuffer;

This comment has been minimized.

@srowen

srowen Dec 27, 2016

Member

This is a globally shared variable -- that's not going to work is it? reads would all corrupt each other.

@srowen

srowen Dec 27, 2016

Member

This is a globally shared variable -- that's not going to work is it? reads would all corrupt each other.

This comment has been minimized.

@michaelkamprath

michaelkamprath Dec 28, 2016

Admittedly, I am not deeply familiar with how this class is being used concurrently within a single JVM. If multiple threads are using it concurrently, you are right and a different approach must be found. I was trying to avoid the cost of reallocating a buffer with every call to getDouble().

@michaelkamprath

michaelkamprath Dec 28, 2016

Admittedly, I am not deeply familiar with how this class is being used concurrently within a single JVM. If multiple threads are using it concurrently, you are right and a different approach must be found. I was trying to avoid the cost of reallocating a buffer with every call to getDouble().

@@ -119,7 +127,12 @@ public static void putFloat(Object object, long offset, float value) {
}
public static double getDouble(Object object, long offset) {
- return _UNSAFE.getDouble(object, offset);
+ if ( null == _doubleBuffer) {

This comment has been minimized.

@srowen

srowen Dec 27, 2016

Member

Style nit:

if (_doubleBuffer == null) {
@srowen

srowen Dec 27, 2016

Member

Style nit:

if (_doubleBuffer == null) {
+ byte[] heapObj = new byte[16];
+ long offset = BYTE_ARRAY_OFFSET;
+ long bufferSize = 16;
+ for (long i = 0; i < 8; ++i ) {

This comment has been minimized.

@srowen

srowen Dec 27, 2016

Member

Spacing:

for (long i = 0; i < 8; i++) {
  if ((offset + i) % 8 == 0) {

Why a long?

@srowen

srowen Dec 27, 2016

Member

Spacing:

for (long i = 0; i < 8; i++) {
  if ((offset + i) % 8 == 0) {

Why a long?

This comment has been minimized.

@michaelkamprath

michaelkamprath Dec 28, 2016

Simply to match the offset variable, which needs to be a long.

@michaelkamprath

michaelkamprath Dec 28, 2016

Simply to match the offset variable, which needs to be a long.

+ return _UNSAFE.getDouble(object, offset);
+ } else {
+ copyMemory(object, offset, _doubleBuffer.getBaseObject(), _doubleBuffer.getBaseOffset(), 8);
+ return _UNSAFE.getDouble(_doubleBuffer.getBaseObject(), _doubleBuffer.getBaseOffset());

This comment has been minimized.

@kiszk

kiszk Dec 27, 2016

Contributor

If getLong() never causes an alignment error, how about Double.longBitsToDouble(_UNSAFE.getLong()).
While I am not familiar with ARM architecture, I am curious which one is faster?

@kiszk

kiszk Dec 27, 2016

Contributor

If getLong() never causes an alignment error, how about Double.longBitsToDouble(_UNSAFE.getLong()).
While I am not familiar with ARM architecture, I am curious which one is faster?

This comment has been minimized.

@michaelkamprath

michaelkamprath Dec 28, 2016

I will test this idea and report back. Thanks!

@michaelkamprath

michaelkamprath Dec 28, 2016

I will test this idea and report back. Thanks!

+ offset += i;
+ bufferSize -= i;
+ break;
+ }

This comment has been minimized.

@kiszk

kiszk Dec 27, 2016

Contributor

Can we assign null into _doubleBuffer if offset % 8 == 0?

@kiszk

kiszk Dec 27, 2016

Contributor

Can we assign null into _doubleBuffer if offset % 8 == 0?

This comment has been minimized.

@michaelkamprath

michaelkamprath Dec 28, 2016

No, we can't, because we would still need the buffer to avoid doing the direct double read from an unaligned memory location on line 131.

The reason for finding an aligned offset here is because on ARM7, the actual memory address for the byte[0] location is not the memory address of the byte[] object. That starting point for the 0-index item is indicated by Platform.BYTE_ARRAY_OFFSET, which is the Java overhead memory used for managing the byte[] object. So if we read from byte[0], that could be unaligned because of the Java overhead, hence the reason to find the index in the byte buffer where overhead+index is aligned. I am depending on the Java behavior of aligning on objects (here, the byte[] object) for only needing to check that the offset is also aligned.

@michaelkamprath

michaelkamprath Dec 28, 2016

No, we can't, because we would still need the buffer to avoid doing the direct double read from an unaligned memory location on line 131.

The reason for finding an aligned offset here is because on ARM7, the actual memory address for the byte[0] location is not the memory address of the byte[] object. That starting point for the 0-index item is indicated by Platform.BYTE_ARRAY_OFFSET, which is the Java overhead memory used for managing the byte[] object. So if we read from byte[0], that could be unaligned because of the Java overhead, hence the reason to find the index in the byte buffer where overhead+index is aligned. I am depending on the Java behavior of aligning on objects (here, the byte[] object) for only needing to check that the offset is also aligned.

@kiszk

This comment has been minimized.

Show comment
Hide comment
@kiszk

kiszk Dec 28, 2016

Contributor

While we ideally expect static final is evaluated as a constant by a JIT compiler, I do not know the fact. I would appreciate someone would look at the binary code generated by OpenJDK since I cannot do it due to some reasons.

Contributor

kiszk commented Dec 28, 2016

While we ideally expect static final is evaluated as a constant by a JIT compiler, I do not know the fact. I would appreciate someone would look at the binary code generated by OpenJDK since I cannot do it due to some reasons.

@michaelkamprath

This comment has been minimized.

Show comment
Hide comment
@michaelkamprath

michaelkamprath Dec 28, 2016

I've refactored this issue's solution using the idea that @kiszk supplied above. This ended up being a better idea, thanks. It produced cleaner code than my original allocated buffer approach. Also, based on the input that this solution should have zero impact on the x86 platform, I've refactored to a functor design pattern which creates the equivalent performance behavior for x86 as before this fix.

The performance impact of the using an intermediate long as a buffer when fetching a double from a memory block looks to be about 10%. I ran some performance tests writing and reading random doubles to random aligned locations in a buffer 100M times over. On my mac laptop and an ARM unit, the performance difference was the long-buffered approach was about 10% slower than the original direct fetch approach. However, since the this refactor isolates the fetching behavior into a functor, the 10% slowdown only affects the ARM platform.

I also added some unit tests if only to convince myself I hadn't broken anything.

I've refactored this issue's solution using the idea that @kiszk supplied above. This ended up being a better idea, thanks. It produced cleaner code than my original allocated buffer approach. Also, based on the input that this solution should have zero impact on the x86 platform, I've refactored to a functor design pattern which creates the equivalent performance behavior for x86 as before this fix.

The performance impact of the using an intermediate long as a buffer when fetching a double from a memory block looks to be about 10%. I ran some performance tests writing and reading random doubles to random aligned locations in a buffer 100M times over. On my mac laptop and an ARM unit, the performance difference was the long-buffered approach was about 10% slower than the original direct fetch approach. However, since the this refactor isolates the fetching behavior into a functor, the 10% slowdown only affects the ARM platform.

I also added some unit tests if only to convince myself I hadn't broken anything.

@srowen

Still needs a fair bit of work, but might be more feasible. I am not sure about the overhead of the extra dispatch here unless we can get it to inline reliably. The hit for x64 may be too much.

+
+import sun.misc.Unsafe;
+
+public class DoubleAccessBuffered implements DoubleAccessFunctor {

This comment has been minimized.

@srowen

srowen Dec 28, 2016

Member

(Why public?)

@srowen

srowen Dec 28, 2016

Member

(Why public?)

This comment has been minimized.

@michaelkamprath

michaelkamprath Dec 28, 2016

No reason but forgot to uncheck the Eclipse modifiers when creating a new class. Fixed.

@michaelkamprath

michaelkamprath Dec 28, 2016

No reason but forgot to uncheck the Eclipse modifiers when creating a new class. Fixed.

+
+import sun.misc.Unsafe;
+
+class DoubleAccessDirect implements DoubleAccessFunctor {

This comment has been minimized.

@srowen

srowen Dec 28, 2016

Member

Implementations should be final for completeness and to maximize chance of JIT inlining.

@srowen

srowen Dec 28, 2016

Member

Implementations should be final for completeness and to maximize chance of JIT inlining.

This comment has been minimized.

+ _doubleAccessFunc = new DoubleAccessBuffered();
+ } else {
+ // set the buffer address to null to indicate no buffer allocated
+ _doubleAccessFunc = new DoubleAccessDirect();;

This comment has been minimized.

@srowen

srowen Dec 28, 2016

Member

extra semicolon

@srowen

srowen Dec 28, 2016

Member

extra semicolon

This comment has been minimized.

@@ -44,6 +48,9 @@
public static final int DOUBLE_ARRAY_OFFSET;
private static final boolean unaligned;
+
+ private static final DoubleAccessFunctor _doubleAccessFunc;

This comment has been minimized.

@srowen

srowen Dec 28, 2016

Member

Java/Scala wouldn't prefix members with _

@srowen

srowen Dec 28, 2016

Member

Java/Scala wouldn't prefix members with _

This comment has been minimized.

+ arch);
+ _doubleAccessFunc = new DoubleAccessBuffered();
+ } else {
+ // set the buffer address to null to indicate no buffer allocated

This comment has been minimized.

@srowen

srowen Dec 28, 2016

Member

The comment seems wrong

@srowen

srowen Dec 28, 2016

Member

The comment seems wrong

This comment has been minimized.

@michaelkamprath

michaelkamprath Dec 28, 2016

removed (vestige from prior approach)

@michaelkamprath

michaelkamprath Dec 28, 2016

removed (vestige from prior approach)

@@ -252,6 +272,7 @@ public static void throwException(Throwable t) {
LONG_ARRAY_OFFSET = 0;
FLOAT_ARRAY_OFFSET = 0;
DOUBLE_ARRAY_OFFSET = 0;
+ _doubleAccessFunc = null;

This comment has been minimized.

@srowen

srowen Dec 28, 2016

Member

Is this supposed to be null now rather than just the other implementation?

@srowen

srowen Dec 28, 2016

Member

Is this supposed to be null now rather than just the other implementation?

This comment has been minimized.

@michaelkamprath

michaelkamprath Dec 28, 2016

I am setting it to null to keep in line with the behavior of the _UNSAFE class variable. In this case, _UNSAFE is null. In that case, any calls to the Platform.get... or Platform.put... will result in a NullPointerException. I'm assuming it is a programming error elsewhere to call these methods without checking Platform.unaligned() ( example ).

@michaelkamprath

michaelkamprath Dec 28, 2016

I am setting it to null to keep in line with the behavior of the _UNSAFE class variable. In this case, _UNSAFE is null. In that case, any calls to the Platform.get... or Platform.put... will result in a NullPointerException. I'm assuming it is a programming error elsewhere to call these methods without checking Platform.unaligned() ( example ).

+ // write a double to an unaligned location
+ long unalignedOffset = testBuffer.getBaseOffset() + 3;
+ // double check unalignment just to be sure:
+ if (unalignedOffset%8 == 0) {

This comment has been minimized.

@srowen

srowen Dec 28, 2016

Member

Space around operators like %

@srowen

srowen Dec 28, 2016

Member

Space around operators like %

This comment has been minimized.

+ long unalignedOffset = testBuffer.getBaseOffset() + 3;
+ // double check unalignment just to be sure:
+ if (unalignedOffset%8 == 0) {
+ ++unalignedOffset;

This comment has been minimized.

@srowen

srowen Dec 28, 2016

Member

x++ not ++x unless there is a specific need for preincrement

@srowen

srowen Dec 28, 2016

Member

x++ not ++x unless there is a specific need for preincrement

This comment has been minimized.

+ buffered.getDouble(unsafe, testBuffer.getBaseObject(), alignedOffset),
+ 0.000001);
+ } catch (Throwable cause) {
+ // can't run test because Unsafe is unavailable.

This comment has been minimized.

@srowen

srowen Dec 28, 2016

Member

You can't do this in a test, it will catch any error. Construct the test to deal with this case explicitly.

@srowen

srowen Dec 28, 2016

Member

You can't do this in a test, it will catch any error. Construct the test to deal with this case explicitly.

This comment has been minimized.

@michaelkamprath

michaelkamprath Dec 28, 2016

refactored test to isolate the Unsafe object creation better.

@michaelkamprath

michaelkamprath Dec 28, 2016

refactored test to isolate the Unsafe object creation better.

+
+import sun.misc.Unsafe;
+
+interface DoubleAccessFunctor {

This comment has been minimized.

@kiszk

kiszk Dec 28, 2016

Contributor

Is there any reason to use interface? I think that abstract class is enough for this.
I like virtual call rather than interface call. This is because the code sequnce for virtual call is simpler than that for interface call.

@kiszk

kiszk Dec 28, 2016

Contributor

Is there any reason to use interface? I think that abstract class is enough for this.
I like virtual call rather than interface call. This is because the code sequnce for virtual call is simpler than that for interface call.

This comment has been minimized.

@michaelkamprath

michaelkamprath Dec 28, 2016

No reason other than there was no commonality between the concrete classes other than the interface. I will switch to an abstract class given your performance input.

@michaelkamprath

michaelkamprath Dec 28, 2016

No reason other than there was no commonality between the concrete classes other than the interface. I will switch to an abstract class given your performance input.

This comment has been minimized.

@srowen

srowen Dec 30, 2016

Member

Interesting @kiszk I had not encountered this before. I would have used an interface for simplicity, but I understand this is an argument about performance in hot code. Design-wise it's virtually identical. According to http://stackoverflow.com/questions/21608240/performance-difference-between-passing-interface-and-class-reloaded this particular case might be just as fast. I think we can leave it unless someone has a strong feeling otherwise.

@srowen

srowen Dec 30, 2016

Member

Interesting @kiszk I had not encountered this before. I would have used an interface for simplicity, but I understand this is an argument about performance in hot code. Design-wise it's virtually identical. According to http://stackoverflow.com/questions/21608240/performance-difference-between-passing-interface-and-class-reloaded this particular case might be just as fast. I think we can leave it unless someone has a strong feeling otherwise.

@srowen

I'm still sort of uneasy about the complexity and tiny overhead in this path. Dumb question, what's the use case for running Spark on ARM? I still think of that as a mobile architecture.

@@ -22,10 +22,14 @@
import java.lang.reflect.Method;
import java.nio.ByteBuffer;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
import sun.misc.Cleaner;
import sun.misc.Unsafe;
public final class Platform {

This comment has been minimized.

@srowen

srowen Dec 30, 2016

Member

BTW there is a whole second copy of this class under the 'sketch' module that I think would also have to be modified. Hm, then the new classes need to be copied too, ugh

@srowen

srowen Dec 30, 2016

Member

BTW there is a whole second copy of this class under the 'sketch' module that I think would also have to be modified. Hm, then the new classes need to be copied too, ugh

This comment has been minimized.

@michaelkamprath

michaelkamprath Dec 30, 2016

I missed that. I'll address that when we determine the final path here (per my comment below).

@michaelkamprath

michaelkamprath Dec 30, 2016

I missed that. I'll address that when we determine the final path here (per my comment below).

+ unsafeField.setAccessible(true);
+ unsafe = (sun.misc.Unsafe) unsafeField.get(null);
+ } catch (Throwable cause) {
+ unsafe = null;

This comment has been minimized.

@srowen

srowen Dec 30, 2016

Member

Could just return here to skip the test?

@srowen

srowen Dec 30, 2016

Member

Could just return here to skip the test?

+
+ // determine whether double access should be aligned.
+ String arch = System.getProperty("os.arch", "");
+ if (arch.matches("^(arm|arm32)")) {

This comment has been minimized.

@kiszk

kiszk Dec 30, 2016

Contributor

What's happen on ARM 64-bit?

@kiszk

kiszk Dec 30, 2016

Contributor

What's happen on ARM 64-bit?

This comment has been minimized.

@michaelkamprath

michaelkamprath Dec 30, 2016

@kiszk I have tested on ARM 64 (aarch64). Any alignment works for double access there, though 8-byte aligned access looks to be about 10% faster than unaligned access. Using an intermediate long buffer (your idea) is about 5% slower than direct access regardless of being aligned or not. In both cases, I tested with an ODROID C2 using Oracle Java 8.

@michaelkamprath

michaelkamprath Dec 30, 2016

@kiszk I have tested on ARM 64 (aarch64). Any alignment works for double access there, though 8-byte aligned access looks to be about 10% faster than unaligned access. Using an intermediate long buffer (your idea) is about 5% slower than direct access regardless of being aligned or not. In both cases, I tested with an ODROID C2 using Oracle Java 8.

This comment has been minimized.

@kiszk

kiszk Dec 30, 2016

Contributor

Thanks for your clarification. I was afraid that ARM 64 may return arm.

@kiszk

kiszk Dec 30, 2016

Contributor

Thanks for your clarification. I was afraid that ARM 64 may return arm.

@michaelkamprath

This comment has been minimized.

Show comment
Hide comment
@michaelkamprath

michaelkamprath Dec 30, 2016

@srowen To answer the use case question, it is primarily academic for learning and testing. Students and researchers build clusters of Raspberry PI, ODROID, or other SBCs to have a cost effective access to a multi-node hardware cluster. Here are some examples of projects. There is even a commercial vendor selling these SBC clusters. In my own case, its being used to economically learn how to deal problems of efficiency (it's easier to spot and work through patterns of inefficiency on constrained systems than full powered systems).

I am personally not aware if there are currently any server-class CPUs that requires double alignment. Double alignment SPARC processors used to be the bane of my existence in the early 2000's, but that was over a decade ago. My understanding is that today x86 supports unaligned double access with a theoretical performance hit that in practice is rarely seen. Typically you never concern yourself with alignment in Java because the JVM takes care of it for you, but here we are delving into the world of Unsafe, which bypasses the protections the JVM provides. Admittedly, it took me a long while to even figure out that my problem was related to alignment because as indicated, I haven't dealt with such issues in over a decade.

With all that said, maybe a better approach here is to create a patch that users can use to create a spark build when they want to run Spark on a system that requires double alignment, which to the best of my knowledge are currently just ARM 32-bit CPUs. That would even let the code change to be more concise, without needing to determine at runtime which method to use. And if ever should arise a server-class CPU with alignment requirements, we know what to do.

michaelkamprath commented Dec 30, 2016

@srowen To answer the use case question, it is primarily academic for learning and testing. Students and researchers build clusters of Raspberry PI, ODROID, or other SBCs to have a cost effective access to a multi-node hardware cluster. Here are some examples of projects. There is even a commercial vendor selling these SBC clusters. In my own case, its being used to economically learn how to deal problems of efficiency (it's easier to spot and work through patterns of inefficiency on constrained systems than full powered systems).

I am personally not aware if there are currently any server-class CPUs that requires double alignment. Double alignment SPARC processors used to be the bane of my existence in the early 2000's, but that was over a decade ago. My understanding is that today x86 supports unaligned double access with a theoretical performance hit that in practice is rarely seen. Typically you never concern yourself with alignment in Java because the JVM takes care of it for you, but here we are delving into the world of Unsafe, which bypasses the protections the JVM provides. Admittedly, it took me a long while to even figure out that my problem was related to alignment because as indicated, I haven't dealt with such issues in over a decade.

With all that said, maybe a better approach here is to create a patch that users can use to create a spark build when they want to run Spark on a system that requires double alignment, which to the best of my knowledge are currently just ARM 32-bit CPUs. That would even let the code change to be more concise, without needing to determine at runtime which method to use. And if ever should arise a server-class CPU with alignment requirements, we know what to do.

@srowen

This comment has been minimized.

Show comment
Hide comment
@srowen

srowen Dec 31, 2016

Member

I can't imagine Raspberry Pi is a good way to run Spark. It won't be cost effective compared to a simple commodity server. Yeah, I'm concerned that this introduces overhead for every other architecture in a hot path, and I am not sure this actually makes Spark work on such an architetcure anyway.

Member

srowen commented Dec 31, 2016

I can't imagine Raspberry Pi is a good way to run Spark. It won't be cost effective compared to a simple commodity server. Yeah, I'm concerned that this introduces overhead for every other architecture in a hot path, and I am not sure this actually makes Spark work on such an architetcure anyway.

@michaelkamprath

This comment has been minimized.

Show comment
Hide comment
@michaelkamprath

michaelkamprath Jan 2, 2017

@srowen No one has any delusions that an ARM platform is the place to do heavy lifting for data. But if your goal is academic in nature (hands on learning how to set up and operate a cluster, learning the API, etc), it's a very economical platform. I do understand your concerns for the core target and totally get why you wouldn't want this fix in the main line, so I will instead pursue a patch approach for those who'd be interested. I'm not sure what that means here ... do you or I close this pull request?

@srowen No one has any delusions that an ARM platform is the place to do heavy lifting for data. But if your goal is academic in nature (hands on learning how to set up and operate a cluster, learning the API, etc), it's a very economical platform. I do understand your concerns for the core target and totally get why you wouldn't want this fix in the main line, so I will instead pursue a patch approach for those who'd be interested. I'm not sure what that means here ... do you or I close this pull request?

[SPARK-18819] refactor as standalone patch
Refactoring this fix as a standalone patch that can be used to
create custom build of Spark that run on ARM32 platforms. Since
these changes will not be merged into the main line, removed
conditional usage of aligned or unaligned double access and made
this code only us aligned double access.
@michaelkamprath

This comment has been minimized.

Show comment
Hide comment
@michaelkamprath

michaelkamprath Jan 21, 2017

I am withdrawing this pull request and will instead publish a patch for those who are interested in using Spark on a platforms requiring byte alignment for double access.

I am withdrawing this pull request and will instead publish a patch for those who are interested in using Spark on a platforms requiring byte alignment for double access.

@michaelkamprath michaelkamprath deleted the michaelkamprath:fix/SPARK-18819_double-alignment branch Jan 21, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment