I found a race condition that can occur when running the sunflow benchmark on multiple threads, resulting in the following stack trace:
===== DaCapo unknown sunflow starting warmup 1 =====
Exception in thread "Thread-1" java.lang.NullPointerException
at org.sunflow.core.Geometry.intersect(Geometry.java:93)
at org.sunflow.core.Instance.intersect(Instance.java:135)
at org.sunflow.core.InstanceList.intersectPrimitive(InstanceList.java:30)
at org.sunflow.core.accel.BoundingIntervalHierarchy.intersect(BoundingIntervalHierarchy.java:559)
at org.sunflow.core.Scene.trace(Scene.java:199)
at org.sunflow.core.LightServer$1.run(LightServer.java:231)
at java.base@21.0.1/java.lang.Thread.runWith(Thread.java:1596)
at java.base@21.0.1/java.lang.Thread.run(Thread.java:1583)
at org.graalvm.nativeimage.builder/com.oracle.svm.core.thread.PlatformThreads.threadStartRoutine(PlatformThreads.java:784)
at org.graalvm.nativeimage.builder/com.oracle.svm.core.posix.thread.PosixPlatformThreads.pthreadStartRoutine(PosixPlatformThreads.java:211)
java.lang.reflect.InvocationTargetException
at java.base@21.0.1/java.lang.reflect.Method.invoke(Method.java:580)
at org.dacapo.harness.Sunflow.validate(Sunflow.java:79)
at org.dacapo.harness.Benchmark.run(Benchmark.java:195)
at org.dacapo.harness.TestHarness.runBenchmark(TestHarness.java:200)
at org.dacapo.harness.TestHarness.main(TestHarness.java:153)
at java.base@21.0.1/java.lang.reflect.Method.invoke(Method.java:580)
at Harness.main(Unknown Source)
at java.base@21.0.1/java.lang.invoke.LambdaForm$DMH/sa346b79c.invokeStaticInit(LambdaForm$DMH)
Caused by: java.lang.RuntimeException: Image check failed! - #errors: 89833
at org.sunflow.Benchmark.print(Benchmark.java:252)
at org.sunflow.system.UI.printError(UI.java:63)
at org.sunflow.Benchmark.kernelEnd(Benchmark.java:240)
... 8 more
The issue seems to lie in the sunflow project itself and has laid dormant for years.
Taking a look at the implementation of the Geometry class I was able to find the cause of the issue.
The upshot is in this snippet:
private AccelerationStructure accel;
private int builtAccel;
void intersect(Ray r, IntersectionState state) {
/* [...] */
if (builtAccel == 0)
build();
accel.intersect(r, state); // This line throws NPE due to accel == null
}
private synchronized void build() {
if (builtAccel != 0)
return;
if (primitives != null) {
int n = primitives.getNumPrimitives();
/* [...] */
accel = AccelerationStructureFactory.create(acceltype, n, true);
accel.build(primitives);
} else {
accel = new NullAccelerator();
}
builtAccel = 1;
}
Here intersect is being called concurrently from multiple threads.
The problem is builtAccel is a plain old int that is set in a synchronized method, but read in a non-synchronized method.
Because there's no forced ordering between builtAccel and accel, it is possible for accel to be loaded before builtAccel is (or for builtAccel to be set to 1 before accel is initialized), leading to a NPE.
The upstream fix would be to make builtAccel volatile to force a memory barrier between checking and setting builtAccel. I'm not sure how feasible it would be to patch the benchmark, so a workaround that could be applied in the harness would be to somehow call build() on all Geometry instances in the during scene initialization, before any call to intersect is performed.
I found a race condition that can occur when running the
sunflowbenchmark on multiple threads, resulting in the following stack trace:The issue seems to lie in the sunflow project itself and has laid dormant for years.
Taking a look at the implementation of the Geometry class I was able to find the cause of the issue.
The upshot is in this snippet:
Here
intersectis being called concurrently from multiple threads.The problem is
builtAccelis a plain old int that is set in a synchronized method, but read in a non-synchronized method.Because there's no forced ordering between
builtAccelandaccel, it is possible foraccelto be loaded beforebuiltAccelis (or forbuiltAccelto be set to 1 beforeaccelis initialized), leading to a NPE.The upstream fix would be to make
builtAccelvolatile to force a memory barrier between checking and settingbuiltAccel. I'm not sure how feasible it would be to patch the benchmark, so a workaround that could be applied in the harness would be to somehow callbuild()on allGeometryinstances in the during scene initialization, before any call tointersectis performed.