Skip to content

Race condition in Sunflow benchmark #258

@CRefice

Description

@CRefice

I found a race condition that can occur when running the sunflow benchmark on multiple threads, resulting in the following stack trace:

===== DaCapo unknown sunflow starting warmup 1 =====
Exception in thread "Thread-1" java.lang.NullPointerException
	at org.sunflow.core.Geometry.intersect(Geometry.java:93)
	at org.sunflow.core.Instance.intersect(Instance.java:135)
	at org.sunflow.core.InstanceList.intersectPrimitive(InstanceList.java:30)
	at org.sunflow.core.accel.BoundingIntervalHierarchy.intersect(BoundingIntervalHierarchy.java:559)
	at org.sunflow.core.Scene.trace(Scene.java:199)
	at org.sunflow.core.LightServer$1.run(LightServer.java:231)
	at java.base@21.0.1/java.lang.Thread.runWith(Thread.java:1596)
	at java.base@21.0.1/java.lang.Thread.run(Thread.java:1583)
	at org.graalvm.nativeimage.builder/com.oracle.svm.core.thread.PlatformThreads.threadStartRoutine(PlatformThreads.java:784)
	at org.graalvm.nativeimage.builder/com.oracle.svm.core.posix.thread.PosixPlatformThreads.pthreadStartRoutine(PosixPlatformThreads.java:211)
java.lang.reflect.InvocationTargetException
	at java.base@21.0.1/java.lang.reflect.Method.invoke(Method.java:580)
	at org.dacapo.harness.Sunflow.validate(Sunflow.java:79)
	at org.dacapo.harness.Benchmark.run(Benchmark.java:195)
	at org.dacapo.harness.TestHarness.runBenchmark(TestHarness.java:200)
	at org.dacapo.harness.TestHarness.main(TestHarness.java:153)
	at java.base@21.0.1/java.lang.reflect.Method.invoke(Method.java:580)
	at Harness.main(Unknown Source)
	at java.base@21.0.1/java.lang.invoke.LambdaForm$DMH/sa346b79c.invokeStaticInit(LambdaForm$DMH)
Caused by: java.lang.RuntimeException: Image check failed! - #errors: 89833
	at org.sunflow.Benchmark.print(Benchmark.java:252)
	at org.sunflow.system.UI.printError(UI.java:63)
	at org.sunflow.Benchmark.kernelEnd(Benchmark.java:240)
	... 8 more 

The issue seems to lie in the sunflow project itself and has laid dormant for years.
Taking a look at the implementation of the Geometry class I was able to find the cause of the issue.

The upshot is in this snippet:

    private AccelerationStructure accel;
    private int builtAccel;

    void intersect(Ray r, IntersectionState state) {
        /* [...] */
        if (builtAccel == 0)
            build();
        accel.intersect(r, state); // This line throws NPE due to accel == null
    }

    private synchronized void build() {
        if (builtAccel != 0)
            return;
        if (primitives != null) {
            int n = primitives.getNumPrimitives();
            /* [...] */
            accel = AccelerationStructureFactory.create(acceltype, n, true);
            accel.build(primitives);
        } else {
            accel = new NullAccelerator();
        }
        builtAccel = 1;
    }

Here intersect is being called concurrently from multiple threads.
The problem is builtAccel is a plain old int that is set in a synchronized method, but read in a non-synchronized method.
Because there's no forced ordering between builtAccel and accel, it is possible for accel to be loaded before builtAccel is (or for builtAccel to be set to 1 before accel is initialized), leading to a NPE.

The upstream fix would be to make builtAccel volatile to force a memory barrier between checking and setting builtAccel. I'm not sure how feasible it would be to patch the benchmark, so a workaround that could be applied in the harness would be to somehow call build() on all Geometry instances in the during scene initialization, before any call to intersect is performed.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions