JNI lazy initialization fails under stage cancellation, causing executor-wide NoClassDefFoundError (AuronCallNativeWrapper) #2118

@Riefu

Description

We encountered an issue in production where Auron native execution fails intermittently on a single executor with:
java.lang.NoClassDefFoundError: Could not initialize class org.apache.auron.jni.AuronCallNativeWrapper

The failure is isolated to a single executor JVM. Other executors on the same node continue to work normally.

Once triggered, all subsequent tasks scheduled on that executor consistently fail with the same error until the executor or application is restarted.


Relevant Logs

Initial failure:

INFO AuronCallNativeWrapper: Initializing native environment (batchSize=10000, memoryFraction=0.6)

INFO Executor: Executor is trying to kill task ..., reason: Stage cancelled

26/03/25 19:01:45 INFO AuronCallNativeWrapper: Initializing native environment (batchSize=10000, memoryFraction=0.6)
26/03/25 19:01:51 INFO Executor: Executor is trying to kill task 9.0 in stage 8157.0 (TID 52888), reason: Stage cancelled
26/03/25 19:01:51 INFO Executor: Executor is trying to kill task 49.0 in stage 8157.0 (TID 52928), reason: Stage cancelled
26/03/25 19:01:51 INFO Executor: Executor is trying to kill task 89.0 in stage 8157.0 (TID 52968), reason: Stage cancelled
26/03/25 19:01:51 INFO Executor: Executor is trying to kill task 29.0 in stage 8157.0 (TID 52908), reason: Stage cancelled
26/03/25 19:01:51 INFO Executor: Executor is trying to kill task 69.0 in stage 8157.0 (TID 52948), reason: Stage cancelled
26/03/25 19:01:51 ERROR Executor: Exception in task 49.0 in stage 8157.0 (TID 52928)
java.lang.NoClassDefFoundError: Could not initialize class org.apache.auron.jni.AuronCallNativeWrapper
at org.apache.spark.sql.auron.NativeHelper$.executeNativePlan(NativeHelper.scala:103)
at org.apache.spark.sql.execution.auron.shuffle.AuronShuffleWriterBase.nativeShuffleWrite(AuronShuffleWriterBase.scala:67)
at org.apache.spark.sql.execution.auron.plan.NativeShuffleExchangeExec$$anon$1.internalWrite(NativeShuffleExchangeExec.scala:187)
at org.apache.spark.sql.execution.auron.plan.NativeShuffleExchangeExec$$anon$1.write(NativeShuffleExchangeExec.scala:131)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
26/03/25 19:01:51 ERROR Executor: Exception in task 69.0 in stage 8157.0 (TID 52948)
java.lang.NoClassDefFoundError: Could not initialize class org.apache.auron.jni.AuronCallNativeWrapper
at org.apache.spark.sql.auron.NativeHelper$.executeNativePlan(NativeHelper.scala:103)
at org.apache.spark.sql.execution.auron.shuffle.AuronShuffleWriterBase.nativeShuffleWrite(AuronShuffleWriterBase.scala:67)
at org.apache.spark.sql.execution.auron.plan.NativeShuffleExchangeExec$$anon$1.internalWrite(NativeShuffleExchangeExec.scala:187)
at org.apache.spark.sql.execution.auron.plan.NativeShuffleExchangeExec$$anon$1.write(NativeShuffleExchangeExec.scala:131)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
26/03/25 19:01:51 ERROR Executor: Exception in task 89.0 in stage 8157.0 (TID 52968)
java.lang.ExceptionInInitializerError
at org.apache.spark.sql.auron.NativeHelper$.executeNativePlan(NativeHelper.scala:103)
at org.apache.spark.sql.execution.auron.shuffle.AuronShuffleWriterBase.nativeShuffleWrite(AuronShuffleWriterBase.scala:67)
at org.apache.spark.sql.execution.auron.plan.NativeShuffleExchangeExec$$anon$1.internalWrite(NativeShuffleExchangeExec.scala:187)
at org.apache.spark.sql.execution.auron.plan.NativeShuffleExchangeExec$$anon$1.write(NativeShuffleExchangeExec.scala:131)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalStateException: error loading native libraries: java.nio.channels.ClosedByInterruptException
at org.apache.auron.jni.SparkAuronAdaptor.loadAuronLib(SparkAuronAdaptor.java:52)
at org.apache.auron.jni.AuronCallNativeWrapper.<clinit>(AuronCallNativeWrapper.java:74)
... 13 more
26/03/25 19:01:51 ERROR Executor: Exception in task 9.0 in stage 8157.0 (TID 52888)
java.lang.NoClassDefFoundError: Could not initialize class org.apache.auron.jni.AuronCallNativeWrapper
at org.apache.spark.sql.auron.NativeHelper$.executeNativePlan(NativeHelper.scala:103)
at org.apache.spark.sql.execution.auron.shuffle.AuronShuffleWriterBase.nativeShuffleWrite(AuronShuffleWriterBase.scala:67)
at org.apache.spark.sql.execution.auron.plan.NativeShuffleExchangeExec$$anon$1.internalWrite(NativeShuffleExchangeExec.scala:187)
at org.apache.spark.sql.execution.auron.plan.NativeShuffleExchangeExec$$anon$1.write(NativeShuffleExchangeExec.scala:131)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
26/03/25 19:01:51 ERROR Executor: Exception in task 29.0 in stage 8157.0 (TID 52908)
java.lang.NoClassDefFoundError: Could not initialize class org.apache.auron.jni.AuronCallNativeWrapper
at org.apache.spark.sql.auron.NativeHelper$.executeNativePlan(NativeHelper.scala:103)
at org.apache.spark.sql.execution.auron.shuffle.AuronShuffleWriterBase.nativeShuffleWrite(AuronShuffleWriterBase.scala:67)
at org.apache.spark.sql.execution.auron.plan.NativeShuffleExchangeExec$$anon$1.internalWrite(NativeShuffleExchangeExec.scala:187)
at org.apache.spark.sql.execution.auron.plan.NativeShuffleExchangeExec$$anon$1.write(NativeShuffleExchangeExec.scala:131)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Subsequent failures:

java.lang.NoClassDefFoundError: Could not initialize class org.apache.auron.jni.AuronCallNativeWrapper


Observed Behavior

  • The issue occurs on only one executor JVM
  • Other executors on the same node are unaffected
  • Tasks fail only when scheduled on that specific executor
  • Restarting the application (or executor) resolves the issue
  • The failure occurs in the native shuffle execution path

Root Cause Analysis (Our Understanding)

Based on logs and behavior, we believe the root cause is:

  1. The Auron native library is lazily initialized in the static initializer of AuronCallNativeWrapper
  2. Multiple tasks attempt initialization concurrently (lock contention on class initialization)
  3. During initialization, Spark cancels the stage and interrupts the task threads
  4. Native library loading (likely involving file/channel operations) throws:
    ClosedByInterruptException
  5. This causes static initialization (<clinit>) to fail with ExceptionInInitializerError
  6. The JVM marks the class as having failed initialization
  7. All subsequent uses of the class in the same executor result in:
    NoClassDefFoundError: Could not initialize class

This effectively poisons the entire executor JVM.
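
The poisoning in steps 5 to 7 is standard JVM behavior and can be reproduced without Auron. Below is a minimal sketch; `FragileNative` and `PoisonedClassDemo` are hypothetical names, and the thrown exception only simulates the interrupted library load:

```java
// Hypothetical demo, not Auron code: shows how one failed static
// initializer makes a class unusable for the lifetime of the JVM.
final class PoisonedClassDemo {
    // Touches FragileNative and reports which Throwable type the JVM raised.
    static String attempt() {
        try {
            FragileNative.handle();
            return "ok";
        } catch (Throwable t) {
            return t.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // The first access runs <clinit>, which fails.
        System.out.println(attempt());
        // Later accesses never re-run <clinit>; the class stays in the failed state.
        System.out.println(attempt());
    }
}

// Stand-in for a class that loads a native library during static initialization.
final class FragileNative {
    // Evaluated inside <clinit>; simulates a load aborted by a thread interrupt.
    private static final long HANDLE = load();

    private static long load() {
        throw new IllegalStateException("error loading native libraries (simulated)");
    }

    static long handle() {
        return HANDLE;
    }
}
```

The first access reports ExceptionInInitializerError (with the original cause attached), and every later access reports NoClassDefFoundError, which matches the mix of errors in the logs above.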


Impact

  • A single executor becomes permanently unusable for Auron tasks
  • Tasks repeatedly fail if scheduled on that executor
  • Requires executor/application restart to recover
  • Can cause prolonged instability (we observed roughly 20 minutes of impact)

Expected Behavior

  • Native library initialization should be:

    • resilient to task interruption, OR
    • retriable after failure, OR
    • isolated from task lifecycle (not tied to task threads)
  • Executor should not enter unrecoverable state due to a transient interrupt
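
One way to get the "isolated from task lifecycle" property is to run the load on a dedicated thread that Spark never interrupts, while the task thread waits for the result. This is only a sketch under assumptions: `IsolatedLoader` and `loadOffThread` are hypothetical names, and the `Callable` stands in for whatever actually loads the native library.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;

// Hypothetical helper, not part of Auron.
final class IsolatedLoader {
    // Runs `loader` on its own daemon thread. The calling (task) thread waits
    // for completion; if it is interrupted while waiting, the interrupt is
    // remembered and restored afterwards instead of aborting the load.
    static <T> T loadOffThread(Callable<T> loader) {
        FutureTask<T> task = new FutureTask<>(loader);
        Thread thread = new Thread(task, "native-lib-loader");
        thread.setDaemon(true);
        thread.start();

        boolean interrupted = false;
        try {
            while (true) {
                try {
                    return task.get();
                } catch (InterruptedException e) {
                    // Task kill arrived: keep waiting, the loader thread is unaffected.
                    interrupted = true;
                } catch (ExecutionException e) {
                    throw new IllegalStateException("native library load failed", e.getCause());
                }
            }
        } finally {
            if (interrupted) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```

The trade-off is that a killed task keeps waiting until the load finishes, so this fits best when the load is short and bounded.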


Suggestions

  1. Avoid lazy initialization in task execution path

    • Preload native libraries at executor startup
  2. Make initialization interrupt-safe

    • Ignore or defer thread interrupts during critical JNI loading
  3. Allow retry after initialization failure

    • Avoid permanently poisoning class state
  4. Fail fast on executor

    • If initialization fails, terminate executor instead of leaving it in broken state
  5. Reduce lock contention during initialization

    • Ensure only one thread performs initialization without exposing others to failure
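
Suggestions 2, 3 and 5 could be combined by moving the load out of the static initializer into an explicit, synchronized holder. The following is a rough sketch, not a proposed patch: `RetriableInit` is a hypothetical class, and the `Supplier` stands in for the real loading code.

```java
import java.util.function.Supplier;

// Hypothetical holder, not part of Auron: lazy initialization that can be
// retried after a failure because no class-level state is poisoned.
final class RetriableInit<T> {
    private final Supplier<T> loader;
    private T value;        // guarded by `this`
    private boolean loaded; // guarded by `this`

    RetriableInit(Supplier<T> loader) {
        this.loader = loader;
    }

    // Only one thread runs the loader at a time; the others block on the monitor.
    synchronized T get() {
        if (loaded) {
            return value;
        }
        // Defer an already-pending interrupt so interruptible I/O inside the
        // loader does not abort immediately. An interrupt that arrives while
        // the loader is running can still make it fail.
        boolean interrupted = Thread.interrupted();
        try {
            value = loader.get(); // if this throws, `loaded` stays false
            loaded = true;
            return value;
        } finally {
            if (interrupted) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```

If the loader throws, only the calling task fails; the next task that calls `get()` retries the load instead of hitting NoClassDefFoundError.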
