JNI lazy initialization fails under stage cancellation, causing executor-wide NoClassDefFoundError (AuronCallNativeWrapper) #2118
Description
We encountered an issue in production where Auron native execution fails intermittently on a single executor with:
java.lang.NoClassDefFoundError: Could not initialize class org.apache.auron.jni.AuronCallNativeWrapper
The failure is isolated to a single executor JVM. Other executors on the same node continue to work normally.
Once triggered, all subsequent tasks scheduled on that executor consistently fail with the same error until the executor or application is restarted.
Relevant Logs
Initial failure:
INFO AuronCallNativeWrapper: Initializing native environment (batchSize=10000, memoryFraction=0.6)
INFO Executor: Executor is trying to kill task ..., reason: Stage cancelled
26/03/25 19:01:45 INFO AuronCallNativeWrapper: Initializing native environment (batchSize=10000, memoryFraction=0.6)
26/03/25 19:01:51 INFO Executor: Executor is trying to kill task 9.0 in stage 8157.0 (TID 52888), reason: Stage cancelled
26/03/25 19:01:51 INFO Executor: Executor is trying to kill task 49.0 in stage 8157.0 (TID 52928), reason: Stage cancelled
26/03/25 19:01:51 INFO Executor: Executor is trying to kill task 89.0 in stage 8157.0 (TID 52968), reason: Stage cancelled
26/03/25 19:01:51 INFO Executor: Executor is trying to kill task 29.0 in stage 8157.0 (TID 52908), reason: Stage cancelled
26/03/25 19:01:51 INFO Executor: Executor is trying to kill task 69.0 in stage 8157.0 (TID 52948), reason: Stage cancelled
26/03/25 19:01:51 ERROR Executor: Exception in task 49.0 in stage 8157.0 (TID 52928)
java.lang.NoClassDefFoundError: Could not initialize class org.apache.auron.jni.AuronCallNativeWrapper
at org.apache.spark.sql.auron.NativeHelper$.executeNativePlan(NativeHelper.scala:103)
at org.apache.spark.sql.execution.auron.shuffle.AuronShuffleWriterBase.nativeShuffleWrite(AuronShuffleWriterBase.scala:67)
at org.apache.spark.sql.execution.auron.plan.NativeShuffleExchangeExec$$anon$1.internalWrite(NativeShuffleExchangeExec.scala:187)
at org.apache.spark.sql.execution.auron.plan.NativeShuffleExchangeExec$$anon$1.write(NativeShuffleExchangeExec.scala:131)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
26/03/25 19:01:51 ERROR Executor: Exception in task 69.0 in stage 8157.0 (TID 52948)
java.lang.NoClassDefFoundError: Could not initialize class org.apache.auron.jni.AuronCallNativeWrapper
at org.apache.spark.sql.auron.NativeHelper$.executeNativePlan(NativeHelper.scala:103)
at org.apache.spark.sql.execution.auron.shuffle.AuronShuffleWriterBase.nativeShuffleWrite(AuronShuffleWriterBase.scala:67)
at org.apache.spark.sql.execution.auron.plan.NativeShuffleExchangeExec$$anon$1.internalWrite(NativeShuffleExchangeExec.scala:187)
at org.apache.spark.sql.execution.auron.plan.NativeShuffleExchangeExec$$anon$1.write(NativeShuffleExchangeExec.scala:131)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
26/03/25 19:01:51 ERROR Executor: Exception in task 89.0 in stage 8157.0 (TID 52968)
java.lang.ExceptionInInitializerError
at org.apache.spark.sql.auron.NativeHelper$.executeNativePlan(NativeHelper.scala:103)
at org.apache.spark.sql.execution.auron.shuffle.AuronShuffleWriterBase.nativeShuffleWrite(AuronShuffleWriterBase.scala:67)
at org.apache.spark.sql.execution.auron.plan.NativeShuffleExchangeExec$$anon$1.internalWrite(NativeShuffleExchangeExec.scala:187)
at org.apache.spark.sql.execution.auron.plan.NativeShuffleExchangeExec$$anon$1.write(NativeShuffleExchangeExec.scala:131)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalStateException: error loading native libraries: java.nio.channels.ClosedByInterruptException
at org.apache.auron.jni.SparkAuronAdaptor.loadAuronLib(SparkAuronAdaptor.java:52)
at org.apache.auron.jni.AuronCallNativeWrapper.<clinit>(AuronCallNativeWrapper.java:74)
... 13 more
26/03/25 19:01:51 ERROR Executor: Exception in task 9.0 in stage 8157.0 (TID 52888)
java.lang.NoClassDefFoundError: Could not initialize class org.apache.auron.jni.AuronCallNativeWrapper
at org.apache.spark.sql.auron.NativeHelper$.executeNativePlan(NativeHelper.scala:103)
at org.apache.spark.sql.execution.auron.shuffle.AuronShuffleWriterBase.nativeShuffleWrite(AuronShuffleWriterBase.scala:67)
at org.apache.spark.sql.execution.auron.plan.NativeShuffleExchangeExec$$anon$1.internalWrite(NativeShuffleExchangeExec.scala:187)
at org.apache.spark.sql.execution.auron.plan.NativeShuffleExchangeExec$$anon$1.write(NativeShuffleExchangeExec.scala:131)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
26/03/25 19:01:51 ERROR Executor: Exception in task 29.0 in stage 8157.0 (TID 52908)
java.lang.NoClassDefFoundError: Could not initialize class org.apache.auron.jni.AuronCallNativeWrapper
at org.apache.spark.sql.auron.NativeHelper$.executeNativePlan(NativeHelper.scala:103)
at org.apache.spark.sql.execution.auron.shuffle.AuronShuffleWriterBase.nativeShuffleWrite(AuronShuffleWriterBase.scala:67)
at org.apache.spark.sql.execution.auron.plan.NativeShuffleExchangeExec$$anon$1.internalWrite(NativeShuffleExchangeExec.scala:187)
at org.apache.spark.sql.execution.auron.plan.NativeShuffleExchangeExec$$anon$1.write(NativeShuffleExchangeExec.scala:131)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Subsequent failures:
java.lang.NoClassDefFoundError: Could not initialize class org.apache.auron.jni.AuronCallNativeWrapper
Observed Behavior
- Issue happens on only one executor JVM
- Other executors on the same node are unaffected
- Tasks keep failing only when scheduled to that specific executor
- Restarting the application (or executor) resolves the issue
- Failure occurs during native shuffle execution path
Root Cause Analysis (Our Understanding)
Based on logs and behavior, we believe the root cause is:
- Auron native library is lazily initialized in AuronCallNativeWrapper
- Multiple tasks concurrently attempt initialization (lock contention)
- During initialization, Spark triggers stage cancellation, interrupting task threads
- Native library loading (likely involving file/channel operations) throws ClosedByInterruptException
- This causes static initialization (<clinit>) to fail
- JVM marks the class as failed initialization
- All subsequent usages in the same executor result in NoClassDefFoundError: Could not initialize class
This effectively poisons the entire executor JVM.
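The sequence above can be reproduced outside Spark with a small standalone sketch. Everything here is hypothetical and for illustration only: FakeNativeWrapper stands in for AuronCallNativeWrapper, and a plain FileChannel read stands in for whatever I/O the real library loader performs. The interrupt flag is set before initialization starts, which simulates (but is not identical to) a stage cancellation arriving mid-load.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class PoisonedInitDemo {
    // Scratch file read by the static initializer below; created in run().
    static volatile Path probeFile;

    // Hypothetical stand-in for AuronCallNativeWrapper: all setup lives in <clinit>.
    static class FakeNativeWrapper {
        static {
            try (FileChannel ch = FileChannel.open(probeFile, StandardOpenOption.READ)) {
                // Interruptible channel I/O: with the interrupt flag set, this read
                // closes the channel and throws ClosedByInterruptException.
                ch.read(ByteBuffer.allocate(16));
            } catch (IOException e) {
                throw new IllegalStateException("error loading native libraries: " + e, e);
            }
        }

        static void call() { }
    }

    // Touches the class and reports which Throwable type (if any) came back.
    static String attempt() {
        try {
            FakeNativeWrapper.call();
            return "ok";
        } catch (Throwable t) {
            return t.getClass().getName();
        }
    }

    static String[] run() {
        String[] results = new String[2];
        try {
            probeFile = Files.createTempFile("fake-native", ".bin");

            // First "task": interrupted (as by a stage cancellation) before it
            // triggers class initialization.
            Thread cancelled = new Thread(() -> {
                Thread.currentThread().interrupt();
                results[0] = attempt();
            });
            cancelled.start();
            cancelled.join();

            // Second "task": never interrupted, but the class is already marked
            // as failed, so it cannot initialize it either.
            Thread healthy = new Thread(() -> results[1] = attempt());
            healthy.start();
            healthy.join();

            Files.deleteIfExists(probeFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return results;
    }

    public static void main(String[] args) {
        String[] r = run();
        System.out.println("interrupted task: " + r[0]);
        System.out.println("later task:       " + r[1]);
    }
}
```

In a fresh JVM the interrupted task should see java.lang.ExceptionInInitializerError and the later, healthy task java.lang.NoClassDefFoundError, matching the pattern in the logs: one task carries the real cause, and every other task only sees "Could not initialize class".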
Impact
- A single executor becomes permanently unusable for Auron tasks
- Tasks repeatedly fail if scheduled on that executor
- Requires executor/application restart to recover
- Can cause prolonged instability (we observed ~20 minutes impact)
Expected Behavior
- Native library initialization should be:
  - resilient to task interruption, OR
  - retriable after failure, OR
  - isolated from task lifecycle (not tied to task threads)
- Executor should not enter unrecoverable state due to a transient interrupt
Suggestions
- Avoid lazy initialization in task execution path
  - Preload native libraries at executor startup
- Make initialization interrupt-safe
  - Ignore or defer thread interrupts during critical JNI loading
- Allow retry after initialization failure
  - Avoid permanently poisoning class state
- Fail fast on executor
  - If initialization fails, terminate executor instead of leaving it in broken state
- Reduce lock contention during initialization
  - Ensure only one thread performs initialization without exposing others to failure
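As a rough illustration of how several of these suggestions could combine, here is a minimal sketch, not a proposed patch. It assumes the loading logic can be moved out of the static initializer into an ordinary method. RetriableInit and its loader argument are hypothetical names; the loader stands in for whatever SparkAuronAdaptor.loadAuronLib does today. The load runs on a dedicated thread so a task interrupt cannot reach its I/O, a failure leaves the state retriable, and the caller's interrupt flag is deferred and then restored.

```java
class RetriableInit {
    private final Runnable loader;          // hypothetical: wraps the real library-loading call
    private final Object lock = new Object();
    private volatile boolean initialized;

    RetriableInit(Runnable loader) {
        this.loader = loader;
    }

    boolean isInitialized() {
        return initialized;
    }

    void ensureInitialized() {
        if (initialized) return;            // fast path once loading has succeeded
        synchronized (lock) {               // only one thread attempts the load at a time
            if (initialized) return;

            // Run the loader on its own thread: interrupts aimed at the calling
            // task thread never reach the loader's I/O.
            Throwable[] failure = new Throwable[1];
            Thread worker = new Thread(() -> {
                try {
                    loader.run();
                } catch (Throwable t) {
                    failure[0] = t;
                }
            }, "native-init");
            worker.setDaemon(true);
            worker.start();

            // Wait for the load to finish, deferring any interrupt of the caller.
            boolean interrupted = false;
            while (true) {
                try {
                    worker.join();
                    break;
                } catch (InterruptedException e) {
                    interrupted = true;
                }
            }
            if (interrupted) {
                Thread.currentThread().interrupt(); // hand the cancellation back to the task
            }

            if (failure[0] != null) {
                // Nothing is marked as permanently failed; the next caller retries.
                throw new IllegalStateException(
                        "native initialization failed; a later call will retry", failure[0]);
            }
            initialized = true;
        }
    }

    public static void main(String[] args) {
        RetriableInit init = new RetriableInit(
                () -> System.out.println("loading native library (placeholder)"));
        Thread.currentThread().interrupt(); // simulate a pending task kill
        init.ensureInitialized();
        System.out.println("initialized=" + init.isInitialized()
                + ", interrupt preserved=" + Thread.interrupted());
    }
}
```

One trade-off of this shape: a cancelled task that is waiting on the lock or on the worker thread does not react to the kill until the load attempt finishes, so it suits loads that are expected to be short. Calling ensureInitialized() once at executor startup would also take the load out of the task execution path entirely.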