[AURON #2118] make native environment initialization retriable after failure#2151
[AURON #2118] make native environment initialization retriable after failure#2151XorSum wants to merge 1 commit intoapache:masterfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR addresses executor-wide NoClassDefFoundError caused by failures in JNI/native initialization during class static initialization, by moving native environment initialization into a retriable, lazily-invoked code path.
Changes:
- Remove static initializer-based native environment setup from
AuronCallNativeWrapper. - Add a thread-safe, lazy
init()method guarded by double-checked locking and invoke it from the wrapper constructor.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| private static void init() { | ||
| if (!initialized) { | ||
| synchronized (AuronCallNativeWrapper.class) { | ||
| if (!initialized) { | ||
| // initialize native environment | ||
| LOG.info("Initializing native environment (batchSize=" | ||
| + AuronAdaptor.getInstance().getAuronConfiguration().get(AuronConfiguration.BATCH_SIZE) | ||
| + ", " | ||
| + "memoryFraction=" | ||
| + AuronAdaptor.getInstance().getAuronConfiguration().get(AuronConfiguration.MEMORY_FRACTION) |
There was a problem hiding this comment.
init() can run while a Spark task thread is already interrupted/cancelled (the production issue shows ClosedByInterruptException). Even though initialization is now retriable, attempting Files.copy/System.load from an interrupted task is still likely to fail and generate noisy errors. Consider short-circuiting initialization when the current task is not running (e.g., !AuronAdaptor.getInstance().isTaskRunning() or Thread.currentThread().isInterrupted()), and fail fast so a later non-cancelled task can perform the one-time init successfully.
| // initialize native environment | ||
| LOG.info("Initializing native environment (batchSize=" | ||
| + AuronAdaptor.getInstance().getAuronConfiguration().get(AuronConfiguration.BATCH_SIZE) | ||
| + ", " | ||
| + "memoryFraction=" | ||
| + AuronAdaptor.getInstance().getAuronConfiguration().get(AuronConfiguration.MEMORY_FRACTION) | ||
| + ")"); | ||
|
|
||
| // arrow configuration | ||
| System.setProperty("arrow.struct.conflict.policy", "CONFLICT_APPEND"); | ||
|
|
||
| // preload JNI bridge classes | ||
| try { | ||
| Class.forName("org.apache.auron.jni.JniBridge"); | ||
| } catch (ClassNotFoundException e) { | ||
| throw new RuntimeException("Cannot load JniBridge class", e); | ||
| } | ||
|
|
||
| AuronAdaptor.getInstance().loadAuronLib(); | ||
| Runtime.getRuntime().addShutdownHook(new Thread(JniBridge::onExit)); | ||
|
|
||
| initialized = true; |
There was a problem hiding this comment.
Because initialization is now retried after failures, AuronAdaptor.getInstance().loadAuronLib() may be invoked multiple times in the same JVM. In the Spark adaptor, each call extracts to a fresh createTempFile(...).deleteOnExit(), so repeated failures can accumulate temp files until executor exit. Consider making loadAuronLib() idempotent / cached (or cleaning up temp files on failure) to avoid disk bloat on flaky/interrupt-driven init attempts.
| // initialize native environment | |
| LOG.info("Initializing native environment (batchSize=" | |
| + AuronAdaptor.getInstance().getAuronConfiguration().get(AuronConfiguration.BATCH_SIZE) | |
| + ", " | |
| + "memoryFraction=" | |
| + AuronAdaptor.getInstance().getAuronConfiguration().get(AuronConfiguration.MEMORY_FRACTION) | |
| + ")"); | |
| // arrow configuration | |
| System.setProperty("arrow.struct.conflict.policy", "CONFLICT_APPEND"); | |
| // preload JNI bridge classes | |
| try { | |
| Class.forName("org.apache.auron.jni.JniBridge"); | |
| } catch (ClassNotFoundException e) { | |
| throw new RuntimeException("Cannot load JniBridge class", e); | |
| } | |
| AuronAdaptor.getInstance().loadAuronLib(); | |
| Runtime.getRuntime().addShutdownHook(new Thread(JniBridge::onExit)); | |
| initialized = true; | |
| try { | |
| // initialize native environment | |
| LOG.info("Initializing native environment (batchSize=" | |
| + AuronAdaptor.getInstance().getAuronConfiguration().get(AuronConfiguration.BATCH_SIZE) | |
| + ", " | |
| + "memoryFraction=" | |
| + AuronAdaptor.getInstance().getAuronConfiguration().get(AuronConfiguration.MEMORY_FRACTION) | |
| + ")"); | |
| // arrow configuration | |
| System.setProperty("arrow.struct.conflict.policy", "CONFLICT_APPEND"); | |
| // preload JNI bridge classes | |
| try { | |
| Class.forName("org.apache.auron.jni.JniBridge"); | |
| } catch (ClassNotFoundException e) { | |
| throw new RuntimeException("Cannot load JniBridge class", e); | |
| } | |
| AuronAdaptor.getInstance().loadAuronLib(); | |
| Runtime.getRuntime().addShutdownHook(new Thread(JniBridge::onExit)); | |
| } finally { | |
| // Mark initialization as attempted to avoid repeated native library loading | |
| initialized = true; | |
| } |
Which issue does this PR close?
Closes #2118
Rationale for this change
avoid initialize in static block, make native environment initialization retriable after failure
What changes are included in this PR?
Are there any user-facing changes?
How was this patch tested?