
[AURON #2118] make native environment initialization retriable after failure #2151

Open
XorSum wants to merge 1 commit into apache:master from XorSum:fix_initialization_error


Contributor

@XorSum XorSum commented Apr 2, 2026

Which issue does this PR close?

Closes #2118

Rationale for this change

Avoid initializing the native environment in a static block, and make native environment initialization retriable after a failure.

What changes are included in this PR?

Are there any user-facing changes?

How was this patch tested?

Contributor

Copilot AI left a comment


Pull request overview

This PR addresses executor-wide NoClassDefFoundError caused by failures in JNI/native initialization during class static initialization, by moving native environment initialization into a retriable, lazily-invoked code path.
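The failure mode described above follows from how the JVM treats a class whose static initializer throws: the first use surfaces ExceptionInInitializerError, and every later use in the same class loader gets NoClassDefFoundError with no retry. A minimal sketch of that behavior, using the hypothetical class names FragileNative and StaticInitFailureDemo (not part of Auron) and a simulated failure in place of a real JNI load:

```java
// Hypothetical stand-in for a class that loads native code in a static block.
final class FragileNative {
    static {
        // `if (true)` keeps the compiler happy: an initializer must be able to complete normally.
        if (true) {
            throw new IllegalStateException("simulated native load failure");
        }
    }

    static void call() {
        // Never reached: class initialization always fails first.
    }
}

final class StaticInitFailureDemo {
    /** Returns the simple class names of the errors seen on the first and second use. */
    static String[] observe() {
        String[] seen = new String[2];
        for (int i = 0; i < 2; i++) {
            try {
                FragileNative.call();
            } catch (Throwable t) {
                // First use: the initializer's exception wrapped in ExceptionInInitializerError.
                // Second use: the class is marked erroneous, so NoClassDefFoundError is thrown.
                seen[i] = t.getClass().getSimpleName();
            }
        }
        return seen;
    }
}
```

Calling StaticInitFailureDemo.observe() reports one error type for the first use and a different one for the second, which is why a single interrupted initialization can affect every later task on the executor.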

Changes:

  • Remove static initializer-based native environment setup from AuronCallNativeWrapper.
  • Add a thread-safe, lazy init() method guarded by double-checked locking and invoke it from the wrapper constructor.
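The lazy, retriable pattern in the second bullet can be sketched as follows. This is an illustration under assumptions, not the PR's code: RetriableInit, loadNativeEnvironment() and the attempts counter are hypothetical, the simulated failure on the first attempt stands in for an interrupted native load, and the flag is assumed to be volatile (the excerpt in this review does not show its declaration).

```java
import java.util.concurrent.atomic.AtomicInteger;

final class RetriableInit {
    // volatile is needed for double-checked locking to be safe under the Java memory model.
    private static volatile boolean initialized = false;

    // Demo-only counter used to simulate a transient failure on the first attempt.
    static final AtomicInteger attempts = new AtomicInteger();

    /** Hypothetical stand-in for the real native setup: fails on the first attempt only. */
    private static void loadNativeEnvironment() {
        if (attempts.incrementAndGet() == 1) {
            throw new IllegalStateException("simulated transient failure");
        }
    }

    static void init() {
        if (!initialized) {                       // fast path, no lock once initialized
            synchronized (RetriableInit.class) {
                if (!initialized) {               // re-check under the lock
                    loadNativeEnvironment();      // if this throws, the flag stays false
                    initialized = true;           // set only after success, so later callers retry
                }
            }
        }
    }

    static boolean isInitialized() {
        return initialized;
    }
}
```

Because the flag is written only after the setup succeeds, a caller that hits the failure sees the exception, while the next caller runs the setup again instead of inheriting a permanently broken class.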


Comment on lines +81 to +90
private static void init() {
    if (!initialized) {
        synchronized (AuronCallNativeWrapper.class) {
            if (!initialized) {
                // initialize native environment
                LOG.info("Initializing native environment (batchSize="
                        + AuronAdaptor.getInstance().getAuronConfiguration().get(AuronConfiguration.BATCH_SIZE)
                        + ", "
                        + "memoryFraction="
                        + AuronAdaptor.getInstance().getAuronConfiguration().get(AuronConfiguration.MEMORY_FRACTION)

Copilot AI Apr 2, 2026


init() can run while a Spark task thread is already interrupted/cancelled (the production issue shows ClosedByInterruptException). Even though initialization is now retriable, attempting Files.copy/System.load from an interrupted task is still likely to fail and generate noisy errors. Consider short-circuiting initialization when the current task is not running (e.g., !AuronAdaptor.getInstance().isTaskRunning() or Thread.currentThread().isInterrupted()), and fail fast so a later non-cancelled task can perform the one-time init successfully.
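One way to read the fail-fast suggestion above is a guard inside the lock that refuses to start initialization on an interrupted thread. The sketch below is an assumption-laden illustration: InterruptAwareInit and its Runnable parameter are hypothetical, and it checks only Thread.currentThread().isInterrupted(), since the behavior of AuronAdaptor.isTaskRunning() is not shown in this conversation.

```java
final class InterruptAwareInit {
    private static volatile boolean initialized = false;

    /** nativeSetup is a hypothetical stand-in for the real one-time native setup. */
    static void init(Runnable nativeSetup) {
        if (initialized) {
            return;
        }
        synchronized (InterruptAwareInit.class) {
            if (initialized) {
                return;
            }
            // isInterrupted() reads the interrupt flag without clearing it, so the
            // surrounding task still observes its own cancellation afterwards.
            if (Thread.currentThread().isInterrupted()) {
                throw new IllegalStateException(
                        "skipping native initialization: current thread is interrupted");
            }
            nativeSetup.run();
            initialized = true; // reached only when setup completed without throwing
        }
    }

    static boolean isInitialized() {
        return initialized;
    }
}
```

The interrupted caller fails quickly without touching the file system or the native loader, and the first non-interrupted caller performs the one-time setup.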

Comment on lines +85 to +106
// initialize native environment
LOG.info("Initializing native environment (batchSize="
        + AuronAdaptor.getInstance().getAuronConfiguration().get(AuronConfiguration.BATCH_SIZE)
        + ", "
        + "memoryFraction="
        + AuronAdaptor.getInstance().getAuronConfiguration().get(AuronConfiguration.MEMORY_FRACTION)
        + ")");

// arrow configuration
System.setProperty("arrow.struct.conflict.policy", "CONFLICT_APPEND");

// preload JNI bridge classes
try {
    Class.forName("org.apache.auron.jni.JniBridge");
} catch (ClassNotFoundException e) {
    throw new RuntimeException("Cannot load JniBridge class", e);
}

AuronAdaptor.getInstance().loadAuronLib();
Runtime.getRuntime().addShutdownHook(new Thread(JniBridge::onExit));

initialized = true;

Copilot AI Apr 2, 2026


Because initialization is now retried after failures, AuronAdaptor.getInstance().loadAuronLib() may be invoked multiple times in the same JVM. In the Spark adaptor, each call extracts to a fresh createTempFile(...).deleteOnExit(), so repeated failures can accumulate temp files until executor exit. Consider making loadAuronLib() idempotent / cached (or cleaning up temp files on failure) to avoid disk bloat on flaky/interrupt-driven init attempts.

Suggested change
- // initialize native environment
- LOG.info("Initializing native environment (batchSize="
-         + AuronAdaptor.getInstance().getAuronConfiguration().get(AuronConfiguration.BATCH_SIZE)
-         + ", "
-         + "memoryFraction="
-         + AuronAdaptor.getInstance().getAuronConfiguration().get(AuronConfiguration.MEMORY_FRACTION)
-         + ")");
- // arrow configuration
- System.setProperty("arrow.struct.conflict.policy", "CONFLICT_APPEND");
- // preload JNI bridge classes
- try {
-     Class.forName("org.apache.auron.jni.JniBridge");
- } catch (ClassNotFoundException e) {
-     throw new RuntimeException("Cannot load JniBridge class", e);
- }
- AuronAdaptor.getInstance().loadAuronLib();
- Runtime.getRuntime().addShutdownHook(new Thread(JniBridge::onExit));
- initialized = true;
+ try {
+     // initialize native environment
+     LOG.info("Initializing native environment (batchSize="
+             + AuronAdaptor.getInstance().getAuronConfiguration().get(AuronConfiguration.BATCH_SIZE)
+             + ", "
+             + "memoryFraction="
+             + AuronAdaptor.getInstance().getAuronConfiguration().get(AuronConfiguration.MEMORY_FRACTION)
+             + ")");
+     // arrow configuration
+     System.setProperty("arrow.struct.conflict.policy", "CONFLICT_APPEND");
+     // preload JNI bridge classes
+     try {
+         Class.forName("org.apache.auron.jni.JniBridge");
+     } catch (ClassNotFoundException e) {
+         throw new RuntimeException("Cannot load JniBridge class", e);
+     }
+     AuronAdaptor.getInstance().loadAuronLib();
+     Runtime.getRuntime().addShutdownHook(new Thread(JniBridge::onExit));
+ } finally {
+     // Mark initialization as attempted to avoid repeated native library loading
+     initialized = true;
+ }
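The comment's other option, making the library load idempotent and cleaning up temp files on failure, could look roughly like the sketch below. Setting the flag in a finally block, as in the suggested change, also marks a failed attempt as done and so stops later retries; this variant keeps the retry path open. All names here are hypothetical (NativeLibExtractor, Loader, the "libdemo-" prefix), and the Loader callback stands in for System.load; the real loadAuronLib() signature and extraction logic are not shown in this conversation.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class NativeLibExtractor {
    // Path of the successfully loaded library; guarded by the class lock.
    private static Path cached;

    /** Hypothetical hook for whatever loads the extracted file (System.load in real code). */
    interface Loader {
        void load(Path lib) throws Exception;
    }

    static synchronized Path extractAndLoad(InputStream libBytes, Loader loader) throws Exception {
        if (cached != null) {
            return cached; // idempotent: a successful load is never repeated
        }
        Path tmp = Files.createTempFile("libdemo-", ".tmp");
        boolean ok = false;
        try {
            // createTempFile already created the file, so the copy must replace it.
            Files.copy(libBytes, tmp, StandardCopyOption.REPLACE_EXISTING);
            loader.load(tmp);
            ok = true;
        } finally {
            if (!ok) {
                // A failed attempt leaves nothing behind, so retries do not accumulate files.
                Files.deleteIfExists(tmp);
            }
        }
        tmp.toFile().deleteOnExit();
        cached = tmp;
        return tmp;
    }
}
```

Using a success flag with finally rather than a catch block means the cleanup also runs when the loader throws an Error such as UnsatisfiedLinkError.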


Development

Successfully merging this pull request may close these issues.

JNI lazy initialization fails under stage cancellation, causing executor-wide NoClassDefFoundError (AuronCallNativeWrapper)
