bug(bindings/java): HF blocking path segfaults on Linux with XET async upload #7367

@Xuanwo

Describe the bug

The CI job binding_java / ubuntu-latest / hf / hf_bucket currently crashes the forked JVM with SIGSEGV instead of reporting a normal test failure.

This is not caused by calling hf-xet's blocking API directly.

The Java binding builds a blocking operator by wrapping OpenDAL's async operator, and the crash happens when that blocking wrapper executes an HF/XET async write future on Linux x86_64.

What we know so far

The relevant execution model is:

  • Java behavior tests create a blocking operator via AsyncOperator.blocking().
  • The Java blocking binding then calls Rust blocking::Operator.
  • OpenDAL's blocking operator drives the underlying async write via Handle::block_on(...).
  • HF write goes through the XET async upload path.
  • That async upload path internally spawns Tokio async tasks and blocking tasks.

So the failing shape is:

Java sync API -> Rust blocking::Operator::block_on(async HF/XET write future) -> HF/XET async upload graph
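
To make this call shape concrete, here is a JDK-only analogy in Java. It is not OpenDAL or hf-xet code: asyncWrite, blockingWrite, and the thread pool are hypothetical stand-ins. CompletableFuture.join() plays the role of Handle::block_on(...), and the executor plays the role of the Tokio runtime. The sketch only illustrates the structure, a synchronous caller parked on a future whose work fans out into further tasks, and does not reproduce the crash.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class BlockingOverAsyncShape {
    // Hypothetical stand-in for the async upload path: fans out into several
    // pool tasks and completes with the total number of bytes "uploaded".
    // Assumes chunks >= 1.
    static CompletableFuture<Integer> asyncWrite(ExecutorService pool, byte[] data, int chunks) {
        int chunkSize = (data.length + chunks - 1) / chunks;
        List<CompletableFuture<Integer>> parts = IntStream.range(0, chunks)
            .mapToObj(i -> CompletableFuture.supplyAsync(() -> {
                int start = Math.min(i * chunkSize, data.length);
                int end = Math.min(start + chunkSize, data.length);
                return end - start; // pretend this chunk was uploaded
            }, pool))
            .collect(Collectors.toList());
        // Complete once every chunk task has finished, then sum the sizes.
        return CompletableFuture.allOf(parts.toArray(new CompletableFuture[0]))
            .thenApply(v -> parts.stream().mapToInt(CompletableFuture::join).sum());
    }

    // Hypothetical stand-in for the blocking wrapper: the calling thread
    // waits until the whole async graph has completed.
    static int blockingWrite(ExecutorService pool, byte[] data, int chunks) {
        return asyncWrite(pool, data, chunks).join();
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            int written = blockingWrite(pool, new byte[10], 3);
            System.out.println("written=" + written);
        } finally {
            pool.shutdown();
        }
    }
}
```

In the real failure the waiting thread is a JVM thread that has entered native code through JNI, and the fan-out happens inside hf-xet on Tokio worker and blocking threads, which this model does not capture.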

Evidence

The following repro results were observed with real HF bucket credentials:

  • macOS + Java HF blocking write: passes
  • Ubuntu 24.04 x86_64 container + Java HF async write: passes
  • Ubuntu 24.04 x86_64 container + AsyncOperator.blocking() construction only: passes
  • Ubuntu 24.04 x86_64 container + first blocking write: JVM crashes with SIGSEGV
  • Running the minimal repro directly via java -cp ... also crashes, so this is not caused by Surefire/JUnit

This rules out several earlier hypotheses:

  • not HF credentials
  • not Java test code itself
  • not HfCore session initialization timing
  • not using XET's own blocking API
  • not Maven Surefire

A temporary experiment also replaced the Java blocking runtime with a plain Tokio runtime without the JNI thread hooks, and Linux still crashed. So the primary issue does not appear to be the Java executor's attach/detach hooks either.

Likely root cause

The most likely root cause is an incompatibility between:

  • Java's native blocking binding path using Rust Handle::block_on(...), and
  • the HF/XET async upload implementation that internally fans out into spawned async/blocking Tokio work.

In other words, the failure is at the boundary between the Java blocking wrapper and the HF/XET async execution graph, not in ordinary HF business logic.

Minimal repro direction

A reduced repro can be built around:

  1. Construct AsyncOperator.of("hf", config)
  2. Convert it with .blocking()
  3. Perform a single write() against an HF bucket on Linux x86_64

That is sufficient to trigger the crash in the affected environment.

Expected behavior

HF behavior tests should either pass or fail with a normal OpenDAL/Java exception. They must not terminate the JVM with SIGSEGV.

Temporary mitigation

Until the Java blocking path is redesigned or HF/XET becomes compatible with this execution model, the practical mitigation is to disable the Java HF behavior case in CI.

Labels: bindings/java, bug, releases-note/fix
