macOS aarch64 flake: SIGBUS in _pthread_tsd_cleanup after ParquetReadFromFakeHadoopFsSuite #4200

@andygrove

Description

Recurring JVM crash on `macos-14/Spark 4.1, JDK 17, Scala 2.13 [parquet]` (and occasionally other macOS PR-build jobs) after the single test in `ParquetReadFromFakeHadoopFsSuite` completes. Reproduced on at least PR #4197 and in earlier runs.

Same failure shape as closed #2354 (`hdfsThreadDestructor` on linux amd64), but here on macOS aarch64 the offending frame is anonymous.

`hs_err` summary

```
SIGBUS (0xa) at pc=0x000000012e828e00
siginfo: si_signo: 10 (SIGBUS), si_code: 1 (BUS_ADRALN), si_addr: 0x000000012e828e00
Current thread is native thread

Native frames:
C 0x000000012e828e00 ← unmapped/stripped
C [libsystem_pthread.dylib+0x4818] _pthread_tsd_cleanup+0x1e8
C [libsystem_pthread.dylib+0x762c] _pthread_exit+0x54
C [libsystem_pthread.dylib+0x6f48] _pthread_start+0x94

Registers (selected):
pc=0x000000012e828e00 x8=0x000000012e828e00 ← callee == pc
```

Root cause (suspected)

This looks like the classic pattern of a `pthread_key_create` TSD destructor being invoked after its code has been `dlclose`'d:

  1. libcomet (or a library it pulls in — `hdfs-opendal` / libhdfs) calls `pthread_key_create(&key, destructor_fn)` for cleanup on thread exit.
  2. `ParquetReadFromFakeHadoopFsSuite` runs, spawns hdfs worker threads.
  3. The one test finishes (`931 ms` in the latest run); hdfs background threads finish their work and call `_pthread_exit`.
  4. `_pthread_tsd_cleanup` walks the TSD key table and jumps to `destructor_fn`.
  5. By this point the page holding `destructor_fn` has been unmapped / the lib has been unloaded, so the fetch at `pc` raises `BUS_ADRALN`.

The stack `_pthread_start → _pthread_exit → _pthread_tsd_cleanup → <anonymous frame>`, plus `pc == x8` (the TSD cleanup loop stores the destructor in `x8` before `blr x8` on arm64), is the tell.
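To make steps 1 to 4 concrete, here is a minimal sketch in plain C of how a TSD destructor ends up being called from thread exit. It is not taken from libcomet or libhdfs; the names `tsd_key`, `count_destructor`, `tsd_worker`, and `run_tsd_demo` are hypothetical and exist only for illustration. The sketch shows the healthy case where the destructor's code is still mapped; the crash in this issue would correspond to the same call happening after that code has gone away.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical names for illustration; not the real libhdfs symbols. */
static pthread_key_t tsd_key;
static int destructor_calls;

/* Runs during thread exit for every thread whose slot is non-NULL. */
static void count_destructor(void *value) {
    destructor_calls++;
    free(value);
}

static void *tsd_worker(void *arg) {
    (void)arg;
    int *state = malloc(sizeof *state);
    if (state != NULL) {
        *state = 42;
        /* A non-NULL slot is what makes thread exit call the destructor. */
        pthread_setspecific(tsd_key, state);
    }
    return NULL; /* Returning from the start routine triggers TSD cleanup. */
}

/* Returns how many times the destructor ran, or -1 on setup failure. */
int run_tsd_demo(void) {
    pthread_t thread;

    destructor_calls = 0;
    if (pthread_key_create(&tsd_key, count_destructor) != 0) {
        return -1;
    }
    if (pthread_create(&thread, NULL, tsd_worker, NULL) != 0) {
        pthread_key_delete(tsd_key);
        return -1;
    }
    /* Joining orders the destructor's write before the read below. */
    pthread_join(thread, NULL);
    pthread_key_delete(tsd_key);
    return destructor_calls;
}
```

Calling `run_tsd_demo()` from a `main` should return 1, since the worker's slot is non-NULL when the thread exits. If `count_destructor` lived in a shared library that was unloaded before the worker exited, the same cleanup path would jump to an address that no longer holds code.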

Where the stale destructor comes from

The suite depends on the `hdfs-opendal` feature (`assume(isFeatureEnabled("hdfs-opendal"))`). On macOS aarch64 CI that feature is enabled, so every run exercises the JNI bridge to Hadoop native libs. Those libs are the most likely registrars of the TSD key (cf. the original #2354 crash that pointed at `hdfsThreadDestructor+0x61`).

Mitigations to consider

  • Skip `ParquetReadFromFakeHadoopFsSuite` on macOS aarch64 until the root cause is fixed.
  • Unregister TSD keys at library-unload time, or avoid dlclose-like paths when TSD destructors are registered.
  • Upstream fix in whichever hdfs binding registers the key (mirrors the `hdfsThreadDestructor` crash in #2354).
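As a rough sketch of the second mitigation, the C snippet below deletes the key before a still-running thread exits. POSIX specifies that after `pthread_key_delete` the associated destructor is no longer called at thread exit. All names (`lib_key`, `lib_destructor`, `lib_on_unload`, `run_unload_demo`) are hypothetical. In a real shared library the unload hook might be a function marked `__attribute__((destructor))` (a GCC/Clang extension), but whether that fits the actual hdfs binding is an assumption.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical names for illustration; not the real libhdfs symbols. */
static pthread_key_t lib_key;
static int lib_destructor_calls;
static void *lib_value;
static pthread_mutex_t lib_mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t lib_cv = PTHREAD_COND_INITIALIZER;
static int lib_stage; /* 0 = started, 1 = value set, 2 = key deleted */

static void lib_destructor(void *value) {
    lib_destructor_calls++;
    free(value);
}

static void *lib_worker(void *arg) {
    (void)arg;
    lib_value = malloc(1);
    pthread_setspecific(lib_key, lib_value);

    pthread_mutex_lock(&lib_mu);
    lib_stage = 1;
    pthread_cond_broadcast(&lib_cv);
    /* Stay alive until the "library" has torn down its key. */
    while (lib_stage < 2) {
        pthread_cond_wait(&lib_cv, &lib_mu);
    }
    pthread_mutex_unlock(&lib_mu);
    return NULL; /* Thread exit happens after the key was deleted. */
}

/* Stand-in for an unload hook: after this, thread exit skips the destructor. */
static void lib_on_unload(void) {
    pthread_key_delete(lib_key);
}

/* Returns how many times the destructor ran, or -1 on setup failure. */
int run_unload_demo(void) {
    pthread_t thread;

    lib_destructor_calls = 0;
    lib_stage = 0;
    if (pthread_key_create(&lib_key, lib_destructor) != 0) {
        return -1;
    }
    if (pthread_create(&thread, NULL, lib_worker, NULL) != 0) {
        pthread_key_delete(lib_key);
        return -1;
    }

    pthread_mutex_lock(&lib_mu);
    while (lib_stage < 1) {
        pthread_cond_wait(&lib_cv, &lib_mu);
    }
    pthread_mutex_unlock(&lib_mu);

    lib_on_unload();

    pthread_mutex_lock(&lib_mu);
    lib_stage = 2;
    pthread_cond_broadcast(&lib_cv);
    pthread_mutex_unlock(&lib_mu);

    pthread_join(thread, NULL);
    /* pthread_key_delete never runs destructors, so free the value here. */
    free(lib_value);
    return lib_destructor_calls;
}
```

Here `run_unload_demo()` should return 0, because the key is deleted before the worker exits. The trade-off is that per-thread values are no longer freed automatically, so the library would need another way to release them.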

Linking PR #4197 where this most recently surfaced.

Metadata

Labels: area:ci (CI/CD, GitHub Actions, build tooling), area:scan (Parquet scan / data reading), bug (Something isn't working), priority:low (Minor issues, test failures, tooling, cosmetic)
