opencode plugin can spawn AFT before ONNX runtime path is ready, causing dlopen failed and SIGABRT #5

@Suknna

Summary

When experimental_semantic_search=true, the OpenCode plugin can spawn the AFT binary before ONNX runtime initialization finishes. In that window, the child process starts without the resolved ONNX dylib directory, semantic index initialization tries to load libonnxruntime.dylib, and the Rust binary aborts with SIGABRT.

This was reproducible on macOS with the OpenCode plugin, and it explains the repeated restart loops that log:

  • Failed to load ONNX Runtime dylib
  • dlopen failed
  • Process exited: code=null, signal=SIGABRT
  • Max restarts (3) reached

Environment

  • Platform: macOS darwin/arm64
  • AFT binary: v0.11.1
  • OpenCode: 1.4.3
  • Config: experimental_search_index=true, experimental_semantic_search=true

Root Cause

In packages/opencode-plugin/src/index.ts, ensureOnnxRuntime() previously ran asynchronously without being awaited. BridgePool could be created first, and the AFT child process could be spawned before _ort_dylib_dir was populated.

That means the child process environment did not reliably include the ONNX dylib path before semantic index initialization started.

Because the Rust release profile uses panic = "abort", a failure to load ONNX Runtime aborts the process with SIGABRT instead of surfacing as a recoverable error.
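
The ordering problem described above can be reproduced in isolation. The following is a minimal sketch, not the plugin's actual code: spawnChild, startRacy, the ConfigOverrides shape, and the returned directory are hypothetical stand-ins, and only the names ensureOnnxRuntime and _ort_dylib_dir come from this issue.

```typescript
// Hypothetical stand-ins for illustration; not the plugin's actual code.
interface ConfigOverrides {
  _ort_dylib_dir?: string;
}

// Pretends to locate the ONNX runtime. It resolves on a later microtask,
// never synchronously, which is all the race needs.
async function ensureOnnxRuntime(storageDir: string): Promise<string> {
  await Promise.resolve();
  return `${storageDir}/onnxruntime`; // made-up directory layout
}

// Pretends to spawn the child: captures the config as it exists at spawn time.
function spawnChild(overrides: ConfigOverrides): ConfigOverrides {
  return { ...overrides };
}

// Racy ordering: the lookup is started but not awaited, so the spawn wins.
function startRacy(storageDir: string): ConfigOverrides {
  const overrides: ConfigOverrides = {};
  void ensureOnnxRuntime(storageDir).then((dir) => {
    overrides._ort_dylib_dir = dir; // arrives after the child already started
  });
  return spawnChild(overrides);
}

const childConfig = startRacy("/tmp/aft-storage");
// childConfig._ort_dylib_dir is undefined here: the child never saw the path.
```

In this sketch the child always loses the path because the spawn is synchronous. In the real plugin the outcome presumably depends on timing, which would match the intermittent nature of the crash.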

Reproduction

  1. Enable experimental_semantic_search=true
  2. Start OpenCode with the AFT plugin
  3. Trigger an AFT-backed session quickly on a project that causes semantic indexing to initialize early
  4. Observe repeated crashes in aft-plugin.log

Observed log sequence:

  • ONNX Runtime found at ...
  • Spawning binary: .../v0.11.1/aft
  • Failed to load ONNX Runtime dylib: ... dlopen failed
  • Process exited: code=null, signal=SIGABRT
  • Auto-restart #1/#2/#3
  • Max restarts (3) reached

Evidence

I verified this locally from logs and runtime behavior.

Before the fix, the hwmon project was a reliable repro case and repeatedly crashed during semantic search startup.

After changing plugin startup order so ONNX initialization is awaited before bridge creation, the same setup successfully reached:

  • started
  • pre-warmed symbol cache
  • built semantic index
  • semantic index persisted

Example successful post-fix log sequence:

  • Spawning binary: /Users/.../.cache/aft/bin/v0.11.1/aft (cwd: /Users/.../hwmon)
  • Binary version: 0.11.1
  • pre-warmed symbol cache: 63 files
  • built semantic index: 76 files, 799 entries
  • semantic index persisted: 799 entries, 1707.0 KB

Proposed Fix

In the OpenCode plugin startup path:

  • await ensureOnnxRuntime(storageDir) before constructing BridgePool
  • set configOverrides._ort_dylib_dir before any child process spawn
  • keep the fast path unchanged when experimental_semantic_search=false
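
The three points above could look roughly like the following. This is a sketch under assumptions: PluginConfig, StartupDeps, startPlugin, and createBridgePool are hypothetical names introduced here so the ordering can be shown without the real plugin code, and the assumption that ensureOnnxRuntime resolves to the dylib directory (or undefined) is mine.

```typescript
// Hypothetical shapes; the real plugin types are not shown in this issue.
interface PluginConfig {
  experimental_semantic_search: boolean;
}
interface ConfigOverrides {
  _ort_dylib_dir?: string;
}
interface StartupDeps {
  // Assumed to resolve to the ONNX dylib directory, or undefined if unavailable.
  ensureOnnxRuntime(storageDir: string): Promise<string | undefined>;
  // Stand-in for constructing BridgePool, which may spawn the AFT child.
  createBridgePool(overrides: ConfigOverrides): unknown;
}

async function startPlugin(
  config: PluginConfig,
  storageDir: string,
  deps: StartupDeps,
): Promise<ConfigOverrides> {
  const overrides: ConfigOverrides = {};

  if (config.experimental_semantic_search) {
    // Block until the runtime directory is known, so no spawn can race ahead.
    const dir = await deps.ensureOnnxRuntime(storageDir);
    if (dir !== undefined) {
      overrides._ort_dylib_dir = dir;
    }
  }

  // Fast path: with semantic search off, nothing above touches ONNX at all.
  deps.createBridgePool(overrides);
  return overrides;
}
```

Passing the collaborators in as arguments is a choice made for this sketch; it keeps the ordering easy to check in a test without loading ONNX or spawning a process.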

Regression Coverage

I added two focused tests locally:

  1. ensureOnnxRuntime must resolve before BridgePool construction
  2. when semantic search is disabled, ensureOnnxRuntime must not be called
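
For reference, the two checks can be expressed against any startup function that accepts its collaborators as arguments. This is an illustrative sketch rather than the tests I added: StartFn, StartupDeps, recordingDeps, and the event names are hypothetical, and no test framework is assumed.

```typescript
// Hypothetical seam: the startup entry point takes its collaborators as arguments.
interface StartupDeps {
  ensureOnnxRuntime(storageDir: string): Promise<string | undefined>;
  createBridgePool(overrides: { _ort_dylib_dir?: string }): void;
}
type StartFn = (
  semanticSearch: boolean,
  storageDir: string,
  deps: StartupDeps,
) => Promise<void>;

// Builds fake collaborators that append to a shared event log.
function recordingDeps(events: string[]): StartupDeps {
  return {
    async ensureOnnxRuntime(storageDir) {
      events.push("onnx:start");
      await Promise.resolve(); // force an async gap, as a real lookup would have
      events.push("onnx:resolved");
      return `${storageDir}/onnxruntime`; // made-up directory layout
    },
    createBridgePool(overrides) {
      events.push(overrides._ort_dylib_dir ? "pool:with-dir" : "pool:no-dir");
    },
  };
}

// Check 1: ensureOnnxRuntime must resolve before BridgePool construction.
async function onnxResolvesBeforeBridgePool(start: StartFn): Promise<void> {
  const events: string[] = [];
  await start(true, "/tmp/aft-storage", recordingDeps(events));
  const expected = "onnx:start,onnx:resolved,pool:with-dir";
  if (events.join(",") !== expected) {
    throw new Error(`unexpected order: ${events.join(",")}`);
  }
}

// Check 2: with semantic search disabled, ensureOnnxRuntime must not be called.
async function onnxSkippedWhenDisabled(start: StartFn): Promise<void> {
  const events: string[] = [];
  await start(false, "/tmp/aft-storage", recordingDeps(events));
  if (events.join(",") !== "pool:no-dir") {
    throw new Error(`ONNX init ran on the fast path: ${events.join(",")}`);
  }
}
```

Recording events in a single array makes the ordering assertion a plain string comparison, so a fire-and-forget call shows up as pool:no-dir appearing before onnx:resolved.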

Why This Matters

This is not just a corrupted local ONNX cache case. The race can happen even when the dylib exists and is loadable on disk, because the child process may start before the plugin has injected the runtime directory into process configuration.
