VRT hangs indefinitely when the emulation/simulation server crashes #71

@JOOpdenhoevel

Summary

When VRT is used in EMULATION or SIMULATION mode, it spawns a child process (vpp_emu / vpp_sim) that exposes a ZeroMQ REP socket on tcp://localhost:5555, and talks to it through vrt::ZmqServer over a ZMQ_REQ socket. If that child process never starts, dies during startup, or crashes mid-run, VRT does not detect this — every subsequent socket.recv(...) blocks forever and the calling application hangs with no diagnostic.

VRT should detect that the emulation/simulation peer is unreachable or has gone away and fail fast with a clear error instead of blocking indefinitely.

Where this lives

  • vrt/src/device.cpp, Device::Device(...) (EMULATION/SIMULATION branches): launches the child via std::system(...) inside a detached std::thread. The PID is never captured, so liveness can't be polled and exit status can't be reported.
  • vrt/src/utils/zmq_server.cpp, ZmqServer::ZmqServer(): creates a ZMQ_REQ socket and calls connect(). ZMQ's connect succeeds even when no peer is listening; failure only manifests at send / recv time, where the current code blocks unconditionally.

Proposed fix (in increasing order of invasiveness)

  1. Socket-level timeouts. Set ZMQ_RCVTIMEO, ZMQ_SNDTIMEO, and ZMQ_LINGER=0 on the REQ socket in ZmqServer. Wrap every socket.recv(reply) so a timeout raises std::runtime_error with a clear message ("emulation/simulation server did not respond within Xs — it may have crashed; check the server logs"). Note: a REQ socket whose request has timed out is in an invalid state and must be re-created — simplest is to throw and let the caller bail.
  2. Startup handshake. In the Device constructor, after spawning the child, send a probe command with a short timeout and a few retries. This converts "hang on first real call" into "fail fast at construction with a diagnostic."
  3. Capture the child PID. Replace std::system() + detached std::thread with posix_spawn (or fork/execvp), store the PID on Device, and call waitpid(pid, &status, WNOHANG) from cleanup() and on any timeout to surface the exit code/signal in the error message. This also catches mid-run crashes, not just startup failures.

(1) alone removes the hang. (1)+(2) gives a clear early failure. (3) is the right long-term fix because it also catches mid-run crashes and reports the underlying cause.
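
The combination of (1) and (2) can be sketched as a bounded retry loop around a probe. This is an illustration only: Probe and wait_for_server are hypothetical names, and the probe is assumed to wrap one send/recv round trip on a REQ socket that already has ZMQ_RCVTIMEO and ZMQ_SNDTIMEO set, returning false on timeout.

```cpp
#include <chrono>
#include <functional>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical probe: sends one cheap request and reports whether a reply
// arrived within `timeout`. Because a REQ socket is unusable after a
// timed-out request, a real probe should create a fresh socket per attempt.
using Probe = std::function<bool(std::chrono::milliseconds timeout)>;

// Try the probe up to `max_attempts` times, sleeping `backoff` between
// attempts. Returns the 1-based attempt number that succeeded, or throws
// so that construction fails fast with a diagnostic instead of hanging.
inline int wait_for_server(const Probe& probe, int max_attempts,
                           std::chrono::milliseconds timeout,
                           std::chrono::milliseconds backoff) {
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        if (probe(timeout)) return attempt;
        if (attempt < max_attempts) std::this_thread::sleep_for(backoff);
    }
    throw std::runtime_error(
        "emulation/simulation server did not respond to " +
        std::to_string(max_attempts) + " probe(s) of " +
        std::to_string(timeout.count()) +
        " ms each; it may have crashed, check the server logs");
}
```

The Device constructor would call something like this right after spawning the child. The worst-case wait is roughly max_attempts × timeout plus the backoff sleeps, so the values should be chosen to stay within the "few seconds" named in the acceptance criteria.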

Acceptance criteria

  • Pointing VRT at an emu/sim binary that exits immediately produces a clear error within a few seconds (no hang).
  • An emu/sim binary that crashes mid-execution causes the next VRT call to throw with a diagnostic that names the failed command and, ideally, the child's exit status or signal.
