fix: run_tests_with_ollama.sh proceeds silently when Ollama warmup times out #759

@ajbozarth

Description

Summary

The run_tests_with_ollama.sh script has a fast path for when it detects an existing Ollama server (lines 82–84), but it blindly trusts that server without verifying it is actually functional. If the existing server is in a bad state, the entire test run fails with Ollama connectivity errors rather than a clear setup failure.

What goes wrong

When the script detects Ollama already running, it skips starting its own server and proceeds directly to model pulls and warmups. If a warmup times out, the script logs a warning but carries on anyway:

Warning: warmup for granite4:micro timed out (will load on first test)

The subsequent tests then error with "could not create OllamaModelBackend: ollama server not running at None" rather than failing fast. The run still takes the full ~80 minutes working through connection timeouts on every affected test before reporting the failures.

Suggested improvements

  • Treat a warmup timeout as a fatal error rather than a warning: either die() with a clear message or attempt to restart the server
  • When reusing an existing server, verify it is responsive with a lightweight check (e.g. ollama ps) before proceeding to warmups
  • Consider adding a --force-restart-ollama flag for environments where stale servers are common
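
The first two suggestions could look roughly like the sketch below. It is not taken from the script: die() is the helper named above, but its body here is a guess, and ollama_is_responsive, warmup_or_die, the default time limits, and the "hi" warmup prompt are all hypothetical. It also assumes the coreutils timeout command is available on the test node.

```shell
#!/usr/bin/env bash
# Sketch only. die() is named in the issue; its body here is an assumption.
# ollama_is_responsive and warmup_or_die are hypothetical helper names.

die() {
    echo "Error: $*" >&2
    exit 1
}

# Return 0 only if the server answers `ollama ps` within the time limit.
# $1: time limit in seconds (default 10, an arbitrary choice).
ollama_is_responsive() {
    local limit="${1:-10}"
    timeout "$limit" ollama ps >/dev/null 2>&1
}

# Warm up one model and abort the whole run if the warmup fails or times out.
# $1: model name, $2: time limit in seconds (default 120, an arbitrary choice).
warmup_or_die() {
    local model="$1" limit="${2:-120}"
    if ! timeout "$limit" ollama run "$model" "hi" >/dev/null 2>&1; then
        die "warmup for $model failed or timed out after ${limit}s; the Ollama server may be stale"
    fi
}
```

In the fast path that reuses an existing server, the script might call `ollama_is_responsive || die "existing Ollama server is not responding"` before any model pulls, then `warmup_or_die granite4:micro` in place of the current warning. Either failure would stop the run within seconds instead of after the full test pass.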

Context

Encountered during a manual cluster test run on an IBM LSF p-series GPU node (preemptable queue). The node had a stale Ollama server from a previous session that was running but unresponsive. Re-running after confirming no Ollama process was running produced a clean result.
