
feat: Resource-Aware Async Orchestrator for Hardware-Safe Inference#282

Open
sanvishukla wants to merge 1 commit into fireform-core:main from sanvishukla:feature/VRAM-Orchestrator

Conversation

@sanvishukla

Summary

This Pull Request introduces a VRAM-Aware Hardware Orchestrator and refactors the core inference pipeline to be fully asynchronous. This addresses the critical issue of server hanging during LLM extraction (identified in #277) and provides the necessary architectural foundation for local voice transcription (#115) on resource-constrained hardware.


Key Changes

1. VRAM Orchestrator (src/core/orchestrator.py)

  • Introduced a VRAMOrchestrator singleton that uses an asyncio.Lock (mutex) to gate access to the GPU.
  • Rationale: prevents out-of-memory (OOM) crashes on 8 GB/16 GB devices by ensuring that heavy models (such as Whisper and Ollama) run one at a time and never compete for the same GPU/VRAM resources.

2. Full Async Pipeline Refactor

  • Core Logic: Refactored llm.py to use httpx.AsyncClient for non-blocking communication with local Ollama instances.
  • Propagation: Updated Filler, FileManipulator, and Controller layers to support async/await.
  • API Response: Converted the /forms/fill endpoint to async def, allowing the FastAPI server to remain responsive to other requests (e.g., GET /templates) even during heavy 30–60 second inference tasks.

3. Improved Transport Reliability

  • Added configurable 60s timeouts to LLM requests to prevent indefinite "silent hangs" on slow model loads.
  • Implemented robust error handling for transport errors (ConnectError, TimeoutException), providing clear feedback when Ollama is unavailable.

Technical Implementation Details

  • Pattern: Singleton design for hardware management.
  • Dependencies: Added httpx (async HTTP), pytest-asyncio (async testing), and respx (transport mocking).
  • Schema Mapping: Simplified per-field orchestration within the main_loop to allow yielding the event loop between extraction steps.
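The per-field orchestration idea can be sketched as a plain awaiting loop; `fill_fields` and `extract` are hypothetical names standing in for the PR's `main_loop` and extraction step.

```python
async def fill_fields(fields, extract):
    """Extract each field in turn, yielding the event loop between steps."""
    results = {}
    for name in fields:
        # Each await is a suspension point, so the server can handle
        # other requests between per-field extraction steps.
        results[name] = await extract(name)
    return results
```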

Verification & Testing

Automated Tests

  • tests/test_vram_orchestrator.py

    • Confirms singleton integrity
    • Verifies concurrent hardware requests are correctly serialized via the Mutex lock
  • tests/test_reliability.py

    • Uses respx to simulate slow LLM responses and connection failures
    • Verifies timeout handling and non-blocking behavior
```bash
# Run verification
export PYTHONPATH=$PYTHONPATH:.
pytest tests/test_vram_orchestrator.py tests/test_reliability.py
```

Closes #281

@Mahendrareddy2006

Thanks @sanvishukla for the detailed implementation. I originally noticed the blocking behavior while testing the FireForm API locally, so it's great to see a full async refactor addressing the issue.

The orchestrator approach for managing VRAM access also seems useful for supporting multiple models on lower-memory systems. I’ll try testing this PR locally and share feedback if I notice anything.

