
feat: Resource-Aware Async Orchestrator for Hardware-Safe Inference#282

Open
sanvishukla wants to merge 1 commit into fireform-core:main from sanvishukla:feature/VRAM-Orchestrator

Conversation

@sanvishukla

Summary

This Pull Request introduces a VRAM-Aware Hardware Orchestrator and refactors the core inference pipeline to be fully asynchronous. This addresses the critical issue of server hanging during LLM extraction (identified in #277) and provides the necessary architectural foundation for local voice transcription (#115) on resource-constrained hardware.


Key Changes

1. VRAM Orchestrator (src/core/orchestrator.py)

  • Introduced a VRAMOrchestrator singleton that uses an asyncio.Lock (mutex) to gate access to the GPU.
  • Rationale: prevents out-of-memory (OOM) crashes on 8 GB/16 GB devices by ensuring that heavy models (such as Whisper and Ollama) run one at a time and never compete for the same GPU/VRAM resources.

2. Full Async Pipeline Refactor

  • Core Logic: Refactored llm.py to use httpx.AsyncClient for non-blocking communication with local Ollama instances.
  • Propagation: Updated Filler, FileManipulator, and Controller layers to support async/await.
  • API Response: Converted the /forms/fill endpoint to async def, allowing the FastAPI server to remain responsive to other requests (e.g., GET /templates) even during heavy 30–60 second inference tasks.

3. Improved Transport Reliability

  • Added configurable 60s timeouts to LLM requests to prevent indefinite "silent hangs" on slow model loads.
  • Implemented robust error handling for transport errors (ConnectError, TimeoutException), providing clear feedback when Ollama is unavailable.

Technical Implementation Details

  • Pattern: Singleton design for hardware management.
  • Dependencies: Added httpx (async HTTP), pytest-asyncio (async testing), and respx (transport mocking).
  • Schema Mapping: Simplified per-field orchestration within the main_loop to allow yielding the event loop between extraction steps.
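The per-field orchestration idea can be sketched as a plain awaiting loop; `fill_fields` and `extract` are hypothetical names standing in for the PR's `main_loop` and extraction step.

```python
async def fill_fields(fields, extract):
    """Extract each field in turn, yielding the event loop between steps."""
    results = {}
    for name in fields:
        # Each await is a suspension point, so the server can handle
        # other requests between per-field extraction steps.
        results[name] = await extract(name)
    return results
```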

Verification & Testing

Automated Tests

  • tests/test_vram_orchestrator.py

    • Confirms singleton integrity
    • Verifies concurrent hardware requests are correctly serialized via the Mutex lock
  • tests/test_reliability.py

    • Uses respx to simulate slow LLM responses and connection failures
    • Verifies timeout handling and non-blocking behavior
```bash
# Run verification
export PYTHONPATH=$PYTHONPATH:.
pytest tests/test_vram_orchestrator.py tests/test_reliability.py
```

Closes #281

@Mahendrareddy2006

Thanks @sanvishukla for the detailed implementation. I originally noticed the blocking behavior while testing the FireForm API locally, so it's great to see a full async refactor addressing the issue.

The orchestrator approach for managing VRAM access also seems useful for supporting multiple models on lower-memory systems. I’ll try testing this PR locally and share feedback if I notice anything.

