This repository contains a hardened, Dockerized Retrieval-Augmented Generation (RAG) inference boilerplate.
Standard open-source configurations for locally hosted Large Language Models fail when subjected to production constraints: they exhaust VRAM, block asynchronous event loops, and break on minor dependency version mismatches. This architecture combines native FastAPI routing with strict hardware mapping to deploy LoRA fine-tuned models on highly constrained compute environments (≤ 4 GB VRAM).
- **ASGI Event Loop Preservation:** Native Hugging Face tensor generation is strictly synchronous. Executing `model.generate()` inside a standard `async` route blocks the event loop and gridlocks the server under concurrent requests. This architecture enforces strict application state isolation (`app.state`) and uses `starlette.concurrency` threadpool offloading to keep the API responsive.
- **Deterministic VRAM Constraints:** Loading full-precision (fp32) weights by default is an engineering anti-pattern. This node leverages `bitsandbytes` (with `scipy`) for deterministic 8-bit weight quantization paired with `torch.float16` activations, cutting the weight footprint to roughly a quarter of full precision so inference fits on entry-level GPUs.
- **Dependency Schema Immutability:** Recent releases of `peft` inject experimental configuration keys (e.g., `alora_invocation_tokens`) that break older deployment stacks. The adapter configurations in this repository have been pruned to guarantee backward compatibility with `peft==0.6.2` and `transformers==4.35.2`.
- **Virtualization & Network Resilience:** Deployed from a customized `nvidia/cuda:11.8.0` base image, with explicit DNS injection to work around WSL2 DNS resolution failures and WDDM routing issues.
Ensure the deployment host complies with the following strict parameters:

- Host Drivers: NVIDIA drivers explicitly supporting CUDA 11.8 or higher. Legacy OEM drivers (e.g., WDDM 451.xx) will cause GPU passthrough failures inside the container.
Execute the following directives sequentially to instantiate the node.
- Clone the Asset:

  ```bash
  git clone https://github.com/your-username/llmops.git
  cd llmops
  ```
- Compile the Immutable Layer: Build the image so the Docker daemon ingests the dependency manifests and compiles the application layer with cached C-extensions.

  ```bash
  docker build -t enterprise-inference-node .
  ```
- Initialize the Containerized Node: Run the container with explicit GPU passthrough. The `--dns` flags are mandatory when operating behind a restrictive virtualization backend (such as Windows WSL2) to prevent DNS resolution failures when reaching the external model hub.

  ```bash
  docker run -d --name llm-node --gpus all --dns 8.8.8.8 --dns 1.1.1.1 -p 8000:8000 enterprise-inference-node
  ```
- Observe the Boot Sequence: Monitor the logs to confirm the 8-bit tensor allocation and application state binding.

  ```bash
  docker logs -f llm-node
  ```
- The node is operational only once the logs show: ✔️ Injection Complete! Inference Node active!
Endpoint: `POST /generate/`

Payload Schema (`application/json`):

```json
{
  "instructions": "Classify the following text as True or Fake News.",
  "context": "The global compute market is shifting entirely towards edge-deployed quantized models.",
  "max_new_tokens": 128
}
```
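The payload above can be exercised from Python's standard library. `build_generate_request` is a hypothetical helper (not part of the repository); the host and port match the `docker run` mapping earlier in this README.

```python
import json
from urllib.request import Request


def build_generate_request(host: str = "http://localhost:8000") -> Request:
    """Build the POST request matching the payload schema above."""
    payload = {
        "instructions": "Classify the following text as True or Fake News.",
        "context": "The global compute market is shifting entirely towards "
                   "edge-deployed quantized models.",
        "max_new_tokens": 128,
    }
    return Request(
        f"{host}/generate/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
```

Once the container is up, `urllib.request.urlopen(build_generate_request())` sends the call; the exact shape of the response body is defined by the application code.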
Execution Telemetry: The application includes a baseline HTTP middleware layer that logs request method, path, status code, and sub-second latency to standard output for pipeline observability.