This repository contains a hardened, Dockerized Retrieval-Augmented Generation (RAG) inference boilerplate.
Standard open-source configurations for locally hosted Large Language Models fail when subjected to production constraints: they exhaust VRAM, block asynchronous event loops, and break on minor dependency version mismatches. This architecture combines native FastAPI routing with strict hardware mapping to deploy LoRA fine-tuned models on highly constrained compute environments (≤ 4 GB VRAM).
- **ASGI Event Loop Preservation:** Native Hugging Face tensor generation is strictly synchronous. Executing `model.generate()` inside a standard `async` route blocks the event loop and gridlocks the server under concurrent requests. This architecture enforces strict application state isolation (`app.state`) and uses `starlette.concurrency` threadpool offloading to keep the API responsive.
- **Deterministic VRAM Constraints:** Loading full-precision (fp32) weights by default is an engineering anti-pattern. This node leverages `bitsandbytes` (with `scipy`) for deterministic 8-bit weight quantization paired with `torch.float16` activations, cutting the weight footprint to roughly a quarter of full precision so inference fits on entry-level GPUs.
- **Dependency Schema Immutability:** Recent releases of `peft` inject experimental configuration keys (e.g., `alora_invocation_tokens`) that break older deployment stacks. The adapter configurations in this repository have been pruned to guarantee backward compatibility with `peft==0.6.2` and `transformers==4.35.2`.
- **Virtualization & Network Resilience:** Deployed from a customized `nvidia/cuda:11.8.0` base image, with explicit DNS injection to work around WSL2 DNS resolution failures and WDDM routing issues.
Ensure the deployment host complies with the following strict parameters:

- Host Drivers: NVIDIA drivers explicitly supporting CUDA 11.8 or higher. Legacy OEM drivers (e.g., WDDM 451.xx) will cause GPU passthrough failures inside the container.
Execute the following directives sequentially to instantiate the node.
- Clone the Asset:

  ```bash
  git clone https://github.com/your-username/llmops.git
  cd llmops
  ```
- Compile the Immutable Layer: Build the image so the Docker daemon ingests the dependency manifests and compiles the application layer with cached C-extensions.

  ```bash
  docker build -t enterprise-inference-node .
  ```
- Initialize the Containerized Node: Run the container with explicit GPU passthrough. The `--dns` flags are mandatory when operating behind a restrictive virtualization backend (such as Windows WSL2) to prevent DNS resolution failures when reaching the external model hub.

  ```bash
  docker run -d --name llm-node --gpus all --dns 8.8.8.8 --dns 1.1.1.1 -p 8000:8000 enterprise-inference-node
  ```
- Observe the Boot Sequence: Monitor the logs to confirm the 8-bit tensor allocation and application state binding.

  ```bash
  docker logs -f llm-node
  ```
- The node is operational only once the logs show: ✔️ Injection Complete! Inference Node active!
Endpoint: `POST /generate/`

Payload Schema (`application/json`):

```json
{
  "instructions": "Classify the following text as True or Fake News.",
  "context": "The global compute market is shifting entirely towards edge-deployed quantized models.",
  "max_new_tokens": 128
}
```
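The payload above can be exercised from Python's standard library. `build_generate_request` is a hypothetical helper (not part of the repository); the host and port match the `docker run` mapping earlier in this README.

```python
import json
from urllib.request import Request


def build_generate_request(host: str = "http://localhost:8000") -> Request:
    """Build the POST request matching the payload schema above."""
    payload = {
        "instructions": "Classify the following text as True or Fake News.",
        "context": "The global compute market is shifting entirely towards "
                   "edge-deployed quantized models.",
        "max_new_tokens": 128,
    }
    return Request(
        f"{host}/generate/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
```

Once the container is up, `urllib.request.urlopen(build_generate_request())` sends the call; the exact shape of the response body is defined by the application code.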
Execution Telemetry: The application includes a baseline HTTP middleware layer that logs request method, path, status code, and sub-second latency to standard output for pipeline observability.