# Enterprise Local LLMOps Inference Node

This repository contains a hardened, Dockerized Retrieval-Augmented Generation (RAG) inference boilerplate.

Typical open-source setups for local Large Language Models fail under production constraints: they exhaust VRAM, block asynchronous event loops, and break on minor dependency version mismatches. This architecture combines native FastAPI routing with strict hardware mapping to deploy LoRA fine-tuned models on highly constrained compute environments (≤ 4 GB VRAM).

## Architectural Optimizations

1. **ASGI Event Loop Preservation:** Hugging Face tensor generation is strictly synchronous. Calling `model.generate()` directly inside an `async` route blocks the event loop and stalls the server under concurrent requests. This architecture keeps model objects in isolated application state (`app.state`) and offloads generation to a `starlette.concurrency` thread pool to keep the API responsive.

2. **Deterministic VRAM Constraints:** Loading weights at full precision is an anti-pattern on small GPUs. This node uses `bitsandbytes` (with `scipy`) for deterministic 8-bit quantization paired with `torch.float16` mapping, halving the memory footprint so the model fits on entry-level enterprise GPUs.

3. **Dependency Schema Immutability:** Recent releases of `peft` inject experimental configuration keys (e.g., `alora_invocation_tokens`) that break stable deployment stacks. The adapter configurations in this repository have been pruned to guarantee backward compatibility with `peft==0.6.2` and `transformers==4.35.2`.

4. **Virtualization & Network Resilience:** Deployed on a customized `nvidia/cuda:11.8.0` base image, with explicit DNS settings to work around WSL2 network resolution failures and WDDM driver routing issues.
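The event-loop offloading in optimization 1 can be sketched with the standard library alone. This is a minimal illustration, not the repository's code: `blocking_generate` is a stand-in for the synchronous `model.generate()` call, and `asyncio.to_thread` plays the role that `starlette.concurrency.run_in_threadpool` plays inside the FastAPI route.

```python
import asyncio
import time

def blocking_generate(prompt: str) -> str:
    """Stand-in for the synchronous model.generate() call."""
    time.sleep(0.2)  # simulates tensor generation occupying the thread
    return f"completion for: {prompt}"

async def generate_route(prompt: str) -> str:
    # Offload the blocking call to a worker thread so the event loop
    # stays free to serve other requests; in the app this is
    # starlette.concurrency.run_in_threadpool(model.generate, ...).
    return await asyncio.to_thread(blocking_generate, prompt)

async def main():
    ticks = 0

    async def heartbeat():
        # Proves the loop keeps running while generation is in flight.
        nonlocal ticks
        while True:
            await asyncio.sleep(0.05)
            ticks += 1

    hb = asyncio.create_task(heartbeat())
    result = await generate_route("hello")
    hb.cancel()
    return result, ticks

result, ticks = asyncio.run(main())
```

Had `generate_route` called `blocking_generate` directly, the heartbeat task would not tick until generation finished — the gridlock the architecture avoids.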
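The quantized load in optimization 2 looks roughly as follows under the pinned `transformers==4.35.2` API. This is a loading sketch, not code from the repository; the model identifier is a placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "your-org/your-base-model"  # placeholder, not a path from this repo

# load_in_8bit delegates weight quantization to bitsandbytes;
# torch.float16 covers the remaining non-quantized tensors, halving
# the footprint relative to a default-precision load.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",  # map layers onto the single available GPU
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
```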
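The pruning in optimization 3 amounts to dropping configuration keys that `peft==0.6.2` does not recognize before the adapter is loaded. A minimal sketch — the allow-list below is illustrative, not the exact key set accepted by that release:

```python
import json

# Keys accepted by the pinned peft release; newer peft versions emit
# extra experimental keys (e.g. "alora_invocation_tokens") that older
# LoraConfig constructors reject. This allow-list is illustrative.
KNOWN_KEYS = {
    "peft_type", "base_model_name_or_path", "task_type", "r",
    "lora_alpha", "lora_dropout", "target_modules", "bias",
    "fan_in_fan_out", "inference_mode", "modules_to_save",
}

def prune_adapter_config(path: str) -> None:
    """Drop adapter_config.json keys the pinned peft version rejects."""
    with open(path) as f:
        cfg = json.load(f)
    pruned = {k: v for k, v in cfg.items() if k in KNOWN_KEYS}
    with open(path, "w") as f:
        json.dump(pruned, f, indent=2)
```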

## Infrastructure Prerequisites

Ensure the deployment host meets the following requirements:

- **Host Drivers:** NVIDIA Studio Drivers explicitly supporting CUDA 11.8 or later. Legacy OEM drivers (e.g., WDDM 451.xx) cause GPU passthrough failures inside the container.
- **Orchestration:** Docker Engine with the NVIDIA Container Toolkit installed and initialized.
- **Hardware:** Minimum 4 GB dedicated VRAM.

## Global Deployment Protocol

Execute the following steps sequentially to instantiate the node.

1. **Clone the repository:**

   ```bash
   git clone https://github.com/your-username/llmops.git
   cd llmops
   ```

2. **Build the immutable layer:** The build forces the Docker daemon to ingest the dependency manifests and compile the application layer, reusing cached C-extensions.

   ```bash
   docker build -t enterprise-inference-node .
   ```

3. **Initialize the containerized node:** Run the container with explicit GPU passthrough. The `--dns` flags are mandatory inside restrictive virtualization backends (such as Windows WSL2) to prevent model hub resolution failures.

   ```bash
   docker run -d --name llm-node --gpus all --dns 8.8.8.8 --dns 1.1.1.1 -p 8000:8000 enterprise-inference-node
   ```

4. **Observe the boot sequence:** Monitor the logs to confirm the 8-bit tensor allocation and application state binding.

   ```bash
   docker logs -f llm-node
   ```

5. The node is operational only when the terminal outputs: `✔️ Injection Complete! Inference Node active!`

## API Contract

**Endpoint:** `POST /generate/`

**Payload schema** (`application/json`):

```json
{
  "instructions": "Classify the following text as True or Fake News.",
  "context": "The global compute market is shifting entirely towards edge-deployed quantized models.",
  "max_new_tokens": 128
}
```
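A sample invocation of the endpoint, assuming the container from the deployment steps above is running and listening on `localhost:8000`:

```shell
curl -s -X POST http://localhost:8000/generate/ \
  -H "Content-Type: application/json" \
  -d '{
        "instructions": "Classify the following text as True or Fake News.",
        "context": "The global compute market is shifting entirely towards edge-deployed quantized models.",
        "max_new_tokens": 128
      }'
```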

**Execution Telemetry:** The application includes a baseline HTTP middleware layer that logs request method, path, status code, and latency with sub-second resolution to standard output for pipeline observability.
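That telemetry layer can be sketched as a minimal pure-ASGI middleware. This is an illustration of the pattern, not the repository's implementation:

```python
import time

class TelemetryMiddleware:
    """Log method, path, status code, and latency for each HTTP request."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return
        start = time.perf_counter()
        seen = {}

        async def send_wrapper(message):
            # The status code travels in the http.response.start message.
            if message["type"] == "http.response.start":
                seen["status"] = message["status"]
            await send(message)

        await self.app(scope, receive, send_wrapper)
        elapsed = time.perf_counter() - start
        print(f'{scope["method"]} {scope["path"]} '
              f'{seen.get("status")} {elapsed:.3f}s')
```

In FastAPI this would be registered with `app.add_middleware(TelemetryMiddleware)`.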
