Official Python client for Inference Labs — a vendor-neutral router for the major cloud LLMs (OpenAI / Azure / Anthropic / Google / AWS Bedrock / RunwayML). One endpoint, one billing surface, automatic failover, semantic caching, and policy-driven model selection (cost-first, quality-first, latency-first, balanced, judge).

pip install inference-labs

Or install the current release directly from GitHub (this works today and does not depend on a PyPI release):

pip install https://github.com/bosslesss/inference-labs-python/releases/download/v0.1.0/inference_labs-0.1.0-py3-none-any.whl

Optional LangChain integration:

pip install "inference-labs[langchain]"

from inference_labs import InferenceLabs
client = InferenceLabs(api_key="il_live_...") # or INFERENCE_LABS_API_KEY env var
out = client.generate(
    prompt="Summarize this ticket: the laser printer is offline...",
    strategy="cost-first",
    max_cost_usd=0.01,
)
print(out.text)
print(f"routed via {out.provider}/{out.model} -- ${out.cost_usd:.5f}")

Streaming:

for chunk in client.stream(prompt="Write a haiku about caching."):
    print(chunk, end="", flush=True)

Async (same surface, awaitable):

import asyncio
from inference_labs import AsyncInferenceLabs

async def main():
    async with AsyncInferenceLabs() as client:
        out = await client.generate(prompt="Hello.")
        print(out.text)

asyncio.run(main())

from inference_labs.langchain import ChatInferenceLabs
from langchain_core.messages import HumanMessage, SystemMessage
llm = ChatInferenceLabs(
    api_key="il_live_...",
    strategy="balanced",
    max_cost_usd=0.01,
)
resp = llm.invoke([
    SystemMessage(content="You are a terse SRE."),
    HumanMessage(content="What does workers=1 threads=8 mean for SQLite?"),
])
print(resp.content)
print(resp.additional_kwargs)  # -> model, provider, cost_usd, latency_ms, cached, trace_id

All parameters below are optional and can be set per call.

| Parameter | Type | What it does |
|---|---|---|
| `strategy` | `"balanced"` / `"cost-first"` / `"quality-first"` / `"latency-first"` / `"judge"` | Picks the policy the router uses to choose between models in your allowlist. |
| `max_cost_usd` | `float` | Hard cap on per-request cost in USD. |
| `max_latency_ms` | `int` | Latency budget in milliseconds. |
| `allow_models` | `list[str]` | Restrict the call to a subset of model IDs. |
| `deny_models` | `list[str]` | Exclude specific model IDs from selection. |
| `workspace_id` | `str` | Override the API key's default workspace. |
| `collect_trace` | `bool` | Persist a redacted trace for evals (default `True`). |
| `redact_pii` | `bool` | Run the PII / secrets redactor before storage (default `True`). |

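
The per-call parameters in the table can be bundled into a reusable set of keyword arguments. The sketch below is a hypothetical helper (`RoutingOptions` is not part of the SDK); it assumes only the parameter names and types listed above, and that `client.generate` accepts them as keyword arguments as the other examples in this README do.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional

# Strategy names as listed in the parameter table.
STRATEGIES = {"balanced", "cost-first", "quality-first", "latency-first", "judge"}


@dataclass(frozen=True)
class RoutingOptions:
    """Hypothetical helper, not part of the SDK: bundles per-call options."""

    strategy: Optional[str] = None
    max_cost_usd: Optional[float] = None
    max_latency_ms: Optional[int] = None
    allow_models: Optional[list[str]] = None
    deny_models: Optional[list[str]] = None

    def __post_init__(self) -> None:
        # Fail early on values the table does not allow.
        if self.strategy is not None and self.strategy not in STRATEGIES:
            raise ValueError(f"unknown strategy: {self.strategy!r}")
        if self.max_cost_usd is not None and self.max_cost_usd <= 0:
            raise ValueError("max_cost_usd must be positive")
        if self.max_latency_ms is not None and self.max_latency_ms <= 0:
            raise ValueError("max_latency_ms must be positive")

    def as_kwargs(self) -> dict[str, Any]:
        """Return only the options that were set, leaving the rest at their defaults."""
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not None
        }


cheap_and_fast = RoutingOptions(strategy="cost-first", max_cost_usd=0.01, max_latency_ms=2000)
print(cheap_and_fast.as_kwargs())
```

With a real client this would be used as `client.generate(prompt="...", **cheap_and_fast.as_kwargs())`.
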
The router returns a small typed object:
@dataclass
class GenerationResult:
    text: str
    model: str
    provider: str
    cost_usd: float
    latency_ms: int
    cached: bool
    trace_id: str | None
    raw: dict  # whole response payload if you need fields we don't surface

from inference_labs import (
    InferenceLabsError, AuthenticationError, RateLimitError,
    InsufficientCreditsError, APIError,
)

All exceptions inherit from InferenceLabsError, so catching that one class handles them all.

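
One common use of a typed exception hierarchy is retrying only the errors that are likely to be transient. The following is a minimal sketch of a generic retry helper with exponential backoff; `call_with_retry` and `FakeRateLimitError` are hypothetical names for illustration, and the stand-in exception is used so that the sketch runs without the SDK installed. The README does not say which SDK errors are safe to retry, so treating `RateLimitError` as retryable is an assumption.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on the given exception types with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt, plus up to 100 ms of jitter.
            sleep(base_delay_s * 2 ** attempt + random.uniform(0, 0.1))
    raise ValueError("attempts must be at least 1")


# Stand-in for the SDK's RateLimitError (assumption: it behaves like a normal exception).
class FakeRateLimitError(Exception):
    pass


calls = {"n": 0}


def flaky() -> str:
    """Fails twice, then succeeds, to exercise the retry path."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise FakeRateLimitError("slow down")
    return "ok"


result = call_with_retry(flaky, retry_on=(FakeRateLimitError,), sleep=lambda s: None)
print(result, calls["n"])
```

With the real SDK, the call would look like `call_with_retry(lambda: client.generate(prompt="Hello."), retry_on=(RateLimitError,))`. Errors such as `AuthenticationError` or `InsufficientCreditsError` are unlikely to succeed on a retry, so they are left out of `retry_on` here.
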
client = InferenceLabs(
    api_key="il_live_...",  # or INFERENCE_LABS_API_KEY
    base_url="https://app.inference-labs.com",  # override for staging / self-hosted
    timeout=60.0,
)

For multi-tenant frameworks, pass your own httpx.Client / httpx.AsyncClient via the client= kwarg so the SDK reuses your connection pool.

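
When the same code has to run against production, staging, and self-hosted deployments, it can help to build the constructor arguments from the environment. This is a rough sketch under stated assumptions: only `INFERENCE_LABS_API_KEY` is documented in this README, while `INFERENCE_LABS_BASE_URL`, `INFERENCE_LABS_TIMEOUT`, and the helper `client_kwargs_from_env` are hypothetical names chosen for illustration.

```python
import os
from typing import Any, Mapping

# Values taken from the configuration example in this README.
DEFAULT_BASE_URL = "https://app.inference-labs.com"
DEFAULT_TIMEOUT_S = "60"


def client_kwargs_from_env(env: Mapping[str, str] = os.environ) -> dict[str, Any]:
    """Build InferenceLabs(...) keyword arguments from environment variables.

    INFERENCE_LABS_BASE_URL and INFERENCE_LABS_TIMEOUT are hypothetical
    variable names; the SDK itself is not known to read them.
    """
    kwargs: dict[str, Any] = {
        "base_url": env.get("INFERENCE_LABS_BASE_URL", DEFAULT_BASE_URL).rstrip("/"),
        "timeout": float(env.get("INFERENCE_LABS_TIMEOUT", DEFAULT_TIMEOUT_S)),
    }
    api_key = env.get("INFERENCE_LABS_API_KEY")
    if api_key:
        # Pass the key only when it is set, so the SDK's own env lookup still applies.
        kwargs["api_key"] = api_key
    return kwargs


staging = client_kwargs_from_env({
    "INFERENCE_LABS_API_KEY": "il_live_...",
    "INFERENCE_LABS_BASE_URL": "https://staging.example.com/",
})
print(staging)
```

The resulting dictionary would then be passed as `InferenceLabs(**client_kwargs_from_env())`.
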
Apache-2.0. See LICENSE.
- Marketing: https://inference-labs.com
- App / dashboard: https://app.inference-labs.com
- Issues: https://github.com/bosslesss/InferenceLabs/issues
- Blog: https://blog.inference-labs.com