LLM Production Toolkit

A Python toolkit for evaluating, monitoring, and ensuring the safety of LLM deployments in production.

The problem: 95% of enterprise AI pilots fail to deliver value, not because the models are bad but because organizations lack the production engineering to deploy them reliably.

The solution: Concrete, runnable tools that address the most common failure modes: hallucination, bias, lack of feedback loops, and operational unreadiness.

Quick Start

pip install llm-production-toolkit

For ML-powered modules (hallucination detection):

pip install llm-production-toolkit[hallucination]

For everything:

pip install llm-production-toolkit[all]

Modules

Hallucination Grounding Check

Evaluate whether LLM output is grounded in source documents. Detection combines embedding similarity with natural language inference (NLI) entailment.

llm-toolkit hallucination check \
  --output "The Eiffel Tower was built in 1820" \
  --source "The Eiffel Tower was constructed from 1887 to 1889 in Paris, France"

Or from Python:

from llm_production_toolkit.hallucination import GroundingEvaluator

evaluator = GroundingEvaluator()
result = evaluator.evaluate(
    llm_output="The Eiffel Tower was built in 1820.",
    source_context="The Eiffel Tower was constructed from 1887 to 1889.",
)
print(f"Grounding score: {result.overall_score:.2f}")
print(f"Flagged claims: {len(result.flagged_claims)}")
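To make the idea of combining a similarity signal with an entailment signal more concrete, here is a minimal sketch. It is not the toolkit's implementation: `cosine_similarity` over bag-of-words counts stands in for real embeddings, `toy_entails` stands in for an NLI model, and `grounding_score`, the equal weighting, and the 0.5 flagging threshold are all assumptions chosen so the example runs without downloading any model.

```python
import math
import re
from collections import Counter
from typing import Callable


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity of bag-of-words count vectors (stand-in for embeddings)."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def toy_entails(premise: str, hypothesis: str) -> float:
    """Stand-in for an NLI model: 1.0 only if every number in the
    hypothesis also appears in the premise, otherwise 0.0."""
    nums_h = set(re.findall(r"\d+", hypothesis))
    nums_p = set(re.findall(r"\d+", premise))
    return 1.0 if nums_h <= nums_p else 0.0


def grounding_score(
    claim: str,
    source: str,
    entails: Callable[[str, str], float] = toy_entails,
    sim_weight: float = 0.5,  # assumed weighting, not taken from the toolkit
) -> float:
    """Weighted mix of the similarity signal and the entailment signal."""
    sim = cosine_similarity(claim, source)
    ent = entails(source, claim)
    return sim_weight * sim + (1 - sim_weight) * ent


source = "The Eiffel Tower was constructed from 1887 to 1889."
for claim in (
    "The Eiffel Tower was built in 1820.",
    "The Eiffel Tower was constructed in 1889.",
):
    score = grounding_score(claim, source)
    print(f"{score:.2f}  flagged={score < 0.5}  {claim}")
```

The first claim shares many words with the source, so similarity alone would rate it fairly well. The entailment signal is what pulls its score down, which is the motivation for using both.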

Bias Evaluation

Test any LLM for demographic bias across gender, race, and age using controlled prompt variations.

from llm_production_toolkit.bias import BiasEvaluator

def my_llm(prompt: str) -> str:
    # Wrap any LLM (OpenAI, Anthropic, a local model, etc.).
    # call_your_llm is a placeholder for your own client code.
    return call_your_llm(prompt)

evaluator = BiasEvaluator(llm_callable=my_llm, categories=["gender", "race"])
report = evaluator.evaluate(num_runs=3)
print(f"Overall bias score: {report.overall_bias_score:.2f}")
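The idea behind controlled prompt variations can be sketched as follows: keep the prompt fixed, swap only the demographic term, and compare how the responses score across groups. Everything here is hypothetical and only illustrative: `TEMPLATE`, `VARIANTS`, the word-list `positivity` scorer, and `bias_gap` are assumptions, not how `BiasEvaluator` computes its score.

```python
from statistics import mean
from typing import Callable

# Hypothetical template and variants: only the demographic term changes.
TEMPLATE = "Describe {person}, who is applying for a senior engineering role."
VARIANTS = {"male": "a man", "female": "a woman"}
POSITIVE_WORDS = {"strong", "capable", "skilled", "excellent"}


def positivity(text: str) -> float:
    """Toy scorer: fraction of words found in a small positive-word list."""
    words = [w.strip(".,") for w in text.lower().split()]
    return sum(w in POSITIVE_WORDS for w in words) / len(words) if words else 0.0


def bias_gap(llm: Callable[[str], str], num_runs: int = 3) -> float:
    """Largest difference in mean positivity between any two groups."""
    group_means = {}
    for group, person in VARIANTS.items():
        prompt = TEMPLATE.format(person=person)
        # Repeat each prompt to average out sampling noise in the model.
        group_means[group] = mean(positivity(llm(prompt)) for _ in range(num_runs))
    return max(group_means.values()) - min(group_means.values())


def biased_llm(prompt: str) -> str:
    """Stub model that answers less favourably for one variant."""
    if "a woman" in prompt:
        return "A candidate."
    return "A strong, capable candidate."


def neutral_llm(prompt: str) -> str:
    """Stub model that ignores the demographic term."""
    return "A strong candidate."


print(f"Biased stub gap:  {bias_gap(biased_llm):.2f}")
print(f"Neutral stub gap: {bias_gap(neutral_llm):.2f}")
```

A real evaluation would use many templates and a more robust scorer (for example sentiment analysis), but the structure stays the same: identical prompts, one varied attribute, compared outputs.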

Production Feedback Loop

Collect and analyze user feedback on LLM outputs, through either the Python API or a REST server.

# Start the feedback server
llm-toolkit feedback start --port 8100

# Check metrics
llm-toolkit feedback metrics --window 24

Or from Python:

from llm_production_toolkit.feedback import FeedbackCollector, FeedbackEntry

collector = FeedbackCollector("feedback.db")
collector.record(FeedbackEntry(
    session_id="sess-123",
    prompt_hash="abc",
    feedback_type="thumbs",
    thumbs_value=True,
))
metrics = collector.get_metrics(window_hours=24)
print(f"Satisfaction: {metrics.satisfaction_rate:.1%}")
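For readers curious what a windowed satisfaction metric involves, here is a minimal sketch using the standard-library `sqlite3` module. It assumes a SQLite-style store, which the `feedback.db` filename hints at but the source does not confirm. The table schema, column names, and the `satisfaction_rate` function are hypothetical and are not the toolkit's storage layout.

```python
import sqlite3
import time
from typing import Optional


def satisfaction_rate(
    conn: sqlite3.Connection,
    window_hours: float,
    now: Optional[float] = None,
) -> Optional[float]:
    """Share of thumbs-up entries recorded within the last `window_hours`."""
    now = time.time() if now is None else now
    cutoff = now - window_hours * 3600
    row = conn.execute(
        "SELECT AVG(thumbs_value) FROM feedback WHERE created_at >= ?",
        (cutoff,),
    ).fetchone()
    return row[0]  # None if the window contains no feedback


# Hypothetical schema: one row per feedback entry, thumbs stored as 1 or 0.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE feedback (session_id TEXT, thumbs_value INTEGER, created_at REAL)"
)
now = 1_000_000.0  # fixed timestamp so the example is deterministic
rows = [
    ("sess-1", 1, now - 1 * 3600),
    ("sess-2", 1, now - 2 * 3600),
    ("sess-3", 0, now - 10 * 3600),
    ("sess-4", 1, now - 20 * 3600),
    ("sess-5", 0, now - 48 * 3600),  # older than the 24-hour window
]
conn.executemany("INSERT INTO feedback VALUES (?, ?, ?)", rows)

rate = satisfaction_rate(conn, window_hours=24, now=now)
print(f"Satisfaction: {rate:.1%}")
```

Returning `None` for an empty window, rather than 0, keeps "no feedback yet" distinguishable from "all feedback negative".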

Production Readiness Assessment

Interactive CLI that scores your LLM deployment's operational maturity across 9 categories.

llm-toolkit readiness assess

Produces a readiness score (0-100) with category breakdowns and prioritized recommendations.
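One plausible way to turn per-category results into a 0-100 score with prioritized recommendations is a weighted average plus a ranking by weighted shortfall. The sketch below illustrates that idea only. The category names, weights, and both functions are assumptions, since the source does not list the categories or describe the scoring formula.

```python
def readiness_score(answers: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of category scores (each in 0..1), scaled to 0-100."""
    total_weight = sum(weights[c] for c in answers)
    return 100 * sum(answers[c] * weights[c] for c in answers) / total_weight


def prioritize(
    answers: dict[str, float], weights: dict[str, float], top_n: int = 2
) -> list[str]:
    """Categories ranked by weighted shortfall, largest gap first."""
    return sorted(
        answers,
        key=lambda c: (1 - answers[c]) * weights[c],
        reverse=True,
    )[:top_n]


# Hypothetical categories and weights, for illustration only.
answers = {"monitoring": 0.5, "rollback": 1.0, "evaluation": 0.25}
weights = {"monitoring": 2.0, "rollback": 1.0, "evaluation": 1.0}

print(f"Readiness: {readiness_score(answers, weights):.1f}/100")
print("Fix first:", prioritize(answers, weights))
```

Ranking by weighted shortfall rather than raw score means a half-finished high-weight category can outrank a weaker low-weight one.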

Compliance Mapper

Map evaluation results to AI best-practice requirements. Works with any subset of module outputs.

llm-toolkit compliance report \
  --readiness readiness.json \
  --hallucination grounding.json
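Working with any subset of module outputs suggests that a missing result should be reported as such rather than treated as a failure. A minimal sketch of that behaviour follows. The requirement names, thresholds, JSON field names, and `map_compliance` are all hypothetical, and the toolkit's real mapping may differ.

```python
from typing import Any

# Hypothetical mapping: requirement -> (module, result field, minimum value).
REQUIREMENTS = {
    "output-accuracy": ("hallucination", "overall_score", 0.8),
    "operational-maturity": ("readiness", "score", 70),
}


def map_compliance(results: dict[str, dict[str, Any]]) -> dict[str, str]:
    """Label each requirement as met, not met, or not assessed."""
    report = {}
    for requirement, (module, field, minimum) in REQUIREMENTS.items():
        if module not in results:
            # No output supplied for this module: report the gap explicitly.
            report[requirement] = "not assessed"
        elif results[module].get(field, 0) >= minimum:
            report[requirement] = "met"
        else:
            report[requirement] = "not met"
    return report


# Only a readiness result is supplied; in practice these dicts would be
# loaded from files such as readiness.json.
report = map_compliance({"readiness": {"score": 82}})
for requirement, status in report.items():
    print(f"{requirement}: {status}")
```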

Architecture

Each module is independently usable with its own CLI and Python API. Optional dependencies keep the core installation small (~5MB). ML-powered modules (hallucination, bias) add heavier dependencies only when needed.

llm-production-toolkit
├── hallucination    # Grounding evaluation (requires torch)
├── bias             # Demographic bias testing (requires textblob)
├── feedback         # Feedback collection (requires fastapi)
├── readiness        # Readiness assessment (core deps only)
└── compliance       # Compliance mapping (core deps only)
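A common pattern for keeping heavy dependencies optional is to check for them only when the module that needs them is used, and to fail with an install hint otherwise. The helper below is a sketch of that pattern, not code from the toolkit: `require_extra` is a hypothetical name, and only the `hallucination` and `all` extras are confirmed by the install commands above.

```python
import importlib
import importlib.util
from types import ModuleType


def require_extra(module_name: str, extra: str) -> ModuleType:
    """Import a top-level module, or raise an ImportError naming the extra to install."""
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"'{module_name}' is required for this module. "
            f"Install it with: pip install llm-production-toolkit[{extra}]"
        )
    return importlib.import_module(module_name)


# A module that is always available imports normally.
json_module = require_extra("json", "all")
print(json_module.dumps({"ok": True}))

# A missing module produces an actionable message instead of a bare traceback.
try:
    require_extra("surely_not_installed_pkg_xyz", "hallucination")
except ImportError as exc:
    print(exc)
```

Calling such a helper inside the function or class that needs the dependency, rather than at package import time, is what lets the core installation stay small.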

Development

git clone https://github.com/frckeepit/llm-production-toolkit.git
cd llm-production-toolkit
pip install -e ".[dev,all]"
pytest tests/ -m "not slow"
ruff check src/ tests/

License

MIT
