Closed-Loop LLM Routing with Conformal Cascading and Distillation Co-Optimization
This repository contains the code, configuration, and benchmarking infrastructure for RouteNLP, a cost-aware framework for routing enterprise NLP queries across a tiered model portfolio. The system integrates difficulty-aware routing, confidence-calibrated cascading with conformal prediction, and a distillation-routing co-optimization loop.
| Aspect | Details |
|---|---|
| Status | Pilot deployment (8 weeks) |
| Scale | ~5,000 queries/day |
| Domains | Customer service (pilot), finance & legal (benchmark) |
| Latency | P50: ~45ms (router), P99: 387ms (end-to-end) |
| Infrastructure | vLLM (T2/T3), ONNX Runtime (T1), API gateway (T4) |
| Metric | Before (Always-T4) | After (RouteNLP) | Change |
|---|---|---|---|
| Inference cost | $200K+/month | ~$84K/month | −58% |
| Latency P99 | 1,847 ms | 387 ms | −79% |
| Throughput | Limited by API | 10K+ QPS (T1/T2) | Significantly improved |
| Response acceptance | 93.8% | 91.0% | −2.8 pp (deemed acceptable) |
| SLA violation rate | 38.2% | 2.3% | −94% |
| Quality ratio (vs. T4) | 1.000 | 0.971 | −2.9% |
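As a back-of-envelope, the per-query prices of the four tiers translate into a blended cost once a routing mix is fixed. The mix below is hypothetical (the measured production distribution is not published here), so the resulting ratio is illustrative only:

```python
# Illustrative only: per-1K-query tier prices from the architecture diagram,
# with a HYPOTHETICAL routing mix (not the measured production distribution).
prices = {"T1": 0.01, "T2": 0.10, "T3": 0.80, "T4": 8.00}  # $ per 1K queries
mix    = {"T1": 0.40, "T2": 0.30, "T3": 0.20, "T4": 0.10}  # fraction of traffic

blended = sum(prices[t] * mix[t] for t in prices)   # $ per 1K queries
cost_ratio = blended / prices["T4"]                 # vs. Always-T4
print(f"blended=${blended:.3f}/1K, cost ratio={cost_ratio:.3f}")
```

Note that the monthly dollar figures above also include fixed serving infrastructure, so a pure API-price ratio like this need not match either reported number exactly.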
```
[Query + Task] → [Difficulty-Aware Router (DistilBERT, ~4ms)]
       ↓
   ┌────────────────┬────────────────┬────────────────┐
   ↓                ↓                ↓                ↓
[T1: DistilBERT] [T2: Mistral-7B] [T3: Mixtral]  [T4: GPT-4-Turbo]
  [$0.01/1K]       [$0.10/1K]      [$0.80/1K]     [$8.00/1K]
   ↓                ↓                ↓                ↓
[Confidence Check: u ≤ δ_{k,t}?]                 [Response]
   ↓ (escalate if uncertain)
[Next Tier via Cascade]
   ↓
[Co-Optimization Loop: cluster failures → targeted distillation → retrain router]
```
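The cascade step in the diagram can be sketched as follows (hypothetical interfaces, not the repository's `CascadeEngine` API): each tier returns a response plus an uncertainty score `u`, the query stops at the first tier where `u` is at or below that tier's threshold, and otherwise escalates.

```python
# Minimal sketch of conformal cascading: try tiers cheapest-first and
# escalate whenever uncertainty u exceeds the tier's threshold delta.
def cascade(query, tiers, thresholds, final="T4"):
    """tiers: ordered dict name -> fn(query) returning (response, uncertainty)."""
    for name, run in tiers.items():
        response, u = run(query)
        if name == final or u <= thresholds.get(name, 0.0):
            return name, response  # confident enough (or last tier): stop here
    return final, response

# Toy tiers: a cheap model that is uncertain on "hard" queries.
tiers = {
    "T1": lambda q: ("t1-answer", 0.9 if "hard" in q else 0.1),
    "T4": lambda q: ("t4-answer", 0.0),
}
thresholds = {"T1": 0.3}
print(cascade("easy query", tiers, thresholds))  # ('T1', 't1-answer')
print(cascade("hard query", tiers, thresholds))  # ('T4', 't4-answer')
```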
- Conformal thresholds are initialization, not guarantees. Under distribution shift, coverage violations exceeded the 5% target (up to 8.1%). Production monitoring and periodic recalibration are essential.
- Failure clustering provides 2× the cost improvement of random distillation. Targeting systematic failure patterns for distillation is far more effective than uniformly sampling training data.
- Generation tasks are harder to route than classification tasks. BERTScore as a routing proxy agrees with human judgment 84-87% of the time but can miss nuanced quality differences, leading to 8-9% substantially degraded responses.
- The co-optimization loop converges quickly but has diminishing returns. After 3 iterations, remaining failures become edge cases that are increasingly hard to address through distillation alone.
- Portfolio agnosticism matters operationally. When the frontier model changed during the pilot, only thresholds and quality labels needed updating; the routing framework adapted automatically.
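The first point can be operationalized with the standard split-conformal quantile recipe: set each threshold δ to an empirical quantile of calibration-set uncertainty scores, and recompute it on fresh data as the distribution drifts. A minimal sketch (illustrative, not the repository's `ConformalCalibrator`):

```python
import math

def conformal_threshold(cal_scores, alpha=0.05):
    """Split-conformal quantile: with n calibration uncertainty scores,
    take the ceil((n+1)*(1-alpha))-th smallest as the threshold delta."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:                    # too few calibration points for this alpha
        return float("inf")
    return sorted(cal_scores)[k - 1]

# 99 calibration scores 0.01..0.99: ceil(100 * 0.95) = 95 -> 95th smallest
scores = [i / 100 for i in range(1, 100)]
print(conformal_threshold(scores, alpha=0.05))  # 0.95
```

Periodic recalibration then amounts to re-running this on a rolling window of recent calibration scores.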
```bash
# Clone repository
git clone https://anonymous.4open.science/r/RouteNLP
cd routenlp

# Install core package
pip install -e .

# Install with serving dependencies
pip install -e ".[serving]"

# Install everything (including dev tools)
pip install -e ".[all]"
```

```bash
# Single-configuration evaluation
python -m routenlp.evaluation --batch-size 1 --num-samples 100

# Scaling benchmark across batch sizes
python -m routenlp.evaluation --scaling
```

```bash
# Serve the production API
uvicorn routenlp.serving.server:app --host 0.0.0.0 --port 8080

# Docker
docker build -f Dockerfile.serving -t routenlp-serving .
docker run -p 8080:8080 routenlp-serving
```

```bash
# Compliance check
python scripts/verify_acl_industry_compliance.py
```

```
routenlp/
├── src/routenlp/
│   ├── router/                  # Difficulty-aware router (DistilBERT + task embeddings)
│   │   ├── __init__.py          # RouterConfig, DifficultyAwareRouter, RouterLoss
│   │   └── trainer.py           # Training loop with early stopping
│   ├── cascade/                 # Conformal cascading mechanism
│   │   └── __init__.py          # ConformalCalibrator, CascadeEngine
│   ├── distillation/            # Co-optimization loop
│   │   └── __init__.py          # FailureAnalyzer, DistillationDataGenerator, CoOptLoop
│   ├── models/                  # Model portfolio management
│   │   └── __init__.py          # LocalModelTier, VLLMModelTier, APIModelTier, ModelPortfolio
│   ├── data/                    # Dataset loading and processing
│   │   └── __init__.py          # BenchmarkDataset, load_benchmark, TASK_REGISTRY
│   ├── evaluation/              # Evaluation and latency benchmarking
│   │   └── __init__.py          # benchmark_latency, evaluate_system
│   ├── serving/                 # Production FastAPI server
│   │   └── server.py            # Endpoints: /route, /route/batch, /health, /metrics
│   └── utils/                   # Reproducibility, metrics
│       ├── reproducibility.py
│       └── metrics.py           # F1, ROUGE-L, BERTScore, accuracy + thresholds
├── configs/
│   └── default.yaml             # Full system configuration
├── scripts/
│   ├── train_router.py          # Router training (single/multi-seed)
│   ├── evaluate.py              # System evaluation
│   └── verify_acl_industry_compliance.py
├── tests/
│   └── test_core.py             # Comprehensive test suite
├── docs/
│   └── COMPUTE_RESOURCES.md
├── kubernetes/
│   └── deployment.yaml          # K8s deployment + HPA + service
├── Dockerfile.serving           # Production Docker image
├── pyproject.toml               # Package configuration
├── requirements-serving.txt
└── README.md
```
Six tasks across three enterprise domains (40,200 train / 8,800 test):
| Domain | Task | Train | Test | Metric | Source |
|---|---|---|---|---|---|
| Finance | NER | 8,200 | 1,800 | F1 | SEC EDGAR |
| Finance | Summarization | 5,400 | 1,200 | ROUGE-L | SEC EDGAR |
| Customer Service | Intent Classification | 12,000 | 2,600 | F1 | BANKING77+ |
| Customer Service | Response Generation | 6,800 | 1,500 | BERTScore | BANKING77+ |
| Legal | Clause Extraction | 4,600 | 1,000 | F1 | CUAD |
| Legal | Risk Assessment | 3,200 | 700 | Accuracy | CUAD |
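The extraction-style tasks (financial NER, legal clause extraction) are scored with F1 over predicted spans. A minimal span-level sketch (hypothetical helper, not the repository's `metrics.py`):

```python
# Span-level F1: exact-match on (start, end, label) triples.
def span_f1(pred_spans, gold_spans):
    pred, gold = set(pred_spans), set(gold_spans)
    tp = len(pred & gold)                         # exact span matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {(0, 5, "ORG"), (10, 14, "MONEY")}
pred = {(0, 5, "ORG"), (20, 24, "DATE")}
print(span_f1(pred, gold))  # P = 0.5, R = 0.5 -> F1 = 0.5
```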
| System | Quality Ratio ↑ | Cost Ratio ↓ | P99 (ms) ↓ | SLA Viol. ↓ |
|---|---|---|---|---|
| Always-T4 | 1.000±.000 | 1.000±.000 | 1,847 | 38.2% |
| FrugalGPT | .967±.004 | .284±.009 | 986 | 21.3% |
| Hybrid LLM | .972±.005 | .312±.011 | 874 | 18.7% |
| RouteLLM | .969±.004 | .246±.008 | 841 | 17.2% |
| AutoMix | .958±.006 | .231±.010 | 1,124 | 24.6% |
| RouteNLP | .971±.004 | .159±.006 | 387 | 2.3% |
| Task | T4 Quality | RouteNLP | Retention | Cost Reduction |
|---|---|---|---|---|
| Financial NER | 94.2 F1 | 93.8±.3 | 99.6% | 82% |
| Financial Summarization | 48.7 R-L | 46.9±.4 | 96.3% | 47% |
| CS Intent | 96.1 F1 | 95.8±.2 | 99.7% | 85% |
| CS Response | 72.4 BS | 69.7±.6 | 96.3% | 42% |
| Legal Clause | 91.6 F1 | 90.9±.3 | 99.2% | 78% |
| Legal Risk | 88.3 Acc | 86.1±.5 | 97.5% | 40% |
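The retention column is simply routed quality divided by Always-T4 quality. Reproducing two rows from the reported numbers:

```python
# Retention = routed quality / Always-T4 quality, from the table's figures.
for task, t4, routed in [("Financial NER", 94.2, 93.8),
                         ("CS Response", 72.4, 69.7)]:
    print(f"{task}: retention = {routed / t4:.1%}")
```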
Minimum:
- CPU: 4+ cores
- RAM: 16 GB
- GPU: NVIDIA T4 or better (for T1 fine-tuned model)
- Storage: 20 GB

Recommended:
- CPU: 8+ cores
- RAM: 64 GB
- GPU: NVIDIA A100 40/80 GB
- Storage: 100 GB
While production data cannot be released, we provide:
- Router training code and configuration
- Cascade calibration implementation
- Co-optimization loop implementation
- Latency benchmarking infrastructure
- Evaluation metrics and system evaluation
- Serving infrastructure (FastAPI + Docker + Kubernetes)
- Full hyperparameter specification
- Multi-seed evaluation support (5 seeds)

Not released:
- Training data (proprietary annotations)
- Fine-tuned model weights (pending approval)
- Pilot deployment covers only customer service (~5K queries/day, 8 weeks); finance and legal claims rely on benchmark simulation.
- The benchmark adapts public datasets with enterprise annotations rather than proprietary data.
- The co-optimization loop ran on benchmark data, not production failure logs.
- The pilot was a shadow deployment without randomized A/B testing.
- Conformal coverage degrades under distribution shift (up to 8.1% violations vs. 5% target).
- English-only evaluation.
- BERTScore proxy agreement with humans (84-87%) is not verified under domain shift.
- Cost savings depend on provider-specific cost structures, and adapting two-model baselines to our four-tier portfolio may not fully preserve their inductive biases.
- The co-optimization loop incurs ~$2,400 one-time cost at our scale.
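For scale, the one-time co-optimization cost is small relative to the reported monthly savings. A quick back-of-envelope using the headline figures (treating "$200K+" as $200K):

```python
# Payback period for the one-time co-optimization cost, using reported figures.
one_time = 2_400                       # ~$2,400 one-time co-optimization cost
monthly_savings = 200_000 - 84_000     # before vs. after, from the results table
payback_days = one_time / monthly_savings * 30
print(f"payback ~= {payback_days:.1f} days")
```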
Cost-aware routing creates quality disparities: queries routed to cheaper models receive less capable responses. The framework mitigates this through task-level quality constraints and conformal calibration, but individual-level variance exists. We recommend that organizations:
- Disclose model tier usage to end users
- Monitor routing patterns for systematic disparities across demographic groups
- Implement fairness-constrained routing that escalates underserved segments
- Conduct regular quality audits with human evaluation
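The third recommendation (fairness-constrained routing) can be sketched as a gate that overrides the router when a segment's rolling acceptance rate falls below a floor. Everything below is a hypothetical policy sketch, not shipped code:

```python
from collections import defaultdict, deque

class FairnessGate:
    """Force escalation for segments whose recent acceptance rate is low."""
    def __init__(self, floor=0.90, window=500):
        self.floor = floor
        self.history = defaultdict(lambda: deque(maxlen=window))

    def record(self, segment, accepted):
        self.history[segment].append(1 if accepted else 0)

    def min_tier(self, segment, routed_tier):
        h = self.history[segment]
        if len(h) >= 50 and sum(h) / len(h) < self.floor:
            return "T4"          # escalate the underserved segment
        return routed_tier       # otherwise keep the router's choice

gate = FairnessGate(floor=0.90)
for _ in range(60):
    gate.record("segment-a", accepted=False)   # quality collapse in one segment
print(gate.min_tier("segment-a", "T1"))  # T4
print(gate.min_tier("segment-b", "T1"))  # T1 (no evidence of harm)
```

The 50-sample minimum avoids escalating on noise from tiny segments; the floor and window would need tuning against the audit process described above.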
The co-optimization loop (~120 GPU-hours) yields 40-85% ongoing inference cost reduction, representing a net environmental benefit. The benchmark uses publicly available data with no PII.
- A. Limitations: See Limitations section above
- B. Potential Risks: Quality disparities from cost-aware routing; see Ethical Considerations
- C. Compute Resources: See docs/COMPUTE_RESOURCES.md
- D. Human Evaluation: 200 samples × 2 tasks, 3 annotators, Krippendorff's α = 0.68-0.72