Closed-Loop LLM Routing with Conformal Cascading and Distillation Co-Optimization
This repository contains the code, configuration, and benchmarking infrastructure for RouteNLP, a cost-aware framework for routing enterprise NLP queries across a tiered model portfolio. The system integrates difficulty-aware routing, confidence-calibrated cascading with conformal prediction, and a distillation-routing co-optimization loop.
| Aspect | Details |
|---|---|
| Status | Pilot deployment (8 weeks) |
| Scale | ~5,000 queries/day |
| Domains | Customer service (pilot), finance & legal (benchmark) |
| Latency | P50: ~45ms (router), P99: 387ms (end-to-end) |
| Infrastructure | vLLM (T2/T3), ONNX Runtime (T1), API gateway (T4) |
| Metric | Before (Always-T4) | After (RouteNLP) | Change |
|---|---|---|---|
| Inference cost | $200K+/month | ~$84K/month | −58% |
| Latency P99 | 1,847 ms | 387 ms | −79% |
| Throughput | Limited by API | 10K+ QPS (T1/T2) | Significantly improved |
| Response acceptance | 93.8% | 91.0% | −2.8 pp (deemed acceptable) |
| SLA violation rate | 38.2% | 2.3% | −94% |
| Quality ratio (vs. T4) | 1.000 | 0.971 | −2.9% |
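As a back-of-envelope, the per-query prices of the four tiers translate into a blended cost once a routing mix is fixed. The mix below is hypothetical (the measured production distribution is not published here), so the resulting ratio is illustrative only:

```python
# Illustrative only: per-1K-query tier prices from the architecture diagram,
# with a HYPOTHETICAL routing mix (not the measured production distribution).
prices = {"T1": 0.01, "T2": 0.10, "T3": 0.80, "T4": 8.00}  # $ per 1K queries
mix    = {"T1": 0.40, "T2": 0.30, "T3": 0.20, "T4": 0.10}  # fraction of traffic

blended = sum(prices[t] * mix[t] for t in prices)   # $ per 1K queries
cost_ratio = blended / prices["T4"]                 # vs. Always-T4
print(f"blended=${blended:.3f}/1K, cost ratio={cost_ratio:.3f}")
```

Note that the monthly dollar figures above also include fixed serving infrastructure, so a pure API-price ratio like this need not match either reported number exactly.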
```
[Query + Task] → [Difficulty-Aware Router (DistilBERT, ~4ms)]
       ↓
   ┌────────────────┬────────────────┬────────────────┐
   ↓                ↓                ↓                ↓
[T1: DistilBERT] [T2: Mistral-7B] [T3: Mixtral]  [T4: GPT-4-Turbo]
  [$0.01/1K]       [$0.10/1K]      [$0.80/1K]     [$8.00/1K]
   ↓                ↓                ↓                ↓
[Confidence Check: u ≤ δ_{k,t}?]                 [Response]
   ↓ (escalate if uncertain)
[Next Tier via Cascade]
   ↓
[Co-Optimization Loop: cluster failures → targeted distillation → retrain router]
```
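The cascade step in the diagram can be sketched as follows (hypothetical interfaces, not the repository's `CascadeEngine` API): each tier returns a response plus an uncertainty score `u`, the query stops at the first tier where `u` is at or below that tier's threshold, and otherwise escalates.

```python
# Minimal sketch of conformal cascading: try tiers cheapest-first and
# escalate whenever uncertainty u exceeds the tier's threshold delta.
def cascade(query, tiers, thresholds, final="T4"):
    """tiers: ordered dict name -> fn(query) returning (response, uncertainty)."""
    for name, run in tiers.items():
        response, u = run(query)
        if name == final or u <= thresholds.get(name, 0.0):
            return name, response  # confident enough (or last tier): stop here
    return final, response

# Toy tiers: a cheap model that is uncertain on "hard" queries.
tiers = {
    "T1": lambda q: ("t1-answer", 0.9 if "hard" in q else 0.1),
    "T4": lambda q: ("t4-answer", 0.0),
}
thresholds = {"T1": 0.3}
print(cascade("easy query", tiers, thresholds))  # ('T1', 't1-answer')
print(cascade("hard query", tiers, thresholds))  # ('T4', 't4-answer')
```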
- Conformal thresholds are initialization, not guarantees. Under distribution shift, coverage violations exceeded the 5% target (up to 8.1%). Production monitoring and periodic recalibration are essential.
- Failure clustering provides 2× the cost improvement of random distillation. Targeting systematic failure patterns for distillation is far more effective than uniformly sampling training data.
- Generation tasks are harder to route than classification tasks. BERTScore as a routing proxy agrees with human judgment 84-87% of the time but can miss nuanced quality differences, leading to 8-9% substantially degraded responses.
- The co-optimization loop converges quickly but has diminishing returns. After 3 iterations, remaining failures become edge cases that are increasingly hard to address through distillation alone.
- Portfolio agnosticism matters operationally. When the frontier model changed during the pilot, only thresholds and quality labels needed updating; the routing framework adapted automatically.
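The first point can be operationalized with the standard split-conformal quantile recipe: set each threshold δ to an empirical quantile of calibration-set uncertainty scores, and recompute it on fresh data as the distribution drifts. A minimal sketch (illustrative, not the repository's `ConformalCalibrator`):

```python
import math

def conformal_threshold(cal_scores, alpha=0.05):
    """Split-conformal quantile: with n calibration uncertainty scores,
    take the ceil((n+1)*(1-alpha))-th smallest as the threshold delta."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:                    # too few calibration points for this alpha
        return float("inf")
    return sorted(cal_scores)[k - 1]

# 99 calibration scores 0.01..0.99: ceil(100 * 0.95) = 95 -> 95th smallest
scores = [i / 100 for i in range(1, 100)]
print(conformal_threshold(scores, alpha=0.05))  # 0.95
```

Periodic recalibration then amounts to re-running this on a rolling window of recent calibration scores.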
```bash
# Clone repository
git clone https://anonymous.4open.science/r/RouteNLP
cd routenlp

# Install core package
pip install -e .

# Install with serving dependencies
pip install -e ".[serving]"

# Install everything (including dev tools)
pip install -e ".[all]"
```

```bash
# Single-configuration evaluation
python -m routenlp.evaluation --batch-size 1 --num-samples 100

# Scaling benchmark across batch sizes
python -m routenlp.evaluation --scaling
```

```bash
# Serve the production API
uvicorn routenlp.serving.server:app --host 0.0.0.0 --port 8080

# Docker
docker build -f Dockerfile.serving -t routenlp-serving .
docker run -p 8080:8080 routenlp-serving
```

```bash
# Compliance check
python scripts/verify_acl_industry_compliance.py
```

```
routenlp/
├── src/routenlp/
│   ├── router/                  # Difficulty-aware router (DistilBERT + task embeddings)
│   │   ├── __init__.py          # RouterConfig, DifficultyAwareRouter, RouterLoss
│   │   └── trainer.py           # Training loop with early stopping
│   ├── cascade/                 # Conformal cascading mechanism
│   │   └── __init__.py          # ConformalCalibrator, CascadeEngine
│   ├── distillation/            # Co-optimization loop
│   │   └── __init__.py          # FailureAnalyzer, DistillationDataGenerator, CoOptLoop
│   ├── models/                  # Model portfolio management
│   │   └── __init__.py          # LocalModelTier, VLLMModelTier, APIModelTier, ModelPortfolio
│   ├── data/                    # Dataset loading and processing
│   │   └── __init__.py          # BenchmarkDataset, load_benchmark, TASK_REGISTRY
│   ├── evaluation/              # Evaluation and latency benchmarking
│   │   └── __init__.py          # benchmark_latency, evaluate_system
│   ├── serving/                 # Production FastAPI server
│   │   └── server.py            # Endpoints: /route, /route/batch, /health, /metrics
│   └── utils/                   # Reproducibility, metrics
│       ├── reproducibility.py
│       └── metrics.py           # F1, ROUGE-L, BERTScore, accuracy + thresholds
├── configs/
│   └── default.yaml             # Full system configuration
├── scripts/
│   ├── train_router.py          # Router training (single/multi-seed)
│   ├── evaluate.py              # System evaluation
│   └── verify_acl_industry_compliance.py
├── tests/
│   └── test_core.py             # Comprehensive test suite
├── docs/
│   └── COMPUTE_RESOURCES.md
├── kubernetes/
│   └── deployment.yaml          # K8s deployment + HPA + service
├── Dockerfile.serving           # Production Docker image
├── pyproject.toml               # Package configuration
├── requirements-serving.txt
└── README.md
```
Six tasks across three enterprise domains (40,200 train / 8,800 test):
| Domain | Task | Train | Test | Metric | Source |
|---|---|---|---|---|---|
| Finance | NER | 8,200 | 1,800 | F1 | SEC EDGAR |
| Finance | Summarization | 5,400 | 1,200 | ROUGE-L | SEC EDGAR |
| Customer Service | Intent Classification | 12,000 | 2,600 | F1 | BANKING77+ |
| Customer Service | Response Generation | 6,800 | 1,500 | BERTScore | BANKING77+ |
| Legal | Clause Extraction | 4,600 | 1,000 | F1 | CUAD |
| Legal | Risk Assessment | 3,200 | 700 | Accuracy | CUAD |
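The extraction-style tasks (financial NER, legal clause extraction) are scored with F1 over predicted spans. A minimal span-level sketch (hypothetical helper, not the repository's `metrics.py`):

```python
# Span-level F1: exact-match on (start, end, label) triples.
def span_f1(pred_spans, gold_spans):
    pred, gold = set(pred_spans), set(gold_spans)
    tp = len(pred & gold)                         # exact span matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {(0, 5, "ORG"), (10, 14, "MONEY")}
pred = {(0, 5, "ORG"), (20, 24, "DATE")}
print(span_f1(pred, gold))  # P = 0.5, R = 0.5 -> F1 = 0.5
```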
| System | Quality Ratio ↑ | Cost Ratio ↓ | P99 (ms) ↓ | SLA Viol. ↓ |
|---|---|---|---|---|
| Always-T4 | 1.000±.000 | 1.000±.000 | 1,847 | 38.2% |
| FrugalGPT | .967±.004 | .284±.009 | 986 | 21.3% |
| Hybrid LLM | .972±.005 | .312±.011 | 874 | 18.7% |
| RouteLLM | .969±.004 | .246±.008 | 841 | 17.2% |
| AutoMix | .958±.006 | .231±.010 | 1,124 | 24.6% |
| RouteNLP | .971±.004 | .159±.006 | 387 | 2.3% |
| Task | T4 Quality | RouteNLP | Retention | Cost Reduction |
|---|---|---|---|---|
| Financial NER | 94.2 F1 | 93.8±.3 | 99.6% | 82% |
| Financial Summarization | 48.7 R-L | 46.9±.4 | 96.3% | 47% |
| CS Intent | 96.1 F1 | 95.8±.2 | 99.7% | 85% |
| CS Response | 72.4 BS | 69.7±.6 | 96.3% | 42% |
| Legal Clause | 91.6 F1 | 90.9±.3 | 99.2% | 78% |
| Legal Risk | 88.3 Acc | 86.1±.5 | 97.5% | 40% |
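The retention column is simply routed quality divided by Always-T4 quality. Reproducing two rows from the reported numbers:

```python
# Retention = routed quality / Always-T4 quality, from the table's figures.
for task, t4, routed in [("Financial NER", 94.2, 93.8),
                         ("CS Response", 72.4, 69.7)]:
    print(f"{task}: retention = {routed / t4:.1%}")
```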
Minimum:
- CPU: 4+ cores
- RAM: 16 GB
- GPU: NVIDIA T4 or better (for T1 fine-tuned model)
- Storage: 20 GB

Recommended:
- CPU: 8+ cores
- RAM: 64 GB
- GPU: NVIDIA A100 40/80 GB
- Storage: 100 GB
While production data cannot be released, we provide:
- Router training code and configuration
- Cascade calibration implementation
- Co-optimization loop implementation
- Latency benchmarking infrastructure
- Evaluation metrics and system evaluation
- Serving infrastructure (FastAPI + Docker + Kubernetes)
- Full hyperparameter specification
- Multi-seed evaluation support (5 seeds)

Not released:
- Training data (proprietary annotations)
- Fine-tuned model weights (pending approval)
- Pilot deployment covers only customer service (~5K queries/day, 8 weeks); finance and legal claims rely on benchmark simulation.
- The benchmark adapts public datasets with enterprise annotations rather than proprietary data.
- The co-optimization loop ran on benchmark data, not production failure logs.
- The pilot was a shadow deployment without randomized A/B testing.
- Conformal coverage degrades under distribution shift (up to 8.1% violations vs. 5% target).
- English-only evaluation.
- BERTScore proxy agreement with humans (84-87%) is not verified under domain shift.
- Cost savings depend on provider-specific cost structures, and adapting two-model baselines to our four-tier portfolio may not fully preserve their inductive biases.
- The co-optimization loop incurs ~$2,400 one-time cost at our scale.
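For scale, the one-time co-optimization cost is small relative to the reported monthly savings. A quick back-of-envelope using the headline figures (treating "$200K+" as $200K):

```python
# Payback period for the one-time co-optimization cost, using reported figures.
one_time = 2_400                       # ~$2,400 one-time co-optimization cost
monthly_savings = 200_000 - 84_000     # before vs. after, from the results table
payback_days = one_time / monthly_savings * 30
print(f"payback ~= {payback_days:.1f} days")
```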
Cost-aware routing creates quality disparities: queries routed to cheaper models receive less capable responses. The framework mitigates this through task-level quality constraints and conformal calibration, but individual-level variance exists. We recommend that organizations:
- Disclose model tier usage to end users
- Monitor routing patterns for systematic disparities across demographic groups
- Implement fairness-constrained routing that escalates underserved segments
- Conduct regular quality audits with human evaluation
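The third recommendation (fairness-constrained routing) can be sketched as a gate that overrides the router when a segment's rolling acceptance rate falls below a floor. Everything below is a hypothetical policy sketch, not shipped code:

```python
from collections import defaultdict, deque

class FairnessGate:
    """Force escalation for segments whose recent acceptance rate is low."""
    def __init__(self, floor=0.90, window=500):
        self.floor = floor
        self.history = defaultdict(lambda: deque(maxlen=window))

    def record(self, segment, accepted):
        self.history[segment].append(1 if accepted else 0)

    def min_tier(self, segment, routed_tier):
        h = self.history[segment]
        if len(h) >= 50 and sum(h) / len(h) < self.floor:
            return "T4"          # escalate the underserved segment
        return routed_tier       # otherwise keep the router's choice

gate = FairnessGate(floor=0.90)
for _ in range(60):
    gate.record("segment-a", accepted=False)   # quality collapse in one segment
print(gate.min_tier("segment-a", "T1"))  # T4
print(gate.min_tier("segment-b", "T1"))  # T1 (no evidence of harm)
```

The 50-sample minimum avoids escalating on noise from tiny segments; the floor and window would need tuning against the audit process described above.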
The co-optimization loop (~120 GPU-hours) yields 40-85% ongoing inference cost reduction, representing a net environmental benefit. The benchmark uses publicly available data with no PII.
- A. Limitations: See Limitations section above
- B. Potential Risks: Quality disparities from cost-aware routing; see Ethical Considerations
- C. Compute Resources: See docs/COMPUTE_RESOURCES.md
- D. Human Evaluation: 200 samples × 2 tasks, 3 annotators, Krippendorff's α = 0.68-0.72