bettyguo/RouteNLP

RouteNLP

Closed-Loop LLM Routing with Conformal Cascading and Distillation Co-Optimization

This repository contains the code, configuration, and benchmarking infrastructure for RouteNLP, a cost-aware framework for routing enterprise NLP queries across a tiered model portfolio. The system integrates difficulty-aware routing, confidence-calibrated cascading with conformal prediction, and a distillation–routing co-optimization loop.


🏭 Main Work

Deployment Context

| Aspect | Details |
|---|---|
| Status | Pilot deployment (8 weeks) |
| Scale | ~5,000 queries/day |
| Domains | Customer service (pilot); finance & legal (benchmark) |
| Latency | P50 ~45 ms (router); P99 387 ms (end-to-end) |
| Infrastructure | vLLM (T2/T3), ONNX Runtime (T1), API gateway (T4) |

Production Metrics

| Metric | Before (Always-T4) | After (RouteNLP) | Change |
|---|---|---|---|
| Inference cost | $200K+/month | ~$84K/month | −58% |
| Latency (P99) | 1,847 ms | 387 ms | −79% |
| Throughput | Limited by API | 10K+ QPS (T1/T2) | Significantly improved |
| Response acceptance | 93.8% | 91.0% | −2.8 pp (deemed acceptable) |
| SLA violation rate | 38.2% | 2.3% | −94% |
| Quality ratio (vs. T4) | 1.000 | 0.971 | −2.9% |

System Architecture

```
[Query + Task] → [Difficulty-Aware Router (DistilBERT, ~4ms)]
                         ↓
         ┌───────────────┼───────────────┬───────────────┐
         ↓               ↓               ↓               ↓
    [T1: DistilBERT] [T2: Mistral-7B] [T3: Mixtral]  [T4: GPT-4-Turbo]
    [$0.01/1K]       [$0.10/1K]       [$0.80/1K]     [$8.00/1K]
         ↓               ↓               ↓               ↓
    [Confidence Check: u ≤ δ_{k,t}?]                  [Response]
         ↓ (escalate if uncertain)
    [Next Tier via Cascade]
                         ↓
    [Co-Optimization Loop: cluster failures → targeted distillation → retrain router]
```
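The escalation rule in the diagram can be sketched as a loop over tiers. This is a minimal illustration, not the repository's `CascadeEngine`: the toy models, uncertainty scores `u`, and thresholds below are all placeholders.

```python
# Sketch of the conformal cascade: try cheaper tiers first, escalate
# whenever the uncertainty score u exceeds the tier's calibrated threshold.
# Tier names, costs, and the toy uncertainty values are illustrative only.

def cascade(query, tiers, thresholds):
    """Return (answer, tier_name, cost) from the first confident tier."""
    for (name, model, cost), delta in zip(tiers, thresholds):
        answer, u = model(query)   # u = nonconformity / uncertainty score
        if u <= delta:             # confident enough: stop here
            return answer, name, cost
    # Final tier always answers (its threshold is set to 1.0 below).
    name, model, cost = tiers[-1]
    return model(query)[0], name, cost

# Toy models: each returns (answer, uncertainty).
tiers = [
    ("T1", lambda q: (f"T1:{q}", 0.9), 0.01),
    ("T2", lambda q: (f"T2:{q}", 0.3), 0.10),
    ("T4", lambda q: (f"T4:{q}", 0.05), 8.00),
]
thresholds = [0.2, 0.4, 1.0]

print(cascade("refund status?", tiers, thresholds))
# T1 is too uncertain (0.9 > 0.2), so the query escalates and stops at T2.
```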

Lessons Learned

  1. Conformal thresholds are initialization, not guarantees. Under distribution shift, coverage violations exceeded the 5% target (up to 8.1%). Production monitoring and periodic recalibration are essential.

  2. Failure clustering provides 2× the cost improvement of random distillation. Targeting systematic failure patterns for distillation is far more effective than uniformly sampling training data.

  3. Generation tasks are harder to route than classification tasks. BERTScore as a routing proxy agrees with human judgment 84-87% of the time but can miss nuanced quality differences, leaving 8-9% of generated responses substantially degraded.

  4. The co-optimization loop converges quickly but has diminishing returns. After 3 iterations, remaining failures become edge cases that are increasingly hard to address through distillation alone.

  5. Portfolio agnosticism matters operationally. When the frontier model changed during the pilot, only thresholds and quality labels needed updating; the routing framework adapted automatically.
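Lesson 1 concerns how the thresholds δ_{k,t} are initialized. A minimal split-conformal sketch (the calibration scores below are synthetic; the repository's `ConformalCalibrator` is the real implementation):

```python
import math

def conformal_threshold(cal_scores, alpha=0.05):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))/n empirical
    quantile of calibration nonconformity scores. Escalating whenever
    u > threshold keeps miscoverage near alpha -- but only while new
    queries are exchangeable with the calibration set, which is exactly
    what breaks under distribution shift (Lesson 1)."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))      # rank of the quantile
    return sorted(cal_scores)[min(k, n) - 1]

# Synthetic calibration scores for one (tier, task) pair.
scores = [i / 100 for i in range(1, 100)]     # 0.01 .. 0.99
delta = conformal_threshold(scores, alpha=0.05)
```

Periodic recalibration then amounts to recomputing `delta` on a sliding window of recent production scores rather than the original calibration split.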


Quick Start

Installation

```bash
# Clone repository
git clone https://anonymous.4open.science/r/RouteNLP
cd routenlp

# Install core package
pip install -e .

# Install with serving dependencies
pip install -e ".[serving]"

# Install everything (including dev tools)
pip install -e ".[all]"
```

Run Latency Benchmark

```bash
python -m routenlp.evaluation --batch-size 1 --num-samples 100

# Scaling benchmark across batch sizes
python -m routenlp.evaluation --scaling
```

Start Serving

```bash
uvicorn routenlp.serving.server:app --host 0.0.0.0 --port 8080

# Docker
docker build -f Dockerfile.serving -t routenlp-serving .
docker run -p 8080:8080 routenlp-serving
```
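Once the server is up, a routing request might look like the following. The `/route` endpoint comes from the repository structure; the payload field names (`query`, `task`) are an assumption, not a documented schema.

```python
import json
import urllib.request

def build_route_request(query: str, task: str) -> dict:
    """Assemble a /route payload. The field names here are assumed,
    not taken from a documented server schema."""
    return {"query": query, "task": task}

payload = build_route_request("Where is my refund?", "cs_intent")

# Against a running server (uvicorn ... --port 8080), one would POST:
# req = urllib.request.Request(
#     "http://localhost:8080/route",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read())
```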

Verify ACL Compliance

```bash
python scripts/verify_acl_industry_compliance.py
```

Repository Structure

```
routenlp/
├── src/routenlp/
│   ├── router/             # Difficulty-aware router (DistilBERT + task embeddings)
│   │   ├── __init__.py     # RouterConfig, DifficultyAwareRouter, RouterLoss
│   │   └── trainer.py      # Training loop with early stopping
│   ├── cascade/            # Conformal cascading mechanism
│   │   └── __init__.py     # ConformalCalibrator, CascadeEngine
│   ├── distillation/       # Co-optimization loop
│   │   └── __init__.py     # FailureAnalyzer, DistillationDataGenerator, CoOptLoop
│   ├── models/             # Model portfolio management
│   │   └── __init__.py     # LocalModelTier, VLLMModelTier, APIModelTier, ModelPortfolio
│   ├── data/               # Dataset loading and processing
│   │   └── __init__.py     # BenchmarkDataset, load_benchmark, TASK_REGISTRY
│   ├── evaluation/         # Evaluation and latency benchmarking
│   │   └── __init__.py     # benchmark_latency, evaluate_system
│   ├── serving/            # Production FastAPI server
│   │   └── server.py       # Endpoints: /route, /route/batch, /health, /metrics
│   └── utils/              # Reproducibility, metrics
│       ├── reproducibility.py
│       └── metrics.py      # F1, ROUGE-L, BERTScore, accuracy + thresholds
├── configs/
│   └── default.yaml        # Full system configuration
├── scripts/
│   ├── train_router.py     # Router training (single/multi-seed)
│   ├── evaluate.py         # System evaluation
│   └── verify_acl_industry_compliance.py
├── tests/
│   └── test_core.py        # Comprehensive test suite
├── docs/
│   └── COMPUTE_RESOURCES.md
├── kubernetes/
│   └── deployment.yaml     # K8s deployment + HPA + service
├── Dockerfile.serving      # Production Docker image
├── pyproject.toml          # Package configuration
├── requirements-serving.txt
└── README.md
```

Benchmark

Six tasks across three enterprise domains (40,200 train / 8,800 test):

| Domain | Task | Train | Test | Metric | Source |
|---|---|---|---|---|---|
| Finance | NER | 8,200 | 1,800 | F1 | SEC EDGAR |
| Finance | Summarization | 5,400 | 1,200 | ROUGE-L | SEC EDGAR |
| Customer Service | Intent Classification | 12,000 | 2,600 | F1 | BANKING77+ |
| Customer Service | Response Generation | 6,800 | 1,500 | BERTScore | BANKING77+ |
| Legal | Clause Extraction | 4,600 | 1,000 | F1 | CUAD |
| Legal | Risk Assessment | 3,200 | 700 | Accuracy | CUAD |

Main Results

Cost–Quality Tradeoff (5-seed mean ± std)

| System | Quality Ratio ↑ | Cost Ratio ↓ | P99 (ms) ↓ | SLA Viol. ↓ |
|---|---|---|---|---|
| Always-T4 | 1.000 ± .000 | 1.000 ± .000 | 1,847 | 38.2% |
| FrugalGPT | .967 ± .004 | .284 ± .009 | 986 | 21.3% |
| Hybrid LLM | .972 ± .005 | .312 ± .011 | 874 | 18.7% |
| RouteLLM | .969 ± .004 | .246 ± .008 | 841 | 17.2% |
| AutoMix | .958 ± .006 | .231 ± .010 | 1,124 | 24.6% |
| RouteNLP | .971 ± .004 | .159 ± .006 | 387 | 2.3% |
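The cost ratios above follow from simple expected-value arithmetic over the tier prices. The routing mix below is illustrative, not the measured production distribution:

```python
# Expected cost per 1K queries = sum over tiers of
# (fraction routed to tier) x (tier price per 1K).
# Prices come from the architecture diagram; the mix is an assumed example.
prices = {"T1": 0.01, "T2": 0.10, "T3": 0.80, "T4": 8.00}
mix    = {"T1": 0.40, "T2": 0.35, "T3": 0.15, "T4": 0.10}   # hypothetical

expected = sum(mix[t] * prices[t] for t in prices)  # $ per 1K queries
ratio = expected / prices["T4"]                     # vs. Always-T4
```

Even with 10% of traffic escalated all the way to T4, the expected cost is a small fraction of the Always-T4 baseline, because T4 dominates the sum and everything else is nearly free by comparison.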

Per-Task Performance

| Task | T4 Quality | RouteNLP | Retention | Cost Reduction |
|---|---|---|---|---|
| Financial NER | 94.2 F1 | 93.8 ± .3 | 99.6% | 82% |
| Financial Summarization | 48.7 R-L | 46.9 ± .4 | 96.3% | 47% |
| CS Intent | 96.1 F1 | 95.8 ± .2 | 99.7% | 85% |
| CS Response | 72.4 BS | 69.7 ± .6 | 96.3% | 42% |
| Legal Clause | 91.6 F1 | 90.9 ± .3 | 99.2% | 78% |
| Legal Risk | 88.3 Acc | 86.1 ± .5 | 97.5% | 40% |

Hardware Requirements

Minimum (evaluation only)

  • CPU: 4+ cores
  • RAM: 16 GB
  • GPU: NVIDIA T4 or better (for T1 fine-tuned model)
  • Storage: 20 GB

Recommended (full pipeline)

  • CPU: 8+ cores
  • RAM: 64 GB
  • GPU: NVIDIA A100 40/80 GB
  • Storage: 100 GB

Reproducibility

While production data cannot be released, we provide:

  • Router training code and configuration
  • Cascade calibration implementation
  • Co-optimization loop implementation
  • Latency benchmarking infrastructure
  • Evaluation metrics and system evaluation
  • Serving infrastructure (FastAPI + Docker + Kubernetes)
  • Full hyperparameter specification
  • Multi-seed evaluation support (5 seeds)

Not included:

  • Training data (proprietary annotations)
  • Fine-tuned model weights (release pending approval)

Limitations

  1. Pilot deployment covers only customer service (~5K queries/day, 8 weeks); finance and legal claims rely on benchmark simulation.
  2. The benchmark adapts public datasets with enterprise annotations rather than proprietary data.
  3. The co-optimization loop ran on benchmark data, not production failure logs.
  4. The pilot was a shadow deployment without randomized A/B testing.
  5. Conformal coverage degrades under distribution shift (up to 8.1% violations vs. 5% target).
  6. English-only evaluation.
  7. BERTScore proxy agreement with humans (84-87%) is not verified under domain shift.
  8. Cost savings depend on the deployer's cost structure, and adapting two-model routing baselines to a four-tier portfolio may not fully preserve their inductive biases.
  9. The co-optimization loop incurs ~$2,400 one-time cost at our scale.

Ethical Considerations and Fairness

Cost-aware routing creates quality disparities: queries routed to cheaper models receive less capable responses. The framework mitigates this through task-level quality constraints and conformal calibration, but individual-level variance exists. We recommend that organizations:

  • Disclose model tier usage to end users
  • Monitor routing patterns for systematic disparities across demographic groups
  • Implement fairness-constrained routing that escalates underserved segments
  • Conduct regular quality audits with human evaluation
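The second recommendation, monitoring for systematic disparities, can start as a simple comparison of per-group escalation rates. The group labels and alert margin below are placeholders, not a prescribed fairness criterion:

```python
from collections import Counter

def escalation_rates(log):
    """log: iterable of (group, reached_t4: bool). Returns the per-group
    fraction of queries that escalated to the frontier tier."""
    total, esc = Counter(), Counter()
    for group, reached_t4 in log:
        total[group] += 1
        esc[group] += reached_t4
    return {g: esc[g] / total[g] for g in total}

def flag_disparity(rates, margin=0.10):
    """Flag when any two groups' escalation rates differ by more than margin."""
    vals = list(rates.values())
    return max(vals) - min(vals) > margin

# Hypothetical routing log: group A escalates 30% of the time, group B 5%.
log = [("A", True)] * 30 + [("A", False)] * 70 + \
      [("B", True)] * 5 + [("B", False)] * 95
rates = escalation_rates(log)
```

A flagged disparity does not by itself imply unfairness (query difficulty may differ across groups), but it identifies where a human quality audit should look first.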

The co-optimization loop (~120 GPU-hours) yields 40-85% ongoing inference cost reduction, representing a net environmental benefit. The benchmark uses publicly available data with no PII.


Responsible NLP Checklist

  • A. Limitations: See Limitations section above
  • B. Potential Risks: Quality disparities from cost-aware routing; see Ethical Considerations
  • C. Compute Resources: See docs/COMPUTE_RESOURCES.md
  • D. Human Evaluation: 200 samples × 2 tasks, 3 annotators, Krippendorff's α = 0.68-0.72
