Proof-of-concept for CPU embedding offload and CPU-assisted cold start for recommendation models.
- Python 3.9+
- PyTorch 2.0+ (CPU & CUDA)
- FastAPI (routing server)
- NumPy (data generation)
Request → Router (poc.py) → CPU Embeddings → GPU MLP → Response
↘ GPU Full Model ↗
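The router's dispatch decision can be sketched as a simple batch-size threshold. This is an illustrative sketch, not the actual poc.py API; the threshold value and function name are assumptions:

```python
# Hypothetical sketch of the routing decision in poc.py: large batches go to
# the CPU-embedding + GPU-MLP path, small batches to the full GPU model.
CPU_OFFLOAD_BATCH_THRESHOLD = 256  # assumed value; tune per hardware

def route(batch_size: int) -> str:
    """Pick an execution path for a request batch."""
    if batch_size >= CPU_OFFLOAD_BATCH_THRESHOLD:
        return "cpu_embed_gpu_mlp"  # embeddings on CPU, MLP on GPU
    return "gpu_full"               # entire model on GPU

print(route(8))     # gpu_full
print(route(1024))  # cpu_embed_gpu_mlp
```

Large batches amortize the host-to-device transfer of pooled embeddings, which is why they favor the offload path.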
# Install dependencies using uv (recommended)
uv sync
# Or with pip
pip install torch numpy
# Run simple example
python3 example.py
# Run benchmarks (takes ~60 seconds)
python3 benchmark.py

Minimum:
- CPU: 8+ cores, 32GB RAM
- GPU: 1× RTX 3090 / A10 / T4 (16GB+ VRAM)
Optimal:
- CPU: 32+ cores, 128GB RAM
- GPU: 1× A100 (40GB+)
- Network: 10Gbps if CPU/GPU on separate machines
- CPU embedding offload cuts GPU memory usage by 10-100× (embedding tables dominate model size)
- CPU pre-fill during cold start eliminates downtime
- Intelligent routing improves latency for large batches
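The CPU-embedding / GPU-MLP split behind these findings can be sketched in PyTorch. Module sizes and names here are illustrative, not the real model from poc.py:

```python
import torch
import torch.nn as nn

# The large embedding table stays in host RAM; only the small dense MLP
# needs VRAM. This is what shrinks GPU memory by orders of magnitude.
device = "cuda" if torch.cuda.is_available() else "cpu"

emb = nn.EmbeddingBag(1_000_000, 64, mode="mean")        # lookup on CPU
mlp = nn.Sequential(nn.Linear(64, 128), nn.ReLU(),
                    nn.Linear(128, 1)).to(device)        # MLP on GPU if present

ids = torch.randint(0, 1_000_000, (32, 10))   # 32 requests, 10 feature ids each
pooled = emb(ids)                             # CPU lookup + mean pooling
scores = mlp(pooled.to(device))               # only the (32, 64) tensor crosses PCIe
print(scores.shape)                           # torch.Size([32, 1])
```

Only the pooled activations move to the GPU, so the transfer cost is independent of the embedding-table size.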
- Baseline GPU latency: 5-100ms
- CPU offload latency: 10-50ms (beats the GPU baseline at large batch sizes)
- Cold start: 30-60s → <5s with CPU assist
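The cold-start assist amounts to serving from a CPU fallback immediately while the GPU model loads in the background, then flipping over. A minimal sketch of that pattern, with a short sleep standing in for the real 30-60s weight load:

```python
import threading
import time

# Hypothetical cold-start assist: answer requests from a CPU path with no
# downtime, and switch to the GPU path once its slow load completes.
gpu_ready = threading.Event()

def load_gpu_model():
    time.sleep(0.1)   # stand-in for a 30-60s GPU weight load
    gpu_ready.set()

def serve(request: str) -> str:
    backend = "gpu" if gpu_ready.is_set() else "cpu"
    return f"{backend}:{request}"

threading.Thread(target=load_gpu_model, daemon=True).start()
print(serve("r1"))    # cpu:r1 -- served while the GPU is still warming up
gpu_ready.wait()
print(serve("r2"))    # gpu:r2 -- after warm-up completes
```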
Done:
- ✅ Basic routing logic
- ✅ CPU embedding lookup
- ✅ GPU MLP execution
TODO: See TODO.md
- Prism (Alibaba): NSDI 2025
- FleetRec (ETH/Alibaba): KDD 2021
- CaraServe (HKUST): arXiv 2024
- NVIDIA HugeCTR: github.com/NVIDIA-Merlin/HugeCTR