ModelExpress Release v0.3.0

@saturley-hall saturley-hall released this 17 Apr 22:04
76fc5d7

ModelExpress v0.3.0 Release Notes

The big picture

Today, scaling inference means waiting. A 671B model takes 40+ minutes to load from storage before it can serve a single token. Every new node repeats that wait. ModelExpress exists to eliminate it, by turning GPU memory itself into the fastest model cache in the cluster.

With v0.3.0, that vision gets the infrastructure to power it. This release makes NIXL-based GPU-to-GPU transfer a first-class, production-grade path. Users are already deploying on Kubernetes with proper lifecycle management, metadata coordination, and failure handling. It's incredible to see DeepSeek-V3 transfers in production inference environments take ~15 seconds across 8 GPUs.

What's new in this release

v0.3.0 lands three things that matter:

  1. P2P transfers that work in production: not just the data plane, but the control plane around it. Multi-source metadata exchange, heartbeat-based liveness, stale source cleanup, and a NIXL-native metadata path that removes the Redis dependency for coordination. You can now have multiple workers publishing and consuming transfer metadata cleanly, and the system detects and sheds dead sources automatically.
  2. One loader, not two: the previous mx-source / mx-target split is gone. The unified MX loader auto-detects the right path (HuggingFace download, disk cache, GDS, or P2P receive) based on what's available. Less configuration, fewer failure modes, clearer operations.
  3. Kubernetes as a real deployment target: metadata management through Kubernetes-native backends, Helm chart hardening (ephemeral storage limits, ServiceAccount lockdown), no-shared-storage test coverage, and multi-node gRPC transfer validation. MX no longer just "runs on K8s"; it's K8s-native.

P2P transfer and metadata

  • End-to-end NIXL P2P with post-processed tensor registration and non-contiguous RDMA: full module trees get registered, not just top-level parameters. Storage-level RDMA handles real tensor layouts. (#135, #169, #188)
  • NIXL-native metadata exchange: P2P coordination without a centralized store. Fewer moving parts for distributed setups. (#177)
  • Multi-source metadata model: multiple workers can publish and consume metadata concurrently, required for real multi-node topologies. (#170)
  • TransferEngine metadata backend + simplified schema: pluggable metadata backends with a cleaner wire format. Foundation for resilience in production. (#157, #165)
  • Client heartbeats + stale-source reaper: the server detects dead sources and cleans them up. P2P doesn't hang waiting for a node that's gone. (#182)
  • NIXL listener lifecycle: listener only starts when P2P metadata is enabled. No phantom ports in cache-only deployments. (#210)
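
The heartbeat and stale-source reaper behavior described above can be pictured with a minimal sketch. This is not the ModelExpress implementation: `SourceRegistry`, its method names, and the 30-second timeout are hypothetical, chosen only to illustrate the idea that a source which stops sending heartbeats is dropped so that targets do not wait on it.

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field


@dataclass
class SourceRegistry:
    """Hypothetical in-memory registry of P2P sources, keyed by worker ID."""

    ttl_seconds: float = 30.0  # assumed timeout, not a ModelExpress default
    _last_seen: dict[str, float] = field(default_factory=dict)

    def heartbeat(self, worker_id: str, now: float | None = None) -> None:
        # Record the most recent time this worker reported in.
        self._last_seen[worker_id] = time.monotonic() if now is None else now

    def reap_stale(self, now: float | None = None) -> list[str]:
        # Remove and return every worker whose last heartbeat is older than the TTL.
        now = time.monotonic() if now is None else now
        stale = [w for w, seen in self._last_seen.items() if now - seen > self.ttl_seconds]
        for worker_id in stale:
            del self._last_seen[worker_id]
        return stale

    def live_sources(self) -> list[str]:
        # Workers that are still eligible to serve transfers.
        return sorted(self._last_seen)
```

In a setup like this, a server would call `reap_stale()` on a timer and only hand out metadata for `live_sources()`.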

Loading and caching

  • Unified MX loader: one path replaces the source/target split. Auto-detects the fastest available route. (#147)
  • GDS-aware loading: GPU Direct Storage integrated into the auto-detection path, used when available, skipped when not. (#166)
  • Provider-aware cache and streaming: cache and download behavior follows the active provider, not a hardcoded HuggingFace assumption. (#172)
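
As a rough illustration of what route auto-detection could look like, the sketch below picks one of the four paths named in these notes from a set of availability flags. The names (`Route`, `Availability`, `pick_route`) and, importantly, the priority order are assumptions made for this example; the notes only say the loader chooses the fastest available route.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    P2P_RECEIVE = "p2p_receive"
    GDS = "gds"
    DISK_CACHE = "disk_cache"
    HF_DOWNLOAD = "hf_download"


@dataclass(frozen=True)
class Availability:
    """Hypothetical probe results gathered before loading."""

    p2p_source_ready: bool  # a live source already holds the weights in GPU memory
    gds_supported: bool     # GPU Direct Storage is usable on this node
    cached_on_disk: bool    # the model files are already on local or mounted storage


def pick_route(avail: Availability) -> Route:
    # Assumed priority: GPU-to-GPU first, then storage-to-GPU, then plain disk,
    # with a HuggingFace download as the fallback.
    if avail.p2p_source_ready:
        return Route.P2P_RECEIVE
    if avail.cached_on_disk and avail.gds_supported:
        return Route.GDS
    if avail.cached_on_disk:
        return Route.DISK_CACHE
    return Route.HF_DOWNLOAD
```

Keeping the decision in one pure function is one way to get the "less configuration, fewer failure modes" property: every fallback is visible in a single place.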

Kubernetes and deployment

  • Helm chart hardened: ephemeral-storage limits, ServiceAccount automount disabled, PVC naming aligned with docs (#144, #143, #145, #192, #191)
  • Redis metadata setup made explicit in README and docker-compose (#203)
  • Docker build context excludes target/, cutting image build upload size (#175)
  • Unnecessary serviceAccountName removed from vLLM client manifests (#198)
  • Standard gRPC health check service for probe integration (#205)
  • CodeRabbit added for automated PR review (#138)

Bug fixes

  • Model deletion no longer triggers unintended downloads; eviction path is consistent (#154, #168)
  • HuggingFace: handles empty files, skips dotfiles, honors HF_HUB_OFFLINE (#139, #128)
  • Cache clearing works in real usage (#130)
  • Non-root workers can use PVCs correctly (#132)
  • Kubernetes status round-trips correctly, with Unknown as the default status (#174)
  • Metadata publish retries before failing the loader (#196)
  • Security: addresses Pygments CVE-2026-4539 and rustls-webpki RUSTSEC-2026-0049 (#195, #178)
  • Multi-node K8s tests respect KUBECONFIG; Python tests re-enabled in CI (#189, #171)
  • P2P throughput docs corrected; metadata backend logging shows actual type (#181, #190)
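
The "retries before failing the loader" fix can be sketched as a bounded retry with exponential backoff. This is an assumption-laden illustration, not the code from #196: `publish_with_retry`, the attempt count, the delays, and the choice of `ConnectionError` as the retryable error are all hypothetical.

```python
import time
from typing import Callable


def publish_with_retry(
    publish: Callable[[], None],
    attempts: int = 3,          # assumed limit, not a ModelExpress default
    base_delay: float = 0.5,    # assumed initial backoff in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Call `publish` until it succeeds or `attempts` is exhausted."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            publish()
            return
        except ConnectionError:
            if attempt == attempts:
                # Out of retries: surface the error so the loader fails.
                raise
            # Back off 0.5s, 1s, 2s, ... before trying again.
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter keeps the function easy to test without real delays.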

Where we're headed

v0.3.0 establishes the P2P transfer plane and pluggable metadata as production primitives. The next releases focus on three areas:

  • Performance: transfer throughput optimization, contiguous region support, and benchmarking across network topologies
  • Broader runtime coverage: stronger SGLang and TensorRT-LLM integration alongside vLLM
  • Day-2 operations: observability (metrics, tracing), rolling upgrades without transfer disruption, and multi-tenant isolation

The longer arc: ModelExpress becomes the weight management layer for inference and RL systems, the critical piece that makes model placement, scaling, and migration fast enough that the orchestrator can treat GPU memory as a fungible resource across the cluster.

Contributors

Thank you to everyone who contributed to this release, especially the sustained effort that landed NIXL P2P and metadata coherently across dozens of PRs, and to all the reviewers and testers who exercised the Kubernetes and multi-node paths.

Full changelog

v0.2.2...v0.3.0