ModelExpress v0.4.0

@saturley-hall released this 10 Jun 17:22
ac11e93

ModelExpress v0.4.0 Release Notes

The big picture

ModelExpress exists to eliminate the wait. Scaling inference today means watching large models load from storage before they can serve a single token, and paying that cost again on every new node. v0.3.0 turned NIXL-based GPU-to-GPU transfer into a production-grade path and made Kubernetes a first-class deployment target.

v0.4.0 is about reach. The fast path is no longer tied to one storage backend, one model source, or one inference engine. This release adds streaming from S3, GCS, Azure, and local disk; brings in NGC as a model provider; replaces the SQLite registry with a distributed, K8s-native backend that scales across the cluster; and extends first-class loader integration to SGLang and TensorRT-LLM alongside vLLM. P2P transfer also matures considerably, with libfabric/EFA support, per-rank NIC pinning, a new VMM arena allocator, and safety checks that make the transfer plane more robust under real workloads.

What's new in this release

v0.4.0 lands four things that matter:

  1. Stream weights from anywhere: the new ModelStreamer load strategy streams model weights directly from S3 object storage, and now also from local disk, GCS, and Azure. Distributed streaming and S3 URI support land for the mx load format, and ModelExpress can hand off to native engine streamer loaders where they exist. (#221, #235, #261, #275)
  2. A distributed registry, not a local file: SQLite is replaced by a distributed RegistryBackend built on Redis and Kubernetes CRDs, plus a k8s-service metadata backend. The registry now coordinates state across the whole cluster instead of living on a single node, with model keys scoped by provider to prevent cross-provider false cache hits. (#250, #251, #412)
  3. More engines, more providers: SGLang gets a full ModelExpress loader integration with usage guide and P2P examples, TRT-LLM gains P2P weight transfer, and an NGC model provider joins HuggingFace and the cloud backends. Model requests are now provider-aware end to end. (#273, #202, #232, #207)
  4. A more robust P2P transfer plane: libfabric backend selection in NIXL with an AWS EFA example, per-rank NIC pinning via MX_RDMA_NIC_PIN, allocation-based pool registration via MX_POOL_REG, a per-allocation VMM arena allocator, P2P enabled by default for MLA models, and explicit safety checks around weight transfer. (#259, #255, #267, #282, #265, #225)

Storage providers and streaming

  • ModelStreamer for S3: stream model weights directly from S3 object storage instead of staging to disk first. (#221)
  • ModelStreamer everywhere: extended to local disk, GCS, and Azure, so the same streaming path covers on-prem and every major cloud. (#235)
  • Distributed streaming + S3 URIs for --load-format mx: streaming scales across workers and accepts S3 URIs directly. (#261)
  • Native engine streamer loaders: hand off to the engine's own streamer where one exists, avoiding duplicate paths. (#275)
  • NGC model provider: download models from NGC, joining HuggingFace and the cloud backends. (#232)
  • GCS end to end: GCS exposed as a first-class source across the request and cache paths. (#209)
  • Provider-aware model requests: requests carry their provider through the stack, plus a file selector for fetching specific model files. (#207, #254)
  • ModelStreamer Kubernetes recipes: ready-to-use manifests for streaming deployments. (#296)
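As a rough illustration of what accepting S3 URIs directly involves, the sketch below splits an `s3://bucket/prefix` URI into its bucket and key prefix using only the Python standard library. The names `S3Location` and `parse_s3_uri` are hypothetical and are not part of the ModelExpress API; the actual parsing in ModelExpress may differ.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class S3Location(NamedTuple):
    """Hypothetical container for the two parts of an S3 URI."""
    bucket: str
    prefix: str


def parse_s3_uri(uri: str) -> S3Location:
    """Split 's3://bucket/some/prefix' into bucket and key prefix.

    Illustrative helper only; not the ModelExpress implementation.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The path keeps its leading slash; object keys do not start with one.
    return S3Location(bucket=parsed.netloc, prefix=parsed.path.lstrip("/"))


if __name__ == "__main__":
    loc = parse_s3_uri("s3://my-bucket/models/example-model")
    print(loc.bucket, loc.prefix)
```

A loader that receives a plain filesystem path would fail this check and could fall through to its local-disk handling instead.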

Distributed registry and metadata

  • Distributed RegistryBackend (Redis + K8s CRD): replaces SQLite with cluster-wide state that scales beyond a single node. (#250)
  • k8s-service metadata backend: an additional Kubernetes-native metadata path. (#251)
  • Provider-scoped model keys: keys are namespaced by provider, preventing cross-provider false cache hits (backport to 0.4.0). (#412)
  • CRD status reporting: CRD conditions and observedGeneration are populated, agg.yaml fixed, and a stale manifest removed. (#216)
  • GMS integration groundwork: build_load_context extracted and unpublish_metadata added. (#239)
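To see why provider-scoped keys matter, consider two providers that both host a model under the same name. The sketch below is a minimal illustration, assuming a simple `provider:name` key format; the `ModelKey` class and the key layout are hypothetical and do not reflect the actual registry schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelKey:
    """Hypothetical registry key; the real key format is not shown in these notes."""
    provider: str
    model_name: str

    def unscoped(self) -> str:
        # Name-only key: two providers hosting the same name collide.
        return self.model_name

    def scoped(self) -> str:
        # Provider-namespaced key: the same name under different providers stays distinct.
        return f"{self.provider}:{self.model_name}"


if __name__ == "__main__":
    a = ModelKey("huggingface", "org/example-model")
    b = ModelKey("ngc", "org/example-model")
    print(a.unscoped() == b.unscoped())  # name-only keys collide
    print(a.scoped() == b.scoped())      # scoped keys do not
```

With name-only keys, a cache lookup for the second provider would report a hit on the first provider's entry; namespacing the key removes that ambiguity.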

P2P transfer and NIXL

  • TRT-LLM P2P weight transfer: peer-to-peer transfer support for TensorRT-LLM. (#202)
  • Safety checks for P2P weight transfer: guards against unsafe transfer conditions. (#225)
  • libfabric backend selection in NIXL: choose the LIBFABRIC backend, with an aws_efa example for EFA on AWS. (#259, #274)
  • Per-rank NIC pinning (MX_RDMA_NIC_PIN): pin UCX_NET_DEVICES per rank for predictable multi-NIC performance. (#255)
  • Allocation-based pool registration (MX_POOL_REG): register NIXL pools by allocation. (#267)
  • VMM arena allocator: per-allocation cuMemCreate, 16 TiB virtual address space, and single-MR dmabuf. (#282)
  • P2P on by default for MLA models: MLA architectures use P2P transfer without extra configuration. (#265)
  • Hidden CUDA tensors for correct RDMA inference: adopts hidden CUDA tensors so P2P RDMA produces correct results. (#241)
  • Manifest size logging and TLS cleanup: log manifest byte size on source and target; remove stale UCX_TLS settings. (#278, #336)
  • NIXL listener lifecycle: a change to start the listener only when P2P is enabled landed and was then reverted while the conditions are reworked. (#210, #220)
  • Single-node disaggregated Dynamo P2P example: a single-node disaggregated variant for the Dynamo P2P example. (#245)
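The per-rank NIC pinning item above can be pictured as each rank restricting UCX to one network device. The sketch below is an illustration only: `UCX_NET_DEVICES` is the standard UCX environment variable, but the round-robin mapping, the function name `pin_nic_for_rank`, and the example device names are assumptions. These notes do not describe how `MX_RDMA_NIC_PIN` actually selects a device.

```python
import os
from typing import MutableMapping, Optional, Sequence


def pin_nic_for_rank(
    local_rank: int,
    nics: Sequence[str],
    env: Optional[MutableMapping[str, str]] = None,
) -> str:
    """Assign one NIC to a rank by round-robin and export it for UCX.

    Hypothetical helper; the round-robin policy is an assumption.
    """
    if not nics:
        raise ValueError("no NICs to choose from")
    target = os.environ if env is None else env
    device = nics[local_rank % len(nics)]
    # UCX reads this variable to limit which network devices it uses.
    target["UCX_NET_DEVICES"] = device
    return device


if __name__ == "__main__":
    example_env = {}  # use a plain dict here to avoid touching the real environment
    chosen = pin_nic_for_rank(1, ["mlx5_0:1", "mlx5_1:1"], example_env)
    print(chosen, example_env)
```

Such a variable has to be set before UCX initializes in the worker process, which is why the sketch writes to the environment rather than returning a configuration object.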

Engine integrations (vLLM, SGLang, TRT-LLM)

  • SGLang loader integration: a full ModelExpress loader for SGLang, with a usage guide and P2P examples. (#273, #264, #287)
  • vLLM mx load format: native support for the ModelExpress vLLM load format. (#292)
  • Opt-in OTEL tracing in the vLLM loader: distributed tracing for day-2 observability. (#260)
  • vLLM logging and rank/device fixes: propagate vLLM log handlers to ModelExpress loggers, resolve worker device id, keep device id aligned with the target device, and use the torch.distributed world rank for worker_rank. (#227, #269, #270, #363)
  • Loader refactors: extract LoadStrategyChain and modularize the vLLM loader; organize engine and metadata loading paths. (#200, #262)
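The idea behind a load-strategy chain, and the re-initialization fix listed under bug fixes (#233), can be sketched as trying strategies in order and never handing a partially mutated model to the next one. This is a minimal sketch under stated assumptions: `load_with_fallback`, the strategy signature, and the dict-based "model" are hypothetical and are not the actual `LoadStrategyChain` API. For simplicity it builds a fresh model for every attempt, which has the same effect as re-initializing after a failure.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Model = Dict[str, str]
Strategy = Callable[[Model], None]


def load_with_fallback(
    make_model: Callable[[], Model],
    strategies: Sequence[Tuple[str, Strategy]],
) -> Tuple[str, Model]:
    """Try each strategy in order on a freshly built model.

    Returns the name of the strategy that succeeded and the loaded model.
    """
    errors: List[str] = []
    for name, strategy in strategies:
        # A fresh model per attempt: a failed strategy may have mutated the
        # previous one, so it must not be reused by the fallback.
        model = make_model()
        try:
            strategy(model)
            return name, model
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all load strategies failed: " + "; ".join(errors))


def flaky_p2p(model: Model) -> None:
    model["weights"] = "partial"  # mutates the model, then fails
    raise RuntimeError("peer unavailable")


def from_disk(model: Model) -> None:
    model["weights"] = "full"


if __name__ == "__main__":
    used, loaded = load_with_fallback(dict, [("p2p", flaky_p2p), ("disk", from_disk)])
    print(used, loaded)
```

The design choice here is to pay the cost of rebuilding the model on fallback in exchange for a guarantee that each strategy starts from a clean state.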

Lifecycle and API

  • Sleep/wake support: pause_serving / resume_serving lifecycle hooks. (#291)
  • Go gRPC bindings: official Go bindings for the ModelExpress gRPC API. (#243)
  • Standard gRPC health check service: standard health-check endpoint for probe integration. (#205)
  • Bare gRPC endpoints accepted: clients can pass bare ModelExpress gRPC endpoints. (#299)
  • MX_SERVER_ADDRESS recommended: docs now recommend MX_SERVER_ADDRESS and mark MODEL_EXPRESS_URL deprecated. (#400)
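For client code that needs to keep working during the migration from `MODEL_EXPRESS_URL` to `MX_SERVER_ADDRESS`, one possible approach is to prefer the new variable and fall back to the old one with a warning. The sketch below is illustrative only: the function `resolve_server_address` is hypothetical, and the precedence shown is an assumption, not documented ModelExpress behavior.

```python
import os
import warnings
from typing import Mapping, Optional


def resolve_server_address(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the server address, preferring MX_SERVER_ADDRESS.

    Hypothetical helper; the precedence here is an assumption.
    """
    source = os.environ if env is None else env
    address = source.get("MX_SERVER_ADDRESS")
    if address:
        return address
    legacy = source.get("MODEL_EXPRESS_URL")
    if legacy:
        warnings.warn(
            "MODEL_EXPRESS_URL is deprecated; set MX_SERVER_ADDRESS instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return legacy
    return None


if __name__ == "__main__":
    # Example values only; the host and port are placeholders.
    print(resolve_server_address({"MX_SERVER_ADDRESS": "modelexpress.example:8001"}))
```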

Bug fixes

  • Re-initialize the model after a failed strategy mutates it, fixing the FP8 disk-fallback crash. (#233)
  • Resolve the HuggingFace cache path in the model validate command, and validate GCS-cached models in the CLI. (#224, #399)
  • On cache hit, clear now removes the registry record and verifies disk (backport to 0.4.0). (#409)
  • NGC downloads use the files-manifest endpoint for all storage/scope combinations (backport to 0.4.0). (#405)
  • Default coalesce_transfers to False in receive_from_source. (#234)
  • Fix the unsupported wheel/source-distribution error on macOS. (#257)
  • Use the correct grpc-health-probe image in the Helm test template. (#213)
  • Example/manifest fixes: remove --enable-expert-parallel from vLLM examples, align AWS EFA vLLM DeepSeek defaults, and bind the SGLang P2P example to 0.0.0.0:8000 to match its Service. (#228, #407, #401)
  • Make Redis metadata setup explicit in the README and docker-compose. (#203)

CI, build, and housekeeping

  • New Kubernetes CI coverage: P2P tests for vLLM and TRT-LLM; P2P + TP + model-streamer tests; TRT-LLM P2P and single-node TP=2 unblocked, with a Dynamo aggregated test; and a disaggregated Dynamo P2P + stale-metadata test. (#279, #290, #294, #297)
  • Enable copy-pr-bot on the repository. (#283)
  • Docker reorganized into docker/ with a cross-arch wheel builder. (#398)
  • Bump dependencies (urllib3, pytest, rand) to address Dependabot alerts. (#276)
  • Simplify the redundant quiet-mode writer condition. (#249)
  • Release bookkeeping: bump version 0.3.0 → 0.4.0 and bump example/Helm image tags to 0.4.0. (#339, #393)
  • Documentation: update vLLM loader registration docs for 0.4.0. (#391)

Where we're headed

v0.4.0 broadens ModelExpress across storage backends, model providers, and inference engines, and hardens the P2P transfer plane. The next releases focus on:

  • Performance: transfer throughput optimization across the new streaming and P2P paths, contiguous-region support, and benchmarking across network topologies and cloud fabrics.
  • Day-2 operations: building on OTEL tracing and sleep/wake toward full observability, rolling upgrades without transfer disruption, and multi-tenant isolation.
  • Deeper engine and provider coverage: continuing to close the gap between vLLM, SGLang, and TensorRT-LLM, and expanding provider support.

The longer arc is unchanged: ModelExpress becomes the weight management layer for inference and RL systems, the piece that makes model placement, scaling, and migration fast enough that the orchestrator can treat GPU memory as a fungible resource across the cluster.

Contributors

Thank you to everyone who contributed to this release, and especially to the reviewers and testers who validated the new streaming, multi-cloud, distributed-registry, and SGLang/TRT-LLM P2P paths across Kubernetes and multi-node setups.

Full changelog

v0.3.0...v0.4.0