[Initiative]: AI for Science (AI4S) Cloud-Native Foundation whitepaper #2002

@Jiamu1

Description

Name

AI4S Cloud-Native Foundation whitepaper

Short description

Formalize principles, reference architectures, and ecosystem strategies for building scalable, reproducible AI-for-Science (AI4S) platforms on Kubernetes and cloud-native substrates, unifying scientific computing, AI workloads, and large-scale data management.

Responsible group

TOC

Does the initiative belong to a subproject?

Yes

Subproject name

TOC Artificial Intelligence Initiatives

Primary contact

Yanjun Chen (@chenyanjunyjy), Jiamu Liu (@Jiamu1)

Additional contacts

No response

Initiative description

The purpose of this initiative is to establish a dedicated sub-stream within relevant cloud-native working groups to define foundational principles, reference patterns, and ecosystem gaps for AI4S platforms. These platforms will accelerate scientific discovery across domains (e.g., material science, drug discovery, climate modeling) by leveraging cloud-native’s strengths in elasticity, portability, and orchestration to unify heterogeneous workloads—including AI model training/inference, high-performance computing (HPC) tasks, large-scale scientific data processing, and AI4S agents.

Scope Definition
The group will focus on architectural guidance, interoperability standards, and ecosystem alignment for cloud-native AI4S platforms, rather than building proprietary runtime systems or deep-diving into domain-specific scientific algorithms. Key focus areas include:

  1. Workload Orchestration & Resource Management: How should heterogeneous AI4S workloads (e.g., GPU-intensive model training, CPU-bound scientific simulations, batch data preprocessing) be modeled in cloud-native terms (Pods, CRDs, Jobs)? How can Kubernetes integrate with HPC schedulers (e.g., Slurm), and how can scheduling for mixed workloads be optimized using Dynamic Resource Allocation (DRA) and GPU partitioning? What lifecycle hooks are needed for long-running scientific simulations paired with AI inference pipelines?
  2. Unified Data Layer for Scientific AI: Scientific data is often large-scale, heterogeneous (e.g., protein sequences, sensor data, simulations, imaging), and stored in siloed systems (e.g., HPC storage, object stores, databases). How should a cloud-native data access layer be designed to unify these sources behind AI-friendly abstractions (e.g., vector databases for molecular structures, cached datasets for model fine-tuning)? Which protocols (e.g., S3, CSI, Data Transfer Node APIs) and caching strategies (e.g., Redis, Memcached) ensure low-latency access for AI4S workloads?
  3. AI Toolchain Integration & Model Management: How can the integration of AI toolchains (e.g., model registries, fine-tuning frameworks, inference optimizers) into cloud-native AI4S platforms be standardized? Which CRDs or APIs are needed to manage scientific AI models (e.g., versioning, lineage, reproducible training against specific scientific datasets)? How can inference for domain-specific models (e.g., 3D molecular models, climate simulation predictors) be optimized with techniques such as model parallelism and tensor parallelism on cloud-native clusters?
  4. AI4S Agent Engineering Stack: AI4S requires intelligent agents to automate coordination of the scientific AI workflow lifecycle. How can cloud-native AI4S agents be built from CNCF primitives (e.g., Kubernetes CRDs, Argo Workflows, OpenTelemetry)? How should an agent’s decision logic be designed (e.g., adjusting training sharding based on GPU utilization) to compensate for the lack of native intelligent coordination for scientific computing in general-purpose CNCF workflow tools (Argo, Tekton)?
  5. Reproducibility & Traceability: Scientific discovery requires strict reproducibility of experiments. How can cloud-native technologies (e.g., containerization, environment snapshots, GitOps) capture the full experiment context (code, data, models, infrastructure)? Which observability primitives (e.g., OpenTelemetry spans for simulation-AI pipelines, metadata logging for scientific parameters) are needed to trace the lineage of scientific insights from data collection to AI-driven conclusions?
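To make focus areas 1, 3, and 5 concrete, one possible shape for such a platform is a custom resource that pins the full experiment context (code, data, model, infrastructure) for reproducibility. The sketch below is purely illustrative: the `ai4s.example.io` API group, the `ScientificExperiment` kind, and every field name are assumptions for this example, not an existing CRD.

```python
# Illustrative sketch: modeling a reproducible AI4S experiment as a
# Kubernetes custom resource. All group/kind/field names are hypothetical.

def build_experiment_manifest(name, image, dataset_uri, dataset_version,
                              code_commit, gpus=1):
    """Build a custom-resource manifest that captures the experiment's
    workload spec alongside its provenance (code commit, dataset version),
    so a run can be traced and replayed exactly."""
    return {
        "apiVersion": "ai4s.example.io/v1alpha1",  # assumed group/version
        "kind": "ScientificExperiment",            # assumed kind
        "metadata": {"name": name},
        "spec": {
            "workload": {
                "image": image,
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
            },
            "provenance": {  # lineage data for reproducibility/traceability
                "codeCommit": code_commit,
                "dataset": {"uri": dataset_uri, "version": dataset_version},
            },
        },
    }

manifest = build_experiment_manifest(
    name="protein-folding-run-42",
    image="registry.example.io/fold:1.4.2",
    dataset_uri="s3://ai4s-data/proteins",
    dataset_version="2024-06-01",
    code_commit="9f2c1ab",
)
```

In practice such a manifest would be applied to a cluster and reconciled by an operator; the point of the sketch is only that versioned code, data, and infrastructure can live in one declarative object.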

Why it matters to CNCF

  1. Emerging scientific workloads: AI4S is a high-impact emerging workload driving breakthroughs in healthcare, energy, and climate science. These workloads need cloud-native technology’s strengths in scalability, resilience, and interoperability—giving cloud-native a key chance to become the de facto infrastructure for scientific discovery.
  2. Avoid fragmentation: Today’s AI4S platforms are often siloed (e.g., HPC-centric solutions, proprietary AI toolchains, cloud-specific offerings), leading to vendor lock-in and limited collaboration. A neutral and open cloud-native framework can unify these diverse approaches, and CNCF could take a foundational role in this effort.
  3. Fill the HPC-AI-cloud gap: Traditional HPC systems lack the elasticity and AI toolchain integration needed for modern scientific AI. This initiative bridges this gap, expanding cloud-native’s reach beyond enterprise IT to research and scientific computing.
  4. Ecosystem expansion: AI4S will involve academic researchers, industrial R&D teams, HPC experts, and AI engineers. By defining clear cloud-native patterns for AI4S, the community can attract new contributors and projects (e.g., scientific data operators, AI4S-specific schedulers) while integrating existing cloud-native tools into a high-impact use case.

Key technologies & projects involved

  1. Orchestration & Scheduling: Kubernetes, KEDA, Kueue, Dynamic Resource Allocation (DRA), MPI Operator, Slurm-Kubernetes Bridge, Volcano Scheduler
  2. Data Management: MinIO, Ceph, Apache Iceberg (for data lakes), Vector Databases (Chroma, Milvus), Velero (data backup/restore), CSI Drivers for HPC Storage, KubeRay
  3. AI Toolchain: Kubeflow, MLflow (model registry), BentoML (model serving), vLLM/TensorRT (inference optimization), PyTorch Distributed/TensorFlow Distributed (distributed training)
  4. Scientific Computing Integration: MPI, OpenMP, NVIDIA CUDA/AMD ROCm (accelerators), HPC Container Orchestration (Singularity/Kubernetes integration)
  5. Observability & Traceability: OpenTelemetry, Prometheus/Grafana, Jaeger (distributed tracing), Weights & Biases (experiment tracking), GitLab/GitHub Actions (GitOps)
  6. Workflow Automation: Argo Workflows, Tekton, Apache Airflow (scientific pipeline orchestration), Prefect
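To illustrate the agent decision logic raised in the scope section (item 4), the sketch below shows one possible policy for adjusting a training job's data-parallel shard count from observed GPU utilization. Every name and threshold here is an assumption for illustration; a real agent would read utilization from an observability backend (e.g., Prometheus) and apply the decision by patching a workload CRD inside a reconcile loop.

```python
# Illustrative sketch of AI4S agent decision logic. The function name and
# the 0.40/0.85 thresholds are hypothetical, not part of any CNCF project.

def decide_shard_count(current_shards: int, gpu_utilization: float,
                       max_shards: int = 16) -> int:
    """Pick a new data-parallel shard count from mean GPU utilization.

    Low utilization suggests the job is input- or communication-bound,
    so scale in; sustained high utilization suggests headroom, so scale out.
    """
    if gpu_utilization < 0.40 and current_shards > 1:
        return max(1, current_shards // 2)          # scale in: GPUs starved
    if gpu_utilization > 0.85 and current_shards < max_shards:
        return min(max_shards, current_shards * 2)  # scale out: headroom
    return current_shards                           # within the healthy band

print(decide_shard_count(4, 0.30))  # -> 2
print(decide_shard_count(4, 0.90))  # -> 8
print(decide_shard_count(4, 0.60))  # -> 4
```

The gap this kind of policy fills is exactly the one noted above: general-purpose workflow engines such as Argo or Tekton execute a fixed DAG, whereas a scientific-computing agent must close the loop between runtime telemetry and workload reconfiguration.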

Deliverable(s) or exit criteria

Publish the “AI for Science (AI4S) Cloud-Native Foundation” whitepaper (≤ 15 pp), detailing architectural patterns, a reference architecture, workload models, data-layer design, key challenges, gap analysis, and case studies (e.g., drug discovery, climate modeling).

Tracking document for meeting and progress

TBD

Metadata

Assignees

No one assigned

    Labels

    kind/initiative: An initiative or an item related to initiative processes
    needs-group: Indicates an issue or PR that has not been assigned a group (toc or tag/foo label applied)
    needs-triage: Indicates an issue or PR that has not been triaged yet (has a 'triage/foo' label applied)

    Type

    No type

    Projects

    Status

    New

    Milestone

    No milestone
