aws-solutions-library-samples/guidance-for-building-a-high-performance-numerical-weather-prediction-system-on-aws


Guidance for Building a High-Performance Numerical Weather Prediction System on AWS

Overview

Amazon Web Services (AWS) provides the most elastic and scalable cloud infrastructure to run your weather workloads. With virtually unlimited capacity, engineers, researchers, HPC system administrators, and organizations can innovate beyond the limitations of on-premises HPC infrastructure.

High Performance Computing (HPC) on AWS removes the long wait times and lost productivity often associated with on-premises HPC clusters. Flexible HPC cluster configurations and virtually unlimited scalability allow you to grow and shrink your infrastructure as your workloads dictate, not the other way around. Additionally, with access to a broad portfolio of cloud-based services for Data Analytics, Artificial Intelligence (AI), and Machine Learning (ML), you can reinvent traditional Numerical Weather Prediction (NWP) workflows to derive results faster and under budget.

Please find AWS Weather HPC customer case studies here, under Weather.

This guidance is intended for builders who want to learn hands-on about running weather forecasting in the AWS Cloud.

Sample Surface Temperature Model

Figure 1. Sample surface temperature model obtained by numerical weather prediction

Architecture Overview

Architecture diagrams

The architecture diagrams below show a sample HPC cluster architecture, the provisioning process, and user interactions via the AWS ParallelCluster UI for running numerical weather forecasting tasks.
Provision AWS ParallelCluster UI and configure HPC cluster
Figure 2: AWS ParallelCluster UI and HPC Cluster Architecture

The following steps provision the AWS ParallelCluster UI and configure an HPC cluster with compute and storage capabilities:

  1. Users deploy the guidance AWS CloudFormation stack, which provisions networking resources (Amazon VPC, subnets), Amazon API Gateway, storage (Amazon FSx for Lustre), and finally the AWS ParallelCluster UI.
  2. The AWS ParallelCluster UI endpoint is made available for user authentication via Amazon API Gateway.
  3. Users authenticate to the AWS ParallelCluster UI endpoint, which triggers an AWS Lambda function that handles login details via Amazon Cognito.
  4. Authenticated users provision HPC clusters via the ParallelCluster UI using the sample cluster specifications available with the guidance code. Each HPC cluster has a head node and one or more compute nodes that are dynamically provisioned for application workload execution.
  5. Users authenticated via the ParallelCluster UI can connect to the HPC cluster using AWS Systems Manager Session Manager or NICE DCV sessions.
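
As an illustration of the kind of cluster specification used in step 4, here is a minimal sketch that builds a ParallelCluster-v3-style configuration as a Python dictionary and prints it as JSON (which is also valid YAML). The instance types, Region, storage size, and `/fsx` mount directory come from this README. The operating system, queue and resource names, and subnet IDs are placeholder assumptions, not the values in the guidance's sample specifications.

```python
import json


def build_cluster_config(head_subnet_id, compute_subnet_id, max_compute_nodes=2):
    """Return a ParallelCluster-v3-style cluster configuration as a dict.

    Subnet IDs are caller-supplied placeholders; "alinux2" and the queue and
    resource names are assumptions made for illustration only.
    """
    return {
        "Region": "us-east-2",
        "Image": {"Os": "alinux2"},  # assumed OS
        "HeadNode": {
            "InstanceType": "c6a.2xlarge",
            "Networking": {"SubnetId": head_subnet_id},  # public subnet
        },
        "Scheduling": {
            "Scheduler": "slurm",
            "SlurmQueues": [
                {
                    "Name": "compute",  # hypothetical queue name
                    "ComputeResources": [
                        {
                            "Name": "hpc6a",  # hypothetical resource name
                            "InstanceType": "hpc6a.48xlarge",
                            "MinCount": 0,  # scale to zero when idle
                            "MaxCount": max_compute_nodes,
                        }
                    ],
                    "Networking": {"SubnetIds": [compute_subnet_id]},  # private subnet
                }
            ],
        },
        "SharedStorage": [
            {
                "Name": "fsx",
                "StorageType": "FsxLustre",
                "MountDir": "/fsx",
                "FsxLustreSettings": {"StorageCapacity": 1200},  # GB
            }
        ],
    }


config = build_cluster_config("subnet-PUBLIC-PLACEHOLDER", "subnet-PRIVATE-PLACEHOLDER")
print(json.dumps(config, indent=2))
```

Setting `MinCount` to 0 reflects the dynamic provisioning described above: compute nodes exist only while jobs need them.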

Sample HPC Cluster Architecture and user Interactions


Figure 3. HPC cluster architecture and user interactions for running numerical weather prediction on AWS

The following steps describe how users interact with the AWS ParallelCluster UI to configure an HPC cluster with compute and storage capabilities, and then deploy and run a numerical weather prediction model.

  1. The user authenticates to the AWS ParallelCluster UI via Amazon Cognito, API Gateway, and Lambda.
  2. The user connects to the HPC cluster via the AWS ParallelCluster UI using an SSM connection or NICE DCV (the latter can also be used directly, without the ParallelCluster UI).
  3. Slurm (the HPC workload manager from SchedMD) is installed and manages the cluster's resources, driving the scaling of compute nodes.
  4. Spack, a package manager for supercomputers, Linux, and macOS, is used to install the necessary compilers and libraries, including the NCAR Command Language (NCL) and the Weather Research & Forecasting (WRF) model.
  5. Amazon FSx for Lustre storage is provisioned along with the other HPC cluster resources. Input data for the WRF test case, the 12-km CONUS (Continental United States) model, is copied to the /fsx directory, which is mapped to that storage.
  6. Users create an sbatch script to run the CONUS 12-km model, submit the job, and monitor its status with the squeue command.
  7. Weather forecast results are stored in the /fsx/conus_12km/ directory and can be visualized using NCL scripts.
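
Steps 6 and 7 can be sketched as follows. This is a minimal illustration, not the script shipped with the guidance. The job name, node and task counts, output file name, and the `spack load wrf` and `mpirun wrf.exe` lines are assumptions. Only `sbatch`, `squeue`, and the `/fsx/conus_12km` directory come from the steps above.

```python
import subprocess


def build_sbatch_script(nodes=2, tasks_per_node=96, run_dir="/fsx/conus_12km"):
    """Return the text of a Slurm batch script for a WRF run (illustrative values)."""
    lines = [
        "#!/bin/bash",
        "#SBATCH --job-name=wrf-conus12km",  # hypothetical job name
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={tasks_per_node}",
        "#SBATCH --exclusive",
        "#SBATCH --output=conus-%j.out",  # %j expands to the job ID
        "",
        "# Assumes Spack shell integration is already set up on the cluster",
        "spack load wrf",
        f"cd {run_dir}",
        "mpirun wrf.exe",  # assumed launch command
    ]
    return "\n".join(lines) + "\n"


def submit(script_path):
    """Submit the script with sbatch and return the job ID as a string."""
    result = subprocess.run(
        ["sbatch", "--parsable", script_path],
        check=True, capture_output=True, text=True,
    )
    # --parsable prints "jobid" or "jobid;cluster"
    return result.stdout.strip().split(";")[0]


def job_state(job_id):
    """Return the job's state from squeue, or NOT_IN_QUEUE once it has left the queue."""
    result = subprocess.run(
        ["squeue", "-h", "-j", job_id, "-o", "%T"],
        capture_output=True, text=True,
    )
    return result.stdout.strip() or "NOT_IN_QUEUE"


script_text = build_sbatch_script()
print(script_text)
```

On the head node, you would write `script_text` to a file such as `slurm-wrf-conus12km.sh` (a hypothetical name), pass its path to `submit`, and poll `job_state` with the returned job ID until the job leaves the queue.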

AWS Services in this Guidance

The following AWS Services are deployed in this Guidance:

| AWS service | Description |
|---|---|
| Amazon VPC | Core service - provides additional networking isolation and security |
| Amazon EC2 | Core service - EC2 instances used as cluster nodes |
| Amazon API Gateway | Core service - create, publish, maintain, monitor, and secure APIs at scale |
| Amazon Cognito | Core service - provides user identity and access management services |
| AWS Lambda | Core service - provides serverless automation of user authentication |
| Amazon FSx for Lustre | Core service - provides high-performance Lustre file system |
| AWS ParallelCluster | Core service - open source cluster management tool for deployment and management of High Performance Computing (HPC) clusters |
| Amazon High Performance Computing (HPC) cluster | Core service - high performance compute resource |
| AWS Systems Manager Session Manager | Auxiliary service - instance connection management |

Plan your deployment

Supported AWS Regions

This Guidance uses specific Amazon EC2 instance types, such as hpc6a, and Amazon FSx for Lustre storage, which may not currently be available in all AWS Regions. You must launch this solution in an AWS Region where the required EC2 instance types and FSx for Lustre are available. For the most current availability of AWS services by Region, refer to the AWS Regional Services List.

Guidance for Building a High-Performance Numerical Weather Prediction System on AWS is currently supported in the following AWS Regions (based on availability of hpc6a, hpc6id, hpc7a, and hpc7g instances):

| AWS Region | Amazon EC2 HPC Optimized Instance type |
|---|---|
| ap-southeast-1 | hpc6a.48xlarge |
| eu-north-1 | hpc6id.32xlarge, hpc6a.48xlarge |
| eu-west-1 | hpc7a.12xlarge, hpc7a.24xlarge, hpc7a.48xlarge, hpc7a.96xlarge |
| us-east-1 | hpc7g.4xlarge, hpc7g.8xlarge, hpc7g.16xlarge |
| us-east-2 | hpc6a.48xlarge, hpc6id.32xlarge, hpc7a.12xlarge, hpc7a.24xlarge, hpc7a.48xlarge, hpc7a.96xlarge |
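
When picking a Region for a deployment, the table above can be encoded as a small lookup. The sketch below copies the Region and instance type pairs from this README. The helper names are hypothetical, and the data is a static snapshot, so check current availability before relying on it.

```python
# Region -> HPC-optimized instance types, copied from the table above.
SUPPORTED_INSTANCE_TYPES = {
    "ap-southeast-1": ["hpc6a.48xlarge"],
    "eu-north-1": ["hpc6id.32xlarge", "hpc6a.48xlarge"],
    "eu-west-1": ["hpc7a.12xlarge", "hpc7a.24xlarge", "hpc7a.48xlarge", "hpc7a.96xlarge"],
    "us-east-1": ["hpc7g.4xlarge", "hpc7g.8xlarge", "hpc7g.16xlarge"],
    "us-east-2": [
        "hpc6a.48xlarge", "hpc6id.32xlarge",
        "hpc7a.12xlarge", "hpc7a.24xlarge", "hpc7a.48xlarge", "hpc7a.96xlarge",
    ],
}


def is_supported(region, instance_type):
    """Return True if the table lists instance_type for region."""
    return instance_type in SUPPORTED_INSTANCE_TYPES.get(region, [])


def regions_supporting(instance_type):
    """Return the sorted list of Regions where the table lists instance_type."""
    return sorted(
        region
        for region, types in SUPPORTED_INSTANCE_TYPES.items()
        if instance_type in types
    )


print(regions_supporting("hpc6a.48xlarge"))
# ['ap-southeast-1', 'eu-north-1', 'us-east-2']
```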

Cost

You are responsible for the cost of the AWS services used while running this guidance.

Refer to the pricing webpage for each AWS service used in this Guidance. The monthly estimate assumes an HPC cluster with one c6a.2xlarge head node, two hpc6a.48xlarge compute nodes, and 1200 GB of persistent FSx for Lustre storage, active 50% of the time. In practice, compute nodes are deprovisioned about 10 minutes after a job completes, so the actual monthly cost would likely be lower than this estimate.

Sample cost table

The following table provides a sample cost breakdown for an HPC cluster with one c6a.2xlarge head node, two hpc6a.48xlarge compute nodes, and 1200 GB of persistent FSx for Lustre storage, deployed in the us-east-2 Region:

| Resource | On-Demand cost per month (USD) |
|---|---|
| c6a.2xlarge | $226.58 |
| hpc6a.48xlarge | $2,102.40 |
| FSx Lustre storage | $720.07 |
| VPC, subnets | $283.50 |
| Total estimate | $3,332.55 |
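
The total in the table above is the sum of its line items. The sketch below reproduces that arithmetic and shows how the estimate might change with compute usage. It assumes, as a simplification not stated in this README, that only the hpc6a.48xlarge line scales linearly with the fraction of time the compute nodes are active, and that the table reflects 50% activity.

```python
from decimal import Decimal

# Monthly line items in USD, copied from the sample cost table.
MONTHLY_COSTS = {
    "c6a.2xlarge": Decimal("226.58"),
    "hpc6a.48xlarge": Decimal("2102.40"),
    "FSx Lustre storage": Decimal("720.07"),
    "VPC, subnets": Decimal("283.50"),
}

BASELINE_ACTIVE_FRACTION = Decimal("0.5")  # the 50% activity assumed by the table


def estimate_total(compute_active_fraction=BASELINE_ACTIVE_FRACTION):
    """Estimate the monthly total, scaling only the compute-node line.

    Assumption: compute cost is proportional to the active fraction and all
    other line items are fixed.
    """
    scale = Decimal(str(compute_active_fraction)) / BASELINE_ACTIVE_FRACTION
    total = sum(
        cost * scale if name == "hpc6a.48xlarge" else cost
        for name, cost in MONTHLY_COSTS.items()
    )
    return total.quantize(Decimal("0.01"))


print(estimate_total())        # 3332.55, matching the table
print(estimate_total("0.25"))  # 2281.35 if compute nodes are active 25% of the time
```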

Security

When you build systems on AWS infrastructure, security responsibilities are shared between you and AWS. This shared responsibility model reduces your operational burden because AWS operates, manages, and controls the components including the host operating system, the virtualization layer, and the physical security of the facilities in which the services operate. For more information about AWS security, visit AWS Cloud Security.

AWS ParallelCluster users are securely authenticated and authorized to their roles via an Amazon Cognito user pool. HPC cluster EC2 components are deployed into a Virtual Private Cloud (VPC), which provides additional network security isolation for all contained components. The head node is deployed into a public subnet and is accessible via secure connections (SSH and DCV). Compute nodes are deployed into a private subnet and managed from the head node via the Slurm workload manager. Data stored in Amazon FSx for Lustre is encrypted at rest and in transit.

See CONTRIBUTING for more information.

Deployment Steps

Please see the published Implementation Guide for step-by-step deployment instructions for this guidance.

License

This library is licensed under the MIT-0 License. See the LICENSE file.
