Build Status

Table of Contents

Azure Big Compute

License

MSFT OSCC

Credits

This repo is inspired by Christian Smith's repo https://github.com/smith1511/hpc

Deploy from Portal and visualize

Deploy to Azure

For portal deployment, the following picture may help.

azureportaldeploy

This project is hosted at:

For the latest version, to contribute, and for more information, please go through this README.md.

To clone the current master (development) branch run:

git clone https://github.com/cloudgear-io/azure-bigcompute.git

Single or Cluster Topology Examples with Azure CLI

New Azure CLI

docker run -dti --restart=always --name=azure-cli-python azuresdk/azure-cli-python && docker exec -ti azure-cli-python bash -c "az login && bash"

To sign in, use a web browser to open the page https://aka.ms/devicelogin and enter the code XXXXXXXXX to authenticate.

HPC with RDMA over IB

  • HPC cluster (each node H16R) with PBSPro and no OMS, head login user "azurehpcuser" and internal user "hpcgpu"; minimum 1 head and 1 worker. Replace the SshPublicKey value below with your SSH public key:

     bash-4.3# az group create -l southcentralus -n tsthpc && az group deployment create -g tsthpc -n tsthpc --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"cluster\"},\"DnsLabelPrefix\":{\"value\":\"tsthpc\"},\"AdminUserName\":{\"value\":\"azurehpcuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"openlogic\"},\"ImageOffer\":{\"value\":\"CentOS-HPC\"},\"ImageSku\":{\"value\":\"7.4\"},\"schedulerpbsORTorqueORslurm\":{\"value\":\"pbspro\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_H16R\"},\"WorkerNodeCount\":{\"value\": 1},\"NumDataDisks\":{\"value\":\"32\"}}" --debug
  • HPC single H16R with PBSPro and no OMS, login user "azurehpcuser" and internal user "hpcgpu". Replace the SshPublicKey value below with your SSH public key:

     bash-4.3# az group create -l southcentralus -n tsthpc && az group deployment create -g tsthpc -n tsthpc --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"single\"},\"DnsLabelPrefix\":{\"value\":\"tsthpc\"},\"AdminUserName\":{\"value\":\"azurehpcuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"openlogic\"},\"ImageOffer\":{\"value\":\"CentOS-HPC\"},\"ImageSku\":{\"value\":\"7.4\"},\"schedulerpbsORTorqueORslurm\":{\"value\":\"pbspro\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_H16R\"},\"WorkerNodeCount\":{\"value\": 0},\"NumDataDisks\":{\"value\":\"32\"}}" --debug
  • HPC cluster (each node H16R) with PBSPro and OMS, head login user "azurehpcuser" and internal user "hpcgpu"; minimum 1 head and 1 worker. Replace the SshPublicKey, oMSWorkSpaceId and oMSWorkSpaceKey values below with your own:

     bash-4.3# az group create -l southcentralus -n tsthpc && az group deployment create -g tsthpc -n tsthpc --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"cluster\"},\"DnsLabelPrefix\":{\"value\":\"tsthpc\"},\"AdminUserName\":{\"value\":\"azurehpcuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"openlogic\"},\"ImageOffer\":{\"value\":\"CentOS-HPC\"},\"ImageSku\":{\"value\":\"7.4\"},\"schedulerpbsORTorqueORslurm\":{\"value\":\"pbspro\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_H16R\"},\"WorkerNodeCount\":{\"value\": 1},\"NumDataDisks\":{\"value\":\"32\"},\"oMSWorkSpaceId\":{\"value\": \"xxxxxxxxxx\"},\"oMSWorkSpaceKey\":{\"value\": \"xxxxxxxxx\"}}" --debug
  • HPC single H16R with PBSPro and OMS, login user "azurehpcuser" and internal user "hpcgpu". Replace the SshPublicKey, oMSWorkSpaceId and oMSWorkSpaceKey values below with your own:

     bash-4.3# az group create -l southcentralus -n tsthpc && az group deployment create -g tsthpc -n tsthpc --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"single\"},\"DnsLabelPrefix\":{\"value\":\"tsthpc\"},\"AdminUserName\":{\"value\":\"azurehpcuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"openlogic\"},\"ImageOffer\":{\"value\":\"CentOS-HPC\"},\"ImageSku\":{\"value\":\"7.4\"},\"schedulerpbsORTorqueORslurm\":{\"value\":\"pbspro\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_H16R\"},\"WorkerNodeCount\":{\"value\": 0},\"NumDataDisks\":{\"value\":\"32\"},\"oMSWorkSpaceId\":{\"value\": \"xxxxxxxxxx\"},\"oMSWorkSpaceKey\":{\"value\": \"xxxxxxxxx\"}}" --debug
  • HPC cluster (each node H16R) with Torque and no OMS, head login user "azurehpcuser" and internal user "hpcgpu"; minimum 1 head and 1 worker. Replace the SshPublicKey value below with your SSH public key:

     bash-4.3# az group create -l southcentralus -n tsthpc && az group deployment create -g tsthpc -n tsthpc --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"cluster\"},\"DnsLabelPrefix\":{\"value\":\"tsthpc\"},\"AdminUserName\":{\"value\":\"azurehpcuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"openlogic\"},\"ImageOffer\":{\"value\":\"CentOS-HPC\"},\"ImageSku\":{\"value\":\"7.4\"},\"schedulerpbsORTorqueORslurm\":{\"value\":\"Torque\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_H16R\"},\"WorkerNodeCount\":{\"value\": 1},\"NumDataDisks\":{\"value\":\"32\"}}" --debug
  • HPC single H16R with Torque and no OMS, login user "azurehpcuser" and internal user "hpcgpu". Replace the SshPublicKey value below with your SSH public key:

     bash-4.3# az group create -l southcentralus -n tsthpc && az group deployment create -g tsthpc -n tsthpc --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"single\"},\"DnsLabelPrefix\":{\"value\":\"tsthpc\"},\"AdminUserName\":{\"value\":\"azurehpcuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"openlogic\"},\"ImageOffer\":{\"value\":\"CentOS-HPC\"},\"ImageSku\":{\"value\":\"7.4\"},\"schedulerpbsORTorqueORslurm\":{\"value\":\"Torque\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_H16R\"},\"WorkerNodeCount\":{\"value\": 0},\"NumDataDisks\":{\"value\":\"32\"}}" --debug
  • HPC cluster (each node H16R) with Torque and OMS, head login user "azurehpcuser" and internal user "hpcgpu"; minimum 1 head and 1 worker. Replace the SshPublicKey, oMSWorkSpaceId and oMSWorkSpaceKey values below with your own:

     bash-4.3# az group create -l southcentralus -n tsthpc && az group deployment create -g tsthpc -n tsthpc --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"cluster\"},\"DnsLabelPrefix\":{\"value\":\"tsthpc\"},\"AdminUserName\":{\"value\":\"azurehpcuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"openlogic\"},\"ImageOffer\":{\"value\":\"CentOS-HPC\"},\"ImageSku\":{\"value\":\"7.4\"},\"schedulerpbsORTorqueORslurm\":{\"value\":\"Torque\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_H16R\"},\"WorkerNodeCount\":{\"value\": 1},\"NumDataDisks\":{\"value\":\"32\"},\"oMSWorkSpaceId\":{\"value\": \"xxxxxxxxxx\"},\"oMSWorkSpaceKey\":{\"value\": \"xxxxxxxxx\"}}" --debug
  • HPC single H16R with Torque and OMS, login user "azurehpcuser" and internal user "hpcgpu". Replace the SshPublicKey, oMSWorkSpaceId and oMSWorkSpaceKey values below with your own:

     bash-4.3# az group create -l southcentralus -n tsthpc && az group deployment create -g tsthpc -n tsthpc --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"single\"},\"DnsLabelPrefix\":{\"value\":\"tsthpc\"},\"AdminUserName\":{\"value\":\"azurehpcuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"openlogic\"},\"ImageOffer\":{\"value\":\"CentOS-HPC\"},\"ImageSku\":{\"value\":\"7.4\"},\"schedulerpbsORTorqueORslurm\":{\"value\":\"Torque\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_H16R\"},\"WorkerNodeCount\":{\"value\": 0},\"NumDataDisks\":{\"value\":\"32\"},\"oMSWorkSpaceId\":{\"value\": \"xxxxxxxxxx\"},\"oMSWorkSpaceKey\":{\"value\": \"xxxxxxxxx\"}}" --debug
  • HPC cluster (each node H16R) with Slurm and no OMS, head login user "azurehpcuser" and internal user "hpcgpu"; minimum 1 head and 1 worker. Replace the SshPublicKey value below with your SSH public key:

     bash-4.3# az group create -l southcentralus -n tsthpc && az group deployment create -g tsthpc -n tsthpc --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"cluster\"},\"DnsLabelPrefix\":{\"value\":\"tsthpc\"},\"AdminUserName\":{\"value\":\"azurehpcuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"openlogic\"},\"ImageOffer\":{\"value\":\"CentOS-HPC\"},\"ImageSku\":{\"value\":\"7.4\"},\"schedulerpbsORTorqueORslurm\":{\"value\":\"slurm\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_H16R\"},\"WorkerNodeCount\":{\"value\": 1},\"NumDataDisks\":{\"value\":\"32\"}}" --debug
  • HPC single H16R with Slurm and no OMS, login user "azurehpcuser" and internal user "hpcgpu". Replace the SshPublicKey value below with your SSH public key:

     bash-4.3# az group create -l southcentralus -n tsthpc && az group deployment create -g tsthpc -n tsthpc --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"single\"},\"DnsLabelPrefix\":{\"value\":\"tsthpc\"},\"AdminUserName\":{\"value\":\"azurehpcuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"openlogic\"},\"ImageOffer\":{\"value\":\"CentOS-HPC\"},\"ImageSku\":{\"value\":\"7.4\"},\"schedulerpbsORTorqueORslurm\":{\"value\":\"slurm\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_H16R\"},\"WorkerNodeCount\":{\"value\": 0},\"NumDataDisks\":{\"value\":\"32\"}}" --debug
  • HPC cluster (each node H16R) with Slurm and OMS, head login user "azurehpcuser" and internal user "hpcgpu"; minimum 1 head and 1 worker. Replace the SshPublicKey, oMSWorkSpaceId and oMSWorkSpaceKey values below with your own:

     bash-4.3# az group create -l southcentralus -n tsthpc && az group deployment create -g tsthpc -n tsthpc --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"cluster\"},\"DnsLabelPrefix\":{\"value\":\"tsthpc\"},\"AdminUserName\":{\"value\":\"azurehpcuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"openlogic\"},\"ImageOffer\":{\"value\":\"CentOS-HPC\"},\"ImageSku\":{\"value\":\"7.4\"},\"schedulerpbsORTorqueORslurm\":{\"value\":\"slurm\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_H16R\"},\"WorkerNodeCount\":{\"value\": 1},\"NumDataDisks\":{\"value\":\"32\"},\"oMSWorkSpaceId\":{\"value\": \"xxxxxxxxxx\"},\"oMSWorkSpaceKey\":{\"value\": \"xxxxxxxxx\"}}" --debug
  • HPC single H16R with Slurm and OMS, login user "azurehpcuser" and internal user "hpcgpu". Replace the SshPublicKey, oMSWorkSpaceId and oMSWorkSpaceKey values below with your own:

     bash-4.3# az group create -l southcentralus -n tsthpc && az group deployment create -g tsthpc -n tsthpc --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"single\"},\"DnsLabelPrefix\":{\"value\":\"tsthpc\"},\"AdminUserName\":{\"value\":\"azurehpcuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"openlogic\"},\"ImageOffer\":{\"value\":\"CentOS-HPC\"},\"ImageSku\":{\"value\":\"7.4\"},\"schedulerpbsORTorqueORslurm\":{\"value\":\"slurm\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_H16R\"},\"WorkerNodeCount\":{\"value\": 0},\"NumDataDisks\":{\"value\":\"32\"},\"oMSWorkSpaceId\":{\"value\": \"xxxxxxxxxx\"},\"oMSWorkSpaceKey\":{\"value\": \"xxxxxxxxx\"}}" --debug
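
The inline, backslash-escaped JSON in the commands above is easy to mistype. As a minimal sketch, the same parameters could be written to a file and passed with the `@file` form of `--parameters`. The function name `write_params` and the file name `hpc-params.json` are illustrative and not part of this repository, and the sketch assumes your Azure CLI version accepts a bare parameters object from a file the same way it accepts one inline. The deployment command is only printed, not run.

```shell
#!/usr/bin/env bash
# Sketch: keep the template parameters in a file instead of escaping JSON inline.
# write_params and hpc-params.json are hypothetical names used for illustration.

write_params() {
  # $1 = output file, $2 = single|cluster, $3 = worker count,
  # $4 = scheduler (pbspro|Torque|slurm), $5 = SSH public key
  cat > "$1" <<EOF
{
  "singleOrCluster": {"value": "$2"},
  "DnsLabelPrefix": {"value": "tsthpc"},
  "AdminUserName": {"value": "azurehpcuser"},
  "SshPublicKey": {"value": "$5"},
  "ImagePublisher": {"value": "openlogic"},
  "ImageOffer": {"value": "CentOS-HPC"},
  "ImageSku": {"value": "7.4"},
  "schedulerpbsORTorqueORslurm": {"value": "$4"},
  "HeadandWorkerNodeSize": {"value": "Standard_H16R"},
  "WorkerNodeCount": {"value": $3},
  "NumDataDisks": {"value": "32"}
}
EOF
}

# Same values as the PBSPro cluster example; XXXXXX stands for your key.
write_params hpc-params.json cluster 1 pbspro "XXXXXX"

# Print the deployment command for review rather than running it.
echo az group deployment create -g tsthpc -n tsthpc \
  --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json \
  --parameters @hpc-params.json
```

Switching between the single and cluster topologies then only changes the second and third arguments of the call.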

GPU Computes

Ubuntu 16.04-LTS
  • Ubuntu GPU cluster (each node NC24) with no scheduler and no OMS, head login user "azuregpuuser" and internal user "gpuclususer"; minimum 1 head and 1 worker. Replace the SshPublicKey value below with your SSH public key:

    bash-4.3# az group create -l eastus -n tstgpu4computes && az group deployment create -g tstgpu4computes -n tstgpu4computes --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"cluster\"},\"DnsLabelPrefix\":{\"value\":\"tstgpu4computes\"},\"AdminUserName\":{\"value\":\"azuregpuuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"Canonical\"},\"ImageOffer\":{\"value\":\"UbuntuServer\"},\"ImageSku\":{\"value\":\"16.04-LTS\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_NC24\"},\"WorkerNodeCount\":{\"value\": 1},\"GpuHpcUserName\":{\"value\":\"gpuclususer\"},\"NumDataDisks\":{\"value\":\"32\"}}" --debug 
  • Ubuntu single NC24 with no scheduler and no OMS, login user "azuregpuuser" and internal user "gpuuser". Replace the SshPublicKey value below with your SSH public key:

     bash-4.3# az group create -l eastus -n tstgpu4computes && az group deployment create -g tstgpu4computes -n tstgpu4computes --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"single\"},\"DnsLabelPrefix\":{\"value\":\"tstgpu4computes\"},\"AdminUserName\":{\"value\":\"azuregpuuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"Canonical\"},\"ImageOffer\":{\"value\":\"UbuntuServer\"},\"ImageSku\":{\"value\":\"16.04-LTS\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_NC24\"},\"WorkerNodeCount\":{\"value\": 0},\"GpuHpcUserName\":{\"value\":\"gpuuser\"},\"NumDataDisks\":{\"value\":\"32\"}}" --debug
  • Ubuntu GPU cluster (each node NC24) with no scheduler and with OMS, head login user "azuregpuuser" and internal user "gpuclususer"; minimum 1 head and 1 worker. Replace the SshPublicKey, oMSWorkSpaceId and oMSWorkSpaceKey values below with your own:

     bash-4.3# az group create -l eastus -n tstgpu4computes && az group deployment create -g tstgpu4computes -n tstgpu4computes --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"cluster\"},\"DnsLabelPrefix\":{\"value\":\"tstgpu4computes\"},\"AdminUserName\":{\"value\":\"azuregpuuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"Canonical\"},\"ImageOffer\":{\"value\":\"UbuntuServer\"},\"ImageSku\":{\"value\":\"16.04-LTS\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_NC24\"},\"WorkerNodeCount\":{\"value\": 1},\"GpuHpcUserName\":{\"value\":\"gpuclususer\"},\"NumDataDisks\":{\"value\":\"32\"},\"oMSWorkSpaceId\":{\"value\": \"xxxxxxxxxx\"},\"oMSWorkSpaceKey\":{\"value\": \"xxxxxxxxx\"}}" --debug
  • Ubuntu single NC24 with no scheduler and with OMS, login user "azuregpuuser" and internal user "gpuuser". Replace the SshPublicKey, oMSWorkSpaceId and oMSWorkSpaceKey values below with your own:

     bash-4.3# az group create -l eastus -n tstgpu4computes && az group deployment create -g tstgpu4computes -n tstgpu4computes --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"single\"},\"DnsLabelPrefix\":{\"value\":\"tstgpu4computes\"},\"AdminUserName\":{\"value\":\"azuregpuuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"Canonical\"},\"ImageOffer\":{\"value\":\"UbuntuServer\"},\"ImageSku\":{\"value\":\"16.04-LTS\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_NC24\"},\"WorkerNodeCount\":{\"value\": 0},\"GpuHpcUserName\":{\"value\":\"gpuuser\"},\"NumDataDisks\":{\"value\":\"32\"},\"oMSWorkSpaceId\":{\"value\": \"xxxxxxxxxx\"},\"oMSWorkSpaceKey\":{\"value\": \"xxxxxxxxx\"}}" --debug
CentOS 7.3
  • CentOS GPU cluster (each node NC24) with no scheduler and no OMS, head login user "azuregpuuser" and internal user "gpuclususer"; minimum 1 head and 1 worker. Replace the SshPublicKey value below with your SSH public key:

    bash-4.3# az group create -l eastus -n tstgpu4computes && az group deployment create -g tstgpu4computes -n tstgpu4computes --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"cluster\"},\"DnsLabelPrefix\":{\"value\":\"tstgpu4computes\"},\"AdminUserName\":{\"value\":\"azuregpuuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"openlogic\"},\"ImageOffer\":{\"value\":\"CentOS\"},\"ImageSku\":{\"value\":\"7.3\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_NC24\"},\"WorkerNodeCount\":{\"value\": 1},\"GpuHpcUserName\":{\"value\":\"gpuclususer\"},\"NumDataDisks\":{\"value\":\"32\"}}" --debug 
  • CentOS single NC24 with no scheduler and no OMS, login user "azuregpuuser" and internal user "gpuuser". Replace the SshPublicKey value below with your SSH public key:

     bash-4.3# az group create -l eastus -n tstgpu4computes && az group deployment create -g tstgpu4computes -n tstgpu4computes --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"single\"},\"DnsLabelPrefix\":{\"value\":\"tstgpu4computes\"},\"AdminUserName\":{\"value\":\"azuregpuuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"openlogic\"},\"ImageOffer\":{\"value\":\"CentOS\"},\"ImageSku\":{\"value\":\"7.3\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_NC24\"},\"WorkerNodeCount\":{\"value\": 0},\"GpuHpcUserName\":{\"value\":\"gpuuser\"},\"NumDataDisks\":{\"value\":\"32\"}}" --debug
  • CentOS GPU cluster (each node NC24) with no scheduler and with OMS, head login user "azuregpuuser" and internal user "gpuclususer"; minimum 1 head and 1 worker. Replace the SshPublicKey, oMSWorkSpaceId and oMSWorkSpaceKey values below with your own:

     bash-4.3# az group create -l eastus -n tstgpu4computes && az group deployment create -g tstgpu4computes -n tstgpu4computes --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"cluster\"},\"DnsLabelPrefix\":{\"value\":\"tstgpu4computes\"},\"AdminUserName\":{\"value\":\"azuregpuuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"openlogic\"},\"ImageOffer\":{\"value\":\"CentOS\"},\"ImageSku\":{\"value\":\"7.3\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_NC24\"},\"WorkerNodeCount\":{\"value\": 1},\"GpuHpcUserName\":{\"value\":\"gpuclususer\"},\"NumDataDisks\":{\"value\":\"32\"},\"oMSWorkSpaceId\":{\"value\": \"xxxxxxxxxx\"},\"oMSWorkSpaceKey\":{\"value\": \"xxxxxxxxx\"}}" --debug
  • CentOS single NC24 with no scheduler and with OMS, login user "azuregpuuser" and internal user "gpuuser". Replace the SshPublicKey, oMSWorkSpaceId and oMSWorkSpaceKey values below with your own:

     bash-4.3# az group create -l eastus -n tstgpu4computes && az group deployment create -g tstgpu4computes -n tstgpu4computes --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"single\"},\"DnsLabelPrefix\":{\"value\":\"tstgpu4computes\"},\"AdminUserName\":{\"value\":\"azuregpuuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"openlogic\"},\"ImageOffer\":{\"value\":\"CentOS\"},\"ImageSku\":{\"value\":\"7.3\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_NC24\"},\"WorkerNodeCount\":{\"value\": 0},\"GpuHpcUserName\":{\"value\":\"gpuuser\"},\"NumDataDisks\":{\"value\":\"32\"},\"oMSWorkSpaceId\":{\"value\": \"xxxxxxxxxx\"},\"oMSWorkSpaceKey\":{\"value\": \"xxxxxxxxx\"}}" --debug

Jumpboxes

A Linux secure shell (SSH) jumpbox is worth building. A central host from which you can quickly “jump” to any cluster saves a lot of time. It also helps speed up repetitive chores during testing and deployment, especially in a cloud-only environment.

This repository can be used to create Linux jumpboxes, preferably on Ubuntu 16.04-LTS or CentOS 7.3, depending on your distro of choice.

For Linux, it is a good idea to review the Azure virtual machines you can use to run your Linux apps and workloads.

A jumpbox deployment can also be turned into a cluster by replacing single with cluster in the template parameters and setting WorkerNodeCount to at least 1.

  • The following CLI example creates a CentOS 7.3 VM with DNS name centospublic in the West Europe region, of size Standard F2s (2 cores, 4 GB memory), with local SSD and 2 TB available on /data. "azureuser" is the admin login user with sudo privileges, and the VM name is "centos73". "azure" is the internal user, again with sudo privileges and access to /data. An SSH key for "azureuser" must be provided during deployment. Optionally, an OMS workspace ID and key may be provided. The VM allows non-privileged use of the latest Docker CE for CentOS and the latest releases of docker-compose and docker-machine.
az group create -l westeurope -n centospublicwe && az group deployment create -g centospublicwe -n centospublicwe --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"single\"},\"DnsLabelPrefix\":{\"value\":\"centospublic\"},\"AdminUserName\":{\"value\":\"azureuser\"},\"SshPublicKey\":{\"value\":\"XXXXX\"},\"ImagePublisher\":{\"value\":\"openlogic\"},\"ImageOffer\":{\"value\":\"CentOS\"},\"ImageSku\":{\"value\":\"7.3\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_F2s\"},\"WorkerNodeCount\":{\"value\": 0},\"GpuHpcUserName\":{\"value\":\"azure\"},\"MasterVMName\":{\"value\":\"centos73\"},\"NumDataDisks\":{\"value\":\"2\"}}" --debug
  • The following CLI example creates an Ubuntu 16.04-LTS VM with DNS name ubuntupublic in the West Europe region, of size Standard F2s (2 cores, 4 GB memory), with local SSD and 2 TB available on /data. "azureuser" is the admin login user with sudo privileges, and the VM name is "ubuntu1604". "azure" is the internal user, again with sudo privileges and access to /data. An SSH key for "azureuser" must be provided during deployment. Optionally, an OMS workspace ID and key may be provided. The VM allows non-privileged use of the latest Docker CE for Ubuntu 16.04-LTS and the latest releases of docker-compose and docker-machine.
az group create -l westeurope -n ubuntupublicwe && az group deployment create -g ubuntupublicwe -n ubuntupublicwe --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"single\"},\"DnsLabelPrefix\":{\"value\":\"ubuntupublic\"},\"AdminUserName\":{\"value\":\"azureuser\"},\"SshPublicKey\":{\"value\":\"XXXXX\"},\"ImagePublisher\":{\"value\":\"Canonical\"},\"ImageOffer\":{\"value\":\"UbuntuServer\"},\"ImageSku\":{\"value\":\"16.04-LTS\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_F2s\"},\"WorkerNodeCount\":{\"value\": 0},\"GpuHpcUserName\":{\"value\":\"azure\"},\"MasterVMName\":{\"value\":\"ubuntu1604\"},\"NumDataDisks\":{\"value\":\"2\"}}" --debug
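
Once a jumpbox is up, hopping through it to a cluster node might look like the sketch below. It assumes the public name follows the usual `<label>.<region>.cloudapp.azure.com` pattern with the DnsLabelPrefix used as the label unchanged (the template may decorate it), that the client has OpenSSH 7.3 or later for the `-J` option, and that `10.0.0.4` is a reachable private address. That address and both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: build the ssh command that hops through the jumpbox.
# jumpbox_fqdn, jump_cmd and the 10.0.0.4 target are illustrative assumptions.

jumpbox_fqdn() {
  # $1 = DnsLabelPrefix, $2 = Azure region
  echo "$1.$2.cloudapp.azure.com"
}

jump_cmd() {
  # $1 = jumpbox user, $2 = jumpbox FQDN, $3 = target user@host
  echo "ssh -J $1@$2 $3"
}

fqdn=$(jumpbox_fqdn centospublic westeurope)

# Print the command for review; run it yourself once the names are confirmed.
jump_cmd azureuser "$fqdn" azurehpcuser@10.0.0.4
```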

GPUs for Compute

Azure GPUs

  • The entry point is currently valid for the stated SKU only in the "East-US" and "Southcentral-US" regions. SKU availability per region is here.
  • GPU enablement requires an approved quota for the SKU in the subscription being used. Please see this link for instructions on requesting a core quota increase.
  • NVIDIA drivers install unattended on both Ubuntu 16.04-LTS and CentOS 7.3, for cluster as well as single-node deployments.
  • The latest secure install of CUDA is available (auto-updated to 9.0), with samples on RAID0 (/data/data by default) in NVIDIA_CUDA-9.0_Samples for Ubuntu and in /usr/local/cuda-9.0/samples for CentOS 7.3.
  • All CUDA samples can be run across the cluster to test the latest CUDA and cuDNN.
  • Latest Docker CE for both Ubuntu and CentOS, configurable on the head and all compute nodes. Default is 17.09.x CE.
  • Latest docker-compose, configurable on the head and compute nodes.
  • Latest docker-machine, configurable.
  • Both the new and the old Azure CLI are installed on the head and compute nodes.
  • Disks are auto-mounted at /'parameter'/data.
  • NFS4 is on.
  • Strict SSH public key authentication is enabled.
  • Nodes that share the public RSA key can be used as direct jumpboxes as azuregpuuser@DNS.
  • Between head and compute nodes, switch user with sudo su - --gpuclususer-- and then ssh directly.
  • Internal firewall is off.
  • For M60 usage for visualizations using NVIDIA GRID 4.2 for Windows Server 2016, please visit aka.ms/accessgpu
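
After provisioning, a quick way to confirm that the tools listed above landed on a node is to check for them on PATH. The following is a minimal sketch, not part of azuredeploy.sh, and `check_tool` is a hypothetical helper name. Note that nvcc may be reported as missing until the CUDA bin directory has been added to PATH.

```shell
#!/usr/bin/env bash
# Sketch: report which of the tools mentioned in the list above are on PATH.
# check_tool is an illustrative helper, not something this repository ships.

check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: missing"
  fi
}

for tool in nvidia-smi nvcc docker docker-compose docker-machine az; do
  check_tool "$tool"
done
```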

Try CUDA Samples and GROMACS

  • The latest secure install of CUDA is available (auto-updated to 9.0), with samples on RAID0 (/data/data by default) in NVIDIA_CUDA-9.0_Samples for Ubuntu and in /usr/local/cuda-9.0/samples for CentOS 7.3. After successful provisioning, running make within each sample directory is sufficient.
  • Securely install GROMACS for GPU usage with the commands below.
  • For both GPU and MPI usage, add the extra cmake option -DGMX_MPI=on.
	yum install -y cmake        # CentOS
	apt-get install -y cmake    # Ubuntu

Then,

	cd /opt && \
	export GROMACS_DOWNLOAD_SUM=e9e3a41bd123b52fbcc6b32d09f8202b && export GROMACS_PKG_VERSION=2016.3 && curl -o gromacs-$GROMACS_PKG_VERSION.tar.gz -fsSL http://ftp.gromacs.org/pub/gromacs/gromacs-$GROMACS_PKG_VERSION.tar.gz && \
	echo "$GROMACS_DOWNLOAD_SUM  gromacs-$GROMACS_PKG_VERSION.tar.gz" | md5sum -c --strict - && \
	tar xfz gromacs-$GROMACS_PKG_VERSION.tar.gz && \
	cd gromacs-$GROMACS_PKG_VERSION && \
	mkdir build-gromacs && \
	cd build-gromacs && \
	cmake .. -DGMX_BUILD_OWN_FFTW=ON -DGMX_GPU=ON -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-8.0 && \
	make && \
	make install && \
	export PATH=/usr/local/gromacs/bin:$PATH

After the above, gmx is available. For further reference, please see the latest GROMACS manual.
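
The MPI variant mentioned in the list above differs only in the cmake step. As a small sketch, the helper below (a hypothetical name, `gromacs_cmake_cmd`) prints the cmake line for either build so it can be pasted into the sequence above. With MPI enabled, GROMACS normally names the binary gmx_mpi rather than gmx, so check which one your build produced.

```shell
#!/usr/bin/env bash
# Sketch: print the cmake line for a GPU-only or a GPU+MPI GROMACS build.
# gromacs_cmake_cmd is an illustrative helper; the flags come from the text above.

gromacs_cmake_cmd() {
  # $1 = "mpi" to add the MPI option; anything else gives the GPU-only build
  local flags="-DGMX_BUILD_OWN_FFTW=ON -DGMX_GPU=ON -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-8.0"
  if [ "$1" = "mpi" ]; then
    flags="$flags -DGMX_MPI=on"
  fi
  echo "cmake .. $flags"
}

gromacs_cmake_cmd gpu
gromacs_cmake_cmd mpi
```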

Unattended NVIDIA Tesla Driver Silent Install without further reboot during provisioning via this repo

The NVIDIA Tesla driver is installed silently, without a further reboot, via azuredeploy.sh in this repository for a cluster or a single node as follows:

Currently, this is not required when using the secure cuda-repo-ubuntu1604_8.0.61-1_amd64.deb for Azure NC VMs running Ubuntu Server 16.04 LTS.

It is required for the NVIDIA driver with DKMS (Dynamic Kernel Module Support), so that the driver load survives kernel updates.

Ubuntu 16.04-LTS

	service lightdm stop 
	wget http://us.download.nvidia.com/XFree86/Linux-x86_64/375.39/NVIDIA-Linux-x86_64-375.39.run
	apt-get install -y linux-image-virtual
	apt-get install -y linux-virtual-lts-xenial
	apt-get install -y linux-tools-virtual-lts-xenial linux-cloud-tools-virtual-lts-xenial
	apt-get install -y linux-tools-virtual linux-cloud-tools-virtual
	DEBIAN_FRONTEND=noninteractive apt-mark hold walinuxagent
	DEBIAN_FRONTEND=noninteractive apt-get update -y
	DEBIAN_FRONTEND=noninteractive apt-get install -y build-essential gcc gcc-multilib dkms g++ make binutils linux-headers-`uname -r` linux-headers-4.4.0-70-generic
	chmod +x NVIDIA-Linux-x86_64-375.39.run
	./NVIDIA-Linux-x86_64-375.39.run  --silent --dkms
	DEBIAN_FRONTEND=noninteractive update-initramfs -u

CentOS 7.3

	wget http://us.download.nvidia.com/XFree86/Linux-x86_64/375.39/NVIDIA-Linux-x86_64-375.39.run
	yum clean all
	yum update -y  dkms
	yum install -y gcc make binutils gcc-c++ kernel-devel kernel-headers --disableexcludes=all
	yum -y upgrade kernel kernel-devel
	chmod +x NVIDIA-Linux-x86_64-375.39.run
	cat >>~/install_nvidiarun.sh <<EOF
	cd /var/lib/waagent/custom-script/download/0 && \
	./NVIDIA-Linux-x86_64-375.39.run --silent --dkms --install-libglvnd && \
	sed -i '$ d' /etc/rc.d/rc.local && \
	chmod -x /etc/rc.d/rc.local
	rm -rf ~/install_nvidiarun.sh
	EOF
	chmod +x ~/install_nvidiarun.sh
	echo -ne "~/install_nvidiarun.sh" >> /etc/rc.d/rc.local
	chmod +x /etc/rc.d/rc.local

Installation of NVIDIA CUDA Toolkit during provisioning via this repo

The NVIDIA CUDA Toolkit is installed silently and securely via azuredeploy.sh in this repository for a cluster or a single node.

Ubuntu 16.04-LTS

CUDA_REPO_PKG=cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
DEBIAN_FRONTEND=noninteractive apt-mark hold walinuxagent
export CUDA_DOWNLOAD_SUM=1f4dffe1f79061827c807e0266568731 && export CUDA_PKG_VERSION=8-0 && curl -o cuda-repo.deb -fsSL http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/${CUDA_REPO_PKG} && \
    echo "$CUDA_DOWNLOAD_SUM  cuda-repo.deb" | md5sum -c --strict - && \
    dpkg -i cuda-repo.deb && \
    rm cuda-repo.deb && \
    apt-get update -y && apt-get install -y cuda && \
    apt-get install -y nvidia-cuda-toolkit && \
export LIBRARY_PATH=/usr/local/cuda-8.0/lib64/:${LIBRARY_PATH}  && export LIBRARY_PATH=/usr/local/cuda-8.0/lib64/stubs:${LIBRARY_PATH} && \
export PATH=/usr/local/cuda-8.0/bin:${PATH}

CentOS 7.3

  wget http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-repo-rhel7-8.0.61-1.x86_64.rpm
  rpm -i cuda-repo-rhel7-8.0.61-1.x86_64.rpm
  yum clean all
  yum install -y cuda

CUDA Samples Install

Ubuntu 16.04-LTS

CUDA samples are installed via azuredeploy.sh in this repository, for a cluster or a single node, into the parameterized RAID0 location as follows for Ubuntu:

export SHARE_DATA="/data/data"
export SAMPLES_USER="gpuuser"
su -c "/usr/local/cuda-8.0/bin/./cuda-install-samples-8.0.sh $SHARE_DATA" $SAMPLES_USER

CentOS 7.3

The samples are in /usr/local/cuda-8.0/samples for CentOS 7.3.

  • After successful provisioning, running make within each sample directory is sufficient.
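
To find the directories where make can be run, something like the sketch below could be used. The helper name `list_sample_dirs` is hypothetical, and the samples root shown (NVIDIA_CUDA-8.0_Samples under the SHARE_DATA location) is an assumption based on the install command above. Adjust it to the CUDA version and path actually present on the node.

```shell
#!/usr/bin/env bash
# Sketch: list every sample directory that has its own Makefile.
# list_sample_dirs and the samples_root default are illustrative assumptions.

list_sample_dirs() {
  # Print each directory below $1 (excluding $1 itself) that contains a Makefile.
  find "$1" -mindepth 2 -name Makefile -exec dirname {} \; | sort
}

samples_root="${SHARE_DATA:-/data/data}/NVIDIA_CUDA-8.0_Samples"
if [ -d "$samples_root" ]; then
  # Show the first few buildable samples; cd into one and run make.
  list_sample_dirs "$samples_root" | head -n 5
fi
```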

Secure installation of cuDNN during provisioning via this repo

Both Ubuntu 16.04-LTS and CentOS 7.3

The NVIDIA CUDA® Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. cuDNN provides highly tuned implementations for standard routines such as forward and backward convolution, pooling, normalization, and activation layers. cuDNN is part of the NVIDIA Deep Learning SDK and is installed silently as follows via azuredeploy.sh in this repository for a cluster or a single node.

   export CUDNN_DOWNLOAD_SUM=a87cb2df2e5e7cc0a05e266734e679ee1a2fadad6f06af82a76ed81a23b102c8 && curl -fsSL http://developer.download.nvidia.com/compute/redist/cudnn/v5.1/cudnn-8.0-linux-x64-v5.1.tgz -O && \
   echo "$CUDNN_DOWNLOAD_SUM  cudnn-8.0-linux-x64-v5.1.tgz" | sha256sum -c --strict - && \
   tar -xzf cudnn-8.0-linux-x64-v5.1.tgz -C /usr/local && \
   rm cudnn-8.0-linux-x64-v5.1.tgz && \
   ldconfig
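
To confirm which cuDNN version ended up on a node, the version macros in the installed header can be read back. This is a minimal sketch with a hypothetical helper name, `cudnn_version`. It assumes the archive unpacked its header to /usr/local/cuda/include/cudnn.h, which is the usual layout of this tarball when extracted with -C /usr/local.

```shell
#!/usr/bin/env bash
# Sketch: read the cuDNN version from the CUDNN_* macros in cudnn.h.
# cudnn_version and the header path are illustrative assumptions.

cudnn_version() {
  # $1 = path to cudnn.h; prints MAJOR.MINOR.PATCHLEVEL
  awk '/^#define CUDNN_MAJOR /      {maj=$3}
       /^#define CUDNN_MINOR /      {min=$3}
       /^#define CUDNN_PATCHLEVEL / {pat=$3}
       END {print maj "." min "." pat}' "$1"
}

header=/usr/local/cuda/include/cudnn.h
if [ -f "$header" ]; then
  cudnn_version "$header"
fi
```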

nvidia-docker usage

Binary installation of nvidia-docker, with a parameterized version, is automated for both Ubuntu 16.04-LTS and CentOS 7.3.

The latest NVIDIA 2.0 runtime for Docker is available and runs automatically after cluster provisioning.

DIGITS with docker runtime nvidia 2 and tensorboard within DIGITS

The latest DIGITS is available at

http://<Cluster_Public_IP>:5000

and TensorBoard at

http://<Cluster_Public_IP>:6006

Notes on nvidia-docker usage

In addition to the CUDA Toolkit installed during provisioning via this repo, nvidia-docker can be used to run a dockerized CUDA Toolkit, as in the test and picture below. This opens up the possibility of using "py" and "gpu" tagged images of cntk, tensorflow, theano and more, available as nightly builds from Docker Hub with Jupyter notebooks. The latest gitlab.com/nvidia cuDNN RCs can be used for testing.

sudo systemctl start nvidia-docker

nvidia-docker run --rm nvidia/cuda nvidia-smi

nvidiadocker

More information is available at https://github.com/NVIDIA/nvidia-docker/wiki

License Agreements

By provisioning via this repository, you agree to the terms of the license agreements for NVIDIA software installed silently.

CUDA Toolkit

To view the license for the CUDA Toolkit, click here

CUDA Deep Neural Network library (cuDNN)

To view the license for cuDNN, click here

H-Series and A9 with schedulers

Details

  • Entry point is valid for the stated Sku, presently in specific regions. Sku availability per region is here.
  • The default quota is 8 cores per region, and a quota increase can be requested for the stated subscription. Please see this link for instructions on requesting a core quota increase.
  • This creates a configurable number of disks of configurable size for CentOS-HPC A9/H16R/H16MR, and a cluster with a configurable number of worker nodes, each with prebuilt Intel MPI and direct RDMA for the head and corresponding compute nodes.
    • For the CentOS-HPC imageOffer, the skuName(s) are 7.1.
    • The cluster scheduler can be Torque or PBSPro.
    • Only Intel MPI.
    • Latest Docker CE for both Ubuntu and CentOS, configurable on each head and all compute nodes (default is 17.03 CE).
    • Latest docker-compose, configurable on each head and compute node.
    • Latest docker-machine, configurable.
    • The latest new and old Azure CLIs are on both head and compute nodes.
    • Disk auto mounting is at /'parameter'/data.
    • NFS4 is on.
    • Strict ssh public key authentication is enabled.
    • Nodes that share the admin user's public id_rsa.pub key can be used as direct jump boxes as azurehpcuser@DNS.
    • Head and compute nodes work via sudo su - --hpc user-- and then direct ssh.
    • Check the Microsoft drivers via rpm -qa msft* or rpm -qa microsoft*
    • Internal firewalld is off.
    • Disabling WALinuxAgent and manual workarounds are presently required ONLY for NC24R on CentOS 7.3.

mpirun

All paths are set automatically for the key 'default' provided users such as azureuser/hpc. For root, su - root is required.

source /opt/intel/impi/5.1.3.223/bin64/mpivars.sh

mpirun -ppn 1 -n 2 -hosts headN,compn0 -env I_MPI_FABRICS=shm:dapl -env I_MPI_DAPL_PROVIDER=ofa-v2-ib0 -env I_MPI_DYNAMIC_CONNECTION=0 hostname (Cluster Check)

mpirun -hosts headN,compn0 -ppn --processes per node in number-- -n --number of consecutive processes-- -env I_MPI_FABRICS=dapl -env I_MPI_DAPL_PROVIDER=ofa-v2-ib0 -env I_MPI_DYNAMIC_CONNECTION=0 IMB-MPI1 pingpong (Base Pingpong stats)
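
As a rough illustration of how the -ppn and -n values relate, the sketch below derives -n as (number of hosts) × (processes per node), which matches the cluster check above (2 hosts, -ppn 1, -n 2). This assumes every listed host should be filled with the same number of ranks. The function names mpi_total_procs and build_pingpong_cmd are hypothetical helpers and not part of this repository; the second one only prints the command line and does not run it.

```shell
# mpi_total_procs HOSTS PPN
# HOSTS is a comma-separated list as passed to -hosts, PPN is processes per node.
# Prints the total process count, assuming every host gets PPN ranks.
mpi_total_procs() {
    local hosts="$1" ppn="$2" count
    # Split on commas and count the non-empty entries.
    count=$(printf '%s\n' "$hosts" | tr ',' '\n' | grep -c .)
    echo $(( count * ppn ))
}

# build_pingpong_cmd HOSTS PPN
# Prints (does not execute) the pingpong command line with the derived -n value.
build_pingpong_cmd() {
    local hosts="$1" ppn="$2" n
    n=$(mpi_total_procs "$hosts" "$ppn")
    echo "mpirun -hosts ${hosts} -ppn ${ppn} -n ${n} -env I_MPI_FABRICS=dapl -env I_MPI_DAPL_PROVIDER=ofa-v2-ib0 -env I_MPI_DYNAMIC_CONNECTION=0 IMB-MPI1 pingpong"
}
```

Running build_pingpong_cmd headN,compn0 1 prints a command with -ppn 1 -n 2, which can be reviewed and then executed after sourcing mpivars.sh.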

IB

ls /sys/class/infiniband

cat /sys/class/infiniband/mlx4_0/ports/1/state

/etc/init.d/opensmd start (if required)

cat /sys/class/infiniband/mlx4_0/ports/1/rate
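
The state check above can be scripted. The sketch below assumes the sysfs state file reports the port state in the usual "<number>: <NAME>" form, for example "4: ACTIVE". The function name ib_port_is_active is a hypothetical helper and not part of this repository.

```shell
# ib_port_is_active STATE_FILE
# Hypothetical helper: returns 0 if the InfiniBand port state file reports ACTIVE,
# 1 if it reports another state, and 2 if the file cannot be read.
ib_port_is_active() {
    local state_file="$1"
    [ -r "$state_file" ] || return 2
    # The file holds a single line such as "4: ACTIVE" or "1: DOWN".
    grep -q 'ACTIVE' "$state_file"
}
```

For example, ib_port_is_active /sys/class/infiniband/mlx4_0/ports/1/state || echo "IB port is not active" flags a port that is down or missing before an MPI run is attempted.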

Torque and PBSPro for CentOS-HPC Skus

All compute nodes automatically run pbs_mom, and the head node runs pbs_mom and pbs_server, for the latest Torque or PBSPro, built from source from their respective master repos at cluster provisioning time. No post-installation tasks are required after a successful cluster deployment, unless np is to be increased from 1.

Check for Torque or PBSPro via pbsnodes -a
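
For a quick health summary, the pbsnodes -a output can be filtered for nodes whose state is free. The sketch below assumes the per-node attribute lines have the form "state = free", as printed by both Torque and PBSPro; nodes in any other or combined state are not counted. The function name count_free_nodes is a hypothetical helper and not part of this repository.

```shell
# count_free_nodes
# Hypothetical helper: reads `pbsnodes -a` style output on stdin and prints
# the number of nodes whose state line is exactly "state = free".
count_free_nodes() {
    awk '$1 == "state" && $2 == "=" && $3 == "free" { n++ } END { print n + 0 }'
}
```

Usage on the head node would be pbsnodes -a | count_free_nodes, and the result can be compared against the expected worker node count.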

PBS Pro License

All paths are set automatically for the key 'default' users such as azureuser/hpc/root. For root, su - root is required.

SLURM LICENSE AGREEMENT - GPL v2

To check SLURM info, run: sinfo -N -l

Since this is Intel MPI, the preferred usage is mpirun with sbatch.

Optional usage with OMS

OMS setup is optional: the OMS Workspace Id and OMS Workspace Key can either be kept blank or populated after the steps below.

Create a free account for MS Azure Operations Management Suite with workspaceName

  • Provide a Name for the OMS Workspace.
  • Link your Subscription to the OMS Portal.
  • Depending upon the region, a Resource Group would be created in the Subscription like 'mms-weu' for 'West Europe' and the named OMS Workspace with portal details etc. would be created in the Resource Group.
  • Logon to the OMS Workspace and Go to -> Settings -> 'Connected Sources' -> 'Linux Servers' -> Obtain the Workspace ID like ba1e3f33-648d-40a1-9c70-3d8920834669 and the 'Primary and/or Secondary Key' like xkifyDr2s4L964a/Skq58ItA/M1aMnmumxmgdYliYcC2IPHBPphJgmPQrKsukSXGWtbrgkV2j1nHmU0j8I8vVQ==
  • Add the 'Agent Health', 'Activity Log Analytics' and 'Container' solutions from the 'Solutions Gallery' of the OMS Portal of the workspace.
  • While deploying the template, only the Workspace ID and the Key need to be supplied; everything is then registered, including all containers on any nodes of the cluster(s).
  • One can then log in to https://OMSWorkspaceName.portal.mms.microsoft.com, check all containers running for single or cluster topologies, use Log Analytics and, if required, perform automated backups using the corresponding Solutions for OMS.
  • Further Solutions, such as Backup, can be added from the OMS Workspace.
  • OMS usage is Sku/provider/imageoffer agnostic, since a Dockerized OMS agent on the latest tag is present on all nodes after deployment via this repository.
  • Or if the OMS Workspace and the Machines are in the same subscription, one can just connect the Linux Node sources manually to the OMS Workspace as Data Sources.
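
Before passing the Workspace ID to a deployment, its shape can be sanity-checked. The sketch below assumes the Workspace ID is a GUID in the 8-4-4-4-12 hexadecimal form of the example shown in the steps above; it checks the format only and does not confirm that the workspace exists. The function name is_workspace_id is a hypothetical helper and not part of this repository.

```shell
# is_workspace_id VALUE
# Hypothetical helper: returns 0 if VALUE looks like a GUID
# (8-4-4-4-12 hexadecimal groups), non-zero otherwise.
is_workspace_id() {
    printf '%s\n' "$1" | grep -Eq '^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$'
}
```

For example, is_workspace_id "$OMS_WORKSPACE_ID" || echo "Workspace ID does not look like a GUID" catches a truncated or mistyped value early, where OMS_WORKSPACE_ID is an illustrative variable name.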

Reporting bugs

Please report bugs by opening an issue in the GitHub Issue Tracker.

Patches and pull requests

Patches can be submitted as GitHub pull requests. If using GitHub please make sure your branch applies to the current master as a 'fast forward' merge (i.e. without creating a merge commit). Use the git rebase command to update your branch to the current master if necessary.

Region availability and Quotas for MS Azure Skus

  • Sku availability per region is here.
  • Please see this link for instructions on requesting a core quota increase.
  • For more information on Azure subscription and service limits, quota, and constraints, please see here.