# Comet: System Characteristics

- Total peak flops ~2.1 PF
- Dell primary integrator
    - Intel Haswell processors w/ AVX2
    - Mellanox FDR InfiniBand
- 1944 standard compute nodes (46656 cores)
    - Dual CPUs, each 12-core, 2.5 GHz
    - 128 GB DDR4 2133 MHz DRAM
    - 2\*160 GB SSDs (local disk)
- 72 GPU nodes
    - 36 nodes same as standard nodes *plus* two NVIDIA K80 cards, each with dual Kepler3 GPUs
    - 36 nodes with 2 14-core Intel Broadwell CPUs plus 4 NVIDIA P100 GPUs
- 4 large-memory nodes
    - 1.5 TB DDR4 1866 MHz DRAM
    - 4 Haswell processors/node
    - 64 cores/node
- Hybrid fat-tree topology
    - FDR (56 Gbps) InfiniBand
    - Rack-level (72 nodes, 1728 cores) full bisection bandwidth
    - 4:1 oversubscription cross-rack
- Performance Storage (Aeon)
    - 7.6 PB, 200 GB/s; Lustre
    - Scratch and Persistent Storage segments
- Durable Storage (Aeon)
    - 6 PB, 100 GB/s; Lustre
    - Automatic backups of critical data
- Home directory storage
- Gateway hosting nodes
- Virtual image repository
- 100 Gbps external connectivity to Internet2 and ESNet
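
The core counts quoted above follow from the per-node configuration. A quick arithmetic check as a shell sketch (the variable names are illustrative only):

```shell
#!/bin/bash
# Figures taken from the list above.
cores_per_cpu=12
cpus_per_node=2
standard_nodes=1944
nodes_per_rack=72

cores_per_node=$(( cores_per_cpu * cpus_per_node ))
total_cores=$(( standard_nodes * cores_per_node ))
rack_cores=$(( nodes_per_rack * cores_per_node ))

echo "cores per node: $cores_per_node"
echo "standard-node cores: $total_cores"
echo "cores per rack: $rack_cores"
```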

# Comet Network Architecture
## InfiniBand Compute, Ethernet Storage

![alt text](supercomputerarchitecture.png "Supercomputer Architecture")

# Getting Started

## System Access - Logging in

- Linux/Mac - Use available ssh clients
- ssh clients for Windows - PuTTY, Cygwin
    - http://www.chiark.greenend.org.uk/~sgtatham/putty/
- Login hosts for the SDSC Comet:
    - comet.sdsc.edu

# Logging into Comet

- Mac/Linux
    - ssh username@comet.sdsc.edu
- Windows (PuTTY)
    - Host Name is comet.sdsc.edu
    
![alt text](loggingontocomet.png "Logging onto Comet")

# Comet: Filesystems

- Lustre filesystems - Good for scalable large block I/O
    - Accessible from all compute and GPU nodes
    - /oasis/scratch/comet - 2.5PB, peak performance: 100GB/s. Good location for storing large scale scratch data during a job
    - /oasis/projects/nsf - 2.5PB, peak performance: 100GB/s. Long term storage
    - **Not good for lots of small files or small block I/O**
- SSD filesystems
    - /scratch local to each native compute node - 210GB on regular compute nodes, 285GB on GPU, large memory nodes, 1.4TB on selected compute nodes
    - SSD location is good for writing small files and temporary scratch files. Purged at the end of a job
- Home directories (/home/\$USER)
    - Source trees, binaries, and small input files
    - **Not good for large scale I/O**

# Comet: System Environment

- Modules used to manage environment for users
- Default environment:
    - \$ module li
    - Currently Loaded Modulefiles:
        - intel/2013\_sp1.2.144
        - mvapich2\_ib/2.1
        - gnutools/2.69
- Listing available modules:
    - \$ module av
    - --------------------------/opt/modulefiles/mpi/.intel--------------------------
    - intelmpi/2016.3.210(default) mvapich2\_ib/2.1(default)
    - mvapich2\_gdr/2.1(default) openmpi\_ib/1.8.4(default)
    - mvapich2\_gdr/2.2
    - ------------------------/opt/modulefiles/applications/.intel-----------------
    - ...
    - ...
- Loading modules:
    - \$ module load fftw/3.3.4
    - \$ module li
    - Currently Loaded Modulefiles:
        - intel/2013\_sp1.2.144
        - mvapich2\_ib/2.1
        - gnutools/2.69
        - fftw/3.3.4
- See what a module does:
    - \$ module show fftw/3.3.4
    - .----------------------------------------------------------------
    - /opt/modulefiles/applications/.intel/fftw/3.3.4:
    - module-whatis fftw
    - module-whatis Version: 3.3.4
    - ...
    - ...
- \$ echo \$PATH
    - /opt/fftw/3.3.4/intel/mvapich2\_ib/bin:/share/apps/compute...
- \$ echo \$FFTWHOME
    - /opt/fftw/3.3.4/intel/mvapich2\_ib
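
Judging from the `echo` output above, loading the fftw module amounts to environment changes such as prepending a `bin` directory to `PATH` and setting `FFTWHOME`. A rough sketch of that effect in plain shell, assuming these are the only two changes (the real modulefile may set more variables; `module show` lists them):

```shell
#!/bin/bash
# Values taken from the echo output above; everything else is an assumption.
FFTWHOME=/opt/fftw/3.3.4/intel/mvapich2_ib
PATH=$FFTWHOME/bin:$PATH
export FFTWHOME PATH

# Show the install root and the first entry now on the search path.
echo "$FFTWHOME"
echo "${PATH%%:*}"
```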

# Parallel Programming

- Comet supports MPI, OpenMP, and Pthreads for parallel programming. Hybrid modes are possible
- GPU nodes support CUDA, OpenACC
- MPI
    - Default: mvapich2_ib/2.1
    - Other options: openmpi_ib/1.8.4 (and 1.10.2), Intel MPI
    - mvapich2_gdr: GPU direct enabled version
- OpenMP: All compilers (GNU, Intel, PGI) have OpenMP flags
- Default Intel Compiler: intel/2013_sp1.2.144; *Versions 2015.2.164 and 2016.3.210 available*

# Running Jobs on Comet

- Important Note: **Do not run on the login nodes - even for simple tests**
- All runs must be via the Slurm scheduling infrastructure
    - Interactive Jobs: Use **srun** command:
        - srun --pty --nodes=1 --ntasks-per-node=24 -p debug -t 00:30:00 --wait 0 /bin/bash
    - Batch Jobs: Submit batch scripts from the login nodes. Can choose:
        - Partition (details on upcoming cell)
        - Time limit for the run (maximum of 48 hours)
        - Number of nodes, tasks per node
        - Memory requirements (if any)
        - Job name, output file location
        - Email info, configuration

# Slurm Commands

![alt text](slurmtable.png "Slurm Table")

- The partition is specified using the -p option in the batch script. For example:
    - #SBATCH -p gpu
- Submit jobs using the **sbatch** command:
    - \$ sbatch Localscratch-slurm.sb
    - Submitted batch job 8718049
- Check job status using the **squeue** command:
    - \$ squeue -u \$USER
    - JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
    - 8718049 compute localscr mahidhar PD 0:00 1 (Priority)
- Once the job is running:
    - \$ squeue -u \$USER
    - JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
    - 8718064 debug localscr mahidhar R 0:02 1 comet-14-01
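
When scripting around these commands, it can help to capture the job ID from the `sbatch` reply. A small sketch using the sample reply shown above; the variable names are illustrative:

```shell
#!/bin/bash
# Sample sbatch reply copied from above. In a real session it would come from
# something like: submit_output=$(sbatch Localscratch-slurm.sb)
submit_output="Submitted batch job 8718049"

# Strip everything up to the last space, leaving the numeric job ID.
job_id=${submit_output##* }
echo "$job_id"

# The ID can then be passed to squeue, e.g.: squeue -j "$job_id"
```

`sbatch --parsable` is an alternative that prints the job ID without the surrounding text.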

# Comet Compute Nodes
## 2-Socket (Total 24 cores) Intel Haswell Processors

- Hands on Examples Using:
    - MPI
    - OpenMP
    - HYBRID
    - Local scratch

# Comet - Compiling/Running Jobs

- Copy and change to directory (assuming you already copied the PHYS244 directory):
    - cd /home/\$USER/SI2017/MPI
- Verify modules loaded:
    - module list
    - Currently Loaded Modulefiles:
        - intel/2013\_sp1.2.144
        - mvapich2\_ib/2.1
        - gnutools/2.69
- Compile the MPI hello world code:
    - mpif90 -o hello\_mpi hello\_mpi.f90
- Verify executable has been created:
    - ls -lt hello_mpi
    - -rwxr-xr-x 1 mahidhar sdsc 721912 Mar 25 14:53 hello_mpi
- Submit job from IBRUN directory:
    - cd /home/\$USER/SI2017/MPI/IBRUN
    - sbatch --res=SI2017DAY1 hellompi-slurm.sb

# Comet - Hello World on Compute Nodes

The submit script is hellompi-slurm.sb:

#!/bin/bash
#SBATCH --job-name="hellompi"
#SBATCH --output="hellompi.%j.%N.out"
...
...

#This job runs with 2 nodes, 24 cores per node for a total of 48 cores
#ibrun in verbose mode will give binding detail

ibrun -v ../hello_mpi
IBRUN: Command is ../hello_mpi
IBRUN: Command is /share/apps/examples/MPI/hello_mpi
...
...
node      18 : Hello world
node      13 : Hello world
...

# Compiling OpenMP Example

- Change to the examples directory:
    - cd /home/\$USER/SI2017/OPENMP
- Compile using -openmp flag:
    - ifort -o hello\_openmp -openmp hello\_openmp.f90
- Verify executable was created:
    - [mahidhar@comet-08-11 OPENMP]\$ ls -lt hello\_openmp
    - -rwxr-xr-x 1 mahidhar sdsc 750648 Mar 25 15:00 hello_openmp

# OpenMP job script

#!/bin/bash
#SBATCH --job-name="hello\_openmp"
#SBATCH --output="hello\_openmp.%j.%N.out"
...
...

#SET the number of openmp threads
export OMP_NUM_THREADS=24

#Run the OpenMP job
./hello_openmp

# Output from the OpenMP Job

\$ more hello_openmp.out
HELLO FROM THREAD NUMBER = 7
HELLO FROM THREAD NUMBER = 6
HELLO FROM THREAD NUMBER = 9
...
...

# Running Hybrid (MPI + OpenMP) Jobs

Several HPC codes use a hybrid MPI + OpenMP approach

The "ibrun" wrapper was developed to handle such hybrid use cases. It automatically senses the MPI build (mvapich2, openmpi) and binds tasks correctly

"ibrun -help" gives detailed usage info

hello\_hybrid.c is a sample code, and hello_hybrid.cmd shows "ibrun" usage

# hello_hybrid.cmd

#!/bin/bash
#SBATCH --job-name="hellohybrid"
#SBATCH --output="hellohybrid.%j.%N.out"
...
...

export OMP_NUM_THREADS=6
ibrun --npernode 4 ./hello_hybrid

# Hybrid Code Output

\[etrain61@comet-ln3 HYBRID]$ more hellohybrid.8557716.comet-14-01.out

Hello from thread 0 out of 6 from process 2 out of 8 on comet-14-01.local
...
...
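
The values in `hello_hybrid.cmd` are chosen so that MPI tasks per node times OpenMP threads per task fills the 24 cores of a standard node. A sketch of that arithmetic, assuming a 2-node job (inferred from "out of 8" in the output above); the variable names are illustrative:

```shell
#!/bin/bash
cores_per_node=24          # 2 sockets x 12 cores on a standard node
tasks_per_node=4           # value passed to ibrun --npernode
nodes=2                    # assumption, inferred from the output above
export OMP_NUM_THREADS=6

cores_used=$(( tasks_per_node * OMP_NUM_THREADS ))
total_tasks=$(( nodes * tasks_per_node ))

# Warn if the layout asks for more threads than the node has cores.
if [ "$cores_used" -gt "$cores_per_node" ]; then
    echo "oversubscribed: $cores_used threads on $cores_per_node cores" >&2
fi
echo "$total_tasks MPI tasks, $cores_used cores used per node"
```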

# Using SSD Scratch

#!/bin/bash
#SBATCH --job-name="localscratch"

...

#Copy binary to SSD

cp IOR.exe /scratch/\$USER/\$SLURM\_JOBID

#Change to local scratch (SSD) and run IOR benchmark

cd /scratch/\$USER/\$SLURM\_JOBID

#Run IO benchmark

ibrun -np 4 ./IOR.exe -F -t 1m -b 4g -v --v > IOR.out.\$SLURM\_JOBID

#Copy out data you need

cp IOR.out.\$SLURM\_JOBID \$SLURM\_SUBMIT\_DIR

- Snapshot on the node during the run:

\$ pwd

/scratch/mahidhar/435463

\$ ls -lt

total 22548292

-rw-r-- 1 mahidhar hpss 5429526528 May 15 23:48 testFile.00000001

...

- Performance from single node (in log file copied back):
    - Max Write: 250.52 MiB/sec (262.69 MB/sec)
    - Max Read: 181.92 MiB/sec (190.76 MB/sec)
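
The script above follows a stage-in, run, copy-out pattern, which matters because local scratch is purged at the end of the job. A self-contained sketch of the same pattern with a stand-in workload in place of IOR; the fallback directories and file names are hypothetical and only let the sketch run outside a Slurm job:

```shell
#!/bin/bash
# Fallbacks after ":-" are illustrative so the sketch also runs outside Slurm.
job_id=${SLURM_JOBID:-demo$$}
submit_dir=${SLURM_SUBMIT_DIR:-$(mktemp -d)}
scratch_dir=/scratch/$USER/$job_id

# Assumption: inside a job the per-job SSD directory already exists, as in the
# script above. Elsewhere, fall back to a temporary directory.
if [ ! -d "$scratch_dir" ]; then
    scratch_dir=$(mktemp -d)
fi

# Do the small-file I/O on the local disk.
cd "$scratch_dir" || exit 1
for i in 1 2 3; do
    echo "record $i" > "part.$i"
done
cat part.1 part.2 part.3 > "result.$job_id"

# Copy results back before the job ends, since local scratch is purged.
cp "result.$job_id" "$submit_dir/"
echo "saved $submit_dir/result.$job_id"
```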

# Comet GPU Nodes
## 2 NVIDIA K80 Cards (4 GPUs total) per node

[1] CUDA code compile and run example

[2] Hands-on examples using Singularity to enable TensorFlow

# Compiling CUDA Example

- Load the CUDA module:
    - module load cuda
- Compile the code:
    - cd /home/\$USER/SI2017/CUDA
    - nvcc -o matmul -I. matrixMul.cu
- Submit the job:
    - sbatch --res=SI2017DAY1 cuda.sb

# CUDA Example: Batch Submission Script

#!/bin/bash
#SBATCH --job-name="CUDA"
#SBATCH --output="CUDA.%j.%N.out"

...

#Load the cuda module

module load cuda

#Run the job

./matmul

# Singularity: Provides Flexibility for OS Environment

- Singularity (http://singularity.lbl.gov) is a relatively new development that has become very popular on Comet
- Singularity allows groups to easily migrate complex software stacks from their campus to Comet
- Singularity runs in user space and requires very little special support - in fact, it actually reduces it in some cases
- We have roughly 15 groups running this on Comet
- Applications include: TensorFlow, ParaView, Torch, FEniCS, and custom user applications
- Docker images can be imported into Singularity

# TensorFlow via Singularity

#!/bin/bash
#SBATCH --job-name="TensorFlow"
#SBATCH --output="TensorFlow.%j.%N.out"
#SBATCH --partition=gpu-shared
...
#SBATCH --gres=gpu:k80:1

#Run the job

module load singularity

singularity exec /share/apps/gpu/singularity/sdsc_ubuntu_gpu_tflow.img lsb_release -a

singularity exec /share/apps/gpu/singularity/sdsc_ubuntu_gpu_tflow.img python -m tensorflow.models.image.mnist.convolutional

- Change to the examples directory:
cd /home/$USER/SI2017/TensorFlow

- Submit the job:
sbatch --res=SI2017DAY1 TensorFlow.sb

# TensorFlow Example: Output

Distributor ID: Ubuntu

Description: Ubuntu 16.04 LTS

Release: 16.04

Codename: xenial

I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally

...

I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0)->(device: 0, name: Tesla K80, pci bus id: 0000:85:00.0)

...

# Add Data Analysis to Existing Compute Infrastructure

![alt text](structure1.png "Structure 1")

![alt text](structure2.png "Structure 2")

![alt text](structure3.png "Structure 3")

![alt text](structure4.png "Structure 4")

# ANAGRAM Example

- Change to directory:

cd $HOME/SI2017/hadoop/ANAGRAM_Hadoop2

- Submit job:

sbatch --res=SI2017DAY1 anagram.script

- Check configuration in directory:

ls $HOME/cometcluster

# Anagram Example - Sample Output

cat part-00000

...

aabcdelmnu manducable, ambulanced,

aabcdeorrsst broadcasters, rebroadcasts,

...
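
Each line of the sample output pairs a key with the words that share it, and the key appears to be the word's letters in sorted order. A minimal sketch of computing such a key in shell; this only illustrates the idea and is not taken from `AnagramJob.jar`, and the `anagram_key` helper is hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: split a word into one letter per line, sort the
# letters, and join them back into a single string.
anagram_key() {
    printf '%s\n' "$1" | fold -w1 | LC_ALL=C sort | tr -d '\n'
}

key1=$(anagram_key manducable)
key2=$(anagram_key ambulanced)
echo "$key1 $key2"
```

Words that are anagrams of each other produce the same key, which is what lets a MapReduce job group them under one output line.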

# RDMA-Hadoop and RDMA-Spark
## Network-Based Computing Lab, Ohio State University
### NSF-funded project in collaboration with Dr. D.K. Panda

- HDFS, MapReduce, and RPC over native InfiniBand and RDMA over Converged Ethernet (RoCE)
- Based on Apache distributions of Hadoop and Spark
- Version RDMA-Apache-Hadoop-2.x 1.1.0 (based on Apache Hadoop 2.6.0) available on Comet
- Version RDMA-Spark 0.9.3 (based on Apache Spark 1.5.1) is available on Comet
- More details on the RDMA-Hadoop and RDMA-Spark projects at:
    - http://hibd.cse.ohio-state.edu/

# RDMA-Hadoop, Spark

- Exploit performance on modern clusters with RDMA-enabled interconnects for Big Data applications
- Hybrid design with in-memory and heterogeneous storage (HDD, SSDs, Lustre)
- Keep compliance with standard distributions from Apache

![alt text](RDMA.png "RDMA-Hadoop Spark")

# Hands On: Anagram using HHH-M mode

#!/bin/bash

#SBATCH --job-name="rdmahadoopanagram"

#SBATCH --output="rdmahadoopanagram.%j.%N.out"

...

#Script requests 3 nodes - one used for namenode, 2 for data nodes/processing

#Set modulepath and load RDMA Hadoop Module

export MODULEPATH=/share/apps/compute/modulefiles/applications:$MODULEPATH

module load rdma-hadoop/2x-1.1.0

#Get the host list

export SLURM\_NODEFILE=\`generate_pbs_nodefile`

cat \$SLURM\_NODEFILE | sort -u > hosts.hadoop.list

#Use SLURM integrated configuration/startup script

hibd_install_configure_start.sh -s -n ./hosts.hadoop.list -i \$SLURM\_JOBID -h \$HADOOP\_HOME -j \$JAVA\_HOME -m hhh-m -r /dev/shm -d /scratch/\$USER/\$SLURM\_JOBID -t /scratch/\$USER/\$SLURM_JOBID/hadoop_local

#Commands to run ANAGRAM example

\$HADOOP\_HOME/bin/hdfs --config \$HOME/conf_\$SLURM_JOBID dfs -mkdir -p /user/\$USER/input

\$HADOOP\_HOME/bin/hdfs --config \$HOME/conf_\$SLURM_JOBID dfs -put SINGLE.TXT /user/\$USER/input/SINGLE.TXT

\$HADOOP\_HOME/bin/hadoop --config \$HOME/conf_\$SLURM_JOBID jar AnagramJob.jar /user/\$USER/input/SINGLE.TXT /user/\$USER/output

\$HADOOP\_HOME/bin/hdfs --config \$HOME/conf_\$SLURM_JOBID dfs -get /user/\$USER/output/part* \$SLURM_WORKING_DIR

#Clean up

hibd_stop_cleanup.sh -d -h $HADOOP_HOME -m hhh-m -r /dev/shm
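
The `sort -u` step in the script above collapses the node file to one line per host. A small sketch of that step on a made-up node file, assuming `generate_pbs_nodefile` writes one hostname per task as PBS-style node files do; the host names and temporary paths here are hypothetical:

```shell
#!/bin/bash
# Build a stand-in node file with repeated host names (hypothetical values).
work_dir=$(mktemp -d)
nodefile=$work_dir/nodefile
printf '%s\n' comet-14-01 comet-14-01 comet-14-02 comet-14-02 \
    comet-14-03 comet-14-03 > "$nodefile"

# Keep one line per host, as the job script does.
sort -u "$nodefile" > "$work_dir/hosts.hadoop.list"

host_count=$(grep -c . "$work_dir/hosts.hadoop.list")
echo "$host_count unique hosts"
```

With three unique hosts, one can serve as the namenode and two as data nodes, matching the layout described in the script comment.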

# RDMA-Hadoop: HHH-M Example

- Change to directory:

cd $HOME/SI2017/hadoop/RDMA-Hadoop/RDMA-HHH-M

- Submit job:

sbatch --res=SI2017DAY1 anagram.script

# Summary

- Comet can be directly accessed using an ssh client
- Always run via the batch scheduler - for both interactive and batch jobs. **Do not run on the login nodes**
- Choose your filesystem wisely - Lustre parallel filesystem for large block I/O. SSD based filesystems for small block I/O, lots of small files. **Do not use home filesystem for intensive I/O of any kind**
- Comet can handle MPI, OpenMP, Pthreads, Hybrid, CUDA, and OpenACC jobs. Singularity provides further flexibility
- Dynamic spin up of Hadoop, Spark instances within Comet scheduler framework