# HPC Architecture

## Overview
- HPC Job Scheduling
- Performance analysis
- Debugging
- Profiling
- Petascale and Exascale Computing

## Cluster Terminology


## What is a Job
- Job 
    - User's program/name of an executable
    - input data and parameters
    - environment variables
    - required libraries
    - descriptions of computing resources required
- Job Script
    - Formal specification
    - identifies an application to run along with its input data and env variables
    - requests computing resources.

## Local Resource Manager (LRM) / Batch System

- Job scheduler or workload manager
    - Identifies jobs to run, selects the resources for each job, and decides when to run it.
- Resource manager
    - Identifies the compute resources, keeps track of their usage, and feeds this information back to the workload manager.
- Execution manager
    - Coordinates job initiation and the start of execution.

## Job Scheduling
- The LRM is responsible for receiving and parsing the job script.
- If a job cannot be executed immediately, it is added to a queue.

## Job Scheduling Policies:
- FCFS (first come, first served)
- Multi-priority queues
- Backfilling
- Fair-share
- Preemptive
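
To make the queueing idea concrete, here is a toy sketch of FCFS on a single resource. It is a simplification, not how a real scheduler is implemented: the function name `fcfs_schedule` and the job names and runtimes are hypothetical, and real schedulers track many nodes at once.

```shell
#!/bin/bash
# Toy FCFS model: one resource, jobs run strictly in submission order.
# Each argument is "name:runtime_minutes" (hypothetical jobs).
fcfs_schedule() {
    local clock=0 job name runtime
    for job in "$@"; do
        name=${job%%:*}        # text before the colon
        runtime=${job##*:}     # text after the colon
        echo "$name start=$clock end=$((clock + runtime))"
        clock=$((clock + runtime))
    done
}

fcfs_schedule jobA:30 jobB:10 jobC:5
```

In this model the 5-minute jobC cannot start until minute 40 because it queues behind longer jobs, which is one motivation for the other policies in the list.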

## Job Execution on Compute Nodes
- User -> Job script -> Head Node/Login Node -> Network Switch -> Compute Nodes
- Job script: the scheduler uses the resources requested in it to select appropriate nodes for job execution. The executable and input files are copied to the compute nodes and the job is started.
- Login node: monitors the status of submitted jobs. (It is not supposed to run anything; only compute nodes execute jobs.)
- After execution, the input and output files are written to the user-specified location.

## Local Resource Manager - SLURM
- When you log in to the HPC cluster, you land on a login node.
    - Login nodes are not meant to run jobs.
    - They are used to submit jobs to the compute nodes.
- To submit a job to the cluster, you need to write a scheduler job script.
- SLURM - Simple Linux Utility for Resource Management.
- It is a local resource manager that provides a framework for job queues, allocation of compute nodes, and the start and execution of jobs.

## SLURM Components
- client commands list
- compute node daemons
    - slurmd
- controller daemons
    - slurmctld
    - secondary slurmctld
    - slurmdbd
- database
- other clusters (optional)

## SLURM Commands
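The most commonly used Slurm commands (see the slides for full descriptions):
- sbatch -> submit a job script to the queue
- squeue -> info on the job queue
- sinfo -> info on all nodes, partitions, and their availability
- scancel -> cancel a pending or running job
- scontrol -> view or modify the state of jobs, nodes, and partitions
- sacct -> accounting information for running and completed jobs
- srun -> launch a job step or an interactive job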



## SLURM: Sample Job Script for Serial Jobs
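
The notes do not record the script shown in the session, so the following is only a minimal sketch of what a serial job script can look like. The partition name `standard`, the job and file names, and the program `./my_serial_app` are assumptions; replace them with values valid on your cluster.

```shell
#!/bin/bash
# Minimal serial job sketch. Partition, names and program are hypothetical.

# one node
#SBATCH --nodes=1
# one task: a serial program uses a single core
#SBATCH --ntasks=1
#SBATCH --job-name=serial-test
# %j expands to the job id
#SBATCH --output=serial-%j.out
#SBATCH --error=serial-%j.err
# wall-clock limit (hh:mm:ss)
#SBATCH --time=00:10:00
#SBATCH --partition=standard

# Slurm sets SLURM_JOB_ID inside a job; fall back to a label elsewhere.
job_id="${SLURM_JOB_ID:-not-under-slurm}"
echo "Job id: $job_id"
echo "Running on: $(uname -n)"

# ./my_serial_app input.dat    # hypothetical program; uncomment and adapt
```

Because the `#SBATCH` lines are ordinary comments to bash, the script can also be run directly in a shell to check it before submitting it with sbatch.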


## SLURM: Sample Job Script for Parallel Jobs on GPUs

    #!/bin/bash
    # number of nodes
    #SBATCH -N 1
    # number of tasks per node (one core per task by default)
    #SBATCH --ntasks-per-node=40
    # name of output file
    #SBATCH --output=3mm.out
    # name of error file
    #SBATCH --error=3mm.err
    # time required to execute the program (hh:mm:ss)
    #SBATCH --time=01:00:00
    # request 2 GPUs on each compute node
    #SBATCH --gres=gpu:2
    # partition or queue name
    #SBATCH --partition=gpu

    # number of OpenMP threads
    export OMP_NUM_THREADS=40

## Commands executed in Param Utkarsh:
- sinfo
- squeue
- ls /tmp/slurm-samples
- cp -r /tmp/slurm-samples ~/
- cd slurm-samples
- sbatch test1.slurm
- ls -lrt
- squeue -u \<username\>
- sbatch test2.slurm
- cp test1.slurm test-sleep.slurm
- vim test-sleep.slurm
    - (Add line "sleep 120" so that this job will sleep for 120 seconds).
- sbatch test-sleep.slurm
- squeue -u \<username\>
- scontrol show job \<jobid\>
- scancel \<jobid\>
- sacct -u \<username\>
- srun --nodes=1 --ntasks-per-node=1 --time=00:05:00 --pty bash
    - (This srun command queues an interactive job, waits until resources are allocated to its jobid, and then opens a shell on the compute node.)


## Working Environment

## Module Utility
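- How to set up the environment?
    - Use the module command.
- Some important commands:
    - module avail -> list the software available on the HPC system
        - shows the list of precompiled software packages
    - module list -> show the currently loaded modules
    - module load \<module\> -> load a module into the environment
    - module unload \<module\> -> remove a loaded module
    - module purge -> unload all loaded modules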


## Commands executed:
- mpicc
- module avail | grep openmpi
    - This will show different versions of a software if available
- module load openmpi/4.1.1
- mpicc
- which mpicc
- echo $PATH
    - openmpi/4.1.1 is added to path
- echo $LD_LIBRARY_PATH
    - library files are displayed
- module unload openmpi/4.1.1
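
The PATH and LD_LIBRARY_PATH changes observed above can be imitated with a small sketch. This is a simplified model of what loading a module does, not the module tool itself: the function `demo_module_load` and the install prefix `/opt/example/openmpi/4.1.1` are hypothetical.

```shell
#!/bin/bash
# Simplified model of "module load": prepend a package's directories
# to the search paths. The prefix used below is hypothetical.
demo_module_load() {
    local prefix=$1
    export PATH="$prefix/bin:$PATH"
    # Append the old value only if it was non-empty.
    export LD_LIBRARY_PATH="$prefix/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}

# Run the demo in a subshell so the current shell's paths stay untouched.
(
    demo_module_load /opt/example/openmpi/4.1.1
    echo "first PATH entry:            ${PATH%%:*}"
    echo "first LD_LIBRARY_PATH entry: ${LD_LIBRARY_PATH%%:*}"
)
```

Since the new directories come first, `which mpicc` would resolve to the loaded version, and unloading the module corresponds to removing those entries again.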

## Transferring files between local machine and HPC cluster
- To transfer data from a local system (laptop/desktop) to the HPC system, use the "scp" command in a terminal.
- The following command copies a local directory to the cluster: "scp -r \<path to the local data directory\> \<your username\>@\<IP of ParamUtkarsh\>:\<path to directory on HPC where to save the data\>"

## SLURM Job Arrays
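
The contents of array-jobs.slurm are not recorded in these notes, so the script below is only a sketch of how a job array script is commonly written. The array range, the file naming scheme `input_N.dat`, and the program `./my_app` are assumptions.

```shell
#!/bin/bash
# Job array sketch: Slurm runs this script once per array index.
# Array range, file names and the program are hypothetical.

#SBATCH --job-name=array-demo
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:05:00
# run four array tasks with indices 1..4
#SBATCH --array=1-4
# %A = array job id, %a = array task index
#SBATCH --output=array-%A_%a.out

# Slurm sets SLURM_ARRAY_TASK_ID for each task; default to 1 so the
# script can also be tried outside Slurm.
task_id="${SLURM_ARRAY_TASK_ID:-1}"
input_file="input_${task_id}.dat"

echo "Array task ${task_id} would process ${input_file}"
# ./my_app "$input_file"    # hypothetical program; uncomment and adapt
```

A single sbatch call then creates all four tasks, and each task picks its own input from its index.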

## Commands executed:
- sbatch array-jobs.slurm
- sacct 

## Dependent Jobs/Workflows
- A standard example of this is a workflow in which the output from one job is used as the input to the next.
- Useful when you need to run multiple jobs in a particular order.

                    jobA
                /           \
            jobB            jobC
                \           /
                    jobD

- jobB and jobC both depend only on jobA, so (as the instructor noted) they can run in parallel; jobD starts only after both have finished.

## Commands executed:
- sbatch JobA.slurm
- sbatch -d afterok:\<jobidA\> JobB.slurm
- sbatch -d afterok:\<jobidA\> JobC.slurm
- sbatch -d afterok:\<jobidB\>:\<jobidC\> JobD.slurm
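
Copying job ids by hand is error-prone, so the same diamond workflow can be scripted. The sketch below assumes the four job scripts from the session exist in the current directory; the function name `submit_diamond` is hypothetical. It relies on sbatch's `--parsable` option, which prints only the job id (optionally followed by `;cluster`).

```shell
#!/bin/bash
# Submit jobA, then jobB and jobC after jobA succeeds, then jobD after both.
submit_diamond() {
    local a b c d
    a=$(sbatch --parsable JobA.slurm) || return 1
    a=${a%%;*}    # drop an optional ";cluster" suffix
    b=$(sbatch --parsable -d "afterok:$a" JobB.slurm) || return 1
    b=${b%%;*}
    c=$(sbatch --parsable -d "afterok:$a" JobC.slurm) || return 1
    c=${c%%;*}
    d=$(sbatch --parsable -d "afterok:$b:$c" JobD.slurm) || return 1
    d=${d%%;*}
    echo "A=$a B=$b C=$c D=$d"
}

# Only submit when Slurm is actually available.
if command -v sbatch >/dev/null 2>&1; then
    submit_diamond
else
    echo "sbatch not found; run this on a cluster login node"
fi
```

With `afterok`, a dependent job starts only if the jobs it waits for finished successfully, so a failure in jobA keeps jobB, jobC, and jobD from running.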

- /tmp/commands
    - this file stores all the commands executed so far

## Output of /tmp/commands:

    paramutkarsh.cdac.in
    ssh user@14.139.1.74 -p 4422
    ls /tmp/slurm-samples
    cp -r /tmp/slurm-samples ~/
    cd ~/slurm-samples
    ls
    sbatch test1.slurm
    ls -lrt
    squeue -u <username>
    sbatch test2.slurm
    cp test1.slurm test-sleep.slurm
    vi test-sleep.slurm
    sbatch test-sleep.slurm
    srun --nodes=1 --ntasks-per-node=1 --time=00:05:00 --pty bash
    mpicc
    module avail | grep openmpi
    module load openmpi/4.1.1
    which mpicc
    echo $PATH
    echo $LD_LIBRARY_PATH
    module unload openmpi/4.1.1
    sbatch array-jobs.slurm
    sbatch -d afterok:<jobidA> JobB.slurm
    sbatch -d afterok:<jobidA> JobC.slurm
    sbatch -d afterok:<jobidB>:<jobidC> JobD.slurm


## Speedup

- speedup = (Sequential execution time)/(parallel execution time)

## Amdahl's Law:
- S(n) <= 1/(f + (1-f)/n) <= 1/f (the bound 1/f is approached for large values of n)
- S(n) = theoretical speedup with n processors
- f = serial fraction
- n = number of processors
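
As a worked example, the sketch below evaluates Amdahl's bound with f taken as the serial fraction, so the denominator is f + (1-f)/n. The function name `amdahl_speedup` and the 10% serial fraction are assumptions chosen for illustration.

```shell
#!/bin/bash
# Amdahl's law with f = serial fraction, n = number of processors:
#   S(n) = 1 / (f + (1 - f) / n)
amdahl_speedup() {
    awk -v f="$1" -v n="$2" 'BEGIN { printf "%.2f\n", 1 / (f + (1 - f) / n) }'
}

# Hypothetical program with a 10% serial fraction.
for n in 1 10 100 1000; do
    echo "n=$n speedup=$(amdahl_speedup 0.1 "$n")"
done
```

The printed speedups are 1.00, 5.26, 9.17, and 9.91: adding processors gives diminishing returns, and the value never exceeds 1/f = 10.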

## Gustafson's Law:
- S(n) = n + (1-n)f
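
The formula can be evaluated the same way, again reading f as the serial fraction. The function name `gustafson_speedup` and the value f = 0.1 are assumptions for illustration.

```shell
#!/bin/bash
# Gustafson's law with f = serial fraction, n = number of processors:
#   S(n) = n + (1 - n) * f
gustafson_speedup() {
    awk -v f="$1" -v n="$2" 'BEGIN { printf "%.2f\n", n + (1 - n) * f }'
}

# Hypothetical workload with a 10% serial fraction.
for n in 1 10 100; do
    echo "n=$n scaled speedup=$(gustafson_speedup 0.1 "$n")"
done
```

This prints 1.00, 9.10, and 90.10. Unlike Amdahl's bound, the scaled speedup keeps growing with n, because Gustafson's law assumes the problem size grows with the number of processors.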

## Granularity:
- Granularity = size of computation between synchronization points.
    - Coarse -> heavyweight processes + IPC (interprocess communication, e.g. MPI)
    - Medium -> threads + [message passing + shared memory]
    - Fine -> instruction level (e.g. SIMD)
- Computation to Communication Ratio
    - (Computation time)/(Communication time)
    - Increasing this ratio is often a key to good efficiency.
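
A small numeric sketch of the ratio, using made-up timings: the function name `comp_comm_ratio` and all the numbers are hypothetical.

```shell
#!/bin/bash
# Computation-to-communication ratio = computation time / communication time.
comp_comm_ratio() {
    awk -v comp="$1" -v comm="$2" 'BEGIN { printf "%.2f\n", comp / comm }'
}

# Same communication cost (10 s), but more work done between messages.
echo "fine-grained:   $(comp_comm_ratio 20 10)"
echo "coarse-grained: $(comp_comm_ratio 80 10)"
```

Here the coarse-grained variant has a ratio of 8.00 against 2.00, meaning a smaller share of its run time goes to communication.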

## Communication Overhead
- Communication overhead = time (measured in instructions) that a zero-byte message consumes in a process.
    - It measures the time spent on communication that cannot be spent on computation.
- Overlapped message = the portion of a message's lifetime that can occur concurrently with computation.

## Debugging

## Parallel vs Serial

## Parallel bugs
- Race conditions
- Deadlocks 

eg of deadlock and code samples
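
The lecture's own code samples are not reproduced in these notes. As a stand-in, here is a shell-level analogue of a race condition and its fix: several background workers update a shared counter file, first without and then with a lock. All function names are hypothetical, and the lock is a simple `mkdir`-based spin lock, used because `mkdir` either creates the directory or fails atomically.

```shell
#!/bin/bash
# Race condition demo: concurrent read-modify-write on a shared file.
increment() {                 # not atomic: another worker can run in between
    local v
    v=$(cat "$counter")
    echo $((v + 1)) > "$counter"
}

locked_increment() {          # only one worker at a time holds the lock
    until mkdir "$lock" 2>/dev/null; do :; done
    increment
    rmdir "$lock"
}

run_workers() {               # $1 = function to call; 4 workers x 25 updates
    counter=$(mktemp)
    lock="$counter.lock"
    echo 0 > "$counter"
    local w i
    for w in 1 2 3 4; do
        ( for ((i = 0; i < 25; i++)); do "$1"; done ) &
    done
    wait
    cat "$counter"
    rm -f "$counter"
}

echo "without lock: $(run_workers increment) (may be less than 100)"
echo "with lock:    $(run_workers locked_increment)"
```

Without the lock, updates can be lost because two workers read the same old value. With the lock, every update is applied and the result is always 100. A deadlock is the opposite failure: workers each hold one lock and wait forever for a lock held by another.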

## Overview of GDB
- GNU Debugger 

## Preparing your program
- Debugging needs an additional step when compiling the program.
- Compile with the -g option to include debugging information (which gdb needs).

## Commands executed:
- cd slurm-samples
- gdb gdb-test
- list
    - list the source code and line numbers
- break 6
    - sets breakpoint at line number 6
- run "cdacbangalore"
- next
- backtrace
- info frame
- x &a1
- print a1
- quit
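
The same session can also be replayed non-interactively with gdb's batch mode, which is handy inside a job script. This is a sketch under assumptions: it expects a binary `./gdb-test` built with `-g` (for example `gcc -g -o gdb-test gdb-test.c`, where the source file name is a guess), and it reuses the breakpoint line and argument from the session above.

```shell
#!/bin/bash
# Replay the interactive gdb session in batch mode.
# -batch exits after the commands, -ex runs one gdb command,
# --args passes the rest of the line to the debugged program.
gdb_args=(
    -batch
    -ex "break 6"
    -ex "run"
    -ex "next"
    -ex "backtrace"
    -ex "info frame"
    -ex "print a1"
    --args ./gdb-test cdacbangalore
)

# Run only when both gdb and the binary are present.
if command -v gdb >/dev/null 2>&1 && [ -x ./gdb-test ]; then
    gdb "${gdb_args[@]}"
else
    echo "gdb or ./gdb-test not found; skipping debug session"
fi
```

Keeping the commands in an array makes it easy to add or remove steps without worrying about quoting.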

## OpenMP debugging with GDB
- Thread-specific commands in GDB:
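    - info threads -> list all threads
    - thread -> show the current thread
    - thread \<thread_no\> -> switch to the given thread
    - thread apply \<thread_no | all\> \<command\> -> run a command in the selected threads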
    ....

## Debuggers
- GDB
- LLDB
- TotalView
- DDT

## Profiling
- Profiling measures and analyzes the performance of a software application in order to identify and diagnose performance issues or bottlenecks.
- Various profilers are available, such as Intel VTune (see the slides for more).

## Petascale and Exascale Computing:
- Top500 entries report: cores, Rmax (PFlop/s), Rpeak (PFlop/s), and power (kW).
- Rmax = maximum performance achieved on the HPL benchmark
- Rpeak = theoretical peak performance
- AIRAWAT - PARAM Siddhi-AI, rated at 8.5 petaflops, is the fastest supercomputer in India.
- It was ranked no. 90 in the "Top500" supercomputer list of November 2023.

## Challenges
- power consumption
- scalability
- heterogeneity
- resilience
- programming methodologies and applications

## C-DAC system software
- ParaDE -> parallel desktop environment
- CAPC -> C-DAC auto-parallelizing compiler
- CHAP -> C-DAC HPC Application Profiler
- SuPrakashan -> monitoring and management solution
- HPC Dashboard

**All of this software is available only on PARAM systems**