# HPC for All

Ana Gainaru <br/>
<small>Vanderbilt University</small>

# About me

<cite> www.ana-gainaru.com (anagainaru Github) </cite>

<img src="figures/aboutme.png" width="200" align="left" />

<br />
PhD in Computer Science <br />
<small> Failure prediction, Hybrid checkpointing <br />
        Fault tolerance framework for Blue Waters </small>

<br /><br />
HPC Architect <br />
<small> Collective communication <br />
         HW custom application optimization</small>

<br />
Research Assistant Professor <br />
<small>Scheduling <br />
            Heterogeneous, dynamic applications</small>

# About HPC

![HPC systems](figures/hpc.png)

HPC systems evolve together with large monolithic codes
 - Focus on performance
 - Developed by the community for years
 - Tuned to scale and run on large-scale infrastructures

# HPC Evolution

![HPC systems](figures/hpc.png)

- NUMA
- Hardware threading

- Accelerators, Memory hierarchy 
- High-bandwidth memory
- Burst buffers, Vectorization

<cite> Summit/Sierra: NVLink, NVMe, NVSwitch, GPUDirect, Unified Virtual Memory </cite>

# HPC Evolution

![HPC systems](figures/hpc.png)

**Unknowns to come**
 - Chiplet architectures
 - Configurable Spatial Accelerator architecture
 - Computing on switch

# HPC Evolution

Top500 June 2018 <br/>
<small> Only 55 systems have both HPL and HPCG information </small>


<img src="figures/hpcg_hpl.png" width="600" />

<cite> Two orders of magnitude performance gap </cite>

# XSEDE

<img src="figures/cumulative_walltime.png"  width="350" align="right" valign="top" />

Workloads running on Stampede (representative of all XSEDE systems)

```
Applications running on one node: 33%
Applications running on < 10 nodes: 79%
Applications running on < 100 nodes: 98%
Average number of nodes: 15 (max 10,417)
```

<br/><br/>
Applications run between a few minutes and over 100 hours 
<br/>

<cite> Small jobs are distributed almost uniformly throughout the year </cite>

<small> Over 80% of the days run multiple small jobs (average volume of 100 node hours) </small>




# Second generation applications

![Variability](figures/app_variability.png)

Variability in execution time and memory usage <br/><br/>
<small>
(a) Variability factor (ratio of the maximum variability to the execution time) for MultiAtlas, a second generation application (blue), and a traditional application that ran on the Mira supercomputer (orange) <br/><br />
(b) Memory usage variability within multiple phases of the MultiAtlas code
</small>

# Performance variability

<img src="figures/correlation.png"  width="300" align="right" valign="top" />
          
1. Intrinsic to the code
2. Dependent on the input data 
    - size or characteristics
3. Due to system failures
4. Due to resource contention
5. Due to other runtime decisions
    - scheduling policies

**Same input data and execution parameters**
- Task-based GEMM has 20% variation on CPUs (over 50% on GPUs)
- MultiAtlas has over 10x variation when running on one node (around 7% intrinsic to the code)

# Research Statement

<cite> Update the HPC software stack for the new generation of applications </cite>

Directions:

**(1) Understand performance variability**
- At system, middleware, application levels

![Congestion](figures/intrepid_congestion.png)

`Congestion caused by sharing resources`

# Research Statement

<cite> Update the HPC software stack for the new generation of applications </cite>

Directions:

**(1) Understand performance variability**
- At system, middleware, application levels

![Resiliency](figures/resiliency.png)

# Research Statement

<cite> Update the HPC software stack for the new generation of applications </cite>

Directions:

**(1) Understand performance variability**
- At system, middleware, application levels

![Scheduling](figures/wait_time.png)
<small> Queue wait time is a function of the requested walltime, requested cores, platform occupancy, and platform policy </small>

<sub> K. Yamamoto et al., “The K computer Operations: Experiences and
Statistics”, Procedia Computer Science, Volume 29, 2014, Pages 576-585 </sub>

# Research Statement

<cite> Update the HPC software stack for the new generation of applications </cite>

Directions:

**(1) Understand performance variability**
- At system, middleware, application levels

<img src="figures/multiatlas_corr.png" width="300" align="left" />
<img src="figures/multiatlas_prediction.png" width="390" />

# Research Statement

<cite> Update the HPC software stack for the new generation of applications </cite>

Directions:

**(1) Understand performance variability**
- At system, middleware, application levels

**(2) Design new middleware to adapt to the needs of unpredictable applications**
- On-going work in I/O, scheduling and fault tolerance

<img src="figures/sch.png" width="250" align="right" />

# Preliminary results

## Scheduling

**Direction 1**: Optimize the way emerging applications use current scheduling systems
 - Automating the resource estimation provided by users 
 - Using checkpointing strategies to deal with applications being killed by the runtime `INRIA`
 - Developing tools to understand task-based scheduling limitations `UTK`

**Direction 2**: Modernize existing schedulers to become more flexible 
 - Provide flexible resource management

# Resource estimation

<img src="figures/estimates1.png" width="400" align="left" />
<img src="figures/estimates2.png" width="380" />

Use logs of execution to predict future runs
 - Fit the CDF with polynomial/distribution interpolation
 - Machine learning methods

# Resource estimation

<img src="figures/sequence.png" width="400" align="left" valign="top" />
<img src="figures/intrepid_utilization.png" width="250" />

<br/>

Use logs of execution to predict future runs
 - **Fit the CDF with polynomial/distribution interpolation**
 - Machine learning methods
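
A minimal sketch of the distribution-fitting option, assuming a log-normal distribution describes the walltimes reasonably well (an assumption for illustration, not a claim about the fit used in this work). The walltimes in `history` and the helper names are hypothetical.

```python
import math
from statistics import NormalDist

def empirical_cdf(walltimes):
    """Return the sorted walltimes and the empirical CDF value at each one."""
    xs = sorted(walltimes)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]

def fit_lognormal(walltimes):
    """Fit a log-normal by fitting a normal distribution to log(walltime)."""
    return NormalDist.from_samples([math.log(t) for t in walltimes])

def walltime_for_quantile(fit, q):
    """Walltime that covers a fraction q of runs under the fitted distribution."""
    return math.exp(fit.inv_cdf(q))

# Hypothetical walltimes (hours) taken from past execution logs.
history = [1.2, 1.5, 1.7, 2.0, 2.1, 2.4, 3.0, 3.8, 5.5, 9.0]
xs, cdf = empirical_cdf(history)
fit = fit_lognormal(history)
request = walltime_for_quantile(fit, 0.9)
print(f"request {request:.2f} h to cover about 90% of runs")
```

The fitted CDF can then be queried at any quantile, which is what turns a log of past runs into a concrete walltime request.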

# Resource estimation

Productivity is more important than performance
 - Codes change depending on each study
 - Modules are combined in different ways
 
<img src="figures/history.png" width="450" />
 
 <cite> Discard the distant past and only keep the last week/month for the prediction </cite>

# Implementation

Machine learning to identify the amount of history to use
 - Based on how frequently an application is run
 - Based on how frequently the application changes behavior
 
**(1) Use the generated sequences to submit an application on Slurm**
 - Slurm doesn't automatically re-submit an application in case of a failure
 
**(2) Reserve a "Job Lane" for multiple jobs**
 - Use the generated sequences to schedule jobs inside the Lane
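
As a rough illustration of why a sequence of requests can help, the sketch below compares the average reserved time of a single conservative request against an increasing sequence. It assumes no checkpointing (a killed run restarts from scratch) and ignores queue wait time. The walltimes, the sequences, and `job.sh` are hypothetical.

```python
def total_reserved(walltime, sequence):
    """Total time reserved until the first request long enough for the run.

    Assumes no checkpointing: a run killed at the end of a too-short
    reservation restarts from scratch in the next one.
    """
    used = 0.0
    for request in sequence:
        used += request
        if walltime <= request:
            return used
    raise ValueError("sequence does not cover this walltime")

def average_cost(history, sequence):
    """Average reserved time over past runs for a given request sequence."""
    return sum(total_reserved(w, sequence) for w in history) / len(history)

def slurm_time(hours):
    """Format hours as HH:MM:SS, a format accepted by sbatch --time."""
    seconds = round(hours * 3600)
    return f"{seconds // 3600:02d}:{seconds % 3600 // 60:02d}:{seconds % 60:02d}"

history = [1.0, 1.5, 2.0, 2.0, 3.0, 8.0]   # hypothetical walltimes (hours)
single = [8.0]                             # one conservative request
sequence = [2.0, 4.0, 8.0]                 # hypothetical increasing sequence

print(average_cost(history, single))
print(average_cost(history, sequence))
# Commands a resubmission wrapper could issue, one per failed attempt.
print([f"sbatch --time={slurm_time(t)} job.sh" for t in sequence])
```

Because Slurm does not resubmit on its own, a wrapper of this kind would have to issue the next request in the sequence after each timeout.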

# Preliminary results
## Scheduling

Direction 1: Optimize the way emerging applications use current scheduling systems
 - Automating the resource estimation provided by users 
 - Using checkpointing strategies to deal with applications being killed by the runtime
 - Developing tools to understand task-based scheduling limitations 

**Direction 2: Modernize existing schedulers to become more flexible**
 - **Provide flexible resource management**

# Flexible resource management

## Move towards online schedulers
Online schedulers work perfectly with neuroscience workloads
 - Not necessary with classic HPC

<img src="figures/scheduling.png" alt="Hybrid schedulers" width="500" />

# Very preliminary results

Simulating hybrid schedulers (stochastic batch scheduler)

<img src="figures/scheduling_result.png" width="500" />

<cite> Variability factor = ratio of the maximum variability to the execution time </cite>

# Code

Code that returns the sequence of requests used at Vanderbilt: <br />
https://github.com/anagainaru/HPCWalltime

<small> Speculative Scheduling Techniques for Stochastic HPC Applications <br />
[ICPP 2019] </small>

<small> Making Speculative Scheduling Robust to Incomplete Data <br />
[SCALA@SC 2019] </small>

<img src="https://raw.githubusercontent.com/anagainaru/ScheduleFlow/master/docs/logo.png" alt="ScheduleFlow" width="200" align="left" />

<br/><br/>
Simulator for batch schedulers:  <br />
https://github.com/anagainaru/ScheduleFlow

<small> On-the-fly scheduling vs. reservation-based scheduling for unpredictable workflows  <br />
[IJHPCA 2019] </small>

<img src="figures/ft.png" width="250" align="right" />

# Preliminary results

## Fault tolerance

**Direction 1:** Characterizing the Intrinsic Application Resiliency to Failures
 - Machine Learning Systems
 - PDE-based simulations
 - Second generation applications
 
**Direction 2:** Adapt current checkpointing/replication methods
 - Application-Level Fault Tolerance 


# PDE-Based Simulations

E.g. Two-dimensional heat flow problem

<img src="figures/spmv_error.png" width="300" align="right" valign="middle" />
<img src="figures/heat_equation.png" width="400" />

Translates to a series of SpMV that can propagate errors and degrade performance
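
A small sketch of how one corrupted element spreads through repeated SpMV. It assumes an explicit time-stepping scheme with a 5-point stencil on a 5 x 5 grid and zero boundary values. The coefficient `alpha`, the grid size, and the injected error are hypothetical choices for illustration.

```python
def heat_matrix(n, alpha=0.1):
    """Sparse explicit-step operator for the 2D heat equation on an n x n grid.

    Each row is a list of (column, value) pairs for a 5-point stencil;
    out-of-grid neighbours are dropped (zero boundary values).
    """
    rows = []
    for i in range(n):
        for j in range(n):
            row = [(i * n + j, 1.0 - 4.0 * alpha)]
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < n and 0 <= nj < n:
                    row.append((ni * n + nj, alpha))
            rows.append(row)
    return rows

def spmv(rows, u):
    """Sparse matrix-vector product y = A u."""
    return [sum(v * u[c] for c, v in row) for row in rows]

n = 5
A = heat_matrix(n)
clean = [0.0] * (n * n)
clean[12] = 1.0                      # heat source in the centre of the grid
faulty = list(clean)
faulty[0] += 0.5                     # hypothetical silent error in one element

for step in range(3):
    clean, faulty = spmv(A, clean), spmv(A, faulty)
    corrupted = sum(1 for a, b in zip(clean, faulty) if abs(a - b) > 1e-12)
    print(f"after {step + 1} SpMV: {corrupted} corrupted elements")
```

Each product pushes the error one stencil step further, so the number of affected elements grows with every iteration.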

### Failures can degrade the performance or prevent convergence

# Characterize intrinsic resilient behavior

We define a metric to characterize the rate of change within each component u_i of the solution vector
 - that corresponds to the point (x_i , y_i) in the spatial domain
 
**First resiliency gradient**
![Metric](figures/metric.png)

Similarly, we can define the **Second resiliency gradient**
 - characterize the change in acceleration

![Metric](figures/metric2.png)

<cite> We define slow and fast changing elements </cite>
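
The exact metrics are given in the figures. As a rough illustration only, the sketch below assumes the first gradient is the per-element difference between consecutive iterations and the second gradient is the difference of those differences. The threshold and the sample vectors are hypothetical.

```python
def first_gradient(history):
    """Per-element change between the last two iterations (assumed form)."""
    prev, curr = history[-2], history[-1]
    return [abs(c - p) for p, c in zip(prev, curr)]

def second_gradient(history):
    """Per-element change of the change over the last three iterations."""
    a, b, c = history[-3], history[-2], history[-1]
    return [abs((z - y) - (y - x)) for x, y, z in zip(a, b, c)]

def classify(gradients, threshold):
    """Label each element 'slow' or 'fast' changing (hypothetical threshold)."""
    return ["fast" if g > threshold else "slow" for g in gradients]

# Hypothetical solution vectors u from three consecutive iterations.
history = [
    [1.00, 0.50, 0.00],
    [1.00, 0.70, 0.25],
    [1.00, 0.80, 0.75],
]
print(classify(first_gradient(history), threshold=0.2))
print(second_gradient(history))
```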

# Failure injection

Resiliency properties when injecting failures in the slow and fast changing components of the solution vector

![Injection](figures/spMV_injection.png)

<small> (a) Relative increase in the number of iterations to convergence compared to error-free execution</small>

<small> (b) Histogram of the scale of relative error in the solution vector compared to the convergent vector in
error-free execution </small>

# Resiliency metrics in space

Similar to time but based on neighbor values

<img alt="Space metric" src="figures/patch_space.png" width="400"/>

Classifying the spatial domain into patches based on the space changing metric
 - First or second gradient

# Small and constant values

Can be interpolated 
 - based on previous value (in time)
 - based on the average neighbor values
 
![Interpolation](figures/constant.png)

<small> Percentage of elements in the solution vector that either remain constant through iterations or have small resiliency gradients that allow for an error to be smoothed out by interpolation </small>
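
The two interpolation options can be sketched as follows, assuming the solution is stored as a 2D grid and the location of the corrupted element is already known. Detection is out of scope here, and the grids are hypothetical.

```python
def repair_from_previous(previous, i, j):
    """Temporal repair: reuse the element's value from the previous iteration."""
    return previous[i][j]

def repair_from_neighbours(grid, i, j):
    """Spatial repair: average of the in-grid neighbours of element (i, j)."""
    values = [
        grid[ni][nj]
        for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
        if 0 <= ni < len(grid) and 0 <= nj < len(grid[0])
    ]
    return sum(values) / len(values)

previous = [[1.0, 1.0, 1.0], [1.0, 2.0, 1.0], [1.0, 1.0, 1.0]]
current = [[1.0, 1.2, 1.0], [1.2, 1e30, 1.2], [1.0, 1.2, 1.0]]  # centre corrupted

print(repair_from_previous(previous, 1, 1))
print(repair_from_neighbours(current, 1, 1))
```

Either repair is only reasonable for elements with small gradients, where the replaced value stays close to the true one.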

# Time-space change rate

The evolution of "patches" in the spatial domain over time for the heat-flow problem
![time space](figures/patch_time.png)

Protect only border elements (or protect them with stronger methods)

### How often to trigger a new characterization phase?

# Preliminary results

<img src="figures/interpolation.png" width="450" />

<small> Relative performance degradation compared to failure-free execution when injecting failures in the high gradient elements or in low-gradient elements (with or without interpolation) </small>

# Code

University of Florida Sparse Matrix Collection: <br/>
https://www.cise.ufl.edu/research/sparse/matrices/list_by_id.html

<img src="figures/matrix.gif" align="right" width="200" />

Code for SpMV <br/>
https://github.com/vanderbiltscl/spMV_customFI

<cite> Preliminary work, not all the code is available </cite>

# Future

<cite> Update the HPC software stack for the new generation of applications </cite>

**(1) Understand performance variability**
- Scheduling decisions (batch and task schedulers)
- Caused by fault tolerance mechanisms
- I/O congestion (connected with fault tolerance and scheduling)

**(2) Design new middleware to adapt to the needs of unpredictable applications**
- Promote communication between the application and the middleware
- Application aware fault tolerance
- Hybrid / speculative schedulers that inform the application 

<img src="https://avatars2.githubusercontent.com/u/49881432?s=200&v=4" alt="Vanderbilt" align="right" width="150" />
<img src="https://anagainaru.github.io/assets/images/favicon.png" alt="Ana Gainaru" align="right" width="150" />

# Thank you

<br />

### Questions?

<br /><br />

<cite> ana.gainaru@vanderbilt.edu </cite> 

http://www.ana-gainaru.com