

# The Parallel Programming World Beyond OpenMP

Tim Mattson
Intel Corp.

#### **Notices and Disclaimers**

Forward-Looking Statements. Statements in this presentation that refer to business outlook, plans, and expectations are forward-looking statements that involve risks and uncertainties. Words such as "anticipate," "expect," "intend," "goal," "plans," "believe," "seek," "estimate," "continue," "committed," "on-track," "positioned," "ramp," "momentum," "roadmap," "path," "pipeline," "progress," "schedule," "forecast," "likely," "guide," "potential," "next gen," "future," "may," "will," "would," "should," "could," and variations of such words and similar expressions are intended to identify such forward-looking statements. Statements that refer to or are based on estimates, forecasts, projections, uncertain events or assumptions, including statements relating to Intel's strategy and its anticipated benefits; business plans; financial projections and expectations; total addressable market (TAM) and market opportunity; manufacturing expansion and investment plans; future manufacturing capacity; future products, technology, and services, and the expected availability and benefits of such products, technology, and services, including product and manufacturing plans, goals, timelines, ramps, progress, and future product and process leadership and performance; future economic conditions; future impacts of the COVID-19 pandemic; plans and goals related to Intel's foundry business; future legislation; future capital offsets; pending or future transactions; the proposed Mobileye IPO; supply expectations including regarding industry shortages; future external foundry usage; future use of EUV and other manufacturing technologies; expectations regarding customers, including designs, wins, orders, and partnerships; projections regarding competitors; ESG goals; and anticipated trends in our businesses or the markets relevant to them, including future demand, market share, industry growth, and technology trends, also identify forward-looking statements. Such statements involve many risks and uncertainties that could cause actual results to differ materially from those expressed or implied in these forward-looking statements. Important factors that could cause actual results to differ materially are set forth in Intel's earnings release dated January 26, 2022, which is included as an exhibit to Intel's Form 8-K furnished to the SEC on such date, and in Intel's SEC filings, including the company's most recent reports on Forms 10-K and 10-Q. Copies of Intel's SEC filings may be obtained by visiting our Investor Relations website at www.intc.com or the SEC's website at www.sec.gov. All information in this presentation reflects management's views as of April 20, unless an earlier date is indicated. Intel does not undertake, and expressly disclaims any duty, to update any statement made in this presentation, whether as a result of new information, new developments or otherwise, except to the extent that disclosure may be required by law.

Intel technologies may require enabled hardware, software or service activation. No product or component can be absolutely secure. Your costs and results may vary. Product and process performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex and www.Intel.com/ProcessInnovation. Future product and process performance and other metrics are projections and are inherently uncertain.

Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

#### **Disclaimer**

- The views expressed in this talk are those of the speaker and not his employer.
- This is a speculative, academic style talk ... I am not describing, or even suggesting, <u>ANYTHING</u> about future products from Intel!!!

I work in Intel's research labs. I don't build products.

Instead, I get to poke into dark corners and pursue silly ideas ... just to make sure we don't miss anything important.

I have a really great job!!!!!

#### Hardware is diverse ... and its only getting worse!!!



Cloud







Cluster



Heterogeneous node

#### The Big Three

- In HPC, 3 programming environments dominate ... covering the major classes of hardware.
  - MPI: distributed memory systems ... though it works nicely on shared memory computers.

- OpenMP: Shared memory systems ... more recently, GPGPU too.

You are all OpenMP experts and know a great deal about multithreading

CUDA, OpenCL, Sycl, OpenACC, OpenMP ...: GPU programming (use CUDA if you don't mind locking yourself to a single vendor ... it is a really nice programming model)

• Even if you don't plan to spend much time programming with these systems ... a well rounded HPC programmer should know what they are and how they work.

#### The Big Three

If you don't know MPI, you really aren't an HPC programmer!

- In HPC, 3 programming environments dominate ... covering the major classes of hardware.
  - MPI: distributed memory systems ... though it works nicely on shared memory computers.

OpenMP: Shared memory systems ... more recently, GPGPU too.

CUDA, OpenCL, Sycl, OpenACC, OpenMP ...: GPU programming (use CUDA if you don't mind locking yourself to a single vendor ... it is a really nice programming model)

• Even if you don't plan to spend much time programming with these systems ... a well rounded HPC programmer should know what they are and how they work.

#### Parallel API's: MPI ... the Message Passing Interface



# **Programming Model: Message Passing**

- Program consists of a collection of named processes.
  - Number of processes almost always fixed at program startup time
  - Local address space per node -- NO physically shared memory.
  - Logically shared data is partitioned over local processes.
- Processes communicate by explicit send/receive pairs
  - Coordination is implicit in every communication event.
  - MPI (Message Passing Interface) is the most commonly used SW



Private memory

# How do people use MPI? The SPMD Design Pattern

A sequential program working on a data set

- •A single program working on a decomposed data set.
- •Use Node ID and numb of nodes to split up work between processes
- Coordination by passing messages.



#### An MPI program at runtime

 Typically, when you run an MPI program, multiple processes running the same program are launched ... working on their own block of data.



MPI functions work within a "<u>context</u>": MPI actions occurring in different contexts, even if they share a process group, cannot interfere with each other.

#### **MPI Hello World**

```
#include <stdio.h>
#include <mpi.h>
int main (int argc, char **argv){
    int rank, size;
    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    MPI_Comm_size (MPI_COMM_WORLD, &size);
    printf( "Hello from process %d of %d\n",
                                 rank, size );
    MPI Finalize();
    return 0;
```

#### Initializing and finalizing MPI

```
other MPI functions.
                            agrc and argv are the command line args passed
#include <stdio.h>
                             from main()
#include <mpi.h>
int main (int argc, char **argv){
    int rank, size;
    MPI_Init (&argc, &argv);
    MPI Comm rank (MPI COMM WORLD, &rank);
    MPI Comm size (MPI COMM WORLD, &size);
    printf( "Hello from process %d of %d\n",
                                  rank, size );
    MPI Finalize();
                      int MPI Finalize (void)
    return 0;
                          Frees memory allocated by the MPI library ... close
                            every MPI program with a call to MPI_Finalize
```

int MPI Init (int\* argc, char\* argv[])

Initializes the MPI library ... called before any

#### How many processes are involved?

```
int MPI_Comm_size (MPI_Comm comm, int* size)
```

- MPI\_Comm, an opaque data type called a communicator. Default context: MPI\_COMM\_WORLD (all processes)
- MPI\_Comm\_size returns the number of processes in the process group associated with the communicator

#inclu #include <mpi.h> int main (int argc, char \*\*argv){ int rank, size; MPI\_Init (&argc, &argv); MPI\_Comm\_rank (MPI\_COMM\_WORLD, &rank); MPI\_Comm\_size (MPI\_COMM\_WORLD, &size); printf( "Hello from process %d of %d\n", rank, size ); MPI Finalize(); return 0;

Communicators consist of two parts, a context and a process group.

The communicator lets one control how groups of messages interact.

Communicators support modular SW ... i.e. I can give a library module its own communicator and know that it's messages can't collide with messages originating from outside the module

## Which process "am I" (the rank)

```
int MPI_Comm_rank (MPI_Comm comm, int* rank)
```

- MPI\_Comm, an opaque data type, a communicator. Default context: MPI\_COMM\_WORLD (all processes)
- MPI\_Comm\_rank An integer ranging from 0 to "(num of procs)-1"

```
#inclu
#include <mpi.h>
int main (int argc, char **argv){
    int rank, size;
    MPI Init (&argc, &argv);
    MPI Comm rank (MPI COMM WORLD, &rank);
    MPI Comm size (MPI COMM WORLD, &size);
    printf( "Hello from process %d of %d\n",
                                 rank, size );
    MPI_Finalize();
    return 0;
```

Note that other than init() and finalize(), every MPI function has a communicator.

This makes sense .. You need a context and group of processes that the MPI functions impact ... and those come from the communicator.

#### Running the program

```
#include <stdio.h>
#include <mpi.h>
int main (int argc, char **argv){
    int rank, size;
    MPI Init (&argc, &argv);
    MPI Comm rank (MPI COMM WORLD, &rank);
    MPI Comm size (MPI COMM WORLD, &size);
    printf( "Hello from process %d of %d\n",
                                rank, size );
    MPI Finalize();
    return 0;
```

- On a 4 node cluster, I'd run this program (hello) as:
   > mpiexec -np 4 -hostfile hostf hello
- Where "hostf" is a file with the names of the cluster nodes, one to a line.
- What would this program would output?

#### Running the program

```
#include <stdio.h>
#include <mpi.h>
int main (int argc, char **argv){
    int rank, size;
    MPI_Init (&argc, &argv);
    MPI Comm rank (MPI COMM WORLD, &rank);
    MPI Comm size (MPI COMM WORLD, &size);
    printf( "Hello from process %d of %d\n",
                                rank, size );
    MPI Finalize();
    return 0;
```

On a 4 node cluster, I'd run this program (hello) as:
 > mpiexec -np 4 -hostfile hostf hello Hello from process 1 of 4
 Hello from process 2 of 4
 Hello from process 0 of 4
 Hello from process 3 of 4

 Where "hostf" is a file with the names of the cluster nodes, one to a line.

## **Bulk Synchronous Programming:**

#### A common design pattern used with MPI Programs

- Many MPI applications have few (if any) sends and receives.
   They use the following very common pattern:
  - Use the Single Program Multiple Data pattern
  - Each process maintains a local view of the global data
  - A problem broken down into phases each of which is composed of two subphases:
    - Compute on local view of data
    - Communicate to update global view on all processes (collective communication).
  - Continue phases until complete

This is a subset or the SPMD pattern sometimes referred to as the Bulk Synchronous pattern.



#### **Example Problem:** Numerical Integration



Mathematically, we know that:

$$\int_{0}^{1} \frac{4.0}{(1+x^2)} dx = \pi$$

We can approximate the integral as a sum of rectangles:

$$\sum_{i=0}^{N} F(x_i) \Delta x \approx \pi$$

Where each rectangle has width  $\Delta x$  and height  $F(x_i)$  at the middle of interval i.

#### PI Program: an example

```
static long num steps = 100000;
double step;
void main ()
        int i; double x, pi, sum = 0.0;
        step = 1.0/(double) num_steps;
        x = 0.5 * step;
        for (i=0;i\leq num steps; i++)
              x+=step;
              sum += 4.0/(1.0+x*x);
        pi = step * sum;
```

#### Pi program in MPI ... using the BSP pattern

```
#include <mpi.h>
void main (int argc, char *argv[])
       int i, my id, numprocs; double x, pi, step, sum = 0.0;
       step = 1.0/(double) num steps;
       MPI Init(&argc, &argv);
       MPI Comm Rank(MPI_COMM_WORLD, &my_id);
       MPI Comm Size(MPI COMM WORLD, &numprocs);
       my steps = num steps/numprocs;
       for (i=my id*my steps; i<(my id+1)*my_steps; i++)
                x = (i+0.5)*step;
                                              Sum values in "sum" from
                sum += 4.0/(1.0+x*x);
                                              each process and place it
                                                 in "pi" on process 0
       sum *= step;
       MPI Reduce(&sum, &pi, 1, MPI DOUBLE, MPI SUM, 0,
               MPI COMM WORLD);
```

#### Reduction

- MPI\_Reduce performs specified reduction operation on specified data from all processes in communicator, places result in process "root" only.
- MPI\_Allreduce places result in all processes (avoid unless necessary)

| Operation  | Function                   |
|------------|----------------------------|
| MPI_SUM    | Summation                  |
| MPI_PROD   | Product                    |
| MPI_MIN    | Minimum value              |
| MPI_MINLOC | Minimum value and location |
| MPI_MAX    | Maximum value              |
| MPI_MAXLOC | Maximum value and location |
| MPI_LAND   | Logical AND                |

| Operation    | Function                                          |
|--------------|---------------------------------------------------|
| MPI_BAND     | Bitwise AND                                       |
| MPI_LOR      | Logical OR                                        |
| MPI_BOR      | Bitwise OR                                        |
| MPI_LXOR     | Logical exclusive OR                              |
| MPI_BXOR     | Bitwise exclusive OR                              |
| User-defined | It is possible to define new reduction operations |

## Sending and receiving messages

- Pass a buffer which holds "count" values of MPI\_TYPE
- The data in a message to send or receive is described by a triple:
  - (address, count, datatype)
- The receiving process identifies messages with the double :
  - (source, tag)
- Where:
  - Source is the rank of the sending process
  - Tag is a user-defined integer to help the receiver keep track of different messages from a single source

MPI\_Send (buff, 100, MPI\_DOUBLE, Dest, tag, MPI\_COMM\_WORLD);



Rank of Source node

# **Blocking Send-Receive Timing Diagram**

(MPI functions return when local buffer can be used again)



It is important to post the receive before sending, for highest performance.

#### Non-Blocking Send-Receive Diagram

(MPI functions return immediately)



#### **Example: finite difference methods**

- Solve the heat diffusion equation in 1 D:
  - u(x,t) describes the temperature field
  - We set the heat diffusion constant to one
  - Boundary conditions, constant u at endpoints.

$$\frac{\partial^2 u}{\partial x^2} = \frac{\partial u}{\partial t}$$

map onto a mesh with stepsize h and k

$$x_i = x_0 + ih \qquad t_i = t_0 + ik$$

 Central difference approximation for spatial derivative (at fixed time)

$$\frac{\partial^2 u}{\partial x^2} = \frac{u_{j+1} - 2u_j + u_{j-1}}{h^2}$$

■ Time derivative at t = t<sup>n+1</sup>

$$\frac{du}{dt} = \frac{u^{n+1} - u^n}{k}$$

#### **Example: Explicit finite differences**

Combining time derivative expression using spatial derivative

at t = t<sup>n</sup>

$$\frac{u_j^{n+1} - u_j^n}{k} = \frac{u_{j+1}^n - 2u_j^n + u_{j-1}^n}{h^2}$$

Solve for u at time n+1 and step j

$$u_{j}^{n+1} = (1-2r)u_{j}^{n} + ru_{j-1}^{n} + ru_{j+1}^{n} \qquad r = \frac{k}{h^{2}}$$

The solution at  $t = t_{n+1}$  is determined explicitly from the solution at  $t = t_n$  (assume u[t][0] = u[t][N] = Constant for all t).

```
for (int t = 0; t < N_STEPS-1; ++t)
  for (int x = 1; x < N-1; ++x)
     u[t+1][x] = u[t][x] + r*(u[t][x+1] - 2*u[t][x] + u[t][x-1]);</pre>
```

 Explicit methods are easy to compute ... each point updated based on nearest neighbors. Converges for r<1/2.</li>





Pictorially, you are sliding a three point "stencil" across the domain (u) and updating the center point at each stop.

return 0;



```
int main()
                                                             Note: I don't need the
                                                            intermediate "u[t]" values
   double *u = malloc (sizeof(double) * (N));
                                                         hence "u" is just indexed by x.
   double *up1 = malloc (sizeof(double) * (N));
   initialize_data(uk, ukp1, N, P); // init to zero, set end temperatures
   for (int t = 0; t < N_STEPS; ++t){
      for (int x = 1; x < N-1; ++x)
          up1[x] = u[x] + (k / (h*h)) * (u[x+1] - 2*u[x] + u[x-1]);
                                                   A well known trick with 2 arrays so I
      temp = up1; up1 = u; u = temp;
                                                   don't overwrite values from step k-1
                                                   as I fill in for step k
```



```
How would
int main()
                                                       you parallelize
  double *u = malloc (sizeof(double) * (N));
                                                       this program?
  double *up1 = malloc (sizeof(double) * (N));
  initialize_data(uk, ukp1, N, P); // init to zero, set end temperatures
  for (int t = 0; t < N_STEPS; ++t){
     for (int x = 1; x < N-1; ++x)
         up1[x] = u[x] + (k / (h*h)) * (u[x+1] - 2*u[x] + u[x-1]);
     temp = up1; up1 = u; u = temp;
return 0;
```

• Start with our original picture of the problem ... a one dimensional domain with end points set at a fixed temperature.



• Break it into chunks assigning one chunk to each process.



• Each process works on it's own chunk ... sliding the stencil across the domain to updates its own data.



 What about the ends of each chunk ... where the stencil will run off the end and hence have missing values for the computation?



• We add ghost cells to the ends of each chunk, update them with the required values from neighbor chunks at each time step ... hence giving the stencil everything it needs on any given chunk to update all of its values.



## **Design Pattern: Geometric Decomposition**

#### • Use when:

 The problem is organized around a central data structure that can be decomposed into smaller segments (chunks) that can be updated concurrently.

#### Solution

- Typically, the data structure is updated iteratively where a new value for one chunk depends on neighboring chunks.
- The computation breaks down into three components: (1) exchange boundary data, (2) update the interiors or each chunk, and (3) update boundary regions. The optimal size of the chunks is dictated by the properties of the memory hierarchy.

#### Note:

 This pattern is often used with the Structured Mesh and linear algebra computational strategy pattern.

## The Geometric Decomposition Pattern

This is an instance of a very important design pattern ... the Geometric decomposition pattern.



#### **Heat Diffusion MPI Example**

```
MPI Init (&argc, &argv);
MPI Comm size (MPI COMM WORLD, &P);
MPI Comm rank (MPI COMM WORLD, &myID);
double *u = malloc (sizeof(double) * (2 + N/P)) // include "Ghost Cells"
double *up1 = malloc (sizeof(double) * (2 + N/P)); // to hold values
                                                    // from my neighbors
initialize_data(uk, ukp1, N, P);
for (int t = 0; t < N STEPS; ++t){
  if (myID != 0) MPI_Send (&u[1], 1, MPI_DOUBLE, myID-1, 0, MPI_COMM_WORLD);
  if (myID != P-1) MPI Recv (&u[N/P+1], 1, MPI DOUBLE, myID+1, 0, MPI COMM WORLD, &status);
  if (myID != P-1) MPI Send (&u[N/P], 1, MPI DOUBLE, myID+1, 0, MPI COMM WORLD);
  if (myID != 0) MPI_Recv (&u[0], 1, MPI_DOUBLE, myID-1, 0,MPI_COMM WORLD, &status);
  for (int x = 2; x \leftarrow N/P; ++x)
    up1[x] = u[x] + (k / (h*h)) * (u[x+1] - 2*u[x] + u[x-1]);
  if (myID != 0)
    up1[1] = u[1] + (k / (h*h)) * (u[1+1] - 2*u[1] + u[1-1]);
  if (myID != P-1)
    up1[N/P] = u[N/P] + (k/(h*h)) * (u[N/P+1] - 2*u[N/P] + u[N/P-1]);
  temp = up1; up1 = u; u = temp;
} // End of for (int t ...) loop
MPI Finalize();
return 0;
```

We write/explain this part first and then address the communication and data structures

#### **Heat Diffusion MPI Example**

```
/* continued from previous slide */
                                         Temperature fields using local data and values
                                         from ghost cells.
  for (int x = 2; x \leftarrow N/P; ++x)
    up1[x] = u[x] + (k / (h*h)) * (u[x+1] - 2*u[x] + u[x-1]);
  if (myID != 0)
    up1[1] = u[1] + (k / (h*h)) * (u[1+1] - 2*u[1] + u[1-1]);
                                                                     u[0] and u[N/P+1]
                                                                        are the ghost
  if (myID != P-1)
    up1[N/P] = u[N/P] + (k/(h*h)) * (u[N/P+1] - 2*u[N/P] + u[N/P-1]);
  temp = up1; up1 = u; u = temp;
                                        Note I was lazy and assume N was evenly
} // End of for (int t ...) loop
                                        divided by P. Clearly, I'd never do this in a
                                        "real" program.
MPI_Finalize();
return 0;
```

cells

#### **Heat Diffusion MPI Example**

```
MPI_Init (&argc, &argv);
                                        1D PDE solver ... the simplest "real" message
MPI_Comm_size (MPI_COMM_WORLD, &P); passing code I can think of. Note: edges of
MPI_Comm_rank (MPI_COMM_WORLD, &myID); domain held at a fixed temperature
double *u = malloc (sizeof(double) * (2 + N/P)) // include "Ghost Cells"
double *up1 = malloc (sizeof(double) * (2 + N/P)); // to hold values
                                                     // from my neighbors
initialize data(uk, ukp1, N, P);
for (int t = 0; t < N STEPS; ++t){
  if (myID != 0) Send my "right" boundary value to my "right' neighbor
    MPI_Send (&u[1], 1, MPI_DOUBLE, myID-1, 0, MPI_COMM_WORLD);
  if (myID != P-1) Receive my "left" ghost cell from my "left' neighbor
    MPI_Recv (&u[N/P+1], 1, MPI_DOUBLE, myID+1, 0, MPI_COMM_WORLD, &status);
  if (myID != P-1) Send my "left" boundary value to my "left' neighbor
    MPI_Send (&u[N/P], 1, MPI_DOUBLE, myID+1, 0, MPI_COMM_WORLD);
  if (myID != 0) Receive my "right" ghost cell from my "right' neighbor
    MPI Recv (&u[0], 1, MPI DOUBLE, myID-1, 0, MPI COMM WORLD, &status);
/* continued on next slide */
```

## MPI is huge!!!

- MPI has over 430 functions!!!
  - Many forms of message passing
  - Full range of collectives (such as reduction)
  - dynamic process management
  - Shared memory
  - and much more
- Most programs, however use around a dozen different constructs ... so it's not as hard to learn as it may seem.

# Does a shared address space make programming easier?



Proving that a shared address space program using semaphores is race free is an NP-complete problem\*

## The Big Three

- In HPC, 3 programming environments dominate ... covering the major classes of hardware.
  - MPI: distributed memory systems ... though it works nicely on shared memory computers.

OpenMP: Shared memory systems ... more recently, GPGPU too.



• Even if you don't plan to spend much time programming with these systems ... a well rounded HPC programmer should know what they are and how they work.

#### **OpenMP Basic Definitions:** Basic Solution Stack



For the OpenMP Common Core, we focus on Symmetric Multiprocessor Case .... i.e., lots of threads with "equal cost access" to memory

#### **OpenMP Basic Definitions:** Solution stack



45

# A Generic GPU (following Hennessey and Patterson)



## A Generic GPU (following Hennessey and Patterson)



#### The "BIG idea" of GPU programming

#### Traditional loops

```
void
trad mul(int n,
         const float *a,
         const float *b,
         float *c)
  int i;
  for (i=0; i<n; i++)
    c[i] = a[i] * b[i];
```

#### **Data Parallel OpenCL**

```
kernel void
dp mul(global const float *a,
       global const float *b,
       global float *c)
  int id = get global id(0);
 c[id] = a[id] * b[id];
 // execute over "n" work-items
```

This is just the kernel code. The host code running on the CPU is much more complicated.

# How do we execute code on a GPU: The SIMT model (Single Instruction Multiple Thread)

 Turn source code into a scalar work-item

```
extern void reduce( __local float*, __global float*);

__kernel void pi( const int niters, float step_size,
    __local float* l_sums, __global float* p_sums)
{
    int n_wrk_items = get_local_size(0);
    int loc_id = get_local_id(0);
    int grp_id = get_group_id(0);
    float x, accum = 0.0f; int i,istart,iend;

istart = (grp_id * n_wrk_items + loc_id) * niters;
    iend = istart+niters;

for(i= istart; i<iend; i++){
        x = (i+0.5f)*step_size; accum += 4.0f/(1.0f+x*x); }

l_sums[local_id] = accum;
    barrier(CLK_LOCAL_MEM_FENCE);
    reduce(l_sums, p_sums);
}</pre>
```

This is OpenCL kernel code ... the sort of code the OpenMP compiler generates on your behalf

2. Map work-items onto an N dim index space.



4. Run on hardware

designed around

# **GPU** terminology is Broken (sorry about that)

| Hennessy and Patterson           | CUDA                     | OpenCL               |
|----------------------------------|--------------------------|----------------------|
| Multithreaded SIMD Processor     | Streaming multiprocessor | Compute Unit         |
| SIMD Thead Scheduler             | Warp Scheduler           | Work-group scheduler |
| SIMD Lane                        | CUDA Core                | Processing Element   |
| GPU Memory                       | Global Memory            | Global Memory        |
| Private Memory                   | Local Memory             | Private Memory       |
| Local Memory                     | Shared Memory            | Local Memory         |
| Vectorizable Loop                | Grid                     | NDRange              |
| Sequence of SIMD Lane operations | CUDA Thread              | work-item            |
| A thread of SIMD instructions    | Warp                     | sub-group            |

# Executing a program on CPUs and GPUs





One work-group per compute-unit executing

# Executing a program on CPUs and GPUs



One work-group per compute-unit executing

# CPU/GPU execution modes!



For a CPU, the threads are all active and able to make forward progress.

For a GPU, any given work-group might be in the queue waiting to execute.

#### A Generic Host/Device Platform Model



- One Host and one or more Devices
  - Each Device is composed of one or more Compute Units
  - Each Compute Unit is divided into one or more *Processing Elements*
- Memory divided into host memory and device memory

# Running code on the GPU: The target construct and default data movement



## **Default Data Sharing: example**

```
1. Variables created in host
int main(void) {
                                                 memory.
 int N = 1024;
 double A[N], B[N];
                                      2. Scalar N and stack arrays
                                      A and B are copied to device
                                           memory. Execution
 #pragma omp target
                                          transferred to device.
                                      3. ii is private on the device
                                        as it's declared within the
   for (int ii = 0; ii < N; ++ii) {
                                               target region
     A[ii] = A[ii] + B[ii];
                                       4. Execution on the device.
                                       5. stack arrays A and B are
                                      copied from device memory back to the host. Host
 } // end of target region
                                           resumes execution.
```

#### So lets run code in parallel on the device

```
int main(void) {
 int N = 1024;
  double A[N], B[N];
 #pragma omp target
   #pragma omp loop
   for (int ii = 0; ii < N; ++ii) {
     A[ii] = A[ii] + B[ii];
  } // end of target region
```

#### The loop construct tells the compiler:

"this loop will execute correctly if the loop iterations run in any order. You can safely run them concurrently. And the loop-body doesn't contain any OpenMP constructs. So do whatever you can to make the code run fast"

The loop construct is a declarative construct. You tell the compiler what you want done but you DO NOT tell it how to "do it". This is new for OpenMP

# What about pointers? implicit movement with a target region

- Pointers and their data:
  - Example: arrays allocated on the heap
    - double \*A = malloc(sizeof(double)\*1000);
  - The pointer value will be mapped.
  - But the data it points to will not be mapped by default.

#### **Explicit Data Sharing**

- We explicitly control the movement of data using the map clause.
- Data allocated on the heap needs to be explicitly copied to/from the device:

```
int main(void) {
  int ii=0, N = 1024;
  int* A = malloc(sizeof(int)*N);

#pragma omp target
  {
    // N, ii and A all exist here
    // The data that A points to (*A , A[ii]) DOES NOT exist here!
  }
}
```

## **Controlling data movement**

```
int i, a[N], b[N], c[N];
#pragma omp target map(to:a,b) map(tofrom:c)
```

Data movement defined from the *host* perspective.

- The various forms of the map clause
  - map(to:list): On entering the region, variables in the list are initialized on the device using the original values from the host (host to device copy).
  - map(from:list): At the end of the target region, the values from variables in the list are copied into the original variables (device to host copy). On entering the region, initial value of the variable is not initialized.
  - map(tofrom:list): the effect of both a map-to and a map-from (host to device copy at start of region, device to host copy at end)
  - map(alloc:list): On entering the region, data is allocated and uninitialized on the device.
  - map(list): equivalent to map(tofrom:list).
- For pointers you must use array section notation ...
  - map(to:a[0:N]). Notation is A[lower-bound : length]

## Moving arrays with the map clause

```
int main(void) {
  int N = 1024;
  int^* \mathbf{A} = malloc(sizeof(int)^* \mathbf{N});
  #pragma omp target map(A[0:N])
   // N, ii and A all exist here
   // The data that A points to DOES exist here!
```

Default mapping map(tofrom: A[0:N])

Copy at start and end of **target** region.

#### Our running example: Jacobi solver

- An iterative method to solve a system of linear equations
  - Given a matrix A and a vector b find the vector x such that Ax=b
- The basic algorithm:
  - Write A as a lower triangular (L), upper triangular (U) and diagonal matrix
     Ax = (L+D+U)x = b
  - Carry out multiplications and rearrange

$$Dx=b-(L+U)x \rightarrow x = (b-(L+U)x)/D$$

Iteratively compute a new x using the x from the previous iteration

$$X_{new} = (b-(L+U)x_{old})/D$$

- Advantage: we can easily test if the answer is correct by multiplying our final x by A and comparing to b
- Disadvantage: It takes many iterations and only works for diagonally dominant matrices

#### **Jacobi Solver**

Iteratively update xnew until the value stabilizes (i.e. change less than a preset TOL)

```
<<< allocate and initialize the matrix A >>>
<< and vectors x1, x2 and b
                                       >>>
while((conv > TOL) && (iters<MAX ITERS))
   iters++;
  for (i=0; i<Ndim; i++){
     xnew[i] = (TYPE) 0.0;
     for (j=0; j<Ndim;j++){
        if(i!=j)
         xnew[i]+=A[i*Ndim + i]*xold[i];
     xnew[i] = (b[i]-xnew[i])/A[i*Ndim+i];
```

```
// test convergence
   conv = 0.0;
   for (i=0; i<Ndim; i++){
     tmp = xnew[i]-xold[i];
     conv += tmp*tmp;
   conv = sqrt((double)conv);
  // swap pointers for next
  // iteration
  TYPE* tmp = xold;
  xold = xnew;
  xnew = tmp;
} // end while loop
```

## Jacobi Solver (Par Targ, 1/2)

```
while((conv > TOL) && (iters<MAX_ITERS))</pre>
   iters++;
#pragma omp target map(tofrom:xnew[0:Ndim],xold[0:Ndim]) \
              map(to:A[0:Ndim*Ndim], b[0:Ndim])
#pragma omp loop
for (i=0; i<Ndim; i++){
     xnew[i] = (TYPE) 0.0;
     for (j=0; j<Ndim; j++){
        if(i!=j)
         xnew[i]+=A[i*Ndim + i]*xold[i];
     xnew[i] = (b[i]-xnew[i])/A[i*Ndim+i];
```

# Jacobi Solver (Par Targ, 2/2)

```
// test convergence
  conv = 0.0;
#pragma omp target map(to:xnew[0:Ndim],xold[0:Ndim]) \
                         map(tofrom:conv)
#pragma omp loop private(i,tmp) reduction(+:conv)
for (i=0; i<Ndim; i++){
     tmp = xnew[i]-xold[i];
     conv += tmp*tmp;
  conv = sqrt((double)conv);
  TYPE* tmp = xold;
  xold = xnew;
  xnew = tmp;
} // end while loop
```

This worked but the performance was awful. Why?

| System                 | Implementation         | Ndim = 4096 |
|------------------------|------------------------|-------------|
| NVIDA®<br>K20X™<br>GPU | Target dir per<br>loop | 131.94 secs |

Cray® XC40™ Supercomputer running Cray® Compiling Environment 8.5.3. Intel® Xeon ® CPU E5-2697 v2 @ 2.70GHz with 32 GB DDR3. NVIDIA® Tesla® K20X, 6GB.

#### **Data movement dominates!!!**

```
while((conv > TOLERANCE) && (iters<MAX_ITERS))
                                                     Typically over 4000 iterations!
 { iters++;
  xnew = iters % s ? x2 : x1:
  xold = iters \% s ? x1 : x2:
                                                                For each iteration, copy to device
  #pragma omp target map(tofrom:xnew[0:Ndim],xold[0:Ndim]) \
                                                                (3*Ndim+Ndim<sup>2</sup>)*sizeof(TYPE) bytes
             map(to:A[0:Ndim*Ndim], b[0:Ndim])
   #pragma omp loop private(i,j)
   for (i=0; i<Ndim; i++){
      xnew[i] = (TYPE) 0.0;
      for (j=0; j<Ndim; j++){
        if(i!=j)
         xnew[i]+= A[i*Ndim + j]*xold[j];
      xnew[i] = (b[i]-xnew[i])/A[i*Ndim+i];
                                                                For each iteration, copy from device
// test convergence
                                                                2*Ndim*sizeof(TYPE) bytes
  conv = 0.0;
  #pragma omp target map(to:xnew[0:Ndim],xold[0:Ndim]) \
                      map(tofrom:conv)
                                                                          For each iteration, copy to
    #pragma loop reduction(+:conv)
                                                                          device
    for (i=0; i<Ndim; i++){
                                                                          2*Ndim*sizeof(TYPE) bytes
      tmp = xnew[i]-xold[i];
      conv += tmp*tmp;
  conv = sqrt((double)conv);
```

#### **Target data directive**

The target data construct creates a target data region
 ... use map clauses for explicit data management



# Jacobi Solver (Par Target Data, 1/2)

```
#pragma omp target data map(tofrom:x1[0:Ndim],x2[0:Ndim]) \
              map(to:A[0:Ndim*Ndim], b[0:Ndim],Ndim)
while((conv > TOL) && (iters<MAX ITERS))
 { iters++;
#pragma omp target
#pragma omp loop private(j) firstprivate(xnew,xold)
  for (i=0; i<Ndim; i++){
     xnew[i] = (TYPE) 0.0;
     for (j=0; j<Ndim;j++){
       if(i!=j)
        xnew[i]+=A[i*Ndim + i]*xold[i];
     xnew[i] = (b[i]-xnew[i])/A[i*Ndim+i];
```

# Jacobi Solver (Par Target Data, 2/2)

```
// test convergence
conv = 0.0;
#pragma omp target map(tofrom: conv)
#pragma omp loop private(tmp) firstprivate(xnew,xold) reduction(+:conv)
  for (i=0; i<Ndim; i++){
     tmp = xnew[i]-xold[i];
     conv += tmp*tmp;
// end target region
conv = sqrt((double)conv);
  TYPE* tmp = xold;
  xold = xnew;
  xnew = tmp;
} // end while loop
```

| System   | Implementation                | Ndim = 4096 |
|----------|-------------------------------|-------------|
| K30 V TM | Target dir per loop           | 131.94 secs |
|          | Above plus target data region | 18.37 secs  |

## **Single Instruction Multiple Data**

- Individual work-items of a warp start together at the same program address
- Each work-item has its own instruction address counter and register state
  - Each work-item is free to branch and execute independently
  - Supports the SPMD pattern.
- Branch behavior
  - Each branch will be executed serially
  - Work-items not following the current branch will be disabled



#### **Branching**

#### **Conditional execution**

```
// Only evaluate expression
// if condition is met
if (a > b)
{
   acc += (a - b*c);
}
```

#### **Selection and masking**

```
// Always evaluate expression
// and mask result
temp = (a - b*c);
mask = (a > b ? 1.f : 0.f);
acc += (mask * temp);
```

#### Coalescence

- Coalesce to combine into one
- Coalesced memory accesses are key for high bandwidth
- Simply, it means, if thread i
  accesses memory location n then
  thread i+1 accesses memory
  location n+1
- In practice, it's not quite as strict...

```
for (int id = 0; id < size; id++)
 // ideal
    float val1 = memA[id];
 // still pretty good
    const int c = 3;
    float val2 = memA[id + c];
 // stride size is not so good
    float val3 = memA[c*id];
 // terrible
    const int loc =
      some strange func(id);
    float val4 = memA[loc];
```

#### Jacobi Solver (Target Data/branchless/coalesced mem, 1/2)

```
#pragma omp target data map(tofrom:x1[0:Ndim],x2[0:Ndim]) \
              map(to:A[0:Ndim*Ndim], b[0:Ndim],Ndim)
while((conv > TOL) && (iters<MAX ITERS))
 { iters++;
#pragma omp target
    #pragma omp loop private(j)
  for (i=0; i<Ndim; i++){
     xnew[i] = (TYPE) 0.0;
     for (j=0; j<Ndim;j++){
        xnew[i]+= (A[j*Ndim + i]*xold[j])*((TYPE)(i != j));
     xnew[i] = (b[i]-xnew[i])/A[i*Ndim+i];
```

```
We replaced the original code with a poor memory access pattern xnew[i]+= (A[i*Ndim + j]*xold[j])
With the more efficient xnew[i]+= (A[j*Ndim + i]*xold[j])
```

### Jacobi Solver (Target Data/branchless/coalesced mem, 2/2)

```
// test convergence
   conv = 0.0;
#pragma omp target map(tofrom: conv)
#pragma omp loop private(tmp) reduction(+:conv)
  for (i=0; i<Ndim; i++){
     tmp = xnew[i]-xold[i];
     conv += tmp*tmp;
conv = sqrt((double)conv);
  TYPE* tmp = xold;
  xold = xnew;
  xnew = tmp;
} // end while loop
```

| System          | Implementation                      | Ndim = 4096 |
|-----------------|-------------------------------------|-------------|
| NVIDA®<br>K20X™ | Target dir per<br>loop              | 131.94 secs |
| GPU             | Above plus<br>target data<br>region | 18.37 secs  |
|                 | Above plus reduced branching        | 13.74 secs  |
|                 | Above plus improved mem access      | 7.64 secs   |

## The loop construct is great, but sometimes you want more control.

#### Our host/device Platform Model and OpenMP



blocks of loop iterations to teams.

76

#### teams and distribute constructs

#### The teams construct

- Similar to the parallel construct
- It starts a league of thread teams
- Each team in the league starts as one initial thread a team of one
- Threads in different teams cannot synchronize with each other
- The construct must be "perfectly" nested in a target construct

#### The distribute construct

- Similar to the **for** construct
- Loop iterations are workshared across the initial threads in a league
- No implicit barrier at the end of the construct
- dist\_schedule(kind[, chunk\_size])
  - If specified, scheduling kind must be static
  - Chunks are distributed in round-robin fashion in chunks of size chunk\_size
  - If no chunk size specified, chunks are of (almost) equal size; each team receives at least one chunk

### Create a league of teams and distribute a loop among them

- teams construct
- distribute construct

```
#pragma omp target
#pragma omp teams
#pragma omp distribute
for (i=0;i<N;i++)
...
```



- Transfer execution control to MULTIPLE device initial threads
- Workshare loop iterations across the initial threads.

## Create a league of teams and distribute a loop among them and run each team in parallel with its partition of the loop

- teams distribute
- parallel for simd

#pragma omp target
#pragma omp teams
#pragma omp distribute
#pragma omp parallel for simd
for (i=0;i<N;i++)</pre>



- Transfer execution control to MULTIPLE device initial threads
  - Workshare loop iterations across the initial threads (teams distribute)
- Each initial thread becomes the primary\* thread in a thread team
  - Workshare loop iterations across the threads in a team (parallel for)

## Create a league of teams and distribute a loop among them and run each team in parallel with its partition of the loop

- teams distribute
- parallel for simd

Works with nested loops as well

```
#pragma omp target
#pragma omp teams distribute
for (i=0;i<N;i++)
#pragma omp parallel for simd
for (j=0;j<M;i++)
...
```

host thread



- Transfer execution control to MULTIPLE device initial threads
  - Workshare loop iterations across the initial threads (teams distribute)
- Each initial thread becomes the primary\* thread in a thread team
  - Workshare loop iterations across the threads in a team (parallel for)

#### SIMT Programming models: it's more than just OpenMP

#### • CUDA:

- Released ~2006. Made GPGPU programming "mainstream" and continues to drive innovation in SIMT programming.
  - Downside: proprietary to NVIDIA

#### OpenCL:

- Open Standard for SIMIT programming created by Apple, Intel, NVIDIA, AMD, and others. 1st release in 2009.
- Supports CPUs, GPUs, FPGAs, and DSP chips. The leading cross platform SIMT model.
  - Downside: extreme portability means verbose API. Painfully low level especially for the host-program.

#### Sycl:

- C++ abstraction layer implements SIMT model with kernels as lambdas. Closely aligned with OpenCL. 1st release 2014
  - Downside: Cross platform implementations only emerging recently.

#### Directive driven programming models:

- OpenACC: they split from an OpenMP working group to create a competing directive driven API emphasizing descriptive (rather than prescriptive) semantics.
  - Downside: NOT an Open Standard. Controlled by NVIDIA.
- OpenMP: Mixes multithreading and SIMT. Semantics are prescriptive which makes it more verbose. A truly Open standard supported by all the key GPU players.
  - Downside: Poor compiler support so far ... but that will change over the next couple years.

#### **Vector addition with CUDA**

```
// Compute sum of length-N vectors: C = A + B
                                                                                CUDA kernel as
               void global
                                                                                   function
               vecAdd (float* a, float* b, float* c, int N) {
                   int i = blockIdx.x * blockDim.x + threadIdx.x;
                   if (i < N) c[i] = a[i] + b[i];
                                                                             Unified shared
               int main () {
                                                                           memory ... allocate
                   int N = \dots;
                                                                           on host, visible on
                   float *a, *b, *c;
                                                                               device too
                   cudaMalloc (&a, sizeof(float) * N);
Enqueue the kernel
to execute on the
                  // ... allocate other arrays (b and c), fill with data
      Grid
                 // Use thread blocks with 256 threads each
                   vecAdd <<< (N+255)/256, 256 >>> (a, b, c, N);
```

#### **Vector addition with SYCL**

```
// Compute sum of length-N vectors: C = A + B
                #include <CL/sycl.hpp>
                int main () {
                                                                                      Unified shared
                    int N = \dots;
                                                                                    memory ... allocate
Create a queue
                    float *a, *b, *c;
                                                                                     on host, visible on
  for SYCL
                                                                                        device too

    sycl::queue q;
 commands
                    *a = (float *)sycl::malloc_shared(N * sizeof(float), q);
                  // ... allocate other arrays (b and c), fill with data
                         q.parallel_for(sycl::range<1>{N},
                                                                                Kernel as a C++
                                   [=](sycl::id<1> i) {
                                                                                Lambda function
                                     c[i] = a[i] + b[i];
                                   });
                                                                               [=] means capture external
                                                                                 variables by value.
                          q.wait();
```

### **Vector addition with OpenACC**

•Let's add two vectors together .... C = A + B

Host waits here until the kernel is done. Then the output array c is copied back to the host.

```
Assure the
void vadd(int n,
                                    compiler that c is
          const float *a,
                                     not aliased with
          const float *b,
                                     other pointers
          float *restrict c)
                                       Turn the loop
  int i;
                                       into a kernel,
 #pragma acc parallel loop ←
                                       move data to a
  for (i=0; i<n; i++)
                                        device, and
    c[i] = a[i] + b[i];
                                        launch the
                                          kernel.
int main(){
float *a, *b, *c; int n = 10000;
// allocate and fill a and b
    vadd(n, a, b, c);
```

#### A more complicated example: Jacobi iteration: OpenACC (GPU)

```
#pragma acc data copy(A), create(Anew) 
while (err>tol && iter < iter max) {</pre>
                                                     host)
   err = 0.0;
   #pragma acc parallel loop reduction(max:err)
   for(int j=1; j< n-1; j++) {
      for(int i=1; i<M-1; i++) {
         Anew[j][i] = 0.25* (A[j][i+1] + A[j][i-1]+
                              A[j-1][i] + A[j+1][i]);
         err = max(err,abs(Anew[j][i] - A[j][i]));
    #pragma acc parallel loop
    for(int j=1; j< n-1; j++){
      for(int i=1; i<M-1; i++) {
         A[j][i] = Anew[j]i];
                       Copy A back out to host
    iter ++;
                          ... but only once
```

Create a data region on the GPU. Copy A once onto the GPU, and create Anew on the device (no copy from host)

#### A more complicated example:

#### Jacobi iteration: OpenMP target directives

```
region on the
#pragma omp target data map(A) map(alloc:Anew)_
                                                        GPU. Map A
while (err>tol && iter < iter max){</pre>
                                                       and Anew onto
   err = 0.0;
                                                       the target device
   #pragma target
   #pragma omp teams loop reduction(max:err)
   for(int j=1; j< n-1; j++) {
      for(int i=1; i<M-1; i++) {
         Anew[j][i] = 0.25* (A[j][i+1] + A[j][i-1]+
                               A[j-1][i] + A[j+1][i]);
         err = max(err,abs(Anew[j][i] - A[j][i]));
    #pragma omp target
    #pragma omp teams loop
    for(int j=1; j< n-1; j++){
      for(int i=1; i<M-1; i++) {
         A[j][i] = Anew[j]i];
    iter ++;
                Copy A back out to host
                   ... but only once
```

Create a data

#### Why so many ways to do the same thing?

- The parallel programming model people have failed you ...
  - It's more fun to create something new in your own closed-community that work across vendors to create a portable API
- The hardware vendors have failed you ...
  - Don't you love my "walled garden"? It's so nice here, programmers, just don't even think of going to some other platform since your code is not portable.
- The standards community has failed you ...
  - Standards are great, but they move too slow. OpenACC stabbed OpenMP in the back and I'm pissed, but their comments at the time were spot-on (OpenMP was moving so slow ... they just couldn't wait).
- The applications community failed themselves ...
  - If you don't commit to a standard and use "the next cool thing" you end up with the diversity of overlapping options we have today. Think about what happened with OpenMP and MPI.



### If you care about power, the world is heterogeneous?

Specialized processors doing operations suited to their architecture are more efficient than general purpose processors.



Hence, future systems will be increasingly heterogeneous ... GPUs, CPUs, FPGAs, and a wide range of accelerators

Source: Suyash Bakshi and Lennart Johnsson, "A Highly Efficient SGEMM Implementation using DMA on the Intel/Movidius Myriad-2. IEEE International Symposium on Computer Architecture and High Performance Computing, 2020

## Offload vs. Heterogeneous computing

- Offload: The CPU moves work to an accelerator and waits for the answer.
- Heterogeneous Computing: Run sub-problems in parallel on the hardware best suited to them.



#### Example: Single-cell RNA-Seq benchmark (SCANPY)

- SCANPY ... a widely used tool for studying gene expression. All data are elapsed time in seconds
- We started with results from an Nvidia blog (Example 2 from <u>link</u>), optimized code for one socket of Intel® Xeon® 8380 CPU and then "simulated" heterogeneous computing result by taking the faster of CPU and GPU execution times.

| Pipeline stages              | 64 vCPUs<br>n1-highmem-64<br>(off-the-shelf Python) | A100 40Gb<br>(Clara Parabricks) | ICX-1s, 40 cores<br>(optimized by Intel) | "Simulated" hetercRedacted A100 & ICS-1s 40 cores |
|------------------------------|-----------------------------------------------------|---------------------------------|------------------------------------------|---------------------------------------------------|
| Data Loading & Preprocessing | 1120                                                | 475                             | 15.7                                     | Imagine                                           |
| PCA                          | 44                                                  | 17.8                            | 5.0                                      | mixing the                                        |
| T-SNE                        | 6509                                                | 37                              | 205.6                                    | best of the                                       |
| K-means (single iteration)   | 148                                                 | 2                               | 7.1                                      | CPU and GPU                                       |
| KNN                          | 154                                                 | 62                              | 59.8                                     | numbers.                                          |
| UMAP                         | 2571                                                | 21                              | 84.5                                     | What <sup>2</sup> would                           |
| Louvain clustering           | 1153                                                | 2.4                             | 6.0                                      | th'e                                              |
| Leiden clustering            | 6345                                                | 1.7                             | 28.4                                     | performance                                       |
| Reanalysis of subgroup       | 255                                                 | 17.9                            | 22.5                                     | look like?                                        |
| Rest                         | 39                                                  | 49.2                            | 49.0                                     | 49.0                                              |
| End-to-End runtime           | 18338                                               | 686                             | 483.6                                    | 211.5                                             |

#### **Lessons learned:**

- Be careful comparing unoptimized python to hand-tuned CUDA code
- GPUs are great. So are CPUs if you fully utilize all the cores and vector units.
- What you really want is the best of both worlds. You want heterogeneous computing!

See Backup for workloads and configurations. Results may vary.

Clara Parabricks: Nvidia solution stack built on RAPIDS for healthcare applications

https://github.com/clara-parabricks/rapids-single-cell-examples

github repository as of Dec 16, 2020

This column shows the potential of heterogenous computing. We ignored extra communication and synchronization overhead, so actual runtimes would be slightly greater.

Source: Github repository as of Dec 16, 2020 - Example 2: Single-cell RNA-seq of 1.3 Million Mouse Brain Cells comparing CPU (n1-highmem-64 64 vCPUs) vs GPU (n1-highmem-66. <a href="https://github.com/clara-parabricks/rapids-single-cell-examples">https://github.com/clara-parabricks/rapids-single-cell-examples</a>. Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.

Third party names are the property of their owners.

#### Five Epochs of Distributed Computing\*

| Epoch starting date | Defining limitations                    | Application                                    | Interaction time and Network performance | Capability           |
|---------------------|-----------------------------------------|------------------------------------------------|------------------------------------------|----------------------|
| First<br>1970       | Rare connections to expensive computers | FTP, telnet, email                             | 100 ms<br>Low bandwidth high<br>latency  | People to computers  |
| Second<br>1984      | I/O wall, disks<br>can't keep up        | RPC,<br>Client Server                          | 10 ms<br>10 mbps                         | Computer to computer |
| Third<br>1990       | Networking wall                         | MPP HPC, three-<br>tier datacenter<br>networks | 1 ms<br>100 mbs → 1 Gbs                  | Services to services |
| Fourth<br>2000      | Dennard scaling wall per core plateau   | Web search, planet-scale services              | 100 μs<br>10 Gbps<br>flash               | People to people     |
| Fifth<br>2015       | Per socket wall accelerators take off   | Machine<br>Learning, data<br>centric computing | 10 μs<br>200 Gbps → 1 Tbps               | People to insights   |

<sup>\*</sup>The five Epochs of distributed computing, Amin Vahdat of Google: SIGCOMM Lifetime achievement award keynote, 2020.

## The Eight Fallacies of Distributed Computing

(Peter Deutsch of Sun Microsystems, 1994 ... item 8 added in 1997 by James Gosling)

Essentially everyone, when they first build a distributed application, makes the following eight assumptions. All prove to be false in the long run and all cause *big* trouble and *painful* learning experiences.

- 1. The network is reliable
- 2. Latency is zero
- 3. Bandwidth is infinite
- 4. The network is secure
- 5. Topology doesn't change
- 6. There is one administrator
- 7. Transport cost is zero
- 8. The network is homogeneous

https://en.wikipedia.org/wiki/Fallacies\_of\_distributed\_computing

## The Eight Fallacies of Distributed Computing

(Peter Deutsch of Sun Microsystems, 1994 ... item 8 added in 1997 by James Gosling)

Essentially everyone, when they first build a distributed application, makes the following eight assumptions. All prove to be false in the long run and all cause *big* trouble and *painful* learning experiences.

- 1. The network is reliable
- Latency is low and fixed
- 3. Bandwidth is high and fixed
- 4. The network is secure
- 5. Topology doesn't change
- 6. There is one administrator
- 7. Transport cost is negligible
- 8. The network is homogeneous

https://en.wikipedia.org/wiki/Fallacies\_of\_distributed\_computing

## The Eight Fallacies of Distributed Computing

(Peter Deutsch of Sun Microsystems, 1994 ... item 8 added in 1997 by James Gosling)

Essentially everyone, when they first build a distributed application, makes the following eight assumptions. All prove to be false in the long run and all cause *big* trouble and *painful* learning experiences.

#### **Cloud**

- X. The network is reliable
- X. Latency is low and fixed
- X. Bandwidth is high and fixed
- X. The network is secure
- X. Topology doesn't change
- X. There is one administrator
- X. Transport cost is negligible
- X. The network is homogeneous

#### **HPC Cluster**

- **1.** The network is reliable
- Latency is low and fixed
- 3. Bandwidth is high and fixed
- 4. The network is secure
- 5. Topology doesn't change
- **6.** There is one administrator
- X. Transport cost is negligible
- **In the state of t**

## The three domains of parallel programming

| Platform*                    | Laptop or server     | HPC Cluster                                                          | Cloud                                                                      |
|------------------------------|----------------------|----------------------------------------------------------------------|----------------------------------------------------------------------------|
| Execution Agent              | Threads              | Processes                                                            | Microservices                                                              |
| Memory                       | Single Address Space | Distributed memory, local<br>memory owned by individual<br>processes | Distributed object store (in memory) backed by a persistent storage system |
| Typical<br>Execution Pattern | Fork-join            | SPMD                                                                 | Event driven tasks, FaaS,<br>and Actors                                    |
| ,                            | '                    | '                                                                    |                                                                            |

Laptop/server and cluster models work well together.

An impenetrable wall separates them from the cloud-native world

### The sixth Epoch of Distributed Computing

| Epoch starting date | Defining limitations                    | Application                                                              | Interaction time and Network performance | Capability            |
|---------------------|-----------------------------------------|--------------------------------------------------------------------------|------------------------------------------|-----------------------|
| First<br>1970       | Rare connections to expensive computers | FTP, telnet, email                                                       | 100 ms<br>Low bandwidth high latency     | People to computers   |
| Second<br>1984      | I/O wall, disks can't keep up           | RPC,<br>Client Server                                                    | 10 ms<br>10 mbps                         | Computer to computer  |
| Third<br>1990       | Networking wall                         | MPP HPC, three-tier datacenter networks                                  | 1 ms<br>100 mbs → 1 Gbs                  | Services to services  |
| Fourth<br>2000      | Dennard Scaling Wall per core plateau   | Web search, planet-scale services                                        | 100 μs<br>10 Gbps<br>flash               | People to people      |
| Fifth 2015          | Per socket wall accelerators take off   | Machine Learning, data centric computing                                 | 10 μs<br>200 Gbps → 1 Tbps               | People to insights    |
| Sixth<br>2025       | Speed of light                          | Dynamic, real-time AI, integrated from data-center to the edge with SDE* | 100 ns<br>10 Tbs                         | People to experiences |

<sup>\*</sup> SDE: Software defined Everything, i.e. software defined networking, software defined infrastructure, software defined servers ... All at the same time ... to dynamically construct systems to meet the needs of workloads.

# Networking technology... replace generic data center network with a cluster of cliques



A clique: A graph where every vertex is connected to every other vertex

A Clique: a network of diameter one with  $O(\frac{1}{4}N^2)$  bisection bandwidth

Combine with next generation optical networks to hit latencies of 100 ns

#### Latencies every engineer should know ...

L1 cache reference 1.5 ns

L2 cache reference 5 ns

Branch misprediction 6 ns

Uncontended mutex lock/unlock 20 ns

L3 cache reference 25 ns

Main memory reference 100 ns

"Far memory"/Fast NVIVI reference 1,000 ns (1us)

Read 1 MB sequentially from memory 12,000 ns (12 us)

SSD Random Read 100,000 ns (100 us)

Read 1 MB bytes sequentially from SSD 500,000 ns (500 us)

Read 1 MB sequentially from 10Gbps network 1,000,000 ns (1 ms)

Read 1 MB sequentially from disk 10,000,000 ns (10 ms)

Disk seek 10,000,000 ns (10 ms)

Send packet California→Netherlands→California (150 ms)



A cluster of nodes with a Clique network topology and low latency optical network...

Yields one hop network latencies on par with DRAM access latencies.

Source: The Datacenter as a Computer:

Designing Warehouse-Scale Machines, Luiz
Andre Barroso, Urs Holzle, Parthasarathy
Ranganathan, 3<sup>rd</sup> edition, Morgan & Claypool,
2019.

#### Take out the big stuff & you're left with lots of µs overheads

All those SW overheads add up ... like bricks that combine to build a networking-wall ... turning a 2 µs network into a 100 µs network...



Computer Scientists need to rethink system SW stacks to minimize latencies ... fast RDMA, reduce sync contention, low latency interrupt handlers, and more .... All to hit  $O(\mu s)$  latencies.

## In the sixth Epoch of Distributed Computing, cloud and cluster overlap ... or even merge!

**HPC Cluster** Cloud The network is reliable Data Streaming Accelerator reduces tail latency. Latency is low and fixed Chip-to-chip optical networks push latency down, 3. Bandwidth is high and fixed P4/P5/P6 + Infrastructure and bandwidth up The network is secure Processing Units drive down latency and reduces jitter Topology doesn't change Transport cost is negligible The network is homogeneous

With Low Latencies, high bandwidths and stable performance, we can do loosely synchronous and synchronous applications in the cloud. The economics of the cloud vs dedicated HPC clusters means the cloud will dominate HPC

HPC applications will need to change to deal with reliability and network inhomogeneities.

## The three domains of parallel programming

| Platform*                    | Laptop or server     | HPC Cluster                                                          | Cloud                                                                      |
|------------------------------|----------------------|----------------------------------------------------------------------|----------------------------------------------------------------------------|
| Execution Agent              | Threads              | Processes                                                            | Microservices                                                              |
| Memory                       | Single Address Space | Distributed memory, local<br>memory owned by individual<br>processes | Distributed object store (in memory) backed by a persistent storage system |
| Typical<br>Execution Pattern | Fork-join            | SPMD                                                                 | Event driven tasks, FaaS,<br>and Actors                                    |

Advances in networking technology plus low-overhead software stacks optimized to reduce tail-latency will shatter this wall



## The three domains of parallel programming

| Platform*                    | Laptop or server     | HPC Cluster                                                    | Cloud                                                                      |
|------------------------------|----------------------|----------------------------------------------------------------|----------------------------------------------------------------------------|
| Execution Agent              | Threads              | Processes                                                      | Microservices                                                              |
| Memory                       | Single Address Space | Distributed memory, local memory owned by individual processes | Distributed object store (in memory) backed by a persistent storage system |
| Typical<br>Execution Pattern | Fork-join            | SPMD                                                           | Event driven tasks, FaaS,<br>and Actors                                    |

There will always be a need for top-end scalable systems in supercomputer centers, but economics will push the bulk of scientific computing into the cloud.

## One codebase $\rightarrow$ many systems

 Performance, Productivity AND Portability ... the database people "did it" with relational algebras and SQL.



<sup>\*</sup>This is the logo of the machine programming research program I help lead inside Intel Labs

# The Three Pillars of Machine Programming (MP)



MP is the automation of software development

- Intention: Discover the intent of a programmer
- Invention: Create new algorithms and data structures
- Adaptation: Evolve in a changing hardware/software world

Justin Gottschlich, Intel Labs Armando Solar-Lezama, MIT Nesime Tatbul, Intel Labs Michael Carbin, MIT Martin Rinard, MIT Regina Barzilay, MIT Saman Amarasinghe, MIT Joshua B Tenenbaum, MIT Tim Mattson, Intel Labs

Summarized ~90 works.

Key efforts by Berkeley, Google, Microsoft, MIT, Stanford, UW and others.

#### oneAPI: A bridge to our heterogeneous/Distributed Future

My vision for how we bring one API into a future dominated by power-optimized heterogenous chips organized into distributed systems.



The key to making this work ... the programmer is in control and chooses the level of abstraction based on the programming task.

## Summary

- Parallel computing is fun ... but it can be hard.
- Fortunately, if you stick to the Big-3 and the core patterns of parallel computing for HPC, it's not too overwhelming
  - The big 3: MPI, OpenMP, and "a GPU programming model"
  - Key Patterns: SPMD, loop level parallelism, geometric decomposition, divide and conquer, and SIMT
- Some day we'll automate the hard-parts with Machine Programming, but that may be 10 years!!!!

### SCANPY workload details and system configuration

|                | T T                               |
|----------------|-----------------------------------|
| ame            | Intel® Xeon® Platinum 8380        |
| Time           | Jan 20, 2022                      |
| Manufacturer   | Intel Corporation                 |
| Product Name   | Intel® Xeon® Platinum 8380        |
|                | SE5C6200.86B.0020.P23.21032613    |
| BIOS Version   | 09                                |
|                | Rocky Linux release 8.5 (Green    |
| OS             | Obsidian)                         |
| Kernel         | 4.18.0-240.22.1.el8_3.crt6.x86_64 |
| Microcode      | 0xd000270                         |
| IRQ Balance    | enabled                           |
|                | Intel(R) Xeon(R) Platinum 8380    |
| CPU Model      | CPU @ 2.30GHz                     |
| Base Frequency | 2.3GHz                            |
| Maximum        |                                   |
| Frequency      | 3.4GHz                            |
| All-core       |                                   |
| Maximum        |                                   |
| Frequency      | 2.5GHz                            |
| CPU(s)         | 40                                |
| Thread(s) per  |                                   |
| Core           | 2                                 |
| Core(s) per    |                                   |
| Socket         | 40                                |

| Socket(s)             | 1                          |
|-----------------------|----------------------------|
| NUMA Node(s)          | 1                          |
| Prefetchers           |                            |
| Turbo                 | Enabled                    |
| PPIN(s)               |                            |
| Power & Perf          |                            |
| Policy                | Performance                |
| TDP                   | 270 watts                  |
| Frequency Driver      |                            |
| Frequency             |                            |
| Governer              | Performance                |
| Frequency (MHz)       |                            |
| Max C-State           |                            |
|                       | Intel® Xeon® Platinum 8380 |
|                       | 40c D1 DDR4                |
|                       | 16*16GB@3200MHz -          |
| Installed             | Mellanox HDR               |
| Huge Pages Size       | 2048 kB                    |
| Transparent           |                            |
| Huge Pages            | Always                     |
| Automatic             |                            |
| <b>NUMA Balancing</b> | Enabled                    |

- The following was done to optimize the SCANPY benchmark
  - Data preprocessing used warm file cache and multi-threaded using Numba JIT
  - PCA, K-means, KNN Used the Intel extension for scikit-learn.
  - t-SNE Used optimized version from Intel's oneDAL Library.
  - Parallelized quadtree building, sorting and summarization steps using Morton codes.
  - UMAP optimized the UMAP code using AVX512/AVX2. Used MKL for eigenvalue computation.
  - Louvain and Leiden algorithms collaborated with Katana Graph to get well optimized versions and integrated them into SCANPY.