# Capturing loop parallelisability property:

from source code parallelisability metrics to machine-learning based assistance tool

PPar Lunch Talk

January, 23, 2019

Aleksandr Maramzin, s1736883







# **Background**

- High-level Software Parallelisation Problem Statement
- Software Quality Metrics as a work motivation
- Program Dependence Graph (PDG) and Loop Iterator Recognition

# **High-level Problem Statement**

• Abundance of parallel hardware across the whole spectrum of computing systems (from small embedded devices to warehouse-scale server farms)







• **Difficulty of manual parallel programming** (requires application domain knowledge, parallell programming skills (synchronisation, deadlocks, race conditions, etc.), knowledge of exact frameworks (OpenMP, MPI, etc.)

• Automatic parallelisation tools are conservative and limited in their abilities to exploit all available application parallelism



# Software Source Code Metrics in Software Engineering

Software Quality

# Cyclomatic Complexity (CC)

Code coverage

Software Coupling & Cohesion



Bugs per line of code

#### Halstead's Software Science

For a given problem, Let:

- $\eta_1$  = the number of distinct operators
- $\eta_2$  = the number of distinct operands
- $N_1$  = the total number of operators
- $N_2$  = the total number of operands

From these numbers, several measures can be calculated:

- ullet Program vocabulary:  $\eta=\eta_1+\eta_2$
- ullet Program length:  $N=N_1+N_2$
- ullet Calculated program length:  $\hat{N}=\eta_1\log_2\eta_1+\eta_2\log_2\eta_2$
- ullet Volume:  $V=N imes \log_2\eta$
- ullet Difficulty :  $D=rac{\eta_1}{2} imesrac{N_2}{\eta_2}$
- ullet Effort: E=D imes V

#### Parallel Programming

The only metrics in the subfield are different variations of speedup

$$speedup = \frac{\text{serial execution time}}{\text{parallel execution time}}$$

- Relative speedup (serial algorithm vs parallel algorithm)
- Real speedup (best known serial algorithm vs parallel algorithm)
- Absolute speedup (best known serial algorithm on the best serial hardware vs parallel algorithm)
- Analytical speedup
- Asymptotic speedup (O<sub>serial</sub>(n)/O<sub>parallel</sub>(n))

# Program Dependence Graph and Iterator Recognition Works

- Program Dependence Graph (PDG) is an IR graph, representing both data and control dependencies between instructions
- **Generalized Loop Iterator** is a Strongly Connected Component on a PDG with no incoming dependence edges in the Component Graph

- The Program Dependence Graph and Its Use in Optimization. Jeanne Ferrante (IBM T. J. Watson Research Center), Karl J. Ottenstein (Michigan Technological University), Joe D. Warren (Rice University). ACM Transactions on Programming Languages and Systems, 1987.
- Generalized Profile-Guided Iterator Recognition. Stanislav Manilov, Christos Vasiladiotis, and Björn Franke (The University of Edinburgh). Proceedings of 27th International Conference on Compiler Construction (CC'18).

# **Dependence based Anatomy of a Loop**

```
unsigned int a[100];
for (unsigned int i = 1; i < 100; i++) {
    a[i] = a[i-1];
}</pre>
```

## Loop Iterator



# **Software Parallelisability Metrics**

- Proposed Software Parallelisability Metrics
- Metrics Correlation with Loop Parallelisability

| Metric Group           | Metric                                     | Metric Definition                                                                                        | Intuition                                                                                                                                                    |  |  |
|------------------------|--------------------------------------------|----------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|
|                        | Absolute Size                              | Number of LLVM IR instructions in a whole loop                                                           | The bigger the loop, the harder it is to parallelize it                                                                                                      |  |  |
| Loop Proportions       | Payload Fraction                           | Payload Instructions Number Total Loop Instructions Number                                               | The smaller the payload fraction (hence, the more complex iterator is), the harder it is to parallelize a loop                                               |  |  |
| Loop Floportions       | Proper SCCs Number                         | Number of SCCs with more than one LLVM IR instruction in a payload of a loop                             | The more proper SCCs we have, the harder this loop is for parallelization                                                                                    |  |  |
|                        | Critical Payload<br>Fraction               | Critical Payload Instructions Number Payload Instructions Number                                         | The bigger the critical part of a loop, the harder it is to parallelize a loop                                                                               |  |  |
| Loop                   | Payload<br>Dependencies<br>Number          | Number of PDG edges in a payload ( <b>True</b> , <b>Anti</b> , <b>Output</b> and <b>Total</b> )          | The more dependencies we have, especially in the critical part of a loop payload, the                                                                        |  |  |
| Dependencies<br>Number | Critical Payload<br>Dependencies<br>Number | Number of PDG edges in a critical payload ( <b>True</b> , <b>Anti</b> , <b>Output</b> and <b>Total</b> ) | harder it is to parallelize a loop                                                                                                                           |  |  |
| Loop Cohesion          | Iterator/Payload<br>Cohesion               | Edges Number Between Iterator / Payload<br>Total Loop Edges Number                                       | No apparent intuition                                                                                                                                        |  |  |
| Loop Collesion         | Critical/Regular<br>Payload Cohesion       | Edges Number Between Critical / Non - critical Payload Total Payload Edges Number                        | The tighter a regular payload is coupled with payload's critical part (more edges in between, bigger cohesion value), the harder it is to parallelize a loop |  |  |
| Loop Instruction       | Call instructions count                    | Number of call instructions in iterator, payload and critical payload of a loop                          | Uninlined calls prevent loop parallelisation                                                                                                                 |  |  |
| Nature                 | Branch instructions count                  | Number of branch instructions in iterator, payload and critical payload of a loop                        | Branch instructions implement breaks, returns and a complicated CF in a loop body, what effectively hinders loop parallelisation                             |  |  |









• Let's try to harness all these metrics together in a machine learning model

## **Benchmarks**

- Brief Seoul National University (SNU) NAS Parallel Benchmarks (NPB) overview
- SNU NPB Parallelisability analysis

# Seoul National University NAS Parallel Benchmarks

|      | Seoul National University (SNU) NASA Parallel Benchmarks (NPB) |                    |                          |  |  |  |
|------|----------------------------------------------------------------|--------------------|--------------------------|--|--|--|
| NASA | Benchmarks Total Number                                        | Loops Total Number | OpenMP Pragmas<br>Number |  |  |  |
|      | 11                                                             | 1575               | 211                      |  |  |  |

- NAS Parallel Benchmarks (NPB) are a set of benchmarks targeting performance evaluation of highly parallel supercomputers
- SNU NPB version of benchmarks developed at Seoul National University
- SNU NPB has sequential as well as manually parallelised OpenMP version

# Seoul National University NAS Parallel Benchmarks

#### **SNU NPB Parallelisability %**



#### **SNU NPB Total Loops Number**



## Intel C/C++ Compiler Parallelisation limitations (171)

#### **Potentially solvable limitations**

| SNU NASA Parallel Benchmarks Intel C/C++ Compiler Limitations (89) |                                |                   |                                               |                        |       |  |  |
|--------------------------------------------------------------------|--------------------------------|-------------------|-----------------------------------------------|------------------------|-------|--|--|
| Unrecognised<br>Reduction                                          | Array<br>Privatization<br>Need | Alias<br>Analysis | Statically<br>unknown number<br>of iterations | Static<br>dependencies | Other |  |  |
| 18                                                                 | 3                              | 49                | 7                                             | 11                     | 3     |  |  |

# Difficult to overcome: requires human source code comprehension and benchmark runtime information

#### SNU NASA Parallel Benchmarks Intel C/C++ Compiler Limitations (82) Static dependencies: **Array Privatization** Too complex (cross-iteration Alias analysis Uninlined Other runtime information need dependencies, entangled CF, conservativeness function calls Need indirect array referencing, etc.) 35 4 22 11 1 4

# Machine Learning based Loop Parallelisability Prediction

- Supervised Machine Learning Loop Classification Problem
- Loop Classification Labels
- Model Learning Performance
- Model Use Scenarios





#### **Seoul National University (SNU) NASA Parallel Benchmarks (NPB)**

| Benchmarks Total Number | Loops Total Number | OpenMP Pragmas<br>Number |
|-------------------------|--------------------|--------------------------|
| 11                      | 1575               | 211                      |

| SNU NPB ICC optimisations |           |              |            |        |                             |            |                                     |                               |  |
|---------------------------|-----------|--------------|------------|--------|-----------------------------|------------|-------------------------------------|-------------------------------|--|
| Loop Distributi           | p Fusions |              | Loop Colla | pses   | Loc                         | op Tilings |                                     |                               |  |
| 34                        |           |              | 214        | 214 58 |                             |            |                                     | 27                            |  |
| Parallelised              | Mems      | et Generated | Vectorised |        | rallelisation<br>pendencies |            | Not a<br>rallelisation<br>candidate | Vectorisation<br>Dependencies |  |
| 653                       |           | 35           | 737        |        | 535                         |            | 103                                 | 266                           |  |

| Machine Learning Parallelisability Labels (Parallelisible/Non-Parallelisible) |                          |                        |  |  |  |  |
|-------------------------------------------------------------------------------|--------------------------|------------------------|--|--|--|--|
| SNU NPB DEVELOPER<br>KNOWLEDGE                                                | ICC MISSED OPPORTUNITIES | ICC                    |  |  |  |  |
| 1145 / <mark>430</mark>                                                       | 1051 / <mark>524</mark>  | 962 / <mark>613</mark> |  |  |  |  |

# **Decision Tree Model Learning Performance**

#### **Machine Learning Accuracy** (outliers beyond 3 standard deviations have been filtered out) **Loop Classification** SNU NPB developer knowledge ICC missed opportunities ICC Labels 1420 loops **1420 loops** 1420 loops Loops 995 / 425 901 / 519 812 / 608 **Baseline** (random) 70% 63.5% 57% predictor accuracy **DT** accuracy 89-91.6% 90-93% 89-92.5% Error types (safe/false negatives 43-52% 42-50% 42-49% (unsafe/false 48-57% 50-58% 51-58% positives)

# **Decision Tree Model Learning Performance**



# C/C++ loop **ICC** Compiler **Decision Tree Model** learn [YES] 93.4% / <mark>6.6%</mark> [YES] 100% / 0% 14.5% / 85.5% Parallelisation Advice [YES] 93.8% / 6.2% [NO] 3.6% / 96.4%

# **ML based Parallelisation Assistance Tool**

Tool performance with ICC missed opportunities loop classification labels

| Training /<br>Testing<br>Set Sizes | ICC<br>/ ML | Cases<br>Number | Real<br>Parallel | Non-<br>Parallel |          | ormance<br>% → 98.9% |
|------------------------------------|-------------|-----------------|------------------|------------------|----------|----------------------|
| 880 / 220                          | 0/0         | 49.9            | 1.8              | 48.1             | ICC      | 148.1                |
| 161.6 / <mark>59</mark>            | 0/1         | 22.3            | 11.4             | 10.9             | ICC+ML   | 159.8                |
|                                    | 1/0         | 6               | 6.28             | 0                | ICCTIVIL |                      |
|                                    | 1/1         | 142.1           | 142.1            | 0                | Real     | 161.6                |

<sup>\*</sup> Results are relatively stable across training/testing sets size variation

Combined output utilises almost all SNU NPB available parallelism, but introduces unsafe false positives

Programmer eliminates unsafe false positives

In a perfect use scenario SNU NPB available parallelism utilisation increases on average from 91.5% to 99%

# C/C++ loop ICC Compiler **Decision Tree Model** learn [YES] 100% / 0% [YES] 93.7% / 6.3% 21.4% / 79.6% Parallelisation Advice

[YES] 94% / 6%

[YES] 100% / 0%

#### ML based Parallelisation Assistance Tool

Tool performance with SNU NPB developer loop classification labels

| Training /<br>Testing<br>Set Sizes | ICC /<br>ML | Cases<br>Number | Real<br>Parallel | Non-<br>Parallel | Performa<br>85.7% → |       |
|------------------------------------|-------------|-----------------|------------------|------------------|---------------------|-------|
| 867 / 217                          | 0/0         | 36              | 3.4              | 32.6             | ICC                 | 148.5 |
| 173.3 /<br>43.6                    | 0/1         | 32.4            | 21.4             | 11               |                     |       |
| .0.0                               | 1/0         | 5.5             | 5.5              | 0                | ICC+ML              | 169.9 |
|                                    | 1/1         | 143             | 143              | 0                | Real                | 173.3 |

Combined output utilises almost all SNU NPB available NOT 9.5% 191.5% parallelism, but introduces unsafe false positives

> Programmer eliminates unsafe false positives

> > In a perfect use scenario SNU NPB available parallelism utilisation increases on average from 86% to 98%

#### **ML based Parallelisation Assistance Tool**



Safe Parallelism Utilisation Increase

C/C++ loop

# Thank you for your attention!

# Backup

- Source code examples
- Additional Tables

## BT, SP, LU benchmarks: array privatization need

```
void error_norm(double rms[5])
                                                                     void exact_solution(double xi, double eta, double zeta, double dtemp[5])
 int i, j, k, m, d;
 double xi, eta, zeta, u_exact[5], add;
                                                                       int m;
 double rms_local[5];
                                                                      for (m = 0; m < 5; m++) {
                                                                        dtemp[m] = ce[m][0] +
 for (m = 0; m < 5; m++) {
   rms[m] = 0.0:
                                                                          xi*(ce[m][1] + xi*(ce[m][4] + xi*(ce[m][7] + xi*ce[m][10]))) +
                                                                          eta*(ce[m][2] + eta*(ce[m][5] + eta*(ce[m][8] + eta*ce[m][11])))+
                                                                          zeta*(ce[m][3] + zeta*(ce[m][6] + zeta*(ce[m][9] +
                                                                          zeta*ce[m][12])));
 #pragma omp parallel default(shared) \
         private(i,j,k,m,zeta,eta,xi,add,u_exact,rms_local) shared(rms)
 for (m = 0; m < 5; m++) {
                                                      Loop nest (k,j,i) accumulates the total error (difference between
   rms local[m] = 0.0:
                                                      solution u and the exact solution u exact) via a reduction
 #pragma omp for nowait
 for (k = 0; k <= grid_points[2]-1; k++) {
                                                      Every iteration (k,j,i) uses temporary rms |local|| memory to
   zeta = (double)(k) * dnzm1;
   for (j = 0; j <= grid_points[1]-1; j++) {
                                                      compute and store one little portion of the reduction
     eta = (double)(j) * dnym1;
     for (i = 0; i <= grid_points[0]-1; i++) {
       xi = (double)(i) * dnxm1;
                                                      This rms local introduces cross-iteration dependency, since it
       exact_solution(xi, eta, zeta, u_exact);
                                                      is being reused on different loop nest iterations
       for (m = 0; m < 5; m++) {
         add = u[k][j][i][m]-u_exact[m];
                                                      Programmer replicates this temporary rms |local| memory with
         rms local[m] = rms local[m] + add*add:
```

for (m = 0; m < 5; m++) {

#pragma omp atomic rms[m] += rms\_local[m];

} //end parallel

Intel Compiler does not automatically do such a parallelising transformation

OpenMP semantics and parallelises the loop nest

### **UA benchmark: static dependence (runtime information need)**

```
// mt_to_id[miel] takes as argument the morton index and returns the actual
                          element index
// id to mt(iel) takes as argument the actual element index and returns the
                         morton index
#pragma omp parallel for default(shared) private(miel,iel)
for (miel = 0; miel < nelt; miel++) {</pre>
   iel = mt_to_id[miel];
  id_to_mt[iel] = miel; Output
 LOOP BEGIN at /home/s1736883/Work/PParMetrics/benchmarks/snu-npb/nauseous-omp/UA/src/adapt.c(149.3)
    remark #17104: loop was not parallelized: existence of parallel dependence
    remark #17106: parallel dependence: assumed OUTPUT dependence between id_to_mt[mt_to_id[miel]] (151:5) and id_to_mt[mt_to_id[miel]] (151:5)
    remark #17106: parallel dependence: assumed OUTPUT dependence between id_to_mt[mt_to_id[miel]] (151:5) and id_to_mt[mt_to_id[miel]] (151:5)
    remark #15388: vectorization support: reference mt_to_id[miel] has aligned access [ /home/s1736883/Work/PParMetrics/benchmarks/snu-npb/nduseous-omp/
    remark #15329: vectorization support: irregularly indexed store was emulated for the variable <id_to_mt[mt_to_id[miel]]>, part of index is read from m
    remark #15305: vectorization support: vector length 8
    remark #15300: LOOP WAS VECTORIZED
    remark #15448: unmasked aligned unit stride loads: 1
    remark #15463: unmasked indexed (or scatter) stores: 1
    remark #15475: --- begin vector cost summary ---
    remark #15476: scalar cost: 4
    remark #15477: vector cost: 20.750
    remark #15478: estimated potential speedup: 0.190
    remark #15488: --- end vector cost summary ---
    remark #25015: Estimate of max trip count of loop=1100
```

LOOP END

- Loop has an output dependency and Intel compiler correctly detects it
- Intel compiler does not parallelise the loop, but can vectorise it
- Programmer knows, that dependence does not materialise and preceeds the loop with an OpenMP pragma

# IS benchmark: static dependence (runtime information need)

```
Determine the number of keys in each bucket */
     #pragma omp for schedule(static)
     for( i=0; i<NUM_KEYS; i++ )</pre>
          work_buff[key_array[i] >> shift]++;  True Anti Output
LOOP BEGIN at /home/s1736883/Work/PParMetrics/benchmarks/snu-npb/nauseous-omp/IS/src/is.c(649.5)
  remark #17104: loop was not parallelized: existence of parallel dependence
  remark #17106: parallel dependence: assumed FLOW dependence between work_buff[key_array[i]] (650:9) and work_buff[key_array[i]]
  remark #17106: parallel dependence: assumed ANTI dependence between work_buff[key_array[i]] (650:9) and work_buff[key_array[i]]
  remark #17106: parallel dependence: assumed OUTPUT dependence between work_buff[key_array[i]] (650:9) and work_buff[key_array[i
  remark #17106: parallel dependence: assumed OUTPUT dependence between work_buff[key_array[i]] (650:9) and work_buff[key_array[i
  remark #15344: loop was not vectorized: vector dependence prevents vectorization
  remark #15346: vector dependence: assumed FLOW dependence between work_buff[key_array[i]] (650:9) and work_buff[key_array[i]] (
  remark #15346: vector dependence: assumed ANTI dependence between work_buff[key_array[i]] (650:9) and work_buff[key_array[i]] (
  remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
  remark #25015: Estimate of max trip count of loop=134217728
LOOP END
```

- Intel Compiler assumes all static dependencies to materialise and does not proceed with the parallelisation of the loop
- Programmer knows that

# Intel C/C++ Compiler optimisation report parsing and analysis problems

- Intel Compiler uses cost model to decide on how to process parallelisible loops and loop nests (parallelise, vectorise, generate memset/memcpy, leave sequential)
- Has been vectorised != absence of parallelisation restricting dependencies
- A set of transformations applied in order to enable loop parallelisation (loop distribution, fusion, tiling, collapsing, interchange, etc.)
- Loop eliminations (unrolling, inlining,

| Machine Learning Parallelisability Labels (Parallelisible/Non-Parallelisible) |                         |                        |  |  |  |  |  |
|-------------------------------------------------------------------------------|-------------------------|------------------------|--|--|--|--|--|
| OMP+MANUAL+ICC                                                                | MANUAL+ICC              | ICC                    |  |  |  |  |  |
| 1145 / <mark>430</mark>                                                       | 1051 / <mark>524</mark> | 962 / <mark>613</mark> |  |  |  |  |  |

# **Seoul National University NAS Parallel Benchmarks**

| Benchmarks | Brief Description                                                                                              | Benchmark<br>Parallelisability | Loops<br>Number | Loops Na              | ature       |
|------------|----------------------------------------------------------------------------------------------------------------|--------------------------------|-----------------|-----------------------|-------------|
|            |                                                                                                                | (ICC+OMP/ICC)                  |                 | OMP+<br>ICC           | ICC         |
| ВТ         |                                                                                                                | <mark>79%</mark> /<br>69%      | 190             | 150 / <mark>30</mark> | 131 /<br>59 |
| SP         | parallel access patterns.                                                                                      | 83% /<br>77%                   | 259             | 217 / 42              | 200 /<br>59 |
| LU         |                                                                                                                | 79% /<br>68%                   | 207             | 164 / <mark>43</mark> | 141 /<br>64 |
| CG         | Half of the benchmark are parallelisible reductions (even without OpenMP compiler hints), the other half are   | 61% /<br>59%                   | 47              | 29 / <mark>18</mark>  | 28 / 19     |
| EP         | Small, nothing interesting                                                                                     | 60% /<br>50%                   | 10              | 6 / <mark>4</mark>    | 5/5         |
| IS         | Materialisation of loop cross-iteration dependencies depends on a dynamic program properties. Programmer knows | 67% /<br>43%                   | 20              | 8 / 12                | 6 / 14      |

# Loops challenging ICC

- · Loops with cross-iteration dependencies
- · Loops with a statically unknown number of iterations

- · Loops with unidentified control variable
- · Loops with function calls inside the body
- · Loops with multiple exits (break, return statements, etc.)

```
void exact_solution(double xi, double eta, double zeta, double dtemp[5])
   int m;
   for (m = 0; m < 5; m++) {
      dtemp[m] = ce[m][0] +
        xi*(ce[m][1] + xi*(ce[m][4] + xi*(ce[m][7] + xi*ce[m][10]))) +
        eta*(ce[m][2] + eta*(ce[m][5] + eta*(ce[m][8] + eta*ce[m][11])))+
        zeta*(ce[m][3] + zeta*(ce[m][6] + zeta*(ce[m][9] +
        zeta*ce[m][12])));
LOOP BEGIN at /home/s1736883/Work/PParMetrics/benchmarks/snu-npb/nauseous-omp/BT/src/exact_solution.c(44,3)
  remark #17104: loop was not parallelized: existence of parallel dependence
  remark #17106: parallel dependence: assumed FLOW dependence between dtemp[m] (45:5) and ce[m][0] (45:5)
  remark #17106: parallel dependence: assumed ANTI dependence between ce[m][0] (45:5) and dtemp[m] (45:5)
  remark #15344: loop was not vectorized: vector dependence prevents vectorization
  remark #15346: vector dependence: assumed FLOW dependence between dtemp[m] (45:5) and ce[m][0] (45:5)
  remark #15346: vector dependence: assumed ANTI dependence between ce[m][0] (45:5) and dtemp[m] (45:5)
LOOP END
```

## Loop classification problem [1]: what is parallel and what is not?

```
void error_norm(double rms[5])
                                                                       void exact_solution(double xi, double eta, double zeta, double dtemp[5])
 int i, j, k, m, d;
 double xi, eta, zeta, u_exact[5], add;
                                                                         int m;
 double rms_local[5];
                                                                         for (m = 0; m < 5; m++) {
                                                                           dtemp[m] = ce[m][0] +
 for (m = 0; m < 5; m++) {
   rms[m] = 0.0:
                                                                             xi*(ce[m][1] + xi*(ce[m][4] + xi*(ce[m][7] + xi*ce[m][10]))) +
                                                                             eta*(ce[m][2] + eta*(ce[m][5] + eta*(ce[m][8] + eta*ce[m][11])))+
                                                                             zeta*(ce[m][3] + zeta*(ce[m][6] + zeta*(ce[m][9] +
                                                                             zeta*ce[m][12])));
 #pragma omp parallel default(shared) \
         private(i,j,k,m,zeta,eta,xi,add,u_exact,rms_local) shared(rms)
  for (m = 0; m < 5; m++) {
                                                     How to label these loops for a ML-based tool?
   rms local[m] = 0.0:
                                                      OpenMP semantics is supposed to replicate rmh local and ensure
  #pragma omp for nowait
                                                      atomic mutually exclusive access to a shared rhs[]
  for (k = 0; k <= grid_points[2]-1; k++) {
   zeta = (double)(k) * dnzm1;
   for (j = 0; j <= grid_points[1]-1; j++) {
                                                      OpenMP puts in an additional semantics, without which there is a memory
     eta = (double)(j) * dnym1;
                                                      dependence
     for (i = 0; i <= grid_points[0]-1; i++) {
       xi = (double)(i) * dnxm1;
       exact_solution(xi, eta, zeta, u_exact);
                                                      No dependence in OpenMP version, but memory cross-iteration
                                                      dependence in the sequential version
       for (m = 0; m < 5; m++) {
         add = u[k][j][i][m]-u_exact[m];
```

instruction

rms local[m] = rms local[m] + add\*add;

for (m = 0; m < 5; m++) {

rms[m] += rms\_local[m];

#pragma omp atomic

} //end parallel

- Function calls scare conservative compiler. Should we classify the code
- as it is or only after assumed inlining is done?

Need to provide memory in rhs local for every parallel thread, before we

parallelise it. Compiler does not do it automatically, without programmer's

#### BT, SP, LU benchmarks:

```
int m. i. i. k. ip1. im1. ip1. im1. km1. kp1:
#pragma omp parallel default(shared) private(i,j,k,m,zeta,eta,xi,\
#pragma omp for schedule(static) nowait
for (k = 1; k <= grid_points[2]-2; k++) {</pre>
 zeta = (double)(k) * dnzm1;
 for (j = 1; j <= grid_points[1]-2; j++) {
   eta = (double)(j) * dnym1;
   for (i = 0; i <= grid_points[0]-1; i++) {</pre>
      xi = (double)(i) * dnxm1:
      exact_solution(xi, eta, zeta, dtemp);
      for (m = 0; m < 5; m++) {
       ue[i][m] = dtemp[m];
      q[i] \= 0.5*(buf[\frac{1}{2}][1] \tag{1} + buf[i][2] \tag{2} + \text{ue}[i][2] +
                  buf[i\][3]*ue[i][3]);
    for (i = 1; i <= grid_points[0]-2; i++) {
      im1 = i+1;
                           True
      ip1 = i+1:
      forcing[k][j][i][0] + forcing[k][j][i][0] -
        tx2*( ue[ip1][1]-ue[im1][1] )+
        dx1tx1*(ue[ip1][0]-2.0*ue[i][0]+ue[im1][0]);
    for (m = 0: m < 5: m++) {
      i = grid_points[0]-3;
      forcing[k][i][i][m] = forcing[k][i][i][m] - dssp *
        (ue[i-2][m] - 4.0*ue[i-1][m] +
         6.0*ue[i][m] - 4.0*ue[i+1][m]);
      i = grid_points[0]-2;
      forcing[k][j][i][m] = forcing[k][j][i][m] - dssp *
        (ue[i-2][m] - 4.0*ue[i-1][m] + 5.0*ue[i][m]);
```

double dtemp[5], xi, eta, zeta, dtpp;

- ue[PROBLEM\_SIZE+1][5] is a global 2D array, which contains different values as a function of 2 loop nest iteration (k,j)
- Programmer does not replicate ue's memory per every thread, but seemingly relies on the nonoverlapping timing of loop iterations
- Intel Compiler correctly detects all dependencies and refuses to parallelise this code

# OpenMP pragma [3]: UA

```
for (i = 0; i < LX1-1; i++) {
    [tmmor[idmo[iel][iface][0][0][j][i]] = 0.0; Output
}</pre>
```

- Loop has an output dependency and Intel compiler correctly detects it
- Intel compiler does not parallelise the loop, due to an output dependency
- Same-value output dependency

# Getting ML classification labels for loops: methods and challenges

#### Methods and sources of loop classification labels extraction:

- Seoul National University (SNU) NAS Parallel Benchmarks (NPB) with added OpenMP pragmas
- Intel C/C++ Compiler (ICC) optimisation reports
- Manual SNU NPB source code study to check and verify parallelisation labels

#### Task challenges and complications:

- Seoul National University (SNU) NAS Parallel Benchmarks (NPB) with added OpenMP pragmas
- Intel C/C++ Compiler (ICC) optimisation reports
- Manual SNU NPB source code study to check and verify parallelisation labels