# Assignment 1 COMP4300

Abhaas Goyal (u7145384) April 27, 2021

Table 1: Table abbreviations

| $\mathbf{A_{T}}$    | Advection Time         |
|---------------------|------------------------|
| $N_{ m Send}$       | Blocking Send          |
| $\mathbf{I_{Send}}$ | Non-Blocking Send      |
| $\mathbf{G_F}$      | Gigaflops per second   |
| $\mathbf{P_C}$      | Per core (with $G_F$ ) |
| $\mathbf{NP}$       | Number of processes    |

## 1 Deadlock issues

The prototype function parAdvect() can fail for total buffer length  $> 2^{15}$  because there are 2 MPI\_Send happening at the same time. When buffer length is small then the send is locally blocking (so it copies the message into the local buffer). But for larger value of total buffer length it becomes globally blocking (so both the messages now are stuck in the first MPI\_Send waiting for the other process to receive). Hence, we face a deadlock situation here.

The simplest solution is to divide up the ranks into even and odd, and switch up the sending and receiving these classes.

Figure 1: Splitting and switching receiving and sending for odd and even ranks

Now it seems to be safely working for large values of M and N.

## 2 The effect of non-blocking communication

## 2.1 Implementing Non-blocking communication

- MPI\_ISend and MPI\_IRecv were implemented with the same style as given in the original question with MPI\_ISend and MPI\_IRecv, which form the basis for later questions.
- The above was done keeping in mind that there were no race conditions during top/bottom and left/right halo exchange

### 2.2 Timing of Blocking vs Non-blocking communication rationale

- The timing was defined by using the difference of M\_Wtime() between top and bottom halo sending and receiving of messages to maximize preciseness.
- In each repetition the total time accumulator was increased and in the end average time to update was taken. The source could be found in a previous commit here. The basic methodology to reduce and put the answer is here as follows

- The tests were done with small values of M and huge values of N since we need to focus on timings of communication here and in top/bottom exchange large values of N would mean a larger chunk of block to sent/receive.
- It is also to be noted that the value of M should at least be the number of working nodes for all nodes participating together in alteast one send/receive. Hence, I chose M as {48, 192} for tests
- $\bullet$  Another advanctage of keeping the value of M small is that we can still exploit the L3 cache with larger values of N
- Number of reps were taken to be 100 for making the sending/receiving average more consistent
- The tests were done 3 times and we were provided with consistent values.

#### 2.3 Results

• For small values of N (Table 2), got varied results with in some cases blocking being faster and in others non-blocking being faster

For large values of N (Table 3), one could clearly see that non-blocking sends/received proved to be consistently faster to a noticeable extent. Hence, we would be using that for the following set of questions.

Table 2: Blocking vs Non Blocking Send for Large values of N

| $\mathbf{M}$ | N    | NP  | $N_{ m Send}$ | $I_{ m Send}$             |
|--------------|------|-----|---------------|---------------------------|
| 48           | 1000 | 3   | 1.0e-06       | 2.6e-06                   |
| 48           | 1000 | 6   | 7.0e-06       | 4.2e-06                   |
| 48           | 1000 | 12  | 8.9e-06       | 9.3e-06                   |
| 48           | 1000 | 24  | 9.3e-05       | 7.4e-05                   |
| 48           | 1000 | 48  | 9.3e-05       | 5.4e-05                   |
| 192          | 1000 | 48  | 1.4e-05       | 6.8e-06                   |
| 192          | 1000 | 96  | 9.9e-05       | 7.6e-05                   |
| 192          | 1000 | 192 | 1.1e-05       | $6.6\mathrm{e}\text{-}05$ |

Table 3: Blocking vs Non Blocking Send for Large values of N

| $\overline{\mathbf{M}}$ | N      | NP  | $N_{ m Send}$ | $I_{ m Send}$ |
|-------------------------|--------|-----|---------------|---------------|
| 48                      | 100000 | 3   | 1.0e-03       | 3.6e-04       |
| 48                      | 100000 | 6   | 7.0e-04       | 4.2e-04       |
| 48                      | 100000 | 12  | 8.9e-04       | 9.3e-04       |
| 48                      | 100000 | 24  | 9.3e-04       | 7.4e-04       |
| 48                      | 100000 | 48  | 9.3e-04       | 5.4e-04       |
| 192                     | 100000 | 48  | 1.4e-03       | 6.8e-04       |
| 192                     | 100000 | 96  | 9.9e-04       | 7.6e-04       |
| 192                     | 100000 | 192 | 1.1e-03       | 6.6e-04       |

## 3 Make Performance modelling and calibration

#### 3.1 Performance Model

• Parallel communication of top and bottom

$$T_{comm} = 4(t_s + Nt_w)$$

• Sequential communication (for width = 1) (p is number of processes)

$$T_{seq} = 18 * \frac{r.MN.t_f}{p} = O(MN/p)$$

2 is taken in sequential because of copying back

$$T_{par} = 4(t_s + Nt_w) + 18 * \frac{r.MN.t_f}{p}$$

## 3.2 Testing Methodolgy

- Performance model was tested in one node
- The goal was to minimize top and bottom halo exchange time. Hence, like in question 2, a large value of N and small value of M was taken. In this case M = 48 and N = 1000000. They remain unchanged for increasing NP in this case because we want to do strong scaling.
- Number of reps was taken to be 100

### 3.3 Results

M = 48 N = 1000000 reps = 100

Table 4: Strong scaling on single Node

| NP | $\mathbf{A_{T}}$ | $G_{ m F}$ | $ ho_{ m C}$ |
|----|------------------|------------|--------------|
| 3  | 6.01e-01         | 1.60e + 01 | 5.32e+00     |
| 6  | 5.11e-01         | 1.88e + 01 | 3.13e+00     |
| 12 | 5.35 e-01        | 1.79e + 01 | 1.50e + 00   |
| 24 | 2.40e-01         | 4.01e+01   | 1.67e + 00   |
| 48 | 1.28e-01         | 7.53e + 01 | 1.57e + 00   |

- t<sub>f</sub> is taken as 1 / P\_C, hence we find that it is first decreasing then staying consistent.
- When going from 6 to 12 processors, I was surprised to see the advection time increasing instead of decreasing and for initial values of NP A<sub>T</sub> doesn't seem to decrease that much. My best guess would be because of the memory hierarchy present in NCI nodes. As the number of processors increase, the size of the data distribution decreases in each processor, hence more data can be stored in L1 and L2 cache with more cache hits. This leads to t<sub>w</sub> being less. Initially, they don't have much effect given the size of the data and the communication of the nodes is increasing (leading to the abnormality), however from NP > 12, the most of the blocks are small enough to be fit into lower levels of cache hierarchy.
- Other than that, we see a consistent result of **A**<sub>T</sub> almost halving after N=12, when everything is in L3.

# 4 The effect of 2D process grids

## 4.1 Theoretical time

• Strip communication of top and bottom (Done till Q3)

$$4(t_s + Nt_w)$$

• Block communication

$$8(t_s + \frac{MN}{p}t_w)$$

• Block communication > Strip communication if

$$t_s > N(1 - \frac{2M}{p})t_w$$

• We find a similar graph as found in Lecture 11 (22). Hence, we need to use large values of M and N to see an improvement.

#### 4.2 Results

•  $M = N = 2000 (2 * L_3 cache has around 70 MB memory)$ 

Table 5: Computation for 2D process grids (1 Node)

| P  | Q  | NP | $\mathbf{A_{T}}$ | $G_{ m F}$ | $P_{C}$    |
|----|----|----|------------------|------------|------------|
| 1  | 12 | 12 | 2.70e-01         | 2.97e+01   | 2.47e+00   |
| 1  | 24 | 24 | 8.19e-02         | 9.76e + 01 | 4.07e + 00 |
| 1  | 48 | 48 | 4.42e-02         | 1.81e + 02 | 3.77e + 00 |
| 2  | 6  | 12 | 2.69e-01         | 2.98e + 01 | 2.48e + 00 |
| 2  | 12 | 24 | 7.92e-02         | 1.01e + 02 | 4.21e+00   |
| 2  | 24 | 48 | 3.35e-02         | 2.38e + 02 | 4.97e + 00 |
| 3  | 4  | 12 | 2.64e-01         | 3.03e+01   | 2.53e+00   |
| 3  | 8  | 24 | 8.08e-02         | 9.90e + 01 | 4.12e+00   |
| 3  | 12 | 48 | 3.30e-02         | 2.42e + 02 | 5.05e + 00 |
| 6  | 2  | 12 | 2.65e-01         | 3.02e+01   | 2.52e+00   |
| 6  | 4  | 24 | 8.38e-02         | 9.55e + 01 | 3.98e + 00 |
| 6  | 8  | 48 | 3.06e-02         | 2.61e + 02 | 5.44e + 00 |
| 12 | 1  | 12 | 2.65e-01         | 3.02e+01   | 2.52e+00   |
| 12 | 2  | 24 | 7.74e-02         | 1.03e + 01 | 4.31e+00   |
| 12 | 4  | 48 | 2.91e-02         | 2.75e + 02 | 5.73e + 00 |
| 16 | 3  | 48 | 2.35e-02         | 3.40e + 02 | 7.08e + 00 |
| 24 | 2  | 48 | 3.20 e-02        | 2.50e + 02 | 5.51e + 00 |

- From Table 4, best ratio of P:Q is either between 3 and 5.2 (Because Q is stranded it needs some more time to send than P)
- On further investigation an estimate was found to around 4 (by looking at the above data and additional experiments)

Table 6: Computation for 2D process grids (4 Nodes)

|     |    | _   |                                      |            | ,          |
|-----|----|-----|--------------------------------------|------------|------------|
| P   | Q  | NP  | $\mathbf{A_{T}}$                     | $G_{ m F}$ | $ m P_{C}$ |
| 48  | 1  | 48  | 2.70e-01                             | 2.97e + 01 | 2.47e + 00 |
| 16  | 3  | 48  | 2.35e-02                             | 3.40e + 02 | 7.08e + 00 |
| 96  | 1  | 96  | 4.97e-02                             | 1.61e + 02 | 1.68e + 00 |
| 16  | 6  | 96  | 1.73e-02                             | 4.62e + 02 | 4.81e+00   |
| 192 | 1  | 192 | 4.40e-02                             | 1.81e + 02 | 9.47e-01   |
| 16  | 12 | 192 | 1.45e-02                             | 5.51e + 02 | 2.87e + 00 |
| 24  | 8  | 192 | 1.37e-02                             | 5.85e + 02 | 3.05e + 00 |
| 32  | 6  | 192 | $2.19\mathrm{e}\text{-}02\mathrm{s}$ | 3.66e + 02 | 1.90e+00   |

- Till Q3 Q was till 1 only so performance improvement in optimal values of P and Q (when P is close to a near square ration) are highly impressive. On optimum values a 4x improvement is around on using this approach for large values of M and N.
- If  $t_w$  were 10 times larger, then strip partitioning would have been better (from seeing the equation and calculating the values between strip and block paritioning the condition wouldn't hold true for the large values that we have tested against)

## 5 Overlapping communication with computation

- In this question, we should capitalize on LR exchange since we need Q=1 and keeping the sequential exchange part minimal we update on the rows. We take the parameters as M=1000000, N=NP
- The performance model would be affected by

$$T_{comm} = t_s + 4 * t_w$$

In the best case scenario (the 4 is to highlight the 4 corners that I have sent before sending the messages left and right)

#### 5.1 Results

Table 7: Performance comparision between normal and overlapping communication (4 Nodes)

| NP  | $\mathbf{A_{T}}$ | $G_{ m F}$ | $P_{C}$    | A <sub>T</sub> (-o) | $G_{ m F}$ | $P_{C}$    |
|-----|------------------|------------|------------|---------------------|------------|------------|
| 48  | 3.78 e-01 s      | 1.02e+02   | 2.12e+00   | 3.60e-01            | 1.07e + 02 | 2.22e+00   |
| 96  | 1.64 e-01 s      | 2.34e + 02 | 2.44e + 00 | 1.21e-01            | 3.17e + 02 | 3.31e+00   |
| 192 | 5.33e- $02$ s    | 7.21e + 02 | 3.75e + 00 | 3.95e-02            | 9.72e + 02 | 5.06e + 00 |

- For large number of processes with high left and right halo exchange, it acts as an optimization layer and it works pretty nice (with a 1.34 speedup in 192 processes).
- Achieving overlap for 2D communication is difficult because the left-right halo exchange is dependent
  on top bottom halo exchange corners I clarified a doubt on this with a diagram in Piazza(link).
  So synchronizing it with the number of requests to wait for is difficult (with additional checks for
  P=1 or Q = 1). However, I implemented this to work for 2D process grids in optimization part of
  the assignment and compared it's model too with Gadi in Q9

## 6 Wide halo transfers

## 6.1 Gracefully exiting on $w > m \mid\mid w > n$

- For Q or P > 1 Since the ranks which touch the right and bottom of the field may not satisfy the condition of w > M\_loc || w > N\_loc (because the blocks size may be smaller there during division of work), normally checking this wouldn't work for all ranks.
- However, we know that for rank == 0, if this condition holds true then all the blocks are in danger.
- Hence, we broadcast the value to exit in this scenario
- I also changed the return type of checkHaloSize() to int so that all the processes could gracefully exit from main()

## 6.2 Performance Model

• Parallel communication of top and bottom

$$T_{comm} = 8(t_s + \frac{MNw}{p}t_w)$$

• Sequential communication (for width = 1) (p is number of processes) (Assumed the case of square grids for maximal increase of w)

$$T_{seq} = 18 * \frac{r.MN.t_f}{pw} + \frac{2 * r * (w-1)^2 + (w-2)^2 .. (w-w)^2}{w} = O(\frac{MN}{pw} + w^2)$$

2 is taken in sequential because of copying back

• Total time

$$T_{par} = 8(t_s + \frac{MNw}{p}t_w) + O(\frac{MN}{pw} + w^2)$$

## 6.3 Implementation of wide halo technique (discussing the impact on performance)

A wide halo technique would be useful when it's applied in conjunction with overlapping (since you need to do some extra computation) or when a large amount of data can be stored and used in cache at one point during the sequential updation process. However, this comes at a price of increased 4 \* w^2 computations (affecting t<sub>f</sub>) with increasing values of w in 4 different corners and. Functionally and algorithmically as of now it is correct however there are a few bottlenecks or some extra wait calling that I have not noticed.

• One of the bottlenecks and extra overhead that my non-optimized code has is that when executing this section of the code:

```
int reps_left = reps % w;
if (reps_left > 0) {
    // Doing w updates isn't good, the number should be reps_left
    // But won't work
    updateBoundary(u, ldu, w);
    for (w_i = 1; w_i <= reps_left; w_i++) {
        int UR_size = M_loc + (2 * w) - (2 * w_i);
        int UC_size = N_loc + (2 * w) - (2 * w_i);
        updateAdvectField(UR_size, UC_size, &V(u,w_i,w_i), ldu, &V(v,w_i,w_i), ldv);
        copyField(UR_size, UC_size, &V(v,w_i,w_i), ldv, &V(u,w_i,w_i), ldu);
    }
}</pre>
```

- Extra updates are being taken here which are not needed. To improve this, I tried doing
  with updateBoundary(u, ldu, \*n\_reps\*); but it didn't work. Hence, the code needs to be
  investigated further to be improved.
- This leads to the fact that reps % w == 0 would give the optimum performance as of now.

## 6.4 Results

```
M = N = 2000 \text{ reps} = 100
```

| 7D 11 0  | D C          | c ·        | 1 1        | C • 1.1   | C 1           | 1        | COD    | • 1   |
|----------|--------------|------------|------------|-----------|---------------|----------|--------|-------|
| Table X  | Pertormance  | tor varior | g lengthe  | of width  | for optimized | division | 01 711 | oride |
| Table 0. | 1 CHOHIMANCC | ioi variot | o icinguio | or wratin | TOT OPHILIDOG | urvision | 01 21  | grius |

| $\mathbf{w}$ | P  | $\mathbf{Q}$ | NP  | $G_{ m R}$ | $G_{ m R}$ | $\overline{\mathrm{P_{C}}}$ |
|--------------|----|--------------|-----|------------|------------|-----------------------------|
| 1            | 3  | 4            | 12  | 2.64e-01   | 3.03e+01   | 2.53e+00                    |
| 1            | 16 | 3            | 48  | 2.35e-02   | 3.40e + 02 | 7.08e + 00                  |
| 1            | 24 | 8            | 192 | 1.37e-02   | 5.85e + 02 | 3.05e + 00                  |
| 2            | 3  | 4            | 12  | 2.70e-01   | 2.97e + 01 | 2.47e + 00                  |
| 2            | 16 | 3            | 48  | 3.01e-02   | 2.66e + 02 | 5.55e + 00                  |
| 2            | 24 | 8            | 192 | 2.66e-01   | 3.01e+01   | 2.51e+00                    |
| 3            | 3  | 4            | 12  | 2.70e-01   | 2.97e + 01 | 2.47e + 00                  |
| 3            | 16 | 3            | 48  | 3.14e-02   | 2.55e + 02 | 5.31e+00                    |
| 3            | 24 | 8            | 192 | 2.68e-01   | 2.99e+01   | 2.49e+00                    |
| 4            | 3  | 4            | 12  | 2.70e-01   | 2.97e + 01 | 2.47e + 00                  |
| 4            | 16 | 3            | 48  | 2.99e-02   | 2.68e + 02 | 5.58e + 00                  |
| _ 4          | 24 | 8            | 192 | 2.70e-01   | 2.96e + 01 | 2.47e + 00                  |

- Sample values of tiled stencil with optimum values of P and Q in Q4
- Can suprisingly see increasing trend of time with increase of of w because of the reason above and also because cache locality is not sustained in increasing order of lines. Maybe the implementation is lacking but my assumption is that the following approach is limited.
- The values of w = 2 and 3 are taken from the fact that reps % 3 != 0 and reps % 2 == 0 (see the bottleneck point mentioned above)

## 7 Tiled Stencil Technique

While the processing power of CPUs have been consistently growing, the actual performance of computation is often bottlenecked by the speed at which the processor can access data from the memory. This is a potential problem that we have been facing since the last question. The main factors [1]

The main goal of tiling transformation can be used to improve locality of data, and hence improving cache performance by a huge margin. In our advection solver (where a 9 point stencil has been used) the distance to be traveled between the central to bottom right corner is  $N_{loc} + 2 * w + 1$ . Hence as w increases, the probability of the next element in the same cache decreases.

For eg, a very similar problem of 2D Jacobi 9 point stencil uses a form of tiling called Hierarchical Overlapped tiling 2. It's goal is to adapt to the memory hierarchy of the target machine.

The papers mention that tiling techniques could improve reuse, with array padding. By rearranging and grouping the order of execution (or iteration space) while preserving the information flow we need to achieve 2 oals

- 1. Not violating data dependency
- 2. Increasing the segment of data in cache which is needed

Selecting Tile sizes and the process of determining the abstract tiles is reference in [2] with the following code:

# 8 Combination of Wide halo and stencil technique

A combination of these might alleviate the  $w^2$  extra computations to a certain extent by having good cache coherency. It would work for small values of 2.

# 9 Extra optimizations

- A potential optimization was seen when I saw an opportunity to explore Q5 further to work with values of Q > 1. For that, I also referenced my doubts and got it clarified on Piazza (link).
- The algorithm works to overlap both row and column exchange with computation in the case that
  - 1. The left and right exchange only happens when the values of the corners of the top and bottom halos are exchanged
  - 2. The final updation of the inner halo only happens after receiving all the messages.

#### 9.1 Results

Table 9: Performance for optimized division of 2D grids and overlapping

| P  | $\mathbf{Q}$ | NP  | ${f A_T}$ | $G_{ m R}$ | ${ m P_C}$ |
|----|--------------|-----|-----------|------------|------------|
| 16 | 3            | 48  | 2.37e-02  | 3.38e + 02 | 7.05e+00   |
| 16 | 6            | 96  | 1.69e-02  | 4.68e + 02 | 4.81e+00   |
| 16 | 12           | 192 | 1.44e-02  | 5.57e + 02 | 2.87e + 00 |
| 24 | 8            | 192 | 1.33e-02  | 5.97e + 02 | 3.11e+00   |

We see a Similar performance to Q4 with it being slightly slow at some points (but and improved version of Q5 nonetheless). I would have liked to know how I could have improved this model w.r.t the part on exchanging corners.

## 10 References

- [1] Zhou, T., 2016. Factors Affecting Stencil Code Performance. [online] Scholar-ship.tricolib.brynmawr.edu. Available at: https://scholarship.tricolib.brynmawr.edu/bitstream/handle/10066/18674/2016ZhouT.pdf?sequence=1&isAllowed=y [Accessed 26 April 2021].
- [2] Holewinski, J., 2021. High-Performance Code Generation for Stencil Computations on GPU Architectures. [online] http://web.cs.ucla.edu/~pouchet/doc/ics-article.12.pdf. Available at: http://web.cs.ucla.edu/~pouchet/doc/ics-article.12.pdf [Accessed 26 April 2021].
- [3] Rivera, G. and Tseng, C., 2000. Tiling optimizations for 3D scientific computations | Proceedings of the 2000 ACM/IEEE conference on Supercomputing. [online] Dl.acm.org. Available at: https://dl.acm.org/doi/10.5555/370049.370403 [Accessed 26 April 2021].

# 11 Acknowledgements

- Done individually and with help from Piazza/COMP4300 Practicals.
- One thing which could be improved was providing some sort of visualization to the proposed question which asks the algorithm to be implemented. Rather than being asked in Piazza (like the examples of Q5 and Q6), which would save the student so much time in understanding and executing the correct algorithm.