# Standard Parallelism in C++

**Author**: [Debabrata Mandal]()

# Table of Contents

1. <a href="#"> Introduction </a>
2. <a href="#"> Setup </a>
3. <a href="#"> Examine CUDA C++ code </a>
4. <a href="#"> Execution policies (since C++17) </a>
5. <a href="#"> Matrix multiplication using std::par </a>
6. <a href="#"> Conclusion </a>

# Introduction

Welcome to the first assignment on heterogeneous computing in C++ using [standard parallelism](https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41960/). Since C++17 there has been an easier way to write parallel code targeting different kinds of platforms (e.g. CPUs and GPUs). This was added to the standard under the name **execution policies**, and many popular algorithms (e.g. [std::sort](https://en.cppreference.com/w/cpp/algorithm/sort)) already support it out of the box.


In this assignment, we explore what standard parallelism means and what it implies for users of the language. It is not compulsory, but it is recommended that readers refer to all the linked articles and videos scattered throughout this notebook.

<div class="alert alert-block alert-info"> <b> NOTE: </b> <span style="color:black"> <b> Since some parts of this assignment will be automatically graded, refrain from adding or deleting any cells in this notebook. Answer cells have been already provided for the respective questions.</b> </span> </div>

# Setup


Since this assignment will explore the usage of GPUs you will have to install the necessary tools (compiler and profiler) to successfully complete it. 

Specifically we will need the following tools:

1. [nvc++](https://www.youtube.com/watch?v=KhZvrF_w1ak) compiler (part of the [Nvidia HPC SDK](https://developer.nvidia.com/hpc-sdk))
2. [NSIGHT Systems](https://developer.nvidia.com/nsight-systems) (nsys)

Since this notebook has been primarily designed for readers who do not have access to a GPU (other than a hosted runtime environment like Colab), we will assume the OS and hardware resources used to solve the questions are similar to the Colab version.

## Switching between runtimes

In order to turn on GPU runtime, 
- Runtime > Change runtime type > Hardware accelerator > GPU

In [1]:
# check GPU is recognised and available
! /opt/bin/nvidia-smi

Fri Aug  5 05:30:29 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
# Make sure we install the right package depending on the OS release.
! cat /etc/os-release

NAME="Ubuntu"
VERSION="18.04.6 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.6 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic


Now, based on the CUDA toolkit version installed and the OS release find the compatible `nvhpc` package from [this](https://developer.nvidia.com/nvidia-hpc-sdk-releases) list. **(Points: 1)**

ANSWER: XY.Z >= 21.2. However, `nvcc` fails to work with the latest package versions due to a PTX compatibility issue. The latest packages bundle a CUDA version newer than the 11.2 supported by the installed driver, so their `nvcc` would need a forward compatibility package.

In [3]:
! echo 'deb [trusted=yes] https://developer.download.nvidia.com/hpc-sdk/ubuntu/amd64 /' > /etc/apt/sources.list.d/nvhpc.list
! sudo apt-get update -y

# Replace X,Y & Z with package version.
! sudo apt-get install -y nvhpc-21-2

Get:1 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]
Get:2 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Hit:3 http://archive.ubuntu.com/ubuntu bionic InRelease
Hit:4 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease
Get:5 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Hit:6 http://ppa.launchpad.net/cran/libgit2/ubuntu bionic InRelease
Ign:7 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Get:8 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease [1,581 B]
Hit:9 http://ppa.launchpad.net/deadsnakes/ppa/ubuntu bionic InRelease
Get:10 http://security.ubuntu.com/ubuntu bionic-security/universe amd64 Packages [1,528 kB]
Get:11 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]
Hit:12 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic InRelease
Get:13 http://archive.ubuntu.com/ubuntu bi

In [4]:
# Check compiler version

# Replace X,Y & Z with package version.
! /opt/nvidia/hpc_sdk/Linux_x86_64/21.2/compilers/bin/nvc++ --version 


nvc++ 21.2-0 LLVM 64-bit target on x86-64 Linux -tp haswell 
NVIDIA Compilers and Tools
Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.


Since we will be using `nvc++` a lot, we should modify the system `PATH` variable to include the directory containing `nvc++`, so that we do not have to type the full path each time.

<div class="alert alert-block alert-danger"> Warning: If the steps in the next cell are not followed correctly, you might need to restart the runtime.</div>

In [5]:
!echo $PATH # note this path down

/opt/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tools/node/bin:/tools/google-cloud-sdk/bin


In [6]:
# **Note** the <old $PATH> in command below. 
# Fill it with the PATH output from the previous cell.
# When this notebook was last run it was the following:
# /opt/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tools/node/bin:/tools/google-cloud-sdk/bin

import os

# Replace X,Y & Z with package version.
os.environ['PATH']='/opt/nvidia/hpc_sdk/Linux_x86_64/21.2/compilers/bin/:/opt/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tools/node/bin:/tools/google-cloud-sdk/bin'

!echo $PATH

/opt/nvidia/hpc_sdk/Linux_x86_64/21.2/compilers/bin/:/opt/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tools/node/bin:/tools/google-cloud-sdk/bin


In [7]:
!nvc++ --version


nvc++ 21.2-0 LLVM 64-bit target on x86-64 Linux -tp haswell 
NVIDIA Compilers and Tools
Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.


Find the `nsys` version installed. **(Points: 1)**

(Hint: `nsys` is already a part of the `nvhpc` package installed before!) 

In [8]:
!nsys --version

NVIDIA Nsight Systems version 2020.5.1.85-5ee086b


Your answer: 

---

Curious readers can try exploring the various options [nsys](https://docs.nvidia.com/nsight-systems/UserGuide/index.html) provides. This can prove to be a useful exercise for later parts of this assignment.

Some highlighted options are:

**nsys**
  * --trace
  * --stats
  * --cuda-memory-usage
  * --cuda-um-cpu-page-faults
  * --cuda-um-gpu-page-faults
  * --gpu-metrics-device

To explore the full set of options that `nsys` provides either refer to the official Nvidia documentation on the tools, or simply print them using `--help`.




# Examine CUDA C++ code

In this part of the assignment, you shall look at 2 CUDA kernels related to [matrix multiplication](https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#mat-mat-multi) on the GPU. While this section will not directly make you write code in CUDA, it will set the background for our future discussions.

The purpose of this section is to stress the importance of reading and understanding CUDA code and translating it to platform-agnostic code using standard parallelism in later sections.

For the code snippet below, add **single** line comments wherever indicated. Be as specific as possible to score the maximum points! **(Points: 3)**

```c++
template <typename T>
__global__ void gpu_gemm_nn(int m, int n, int k,            //in: matrix dimensions: C(m,n)+=A(m,k)*B(k,n)
                            T * __restrict__ dest,          //inout: pointer to C matrix data
                            const T * __restrict__ left,    //in: pointer to A matrix data
                            const T * __restrict__ right)   //in: pointer to B matrix data
{
    // <what does ty refer to?>
    size_t ty = blockIdx.y*blockDim.y + threadIdx.y;
    // <what does tx refer to?>
    size_t tx = blockIdx.x*blockDim.x + threadIdx.x;
    size_t n_pos = ty;
    // <why is the following loop necessary?>
    while(n_pos < n){
        size_t m_pos = tx;
        while(m_pos < m) {
            // <what does tmp store?>
            T tmp = static_cast<T>(0.0);
            // <the following loops over which dimension (x or y) of the matrix A and which dimension of matrix B?>
            for(size_t k_pos = 0; k_pos < k; ++k_pos)
            {
                tmp += left[m_pos*k + k_pos] * right[k_pos*n + n_pos];
            }
            dest[m_pos*n + n_pos] += tmp;
            m_pos += gridDim.x*blockDim.x;
        }
        n_pos += gridDim.y*blockDim.y;
    }
    return;
}
```

Your answer:

1. The column id of the cell in the output matrix being computed.

2. The row id of the cell in the output matrix being computed.

3. If the row or column dimensions of the output matrix are not a multiple of the block dims, the last blocks along the x and y dimensions are only partially contained within the output matrix C, so some of their threads fall outside it. The loop conditions prevent such threads from doing any work. The strides (`gridDim * blockDim`) additionally let each thread process several cells when the grid is smaller than the matrix.

4. It accumulates the value for an output cell (tx, ty) in the matrix C.

5. The x dimension of matrix A and the y dimension of matrix B.
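
To make the index arithmetic in the answers above concrete, here is a minimal host-side sketch of the same computation in plain C++. The name `cpu_gemm_nn` and the use of `std::vector` are assumptions made for illustration; they are not part of `naive_matmul.cu`. The two outer loops play the role of `tx` (row) and `ty` (column), and the inner loop is the same walk over `k`.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side reference: C(m,n) += A(m,k) * B(k,n), all row major,
// using the same index expressions as the kernel body.
template <typename T>
void cpu_gemm_nn(std::size_t m, std::size_t n, std::size_t k,
                 std::vector<T>& dest,
                 const std::vector<T>& left,
                 const std::vector<T>& right)
{
    for (std::size_t m_pos = 0; m_pos < m; ++m_pos) {       // row of C (tx in the kernel)
        for (std::size_t n_pos = 0; n_pos < n; ++n_pos) {   // column of C (ty in the kernel)
            T tmp = static_cast<T>(0);                       // dot product of one row and one column
            for (std::size_t k_pos = 0; k_pos < k; ++k_pos) {
                tmp += left[m_pos * k + k_pos] * right[k_pos * n + n_pos];
            }
            dest[m_pos * n + n_pos] += tmp;
        }
    }
}
```

A reference like this can also be used to check the GPU result on small matrices before timing larger ones.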

----

Work out the arithmetic intensity of matrix multiplication based on the algorithm above. **(Points: 1)**

Your answer: 
Refer to https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html

---

Now, look into the file `naive_matmul.cu`. It contains the same kernel that we discussed in the previous cell. Read the `main` function provided and find the execution time for the following workloads. **(Points: 2)**

| Workload  |  Execution times (ms) |
|---|---|
| M=N=K=256  |  0.501 |
|  M=N=K=512 |  2.442 |
| M=N=1024 and K=256  |  4.702 |
|  M=N=K=2048 | 147.85  |

Attach the profiler screenshots zooming into the timeline **(Points: 2)**
1. When the kernel was started - mark/highlight the memcpy operation from Host to Device
2. When the kernel finished executing - mark/highlight the memcpy operation from Device to Host

The profile can be first generated using `nsys profile` and then visualised using NSIGHT Systems GUI. 

For the workload M=N=K=1024 find the optimal block size (defined as BLOCK_SIZE in the code). Provide a short explanation for the same. **(Points: 1)**

In [12]:
# your code workspace
# !/opt/nvidia/hpc_sdk/Linux_x86_64/21.2/compilers/bin/nvcc
!nvcc -lnvToolsExt naive_matmul.cu && ./a.out

# # For nsys profile options refer to this crisp summary of the CLI options
# # https://gist.github.com/mcarilli/376821aa1a7182dfcf59928a7cde3223

# !nsys profile --trace cuda,nvtx -f true -o report1 ./a.out

[01m[Knaive_matmul.cu:[m[K In function ‘[01m[Kint main()[m[K’:
     cudaThreadSynchron[01;35m[Ki[m[Kze();
                       [01;35m[K^[m[K
[01m[K/opt/nvidia/hpc_sdk/Linux_x86_64/21.2/cuda/11.2/include/cuda_runtime_api.h:1011:46:[m[K [01;36m[Knote: [m[Kdeclared here
 extern __CUDA_DEPRECATED __host__ cudaError_t[01;36m[K CUDARTAPI cudaThread[m[KSynchronize(void);
                                              [01;36m[K^~~~~~~~~~~~~~~~~~~~~[m[K
Check 1/1
Naive matmul 10.0411 ms


Your answers:

Screenshots

1. ![](https://drive.google.com/uc?export=view&id=1VefYQC9RJsFCS7x1ugd778Ph2SeoR9PE)


2. ![](https://drive.google.com/uc?export=view&id=1c8KJmhbAA47iYuytb8eJv9Esh4GRzekY)


Optimal block size - 8

BLOCK SIZE 6 ~ 13 ms

BLOCK SIZE 8 ~ 10 ms

BLOCK SIZE 10 ~ 13 ms

(The exact reason for this is complicated to find due to the interaction between the block scheduler, thread scheduler, memory address streams generated, internal resource limits, internal buffering mechanisms etc. However, for simple kernels like ours it may be possible to find a closed-form solution. In practical settings, performance tuning will usually reveal the best set of configuration parameters to use.)


---

**Bonus - Part 1 (2 points)**

For this bonus part, continue to make changes to the `naive_matmul.cu` script and measure execution times using it. You will need to submit this modified script.

Observe the memory access pattern of the kernel provided and explore variations to find the most efficient one. Show the improved execution times. **(Points: 1)**

Try creating pinned memory on the host instead of allocating it using `malloc`. Observe and note the improved execution time. **(Points: 1)**

Hint: Look at the matrix initialisation code for A or B inside `main()`. Is A (and B) row major or column major?

In [None]:
# your code workspace
# Look at the kernel gpu_gemm_nn_optimised naive_matmul.cu file



Your answer: 

Changing matrix A to column major improves performance: adjacent threads within a warp then access adjacent elements along a column of A, which gives a better memory access pattern through memory coalescing.
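
The effect of the layout on addresses can be sketched with two small index helpers. The names `row_major_index` and `col_major_index` are hypothetical and are not taken from `naive_matmul.cu`. In the kernel, neighbouring threads along x differ by one in `m_pos`, so the distance between the elements of A that they read depends on the layout.

```cpp
#include <cstddef>

// Linear offset of element (row, col) in a row-major matrix with `cols` columns.
constexpr std::size_t row_major_index(std::size_t row, std::size_t col, std::size_t cols)
{
    return row * cols + col;
}

// Linear offset of element (row, col) in a column-major matrix with `rows` rows.
constexpr std::size_t col_major_index(std::size_t row, std::size_t col, std::size_t rows)
{
    return col * rows + row;
}

// Distance, in elements, between the entries of A(m,k) read by two threads whose
// m_pos values are 0 and 1, for the same k_pos.
constexpr std::size_t row_major_stride(std::size_t k)
{
    return row_major_index(1, 0, k) - row_major_index(0, 0, k);   // equals k
}

constexpr std::size_t col_major_stride(std::size_t m)
{
    return col_major_index(1, 0, m) - col_major_index(0, 0, m);   // equals 1
}
```

With a row-major A the neighbouring threads are `k` elements apart, while with a column-major A they are one element apart, which is the access pattern that hardware can coalesce.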

---

# Execution policies (since C++17)

Since C++17, parallelism was made available for popular `std::` algorithms under the context of execution policies. To read more about this refer to the following articles:
1. https://developer.nvidia.com/blog/accelerating-standard-c-with-gpus-using-stdpar/
2. https://developer.nvidia.com/blog/developing-accelerated-code-with-standard-language-parallelism/
3. https://developer.nvidia.com/blog/multi-gpu-programming-with-standard-parallel-c-part-1/
4. https://developer.nvidia.com/blog/multi-gpu-programming-with-standard-parallel-c-part-2/

<div class="alert alert-block alert-info"> <b>Tip:</b> One might find many answers for this assignment hidden in the above links!</div>

In this section, we will explore the usage of different execution policies offered by the standard and inspect and try to improve their performance using the `nsys profiler`.




We will first implement some trivial workloads using C++11 (& C++17). These will help us understand the maximum amount of effort needed to make our workloads compatible for GPUs.

### [std::sort]()

Look inside the file `sort.cpp`. It contains a very simple example which can be our first candidate made eligible for standard parallelism.

Your task is to modify the function `naive_sort` to use [`std::execution`](https://en.cppreference.com/w/cpp/algorithm/execution_policy_tag_t) policies, benchmark and find the execution times and fill in the 2 tables below. To find execution times for smaller workloads, make sure you use a high precision clock. **(Points: 3)**

(Feel free to modify the script as you please, to automate the benchmarking process.)
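
One possible way to automate the measurements is sketched below. The helper name `average_sort_ms` and its signature are assumptions, not part of `sort.cpp`. It uses `std::chrono::steady_clock`, which is monotonic, and takes the execution policy as a parameter so the same code can fill every row of the first table.

```cpp
#include <algorithm>
#include <chrono>
#include <execution>
#include <vector>

// Hypothetical helper: sorts a fresh copy of `input` `iterations` times with the
// given execution policy and returns the mean time per sort in milliseconds.
template <typename Policy>
double average_sort_ms(Policy&& policy, const std::vector<int>& input, int iterations)
{
    using clock = std::chrono::steady_clock;   // monotonic clock for interval timing
    std::chrono::duration<double, std::milli> total{0.0};
    for (int i = 0; i < iterations; ++i) {
        std::vector<int> data = input;         // unsorted copy, made outside the timed region
        const auto start = clock::now();
        std::sort(policy, data.begin(), data.end());
        total += clock::now() - start;
    }
    return total.count() / iterations;
}
```

A call such as `average_sort_ms(std::execution::par, input, 1000)` would then give one table entry. Note that each copy is created on the host, so with `-stdpar=gpu` the timed region is likely to include the cost of moving the data to the device.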

| Time in ms (average over 1000 iterations) | cpu  | multicore  | gpu  |
|:---|:---:|:---:|:---:|
| no execution policy  |   |   |   |
| std::seq  |   |   |   |
|  std::unseq |   |   |   |
|  std::par |   |   |   |
|  std::par_unseq |   |   |   |

| Time in ms (average over 1000 iterations) | N=1024 | N=2048 | N=8192 | N=81920 | N=819200 |
|:---|:---:|:---:|:---:|:---:|:---:|
| cpu |   |   |   |   |   |
| multicore |   |   |   |   |   |
| gpu |   |   |   |   |   |

Mention at least one important observation from table 1 and table 2. **(Points: 1)**

In [None]:
# your workspace

# !nvc++ -Minfo -stdpar=gpu/multicore sort_sol.cpp && ./a.out
!nvc++ -Minfo -stdpar=gpu sort_sol.cpp && ./a.out

  }
   ^

3.21192 ms
1 819200


Your answer:

Table 1: For large workloads, seq and unseq improve only marginally from cpu to multicore to gpu, compared to the improvement shown by par/par_unseq.

Table 2: For smaller workloads "seq" dominates on the cpu (some reasons being that no memory is copied between device and host, and that a smaller workload has less need for parallelism). However, as the array size increases beyond 8192, the best execution time is shown by the GPU when a parallel (par, par_unseq) execution policy is used. It even performs better than multicore mode, since the number of threads available for parallel execution is much higher on the GPU than on all of the CPU cores combined.

----

### Examples

Look inside the files `sum.cpp`, `vector_product.cpp`, `swaps.cpp` and `increment.cpp`. They contain naive implementations of some popular algorithms. Your task is to examine the code presented and answer the questions below.

Using [`std::for_each`](https://en.cppreference.com/w/cpp/algorithm/for_each) only, which execution policy will perform the best for each algorithm? **(Points: 1)**

| Algorithm  | Execution policy used |
|---|---|
| swaps  | seq  |
|  naive_sum | seq/unseq (since parallel execution policies will lead to data race)  |
|  naive_increment | seq/unseq (higher compute workloads will change this answer to par/par_unseq) |
|  naive_inner_prod | seq/unseq (since parallel execution with for_each will cause data race) |


Port the above "naive" implementations to work on GPUs using the following `std` algorithms. **(Points: 4)**

| Algorithm  | std:: | function name |
|---|---|---| 
| swaps  |  for_each | std_swaps |
|  naive_sum |  reduce | std_reduce_sum |
|  naive_increment | transform | std_transform |
|  naive_innerp | inner_product | std_innerp |

The function signatures are already provided in each of the code files. Fill in these empty functions and make sure they work with the `main()` provided.
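
As an illustration of the kind of port the table asks for, here is a minimal sketch of two of the functions. The signatures are assumptions (the ones provided in the assignment files may differ), and the execution policy is passed as a parameter so that the same code can be tried with `seq`, `par` and `par_unseq`.

```cpp
#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

// Illustrative signatures only; the ones in the assignment files may differ.

// Sum with std::reduce: the algorithm combines partial sums itself, so there is
// no shared accumulator for parallel iterations to race on.
template <typename Policy>
int std_reduce_sum(Policy&& policy, const std::vector<int>& v)
{
    return std::reduce(policy, v.begin(), v.end(), 0);
}

// Increment with std::transform: every iteration reads and writes only its own
// element, so iterations are independent of each other.
template <typename Policy>
void std_transform(Policy&& policy, std::vector<int>& v)
{
    std::transform(policy, v.begin(), v.end(), v.begin(),
                   [](int x) { return x + 1; });
}
```

Compared with a `std::for_each` that adds into a captured variable, `std::reduce` removes the data race noted in the first table, which is what makes the parallel policies usable for the sum.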

In [None]:
# your workspace
!nvc++ -Minfo increment_sol.cpp && ./a.out

  }
   ^

5
0.106532 ms
5
0.122664 ms


Based on your implementations, answer the following questions: **(Points: 4)**
1. How well does a parallelised `std_swaps` perform against a sequential implementation?
2. On which platform (CPU or GPU) does parallelised `std_reduce_sum` perform better?
3. For `N=8192`, does parallelised `std_transform` on the GPU ever perform better than the sequential implementation? If not, explain why; otherwise, mention the scenario in which it shows improvement over the sequential implementation. (Hint: Adjust the workload assigned within each loop and observe the behavior.)
4. Does `std::inner_product` support execution policies? If not, write a new function `std_transform_reduce` which makes use of an algorithm that supports parallel execution policies.



Your answer:

(For detailed reasoning read the comments in solution code files.)

1. Poorly as compared to sequential policy.

2. On GPUs, but only for large workloads (N = 819200, 8192000). For a workload of N=8192, CPU execution with any policy outperforms the GPU. It is worth mentioning that when the GPU does not show better performance, this is primarily due to poor memory access patterns from device global memory. Using `-stdpar=multicore` is a better option than relying on single-core CPU execution, since there is no concept of device memory for `multicore`.

3. Find the reason in the solution code (`increment_sol.cpp`).

4. No, it does not.
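
Since `std::inner_product` has no overloads that take an execution policy, one option is `std::transform_reduce`, which does. Below is a minimal sketch of `std_transform_reduce`; the signature is an assumption, and `b` is assumed to be at least as long as `a`.

```cpp
#include <execution>
#include <numeric>
#include <vector>

// Illustrative signature; the one expected by the assignment files may differ.
// Computes sum(a[i] * b[i]); `b` must have at least as many elements as `a`.
template <typename Policy>
float std_transform_reduce(Policy&& policy,
                           const std::vector<float>& a,
                           const std::vector<float>& b)
{
    // This overload multiplies element pairs and adds the products (the defaults
    // are std::multiplies and std::plus), starting from the initial value 0.0f.
    return std::transform_reduce(policy, a.begin(), a.end(), b.begin(), 0.0f);
}
```

Unlike `std::inner_product`, the order in which products are added is unspecified, so with floating-point data the result may differ slightly between policies and runs.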




Note: On single-core CPUs, only concurrent execution is possible, hence all policies should show similar execution times with minor differences. This should hold for all the kernels presented in this question.

---

### Debugging

This section will focus on some common pitfalls when using `std::par` with GPUs. 

To complete this part, you will need to debug, compile and verify the code snippets provided. In the files `debug*.cpp`, make minor changes so that they compile and the checks pass. **(Points: 4)**

(Hint: Search for an `nvc++` flag which provides additional debug info during compilation.)
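
One pitfall described in the NVIDIA articles linked earlier is that, with `-stdpar=gpu`, code running on the GPU can only dereference pointers into CPU heap memory, not into the CPU stack or global variables. Whether the `debug*.cpp` files exercise this particular issue is not assumed here; the sketch below (with the hypothetical name `scale_in_place`) only illustrates the safe pattern of capturing scalars by value and keeping the data in a heap-backed container.

```cpp
#include <algorithm>
#include <execution>
#include <vector>

// Hypothetical example: scales every element of `v` by `factor`.
template <typename Policy>
void scale_in_place(Policy&& policy, std::vector<float>& v, float factor)
{
    // `factor` is captured by value. A by-reference capture would make the lambda
    // hold a pointer into this function's CPU stack frame. The vector's elements
    // live on the heap, so references to them are safe to use in the lambda.
    std::for_each(policy, v.begin(), v.end(),
                  [factor](float& x) { x *= factor; });
}
```

On a CPU-only build both capture styles behave the same, which is why this kind of mistake tends to surface only once the code is compiled for the GPU.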

In [None]:
# your workspace

---

# Matrix multiplication using std::par

Now, you shall finally implement matrix multiplication using std::par. To complete this section, make sure you have the nsys profiler GUI installed and running.

**Part 1**

The reference script is provided in `matmul_1.cpp`.

The points for this part will depend on the relative performance of your optimised implementation in comparison to the CUDA kernels (under a similar workload).

**(Total Points: 6)**

*Make sure your implementation uses at most 2 nested `std::for_each`/looping blocks for this part.*

Q1. What is the relative performance of your unoptimised implementation when compared to the CUDA kernel? **(Points: 0.5)**

Q2. In order to reduce the execution time of your implementation, take the help of the `nsys profiler` to collect logs and identify unnecessary `memcpy` cycles from **device to host** and **host to device**. Attach screenshots clearly marking these memcpy operations from the profiler. **(Points: 0.5)**

Q3. Refer to [this](https://developer.nvidia.com/blog/multi-gpu-programming-with-standard-parallel-c-part-2/) article to come up with a fix to the script provided. Explain the fix. **(Points: 0.5)**

Q4. Based on the finding in the previous question try to optimise your script as much as possible. Note down the new relative performance of your implementation. **(Points: 3.5)**

Q5. In the same article linked above, there is yet another optimisation strategy mentioned which further improves the execution time on top of the previous fix. Identify it and mention the relative speed up from step 4. **(Points: 0.5)**

Q6. Based on our observation from the CUDA exercise, is `cudaMallocHost` a better way to allocate memory on the host when using standard parallelism? Explain your answer. **(Points: 0.5)**

You will need to submit the `matmul_1.cpp` script provided with your (ultra) optimised implementation.

In [21]:
# your workspace
!nvc++ -Minfo -stdpar=gpu matmul_1.cpp && ./a.out

# !nsys profile --trace cuda,nvtx -f true -o report2 ./a.out

      printf("Time elapsed on matrix multiplication of %dx%d . %dx%d for matmul: %f ms.\n\n", m, k, k, n, elapsed);
                                                                                                          ^

Time elapsed on matrix multiplication of 1024x1024 . 1024x1024 for matmul: 16.218702 ms.



Your answers:

1. ~ (10ms / 20 ms) = 0.5

<!-- 2. ![](https://drive.google.com/uc?export=view&id=1sg6hAf8FYFN26ejTxA39_vj3mIeAk1HI) -->

2. https://drive.google.com/file/d/1sg6hAf8FYFN26ejTxA39_vj3mIeAk1HI/view?usp=sharing


3. The improvement comes from no longer initialising the matrices on the CPU. Since stdpar uses unified memory, CPU-side initialisation forces the matrices to be migrated between device and host, resulting in an unnecessary memory copy.

4. (10 ms / 16 ms)

5. The second improvement is obtained by allocating the memory with `cudaMalloc` instead of `malloc`, which forces the memory to be available on the GPU. All these optimisations serve one purpose: to reduce the number of memcpy cycles required before we get the actual outputs. (Note: For the actual outputs we will have to incur a device to host memcpy cycle.) - (10 ms / 15 ms)

6. `cudaMallocHost` creates pinned memory explicitly on the CPU, so we would have to incur additional host to device and device to host memcpy cycles, which would not be beneficial for our use.
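
For reference, one way to stay within the limit on nested `std::for_each` blocks is to run a single parallel loop over the output cells and keep only the inner product as a plain loop. The sketch below is an illustration under assumptions (the name `stdpar_matmul`, the `std::vector<float>` arguments and row-major storage are not taken from `matmul_1.cpp`). It recovers each cell's index from the address of the element, which is valid for `std::for_each` because that algorithm is not allowed to copy the elements it visits.

```cpp
#include <algorithm>
#include <cstddef>
#include <execution>
#include <vector>

// Hypothetical sketch: C(m,n) = A(m,k) * B(k,n), all row major, with one
// parallel loop over the m*n cells of C.
template <typename Policy>
void stdpar_matmul(Policy&& policy, std::size_t m, std::size_t n, std::size_t k,
                   const std::vector<float>& a, const std::vector<float>& b,
                   std::vector<float>& c)
{
    c.resize(m * n);

    // Raw pointers into heap storage, captured by value in the lambda below.
    const float* a_ptr = a.data();
    const float* b_ptr = b.data();
    const float* c_ptr = c.data();

    std::for_each(policy, c.begin(), c.end(), [=](float& out) {
        // Position of this element within C, then its row and column.
        const std::size_t cell = static_cast<std::size_t>(&out - c_ptr);
        const std::size_t row = cell / n;
        const std::size_t col = cell % n;

        float tmp = 0.0f;
        for (std::size_t k_pos = 0; k_pos < k; ++k_pos) {
            tmp += a_ptr[row * k + k_pos] * b_ptr[k_pos * n + col];
        }
        out = tmp;
    });
}
```

Calling it with `std::execution::par` and compiling with `-stdpar=gpu` would offload the loop. Filling `a` and `b` inside a parallel algorithm as well, rather than in a host loop, is in line with the fix discussed in answer 3.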

---

**Part 2 (Bonus)**

The reference script is provided in `matmul_2.cpp`.

**(Total Points: 3)**



Q1. In the previous part there was a constraint on the number of nested `std::for_each` blocks that one could use. However, it is more intuitive to represent matrix multiplication in terms of an outer loop on the row index, an intermediate loop on the column index and an inner loop representing the inner product of the row and column. Implement matrix multiplication using at least 3 nested `std::for_each` blocks. **(Points: 1)**

Q2. Did you face any issues in the previous question? What is the best execution time you can obtain using the 3 nested `for_each` blocks? **(Points: 1)**

Q3. Can you improve your execution time by switching from row-major to column major for each of the three matrices? How much improvement did you observe? Provide an intuitive explanation for the same. **(Points: 1)**

You will need to submit the `matmul_2.cpp` script provided with your nested `for_each` implementation.

In [24]:
# your workspace

!nvc++ -Minfo -stdpar=gpu matmul_2.cpp && ./a.out

      printf("Time elapsed on matrix multiplication of %dx%d . %dx%d for matmul: %f ms.\n\n", m, k, k, n, elapsed);
                                                                                                          ^

Time elapsed on matrix multiplication of 1024x1024 . 1024x1024 for matmul: 134.058170 ms.



Your answers:

2. Possible issues - cannot create nested std par policies.

3. The improvement is because writing (to the output matrix cell) is much more time consuming than reading from the input matrix cells. This is because writing 

improvement - (134 ms / 1100 ms)

---