# Using MPI with OpenMP* Offload (C/C++)

## Learning Objective

* Explain how OpenMP can be used along with MPI on compute clusters with GPUs

## HPC Multi-Node Workflow with oneAPI
With oneAPI, accelerated code can be written in either a kernel style (DPC++) or a __directive-based style__. Developers can use the __Intel® DPC++ Compatibility Tool__ to perform a one-time migration from __CUDA*__ to __Data Parallel C++__. Existing __Fortran__ programmers can leverage __OpenMP*__ for directive-based offload. Existing __C++__ applications can use either the __kernel style__ or the __directive-based style__, and existing __OpenCL*__ applications can remain in the OpenCL language or migrate to Data Parallel C++.

To take advantage of multi-node clusters with GPUs, these programming models can be used in conjunction with MPI. The Intel® MPI Library is included with the Intel® oneAPI HPC Toolkit.

__Intel® Advisor__ is recommended to __optimize__ the existing design for __vectorization and memory usage__ (CPU and GPU), to __identify__ loops that are candidates for __offload__, and to project their __performance on target accelerators__. The __Intel® VTune™ Profiler__ can be used to profile your accelerated program, while the __Intel® Trace Analyzer and Collector__ can help you understand MPI application behavior.

The figure below shows the recommended approach for HPC developers coming from different starting points.

<img src="Assets/OneAPI_flow_mpi.JPG">

## MPI and OpenMP

The following diagram illustrates how MPI and OpenMP interact with the various nodes and accelerator devices in a cluster. Currently, the Intel® MPI Library supports host-based MPI, where MPI calls are made from the hosts' CPUs. In the future, GPU-aware MPI will be supported so that GPU code will be able to interact with MPI directly.

<img src="Assets/OpenMP_MPI.JPG">

In the current paradigm, the two standards serve different purposes and can be effectively used together. MPI is used to communicate between nodes, while OpenMP is used to accelerate computation on a single node using the CPUs or GPUs available.
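The single-node half of this division of labor can be sketched with one OpenMP target loop. The helper below is hypothetical and is not part of the lab code: `SumOfSquares` and its parameters are names chosen for illustration. It assumes a compiler with OpenMP offload enabled; by default the loop typically falls back to the host when no device is available, and it runs serially if OpenMP is not enabled at compile time.

```cpp
// Hypothetical helper (not part of the lab code): sums the squares of n values.
// The target directive asks the OpenMP runtime to run the loop on a device.
// map(to: ...) copies the input to the device, and the reduction clause
// combines the per-thread partial sums into "sum".
double SumOfSquares(const double* data, int n) {
    double sum = 0.0;
    #pragma omp target teams distribute parallel for \
        map(to: data[0:n]) map(tofrom: sum) reduction(+: sum)
    for (int i = 0; i < n; i++) {
        sum += data[i] * data[i];
    }
    return sum;
}
```

In an MPI program, each rank would call a function like this on its own slice of the data and then combine the per-rank results with a host-side MPI call.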

## Calculating Pi

To illustrate this concept, we're going to use both MPI and OpenMP to calculate pi through numeric integration.

$$
\pi=\int_0^1 \frac{4}{1+x^2} dx
$$

The algorithm will use a discretization of the above equation.

$$
\pi \approx \sum_{i=0}^{N-1} \frac{4}{1+x_{i}^2} dx
$$

where $N$ is the number of steps (iterations), $dx = 1/N$, and $x_{i} = (i+0.5)/N$ is the midpoint of the $i$-th subinterval.

In the code below, _main()_ divides the iterations evenly among the MPI ranks. Each rank then executes its group of iterations in parallel on the GPU.

Execute the cell below to write the code to file.


In [None]:
%%writefile lab/pi_mpi_omp.cpp

//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
//
// PI_MPI_OMP: Using OpenMP Offload in MPI program.
//
// Using OpenMP Offload, the code sample runs multiple MPI ranks to
// distribute the calculation of the number Pi. Each rank offloads the
// computation to an accelerator (GPU/CPU) using OpenMP Offload to compute
// a partial sum of the number Pi.
//
// For more information on the Intel(r) C++ Compiler or
// Intel(r) MPI Library, visit the Intel(r) HPC Toolkit website.
// https://software.intel.com/content/www/us/en/develop/tools/oneapi/hpc-toolkit.html
//
//******************************************************************************
// Content: (version 1.0)
//      Calculate the number Pi in parallel using its integral representation.
//
//******************************************************************************
#include <mpi.h>
#include <cstdlib>   // exit
#include <iostream>
#include <omp.h>

using namespace std;

constexpr int kMaster = 0;
constexpr long kIteration = 1024;
constexpr long kScale = 45;
constexpr long kTotalNumStep = kIteration * kScale;

//******************************************************************************
// Function description: computes the number Pi partially in parallel using OpenMP.
// Each MPI rank calls this function to compute its partial sum of the number Pi.
//******************************************************************************
void CalculatePiParallel(float* results, int rank_num, int num_procs);

int main(int argc, char* argv[]) {
    int id, num_procs;
    float total_pi = 0.0f;

    // Start MPI.
    if (MPI_Init(&argc, &argv) != MPI_SUCCESS) {
        cout << "Failed to initialize MPI\n";
        exit(-1);
    }
    
    // Retrieve the number of processes in the communicator.
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
    
    // Determine the rank of the process.
    MPI_Comm_rank(MPI_COMM_WORLD, &id);

    int num_step_per_rank = kTotalNumStep / num_procs;
    float* results_per_rank = new float[num_step_per_rank];
    for (size_t i = 0; i < num_step_per_rank; i++) results_per_rank[i] = 0.0;

    // Calculate the Pi number partially in parallel.
    CalculatePiParallel(results_per_rank, id, num_procs);

    float sum = 0.0;
    for (size_t i = 0; i < num_step_per_rank; i++) sum += results_per_rank[i];

    delete[] results_per_rank;

    MPI_Reduce(&sum, &total_pi, 1, MPI_FLOAT, MPI_SUM, kMaster, MPI_COMM_WORLD);

    if (id == kMaster) cout << "---> pi= " << total_pi << "\n";

    MPI_Finalize();

    return 0;
}

////////////////////////////////////////////////////////////////////////
//
// Compute the number Pi partially on device: the partial result is
// returned in "results".
//
////////////////////////////////////////////////////////////////////////
void CalculatePiParallel(float* results, int rank_num, int num_procs) {
    char machine_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    int is_cpu=true;
    int num_step = kTotalNumStep / num_procs;
    float* x_pos_per_rank = new float[num_step];
    float dx, dx_2;

    // Get the machine name.
    MPI_Get_processor_name(machine_name, &name_len);

    dx = 1.0f / (float)kTotalNumStep;
    dx_2 = dx / 2.0f;

    for (size_t i = 0; i < num_step; i++)
        x_pos_per_rank[i] = ((float)rank_num / (float)num_procs) + i * dx + dx_2;
    
    #pragma omp target map(from:is_cpu) map(to:x_pos_per_rank[0:num_step]) map(from:results[0:num_step])
    {  
        #pragma omp teams distribute parallel for simd
        // Each iteration computes one term of this rank's partial sum of Pi.
        for (int k=0; k< num_step; k++) {
            if (k==0) is_cpu=omp_is_initial_device();
            float x = x_pos_per_rank[k];
            results[k] = (4.0f * dx) / (1.0f + x * x);
        }
    }
    cout << "Rank " << rank_num << " of " << num_procs
         << " runs on: " << machine_name
         << ", uses device: " << (is_cpu?"CPU":"GPU")
         << "\n";

    // Cleanup.
    delete[] x_pos_per_rank;
}

### Compile the Code
OpenMP offload code that uses MPI can be compiled with the __mpiicpc__ compiler command included with the Intel MPI Library. Simply set the underlying compiler to __icpx__ and pass the options that enable OpenMP offload.
The script _compile_omp_c.sh_ was created to make it easy to submit compile commands on the DevCloud.
It compiles the newly written _pi_mpi_omp.cpp_ with __mpiicpc__ using __icpx__.
You may examine the compile script by executing the following cell.

In [None]:
%pycat compile_omp_c.sh

The following cell submits the compilation script for execution using the __q__ script. The __q__ script submits jobs to the DevCloud and retrieves the output. The first argument to __q__ is the script to execute. The second argument specifies the properties of the nodes to request. In the following cell, we request one node with the property ppn=2.

In [None]:
! chmod 755 q; chmod 755 compile_omp_c.sh; ./q compile_omp_c.sh nodes=1:ppn=2

### Execute the Code
Next, we will execute the compiled binary with the __mpirun__ command, which launches the MPI job using 4 processes. Examine the launch script by executing the following cell.

In [None]:
%pycat launch.sh

Execute the following cell to run the program on multiple nodes. In the following example, 2 nodes with GPUs are requested.

In [None]:
! chmod 755 q; chmod 755 launch.sh; ./q launch.sh nodes=2:gpu:ppn=2

Once the execution completes, the output should show the value of $\pi$ as well as the node that each process ran on.

## Conclusion
This simple exercise exposed you to the basics of using OpenMP offload and MPI with the [Intel® oneAPI Toolkits](https://software.intel.com/oneapi "oneAPI main page"). We encourage you to try these software toolkits on the [Intel® DevCloud](https://devcloud.intel.com/datacenter/connect) with your own designs.
***

@Intel Corporation | [\*Trademark](https://www.intel.com/content/www/us/en/legal/trademarks.html)