# HPC Demonstration

This notebook explores some of the considerations involved in running Firedrake on High Performance Computing (HPC) systems. We will solve very large instances of the Poisson equation, demonstrating a range of solver options for problems of different sizes. Additional supplementary material is provided for running scripts on HPC.

## How big?

The parameters `Nx`, `Nref` and `degree` defined below have been selected so that the simulation runs on a single core in a notebook. This is not the regime we want to think about in this tutorial; we want to think about very large problems. We will consider how each of these parameters affects the overall problem size.

`Nx` defines the coarse grid in the mesh hierarchy; it is used to construct a coarse cube mesh. The cube is divided into $N_x \times N_x \times N_x$ smaller cubes and each cube is then (by default) split into 6 tetrahedra, giving a total of $6 N_x^3$ cells.

The `Nref` parameter determines how many times that mesh is refined to create a mesh hierarchy. Each refinement uniformly divides each tetrahedron into 8 smaller tetrahedra. After $N_{ref}$ refinements there are $8^{N_{ref}}$ times as many cells.

Finally, the degree, which we denote $k$, specifies the polynomial order of the basis functions used to approximate functions in our finite element space. For standard Lagrange elements in 3D each cell has $\frac{1}{6}(k+1)(k+2)(k+3)$ degrees of freedom. Some of these degrees of freedom (DOFs) are shared with neighbouring cells for a conforming finite element scheme.
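The counts quoted above are easy to tabulate. The following is a minimal sketch in plain Python; the helper names `num_cells` and `dofs_per_cell` are ours for illustration and are not part of Firedrake.

```python
def num_cells(Nx, Nref):
    """Cells in the finest mesh: 6 tetrahedra per coarse cube, times 8 per refinement."""
    return 6 * Nx**3 * 8**Nref


def dofs_per_cell(k):
    """Local DOFs of a degree-k Lagrange element on a tetrahedron."""
    return (k + 1) * (k + 2) * (k + 3) // 6


for Nx, Nref in [(8, 0), (8, 1), (8, 2)]:
    print(f"Nx={Nx}, Nref={Nref}: {num_cells(Nx, Nref)} cells")

for k in (1, 2, 3):
    print(f"degree {k}: {dofs_per_cell(k)} DOFs per cell")
```

Note that multiplying the two numbers overcounts the global DOFs, because DOFs on shared vertices, edges and faces would be counted once for every cell that touches them.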

This small notebook example solves a problem with a large number of DOFs, but on HPC we want to solve problems orders of magnitude larger still. By the end of this notebook we will be considering problems larger than 30 000 000 DOFs. When solving problems using Firedrake in parallel, it's worth remembering that performance can be improved by adding more processes (MPI ranks) as long as the number of DOFs remains above [50 000 DOFs per core](https://firedrakeproject.org/parallelism.html#expected-performance-improvements).

Later we will discuss complexity in terms of the variable $n$, which corresponds to the total number of DOFs.

---
### Exercise:
Can you work out the total number of DOFs for a cube mesh: coarse grid size $N_x = 8$, $N_{ref} = 2$ refinements and degree $k=2$ Lagrange finite elements?

Can you find an expression to calculate the total number of DOFs for a mesh described above in terms of the coarse mesh size $N_x$, the number of refinements $N_{ref}$, the degree of the Lagrange finite element $k$ and the geometric dimension of the mesh $d$?

---

In [1]:
from firedrake import *
from firedrake.petsc import PETSc
from time import time

parprint = PETSc.Sys.Print

We start, as always, by importing Firedrake. We also define `parprint` to perform parallel printing as in this [demonstration](https://firedrakeproject.org/demos/parprint.py.html). Finally, we import `time` from the Python `time` module to benchmark the different solvers.

## The equations
We will consider the Poisson equation in a 3D domain $\Omega = [0, 1]^3$:

$$
\left\{
\begin{aligned}
	-\nabla^2 u &= f && \text{on } \Omega,\\
	u &= 0 && \text{on } \partial\Omega,
\end{aligned}
\right.
$$

Where $f$ is given by:

$$
f(x,y,z) = -\frac{\pi^2}{2}
\times\left( 2\cos(\pi x) - \cos\left( \frac{\pi x}{2} \right)
- 2(a^2 + b^2)\sin(\pi x)\tan \left( \frac{\pi x}{4} \right)  \right)
\times\sin(a\pi y) \sin(b\pi z)
$$

We use this particular right hand side since it has the corresponding analytic solution:

$$
u(x,y,z) =
\sin(\pi x)\tan\left(\frac{\pi x}{4}\right)
\sin(a\pi y)\sin(b\pi z)
$$
Having an analytic solution allows us to compute the error in our computed solution as $\|u_h - u\|_{L^2}$. For this notebook we fix $a=1$ and $b=2$, but feel free to try other values.
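As a quick sanity check that this $u$ and $f$ really satisfy $-\nabla^2 u = f$, the sketch below compares $f$ against a second-order central finite difference approximation of $-\nabla^2 u$ at a single interior point. It uses only the Python standard library; the function names, the sample point and the step size `h` are illustrative choices, not part of the original notebook.

```python
import math

A, B = 1, 2  # the values of a and b fixed in this notebook


def u(x, y, z):
    """Analytic solution."""
    return (math.sin(math.pi*x) * math.tan(math.pi*x/4)
            * math.sin(A*math.pi*y) * math.sin(B*math.pi*z))


def f(x, y, z):
    """Right hand side of the Poisson equation."""
    xpart = (2*math.cos(math.pi*x) - math.cos(math.pi*x/2)
             - 2*(A**2 + B**2)*math.sin(math.pi*x)*math.tan(math.pi*x/4))
    return -math.pi**2/2 * xpart * math.sin(A*math.pi*y) * math.sin(B*math.pi*z)


def minus_laplacian(fun, x, y, z, h=1e-3):
    """Central finite difference approximation of -div(grad(fun))."""
    total = (fun(x + h, y, z) + fun(x - h, y, z)
             + fun(x, y + h, z) + fun(x, y - h, z)
             + fun(x, y, z + h) + fun(x, y, z - h)
             - 6*fun(x, y, z))
    return -total / h**2


point = (0.3, 0.4, 0.2)  # arbitrary interior point
residual = abs(minus_laplacian(u, *point) - f(*point))
print("Finite difference residual:", residual)
```

The residual should be small, limited by the $O(h^2)$ truncation error of the finite difference stencil rather than by any mismatch between $u$ and $f$.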

The Poisson equation has the weak form: Find $u_h \in V$ such that

$$
\int_\Omega \nabla u_h\cdot \nabla v\ dx = \int_\Omega f v\ dx \qquad \forall v \in V
$$

For the discrete function space $V$ we initially consider piecewise quadratic Lagrange elements, that is `V = FunctionSpace(mesh, "CG", degree)`, with `degree=2`.

It is straightforward to solve the equation using Firedrake by expressing this weak form in UFL.
We create a Python function `make_problem` which generates a `problem` object of the desired size, a function `u_h` to store the solution and the analytic solution `truth` so we can compute the $L^2$ error norm.

In [2]:
def make_problem(Nx, Nref, degree):
    # Create mesh and mesh hierarchy
    mesh = UnitCubeMesh(Nx, Nx, Nx)
    hierarchy = MeshHierarchy(mesh, Nref)
    mesh = hierarchy[-1]
    
    # Define the function space and print the DOFs
    V = FunctionSpace(mesh, "CG", degree)
    dofs = V.dim()
    parprint('DOFs', dofs)

    u = TrialFunction(V)
    v = TestFunction(V)

    bcs = DirichletBC(V, zero(), (1, 2, 3, 4, 5, 6))
    
    # Define the RHS and analytic solution
    x, y, z = SpatialCoordinate(mesh)

    a = Constant(1)
    b = Constant(2)
    exact = sin(pi*x)*tan(pi*x/4)*sin(a*pi*y)*sin(b*pi*z)
    truth = Function(V).interpolate(exact)
    f = -pi**2 / 2
    f *= 2*cos(pi*x) - cos(pi*x/2) - 2*(a**2 + b**2)*sin(pi*x)*tan(pi*x/4)
    f *= sin(a*pi*y)*sin(b*pi*z)
    
    # Define the problem using the bilinear form `a` and linear functional `L`
    a = dot(grad(u), grad(v))*dx
    L = f*v*dx
    u_h = Function(V)
    problem = LinearVariationalProblem(a, L, u_h, bcs=bcs)
    return problem, u_h, truth

In [3]:
Nx = 8
Nref = 2
degree = 2

problem, u_h, truth = make_problem(Nx, Nref, degree)

DOFs 274625


Creating a problem instance, we can see there are just short of 275 000 DOFs. (How close was your answer to the above exercise?)

## The Solver

We define another function to wrap the solve, so that we can provide different solver options. To assess their performance, the run time is printed.
This is a fairly crude way to profile our code. For a more in-depth guide to profiling, take a look at the page on [optimising Firedrake performance](https://firedrakeproject.org/optimising.html).

This table summarises the different solvers we will use:

| Solver | Abbreviation |   Cost   | Information |
|:-------|:------------:|:--------:|:------------|
| LU     | LU           |   O(n²)  | Firedrake Default |
| Conjugate Gradient + Geometric Multigrid V-cycle | CG + MGV | O(qn) | Sensible choice of KSP + PC |
| Full Geometric Multigrid | FMG | O(n) | Linear complexity |
| Matrix free FMG | Matfree FMG | O(n) | Reduced memory O(kn) |
| Matrix free FMG with Telescoping | Telescoped matfree FMG | O(n) | Reduced memory and reduced communication |

where n is the problem size and q is the number of iterations taken by an iterative method.

In [4]:
def run_solve(problem, parameters):
    # Create a solver and time how long the setup and solve take
    t = time()
    solver = LinearVariationalSolver(problem, solver_parameters=parameters)
    solver.solve()
    parprint("Runtime :", time() - t)

## LU

We can start by looking at Firedrake's default solver options. If you don't specify any solver options, a direct solver such as MUMPS will be used to perform an LU factorisation.

Here we explicitly list the PETSc solver options so it's clear how the solver is set up. We also enable `snes_view` so that PETSc prints the solver options it's using at runtime.

In [5]:
u_h.assign(0)
lu_mumps = {
    "snes_view": None,
    "ksp_type": "preonly",
    "pc_type": "lu",
    "pc_factor_mat_solver_type": "mumps"
}
run_solve(problem, lu_mumps)
parprint("Error   :", errornorm(truth, u_h))

SNES Object: (firedrake_0_) 1 MPI processes
  type: ksponly
  maximum iterations=50, maximum function evaluations=10000
  tolerances: relative=1e-08, absolute=1e-50, solution=1e-08
  total number of linear solver iterations=1
  total number of function evaluations=1
  norm schedule ALWAYS
  KSP Object: (firedrake_0_) 1 MPI processes
    type: preonly
    maximum iterations=10000, initial guess is zero
    tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
    left preconditioning
    using NONE norm type for convergence test
  PC Object: (firedrake_0_) 1 MPI processes
    type: lu
      out-of-place factorization
      tolerance for zero pivot 2.22045e-14
      matrix ordering: external
      factor fill ratio given 0., needed 0.        Factored matrix follows:
          Mat Object: 1 MPI processes
            type: mumps
            rows=274625, cols=274625
            package used to perform factorization: mumps
            total: nonzeros=410915141, allocated nonzeros=41

Error   : 1.7378060441226638e-05


The above solve takes under a minute on a single Zen2 core of ARCHER2.

Dense LU factorisations are very expensive, typically $O(n^3)$. Sparse LU with a state of the art solver like MUMPS or SuperLU_dist can do better, typically in $O(n^2)$ or possibly faster, depending on the specific problem. 

We can measure the computational cost of our solvers by increasing the problem size (the number of DOFs) and observing how this changes the solver run time. In the computational cost plots below you can see that the cost of LU factorisation stays below $O(n^{5/3})$, but this cost grows far faster than the other solver methods.
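One way to turn such measurements into an exponent is to fit a straight line to $\log(\text{time})$ against $\log(n)$; the slope estimates $p$ in $O(n^p)$. The sketch below does this with a least-squares fit in plain Python. The helper `fit_exponent` is a hypothetical name, and the timings are synthetic values generated from known power laws purely to exercise the fit. They are not measurements from this notebook.

```python
import math


def fit_exponent(ns, times):
    """Least-squares slope of log(time) against log(n)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(t) for t in times]
    xbar = sum(xs) / len(xs)
    ybar = sum(ys) / len(ys)
    num = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    den = sum((x - xbar) ** 2 for x in xs)
    return num / den


# Synthetic timings generated from t = c * n**p, NOT real measurements
ns = [10_000, 80_000, 640_000, 5_120_000]
synthetic_quadratic = [1e-9 * n**2 for n in ns]
synthetic_linear = [1e-6 * n for n in ns]

print("Fitted exponent (quadratic data):", fit_exponent(ns, synthetic_quadratic))
print("Fitted exponent (linear data):   ", fit_exponent(ns, synthetic_linear))
```

With real timings the points will not lie exactly on a line, so the fitted slope is only an estimate and is most meaningful once $n$ is large enough for the asymptotic behaviour to dominate.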

Direct solvers are very fast for small problems, which is why LU is the default solver in Firedrake. However, when $n$ gets large, direct solvers are no longer viable and should be avoided where possible.

![](image/hpc_single.png)

## Iterative solvers

An alternative to a direct solver is an iterative solver and PETSc gives us access to a large number of Krylov Subspace solvers (KSP). Since we have a symmetric problem, we can use the Conjugate Gradient (CG) method, which has computational cost $O(qn)$, where $q$ is the number of iterations for the method to converge. 

To reduce $q$ we can precondition the KSP. Looking at the `make_problem` function, we have created a `MeshHierarchy`, which allows us to use Geometric Multigrid V-cycles to precondition the CG method. The solver options for this setup are shown below.

We assign 0 to the function `u_h` before we solve so that we aren't using the solution from the LU solve above as our initial guess for the CG solver.

In [6]:
u_h.assign(0)
vmg = {
    "snes_view": None,
    "ksp_type": "cg",
    "pc_type": "mg",
    "pc_mg_log": None
}
run_solve(problem, vmg)
parprint("Error   :", errornorm(truth, u_h))

SNES Object: (firedrake_1_) 1 MPI processes
  type: ksponly
  maximum iterations=50, maximum function evaluations=10000
  tolerances: relative=1e-08, absolute=1e-50, solution=1e-08
  total number of linear solver iterations=5
  total number of function evaluations=1
  norm schedule ALWAYS
  KSP Object: (firedrake_1_) 1 MPI processes
    type: cg
    maximum iterations=10000, initial guess is zero
    tolerances:  relative=1e-07, absolute=1e-50, divergence=10000.
    left preconditioning
    using PRECONDITIONED norm type for convergence test
  PC Object: (firedrake_1_) 1 MPI processes
    type: mg
      type is MULTIPLICATIVE, levels=3 cycles=v
        Cycles per PCApply=1
        Not using Galerkin computed coarse grid matrices
    Coarse grid solver -- level -------------------------------
      KSP Object: (firedrake_1_mg_coarse_) 1 MPI processes
        type: preonly
        maximum iterations=10000, initial guess is zero
        tolerances:  relative=1e-05, absolute=1e-50, diverge

The CG solver is significantly faster than the LU factorisation, but is still slower than the full Geometric multigrid method.

We can measure the weak scaling performance of the solvers by increasing the size of the problem in line with the number of processors. This is done approximately in the plot below, where the number of DOFs per core is displayed underneath each data point. For a solver that weak scales perfectly, using twice as many cores to solve a problem twice as large should take the same total time, so the lines in the plot should be approximately constant.

In the weak scaling plot below, CG weak scales for longer than the LU factorisation, but we also see that CG in _this_ setup does not weak scale as well as the full multigrid methods.

![](image/hpc_weak.png)

## Geometric Multigrid

It is possible to solve the Poisson problem using multigrid alone, without an iterative solver. To achieve this, the `ksp_type` is set to `preonly` and we now perform a full multigrid sweep (sometimes called an F-cycle).

Beware that by using `preonly` we turn off PETSc's internal convergence checks. In this example we have carefully chosen the number of smoothing steps (`mg_levels_ksp_max_it`) so that our solver converges.

---
### Exercise:
Can we get away with fewer smoothing steps in this multigrid solver?
Try changing `mg_levels_ksp_max_it` in the solver options and observe what happens to the errornorm for the solution.

---

In [7]:
u_h.assign(0)
fmg = {
    "snes_view": None,
    "ksp_type": "preonly",
    "pc_type": "mg",
    "pc_mg_log": None,
    "pc_mg_type": "full",
    "mg_levels_ksp_type": "chebyshev",
    "mg_levels_ksp_max_it": 10,
    "mg_levels_pc_type": "jacobi",
    "mg_coarse_pc_type": "lu",
    "mg_coarse_pc_factor_mat_solver_type": "mumps"
}
run_solve(problem, fmg)
parprint("Error   :", errornorm(truth, u_h))

SNES Object: (firedrake_2_) 1 MPI processes
  type: ksponly
  maximum iterations=50, maximum function evaluations=10000
  tolerances: relative=1e-08, absolute=1e-50, solution=1e-08
  total number of linear solver iterations=1
  total number of function evaluations=1
  norm schedule ALWAYS
  KSP Object: (firedrake_2_) 1 MPI processes
    type: preonly
    maximum iterations=10000, initial guess is zero
    tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
    left preconditioning
    using NONE norm type for convergence test
  PC Object: (firedrake_2_) 1 MPI processes
    type: mg
      type is FULL, levels=3 cycles=v
        Not using Galerkin computed coarse grid matrices
    Coarse grid solver -- level -------------------------------
      KSP Object: (firedrake_2_mg_coarse_) 1 MPI processes
        type: preonly
        maximum iterations=10000, initial guess is zero
        tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
        left preconditioning
    

Removing the KSP iterative solve gives a significant speed up over using multigrid as a preconditioner. Note this is possible since the linear system for Poisson is relatively well conditioned, and is a prototypical problem for multigrid. In general it may not be possible to just use a multigrid preconditioner without a Krylov subspace iterative solver. Designing a fast and scalable solver for a given problem is often very challenging.

We can measure the strong scaling performance of the multigrid solver by choosing a large enough problem and seeing how long it takes to solve on different numbers of processes. In the plot below, the number of DOFs per core is displayed underneath each data point. For a solver that strong scales perfectly, using twice as many cores to solve the same size problem should take half as long. This ideal scaling is also plotted as the dashed line.

The figure below shows what happens when we use this multigrid solver for a large problem. For this test we set `Nx = 10` and `Nref = 4` to make a problem with 33 076 161 DOFs and solve over multiple nodes.
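To get a feel for how thinly this problem is spread as nodes are added, the sketch below recomputes the DOF count from `Nx` and `Nref` and divides it by the number of MPI ranks. It assumes degree 2 elements (so the count is $(2N+1)^3$, the formula quoted in the main exercise) and 128 ranks per node, the value used in the ARCHER2 job script in this notebook. The function name `dofs_per_rank` and the list of node counts are illustrative.

```python
Nx, Nref = 10, 4          # values quoted in the text
ranks_per_node = 128      # assumption: taken from the ARCHER2 job script

N = Nx * 2**Nref          # cells along one edge of the finest mesh
dofs = (2 * N + 1) ** 3   # degree 2 Lagrange DOF count on the cube


def dofs_per_rank(nodes):
    """Average number of DOFs owned by each MPI rank."""
    return dofs / (nodes * ranks_per_node)


print("Total DOFs:", dofs)
for nodes in (1, 2, 4, 8, 16):
    print(f"{nodes:2d} nodes: {dofs_per_rank(nodes):10.0f} DOFs per rank")
```

Under these assumptions the load per rank drops below the 50 000 DOFs per core guideline somewhere between 4 and 8 nodes, which is worth bearing in mind when reading any strong scaling plot for this problem size.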

The full multigrid solver strong scales poorly beyond 2 nodes. A matrix free solver is slightly faster, but still scales poorly. The reason for this poor scaling is that the solver spends most of its time communicating while it solves the coarse grid problem in a distributed manner. In the next section we show how to overcome this issue using a telescoping solver.

![](image/hpc_strong.png)

## Variants

In this section we show two variants of the full multigrid solver above, which have advantages for larger problems and on HPC architectures.

One key advantage of using geometric multigrid over algebraic multigrid is the ability to use matrix free methods. These methods never assemble the full finite element matrix, which for large problems gives a significant reduction in memory usage. On the coarsest mesh of the multigrid hierarchy we can use the `firedrake.AssembledPC` to assemble the finite element matrix, which allows us to use a direct solver.

In [8]:
u_h.assign(0)
fmg_matfree = {
    "snes_view": None,
    "mat_type": "matfree",
    "ksp_type": "preonly",
    "pc_type": "mg",
    "pc_mg_log": None,
    "pc_mg_type": "full",
    "mg_levels_ksp_type": "chebyshev",
    "mg_levels_ksp_max_it": 10,
    "mg_levels_pc_type": "jacobi",
    "mg_coarse_pc_type": "python",
    "mg_coarse_pc_python_type": "firedrake.AssembledPC",
    "mg_coarse_assembled": {
        "mat_type": "aij",
        "pc_type": "lu",
        "pc_factor_mat_solver_type": "mumps"
    }
}
run_solve(problem, fmg_matfree)
parprint("Error   :", errornorm(truth, u_h))

SNES Object: (firedrake_3_) 1 MPI processes
  type: ksponly
  maximum iterations=50, maximum function evaluations=10000
  tolerances: relative=1e-08, absolute=1e-50, solution=1e-08
  total number of linear solver iterations=1
  total number of function evaluations=1
  norm schedule ALWAYS
  KSP Object: (firedrake_3_) 1 MPI processes
    type: preonly
    maximum iterations=10000, initial guess is zero
    tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
    left preconditioning
    using NONE norm type for convergence test
  PC Object: (firedrake_3_) 1 MPI processes
    type: mg
      type is FULL, levels=3 cycles=v
        Not using Galerkin computed coarse grid matrices
    Coarse grid solver -- level -------------------------------
      KSP Object: (firedrake_3_mg_coarse_) 1 MPI processes
        type: preonly
        maximum iterations=10000, initial guess is zero
        tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
        left preconditioning
    

The final set of solver options deals with very large problems spread over multiple compute nodes. For a problem with a large multigrid hierarchy, the coarse grid problem is often so small that when it is solved over multiple nodes, the coarse solve spends all its time performing communication, which is slow.

The solution is to let each node solve a local copy of the coarse grid problem, which avoids this communication. This functionality is enabled using the `telescope` preconditioner alongside the assembled preconditioner, as shown below:

In [9]:
u_h.assign(0)
telescope_factor = 1 # Set to number of nodes!
fmg_matfree_telescope = {
    "snes_view": None,
    "mat_type": "matfree",
    "ksp_type": "preonly",
    "pc_type": "mg",
    "pc_mg_log": None,
    "pc_mg_type": "full",
    "mg_levels_ksp_type": "chebyshev",
    "mg_levels_ksp_max_it": 10,
    "mg_levels_pc_type": "jacobi",
    "mg_coarse_pc_type": "python",
    "mg_coarse_pc_python_type": "firedrake.AssembledPC",
    "mg_coarse_assembled": {
        "mat_type": "aij",
        "pc_type": "telescope",
        "pc_telescope_reduction_factor": telescope_factor,
        "pc_telescope_subcomm_type": "contiguous",
        "telescope_pc_type": "lu",
        "telescope_pc_factor_mat_solver_type": "mumps"
    }
}
run_solve(problem, fmg_matfree_telescope)
parprint("Error   :", errornorm(truth, u_h))

SNES Object: (firedrake_4_) 1 MPI processes
  type: ksponly
  maximum iterations=50, maximum function evaluations=10000
  tolerances: relative=1e-08, absolute=1e-50, solution=1e-08
  total number of linear solver iterations=1
  total number of function evaluations=1
  norm schedule ALWAYS
  KSP Object: (firedrake_4_) 1 MPI processes
    type: preonly
    maximum iterations=10000, initial guess is zero
    tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
    left preconditioning
    using NONE norm type for convergence test
  PC Object: (firedrake_4_) 1 MPI processes
    type: mg
      type is FULL, levels=3 cycles=v
        Not using Galerkin computed coarse grid matrices
    Coarse grid solver -- level -------------------------------
      KSP Object: (firedrake_4_mg_coarse_) 1 MPI processes
        type: preonly
        maximum iterations=10000, initial guess is zero
        tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
        left preconditioning
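Rather than editing `telescope_factor` by hand for every job, the number of nodes can be read from the environment. The sketch below is one possible approach and not part of the original notebook: it assumes the script runs under Slurm, which sets `SLURM_JOB_NUM_NODES` inside a job, and falls back to 1 elsewhere (for example in a notebook). The helper name `telescope_options` is hypothetical; the option keys are copied from the cell above.

```python
import os


def telescope_options(num_nodes):
    """Coarse grid options from the cell above, with the reduction factor filled in."""
    return {
        "mat_type": "aij",
        "pc_type": "telescope",
        "pc_telescope_reduction_factor": num_nodes,
        "pc_telescope_subcomm_type": "contiguous",
        "telescope_pc_type": "lu",
        "telescope_pc_factor_mat_solver_type": "mumps",
    }


# SLURM_JOB_NUM_NODES is set by Slurm inside a job; default to 1 otherwise
num_nodes = int(os.environ.get("SLURM_JOB_NUM_NODES", "1"))
coarse_options = telescope_options(num_nodes)
print("Telescope reduction factor:", coarse_options["pc_telescope_reduction_factor"])
```

The resulting dictionary could then be used as the value of `"mg_coarse_assembled"` in `fmg_matfree_telescope`.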
    

## Running on HPC

To run these examples on HPC, the Firedrake code must be a Python script. You can download this notebook as a Python script in an interactive Jupyter notebook by clicking `File > Download as > Python (.py)`.

The script must be run through a job scheduler using a separate job script. An example job script suitable for running on ARCHER2 is provided below.

To use this script, change the account (`-A`) to your account, the number of nodes (`--nodes=`) to the number of nodes you want to use, and the time (`-t`) as appropriate; it is currently set to 10 _minutes_.

```bash
#!/bin/bash
#SBATCH -p standard
#SBATCH -A account
#SBATCH -J firedrake
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH -t 0:10:00

export VENV_NAME=firedrake_08_2021
export WORK=/work/e682/shared/firedrake_tarballs/firedrake_08_2021/
export FIREDRAKE_TEMP=firedrake_tmp
export LOCAL_BIN=$WORK

myScript="HPC_demo.py"

module load epcc-job-env

# Activate Firedrake venv (activate once on first node, extract once per node)
source $LOCAL_BIN/firedrake_activate.sh
srun --ntasks-per-node 1 $LOCAL_BIN/firedrake_activate.sh

# Run Firedrake script
srun --ntasks-per-node 128 $VIRTUAL_ENV/bin/python ${myScript}
```

Finally, if you named your jobscript `jobscript.slm`, then it can be submitted to the queue by running the following command on ARCHER2:

```bash
sbatch jobscript.slm
```

You can see your job's progress through the queue using:

```bash
squeue -u $USER
```

If you need to cancel a job for any reason, you can pass your job ID number as an argument to the scancel command:

```bash
scancel 123456
```

## Main exercise

Perform a convergence study for the Poisson problem above, using degree 2 Lagrange elements. To do this, solve the problem on a range of different mesh sizes. The cell diameter on the finest mesh in a multigrid hierarchy is given by $h = \frac{\sqrt{2}}{N}$, where $N = N_x \times 2^{N_{ref}}$ is the number of cells along one edge of the cube on the finest grid.

a)

For this exercise we will repeatedly double $N$ (to halve the value of $h$), and measure the error for each solution. Start by picking a suitable coarse mesh size (`Nx`) and number of mesh refinements (`Nref`) to achieve the desired fine mesh size `N`:

| N =  | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
|------|---|----|----|----|-----|-----|-----|
| Nx   |   |    | 8  |    |     |     |     |
| Nref |   |    | 2  |    |     |     |     |

Throughout the exercise, one column of each table has already been filled in. These values correspond to the case presented in the notebook.

The values for `Nx` and `Nref` are not unique. For instance, to get a fine mesh with N = 256, we could choose `Nx = 256`, `Nref = 0` or `Nx = 8`, `Nref = 5`. However, if we use the former, we cannot use multigrid to solve the problem since there will be no mesh hierarchy! Think about how many multigrid levels are appropriate for the size of problem.
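To see all the options at a glance, the following sketch enumerates every `(Nx, Nref)` pair satisfying $N = N_x \times 2^{N_{ref}}$ for a given `N`. The function `hierarchy_choices` and its `min_Nx` cut-off are hypothetical helpers introduced here for illustration; which pair is most appropriate is still a judgement call.

```python
def hierarchy_choices(N, min_Nx=2):
    """All (Nx, Nref) pairs with Nx * 2**Nref == N and Nx >= min_Nx."""
    choices = []
    Nref = 0
    while N % 2**Nref == 0 and N // 2**Nref >= min_Nx:
        choices.append((N // 2**Nref, Nref))
        Nref += 1
    return choices


for Nx, Nref in hierarchy_choices(256):
    print(f"Nx = {Nx:3d}, Nref = {Nref}")
```

Both choices mentioned above appear in the output, along with the intermediate ones; pairs with `Nref = 0` give no hierarchy for multigrid to use.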

b)
 
Next, approximate the number of DOFs for each problem using the guide in the **How big?** section above. Using the total number of DOFs, work out how many processes would be appropriate for solving the problem (try to pick a power of 2) and hence how many nodes.

| N =       | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
|-----------|---|----|----|----|-----|-----|-----|
| DOFs      |   |    | 274625 |    |     |     |     |
| Processes |   |    | 4  |    |     |     |     |
| Nodes     |   |    | 1  |    |     |     |     |

If you didn't complete the exercise in the **How big?** section, you can use $DOFs = (2N + 1)^3$.
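The sketch below turns this formula into a rough resource estimate. It is one possible heuristic, not a rule from the Firedrake documentation: it picks the largest power of 2 number of processes that keeps at least 50 000 DOFs per process, and assumes 128 processes per node as in the ARCHER2 job script above. The function name `estimate_resources` and its keyword arguments are hypothetical.

```python
import math


def estimate_resources(N, dofs_per_rank=50_000, ranks_per_node=128):
    """Return (DOFs, processes, nodes) for degree 2 elements with N cells per edge."""
    dofs = (2 * N + 1) ** 3
    # Most ranks we can use while keeping at least dofs_per_rank DOFs on each
    max_ranks = max(1, dofs // dofs_per_rank)
    # Round down to a power of 2
    ranks = 1 << (max_ranks.bit_length() - 1)
    nodes = math.ceil(ranks / ranks_per_node)
    return dofs, ranks, nodes


for N in (8, 16, 32, 64, 128, 256, 512):
    dofs, ranks, nodes = estimate_resources(N)
    print(f"N = {N:3d}: {dofs:>13,d} DOFs, {ranks:>6d} processes, {nodes:>4d} nodes")
```

For `N = 32` this reproduces the values already entered in the table. Treat the other rows as a starting point: memory limits, queue policies and the need for every rank to own at least one coarse cell may all justify a different choice.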

c)

For each problem size, create a Firedrake problem and solve it using the functions provided above. Record the error alongside the cell diameter $h$ in the table below. You will need to copy the code from this notebook, or download it as a Python file and modify it as appropriate.

Additionally, you will need to modify the jobscript to fit the size of problem you are solving. Both the Python script and jobscript need to be changed to suit the problem size!

| N =   | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
|-------|---|----|----|----|-----|-----|-----|
| h     |   |    | 0.044 |    |     |     |     |
| Error |   |    | 2.33E-06 |    |     |     |     |

d)

Plot the error against h and measure the rate of convergence.
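One common way to measure the rate is to compare successive meshes: if the error behaves like $C h^p$, then $p \approx \log(e_1/e_2) / \log(h_1/h_2)$. The sketch below implements this in plain Python. The helper names are ours, and the errors are placeholders manufactured from $e = 0.1\,h^{1.5}$ purely to exercise the code. They are not results for this problem, so replace them with the errors recorded in your table.

```python
import math


def cell_diameter(N):
    """h = sqrt(2) / N, as defined at the start of this exercise."""
    return math.sqrt(2) / N


def observed_rates(hs, errors):
    """Observed order between successive meshes: log(e1/e2) / log(h1/h2)."""
    pairs = list(zip(hs, errors))
    return [
        math.log(e1 / e2) / math.log(h1 / h2)
        for (h1, e1), (h2, e2) in zip(pairs, pairs[1:])
    ]


Ns = [8, 16, 32, 64]
hs = [cell_diameter(N) for N in Ns]

# Placeholder errors from e = 0.1 * h**1.5, NOT real results for this problem
placeholder_errors = [0.1 * h**1.5 for h in hs]

print("Cell diameters:", hs)
print("Observed rates:", observed_rates(hs, placeholder_errors))
```

On a log-log plot of error against $h$, the same rate appears as the slope of the line through the data points.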

**Hints:**
- You don't need much compute power to solve small problems on coarse meshes; these will likely fit on one node.
- Remember to make your job big enough for the number of processes that you run:
    - Each MPI rank must own at least one cell in the mesh
    - Firedrake performs better when there are more than 50000 DOFs per rank