# Adaptive PDE discretizations on cartesian grids 
## Volume : GPU accelerated methods
## Part : Eikonal equations, acceleration and reproducibility
## Chapter : Riemannian metrics

In this notebook, we solve Riemannian eikonal equations on the CPU and the GPU, and check that they produce consistent results.

**GPU performance** GPUs are massively parallel machines, which efficiently exploit cache locality. Hence they are at their advantage with:
* Large problem instances, which are embarrassingly parallel
* Moderate anisotropy, so that the numerical scheme stencils are not too wide

In [1]:
large_instances = False # True favors the GPU code (CPU times may become a bit long.)
strong_anisotropy = True # True favors the CPU code
anisotropy_bound = 10. if strong_anisotropy else 4. # Ratio between the fastest and the slowest velocity at any given point

[**Summary**](Summary.ipynb) of the volume GPU accelerated methods, this series of notebooks.

[**Main summary**](../Summary.ipynb) of the Adaptive Grid Discretizations book of notebooks, including the other volumes.

# Table of contents
  * [1. Two dimensions](#1.-Two-dimensions)
    * [1.1 Isotropic metric](#1.1-Isotropic-metric)
    * [1.2 Smooth anisotropic metric](#1.2-Smooth-anisotropic-metric)
  * [2. Three dimensions](#2.-Three-dimensions)
    * [2.1 Smooth anisotropic metric](#2.1-Smooth-anisotropic-metric)



**Acknowledgement.** The experiments presented in these notebooks are part of ongoing research.
The author would like to acknowledge fruitful informal discussions with L. Gayraud on the 
topic of GPU coding and optimization.

Copyright Jean-Marie Mirebeau, University Paris-Sud, CNRS, University Paris-Saclay

## 0. Importing the required libraries

In [2]:
import sys; sys.path.insert(0,"..")
#from Miscellaneous import TocTools; print(TocTools.displayTOC('Riemann_Repro','GPU'))

In [3]:
import cupy as cp
import numpy as np
import itertools
from matplotlib import pyplot as plt
np.set_printoptions(edgeitems=30, linewidth=100000, formatter=dict(float=lambda x: "%5.3g" % x))

In [4]:
from agd import HFMUtils
from agd import AutomaticDifferentiation as ad
from agd import Metrics
from agd import FiniteDifferences as fd
from agd import LinearParallel as lp
import agd.AutomaticDifferentiation.cupy_generic as cugen

from agd.ExportedCode.Notebooks_GPU.Isotropic_Repro import RunCompare

In [5]:
def ReloadPackages():
    from Miscellaneous.rreload import rreload
    global HFMUtils,ad,cugen,RunGPU,RunSmart,Metrics
    HFMUtils,ad,cugen,Metrics = rreload([HFMUtils,ad,cugen,Metrics],"../..")    
    HFMUtils.dictIn.RunSmart = cugen.cupy_get_args(HFMUtils.RunSmart,dtype64=True,iterables=(dict,Metrics.Base))

In [6]:
cp = ad.functional.decorate_module_functions(cp,cugen.set_output_dtype32) # Use float32 and int32 types in place of float64 and int64
plt = ad.functional.decorate_module_functions(plt,cugen.cupy_get_args)
HFMUtils.dictIn.RunSmart = cugen.cupy_get_args(HFMUtils.RunSmart,dtype64=True,iterables=(dict,Metrics.Base))
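
The "Casting output of function ... from float64 to float32" messages that appear in this notebook come from the decoration applied above. As a rough illustration of the idea only, here is a minimal NumPy sketch of a decorator that casts 64-bit outputs to 32 bits. The name `set_output_dtype32_sketch` is hypothetical, and this is not the implementation found in `agd`.

```python
import functools
import numpy as np

def set_output_dtype32_sketch(func):
    """Hypothetical stand-in: cast 64-bit array outputs of `func` to 32 bits."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        out = func(*args, **kwargs)
        if isinstance(out, np.ndarray):
            if out.dtype == np.float64:
                print(f"Casting output of function {func.__name__} from float64 to float32")
                return out.astype(np.float32)
            if out.dtype == np.int64:
                return out.astype(np.int32)
        return out  # Other outputs are returned unchanged
    return wrapper

array32 = set_output_dtype32_sketch(np.array)
x = array32([0.5, 0.5])   # Prints the casting message
k = array32([1, 2])       # Integer input, stored as int32
print(x.dtype, k.dtype, np.finfo(np.float32).eps)
```

Single precision has a machine epsilon of about 1.2e-7, which gives a natural scale for reading the GPU/CPU discrepancies reported in this notebook.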

## 1. Two dimensions

### 1.1 Isotropic metric

In [7]:
n=4000 if large_instances else 1000
hfmIn = HFMUtils.dictIn({
    'model':'Riemann2',
    'metric':Metrics.Riemann.from_cast(Metrics.Isotropic(cp.array(1.),vdim=2)),
    'seeds':cp.array([[0.5,0.5]]),
    'exportValues':1,
#    'bound_active_blocks':True,
    'traits':{
        'niter_i':24,'shape_i':(12,12), # Best
    }
})
hfmIn.SetRect([[0,1],[0,1]],dimx=n+1,sampleBoundary=True)

Casting output of function array from float64 to float32
Casting output of function array from float64 to float32


In [8]:
_,cpuOut = RunCompare(hfmIn,check=1e-5)

Setting the kernel traits.
Prepating the domain data (shape,metric,...)
Preparing the problem rhs (cost, seeds,...)
Preparing the GPU kernel
Running the eikonal GPU kernel
GPU kernel eikonal ran for 0.06350064277648926 seconds,  and 86 iterations.
Post-Processing
--- gpu done, turning to cpu ---
Field verbosity defaults to 1
Field order defaults to 1
Field seedRadius defaults to 0
Fast marching solver completed in 0.821 s.
Solver time (s). GPU : 0.06350064277648926, CPU : 1.476. Device acceleration : 23.243859203051727
Max |gpuValues-cpuValues| :  2.8908252719395122e-06
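
The last two lines of the log above report a speed-up factor and a maximal discrepancy. The following sketch shows how such quantities can be computed from two arrays of values. The helper `compare_values` and the synthetic data are hypothetical, introduced for illustration; this is not the code of `RunCompare`.

```python
import numpy as np

def compare_values(gpu_values, cpu_values, check=None):
    """Return max |gpu - cpu|, and optionally assert that it is below `check`."""
    gpu_values = np.asarray(gpu_values, dtype=np.float64)
    cpu_values = np.asarray(cpu_values, dtype=np.float64)
    max_diff = float(np.max(np.abs(gpu_values - cpu_values)))
    print("Max |gpuValues-cpuValues| : ", max_diff)
    if check:  # check=False or check=None disables the assertion
        assert max_diff < check, f"discrepancy {max_diff} exceeds {check}"
    return max_diff

# Synthetic data standing in for the solver outputs
rng = np.random.default_rng(0)
cpu = rng.random((50, 50))           # double precision reference
gpu = cpu.astype(np.float32)         # emulates single precision storage
diff = compare_values(gpu, cpu, check=1e-5)

# Speed-up factor, using the timings printed in the log above
gpu_time, cpu_time = 0.06350064277648926, 1.476
acceleration = cpu_time / gpu_time
print("Device acceleration :", acceleration)
```

Passing `check=False`, as done in a later cell of this notebook, would in this sketch simply skip the assertion while still reporting the discrepancy.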


In [9]:
n=200; hfmInS = hfmIn.copy() # Define a small instance for bit-consistency validation
hfmInS.SetRect([[0,1],[0,1]],dimx=n+1,sampleBoundary=True)
X = hfmInS.Grid()
cost = np.prod(np.sin(2*np.pi*X),axis=0)+1.1
hfmInS.update({
    'metric': Metrics.Riemann.from_cast(Metrics.Isotropic(cost,vdim=2)), # Isotropic but non-constant metric
    'verbosity':0,
})

In [10]:
RunCompare(hfmInS,variants='basic')

Solver time (s). GPU : 0.013496875762939453, CPU : 0.051000000000000004. Device acceleration : 3.7786522522522525
Max |gpuValues-cpuValues| :  1.3751945037165925e-06

 --- Variant {'multiprecision': True} ---
Solver time (s). GPU : 0.01599717140197754, CPU : 0.053. Device acceleration : 3.313085711730778
Max |gpuValues-cpuValues| :  4.742801229529192e-08

 --- Variant {'seedRadius': 2.0} ---
Solver time (s). GPU : 0.011999130249023438, CPU : 0.055. Device acceleration : 4.583665553965983
Max |gpuValues-cpuValues| :  1.268472188953318e-06

 --- Variant {'seedRadius': 2.0, 'multiprecision': True} ---
Solver time (s). GPU : 0.012998342514038086, CPU : 0.051000000000000004. Device acceleration : 3.923577174929841
Max |gpuValues-cpuValues| :  8.585884103684549e-08


In [11]:
RunCompare(hfmInS,variants='ext',check=0.004)

Solver time (s). GPU : 0.016002178192138672, CPU : 0.05. Device acceleration : 3.124574629756548
Max |gpuValues-cpuValues| :  1.3751945037165925e-06

 --- Variant {'multiprecision': True} ---
Solver time (s). GPU : 0.01600170135498047, CPU : 0.053. Device acceleration : 3.3121478038023717
Max |gpuValues-cpuValues| :  4.742801229529192e-08

 --- Variant {'seedRadius': 2.0} ---
Solver time (s). GPU : 0.015000343322753906, CPU : 0.052. Device acceleration : 3.466587322779579
Max |gpuValues-cpuValues| :  1.268472188953318e-06

 --- Variant {'seedRadius': 2.0, 'multiprecision': True} ---
Solver time (s). GPU : 0.013002872467041016, CPU : 0.053. Device acceleration : 4.076022443067219
Max |gpuValues-cpuValues| :  8.585884103684549e-08

 --- Variant {'factoringRadius': 10.0, 'factoringPointChoice': 'Key'} ---
Solver time (s). GPU : 0.011497735977172852, CPU : 0.056. Device acceleration : 4.870524085018144
Max |gpuValues-cpuValues| :  0.0002013597425319924

 --- Variant {'factoringRadius': 10.

### 1.2 Smooth anisotropic metric

In [12]:
n=4000 if large_instances else 1000
hfmIn = HFMUtils.dictIn({
    'model':'Riemann2',
    'seeds':cp.array([[0.,0.]]),
    'exportValues':1,
#    'bound_active_blocks':True,
    'traits':{
        'niter_i':16,'shape_i':(8,8), # Best
    },
})
hfmIn.SetRect([[-np.pi,np.pi],[-np.pi,np.pi]],dimx=n+1,sampleBoundary=True)

Casting output of function array from float64 to float32


In [13]:
def height(x): return np.sin(x[0])*np.sin(x[1])
def surface_metric(x,z,mu):
    ndim,shape = x.ndim-1,x.shape[1:]
    x_ad = ad.Dense.identity(constant=x,shape_free=(ndim,))
    tensors = lp.outer_self( z(x_ad).gradient() ) + mu**-2 * fd.as_field(cp.eye(ndim),shape)
    return Metrics.Riemann(tensors)
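
To see how `mu` relates to the anisotropy of this metric, note that the tensor $\nabla z \nabla z^T + \mu^{-2} {\rm Id}$ has eigenvalues $\mu^{-2}$ and $\|\nabla z\|^2 + \mu^{-2}$, so the ratio of the fastest to the slowest velocity is $\sqrt{1+\mu^2 \|\nabla z\|^2}$. The sketch below checks this with plain NumPy and an analytic gradient, rather than with the automatic differentiation tools of `agd`; the grid size is an arbitrary choice made for illustration.

```python
import numpy as np

mu = 10.0                                    # Plays the role of anisotropy_bound
axis = np.linspace(-np.pi, np.pi, 101)
x = np.array(np.meshgrid(axis, axis, indexing="ij"))    # Shape (2, 101, 101)

# Analytic gradient of z(x) = sin(x0) sin(x1)
grad = np.array([np.cos(x[0]) * np.sin(x[1]),
                 np.sin(x[0]) * np.cos(x[1])])           # Shape (2, 101, 101)

# M = grad z (x) grad z + mu^-2 Id, stored here with the grid axes first
outer = np.einsum("i...,j...->...ij", grad, grad)        # Shape (101, 101, 2, 2)
tensors = outer + mu**-2 * np.eye(2)

# Velocity in a unit direction v is 1/sqrt(v.M.v), hence the ratio
# fastest/slowest is sqrt(lambda_max/lambda_min)
eigvals = np.linalg.eigvalsh(tensors)                    # Ascending order
anisotropy = np.sqrt(eigvals[..., 1] / eigvals[..., 0])
print(anisotropy.min(), anisotropy.max(), np.sqrt(1 + mu**2))
```

For this height function the gradient norm is at most one, so the maximal anisotropy is $\sqrt{1+\mu^2}$, which is close to `mu` when `mu` is large.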

In [14]:
hfmIn['metric'] = surface_metric(hfmIn.Grid(),height,mu=anisotropy_bound)

Casting output of function eye from float64 to float32


In [15]:
gpuOut,cpuOut = RunCompare(hfmIn,check=False)

Setting the kernel traits.
Prepating the domain data (shape,metric,...)
Preparing the problem rhs (cost, seeds,...)
Preparing the GPU kernel
Running the eikonal GPU kernel
GPU kernel eikonal ran for 0.23299622535705566 seconds,  and 254 iterations.
Post-Processing
--- gpu done, turning to cpu ---
Field verbosity defaults to 1
Field order defaults to 1
Field seedRadius defaults to 0
Fast marching solver completed in 1.429 s.
Solver time (s). GPU : 0.23299622535705566, CPU : 2.5060000000000002. Device acceleration : 10.755539048581898
Max |gpuValues-cpuValues| :  5.273971551522649e-05


In [16]:
n=200; hfmInS = hfmIn.copy() # Define a small instance for bit-consistency validation
hfmInS.SetRect([[-np.pi,np.pi],[-np.pi,np.pi]],dimx=n+1,sampleBoundary=True)
hfmInS.update({
    'metric' : surface_metric(hfmInS.Grid(),height,mu=anisotropy_bound), 
    'verbosity':0,
})

Casting output of function eye from float64 to float32


In [19]:
RunCompare(hfmInS,variants='basic')

Solver time (s). GPU : 0.03899812698364258, CPU : 0.09. Device acceleration : 2.3078031423855228
Max |gpuValues-cpuValues| :  7.870599221471153e-06

 --- Variant {'multiprecision': True} ---
Solver time (s). GPU : 0.04249691963195801, CPU : 0.089. Device acceleration : 2.094269438132907
Max |gpuValues-cpuValues| :  2.1523372906173677e-07

 --- Variant {'seedRadius': 2.0} ---
Solver time (s). GPU : 0.03249502182006836, CPU : 0.09. Device acceleration : 2.7696550104920243
Max |gpuValues-cpuValues| :  7.90254544336122e-06

 --- Variant {'seedRadius': 2.0, 'multiprecision': True} ---
Solver time (s). GPU : 0.034499168395996094, CPU : 0.089. Device acceleration : 2.5797723289564614
Max |gpuValues-cpuValues| :  2.2754650697009993e-07


Due to the different switching criteria of the second order scheme, we do not have bit consistency in that case. The results are nevertheless quite close. Note also that we do not deactivate the `decreasing` trait here, contrary to the isotropic case, because the scheme often does not converge without it.

**Bottom line.** Second order accuracy for anisotropic metrics on the GPU is highly experimental, and not very reliable at this stage. Further investigation is needed on the matter.

In [20]:
RunCompare(hfmInS,variants='ext',check=0.1)

Solver time (s). GPU : 0.03699779510498047, CPU : 0.089. Device acceleration : 2.405548756283026
Max |gpuValues-cpuValues| :  7.870599221471153e-06

 --- Variant {'multiprecision': True} ---
Solver time (s). GPU : 0.04148077964782715, CPU : 0.089. Device acceleration : 2.145572015656702
Max |gpuValues-cpuValues| :  2.1523372906173677e-07

 --- Variant {'seedRadius': 2.0} ---
Solver time (s). GPU : 0.031000137329101562, CPU : 0.089. Device acceleration : 2.870955023687934
Max |gpuValues-cpuValues| :  7.90254544336122e-06

 --- Variant {'seedRadius': 2.0, 'multiprecision': True} ---
Solver time (s). GPU : 0.03549981117248535, CPU : 0.089. Device acceleration : 2.5070555887627015
Max |gpuValues-cpuValues| :  2.2754650697009993e-07

 --- Variant {'factoringRadius': 10.0, 'factoringPointChoice': 'Key'} ---
Solver time (s). GPU : 0.03500056266784668, CPU : 0.088. Device acceleration : 2.5142452947146854
Max |gpuValues-cpuValues| :  0.00029348062993576896

 --- Variant {'factoringRadius': 10.

If the enforced monotonicity is removed, then convergence of the scheme is harder to obtain, and requires setting some other parameters carefully and conservatively.

<!---
hfmInS.update({
    'order2_threshold':0.03,
    'verbosity':1,
    'traits':{'decreasing_macro':0,'order2_threshold_weighted_macro':1},
    'metric' : surface_metric(hfmInS.Grid(),height),
    'multiprecision':False,
    'tol':1e-6
})
--->

In [21]:
hfmInS.update({
    'tol':1e-6, # Tolerance for the convergence of the fixed point solver
    'order2_threshold':0.03, # Use first order scheme if second order difference is too large
    'traits':{'decreasing_macro':0}, # Do not enforce monotonicity
})
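
The comment on `order2_threshold` above describes a fallback to the first order scheme when the second order difference is too large. The following one dimensional sketch illustrates this kind of switching rule. The function `upwind_difference` and its precise criterion (comparing the second order difference with the first order one) are assumptions made for illustration, and may differ from the criteria used by the CPU and GPU solvers.

```python
def upwind_difference(u0, u1, u2, h, order2_threshold=0.03):
    """Upwind approximation of u'(x) from u0=u(x), u1=u(x-h), u2=u(x-2h).

    Hypothetical switching rule: the second order formula is used only if the
    second order difference is small relative to the first order difference.
    Returns the approximate derivative and the order actually used.
    """
    first = u0 - u1                  # First order difference
    second = u0 - 2 * u1 + u2        # Second order difference
    if abs(second) <= order2_threshold * abs(first):
        # Equals (3 u0 - 4 u1 + u2) / (2 h)
        return (first + 0.5 * second) / h, 2
    return first / h, 1

h = 0.01
# Smooth data: u(x) = x^2 sampled at x = 1, 1-h, 1-2h
d_smooth, order_smooth = upwind_difference(1.0**2, 0.99**2, 0.98**2, h)
# Data with a kink between the samples: the second order difference is large
d_kink, order_kink = upwind_difference(0.5, 0.49, 0.5, h)
print(d_smooth, order_smooth, d_kink, order_kink)
```

Such a test is a discrete decision, so two implementations whose criteria differ slightly can select different orders at some grid points, which is consistent with the lack of bit consistency noted in this notebook.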

In [24]:
RunCompare(hfmInS,variants='ext',check=0.15)

Solver time (s). GPU : 0.031998395919799805, CPU : 0.089. Device acceleration : 2.781389424115758
Max |gpuValues-cpuValues| :  1.1566087197545372e-05

 --- Variant {'multiprecision': True} ---
Solver time (s). GPU : 0.038500070571899414, CPU : 0.089. Device acceleration : 2.311684074287377
Max |gpuValues-cpuValues| :  2.0414844297267365e-06

 --- Variant {'seedRadius': 2.0} ---
Solver time (s). GPU : 0.026499032974243164, CPU : 0.091. Device acceleration : 3.4340875792883168
Max |gpuValues-cpuValues| :  1.1538428774660048e-05

 --- Variant {'seedRadius': 2.0, 'multiprecision': True} ---
Solver time (s). GPU : 0.032000064849853516, CPU : 0.089. Device acceleration : 2.781244363647201
Max |gpuValues-cpuValues| :  2.0156858502318187e-06

 --- Variant {'factoringRadius': 10.0, 'factoringPointChoice': 'Key'} ---
Solver time (s). GPU : 0.034998416900634766, CPU : 0.09. Device acceleration : 2.571544886030764
Max |gpuValues-cpuValues| :  0.00029348062993576896

 --- Variant {'factoringRadius'

In [None]:
# TODO : discontinuous metric

## 2. Three dimensions

### 2.1 Smooth anisotropic metric

We generalize the two dimensional test case, although it no longer makes much geometrical sense: we are computing geodesics in a three dimensional volume viewed as a hypersurface embedded in four dimensional Euclidean space.
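
One way to read this embedding, given here as an interpretation rather than a statement from the library: the metric $\nabla z \nabla z^T + \mu^{-2} {\rm Id}$ measures a tangent vector $v$ by the Euclidean length of its image under the differential of the map $x \mapsto (x/\mu, z(x))$ into four dimensional space. The sketch below checks this identity numerically at a random point, using plain NumPy and an analytic gradient.

```python
import numpy as np

mu = 10.0
rng = np.random.default_rng(0)
x = rng.uniform(-np.pi, np.pi, size=3)    # A point of the domain
v = rng.normal(size=3)                    # A tangent vector at x

# Analytic gradient of z(x) = sin(x0) sin(x1) sin(x2)
s, c = np.sin(x), np.cos(x)
grad = np.array([c[0] * s[1] * s[2], s[0] * c[1] * s[2], s[0] * s[1] * c[2]])

# Metric tensor: grad z (x) grad z + mu^-2 Id
M = np.outer(grad, grad) + mu**-2 * np.eye(3)
riemannian_norm = np.sqrt(v @ M @ v)

# Image of v in R^4 under the differential of x -> (x / mu, z(x))
lifted = np.append(v / mu, grad @ v)
euclidean_norm = np.linalg.norm(lifted)

print(riemannian_norm, euclidean_norm)    # The two values agree up to rounding
```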

In [25]:
n=200 if large_instances else 100
hfmIn = HFMUtils.dictIn({
    'model':'Riemann3',
    'seeds':cp.array([[0.,0.,0.]]),
    'exportValues':1,
#    'bound_active_blocks':True,
})
hfmIn.SetRect([[-np.pi,np.pi],[-np.pi,np.pi],[-np.pi,np.pi]],dimx=n+1,sampleBoundary=True)

Casting output of function array from float64 to float32


In [26]:
def height3(x): return np.sin(x[0])*np.sin(x[1])*np.sin(x[2])

In [27]:
hfmIn['metric'] = surface_metric(hfmIn.Grid(),height3,mu=anisotropy_bound)

Casting output of function eye from float64 to float32


In [28]:
gpuOut,cpuOut = RunCompare(hfmIn,check=1e-4)

Setting the kernel traits.
Prepating the domain data (shape,metric,...)
Preparing the problem rhs (cost, seeds,...)
Preparing the GPU kernel
Running the eikonal GPU kernel
GPU kernel eikonal ran for 0.14097881317138672 seconds,  and 60 iterations.
Post-Processing
--- gpu done, turning to cpu ---
Field verbosity defaults to 1
Field order defaults to 1
Field seedRadius defaults to 0
Fast marching solver completed in 6.378 s.
Solver time (s). GPU : 0.14097881317138672, CPU : 9.642. Device acceleration : 68.39325557577438
Max |gpuValues-cpuValues| :  8.276338205548406e-06


In [29]:
n=20; hfmInS = hfmIn.copy() # Define a small instance for bit-consistency validation
hfmInS.SetRect([[-np.pi,np.pi],[-np.pi,np.pi],[-np.pi,np.pi]],dimx=n+1,sampleBoundary=True)
hfmInS.update({
    'metric' : surface_metric(hfmInS.Grid(),height,mu=anisotropy_bound), # `height` ignores x[2]; `height3` is the three dimensional variant
    'verbosity':0,
})

Casting output of function eye from float64 to float32


In [30]:
RunCompare(hfmInS,variants='basic')

Solver time (s). GPU : 0.008499622344970703, CPU : 0.038000000000000006. Device acceleration : 4.470786872370267
Max |gpuValues-cpuValues| :  2.67317871505135e-07

 --- Variant {'multiprecision': True} ---
Solver time (s). GPU : 0.008997201919555664, CPU : 0.038. Device acceleration : 4.223535310173039
Max |gpuValues-cpuValues| :  4.1609422674060426e-07

 --- Variant {'seedRadius': 2.0} ---
Solver time (s). GPU : 0.006497621536254883, CPU : 0.036000000000000004. Device acceleration : 5.540488900304554
Max |gpuValues-cpuValues| :  2.0110513232474148e-07

 --- Variant {'seedRadius': 2.0, 'multiprecision': True} ---
Solver time (s). GPU : 0.009000062942504883, CPU : 0.035. Device acceleration : 3.8888616917004426
Max |gpuValues-cpuValues| :  4.477440634920171e-07


Due to the different switching criteria of the second order scheme, we do not have bit consistency in that case. The results are nevertheless quite close.

In [31]:
RunCompare(hfmInS,variants='ext',check=0.1)

Solver time (s). GPU : 0.009010076522827148, CPU : 0.038. Device acceleration : 4.217500251382604
Max |gpuValues-cpuValues| :  2.67317871505135e-07

 --- Variant {'multiprecision': True} ---
Solver time (s). GPU : 0.0089874267578125, CPU : 0.037000000000000005. Device acceleration : 4.116862478777589
Max |gpuValues-cpuValues| :  4.1609422674060426e-07

 --- Variant {'seedRadius': 2.0} ---
Solver time (s). GPU : 0.0069980621337890625, CPU : 0.037000000000000005. Device acceleration : 5.287177977650587
Max |gpuValues-cpuValues| :  2.0110513232474148e-07

 --- Variant {'seedRadius': 2.0, 'multiprecision': True} ---
Solver time (s). GPU : 0.00799870491027832, CPU : 0.037000000000000005. Device acceleration : 4.6257488449730255
Max |gpuValues-cpuValues| :  4.477440634920171e-07

 --- Variant {'factoringRadius': 10.0, 'factoringPointChoice': 'Key'} ---
Solver time (s). GPU : 0.007008075714111328, CPU : 0.045. Device acceleration : 6.421163502755665
Max |gpuValues-cpuValues| :  0.014765998189