# Input/Output with PDI

In [None]:
from trustutils import run

run.introduction("A. KHIZAR")

run.TRUST_parameters()

# Description: 

In this note, we check the use of the keyword 'pdi', which makes it possible to write and read files in HDF5 format through the open-source PDI library, developed by Julien Bigot (DRF). The documentation is available at https://pdi.dev/main/.

PDI (Portable Data Interface) is an interface that eases the coupling between a simulation code and a data handling library. It has several applications, but here we use it to handle I/O: TRUST exposes its data to PDI, which then notifies the selected plugin to work on them. We chose HDF5 as the backend here, but others are available, and switching between them should have a minimal impact on the code.

### The YAML file
To monitor exchanges between the code and external libraries, PDI needs a YAML file containing a specification tree at initialization. This file describes the data (their type and size) and lists the plugins used to interact with them. It is generated automatically by TRUST and is named `save_$pb_name.yml`. Advanced users can nevertheless specify a custom YAML file with the following syntax:

`sauvegarde pdi { checkpoint_fname TRUST.sauv yaml_fname PDI.yml } `

In this example, we use `PDI.yml` to initialize PDI in order to generate the `TRUST.sauv` checkpoint file.

### The checkpoint file
The strategy used here is to handle one file per NUMA node. The main advantage is that communications are aggregated between processes of the same node, which only process local data. This is a good compromise between two extremes: one file per processor (not recommended on a cluster, as it clutters working directories and degrades performance for large-scale computations because the file metadata reside on a single metadata server) and a single file for all processes, which can saturate the bandwidth and involve fairly heavy communications.
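
To make the trade-off above concrete, the sketch below groups ranks by node and compares the number of files produced by the one-file-per-processor and one-file-per-node strategies. It is only an illustration: the function `group_ranks_by_node` and the rank layout are hypothetical and do not come from TRUST or PDI.

```python
from collections import defaultdict


def group_ranks_by_node(rank_to_node):
    """Map each node identifier to the sorted list of ranks it hosts."""
    groups = defaultdict(list)
    for rank, node in rank_to_node.items():
        groups[node].append(rank)
    return {node: sorted(ranks) for node, ranks in groups.items()}


# Hypothetical layout: 8 ranks spread over 2 nodes, 4 ranks per node.
layout = {rank: rank // 4 for rank in range(8)}
groups = group_ranks_by_node(layout)

print(groups)
print("files with one file per processor:", len(layout))
print("files with one file per node:", len(groups))
```

With this layout, the per-node strategy writes 2 files instead of 8, while each file is still only shared by processes of the same node.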

Each file is organized as follows:
* For a sequential computation using the keyword `sauvegarde`, we store each unknown in its own multidimensional dataset: the first dimension iterates over the successive backups performed during the simulation; the remaining ones are specific to each unknown.

<img src="src/seq_checkpoint.png">

As we need to know the dimensions of each dataset before we start writing into it, we need to fix the number of iterations we want to save in advance. This can be set with the parameter `nb_sauv_max` in the scheme section of the datafile (by default, it's 10). If it turns out, during the simulation, that we need to save more iterations than planned, then we will overwrite the first ones.   
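
The overwrite behaviour described above can be pictured as a circular buffer along the first dimension of the dataset. The following is a minimal sketch, assuming the slot is simply the backup counter modulo `nb_sauv_max`; the function `backup_slot` is hypothetical and the actual indexing used by TRUST may differ.

```python
def backup_slot(backup_index, nb_sauv_max=10):
    """Return the row of the dataset's first dimension used by a backup.

    Assumption (not confirmed by the TRUST sources): backups wrap around,
    so backup number nb_sauv_max overwrites slot 0, the next one slot 1, etc.
    """
    if nb_sauv_max <= 0:
        raise ValueError("nb_sauv_max must be positive")
    return backup_index % nb_sauv_max


# With the default of 10 slots, the 11th and 12th backups (indices 10 and 11)
# reuse the slots of the first two backups.
slots = [backup_slot(i) for i in range(12)]
print(slots)
```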

* For a sequential computation using the keyword `sauvegarde_simple`, we only store the last backup. Each unknown is dumped in its own multidimensional dataset, which is overwritten with every new backup.

* For a parallel computation, the same logic applies depending on the chosen keyword. However, we insert an extra dimension so that each processor of the node can write its data in its own section.

<img src="src/par_checkpoint.png">

Note: Each processor does not necessarily have arrays of the same size, and given that the size of a dataset must be fixed in each dimension, we chose to fix it with the size of the largest array of the node. 
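
The sizing rule in the note above can be sketched as follows: the per-processor arrays of a node are stacked into one fixed-size block whose width is the length of the largest array. This is an illustration only; the function `build_node_dataset` and the padding value are assumptions, not the TRUST implementation.

```python
def build_node_dataset(local_arrays, fill_value=0.0):
    """Stack per-processor arrays into one fixed-size 2D block.

    local_arrays: list of lists, one per processor of the node.
    Every row is padded to the length of the longest array, which mirrors
    the fixed-size constraint described above. The fill value is an
    assumption made for this illustration.
    """
    width = max((len(a) for a in local_arrays), default=0)
    return [list(a) + [fill_value] * (width - len(a)) for a in local_arrays]


# Two processors on the same node holding 3 and 5 local values.
block = build_node_dataset([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0, 8.0]])
for rank, row in enumerate(block):
    print(rank, row)
```

Each row plays the role of one processor's section in the extra dimension; the padded entries carry no physical data.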




# Keyword Tests: 


In [None]:
from trustutils import run  

def prepare_pre_run(targetDir, nProcs, checkpoint, restart):
    """Copy the pre_run template into targetDir and fill in its placeholders."""
    cmd = "cp pre_run_reprise " + targetDir + "/pre_run;"
    # parallel cases: also copy domaine.data into the target directory
    if nProcs > 1:
        cmd = cmd + "cp domaine.data " + targetDir + ";"
    # fill in the number of processors and the checkpoint keyword
    cmd = cmd + "sed -i 's/nProcs=_TO_FILL_/nProcs=" + str(nProcs) + "/g' " + targetDir + "/pre_run;"
    cmd = cmd + "sed -i 's/checkpoint_format=_TO_FILL_/checkpoint_format=" + checkpoint + "/g' " + targetDir + "/pre_run;"
    # value substituted for _T0_: 0.06 for sauvegarde + reprise, 0.08 otherwise
    tinit = 0.06 if checkpoint == "sauvegarde" and restart == "reprise" else 0.08
    cmd = cmd + "sed -i 's/t0=_T0_/t0=" + str(tinit) + "/g' " + targetDir + "/pre_run;"
    run.executeCommand(cmd)
            
def make_PAR_FILE(par, nProcs):
    """Turn the sequential datafile into a parallel-only datafile."""
    # comment out the MESH block
    par.substitute("# BEGIN MESH #","# BEGIN MESH ")
    par.substitute("# END MESH #","END MESH #")
    # uncomment the SCATTER block
    par.substitute("# BEGIN SCATTER","# BEGIN SCATTER #")
    par.substitute("END SCATTER #","# END SCATTER #")
    header = "PARALLEL ONLY " + str(nProcs)
    par.substitute("PARALLEL NOT", header)
            
run.initBuildDirectory()

# running reference computation
run.executeScript("run_full_computation") 

baseDir = "CHECKPOINT_RESTART/" 
for checkpoint in ["sauvegarde", "sauvegarde_simple"] : 
    for restart in ["reprise", "resume_last_time"] :
        # sequential test case
        nProcs = 1
        targetDir = baseDir + "SEQ" + restart + "_from_" + "SEQ" + checkpoint
        templates = {"checkpoint_restart": restart, "PARALLEL_HEADER_": "PARALLEL NOT"}
        seq = run.addCaseFromTemplate("sauvegarde_reprise.data",targetDirectory=targetDir,dic=templates)          
        prepare_pre_run(targetDir, nProcs, checkpoint, restart)
        
        # parallel test case (partition() is not used: we want a purely parallel case,
        # not a parallel datafile generated in addition to the sequential one)
        nProcs = 2
        targetDir = baseDir + "PAR" + restart + "_from_" + "PAR" + checkpoint
        par = seq.copy("sauvegarde_reprise.data", targetDirectory=targetDir, nbProcs=nProcs)
        make_PAR_FILE(par, nProcs)
        run.addCase(par)
        prepare_pre_run(targetDir, nProcs, checkpoint, restart)

run.printCases()
run.runCases(preventConcurrent=True)

## Backup file in HDF5 format
We check that a calculation can be resumed from an HDF5 backup file.

###  Use of the keywords `sauvegarde` and `reprise`

Comparison of channel flow rate between a full calculation and a resumed calculation. 

The results of the resumed calculation must fully overlap those of the full calculation (here, for the sequential computation, the second half of the computation starts slightly before the end of the first half).
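
The overlap is checked visually in the plots of this note. As a complement, the sketch below shows one possible way to quantify it: the largest absolute difference between two series at the times they share. The function `max_gap_on_overlap` is hypothetical, and the numbers are made up for illustration; they are not results of the test cases. With the arrays loaded in this note, the series could be built as `list(zip(x[0], x[1]))`.

```python
def max_gap_on_overlap(reference, resumed, tol=1e-12):
    """Largest absolute difference between two (time, value) series.

    Only times present in both series (within tol) are compared.
    Returns None when the series share no time.
    """
    gaps = []
    for t_res, v_res in resumed:
        for t_ref, v_ref in reference:
            if abs(t_ref - t_res) <= tol:
                gaps.append(abs(v_ref - v_res))
                break
    return max(gaps) if gaps else None


# Made-up values, for illustration only.
reference = [(0.00, 1.0), (0.02, 1.5), (0.04, 2.0), (0.06, 2.5), (0.08, 3.0)]
resumed = [(0.06, 2.5), (0.08, 3.0)]

print(max_gap_on_overlap(reference, resumed))
```

A value of zero (or one below a chosen tolerance) means that the resumed curve lies on top of the reference curve wherever both are defined.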

In [None]:
from trustutils.jupyter import plot

a = plot.Graph("test Keywords", size=[15,8])

x = plot.loadText("CHECKPOINT_RESTART/REF/ref_Channel_Flow_Rate_pb_periox")
a.add(x[0],x[1],marker="r-|",label="Full calculation",linewidth=2)

x = plot.loadText("CHECKPOINT_RESTART/SEQreprise_from_SEQsauvegarde/backup_Channel_Flow_Rate_pb_periox")
a.add(x[0],x[1],marker="b-x",label="First half of the calculation")

x = plot.loadText("CHECKPOINT_RESTART/SEQreprise_from_SEQsauvegarde/sauvegarde_reprise_Channel_Flow_Rate_pb_periox")
a.add(x[0],x[1],marker="g-*",label="Second half of the calculation")

x = plot.loadText("CHECKPOINT_RESTART/PARreprise_from_PARsauvegarde/backup_Channel_Flow_Rate_pb_periox")
a.add(x[0],x[1],marker="y-x",label="Parallel first half of the calculation")

x = plot.loadText("CHECKPOINT_RESTART/PARreprise_from_PARsauvegarde/sauvegarde_reprise_Channel_Flow_Rate_pb_periox")
a.add(x[0],x[1],marker="k-*",label="Parallel second half of the calculation")

a.label("Time","Flow Rate")

a.visu()

###  Use of the keywords `sauvegarde` and `resume_last_time`

Comparison of channel flow rate between a full calculation and a resumed calculation. 

The results of the resumed calculation must overlap those of the full calculation.

In [None]:
from trustutils.jupyter import plot

a = plot.Graph("test Keywords", size=[15,8])

x = plot.loadText("CHECKPOINT_RESTART/REF/ref_Channel_Flow_Rate_pb_periox")
a.add(x[0],x[1],marker="r-|",label="Full calculation",linewidth=2)

x = plot.loadText("CHECKPOINT_RESTART/SEQresume_last_time_from_SEQsauvegarde/backup_Channel_Flow_Rate_pb_periox")
a.add(x[0],x[1],marker="b-x",label="First half of the calculation")

x = plot.loadText("CHECKPOINT_RESTART/SEQresume_last_time_from_SEQsauvegarde/sauvegarde_reprise_Channel_Flow_Rate_pb_periox")
a.add(x[0],x[1],marker="g-*",label="Second half of the calculation")

x = plot.loadText("CHECKPOINT_RESTART/PARresume_last_time_from_PARsauvegarde/backup_Channel_Flow_Rate_pb_periox")
a.add(x[0],x[1],marker="y-x",label="Parallel first half of the calculation")

x = plot.loadText("CHECKPOINT_RESTART/PARresume_last_time_from_PARsauvegarde/sauvegarde_reprise_Channel_Flow_Rate_pb_periox")
a.add(x[0],x[1],marker="k-*",label="Parallel second half of the calculation")


a.label("Time","Flow Rate")

a.visu()

###  Use of the keywords `sauvegarde_simple` and `reprise`

Comparison of channel flow rate between a full calculation and a resumed calculation. 

The results of the resumed calculation must overlap those of the full calculation.

In [None]:
from trustutils.jupyter import plot

a = plot.Graph("test Keywords", size=[15,8])

x = plot.loadText("CHECKPOINT_RESTART/REF/ref_Channel_Flow_Rate_pb_periox")
a.add(x[0],x[1],marker="r-|",label="Full calculation",linewidth=2)

x = plot.loadText("CHECKPOINT_RESTART/SEQreprise_from_SEQsauvegarde_simple/backup_Channel_Flow_Rate_pb_periox")
a.add(x[0],x[1],marker="b-x",label="First half of the calculation")

x = plot.loadText("CHECKPOINT_RESTART/SEQreprise_from_SEQsauvegarde_simple/sauvegarde_reprise_Channel_Flow_Rate_pb_periox")
a.add(x[0],x[1],marker="g-*",label="Second half of the calculation")

x = plot.loadText("CHECKPOINT_RESTART/PARreprise_from_PARsauvegarde_simple/backup_Channel_Flow_Rate_pb_periox")
a.add(x[0],x[1],marker="y-x",label="Parallel first half of the calculation")

x = plot.loadText("CHECKPOINT_RESTART/PARreprise_from_PARsauvegarde_simple/sauvegarde_reprise_Channel_Flow_Rate_pb_periox")
a.add(x[0],x[1],marker="k-*",label="Parallel second half of the calculation")

a.label("Time","Flow Rate")

a.visu()

###  Use of the keywords `sauvegarde_simple` and `resume_last_time`

Comparison of channel flow rate between a full calculation and a resumed calculation. 

The results of the resumed calculation must overlap those of the full calculation.

In [None]:
from trustutils.jupyter import plot

a = plot.Graph("test Keywords", size=[15,8])

x = plot.loadText("CHECKPOINT_RESTART/REF/ref_Channel_Flow_Rate_pb_periox")
a.add(x[0],x[1],marker="r-|",label="Full calculation",linewidth=2)

x = plot.loadText("CHECKPOINT_RESTART/SEQresume_last_time_from_SEQsauvegarde_simple/backup_Channel_Flow_Rate_pb_periox")
a.add(x[0],x[1],marker="b-x",label="First half of the calculation")

x = plot.loadText("CHECKPOINT_RESTART/SEQresume_last_time_from_SEQsauvegarde_simple/sauvegarde_reprise_Channel_Flow_Rate_pb_periox")
a.add(x[0],x[1],marker="g-*",label="Second half of the calculation")

x = plot.loadText("CHECKPOINT_RESTART/PARresume_last_time_from_PARsauvegarde_simple/backup_Channel_Flow_Rate_pb_periox")
a.add(x[0],x[1],marker="y-x",label="Parallel first half of the calculation")

x = plot.loadText("CHECKPOINT_RESTART/PARresume_last_time_from_PARsauvegarde_simple/sauvegarde_reprise_Channel_Flow_Rate_pb_periox")
a.add(x[0],x[1],marker="k-*",label="Parallel second half of the calculation")

a.label("Time","Flow Rate")

a.visu()