# LINC tutorial - Jan 2024

author: Etienne Bonnassieux etienne.bonnassieux@uni-wuerzburg.de
Official documentation: https://linc.readthedocs.io/en/latest/

## This document

The purpose of this document is to provide an intuitive, user-friendly companion to the official LINC documentation, so that new LOFAR users can run the initial calibration pipeline on their data. We will go through defining our paths, datasets, etc.; we will then show how to install LINC and deploy it on the data as defined here.

## Prerequisites

The current software prerequisites for LINC are a huge improvement over past iterations: in principle, all that is strictly required is git and support for python3 virtual environments. If these are not present, you can follow the steps below to install them (assuming a Debian-based Linux machine; these are shell commands, to be run in a terminal):

In [2]:
sudo apt update && sudo apt upgrade -y
sudo apt-get install git -y
sudo apt install python3-pip -y
sudo apt install build-essential libssl-dev libffi-dev python3-dev
sudo apt install python3-venv -y

This will, sequentially:

1. update your software list and upgrade the packages already installed
2. install git, easy peasy
3. install pip, necessary for convenient python package installation
4. install compilers and development headers, needed whenever a python package has to be built from source
5. actually install python3-venv itself, which is the python3 virtual environment package.

With this done, you are able to proceed to the next steps.
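
If you would rather verify the prerequisites than reinstall them, the check can be sketched in Python as below. This is a minimal sketch and not part of LINC: the function name check_prerequisites is made up for this tutorial, and the test for ensurepip rests on the assumption that, on Debian-like systems, it is the python3-venv package which provides that module.

```python
import importlib.util
import shutil


def check_prerequisites():
    """Return a dict mapping each prerequisite to True if it was found."""
    return {
        # shutil.which returns None when the executable is not on the PATH
        "git": shutil.which("git") is not None,
        # find_spec returns None when the module cannot be imported
        "pip": importlib.util.find_spec("pip") is not None,
        "venv": importlib.util.find_spec("venv") is not None,
        # assumption: ensurepip is what python3-venv adds on Debian-like systems
        "ensurepip": importlib.util.find_spec("ensurepip") is not None,
    }


for name, found in check_prerequisites().items():
    print("%-10s %s" % (name, "found" if found else "MISSING"))
```

Anything reported as MISSING can then be installed with the corresponding apt command above.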


## I. Defining our environment variables

In order to facilitate this tutorial, we will define a series of important environment variables. **This step is crucial** in order to deploy the pipeline on a variety of environments. You will need to define:

1. The absolute path of your working directory. This is where you will want the pipeline to run, where you will place its input files, and where it will place its outputs. Note that **you will need a reasonable data quota at this specific location**: under **no circumstances** should it EVER be your home directory! If in doubt, ask your friendly local sysadmin where your data storage is located on your computational architecture.
2. The absolute path to the datasets you want to reduce. At present LINC can only reduce one observation at a time; multiple observations will need sequential LINC calls on their respective datasets.
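
Since each LINC call handles a single observation, a directory holding several observations has to be split up first. The sketch below groups measurement sets by observation ID. It assumes that file names start with the observation ID followed by an underscore, as in the L671058_SB*MS pattern used later in this tutorial; the example file names and the function name group_by_observation are hypothetical.

```python
import re
from collections import defaultdict


def group_by_observation(ms_paths):
    """Group measurement-set paths by their leading observation ID (e.g. L671058)."""
    groups = defaultdict(list)
    for path in ms_paths:
        name = path.rstrip("/").split("/")[-1]
        match = re.match(r"(L\d+)_", name)
        # names that do not follow the assumed scheme are collected under "unknown"
        key = match.group(1) if match else "unknown"
        groups[key].append(path)
    return {obs: sorted(paths) for obs, paths in groups.items()}


# hypothetical file names, for illustration only
example = [
    "/data/L671058_SB001_uv.MS",
    "/data/L671058_SB000_uv.MS",
    "/data/L671060_SB000_uv.MS",
]
for obs, paths in group_by_observation(example).items():
    print(obs, len(paths), "measurement sets")
```

Each group would then get its own input file and its own LINC run.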

In the examples below, I have defined the environment variables for our local compute infrastructure in Würzburg. **If you don't change these for your use case, nothing will work**: path errors are the most common cause of errors and bugs with LINC, so please check that each of these folders and files exists with a quick ls!


In [3]:
working_dir = '/data/LOFAR/LBA'
data_dir    = '/data/LOFAR/LBA/DATA/3C380'
msstring    = 'L671058_SB*MS'
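
The quick ls can also be done from Python. The following is a minimal sketch; the function name check_paths is made up for this tutorial, and the paths in the last line are simply the example values from the cell above.

```python
import glob
import os


def check_paths(working_dir, data_dir, msstring):
    """Return a list of human-readable problems; an empty list means all is well."""
    problems = []
    for label, path in (("working_dir", working_dir), ("data_dir", data_dir)):
        if not os.path.isabs(path):
            problems.append("%s is not an absolute path: %s" % (label, path))
        if not os.path.isdir(path):
            problems.append("%s does not exist: %s" % (label, path))
    pattern = os.path.join(data_dir, msstring)
    if not glob.glob(pattern):
        problems.append("no measurement sets match %s" % pattern)
    return problems


# example values from this tutorial; replace them with your own
for problem in check_paths('/data/LOFAR/LBA', '/data/LOFAR/LBA/DATA/3C380', 'L671058_SB*MS'):
    print("WARNING:", problem)
```

If nothing is printed, all three definitions point at something that exists.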

## II. Acquiring the pipeline

Once the prerequisites for LINC are installed, the pipeline itself can be acquired and installed with the following commands:

In [4]:
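# assumes the shell variable working_dir holds your working directory, e.g. export working_dir=/data/LOFAR/LBA
git clone https://git.astron.nl/RD/LINC.git "$working_dir/LINC"
cd "$working_dir/LINC"
./build_venv.sh
source venv/bin/activate
pip install --upgrade "toil[cwl]"
pip install cwltool==3.1.20220628170238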

This will, in order:

1. Git clone the pipeline
2. go to the cloned repository
3. create the virtual environment
4. enter it
5. install the necessary Common Workflow Language processing tool (in this case, toil)
6. install a version of a second Common Workflow Language processing tool (in this case, cwltool) which is known to work with LINC

And with that, you're done! Note that you will need to reactivate the virtual environment (source venv/bin/activate, from the LINC directory) each time you open a new terminal in which to start the pipeline, so do this before launching any of the commands built in the following steps.
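
A quick way to confirm that a terminal is ready is to ask Python whether it is running inside a virtual environment and whether the two runners are on the PATH. This is a sketch only; the function names are made up for this tutorial.

```python
import shutil
import sys


def in_virtualenv():
    """True when the interpreter runs inside a venv (prefix differs from base prefix)."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def missing_runners(names=("toil-cwl-runner", "cwltool")):
    """Return the runner executables that are not on the PATH."""
    return [name for name in names if shutil.which(name) is None]


print("virtual environment active:", in_virtualenv())
print("runners not found:", missing_runners())
```

If the first line prints False, or the second list is not empty, activate the environment and try again.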

## III. Configuring the pipeline

### III.a. Running the LINC calibrator pipeline

We now write short Python scripts which generate the .json files that LINC takes as input. There are two files we must create:

1. LINC_calib.json
2. LINC_target.json

The first file defines the parameters, including datasets, for the pipeline which finds the initial calibration solutions. The second defines the parameters, including datasets and the outputs of the calibrator run, which are used as inputs for the target pipeline. Once the latter has converged, you have successfully calibrated your LOFAR observation, albeit only for direction-independent gains, and without international stations. This is nevertheless a great start.

We begin by defining the necessary environment variables for the calibration run:

In [5]:
ncpu                 = 24
working_dir          = '/home/ebonnassieux/Teaching'
data_dir             = '/home/ebonnassieux/Teaching'
linc_dir             = '/home/ebonnassieux/LINC'
log_dir              = '/home/ebonnassieux/Teaching'
msstring             = '*MS'
linc_calib_parset    = "LINC_calib.json"

Then, we run the code below, which will generate a file named as per the linc_calib_parset variable above ("LINC_calib.json" in this example). Everything not set in this file is left at the LINC defaults, which can be tweaked if necessary using the parameters described in the official documentation: https://linc.readthedocs.io/en/latest/calibrator.html

In [18]:
import glob
import json

# collect the measurement sets, sorted so that the subbands appear in order
mslist = sorted(glob.glob(data_dir + "/" + msstring))
if not mslist:
    raise FileNotFoundError("no measurement sets match %s/%s" % (data_dir, msstring))

# pipeline inputs: the measurement sets and the directory holding the calibrator skymodels
calib_inputs = {
    "msin": [{"class": "Directory", "path": ms} for ms in mslist],
    "calibrator_path_skymodel": {"class": "Directory", "path": "%s/skymodels/" % linc_dir},
}

# json.dump takes care of quotes and commas; the with-block closes the file
with open(linc_calib_parset, "w") as f:
    json.dump(calib_inputs, f, indent=4)


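
Before launching anything, it can be worth checking that the input file parses and that every measurement set it lists exists. A minimal sketch follows; check_input_file is a name made up for this tutorial, and the only key it relies on is the "msin" list written above. Note that json.load is strict: it rejects, for example, a trailing comma before a closing brace.

```python
import json
import os
import tempfile


def check_input_file(parset):
    """Parse an input file strictly and return the msin paths that do not exist."""
    with open(parset) as f:
        inputs = json.load(f)  # raises json.JSONDecodeError on malformed JSON
    return [entry["path"] for entry in inputs["msin"]
            if not os.path.isdir(entry["path"])]


# self-contained demonstration with a throwaway input file
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.json")
    entries = [{"class": "Directory", "path": tmp},
               {"class": "Directory", "path": os.path.join(tmp, "absent.MS")}]
    with open(demo, "w") as f:
        json.dump({"msin": entries}, f)
    print("missing:", check_input_file(demo))
```

On your own data, the call would be check_input_file(linc_calib_parset); an empty list means every path was found.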

Using this file, you are now ready to run the LINC calibrator pipeline. There are two ways to invoke the command, which I build for you below. In principle, it is better to use toil-cwl-runner, but cwltool also works. **You only need to run one**: we recommend using **toil-cwl-runner**.

In [19]:
# log_dir is the directory which will receive the tool logs; it must be defined before running this cell
linc_calib_command = 'time toil-cwl-runner --outdir %s/linc_outdir '%working_dir             +\
                                          '--bypass-file-store '                             +\
                                          '--tmp-outdir-prefix %s '%(working_dir+'/scratch') +\
                                          '--maxLocalJobs %i '%ncpu                          +\
                                          '--log-dir %s '%log_dir                            +\
                                          '%s/workflows/HBA_calibrator.cwl '%linc_dir        +\
                                          '%s 2>&1 | tee pipeline.calib.log'%linc_calib_parset
print(linc_calib_command)

time toil-cwl-runner --outdir /home/ebonnassieux/Teaching/linc_outdir --bypass-file-store --tmp-outdir-prefix /home/ebonnassieux/Teaching/scratch --maxLocalJobs 24 --log-dir /home/ebonnassieux/Teaching /home/ebonnassieux/LINC/workflows/HBA_calibrator.cwl LINC_calib.json 2>&1 | tee pipeline.calib.log


In [20]:
linc_calib_command = 'time cwltool --parallel --singularity '                   +\
                                  '--outdir %s/linc_outdir '%working_dir        +\
                                  '--tmpdir-prefix %s/scratch '%working_dir     +\
                                  '--log-dir %s/scratch '%working_dir         +\
                                  '%s/workflows/HBA_calibrator.cwl '%linc_dir   +\
                                  '%s 2>&1 | tee pipeline.calib.log'%linc_calib_parset
print(linc_calib_command)

time cwltool --parallel --singularity --outdir /home/ebonnassieux/Teaching/linc_outdir --tmpdir-prefix /home/ebonnassieux/Teaching/scratch --log-dir /home/ebonnassieux/Teaching/scratch /home/ebonnassieux/LINC/workflows/HBA_calibrator.cwl LINC_calib.json 2>&1 | tee pipeline.calib.log


Once you've double-checked that the paths in the command you've generated are correct (a little ls never hurts), you are ready to copy-paste the command into your terminal and start blasting. Pay particular attention to the **number of CPUs** used (not optional with cwltool): it's very rude to crash other people's jobs! Also make sure that **all the tmpdirs and logs** direct to locations where you **know there is a large amount of disk space**. After that, press enter, and go grab a warm beverage of your choice.
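
Both checks can be sketched in a few lines of Python. This is an illustration only: the function names and the default of two reserved cores are choices made for this tutorial, and shutil.disk_usage reports the free space on the filesystem, which is not the same thing as your personal quota.

```python
import os
import shutil


def safe_ncpu(requested, reserve=2):
    """Clamp the requested job count so that `reserve` cores stay free for others."""
    available = os.cpu_count() or 1  # cpu_count may return None
    return max(1, min(requested, available - reserve))


def free_gib(path):
    """Free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1024**3


print("jobs to request:", safe_ncpu(24))
print("free space here: %.1f GiB" % free_gib(os.getcwd()))
```

Call free_gib on your scratch, output and log directories rather than on the current directory, and compare the result with the size of your dataset.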

### III.b. Reading and understanding the diagnostic outputs

blabla


### III.c. Running the LINC target pipeline

If all goes well, your LINC calibrator pipeline will end with this positive message:

[check what linc says]

If so, you are ready to continue with the target pipeline. This is very similar to the calibrator pipeline, but additionally needs to know where the calibrator pipeline wrote its outputs (calib_dir below). We once again define our environment variables:

In [14]:
ncpu               = 24
working_dir        = '/home/ebonnassieux/Teaching'
data_dir           = '/home/ebonnassieux/Teaching'
calib_dir          = '/home/ebonnassieux/Teaching'
linc_dir           = '/home/ebonnassieux/LINC'
log_dir            = '/home/ebonnassieux/Teaching'
msstring           = '*MS'
linc_target_parset = "LINC_target.json"
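
The target run depends on the solutions file produced by the calibrator run, so it is worth confirming that it is where calib_dir says it is. The sketch below assumes the file is called cal_solutions.h5, which is the name used in the input file generated in the next cell; the function name find_cal_solutions is made up for this tutorial.

```python
import os


def find_cal_solutions(calib_dir, filename="cal_solutions.h5"):
    """Return the full path to the calibrator solutions, or raise if they are absent."""
    path = os.path.join(calib_dir, filename)
    if not os.path.isfile(path):
        raise FileNotFoundError("expected calibrator solutions at %s" % path)
    return path


# example value from this tutorial; replace it with your own calib_dir
try:
    print("found", find_cal_solutions('/home/ebonnassieux/Teaching'))
except FileNotFoundError as err:
    print("not ready for the target run:", err)
```

If the file is not found, check the --outdir given to the calibrator command: that is where its outputs were written.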

We then generate the appropriate .json file:

In [15]:
import glob
import json

# collect the measurement sets, sorted so that the subbands appear in order
mslist = sorted(glob.glob(data_dir + "/" + msstring))
if not mslist:
    raise FileNotFoundError("no measurement sets match %s/%s" % (data_dir, msstring))

# pipeline inputs: baseline selections, the calibrator solutions and the measurement sets
target_inputs = {
    "process_baselines_target": "[CR]S*&",
    "filter_baselines": "[CR]S*&",
    "flag_baselines": [],
    "cal_solutions": {"class": "File", "path": "%s/cal_solutions.h5" % calib_dir},
    "msin": [{"class": "Directory", "path": ms} for ms in mslist],
}

# json.dump takes care of quotes and commas; the with-block closes the file
with open(linc_target_parset, "w") as f:
    json.dump(target_inputs, f, indent=4)




You can now run the LINC target pipeline using this file, as before. This time, we only show the toil-cwl-runner command, though the cwltool command also exists and can be built if you wish.

In [16]:
linc_target_command = 'time toil-cwl-runner --outdir %s '%(working_dir+'/linc_target')        +\
                                           '--bypass-file-store '                             +\
                                           '--tmp-outdir-prefix %s '%(working_dir+'/scratch') +\
                                           '--tmpdir-prefix %s '%(working_dir+'/scratch')     +\
                                           '--maxLocalJobs %i '%ncpu                          +\
                                           '--log-dir %s '%log_dir                            +\
                                           '%s/workflows/HBA_target.cwl '%linc_dir            +\
                                           '%s 2>&1 | tee pipeline.target.log'%linc_target_parset
print(linc_target_command)

time toil-cwl-runner --outdir /home/ebonnassieux/Teaching/linc_target --bypass-file-store --tmp-outdir-prefix /home/ebonnassieux/Teaching/scratch --tmpdir-prefix /home/ebonnassieux/Teaching/scratch --maxLocalJobs 24 --log-dir /home/ebonnassieux/Teaching /home/ebonnassieux/LINC/workflows/HBA_target.cwl LINC_target.json 2>&1 | tee pipeline.target.log
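
When a command is glued together from string fragments, a single missing space silently merges two arguments. One way to rule this out is to keep the arguments in a list and let shlex.join (Python 3.8 or later) insert the separators. The sketch below uses the same flags as the command above; the function name build_toil_command and the example paths are made up for this tutorial.

```python
import shlex


def build_toil_command(workflow, parset, outdir, scratch, log_dir, ncpu):
    """Assemble the toil-cwl-runner call from a list, one argument per element."""
    args = [
        "toil-cwl-runner",
        "--outdir", outdir,
        "--bypass-file-store",
        "--tmp-outdir-prefix", scratch,
        "--tmpdir-prefix", scratch,
        "--maxLocalJobs", str(ncpu),
        "--log-dir", log_dir,
        workflow,
        parset,
    ]
    # shlex.join quotes each argument if needed and separates them with single spaces
    return shlex.join(args)


# hypothetical paths, for illustration only
print(build_toil_command("/linc/workflows/HBA_target.cwl", "LINC_target.json",
                         "/work/linc_target", "/work/scratch", "/work/logs", 24))
```

The shell parts (time in front, 2>&1 | tee pipeline.target.log behind) can then be added around the returned string.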


## Recap: the full stack

To finish this tutorial, the cell below gathers the calibrator steps into a single Python script which writes the input file and then launches the pipeline for you. Copy the entire cell into a text file and edit the directories for your architecture, as you have done previously. Finally, run your text file as a python script, from a terminal in which the LINC virtual environment is active. This should run your pipeline correctly.

In [None]:
import glob
import json
import subprocess

ncpu                 = 24
working_dir          = '/home/ebonnassieux/Teaching'
data_dir             = '/home/ebonnassieux/Teaching'
linc_dir             = '/home/ebonnassieux/LINC'
log_dir              = '/home/ebonnassieux/Teaching'
msstring             = '*MS'
linc_calib_parset    = "LINC_calib.json"

runner = 'cwltool'
# runner = 'toil-cwl-runner'

# write the calibrator input file
mslist = sorted(glob.glob(data_dir + "/" + msstring))
if not mslist:
    raise FileNotFoundError("no measurement sets match %s/%s" % (data_dir, msstring))
calib_inputs = {
    "msin": [{"class": "Directory", "path": ms} for ms in mslist],
    "calibrator_path_skymodel": {"class": "Directory", "path": "%s/skymodels/" % linc_dir},
}
with open(linc_calib_parset, "w") as f:
    json.dump(calib_inputs, f, indent=4)

# build the command for the chosen runner
if runner == 'toil-cwl-runner':
    linc_calib_command = 'time toil-cwl-runner --outdir %s/linc_outdir '%working_dir +\
    '--bypass-file-store '                             +\
    '--tmp-outdir-prefix %s '%(working_dir+'/scratch') +\
    '--maxLocalJobs %i '%ncpu                          +\
    '--log-dir %s '%log_dir                            +\
    '%s/workflows/HBA_calibrator.cwl '%linc_dir        +\
    '%s 2>&1 | tee pipeline.calib.log'%linc_calib_parset
elif runner == 'cwltool':
    linc_calib_command = 'time cwltool --parallel --singularity ' +\
    '--outdir %s/linc_outdir '%working_dir        +\
    '--tmpdir-prefix %s/scratch '%working_dir     +\
    '--log-dir %s/scratch '%working_dir           +\
    '%s/workflows/HBA_calibrator.cwl '%linc_dir   +\
    '%s 2>&1 | tee pipeline.calib.log'%linc_calib_parset
else:
    raise ValueError("unknown runner: %s" % runner)

# run through bash, where "time" is a shell keyword
subprocess.run(linc_calib_command, shell=True, executable='/bin/bash')
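
One caveat with piping into tee: the exit status seen by the calling script is that of tee, not of the pipeline, so a failed run is not visible to Python. If you want the script to notice failures, the logging can be done in Python instead. The following is a sketch under that assumption, not something LINC provides; run_and_log is a made-up name, and the command in the demonstration is a harmless stand-in.

```python
import os
import subprocess
import sys
import tempfile


def run_and_log(args, logfile):
    """Run a command, echo its output, copy it to `logfile`, and return the exit code."""
    with open(logfile, "w") as log:
        proc = subprocess.Popen(args, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT, text=True)
        for line in proc.stdout:
            sys.stdout.write(line)  # behaves like tee: show the line ...
            log.write(line)         # ... and keep a copy in the log file
        return proc.wait()


# stand-in command; replace it with the pipeline command given as a list of arguments
demo_log = os.path.join(tempfile.gettempdir(), "pipeline.demo.log")
code = run_and_log([sys.executable, "-c", "print('hello from the pipeline')"], demo_log)
print("exit code:", code)
```

A non-zero exit code then tells you that the run failed, without having to read the log.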
