# Benchmarks

A demonstration of parallel processing performance of this CNGI prototype against the current released versions of CASA using a selection of ALMA datasets representing different computationally-demanding configurations.

## Methodology

Measurement of runtime performance for the component typically dominating compute cost for existing and future workflows -- the [aperture synthesis gridder](https://casadocs.readthedocs.io/en/latest/notebooks/synthesis_imaging.html#Gridding-+-FFT) -- was made against the reference implementation of CASA. Both the latest public release ([6.1.2-7](https://casa.nrao.edu/casadocs/casa-6.1.0/introduction/release-notes-610), referenced here as 6.1) and a pre-release build of the next upcoming CASA version (referenced here as 6.2) were used in order to demonstrate the effect of the recent major refactor to the gridding code of cube imaging implemented by the production system.

Relevant calls of the CASA task `tclean` were isolated for direct comparison with the latest version of the `cngi-prototype` implementation of the mosaic and standard gridders.

The steps of the workflow used to prepare data for testing were:

1. Download archive data from the ALMA Archive
2. Restore the calibrated MeasurementSet using scriptForPI.py with a compatible version of CASA
3. Split off science targets and representative spectral window into a single MeasurementSet
4. Convert calibrated MeasurementSet into zarr format using `cngi.conversion.convert_ms`

This allowed for generation of image data from visibilities for comparison. Tests were run in two different environments:

1. On premises using the same high performance computing (HPC) cluster environment used for offline processing of data from North American [ALMA](https://science.nrao.edu/facilities/alma/facilities/alma) operations.
2. Using commercial cloud resources furnished by Amazon Web Services ([AWS](https://aws.amazon.com/)).

## Dataset Selection

Observations were chosen for their source/observation properties, data volume, and usage mode diversity, particularly the relatively large number of spectral channels, pointings, or executions. Two were observed by the ALMA Compact Array (ACA) of 7m antennas, and two were observed by the main array of 12m antennas.

The datasets from each project code and Member Object Unit Set (MOUS) were processed following publicly documented ALMA archival reprocessing workflows, and come from public observations used by other teams in previous benchmarking and profiling efforts.


### 2017.1.00271.S 

Compact array observations over many (nine) execution blocks using the mosaic gridder 

-   MOUS uid://A001/X1273/X2e3
-   MeasurementSet Rows: 315831
-   CNGI Shape (time, baseline, chan, pol): (455, 745, 7635, 2)
-   Visibility Data Size: 82.82 GB


![im10](https://raw.githubusercontent.com/casangi/cngi_prototype/master/docs/_media/X2e3/combined_X2e3.png)


### 2018.1.01091.S

Compact array observations with many (141) pointings using the mosaic gridder

-   MOUS uid://A001/X133d/X1a36
-   MeasurementSet Rows: 15510
-   CNGI Shape (time, baseline, chan, pol): (282, 55, 1024, 2)
-   Visibility Data Size: 508.23 MB

![im2](https://raw.githubusercontent.com/casangi/cngi_prototype/master/docs/_media/X1a36/combined_X1a36.png)



### 2017.1.00717.S

Main array observations with many spectral channels and visibilities using the standard gridder

-   MOUS uid://A001/X1273/Xc66
-   MeasurementSet Rows: 100284
-   CNGI Shape (time, baseline, chan, pol): (2564, 53, 2048, 2)
-   Visibility Data Size: 8.91 GB

![im6](https://raw.githubusercontent.com/casangi/cngi_prototype/master/docs/_media/Xc66/combined_Xc66.png)


### 2017.1.00983.S 

Main array observations with many spectral channels and visibilities using the mosaic gridder

-   MOUS uid://A001/X12a3/X3be
-   MeasurementSet Rows: 646418
-   CNGI Shape (time, baseline, chan, pol): (729, 1159, 3853, 2)
-   Visibility Data Size: 104.17 GB

![im14](https://raw.githubusercontent.com/casangi/cngi_prototype/master/docs/_media/Xc66/combined_Xc66.png)
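The visibility data sizes quoted for the four datasets above can be reproduced from the listed CNGI shapes. The following is a minimal sketch, assuming each visibility is stored as a 16-byte complex value (`complex128`) and that sizes are reported in decimal units (1 GB = 10^9 bytes); both are assumptions inferred from the numbers rather than stated in the text, and the helper name `visibility_bytes` is hypothetical.

```python
import math

# Assumption: one visibility is a complex128 value (16 bytes).
BYTES_PER_VISIBILITY = 16

# CNGI shapes (time, baseline, chan, pol) as listed for each MOUS above.
SHAPES = {
    "uid://A001/X1273/X2e3": (455, 745, 7635, 2),
    "uid://A001/X133d/X1a36": (282, 55, 1024, 2),
    "uid://A001/X1273/Xc66": (2564, 53, 2048, 2),
    "uid://A001/X12a3/X3be": (729, 1159, 3853, 2),
}


def visibility_bytes(shape, itemsize=BYTES_PER_VISIBILITY):
    """Return the number of bytes needed for a dense array of this shape."""
    return math.prod(shape) * itemsize


for mous, shape in SHAPES.items():
    # Report in decimal gigabytes (10^9 bytes).
    print(f"{mous}: {visibility_bytes(shape) / 1e9:.2f} GB")
```

Under these assumptions the computed values match the sizes listed for each dataset, which suggests the quoted sizes describe the dense visibility array rather than the on-disk (possibly compressed) footprint.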




**Speed Up and Work**

![im340](https://raw.githubusercontent.com/casangi/cngi_prototype/master/docs/_media/gcf_size_cluster_A001_X12a3_X3be.png)


## Comparison of Runtimes

**Single Machine**

The total runtime of the prototype mosaic gridder was less than the 6.1 and 6.2 reference implementations in most cases. The prototype standard gridder has comparable performance for all but the least-optimal chunk size selection. 

There does not appear to be a performance penalty associated with the adoption of a pure Python framework in comparison to the compiled C++/Fortran reference implementation. This is likely due in large part to the prototype's reliance on the `numba` Just-In-Time (JIT) compiler and on the C foreign function interface used by third-party framework packages including `numpy` and `scipy`.

The Fortran gridding code in CASA appears slightly more efficient than the JIT-decorated Python code in the prototype. However, the test implementation handles chunked data more efficiently and has no intermediate steps where data is written to disk, whereas CASA generates TempLattice files to store intermediate results.

**Multi-Node**

The total runtime of the prototype mosaic and standard gridders was less than the 6.1 and 6.2 reference implementations in all cases. 

There does not appear to be a performance penalty associated with the adoption of a pure Python framework for distributed scheduling in comparison to the MPI-based reference implementation. This is likely due in part to the graph optimization performed by the task scheduler, although the scheduler also introduces overhead that begins to dominate the total runtime at higher levels of concurrency.


**Comparison of CASA versions**

Only the total time is comparable between test executions of CASA versions before and after the cube refactor, because the versions differ in their use of virtual concatenation versus disk writes using temp lattices.

Note that for some settings of dask array chunking, one dask chunk had a shape smaller than the others due to the combination of multiple executions before conversion. This effectively separated the on-disk chunk shape along the time dimension, with some limited potential to degrade performance.


**CHILES Benchmark**




## Commercial Cloud

The total runtime curves for tests run on AWS show higher variance. One contributing factor that likely dominated this effect was the use of [preemptible instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-interruptions.html) underlying the compute nodes running the worker configuration. For the same reason, some cloud-based test runs show decreased performance with increased scale. This is due to the preemption of nodes and the associated redeployment by Kubernetes, which sometimes constituted a large fraction of the total test runtime, as demonstrated by the task stream for the following test case. Note the horizontal white bar (signifying no tasks executed) shortly after graph execution begins, as well as some final tasks being assigned to a new node that came online after a few minutes (represented by the new "bar" of 8 rows at top right) in the following figure:

![im18](https://raw.githubusercontent.com/casangi/cngi_prototype/master/docs/_media/task_stream_A001_X1273_Xc66_threads_40_chans_45.png)

Qualitatively, failure rates were higher during tests of CASA on local HPC infrastructure than they were using dask on the cluster or cloud. The cube refactor shows a noticeable improvement in this area, but is still worse than the prototype.

## Profiling Results

Benchmarks performed using a single chunk size constitute a test of strong scaling (constant data volume, changing number of processors). Three of the projects used the mosaic gridder, the test of which consisted of multiple function calls composed into a rudimentary "pipeline". The other project used the standard gridder, with fewer separate function calls and thus relatively more time spent on compute.
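For reference, strong scaling is usually summarized by the speedup and parallel efficiency relative to the smallest processor count tested. The sketch below uses the standard definitions; the function name `strong_scaling` and the runtimes in the example are hypothetical and are not measurements from these benchmarks.

```python
def strong_scaling(runtimes):
    """Compute speedup and efficiency for a fixed data volume.

    runtimes maps processor count -> wall-clock seconds.
    Returns a dict mapping processor count -> (speedup, efficiency),
    both measured relative to the smallest processor count present.
    """
    base_procs = min(runtimes)
    base_time = runtimes[base_procs]
    results = {}
    for procs in sorted(runtimes):
        speedup = base_time / runtimes[procs]
        # Ideal strong scaling gives speedup == procs / base_procs,
        # i.e. an efficiency of 1.0.
        efficiency = speedup * base_procs / procs
        results[procs] = (speedup, efficiency)
    return results


# Hypothetical runtimes in seconds, for illustration only.
example = {1: 800.0, 2: 400.0, 4: 250.0, 8: 200.0}
for procs, (speedup, efficiency) in strong_scaling(example).items():
    print(f"{procs:>2} procs: speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

An efficiency that falls as processors are added, as in the made-up numbers here, is the signature of overheads that do not shrink with the per-processor workload.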

The time spent in each function `<plots>`

The communication of data between workers constituted a relatively small proportion of the total runtime, and the distribution of data between workers was relatively uniform, at all horizontal scalings, with some hot spots beginning to present once tens of nodes were involved. This is demonstrated by the following figure, taken from the performance report of a representative test execution:

![im17](https://raw.githubusercontent.com/casangi/cngi_prototype/master/docs/_media/bandwidth_A001_X12a3_X3be_threads_256_chans_48.png)

The time overhead associated with graph creation and task scheduling (approximately 100 ms per task for dask) grew as more nodes were introduced until eventually coming to represent a fraction of total execution time comparable to the computation itself, especially in the test cases with smaller data.
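A toy model can illustrate why the overhead fraction rises with scale. The sketch below is a deliberate simplification and not the model used in these benchmarks: it assumes the scheduler handles tasks serially at a fixed per-task cost and that the compute time divides perfectly across workers, so it captures only the shrinking compute term and not any growth in the overhead itself. The function name and all example numbers are hypothetical.

```python
def overhead_fraction(n_tasks, per_task_overhead_s, compute_s, n_workers):
    """Fraction of modeled wall-clock time spent on scheduling overhead.

    Toy model (assumption): overhead = n_tasks * per_task_overhead_s,
    handled serially, while compute_s is split evenly over n_workers.
    """
    overhead = n_tasks * per_task_overhead_s
    compute = compute_s / n_workers
    return overhead / (overhead + compute)


# Hypothetical workload: 100 tasks, 0.1 s overhead each, 1000 s of compute.
for workers in (1, 10, 100):
    frac = overhead_fraction(100, 0.1, 1000.0, workers)
    print(f"{workers:>3} workers: overhead is {frac:.0%} of modeled runtime")
```

In this model the overhead term stays constant while the compute term shrinks, so smaller datasets (less compute per task) reach the point where overhead is comparable to computation at lower worker counts.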


## Reference Configurations

Dask profiling data were collected using the [`performance_report`](https://distributed.dask.org/en/latest/diagnosing-performance.html#performance-reports) function in tests run both on-premises and in the commercial cloud.

Some values of the [distributed configuration](https://distributed.dask.org/en/latest/worker.html) were modified from their defaults:
```
distributed:
  worker:
    # Fractions of worker memory at which we take action to avoid memory blowup
    # Set any of the lower three values to False to turn off the behavior entirely
    memory:
      target: 0.85  # fraction to stay below (default 0.60)
      spill: 0.92  # fraction at which we spill to disk (default 0.70)
      pause: 0.95  # fraction at which we pause worker threads (default 0.80)
```

Thread-based parallelism in dependent libraries was disabled using the environment variables `BLAS_NUM_THREADS`, `BLOSC_NOLOCK`, `MKL_NUM_THREADS`, and `OMP_NUM_THREADS`.
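One way to apply such settings from Python is sketched below. The variable names are those listed above; the value `"1"` and the helper name `limit_library_threads` are assumptions for illustration, since the values actually used in the tests are not recorded here. These variables are typically read when a library is first loaded, so they should be set before importing packages such as `numpy`.

```python
import os

# Variable names taken from the text above; the value "1" is an assumption.
THREAD_ENV_VARS = (
    "BLAS_NUM_THREADS",
    "BLOSC_NOLOCK",
    "MKL_NUM_THREADS",
    "OMP_NUM_THREADS",
)


def limit_library_threads(value="1", environ=os.environ):
    """Set each threading-related variable in the given environment mapping.

    Returns a dict of the variables that were set, for logging or inspection.
    """
    for name in THREAD_ENV_VARS:
        environ[name] = value
    return {name: environ[name] for name in THREAD_ENV_VARS}


# Call this before importing numerical libraries in the worker process.
print(limit_library_threads())
```

Passing a plain dict as `environ` makes it possible to build the mapping without touching the current process, for example to hand it to a job submission script.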

**On-premises HPC cluster**

- Test execution via Python scripts submitted to Moab scheduler and Torque resource manager with specifications documented [internally](https://info.nrao.edu/computing/guide/cluster-processing)
- Scheduling backend: `dask-jobqueue`
- I/O of visibility and image data via shared InfiniBand-interconnected Lustre file system for access from on-premises high performance compute (HPC) nodes
- 16 threads per dask worker
- Compute via nodes from the cvpost batch queue with Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz with clock speed 1199.865 MHz and cache size 20480 KB.

**Commercial cloud (AWS)**

- Test execution via Jupyter notebooks running on a cloud deployment of the public [dask-docker](https://docs.dask.org/en/latest/setup/docker.html) image (version 2021.3.0) backed by a [Kubernetes cluster](https://docs.dask.org/en/latest/setup/kubernetes-helm.html) installed with `kops` (version 1.18.0), modified to include installation of version 0.0.83 of `cngi-prototype` and associated dependencies.
- Distributed scheduling backend: `dask.distributed`
- I/O of visibility and image data via Simple Storage Service (S3) object storage for access from commercial cloud Elastic Compute Cloud (EC2) nodes
- 8 threads per dask worker
- Compute via managed Kubernetes cluster backed by a variety of [instance types](https://aws.amazon.com/ec2/instance-types/) all running on the current daily build of the [Ubuntu 20.04](http://cloud-images.ubuntu.com/focal/current/) operating system. Cluster coordination service pods were run on a single dedicated `t3.small` instance. Jupyter notebook, dask scheduler, and [etcd](https://etcd.io/) service pods were run on a single dedicated `m5dn.4xlarge` instance. Worker pods were run on a configured number of preemptible instances drawn from a pool composed of the following types: `m5.4xlarge`, `m5d.4xlarge`, `m5dn.4xlarge`, `r5.4xlarge`, `r4.4xlarge`, `m4.4xlarge`.

Hyperthreads [exposed as vCPUs](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-optimize-cpu.html) on the EC2 instances were disabled using the following shell script at instance launch:
```
spec:
  additionalUserData:
  - content: |
      #!/usr/bin/env bash
      for cpunum in $(cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | cut -s -d, -f2- | tr ',' '\n' | sort -un)
      do
        echo 0 > /sys/devices/system/cpu/cpu$cpunum/online
      done
    name: disable_hyperthreading.sh
    type: text/x-shellscript
  image: 
```
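To make the selection logic of the shell pipeline above explicit, here is a Python sketch of the same parsing step. It takes the contents of the `thread_siblings_list` files as strings and returns the logical CPU numbers that the script would take offline, that is, every sibling after the first in each comma-separated list. The function name `siblings_to_offline` is hypothetical, and unlike the shell version this sketch silently ignores range-style entries such as `0-1`.

```python
def siblings_to_offline(sibling_lists):
    """Return sorted, unique CPU numbers listed after the first sibling.

    Mirrors `cut -s -d, -f2- | tr ',' '\\n' | sort -un` for lines such as
    "0,8": the first field is kept online and the rest are selected.
    """
    offline = set()
    for line in sibling_lists:
        fields = line.strip().split(",")
        if len(fields) < 2:
            # Like `cut -s`, skip lines that contain no comma at all.
            continue
        for field in fields[1:]:
            # Sketch-only simplification: ignore non-numeric entries (ranges).
            if field.isdigit():
                offline.add(int(field))
    return sorted(offline)


# Each physical core reports the same list once per logical CPU,
# so duplicates are expected in the input.
print(siblings_to_offline(["0,8", "1,9", "0,8", "1,9"]))
```

On a live Linux host the input lines could be read from `/sys/devices/system/cpu/cpu*/topology/thread_siblings_list`; writing `0` to the corresponding `online` files, as the script does, requires root privileges.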
  