# qagmire

> Straightforward analysis of WEAVE data.

Qagmire tries to make dealing with [WEAVE](https://www.ing.iac.es/Astronomy/instruments/weave/weaveinst.html) data products quick and easy, so you can focus on verifying the data quality and producing science results. The focus is on working directly with data stored in FITS files, each typically containing information for a single WEAVE observation.

Qagmire is being developed for the needs of the Quality Assurance Group (QAG) and Science Verification phase of the [WEAVE-LOFAR survey](https://ingconfluence.ing.iac.es/confluence/display/WEAV/WEAVE-LOFAR), so it currently focusses on WEAVE MOS data. However, qagmire is primarily a demonstration of techniques to efficiently analyse data in very large and growing collections of FITS files. The approaches can easily be extended to other WEAVE surveys and beyond.

A key concept is to read data from the FITS files into [xarray](https://docs.xarray.dev) [Datasets](https://docs.xarray.dev/en/stable/user-guide/data-structures.html#dataset), via a [netCDF](https://docs.xarray.dev/en/stable/user-guide/io.html) cache to improve efficiency. Operations can be performed on the resulting `Dataset`, e.g. performing calculations using specific data products for a particular selection of observed sources, in a _lazy_ fashion. This means that the required computations are not performed until the final results are required. At that point, they can be computed (via [dask](https://docs.xarray.dev/en/stable/user-guide/dask.html)) in a manner that makes best use of the available computational resources. Time-consuming computations can be easily accelerated by assigning more computational resources to the task.
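As a rough illustration of the lazy workflow described above, here is a minimal sketch using xarray and dask directly. The `Dataset` is built from random numbers as a stand-in for one read from WEAVE files, and the variable names (`flux`, `snr`) and dimension names (`source`, `wavelength`) are assumptions made for this example, not the structure that qagmire actually produces.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for a Dataset read from WEAVE FITS files.
# Variable and dimension names are illustrative assumptions only.
rng = np.random.default_rng(0)
n_source, n_wave = 1000, 500
ds = xr.Dataset(
    {
        "flux": (("source", "wavelength"), rng.normal(1.0, 0.1, (n_source, n_wave))),
        "snr": (("source",), rng.uniform(0.0, 20.0, n_source)),
    },
    coords={"wavelength": np.linspace(3700.0, 9500.0, n_wave)},
)

# Chunking converts the variables to dask arrays, so later operations are lazy.
ds = ds.chunk({"source": 100})

# Build up the calculation: mean spectrum of the sources with snr > 5.
# Nothing is computed at this point; xarray only records the task graph.
mean_spectrum = ds["flux"].where(ds["snr"] > 5).mean("source")

# The work is only done when the result is explicitly requested.
result = mean_spectrum.compute()
```

Here `mean_spectrum` remains a dask-backed `DataArray` until `compute()` is called, which is the point at which dask schedules the work across the available resources.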

Another important element of qagmire is that tests, verification and analysis can all be performed interactively in Jupyter notebooks. Code in one notebook can be automatically exported into the qagmire library for reuse elsewhere, so each notebook can focus on a single issue. Thanks to nbdev, the notebooks are automatically available online for others to inspect.

There is some explanation [below](#why-is-qagmire-needed) of why one might prefer these approaches to some alternatives.

## Structure

If you are primarily interested in seeing qagmire used to run diagnostic tests on WEAVE data products, then take a look at the [observing conditions check](diagnostics/10_obs_cond_check.ipynb), the [L1 spectrum value check](13_l1_spectrum_value_check.ipynb), and other notebooks in the diagnostics folder.

The approach to data access is implemented in [data](01_data.ipynb). This includes a set of `get_*_files` for locating WEAVE files for specific dates, etc. and `read_*` functions which return xarray `Dataset`s of WEAVE data. The [data](01_data.ipynb) notebook also demonstrates the use of these functions and documents the structure of the resulting `Dataset`s.

A framework for writing and running diagnostic tests is implemented in [quality_assurance](01_quality_assurance.ipynb). This defines a `Diagnostics` class that is used by all the notebooks in the diagnostics folder.

Finally, utility functions used in other notebooks are collected in [utilities](03_utilities.ipynb).


## Installation

Qagmire is currently being developed at [github.com/bamford/qagmire](https://github.com/bamford/qagmire).

At the moment you can install it (ideally in a fresh environment) using:
```sh
pip install -e "git+https://github.com/bamford/qagmire.git#egg=qagmire"
```
The `-e` option ensures that you can make edits to the code and they will be picked up without needing to reinstall. The `#egg=qagmire` suffix tells pip the package name, which it requires for editable installs from a git URL.

Once development has progressed a little further, I plan to transfer the repo to the [WEAVE-LOFAR](https://github.com/WEAVE-LOFAR) organisation and submit it to PyPI.

## Development

Qagmire is developed using [nbdev](https://nbdev.fast.ai/). If you want to contribute to the main code, then you would be best getting familiar with nbdev, e.g. via the [walkthrough](https://nbdev.fast.ai/getting_started.html).
You can write and execute notebooks that use qagmire without needing to use `nbdev`. However, you should use nbdev (ideally via the [git hooks](https://nbdev.fast.ai/tutorials/pre_commit.html) by running `pre-commit install`) when committing back to the repository.

## Why is qagmire needed?

Those in the know will be aware that we already have a tool for reading and analysing WEAVE data, [weave-io](https://github.com/WEAVE-LOFAR/weave-io). The approach taken by `weave-io` is to ingest data from the original FITS files into a database. This database has a carefully designed relational structure, with a [specialised syntax](http://shaun.science/weave-io/objects/) for accessing the data. Information about the structure of the data, as well as various quantities from the headers of the FITS files, is stored in the database itself, but pixel data is read on demand from the original FITS files.

Unfortunately, this approach has proved to have a few downsides:

1. The structure of the FITS files is hardcoded into weave-io. This means that if the structure changes, which is likely during the early Quality Assurance and Science Verification stages of the survey, then ingesting the data may fail. The code would need to be adapted to work with the new structure.
2. Weave-io's code is very sophisticated, containing many layers of abstraction and conventions. Unfortunately, it can therefore be quite difficult to make changes, especially in a way that doesn't break some other functionality.
3. Ingesting data is a slow process. It can take on the order of hours to ingest a single night of data.
4. If there is a change in the data structure, within weave-io or in the FITS files, then the entire database may need to be reingested.
5. While it is possible to ingest different subsets of data into different databases, it is then very difficult to perform computations that use data across different databases.
6. The relational syntax sometimes produces hard-to-understand or non-intuitive results.
7. Accessing many elements of data is slow. If a test is written to operate on a single element (e.g. a spectrum), then a database access is required for each element, which is very slow. However, this can be sped up by writing the tests to work on a larger unit, e.g. an OB, in one go.
8. Accessing pixel data is particularly slow, and it is difficult to speed up such queries. Each spectrum is a separate element in the database, so its data is read individually. This means that performing a test on all spectra in an observation involves ~1000 individual disk read operations. This is much slower than simply reading all the spectra into memory in one go.
9. Although a multi-process job distribution method was intended to be part of weave-io, this has not been tested and there are signs that it may be difficult to get working (and would still be inefficient). Increasing the capabilities of weave-io requires the involvement of the cluster manager.

Some of these issues could likely be addressed with further development of weave-io. However, at least in the interim, it seems wise to have an approach that enables more direct and flexible access to the data in the original FITS files, while allowing computations across large amounts of data to be performed in a reasonable time. This is the goal of qagmire.

With qagmire, tests take orders of magnitude less time than they would in weave-io. Long-running computations can be easily accelerated by taking advantage of multiple processors or even nodes through the regular user-facing job submission system.

This certainly doesn't preclude making data available via a database front-end once things are more stable. It may be interesting to explore using [TileDB](https://tiledb-inc.github.io/TileDB-CF-Py/documentation/index.html). 