# Building a waveform dataset

For training neural networks, the more training samples the better. With too little training data, one runs the risk of overfitting. Waveforms, however, can be expensive to generate and take up significant storage. Dingo adopts several strategies to mitigate these problems:

* Dingo partitions parameters into two types---intrinsic and extrinsic---and builds a training set based only on the intrinsic parameters. This consists of waveform polarizations $h_+$ and $h_\times$. Extrinsic parameters are selected during training, and applied to generate the detector waveforms $h_I$. This augments the training set to provide unlimited samples from the extrinsic parameters.

* Saved waveforms are compressed using a singular value decomposition. Although this is lossy, waveform mismatches can be monitored to ensure that they fall below the intrinsic error in the waveform model.
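The first strategy can be illustrated with a minimal NumPy sketch. This is not Dingo's implementation: the random stand-in polarizations, the fiducial distance of 100 Mpc, and the randomly drawn antenna-pattern values `f_plus` and `f_cross` are all assumptions made for illustration (real antenna patterns depend on sky position, polarization angle, and time). The sketch relies only on standard relations: the amplitude scales inversely with luminosity distance, a time shift $\Delta t$ multiplies frequency-domain data by $e^{-2\pi i f \Delta t}$, and the detector strain is $h_I = F_+ h_+ + F_\times h_\times$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Uniform frequency grid from 0 to 1024 Hz with spacing 0.125 Hz
# (an assumption, chosen to resemble the example domain on this page).
f_max, delta_f = 1024.0, 0.125
freqs = np.arange(0.0, f_max + delta_f, delta_f)

# Stand-ins for one stored sample: random complex polarizations, taken to be
# generated at a fiducial distance (assumed 100 Mpc) and geocent_time = 0.
d_ref = 100.0
h_plus = rng.normal(size=freqs.size) + 1j * rng.normal(size=freqs.size)
h_cross = rng.normal(size=freqs.size) + 1j * rng.normal(size=freqs.size)


def apply_extrinsic(h_plus, h_cross, distance, time_shift, f_plus, f_cross):
    """Project stored polarizations onto one detector for new extrinsic values."""
    scale = d_ref / distance                           # amplitude ~ 1 / distance
    phase = np.exp(-2j * np.pi * freqs * time_shift)   # time translation
    return scale * phase * (f_plus * h_plus + f_cross * h_cross)


# Each training step can draw fresh extrinsic parameters for the same stored
# sample. Antenna-pattern values are drawn at random purely for illustration.
detector_waveforms = [
    apply_extrinsic(
        h_plus,
        h_cross,
        distance=rng.uniform(100.0, 1000.0),
        time_shift=rng.uniform(-0.1, 0.1),
        f_plus=rng.uniform(-1.0, 1.0),
        f_cross=rng.uniform(-1.0, 1.0),
    )
    for _ in range(3)
]
```

Because the stored polarizations are reused with new extrinsic draws, a single intrinsic sample yields arbitrarily many distinct detector waveforms at negligible cost.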




## The `WaveformDataset` class

The `WaveformDataset` is a storage container for waveform polarizations and parameters, which can be used to serve samples to a neural network during training:

```{eval-rst}
.. autoclass:: dingo.gw.dataset.WaveformDataset
    :members:
    :inherited-members:
    :show-inheritance:
```

`WaveformDataset` subclasses `dingo.core.dataset.DingoDataset` and `torch.utils.data.Dataset`. The former provides generic functionality for saving and loading datasets as HDF5 files and dictionaries, and is used in several components of Dingo. The latter allows the `WaveformDataset` to be used with a PyTorch `DataLoader`. In general, we follow the PyTorch design framework for training, including [Datasets, DataLoaders,](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) and [Transforms](https://pytorch.org/tutorials/beginner/basics/transforms_tutorial.html).

## Generating a simple dataset

As described above, the `WaveformDataset` class is just a container, and does not generate the contents itself. Dataset generation is instead carried out using functions in the `dingo.gw.dataset.generate_dataset` module. Although in practice, datasets are likely to be generated from a settings file using the command line interface, here we describe how to generate one interactively.

A dataset is based on an intrinsic prior and a waveform generator, so we build these as described [here](generating_waveforms.ipynb).

```python
from dingo.gw.waveform_generator import WaveformGenerator
from bilby.core.prior import PriorDict
from dingo.gw.prior import default_intrinsic_dict
from dingo.gw.domains import FrequencyDomain

domain = FrequencyDomain(f_min=20.0, f_max=1024.0, delta_f=0.125)
wfg = WaveformGenerator(approximant='IMRPhenomXPHM', domain=domain, f_ref=20.0)
prior = PriorDict(default_intrinsic_dict)
```

We can use the following function to generate sets of parameters and associated waveforms:

```python
from dingo.gw.dataset.generate_dataset import generate_parameters_and_polarizations

parameters, polarizations = generate_parameters_and_polarizations(
    wfg, prior, num_samples=100, num_processes=1
)
```

```text
Generating dataset of size 100
```


```python
parameters
```

| | mass_ratio | chirp_mass | luminosity_distance | theta_jn | phase | a_1 | a_2 | tilt_1 | tilt_2 | phi_12 | phi_jl | geocent_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.174711 | 46.023150 | 1000.0 | 1.785557 | 5.332880 | 0.870659 | 0.444271 | 0.927936 | 2.372401 | 4.737038 | 2.205327 | 0.0 |
| 1 | 0.435956 | 82.691423 | 1000.0 | 1.465030 | 3.734158 | 0.257709 | 0.494684 | 2.108445 | 0.719251 | 3.216247 | 2.617178 | 0.0 |
| 2 | 0.308695 | 86.753491 | 1000.0 | 1.128628 | 5.741556 | 0.570944 | 0.936525 | 2.207516 | 1.800024 | 1.187898 | 1.449362 | 0.0 |
| 3 | 0.561744 | 67.422981 | 1000.0 | 1.001984 | 0.644845 | 0.572964 | 0.851733 | 1.974848 | 1.206800 | 3.633199 | 3.692582 | 0.0 |
| 4 | 0.253560 | 87.186127 | 1000.0 | 2.237976 | 2.054238 | 0.831898 | 0.200498 | 0.598407 | 0.439991 | 6.242744 | 0.548293 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 0.189224 | 30.649494 | 1000.0 | 1.649312 | 1.998984 | 0.985395 | 0.546526 | 2.369616 | 0.859203 | 0.518297 | 2.452604 | 0.0 |
| 96 | 0.523299 | 62.313277 | 1000.0 | 0.158947 | 1.507464 | 0.739663 | 0.478863 | 0.373476 | 2.263823 | 0.006073 | 4.600835 | 0.0 |
| 97 | 0.271249 | 29.078625 | 1000.0 | 0.882830 | 3.676211 | 0.275971 | 0.362211 | 1.193439 | 1.620267 | 0.676554 | 0.612802 | 0.0 |
| 98 | 0.228471 | 70.515950 | 1000.0 | 0.439977 | 4.991404 | 0.621307 | 0.160902 | 2.695218 | 1.470612 | 3.300429 | 4.546350 | 0.0 |


In [8]:
polarizations

{'h_plus': array([[ 0.+0.j,  0.+0.j,  0.+0.j, ...,  0.+0.j,  0.+0.j, -0.+0.j],
        [ 0.+0.j,  0.+0.j,  0.+0.j, ...,  0.+0.j,  0.+0.j,  0.-0.j],
        [ 0.+0.j,  0.+0.j,  0.+0.j, ...,  0.+0.j,  0.+0.j,  0.+0.j],
        ...,
        [ 0.+0.j,  0.+0.j,  0.+0.j, ...,  0.+0.j,  0.+0.j,  0.-0.j],
        [ 0.+0.j,  0.+0.j,  0.+0.j, ...,  0.+0.j,  0.+0.j,  0.+0.j],
        [ 0.+0.j,  0.+0.j,  0.+0.j, ...,  0.+0.j,  0.+0.j,  0.+0.j]]),
 'h_cross': array([[ 0.+0.j,  0.+0.j,  0.+0.j, ...,  0.+0.j,  0.+0.j, -0.+0.j],
        [ 0.+0.j,  0.+0.j,  0.+0.j, ...,  0.+0.j,  0.+0.j,  0.-0.j],
        [ 0.+0.j,  0.+0.j,  0.+0.j, ...,  0.+0.j,  0.+0.j,  0.+0.j],
        ...,
        [ 0.+0.j,  0.+0.j,  0.+0.j, ...,  0.+0.j,  0.+0.j,  0.-0.j],
        [ 0.+0.j,  0.+0.j,  0.+0.j, ...,  0.+0.j,  0.+0.j,  0.+0.j],
        [ 0.+0.j,  0.+0.j,  0.+0.j, ...,  0.+0.j,  0.+0.j,  0.+0.j]])}

We can then put these in a `WaveformDataset`:

```python
from dingo.gw.dataset import WaveformDataset

dataset_dict = {'parameters': parameters, 'polarizations': polarizations}
wfd = WaveformDataset(dictionary=dataset_dict)
```

Samples can then be easily indexed:

```python
wfd[0]
```

```text
{'parameters': {'mass_ratio': 0.17471100347005172,
  'chirp_mass': 46.02314987592415,
  'luminosity_distance': 1000.0,
  'theta_jn': 1.7855568083299755,
  'phase': 5.332879672785095,
  'a_1': 0.8706586447405819,
  'a_2': 0.4442705825888118,
  'tilt_1': 0.9279362382149828,
  'tilt_2': 2.3724006876536285,
  'phi_12': 4.7370383398776985,
  'phi_jl': 2.2053266465830292,
  'geocent_time': 0.0},
 'waveform': {'h_plus': array([ 0.+0.j,  0.+0.j,  0.+0.j, ...,  0.+0.j,  0.+0.j, -0.+0.j]),
  'h_cross': array([ 0.+0.j,  0.+0.j,  0.+0.j, ...,  0.+0.j,  0.+0.j, -0.+0.j])}}
```

```{note}
The sample is represented as a *nested dictionary*. This is a standard format for Dingo.
```
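To make the nested-dictionary format concrete, here is a toy container. It is a simplified stand-in, not the actual `WaveformDataset` code: the class name `ToyWaveformDataset` and its storage layout (a dict of parameter arrays plus a dict of polarization arrays) are hypothetical. It shows how `__len__` and `__getitem__`, the two methods that a PyTorch map-style dataset must provide, can return samples in this format.

```python
import numpy as np


class ToyWaveformDataset:
    """Hypothetical stand-in illustrating the nested-dictionary sample format."""

    def __init__(self, parameters, polarizations):
        # parameters: dict of 1D arrays, one entry per sample.
        # polarizations: dict of 2D arrays with shape (num_samples, num_bins).
        self.parameters = parameters
        self.polarizations = polarizations

    def __len__(self):
        return len(next(iter(self.parameters.values())))

    def __getitem__(self, idx):
        return {
            "parameters": {k: float(v[idx]) for k, v in self.parameters.items()},
            "waveform": {k: v[idx] for k, v in self.polarizations.items()},
        }


num_samples, num_bins = 4, 8
toy = ToyWaveformDataset(
    parameters={
        "chirp_mass": np.linspace(30.0, 60.0, num_samples),
        "mass_ratio": np.linspace(0.2, 0.8, num_samples),
    },
    polarizations={
        "h_plus": np.zeros((num_samples, num_bins), dtype=complex),
        "h_cross": np.zeros((num_samples, num_bins), dtype=complex),
    },
)

sample = toy[0]
print(sorted(sample))                      # top-level keys of the sample
print(sample["parameters"])                # flat dict of floats
print(sample["waveform"]["h_plus"].shape)  # one row of the stored array
```

Keeping parameters and waveform data under separate top-level keys means a transform can modify one branch (for example, the waveform) without needing to know the layout of the other.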

## Automated dataset construction

The simple dataset constructed above is useful for illustrative purposes, but it lacks several important features:
* Waveforms are not compressed. A dataset with many samples would therefore take up enormous storage space.
* The dataset is not reproducible. It contains no metadata describing its construction (e.g., waveform approximant, domain, prior, ...).

The `generate_dataset` function automates all of these advanced features:
```{eval-rst}
.. autofunction:: dingo.gw.dataset.generate_dataset.generate_dataset
```
This function is in turn wrapped by the command-line functions `dingo_generate_dataset` and `dingo_generate_dataset_dag`. These take a `.yaml` file with the same contents as the settings dictionary.

### Configuration

A typical settings dictionary / `.yaml` config file takes the following form, described in detail below:

```yaml
domain:
  type: FrequencyDomain
  f_min: 20.0
  f_max: 1024.0
  delta_f: 0.125

waveform_generator:
  approximant: IMRPhenomXPHM
  f_ref: 20.0
  # f_start: 15.0  # Optional setting useful for EOB waveforms. Overrides f_min when generating waveforms.
  # new_interface: true # Optional setting for employing new waveform interface. This is needed for SEOBNRv5 approximants, and optional for standard LAL approximants.
  spin_conversion_phase: 0.0

# Dataset only samples over intrinsic parameters. Extrinsic parameters are chosen at train time.
intrinsic_prior:
  mass_1: bilby.core.prior.Constraint(minimum=10.0, maximum=80.0)
  mass_2: bilby.core.prior.Constraint(minimum=10.0, maximum=80.0)
  chirp_mass: bilby.gw.prior.UniformInComponentsChirpMass(minimum=25.0, maximum=100.0)
  mass_ratio: bilby.gw.prior.UniformInComponentsMassRatio(minimum=0.125, maximum=1.0)
  phase: default
  a_1: bilby.core.prior.Uniform(minimum=0.0, maximum=0.99)
  a_2: bilby.core.prior.Uniform(minimum=0.0, maximum=0.99)
  tilt_1: default
  tilt_2: default
  phi_12: default
  phi_jl: default
  theta_jn: default
  # Reference values for fixed (extrinsic) parameters. These are needed to generate a waveform.
  luminosity_distance: 100.0  # Mpc
  geocent_time: 0.0  # s

# Dataset size
num_samples: 5000000

# Save a compressed representation of the dataset
compression:
  svd:
    # Truncate the SVD basis at this size. No truncation if zero.
    size: 200
    num_training_samples: 50000
    num_validation_samples: 10000
  whitening: aLIGO_ZERO_DET_high_P_asd.txt
```

domain
: Specifies the data domain. Currently only `FrequencyDomain` is implemented.

waveform_generator
: Choose the approximant and reference frequency. For EOB models that require time integration, it is usually necessary to specify a lower starting frequency with the optional `f_start` setting, which overrides `f_min` when generating waveforms. In this case, `f_ref` is ignored.

  spin_conversion_phase (optional)
  : Value for `phiRef` when converting PE spins to Cartesian spins via `bilby_to_lalsimulation_spins`. When set to `None` (default), this uses the `phase`
   parameter. When set to 0.0, `phase` only refers to the azimuthal observation angle, allowing for it to be treated as an extrinsic parameter.
     ```{important}
     It is necessary to set this to 0.0 if planning to train a `phase`-marginalized network, and then reconstruct the `phase` synthetically.
     ```

intrinsic_prior
: Specify the prior over intrinsic parameters. Intrinsic parameters here refer to those parameters that are needed to generate waveform polarizations. Extrinsic parameters here refer to those parameters that can be sampled and applied rapidly during training. As shown in the example, it is also possible to specify `default` priors, which is convenient for certain parameters. These are listed in `dingo.gw.prior.default_intrinsic_dict`.

  Intrinsic parameters include masses and spins, but also inclination, reference phase, luminosity distance, and time of coalescence at geocenter. Although inclination and phase are often considered extrinsic parameters, they are needed to generate waveform polarizations and cannot be easily transformed.

  Luminosity distance and time of coalescence are considered *both* intrinsic and extrinsic. They are needed to generate polarizations, but they can also be easily transformed during training to augment the dataset. We therefore fix them to fiducial values for generating polarizations.
  
num_samples
: The number of samples to include in the dataset. For a production model, we typically use $5 \times 10^6$ samples.

compression (optional)
: How to compress the dataset.

  svd (optional)
  : Construct an SVD basis based on a specified number of additional samples. Save the main dataset in terms of its SVD basis coefficients. The number of elements in the basis is specified by the `size` setting. The performance of the basis is also evaluated in terms of the mismatch against a number of validation samples. All of the validation information, as well as the basis itself, is saved along with the waveform dataset.
  
  whitening (optional)
  : Whether to save whitened waveforms, and in particular, whether to construct the basis based on whitened waveforms. The basis will be more efficient if whitening is used to adapt it to the detector noise characteristics. To use whitening, simply specify the desired ASD to use, from the Bilby [list of ASDs](https://git.ligo.org/lscsoft/bilby/-/tree/master/bilby/gw/detector/noise_curves). Note that the whitening is used only for the internal storage of the dataset. When accessing samples from the dataset, they will be unwhitened.
  
  Dataset compression is implemented internally by setting the `WaveformGenerator.transform` operator, so that elements are compressed immediately after generation (avoiding the need to store many uncompressed waveforms in memory). Likewise, decompression is implemented by setting the `WaveformDataset.decompression_transform` operator to apply the inverse transformation. This will act on samples to decompress them when accessed through `WaveformDataset.__getitem__()`.
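The compression scheme described above can be sketched in plain NumPy. This is an illustrative toy under stated assumptions, not Dingo's implementation: the "waveforms" are random combinations of smooth complex modes, the ASD is a made-up smooth curve, and the names `compress`, `decompress`, and `mismatch` are hypothetical helpers. It builds a truncated SVD basis from whitened training samples, stores coefficients, and checks the mismatch on held-out validation samples after decompression and unwhitening.

```python
import numpy as np

rng = np.random.default_rng(0)
num_bins, num_modes, basis_size = 256, 30, 20
num_train, num_val = 400, 50


def toy_waveforms(n):
    """Stand-in 'waveforms': random sums of smooth modes with decaying weights."""
    x = np.linspace(0.0, 1.0, num_bins)
    k = np.arange(1, num_modes + 1)
    modes = np.exp(2j * np.pi * k[:, None] * x[None, :]) / k[:, None] ** 2
    coeffs = rng.normal(size=(n, num_modes)) + 1j * rng.normal(size=(n, num_modes))
    return coeffs @ modes


# Hypothetical smooth ASD used for whitening (not a real noise curve).
asd = 1.0 + np.linspace(0.0, 1.0, num_bins) ** 2

# Build the basis from whitened training waveforms, then truncate it.
train = toy_waveforms(num_train) / asd
_, _, vh = np.linalg.svd(train, full_matrices=False)
V = vh[:basis_size].conj().T  # shape (num_bins, basis_size)


def compress(h):
    """Whiten, then project onto the truncated basis."""
    return (h / asd) @ V


def decompress(c):
    """Reconstruct from basis coefficients, then unwhiten."""
    return (c @ V.conj().T) * asd


def mismatch(a, b):
    """One minus the normalized overlap, evaluated on whitened data."""
    a, b = a / asd, b / asd
    norm = np.sqrt(np.vdot(a, a).real * np.vdot(b, b).real)
    return 1.0 - np.vdot(a, b).real / norm


# Validate on held-out waveforms that were not used to build the basis.
val = toy_waveforms(num_val)
coefficients = compress(val)
recon = decompress(coefficients)
mismatches = np.array([mismatch(h, r) for h, r in zip(val, recon)])
print(f"stored {coefficients.shape[1]} coefficients instead of {num_bins} bins")
print(f"max validation mismatch: {mismatches.max():.2e}")
```

In this toy, the basis is deliberately smaller than the number of underlying modes, so the compression is lossy and the validation mismatch quantifies the loss. Increasing `basis_size` trades storage for accuracy.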
  
```{important}
The automated dataset constructors store the configuration settings in `WaveformDataset.settings`. This is so that the settings can be accessed by downstream tasks, and for reference.
```

### Command-line interface

In most cases the command-line interface will be used to generate a dataset. Given a settings file, one can call
```bash
dingo_generate_dataset --settings_file settings.yaml \
                       --num_processes N \
                       --out_file waveform_dataset.hdf5
```
This will generate a dataset following the configuration in `settings.yaml` and save it as `waveform_dataset.hdf5`, using `N` processes.

To inspect the dataset (or any other Dingo-generated file) use
```bash
dingo_ls waveform_dataset.hdf5
```
This will print the configuration settings, as well as a summary of the SVD compression performance (if available).

For larger datasets, or those based on slower waveform models, Dingo includes a script that builds a condor DAG, `dingo_generate_dataset_dag`. This splits the generation of waveforms across several nodes, and then reconstitutes the final dataset.