# Data pre-processing

A sample from a `WaveformDataset` consists of labeled waveform polarizations $(\theta_{\text{intrinsic}}, (h_+,h_\times))$, represented as a nested dictionary. This must be transformed into noisy detector data $d_I$ (with additional noise context data) in a form suitable for input to a neural network. Dingo accomplishes this by applying a sequence of [**transforms**](https://pytorch.org/tutorials/beginner/basics/transforms_tutorial.html) to the sample.

A transform is simply a class with a `__call__()` method, which takes a sample as input and returns a transformed sample. A sequence of transforms can then be [composed](https://pytorch.org/vision/stable/generated/torchvision.transforms.Compose.html#torchvision.transforms.Compose) to build a more complex transform in a modular way. Dingo's training transform sequence is stored as `WaveformDataset.transform`, and is applied automatically when elements are accessed through indexing.
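
As a minimal sketch of this pattern, each transform implements `__call__()` and a composite transform applies its members in order. The class names `AddOffset`, `Scale`, and `ComposeTransforms` are hypothetical illustrations and are not part of Dingo or torchvision.

```python
class AddOffset:
    """Hypothetical transform: add a constant to sample["x"]."""

    def __init__(self, offset):
        self.offset = offset

    def __call__(self, sample):
        out = dict(sample)  # shallow copy, so the input sample is not modified
        out["x"] = out["x"] + self.offset
        return out


class Scale:
    """Hypothetical transform: multiply sample["x"] by a constant."""

    def __init__(self, factor):
        self.factor = factor

    def __call__(self, sample):
        out = dict(sample)
        out["x"] = out["x"] * self.factor
        return out


class ComposeTransforms:
    """Apply a list of transforms one after another."""

    def __init__(self, transforms):
        self.transforms = list(transforms)

    def __call__(self, sample):
        for transform in self.transforms:
            sample = transform(sample)
        return sample


transform = ComposeTransforms([AddOffset(1.0), Scale(2.0)])
result = transform({"x": 3.0})  # (3.0 + 1.0) * 2.0
```

Because each step only needs to accept and return a sample, individual transforms can be tested in isolation and reordered or swapped without touching the rest of the chain.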

## GW transform sequence

For Dingo, the flowchart below indicates the sequence of transforms applied to a sample from a `WaveformDataset`.

```{mermaid}
    :caption: Flowchart for Dingo data-preprocessing pipeline for training, starting from a sample from a `WaveformDataset`. Transforms with rounded corners include an element of randomness, whereas parallelogram-shaped items are deterministic.
    
flowchart TB
    sample[Sample from WaveformDataset]
    sample-->extrinsic([SampleExtrinsicParameters])
    subgraph det[Simulate waveforms in detectors]
        direction TB
        det_times[/GetDetectorTimes/]
        det_times-->gnpe_maybe{Using GNPE?}
        gnpe_maybe-- No -->project_det[/ProjectOntoDetectors/]
        gnpe_maybe-- Yes -->gnpe_times([GNPECoalescenceTimes])
        gnpe_times-->project_det
    end
    subgraph noise[Add noise]
        direction TB
        sample_asd([SampleNoiseASD])
        sample_asd-->whiten[/WhitenAndScaleStrain/]
        whiten-->add_noise([AddWhiteNoiseComplex])
    end
    subgraph output[Prepare output]
        direction TB
        standardize[/SelectStandardizeRepackageParameters/]   
        standardize-->repackage[/RepackageStrainsAndASDS/]
        repackage-->unpack[/UnpackDict/]
    end
    extrinsic-->det
    det-->noise
    noise-->output
    output-->E[End]
```

```{important}
Some pre-processing transforms include an element of randomness. This serves to augment the training data and reduce overfitting.
```

### Extrinsic parameters

The starting point for this chain of transforms is a sample `sample` with `parameters` and `polarizations` sub-dictionaries. The first transform samples the extrinsic parameters, and adds a new sub-dictionary `extrinsic_parameters` to `sample`. Extrinsic parameters include sky position (right ascension, declination), polarization angle, time of coalescence, and luminosity distance (the latter two of which are also considered intrinsic parameters).

```{eval-rst}
.. autoclass:: dingo.gw.transforms.SampleExtrinsicParameters
    :members:
```
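
The effect of this step on the sample dictionary can be sketched as follows. This is an illustration only: `SampleExtrinsicSketch` is a hypothetical stand-in, not the Dingo class, and the distributions are assumptions (uniform in right ascension and polarization angle, isotropic in declination, and the time and distance ranges borrowed from the example settings later on this page).

```python
import numpy as np


class SampleExtrinsicSketch:
    """Hypothetical stand-in for an extrinsic-parameter sampling transform."""

    def __init__(self, seed=None):
        self.rng = np.random.default_rng(seed)

    def __call__(self, sample):
        rng = self.rng
        extrinsic = {
            "ra": rng.uniform(0.0, 2.0 * np.pi),
            # Uniform in sin(dec) gives an isotropic distribution on the sky.
            "dec": np.arcsin(rng.uniform(-1.0, 1.0)),
            "psi": rng.uniform(0.0, np.pi),
            "geocent_time": rng.uniform(-0.10, 0.10),
            "luminosity_distance": rng.uniform(100.0, 1000.0),
        }
        out = dict(sample)
        # Held separately until the parameters are applied to the waveform.
        out["extrinsic_parameters"] = extrinsic
        return out


sample = {"parameters": {"chirp_mass": 30.0}, "polarizations": {}}
new_sample = SampleExtrinsicSketch(seed=0)(sample)
```

Since the draw happens every time a sample is accessed, the same intrinsic waveform is paired with different extrinsic parameters in each epoch.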

### Detector waveforms

The next sequence of transforms applies the extrinsic parameters to `sample["polarizations"]` to produce detector waveforms in `sample["waveform"]`. The first transform calculates the arrival time $t_I$ of the waveform in each detector, based on the time of coalescence at geocenter and the sky position, and stores this in `sample["extrinsic_parameters"]`.

```{eval-rst}
.. autoclass:: dingo.gw.transforms.GetDetectorTimes
    :members:
```

```{important}
Dingo models are trained for a **fixed set of detectors.** This must be selected prior to training, and a new model must be trained if one wishes to analyze data in a different set of detectors. Thus, e.g., separate models must be trained for HL and HLV configurations.
```

(ref:ref-time)=
```{note}
During training, Dingo **fixes the orientation of the Earth** (and corresponding interferometer positions and orientations) to that at a fixed reference time `ref_time`. This is so that the model does not have to learn about the rotation of the Earth. This is corrected in post-processing by shifting the inferred right ascension by the difference between the true and reference sidereal times.
```
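
The right ascension correction can be sketched as below. This is a simplified illustration rather than Dingo's implementation: the function name is hypothetical, and the difference in sidereal time is approximated by assuming the Earth rotates uniformly once per mean sidereal day. A real analysis would use a proper sidereal-time routine.

```python
import math

SIDEREAL_DAY = 86164.0905  # approximate mean sidereal day in seconds


def correct_right_ascension(ra_inferred, t_event, t_ref):
    """Shift an inferred right ascension from the reference-time frame
    to the frame at the true event time (uniform-rotation approximation)."""
    delta_sidereal = 2.0 * math.pi * (t_event - t_ref) / SIDEREAL_DAY
    return (ra_inferred + delta_sidereal) % (2.0 * math.pi)


t_ref = 1126259462.391
# A quarter of a sidereal day after the reference time, the Earth has
# turned by a quarter revolution, so the correction is pi / 2.
ra_true = correct_right_ascension(1.0, t_ref + SIDEREAL_DAY / 4.0, t_ref)
```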

Optionally, the times $t_I$ are perturbed to give new "proxy times" as part of the [](gnpe.md) algorithm.

```{eval-rst}
.. autoclass:: dingo.gw.transforms.GNPECoalescenceTimes
    :members:
```
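
A minimal sketch of the perturbation step, assuming a uniform kernel like the one in the example settings later on this page. The function name and dictionary layout are hypothetical, and options such as `exact_equiv` are not modeled here.

```python
import numpy as np


def perturb_detector_times(detector_times, half_width=0.001, rng=None):
    """Return proxy times t_I + epsilon_I, with epsilon_I drawn
    independently from Uniform(-half_width, half_width)."""
    rng = np.random.default_rng() if rng is None else rng
    return {
        ifo: t + rng.uniform(-half_width, half_width)
        for ifo, t in detector_times.items()
    }


detector_times = {"H1": 0.0123, "L1": 0.0051}
proxy_times = perturb_detector_times(detector_times, rng=np.random.default_rng(0))
```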

Finally, the detector waveforms $h_I$ are calculated from the extrinsic parameters. (In the backend, these transforms use the Bilby interferometer libraries.) The contents of the `extrinsic_parameters` sub-dictionary are then moved into `sample["parameters"]`; this was essentially a holding place for parameters not yet applied to the waveform.

```{eval-rst}
.. autoclass:: dingo.gw.transforms.ProjectOntoDetectors
    :members:
```
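
In the frequency domain, the projection follows the standard form $h_I(f) = \left[F_{+,I}\,h_+(f) + F_{\times,I}\,h_\times(f)\right] e^{-2\pi i f t_I}$, with an overall rescaling for luminosity distance. The sketch below illustrates this with NumPy. The function name, the reference distance `d_ref`, and the antenna pattern values are placeholders chosen for illustration; in practice the antenna patterns depend on sky position, polarization angle, and detector, and are obtained from the interferometer libraries.

```python
import numpy as np


def project_onto_detector(h_plus, h_cross, f_plus, f_cross,
                          frequencies, t_detector, d_ref, d_lum):
    """Combine polarizations into a detector waveform (illustrative only)."""
    # Antenna pattern weighting of the two polarizations.
    h = f_plus * h_plus + f_cross * h_cross
    # Amplitude scales inversely with luminosity distance.
    h = h * (d_ref / d_lum)
    # A time shift is a frequency-dependent phase in the frequency domain.
    return h * np.exp(-2j * np.pi * frequencies * t_detector)


frequencies = np.linspace(20.0, 1024.0, 5)
h_plus = np.ones(5, dtype=complex)
h_cross = 1j * np.ones(5, dtype=complex)

h_unshifted = project_onto_detector(
    h_plus, h_cross, 0.5, -0.3, frequencies, 0.0, 100.0, 200.0
)
h_shifted = project_onto_detector(
    h_plus, h_cross, 0.5, -0.3, frequencies, 0.01, 100.0, 200.0
)
```

The time shift changes only the phase of each frequency bin, so `h_shifted` and `h_unshifted` have the same amplitude spectrum.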

### Noise

Once the detector waveforms have been obtained, noise $n_I$ must be added to simulate realistic data. First, noise ASDs are selected randomly for each detector from an `ASDDataset` for the relevant observing run. These are stored in `sample["asds"]`. For details see [](noise_dataset.ipynb#asd-dataset).

```{eval-rst}
.. autoclass:: dingo.gw.transforms.SampleNoiseASD
    :members:
```

The waveform is then whitened by dividing by the sampled ASD, and further scaled by the standard deviation of white noise. After this step, the noise in each input to the network has unit variance, which is important for successful training.

```{eval-rst}
.. autoclass:: dingo.gw.transforms.WhitenAndScaleStrain
    :members:
```
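
The following sketch illustrates why this scaling yields unit-variance noise. It assumes a one-sided PSD convention, in which the real and imaginary parts of frequency-domain noise each have variance $w\,S_n(f) / (4\,\delta f)$ for window factor $w$ and frequency resolution $\delta f$. The function name and the exact form of the scale factor are assumptions made for illustration, not a statement of Dingo's implementation.

```python
import numpy as np


def whiten_and_scale(strain, asd, delta_f, window_factor=1.0):
    """Divide by the ASD and by the standard deviation of white noise."""
    noise_std = np.sqrt(window_factor) / np.sqrt(4.0 * delta_f)
    return strain / (asd * noise_std)


rng = np.random.default_rng(0)
n_bins, delta_f = 20000, 0.125  # delta_f = 1 / T for T = 8 s
asd = np.full(n_bins, 1e-23)

# Simulate colored Gaussian noise consistent with the ASD.
sigma = asd / np.sqrt(4.0 * delta_f)
noise = sigma * (rng.standard_normal(n_bins) + 1j * rng.standard_normal(n_bins))

white = whiten_and_scale(noise, asd, delta_f)
# The real and imaginary parts of `white` now have standard deviation close to 1.
```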

After whitening, the noise is white, so the final step is to sample white noise and add it to `sample["waveform"]`.

```{eval-rst}
.. autoclass:: dingo.gw.transforms.AddWhiteNoiseComplex
    :members:
```
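
A minimal sketch of this step, assuming the waveforms are stored as a dictionary of complex arrays keyed by detector name. The function name is hypothetical, and the noise is drawn with unit variance in both the real and imaginary parts.

```python
import numpy as np


def add_white_noise_complex(waveforms, rng=None):
    """Add unit-variance complex Gaussian noise to each detector waveform."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = {}
    for ifo, h in waveforms.items():
        noise = rng.standard_normal(h.shape) + 1j * rng.standard_normal(h.shape)
        noisy[ifo] = h + noise
    return noisy


waveforms = {
    "H1": np.zeros(10000, dtype=complex),
    "L1": np.zeros(10000, dtype=complex),
}
noisy = add_white_noise_complex(waveforms, rng=np.random.default_rng(1))
```

Because a fresh noise realization is drawn on every access, the network never sees the same noisy data twice, even for a repeated waveform.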

### Output

The final set of transforms prepares the sample for input to the neural network. First, the desired inference parameters are selected. By taking only a subset of `parameters`, one can train a marginalized posterior model. These parameters are also standardized to have zero mean and unit variance to improve training. (Standardization is undone in post-processing after inference.) The parameters are then repackaged into a `numpy.ndarray`, so that parameter labels are implicit in the ordering.

```{eval-rst}
.. autoclass:: dingo.gw.transforms.SelectStandardizeRepackageParameters
    :members:
```
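
The three operations (select, standardize, repackage) and the inverse used in post-processing can be sketched as follows. The function names and the mean and standard deviation values are hypothetical.

```python
import numpy as np


def select_standardize_repackage(parameters, names, mean, std):
    """Pick the named parameters, standardize them, and stack into an array."""
    return np.array(
        [(parameters[k] - mean[k]) / std[k] for k in names], dtype=np.float32
    )


def destandardize(array, names, mean, std):
    """Inverse operation: recover a labeled dictionary from the array."""
    return {k: float(array[i]) * std[k] + mean[k] for i, k in enumerate(names)}


parameters = {"chirp_mass": 30.0, "mass_ratio": 0.5, "ra": 1.0}
names = ["chirp_mass", "mass_ratio"]  # "ra" is dropped, i.e. marginalized over
mean = {"chirp_mass": 25.0, "mass_ratio": 0.5}
std = {"chirp_mass": 5.0, "mass_ratio": 0.25}

theta = select_standardize_repackage(parameters, names, mean, std)
recovered = destandardize(theta, names, mean, std)
```

The ordering of `names` must be stored alongside the model, since it is the only record of which array entry corresponds to which parameter.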

The `waveform` and `asds` dictionaries are also repackaged into a single array of shape suitable for input to the network. In particular, the complex frequency domain strain data are decomposed into real and imaginary parts.


```{eval-rst}
.. autoclass:: dingo.gw.transforms.RepackageStrainsAndASDS
    :members:
```
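
One possible layout for the repackaged array is sketched below, with one row per detector and separate channels for the real part, the imaginary part, and the ASD. The function name, the channel ordering, and the choice to store the ASD directly (rather than some rescaled or inverted form) are assumptions made for illustration.

```python
import numpy as np


def repackage_strains_and_asds(waveform, asds, detectors):
    """Stack per-detector data into shape (n_detectors, 3, n_frequencies)."""
    n_freq = len(waveform[detectors[0]])
    out = np.empty((len(detectors), 3, n_freq), dtype=np.float32)
    for i, ifo in enumerate(detectors):
        out[i, 0] = waveform[ifo].real  # real part of the strain
        out[i, 1] = waveform[ifo].imag  # imaginary part of the strain
        out[i, 2] = asds[ifo]           # noise context
    return out


detectors = ["H1", "L1"]
waveform = {
    "H1": np.array([1 + 2j, 3 + 4j, 5 + 6j, 7 + 8j]),
    "L1": np.array([0 + 1j, 0 + 1j, 0 + 1j, 0 + 1j]),
}
asds = {"H1": np.ones(4), "L1": 2.0 * np.ones(4)}

data = repackage_strains_and_asds(waveform, asds, detectors)
```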

Finally, the `sample` dictionary of arrays is unpacked to a tuple of arrays for parameters and data.

```{eval-rst}
.. autoclass:: dingo.gw.transforms.UnpackDict
    :members:
```
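
A minimal sketch of the unpacking step. The class name `UnpackDictSketch` and the key names used here are hypothetical.

```python
import numpy as np


class UnpackDictSketch:
    """Return the selected dictionary entries as a tuple, in a fixed order."""

    def __init__(self, selected_keys):
        self.selected_keys = list(selected_keys)

    def __call__(self, sample):
        return tuple(sample[k] for k in self.selected_keys)


sample = {
    "inference_parameters": np.zeros(15, dtype=np.float32),
    "waveform": np.zeros((2, 3, 8), dtype=np.float32),
    "asds": {"H1": np.ones(8)},  # not selected, so dropped from the output
}
theta, data = UnpackDictSketch(["inference_parameters", "waveform"])(sample)
```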

When used with a torch `DataLoader`, the final numpy arrays are automatically transformed into torch tensors.


## Building the transforms

The following function will set the `transform` property of a `WaveformDataset` to the above transform sequence:

```{eval-rst}
.. autofunction:: dingo.gw.training.set_train_transforms
```

The various options are specified by passing an appropriate `data_settings` dictionary. In practice, these settings will be specified along with other [training settings](training).

```{code-block} yaml
---
caption: Sample `data_settings` dictionary for configuring a sequence of training transforms. This dictionary includes several options not needed for `set_train_transforms`, but which are needed as part of other training settings.
---
waveform_dataset_path: /path/to/waveform_dataset.hdf5  # Contains intrinsic waveforms
train_fraction: 0.95
window:  # Needed to calculate window factor for simulated data
  type: tukey
  f_s: 4096
  T: 8.0
  roll_off: 0.4
domain_update:
  f_min: 20.0
  f_max: 1024.0
svd_size_update: 200  # Optionally, reduce the SVD size when decompressing (for performance)
detectors:
  - H1
  - L1
extrinsic_prior:  # Sampled at train time
  dec: default
  ra: default
  geocent_time: bilby.core.prior.Uniform(minimum=-0.10, maximum=0.10)
  psi: default
  luminosity_distance: bilby.core.prior.Uniform(minimum=100.0, maximum=1000.0)
ref_time: 1126259462.391
gnpe_time_shifts:
  kernel: bilby.core.prior.Uniform(minimum=-0.001, maximum=0.001)
  exact_equiv: True
inference_parameters: default
```
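
As one illustration of how these settings are used, the `window` entry determines a window factor, taken here to be the mean of the squared window. The sketch below uses a NumPy implementation of a Tukey window. It assumes that `roll_off` is the duration in seconds of each tapered edge, so that the Tukey shape parameter is `alpha = 2 * roll_off / T`. Both the function names and this interpretation are assumptions rather than a description of Dingo's code.

```python
import numpy as np


def tukey_window(n_samples, alpha):
    """Symmetric Tukey (tapered cosine) window with taper fraction alpha > 0."""
    x = np.linspace(0.0, 1.0, n_samples)
    w = np.ones(n_samples)
    rise = x < alpha / 2.0
    fall = x > 1.0 - alpha / 2.0
    w[rise] = 0.5 * (1.0 - np.cos(2.0 * np.pi * x[rise] / alpha))
    w[fall] = 0.5 * (1.0 - np.cos(2.0 * np.pi * (1.0 - x[fall]) / alpha))
    return w


def window_factor(f_s, T, roll_off):
    """Mean squared value of the window over the segment."""
    alpha = 2.0 * roll_off / T
    w = tukey_window(int(f_s * T), alpha)
    return np.mean(w**2)


factor = window_factor(f_s=4096, T=8.0, roll_off=0.4)
```

For a Tukey window the continuum value is $1 - 5\alpha/8$, so with $\alpha = 0.1$ the factor is close to 0.9375.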

waveform_dataset_path
: Points to the waveform dataset.

train_fraction
: Fraction of waveform dataset to be used for training. The remainder are used to compute the test loss.

window
: Specifies the window function to use when FFTing the time-domain data. It is used here to calculate a window factor for simulating data. See the discussion [here](ref:window-factor).

domain_update (optional)
: Optionally specify new domain properties. These will update the domain associated to the `WaveformDataset`. They must necessarily describe a domain contained within the original.

svd_size_update (optional)
: If the `WaveformDataset` uses SVD compression, optionally use a smaller number of basis elements than stored in the dataset. Decompression of the waveforms is the slowest preprocessing operation, so using this option can improve training speed at the expense of accuracy.

detectors
: Set the desired GW interferometers for the Dingo model.

extrinsic_prior
: Specify the extrinsic prior. Default options are available.

ref_time
: Reference time for the interferometer locations and orientations. See the [note](ref:ref-time) above.

gnpe_time_shifts (optional)
: GNPE kernel and additional options. See [](gnpe.md).

inference_parameters
: Parameters to infer with the model. At present they must be a subset of `sample["parameters"]`. By specifying a strict subset, this can be used to marginalize over parameters. The `default` setting points to `dingo.gw.prior.default_inference_parameters`:

```python
import warnings
warnings.filterwarnings("ignore", "Wswiglal-redir-stdio")
import lal

from dingo.gw.prior import default_inference_parameters
default_inference_parameters
```

```text
['chirp_mass',
 'mass_ratio',
 'phase',
 'a_1',
 'a_2',
 'tilt_1',
 'tilt_2',
 'phi_12',
 'phi_jl',
 'theta_jn',
 'luminosity_distance',
 'geocent_time',
 'ra',
 'dec',
 'psi']
```