# Neural network architecture

Dingo supports several approaches to posterior estimation, including [neural posterior estimation](https://arxiv.org/abs/1605.06376) (NPE) and [flow matching posterior estimation](https://arxiv.org/pdf/2305.17161) (FMPE); see [here](sbi.md) for an introduction. A central object is the conditional neural density estimator, a deep neural network trained to represent the Bayesian posterior. This section describes the neural network architecture developed in {cite:p}`Dax:2021tsq`, and subsequently used in {cite:p}`Dax:2021myb`, {cite:p}`Dax:2022pxd` and {cite:p}`Wildberger:2022agw`.

## Neural spline flow with SVD compression
The NPE architecture consists of two components: the embedding network, which compresses the high-dimensional data to a lower-dimensional feature vector, and the conditional normalizing flow, which estimates the Bayesian posterior based on this feature vector. Both components are trained jointly and end-to-end with the objective described [here](sbi.md). The network can be built with

```python
from dingo.core.nn.nsf import create_nsf_with_rb_projection_embedding_net
```

### Embedding network
The embedding network compresses the high-dimensional conditioning information (consisting of frequency domain strain and PSD data). The first layer of this network is initialized with an [SVD](https://en.wikipedia.org/wiki/Singular_value_decomposition) matrix from a reduced basis built with noise-free waveforms. This projection filters out the noise that is orthogonal to the signal manifold, which significantly simplifies the task for the neural network.
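
To illustrate why projecting onto a reduced basis suppresses noise, the toy sketch below projects a noisy vector onto a small orthonormal basis. It is not Dingo's implementation. The basis is hand-written here, whereas in Dingo it comes from an SVD of noise-free waveforms, the data are real-valued for simplicity, and the helper names `project` and `reconstruct` are hypothetical.

```python
import math

# Toy orthonormal basis spanning a 2-dimensional "signal" subspace of R^4.
# Assumption: in practice these rows would come from an SVD of noise-free waveforms.
s = 1.0 / math.sqrt(2.0)
basis = [
    [s, s, 0.0, 0.0],
    [0.0, 0.0, s, s],
]


def project(data, basis):
    """Return the coefficients of `data` with respect to each basis vector."""
    return [sum(b_i * d_i for b_i, d_i in zip(b, data)) for b in basis]


def reconstruct(coeffs, basis):
    """Map reduced-basis coefficients back to the full space."""
    n = len(basis[0])
    return [sum(c * b[i] for c, b in zip(coeffs, basis)) for i in range(n)]


signal = [1.0, 1.0, 2.0, 2.0]    # lies in the span of the basis
noise = [0.3, -0.3, -0.1, 0.1]   # orthogonal to both basis vectors
data = [x + y for x, y in zip(signal, noise)]

coeffs = project(data, basis)            # compressed representation (2 numbers)
recovered = reconstruct(coeffs, basis)   # approximately equal to `signal`
print(coeffs, recovered)
```

In this toy case the noise is exactly orthogonal to the basis, so it is removed completely. Noise components that lie within the span of the basis would pass through the projection unchanged.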

The initial compression layer is followed by a sequence of residual blocks consisting of dense layers for further compression. Example kwargs:

```python
embedding_kwargs = {
    "input_dims": (2, 3, 8033),
    "output_dim": 128,
    "hidden_dims": [
        1024, 1024, 1024, 1024, 1024, 1024,
        512, 512, 512, 512, 512, 512,
        256, 256, 256, 256, 256, 256,
        128, 128, 128, 128, 128, 128,
    ],
    "activation": "elu",
    "dropout": 0.0,
    "batch_norm": True,
    "svd": {
        "num_training_samples": 50000,
        "num_validation_samples": 5000,
        "size": 200,
    },
}
```

Here, `input_dims=(2, 3, 8033)` specifies the input dimensions for frequency domain data with `2` detectors, `3` channels per detector (real part, imaginary part, ASD), and `8033` frequency bins. The embedding network compresses this to `output_dim=128` components. The SVD initialization is controlled with the `svd` argument, and the residual blocks are specified with `hidden_dims`.
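
As a quick sanity check on the size of the compression, the snippet below computes the total number of input values implied by `input_dims` and compares it with `output_dim`. This is plain arithmetic on the example values above and does not use any Dingo code.

```python
import math

input_dims = (2, 3, 8033)  # (detectors, channels, frequency bins)
output_dim = 128

# Total number of real values fed into the embedding network per event.
num_inputs = math.prod(input_dims)
compression_factor = num_inputs / output_dim

print(num_inputs)  # 48198
print(f"compression factor: {compression_factor:.1f}")
```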

```{note}
Not all of these arguments have to be set in the configuration file when training Dingo. For example, the `input_dims` argument is automatically filled in based on the specified domain information and number of detectors. Similarly, the `context_dim` of the flow (see below) is filled in based on the `output_dim` of the embedding network and the number of [GNPE](gnpe.md) proxies. See the [Dingo examples](https://github.com/dingo-gw/dingo/tree/main/examples) for the corresponding configuration files and training commands.
```

### Discrete Normalizing Flow
We use the [neural spline flow](https://arxiv.org/abs/1906.04032) as a density estimator. This takes the output of the embedding network as context information and estimates the Bayesian posterior distribution. Example kwargs:

```python
posterior_kwargs = {
    "input_dim": 15,
    "context_dim": 129,
    "num_flow_steps": 30,
    "base_transform_kwargs": {
        "hidden_dim": 512,
        "num_transform_blocks": 5,
        "activation": "elu",
        "dropout_probability": 0.0,
        "batch_norm": True,
        "num_bins": 8,
        "base_transform_type": "rq-coupling",
    },
}
```

This creates a neural spline flow over `input_dim=15` parameters, conditioned on a 129-dimensional context vector (`context_dim=129`), which corresponds to the 128-dimensional output of the embedding network plus one [GNPE](gnpe.md) proxy variable. The neural spline flow consists of `num_flow_steps=30` layers, each of which is specified with `base_transform_kwargs`.
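
The relation between `context_dim`, the embedding network's `output_dim`, and the number of GNPE proxies can be written out as a small sketch. The helper `fill_context_dim` is hypothetical and only illustrates the arithmetic; it is not the routine Dingo uses to complete the settings.

```python
def fill_context_dim(posterior_kwargs, embedding_kwargs, num_gnpe_proxies):
    """Return a copy of `posterior_kwargs` with `context_dim` filled in.

    Hypothetical helper: context_dim = embedding output_dim + number of GNPE proxies.
    """
    filled = dict(posterior_kwargs)
    filled["context_dim"] = embedding_kwargs["output_dim"] + num_gnpe_proxies
    return filled


filled = fill_context_dim({"input_dim": 15}, {"output_dim": 128}, num_gnpe_proxies=1)
print(filled["context_dim"])  # 129
```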

```python
nde = create_nsf_with_rb_projection_embedding_net(posterior_kwargs, embedding_kwargs)
```

### Continuous Normalizing Flow

Flow Matching Posterior Estimation (FMPE) utilizes continuous normalizing flows to represent posterior distributions. Instead of discrete mappings, FMPE models a velocity field that governs transformations over time, mapping a base distribution (e.g., standard normal) to the posterior.

We use a dense residual network to parameterize the velocity field, which takes as input the parameters $\theta$, time $t$, and context $d$. The model is trained by minimizing the flow matching loss, which ensures that the learned velocity field matches the optimal transport-inspired target velocity field.
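
The following pure-Python sketch shows one way to evaluate a single-sample flow matching loss. It assumes the optimal-transport Gaussian path from the flow matching literature, $\theta_t = t\,\theta_1 + (1 - (1 - \sigma_\text{min})\,t)\,\theta_0$ with target velocity $\theta_1 - (1 - \sigma_\text{min})\,\theta_0$, and it samples $t$ uniformly. The function names, the uniform time sampling, and the placeholder velocity field are assumptions made for illustration, not Dingo's implementation.

```python
import random


def sample_path_point(theta_1, t, sigma_min, rng):
    """Sample a point on the path from base noise to `theta_1` at time `t`.

    Returns (theta_t, target_velocity) for the assumed optimal-transport path.
    """
    theta_0 = [rng.gauss(0.0, 1.0) for _ in theta_1]  # base sample ~ N(0, 1)
    scale = 1.0 - (1.0 - sigma_min) * t               # noise scale at time t
    theta_t = [t * x1 + scale * x0 for x0, x1 in zip(theta_0, theta_1)]
    target = [x1 - (1.0 - sigma_min) * x0 for x0, x1 in zip(theta_0, theta_1)]
    return theta_t, target


def flow_matching_loss(velocity_fn, theta_1, context, sigma_min, rng):
    """Squared error between predicted and target velocity for one sample."""
    t = rng.random()  # assumption: t ~ Uniform[0, 1)
    theta_t, target = sample_path_point(theta_1, t, sigma_min, rng)
    predicted = velocity_fn(theta_t, t, context)
    return sum((p - u) ** 2 for p, u in zip(predicted, target))


def zero_velocity(theta, t, context):
    """Placeholder for the velocity network: always predicts zero."""
    return [0.0] * len(theta)


rng = random.Random(0)
loss = flow_matching_loss(zero_velocity, [0.5, -1.0], None, 1e-3, rng)
print(loss)
```

On this path, the sample at $t = 1$ is $\theta_1$ plus Gaussian noise with standard deviation $\sigma_\text{min}$, so $\sigma_\text{min}$ sets the width of the path at its endpoint. In training, the loss would be averaged over many $(\theta_1, d)$ pairs and times $t$.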

Example kwargs for an FMPE model:

```python
posterior_kwargs = {
    "activation": "gelu",
    "batch_norm": True,
    "context_with_glu": False,
    "dropout": 0.0,
    "hidden_dims": [
        1024, 1024, 1024, 1024, 1024, 1024,
        512, 512, 512, 512, 512, 512, 512, 512,
    ],
    "sigma_min": 0.001,
    "theta_embedding_kwargs": {
        "embedding_net": {
            "activation": "gelu",
            "hidden_dims": [16, 32, 64, 128, 256],
            "output_dim": 256,
            "type": "DenseResidualNet",
        },
        "encoding": {
            "encode_all": False,
            "frequencies": 0,
        },
    },
    "theta_with_glu": True,
    "time_prior_exponent": 1,
    "type": "DenseResidualNet",
}
```


### Explanation of Key Settings

- **`theta_embedding_kwargs`**:
  - Specifies how the parameter $\theta$ is embedded for input into the velocity network.
  - Contains settings for:
    - `embedding_net`: Defines the embedding network, such as a `DenseResidualNet`. This network transforms the input $\theta$ into a higher-dimensional representation for downstream processing. The `hidden_dims` control the architecture depth and complexity, while the `output_dim` specifies the dimensionality of the embedding.
    - `encoding`: Configures the positional encoding for $\theta$. The `encode_all` flag determines whether all components of $\theta$ or only a subset (e.g., the first component) are encoded. `frequencies` adjusts the sinusoidal encoding settings.

- **`hidden_dims`**:
  - Defines the layer sizes of the velocity network. In the example above, the widths decrease from 1024 to 512. Larger dimensions can better approximate complex velocity fields but increase computational cost.

- **`sigma_min`**:
  - Controls the minimum noise level in the interpolation path between the base distribution and the target posterior distribution. Values of `sigma_min` that are too low can lead to sharp trajectories that are harder for the velocity network to model accurately, potentially resulting in unstable training.

- **`theta_with_glu`**:
  - If `True`, applies Gated Linear Units (GLUs) for processing $\theta$. GLUs introduce an additional nonlinear gating mechanism, which can improve the network’s ability to model complex relationships between inputs.

- **`context_with_glu`**:
  - Similar to `theta_with_glu`, but applies GLUs to the context $d$. Useful for enhancing feature extraction when the context data has intricate dependencies.
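
The gating mechanism behind `theta_with_glu` and `context_with_glu` can be sketched in a few lines. A gated linear unit multiplies one set of values elementwise by the sigmoid of a second set (the gates). The sketch below is generic: how the gate inputs are produced from $\theta$ or $d$ inside Dingo's network is not shown, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def glu(values, gates):
    """Gated linear unit: elementwise `values * sigmoid(gates)`."""
    if len(values) != len(gates):
        raise ValueError("values and gates must have the same length")
    return [v * sigmoid(g) for v, g in zip(values, gates)]


hidden = [1.0, -2.0, 0.5]
# Assumption: in a network, the gates would come from a learned layer
# applied to theta or to the context.
gates = [10.0, 0.0, -10.0]

out = glu(hidden, gates)
print(out)
```

A large positive gate passes the value through almost unchanged, a gate of zero halves it, and a large negative gate suppresses it, which lets the conditioning input modulate each hidden feature individually.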
