### Notes for training 

* Adjust learning rate: ``batch_size`` / ``learning_rate`` = const. See [this blog](https://medium.com/deep-learning-experiments/effect-of-batch-size-on-neural-net-training-c5ae8516e57). Might have to do with normalisation details.
* H1 loss seems considerably better than L2 loss.
* Distance to center
* Optimal Kernel size? 60 better than 40 better than 20
* More data = good. Implement data loader for GPU? 
* Decide on features. 
  * Can have coordinates + derivatives + input, or natural coordinates + curvature.
  * Natural coordinates + curvature results in worse performance
  * Reparameterisation to uniform (not GL-points)
  * [DMD](https://arxiv.org/pdf/1409.6358.pdf)
  * <span style="color:red"> IDEA: Generalize our results to non-linear projection mappings.</span> Let $u$ be micro data and $U$ macro data.
    *   Instead of $\langle a(u), U\rangle$, generalize to $\sigma(\langle\pmb{a}_2(u), \sigma(\langle \pmb{a}_1(u), U\rangle)\rangle)$ or even
        $$
        z_{k+1} = z_k + \sigma(W_k(u)z_k + b_k(u))
        $$ 
    * Geo-FNO fits in this framework. Rescale FNO, but canonical arclength parameterisation means det() = 1.
* Uncertainty quantification
  * Sampling from posterior
    * HMC - Hamiltonian Monte Carlo
    * MH - Metropolis-Hastings
    * NUTS - No-U-Turn Sampler
    * Langevin dynamics
    * Robbins-Monro
    * MFVI - Mean-Field Variational inference
    * Monte Carlo Dropout, also during inference - obtain mean + std
    * Laplace approximation
    * Deep ensembles / snapshot ensembles (cyclic learning rate)
    * Stochastic Weight Averaging-Gaussian (SWAG)
  * Bayesian PINN - Bayesian neural networks
    * weights + biases follow a distribution
    * Gappy data: Gaussian process regression
    * Better to use Gaussian Process Regression for UQ
    * ELBO (evidence lower bound): maximising it minimises the KL divergence between the variational approximation and the true posterior
    * [Multi-fidelity](https://www.sciencedirect.com/science/article/pii/S0021999119307260?ref=pdf_download&fr=RR-2&rr=7dc5cafa9f9d09a3) Bayesian NN: high-fidelity + low-fidelity data, train low-fidelity NN
  * Meta-learning
    * "Learning to learn"
    * Model Agnostic Meta-Learning (MAML)
  * Functional Prior
    * Gaussian prior + historical observations -> generate functional priors
    * Functional prior + new observations -> posterior
    * Latent space | neural space | functional space
  * Uncertainty quantification in scientific machine learning (Karniadakis)
    * Neural UQ library / Hamiltonian Monte Carlo conceptual intro
    * Epistemic uncertainty vs aleatoric uncertainty (account for noise in data?)
* Architecture
  * PINN 
    * Main drawback: Retrain for each new problem.
    * Compute derivatives in forward mode.
    * Compute gradients of loss function in reverse mode.
    * Adversarial sampling / adversarial weighting
    * Self adaptive loss weights
    * Dynamic weights for PINN (weights vary over domain)
    * Hard constraints: compact fcn, periodic, rotation of scalar field
    * hp-VPINN (global NN, local weighting), VPINN, VarNet, D3M, conservative VPINN
    * XPINN, cPINN - domain decomposition
    * FPINN - fractional derivatives
    * SPINN - Pi-GAN, physics informed GAN
    * Separable PINN: tensor products
    * Sobolev training for PINN
    * ENG (energy natural gradient) [Marius](https://arxiv.org/pdf/2302.13163v1.pdf)
      * **Observations**
        * Gramian better with more points (smoother loss)
        * Works best for smooth landscape (struggles with sin(3x))
        * Inversion scales like $O(n^3)$
        * Gramian is poorly conditioned (condition number $\kappa > 10^{10}$)
        * Sparse (only a few entries have high curvature)
        * Converges such that $\nabla L_\Omega(\theta) = \nabla L_{\partial\Omega}(\theta)$. If $L_\Omega \ll L_{\partial\Omega}$, we're stuck here
      * **Methods**
        * Gauss-Seidel? Conjugate gradient? Coordinate Newton descent?
        * [Stochastic Conjugate Gradient](https://arxiv.org/pdf/1710.09979.pdf)
        * JAX uses a [LAPACK implementation of least squares](https://netlib.org/lapack/explore-html/d7/d3b/group__double_g_esolve_ga94bd4a63a6dacf523e25ff617719f752.html), based on [Householder](https://en.wikipedia.org/wiki/Singular_value_decomposition) transforms to find the singular values.
        * [Competitive gradient](https://arxiv.org/pdf/1905.12103.pdf) [Competitive PINN](https://arxiv.org/pdf/2204.11144.pdf)
        * Energy natural gradient is actually [Gauss-Newton](https://en.wikipedia.org/wiki/Gauss%E2%80%93Newton_algorithm), with an appropriate choice for the residuals.
        * Add diagonal to matrix: regularised Gauss-Newton [here 1](https://arxiv.org/pdf/2112.02089.pdf) [here 2, called damping](https://arxiv.org/pdf/2010.00879.pdf) can also be viewed as [Tikhonov regularisation](https://arxiv.org/pdf/1412.1193.pdf), or [Levenberg-Marquardt](https://github.com/google/jaxopt/blob/main/jaxopt/_src/levenberg_marquardt.py). [Gauss-Newton](https://arxiv.org/pdf/2306.08727.pdf) [Lev-Marq 2](https://arxiv.org/ftp/arxiv/papers/2111/2111.06060.pdf)
        * [Does Optimisation matter?](https://arxiv.org/pdf/2002.12642.pdf)
        * Raj: [optax](https://optax.readthedocs.io/en/latest/api.html?highlight=adam#adam)
          *  GPU: Memory should be saturated. CPU not as important. 
          * Use cpython profiler to check flop counts
    * FNO Karniadakis felt Honest
      * PINO  - Physics-informed neural operator.
  * Unrolled gradient
    * Feature maps that evaluate integral operator
    * Operations are too costly
  * [FNO](https://github.com/neuraloperator/neuraloperator/)
    * [Comparison to DeepONet by Karniadakis](https://arxiv.org/pdf/2111.05512.pdf)
    * Best performance so far. (40 modes with a lot of data)
    * Best result on low variability data with anchored parameterisation
    * **Relative loss to spread error - big difference**
    * **H1 Penalty - better generalisation**
    * Invariance/equivariance. [Clifford layers](https://arxiv.org/pdf/2209.04934.pdf)
      * FNO, curvature + natural coordinates + ReLU + no bias [invariant](https://proceedings.neurips.cc/paper_files/paper/2022/file/5aea56eefab60e06f35016478e21aae6-Paper-Conference.pdf) [invariant 2](https://arxiv.org/pdf/2006.15646.pdf)
      * Gives equivariance wrt rescale, rotate, and circular shift
      * **INVARIANT!** Use only normalised curvature + natural coordinates, can use bias + grelu. 
    * Decimate and train on low resolution data (multigrid). 
      * Same loss, likely because the frequencies are not well tuned initially.
      * V-cycle?
  * [Geo-FNO](https://arxiv.org/pdf/2207.05209.pdf) (geometry-aware FNO) 
  * FNOproj
    * Force linearity of map: map directly to eigen basis.
    * Performs well, not as well as FNO on low res data
  * U-FNO
    * Performs well with a lot of parameters
    * Too costly in comparison to FNO
  * FNO-conv?
    * Parallel: Spectral Ewald
    * Close range interactions with kernel
  * Ideas
    * "Spherical Convolutions"
    * "Normalizing inputs"
    * [Map to source field](https://academic.oup.com/gji/article/205/1/575/2594866?login=true)
    * "Equivalent maps"
    * Don't plot input vs output
    * [Linearly constrained](https://arxiv.org/pdf/2002.01600.pdf)
  * [Wavelet Neural Operator](https://github.com/TapasTripura/Wavelet-Neural-Operator-for-pdes/blob/main/wno_2d_AC.py), Article [here](https://arxiv.org/pdf/2205.02191.pdf)
  * [Deep M&M net](https://arxiv.org/pdf/2009.12935.pdf)
  * FNO-former
  * DeepONet
    * Branch net, trunk net. Recurrent for time.
    * **Better Feature map (Legendre, Fourier) -> better fit**, lower generalisation error.
      * Karhunen-Loeve (KL) expansion, expand process into "eigen modes"
      * Karhunen Loeve transform, Hotelling transform
    * Good at one-shot, important to give right data distribution
    * DeepONet transfer learning / domain adaptation
      * Multiple domains -> change the basis functions of branch net
  * [ViTO](https://arxiv.org/pdf/2303.08891.pdf)
    * U-Net architecture with vision transformer at the bottleneck
    * Performs on par with FNO on other data.
  * [DiTTO](https://github.com/lucidrains/DALLE-pytorch/discussions/375)
    * Diffusion-Type Transformer Operator
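
The learning-rate note at the top (``batch_size`` / ``learning_rate`` = const) can be sketched as a one-line helper. This is only an illustration of that rule: the function name `scaled_learning_rate` and the reference values below are hypothetical, and whether the linear rule holds for a given optimiser and normalisation scheme is an open question in these notes.

```python
def scaled_learning_rate(base_lr, base_batch_size, new_batch_size):
    """Rescale the learning rate so that batch_size / learning_rate stays constant.

    Hypothetical helper: base_lr and base_batch_size are a reference setting
    that is assumed to train well already.
    """
    if base_batch_size <= 0 or new_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * new_batch_size / base_batch_size


# Example: a reference run with batch size 32 and learning rate 1e-3,
# moved to batch size 128.
new_lr = scaled_learning_rate(1e-3, 32, 128)
print("learning rate for batch size 128:", new_lr)
```

Quadrupling the batch size quadruples the learning rate, so the ratio between the two is unchanged.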
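
The notes on relative loss and the H1 penalty can be made concrete with a small dependency-free sketch. It assumes a uniform periodic grid and central differences for the derivative; the function names are hypothetical and this is not necessarily the loss used in the experiments above. It only shows how a relative H1 norm weights high-frequency error more heavily than a relative L2 norm.

```python
import math


def periodic_derivative(values, spacing):
    """Central difference on a uniform periodic grid (values[-1] wraps around)."""
    n = len(values)
    return [(values[(i + 1) % n] - values[i - 1]) / (2.0 * spacing) for i in range(n)]


def squared_norm(values):
    return sum(v * v for v in values)


def relative_l2_loss(pred, target):
    """||pred - target||_2 / ||target||_2."""
    diff = [p - t for p, t in zip(pred, target)]
    return math.sqrt(squared_norm(diff) / squared_norm(target))


def relative_h1_loss(pred, target, spacing):
    """Relative H1 error: function values plus first derivatives."""
    diff = [p - t for p, t in zip(pred, target)]
    numerator = squared_norm(diff) + squared_norm(periodic_derivative(diff, spacing))
    denominator = squared_norm(target) + squared_norm(periodic_derivative(target, spacing))
    return math.sqrt(numerator / denominator)


# Toy data: the prediction carries a small high-frequency error.
n = 64
spacing = 2.0 * math.pi / n
grid = [i * spacing for i in range(n)]
target = [math.sin(x) for x in grid]
pred = [math.sin(x) + 0.1 * math.sin(3.0 * x) for x in grid]

print("relative L2:", relative_l2_loss(pred, target))
print("relative H1:", relative_h1_loss(pred, target, spacing))
```

On this toy example the H1 value comes out larger than the L2 value, because the derivative term amplifies the error in the third mode.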
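
The "add diagonal to matrix" note under ENG can be illustrated with a damped Gauss-Newton (Levenberg-Marquardt style) loop on a toy two-parameter fit. This is a minimal sketch under stated assumptions, not the ENG implementation: the model `a * exp(b * x)`, the data, the starting point, and the factor-of-10 damping schedule are all made up for illustration, and the 2x2 system is solved by hand to keep the example dependency-free.

```python
import math


def residuals(theta, xs, ys):
    a, b = theta
    return [a * math.exp(b * x) - y for x, y in zip(xs, ys)]


def jacobian(theta, xs):
    """Rows are (d r_i / d a, d r_i / d b)."""
    a, b = theta
    return [(math.exp(b * x), a * x * math.exp(b * x)) for x in xs]


def loss(theta, xs, ys):
    return 0.5 * sum(r * r for r in residuals(theta, xs, ys))


def damped_gauss_newton_step(theta, xs, ys, damping):
    """Solve (J^T J + damping * I) d = J^T r and return theta - d."""
    r = residuals(theta, xs, ys)
    jac = jacobian(theta, xs)
    g00 = sum(j[0] * j[0] for j in jac) + damping  # damping on the diagonal
    g01 = sum(j[0] * j[1] for j in jac)
    g11 = sum(j[1] * j[1] for j in jac) + damping
    rhs0 = sum(j[0] * ri for j, ri in zip(jac, r))
    rhs1 = sum(j[1] * ri for j, ri in zip(jac, r))
    det = g00 * g11 - g01 * g01  # positive because damping > 0
    d0 = (g11 * rhs0 - g01 * rhs1) / det
    d1 = (g00 * rhs1 - g01 * rhs0) / det
    return (theta[0] - d0, theta[1] - d1)


def fit(theta, xs, ys, damping=1e-3, iterations=50):
    """Accept a step only if the loss decreases; otherwise increase the damping."""
    current = loss(theta, xs, ys)
    for _ in range(iterations):
        trial = damped_gauss_newton_step(theta, xs, ys, damping)
        trial_loss = loss(trial, xs, ys)
        if trial_loss < current:
            theta, current = trial, trial_loss
            damping = max(damping / 10.0, 1e-12)
        else:
            damping *= 10.0
    return theta, current


# Noise-free toy data generated from a = 2, b = 0.5 (illustrative values).
xs = [i / 10.0 for i in range(11)]
ys = [2.0 * math.exp(0.5 * x) for x in xs]
start = (1.5, 0.3)
initial_loss = loss(start, xs, ys)
theta, final_loss = fit(start, xs, ys)
print("theta:", theta, "loss:", final_loss)
```

With a large damping the step shrinks towards a short gradient-descent step, and with a small damping it approaches plain Gauss-Newton, which is why the same diagonal term can be read as Tikhonov regularisation of the Gramian.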