In [1]:
# ============================================================
# Notebook setup: run this before everything
# ============================================================

%load_ext autoreload
%autoreload 2

# Control figure size
figsize=(14, 4)

from util import util

# From UDEs to PINNs

## UDEs and Similar Approaches

**UDEs provide a good starting point for _two more approaches_**

If you keep the connection to physics, but you _relax the ODE mechanism_

* ...Then you get _Physics Informed Neural Networks_
  - Technically, UDEs can be considered PINNs
  - ...But the term typically refers to approaches such as those surveyed [here](https://link.springer.com/article/10.1007/s10915-022-01939-z)

If you keep the ODE mechanism, but you _drop the connection to physics_

* ...Then you get _Neural Ordinary Differential Equations_
  - This was one of the first approaches to integrate NNs and differential equations
  - The seminal paper is [publicly available](https://proceedings.neurips.cc/paper/2018/hash/69386f6bb1dfed68692a24c8686939b9-Abstract.html)
  
**We are going to briefly outline the former approach**

## From UDEs to PINNs

**Let's start by recapping how UDEs work**

At _inference time_, we (typically) integrate an initial value problem:

$$\begin{align}
& \dot{\hat{y}} = f(\hat{y}, t, U(\hat{y}, t, \theta)) \\
& \hat{y}(t_0) = y_0
\end{align}$$

At _training time_, we solve:

$$\begin{align}
\text{argmin}_\theta\ & L(\hat{y}(t), y) \\
& \dot{\hat{y}} = f(\hat{y}, t, U(\hat{y}, t, \theta)) \\
& \hat{y}(t_0) = y_0
\end{align}$$

* Which requires embedding ODE integration in gradient descent
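
To make the cost of this nested structure concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the dynamics `f`, the one-parameter term `U` (a stand-in for a neural network), forward Euler as the integrator, and finite differences in place of automatic differentiation. The only point is that every loss evaluation requires a full integration pass.

```python
import math

def f(y, t, u):
    # Known physics (-y) plus the learned correction u (hypothetical dynamics)
    return -y + u

def U(y, t, theta):
    # Stand-in for a neural network: a single learnable gain (assumption)
    return theta * y

def integrate(theta, y0, ts):
    # Forward Euler over the time grid ts; returns the trajectory on that grid
    ys = [y0]
    for t_prev, t_next in zip(ts[:-1], ts[1:]):
        y = ys[-1]
        ys.append(y + (t_next - t_prev) * f(y, t_prev, U(y, t_prev, theta)))
    return ys

def loss(theta, y0, ts, y_obs):
    # Mean squared error: one call means one full ODE integration
    y_hat = integrate(theta, y0, ts)
    return sum((a - b) ** 2 for a, b in zip(y_hat, y_obs)) / len(ts)

# Synthetic measurements generated with theta = 0.5, i.e. y(t) = exp(-0.5 t)
ts = [i / 100 for i in range(101)]
y_obs = [math.exp(-0.5 * t) for t in ts]

theta, lr, eps = 0.0, 1.0, 1e-4
for _ in range(200):
    # Central finite difference: two more integrations per gradient step
    grad = (loss(theta + eps, 1.0, ts, y_obs)
            - loss(theta - eps, 1.0, ts, y_obs)) / (2 * eps)
    theta -= lr * grad

print(f"estimated theta: {theta:.3f}")
```

In a real UDE the finite-difference step would be replaced by differentiating through the solver, but the integration inside each training step remains.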

## From UDEs to PINNs

**What if we tried to simplify the inference process?**

For example, we could use a NN to _approximate $y(t)$ itself_

$$
\hat{y}(t; \theta) \simeq y(t)
$$

**This approach has several immediate benefits:**

* Inference becomes as efficient as evaluating $\hat{y}(t; \theta)$
  - No need to integrate anything, linear scalability w.r.t. the sampling points
* Handling PDEs also becomes pretty simple
  - We just need to use a multivariate $t$

**But where is physics here?**

## Training in PINNs

**The ODE is taken into account _at training time_**

Superficially, the training problem is similar to the UDE one:

$$\begin{align}
\text{argmin}_\theta\ & L(\hat{y}(t; \theta), y) \\
& \dot{\hat{y}}(t; \theta) = f(\hat{y}(t; \theta), t) \\
& \hat{y}(t_0; \theta) = y_0
\end{align}$$

...But in fact, the situation is very different:

* Since both $\hat{y}(t; \theta)$ and $\dot{\hat{y}}(t; \theta)$ need to be learned
* ...Classical ODE integration methods are no longer viable

**PINNs circumvent this issue by _using NN training for ODE integration_**

## Training in PINNs

**In particular, we can apply a Lagrangian relaxation to the problem**

We relax the constraints in the previous formulation so that we obtain:

$$\begin{align}
\mathcal{L}(y, \hat{y}, t, \theta) & = L(\hat{y}(t; \theta), y) \\
                          & + \lambda_{de}^T \|\dot{\hat{y}}(t; \theta) - f(\hat{y}(t; \theta), t)\|_2^2 \\
                          & + \lambda_{bc}^T \|\hat{y}(t_0; \theta) - y_0\|_2^2
\end{align}$$

In optimization, this is called a _Lagrangian_

* Besides the original loss $L$
* ...There is a penalty term linked to the ODE, with weights (multipliers) $\lambda_{de}^T$
* ...And a penalty term linked to the initial value, with multipliers $\lambda_{bc}^T$
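
The three terms can be evaluated directly once a candidate $\hat{y}$ and its time derivative are available. The sketch below is only illustrative: the function name `relaxed_loss`, the scalar multipliers, the averaging over points, and the test ODE $\dot{y} = -y$ with $y(0) = 1$ are all assumptions made here.

```python
import math

def relaxed_loss(y_hat, dy_hat, f, data, ts_col, t0, y0, lam_de=1.0, lam_bc=1.0):
    # Data term: mean squared error on the (t, y) measurements
    data_term = sum((y_hat(t) - y) ** 2 for t, y in data) / len(data)
    # ODE term: squared residual of y' = f(y, t), averaged over the sampled points
    de_term = sum((dy_hat(t) - f(y_hat(t), t)) ** 2 for t in ts_col) / len(ts_col)
    # Initial-value term
    bc_term = (y_hat(t0) - y0) ** 2
    total = data_term + lam_de * de_term + lam_bc * bc_term
    return total, (data_term, de_term, bc_term)

f = lambda y, t: -y                               # right-hand side of the ODE
data = [(t, math.exp(-t)) for t in (0.5, 1.0)]    # two noiseless measurements
ts_col = [0.0, 0.25, 0.5, 0.75, 1.0]              # points where the ODE is checked

# Candidate 1: the exact solution, so every term vanishes
exact = relaxed_loss(lambda t: math.exp(-t), lambda t: -math.exp(-t),
                     f, data, ts_col, 0.0, 1.0)
# Candidate 2: a straight line that matches y(0) = 1 but violates the ODE
linear = relaxed_loss(lambda t: 1.0 - t, lambda t: -1.0,
                      f, data, ts_col, 0.0, 1.0)

print("exact :", exact)
print("linear:", linear)
```

The straight line satisfies the initial value exactly, so only its data and ODE terms are non-zero: the penalties measure different kinds of violation independently.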

## Training in PINNs

**The approach can be generalized**

* In particular we can take into account both ODEs and PDEs
* ...And we can use different types of penalizers

**We just need to make the formulation a bit more abstract:**

$$\begin{align}
\mathcal{L}(y, \hat{y}, t, \theta) & = L(\hat{y}(t; \theta), y) \\
                          & + \lambda_{de}^T L_{de}(F(\hat{y}, t; \theta)) \\
                          & + \lambda_{bc}^T L_{bc}(B(\hat{y}, t; \theta))
\end{align}$$


* Where $F(y, t; \theta) = 0$ defines the original ordinary or partial DE
* ...And $B(y, t; \theta) = 0$ defines the original initial or boundary conditions
* The $L_{de}$ and $L_{bc}$ terms can be L2 norms, but also other types of penalizers
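
As a concrete instance (the heat equation is an illustrative choice made here, with a hypothetical diffusivity $\alpha$ and initial profile $y_0(x)$), take the input $t = (\tau, x)$ with time $\tau$ and position $x \in [0, 1]$. The two abstract operators could then read:

$$\begin{align}
F(\hat{y}, t; \theta) & = \frac{\partial \hat{y}}{\partial \tau}(\tau, x; \theta) - \alpha \frac{\partial^2 \hat{y}}{\partial x^2}(\tau, x; \theta) \\
B(\hat{y}, t; \theta) & = \begin{pmatrix} \hat{y}(0, x; \theta) - y_0(x) \\ \hat{y}(\tau, 0; \theta) \\ \hat{y}(\tau, 1; \theta) \end{pmatrix}
\end{align}$$

Here the first component of $B$ encodes the initial condition and the other two encode zero-valued boundary conditions at $x = 0$ and $x = 1$. Choosing instead $F = \dot{\hat{y}}(t; \theta) - f(\hat{y}(t; \theta), t)$ and $B = \hat{y}(t_0; \theta) - y_0$, with squared L2 norms for $L_{de}$ and $L_{bc}$, recovers the earlier ODE formulation.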

## Training in PINNs

**Then we train by:**

* Sampling points $\{t_i\}_{i=1}^n$ in the input space
* Choosing $\theta$ so as to minimize the sum of Lagrangians

$$\begin{align}
\text{argmin}_\theta \sum_{i=1}^n \mathcal{L}(y, \hat{y}, t_i, \theta)
\end{align}$$

We can employ gradient descent, as usual

**Again, there is no need to use ODE/PDE integration at training time**

...Because _training is the integration process_

* In fact, it is possible to drop the data-based loss $L$ and the approach still works
* In such a case, PINNs can act as _approximate_ ODE/PDE integrators
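
The idea that training replaces integration can be checked on a toy problem. The sketch below drops the data term and fits a model to $\dot{y} = -y$, $y(0) = 1$ using only the two penalties. To stay dependency-free it makes several assumptions that are not part of the original material: a cubic polynomial stands in for the neural network, the gradient is written by hand, and the multipliers are fixed to 1.

```python
import math

DEGREE = 3  # hypothetical model: y_hat(t) = sum_k theta[k] * t**k

def y_hat(t, theta):
    return sum(c * t ** k for k, c in enumerate(theta))

def dy_hat(t, theta):
    # Exact time derivative of the polynomial model
    return sum(k * c * t ** (k - 1) for k, c in enumerate(theta) if k > 0)

def loss_and_grad(theta, ts, lam_bc=1.0):
    # Physics-only loss for y' = -y, y(0) = 1 (no data term)
    n = len(ts)
    loss = 0.0
    grad = [0.0] * len(theta)
    for t in ts:
        r = dy_hat(t, theta) + y_hat(t, theta)  # ODE residual y' - f, with f = -y
        loss += r ** 2 / n
        for k in range(len(theta)):
            # Derivative of the residual w.r.t. theta[k]: k t^(k-1) + t^k
            dr = (k * t ** (k - 1) if k > 0 else 0.0) + t ** k
            grad[k] += 2 * r * dr / n
    b = y_hat(0.0, theta) - 1.0                 # initial-value violation
    loss += lam_bc * b ** 2
    grad[0] += 2 * lam_bc * b                   # since y_hat(0) = theta[0]
    return loss, grad

ts = [i / 20 for i in range(21)]                # sampled points in [0, 1]
theta = [0.0] * (DEGREE + 1)
initial_loss, _ = loss_and_grad(theta, ts)
for _ in range(5000):                           # plain gradient descent
    _, grad = loss_and_grad(theta, ts)
    theta = [c - 0.05 * g for c, g in zip(theta, grad)]
final_loss, _ = loss_and_grad(theta, ts)

print(f"loss: {initial_loss:.4f} -> {final_loss:.6f}")
print(f"y_hat(1) = {y_hat(1.0, theta):.4f}, exp(-1) = {math.exp(-1):.4f}")
```

No ODE solver is called anywhere: the trained model itself is the approximate solution on the sampled interval.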

## No Free Lunch

**In the above description, it's easy to miss an important point**

Let's consider again the DE-based components in the Lagrangian:

$$
L_{de}(F(\hat{y}, t; \theta)) \quad \text{ which could be e.g. } \quad \|\color{red}{\dot{\hat{y}}(t; \theta)} - f(\hat{y}(t; \theta), t)\|_2^2
$$

* The penalizer contains _derivatives_ (possibly partial)
* ...And it should provide a contribution for gradient descent

**This means that we need a way to compute the components of $\dot{\hat{y}}$**

...So that we obtain an expression that is _again differentiable in $\theta$_

* This can be a bit tricky in practice!
* Viable approaches include exact differentiation (manual, symbolic, or automatic)
* ...Or partially numeric methods such as finite differences, etc.
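
The two families of approaches can be compared on a tiny model. The sketch below assumes a hypothetical network with one hidden `tanh` layer and arbitrary hand-picked weights. It computes $\dot{\hat{y}}$ once in closed form via the chain rule and once by central finite differences.

```python
import math

def net(t, params):
    # One hidden tanh layer, scalar input and output (hypothetical tiny model)
    w1, b1, w2, b2 = params
    return sum(v * math.tanh(w * t + b) for w, b, v in zip(w1, b1, w2)) + b2

def net_dt(t, params):
    # Exact derivative w.r.t. t via the chain rule: d tanh(z)/dz = 1 - tanh(z)^2
    w1, b1, w2, _ = params
    return sum(v * w * (1.0 - math.tanh(w * t + b) ** 2)
               for w, b, v in zip(w1, b1, w2))

def net_dt_fd(t, params, h=1e-5):
    # Central finite difference: simpler, but approximate and sensitive to h
    return (net(t + h, params) - net(t - h, params)) / (2 * h)

# Arbitrary weights, chosen only for the comparison
params = ([1.5, -0.7, 0.3], [0.1, 0.4, -0.2], [0.8, -1.1, 0.5], 0.05)
for t in (0.0, 0.5, 1.0):
    print(t, net_dt(t, params), net_dt_fd(t, params))
```

Note that `net_dt` is itself a smooth expression of the weights, so it can be differentiated again with respect to $\theta$ during training. Automatic differentiation frameworks produce this kind of expression without manual derivation.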

## No Free Lunch

**Moreover, assigning a value to the multipliers is not trivial**

These are the "weight" vectors $\lambda_{de}$ and $\lambda_{bc}$

$$\begin{align}
\mathcal{L}(y, \hat{y}, t, \theta) = L(\hat{y}(t; \theta), y)
                            + \color{red}{\lambda_{de}^T} L_{de}(F(\hat{y}, t;\theta))
                          + \color{red}{\lambda_{bc}^T} L_{bc}(B(\hat{y}, t; \theta))
\end{align}$$

Finding a good balance might be very tricky

* A good alternative might be using dual ascent
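
The mechanics of dual ascent can be seen on a one-dimensional toy problem. This sketch is an assumption-laden illustration rather than a PINN recipe: it uses a plain Lagrangian with a linear constraint term, and the inner minimization is solved in closed form. In a PINN the inner step would instead be a few gradient descent updates on $\theta$.

```python
def dual_ascent(steps=50, alpha=1.0):
    # Toy problem (assumption): minimize (x - 2)^2 subject to x - 1 = 0
    lam = 0.0
    for _ in range(steps):
        # Inner step: minimize the Lagrangian (x - 2)^2 + lam * (x - 1) over x
        x = 2.0 - lam / 2.0
        # Outer step: move the multiplier along the constraint violation
        lam += alpha * (x - 1.0)
    return x, lam

x, lam = dual_ascent()
print(f"x = {x:.4f}, lambda = {lam:.4f}")
```

The multiplier grows while the constraint is violated and stops changing once it is satisfied, so the balance is adjusted during training rather than being fixed in advance.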

**Additionally, boundary conditions are _incorporated at training time_**

...So, if they change, we need to _repeat training_

* In some contexts, this can be a major problem

## No Free Lunch

**Finally, one should be careful with the problem semantics**

Let's consider, for a given input vector $t$, the penalty term for the DE constraint:

$$
\|\dot{\hat{y}}(t; \theta) - f(\hat{y}(t; \theta), t)\|_2^2
$$

**The constraint is enforced in a _soft_ fashion**

...Meaning that it might be violated

* Proper weight calibration can help, but violations will still typically occur

**Even if we manage exact satisfaction**

...The constraint will hold only locally, for the specified $t$ values

* When we move away from the $t$ values considered at training time
* ...The NN may behave inconsistently with the underlying physics

## Some Remarks

**Let's conclude with some differences between mainstream PINNs and UDEs**

Unlike UDEs, PINNs need to _learn the involved physics_

* It might be necessary to use larger networks
* ...Because they will need to learn a more complex relation

The DE constraints are only _approximately satisfied_

* UDEs instead provide full guarantees
* ...But approximate satisfaction might be good if the DE is not fully reliable

PINNs do not rely on DE integration: they _are_ an integration method

* This makes them faster than UDEs at inference and possibly training time
* ...But don't forget that changing the boundary conditions requires retraining!

## Some Remarks

**If you are looking for additional information**

* There's a very well done [PyTorch library for PINNs](https://github.com/mathLab/PINA)
* For UDEs and Neural ODEs, a well-known differential equation solver library is [available for JAX](https://docs.kidger.site/diffrax/)
* The PINN idea can be generalized, leading to [Neural Operators](https://arxiv.org/pdf/2010.08895)
* ...Which map boundary conditions directly into solutions of the differential equation