# Neural ODEs

As in the [Neural Hamiltonian ODE](<./NeuralHamiltonianODEs.ipynb>) example, we consider a generic system of the form:

$$
\dot {\mathbf x} = \mathbf f(\mathbf x, \mathcal N_\theta(\mathbf x))
$$

whose solution is denoted by $\mathbf x(t; \mathbf x_0, \theta)$ to make explicit the dependence on the initial conditions $\mathbf x_0$ and the network parameters $\theta$.
We refer to these systems as Neural ODEs, a term popularized by the paper:

*Chen, Ricky TQ, Yulia Rubanova, Jesse Bettencourt, and David K. Duvenaud.* "Neural ordinary differential equations." Advances in neural information processing systems 31 (2018).

where it indicated a specific form of the above equation, in which the state represents the neuronal activity of a neural network. Here we depart from that terminology and use the term for any ODE whose right-hand side contains an artificial neural network. All the cases illustrated in the [Neural Hamiltonian ODE](<./NeuralHamiltonianODEs.ipynb>) example are, therefore, also special cases of Neural ODEs.

Whenever we have a Neural ODE, it is important to be able to define a training pipeline that changes the network parameters $\theta$ so as to make some loss decrease.

We indicate such a loss with $\mathcal L(\mathbf x(t; \mathbf x_0, \theta))$ and show in this example how to compute its gradient using *heyoka*, and hence how to set up a training pipeline for Neural ODEs.

In [1]:
# The usual main imports
import heyoka as hy
import numpy as np
import time

%matplotlib inline
import matplotlib.pyplot as plt

The gradients we seek can be written as:

$$
\begin{array}{l}
\frac{\partial \mathcal L}{\partial \mathbf x_0} =  \frac{\partial \mathbf x}{\partial \mathbf x_0} \frac{\partial \mathcal L}{\partial \mathbf x}\\
\frac{\partial \mathcal L}{\partial \theta} = \frac{\partial \mathbf x}{\partial \theta} \frac{\partial \mathcal L}{\partial \mathbf x}
\end{array}
$$

In the expressions above we know the functional form of $\mathcal L$, and hence its derivatives w.r.t. $\mathbf x$. We thus need to compute the remaining terms, i.e. the ODE sensitivities $\mathbf \Phi = \frac{\partial \mathbf x(t)}{\partial \mathbf x_0}$ and $\boldsymbol \varphi = \frac{\partial \mathbf x(t)}{\partial \boldsymbol \theta}$.
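As a sanity check of the chain rule, consider a scalar toy problem, $\dot x = \theta x$, whose solution $x(t) = x_0 e^{\theta t}$ gives both sensitivities in closed form. The toy dynamics, the quadratic loss and the names `flow`, `loss` and `target` are assumptions made only for illustration; they are not the network used in this notebook. The sketch assembles the two gradients and compares them with central finite differences.

```python
import numpy as np

# Toy problem (illustrative assumption): dx/dt = theta * x, so x(t) = x0 * exp(theta * t).
def flow(x0, theta, t):
    return x0 * np.exp(theta * t)

# Hypothetical loss: squared distance of the final state from a target value.
target = 2.0
def loss(x):
    return 0.5 * (x - target) ** 2

x0, theta, T = 1.1, 0.3, 2.0
xT = flow(x0, theta, T)

# Closed-form sensitivities of the flow.
Phi = np.exp(theta * T)              # dx(T)/dx0
varphi = x0 * T * np.exp(theta * T)  # dx(T)/dtheta

# Chain rule: dL/dx follows from the functional form of the loss.
dLdx = xT - target
grad_x0 = Phi * dLdx
grad_theta = varphi * dLdx

# Central finite differences as an independent check.
h = 1e-6
fd_x0 = (loss(flow(x0 + h, theta, T)) - loss(flow(x0 - h, theta, T))) / (2 * h)
fd_theta = (loss(flow(x0, theta + h, T)) - loss(flow(x0, theta - h, T))) / (2 * h)
print(abs(grad_x0 - fd_x0), abs(grad_theta - fd_theta))
```

Both differences should be at the level of the finite-difference error.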

```{note}

The ODE sensitivities can be computed with two methods: the variational equations and the adjoint method. Both compute the same quantities, and we shall see that they are ultimately two versions of the same reasoning, leading to algorithms of similar complexity, contrary to what is sometimes believed or reported in the scientific literature.
```

For the sake of clarity we here consider a system in the simplified form:

$$
\dot {\mathbf x} = \mathcal N_\theta(\mathbf x)
$$

The r.h.s. is a feed-forward neural network, and we use the *heyoka* factory function `ffnn()` to instantiate it:

In [2]:
# We create the symbols for the network inputs (two in this first simple case)
state = hy.make_vars("x", "y")

# An identity (linear) activation, defined for reference; the network below uses tanh on every layer
linear = lambda inp: inp

# We call the factory to construct the FFNN:
ffnn = hy.model.ffnn(inputs = state, nn_hidden = [5,5,5], n_out = 2, activations = [hy.tanh, hy.tanh, hy.tanh, hy.tanh])
print(ffnn)

[tanh((p85 + (p60 * tanh((p80 + (p35 * tanh((p75 + (p10 * tanh((p70 + (p0 * x) + (p1 * y)))) + (p11 * tanh((p71 + (p2 * x) + (p3 * y)))) + (p12 * tanh((p72 + (p4 * x) + (p5 * y)))) + (p13 * tanh((p73 + (p6 * x) + (p7 * y)))) + (p14 * tanh((p74 + (p8 * x) + (p9 * y))))))) + (p36 * tanh((p76 + (p15 * tanh((p70 + (p0 * x) + (p1 * y)))) + (p16 * tanh((p71 + (p2 * x) + (p3 * y)))) + (p17 * tanh((p72 + (p4 * x) + (p5 * y)))) + (p18 * tanh((p73 + (p6 * x) + (p7 * y)))) + (p19 * tanh((p74 + (p8 * x) + (p9 * y))))))) + (p37 * tanh((p77 + (p20 * tanh((p70 + (p0 * x) + (p1 * y)))) + (p21 * tanh((p71 + (p2 * x) + (p3 * y)))) + (p22 * tanh((p72 + (p4 * x) + (p5 * y)))) + (p23 * tanh((p73 + (p6 * x) + (p7 * y)))) + (p24 * tanh((p74 + (p8 * x) + (p9 * y))))))) + (p38 * tanh((p78 + (p25 * tanh((p70 + (p0 * x) + (p1 * y)))) + (p26 * tanh((p71 + (p2 * x) + (p3 * y)))) + (p27 * tanh((p72 + (p4 * x) + (p5 * y)))) + (p28 * tanh((p73 + (p6 * x) + (p7 * y)))) + (p29 * tanh((p74 + (p8 * x) + (p9 * y))))))) + 
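The printed expression refers to the network parameters as `p0`, `p1`, and so on. The total number of parameters follows from the layer sizes alone, and can be counted with a small helper. The function name `count_ffnn_params` is hypothetical, and the assumption is a fully connected network with one bias per neuron, which is consistent with the expression printed above.

```python
def count_ffnn_params(n_in, nn_hidden, n_out):
    """Count weights and biases of a fully connected feed-forward network."""
    sizes = [n_in] + list(nn_hidden) + [n_out]
    # One weight per connection between consecutive layers.
    n_weights = sum(a * b for a, b in zip(sizes[:-1], sizes[1:]))
    # One bias per neuron in every layer except the input layer.
    n_biases = sum(sizes[1:])
    return n_weights, n_biases

# Same architecture as above: 2 inputs, three hidden layers of 5 neurons, 2 outputs.
n_weights, n_biases = count_ffnn_params(2, [5, 5, 5], 2)
print(n_weights, n_biases, n_weights + n_biases)  # 70 17 87
```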

## The Variational Equations
As already derived in the examples dedicated to the [variational equations](<./The Variational equations.ipynb>) and to the [Periodic orbits in the CR3BP](<./The Variational equations.ipynb>), the ODE sensitivities can be computed from the differential equations:

$$
 \frac{d\mathbf \Phi}{dt} = \nabla_\mathbf x \mathcal N_\theta(\mathbf x) \cdot \mathbf \Phi \qquad (n,n) = (n,n) (n,n)
$$

and

$$
\frac{d\boldsymbol \varphi}{dt} = \nabla_\mathbf x \mathcal N_\theta(\mathbf x) \cdot \boldsymbol \varphi + \frac{\partial \mathcal N_\theta(\mathbf x)}{\partial \boldsymbol \theta} \qquad (n,N) = (n,n) (n,N) + (n,N) 
$$
where we have reported also the dimensions of the various terms for clarity: $n$ is the system dimension (2 in our case) and $N$ the number of parameters (87 in our case).
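To make the two equations concrete, here is a minimal NumPy sketch that writes them out by hand for a much smaller network, $\mathcal N_\theta(\mathbf x) = \tanh(\mathbf W \mathbf x + \mathbf b)$ with $n = 2$ and $N = 6$, and integrates them with a fixed-step RK4 scheme. The single-layer network, the parameter layout, the random parameter values and all function names are assumptions made for illustration, and the integrator is a stand-in rather than a Taylor method. The result is checked against central finite differences.

```python
import numpy as np

n = 2  # state dimension

def unpack(theta):
    # Assumed layout: n*n weights (row-major) followed by n biases.
    return theta[: n * n].reshape(n, n), theta[n * n :]

def net(x, theta):
    W, b = unpack(theta)
    return np.tanh(W @ x + b)

def jacobians(x, theta):
    W, b = unpack(theta)
    s = 1.0 - np.tanh(W @ x + b) ** 2      # tanh'(z), shape (n,)
    dNdx = s[:, None] * W                  # (n, n): dN_i/dx_j = s_i W_ij
    dNdW = np.zeros((n, n * n))
    for i in range(n):
        dNdW[i, i * n : (i + 1) * n] = s[i] * x   # dN_i/dW_ij = s_i x_j
    dNdtheta = np.hstack([dNdW, np.diag(s)])      # (n, N): dN_i/db_i = s_i
    return dNdx, dNdtheta

def rhs(y, theta):
    N = theta.size
    x = y[:n]
    Phi = y[n : n + n * n].reshape(n, n)
    varphi = y[n + n * n :].reshape(n, N)
    dNdx, dNdtheta = jacobians(x, theta)
    dPhi = dNdx @ Phi                      # (n,n) = (n,n)(n,n)
    dvarphi = dNdx @ varphi + dNdtheta     # (n,N) = (n,n)(n,N) + (n,N)
    return np.concatenate([net(x, theta), dPhi.ravel(), dvarphi.ravel()])

def propagate(x0, theta, T=1.0, steps=200):
    N = theta.size
    # Initial conditions: Phi(0) = identity, varphi(0) = 0.
    y = np.concatenate([x0, np.eye(n).ravel(), np.zeros(n * N)])
    h = T / steps
    for _ in range(steps):
        k1 = rhs(y, theta)
        k2 = rhs(y + 0.5 * h * k1, theta)
        k3 = rhs(y + 0.5 * h * k2, theta)
        k4 = rhs(y + h * k3, theta)
        y = y + h / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    return y[:n], y[n : n + n * n].reshape(n, n), y[n + n * n :].reshape(n, N)

rng = np.random.default_rng(0)
theta = rng.normal(size=n * n + n)
x0 = np.array([1.1, 1.1])
xT, Phi, varphi = propagate(x0, theta)

# Central finite differences on the final state, one column per perturbed input.
eps = 1e-6
Phi_fd = np.column_stack([
    (propagate(x0 + eps * e, theta)[0] - propagate(x0 - eps * e, theta)[0]) / (2 * eps)
    for e in np.eye(n)
])
varphi_fd = np.column_stack([
    (propagate(x0, theta + eps * e)[0] - propagate(x0, theta - eps * e)[0]) / (2 * eps)
    for e in np.eye(theta.size)
])
print(np.max(np.abs(Phi - Phi_fd)), np.max(np.abs(varphi - varphi_fd)))
```

Writing the Jacobians by hand is already tedious for a single layer, which is the part a symbolic differentiation tool automates.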

Now this may all sound very complicated, but *heyoka* simplifies everything for you, so that the code looks like:

In [3]:
# Jacobian of the network w.r.t. the parameters
dNdtheta = hy.diff_tensors(ffnn, hy.diff_args.params)
dNdtheta = dNdtheta.jacobian
print("Shape of dNdtheta:", dNdtheta.shape)

# Jacobian of the network w.r.t. the state variables
dNdx = hy.diff_tensors(ffnn, hy.diff_args.vars)
dNdx = dNdx.jacobian
print("Shape of dNdx:", dNdx.shape)


Shape of dNdtheta: (2, 87)
Shape of dNdx: (2, 2)


To assemble the differential equations we must now define the symbolic variables for all the elements of $\mathbf \Phi$ and $\boldsymbol \varphi$.

In [4]:
# We define the symbols for phi
symbols_phi = []
for i in range(dNdtheta.shape[0]):
    for j in range(dNdtheta.shape[0]):
        # Here we define the symbol for the variations
        symbols_phi.append("phi_"+str(i)+str(j))  
phi = np.array(hy.make_vars(*symbols_phi)).reshape((dNdtheta.shape[0], dNdtheta.shape[0]))

# We define the symbols for varphi
symbols_varphi = []
for i in range(dNdtheta.shape[0]):
    for j in range(dNdtheta.shape[1]):
        # Here we define the symbol for the variations
        symbols_varphi.append("varphi_"+str(i)+str(j))  
varphi = np.array(hy.make_vars(*symbols_varphi)).reshape((dNdtheta.shape[0], dNdtheta.shape[1]))

... and finally assemble the variational equations.

In [5]:
# The (variational) equations of motion in matrix form
dphidt = dNdx@phi
dvarphidt =  dNdx@varphi + dNdtheta

All that remains to be done is to build a Taylor adaptive integrator (an ODE solver) with all the compiled equations:

In [6]:
dyn = []
# The \dot x = ffnn
for lhs, rhs in zip(state,ffnn):
    dyn.append((lhs, rhs))
# The variational equations for x0
for lhs, rhs in zip(phi.flatten(),dphidt.flatten()):
    dyn.append((lhs, rhs))
# The variational equations for the thetas
for lhs, rhs in zip(varphi.flatten(),dvarphidt.flatten()):
    dyn.append((lhs, rhs))
# Initial conditions for the variational equations: the identity matrix for phi and zeros for varphi
ic_var = np.eye(len(state)).flatten().tolist() + [0.] * len(symbols_varphi)

In [7]:
start_time = time.time()
ta = hy.taylor_adaptive(
    # The ODEs.
    dyn,
    # The initial conditions.
    [1.1, 1.1] + ic_var,
    # Operate in compact mode.
    compact_mode = True
)
print("--- %s seconds --- to build the Taylor integrator" % (time.time() - start_time))

--- 1.4344515800476074 seconds --- to build the Taylor integrator


In [8]:
start_time = time.time()
ta.propagate_until(10.)
print("--- %s seconds --- to propagate" % (time.time() - start_time))

--- 0.0006282329559326172 seconds --- to propagate
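After the propagation, the integrator state holds the quantities in the order in which the equations were appended to `dyn`: the 2 state variables, then the 4 entries of $\mathbf \Phi$, then the $2 \times 87$ entries of $\boldsymbol \varphi$, both flattened row by row. The sketch below shows one way to unpack such a vector and apply the chain rule. The helper name `gradients_from_state` and the quadratic loss are hypothetical, and a synthetic array stands in for the integrator state (with the integrator above one would presumably pass `ta.state`).

```python
import numpy as np

def gradients_from_state(flat_state, dLdx, n=2, N=87):
    """Split [x, Phi, varphi] (row-major) and apply the chain rule."""
    flat_state = np.asarray(flat_state, dtype=float)
    x = flat_state[:n]
    Phi = flat_state[n : n + n * n].reshape(n, n)     # Phi[i, j] = dx_i / dx0_j
    varphi = flat_state[n + n * n :].reshape(n, N)    # varphi[i, k] = dx_i / dtheta_k
    # With this index convention the gradients are transposed-Jacobian products.
    grad_x0 = Phi.T @ dLdx
    grad_theta = varphi.T @ dLdx
    return x, grad_x0, grad_theta

# Synthetic stand-in for the state at t = 0: Phi = identity, varphi = 0.
demo_state = np.concatenate([[1.1, 1.1], np.eye(2).ravel(), np.zeros(2 * 87)])

# Hypothetical loss L = 0.5 * ||x - target||^2, so dL/dx = x - target.
target = np.zeros(2)
x, grad_x0, grad_theta = gradients_from_state(demo_state, demo_state[:2] - target)
print(grad_x0, grad_theta.shape)
```

At $t = 0$ the sensitivities are trivial, so the gradient w.r.t. $\mathbf x_0$ equals $\partial \mathcal L / \partial \mathbf x$ and the gradient w.r.t. $\theta$ vanishes.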


## A note on the Adjoint Method
Let us, for a moment, instead of seeking $\frac{\partial \mathbf x(t)}{\partial \mathbf x_0}$, seek its inverse, and thus define:

$$
\mathbf a = \frac{\partial \mathbf x_0}{\partial \mathbf x(t)} 
$$

By definition, $\mathbf a$ is the inverse of $\mathbf \Phi$, i.e. $\mathbf a = \mathbf \Phi^{-1}$. Recalling that the derivative of a matrix inverse is $\frac{d\mathbf A^{-1}}{dt} = - \mathbf A^{-1}\frac{d \mathbf A}{dt}\mathbf A^{-1}$, we also have:

$$
\frac{\partial \mathbf a}{\partial t} = - \mathbf \Phi^{-1} \frac{\partial \mathbf \Phi}{\partial t} \mathbf \Phi^{-1} =- \mathbf \Phi^{-1} \nabla_\mathbf x \mathcal N_\theta(\mathbf x)  \mathbf \Phi  \mathbf \Phi^{-1} = -\mathbf a \nabla_\mathbf x \mathcal N_\theta(\mathbf x)
$$

which is a very compact and elegant derivation (I know, right?) of the adjoint equation for our case, otherwise often derived using the calculus of variations and a much lengthier sequence of variational identities.
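The identity $\mathbf a = \mathbf \Phi^{-1}$ can be checked numerically by integrating $\dot{\mathbf \Phi} = \mathbf J \mathbf \Phi$ and $\dot{\mathbf a} = -\mathbf a \mathbf J$ side by side and verifying that their product stays equal to the identity. In this sketch the time-dependent matrix returned by `jac` is a hypothetical stand-in for $\nabla_\mathbf x \mathcal N_\theta(\mathbf x(t))$ evaluated along a trajectory, and a fixed-step RK4 scheme replaces the Taylor integrator.

```python
import numpy as np

def jac(t):
    # Hypothetical stand-in for the network Jacobian along a trajectory.
    return np.array([[0.0, 1.0], [-1.0 - 0.5 * np.sin(t), -0.1]])

def rhs(t, y):
    Phi, a = y            # y has shape (2, 2, 2): y[0] is Phi, y[1] is a
    J = jac(t)
    return np.array([J @ Phi, -a @ J])

def rk4(y, t0, t1, steps):
    h = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        k1 = rhs(t, y)
        k2 = rhs(t + 0.5 * h, y + 0.5 * h * k1)
        k3 = rhs(t + 0.5 * h, y + 0.5 * h * k2)
        k4 = rhs(t + h, y + h * k3)
        y = y + h / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
        t += h
    return y

# Both matrices start from the identity at t = 0.
y0 = np.array([np.eye(2), np.eye(2)])
Phi, a = rk4(y0, 0.0, 2.0, 400)

# If a is the inverse of Phi, this residual is at the level of the integration error.
residual = np.max(np.abs(a @ Phi - np.eye(2)))
print(residual)
```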

More importantly, the derivation shows that the adjoint method is strongly related to the variational equations, and thus the resulting algorithmic complexity cannot, and will not, be different.

In the classic derivation of the adjoint method the sensitivities are taken with respect to $\mathbf x(T)$ and not $\mathbf x_0 = \mathbf x(t_0)$. This is irrelevant for the purpose of the demonstration as $t_0$ is just a point in time and can represent a point in the future as well as a point in the past.

In the paper "Neural ordinary differential equations", which popularized the use of feed-forward neural networks on the r.h.s. of ODEs, the derivation is made for a loss $\mathcal L$, and an ODE is sought for $\mathbf {\hat a} = \frac{\partial \mathcal L(\mathbf x(T))}{\partial \mathbf x(t)}$.
 
Since:

$$
\mathbf {\hat a} = \frac{\partial \mathcal L(\mathbf x(T))}{\partial \mathbf x(t)} = \frac{\partial \mathcal L(\mathbf x(T))}{\partial \mathbf x(T)}\frac{\partial \mathbf x(T)}{\partial \mathbf x(t)}
$$

It is easy to see that the same differential equation we derived above holds for $\mathbf {\hat a}$, by taking the time derivative of the above identity and noting that $\frac{\partial \mathcal L(\mathbf x(T))}{\partial \mathbf x(T)}$ is a constant.
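In practice this means that $\mathbf{\hat a}$ can be integrated backward in time from $\mathbf{\hat a}(T) = \frac{\partial \mathcal L}{\partial \mathbf x(T)}$ down to $t = 0$, where it equals the gradient of the loss w.r.t. the initial conditions. The sketch below compares that backward integration with the forward route through $\mathbf \Phi$. The matrix returned by `jac` is again a hypothetical stand-in for the network Jacobian along a trajectory, the value of `dLdxT` is arbitrary, and RK4 is used only for brevity.

```python
import numpy as np

def jac(t):
    # Hypothetical stand-in for the network Jacobian along a trajectory.
    return np.array([[0.0, 1.0], [-1.0 - 0.5 * np.sin(t), -0.1]])

def rk4(f, y, t0, t1, steps=400):
    h = (t1 - t0) / steps   # negative when integrating backward in time
    t = t0
    for _ in range(steps):
        k1 = f(t, y)
        k2 = f(t + 0.5 * h, y + 0.5 * h * k1)
        k3 = f(t + 0.5 * h, y + 0.5 * h * k2)
        k4 = f(t + h, y + h * k3)
        y = y + h / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
        t += h
    return y

T = 2.0
dLdxT = np.array([0.3, -0.7])   # arbitrary value of dL/dx(T), treated as a row vector

# Forward route: propagate Phi from 0 to T, then contract with dL/dx(T).
Phi_T = rk4(lambda t, Phi: jac(t) @ Phi, np.eye(2), 0.0, T)
grad_forward = dLdxT @ Phi_T

# Adjoint route: propagate a_hat from T back to 0, starting from dL/dx(T).
grad_adjoint = rk4(lambda t, a_hat: -a_hat @ jac(t), dLdxT, T, 0.0)

print(grad_forward, grad_adjoint)
```

The two routes should agree up to integration error. The forward one carries an $(n, n)$ matrix, the backward one a single row vector of length $n$.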