# Review of Typical Neural Network Framework

In any supervised or semi-supervised setting, we are given a set of inputs and outputs $\{(x_i, y_i)\}_{i=1}^n$. The goal is to learn a function $F: x \rightarrow y$ that maps each input $x_i$ to its corresponding output $y_i$. Typically, $F$ is modeled as a neural network $F_\theta$ where the parameters $\theta$ are learned through stochastic gradient descent with respect to a user-defined loss function $\mathcal{L}(x_i, y_i)$ (e.g. squared loss).
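As a minimal sketch of this setup, the block below fits a model by stochastic gradient descent on the squared loss. To stay dependency-free it uses a single linear unit $w x + b$ in place of a full neural network; the function name `sgd_fit`, the learning rate, and the synthetic data are assumptions made for illustration only.

```python
import random

def sgd_fit(data, lr=0.05, epochs=200, seed=0):
    """Fit y ~ w * x + b by per-sample SGD on the squared loss (w*x + b - y)**2."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0                      # the parameters theta = (w, b)
    for _ in range(epochs):
        rng.shuffle(data)                # "stochastic": visit samples in random order
        for x, y in data:
            err = (w * x + b) - y        # prediction error on one sample
            w -= lr * 2.0 * err * x      # d(err**2)/dw = 2 * err * x
            b -= lr * 2.0 * err          # d(err**2)/db = 2 * err
    return w, b

# Synthetic, noise-free data generated from y = 3x + 1 (illustrative only).
data = [(x, 3.0 * x + 1.0) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
w, b = sgd_fit(data)
print(w, b)
```

With a real network, the two hand-written gradient lines would be replaced by automatic differentiation, but the loop structure stays the same.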

# Neural Ordinary Differential Equations (Neural ODEs)
Chen et al. showed that instead of modeling $F: x \rightarrow y$ as a neural network, we can reformulate the problem in a continuous setting. Namely, let us call $h(t)$ the **hidden state** which depends on $t$ ($t$ can but does not have to represent time). The initial hidden state at some time $t_0$ will be $h(t_0)=x$. The final hidden state at some time $t_1$ after $t_0$ will be $h(t_1) = y$. 

The derivative of $h(t)$ with respect to $t$ is modeled as a neural network $f$ which receives $h(t)$ and $t$ as input and is parameterized by $\theta$. Notice that the values of the initial time $t_0$ and final time $t_1$ are not fixed in advance. Like the parameters $\theta$ of the neural network, $t_0$ and $t_1$ are free parameters and are likewise optimized through gradient descent with respect to the loss $\mathcal{L}$.

In summary, we are learning a function $f(t, h(t), \theta)$ such that

$$f(t, h(t), \theta) = \frac{d h(t)}{d t}$$
$$h(t_0) = x$$
$$h(t_1) = y$$

Our prediction $\hat{y}$ is thus

$$\widehat{y} = h(t_1) = ODESolve\big(h(t_0), t_0, t_1, \theta, f\big)$$

where $ODESolve$ is any numerical ODE solver (e.g. Runge-Kutta) which solves for the hidden state at $t_1$, $h(t_1)$. To optimize the free parameters $t_0$, $t_1$, and $\theta$, the adjoint sensitivity method as described by Chen et al. is first used to find the gradients $\frac{d \mathcal{L}}{d h(t)}$ and $\frac{d \mathcal{L}}{d \theta}$. The free parameters are then optimized using gradient descent.
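The forward solve and the adjoint backward pass can be sketched on a toy problem. The block below is an illustration under stated assumptions, not the implementation from Chen et al.: the neural network $f$ is replaced by the scalar linear dynamics $f(t, h, \theta) = \theta h$ so that the result can be checked against a closed form, the solver is a hand-written fixed-step Runge-Kutta (RK4) routine, and the names `rk4_solve`, `aug_dynamics`, and `loss_and_grads` are hypothetical. The backward pass integrates the augmented state $[h, a, g]$ from $t_1$ back to $t_0$, where $a(t) = \frac{d \mathcal{L}}{d h(t)}$ is the adjoint and $g$ accumulates $\frac{d \mathcal{L}}{d \theta}$.

```python
import math

def rk4_solve(f, h0, t0, t1, theta, steps=100):
    """Integrate dh/dt = f(t, h, theta) from t0 to t1 with fixed-step RK4.
    h is a list of floats; t1 < t0 integrates backwards in time."""
    dt = (t1 - t0) / steps
    t, h = t0, list(h0)
    for _ in range(steps):
        k1 = f(t, h, theta)
        k2 = f(t + dt / 2, [hi + dt / 2 * k for hi, k in zip(h, k1)], theta)
        k3 = f(t + dt / 2, [hi + dt / 2 * k for hi, k in zip(h, k2)], theta)
        k4 = f(t + dt, [hi + dt * k for hi, k in zip(h, k3)], theta)
        h = [hi + dt / 6 * (a + 2 * b + 2 * c + d)
             for hi, a, b, c, d in zip(h, k1, k2, k3, k4)]
        t += dt
    return h

def f(t, h, theta):
    # Toy stand-in for the neural network: dh/dt = theta * h.
    return [theta * h[0]]

def aug_dynamics(t, s, theta):
    # Augmented state s = [h, a, g] with a = dL/dh(t).
    # dh/dt = f, da/dt = -a * df/dh, dg/dt = -a * df/dtheta.
    h, a, _ = s
    return [theta * h, -a * theta, -a * h]

def loss_and_grads(x, y, t0, t1, theta):
    h1 = rk4_solve(f, [x], t0, t1, theta)[0]        # forward: y_hat = h(t1)
    loss = (h1 - y) ** 2                            # squared loss
    a1 = 2.0 * (h1 - y)                             # dL/dh(t1)
    # Backward: start at t1 with g = 0 and integrate down to t0.
    _, a0, dL_dtheta = rk4_solve(aug_dynamics, [h1, a1, 0.0], t1, t0, theta)
    return loss, a0, dL_dtheta                      # a0 = dL/dh(t0) = dL/dx

x, y, t0, t1, theta = 1.0, 2.0, 0.0, 1.0, 0.5
loss, dL_dx, dL_dtheta = loss_and_grads(x, y, t0, t1, theta)

# Closed form for this toy problem: h(t1) = x * exp(theta * (t1 - t0)).
h1_exact = x * math.exp(theta * (t1 - t0))
exact_dL_dtheta = 2.0 * (h1_exact - y) * h1_exact * (t1 - t0)
print(dL_dtheta, exact_dL_dtheta)
```

Because the hidden state is re-integrated backwards alongside the adjoint, the intermediate states of the forward pass do not need to be stored.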

## Advantages of Neural ODEs
There are several advantages of Neural ODEs over normal neural networks.

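**First**, Neural ODEs ensure that **the learned mapping $F: x \rightarrow y$ is smooth**. Because $y$ is obtained by integrating the hidden state forward from $x$, nearby inputs are carried to nearby outputs, provided $f$ is Lipschitz continuous in $h$.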

**Second**, the Neural ODE formulation is **well-suited for time series data**. For instance, in the framework given above, the only hidden states that had any meaning for our problem were $h(t_0)$, which was our input $x$, and $h(t_1)$, which was our output $y$. All other hidden states in between $t_0$ and $t_1$ did not carry any meaning for our problem. Imagine now that we are dealing with time-series data $(x_0, x_1, x_2, \ldots, x_n)$ and our goal is to either interpolate in between two given data points or predict new data points. Now each hidden state has meaning: $h(t) = x_t$.
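To make the time-series reading concrete, the sketch below evaluates the hidden state at every observation time and then at an unobserved time in between. It is illustrative only: the dynamics $f(t, h, \theta) = -\theta h$ stand in for a trained network, the observations are synthetic, and the helper names `rk4_step` and `odesolve_at` are hypothetical.

```python
import math

def rk4_step(f, t, h, dt, theta):
    """One classical RK4 step for a scalar hidden state h."""
    k1 = f(t, h, theta)
    k2 = f(t + dt / 2, h + dt / 2 * k1, theta)
    k3 = f(t + dt / 2, h + dt / 2 * k2, theta)
    k4 = f(t + dt, h + dt * k3, theta)
    return h + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def odesolve_at(f, h0, times, theta, substeps=20):
    """Return h(t) for every t in `times` (ascending), with h(times[0]) = h0."""
    states, h = [h0], h0
    for t_a, t_b in zip(times, times[1:]):
        dt = (t_b - t_a) / substeps
        for k in range(substeps):
            h = rk4_step(f, t_a + k * dt, h, dt, theta)
        states.append(h)
    return states

def decay(t, h, theta):
    # Toy stand-in for the learned network: dh/dt = -theta * h.
    return -theta * h

theta = 1.3
obs_times = [0.0, 0.5, 1.0, 2.0]                           # irregularly spaced
observations = [math.exp(-theta * t) for t in obs_times]   # synthetic x_t

# Every hidden state now has a target: h(t) should match x_t.
predictions = odesolve_at(decay, observations[0], obs_times, theta)
mse = sum((p - x) ** 2 for p, x in zip(predictions, observations)) / len(obs_times)

# Interpolation: query the same trajectory at an unobserved time.
h_075 = odesolve_at(decay, observations[0], [0.0, 0.75], theta)[-1]
print(mse, h_075)
```

Note that the observation times do not need to be evenly spaced, since the solver can be stopped at any $t$.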

**Third**, we can **take advantage of the rich differential equations theory and numerical solvers** developed over the past 200 years.

# Neural ODE Demo
In the demo below, we will attempt to use a Neural ODE to fit time-series data from a molecular simulation of alanine dipeptide.

In [None]:
# Neural ODE

# Neural Stochastic Differential Equations (Neural SDEs)

In [None]:
# Neural SDE