# Neural Ordinary Differential Equations
## Summary

NeurIPS is the largest AI conference in the world. Of the 4,854 papers submitted, 4 received a "Best Paper" award, and this is one of them. The basic idea: a neural network is a stack of layers of simple computation nodes that work together to approximate a function. If we re-frame the network as an ordinary differential equation (ODE), we can use existing ODE solvers (like Euler's method) to compute its output. This means no discrete layers; instead, the network is a continuous function. There is no need to specify the number of layers beforehand. Instead, we specify the desired accuracy, and the solver adapts its evaluation steps to stay within that margin of error. It's still early stages, but this could be as big a breakthrough as GANs!
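To make "specify the desired accuracy instead of the number of layers" concrete, here is a minimal sketch of an adaptive solver. It is an illustration, not the solver used in the paper: it uses Euler steps with a step-doubling error estimate, and the names `adaptive_euler`, `tol`, and the test equation dx/dt = -x are assumptions chosen for the example.

```python
import math

def euler_step(f, x, h):
    """One Euler step of size h for dx/dt = f(x)."""
    return x + h * f(x)

def adaptive_euler(f, x0, t_end, tol):
    """Integrate dx/dt = f(x) from t=0 to t_end.

    The caller gives a per-step tolerance `tol`, not a step count.
    Returns the final state and the number of evaluations of f.
    """
    t, x, h = 0.0, x0, t_end
    n_evals = 0
    while t < t_end:
        h = min(h, t_end - t)
        full = euler_step(f, x, h)                            # one step of size h
        half = euler_step(f, euler_step(f, x, h / 2), h / 2)  # two steps of size h/2
        n_evals += 3
        if abs(full - half) <= tol:
            # Estimated local error is small enough: accept, then try a larger step.
            t, x = t + h, half
            h *= 2
        else:
            # Too inaccurate: retry from the same point with a smaller step.
            h /= 2
    return x, n_evals

# Hypothetical test problem: dx/dt = -x with x(0) = 1, exact solution exp(-t).
decay = lambda x: -x
exact = math.exp(-1.0)
x_loose, n_loose = adaptive_euler(decay, 1.0, 1.0, tol=1e-2)
x_tight, n_tight = adaptive_euler(decay, 1.0, 1.0, tol=1e-4)
print(n_loose, abs(x_loose - exact))
print(n_tight, abs(x_tight - exact))
```

A tighter tolerance makes the solver take more, smaller steps, which is the continuous analogue of using a deeper network.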

<table>
    <tr>
        <td> <img src="images/resnet_0_viz.png" alt="Drawing" style="width: 450px;"/> </td>
        <td> <img src="images/odenet_0_viz.png" alt="Drawing" style="width: 450px;"/> </td>
    </tr>
</table>

 -  Left: A residual network defines a discrete sequence of finite transformations.
 -  Right: An ODE network defines a vector field, which continuously transforms the state.
 -  Both: Circles represent evaluation locations.

## Demo 
An ODENet approximated this spiral function better than a Recurrent Network. 

![alt text](images/demon-timeseries.png)

ODENet gives results comparable to ResNet at a lower memory cost.
![Ode vs ResNet](images/resnet-vs-ode.png)


## Why Does This Matter?

1. Faster test time than recurrent networks, at the cost of slower training. Precision can be traded for speed, which makes it a good fit for low-power edge computing.
2. More accurate results for time series predictions (!!), i.e. continuous-time models
3. Opens up a whole new realm of mathematics for optimizing neural networks (differential equation solvers, 100+ years of theory)
4. Gradients can be computed with constant memory cost



# Neural Networks as Universal Approximators

By the Universal Approximation Theorem, a network made of linear matrix multiplications followed by a non-linear function can approximate any continuous function on a compact domain to arbitrary accuracy, given enough hidden units.

# Residual Neural Network

Microsoft proposed residual networks (ResNets) for the 2015 ImageNet competition:
- They were published in December 2015 as a solution to the ImageNet classification competition.
- ResNets had the best accuracy in the competition.
- ResNets use "skip connections" between layers, which increases accuracy.
- The authors were able to train networks up to 1000 layers deep while avoiding vanishing gradients, which otherwise lower accuracy.
- 6 months later, their publication already had more than 200 citations.

The residual layer is actually quite simple: add the output of the activation function to the original input to the layer. As a formula, the (k+1)th layer computes:

\begin{equation} x_{k+1} = x_{k} + F(x_{k})\end{equation}

where F is the function of the kth layer and its activation. For example, F might represent a convolutional layer with a ReLU activation. This simple formula is a special case of the formula:

\begin{equation} x_{k+1} = x_{k} + h F(x_k),\end{equation}

which is one step of the Euler method for solving ordinary differential equations (ODEs) with step size h. The residual layer is the case h=1.
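The correspondence can be checked numerically with a minimal sketch. The function `F` below is a hypothetical scalar stand-in for a layer plus its activation (a real layer would act on tensors and have learned weights); the other names are also chosen only for this example.

```python
import math

def F(x):
    # Hypothetical stand-in for "layer + activation": tanh of a scaled input.
    return math.tanh(0.5 * x)

def residual_layer(x):
    """Residual update: x_{k+1} = x_k + F(x_k)."""
    return x + F(x)

def euler_step(x, h):
    """Euler update: x_{k+1} = x_k + h * F(x_k)."""
    return x + h * F(x)

def resnet(x, depth):
    """Apply `depth` residual layers in sequence."""
    for _ in range(depth):
        x = residual_layer(x)
    return x

def euler_solve(x, h, n_steps):
    """Apply `n_steps` Euler steps of size h."""
    for _ in range(n_steps):
        x = euler_step(x, h)
    return x

resnet_out = resnet(0.3, depth=4)
euler_out = euler_solve(0.3, h=1.0, n_steps=4)
print(resnet_out, euler_out)
```

With h=1 the two computations coincide, so a stack of residual layers that share the same F is the same as running the Euler method for `depth` steps.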

# Euler's Method

Consider a simplified ODE from physics: we want to model the position x of a marble. Assume we can calculate its velocity x′ (the derivative of position) at any position x. We know that the marble starts at position x(0)=0 and that its velocity at time t depends on its position through the formula:

\begin{equation} \dot{x}(t) = f(x) \end{equation}

The Euler method solves this problem by following physical intuition: your position at a time very close to the present depends on your current velocity and position. For example, if you are travelling at a velocity of 5 meters per second for 1 second, your position changes by 5 meters. If you travel for h seconds, you will have travelled 5h meters. As a formula:

\begin{equation}x(t+h) = x(t) + h \dot{x}(t),\end{equation}

but since we know

\begin{equation} \dot{x}(t) = f(x) \end{equation}

we can rewrite this as

\begin{equation} x(t+h) = x(t) + h f(x).\end{equation}

If you squint at this formula for the Euler method, you can see it looks just like the formula for residual layers!
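As a rough sketch of the method in code: the velocity field f(x) = 1 - x below is a hypothetical choice (the text does not specify one), picked because with x(0)=0 the exact solution x(t) = 1 - exp(-t) is known and the Euler result can be compared against it.

```python
import math

def f(x):
    # Hypothetical velocity field: the marble slows down as it approaches x = 1.
    return 1.0 - x

def euler(f, x0, t_end, n_steps):
    """Approximate x(t_end) for dx/dt = f(x), x(0) = x0, with n_steps Euler steps."""
    h = t_end / n_steps
    x = x0
    for _ in range(n_steps):
        x = x + h * f(x)  # x(t + h) = x(t) + h * f(x)
    return x

exact = 1.0 - math.exp(-1.0)  # exact position at t = 1
errors = [abs(euler(f, 0.0, 1.0, n) - exact) for n in (10, 100, 1000)]
print(errors)
```

The error shrinks as the step size h shrinks, at the cost of more evaluations of f.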

This observation has meant three things for designing neural networks:

- New neural network layers can be created through different numerical approaches to solving ODEs
- The possibility of arbitrarily deep neural networks
- Training of a deep network can be improved by considering the so-called stability of the underlying ODE and its numerical discretization
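The stability point can be illustrated with a standard textbook test equation, dx/dt = -λx, which is an assumption of this sketch rather than something taken from the text. One Euler step multiplies x by (1 - hλ), so the iterates decay only when |1 - hλ| < 1, that is, when h < 2/λ.

```python
def euler_decay(lam, h, n_steps, x0=1.0):
    """Apply n_steps Euler steps to dx/dt = -lam * x, starting from x0."""
    x = x0
    for _ in range(n_steps):
        x = x + h * (-lam * x)
    return x

lam = 10.0
stable = euler_decay(lam, h=0.05, n_steps=50)    # h < 2 / lam: iterates shrink
unstable = euler_decay(lam, h=0.25, n_steps=50)  # h > 2 / lam: iterates blow up
print(stable, unstable)
```

The true solution decays to zero in both cases; only the discretization with the larger step diverges, which is why the choice of step size and scheme matters for very deep networks.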

### Two more points

- To create arbitrarily deep networks with a finite memory footprint, design neural networks based on stable ODEs and numerical discretizations.
- Gradient descent can be viewed as applying Euler's method to the ordinary differential equation of gradient flow, where f is the objective being minimized and the step size h plays the role of the learning rate:

$$ \dot{x}(t) = - \nabla f(x(t)) $$
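A minimal sketch of this equivalence, assuming a hypothetical one-dimensional objective f(x) = 0.5 (x - 3)^2 whose gradient is x - 3:

```python
def grad(x):
    # Gradient of the hypothetical objective f(x) = 0.5 * (x - 3)^2.
    return x - 3.0

def gradient_descent(x, lr, n_steps):
    """Plain gradient descent: x <- x - lr * grad(x)."""
    for _ in range(n_steps):
        x = x - lr * grad(x)
    return x

def euler_gradient_flow(x, h, n_steps):
    """Euler's method on dx/dt = -grad(x): x <- x + h * (-grad(x))."""
    for _ in range(n_steps):
        x = x + h * (-grad(x))
    return x

x_gd = gradient_descent(0.0, lr=0.1, n_steps=200)
x_euler = euler_gradient_flow(0.0, h=0.1, n_steps=200)
print(x_gd, x_euler)
```

Both loops perform the same update, with the learning rate playing the role of the step size h, and both approach the minimizer x = 3.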
