# Deep Learning on Differential Equations: a Case Study with NODEs and the Double Pendulum

In [3]:
from matplotlib import animation
animation.writer = animation.writers['ffmpeg']

## Introduction and Review of Literature

In this project, our goal was to learn how deep neural networks perform on physics-related problems when physics-related concepts are built into the network's architecture. To investigate this question, we used prediction of the future state of a double pendulum: the system is governed by relatively simple ordinary differential equations (ODEs) but exhibits chaotic behavior, making the learning task nontrivial. Using a deterministic (if chaotic) system also allowed us to generate effectively unlimited data on demand; whenever data were required for training or validation, we simply sampled random initial states and integrated the differential equations to produce time series starting from them.

We focused primarily on neural ODEs (NODEs) in our experiments. NODEs are a conceptual extension of residual networks (ResNets) and recurrent neural networks (RNNs) in the sense that they represent relevant information about a time series through a hidden state that is updated as time moves forward (Chen, Rubanova, Bettencourt, \& Duvenaud). However, while ResNets and RNNs perform updates in discrete time, NODEs operate in continuous time. Effectively, an RNN or similar model is trained to approximate the function that maps the hidden state at time index $t$ to the hidden state at time index $t+1$. A NODE instead approximates the continuous-time derivative of the hidden state at any time. This allows the trained neural network to be used as the derivative function in a black-box numerical ODE solver, the result of which can be decoded into a prediction at an arbitrary future time with constant computational cost. Put another way, NODEs mimic the limit of applying RNNs to a problem as the size of the discrete timestep used by the RNNs goes to 0.
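The limiting relationship between discrete residual updates and a continuous-time solve can be illustrated numerically. The sketch below is not the project's code: the tiny derivative network `f`, its random (untrained) weights, and the helper names `residual_rollout` and `rk4_solve` are all assumptions made for illustration. It applies ResNet-style updates $h \leftarrow h + \Delta t\, f(h)$ with progressively smaller steps and compares the result with a fine-grained RK4 solve of $dh/dt = f(h)$.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for a trained derivative network (weights are random).
W1 = rng.normal(scale=0.5, size=(8, 2))
b1 = rng.normal(scale=0.1, size=8)
W2 = rng.normal(scale=0.5, size=(2, 8))


def f(h):
    """Derivative of the hidden state, dh/dt = f(h)."""
    return W2 @ np.tanh(W1 @ h + b1)


def residual_rollout(h0, t_end, n_steps):
    """Discrete residual updates: h <- h + dt * f(h), repeated n_steps times."""
    h = np.asarray(h0, dtype=float)
    dt = t_end / n_steps
    for _ in range(n_steps):
        h = h + dt * f(h)
    return h


def rk4_solve(h0, t_end, n_steps=2000):
    """Fine-grained RK4 solve, used here as the continuous-time reference."""
    h = np.asarray(h0, dtype=float)
    dt = t_end / n_steps
    for _ in range(n_steps):
        k1 = f(h)
        k2 = f(h + 0.5 * dt * k1)
        k3 = f(h + 0.5 * dt * k2)
        k4 = f(h + dt * k3)
        h = h + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return h


h0 = np.array([1.0, -0.5])
reference = rk4_solve(h0, t_end=1.0)
errors = {
    n: np.linalg.norm(residual_rollout(h0, 1.0, n) - reference)
    for n in (1, 10, 100, 1000)
}
for n, err in errors.items():
    print(f"{n:5d} residual steps: gap to ODE solution = {err:.2e}")
```

Because each residual update is a forward Euler step, the gap to the reference solution should shrink as the number of steps grows, which is the sense in which a NODE is the small-timestep limit of the discrete model.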

We also considered applying Physics-Informed Neural Networks (PINNs) to the task of predicting future states of a double pendulum system. PINNs are deep neural networks whose objective function contains one term penalizing deviation from observed training data and another term penalizing violation of prior knowledge about the physics of the system.

A wide variety of architectures and approaches fall under this umbrella. The authors who introduced the term (Raissi, Perdikaris, \& Karniadakis) first described a method that relies heavily on prior physics knowledge: effectively, a neural network used to solve a particular initial value problem. Other researchers applied this general method to the double pendulum problem and found that PINNs can fail to capture chaotic dynamics because they deviate slightly from the given initial conditions in order to match training trajectories (Steger, Rohrhofer, \& Geiger). Later works took the opposite approach and set out to derive the governing equations of motion from data (Dufera). We were most interested in a hybrid approach such as that described by Karniadakis et al., in which models learn from both data and some prior knowledge about the system's physics.

The most obvious initial investigation to us was the application of a straightforward multilayer perceptron (MLP) model with a PINN loss to map between discrete time points in the state of the system. In other words, we would train a neural network to take in the positions and velocities of both pendulums and output a prediction for those values a small, constant time interval later. However, upon further reflection, we realized that when training data are generated without noise, the data perfectly follow the physics, so the data loss and the physics loss are equivalent. As a result, the network would not truly be a PINN. We thus chose to focus mostly on NODEs and to use "PINNs" (which here are equivalent to an MLP with a data-only loss) as a control for comparison purposes.
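One way to make this equivalence concrete is sketched below. The notation is introduced here for illustration and is not taken from the project: $f_\theta$ is the network, $\Phi_{\Delta t}$ is the true flow map that advances a state by $\Delta t$, $(\mathbf{x}_i, \mathbf{y}_i)$ are the $N$ training pairs, and $\lambda$ is an assumed weighting between the two terms. A discrete-time PINN-style loss could then be written as

$$
\mathcal{L}(\theta) = \underbrace{\frac{1}{N}\sum_{i=1}^{N} \left\lVert f_\theta(\mathbf{x}_i) - \mathbf{y}_i \right\rVert^2}_{\text{data term}} + \lambda \underbrace{\frac{1}{N}\sum_{i=1}^{N} \left\lVert f_\theta(\mathbf{x}_i) - \Phi_{\Delta t}(\mathbf{x}_i) \right\rVert^2}_{\text{physics term}}.
$$

If the data are noiseless, then $\mathbf{y}_i = \Phi_{\Delta t}(\mathbf{x}_i)$ for every $i$, the two sums coincide, and $\mathcal{L}(\theta)$ reduces to $(1+\lambda)$ times the data term, which is just a rescaled data-only loss.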

<video src='presentation_chaos.mp4' controls></video>

## Methods

We forewent a traditional dataset in favor of online data generation to ease comparison across various tasks. To create a ground-truth trajectory, we integrated the true equations of motion of the double pendulum with the fourth-order Runge-Kutta method (RK4) using a step size of $\Delta t = 0.005$ s. When generating noisy data, we then injected noise by adding $\boldsymbol{\varepsilon} \sim \mathcal{N}(\textbf{0}, \sigma^2 I)$ to the trajectory.
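The data-generation procedure described above can be sketched as follows. This is an illustrative reconstruction, not the project's code: the state ordering $[\theta_1, \theta_2, \omega_1, \omega_2]$, the value of `G`, the point-mass-on-massless-rod model, and the function names are assumptions. The accelerations are obtained by solving the two standard Lagrangian equations of the double pendulum as a 2x2 linear system.

```python
import numpy as np

G = 9.81  # assumed gravitational acceleration in m/s^2


def double_pendulum_deriv(state, m1=1.0, m2=1.0, l1=1.0, l2=1.0):
    """Time derivative of [theta1, theta2, omega1, omega2] (assumed ordering)."""
    th1, th2, w1, w2 = state
    s, c = np.sin(th1 - th2), np.cos(th1 - th2)
    # Coefficients of the linear system A @ [alpha1, alpha2] = r
    a11, a12 = (m1 + m2) * l1, m2 * l2 * c
    a21, a22 = l1 * c, l2
    r1 = -m2 * l2 * w2**2 * s - (m1 + m2) * G * np.sin(th1)
    r2 = l1 * w1**2 * s - G * np.sin(th2)
    det = a11 * a22 - a12 * a21  # strictly positive for positive masses and lengths
    alpha1 = (r1 * a22 - a12 * r2) / det
    alpha2 = (a11 * r2 - a21 * r1) / det
    return np.array([w1, w2, alpha1, alpha2])


def rk4_step(f, y, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = f(y)
    k2 = f(y + 0.5 * dt * k1)
    k3 = f(y + 0.5 * dt * k2)
    k4 = f(y + dt * k3)
    return y + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)


def generate_trajectory(y0, n_steps, dt=0.005, sigma=0.0, rng=None):
    """Integrate from y0 for n_steps steps; optionally add N(0, sigma^2 I) noise."""
    traj = np.empty((n_steps + 1, 4))
    traj[0] = y0
    for i in range(n_steps):
        traj[i + 1] = rk4_step(double_pendulum_deriv, traj[i], dt)
    if sigma > 0:
        rng = np.random.default_rng() if rng is None else rng
        traj = traj + rng.normal(0.0, sigma, size=traj.shape)
    return traj


# Usage: one second of clean data and a noisy copy of the same trajectory.
y0 = np.array([1.0, 0.5, 0.0, 0.0])
clean = generate_trajectory(y0, n_steps=200)
noisy = generate_trajectory(y0, n_steps=200, sigma=0.1, rng=np.random.default_rng(0))
```

Note that `sigma` here is the standard deviation, so a noise level quoted as $\sigma^2$ would be passed as its square root.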

The control model was a feedforward multilayer perceptron with 3 hidden layers of dimension 30. Its training data were noiseless and all used pendula with equal masses (1 kg) and equal lengths (1 m). The data consisted of input-output pairs of four variables each: the input was the angle and angular velocity of each pendulum at one point in time, and the output was those same four variables 0.005 s later. Angles were sampled uniformly between $-\pi$ and $\pi$ radians, and velocities were drawn from a normal distribution centered at 0. Training was done for 450,000 epochs, each of which consisted of $2^{15}$ training pairs. Loss was a simple mean squared error on vectors in $\mathbb{R}^4$.
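A minimal NumPy sketch of the control setup is given below, covering input sampling, the 4-30-30-30-4 architecture, and the loss. Several details are assumptions rather than facts from the project: the standard deviation of the velocity distribution (`velocity_std`), the weight initialization, the state ordering, and the choice to average the squared error over all elements. The ReLU nonlinearity follows the architecture notes later in this section. Training itself is omitted.

```python
import numpy as np


def sample_states(n, rng, velocity_std=1.0):
    """Sample n states [theta1, theta2, omega1, omega2] (assumed ordering).

    Angles are uniform on [-pi, pi); angular velocities are normal around 0.
    velocity_std is an assumed value; the text does not state it.
    """
    angles = rng.uniform(-np.pi, np.pi, size=(n, 2))
    velocities = rng.normal(0.0, velocity_std, size=(n, 2))
    return np.hstack([angles, velocities])


def init_mlp(rng, sizes=(4, 30, 30, 30, 4)):
    """One (W, b) pair per layer; the He-style initialization is an assumption."""
    return [
        (rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def mlp_forward(params, x):
    """ReLU on the three hidden layers, linear output layer."""
    for W, b in params[:-1]:
        x = np.maximum(x @ W + b, 0.0)
    W, b = params[-1]
    return x @ W + b


def mse(pred, target):
    """Mean squared error averaged over samples and the four state components."""
    return np.mean((pred - target) ** 2)


# Usage: one epoch's worth of inputs (2**15 states) pushed through the model.
rng = np.random.default_rng(0)
x = sample_states(2**15, rng)
params = init_mlp(rng)
pred = mlp_forward(params, x)
n_params = sum(W.size + b.size for W, b in params)
```

The targets for `mse` would be the same states advanced by 0.005 s with the true dynamics.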

For our ablation study, we varied four hyperparameters of the training dataset and model architecture: model size, trajectory length, noise level, and number of training samples. 

* For model size, we had 3 models with approximately $\dim\boldsymbol{\theta} \in \{n_1, n_2, n_3\}$ parameters. 
* We then assigned maximum and minimum trajectory lengths and had the dataset generate trajectories of lengths uniformly sampled between them. Our trajectory lengths were then chosen to be in one of the following bins: [$m_1$, $M_1$], [$m_2$, $M_2$], [$m_3$, $M_3$]. We denote mean trajectory length $\bar \ell$.
* We added noise $\boldsymbol{\varepsilon} \sim \mathcal{N}(\textbf{0}, \sigma^2 I)$ for $\sigma^2\in \{\sigma_1^2, \sigma_2^2\}$. We also trained with no noise (denoted $\sigma^2 = 0$ in the tables in this paper).
* We tried training on a total of $|X| \in \{s_1, s_2, s_3\}$ trajectories. 
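The four ablation axes listed above could be bundled into a single configuration object, as in the sketch below. The class `AblationSetting`, the helper `sample_lengths`, and every numeric value in the usage example are hypothetical placeholders, not the settings used in the study. The sketch also assumes that trajectory lengths are measured in integer time steps and that both bin endpoints are included.

```python
import numpy as np
from dataclasses import dataclass


@dataclass(frozen=True)
class AblationSetting:
    n_params: int        # approximate model size, dim(theta)
    min_len: int         # minimum trajectory length m, in time steps (assumed unit)
    max_len: int         # maximum trajectory length M, in time steps (assumed unit)
    sigma2: float        # noise variance; 0.0 means noiseless data
    n_trajectories: int  # number of training trajectories, |X|

    @property
    def mean_len(self):
        """Mean trajectory length for uniform sampling on [min_len, max_len]."""
        return 0.5 * (self.min_len + self.max_len)


def sample_lengths(setting, rng):
    """Draw one length per trajectory, uniformly from the inclusive bin."""
    return rng.integers(
        setting.min_len, setting.max_len, size=setting.n_trajectories, endpoint=True
    )


# Usage with placeholder values only (not the values from the study).
setting = AblationSetting(
    n_params=1000, min_len=10, max_len=20, sigma2=0.0, n_trajectories=5000
)
lengths = sample_lengths(setting, np.random.default_rng(0))
```

With uniform sampling, the empirical mean of `lengths` should sit close to `setting.mean_len`, which corresponds to $\bar \ell$ in the notation above.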


Our NODE used tanh for its nonlinearities, and layers in the middle were twice as large as layers near the edge. Our PINNs and vanilla MLPs were similar, but used ReLU for the nonlinearities. 

## Results

We found that . . .

## Conclusions

We conclude that . . .

## References

We referenced . . .