# Article Review & Reproduction

This notebook is an assignment for Prof. Nelly Pustelnik's machine learning class in the M2 Complex Systems program at ENS de Lyon. In what follows, we review an article introducing physics-informed neural networks and reproduce some of its results.

**Students**: Ayoub Dhibi & Théodore Farmachidi

**Article**:
[Maziar Raissi, Paris Perdikaris, George Em Karniadakis. "Physics Informed Deep Learning (Part I): Data-driven Solutions of Nonlinear Partial Differential Equations." (2017).](https://arxiv.org/abs/1711.10561)

**Note**: the presentation is expected to include a general introduction, the article's specific features in relation to the state of the art, a more technical section on the article's contributions, and a presentation of the results reproduced.

---

## Table of Contents
1. [Article Summary](#article-summary)
2. [Key Concepts & Methods](#key-concepts--methods)
3. [Important Figures & Results](#important-figures--results)
4. [Reproduction of Results](#reproduction-of-results)
5. [Personal Notes & Questions](#personal-notes--questions)

---

**Roadmap**:

1. Analyze the article and its details
2. Reproduce two examples (one per part)
3. Apply the method to a new use case
4. Prepare the presentation (10' presentation + 10' questions)

**Due dates**: 
- ML challenge for the 04/01
- notebook for the 05/01
- presentation for the 09/01

## 1. Article Summary <a id="article-summary"></a>

### Abstract
### 1. Introduction
When enough data is available, neural networks (NNs) have proven very effective at machine learning problems such as natural language processing and image recognition. In many applications, however, data is scarce and difficult to obtain (complex physical systems, biological systems, etc.). With little data, current state-of-the-art NNs are less accurate and may not even converge to a proper solution of the problem.

The current approach to these complex-system problems does not use all the available information to constrain the solutions. Typically, one feeds a neural network (NN) with pairs of training inputs and targets and trains it by minimizing some error criterion, possibly with a regularization term; both the loss and the regularization are chosen heuristically. When we have prior knowledge of the system and of the physical laws governing it, a better approach is to encode the physical constraints directly into the regularization. The resulting model, a physics-informed neural network (PINN), displays better convergence behavior and better generalization even when few data points are available.

Previous work on physics-informed models used Gaussian process regression to solve linear problems, obtaining good approximations and theoretical error bounds on several linear problems from mathematical physics. Later work extended the method to nonlinear problems by applying it to local linear approximations of the problem.

#### 1.1 Problem setup and summary of contributions
In the article, the authors use deep neural networks (DNNs) as universal function approximators, which naturally extends the class of solvable problems to nonlinear settings. They also use [automatic differentiation](https://en.wikipedia.org/wiki/Automatic_differentiation) to compute the partial derivatives efficiently. A constraint is then added to the neural network so that the approximated function is a solution of the nonlinear partial differential equations (PDEs) governing the problem; these equations encode the physical laws characteristic of the system (conservation principles, symmetries, etc.). This approach leads to a new class of numerical solvers for PDEs and of data-driven model inference methods.

This new approach can be used to tackle two types of problems:
1. data-driven solution: given noisy measurements, find a function that is a solution of the PDE
2. data-driven discovery: given noisy measurements, find the parameters of the PDE such that its solution(s) could have produced the measurements

We can frame these as follows. Consider parametrized nonlinear partial differential equations of the general form
$$
\partial_t u(t,x) + N[u;\lambda] = 0
$$
where $u(t,x)$ is the solution of the PDE and $N[\cdot;\lambda]$ is a nonlinear operator parametrized by $\lambda$.
We use subscripts to denote partial derivatives:
$$
\partial_t u = u_t \text{, } \partial_x u = u_x \text{, } \partial_x^2 u = u_{xx} \text{, etc}
$$
For example, Burgers' equation
$$
u_t + \lambda_1 uu_x - \lambda_2 u_{xx} = 0
$$
has $N[u;\lambda] = \lambda_1 uu_x - \lambda_2 u_{xx}$ with $\lambda = (\lambda_1, \lambda_2)$.
The two types of problems we outlined can now be reframed as:
1. data-driven solution: given noisy measurements and fixed parameters $\lambda$, what can be said about $u$?
2. data-driven discovery: given noisy measurements, and assuming they come from a solution of the equation, what are the most likely parameters $\lambda$ of the equation?

This paper focuses on data-driven solutions to the restricted class of problems:
$$
u_t + N[u] = 0, \ x \in \Omega, \ t \in [0,T]
$$
with $\Omega$ a subset of $\mathbb{R}^D$. The paper presents two classes of algorithms, continuous-time and discrete-time models, and evaluates their properties and performance on benchmark problems. For reproducibility, the authors made the code available at [https://github.com/maziarraissi/PINNs](https://github.com/maziarraissi/PINNs).



### 2. Continuous Time Models
A deep neural network is used to approximate a solution of the PDE. If the network is large enough (enough hidden units), it is a universal approximator, i.e. it can approximate any function one is likely to encounter (except very pathological examples) [ref](https://www.sciencedirect.com/science/article/pii/0893608089900208). We denote the neural network by $u^\theta$, where $\theta$ represents its parameters. Finally, we define the *physics-informed neural network*
$$
f = u_t^\theta + N[u^\theta]
$$
which can be derived from $u^\theta$ via automatic differentiation. <span style="color:red">Automatic differentiation</span> is a way to compute exact derivatives (up to floating-point round-off) of numerical functions that are expressed as computational graphs, e.g. neural networks.
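
To make the idea concrete, here is a minimal sketch of automatic differentiation using dual numbers. It is an illustration only: it uses forward mode for compactness, whereas deep learning frameworks typically rely on reverse mode, and the `Dual` class and the one-neuron `tiny_network` are hypothetical names introduced for this example, not taken from the article's code.

```python
import math


class Dual:
    """Number of the form value + deriv * eps, with eps**2 == 0."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def lift(other):
        # Promote plain numbers to constants (zero derivative).
        return other if isinstance(other, Dual) else Dual(float(other))

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule: (a b)' = a' b + a b'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def tanh(z):
    """tanh extended to dual numbers via the chain rule."""
    t = math.tanh(z.value)
    return Dual(t, (1.0 - t * t) * z.deriv)


def tiny_network(x):
    """Hypothetical one-neuron 'network': 0.5 * tanh(2 x + 1)."""
    return 0.5 * tanh(2.0 * x + 1.0)


x0 = 0.3
out = tiny_network(Dual(x0, 1.0))  # seed the input with dx/dx = 1
# Hand-derived derivative, for comparison with out.deriv.
exact = 0.5 * 2.0 * (1.0 - math.tanh(2.0 * x0 + 1.0) ** 2)
```

The derivative carried in `out.deriv` matches the hand-derived value `exact` up to round-off, with no finite-difference step size to tune.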

The networks $u^\theta$ and $f$ share the same parameters $\theta$, which are learned by minimizing
$$
MSE = MSE_u + MSE_f
$$

where

$$
MSE_u = \frac{1}{N_u} \sum_{i=1}^{N_u} |u^\theta(t_u^i, x_u^i) - u^i|^2,
$$

and

$$
MSE_f = \frac{1}{N_f} \sum_{i=1}^{N_f} |f(t_f^i, x_f^i)|^2.
$$

Here $MSE_u$ is computed over the initial and boundary training data for $u$, and $MSE_f$ over collocation points sampled in the space-time domain $[0,T] \times \Omega$ using a <span style="color:red">Latin hypercube sampling strategy</span>.
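
As an illustration of Latin hypercube sampling, the sketch below splits each axis into `n_points` equal strata and places exactly one sample in every stratum of every axis. It uses only the standard library; the helper names `latin_hypercube` and `scale`, the seed, and the choice of 1000 points are assumptions made for this example, and the article's own code relies on a dedicated library instead.

```python
import random


def latin_hypercube(n_points, n_dims, rng):
    """Return n_points samples in [0, 1)^n_dims, one per stratum along every axis."""
    columns = []
    for _ in range(n_dims):
        strata = list(range(n_points))
        rng.shuffle(strata)  # independent random pairing of strata across axes
        columns.append([(k + rng.random()) / n_points for k in strata])
    # Transpose the per-axis columns into a list of points.
    return [tuple(col[i] for col in columns) for i in range(n_points)]


def scale(points, lower, upper):
    """Map unit-cube points to the box [lower, upper]."""
    return [tuple(lo + (hi - lo) * c for c, lo, hi in zip(p, lower, upper))
            for p in points]


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
unit_points = latin_hypercube(1000, 2, rng)
# Domain of the Burgers example in this notebook: t in [0, 1], x in [-1, 1].
collocation = scale(unit_points, lower=(0.0, -1.0), upper=(1.0, 1.0))
```

Compared with plain uniform sampling, this stratification prevents the points from clustering along any single coordinate.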

If the PDE problem is well posed, the initial and boundary conditions alone are enough to determine a unique solution. Moreover, typical PDE solvers (based on other methods, e.g. finite differences) also use only initial and boundary data. There is, however, no theoretical convergence guarantee for the neural network approach. Still, the empirical results show that with a deep enough neural network and enough collocation points $N_f$, the approximated solution achieves good prediction accuracy.

In what follows, the authors illustrate the continuous-time method on Burgers' equation and on the Schrödinger equation. They try different sets of hyperparameters to build empirical evidence on the model accuracy mentioned above.

We want enough collocation points to avoid overfitting: $MSE_f$ acts as a regularization term, so a larger $N_f$ means stronger regularization.
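
The composite loss defined earlier in this section can be sketched as follows. This is a simplified illustration, not the authors' implementation: `pinn_loss` and `mean_squared` are hypothetical helpers, and the network is replaced by the identically zero function, whose Burgers residual vanishes everywhere.

```python
import math


def mean_squared(values):
    return sum(v * v for v in values) / len(values)


def pinn_loss(u_model, f_model, data, collocation):
    """Return (MSE, MSE_u, MSE_f) with MSE = MSE_u + MSE_f.

    data:        list of (t, x, u_observed) on the initial/boundary set
    collocation: list of (t, x) points inside the domain
    """
    mse_u = mean_squared([u_model(t, x) - u_obs for t, x, u_obs in data])
    mse_f = mean_squared([f_model(t, x) for t, x in collocation])
    return mse_u + mse_f, mse_u, mse_f


# Assumption for illustration: the "network" is u = 0 everywhere,
# so its residual u_t + u u_x - nu u_xx is also 0 everywhere.
def zero_model(t, x):
    return 0.0


def zero_residual(t, x):
    return 0.0


# Initial-condition data u(0, x) = -sin(pi x) on a small hand-picked grid.
data = [(0.0, x, -math.sin(math.pi * x)) for x in (-0.5, 0.0, 0.5)]
collocation = [(0.5, -0.5), (0.5, 0.0), (0.5, 0.5)]

total, mse_u, mse_f = pinn_loss(zero_model, zero_residual, data, collocation)
```

Here `mse_f` is zero while `mse_u` is not: the zero function satisfies the PDE but not the data, which is why both terms are needed in the loss.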

#### 2.1 Burgers' equation
##### a. Problem definition
Burgers' equation can be derived from the Navier-Stokes equations for the velocity field by dropping the pressure gradient term. It is used in fluid mechanics, nonlinear acoustics, gas dynamics and traffic flow. In 1D, Burgers' equation reads
$$
u_t + uu_x - (0.01/\pi)u_{xx} = 0, \ \forall x \in [-1,1], \ \forall t \in [0,1]
$$
**If we add** the initial condition and the Dirichlet boundary conditions,
$$
\begin{aligned}
u(0,x) &= -\sin (\pi x) \\
u(t,-1) &= u(t,1) = 0
\end{aligned}
$$
Burgers' equation as stated has a unique solution. We now look for a neural network $u^\theta$ that approximates this solution. We define the physics-informed neural network $f$ as explained above,
$$
f = u^{\theta}_t + u^\theta u_{x}^\theta - (0.01/\pi)u_{xx}^\theta
$$
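
To see what the residual $f$ measures, the sketch below evaluates it for a closed-form candidate instead of a neural network, with derivatives written by hand rather than obtained by automatic differentiation. The candidate $u(t,x) = -e^{-t}\sin(\pi x)$ is an assumption chosen for illustration: it satisfies the initial and boundary conditions above but is not a solution of the PDE.

```python
import math

NU = 0.01 / math.pi  # viscosity coefficient of the Burgers example above


def candidate(t, x):
    """Hypothetical candidate u = -exp(-t) sin(pi x); returns (u, u_t, u_x, u_xx)."""
    decay = math.exp(-t)
    s, c = math.sin(math.pi * x), math.cos(math.pi * x)
    u = -decay * s
    u_t = decay * s
    u_x = -math.pi * decay * c
    u_xx = math.pi ** 2 * decay * s
    return u, u_t, u_x, u_xx


def burgers_residual(t, x):
    """f = u_t + u u_x - (0.01 / pi) u_xx evaluated for the candidate."""
    u, u_t, u_x, u_xx = candidate(t, x)
    return u_t + u * u_x - NU * u_xx


# Small regular grid of interior points (an arbitrary choice for this sketch).
grid = [(0.1 * i, -1.0 + 0.25 * j) for i in range(1, 10) for j in range(1, 8)]
mse_f = sum(burgers_residual(t, x) ** 2 for t, x in grid) / len(grid)
```

The candidate matches the initial and boundary data, yet `mse_f` is strictly positive, so the PDE term of the loss is what rules it out.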

##### b. Solver setup

- **Network architecture**: The neural network $u^\theta$ has 9 layers, i.e. 8 hidden layers of 20 neurons each followed by the output layer ($3021$ parameters in total), with $\tanh$ activation functions.

<center>


| Input layer | Hidden layers | Output layer |
|-------------|---------------|--------------|
|[2] = $(x,t)$|    [20] x 8   |[1] = $u(x,t)$|

<details>
<summary>Note</summary>

---
Here the paper can be confusing because it says
> "the network architecture is fixed to 9 layers with 20 neurons per hidden layer"

It seems that either the input or the output layer is not counted as a layer by the authors. However, the architecture above is the exact one used, cf. [line 145 of the Python script](https://github.com/maziarraissi/PINNs/blob/master/appendix/continuous_time_inference%20(Burgers)/Burgers.py) used to generate the results in the article.

---
</details>

</center>
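
The parameter count quoted above can be checked with a few lines, assuming a plain fully connected network with one bias per neuron (the helper `count_parameters` is introduced here for illustration):

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))


# 2 inputs -> 8 hidden layers of 20 neurons -> 1 output
layers = [2] + [20] * 8 + [1]
n_params = count_parameters(layers)
```

The breakdown is $2 \times 20 + 20 = 60$ for the first layer, $7 \times (20 \times 20 + 20) = 2940$ between hidden layers and $20 + 1 = 21$ for the output, i.e. $3021$ in total.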





- **Optimizer**: The loss function is optimized using the <span style="color:red">L-BFGS</span> algorithm. For bigger datasets, the authors suggest using a mini-batch setting with SGD or its modern variants.
- **Training data**: $N_u = 100$ randomly distributed initial and boundary data points.
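
One possible way to build such a training set for the Burgers problem is sketched below. The sampling scheme (choosing uniformly between the initial line and the two walls, then drawing a uniform position) and the name `sample_training_data` are assumptions for this example; the article's script may select its points differently.

```python
import math
import random


def sample_training_data(n_u, rng):
    """Draw n_u points (t, x, u) on the initial line t = 0 and the walls x = -1, x = 1."""
    data = []
    for _ in range(n_u):
        side = rng.choice(("initial", "left", "right"))
        if side == "initial":
            x = rng.uniform(-1.0, 1.0)
            data.append((0.0, x, -math.sin(math.pi * x)))  # u(0, x) = -sin(pi x)
        else:
            t = rng.uniform(0.0, 1.0)
            x = -1.0 if side == "left" else 1.0
            data.append((t, x, 0.0))  # u(t, -1) = u(t, 1) = 0
    return data


training_data = sample_training_data(100, random.Random(0))
```

No point inside the domain carries a target value: interior information comes only from the PDE residual at the collocation points.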


**Definitions**:
- **collocation points** = randomly chosen points in the domain where the neural network is penalized for not satisfying the PDE.

**Notes**:
- On page 4 of the PDF, the authors explain the two types of problems and mention "filtering and smoothing" for data-driven solutions. They also cite their own previous work extensively; we may need to read those papers too to get a better understanding of these "data-driven" problems.
- Text in <span style="color:red">red</span> refers to technical details that we have to explain in more depth.
- Apparently the L-BFGS algorithm is better suited to these problems; we can try to compare it with the Adam optimizer.

---

### L-BFGS algorithm
Mentioned in the article as
- a quasi-Newton, full-batch gradient-based optimization algorithm 

**quasi-Newton**


Newton methods use the local curvature of the loss function $\mathcal{L}$ to rescale the gradient step automatically: big steps along flat (low-curvature) directions and small steps along steep (high-curvature) directions. The local curvature is encoded by the Hessian $\mathbf{H}$, so the update takes a form similar to
$$
\theta_{k+1} = \theta_k - \mathbf{H}^{-1}\nabla\mathcal{L}
$$
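However, the Hessian is big (its size is the square of the number of parameters), so it takes space to store and time to invert. Quasi-Newton methods therefore build an approximation of $\mathbf{H}^{-1}$ from successive gradients; L-BFGS, the limited-memory variant, keeps only the most recent updates.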
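
The toy example below illustrates both ideas on a separable quadratic loss, under the simplifying assumption that the Hessian is diagonal. It is not an implementation of BFGS or L-BFGS, which maintain a full low-rank approximation of $\mathbf{H}^{-1}$ and use a line search; the constants and function names are chosen for this sketch only.

```python
# Separable quadratic loss L(theta) = 0.5 * (100 * a**2 + b**2):
# steep along a (curvature 100), flat along b (curvature 1).
CURVATURES = (100.0, 1.0)


def gradient(theta):
    return tuple(h * th for h, th in zip(CURVATURES, theta))


def gradient_step(theta, lr):
    """Plain gradient descent: one learning rate for every direction."""
    return tuple(th - lr * g for th, g in zip(theta, gradient(theta)))


def newton_step(theta):
    """Newton update; H is diag(CURVATURES), so H^{-1} grad is elementwise."""
    return tuple(th - g / h
                 for th, g, h in zip(theta, gradient(theta), CURVATURES))


def secant_curvatures(theta_old, theta_new):
    """Quasi-Newton idea: estimate curvature from two gradients, without the Hessian.

    Assumes every coordinate moved between the two iterates.
    """
    g_old, g_new = gradient(theta_old), gradient(theta_new)
    return tuple((gn - go) / (tn - to)
                 for gn, go, tn, to in zip(g_new, g_old, theta_new, theta_old))


theta0 = (1.0, 1.0)
after_gd = gradient_step(theta0, lr=0.01)   # lr limited by the steep direction
after_newton = newton_step(theta0)          # reaches the minimum (0, 0) in one step
estimated = secant_curvatures(theta0, after_gd)
```

With a learning rate small enough for the steep direction, gradient descent barely moves along the flat one, while the Newton step rescales each direction by its curvature. The secant estimate recovers these curvatures from gradient differences alone, which is the information quasi-Newton methods accumulate.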

**full-batch**


Every gradient is computed using all collocation and initial/boundary points; no stochastic mini-batching is used.

## 2. Key Concepts & Methods <a id="key-concepts--methods"></a>

_List and explain the main concepts, equations, and methods introduced in the article. Use bullet points, equations, or diagrams as needed._

## 3. Important Figures & Results <a id="important-figures--results"></a>

_Add screenshots, plots, or descriptions of the most important figures and results from the article. Briefly explain their significance._

## 4. Reproduction of Results <a id="reproduction-of-results"></a>

_Use the following code cells to implement and reproduce key results from the article using Python. Add explanations and visualizations as needed._

## 5. Personal Notes & Questions <a id="personal-notes--questions"></a>

_Write down your thoughts, questions, and points to discuss during your presentation or with your professor._