<a href="https://colab.research.google.com/github/Xuan-He-97/Neural-networks-and-quantum-field-theory/blob/main/Neural_networks_and_quantum_field_theory.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Gaussian Process

### Concept of Gaussian Process

Prior over functions: $\; p(f)$

For training: minimize the negative log marginal likelihood $\mathcal{L}(\theta)\;$
($\theta$ denotes the hyperparameters and the noise level)

Can we use Bayesian regression? $\;\;\;p(f|\mathcal{D}) = \frac{p(f)p(\mathcal{D}|f)}{p(\mathcal{D})}$
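The training objective mentioned above can be made concrete with a small sketch. The squared-exponential kernel, the toy data, and the names `rbf_kernel` and `negative_log_marginal_likelihood` are illustrative assumptions, not taken from the paper; the formula is the standard zero-mean GP marginal likelihood with Gaussian observation noise.

```python
import numpy as np

def rbf_kernel(xa, xb, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel (an illustrative choice, not from the notes)."""
    diff = xa[:, None] - xb[None, :]
    return variance * np.exp(-0.5 * (diff / length_scale) ** 2)

def negative_log_marginal_likelihood(x, y, length_scale, variance, noise_std):
    """-log p(y | x, theta) for a zero-mean GP with Gaussian observation noise."""
    n = x.shape[0]
    cov = rbf_kernel(x, x, length_scale, variance) + noise_std**2 * np.eye(n)
    chol = np.linalg.cholesky(cov)                              # cov = chol @ chol.T
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y))   # cov^{-1} y
    data_fit = 0.5 * y @ alpha
    complexity = np.sum(np.log(np.diag(chol)))                  # 0.5 * log det(cov)
    return data_fit + complexity + 0.5 * n * np.log(2.0 * np.pi)

# Toy data, assumed only for this example.
x_train = np.linspace(-2.0, 2.0, 8)
y_train = np.sin(x_train)
nlml = negative_log_marginal_likelihood(x_train, y_train, 1.0, 1.0, 0.1)
```

Minimizing `nlml` over `length_scale`, `variance` and `noise_std` would play the role of training; the Cholesky factorization is used here only for numerical stability.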

<br>

A Gaussian process is a collection of random variables $f$ indexed by a continuous variable, $\;\; f = f(x)$. Any finite subset of function values $\{f_n\}^N_{n=1}$ has a joint (zero-mean) Gaussian distribution.

$\;\;\;\;\;\;\;\;\;\; p(\textbf{f} | \textbf{X}) = \mathcal{N}(0, \textbf{K})$
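To illustrate $p(\textbf{f} | \textbf{X}) = \mathcal{N}(0, \textbf{K})$, the sketch below draws function values at a finite set of inputs and checks that their empirical covariance approaches $\textbf{K}$. The squared-exponential kernel and the input grid are assumptions made only for this example.

```python
import numpy as np

def rbf_kernel(xa, xb, length_scale=1.0):
    """Squared-exponential kernel (illustrative assumption)."""
    diff = xa[:, None] - xb[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 5)          # finite subset of the index set
K = rbf_kernel(x, x)
jitter = 1e-10 * np.eye(len(x))        # keeps the covariance numerically PSD

# Each row is one draw of (f(x_1), ..., f(x_5)) from the GP prior.
samples = rng.multivariate_normal(np.zeros(len(x)), K + jitter, size=20000)
empirical_K = samples.T @ samples / samples.shape[0]
```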

infinite width $\;\;\;\;\;\;\;\Longleftrightarrow\;\;\;\;\;\;\;$ GP $\;\;\;\;\;\;\;\;\;\Longleftrightarrow\;\;\;\;\;\;\;$ free field theory

finite width $\;\;\;\;\;\;\;\;\;\Longleftrightarrow\;\;\;\;\;\;\;$ NGP $\;\;\;\;\;\;\;\Longleftrightarrow\;\;\;\;\;\;\;$ effective field theory

Function over $x\;$: $\;f(x)$ 

$\;f(x)$ has a distribution: $p(f)$

$x$ is like a time/space coordinate. There are infinitely many choices of $x$, but for any finite subset of $x$, the function values $\{f_i\}^m_{i=1}$ have a joint (zero-mean) Gaussian distribution.

$\;\;\;\;\;\;\;\;\;\; p(\textbf{f} | \textbf{X}) = \mathcal{N}(0, \textbf{K})$

For neural networks, $x$ is our input data, $f$ is the network. 

$f_{\theta, N}:$ $R^{d_{in}}$ $\to R^{d_{out}}$

parameter space distribution $\to$ function space distribution

### Neural Network Corresponds to Gaussian Process

parameter space distribution $\to$ function space distribution

$f_{\theta, N}:$ $R^{d_{in}}$ $\to R^{d_{out}}$

Consider one hidden layer of $N$ units with output dimension $1$:

$\;\;\;\;\;\; f(\textbf{x}) = b_1 + \displaystyle \sum^{N}_{j=1} W_1^j \sigma(\textbf{x}; \textbf{W}_0^j, b_0^j)$

The $\textbf{W}_0^j$ are i.i.d.; $b_0^j$, $b_1$ and $W_1^j$ are zero-mean and independent.

In this paper, $b_0^j, b_1 \sim \mathcal{N}(0, \sigma_b^2)\;$, $\;\textbf{W}_0^j \sim \mathcal{N}(0, \sigma_W^2 / d_{in})\;$, $\; W_1^j \sim \mathcal{N}(0, \sigma_W^2/N)$

(normalize w.r.t. input dimension)

The kernel function depends on the activation function.

- For neural networks with $N \to \infty$:  Gaussian distribution on function space (Gaussian Process) (distribution depends on the GP kernel)

- For neural networks with $N < \infty$: distribution receives $1/N$ corrections $\to$ non-Gaussian Process (NGP)

Learning $\to$ data-induced flow of the function space distribution (it remains an NGP during training)
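The map from a parameter space distribution to a function space distribution can be sketched by sampling many random networks with the initialization stated above and evaluating them on a few inputs. The `tanh` activation, the width, the inputs, and the function name `sample_network_outputs` are assumptions for illustration only.

```python
import numpy as np

def sample_network_outputs(x, width, n_nets, sigma_w=1.0, sigma_b=1.0, seed=0):
    """Evaluate n_nets random single-hidden-layer networks on inputs x.

    x has shape (n_points, d_in); the result has shape (n_nets, n_points).
    """
    rng = np.random.default_rng(seed)
    n_points, d_in = x.shape
    # W_0 ~ N(0, sigma_w^2 / d_in), b_0, b_1 ~ N(0, sigma_b^2), W_1 ~ N(0, sigma_w^2 / N)
    w0 = rng.normal(0.0, sigma_w / np.sqrt(d_in), size=(n_nets, d_in, width))
    b0 = rng.normal(0.0, sigma_b, size=(n_nets, 1, width))
    w1 = rng.normal(0.0, sigma_w / np.sqrt(width), size=(n_nets, width))
    b1 = rng.normal(0.0, sigma_b, size=(n_nets, 1))
    hidden = np.tanh(x[None, :, :] @ w0 + b0)      # tanh is an assumed activation
    return np.einsum("npw,nw->np", hidden, w1) + b1

x = np.array([[0.2], [0.5], [1.0]])
outputs = sample_network_outputs(x, width=50, n_nets=20000)
two_point = outputs.T @ outputs / outputs.shape[0]   # empirical G^(2)(x_i, x_j)
```

Each row of `outputs` is one draw from the function space distribution restricted to the three inputs; `two_point` is a Monte Carlo estimate of the kernel.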



### NN-QFT correspondence

EFT approach can be used for any architecture admitting a GP limit.

Admit GP limits: 

- multilayer perceptrons, recurrent NNs, skip connections, convolutions, graph convolutions, pooling, batch/layer normalization, attention.

- both randomly initialized and appropriately trained networks.

### Meaning of NN-QFT Correspondence

$P[f] \sim e^{-S}$

overparameterization (increasingly large numbers of parameters) $\Leftrightarrow$ neural network likelihood simplicity (simple distributions)

because of:

- asymptotic NN $\to$ GP

- non-Gaussian correction is $1/N$ (actually need only a single number to correct correlation functions)

## Asymptotic NN and Free Field Theory

### GP/asymptotic NN and Free QFT correspondence

| GP/asymptotic NN | Free QFT |
| --- | --- |
| Input x | External space or momentum space point |
| Kernel $K(x_1, x_2)$ | Feynman propagator |
| Asymptotic NN $f(x)$ | Free Field |
| Log-likelihood | Free action $S_{GP}$ |

### GP/asymptotic NN

$\theta \sim P(\theta)$ + network architecture $\to$ $P(f)$

$\{f(x_1), ..., f(x_k)\} \sim \mathcal{N}(\mu, \Xi^{-1})$ 

assumption: $\mu = 0$

$(\Xi^{-1})_{ij} = K_{ij}$

- Correlation function (n-pt functions)

$G^{(n)}(x_1, ..., x_n) = \frac{\int df f_1 ... f_n e^{- S}}{Z}$

- partition function $Z = \int df e^{-S}$

- discrete action $S = \frac{1}{2} f_i \Xi_{ij} f_j$ (Einstein summation)

- continuous action $S = \frac{1}{2} \int d^{d_{in}}x d^{d_{in}}x' f(x) \Xi(x, x')f(x')$

- inverse covariance function $\int d^{d_{in}}x' K(x, x') \Xi(x', x'') = \delta^{(d_{in})} (x - x'')$

- local GP $S = \frac{1}{2} \int d^{d_{in}}x f(x) \Xi(x) f(x)$  where $\Xi(x, x') = \Xi(x) L_\sigma(x, x')$ and $L_0(x, x') = \delta^{(d_{in})} (x - x')$

- ultra-local GP: f is constant $S = \frac{1}{2} f \Xi f$ where $K\Xi = 1$
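On a finite set of inputs the inverse-covariance relation becomes a matrix inverse, $K_{ij}\Xi_{jk} = \delta_{ik}$, and the discrete action is a quadratic form. A minimal sketch, with an assumed illustrative kernel and arbitrary function values:

```python
import numpy as np

x = np.array([0.0, 0.5, 1.0, 1.5])
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)   # illustrative kernel matrix
Xi = np.linalg.inv(K)                                # discrete inverse covariance

def gp_action(f, Xi):
    """Discrete GP action S = (1/2) f_i Xi_ij f_j."""
    return 0.5 * f @ Xi @ f

f = np.array([0.1, -0.2, 0.3, 0.0])   # arbitrary function values at the inputs
S = gp_action(f, Xi)
```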

Question: what are the propagator (the probability or amplitude for a particle to propagate from one point to another) and the action in QFT?

### Free Field Theory

- quantum field $\phi(x)$

- path integral $Z = \int D \phi e^{-S[\phi]}$

- action $S[\phi] = \int d^dx \phi(x) (\Box + m^2) \phi(x)$

question A.21

### Computing correlation function

$G_{GP}^{(n)}(x_1, ..., x_n) = \sum_{p\in \textrm{Wick}(x_1, ..., x_n)} K(a_1, b_1)...K(a_{n/2}, b_{n/2})$
(can be derived)
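The sum over Wick pairings can be enumerated directly. The sketch below assumes the kernel is given as a matrix over a finite list of inputs; the helper names `wick_pairings` and `gp_correlator` and the example kernel are hypothetical.

```python
import numpy as np

def wick_pairings(indices):
    """Yield every way to split `indices` into unordered pairs."""
    if not indices:
        yield []
        return
    first, rest = indices[0], indices[1:]
    for pos, partner in enumerate(rest):
        remaining = rest[:pos] + rest[pos + 1:]
        for tail in wick_pairings(remaining):
            yield [(first, partner)] + tail

def gp_correlator(K, points):
    """G_GP^(n) at the given point indices via the sum over Wick pairings."""
    if len(points) % 2 == 1:
        return 0.0                      # odd-point functions vanish
    return sum(
        np.prod([K[a, b] for a, b in pairing])
        for pairing in wick_pairings(list(points))
    )

x = np.array([0.2, 0.4, 0.6, 0.8, 1.0, 1.2])
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)   # illustrative kernel matrix
G4 = gp_correlator(K, (0, 1, 2, 3))
n_pairings_6 = len(list(wick_pairings(list(range(6)))))
```

For $n = 4$ there are 3 pairings and for $n = 6$ there are 15, which is where the "15 combinations" in the 6-pt function come from.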

Diagrammatic representations: pairs of points, due to the Gaussian nature of $Z_{GP}$

Question: how to understand

### GP in neural networks

- Why GP: by CLT

$f_{\theta, N}(x) = z_1^k = \sum_{j=1}^N W_1^{jk}x_1^j + b_1^k$

<br>

- $f$ has two parts

$f = f_b \text{(ultra-local)} + f_W \text{(depends on model and activation)}$ $\to$ $G^{(2)}(x_1, x_2) = G_b^{(2)}(x_1, x_2) + G_W^{(2)}(x_1, x_2)$

$G^{(4)}(x_1, x_2, x_3, x_4) = G_b^{(4)}(x_1, x_2, x_3, x_4) + G_W^{(4)}(x_1, x_2, x_3, x_4) + \sum_{i \ne j \ne k \ne l} G_b^{(2)}(x_i, x_j) G_W^{(2)}(x_k, x_l)$

TODO: kernel derivation

### Three networks

- Erf-net

- ReLu-net

- Gauss-net (translation invariant kernel)

## Experiments of GP limit

### How to measure

- normalized deviation $m_n$

$m_n(x_1, ..., x_n) = \Delta G^{(n)}(x_1, ..., x_n) / G_{GP}^{(n)}(x_1, ..., x_n) $

$\Delta G^{(n)}(x_1, ..., x_n) = G^{(n)}(x_1, ..., x_n) - G_{GP}^{(n)}(x_1, ..., x_n)$

<br>

$G^{(n)}(x_1, ..., x_n) = \mathbb{E}[f_{\alpha}(x_1)...f_{\alpha}(x_n)]$ is measured empirically over $100 \text{ (experiments)} \times 10^5 \text{ (nets)}$ network draws

Weights and biases are drawn from Gaussian distributions with mean zero and standard deviation 1

<br>

$G_{GP}^{(n)}(x_1, ..., x_n)$ is computed using Wick contraction

<br>

$d_{in} = d_{out} = 1$
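A sketch of how $m_4$ could be estimated from an ensemble of outputs. As a sanity check, the "networks" here are drawn from an exact GP, so $m_4$ should be consistent with zero up to Monte Carlo noise; the kernel matrix, sample size, and function names are assumptions for illustration, not the paper's setup.

```python
import numpy as np

def empirical_correlator(outputs, points):
    """Monte Carlo estimate of G^(n): mean over nets of prod_i f(x_i)."""
    return np.mean(np.prod(outputs[:, list(points)], axis=1))

def gp_four_point(K, i, j, k, l):
    """Wick-contraction result for the GP 4-pt function."""
    return K[i, j] * K[k, l] + K[i, k] * K[j, l] + K[i, l] * K[j, k]

def normalized_deviation_4(outputs, K, points):
    """m_4 = (G^(4) - G_GP^(4)) / G_GP^(4)."""
    g_gp = gp_four_point(K, *points)
    return (empirical_correlator(outputs, points) - g_gp) / g_gp

rng = np.random.default_rng(1)
x = np.array([0.2, 0.4, 0.6, 0.8])
K = 1.0 + np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)   # assumed kernel matrix
outputs = rng.multivariate_normal(np.zeros(4), K, size=200000)
m4 = normalized_deviation_4(outputs, K, (0, 1, 2, 3))
```

Replacing `outputs` with outputs of finite-width networks would give the non-zero deviations discussed in the results.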

### Inputs

![alt text](https://user-images.githubusercontent.com/79208856/119465386-33560c80-bd76-11eb-876a-5d61c736fbd9.png)

- inputs of ReLu-net and Erf-net are chosen to be positive so that the kernel is always positive

- inputs are chosen where the finite width NGP is well approximated by local-operator correction terms to the associated log-likelihood

Question: what are local-operator correction terms?

### Results

![alt text](https://user-images.githubusercontent.com/79208856/119467356-03a80400-bd78-11eb-86c1-87690245757c.png)

- 2-pt function is exactly the kernel even away from GP (can be proved)

- background is the average standard deviation of $m_n$ across the 100 experiments

- $\Delta G^{(n)} \propto N^{-1}$ for $n = 4, 6$  (can be proved)

- connected contribution of the 4-pt function is same as $\Delta G^{(4)}$ $\propto N^{-1}$  (can be proved)

- connected contribution of the 6-pt function is $\propto N^{-2}$  (can be proved)

- $G^{(2k)}(x_1, ..., x_{2k})|_{\text{connected}} = \left[ G^{(2k)}(x_1, ..., x_{2k}) - S(x_1, ..., x_{2k}) \right]|_{\text{internal indices same}}$

- $G^{(2k)}(x_1, ..., x_{2k})|_{\text{connected}} \propto \frac{1}{N^{k-1}}$

- $G^{(2k)}(x_1, ..., x_{2k})|_{\text{connected}}$ is computed from expectation of layer postactivation value

![alt text](https://user-images.githubusercontent.com/79208856/119483310-68b72600-bd87-11eb-968c-ff39dbc32e92.png)

$\Delta G^{(2)} = 0$

$\Delta G^{(4)} \propto N^{-1}$

$\Delta G^{(6)} \propto N^{-1} + N^{-2} \propto N^{-1}$

$G^{(4)}(x_1, ..., x_{4})|_{\text{connected}} = \Delta G^{(4)} \propto N^{-1}$

$G^{(6)}(x_1, ..., x_{6})|_{\text{connected}} = \Delta G^{(6)} - \sum G^{(4)}(x_1, ..., x_{4})|_{\text{connected}} \; G^{(2)}(x_5, x_6)  \propto N^{-2}$
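The $N^{1-k}$ scaling of the connected $2k$-pt function can be motivated by a short cumulant argument. This is a simplified sketch, not the paper's proof: it evaluates the network at a single input $x$, writes $f = b_1 + \sum_{j} a_j$ with $a_j = W_1^j \sigma_j$, and introduces $\tilde{a}_j := \sqrt{N} a_j$ as a hypothetical rescaled variable whose distribution does not depend on $N$. Writing $C_{2k}$ for the $2k$-th cumulant (the connected correlator at coincident points), the $a_j$ are i.i.d. across $j$, cumulants are additive over independent terms, and the Gaussian bias $b_1$ only contributes to $C_2$, so for $k \geqslant 2$:

$$C_{2k}(f) = \sum_{j=1}^{N} C_{2k}(a_j) = N \, C_{2k}\left(\frac{\tilde{a}}{\sqrt{N}}\right) = N \cdot N^{-k} \, C_{2k}(\tilde{a}) = \frac{C_{2k}(\tilde{a})}{N^{k-1}}$$

For $k = 2$ this reproduces the $N^{-1}$ scaling of the connected 4-pt function, and for $k = 3$ the $N^{-2}$ scaling of the connected 6-pt function.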

## Non-Gaussian processes with effective field theory

### NGP NN and EFT correspondence

| NGP NN | EFT |
| --- | --- |
| Input x | External space or momentum space point |
| Kernel $K(x_1, x_2)$ | Free or exact propagator |
| NN output $f(x)$ | Interacting Field |
| Non-Gaussianities | Interactions |
| Non-Gaussian coefficients | Coupling strengths |
| Log-likelihood | EFT action $S$ |

### EFT: scales

$[S_{GP}] = 0 \to [f] = - \frac{2d_{in} + [\Xi]}{2}$

for $\mathcal{O}_k := g_k f(x)^k$ in $\Delta S $ as $ \int d^{d_{in}}x \mathcal{O}_k$

$[g_k] = -d_{in} + \frac{k(2d_{in} + [\Xi])}{2} =  -d_{in} - \frac{k([K])}{2} $

$[\Xi] + [K] = -2d_{in}$

$\mathcal{O}_k$ can be ignored for sufficiently large $k$
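The dimension counting above is simple enough to tabulate. The helper names below are hypothetical; the formulas are the ones stated in this section, using $[\Xi] + [K] = -2d_{in}$ to express everything through $[K]$.

```python
def field_dimension(kernel_dim):
    """[f] = -(2 d_in + [Xi]) / 2, which equals [K] / 2 since [Xi] + [K] = -2 d_in."""
    return kernel_dim / 2.0

def coupling_dimension(k, d_in, kernel_dim):
    """[g_k] = -d_in - k [K] / 2 for the operator g_k f(x)^k."""
    return -d_in - k * kernel_dim / 2.0

# Example: d_in = 1 and [K] = 2.
dim_lambda = coupling_dimension(4, d_in=1, kernel_dim=2)   # quartic coupling
dim_kappa = coupling_dimension(6, d_in=1, kernel_dim=2)    # sextic coupling
```

With $[K] > 0$ the dimension becomes more negative as $k$ grows, which is the sense in which high-$k$ operators can be ignored.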

### How to construct the effective action of an NGP

- Determine the symmetries respected by the system of interest.

- Fix an upper bound $k$ on the dimension of any operator appearing in $\Delta S$.

- Define $\Delta S$ to contain all operators of dimension $\leqslant k $ that respect the symmetries.

### Perturbation Theory

$G^{(n)}(x_1, ..., x_n) = \frac{\int df f_1 ... f_n e^{- S}}{Z_0}$

$Z_0 = \int df e^{-S}$

$S = S_{GP} + \Delta S$

$S_{GP}$ is Gaussian, so integrals against $e^{-S_{GP}}$ can be computed exactly, but the integral with $\Delta S$ in the exponent cannot be computed in closed form.

But $\Delta S$ is small $\to$ perturbation theory.

### Cutoffs

$S \to S_{\Lambda}$ to deal with divergences. 

Predictions should be insensitive to the choice of $\Lambda$ $\to$ the couplings obey RGEs (ReLu-net satisfies them)

$\Delta S = \int d^{d_{in}}x \left[ \lambda f(x)^4 + \kappa f(x)^6 \right]$

question: why $k \geqslant 4$

- odd-order coefficients are zero. Odd-point functions must vanish because the means of the weights and biases are zero; equivalently, $S$ must have an $f \to -f$ symmetry.

### Represent correlation function diagrammatically - Feynman rules

$O(\lambda^l \kappa^m)$ correction to $G^{(n)}$, $l$ 4-pt interaction vertices and $m$ 6-pt interaction vertices.

$G^{(2)}(x_1, x_2) = K(x_1, x_2) \; \to \; S = S_{G} + \Delta S $ 

$G^{(n)}(x_1, ..., x_n) = \frac{\int df f(x_1)...f(x_n) \left[ 1 - \int d^{d_{in}}x g_k f(x)^k + O(g_k^2) \right] e^{-S_{GP}} / Z_{GP, 0}}{\int df \left[ 1 - \int d^{d_{in}}x g_k f(x)^k + O(g_k^2) \right] e^{-S_{GP}} / Z_{GP, 0}} $

<br>

- Feynman rules
![alt text](https://user-images.githubusercontent.com/79208856/119619696-d706f180-be36-11eb-8729-43821cd96c78.png)

- Feynman lines from --- to - - $\Leftrightarrow$ propagator of $S_G \to $ 2-pt function

- Expand to leading order in $\lambda$ and $\kappa$

question eq.70 why linear? Taylor expansion?

### Symmetries and Interaction strength coefficient (also called coupling constants in QFT)

- Technical naturalness: a coupling $g$ appearing in $\Delta S$ may be small relative to $\Lambda$ if a symmetry is restored when $g$ is set to zero.

$\Delta g = 0$ if $S$ has a symmetry.

$g \to 0 \Rightarrow \Delta g \to 0$

<br>

- Technical naturalness for NGP neural networks:

$\lambda (x) = \bar{\lambda} + \delta \lambda (x)$

if there is symmetry $T$ in $\delta \lambda \to 0$ limit

then $\frac{\delta \lambda}{\lambda} \leqslant 1$

which means the couplings in NGPs are (nearly) constant. (Not proven, but it can be tested: the couplings are effectively constant in Gauss-net.)

$T$ : $K(x, y)$ is translation invariant

$\lambda, \; \kappa $ are constants by the symmetry at the GP limit and technical naturalness.

### Independence of single-layer networks

$S_{GP} = S_{GP}^b + S_{GP}^W = \frac{1}{2} \int d^{d_{in}}x d^{d_{in}}y \left[ f_b(x) \Xi_b (x, y) f_b(y) + f_W(x) \Xi_W (x, y) f_W(y) \right]$

$K(x, y) = K_b(x, y) + K_W(x, y)$

$\Delta S = \Delta S_b + \Delta S_W = \Delta S_W$

similarly $G^{(n)}(x_1, ..., x_n) = \frac{\int df f(x_1)...f(x_n) \left[ 1 - \int d^{d_{in}}x g_k f_W(x)^k + O(g_k^2) \right] e^{-S_{GP}} / Z_{GP, 0}}{\int df \left[ 1 - \int d^{d_{in}}x g_k f_W(x)^k + O(g_k^2) \right] e^{-S_{GP}} / Z_{GP, 0}} \; \Rightarrow \; $ Feynman diagrams

Dashed lines in the Feynman diagrams correspond to $K_W(u, v)$

## Experiments of NGP

### Compute coupling constants

- if $\lambda$ is constant

&emsp; $\lambda = \frac{K(x_1, x_2)K(x_3, x_4) + K(x_1, x_3)K(x_2, x_4) + K(x_1, x_4)K(x_2, x_3) - G^{(4)}(x_1, x_2, x_3, x_4)}{24 \int d^{d_{in}}y K_W(x_1, y)K_W(x_2, y)K_W(x_3, y)K_W(x_4, y)}$

- if $\lambda$ is not constant

&emsp; $\lambda (y) = \bar{\lambda} + \delta \lambda (y)$

&emsp; $\bar{\lambda} = \frac{K(x_1, x_2)K(x_3, x_4) + K(x_1, x_3)K(x_2, x_4) + K(x_1, x_4)K(x_2, x_3) - G^{(4)} \; (x_1, x_2, x_3, x_4)}{24 \int d^{d_{in}}y \; \Delta_{1234y}} - \frac{\int d^{d_{in}}y \; \delta \lambda (y) \; \Delta_{1234y}}{\int d^{d_{in}}y \; \Delta_{1234y}}$

&emsp;  Where $\Delta_{1234y} = K_W(x_1, y)K_W(x_2, y)K_W(x_3, y)K_W(x_4, y)$

- if $\delta \lambda (y)$ is small 

&emsp; $\lambda \backsimeq \bar{\lambda} \backsimeq \text{mean} (\lambda_m(x_1, x_2, x_3, x_4))$

- for $\kappa$

&emsp; $\delta'(x_1, ..., x_6) := G^{(6)} \; (x_1, ..., x_6) - \sum_{\text{15 combinations}} \left[ K(x_i, x_j)K(x_k, x_l)K(x_m, x_n) - 24 \int d^{d_{in}}y \lambda K_W(x_i, y)K_W(x_j, y)K_W(x_k, y)K_W(x_l, y)K_W(x_m, x_n) \right] \;$

&emsp; $\delta(x_1, ..., x_6) := \frac{\delta'(x_1, ..., x_6)}{G^{(6)} \; (x_1, ..., x_6)} $

&emsp; $\delta(x_1, ..., x_6) \to 0$ in large $N$ and large $\Lambda$ limits
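The constant-$\lambda$ formula can be sketched numerically for $d_{in} = 1$. The code below assumes the weight part of the Gauss-net kernel for $K_W$, $\sigma_b = \sigma_W = 1$, a finite integration range as the cutoff, and a made-up value of $\lambda$ that is used to forward-model $G^{(4)}$ and then recovered; none of these numbers come from the paper.

```python
import numpy as np

def k_w(x, y, sigma_w=1.0):
    """Weight part of the Gauss-net kernel for d_in = 1 (assumed K_W)."""
    return sigma_w**2 * np.exp(-0.5 * sigma_w**2 * (x - y) ** 2)

def vertex_integral(xs, cutoff=10.0, n_grid=4001):
    """Trapezoid estimate of int dy prod_i K_W(x_i, y) over [-cutoff, cutoff]."""
    y = np.linspace(-cutoff, cutoff, n_grid)
    integrand = np.prod([k_w(xi, y) for xi in xs], axis=0)
    dy = y[1] - y[0]
    return dy * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1]))

def wick_four(K):
    """Sum over the three pairings of points 0..3."""
    return K[0, 1] * K[2, 3] + K[0, 2] * K[1, 3] + K[0, 3] * K[1, 2]

def lambda_from_four_point(g4, K, xs):
    """Invert the leading-order relation G4 = Wick sum - 24 lambda * integral."""
    return (wick_four(K) - g4) / (24.0 * vertex_integral(xs))

xs = np.array([-0.5, -0.2, 0.3, 0.6])
K = 1.0 + k_w(xs[:, None], xs[None, :])       # sigma_b = 1 assumed
lam_true = 0.01                                # hypothetical coupling
g4 = wick_four(K) - 24.0 * lam_true * vertex_integral(xs)
lam_est = lambda_from_four_point(g4, K, xs)
```

In an actual experiment `g4` would be the measured 4-pt function, and averaging `lam_est` over many input 4-tuples corresponds to the $\text{mean}(\lambda_m)$ estimate above.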

### Results

![alt text](https://user-images.githubusercontent.com/79208856/119772490-d3847080-bef1-11eb-861d-961438b21d71.png)

![alt text](https://user-images.githubusercontent.com/79208856/119772541-f020a880-bef1-11eb-806b-ff150dfdbd2a.png)

## Experiments of fitting EFT parameters

### Compute deviation from GP in 4-pt function

$\Delta S_W = \int d^{d_{in}}x \lambda (x) f_W(x)^4$

We design three functional forms for $\lambda$:

$\Delta G^{(4)}_{EFT} = \lambda_0 T_0(x_1, ..., x_4) + \lambda_2 T_2(x_1, ..., x_4) + \lambda_{NL} T_{NL}(x_1, ..., x_4)$

- $T_0(x_1, ..., x_4) = 24 \int d^{d_{in}}x K_W(x_1, x)K_W(x_2, x)K_W(x_3, x)K_W(x_4, x)$

- $T_2(x_1, ..., x_4) = 24 \int d^{d_{in}}x x^2 K_W(x_1, x)K_W(x_2, x)K_W(x_3, x)K_W(x_4, x)$

- nonlocal term: $T_{NL}(x_1, ..., x_4) = 8 \int d^{d_{in}}x d^{d_{in}}y \left[ K_W(x_1, x)K_W(x_2, x)K_W(x_3, y)K_W(x_4, y) + K_W(x_1, x)K_W(x_2, y)K_W(x_3, x)K_W(x_4, y) + K_W(x_1, x)K_W(x_2, y)K_W(x_3, y)K_W(x_4, x) \right]$


Predict $\Delta G^{(4)}$

### Experimental measurements of deviation

$\Delta G^{(4)}_{exp}$ is averaged over $10^7$ experiments

thus we can find values of $\lambda$ to minimize the mean squared error between $\Delta G^{(4)}_{exp}$ and $\Delta G^{(4)}_{EFT}$

test effectiveness of measurement value by making predictions for the test set
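Since $\Delta G^{(4)}_{EFT}$ is linear in $(\lambda_0, \lambda_2, \lambda_{NL})$, minimizing the mean squared error is an ordinary least-squares problem. The sketch below uses random placeholder values for the tensors $T_0, T_2, T_{NL}$ and hypothetical couplings purely to show the fitting and held-out evaluation steps; it contains no real measurements.

```python
import numpy as np

def fit_couplings(T, delta_g4):
    """Least-squares couplings minimizing ||T @ lam - delta_g4||^2.

    T has one row per input 4-tuple and columns (T_0, T_2, T_NL).
    """
    lam, *_ = np.linalg.lstsq(T, delta_g4, rcond=None)
    return lam

rng = np.random.default_rng(0)
T_train = rng.normal(size=(40, 3))            # placeholder tensors, not real data
lam_true = np.array([0.02, -0.005, 0.001])    # hypothetical couplings
dg4_train = T_train @ lam_true + 1e-4 * rng.normal(size=40)   # synthetic "measurement"
lam_fit = fit_couplings(T_train, dg4_train)

# Held-out 4-tuples play the role of the test set.
T_test = rng.normal(size=(10, 3))
test_mse = np.mean((T_test @ lam_fit - T_test @ lam_true) ** 2)
```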

## Wilsonian Renormalization



### How couplings change with $\Lambda$

- goal of RG: solve divergences arising from integrals over the space of inputs

using cutoff: $\Delta S_{\Lambda} = \int^{\Lambda}_{-\Lambda} d^{d_{in}}x \sum_{l \leqslant k} g_{\mathcal{O}_l}(\Lambda)\mathcal{O}_l$

- How to understand cutoff: MNIST maximum brightness and darkness scale.

- We can extract values of $g_{\mathcal{O}_l}(\Lambda)$ from experimental result of $G^{(n)}(x_1, ... , x_n)$

- $\Delta S_{\Lambda}$ should satisfy $\frac{d G^{(n)}\;(x_1, ... , x_n)}{d \log \Lambda} = 0$

- $\beta$ -function: $\beta(g_{\mathcal{O}_l}) := \frac{d(g_{\mathcal{O}_l}(\Lambda))}{d \log \Lambda}$

- RG flow: flow in the couplings induced by varying $\Lambda$

- Along a direction of flow, couplings can be irrelevant (decrease), relevant (increase), marginal (same). 

- Since $[\lambda] = -d_{in} - 2 [K]$, $\;\;\;$ $[\kappa] = -d_{in} - 3[K]$. When $[K] > 0$, $\kappa$ decreases more quickly than $\lambda$
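Given couplings extracted at several cutoffs, the $\beta$-function can be estimated by a finite difference in $\log \Lambda$. The power-law running used below is a hypothetical example chosen because its $\beta$-function is known exactly ($g \propto \Lambda^{p} \Rightarrow \beta = p\,g$); it is not a result from the paper.

```python
import numpy as np

def beta_function(couplings, cutoffs):
    """Finite-difference estimate of d g / d log(Lambda)."""
    return np.gradient(couplings, np.log(cutoffs))

cutoffs = np.geomspace(1.0, 100.0, 200)
g = 0.5 * cutoffs ** -3.0        # hypothetical coupling running as a power law
beta = beta_function(g, cutoffs)
```

A negative `beta` that drives the coupling toward zero as $\Lambda$ grows corresponds to an irrelevant coupling in the terminology above.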

### $\beta$ - functions

- Kernels have model independent term and model dependent term

$K(x, x')=\alpha + \varsigma(x, x')$

<br>

- Thus correlation functions can be written in combination of model independent and dependent terms

$G^{(4)}\;(x_1, ... , x_4) = \gamma_{4, 0} + \varrho_{4, 0} - \lambda \int^{\Lambda}_{-\Lambda} d^{d_{in}}x (\gamma_{4, \lambda} + \varrho_{4, \lambda}) - \kappa \int^{\Lambda}_{-\Lambda} d^{d_{in}}x (\gamma_{4, \kappa} + \varrho_{4, \kappa})$

$G^{(6)}\;(x_1, ... , x_6) = \gamma_{6, 0} + \varrho_{6, 0} - \lambda \int^{\Lambda}_{-\Lambda} d^{d_{in}}x (\gamma_{6, \lambda} + \varrho_{6, \lambda}) - \kappa \int^{\Lambda}_{-\Lambda} d^{d_{in}}x (\gamma_{6, \kappa} + \varrho_{6, \kappa})$

<br>

- take the derivative with respect to $\log \Lambda$ and drop the last term ($\kappa$ is negligible)

![alt text](https://user-images.githubusercontent.com/79208856/120148346-a51cd300-c21a-11eb-92c3-ce00cd3d4d9c.png)

### RG in neural networks

- Because bias term is Gaussian

$\Delta S_{W, \Lambda} = \int^{\Lambda}_{-\Lambda} d^{d_{in}}x \displaystyle \sum_{l \leqslant k} g_{\mathcal{O}_l}(\Lambda)\mathcal{O}_l$

$[\lambda] = -d_{in} - 2 [K_W]$, $\;\;\;$ $[\kappa] = -d_{in} - 3[K_W]$. 

<br>

- For network architecture

$K_W(x, x') = \varsigma(x, x')$

$G^{(4)}\;(x_1, ... , x_4) = \gamma_{4, 0} + \varrho_{4, 0} - \lambda \int^{\Lambda}_{-\Lambda} d^{d_{in}}x \varrho_{4, \lambda} - \kappa \int^{\Lambda}_{-\Lambda} d^{d_{in}}x \varrho_{4, \kappa}$

$G^{(6)}\;(x_1, ... , x_6) = \gamma_{6, 0} + \varrho_{6, 0} - \lambda \int^{\Lambda}_{-\Lambda} d^{d_{in}}x \varrho_{6, \lambda} - \kappa \int^{\Lambda}_{-\Lambda} d^{d_{in}}x \varrho_{6, \kappa}$

<br>

- RG equations

![alt text](https://user-images.githubusercontent.com/79208856/120154133-1ca23080-c222-11eb-8f34-8e651984f7bd.png)

### RG flows in Gauss-net

$K_{\text{Gauss}}(x, x') = \sigma_b^2 + \sigma_W^2 \exp{\left[ - \frac{\sigma_W^2 \; |x-x'|^2}{2d_{in}}\right]}$

- integrals over the Gauss-net kernel converge for a large integration range, so we can take $\Lambda \to \infty$

- for small $x$ and $x'$, $[K_W]<0$ 
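The Gauss-net kernel above is easy to evaluate directly, which also makes its translation invariance explicit. A minimal sketch; the function name and the default $\sigma_W = \sigma_b = 1$ are assumptions.

```python
import numpy as np

def gauss_net_kernel(x, xp, sigma_w=1.0, sigma_b=1.0):
    """K_Gauss(x, x') for inputs of shape (d_in,) or scalars (d_in = 1)."""
    x, xp = np.atleast_1d(x), np.atleast_1d(xp)
    d_in = x.shape[0]
    sq_dist = np.sum((x - xp) ** 2)
    return sigma_b**2 + sigma_w**2 * np.exp(-sigma_w**2 * sq_dist / (2.0 * d_in))

k_same = gauss_net_kernel(0.3, 0.3)        # sigma_b^2 + sigma_w^2 at coincident points
k_a = gauss_net_kernel(0.3, 0.8)
k_b = gauss_net_kernel(1.3, 1.8)           # same separation, shifted inputs
k_far = gauss_net_kernel(0.0, 100.0)       # only the bias part survives
```

The kernel depends on the inputs only through $|x - x'|$, so `k_a` and `k_b` agree; at large separation only the constant $\sigma_b^2$ remains.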

### ReLu-net

$K_W(x, x') = \varsigma(x, x') = h_1(x, x')h_2(\theta)$

$h_1(x, x') = \sqrt{(\sigma_b^2 + \frac{\sigma_W^2}{d_{in}} x \cdot x)(\sigma_b^2 + \frac{\sigma_W^2}{d_{in}} x' \cdot x')}$

$h_2(\theta) = (\sin \theta + (\pi - \theta) \cos \theta)$

- $[K_W(x, x')] = 2$

- $[\lambda] = -d_{in} - 4, \; [\kappa] = -d_{in} - 6$

![alt text](https://user-images.githubusercontent.com/79208856/120420985-074f1280-c398-11eb-92d2-edc0fc4e10b5.png)