# Neural Tangent Kernel

Despite their widespread practical success, the training behaviour of Deep Neural Networks (DNNs) remains poorly understood. Much attention has been paid to convergence in parameter space, but it is well known that the loss surface of a DNN is high-dimensional and non-convex. It contains many saddle points, valleys and symmetries, which make it unreasonable to expect a gradient-based optimiser to converge to a global minimum, if one exists. Nevertheless, DNNs are known to generalise well, even though they appear heavily over-parametrised.

This report studies some recent papers that aim to shed light on the dynamics of DNN training. The main idea is to view the training process in terms of the function $f_{\theta}$ that the DNN represents, rather than the parameters $\theta$.

Consider a fully connected, feed-forward neural network with layers numbered from $0$ (input) to $L$ (output), where layer $l$ has width $n_l$. Denote the training set by $\mathcal{D} \subseteq \mathbb{R}^{n_0} \times \mathbb{R}^{n_L}$, and let $\mathcal{X}=\{ x : (x,y) \in \mathcal{D} \}$ and $\mathcal{Y}=\{ y : (x,y) \in \mathcal{D} \}$ be the corresponding sets of inputs and labels. Following the parametrisation used by [Jacot et al.,  2018](https://arxiv.org/pdf/1806.07572.pdf), the recurrence relation of the neural network can then be described as 
\begin{align}
\alpha^{(0)} (x; \theta) &= x \\
\tilde{\alpha}^{(l+1)} (x ; \theta) &= \frac{1}{\sqrt{n_l}} W^{(l)} \alpha^{(l)} (x; \theta) + \beta b^{(l)} \\
\alpha^{(l)} (x; \theta) &= \sigma (\tilde{\alpha}^{(l)} (x ; \theta)) 
\end{align}
where the functions $\tilde{\alpha}^{(l)}(\cdot; \theta):\mathbb{R}^{n_0} \to \mathbb{R}^{n_l}$ and $\alpha^{(l)} (\cdot; \theta):\mathbb{R}^{n_0} \to \mathbb{R}^{n_l} $ are the preactivations and activations at layer $l$. The nonlinearity $\sigma : \mathbb{R} \to \mathbb{R}$ is applied entrywise, and the parameters $\theta$ consist of the trainable variables contained in the connection matrices $W^{(l)} \in \mathbb{R}^{n_{l+1} \times n_l}$ and the bias vectors $b^{(l)} \in \mathbb{R}^{n_{l+1}}$ for $l=0, \dots, L-1$. As the widths of the hidden layers $n_1, \dots, n_{L-1}$ grow to infinity, the factors $\frac{1}{\sqrt{n_l}}$ ensure that the neural network has a consistent asymptotic behaviour, while the factor $\beta$ balances the influence of the connection weights and the biases when $n_l$ is large. 
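
The recurrence above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the helper names `init_params` and `forward`, the standard Gaussian initialisation, the choice $\sigma = \tanh$ and the value $\beta = 0.1$ are all assumptions made here. Each $W^{(l)}$ is stored with shape $(n_{l+1}, n_l)$ so that the product $W^{(l)} \alpha^{(l)}$ is defined.

```python
import numpy as np

def init_params(widths, rng):
    """Draw every W^(l) and b^(l) i.i.d. from N(0, 1) (assumed initialisation)."""
    params = []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        W = rng.standard_normal((n_out, n_in))  # maps R^{n_l} to R^{n_{l+1}}
        b = rng.standard_normal(n_out)
        params.append((W, b))
    return params

def forward(params, x, beta=0.1, sigma=np.tanh):
    """Return the output preactivation alpha_tilde^(L)(x; theta)."""
    alpha = x  # alpha^(0) = x
    for l, (W, b) in enumerate(params):
        n_l = W.shape[1]
        pre = W @ alpha / np.sqrt(n_l) + beta * b  # alpha_tilde^(l+1)
        # sigma is applied on hidden layers only; the output stays a preactivation
        alpha = sigma(pre) if l < len(params) - 1 else pre
    return alpha

rng = np.random.default_rng(0)
widths = [3, 64, 64, 2]  # n_0, n_1, n_2, n_3 (hypothetical sizes)
params = init_params(widths, rng)
y = forward(params, np.ones(3))  # a vector in R^{n_L}
```

Because the scaling $\frac{1}{\sqrt{n_l}}$ sits in the forward pass rather than in the initialisation, the parameters themselves can be drawn at unit scale regardless of the layer width.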

Let $p^{in}$ be a distribution on the input space $\mathbb{R}^{n_0}$, let $P=\sum_{l=0}^{L-1} (n_l + 1)n_{l+1}$ be the dimension of the parameter space, and let $\mathcal{F}=\{f:\mathbb{R}^{n_0} \to \mathbb{R}^{n_L} \}$ be the function space. The realisation function $F^{(L)}:\mathbb{R}^P \to \mathcal{F}$ maps parameters $\theta$ to functions $f_{\theta} \in \mathcal{F}$, where $f_{\theta}(x):= \tilde{\alpha}^{(L)} (x ; \theta)$ is the network function computed by the neural network. For a functional cost $C:\mathcal{F} \to \mathbb{R}$, the composite cost $C \circ F^{(L)} : \mathbb{R}^{P} \to \mathbb{R}$ is generally highly non-convex even when $C$ is convex, which makes the behaviour of the parameters during training difficult to study. [Jacot et al.,  2018](https://arxiv.org/pdf/1806.07572.pdf) therefore study the evolution of the network function $f_{\theta}$ rather than that of the parameters $\theta$.
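
As a quick sanity check on the parameter count, each layer $l$ contributes $n_l n_{l+1}$ weights and $n_{l+1}$ biases, that is $(n_l + 1)n_{l+1}$ parameters. The sketch below sums these terms; the function name `num_parameters` and the example widths are hypothetical.

```python
def num_parameters(widths):
    """P = sum over l of (n_l + 1) * n_{l+1}, for widths = [n_0, ..., n_L]."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(widths[:-1], widths[1:]))

# Example: n_0 = 3, two hidden layers of width 64, n_L = 2.
# (3 + 1) * 64 + (64 + 1) * 64 + (64 + 1) * 2 = 256 + 4160 + 130
P = num_parameters([3, 64, 64, 2])
```

Since $P$ grows quadratically in the hidden width, the parameter space becomes very high-dimensional long before the function space $\mathcal{F}$ changes at all.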

On the function space $\mathcal{F}$, consider the seminorm $\|\cdot\|_{p^{in}}$ induced by the bilinear form  
\begin{align}
\langle f,g \rangle_{p^{in}}=\mathbb{E}_{x \sim p^{in}}\Bigl[f(x)^Tg(x)\Bigr]
\end{align}
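and the set $\mathcal{F}^*$ containing linear forms $\mu: \mathcal{F} \to \mathbb{R}$ where $\mu=\langle d, \cdot \rangle_{p^{in}}$ for some $d \in \mathcal{F}$. The functional derivative of the cost $C$ at $f_{\theta(t)}$ is then an element of $\mathcal{F}^*$. Under gradient descent on the parameters, [Jacot et al.,  2018](https://arxiv.org/pdf/1806.07572.pdf) show that the network function $f_{\theta(t)}$ follows the kernel gradient of $C$ with respect to the neural tangent kernel $\Theta^{(L)}$: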
\begin{align}
\partial_t f_{\theta(t)}=-\nabla_{\Theta^{(L)}}C|_{f_{\theta(t)}}
\end{align}
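
To make the kernel $\Theta^{(L)}$ concrete: in [Jacot et al.,  2018](https://arxiv.org/pdf/1806.07572.pdf) the neural tangent kernel is defined as $\Theta^{(L)}(\theta)=\sum_{p=1}^{P} \partial_{\theta_p}F^{(L)}(\theta) \otimes \partial_{\theta_p}F^{(L)}(\theta)$, so on a finite set of inputs it is the Gram matrix $JJ^T$ of the Jacobian $J$ of the network outputs with respect to $\theta$. The sketch below is a rough NumPy illustration under several assumptions made here and not taken from the paper: a scalar-output network with one hidden layer, $\sigma = \tanh$, $\beta = 0.1$, a central finite-difference Jacobian, a least-squares cost with zero targets, and $p^{in}$ taken to be the empirical distribution on the sample points. All function names are hypothetical.

```python
import numpy as np

def network(theta, x, n0, n1, beta=0.1):
    """Scalar-output, one-hidden-layer network in the 1/sqrt(n_l) parametrisation."""
    i = 0
    W0 = theta[i:i + n1 * n0].reshape(n1, n0); i += n1 * n0
    b0 = theta[i:i + n1]; i += n1
    W1 = theta[i:i + n1].reshape(1, n1); i += n1
    b1 = theta[i:i + 1]
    hidden = np.tanh(W0 @ x / np.sqrt(n0) + beta * b0)
    return (W1 @ hidden / np.sqrt(n1) + beta * b1)[0]

def jacobian(theta, xs, n0, n1, eps=1e-5):
    """Central finite differences: J[i, p] approximates d f_theta(x_i) / d theta_p."""
    J = np.zeros((len(xs), theta.size))
    for p in range(theta.size):
        step = np.zeros_like(theta)
        step[p] = eps
        for i, x in enumerate(xs):
            up = network(theta + step, x, n0, n1)
            down = network(theta - step, x, n0, n1)
            J[i, p] = (up - down) / (2 * eps)
    return J

def empirical_ntk(theta, xs, n0, n1):
    """Gram matrix K[i, j] = sum_p d_p f(x_i) * d_p f(x_j)."""
    J = jacobian(theta, xs, n0, n1)
    return J @ J.T

def kernel_gradient(K, d):
    """Kernel gradient at the sample points for empirical p^in:
    (1/N) * sum_j K(x_i, x_j) * d(x_j)."""
    return K @ d / len(d)

rng = np.random.default_rng(0)
n0, n1 = 2, 16
P = (n0 + 1) * n1 + (n1 + 1) * 1      # parameter count for widths [2, 16, 1]
theta = rng.standard_normal(P)
xs = rng.standard_normal((4, n0))     # four sample inputs
ys = np.zeros(4)                      # assumed regression targets

K = empirical_ntk(theta, xs, n0, n1)
f0 = np.array([network(theta, x, n0, n1) for x in xs])
dfdt = -kernel_gradient(K, f0 - ys)   # right-hand side of the evolution equation
```

For the least-squares cost, $d = f_{\theta} - y$ on the sample points, so `dfdt` is the instantaneous change of the network outputs predicted by the equation above. A small gradient step on $\theta$ should move the outputs in approximately this direction.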



[Jacot et al.,  2018](https://arxiv.org/pdf/1806.07572.pdf)

[Yang, 2019](https://arxiv.org/pdf/1902.04760.pdf)

[Lee et al., 2019](https://arxiv.org/pdf/1902.06720.pdf)

