<h1><center>PSP Linkages using Deep Learning</center></h1>
<h3><center>Apaar Shanker</center></h3>
<h3><center>Georgia Institute of Technology</center></h3>
 
![image.png](pres_imgs/intro_slide.png)

* Image: Ahmet Cecen et al., Acta Materialia 146 (2018) 76-84

<h1><center>Overview</center></h1>

### * Description of Neural Network Model
### * Training a Neural Network Model: Backpropagation
### * Applications of Neural Net models in materials domain
### * Popular Libraries : How to NN?
### * Convolutional Neural Networks
### * Analogy between CNNs and MKS Localization
### * PDE-NETs and learning Differential Equations using Conv-Net filters


<h1><center>The inevitable brain analogy and the Perceptron</center></h1>
![StanfordLectureNotes](pres_imgs/neuron_and_perceptron.png)

<h1><center>Zooming in on the Perceptron</center></h1>

![StanfordLectureNotes](pres_imgs/perceptron.png)

<h1><center>The neural network model</center></h1>

## A simple linear model
 * $f = Wx$
     * $x=\{x_1, x_2,\cdots,x_n\}$
     * $ W = 
     \begin{pmatrix}w_{11} & w_{12} & \cdots & w_{1n} \\w_{21} & w_{22} & \cdots & w_{2n} \\ \vdots& \vdots & \ddots & \vdots \\ w_{m1} & w_{m2} & \cdots & w_{mn}    
     \end{pmatrix}$
     * $f = \{f_1, f_2,\cdots,f_m\}$

$x$ : $(n \times 1)$, $W$ : $(m \times n)$, $f$ : $(m \times 1)$

<h1><center>The neural network model</center></h1>

A simple linear model
 * $f = Wx$
 
A 2-layer Neural Network
 * $f = W_2 max(0, W_1x)$
 
$\max(0,x)$ is known as the ReLU (Rectified Linear Unit) function

$x$ : ($n \times 1$), $W_1$ : ($m1 \times n$), $W_2$ : ($m2 \times m1$), $f$ : ($m2 \times 1$)

<h1><center>The neural network model</center></h1>

A simple linear model
 * $f = Wx$
 
A 2-layer Neural Network
 * $f = W_2 max(0, W_1x)$

or A 3-layer Neural Network
 * $f = W_3max(0, W_2 max(0, W_1x))$

$x$: ($n \times 1$), $W_1$: ($m1 \times n$), $W_2$: ($m2 \times m1$), $W_3$: ($m3 \times m2$), $f$: ($m3 \times 1$)
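
The forward pass of the 3-layer model can be sketched in plain Python so that the shapes stay visible. The helper names (`matvec`, `relu`, `forward`) and the weight values are illustrative assumptions, not taken from the slides; in practice a library such as NumPy or PyTorch would handle the matrix products.

```python
def matvec(W, x):
    """Multiply an (m x n) matrix, stored as a list of rows, by a length-n vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def relu(v):
    """Element-wise max(0, v)."""
    return [max(0.0, v_i) for v_i in v]


def forward(x, W1, W2, W3):
    """f = W3 max(0, W2 max(0, W1 x))."""
    h1 = relu(matvec(W1, x))   # (m1 x 1)
    h2 = relu(matvec(W2, h1))  # (m2 x 1)
    return matvec(W3, h2)      # (m3 x 1)


# Hypothetical sizes: n = 2, m1 = 3, m2 = 2, m3 = 1
W1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]
W2 = [[1.0, 0.0, -1.0], [0.0, 2.0, 1.0]]
W3 = [[1.0, -0.5]]
x = [1.0, 2.0]

f = forward(x, W1, W2, W3)
print(f)  # [-3.0]
```

Note that each weight matrix has as many columns as the previous layer has outputs, which is exactly the shape bookkeeping listed above.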

<h1><center>The neural network model</center></h1>

A simple linear model
 * $f = Wx$
 
A 2-layer Neural Network
 * $f = W_2 max(0, W_1x)$

or, if you fancy, a 3-layer network with both ReLU and sigmoid activations
 * $f = W_3max(0, W_2 \sigma(W_1x))$ where $\sigma(h) = \dfrac{1}{1 + \exp(-h)}$

$x$ : ($n \times 1$), $W_1$ : ($m1 \times n$), $W_2$ : ($m2 \times m1$), $W_3$ : ($m3 \times m2$), $f$ : ($m3 \times 1$)

<h1><center>Commonly Used Activation Functions</center></h1>
![StanfordLectureNotes](pres_imgs/activation_functions.png)

* Sigmoid and tanh functions are most commonly used in multilayer perceptron models, whereas ReLU is the standard for the conv-nets described later. 
* Please note that the derivatives of all these functions are really easy to compute, e.g. $\dfrac{d \sigma(x)}{dx} = \sigma(x)(1-\sigma(x))$
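
The sigmoid derivative identity can be checked numerically. The sketch below compares $\sigma(x)(1-\sigma(x))$ with a central finite difference; the function names and the test points are illustrative assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Analytic derivative: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def central_difference(fn, x, eps=1e-6):
    """Numerical derivative used only as a sanity check."""
    return (fn(x + eps) - fn(x - eps)) / (2.0 * eps)


points = [-2.0, 0.0, 0.5, 3.0]  # arbitrary test points
max_error = max(
    abs(sigmoid_grad(p) - central_difference(sigmoid, p)) for p in points
)
print(max_error)  # should be very small
```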

<h1><center>The Neural Network as a Computational Graph</center></h1>
![StanfordLectureNotes](pres_imgs/graph_representation.png)

* s denotes the number of nodes (perceptrons) in each layer
* This is a fully connected multi-layered perceptron model
* Implemented in scikit-learn as the `MLPClassifier` (MLPC) model and also available in the MATLAB machine learning toolbox

<h1><center>Training the model: Optimizing the loss function</center></h1>

Consider the linear regression model:
 - $y = w^Tx$

<h1><center>Training the model: Optimizing the loss function</center></h1>

Consider the linear model:
 - $y = w^Tx$

We can define a function $\mathcal{L}$:
\begin{align}
    \mathcal{L}(w) &= \sum_{i=1}^{N}(\hat{y}_i - y_i)^2\\
    &= \sum_{i=1}^{N}(\hat{y}_i - w^Tx_i)^2
\end{align}
such that the problem of guessing the weights reduces to the problem of minimizing the function $\mathcal{L}$ also known as the loss function.


<h1><center>Training the model: Optimizing the loss function</center></h1>

Consider the linear model:
 - $y = w^Tx$

We can define a function $\mathcal{L}$:
\begin{align}
    \mathcal{L}(w) &= \sum_{i=1}^{N}(\hat{y}_i - y_i)^2\\
    &= \sum_{i=1}^{N}(\hat{y}_i - w^Tx_i)^2
\end{align}
such that the problem of guessing the weights reduces to the problem of minimizing the function $\mathcal{L}$ also known as the loss function.

**In this case, the function $\mathcal{L}$ is clearly convex, i.e. a paraboloid in $w$ space, so the problem has an analytical solution (provided $X^TX$ is invertible):**
 - $\hat{w} = (X^TX)^{-1}X^T\hat{Y}$ where $X: \{x_i\}$ and $\hat{Y}: \{\hat{y}_i\}$
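
As a small worked instance of the closed-form solution, the sketch below solves the normal equations $(X^TX)\,w = X^T\hat{Y}$ for a model with two weights, using an explicit 2x2 inverse. The function name `normal_equation_2d` and the toy data (generated from $y = 2x_1 - 3x_2$) are assumptions made for illustration.

```python
def normal_equation_2d(X, y):
    """Solve (X^T X) w = X^T y for two weights via the 2x2 inverse."""
    # Entries of the symmetric matrix A = X^T X
    a11 = sum(r[0] * r[0] for r in X)
    a12 = sum(r[0] * r[1] for r in X)
    a22 = sum(r[1] * r[1] for r in X)
    # Entries of the vector b = X^T y
    b1 = sum(r[0] * t for r, t in zip(X, y))
    b2 = sum(r[1] * t for r, t in zip(X, y))
    det = a11 * a22 - a12 * a12  # must be non-zero for X^T X to be invertible
    return [(a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det]


# Toy data: each row of X is one x_i, targets follow y = 2*x1 - 3*x2
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [2.0, -3.0, -1.0, 1.0]

w_hat = normal_equation_2d(X, y)
print(w_hat)  # recovers the weights used to generate the data
```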

<h1><center>Training the model: Gradient Descent</center></h1>

![StanfordLectureNotes](pres_imgs/mountains.png)

<h1><center>Training the model: Gradient Descent</center></h1>

* The gradient at any point in the loss function is denoted as *$\nabla_w\mathcal{L}$*   

* It is a vector that gives the direction of maximal positive change in the loss function.  

* As such, the loss function can be minimized by moving in the direction opposite to the gradient.  

* This gives us an update rule  
    * $w_{i}^{t+1} = w_{i}^{t} - \lambda \dfrac{\partial \mathcal{L}(w)}{\partial{w_i}}$
    * $\lambda$ is referred to as the learning rate and controls the speed of descent.

![StanfordLectureNotes](pres_imgs/grad_descent.png)
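
The update rule can be sketched for the linear least-squares loss $\mathcal{L}(w) = \sum_i (\hat{y}_i - w^Tx_i)^2$. The function names, the toy data, the learning rate, and the number of steps below are illustrative assumptions; the learning rate in particular has to be small enough for the iteration to remain stable.

```python
def loss_gradient(w, X, y):
    """Gradient of sum_i (y_i - w^T x_i)^2 with respect to w."""
    grad = [0.0] * len(w)
    for x_i, y_i in zip(X, y):
        residual = y_i - sum(w_k * x_k for w_k, x_k in zip(w, x_i))
        for k, x_k in enumerate(x_i):
            grad[k] -= 2.0 * residual * x_k
    return grad


def gradient_descent(X, y, learning_rate=0.05, steps=500):
    """Repeatedly apply w <- w - learning_rate * gradient, starting from zero."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        grad = loss_gradient(w, X, y)
        w = [w_k - learning_rate * g_k for w_k, g_k in zip(w, grad)]
    return w


# Toy data generated from y = 2*x1 - 3*x2
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [2.0, -3.0, -1.0, 1.0]

w_gd = gradient_descent(X, y)
print(w_gd)  # approaches [2.0, -3.0]
```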

<h1><center>Training the model: Stochastic Gradient Descent</center></h1>


* Recall:
    * $\mathcal{L} = \dfrac{1}{N}\sum^{N}(\hat{y}_i - f(x_i))^{2}$
* For large datasets, it is expensive to compute loss for the entire dataset in each update step.
* An alternative is to compute gradient over batches of training data.
* **Stochastic refers to the fact that the "mini-batch" loss function is a "stochastic" approximation of the actual loss**
* This gives us a modified update rule  
    * $w_{i}^{t+1} = w_{i}^{t} - \lambda \dfrac{\partial \mathcal{L}_j(w)}{\partial{w_i}}$, where $\mathcal{L}_j$ is the loss computed on the $j$-th mini-batch
    * $\lambda$ is referred to as the learning rate and controls the speed of descent.
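
A minimal sketch of mini-batch stochastic gradient descent for the same linear model follows. The data are synthetic and noise-free, and the names (`minibatch_gradient`, `sgd`), batch size, learning rate, and seeds are assumptions chosen for illustration only.

```python
import random


def minibatch_gradient(w, batch):
    """Gradient of the mean squared error over one mini-batch."""
    grad = [0.0] * len(w)
    for x_i, y_i in batch:
        residual = y_i - sum(w_k * x_k for w_k, x_k in zip(w, x_i))
        for k, x_k in enumerate(x_i):
            grad[k] -= 2.0 * residual * x_k / len(batch)
    return grad


def sgd(data, n_weights, learning_rate=0.3, batch_size=4, epochs=300, seed=0):
    """Shuffle the data each epoch and update the weights once per mini-batch."""
    rng = random.Random(seed)
    data = list(data)
    w = [0.0] * n_weights
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad = minibatch_gradient(w, batch)
            w = [w_k - learning_rate * g_k for w_k, g_k in zip(w, grad)]
    return w


# Synthetic data from y = 2*x1 - 3*x2 (hypothetical example)
data_rng = random.Random(1)
inputs = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(20)]
data = [(x, 2.0 * x[0] - 3.0 * x[1]) for x in inputs]

w_sgd = sgd(data, n_weights=2)
print(w_sgd)  # approaches [2.0, -3.0]
```

Each update only touches `batch_size` samples, so its cost does not grow with the size of the full dataset.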



<h1><center>Training the model: Backpropagation</center></h1>


* Recall the form of the 3-layer Neural Network Model:
    * $f = W_3max(0, W_2 max(0, W_1x))$

<h1><center>Training the model: Backpropagation</center></h1>


* Recall the form of the 3-layer Neural Network Model:
    * $f = W_3max(0, W_2 max(0, W_1x))$
* We again define the loss function as:
    * $\mathcal{L} = \dfrac{1}{N}\sum^{N}(\hat{y}_i - f(x_i))^{2}$

<h1><center>Training the model: Backpropagation</center></h1>


* Recall the form of the 3-layer Neural Network Model:
    * $f = W_3\sigma(W_2 \sigma(W_1x))$
* We again define the loss function as:
    * $\mathcal{L} = \dfrac{1}{N}\sum^{N}(\hat{y}_i - f(x_i))^{2}$
* We would like to use the gradient descent strategy for optimizing $\mathcal{L}$, which is:
    * $w_{i,j}^{l}[t+1] = w_{i,j}^{l}[t] - \lambda \dfrac{\partial \mathcal{L}(w[t])}{\partial{w^l_{i,j}[t]}}$
    * where $l$ is the index of the layer and $i,j$ are the indices of the parameter in that layer's weight matrix
* However, because of the deep nesting of weights, it is difficult to get the analytic form of the partial derivative:
    * $\dfrac{\partial \mathcal{L}(w[t])}{\partial{w^l_{i,j}[t]}}$

<h1><center>Training the model: Backpropagation</center></h1>

## What if we use the chain rule?

Recall the chain rule:  
\begin{align}
\dfrac{d(f\circ g)(x)}{dx} = \dfrac{df(g(x))}{d(g(x))}\dfrac{d(g(x))}{dx}
\end{align}

* A simplified illustration of backpropagation using the univariate logistic least squares model

![StanfordLectureNotes](pres_imgs/back_prop_simple.png)

http://www.cs.toronto.edu/~rgrosse/courses/csc321_2017/slides/lec6.pdf
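
The chain rule computation can be sketched for a univariate logistic least squares model. The sketch assumes the form $z = wx + b$, $y = \sigma(z)$, $\mathcal{L} = \frac{1}{2}(y - t)^2$; the function names and the numbers used are illustrative, and the result is compared with a finite-difference estimate.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, b, x, t):
    """Assumed model: L = 0.5 * (sigmoid(w*x + b) - t)^2."""
    return 0.5 * (sigmoid(w * x + b) - t) ** 2


def backprop(w, b, x, t):
    # Forward pass: compute and keep the intermediate values
    z = w * x + b
    y = sigmoid(z)
    # Backward pass: apply the chain rule from the loss back to the weights
    dL_dy = y - t
    dL_dz = dL_dy * y * (1.0 - y)  # uses d(sigma)/dz = sigma * (1 - sigma)
    dL_dw = dL_dz * x
    dL_db = dL_dz
    return dL_dw, dL_db


w, b, x, t = 0.8, -0.2, 1.5, 1.0  # arbitrary example values
dL_dw, dL_db = backprop(w, b, x, t)

# Finite-difference estimates, used only to check the analytic gradients
eps = 1e-6
numeric_dw = (loss(w + eps, b, x, t) - loss(w - eps, b, x, t)) / (2 * eps)
numeric_db = (loss(w, b + eps, x, t) - loss(w, b - eps, x, t)) / (2 * eps)
print(dL_dw, numeric_dw)
print(dL_db, numeric_db)
```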

<h1><center>Training the model: Backpropagation</center></h1>
   
![StanfordLectureNotes](pres_imgs/back_prop_mlpc.png)

<h1><center>Training the model: Backpropagation</center></h1>
 
  
   
![StanfordLectureNotes](pres_imgs/back_prop_mlpc_vectorized.png)

<h1><center>Training the model: Backpropagation</center></h1>

* In the message passing notation:  
  
   
![StanfordLectureNotes](pres_imgs/back-prop.png)

* For each training step, a forward pass results in the computation of the loss value
* A backward pass results in the update of the weights
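
The forward and backward passes can be sketched for a small 2-layer network $f = w_2^T\sigma(W_1x)$ with a scalar output and squared-error loss. This is a simplified stand-in with no bias terms and a single training example, not the network from the figures; all names and numerical values are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(W1, w2, x, t):
    """Forward pass: hidden activations, prediction, and loss (t - f)^2."""
    h = [sigmoid(sum(w * x_k for w, x_k in zip(row, x))) for row in W1]
    f = sum(w * h_j for w, h_j in zip(w2, h))
    return h, f, (t - f) ** 2


def backward(W1, w2, x, t):
    """Backward pass: chain rule from the loss back to every weight."""
    h, f, _ = forward(W1, w2, x, t)
    dL_df = 2.0 * (f - t)
    grad_w2 = [dL_df * h_j for h_j in h]
    dL_dz = [dL_df * w * h_j * (1.0 - h_j) for w, h_j in zip(w2, h)]
    grad_W1 = [[dz_j * x_k for x_k in x] for dz_j in dL_dz]
    return grad_W1, grad_w2


def train_step(W1, w2, x, t, learning_rate=0.1):
    """One gradient descent update using the backpropagated gradients."""
    grad_W1, grad_w2 = backward(W1, w2, x, t)
    new_W1 = [
        [w - learning_rate * g for w, g in zip(row, g_row)]
        for row, g_row in zip(W1, grad_W1)
    ]
    new_w2 = [w - learning_rate * g for w, g in zip(w2, grad_w2)]
    return new_W1, new_w2


# Hypothetical initial weights and one training example
W1_init = [[0.5, -0.3], [0.1, 0.8], [-0.7, 0.2]]
w2_init = [0.4, -0.6, 0.9]
x, t = [1.0, 2.0], 1.0

W1, w2 = W1_init, w2_init
loss_before = forward(W1, w2, x, t)[2]
for _ in range(50):
    W1, w2 = train_step(W1, w2, x, t)
loss_after = forward(W1, w2, x, t)[2]
print(loss_before, loss_after)  # the loss should decrease
```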

<h1><center>Back to the equation</center></h1>
### A 3-layer feed-forward Neural Network
 * $f = W_3max(0, W_2 max(0, W_1x))$
 
### To Summarize:
* A multilayer perceptron is just a sequence of linear transforms, each followed by a non-linear transform, applied to an input vector.
* A feed-forward fully connected neural network with a single hidden layer, using practically any nonlinear activation function and enough hidden nodes, can approximate any continuous function of any number of real variables on any compact set to any desired degree of accuracy.
* Number of parameters in the model = $\sum_{n=1}^{N} (L_{n-1}+1) L_n$, where $L_n$ is the number of nodes in layer $n$, $L_0$ is the input dimension, and the $+1$ accounts for the bias term
* **How to guess the values of these parameters?**
* https://papers.nips.cc/paper/874-how-to-choose-an-activation-function.pdf
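
The parameter count can be evaluated with a short helper. The function name `count_parameters` and the layer sizes in the example are hypothetical; the formula assumes one bias per node, which is what the $+1$ represents.

```python
def count_parameters(layer_sizes):
    """Sum of (inputs + 1 bias) * outputs over consecutive pairs of layers.

    layer_sizes[0] is the input dimension; the remaining entries are the
    number of nodes in each subsequent layer.
    """
    return sum(
        (n_in + 1) * n_out
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    )


# Hypothetical network: 784 inputs, one hidden layer of 100 nodes, 10 outputs
n_params = count_parameters([784, 100, 10])
print(n_params)  # 79510
```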

<h1><center>Resources for implementing Neural Networks</center></h1>

 - Pytorch - http://pytorch.org/
 - Tensorflow - http://tensorflow.org/
 - Theano - http://deeplearning.net/software/theano/
 - Keras - https://keras.io/
 
 A useful learning resource - 
 https://playground.tensorflow.org/
 
 Background
 http://cs231n.github.io/
 
 

<h1><center>Convolutional Neural Networks</center></h1>

![StanfordLectureNotes](pres_imgs/CNN_template.png)

Gradient Based Learning applied to document recognition [LeCun, Bottou, Bengio, Haffner 1998]

<h1><center>Convolutional Neural Networks</center></h1>

* Image data are high dimensional and have local embedded structures.  

* CNNs were conceptualized to overcome the limitations of Fully Connected neural networks in processing image data

<h1><center>Convolutional Neural Networks</center></h1>
![StanfordLectureNotes](pres_imgs/fc_layer.png)

<h1><center>Convolutional Neural Networks</center></h1>
![StanfordLectureNotes](pres_imgs/conv_layer_1.png)

Recall convolution:

\begin{align}
f[x, y] * g[x,y] = \sum_{n_1 = -\infty}^{\infty}\sum_{n_2 = -\infty}^{\infty} f[n_1, n_2]\cdot g[x-n_1, y-n_2]
\end{align}
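
For finite arrays the double sum reduces to a few nested loops. The sketch below computes the "full" 2D convolution of two small arrays, treating everything outside them as zero; the function name `conv2d_full` and the example arrays are assumptions.

```python
def conv2d_full(f, g):
    """Full 2D discrete convolution of two finite arrays (lists of rows).

    out[x][y] = sum over n1, n2 of f[n1][n2] * g[x - n1][y - n2],
    with both arrays treated as zero outside their bounds.
    """
    rows = len(f) + len(g) - 1
    cols = len(f[0]) + len(g[0]) - 1
    out = [[0.0] * cols for _ in range(rows)]
    for n1, f_row in enumerate(f):
        for n2, f_val in enumerate(f_row):
            for m1, g_row in enumerate(g):
                for m2, g_val in enumerate(g_row):
                    # (m1, m2) = (x - n1, y - n2), so the output index is the sum
                    out[n1 + m1][n2 + m2] += f_val * g_val
    return out


image = [[1, 2], [3, 4]]    # hypothetical 2x2 input
kernel = [[1, 0], [0, -1]]  # hypothetical 2x2 filter
result = conv2d_full(image, kernel)
for row in result:
    print(row)
```

Deep learning libraries commonly implement the convolution layer as a cross-correlation, i.e. without flipping the filter; since the filter weights are learned, the distinction does not affect what the layer can represent.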

<h1><center>Convolutional Neural Networks</center></h1>
![StanfordLectureNotes](pres_imgs/conv_layer_2.png)

<h1><center>Convolutional Neural Networks</center></h1>
![StanfordLectureNotes](pres_imgs/conv_layer_3.png)

<h1><center>Convolutional Neural Networks</center></h1>
![StanfordLectureNotes](pres_imgs/conv_layer_4.png)

<h1><center>Convolutional Neural Networks</center></h1>
![StanfordLectureNotes](pres_imgs/conv_net_full.png)

<h1><center>Convolutional Neural Networks</center></h1>

## Pooling: used in order to reduce the number of parameters and prevent overfitting
![StanfordLectureNotes](pres_imgs/pooling.png)
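
A minimal sketch of one common pooling choice, max pooling with a 2x2 window and stride 2, is given below. The window size, the stride, and the function name `max_pool_2x2` are assumptions; other pooling variants (e.g. average pooling) follow the same pattern.

```python
def max_pool_2x2(feature_map):
    """Max pooling with a 2x2 window and stride 2.

    A trailing odd row or column, if present, is dropped.
    """
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


feature_map = [
    [1, 3, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
pooled = max_pool_2x2(feature_map)
print(pooled)  # [[6, 8], [3, 4]]
```

The 4x4 map shrinks to 2x2, so any layer that consumes the pooled output sees a quarter as many inputs.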

<h1><center>Convolutional Neural Networks</center></h1>

## Typical off the shelf CNN / Deep Learning Model
### Representation Learning versus Template learning
### A compaction of the typical MKS workflow, which first generates features and then builds the linkage

![StanfordLectureNotes](pres_imgs/typical_cnn.png)

<h1><center>Convolutional Neural Networks</center></h1>

## VGG-Net :  A Production CNN

![StanfordLectureNotes](pres_imgs/vgg_net.png)

<h1><center>Convolutional Neural Networks</center></h1>

## Why should you care?

![StanfordLectureNotes](pres_imgs/image_net.png)