<div style="text-align: right"> 

### DATA 22100 - Introduction to Machine Learning

</div>

<img src="https://github.com/david-biron/DATA221imgs/blob/main/UChicago_DSI.png?raw=true" align="right" alt="UC-DSI" width="300">





<center> 

# Neural Networks in a nutshell
    
<br/>
    
</center> 

    

### Neurons 

A **neuron** is an electrically excitable cell that fires electric signals called action potentials. 

<img src="https://github.com/david-biron/DATA221imgs/blob/main/neuron.png?raw=true" width="500">


In a *grossly oversimplified* picture, a neuron would: 

* receive all-or-none ($1$ or $0$) input signals via its **dendrites** (and soma). 
* multiply each input by a 'weight' (**synaptic strength**). 
* aggregate the weighted input signals in the **soma** (and dendrites), <br/> i.e., sum the weighted signals. 
* pass the sum through a non-linear function that determines whether to send a $1$ or $0$ output signal. 
* send out an output signal (or not) to other neurons down the **axon**, <br/> which would connect to dendrites of downstream neurons. 


The signaling process is electro-chemical: 

* Input electric action potentials trigger chemical signaling in the receiving dendrite. 
* Chemical signaling in the dendrites and the soma aggregates the inputs and, eventually, triggers (or not) the output electric action potential.  
* The structure connecting an upstream axon to a downstream dendrite is called a **synapse**. 


### Simple artificial 'neurons'

Computational neural networks are composed of simple units ('neurons') loosely inspired by biological neurons: 

| | | 
|:-:|:-:|
| <img src="https://github.com/david-biron/DATA221imgs/blob/main/neuronArtificial1.png?raw=true" width="400"> | <img src="https://github.com/david-biron/DATA221imgs/blob/main/neuronArtificial2.png?raw=true" width="500"> | 

* Inputs ($1$ or $0$) from other neurons are multiplied by 'synaptic' weights $w_j$.
* Weighted inputs are summed ($\sum$), optionally along with a 'bias' ($b$). 
* The sum is passed through a nonlinearity ($\cal S$) that transforms it into an output that is either $0$ or some nonzero value. <br/> The nonlinearity is also called the **activation function**, $g()$.  
* The output can serve as input to downstream neurons. 

<center>    
$\begin{eqnarray}
\boxed{ \  \text{single_neuron_output} = \text{activation_function} \left( \sum_j \text{weight}_j \ \times \ \text{input}_j + \text{bias} \right)  \ } 
\end{eqnarray}$ 
</center>    

#### A simple neuron example 

|   |   |
|:--|:--|
| <img src="https://github.com/david-biron/DATA221imgs/blob/main/icon_example.png?raw=true" width="50"> | Consider a 'neuron' with two inputs ($24$ and $16$), where the corresponding weights are $0.5$ and $-1$, <br/> there is no bias, and the activation function is a Rectified Linear Unit, where $ReLU(x) = \max\left(0, x\right)$.  |

<img src="https://github.com/david-biron/DATA221imgs/blob/main/ExampleNeuronWithTwoInputs.png?raw=true" width="500">

Input 1: $24 \times 0.5 = 12$ <br/>
Input 2: $16 \times (-1.0) = -16$ <br/>
Sum ($\sum$): $12 + (-16) = -4$ <br/>
Output $ = ReLU(-4) = 0$

<br/>

With the first input changed to $36$: <br/>
Input 1: $36 \times 0.5 = 18$ <br/>
Input 2: $16 \times (-1.0) = -16$ <br/>
Sum ($\sum$): $18 + (-16) = 2$ <br/>
Output $ = ReLU(2) = 2$
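
The arithmetic of this example can be reproduced in a few lines of plain Python. This is only a minimal sketch: the function names `relu` and `neuron_output` are made up for illustration and are not part of any library.

```python
def relu(x):
    """Rectified Linear Unit: max(0, x)."""
    return max(0.0, x)


def neuron_output(inputs, weights, bias=0.0, activation=relu):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)


weights = [0.5, -1.0]  # the weights from the example; there is no bias

print(neuron_output([24, 16], weights))  # 0.5*24 - 16 = -4, so ReLU gives 0.0
print(neuron_output([36, 16], weights))  # 0.5*36 - 16 = 2, so ReLU gives 2.0
```

Passing a different function as `activation` changes the nonlinearity without touching the weighted sum.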


### Commonly used activation functions include:

* [`Sigmoid`](https://pytorch.org/docs/stable/generated/torch.nn.Sigmoid.html)
* [`Tanh`](https://pytorch.org/docs/stable/generated/torch.nn.Tanh.html) (hyperbolic tangent). 
* [`ReLU`](https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html) (Rectified Linear Unit). 

<img src="https://github.com/david-biron/DATA221imgs/blob/main/CommonActivationFunctions2.png?raw=true" width="500"> 
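
For intuition, the three functions can be written for a single number using only Python's `math` module. This is a sketch of the formulas, not of how PyTorch implements them, and it is not numerically hardened (e.g., `math.exp` overflows for very negative inputs to `sigmoid`).

```python
import math


def sigmoid(x):
    """Logistic sigmoid: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Hyperbolic tangent: squashes any real number into (-1, 1)."""
    return math.tanh(x)


def relu(x):
    """Rectified Linear Unit: zero for negative inputs, identity otherwise."""
    return max(0.0, x)


for x in [-2.0, 0.0, 2.0]:
    print(f"x={x:+.1f}  sigmoid={sigmoid(x):.3f}  tanh={tanh(x):+.3f}  relu={relu(x):.1f}")
```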

<br> 



### Artificial Neural Networks

* An **Artificial Neural Network** is composed of many artificial neurons that are connected to each other, <br> i.e., the outputs of some serve as inputs to others.
* A common architecture arranges the neurons in **layers**: an input layer, one or more hidden layers, and an output layer (left to right). 

<img src="https://github.com/david-biron/DATA221imgs/blob/main/ArtificialNN1.png?raw=true" width="500">

* Fully connected layers (every node in the upstream layer is connected to every node in the downstream layer) are called **linear layers**.   
* **Deep networks** have many (more than one) hidden layers. 
* A **forward pass** is a signal flowing from left to right: <br> input $\to$ hidden$_1$ $\to$ hidden$_2$ $\to \ \dots $ $\to$ hidden$_n$ $\to$ output. <br> A prediction is computed with a forward pass.  

<br> 

<center>
${ \begin{eqnarray} 
\boxed{ \
\text{Training a neural network means learning the weights that would produce a desired result.}
 \ }
\end{eqnarray}}$ 
</center>    

<br> 

* A typical training scheme: 
    * Randomly (or not) **initialize** the weights for all the nodes.
    * Given a training example, perform a **forward pass** using the current weights and calculate the output value.
    * Compare the output with the target output in the training data: measure the error using a **loss function**.
    * Perform a **backward pass** (from right to left) and use **backpropagation** $+$ **gradient descent** to update the weights. 
    * Repeat the **forward pass** and **backward pass** for each training example, <br> thereby gradually improving the weights.  
    
* **Backpropagation** calculates each weight’s contribution to the error (the gradient of the loss with respect to that weight); **gradient descent** then adjusts the weights in the direction that **reduces the loss**. 
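
The training scheme above can be sketched for the smallest possible 'network': a single sigmoid neuron with a squared-error loss. Everything specific here is an assumption chosen for illustration: the toy dataset (the logical OR of two inputs), the learning rate, the number of epochs, the random seed, and the helper names. The gradient is written out by hand; a library such as PyTorch would compute it automatically.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(x, weights, bias):
    """Forward pass of one sigmoid neuron."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def mean_loss(data, weights, bias):
    """Mean squared error over the whole dataset."""
    return sum((predict(x, weights, bias) - y) ** 2 for x, y in data) / len(data)


def init_params(n_inputs, seed=0):
    """Step 1: randomly initialize the weights (the bias starts at zero)."""
    rng = random.Random(seed)
    return [rng.uniform(-0.5, 0.5) for _ in range(n_inputs)], 0.0


def train(data, weights, bias, learning_rate=0.5, epochs=2000):
    for _ in range(epochs):
        for x, y in data:
            p = predict(x, weights, bias)             # step 2: forward pass
            # Step 3: the loss for this example is (p - y) ** 2.
            # Step 4: backward pass via the chain rule, dL/dz = dL/dp * dp/dz,
            # where dp/dz = p * (1 - p) for the sigmoid.
            dloss_dz = 2.0 * (p - y) * p * (1.0 - p)
            # Gradient descent: move each parameter against its gradient.
            weights = [w - learning_rate * dloss_dz * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * dloss_dz
    return weights, bias


# Toy data (assumed for illustration): the logical OR of two binary inputs.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

weights, bias = init_params(n_inputs=2)
loss_before = mean_loss(data, weights, bias)
weights, bias = train(data, weights, bias)
loss_after = mean_loss(data, weights, bias)
predictions = [round(predict(x, weights, bias)) for x, _ in data]

print(f"mean loss before: {loss_before:.4f}, after: {loss_after:.4f}")
print("rounded predictions:", predictions)
```

The weights are updated after every single example here; updating after a batch of examples is an equally common choice.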
    

See some examples [here](https://towardsdatascience.com/applied-deep-learning-part-1-artificial-neural-networks-d7834f67a4f6), [here](https://iamtrask.github.io/2015/07/12/basic-python-network/), and [here](https://iamtrask.github.io/2015/07/27/python-network-part2/). 

<br> 



### What is the gradient with respect to the parameter $w_i$? 

#### Given data $\{x_1, x_2, \dots, x_p\}$ 

Suppose that $\{w_1, w_2, \dots, w_p\}$ are weights in a particular layer (e.g., they connect the $2^{nd}$ and $3^{rd}$ layers of neurons) that feed a single downstream neuron, with $w_0$ as its bias.  

$$\begin{eqnarray}
\hat y_{\text{'w layer'}} &=& w_0 + w_1 x_1 + \dots + w_p x_p \\ 
\hat p &=& g\left( \hat y \right) \hspace{1cm} \text{(activation function)} \\ 
\text{loss_function} &=& {\cal L}\left( \left. \hat p \ , \ \text{ground_truth} \ \right|  \ \text{data, including $x_i$} \right) \\ 
\ \\ 
\rightarrow \ \frac{\partial \ \text{loss_function}}{\partial w_i} &=& \frac{\partial {\cal L}}{\partial \hat p} \frac{\partial \hat p}{\partial \hat y}\frac{\partial \hat y}{\partial w_i} \\
&\underbrace{=}_{\text{e.g. $\sum (\dots)^2$ loss}}& 2\left( \left. \hat p \ - \ \text{ground_truth} \ \right|  \ \text{data} \right) \frac{\partial g}{\partial \hat y} x_i
\end{eqnarray}$$
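
One way to build trust in the chain-rule expression is to compare it with a numerical derivative. The sketch below assumes a sigmoid activation (so that $\partial g / \partial \hat y = g(\hat y)\left(1 - g(\hat y)\right)$) and a squared-error loss on a single example; the numbers and function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(weights, bias, x, target):
    """Squared error of a single sigmoid neuron on one example."""
    y_hat = bias + sum(w * xi for w, xi in zip(weights, x))
    p_hat = sigmoid(y_hat)
    return (p_hat - target) ** 2


def analytic_gradient(weights, bias, x, target):
    """Chain rule: dL/dw_i = 2 (p_hat - target) * g'(y_hat) * x_i."""
    y_hat = bias + sum(w * xi for w, xi in zip(weights, x))
    p_hat = sigmoid(y_hat)
    dloss_dp = 2.0 * (p_hat - target)
    dp_dy = p_hat * (1.0 - p_hat)               # derivative of the sigmoid
    return [dloss_dp * dp_dy * xi for xi in x]  # dy_hat/dw_i = x_i


def numeric_gradient(weights, bias, x, target, eps=1e-6):
    """Central finite differences, used only as an independent check."""
    grads = []
    for i in range(len(weights)):
        up = list(weights)
        down = list(weights)
        up[i] += eps
        down[i] -= eps
        grads.append((loss(up, bias, x, target) - loss(down, bias, x, target)) / (2 * eps))
    return grads


# Hypothetical numbers chosen only for illustration.
weights, bias = [0.3, -0.2, 0.1], 0.05
x, target = [1.0, 2.0, -1.5], 1.0

analytic = analytic_gradient(weights, bias, x, target)
numeric = numeric_gradient(weights, bias, x, target)
for a, n in zip(analytic, numeric):
    print(f"analytic {a:+.6f}   numeric {n:+.6f}")
```

The two columns should agree to several decimal places; a mismatch usually signals a mistake in the hand-derived gradient.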


In a fully connected network (linear layers), the weights are stored in matrices whose dimensions correspond to the sizes of the upstream and downstream layers. For example:  

<img src="https://github.com/david-biron/DATA221imgs/blob/main/ArtificialNN2.png?raw=true" width="500">

* The input layer has $3$ nodes. <br/> To fully connect them to the following $4$ nodes we'd need a $3 \times 4$ matrix of weights.
* The first hidden layer has $4$ nodes. <br> To fully connect them to the following $4$ nodes we'd need a $4 \times 4$ matrix of weights.
* The second hidden layer has $4$ nodes. <br> To fully connect them to the output layer we'd need a $4 \times 1$ matrix of weights (which in the image is depicted as $1 \times 4$).
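
The shapes listed above can be checked with a small forward pass in plain Python. This is a sketch under several assumptions: the weights are random, the biases are zero, ReLU follows each hidden layer, the output has no activation, and each matrix is stored as (upstream $\times$ downstream). Libraries may store the transpose; PyTorch's `torch.nn.Linear`, for instance, keeps its weight as (out_features, in_features).

```python
import random


def relu(x):
    return max(0.0, x)


def random_matrix(n_rows, n_cols, rng):
    """An n_rows x n_cols matrix (a list of rows) of random weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_cols)] for _ in range(n_rows)]


def linear_layer(inputs, weights, biases):
    """Fully connected layer: out[k] = sum_j inputs[j] * weights[j][k] + biases[k]."""
    return [
        sum(inputs[j] * weights[j][k] for j in range(len(inputs))) + biases[k]
        for k in range(len(biases))
    ]


rng = random.Random(0)
layer_sizes = [3, 4, 4, 1]  # input, hidden 1, hidden 2, output

# One (upstream x downstream) weight matrix and one bias vector per connection.
weights = [random_matrix(n_in, n_out, rng)
           for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
biases = [[0.0] * n_out for n_out in layer_sizes[1:]]


def forward(x):
    """Forward pass: ReLU after each hidden layer, no activation on the output."""
    activations = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        activations = linear_layer(activations, W, b)
        if i < len(weights) - 1:
            activations = [relu(a) for a in activations]
    return activations


for W in weights:
    print(f"{len(W)} x {len(W[0])}")  # 3 x 4, then 4 x 4, then 4 x 1
print(forward([1.0, 0.5, -0.5]))      # a list holding the single output value
```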


To make a prediction given an input, the network needs to learn these weights (along with the biases and, possibly, parameters of the activation functions). <br> Training with **backpropagation** iteratively updates these weights until they are 'learned'.


### A Neural Network Playground

To get a feel for NNs, [try this out](https://playground.tensorflow.org/#activation=relu&batchSize=26&dataset=circle&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=3,2,2&seed=0.58949&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=true&xSquared=false&ySquared=false&cosX=false&sinX=true&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false&batchSize_hide=false). 

<br> 

(Don't forget to train the network with the 'Play' button at the top left.) 