# Neural Nets

http://neuralnetworksanddeeplearning.com/

## Neurons

<img src='https://upload.wikimedia.org/wikipedia/commons/thumb/3/36/Components_of_neuron.jpg/640px-Components_of_neuron.jpg' alt='components of neuron' />

By Jennifer Walinga - https://opentextbc.ca/introductiontopsychology/chapter/3-1-the-neuron-is-the-building-block-of-the-nervous-system/, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=97847412

## Artificial Neuron (Perceptron)

<img src='https://raw.githubusercontent.com/gitmystuff/Linkables/main/Artificial_neural_network.png' alt='artificial neuron' />

Artificial Neuron - By Geetika saini - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=57111539
An artificial neuron consisting of dendrites, an axon, and a threshold function

* The diagram above is a single observation (one row)
* Each input represents a feature (explanatory variable)
* Weights are how neural nets learn (parameters, coefficients)
* The synapse is represented by the lines between the weights and the transfer function
* The output ($o_j$) can be continuous, binary, or categorical
* Probabilities are often attached to the categorical outputs
* Activation functions are written as $\phi(x)$ or $\phi(\sum w_i x_i)$
* The choice of activation function depends on the dependent variable (DV); for example, if the DV is binary, we can use a threshold or sigmoid activation function
* Threshold: $y = \phi(\sum w_i x_i)$
* **OR** $y = 1$ if $\sum w_i x_i \geq \theta$, otherwise $0$
* Sigmoid: $p(y=1) = \phi(\sum w_i x_i)$
* **OR** $p(y = 1) = \sigma(\sum w_i x_i)$, where $\sigma(x) = 1 / (1 + e^{-x})$
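
The threshold and sigmoid formulas above can be sketched in a few lines of Python. The weights, inputs, and threshold value below are hypothetical numbers chosen only for illustration.

```python
import math

def weighted_sum(weights, inputs):
    """Compute sum_i w_i * x_i for one observation."""
    return sum(w * x for w, x in zip(weights, inputs))

def threshold_neuron(weights, inputs, theta):
    """Return 1 if the weighted sum reaches the threshold theta, otherwise 0."""
    return 1 if weighted_sum(weights, inputs) >= theta else 0

def sigmoid_neuron(weights, inputs):
    """Return p(y = 1) by passing the weighted sum through the sigmoid."""
    z = weighted_sum(weights, inputs)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights and a single observation with three features
weights = [0.5, -0.25, 1.0]
inputs = [1.0, 2.0, 0.5]   # weighted sum = 0.5 - 0.5 + 0.5 = 0.5

fired = threshold_neuron(weights, inputs, theta=0.4)  # binary output
prob = sigmoid_neuron(weights, inputs)                # probability that y = 1
print(fired, round(prob, 3))
```

The threshold neuron gives a hard 0/1 decision, while the sigmoid neuron gives a probability that can be cut off at any level afterwards.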

## Transfer Function

The transfer function translates the input signals to output signals. Four types of transfer functions are commonly used: unit step (threshold), sigmoid, piecewise linear, and Gaussian. With the unit step function, the output is set at one of two levels, depending on whether the total input is greater than or less than some threshold value.

https://www.saedsayad.com/artificial_neural_network.htm

## Activation Functions

An Activation Function decides whether a neuron should be activated or not. This means that it will decide whether the neuron's input to the network is important or not in the process of prediction using simpler mathematical operations.

https://www.v7labs.com/blog/neural-networks-activation-functions

### The Sigmoid: $S(x) = \frac{1}{1 + e^{-x}} = \frac{e^x}{e^x + 1} = 1 - S(-x)$

Learn more:

* https://en.wikipedia.org/wiki/Sigmoid_function
* https://www.analyticsvidhya.com/blog/2020/01/fundamentals-deep-learning-activation-functions-when-to-use-them/
* https://en.wikipedia.org/wiki/Activation_function

### Rectified Linear Unit (ReLU)

* The Rectified Linear Unit is one of the most commonly used activation functions in deep learning models. The function returns 0 for any negative input, and for any positive value $x$ it returns that value unchanged.
* $f(x) = \max(0, x)$

### Gaussian Error Linear Unit (GELU)
* Weights the input by the standard Gaussian (normal) CDF, making it a smoother version of ReLU
* A common expression is: $\text{GELU}(x) = x\Phi(x)$, where $\Phi$ is the standard normal CDF
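
As a rough illustration of the two definitions, here is a minimal sketch using only the standard library. The helper name `standard_normal_cdf` is introduced here for illustration; it computes $\Phi$ from the error function.

```python
import math

def relu(x):
    """ReLU: 0 for negative inputs, x itself otherwise."""
    return max(0.0, x)

def standard_normal_cdf(x):
    """Phi(x), the standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu(x):
    """GELU(x) = x * Phi(x)."""
    return x * standard_normal_cdf(x)

# Compare the two functions at a few points around zero
for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):.4f}")
```

Unlike ReLU, GELU is slightly negative for small negative inputs and has no sharp corner at zero; for large positive inputs the two are nearly identical.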


<img src='https://raw.githubusercontent.com/gitmystuff/Linkables/main/ReLU_and_GELU.svg' alt='ReLU' />

ReLU - By Ringdongdang - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=95947821
Plot of the ReLU rectifier (blue) and GELU (green) functions near x = 0

## Some Neural Net Categories

* Artificial Neural Nets: Feature selection / extraction
* Convolutional Neural Nets: Feature learning
* Recurrent Neural Nets: Feature propagation (memory)
* Propagation, in science, is the breeding of specimens of a plant or animal by natural processes from the parent stock

## Artificial Neural Network (ANN)

<img src='https://raw.githubusercontent.com/gitmystuff/Linkables/main/Colored_neural_network.svg' alt='simple neural net' />

ANN - By Glosser.ca - Own work, Derivative of File:Artificial neural network.svg, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=24913461
An artificial neural network is an interconnected group of nodes, inspired by a simplification of neurons in a brain. Here, each circular node represents an artificial neuron and an arrow represents a connection from the output of one artificial neuron to the input of another.

### A Simple Example Process

* Initialize weights
* Observation features are passed through a hidden layer (forward propagation)
* Weights are applied along the way; e.g., the third blue hidden node might activate (fire) when it receives inputs from the three red input features, which signals their importance
* The two green outputs can be thought of as $\hat{y}$, the predicted value, which is compared with $y$, the actual value
* The cost function is calculated, e.g., $\frac{1}{2}(\hat{y} - y)^2$ (cost function and gradient descent)
* Weights get adjusted (backpropagation)
* Repeat for multiple epochs until some criterion is met, such as the cost function being minimized
* One epoch is when the whole training set is passed through the ANN
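
The loop above can be sketched with NumPy. To keep the example short, this sketch assumes a single linear neuron with no hidden layer, and the toy data, learning rate, and epoch count are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): 4 observations, 3 features, targets from known weights
X = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w

w = rng.normal(scale=0.1, size=3)      # initialize weights
learning_rate = 0.5                    # assumed value
costs = []

for epoch in range(200):               # one epoch = one pass over the whole training set
    y_hat = X @ w                      # forward propagation
    error = y_hat - y
    cost = 0.5 * np.mean(error ** 2)   # cost: mean of 1/2 (y_hat - y)^2
    costs.append(cost)
    grad = X.T @ error / len(X)        # gradient of the cost with respect to the weights
    w -= learning_rate * grad          # adjust the weights

print(costs[0], costs[-1], w)
```

In practice the stopping rule is often a tolerance on the change in cost rather than a fixed number of epochs.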

### 3 Inputs Without the Hidden Layer

* $y = w_1 x_1 + w_2 x_2 + w_3 x_3$

### With a Hidden Layer

* Inputs may or may not go to each node in the hidden layer
* Some inputs may not be relevant to some nodes
* Hidden layers are where features are discovered and selected (feature selection)
* $y = \phi(\sum w_i x_i)$
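
A minimal sketch of the two cases, assuming a sigmoid for $\phi$ and hypothetical weights. Zero weights stand in for inputs that are not relevant to a given hidden node.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])            # one observation, three inputs

# Without a hidden layer: y = w_1 x_1 + w_2 x_2 + w_3 x_3
w = np.array([0.1, 0.2, 0.3])             # hypothetical weights
y_linear = w @ x

# With a hidden layer of two nodes: each node computes phi(sum_i w_i x_i)
W_hidden = np.array([[0.1, 0.2, 0.3],     # weights into hidden node 1
                     [0.0, -0.5, 0.0]])   # node 2 effectively ignores x_1 and x_3
w_out = np.array([1.0, -1.0])             # hypothetical output weights

hidden = sigmoid(W_hidden @ x)            # activations of the hidden nodes
y_hidden = sigmoid(w_out @ hidden)        # output neuron applies phi again

print(y_linear, hidden, y_hidden)
```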

### Batch Gradient Descent vs Stochastic Gradient Descent

* Batch updates weights when all observations have been passed through the net
* SGD updates weights per observation
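
The difference is only in how often the weights change. The sketch below assumes a linear model with a squared-error cost; the function names, toy data, and learning rate are hypothetical.

```python
import numpy as np

def gradient(w, X, y):
    """Gradient of the mean of 1/2 (y_hat - y)^2 for a linear model y_hat = X @ w."""
    return X.T @ (X @ w - y) / len(X)

def batch_epoch(w, X, y, lr):
    """Batch gradient descent: one weight update per epoch, using all observations."""
    return w - lr * gradient(w, X, y)

def sgd_epoch(w, X, y, lr, rng):
    """Stochastic gradient descent: one weight update per observation, in shuffled order."""
    for i in rng.permutation(len(X)):
        w = w - lr * gradient(w, X[i:i + 1], y[i:i + 1])
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))              # hypothetical toy data
target_w = np.array([1.5, -2.0])
y = X @ target_w

w_batch = np.zeros(2)
w_sgd = np.zeros(2)
for epoch in range(300):
    w_batch = batch_epoch(w_batch, X, y, lr=0.1)       # 1 update per epoch
    w_sgd = sgd_epoch(w_sgd, X, y, lr=0.1, rng=rng)    # 20 updates per epoch

print(w_batch, w_sgd)
```

Mini-batch gradient descent sits between the two: it updates the weights after each small group of observations.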

### Output Probabilities

* The output layer uses softmax to provide a probability for each label, such as Setosa, Versicolor, Virginica
* The output shows the probability of each class label, e.g., (0.3, 0.2, 0.5), and the probabilities sum to 1
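
A minimal softmax sketch, assuming hypothetical raw scores coming out of the output layer:

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)    # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

labels = ["Setosa", "Versicolor", "Virginica"]
scores = np.array([1.0, 0.6, 1.5])       # hypothetical raw output-layer scores
probs = softmax(scores)
predicted = labels[int(np.argmax(probs))]

print(dict(zip(labels, probs.round(3))), predicted)
```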

## Another Neural Net

<img src='https://raw.githubusercontent.com/gitmystuff/Linkables/main/neural%20network.jpg' alt='neural network' />

Neural Network - "Neural Network : basic scheme with legends" by fdecomite is licensed under CC BY 2.0. To view a copy of this license, visit https://creativecommons.org/licenses/by/2.0/?ref=openverse.

## Deep Neural Net

<img src='https://raw.githubusercontent.com/gitmystuff/Linkables/main/deep%20neural%20net.jpg' alt='deep neural net' />

Deep Neural Net - https://www.vectorstock.com/royalty-free-vector/neural-net-neuron-network-vector-10723960

## MLP (Multi-Layer Perceptron)

Information flows in one direction only, from input to output (feedforward)

## Backpropagation

* The neuron's output is compared to the actual value using $C = \frac{1}{2}(\hat{y} - y)^2$
* Weights are then updated according to how much each one is responsible for the error (such as MSE)
* The learning rate decides by how much we update the weights
* All weights are adjusted simultaneously

As a machine-learning algorithm, backpropagation performs a backward pass to adjust the model's parameters, aiming to minimize the mean squared error (MSE). In a single-layered network, backpropagation uses the following steps:

* Traverse through the network from the input to the output by computing the hidden layers' output and the output layer (The Feedforward Step)
* Starting at the output layer and working backward, calculate the derivative of the cost function with respect to the weights of the output and hidden layers
* Repeatedly update the weights until they converge or the model has undergone enough iterations
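
These steps can be sketched for a network with one hidden layer. This is an illustrative sketch, not a reference implementation: the sigmoid activations, the toy OR data, the layer sizes, the learning rate, and the step count are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)

# Toy data (hypothetical): the logical OR of two binary inputs
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [1.0]])

W1 = rng.normal(scale=0.5, size=(2, 3))   # input -> hidden weights
b1 = np.zeros((1, 3))
W2 = rng.normal(scale=0.5, size=(3, 1))   # hidden -> output weights
b2 = np.zeros((1, 1))
lr = 0.5                                  # learning rate (assumed value)
costs = []

for step in range(2000):
    # Feedforward step: hidden layer output, then output layer
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)
    costs.append(0.5 * np.mean((y_hat - y) ** 2))

    # Backward pass: chain rule from the cost back through each layer
    delta_out = (y_hat - y) * y_hat * (1 - y_hat) / len(X)
    grad_W2 = h.T @ delta_out
    grad_b2 = delta_out.sum(axis=0, keepdims=True)
    delta_hidden = (delta_out @ W2.T) * h * (1 - h)
    grad_W1 = X.T @ delta_hidden
    grad_b1 = delta_hidden.sum(axis=0, keepdims=True)

    # All gradients are computed first, then all weights are updated together
    W2 -= lr * grad_W2
    b2 -= lr * grad_b2
    W1 -= lr * grad_W1
    b1 -= lr * grad_b1

print(costs[0], costs[-1])
```

Each `delta` term measures how much a layer's pre-activation values contributed to the error, which is what "responsible for the error" means in practice.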

https://en.wikipedia.org/wiki/Backpropagation

## CNN (Convolutional Neural Network)

https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53

Gradient-Based Learning Applied to Document Recognition by Yann LeCun et al. (1998)

http://vision.stanford.edu/cs598_spring07/papers/Lecun98.pdf

### CNN and Document Classification

Convolutional neural networks are effective at document classification, namely because they are able to pick out salient features (e.g. tokens or sequences of tokens) in a way that is invariant to their position within the input sequences.

https://machinelearningmastery.com/best-practices-document-classification-deep-learning/

Networks with convolutional and pooling layers are useful for classification tasks in which we expect to find strong local clues regarding class membership, but these clues can appear in different places in the input. […] We would like to learn that certain sequences of words are good indicators of the topic, and do not necessarily care where they appear in the document. Convolutional and pooling layers allow the model to learn to find such local indicators, regardless of their position.

https://arxiv.org/abs/1510.00726

<img src='https://raw.githubusercontent.com/gitmystuff/Linkables/main/Convolutional_Neural_Network_with_Color_Image_Filter.gif' alt='convolutional neural network' />

By Cecbur - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=76640332

In [None]:
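import numpy as np

# Build n x n grids of alternating 1s and 0s: first 15 x 15, then 3 x 3
for n in (15, 3):
    a = np.zeros(n * n, dtype=int)  # flat array of n*n zeros
    a[::2] = 1                      # set every other element to 1
    a = a.reshape(n, n)             # with odd n, the rows alternate like a checkerboard
    print(a)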

[[1 0 1 0 1 0 1 0 1 0 1 0 1 0 1]
 [0 1 0 1 0 1 0 1 0 1 0 1 0 1 0]
 [1 0 1 0 1 0 1 0 1 0 1 0 1 0 1]
 [0 1 0 1 0 1 0 1 0 1 0 1 0 1 0]
 [1 0 1 0 1 0 1 0 1 0 1 0 1 0 1]
 [0 1 0 1 0 1 0 1 0 1 0 1 0 1 0]
 [1 0 1 0 1 0 1 0 1 0 1 0 1 0 1]
 [0 1 0 1 0 1 0 1 0 1 0 1 0 1 0]
 [1 0 1 0 1 0 1 0 1 0 1 0 1 0 1]
 [0 1 0 1 0 1 0 1 0 1 0 1 0 1 0]
 [1 0 1 0 1 0 1 0 1 0 1 0 1 0 1]
 [0 1 0 1 0 1 0 1 0 1 0 1 0 1 0]
 [1 0 1 0 1 0 1 0 1 0 1 0 1 0 1]
 [0 1 0 1 0 1 0 1 0 1 0 1 0 1 0]
 [1 0 1 0 1 0 1 0 1 0 1 0 1 0 1]]
[[1 0 1]
 [0 1 0]
 [1 0 1]]


## CNN Steps

### Convolution

* Imagine an input image with a series of 0s and 1s
* We take an $n \times m$ feature detector (filter), which is also a series of 0s and 1s, and let it stride across our image to see whether any $n \times m$ sections are similar
* A feature map (convolved map) records, at each position, the sum of the element-wise products of the filter and the image section beneath it; higher values indicate a closer match
* This reduces the size of our *image*, so it is lossy, but the feature map preserves the discovered features
* We create many feature maps, one per filter, to form a convolutional layer
* A convolutional layer is a layer of feature maps
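
The sliding-filter idea above can be sketched as follows. The function name `convolve2d_valid` and the 5x5 checkerboard image are assumptions made for illustration. As in most deep learning libraries, the filter is not flipped, so strictly speaking this is a cross-correlation.

```python
import numpy as np

def convolve2d_valid(image, kernel, stride=1):
    """Slide the kernel over the image and sum the element-wise products at each position."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w), dtype=image.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patch = image[r:r + kh, c:c + kw]
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

# 5x5 checkerboard image of 0s and 1s, and a 3x3 checkerboard filter
image = (np.indices((5, 5)).sum(axis=0) % 2 == 0).astype(int)
kernel = image[:3, :3].copy()

fmap = convolve2d_valid(image, kernel)
print(fmap)
```

The 5x5 image shrinks to a 3x3 feature map, with the largest values where the image section matches the filter exactly.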

### ReLU Layer

* Now we apply a rectifier such as $\phi(x) = \max(x, 0)$

### Max Pooling

* For example, find the max value in each window as it strides across the feature map and record it in a pooled feature map
* This reduces the size again and preserves the features
* Disregards noise
* Helps prevent overfitting
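
A minimal sketch of the rectifier followed by max pooling, assuming a 2x2 window with stride 2 and a hypothetical 4x4 feature map:

```python
import numpy as np

def max_pool2d(feature_map, size=2, stride=2):
    """Record the max of each size x size window in a pooled feature map."""
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    pooled = np.zeros((out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            pooled[i, j] = feature_map[r:r + size, c:c + size].max()
    return pooled

# Hypothetical feature map containing a couple of negative values
feature_map = np.array([[1, -1, 2, 3],
                        [4, 6, 6, 8],
                        [3, 1, 1, -2],
                        [1, 2, 2, 4]])

rectified = np.maximum(feature_map, 0)   # ReLU layer: negatives become 0
pooled = max_pool2d(rectified)           # 4x4 -> 2x2
print(pooled)
```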

### Flattening

* This step takes a matrix and flattens it to a $1 \times n$ stack

### Full Connection

* From here, the flattened stack gets fed into an ANN
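
Flattening and the hand-off to a fully connected layer can be sketched like this. The pooled feature maps, the number of output nodes, and the random weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical 2x2 pooled feature maps
pooled_maps = np.array([[[6, 8],
                         [3, 4]],
                        [[1, 0],
                         [2, 5]]], dtype=float)

# Flattening: stack every value into a single 1 x n row (n = 2 * 2 * 2 = 8)
flattened = pooled_maps.reshape(1, -1)

# Full connection: every flattened value feeds every node of a dense layer
W = rng.normal(scale=0.1, size=(flattened.shape[1], 3))  # 8 inputs -> 3 nodes
b = np.zeros(3)
scores = flattened @ W + b

print(flattened, scores.shape)
```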

## RNN (Recurrent Neural Network)

RNNs are called recurrent because they perform the same task for every element of a sequence, with the output depending on the previous computations (memory). Typical uses:

* Time series
* Sequential data
* Recognizes sequential patterns and tries to predict the next likely event

<img src='https://raw.githubusercontent.com/gitmystuff/Linkables/main/Recurrent_neural_network_unfold.svg' alt='recurrent neural network' />

By fdeloche - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=60109157

* The bottom is the input state
* The middle is the hidden state
* The top is the output state
* U, V, W are the weight matrices of the hidden layer, output layer, and hidden state, respectively, and are shared across time
* Compressed version on the left, the unfolded version on the right (represents time, or words in a sentence)
* RNNs suffer from exploding or vanishing gradients
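
The unfolded computation can be sketched as a loop over time steps. This sketch follows the naming in the list above (U for the input-to-hidden weights, W for the hidden-to-hidden weights, V for the hidden-to-output weights); the tanh activation, the layer sizes, and the random inputs are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 3, 4, 2   # hypothetical sizes

U = rng.normal(scale=0.1, size=(hidden_size, input_size))    # input -> hidden
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))   # previous hidden -> hidden
V = rng.normal(scale=0.1, size=(output_size, hidden_size))   # hidden -> output

def rnn_forward(xs, U, W, V):
    """Run the same weights over every element of the sequence."""
    h = np.zeros(U.shape[0])             # initial hidden state
    outputs = []
    for x_t in xs:                       # one iteration per time step
        h = np.tanh(U @ x_t + W @ h)     # new hidden state depends on the previous one
        outputs.append(V @ h)            # output at this time step
    return np.array(outputs), h

xs = rng.normal(size=(5, input_size))    # a sequence of 5 inputs
outputs, h_final = rnn_forward(xs, U, W, V)
print(outputs.shape, h_final.shape)
```

Because `W` is multiplied in at every step, gradients flowing back through a long sequence are repeatedly scaled by it, which is where exploding or vanishing gradients come from.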

## Long Short Term Memory

Long short-term memory (LSTM) is an artificial neural network used in the fields of artificial intelligence and deep learning. Unlike standard feedforward neural networks, LSTM has feedback connections. Such a recurrent neural network (RNN) can process not only single data points (such as images), but also entire sequences of data (such as speech or video).

https://en.wikipedia.org/wiki/Long_short-term_memory

* Cell state: information flows from left to right, unchanged unless a gate modifies it
* Information can be added to or removed from the cell state, and these changes are regulated by gates
* LSTMs remember or forget things selectively, which is different from a plain RNN
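
One step of a commonly used LSTM formulation can be sketched as follows. The parameter names, sizes, and random initialization are hypothetical, and this sketch is not tied to the exact cell drawn in the figure.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One time step: gates decide what the cell state forgets, adds, and exposes."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(params["W_f"] @ z + params["b_f"])         # forget gate
    i = sigmoid(params["W_i"] @ z + params["b_i"])         # input gate
    o = sigmoid(params["W_o"] @ z + params["b_o"])         # output gate
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])   # candidate values
    c = f * c_prev + i * c_tilde     # remove some old information, add some new
    h = o * np.tanh(c)               # hidden state is a gated view of the cell state
    return h, c

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4       # hypothetical sizes
params = {}
for gate in "fioc":
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
    params[f"b_{gate}"] = np.zeros(hidden_size)

h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):   # a sequence of 5 inputs
    h, c = lstm_step(x_t, h, c, params)

print(h, c)
```

When the forget gate is close to 1 and the input gate is close to 0, the cell state passes through a step almost unchanged.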

<img src='https://raw.githubusercontent.com/gitmystuff/Linkables/main/LSTM_Cell.svg' alt='LSTM Cell' />

By Guillaume Chevalier - File:The_LSTM_Cell.svg, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=109362147


## Notes

ReLU returns 0 or a positive number

Softmax activation is used for classification

Cross-entropy is a measure of the difference between two probability distributions for a given random variable or set of events.

You might recall that information quantifies the number of bits required to encode and transmit an event. Lower probability events have more information, higher probability events have less information.

In information theory, we like to describe the “surprise” of an event. An event is more surprising the less likely it is, meaning it contains more information.

https://machinelearningmastery.com/cross-entropy-for-machine-learning/
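
The ideas of information and cross-entropy can be checked numerically. This is a minimal sketch in bits (log base 2); the probability values are hypothetical.

```python
import math

def information(p):
    """Information (surprise) of an event with probability p, in bits."""
    return -math.log2(p)

def cross_entropy(p, q):
    """Cross-entropy H(P, Q) = -sum_x P(x) * log2(Q(x)), skipping terms where P(x) = 0."""
    return -sum(p_i * math.log2(q_i) for p_i, q_i in zip(p, q) if p_i > 0)

rare = information(0.1)      # low-probability event: more information
common = information(0.9)    # high-probability event: less information

actual = [0.0, 0.0, 1.0]            # true class as a one-hot distribution
predicted = [0.3, 0.2, 0.5]         # hypothetical softmax output
confident = [0.05, 0.05, 0.9]       # a better prediction for the same class

loss = cross_entropy(actual, predicted)
better_loss = cross_entropy(actual, confident)
print(rare, common, loss, better_loss)
```

The closer the predicted distribution is to the actual one, the lower the cross-entropy, which is why it works as a classification loss.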

Adam is an optimization algorithm that can be used instead of the classical stochastic gradient descent procedure to update network weights iteratively based on training data.

The algorithm is called Adam. It is not an acronym and is not written as “ADAM”.

https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/
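
For reference, a single Adam update can be sketched as below. The default values follow those commonly quoted for the algorithm; the toy objective $f(w) = (w - 3)^2$, the larger learning rate, and the step count are assumptions made for illustration.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad           # running average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2      # running average of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-weight scaled step
    return w, m, v

# Toy objective (hypothetical): minimize f(w) = (w - 3)^2
w = np.array([0.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 2001):
    grad = 2 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.05)

print(w)
```

Compared with plain SGD, each weight gets its own effective step size, based on the recent history of its gradients.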