# 1. Introduction

This notebook explains what a neural network is and how we represent it in a machine learning model.  A neural network in machine learning is an algorithmic architecture designed to mimic biological neural networks.

# 2. Biological Neural Networks

## 2.1. Neurons

### 2.1.1. What are they?

A neuron is the brain's <b>basic computational unit</b>.  

### 2.1.2. What do they do?

A neuron receives and integrates chemical signals from other neurons via its <b>dendrites</b> and, depending on a number of factors, either does nothing or generates an electrical signal that travels along its <b>axon</b>, which in turn signals other connected neurons via synapses:

<p align = "center">
<img src="..\Images\neurons.PNG" width="80%"/>
</p>

Source: https://becominghuman.ai/making-a-simple-neural-network-2ea1de81ec20

### 2.1.3. How do they work?

In essence, the cell acts as a <b>function</b> into which we provide inputs (via the dendrites) that are churned into an output via the axon terminals.  

### 2.1.4. Relevance to machine learning?

Neural networks in machine learning aim to represent this function and connect neurons in a useful way.

# 3. Logistic Regression vs. Neural Networks

## 3.1. Logistic Regression Representation

In logistic regression, we composed a linear model $\theta^T x$ with the logistic function $g(z)$ to form our predictions $h_\theta(x) = g(\theta^T x)$.  This linear model was a combination of feature inputs $x_i$ and weights $\theta_i$:

<p align = "center">
<img src="../Images\logisticRegressionNN.PNG" width=60%>
</p>

Represented like this, we can see that:

1. The first layer contains a node for each value in our input feature vector $x^1$, i.e. $x_0$, $x_1$ and $x_2$.


2. These input features are scaled by their corresponding weights stored in the $\theta^1$ vector, i.e. $\theta_0$, $\theta_1$ and $\theta_2$.


3. The values from (1) and (2) - a single linear combination of all the inputs - are then passed into a single output node, which processes these values via the logistic (aka sigmoid) function.


4. In neural network parlance, we can say (3) is the "<b>activation unit</b>".  It controls whether or not this node, or "<b>neuron</b>", fires.


5. Recall that the $x_0$ bias unit has been added.  It enables matrix operations and also ensures our model isn't forced through the origin, i.e. $(0, 0)$.  Without it, our model would necessarily pass through the origin and cause problems like so:

<p align = "center">
<img src="../Images\logisticRegressionOrigin.PNG" width=60%>
</p>
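
The five points above can be sketched in a few lines of plain Python. This is only an illustration: the function names `sigmoid` and `logistic_unit` and the weight values are assumptions made for the example, not taken from the course material.

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_unit(features, theta):
    """Prediction of one logistic unit; theta[0] is the weight on the bias unit."""
    x = [1.0] + list(features)                    # prepend the bias unit x_0 = 1
    z = sum(t * xi for t, xi in zip(theta, x))    # single linear combination
    return sigmoid(z)                             # activation of the output node

# Hypothetical weights theta_0, theta_1, theta_2 (assumed for illustration).
theta = [-1.0, 2.0, 0.5]
prediction = logistic_unit([0.5, 2.0], theta)
print(prediction)
```

Because $x_0 = 1$ is prepended inside the function, the weight $\theta_0$ shifts the decision boundary away from the origin.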

## 3.2. Difference to a simple Neural Network

At a high level, they're practically identical - the main difference being the activation function, $g(z)$, used to control neuron firing. The perceptron activation is a <b>step-function</b> from 0 (when the neuron doesn't fire) to 1 (when the neuron fires) while the logistic regression model has a smoother activation function with values ranging from 0 to 1.
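
To make the contrast concrete, the sketch below evaluates both activation functions on a few values of $z$. Placing the step at $z = 0$ is an assumption for the example; the text above does not fix where the step occurs.

```python
import math

def step(z):
    """Perceptron-style step activation: 0 (no fire) or 1 (fire).
    The threshold at z = 0 is an assumption made for this sketch."""
    return 1 if z >= 0 else 0

def sigmoid(z):
    """Logistic activation: a smooth curve with values between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

for z in (-2.0, -0.1, 0.0, 0.1, 2.0):
    print(f"z={z:+.1f}  step={step(z)}  sigmoid={sigmoid(z):.3f}")
```

Near $z = 0$ the step output jumps from 0 to 1, while the sigmoid output changes only slightly.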

## 3.3. Why can't we simply use linear or logistic regression?

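Even for a relatively simple problem such as predicting house prices, if there are 100 features it may become incredibly complicated to fit a linear or logistic function to the data: capturing non-linear relationships means adding polynomial terms, and 100 features already produce roughly 5,000 quadratic terms.  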

<p align = "center">
<img src="../Images\neuralnetworks1.PNG" width=70%>
</p>

# 4. What are neural networks in machine learning?

The below is a great video from [3Blue1Brown](http://www.3blue1brown.com/) introducing and explaining neural networks:

In [19]:
from IPython.display import HTML

# Youtube
HTML('<center><iframe width="560" height="315" src="https://www.youtube.com/embed/aircAruvnKk" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen> </iframe></center>')


# 5. Neural Network Ingredients + Notation

Working with this example from the Andrew Ng course we will explain the function of each component, plus its corresponding notation.

<p align = "center">
<img src="../Images\neuralnetworks6.PNG" width=80%>
</p>

## 5.1. Nodes

A neural network is a network of computational units, also known as nodes.  These nodes store and / or process information passing through the neural network from its input layer to its output layer.

In the above, each node is represented by a circle.  There are different types of nodes, including:

* Input nodes


* Activation Units


* Output nodes

These, and their notation, are explored below.

## 5.2. Layers $j$

### 5.2.1. What do they do?

Layers are simply a collection of nodes organised together but separate from other layers of nodes.  There are different types of layers:

* <b>Input Layer:</b> the first layer in the neural network where each node represents a value for each feature $x_i$.
    
    
* <b>Hidden Layers:</b> subsequent layers intermediate between the input layer and output layer.


* <b>Output Layer:</b> the last layer in the neural network where each node represents a value corresponding to an output value $y$.


### 5.2.2. Notation

* Current layer = $j$.


* Preceding layer = $j - 1$.


* Subsequent layer = $j + 1$.



## 5.3. Input Nodes $x$

### 5.3.1. What does it do?

An input node stores the value of a single feature, $x_i$.  For instance, if we were teaching a neural network to classify 28 x 28 pixel images (i.e. 784 pixels total) of handwritten digits to their corresponding numerical values we might:

1. Represent each <b>pixel</b> by its grayscale value, i.e. a number between $0$ (white) and $1$ (black);


2. Store each pixel's value in its own input node, e.g. node 1 = grayscale value of pixel 1 and so on for $x_1, x_2, ... x_{784}$; and


3. Therefore each <b>image</b>, $x^i$, would be represented by a $784$ x $1$ vector of pixel grayscale values.
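
As a rough sketch of steps 1 to 3, the snippet below flattens a 28 x 28 grid of grayscale values into a single 784-element feature vector. The helper name `flatten_image` and the example image are hypothetical, and row-by-row ordering of pixels is an assumption.

```python
def flatten_image(image):
    """Flatten a 2-D grid of grayscale values into one feature vector,
    reading the pixels row by row (ordering assumed for this sketch)."""
    return [pixel for row in image for pixel in row]

# A hypothetical blank 28 x 28 image with a single fully dark pixel.
image = [[0.0] * 28 for _ in range(28)]
image[0][1] = 1.0

x = flatten_image(image)
print(len(x))   # one input node per pixel
```

Note that Python lists are 0-indexed, so `x[1]` here corresponds to the second pixel, i.e. $x_2$ in the notation above.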

### 5.3.2. Notation

Simply:

* $x_i$ = the $i^{th}$ feature; and


* $x^{(i)}$ = the $i^{th}$ sample.


<b>Example:</b> $x_1^{(2)}$ = feature $1$ in the $2^{nd}$ sample.


## 5.4. Activation Units $a$

### 5.4.1. What does it do?

Each activation unit:

1. Takes as its input the sum of: (i) each value from each input node, $x_i$, multiplied (ii) by the corresponding weight, $w_i$.


2. The information from (1) is then processed by an <b>activation function</b>, traditionally the <b>sigmoid / logistic</b> function, although in recent times the <b>ReLU</b> function is often preferred for hidden layers.  The sigmoid squashes that number into a value between $0$ and $1$, whereas ReLU outputs $\max(0, z)$, which is $0$ for negative inputs and unbounded above.


3. The value from (2) can then be compared against a <b>threshold value</b> (e.g. $0.5$ for the sigmoid, which corresponds to the weighted sum being $0$).  If the value is higher than the threshold the neuron is said to fire, and if it is less than the threshold it does not fire.
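
A minimal sketch of one activation unit follows, assuming a separate `bias` argument and made-up input and weight values. It applies the same weighted sum to a sigmoid and to a ReLU activation; note that ReLU, defined as $\max(0, z)$, is not bounded above by $1$.

```python
import math

def sigmoid(z):
    """Squashes z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Rectified linear unit: 0 for negative z, z itself otherwise."""
    return max(0.0, z)

def activation_unit(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = bias + sum(w * x for w, x in zip(weights, inputs))
    return activation(z)

# Hypothetical values chosen so that the weighted sum z equals 3.0.
inputs = [1.0, 2.0]
weights = [0.5, 1.5]
a_sigmoid = activation_unit(inputs, weights, -0.5, sigmoid)
a_relu = activation_unit(inputs, weights, -0.5, relu)
print(a_sigmoid, a_relu)
```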

### 5.4.2. Notation

1. $_i$ the activation unit number.


2. $^j$ the layer number used to reference the current layer from:

    (a) $j + 1$, the <b>subsequent</b> layer; and
    
    (b) $j - 1$, the <b>previous</b> layer.


3. $a_i^{(j)}$ = activation unit $i$ in layer $j$.  

    <b>Example:</b> $a_1^{(2)}$ = activation unit $1$ in the $2^{nd}$ layer.

## 5.5. Output Nodes

### 5.5.1. What does it do?

An output node is an activation unit that computes the final output for the hypothesis.

### 5.5.2. Notation

Same as for activation units above.

    
## 5.6. Parameters Matrix $\Theta$ 

### 5.6.1. What does it do?

The parameters matrix controls the function mapping from layer $j$ to layer $j + 1$.  In other words, these are the dials and knobs that the algorithm tweaks to learn the optimum values and combinations of weights to accurately predict outputs from inputs.

### 5.6.2. Notation

<p align = "center">
<img src="../Images\neuralnetwork10.PNG" width=100%>
</p>
  

### 5.6.3. Dimensionality of the $\Theta$ Parameters Matrix

The dimensionality of the $\Theta^{(j)}$ matrix, which maps layer $j$ to layer $j + 1$, is expressed as: $s_{j + 1}$ x $(s_j + 1)$ (rows x columns), where: 

* $s_{j + 1}$ = number of units in layer $j + 1$, i.e. the following layer; and


* $(s_j + 1)$ = number of units in the current layer $+ 1$, i.e. the additional unit to represent the bias unit, whose value is always $1$.

#### Example 1: $\Theta^{1}$

* <b>Number of rows</b> = number of units in the <b>following</b> layer, i.e. $3$.


* <b>Number of columns</b> = number of units in the <b>current</b> layer $+ 1$ (because of the bias unit), i.e. $4$.


* <b>Resulting Dimensionality</b> = $\Theta^{1}$ is of $3$ x $4$ dimensions.

#### Example 2: $\Theta^{2}$

* <b>Number of rows</b> = number of units in the <b>following</b> layer, i.e. $1$.


* <b>Number of columns</b> = number of units in the <b>current</b> layer $+ 1$ (because of the bias unit), i.e. $4$.


* <b>Resulting Dimensionality</b> = $\Theta^{2}$ is of $1$ x $4$ dimensions.
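
The rule $s_{j + 1}$ x $(s_j + 1)$ can be checked with a short helper. The function name `theta_shapes` is hypothetical, and the layer sizes `[3, 3, 1]` are inferred from the two examples above (bias units excluded).

```python
def theta_shapes(layer_sizes):
    """Return the (rows, columns) of each Theta matrix.

    layer_sizes lists the number of units per layer, excluding bias units.
    Each matrix has s_(j+1) rows and s_j + 1 columns.
    """
    return [(s_next, s_curr + 1)
            for s_curr, s_next in zip(layer_sizes, layer_sizes[1:])]

# 3 input units, 3 hidden units, 1 output unit.
shapes = theta_shapes([3, 3, 1])
print(shapes)
```

The first tuple is the shape of $\Theta^{1}$ and the second is the shape of $\Theta^{2}$.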

# 6. How is information forward processed from one layer to another?

In the below, the value of $a_0^{(1)}$ is equal to the sigmoid of: the weighted sum of each activation unit in the preceding layer + the bias unit.

<p align = "center">
<img src="../Images\NNProcessing.PNG" width=80%>
</p>

<p align = "center">
<img src="../Images\NNProcessing3.PNG" width=80%>
</p>

# 7. Forward Propagation 

## 7.1. What is forward propagation?

In neural networks, you propagate <b>forward</b> from the initial input feature values in layer 1 to the output values of the final layer.  This is done to compute the network's output, which is then compared with the real value to get the error. 
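
A minimal forward pass might look like the sketch below. It assumes sigmoid activations in every layer and one weight matrix per layer transition, with the bias weight in column 0 of each row; the weight values and the function name `forward_propagate` are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward_propagate(x, thetas):
    """Return the activations of every layer, input layer first.

    Each theta is a list of rows; row i holds the weights feeding unit i of
    the next layer, with the bias weight in column 0.
    """
    activations = [list(x)]
    for theta in thetas:
        a_with_bias = [1.0] + activations[-1]      # add the bias unit
        next_a = [sigmoid(sum(w * a for w, a in zip(row, a_with_bias)))
                  for row in theta]
        activations.append(next_a)
    return activations

# Hypothetical weights for a 3-3-1 network (3 x 4 and 1 x 4 matrices).
theta1 = [[0.1, 0.2, -0.3, 0.4],
          [-0.5, 0.6, 0.7, -0.8],
          [0.9, -1.0, 0.2, 0.3]]
theta2 = [[0.3, -0.6, 0.9, -1.2]]

layers = forward_propagate([1.0, 0.5, -1.5], [theta1, theta2])
hypothesis = layers[-1]
print(hypothesis)
```

The last entry of `layers` is the network's output, which is the value that would be compared with the real value to get the error.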

## 7.2. What's clever about forward propagation?

The clever thing is the difference in what a neural network can learn vs. logistic regression:

* <b>Logistic Regression:</b> output layer learns new parameters based on the original input features.


* <b>Neural Network:</b> output layer learns <b>both</b> new parameters <b>and</b> new features based on inputs from intermediate layers.

### 7.2.1. Logistic Regression

<p align = "center">
<img src="../Images\neuralnetworks14.PNG" width=40%>
</p>

Here Layer 3 is simply a logistic regression node whose hypothesis output is: $g(\Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)})$.

The model is constrained to processing the input features scaled by the weights.  It cannot learn new features, only new parameters.

 
### 7.2.2. Neural Networks

<p align = "center">
<img src="../Images\neuralnetworks12.PNG" width=40%>
</p>

Here, unlike the above, Layer 3 learns both: 

1. new parameters; and


2. new features!

So it's as if the neural network, instead of being constrained to feed the features x<sub>1</sub>, x<sub>2</sub> and x<sub>3</sub> into logistic regression, gets to learn its own features a<sub>1</sub><sup>(2)</sup>, a<sub>2</sub><sup>(2)</sup> and a<sub>3</sub><sup>(2)</sup> to feed into it.

### 7.2.3. Why is this helpful?

It means the neural network can learn some interesting and complex features (e.g. edges or combinations of edges in images of handwritten digits as in the 3Blue1Brown video) and therefore end up with better hypotheses than if you were constrained to use the raw features x<sub>1</sub>, x<sub>2</sub> and x<sub>3</sub>.

## 7.3. An Example

<p align = "center">
<img src="../Images\neuralnetworks11.PNG" width=70%>
</p>

# 8. Backward Propagation

## 8.1. What is backward propagation?

To minimize the error, you propagate <b>backwards</b> by finding the derivative of the error with respect to each weight and then subtracting this value, scaled by a learning rate, from the weight value.
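
The update rule can be illustrated on the simplest possible case: a single sigmoid neuron, not a full multi-layer network. The sketch assumes a cross-entropy error, for which the derivative with respect to weight $\theta_i$ is $(h - y) x_i$, and a hypothetical `learning_rate` parameter. Propagating derivatives back through hidden layers needs the chain rule and is not shown here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_step(theta, x, y, learning_rate):
    """One gradient-descent update for a single sigmoid neuron.

    Assumes cross-entropy error, whose derivative with respect to
    theta_i is (h - y) * x_i. x[0] must be the bias unit (1.0).
    """
    h = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    gradient = [(h - y) * xi for xi in x]
    return [t - learning_rate * g for t, g in zip(theta, gradient)]

x = [1.0, 2.0, -1.0]      # bias unit followed by two feature values
y = 1.0                   # the real value for this sample
theta = [0.0, 0.0, 0.0]   # start with all weights at zero

updated = gradient_step(theta, x, y, learning_rate=0.1)
print(updated)
```

After the step, the neuron's prediction for this sample moves from $0.5$ towards the real value of $1$.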


## 8.2. Back Propagation

See week 5 notes for full detail.

# 9. Neural Networks and Logic Gates

## 9.1. Why Demonstrate This?

Demonstrating the below helps build an intuitive understanding of how neural networks can be configured to replicate logic gates to perform increasingly complex decision making.

## 9.2. AND gate

### 9.2.1. Defining the Problem

We want to create an AND logic gate with a neural network:  

<p align = "center">
<img src="../Images\logicGateDiagramAND.PNG" width=40%>
</p>

<p align = "center">
<img src="../Images\ANDLogicGate.PNG" width=50%>
</p>

This is a function that will output 1 if and only if <b>both inputs are equal to 1</b>.

### 9.2.2. A Neural Network Solution

<p align = "center">
<img src="../Images\neuralnetworks15.PNG" width=50%>
</p>

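To achieve this, we choose a negative weight on the bias unit x<sub>0</sub> whose magnitude is greater than each of the weights on x<sub>1</sub> and x<sub>2</sub> individually, but less than their sum.  The neuron then fires only when both inputs are 1.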
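
The AND neuron can be checked against its truth table with a few lines of Python. The weights $-30$, $+20$, $+20$ are the values used in the Andrew Ng course example and are assumed here rather than read off the figure; the helper name `neuron` is hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(theta, inputs):
    """Sigmoid neuron; theta[0] is the weight on the bias unit x_0 = 1."""
    x = [1.0] + list(inputs)
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

# Assumed weights: bias weight -30, input weights +20 and +20.
AND_THETA = [-30.0, 20.0, 20.0]

# Round each activation to 0 or 1 to read it as a logic level.
and_table = {(x1, x2): round(neuron(AND_THETA, (x1, x2)))
             for x1 in (0, 1) for x2 in (0, 1)}
print(and_table)
```

With these weights the weighted sum is $-30$, $-10$, $-10$ and $+10$ for the four input pairs, so only the pair $(1, 1)$ lands on the positive side of the sigmoid.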

## 9.3. OR gate

### 9.3.1. Defining the Problem

We want to create an OR logic gate with a neural network:  

<p align = "center">
<img src="../Images\logicGateDiagramOR.PNG" width=40%>
</p>

<p align = "center">
<img src="../Images\ORLogicGate.PNG" width=50%>
</p>

This is a function that will output 1 if and only if <b>one or both of the inputs are equal to 1</b>.

### 9.3.2. Demonstrating the Solution

<p align = "center">
<img src="../Images\neuralnetworks16.PNG" width=50%>
</p>

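To achieve this, we choose a negative weight on the bias unit x<sub>0</sub> whose magnitude is less than each of the weights on x<sub>1</sub> and x<sub>2</sub>, so a single input equal to 1 is enough to make the neuron fire.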

## 9.4. NOT gate

### 9.4.1. Defining the Problem

We want to create a NOT logic gate with a neural network.

<p align = "center">
<img src="../Images\logicGateDiagramNOT.PNG" width=40%>
</p>

<p align = "center">
<img src="../Images\NOTLogicGate.PNG" width=50%>
</p>

This is a function that will output 1 if and only if <b>the input is equal to 0</b>.

### 9.4.2. Demonstrating the Solution

<p align = "center">
<img src="../Images\neuralnetworks17.PNG" width=50%>
</p>

To achieve this, we choose a positive weight on the bias unit x<sub>0</sub> and a negative weight on x<sub>1</sub> of greater magnitude, so the neuron fires when x<sub>1</sub> is 0 and stops firing when x<sub>1</sub> is 1.

## 9.5. NAND gate

### 9.5.1. Defining the Problem

We want to create a NAND logic gate with a neural network.

<p align = "center">
<img src="../Images\logicGateDiagramNAND.PNG" width=40%>
</p>

<p align = "center">
<img src="../Images\NANDLogicGate.PNG" width=50%>
</p>

This is a function that will output 1 if and only if <b>one or more of the inputs are equal to 0</b>.

### 9.5.2. Demonstrating the Solution

<p align = "center">
<img src="../Images\neuralnetworks21.PNG" width=50%>
</p>

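To achieve this, we choose a positive weight on the bias unit x<sub>0</sub> and negative weights on x<sub>1</sub> and x<sub>2</sub>, where the bias weight is greater in magnitude than each negative weight individually but less than their combined magnitude.  The neuron then stops firing only when both inputs are 1.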

## 9.6. NOR gate

### 9.6.1. Defining the Problem

We want to create a NOR logic gate with a neural network.

<p align = "center">
<img src="../Images\logicGateDiagramNOR.PNG" width=40%>
</p>

<p align = "center">
<img src="../Images\NORLogicGate.PNG" width=50%>
</p>

This is a function that will output 1 if and only if <b> both inputs are equal to 0</b>.

### 9.6.2. Demonstrating the Solution

<p align = "center">
<img src="../Images\neuralnetworks19.PNG" width=50%>
</p>

To achieve this, we choose a positive weight on the bias unit x<sub>0</sub> and negative weights on x<sub>1</sub> and x<sub>2</sub>, each of greater magnitude than the bias weight, so either input being 1 is enough to stop the neuron firing.
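
The OR, NOT, NAND and NOR gates from sections 9.3 to 9.6 can be checked the same way with a single neuron each. The weight values below are illustrative assumptions in the style of the Andrew Ng course examples, not values read off the figures, and the helper name `neuron` is hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(theta, inputs):
    """Sigmoid neuron rounded to 0 or 1; theta[0] is the bias weight."""
    x = [1.0] + list(inputs)
    return round(sigmoid(sum(t * xi for t, xi in zip(theta, x))))

# Assumed weights: [bias weight, weight on x_1, weight on x_2].
OR_THETA = [-10.0, 20.0, 20.0]
NOT_THETA = [10.0, -20.0]
NAND_THETA = [30.0, -20.0, -20.0]
NOR_THETA = [10.0, -20.0, -20.0]

PAIRS = [(0, 0), (0, 1), (1, 0), (1, 1)]
or_outputs = [neuron(OR_THETA, p) for p in PAIRS]
nand_outputs = [neuron(NAND_THETA, p) for p in PAIRS]
nor_outputs = [neuron(NOR_THETA, p) for p in PAIRS]
not_outputs = [neuron(NOT_THETA, (x1,)) for x1 in (0, 1)]

print(or_outputs, nand_outputs, nor_outputs, not_outputs)
```

Each gate differs only in the sign and relative size of the bias weight and the input weights.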

## 9.7. A Limitation of Neural Networks

The above examples are possible because each gate lets a single neuron separate positive examples from negative examples with a straight line, i.e. a linear decision boundary.  Unfortunately, <b>XOR</b> and <b>XNOR</b> logic gates need a non-linear decision boundary, which rules out further permutations on the above basic structures.  For example:

<p align = "center">
<img src="../Images\XORXNORproblem.GIF" width=50%>
</p>

Instead, we must combine these structures to create a more complex neural network with a hidden layer.

## 9.8. XOR gate

### 9.8.1. Defining the Problem

We want to create an XOR logic gate with a neural network.

<p align = "center">
<img src="../Images\logicGateDiagramXOR.PNG" width=40%>
</p>

<p align = "center">
<img src="../Images\XORLogicGate.PNG" width=50%>
</p>

This is a function that will output 1 if and only if <b> one but not both inputs are equal to 1</b>.

### 9.8.2. Demonstrating the Solution

<p align = "center">
<img src="../Images\XORneuralnetwork.PNG" width=50%>
</p>

To solve for an XOR neural network we must combine:

1. a NOR neural network (ORANGE); AND


2. a NAND neural network (BLUE).
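
One way to wire a NOR neuron and a NAND neuron into an XOR network is sketched below. All weights are assumptions for illustration, in particular the output neuron, which is set up here to fire when the NAND neuron is on and the NOR neuron is off; the figure above may use different values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(theta, inputs):
    """Sigmoid neuron; theta[0] is the weight on the bias unit."""
    x = [1.0] + list(inputs)
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

# Hidden layer (assumed weights).
NOR_THETA = [10.0, -20.0, -20.0]
NAND_THETA = [30.0, -20.0, -20.0]
# Output neuron (assumed): bias -10, weight -20 on NOR, weight +20 on NAND.
OUTPUT_THETA = [-10.0, -20.0, 20.0]

def xor(x1, x2):
    hidden = [neuron(NOR_THETA, (x1, x2)), neuron(NAND_THETA, (x1, x2))]
    return round(neuron(OUTPUT_THETA, hidden))

xor_outputs = [xor(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(xor_outputs)
```

The hidden layer is what makes the non-linear decision boundary possible: the output neuron sees the NOR and NAND activations rather than the raw inputs.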

## 9.10. XNOR gate

### 9.10.1. Defining the Problem

We want to create an XNOR logic gate with a neural network.

<p align = "center">
<img src="../Images\logicGateDiagramXNOR.PNG" width=40%>
</p>

<p align = "center">
<img src="../Images\XNORLogicGate.PNG" width=50%>
</p>

This is a function that will output 1 if and only if <b> both inputs are the same, i.e. both 1 or both 0</b>.

### 9.10.2. Demonstrating the Solution

<p align = "center">
<img src="../Images\XNORneuralnetwork.PNG" width=50%>
</p>

To solve for an XNOR neural network we must combine:

1. an AND neural network (RED);


2. a NOR neural network (BLUE); AND


3. an OR neural network (GREEN).
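
The three-neuron combination can be sketched as follows: an AND neuron and a NOR neuron form the hidden layer, and an OR neuron combines their activations. The weight values are the ones commonly used in the Andrew Ng course example and are assumed here rather than read off the figure.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(theta, inputs):
    """Sigmoid neuron; theta[0] is the weight on the bias unit."""
    x = [1.0] + list(inputs)
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

# Assumed weights: [bias weight, weight on input 1, weight on input 2].
AND_THETA = [-30.0, 20.0, 20.0]
NOR_THETA = [10.0, -20.0, -20.0]
OR_THETA = [-10.0, 20.0, 20.0]

def xnor(x1, x2):
    hidden = [neuron(AND_THETA, (x1, x2)), neuron(NOR_THETA, (x1, x2))]
    return round(neuron(OR_THETA, hidden))   # OR of the two hidden units

xnor_outputs = [xnor(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(xnor_outputs)
```

The AND unit covers the case where both inputs are 1, the NOR unit covers the case where both are 0, and the OR unit fires if either case holds.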

## 9.11. Useful Resources

- http://www.ee.surrey.ac.uk/Projects/CAL/digital-logic/gatesfunc/index.html

- https://medium.com/@jayeshbahire/the-xor-problem-in-neural-networks-50006411840b (re the XOR and XNOR problems for neural networks, i.e. the inability to linearly separate these types of classification)

- https://www.quora.com/How-can-we-design-a-neural-network-that-acts-as-an-XOR-gate

# 10. Multiclass Classification

As with logistic regression we can apply neural networks to multiclass classification problems. 

## 10.1. How does this work?

To do so we will have multiple output nodes, each output node corresponding to a particular classification.  E.g.

<p align = "center">
<img src="../Images\multiclassneuralnetwork2.PNG" width=50%>
</p>

* <b>Node 1</b> = Pedestrian.


* <b>Node 2</b> = Car.


* <b>Node 3</b> = Motorcycle.


* <b>Node 4</b> = Truck.

This resulting set of classes can be defined as $y$:

<p align = "center">
<img src="../Images\multiclassneuralnetworkYClasses.PNG" width=20%>
</p>

And if our resulting hypothesis for one set of inputs looks like this:

$h_\Theta(x) =\begin{bmatrix}0 \\ 0 \\ 1 \\ 0\end{bmatrix}$

Then the resulting class for the input image is the third one down, or $h_\Theta(x)_3$, which represents the motorcycle.
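
In practice the output nodes hold values between 0 and 1 rather than exact 0s and 1s, so a common way to read off the class is to pick the node with the largest activation. The sketch below assumes that convention; the helper name `predict_class` and the example activations are made up.

```python
# Class labels in the same order as the output nodes above.
CLASSES = ["pedestrian", "car", "motorcycle", "truck"]

def predict_class(hypothesis):
    """Return the label of the output node with the largest activation."""
    index = max(range(len(hypothesis)), key=lambda i: hypothesis[i])
    return CLASSES[index]

# Hypothetical output activations for one input image.
label = predict_class([0.05, 0.10, 0.80, 0.05])
print(label)
```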

# 11. Why use neural networks?

To solve large and complicated problems.  For instance, a computer vision problem to identify cars from non-cars:

<p align = "center">
<img src="../Images\neuralnetworks2.PNG" width=50%>
</p>

In this type of problem, the number of features is huge.  In such circumstances, a linear classifier on the raw features is too simple, while adding polynomial feature terms to capture non-linearity quickly becomes computationally impractical.  

Instead, we need a non-linear hypothesis to separate the classes.  Neural networks perform much better in this regard than plain logistic regression.

<b>Explanation re Quadratic Terms:</b>

1. The ~3,000,000 quadratic features figure is the number of unique products x<sub>i</sub>x<sub>j</sub> (with i ≤ j, so squared terms are included) that can be formed from n = 2,500 input features.


2. This can be calculated as follows: n(n + 1) / 2.


3. Therefore, this equates to: 2,500 x (2,500 + 1) / 2.


4. 6,252,500 / 2 = 3,126,250.
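
The arithmetic can be verified with a short script, which also cross-checks the formula by brute force on a small value of n. The function name `quadratic_feature_count` is hypothetical.

```python
from itertools import combinations_with_replacement

def quadratic_feature_count(n):
    """Number of products x_i * x_j with i <= j for n features: n(n + 1) / 2."""
    return n * (n + 1) // 2

count = quadratic_feature_count(2500)
print(count)

# Brute-force cross-check for a small n: enumerate every pair (i, j), i <= j.
small_n = 5
enumerated = len(list(combinations_with_replacement(range(small_n), 2)))
print(enumerated == quadratic_feature_count(small_n))
```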

See also here:

- https://www.coursera.org/learn/machine-learning/discussions/weeks/4/threads/zWn1oshFEeWRfg5WVr61Uw


- https://www.mathsisfun.com/combinatorics/combinations-permutations.html

# 12. Popular Neural Network Architectures


<p align = "center">
<img src="../Images\neuralNetworksArchitectires.PNG" width=70%>
</p>


# 13. Useful Resources

- https://www.jeremyjordan.me/intro-to-neural-networks/


- http://www.asimovinstitute.org/neural-network-zoo/


- https://media.readthedocs.org/pdf/ml-cheatsheet/latest/ml-cheatsheet.pdf


- https://bigtheta.io/2016/02/24/intro-to-neural-networks.html


- https://medium.com/@erikhallstrm/backpropagation-from-the-beginning-77356edf427d


- https://ml-cheatsheet.readthedocs.io/en/latest/forwardpropagation.html


- https://d3c33hcgiwev3.cloudfront.net/_106ac679d8102f2bee614cc67e9e5212_deep-learning-notation.pdf?Expires=1543708800&Signature=jRrjLR7FHWycjGsg5b24zlDfEfeOZdm9mqq~j1BQ7Zj2QUgG-Zbk3DvjxKJxgo6j-043b~EQK433DRu7Db3u~hpTYlak331V8oc-CB~lyWEPtLB6zv6JMknrGuoGbwOG4hVy~JLD9msqX9bo~QiAuC0iuFMJhJ7s74bCuuZ8UWY_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A


- https://towardsdatascience.com/under-the-hood-of-neural-network-forward-propagation-the-dreaded-matrix-multiplication-a5360b33426


- https://towardsdatascience.com/introducing-deep-learning-and-neural-networks-deep-learning-for-rookies-1-bd68f9cf5883 


- https://towardsdatascience.com/multi-layer-neural-networks-with-sigmoid-function-deep-learning-for-rookies-2-bf464f09eb7f

