# Neural Networks Representation
In this lesson, we explore **Neural Networks (NN)**, a machine learning model designed to mimic how biological neurons in the brain work. Neural networks are the backbone of many "intelligent" technologies we use on a daily basis:
- **Speech Recognition**: Powering voice assistants (like Siri or Alexa) by interpreting and understanding spoken language.
- **Computer Vision**: Used in automated banking to read handwritten digits on checks.
- **Pattern Recognition**: Identifying complex relationships in data that traditional models might miss.
    - Example: Predictive maintenance in manufacturing, where NNs recognize abnormal patterns in a machine's vibration and sound data to predict failures before they occur.

## Motivations

### Non-linear Hypotheses
To capture complex, non-linear trends using **Logistic Regression**, you must add polynomial features ($x_1^2, x_1x_2,\cdots$). However, this becomes impractical when the number of original features $n$ is large.

#### The Problem: "Feature Explosion"

As you increase the degree of your hypothesis, the number of features grows polynomially in $n$, which quickly becomes unmanageable:
- **Quadratic features**: Grow at $O(n^2)$. If $n=100$, you have $\approx5,000$ features.
- **Cubic features**: Grow at $O(n^3)$. If $n=100$, you have $\approx170,000$ features.

#### The Computer Vision Challenge
In real-world tasks like image recognition, $n$ is naturally high:
- A small $50\times50$ pixel image has 2,500 features (pixels).
- Using quadratic features to detect a car in that image would require over **3 million features**.
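
The counts quoted above can be checked with a short sketch. It assumes that "quadratic features" means all distinct degree-2 monomials (including squares such as $x_1^2$) and "cubic features" means all distinct degree-3 monomials; the helper name `count_monomials` is illustrative, not part of the lesson.

```python
from math import comb


def count_monomials(n: int, degree: int) -> int:
    """Number of distinct monomials of exactly `degree` in n variables.

    This is the "combinations with repetition" count C(n + degree - 1, degree).
    """
    return comb(n + degree - 1, degree)


quadratic_100 = count_monomials(100, 2)      # roughly 5,000
cubic_100 = count_monomials(100, 3)          # roughly 170,000
quadratic_pixels = count_monomials(2500, 2)  # roughly 3 million

print(f"n=100,  quadratic: {quadratic_100:,}")
print(f"n=100,  cubic:     {cubic_100:,}")
print(f"n=2500, quadratic: {quadratic_pixels:,}")
```

Under this assumption the exact values are 5,050, 171,700 and 3,126,250, which agree with the rounded figures in the text.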

$\rightarrow$ **The Bottom Line**: For high-dimensional data, logistic regression with polynomial features is too computationally *expensive* and prone to *overfitting*. **Neural networks** offer a better way to learn these complex non-linear hypotheses efficiently.

### Neurons and Brain
Neural networks were originally inspired by the goal of mimicking the human brain. While an older concept, they have become the modern state-of-the-art due to the surge in computing power.

#### The "One Learning Algorithm" Hypothesis
The core motivation for neural networks is the idea that the brain does not use different software for different senses. Instead, it likely uses a **single, universal learning algorithm** to process all data.

#### Evidence: Brain Plasticity
Neuro-rewiring experiments demonstrate that the brain is incredibly adaptable:
- **Auditory/Somatosensory Cortex**: If visual signals are rerouted to parts of the brain normally used for hearing or touch, those areas will **learn to see**.
- **Sensory Substitution**: Humans can learn to "see" using electrical pulses on the **tongue** (BrainPort) or navigate via **sonar** (echolocation).

#### Strategic Value
The goal of AI is to implement a mathematical approximation of this universal algorithm. We use neural networks today not just for the "AI dream" of true intelligence, but because they are the most effective way to solve complex, high-dimensional machine learning problems.

## Neural Networks

### Model Representation I

#### Biological Neural Models
Neuron and myelinated axon, with signal flow from `Inputs` at `Dendrites` to `Outputs` at `Axon` terminals. The signal is a short electrical pulse called **Action Potential** or **"spike"**.

![image.png](attachment:image.png)

(https://en.wikipedia.org/wiki/Biological_neuron_model#/media/File:Neuron3.png)

![image-2.png](attachment:image-2.png)

https://www.youtube.com/watch?v=nWbzFrZvdYM



![image.png](attachment:image.png)

$$
z(x)=\theta_0x_0+\theta_1x_1+\cdots+\theta_nx_n
=\begin{bmatrix}
    \theta_0 & \theta_1 & \cdots & \theta_n
\end{bmatrix}
\begin{bmatrix}
    x_0\\
    x_1\\
    \vdots\\
    x_n
\end{bmatrix}
=\theta^Tx
$$
$$
g(z)=\frac{1}{1+e^{(-z)}}
$$
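
The two formulas above describe a single artificial neuron: a weighted sum $z=\theta^Tx$ followed by the sigmoid $g(z)$. A minimal NumPy sketch follows; the weights and inputs are made-up values chosen only for illustration, and the function names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    """Logistic activation g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))


def neuron(theta, x):
    """Output of one neuron: g(theta^T x), with the bias x_0 = 1 prepended."""
    x_with_bias = np.concatenate(([1.0], x))
    return sigmoid(theta @ x_with_bias)


# Hypothetical parameters (theta_0, theta_1, theta_2) and inputs (x_1, x_2).
theta = np.array([-1.0, 2.0, 0.5])
x = np.array([0.5, 2.0])

a = neuron(theta, x)  # z = -1 + 2*0.5 + 0.5*2 = 1, so a = g(1)
print(a)
```

Here $z=1$, so the neuron outputs $g(1)\approx0.731$.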

#### Key Terminologies
- **Bias Unit $(x_0)$**: An extra input node that is always equal to 1, providing a baseline for the activation calculation.
- **Weight $(\theta)$**: The parameters that control the strength of the connections between neurons in layer $j$ and layer $(j+1)$.
- **Activation Function**: The logistic (sigmoid) function $g(z)=\frac{1}{1+e^{-z}}$.

#### Artificial Neural Network Architecture
![image-2.png](attachment:image-2.png)

#### Layer Structures
- **Input Layer (Layer 1)**: Contains the original feature data $(x_1,x_2,\cdots,x_i,\cdots,x_n)$.
- **Hidden Layers**: Intermediate layers where internal computations happen. Their nodes are called `Activation Units` $(a_i^{(j)})$.
- **Output Layer**: The final layer that produces the hypothesis result $h_\theta(x)$.

#### Matrix Representation


$$
z^{(j)}=\theta^{(j-1)}a^{(j-1)}\\
a^{(j)}=g(z^{(j)})
$$

**Dimensions of the Matrices**:
$$
z^{(j)}_{[s^{(j)}\times 1]}=\theta^{(j-1)}_{[s^{(j)}\times (s^{(j-1)}+1)]}.a^{(j-1)}_{[(s^{(j-1)}+1)\times 1]}
$$

$$
\begin{bmatrix}
    z_{1}^{(j)}\\
    z_{2}^{(j)}\\
    \vdots\\
    z_{k^{(j)}}^{(j)}\\
    \vdots\\
    z_{s^{(j)}}^{(j)}
\end{bmatrix}=
\begin{bmatrix}
    \theta_{10}^{(j-1)}=b_1^{(j-1)} & \theta_{11}^{(j-1)} & \theta_{12}^{(j-1)} & \cdots & \theta_{1k^{(j-1)}}^{(j-1)} & \cdots & \theta_{1s^{(j-1)}}^{(j-1)}\\
    \theta_{20}^{(j-1)}=b_2^{(j-1)} & \theta_{21}^{(j-1)} & \theta_{22}^{(j-1)} & \cdots & \theta_{2k^{(j-1)}}^{(j-1)} & \cdots & \theta_{2s^{(j-1)}}^{(j-1)}\\
    \vdots\\
    \theta_{k^{(j)}0}^{(j-1)}=b_{k^{(j)}}^{(j-1)} & \theta_{k^{(j)}1}^{(j-1)} & \theta_{k^{(j)}2}^{(j-1)} & \cdots & \theta_{k^{(j)}k^{(j-1)}}^{(j-1)} & \cdots & \theta_{k^{(j)}s^{(j-1)}}^{(j-1)}\\
    \vdots\\
    \theta_{s^{(j)}0}^{(j-1)}=b_{s^{(j)}}^{(j-1)} & \theta_{s^{(j)}1}^{(j-1)} & \theta_{s^{(j)}2}^{(j-1)} & \cdots & \theta_{s^{(j)}k^{(j-1)}}^{(j-1)} & \cdots & \theta_{s^{(j)}s^{(j-1)}}^{(j-1)}
\end{bmatrix}\begin{bmatrix}
    a_{0}^{(j-1)}=1\\
    a_{1}^{(j-1)}\\
    a_{2}^{(j-1)}\\
    \vdots\\
    a_{k^{(j-1)}}^{(j-1)}\\
    \vdots\\
    a_{s^{(j-1)}}^{(j-1)}
\end{bmatrix}
$$

$$
\begin{bmatrix}
    a_{1}^{(j)}\\
    a_{2}^{(j)}\\
    \vdots\\
    a_{k^{(j)}}^{(j)}\\
    \vdots\\
    a_{s^{(j)}}^{(j)}
\end{bmatrix}=
\begin{bmatrix}
    g(z_{1}^{(j)})\\
    g(z_{2}^{(j)})\\
    \vdots\\
    g(z_{k^{(j)}}^{(j)})\\
    \vdots\\
    g(z_{s^{(j)}}^{(j)})
\end{bmatrix}
$$
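
The matrix form above maps directly onto a few lines of NumPy. The sketch below assumes a small 3-4-1 network with randomly drawn weights; the layer sizes, the seed, and the name `forward_layer` are illustrative choices, not values from the lesson.

```python
import numpy as np


def sigmoid(z):
    """Logistic activation applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))


def forward_layer(theta, a_prev):
    """Compute a^(j) = g(theta^(j-1) a^(j-1)).

    theta  : shape (s_j, s_{j-1} + 1); column 0 holds the bias weights.
    a_prev : shape (s_{j-1},); activations of the previous layer, no bias.
    """
    a_prev_with_bias = np.concatenate(([1.0], a_prev))  # prepend a_0 = 1
    z = theta @ a_prev_with_bias                        # shape (s_j,)
    return sigmoid(z)


rng = np.random.default_rng(0)           # arbitrary seed for repeatability
theta1 = rng.normal(size=(4, 3 + 1))     # layer 1 -> layer 2 (3 inputs, 4 units)
theta2 = rng.normal(size=(1, 4 + 1))     # layer 2 -> layer 3 (4 units, 1 output)

x = np.array([0.2, -0.4, 0.9])           # hypothetical input features
a2 = forward_layer(theta1, x)
h = forward_layer(theta2, a2)

print(a2.shape, h.shape)
```

Each weight matrix has one row per unit in the next layer and one extra column for the bias, which is why the previous activations get a leading 1 before the product.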

## Applications

![z](attachment:image.png)

### Examples and Intuitions I

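This example demonstrates how a single neuron (a network with no hidden layer) can simulate simple logical gates such as AND, OR, and NOR by choosing specific weights (parameters). Gates that are not linearly separable, such as XOR and XNOR, need a hidden layer.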


The graph of our function:
$$
\begin{bmatrix}
    x_0\\
    x_1\\
    x_2\\
\end{bmatrix}\rightarrow
\begin{bmatrix}
    g(z^{(2)})
\end{bmatrix}\rightarrow
\begin{bmatrix}
    h_\theta(x)
\end{bmatrix}\\
x_0=1\text{: bias unit}\\
$$

#### The AND Gate Simulation
$$
\theta^{(1)}=\begin{bmatrix}
    -30\\
    20\\
    20
\end{bmatrix}\\
h_\theta(x)=g(-30+20x_1+20x_2)\\
(x_1,x_2)=(0,0)\rightarrow g(-30+0+0)=g(-30)\approx0\\
(x_1,x_2)=(0,1)\rightarrow g(-30+0+20)=g(-10)\approx0\\
(x_1,x_2)=(1,0)\rightarrow g(-30+20+0)=g(-10)\approx0\\
(x_1,x_2)=(1,1)\rightarrow g(-30+20+20)=g(10)\approx1
$$

#### The OR Gate Simulation
$$
\theta^{(1)}=\begin{bmatrix}
    -10\\
    20\\
    20
\end{bmatrix}\\
h_\theta(x)=g(-10+20x_1+20x_2)\\
(x_1,x_2)=(0,0)\rightarrow g(-10+0+0)=g(-10)\approx0\\
(x_1,x_2)=(0,1)\rightarrow g(-10+0+20)=g(10)\approx1\\
(x_1,x_2)=(1,0)\rightarrow g(-10+20+0)=g(10)\approx1\\
(x_1,x_2)=(1,1)\rightarrow g(-10+20+20)=g(30)\approx1
$$
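
Both truth tables can be reproduced numerically. The sketch below uses the weights from the two hypotheses, $g(-30+20x_1+20x_2)$ for AND and $g(-10+20x_1+20x_2)$ for OR; the helper `gate` and the rounding to 0/1 are illustrative additions.

```python
import numpy as np


def sigmoid(z):
    """Logistic activation g(z)."""
    return 1.0 / (1.0 + np.exp(-z))


def gate(theta, x1, x2):
    """Single-neuron output g(theta_0 + theta_1*x1 + theta_2*x2)."""
    return sigmoid(theta @ np.array([1.0, x1, x2]))


THETA_AND = np.array([-30.0, 20.0, 20.0])
THETA_OR = np.array([-10.0, 20.0, 20.0])

and_table = {}
or_table = {}
for x1 in (0, 1):
    for x2 in (0, 1):
        # Round the sigmoid output to the nearest integer to read it as a bit.
        and_table[(x1, x2)] = int(round(float(gate(THETA_AND, x1, x2))))
        or_table[(x1, x2)] = int(round(float(gate(THETA_OR, x1, x2))))
        print(x1, x2, and_table[(x1, x2)], or_table[(x1, x2)])
```

The large magnitudes (10, 20, 30) push the sigmoid far into its flat regions, so the outputs are within about $5\times10^{-5}$ of 0 or 1.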

### Examples and Intuitions II

The $\theta$ vectors for the AND, NOR, and OR gates are:
$$
\theta^{(1)}=\begin{bmatrix}
    -30\\
    20\\
    20
\end{bmatrix}\rightarrow \text{"AND" Gate}\\
\theta^{(1)}=\begin{bmatrix}
    10\\
    -20\\
    -20
\end{bmatrix}\rightarrow \text{"NOR ((NOT x1) AND (NOT x2))" Gate}\\
\theta^{(1)}=\begin{bmatrix}
    -10\\
    20\\
    20
\end{bmatrix}\rightarrow \text{"OR" Gate}
$$



$$
\begin{bmatrix}
    x_0\\
    x_1\\
    x_2\\
\end{bmatrix}\rightarrow
\begin{bmatrix}
    a_1^{(2)}=g(z_1^{(2)})\\
    a_2^{(2)}=g(z_2^{(2)})\\
\end{bmatrix}\rightarrow
\begin{bmatrix}
    a^{(3)}=g(z^{(3)})
\end{bmatrix}\rightarrow
\begin{bmatrix}
    h_\theta(x)
\end{bmatrix}\\
x_0=1\text{: bias unit}\\
$$


$$
\theta^{(1)}=\begin{bmatrix}
    -30 & 20 & 20\\
    10 & -20 & -20
\end{bmatrix}, \theta^{(2)}=\begin{bmatrix}
    -10 & 20 & 20
\end{bmatrix}
$$

$$
a^{(2)}=g(\theta^{(1)}.x)\\
h_\theta(x)=a^{(3)}=g(\theta^{(2)}.a^{(2)})
$$
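
Stacking the AND and NOR units in the hidden layer and feeding them into an OR unit computes $(x_1 \text{ AND } x_2) \text{ OR } ((\text{NOT } x_1) \text{ AND } (\text{NOT } x_2))$, which is the XNOR function. The sketch below stores each weight matrix with one row per unit, following the $s^{(j)}\times(s^{(j-1)}+1)$ dimension convention; the function name `forward` is illustrative.

```python
import numpy as np


def sigmoid(z):
    """Logistic activation applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))


# One row per hidden unit: row 0 is the AND unit, row 1 is the NOR unit.
THETA1 = np.array([[-30.0, 20.0, 20.0],
                   [10.0, -20.0, -20.0]])
# Output unit: OR of the two hidden activations.
THETA2 = np.array([[-10.0, 20.0, 20.0]])


def forward(x1, x2):
    """Forward propagation through the 2-2-1 network."""
    x = np.array([1.0, x1, x2])            # input with bias x_0 = 1
    a2 = sigmoid(THETA1 @ x)               # hidden activations (AND, NOR)
    a2 = np.concatenate(([1.0], a2))       # add bias a_0^(2) = 1
    return sigmoid(THETA2 @ a2)[0]         # h_theta(x) = a^(3)


xnor_table = {
    (x1, x2): int(round(float(forward(x1, x2))))
    for x1 in (0, 1)
    for x2 in (0, 1)
}
print(xnor_table)
```

The network outputs 1 exactly when both inputs are equal, something no single neuron with these inputs can do because the two classes are not linearly separable.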

### Multiclass Classification

![image.png](attachment:image.png)

To perform **Multiclass Classification** using neural networks, we extend the output layer to represent multiple categories simultaneously.

#### The One-vs-All Approach
Instead of a single scalar output, the hypothesis function $h_\theta(x)$ returns a **vector**. Each element in that vector corresponds to the probability (or a binary indicator) of the input belonging to a specific class.
For example, in a four-class system (Pedestrian, Car, Truck, Motorcycle), the target labels $y$ are represented as follows:
$$
y=\begin{bmatrix}
    1\\0\\0\\0
\end{bmatrix},
\begin{bmatrix}
    0\\1\\0\\0
\end{bmatrix},
\begin{bmatrix}
    0\\0\\1\\0
\end{bmatrix},
\begin{bmatrix}
    0\\0\\0\\1
\end{bmatrix}
$$

Each inner layer provides some new information that leads to the final hypothesis function. The architecture looks like this:
$$
\begin{bmatrix}
    x_0\\x_1\\x_2\\ \cdots \\x_n
\end{bmatrix} \rightarrow
\begin{bmatrix}
    a_0^{(2)}\\a_1^{(2)}\\a_2^{(2)}\\ \cdots
\end{bmatrix} \rightarrow
\begin{bmatrix}
    a_0^{(3)}\\a_1^{(3)}\\a_2^{(3)}\\ \cdots
\end{bmatrix} \rightarrow \cdots \rightarrow
\begin{bmatrix}
    h_\theta(x)_1\\
    h_\theta(x)_2\\
    h_\theta(x)_3\\
    h_\theta(x)_4
\end{bmatrix}
$$

Prediction: if the output is $h_\theta(x)=\begin{bmatrix}0 & 0 & 1 & 0\end{bmatrix}^T$, the network has identified the input as the third category (**Truck**).
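
In practice the output vector rarely contains exact zeros and ones, so a common way to read off the prediction is to pick the index of the largest element. The sketch below assumes that rule; the output values in `h` are invented for illustration, and `one_hot` and `predict_class` are hypothetical helper names.

```python
import numpy as np

CLASSES = ["Pedestrian", "Car", "Truck", "Motorcycle"]


def one_hot(index, num_classes):
    """Target vector y with a 1 at `index` and 0 elsewhere."""
    y = np.zeros(num_classes)
    y[index] = 1.0
    return y


def predict_class(h):
    """Return the class whose output unit has the highest value."""
    return CLASSES[int(np.argmax(h))]


# Hypothetical network output for one image (not taken from a trained model).
h = np.array([0.02, 0.05, 0.91, 0.10])

label = predict_class(h)
target = one_hot(CLASSES.index("Truck"), len(CLASSES))
print(label, target)
```

With these values the third unit is the largest, so the predicted class is Truck, matching the target vector $\begin{bmatrix}0 & 0 & 1 & 0\end{bmatrix}^T$.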