# Biological Inspiration

**Inspired by Neurons:**  
The perceptron mimics the basic function of biological neurons in the brain.

**Biological Neuron:**
- **Dendrites:** Act as input receivers, collecting signals from other neurons.
- **Activation:** If the combined input signals exceed a threshold, the neuron fires a signal.
- **Axon:** Transmits the signal to other neurons.
- **Synapse:** The junction between an axon and a dendrite, where synaptic strength acts like a weight, modulating the signal.

*Note: The perceptron simplifies this biological process into a computational model.*

<div style="text-align:center">
  <img src="../assets/biological_neuron.png" alt="Biological Motivation Behind Perceptron">
</div>

# McCulloch-Pitts Neuron

An early model of artificial neurons with **binary inputs** and **threshold-based activation**.

**Components:**
- Input vector: $X = [x_1, \cdots, x_d]$, where each $x_i \in \{0,1\}$
- Weights vector: $W = [w_1, \cdots, w_d]$
- Threshold: $\theta$

**Pre-Activation:**  
Compute the weighted sum:
$$z = \sum_{i=1}^d w_i x_i$$

**Activation function (Threshold-Based):**
$$
y = \begin{cases}
1 & \text{if } z \ge \theta \\
0 & \text{otherwise}
\end{cases}
\quad
\text{where } z = \sum_i w_i x_i
$$

<div style="text-align:center">
  <img src="../assets/neuron_first.png" alt="simple ai neuron example">
</div>

**Bias Alternative:**  
Instead of a threshold $\theta$, introduce a bias term $b = -\theta$, shifting the activation function to:
$$
y = \begin{cases}
1 & \text{if } \sum_{i=1}^d w_i x_i + b \geq 0 \\
0 & \text{otherwise}
\end{cases}$$

<div style="text-align:center">
  <img src="../assets/neuron_second.png" alt="better ai neuron example">
</div>
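The threshold form and the bias form describe the same unit. The sketch below checks this on a few binary inputs; the weights and the threshold value are hypothetical example values, not taken from the figures.

```python
import numpy as np

def mp_neuron_threshold(x, w, theta):
    """Fire (return 1) when the weighted sum reaches the threshold theta."""
    z = np.dot(w, x)
    return int(z >= theta)

def mp_neuron_bias(x, w, b):
    """Same unit written with a bias b = -theta and a fixed threshold of 0."""
    z = np.dot(w, x) + b
    return int(z >= 0)

# Hypothetical example: three binary inputs, unit weights, theta = 2.
w = np.array([1.0, 1.0, 1.0])
theta = 2.0

for bits in ([0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]):
    x = np.array(bits)
    y_threshold = mp_neuron_threshold(x, w, theta)
    y_bias = mp_neuron_bias(x, w, -theta)
    assert y_threshold == y_bias  # both forms agree
    print(bits, y_threshold)
```

With these values the neuron fires only when at least two of the three inputs are 1.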

# Perceptron

A generalization of the McCulloch-Pitts neuron that:
- Accepts **real-valued inputs**
- Supports **customizable activation functions**

**Activation Function Recommendations:**
- Non-zero derivative in most regions (important for gradient-based training)
- Differentiable, to allow gradient computation

## Common Activation Functions

### **Step function:**

**Formula:**
$$
y = \begin{cases}
1 & \text{if } x \ge 0 \\ 
0 & \text{if } x \lt 0
\end{cases}
$$

**Issues:**
- Not differentiable at $x=0$
- Derivative is $0$ elsewhere $\implies$ **Not suitable** for gradient-based learning



### **Sigmoid (Logistic) Function:**

Also known as the **soft** perceptron.

**Activation Function:**
$$\sigma(z) = \frac{1}{1 + e^{-z}}, \quad \text{where } z = \sum_i w_ix_i + b$$

<div style="text-align:center">
  <img src="../assets/sigmoid.png" alt="sigmoid function">
</div>

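**Key Properties:**
- Maps any real input smoothly to a value between 0 and 1
- Smooth and differentiable everywhere, with a derivative that is never exactly zero
- These properties make it a good candidate activation function for gradient-based training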
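The difference between the step function and the sigmoid can be checked numerically. This is a minimal sketch; it relies on the standard identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, which is not stated above, and the function names are illustrative.

```python
import numpy as np

def step(z):
    """Hard threshold: 1 where z >= 0, else 0."""
    return (np.asarray(z) >= 0).astype(float)

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid, using sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print("step:        ", step(z))
print("sigmoid:     ", sigmoid(z))
print("sigmoid grad:", sigmoid_grad(z))

# Finite-difference slope of the step function away from 0 is exactly zero,
# so a gradient-based update would receive no signal there.
h = 1e-3
print("step slope at z=1:", (step(1.0 + h) - step(1.0 - h)) / (2 * h))
```

The sigmoid gradient peaks at $z = 0$ and shrinks toward zero for large $|z|$, but it never vanishes completely the way the step function's derivative does.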

# Multi-Layer Perceptron (MLP)

## Introduction

**Definition:**  
A network of perceptrons organized in layers, also called a **feed-forward neural network**.

$$\text{Input Layer} \rightarrow \underbrace{\underbrace{\text{Hidden Layer $1$} \rightarrow \cdots \rightarrow \text{Hidden Layer $n$}}_{\text{Hidden Layers}} \rightarrow \text{Output Layer}}_{\text{Network Layers (Computational Layers)}}$$

- **Computational Layers** = Hidden Layers + Output Layer

- **Network Depth** = Number of computational layers

<div style="text-align:center">
  <img src="../assets/mlp_example.png" alt="MLP example">
</div>

## MLP Computation Flow

Suppose we have a Multi-Layer Neural Network:
<div style="text-align:center">
  <img src="../assets/mlp_weighted.png" alt="MLP example">
</div>

**Layer 1 Computation:**
$$z_{j}^{[1]} = \phi_1 \left(\sum_{i=0}^d w_{ji}^{[1]} x_i \right)$$
or equivalently, in matrix form:
$$z^{[1]} = \phi_1 (W^{[1]} x)$$

Where:
- $x$: Input vector

- $W^{[1]}$: Weights of first layer

- $\phi_1$: Activation function for layer 1

- $\sum_{i=0}^d w_{ji}^{[1]} x_i$: Pre-activation of unit $j$ in layer 1

- $z^{[1]}$: Output of first layer

**Layer 2 Computation:**
$$z_{j}^{[2]} = \phi_2 \left(\sum_{i} w_{ji}^{[2]} z_i^{[1]} \right), \quad \text{where } z^{[1]} = \phi_1 (W^{[1]} x)$$
or equivalently, in matrix form:
$$z^{[2]} = \phi_2 (W^{[2]} \phi_1 (W^{[1]} x))$$

Each layer applies a linear transformation followed by a non-linear activation.

*Usually, all hidden layers use the same activation function.*
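The layer-by-layer computation can be sketched in a few lines of NumPy. The sums above start at $i = 0$, so this sketch assumes the common convention that a constant $x_0 = 1$ is prepended at each layer and column 0 of each weight matrix acts as the bias. The layer sizes, the random weights, and the choice of sigmoid for both layers are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, activations):
    """Compute phi_L(W_L ... phi_1(W_1 x)) one layer at a time."""
    z = np.asarray(x, dtype=float)
    for W, phi in zip(weights, activations):
        z = np.concatenate(([1.0], z))  # assumed convention: z_0 = 1 carries the bias
        z = phi(W @ z)                  # linear map, then non-linear activation
    return z

# Hypothetical sizes: 3 inputs, 4 hidden units, 2 outputs.
rng = np.random.default_rng(0)
d, hidden, out = 3, 4, 2
W1 = rng.normal(size=(hidden, d + 1))    # +1 column for the bias
W2 = rng.normal(size=(out, hidden + 1))

x = np.array([0.5, -1.0, 2.0])
y = forward(x, [W1, W2], [sigmoid, sigmoid])
print(y.shape)  # (2,)
```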

**Importance of Non-linear Activation:**

If no non-linear activation is used:

$$z^{[1]} = W^{[1]} x$$
$$z^{[2]} = W^{[2]} W^{[1]} x = W' x$$

This is equivalent to a single linear layer. **No extra modeling power is added.**
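A quick numerical check of this collapse, using randomly drawn matrices (the shapes and seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # first "layer": 3 inputs -> 4 units
W2 = rng.normal(size=(2, 4))   # second "layer": 4 units -> 2 outputs
x = rng.normal(size=3)

# Two linear layers applied in sequence, with no activation in between.
two_layer = W2 @ (W1 @ x)

# The same map expressed as a single matrix W' = W2 W1.
W_prime = W2 @ W1
one_layer = W_prime @ x

print(W_prime.shape)                      # (2, 3)
print(np.allclose(two_layer, one_layer))  # True, up to floating-point rounding
```

However many linear layers are stacked, the product of their weight matrices is still a single matrix, so the hidden width of 4 adds no expressive power here.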

# Deep vs. Shallow Networks

**Depth of a Network:**  
Defined as the length of the longest path (counted in computational layers) from any input node to any output node in the DAG of the network.

The network is deep if it has more than one hidden layer (i.e., more than 2 computational layers).

- **Width**:
  - More neurons per layer increase the model's capacity and complexity.

- **Depth**:
  - More layers allow greater abstraction and modeling of complex patterns.
  - Networks with more than one hidden layer are considered deep, which is where the term **deep learning** comes from.

- **Balance**:
    - **Too narrow/shallow**: risk of underfitting
    - **Too wide/deep**: risk of overfitting

<div style="text-align:center">
  <img src="../assets/width_depth_balance.png" alt="width-depth balance">
</div>

# MLP as Universal Approximator

## Introduction

**Theorem:**  
A feed-forward neural network with a single hidden layer and a linear output can approximate any continuous function on a compact subset of $\mathbb{R}^D$ to arbitrary accuracy.

**Conditions:**
- The activation function must be non-linear and satisfy mild assumptions. The classical statement assumes a continuous, bounded, non-constant activation such as the sigmoid; later results extend it to any non-polynomial activation, which covers ReLU.
- Requires a sufficiently large (but finite) number of hidden units.

**Mathematical Representation:** For a network with $M$ hidden units:
$$
F_k(x) = \sum_{j=1}^M w_{kj}^{[2]} \phi \left(\sum_{i=1}^D w_{ji}^{[1]} x_i + b_j^{[1]} \right) + b_k^{[2]}
$$
Where:
- $x \in \mathbb{R}^D$: Input vector.

- $w_{ji}^{[1]}$: Weights from input to hidden layer.

- $b_j^{[1]}$: Bias for hidden units.

- $\phi$: Non-linear activation function (e.g., sigmoid, ReLU).

- $w_{kj}^{[2]}, b_k^{[2]}$: Weights and biases for the output layer.


**Note:** This is of more theoretical interest than practical use.

## MLPs as Universal Boolean Functions

### **Core Concepts**

**Key Idea:**  
MLPs can represent any Boolean function by constructing networks for basic logic gates (**AND**, **OR**, **NOT**) and **combining them**.

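**Basic Gates** (binary inputs, firing rule $z \ge \theta$):
- **AND Gate**:
    - Single-layer network with two inputs, unit weights, and threshold $\theta$ set between 1 and 2 ($1 < \theta \le 2$).

- **OR Gate**:
    - Single-layer network with two inputs, unit weights, and threshold $\theta$ between 0 and 1 ($0 < \theta \le 1$).

- **NOT Gate**:  
    - Single neuron with weight $w = -1$ and threshold $\theta$ between $-1$ and 0 ($-1 < \theta \le 0$).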
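The three gates can be written as single threshold units. In this sketch the specific thresholds (1.5, 0.5, and -0.5) are one arbitrary choice, assuming unit weights for AND and OR and the firing rule $z \ge \theta$.

```python
import numpy as np

def threshold_unit(x, w, theta):
    """Return 1 if the weighted sum of x reaches theta, else 0."""
    return int(np.dot(w, x) >= theta)

def AND(x1, x2):
    return threshold_unit([x1, x2], [1, 1], 1.5)   # needs both inputs on

def OR(x1, x2):
    return threshold_unit([x1, x2], [1, 1], 0.5)   # needs at least one input on

def NOT(x1):
    return threshold_unit([x1], [-1], -0.5)        # fires only when the input is 0

print("x1 x2 AND OR")
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, AND(x1, x2), OR(x1, x2))
print("NOT 0 =", NOT(0), " NOT 1 =", NOT(1))
```

Because AND, OR, and NOT are each a single unit, any Boolean formula built from them can be wired up as a network of such units.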

### **Generalization**

Any Boolean function can be expressed in conjunctive normal form (CNF) or disjunctive normal form (DNF), which makes MLPs universal for Boolean functions.

Given $N$ Boolean variables:
- **Number of neurons** depends on the representation **complexity**
- More compact representations reduce required neurons
- One tool for finding a compact representation: the **Karnaugh map**

<div style="text-align:center">
  <img src="../assets/karnaugh_map.png" alt="Karnaugh map example">
</div>

### **Largest Irreducible Boolean Function: Parity**

**Parity Function:**  
Computes whether the number of 1s in $D$ inputs is odd or even.

For a single hidden layer:
- Requires $2^{D-1}$ hidden units.
- Total weights and biases: $(D + 2) \cdot 2^{D-1} + 1$.


<div style="text-align:center">
  <img src="../assets/single_layer_parity.png" alt="single layer parity">
</div>
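The counts above can be reproduced with an explicit DNF-style construction: one hidden unit per odd-parity input pattern, followed by an OR output unit. This is a sketch of one possible construction, not necessarily the one in the figure; the function names are illustrative.

```python
import itertools
import numpy as np

def build_parity_net(D):
    """One hidden threshold unit for each input pattern with an odd number of 1s."""
    patterns = [p for p in itertools.product([0, 1], repeat=D) if sum(p) % 2 == 1]
    P = np.array(patterns)
    W1 = 2 * P - 1          # weight +1 where the pattern has a 1, -1 where it has a 0
    theta1 = P.sum(axis=1)  # the sum reaches this value only when x equals the pattern
    return W1, theta1

def parity(x, W1, theta1):
    h = (W1 @ np.asarray(x) >= theta1).astype(int)  # hidden layer: pattern detectors
    return int(h.sum() >= 1)                        # output unit: OR of the hidden units

D = 4
W1, theta1 = build_parity_net(D)

# Verify the network on every one of the 2^D inputs.
for x in itertools.product([0, 1], repeat=D):
    assert parity(x, W1, theta1) == sum(x) % 2

n_hidden = W1.shape[0]
# hidden weights + hidden thresholds + output weights + output threshold
n_params = W1.size + theta1.size + n_hidden + 1
print(n_hidden, n_params)  # 8 49
```

For $D = 4$ this gives $2^{D-1} = 8$ hidden units and $(D + 2) \cdot 2^{D-1} + 1 = 49$ weights and biases, matching the formulas above.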

**Deep Network for Parity**

**Recursive Construction Formula:**
$$f = (((X_1 \oplus X_2) \oplus X_3) \cdots \oplus X_D)$$
- Each XOR requires 3 neurons (2 hidden, 1 output).

*See the XOR section if any of this is unclear.*

**Resource Requirements:**
- **Total Nodes:** $3 (D - 1)$
- **Weights & biases:** $9 (D - 1)$

Example for $D = 4$:
<div style="text-align:center">
  <img src="../assets/deep_network_parity.png" alt="deep network parity">
</div>

**Stage Breakdown** (each stage is one XOR block of 3 nodes):
- Stage 1:  
    $\text{Result}_1 = X_1 \oplus X_2$ (3 nodes)

- Stage 2:  
    $\text{Result}_2 = \text{Result}_1 \oplus X_3$ (3 nodes)

- Stage 3:  
    $\text{Result}_3 = \text{Result}_2 \oplus X_4$ (3 nodes)
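The chained construction can be sketched directly. The XOR block below uses one common choice of weights (an OR unit and a NAND unit feeding an AND unit); these particular weights and biases are assumptions and may differ from the ones used in the XOR section.

```python
import itertools

def unit(inputs, weights, bias):
    """Threshold unit in bias form: fires when w . x + b >= 0."""
    return int(sum(w * x for w, x in zip(weights, inputs)) + bias >= 0)

def xor(a, b):
    """XOR from 3 units: 2 hidden (OR, NAND) and 1 output (AND)."""
    h_or = unit([a, b], [1, 1], -0.5)           # a OR b
    h_nand = unit([a, b], [-1, -1], 1.5)        # NOT (a AND b)
    return unit([h_or, h_nand], [1, 1], -1.5)   # h_or AND h_nand

def deep_parity(x):
    """Fold the inputs left to right: ((x1 xor x2) xor x3) ..."""
    result = x[0]
    for xi in x[1:]:
        result = xor(result, xi)
    return result

D = 4
for x in itertools.product([0, 1], repeat=D):
    assert deep_parity(x) == sum(x) % 2

nodes = 3 * (D - 1)    # 3 units per XOR block
params = 9 * (D - 1)   # each unit has 2 weights + 1 bias
print(nodes, params)   # 9 27
```

Compared with the single-hidden-layer construction, the resource count grows linearly in $D$ rather than exponentially.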

**Depth vs. Width Tradeoff:**

Optimal architecture balances width and depth based on the function’s complexity.

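**Better Architecture Example:** pair the inputs in a balanced tree, shown here for 8 inputs:
$$f = ((X_1 \oplus X_2) \oplus (X_3\oplus X_4)) \oplus ((X_5 \oplus X_6) \oplus (X_7\oplus X_8))$$

Requires only $2 \log_2 N$ layers for $N$ inputs, since each XOR block is two layers deep and the tree has $\log_2 N$ levels.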

### **Summary: Wide vs. Deep Network**

- An MLP with **one hidden layer** can represent any Boolean function (**universal** Boolean function)
    - But may require an **exponential** number of neurons
- Deep networks can represent the same function with **fewer parameters**
    - Depth gives **efficiency**
- Complexity depends on:
    - Number of variables
    - Minimum DNF/CNF size

### **Limit: Shannon’s Theorem**

- For $D > 2$, some Boolean functions require at least $\frac{2^D}{D}$ gates in a Boolean circuit.

- **Implication:** For large $D$, most $D$-input Boolean functions need exponentially large circuits, regardless of depth.

- **Note:** If all Boolean functions could be computed with polynomial-sized circuits, it would imply $P = NP$.

**Threshold Circuits (TCs):**

- MLPs use threshold gates, which are more powerful than standard Boolean gates.
- Example: A single threshold gate can compute the majority function (“at least $K$ inputs are 1”), which requires exponential size in a Boolean circuit of fixed depth.
- For fixed depth: Boolean Circuits $\subset$ Threshold Circuits.
- Threshold Circuits can also be viewed as Arithmetic Circuits
    - Model computation over reals (not just Boolean values)

## MLPs as Universal Classifiers

**Problem Overview**

- MLP as a function over real inputs
- MLP as a function that finds a complex “decision boundary” over a space of reals
- Output is binary or multi-class

| Structure                       | Type of Decision Regions    | Interpretation                |
|---------------------------------|-----------------------------|-------------------------------|
| Single Layer (no hidden layer)  | Half space                  | Region found by a hyper-plane |
| Two Layer (one hidden layer)    | Polyhedral (open or closed) | Intersection of half spaces   |
| Three Layer (two hidden layers) | Arbitrary regions           | Union of polyhedra            |


<div style="text-align:center">
  <img src="../assets/convex_classifier.png" alt="convex classifier">
</div>

<div style="text-align:center">
  <img src="../assets/non_convex_classifier.png" alt="non-convex classifier">
</div>

**Polygon-Based Region Approximation**

- Let $M$ = number of sides
- As $M$ increases, approximation to a circular region improves
  - Reduces the area outside the polygon that has $\frac{M}{2} \lt \text{Sum} \lt M$

$$\sum_i z_i = M \left( 1 - \frac{1}{\pi} \arccos\left(\min\left(1, \frac{\text{radius}}{|x - \text{cent}|}\right)\right) \right)$$

<div style="text-align:center">
  <img src="../assets/cylinder_region.png" alt="cylinder">
</div>

- Inside: Sum = $M$
- Outside: Sum = $\frac{M}{2}$
- Apply bias: $b = -\frac{M}{2}$

**Result:** The network can isolate a circle-like region with high precision.

**The circle net:**
- Needs a very large number of neurons
- Sum is $M$ inside the circle and $\frac{M}{2}$ almost everywhere outside
    - After adding the bias $-\frac{M}{2}$, the sum is $\frac{M}{2}$ inside the circle and $0$ almost everywhere outside
- The circle can be centered at any location
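The sum formula can be checked numerically. The sketch below assumes a 2-D input and $M$ threshold units whose boundaries are tangent lines evenly spaced around the circle; the helper names, the value of $M$, and the test points are illustrative choices.

```python
import numpy as np

def polygon_sum(x, center, radius, M):
    """Number of the M half-plane units that fire at point x."""
    angles = 2 * np.pi * np.arange(M) / M
    normals = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (M, 2)
    # Unit k fires when x lies on the inner side of the k-th tangent line.
    z = radius - normals @ (np.asarray(x, dtype=float) - center)
    return int(np.sum(z >= 0))

def predicted_sum(x, center, radius, M):
    """Closed-form sum from the formula above (exact in the limit of large M)."""
    dist = np.linalg.norm(np.asarray(x, dtype=float) - center)
    ratio = 1.0 if dist == 0 else min(1.0, radius / dist)
    return M * (1 - np.arccos(ratio) / np.pi)

M = 1000
center = np.array([0.0, 0.0])
radius = 1.0

inside = polygon_sum([0.2, 0.1], center, radius, M)
far = polygon_sum([100.0, 0.0], center, radius, M)

print("inside:", inside)  # every unit fires, so the sum equals M
print("far outside:", far, "formula:", predicted_sum([100.0, 0.0], center, radius, M))
```

Far from the circle the count settles close to $\frac{M}{2}$, so subtracting the bias $\frac{M}{2}$ leaves a clearly positive value only inside the circle.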
- Circle can be at any location


## MLPs as Universal Approximators

**Goal:** Use an MLP as a continuous-valued regression model.

A simple 3-unit MLP with a **summing** output unit can generate a **square pulse** over an input
- Output is 1 only if the input lies between $T_1$ and $T_2$
- $T_1$ and $T_2$ can be arbitrarily specified

<div style="text-align:center">
  <img src="../assets/square_pulse.png" alt="square pulse">
</div>

An MLP with many units can model an arbitrary function over an input
- Simply make the individual pulses narrower

**Result:** An MLP with one hidden layer can model an arbitrary function of a single input.

<div style="text-align:center">
  <img src="../assets/continuous_value_regression.png" alt="continuous value regression example">
</div>
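The pulse construction can be sketched as follows. Each pulse is the difference of two threshold units combined by a summing output, and the target function is approximated by scaling each pulse with the function value at its center. Using $\sin$ as the target and sampling at the pulse centers are illustrative assumptions.

```python
import numpy as np

def step(z):
    """Threshold unit: 1 where z >= 0, else 0."""
    return (np.asarray(z) >= 0).astype(float)

def pulse(x, t1, t2):
    """Two threshold units plus a summing output: 1 on [t1, t2), 0 elsewhere."""
    return step(x - t1) - step(x - t2)

def approximate(f, x, lo, hi, n_pulses):
    """Sum of n_pulses adjacent pulses, each scaled by f at the pulse center."""
    edges = np.linspace(lo, hi, n_pulses + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    y = np.zeros_like(x, dtype=float)
    for t1, t2, c in zip(edges[:-1], edges[1:], centers):
        y += f(c) * pulse(x, t1, t2)
    return y

x = np.linspace(0.0, 2 * np.pi, 1000, endpoint=False)
errors = {}
for n in (10, 100):
    y_hat = approximate(np.sin, x, 0.0, 2 * np.pi, n)
    errors[n] = np.max(np.abs(y_hat - np.sin(x)))
    print(n, "pulses, max error:", errors[n])
```

Narrower pulses give a smaller worst-case error, at the cost of two more hidden units per pulse.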

# MLP with ReLU Activation Function

**ReLU Activation:**
$$\phi(z) = \max(0, z)$$

**Effect:**  
Creates piecewise linear functions.

**Number of Pieces:**
- Proportional to the number of neurons in the layer.
- Increasing neurons increases the number of linear segments, enabling finer approximations.

<div style="text-align:center">
  <img src="../assets/relu_pieces.png" alt="relation between ReLU function and pieces">
</div>

### **Composing Networks**

What is the effect of increasing layers?

Look at this example:

<div style="text-align:center">
  <img src="../assets/composing_network.png" alt="composing networks example">
</div>

As the input moves from -1 to 1, Network 1 folds the space three times, once over each of these intervals:
- between -1 and 0
- between 0 and 0.8
- between 0.8 and 1

Network 2 is therefore replicated three times, once within each of these intervals.

**Key Insight:**
- Each additional layer multiplies the number of linear regions.
- For a network with $K$ layers and $n$ neurons per layer, the number of linear regions can grow exponentially in $K$.
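A small experiment makes the multiplication of regions concrete. This sketch does not use the networks from the figure; it assumes a standard "tent" layer built from two ReLU units and composes it $K$ times, then counts linear pieces on $[0, 1]$ by counting slope changes on a fine grid.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def tent(x):
    """One layer with two ReLU units: rises on [0, 0.5], falls on [0.5, 1]."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def count_linear_pieces(K, n_grid=1025):
    """Compose the tent layer K times and count distinct linear pieces on [0, 1]."""
    x = np.linspace(0.0, 1.0, n_grid)
    y = x.copy()
    for _ in range(K):
        y = tent(y)
    slopes = np.diff(y) / np.diff(x)
    # A new piece starts wherever the slope changes between neighboring cells.
    return int(np.sum(~np.isclose(slopes[1:], slopes[:-1])) + 1)

for K in (1, 2, 3, 4):
    print(K, "layers ->", count_linear_pieces(K), "linear pieces")
# 1 layers -> 2, 2 layers -> 4, 3 layers -> 8, 4 layers -> 16
```

Each added layer doubles the number of pieces here, giving $2^K$ pieces from only $2K$ hidden units, whereas a single hidden layer with the same $2K$ ReLU units can produce at most $2K + 1$ pieces for a scalar input.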

### **Result:**

The maximum number of linear regions grows with the number of parameters for networks mapping a scalar input $x$ to a scalar output $y$.

For $K$ layers:
<div style="text-align:center">
  <img src="../assets/max_number_regions.png" alt="max number of regions">
</div>

Adding more layers increases the number of linear regions **exponentially.**