<a href="https://colab.research.google.com/github/danieleduardofajardof/DataSciencePrepMaterial/blob/main/7_DeepLearning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 7. Deep Learning

---




## 1. Multi-Layer Perceptron (MLP)

An MLP is a fully connected feedforward neural network consisting of an input layer, one or more hidden layers, and an output layer.

- **Forward Propagation**:

  $$
  \begin{aligned}
  z^{(l)} &= W^{(l)} a^{(l-1)} + b^{(l)} \\
  a^{(l)} &= f(z^{(l)})
  \end{aligned}
  $$

  where:
  - $W^{(l)}$: weight matrix of layer $l$
  - $b^{(l)}$: bias vector of layer $l$
  - $a^{(l)}$: activation of layer $l$, with $a^{(0)} = x$ (the input)
  - $f$: activation function, applied element-wise

- Commonly used for structured data and basic classification tasks.
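
The forward propagation equations above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names (`relu`, `mlp_forward`), the layer sizes, and the choice to leave the last layer linear (so it returns raw scores) are all assumptions made for the example.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def mlp_forward(x, weights, biases, f=relu):
    """Apply z = W a + b and a = f(z) layer by layer.

    Assumption for this sketch: the final layer is left linear.
    """
    a = x
    last = len(weights) - 1
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = W @ a + b
        a = z if l == last else f(z)
    return a

rng = np.random.default_rng(0)
sizes = [4, 8, 3]  # input, hidden, output widths (illustrative)
weights = [rng.normal(0.0, 0.1, size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

x = rng.normal(size=sizes[0])
logits = mlp_forward(x, weights, biases)  # one score per output unit
```

Each weight matrix has shape `(n_out, n_in)`, so `W @ a` maps the previous layer's activation to the next layer's pre-activation.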


### Perceptron

A perceptron is the simplest type of neural network used for binary classification.

- **Output**:

  $$
  y = \begin{cases}
  1 & \text{if } w \cdot x + b > 0 \\
  0 & \text{otherwise}
  \end{cases}
  $$

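- If the data is linearly separable, the perceptron learning rule converges to a separating hyperplane in a finite number of updates. A single perceptron cannot represent functions that are not linearly separable, such as XOR.
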
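
A minimal sketch of the classic perceptron learning rule on the logical AND function, which is linearly separable. The helper names (`predict`, `train_perceptron`), the learning rate, and the epoch cap are assumptions chosen for the example.

```python
import numpy as np

def predict(w, b, X):
    # Step activation: 1 if w . x + b > 0, else 0.
    return (X @ w + b > 0).astype(int)

def train_perceptron(X, y, epochs=20, lr=1.0):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            update = lr * (yi - predict(w, b, xi))
            if update != 0:
                # Move the boundary toward the misclassified point.
                w += update * xi
                b += update
                errors += 1
        if errors == 0:  # every point classified correctly
            break
    return w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y_and)
predictions = predict(w, b, X)
```

Replacing `y_and` with XOR labels (`[0, 1, 1, 0]`) would never reach zero errors, because no single line separates those classes.
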
---

## 2. Activation Functions
Activation functions introduce non-linearity into the network, allowing it to learn complex patterns. Without them, a stack of layers collapses into a single linear map. They are applied element-wise to each neuron's pre-activation $z$ to produce the neuron's output.


#### 2.1. ReLU (Rectified Linear Unit)

**Definition**:
$$
f(x) = \max(0, x)
$$

**Pros**:
- Computationally efficient
- Reduces likelihood of vanishing gradient
- Sparse activation (many outputs are zero)

**Cons**:
- Dying ReLU problem: neurons can become inactive and stop learning if inputs are always negative

**Use Cases**:
- Default choice for hidden layers in CNNs and MLPs
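
A small sketch of ReLU and its gradient, illustrating the sparse-activation and dying-ReLU points above. The function names are illustrative, and taking the derivative at exactly $x = 0$ to be 0 is a convention assumed here.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Convention assumed here: the derivative at x == 0 is taken as 0.
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
out = relu(x)               # negative inputs are clipped to 0
grad = relu_grad(x)         # gradient is 1 for x > 0 and 0 otherwise
sparsity = np.mean(out == 0)  # fraction of outputs that are exactly zero
```

A unit whose pre-activation is negative for every training example receives zero gradient on every step, which is the dying ReLU problem.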



#### 2.2. Leaky ReLU

**Definition**:
$$
f(x) = \begin{cases}
x & \text{if } x > 0 \\
\alpha x & \text{if } x \leq 0
\end{cases}
$$

Where $\alpha$ is a small constant (e.g., 0.01)

**Pros**:
- Addresses dying ReLU by allowing a small gradient $\alpha$ when $x < 0$

**Cons**:
- Adds a hyperparameter ($\alpha$) that may need tuning
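
A minimal sketch of Leaky ReLU and its gradient, assuming the example value $\alpha = 0.01$ from the definition above. The function names are illustrative.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # The gradient never becomes exactly zero, so a unit can keep learning.
    return np.where(x > 0, 1.0, alpha)

x = np.array([-1.0, 2.0])
out = leaky_relu(x)
grad = leaky_relu_grad(x)
```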

---

#### 2.3. ELU (Exponential Linear Unit)

**Definition**:
$$
f(x) = \begin{cases}
x & \text{if } x \geq 0 \\
\alpha (e^x - 1) & \text{if } x < 0
\end{cases}
$$

Where $\alpha > 0$ is a constant (commonly 1) that sets the value $-\alpha$ toward which the output saturates for large negative inputs

**Pros**:
- Avoids dying neurons
- Mean activations closer to zero, which helps learning

**Cons**:
- More computationally expensive than ReLU
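
A hedged sketch of ELU, assuming $\alpha = 1$, together with a quick check of the claim that its mean activation sits closer to zero than ReLU's. The standard normal inputs are an assumption chosen only for the comparison.

```python
import numpy as np

def elu(x, alpha=1.0):
    # np.expm1 computes e^x - 1 accurately for small |x|.
    # np.minimum keeps the exponential from overflowing on the unused branch.
    return np.where(x >= 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

def relu(x):
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)  # assumed input distribution for the comparison
elu_mean = elu(x).mean()
relu_mean = relu(x).mean()
```

On these inputs `elu_mean` should be noticeably smaller in magnitude than `relu_mean`, because the negative branch pulls the average back toward zero.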


#### 2.4. Sigmoid

**Definition**:
$$
f(x) = \frac{1}{1 + e^{-x}}
$$

**Pros**:
- Output lies in the open interval (0, 1), so it can be read as a probability
- Useful for binary classification (as final layer)

**Cons**:
- Saturates and kills gradients for large positive/negative inputs
- Not zero-centered

**Use Cases**:
- Output layer in binary classification
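
A small sketch that makes the saturation point concrete: the sigmoid derivative $f'(x) = f(x)(1 - f(x))$ peaks at 0.25 and shrinks rapidly as $|x|$ grows. The numerically stable formulation below is one common choice, assumed here for illustration.

```python
import numpy as np

def sigmoid(x):
    # Stable form: only ever exponentiates a non-positive number.
    x = np.asarray(x, dtype=float)
    e = np.exp(-np.abs(x))
    return np.where(x >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

grad_at_zero = sigmoid_grad(0.0)   # the maximum possible gradient, 0.25
grad_at_ten = sigmoid_grad(10.0)   # nearly zero: the unit has saturated
```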


#### 2.5. Tanh (Hyperbolic Tangent)

**Definition**:
$$
f(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
$$

**Pros**:
- Output lies in the open interval (-1, 1)
- Zero-centered

**Cons**:
- Saturates for large $|x|$, so it suffers from the vanishing gradient problem, like sigmoid

**Use Cases**:
- Sometimes preferred over sigmoid for hidden layers
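
Two standard identities help explain the comparison with sigmoid. Writing $\sigma$ for the sigmoid function, tanh is a rescaled and shifted sigmoid, and its derivative reaches 1 at the origin (versus 0.25 for sigmoid) before decaying to 0 as $|x|$ grows:

$$
\tanh(x) = 2\sigma(2x) - 1, \qquad \tanh'(x) = 1 - \tanh^2(x)
$$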


#### 2.6. Softmax

**Definition**:
For a vector $\mathbf{z}$, the softmax output for class $i$ is:
$$
\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}
$$

**Pros**:
- Produces a probability distribution over classes
- Useful for multi-class classification

**Cons**:
- Sensitive to large input values: $e^{z_i}$ can overflow unless the maximum logit is subtracted before exponentiating

**Use Cases**:
- Output layer in multi-class classification problems
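
A minimal sketch of a numerically stable softmax. It relies on the fact that subtracting the same constant from every logit leaves the result unchanged. The function name and the example logits are illustrative.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    # Shift so the largest logit is 0; the exponentials then cannot overflow.
    shifted = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

# A naive np.exp(1000.0) would overflow; the shifted version does not.
probs = softmax([1000.0, 1001.0, 1002.0])
```

The result matches `softmax([0.0, 1.0, 2.0])`, since only the differences between logits matter.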


#### Summary Table

| Function   | Output Range      | Zero-Centered | Pros                                   | Common Use                         |
|------------|-------------------|---------------|----------------------------------------|------------------------------------|
| ReLU       | [0, ∞)            | No            | Fast, simple, sparse activations       | Hidden layers (CNNs, MLPs)         |
| Leaky ReLU | (−∞, ∞)           | Partially     | Fixes the dying ReLU problem           | Hidden layers where ReLU units die |
| ELU        | (−α, ∞)           | Approximately | Better gradient flow                   | Deep networks                      |
| Sigmoid    | (0, 1)            | No            | Smooth, interpretable as a probability | Binary classification output       |
| Tanh       | (−1, 1)           | Yes           | Better than sigmoid for hidden layers  | RNNs, deep MLPs                    |
| Softmax    | (0, 1), sums to 1 | No            | Converts logits to probabilities       | Multi-class classification output  |


---
## 3. Autoencoders

Autoencoders are unsupervised neural networks designed to learn compressed representations of input data (useful for dimensionality reduction, feature learning, and reconstruction).


#### **Core Idea**

An autoencoder consists of two main components:

- **Encoder**: Maps the input data $x$ to a lower-dimensional latent representation $z$
- **Decoder**: Attempts to reconstruct the input $\hat{x}$ from the latent code $z$



#### **Architecture**

- Input layer: Raw data $x\in\mathbb{R}^n$
- **Encoder**: A stack of layers that reduce dimensionality (e.g., MLP or CNN layers)
- Latent layer: Bottleneck that contains the compressed representation $z\in\mathbb{R}^k$, where $k < n$
- **Decoder**: A symmetric stack of layers that reconstruct $\hat{x}\in \mathbb{R}^n$



#### **Loss Function**

The objective is to minimize the reconstruction error between the original input and the output:

$$
L = \| x - \hat{x} \|^2
$$

Where:
- $x$: original input
- $\hat{x}$: reconstructed input

Alternative loss functions:
- Binary cross-entropy for binary input
- An additional KL-divergence regularization term in variational autoencoders (added to the reconstruction loss, not a replacement for it)
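
To make the reconstruction objective concrete, here is a hedged sketch of a tiny linear autoencoder trained by gradient descent on the squared reconstruction error. The toy data, the matrix names (`W_enc`, `W_dec`), the learning rate, and the step count are all assumptions for illustration. A practical autoencoder would add non-linear layers and use a deep learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points in R^5 that lie close to a 2-D subspace.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 5))

n, k = X.shape[1], 2                       # input dimension and bottleneck size, k < n
W_enc = rng.normal(0.0, 0.1, size=(n, k))  # encoder: z = x W_enc
W_dec = rng.normal(0.0, 0.1, size=(k, n))  # decoder: x_hat = z W_dec

def reconstruction_loss(X, W_enc, W_dec):
    X_hat = (X @ W_enc) @ W_dec
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

initial_loss = reconstruction_loss(X, W_enc, W_dec)

lr = 0.01
for _ in range(2000):
    Z = X @ W_enc
    X_hat = Z @ W_dec
    G = 2.0 * (X_hat - X) / len(X)   # gradient of the loss w.r.t. X_hat
    grad_dec = Z.T @ G               # chain rule through x_hat = z W_dec
    grad_enc = X.T @ (G @ W_dec.T)   # chain rule through z = x W_enc
    W_enc -= lr * grad_enc
    W_dec -= lr * grad_dec

final_loss = reconstruction_loss(X, W_enc, W_dec)
```

Because the data is almost exactly two-dimensional, a bottleneck of size `k = 2` should be enough to bring `final_loss` well below `initial_loss`.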



#### **Variants**

- **Denoising Autoencoders**:
  - Learn to reconstruct clean input from noisy input
  - Take a corrupted input $\tilde{x}$ and are trained to output the clean $x$
  
- **Sparse Autoencoders**:
  - Use sparsity regularization on latent representation (e.g., L1 penalty)
  
- **Variational Autoencoders (VAE)**:
  - Probabilistic model; learns a distribution over latent space
  - Latent variables $z \sim \mathcal{N}(\mu, \sigma^2)$
  - Minimizes the negative evidence lower bound (ELBO), which combines a reconstruction term and a KL divergence regularizer:
  
  $$
  L = -\mathbb{E}_{q(z|x)}[\log p(x|z)] + D_{KL}(q(z|x) \| p(z))
  $$
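
A hedged sketch of the two VAE-specific ingredients, assuming a diagonal Gaussian encoder $q(z|x) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$ and a standard normal prior $p(z) = \mathcal{N}(0, I)$. Under those assumptions the KL term has the closed form $\frac{1}{2}\sum_j (\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1)$. The function names and the use of `log_var` $= \log \sigma^2$ are illustrative choices.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    # z = mu + sigma * eps with eps ~ N(0, I); keeps sampling differentiable in mu, sigma.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    # Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims.
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - log_var - 1.0, axis=-1)

rng = np.random.default_rng(0)
mu = np.array([1.0, 0.0])
log_var = np.zeros(2)
z = reparameterize(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)  # 0.5 * (1^2 + 0^2) = 0.5
```

The KL term is zero exactly when the encoder's distribution equals the prior, which is what pulls the latent codes toward a standard normal.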


#### **Applications**

- **Dimensionality Reduction**: Similar to PCA but can learn non-linear transformations
- **Denoising**: Remove noise from images, signals, or data
- **Anomaly Detection**: High reconstruction error may indicate anomalies
- **Pretraining**: Learn useful representations for downstream supervised tasks
- **Image Compression**: Encode images into compact representations

---

#### **Example Use Case: Anomaly Detection**

1. Train autoencoder on normal (non-anomalous) data
2. During inference, compute reconstruction error:
   $$
   \text{Error} = \| x - \hat{x} \|
   $$
3. If error exceeds a threshold, classify input as an anomaly
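
The three steps above can be sketched as follows. To keep the example self-contained, a projection onto the top principal direction stands in for a trained autoencoder (an assumption, not a neural network). The toy data and the 99th-percentile threshold rule are also illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: "normal" training data lies near a 1-D line in R^3.
direction = np.array([1.0, 2.0, -1.0]) / np.sqrt(6.0)
X_train = rng.normal(size=(500, 1)) * direction + 0.05 * rng.normal(size=(500, 3))

# Stand-in autoencoder: encode = project onto the top principal direction.
mean = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
components = Vt[:1]  # shape (1, 3)

def reconstruct(X):
    Z = (X - mean) @ components.T   # encode to a 1-D latent code
    return Z @ components + mean    # decode back to R^3

# Step 2: reconstruction error ||x - x_hat|| per sample.
def reconstruction_error(X):
    return np.linalg.norm(X - reconstruct(X), axis=1)

# Step 3: assumed threshold rule, the 99th percentile of training errors.
threshold = np.percentile(reconstruction_error(X_train), 99)

x_normal = 1.5 * direction              # on the line the model has learned
x_anomaly = np.array([2.0, -2.0, 2.0])  # far from that line
is_anomaly = reconstruction_error(np.stack([x_normal, x_anomaly])) > threshold
```

Here `is_anomaly` should flag only the second point. In practice the threshold is usually tuned on a validation set to trade off false alarms against missed anomalies.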



#### **Visualization**

- Latent space can be visualized in 2D/3D to understand data clusters
- Useful in exploratory data analysis and clustering




---

## 4. Convolutional Neural Networks (CNNs)

Specialized neural networks for spatial data (especially images).

- **Core Components**:
  - **Convolutional layers**: Apply filters to extract features
  - **Pooling layers**: Downsample feature maps (e.g., max pooling)
  - **Fully connected layers**: Final prediction

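- **Equation for convolution** (the form used by most deep learning libraries; strictly speaking this is cross-correlation, because the kernel $K$ is not flipped):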

  $$
  (I * K)(i, j) = \sum_m \sum_n I(i+m, j+n) \cdot K(m,n)
  $$

- **Use Cases**: Image classification, object detection, medical imaging
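
A minimal sketch of the sliding-window sum above and of max pooling, written with explicit loops for clarity rather than speed. It assumes a single-channel image, stride 1, and no padding. The function names and the toy image are illustrative.

```python
import numpy as np

def cross_correlate2d(I, K):
    """out[i, j] = sum_m sum_n I[i + m, j + n] * K[m, n] (no padding, stride 1)."""
    kh, kw = K.shape
    out_h = I.shape[0] - kh + 1
    out_w = I.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(I[i:i + kh, j:j + kw] * K)
    return out

def max_pool2d(x, size=2):
    # Non-overlapping size x size windows; trailing rows/columns that do not fit are dropped.
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.arange(16, dtype=float).reshape(4, 4)
edge_kernel = np.array([[1.0, -1.0]])             # horizontal difference filter
features = cross_correlate2d(image, edge_kernel)  # shape (4, 3), every entry is -1
pooled = max_pool2d(image)                        # [[5, 7], [13, 15]]
```

The kernel responds with a constant because neighbouring pixels in this toy image always differ by exactly 1. On a real image, the same filter would highlight vertical edges.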

---

## 5. Recurrent Neural Networks (RNNs)

Neural networks designed for sequence data.

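- **Core Concept**: The hidden state $h_t$ summarizes the inputs seen up to time step $t$ and is updated from the previous state $h_{t-1}$ and the current input $x_t$, using the same weights at every step: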

  $$
  h_t = f(W_{hh} h_{t-1} + W_{xh} x_t + b)
  $$

- **Limitations**: Struggle with long-term dependencies because gradients tend to vanish (or explode) when backpropagated through many time steps.
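
A minimal sketch of the recurrence above, unrolled over a short sequence. It assumes $f = \tanh$ and a zero initial hidden state. The dimensions and the function name `rnn_forward` are illustrative.

```python
import numpy as np

def rnn_forward(xs, W_hh, W_xh, b, h0=None, f=np.tanh):
    """Apply h_t = f(W_hh h_{t-1} + W_xh x_t + b) for each x_t in xs."""
    h = np.zeros(W_hh.shape[0]) if h0 is None else h0
    states = []
    for x_t in xs:
        h = f(W_hh @ h + W_xh @ x_t + b)  # the same weights are reused at every step
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
hidden, inputs, steps = 3, 2, 5
W_hh = rng.normal(0.0, 0.5, size=(hidden, hidden))
W_xh = rng.normal(0.0, 0.5, size=(hidden, inputs))
b = np.zeros(hidden)
xs = rng.normal(size=(steps, inputs))

states = rnn_forward(xs, W_hh, W_xh, b)  # one hidden state per time step
```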

---

## 6. LSTM (Long Short-Term Memory)

An RNN variant that preserves long-term dependencies using memory cells.

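- **Gates**:
  - Forget gate $f_t$: how much of the previous cell state $c_{t-1}$ to keep
  - Input gate $i_t$: how much of the new candidate value to write into the cell state
  - Output gate $o_t$: how much of the cell state to expose as the hidden state $h_t$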

- **Core Equations**:

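  $$
  \begin{aligned}
  f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
  i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
  o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
  c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
  h_t &= o_t \odot \tanh(c_t)
  \end{aligned}
  $$

  where $\sigma$ is the sigmoid function and $\odot$ denotes element-wise multiplication.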

- **Use Cases**: Time series forecasting, language modeling
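
The core equations above translate almost line for line into NumPy. This is a sketch of a single time step. The parameter layout (a dict of per-gate matrices keyed `"f"`, `"i"`, `"o"`, `"c"`) and the dimensions are assumptions made for readability.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b are dicts keyed by gate name (illustrative layout)."""
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])      # forget gate
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])      # input gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])      # output gate
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate cell value
    c_t = f_t * c_prev + i_t * c_tilde  # element-wise products
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
gates = ["f", "i", "o", "c"]
W = {g: rng.normal(0.0, 0.1, size=(n_hid, n_in)) for g in gates}
U = {g: rng.normal(0.0, 0.1, size=(n_hid, n_hid)) for g in gates}
b = {g: np.zeros(n_hid) for g in gates}

h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(6, n_in)):  # run over a short random sequence
    h, c = lstm_step(x_t, h, c, W, U, b)
```

The additive update of `c_t` is what lets gradients flow across many steps: when the forget gate stays near 1, the cell state is carried forward almost unchanged.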

---

## 7. GANs (Generative Adversarial Networks)

Two neural networks — a **generator** and a **discriminator** — compete in a game-theoretic setup.

- **Generator**: Tries to generate realistic data
- **Discriminator**: Tries to distinguish real from fake data

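- **Loss Function** (a minimax objective, where $D(x)$ is the discriminator's estimated probability that $x$ is real and $G(z)$ is a sample generated from noise $z$):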

  $$
  \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]
  $$

- **Use Cases**: Image generation, data augmentation, super-resolution
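
A small sketch that evaluates the value function $V(D, G)$ from batches of discriminator outputs, to show what each player is pushing toward. The function name `gan_value`, the clipping constant, and the hand-picked outputs are assumptions. No networks are trained here.

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """Monte Carlo estimate of V(D, G).

    d_real: D(x) on real samples; d_fake: D(G(z)) on generated samples.
    """
    d_real = np.clip(d_real, eps, 1.0 - eps)  # guard against log(0)
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# Discriminator that separates real from fake almost perfectly: V is close to 0.
v_confident = gan_value(np.full(8, 0.99), np.full(8, 0.01))

# Discriminator that cannot tell them apart outputs 0.5 everywhere: V = -log(4).
v_confused = gan_value(np.full(8, 0.5), np.full(8, 0.5))
```

The discriminator tries to raise $V$ toward 0, while the generator tries to lower it. At the theoretical equilibrium the discriminator outputs 0.5 everywhere and $V = -\log 4$.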

---