# Definition

**What is Deep Learning?**

Deep learning is a subfield of machine learning that uses **deep models**.

**Deep models:** computational models consisting of multiple processing layers to learn hierarchical representations of data.  

Each layer transforms the input into a more abstract and meaningful representation, where the output of one layer serves as the input to the next.

# Comparison with Traditional Machine Learning

**Machine Learning Recap:**
- Learns a mapping from input features to output labels using training data.
- Requires **hand-crafted** features to perform well.
- Performance depends heavily on the quality of the **data representation**.

However, designing good features, especially for complex inputs (e.g., images, video, and time series), is challenging:
- It is difficult to know which features should be extracted
- Experts may spend years refining feature sets that are still incomplete or over-specified

**Deep Learning Approach:**

Deep learning automatically learns both:
- Feature representations from raw input data
- The mapping from these features to the output

$$\text{Input} \rightarrow \underbrace{\text{Trainable Feature Extractor} \rightarrow \text{Trainable Classifier}}_{\text{End-to-End Learning}} \rightarrow \text{Output}$$

<div style="text-align:center">
    <img src="../assets/deep_example.png" alt="deep model example">
</div>

In this schematic:
- First layer ($h_i$): learns features
- Second layer ($y_i$): performs classification

# Multi-Layer Neural Networks

For a layer $ l $, the output of the $ k $-th neuron, after applying the activation function $ f $, is computed as:
$$a_k^{[l]} = f \left( \sum_{i=1}^M W_{ki}^{[l]} a_i^{[l-1]} + {b_k}^{[l]} \right)$$

Where:
- $ k \in \{1, \dots, K\} $: Index of a neuron in layer $ l $, which has $ K $ neurons.

- $ W_{ki}^{[l]} $: Weight connecting the $ i $-th neuron in layer $ l-1 $ to the $ k $-th neuron in layer $ l $.

- $ b_k^{[l]} $: Bias term for the $ k $-th neuron in layer $ l $.

- $ f(\cdot) $: Activation function (e.g., ReLU, sigmoid).

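- $ i \in \{1, \dots, M\} $: Index of a neuron in the previous layer $ l-1 $, which has $ M $ neurons.

- $ a_i^{[l-1]} $: Output of the $ i $-th neuron in layer $ l-1 $ (for the first layer, the input features).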
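
As a minimal sketch of the formula above, the following plain-Python function computes the outputs of one layer. The names `layer_forward` and `relu`, and the example weights, are assumptions made for illustration; a real implementation would normally use a matrix library.

```python
def relu(z):
    """ReLU activation: f(z) = max(0, z)."""
    return max(0.0, z)

def layer_forward(W, b, a_prev, f=relu):
    """Compute a_k = f(sum_i W[k][i] * a_prev[i] + b[k]) for every neuron k."""
    return [
        f(sum(w_ki * a_i for w_ki, a_i in zip(W_k, a_prev)) + b_k)
        for W_k, b_k in zip(W, b)
    ]

# Hypothetical layer with M = 3 inputs and K = 2 neurons.
W = [[1.0, -2.0, 0.5],   # weights into neuron k = 1
     [0.0, 1.0, 1.0]]    # weights into neuron k = 2
b = [0.5, -1.0]
a_prev = [1.0, 2.0, 4.0]

print(layer_forward(W, b, a_prev))  # [0.0, 5.0]
```

The first neuron has a pre-activation of $-0.5$, which ReLU clips to $0$; the second has a pre-activation of $5$, which passes through unchanged.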


<div style="text-align:center">
    <img src="../assets/mlp_represent.png" alt="multi-layer neural network example">
</div>

**Learnable parameters:**

The learnable parameters in an MLP are:
- Weights $ W_{ki}^{[l]} $
- Biases $ b_k^{[l]} $

The total number of parameters (weights and biases) in a fully connected network with $ L $ hidden layers is:

$$(d+1)m_1 + (m_1+1)m_2 + \dots + (m_{L-1}+1)m_L + (m_L+1)k$$

Where:
- $ d $: Number of input features.
- $ m_i $: Number of neurons in the $ i $-th hidden layer.
- $ k $: Number of output units.
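
The count can be checked with a short helper. This is only a sketch: the function name `count_parameters` and the layer sizes in the example are hypothetical, and it assumes a fully connected network in which every neuron has one bias.

```python
def count_parameters(d, hidden_sizes, k):
    """Count the weights and biases of a fully connected network.

    d: number of input features
    hidden_sizes: [m_1, ..., m_L], neurons per hidden layer
    k: number of output units
    """
    sizes = [d] + list(hidden_sizes) + [k]
    # Each layer contributes (fan_in + 1) * fan_out parameters:
    # fan_in weights plus one bias for each of its fan_out neurons.
    return sum((fan_in + 1) * fan_out for fan_in, fan_out in zip(sizes, sizes[1:]))

# Hypothetical network: 784 inputs, hidden layers of 128 and 64 neurons, 10 outputs.
print(count_parameters(d=784, hidden_sizes=[128, 64], k=10))  # 109386
```

Here the three terms are $785 \cdot 128 = 100480$, $129 \cdot 64 = 8256$, and $65 \cdot 10 = 650$.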

**Compositionality in Deep Learning**

Deep learning breaks down complex mappings into simpler, nested functions:
$$\text{Input} \rightarrow \underbrace{\text{layer 1} \rightarrow \cdots \rightarrow \text{layer n}}_{\text{Trainable Feature Extractor}} \rightarrow \text{Classifier} \rightarrow \text{Output}$$

Why is this powerful?
- Each layer captures increasingly abstract features
- **Early layers:** **low-level** features (e.g., edges, corners)
- **Later layers:** **high-level** concepts (e.g., faces, objects)

Compositionality is useful for describing the world around us efficiently:
- The learned function can be seen as a composition of simpler operations.
- Features and concepts form a hierarchy whose more abstract factors enable better generalization:
    - Each concept is defined in relation to simpler concepts.
    - More abstract representations are computed in terms of less abstract ones.
- Theory shows that this can be **exponentially advantageous**: for some functions, a shallow model needs exponentially more units than a deep one to represent the same mapping.

Deep learning gains its power and flexibility by learning to represent the world as a nested hierarchy of concepts.
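
The idea of a model as nested functions can be sketched in a few lines. The `compose` helper and the four toy "layers" below are hypothetical stand-ins chosen for illustration, not real trained layers.

```python
from functools import reduce

def compose(*layers):
    """Return a function that applies the given layers from left to right."""
    return lambda x: reduce(lambda value, layer: layer(value), layers, x)

# Hypothetical toy "layers": each one is a simple function of the previous output.
scale = lambda x: [2.0 * v for v in x]
shift = lambda x: [v + 1.0 for v in x]
rectify = lambda x: [max(0.0, v) for v in x]
total = lambda x: sum(x)

# model(x) = total(rectify(shift(scale(x))))
model = compose(scale, shift, rectify, total)
print(model([-2.0, 0.5, 1.0]))  # 5.0
```

Each stage is simple on its own; the overall mapping only becomes non-trivial through nesting.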

# History of Deep Learning

- **1943 Artificial Neuron**
    - Modeled brain neurons using a weighted sum of **boolean** inputs passed through an activation function.

- **1957 Perceptron**
    - A linear classifier for real inputs, limited to **linearly separable** problems (e.g., unable to solve XOR).
<div style="text-align:center">
    <img src="../assets/perceptron_schematic.png" alt="Perceptron example">
</div>

- **1969 Limitations of Neural Networks:**
    - Minsky and Papert's analysis showed that the perceptron algorithm cannot handle non-linearly separable problems (such as XOR), which dampened interest in neural networks.

- **1979 Neocognitron (inspires CNNs):**
    - Inspired by the brain’s visual system, it introduced multiple computational layers, laying the groundwork for convolutional neural networks (CNNs).
<div style="text-align:center">
    <img src="../assets/cnn.png" alt="CNN example">
</div>

- **1982 Recurrent Neural Networks (RNNs):**
    - Designed for sequential and time-series data.
<div style="text-align:center">
    <img src="../assets/rnn.png" alt="RNN example">
</div>

- **1986 Back Propagation:**
    - Combined gradient descent, the chain rule, and dynamic programming to train neural networks efficiently (building on automatic differentiation from 1970).

- **1997 Long Short-Term Memory (LSTM)**
    - Enhanced RNNs to better handle long-term dependencies in sequential data.
<div style="text-align:center">
    <img src="../assets/lstm.png" alt="LSTM example">
</div>

- **1998 LeNet (Neocognitron + Back-prop)**
    - Combined Neocognitron’s architecture with back-propagation to solve handwritten digit recognition, marking a practical deep learning success.
<div style="text-align:center">
    <img src="../assets/le_net.png" alt="LeNet example">
</div>

- **2006 Deep Learning**
    - Layer-wise pre-training: each layer is trained individually, which is an easier task than training the whole network at once.
        - The per-layer trained parameters then initialize further training of the full network.
        - This made training multi-layer neural networks easier and revitalized interest in them.
    - Progress was still limited by the available compute resources and data.

- **2009 ImageNet**
    - Introduced a large-scale image dataset, enabling deep learning to tackle complex vision tasks.

- **2012 AlexNet**
    - A deep CNN that achieved breakthrough performance on ImageNet, leveraging GPUs and large datasets.
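
The linear-separability limitation noted in the 1957 and 1969 entries can be reproduced with a small experiment. This is a sketch of the classic perceptron update rule; the function name, learning rate, and epoch count are illustrative assumptions.

```python
def train_perceptron(samples, epochs=20, lr=1.0):
    """Run the perceptron rule on 2-D inputs.

    Returns (weights, bias, number of mistakes made in the last epoch).
    """
    w, b = [0.0, 0.0], 0.0
    errors = 0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), target in samples:
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            update = lr * (target - pred)  # zero when the prediction is correct
            if update != 0:
                errors += 1
                w[0] += update * x1
                w[1] += update * x2
                b += update
    return w, b, errors

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

print(train_perceptron(AND)[2])  # 0: AND is linearly separable, so training converges
print(train_perceptron(XOR)[2])  # greater than 0: no line separates the XOR classes
```

Because no single line separates the XOR classes, every pass over the data contains at least one mistake, no matter how many epochs are run.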

**Why Did Deep Learning Become Popular?**

- **Data:**
    - Availability of massive datasets (e.g., ImageNet)
    - These provide the volume of training examples needed for deep models

- **Hardware:**
    - Availability of the computational resources to run much larger models
    - Especially GPUs

- **Algorithm:**
    - New architectures (CNNs, RNNs)
    - Frameworks such as TensorFlow and PyTorch
    - New training techniques

# Advanced Concepts in Deep Learning

## Transfer Learning

- **Problem:**
    - **Training** deep networks from scratch **requires massive labeled data**.

- **Solution:**
    - Use **pre-trained** models (e.g., ResNet, BERT) as feature extractors.

- **Example:**
    - After the success of deep models on image classification, pre-trained models helped achieve strong results in other vision tasks:
        - Object detection
        - Segmentation
        - Image captioning
        - Visual Question Answering (VQA)
        - …
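
The feature-extractor idea can be illustrated with a toy example. Everything here is an assumption made for illustration: `pretrained_features` is a hand-written stand-in for a frozen pre-trained network (it is not ResNet or BERT), and the four labeled points are made up. Only the small classifier head on top of the frozen features is trained.

```python
import math

def pretrained_features(x):
    """Stand-in for a frozen pre-trained network (hypothetical); it is never updated."""
    x1, x2 = x
    return [x1 * x2, x1 + x2]

def train_head(data, epochs=50, lr=0.5):
    """Train only a logistic-regression head on top of the frozen features."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            h = pretrained_features(x)
            p = 1.0 / (1.0 + math.exp(-(w[0] * h[0] + w[1] * h[1] + b)))
            grad = p - y  # gradient of the log loss with respect to the logit
            w = [w_j - lr * grad * h_j for w_j, h_j in zip(w, h)]
            b -= lr * grad
    return w, b

def predict(x, w, b):
    h = pretrained_features(x)
    return 1 if w[0] * h[0] + w[1] * h[1] + b > 0 else 0

# Tiny labeled set: label 1 when both coordinates have the same sign.
data = [((1, 1), 1), ((-1, -1), 1), ((1, -1), 0), ((-1, 1), 0)]
w, b = train_head(data)
print([predict(x, w, b) for x, _ in data])  # [1, 1, 0, 0]
```

The task is not linearly separable in the raw inputs, but it is in the reused feature space, so four labeled examples are enough to fit the head.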

## Transformer

Introduced for sequence-to-sequence tasks in natural language processing (NLP), transformers consist of:

- **Encoder:** Encodes input sequences into a latent representation.
- **Decoder:** Generates output sequences from the encoded representation.

**Impact:**  
Transformers dominate NLP (e.g., GPT, BERT) and vision (e.g., ViT).

## Self-Supervised Learning (SSL)

**Idea:** Generate **pretext tasks** from unlabeled data to learn useful representations.

**Benefits:**
- Leverages vast unlabeled data.
- Produces more generalizable models.

The learning mechanism is the same as in supervised learning, but instead of the data being labeled manually, the labels are derived automatically from the data itself.
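
One possible pretext task is masked-word prediction, sketched below. The function name and the `[MASK]` token are assumptions for illustration; the point is only that the (input, label) pairs come from the unlabeled text itself.

```python
def make_masked_examples(sentence, mask_token="[MASK]"):
    """Turn one unlabeled sentence into (masked input, target word) training pairs."""
    words = sentence.split()
    examples = []
    for i, target in enumerate(words):
        masked = words[:i] + [mask_token] + words[i + 1:]
        examples.append((" ".join(masked), target))
    return examples

for masked, target in make_masked_examples("deep models learn features"):
    print(masked, "->", target)
# [MASK] models learn features -> deep
# deep [MASK] learn features -> models
# deep models [MASK] features -> learn
# deep models learn [MASK] -> features
```

A model trained on such pairs is supervised in the usual way, yet no human annotation was needed to create the labels.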

## Multi-Modal Models

Multi-modal models learn from multiple data types (e.g., image + text).

**CLIP:** Learns a multi-modal embedding space by jointly training an image encoder and a text encoder, so that images and texts are mapped into the same space.

- Uses the large amount of multi-modal (image-text) data that is available
- Enables zero-shot classification
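
Zero-shot classification in a shared embedding space can be sketched as a nearest-neighbor lookup by cosine similarity. The 3-D embeddings and the prompt strings below are made up for illustration; in a real system they would come from the trained image and text encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(image_embedding, text_embeddings):
    """Return the label whose text embedding is most similar to the image embedding."""
    return max(text_embeddings,
               key=lambda label: cosine(image_embedding, text_embeddings[label]))

# Hypothetical embeddings in a shared 3-D space.
text_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.3, 0.1]

print(zero_shot_classify(image_embedding, text_embeddings))  # a photo of a cat
```

New classes can be added by writing new text prompts, without retraining a classifier, which is what makes the approach zero-shot.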

## Generative Models

Generative models output text, images, video, audio, … either unconditionally or given partial guidance such as a prompt.

## Large Language Models (LLMs)

Large language models are one of the most successful applications of transformer models.

LLMs, built on **transformer** architectures, are trained on massive text datasets in a **self-supervised** manner. Based on the knowledge gained from these datasets, they can do the following with text and other content:
- Recognize
- Summarize
- Translate
- Predict
- Generate