<font size=6>**Deep Learning**</font>

**_Deep Learning_** is a type of Machine Learning which is characterized by being **deep**.

Meaning, it uses **multiple layers** to process the input information (Figure 0).

<table><tr>
    <td width=640>
        <img src="images/Simple_vs_Deep.png">
        <center>
            <br>
            Figure 0.  A simple <i>feedforward</i> Neural Network compared with a Deep <i>feedforward</i> Neural Network.<br>
            (From <a href="https://thedatascientist.com/what-deep-learning-is-and-isnt/">here</a>)
        </center>
    </td>
</tr></table>

The actual way the depth is designed can be very different. It could be achieved e.g. by **stacking** sequential layers (_feedforward neural networks_), via **recurrent** layers (_recurrent neural networks_), via a "**mix**" of these two approaches (_U-nets_), and many other ways.

Don't worry: we will explain how to _computationally_ create neurons/layers [later](#Generic_Architecture_and_Neurons).

# Why Deep Learning is cool

    It is not, we are geeks, and that's the truth.

However ... we do live in the era of "Big Data":

In [None]:
import pandas as pd

df_surveys = pd.DataFrame([
    ['2MASS',                                  1997,    20, 25.4],
    ['Sloan Digital Sky Survey (SDSS)',        2000,   200, 50],
    ['Large Synoptic Survey Telescope (LSST)', 2023,  30e3, 200e3],
    ['Square Kilometer Array (SKA)',           2027, 150e3, 4.6e6]
], columns=['Sky Survey Project', 'First Light', 'Velocity (GB/day)', 'Volume (TB)'])

# Cast the numeric columns to integers for a compact display (25.4 TB is truncated to 25)
df_surveys[df_surveys.columns[1:]] = df_surveys[df_surveys.columns[1:]].astype(int)

display(df_surveys)

In [None]:
import cutecharts.charts as ctc

chart = ctc.Line("Survey size evolution", width='500px')
chart.set_options(labels=list(df_surveys['First Light']), x_label='Year', y_label='Volume (TB)')
chart.add_series('Volume (TB)', list(df_surveys['Volume (TB)']))
chart.render_notebook()

We cannot expect to humanly inspect these data and derive the intuition for the rules which categorize them.

$\rightarrow$ We have to leverage:

- the **large number** of examples

- algorithms that can abstract **arbitrarily complex** rules

## So how does Deep Learning address big data issues?

The basic idea is that layers construct **new features**.

In practice, Deep Learning systems include implicit **feature engineering** _on top_ of the learning task (e.g., _classification_ or _regression_).<br>

In this way, they are a step forward with respect to "classic" ML approaches (Figure 1).

<table><tr>
    <td width=480>
        <img src="images/Deep_Feature_Engeneering.png">
        <center>
            <br>
            Figure 1.  A Deep Neural Network seen as a combination of feature extractor + learner (e.g. classifier or regressor).<br>
            (From <a href="https://stats.stackexchange.com/questions/562466/neural-networks-automatically-do-feature-engineering-how/">here</a>)
        </center>
    </td>
</tr></table>

From this perspective, the connections between the network neurons represent **potential correlations** between features.

<u>What are the implications?</u>

The scientist **does _not_ need detailed insight into the problem** to build the proper features or select the proper classifier<br>
$\rightarrow$ the DL system does it all for us!

This comes in particularly handy when we deal with databases with **millions of objects** and **hundreds of features**!

<u>References</u>

In case you are curious, it has been proven that feedforward Neural Networks are indeed "**_universal approximators_**"
(e.g. [Kurt Hornik (1991), Neural Networks, 4, 2](https://www.sciencedirect.com/science/article/abs/pii/089360809190009T?via%3Dihub)), meaning that they can in principle approximate any continuous (linear or non-linear) relation between the features and the target.

## Some example applications

Indeed, Deep Learning (hereafter, **DL**) is being used to solve very _different_ problems, e.g.:

- **Self-Driving cars**

<table><tr>
    <td width=640>
        <img src="images/DL_Self_Driving.png">
        <center>
            <br>
            Figure 1.2.a.  NVIDIA's driverless car simulator.<br>
            (From <a href="https://images.nvidia.com/content/tegra/automotive/images/2016/solutions/pdf/end-to-end-dl-using-px.pdf/">"End to End Learning for Self-Driving Cars" (2016)</a>)
        </center>
    </td>
</tr></table>

- **Protein Structure Prediction**

<table><tr>
    <td width=640>
        <img src="images/DL_AlphaFold.jpg">
        <center>
            <br>
            Figure 1.2.b.  DeepMind's AlphaFold network for the prediction of molecular structures of proteins.
            Original paper: <a href="https://www.nature.com/articles/s41586-021-03819-2">Jumper, J., Evans, R., Pritzel, A. et al. 2021,  Nature, 596, 583</a>.<br>
            (From <a href="https://www.deepmind.com/blog/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology">DeepMind's blog</a>)
        </center>
    </td>
</tr></table>


- **Natural Language Processing, translation, and text generation**


<table><tr>
    <td width=640>
        <img src="images/DL_NLP.png">
        <center>
            <br>
            Figure 1.2.c.  Google's unified text-to-text transformer.<br>
            (From <a href="https://arxiv.org/abs/1910.10683">Raffel et al. 2019, arXiv:1910.10683</a>)
        </center>
    </td>
</tr></table>


- **Computer Vision (lots and lots of it!)**

<table><tr>
    <td width=640>
    <img src="https://scontent-vie1-1.xx.fbcdn.net/v/t39.2365-6/10000000_3947476245325303_7673388906041049088_n.png?_nc_cat=107&ccb=1-7&_nc_sid=ad8a9d&_nc_ohc=to_z8oxJaisAX_1lJ8s&_nc_ht=scontent-vie1-1.xx&oh=00_AfDyNwVlX9BSki5r28HOjoteYlR7lyR3I_Hidq3rxC9Gcg&oe=6473E54D" alt="Detectron example">
        <center>
        Figure 1.2.d.  Facebook's Detectron2 for multiple computer vision tasks.<br>
        (From <a href="https://ai.facebook.com/tools/detectron2/">Meta AI blog</a>)
        </center>
    </td>
</tr></table>


... and many, many other _scary_ applications like:

- **Deep Fakes**

<table><tr>
    <td width=640>
        <img src="images/DL_DeepFake.jpg">
        <center>
        Figure 1.2.e.  Deep Fakes can be used to bring back actors from when the cinema was actually good (i.e., before 1999!), but also to produce false evidence.  Luckily, there are already ML efforts to uncover Deep Fakes, e.g. <a href="https://arxiv.org/abs/2101.01456/">Zi et al. 2021, arXiv:2101.01456</a>.
        </center>
    </td>
</tr></table>

- **AI chats**

<table><tr>
    <td width=640>
        <img src="images/ChatGPT.png">
        <center>
        Figure 1.2.f.  Large Language Models applied as chat tools necessarily bring along the creators' moral judgment.
        (From <a href="https://openai.com/blog/chatgpt">OpenAI's blog</a>)
        </center>
    </td>
</tr></table>

- **Video Games**

<table><tr>
    <td width=640>
        <img src="https://assets-global.website-files.com/621e749a546b7592125f38ed/62271e2f604e640534eeca99_AlphaStar%2003.gif">
        <center>
        Figure 1.2.g.  DeepMind's AlphaStar absolutely demolishing a human player who later claimed ... erhm ... that the internet connection was bad that day because ... ehrm, mmmh ... someone in the house was watching Netflix.<br>
        (From <a href="https://www.deepmind.com/blog/alphastar-mastering-the-real-time-strategy-game-starcraft-ii">DeepMind's blog</a>)
        </center>
    </td>
</tr></table>

- - -

Catching up with all the new DL developments is becoming physically impossible, but you can follow great channels like [Two Minute Papers](https://www.youtube.com/c/K%C3%A1rolyZsolnai/featured) to try and stay updated.

## Deep Learning in Astronomy


The application of DL in Astronomy is still at an **amateur level**, compared with what happens in the industry (_prejudice against the "black box"?_).<br>  However ... 

Astronomy is the perfect ML lab because it offers:
- tough problems to solve
- large data

In fact, Deep Learning publications are **exploding** in Astronomy (Figure 1.3.a)!

<table><tr>
    <td width=480>
        <img src="images/Deep_Learning_astro_papers.png">
        <center>
            <br>
            Figure 1.3.a. Number of refereed astronomy papers containing the text "Deep Learning" in their abstracts.<br>
            (From <a href="https://ui.adsabs.harvard.edu/search/filter_property_fq_property=AND&filter_property_fq_property=property%3A%22refereed%22&fq=%7B!type%3Daqp%20v%3D%24fq_property%7D&fq_property=(property%3A%22refereed%22)&q=abs%3A%22deep%20learning%22%20year%3A2015-2022&sort=date%20desc%2C%20bibcode%20desc&p_=0">NASA ADS</a>)
        </center>
    </td>
</tr></table>

<font size=3><u>**Some notable examples**</u></font>

**Galaxy Classification**
    
- [Dieleman et al. (2015), MNRAS, 450, 1441](https://ui.adsabs.harvard.edu/abs/2015MNRAS.450.1441D/abstract) $-$ calculate probabilities for the 37 Galaxy Zoo possible answers

    - **training**: classification of 61,578 JPEG images from SDSS with GZ labels
    - **architecture**: standard CNN

<table><tr>
    <td width=420>
        <img src="images/Galaxy_Zoo_flowchart.png">
        <center>
            <br>
            Figure 1.3.b. Galaxy Zoo classification tree.<br>
            (From <a href="https://ui.adsabs.harvard.edu/abs/2013yCat..74352835W/abstract">Willett et al. (2013)</a>)
        </center>
    </td>
    <td width=480>
        <img src="images/Dieleman_Fig11.png">
        <center>
            <br>
            Figure 1.3.c. Activation of the CNN layers.<br>
            (From <a href="https://ui.adsabs.harvard.edu/abs/2015MNRAS.450.1441D/abstract">Dieleman et al. (2015)</a>)
        </center>
    </td>
</tr></table>
    
- [Ackermann et al. 2018, MNRAS, 479, 415](https://ui.adsabs.harvard.edu/abs/2018MNRAS.479..415A/abstract) $-$ identify mergers

    - **training**: classification of ~4000 JPEG images from SDSS with GZ labels
    - **architecture**: CNN with transfer learning
    
<table><tr>
    <td width=480>
        <img src="images/Ackerman_Fig8.png">
        <center>
            <br>
            Figure 1.3.d. Some galaxy pairs confidently identified as mergers.<br>
            (From <a href="https://ui.adsabs.harvard.edu/abs/2018MNRAS.479..415A/abstract">Ackermann et al. (2018)</a>)
        </center>
    </td>
</tr></table>
    
**Galaxy Morphology**
    
- [Aragon-Calvo et al. 2020, MNRAS, 498, 3713](https://ui.adsabs.harvard.edu/abs/2020MNRAS.498.3713A/abstract) $-$ obtain structural parameters via self-supervised learning

    - **training**: re-produce parameters used to generate artificial galaxies
    - **architecture**: semantic autoencoder
    
<table><tr>
    <td width=640>
        <img src="images/Aragon_Semantic_Autoencoder.png">
        <center>
            <br>
            Figure 1.3.e. Fitting of morphological structural parameters with an Autoencoder.<br>
            (From <a href="https://ui.adsabs.harvard.edu/abs/2020MNRAS.498.3713A/abstract">Aragon-Calvo et al. (2020)</a>)
        </center>
    </td>
</tr></table>    
    
**Serendipitous Detection**  
    
- [Lanusse et al. 2018, MNRAS, 473, 3895](https://ui.adsabs.harvard.edu/abs/2018MNRAS.473.3895L/abstract) $-$ spot gravitational lenses

    - **training**: 20,000 LSST-like observations
    - **architecture**: CNN + ResNet
    
<table><tr>
    <td width=480>
        <img src="images/DeepLens_Fig8.png">
        <center>
            <br>
            Figure 1.3.f. Some images correctly identified as hosting lenses.<br>
            (From <a href="https://ui.adsabs.harvard.edu/abs/2018MNRAS.473.3895L/abstract">Lanusse et al. (2018)</a>)
        </center>
    </td>
</tr></table>    

- [Dekany & Grebel 2020, ApJ, 898, 46](https://ui.adsabs.harvard.edu/abs/2020ApJ...898...46D/abstract) $-$ spot fundamental-mode RR Lyrae stars 

    - **training**: 10$^7$$-$10$^8$ near-IR photometric time-series
    - **architecture**: RNN
    
<table><tr>
    <td width=480>
        <img src="images/Dekani_Fig4.png">
        <center>
            <br>
            Figure 1.3.g. Spatial distribution of the objects used as training set.<br>
            (From <a href="https://ui.adsabs.harvard.edu/abs/2020ApJ...898...46D/abstract">Dekany & Grebel (2020)</a>)
        </center>
    </td>
</tr></table>    

    
**Image Reconstruction**
       
- [Schawinski et al. 2017, MNRAS, 467, L110](https://ui.adsabs.harvard.edu/abs/2017MNRAS.467L.110S/abstract) $-$ image denoising

    - **training**: 4550 nearby SDSS galaxies
    - **architecture**: GAN
    
<table><tr>
    <td width=800>
        <img src="images/Schawinski_Fig2.png">
        <center>
            <br>
            Figure 1.3.h. Degraded image details reconstructed by a GAN.<br>
            (From <a href="https://ui.adsabs.harvard.edu/abs/2017MNRAS.467L.110S/abstract">Schawinski et al. (2017)</a>)
        </center>
    </td>
</tr></table>    

**Cosmological Simulations**
    
- [Rodríguez et al. 2018, ComAC, 5, 4](https://ui.adsabs.harvard.edu/abs/2018ComAC...5....4R/abstract) $-$ create computationally _cheap_ cosmological simulations

    - **training**: 10 independent L-PICOLA simulation boxes
    - **architecture**: GAN
    
<table><tr>
    <td width=800>
        <img src="images/Rodriguez_Fig1.png">
        <center>
            <br>
            Figure 1.3.i. Comparison between the results of an N-body simulation (<i>left</i>) and of a GAN (<i>right</i>).<br>
            (Adapted from <a href="https://ui.adsabs.harvard.edu/abs/2018ComAC...5....4R/abstract">Rodríguez et al. (2018)</a>)
        </center>
    </td>
</tr></table>    

**Source Density Prediction**

- [Xu et al. 2023, ApJ preprint](https://ui.adsabs.harvard.edu/abs/2023arXiv230401670X/abstract) $-$ get a census of Giant Molecular Clouds (GMCs) from their $N_H$ column-density images
    
    - **training**: 7179 high-res MHD simulations of clouds
    - **architecture**: Diffusion Models
    
<table><tr>
    <td width=600>
        <img src="images/Xu_Fig2.png">
        <center>
            <br>
            Figure 1.3.j. The true GMC number density (<i>bottom-left</i>) is iteratively reconstructed, conditioned on the line-of-sight $N_H$ column density (<i>top-left</i>).<br>
            (From <a href="https://ui.adsabs.harvard.edu/abs/2023arXiv230401670X/abstract">Xu et al. (2023)</a>)
        </center>
    </td>
</tr></table>   

# Neural Networks (NN) Components

## Generic Architecture and Neurons
<a id='Generic_Architecture_and_Neurons'></a>

<table><tr>
    <td width=640>
        <img src="images/Generic_Architecture.png">
        <center>
            <br>
            Figure 11.  A simple, generic <i>feedforward</i> deep neural architecture.  The neurons of a layer might be connected to all the neurons of the neighboring layers, like in this example (<i>fully-connected</i> layers), or not.<br>
            (Adapted from <a href="https://ui.adsabs.harvard.edu">here</a>)
        </center>
    </td>
</tr></table>

<font size=3><u>**Nomenclature**</u></font>
    
**Neuron**: A simple element in a network, carrying 1 value.
    
**Layers**: A collection of neurons activated simultaneously.<br>
    
    Layers are represented differently depending on the architecture.
    E.g., fully-connected layers (as in the Figure above), appear as vertical stripes of neurons.
  
- **input layer**: the data
- **hidden layers**: the internal layers ("_hidden_" from the point of view of the NN user)
- **output layer**: the variable(s) of interest (e.g., class(es) or $y$)
    
    
    E.g., if we provide an image as input, each pixel is 1 neuron of the input layer.


Contemporary NNs can contain hundreds of layers, with millions to billions of trainable parameters.

## Weights and Biases

The core of the functioning of any NN is how the **information flows** through a neuron.

<table><tr>
    <td width=640>
        <img src="images/Weights_and_Biases.png">
        <center>
            <br>
            Figure 12.  How the information is propagated through a neuron. Don't get confused by $\hat{y}$: in this image, it only represents the neuron's output, not the target variable (e.g. the <i>class</i>).<br>
            (From <a href="https://ui.adsabs.harvard.edu">here</a>)
        </center>
    </td>
</tr></table>


1. **The first stage is <u>linear</u>:**<br>
    A neuron takes all the inputs (values) $x_i$ directed into it, multiplies each of them by a different _weight_ ($w_i$), and takes the sum.<br>
    Then, it adds a _bias_ ($b$).
<br>

2. **The second stage is (usually) <u>non-linear</u>:**<br>
    The summation is passed to an **activation function**.<br>
    The activation function acts as a filter, basically deciding when and how the information shall flow. 

The neuron output is therefore:

   $$ \hat{y} = f\left(\sum_i w_i x_i + b\right) $$
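
The two stages can be sketched in a few lines of NumPy. This is only an illustration: the input values, weights, bias, and the choice of `tanh` as activation function are arbitrary assumptions, not taken from the figure.

```python
import numpy as np

def neuron_output(x, w, b, activation=np.tanh):
    """Return f(sum_i w_i * x_i + b) for a single neuron."""
    z = np.dot(w, x) + b      # linear stage: weighted sum of the inputs plus bias
    return activation(z)      # non-linear stage: activation function

x = np.array([0.5, -1.0, 2.0])   # hypothetical inputs
w = np.array([0.1, 0.4, -0.2])   # hypothetical weights
b = 0.3                          # hypothetical bias

y_hat = neuron_output(x, w, b)
print(y_hat)
```

Here the weighted sum is $-0.45$, so the neuron outputs $\tanh(-0.45)$.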

<u>**Important**</u>

The _weights_ and _biases_ are <u>the</u> elements that are fit during the training of the model! 

Fitting a model == optimizing **all** the _weights_ and _biases_ within the NN, in order to **approximate** the desired output $y$ given a corresponding example $x$.

## Activation Functions

Activation functions are what make NNs so **efficient** as universal tools.

They introduce <u>non-linearities</u> $\rightarrow$ a NN can create an arbitrarily complex model.

They can be basically **any** _filter_-like function, but they should ideally possess some properties:

- **computationally inexpensive** $\leftarrow$ hence simple, since they get executed at each neuron

- **zero-centered** $\leftarrow$ not to shift values towards a preferential direction

- **differentiable** $\leftarrow$ because NNs work with [Backpropagation](#Error-Function,-Gradient-Descent-and-Backpropagation)

- **avoid vanishing when chained** $\leftarrow$ more correctly, we need to avoid vanishing gradients<br>
  (see [Error Function, Gradient Descent and Backpropagation](#Error-Function,-Gradient-Descent-and-Backpropagation))


<table><tr>
    <td width=480>
        <img src="images/Activation_Functions.png">
        <center>
            <br>
            Figure 13.  A collection of commonly used activation functions<br>
            (Adapted from <a href="https://wandb.ai/lavanyashukla/vega-plots/reports/Natural-Language-Processing--Vmlldzo2Nzk2Ng">here</a>)
        </center>
    </td>
</tr></table>
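
As a rough illustration, some widely used activation functions can be written directly in NumPy. The selection below (sigmoid, tanh, ReLU, leaky ReLU) and the leaky slope of 0.01 are assumptions for this sketch, and may not match exactly the set shown in the figure.

```python
import numpy as np

def sigmoid(z):
    """Squashes any value into (0, 1); not zero-centered."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Cheap to compute; passes positive values, blocks negative ones."""
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    """Like ReLU, but keeps a small gradient for negative values."""
    return np.where(z > 0, z, slope * z)

z = np.array([-2.0, 0.0, 2.0])
activations = {
    "sigmoid": sigmoid(z),
    "tanh": np.tanh(z),          # zero-centered, output in (-1, 1)
    "relu": relu(z),
    "leaky_relu": leaky_relu(z),
}
for name, values in activations.items():
    print(name, values)
```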


## NN Architecture Variants

NNs come in **countless architectures**, and even trying to classify them is a tough task ...

<table><tr>
    <td width=640>
        <img src="images/NN_Zoo.png">
        <center>
            <br>
            Figure 14.  A comprehensive scheme of NN architectures.<br>
            (Image credit: <a href="https://www.asimovinstitute.org/author/fjodorvanveen/">Fjodor van Veen</a>) 
        </center>
    </td>
</tr></table>

<table><tr>
    <td width=640>
        <img src="images/Morpheus.jpg">
    </td>
</tr></table>

# Training NNs

As mentioned above:

    training a NN = optimize its weights and biases
    
in order to produce the desired output (class or values) given a corresponding input.

- - -

The details of the training method depend on the DL learning problem:

- **supervised:** we have labelled examples 
- **unsupervised:** no labels are available 
- **reinforcement:** the examples are associated with a "reward" 

We will focus on the **supervised** case to illustrate the NN mechanics:<br>
$\rightarrow$ Let's assume we have some predicting variables $X$, and labels $y$.

- - -

How can we _tell_ the network in which way it shall modify weights (and biases)?<br>
Let's break down the NN rationale:

1. We initialize the weights to some arbitrary value
2. We take a sub-sample (batch) of the data $X$, $X_{batch}$ 
3. We propagate $X_{batch}$ through the network, obtaining a predicted $\hat{y}_{batch}$
4. We assess the **error** between $\hat{y}_{batch}$ and the true $y_{batch}$
5. We need to **backpropagate** the information about the difference
6. We need to **update** the weights in the right direction
7. Repeat from step 2 until all data are used

The critical steps are #4, #5 and #6. 
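
To make the seven steps concrete, here is a minimal sketch for the simplest possible "network": a single linear neuron, $\hat{y} = w x + b$. The toy data, the learning rate `eta`, the batch size, and the use of a squared error are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: y = 3*x + 1 plus a little noise
X = rng.uniform(-1, 1, size=200)
y = 3.0 * X + 1.0 + rng.normal(0, 0.05, size=200)

def mean_error(w, b):
    """Mean of the squared error 0.5 * (y_hat - y)**2 over all the data."""
    return np.mean(0.5 * (w * X + b - y) ** 2)

w, b = 0.0, 0.0          # step 1: arbitrary initial weight and bias
eta = 0.1                # size of each update (assumed value)
batch_size = 20
error_before = mean_error(w, b)

order = rng.permutation(len(X))
for start in range(0, len(X), batch_size):      # step 7: repeat until all data are used
    idx = order[start:start + batch_size]
    X_batch, y_batch = X[idx], y[idx]           # step 2: take a batch
    y_hat = w * X_batch + b                     # step 3: propagate through the "network"
    residual = y_hat - y_batch                  # step 4: assess the error
    grad_w = np.mean(residual * X_batch)        # step 5: how the error depends on w ...
    grad_b = np.mean(residual)                  #         ... and on b
    w -= eta * grad_w                           # step 6: update in the right direction
    b -= eta * grad_b

error_after = mean_error(w, b)
print(error_before, error_after)
```

After a single pass over the data the parameters are not yet at their best values, but the error has already decreased.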

## Error Function, Gradient Descent and Backpropagation

Let's consider an **error** (a.k.a. **loss**, or **cost**) function:

$$ E (\hat{y}, y) $$

which assesses the intensity of the error.  For _example_, we might adopt: $E (\hat{y}, y) = {1\over2}(\hat{y} - y)^2 $.

To be precise, $\hat{y}$ is itself a function of the input **x**, and of the NN parameters $\theta$ (all the $w$ and $b$ of each node):

$$ \hat{y} = f(\textbf{x}, \theta) $$

therefore:

$$ E = E (x, y, \theta) $$

- - -

The **gradient** of $E$ with respect to $\theta$, i.e. $\nabla E (x, y, \theta)$, is a vector pointing in the direction of steepest _increase_ of $E$, so its opposite, $-\nabla E$, _roughly_ points towards the minimum of $E$:

<table><tr>
    <td width=640>
        <img src="images/Gradient_Descent.png">
        <center>
            <br>
            Figure 15.  Iterative update of a weight via the Gradient Descent method.<br>
            (From <a href="https://ekamperi.github.io/machine%20learning/2019/07/28/gradient-descent.html">here</a>) 
        </center>
    </td>
</tr></table>

We can therefore **update** the weights by subtracting a vector proportional to $\nabla E$:

$$ \theta^\prime = \theta - \eta \cdot \nabla E (x, y, \theta) ~~~~~(1)$$

where the proportionality constant $\eta$ is called the **learning rate**, because it regulates how fast we proceed towards the minimum (possibly _overshooting_ it, if $\eta$ is too large).

This is the **Gradient Descent** method.
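
A minimal sketch of the update rule on a toy, one-parameter error function $E(\theta) = \theta^2$ (an assumption made for illustration, as are the two learning rates) shows both the convergence and the overshooting regime:

```python
def gradient_descent(grad, theta0, eta, n_steps):
    """Apply theta' = theta - eta * grad(theta) repeatedly and return the path."""
    path = [theta0]
    for _ in range(n_steps):
        path.append(path[-1] - eta * grad(path[-1]))
    return path

def grad_E(theta):
    """Gradient of the toy error E(theta) = theta**2 (minimum at theta = 0)."""
    return 2.0 * theta

small = gradient_descent(grad_E, theta0=1.0, eta=0.1, n_steps=20)
large = gradient_descent(grad_E, theta0=1.0, eta=1.1, n_steps=20)

print(abs(small[-1]))   # close to the minimum
print(abs(large[-1]))   # farther from the minimum than the starting point
```

With $\eta = 0.1$ each step shrinks $\theta$ by a factor 0.8, while with $\eta = 1.1$ each step jumps over the minimum and lands farther away (factor $-1.2$).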

- - - 

So we need to calculate $\nabla E~$:

$$ \nabla E (x, y, \theta) =
      \left( \frac{\partial E}{\partial \theta_1},
             \frac{\partial E}{\partial \theta_2},
             \ldots,
             \frac{\partial E}{\partial \theta_n}
\right)$$

But what are the elements $\frac{\partial E}{\partial \theta_i}$?

Let's start from the output neuron, and consider only the parameter $i = 1$, i.e. $\theta_1 = w_1$.<br>
We can _re_-write $\frac{\partial E}{\partial \theta_1}$ as:

$$\frac{\partial E}{\partial \theta_1} = 
  \frac{\partial E}{\partial w_1} = 
    \frac{\partial E}{\partial \hat{y}} \cdot
    \frac{\partial \hat{y}}{\partial w_1}
$$

where we applied the **chain rule** for derivatives.

We have the ingredients to calculate the two components because we saw above their functional form:

- $ \hat{y} = f\left(\sum_i w_i x_i + b\right) $

- $E (\hat{y}, y) = {1\over2}(\hat{y} - y)^2 $.

You can imagine **propagating** the chain rule back to the original input: all the steps are either _linear_ or passing through a _differentiable_ activation function!
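
As a sanity check, the chain rule for a single output neuron can be compared against a numerical derivative. The sketch below assumes a sigmoid activation and the squared error used as example above; the input, weights, bias, and target values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w, b):
    """Neuron output: y_hat = f(sum_i w_i * x_i + b), with f = sigmoid."""
    return sigmoid(np.dot(w, x) + b)

def error(y_hat, y):
    return 0.5 * (y_hat - y) ** 2

x = np.array([0.5, -1.0])   # hypothetical input
w = np.array([0.2, 0.7])    # hypothetical weights
b, y = 0.1, 1.0             # hypothetical bias and target

# Chain rule: dE/dw_1 = dE/dy_hat * dy_hat/dw_1
y_hat = forward(x, w, b)
dE_dyhat = y_hat - y                         # derivative of 0.5 * (y_hat - y)**2
dyhat_dw1 = y_hat * (1.0 - y_hat) * x[0]     # sigmoid'(z) times dz/dw_1 = x_1
grad_chain = dE_dyhat * dyhat_dw1

# Numerical check: central finite difference on w_1
eps = 1e-6
w_plus, w_minus = w.copy(), w.copy()
w_plus[0] += eps
w_minus[0] -= eps
grad_numeric = (error(forward(x, w_plus, b), y)
                - error(forward(x, w_minus, b), y)) / (2 * eps)

print(grad_chain, grad_numeric)
```

The two values agree to many decimal places, which is also a handy way to debug hand-written gradients.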


A visual explanation of the **Backpropagation Algorithm** is given in this

> [Google Developers webpage](https://developers-dot-devsite-v2-prod.appspot.com/machine-learning/crash-course/backprop-scroll).

<font size=3><u>**Stochastic Gradient Descent**</u></font>

The Gradient Descent (GD) presented above is the "**minibatch GD**", where the gradient is calculated as the average gradient over the **batch** of data, $X_{batch}$.

$$ \nabla E (x, y, \theta) = {1 \over N_{batch}}\sum_{i=1}^{N_{batch}} \nabla E (x_i, y_i, \theta) $$

_NOTE: Confusingly enough, the "**batch GD**" is the one using <u>all</u> data at once._
    
It is arguably the most used technique, and the minibatch helps reduce the large number of calculations involved in Backprop.

    
However in some cases, especially with large datasets, it might be convenient to use an **estimate** of the Gradient, e.g. with a **Stochastic Gradient Descent (SGD)**.
    
In SGD, we:
1. Propagate 1 example through the NN
2. Calculate its gradient and update the weights
3. Repeat for all the training examples

**PROs**:
- starts converging earlier because of more frequent updates $\rightarrow$ quick insights
- stochasticity can help avoid local minima
    
**CONs**:
- the Error function oscillates more    
- reaches a sub-optimal minimum compared to batch GD
- it is slower because it is sequential (cannot be parallelized)

    
Read more on [this post](https://machinelearningmastery.com/gentle-introduction-mini-batch-gradient-descent-configure-batch-size/).
    
_NOTE: To add confusion, it is common to refer to the minibatch GD as "SGD", since in practice it introduces stochasticity by picking minibatches.  In the remainder we will adopt the same nomenclature._

## Optimization Algorithms

A NN model can easily contain millions of parameters.

How can we efficiently **explore** the parameter space, to find the global minimum?

### Momentum

This method takes into account the **previous values** of the gradient.

It is particularly useful when exploring _monotonic_ parameter spaces, because it can skip faster through large areas of the landscape.
    
The parameter update is a variation of Equation 1:
    
$$ v^\prime = \alpha v  - \eta \cdot \nabla E (x, y, \theta) $$

$$ \theta^\prime = \theta + v^\prime $$

- $v$ represents the current "velocity" of the gradient.<br>
- The larger $\alpha\in$ [0, 1], the more previous steps are taken into account.
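
A minimal sketch of the momentum update, applied to a toy error function with a very flat valley ($E = 0.005\,\theta^2$; the function, the learning rate, and $\alpha = 0.9$ are assumptions chosen for illustration):

```python
def momentum_descent(grad, theta0, eta, alpha, n_steps):
    """Gradient descent with momentum:
    v' = alpha * v - eta * grad(theta);  theta' = theta + v'."""
    theta, v = theta0, 0.0
    for _ in range(n_steps):
        v = alpha * v - eta * grad(theta)   # the velocity remembers past gradients
        theta = theta + v                   # move along the updated velocity
    return theta

def grad_E(theta):
    """Gradient of the toy error E(theta) = 0.005 * theta**2 (a very flat valley)."""
    return 0.01 * theta

plain = momentum_descent(grad_E, theta0=10.0, eta=1.0, alpha=0.0, n_steps=100)
with_momentum = momentum_descent(grad_E, theta0=10.0, eta=1.0, alpha=0.9, n_steps=100)

print(plain, with_momentum)
```

With $\alpha = 0$ the rule reduces to plain Gradient Descent, which after 100 steps is still far from the minimum at $\theta = 0$; with $\alpha = 0.9$ the accumulated velocity brings $\theta$ much closer in the same number of steps.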

<table><tr>
    <td width=1000>
        <img src="images/Momentum.gif">
        <center>
            <br>
            Figure 16.  Representation of the influence of Momentum on the GD.<br>
            (From <a href="https://mlfromscratch.com/optimizers-explained/#/">here</a>) 
        </center>
    </td>
</tr></table>

### Adaptive learning rates

There are multiple alternatives to SGD, here we list a few famous ones.<br>

They differ in how they dynamically **adjust** the **Learning Rate** (**LR**; $\eta$), which is otherwise a tricky hyperparameter to calibrate.

<font size=3><u>**AdaGrad**</u></font>

AdaGrad **separately** scales the LR for each parameter (i.e. for each "direction" in the parameter space).<br>
$\rightarrow$ Each direction gets a "personalized" search.

- Scaling is performed inversely to the **sum of the squares** of _all_ the historical gradients.
- Directions with larger gradient see their LR decrease faster.
    

<font size=3><u>**RMSProp**</u></font>

Same concept as AdaGrad, but it scales with an **exponentially weighted moving average** of the previous squared gradients (decay of importance).<br> (_Instead of using their plain sum, as for AdaGrad_).
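
The difference between the two accumulators can be sketched as follows. The function names, the default values of `eta`, `rho`, and `eps`, and the constant toy gradient are assumptions for illustration, not the exact implementation of any library.

```python
import numpy as np

def adagrad_step(theta, grad, cache, eta=0.1, eps=1e-8):
    """AdaGrad: scale each direction by the sum of all its past squared gradients."""
    cache = cache + grad ** 2
    theta = theta - eta * grad / (np.sqrt(cache) + eps)
    return theta, cache

def rmsprop_step(theta, grad, cache, eta=0.1, rho=0.9, eps=1e-8):
    """RMSProp: scale by an exponentially weighted average of squared gradients."""
    cache = rho * cache + (1.0 - rho) * grad ** 2
    theta = theta - eta * grad / (np.sqrt(cache) + eps)
    return theta, cache

# Constant hypothetical gradient: steep along direction 0, shallow along direction 1
grad = np.array([10.0, 0.1])
theta_a, cache_a = np.zeros(2), np.zeros(2)
theta_r, cache_r = np.zeros(2), np.zeros(2)
steps_a, steps_r = [], []
for _ in range(50):
    new_a, cache_a = adagrad_step(theta_a, grad, cache_a)
    new_r, cache_r = rmsprop_step(theta_r, grad, cache_r)
    steps_a.append(np.abs(new_a - theta_a))   # size of the step actually taken
    steps_r.append(np.abs(new_r - theta_r))
    theta_a, theta_r = new_a, new_r

print(steps_a[0], steps_a[-1])   # AdaGrad: steps keep shrinking
print(steps_r[0], steps_r[-1])   # RMSProp: steps level off
```

Even though the two gradient components differ by a factor 100, both optimizers end up taking steps of similar size along the two directions. With AdaGrad the steps keep shrinking as the sum grows, while with RMSProp they level off, because old gradients are progressively forgotten.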


<font size=3><u>**Adam**</u></font>

Adam (i.e., Adaptive Moments) is sort of **RMSProp + momentum**.

But it uses both the **first** and the **second** moment of past gradients.

- - -

_NOTE: Don't confuse "moment" of a quantity with "momentum":_
    
$$ \mathrm{moment}_n = E~[X^n]$$

- _moment_1 = **average**_
- _moment_2 = **uncentered variance** (variance without centering around the mean)_
   
_In other words, RMSProp uses the second moment of the gradients, just with $E[]$ being an exponentially-weighted average._
    
_**Adam also** makes use of the first moment of the historical gradients (which plays the role of the momentum)._

- - -
    
**TL;DR:** Adam is arguably the **best-performing** optimizer on average and quite **robust** with respect to the choice of hyperparameters.      

$\rightarrow$ Safe choice!
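
For the curious, a single Adam update can be sketched as below. The default values of `eta`, `beta1`, `beta2`, and `eps` are the ones commonly quoted for Adam; the function name, the toy error function $E = \sum \theta_i^2$, and the larger `eta=0.01` used in the loop are assumptions for this illustration.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the step counter, starting from 1."""
    m = beta1 * m + (1.0 - beta1) * grad          # first moment (momentum-like)
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # second moment (RMSProp-like)
    m_hat = m / (1.0 - beta1 ** t)                # bias corrections, needed because
    v_hat = v / (1.0 - beta2 ** t)                # m and v start from zero
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy error E(theta) = sum(theta**2), whose gradient is 2 * theta
theta = np.array([1.0, -2.0])
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t, eta=0.01)

print(theta)   # both parameters end up close to the minimum at 0
```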
    
    
<font size=3><u>**Reading material**</u></font>
    
At this [page](https://mlfromscratch.com/optimizers-explained/#/) you will find graphical explanations of these optimizers as well as their mathematical formulations.

## Learning Curves

In DL, it is common to cycle the training over the same data $N$ times $-$ each pass is called an "**Epoch**".

Training rationale with **N_epochs** epochs, and **N_batches** batches:

    0. set epoch = 0
    1. Train on batch 1, update weights
       Continue training on batch 2, update weights
       Continue training on batch 3, update weights
       ...
       Continue training on batch N_batches, update weights
    2. epoch += 1
       (epoch completed)
    3. if epoch < N_epochs: continue from 1, else stop
    
- - -

<u>NOTE:</u>
_This introduces bias towards the training data, but it is usually more than compensated by a longer gradient descent._

- - -

We can display the values of the Loss as a function of  "**time**" $\rightarrow$ **Learning Curves** (**Validation Curves**)

<table><tr>
    <td width=480>
        <img src="images/Learning_Curves.png">
        <center>
            <br>
            Figure 16.  Learning curves for train and validation set.<br>
            (From <a href="https://www.kaggle.com/code/ryanholbrook/overfitting-and-underfitting">here</a>) 
        </center>
    </td>
</tr></table>
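
A learning curve is simply the loss recorded once per epoch. The sketch below does this for a one-neuron linear model on hypothetical toy data; the data, the train/validation split, the learning rate, and the number of epochs are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy data, split into a training and a validation set
X = rng.uniform(-1, 1, size=300)
y = 2.0 * X - 0.5 + rng.normal(0, 0.1, size=300)
X_train, y_train = X[:200], y[:200]
X_val, y_val = X[200:], y[200:]

def mean_loss(w, b, X, y):
    return np.mean(0.5 * (w * X + b - y) ** 2)

w, b, eta = 0.0, 0.0, 0.1
n_epochs, batch_size = 30, 20
train_curve, val_curve = [], []

for epoch in range(n_epochs):
    order = rng.permutation(len(X_train))              # reshuffle at every epoch
    for start in range(0, len(X_train), batch_size):   # loop over the batches
        idx = order[start:start + batch_size]
        residual = w * X_train[idx] + b - y_train[idx]
        w -= eta * np.mean(residual * X_train[idx])
        b -= eta * np.mean(residual)
    # One point per epoch on each curve
    train_curve.append(mean_loss(w, b, X_train, y_train))
    val_curve.append(mean_loss(w, b, X_val, y_val))

print(train_curve[0], train_curve[-1], val_curve[-1])
```

Plotting `train_curve` and `val_curve` against the epoch number (e.g. with matplotlib) gives curves like the ones in the figure. Note that the validation set is only used to evaluate the loss, never to update the weights.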

We can use them to spot **overfitting** (training goes much better than validation) or **underfitting** (the training loss is still trending downward, so the model could do better):


<table><tr>
    <td width=480>
        <img src="images/Learning_Curves_Good.png">
        <center>
            <br>
            Figure 17.A.  <b>A good fit</b>.<br>
            Training and validation converge <u>and</u> they flatten.<br>
            (From <a href="https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/">here</a>) 
        </center>
    </td>
    <td width=480>
        <img src="images/Learning_Curves_Underfitting.png">
        <center>
            <br>
            Figure 17.B.  <b>Underfitting</b>.<br>
            The training set shows a downward trend at the right edge: maybe a few more epochs could yield the best performance.<br>
            (From <a href="https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/">here</a>) 
        </center>
    </td>
    <td width=480>
        <img src="images/Learning_Curves_Overfitting.png">
        <center>
            <br>
            Figure 17.C.  <b>Overfitting</b>.<br>
            The validation curve cannot keep up with the training curve.<br>
            (From <a href="https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/">here</a>) 
        </center>
    </td>
</tr></table>

# What you will see in the Workshops

You will explore 3 domains of DL networks:

- **Supervised Learning - Regression** $~~~~\rightarrow$ **Fully-connected Layers**

- **Supervised Learning - Classification** $~\rightarrow$ **Convolutional Neural Networks (CNNs)**

- **Unsupervised Learning - Generation**  $~\rightarrow$ **Diffusion Models**

## Supervised Learning: CNNs

**CNNs** excel at classifying images thanks to their **Convolutional Layers**:

<table><tr>
    <td width=1000>
        <img src="images/CNN_Architecture.png">
        <center>
            <br>
            Figure 18.  A prototypical CNN architecture.<br>
        </center>
    </td>
</tr></table>

- <font color='darkred'>**Convolution**</font> $~\rightarrow$ Filters that scan the image to detect different features
- <font color='darkgreen'>**Pooling**</font> $~~~~~~~~\rightarrow$ Reduce dimensionality to increase abstraction
- <font color='darkblue'>**Flattening**</font> $~~~~\rightarrow$ Encodes features into variables
- <font color='purple'>**Dense Layers**</font> $\rightarrow$ Feature classifier

### Transfer Learning

CNNs require lots of data $\rightarrow$  **Transfer Learning** helps to address this issue:

    1. Pre-train the convolutional/pooling layers on any large image dataset
    2. Freeze those parameters
    3. Fit the classifier part on your astro images

<table><tr>
    <td width=800>
        <img src="images/CNN_Transfer.png">
        <center>
            <br>
            Figure 19.  Don't have enough data?  Just add cats and dogs, SMH ...<br>
        </center>
    </td>
</tr></table>

> The pre-training will teach the CNN to **recognize features** in images (shapes, edges, etc.).

See [Ackermann et al. 2018, MNRAS, 479, 415](https://ui.adsabs.harvard.edu/abs/2018MNRAS.479..415A/abstract) $-$ identify mergers


## Unsupervised Learning: Autoencoders

**Autoencoders** allow us to _consistently_ **encode** data into a lower-dimensional space $\rightarrow$ they can **summarize** data properties.

They are "consistent" because they learn to **match** an _input_ and its _reconstruction_ from the embedding space.

<table><tr>
    <td width=640>
        <img src="images/Autoencoder.png">
        <center>
            <br>
            Figure 20.  A prototypical Autoencoder architecture.
            <br>
            (From <a href="https://www.assemblyai.com/blog/introduction-to-variational-autoencoders-using-keras/">here</a>)
        </center>
    </td>
</tr></table>
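
The idea can be sketched with the simplest possible case: a _linear_ autoencoder, without activation functions, which compresses 2-D points into a 1-D embedding and back. The toy data, the weight names `W_enc` and `W_dec`, the learning rate, and the number of iterations are assumptions for illustration; real autoencoders use non-linear layers and a DL library.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: 2-D points lying close to a straight line
t = rng.normal(0, 1, size=(500, 1))
X = t @ np.array([[1.0, 2.0]]) + rng.normal(0, 0.05, size=(500, 2))

# Encoder (2 -> 1) and decoder (1 -> 2) weights, initialized to small random values
W_enc = rng.normal(0, 0.1, size=(2, 1))
W_dec = rng.normal(0, 0.1, size=(1, 2))
eta = 0.01

def reconstruction_loss(X, W_enc, W_dec):
    """Mean squared distance between the inputs and their reconstructions."""
    return np.mean(np.sum((X @ W_enc @ W_dec - X) ** 2, axis=1))

loss_start = reconstruction_loss(X, W_enc, W_dec)
for _ in range(2000):
    code = X @ W_enc                       # encode into the 1-D embedding
    residual = code @ W_dec - X            # decode and compare with the input
    grad_dec = 2.0 * code.T @ residual / len(X)
    grad_enc = 2.0 * X.T @ (residual @ W_dec.T) / len(X)
    W_dec -= eta * grad_dec                # plain Gradient Descent updates
    W_enc -= eta * grad_enc
loss_end = reconstruction_loss(X, W_enc, W_dec)

print(loss_start, loss_end)
```

No labels are involved: the "target" of the training is the input itself. After training, `X @ W_enc` is a 1-number summary of each 2-D point from which the point can be approximately recovered.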

# Libraries

<table><tr>
    <td width=640>
        <img src="images/Dinosaurs.png">
    </td>
</tr></table>

There are several libraries that implement **Deep Learning routines** in Python.

For the Workshop we will focus on [**Tensorflow**](https://www.tensorflow.org/) and [**Keras**](https://keras.io/).

## TensorFlow

<table><tr>
    <td width=640>
        <img src="images/Tensorflow_Logo.png">
    </td>
</tr></table>


**Tensorflow** does the _heavy lifting_ for us:
- pre-defined **functions** and **layers**
- automatically computes the **derivatives** (for Backprop!)

## Keras

**Keras** is an **API (Application Programming Interface)** $-$ it supports multiple libraries, e.g.:
- Tensorflow (Google)
- CNTK (Microsoft)
- MXNet (Apache)
- Theano

It provides _high-level_ functionalities built on e.g. Tensorflow.

<table><tr>
    <td width=640>
        <img src="images/Keras.png">
    </td>
</tr></table>

You can see it imported in different ways, because it is **maintained separately** by both Tensorflow and Keras (functionalities _might_ differ slightly):

In [None]:
# Install: Anaconda -> Tensorflow -> Keras

import keras.models
# and
from keras import backend as K
# use the code from the standalone Keras repository

# ... OR

import tensorflow.keras
# use the copy bundled with TensorFlow (recommended: better maintained)

It can be coded in 2 "_styles_":

<u>Sequential API</u>

  >**PROs:** Simpler (stack of layers)<br>
  >**CONs:** Single-input, single-output

<u>Functional API</u>

  > **PROs:** More flexible, multi-input, multi-output<br>
  > **CONs:** Steep learning curve, need to understand tensor programming

### Keras Sequential API

In [None]:
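from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential()
model.add(keras.Input(shape=(10,)))                 # 10 input features (assumed)
model.add(layers.Dense(20, activation='relu'))      # hidden layer
model.add(layers.Dense(10, activation='softmax'))   # output layer: 10 class probabilities
# model.compile(optimizer='adam', loss='categorical_crossentropy')
# model.fit(x_train, y_train, epochs=5, batch_size=32)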

### Keras Functional API

In [None]:
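from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(10,))                     # 10 input features
x = layers.Dense(20, activation='relu')(inputs)       # each layer is called on the previous tensor
x = layers.Dense(20, activation='relu')(x)
outputs = layers.Dense(10, activation='softmax')(x)   # output layer: 10 class probabilities
model = keras.Model(inputs, outputs)
# model.compile(optimizer='adam', loss='categorical_crossentropy')
# model.fit(x_train, y_train, epochs=5, batch_size=32)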