# Processing images with Neural Networks
After introducing Neural Networks in their simplest form, the Multi-Layer Perceptron (MLP), we are now ready to return to the main topic of this course: **image reconstruction**. In the following, we briefly recall the problem setup, mainly to fix notation, and discuss how the basic MLP architecture can be modified to address the task of image reconstruction.
This discussion will include **Convolutional Neural Networks (CNN)**, a modification of the already-described MLP that processes the input image more flexibly and generalizes better to new data by imitating the behavior of the human eye. To this aim, we will introduce the concept of **Receptive Field (RF)**, the sub-portion of the input image that provides the context in image reconstruction. To enlarge the RF, we will then discuss **U-Net**, arguably the most widely used neural network architecture for image-related tasks, highlighting its main advantages and limitations.

Finally, we will introduce **Vision Transformers (ViT)**: a more recent architecture that performs particularly well on computer vision tasks and aims to replicate the success that Transformers have already had in language processing.


## A Brief Recall on Image Reconstruction

In the first module of this course, you already discussed the problem of image reconstruction from measurements acquired via **linear operators**. In particular, this was done by considering the following acquisition system:

$$
y^\delta = Kx_{true} + e,
$$

where $K \in \mathbb{R}^{m \times n}$ represents the **acquisition operator**, $x_{true} \in \mathbb{R}^n$ is the **true datum** we want to reconstruct (here represented in *vectorized* form), $e \in \mathbb{R}^m$ is the **measurement noise**, which satisfies $|| e ||_2 \leq \delta$, and $y^\delta \in \mathbb{R}^m$ is the **acquired datum**.
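As a minimal sketch of the acquisition model above, the snippet below simulates $y^\delta = Kx_{true} + e$ on a toy problem. The sizes, the random dense matrix standing in for $K$, and the value of $\delta$ are assumptions chosen purely for illustration; the noise is rescaled so that $|| e ||_2 = \delta$, which satisfies the bound $|| e ||_2 \leq \delta$.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 16, 12                       # toy sizes (assumed for illustration)
K = rng.standard_normal((m, n))     # hypothetical dense acquisition operator
x_true = rng.random(n)              # vectorized "true" datum

delta = 0.1                         # noise level (assumed)
e = rng.standard_normal(m)
e = delta * e / np.linalg.norm(e)   # rescale so that ||e||_2 = delta

y_delta = K @ x_true + e            # acquired datum
```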

The task of image reconstruction is to compute an approximation of $x_{true}$, denoted by $x^*$ in the following, starting from $y^\delta$ and, possibly, some information about the noise (e.g., $e$ could be Gaussian noise with zero mean and a given standard deviation $\sigma>0$).

For simplicity, in the first part of this course, $x_{true}$ was usually represented as a matrix of shape $n_x \times n_y$, where $n_x$ and $n_y$ denote the number of pixels per row and column, respectively. Clearly, this implies that $(n_x, n_y)$ satisfies $n_x \cdot n_y = n$. Similarly, the acquired measurement data $y^\delta$ is also treated as an image, with shape $m_x \times m_y$ such that $m_x \cdot m_y = m$.
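The relation between the image form and the vectorized form can be sketched as follows. The row-major ordering used here is NumPy's default and is an assumption; the course may adopt a different flattening convention.

```python
import numpy as np

n_x, n_y = 3, 4                                   # toy image shape (assumed)
x_img = np.arange(n_x * n_y, dtype=float).reshape(n_x, n_y)

x_vec = x_img.flatten()                           # vector of length n = n_x * n_y
x_back = x_vec.reshape(n_x, n_y)                  # back to the image form
```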

```{note}
Because of the size of the image $x_{true}$, the matrix $K$ typically **cannot** be stored explicitly in memory. For this reason, we usually work with an **operator** that *simulates* the application of $K$ to the input $x_{true}$, without ever forming the matrix.
```
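To make the note concrete: for a $256 \times 256$ image, a dense square $K$ would have $65536^2$ entries, i.e. 32 GiB in double precision. The sketch below shows one possible matrix-free operator, assuming $K$ is a blur with periodic boundary conditions so that it can be applied through the FFT. The helper `make_blur`, the $5 \times 5$ uniform kernel and the boundary conditions are all illustrative assumptions, not the operator used in the course.

```python
import numpy as np

n_x, n_y = 256, 256
n = n_x * n_y
dense_bytes = 8 * n * n   # float64 storage of a dense n x n matrix
print(f"A dense K would need {dense_bytes / 2**30:.0f} GiB")

def make_blur(kernel, shape):
    """Return functions applying K and K^T for a periodic blur, without building K."""
    k_pad = np.zeros(shape)
    kx, ky = kernel.shape
    k_pad[:kx, :ky] = kernel
    # Centre the kernel at pixel (0, 0) so that the blur does not shift the image.
    k_pad = np.roll(k_pad, (-(kx // 2), -(ky // 2)), axis=(0, 1))
    k_hat = np.fft.fft2(k_pad)

    def K(x):
        # Circular convolution computed in the Fourier domain.
        return np.real(np.fft.ifft2(k_hat * np.fft.fft2(x)))

    def K_T(y):
        # The adjoint corresponds to the complex-conjugate filter.
        return np.real(np.fft.ifft2(np.conj(k_hat) * np.fft.fft2(y)))

    return K, K_T

kernel = np.ones((5, 5)) / 25.0          # hypothetical uniform blur kernel
K, K_T = make_blur(kernel, (n_x, n_y))

rng = np.random.default_rng(0)
x = rng.random((n_x, n_y))               # stand-in for x_true in image form
y = K(x)                                 # K applied without storing any matrix
```

Only the padded kernel and its Fourier transform are stored, so the memory cost is of the order of one image rather than $n^2$ entries.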

We also recall that most classical methods for solving the image reconstruction problem defined above rely on the **regularized least squares optimization problem**:

$$
\min_{x \in \mathcal{X}} \frac{1}{2} || K x - y^\delta ||_2^2 + \lambda R(x),
$$

where $\mathcal{X}$ denotes the image domain (typically, $\mathcal{X} = \{ x \geq 0 \}$), $R(x)$ is the regularizer, which incorporates prior information about the solution, and $\lambda > 0$ is the *regularization parameter*.

This optimization problem is then solved using an *optimizer*, which depends on the mathematical properties of $R(x)$.
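As an illustration of one such optimizer, the sketch below solves the problem above by projected gradient descent, under the assumptions that $R(x) = \frac{1}{2} || x ||_2^2$ (Tikhonov regularization) and $\mathcal{X} = \{ x \geq 0 \}$. The small dense $K$, the noise level and the value of $\lambda$ are illustrative choices; other regularizers generally require different algorithms.

```python
import numpy as np

def objective(K, x, y_delta, lam):
    """0.5 * ||K x - y||_2^2 + lam * R(x), with R(x) = 0.5 * ||x||_2^2 (assumed)."""
    return 0.5 * np.sum((K @ x - y_delta) ** 2) + 0.5 * lam * np.sum(x ** 2)

def projected_gradient(K, y_delta, lam, n_iter=500):
    """Projected gradient descent on X = {x >= 0} with constant step 1/L."""
    L = np.linalg.norm(K, 2) ** 2 + lam       # Lipschitz constant of the gradient
    x = np.zeros(K.shape[1])
    for _ in range(n_iter):
        grad = K.T @ (K @ x - y_delta) + lam * x
        x = np.maximum(x - grad / L, 0.0)     # gradient step, then projection onto X
    return x

rng = np.random.default_rng(0)
K = rng.standard_normal((30, 20))             # toy dense operator (assumed)
x_true = rng.random(20)                       # non-negative ground truth
y_delta = K @ x_true + 0.01 * rng.standard_normal(30)

lam = 1e-2                                    # regularization parameter (assumed)
x_star = projected_gradient(K, y_delta, lam)
```

The projection onto $\{ x \geq 0 \}$ reduces to clipping negative entries to zero, which is why the constraint adds almost no cost per iteration.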

## Neural Networks for Image Processing
When working with neural networks, we need to take a slightly different approach. Indeed, as already remarked in the previous sections, a neural network pipeline is composed of two main components:
* A **model architecture**, described by the type of layers, the number of layers ($L$) and the activation functions ($\rho$), and represented as $f_\Theta$ for simplicity.
* A **training set** $D$ containing $N$ pairs of input-output data, which is used to train the model, optimizing its parameters to achieve the task described by $D$.

While we leave the description of the neural network model to the next few sections, we focus here on how the dataset is usually built for image processing tasks.
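One common way to obtain such a dataset, assumed here for illustration, is to start from a collection of ground-truth images and simulate the corresponding measurements through the acquisition model $y^\delta = Kx_{true} + e$. In the sketch below, the random vectors standing in for images, the dense matrix $K$, and the Gaussian noise level $\sigma$ are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

N, n, m = 100, 16, 12                     # dataset size and toy dimensions (assumed)
sigma = 0.05                              # noise standard deviation (assumed)
K = rng.standard_normal((m, n))           # hypothetical acquisition operator

X = rng.random((N, n))                    # stand-ins for vectorized ground-truth images
E = sigma * rng.standard_normal((N, m))   # one Gaussian noise realization per sample
Y = X @ K.T + E                           # row i is y_i = K x_i + e_i

# Training set D as a list of (input, target) pairs: (corrupted datum, true datum).
D = list(zip(Y, X))
```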

### End-to-end vs Hybrid
The first distinction to clarify when working with neural networks for image reconstruction is between the **end-to-end** approach and the **hybrid** approach.

* An end-to-end neural network is a model $f_\Theta$ that is trained to take as input the corrupted datum $y^\delta$ and **directly** compute the reconstruction $x^* = f_\Theta(y^\delta)$, in a single forward application of the model. This causes ...

```{image} /imgs/end-to-end.pdf
:width: 600px
:align: center
```

* A hybrid algorithm for image reconstruction, instead, ...

```{image} /imgs/hybrid-approach.pdf
:width: 600px
:align: center
```