# Explanation

Most CNN architectures were originally designed for tasks like image classification, where identifying high-level features is usually sufficient to accomplish the task effectively. These tasks usually involve taking in images as inputs and producing much lower-dimensional outputs like image labels.

However, there is a set of important tasks that require much more localization, where the network also has to understand low-level features in the images. One such case is medical image segmentation, which requires a network to draw boundaries around entities in medical images. This task requires the network to understand pixel-level details in order to distinguish between different parts of an image.

In addition to the novelty of this task, the datasets for medical image segmentation are also quite small, making it even more of a challenge.

Despite this, the **U-Net** architecture was able to achieve a new record performance on the ISBI image segmentation challenge.

The U-Net builds on the architecture of CNNs and adds inductive biases that are particularly effective for tasks involving **image-to-image** conversion.

Medical segmentation happens to be one such task, since the output has to be an image matching the input image, but with segmentation marks on it.

However, U-Net has also become extremely effective for many other useful image-to-image tasks. Notably, diffusion models, which are now the state-of-the-art models for image generation, use U-Nets to convert noisy images into eventual synthesized images.

U-Nets are also particularly useful for high resolution images, where their ability to understand complex details in images is especially valuable.

Models like SDXL and DALL-E 2 have U-Nets at the core of their architecture, which is part of why I decided to highlight the U-Net.

### Intuition

The U-Net architecture is an encoder-decoder model. Like most encoder-decoder architectures (more are covered later in this repository in the sequence modeling and image generation sections), the encoder is responsible for compressing the input into compact, high-level representations, and the decoder is responsible for taking these representations and adding detail back to them.

It has two core pathways that account for its functionality.

The first pathway involves a series of convolutional blocks that "down-sample" and "up-sample" the features.

Each down-sampling block consists of two 3x3 convolutions, each followed by a ReLU activation function, and then a 2x2 max-pool with stride 2 that halves the height and width of the feature map. The number of channels doubles at each down-sampling step. The U-Net usually has a number of these down-sampling blocks in sequence.
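To make the bookkeeping concrete, here is a minimal sketch that traces feature-map shapes through the contracting path. It assumes the configuration from the original paper (a 572x572 input, 64 channels after the first block, four down-sampling blocks, and two unpadded 3x3 convolutions per block). The function name and its parameters are made up for illustration.

```python
def contracting_path_shapes(size=572, channels=64, depth=4):
    """Trace (channels, height/width) through an unpadded U-Net encoder.

    Assumes square inputs and two unpadded 3x3 convolutions per block.
    """
    shapes = []
    for _ in range(depth):
        size -= 4                        # each unpadded 3x3 conv trims 2 pixels
        shapes.append((channels, size))  # features kept for the connecting path
        size //= 2                       # 2x2 max-pool with stride 2
        channels *= 2                    # channels double in the next block
    size -= 4                            # bottleneck: two more 3x3 convs
    shapes.append((channels, size))
    return shapes


for channels, size in contracting_path_shapes():
    print(f"{channels:>4} channels at {size}x{size}")
```

The last entry is the bottleneck: the most channels at the lowest spatial resolution. With padded convolutions (common in later implementations) the `size -= 4` steps would disappear and only the pooling would change the resolution.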

The up-sampling blocks at the end of the architecture mirror this process in reverse. Each one starts with a 2x2 up-convolution that maps the latents to fewer channels but higher spatial resolution, followed by 3x3 convolutions with a ReLU after each. The last layer before the output is a 1x1 convolution that maps each pixel's feature vector to the output classes, so the model produces a per-pixel output map aligned with the input. In the original paper this map is somewhat smaller than the input because the convolutions are unpadded; padded variants output the same size as the input.

Between the down-sampling and up-sampling pathways is the "bottleneck", where representations have the most channels and the lowest spatial resolution.
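The mirrored arithmetic for the expanding path can be sketched the same way. This assumes the original paper's setup (a 1024-channel, 28x28 bottleneck, 2x2 up-convolutions that halve the channels, and two unpadded 3x3 convolutions per block). The function name is hypothetical.

```python
def expanding_path_shapes(size=28, channels=1024, depth=4):
    """Trace (channels, height/width) through an unpadded U-Net decoder.

    Assumes square feature maps and two unpadded 3x3 convolutions per block.
    """
    shapes = []
    for _ in range(depth):
        size *= 2        # 2x2 up-convolution doubles the resolution
        channels //= 2   # and halves the number of channels
        # Concatenating encoder features doubles the channels again for a
        # moment; the first 3x3 conv brings them back down to `channels`.
        size -= 4        # two unpadded 3x3 convs trim 2 pixels each
        shapes.append((channels, size))
    return shapes


for channels, size in expanding_path_shapes():
    print(f"{channels:>4} channels at {size}x{size}")
```

Under these assumptions the final feature map is 64 channels at 388x388, which the 1x1 convolution then maps to one channel per output class. That is why the paper's output is smaller than its 572x572 input.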

In order to construct images effectively (in the case of the diffusion models covered in the last section, to predict the noise in images correctly), the model needs to learn to keep the most useful high-level representations at the bottleneck. Each successive down-sampling block stores representations of increasing feature size.

Meanwhile, the decoder blocks add in features of decreasing feature size, effectively restoring finer details closer to the output.

The last important piece of the U-Net architecture is the "connecting paths" (skip connections), which connect each down-sampling block directly to the up-sampling block at the same level. The encoder features are concatenated with the up-sampled decoder features, giving each decoder layer direct access to the localized detail that pooling would otherwise discard, so it can refine features of varying complexity effectively.
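A connecting path can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the paper's code: it assumes `(channels, height, width)` arrays and the unpadded setup from the paper, where the encoder feature map is slightly larger than the decoder one and has to be center-cropped before concatenation. The helper names `center_crop` and `skip_connect` are made up for this example.

```python
import numpy as np


def center_crop(features, target_hw):
    """Crop a (C, H, W) array around its center to (C, target_h, target_w)."""
    _, h, w = features.shape
    target_h, target_w = target_hw
    top = (h - target_h) // 2
    left = (w - target_w) // 2
    return features[:, top:top + target_h, left:left + target_w]


def skip_connect(encoder_features, decoder_features):
    """Crop encoder features to the decoder's size and stack along channels."""
    cropped = center_crop(encoder_features, decoder_features.shape[1:])
    return np.concatenate([cropped, decoder_features], axis=0)


# Shapes from the deepest connecting path in the original architecture.
encoder = np.zeros((512, 64, 64), dtype=np.float32)  # saved before pooling
decoder = np.ones((512, 56, 56), dtype=np.float32)   # after the up-convolution
merged = skip_connect(encoder, decoder)
print(merged.shape)
```

With padded convolutions the two feature maps already match in size, so the crop becomes a no-op and only the channel-wise concatenation remains.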

# My Notes

## 📜 [U-Net: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/pdf/1505.04597)

> There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently.
>

This paper shows that it’s possible to train effective deep neural networks with a small amount of data by combining heavy data augmentation with a suitable architecture. The U-Net architecture introduces a powerful inductive bias that lets the network learn effectively with far less data.

> The typical use of convolutional networks is on classification tasks, where the output to an image is a single class label. However, in many visual tasks, especially in biomedical image processing, the desired output should include localization.
>

Most CNNs are built for image classification, where they just need to predict a relatively small number of output labels given the inputs to the network. Meanwhile, in medical image segmentation (the task that prompted U-Net), networks that can localize features are critical.

> The architecture consists of a contracting path to capture
context and a symmetric expanding path that enables precise localization.
>

The U-Net architecture has a “down-sampling” pathway that captures increasingly general features, and then an “up-sampling” pathway that lets the network focus on specific details.

> We show that such a network can be trained end-to-end from very
few images and outperforms the prior best method.
>

> The main idea in [...] is to supplement a usual contracting network by
successive layers, where pooling operators are replaced by upsampling operators.
>

Down-sampling consists of convolutions with max pooling to decrease the dimensions of the data at each layer. Then, the up-sampling involves up-convolutions and standard convolutions.
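The two resampling operations can be illustrated on a single-channel array. This is a simplified sketch: a real up-convolution has learned weights and mixes channels, while here the 2x2 kernel is a fixed, hypothetical one. For one channel, a stride-2 transposed convolution with a 2x2 kernel expands each input pixel into a 2x2 patch scaled by the kernel, which is exactly what `np.kron` computes.

```python
import numpy as np


def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on an (H, W) array with even H and W."""
    h, w = x.shape
    # Split into 2x2 blocks, then take the maximum inside each block.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))


def up_conv_2x2(x, kernel):
    """Single-channel transposed convolution with a 2x2 kernel and stride 2.

    Each input pixel becomes a 2x2 output patch scaled by `kernel`.
    """
    return np.kron(x, kernel)


x = np.arange(16, dtype=np.float32).reshape(4, 4)
pooled = max_pool_2x2(x)                      # 4x4 -> 2x2
kernel = np.ones((2, 2), dtype=np.float32)    # hypothetical; learned in practice
restored = up_conv_2x2(pooled, kernel)        # 2x2 -> 4x4
print(pooled)
print(restored.shape)
```

The restored array has the original resolution but not the original detail, which is the information the connecting paths supply.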

> In order to localize, high resolution features from the contracting path are combined with the upsampled output.
>

In the up-sampling path, the U-Net has connecting paths back to the corresponding encoder layers, allowing the decoder to draw on features of a different resolution at each layer.

> One important modification in our architecture is that in the upsampling part we have also a large number of feature channels, which allow the network to propagate context information to higher resolution layers.
>

The encoder expands the channels as it down-samples to store more complex representations, and then reverses this on the up-sample.

> As for our tasks there is very little training data available, we use excessive data augmentation by applying elastic deformations to the available training images.
>

### Network Architecture

![Screenshot 2024-05-22 at 2.48.43 AM.png](../../images/Screenshot_2024-05-22_at_2.48.43_AM.png)

### Training

**3.1 Data Augmentation**

> We generate smooth deformations using random displacement vectors on a coarse 3 by 3 grid.
>

They use data augmentation to expand the effective size of the dataset, letting them train the model well with very few annotated training samples.
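A rough NumPy sketch of this kind of elastic deformation is below. It follows the paper's description of Gaussian displacements (10 pixels standard deviation) on a coarse 3x3 grid, but makes two simplifying assumptions that differ from the paper: the coarse grid is upsampled with bilinear rather than bicubic interpolation, and pixels are resampled with nearest-neighbor lookup. All function names are made up for illustration.

```python
import numpy as np


def bilinear_resize(grid, out_h, out_w):
    """Bilinearly interpolate a small (gh, gw) grid up to (out_h, out_w)."""
    gh, gw = grid.shape
    ys = np.linspace(0, gh - 1, out_h)
    xs = np.linspace(0, gw - 1, out_w)
    y0 = np.clip(np.floor(ys).astype(int), 0, gh - 2)
    x0 = np.clip(np.floor(xs).astype(int), 0, gw - 2)
    wy = (ys - y0)[:, None]  # vertical interpolation weights
    wx = (xs - x0)[None, :]  # horizontal interpolation weights
    top = grid[y0][:, x0] * (1 - wx) + grid[y0][:, x0 + 1] * wx
    bottom = grid[y0 + 1][:, x0] * (1 - wx) + grid[y0 + 1][:, x0 + 1] * wx
    return top * (1 - wy) + bottom * wy


def random_displacement_field(h, w, sigma=10.0, grid_size=3, rng=None):
    """Smooth per-pixel (dy, dx) offsets from a coarse random grid."""
    rng = np.random.default_rng() if rng is None else rng
    dy = bilinear_resize(rng.normal(0.0, sigma, (grid_size, grid_size)), h, w)
    dx = bilinear_resize(rng.normal(0.0, sigma, (grid_size, grid_size)), h, w)
    return dy, dx


def warp(image, dy, dx):
    """Resample a 2D array at displaced positions (nearest neighbor)."""
    h, w = image.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_r = np.clip(np.rint(rows + dy).astype(int), 0, h - 1)
    src_c = np.clip(np.rint(cols + dx).astype(int), 0, w - 1)
    return image[src_r, src_c]


rng = np.random.default_rng(0)
image = rng.random((64, 64))
mask = (image > 0.5).astype(np.uint8)  # stand-in segmentation labels
dy, dx = random_displacement_field(64, 64, rng=rng)
warped_image = warp(image, dy, dx)
warped_mask = warp(mask, dy, dx)       # same field keeps labels aligned
```

The important design point is that the image and its segmentation mask are warped with the same displacement field, so every augmented sample remains a valid training pair.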

### Experiments

![Screenshot 2024-05-22 at 3.02.05 AM.png](../../images/Screenshot_2024-05-22_at_3.02.05_AM.png)

### Conclusion

> Thanks to data augmentation with elastic deformations, it only needs very few annotated images and has a very reasonable training time of only 10 hours on a NVIDIA Titan GPU
>