# 1 - Semantic Segmentation

The goal of semantic image segmentation is to label each pixel of an image with a corresponding class of what is being represented. Because we're predicting for every pixel in the image, this task is commonly referred to as dense prediction.

Note that unlike the previous tasks, the expected output in semantic segmentation is not just labels and bounding box parameters. The output itself is a high-resolution image (typically the same size as the input image) in which each pixel is assigned to a particular class. It is thus image classification at the pixel level.
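As a minimal sketch of what "one class per pixel" means, assume a model has already produced a score for every class at every pixel. The `scores` values and the `segmentation_map` helper below are made up for illustration; the segmentation map is simply the per-pixel argmax over classes.

```python
# Hypothetical per-pixel class scores for a 2x3 image and 3 classes,
# indexed as scores[row][col][class]. The numbers are invented.
scores = [
    [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1], [0.2, 0.2, 0.6]],
    [[0.3, 0.4, 0.3], [0.05, 0.05, 0.9], [0.5, 0.3, 0.2]],
]


def segmentation_map(scores):
    """Return one class index per pixel by taking the argmax over classes."""
    return [
        [max(range(len(pixel)), key=lambda c: pixel[c]) for pixel in row]
        for row in scores
    ]


label_map = segmentation_map(scores)
print(label_map)  # [[1, 0, 2], [1, 2, 0]]
```

The output has the same height and width as the input scores, with the class axis collapsed to a single label.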

<br>

<div style="text-align:center">
    <img src="media/instance_seg.png" width=800>
</div>

To generate this pixel-level output image, semantic segmentation models often have a U-shape architecture:
- First sample down to capture context and high-level features
- Then sample back up to the original input resolution using upsampling and ***transpose convolutions*** in order to output a dense prediction with the same dimensions.

<br>

<div style="text-align:center">
    <img src="media/segmen_example.png" width=800>
    <caption><center><font color="purple">Down-sampling (first half) and then Up-sampling (latter half) using Transpose Convolution for Semantic Segmentation</font></center></caption>
</div>
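The down-then-up pattern can be illustrated by tracing feature-map shapes. This is only a sketch under assumed numbers: the helper `unet_like_shapes`, the factor of 2 per stage, the channel doubling, and the starting 64 channels are all hypothetical choices, not values taken from the figure.

```python
def unet_like_shapes(height, width, channels=64, depth=3):
    """Trace (height, width, channels) through a hypothetical U-shaped network."""
    shapes = [(height, width, channels)]
    # Downsampling half: shrink height/width, grow channels.
    for _ in range(depth):
        height, width, channels = height // 2, width // 2, channels * 2
        shapes.append((height, width, channels))
    # Upsampling half: grow height/width back, shrink channels.
    for _ in range(depth):
        height, width, channels = height * 2, width * 2, channels // 2
        shapes.append((height, width, channels))
    return shapes


shapes = unet_like_shapes(128, 128)
for shape in shapes:
    print(shape)
```

With these assumed settings the last shape matches the first, which is what allows a prediction for every input pixel.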

## 1.1 - Transpose Convolution

A transpose convolution (sometimes loosely called a deconvolution) upsamples a smaller input to produce a larger output, in contrast to a normal convolution, which typically shrinks the input.

<br>

<div style="text-align:center">
    <img src="media/normal_vs_tranpose.png" width=700>
    <caption><center><font color="purple">Normal vs. Transpose Convolution</font></center></caption>
</div>

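In a normal convolution, we slide a filter over the input image, which (without padding) results in a smaller output. In a transpose convolution, the filter is placed on the output instead: each input value scales a copy of the filter, the copies are laid onto the output at stride-spaced positions, and overlapping entries are summed. This is the same computation as the backward pass of a normal convolution with respect to its input, which is why the output is larger than the input.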

<br>

<div style="text-align:center">
    <img src="media/transpose_conv.png" width=800>
    <caption><center><font color="purple">Applying 2x2 filter to an input image 2x2 to obtain 3x3 output image</font></center></caption>
</div>
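The 2x2-filter-on-2x2-input case can be sketched in plain Python. The function name `transpose_conv2d` and the input and filter values are hypothetical (they are not read from the figure); the sketch assumes stride 1 and no padding.

```python
def transpose_conv2d(x, w, stride=1):
    """Transpose convolution of a 2D input x with a 2D filter w (no padding).

    Each input value scales the whole filter; the scaled copy is added into
    the output, shifted by `stride` per input step, and overlaps are summed.
    """
    n_h, n_w = len(x), len(x[0])
    f_h, f_w = len(w), len(w[0])
    out_h = (n_h - 1) * stride + f_h
    out_w = (n_w - 1) * stride + f_w
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(n_h):
        for j in range(n_w):
            for a in range(f_h):
                for b in range(f_w):
                    out[i * stride + a][j * stride + b] += x[i][j] * w[a][b]
    return out


# Made-up 2x2 input and 2x2 filter.
x = [[0, 1],
     [2, 3]]
w = [[0, 1],
     [2, 3]]

y = transpose_conv2d(x, w)
for row in y:
    print(row)
# [0, 0, 1]
# [0, 4, 6]
# [4, 12, 9]
```

The centre entry receives contributions from several input values, which is the overlap-and-sum step.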

For a different set of configurations:
- Input image: $2 \times 2$
- Filter size: $3 \times 3$
- Padding: $p=1$
- Stride: $s=2$

<div style="text-align:center">
    <img src="media/transpose_conv1.png" width=800>
    <caption><center><font color="purple">Applying 3x3 filter to an input image 2x2 to obtain 4x4 output image</font></center></caption>
</div>
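Output sizes can be sanity-checked with the size formula $(n - 1)s + f - 2p + \text{output\_padding}$, which is the convention used by, for example, PyTorch's `ConvTranspose2d` with dilation 1. The helper name below is hypothetical. Under this convention, the settings above give $3 \times 3$ unless one extra row and column are added (`output_padding=1`), which yields the $4 \times 4$ shown in the figure. Which convention the figure follows is an assumption here.

```python
def transpose_conv_output_size(n, f, stride=1, padding=0, output_padding=0):
    """Output side length of a transpose convolution (dilation assumed to be 1)."""
    return (n - 1) * stride + f - 2 * padding + output_padding


# 2x2 input, 2x2 filter, stride 1, no padding.
print(transpose_conv_output_size(2, 2))  # 3

# 2x2 input, 3x3 filter, p=1, s=2.
print(transpose_conv_output_size(2, 3, stride=2, padding=1))  # 3
print(transpose_conv_output_size(2, 3, stride=2, padding=1, output_padding=1))  # 4
```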


## 1.2 - Skip Connections

The U-Net architecture for semantic segmentation consists of two main parts:

1. An encoder similar to a normal CNN that compresses the input image into a smaller representation, capturing high-level context but losing spatial details.
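2. A decoder that upsamples the representation back to the input dimensions using transpose convolutions. This allows dense prediction at the pixel level.

Skip connections copy encoder activations across to the decoder layer of matching resolution, so that the spatial details lost in the encoder can be reused during upsampling.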

<br>

<div style="text-align:center">
    <img src="media/unet-skip-conn.png" width=800>
</div>

# 2 - U-Net Architecture

The U-Net architecture consists of two main pathways:

**Encoder (contracting path):**
- Typical CNN layers to compress the input image into feature representations, capturing higher-level context but losing spatial details. Uses convolutions and max pooling.


**Decoder (expanding path):**
- Transpose convolutions to upsample features back to original input resolution so dense predictions can be made at pixel level.

The key innovation in U-Net is the use of skip connections, which pass information from the encoder layers directly to the decoder layers at the same spatial scale. This helps the decoder recover spatial details lost during downsampling.

So the decoder combines:
- High-level context from the encoder path
- Fine-grained localization from the skip connections

This provides sufficient information to accurately classify pixels for semantic segmentation. The overall architecture looks like a U shape, with encoder and decoder paths bridged by skip connections, enabling precise localization while leveraging context.
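A minimal sketch of how a skip connection merges the two sources, assuming the features are joined by concatenation along the channel axis (as in the original U-Net). The helper `concat_channels` and the feature values are hypothetical.

```python
def concat_channels(decoder_features, encoder_features):
    """Concatenate two H x W x C feature maps along the channel axis."""
    # Both maps must share the same spatial size.
    assert len(decoder_features) == len(encoder_features)
    assert len(decoder_features[0]) == len(encoder_features[0])
    return [
        [dec_px + enc_px for dec_px, enc_px in zip(dec_row, enc_row)]
        for dec_row, enc_row in zip(decoder_features, encoder_features)
    ]


# Made-up 2x2 feature maps: 2 decoder channels and 3 encoder channels.
decoder = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
encoder = [[[10, 11, 12], [13, 14, 15]], [[16, 17, 18], [19, 20, 21]]]

merged = concat_channels(decoder, encoder)
print(len(merged), len(merged[0]), len(merged[0][0]))  # 2 2 5
print(merged[0][0])  # [1, 2, 10, 11, 12]
```

Height and width are unchanged. Only the channel count grows, so later convolutions see both the context and the fine detail at each pixel.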

<br>

<div style="text-align:center">
    <img src="media/unet.png" width=900>
</div>

**Left Half**
- Apply normal convolutions to reduce height and width but increase depth.

**Right Half**
1. Apply transpose convolution to reduce depth (same height and width). Using a skip connection, copy over the activations from the encoder part.
2. Apply transpose convolution to reduce depth, but this time increase height and width and also copy over the activations from the encoder part.
3. At the top layer, after applying the last transpose convolution, the dimensions are the same as the input.
4. Lastly, apply a $1 \times 1$ convolution to output a $(h_{\text{in}}, w_{\text{in}}, n_{\text{classes}})$ volume. In the image above, we are interested in segmenting 3 classes, so if the input image were $19 \times 19$, the output volume would be $19 \times 19 \times 3$, where 3 is the number of classes (not the number of color channels of the input image).
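The final $1 \times 1$ convolution can be sketched as the same small linear map applied independently at every pixel. The helper `conv1x1`, the 4 input channels, and the weight and bias values are hypothetical; only the $19 \times 19$ input size and the 3 classes come from the example above.

```python
def conv1x1(features, weights, biases):
    """Apply a 1x1 convolution: the same linear map at every pixel.

    features: H x W x C_in nested lists
    weights:  n_classes x C_in
    biases:   n_classes
    """
    return [
        [
            [sum(w * v for w, v in zip(w_row, pixel)) + b
             for w_row, b in zip(weights, biases)]
            for pixel in row
        ]
        for row in features
    ]


h_in, w_in, c_in, n_classes = 19, 19, 4, 3
features = [[[1.0] * c_in for _ in range(w_in)] for _ in range(h_in)]
weights = [[0.5] * c_in, [0.25] * c_in, [0.0] * c_in]  # made-up weights
biases = [0.0, 1.0, -1.0]                              # made-up biases

scores = conv1x1(features, weights, biases)
print(len(scores), len(scores[0]), len(scores[0][0]))  # 19 19 3
print(scores[0][0])  # [2.0, 2.0, -1.0]
```

Each pixel ends up with one score per class, so the output volume is $19 \times 19 \times 3$ regardless of how many channels fed into the layer.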