## Image Segmentation

Image segmentation is the process of partitioning a digital image into multiple segments (sets of pixels, also known as image objects). The goal is to simplify or change the representation of an image into something that is more meaningful and easier to analyze.

### Types

#### Semantic Segmentation
Every pixel in the image is assigned a class label (e.g., "road," "car," "pedestrian").
- It does not differentiate between individual objects of the same class. All cars are just one big "car" blob.

#### Instance Segmentation
Instance segmentation goes a step further than semantic segmentation. It not only identifies the class of each pixel but also differentiates between individual instances of that class.
- If there are five cars, each car gets its own unique mask and ID.

#### Panoptic Segmentation
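Combines semantic and instance segmentation, so every pixel gets a class label and pixels of countable objects also get an instance ID:
- Instance: Countable objects ("things") like cars and people.
- Semantic: Amorphous regions ("stuff") like sky, grass, or road.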
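
To make the three output formats concrete, here is a minimal NumPy sketch of a toy scene with two cars against the sky. The class IDs, the array layout, and the `class * OFFSET + instance` encoding are assumptions chosen for illustration; real datasets define their own conventions.

```python
import numpy as np

# Toy 4x6 scene. Assumed class IDs: 0 = sky ("stuff"), 1 = car ("thing").
semantic = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
])

# Instance map: 0 = no instance, 1 and 2 = two separate cars.
instance = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
])

# Panoptic map: one integer per pixel that carries both class and instance.
# The encoding class * OFFSET + instance is an illustrative assumption.
OFFSET = 1000
panoptic = semantic * OFFSET + instance

# Semantic output only knows "car pixels"; instance output can count cars.
num_car_pixels = int((semantic == 1).sum())
num_cars = len(np.unique(instance[instance > 0]))
panoptic_ids = sorted(np.unique(panoptic).tolist())

print(num_car_pixels, num_cars, panoptic_ids)
```

The semantic map alone cannot tell that there are two cars; the instance map can, and the panoptic map keeps both the sky label and the per-car IDs in one array.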

### Classic Techniques
- **Thresholding**: Splits the image into foreground and background by comparing each pixel's intensity against a threshold value.
- **Region-Based (Region Growing)**: Starts from a "seed" pixel and absorbs neighbouring pixels of similar color or texture until it hits a boundary.
- **Watershed Algorithm**: Treats the image like a topographic map in which high-intensity pixels are peaks and low-intensity pixels are valleys. It "floods" the valleys with water (labels), and segment boundaries form where neighbouring basins meet.
- **K-Means Clustering**: Groups pixels into $k$ clusters by color similarity; each cluster becomes a segment.
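
Two of the techniques above, thresholding and region growing, are simple enough to sketch in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function names, the fixed threshold, the 4-connected neighbourhood, and the "within `tol` of the seed intensity" similarity rule are all assumptions made for the example.

```python
from collections import deque

import numpy as np


def threshold_segment(image, thresh):
    """Return a boolean foreground mask: True where intensity > thresh."""
    return image > thresh


def region_grow(image, seed, tol):
    """Grow a region from seed (row, col) over 4-connected neighbours
    whose intensity is within tol of the seed pixel's intensity."""
    h, w = image.shape
    seed_val = float(image[seed])
    mask = np.zeros((h, w), dtype=bool)
    mask[seed] = True
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < h and 0 <= nc < w
            if inside and not mask[nr, nc]:
                if abs(float(image[nr, nc]) - seed_val) <= tol:
                    mask[nr, nc] = True
                    queue.append((nr, nc))
    return mask


# Toy grayscale image: a bright object (top right) on a dark background,
# plus one mid-grey pixel (90) that belongs to neither.
image = np.array([
    [10, 12, 11, 200, 205],
    [11, 13, 10, 210, 198],
    [12, 10, 90, 202, 201],
    [ 9, 11, 12,  10,  11],
], dtype=np.uint8)

foreground = threshold_segment(image, thresh=128)
bright_region = region_grow(image, seed=(0, 3), tol=15)
dark_region = region_grow(image, seed=(0, 0), tol=15)

print(foreground.sum(), bright_region.sum(), dark_region.sum())
```

On this toy image both methods recover the same bright object, but region growing depends on the seed: starting in the dark area instead returns the background region and stops at the mid-grey pixel.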

### Deep Learning Techniques

<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/0*lxn3ufd8_gcN0Ij2" width="500">

#### U-Net
Developed in 2015 for medical image segmentation, U-Net is a convolutional neural network with a U-shaped structure consisting of:
- **Encoder (Contracting Path)**: Captures context using conv + pooling layers.
- **Decoder (Expanding Path)**: Upsamples the features back to the input resolution using transposed convolutions, producing a per-pixel segmentation map.
- **Skip Connections**: Transfer fine-grained details from encoder to decoder layers.

<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/0*wpPSAaAgZBq6eaCY" width="400">

Imagine scanning a photo and printing it again. 
- The scanner compresses it (encoder), while the printer reconstructs the fine details (decoder). 
- Skip connections are like post-it notes helping the printer fill in the missing textures.
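
The data flow through the "U" can be sketched with plain NumPy by tracking tensor shapes. This is a shape-only illustration under loud assumptions: the convolutions are omitted entirely, nearest-neighbour repetition stands in for the learned transposed convolutions, and the channel counts and helper names are made up for the example.

```python
import numpy as np


def max_pool_2x2(x):
    """Halve height and width by taking the max over 2x2 blocks.
    x has shape (channels, height, width) with even height and width."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))


def upsample_2x(x):
    """Double height and width by nearest-neighbour repetition.
    (Stand-in for a learned transposed convolution.)"""
    return x.repeat(2, axis=1).repeat(2, axis=2)


rng = np.random.default_rng(0)

# Encoder: resolution shrinks at each level.
enc1 = rng.random((8, 64, 64))      # full-resolution features
enc2 = max_pool_2x2(enc1)           # (8, 32, 32)
bottleneck = max_pool_2x2(enc2)     # (8, 16, 16)

# Decoder: upsample, then concatenate the matching encoder features
# along the channel axis. That concatenation is the skip connection.
dec2 = np.concatenate([upsample_2x(bottleneck), enc2], axis=0)  # (16, 32, 32)
dec1 = np.concatenate([upsample_2x(dec2), enc1], axis=0)        # (24, 64, 64)

print(enc2.shape, bottleneck.shape, dec2.shape, dec1.shape)
```

The decoder output has the same height and width as the input, and each decoder level sees both the upsampled coarse features and the fine-grained encoder features from the same resolution.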

#### Mask R-CNN
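Mask R-CNN builds on the Faster R-CNN detector. Before the mask is drawn, the model performs standard two-stage object detection (region proposal, then classification and box refinement) on top of a shared backbone: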
- **Backbone (CNN)**: A feature extractor (usually ResNet-50 or ResNet-101) that turns the raw image into a high-level feature map.
- **Region Proposal Network (RPN)**: This scans the feature map and proposes "candidate" areas (Regions of Interest, or RoIs) where it thinks an object might exist.
- **The Classifier/Regressor**: For each proposal, the model predicts a class (e.g., "Person") and refines the bounding box coordinates.

Mask R-CNN adds a third branch in parallel to the classification and bounding box branches.
- **Output**: While the other branches output numbers (class ID and 4 coordinates), the Mask Head outputs an $m \times m$ binary mask for each RoI.
- **Pixel-to-Pixel**: This branch is a small Fully Convolutional Network (FCN). It preserves the spatial layout of the object, allowing it to map exactly which pixels inside the bounding box belong to the object and which belong to the background.
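
Because the mask is predicted at a fixed $m \times m$ resolution inside the box, it has to be resized to the box and pasted back into the image at inference time. The sketch below shows that step under simplifying assumptions: `paste_mask` is a hypothetical helper, nearest-neighbour resizing replaces the smoother interpolation real implementations typically use, boxes are integer `(x0, y0, x1, y1)` with exclusive ends, and the box lies fully inside the image.

```python
import numpy as np


def paste_mask(mask_probs, box, image_shape, threshold=0.5):
    """Resize a small mask of probabilities to the box size and paste it
    into a full-image boolean mask.

    mask_probs : (m, m) array of foreground probabilities from the mask head
    box        : (x0, y0, x1, y1) integer box, end coordinates exclusive
    image_shape: (height, width) of the full image
    """
    x0, y0, x1, y1 = box
    box_h, box_w = y1 - y0, x1 - x0
    mask_h, mask_w = mask_probs.shape

    # Nearest-neighbour resize: map each box pixel to a mask cell.
    rows = np.arange(box_h) * mask_h // box_h
    cols = np.arange(box_w) * mask_w // box_w
    resized = mask_probs[np.ix_(rows, cols)]

    full = np.zeros(image_shape, dtype=bool)
    full[y0:y1, x0:x1] = resized >= threshold
    return full


# A 2x2 "mask head" output for one RoI, pasted into a 6x8 image.
mask_probs = np.array([
    [0.9, 0.1],
    [0.8, 0.7],
])
full_mask = paste_mask(mask_probs, box=(2, 1, 6, 5), image_shape=(6, 8))

print(full_mask.astype(int))
```

Pixels outside the box stay background, and inside the box only the cells whose probability reaches the threshold are marked as object.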

### Training Segmentation Models
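Plain pixel accuracy is a misleading training signal for segmentation because of class imbalance: if a tumor covers only 1% of the image, a model that predicts "background" everywhere still scores 99% accuracy. Loss functions that counter this imbalance include: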
- **Dice Loss**: Based on the Dice coefficient, $2|A \cap B| / (|A| + |B|)$, where $A$ is the predicted mask and $B$ is the true mask. It measures only the overlap between the two, so correctly predicted "background" pixels do not inflate the score.
- **Weighted Cross-Entropy**: Assigns a larger weight to the rare class, so the model is told to care (for example) 10x more about "tumor" pixels than "empty space" pixels.
- **Focal Loss**: Introduced with the RetinaNet object detector and also applied to segmentation. It down-weights "easy" pixels (like the solid center of an object) so that training focuses on "hard-to-classify" pixels (like thin edges).
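
The imbalance problem and the three losses above can be sketched for the binary case in NumPy. This is a minimal illustration, not a framework implementation: the function names, the smoothing constant `eps`, the default `pos_weight=10.0`, and `gamma=2.0` are assumptions, and the focal loss here omits the optional class-balancing factor.

```python
import numpy as np


def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted label equals the true label."""
    return float((pred == target).mean())


def dice_loss(probs, target, eps=1e-6):
    """1 - Dice coefficient; eps avoids division by zero on empty masks."""
    intersection = (probs * target).sum()
    dice = (2.0 * intersection + eps) / (probs.sum() + target.sum() + eps)
    return float(1.0 - dice)


def weighted_bce(probs, target, pos_weight=10.0, eps=1e-7):
    """Binary cross-entropy with foreground pixels weighted by pos_weight."""
    p = np.clip(probs, eps, 1 - eps)
    loss = -(pos_weight * target * np.log(p) + (1 - target) * np.log(1 - p))
    return float(loss.mean())


def focal_loss(probs, target, gamma=2.0, eps=1e-7):
    """Cross-entropy scaled by (1 - p_t)^gamma, so confident, correct
    pixels contribute very little."""
    p = np.clip(probs, eps, 1 - eps)
    p_t = np.where(target == 1, p, 1 - p)  # probability of the true class
    return float((-(1 - p_t) ** gamma * np.log(p_t)).mean())


# 10x10 image with a single foreground pixel: 1% "tumor", 99% background.
target = np.zeros((10, 10))
target[0, 0] = 1.0
all_background = np.zeros((10, 10))  # a useless model: never finds the tumor

acc = pixel_accuracy(all_background, target)
dice = dice_loss(all_background, target)
wbce_10 = weighted_bce(all_background, target, pos_weight=10.0)
wbce_1 = weighted_bce(all_background, target, pos_weight=1.0)

print(f"accuracy={acc:.2f} dice_loss={dice:.4f}")
print(f"bce(w=1)={wbce_1:.4f} bce(w=10)={wbce_10:.4f}")
```

The all-background prediction looks excellent by accuracy but gets a Dice loss close to its maximum of 1, and raising `pos_weight` makes the single missed tumor pixel dominate the cross-entropy.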