### **Semantic Segmentation vs. Instance Segmentation**


![](images/classification_semantic_segmentation_object_detection_instance_segmentation.png)


### **Key Differences**  
| **Aspect**                | **Semantic Segmentation**                     | **Instance Segmentation**                     |
|---------------------------|-----------------------------------------------|-----------------------------------------------|
| **Granularity**           | Class-level (groups all objects of same class) | Object-level (distinguishes individual objects) |
| **Output**                | Single class label per pixel                  | Class label + instance ID per pixel           |
| **Object Differentiation** | No differentiation within same class          | Differentiates individual instances           |
| **Complexity**            | Simpler, focuses only on class prediction      | More complex, combines detection and segmentation |
| **Example Models**        | U-Net, DeepLab, FCN                           | Mask R-CNN, YOLACT, SOLO                     |


Both approaches are critical in computer vision, with semantic segmentation being sufficient for class-based tasks and instance segmentation required for applications needing individual object tracking or counting.
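
To make the "Output" row of the table concrete, here is a minimal sketch in plain Python. The toy 1×6 image, the class IDs (0 = background, 1 = car), and the helper `count_instances` are all hypothetical and chosen only for illustration.

```python
# Toy 1x6 "image": two separate cars with background between them.
# Hypothetical class IDs: 0 = background, 1 = car.
semantic_mask = [1, 1, 0, 0, 1, 1]

# Instance segmentation additionally assigns an instance ID per pixel
# (0 = no object, 1 = first car, 2 = second car).
instance_ids = [1, 1, 0, 0, 2, 2]


def count_instances(class_mask, id_mask, class_id):
    """Count distinct non-zero instance IDs among pixels of one class."""
    ids = {i for c, i in zip(class_mask, id_mask) if c == class_id and i != 0}
    return len(ids)


classes_present = sorted(set(semantic_mask))
num_cars = count_instances(semantic_mask, instance_ids, class_id=1)

print(classes_present)  # [0, 1]
print(num_cars)         # 2
```

The semantic mask alone only tells us that "car" pixels exist. Counting the two cars needs the instance IDs.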

## Dice Loss

Dice loss is a **very common loss function for segmentation tasks** in deep learning, especially when dealing with **imbalanced datasets** where the foreground (object of interest) occupies only a small part of the image. The **Dice coefficient** (also called **Sørensen–Dice index**) is a **similarity measure** between two sets.
If you have two sets $A$ (ground truth) and $B$ (prediction), the Dice coefficient is:

$$
\text{Dice}(A, B) = \frac{2|A \cap B|}{|A| + |B|}
$$

* $|A|$ = number of elements (pixels) in set $A$ (ground truth foreground pixels)
* $|B|$ = number of elements (pixels) in set $B$ (predicted foreground pixels)
* $|A \cap B|$ = overlap between $A$ and $B$

So **Dice = 1** means perfect overlap (prediction = ground truth)
and **Dice = 0** means no overlap at all.

---

**Why Is It Useful?**

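In segmentation, we want the predicted mask to match the ground truth mask as closely as possible.
The Dice coefficient directly measures **overlap** on the foreground, so it is more robust to class imbalance than pixel-wise accuracy. For example, if the foreground covers 1% of the image, predicting all background scores 99% accuracy but a Dice of 0.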

---

## From Dice Coefficient to Dice Loss

Since we minimize loss functions during training, we use:

$$
\text{Dice Loss} = 1 - \text{Dice Coefficient}
$$

This makes the loss **small when overlap is high** and **large when overlap is poor**.

---






## **Numerical Example**

Let’s say you have a very small image with 6 pixels:

Ground truth: `1 0 0 1 0 0`
Prediction:   `1 0 1 1 0 0`

* Intersection (where both are 1): `1 0 0 1 0 0` → **2 pixels**
* Ground truth positives: **2**
* Prediction positives: **3**

Dice coefficient:

$$
\text{Dice} = \frac{2 \times 2}{2 + 3} = \frac{4}{5} = 0.8
$$

So Dice loss = $1 - 0.8 = 0.2$.
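
The same calculation can be checked with a few lines of plain Python. This is a minimal sketch: the function name `dice_coefficient` and the convention of returning 1.0 for two empty masks are assumptions, not something the notes prescribe.

```python
def dice_coefficient(ground_truth, prediction):
    """Dice coefficient for two equal-length binary (0/1) masks."""
    intersection = sum(g * p for g, p in zip(ground_truth, prediction))
    total = sum(ground_truth) + sum(prediction)
    # Assumed convention: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2 * intersection / total


ground_truth = [1, 0, 0, 1, 0, 0]
prediction = [1, 0, 1, 1, 0, 0]

dice = dice_coefficient(ground_truth, prediction)
print(dice)      # 0.8
print(1 - dice)  # Dice loss, approximately 0.2 (floating point)
```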

---


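## Mathematical Formulation for Deep Learning: Soft Dice

Let $g_i \in \{0,1\}$ be the ground-truth label and $p_i \in \{0,1\}$ the thresholded prediction at pixel $i$. The set sizes then become sums over pixels: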

$$
|A \cap B| = \sum_i g_i p_i
$$

$$
|A| = \sum_i g_i, \quad |B| = \sum_i p_i
$$

---

**Make It “Soft”**

In deep learning, the network predicts **probabilities** $p_i \in [0,1]$ (sigmoid output for binary segmentation).
So we simply **do not round** them — we keep them continuous.

Thus, the **soft Dice coefficient** becomes:

$$
\text{SoftDice}(p, g) = \frac{2 \sum_i p_i g_i}{\sum_i p_i + \sum_i g_i + \epsilon}
$$

Where $g_i$ is still binary (0 or 1), but $p_i$ is a probability.
$\epsilon$ is a small constant to avoid division by zero.

---

**Soft Dice Loss**:

Since we minimize losses, we define:

$$
\boxed{\text{Soft Dice Loss} = 1 - \frac{2 \sum_i p_i g_i}{\sum_i p_i + \sum_i g_i + \epsilon}}
$$

* When prediction $p = g$ (perfect match), numerator = denominator → loss = 0.
* When prediction is bad (no overlap), numerator ≈ 0 → loss ≈ 1.
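
The boxed formula translates almost directly into code. The sketch below uses plain Python lists to stay dependency-free; the function name `soft_dice_loss` and the default `eps=1e-7` are assumptions, and in practice you would compute this on tensors in your framework of choice.

```python
def soft_dice_loss(probs, targets, eps=1e-7):
    """Soft Dice loss for flat lists of probabilities and 0/1 targets."""
    intersection = sum(p * g for p, g in zip(probs, targets))
    # eps only guards against division by zero when both masks are empty.
    denominator = sum(probs) + sum(targets) + eps
    return 1.0 - 2.0 * intersection / denominator


targets = [1, 0, 0, 1, 0, 0]
confident = [0.9, 0.1, 0.1, 0.9, 0.1, 0.1]  # close to the ground truth
uncertain = [0.6, 0.4, 0.4, 0.6, 0.4, 0.4]  # right side of 0.5, but hesitant

print(soft_dice_loss(confident, targets))
print(soft_dice_loss(uncertain, targets))
```

Both predictions give the same mask after thresholding at 0.5, yet the confident one gets the lower loss. That continuous signal is what makes the soft version usable for gradient-based training.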

---

## When to Use Dice Loss

**Best for segmentation with class imbalance**, like:

* Medical image segmentation (tumor occupies tiny fraction of image)
* Road/lane detection
* Object segmentation with sparse objects

Sometimes people combine **Dice loss + Cross Entropy loss** to benefit from both:

* Cross-entropy gives good per-pixel supervision.
* Dice focuses on overall overlap (global structure).
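
One common way to combine the two is a weighted sum. The sketch below is only an illustration for the binary case: the function names and the `dice_weight` parameter (with its default of 0.5) are hypothetical, and the right weighting depends on the task.

```python
import math


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean per-pixel binary cross-entropy; probs are clipped for stability."""
    total = 0.0
    for p, g in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total += -(g * math.log(p) + (1 - g) * math.log(1.0 - p))
    return total / len(probs)


def soft_dice_loss(probs, targets, eps=1e-7):
    """Soft Dice loss for flat lists of probabilities and 0/1 targets."""
    intersection = sum(p * g for p, g in zip(probs, targets))
    return 1.0 - 2.0 * intersection / (sum(probs) + sum(targets) + eps)


def combined_loss(probs, targets, dice_weight=0.5):
    """Weighted sum of Dice and cross-entropy (dice_weight is a hypothetical knob)."""
    dice = soft_dice_loss(probs, targets)
    bce = binary_cross_entropy(probs, targets)
    return dice_weight * dice + (1.0 - dice_weight) * bce


targets = [1, 0, 0, 1]
print(combined_loss([0.9, 0.1, 0.1, 0.9], targets))
print(combined_loss([0.6, 0.4, 0.4, 0.6], targets))
```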

---

## Numerical Example: Soft Dice for Multi-Class Semantic Segmentation

Consider **3-class semantic segmentation** (classes A, B, C) on a **2×2 image** (4 pixels). We’ll (1) turn **logits → probabilities** with **softmax**, (2) compute **soft Dice per class**, and (3) average the per-class scores and subtract the result from 1 to get the final loss.

---

#### Logits Values

Suppose the network outputs these **logits** per pixel (order = [A,B,C]):

* Pixel 1: `[1, 0, 0]`
* Pixel 2: `[0, 1, 0]`
* Pixel 3: `[0, 0, 1]`
* Pixel 4: `[0.5, 0.5, 0]`


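Arranged as 2×2 maps (pixels 1 and 2 on the top row, pixels 3 and 4 on the bottom row), the logits for each class are:

$$
z_A = \begin{bmatrix}
1.0 & 0.0 \\
0.0 & 0.5 \\
\end{bmatrix}
$$

$$
z_B = \begin{bmatrix}
0.0 & 1.0 \\
0.0 & 0.5 \\
\end{bmatrix}
$$

$$
z_C = \begin{bmatrix}
0.0 & 0.0 \\
1.0 & 0.0 \\
\end{bmatrix}
$$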



---




#### Probabilities (softmax)

Softmax at a pixel:

$$
p_c=\frac{e^{z_c}}{\sum_{k} e^{z_k}}
$$

Using $e\approx2.71828$ and $e^{0.5}\approx1.64872$:

* **Pixel 1**: exp = `[2.71828, 1, 1]`, sum = `4.71828` → probs ≈ `[0.5761, 0.2119, 0.2119]`
* **Pixel 2**: exp = `[1, 2.71828, 1]`, sum = `4.71828` → probs ≈ `[0.2119, 0.5761, 0.2119]`
* **Pixel 3**: exp = `[1, 1, 2.71828]`, sum = `4.71828` → probs ≈ `[0.2119, 0.2119, 0.5761]`
* **Pixel 4**: exp = `[1.64872, 1.64872, 1]`, sum = `4.29744` → probs ≈ `[0.3837, 0.3837, 0.2327]`

(rounded to 4 decimals)
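
The per-pixel softmax can be reproduced with the standard library alone. This is a sketch under the assumption that each pixel's logits are stored as a list in `[A, B, C]` order; the variable names are illustrative.

```python
import math


def softmax(logits):
    """Softmax over one pixel's class logits."""
    # Subtracting the max leaves the result unchanged but avoids overflow.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


pixel_logits = [
    [1.0, 0.0, 0.0],  # pixel 1
    [0.0, 1.0, 0.0],  # pixel 2
    [0.0, 0.0, 1.0],  # pixel 3
    [0.5, 0.5, 0.0],  # pixel 4
]

probs = [softmax(z) for z in pixel_logits]
for row in probs:
    print([round(p, 4) for p in row])
```

Each printed row sums to 1 up to rounding, which is a quick sanity check on any hand-computed softmax.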


As 2×2 probability maps, one per class:

$$
p_A = \begin{bmatrix}
0.5761 & 0.2119 \\
0.2119 & 0.3837 \\
\end{bmatrix}
$$

$$
p_B = \begin{bmatrix}
0.2119 & 0.5761 \\
0.2119 & 0.3837 \\
\end{bmatrix}
$$

$$
p_C = \begin{bmatrix}
0.2119 & 0.2119 \\
0.5761 & 0.2327 \\
\end{bmatrix}
$$


---




#### Ground truth (one-hot masks)

Let the **ground-truth (GT) classes** for the 4 pixels be `[A, B, B, C]`, so the per-class pixel counts are $|A|=1, |B|=2, |C|=1$.

With class indices $A=0$, $B=1$, $C=2$, the 2×2 label map is:

$$
\begin{bmatrix}
A & B \\
B & C \\
\end{bmatrix}
=
\begin{bmatrix}
0 & 1 \\
1 & 2 \\
\end{bmatrix}
$$

Convert this to one-hot: for each class, the mask is 1 at pixels of that class and 0 elsewhere.
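
The one-hot conversion is a one-liner per class. The sketch below assumes the labels are stored as a flat list in row-major pixel order; `one_hot` and `CLASSES` are illustrative names.

```python
CLASSES = ["A", "B", "C"]


def one_hot(labels, classes):
    """Return one 0/1 mask per class for a flat list of class labels."""
    return {c: [1 if label == c else 0 for label in labels] for c in classes}


gt_labels = ["A", "B", "B", "C"]  # pixels 1..4 in row-major order
masks = one_hot(gt_labels, CLASSES)

print(masks["A"])  # [1, 0, 0, 0]
print(masks["B"])  # [0, 1, 1, 0]
print(masks["C"])  # [0, 0, 0, 1]
```

Every pixel belongs to exactly one class, so the masks sum to 1 at each pixel position.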


---



#### Soft Dice (per class → macro average)

We have 4 pixels ($i=1,2,3,4$) and 3 classes ($A,B,C$).
Predicted probabilities $p_{i,c}$ come from the softmax step above.
Ground-truth one-hot values $y_{i,c}$ come from the one-hot masks above.

**Soft Dice** for class $c$ (no batch here, just 4 pixels):

$$
\text{Dice}_c = \frac{2\sum_i p_{i,c}y_{i,c}}{\sum_i p_{i,c} + \sum_i y_{i,c} + \epsilon}
$$

* $\sum_i y_{i,c} =$ number of pixels of class $c$.
* $\sum_i p_{i,c} =$ sum of predicted probabilities for class $c$.
* $\sum_i p_{i,c}y_{i,c} =$ “soft intersection” (only GT pixels of class $c$ contribute).

No denominator is zero in this example, so the calculations below drop $\epsilon$.





---



#### Sums you need (with variables + numbers)


* For **Class A**:

$$
p_A = \begin{bmatrix}
0.5761 & 0.2119 \\
0.2119 & 0.3837 \\
\end{bmatrix},
\quad
y_A = \begin{bmatrix}
1 & 0 \\
0 & 0 \\
\end{bmatrix}
$$

$$
\sum_i p_{i,A} = p_{1,A} + p_{2,A} + p_{3,A} + p_{4,A}
= 0.5761 + 0.2119 + 0.2119 + 0.3837 \approx 1.3837
$$

(The sum is taken before rounding; the rounded terms alone add up to 1.3836.)

$$
\sum_i y_{i,A} = y_{1,A}+y_{2,A}+y_{3,A}+y_{4,A}
= 1+0+0+0 = 1
$$

$$
\sum_i p_{i,A}y_{i,A} = p_{1,A}\cdot y_{1,A} = 0.5761 \quad(\text{since only pixel 1 is class A})
$$

---

* For **Class B**:

$$
p_B = \begin{bmatrix}
0.2119 & 0.5761 \\
0.2119 & 0.3837 \\
\end{bmatrix},
\quad
y_B = \begin{bmatrix}
0 & 1 \\
1 & 0 \\
\end{bmatrix}
$$

$$
\sum_i p_{i,B} = p_{1,B} + p_{2,B} + p_{3,B} + p_{4,B}
= 0.2119 + 0.5761 + 0.2119 + 0.3837 \approx 1.3837
$$

(The sum is taken before rounding; the rounded terms alone add up to 1.3836.)

$$
\sum_i y_{i,B} = y_{1,B}+y_{2,B}+y_{3,B}+y_{4,B}
= 0+1+1+0 = 2
$$

$$
\sum_i p_{i,B}y_{i,B} = p_{2,B}\cdot y_{2,B} + p_{3,B}\cdot y_{3,B}
= 0.5761 + 0.2119 = 0.7880
$$

---

* For **Class C**:



$$
p_C = \begin{bmatrix}
0.2119 & 0.2119 \\
0.5761 & 0.2327 \\
\end{bmatrix},
\quad
y_C = \begin{bmatrix}
0 & 0 \\
0 & 1 \\
\end{bmatrix}
$$



$$
\sum_i p_{i,C} = p_{1,C} + p_{2,C} + p_{3,C} + p_{4,C}
= 0.2119 + 0.2119 + 0.5761 + 0.2327 = 1.2326
$$

$$
\sum_i y_{i,C} = y_{1,C}+y_{2,C}+y_{3,C}+y_{4,C}
= 0+0+0+1 = 1
$$

$$
\sum_i p_{i,C}y_{i,C} = p_{4,C}\cdot y_{4,C} = 0.2327
$$

---



#### Per-class Dice with variables

* **Class A**:

$$
\text{Dice}_A = \frac{2\sum_i p_{i,A}y_{i,A}}{\sum_i p_{i,A} + \sum_i y_{i,A}}
= \frac{2(0.5761)}{1.3837 + 1}
= \frac{1.1522}{2.3837} \approx 0.4834
$$

* **Class B**:

$$
\text{Dice}_B = \frac{2\sum_i p_{i,B}y_{i,B}}{\sum_i p_{i,B} + \sum_i y_{i,B}}
= \frac{2(0.7880)}{1.3837 + 2}
= \frac{1.5760}{3.3837} \approx 0.4658
$$

* **Class C**:

$$
\text{Dice}_C = \frac{2\sum_i p_{i,C}y_{i,C}}{\sum_i p_{i,C} + \sum_i y_{i,C}}
= \frac{2(0.2327)}{1.2326 + 1}
= \frac{0.4654}{2.2326} \approx 0.2085
$$

---

#### Macro Dice and Soft Dice loss

$$
\text{Macro Dice} = \frac{\text{Dice}_A + \text{Dice}_B + \text{Dice}_C}{3}
= \frac{0.4834 + 0.4658 + 0.2085}{3}
\approx 0.3859
$$

$$
\text{Soft Dice Loss} = 1 - \text{Macro Dice} \approx 0.6141
$$
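
The whole worked example (softmax, per-class soft Dice, macro average, loss) fits in one short function. This is a sketch in plain Python, assuming integer class indices 0 = A, 1 = B, 2 = C and a hypothetical `eps=1e-7`. Because the hand calculation rounds intermediate values to 4 decimals, the per-class scores printed here may differ from it in the last digit.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one pixel's class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def macro_soft_dice_loss(logits, labels, num_classes, eps=1e-7):
    """Return (loss, per-class soft Dice) for per-pixel logits and int labels."""
    probs = [softmax(z) for z in logits]
    dice_scores = []
    for c in range(num_classes):
        p_c = [p[c] for p in probs]                             # class-c probabilities
        y_c = [1.0 if label == c else 0.0 for label in labels]  # one-hot mask
        intersection = sum(p * y for p, y in zip(p_c, y_c))
        dice_scores.append(2.0 * intersection / (sum(p_c) + sum(y_c) + eps))
    macro_dice = sum(dice_scores) / num_classes
    return 1.0 - macro_dice, dice_scores


logits = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
]
labels = [0, 1, 1, 2]  # A, B, B, C

loss, per_class = macro_soft_dice_loss(logits, labels, num_classes=3)
print([round(d, 4) for d in per_class])
print(round(loss, 4))
```

Class C has the lowest score because its only ground-truth pixel (pixel 4) gets the smallest probability for the correct class, while the model puts most of its class-C probability on pixel 3.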

---