## Structural Similarity Index Measure (SSIM)


The **Structural Similarity Index Measure (SSIM)** is a widely used metric for measuring the similarity between two images. Unlike simple metrics such as Mean Squared Error (MSE) or Peak Signal-to-Noise Ratio (PSNR), SSIM is designed to model the way humans perceive image quality — focusing on structural information, contrast, and luminance rather than raw pixel differences.

---

### The Idea Behind SSIM

SSIM tries to answer: *“How similar are two images in terms of structure, contrast, and brightness?”*

It decomposes similarity into **three components**:

1. **Luminance similarity** $l(x, y)$:

   Are the two images equally bright on average?

   $$
   l(x,y) = \frac{2 \mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}
   $$

   - If both images have similar brightness, this term is near 1.
   - If one image is much darker, it drops below 1.

2. **Contrast similarity** $c(x, y)$:

   Do the two images have the same amount of contrast?

   $$
   c(x,y) = \frac{2 \sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}
   $$

   - If both images have similar contrast (variability), this term is near 1.
   - If one is flat (low contrast) and the other is textured (high contrast), the similarity decreases.

3. **Structural similarity** $s(x, y)$:

   Do the two images have the same patterns and textures?

   $$
   s(x,y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}
   $$

   - If $x$ and $y$ rise and fall together (high correlation), this term is near $1$.
   - If they are uncorrelated or inverted (noise, wrong edges), this term becomes smaller or even negative.

Here:

* $\mu_x, \mu_y$ are the mean intensities,
* $\sigma_x, \sigma_y$ are standard deviations,
* $\sigma_{xy}$ is covariance between $x$ and $y$,
* $C_1, C_2, C_3$ are small constants to stabilize division.

The final SSIM is:

$$
SSIM(x, y) = [l(x, y)]^\alpha \cdot [c(x, y)]^\beta \cdot [s(x, y)]^\gamma
$$

Usually $\alpha = \beta = \gamma = 1$.


SSIM is bounded between **–1 and 1**, where **1 means perfect similarity**.

In loss form we usually use:

$$
\text{SSIMLoss} = 1 - \text{SSIM}
$$


---



With $\alpha = \beta = \gamma = 1$ and the common choice $C_3 = C_2 / 2$, the product of the three terms simplifies into a single combined expression:


$$
SSIM(x,y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}
{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}
$$
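
For reference, the simplification can be checked by hand. The short derivation below assumes $\alpha = \beta = \gamma = 1$ and $C_3 = C_2 / 2$; multiplying the numerator and denominator of $s(x,y)$ by 2 lets the contrast numerator cancel:

$$
\begin{aligned}
c(x,y)\, s(x,y)
&= \frac{2\sigma_x\sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2} \cdot \frac{\sigma_{xy} + C_2/2}{\sigma_x\sigma_y + C_2/2} \\
&= \frac{2\sigma_x\sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2} \cdot \frac{2\sigma_{xy} + C_2}{2\sigma_x\sigma_y + C_2} \\
&= \frac{2\sigma_{xy} + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}
\end{aligned}
$$

Multiplying this by the luminance term $l(x,y)$ gives the combined form.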

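The stabilizing constants are defined as $C_1 = (K_1 L)^2$ and $C_2 = (K_2 L)^2$, with $K_1 = 0.01$, $K_2 = 0.03$, and $L$ the dynamic range of the pixel values. For images scaled to $[0, 1]$ (so $L = 1$):
$$
C_1 = (0.01)^2, \quad C_2 = (0.03)^2
$$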






---


### **Interpreting the number**

| SSIM index (= 1 - loss) | Rough meaning (rule of thumb)                     |
| ----------------------- | ------------------------------------------------- |
| 0.95 – 1.00             | Almost identical                                  |
| 0.70 – 0.95             | Similar (small structural differences)            |
| 0.40 – 0.70             | Moderately different (like adjacent KITTI frames) |
| 0.10 – 0.40             | Visibly different                                 |
| < 0.10                  | Almost unrelated                                  |



### Numerical SSIM Example
This is a fully worked **numerical SSIM example** with two $3\times3$ grayscale image patches. It computes every piece: mean, standard deviation, covariance, normalized correlation, the three SSIM terms (luminance, contrast, structure), and the final SSIM.

**Images (grayscale, 8-bit scale assumed)**

$$
x=\begin{bmatrix}
10&20&30\\
20&30&40\\
30&40&50
\end{bmatrix},\quad
y=\begin{bmatrix}
12&22&32\\
21&31&41\\
29&39&49
\end{bmatrix}
$$

We’ll treat the whole $3\times3$ window as one patch (i.e., a single SSIM window).

---

#### 1) Basic statistics

Let $N=9$.

**Means**

$$
\mu_x=30.0000,\qquad \mu_y=30.6667
$$

**Sample standard deviations** (ddof=1)

$$
\sigma_x=\sqrt{150}=12.2474,\qquad \sigma_y=\sqrt{129.25}=11.3688
$$

**Sample covariance**

$$
\sigma_{xy}=\frac{1}{N-1}\sum (x_i-\mu_x)(y_i-\mu_y)=138.75
$$

**Normalized correlation (Pearson $\rho$)**

$$
\rho=\frac{\sigma_{xy}}{\sigma_x\sigma_y}=\frac{138.75}{12.2474\cdot 11.3688}\approx 0.9965
$$

> Intuition: the patches vary together almost perfectly (very strong structural agreement).

---

#### 2) SSIM components

Use the standard SSIM constants for 8-bit images:

$$
L=255,\quad C_1=(0.01L)^2=6.5025,\quad C_2=(0.03L)^2=58.5225,\quad C_3=\frac{C_2}{2}=29.26125
$$

#### (a) Luminance term $l(x,y)$

$$
l=\frac{2\mu_x\mu_y+C_1}{\mu_x^2+\mu_y^2+C_1}
=\frac{2\cdot 30\cdot 30.6667+6.5025}{30^2+30.6667^2+6.5025}
\approx 0.99976
$$

#### (b) Contrast term $c(x,y)$

$$
c=\frac{2\sigma_x\sigma_y+C_2}{\sigma_x^2+\sigma_y^2+C_2}
=\frac{2\cdot 12.2474\cdot 11.3688+58.5225}{12.2474^2+11.3688^2+58.5225}
\approx 0.99771
$$

#### (c) Structure term $s(x,y)$

$$
s=\frac{\sigma_{xy}+C_3}{\sigma_x\sigma_y+C_3}
=\frac{138.75+29.26125}{12.2474\cdot 11.3688+29.26125}
\approx 0.99710
$$

---

#### 3) Final SSIM

$$
SSIM=l\cdot c\cdot s\approx 0.99976\cdot 0.99771\cdot 0.99710\approx \mathbf{0.99458}
$$

**Takeaway:** Despite small brightness/contrast differences, the structures match extremely well (high $\rho$ and high SSIM ≈ **0.995**). This is exactly the kind of case where SSIM (and the structure term) shines compared to plain MSE/PSNR.
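
The arithmetic above can be reproduced with a few lines of plain Python. This is a minimal sketch using only the standard library; the variable names are illustrative, and it also evaluates the combined single-fraction form to show that it agrees with the three-term product when $C_3 = C_2/2$.

```python
import math

# The two 3x3 patches from the example, flattened row by row.
x = [10, 20, 30, 20, 30, 40, 30, 40, 50]
y = [12, 22, 32, 21, 31, 41, 29, 39, 49]
n = len(x)

mu_x = sum(x) / n
mu_y = sum(y) / n

# Sample statistics (divide by N - 1), matching ddof=1 in the text.
var_x = sum((a - mu_x) ** 2 for a in x) / (n - 1)
var_y = sum((b - mu_y) ** 2 for b in y) / (n - 1)
cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / (n - 1)
sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)

# Standard constants for an 8-bit dynamic range.
L = 255
C1, C2 = (0.01 * L) ** 2, (0.03 * L) ** 2
C3 = C2 / 2

lum = (2 * mu_x * mu_y + C1) / (mu_x ** 2 + mu_y ** 2 + C1)
con = (2 * sd_x * sd_y + C2) / (var_x + var_y + C2)
struct = (cov_xy + C3) / (sd_x * sd_y + C3)
ssim_product = lum * con * struct

# Combined single-fraction form; agrees with the product when C3 = C2 / 2.
ssim_combined = ((2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)) / (
    (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2)
)

print(f"l={lum:.5f} c={con:.5f} s={struct:.5f}")
print(f"SSIM (product)  = {ssim_product:.5f}")
print(f"SSIM (combined) = {ssim_combined:.5f}")
```

Note that this uses sample statistics over one whole patch, as in the worked example; windowed implementations typically use Gaussian-weighted statistics instead, so their values will differ slightly.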


### **PyTorch Implementation of SSIM**


#### **Local Computation and the Role of the Gaussian Window**

Unlike pixel-wise losses, SSIM operates on **local patches** — e.g. $ 11 \times 11 $ regions — because the human eye perceives structure **locally**, not globally.

For each pixel location $(i,j)$, a local window $ \mathcal{W}(i,j) $ is centered on that pixel.
Within this window, SSIM computes the following:

* Local means $ \mu_x, \mu_y $
* Local variances $ \sigma_x^2, \sigma_y^2 $
* Local covariance $ \sigma_{xy} $

#### **Why Gaussian?**

Not all pixels in a local window should contribute equally — center pixels are perceptually more relevant than distant ones.
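Therefore, SSIM uses a **Gaussian weighting function** to emphasize the center. Writing $(u, v)$ for the offset from the window center (to avoid confusion with the images $x$ and $y$):

$$
G(u, v) = \frac{1}{2\pi\sigma^2} e^{-\frac{u^2 + v^2}{2\sigma^2}}
$$

In practice the Gaussian is sampled on the discrete window and renormalized so that its weights sum to 1, which gives a **smooth, normalized weighting kernel** over the window.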
It ensures:

* Higher weight near the window center
* Gradual falloff toward the edges
* Smooth transitions between neighboring windows

The local mean (brightness) is then computed as a **weighted average**:
$$
\mu_x(i,j) = (G * x)(i,j)
$$
and similarly for $ \mu_y(i,j) $.

The variance and covariance use the same Gaussian weights:
$$
\sigma_x^2 = (G * x^2) - \mu_x^2, \quad
\sigma_y^2 = (G * y^2) - \mu_y^2, \quad
\sigma_{xy} = (G * xy) - \mu_x\mu_y
$$

So, the Gaussian window defines how **local** and how **smooth** these statistics are.
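
To make the weighted statistics concrete, here is a minimal sketch for a single window position, written in plain Python. The helper names `gaussian_kernel_2d` and `weighted_stats` are hypothetical, and the $3 \times 3$ window with $\sigma = 1$ is chosen only to keep the example small.

```python
import math


def gaussian_kernel_2d(size, sigma):
    """Sampled 2D Gaussian over a size x size window, renormalized to sum to 1."""
    c = size // 2
    raw = [
        [math.exp(-((i - c) ** 2 + (j - c) ** 2) / (2 * sigma ** 2)) for j in range(size)]
        for i in range(size)
    ]
    total = sum(sum(row) for row in raw)
    return [[v / total for v in row] for row in raw]


def weighted_stats(px, py, kernel):
    """Gaussian-weighted means, variances and covariance for one window position."""
    triples = [
        (w, a, b)
        for krow, xrow, yrow in zip(kernel, px, py)
        for w, a, b in zip(krow, xrow, yrow)
    ]
    mu_x = sum(w * a for w, a, _ in triples)
    mu_y = sum(w * b for w, _, b in triples)
    # E_w[x^2] - mu^2, the same form as (G * x^2) - mu_x^2 in the text.
    var_x = sum(w * a * a for w, a, _ in triples) - mu_x ** 2
    var_y = sum(w * b * b for w, _, b in triples) - mu_y ** 2
    cov_xy = sum(w * a * b for w, a, b in triples) - mu_x * mu_y
    return mu_x, mu_y, var_x, var_y, cov_xy


# Toy 3x3 patches (the same values as the worked example earlier in this section).
x = [[10, 20, 30], [20, 30, 40], [30, 40, 50]]
y = [[12, 22, 32], [21, 31, 41], [29, 39, 49]]

kernel = gaussian_kernel_2d(size=3, sigma=1.0)  # sigma chosen for illustration only
mu_x, mu_y, var_x, var_y, cov_xy = weighted_stats(x, y, kernel)
print(mu_x, mu_y, var_x, var_y, cov_xy)
```

Because the corner pixels receive less weight than the center, the weighted variance of `x` comes out smaller than its uniform (unweighted) variance over the same window.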

---



#### **Implementation Breakdown**

Here is how the PyTorch implementation encodes the above:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SSIMLoss(nn.Module):
    def __init__(self, window_size=11, size_average=True):
        super().__init__()
        self.window_size = window_size
        self.size_average = size_average
```

#### **(a) Building the Gaussian Window**

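```python
def gaussian_window(self, window_size, sigma):
    # math.exp (requires `import math`) works on plain Python numbers;
    # torch.exp would need a tensor argument.
    center = window_size // 2
    gauss = torch.tensor([
        math.exp(-((x - center) ** 2) / (2 * sigma ** 2))
        for x in range(window_size)
    ])
    return gauss / gauss.sum()  # normalize so the weights sum to 1
```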

Creates a **1D Gaussian vector** centered at `window_size // 2`.

Then extended to **2D** via outer product:

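```python
# mm() needs 2D inputs, so reshape the 1D kernel to a (k, 1) column first.
_1D_window = self.gaussian_window(window_size, sigma).unsqueeze(1)
_2D_window = _1D_window.mm(_1D_window.t())  # outer product, shape (k, k)
```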

and broadcast to all channels:

```python
window = _2D_window.expand(channel, 1, window_size, window_size)
```

---

#### **(b) Computing Local Statistics**

Using grouped 2D convolution, each channel is processed independently:

```python
mu1 = F.conv2d(img1, window, padding=self.window_size//2, groups=channel)
mu2 = F.conv2d(img2, window, padding=self.window_size//2, groups=channel)
```

This computes local **weighted means** using the Gaussian kernel.

Then:

```python
sigma1_sq = F.conv2d(img1 * img1, window, ...) - mu1**2
sigma2_sq = F.conv2d(img2 * img2, window, ...) - mu2**2
sigma12   = F.conv2d(img1 * img2, window, ...) - mu1 * mu2
```

These correspond to **local variances and covariance**, weighted by the same Gaussian.

---

#### **(c) Computing the SSIM Map**

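```python
# These constants assume intensities in [0, 1] (L = 1);
# for 8-bit inputs use (0.01 * 255)**2 and (0.03 * 255)**2.
C1, C2 = 0.01**2, 0.03**2
ssim_map = ((2*mu1*mu2 + C1) * (2*sigma12 + C2)) / \
           ((mu1**2 + mu2**2 + C1) * (sigma1_sq + sigma2_sq + C2))
```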

This yields a pixel-wise SSIM map (one value per local patch).

---

#### **(d) Turning into a Loss**

The final output is:

```python
return 1 - ssim_map.mean()
```

So the model minimizes dissimilarity (i.e., maximizes SSIM).
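
Putting the fragments together, the following is a self-contained sketch of how such a loss could be assembled. It is not the original author's full code: the names `gaussian_window_2d` and `SSIMLossSketch` are hypothetical, `sigma=1.5` is an assumed default (the text above does not state a value), and inputs are assumed to be `(N, C, H, W)` tensors with intensities in $[0, 1]$.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


def gaussian_window_2d(window_size, sigma, channel):
    """Return a (channel, 1, k, k) normalized Gaussian kernel for grouped conv2d."""
    center = window_size // 2
    g = torch.tensor(
        [math.exp(-((i - center) ** 2) / (2.0 * sigma ** 2)) for i in range(window_size)],
        dtype=torch.float32,
    )
    g = (g / g.sum()).unsqueeze(1)   # column vector, shape (k, 1)
    window_2d = g @ g.t()            # outer product, shape (k, k), sums to 1
    return window_2d.expand(channel, 1, window_size, window_size).contiguous()


class SSIMLossSketch(nn.Module):
    """1 - mean(SSIM map), using Gaussian-weighted local statistics."""

    def __init__(self, window_size=11, sigma=1.5):  # sigma=1.5 is an assumption
        super().__init__()
        self.window_size = window_size
        self.sigma = sigma

    def forward(self, img1, img2):
        channel = img1.shape[1]
        window = gaussian_window_2d(self.window_size, self.sigma, channel).to(
            device=img1.device, dtype=img1.dtype
        )
        pad = self.window_size // 2

        def blur(t):
            # Grouped convolution: each channel is filtered independently.
            return F.conv2d(t, window, padding=pad, groups=channel)

        mu1, mu2 = blur(img1), blur(img2)
        sigma1_sq = blur(img1 * img1) - mu1 ** 2
        sigma2_sq = blur(img2 * img2) - mu2 ** 2
        sigma12 = blur(img1 * img2) - mu1 * mu2

        C1, C2 = 0.01 ** 2, 0.03 ** 2  # assumes intensities in [0, 1]
        ssim_map = ((2 * mu1 * mu2 + C1) * (2 * sigma12 + C2)) / (
            (mu1 ** 2 + mu2 ** 2 + C1) * (sigma1_sq + sigma2_sq + C2)
        )
        return 1 - ssim_map.mean()


# Usage: identical inputs give a loss near 0; unrelated noise gives a larger loss.
torch.manual_seed(0)
a = torch.rand(2, 3, 32, 32)
b = torch.rand(2, 3, 32, 32)
loss_fn = SSIMLossSketch()
print(loss_fn(a, a).item(), loss_fn(a, b).item())
```

The kernel is rebuilt on every call to keep the sketch short; caching it (for example with `register_buffer`) would avoid that overhead in a training loop.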

---

#### **Interpretation**

| Aspect                   | Meaning                                     | Implementation                |
| ------------------------ | ------------------------------------------- | ----------------------------- |
| **Window**               | Defines local region around each pixel      | Gaussian kernel (e.g., 11×11) |
| **Gaussian weighting**   | Emphasizes center pixels, smooth transition | `gaussian_window()`           |
| **Luminance similarity** | Compares brightness                         | `mu1`, `mu2`                  |
| **Contrast similarity**  | Compares variance                           | `sigma1_sq`, `sigma2_sq`      |
| **Structure similarity** | Compares pattern correlation                | `sigma12`                     |
| **Final SSIM**           | Combined measure                            | `ssim_map = ...`              |
| **Loss**                 | $1 - \text{mean}(\text{SSIM})$              | Returned as `SSIMLoss`        |

---

#### **Typical Use in Deep Learning**

In methods such as **Monodepth2** (self-supervised monocular depth estimation), the photometric loss combines SSIM and L1:

$$
L_{photo} = \alpha \cdot L_{SSIM} + (1 - \alpha) \cdot L_{L1}
$$

with $\alpha = 0.85$. Monodepth2 defines the SSIM term as $L_{SSIM} = (1 - \text{SSIM}) / 2$, which keeps it in $[0, 1]$.
This way, the loss captures both structural similarity and pixel accuracy.
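
As a small illustration of the weighting, the sketch below blends the two terms for scalar inputs. The function name `photometric_loss` and the sample values are hypothetical, and the SSIM term is assumed to be $(1 - \text{SSIM})/2$ clamped to $[0, 1]$, following the Monodepth2 formulation.

```python
def photometric_loss(ssim_value, l1_value, alpha=0.85):
    """Blend an SSIM-based term with an L1 term.

    Assumes the SSIM term is (1 - SSIM) / 2 clamped to [0, 1]; swap in
    1 - SSIM if that is the convention in use.
    """
    ssim_term = min(max((1.0 - ssim_value) / 2.0, 0.0), 1.0)
    return alpha * ssim_term + (1.0 - alpha) * l1_value


# Hypothetical per-pixel values: a well-matched pixel and a poorly matched one.
good = photometric_loss(ssim_value=0.95, l1_value=0.02)
bad = photometric_loss(ssim_value=0.30, l1_value=0.25)
print(good, bad)
```

With $\alpha = 0.85$ the SSIM term dominates, so structural mismatches are penalized more heavily than small absolute intensity errors.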

---

#### **Behavior**

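* If two images are identical:
  $SSIM = 1 \Rightarrow L_{SSIM} = 0$
* If they are structurally unrelated:
  $SSIM \approx 0 \Rightarrow L_{SSIM} \approx 1$
* If local structure is inverted (negative covariance), SSIM can be negative, so with $L_{SSIM} = 1 - SSIM$ the loss can exceed 1 (up to 2).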

---

### **Summary**

* The **Gaussian window** defines the **local region** and gives **weighted emphasis** toward the center.
* SSIM computes **luminance**, **contrast**, and **structure** similarity **within** that window.
* Using **convolution with a Gaussian kernel** efficiently computes local statistics for every pixel.
* The final loss, $1 - \text{SSIM}$, encourages the model to produce images that are perceptually and structurally similar to the target.

---



###  Normalized Correlation
The **normalized correlation** part of SSIM is the most “structural” component, so it’s worth understanding carefully.

**Step 1: Represent the Images**

Suppose you have two image patches $x$ and $y$ of size $N$ pixels (can be grayscale or single channel).

$$
x = [x_1, x_2, \dots, x_N], \quad y = [y_1, y_2, \dots, y_N]
$$

---

**Step 2: Compute the Mean**

Compute the mean intensity (average brightness) of each patch:

$$
\mu_x = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad
\mu_y = \frac{1}{N} \sum_{i=1}^{N} y_i
$$

---

**Step 3: Compute the Standard Deviations**

Compute how much pixel values vary around the mean:

$$
\sigma_x = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \mu_x)^2}
$$

$$
\sigma_y = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (y_i - \mu_y)^2}
$$

These represent the **contrast** of each image.

---

**Step 4: Compute the Covariance**

Covariance measures how much the two patches vary *together*:

$$
\sigma_{xy} = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)
$$

* If $x$ and $y$ increase/decrease together → covariance is **positive**.
* If one increases when the other decreases → covariance is **negative**.

---

**Step 5: Normalize → Get the Correlation**

The **Pearson correlation coefficient** is just the covariance normalized by the product of standard deviations:

$$
\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}
$$

* This value ranges between **-1 and 1**.

  * $1.0 \Rightarrow$ perfect positive linear correlation (structures align perfectly).
  * $0 \Rightarrow$ no linear correlation (structures unrelated).
  * $-1.0 \Rightarrow$ perfect negative correlation (inverted contrast).

---

**Step 6: Add Stabilization for SSIM**

In SSIM, to avoid division by zero when contrast is very low, we use a small constant $C_3$:

$$
s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}
$$

This keeps the measure well-defined even for very flat patches (e.g., almost uniform gray).
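
The steps above can be sketched in plain Python. The helper names `sample_stats` and `structure_term` and the toy patches are illustrative; $C_3$ is set to $C_2/2$ for 8-bit images, as in the worked example. The sketch shows two edge cases: an inverted pattern, where the correlation is $-1$, and a pair of flat patches, where $\rho$ would be $0/0$ but $s(x, y)$ stays defined.

```python
import math


def sample_stats(x, y):
    """Sample standard deviations and sample covariance (N - 1 denominator)."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    sd_x = math.sqrt(sum((a - mu_x) ** 2 for a in x) / (n - 1))
    sd_y = math.sqrt(sum((b - mu_y) ** 2 for b in y) / (n - 1))
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / (n - 1)
    return sd_x, sd_y, cov


def structure_term(x, y, c3):
    """Stabilized structure term s(x, y) = (cov + C3) / (sd_x * sd_y + C3)."""
    sd_x, sd_y, cov = sample_stats(x, y)
    return (cov + c3) / (sd_x * sd_y + c3)


C3 = (0.03 * 255) ** 2 / 2  # C2 / 2 for 8-bit images

ramp = [10, 20, 30, 40, 50]
inverted = [50, 40, 30, 20, 10]      # same pattern with flipped contrast
flat_a = [100, 100, 100, 100, 100]   # zero contrast
flat_b = [100, 100, 100, 100, 100]

sd_x, sd_y, cov = sample_stats(ramp, inverted)
rho = cov / (sd_x * sd_y)                      # Pearson correlation
s_inverted = structure_term(ramp, inverted, C3)
s_flat = structure_term(flat_a, flat_b, C3)    # rho is undefined here, s is not
print(rho, s_inverted, s_flat)
```

For the inverted pair the stabilized term is negative but closer to zero than $\rho$, because $C_3$ is added to both numerator and denominator.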

---

**Intuition**

* **Covariance** tells you whether pixel intensities move together.
* **Normalization** by $\sigma_x \sigma_y$ removes the effect of scale/contrast so you focus purely on **structure** (edges, textures, gradients).

This is why SSIM can still give a high similarity score if one image is slightly brighter/darker — because the **pattern** of variations is the same.

---

SSIM tries to mimic the **human visual system** by:

* **Normalizing for lighting** (so small brightness changes are penalized only mildly).
* **Normalizing for contrast** (so small contrast changes are less penalized).
* **Measuring structure** (so it cares about edges, patterns, textures).





## LPIPS

So far, we have discussed **SSIM**, a hand-crafted metric. **LPIPS (Learned Perceptual Image Patch Similarity)** goes a step further: instead of manually designing similarity measures, it uses **deep features** from pretrained networks (e.g., AlexNet, VGG, SqueezeNet) to capture perceptual similarity.


* Proposed in **"The Unreasonable Effectiveness of Deep Features as a Perceptual Metric"** (Zhang et al., 2018).
* Idea: Humans judge images by *perceptual similarity*, not pixel-wise equality.
* LPIPS measures distance in the **feature space** of a pretrained CNN rather than raw pixels.

---

####  How It Works

1. Take two images $x$ and $y$.
2. Pass both through a **pretrained network** (e.g., VGG).
3. Extract activations from multiple layers (feature maps).
4. Normalize each feature vector to unit length along the channel dimension and compute the **squared L2 distance** per spatial location.
5. Average the distances over spatial positions and sum over layers.
6. Optionally, train small linear weights to better align with human judgments.

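$$
LPIPS(x,y) = \sum_l \frac{1}{H_l W_l} \sum_{h,w} \left\| w_l \odot \left( \hat{f}_l(x)_{h,w} - \hat{f}_l(y)_{h,w} \right) \right\|_2^2
$$

* $f_l$: feature map from layer $l$, with spatial size $H_l \times W_l$.
* $\hat{f}_l$: $f_l$ unit-normalized along the channel dimension.
* $w_l$: learned per-channel weight vector for layer $l$ ($\odot$ denotes the element-wise product).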
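
The distance computation itself can be sketched on toy data without a pretrained network. In the plain-Python sketch below, each layer is represented as a list of channel vectors (one per spatial position); the function name `lpips_like_distance`, the feature values, and the weights are all made up for illustration, and the weights are assumed to be per-channel vectors as in the LPIPS paper. Real LPIPS scores require activations and learned weights from an actual pretrained model.

```python
import math


def unit_normalize(v, eps=1e-10):
    """Scale a channel vector to unit L2 length (eps guards against zero vectors)."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / (norm + eps) for c in v]


def lpips_like_distance(feats_x, feats_y, weights):
    """Sum over layers of the spatially averaged, channel-weighted squared distance.

    feats_x / feats_y: one entry per layer; each entry is a list of channel
    vectors, one per spatial position. weights: one per-channel vector per layer.
    """
    total = 0.0
    for layer_x, layer_y, w in zip(feats_x, feats_y, weights):
        layer_sum = 0.0
        for vx, vy in zip(layer_x, layer_y):
            nx, ny = unit_normalize(vx), unit_normalize(vy)
            layer_sum += sum((wc * (a - b)) ** 2 for wc, a, b in zip(w, nx, ny))
        total += layer_sum / len(layer_x)  # average over spatial positions
    return total


# Toy "activations": layer 1 has 2 positions x 3 channels, layer 2 has 1 x 2.
feats_a = [[[1.0, 0.0, 2.0], [0.5, 0.5, 0.0]], [[3.0, 4.0]]]
feats_b = [[[1.0, 1.0, 2.0], [0.0, 0.5, 0.5]], [[4.0, 3.0]]]
weights = [[1.0, 0.5, 0.25], [1.0, 1.0]]  # hypothetical, not learned values

d_same = lpips_like_distance(feats_a, feats_a, weights)
d_diff = lpips_like_distance(feats_a, feats_b, weights)
print(d_same, d_diff)
```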

---

####  Why LPIPS is Important

* **MSE/PSNR**: pixel-wise, not perceptual.
* **SSIM**: structural but still hand-crafted.
* **LPIPS**: learned perceptual similarity; in the human-judgment study of Zhang et al. (2018) it agreed with human ratings more closely than PSNR or SSIM.

In practice, LPIPS is one of the most widely used metrics for evaluating perceptual image quality (GANs, super-resolution, style transfer, inpainting).

---

**LPIPS in Deep Learning Workflows**

* Used as an **evaluation metric** for generative models (GANs, diffusion, etc.).
* Sometimes used as a **loss function** (LPIPS loss) for training perceptual similarity.
* Often combined with pixel losses (L1/L2) or SSIM.


---

####  Comparison: SSIM vs LPIPS

| Metric       | Based on                           | Pros                                                         | Cons                                             |
| ------------ | ---------------------------------- | ------------------------------------------------------------ | ------------------------------------------------ |
| **MSE/PSNR** | Pixel differences                  | Simple, fast                                                 | Not perceptual, sensitive to shifts              |
| **SSIM**     | Luminance, contrast, structure     | Better perceptual alignment                                  | Hand-crafted, less robust to complex distortions |
| **LPIPS**    | Deep features (VGG, AlexNet, etc.) | Correlates closely with human judgments, widely used in GAN evaluation | Heavier, requires pretrained nets                |

---