# Quantitative Evaluation

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import numpy as np
import torch
import os
from PIL import Image

path = '/content/drive/MyDrive/Pic 16B/CAN'

### a) Fréchet Inception Distance (FID)

FID is a metric used to quantitatively assess the quality of images produced by a generative model. It improves upon the *inception score* by comparing the generated images with real images, rather than only evaluating how well a classifier (Inception v3) can recognize the generated images as known objects.



$$d^2((m, C), (m_w, C_w)) = ||m-m_w||_2^{2} + \mathrm{Tr}\left(C+C_w - 2(CC_w)^{1/2}\right)$$

Source: *GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium (Heusel et al. 2017)*

Essentially, FID captures the difference between two Gaussian distributions fitted to the synthetic and real images (m = feature-wise mean, C = covariance matrix, subscript w = real-world data). As a "spin-off" of the inception score, it also relies on the Inception network: the two Gaussians are fitted to the activations of the pool_3 layer of InceptionNet for generated and real samples.


*Remark: It is recommended to use a minimum sample size of 10,000 to calculate the FID; otherwise the true FID of the generator is underestimated.* In fact, I think more than 50,000 samples is preferred.

*Remark 2: FID scores depend largely on the number of samples (fewer samples = larger score), so it is crucial to use the same number of samples for every model being compared.*

Implementation: Complicated. It requires loading a pre-trained InceptionV3, resizing the images, extracting its activations on both real and generated images, calculating the mean and covariance of each set, and then feeding them into the formula above:

In [None]:
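# Sketch only (not run here); assumes the pytorch-fid package and a batch tensor `images`
# from pytorch_fid.inception import InceptionV3
# device = 'cuda' if torch.cuda.is_available() else 'cpu'
# model = InceptionV3().to(device)
# model.eval()

# with torch.no_grad():
#   pred = model(images.to(device))[0]  # the model returns a list of feature maps

# act = pred.squeeze(3).squeeze(2).cpu().numpy()  # pooled activations, one row per image
# mu = np.mean(act, axis=0)
# sigma = np.cov(act, rowvar=False)

# etc...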
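
The last step of that recipe, turning two sets of activations into a distance, can be sketched in plain NumPy. This is only an illustration under assumptions: `frechet_distance`, `act_gen`, and `act_real` are hypothetical names, the inputs are assumed to be `(N, D)` arrays of activations, and the trace of the matrix square root is obtained from eigenvalues rather than an explicit matrix square root, which may differ from what any given FID package does internally.

```python
import numpy as np

def frechet_distance(act_gen, act_real):
    """Frechet distance between Gaussians fitted to two (N, D) activation arrays."""
    mu, mu_w = act_gen.mean(axis=0), act_real.mean(axis=0)
    sigma = np.cov(act_gen, rowvar=False)
    sigma_w = np.cov(act_real, rowvar=False)
    # Tr((C C_w)^{1/2}) equals the sum of the square roots of the eigenvalues of C C_w.
    # For covariance matrices these are real and non-negative up to numerical noise,
    # so the imaginary parts are dropped and tiny negatives are clipped to zero.
    eigvals = np.linalg.eigvals(sigma @ sigma_w)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0, None)).sum()
    mean_term = np.sum((mu - mu_w) ** 2)
    return float(mean_term + np.trace(sigma) + np.trace(sigma_w) - 2 * tr_sqrt)

# Toy check on random "activations": identical sets give (numerically) zero,
# and shifting every feature by 3 adds ||m - m_w||^2 = 8 * 3^2 to the distance.
rng = np.random.default_rng(0)
a = rng.normal(size=(500, 8))
same = frechet_distance(a, a)
shifted = frechet_distance(a, a + 3.0)
```

On real data the rows would be Inception activations rather than random numbers, and the number of samples should exceed the feature dimension so that the covariance estimates are well conditioned.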

Instead, we will import a handy package to do this for us:

In [None]:
!pip install clean-fid

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting clean-fid
  Downloading clean_fid-0.1.35-py3-none-any.whl (26 kB)
Installing collected packages: clean-fid
Successfully installed clean-fid-0.1.35


#### Remove some duplicates (already done, not necessary to run again)
(formatted as fake_img_x (1).png)

In [None]:
for file in os.listdir(f'{path}/fake_images'):
  if file.endswith(').png'):
    os.remove(f'{path}/fake_images/{file}')

#### Remove style from wikiart_real image names (already done, not necessary to run again)

In [None]:
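for file in os.listdir(f'{path}/wikiart_real'):
  # Keep the first two underscore-separated fields of the name (drops the style suffix).
  # The extension is stripped first so that already-renamed files keep their name.
  stem = os.path.splitext(file)[0]
  new = "_".join(stem.split("_", 2)[:2])
  os.rename(f'{path}/wikiart_real/{file}', f'{path}/wikiart_real/{new}.png')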

### Calculate FID

In [None]:
from cleanfid import fid
dcgan_score = fid.compute_fid(f'{path}/fake_images', f'{path}/wikiart_real')

compute FID between two folders




Found 10000 images in the folder /content/drive/MyDrive/Pic 16B/CAN/fake_images


FID fake_images : 100%|██████████| 313/313 [03:53<00:00,  1.34it/s]


Found 10000 images in the folder /content/drive/MyDrive/Pic 16B/CAN/wikiart_real


FID wikiart_real : 100%|██████████| 313/313 [03:19<00:00,  1.57it/s]


In [None]:
can_score = fid.compute_fid(f'{path}/CAN_images', f'{path}/wikiart_real')

compute FID between two folders
Found 10000 images in the folder /content/drive/MyDrive/Pic 16B/CAN/CAN_images


FID CAN_images : 100%|██████████| 313/313 [03:42<00:00,  1.41it/s]


Found 10000 images in the folder /content/drive/MyDrive/Pic 16B/CAN/wikiart_real


FID wikiart_real : 100%|██████████| 313/313 [01:20<00:00,  3.90it/s]


In [None]:
dcgan_score, can_score

(185.15256074979845, 370.78927324817784)

### b) Traditional image processing metrics

*Source: PyTorch Image Quality: Metrics for Image Quality Assessment*

In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

!pip install piq
!pip install lpips

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# Helper functions to retrieve a batch of 10 images, resize if needed, and convert to np arrays

def load_img_batch(folder, prefix, idx, size=None):
  imgs = []
  for j in range(idx, idx + 10):
    im_frame = Image.open(f'{path}/{folder}/{prefix}_{j}.png')
    if size is not None:
      im_frame = im_frame.resize(size)
    imgs.append(np.array(im_frame))
  return imgs

def get_dcgan_fake_img_batch(idx):
  # only the DCGAN outputs are resized (to 128x128)
  return load_img_batch('fake_images', 'fake_img', idx, size=(128, 128))

def get_can_fake_img_batch(idx):
  return load_img_batch('CAN_images', 'gen_image', idx)

def get_real_img_batch(idx):
  return load_img_batch('wikiart_real', 'image', idx)

SSIM (Structural Similarity Index Measure), PSNR (Peak Signal-to-Noise Ratio), etc. are full-reference metrics: they compare an image to an uncompressed, distortion-free reference of the same image. They are therefore better suited to image compression and restoration tasks than to judging generative image quality, but we can try them anyway.
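
PSNR is not computed in this notebook, but since it is the simplest full-reference metric, a minimal sketch may help make the idea concrete. It assumes two same-shaped images with pixel values in `[0, data_range]`; the function and parameter names are illustrative (piq also ships its own `psnr`).

```python
import numpy as np

def psnr(x, y, data_range=1.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    mse = np.mean((x - y) ** 2)  # mean squared error against the reference
    if mse == 0:
        return float('inf')  # identical images
    return float(10 * np.log10(data_range ** 2 / mse))

# A uniform error of 0.1 on a [0, 1] image gives MSE = 0.01, i.e. about 20 dB.
reference = np.zeros((8, 8))
distorted = np.full((8, 8), 0.1)
value = psnr(distorted, reference)
```

A higher PSNR means the image is closer to its reference, which is why the metric only makes sense when a true reference for the same image exists.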

#### i) Structural Similarity Index Measure (SSIM)

- Calculates perceived structural differences, based on the idea that spatially close pixels have strong inter-dependencies
- A weighted combination of three comparison measurements (luminance, contrast, and structure), computed using a sliding Gaussian window
- A value of 1 indicates perfect similarity, 0 = no similarity, and -1 = perfect anti-correlation

$$l(x,y) = \frac{2\mu_x\mu_y + c_1}{\mu_x^2+\mu_y^2+c_1}$$
$$c(x,y) = \frac{2\sigma_x\sigma_y + c_2}{\sigma_x^2+\sigma_y^2+c_2}$$
$$s(x,y) = \frac{\sigma_{xy} + c_3}{\sigma_x\sigma_y+c_3}$$

where $\mu$ is the pixel sample mean, $\sigma^2$ is the variance, $\sigma_{xy}$ is the covariance, and $c_1 = (k_1L)^2$ and $c_2 = (k_2L)^2$ are stabilizing constants, with $L$ the dynamic range of the pixel values, $k_1=0.01$, $k_2=0.03$, and $c_3 = c_2/2$.

$$SSIM(x,y) = l(x,y)^\alpha c(x,y)^\beta s(x,y)^\gamma$$

where $\alpha$, $\beta$, and $\gamma$ are weighting exponents, typically all set to 1.
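
To make the three terms concrete, here is a simplified sketch that evaluates them once over the whole image instead of with a sliding Gaussian window, so its values will not match `piq.ssim`. It assumes $\alpha = \beta = \gamma = 1$, pixel values in `[0, data_range]`, and the standard structure term with $\sigma_x\sigma_y$ in the denominator; `global_ssim` is a hypothetical helper name.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """Single-window SSIM: luminance * contrast * structure over the whole image."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2
    c3 = c2 / 2
    mu_x, mu_y = x.mean(), y.mean()
    sigma_x, sigma_y = x.std(), y.std()
    sigma_xy = ((x - mu_x) * (y - mu_y)).mean()  # covariance
    lum = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    con = (2 * sigma_x * sigma_y + c2) / (sigma_x ** 2 + sigma_y ** 2 + c2)
    struct = (sigma_xy + c3) / (sigma_x * sigma_y + c3)
    return float(lum * con * struct)

rng = np.random.default_rng(0)
img = rng.random((32, 32))
identical = global_ssim(img, img)       # an image compared with itself
inverted = global_ssim(img, 1.0 - img)  # an image compared with its negative
```

Comparing an image with itself gives a score of 1, while comparing it with its negative gives a negative score because the covariance term changes sign.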


To calculate SSIM, we take our 10,000 images from each set (real, DCGAN, CAN) in batches of 10, pair each generated batch with the real batch at the same indices, calculate SSIM for each batch pair, and finally compute the mean SSIM.

In [None]:
from piq import ssim

In [None]:
from tqdm import tqdm

# We also calculate Learned Perceptual Image Patch Similarity (LPIPS),
# a common perceptual metric that is also often used as a loss for training GANs. More formally, it
# measures perceptual distance in the feature space of the AlexNet model (or VGG, depending on which you use).
# A lower score indicates that the compared images are more similar.
import lpips
lpips_alex = lpips.LPIPS(net='alex') # best forward scores

def to_tensor(imgs):
  # list of HxWxC uint8 arrays -> float tensor of shape (N, C, H, W) scaled to [0, 1]
  batch = np.transpose(np.stack(imgs), (0, 3, 1, 2))
  return torch.Tensor(batch / 255)

dcgan_ssims = []
can_ssims = []
dcgan_lpipses = []
can_lpipses = []

for i in tqdm(range(0, 10000, 10)):
  dcgan_imgs = to_tensor(get_dcgan_fake_img_batch(i))
  can_imgs = to_tensor(get_can_fake_img_batch(i))
  real_imgs = to_tensor(get_real_img_batch(i))

  dcgan_ssims.append(ssim(dcgan_imgs, real_imgs).item())
  can_ssims.append(ssim(can_imgs, real_imgs).item())

  # LPIPS expects inputs in [-1, 1]; normalize=True rescales [0, 1] inputs accordingly.
  # (The logged output below came from a run that passed raw [0, 255] values.)
  dcgan_lpipses.append(lpips_alex(dcgan_imgs, real_imgs, normalize=True).mean().item())
  can_lpipses.append(lpips_alex(can_imgs, real_imgs, normalize=True).mean().item())

print("\nMean SSIM:")
print(f'DCGAN: {np.mean(dcgan_ssims)}, CAN: {np.mean(can_ssims)}')
print("Mean LPIPS:")
print(f'DCGAN: {np.mean(dcgan_lpipses)}, CAN: {np.mean(can_lpipses)}')

Setting up [LPIPS] perceptual loss: trunk [alex], v[0.1], spatial [off]




Loading model from: /usr/local/lib/python3.10/dist-packages/lpips/weights/v0.1/alex.pth


100%|██████████| 1000/1000 [2:01:00<00:00,  7.26s/it]

Mean SSIM:
DCGAN: 0.08275356287509203, CAN: 0.011302134850993753
Mean LPIPS:
DCGAN: 0.3396139343678951, CAN: 0.8619114523530006





In contrast, we can also try no-reference image quality metrics such as BRISQUE (Blind/Referenceless Image Spatial Quality Evaluator), NIQE (Natural Image Quality Evaluator), or PIQE (Perception based Image Quality Evaluator). These are computed directly on the generated images, with no reference to the real images, and use statistical features to evaluate image quality.

#### iii) Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE)

- Extracts 36 point-wise statistical features of locally normalized luminance coefficients to measure deviations from a Natural Scene Statistics (NSS)-based model
- Smaller scores indicate higher perceptual quality
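
The "locally normalized luminance coefficients" are commonly called MSCN (mean-subtracted, contrast-normalized) coefficients. The sketch below shows one way they could be computed for a grayscale image in plain NumPy. The 7-tap Gaussian window, `sigma = 7/6`, and stabilizing constant `c = 1` are assumptions taken from the usual description of BRISQUE, the helper names are hypothetical, and the `brisque` package may implement this step differently.

```python
import numpy as np

def gaussian_kernel_1d(size=7, sigma=7 / 6):
    """Normalized 1-D Gaussian weights."""
    ax = np.arange(size) - (size - 1) / 2
    k = np.exp(-ax ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def blur(img, kernel):
    """Separable Gaussian blur with edge padding; output has the input's shape."""
    pad = len(kernel) // 2
    padded = np.pad(img, pad, mode='edge')
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode='valid'), 1, padded)
    return np.apply_along_axis(lambda col: np.convolve(col, kernel, mode='valid'), 0, rows)

def mscn(img, c=1.0):
    """Subtract the local mean and divide by the local standard deviation."""
    img = np.asarray(img, dtype=np.float64)
    k = gaussian_kernel_1d()
    mu = blur(img, k)                             # local mean
    var = blur(img ** 2, k) - mu ** 2             # local variance
    sigma = np.sqrt(np.clip(var, 0, None))        # clip tiny negatives from rounding
    return (img - mu) / (sigma + c)

flat = mscn(np.full((16, 16), 100.0))             # a featureless image
rng = np.random.default_rng(0)
noisy = mscn(rng.integers(0, 256, size=(16, 16)))
```

BRISQUE then fits parametric distributions to these coefficients (and to products of neighboring coefficients) at two scales, and it is the fitted parameters that form the feature vector passed to the quality model.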



In [None]:
!pip install brisque

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
from brisque import BRISQUE

obj = BRISQUE(url=False)

from PIL import Image

dcgan_brisques = []
can_brisques = []
real_brisques = []
for i in tqdm(range(0, 10000, 10)):
  dcgan_imgs = get_dcgan_fake_img_batch(i)
  can_imgs = get_can_fake_img_batch(i)
  real_imgs = get_real_img_batch(i)

  dcgan_brisques.extend([obj.score(img) for img in dcgan_imgs])
  can_brisques.extend([obj.score(img) for img in can_imgs])
  real_brisques.extend([obj.score(img) for img in real_imgs])

print("\nMean BRISQUE scores:")
print(f'DCGAN: {np.mean(dcgan_brisques)}, CAN: {np.mean(can_brisques)}, Real: {np.mean(real_brisques)}')

100%|██████████| 1000/1000 [4:07:41<00:00, 14.86s/it]


Mean BRISQUE scores:
DCGAN: 50.716437383869525, CAN: 88.80208471716486, Real: 18.88789619882305



