# Import Dependencies

In [3]:
import os 
import numpy as np 
import cv2 as cv
import matplotlib.pyplot as plt

from pygments.formatters import img
from tqdm import tqdm

from skimage.metrics import structural_similarity
from sklearn.cluster import KMeans

from preprocessing.edge_extraction import *
from feature_extraction import * 
from preprocessing.fourier_transform import * 
from preprocessing.image_conversion import * 
from clustering import *
from preprocessing.contrast_enhancement import *

# Pre-processing

To reduce noise in images of whole artworks and fragments, we initially considered using the Fourier transform to process the images in the frequency domain.

While converting an image from RGBA to grayscale simplifies processing, it results in the loss of RGB color and alpha channel data, which can be problematic if that information is needed later. Therefore, we chose to split the image into its primary color channels (excluding the alpha channel) and process each channel separately in the frequency domain. After filtering, we planned to reconstruct the filtered image by recombining the processed channels.

However, after several trials, we found that processing the channels separately led to significant information loss in one or more channels. Consequently, we decided to use the NLMeansDenoising filter instead.

Since our goal is to cluster fragments that belong to the same image, we focus on maintaining "continuity" along the fragment borders. Therefore, our process emphasizes the information present along these edges.

Steps:
1. Extract a working region from the borders of the fragment.
2. Filter out the transparent pixels from the working region.
3. Denoise the working region.

**CONSIDERATION**: Contrast enhancement.

# Feature Extraction

To extract relevant features from the fragments, we employ two methods:
- Color Histograms
- Gradient Jacobians

## Color Histograms

Color histograms are graphical representations of the distribution of colors in an image. They quantify the number of pixels that have specific color values, effectively capturing the color composition of the image. By analyzing the color histograms of image fragments, we can compare and cluster similar fragments based on their color distributions.

**This technique is particularly useful for identifying and matching regions of images that share similar color patterns**.

## Similarity Structural Index Measure

The Similarity Structural Index Measure (SSIM) is used as a metric to measure the similarity between two given images incorporating illuminance, contrast and structural information. 
For each reference image we compute the SSIM for each fragment on the key idea that the fragments of the same image should have the highest SSIM since they share structural, illumination 
and contrast information with the reference image.

We use both the color histogram and the SSIM to mainly highlight color and structural information.

### Structural Similarity (example)

In [None]:
from skimage.metrics import structural_similarity
import cv2
import numpy as np

before = cv2.imread('data/5.38.35.png')
after = cv2.imread('references/5.37.jpg')

max_w = max(before.shape[0], after.shape[0])
max_h = max(before.shape[1], after.shape[1])

before = cv2.resize(before, (max_w, max_h))
after = cv2.resize(after, (max_w, max_h))

# Convert images to grayscale
before_gray = cv2.cvtColor(before, cv2.COLOR_BGR2GRAY)
after_gray = cv2.cvtColor(after, cv2.COLOR_BGR2GRAY)

# Compute SSIM between two images
(score, diff) = structural_similarity(before_gray, after_gray, full=True)
print("Image similarity", score)

# The diff image contains the actual image differences between the two images
# and is represented as a floating point data type in the range [0,1]
# so we must convert the array to 8-bit unsigned integers in the range
# [0,255] before we can use it with OpenCV
diff = (diff * 255).astype("uint8")

# Threshold the difference image, followed by finding contours to
# obtain the regions of the two input images that differ
thresh = cv2.threshold(diff, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)[1]
contours = cv2.findContours(thresh.copy(), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
contours = contours[0] if len(contours) == 2 else contours[1]

mask = np.zeros(before.shape, dtype='uint8')
filled_after = after.copy()

for c in contours:
    area = cv2.contourArea(c)
    if area > 40:
        x,y,w,h = cv2.boundingRect(c)
        cv2.rectangle(before, (x, y), (x + w, y + h), (36,255,12), 2)
        cv2.rectangle(after, (x, y), (x + w, y + h), (36,255,12), 2)
        cv2.drawContours(mask, [c], 0, (0,255,0), -1)
        cv2.drawContours(filled_after, [c], 0, (0,255,0), -1)

cv2.imshow('before', before)
cv2.imshow('after', after)
cv2.imshow('diff',diff)
cv2.imshow('mask',mask)
cv2.imshow('filled after',filled_after)
cv2.waitKey(0)

# Clustering

We perform "iterative clustering" using color histograms and SSIM scores for each fragment relative to a reference image. Our goal is to create two clusters for each reference image:
- **IN-CLUSTER**: Contains all and only the fragments of the reference image.
- **OUT-CLUSTER**: Contains the spurious fragments.

**Determining Cluster Identity**

To determine which cluster is which, we use "dummy" precision, recall, and F1 scores, leveraging our knowledge of the reference image and the number of its fragments. Initially, we focus on the cluster with the highest recall score, identifying it as the IN-CLUSTER. This cluster has the most fragments of the reference image under examination.

Next, we refine the IN-CLUSTER to reduce the number of spurious fragments, thereby increasing its precision and purity. Once refined, the fragments in the IN-CLUSTER are excluded from further clustering iterations.

This process is repeated for each reference image.


## Dataset Creation

We compute the SSIM score on entire fragments, but consider only the color histograms of the borders for reasons of "continuity." If two fragments are from the same image, they should have similar color distribution on the borders, especially if they were originally adjacent.

In [8]:
threshold = 5
references_path = "references"
data_dir = "./data"

In [9]:
reference_images_ids = [reference.split(".")[1] for reference in tqdm(os.listdir(references_path), desc="Retrieving reference ids")]
reference_images = [cv.imread(os.path.join(references_path, reference), cv.IMREAD_UNCHANGED) for reference in tqdm(os.listdir(references_path), desc="Retrieving reference images")]

Retrieving reference ids: 100%|██████████| 8/8 [00:00<00:00, 33354.31it/s]
Retrieving reference images: 100%|██████████| 8/8 [00:00<00:00, 57.45it/s]


In [20]:
def compute_ssim_scores(fragments, reference_image, reference_id):
    ssim_scores = []
    
    # compute the SSIM for each fragment with regard to a specific reference image
    for fragment in tqdm(fragments, desc=f"Calculating SSIM scores for reference ID {reference_id}"):
        max_w = max(fragment.shape[0], reference_image.shape[0])
        max_h = max(fragment.shape[1], reference_image.shape[1])
    
        fragment = cv.resize(fragment, (max_w, max_h))
        reference = cv.resize(reference_image, (max_w, max_h))
    
        # Convert images to grayscale
        fragment_gray = cv.cvtColor(fragment, cv.COLOR_BGR2GRAY)
        reference_gray = cv.cvtColor(reference, cv.COLOR_BGR2GRAY)
        (score, diff) = structural_similarity(fragment_gray, reference_gray, full=True)
        ssim_scores.append(score)
    
    return ssim_scores

In [25]:
def define_IN_clusters(reference_id, threshold, output_dir, metric):
    scores = compute_metrics(reference_id, output_dir, metric=metric)
    print(scores)
    opt_clusters = {}
    # select the IN-CLUSTER
    # this is the generalized form in case we want to consider more than 2 clusters of more than one reference image per time
    for max_item in scores[f"max_{metric}"]:
        if max_item[1] >= threshold:
            if reference_id in opt_clusters:
                opt_clusters[reference_id].append(max_item[0])
            else:
                opt_clusters[reference_id] = [max_item[0]]
                    
    return opt_clusters

In [23]:
optimal_data_dir = "./optimal_data"
metric = "recall"
metric_threshold = 0.80
output_dir = "clusters/kmeans/colors_ssim"
optimal_dir = "optimal_clusters/kmeans/colors_ssim"

In [27]:
os.makedirs(optimal_data_dir, exist_ok=True)
n_references = len(reference_images_ids)
c = 0

for i in range(len(reference_images_ids)):
    reference_id = reference_images_ids[i]
    reference_image = reference_images[i]

    print(f"Iteration {i + 1} out of {n_references} - Reference ID {reference_id}")
    
    working_region_fragments_dataset = create_dataset(img_dir=data_dir, threshold=threshold)
    original_fragments_dataset = create_dataset(img_dir=data_dir, extract_borders=False)

    ssim_scores = compute_ssim_scores(original_fragments_dataset, reference_image, reference_id)
    color_histograms = compute_color_histograms(working_region_fragments_dataset)
    # actual dataset
    X = []
    for idx, color_histogram in enumerate(color_histograms):
        combined_features = np.concatenate((color_histogram, [ssim_scores[idx]]))
        X.append(combined_features)
    
    X = np.array(X)
    
    kmeans = KMeans(n_clusters=2, random_state=42)
    fit_kmeans = kmeans.fit(X)
    create_cluster_dirs(data_dir=data_dir, output_dir=output_dir, labels=fit_kmeans.labels_)
    IN_clusters = define_IN_clusters(reference_id=reference_id, threshold=metric_threshold, output_dir=output_dir, metric=metric)

    print(IN_clusters)
    if len(IN_clusters) == 0 or c != 0:
        break

    # refine the IN-Cluster
    # ....
    
    # move the refined IN-Cluster (optimal) to another path and reinitiate the clustering process without those fragments
    os.makedirs(optimal_dir, exist_ok=True)
    
    for reference_id, cluster_dirs in IN_clusters.items():
        reference_dir = os.path.join(optimal_dir, reference_id)
        os.makedirs(reference_dir, exist_ok=True)
        
        for cluster_dir in cluster_dirs:
            img_dir = os.path.join("clusters/kmeans/colors_ssim", cluster_dir)
            for filename in os.listdir(img_dir):
                shutil.copy(os.path.join(img_dir, filename), os.path.join(reference_dir, filename))
                shutil.move(os.path.join(data_dir, filename), os.path.join(optimal_data_dir, filename))
            shutil.rmtree(img_dir)
        del reference_images_ids[i]
        del reference_images[i]
    c = 1

Iteration 1 out of 8 - Reference ID 33


Creating dataset: 100%|██████████| 328/328 [00:03<00:00, 101.21it/s]
Creating dataset: 100%|██████████| 328/328 [00:13<00:00, 23.71it/s]
Calculating SSIM scores for reference ID 33: 100%|██████████| 328/328 [00:02<00:00, 113.76it/s]
Computing color histograms: 100%|██████████| 328/328 [00:00<00:00, 106736.89it/s]
Creating cluster dirs: 100%|██████████| 328/328 [00:00<00:00, 2897.76it/s]


{'max_recall': [('cluster_0', 0.9555555555555556)], 'scores': {'cluster_1': {'precision': 0.010416666666666666, 'recall': 0.044444444444444446, 'f1': 0.01687763713080169}, 'cluster_0': {'precision': 0.3161764705882353, 'recall': 0.9555555555555556, 'f1': 0.47513812154696133}}}
{'33': ['cluster_0']}
Iteration 2 out of 8 - Reference ID 36


Creating dataset: 100%|██████████| 192/192 [00:01<00:00, 115.92it/s]
Creating dataset: 100%|██████████| 192/192 [00:05<00:00, 37.90it/s]
Calculating SSIM scores for reference ID 36:  11%|█▏        | 22/192 [00:09<01:16,  2.22it/s]


KeyboardInterrupt: 

In [28]:
restore_data(optimal_data_dir, data_dir)