# Cluster Logos

Acest notebook preia logo-urile din 'dataset/logos_preprocessed/', le grupeaza in functie de similitudine, le salveaza in "dataset/logos_clustered" si salveaza domeniile in clustere intr-un fisier clusters.pkl.

M-am gandit la o reducere dimenionala a imaginilor vectorizate, urmata de compararea vectorilor sub diferite norme in spatiul de dimensiune redusa. Am aplicat PCA pe matricea de date, matricea de date fiind obtinuta prin alipirea fiecarei imagini vectorizate ca vector-coloana. Initial, acesti vectori au 256x256x3 = 196608 elemente fiecare. Am redus initial la 250 de elemente fiecare vector urmand sa le compar in norma euclidiana. Norma euclidiana s-a dovedit a fi buna pentru a filtra logo urile identice care au putin zgomot de resize, etc provenite de pe domenii detinute de aceeasi firma. Intrucat nu suntem interesati sa testam identitatea logo-urilor, le-am filtrat pe cele care se aflau la distanta mica unele de altele in sens euclidian. Pe lista rezultata, de logo-uri distincte, am aplicat PCA cu 200 de elemente si le-am comparat distanta in sens unghiular, de directie a vectorilor, care s-a dovedit a fi o norma mai buna in cazul in care se cauta similitudine. Un tuning a fost pe parametrii de distanta acceptata si numarul de componente la care reduce PCA-ul. Acesti doi parametri captureaza tradeoff-ul dintre cat de generala sau particulara e forma unui logo. Numar mai mic de componente si distanta acceptata mai mare favorizeaza generalizarea, iar viceversa particularizarea.

Clusteringul efectiv l-am facut prin calculul unei matrici de distante, de la fiecare element la fiecare element, apoi folosirea unui graf, unde aceste clustere deveneau componente conexe in cazul in care distanta dintre noduri era mai mica decat cea din parametrul setat. Logo-urile care nu au putut fi puse in niciun cluster pentru ca erau prea indepartate de orice punct au fost salvate intr-o categorie diferita de anomalii. In contextul asta, anomalii inseamna ca acele logo-uri sunt originale.

Alte directii interesante ar fi fost compararea datelor in niste domenii de frecventa ale unor transformate, de ex. Discrete Cosine Transform. O alta metoda buna pentru reducerea dimensionala ar fi fost dictionary learning care ar fi capturat mai bine caracterisiticile specifice, urmat probabil de PCA pentru reducerea dimensionalitatii datelor sparse.

In [20]:
import os
import numpy as np
from PIL import Image
from sklearn.decomposition import PCA
import shutil
import pickle

In [None]:
INPUT_FOLDER = "dataset/logos_processed"
OUTPUT_FOLDER_PCA = "dataset/logos_processed_pca"
OUTPUT_FOLDER_CLUSTER = "dataset/logos_clustered"
ORIGINAL_SHAPE=(256, 256, 3)

In [7]:
DIST_PCA = 0.10
N_COMPONENTS_PCA = 250
METRIC_PCA = "cosine"

In [None]:
def load_image_matrices(file_list = None, size=(256, 256), normalize=True):

    images = {}
    valid_exts = (".png", ".jpg", ".jpeg", ".webp")

    if file_list is None:
        file_list = [
            fname for fname in os.listdir(INPUT_FOLDER)
            if fname.lower().endswith(valid_exts)
        ]

    for fname in file_list:
        path = os.path.join(INPUT_FOLDER, fname)

        if not os.path.exists(path):
            print(f"[WARN] Fisierul '{fname}' nu a fost gasit în {INPUT_FOLDER}")
            continue

        try:
            img = Image.open(path).convert("RGB")
            img = img.resize(size, Image.Resampling.LANCZOS)
            arr = np.array(img, dtype=np.float32)
            if normalize:
                arr /= 255.0
            images[fname] = arr
        except Exception as e:
            print(f"[ERROR] Eroare la '{fname}': {e}")

    print(f"{len(images)} imagini incarcate din {INPUT_FOLDER}.")
    return images

In [None]:
def save_clusters(cluster_result):

    clusters = cluster_result.get("clusters")
    if clusters is None:
        raise ValueError("cluster_result nu conține cheia 'clusters'.")

    if os.path.exists(OUTPUT_FOLDER_CLUSTER):
        shutil.rmtree(OUTPUT_FOLDER_CLUSTER)

    os.makedirs(OUTPUT_FOLDER_CLUSTER, exist_ok=True)

    for cluster_id, image_names in clusters.items():
        cluster_folder = os.path.join(OUTPUT_FOLDER_CLUSTER, f"cluster_{cluster_id}")
        os.makedirs(cluster_folder, exist_ok=True)

        for name in image_names:
            src_path = os.path.join(INPUT_FOLDER, name)
            if os.path.exists(src_path):
                shutil.copy(src_path, cluster_folder)
            else:
                print(f"Imagine lipsa: {src_path}")

    print("Toate imaginile au fost organizate cu succes.")


In [None]:
def select_representative_names(cluster_result):

    clusters = cluster_result.get("clusters", {})
    if not clusters:
        raise ValueError("cluster_result nu contine cheia 'clusters' sau e gol.")

    selected = []

    if "anomalies" in clusters:
        selected.extend(clusters["anomalies"])
    
    # if "cluster_0" in clusters:
    #     selected.extend(clusters["cluster_0"])

    for cid, imgs in clusters.items():
        if cid == "anomalies":
            continue
        if imgs:
            selected.append(imgs[0]) 

    print(f"{len(selected)} imagini selectate (anomalies + 1 per cluster).")
    return selected


In [None]:
def pca_compress_reconstruct_logos(images_dict, n_components, save = False):
    names = list(images_dict.keys())
    data = np.array([img.flatten() for img in images_dict.values()]) 
    print(f"Matrice de date: {data.shape} (num_imagini, pixeli_per_imagine)")

    print(f"Se aplică PCA cu {n_components} componente...")
    pca = PCA(n_components=n_components)
    Z = pca.fit_transform(data)
    print("PCA antrenat si imaginile comprimate.")

    if save:
        os.makedirs(OUTPUT_FOLDER_PCA, exist_ok=True)

        X_recon = pca.inverse_transform(Z)
        example_shape = next(iter(images_dict.values())).shape
        h, w, c = example_shape

        for i, name in enumerate(names):
            img_recon = X_recon[i].reshape(h, w, c)
            img_recon = np.clip(img_recon, 0, 1)
            out_img = (img_recon * 255).astype(np.uint8)

            out_path = os.path.join(OUTPUT_FOLDER_PCA, f"{name}_pca.png")
            Image.fromarray(out_img).save(out_path)

        np.save(os.path.join(OUTPUT_FOLDER_PCA, "pca_components.npy"), pca.components_)
        np.save(os.path.join(OUTPUT_FOLDER_PCA, "pca_mean.npy"), pca.mean_)
        np.save(os.path.join(OUTPUT_FOLDER_PCA, "compressed_Z.npy"), Z)

        print(f"Imaginile reconstruite salvate in '{OUTPUT_FOLDER_PCA}'")

    result = {'Z': Z, 'W': pca.components_, 'mean': pca.mean_, 'names': names}
    return result

In [None]:
def compute_distance_matrix(Z, metric = "cosine", normalize = True):

    Z = np.asarray(Z, dtype=np.float64)

    if metric == "cosine":
        if normalize:
            norms = np.linalg.norm(Z, axis=1, keepdims=True)
            safe_norms = np.where(norms == 0.0, 1.0, norms)
            Z = Z / safe_norms
        S = Z @ Z.T
        np.clip(S, -1.0, 1.0, out=S)
        D = 1.0 - S  
    elif metric == "euclidean":
        D = np.sqrt(
            np.sum(Z**2, axis=1, keepdims=True)
            + np.sum(Z**2, axis=1)
            - 2 * (Z @ Z.T)
        )
    elif metric == "manhattan":
        n = Z.shape[0]
        D = np.zeros((n, n))
        for i in range(n):
            D[i] = np.sum(np.abs(Z[i] - Z), axis=1)
    else:
        raise ValueError(f"Metrica necunoscuta: {metric}")

    np.fill_diagonal(D, 0.0)
    return D

In [None]:
def cluster_by_similarity_radius(Z, names, dist = 0.15, metric = "cosine", return_distance_matrix = False, normalize = True):

    Z = np.asarray(Z, dtype=np.float64)
    n = Z.shape[0]
    if len(names) != n:
        raise ValueError("Eroare dimensiune")

    print(f"Calcul distanta ({metric}) intre {n} imagini ...")
    D = compute_distance_matrix(Z, metric=metric, normalize=normalize)

    neigh = (D <= dist)
    np.fill_diagonal(neigh, False)

    labels = -np.ones(n, dtype=int)
    cid = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        stack = [i]
        labels[i] = cid
        while stack:
            u = stack.pop()
            for v in np.where(neigh[u])[0]:
                if labels[v] == -1:
                    labels[v] = cid
                    stack.append(v)
        cid += 1

    clusters = {}
    for lab, name in zip(labels.tolist(), names):
        clusters.setdefault(int(lab), []).append(name)

    anomalies = []
    new_clusters = {}
    new_label_map = {}
    new_id = 0

    for cid, imgs in clusters.items():
        if len(imgs) == 1:
            anomalies.extend(imgs)
            new_label_map[cid] = -1
        else:
            new_clusters[new_id] = imgs
            new_label_map[cid] = new_id
            new_id += 1

    if anomalies:
        new_clusters["anomalies"] = anomalies
        print(f"{len(anomalies)} anomalii (clustere de un singur element).")

    updated_labels = np.array([new_label_map[l] for l in labels], dtype=object)

    print(f"{len(new_clusters)} clustere (inclusiv 'anomalies', dist={dist}, metric={metric}).")

    result = {
        "labels": updated_labels,
        "clusters": new_clusters,
        "method": "radius_components",
        "metric": metric,
    }
    if return_distance_matrix:
        result["D"] = D

    return result

In [14]:
logos = load_image_matrices()

2314 imagini incarcate din dataset/logos_processed.


In [15]:
n_pca_components_euclidean = 250
result_pca_euclidean = pca_compress_reconstruct_logos(logos, n_components = n_pca_components_euclidean)

Matrice de date: (2314, 196608) (num_imagini, pixeli_per_imagine)
Se aplică PCA cu 250 componente...
PCA antrenat si imaginile comprimate.


In [None]:
# norma euclidiana este foarte buna pentru a filtra logo urile identice care au putin zgomot de resize, etc
# intrucat nu suntem interesati sa testam identitatea la logo uri, putem da drop la cele asemanatoare dpdv eucildian
# iar pe restul sa le comparam intr un sens unghiular

metric = 'euclidean'
dist = 10

res_cluster_euclidean = cluster_by_similarity_radius(
    Z=result_pca_euclidean['Z'],
    names=result_pca_euclidean['names'],
    dist=dist,                   
    metric = metric,
    return_distance_matrix=False
)

for cid, imgs in sorted(res_cluster_euclidean["clusters"].items(), key=lambda x: len(x[1]), reverse=True):
    print(f"Cluster {cid}: {len(imgs)} imagini")
# save_clusters(res_cluster_euclidean)

Calcul distanta (euclidean) intre 2314 imagini ...
677 anomalii (clustere de un singur element).
213 clustere (inclusiv 'anomalies', dist=10, metric=euclidean).
Cluster anomalies: 677 imagini
Cluster 3: 219 imagini
Cluster 28: 77 imagini
Cluster 84: 54 imagini
Cluster 6: 39 imagini
Cluster 53: 37 imagini
Cluster 72: 35 imagini
Cluster 31: 31 imagini
Cluster 27: 29 imagini
Cluster 36: 29 imagini
Cluster 9: 28 imagini
Cluster 104: 28 imagini
Cluster 135: 27 imagini
Cluster 12: 24 imagini
Cluster 46: 24 imagini
Cluster 25: 20 imagini
Cluster 40: 18 imagini
Cluster 18: 15 imagini
Cluster 78: 15 imagini
Cluster 118: 15 imagini
Cluster 24: 14 imagini
Cluster 13: 13 imagini
Cluster 41: 12 imagini
Cluster 60: 11 imagini
Cluster 1: 10 imagini
Cluster 62: 10 imagini
Cluster 39: 9 imagini
Cluster 44: 9 imagini
Cluster 76: 9 imagini
Cluster 123: 9 imagini
Cluster 151: 9 imagini
Cluster 161: 9 imagini
Cluster 179: 9 imagini
Cluster 22: 8 imagini
Cluster 48: 8 imagini
Cluster 64: 8 imagini
Cluster 7

  D = np.sqrt(


Toate imaginile au fost organizate cu succes.


In [17]:
representatives = select_representative_names(res_cluster_euclidean)
print(f"\nTotal reprezentanți selectați: {len(representatives)}")

distinctive_logos = load_image_matrices(file_list = representatives)

n_pca_components_cosine = 200
result_pca_cosine = pca_compress_reconstruct_logos(distinctive_logos, n_components = n_pca_components_cosine)

889 imagini selectate (anomalies + 1 per cluster).

Total reprezentanți selectați: 889
889 imagini incarcate din dataset/logos_processed.
Matrice de date: (889, 196608) (num_imagini, pixeli_per_imagine)
Se aplică PCA cu 200 componente...
PCA antrenat si imaginile comprimate.


In [18]:
dist = 0.15
metric = 'cosine'

res_final_cluster = cluster_by_similarity_radius(
    Z=result_pca_cosine['Z'],
    names=result_pca_cosine['names'],
    dist=dist,                   
    metric = metric,
    return_distance_matrix=False
)

for cid, imgs in sorted(res_final_cluster["clusters"].items(), key=lambda x: len(x[1]), reverse=True):
    print(f"Cluster {cid}: {len(imgs)} imagini")

save_clusters(res_final_cluster)

Calcul distanta (cosine) intre 889 imagini ...
444 anomalii (clustere de un singur element).
81 clustere (inclusiv 'anomalies', dist=0.15, metric=cosine).
Cluster anomalies: 444 imagini
Cluster 0: 162 imagini
Cluster 2: 12 imagini
Cluster 21: 10 imagini
Cluster 17: 9 imagini
Cluster 29: 9 imagini
Cluster 15: 8 imagini
Cluster 19: 8 imagini
Cluster 8: 7 imagini
Cluster 10: 7 imagini
Cluster 13: 7 imagini
Cluster 5: 6 imagini
Cluster 9: 6 imagini
Cluster 16: 6 imagini
Cluster 26: 6 imagini
Cluster 30: 6 imagini
Cluster 31: 6 imagini
Cluster 3: 5 imagini
Cluster 11: 5 imagini
Cluster 20: 5 imagini
Cluster 46: 5 imagini
Cluster 69: 5 imagini
Cluster 4: 4 imagini
Cluster 24: 4 imagini
Cluster 32: 4 imagini
Cluster 34: 4 imagini
Cluster 37: 4 imagini
Cluster 39: 4 imagini
Cluster 43: 4 imagini
Cluster 64: 4 imagini
Cluster 7: 3 imagini
Cluster 12: 3 imagini
Cluster 23: 3 imagini
Cluster 27: 3 imagini
Cluster 36: 3 imagini
Cluster 52: 3 imagini
Cluster 55: 3 imagini
Cluster 56: 3 imagini
Clus

In [19]:
with open("clusters.pkl", "wb") as f:
    pickle.dump(res_final_cluster["clusters"], f)