# Face Extraction

This function extracts faces from a folder of images (flat or nested structure) using DeepFace. All face embeddings and their metadata are returned directly in memory (`faces_array` and `metadata`).


### Arguments:

| Parameter           | Type    | Description                                                                                                          |
| ------------------- | ------- | -------------------------------------------------------------------------------------------------------------------- |
| `event_folder_path` | `str`   | Path to the folder containing images (either a directory of images or a folder of album subfolders).                 |
| `min_face_size`     | `int`   | Minimum face width and height (in pixels). A face is skipped if either dimension is smaller. Default: `27`.          |
| `min_confidence`    | `float` | Minimum detection confidence for a face to be considered valid. Default: `1.0`.                                      |


### Return

`faces_array`: NumPy array containing the face embeddings of all valid faces.

`metadata`: List of dictionaries with the metadata of each extracted face, including the source file location, the facial area, and the detection confidence.



### Output

`cropped_dir`: folder containing the cropped face images, with padding applied.

### Additional Notes
- A face is skipped if:

    - Its width or height is too large (>= 800 pixels, most likely the whole image).
    - Its width or height is smaller than `min_face_size`.
    - No face is detected.
    - Its confidence is below `min_confidence` (default `1.0`).
- Cropping adds 30% padding on the left, right, top, and bottom to preserve the context around the face.
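
The skip rules and the 30% padding above can be sketched as two small helpers. This is only an illustration: `should_skip`, `padded_box`, and the `max_face_size` parameter are hypothetical names (the notebook hard-codes 800), and the box values in the example are made up.

```python
def should_skip(w, h, confidence, min_face_size=27, min_confidence=1.0, max_face_size=800):
    """Return True when a detected face fails any of the filters listed above."""
    if w >= max_face_size or h >= max_face_size:
        return True   # probably the whole image rather than a face
    if w < min_face_size or h < min_face_size:
        return True   # too small to be useful
    return confidence < min_confidence


def padded_box(x, y, w, h, img_width, img_height, pad=0.3):
    """Expand a face box by `pad` on every side, clamped to the image bounds."""
    left = max(0, x - int(w * pad))
    top = max(0, y - int(h * pad))
    right = min(img_width, x + w + int(w * pad))
    bottom = min(img_height, y + h + int(h * pad))
    return left, top, right, bottom   # (left, upper, right, lower), as PIL's Image.crop expects


print(padded_box(100, 100, 50, 40, 1000, 1000))  # (85, 88, 165, 152)
print(padded_box(5, 5, 50, 40, 60, 50))          # (0, 0, 60, 50), clamped to the image
```

Near an image edge the padding is simply cut off, so crops of faces close to the border end up less than 30% larger on that side.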



### Supported Folder Structures

**1. Flat Folder**

```
ef-efekta/
├── img1.jpg
├── img2.png
├── ...
```

**2. Nested Folder (Album per Subfolder)**
```
ef-efekta/
├── album1/
│   ├── a1_img1.jpg
│   └── a1_img2.jpg
├── album2/
│   ├── a2_img1.jpg
│   └── a2_img2.jpg
```
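
As a rough sketch of how the two layouts can be told apart, the hypothetical helper `detect_layout` below applies the same check as the extraction code: any image at the top level means flat mode, otherwise subfolders mean nested mode. The temporary folders exist only to try it out.

```python
import os
import tempfile

IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png', '.bmp')


def detect_layout(folder):
    """Classify a folder as 'flat', 'nested', or 'empty' (hypothetical helper)."""
    items = os.listdir(folder)
    has_images = any(f.lower().endswith(IMAGE_EXTENSIONS) for f in items)
    has_subdirs = any(os.path.isdir(os.path.join(folder, f)) for f in items)
    if has_images:
        return "flat"      # top-level images win; subfolders are ignored in this mode
    if has_subdirs:
        return "nested"
    return "empty"


# Build two throwaway folders to try the helper on.
with tempfile.TemporaryDirectory() as root:
    flat = os.path.join(root, "flat_event")
    nested = os.path.join(root, "nested_event")
    os.makedirs(flat)
    os.makedirs(os.path.join(nested, "album1"))
    open(os.path.join(flat, "img1.jpg"), "w").close()
    open(os.path.join(nested, "album1", "a1_img1.jpg"), "w").close()
    layouts = (detect_layout(flat), detect_layout(nested))

print(layouts)  # ('flat', 'nested')
```

A folder that mixes top-level images with album subfolders is treated as flat, so the images inside its subfolders would not be processed.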


In [None]:
import os
from tqdm import tqdm
from PIL import Image
import numpy as np
from deepface import DeepFace

def extract_faces(event_folder_path, min_face_size=27, min_confidence=1.0):
    """
    Extract faces from an event folder without saving a cache, filtering by face size and confidence.

    Args:
        event_folder_path (str): Path to the image folder (flat or nested)
        min_face_size (int): Minimum face width/height in pixels
        min_confidence (float): Minimum confidence for a face to be considered valid

    Returns:
        faces_array (np.ndarray): Embeddings of the valid faces
        metadata (List[Dict]): Metadata of the valid faces
    """
    if not os.path.exists(event_folder_path):
        print(f"❌ Source folder not found: {event_folder_path}")
        return np.array([]), []

    faces, metadata = [], []
    event_id = os.path.basename(event_folder_path)

    def process_image(img_path, album_id="main", album_name="main"):
        nonlocal faces, metadata
        try:
            result = DeepFace.represent(
                img_path=img_path,
                model_name="Facenet512",
                detector_backend="retinaface",
                align=True,
                enforce_detection=False
            )
            img = Image.open(img_path).convert("RGB")
            img_width, img_height = img.size
            img_name = os.path.basename(img_path)

            for face_idx, face_data in enumerate(result):
                facial_area = face_data['facial_area']
                x, y, w, h = facial_area['x'], facial_area['y'], facial_area['w'], facial_area['h']
                confidence = face_data.get("face_confidence", 0.0)

                if w >= 800 or h >= 800:
                    continue  # Skip full-image faces but keep checking the remaining faces
                if w < min_face_size or h < min_face_size:
                    continue  # Skip small faces
                if confidence < min_confidence:
                    print(f"⚠️ Skipped face with confidence {confidence:.2f} from {img_path}")
                    continue

                # Crop with padding
                pad = 0.3
                new_x = max(0, x - int(w * pad))
                new_y = max(0, y - int(h * pad))
                new_w = min(img_width, x + w + int(w * pad)) - new_x
                new_h = min(img_height, y + h + int(h * pad)) - new_y
                cropped_img = img.crop((new_x, new_y, new_x + new_w, new_y + new_h))
                cropped_name = f"{os.path.splitext(img_name)[0]}_f-{face_idx}.jpg"
                cropped_path = os.path.join("cropped_dir", cropped_name)
                os.makedirs("cropped_dir", exist_ok=True)
                cropped_img.save(cropped_path)

                face_vector = face_data['embedding']
                faces.append(face_vector)
                metadata.append({
                    "foto_id": cropped_path,  # path to the cropped face file, used later when copying into cluster folders
                    "album": {
                        "id": album_id,
                        "name": album_name,
                        "event": {
                            "id": event_id,
                            "name": event_id
                        }
                    },
                    "embedding": face_vector,
                    "cluster_id": None,
                    "path": img_path,
                    "facial_area": facial_area,
                    "face_confidence": confidence
                })

        except Exception as e:
            print(f"⚠️ Error processing {img_path}: {e}")

    # Check folder structure
    items = os.listdir(event_folder_path)
    image_files = [f for f in items if f.lower().endswith(('.jpg', '.jpeg', '.png', '.bmp'))]
    subdirs = [f for f in items if os.path.isdir(os.path.join(event_folder_path, f))]

    if image_files:
        print("📁 Flat folder mode")
        with tqdm(total=len(image_files), desc="Processing Images") as pbar:
            for fname in image_files:
                process_image(os.path.join(event_folder_path, fname))
                pbar.update(1)

    elif subdirs:
        print("📁 Nested folder mode")
        total = sum(len(os.listdir(os.path.join(event_folder_path, sd))) for sd in subdirs)
        with tqdm(total=total, desc="Processing Albums") as pbar:
            for album_name in subdirs:
                album_path = os.path.join(event_folder_path, album_name)
                for fname in os.listdir(album_path):
                    if fname.lower().endswith(('.jpg', '.jpeg', '.png', '.bmp')):
                        process_image(os.path.join(album_path, fname), album_id=album_name, album_name=album_name)
                    pbar.update(1)
    else:
        print("❌ No images or subdirectories found.")
        return np.array([]), []

    faces_array = np.array(faces)
    print(f"\n✅ Done! Total valid faces: {len(faces_array)}, Metadata: {len(metadata)}")
    return faces_array, metadata


In [None]:
SOURCE_FOLDER = "ef-efekta"  # folder name
faces, metadata = extract_faces(SOURCE_FOLDER)

print(f"Number of faces found: {len(faces)}")
dim = f"{faces.shape[1]}D" if len(faces) > 0 else "N/A"
print(f"Embedding dimension: {dim}")


# Clustering

This function clusters the face embeddings using UMAP (for dimensionality reduction) combined with HDBSCAN (for clustering), then automatically organizes the faces into folders according to the resulting clusters.



### Arguments

| Parameter          | Type         | Description                                                                                     |
| ------------------ | ------------ | ----------------------------------------------------------------------------------------------- |
| `faces`            | `np.ndarray` | Array of high-dimensional face embeddings (e.g. `(N, 512)`)                                     |
| `metadata`         | `List[dict]` | List of per-face metadata (must be aligned with `faces`)                                        |
| `min_cluster_size` | `int`        | Minimum size of a cluster. Groups with fewer faces are treated as outliers. Default: `2`        |
| `metric`           | `str`        | Distance metric for HDBSCAN (`'euclidean'`, `'manhattan'`, etc.). Default: `'euclidean'`        |
| `output_dir`       | `str`        | Name of the output folder for the clustered faces. Default: `"clustered_faces"`                 |


### Return 

`cluster_labels` → `np.ndarray` with the cluster label of each face (-1 means outlier)

`n_clusters` → number of valid clusters (outliers excluded)

`n_outliers` → number of faces that do not belong to any cluster
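
To make the relationship between the three return values concrete, here is a minimal sketch that derives the two counts from a label sequence. `summarize_labels` is a hypothetical helper and the labels are made up.

```python
def summarize_labels(cluster_labels):
    """Count valid clusters and outliers from a sequence of labels (-1 = outlier)."""
    labels = [int(label) for label in cluster_labels]
    n_clusters = len(set(labels) - {-1})   # distinct labels, ignoring the outlier label
    n_outliers = labels.count(-1)
    return n_clusters, n_outliers


# Made-up labels for seven faces
print(summarize_labels([0, 0, 1, -1, 1, -1, 2]))  # (3, 2)
```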

### Configuration

**UMAP**
- n_neighbors: 3
- n_components: 38 (35 when there are more than 1000 faces)
- min_dist: 0.0
- metric: cosine
- random_state: 42

**Dynamic UMAP n_components**

- If the number of faces > 1000 → n_components = 35
- If the number of faces ≤ 1000 → n_components = 38
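
The rule above fits in a one-line conditional. The sketch below wraps it in a hypothetical helper, `pick_n_components`. The optional `clamp_to_samples` flag is an assumption added here and is not part of the notebook: it keeps the target dimension below the number of faces, since very small datasets may not support a 38-dimensional reduction.

```python
def pick_n_components(n_faces, clamp_to_samples=False):
    """Return the UMAP n_components for a given number of faces."""
    n_components = 35 if n_faces > 1000 else 38
    if clamp_to_samples:
        # Assumption, not in the notebook: stay below the sample count, never under 2.
        n_components = max(2, min(n_components, n_faces - 2))
    return n_components


for n in (50, 1000, 1001):
    print(n, pick_n_components(n))
# 50 38
# 1000 38
# 1001 35

print(pick_n_components(20, clamp_to_samples=True))  # 18
```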

**HDBSCAN**
- min_cluster_size: 2
- metric: euclidean
- min_samples: 2





### Output Structure

```
clustered_faces/
├── cluster_00/
│   ├── IMG001_f-0.jpg
│   ├── IMG010_f-1.jpg
│   └── ...
├── cluster_01/
│   └── ...
└── outliers/
    └── IMG099_f-3.jpg
```
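
The folder names in this layout follow directly from the cluster labels. As a small sketch (the helper names and file names are hypothetical), label `-1` maps to `outliers` and every other label is zero-padded to at least two digits:

```python
from collections import defaultdict


def cluster_folder_name(label):
    """Map a cluster label to its folder name."""
    return "outliers" if label == -1 else f"cluster_{label:02d}"


def group_by_folder(file_names, cluster_labels):
    """Group file names under the folder their label maps to."""
    groups = defaultdict(list)
    for name, label in zip(file_names, cluster_labels):
        groups[cluster_folder_name(int(label))].append(name)
    return dict(groups)


# Made-up file names and labels
print(group_by_folder(["a.jpg", "b.jpg", "c.jpg"], [0, 0, -1]))
# {'cluster_00': ['a.jpg', 'b.jpg'], 'outliers': ['c.jpg']}
```

Labels of 100 and above still work; they simply use three digits (`cluster_100`).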

In [None]:
import os
import shutil
import numpy as np
from collections import defaultdict
import umap
import hdbscan

def cluster_faces(faces, metadata, min_cluster_size=2, metric='euclidean', output_dir="clustered_faces"):
    """
    Cluster face embeddings using UMAP + HDBSCAN and organize into folders.

    Args:
        faces (np.ndarray): Embedding vectors for each face (e.g., shape (N, 512))
        metadata (list[dict]): Metadata list, one per face
        min_cluster_size (int): Minimum number of samples in a cluster
        metric (str): Distance metric for clustering (e.g., 'euclidean', 'manhattan')
        output_dir (str): Output directory to save clustered faces

    Returns:
        tuple: (cluster_labels, n_clusters, n_outliers)
    """
    n_faces = len(faces)
    if n_faces == 0:
        print("❌ No face embeddings provided.")
        return np.array([]), 0, 0

    if n_faces < min_cluster_size:
        print(f"⚠️ Only {len(faces)} faces available; min_cluster_size={min_cluster_size}")
        cluster_labels = np.array([-1] * len(faces))
        _organize_faces(metadata, cluster_labels, output_dir)
        return cluster_labels, 0, len(faces)
    
    n_components = 35 if n_faces > 1000 else 38
    # UMAP Dimensionality Reduction
    reducer = umap.UMAP(
        n_neighbors=3,
        n_components=n_components,
        min_dist=0.0,
        metric='cosine',
        random_state=42
    )
    reduced_embeddings = reducer.fit_transform(faces)

    # HDBSCAN Clustering
    clusterer = hdbscan.HDBSCAN(
        min_cluster_size=min_cluster_size,
        min_samples=2,
        metric=metric,
        cluster_selection_method='eom'
    )
    cluster_labels = clusterer.fit_predict(reduced_embeddings)

    n_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
    n_outliers = list(cluster_labels).count(-1)

    # Update metadata with cluster_id
    for i, label in enumerate(cluster_labels):
        metadata[i]['cluster_id'] = int(label)

    # Organize clustered faces into folders
    _organize_faces(metadata, cluster_labels, output_dir)

    print(f"✅ Clustering completed: {n_clusters} clusters, {n_outliers} outliers.")
    return cluster_labels, n_clusters, n_outliers


def _organize_faces(metadata, cluster_labels, output_dir):
    """
    Copy face images into folders based on their cluster labels.
    """
    print("📁 Organizing faces into cluster folders...")
    if os.path.exists(output_dir):
        shutil.rmtree(output_dir)
    os.makedirs(output_dir)

    cluster_groups = defaultdict(list)
    for i, label in enumerate(cluster_labels):
        cluster_groups[label].append(metadata[i])

    for cluster_id, faces_in_cluster in cluster_groups.items():
        folder_name = "outliers" if cluster_id == -1 else f"cluster_{cluster_id:02d}"
        cluster_dir = os.path.join(output_dir, folder_name)
        os.makedirs(cluster_dir, exist_ok=True)

        for face_data in faces_in_cluster:
            src_path = face_data['foto_id']
            if os.path.exists(src_path):
                dst_path = os.path.join(cluster_dir, os.path.basename(src_path))
                shutil.copy2(src_path, dst_path)


In [None]:
cluster_labels, n_clusters, n_outliers = cluster_faces(
    faces=faces,
    metadata=metadata,
    output_dir="clustered_faces"
)


# Centroid Extraction

This function computes the centroid (mean embedding) of each cluster and picks the single most representative face (the one most similar to the centroid). Outliers (`cluster_id == -1`) are skipped.

### Parameters:
- `metadata` (`list`): List of face metadata. Each entry must have `embedding` and `cluster_id`.

### Output:
A dictionary keyed by `cluster_id`, where each entry has this structure:
- `cluster_id`: Cluster ID as a string.
- `centroid_embedding`: Centroid vector (the embedding of the face closest to the mean).
- `representative`: Metadata of the face most similar to the centroid.
- `centroid_id`: The `foto_id` of the representative face.
- `event_id`, `album_id`: Event and album names, taken from the representative face's metadata.
- `members`: All members of the cluster.
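
The selection step can be illustrated without NumPy or scikit-learn. The sketch below uses only the standard library and made-up 2-D vectors. The helper names are hypothetical, and it assumes no embedding has zero length.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(values) / n for values in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def pick_representative(embeddings):
    """Return (index, centroid) for the embedding most similar to the mean."""
    centroid = mean_vector(embeddings)
    sims = [cosine_similarity(centroid, e) for e in embeddings]
    best_idx = max(range(len(sims)), key=sims.__getitem__)
    return best_idx, centroid


toy_cluster = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
best_idx, centroid = pick_representative(toy_cluster)
print(best_idx)  # 2, because [1.0, 1.0] points in the same direction as the mean
```

Storing the representative's own embedding instead of the raw mean keeps the stored vector tied to a real face image, at the cost of being slightly off the true center of the cluster.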


In [None]:
import os
import numpy as np
from collections import defaultdict
from sklearn.metrics.pairwise import cosine_similarity

def compute_centroids(metadata: list) -> dict:
    """
    Compute the centroid (mean embedding) and the representative metadata of each face cluster.

    Parameters:
        metadata (list): List of face metadata entries that have 'cluster_id' and 'embedding'.

    Returns:
        dict: Dictionary holding the centroid embedding, representative metadata, and cluster members for each cluster_id.
    """

    cluster_to_items = defaultdict(list)
    for item in metadata:
        cluster_id = item.get("cluster_id")
        if cluster_id == -1 or item.get("embedding") is None:
            continue
        cluster_to_items[cluster_id].append(item)

    centroids = {}

    for cluster_id, items in cluster_to_items.items():
        embeddings = np.array([item["embedding"] for item in items])

        # Compute the centroid as the mean embedding
        centroid_vec = np.mean(embeddings, axis=0)

        # Find the face most similar to the centroid
        sims = cosine_similarity([centroid_vec], embeddings)[0]
        best_idx = np.argmax(sims)
        representative_meta = items[best_idx]

        # Store the result; event and album names come from the representative's metadata
        centroids[cluster_id] = {
            "cluster_id": str(cluster_id),
            "event_id": representative_meta['album']['event']['name'],
            "album_id": representative_meta['album']['name'],
            "centroid_id": representative_meta["foto_id"],
            "centroid_embedding": representative_meta['embedding'],
            "representative": representative_meta,
            "members": items,
        }

    return centroids


In [None]:
centroids = compute_centroids(metadata)


# ChromaDB for Faces


Face embeddings are stored in the `face_embeddings` collection in ChromaDB, using `cosine` as the matching metric. The metadata of each face is stored as flat string values so it can be retrieved later.

### Metadata Structure:
- `foto_id`: Path to the cropped face file
- `album_id`: Album ID
- `event_id`: Event ID
- `cluster_id`: Cluster ID assigned by clustering
- `path`: Path to the original image
- `face_confidence`: Face detection confidence
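
The nested metadata produced during extraction has to be flattened into these six fields before it is stored. A minimal sketch of that mapping follows. `flatten_face_metadata` is a hypothetical helper and the sample entry is made up.

```python
def flatten_face_metadata(meta):
    """Turn one nested metadata entry into a flat dict of strings."""
    album = meta.get("album", {})
    return {
        "foto_id": str(meta.get("foto_id", "")),
        "album_id": str(album.get("id", "")),
        "event_id": str(album.get("event", {}).get("id", "")),
        "cluster_id": str(meta.get("cluster_id", "")),
        "path": str(meta.get("path", "")),
        "face_confidence": str(meta.get("face_confidence", "")),
    }


# Made-up entry shaped like the extraction metadata
sample = {
    "foto_id": "cropped_dir/img1_f-0.jpg",
    "album": {"id": "album1", "name": "album1", "event": {"id": "ef-efekta", "name": "ef-efekta"}},
    "cluster_id": 3,
    "path": "ef-efekta/album1/img1.jpg",
    "face_confidence": 1.0,
}
flat = flatten_face_metadata(sample)
print(flat["event_id"], flat["cluster_id"], flat["face_confidence"])  # ef-efekta 3 1.0
```

The embedding itself is left out of the flat dict because it is passed to the collection separately as the vector.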


In [None]:
import chromadb

# Initialize the ChromaDB client and collection
client = chromadb.PersistentClient(path="chroma")
collection = client.get_or_create_collection(
    name="face_embeddings",
    metadata={"hnsw:space": "cosine"}  # make sure the 'cosine' space is used
)

print("\n💾 Saving data to ChromaDB...")

# Store each embedding in ChromaDB
for idx, (face_vector, meta) in enumerate(zip(faces, metadata)):
    doc_id = f"face-{idx}"

    # Flatten the nested metadata into scalar values (stored as strings here)
    str_metadata = {
        "foto_id": meta.get("foto_id", ""),
        "album_id": meta.get("album", {}).get("id", ""),
        "event_id": meta.get("album", {}).get("event", {}).get("id", ""),
        "cluster_id": str(meta.get("cluster_id", "")),
        "path": meta.get("path", ""),
        "face_confidence": str(meta.get("face_confidence", ""))
    }

    collection.add(
        ids=[doc_id],
        embeddings=[face_vector],
        metadatas=[str_metadata],
        documents=[meta["foto_id"]]
    )

print(f"✅ {len(faces)} embeddings saved to ChromaDB.")


### 🔍 search_similar_faces()

This function searches for the `top_k` faces most similar to a given embedding. Similarity is measured with the cosine distance computed by ChromaDB.

### Parameters:
- `query_vector`: Embedding vector to find similar faces for.
- `collection_name`: Name of the ChromaDB collection (default: `"face_embeddings"`).
- `top_k`: Number of most similar faces to return (default: `10`).
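
With the collection's `hnsw:space` set to `cosine`, the `Distance` values printed by the search are cosine distances, that is, 1 minus the cosine similarity, so smaller means more similar. The sketch below computes that quantity by hand on made-up 2-D vectors to show the scale. `cosine_distance` is a hypothetical helper, not a ChromaDB function.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity: 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


print(round(cosine_distance([1.0, 0.0], [1.0, 0.0]), 4))  # 0.0
print(round(cosine_distance([1.0, 0.0], [0.0, 1.0]), 4))  # 1.0
print(round(cosine_distance([1.0, 0.0], [1.0, 1.0]), 4))  # 0.2929
```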

### Usage Example:
```python
results = search_similar_faces(query_vector=your_embedding, top_k=5)
```

In [None]:
def search_similar_faces(query_vector, collection_name="face_embeddings", top_k=10):
    """
    Search the ChromaDB database for the faces most similar to an embedding vector.

    Args:
        query_vector (list/np.array): Embedding vector of the query face.
        collection_name (str): Name of the ChromaDB collection.
        top_k (int): Number of similar faces to return.

    Returns:
        dict: Search results from ChromaDB.
    """
    import chromadb

    # Connect to the local database
    client = chromadb.PersistentClient(path="chroma")
    collection = client.get_or_create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"}
    )

    # Query ChromaDB; wrap the single vector in a list so it is treated as one query
    results = collection.query(
        query_embeddings=[query_vector],
        n_results=top_k,
        include=["distances", "metadatas"]
    )

    # Print the search results
    print(f"\n🔍 Found {len(results['ids'][0])} similar faces:")
    for i, (doc_id, metadata, distance) in enumerate(zip(
        results['ids'][0],
        results['metadatas'][0],
        results['distances'][0]
    )):
        print(f"{i+1}. ID: {doc_id}")
        print(f"   ➤ Foto ID: {metadata.get('foto_id')}")
        print(f"   ➤ Path: {metadata.get('path')}")
        print(f"   ➤ Distance: {distance:.4f}")
        print("---")

    return results


# ChromaDB for Centroids

Stores the centroid data produced by face clustering in ChromaDB, with metadata that includes `event_id`, `album_id`, `cluster_id`, and more.

### Parameters
`centroids` (`dict`): Dictionary produced by `compute_centroids`, holding the embedding, `foto_id`, and other representative metadata of each cluster.

`collection_name` (`str`, default `"centroids_collection"`): Name of the ChromaDB collection to use.

### Output
Prints a log with the number of centroids saved.
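
Before anything is written, the centroid dictionary is turned into three parallel lists (IDs, embeddings, metadata). The sketch below shows that step on its own, without touching ChromaDB. `build_centroid_records` is a hypothetical helper and the sample data is made up.

```python
def build_centroid_records(centroids):
    """Turn a centroids dict into parallel lists of ids, embeddings, and metadata."""
    ids, embeddings, metadatas = [], [], []
    for cluster_id, data in centroids.items():
        rep = data["representative"]
        event_id = rep["album"]["event"]["name"]
        album_id = rep["album"]["name"]
        centroid_id = f"{event_id}_{cluster_id}"   # event name plus cluster number
        ids.append(centroid_id)
        embeddings.append(list(data["centroid_embedding"]))
        metadatas.append({
            "cluster_id": str(cluster_id),
            "event_id": event_id,
            "album_id": album_id,
            "centroid_id": centroid_id,
            "foto_id": rep["foto_id"],
        })
    return ids, embeddings, metadatas


# Made-up centroid entry with a tiny 3-D embedding
sample_centroids = {
    0: {
        "centroid_embedding": [0.1, 0.2, 0.3],
        "representative": {
            "foto_id": "cropped_dir/img1_f-0.jpg",
            "album": {"id": "album1", "name": "album1", "event": {"id": "ef-efekta", "name": "ef-efekta"}},
        },
    }
}
ids, embeddings, metadatas = build_centroid_records(sample_centroids)
print(ids, metadatas[0]["album_id"])  # ['ef-efekta_0'] album1
```

Because the ID combines only the event name and the cluster number, saving centroids for the same event a second time would reuse the same IDs.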


In [None]:
import chromadb

def save_centroids_to_chromadb(centroids, collection_name="centroids_collection"):
    # Initialize the Chroma client
    client = chromadb.PersistentClient(path="cluster")

    # Create or fetch the collection
    collection = client.get_or_create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"}  # keeping cosine is recommended
    )

    ids = []
    embeddings = []
    metadatas = []

    for cluster_id, data in centroids.items():
        representative_meta = data["representative"]
        foto_id = representative_meta["foto_id"]

        event_id = representative_meta["album"]['event']['name']
        album_id = representative_meta["album"]['name']

        centroid_id = f"{event_id}_{cluster_id}"  # unique ID format

        ids.append(centroid_id)
        embeddings.append(data["centroid_embedding"])  # embedding of the representative face, stored as a list
        metadatas.append({
            "cluster_id": str(cluster_id),
            "event_id": event_id,
            "album_id": album_id,
            "centroid_id": centroid_id,
            "foto_id": foto_id
        })

    # Save to Chroma
    collection.add(
        ids=ids,
        embeddings=embeddings,
        metadatas=metadatas
    )

    print(f"✅ Saved {len(ids)} centroids to collection '{collection_name}'")


In [None]:
save_centroids_to_chromadb(centroids)

### 🔍 search_similar_centroids()

This function searches the ChromaDB collection for the centroids most similar to a given face vector, using **cosine similarity**.

### Parameters
- `query_embedding` (`list[float]`): Face embedding vector to match.
- `top_k` (`int`, optional): Number of nearest results to return. Default: `5`.
- `collection_name` (`str`, optional): Name of the collection to search. Default: `"centroids_collection"`.

### Process
1. Connect to the ChromaDB collection using `PersistentClient`.
2. Query the collection with `query_embedding`.
3. Print the results with their metadata and cosine distance.

### Output
Prints the list of nearest results and returns the raw query result.
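
The returned result is a dictionary of nested lists, with one inner list per query vector. As a rough sketch of how it can be read, the hypothetical helper `rows_from_query` flattens the first query's results into one dict per match. The `fake_results` dict is a hand-written stand-in with that shape, not real output.

```python
def rows_from_query(results, query_index=0):
    """Flatten one query's results into a list of dicts, nearest first."""
    ids = results["ids"][query_index]
    metadatas = results["metadatas"][query_index]
    distances = results["distances"][query_index]
    return [
        {"id": item_id, "distance": distance, **(meta or {})}
        for item_id, meta, distance in zip(ids, metadatas, distances)
    ]


# Hand-written stand-in for a query result (illustrative values only)
fake_results = {
    "ids": [["ef-efekta_0", "ef-efekta_3"]],
    "metadatas": [[{"cluster_id": "0"}, {"cluster_id": "3"}]],
    "distances": [[0.12, 0.41]],
}
rows = rows_from_query(fake_results)
print(rows[0]["id"], rows[0]["cluster_id"])  # ef-efekta_0 0
```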


In [None]:
def search_similar_centroids(query_embedding, top_k=5, collection_name="centroids_collection"):
    import chromadb
    client = chromadb.PersistentClient(path="cluster")

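    collection = client.get_or_create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"}  # match the space used when the centroids were saved
    )

    # Search by centroid embedding (cosine distance, as configured on the collection)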
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )

    for i in range(len(results["ids"][0])):
        print(f"{i+1}. Centroid ID: {results['ids'][0][i]}")
        print(f"   ➤ Face ID: {results['metadatas'][0][i]['foto_id']}")

        print(f"   ➤ Event ID: {results['metadatas'][0][i]['event_id']}")
        print(f"   ➤ Album ID: {results['metadatas'][0][i]['album_id']}")
        print(f"   ➤ Cluster ID: {results['metadatas'][0][i]['cluster_id']}")
        print(f"   ➤ Distance (cosine): {results['distances'][0][i]:.4f}")
        print("---")

    return results
