# Exploratory Cluster Analysis of Track Images

This notebook runs an unsupervised cluster analysis on the center images to discover structures and patterns in the data without prior knowledge.

**Goals:**
- Feature extraction with a pretrained CNN (ResNet50)
- K-Means and DBSCAN clustering
- Evaluation of cluster quality
- Visualization and interpretation of the results

**Expected findings:**
- Different ground types (ballast vs. asphalt)
- Different lighting conditions
- City-specific differences
- Anomalies or rare situations

### 1. Setup: Imports, Environment & Reproducibility

In [None]:
# Import Required Libraries
import os
import shutil
# Force TensorFlow to use CPU only - MUST BE SET BEFORE IMPORTING TENSORFLOW
#os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
#os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'  # Reduce TensorFlow logging

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import json
import time
from collections import Counter, defaultdict
from PIL import Image

# Machine Learning Libraries
from sklearn.cluster import KMeans, MiniBatchKMeans, DBSCAN
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score, silhouette_samples
from sklearn.preprocessing import StandardScaler, normalize

# Deep Learning Libraries - Import AFTER setting environment variables
import tensorflow as tf
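from tensorflow.keras.applications import ResNet50, VGG16
from tensorflow.keras.applications.resnet50 import preprocess_input
# Alias the VGG16 variant so it does not shadow the ResNet50 `preprocess_input` used below
from tensorflow.keras.applications.vgg16 import preprocess_input as vgg16_preprocess_input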
from tensorflow.keras.preprocessing import image

# Image Processing
from img_preprocessing import ImagePreprocessor

# Visualization
from matplotlib.patches import Rectangle
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

### 2. Configuration: Paths, Model Parameters

In [None]:
# Configuration with timestamped results
from datetime import datetime

# Generate timestamp for this run
TIMESTAMP = datetime.now().strftime('%Y%m%d_%H%M%S')
print(f"Analysis timestamp: {TIMESTAMP}")

# Configuration
DATASET_PATH = "./datasets/clustering_sample_100"  # Path to sampled dataset
BASE_RESULTS_PATH = f"./results_{TIMESTAMP}"                       # Base results directory
RESULTS_PATH = f"{BASE_RESULTS_PATH}/clustering_analysis"  # Timestamped analysis results
CLUSTERS_PATH = f"{BASE_RESULTS_PATH}/clustered_images"    # Timestamped clustered images

BATCH_SIZE = 32                                     # Batch size for feature extraction
IMG_SIZE = (224, 224)                              # Input size for ResNet50
FEATURE_DIM = 2048                                 # ResNet50 feature dimension
N_CLUSTERS_RANGE = range(3, 16)                    # Range of k values to test
PCA_COMPONENTS = 50                                # PCA components for dimensionality reduction
RANDOM_STATE = 42                                  # Random state for reproducibility

print(f"Results will be saved to:")
print(f"  Analysis results: {RESULTS_PATH}")
print(f"  Clustered images: {CLUSTERS_PATH}")

### 3. Results Directory & Dataset Check
- Creates the results directories (idempotent).
- Prints the dataset and results paths.
- Checks whether the dataset path exists; if not, lists the available subfolders in ./datasets together with their image counts.
- Counts the .png images in the selected dataset and prints the count.
- Writes the run configuration to `run_metadata.json`.

In [None]:
# Create timestamped results directories
Path(RESULTS_PATH).mkdir(parents=True, exist_ok=True)
Path(CLUSTERS_PATH).mkdir(parents=True, exist_ok=True)

print(f"Created timestamped directories:")
print(f"Dataset path: {DATASET_PATH}")
print(f"Analysis results path: {RESULTS_PATH}")
print(f"Clustered images path: {CLUSTERS_PATH}")

# Check if dataset exists
if not os.path.exists(DATASET_PATH):
    print(f"Error: Dataset path {DATASET_PATH} does not exist!")
    print("Available dataset directories:")
    datasets_dir = Path("./datasets")
    if datasets_dir.exists():
        for subdir in datasets_dir.iterdir():
            if subdir.is_dir():
                img_count = len([f for f in subdir.iterdir() if f.suffix == '.png'])
                print(f"  {subdir.name}: {img_count} images")
else:
    img_count = len([f for f in Path(DATASET_PATH).iterdir() if f.suffix == '.png'])
    print(f"Found {img_count} images in dataset")

# Save run metadata
run_metadata = {
    'timestamp': TIMESTAMP,
    'dataset_path': str(Path(DATASET_PATH).resolve()),
    'total_images_found': img_count if os.path.exists(DATASET_PATH) else 0,
    'analysis_results_path': str(Path(RESULTS_PATH).resolve()),
    'clustered_images_path': str(Path(CLUSTERS_PATH).resolve()),
    'configuration': {
        'batch_size': BATCH_SIZE,
        'img_size': IMG_SIZE,
        'feature_dim': FEATURE_DIM,
        'n_clusters_range': list(N_CLUSTERS_RANGE),
        'pca_components': PCA_COMPONENTS,
        'random_state': RANDOM_STATE
    }
}

# Save run metadata
with open(f"{RESULTS_PATH}/run_metadata.json", 'w') as f:
    json.dump(run_metadata, f, indent=2)
    
print(f"\nRun metadata saved to: {RESULTS_PATH}/run_metadata.json")

### 4. Filename Parsing, Dataset Metadata & Tenant Distribution
- Extracts **tenant**, **SID** and **original_filename** from `*_C.png` file names (`parse_filename`).
- Reads the dataset and builds a **file list** plus the **tenant distribution** (`load_dataset_info`).
- Prints the **total number of images**, the **number of tenants** and the **share per tenant (%)**.
- Note: only files with the suffix **`_C.png`** are considered; adapt the parser if your naming convention differs.
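
To make the naming convention concrete, the following minimal sketch splits a name of the form `<tenant>_<sid>_<original>_C.png` the same way the parser in the next cell does. The function name `split_center_filename` and the file names are made up for illustration; they are not taken from the dataset.

```python
def split_center_filename(filename: str):
    """Split '<tenant>_<sid>_<original>_C.png' into (tenant, sid, original)."""
    suffix = "_C.png"
    if not filename.endswith(suffix):
        return None  # not a center image
    parts = filename[: -len(suffix)].split("_")
    if len(parts) < 3:
        return None  # too few fields to hold tenant, SID and original name
    # Everything after the second underscore belongs to the original file name
    return parts[0], parts[1], "_".join(parts[2:])


# Hypothetical file names, for illustration only
print(split_center_filename("cityA_0042_frame_000123_C.png"))
print(split_center_filename("cityA_0042_frame_000123_L.png"))
```

The first call yields the three fields, the second returns `None` because the file is not a `_C.png` image. Underscores inside the original name are preserved because only the first two fields are split off.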

In [None]:
def parse_filename(filename: str):
    """Parse filename to extract tenant, SID, and original filename."""
    if not filename.endswith('_C.png'):
        return None, None, None
    
    name_without_ext = filename[:-6]  # Remove '_C.png'
    parts = name_without_ext.split('_')
    
    if len(parts) >= 3:
        tenant = parts[0]
        sid = parts[1]
        original_filename = '_'.join(parts[2:])
        return tenant, sid, original_filename
    
    return None, None, None

def load_dataset_info(dataset_path: str):
    """Load and analyze dataset information."""
    image_files = []
    tenant_distribution = defaultdict(int)
    
    for file_path in Path(dataset_path).glob('*_C.png'):
        filename = file_path.name
        tenant, sid, original_name = parse_filename(filename)
        
        if tenant:
            image_files.append({
                'filepath': str(file_path),
                'filename': filename,
                'tenant': tenant,
                'sid': sid,
                'original_name': original_name
            })
            tenant_distribution[tenant] += 1
    
    return image_files, dict(tenant_distribution)

# Load dataset information
print("Loading dataset information...")
image_files, tenant_distribution = load_dataset_info(DATASET_PATH)

print(f"\nTotal images: {len(image_files)}")
print(f"Number of tenants: {len(tenant_distribution)}")
print("\nTenant distribution:")
for tenant, count in sorted(tenant_distribution.items()):
    percentage = (count / len(image_files)) * 100
    print(f"  {tenant}: {count} images ({percentage:.1f}%)")

### 5. Feature Extraction

#### 5.0 Configuring the preprocessing method

**Select a preprocessing method for feature extraction:**
- `'none'`: no additional preprocessing (standard ResNet50 preprocessing)
- `'hist_eq'`: histogram equalization; improves global contrast
- `'clahe'`: CLAHE (Contrast Limited Adaptive Histogram Equalization); adaptive local contrast enhancement

**Note:** The method is selected in the first code cell below, either through the dropdown or by changing the default value of `PREPROCESSING_METHOD`.

#### 5.1 Preparing feature extraction: functions & pipeline
- Defines the helper functions `load_and_preprocess_image`, `create_feature_extractor` and `extract_features_batch` (second code cell below).
- Purpose: builds the pipeline (loading, normalization, batch processing); the extraction itself is **not executed** here.
- Expected output: ResNet50 encoder with global average pooling → feature vectors of dimension **2048**.
- Prerequisites: `IMG_SIZE`, `BATCH_SIZE` and `preprocess_input` are defined.
- The functions are called in the feature-extraction cell further below to actually compute the features.
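
For intuition about the `'hist_eq'` option, global histogram equalization can be sketched in plain NumPy. This is an illustrative stand-in, not the code of `ImagePreprocessor.method_1_histogram_equalization`, whose implementation is not shown in this notebook; the function name `equalize_gray` and the synthetic image are assumptions.

```python
import numpy as np


def equalize_gray(img: np.ndarray) -> np.ndarray:
    """Globally equalize the histogram of an 8-bit grayscale image."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]           # CDF value at the darkest occupied gray level
    denom = cdf[-1] - cdf_min
    if denom == 0:                      # constant image: nothing to equalize
        return img.copy()
    # Map each gray level through the rescaled CDF so values spread over 0..255
    lut = np.clip(np.round((cdf - cdf_min) / denom * 255), 0, 255).astype(np.uint8)
    return lut[img]


# Synthetic low-contrast image with gray values squeezed into a narrow band
rng = np.random.default_rng(0)
low_contrast = rng.integers(100, 140, size=(64, 64), dtype=np.uint8)
equalized = equalize_gray(low_contrast)

print("before:", low_contrast.min(), low_contrast.max())
print("after: ", equalized.min(), equalized.max())
```

CLAHE applies the same idea per image tile and limits how strongly any single gray level can be amplified, which is why it tends to handle uneven lighting better than the global variant.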

In [None]:
# Preprocessing Method Selection
# =============================

import ipywidgets as widgets
from IPython.display import display

# Available preprocessing methods
PREPROCESSING_METHODS = {
    'none': 'No additional preprocessing (default)',
    'hist_eq': 'Histogram Equalization',
    'clahe': 'CLAHE (Adaptive Histogram Equalization)'
}

# Create options as list of tuples (display_name, value)
method_options = [(display_name, key) for key, display_name in PREPROCESSING_METHODS.items()]

# Create interactive dropdown for method selection
method_selector = widgets.Dropdown(
    options=method_options,
    value='none',  # Default selection; must be one of the keys above
    description='Method:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='400px')
)

info_text = widgets.HTML(
    value="""
    <div style="padding: 10px; border: 1px solid #ddd; border-radius: 5px; background-color: #f9f9f9;">
    <h4>🔧 Preprocessing methods compared</h4>
    <ul>
    <li><b>No preprocessing:</b> standard ResNet50 preprocessing without additional image enhancement</li>
    <li><b>Histogram Equalization:</b> global contrast enhancement by equalizing the histogram</li>
    <li><b>CLAHE:</b> adaptive local contrast enhancement with clipping</li>
    </ul>
    <p><i>Select a method and run the following cells.</i></p>
    </div>
    """
)

# Global variable to store selected method
PREPROCESSING_METHOD = 'none'

def on_method_change(change):
    global PREPROCESSING_METHOD
    PREPROCESSING_METHOD = change['new']
    print(f"✅ Preprocessing method changed to: {PREPROCESSING_METHODS[PREPROCESSING_METHOD]}")

method_selector.observe(on_method_change, names='value')

# Display the interface
print("🔧 Select a preprocessing method:")
display(widgets.VBox([method_selector, info_text]))

print(f"\n📋 Current selection: {PREPROCESSING_METHODS[PREPROCESSING_METHOD]}")
print("💡 Tip: After changing the method, re-run all following cells!")

In [None]:
def load_and_preprocess_image(image_path: str, target_size: tuple = IMG_SIZE):
    """Load and preprocess image for ResNet50 with optional preprocessing methods."""
    try:
        # Load image
        img = image.load_img(image_path, target_size=target_size)
        
        # Apply selected preprocessing method
        if PREPROCESSING_METHOD == 'hist_eq':
            # Convert PIL image to numpy array for preprocessing
            img_array = np.array(img)
            # Apply histogram equalization
            img_processed = ImagePreprocessor.method_1_histogram_equalization(img_array)
            # Convert back to PIL Image
            img = Image.fromarray(img_processed.astype('uint8'))
            
        elif PREPROCESSING_METHOD == 'clahe':
            # Convert PIL image to numpy array for preprocessing
            img_array = np.array(img)
            # Apply CLAHE
            img_processed = ImagePreprocessor.method_2_clahe(img_array)
            # Convert back to PIL Image
            img = Image.fromarray(img_processed.astype('uint8'))
            
        # Note: If PREPROCESSING_METHOD == 'none', no additional preprocessing is applied
        
        # Convert to array and prepare for ResNet50
        img_array = image.img_to_array(img)    
        img_array = np.expand_dims(img_array, axis=0)
        img_array = preprocess_input(img_array)
        return img_array
        
    except Exception as e:
        print(f"Error loading {image_path}: {e}")
        return None

def create_feature_extractor():
    """Create ResNet50 feature extractor (without top classification layer)."""
    print("Loading ResNet50 model...")
    base_model = ResNet50(
        weights='imagenet',
        include_top=False,
        input_shape=(IMG_SIZE[0], IMG_SIZE[1], 3),
        pooling='avg'  # Global average pooling
    )
    
    # The model output will be (batch_size, 2048) features
    print(f"Feature extractor output shape: {base_model.output_shape}")
    return base_model

def extract_features_batch(model, image_paths: list, batch_size: int = BATCH_SIZE):
    """Extract features from images in batches."""
    features = []
    valid_paths = []
    
    print(f"Extracting features from {len(image_paths)} images...")
    print(f"Using preprocessing method: {PREPROCESSING_METHODS.get(PREPROCESSING_METHOD, 'Unknown')}")
    
    for i in range(0, len(image_paths), batch_size):
        batch_paths = image_paths[i:i + batch_size]
        batch_images = []
        batch_valid_paths = []
        
        for img_path in batch_paths:
            img_array = load_and_preprocess_image(img_path)
            if img_array is not None:
                batch_images.append(img_array[0])  # Remove batch dimension
                batch_valid_paths.append(img_path)
        
        if batch_images:
            batch_images = np.array(batch_images)
            batch_features = model.predict(batch_images, verbose=0)
            features.extend(batch_features)
            valid_paths.extend(batch_valid_paths)
        
        # Progress update
        processed = min(i + batch_size, len(image_paths))
        print(f"Progress: {processed}/{len(image_paths)} images processed")
    
    return np.array(features), valid_paths

#### 5.2 Feature extraction with selectable preprocessing

- **Adaptive pipeline**: uses the preprocessing method selected in Section 5.0
- **Functions**: `load_and_preprocess_image`, `create_feature_extractor` and `extract_features_batch`
- **Preprocessing integration**:
  - `'none'`: standard ResNet50 preprocessing
  - `'hist_eq'`: histogram equalization before the ResNet50 preprocessing
  - `'clahe'`: CLAHE before the ResNet50 preprocessing
- **Feature extraction**: ResNet50 encoder with global average pooling → **2048-dimensional features**
- **Batch processing**: images are processed in batches of configurable size (`BATCH_SIZE`)

**Important**: after changing the preprocessing method, all following cells must be re-run!
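
Where the 2048 dimensions come from can be shown without loading the network. For 224×224 inputs, the last convolutional block of ResNet50 produces a 7×7 grid with 2048 channels, and global average pooling collapses the grid to one value per channel. The sketch below uses random numbers in place of real activations, so only the shapes are meaningful.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for the final ResNet50 feature maps of a batch of 4 images:
# shape (batch, height, width, channels)
feature_maps = rng.random((4, 7, 7, 2048), dtype=np.float32)

# Global average pooling: mean over the two spatial axes
pooled = feature_maps.mean(axis=(1, 2))

print(pooled.shape)  # (4, 2048)
```

Because pooling discards spatial position, the resulting vector describes what appears in the image rather than where, which suits clustering by scene type.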

In [None]:
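# Run on CPU only. Setting CUDA_VISIBLE_DEVICES at this point is unreliable because
# TensorFlow has already been imported, so the GPUs are hidden via tf.config instead.
try:
    tf.config.set_visible_devices([], 'GPU')
except RuntimeError as e:
    # Raised if TensorFlow has already initialized its devices
    print(f"Could not hide GPUs, continuing with the current device setup: {e}")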
# Extract features using ResNet50
print("=" * 80)
print("FEATURE EXTRACTION")
print("=" * 80)

# Create feature extractor
feature_extractor = create_feature_extractor()

# Extract image paths
image_paths = [item['filepath'] for item in image_files]

# Extract features
start_time = time.time()
features, valid_paths = extract_features_batch(feature_extractor, image_paths)
extraction_time = time.time() - start_time

print(f"\nFeature extraction completed in {extraction_time:.2f} seconds")
print(f"Extracted features shape: {features.shape}")
print(f"Valid images: {len(valid_paths)}/{len(image_paths)}")

# Update image_files to only include valid images (order is preserved)
valid_path_set = set(valid_paths)  # set lookup instead of repeated list scans
valid_image_files = [
    img_file for img_file in image_files
    if img_file['filepath'] in valid_path_set
]

image_files = valid_image_files
print(f"Updated image files: {len(image_files)}")

In [None]:
# Status Check: Current Preprocessing Configuration
# =================================================

print("🔍 Current configuration:")
print("=" * 50)
print(f"📊 Preprocessing method: {PREPROCESSING_METHODS.get(PREPROCESSING_METHOD, 'Unknown')}")
print(f"🔧 Method key: {PREPROCESSING_METHOD}")
print(f"📁 Dataset path: {DATASET_PATH}")
print(f"🖼️  Image size: {IMG_SIZE}")
print(f"📦 Batch size: {BATCH_SIZE}")
print(f"🎯 Feature dimension: {FEATURE_DIM}")

# Import check for ImagePreprocessor
try:
    from img_preprocessing import ImagePreprocessor
    print("✅ ImagePreprocessor imported successfully")
    
    # Check if methods exist
    if hasattr(ImagePreprocessor, 'method_1_histogram_equalization'):
        print("✅ Histogram equalization method available")
    else:
        print("❌ Histogram equalization method not found")
        
    if hasattr(ImagePreprocessor, 'method_2_clahe'):
        print("✅ CLAHE method available")
    else:
        print("❌ CLAHE method not found")
        
except ImportError as e:
    print(f"❌ ImagePreprocessor import error: {e}")
    print("💡 Make sure img_preprocessing.py is located in the same directory")

print("\n🚀 Ready for feature extraction!")

if PREPROCESSING_METHOD != 'none':
    print(f"⚠️  Note: preprocessing method '{PREPROCESSING_METHODS[PREPROCESSING_METHOD]}' will be applied")
    print("   This can increase processing time but may improve clustering quality.")
else:
    print("ℹ️  Standard ResNet50 preprocessing is used (no additional preprocessing)")

### 6. Feature Post-processing (Normalization & PCA → Preparation for Clustering)
- Row-wise L2 normalization (`normalize(..., axis=1)`): scales every embedding to unit norm, so that Euclidean distances (and therefore K-Means) depend only on the direction of the vectors, not on their magnitude.
- Prints summary statistics before and after normalization (mean/std, min/max of the raw features, and the L2 norm of one sample).
- PCA reduces the dimension from `FEATURE_DIM` → `PCA_COMPONENTS` (fitted on the normalized features) and reports the total explained variance and the resulting shape.
- Clustering initially uses `features_for_clustering = features_normalized`; **optionally** switch to `features_pca` (faster, less noisy, possibly better separation).
- Reproducibility: `random_state=RANDOM_STATE` is set for PCA.
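
Why unit-norm embeddings help K-Means can be checked numerically: for two unit vectors, the squared Euclidean distance equals `2 - 2 * cosine_similarity`, so clustering by Euclidean distance orders pairs exactly as cosine similarity would. The sketch below uses random vectors as a stand-in for the CNN features and normalizes them by hand; unlike scikit-learn's `normalize`, it does not guard against all-zero rows.

```python
import numpy as np

rng = np.random.default_rng(42)
features = rng.normal(size=(5, 2048))    # stand-in for CNN embeddings

# Row-wise L2 normalization: divide every row by its Euclidean length
norms = np.linalg.norm(features, axis=1, keepdims=True)
unit = features / norms

a, b = unit[0], unit[1]
squared_dist = np.sum((a - b) ** 2)
cosine_sim = float(a @ b)

print(np.linalg.norm(unit, axis=1))                    # all ones up to rounding
print(np.isclose(squared_dist, 2 - 2 * cosine_sim))    # identity holds for unit vectors
```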

In [None]:
# Normalize features
print("Normalizing features...")
features_normalized = normalize(features, norm='l2', axis=1)

print(f"Original feature statistics:")
print(f"  Mean: {features.mean():.4f}")
print(f"  Std: {features.std():.4f}")
print(f"  Min: {features.min():.4f}")
print(f"  Max: {features.max():.4f}")

print(f"\nNormalized feature statistics:")
print(f"  Mean: {features_normalized.mean():.4f}")
print(f"  Std: {features_normalized.std():.4f}")
print(f"  L2 norm (first sample): {np.linalg.norm(features_normalized[0]):.4f}")

# Optional: Apply PCA for dimensionality reduction
print(f"\nApplying PCA to reduce dimensions from {FEATURE_DIM} to {PCA_COMPONENTS}...")
pca = PCA(n_components=PCA_COMPONENTS, random_state=RANDOM_STATE)
features_pca = pca.fit_transform(features_normalized)

print(f"PCA explained variance ratio: {pca.explained_variance_ratio_.sum():.4f}")
print(f"PCA features shape: {features_pca.shape}")

# We'll use both normalized full features and PCA features for comparison
features_for_clustering = features_normalized  # Start with full features

### 7. K-Means

#### 7.1 K-Means: k-sweep & cluster evaluation (inertia, silhouette, cluster sizes)
- A k-sweep is the **systematic testing of several values of k** (e.g. k=3…15) with K-Means/MiniBatchKMeans in order to find a sensible number of clusters.
- Runs K-Means/MiniBatchKMeans over `N_CLUSTERS_RANGE` on `features_for_clustering` (default: `MiniBatchKMeans` for efficiency).
- For every k: fit & predict → computes **inertia**, **silhouette score** and **cluster sizes** (incl. min/max); the fitted model is stored as well.
- Result list `kmeans_results`: one entry per k of the form `{k, inertia, silhouette_score, min_cluster_size, max_cluster_size, cluster_sizes, model}`.
- In practice: choose the best k by **maximum silhouette** and cross-check it against the **bend in the inertia curve**; with strongly imbalanced clusters, also inspect the cluster sizes.


In [None]:
def evaluate_kmeans_clusters(features, k_range, use_minibatch=True):
    """Evaluate K-Means clustering for different k values."""
    results = []
    
    print(f"Evaluating K-Means for k in {list(k_range)}...")
    
    for k in k_range:
        print(f"  Testing k={k}...")
        
        # Use MiniBatchKMeans for efficiency with large datasets
        if use_minibatch:
            kmeans = MiniBatchKMeans(
                n_clusters=k,
                random_state=RANDOM_STATE,
                batch_size=100,
                n_init=10
            )
        else:
            kmeans = KMeans(
                n_clusters=k,
                random_state=RANDOM_STATE,
                n_init=10
            )
        
        # Fit and predict
        cluster_labels = kmeans.fit_predict(features)
        
        # Calculate metrics
        inertia = kmeans.inertia_
        silhouette_avg = silhouette_score(features, cluster_labels)
        
        # Cluster sizes
        cluster_sizes = Counter(cluster_labels)
        min_cluster_size = min(cluster_sizes.values())
        max_cluster_size = max(cluster_sizes.values())
        
        results.append({
            'k': k,
            'inertia': inertia,
            'silhouette_score': silhouette_avg,
            'min_cluster_size': min_cluster_size,
            'max_cluster_size': max_cluster_size,
            'cluster_sizes': dict(cluster_sizes),
            'model': kmeans
        })
        
        print(f"    Inertia: {inertia:.2f}, Silhouette: {silhouette_avg:.4f}")
    
    return results

# Evaluate K-Means clustering
print("=" * 80)
print("K-MEANS CLUSTERING EVALUATION")
print("=" * 80)

kmeans_results = evaluate_kmeans_clusters(features_for_clustering, N_CLUSTERS_RANGE)

#### 7.2 Visualizing the K-Means evaluation & choosing k
- Produces a 2×2 dashboard:
  - Elbow plot (inertia vs. k)
  - Silhouette plot (silhouette vs. k)
  - Cluster sizes (min/max vs. k)
  - Combined metrics (1 − normalized inertia vs. normalized silhouette)
- Saves the figure as `kmeans_evaluation.png` in `RESULTS_PATH` and displays it.
- Determines the optimal `k` as the maximum of the silhouette score (`np.argmax`) and prints `optimal_k` together with the best silhouette value.
- Practical note: check whether the silhouette maximum is stable (no tiny clusters, plausible from a domain point of view); the elbow serves as a plausibility check, not as the sole criterion.
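
Reading the elbow by eye is subjective, so a simple numeric cross-check can help. One common heuristic, which the notebook itself does not use, picks the point of the inertia curve that lies farthest from the straight line joining its first and last point. The function name `elbow_by_max_distance` and the inertia values below are made up for illustration.

```python
import numpy as np


def elbow_by_max_distance(k_values, inertias):
    """Return the k whose point lies farthest from the chord between the curve's endpoints."""
    k = np.asarray(k_values, dtype=float)
    y = np.asarray(inertias, dtype=float)
    # Scale both axes to [0, 1] so that neither dominates the distance
    k_n = (k - k.min()) / (k.max() - k.min())
    y_n = (y - y.min()) / (y.max() - y.min())
    # Perpendicular distance of every point to the line through the first and last point
    dx, dy = k_n[-1] - k_n[0], y_n[-1] - y_n[0]
    dist = np.abs(dy * (k_n - k_n[0]) - dx * (y_n - y_n[0])) / np.hypot(dx, dy)
    return int(k[np.argmax(dist)])


# Made-up inertia curve with a visible bend
ks = list(range(3, 11))
inertia = [100, 70, 40, 36, 33, 30, 28, 26]
print(elbow_by_max_distance(ks, inertia))  # 5
```

In the notebook, the inputs would be `k_values` and `inertias`. If the result disagrees strongly with the silhouette maximum, that is a hint to inspect both candidates rather than to trust either metric alone.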


In [None]:
# Plot K-Means evaluation results
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('K-Means Clustering Evaluation', fontsize=16)

# Extract data for plotting
k_values = [r['k'] for r in kmeans_results]
inertias = [r['inertia'] for r in kmeans_results]
silhouette_scores = [r['silhouette_score'] for r in kmeans_results]
min_cluster_sizes = [r['min_cluster_size'] for r in kmeans_results]
max_cluster_sizes = [r['max_cluster_size'] for r in kmeans_results]

# Elbow plot
axes[0, 0].plot(k_values, inertias, 'bo-')
axes[0, 0].set_xlabel('Number of Clusters (k)')
axes[0, 0].set_ylabel('Inertia')
axes[0, 0].set_title('Elbow Method')
axes[0, 0].grid(True)

# Silhouette score plot
axes[0, 1].plot(k_values, silhouette_scores, 'ro-')
axes[0, 1].set_xlabel('Number of Clusters (k)')
axes[0, 1].set_ylabel('Silhouette Score')
axes[0, 1].set_title('Silhouette Analysis')
axes[0, 1].grid(True)

# Cluster size distribution
axes[1, 0].plot(k_values, min_cluster_sizes, 'go-', label='Min cluster size')
axes[1, 0].plot(k_values, max_cluster_sizes, 'mo-', label='Max cluster size')
axes[1, 0].set_xlabel('Number of Clusters (k)')
axes[1, 0].set_ylabel('Cluster Size')
axes[1, 0].set_title('Cluster Size Distribution')
axes[1, 0].legend()
axes[1, 0].grid(True)

# Combined metrics (normalized)
norm_inertias = np.array(inertias) / max(inertias)
norm_silhouettes = np.array(silhouette_scores) / max(silhouette_scores)
axes[1, 1].plot(k_values, 1 - norm_inertias, 'b-', label='1 - Normalized Inertia')
axes[1, 1].plot(k_values, norm_silhouettes, 'r-', label='Normalized Silhouette')
axes[1, 1].set_xlabel('Number of Clusters (k)')
axes[1, 1].set_ylabel('Normalized Score')
axes[1, 1].set_title('Combined Metrics')
axes[1, 1].legend()
axes[1, 1].grid(True)

plt.tight_layout()
plt.savefig(f"{RESULTS_PATH}/kmeans_evaluation.png", dpi=300, bbox_inches='tight')
plt.show()

# Find optimal k based on silhouette score
best_silhouette_idx = np.argmax(silhouette_scores)
optimal_k = k_values[best_silhouette_idx]
print(f"\nOptimal k based on silhouette score: {optimal_k}")
print(f"Best silhouette score: {silhouette_scores[best_silhouette_idx]:.4f}")

#### 7.3 Final K-Means with the optimal k & cluster analysis
- Runs `MiniBatchKMeans` with `k=optimal_k` (`batch_size=100`, `n_init=20`) and assigns a cluster label to every sample.
- Adds the field `cluster` (int) to each entry of `image_files` for downstream evaluation and storage.
- Builds a cluster summary: total count per cluster and the **tenant distribution per cluster** (count & percentage).
- Console output: for every cluster → number of images and the tenants ranked in descending order.
- In practice: as a next step, persist the artifacts (e.g. `index_with_clusters.csv`, model parameters) and visualize example images per cluster.
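
Persisting the cluster assignment could look like the sketch below, which writes one CSV row per image using only the standard library. The records are hypothetical and merely mimic the shape of `image_files` after labeling; the helper name `save_cluster_index` is an assumption, and a temporary directory stands in for `RESULTS_PATH`.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical records in the shape of `image_files` after cluster assignment
records = [
    {"filepath": "data/cityA_0001_img_C.png", "filename": "cityA_0001_img_C.png",
     "tenant": "cityA", "sid": "0001", "original_name": "img", "cluster": 0},
    {"filepath": "data/cityB_0002_img_C.png", "filename": "cityB_0002_img_C.png",
     "tenant": "cityB", "sid": "0002", "original_name": "img", "cluster": 1},
]


def save_cluster_index(rows, out_dir):
    """Write one CSV row per image, including its cluster label."""
    out_path = Path(out_dir) / "index_with_clusters.csv"
    fieldnames = ["filepath", "filename", "tenant", "sid", "original_name", "cluster"]
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return out_path


with tempfile.TemporaryDirectory() as tmp:    # stands in for RESULTS_PATH
    csv_path = save_cluster_index(records, tmp)
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

print(len(rows), rows[0]["tenant"], rows[0]["cluster"])
```

Note that values read back from CSV are strings, so the cluster label needs an `int(...)` conversion when the index is reloaded.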


In [None]:
# Perform final K-Means clustering with optimal k
print(f"\nPerforming final K-Means clustering with k={optimal_k}...")

final_kmeans = MiniBatchKMeans(
    n_clusters=optimal_k,
    random_state=RANDOM_STATE,
    batch_size=100,
    n_init=20  # More initializations for final model
)

cluster_labels = final_kmeans.fit_predict(features_for_clustering)

# Add cluster labels to image files
for i, img_file in enumerate(image_files):
    img_file['cluster'] = int(cluster_labels[i])

# Analyze cluster composition
print(f"\nCluster Analysis:")
cluster_stats = defaultdict(lambda: defaultdict(int))

for img_file in image_files:
    cluster = img_file['cluster']
    tenant = img_file['tenant']
    cluster_stats[cluster]['total'] += 1
    cluster_stats[cluster][tenant] += 1

for cluster_id in sorted(cluster_stats.keys()):
    stats = cluster_stats[cluster_id]
    total = stats['total']
    print(f"\nCluster {cluster_id}: {total} images")
    
    # Show tenant distribution in this cluster
    tenant_counts = {k: v for k, v in stats.items() if k != 'total'}
    for tenant, count in sorted(tenant_counts.items(), key=lambda x: x[1], reverse=True):
        percentage = (count / total) * 100
        print(f"  {tenant}: {count} ({percentage:.1f}%)")

#### 7.4 K-Means cluster examples: visualizing sample images
- Shows example images from the clusters assigned by **MiniBatchKMeans** (final k).
- Basis: `cluster_labels = final_kmeans.fit_predict(...)` is stored in `image_files[i]['cluster']`; the visualization samples up to `n_examples` images per cluster from it.
- Presentation: an **image grid** (3 columns, several rows) per cluster; each panel is saved as `cluster_<cluster_id>_examples.png`.
- Note: these are **randomly chosen examples**, **not** the images nearest to the cluster centroid. To show the "most representative" examples, the images with **minimal distance to the centroid** could be displayed instead (optional extension).
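
The optional extension mentioned above, picking the images nearest to a centroid, can be sketched as follows. The helper `closest_to_centroid` is hypothetical and not part of the notebook; in the notebook, the arguments would presumably be `features_for_clustering`, `cluster_labels` and `final_kmeans.cluster_centers_`.

```python
import numpy as np


def closest_to_centroid(features, labels, centroids, cluster_id, n_examples=6):
    """Return indices of the samples in `cluster_id` that lie nearest to its centroid."""
    member_idx = np.flatnonzero(labels == cluster_id)
    dists = np.linalg.norm(features[member_idx] - centroids[cluster_id], axis=1)
    order = np.argsort(dists)[:n_examples]
    return member_idx[order]    # indices into the full feature matrix


# Tiny synthetic example with two clusters in 2D
features = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [5.0, 5.0], [5.2, 5.0]])
labels = np.array([0, 0, 0, 1, 1])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])

print(closest_to_centroid(features, labels, centroids, cluster_id=0, n_examples=2))  # [0 1]
```

The returned indices refer to rows of the feature matrix, so they can be used directly to look up the matching entries in `image_files`, provided both are in the same order.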


In [None]:
def display_cluster_examples(image_files, cluster_id, n_examples=6):
    """Display example images from a specific cluster."""
    cluster_images = [img for img in image_files if img['cluster'] == cluster_id]
    
    if not cluster_images:
        print(f"No images found for cluster {cluster_id}")
        return
    
    # Randomly sample examples
    examples = np.random.choice(cluster_images, min(n_examples, len(cluster_images)), replace=False)
    
    # Create subplot
    cols = 3
    rows = (len(examples) + cols - 1) // cols
    fig, axes = plt.subplots(rows, cols, figsize=(15, 5 * rows))
    if rows == 1:
        axes = axes.reshape(1, -1)
    
    fig.suptitle(f'Cluster {cluster_id} Examples ({len(cluster_images)} total images)', fontsize=16)
    
    for i, img_info in enumerate(examples):
        row = i // cols
        col = i % cols
        
        # Load and display image
        try:
            img = Image.open(img_info['filepath'])
            axes[row, col].imshow(img)
            axes[row, col].set_title(f"{img_info['tenant']}_{img_info['sid']}", fontsize=10)
            axes[row, col].axis('off')
        except Exception as e:
            axes[row, col].text(0.5, 0.5, f"Error loading\n{img_info['filename']}", 
                               ha='center', va='center', transform=axes[row, col].transAxes)
            axes[row, col].axis('off')
    
    # Hide empty subplots
    for i in range(len(examples), rows * cols):
        row = i // cols
        col = i % cols
        axes[row, col].axis('off')
    
    plt.tight_layout()
    plt.savefig(f"{RESULTS_PATH}/cluster_{cluster_id}_examples.png", dpi=300, bbox_inches='tight')
    plt.show()

# Display examples for each cluster
print("=" * 80)
print("CLUSTER EXAMPLES")
print("=" * 80)

for cluster_id in sorted(set(cluster_labels)):
    display_cluster_examples(image_files, cluster_id)

### 8. DBSCAN

#### 8.1 DBSCAN: parameter grid, heatmaps & best selection
- Uses the **PCA features** for more efficient density-clustering runs (lower dimension).
- Tests a **parameter grid** of `eps_values × min_samples_values` and runs `DBSCAN.fit_predict(features_pca)` for every combination.
- Determines per run: **number of clusters** (without noise), **noise count/noise ratio** and **silhouette** (only if there is more than one cluster; noise points are excluded).
- Visualizes the results as **3 heatmaps**: number of clusters, silhouette score, noise ratio; axes: `eps` (x), `min_samples` (y).
- Selects the "best" run in two ways:
  - **Best by silhouette** (maximum silhouette)
  - **Best balanced**: `silhouette * (1 - noise_ratio)` (trade-off between quality and noise)
  → sets `best_dbscan` to the "balanced" winner (more robust in practice).
- Saves the overview as `dbscan_parameter_optimization.png` and displays it; logs the best parameters and metrics.
- Notes:
  - Negative silhouette values are capped at **0** for the heatmap (visualization only).
  - If the noise ratio is very high, widen the parameter range or adjust the number of **PCA components**.
  - For a data-driven choice of epsilon, additionally consider a **k-distance plot** (e.g. k=MinPts); evaluate **HDBSCAN** as an alternative.
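
The k-distance plot mentioned in the notes can be prepared with a few lines of NumPy. The sketch computes, for every point, the distance to its k-th nearest neighbour and sorts these values; a sharp rise at the end of the sorted curve suggests a candidate for `eps`. The helper name `k_distances` and the synthetic data are assumptions, and the full pairwise distance matrix is only practical for moderately sized datasets.

```python
import numpy as np


def k_distances(features, k):
    """Sorted distance of every point to its k-th nearest neighbour (excluding itself)."""
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))    # full pairwise matrix, O(n^2) memory
    dist.sort(axis=1)                            # column 0 is the distance to itself (0)
    return np.sort(dist[:, k])


rng = np.random.default_rng(42)
dense = rng.normal(0.0, 0.1, size=(40, 2))      # one tight cluster
outliers = rng.uniform(3.0, 4.0, size=(3, 2))   # a few far-away points
X = np.vstack([dense, outliers])

kd = k_distances(X, k=5)
print(kd[:3], kd[-3:])
```

Plotting `kd` with `plt.plot(kd)` gives the usual k-distance curve. Since scikit-learn's `min_samples` counts the point itself, `k = min_samples - 1` is a reasonable choice when matching the curve to a DBSCAN setting.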


In [None]:
# DBSCAN Clustering
print("=" * 80)
print("DBSCAN CLUSTERING")
print("=" * 80)

# Use PCA features for DBSCAN (better performance in lower dimensions)
print("Applying DBSCAN on PCA-reduced features...")

# Create a grid of parameters to test
eps_values = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.2, 1.5]
min_samples_values = [3, 5, 7, 10, 15, 20]

print("\nTesting parameter combinations:")
print(f"eps values: {eps_values}")
print(f"min_samples values: {min_samples_values}")
print(f"Total combinations to test: {len(eps_values) * len(min_samples_values)}")

dbscan_results = []

# Test all combinations
for eps in eps_values:
    for min_samples in min_samples_values:
        print(f"\nTesting DBSCAN with eps={eps}, min_samples={min_samples}...")
        
        dbscan = DBSCAN(eps=eps, min_samples=min_samples)
        dbscan_labels = dbscan.fit_predict(features_pca)
        
        # Count clusters and noise points
        n_clusters = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)
        n_noise = list(dbscan_labels).count(-1)
        
        # Calculate silhouette score (only if we have more than 1 cluster)
        if n_clusters > 1:
            # Filter out noise points for silhouette calculation
            mask = dbscan_labels != -1
            if mask.sum() > 1:  # Need at least 2 points
                silhouette_avg = silhouette_score(features_pca[mask], dbscan_labels[mask])
            else:
                silhouette_avg = -1
        else:
            silhouette_avg = -1
        
        result = {
            'eps': eps,
            'min_samples': min_samples,
            'n_clusters': n_clusters,
            'n_noise': n_noise,
            'noise_ratio': n_noise / len(dbscan_labels),
            'silhouette_score': silhouette_avg,
            'labels': dbscan_labels
        }
        dbscan_results.append(result)
        
        print(f"  Clusters: {n_clusters}")
        print(f"  Noise points: {n_noise} ({result['noise_ratio']*100:.1f}%)")
        print(f"  Silhouette: {silhouette_avg:.4f}")

# Create results visualization
plt.figure(figsize=(15, 10))

# Create a matrix of results
eps_grid, min_samples_grid = np.meshgrid(eps_values, min_samples_values)
n_clusters_grid = np.zeros_like(eps_grid, dtype=float)
silhouette_grid = np.zeros_like(eps_grid, dtype=float)
noise_ratio_grid = np.zeros_like(eps_grid, dtype=float)

for result in dbscan_results:
    i = min_samples_values.index(result['min_samples'])
    j = eps_values.index(result['eps'])
    n_clusters_grid[i, j] = result['n_clusters']
    silhouette_grid[i, j] = max(result['silhouette_score'], 0)  # Replace negative scores with 0
    noise_ratio_grid[i, j] = result['noise_ratio']

# Plot number of clusters
plt.subplot(221)
plt.imshow(n_clusters_grid, aspect='auto', interpolation='nearest')
plt.colorbar(label='Number of Clusters')
plt.ylabel('min_samples')
plt.xlabel('eps')
plt.title('Number of Clusters')
plt.xticks(range(len(eps_values)), [f'{x:.1f}' for x in eps_values], rotation=45)
plt.yticks(range(len(min_samples_values)), min_samples_values)

# Plot silhouette scores
plt.subplot(222)
plt.imshow(silhouette_grid, aspect='auto', interpolation='nearest')
plt.colorbar(label='Silhouette Score')
plt.ylabel('min_samples')
plt.xlabel('eps')
plt.title('Silhouette Score')
plt.xticks(range(len(eps_values)), [f'{x:.1f}' for x in eps_values], rotation=45)
plt.yticks(range(len(min_samples_values)), min_samples_values)

# Plot noise ratio
plt.subplot(223)
plt.imshow(noise_ratio_grid, aspect='auto', interpolation='nearest')
plt.colorbar(label='Noise Ratio')
plt.ylabel('min_samples')
plt.xlabel('eps')
plt.title('Noise Ratio')
plt.xticks(range(len(eps_values)), [f'{x:.1f}' for x in eps_values], rotation=45)
plt.yticks(range(len(min_samples_values)), min_samples_values)

# Find best results
valid_results = [r for r in dbscan_results if r['silhouette_score'] > 0]
if valid_results:
    # Sort by different metrics
    best_silhouette = max(valid_results, key=lambda x: x['silhouette_score'])
    balanced_score = max(valid_results, key=lambda x: x['silhouette_score'] * (1 - x['noise_ratio']))
    
    print("\nBest results:")
    print("\nBest by silhouette score:")
    print(f"eps={best_silhouette['eps']}, min_samples={best_silhouette['min_samples']}")
    print(f"Clusters: {best_silhouette['n_clusters']}")
    print(f"Noise points: {best_silhouette['n_noise']} ({best_silhouette['noise_ratio']*100:.1f}%)")
    print(f"Silhouette: {best_silhouette['silhouette_score']:.4f}")
    
    print("\nBest balanced (silhouette * (1 - noise_ratio)):")
    print(f"eps={balanced_score['eps']}, min_samples={balanced_score['min_samples']}")
    print(f"Clusters: {balanced_score['n_clusters']}")
    print(f"Noise points: {balanced_score['n_noise']} ({balanced_score['noise_ratio']*100:.1f}%)")
    print(f"Silhouette: {balanced_score['silhouette_score']:.4f}")
    
    # Use balanced score as best result
    best_dbscan = balanced_score
else:
    print("\nNo valid DBSCAN results found. Consider adjusting parameters.")
    best_dbscan = dbscan_results[0]  # Use first result as fallback

plt.tight_layout()
plt.savefig(f"{RESULTS_PATH}/dbscan_parameter_optimization.png", dpi=300, bbox_inches='tight')
plt.show()
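
The balanced selection rule used in the cell above, `silhouette * (1 - noise_ratio)`, can be isolated into a small helper so it is easy to reuse or test. This is a minimal sketch: the function name `pick_balanced` and the demo values are hypothetical and not part of the notebook.

```python
def pick_balanced(results):
    """Return the run with the highest silhouette * (1 - noise_ratio), or None."""
    # Only runs with a positive silhouette score count as valid,
    # mirroring the filter applied in the grid search.
    valid = [r for r in results if r["silhouette_score"] > 0]
    if not valid:
        return None
    return max(valid, key=lambda r: r["silhouette_score"] * (1 - r["noise_ratio"]))


# Made-up values, for illustration only.
demo_results = [
    {"eps": 0.3, "min_samples": 5, "silhouette_score": 0.60, "noise_ratio": 0.50},
    {"eps": 0.5, "min_samples": 5, "silhouette_score": 0.45, "noise_ratio": 0.10},
    {"eps": 1.5, "min_samples": 5, "silhouette_score": -1.0, "noise_ratio": 0.00},
]
best = pick_balanced(demo_results)
print(best["eps"], best["min_samples"])
```

In the demo, the run with the highest raw silhouette loses because half of its points are noise. Penalizing the noise ratio is what produces that outcome.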

### 9. Visualization: K-Means & DBSCAN (Side by Side) with t-SNE
- Computes a 2D t-SNE embedding of the **PCA features** (`features_pca`) with `init='pca'`, `learning_rate='auto'`, `random_state=RANDOM_STATE`, `perplexity=min(50, max(5, int(N * 0.01)))` and `max_iter=1500`.
- Left panel: scatter plot of the **K-Means assignments** (`cluster_labels`, discrete `tab10` colors, alpha 0.7).
- Right panel: **DBSCAN assignments** with dedicated coloring:
  - Noise (`-1`) is mapped to **grey**; the remaining clusters get colors from `tab10`.
  - Labels are remapped so that `-1 → 0` (grey) and the clusters follow consecutively.
- Sets axis labels and titles, adds discrete **colorbars** to both plots, and shows the **number of clusters** and **noise points** (DBSCAN) in the title.
- Saves the figure as `clustering_tsne_visualization_fixed.png` in `RESULTS_PATH` and displays it.
- Notes:
  - t-SNE is **stochastic** and preserves local structure; distances between clusters and cluster sizes in the global layout are not to scale, so interpret them with care.
  - For large N, **UMAP** can be a faster alternative; `random_state` keeps the visualizations reproducible.
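
t-SNE requires the perplexity to be smaller than the number of samples, so a rule that scales with N benefits from an explicit upper clamp for very small datasets. The sketch below shows one way to do this; `choose_perplexity` and its default bounds are illustrative assumptions, not a function defined in this notebook.

```python
def choose_perplexity(n_samples, fraction=0.01, lower=5, upper=50):
    """Scale perplexity with the sample count, but keep it below n_samples."""
    if n_samples < 2:
        raise ValueError("t-SNE needs at least two samples")
    # Roughly 1% of the samples, bounded to the range [lower, upper]
    target = min(upper, max(lower, int(n_samples * fraction)))
    # Perplexity must be strictly smaller than the number of samples
    return min(target, n_samples - 1)


for n in (4, 100, 2000, 10000):
    print(n, choose_perplexity(n))
```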


In [None]:
# Visualize clustering results with t-SNE (Fixed Colors & Legends)
print("=" * 80)
print("CLUSTER VISUALIZATION WITH t-SNE (IMPROVED)")
print("=" * 80)

print("Computing t-SNE embedding with stable parameters...")
N = len(features_pca)
# Use stable t-SNE parameters for reproducible and better results
tsne = TSNE(
    n_components=2,
    init='pca',  # PCA initialization for better stability
    learning_rate='auto',  # Adaptive learning rate
    perplexity=min(50, max(5, int(N * 0.01))),  # Dynamic perplexity based on sample size
    max_iter=1500,  # More iterations for convergence
    random_state=RANDOM_STATE,
    metric='euclidean',
    early_exaggeration=12.0
)

tsne_features = tsne.fit_transform(features_pca)
print(f"t-SNE completed with perplexity={tsne.perplexity}")

# Create visualization with fixed discrete colors
fig, axes = plt.subplots(1, 2, figsize=(20, 8))

# === K-MEANS PLOT WITH DISCRETE COLORS ===
# Remap K-Means labels to 0..C-1 for consistent coloring
unique_kmeans = sorted(set(cluster_labels))
kmeans_label_map = {old: new for new, old in enumerate(unique_kmeans)}
kmeans_colors = np.array([kmeans_label_map[label] for label in cluster_labels])

# Create discrete colormap for K-Means
n_kmeans_clusters = len(unique_kmeans)
kmeans_cmap = matplotlib.colors.ListedColormap(plt.cm.tab10(np.linspace(0, 1, n_kmeans_clusters)))
kmeans_norm = matplotlib.colors.BoundaryNorm(
    boundaries=np.arange(-0.5, n_kmeans_clusters, 1), 
    ncolors=n_kmeans_clusters
)

scatter1 = axes[0].scatter(
    tsne_features[:, 0],
    tsne_features[:, 1],
    c=kmeans_colors,
    cmap=kmeans_cmap,
    norm=kmeans_norm,
    alpha=0.7,
    s=50
)
axes[0].set_title(f'K-Means Clustering (k={optimal_k})\nt-SNE Visualization')
axes[0].set_xlabel('t-SNE Component 1')
axes[0].set_ylabel('t-SNE Component 2')

# Discrete colorbar for K-Means
cbar1 = plt.colorbar(scatter1, ax=axes[0], ticks=range(n_kmeans_clusters))
cbar1.set_ticklabels([f'Cluster {unique_kmeans[i]}' for i in range(n_kmeans_clusters)])
cbar1.set_label('K-Means Clusters')

# === DBSCAN PLOT WITH NOISE IN GREY ===
dbscan_colors = best_dbscan['labels'].copy()
unique_dbscan = sorted(set(dbscan_colors))
n_dbscan_clusters = len(unique_dbscan) - (1 if -1 in unique_dbscan else 0)

# Create color mapping: noise (-1) -> 0 (grey), clusters -> 1,2,3...
if -1 in unique_dbscan:
    # Noise gets grey, clusters get tab10 colors
    colors = ['#808080']  # Grey for noise
    colors.extend(plt.cm.tab10(np.linspace(0, 1, n_dbscan_clusters)))
    # Map -1 to 0, others to 1,2,3...
    dbscan_label_map = {-1: 0}
    cluster_idx = 1
    for label in unique_dbscan:
        if label != -1:
            dbscan_label_map[label] = cluster_idx
            cluster_idx += 1
    total_colors = len(unique_dbscan)
else:
    # No noise, just use tab10 for all clusters
    colors = plt.cm.tab10(np.linspace(0, 1, n_dbscan_clusters))
    dbscan_label_map = {label: idx for idx, label in enumerate(unique_dbscan)}
    total_colors = n_dbscan_clusters

dbscan_colors_mapped = np.array([dbscan_label_map[label] for label in dbscan_colors])

# Create discrete colormap for DBSCAN
dbscan_cmap = matplotlib.colors.ListedColormap(colors)
dbscan_norm = matplotlib.colors.BoundaryNorm(
    boundaries=np.arange(-0.5, total_colors, 1), 
    ncolors=total_colors
)

scatter2 = axes[1].scatter(
    tsne_features[:, 0],
    tsne_features[:, 1],
    c=dbscan_colors_mapped,
    cmap=dbscan_cmap,
    norm=dbscan_norm,
    alpha=0.7,
    s=50
)
axes[1].set_title(f'DBSCAN Clustering (eps={best_dbscan["eps"]})\nt-SNE Visualization\n{n_dbscan_clusters} clusters, {best_dbscan["n_noise"]} noise points')
axes[1].set_xlabel('t-SNE Component 1')
axes[1].set_ylabel('t-SNE Component 2')

# Discrete colorbar for DBSCAN
cbar2 = plt.colorbar(scatter2, ax=axes[1], ticks=range(total_colors))
if -1 in unique_dbscan:
    tick_labels = ['Noise'] + [f'Cluster {i}' for i in range(n_dbscan_clusters)]
else:
    tick_labels = [f'Cluster {unique_dbscan[i]}' for i in range(n_dbscan_clusters)]
cbar2.set_ticklabels(tick_labels)
cbar2.set_label('DBSCAN Clusters')

plt.tight_layout()
plt.savefig(f"{RESULTS_PATH}/clustering_tsne_visualization_fixed.png", dpi=300, bbox_inches='tight')
plt.show()

print("\nVisualization improvements:")
print("- Used stable t-SNE parameters (init='pca', learning_rate='auto')")
print("- Discrete colormaps with BoundaryNorm for clean cluster separation")
print("- Proper legend labels for both K-Means and DBSCAN")
print("- Noise points clearly marked in grey for DBSCAN")
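
The label remapping used for the DBSCAN panel (noise to index 0, clusters to consecutive indices after it) can be expressed as a small standalone function. This is a sketch in plain Python; `remap_labels` is a hypothetical name and the notebook itself builds the mapping inline.

```python
def remap_labels(labels):
    """Map noise (-1) to 0 and the remaining cluster ids to consecutive indices.

    Without noise, the clusters start at index 0.
    Returns the remapped labels and the mapping that was applied.
    """
    unique = sorted({int(label) for label in labels})
    mapping = {}
    next_idx = 0
    if -1 in unique:
        mapping[-1] = 0  # index 0 is reserved for the grey noise color
        next_idx = 1
    for label in unique:
        if label != -1:
            mapping[label] = next_idx
            next_idx += 1
    return [mapping[int(label)] for label in labels], mapping


remapped, mapping = remap_labels([-1, 0, 2, 2, 0])
print(remapped, mapping)
```

Keeping the mapping alongside the remapped labels makes it possible to label colorbar ticks with the original cluster ids.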

### 9.1 K-Nearest-Neighbor Distance Plot

In [None]:
# K-Distance Plot for DBSCAN Parameter Guidance
print("=" * 80)
print("K-DISTANCE PLOT FOR DBSCAN EPS SELECTION")
print("=" * 80)

from sklearn.neighbors import NearestNeighbors

# Use the min_samples from best DBSCAN result
k = best_dbscan['min_samples']
print(f"Computing {k}-distance plot for eps guidance...")

# Fit NearestNeighbors on PCA features
neighbors = NearestNeighbors(n_neighbors=k, metric='euclidean')
neighbors.fit(features_pca)

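# Distance to the k-th nearest neighbor of each point. Because the query set equals
# the training set, each point is its own nearest neighbor (distance 0), which matches
# DBSCAN's min_samples convention of counting the point itself.
distances, indices = neighbors.kneighbors(features_pca)
k_distances = distances[:, k-1]  # k-th distance (0-indexed), self included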

# Sort distances in descending order
k_distances_sorted = np.sort(k_distances)[::-1]

# Create the plot
plt.figure(figsize=(10, 6))
plt.plot(range(len(k_distances_sorted)), k_distances_sorted, 'b-', linewidth=1)
plt.axhline(y=best_dbscan['eps'], color='red', linestyle='--', 
           label=f'Best eps = {best_dbscan["eps"]}')

# Add some reference lines for other tested eps values
for eps in [0.3, 0.5, 0.7, 1.0]:
    if eps != best_dbscan['eps']:
        plt.axhline(y=eps, color='grey', linestyle=':', alpha=0.5, 
                   label=f'eps = {eps}')

plt.xlabel('Points sorted by distance')
plt.ylabel(f'{k}-th Nearest Neighbor Distance')
plt.title(f'K-Distance Plot (k={k}) for DBSCAN eps Selection')
plt.grid(True, alpha=0.3)

# Highlight the region where the elbow typically appears (heuristic: 5-20% of points)
elbow_start = int(len(k_distances_sorted) * 0.05)
elbow_end = int(len(k_distances_sorted) * 0.20)
plt.axvspan(elbow_start, elbow_end, alpha=0.2, color='yellow', 
           label='Typical elbow region')

# Draw the legend last so that it includes the elbow-region patch
plt.legend()

plt.tight_layout()
plt.savefig(f"{RESULTS_PATH}/dbscan_k_distance.png", dpi=300, bbox_inches='tight')
plt.show()

# A point is a core point if its k-distance (self included) is at most eps
n_core = int(np.sum(k_distances <= best_dbscan['eps']))

print("\nK-Distance Plot Analysis:")
print(f"- Red line shows selected eps = {best_dbscan['eps']}")
print("- Look for the 'elbow' in the curve to find a suitable eps")
print("- A sharp bend indicates a clear separation between dense regions and noise")
print(f"- With the current eps, {n_core} of {len(k_distances)} points have a {k}-distance <= eps (core points)")
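
The elbow of the k-distance curve can also be estimated programmatically. The sketch below uses one common heuristic: after normalizing the curve, it picks the point with the largest vertical gap to the chord joining the first and last point. This heuristic is an assumption and not something the notebook uses, and `find_elbow` is a hypothetical helper name. Treat the result as a starting point for `eps` and confirm it visually.

```python
def find_elbow(sorted_distances):
    """Index of the point farthest from the chord between the curve's endpoints."""
    values = [float(d) for d in sorted_distances]
    n = len(values)
    if n < 3:
        raise ValueError("need at least three distances")
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return 0  # flat curve: there is no elbow to find
    # Normalize the y-axis to [0, 1] so the result does not depend on the distance scale
    first = (values[0] - lo) / span
    last = (values[-1] - lo) / span
    best_idx, best_gap = 0, -1.0
    for i, value in enumerate(values):
        x = i / (n - 1)                       # normalized position on the x-axis
        y = (value - lo) / span               # normalized distance
        chord_y = first + (last - first) * x  # chord height at this position
        gap = abs(y - chord_y)
        if gap > best_gap:
            best_idx, best_gap = i, gap
    return best_idx


# Made-up k-distances sorted in descending order, as in the plot above
demo = [10.0, 9.0, 8.0, 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
elbow = find_elbow(demo)
print(f"elbow at index {elbow}, suggested eps ~ {demo[elbow]}")
```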

### 9.2 Interactive Plot for Exploratory Cluster Analysis
- No new libraries required

In [None]:
## Interactive Image Browser for t-SNE Results
## This version works with the standard inline matplotlib backend
#
#import numpy as np
#from pathlib import Path
#from PIL import Image
#import matplotlib.pyplot as plt
#from IPython.display import display, clear_output
#import ipywidgets as widgets
#from ipywidgets import interact, interactive, fixed
#
## --- Prepare data from your notebook ---
#X2d = tsne_features                  # shape (N,2)
#paths = np.array([f['filepath'] for f in image_files])   # length N
#labels_k = np.array(cluster_labels)  # K-Means labels
#labels_db = np.array(best_dbscan['labels'])  # DBSCAN labels (may include -1 for noise)
#
#def create_cluster_browser(X, labels, paths, title, clustering_type="kmeans"):
#    """Create an interactive cluster browser with dropdown selection"""
#    
#    # Create dropdown for point selection
#    n_points = len(X)
#    point_options = [(f"Point {i} - Cluster {labels[i]}", i) for i in range(n_points)]
#    
#    # Create widgets
#    point_selector = widgets.Dropdown(
#        options=point_options,
#        value=0,
#        description='Select point:',
#        style={'description_width': 'initial'}
#    )
#    
#    cluster_filter = widgets.Dropdown(
#        options=[('All clusters', -999)] + [(f'Cluster {c}', c) for c in np.unique(labels)],
#        value=-999,
#        description='Cluster Filter:',
#        style={'description_width': 'initial'}
#    )
#    
#    # Output areas
#    plot_output = widgets.Output()
#    image_output = widgets.Output()
#    info_output = widgets.Output()
#    
#    def update_point_options(cluster_choice):
#        """Update available points based on cluster selection"""
#        if cluster_choice == -999:  # All clusters
#            filtered_indices = list(range(n_points))
#        else:
#            filtered_indices = np.where(labels == cluster_choice)[0].tolist()
#        
#        new_options = [(f"Point {i} - Cluster {labels[i]}", i) for i in filtered_indices]
#        point_selector.options = new_options
#        if new_options:
#            point_selector.value = new_options[0][1]
#    
#    def show_plot_and_image(selected_point, cluster_choice):
#        """Display the scatter plot and selected image"""
#        
#        # Clear outputs
#        plot_output.clear_output(wait=True)
#        image_output.clear_output(wait=True)
#        info_output.clear_output(wait=True)
#        
#        with plot_output:
#            # Create scatter plot
#            fig, ax = plt.subplots(figsize=(10, 8))
#            
#            # Prepare colors for clusters
#            unique_labels = np.unique(labels)
#            colors = plt.cm.tab10(np.linspace(0, 1, len(unique_labels)))
#            color_map = {}
#            
#            for i, label in enumerate(unique_labels):
#                if label == -1:  # Noise points for DBSCAN
#                    color_map[label] = 'gray'
#                else:
#                    color_map[label] = colors[i]
#            
#            # Plot all points
#            for label in unique_labels:
#                mask = labels == label
#                label_name = 'Noise' if label == -1 else f'Cluster {label}'
#                ax.scatter(
#                    X[mask, 0], X[mask, 1], 
#                    c=[color_map[label]], 
#                    label=label_name,
#                    s=50, alpha=0.6
#                )
#            
#            # Highlight selected point
#            ax.scatter(
#                X[selected_point, 0], X[selected_point, 1], 
#                c='red', s=200, marker='o', 
#                facecolors='none', edgecolors='red', linewidths=3,
#                label='Selected point'
#            )
#            
#            ax.set_title(f'{title}\nSelected point: {selected_point}, Cluster: {labels[selected_point]}')
#            ax.set_xlabel('t-SNE Component 1')
#            ax.set_ylabel('t-SNE Component 2')
#            ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
#            ax.grid(True, alpha=0.3)
#            
#            plt.tight_layout()
#            plt.show()
#        
#        with image_output:
#            # Show the selected image
#            try:
#                img_path = paths[selected_point]
#                img = Image.open(img_path)
#                
#                fig_img, ax_img = plt.subplots(figsize=(8, 6))
#                ax_img.imshow(img)
#                ax_img.set_title(f'{Path(img_path).name}\n{clustering_type.upper()} Cluster: {labels[selected_point]}')
#                ax_img.axis('off')
#                plt.tight_layout()
#                plt.show()
#                
#            except Exception as e:
#                print(f"Error loading image: {str(e)}")
#        
#        with info_output:
#            # Show detailed information
#            img_path = paths[selected_point]
#            print(f"📍 Point index: {selected_point}")
#            print(f"🏷️  Cluster: {labels[selected_point]}")
#            print(f"📁 File name: {Path(img_path).name}")
#            print(f"📂 Full path: {img_path}")
#            print(f"📊 t-SNE coordinates: ({X[selected_point, 0]:.3f}, {X[selected_point, 1]:.3f})")
#    
#    # Connect the cluster filter to point options update
#    def on_cluster_change(change):
#        if change['type'] == 'change' and change['name'] == 'value':
#            update_point_options(change['new'])
#    
#    cluster_filter.observe(on_cluster_change)
#    
#    # Create interactive interface
#    interactive_plot = interactive(
#        show_plot_and_image,
#        selected_point=point_selector,
#        cluster_choice=fixed(cluster_filter.value)
#    )
#    
#    # Layout
#    controls = widgets.VBox([
#        widgets.HTML(f"<h3>{title}</h3>"),
#        widgets.HTML("<b>Select a point to see the corresponding image:</b>"),
#        cluster_filter,
#        point_selector
#    ])
#    
#    plot_area = widgets.VBox([
#        widgets.HTML("<b>Scatter plot:</b>"),
#        plot_output
#    ])
#    
#    image_area = widgets.VBox([
#        widgets.HTML("<b>Image preview:</b>"),
#        image_output,
#        info_output
#    ])
#    
#    # Display everything
#    display(widgets.VBox([
#        controls,
#        widgets.HBox([plot_area, image_area])
#    ]))
#    
#    # Show initial plot
#    show_plot_and_image(0, -999)
#    
#    return interactive_plot
#
## Create browsers for both clustering methods
#print("Creating interactive cluster browsers...")
#print("📝 Instructions:")
#print("   1. Optionally select a cluster from the filter")
#print("   2. Select a point from the dropdown list")
#print("   3. The corresponding image is displayed automatically")
#print()
#
## K-Means browser
#print("🔍 K-Means Clustering Browser:")
#kmeans_browser = create_cluster_browser(
#    X2d, labels_k, paths, 
#    title=f"K-Means Clustering (k={optimal_k})",
#    clustering_type="kmeans"
#)
#
#print("\n" + "="*80 + "\n")
#
## DBSCAN browser
#print("🔍 DBSCAN Clustering Browser:")
#dbscan_browser = create_cluster_browser(
#    X2d, labels_db, paths,
#    title=f"DBSCAN Clustering (eps={best_dbscan['eps']:.3f}, min_samples={best_dbscan.get('min_samples', 'N/A')})",
#    clustering_type="dbscan"
#)

### 10. Saving the Cluster Results

#### 10.1 Define Storage Helpers (Directories / Image Copies / Metadata)
- `create_cluster_directories(base_path, method_name, cluster_labels, dbscan_labels=None)`: creates one directory per cluster (`cluster_<id>`) under `base_path/method_name/`; for DBSCAN, noise points go to `noise/`. Returns a dict `{cluster_id: Path}`.
- `copy_images_to_clusters(image_files, cluster_dirs, cluster_labels, method_name='kmeans', dbscan_labels=None)`: copies the images into their cluster directories, skipping targets that already exist with the same file size. Returns the number of copies and errors.
- `save_cluster_metadata(cluster_dirs, image_files, cluster_labels, method_name='kmeans', dbscan_labels=None)`: writes one `cluster_metadata.json` per cluster with `cluster_id`, `method`, `total_images`, `tenant_distribution` and `images[filename, tenant, sid, original_name]`.
- Notes: directory creation is idempotent; `cluster_id` is converted to `int` so that it is JSON serializable; DBSCAN noise (`-1`) is kept separately in `noise/`.


In [None]:
def create_cluster_directories(base_path, method_name, cluster_labels, dbscan_labels=None):
    """Create directory structure for clustered images."""
    method_path = Path(base_path) / method_name
    method_path.mkdir(parents=True, exist_ok=True)
    
    if method_name == 'kmeans':
        unique_clusters = sorted(set(cluster_labels))
    else:  # dbscan
        unique_clusters = sorted(set(dbscan_labels))
        # Handle noise points (-1) separately
        if -1 in unique_clusters:
            unique_clusters = [c for c in unique_clusters if c != -1] + [-1]
    
    cluster_dirs = {}
    for cluster_id in unique_clusters:
        if cluster_id == -1:
            cluster_dir = method_path / 'noise'
        else:
            cluster_dir = method_path / f'cluster_{cluster_id}'
        cluster_dir.mkdir(exist_ok=True)
        cluster_dirs[cluster_id] = cluster_dir
    
    return cluster_dirs

def copy_images_to_clusters(image_files, cluster_dirs, cluster_labels, method_name='kmeans', dbscan_labels=None):
    """Copy images to their respective cluster directories."""
    print(f"\nCopying images to {method_name.upper()} cluster directories...")
    
    if method_name == 'kmeans':
        labels_to_use = cluster_labels
    else:  # dbscan
        labels_to_use = dbscan_labels
    
    copied_count = 0
    error_count = 0
    
    for i, img_file in enumerate(image_files):
        try:
            cluster_id = labels_to_use[i]
            source_path = Path(img_file['filepath'])
            target_dir = cluster_dirs[cluster_id]
            target_path = target_dir / source_path.name
            
            # Copy file if the target is missing or differs in size (size is only a cheap proxy for identity)
            if not target_path.exists() or target_path.stat().st_size != source_path.stat().st_size:
                shutil.copy2(source_path, target_path)
                copied_count += 1
            
        except Exception as e:
            print(f"Error copying {img_file['filename']}: {e}")
            error_count += 1
    
    print(f"Successfully copied {copied_count} images")
    if error_count > 0:
        print(f"Errors: {error_count}")
    
    return copied_count, error_count

def save_cluster_metadata(cluster_dirs, image_files, cluster_labels, method_name='kmeans', dbscan_labels=None):
    """Save metadata for each cluster."""
    print(f"\nSaving {method_name.upper()} cluster metadata...")
    
    if method_name == 'kmeans':
        labels_to_use = cluster_labels
    else:  # dbscan
        labels_to_use = dbscan_labels
    
    for cluster_id, cluster_dir in cluster_dirs.items():
        # Get images for this cluster
        cluster_images = []
        for i, img_file in enumerate(image_files):
            if int(labels_to_use[i]) == cluster_id:  # Convert numpy int32 to Python int
                cluster_images.append({
                    'filename': img_file['filename'],
                    'tenant': img_file['tenant'],
                    'sid': img_file['sid'],
                    'original_name': img_file['original_name']
                })
        
        # Calculate statistics
        tenant_counts = Counter([img['tenant'] for img in cluster_images])
        
        metadata = {
            'cluster_id': int(cluster_id),  # NumPy integer types are not JSON serializable
            'method': method_name,
            'total_images': len(cluster_images),
            'tenant_distribution': dict(tenant_counts),
            'images': cluster_images
        }
        
        # Save metadata
        metadata_file = cluster_dir / 'cluster_metadata.json'
        with open(metadata_file, 'w') as f:
            json.dump(metadata, f, indent=2)
    
    print(f"Metadata saved for {len(cluster_dirs)} clusters")
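
`copy_images_to_clusters` treats a target as up to date when it has the same file size as the source. If two different images can share a name and a size, a content comparison is stricter. The sketch below shows one possible variant using SHA-256 digests; `file_digest` and `copy_if_changed` are hypothetical helpers, not functions from this notebook.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_if_changed(source, target):
    """Copy source to target unless target already has identical contents.

    Returns True if a copy was made, False if the copy was skipped.
    """
    source, target = Path(source), Path(target)
    if (target.exists()
            and target.stat().st_size == source.stat().st_size  # cheap check first
            and file_digest(target) == file_digest(source)):    # then compare contents
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, target)
    return True
```

The size check stays in place as a fast path, so the digest is only computed when the sizes match.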

#### 10.2 Save K-Means Clusters (Execution)
- Creates the K-Means directory structure (`kmeans/cluster_<id>`) under the timestamped `CLUSTERS_PATH`.
- Prints the number of images per cluster (computed from `cluster_labels`).
- Copies the images into their cluster directories (skipping targets that already exist with the same file size) and counts copies and errors.
- Writes one `cluster_metadata.json` per cluster (size, tenant distribution, image list) into the respective directory.
- Finishes by printing the target path `CLUSTERS_PATH/kmeans`.
- Note: the operation is idempotent; for large datasets, consider symlinks instead of copies to save disk space.
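
The symlink option mentioned in the note could look roughly like the following sketch. It falls back to a regular copy where symlinks are unavailable, for example on Windows without the required privilege. `link_or_copy` is a hypothetical helper and is not used by the cells below.

```python
import os
import shutil
from pathlib import Path


def link_or_copy(source, target):
    """Place source at target as a symlink, falling back to a copy.

    Returns "symlink" or "copy" depending on what was created.
    """
    source = Path(source).resolve()  # absolute path keeps the link valid from any directory
    target = Path(target)
    target.parent.mkdir(parents=True, exist_ok=True)
    if target.is_symlink() or target.exists():
        target.unlink()  # replace a stale link or file from an earlier run
    try:
        os.symlink(source, target)
        return "symlink"
    except (OSError, NotImplementedError):
        shutil.copy2(source, target)
        return "copy"
```

Symlinks save disk space, but they break if the source dataset is moved or deleted, so copies remain the safer choice for archived results.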

In [None]:
# Save K-Means clustering results to timestamped directories
print("=" * 80)
print("SAVING K-MEANS CLUSTERS TO TIMESTAMPED DIRECTORIES")
print("=" * 80)

print(f"Using timestamped cluster directory: {CLUSTERS_PATH}")

# Create K-Means cluster directories
kmeans_dirs = create_cluster_directories(CLUSTERS_PATH, 'kmeans', cluster_labels)

print(f"Created K-Means cluster directories:")
for cluster_id, cluster_dir in sorted(kmeans_dirs.items()):
    cluster_size = sum(1 for label in cluster_labels if label == cluster_id)
    print(f"  {cluster_dir.name}: {cluster_size} images")

# Copy images to K-Means clusters
kmeans_copied, kmeans_errors = copy_images_to_clusters(
    image_files, kmeans_dirs, cluster_labels, 'kmeans'
)

# Save K-Means cluster metadata
save_cluster_metadata(kmeans_dirs, image_files, cluster_labels, 'kmeans')

print(f"\nK-Means clustering results saved to: {Path(CLUSTERS_PATH) / 'kmeans'}")

#### 10.3 Save DBSCAN Clusters (Execution)
- Adds `dbscan_cluster` from `best_dbscan['labels']` to each entry in `image_files` (including noise = `-1`).
- Creates the directory structure under `CLUSTERS_PATH/dbscan/` with `cluster_<id>/` and `noise/` (for label `-1`).
- Prints the number of images per DBSCAN cluster (noise reported separately).
- Copies all images into the corresponding **DBSCAN directories** (skipping targets that already exist with the same file size).
- Writes one **cluster_metadata.json** per cluster (total, tenant distribution, image list).
- Finishes by printing the target path `CLUSTERS_PATH/dbscan`.


In [None]:
# Save DBSCAN clustering results to timestamped directories
print("=" * 80)
print("SAVING DBSCAN CLUSTERS TO TIMESTAMPED DIRECTORIES")
print("=" * 80)

print(f"Using timestamped cluster directory: {CLUSTERS_PATH}")

# Add DBSCAN cluster labels to image files for consistency
dbscan_labels = best_dbscan['labels']
for i, img_file in enumerate(image_files):
    img_file['dbscan_cluster'] = int(dbscan_labels[i])

# Create DBSCAN cluster directories
dbscan_dirs = create_cluster_directories(CLUSTERS_PATH, 'dbscan', None, dbscan_labels)

print(f"Created DBSCAN cluster directories:")
for cluster_id, cluster_dir in sorted(dbscan_dirs.items()):
    cluster_size = sum(1 for label in dbscan_labels if label == cluster_id)
    if cluster_id == -1:
        print(f"  {cluster_dir.name} (noise): {cluster_size} images")
    else:
        print(f"  {cluster_dir.name}: {cluster_size} images")

# Copy images to DBSCAN clusters
dbscan_copied, dbscan_errors = copy_images_to_clusters(
    image_files, dbscan_dirs, None, 'dbscan', dbscan_labels
)

# Save DBSCAN cluster metadata
save_cluster_metadata(dbscan_dirs, image_files, None, 'dbscan', dbscan_labels)

print(f"\nDBSCAN clustering results saved to: {Path(CLUSTERS_PATH) / 'dbscan'}")

### 11. Generate Summary Report & Directory Overview
- Builds `cluster_summary` with:
  - `run_info`: run timestamp (`TIMESTAMP`) and the resolved paths of `RESULTS_PATH` and `CLUSTERS_PATH`.
  - `dataset_info`: source path, number of processed images, clustering date.
  - `kmeans_clustering`: method, `optimal_k`, best silhouette score, copy/error counts, cluster distribution.
  - `dbscan_clustering`: method, `eps`, `min_samples`, number of clusters, number of noise points, silhouette score, copy/error counts, cluster distribution.
  - `directory_structure`: absolute paths to `base_path`, `kmeans_path`, `dbscan_path`.
- Saves the summary as `clustering_summary.json` in both `CLUSTERS_PATH` and `RESULTS_PATH`.
- Console overview:
  - Total number of processed images, number of K-Means/DBSCAN clusters created, copy statistics.
  - Tree view of the created directory structure, including cluster sizes.
  - List of the relevant output files (summary, `cluster_metadata.json` per cluster, analysis plots under `RESULTS_PATH`).
- Note: `min_samples` in `dbscan_clustering` is taken from `best_dbscan['min_samples']` rather than from a possibly stale loop variable.

In [None]:
# Generate summary report for timestamped clustered images
print("=" * 80)
print("TIMESTAMPED CLUSTER ORGANIZATION SUMMARY")
print("=" * 80)

# Create comprehensive summary with timestamp info
cluster_summary = {
    'run_info': {
        'timestamp': TIMESTAMP,
        'analysis_results_path': str(Path(RESULTS_PATH).resolve()),
        'clustered_images_path': str(Path(CLUSTERS_PATH).resolve())
    },
    'dataset_info': {
        'source_dataset': str(Path(DATASET_PATH).resolve()),
        'total_images_processed': len(image_files),
        'clustering_date': time.strftime('%Y-%m-%d %H:%M:%S')
    },
    'kmeans_clustering': {
        'method': 'K-Means (MiniBatch)',
        'optimal_k': optimal_k,
        'silhouette_score': float(silhouette_scores[best_silhouette_idx]),
        'images_copied': kmeans_copied,
        'copy_errors': kmeans_errors,
        'cluster_distribution': {
            str(cluster_id): int(count) for cluster_id, count in Counter(cluster_labels).items()
        }
    },
    'dbscan_clustering': {
        'method': 'DBSCAN',
        'eps': best_dbscan['eps'],
        'min_samples': best_dbscan['min_samples'],
        'n_clusters': best_dbscan['n_clusters'],
        'n_noise': best_dbscan['n_noise'],
        'silhouette_score': float(best_dbscan['silhouette_score']),
        'images_copied': dbscan_copied,
        'copy_errors': dbscan_errors,
        'cluster_distribution': {
            str(cluster_id): int(count) for cluster_id, count in Counter(dbscan_labels).items()
        }
    },
    'directory_structure': {
        'base_path': str(Path(CLUSTERS_PATH).resolve()),
        'kmeans_path': str(Path(CLUSTERS_PATH, 'kmeans').resolve()),
        'dbscan_path': str(Path(CLUSTERS_PATH, 'dbscan').resolve())
    }
}

# Save cluster summary in both locations
summary_file_clusters = Path(CLUSTERS_PATH) / 'clustering_summary.json'
summary_file_analysis = Path(RESULTS_PATH) / 'clustering_summary.json'

with open(summary_file_clusters, 'w') as f:
    json.dump(cluster_summary, f, indent=2)
with open(summary_file_analysis, 'w') as f:
    json.dump(cluster_summary, f, indent=2)

print(f"Timestamped cluster organization completed successfully!")
print(f"\nRun timestamp: {TIMESTAMP}")
print(f"Summary:")
print(f"- Total images processed: {len(image_files)}")
print(f"- K-Means clusters created: {len(kmeans_dirs)}")
print(f"- DBSCAN clusters created: {len(dbscan_dirs)}")
print(f"- Images copied (K-Means): {kmeans_copied}")
print(f"- Images copied (DBSCAN): {dbscan_copied}")

print(f"\nTimestamped directory structure created:")
print(f"results/")
print(f"├── clustering_analysis_{TIMESTAMP}/")
print(f"│   ├── run_metadata.json")
print(f"│   ├── clustering_report.json")
print(f"│   ├── clustering_summary.json")
print(f"│   └── *.png (analysis plots)")
print(f"└── clustered_images_{TIMESTAMP}/")
print(f"    ├── clustering_summary.json")
print(f"    ├── kmeans/")
for cluster_id in sorted(set(cluster_labels)):
    cluster_size = sum(1 for label in cluster_labels if label == cluster_id)
    print(f"    │   ├── cluster_{cluster_id}/ ({cluster_size} images)")
print(f"    └── dbscan/")
for cluster_id in sorted(set(dbscan_labels)):
    cluster_size = sum(1 for label in dbscan_labels if label == cluster_id)
    if cluster_id == -1:
        print(f"        ├── noise/ ({cluster_size} images)")
    else:
        print(f"        ├── cluster_{cluster_id}/ ({cluster_size} images)")

print(f"\nFiles saved:")
print(f"- Cluster summary: {summary_file_clusters}")
print(f"- Analysis summary: {summary_file_analysis}")
print(f"- Individual cluster metadata: cluster_metadata.json in each cluster directory")
print(f"- Analysis results: {RESULTS_PATH}/")
print(f"- Clustered images: {CLUSTERS_PATH}/")
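
The tree printed above uses `├──` for every entry, including the last one in each directory. A small helper could choose `└──` for the final entry instead. This is only a sketch; `render_tree` and the demo values are hypothetical.

```python
def render_tree(root, entries):
    """Render a one-level directory tree with correct branch characters.

    entries is a list of (directory_name, image_count) pairs.
    """
    lines = [f"{root}/"]
    for i, (name, count) in enumerate(entries):
        # The last entry closes the branch; all others continue it
        branch = "└──" if i == len(entries) - 1 else "├──"
        lines.append(f"{branch} {name}/ ({count} images)")
    return lines


# Made-up cluster sizes, for illustration only
for line in render_tree("dbscan", [("cluster_0", 12), ("cluster_1", 7), ("noise", 3)]):
    print(line)
```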

### 12. Tenant-Cluster Analysis (Heatmap & Detailed Statistics)
- Builds a tenant × cluster matrix (row-normalized) that holds, for each tenant, the **percentage of its images** falling into each cluster.
- Visualizes the distribution as a **heatmap** with cell annotations (format `.1f`) and saves it as `tenant_cluster_heatmap.png` under `RESULTS_PATH`.
- Also prints a **detail table** to the console: for each tenant, the total image count plus the count and percentage for every non-empty cluster.
- Interpretation: each row sums to ~100 %, so a row shows how strongly a tenant's images are **skewed** toward particular clusters; treat small tenants with caution (high variance).
- Optional: add a second heatmap of **raw counts** or a **stacked bar chart** per tenant so that absolute numbers can be compared.
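
The optional raw-count view could be sketched as follows. This is a minimal example, not part of the notebook's pipeline: it assumes `image_files` is a list of dicts with `tenant` and `cluster` keys, and the helper `tenant_cluster_counts` and the toy records are hypothetical names introduced only for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt


def tenant_cluster_counts(image_files):
    """Return a tenant x cluster table of raw image counts."""
    df = pd.DataFrame(image_files)
    return pd.crosstab(df["tenant"], df["cluster"])


# Hypothetical toy records in the same shape as `image_files`
toy_files = [
    {"tenant": "city_a", "cluster": 0},
    {"tenant": "city_a", "cluster": 0},
    {"tenant": "city_a", "cluster": 1},
    {"tenant": "city_b", "cluster": 1},
]

counts = tenant_cluster_counts(toy_files)

# Row-normalized percentages, i.e. the same quantity the heatmap shows
percentages = counts.div(counts.sum(axis=1), axis=0) * 100

# Stacked bars keep absolute sizes visible, so small tenants stand out
ax = counts.plot(kind="bar", stacked=True, figsize=(8, 4))
ax.set_xlabel("Tenant")
ax.set_ylabel("Number of images")
ax.set_title("Images per tenant, stacked by cluster")
plt.tight_layout()
```

With the real data, passing `image_files` instead of `toy_files` should give the absolute counterpart to the percentage heatmap; `sns.heatmap(counts, annot=True, fmt='d')` would be one way to draw the raw-count heatmap from the same table.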


In [None]:
# Analyze tenant distribution across clusters
print("=" * 80)
print("TENANT DISTRIBUTION ANALYSIS")
print("=" * 80)

# Create tenant-cluster matrix
tenant_cluster_matrix = defaultdict(lambda: defaultdict(int))
total_by_tenant = defaultdict(int)

for img_file in image_files:
    tenant = img_file['tenant']
    cluster = img_file['cluster']
    tenant_cluster_matrix[tenant][cluster] += 1
    total_by_tenant[tenant] += 1

# Convert to DataFrame for easier visualization
tenants = sorted(total_by_tenant.keys())
clusters = sorted(set(cluster_labels))

matrix_data = []
for tenant in tenants:
    row = []
    for cluster in clusters:
        count = tenant_cluster_matrix[tenant][cluster]
        percentage = (count / total_by_tenant[tenant]) * 100
        row.append(percentage)
    matrix_data.append(row)

# Create heatmap
plt.figure(figsize=(12, 8))
heatmap_data = np.array(matrix_data)
sns.heatmap(
    heatmap_data,
    xticklabels=[f'Cluster {c}' for c in clusters],
    yticklabels=tenants,
    annot=True,
    fmt='.1f',
    cmap='YlOrRd',
    cbar_kws={'label': 'Percentage of Tenant Images'}
)
plt.title('Tenant Distribution Across Clusters (%)')
plt.xlabel('Clusters')
plt.ylabel('Tenants')
plt.tight_layout()
plt.savefig(f"{RESULTS_PATH}/tenant_cluster_heatmap.png", dpi=300, bbox_inches='tight')
plt.show()

# Print detailed statistics
print("\nDetailed Tenant-Cluster Distribution:")
for tenant in tenants:
    print(f"\n{tenant} ({total_by_tenant[tenant]} images):")
    for cluster in clusters:
        count = tenant_cluster_matrix[tenant][cluster]
        percentage = (count / total_by_tenant[tenant]) * 100
        if count > 0:
            print(f"  Cluster {cluster}: {count} images ({percentage:.1f}%)")

### 13. Create Clustering Report (JSON) & Print Key Results
- Writes `clustering_report.json` to `RESULTS_PATH` with:
  - `run_info`: `timestamp`, `run_date`, `analysis_results_path`, `clustered_images_path`.
  - `dataset_info`: `total_images`, `feature_dimension`, `pca_components`, `pca_explained_variance`, `tenant_distribution`.
  - `kmeans_results`: `optimal_k`, the best `silhouette_score`, `cluster_sizes`, and the list of `evaluation_results` (k, inertia, silhouette score).
  - `dbscan_results`: `best_eps`, `min_samples`, `n_clusters`, `n_noise`, `silhouette_score`.
  - `tenant_cluster_analysis`: `total_images` and `cluster_distribution` per tenant.
- Persists the report (indent=2) and prints a compact **summary** to the console:
  - Optimal k (K-Means) and best silhouette score.
  - DBSCAN result: number of clusters and number of noise points.
  - PCA: total explained variance (%).
  - Cluster sizes (K-Means) including percentages.
- Notes:
  - If `image_files` was reduced by the valid filter, `tenant_distribution` (computed during the early loading phase) can differ from the per-tenant totals in `tenant_cluster_analysis`. Optionally recompute it from the filtered `image_files`.
  - `min_samples` is stored alongside `best_eps` in `dbscan_results`, so the DBSCAN parameter choice is fully documented.
  - Visualizations (`*.png`) and cluster examples (`cluster_*_examples.png`) are stored alongside the report in `RESULTS_PATH`.
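
Recomputing `tenant_distribution` from the filtered records could look roughly like this. It is a sketch under the assumption that each entry of `image_files` is a dict with a `tenant` key; the helper name `recompute_tenant_distribution` and the toy values are hypothetical.

```python
from collections import Counter


def recompute_tenant_distribution(image_files):
    """Count images per tenant from the (already filtered) records."""
    return dict(Counter(img["tenant"] for img in image_files))


# Hypothetical example: one image of city_b was dropped by the valid filter
early_distribution = {"city_a": 2, "city_b": 2}
filtered_files = [
    {"tenant": "city_a"},
    {"tenant": "city_a"},
    {"tenant": "city_b"},
]

tenant_distribution = recompute_tenant_distribution(filtered_files)

# Per-tenant number of images lost between loading and clustering
dropped = {
    tenant: early_distribution[tenant] - tenant_distribution.get(tenant, 0)
    for tenant in early_distribution
}

print(tenant_distribution)
print(dropped)
```

After recomputing, the values of `tenant_distribution` sum to `total_images`, which keeps the two `dataset_info` fields consistent with each other.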


In [None]:
# Generate comprehensive clustering report with timestamp info
print("=" * 80)
print("TIMESTAMPED CLUSTERING ANALYSIS SUMMARY")
print("=" * 80)

# Compile results with timestamp information
clustering_report = {
    'run_info': {
        'timestamp': TIMESTAMP,
        'run_date': time.strftime('%Y-%m-%d %H:%M:%S'),
        'analysis_results_path': str(Path(RESULTS_PATH).resolve()),
        'clustered_images_path': str(Path(CLUSTERS_PATH).resolve())
    },
    'dataset_info': {
        'total_images': int(len(image_files)),
        'feature_dimension': int(FEATURE_DIM),
        'pca_components': int(PCA_COMPONENTS),
        'pca_explained_variance': float(pca.explained_variance_ratio_.sum()),
        'tenant_distribution': {k: int(v) for k, v in tenant_distribution.items()}
    },
    'kmeans_results': {
        'optimal_k': int(optimal_k),
        'silhouette_score': float(silhouette_scores[best_silhouette_idx]),
        'cluster_sizes': {int(k): int(v) for k, v in Counter(cluster_labels).items()},
        'evaluation_results': [
            {
                'k': int(r['k']),
                'inertia': float(r['inertia']),
                'silhouette_score': float(r['silhouette_score'])
            } for r in kmeans_results
        ]
    },
    'dbscan_results': {
        'best_eps': float(best_dbscan['eps']),
        'min_samples': int(best_dbscan['min_samples']),
        'n_clusters': int(best_dbscan['n_clusters']),
        'n_noise': int(best_dbscan['n_noise']),
        'silhouette_score': float(best_dbscan['silhouette_score'])
    },
    'tenant_cluster_analysis': {
        tenant: {
            'total_images': int(total_by_tenant[tenant]),
            'cluster_distribution': {int(k): int(v) for k, v in tenant_cluster_matrix[tenant].items()}
        } for tenant in tenants
    }
}

# Save results to timestamped directory
with open(f"{RESULTS_PATH}/clustering_report.json", 'w') as f:
    json.dump(clustering_report, f, indent=2)

# Print summary
print(f"Analysis completed successfully!")
print(f"\nRun Information:")
print(f"- Timestamp: {TIMESTAMP}")
print(f"- Analysis results saved to: {RESULTS_PATH}")
print(f"- Clustered images saved to: {CLUSTERS_PATH}")

print(f"\nKey Findings:")
print(f"- Optimal number of clusters (K-Means): {optimal_k}")
print(f"- Best silhouette score: {silhouette_scores[best_silhouette_idx]:.4f}")
print(f"- DBSCAN found {best_dbscan['n_clusters']} clusters with {best_dbscan['n_noise']} noise points")
print(f"- PCA captured {pca.explained_variance_ratio_.sum():.1%} of the variance")

print(f"\nCluster sizes (K-Means):")
cluster_sizes = Counter(cluster_labels)
for cluster_id, size in sorted(cluster_sizes.items()):
    percentage = (size / len(cluster_labels)) * 100
    print(f"  Cluster {cluster_id}: {size} images ({percentage:.1f}%)")

print(f"\nTimestamped results saved to: {RESULTS_PATH}")
print(f"- Run metadata: run_metadata.json")
print(f"- Clustering report: clustering_report.json")
print(f"- Clustering summary: clustering_summary.json")
print(f"- Visualizations: *.png files")
print(f"- Cluster examples: cluster_*_examples.png")

print(f"\nTo compare different runs, check the various timestamped directories in ./results/")

### 14. Interpretation & Next Steps

In [None]:
# Interpretation and next steps for timestamped analysis
print("=" * 80)
print("INTERPRETATION AND NEXT STEPS")
print("=" * 80)

print(f"Analysis run timestamp: {TIMESTAMP}")
print(f"Results saved in timestamped directories for comparison with future runs.")
print()

print("Based on the cluster analysis, consider the following interpretations:")
print()

# Analyze cluster characteristics
print("Cluster Characteristics Analysis:")
for cluster_id in sorted(cluster_sizes.keys()):
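    cluster_images = [img for img in image_files if img['cluster'] == cluster_id]
    if not cluster_images:
        # Guard: a label without matching entries in image_files would
        # otherwise raise an IndexError in most_common(1)[0]
        continue
    cluster_tenants = [img['tenant'] for img in cluster_images]
    tenant_counts = Counter(cluster_tenants)
    dominant_tenant = tenant_counts.most_common(1)[0]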
    
    print(f"\nCluster {cluster_id} ({len(cluster_images)} images):")
    print(f"  Dominant tenant: {dominant_tenant[0]} ({dominant_tenant[1]}/{len(cluster_images)} images, {dominant_tenant[1]/len(cluster_images)*100:.1f}%)")
    
    if len(tenant_counts) == 1:
        print(f"  Interpretation: Tenant-specific characteristics ({dominant_tenant[0]})")
    elif dominant_tenant[1] / len(cluster_images) > 0.7:
        print(f"  Interpretation: Primarily {dominant_tenant[0]} with some similarities to other tenants")
    else:
        print(f"  Interpretation: Mixed tenant cluster - likely represents common infrastructure features")

print("\nPotential classification tasks based on clustering:")
print("1. Tenant Classification: Classify images by transportation company")
print("2. Infrastructure Type: Distinguish between different track types (e.g., embedded vs. ballasted)")
print("3. Environmental Conditions: Classify by lighting, weather, or time of day")
print("4. Urban vs. Rural: Distinguish between city and countryside rail infrastructure")
print()
print("Recommended next steps:")
print("1. Manually inspect cluster examples to understand what visual features drive the clustering")
print("2. Create labels based on identified patterns (e.g., 'gravel_track', 'asphalt_embedded', etc.)")
print("3. Use these labels to train supervised classifiers")
print("4. Consider data augmentation strategies for underrepresented clusters")
print("5. Evaluate whether clustering captures meaningful domain-specific patterns")
print("6. Compare results across different runs using the timestamped directories")
print("7. Track how clustering results change with different parameters or datasets")

print(f"\nFor future comparisons:")
print(f"- This run's results: {RESULTS_PATH}")
print(f"- Clustered images: {CLUSTERS_PATH}")
print(f"- Look for patterns across different timestamps to validate clustering stability")