# Nonlinear Clustering Workflow — Step-by-Step Summary

---

## 1. Random Sampling  
- Loads a **random subset (7,000 rows)** from the large normalized dataset `features_ready.csv`.  
- This prevents memory overload and speeds up computation; earlier attempts on the full dataset ran out of memory.  

---

## 2. Try: Gaussian Mixture Models (GMM)  
- Tests several GMM configurations with **12 → 60 clusters** (step = 6).  
- Uses `covariance_type='tied'` for stability and lower memory usage.  
- For each model:  
  - Fits clusters using the EM algorithm.  
  - Calculates **Silhouette Score** and **Davies–Bouldin Index (DBI)**.  
  - Saves results for comparison.  

---

## 3. Try: Spectral Clustering  
- Runs **Spectral Clustering** with the same cluster counts (12 → 60).  
- Uses a **nearest-neighbors affinity graph** to capture nonlinear relations.  
- Evaluates each configuration with the same metrics (Silhouette, DBI).  

---

## 4. Try: DBSCAN (Density-Based)  
- Applies **DBSCAN** with tuned parameters (`eps=0.1`, `min_samples=10`)  
  to generate more than 10 clusters.  
- Computes Silhouette and DBI if more than one cluster is detected.  

---

## 5. Evaluation of All Models

### Understanding the Evaluation Formula

Each clustering model is evaluated using two key metrics:

1. **Silhouette Score (range: -1 → 1)**  
   - Measures how similar each point is to its own cluster compared to other clusters.  
   - **Interpretation:**
     - **~1.0** → points are tightly grouped and well-separated (excellent clustering)  
     - **~0.5** → moderately good separation  
     - **~0.0** → overlapping clusters  
     - **< 0** → many points sit closer to a neighboring cluster than to their own (likely misassignment, often from too many clusters)  

2. **Davies–Bouldin Index (DBI, range: 0 → ∞)**  
   - Represents the average "similarity" between each cluster and its most similar neighbor.  
   - **Interpretation:**
     - **Closer to 0** → compact, well-separated clusters (good)  
     - **> 2.0** → poor separation or overlapping clusters  
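
To make the two metrics concrete, the sketch below scores the same synthetic points twice: once with a sensible labeling and once with random labels. The data (`make_blobs`), the use of `KMeans`, and all variable names are assumptions made for illustration only; they are not part of the workflow above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic data (illustrative only): three tight, well-separated blobs.
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# A sensible labeling versus a random labeling of the same points.
good_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)
rng = np.random.default_rng(0)
random_labels = rng.integers(0, 3, size=len(X_demo))

sil_good = silhouette_score(X_demo, good_labels)
dbi_good = davies_bouldin_score(X_demo, good_labels)
sil_rand = silhouette_score(X_demo, random_labels)
dbi_rand = davies_bouldin_score(X_demo, random_labels)

print(f"good labels:   silhouette={sil_good:.2f}, DBI={dbi_good:.2f}")
print(f"random labels: silhouette={sil_rand:.2f}, DBI={dbi_rand:.2f}")
```

The sensible labeling should score a silhouette near 1 and a small DBI, while the random labeling should land near 0 silhouette with a much larger DBI.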

---

### Combined Score Formula

To compare models using both metrics, a combined score is computed:

\[
\text{score} = \text{Silhouette} - 0.1 \times \text{DBI}
\]

This formula balances the two metrics:
- **Silhouette** is *maximized* (higher is better),
- **DBI** is *penalized* (lower is better),
- The multiplier `0.1` ensures DBI’s influence is moderate, not dominant.
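
As a worked example of the formula, the sketch below applies it to three made-up models. The model names and metric values are hypothetical and chosen only to show how the DBI penalty can change the ranking.

```python
import pandas as pd

# Hypothetical metric values, for illustration only (not real results).
demo = pd.DataFrame({
    "model": ["A", "B", "C"],
    "silhouette": [0.45, 0.50, 0.10],
    "db": [0.8, 2.5, 1.0],
})

# Same formula as above: reward Silhouette, penalize DBI with weight 0.1.
demo["score"] = demo["silhouette"] - 0.1 * demo["db"]

ranked = demo.sort_values("score", ascending=False).reset_index(drop=True)
print(ranked)
```

Model B has the highest Silhouette, but its DBI of 2.5 costs it 0.25 points, so model A (0.45 - 0.08 = 0.37) ranks first.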

### How to Interpret the Combined Score (for real nonlinear data)

| Combined Score | Meaning | Interpretation |
|----------------|----------|----------------|
| **> 0.4** | Excellent | Very distinct, compact clusters |
| **0.2 – 0.4** | Good | Clear separation, reliable structure |
| **-0.2 – 0.2** | Moderate | Some overlap, acceptable for complex data |
| **-0.4 – -0.2** | Weak | Noisy or poorly separated clusters |
| **< -0.4** | Bad | Overfitted or meaningless clustering |

---

## 6. Saving the Best Model  
- Retrieves the cluster assignments from the best-scoring model.  
- Adds two columns to the dataset:  
  - `cluster` → cluster label for each sample,  
  - `model` → model name (e.g., `Spectral_24`, `GMM_30`, `DBSCAN`).  
- Saves the annotated results to **`clustering_results_best_nonlinear.csv`**.  

In [None]:
import pandas as pd
import numpy as np
from sklearn.cluster import DBSCAN, SpectralClustering
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score, davies_bouldin_score
from tqdm import tqdm

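# Load a random sample so the full dataset never has to fit in memory
with open("features_ready.csv") as f:
    total_rows = sum(1 for _ in f) - 1  # data rows, excluding the header
sample_size = min(7000, total_rows)
rng = np.random.default_rng(42)  # fixed seed so the sample is reproducible
# Row 0 is the header, so only data rows 1..total_rows may be skipped
skip = np.sort(rng.choice(np.arange(1, total_rows + 1), total_rows - sample_size, replace=False)).tolist()
df = pd.read_csv("features_ready.csv", skiprows=skip)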

#Select numeric features
feature_cols = [c for c in df.columns if c.startswith(('t_', 'f_', 'w_'))]
X = df[feature_cols].to_numpy(dtype=np.float64)
results = []
cluster_labels = {}

#Gaussian Mixture
for k in tqdm(range(12, 61, 6), desc="Testing GMM clusters"):
    try:
        gmm = GaussianMixture(
            n_components=k,
            covariance_type='tied',
            reg_covar=1e-4,
            max_iter=500,
            random_state=42
        )
        labels_gmm = gmm.fit_predict(X)
        sil_gmm = silhouette_score(X, labels_gmm)
        db_gmm = davies_bouldin_score(X, labels_gmm)
        results.append({"model": f"GMM_{k}", "silhouette": sil_gmm, "db": db_gmm, "n_clusters": k})
        cluster_labels[f"GMM_{k}"] = labels_gmm
    except Exception as e:
        print(f"⚠️ GMM (k={k}) failed: {e}")

#Spectral Clustering
for k in tqdm(range(12, 61, 6), desc="Testing Spectral clusters"):
    try:
        spectral = SpectralClustering(
            n_clusters=k,
            affinity='nearest_neighbors',
            assign_labels='kmeans',
            random_state=42
        )
        labels_sp = spectral.fit_predict(X)
        sil_sp = silhouette_score(X, labels_sp)
        db_sp = davies_bouldin_score(X, labels_sp)
        results.append({"model": f"Spectral_{k}", "silhouette": sil_sp, "db": db_sp, "n_clusters": k})
        cluster_labels[f"Spectral_{k}"] = labels_sp
    except Exception as e:
        print(f"Spectral (k={k}) failed: {e}")

#DBSCAN
try:
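    # eps was adjusted because the earlier value produced too few clusters
    dbscan = DBSCAN(eps=0.1, min_samples=10, n_jobs=-1)
    labels_db = dbscan.fit_predict(X)
    # DBSCAN marks noise with label -1; do not count it as a cluster
    n_clusters_db = len(set(labels_db)) - (1 if -1 in labels_db else 0)
    if n_clusters_db > 1:
        # Note: both metrics still treat the noise points as one extra group
        sil_db = silhouette_score(X, labels_db)
        db_db = davies_bouldin_score(X, labels_db)
        results.append({
            "model": "DBSCAN",
            "silhouette": sil_db,
            "db": db_db,
            "n_clusters": n_clusters_db
        })
        cluster_labels["DBSCAN"] = labels_db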
except Exception as e:
    print(f"DBSCAN failed: {e}")

#Evaluate all nonlinear models
df_results = pd.DataFrame(results)
df_results["score"] = df_results["silhouette"] - 0.1 * df_results["db"]

best_model_name = df_results.sort_values("score", ascending=False).iloc[0]["model"]
print("Model comparison (non-linear only):")
print(df_results.sort_values("score", ascending=False).head(10))

#Save best model results
best_labels = cluster_labels[best_model_name]
df["cluster"] = best_labels
df["model"] = best_model_name
df.to_csv("clustering_results_best_nonlinear.csv", index=False)
print(f"\n✅ Best nonlinear model: {best_model_name}")
print("\nCluster size distribution:")
print(df["cluster"].value_counts().sort_index())


# Evaluation and Interpretation Guide

---

## 1. Dataset Overview
The file **`clustering_results_best_nonlinear.csv`** contains:
- Extracted time-, frequency-, and wavelet-domain features (columns prefixed `t_`, `f_`, or `w_`);
- Metadata fields such as `fault_type`, `bearing_type`, and `load_level`;
- The assigned cluster label (`cluster`);
- The name of the nonlinear model used (`model`), e.g., `Spectral_24`, `GMM_30`, or `DBSCAN`.

---

## 2. Cluster Evaluation Metrics

| Metric | Description | Interpretation |
|---------|--------------|----------------|
| **Silhouette Score** | Measures cohesion and separation of clusters | Closer to **1.0** → compact, well-separated clusters |
| **Davies–Bouldin Index (DBI)** | Ratio of within-cluster to between-cluster distances | **Lower** = better cluster quality |
| **Cluster Count** | Number of clusters found by the algorithm | Indicates overall structure complexity |
| **Cluster Size Distribution** | Number of samples per cluster | Detects unbalanced or noisy results |

### Example: Evaluating Clusters
Use `silhouette_score` and `davies_bouldin_score` from **scikit-learn** to quantify cluster quality.  
The closer the silhouette score is to **1**, and the smaller the DBI, the better the clustering.
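
A minimal sketch of such an evaluation is shown below. `evaluate_clusters` is a hypothetical helper, and the demo table with its column names `t_rms` and `f_peak` is synthetic; the only assumptions carried over from the workflow are the `t_`/`f_`/`w_` feature prefixes and the `cluster` column.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import davies_bouldin_score, silhouette_score


def evaluate_clusters(df, label_col="cluster"):
    """Hypothetical helper: recompute both metrics from an annotated table."""
    feature_cols = [c for c in df.columns if c.startswith(("t_", "f_", "w_"))]
    X = df[feature_cols].to_numpy(dtype=np.float64)
    labels = df[label_col].to_numpy()
    return {
        "silhouette": silhouette_score(X, labels),
        "dbi": davies_bouldin_score(X, labels),
        "n_clusters": int(pd.Series(labels).nunique()),
    }


# Synthetic stand-in for the results file: two compact, distant groups.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "t_rms": np.concatenate([rng.normal(0, 0.1, 50), rng.normal(5, 0.1, 50)]),
    "f_peak": np.concatenate([rng.normal(0, 0.1, 50), rng.normal(5, 0.1, 50)]),
    "cluster": [0] * 50 + [1] * 50,
})

metrics = evaluate_clusters(demo)
print(metrics)
```

On the real data, the same helper could be called as `evaluate_clusters(pd.read_csv("clustering_results_best_nonlinear.csv"))`. If the best model is DBSCAN, the noise label `-1` is counted as a group by this sketch.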

---

## 3. Cluster–Fault Relationship

### Purpose
To verify whether the discovered clusters correspond to known fault types or load conditions.

### How to Evaluate
- Use a **cross-tabulation** (`pd.crosstab`) between `cluster` and `fault_type` to see the proportion of each fault type in every cluster.  
- Do the same for `load_level` or `bearing_type`.
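
The cross-tabulation can be sketched as follows. The toy table and the fault names (`inner`, `outer`, `ball`) are hypothetical stand-ins for the real results file.

```python
import pandas as pd

# Toy stand-in for the annotated results; the fault names are hypothetical.
demo = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 1, 1, 2, 2, 2],
    "fault_type": ["inner", "inner", "outer", "outer", "outer",
                   "outer", "ball", "ball", "ball", "ball"],
})

# normalize="index" converts each row to proportions that sum to 1 per cluster.
ct = pd.crosstab(demo["cluster"], demo["fault_type"], normalize="index")
print(ct.round(2))
```

Replacing `fault_type` with `load_level` or `bearing_type` gives the corresponding tables for the other metadata fields.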

### Interpretation
- A cluster dominated by one `fault_type` likely represents that fault condition.  
- Mixed clusters might indicate transitional states, noise, or overlapping operating modes.  
- If clusters group primarily by `load_level`, the model may capture load-related patterns instead of faults.
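
One way to put a number on "dominated by one `fault_type`" is to report, for each cluster, the most frequent fault type and its share of the cluster. The sketch below does this on a toy table; the column name `purity` and the fault names are illustrative assumptions, not part of the workflow.

```python
import pandas as pd

# Toy stand-in for the annotated results; the fault names are hypothetical.
demo = pd.DataFrame({
    "cluster": [0, 0, 0, 0, 1, 1, 1, 1],
    "fault_type": ["inner", "inner", "inner", "outer",
                   "outer", "ball", "outer", "inner"],
})

counts = pd.crosstab(demo["cluster"], demo["fault_type"])

summary = pd.DataFrame({
    # Most frequent fault type in each cluster
    "dominant_fault": counts.idxmax(axis=1),
    # Share of the cluster taken by that fault type (1.0 = single fault type)
    "purity": counts.max(axis=1) / counts.sum(axis=1),
})
print(summary)
```

Here cluster 0 is mostly `inner` (share 0.75), while cluster 1 is mixed (share 0.5), which matches the "mixed cluster" case described above.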
