## Step 05: Hyperparameter Tuning

### Import necessary libraries

In [1]:
import pandas as pd
import numpy as np
import os

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import plotly.express as px
from sklearn.manifold import TSNE

from sklearn.metrics import silhouette_score, davies_bouldin_score,calinski_harabasz_score
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.metrics import pairwise_distances
from sklearn.metrics import mean_squared_error

import warnings
warnings.filterwarnings("ignore")

### 5.1 Load datasets from file paths

In [3]:
data_path = '../Dataset/data_cleaned.csv'
genre_data_path = '../Dataset/genre_data_cleaned.csv'

# Check if files exist and load them
if os.path.exists(data_path) and os.path.exists(genre_data_path):
    data = pd.read_csv(data_path)
    genre_data = pd.read_csv(genre_data_path)
    print("Info: Data and genre data successfully loaded.")
else:
    print("Attention: One or both files are not found in the specified directory.")

Info: Data and genre data successfully loaded.


In [5]:
X = genre_data.select_dtypes(include=np.number) 
Y = data.select_dtypes(include=np.number)

### 5.2  KMeans Hyperparameter tuning 
- Exhaustive grid search is perfomed over KMeans hyperparameters using a training-validation split of 80%-20%. 

#### 5.2.1 Genre_data_Cleaned

In [7]:
# Split dataset into training and validation sets to prevent overfitting
X_train, X_val = train_test_split(X, test_size=0.2, random_state=42)

# Define an exhaustive parameter grid for KMeans
param_grid_kmeans = {
    'n_clusters': [13, 20],                       # Reasonable cluster range (based on elbow Method)
    'init': ['k-means++', 'random'],              # Initialization methods
    'n_init': [10, 20],                           # Number of initializations
    'max_iter': [300, 500],                       # Maximum iterations
    'tol': [1e-3, 1e-4]                           # Convergence tolerance
}

# Track progress
parameter_combinations = list(ParameterGrid(param_grid_kmeans))
total_iterations = len(parameter_combinations)
current_iteration = 0

# Track the best and worst parameters and scores
best_score = -1
worst_score = np.inf
best_params = None
worst_params = None

print(f"Total parameter combinations to evaluate: {total_iterations}\n")

# Perform grid search with training-validation split
for params in parameter_combinations:
    current_iteration += 1
    print(f"Iteration {current_iteration}/{total_iterations}: Testing parameters {params}...")

    kmeans = KMeans(
        n_clusters=params['n_clusters'],
        init=params['init'],
        n_init=params['n_init'],
        max_iter=params['max_iter'],
        tol=params['tol'],
        random_state=42
    )
    try:
        # Fit on training data
        kmeans.fit(X_train)
        labels_train = kmeans.labels_

        # Predict on validation data
        labels_val = kmeans.predict(X_val)

        # Compute metrics on validation data
        silhouette_val = silhouette_score(X_val, labels_val)
        davies_bouldin_val = davies_bouldin_score(X_val, labels_val)
        calinski_harabasz_val = calinski_harabasz_score(X_val, labels_val)


        print(f"Silhouette Score: {silhouette_val:.4f}, Davies-Bouldin Score: {davies_bouldin_val:.4f},Calinski Harabasz Score: {calinski_harabasz_val:.4f}")

        # Update the best parameters if validation silhouette score improves
        if silhouette_val > best_score:
            best_score = silhouette_val
            best_params = params

        # Update the worst parameters if validation silhouette score decreases
        if silhouette_val < worst_score:
            worst_score = silhouette_val
            worst_params = params

    except Exception as e:
        print(f"❌ Iteration {current_iteration}/{total_iterations} failed with error: {e}")

# Print the final best and worst parameters and silhouette scores
print("\nSearch Completed!")
print("Best Parameters for KMeans:", best_params)
print("Best Silhouette Score (Validation):", best_score)
print("\nWorst Parameters for KMeans:", worst_params)
print("Worst Silhouette Score (Validation):", worst_score)


Total parameter combinations to evaluate: 32

Iteration 1/32: Testing parameters {'init': 'k-means++', 'max_iter': 300, 'n_clusters': 13, 'n_init': 10, 'tol': 0.001}...
Silhouette Score: 0.2878, Davies-Bouldin Score: 0.9162,Calinski Harabasz Score: 316.1541
Iteration 2/32: Testing parameters {'init': 'k-means++', 'max_iter': 300, 'n_clusters': 13, 'n_init': 10, 'tol': 0.0001}...
Silhouette Score: 0.2860, Davies-Bouldin Score: 0.9133,Calinski Harabasz Score: 315.4086
Iteration 3/32: Testing parameters {'init': 'k-means++', 'max_iter': 300, 'n_clusters': 13, 'n_init': 20, 'tol': 0.001}...
Silhouette Score: 0.2846, Davies-Bouldin Score: 0.9270,Calinski Harabasz Score: 313.9262
Iteration 4/32: Testing parameters {'init': 'k-means++', 'max_iter': 300, 'n_clusters': 13, 'n_init': 20, 'tol': 0.0001}...
Silhouette Score: 0.2846, Davies-Bouldin Score: 0.9270,Calinski Harabasz Score: 313.9262
Iteration 5/32: Testing parameters {'init': 'k-means++', 'max_iter': 300, 'n_clusters': 20, 'n_init': 10

##### Conclusion
**Best hyperparameter values are**:

**Initialization Method**
- k-means initialization provided better centroids compared to random initialization.

**Cluster Count**
- Using 13 clusters results in better-defined groups compared to 20 clusters which lead poor separated clusters.

**Tolerance Impact**
- Smaller tolerances (0.0001) were not significantly better than 0.001, indicating stability.

**Iteration Count**
- 300 iterations were sufficient for convergence.

**Evaluation Metrics(for best parameters)**
- Silhouette Score (0.2878): Indicates moderately defined clusters with relatively high cohesion within clusters.Clusters are not distinct.
- Davies-Bouldin Score (0.9162): A lower score compared to other configurations, highlighting good separation and compactness of clusters.
- Calinski-Harabasz Score (316.1541): The highest score, suggesting compact and well-separated clusters.

#### 5.2.2 Data_Cleaned

In [13]:
# Split the dataset into training and testing subsets
Y_train, Y_test = train_test_split(Y, test_size=0.2, random_state=42)
print("Data split into training and testing sets.")

# Define the parameter grid for hyperparameter tuning
param_grid_kmeans = {
    'kmeans__n_clusters': [11,14],                  # Reasonable (based on Elbow Method)
    'kmeans__init': ['k-means++', 'random'],
    'kmeans__n_init': [10, 20],
    'kmeans__max_iter': [300, 500],
    'kmeans__tol': [1e-3, 1e-4]
}
print("Parameter grid defined.")

# Sample a smaller portion of the data to manage memory usage
sampled_Y_train = Y_train.sample(frac=0.1, random_state=42)
sampled_Y_test = Y_test.sample(frac=0.1, random_state=42)
print("Data sampled for tuning.")

# Perform hyperparameter tuning on the sampled data
results = []
for index, params in enumerate(ParameterGrid(param_grid_kmeans)):
    print(f"Testing parameter set {index + 1}/{len(ParameterGrid(param_grid_kmeans))}: {params}")

    # Set up the pipeline with the current parameters
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('kmeans', KMeans(
            n_clusters=params['kmeans__n_clusters'],
            init=params['kmeans__init'],
            n_init=params['kmeans__n_init'],
            max_iter=params['kmeans__max_iter'],
            tol=params['kmeans__tol'],
            random_state=42
        ))
    ])

    # Fit the model on the sampled training data
    pipeline.fit(sampled_Y_train)
    print("Model fitting completed.")

    # Predict on the sampled testing data
    labels_test = pipeline.predict(sampled_Y_test)
    print("Model prediction completed.")

    # Calculate evaluation metrics
    silhouette_test = silhouette_score(sampled_Y_test, labels_test)
    davies_bouldin_test = davies_bouldin_score(sampled_Y_test, labels_test)
    calinski_harabasz_val = calinski_harabasz_score(X_val, labels_val)


    print(f"Silhouette Score: {silhouette_val:.4f}, Davies-Bouldin Score: {davies_bouldin_val:.4f},Calinski Harabasz Score: {calinski_harabasz_val:.4f}")
    # Store the results
    results.append({
        'params': params,
        'silhouette_score': silhouette_test,
        'davies_bouldin_score': davies_bouldin_test
    })

print("Hyperparameter tuning completed.")

# Find and print the best parameters
results_df = pd.DataFrame(results)
best_result = results_df.sort_values(by='silhouette_score', ascending=False).iloc[0]
print("Best Parameters for KMeans:", best_result['params'])


Data split into training and testing sets.
Parameter grid defined.
Data sampled for tuning.
Testing parameter set 1/32: {'kmeans__init': 'k-means++', 'kmeans__max_iter': 300, 'kmeans__n_clusters': 11, 'kmeans__n_init': 10, 'kmeans__tol': 0.001}
Model fitting completed.
Model prediction completed.
Silhouette Score: 0.2420, Davies-Bouldin Score: 0.9869,Calinski Harabasz Score: 280.5569
Testing parameter set 2/32: {'kmeans__init': 'k-means++', 'kmeans__max_iter': 300, 'kmeans__n_clusters': 11, 'kmeans__n_init': 10, 'kmeans__tol': 0.0001}
Model fitting completed.
Model prediction completed.
Silhouette Score: 0.2420, Davies-Bouldin Score: 0.9869,Calinski Harabasz Score: 280.5569
Testing parameter set 3/32: {'kmeans__init': 'k-means++', 'kmeans__max_iter': 300, 'kmeans__n_clusters': 11, 'kmeans__n_init': 20, 'kmeans__tol': 0.001}
Model fitting completed.
Model prediction completed.
Silhouette Score: 0.2420, Davies-Bouldin Score: 0.9869,Calinski Harabasz Score: 280.5569
Testing parameter set 

##### Conclusion

**Best hyperparameter values are**:

**Initialization Method**
- k-means initialization provided better centroids compared to random initialization.

**Tolerance Impact**
- Smaller tolerances (0.0001) were not significantly better than 0.001, indicating stability.

**Iteration Count**
- 300 iterations were sufficient for convergence.

**Clusters**
- Using 11 clusters results in better-defined groups compared to 14 clusters which lead to overlapping.

**Evaluation Metrics**
- Silhouette Score (0.2420): Indicates moderate cluster separation with overlapping boundaries. Clusters are not distinct.
- Davies-Bouldin Score (0.9869): Moderately low, suggesting average separation and compactness between clusters.
- Calinski-Harabasz Score (280.5569): A moderate value indicating clusters compact but not well-separated.

### 5.3 t-SNE Pipeline Hyperparameter tuning 

- TSNE: t-Distributed Stochastic Neighbor Embedding (t-SNE)
- Pupose : This is a technique for reducing the dimensionality of data to two dimensions (n_components=2) for the purpose of visualization.
- Metric : KL Diveregnce
- Grid search for the optimal t-SNE hyperparameters is performed. 


**t-SNE is used for `genre_data_cleaned` for below reasons:**
- t-SNE is great for visualizing data by reducing it to 2 or 3 dimensions.
- It's designed to capture complex, non-linear relationships, making it ideal for exploring clusters, like genres.
- However, t-SNE is computationally heavy, so it's best suited for small to medium datasets.

**Conclusion:**
- Use t-SNE for the `genre_data_cleaned` dataset to visualize clusters and patterns.

In [19]:
# Split dataset into subsamples for validation to prevent overfitting
X_train, X_val = train_test_split(X, test_size=0.2, random_state=42)

# Define an extended parameter grid for t-SNE
param_grid_tsne = {
    'perplexity': [10, 30, 50],                             # Cover a wider range of perplexities
    'learning_rate': [100, 200, 300],                       # Extend learning rate values
    'max_iter': [500, 1000, 2000],                          # Include shorter iterations for quick results
    'early_exaggeration': [6, 12, 24],                      # Test more exaggeration values
}

# Prepare for tracking progress
parameter_combinations = list(ParameterGrid(param_grid_tsne))
total_iterations = len(parameter_combinations)
current_iteration = 0

# Initialize tracking for the best parameters
best_kl_divergence = np.inf
best_tsne_params = None

# Evaluate t-SNE configurations with progress updates
print(f"Total parameter combinations to evaluate: {total_iterations}\n")

for params in parameter_combinations:
    current_iteration += 1
    print(f"Iteration {current_iteration}/{total_iterations}: Testing parameters {params}...")
    
    try:
        # Fit t-SNE on training data
        tsne = TSNE(
            n_components=2,
            perplexity=params['perplexity'],
            learning_rate=params['learning_rate'],
            max_iter=params['max_iter'],
            early_exaggeration=params['early_exaggeration'],
            random_state=42,
            verbose=0
        )
        embedding_train = tsne.fit_transform(X_train)

        # Compute KL divergence as the primary metric
        kl_divergence = tsne.kl_divergence_

        print(f"KL Divergence: {kl_divergence:.4f}")

        # Update best parameters based on KL divergence
        if kl_divergence < best_kl_divergence:
            best_kl_divergence = kl_divergence
            best_tsne_params = params

    except Exception as e:
        print(f"❌ Iteration {current_iteration}/{total_iterations} failed with error: {e}")

# Print the best parameters and lowest KL divergence
print("\nSearch Completed!")
print("Best Parameters for t-SNE:", best_tsne_params)
print("Lowest KL Divergence:", best_kl_divergence)


Total parameter combinations to evaluate: 81

Iteration 1/81: Testing parameters {'early_exaggeration': 6, 'learning_rate': 100, 'max_iter': 500, 'perplexity': 10}...
KL Divergence: 0.9862
Iteration 2/81: Testing parameters {'early_exaggeration': 6, 'learning_rate': 100, 'max_iter': 500, 'perplexity': 30}...
KL Divergence: 0.8759
Iteration 3/81: Testing parameters {'early_exaggeration': 6, 'learning_rate': 100, 'max_iter': 500, 'perplexity': 50}...
KL Divergence: 0.7866
Iteration 4/81: Testing parameters {'early_exaggeration': 6, 'learning_rate': 100, 'max_iter': 1000, 'perplexity': 10}...
KL Divergence: 0.8664
Iteration 5/81: Testing parameters {'early_exaggeration': 6, 'learning_rate': 100, 'max_iter': 1000, 'perplexity': 30}...
KL Divergence: 0.8373
Iteration 6/81: Testing parameters {'early_exaggeration': 6, 'learning_rate': 100, 'max_iter': 1000, 'perplexity': 50}...
KL Divergence: 0.7658
Iteration 7/81: Testing parameters {'early_exaggeration': 6, 'learning_rate': 100, 'max_iter'

#### Conclusions

**Best Parameters for t-SNE:**
- **KL Divergence (0.7602)** :The lowest value achieved, indicating an effective mapping of high-dimensional data into lower dimensions with minimal loss of structural information.
- **Perplexity (50)**:This value balances the local and global aspects of the data, suggesting that the chosen perplexity effectively captures the underlying data structure.
- **Learning Rate (300)**:This learning rate allows for stable convergence while avoiding oscillations or slow updates.
- **Early Exaggeration(6)**:Enhances separation during the initial iterations, improving visualization of clusters in the final embedding.
- **Max Iterations (2000)**:The model converges effectively, reflecting that the chosen number of iterations provides sufficient time for optimization.

#### Summary
Use this configuration for final t-SNE visualization to ensure optimal preservation of data structure and cluster separability in the low-dimensional space.

### 5.4 PCA Pipeline Hyperparameter tuning 

- PCA: Principal Component Analysis (PCA)
- Purpose: A technique for dimensionality reduction that transforms the data into a lower-dimensional space while retaining as much variance as possible.
- Metric: Reconstrction error (Mean squared error)
- Grid search for the optimal PCA is performed. 


**We use PCA for `data_cleaned` for below reasons:**
- PCA is efficient for reducing dimensions while keeping most of the original data's variance.
- It's better for preprocessing large datasets because it is faster and scales well.
- PCA assumes linear relationships, which works well for structured data intended for modeling.

**Conclusion:**
- Use PCA for the `data_cleaned` dataset to prepare it for further analysis or machine learning.


In [23]:
Y_train, Y_test = train_test_split(Y, test_size=0.2, random_state=42)

# Define the parameter grid for PCA
param_grid_pca = {
    'PCA__n_components': [2],                  # Range of components
    'PCA__whiten': [True, False],              # Whitening options
    'PCA__svd_solver': ['auto', 'full', 'randomized'],  # Additional solvers
    'PCA__tol': [1e-4, 1e-3]                   # Tolerance for convergence
}

# Track progress
parameter_combinations = list(ParameterGrid(param_grid_pca))
total_combinations = len(parameter_combinations)
current_iteration = 0

# Track the best parameters and lowest reconstruction error
best_error = np.inf
best_params = None

# Perform grid search with train-test split
print(f"Total parameter combinations to evaluate: {total_combinations}")

for params in parameter_combinations:
    current_iteration += 1
    print(f"Evaluating combination {current_iteration}/{total_combinations}: {params}")

    try:
        # Define the PCA pipeline with the current parameters
        pca_pipeline = Pipeline([
            ('scaler', StandardScaler()),                               # Standardize the features
            ('PCA', PCA(
                n_components=params['PCA__n_components'],
                whiten=params['PCA__whiten'],
                svd_solver=params['PCA__svd_solver'],
                tol=params.get('PCA__tol', None),                      # Include tolerance if provided
                random_state=42
            ))
        ])
        
        # Fit the pipeline on training data and transform test data
        pca_pipeline.fit(Y_train)
        transformed_train = pca_pipeline.transform(Y_train)
        transformed_test = pca_pipeline.transform(Y_test)
        
        # Inverse transform the test data for reconstruction
        reconstructed_test = pca_pipeline.named_steps['PCA'].inverse_transform(transformed_test)
        
        # Calculate reconstruction error on test data
        error = mean_squared_error(Y_test, reconstructed_test)
        print(f"Reconstruction Error (Test) for combination {current_iteration}/{total_combinations}: {error:.4f}")
        
        # Update the best parameters if test reconstruction error improves
        if error < best_error:
            best_error = error
            best_params = params
            print(f"✅ New Best Reconstruction Error: {error:.4f} with parameters {params}")

    except Exception as e:
        print(f"❌ Combination {current_iteration}/{total_combinations} failed with error: {e}")

# Print the final best parameters and reconstruction error
print("\nSearch Completed!")
print("Best Parameters for PCA:", best_params)
print("Lowest Reconstruction Error (Test):", best_error)


Total parameter combinations to evaluate: 12
Evaluating combination 1/12: {'PCA__n_components': 2, 'PCA__svd_solver': 'auto', 'PCA__tol': 0.0001, 'PCA__whiten': True}
Reconstruction Error (Test) for combination 1/12: 459695.2128
✅ New Best Reconstruction Error: 459695.2128 with parameters {'PCA__n_components': 2, 'PCA__svd_solver': 'auto', 'PCA__tol': 0.0001, 'PCA__whiten': True}
Evaluating combination 2/12: {'PCA__n_components': 2, 'PCA__svd_solver': 'auto', 'PCA__tol': 0.0001, 'PCA__whiten': False}
Reconstruction Error (Test) for combination 2/12: 459695.2128
Evaluating combination 3/12: {'PCA__n_components': 2, 'PCA__svd_solver': 'auto', 'PCA__tol': 0.001, 'PCA__whiten': True}
Reconstruction Error (Test) for combination 3/12: 459695.2128
Evaluating combination 4/12: {'PCA__n_components': 2, 'PCA__svd_solver': 'auto', 'PCA__tol': 0.001, 'PCA__whiten': False}
Reconstruction Error (Test) for combination 4/12: 459695.2128
Evaluating combination 5/12: {'PCA__n_components': 2, 'PCA__svd_s

#### Conclusions

- **Reconstruction Error (459695.21)** PCA successfully reduced dimensionality without significant data loss.
- **Tolerance(0.0001)** Ensures PCA converges with high precision by preventing premature stopping.
- **PCA_svd_solver('auto')** The best solver is automatically choosen based on dataset size, balancing speed and accuracy.

#### Summary
The best parameters provide a good balance between efficiency and minimizing reconstruction error by ensuring optimal preservation of data structure and maintaining meaningful cluster separability in the reduced-dimensional space.
