# Modeling and Evaluation

Train and evaluate a series of KMeans models, each with a different number of clusters **k**, and select the value of **k** that performs best.

## Steps:

1. **Load the Clean, Combined Dataset**
   - Load the preprocessed and combined dataset containing the audio features.

2. **Select Audio Features Based on Description**
   - Choose the relevant features from the dataset for clustering. In this dataset they are the one-hot encoded genre columns (`rock`, `pop`, `blues`, and so on).

3. **Scale the Dataset**
   - Standardize the features (e.g., with `StandardScaler`) so that each has zero mean and unit variance before training the model.

4. **Train a Range of Models with Different k Values**
   - Train multiple KMeans models using different values for **k** (e.g., k=2, 3, 4, ..., 10).

5. **Evaluate and Select the Top 2 Values for k**
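   - Use the **Elbow Method**: plot inertia (within-cluster sum of squares) against **k** and look for the point where the curve flattens.
   - Use the **Silhouette Score** to evaluate how well-defined the clusters are; it ranges from -1 to 1, and higher values indicate better separated clusters.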
   
6. **Try a Live Test with the Selected Models**
   - Test the two top-performing models (based on the Elbow Method and Silhouette Score) in a live setting.
   - Select the best performing value of **k** based on the test results.
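
The steps above can be sketched end to end as follows. This is a minimal illustration, not the notebook's actual pipeline: the data is a hypothetical random matrix of 0/1 flags standing in for the real dataset, `n_init=10` is an assumed setting, and ranking **k** by silhouette score alone is a simplification (the Elbow Method still needs a visual check of the inertia curve).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 300 rows of 6 binary flags (NOT the real dataset).
rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(300, 6)).astype(float)

# Step 3: standardize each column to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# Steps 4 and 5: fit one model per k and record both evaluation metrics.
inertias = {}
silhouettes = {}
for k in range(2, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias[k] = model.inertia_  # within-cluster sum of squares (elbow plot)
    silhouettes[k] = silhouette_score(X_scaled, model.labels_)

# Shortlist the two k values with the highest silhouette score.
top_two = sorted(silhouettes, key=silhouettes.get, reverse=True)[:2]
print("Top two k by silhouette score:", top_two)
```

Plotting `inertias` against **k** with `matplotlib` gives the elbow curve to compare against this shortlist.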

In [6]:
import matplotlib.pyplot as plt
import os
import pandas as pd
import numpy as np
import pickle
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture
from scipy.spatial import distance_matrix

In [7]:
df = pd.read_csv('../data/clean/spotify_data_encoded.csv')

In [8]:
def display_basic_info(df):
    """
    Display basic information about the dataset including shape, data types, and missing values
    """
    print('Dataset Shape:', df.shape)
    print('\nData Types:')
    print(df.dtypes)
    print('\nMissing Values:')
    print(df.isnull().sum())

def display_numerical_summary(df):
    """
    Display summary statistics for numerical columns
    """
    print('Numerical Columns Summary:')
    print(df.describe())

def check_duplicates(df):
    """
    Check for duplicate entries in the dataset
    """
    duplicates = df.duplicated().sum()
    print(f'Number of duplicate entries: {duplicates}')
    
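import ast  # safe parsing of stringified Python literals

def display_unique_values(df, columns):
    """
    Display number of unique values for specified columns, with special handling for the genres column
    """
    print('Unique Values Count:')
    for col in columns:
        if col == 'genres':
            all_genres = []
            for genre_list in df[col].dropna():
                if isinstance(genre_list, str):
                    # Lists are stored as strings in the CSV; literal_eval parses
                    # them without executing arbitrary code (unlike eval)
                    genre_list = ast.literal_eval(genre_list)
                all_genres.extend(genre_list)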
            unique_genres = len(set(all_genres))
            print(f'{col}: {unique_genres} unique genres')
        else:
            print(f'{col}: {df[col].nunique()} unique values')
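
The special handling of `genres` exists because a list column comes back from a CSV file as a string such as `"['rock', 'pop']"`. A small sketch of parsing such values with `ast.literal_eval`, which only accepts Python literals, might look like this. The sample values and the helper name `parse_genres` are hypothetical and chosen only for illustration.

```python
import ast

import pandas as pd

# Hypothetical sample mimicking a list column after a CSV round trip.
sample = pd.Series(["['rock', 'pop']", "['jazz']", "['rock']"])


def parse_genres(value):
    """Parse a stringified list; pass through values that are already lists."""
    return ast.literal_eval(value) if isinstance(value, str) else value


# Flatten all parsed lists into a set of distinct genre names.
unique_genres = {genre for genres in sample.map(parse_genres) for genre in genres}
print(sorted(unique_genres))
```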

In [10]:
# Show basic information about the dataset
display_basic_info(df)

# Show numerical summary
display_numerical_summary(df)

# Check for duplicates
check_duplicates(df)

# Display unique values for relevant columns
display_unique_values(df, ['artist', 'release_year', 'genres'])

Dataset Shape: (7282, 25)

Data Types:
title               object
artist              object
album               object
release_year         int64
popularity           int64
is_explicit           bool
duration_seconds     int64
rock                 int64
pop                  int64
blues                int64
metal                int64
hip-hop              int64
country              int64
punk                 int64
jazz                 int64
rap                  int64
reggae               int64
folk                 int64
soul                 int64
latin                int64
dance                int64
indie                int64
classical            int64
album_cover         object
genres              object
dtype: object

Missing Values:
title               0
artist              0
album               0
release_year        0
popularity          0
is_explicit         0
duration_seconds    0
rock                0
pop                 0
blues               0
metal               0
hip-hop      

## Select Audio Features

In [None]:
data = df.copy()

In [None]:
# Select the clustering features: the one-hot encoded genre columns
features = data[['rock', 'pop', 'blues', 'metal', 'hip-hop', 'country', 
                 'punk', 'jazz', 'rap', 'reggae', 'folk', 'soul', 'latin', 
                 'dance', 'indie', 'classical']]


## Scale the Dataset

In [None]:
# Scale the features
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)
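
As a quick sanity check of what standardization does to 0/1 columns, the following sketch scales a tiny hypothetical frame and confirms that each column ends up with mean close to 0 and standard deviation close to 1. The `toy` data is made up for illustration and is not taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical 0/1 flags standing in for two genre columns.
toy = pd.DataFrame({"rock": [1, 0, 0, 1, 0], "pop": [0, 1, 1, 1, 0]})

scaled = StandardScaler().fit_transform(toy)

# StandardScaler uses the population standard deviation (ddof=0),
# which matches NumPy's default for std().
print("column means:", scaled.mean(axis=0))
print("column stds: ", scaled.std(axis=0))
```

One consequence worth keeping in mind: after scaling, a 1 in a rarely set column maps to a larger value than a 1 in a frequently set column, so rare genres can weigh more heavily in the distances KMeans computes.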

## Train Models with Different k Values

In [None]:

# Train one KMeans model per candidate k (2 through 10)
k_values = list(range(2, 11))
# A fixed random_state makes the centroid initialization reproducible
models = [KMeans(n_clusters=k, random_state=42) for k in k_values]
for model in models:
    model.fit(scaled_features)
    print(f'Model trained with k={model.n_clusters}')
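
Once fitted, each model exposes `labels_` (the cluster assigned to every row) and `inertia_`. A minimal sketch of inspecting them is shown below. It runs on hypothetical random data rather than `scaled_features`, and `n_init=10` is an assumed setting. Very small or very uneven clusters can be an early hint that a given **k** is too large.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for the scaled feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))

models = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X) for k in (2, 3, 4)]

for model in models:
    # Count how many rows were assigned to each cluster.
    sizes = np.bincount(model.labels_, minlength=model.n_clusters)
    print(f"k={model.n_clusters}: inertia={model.inertia_:.1f}, cluster sizes={sizes.tolist()}")
```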
