# Mood-Based Music Recommender (Scientific Computing Project)

This notebook demonstrates an end-to-end **mood-based song recommender** built on the Spotify tracks dataset.

We will:
1. Load and preprocess the data
2. Engineer a simple **rule-based mood model** using `valence` and `energy`
3. Apply **PCA** to build a low-dimensional "mood space" (linear algebra)
4. Build **alternative mood models**:
   - K-means clustering on PCA space
   - Gaussian Mixture Model (GMM) on PCA space
5. Compare mood models using clustering metrics and visualizations
6. Implement a simple **song recommender** that, given a mood, returns the nearest songs

Run the notebook cell by cell, from top to bottom; later cells depend on variables defined in earlier ones.

## 1. Setup and Data Loading

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score, silhouette_score

plt.rcParams['figure.figsize'] = (8, 6)

DATA_PATH = '/mnt/data/dataset.csv'  # adjust if needed

df = pd.read_csv(DATA_PATH)
print('Raw shape:', df.shape)
df.head()

### Select audio features and basic cleaning

In [None]:
# Audio features to use
FEATURES = [
    'danceability', 'energy', 'valence', 'tempo',
    'acousticness', 'instrumentalness', 'liveness',
    'speechiness', 'loudness'
]

id_cols = ['track_id', 'track_name', 'artists', 'track_genre', 'album_name', 'popularity']

# Keep only rows with all required columns present
df = df.dropna(subset=FEATURES)
df = df[id_cols + FEATURES].drop_duplicates(subset='track_id').reset_index(drop=True)

print('After cleaning:', df.shape)
df[FEATURES].describe()
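
To see what the two cleaning steps do, here is a minimal sketch on a made-up four-row table (the rows and values are invented for illustration, not taken from the dataset). It assumes that the same `track_id` can appear in more than one row, for instance under different genres; `drop_duplicates` then keeps only the first occurrence.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows: 'a1' appears twice, 'c3' is missing its energy value
toy = pd.DataFrame({
    'track_id':    ['a1', 'a1', 'b2', 'c3'],
    'track_genre': ['pop', 'dance', 'rock', 'jazz'],
    'energy':      [0.8, 0.8, 0.5, np.nan],
    'valence':     [0.7, 0.7, 0.3, 0.6],
})

# Same two steps as above: drop incomplete rows, then keep one row per track
toy_clean = (
    toy.dropna(subset=['energy', 'valence'])
       .drop_duplicates(subset='track_id')
       .reset_index(drop=True)
)
print(toy_clean)
```

Only the first genre listed for a duplicated track survives, so `track_genre` in the cleaned table reflects row order rather than a primary genre.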

## 2. Rule-Based Mood Model (Heuristic)

We use `valence` and `energy` to define 4 basic moods:
- **Happy / Energetic**: high valence, high energy
- **Calm / Positive**: high valence, low energy
- **Sad / Low Energy**: low valence, low energy
- **Angry / Intense**: low valence, high energy

Tracks that satisfy none of these four conditions (for example, a valence between the two thresholds) are labeled **Neutral / Mixed**.

In [None]:
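def rule_based_mood(valence, energy,
                    high_thr=0.6, low_thr=0.4):
    """Map a (valence, energy) pair to one of four mood quadrants, else 'Neutral / Mixed'."""
    if valence >= high_thr and energy >= high_thr:
        return 'Happy / Energetic'
    # "Low energy" means below low_thr, matching the 'Sad / Low Energy' rule
    elif valence >= high_thr and energy < low_thr:
        return 'Calm / Positive'
    elif valence < low_thr and energy < low_thr:
        return 'Sad / Low Energy'
    elif valence < low_thr and energy >= high_thr:
        return 'Angry / Intense'
    else:
        # At least one of the two values lies between the thresholds
        return 'Neutral / Mixed'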

df['mood_rule'] = [
    rule_based_mood(v, e)
    for v, e in zip(df['valence'], df['energy'])
]

df['mood_rule'].value_counts()
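
The same quadrant rule can also be written in vectorized form, which avoids a Python-level loop on large tables. The sketch below is one possible version using `np.select`; the function name `rule_based_mood_vectorized` and the toy values are hypothetical. It follows the four quadrant definitions listed at the start of this section, reading "low" as below `low_thr` for both valence and energy.

```python
import numpy as np
import pandas as pd

def rule_based_mood_vectorized(valence, energy, high_thr=0.6, low_thr=0.4):
    """Label each (valence, energy) pair with a mood quadrant (hypothetical helper)."""
    valence = np.asarray(valence, dtype=float)
    energy = np.asarray(energy, dtype=float)
    hi_v, lo_v = valence >= high_thr, valence < low_thr
    hi_e, lo_e = energy >= high_thr, energy < low_thr
    conditions = [hi_v & hi_e, hi_v & lo_e, lo_v & lo_e, lo_v & hi_e]
    labels = ['Happy / Energetic', 'Calm / Positive',
              'Sad / Low Energy', 'Angry / Intense']
    # Rows matching none of the four conditions fall through to the default
    return np.select(conditions, labels, default='Neutral / Mixed')

# Invented example values, one per quadrant plus one in-between track
toy = pd.DataFrame({
    'valence': [0.9, 0.8, 0.1, 0.2, 0.5],
    'energy':  [0.9, 0.2, 0.1, 0.8, 0.5],
})
toy['mood_rule'] = rule_based_mood_vectorized(toy['valence'], toy['energy'])
print(toy)
```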

Optionally, we can drop the 'Neutral / Mixed' tracks so that the remaining four moods are more clearly separated. All later sections use the filtered `df_clean`.

In [None]:
df_clean = df[df['mood_rule'] != 'Neutral / Mixed'].reset_index(drop=True)
print('After removing Neutral / Mixed:', df_clean.shape)
df_clean['mood_rule'].value_counts()

## 3. Standardization and PCA (2D Mood Space)

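We standardize the features (zero mean, unit variance) so that features on larger scales, such as `tempo` and `loudness`, do not dominate, and then apply PCA to project each track onto the two directions of greatest variance. The printed explained variance ratio shows how much of the total variance these two components retain. This gives us a low-dimensional representation of each song that we can visualize and use for clustering.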
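
A minimal sketch of why standardization matters before PCA, using synthetic data rather than the Spotify table. The two columns are independent random stand-ins for a wide-range feature and a feature bounded in [0, 1]; the distributions are assumptions chosen only to make the scale difference visible.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000
# Synthetic, independent stand-ins for two features on very different scales
tempo = rng.normal(loc=120.0, scale=30.0, size=n)   # wide numeric range
valence = rng.uniform(0.0, 1.0, size=n)             # bounded in [0, 1]
X = np.column_stack([tempo, valence])

# PCA on raw values versus PCA on standardized values
ratio_raw = PCA(n_components=2).fit(X).explained_variance_ratio_
X_std = StandardScaler().fit_transform(X)
ratio_std = PCA(n_components=2).fit(X_std).explained_variance_ratio_

print('Without scaling:', ratio_raw.round(4))
print('With scaling:   ', ratio_std.round(4))
```

Without scaling, the first component is almost entirely the wide-range feature; after scaling, both features contribute on equal terms.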

In [None]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_clean[FEATURES])

pca = PCA(n_components=2, random_state=42)
Z = pca.fit_transform(X_scaled)
df_clean['pc1'] = Z[:, 0]
df_clean['pc2'] = Z[:, 1]

print('Explained variance ratio:', pca.explained_variance_ratio_)
df_clean[['pc1', 'pc2']].head()
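
To make the linear algebra behind this step explicit, the sketch below reproduces a two-component PCA with a plain singular value decomposition. The random matrix is a synthetic stand-in for the standardized feature matrix, not the Spotify data. Component signs are arbitrary, so the comparison uses absolute values.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic correlated data: 200 samples, 5 features
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X_std = StandardScaler().fit_transform(X)   # columns now have mean 0, variance 1

# SVD of the centered matrix: X_std = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(X_std, full_matrices=False)
Z_svd = X_std @ Vt[:2].T                    # project onto the top two right singular vectors
ratio_svd = (S ** 2)[:2] / np.sum(S ** 2)   # share of variance per component

# Same computation through scikit-learn
pca_toy = PCA(n_components=2, svd_solver='full').fit(X_std)
Z_pca = pca_toy.transform(X_std)

print('SVD ratio:', ratio_svd.round(4))
print('PCA ratio:', pca_toy.explained_variance_ratio_.round(4))
print('Scores match up to sign:', np.allclose(np.abs(Z_svd), np.abs(Z_pca)))
```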

### Visualize rule-based moods in PCA space

In [None]:
plt.figure()
for mood in df_clean['mood_rule'].unique():
    subset = df_clean[df_clean['mood_rule'] == mood]
    plt.scatter(subset['pc1'], subset['pc2'], alpha=0.5, label=mood)

plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Rule-based moods in PCA space')
plt.legend()
plt.show()

## 4. Alternative Mood Model 1: K-Means Clustering

We now treat the problem as **unsupervised clustering** in PCA space. We choose the number of clusters `k=4` (to roughly match 4 moods) and run K-means.

We then compare K-means clusters to rule-based moods using:
- Adjusted Rand Index (ARI)
- Normalized Mutual Information (NMI)
- Silhouette score of the K-means clustering
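
ARI and NMI compare groupings, not label names, so it does not matter which integer K-means assigns to which cluster. A small sketch with invented labels for six items illustrates this:

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Toy labelings of six items: same grouping, different label names
rule_toy = ['happy', 'happy', 'sad', 'sad', 'calm', 'calm']
cluster_toy = [2, 2, 0, 0, 1, 1]

ari_same = adjusted_rand_score(rule_toy, cluster_toy)
nmi_same = normalized_mutual_info_score(rule_toy, cluster_toy)
print('Identical grouping:', ari_same, nmi_same)

# Moving one item to another cluster lowers both scores
cluster_off = [2, 2, 0, 1, 1, 1]
ari_off = adjusted_rand_score(rule_toy, cluster_off)
nmi_off = normalized_mutual_info_score(rule_toy, cluster_off)
print('One item moved:    ', ari_off, nmi_off)
```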

In [None]:
k = 4
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
df_clean['mood_kmeans'] = kmeans.fit_predict(df_clean[['pc1', 'pc2']])

# Encode rule-based moods as integers (the metrics also accept string labels; integer codes are just convenient)
mood_names = df_clean['mood_rule'].unique()
mood_to_int = {m: i for i, m in enumerate(mood_names)}
rule_labels = df_clean['mood_rule'].map(mood_to_int).values
kmeans_labels = df_clean['mood_kmeans'].values

ari_kmeans = adjusted_rand_score(rule_labels, kmeans_labels)
nmi_kmeans = normalized_mutual_info_score(rule_labels, kmeans_labels)
sil_kmeans = silhouette_score(df_clean[['pc1', 'pc2']], kmeans_labels)

print('K-Means vs Rule-based moods:')
print('  Adjusted Rand Index:', ari_kmeans)
print('  Normalized Mutual Information:', nmi_kmeans)
print('  Silhouette score (K-means clusters):', sil_kmeans)
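
The choice `k=4` is motivated by the four moods rather than by the data. One way to check how sensitive the clustering is to that choice is to scan several values of `k` and compare silhouette scores. The sketch below does this on synthetic 2D points with four well-separated groups (an assumption made so the result is easy to read); the loop variable is named `k_try` so that it does not overwrite the notebook's `k`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2D points with four well-separated groups (stand-in for pc1/pc2)
centers = [[-5, -5], [-5, 5], [5, -5], [5, 5]]
X_toy, _ = make_blobs(n_samples=400, centers=centers,
                      cluster_std=0.5, random_state=42)

# Fit K-means for each candidate k and record the silhouette score
scores = {}
for k_try in range(2, 7):
    labels = KMeans(n_clusters=k_try, random_state=42, n_init=10).fit_predict(X_toy)
    scores[k_try] = silhouette_score(X_toy, labels)

best_k = max(scores, key=scores.get)
for k_try, score in scores.items():
    print(k_try, round(score, 3))
print('Best k by silhouette:', best_k)
```

On real data the curve is usually flatter, and a `k` that scores well geometrically need not correspond to interpretable moods.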

### Visualize K-means clusters

In [None]:
plt.figure()
for cluster_id in sorted(df_clean['mood_kmeans'].unique()):
    subset = df_clean[df_clean['mood_kmeans'] == cluster_id]
    plt.scatter(subset['pc1'], subset['pc2'], alpha=0.5, label=f'Cluster {cluster_id}')

plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('K-means clusters in PCA space')
plt.legend()
plt.show()

## 5. Alternative Mood Model 2: Gaussian Mixture Model (GMM)

We also try a probabilistic clustering approach using a **Gaussian Mixture Model** with 4 components.

Again, we compare its clusters to the rule-based moods and compute a silhouette score.

In [None]:
gmm = GaussianMixture(n_components=4, covariance_type='full', random_state=42)
gmm_labels = gmm.fit_predict(df_clean[['pc1', 'pc2']])
df_clean['mood_gmm'] = gmm_labels

ari_gmm = adjusted_rand_score(rule_labels, gmm_labels)
nmi_gmm = normalized_mutual_info_score(rule_labels, gmm_labels)
sil_gmm = silhouette_score(df_clean[['pc1', 'pc2']], gmm_labels)

print('GMM vs Rule-based moods:')
print('  Adjusted Rand Index:', ari_gmm)
print('  Normalized Mutual Information:', nmi_gmm)
print('  Silhouette score (GMM clusters):', sil_gmm)
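
Unlike K-means, a fitted GMM also provides a membership probability for every component, which `fit_predict` collapses into a single hard label. The sketch below shows how those probabilities could be inspected with `predict_proba`, using synthetic overlapping groups in place of the PCA coordinates; the 0.7 cutoff for "ambiguous" points is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic 2D points with two overlapping groups (stand-in for pc1/pc2)
X_toy, _ = make_blobs(n_samples=300, centers=[[0, 0], [2, 0]],
                      cluster_std=1.0, random_state=42)

gmm_toy = GaussianMixture(n_components=2, covariance_type='full',
                          random_state=42).fit(X_toy)
proba = gmm_toy.predict_proba(X_toy)   # shape (n_samples, n_components), rows sum to 1
hard = gmm_toy.predict(X_toy)          # component with the highest probability

# Points whose top probability is below the cutoff sit between the components
ambiguous = proba.max(axis=1) < 0.7
print('Share of ambiguous points:', round(float(ambiguous.mean()), 3))
```

Applied to the mood space, such low-confidence tracks would be natural candidates for a "mixed" mood.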

### Visualize GMM clusters

In [None]:
plt.figure()
for cluster_id in sorted(df_clean['mood_gmm'].unique()):
    subset = df_clean[df_clean['mood_gmm'] == cluster_id]
    plt.scatter(subset['pc1'], subset['pc2'], alpha=0.5, label=f'Cluster {cluster_id}')

plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('GMM clusters in PCA space')
plt.legend()
plt.show()

## 6. Compare Mood Models

We summarize the comparison between:
- **Rule-based moods** (heuristic, interpretable)
- **K-means clusters** (hard clusters)
- **GMM clusters** (probabilistic clusters)

Metrics:
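- Adjusted Rand Index (ARI): agreement with rule-based labels, corrected for chance (1 = perfect agreement, about 0 = random labeling)
- Normalized Mutual Information (NMI): information shared with the rule-based labels (0 = none, 1 = identical partitions)
- Silhouette score: cluster cohesion and separation in PCA space (ranges from -1 to 1, higher is better); it does not use the rule-based labels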
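
The silhouette score is based on pairwise distances, so its cost grows roughly quadratically with the number of tracks. If that becomes too slow, `silhouette_score` accepts `sample_size` and `random_state` to estimate the score from a random subsample. A minimal sketch on synthetic data (the blob layout and the sample size of 500 are illustrative assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2D points in four moderately overlapping groups
centers = [[-5, -5], [-5, 5], [5, -5], [5, 5]]
X_toy, _ = make_blobs(n_samples=3000, centers=centers,
                      cluster_std=2.0, random_state=42)
labels_toy = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(X_toy)

# Exact score on all points versus an estimate from 500 sampled points
sil_full = silhouette_score(X_toy, labels_toy)
sil_sub = silhouette_score(X_toy, labels_toy, sample_size=500, random_state=42)
print('Full:', round(sil_full, 3), 'Subsample:', round(sil_sub, 3))
```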


In [None]:
comparison = pd.DataFrame({
    'Model': ['K-means', 'GMM'],
    'ARI_vs_rule': [ari_kmeans, ari_gmm],
    'NMI_vs_rule': [nmi_kmeans, nmi_gmm],
    'Silhouette': [sil_kmeans, sil_gmm],
})
comparison
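
A single score hides which moods a cluster actually mixes. A contingency table of rule-based moods against cluster ids makes that visible. The sketch below uses invented labels for eight tracks; on the notebook's data the same call would take the `mood_rule` and `mood_kmeans` (or `mood_gmm`) columns of `df_clean`.

```python
import pandas as pd

# Hypothetical labels for eight tracks (invented for illustration)
toy = pd.DataFrame({
    'mood_rule':   ['Happy', 'Happy', 'Happy', 'Sad', 'Sad', 'Angry', 'Angry', 'Calm'],
    'mood_kmeans': [0, 0, 1, 2, 2, 1, 1, 3],
})

# Rows: rule-based moods, columns: cluster ids, cells: track counts
table = pd.crosstab(toy['mood_rule'], toy['mood_kmeans'])
print(table)

# For each cluster, the rule-based mood that contributes the most tracks
dominant = table.idxmax(axis=0)
print(dominant)
```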

## 7. Mood-Based Song Recommendation

We now implement a simple recommender:
- Given a **target mood** (from the rule-based model), we compute the **centroid** of that mood in PCA space.
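- We recommend the `k` songs whose PCA coordinates are closest (Euclidean distance) to that centroid. Candidates are drawn from all tracks in `df_clean`, so a recommended song may carry a different rule-based label than the requested mood.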

In [None]:
# Compute centroids for rule-based moods in PCA space
centroids_rule = df_clean.groupby('mood_rule')[['pc1', 'pc2']].mean()
centroids_rule

In [None]:
def recommend_songs(mood_label, k=10):
    if mood_label not in centroids_rule.index:
        raise ValueError(f"Unknown mood {mood_label}. Available: {list(centroids_rule.index)}")

    centroid = centroids_rule.loc[mood_label].values
    coords = df_clean[['pc1', 'pc2']].values
    distances = np.linalg.norm(coords - centroid, axis=1)
    idx = np.argsort(distances)[:k]
    return df_clean.iloc[idx][id_cols + ['mood_rule', 'pc1', 'pc2']]

print('Available rule-based moods:')
print(list(centroids_rule.index))
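
Because the search runs over every track, a song with a different rule-based label can rank above songs of the requested mood. The toy sketch below shows this effect and one possible variant that restricts candidates to the requested mood. The function `recommend_toy`, its `same_mood_only` flag, and all values are hypothetical and only mirror the logic of `recommend_songs`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy tracks in a 2D mood space (values invented for illustration)
toy = pd.DataFrame({
    'track_name': ['A', 'B', 'C', 'D', 'E'],
    'mood_rule':  ['Happy', 'Happy', 'Sad', 'Sad', 'Happy'],
    'pc1':        [1.0, 1.2, -1.0, 1.5, 3.0],
    'pc2':        [1.0, 0.8, -1.0, 1.5, 3.0],
})
toy_centroids = toy.groupby('mood_rule')[['pc1', 'pc2']].mean()

def recommend_toy(mood_label, n_songs=2, same_mood_only=False):
    """Return the n_songs tracks nearest to the mood centroid (hypothetical helper)."""
    centroid = toy_centroids.loc[mood_label].values
    # Optionally restrict the candidate pool to tracks with the requested label
    pool = toy[toy['mood_rule'] == mood_label] if same_mood_only else toy
    distances = np.linalg.norm(pool[['pc1', 'pc2']].values - centroid, axis=1)
    return pool.iloc[np.argsort(distances)[:n_songs]]

recs_all = recommend_toy('Happy')['track_name'].tolist()
recs_same = recommend_toy('Happy', same_mood_only=True)['track_name'].tolist()
print('All tracks as candidates:', recs_all)
print('Same mood only:          ', recs_same)
```

Which behavior is preferable depends on whether the rule-based label or the position in mood space is treated as the ground truth.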

### Example: get recommendations for a chosen mood

In [None]:
# Choose mood and number of songs
chosen_mood = 'Happy / Energetic'  # change this as you like
k = 10

recs = recommend_songs(chosen_mood, k)
print(f'Recommendations for mood: {chosen_mood}')
for _, row in recs.iterrows():
    print(f"{row['track_name']} - {row['artists']} (Genre: {row['track_genre']}, Popularity: {row['popularity']})")

## 8. Conclusions

- We built a **scientific computing pipeline** using:
  - Standardization
  - PCA for mood-space construction
  - Rule-based moods from `valence` and `energy`
  - Unsupervised clustering (K-means, GMM) as alternative mood models
- We compared mood models using ARI, NMI, and silhouette score.
- Finally, we implemented a simple **mood-based recommender** that picks songs closest to a mood centroid in PCA space.

You can now tweak thresholds, the number of clusters, or add more sophisticated models (e.g. supervised classifiers) as further extensions of this project.