# 🎵 Content-Based Music Recommender System
## Building a Spotify Track Recommendation Engine

### Executive Summary
This notebook demonstrates the development of a content-based music recommendation system using Spotify's audio features. We build a transparent, interpretable recommender that suggests tracks based on musical similarity rather than collaborative filtering.

### Methodology Overview
Our approach follows a classical content-based recommendation pipeline:

1. **Data Exploration**: Load and inspect the Spotify tracks dataset
2. **Feature Engineering**: Select and standardize audio features for modeling  
3. **User Preference Modeling**: Create taste vectors from user's seed tracks
4. **Similarity Computation**: Use cosine similarity between the user's taste vector and every track, computed on representations derived from the standardized features
5. **Recommendation Generation**: Rank and filter similar tracks
6. **Evaluation**: Display results with similarity scores

### Key Technical Decisions
- **Feature Selection**: Focus on 6 core audio features (danceability, energy, tempo, valence, acousticness, loudness)
- **Standardization**: Zero-mean, unit-variance scaling for fair feature contribution
- **Similarity Metric**: Cosine similarity for orientation-based matching
- **User Modeling**: Simple averaging of seed track features

### Dataset
- **Source**: `spotify_songs.csv` (32,833 tracks)
- **Features**: 23 columns including audio features, metadata, and playlist information
- **Scope**: Multiple genres and playlists for diverse recommendations


## 1. Environment Setup and Imports

### Libraries and Configuration
We start by importing pandas and configuring a warning filter; scikit-learn, NumPy, and PyTorch are imported later, in the cells that first use them. Pandas display options are configured to show all columns during exploration.


## 2. Data Loading and Initial Exploration

### Dataset Import
Loading the Spotify tracks dataset from CSV. This dataset contains comprehensive audio features and metadata for over 32K tracks across multiple genres and playlists.


In [None]:
import pandas as pd
import warnings

# Suppress sklearn warnings about feature names (they're harmless in our context)
warnings.filterwarnings('ignore', message='X does not have valid feature names')

pd.set_option('display.max_columns', None)



**Data Source**: Local file `spotify_songs.csv`  
**Expected Schema**: Track identifiers, audio features, and playlist metadata  
**Memory Considerations**: ~6MB dataset, easily fits in memory for analysis

In [6]:
df = pd.read_csv("spotify_songs.csv")

### 2.1 Schema Analysis and Data Quality Assessment

Understanding the dataset structure is crucial for feature selection and preprocessing strategy. We examine:
- **Column types and names**: Identify categorical vs. numerical features
- **Missing values**: Assess data completeness and quality
- **Dataset size**: Confirm we have sufficient data for meaningful recommendations
- **Feature distributions**: Prepare for normalization decisions
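
The checks listed above can be sketched in a few lines. The helper name `summarize_quality` and the toy frame are illustrative assumptions, not part of the notebook. The duplicate-ID count is included because the same `track_id` may appear under more than one playlist, which `df.info()` alone does not reveal.

```python
import pandas as pd

def summarize_quality(frame, id_col="track_id"):
    """Return row count, per-column missing counts, and duplicate-ID count."""
    return {
        "rows": len(frame),
        "missing": frame.isna().sum().to_dict(),
        "duplicate_ids": int(frame[id_col].duplicated().sum()),
    }

# Toy stand-in for the real dataset (values are made up for illustration)
toy = pd.DataFrame({
    "track_id": ["a", "b", "b", "c"],
    "track_name": ["Song A", None, "Song B", "Song C"],
    "energy": [0.9, 0.5, 0.5, 0.2],
})
report = summarize_quality(toy)
print(report)
```

On the real data, the same call would be `summarize_quality(df)`.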


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32833 entries, 0 to 32832
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   track_id                  32833 non-null  object 
 1   track_name                32828 non-null  object 
 2   track_artist              32828 non-null  object 
 3   track_popularity          32833 non-null  int64  
 4   track_album_id            32833 non-null  object 
 5   track_album_name          32828 non-null  object 
 6   track_album_release_date  32833 non-null  object 
 7   playlist_name             32833 non-null  object 
 8   playlist_id               32833 non-null  object 
 9   playlist_genre            32833 non-null  object 
 10  playlist_subgenre         32833 non-null  object 
 11  danceability              32833 non-null  float64
 12  energy                    32833 non-null  float64
 13  key                       32833 non-null  int64  
 14  loudne

### 2.2 Sample Data Inspection

Examining the first few rows provides immediate insights into:
- **Data format and structure**: Verify expected columns are present
- **Track metadata quality**: Check for meaningful track names and artists  
- **Audio feature ranges**: Understand the scale and distribution of our modeling features
- **Data completeness**: Spot any obvious quality issues early


In [8]:
df.head()

| | track_id | track_name | track_artist | track_popularity | track_album_id | track_album_name | track_album_release_date | playlist_name | playlist_id | playlist_genre | playlist_subgenre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6f807x0ima9a1j3VPbc7VN | I Don't Care (with Justin Bieber) - Loud Luxur... | Ed Sheeran | 66 | 2oCs0DGTsRO98Gh5ZSl2Cx | I Don't Care (with Justin Bieber) [Loud Luxury... | 2019-06-14 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.748 | 0.916 | 6 | -2.634 | 1 | 0.0583 | 0.102 | 0.0 | 0.0653 | 0.518 | 122.036 | 194754 |
| 1 | 0r7CVbZTWZgbTCYdfa2P31 | Memories - Dillon Francis Remix | Maroon 5 | 67 | 63rPSO264uRjW1X5E6cWv6 | Memories (Dillon Francis Remix) | 2019-12-13 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.726 | 0.815 | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 0.00421 | 0.357 | 0.693 | 99.972 | 162600 |
| 2 | 1z1Hg7Vb0AhHDiEmnDE79l | All the Time - Don Diablo Remix | Zara Larsson | 70 | 1HoSmj2eLcsrR0vE9gThr4 | All the Time (Don Diablo Remix) | 2019-07-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.675 | 0.931 | 1 | -3.432 | 0 | 0.0742 | 0.0794 | 2.3e-05 | 0.11 | 0.613 | 124.008 | 176616 |
| 3 | 75FpbthrwQmzHlBJLuGdC7 | Call You Mine - Keanu Silva Remix | The Chainsmokers | 60 | 1nqYsOef1yKKuGOVchbsk6 | Call You Mine - The Remixes | 2019-07-19 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.718 | 0.93 | 7 | -3.778 | 1 | 0.102 | 0.0287 | 9e-06 | 0.204 | 0.277 | 121.956 | 169093 |
| 4 | 1e8PAfcKUYoKkxPhrHqw4x | Someone You Loved - Future Humans Remix | Lewis Capaldi | 69 | 7m7vv9wlQ4i0LFuJiE2zsQ | Someone You Loved (Future Humans Remix) | 2019-03-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.65 | 0.833 | 1 | -4.672 | 1 | 0.0359 | 0.0803 | 0.0 | 0.0833 | 0.725 | 123.976 | 189052 |


## 3. Feature Engineering and Data Preparation

### 3.1 Feature Selection Strategy

We carefully select features that capture musical similarity while maintaining model interpretability:

**Identifiers & Metadata** (for results display):
- `track_id`: Unique identifier for lookup and deduplication
- `track_name`, `track_artist`: Human-readable labels for recommendations

**Audio Features** (for similarity computation):
- `danceability`: Rhythmic predictability and beat strength
- `energy`: Perceived intensity and power  
- `tempo`: Speed/pace in beats per minute
- `valence`: Musical positivity/happiness conveyed
- `acousticness`: Confidence measure of acoustic vs. electronic
- `loudness`: Overall average loudness of the track in decibels (dB)

**Rationale**: These 6 features capture the most distinctive and interpretable aspects of musical content while avoiding redundancy with features like `speechiness` or `instrumentalness` that may be genre-specific.


In [10]:
feature_cols = [
    "danceability",
    "energy",
    "tempo",
    "valence",
    "acousticness",
    "loudness"
]


df = df[["track_id", "track_name", "track_artist"] + feature_cols]
df.head()

| | track_id | track_name | track_artist | danceability | energy | tempo | valence | acousticness | loudness |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6f807x0ima9a1j3VPbc7VN | I Don't Care (with Justin Bieber) - Loud Luxur... | Ed Sheeran | 0.748 | 0.916 | 122.036 | 0.518 | 0.102 | -2.634 |
| 1 | 0r7CVbZTWZgbTCYdfa2P31 | Memories - Dillon Francis Remix | Maroon 5 | 0.726 | 0.815 | 99.972 | 0.693 | 0.0724 | -4.969 |
| 2 | 1z1Hg7Vb0AhHDiEmnDE79l | All the Time - Don Diablo Remix | Zara Larsson | 0.675 | 0.931 | 124.008 | 0.613 | 0.0794 | -3.432 |
| 3 | 75FpbthrwQmzHlBJLuGdC7 | Call You Mine - Keanu Silva Remix | The Chainsmokers | 0.718 | 0.93 | 121.956 | 0.277 | 0.0287 | -3.778 |
| 4 | 1e8PAfcKUYoKkxPhrHqw4x | Someone You Loved - Future Humans Remix | Lewis Capaldi | 0.65 | 0.833 | 123.976 | 0.725 | 0.0803 | -4.672 |


### 3.2 Utility Functions for Feature Extraction

Before standardization, we define helper functions to extract features for individual tracks or sets of tracks.


In [None]:
def get_song_features(song_ids):
    """
    song_ids: list of track IDs
    Returns: numpy array shape (len(song_ids), num_features)
    """
    seed_df = df[df['track_id'].isin(song_ids)]
    return seed_df[feature_cols].values

In [11]:
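import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit on the full catalog so every track is scaled with the same mean/variance
scaler = StandardScaler()
X_all = scaler.fit_transform(df[feature_cols])

# Persist the fitted scaler for later use
joblib.dump(scaler, "scaler.pkl")

def get_normalized_features(song_ids):
    """
    song_ids: list of track IDs
    Returns: standardized numpy array, one row per matching dataset row
    """
    seed_df = df[df['track_id'].isin(song_ids)]
    # Pass a DataFrame (not .values) so column names match those seen during fit
    return scaler.transform(seed_df[feature_cols])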


In [12]:
def make_taste_vector(song_ids):
    """
    Takes list of song IDs, returns the averaged feature vector.
    """
    seed_features = get_normalized_features(song_ids)
    return np.mean(seed_features, axis=0, keepdims=True)


### 3.3 Feature Standardization and Normalization

**Why Standardization Matters**:
Different audio features have vastly different scales:
- `tempo` ranges from ~60-200 BPM  
- `loudness` ranges from ~-60 to 0 dB
- `danceability` ranges from 0.0 to 1.0

Without standardization, high-magnitude features would dominate similarity calculations.

**Implementation**:
- **Fit** `StandardScaler` on the entire dataset to learn global mean/variance
- **Transform** features to zero-mean, unit-variance distributions
- **Persist** the fitted scaler for consistent preprocessing of new tracks
- **Apply** same transformation to user seed tracks for fair comparison

This puts all six features on a comparable scale, so no single feature dominates the cosine similarity purely because of its units.
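
The effect of scale can be seen on a toy example. The sketch below uses made-up values for two features (`danceability` and `tempo`) and applies the same zero-mean, unit-variance scaling by hand with NumPy; the `cosine` helper and the array `X` are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Columns: danceability, tempo (toy values chosen for illustration)
X = np.array([
    [0.9, 120.0],
    [0.1, 121.0],
    [0.9, 60.0],
    [0.1, 61.0],
])

# Zero-mean, unit-variance scaling per column (population std, like StandardScaler)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Tracks 0 and 1 have similar tempo but very different danceability
raw_sim = cosine(X[0], X[1])
std_sim = cosine(Z[0], Z[1])
print(f"raw: {raw_sim:.4f}, standardized: {std_sim:.4f}")
```

On raw values the two tracks look almost identical because tempo swamps the dot product; after scaling, the difference in danceability carries comparable weight and the similarity drops sharply.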


In [15]:
my_seeds = ["6f807x0ima9a1j3VPbc7VN", "75FpbthrwQmzHlBJLuGdC7"]  # Example IDs
taste_vector = make_taste_vector(my_seeds)
print(taste_vector)  # (1, num_features)


[[ 0.57312308  1.22740996  0.04193558 -0.31265999 -0.44514596  1.23951851]]




## 4. User Preference Modeling

### 4.1 Taste Vector Construction

**Concept**: Represent user preferences as a single point in standardized feature space by averaging their liked tracks.

**Process**:
1. `get_normalized_features(song_ids)`: Apply the same standardization to user's seed tracks
2. `make_taste_vector(song_ids)`: Compute element-wise mean across all seed tracks
3. Result: A single (1 × 6) vector representing the "center" of user's musical taste

**Mathematical Foundation**:
- Assumes user preferences cluster around a central tendency
- Simple averaging is a reasonable baseline when users provide a handful (roughly 3-10) of representative, stylistically similar tracks
- More sophisticated approaches could weight by play count or recency

**Example Output**: The taste vector shows your relative preferences across the 6 audio dimensions.
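
As a sketch of the weighting idea mentioned above, the averaging step could accept per-seed weights (for example play counts). The function name `make_weighted_taste_vector` and the toy inputs are hypothetical; the notebook itself uses an unweighted mean.

```python
import numpy as np

def make_weighted_taste_vector(seed_features, weights=None):
    """
    seed_features: (n_seeds, n_features) array of standardized features
    weights: optional length-n_seeds weights (e.g. play counts); None gives a plain mean
    Returns: (1, n_features) taste vector
    """
    seed_features = np.asarray(seed_features, dtype=float)
    taste = np.average(seed_features, axis=0, weights=weights)
    return taste.reshape(1, -1)

# Toy standardized features for three seed tracks (made-up values)
seeds = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
plain = make_weighted_taste_vector(seeds)
weighted = make_weighted_taste_vector(seeds, weights=[3, 1, 0])
print(plain, weighted)
```

With `weights=None` the result matches the plain average used in `make_taste_vector`; a zero weight drops a seed entirely.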


In [16]:
import torch

def make_taste_embedding(song_ids, model):
    taste_vector = make_taste_vector(song_ids)
    taste_tensor = torch.tensor(taste_vector, dtype=torch.float32)
    with torch.no_grad():
        taste_emb = model(taste_tensor).numpy()
    return taste_emb


### Embedding model and taste embedding
This section defines a simple neural network (`SongEmbedder`) and constructs a taste embedding:
- `SongEmbedder`: a tiny feed-forward network mapping 6 features → 32-dimensional vector.
- `make_taste_embedding`: converts the averaged, normalized taste vector to a tensor and passes it through the model.

Important: the model is randomly initialized in this notebook, so embeddings are not learned. For stable, meaningful results:
- Either compute cosine similarity directly on standardized features, or
- Train the encoder (e.g., as an autoencoder) before using it for recommendations.
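
The first option (cosine similarity directly on standardized features) might look like the sketch below. `recommend_from_features` and the toy matrix are hypothetical; in this notebook, `X_all` would play the role of `X_std`, and seed rows would be located by `track_id`.

```python
import numpy as np

def recommend_from_features(X_std, seed_idx, top_n=5):
    """Rank rows of X_std by cosine similarity to the mean of the seed rows."""
    X_std = np.asarray(X_std, dtype=float)
    taste = X_std[seed_idx].mean(axis=0)
    norms = np.linalg.norm(X_std, axis=1) * np.linalg.norm(taste)
    # Guard against division by zero for all-zero rows
    sims = X_std @ taste / np.where(norms == 0, 1.0, norms)
    sims[seed_idx] = -np.inf  # never recommend the seeds themselves
    order = np.argsort(-sims)[:top_n]
    return order, sims[order]

# Toy standardized feature matrix (made-up values)
X_toy = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [0.0, 1.0]])
order, scores = recommend_from_features(X_toy, seed_idx=[0], top_n=2)
print(order, scores)
```

Because no model is involved, the ranking is fully determined by the standardized features and is reproducible across runs.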


In [19]:
import torch
import torch.nn as nn

class SongEmbedder(nn.Module):
    def __init__(self, input_dim=6, embedding_dim=32):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, embedding_dim)
        )

    def forward(self, x):
        return self.fc(x)

model = SongEmbedder()


### Computing embeddings for all songs (dataset index)
We compute an embedding for each song to enable fast similarity search.

- For each row in `df`, we normalize its features and run them through `model` to obtain a vector.
- These vectors form `all_song_embeddings`, which we later compare to your taste embedding.

Caveats in this notebook version:
- The model is untrained; results may not be meaningful. In production, use standardized features directly or train an encoder.
- This loop is row-by-row and can be vectorized for better performance.
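- The warnings about feature names arise when scikit-learn receives raw NumPy arrays after the scaler was fit on a DataFrame; they are benign here and disappear if a DataFrame slice is passed to `scaler.transform`.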


In [22]:
all_song_embeddings = []
for i in range(len(df)):
    song_vector = get_normalized_features([df.iloc[i]['track_id']])
    emb = model(torch.tensor(song_vector, dtype=torch.float32)).detach().numpy()
    all_song_embeddings.append(emb[0])
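
A vectorized alternative to the loop is to pass the whole standardized matrix through the network in one batch. The sketch below is self-contained and uses a stand-in network (`toy_model`) and random stand-in features (`toy_features`), both assumptions for illustration. In the notebook, the equivalent call would feed `X_all` to `model`.

```python
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for SongEmbedder: 6 features -> 32-dimensional embedding
toy_model = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 32))
toy_model.eval()

# Stand-in for the standardized feature matrix (100 tracks x 6 features)
toy_features = np.random.default_rng(0).normal(size=(100, 6)).astype(np.float32)

with torch.no_grad():
    # One forward pass for all rows
    batched = toy_model(torch.from_numpy(toy_features)).numpy()

    # Row-by-row reference for the first few tracks, to check equivalence
    looped = np.vstack([
        toy_model(torch.from_numpy(toy_features[i:i + 1])).numpy()
        for i in range(5)
    ])

print(batched.shape)
```

The batched result is a single `(num_songs, embedding_dim)` array, which is also the shape `cosine_similarity` expects.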



### Recommendation function (cosine similarity)
`recommend_songs(example_song_ids, model, all_song_embeddings, df, top_n)` performs the core retrieval:
1. Builds a taste embedding for the seed tracks (`make_taste_embedding`).
2. Computes cosine similarity between the taste embedding and all song embeddings.
3. Excludes the seed songs from candidates to avoid echoing your input.
4. Sorts by similarity and returns the top results along with metadata.

Note: With a learned or deterministic embedding, higher similarity correlates with closer proximity in the feature/embedding space.


In [33]:
from sklearn.metrics.pairwise import cosine_similarity

def recommend_songs(example_song_ids, model, all_song_embeddings, all_song_info, top_n=10):
    """
    example_song_ids: list of track IDs the user likes
    model: your PyTorch embedding model
    all_song_embeddings: precomputed embeddings for all songs in dataset (array-like, one row per song)
    all_song_info: dataframe with columns ['track_id', 'track_name', 'track_artist']
    top_n: number of recommendations to return
    """

    # 1. Compute the taste embedding from example songs
    taste_emb = make_taste_embedding(example_song_ids, model)  # shape (1, embedding_dim)

    # 2. Compute cosine similarity between taste and all songs
    similarities = cosine_similarity(taste_emb, all_song_embeddings)  # shape (1, num_songs)
    similarities = similarities.flatten()

    # 3. Exclude the example songs from recommendations
    example_set = set(example_song_ids)
    candidates = [(idx, score) for idx, score in enumerate(similarities)
                  if all_song_info.iloc[idx]['track_id'] not in example_set]

    # 4. Sort by similarity
    candidates.sort(key=lambda x: x[1], reverse=True)

    # 5. Take top N
    top_candidates = candidates[:top_n]

    # 6. Return human-readable list
    recommendations = []
    for idx, score in top_candidates:
        song = all_song_info.iloc[idx]
        recommendations.append({
            "track_id": song['track_id'],
            "track_name": song['track_name'],
            "track_artist": song['track_artist'],
            "similarity": float(score)
        })

    return recommendations
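
Steps 3 to 5 can also be expressed with pandas operations instead of a Python loop. The sketch below is an alternative under stated assumptions: the helper name `top_n_excluding_seeds` and the toy data are hypothetical, and the extra `drop_duplicates` step assumes the same `track_id` may appear in several rows (for example, once per playlist).

```python
import numpy as np
import pandas as pd

def top_n_excluding_seeds(similarities, info, seed_ids, top_n=10):
    """Return the top_n most similar rows of info, skipping seeds and repeated tracks."""
    scored = info.assign(similarity=np.asarray(similarities, dtype=float))
    scored = scored[~scored["track_id"].isin(seed_ids)]
    scored = scored.sort_values("similarity", ascending=False)
    scored = scored.drop_duplicates(subset="track_id", keep="first")
    return scored.head(top_n).reset_index(drop=True)

# Toy catalog in which track "b" appears twice (made-up values)
toy_info = pd.DataFrame({
    "track_id": ["a", "b", "b", "c"],
    "track_name": ["Song A", "Song B", "Song B", "Song C"],
    "track_artist": ["X", "Y", "Y", "Z"],
})
toy_sims = [1.0, 0.9, 0.9, 0.5]
result = top_n_excluding_seeds(toy_sims, toy_info, seed_ids=["a"], top_n=2)
print(result)
```

The returned DataFrame can be converted to the same list-of-dicts shape with `result.to_dict("records")`.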

### Choosing seed songs and generating recommendations
Here we define a few example seed `track_id`s and call `recommend_songs(...)` to retrieve the top-N most similar tracks. In a user-facing app, these IDs would be collected from user input (e.g., text search and selection).

- `example_songs`: the set of tracks you like
- `top_n`: how many recommendations to return

The recommender compares your taste embedding against all tracks and returns the most similar ones.


In [34]:
example_songs = ["6f807x0ima9a1j3VPbc7VN", "1e8PAfcKUYoKkxPhrHqw4x"]
recommendations = recommend_songs(example_songs, model, all_song_embeddings, df, top_n=5)



### Displaying recommendations
This cell formats and prints the final recommendation list. For each recommended track, we show:
- the track name
- the artist name
- the cosine similarity to your taste vector (higher = more similar)

Use this to quickly scan whether the suggestions align with your preferences.


In [36]:
for i, rec in enumerate(recommendations, 1):
    print(f"{i}. {rec['track_name']} by {rec['track_artist']} (similarity: {rec['similarity']:.3f})")

1. The Fox (What Does the Fox Say?) by Ylvis (similarity: 0.993)
2. D.D.D by THE BOYZ (similarity: 0.993)
3. Just Got Paid by Sigala (similarity: 0.992)
4. Way Too Good by Lucas Estrada (similarity: 0.991)
5. Baby by Bakermat (similarity: 0.990)


## 7. Results Analysis and Next Steps

### 7.1 Recommendation Quality Assessment

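**Interpreting Similarity Scores** (rough rules of thumb, not calibrated thresholds):
- **High similarity (>0.99)**: Near-identical audio profiles
- **Good matches (0.85-0.99)**: Strong musical similarity  
- **Moderate matches (0.70-0.85)**: Decent recommendations worth exploring
- **Low matches (<0.70)**: May introduce beneficial diversity

Because `SongEmbedder` is randomly initialized in this notebook, the scores printed above reflect proximity in an arbitrary embedding space. Treat these bands as meaningful only for similarity computed on standardized features or on a trained embedding.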

### 7.2 Model Limitations and Improvements

**Current Limitations**:
- **Feature scope**: Only 6 audio features, missing genre/mood context
- **User modeling**: Simple averaging may not capture preference diversity
- **Cold start**: Requires seed tracks, no handling for new users
- **Popularity bias**: No consideration of track popularity or recency

**Potential Enhancements**:
1. **Feature expansion**: Include genre, mood, lyrical themes
2. **Advanced user modeling**: Weighted averages, preference clustering  
3. **Hybrid approaches**: Combine content-based with collaborative filtering
4. **Evaluation metrics**: Implement precision@k, diversity measures
5. **Production optimizations**: Approximate nearest neighbors, caching
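
As a starting point for the evaluation item above, precision@k can be computed in a few lines. This is a minimal sketch: the function name, the example IDs, and the notion of "relevant" tracks (for example, tracks the user later saved) are assumptions, since the notebook has no ground-truth labels.

```python
def precision_at_k(recommended_ids, relevant_ids, k):
    """Fraction of the top-k recommended IDs that appear in relevant_ids."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    relevant = set(relevant_ids)
    top_k = list(recommended_ids)[:k]
    hits = sum(1 for track_id in top_k if track_id in relevant)
    return hits / k

# Hypothetical ranked recommendations and held-out relevant tracks
recommended = ["a", "b", "c", "d"]
relevant = {"a", "c"}
p_at_2 = precision_at_k(recommended, relevant, k=2)
p_at_4 = precision_at_k(recommended, relevant, k=4)
print(p_at_2, p_at_4)
```

Dividing by `k` rather than by the number of returned items penalizes a recommender that returns fewer than `k` results.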

### 7.3 Business Applications

This recommender could serve as:
- **Playlist generation**: Auto-create themed playlists from seed tracks
- **Music discovery**: Help users find similar artists/genres
- **A/B testing baseline**: Transparent algorithm for comparison studies
- **Cold start solution**: Recommend to users who have no interaction history but can name a few seed tracks
