# Spotify Music Taste Evolution Analysis

This notebook analyzes how your music taste has evolved over time using Spotify API data.

## Features:
- **Year-by-year analysis** of listening patterns
- **Genre analysis and audio-feature clustering** to identify music preferences
- **Tempo and mood analysis** over time
- **Visualization** of music taste evolution

## Setup:
1. Install required packages: `pip install spotipy pandas numpy matplotlib seaborn scikit-learn`
2. Get Spotify API credentials from [Spotify Developer Dashboard](https://developer.spotify.com/dashboard)
3. Set up your CLIENT_ID and CLIENT_SECRET below
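
As an alternative to pasting credentials into the notebook in step 3, they can be kept in environment variables. The sketch below is a minimal example, assuming the variables are named `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` (the names spotipy itself looks up); the helper `load_credentials` is hypothetical and not part of this notebook.

```python
import os

def load_credentials(env=os.environ):
    """Return (client_id, client_secret) from the environment, or raise if missing."""
    required = ("SPOTIPY_CLIENT_ID", "SPOTIPY_CLIENT_SECRET")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return env["SPOTIPY_CLIENT_ID"], env["SPOTIPY_CLIENT_SECRET"]

# Usage (assumes both variables were exported before starting Jupyter):
# CLIENT_ID, CLIENT_SECRET = load_credentials()
```

Keeping secrets out of the notebook file makes it safer to share or commit.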

In [1]:
# Install required packages (comment this out if they are already installed)
!pip install spotipy pandas numpy matplotlib seaborn scikit-learn

Collecting spotipy
  Downloading spotipy-2.25.2-py3-none-any.whl.metadata (5.1 kB)
Collecting redis>=3.5.3 (from spotipy)
  Downloading redis-7.1.0-py3-none-any.whl.metadata (12 kB)
Downloading spotipy-2.25.2-py3-none-any.whl (31 kB)
Downloading redis-7.1.0-py3-none-any.whl (354 kB)
Installing collected packages: redis, spotipy
Successfully installed redis-7.1.0 spotipy-2.25.2

[notice] A new release of pip is available: 25.1.1 -> 26.0.1
[notice] To update, run: pip install --upgrade pip


In [2]:
import spotipy
from spotipy.oauth2 import SpotifyOAuth
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
from collections import defaultdict
import warnings
warnings.filterwarnings('ignore')

# Set style for better visualizations
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

print("✅ Libraries imported successfully!")

✅ Libraries imported successfully!


## 1. Spotify API Authentication

Set up your Spotify API credentials. You can get them from:
https://developer.spotify.com/dashboard

**Required Scopes:**
- `user-read-recently-played` - to get recently played tracks
- `user-library-read` - to get saved tracks
- `user-read-private` - to get user profile
- `playlist-read-private` - to read playlists

In [None]:
# ====== CONFIGURATION ======
CLIENT_ID = "YOUR_CLIENT_ID_HERE"
CLIENT_SECRET = "YOUR_CLIENT_SECRET_HERE"
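# Must exactly match a redirect URI registered for your app in the dashboard.
# Spotify requires the loopback IP here; "localhost" is no longer accepted.
REDIRECT_URI = "http://127.0.0.1:8888/callback"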

# Scopes needed for the analysis
SCOPE = "user-read-recently-played user-library-read user-read-private playlist-read-private"
# ============================

# Initialize Spotify client
sp = spotipy.Spotify(auth_manager=SpotifyOAuth(
    client_id=CLIENT_ID,
    client_secret=CLIENT_SECRET,
    redirect_uri=REDIRECT_URI,
    scope=SCOPE
))

# Test connection
try:
    user = sp.current_user()
    print("✅ Successfully connected to Spotify!")
    print(f"👤 Logged in as: {user['display_name']}")
    print(f"🆔 User ID: {user['id']}")
except Exception as e:
    print(f"❌ Error connecting to Spotify: {e}")
    print("Please check your CLIENT_ID, CLIENT_SECRET, and REDIRECT_URI")

## 2. Data Collection

We'll collect:
- Recently played tracks (the API returns at most the 50 most recent)
- Saved tracks from your library
- Playlists (to get historical data)

In [None]:
def get_recently_played_tracks(sp, limit=50):
    """Get recently played tracks"""
    try:
        results = sp.current_user_recently_played(limit=limit)
        tracks = []
        for item in results['items']:
            track = item['track']
            played_at = item['played_at']
            tracks.append({
                'track_id': track['id'],
                'name': track['name'],
                'artists': ', '.join([a['name'] for a in track['artists']]),
                'album': track['album']['name'],
                'played_at': played_at,
                'release_date': track['album']['release_date']
            })
        return tracks
    except Exception as e:
        print(f"Error fetching recently played: {e}")
        return []

def get_saved_tracks(sp, limit=1000):
    """Get up to `limit` saved tracks from the user's library"""
    tracks = []
    results = sp.current_user_saved_tracks(limit=50)  # 50 is the page-size maximum
    
    while results:
        for item in results['items']:
            track = item['track']
            if not track or not track['id']:
                continue  # skip unavailable or local tracks
            added_at = item['added_at']
            tracks.append({
                'track_id': track['id'],
                'name': track['name'],
                'artists': ', '.join([a['name'] for a in track['artists']]),
                'album': track['album']['name'],
                'added_at': added_at,
                'release_date': track['album']['release_date']
            })
        
        if results['next'] and len(tracks) < limit:
            results = sp.next(results)
        else:
            break
    
    # The last page may overshoot the limit, so trim the result
    return tracks[:limit]

def get_playlist_tracks(sp, playlist_id):
    """Get all tracks from a playlist"""
    tracks = []
    results = sp.playlist_tracks(playlist_id)
    
    while results:
        for item in results['items']:
            if item['track'] and item['track']['id']:
                track = item['track']
                tracks.append({
                    'track_id': track['id'],
                    'name': track['name'],
                    'artists': ', '.join([a['name'] for a in track['artists']]),
                    'album': track['album']['name'],
                    'added_at': item.get('added_at', None),
                    'release_date': track['album']['release_date']
                })
        
        if results['next']:
            results = sp.next(results)
        else:
            break
    
    return tracks

print("✅ Data collection functions defined!")

In [None]:
# Collect data from multiple sources
print("📥 Collecting recently played tracks...")
recent_tracks = get_recently_played_tracks(sp, limit=50)
print(f"   Found {len(recent_tracks)} recently played tracks")

print("\n📥 Collecting saved tracks...")
saved_tracks = get_saved_tracks(sp, limit=1000)
print(f"   Found {len(saved_tracks)} saved tracks")

print("\n📥 Collecting tracks from playlists...")
playlist_tracks = []
try:
    playlists = sp.current_user_playlists(limit=50)
    for playlist in playlists['items']:
        print(f"   Fetching tracks from: {playlist['name']}")
        tracks = get_playlist_tracks(sp, playlist['id'])
        playlist_tracks.extend(tracks)
    print(f"   Found {len(playlist_tracks)} tracks from playlists")
except Exception as e:
    print(f"   Error fetching playlists: {e}")

# Combine all tracks and remove duplicates
all_tracks = recent_tracks + saved_tracks + playlist_tracks
unique_tracks = {}
for track in all_tracks:
    track_id = track['track_id']
    if track_id not in unique_tracks:
        unique_tracks[track_id] = track
    else:
        # Keep the earliest date if multiple entries
        if track.get('added_at') or track.get('played_at'):
            existing_date = unique_tracks[track_id].get('added_at') or unique_tracks[track_id].get('played_at')
            new_date = track.get('added_at') or track.get('played_at')
            if new_date and (not existing_date or new_date < existing_date):
                unique_tracks[track_id] = track

all_tracks = list(unique_tracks.values())
print(f"\n✅ Total unique tracks collected: {len(all_tracks)}")

## 3. Get Audio Features

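Fetch audio features (tempo, energy, valence, danceability, etc.) for all tracks. Note that Spotify restricted access to the audio-features endpoint on 27 November 2024: apps registered after that date typically receive an HTTP 403 error from it, in which case the cells below will collect no features.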

In [None]:
def get_audio_features_batch(sp, track_ids):
    """Get audio features for multiple tracks (Spotify API limit: 100 per request)"""
    all_features = []
    
    # Process in batches of 100
    for i in range(0, len(track_ids), 100):
        batch = track_ids[i:i+100]
        try:
            features = sp.audio_features(batch)
            all_features.extend([f for f in features if f is not None])
        except Exception as e:
            print(f"Error fetching features for batch {i//100 + 1}: {e}")
            continue
    
    return all_features

def get_track_genres(sp, track_ids):
    """Get genres for tracks via artist information"""
    track_genres = {}
    
    # Get unique artist IDs
    artist_ids = set()
    artist_to_tracks = defaultdict(list)
    
    # First, get artist info from tracks
    for i in range(0, len(track_ids), 50):
        batch = track_ids[i:i+50]
        try:
            tracks = sp.tracks(batch)
            for track in tracks['tracks']:
                if track:
                    for artist in track['artists']:
                        artist_ids.add(artist['id'])
                        artist_to_tracks[artist['id']].append(track['id'])
        except Exception as e:
            print(f"Error fetching track details: {e}")
            continue
    
    # Get artist genres
    artist_ids_list = list(artist_ids)
    for i in range(0, len(artist_ids_list), 50):
        batch = artist_ids_list[i:i+50]
        try:
            artists = sp.artists(batch)
            for artist in artists['artists']:
                if artist:
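                    genres = artist.get('genres', [])
                    for track_id in artist_to_tracks[artist['id']]:
                        track_list = track_genres.setdefault(track_id, [])
                        # Skip genres already recorded, e.g. when several
                        # artists on the same track share a genre
                        track_list.extend(g for g in genres if g not in track_list)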
        except Exception as e:
            print(f"Error fetching artist details: {e}")
            continue
    
    return track_genres

print("✅ Audio features functions defined!")

In [None]:
# Get track IDs
track_ids = [t['track_id'] for t in all_tracks if t['track_id']]

print(f"📊 Fetching audio features for {len(track_ids)} tracks...")
audio_features = get_audio_features_batch(sp, track_ids)

print("📊 Fetching genre information...")
track_genres = get_track_genres(sp, track_ids)

# Create a mapping of track_id to features
features_dict = {f['id']: f for f in audio_features if f}

# Combine all data
complete_data = []
for track in all_tracks:
    track_id = track['track_id']
    if track_id in features_dict:
        row = track.copy()
        features = features_dict[track_id]
        
        # Add audio features
        row.update({
            'danceability': features.get('danceability'),
            'energy': features.get('energy'),
            'key': features.get('key'),
            'loudness': features.get('loudness'),
            'mode': features.get('mode'),
            'speechiness': features.get('speechiness'),
            'acousticness': features.get('acousticness'),
            'instrumentalness': features.get('instrumentalness'),
            'liveness': features.get('liveness'),
            'valence': features.get('valence'),  # Positivity/happiness
            'tempo': features.get('tempo'),
            'duration_ms': features.get('duration_ms'),
            'time_signature': features.get('time_signature'),
            'genres': ', '.join(track_genres.get(track_id, []))
        })
        complete_data.append(row)

print(f"\n✅ Collected data for {len(complete_data)} tracks with audio features")

## 4. Data Preprocessing

Clean and prepare the data for analysis.

In [None]:
# Convert to DataFrame
df = pd.DataFrame(complete_data)

# Parse dates as UTC so API timestamps and release dates can be combined.
# A source may have returned nothing, so create any missing date column first.
for col in ['played_at', 'added_at', 'release_date']:
    if col not in df.columns:
        df[col] = pd.NaT
    df[col] = pd.to_datetime(df[col], errors='coerce', utc=True)

# Create a unified date column (prefer played_at, then added_at, then release_date)
df['date'] = df['played_at'].fillna(df['added_at']).fillna(df['release_date'])

# Remove rows without valid dates, then extract the year
# (filtering first keeps `year` an integer column rather than float with NaN)
df = df[df['date'].notna()].copy()
df['year'] = df['date'].dt.year

# Calculate mood category based on valence and energy
def categorize_mood(row):
    valence = row['valence']
    energy = row['energy']
    
    if valence > 0.6 and energy > 0.6:
        return 'Happy/Energetic'
    elif valence > 0.6 and energy <= 0.6:
        return 'Happy/Calm'
    elif valence <= 0.6 and energy > 0.6:
        return 'Sad/Energetic'
    else:
        return 'Sad/Calm'

df['mood_category'] = df.apply(categorize_mood, axis=1)

# Categorize tempo
def categorize_tempo(tempo):
    if tempo < 60:
        return 'Very Slow'
    elif tempo < 90:
        return 'Slow'
    elif tempo < 120:
        return 'Moderate'
    elif tempo < 150:
        return 'Fast'
    else:
        return 'Very Fast'

df['tempo_category'] = df['tempo'].apply(categorize_tempo)

print(f"✅ Data processed: {len(df)} tracks")
print(f"📅 Date range: {df['date'].min()} to {df['date'].max()}")
print(f"📅 Years covered: {sorted(df['year'].unique())}")
print("\n📊 Sample data:")
df[['name', 'artists', 'year', 'tempo', 'valence', 'energy', 'mood_category', 'genres']].head(10)

## 5. Year-by-Year Analysis

Analyze how your music taste changed over the years.

In [None]:
# Year-by-year statistics
yearly_stats = df.groupby('year').agg({
    'tempo': ['mean', 'std'],
    'energy': ['mean', 'std'],
    'valence': ['mean', 'std'],
    'danceability': ['mean', 'std'],
    'acousticness': ['mean', 'std'],
    'instrumentalness': ['mean', 'std'],
    'track_id': 'count'
}).round(3)

yearly_stats.columns = ['_'.join(col).strip() for col in yearly_stats.columns.values]
yearly_stats = yearly_stats.rename(columns={'track_id_count': 'track_count'})

print("📊 Year-by-Year Statistics:")
print(yearly_stats)

In [None]:
# Visualize year-by-year evolution
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
fig.suptitle('Music Taste Evolution Over Time', fontsize=16, fontweight='bold')

years = sorted(df['year'].unique())
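# numeric_only=True skips text columns (name, artists, genres), which would
# otherwise raise a TypeError in pandas 2.x
yearly_means = df.groupby('year').mean(numeric_only=True)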

# Tempo evolution
axes[0, 0].plot(years, yearly_means['tempo'], marker='o', linewidth=2, markersize=8)
axes[0, 0].set_title('Average Tempo (BPM)', fontsize=12, fontweight='bold')
axes[0, 0].set_xlabel('Year')
axes[0, 0].set_ylabel('Tempo (BPM)')
axes[0, 0].grid(True, alpha=0.3)

# Energy evolution
axes[0, 1].plot(years, yearly_means['energy'], marker='o', linewidth=2, markersize=8, color='orange')
axes[0, 1].set_title('Average Energy', fontsize=12, fontweight='bold')
axes[0, 1].set_xlabel('Year')
axes[0, 1].set_ylabel('Energy (0-1)')
axes[0, 1].grid(True, alpha=0.3)

# Valence (Happiness) evolution
axes[0, 2].plot(years, yearly_means['valence'], marker='o', linewidth=2, markersize=8, color='green')
axes[0, 2].set_title('Average Valence (Happiness)', fontsize=12, fontweight='bold')
axes[0, 2].set_xlabel('Year')
axes[0, 2].set_ylabel('Valence (0-1)')
axes[0, 2].grid(True, alpha=0.3)

# Danceability evolution
axes[1, 0].plot(years, yearly_means['danceability'], marker='o', linewidth=2, markersize=8, color='purple')
axes[1, 0].set_title('Average Danceability', fontsize=12, fontweight='bold')
axes[1, 0].set_xlabel('Year')
axes[1, 0].set_ylabel('Danceability (0-1)')
axes[1, 0].grid(True, alpha=0.3)

# Acousticness evolution
axes[1, 1].plot(years, yearly_means['acousticness'], marker='o', linewidth=2, markersize=8, color='brown')
axes[1, 1].set_title('Average Acousticness', fontsize=12, fontweight='bold')
axes[1, 1].set_xlabel('Year')
axes[1, 1].set_ylabel('Acousticness (0-1)')
axes[1, 1].grid(True, alpha=0.3)

# Track count per year
track_counts = df.groupby('year').size()
axes[1, 2].bar(years, track_counts, color='steelblue', alpha=0.7)
axes[1, 2].set_title('Number of Tracks per Year', fontsize=12, fontweight='bold')
axes[1, 2].set_xlabel('Year')
axes[1, 2].set_ylabel('Track Count')
axes[1, 2].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

## 6. Mood and Tempo Category Analysis

In [None]:
# Mood category distribution over years
mood_by_year = pd.crosstab(df['year'], df['mood_category'], normalize='index') * 100
tempo_by_year = pd.crosstab(df['year'], df['tempo_category'], normalize='index') * 100

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Mood categories
mood_by_year.plot(kind='bar', stacked=True, ax=axes[0], colormap='Set2')
axes[0].set_title('Mood Category Distribution Over Years', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Year')
axes[0].set_ylabel('Percentage (%)')
axes[0].legend(title='Mood Category', bbox_to_anchor=(1.05, 1), loc='upper left')
axes[0].grid(True, alpha=0.3, axis='y')

# Tempo categories
tempo_by_year.plot(kind='bar', stacked=True, ax=axes[1], colormap='viridis')
axes[1].set_title('Tempo Category Distribution Over Years', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Year')
axes[1].set_ylabel('Percentage (%)')
axes[1].legend(title='Tempo Category', bbox_to_anchor=(1.05, 1), loc='upper left')
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

## 7. Genre Analysis

In [None]:
# Extract and analyze genres
all_genres = []
for genres_str in df['genres'].dropna():
    if genres_str:
        genres_list = [g.strip() for g in genres_str.split(',') if g.strip()]
        all_genres.extend(genres_list)

# Count genre frequencies
from collections import Counter
genre_counts = Counter(all_genres)

# Get top genres
top_genres = dict(genre_counts.most_common(20))
print("🎵 Top 20 Genres in Your Library:")
for genre, count in top_genres.items():
    print(f"   {genre}: {count} tracks")

# Visualize top genres
plt.figure(figsize=(12, 8))
top_15 = dict(list(top_genres.items())[:15])
plt.barh(range(len(top_15)), list(top_15.values()), color='steelblue', alpha=0.7)
plt.yticks(range(len(top_15)), list(top_15.keys()))
plt.xlabel('Number of Tracks', fontsize=12)
plt.title('Top 15 Genres in Your Music Library', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

In [None]:
# Genre evolution over years
# Create a genre-year matrix
genre_year_data = []
for _, row in df.iterrows():
    if pd.notna(row['genres']) and row['genres']:
        genres_list = [g.strip() for g in row['genres'].split(',') if g.strip()]
        for genre in genres_list:
            genre_year_data.append({
                'year': row['year'],
                'genre': genre
            })

genre_year_df = pd.DataFrame(genre_year_data)
top_10_genres = [g for g, _ in genre_counts.most_common(10)]

# Filter to top genres and create pivot table
genre_year_filtered = genre_year_df[genre_year_df['genre'].isin(top_10_genres)]
genre_year_pivot = pd.crosstab(genre_year_filtered['year'], genre_year_filtered['genre'])

# Normalize by year
genre_year_pivot_norm = genre_year_pivot.div(genre_year_pivot.sum(axis=1), axis=0) * 100

# Visualize
plt.figure(figsize=(14, 8))
genre_year_pivot_norm.plot(kind='bar', stacked=True, colormap='tab20', ax=plt.gca())
plt.title('Top 10 Genres Evolution Over Years', fontsize=14, fontweight='bold')
plt.xlabel('Year')
plt.ylabel('Percentage (%)')
plt.legend(title='Genre', bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=9)
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

## 8. Clustering Analysis

Use K-Means clustering to identify music taste clusters based on audio features.
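
K-Means groups tracks by Euclidean distance, so a feature on a larger scale dominates unless the features are rescaled: tempo is measured in BPM (tens to hundreds), while energy and valence lie between 0 and 1. The pure-Python sketch below illustrates the per-feature z-score standardization that `StandardScaler` performs; the helper name `zscore` and the sample values are made up for illustration.

```python
from statistics import fmean, pstdev

def zscore(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # a constant feature carries no information
    return [(v - mean) / std for v in values]

tempo = [80.0, 100.0, 120.0, 140.0]  # illustrative BPM values
energy = [0.2, 0.4, 0.6, 0.8]        # illustrative 0-1 values

tempo_z = zscore(tempo)
energy_z = zscore(energy)
```

After scaling, the two features span the same range even though the raw tempo values are hundreds of times larger, so neither one dominates the distance computation.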

In [None]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Select features for clustering
feature_cols = ['danceability', 'energy', 'valence', 'tempo', 'acousticness', 
                'instrumentalness', 'liveness', 'speechiness']

# Prepare data
X = df[feature_cols].dropna()

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Determine optimal number of clusters using elbow method
inertias = []
K_range = range(2, 11)
for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)

# Plot elbow curve
plt.figure(figsize=(10, 6))
plt.plot(K_range, inertias, 'bo-', linewidth=2, markersize=8)
plt.xlabel('Number of Clusters (k)', fontsize=12)
plt.ylabel('Inertia', fontsize=12)
plt.title('Elbow Method for Optimal k', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("💡 Choose the number of clusters based on the elbow point above")

In [None]:
# Perform clustering (adjust n_clusters based on elbow plot)
n_clusters = 5  # Change this based on the elbow plot
kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_scaled)

# Add cluster labels to dataframe
df_clustered = df[feature_cols].dropna().copy()
df_clustered['cluster'] = clusters

# Analyze cluster characteristics
cluster_analysis = df_clustered.groupby('cluster')[feature_cols].mean()
print("📊 Cluster Characteristics (Mean Values):")
print(cluster_analysis.round(3))

# Visualize clusters using PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

plt.figure(figsize=(12, 8))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=clusters, cmap='viridis', 
                     alpha=0.6, s=50, edgecolors='black', linewidth=0.5)
plt.colorbar(scatter, label='Cluster')
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.2%} variance)', fontsize=12)
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.2%} variance)', fontsize=12)
plt.title(f'Music Clusters (K-Means, k={n_clusters})', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Cluster evolution over years
# Add cluster info back to main dataframe
df_with_clusters = df.copy()
df_with_clusters.loc[X.index, 'cluster'] = clusters

# Count clusters per year
cluster_by_year = pd.crosstab(df_with_clusters['year'], df_with_clusters['cluster'], normalize='index') * 100

plt.figure(figsize=(14, 8))
cluster_by_year.plot(kind='bar', stacked=True, colormap='Set3', ax=plt.gca())
plt.title('Cluster Distribution Over Years', fontsize=14, fontweight='bold')
plt.xlabel('Year')
plt.ylabel('Percentage (%)')
plt.legend(title='Cluster', labels=[f'Cluster {i}' for i in range(n_clusters)], 
          bbox_to_anchor=(1.05, 1), loc='upper left')
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

# Describe each cluster
print("\n📋 Cluster Descriptions:")
for cluster_id in range(n_clusters):
    cluster_data = cluster_analysis.loc[cluster_id]
    print(f"\n🎵 Cluster {cluster_id}:")
    print(f"   Tempo: {cluster_data['tempo']:.1f} BPM")
    print(f"   Energy: {cluster_data['energy']:.2f}")
    print(f"   Valence (Happiness): {cluster_data['valence']:.2f}")
    print(f"   Danceability: {cluster_data['danceability']:.2f}")
    print(f"   Acousticness: {cluster_data['acousticness']:.2f}")

## 9. Advanced Visualizations

In [None]:
# Create a comprehensive correlation heatmap
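# Use a separate name so the `audio_features` list fetched in section 3
# is not overwritten
audio_feature_cols = ['danceability', 'energy', 'valence', 'tempo', 'acousticness', 
                      'instrumentalness', 'liveness', 'speechiness']

correlation_matrix = df[audio_feature_cols].corr()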

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Audio Features Correlation Matrix', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# 2D scatter plot: Energy vs Valence (mood map)
plt.figure(figsize=(12, 8))
scatter = plt.scatter(df['energy'], df['valence'], 
                     c=df['year'], cmap='plasma', 
                     alpha=0.6, s=50, edgecolors='black', linewidth=0.3)
plt.colorbar(scatter, label='Year')
plt.xlabel('Energy', fontsize=12)
plt.ylabel('Valence (Happiness)', fontsize=12)
plt.title('Music Mood Map: Energy vs Valence (Colored by Year)', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)

# Add quadrant labels
# Draw the dividers at 0.6 to match the thresholds used in categorize_mood
plt.axhline(y=0.6, color='gray', linestyle='--', alpha=0.5)
plt.axvline(x=0.6, color='gray', linestyle='--', alpha=0.5)
plt.text(0.75, 0.75, 'Happy/Energetic', fontsize=10, alpha=0.7)
plt.text(0.25, 0.75, 'Happy/Calm', fontsize=10, alpha=0.7)
plt.text(0.75, 0.25, 'Sad/Energetic', fontsize=10, alpha=0.7)
plt.text(0.25, 0.25, 'Sad/Calm', fontsize=10, alpha=0.7)

plt.tight_layout()
plt.show()

In [None]:
# Tempo vs Danceability over years
fig, ax = plt.subplots(figsize=(12, 8))
scatter = ax.scatter(df['tempo'], df['danceability'], 
                    c=df['year'], cmap='viridis', 
                    alpha=0.6, s=50, edgecolors='black', linewidth=0.3)
plt.colorbar(scatter, label='Year', ax=ax)
ax.set_xlabel('Tempo (BPM)', fontsize=12)
ax.set_ylabel('Danceability', fontsize=12)
ax.set_title('Tempo vs Danceability (Colored by Year)', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 10. Overall Music Taste Summary

In [None]:
# Overall statistics
print("=" * 60)
print("📊 OVERALL MUSIC TASTE SUMMARY")
print("=" * 60)

print(f"\n🎵 Total Tracks Analyzed: {len(df)}")
print(f"📅 Years Covered: {df['year'].min()} - {df['year'].max()}")
print(f"🎨 Unique Genres: {len(set(all_genres))}")

print("\n🎼 Average Audio Features:")
print(f"   Tempo: {df['tempo'].mean():.1f} BPM")
print(f"   Energy: {df['energy'].mean():.2f}")
print(f"   Valence (Happiness): {df['valence'].mean():.2f}")
print(f"   Danceability: {df['danceability'].mean():.2f}")
print(f"   Acousticness: {df['acousticness'].mean():.2f}")

print("\n📈 Year-by-Year Trends:")
for year in sorted(df['year'].unique()):
    year_data = df[df['year'] == year]
    print(f"\n   {year}:")
    print(f"      Tracks: {len(year_data)}")
    print(f"      Avg Tempo: {year_data['tempo'].mean():.1f} BPM")
    print(f"      Avg Energy: {year_data['energy'].mean():.2f}")
    print(f"      Avg Valence: {year_data['valence'].mean():.2f}")
    print(f"      Top Mood: {year_data['mood_category'].mode().iloc[0] if len(year_data['mood_category'].mode()) > 0 else 'N/A'}")

print("\n🎯 Most Common Genres:")
for genre, count in list(genre_counts.most_common(5)):
    print(f"   {genre}: {count} tracks")

print("\n" + "=" * 60)

## 11. Export Data

Save the analyzed data for further analysis or visualization.

In [None]:
# Export to CSV
output_file = 'spotify_music_analysis.csv'
df.to_csv(output_file, index=False)
print(f"✅ Data exported to {output_file}")

# Export yearly statistics
yearly_stats_file = 'spotify_yearly_stats.csv'
yearly_stats.to_csv(yearly_stats_file)
print(f"✅ Yearly statistics exported to {yearly_stats_file}")

# Export cluster analysis
if 'cluster' in df_with_clusters.columns:
    cluster_file = 'spotify_cluster_analysis.csv'
    df_with_clusters[['name', 'artists', 'year', 'cluster'] + feature_cols].to_csv(cluster_file, index=False)
    print(f"✅ Cluster analysis exported to {cluster_file}")

## Notes

- **Data Collection**: This notebook collects data from recently played tracks, saved tracks, and playlists. For more comprehensive historical data, consider using Spotify's extended listening history if available.

- **API Rate Limits**: The Spotify API enforces rate limits. If you encounter rate-limit errors (HTTP 429), add delays or retries between requests, since the functions in this notebook otherwise skip the failed batch and move on.

- **Clustering**: The number of clusters (k) can be adjusted based on the elbow plot. Experiment with different values to find the best fit for your data.

- **Genres**: Genre information comes from artist data, so tracks may have multiple genres or no genres if the artist information is incomplete.

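- **Date Handling**: Each track is dated by `played_at` if present, otherwise `added_at`, otherwise `release_date`. When a track appears in several sources, the entry with the earliest `added_at` or `played_at` is kept. These dates may not reflect when you first listened to a track, and a `release_date` fallback reflects when the music was released rather than when you listened to it.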
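
For the rate-limit note above, one possible approach is to wrap API calls in a retry loop with exponential backoff. The helper `call_with_backoff` below is hypothetical and not part of spotipy. For simplicity it retries on any exception; a more careful version would inspect the error (for example, checking for an HTTP 429 status) before retrying.

```python
import time

def call_with_backoff(func, *args, max_retries=4, base_delay=1.0,
                      sleep=time.sleep, **kwargs):
    """Call func, retrying with exponentially growing delays when it raises."""
    for attempt in range(max_retries + 1):
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt == max_retries:
                raise  # give up after the final attempt
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            sleep(base_delay * (2 ** attempt))

# Usage inside the batching loops (assumes `sp` and `batch` from this notebook):
# features = call_with_backoff(sp.audio_features, batch)
```

The `sleep` parameter exists only so the delay can be replaced in tests; in normal use the default `time.sleep` applies.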

## Next Steps

1. **Extended History**: Request your extended listening history from Spotify for more comprehensive analysis
2. **Time-based Clustering**: Perform separate clustering for different time periods
3. **Recommendation System**: Use the clusters to build a music recommendation system
4. **Comparative Analysis**: Compare your taste with friends or global trends
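
As a starting point for step 1, the sketch below counts plays per year from extended listening history records. It assumes each record is a JSON object with a `ts` timestamp (ISO 8601, UTC) and an `ms_played` duration; verify these field names against your own export files before relying on them. The function name `plays_per_year` and the 30-second threshold are illustrative choices, not part of this notebook.

```python
import json
from collections import Counter
from datetime import datetime

def plays_per_year(records, min_ms_played=30_000):
    """Count plays per calendar year, skipping very short plays."""
    counts = Counter()
    for record in records:
        if record.get("ms_played", 0) < min_ms_played:
            continue
        # 'ts' is assumed to look like "2021-03-14T09:26:53Z"
        played = datetime.strptime(record["ts"], "%Y-%m-%dT%H:%M:%SZ")
        counts[played.year] += 1
    return dict(sorted(counts.items()))

# Inline sample standing in for the contents of an export file
sample = json.loads("""[
  {"ts": "2021-03-14T09:26:53Z", "ms_played": 201000},
  {"ts": "2021-07-01T18:00:00Z", "ms_played": 5000},
  {"ts": "2022-01-02T08:15:00Z", "ms_played": 185000}
]""")
yearly_plays = plays_per_year(sample)
```

With real data, the records would come from `json.load` on each export file instead of the inline sample.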