<h3 align="center">Machine Learning: Project 2</h3>

<b>Getting data from Spotify's API:</b> <br>
First, an app was created online Spotify's API website and the necessary client id and client secret were created in order to access my profile.

In [32]:
# Spotify API access
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from spotipy.oauth2 import SpotifyOAuth

# general use
import pandas as pd
import numpy as np
from time import sleep

# plotting
import plotly.express as px
import matplotlib.pyplot as plt

# pre-processing
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE

# Kmeans clustering
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score #calculates silhouette scores
from sklearn.metrics import silhouette_score, davies_bouldin_score # quality metrics

# Agglomerative 
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

# DBSCAN
from sklearn.cluster import DBSCAN
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

# Gaussian Mixture model
from sklearn.mixture import GaussianMixture

# Mean shift clustering
from sklearn.cluster import MeanShift, estimate_bandwidth

The next few lines create the spotify object that can then be used to get the data needed from my profile.

In [4]:
# Spotify Authentication - without signing in
cid = 'whatever your client ID is'
secret = 'whatever your client secret is'
client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager = client_credentials_manager)

The next few lines allows us to get data on a specific playlist that is public in Spotify.

In [5]:
playlist_link = "https://open.spotify.com/playlist/4cRlMUSYAavxsaFLbc6IxY?si=f21a419cbac147ad"
playlist_URI = playlist_link.split("/")[-1].split("?")[0]
track_uris = [x["track"]["uri"] for x in sp.playlist_tracks(playlist_URI)["items"]]

The next block of code will allow us to extract the track name, artist, genre, and popularity for each song in the selected playlist.

In [82]:
# lists of data for each track in the playlist
track_uris = []
track_names = []
artist_uris = []
artist_uris = []
artist_infos = []
artist_names = []
artist_pops = []
artist_gens = []
albums = []
track_pops = []
for track in sp.playlist_tracks(playlist_URI)["items"]:
    #URI
    track_uri = track["track"]["uri"]
    track_uris.append(track_uri)

    #Track name
    track_name = track["track"]["name"]
    track_names.append(track_name)
 
    #Main Artist
    artist_uri = track["track"]["artists"][0]["uri"]
    artist_info = sp.artist(artist_uri)
    artist_uris.append(artist_uri)
    artist_infos.append(artist_info)
    
    #Name, popularity, genre
    artist_name = track["track"]["artists"][0]["name"]
    artist_pop = artist_info["popularity"]
    artist_genres = artist_info["genres"]
    artist_names.append(artist_name)
    artist_pops.append(artist_pop)
    artist_gens.append(artist_genres)
    
    #Album
    album = track["track"]["album"]["name"]
    albums.append(album)
    
    #Popularity of the track
    track_pop = track["track"]["popularity"]
    track_pops.append(track_pop)

The next line shows an example of how we look at target features for one of the tracks

In [74]:
sp.audio_features(track_uris)[0]

{'danceability': 0.398,
 'energy': 0.939,
 'key': 9,
 'loudness': -2.865,
 'mode': 0,
 'speechiness': 0.0648,
 'acousticness': 0.00591,
 'instrumentalness': 0.000881,
 'liveness': 0.357,
 'valence': 0.235,
 'tempo': 92.027,
 'type': 'audio_features',
 'id': '5hheGdf1cb4rK0FNiedCfK',
 'uri': 'spotify:track:5hheGdf1cb4rK0FNiedCfK',
 'track_href': 'https://api.spotify.com/v1/tracks/5hheGdf1cb4rK0FNiedCfK',
 'analysis_url': 'https://api.spotify.com/v1/audio-analysis/5hheGdf1cb4rK0FNiedCfK',
 'duration_ms': 342821,
 'time_signature': 4}

In [43]:
features_list = []
for uri in track_uris:
    features = sp.audio_features(uri)
    features_list.append(features) 
print(len(features_list))

49


In [86]:
features_df = pd.DataFrame(features_list[0])

for song in features_list:
    df_song = pd.DataFrame(song)
    features_df = pd.concat([features_df, df_song], ignore_index=True)

# drop first row to remove repeated track
features_df = features_df.drop(df.index[0])

print(features_df)

    danceability  energy  key  loudness  mode  speechiness  acousticness  \
1          0.398   0.939    9    -2.865     0       0.0648      0.005910   
2          0.668   0.921    7    -3.727     1       0.0439      0.049200   
3          0.589   0.797    1    -4.213     1       0.1730      0.000673   
4          0.381   0.984    1    -2.420     1       0.0638      0.000025   
5          0.608   0.792    9    -5.721     1       0.0355      0.000674   
6          0.839   0.560   11    -8.339     1       0.3310      0.006050   
7          0.407   0.980    7    -1.580     1       0.1070      0.001740   
8          0.560   0.925    6    -3.821     0       0.0871      0.000197   
9          0.569   0.745    2    -3.854     1       0.1820      0.005750   
10         0.629   0.889   11    -3.573     1       0.0984      0.024300   
11         0.551   0.953    0    -4.695     1       0.1700      0.010600   
12         0.638   0.961    7    -1.726     1       0.3000      0.014400   
13         0

<b> Data Pre-Processing and Testing</b> <br>
<b>Step 1:</b> Select target Features for clustering, excluding data not relevant for grouping the data based on the audio features. I also removed the key (for now), and mode (not sure what this is), and the time signature since it is almost always in 4/4 for typical songs.

In [87]:
target_features = features_df.loc[:,['danceability', 'energy', 'loudness', 'speechiness', 'acousticness',
                              'instrumentalness', 'liveness', 'valence', 'tempo',]]
print(target_features)

    danceability  energy  loudness  speechiness  acousticness  \
1          0.398   0.939    -2.865       0.0648      0.005910   
2          0.668   0.921    -3.727       0.0439      0.049200   
3          0.589   0.797    -4.213       0.1730      0.000673   
4          0.381   0.984    -2.420       0.0638      0.000025   
5          0.608   0.792    -5.721       0.0355      0.000674   
6          0.839   0.560    -8.339       0.3310      0.006050   
7          0.407   0.980    -1.580       0.1070      0.001740   
8          0.560   0.925    -3.821       0.0871      0.000197   
9          0.569   0.745    -3.854       0.1820      0.005750   
10         0.629   0.889    -3.573       0.0984      0.024300   
11         0.551   0.953    -4.695       0.1700      0.010600   
12         0.638   0.961    -1.726       0.3000      0.014400   
13         0.555   0.883    -3.426       0.0341      0.001140   
14         0.646   0.851    -4.220       0.0662      0.006180   
15         0.728   0.783 

<b>Step 2:</b> Scale the data so that each feature has a mean of 0 and standard deviation of 1. This ensures each feature is weighted equally when clustering. I also apply a principal component analysis and compared this with the original clustering for kmeans.

In [94]:
# Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(target_features)

# Apply PCA for dimensionality reduction to 3 dimensions
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

For K-Means clustering, we need to input the number of clusters for analysis. A decent way of determining the best number of clusters is using a silhouette analysis, which calcualtes how good clustering is for different initializations. The higher this number is, the less interpreable the results will be (especially if songs that are the same genre/artist are being grouped in different clusters) - so I am hoping one of the lower initializations has a better score.

In [95]:
# performing silhouette analysis
try_k = [2, 3, 5, 7, 9, 12] # list of clusters to try

silhouette_scores_raw = []
silhouette_scores_pca = []

# PCA data
for k in try_k:
    # perform kmeans
    # kmeans ++ supposedly speeds up convergence
    km = KMeans(n_clusters=k ,init='k-means++', tol = 0.001)
    km.fit(X_pca)
    labels = km.predict(X_pca)
        
    #calculate score
    sc = silhouette_score(X_pca, labels, metric = 'euclidean')
    silhouette_scores_pca.append(sc)
    
# raw data
for k in try_k:
    # perform kmeans
    # kmeans ++ supposedly speeds up convergence
    km = KMeans(n_clusters=k ,init='k-means++', tol = 0.001)
    km.fit(target_features)
    labels = km.predict(target_features)
        
    #calculate score
    sc = silhouette_score(target_features, labels, metric = 'euclidean')
    silhouette_scores_raw.append(sc)
    
print("Results from raw features:\n",silhouette_scores_raw)
print("Results from PCA data:\n",silhouette_scores_pca)

Results from raw features:
 [0.5018202481247832, 0.5679635951106933, 0.524588723910124, 0.44260768434174125, 0.4875594284870414, 0.47281059770330697]
Results from PCA data:
 [0.6152900151592087, 0.26828476645875476, 0.3495951628570357, 0.32947905346999345, 0.2774943134868466, 0.29547444751813634]


For this example, the number of clusters is best for k = 2, which makes sense because the playlist i got this data from is almost exclusively rock/metal music. For the raw data, the best initialization is for 3 clusters, but the silhouette score is worse for that one.

In [99]:
X_pca = pd.DataFrame(X_pca)
# performing clustering for k = 2 and plotting results
km = KMeans(n_clusters = 2, init = 'k-means++', tol = 0.001)
km.fit(X_pca)
labels = km.predict(X_pca)

# Add cluster labels to DataFrame
results = X_pca.copy()
results['cluster_kmeans'] = km.labels_

# putting results into df and getting uri/song name
results['name'] = track_names
results['uri'] = track_uris

# checking clusters
print(results[['cluster_kmeans', 'name']])

    cluster_kmeans                                               name
0                0                                     The Devil in I
1                0                            Supermassive Black Hole
2                0                                         Shut Me Up
3                0                                     American Idiot
4                0                                         masquerade
5                0                                         Underworld
6                0                                             Custer
7                0                                            Witness
8                0                              Never Wanted To Dance
9                0                                         Lights Out
10               0                                        I'm Yer Dad
11               0                                          Suffocate
12               0                                             Psycho
13               0  

So clearly this is not telling us much (since this is my angsty playlist), but it is good that most of these songs were put into the same cluster because they are all very similar in genre and overall audio qualities.

The next thing I want to do is perform clustering using a few other methods and see how these labels compare to the basic K-Means clustering. Once I have a solid framework for the smaller playlist selected, then I can move onto clustering a large playlist (my liked songs).

In [None]:
# making a function for calculating the two quality metrics 
def evaluate_clustering(df, labels):
    silhouette = silhouette_score(df, labels)
    davies_bouldin = davies_bouldin_score(df, labels)
    print(f"Silhouette Score: {silhouette}")
    print(f"Davies-Bouldin Index: {davies_bouldin}")

Silhouette Score is on a scale from (-1,1) where 1 is better clustering so we want a higher number here, while a lower number is better for the davies bouldin score.

In [106]:
# agglomerative
# Perform hierarchical clustering
clustering = AgglomerativeClustering(n_clusters=3).fit(X_pca)

# Add cluster labels to DataFrame
results['cluster_agglo'] = clustering.labels_

In [103]:
# DBSCAN
# Perform DBSCAN clustering
eps = 0.5
min_samples = 2
clustering = DBSCAN(eps=eps, min_samples=min_samples).fit(X_pca)

# Add cluster labels to DataFrame
results['cluster_dbscan'] = clustering.labels_

In [104]:
# Gaussian Mixture model
# Perform GMM clustering
gmm = GaussianMixture(n_components=2).fit(X_pca)

# Predict cluster labels
labels = gmm.predict(X_pca)

# Add cluster labels to DataFrame
results['cluster_gauss'] = labels

Getting all of my liked songs instead of just a few songs from a playlist. Apparently there is a query limit so I have to send multiple requests for this and then append that to the total data.

In [5]:
# lists of data for each track in the playlist
track_uris = []
track_names = []
artist_uris = []
artist_uris = []
artist_infos = []
artist_names = []
artist_pops = []
artist_gens = []
albums = []
track_pops = []

features_list = []

# get current playlist
playlist_link = "https://open.spotify.com/playlist/4O2iqhl3wxr5yWlwMquosG?si=0de84dd940d24213"
playlist_URI = playlist_link.split("/")[-1].split("?")[0]

for track in sp.playlist_tracks(playlist_URI)["items"]:
    #URI
    track_uri = track["track"]["uri"]
    track_uris.append(track_uri)

    #Track name
    track_name = track["track"]["name"]
    track_names.append(track_name)
 
    #Main Artist
    artist_uri = track["track"]["artists"][0]["uri"]
    artist_info = sp.artist(artist_uri)
    artist_uris.append(artist_uri)
    artist_infos.append(artist_info)
    
    #Name, popularity, genre
    artist_name = track["track"]["artists"][0]["name"]
    artist_pop = artist_info["popularity"]
    artist_genres = artist_info["genres"]
    artist_names.append(artist_name)
    artist_pops.append(artist_pop)
    artist_gens.append(artist_genres)
    
    #Album
    album = track["track"]["album"]["name"]
    albums.append(album)
    
    #Popularity of the track
    track_pop = track["track"]["popularity"]
    track_pops.append(track_pop)


# get current playlist
playlist_link = "https://open.spotify.com/playlist/4OXaKynRScKM8E8Xg8BKtF?si=dc74c53ba57a4c37"
playlist_URI = playlist_link.split("/")[-1].split("?")[0]

for track in sp.playlist_tracks(playlist_URI)["items"]:
    #URI
    track_uri = track["track"]["uri"]
    track_uris.append(track_uri)

    #Track name
    track_name = track["track"]["name"]
    track_names.append(track_name)
 
    #Main Artist
    artist_uri = track["track"]["artists"][0]["uri"]
    artist_info = sp.artist(artist_uri)
    artist_uris.append(artist_uri)
    artist_infos.append(artist_info)
    
    #Name, popularity, genre
    artist_name = track["track"]["artists"][0]["name"]
    artist_pop = artist_info["popularity"]
    artist_genres = artist_info["genres"]
    artist_names.append(artist_name)
    artist_pops.append(artist_pop)
    artist_gens.append(artist_genres)
    
    #Album
    album = track["track"]["album"]["name"]
    albums.append(album)
    
    #Popularity of the track
    track_pop = track["track"]["popularity"]
    track_pops.append(track_pop) 
      
# get current playlist
playlist_link = "https://open.spotify.com/playlist/0U8NMOvqpDlPRErY4AJdLm?si=998b7ac041754749"
playlist_URI = playlist_link.split("/")[-1].split("?")[0]

for track in sp.playlist_tracks(playlist_URI)["items"]:
    #URI
    track_uri = track["track"]["uri"]
    track_uris.append(track_uri)

    #Track name
    track_name = track["track"]["name"]
    track_names.append(track_name)
 
    #Main Artist
    artist_uri = track["track"]["artists"][0]["uri"]
    artist_info = sp.artist(artist_uri)
    artist_uris.append(artist_uri)
    artist_infos.append(artist_info)
    
    #Name, popularity, genre
    artist_name = track["track"]["artists"][0]["name"]
    artist_pop = artist_info["popularity"]
    artist_genres = artist_info["genres"]
    artist_names.append(artist_name)
    artist_pops.append(artist_pop)
    artist_gens.append(artist_genres)
    
    #Album
    album = track["track"]["album"]["name"]
    albums.append(album)
    
    #Popularity of the track
    track_pop = track["track"]["popularity"]
    track_pops.append(track_pop) 
      
# get current playlist
playlist_link = "https://open.spotify.com/playlist/0H1u46JdIlP5q8l6G9oGNu?si=ab3f77e8dcc24fe6"
playlist_URI = playlist_link.split("/")[-1].split("?")[0]

for track in sp.playlist_tracks(playlist_URI)["items"]:
    #URI
    track_uri = track["track"]["uri"]
    track_uris.append(track_uri)

    #Track name
    track_name = track["track"]["name"]
    track_names.append(track_name)
 
    #Main Artist
    artist_uri = track["track"]["artists"][0]["uri"]
    artist_info = sp.artist(artist_uri)
    artist_uris.append(artist_uri)
    artist_infos.append(artist_info)
    
    #Name, popularity, genre
    artist_name = track["track"]["artists"][0]["name"]
    artist_pop = artist_info["popularity"]
    artist_genres = artist_info["genres"]
    artist_names.append(artist_name)
    artist_pops.append(artist_pop)
    artist_gens.append(artist_genres)
    
    #Album
    album = track["track"]["album"]["name"]
    albums.append(album)
    
    #Popularity of the track
    track_pop = track["track"]["popularity"]
    track_pops.append(track_pop) 
    
    
# get current playlist
playlist_link = "https://open.spotify.com/playlist/3sm17yisPaZmlqCVC5DHbq?si=c00e34a6c9024da1"
playlist_URI = playlist_link.split("/")[-1].split("?")[0]

for track in sp.playlist_tracks(playlist_URI)["items"]:
    #URI
    track_uri = track["track"]["uri"]
    track_uris.append(track_uri)

    #Track name
    track_name = track["track"]["name"]
    track_names.append(track_name)
 
    #Main Artist
    artist_uri = track["track"]["artists"][0]["uri"]
    artist_info = sp.artist(artist_uri)
    artist_uris.append(artist_uri)
    artist_infos.append(artist_info)
    
    #Name, popularity, genre
    artist_name = track["track"]["artists"][0]["name"]
    artist_pop = artist_info["popularity"]
    artist_genres = artist_info["genres"]
    artist_names.append(artist_name)
    artist_pops.append(artist_pop)
    artist_gens.append(artist_genres)
    
    #Album
    album = track["track"]["album"]["name"]
    albums.append(album)
    
    #Popularity of the track
    track_pop = track["track"]["popularity"]
    track_pops.append(track_pop) 
    
    
# get current playlist
playlist_link = "https://open.spotify.com/playlist/6KoxtnKxyNYdEcwzNde8xl?si=1fdd9857a14d4293"
playlist_URI = playlist_link.split("/")[-1].split("?")[0]

for track in sp.playlist_tracks(playlist_URI)["items"]:
    #URI
    track_uri = track["track"]["uri"]
    track_uris.append(track_uri)

    #Track name
    track_name = track["track"]["name"]
    track_names.append(track_name)
 
    #Main Artist
    artist_uri = track["track"]["artists"][0]["uri"]
    artist_info = sp.artist(artist_uri)
    artist_uris.append(artist_uri)
    artist_infos.append(artist_info)
    
    #Name, popularity, genre
    artist_name = track["track"]["artists"][0]["name"]
    artist_pop = artist_info["popularity"]
    artist_genres = artist_info["genres"]
    artist_names.append(artist_name)
    artist_pops.append(artist_pop)
    artist_gens.append(artist_genres)
    
    #Album
    album = track["track"]["album"]["name"]
    albums.append(album)
    
    #Popularity of the track
    track_pop = track["track"]["popularity"]
    track_pops.append(track_pop) 
    
    
# get current playlist
playlist_link = "https://open.spotify.com/playlist/24YFWROXbYvfW9KnmHebpP?si=21fd4d8ab6084330"
playlist_URI = playlist_link.split("/")[-1].split("?")[0]

for track in sp.playlist_tracks(playlist_URI)["items"]:
    #URI
    track_uri = track["track"]["uri"]
    track_uris.append(track_uri)

    #Track name
    track_name = track["track"]["name"]
    track_names.append(track_name)
 
    #Main Artist
    artist_uri = track["track"]["artists"][0]["uri"]
    artist_info = sp.artist(artist_uri)
    artist_uris.append(artist_uri)
    artist_infos.append(artist_info)
    
    #Name, popularity, genre
    artist_name = track["track"]["artists"][0]["name"]
    artist_pop = artist_info["popularity"]
    artist_genres = artist_info["genres"]
    artist_names.append(artist_name)
    artist_pops.append(artist_pop)
    artist_gens.append(artist_genres)
    
    #Album
    album = track["track"]["album"]["name"]
    albums.append(album)
    
    #Popularity of the track
    track_pop = track["track"]["popularity"]
    track_pops.append(track_pop) 
       
# get current playlist
playlist_link = "https://open.spotify.com/playlist/5jtHjPMtFTpSHgq0Q1t1Ab?si=165b91e716e84c54"
playlist_URI = playlist_link.split("/")[-1].split("?")[0]

for track in sp.playlist_tracks(playlist_URI)["items"]:
    #URI
    track_uri = track["track"]["uri"]
    track_uris.append(track_uri)

    #Track name
    track_name = track["track"]["name"]
    track_names.append(track_name)
 
    #Main Artist
    artist_uri = track["track"]["artists"][0]["uri"]
    artist_info = sp.artist(artist_uri)
    artist_uris.append(artist_uri)
    artist_infos.append(artist_info)
    
    #Name, popularity, genre
    artist_name = track["track"]["artists"][0]["name"]
    artist_pop = artist_info["popularity"]
    artist_genres = artist_info["genres"]
    artist_names.append(artist_name)
    artist_pops.append(artist_pop)
    artist_gens.append(artist_genres)
    
    #Album
    album = track["track"]["album"]["name"]
    albums.append(album)
    
    #Popularity of the track
    track_pop = track["track"]["popularity"]
    track_pops.append(track_pop) 
       
# get current playlist
playlist_link = "https://open.spotify.com/playlist/2Dwv0orJRDNlfUA5LmHdBh?si=8ae6f83095c14601"
playlist_URI = playlist_link.split("/")[-1].split("?")[0]

for track in sp.playlist_tracks(playlist_URI)["items"]:
    #URI
    track_uri = track["track"]["uri"]
    track_uris.append(track_uri)

    #Track name
    track_name = track["track"]["name"]
    track_names.append(track_name)
 
    #Main Artist
    artist_uri = track["track"]["artists"][0]["uri"]
    artist_info = sp.artist(artist_uri)
    artist_uris.append(artist_uri)
    artist_infos.append(artist_info)
    
    #Name, popularity, genre
    artist_name = track["track"]["artists"][0]["name"]
    artist_pop = artist_info["popularity"]
    artist_genres = artist_info["genres"]
    artist_names.append(artist_name)
    artist_pops.append(artist_pop)
    artist_gens.append(artist_genres)
    
    #Album
    album = track["track"]["album"]["name"]
    albums.append(album)
    
    #Popularity of the track
    track_pop = track["track"]["popularity"]
    track_pops.append(track_pop) 
    
print(len(track_uris))

886


In [6]:
# creating a new app to work around rate limit
# re authenticate with a new app??
cid = '2eb4ebd1c0b2479ebcf4b88186c8dccf'
secret = '5335487540bb4f1abbd5f9e862801676'
client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager = client_credentials_manager)

# getting all audio features
i = 1
for uri in track_uris:
    print(uri)
    if i == 100:
        sleep(30)
        i = 1
        features = sp.audio_features(uri)
        features_list.append(features) 
    else:
        features = sp.audio_features(uri)
        features_list.append(features) 
        i +=1

# creating df of features
features_df = pd.DataFrame(features_list[0])
for song in features_list:
    df_song = pd.DataFrame(song)
    features_df = pd.concat([features_df, df_song], ignore_index=True)
# drop first row to remove repeated track
features_df = features_df.drop(df.index[0])
print(len(features_df))

spotify:track:06PQwEgNVggyvZdzHAR9VM
spotify:track:3coeo9H305LiuDbXXn9ANe
spotify:track:6rUp7v3l8yC4TKxAAR5Bmx
spotify:track:6J4VOoKhRZFNeWkVH0WTzH
spotify:track:0rG5cHbTGTytEO5pPRqacV
spotify:track:6s1dhk7s70vZHtERfJUDMz
spotify:track:6CoVyYMT1M7S9DBRYp8HJL
spotify:track:42slPi7T4oM1VP2JnP6v4o
spotify:track:5d4HcdMt7gVY0gBu1Sgb5G
spotify:track:1I5zEkyDQTJ6TeDxyXBSQ5
spotify:track:4eml8c7ZKYbMPFNgtfiwou
spotify:track:2ViFO3hmIkNUsvoUebYTEm
spotify:track:5UPHeuDP0AnG830Yf3bJJD
spotify:track:3qhlB30KknSejmIvZZLjOD
spotify:track:2GxrNKugF82CnoRFbQfzPf
spotify:track:1VN2vWSkSmMKOhxr8lHzSx
spotify:track:6mSyEZWYvYpEmRoIsTDe8J
spotify:track:1Dr5JexwA15wmKe7Y7maA9
spotify:track:5rKTGQ2Q1wndZnX7km8WYu
spotify:track:0Z7nGFVCLfixWctgePsRk9
spotify:track:1158ckiB5S4cpsdYHDB9IF
spotify:track:7cioKB5CHVzk09SOtTyn0T
spotify:track:47k7FCxk7ylTwKCnJ3QTVc
spotify:track:7lSdUlVf8k6kxklKkskb1m
spotify:track:3COKv7qvzBy9KZEr720y3B
spotify:track:6FjFcwc5GE8ONvInstIM84
spotify:track:51ZQ1vr10ffzbwIjDCwqm4
s

NameError: name 'df' is not defined

In [11]:
# saving to a csv just in case....
#features_df.to_csv('C:/Users/vince/Desktop/spotAPI_data.csv', index=False)
URis = features_df['uri'].tolist()
print(features_df)

     danceability  energy  key  loudness  mode  speechiness  acousticness  \
0           0.545   0.983   10    -5.154     1       0.1940      0.002500   
1           0.545   0.983   10    -5.154     1       0.1940      0.002500   
2           0.440   0.993    1    -1.115     0       0.1410      0.000176   
3           0.498   0.830    6    -5.157     0       0.0421      0.004610   
4           0.450   0.892    8    -5.099     1       0.0466      0.000003   
..            ...     ...  ...       ...   ...          ...           ...   
882         0.635   0.382    9   -10.951     0       0.0518      0.847000   
883         0.479   0.553    5    -9.753     1       0.2090      0.274000   
884         0.455   0.444    3   -11.493     1       0.3280      0.547000   
885         0.816   0.726    5    -3.998     0       0.1290      0.099000   
886         0.820   0.696    7    -4.918     0       0.1820      0.119000   

     instrumentalness  liveness  valence    tempo            type  \
0     

In [12]:
spot_uris = features_df['uri'].tolist()

names = []
artists = []
for uri in spot_uris:
    # Extract track ID from URI
    track_id = uri.split(':')[-1]
    
    # Retrieve track information
    track_info = sp.track(track_id)
    
    # Extract track name from track information
    track_name = track_info['name']
    track_artist = track_info['artists']
    names.append(track_name)
    artists.append(track_artist)
    
features_df['song_name'] = names
features_df['artists'] = artists

<h3 align="center">Method 1: K-Means Clustering</h3>

In [14]:
# select target features
target_features = features_df.loc[:,['danceability', 'energy', 'loudness', 'speechiness', 'acousticness',
                              'instrumentalness', 'liveness', 'valence', 'tempo',]]

target_features_2 = features_df.loc[:,['danceability','energy', 'tempo', 'speechiness', 'liveness', 'valence', 'acousticness']]
# performing silhouette analysis
try_k = [2, 3, 5, 7, 9, 12, 17] # list of clusters to try

silhouette_scores = []

# PCA data
for k in try_k:
    # perform kmeans
    # kmeans ++ supposedly speeds up convergence
    km = KMeans(n_clusters=k ,init='k-means++', tol = 0.00001)
    km.fit(target_features_2)
    labels = km.predict(target_features_2)
        
    #calculate score
    sc = silhouette_score(target_features_2, labels, metric = 'euclidean')
    silhouette_scores.append(sc)
    
print("Results from silhouette analysis:\n",silhouette_scores)

Results from silhouette analysis:
 [0.5179619230429265, 0.5756797023966441, 0.5332665866449482, 0.5307193963730487, 0.5251819273644387, 0.5335332900171172, 0.5261344817799591]


Another issue with clustering in higher dimensions is feature selection. the curse of dimensionality can make clustering perform worse for higher dimensions of data, so determining which features are most prominenet in the data can help reduce the number of dimensions being used for clustering. I will perform clustering for many combinations of different variables and see how this changes the silhouette scores calculated.

In [15]:
fixed_feature = 'danceability'  # Fixed feature name
selected_feature_indices = [1,2,3,4,5,6,7,8,9,10,16,17]  # Column indices of selected features

# number of clusters to try
num_k = [2,3,4,6,8,9,12]

for index in selected_feature_indices:
    # get current variable from dataframe
    X = features_df[[fixed_feature, features_df.columns[index]]]
    for k in num_k:
        # perform kmeans
        silhouette_scores_test = []
        # kmeans ++ supposedly speeds up convergence
        km = KMeans(n_clusters=k ,init='k-means++', tol = 0.001)
        km.fit(X)
        labels = km.predict(X)

        #calculate score
        sc = silhouette_score(X, labels, metric = 'euclidean')
        silhouette_scores_test.append(sc)
        
        # view results
        print("Results for feature:",features_df.columns[index])
        print("current cluster number:",k)
        print("silhouette score:",sc)  

Results for feature: energy
current cluster number: 2
silhouette score: 0.47053891384337726
Results for feature: energy
current cluster number: 3
silhouette score: 0.39113475735344994
Results for feature: energy
current cluster number: 4
silhouette score: 0.35978979658439253
Results for feature: energy
current cluster number: 6
silhouette score: 0.3563436986598031
Results for feature: energy
current cluster number: 8
silhouette score: 0.35520328460541795
Results for feature: energy
current cluster number: 9
silhouette score: 0.34431644817658563
Results for feature: energy
current cluster number: 12
silhouette score: 0.3408933319441528
Results for feature: key
current cluster number: 2
silhouette score: 0.6365648433900102
Results for feature: key
current cluster number: 3
silhouette score: 0.6373618612820973
Results for feature: key
current cluster number: 4
silhouette score: 0.6392076806904978
Results for feature: key
current cluster number: 6
silhouette score: 0.6262705964510645
Resul

From this, we can see a few features that seem to be most important at different cluster number initializations. Energy, loudness, mode, spechiness, acousticness, and instrumentalness all have highest scores for k = 2. While key has the best score at k = 12 since there are 12 different key signatures across this sample of tracks. Time signature is best represneted by 4 clusters since there are 4 different time signatures across the songs in this sample.

In [16]:
# testing the top features together 
test_features = features_df.loc[:,['danceability', 'energy', 'loudness','mode','speechiness', 'acousticness',
                              'instrumentalness','key', 'time_signature']]

# number of clusters to try (will probably be 2 still)
num_k = [2,3,4,6,8,9,12]

# performing kmeans iteratively
for k in num_k:
        # perform kmeans
        # kmeans ++ supposedly speeds up convergence
        km = KMeans(n_clusters=k ,init='k-means++', tol = 0.001)
        km.fit(test_features)
        labels = km.predict(test_features)

        #calculate score
        sc = silhouette_score(test_features, labels, metric = 'euclidean')

        # view results
        print("current cluster number:",k)
        print("silhouette score:",sc)  
        
# although k = 2 has the best clustering score, it does not show us 
# anything meaningful about the data
km = KMeans(n_clusters=8 ,init='k-means++', tol = 0.00001)
km.fit(test_features)
labels = km.predict(test_features)  
test_features['labels'] = labels
test_features['song_name'] = features_df['song_name']

current cluster number: 2
silhouette score: 0.48661550430831174
current cluster number: 3
silhouette score: 0.412341994041607
current cluster number: 4
silhouette score: 0.4176134361818712
current cluster number: 6
silhouette score: 0.35142199687399783
current cluster number: 8
silhouette score: 0.36292339193440015
current cluster number: 9
silhouette score: 0.35005253518356383
current cluster number: 12
silhouette score: 0.3385738049618452


In [17]:
# looking at labels
pd.set_option('display.max_rows', None)
print(test_features[test_features['labels'] == 3][['labels', 'song_name']])
print(test_features[test_features['labels'] == 5][['labels', 'song_name']])
kmeans_results_8 = test_features.copy()

     labels                                          song_name
7         3                                            Big Sky
24        3                                             Heroin
30        3                                  Violets for Roses
31        3                                 Black Bathing Suit
35        3                                 Fuck it I love you
38        3                                     Cherry Blossom
40        3                                 Nectar Of The Gods
41        3                               Are You Gone Already
50        3               Slumber Party (feat. Princess Nokia)
62        3                                          Van Vogue
67        3        Gimme! Gimme! Gimme! (A Man After Midnight)
71        3                                            Runaway
78        3                        Take Me Home, Country Roads
79        3                                 Breaking Up Slowly
85        3                                           Y

Spoiler alert - Most of them are failry incoherent; however, label 3 seems to identify classical music pretty good. label 5 also has a bunch of classical msuic and more sad/slow songs (a lot of Billie and Lana here). These two labels had very similar trends in the songs clustered here. Another interesting thing - Lana was found in every single label which is not shocking since I have here entire discography in my liked songs and her music is typically alternative, making the audio features quite variable.

In [18]:
# testing the top features together 
test_features = features_df.loc[:,['danceability', 'energy', 'loudness','mode','speechiness', 'acousticness',
                              'instrumentalness','key', 'time_signature']]

# although k = 2 has the best clustering score, it does not show us 
# anything meaningful about the data
km = KMeans(n_clusters=4 ,init='k-means++', tol = 0.00001)
km.fit(test_features)
labels = km.predict(test_features)  
test_features['labels'] = labels
test_features['song_name'] = features_df['song_name']

# checking results
print(test_features[test_features['labels'] == 0][['labels', 'song_name']])
print(test_features[test_features['labels'] == 1][['labels', 'song_name']])
print(test_features[test_features['labels'] == 2][['labels', 'song_name']])
print(test_features[test_features['labels'] == 3][['labels', 'song_name']])
print(test_features[test_features['labels'] == 4][['labels', 'song_name']])

# not much better - classical music is clustered the best
kmeans_results_4 = test_features.copy()

     labels                                          song_name
5         0                       Nothing Fades Like the Light
7         0                                            Big Sky
9         0     It's Always Good to Tell Someone You Love Them
10        0                                 Buddy's Rendezvous
11        0                                        Powder Blue
29        0                                          Honeymoon
31        0                                 Black Bathing Suit
32        0        Snow On The Beach (feat. More Lana Del Rey)
35        0                                 Fuck it I love you
36        0                    Summertime The Gershwin Version
37        0                               Dark But Just A Game
38        0                                     Cherry Blossom
41        0                               Are You Gone Already
57        0                           Happiness is a butterfly
78        0                        Take Me Home, Countr

<h3 align="center">Method 2: Density-Based Spatial Clustering of Applications with Noise (DBSCAN)</h3>

In [19]:
# DBSCAN
dbscan = DBSCAN()
test_features = features_df.loc[:,['danceability', 'energy', 'loudness','mode','speechiness', 'acousticness',
                              'instrumentalness','key', 'time_signature']]

# Instantiate DBSCAN with specified parameters
eps = 1.5  # Maximum distance between two samples for them to be considered as in the same cluster
min_samples = 5  # min number of songs to form a cluster
dbscan = DBSCAN(eps=eps, min_samples=min_samples)

# Fit DBSCAN to the data
dbscan.fit(test_features)

# Get cluster labels
labels = dbscan.labels_
test_features['labels'] = labels
test_features['song_name'] = features_df['song_name']
dbscan_results = test_features.copy()

So this method did not work great either. Even after adjusting the parameters for eps and min_samples, the data is not dense enough around clusters in order to identify unique regions. eps controls the maximum distance between two samples in order for those points to be counted in the same cluster and min_samples is the minimum number of points to form a unqiue clsuter

Most of the data gets classified as -1, which means it does not belong to a unqiue cluster in the data. This is likely because the data is so spread out and random. If I only downloaded music of a certain type (country, metal, etc...), then regions of higher density might be more apparent.

<h3 align="center">Method 3: Agglomerative Clustering</h3>

In [20]:
# agglomerative
test_features = features_df.loc[:,['danceability', 'energy', 'loudness','mode','speechiness', 'acousticness',
                              'instrumentalness','key', 'time_signature']]

# Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(test_features)

# Apply PCA for dimensionality reduction to 3 dimensions
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

silhouette_scores = []
for n_clusters in range(2, 20):  # Try different k
    # Fit agglomerative clustering model
    model = AgglomerativeClustering(n_clusters=n_clusters)
    labels = model.fit_predict(X_pca)
    
    # Compute silhouette score
    silhouette_avg = silhouette_score(X_pca, labels)
    silhouette_scores.append(silhouette_avg)

# trying k = 4, 12, 2, 8 (4 was the best overall)
model = AgglomerativeClustering(n_clusters=4)
labels = model.fit_predict(X_pca)

test_features['labels'] = labels
test_features['song_name'] = features_df['song_name']

# looking at labels
i = 0
for i in range(max(labels)):
    print(test_features[test_features['labels'] == i][['labels', 'song_name']])
    i +=1
    
agglo_results = test_features.copy()

     labels                                          song_name
0         0                                             Smokey
1         0                                             Smokey
2         0                                            Akudama
3         0                        I Hate Everything About You
4         0                                For the Glory of...
6         0                                      It Don’t Fade
14        0                                   End of Beginning
15        0                         i like the way you kiss me
16        0                      Seven Wonders - 2017 Remaster
18        0                                         Past Lives
20        0                                     TEXAS HOLD 'EM
21        0                           My Own Summer (Shove It)
23        0                                     Brand New City
26        0                            The West Side Freestyle
27        0           we can't be friends (wait for you

This is still very random, k = 2 is the best silhouette score but does not capture any features of the data. Again, classical music and sad/slow music is usually grouped better than other types of songs in label 1, while the other two clusters are fairly random.

<h3 align="center">Method 4: Gaussian Mixture Model</h3>

In [21]:
test_features = features_df.loc[:,['danceability', 'energy', 'loudness','mode','speechiness', 'acousticness',
                              'instrumentalness','key', 'time_signature']]

# Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(target_features)

# Apply PCA for dimensionality reduction to 3 dimensions
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

# Define range of clusters to try
min_components = 2
max_components = 10
n_components_range = range(min_components, max_components + 1)

# Initialize lists to store BIC and AIC values
bic_values = []
aic_values = []

# Compute BIC and AIC for different numbers of components
for n_components in n_components_range:
    gmm = GaussianMixture(n_components=n_components)
    
    # Fit GMM to the data
    gmm.fit(X_pca)
    
    # Compute BIC and AIC
    bic = gmm.bic(X_pca)
    aic = gmm.aic(X_pca)
    
    # Append BIC and AIC values to lists
    bic_values.append(bic)
    aic_values.append(aic)

print("Resulting AIC/BIC Scores for GMM:")
print(aic_values)
print(bic_values)

Resulting AIC/BIC Scores for GMM:
[8340.584046375503, 8066.28961482814, 7987.075155476151, 7990.4939072715815, 7958.1190253825625, 7965.658975377883, 7967.013192774182, 7951.886105801923, 7957.734689885252]
[8431.553101039384, 8205.137119315117, 8173.8011097862245, 8225.09831140475, 8240.601879338828, 8296.020279157243, 8345.252946376639, 8378.004309227475, 8431.7313431339]


The AIC/BIC scores can be used for GMM for determining the goodness of fitting for data, while also accounting for the number of parameters initialized. BIC is a criterion for model selection among a finite set of models. It is derived from Bayesian probability theory and penalizes models based on both goodness of fit and the number of parameters in the model. AIC is another criterion for model selection that also penalizes models based on both goodness of fit and complexity. 

A better fit to the data will have a smaller value of AIC and BIC, so we can choose the best number of clusters by looking at where these two values are as small as possible. From the analysis output above, the best AIC score is at 10 clusters; however, the BIC score is high for this number of clusters. The lowest BIC score occurs at 4 clusters and this is the second lowest AIC value as well - so lets try k = 4 for the next attempt

In [22]:
test_features = features_df.loc[:,['danceability', 'energy', 'loudness','mode','speechiness', 'acousticness',
                              'instrumentalness','key', 'time_signature']]

# Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(target_features)

# Apply PCA for dimensionality reduction to 3 dimensions
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

# apply GMM
n_comp = 4
gmm = GaussianMixture(n_components=n_comp,max_iter=1000, tol=1e-8)
# Fit GMM to the data
gmm.fit(X_pca)
labels = gmm.predict(X_pca)

test_features['labels'] = labels
test_features['song_name'] = features_df['song_name']

# looking at labels
i = 0
for i in range(max(labels)):
    print(test_features[test_features['labels'] == i][['labels', 'song_name']])
    i +=1
    
gmm_results = test_features.copy()

     labels                                          song_name
3         0                        I Hate Everything About You
6         0                                      It Don’t Fade
14        0                                   End of Beginning
15        0                         i like the way you kiss me
16        0                      Seven Wonders - 2017 Remaster
18        0                                         Past Lives
19        0  Cowboys Are Frequently Secretly Fond Of Each O...
20        0                                     TEXAS HOLD 'EM
21        0                           My Own Summer (Shove It)
23        0                                     Brand New City
26        0                            The West Side Freestyle
27        0           we can't be friends (wait for your love)
28        0                                    Breakin' Dishes
42        0                                    Don't Start Now
43        0                                  Pink Frida

<h3 align="center">Analysis</h3>

Overall, the different methods used clustered resulted in very different results for the songs put into different categories. I feel like K-Means did the best overall with a higher number of clusters (k). DBSCAN performed the worst likely because the data is fairly random and does not have pockets of very high density. Agglomerative and GMM performed similarly, having the best number of clusters at k = 4.

Across K-means, GMM, and Agglomerative, clustering did the best on classical music, slower songs, sad music, and similar 'genres' of music. Usually rock/heavy metal got split between 2 or more clusters and pop/alternative music was the most randomly scattered.

As a final test, I want to use the results from each method to try getting some recommendations from spotify. I will do this by getting a random list of songs from spotify and then having them ranked by each of the methods based on a correlation measurement. I am mainly doing this because Spotify's recommendations consider the artists/albums you listen to most and can soemtimes recommend a lot of the same music.

In [24]:
# Get the top 100 songs from the global Top 200 chart
results = sp.playlist_tracks('37i9dQZEVXbMDoHDwVN2tF', limit=100)

# Extract track information from the playlist
tracks = results['items']

top_uris = []
top_names = []
# getting uris
for t in tracks:
    track_uri = t["track"]["uri"]
    track_name = t["track"]["name"]
    top_uris.append(track_uri)
    top_names.append(track_name)

# getting all audio features
features_list = []
i = 1
for uri in top_uris:
    if i == 100:
        sleep(30) # ensures the rate limit is not exceeded
        i = 1
        features = sp.audio_features(uri)
        features_list.append(features) 
    else:
        features = sp.audio_features(uri)
        features_list.append(features) 
        i +=1
        
# creating df of features
top_df = pd.DataFrame(features_list[0])
for song in features_list:
    df_song = pd.DataFrame(song)
    top_df = pd.concat([top_df, df_song], ignore_index=True)
# drop first row to remove repeated track
top_df = top_df.drop(top_df.index[0])

In [25]:
index_values = range(1, len(top_names) + 1)
name_df = pd.DataFrame({'names':top_names}, index = index_values) # df of all song names

The next step is getting the average audio features from each of the previous methods for each of the clusters to see what features change between each of the clusters. Then I can sort the list of music from spotify based on these scores to see how well the features correlate to recommendations I actually like.

In [26]:
exclude_columns = ['labels', 'song_name']

# getting first label (0)
kmeans_results_8 = kmeans_results_8[kmeans_results_8['labels'] == 0].dropna()
kmeans_results_4 = kmeans_results_4[kmeans_results_4['labels'] == 0].dropna()
dbscan_results = dbscan_results[dbscan_results['labels'] == 0].dropna()
agglo_results = agglo_results[agglo_results['labels'] == 0].dropna()
gmm_results = gmm_results[gmm_results['labels'] == 0].dropna()

# Calculate the average of audio features, excluding name and labels
included_columns = [col for col in kmeans_results_8.columns if col not in exclude_columns]
kmeans_averages_8 = kmeans_results_8[included_columns].mean().tolist()
kmeans_8_filter = kmeans_results_8.drop(columns = exclude_columns)

included_columns = [col for col in kmeans_results_4.columns if col not in exclude_columns]
kmeans_averages_4 = kmeans_results_8[included_columns].mean().tolist()
kmeans_4_filter = kmeans_results_4.drop(columns = exclude_columns)

included_columns = [col for col in dbscan_results.columns if col not in exclude_columns]
dbscan_averages = dbscan_results[included_columns].mean().tolist()
dbscan_filter = dbscan_results.drop(columns = exclude_columns)

included_columns = [col for col in agglo_results.columns if col not in exclude_columns]
agglo_averages = agglo_results[included_columns].mean().tolist()
agglo_filter = agglo_results.drop(columns = exclude_columns)

included_columns = [col for col in gmm_results.columns if col not in exclude_columns]
gmm_averages = gmm_results[included_columns].mean().tolist()
gmm_filter = gmm_results.drop(columns = exclude_columns)

# calculate similarity score for each song in each method
def calculate_similarity(row, input_features):
    # input_features will be the current averages list
    song_features = row.values
    return np.dot(song_features, input_features) / (np.linalg.norm(song_features) * np.linalg.norm(input_features))

# Function to sort DataFrame based on similarity to input features
def sort_by_similarity(df_audio_features, input_features):
    # do this to each row
    df_audio_features['similarity'] = df_audio_features.apply(lambda row: calculate_similarity(row, input_features), axis=1)
    # Sort the DataFrame based on similarity 
    return df_audio_features.sort_values(by='similarity', ascending=False)

In [27]:
# getting top 10 suggestions from each method
top_features = top_df.loc[:,['danceability', 'energy', 'loudness','mode','speechiness', 'acousticness',
                              'instrumentalness','key', 'time_signature']]

kmeans_suggestions_8 = sort_by_similarity(top_features, kmeans_averages_8)

# getting names from indexes
indexes = kmeans_suggestions_8.index.tolist()

song_names = name_df.loc[indexes, 'names']
kmeans_suggestions_8['song_name'] = song_names

# viewing results
print(kmeans_suggestions_8[['similarity', 'song_name']].head(10))

    similarity                        song_name
24    0.986282  I Can Fix Him (No Really I Can)
45    0.970376                    The Albatross
17    0.964219                   Guilty as Sin?
19    0.960159                             loml
10    0.954601                 End of Beginning
35    0.952768                    The Black Dog
14    0.951618   Who’s Afraid of Little Old Me?
32    0.948220                        Clara Bow
50    0.944887                 I Wanna Be Yours
13    0.942852             But Daddy I Love Him


In [28]:
# getting top 10 suggestions from each method
top_features = top_df.loc[:,['danceability', 'energy', 'loudness','mode','speechiness', 'acousticness',
                              'instrumentalness','key', 'time_signature']]

kmeans_suggestions_4 = sort_by_similarity(top_features, kmeans_averages_4)

# getting names from indexes
indexes = kmeans_suggestions_4.index.tolist()

song_names = name_df.loc[indexes, 'names']
kmeans_suggestions_4['song_name'] = song_names

# viewing results
print(kmeans_suggestions_4[['similarity', 'song_name']].head(10))

    similarity                        song_name
24    0.986282  I Can Fix Him (No Really I Can)
45    0.970376                    The Albatross
17    0.964219                   Guilty as Sin?
19    0.960159                             loml
10    0.954601                 End of Beginning
35    0.952768                    The Black Dog
14    0.951618   Who’s Afraid of Little Old Me?
32    0.948220                        Clara Bow
50    0.944887                 I Wanna Be Yours
13    0.942852             But Daddy I Love Him


In [353]:
# getting top 10 suggestions from each method
top_features = top_df.loc[:,['danceability', 'energy', 'loudness','mode','speechiness', 'acousticness',
                              'instrumentalness','key', 'time_signature']]

dbscan_suggestions = sort_by_similarity(top_features, dbscan_averages)

# getting names from indexes
indexes = dbscan_suggestions.index.tolist()
song_names = name_df.loc[indexes, 'names']
dbscan_suggestions['song_name'] = song_names

# viewing results
print(dbscan_suggestions[['similarity', 'song_name']].head(10))

    similarity                                 song_name
14    0.997673  we can't be friends (wait for your love)
24    0.995489                                 Clara Bow
33    0.992032                           thanK you aIMee
9     0.991685                      But Daddy I Love Him
5     0.990719                           So Long, London
42    0.985797                     My Love Mine All Mine
7     0.985572                                 Gata Only
30    0.983333                         imgonnagetyouback
20    0.982688                     Fresh Out The Slammer
37    0.979286                              The Prophecy


In [35]:
# getting top 10 suggestions from each method
top_features = top_df.loc[:,['danceability', 'energy', 'loudness','mode','speechiness', 'acousticness',
                              'instrumentalness','key', 'time_signature']]

agglo_suggestions = sort_by_similarity(top_features, agglo_averages)

# getting names from indexes
indexes = agglo_suggestions.index.tolist()
song_names = name_df.loc[indexes, 'names']
agglo_suggestions['song_name'] = song_names

# viewing results
print(agglo_suggestions[['similarity', 'song_name']].head(10))

song_names = agglo_suggestions['song_name'].tolist()
scope = 'playlist-modify-public'
sp = spotipy.Spotify(auth_manager=SpotifyOAuth(client_id='whatever CID',
                                               client_secret='shhh its a secret',
                                               redirect_uri='1',
                                               scope=scope))
# Create a new playlist
playlist_name = 'ML Project 2'
playlist_description = 'Does this thing work?'
user_id = sp.me()['id']  # Get the current user's ID
playlist = sp.user_playlist_create(user_id, playlist_name, public=True, description=playlist_description)

# Search for each song and add its top result to the playlist
playlist_id = playlist['id']
for song_name in song_names:
    result = sp.search(q='track:' + song_name, type='track', limit=1)
    if result['tracks']['items']:
        track_uri = result['tracks']['items'][0]['uri']
        sp.user_playlist_add_tracks(user_id, playlist_id, [track_uri])
    else:
        print(f"No matching track found for '{song_name}'.")

    similarity                                       song_name
29    0.998012                                        Magnetic
43    0.990977                                 thanK you aIMee
3     0.986888                                       Gata Only
30    0.986765  One Of The Girls (with JENNIE, Lily Rose Depp)
36    0.983054                               imgonnagetyouback
20    0.982476                           Fresh Out The Slammer
25    0.980042                              Tell Ur Girlfriend
47    0.979537                                    Stick Season
38    0.978493                                  So High School
39    0.977507                                    The Prophecy
Enter the URL you were redirected to: 1


SpotifyOauthError: error: invalid_request, error_description: code must be supplied

In [369]:
# getting top 10 suggestions from each method
top_features = top_df.loc[:,['danceability', 'energy', 'loudness','mode','speechiness', 'acousticness',
                              'instrumentalness','key', 'time_signature']]

gmm_suggestions = sort_by_similarity(top_features, gmm_averages)

# getting names from indexes
indexes = gmm_suggestions.index.tolist()
song_names = name_df.loc[indexes, 'names']
gmm_suggestions['song_name'] = song_names

# viewing results
print(gmm_suggestions[['similarity', 'song_name']].head(10))

    similarity                                       song_name
26    0.998372                                        Magnetic
33    0.990324                                 thanK you aIMee
35    0.988336  One Of The Girls (with JENNIE, Lily Rose Depp)
7     0.986852                                       Gata Only
30    0.983064                               imgonnagetyouback
36    0.982520                              Tell Ur Girlfriend
20    0.982331                           Fresh Out The Slammer
45    0.981285                                    Stick Season
32    0.980822                                  So High School
44    0.979463                                           Pedro


<h3 align="center">Results</h3>

k-Means with 4 and 8 clusters has similar suggestions while the other 3 methods are quite different. So now I go listen to these songs.

To the surprise of nobody, Taylor Swift is in the top 10 in every recommendation. I could have looked at other playlists to as well, this is just the top 100 list.

<b>K-Means:</b> Pretty much all Taylor Swift. Although the first song also features the weekend who i do like.

"we cant be friends" is one of my top songs right now so this was a positive result for both DBSCAN and agglomerative; however, it would be nice if i could filter recommendations to only show new songs. <br>

<b>DBSCAN:</b> Almost all Taylor Swift.

<b>Agglomerative:</b>Good top result, I also like seeing Mitski on here. So much Taylor Swift.

<b>GMM:</b> Recommended a few songs in different languages, which I guess makes sense since I am just looking at the audio features. This is likely another thing that Spotify's recommendations takes into account that I did not look into here.

Overall, I liked the results from Agglomerative and K-Means clustering the best just based on what I would listen to. DBSCAN performed the worst in my opinion, which makes sense because the data does not really have 'denser regions' and DBSCAN typically does well for identifying highly concentrated data. GMM perfomed okay, but for some reason recommended more songs in different langauges, but also recommended the leats Taylor Swift. 

Testing the model on different playlists and accounting for other variables such as how many times i listen to each of my liked songs, artists, and albums would likely improve results.