## RECOMMENDER TEST

1. Load the clustered dataset, which includes the columns needed to display songs in the app (spotify_title, spotify_artist, album, album_cover, release_date, popularity, duration_ms, explicit, ...).
2. Use the Spotify API search endpoint (the spotify.search() method) to look up the user's query, then show the matching songs so the user can select one.
3. Build the dataframe with the same song features / columns that we used for the clustering. Make sure the columns are in the same order as the columns in the clustering dataset. Make sure the transformations you apply to calculate genre columns are consistent with the ones used for the clustering.
4. Scale the data with the same scaler used for the clustering.
5. Predict the cluster index for the user's selected song.
6. Filter the dataset to get the songs from the same cluster.
7. Get the top 10 most popular songs from the same cluster.
8. Display the songs to the user.

In [1]:
import os
import pickle
import time

import numpy as np
import pandas as pd
import spotipy
from dotenv import load_dotenv
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
from spotipy.oauth2 import SpotifyClientCredentials, SpotifyOAuth

In [2]:
# Load environment variables from .env file
load_dotenv()

# Now you can access the environment variables
client_id = os.getenv('SPOTIFY_CLIENT_ID')
client_secret = os.getenv('SPOTIFY_CLIENT_SECRET')

spotify = spotipy.Spotify(client_credentials_manager=SpotifyClientCredentials(client_id, client_secret))
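
`os.getenv` returns `None` when a variable is missing, so a check right after `load_dotenv()` makes it obvious which credential is absent. A minimal sketch, assuming a hypothetical helper `require_env` (it is not part of spotipy or python-dotenv):

```python
import os


def require_env(names, environ=None):
    """Return the values of the given environment variables, in order.

    `require_env` is a hypothetical helper for illustration. It raises one
    error that lists every missing variable instead of returning None.
    """
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return [environ[name] for name in names]


# Example with an explicit mapping so the sketch runs without a .env file.
fake_env = {"SPOTIFY_CLIENT_ID": "abc", "SPOTIFY_CLIENT_SECRET": "xyz"}
client_id, client_secret = require_env(
    ["SPOTIFY_CLIENT_ID", "SPOTIFY_CLIENT_SECRET"], environ=fake_env
)
```

In the notebook the call would omit `environ=` so that the real process environment is read.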



## Load the dataset

In [3]:
#load the dataset
spotify_df = pd.read_csv('../data/7_clustered_dataset.csv')
spotify_df.head()



| | original_title | original_artist | spotify_title | spotify_artist | album | release_date | popularity | duration_ms | explicit | album_cover | genres | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Je Sais Que La Terre Est Plate | Raphaël | Je sais que la Terre est plate | Raphaël | Je Sais Que La Terre Est Plate | 2008-03-14 | 14 | 150040 | False | https://i.scdn.co/image/ab67616d0000b2739e6b95... | ['chanson', 'french pop', 'french rock', 'nouv... | 0 |
| 1 | On Efface | Julie Zenatti | On efface | Julie Zenatti | Comme vous... | 2004-03-21 | 1 | 253000 | False | https://i.scdn.co/image/ab67616d0000b27398d445... | ['chanson', 'french pop'] | 0 |
| 2 | Howells Delight | The Baltimore Consort | Howells Delight | Anonymous | The Best of the Baltimore Consort | 2011-02-01 | 3 | 240400 | False | https://i.scdn.co/image/ab67616d0000b27353a906... | ['medieval'] | 0 |
| 3 | Martha Served | I Hate Sally | Martha Served | I Hate Sally | Don't Worry Lady | 2007-06-12 | 1 | 138760 | True | https://i.scdn.co/image/ab67616d0000b273e6d949... | ['canadian metal', 'canadian post-hardcore', '... | 6 |
| 4 | Zip-A-Dee-Doo-Dah | Orlando Pops Orchestra | Zip-a-Dee-Doo-Dah (From "Song of the South") | Orlando Pops Orchestra | Most Amazing Movie, Musical & TV Themes, Vol.6 | 2022-10-07 | 0 | 199986 | False | https://i.scdn.co/image/ab67616d0000b27349ea4d... | ['pops orchestra'] | 0 |


In [4]:
spotify_df.columns

Index(['original_title', 'original_artist', 'spotify_title', 'spotify_artist',
       'album', 'release_date', 'popularity', 'duration_ms', 'explicit',
       'album_cover', 'genres', 'cluster'],
      dtype='object')

## Spotify API user query

In [5]:
def search_spotify_tracks(query, limit=10):
    """
    Search for tracks on Spotify and return formatted results
    
    Parameters:
    query (str): Search query for the track
    limit (int): Maximum number of results to return
    
    Returns:
    list: List of dictionaries containing track information
    """
    results = spotify.search(q=query, type='track', limit=limit)
    
    tracks = []
    for track in results['tracks']['items']:
        track_info = {
            'title': track['name'],
            'artist': track['artists'][0]['name'],
            'artist_id': track['artists'][0]['id'],
            'duration_ms': track['duration_ms'],
            'explicit': track['explicit'],
            'album': track['album']['name'],
            'album_cover': track['album']['images'][0]['url'] if track['album']['images'] else None,
            'preview_url': track['preview_url'],
            'popularity': track['popularity'],
            'id': track['id'],
            'release_year': track['album']['release_date'][:4]
        }
        tracks.append(track_info)
    
    return tracks
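
The field extraction can be exercised without calling the API if it is separated from the request. The sketch below assumes a hypothetical helper `format_track` that takes one element of `results['tracks']['items']`; unlike the function above, it stores the release year as an `int` and tolerates an empty image list or a missing release date. The sample item is hand-written and its values are made up.

```python
def format_track(item):
    """Flatten one search-result item into a plain dict.

    Hypothetical helper for illustration: same fields as the notebook's
    search function, minus preview_url, with release_year as an int.
    """
    album = item.get("album", {})
    images = album.get("images") or []
    release_date = album.get("release_date") or ""
    year_text = release_date[:4]
    first_artist = item["artists"][0]
    return {
        "title": item["name"],
        "artist": first_artist["name"],
        "artist_id": first_artist["id"],
        "duration_ms": item["duration_ms"],
        "explicit": item["explicit"],
        "album": album.get("name"),
        "album_cover": images[0]["url"] if images else None,
        "popularity": item["popularity"],
        "id": item["id"],
        # Slicing works for both "2012-01-27" and year-only dates like "2002".
        "release_year": int(year_text) if year_text.isdigit() else None,
    }


# A hand-written item shaped like a search result (all values are made up).
sample_item = {
    "name": "Example Song",
    "artists": [{"name": "Example Artist", "id": "artist123"}],
    "duration_ms": 200000,
    "explicit": False,
    "album": {"name": "Example Album", "images": [], "release_date": "2012-01-27"},
    "popularity": 50,
    "id": "track123",
}
track = format_track(sample_item)
```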


In [6]:
# Example usage: the user types a song title and we search Spotify for matching tracks

query = "Born to die" 
search_results = search_spotify_tracks(query)
search_results


[{'title': 'Born To Die',
  'artist': 'Lana Del Rey',
  'artist_id': '00FQb4jTyendYWaN8pK0wa',
  'duration_ms': 285400,
  'explicit': False,
  'album': 'Born To Die - The Paradise Edition',
  'album_cover': 'https://i.scdn.co/image/ab67616d0000b273ebc8cfac8b586bc475b04918',
  'preview_url': None,
  'popularity': 74,
  'id': '4Ouhoi2lAhrLJKFzUqEzwl',
  'release_year': '2012'},
 {'title': 'Top',
  'artist': 'Biba',
  'artist_id': '2AVWBp9fVh1GOCQflKM7wo',
  'duration_ms': 173638,
  'explicit': True,
  'album': 'Mango',
  'album_cover': 'https://i.scdn.co/image/ab67616d0000b2735bf284bfd4ad320354138280',
  'preview_url': None,
  'popularity': 53,
  'id': '4IBoZloHkTEug8fKqmLBa4',
  'release_year': '2023'},
 {'title': 'Summertime Sadness',
  'artist': 'Lana Del Rey',
  'artist_id': '00FQb4jTyendYWaN8pK0wa',
  'duration_ms': 265427,
  'explicit': False,
  'album': 'Born To Die - The Paradise Edition',
  'album_cover': 'https://i.scdn.co/image/ab67616d0000b273ebc8cfac8b586bc475b04918',
  'pre

## Dataframe with the same song features

Build the dataframe with the same song features / columns that we used for the clustering.

Columns must be in the same order as the columns in the clustering dataset.

Transformations to calculate genre columns must be consistent with the ones used for the clustering.

columns_to_keep = ['popularity','duration_ms', 'explicit', 'pop','old school hip hop','rap','rock' ,'hard rock', 'release_year']
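
One way to guarantee the column order is to build the row from the column list itself rather than relying on the order in which dict keys happen to be written. The sketch below is a plain-Python illustration; `build_feature_row` and `GENRE_COLUMNS` are hypothetical names, and casting `release_year` to `int` assumes the clustering used a numeric year.

```python
COLUMNS_TO_KEEP = ['popularity', 'duration_ms', 'explicit', 'pop',
                   'old school hip hop', 'rap', 'rock', 'hard rock', 'release_year']
GENRE_COLUMNS = ['pop', 'old school hip hop', 'rap', 'rock', 'hard rock']


def build_feature_row(track, artist_genres):
    """Return the clustering features for one track, keyed in COLUMNS_TO_KEEP order."""
    row = {
        'popularity': track['popularity'],
        'duration_ms': track['duration_ms'],
        'explicit': track['explicit'],
        # Assumption: the clustering dataset stored the year as a number.
        'release_year': int(track['release_year']),
    }
    # Exact membership test: 'art pop' does not set the 'pop' flag on its own.
    for genre in GENRE_COLUMNS:
        row[genre] = 1 if genre in artist_genres else 0
    # Re-emit the keys in the clustering order.
    return {column: row[column] for column in COLUMNS_TO_KEEP}


# Values taken from the "Born To Die" search result shown earlier.
example_track = {'popularity': 74, 'duration_ms': 285400,
                 'explicit': False, 'release_year': '2012'}
feature_row = build_feature_row(example_track, ['art pop', 'pop'])
```

Passing the result to `pd.DataFrame([feature_row], columns=COLUMNS_TO_KEEP)` would then yield a one-row dataframe whose columns follow the clustering order.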

In [7]:
#user selects a song i.e. "Born to die" index 0
user_selection_index = 0
user_selection_track_data= search_results[user_selection_index]

#get genre from artist id
results=spotify.artist(user_selection_track_data['artist_id']) 

genres=results['genres']

print(genres)
print(user_selection_track_data)

['art pop', 'pop']
{'title': 'Born To Die', 'artist': 'Lana Del Rey', 'artist_id': '00FQb4jTyendYWaN8pK0wa', 'duration_ms': 285400, 'explicit': False, 'album': 'Born To Die - The Paradise Edition', 'album_cover': 'https://i.scdn.co/image/ab67616d0000b273ebc8cfac8b586bc475b04918', 'preview_url': None, 'popularity': 74, 'id': '4Ouhoi2lAhrLJKFzUqEzwl', 'release_year': '2012'}


In [8]:
# Build the clustering features for one selected track, including the binary genre flags
def get_necessary_columns(search_results, user_selection_index=0):
    """Return a dict of clustering features for the track at user_selection_index."""
    user_selection_track_data = search_results[user_selection_index]

    # Get the genres from the artist id
    results = spotify.artist(user_selection_track_data['artist_id'])
    genres = results['genres']

    return {
        'popularity': user_selection_track_data['popularity'],
        'duration_ms': user_selection_track_data['duration_ms'],
        'explicit': user_selection_track_data['explicit'],
        'pop': 1 if 'pop' in genres else 0,
        'old school hip hop': 1 if 'old school hip hop' in genres else 0,
        'rap': 1 if 'rap' in genres else 0,
        'rock': 1 if 'rock' in genres else 0,
        'hard rock': 1 if 'hard rock' in genres else 0,
        'release_year': user_selection_track_data['release_year']
    }


In [9]:
# Dataframe with the same song features / columns that we used for the clustering.


selection_data = {
    'popularity': user_selection_track_data['popularity'],
    'duration_ms': user_selection_track_data['duration_ms'],
    'explicit': user_selection_track_data['explicit'],
    'pop': 1 if 'pop' in genres else 0,
    'old school hip hop': 1 if 'old school hip hop' in genres else 0,
    'rap': 1 if 'rap' in genres else 0,
    'rock': 1 if 'rock' in genres else 0,
    'hard rock': 1 if 'hard rock' in genres else 0,
    'release_year': user_selection_track_data['release_year']
}

selection_data_df = pd.DataFrame([selection_data])
selection_data_df


| | popularity | duration_ms | explicit | pop | old school hip hop | rap | rock | hard rock | release_year |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 74 | 285400 | False | 1 | 0 | 0 | 0 | 0 | 2012 |


## Scale the data

In [10]:
# Load the scaler that was fitted during clustering
with open('../scaler/standard_scaler.pkl', 'rb') as f:
    scaler = pickle.load(f)

# Scale the data
selection_data_scaled = scaler.transform(selection_data_df)

selection_data_scaled_df = pd.DataFrame(selection_data_scaled, columns=selection_data_df.columns)

In [11]:
selection_data_scaled_df

| | popularity | duration_ms | explicit | pop | old school hip hop | rap | rock | hard rock | release_year |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.929352 | 0.45503 | -0.257316 | 7.991633 | -0.077505 | -0.104234 | -0.211111 | -0.152744 | 0.861052 |
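
For each column, `StandardScaler.transform` computes z = (x - mean) / std using the mean and population standard deviation stored at fit time. The sketch below reproduces that arithmetic in plain Python and then inverts it for a 0/1 column: if a share p of the training rows has the flag set, the value 1 maps to sqrt((1 - p) / p), so p = 1 / (1 + z²). Under the assumption that `pop` was a strict 0/1 column when the scaler was fitted, the scaled value 7.991633 above would correspond to roughly 1.5% of the training songs carrying the `pop` flag. That figure is an inference from the scaled output, not a number read from the dataset, and the function names are hypothetical.

```python
import math


def z_score(value, mean, std):
    """Standardize one value the way StandardScaler does for a single column."""
    return (value - mean) / std


def binary_column_stats(share_of_ones):
    """Mean and population standard deviation of a 0/1 column."""
    mean = share_of_ones
    std = math.sqrt(share_of_ones * (1.0 - share_of_ones))
    return mean, std


def share_of_ones_from_z(z_of_one):
    """Invert z = sqrt((1 - p) / p) to recover p from the scaled value of a 1."""
    return 1.0 / (1.0 + z_of_one ** 2)


# Scaled 'pop' value taken from the output above.
share = share_of_ones_from_z(7.991633)
mean, std = binary_column_stats(share)
# Scaling a 1 with the recovered statistics should give back the same z value.
roundtrip = z_score(1, mean, std)
```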


## Using the pre-trained model

Getting the cluster index for the user's selected song and filtering top 10 songs in the same cluster

In [12]:
# Load the pre-trained KMeans model and predict
with open('../models/kmeans_model_11.pkl', 'rb') as f:
    model = pickle.load(f)

cluster_prediction = model.predict(selection_data_scaled_df)
#cluster_prediction[0]

print(f'Cluster label: {cluster_prediction[0]}')

Cluster label: 3
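
For a fitted KMeans model, `predict` assigns each sample to the closest centroid in `model.cluster_centers_`. The sketch below shows that rule in plain Python on made-up two-dimensional centroids; `nearest_centroid` is a hypothetical name, and the real model works in the nine-dimensional scaled feature space.

```python
import math


def nearest_centroid(point, centroids):
    """Return the index of the centroid with the smallest Euclidean distance."""
    distances = [math.dist(point, centroid) for centroid in centroids]
    return distances.index(min(distances))


# Toy centroids and a toy point (made-up values, two features only).
toy_centroids = [(0.0, 0.0), (5.0, 5.0), (-3.0, 4.0)]
label = nearest_centroid((4.2, 4.9), toy_centroids)
```

This is also why the scaling step matters: distances are only comparable to the centroids if the new song is transformed with the same scaler as the training data.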


In [13]:
# Load the clustered dataset and filter for the predicted cluster
clustered_df = pd.read_csv('../data/7_clustered_dataset.csv')
clustered_df_filtered = clustered_df[clustered_df['cluster'] == cluster_prediction[0]]

# Pick 10 songs from the same cluster.
# Note: .sample(10) draws 10 random rows, so the sort does not affect which rows come back;
# replace .sample(10) with .head(10) to get the 10 most popular songs instead.
top_10_songs = clustered_df_filtered.sort_values(by='popularity', ascending=False).sample(10)
top_10_songs.head(10)

| | original_title | original_artist | spotify_title | spotify_artist | album | release_date | popularity | duration_ms | explicit | album_cover | genres | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8079 | she will be loved | maroon 5 | She Will Be Loved - Radio Mix | Maroon 5 | Songs About Jane: 10th Anniversary Edition | 2002 | 79 | 259453 | False | https://i.scdn.co/image/ab67616d0000b27392f2d7... | ['pop'] | 3 |
| 9125 | through with you | maroon 5 | Through With You | Maroon 5 | Songs About Jane | 2002-06-25 | 40 | 181573 | False | https://i.scdn.co/image/ab67616d0000b27317b385... | ['pop'] | 3 |
| 70 | 3 | Britney Spears | 3 | Britney Spears | The Singles Collection | 2009-11-09 | 57 | 213173 | False | https://i.scdn.co/image/ab67616d0000b273cc8fa7... | ['dance pop', 'pop'] | 3 |
| 4237 | Someday | Mariah Carey | Someday | Mariah Carey | Mariah Carey | 1990-06-12 | 46 | 246106 | False | https://i.scdn.co/image/ab67616d0000b2735084c6... | ['dance pop', 'pop', 'urban contemporary'] | 3 |
| 1850 | Speed Of Sound | Coldplay | Speed of Sound | Coldplay | X&Y | 2005-06-07 | 70 | 287906 | False | https://i.scdn.co/image/ab67616d0000b2734e0362... | ['permanent wave', 'pop'] | 3 |
| 2970 | PDA | Backstreet Boys | PDA | Backstreet Boys | This Is Us | 2009-10-06 | 27 | 228400 | False | https://i.scdn.co/image/ab67616d0000b2731c5da6... | ['boy band', 'dance pop', 'pop'] | 3 |
| 8611 | the world turned upside down | coldplay | The World Turned Upside Down | Coldplay | Fix You | 2005-09-05 | 31 | 272626 | False | https://i.scdn.co/image/ab67616d0000b273d1471c... | ['permanent wave', 'pop'] | 3 |
| 6752 | sexy ladies | justin timberlake | Sexy Ladies | Justin Timberlake | FutureSex/LoveSounds | 2006-09-12 | 42 | 238600 | False | https://i.scdn.co/image/ab67616d0000b273c6ba98... | ['dance pop', 'pop'] | 3 |
| 698 | Shape Of My Heart | Backstreet Boys | Shape of My Heart | Backstreet Boys | Black & Blue | 2000-11-21 | 70 | 230093 | False | https://i.scdn.co/image/ab67616d0000b273cf871e... | ['boy band', 'dance pop', 'pop'] | 3 |
| 7188 | she will be loved | maroon 5 | She Will Be Loved - Radio Mix | Maroon 5 | Songs About Jane: 10th Anniversary Edition | 2002 | 79 | 259453 | False | https://i.scdn.co/image/ab67616d0000b27392f2d7... | ['pop'] | 3 |
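
The cell above calls `.sample(10)` after sorting, which draws ten random rows, so the output is not ordered by popularity and the same track appears twice (rows 8079 and 7188). A minimal sketch of a strict "top N" that also drops duplicate tracks, assuming a hypothetical helper `top_n_by_popularity` and that title plus artist is an adequate duplicate key for this dataset:

```python
import pandas as pd


def top_n_by_popularity(cluster_df, n=10,
                        key_columns=("spotify_title", "spotify_artist")):
    """Return the n most popular rows, keeping one row per title/artist pair."""
    return (
        cluster_df
        .sort_values("popularity", ascending=False)
        .drop_duplicates(subset=list(key_columns), keep="first")
        .head(n)
    )


# Toy data (made-up rows) with one duplicated track.
toy = pd.DataFrame({
    "spotify_title": ["A", "B", "A", "C"],
    "spotify_artist": ["x", "y", "x", "z"],
    "popularity": [79, 40, 79, 57],
})
top = top_n_by_popularity(toy, n=2)
```

In the notebook this would be called as `top_n_by_popularity(clustered_df_filtered)`. If some variety between runs is wanted, sampling from a larger top slice (for example the 50 most popular rows) is one alternative design, at the cost of no longer returning a strict top 10.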
