## Basic recommendation algorithm ##


**Import libraries**

Import the required libraries for data analysis and machine learning.


In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

**Load data**

Load selected columns from a CSV file into a DataFrame, rename some columns for consistency, and display the first few rows of the resulting data.


In [2]:
# Select relevant columns 
usecols = ["track_id", "track_name", "artist_name", "new_genre", "popularity", "energy", "instrumentalness", "valence"]

# Load data file into a DataFrame
df = pd.read_csv('../datasets/track_data.csv', usecols=usecols, dtype={'popularity': 'float64'})

# Rename columns
df.rename(columns={'artist_name': 'artist', 'new_genre': 'genre'}, inplace=True)

# Display the first few rows 
df.head()

Unnamed: 0,track_id,track_name,artist,genre,valence,energy,popularity,instrumentalness
0,53QF56cjZA9RTuuMZDrSA6,I Won't Give Up,Jason Mraz,Folk,0.139,0.303,68.0,0.0
1,1s8tP3jP4GZcyHDsjvw218,93 Million Miles,Jason Mraz,Folk,0.515,0.454,50.0,1.4e-05
2,7BRCa8MPiyuvr2VU3O9W0F,Do Not Let Me Go,Joshua Hyslop,Folk,0.145,0.234,57.0,5e-05
3,63wsZUhUZLlh1OsyrZq7sz,Fast Car,Boyce Avenue,Folk,0.508,0.251,58.0,0.0
4,6nXIYClvJAfi6ujLiKqEq8,Sky's Still Blue,Andrew Belle,Folk,0.217,0.791,54.0,0.0193


**Normalise data**

Scale the numerical features in the DataFrame, so that they can be compared and treated equally.  Each feature will now have a mean of 0 and a standard deviation of 1.


In [3]:
# Normalise numerical features
numerical_features = ["popularity", "energy", "instrumentalness", "valence"]
scaler = StandardScaler()
df.loc[:, numerical_features] = scaler.fit_transform(df[numerical_features].values)
df.head()

Unnamed: 0,track_id,track_name,artist,genre,valence,energy,popularity,instrumentalness
0,53QF56cjZA9RTuuMZDrSA6,I Won't Give Up,Jason Mraz,Folk,-1.173026,-1.242615,3.104013,-0.699768
1,1s8tP3jP4GZcyHDsjvw218,93 Million Miles,Jason Mraz,Folk,0.223382,-0.686037,1.974708,-0.699731
2,7BRCa8MPiyuvr2VU3O9W0F,Do Not Let Me Go,Joshua Hyslop,Folk,-1.150743,-1.496945,2.413882,-0.699632
3,63wsZUhUZLlh1OsyrZq7sz,Fast Car,Boyce Avenue,Folk,0.197385,-1.434284,2.476621,-0.699768
4,6nXIYClvJAfi6ujLiKqEq8,Sky's Still Blue,Andrew Belle,Folk,-0.883345,0.556127,2.225665,-0.647133


**Setup user preferences**

Setup user preferences for the recommendation algorithm.

In [4]:
# Setup user preferences
preferences = {
    "popularity": 80,
    "energy": 0.5,
    "instrumentalness": 0.1,
    "valence": 0.5,
    "genres": ['Pop', 'Rock'],
    "artists": ['Ed Sheeran', 'Coldplay'],
}

**Filter dataset by genre**

Filter the dataset based on the user's genre preferences.

In [5]:
# Create a boolean mask for matching genres
genre_mask = df['genre'].isin(preferences['genres'])  
genre_mask.head()

0    False
1    False
2    False
3    False
4    False
Name: genre, dtype: bool

In [6]:
# Use the mask to filter and copy the DataFrame
filtered_df = df[genre_mask].copy()                   

# Display the filtered DataFrame
filtered_df.head()

Unnamed: 0,track_id,track_name,artist,genre,valence,energy,popularity,instrumentalness
1770,2iUmqdfGZcHIhS3b9E9EWq,Everybody Talks,Neon Trees,Rock,1.00329,1.046358,3.668666,-0.699768
1771,4FEr6dIdH6EqLKR0jB560J,Rosemary,Deftones,Rock,-1.402542,-0.099971,3.292231,-0.427045
1772,1RTYixE1DD3g3upEpmCJpa,In The End,Black Veil Brides,Rock,-0.686511,1.101647,3.104013,-0.683705
1773,0AOmbw8AwDnwXhHC3OhdVB,Courtesy Call,Thousand Foot Krutch,Rock,-0.036588,-0.007823,3.292231,-0.699768
1774,4bLCPfBLKlqiONo6TALTh5,Entombed,Deftones,Rock,-1.069038,0.416061,2.915796,-0.350683


**Create user vector**

Create a list of the user's preferences, focusing on the numerical features only. We use only numerical features because they can be **directly** compared using math, while non-numeric features like genre or artist cannot.

In [7]:
# Create user vector
user_vector = [preferences[feature] for feature in numerical_features]
print(user_vector)

[80, 0.5, 0.1, 0.5]


**Normalise user vector**

Scale the user's vector, so that it can be compared to the song data.

In [8]:
# Scale the user vector using the same scaler as the dataset
user_vector_scaled = scaler.transform([user_vector])[0]
print(user_vector_scaled)

[ 3.85688358 -0.51648326 -0.42704533  0.16767413]


**Extract track vectors**

Extract the numerical features for each track.

In [9]:
# Extract track vectors
track_vectors = df[numerical_features].values
print(track_vectors)

[[ 3.10401332 -1.24261459 -0.69976823 -1.17302569]
 [ 1.97470792 -0.68603677 -0.69973087  0.22338188]
 [ 2.41388224 -1.49694486 -0.69963187 -1.15074259]
 ...
 [-1.03677314 -0.73764001 -0.691232   -1.5588947 ]
 [-1.16225152 -0.86664812 -0.6995956  -0.93905315]
 [-0.97403395  0.81414319 -0.69974524  1.49351856]]


**Calculate similarity between user vector and track vectors**

Measure how similar each track is to the user's preferences by comparing their feature vectors. The measure of similarity used is called "cosine similarity".

Geometrically, it is the cosine of the angle between the two vectors. If two vectors are identical, the angle is 0 and the cosine is 1, indicating perfect similarity. If they are completely different, the angle is 90 degrees and the cosine is 0, indicating no similarity. Cosine similarity is a measure of orientation, not magnitude.

In [10]:
# Calculate similarity between user vector and track vectors
similarity_matrix = cosine_similarity(track_vectors, [user_vector])
similarity_scores = similarity_matrix.flatten()
print(similarity_scores)

[ 0.85496718  0.88947064  0.76233236 ... -0.49441418 -0.63014501
 -0.46146504]


**Boost similarity scores for preferred artists**

Boost similarity scores for tracks by preferred artists. If a track is by a preferred artist, its similarity score is increased (e.g. by 30%) to make it more likely to be recommended.

In [11]:
# Create a boolean mask for matching artists
artist_matches = df['artist'].isin(preferences['artists'])  
artist_matches.head()

0    False
1    False
2    False
3    False
4    False
Name: artist, dtype: bool

In [12]:
# Show tracks for matching artists
df[artist_matches].head()

Unnamed: 0,track_id,track_name,artist,genre,valence,energy,popularity,instrumentalness
39647,0pjMTISKHTJkogN1BPZxaC,Paradise - Tiësto Remix,Coldplay,Pop,0.698755,0.3534,2.037447,-0.317956
91634,6fxVffaTuwjgEk5h9QyRjy,Photograph,Ed Sheeran,Pop,-0.942767,-0.962483,2.853057,-0.698503
91636,1Slwb6dOYkBlWal1PGtnNg,Thinking out Loud,Ed Sheeran,Pop,0.505634,-0.71921,2.853057,-0.699768
91649,1fu5IQSRgPxJL2OTP7FVLW,I See Fire,Ed Sheeran,Pop,-0.931625,-2.168156,2.476621,-0.699768
91680,3Th56VIq2sEaEmPPETu7p5,All of the Stars,Ed Sheeran,Pop,-0.623376,-0.306384,2.162925,-0.699272


In [13]:
# Boost factor - this can be tweaked.
# A factor of 1.3 means that a preferred artist is 30% more likely to be recommended.
boost_factor = 1.3

# Boost scores for tracks by matching artists
similarity_scores *= (artist_matches.values * (boost_factor - 1)) + 1
print(similarity_scores)

[ 0.85496718  0.88947064  0.76233236 ... -0.49441418 -0.63014501
 -0.46146504]


**Recommend songs**

Return a list of the top N songs based on the similarity scores.


In [14]:
# Recommend top N songs based on the similarity scores
top_n = 10

# Get the indices of the top N scores
top_track_indices = np.argsort(-similarity_scores)[:top_n]
print(top_track_indices)

[ 505879  193625  641658  558753  398633 1128126 1128155  558592  287641
 1128142]


In [15]:
# Select the top N recommendations
recommendations = df.iloc[top_track_indices][["track_name", "artist", "track_id"]]
print(recommendations)

                              track_name      artist                track_id
505879                       My Universe    Coldplay  3FeVmId7tL5YN8B7R3imoM
193625              Hymn for the Weekend    Coldplay  3RiPr603aXAoi4GHyXx0uy
641658                       Don't Panic    Coldplay  2QhURnm7mQDxBb5jWkbDug
558753            2step (feat. Lil Baby)  Ed Sheeran  2UN0lp72LAusrXi8LLVomt
398633   Beautiful People (feat. Khalid)  Ed Sheeran  70eFcWOvlMObDhURTqT4Fv
1128126                       Lego House  Ed Sheeran  5ubHAQtKuFfiG4FXfLP804
1128155                            Drunk  Ed Sheeran  4RnCPWlBsY7oUDdyruod7Y
558592                         Celestial  Ed Sheeran  4zrKN5Sv8JS5mqnbVcsul7
287641                Castle on the Hill  Ed Sheeran  6PCUP3dWmTjcTtXY02oFdT
1128142    Every Teardrop Is a Waterfall    Coldplay  2U8g9wVcUu9wsg6i7sFSv8
