## Final recommendation algorithm ##

Steps that are new or different from the basic algorithm are marked with (New) or (Updated).

**Import libraries**

Import the required libraries for data analysis and machine learning.


In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

**Load data**

Load selected columns from a CSV file into a DataFrame, rename some columns for consistency, and display the first few rows of the resulting data.


In [2]:
# Select relevant columns 
usecols = ["track_id", "track_name", "artist_name", "new_genre", "popularity", "energy", "instrumentalness", "valence"]

# Load data file into a DataFrame
df = pd.read_csv('../datasets/track_data.csv', usecols=usecols, dtype={'popularity': 'float64'})

# Rename columns
df.rename(columns={'artist_name': 'artist', 'new_genre': 'genre'}, inplace=True)

# Display the first few rows 
df.head()

Unnamed: 0,track_id,track_name,artist,genre,valence,energy,popularity,instrumentalness
0,53QF56cjZA9RTuuMZDrSA6,I Won't Give Up,Jason Mraz,Folk,0.139,0.303,68.0,0.0
1,1s8tP3jP4GZcyHDsjvw218,93 Million Miles,Jason Mraz,Folk,0.515,0.454,50.0,1.4e-05
2,7BRCa8MPiyuvr2VU3O9W0F,Do Not Let Me Go,Joshua Hyslop,Folk,0.145,0.234,57.0,5e-05
3,63wsZUhUZLlh1OsyrZq7sz,Fast Car,Boyce Avenue,Folk,0.508,0.251,58.0,0.0
4,6nXIYClvJAfi6ujLiKqEq8,Sky's Still Blue,Andrew Belle,Folk,0.217,0.791,54.0,0.0193


**Normalise data (Updated)**

Scale the non-mood numerical features in the DataFrame, so that they can be compared and treated equally.  Each feature will now have a mean of 0 and a standard deviation of 1.

Mood features (valence, energy) are processed separately.

In [3]:
# Normalise numerical features
numerical_features = ["popularity", "instrumentalness"]
scaler = StandardScaler()
df.loc[:, numerical_features] = scaler.fit_transform(df[numerical_features].values)
df.head()

Unnamed: 0,track_id,track_name,artist,genre,valence,energy,popularity,instrumentalness
0,53QF56cjZA9RTuuMZDrSA6,I Won't Give Up,Jason Mraz,Folk,0.139,0.303,3.104013,-0.699768
1,1s8tP3jP4GZcyHDsjvw218,93 Million Miles,Jason Mraz,Folk,0.515,0.454,1.974708,-0.699731
2,7BRCa8MPiyuvr2VU3O9W0F,Do Not Let Me Go,Joshua Hyslop,Folk,0.145,0.234,2.413882,-0.699632
3,63wsZUhUZLlh1OsyrZq7sz,Fast Car,Boyce Avenue,Folk,0.508,0.251,2.476621,-0.699768
4,6nXIYClvJAfi6ujLiKqEq8,Sky's Still Blue,Andrew Belle,Folk,0.217,0.791,2.225665,-0.647133


**Setup user preferences (Updated)**

Setup user preferences for the recommendation algorithm.

In [4]:
# Setup user preferences
preferences = {
    "popularity": 80,
    "instrumentalness": 0.1,
    "genres": ['Pop', 'Rock'],
    "artists": ['Ed Sheeran', 'Coldplay'],
}

**Filter dataset by genre**

Filter the dataset based on the user's genre preferences.

In [5]:
# Create a boolean mask for matching genres
genre_mask = df['genre'].isin(preferences['genres'])  
genre_mask.head()

0    False
1    False
2    False
3    False
4    False
Name: genre, dtype: bool

In [6]:
# Use the mask to filter and copy the DataFrame
filtered_df = df[genre_mask].copy()                   

# Display the filtered DataFrame
filtered_df.head()

Unnamed: 0,track_id,track_name,artist,genre,valence,energy,popularity,instrumentalness
1770,2iUmqdfGZcHIhS3b9E9EWq,Everybody Talks,Neon Trees,Rock,0.725,0.924,3.668666,-0.699768
1771,4FEr6dIdH6EqLKR0jB560J,Rosemary,Deftones,Rock,0.0772,0.613,3.292231,-0.427045
1772,1RTYixE1DD3g3upEpmCJpa,In The End,Black Veil Brides,Rock,0.27,0.939,3.104013,-0.683705
1773,0AOmbw8AwDnwXhHC3OhdVB,Courtesy Call,Thousand Foot Krutch,Rock,0.445,0.638,3.292231,-0.699768
1774,4bLCPfBLKlqiONo6TALTh5,Entombed,Deftones,Rock,0.167,0.753,2.915796,-0.350683


**Create user vector**

Create a list of the user's preferences, focusing on the selected numerical features only. We use only numerical features because they can be directly compared using math, while non-numeric features like genre or artist cannot.

In [7]:
# Create user vector
user_vector = [preferences[feature] for feature in numerical_features]
print(user_vector)

[80, 0.1]


**Normalise user vector**

Scale the user's vector, so that it can be compared to the song data.

In [8]:
# Scale the user vector using the same scaler as the dataset
user_vector_scaled = scaler.transform([user_vector])[0]
print(user_vector_scaled)

[ 3.85688358 -0.42704533]


**Extract track vectors**

Extract the numerical features for each track.

In [9]:
# Extract track vectors
track_vectors = df[numerical_features].values
print(track_vectors)

[[ 3.10401332 -0.69976823]
 [ 1.97470792 -0.69973087]
 [ 2.41388224 -0.69963187]
 ...
 [-1.03677314 -0.691232  ]
 [-1.16225152 -0.6995956 ]
 [-0.97403395 -0.69974524]]


**Calculate similarity between user vector and track vectors**

Measure how similar each track is to the user's preferences by comparing their feature vectors. The measure of similarity used is called "cosine similarity".

Geometrically, it is the cosine of the angle between the two vectors. If two vectors are identical, the angle is 0 and the cosine is 1, indicating perfect similarity. If they are completely different, the angle is 90 degrees and the cosine is 0, indicating no similarity. Cosine similarity is a measure of direction, not magnitude.

In [10]:
# Calculate similarity between user vector and track vectors
similarity_matrix = cosine_similarity(track_vectors, [user_vector])
similarity_scores = similarity_matrix.flatten()
print(similarity_scores)

[ 0.97524212  0.94215563  0.96012234 ... -0.83272457 -0.85740602
 -0.81287903]


**Boost similarity scores for preferred artists**

Boost similarity scores for tracks by preferred artists. If a track is by a preferred artist, its similarity score is increased (e.g. by 30%) to make it more likely to be recommended.

In [11]:
# Create a boolean mask for matching artists
artist_matches = df['artist'].isin(preferences['artists'])  
artist_matches.head()

0    False
1    False
2    False
3    False
4    False
Name: artist, dtype: bool

In [12]:
# Show tracks for matching artists
df[artist_matches].head()

Unnamed: 0,track_id,track_name,artist,genre,valence,energy,popularity,instrumentalness
39647,0pjMTISKHTJkogN1BPZxaC,Paradise - Tiësto Remix,Coldplay,Pop,0.643,0.736,2.037447,-0.317956
91634,6fxVffaTuwjgEk5h9QyRjy,Photograph,Ed Sheeran,Pop,0.201,0.379,2.853057,-0.698503
91636,1Slwb6dOYkBlWal1PGtnNg,Thinking out Loud,Ed Sheeran,Pop,0.591,0.445,2.853057,-0.699768
91649,1fu5IQSRgPxJL2OTP7FVLW,I See Fire,Ed Sheeran,Pop,0.204,0.0519,2.476621,-0.699768
91680,3Th56VIq2sEaEmPPETu7p5,All of the Stars,Ed Sheeran,Pop,0.287,0.557,2.162925,-0.699272


In [13]:
# Boost factor - this can be tweaked.
# A factor of 1.3 means that a preferred artist is 30% more likely to be recommended.
boost_factor = 1.3

# Boost scores for tracks by matching artists
similarity_scores *= (artist_matches.values * (boost_factor - 1)) + 1
print(similarity_scores)

[ 0.97524212  0.94215563  0.96012234 ... -0.83272457 -0.85740602
 -0.81287903]


**Define current and target moods (New)**

Set the values for the user's current mood and the target mood. We will be trying to shift the user's mood over a number of steps. Based on these values, calculate the per-step mood adjustments required.

In [14]:
# Set starting mood values
start_valence = 0.3
start_energy = 0.2

# Set target mood values
target_valence = 0.8
target_energy = 0.6

# Set number of steps (recommendations)
num_steps = 10

# Calculate per-step mood adjustments
val_adj = (target_valence - start_valence) / num_steps
nrg_adj = (target_energy - start_energy) / num_steps
print(f"Per-step mood adjustments: valence += {val_adj:.2f}, energy += {nrg_adj:.2f}")

Per-step mood adjustments: valence += 0.05, energy += 0.04


**Extract track mood vectors (New)**

Extract the mood features for each track.


In [15]:
# Extract valence and energy values from all tracks
track_mood_vectors = df[["valence", "energy"]].values
print(track_mood_vectors)

[[0.139  0.303 ]
 [0.515  0.454 ]
 [0.145  0.234 ]
 ...
 [0.0351 0.44  ]
 [0.202  0.405 ]
 [0.857  0.861 ]]


**Create the user mood vector for Step n (New)**

Calculate the user's target energy and valence for this step. This step is repeated up to the number of recommendations needed.

In [16]:
# Set step number (n)
step = 1

# Gradually adjust target mood at each recommendation step
step_valence = start_valence + val_adj * step
step_energy = start_energy + nrg_adj * step
step_vector = np.array([step_valence, step_energy])
print(step_vector)

[0.35 0.24]


**Calculate similarity between user mood vector and track mood vectors (New)**

Measure how similar each track is to the user's mood by comparing their feature vectors. The measure of similarity used is called "Euclidean distance".

Geometrically, it is the straight-line distance between the two vectors. If two vectors are identical, the distance is 0, indicating perfect similarity. If they are completely different, the distance is the maximum possible value, indicating no similarity.  Euclidean distance is a measure of magnitude, not direction.  This is appropriate for comparing features like mood, because the intensity of the feature matters.

In [17]:
# Calculate the Euclidean distance between the track mood and the current step's target mood
distances = np.linalg.norm(track_vectors - step_vector, axis=1)
print(distances)

[2.90994049 1.87690435 2.26771205 ... 1.67042892 1.78037764 1.62363389]


In [18]:
# Normalize distances to compute mood closeness (1 = exact match, 0 = no closeness).
# Assumes valence and energy are both in [0, 1], so the maximum possible distance is sqrt(2).
closeness = np.clip(1 - (distances / np.sqrt(2)), 0, 1)
print(closeness)

[0. 0. 0. ... 0. 0. 0.]


**Combine scores for mood and non-mood features (New)**

Combine the similarity scores with the mood closeness scores to get the final scores for each track.

In [19]:
# Combine similarity scores with mood closeness
combined_scores = 0.6 * similarity_scores + 0.4 * closeness
print(combined_scores)

[ 0.58514527  0.56529338  0.5760734  ... -0.49963474 -0.51444361
 -0.48772742]


In [20]:
# Select best track for this step
best_index = int(np.argmax(combined_scores))
recommendation1 = df.iloc[best_index]
print(recommendation1)

track_id              4fZV1JJngUyBJFuGpUHoLx
track_name          You Could Be Mine - Live
artist                                 Slash
genre                                  Metal
valence                                0.419
energy                                 0.964
popularity                          0.406228
instrumentalness                    0.077492
Name: 182374, dtype: object
