## Final recommendation algorithm ##

Steps that are new or different from the basic algorithm are marked with (New) or (Updated).

**Import libraries**

Import the required libraries for data analysis and machine learning.


In [63]:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

**Load data**

Load selected columns from a CSV file into a DataFrame, rename some columns for consistency, and display the first few rows of the resulting data.


In [64]:
# Select relevant columns 
usecols = ["track_id", "track_name", "artist_name", "new_genre", "popularity", "energy", "instrumentalness", "valence"]

# Load data file into a DataFrame
df = pd.read_csv('../static/dataset/track_data.csv', usecols=usecols, dtype={'popularity': 'float64'})

# Rename columns
df.rename(columns={'artist_name': 'artist', 'new_genre': 'genre'}, inplace=True)

# Display the first few rows 
df.head()

Unnamed: 0,artist,track_name,track_id,popularity,energy,instrumentalness,valence,genre
0,Jason Mraz,I Won't Give Up,53QF56cjZA9RTuuMZDrSA6,68.0,0.303,0.0,0.139,Folk
1,Jason Mraz,93 Million Miles,1s8tP3jP4GZcyHDsjvw218,50.0,0.454,1.4e-05,0.515,Folk
2,Joshua Hyslop,Do Not Let Me Go,7BRCa8MPiyuvr2VU3O9W0F,57.0,0.234,5e-05,0.145,Folk
3,Boyce Avenue,Fast Car,63wsZUhUZLlh1OsyrZq7sz,58.0,0.251,0.0,0.508,Folk
4,Andrew Belle,Sky's Still Blue,6nXIYClvJAfi6ujLiKqEq8,54.0,0.791,0.0193,0.217,Folk


**Normalise data (Updated)**

Scale the non-mood numerical features in the DataFrame, so that they can be compared and treated equally.  Each feature will now have a mean of 0 and a standard deviation of 1.

Mood features (valence, energy) are processed separately.

In [65]:
# Normalise numerical features
numerical_features = ["popularity", "instrumentalness"]
scaler = StandardScaler()
df.loc[:, numerical_features] = scaler.fit_transform(df[numerical_features].values)
df.head()

Unnamed: 0,artist,track_name,track_id,popularity,energy,instrumentalness,valence,genre
0,Jason Mraz,I Won't Give Up,53QF56cjZA9RTuuMZDrSA6,3.123399,0.303,-0.691229,0.139,Folk
1,Jason Mraz,93 Million Miles,1s8tP3jP4GZcyHDsjvw218,1.990293,0.454,-0.691191,0.515,Folk
2,Joshua Hyslop,Do Not Let Me Go,7BRCa8MPiyuvr2VU3O9W0F,2.430946,0.234,-0.691092,0.145,Folk
3,Boyce Avenue,Fast Car,63wsZUhUZLlh1OsyrZq7sz,2.493896,0.251,-0.691229,0.508,Folk
4,Andrew Belle,Sky's Still Blue,6nXIYClvJAfi6ujLiKqEq8,2.242095,0.791,-0.638363,0.217,Folk


**Setup user preferences (Updated)**

Setup user preferences for the recommendation algorithm.

In [66]:
# Setup user preferences
preferences = {
    "popularity": 80,
    "instrumentalness": 0.1,
    "genres": ['Pop', 'Rock'],
    "artists": ['Ed Sheeran', 'Coldplay'],
}

**Filter dataset by genre**

Filter the dataset based on the user's genre preferences.

In [67]:
# Create a boolean mask for matching genres
genre_mask = df['genre'].isin(preferences['genres'])  
genre_mask.head()

0    False
1    False
2    False
3    False
4    False
Name: genre, dtype: bool

In [68]:
# Use the mask to filter and copy the DataFrame
filtered_df = df[genre_mask].copy()                   

# Display the filtered DataFrame
filtered_df.head()

Unnamed: 0,artist,track_name,track_id,popularity,energy,instrumentalness,valence,genre
1770,Neon Trees,Everybody Talks,2iUmqdfGZcHIhS3b9E9EWq,3.689952,0.924,-0.691229,0.725,Rock
1771,Deftones,Rosemary,4FEr6dIdH6EqLKR0jB560J,3.31225,0.613,-0.417311,0.0772,Rock
1772,Black Veil Brides,In The End,1RTYixE1DD3g3upEpmCJpa,3.123399,0.939,-0.675095,0.27,Rock
1773,Thousand Foot Krutch,Courtesy Call,0AOmbw8AwDnwXhHC3OhdVB,3.31225,0.638,-0.691229,0.445,Rock
1774,Deftones,Entombed,4bLCPfBLKlqiONo6TALTh5,2.934548,0.753,-0.340614,0.167,Rock


**Create user vector**

Create a list of the user's preferences, focusing on the selected numerical features only. We use only numerical features because they can be directly compared using math, while non-numeric features like genre or artist cannot.

In [69]:
# Create user vector
user_vector = [preferences[feature] for feature in numerical_features]
print(user_vector)

[80, 0.1]


**Normalise user vector**

Scale the user's vector, so that it can be compared to the song data.

In [70]:
# Scale the user vector using the same scaler as the dataset
user_vector_scaled = scaler.transform([user_vector])[0]
print(user_vector_scaled)

[ 3.87880338 -0.41731084]


**Extract track vectors**

Extract the numerical features for each track.

In [71]:
# Extract track vectors
track_vectors = df[numerical_features].values
print(track_vectors)

[[ 3.1233993  -0.69122871]
 [ 1.99029317 -0.69119118]
 [ 2.43094555 -0.69109175]
 ...
 [-1.03132315 -0.68265508]
 [-1.15722383 -0.69105532]
 [-0.96837281 -0.69120562]]


**Calculate similarity between user vector and track vectors**

Measure how similar each track is to the user's preferences by comparing their feature vectors. The measure of similarity used is called "cosine similarity".

Geometrically, it is the cosine of the angle between the two vectors. If two vectors are identical, the angle is 0 and the cosine is 1, indicating perfect similarity. If they are completely different, the angle is 90 degrees and the cosine is 0, indicating no similarity. Cosine similarity is a measure of direction, not magnitude.

In [72]:
# Calculate similarity between user vector and track vectors
similarity_matrix = cosine_similarity(track_vectors, [user_vector])
similarity_scores = similarity_matrix.flatten()
print(similarity_scores)

[ 0.97610515  0.94424562  0.96154262 ... -0.83456128 -0.85920453
 -0.81465349]


**Boost similarity scores for preferred artists**

Boost similarity scores for tracks by preferred artists. If a track is by a preferred artist, its similarity score is increased (e.g. by 30%) to make it more likely to be recommended.

In [73]:
# Create a boolean mask for matching artists
artist_matches = df['artist'].isin(preferences['artists'])  
artist_matches.head()

0    False
1    False
2    False
3    False
4    False
Name: artist, dtype: bool

In [74]:
# Show tracks for matching artists
df[artist_matches].head()

Unnamed: 0,artist,track_name,track_id,popularity,energy,instrumentalness,valence,genre
40626,Coldplay,Paradise - Tiësto Remix,0pjMTISKHTJkogN1BPZxaC,2.053244,0.736,-0.307744,0.643,Pop
93580,Ed Sheeran,Photograph,6fxVffaTuwjgEk5h9QyRjy,2.871598,0.379,-0.689958,0.201,Pop
93582,Ed Sheeran,Thinking out Loud,1Slwb6dOYkBlWal1PGtnNg,2.871598,0.445,-0.691229,0.591,Pop
93595,Ed Sheeran,I See Fire,1fu5IQSRgPxJL2OTP7FVLW,2.493896,0.0519,-0.691229,0.204,Pop
93626,Ed Sheeran,All of the Stars,3Th56VIq2sEaEmPPETu7p5,2.179144,0.557,-0.69073,0.287,Pop


In [76]:
# Boost factor - this can be tweaked.
# A factor of 1.3 means that a preferred artist is 30% more likely to be recommended.
boost_factor = 1.3

# Boost scores for tracks by matching artists
similarity_scores *= (artist_matches.values * (boost_factor - 1)) + 1
print(similarity_scores)

[ 0.97610515  0.94424562  0.96154262 ... -0.83456128 -0.85920453
 -0.81465349]


**Define current and target moods (New)**

Set the values for the user's current mood and the target mood. We will be trying to shift the user's mood over a number of steps. Based on these values, calculate the per-step mood adjustments required.

In [77]:
# Set starting mood values
start_valence = 0.3
start_energy = 0.2

# Set target mood values
target_valence = 0.8
target_energy = 0.6

# Set number of steps (recommendations)
num_steps = 10

# Calculate per-step mood adjustments
val_adj = (target_valence - start_valence) / num_steps
nrg_adj = (target_energy - start_energy) / num_steps
print(f"Per-step mood adjustments: valence += {val_adj:.2f}, energy += {nrg_adj:.2f}")

Per-step mood adjustments: valence += 0.05, energy += 0.04


**Extract track mood vectors (New)**

Extract the mood features for each track.


In [78]:
# Extract valence and energy values from all tracks
track_mood_vectors = df[["valence", "energy"]].values
print(track_mood_vectors)

[[0.139  0.303 ]
 [0.515  0.454 ]
 [0.145  0.234 ]
 ...
 [0.0351 0.44  ]
 [0.202  0.405 ]
 [0.857  0.861 ]]


**Create the user mood vector for Step n (New)**

Calculate the user's target energy and valence for this step. This step is repeated up to the number of recommendations needed.

In [81]:
# Set step number (n)
step = 1

# Gradually adjust target mood at each recommendation step
step_valence = start_valence + val_adj * step
step_energy = start_energy + nrg_adj * step
step_vector = np.array([step_valence, step_energy])
print(step_vector)

[0.35 0.24]


**Calculate similarity between user mood vector and track mood vectors (New)**

Measure how similar each track is to the user's mood by comparing their feature vectors. The measure of similarity used is called "Euclidean distance".

Geometrically, it is the straight-line distance between the two vectors. If two vectors are identical, the distance is 0, indicating perfect similarity. If they are completely different, the distance is the maximum possible value, indicating no similarity.  Euclidean distance is a measure of magnitude, not direction.  This is appropriate for comparing features like mood, because the intensity of the feature matters.

In [82]:
# Calculate the Euclidean distance between the track mood and the current step's target mood
distances = np.linalg.norm(track_vectors - step_vector, axis=1)
print(distances)

[2.925565   1.88618099 2.27975136 ... 1.66112794 1.77160596 1.61407892]


In [84]:
# Normalize distances to compute mood closeness (1 = exact match, 0 = no closeness).
# Assumes valence and energy are both in [0, 1], so the maximum possible distance is sqrt(2).
closeness = np.clip(1 - (distances / np.sqrt(2)), 0, 1)
print(closeness)

[0. 0. 0. ... 0. 0. 0.]


**Combine scores for mood and non-mood features (New)**

Combine the similarity scores with the mood closeness scores to get the final scores for each track.

In [85]:
# Combine similarity scores with mood closeness
combined_scores = 0.6 * similarity_scores + 0.4 * closeness
print(combined_scores)

[ 0.58566309  0.56654737  0.57692557 ... -0.50073677 -0.51552272
 -0.48879209]


In [86]:
# Select best track for this step
best_index = int(np.argmax(combined_scores))
recommendation1 = df.iloc[best_index]
print(recommendation1)

artist                            Coldplay
track_name                             Ink
track_id            6c6W25YoDGjTq3qSPOga5t
popularity                        2.619797
energy                               0.705
instrumentalness                 -0.351571
valence                              0.696
genre                                  Pop
Name: 146942, dtype: object
