## Basic recommendation algorithm ##


**Import libraries**

Import the required libraries for data analysis and machine learning.


In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

**Load data**

Load selected columns from a CSV file into a DataFrame, rename some columns for consistency, and display the first few rows of the resulting data.


In [2]:
# Select relevant columns 
usecols = ["track_id", "track_name", "artist_name", "new_genre", "popularity", "energy", "instrumentalness", "valence"]

# Load data file into a DataFrame
df = pd.read_csv('../datasets/track_data.csv', usecols=usecols, dtype={'popularity': 'float64'})

# Rename columns
df.rename(columns={'artist_name': 'artist', 'new_genre': 'genre'}, inplace=True)

# Display the first few rows 
df.head()

Unnamed: 0,artist,track_name,track_id,popularity,energy,instrumentalness,valence,genre
0,Jason Mraz,I Won't Give Up,53QF56cjZA9RTuuMZDrSA6,68.0,0.303,0.0,0.139,Folk
1,Jason Mraz,93 Million Miles,1s8tP3jP4GZcyHDsjvw218,50.0,0.454,1.4e-05,0.515,Folk
2,Joshua Hyslop,Do Not Let Me Go,7BRCa8MPiyuvr2VU3O9W0F,57.0,0.234,5e-05,0.145,Folk
3,Boyce Avenue,Fast Car,63wsZUhUZLlh1OsyrZq7sz,58.0,0.251,0.0,0.508,Folk
4,Andrew Belle,Sky's Still Blue,6nXIYClvJAfi6ujLiKqEq8,54.0,0.791,0.0193,0.217,Folk


**Normalise data**

Scale the numerical features in the DataFrame, so that they can be compared and treated equally.  Each feature will now have a mean of 0 and a standard deviation of 1.


In [3]:
# Normalise numerical features
numerical_features = ["popularity", "energy", "instrumentalness", "valence"]
scaler = StandardScaler()
df.loc[:, numerical_features] = scaler.fit_transform(df[numerical_features].values)
df.head()

Unnamed: 0,artist,track_name,track_id,popularity,energy,instrumentalness,valence,genre
0,Jason Mraz,I Won't Give Up,53QF56cjZA9RTuuMZDrSA6,3.123399,-1.244617,-0.691229,-1.178925,Folk
1,Jason Mraz,93 Million Miles,1s8tP3jP4GZcyHDsjvw218,1.990293,-0.686393,-0.691191,0.221349,Folk
2,Joshua Hyslop,Do Not Let Me Go,7BRCa8MPiyuvr2VU3O9W0F,2.430946,-1.4997,-0.691092,-1.15658,Folk
3,Boyce Avenue,Fast Car,63wsZUhUZLlh1OsyrZq7sz,2.493896,-1.436853,-0.691229,0.19528,Folk
4,Andrew Belle,Sky's Still Blue,6nXIYClvJAfi6ujLiKqEq8,2.242095,0.559444,-0.638363,-0.888443,Folk


**Setup user preferences**

Setup user preferences for the recommendation algorithm.

In [4]:
# Setup user preferences
preferences = {
    "popularity": 80,
    "energy": 0.5,
    "instrumentalness": 0.1,
    "valence": 0.5,
    "genres": ['Pop', 'Rock'],
    "artists": ['Ed Sheeran', 'Coldplay'],
}

**Filter dataset by genre**

Filter the dataset based on the user's genre preferences.

In [5]:
# Create a boolean mask for matching genres
genre_mask = df['genre'].isin(preferences['genres'])  
genre_mask.head()

0    False
1    False
2    False
3    False
4    False
Name: genre, dtype: bool

In [6]:
# Use the mask to filter and copy the DataFrame
filtered_df = df[genre_mask].copy()                   

# Display the filtered DataFrame
filtered_df.head()

Unnamed: 0,artist,track_name,track_id,popularity,energy,instrumentalness,valence,genre
1770,Neon Trees,Everybody Talks,2iUmqdfGZcHIhS3b9E9EWq,3.689952,1.051125,-0.691229,1.003417,Rock
1771,Deftones,Rosemary,4FEr6dIdH6EqLKR0jB560J,3.31225,-0.098595,-0.417311,-1.409076,Rock
1772,Black Veil Brides,In The End,1RTYixE1DD3g3upEpmCJpa,3.123399,1.106578,-0.675095,-0.691064,Rock
1773,Thousand Foot Krutch,Courtesy Call,0AOmbw8AwDnwXhHC3OhdVB,3.31225,-0.006173,-0.691229,-0.03934,Rock
1774,Deftones,Entombed,4bLCPfBLKlqiONo6TALTh5,2.934548,0.418964,-0.340614,-1.074649,Rock


**Create user vector**

Create a list of the user's preferences, focusing on the numerical features only. We use only numerical features because they can be **directly** compared using math, while non-numeric features like genre or artist cannot.

In [7]:
# Create user vector
user_vector = [preferences[feature] for feature in numerical_features]
print(user_vector)

[80, 0.5, 0.1, 0.5]


**Normalise user vector**

Scale the user's vector, so that it can be compared to the song data.

In [8]:
# Scale the user vector using the same scaler as the dataset
user_vector_scaled = scaler.transform([user_vector])[0]
print(user_vector_scaled)

[ 3.87880338 -0.51633828 -0.41731084  0.16548708]


**Extract track vectors**

Extract the numerical features for each track.

In [9]:
# Extract track vectors
track_vectors = df[numerical_features].values
print(track_vectors)

[[ 3.1233993  -1.24461718 -0.69122871 -1.17892497]
 [ 1.99029317 -0.68639325 -0.69119118  0.22134908]
 [ 2.43094555 -1.49969964 -0.69109175 -1.15658017]
 ...
 [-1.03132315 -0.73814911 -0.68265508 -1.5658624 ]
 [-1.15722383 -0.86753877 -0.69105532 -0.94430458]
 [-0.96837281  0.81822357 -0.69120562  1.4950026 ]]


**Calculate similarity between user vector and track vectors**

Measure how similar each track is to the user's preferences by comparing their feature vectors. The measure of similarity used is called "cosine similarity".

Geometrically, it is the cosine of the angle between the two vectors. If two vectors are identical, the angle is 0 and the cosine is 1, indicating perfect similarity. If they are completely different, the angle is 90 degrees and the cosine is 0, indicating no similarity. Cosine similarity is a measure of orientation, not magnitude.

In [10]:
# Calculate similarity between user vector and track vectors
similarity_matrix = cosine_similarity(track_vectors, [user_vector])
similarity_scores = similarity_matrix.flatten()
print(similarity_scores)

[ 0.8561433   0.89201651  0.7641776  ... -0.49189858 -0.62855837
 -0.45934824]


**Boost similarity scores for preferred artists**

Boost similarity scores for tracks by preferred artists. If a track is by a preferred artist, its similarity score is increased (e.g. by 30%) to make it more likely to be recommended.

In [11]:
# Create a boolean mask for matching artists
artist_matches = df['artist'].isin(preferences['artists'])  
artist_matches.head()

0    False
1    False
2    False
3    False
4    False
Name: artist, dtype: bool

In [12]:
# Show tracks for matching artists
df[artist_matches].head()

Unnamed: 0,artist,track_name,track_id,popularity,energy,instrumentalness,valence,genre
40626,Coldplay,Paradise - Tiësto Remix,0pjMTISKHTJkogN1BPZxaC,2.053244,0.356118,-0.307744,0.698038,Pop
93580,Ed Sheeran,Photograph,6fxVffaTuwjgEk5h9QyRjy,2.871598,-0.963657,-0.689958,-0.948029,Pop
93582,Ed Sheeran,Thinking out Loud,1Slwb6dOYkBlWal1PGtnNg,2.871598,-0.719665,-0.691229,0.504383,Pop
93595,Ed Sheeran,I See Fire,1fu5IQSRgPxJL2OTP7FVLW,2.493896,-2.172896,-0.691229,-0.936856,Pop
93626,Ed Sheeran,All of the Stars,3Th56VIq2sEaEmPPETu7p5,2.179144,-0.305618,-0.69073,-0.627753,Pop


In [101]:
# Boost factor - this can be tweaked.
# A factor of 1.3 means that a preferred artist is 30% more likely to be recommended.
boost_factor = 1.3

# Boost scores for tracks by matching artists
similarity_scores *= (artist_matches.values * (boost_factor - 1)) + 1
print(similarity_scores)

[ 0.8561433   0.89201651  0.7641776  ... -0.49189858 -0.62855837
 -0.45934824]


**Recommend songs**

Return a list of the top N songs based on the similarity scores.


In [14]:
# Recommend top N songs based on the similarity scores
top_n = 10

# Get the indices of the top N scores
top_track_indices = np.argsort(-similarity_scores)[:top_n]
print(top_track_indices)

[488046 139636 311664 532028 295510 420120 532233 597091 285574 130860]


In [15]:
# Select the top N recommendations
recommendations = df.iloc[top_track_indices][["track_name", "artist", "track_id"]]
print(recommendations)

                                  track_name                 artist  \
488046                               CHARGER                   ELIO   
139636           Cut Your Teeth - Kygo Remix         Kyla La Grange   
311664                     City on the Water        The Stone Foxes   
532028                              Beguiled  The Smashing Pumpkins   
295510  Cola - Live from the London Aquarium              CamelPhat   
420120                              Magnetik           Malandra Jr.   
532233               We're Not In Orbit Yet…           Broken Bells   
597091              Not Dark Yet - Version 1              Bob Dylan   
285574                            Mumble Rap                  Belly   
130860                                Poison          The Symposium   

                      track_id  
488046  0iBBOvVQ8QCK7F95boCn3C  
139636  1y4Kln6VEjQMpmHW7j9GeY  
311664  31IeYQlbVtlKss86WXBkLO  
532028  6rBiMyaGB1ZJQnxb01FkPG  
295510  58JJFLeOIq7WOYyrDdc8tr  
420120  4P8ZoEphHmwI