# Music Recommendation System Project

In this project, we will build a music recommendation system using machine learning techniques. The goal is to create a model that can recommend music based on similarity with another song the user inputs.

We have a large dataset in CSV format containing information about music tracks. This dataset includes features such as artists, album name, danceability, energy, and genre, among others. We will try to find the best way to make accurate recommendations, through exploring the data carefully, choosing the right features to use, and finally trying and comparing the accuracy of different algorithms.

In [185]:
# Import necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

# Set the style for data visualization
sns.set(style="whitegrid")

# Load the dataset (replace 'your_dataset.csv' with the actual file path)
data = pd.read_csv("dataset.csv")

data.head()

Unnamed: 0.1,Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
0,0,5SuOikwiRyPMVoIQDJUgSV,Gen Hoshino,Comedy,Comedy,73,230666,False,0.676,0.461,...,-6.746,0,0.143,0.0322,1e-06,0.358,0.715,87.917,4,acoustic
1,1,4qPNDBW1i3p13qLCt0Ki3A,Ben Woodward,Ghost (Acoustic),Ghost - Acoustic,55,149610,False,0.42,0.166,...,-17.235,1,0.0763,0.924,6e-06,0.101,0.267,77.489,4,acoustic
2,2,1iJBSr7s7jYXzM8EGcbK5b,Ingrid Michaelson;ZAYN,To Begin Again,To Begin Again,57,210826,False,0.438,0.359,...,-9.734,1,0.0557,0.21,0.0,0.117,0.12,76.332,4,acoustic
3,3,6lfxq3CG4xtTiEg7opyCyx,Kina Grannis,Crazy Rich Asians (Original Motion Picture Sou...,Can't Help Falling In Love,71,201933,False,0.266,0.0596,...,-18.515,1,0.0363,0.905,7.1e-05,0.132,0.143,181.74,3,acoustic
4,4,5vjLSffimiIP26QG5WcN2K,Chord Overstreet,Hold On,Hold On,82,198853,False,0.618,0.443,...,-9.681,1,0.0526,0.469,0.0,0.0829,0.167,119.949,4,acoustic


In [186]:
missing_values = data.isnull().sum()
print("Missing Values:")
missing_values

Missing Values:


Unnamed: 0          0
track_id            0
artists             1
album_name          1
track_name          1
popularity          0
duration_ms         0
explicit            0
danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
acousticness        0
instrumentalness    0
liveness            0
valence             0
tempo               0
time_signature      0
track_genre         0
dtype: int64

In [187]:
data.dropna(subset=['artists', 'album_name', 'track_name'], inplace=True)
data.drop_duplicates(subset=['track_id'], inplace=True)

data.reset_index(drop=True, inplace=True)

missing_values = data.isnull().sum()

print("Missing Values:")
missing_values

Missing Values:


Unnamed: 0          0
track_id            0
artists             0
album_name          0
track_name          0
popularity          0
duration_ms         0
explicit            0
danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
acousticness        0
instrumentalness    0
liveness            0
valence             0
tempo               0
time_signature      0
track_genre         0
dtype: int64

In [188]:
data.describe()

Unnamed: 0.1,Unnamed: 0,popularity,duration_ms,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
count,89740.0,89740.0,89740.0,89740.0,89740.0,89740.0,89740.0,89740.0,89740.0,89740.0,89740.0,89740.0,89740.0,89740.0,89740.0
mean,53479.005739,33.198808,229144.4,0.562166,0.634458,5.28353,-8.498994,0.636973,0.087442,0.328285,0.173415,0.216971,0.469474,122.058134,3.897426
std,33410.141924,20.58064,112945.8,0.176692,0.256606,3.559912,5.221518,0.480875,0.113278,0.338321,0.323849,0.194885,0.262864,30.117651,0.453437
min,0.0,0.0,8586.0,0.0,0.0,0.0,-49.531,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,23766.75,19.0,173040.0,0.45,0.457,2.0,-10.32225,0.0,0.036,0.0171,0.0,0.0982,0.249,99.26275,4.0
50%,50680.5,33.0,213295.5,0.576,0.676,5.0,-7.185,1.0,0.0489,0.188,5.8e-05,0.132,0.457,122.013,4.0
75%,80618.5,49.0,264293.0,0.692,0.853,8.0,-5.108,1.0,0.0859,0.625,0.097625,0.279,0.682,140.077,4.0
max,113999.0,100.0,5237295.0,0.985,1.0,11.0,4.532,1.0,0.965,0.996,1.0,1.0,0.995,243.372,5.0


In [189]:
data.drop(["track_id", "Unnamed: 0", "artists", "album_name", "track_name", "key", "time_signature", "mode", "track_genre"], axis=1, inplace=True)

data

Unnamed: 0,popularity,duration_ms,explicit,danceability,energy,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo
0,73,230666,False,0.676,0.4610,-6.746,0.1430,0.0322,0.000001,0.3580,0.7150,87.917
1,55,149610,False,0.420,0.1660,-17.235,0.0763,0.9240,0.000006,0.1010,0.2670,77.489
2,57,210826,False,0.438,0.3590,-9.734,0.0557,0.2100,0.000000,0.1170,0.1200,76.332
3,71,201933,False,0.266,0.0596,-18.515,0.0363,0.9050,0.000071,0.1320,0.1430,181.740
4,82,198853,False,0.618,0.4430,-9.681,0.0526,0.4690,0.000000,0.0829,0.1670,119.949
...,...,...,...,...,...,...,...,...,...,...,...,...
89735,21,384999,False,0.172,0.2350,-16.393,0.0422,0.6400,0.928000,0.0863,0.0339,125.995
89736,22,385000,False,0.174,0.1170,-18.318,0.0401,0.9940,0.976000,0.1050,0.0350,85.239
89737,22,271466,False,0.629,0.3290,-10.895,0.0420,0.8670,0.000000,0.0839,0.7430,132.378
89738,41,283893,False,0.587,0.5060,-10.889,0.0297,0.3810,0.000000,0.2700,0.4130,135.960


Here, I chose to drop some features that seemed irrelevant. For the first two, it is because they are randomly generated IDs, which would interfere with the model, for all the names, it is because liking a song by an artist doesn't mean you will like songs by artists with similar names, and same for albums and titles, for key and time signature, it is because most people do not care what key/time signature their music is in, for mode, I am not sure what that feature symbolized, so I chose to remove them. Finally, it seems there was a problem in my data, as the every track had a genre of "acoustic".

In [190]:
correlation_matrix = data.corr()

correlations = correlation_matrix.abs().unstack().sort_values(ascending=False)

# Filter out correlations with themselves (diagonal) and keep only the top correlated pairs
top_correlations = correlations[correlations != 1.0]

print("Top Correlated Feature Pairs:")
top_correlations

# all this looks alright, no feature seem to be correlated apart from what can be expected (energy/loudness/acousticness, mainly)

Top Correlated Feature Pairs:


loudness      energy          0.758774
energy        loudness        0.758774
              acousticness    0.732569
acousticness  energy          0.732569
              loudness        0.582664
                                ...   
tempo         speechiness     0.004033
explicit      valence         0.002709
valence       explicit        0.002709
loudness      duration_ms     0.000360
duration_ms   loudness        0.000360
Length: 132, dtype: float64

In [191]:
scaler = StandardScaler()

data[data.columns] = scaler.fit_transform(data[data.columns])

data[data.columns]

Unnamed: 0,popularity,duration_ms,explicit,danceability,energy,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo
0,1.933925,0.013472,-0.306447,0.644253,-0.675975,0.335727,0.490458,-0.875166,-0.535482,0.723656,0.934047,-1.133599
1,1.059312,-0.704186,-0.306447,-0.804604,-1.825602,-1.673087,-0.098364,1.760810,-0.535468,-0.595078,-0.770269,-1.479843
2,1.156491,-0.162188,-0.306447,-0.702731,-1.073473,-0.236524,-0.280219,-0.349626,-0.535485,-0.512978,-1.329497,-1.518259
3,1.836746,-0.240925,-0.306447,-1.676182,-2.240247,-1.918228,-0.451480,1.704650,-0.535266,-0.436009,-1.241999,1.981635
4,2.371232,-0.268195,-0.306447,0.315996,-0.746122,-0.226373,-0.307585,0.415925,-0.535485,-0.687954,-1.150696,-0.070030
...,...,...,...,...,...,...,...,...,...,...,...,...
89735,-0.592735,1.379914,-0.306447,-2.208184,-1.556706,-1.511831,-0.399395,0.921365,2.330062,-0.670508,-1.657046,0.130717
89736,-0.544146,1.379923,-0.306447,-2.196865,-2.016557,-1.880499,-0.417934,1.967716,2.478280,-0.574553,-1.652861,-1.222517
89737,-0.544146,0.374710,-0.306447,0.378251,-1.190384,-0.458874,-0.401161,1.592330,-0.535485,-0.682823,1.040567,0.342654
89738,0.379057,0.484736,-0.306447,0.140548,-0.500608,-0.457725,-0.509744,0.155815,-0.535485,0.272105,-0.214844,0.461588


In [192]:
df = pd.read_csv("dataset.csv")

df.dropna(subset=['artists', 'album_name', 'track_name'], inplace=True)

df.reset_index(drop=True, inplace=True)

In [193]:
selected_features = data[data.columns]  # Replace with your chosen features

normalized_features = (selected_features - selected_features.mean()) / selected_features.std()

k = 5  # Number of neighbors to consider
knn_model = NearestNeighbors(n_neighbors=k, metric='euclidean')  # Use Euclidean distance

knn_model.fit(normalized_features)

# Choose a target item for which you want to make recommendations
target_track_name = '21 Guns'
target_artist = 'Green Day'
target_row = df[(df['track_name'] == target_track_name) & (df['artists'] == target_artist)].index[0]

distances, indices = knn_model.kneighbors([normalized_features.iloc[target_row]])

recommended_item_indices = indices[0][1:]

print("Recommended Items:")
print(df.iloc[recommended_item_indices])

Recommended Items:
       Unnamed: 0                track_id        artists  \
85986       85987  6TSbdwlDyKMiExEpHvMfWp         Carajo   
23905       23905  1JybHYAOzP0sWjzplbruqJ  Viva La Panda   
85954       85955  1xCQa1dJC3jIXGHaTo7273   GOING STEADY   
86004       86005  2uGos1lTJU2Qd6UpPyRJSP      blink-182   

                    album_name          track_name  popularity  duration_ms  \
85986  Hoy Como Ayer (En Vivo)         Constrictor          36       280426   
23905       California in Rain  California in Rain          46       181016   
85954                    さくらの唄          もしも君が泣くならば          34       226400   
86004    Обратно в клас - rock          I Miss You           4       227813   

       explicit  danceability  energy  ...  loudness  mode  speechiness  \
85986     False         0.628   0.801  ...    -4.424     1       0.0312   
23905     False         0.627   0.653  ...    -8.117     0       0.0923   
85954      True         0.303   0.961  ...    -3.632     1 

