# Music Recommendation System Project

In this project, we will build a music recommendation system using machine learning techniques. The goal is to create a model that recommends music similar to a song the user provides.

We have a large dataset in CSV format containing information about music tracks, including the artists, album name, danceability, energy, and genre, among other features. To make accurate recommendations, we will explore the data carefully, choose the right features, and then try different algorithms and compare their accuracy.

In [81]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

# Set the style for data visualization
sns.set(style="whitegrid")

# Load the dataset (adjust the path if 'dataset.csv' is stored elsewhere)
data = pd.read_csv("dataset.csv")

data.head()

,Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
0,0,5SuOikwiRyPMVoIQDJUgSV,Gen Hoshino,Comedy,Comedy,73,230666,False,0.676,0.461,...,-6.746,0,0.143,0.0322,1e-06,0.358,0.715,87.917,4,acoustic
1,1,4qPNDBW1i3p13qLCt0Ki3A,Ben Woodward,Ghost (Acoustic),Ghost - Acoustic,55,149610,False,0.42,0.166,...,-17.235,1,0.0763,0.924,6e-06,0.101,0.267,77.489,4,acoustic
2,2,1iJBSr7s7jYXzM8EGcbK5b,Ingrid Michaelson;ZAYN,To Begin Again,To Begin Again,57,210826,False,0.438,0.359,...,-9.734,1,0.0557,0.21,0.0,0.117,0.12,76.332,4,acoustic
3,3,6lfxq3CG4xtTiEg7opyCyx,Kina Grannis,Crazy Rich Asians (Original Motion Picture Sou...,Can't Help Falling In Love,71,201933,False,0.266,0.0596,...,-18.515,1,0.0363,0.905,7.1e-05,0.132,0.143,181.74,3,acoustic
4,4,5vjLSffimiIP26QG5WcN2K,Chord Overstreet,Hold On,Hold On,82,198853,False,0.618,0.443,...,-9.681,1,0.0526,0.469,0.0,0.0829,0.167,119.949,4,acoustic


In [82]:
missing_values = data.isnull().sum()
print("Missing Values:")
missing_values

Missing Values:


Unnamed: 0          0
track_id            0
artists             1
album_name          1
track_name          1
popularity          0
duration_ms         0
explicit            0
danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
acousticness        0
instrumentalness    0
liveness            0
valence             0
tempo               0
time_signature      0
track_genre         0
dtype: int64

In [83]:
missing_rows = data[data['artists'].isnull() | data['album_name'].isnull() | data['track_name'].isnull()]

print("Rows with Missing Values:")
missing_rows

Rows with Missing Values:


,Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
65900,65900,1kR4gIb7nGxHPI3D2ifs59,,,,0,0,False,0.501,0.583,...,-9.46,0,0.0605,0.69,0.00396,0.0747,0.734,138.391,4,k-pop


In [84]:
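# Drop the single row that is missing its artist, album, and track name
data.dropna(subset=['artists', 'album_name', 'track_name'], inplace=True)

# Renumber the index so it stays contiguous after the drop
data.reset_index(drop=True, inplace=True)

# Confirm that no missing values remain
missing_values = data.isnull().sum()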

print("Missing Values:")
missing_values

Missing Values:


Unnamed: 0          0
track_id            0
artists             0
album_name          0
track_name          0
popularity          0
duration_ms         0
explicit            0
danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
acousticness        0
instrumentalness    0
liveness            0
valence             0
tempo               0
time_signature      0
track_genre         0
dtype: int64

In [85]:
data.describe()

,Unnamed: 0,popularity,duration_ms,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
count,113999.0,113999.0,113999.0,113999.0,113999.0,113999.0,113999.0,113999.0,113999.0,113999.0,113999.0,113999.0,113999.0,113999.0,113999.0
mean,56999.421925,33.238827,228031.2,0.566801,0.641383,5.309126,-8.25895,0.637558,0.084652,0.314907,0.156051,0.213554,0.474066,122.147695,3.904034
std,32909.243463,22.304959,107296.1,0.173543,0.25153,3.559999,5.029357,0.480708,0.105733,0.332522,0.309556,0.190378,0.259261,29.97829,0.432623
min,0.0,0.0,8586.0,0.0,0.0,0.0,-49.531,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,28499.5,17.0,174066.0,0.456,0.472,2.0,-10.013,0.0,0.0359,0.0169,0.0,0.098,0.26,99.2185,4.0
50%,56999.0,35.0,212906.0,0.58,0.685,5.0,-7.004,1.0,0.0489,0.169,4.2e-05,0.132,0.464,122.017,4.0
75%,85499.5,50.0,261506.0,0.695,0.854,8.0,-5.003,1.0,0.0845,0.5975,0.049,0.273,0.683,140.071,4.0
max,113999.0,100.0,5237295.0,0.985,1.0,11.0,4.532,1.0,0.965,0.996,1.0,1.0,0.995,243.372,5.0


In [87]:
# Drop identifiers, names, and the features judged irrelevant for similarity
data.drop(
    columns=["track_id", "Unnamed: 0", "artists", "album_name", "track_name",
             "key", "time_signature", "mode", "track_genre"],
    inplace=True,
)

data

,popularity,duration_ms,explicit,danceability,energy,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo
0,73,230666,False,0.676,0.4610,-6.746,0.1430,0.0322,0.000001,0.3580,0.7150,87.917
1,55,149610,False,0.420,0.1660,-17.235,0.0763,0.9240,0.000006,0.1010,0.2670,77.489
2,57,210826,False,0.438,0.3590,-9.734,0.0557,0.2100,0.000000,0.1170,0.1200,76.332
3,71,201933,False,0.266,0.0596,-18.515,0.0363,0.9050,0.000071,0.1320,0.1430,181.740
4,82,198853,False,0.618,0.4430,-9.681,0.0526,0.4690,0.000000,0.0829,0.1670,119.949
...,...,...,...,...,...,...,...,...,...,...,...,...
113994,21,384999,False,0.172,0.2350,-16.393,0.0422,0.6400,0.928000,0.0863,0.0339,125.995
113995,22,385000,False,0.174,0.1170,-18.318,0.0401,0.9940,0.976000,0.1050,0.0350,85.239
113996,22,271466,False,0.629,0.3290,-10.895,0.0420,0.8670,0.000000,0.0839,0.7430,132.378
113997,41,283893,False,0.587,0.5060,-10.889,0.0297,0.3810,0.000000,0.2700,0.4130,135.960


Here, I chose to drop some features that seemed irrelevant. The first two, track_id and Unnamed: 0, are identifiers (a track ID and a row index) that say nothing about the music and would only interfere with the model. I dropped the artist, album, and track names because liking a song by one artist doesn't mean you will like songs by artists with similar names, and the same goes for albums and titles. I dropped key and time signature because most people do not care what key or time signature their music is in. I was not sure what mode represented (in Spotify-style audio features it appears to mark a track as major or minor), so I removed it as well. Finally, I dropped track_genre because every track in the preview had a genre of "acoustic". That may simply mean the file is sorted by genre, though: the row with missing values shown earlier is labelled "k-pop".
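Whether a column really holds a single value is easy to check directly, since a preview of the first rows can mislead when the file is sorted by that column. Below is a minimal sketch of such a check. The small frame `toy` and its values are made up for illustration and are not taken from the dataset; on the real data, the same calls would have to run before track_genre is dropped.

```python
import pandas as pd

# Hypothetical stand-in for the track_genre column (not the real data).
toy = pd.DataFrame({
    "track_genre": ["acoustic"] * 3 + ["k-pop"] * 2 + ["rock"],
})

# head() only shows the first rows, so a sorted column can look constant.
first_rows_unique = toy["track_genre"].head(3).nunique()

# nunique() and value_counts() look at the whole column instead.
n_genres = toy["track_genre"].nunique()
genre_counts = toy["track_genre"].value_counts()

print(first_rows_unique)  # distinct genres in the preview
print(n_genres)           # distinct genres in the whole column
print(genre_counts)
```

If the full column turns out to contain many genres, it might be worth keeping it, for example to sanity-check recommendations later.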

In [88]:
correlation_matrix = data.corr()

correlations = correlation_matrix.abs().unstack().sort_values(ascending=False)

# Filter out self-correlations (the diagonal, where the value is exactly 1.0).
# Each remaining pair still appears twice, as (a, b) and (b, a).
top_correlations = correlations[correlations != 1.0]

print("Top Correlated Feature Pairs:")
top_correlations

# All of this looks fine: no features seem to be correlated beyond what can be expected (mainly energy/loudness/acousticness)

Top Correlated Feature Pairs:


loudness      energy          0.761690
energy        loudness        0.761690
acousticness  energy          0.733908
energy        acousticness    0.733908
loudness      acousticness    0.589804
                                ...   
acousticness  speechiness     0.002184
popularity    energy          0.001053
energy        popularity      0.001053
tempo         liveness        0.000603
liveness      tempo           0.000603
Length: 132, dtype: float64
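The listing above shows every pair twice, and filtering on `!= 1.0` relies on the diagonal being exactly 1.0. One alternative, sketched here on made-up data, is to keep only the strict upper triangle of the correlation matrix. The frame `toy` and its three columns are hypothetical stand-ins for the real features; for the 12 columns used here, this approach would give 66 unique pairs instead of 132 entries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy features standing in for the real columns.
rng = np.random.default_rng(0)
energy = rng.normal(size=200)
toy = pd.DataFrame({
    "energy": energy,
    "loudness": energy + rng.normal(scale=0.5, size=200),
    "tempo": rng.normal(size=200),
})

corr = toy.corr().abs()

# Strict upper triangle: excludes the diagonal and the mirrored (b, a) pairs.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Masked-out cells become NaN; dropna() removes them regardless of how
# the installed pandas version handles NaN in stack().
unique_pairs = (
    corr.where(mask)
        .stack()
        .dropna()
        .sort_values(ascending=False)
)

print(unique_pairs)
```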

In [89]:
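# Standardize every column to zero mean and unit variance so that
# features on large scales (e.g. duration_ms) do not dominate distances
scaler = StandardScaler()

data[data.columns] = scaler.fit_transform(data[data.columns])

data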

,popularity,duration_ms,explicit,danceability,energy,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo
0,1.782624,0.024557,-0.305769,0.629239,-0.717147,0.300825,0.551843,-0.850193,-0.504111,0.758735,0.929315,-1.141854
1,0.975625,-0.730889,-0.305769,-0.845908,-1.889974,-1.784739,-0.078995,1.831744,-0.504097,-0.591216,-0.798681,-1.489708
2,1.065291,-0.160353,-0.305769,-0.742187,-1.122667,-0.293289,-0.273827,-0.315489,-0.504115,-0.507172,-1.365679,-1.528303
3,1.692957,-0.243236,-0.305769,-1.733301,-2.312987,-2.039246,-0.457309,1.774605,-0.503886,-0.428381,-1.276965,1.987857
4,2.186123,-0.271942,-0.305769,0.295026,-0.788709,-0.282751,-0.303146,0.463409,-0.504115,-0.686290,-1.184394,-0.073343
...,...,...,...,...,...,...,...,...,...,...,...,...
113994,-0.548707,1.462948,-0.305769,-2.274956,-1.615652,-1.617321,-0.401507,0.977663,2.493742,-0.668431,-1.697779,0.128337
113995,-0.503873,1.462957,-0.305769,-2.263432,-2.084782,-2.000075,-0.421369,2.042258,2.648803,-0.570205,-1.693536,-1.231186
113996,-0.503873,0.404815,-0.305769,0.358411,-1.241937,-0.524135,-0.403399,1.660327,-0.504115,-0.681038,1.037314,0.341259
113997,0.347959,0.520635,-0.305769,0.116395,-0.538241,-0.522942,-0.519731,0.198764,-0.504115,0.296495,-0.235539,0.460746
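With the features standardized, a similarity lookup of the kind described in the introduction could be sketched with scikit-learn's NearestNeighbors. This is only an illustration under assumptions, not necessarily the method used in the rest of this project: the matrix `X` holds six made-up tracks with three features each, and Euclidean distance is an arbitrary choice of metric.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Hypothetical toy matrix: 6 tracks x 3 features (danceability, energy, tempo).
X = np.array([
    [0.68, 0.46, 88.0],
    [0.42, 0.17, 77.0],
    [0.44, 0.36, 76.0],
    [0.27, 0.06, 182.0],
    [0.62, 0.44, 120.0],
    [0.67, 0.47, 90.0],
])

# Same preprocessing as above: zero mean, unit variance per feature.
X_scaled = StandardScaler().fit_transform(X)

# Ask for 3 neighbours: the query track itself plus its 2 closest tracks.
nn = NearestNeighbors(n_neighbors=3, metric="euclidean").fit(X_scaled)
distances, indices = nn.kneighbors(X_scaled[[0]])

# The first hit is the query track itself (distance 0), so skip it.
recommended = indices[0][1:]
print(recommended)
```

On the real data, the row positions in `recommended` would need to be mapped back to track names, which means keeping a copy of the name columns before they are dropped.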
