SPOTIFY GENRE PREDICTION

Useful Imports 

Below are the imports used in the cleaning and pre-processing, clustering, and visualization methods.

In [8]:
# Importing Libraries
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from scipy import stats  # needed for stats.zscore in the outlier step

Pre-processing and cleaning

We started cleaning the data by dropping the unnecessary columns to reduce the dimensionality of the dataset. We first tried converting the string columns to floats by encoding them, but since they did not improve the classification results, we dropped them anyway. We also noticed some inconsistencies: some values were simply missing, others contained question marks (?), and others said 'NaN'. We replaced all of these with NaN so that they could be handled together later.

Furthermore, we replaced the mode (Minor/Major) with the float values 0.0 and 1.0, so that all of the feature data is stored as the same type (float64). We also realised that there are five rows that only contain NaN values, so we removed those as well.

Lastly, we noticed that the 'tempo' column contained a large number of missing values (3984 in the raw data), so we dropped it as well.

In [9]:
# Read the CSV file
data = pd.read_csv('MusicDataSet.csv', header='infer')

# Drop unnecessary columns
columns_to_drop = ['instance_id', 'artist_name', 'track_name', 'obtained_date', 'key', 'duration_ms','music_genre']
data = data.drop(columns=columns_to_drop, axis=1)

# Convert 'mode' column to numeric
data['mode'] = data['mode'].map({'Major': 1, 'Minor': 0}).astype(float)

# Replace '?' with NaN
data.replace('?', np.nan, inplace=True)

# Replace missing values with NaN (np.nan; the np.NaN alias was removed in NumPy 2.0)
data = data.fillna(np.nan)

# Convert columns to numeric and impute missing values with the column mean
data = data.apply(pd.to_numeric, errors='coerce')
data.fillna(data.mean(), inplace=True)

# Count missing values for every column (all zero here, since imputation has already run)
NaN_values = data.isna().sum()
print("Missing values per column:")
print(NaN_values)

# Check whether any rows consist only of NaN values
NaN_rows = data[data.isna().all(axis=1)]
print(NaN_rows)

# Drop rows with all NaN values
data = data.dropna(how='all')

# Lots of missing values in the raw data, so drop this column
data = data.drop(columns='tempo')


Missing values per column:
popularity          0
acousticness        0
danceability        0
energy              0
instrumentalness    0
liveness            0
loudness            0
mode                0
speechiness         0
tempo               0
valence             0
dtype: int64
Empty DataFrame
Columns: [popularity, acousticness, danceability, energy, instrumentalness, liveness, loudness, mode, speechiness, tempo, valence]
Index: []
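
To make the order of the cleaning steps concrete, here is a minimal sketch on a made-up four-row frame. The `toy` frame and its values are illustrative and are not taken from MusicDataSet.csv. The sketch assumes that rows consisting only of NaN are dropped before the mean imputation, so that they are removed rather than filled with column means.

```python
import numpy as np
import pandas as pd

# Made-up miniature of the raw data: one '?' placeholder and one empty row
toy = pd.DataFrame({
    'energy': [0.9, 0.7, np.nan, 0.5],
    'mode': ['Minor', 'Major', np.nan, 'Major'],
    'tempo': ['100.889', '?', np.nan, '128.014'],
})

# Encode 'mode' as floats and turn the '?' placeholder into NaN
toy['mode'] = toy['mode'].map({'Major': 1.0, 'Minor': 0.0})
toy = toy.replace('?', np.nan)

# Parse every column as numbers; anything unparsable becomes NaN
toy = toy.apply(pd.to_numeric, errors='coerce')

# Drop rows that are entirely NaN before imputing (assumed order)
toy = toy.dropna(how='all')

# Fill the remaining gaps with the column means
toy = toy.fillna(toy.mean())

print(toy)
```

With this order the empty row disappears and only the single '?' in 'tempo' is imputed.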


In [10]:
   popularity  acousticness  danceability  duration_ms  energy  \
0        27.0       0.00468         0.652         -1.0   0.941   
1        31.0       0.01270         0.622     218293.0   0.890   
2        28.0       0.00306         0.620     215613.0   0.755   
3        34.0       0.02540         0.774     166875.0   0.700   
4        32.0       0.00465         0.638     222369.0   0.587   
5        47.0       0.00523         0.755     519468.0   0.731   
6        46.0       0.02890         0.572     214408.0   0.803   
7        43.0       0.02970         0.809     416132.0   0.706   
8        39.0       0.00299         0.509     292800.0   0.921   
9        22.0       0.00934         0.578     204800.0   0.731   

   instrumentalness  liveness  loudness   mode  speechiness  \
0          0.792000    0.1150    -5.201  Minor       0.0748   
1          0.950000    0.1240    -7.043  Minor       0.0300   
2          0.011800    0.5340    -4.617  Major       0.0345   
3          0.002530    0.1570    -4.498  Major       0.2390   
4          0.909000    0.1570    -6.266  Major       0.0413   
5          0.854000    0.2160   -10.517  Minor       0.0412   
6          0.000008    0.1060    -4.294  Major       0.3510   
7          0.903000    0.0635    -9.339  Minor       0.0484   
8          0.000276    0.1780    -3.175  Minor       0.2680   
9          0.011200    0.1110    -7.091  Minor       0.1730   

                tempo  valence music_genre  
0             100.889    0.759  Electronic  
1  115.00200000000001    0.531  Electronic  
2             127.994    0.333  Electronic  
3             128.014    0.270  Electronic  
4             145.036    0.323  Electronic  
5                   ?    0.614  Electronic  
6             149.995    0.230  Electronic  
7             120.008    0.761  Electronic  
8  149.94799999999998    0.273  Electronic  
9             139.933    0.203  Electronic  
Missing values per column:
popularity             5
acousticness           5
danceability           5
duration_ms            5
energy                 5
instrumentalness       5
liveness               5
loudness               5
mode                   5
speechiness            5
tempo               3984
valence                5
music_genre            5
dtype: int64
       popularity  acousticness  danceability  duration_ms  energy  \
10000         NaN           NaN           NaN          NaN     NaN   
10001         NaN           NaN           NaN          NaN     NaN   
10002         NaN           NaN           NaN          NaN     NaN   
10003         NaN           NaN           NaN          NaN     NaN   
10004         NaN           NaN           NaN          NaN     NaN   

       instrumentalness  liveness  loudness  mode  speechiness tempo  valence  \
10000               NaN       NaN       NaN   NaN          NaN   NaN      NaN   
10001               NaN       NaN       NaN   NaN          NaN   NaN      NaN   
10002               NaN       NaN       NaN   NaN          NaN   NaN      NaN   
10003               NaN       NaN       NaN   NaN          NaN   NaN      NaN   
10004               NaN       NaN       NaN   NaN          NaN   NaN      NaN   

      music_genre  
10000         NaN  
10001         NaN  
10002         NaN  
10003         NaN  
10004         NaN  


Handling outliers

We identified the outliers by drawing a boxplot for each feature. We then removed every row in which any feature had an absolute z-score of 3 or more, that is, a value at least three standard deviations from that feature's mean.

In [None]:

#BOXPLOTS TO SEE OUTLIERS ETC..
# numerical_columns = data.select_dtypes(include=['float64']).columns

# plt.figure(figsize=(15, 10))
# num_numerical_columns = len(numerical_columns)
# num_rows = (num_numerical_columns + 1) // 2  #Number of rows needed

# for i, column in enumerate(numerical_columns, start=1):
#     plt.subplot(num_rows, 2, i)
#     sns.boxplot(x=data[column].values)
#     plt.title(f'Boxplot of {column}')

# plt.tight_layout()
# plt.show()


#OUTLIERS
numerical_columns = data.select_dtypes(include=['float64']).columns

#Calculate Z-scores for each numerical column
z_scores = np.abs(stats.zscore(data[numerical_columns]))

threshold = 3

#Find and remove rows with outliers
data_no_outliers = data[(z_scores < threshold).all(axis=1)]

# Print the shape before and after removing outliers
print("\n")
print("Size of dataset before removing outliers:", data.shape)
print("Size of dataset after removing outliers:", data_no_outliers.shape)

Size of dataset before removing outliers: (40000, 12)
Size of dataset after removing outliers: (36680, 12)
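
As a small illustration of the z-score rule, the sketch below applies the same filter to a made-up single-column frame with one extreme value. The `toy` frame is hypothetical and the numbers are not from the dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up values: 19 ordinary loudness readings plus one extreme reading
toy = pd.DataFrame({'loudness': np.append(np.linspace(-9.0, -5.0, 19), -40.0)})

# |z| = |x - mean| / std, computed per column
z = np.abs(stats.zscore(toy.to_numpy()))

# Keep only rows where every feature lies within three standard deviations
kept = toy[(z < 3).all(axis=1)]

print("Before:", toy.shape)
print("After: ", kept.shape)
```

Only the extreme reading is removed; the ordinary readings stay well inside the threshold.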

K-means Clustering

Because we are clustering the data by genre, the number of clusters K is in principle already known (k = 11, since there are 11 genres). There is therefore no need to search for K, but I did it anyway to see what value I would get and how the visualization of the data would respond to it.

In [None]:
# Determine the optimal number of clusters using the elbow method
inertia = []
for k in range(1, 11):  # Trying different values for the number of clusters
    kmeans = KMeans(n_clusters=k, max_iter=50, random_state=1)
    kmeans.fit(z)
    inertia.append(kmeans.inertia_)

In [None]:
# Determine the optimal number of clusters using Silhouette Score
silhouette_scores = []
for k in range(2, 16):  # Adjust the range as needed
    kmeans = KMeans(n_clusters=k, max_iter=50, random_state=1)
    kmeans.fit(data_selected)
    labels = kmeans.labels_
    silhouette_avg = silhouette_score(data_selected, labels)
    silhouette_scores.append(silhouette_avg)

I tried two different methods, the elbow method and the silhouette score.
Both graphs pointed to k = 15.
See Fig. 1 and Fig. 2.

I implemented the clustering:
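
The plotting code for these figures is not included here. A minimal sketch of how such curves could be drawn is shown below; it uses synthetic blobs from `make_blobs` as a stand-in for the feature matrix, so the shape of the curves says nothing about the Spotify data, and the variable names are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the feature matrix (NOT the Spotify data)
X, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=1)

k_values = list(range(2, 9))
inertia = []
silhouette_scores = []
for k in k_values:
    kmeans = KMeans(n_clusters=k, n_init=10, max_iter=50, random_state=1).fit(X)
    inertia.append(kmeans.inertia_)                               # elbow curve
    silhouette_scores.append(silhouette_score(X, kmeans.labels_)) # silhouette curve

fig, (ax_elbow, ax_sil) = plt.subplots(1, 2, figsize=(12, 4))
ax_elbow.plot(k_values, inertia, marker='o')
ax_elbow.set_xlabel('Number of clusters k')
ax_elbow.set_ylabel('Inertia')
ax_elbow.set_title('Elbow method')

ax_sil.plot(k_values, silhouette_scores, marker='o')
ax_sil.set_xlabel('Number of clusters k')
ax_sil.set_ylabel('Average silhouette score')
ax_sil.set_title('Silhouette score')

plt.tight_layout()
plt.show()
```

In the elbow plot one looks for the k where the inertia curve flattens; in the silhouette plot one looks for the k with the highest average score.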

In [None]:
# Interpretation of Clusters
cluster_centers = k_means.cluster_centers_
feature_names = selected_columns  
# Create a DataFrame with cluster centers and feature names
cluster_centers_df = pd.DataFrame(cluster_centers, columns=feature_names)

# Print cluster centers for interpretation
print("Cluster Centers (Feature Averages for Each Cluster):")
print(cluster_centers_df)
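
The cell above relies on a fitted `k_means` object and on `selected_columns`, neither of which is defined in the cells shown. A minimal sketch of what that fitting step might look like follows. Everything in it is an assumption: the feature list is taken to be the numeric columns left after cleaning, the data is random stand-in data, the features are standardised with `StandardScaler`, and k is set to 15 as read off the two plots.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Assumed feature list: the numeric columns left after cleaning
selected_columns = ['popularity', 'acousticness', 'danceability', 'energy',
                    'instrumentalness', 'liveness', 'loudness', 'mode',
                    'speechiness', 'valence']

# Random stand-in for the cleaned data (NOT the real dataset)
rng = np.random.default_rng(1)
demo = pd.DataFrame(rng.random((200, len(selected_columns))),
                    columns=selected_columns)

# Standardise the features so no single scale dominates the distances (assumption)
data_selected = StandardScaler().fit_transform(demo[selected_columns])

# Fit K-Means with k = 15 (assumed from the elbow and silhouette results)
k_means = KMeans(n_clusters=15, n_init=10, max_iter=50, random_state=1)
k_means.fit(data_selected)

# One row per cluster, one column per feature
cluster_centers_df = pd.DataFrame(k_means.cluster_centers_, columns=selected_columns)
print(cluster_centers_df.shape)
```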


Then I assigned the genres based on the clustering

In [None]:
# Assigning Genres based on the characteristics observed
genre_mapping = {
   0: "Rock",
   1: "Hiphop",
   2: "Jazz",
   3: "Electronic",
   4: "Anime",
   5: "Alternative",
   6: "Country",
   7: "Rap",
   8: "Blues",
   9: "Classical"
}


In the following code, we are using K-Means clustering to predict music genres based on the given audio features. The process is as follows:

In [None]:
# Predicted labels for each instance
predicted_labels = k_means.labels_

# Store the cluster assignments in 'music_genre' (replaced by genre names below)
data['music_genre'] = predicted_labels

# Map cluster labels to genres
data['music_genre'] = data['music_genre'].map(genre_mapping).fillna('Unknown')
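
To show what the `.map(...).fillna('Unknown')` step does with cluster ids that have no entry in the mapping, here is a small sketch with a shortened, made-up mapping and made-up labels (`toy_mapping` and `toy_labels` are illustrative names only).

```python
import pandas as pd

# Shortened, illustrative mapping (not the full genre_mapping)
toy_mapping = {0: "Rock", 1: "Hiphop", 2: "Jazz"}

# Made-up cluster ids; 3 and 12 have no entry in the mapping
toy_labels = pd.Series([0, 2, 12, 1, 3, 0])

# Unmapped ids become NaN after map() and are then labelled 'Unknown'
toy_genres = toy_labels.map(toy_mapping).fillna('Unknown')
print(toy_genres.tolist())
# ['Rock', 'Jazz', 'Unknown', 'Hiphop', 'Unknown', 'Rock']
```

Any cluster id beyond those listed in the mapping therefore ends up in the 'Unknown' group.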


Mapping the genres to their associated cluster labels

In [None]:
# Keep the raw cluster ids next to the mapped genre names
data['cluster_label'] = predicted_labels

# Print the DataFrame with cluster labels ('tempo' was dropped during cleaning)
print("Data with Cluster Labels:")
print(data[['acousticness', 'danceability', 'energy', 'instrumentalness', 'liveness', 'loudness',
            'mode', 'speechiness', 'valence', 'popularity', 'music_genre', 'cluster_label']])



In the following code, we are using Principal Component Analysis (PCA) to reduce the dimensionality of the selected data (`data_selected`). This is done to visualize the data in a two-dimensional space.

In [None]:
reduced_data = PCA(n_components=2).fit_transform(data_selected)  # Use original data or z-score normalized data

# Scatter plot for each genre
plt.figure(figsize=(12, 8))
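
Two components rarely capture all of the variance, so it can be worth checking how much is retained before reading too much into the 2-D plot. The sketch below does this on random stand-in data rather than on `data_selected`; `explained_variance_ratio_` is an attribute of a fitted scikit-learn `PCA` object.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for data_selected (NOT the real features)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=500)  # make two columns correlated

pca = PCA(n_components=2)
reduced = pca.fit_transform(X)

# Fraction of the total variance carried by each of the two components
print(reduced.shape)
print(pca.explained_variance_ratio_)
print("Total retained:", pca.explained_variance_ratio_.sum())
```

If the retained fraction is low, clusters that look overlapping in the plot may still be well separated in the full feature space.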

Mapping the colors for the graph

In [None]:
genre_colors = {
    'Rock': 'blue',
    'Hiphop': 'orange',
    'Jazz': 'green',
    'Electronic': 'red',
    'Anime': 'purple',
    'Alternative': 'brown',
    'Country': 'pink',
    'Rap': 'gray',
    'Blues': 'cyan',
    'Classical': 'lime',
    'Unknown': 'black'  
}

In this code snippet, we visualize the results of K-Means clustering by plotting the reduced data obtained from PCA.


In [None]:
for genre, color in genre_colors.items():
    # Boolean mask as a NumPy array so it indexes reduced_data by position
    mask = (data['music_genre'] == genre).to_numpy()
    genre_data = reduced_data[mask]
    plt.scatter(genre_data[:, 0], genre_data[:, 1], label=genre, c=color, s=80, alpha=0.7, edgecolors='k')

plt.title('K-Means Clustering Results with PCA')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(loc='upper right', bbox_to_anchor=(1.15, 1))
plt.grid(True, linestyle='--', alpha=0.5)
plt.show()