# Lab 10 Tasks

In this lab we will look at using supervised machine learning algorithms to predict music genres. Specifically, the objective is to classify the genre of a song from Spotify based on a range of associated audio features.

Each song in our dataset is described by a range of features:
- artist_name: Song artist 
- track_name: Song track name
- acousticness: Describes the likelihood that the song is purely acoustic
- danceability: Describes how suitable a track is for dancing based on a combination of elements including tempo, rhythm stability, beat strength, and overall regularity
- energy: A perceptual measure of intensity and activity. More energetic tracks feel fast, loud, and noisy
- instrumentalness: Indicates whether a song includes vocals or not
- liveness: Describes the likelihood that the song was recorded with a live audience
- loudness: Overall loudness of a track in decibels (dB), averaged across the entire track
- speechiness: Describes the likelihood that the song contains spoken words
- tempo: Estimated tempo of a track in beats per minute (BPM)
- valence: Tracks with high valence sound more positive (e.g. cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, angry)
- genre: Target label ('Pop' or 'Rock' in this case)

Original dataset source: 
https://www.kaggle.com/code/iqbalbasyar/spotify-genre-classification/data

Original Spotify documentation:
https://developer.spotify.com/discover/

## Task 1

Load the dataset from the file 'music.csv' and examine the number of songs having each target label.

In [87]:
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Normalizer

In [89]:
df = pd.read_csv('music.csv')
df.head()

Unnamed: 0,artist_name,track_name,acousticness,danceability,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,genre
0,Led Zeppelin,Black Dog,27.4,43.2,84.8,3.44,23.3,-8.095,8.78,81.201,74.3,Rock
1,Mike Posner,Song About You,18.0,67.6,84.9,0.00237,9.98,-3.008,4.12,87.025,64.1,Pop
2,Van Morrison,Days Like This,65.8,66.0,54.1,0.127,9.28,-7.851,5.76,93.744,69.3,Rock
3,The 1975,Give Yourself A Try,0.00306,31.3,80.0,0.0,49.7,-5.011,6.83,183.047,87.1,Rock
4,HUNNY,Parking Lot,0.303,49.8,84.4,0.0,35.5,-5.163,4.74,103.97,46.9,Rock
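Task 1 also asks for the number of songs carrying each target label, which `df.head()` does not show. A minimal sketch using `value_counts()` follows. To keep it self-contained it uses a small stand-in DataFrame (`sample`, a hypothetical name) built from the five rows displayed above; in the notebook the same call would be made on the full `df`.

```python
import pandas as pd

# Stand-in for the full dataset: only the five rows shown by df.head().
sample = pd.DataFrame({
    "track_name": ["Black Dog", "Song About You", "Days Like This",
                   "Give Yourself A Try", "Parking Lot"],
    "genre": ["Rock", "Pop", "Rock", "Rock", "Rock"],
})

# Count how many songs carry each target label.
label_counts = sample["genre"].value_counts()
print(label_counts)
```

On the real data, `df["genre"].value_counts()` gives the class balance, which is worth knowing before interpreting accuracy.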


Remove any non-numeric features from the dataset, and then separate out the features to use for classification from the target label information.

In [97]:
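# Target labels ('Pop' or 'Rock')
target = df["genre"].values
# Keep only the numeric audio features (drops artist_name, track_name and the index column)
data = df[["acousticness","danceability","energy","instrumentalness","liveness","loudness","speechiness","tempo","valence"]]
# Normalizer rescales each row (song) to unit L2 norm; it does not scale features column-wise
normalizer = Normalizer()
data_scaled = normalizer.fit_transform(data.values)
data_scaled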

array([[0.18218362, 0.2872384 , 0.56383834, ..., 0.05837855, 0.53990846,
        0.49402345],
       [0.11640906, 0.43718068, 0.54906272, ..., 0.02664474, 0.56280546,
        0.41454559],
       [0.41299742, 0.41425273, 0.33956171, ..., 0.03615297, 0.58838951,
        0.43496537],
       ...,
       [0.10982616, 0.41167542, 0.41236615, ..., 0.01747548, 0.56484492,
        0.16024949],
       [0.02605898, 0.29232832, 0.44545267, ..., 0.02004537, 0.79070633,
        0.27729429],
       [0.0589719 , 0.51299825, 0.38417615, ..., 0.16546484, 0.64412198,
        0.20497028]])
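The cell above lists the numeric columns by hand and scales them with `Normalizer`. As a hedged illustration, the sketch below shows two related points on a toy frame (`toy`, a hypothetical name, built from three of the rows shown earlier): selecting numeric columns programmatically with `select_dtypes`, and the difference between row-wise scaling (`Normalizer`) and column-wise scaling (`MinMaxScaler`). Which scaler suits KNN best on this dataset is not settled by the lab text, so treat this as a comparison rather than a recommendation.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, Normalizer

# Toy frame: three rows from df.head(), restricted to a few columns.
toy = pd.DataFrame({
    "artist_name": ["Led Zeppelin", "Mike Posner", "Van Morrison"],
    "acousticness": [27.4, 18.0, 65.8],
    "tempo": [81.201, 87.025, 93.744],
    "genre": ["Rock", "Pop", "Rock"],
})

# Keep only numeric columns; text columns such as artist_name and genre drop out.
numeric = toy.select_dtypes(include="number")

# Normalizer works row-wise: every song vector ends up with L2 norm 1.
row_scaled = Normalizer().fit_transform(numeric.values)
row_norms = np.linalg.norm(row_scaled, axis=1)

# MinMaxScaler works column-wise: every feature ends up in the range [0, 1].
col_scaled = MinMaxScaler().fit_transform(numeric.values)

print(list(numeric.columns))
print(row_norms)
print(col_scaled.min(axis=0), col_scaled.max(axis=0))
```

On the full dataset, `select_dtypes` would also keep the `Unnamed: 0` index column, which would need to be dropped explicitly.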

## Task 2

Generate a 60/40 random training and test split of the data. Based on this split, evaluate the accuracy and F1-score achieved by a KNN classifier (for *k=1* neighbour). 

In [39]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

In [103]:
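# 60/40 train/test split with a fixed seed for reproducibility
data_train, data_test, target_train, target_test = train_test_split(data_scaled, target,
    test_size=0.4, random_state=1)
# Fit a 1-nearest-neighbour classifier on the training portion
model = KNeighborsClassifier(n_neighbors=1)
model.fit(data_train, target_train)
predicted = model.predict(data_test)
# Accuracy over both classes
acc = accuracy_score(target_test, predicted)
print(acc)
# F1-score, treating 'Rock' as the positive class
f1 = f1_score(target_test, predicted, pos_label='Rock')
print(f1)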

0.604
0.6


Use a *confusion matrix* to illustrate where the errors lie with the classifier above.

In [105]:
from sklearn.metrics import confusion_matrix

In [107]:
# Rows are true labels, columns are predicted labels, in sorted label order ('Pop', 'Rock')
cm = confusion_matrix(target_test, predicted)
print(cm)

[[307 198]
 [198 297]]
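The confusion matrix ties directly back to the Task 2 scores. The worked sketch below recomputes accuracy and the F1-score for the 'Rock' class from the four counts printed above. It assumes scikit-learn's default layout: rows are true labels, columns are predicted labels, and labels are in sorted order ('Pop', then 'Rock').

```python
import numpy as np

# Confusion matrix printed above; rows are true labels, columns are predicted
# labels, in sorted label order ('Pop', 'Rock').
cm = np.array([[307, 198],
               [198, 297]])

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy = np.trace(cm) / cm.sum()

# Treat 'Rock' (index 1) as the positive class, as in the f1_score call.
tp = cm[1, 1]   # Rock predicted as Rock
fp = cm[0, 1]   # Pop predicted as Rock
fn = cm[1, 0]   # Rock predicted as Pop

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(f1, 3))
```

Both values agree with the 0.604 accuracy and 0.6 F1-score reported earlier. The errors are split evenly, with 198 songs misclassified in each direction.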


## Task 3



Use 5-fold cross-validation to evaluate the accuracy achieved by a KNN (*k=1*) classifier on the data.

In [113]:
from sklearn.model_selection import cross_val_score
knn = KNeighborsClassifier(n_neighbors=1)
acc_scores = cross_val_score(knn, data_scaled, target, cv=5, scoring="accuracy")
print(acc_scores)

[0.608 0.608 0.618 0.59  0.624]
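Five per-fold scores are easier to compare once they are summarised. As a small sketch, the snippet below computes the mean and standard deviation of the fold accuracies copied from the output above. The array is hard-coded here only to keep the example self-contained; in the notebook `acc_scores` already holds these values.

```python
import numpy as np

# Fold accuracies copied from the cross-validation output above.
acc_scores = np.array([0.608, 0.608, 0.618, 0.59, 0.624])

# Mean gives the overall estimate; standard deviation shows fold-to-fold spread.
mean_acc = acc_scores.mean()
std_acc = acc_scores.std()
print("Mean accuracy=%.3f (std=%.3f)" % (mean_acc, std_acc))
```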


Repeat the process above for different parameter values of *k*, from 2 to 10 neighbours. Generate a plot of the different accuracy values achieved for different values of *k*. Which value of *k* yields the highest accuracy?

In [115]:
for k in range(2,11):
    knn = KNeighborsClassifier(n_neighbors=k)
    acc_scores = cross_val_score(knn, data_scaled, target, cv=5, scoring="accuracy")
    mean_acc = acc_scores.mean()
    print("K=%02d neighbours: Accuracy=%.3f" % (k, mean_acc))

K=02 neighbours: Accuracy=0.597
K=03 neighbours: Accuracy=0.632
K=04 neighbours: Accuracy=0.621
K=05 neighbours: Accuracy=0.636
K=06 neighbours: Accuracy=0.635
K=07 neighbours: Accuracy=0.647
K=08 neighbours: Accuracy=0.646
K=09 neighbours: Accuracy=0.646
K=10 neighbours: Accuracy=0.650
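The task asks for a plot, which the loop above does not produce. One possible sketch, assuming `matplotlib` is available, plots the mean accuracies printed above against *k* and picks out the best value. The accuracies are hard-coded from that output so the snippet runs on its own; in the notebook they could instead be collected in a list inside the loop.

```python
import matplotlib.pyplot as plt

# Mean 5-fold accuracies copied from the output above (k = 2..10).
k_values = list(range(2, 11))
mean_accs = [0.597, 0.632, 0.621, 0.636, 0.635, 0.647, 0.646, 0.646, 0.650]

# Pick the k with the highest mean accuracy.
best_k = max(zip(k_values, mean_accs), key=lambda pair: pair[1])[0]

# Line plot of accuracy against the number of neighbours.
fig, ax = plt.subplots()
ax.plot(k_values, mean_accs, marker="o")
ax.set_xlabel("Number of neighbours (k)")
ax.set_ylabel("Mean 5-fold accuracy")
ax.set_title("KNN accuracy for different values of k")
ax.set_xticks(k_values)
plt.show()

print("Best k:", best_k)
```

With these figures the highest mean accuracy is at *k=10*, although the values for *k* from 7 to 10 differ by less than 0.005.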


## Task 4

For certain datasets, classification may work better on a subset of features, rather than on the entire feature set (e.g. when noisy or misleading features are removed).

Using a KNN classifier with the best value of *k* identified in Task 3, compare classification performance for the three feature subsets in the lists below. Which subset gives the highest accuracy?

In [121]:
subset1 = ['danceability', 'energy', 'tempo', 'valence']
subset2 = ['acousticness', 'instrumentalness', 'liveness', 'speechiness']
subset3 = ['energy', 'tempo', 'valence', 'loudness']

In [131]:
data = df[subset1]
data_scaled = normalizer.fit_transform(data.values)
data_scaled

array([[0.29689935, 0.58280242, 0.55806768, 0.51063938],
       [0.44135021, 0.5542993 , 0.5681731 , 0.41849923],
       [0.45682412, 0.37445735, 0.64885637, 0.47966533],
       ...,
       [0.49767617, 0.49851119, 0.68284343, 0.19372629],
       [0.29439747, 0.44860566, 0.79630309, 0.27925702],
       [0.55073158, 0.41243403, 0.69150005, 0.22004677]])

In [145]:
knn = KNeighborsClassifier(n_neighbors=10)
acc_scores = cross_val_score(knn, data_scaled, target, cv=5, scoring="accuracy")
mean_acc = acc_scores.mean()
print("Accuracy=%.3f" % (mean_acc))

Accuracy=0.604


In [147]:
data = df[subset2]
data_scaled = normalizer.fit_transform(data.values)
knn = KNeighborsClassifier(n_neighbors=10)
acc_scores = cross_val_score(knn, data_scaled, target, cv=5, scoring="accuracy")
mean_acc = acc_scores.mean()
print("Accuracy=%.3f" % (mean_acc))

Accuracy=0.604


In [149]:
data = df[subset3]
data_scaled = normalizer.fit_transform(data.values)
knn = KNeighborsClassifier(n_neighbors=10)
acc_scores = cross_val_score(knn, data_scaled, target, cv=5, scoring="accuracy")
mean_acc = acc_scores.mean()
print("Accuracy=%.3f" % (mean_acc))

Accuracy=0.515
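The three cells above repeat the same steps for each subset, so the comparison can also be written as a loop. The sketch below wraps those steps in a helper, `compare_subsets` (a hypothetical name, not part of the lab). It is demonstrated on randomly generated stand-in data (`demo_df`, `demo_target`), not the Spotify dataset, so the accuracies it prints are meaningless in themselves.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import Normalizer

def compare_subsets(df, target, subsets, k=10, cv=5):
    """Return the mean cross-validated accuracy for each named feature subset."""
    results = {}
    for name, columns in subsets.items():
        # Same preprocessing as the notebook: row-wise normalisation.
        scaled = Normalizer().fit_transform(df[columns].values)
        knn = KNeighborsClassifier(n_neighbors=k)
        scores = cross_val_score(knn, scaled, target, cv=cv, scoring="accuracy")
        results[name] = scores.mean()
    return results

# Synthetic stand-in data (random values, NOT the Spotify dataset).
rng = np.random.default_rng(0)
columns = ['acousticness', 'danceability', 'energy', 'instrumentalness',
           'liveness', 'loudness', 'speechiness', 'tempo', 'valence']
demo_df = pd.DataFrame(rng.uniform(0, 100, size=(200, len(columns))),
                       columns=columns)
demo_target = np.array(["Pop", "Rock"] * 100)

subsets = {
    "subset1": ['danceability', 'energy', 'tempo', 'valence'],
    "subset2": ['acousticness', 'instrumentalness', 'liveness', 'speechiness'],
    "subset3": ['energy', 'tempo', 'valence', 'loudness'],
}

results = compare_subsets(demo_df, demo_target, subsets)
for name, acc in results.items():
    print("%s: Accuracy=%.3f" % (name, acc))
```

In the notebook, calling `compare_subsets(df, target, subsets)` would apply the same procedure to the real data and put the three accuracies side by side.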
