# Lab 4: Building a Spotify Song Classifier with KNNs and Decision Trees

Author: Eva Newby

Collaborators: Haylee Oliver, Marina Kochuten, Jo Cardelle. 

**Classify by genres**. In this lab, you will build machine learning genre classifiers using a dataset of Spotify tracks. Your goal is to train a model that can distinguish between two selected genres based on various audio features such as danceability, energy, valence, tempo, and more.

We'll follow the familiar workflow of data exploration -> preprocessing -> model training -> evaluation.

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

### Step 1: The data
We'll focus on two genres: Blues and Jazz (what’s the difference, anyway??).

Both genres have rich musical histories, but they differ in structure:

- Blues tends to have simpler chord progressions and a strong emotional expression.
- Jazz often involves complex improvisation and more varied rhythms.

But can a machine really tell them apart just by looking at numerical audio features? That’s what we’ll find out!

In this step, you will:
- Load the dataset
- Extract only blues and jazz for classification

In [2]:
# Load dataset
spotify_data = pd.read_csv('/courses/EDS232/Data/spotify_dat.csv')

# Print the data types in spotify_data
spotify_data.dtypes

id                   object
name                 object
album                object
album_id             object
artists              object
artist_ids           object
track_number          int64
disc_number           int64
explicit               bool
danceability        float64
energy              float64
key                   int64
loudness            float64
mode                  int64
speechiness         float64
acousticness        float64
instrumentalness    float64
liveness            float64
valence             float64
tempo               float64
duration_ms           int64
time_signature      float64
year                  int64
release_date         object
genres               object
dtype: object

In [3]:
# View the first 3 rows
spotify_data.head(3)

Unnamed: 0,id,name,album,album_id,artists,artist_ids,track_number,disc_number,explicit,danceability,...,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,year,release_date,genres
0,7lmeHLHBe4nmXzuXc0HDjk,Testify,The Battle Of Los Angeles,2eia0myWFgoHuttJytCxgX,['Rage Against The Machine'],['2d0hyoQ5ynDBnkvAbJKORj'],1,1,False,0.47,...,0.0261,1.1e-05,0.356,0.503,117.906,210133,4.0,1999,1999-11-02,rap metal
1,1wsRitfRRtWyEapl0q22o8,Guerrilla Radio,The Battle Of Los Angeles,2eia0myWFgoHuttJytCxgX,['Rage Against The Machine'],['2d0hyoQ5ynDBnkvAbJKORj'],2,1,True,0.599,...,0.0129,7.1e-05,0.155,0.489,103.68,206200,4.0,1999,1999-11-02,rap metal
2,1hR0fIFK2qRG3f3RF70pb7,Calm Like a Bomb,The Battle Of Los Angeles,2eia0myWFgoHuttJytCxgX,['Rage Against The Machine'],['2d0hyoQ5ynDBnkvAbJKORj'],3,1,False,0.315,...,0.0234,2e-06,0.122,0.37,149.749,298893,4.0,1999,1999-11-02,rap metal


In [4]:
# Filter down to two genres
blues_data = spotify_data[spotify_data['genres']=='blues']
jazz_data =  spotify_data[spotify_data['genres']=='jazz']

# Combine both genres into one dataset
combined_data = pd.concat([blues_data, jazz_data])

# Rename 'genres' column to 'genre' for clarity
combined_data = combined_data.rename(columns = {'genres':'genre'})

# Check changes by viewing first 3 rows
combined_data.head(3)

Unnamed: 0,id,name,album,album_id,artists,artist_ids,track_number,disc_number,explicit,danceability,...,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,year,release_date,genre
333,4MdlxZHzUi8fGvuXOtNVhb,Johnny B. Goode - Live,Johnny Winter: A Rock N' Roll Collection,6yn5Mylr8HsH9PVro7Fo22,['Johnny Winter'],['2ODUxmFxJSyvGiimNhMHbO'],1,1,False,0.415,...,0.023,0.0771,0.941,0.924,159.366,203407,4.0,1973,1973,blues
334,576DmaRTiVhm3l00mpQwDm,Good Morning Little School Girl - Live,Johnny Winter: A Rock N' Roll Collection,6yn5Mylr8HsH9PVro7Fo22,['Johnny Winter'],['2ODUxmFxJSyvGiimNhMHbO'],2,1,False,0.294,...,0.188,0.422,0.974,0.639,187.886,276968,4.0,1973,1973,blues
335,3k3Q8IZ5oOqCUfOP2UPgpW,I'll Drown in My Tears,Johnny Winter: A Rock N' Roll Collection,6yn5Mylr8HsH9PVro7Fo22,['Johnny Winter'],['2ODUxmFxJSyvGiimNhMHbO'],3,1,False,0.424,...,0.827,2.1e-05,0.0742,0.39,95.799,285360,3.0,1973,1973,blues


### Step 2: Data Exploration
Now that we have Blues and Jazz as our two genres, it's time to explore the data and look for patterns in the audio features.

Let's investigate:
- Which features are available?
- How strongly do these features correlate with each other?
- How are the tracks distributed across a sample feature, danceability?

In this step, we will:
- Filter out non-numeric columns before the plotting
- Visualize feature correlations using a heatmap.
- Examine key feature distributions with histograms to see if genres behave differently.


In [6]:
# Exclude non-numeric columns but keep 'genre' for grouping
numeric_features = combined_data.select_dtypes(include = [int, float]).copy()
numeric_features.loc[:,'genre'] = combined_data['genre']

# Calculate correlation matrix on the numeric columns only
# ('genre' holds strings, so it must be dropped before calling .corr())
corr_matrix = numeric_features.drop(columns = 'genre').corr()

# Correlation heatmap
plt.figure(figsize = (10,8))
sns.heatmap(corr_matrix, annot = True, cmap = 'coolwarm', fmt = '.2f')
plt.title("Correlation Between Audio Features")
plt.show()

In [None]:
# Histogram of danceability
plt.figure(figsize = (10,8))
sns.histplot(numeric_features['danceability'], bins = 5, kde = False, color = "#F673DA")

plt.xlabel("Danceability")
plt.ylabel("Count")
plt.title("Histogram of Danceability")
plt.show()
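
The histogram above pools both genres together, so on its own it cannot show whether blues and jazz differ. One way to compare them is a per-genre summary. The sketch below uses a made-up `toy` DataFrame (hypothetical values, not the lab's data) and assumes the same `genre` and `danceability` column names as `numeric_features`.

```python
import pandas as pd

# Made-up rows for illustration only (not the lab's data)
toy = pd.DataFrame({
    "genre": ["blues", "blues", "jazz", "jazz"],
    "danceability": [0.40, 0.60, 0.50, 0.70],
})

# Mean, spread, and number of tracks for danceability within each genre
summary = toy.groupby("genre")["danceability"].agg(["mean", "std", "count"])
print(summary)
```

If the per-genre means are far apart relative to their spread, the feature is likely to help a classifier separate the two genres.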

### Step 3: Check for class imbalance
An unequal distribution between the target classes can cause issues for our classifiers, so we'll need to check for that. A commonly used guideline for class imbalance is:

- Slight Imbalance (Less than 1.5:1) → Generally okay; no need for balancing.
- Moderate Imbalance (Between 1.5:1 and 3:1) → Might cause some bias; balancing can help.
- Severe Imbalance (More than 3:1) → Strongly affects model performance; balancing is usually necessary.
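
The guideline above can be written as a small helper. This is a minimal sketch: the function name `imbalance_category` and the example counts are hypothetical, and the thresholds are simply the ones listed in the guideline.

```python
def imbalance_category(majority_count, minority_count):
    """Classify a majority:minority ratio using the guideline thresholds."""
    ratio = majority_count / minority_count
    if ratio < 1.5:
        return "slight"
    if ratio <= 3:
        return "moderate"
    return "severe"

# Made-up counts for illustration only (not the lab's data)
print(imbalance_category(120, 100))  # ratio 1.2
print(imbalance_category(200, 100))  # ratio 2.0
print(imbalance_category(500, 100))  # ratio 5.0
```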

In [None]:
# Count the tracks in each genre
class_counts = numeric_features['genre'].value_counts()

# Calculate ratio of majority to minority class
class_ratio = class_counts.max()/class_counts.min()

# Print class distribution and ratio
print(f"The ratio is {class_ratio:.2f}:1")
print(f"The class distribution is:\n{class_counts}")

Given this result, should we be worried about the imbalance?

*The class imbalance ratio is approximately 1.47:1, which falls in the "slight imbalance" category (less than 1.5:1), so we do not need to balance the classes.*

### Step 4: Preprocessing
We need to ensure all the variables we use are numerical so the models can process them. First we'll drop all the variables that are neither numeric nor the target. Then we'll use a scikit-learn utility, `LabelEncoder()`, to encode our categorical target variable into numerical values. Print the first five values of your encoded labels.

In [None]:
# Define the features (drop columns that are not numeric or target)
X = numeric_features.drop('genre', axis =1)

# Set the target variable as 'genre'
y = numeric_features['genre']

# Encode the target variable ('genre') using label encoding
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Print first five values of encoded labels
print("First 5 encoded labels:", y_encoded[:5])
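
To see which integer stands for which genre, it helps to look at the encoder's `classes_` attribute. The sketch below uses a made-up list of labels (not the lab's data); `LabelEncoder` sorts the unique labels, so each label's position in `classes_` is its integer code.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels for illustration only (not the lab's data)
genres = ["jazz", "blues", "blues", "jazz"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(genres)

# classes_ holds the sorted unique labels; a label's index is its integer code
print(list(encoder.classes_))
print(list(encoded))

# inverse_transform maps integer codes back to the original strings
print(list(encoder.inverse_transform([0, 1])))
```

The same `inverse_transform` call can turn model predictions back into readable genre names.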

Good, now we'll do our data splitting and scaling.  Let's go with an 80/20 split this time.

In [None]:
# Split data into 80% train and 20% test, using the encoded labels
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42)

# Initialize the scaler
scaler = StandardScaler()

# Fit and transform only the training set
X_train_scaled = scaler.fit_transform(X_train)

# Use the same scaler to transform the test set
X_test_scaled = scaler.transform(X_test)
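
To make the fit/transform distinction concrete, here is a minimal sketch of standardization for a single feature using only the standard library. The tempo values are made up, and the variable names are hypothetical. The point is that the mean and standard deviation come from the training values alone and are then reused for the test values.

```python
from statistics import mean, pstdev

# Made-up tempo values (BPM) for illustration only
train_tempo = [90.0, 110.0, 130.0]
test_tempo = [100.0, 150.0]

# "Fit": learn the mean and standard deviation from the training values only
mu = mean(train_tempo)
sigma = pstdev(train_tempo)  # population standard deviation (ddof = 0)

# "Transform": apply the training statistics to both sets
train_scaled = [(x - mu) / sigma for x in train_tempo]
test_scaled = [(x - mu) / sigma for x in test_tempo]

print(train_scaled)
print(test_scaled)
```

The scaled training values have mean 0 and standard deviation 1 by construction. The scaled test values generally do not, which is expected because the test set never influences the statistics.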

### Step 5: Train the models
OK, now it's time to build our models and assess the training accuracy. We'll use the default of k = 5 for the KNN model.

In [None]:
# Initialize models
knn = KNeighborsClassifier(n_neighbors = 5)
dt = DecisionTreeClassifier(random_state=42)

# Train (fit) both models
knn.fit(X_train_scaled, y_train)
dt.fit(X_train_scaled, y_train)

# Predictions on training data
knn_y_train_pred = knn.predict(X_train_scaled)
dt_y_train_pred = dt.predict(X_train_scaled)

# Compute training accuracy
knn_train_accuracy = accuracy_score(y_train, knn_y_train_pred)
dt_train_accuracy = accuracy_score(y_train, dt_y_train_pred)

# Print training accuracy for both models
print("KNN Training Accuracy:", knn_train_accuracy)
print("Decision Tree Training Accuracy:", dt_train_accuracy)
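
For intuition about what the KNN model does at prediction time, here is a minimal from-scratch sketch, assuming Euclidean distance and an unweighted majority vote. The function `knn_predict` and the toy points are hypothetical and are not part of the lab's code.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=5):
    """Predict a label by majority vote among the k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [(dist(point, query), label)
                 for point, label in zip(train_points, train_labels)]
    # Keep the labels of the k closest points
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    nearest_labels = [label for _, label in nearest]
    # The most common label among the neighbors wins
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy two-feature points for illustration only
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["blues", "blues", "blues", "jazz", "jazz", "jazz"]

print(knn_predict(points, labels, (0.5, 0.5), k=3))
print(knn_predict(points, labels, (5.5, 5.5), k=3))
```

Because the vote depends on raw distances, features on larger scales would dominate without the scaling step above.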

### Step 6: Evaluate models
OK, now let's go ahead and predict on the test data to see how well our models do at predicting the genre of unseen tracks.

In [None]:
# Make predictions
knn_pred = knn.predict(X_test_scaled)
dt_pred = dt.predict(X_test_scaled)

# Evaluate prediction accuracy
knn_accuracy = accuracy_score(y_test, knn_pred)
dt_accuracy = accuracy_score(y_test, dt_pred)

# Print test accuracy for both models
print("KNN Test Accuracy:", knn_accuracy)
print("Decision Tree Test Accuracy:", dt_accuracy)

### Step 7: Visualizing Model Performance
Now that we have evaluated our models on unseen data, let’s compare their performance by visualizing the training vs. test accuracy. We'll look for signs of overfitting by plotting the training accuracy and test accuracy for both KNN and Decision Trees.

Create a bar chart that compares the training accuracy and test accuracy for KNN and Decision Trees. Your plot should clearly display:

- The training and test accuracy for both models.
- Properly labeled axes and a title.
- A legend to differentiate training and test accuracy.
- Accuracy values displayed on top of each bar.

In [None]:
# Create a df
accuracy_df = pd.DataFrame({
    'Model': ['KNN', 'KNN', 'Decision Tree', 'Decision Tree'],
    'Accuracy': [knn_train_accuracy, knn_accuracy, dt_train_accuracy, dt_accuracy],
    'Type': ['Training', 'Test', 'Training', 'Test']
})

accuracy_df.head()

In [None]:
# Create a df
accuracy_df = pd.DataFrame({
    'Model': ['K Nearest Neighbors', 'K Nearest Neighbors', 'Decision Tree', 'Decision Tree'],
    'Accuracy': [knn_train_accuracy, knn_accuracy, dt_train_accuracy, dt_accuracy],
    'Type': ['Training', 'Test', 'Training', 'Test']
})

# Make a bar chart
plt.figure(figsize = (10,8))
ax = sns.barplot(data = accuracy_df, x ='Model', y='Accuracy', hue='Type')
plt.title("Training vs. Test Accuracy for K Nearest Neighbors and Decision Tree")
plt.xlabel('Model Type')
plt.ylabel('Accuracy')
plt.legend()
# Label each bar with its accuracy, rounded to three decimals
for container in ax.containers:
    ax.bar_label(container, fmt = '%.3f')
plt.show()

Which model does a better job of generalizing to unseen data? How do you know?
Does it appear that either of the models is overfitting? Why?

*The KNN model appears to generalize better: its training and test accuracy differ by only ~0.029, compared with ~0.147 for the Decision Tree. The Decision Tree is overfitting, as it scores perfectly (1.0 accuracy) on the training data but noticeably lower on the test data.*
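
The comparison above comes down to one subtraction per model. As a minimal sketch, the helper below computes that train/test gap. The function name `generalization_gap` and the scores are illustrative placeholders, not the lab's results.

```python
def generalization_gap(train_accuracy, test_accuracy):
    """Return how far accuracy drops from the training data to the test data."""
    return train_accuracy - test_accuracy

# Illustrative scores only (not the lab's results): (train, test)
example_scores = {
    "model_a": (0.92, 0.89),
    "model_b": (1.00, 0.85),
}

gaps = {name: generalization_gap(train, test)
        for name, (train, test) in example_scores.items()}

for name, gap in gaps.items():
    print(f"{name}: gap = {gap:.3f}")

# The model with the largest gap is the likeliest to be overfitting
most_overfit = max(gaps, key=gaps.get)
print("Largest gap:", most_overfit)
```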

### Bonus question
In the workflow above, we used the default parameter values of `k = 5` for the KNN model, and `max_depth = None` for the Decision Tree. Choose one of the models and see if you can improve test performance by optimizing its corresponding parameter value.

In [None]:
# Choosing the KNN model for optimization
# (named tune_knn_k so it does not overwrite the fitted `knn` model from Step 5)
def tune_knn_k():
    
    k_values = [1,2,3,4,5,6,7,8,9,10]
    accuracies = []
    
    # Loop through the different k values and fit a model for each
    for k in k_values:
        model = KNeighborsClassifier(n_neighbors = k)
        model.fit(X_train_scaled, y_train)
        
        # Make predictions and calculate test accuracy
        y_pred = model.predict(X_test_scaled)
        accuracy = accuracy_score(y_test, y_pred)
        accuracies.append(accuracy)
        
        print(f"K = {k} - Accuracy: {accuracy:.3f}")
        
    # Visualize the effect of k on accuracy
    plt.figure(figsize = (10,6))
    plt.plot(k_values, accuracies, marker = 'o')
    plt.xlabel('Number of Neighbors')
    plt.ylabel('Accuracy')
    plt.title('KNN: Effect of K on Model Accuracy')
    plt.grid(True)
    plt.show()
    
tune_knn_k()

It looks like k = 5, 7, and 9 tie for the highest test accuracy (0.893), so within this range the default of k = 5 is already among the best choices.
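
The loop above picks k by looking at test accuracy, so the test set is no longer a fully independent check. One common alternative is to choose k with cross-validation on the training data and score the test set only once at the end. The sketch below shows that idea on synthetic data from `make_classification` (a stand-in, not the Spotify tracks). The names `X_demo`, `y_demo`, `cv_scores`, and `best_k` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Spotify tracks)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

k_values = range(1, 11)
cv_scores = {}
for k in k_values:
    # The pipeline refits the scaler inside every fold, so validation rows
    # never influence the scaling statistics
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(model, X_tr, y_tr, cv=5).mean()

# Pick the k with the best mean cross-validated accuracy
best_k = max(cv_scores, key=cv_scores.get)

# Score the test set once, after k has been chosen
final_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=best_k))
final_model.fit(X_tr, y_tr)
test_accuracy = final_model.score(X_te, y_te)

print(f"Best k by cross-validation: {best_k}")
print(f"Test accuracy at that k: {test_accuracy:.3f}")
```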
