# CS M148 Project Check-In 4

**Due Date:** November 7, 2025 at 11:59 P.M.

This notebook documents progress for the classification check-in:

1. Apply KNN algorithm or Random Forest Algorithm for classification on a binary categorical response variable
2. Calculate confusion matrix, prediction accuracy, prediction error, true positive rate, true negative rate, and F1 score on training data
3. Calculate and plot ROC curve and AUC on validation data
4. Use 5-fold cross-validation on validation set to calculate AUC and accuracy of each fold

**Dataset:** Spotify Tracks (Hugging Face): `maharshipandya/spotify-tracks-dataset`

In [2]:
# Imports
from datasets import load_dataset
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    confusion_matrix, accuracy_score, classification_report,
    roc_curve, roc_auc_score, f1_score, ConfusionMatrixDisplay
)
from sklearn.pipeline import Pipeline

# Display and plotting defaults
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Load dataset from Hugging Face
# pd.read_csv already returns a DataFrame, so there is no 'train' split to index
print("Loading dataset...")
df = pd.read_csv("hf://datasets/maharshipandya/spotify-tracks-dataset/dataset.csv")

# Basic cleaning: drop obvious duplicates and reset index
df = df.drop_duplicates().reset_index(drop=True)

print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")
df.head()

## 1. Create Binary Categorical Response Variable

We'll create a binary classification task by predicting whether a track is **popular** or not.
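We define "popular" as a popularity score >= 50. This is a fixed cutoff rather than the median, so the two classes may be imbalanced; the class counts printed below show the actual split.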

Alternatively, we could use `explicit` (True/False) if available in the dataset, or create other meaningful binary variables.
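
To illustrate how the choice of cutoff affects class balance, here is a minimal sketch comparing a fixed threshold of 50 with a median-based threshold. The `toy` frame and its values are made up for illustration and are not taken from the Spotify data.

```python
import pandas as pd

# Toy popularity scores (illustrative values only)
toy = pd.DataFrame({"popularity": [0, 5, 10, 20, 30, 40, 55, 70]})

# Fixed cutoff, as used in this notebook
toy["is_popular_fixed"] = (toy["popularity"] >= 50).astype(int)

# Median cutoff: labeling scores strictly above the median as popular
# gives a roughly balanced split
median_cut = toy["popularity"].median()
toy["is_popular_median"] = (toy["popularity"] > median_cut).astype(int)

print("median:", median_cut)
print(toy[["is_popular_fixed", "is_popular_median"]].sum())
```

With these toy values the fixed cutoff labels 2 of 8 rows as popular, while the median cutoff labels 4 of 8.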

In [None]:
# Check if 'popularity' column exists and examine its distribution
if 'popularity' in df.columns:
    print("Popularity statistics:")
    print(df['popularity'].describe())
    
    # Plot distribution
    plt.figure(figsize=(10, 4))
    plt.subplot(1, 2, 1)
    sns.histplot(df['popularity'], kde=True, bins=50)
    plt.title('Distribution of Popularity')
    plt.xlabel('Popularity Score')
    
    # Create binary target: popular = 1 if popularity >= 50, else 0
    threshold = 50
    df['is_popular'] = (df['popularity'] >= threshold).astype(int)
    
    plt.subplot(1, 2, 2)
    df['is_popular'].value_counts().plot(kind='bar')
    plt.title(f'Binary Classification Target (threshold={threshold})')
    plt.xlabel('Is Popular')
    plt.ylabel('Count')
    plt.xticks([0, 1], ['Not Popular (0)', 'Popular (1)'], rotation=0)
    plt.tight_layout()
    plt.show()
    
    print(f"\nClass distribution:")
    print(df['is_popular'].value_counts())
    print(f"\nClass balance: {df['is_popular'].value_counts(normalize=True)}")

In [None]:
# Select features for classification
# Use audio features similar to Check-In 2
num_features = [
    'danceability', 'energy', 'loudness', 'speechiness', 'acousticness',
    'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms'
]

available = [c for c in num_features if c in df.columns]
missing = sorted(set(num_features) - set(available))
if missing:
    print('Missing columns skipped:', missing)

print(f"Using features: {available}")

# Keep rows with no NaNs in used columns
model_df = df.dropna(subset=available + ['is_popular']).copy()

X = model_df[available]
y = model_df['is_popular']

# Split into train and validation sets (80/20 split)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"\nTrain set shape: {X_train.shape}")
print(f"Validation set shape: {X_val.shape}")
print(f"\nTrain class distribution:\n{y_train.value_counts()}")
print(f"\nValidation class distribution:\n{y_val.value_counts()}")

## 2. Apply KNN Algorithm for Classification

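We fit a K-Nearest Neighbors (KNN) classifier with k = 5 on standardized features. Standardizing matters here because KNN relies on distances, which would otherwise be dominated by large-scale features such as `duration_ms`. Comparing other values of k is left to hyperparameter tuning (see Next Steps).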

In [None]:
# Build KNN pipeline with standardization
knn_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=5))
])

# Fit KNN on training data
print("Training KNN classifier...")
knn_pipeline.fit(X_train, y_train)

# Predictions on training and validation sets
y_train_pred_knn = knn_pipeline.predict(X_train)
y_val_pred_knn = knn_pipeline.predict(X_val)

# Prediction probabilities for ROC curve
y_train_proba_knn = knn_pipeline.predict_proba(X_train)[:, 1]
y_val_proba_knn = knn_pipeline.predict_proba(X_val)[:, 1]

print("KNN training complete!")
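
The pipeline above uses `n_neighbors=5`. One possible way to compare several values of k is a cross-validated grid search, sketched below. The synthetic data from `make_classification` is a stand-in for `X_train` and `y_train`, and the candidate grid is an arbitrary assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (X_train, y_train); swap in the real training split
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Candidate k values: an illustrative grid, not a recommendation
param_grid = {"knn__n_neighbors": [3, 5, 11, 21]}
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Score each k by mean AUC over the 5 folds
search = GridSearchCV(pipe, param_grid, cv=cv_demo, scoring="roc_auc")
search.fit(X_demo, y_demo)

print("Best k:", search.best_params_["knn__n_neighbors"])
print("Best mean CV AUC:", round(search.best_score_, 4))
```

Because the scaler sits inside the pipeline, it is refit on each training fold, which avoids leaking fold statistics into the scaling step.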

## 3. Apply Random Forest Algorithm for Classification

We'll also train a Random Forest classifier for comparison.

In [None]:
# Build Random Forest pipeline
# (tree splits are scale-invariant, so the scaler is kept only for parity with KNN)
rf_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42, max_depth=10))
])

# Fit Random Forest on training data
print("Training Random Forest classifier...")
rf_pipeline.fit(X_train, y_train)

# Predictions on training and validation sets
y_train_pred_rf = rf_pipeline.predict(X_train)
y_val_pred_rf = rf_pipeline.predict(X_val)

# Prediction probabilities for ROC curve
y_train_proba_rf = rf_pipeline.predict_proba(X_train)[:, 1]
y_val_proba_rf = rf_pipeline.predict_proba(X_val)[:, 1]

print("Random Forest training complete!")

## 4. Calculate Metrics on Training Data

We'll calculate:
- Confusion Matrix
- Prediction Accuracy
- Prediction Error (1 - Accuracy)
- True Positive Rate (Recall/Sensitivity)
- True Negative Rate (Specificity)
- F1 Score
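
As a worked example of how these quantities follow from the four confusion-matrix cells, the sketch below uses made-up counts (not results from this notebook).

```python
# Hypothetical confusion-matrix counts for 100 predictions
tn, fp, fn, tp = 50, 10, 5, 35

total = tn + fp + fn + tp
accuracy = (tp + tn) / total          # 85 / 100
error = 1 - accuracy                  # prediction error
tpr = tp / (tp + fn)                  # sensitivity / recall: 35 / 40
tnr = tn / (tn + fp)                  # specificity: 50 / 60
precision = tp / (tp + fp)            # 35 / 45
f1 = 2 * precision * tpr / (precision + tpr)  # harmonic mean of precision and recall

print(f"accuracy={accuracy:.4f} error={error:.4f}")
print(f"TPR={tpr:.4f} TNR={tnr:.4f} F1={f1:.4f}")
```

Here F1 simplifies to 2·TP / (2·TP + FP + FN) = 70 / 85.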

In [None]:
def calculate_metrics(y_true, y_pred, model_name="Model"):
    """
    Calculate and display classification metrics.
    """
    # Confusion matrix (labels fixed so ravel() always yields tn, fp, fn, tp)
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    tn, fp, fn, tp = cm.ravel()
    
    # Metrics
    accuracy = accuracy_score(y_true, y_pred)
    error = 1 - accuracy
    tpr = tp / (tp + fn) if (tp + fn) > 0 else 0  # True Positive Rate (Recall/Sensitivity)
    tnr = tn / (tn + fp) if (tn + fp) > 0 else 0  # True Negative Rate (Specificity)
    f1 = f1_score(y_true, y_pred)
    
    print(f"\n{'='*60}")
    print(f"{model_name} - Metrics")
    print(f"{'='*60}")
    print(f"\nConfusion Matrix:")
    print(f"  TN: {tn:6d}  |  FP: {fp:6d}")
    print(f"  FN: {fn:6d}  |  TP: {tp:6d}")
    print(f"\nMetrics:")
    print(f"  Accuracy:              {accuracy:.4f}")
    print(f"  Prediction Error:      {error:.4f}")
    print(f"  True Positive Rate:    {tpr:.4f}  (Sensitivity/Recall)")
    print(f"  True Negative Rate:    {tnr:.4f}  (Specificity)")
    print(f"  F1 Score:              {f1:.4f}")
    
    return {
        'confusion_matrix': cm,
        'accuracy': accuracy,
        'error': error,
        'tpr': tpr,
        'tnr': tnr,
        'f1': f1
    }

# Calculate metrics for KNN on training data
knn_train_metrics = calculate_metrics(y_train, y_train_pred_knn, "KNN (Training Set)")

# Calculate metrics for Random Forest on training data
rf_train_metrics = calculate_metrics(y_train, y_train_pred_rf, "Random Forest (Training Set)")

In [None]:
# Visualize confusion matrices for training data
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# KNN Confusion Matrix
ConfusionMatrixDisplay(knn_train_metrics['confusion_matrix'], 
                       display_labels=['Not Popular', 'Popular']).plot(ax=axes[0], cmap='Blues')
axes[0].set_title('KNN - Confusion Matrix (Training Set)')

# Random Forest Confusion Matrix
ConfusionMatrixDisplay(rf_train_metrics['confusion_matrix'], 
                       display_labels=['Not Popular', 'Popular']).plot(ax=axes[1], cmap='Greens')
axes[1].set_title('Random Forest - Confusion Matrix (Training Set)')

plt.tight_layout()
plt.show()

## 5. Calculate and Plot ROC Curve and AUC on Validation Data

The ROC (Receiver Operating Characteristic) curve shows the trade-off between True Positive Rate and False Positive Rate.
AUC (Area Under the Curve) summarizes performance across all classification thresholds in a single number; higher is better.
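
To make the threshold sweep concrete, the following sketch builds ROC points by hand on six toy scores and integrates them with the trapezoidal rule, then compares the result to `roc_auc_score`. The labels, scores, and the helper `roc_points` are illustrative assumptions, not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities (illustrative only)
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

def roc_points(y_true, scores):
    """Sweep thresholds from high to low and record (FPR, TPR) at each one."""
    thresholds = np.unique(scores)[::-1]
    n_pos = (y_true == 1).sum()
    n_neg = (y_true == 0).sum()
    fpr, tpr = [0.0], [0.0]  # start at the origin (nothing predicted positive)
    for t in thresholds:
        pred_pos = scores >= t
        tpr.append((pred_pos & (y_true == 1)).sum() / n_pos)
        fpr.append((pred_pos & (y_true == 0)).sum() / n_neg)
    return np.array(fpr), np.array(tpr)

fpr, tpr = roc_points(y_true, scores)

# Trapezoidal area under the (FPR, TPR) curve
auc_manual = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
auc_sklearn = roc_auc_score(y_true, scores)

print(f"manual AUC = {auc_manual:.4f}, sklearn AUC = {auc_sklearn:.4f}")
```

Both values come out to 8/9 for this toy input, since 8 of the 9 positive-negative pairs are ranked correctly.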

In [None]:
# Calculate ROC curves and AUC for validation set
fpr_knn, tpr_knn, _ = roc_curve(y_val, y_val_proba_knn)
auc_knn = roc_auc_score(y_val, y_val_proba_knn)

fpr_rf, tpr_rf, _ = roc_curve(y_val, y_val_proba_rf)
auc_rf = roc_auc_score(y_val, y_val_proba_rf)

# Plot ROC curves
plt.figure(figsize=(10, 7))
plt.plot(fpr_knn, tpr_knn, label=f'KNN (AUC = {auc_knn:.4f})', linewidth=2)
plt.plot(fpr_rf, tpr_rf, label=f'Random Forest (AUC = {auc_rf:.4f})', linewidth=2)
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier (AUC = 0.5)', linewidth=1)

plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate', fontsize=12)
plt.title('ROC Curve - Validation Set', fontsize=14, fontweight='bold')
plt.legend(loc='lower right', fontsize=11)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print(f"\nValidation Set AUC Scores:")
print(f"  KNN:           {auc_knn:.4f}")
print(f"  Random Forest: {auc_rf:.4f}")

## 6. 5-Fold Cross-Validation on Validation Set

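We perform 5-fold stratified cross-validation on the validation set and report accuracy and AUC for each fold. Note that `cross_validate` clones each pipeline and refits it on four of the five validation folds, so these scores describe models trained on validation data only, not the models fitted on the training set above.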

In [None]:
from sklearn.model_selection import StratifiedKFold

# Define 5-fold cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Perform cross-validation on validation set for both models
print("Performing 5-fold cross-validation on validation set...\n")

# KNN Cross-Validation
knn_cv_scores = cross_validate(
    knn_pipeline, X_val, y_val, 
    cv=cv, 
    scoring=['accuracy', 'roc_auc'],
    return_train_score=False
)

# Random Forest Cross-Validation
rf_cv_scores = cross_validate(
    rf_pipeline, X_val, y_val, 
    cv=cv, 
    scoring=['accuracy', 'roc_auc'],
    return_train_score=False
)

# Display results for each fold
print("="*70)
print("KNN - 5-Fold Cross-Validation Results (Validation Set)")
print("="*70)
print(f"{'Fold':<6} {'Accuracy':<12} {'AUC':<12}")
print("-"*70)
for i in range(5):
    print(f"{i+1:<6} {knn_cv_scores['test_accuracy'][i]:<12.4f} {knn_cv_scores['test_roc_auc'][i]:<12.4f}")
print("-"*70)
print(f"{'Mean':<6} {knn_cv_scores['test_accuracy'].mean():<12.4f} {knn_cv_scores['test_roc_auc'].mean():<12.4f}")
print(f"{'Std':<6} {knn_cv_scores['test_accuracy'].std():<12.4f} {knn_cv_scores['test_roc_auc'].std():<12.4f}")

print("\n" + "="*70)
print("Random Forest - 5-Fold Cross-Validation Results (Validation Set)")
print("="*70)
print(f"{'Fold':<6} {'Accuracy':<12} {'AUC':<12}")
print("-"*70)
for i in range(5):
    print(f"{i+1:<6} {rf_cv_scores['test_accuracy'][i]:<12.4f} {rf_cv_scores['test_roc_auc'][i]:<12.4f}")
print("-"*70)
print(f"{'Mean':<6} {rf_cv_scores['test_accuracy'].mean():<12.4f} {rf_cv_scores['test_roc_auc'].mean():<12.4f}")
print(f"{'Std':<6} {rf_cv_scores['test_accuracy'].std():<12.4f} {rf_cv_scores['test_roc_auc'].std():<12.4f}")
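
For reference, the per-fold scores can also be computed with an explicit loop, which makes the refit-per-fold behavior visible. This is a sketch on synthetic data; `X_demo`, `y_demo`, and the small forest settings are assumptions chosen to keep it fast.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for (X_val, y_val)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
model = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_acc, fold_auc = [], []
for train_idx, test_idx in cv_demo.split(X_demo, y_demo):
    # A fresh, unfitted copy is trained on the 4 training folds only
    fitted = clone(model).fit(X_demo[train_idx], y_demo[train_idx])
    pred = fitted.predict(X_demo[test_idx])
    proba = fitted.predict_proba(X_demo[test_idx])[:, 1]
    fold_acc.append(accuracy_score(y_demo[test_idx], pred))
    fold_auc.append(roc_auc_score(y_demo[test_idx], proba))

for i, (acc, auc) in enumerate(zip(fold_acc, fold_auc), start=1):
    print(f"Fold {i}: accuracy={acc:.4f}  AUC={auc:.4f}")
```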

In [None]:
# Visualize cross-validation results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy comparison
fold_nums = np.arange(1, 6)
axes[0].plot(fold_nums, knn_cv_scores['test_accuracy'], marker='o', label='KNN', linewidth=2)
axes[0].plot(fold_nums, rf_cv_scores['test_accuracy'], marker='s', label='Random Forest', linewidth=2)
axes[0].axhline(knn_cv_scores['test_accuracy'].mean(), color='C0', linestyle='--', alpha=0.5)
axes[0].axhline(rf_cv_scores['test_accuracy'].mean(), color='C1', linestyle='--', alpha=0.5)
axes[0].set_xlabel('Fold Number', fontsize=11)
axes[0].set_ylabel('Accuracy', fontsize=11)
axes[0].set_title('5-Fold CV: Accuracy per Fold', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)
axes[0].set_xticks(fold_nums)

# AUC comparison
axes[1].plot(fold_nums, knn_cv_scores['test_roc_auc'], marker='o', label='KNN', linewidth=2)
axes[1].plot(fold_nums, rf_cv_scores['test_roc_auc'], marker='s', label='Random Forest', linewidth=2)
axes[1].axhline(knn_cv_scores['test_roc_auc'].mean(), color='C0', linestyle='--', alpha=0.5)
axes[1].axhline(rf_cv_scores['test_roc_auc'].mean(), color='C1', linestyle='--', alpha=0.5)
axes[1].set_xlabel('Fold Number', fontsize=11)
axes[1].set_ylabel('AUC', fontsize=11)
axes[1].set_title('5-Fold CV: AUC per Fold', fontsize=12, fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)
axes[1].set_xticks(fold_nums)

plt.tight_layout()
plt.show()

## Summary and Analysis

### Binary Classification Task
We created a binary classification problem by predicting whether a Spotify track is "popular" (popularity >= 50) or not, using the same audio features from Check-In 2.

### Model Performance

#### Training Set Metrics
Both KNN and Random Forest were evaluated on the training data with:
- **Confusion Matrix**: Shows the breakdown of true positives, true negatives, false positives, and false negatives
- **Accuracy**: Overall percentage of correct predictions
- **Prediction Error**: 1 - Accuracy
- **True Positive Rate (TPR)**: Proportion of actual positives correctly identified (Sensitivity/Recall)
- **True Negative Rate (TNR)**: Proportion of actual negatives correctly identified (Specificity)
- **F1 Score**: Harmonic mean of precision and recall

#### Validation Set Analysis
- **ROC Curve**: Visualizes the trade-off between TPR and FPR at different classification thresholds
- **AUC (Area Under Curve)**: Single metric summarizing overall classification performance
  - AUC = 0.5: Random classifier
  - AUC = 1.0: Perfect classifier
  - AUC > 0.7: Generally considered acceptable
  - AUC > 0.8: Good performance
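
One way to read these reference points: AUC equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. The sketch below checks the two extremes on simulated labels; the helper `pairwise_auc` and the simulated data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
y_sim = rng.integers(0, 2, size=2000)   # simulated binary labels
random_scores = rng.random(2000)        # scores unrelated to the labels
perfect_scores = y_sim.astype(float)    # scores that separate the classes exactly

def pairwise_auc(y, s):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = s[y == 1]
    neg = s[y == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

auc_random = pairwise_auc(y_sim, random_scores)
auc_perfect = pairwise_auc(y_sim, perfect_scores)

print(f"random scores:  AUC = {auc_random:.3f}")
print(f"perfect scores: AUC = {auc_perfect:.3f}")
```

Uninformative scores land close to 0.5, and perfectly separating scores give exactly 1.0.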

#### Cross-Validation Results
5-fold cross-validation on the validation set provides:
- More robust estimates of model performance
- Insight into model stability across different data splits
- Comparison of accuracy and AUC across folds

### Key Observations
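- Random Forest is expected to perform at least as well as KNN on this task, since tree ensembles are less affected by uninformative features than distance-based methods; the validation AUC values and per-fold tables above are the basis for confirming this
- Audio features alone are likely to give only moderate predictive power for popularity, which also depends on factors outside the audio signal (such as artist and genre)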
- Cross-validation helps ensure results are not due to a lucky train/validation split

### Next Steps
For the final project, consider:
- Feature engineering (e.g., interaction terms, polynomial features)
- Hyperparameter tuning (grid search for optimal k in KNN, tree depth in Random Forest)
- Additional features (artist information, release year, genre)
- Handling class imbalance if present
- Ensemble methods combining multiple models
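
On the class-imbalance point, one common option is to reweight classes inversely to their frequency. The sketch below computes scikit-learn's "balanced" weights for a toy 90/10 label vector; the labels are made up and do not reflect the actual `is_popular` split.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels with a 90/10 imbalance (illustrative only)
y_toy = np.array([0] * 90 + [1] * 10)

# "balanced" weights follow n_samples / (n_classes * count_per_class)
classes = np.array([0, 1])
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)
weight_by_class = dict(zip(classes.tolist(), weights.tolist()))

print(weight_by_class)
```

Passing `class_weight="balanced"` to `RandomForestClassifier` applies the same weighting during training. `KNeighborsClassifier` has no such parameter, so resampling or threshold adjustment would be needed there.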