# CS M148 Project Check-in 1

**Project**: Spotify Music Track Analysis and Classification

**Date**: December 12, 2025

---

## Table of Contents
1. Dataset Description
2. Main Features and Justification
3. Data Cleaning and Missingness Handling
4. Exploratory Data Analysis (EDA)


In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import io
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("Libraries imported successfully!")


---

## 1. Dataset Description

### What Dataset Did We Choose?

Our team chose the **Spotify Tracks Dataset** from Hugging Face, a publicly available dataset containing audio feature data for about 114,000 track entries across 114 genres.

**Dataset Source**: [Hugging Face - maharshipandya/spotify-tracks-dataset](https://huggingface.co/datasets/maharshipandya/spotify-tracks-dataset)

### Dataset Overview:
- **Total Tracks**: ~114,000 songs
- **Features**: 21 columns including audio characteristics and metadata
- **Genres**: 114 different music genres (rock, pop, classical, jazz, hip-hop, electronic, etc.)
- **Audio Features**: Quantitative measurements obtained through Spotify's Web API (audio features)

### Why This Dataset?

We selected this dataset for several compelling reasons:

1. **Rich Feature Set**: The dataset includes detailed audio analysis metrics (danceability, energy, valence, tempo, etc.) that provide deep insights into musical characteristics.

2. **Large Sample Size**: With roughly 114,000 rows, we have enough data to train and evaluate machine learning models and to run statistical analyses.

3. **Multi-Genre Coverage**: 114 distinct genres allow us to explore how different musical styles differ in their acoustic properties.

4. **Real-World Application**: Music recommendation systems and playlist generation are valuable commercial applications, making this project practically relevant.

5. **Balanced Classes**: Each genre contains approximately 1,000 tracks, providing good class balance for classification tasks.

6. **Data Quality**: The audio features were collected through Spotify's Web API, so they are computed consistently across all tracks.
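
The class-balance claim in point 5 is easy to verify once the data is loaded. A minimal sketch, assuming a `track_genre` column; the `genre_balance` helper and the toy frame are illustrative names and data, not part of the dataset or of this notebook's pipeline.

```python
import pandas as pd

def genre_balance(frame: pd.DataFrame, genre_col: str = "track_genre") -> pd.Series:
    """Summarize how evenly rows are spread across genres."""
    counts = frame[genre_col].value_counts()
    return pd.Series({
        "n_genres": counts.size,
        "min_count": counts.min(),
        "max_count": counts.max(),
        # 1.0 means perfectly balanced classes
        "imbalance_ratio": counts.max() / counts.min(),
    })

# Toy stand-in data (hypothetical); swap in the loaded DataFrame in practice
toy = pd.DataFrame({"track_genre": ["rock"] * 3 + ["jazz"] * 3 + ["pop"] * 2})
balance = genre_balance(toy)
print(balance)
```

An `imbalance_ratio` close to 1.0 supports using plain accuracy as a classification metric; a much larger ratio would suggest stratified splits or class-weighted metrics instead.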

### Research Questions We Can Explore:
- Can we predict a song's genre based on its audio features?
- What features most strongly influence a song's danceability?
- Are there clusters of similar songs across different genres?
- How do audio characteristics vary across different music genres?


In [None]:
# Load the dataset from Hugging Face
url = "https://huggingface.co/datasets/maharshipandya/spotify-tracks-dataset/resolve/main/dataset.csv"
print("Downloading Spotify dataset from Hugging Face...")
response = requests.get(url, timeout=30)
response.raise_for_status()
df_original = pd.read_csv(io.StringIO(response.text))
print("✓ Dataset loaded successfully!\n")

# Display basic information
print(f"Dataset shape: {df_original.shape}")
print(f"Number of tracks: {df_original.shape[0]:,}")
print(f"Number of features: {df_original.shape[1]}\n")

# Show first few rows
print("First 5 rows of the dataset:")
df_original.head()


In [None]:
# Display column information
print("Dataset Columns and Data Types:\n")
print(df_original.dtypes)
print("\n" + "="*60)
print("Dataset Summary Statistics:")
df_original.describe()


---

## 2. Main Features and Justification

### Audio Feature Descriptions

Our analysis focuses on the following audio features, as exposed by Spotify's Web API:

#### 1. **Danceability** (0.0 - 1.0)
- **Description**: Describes how suitable a track is for dancing based on tempo, rhythm stability, beat strength, and overall regularity
- **Why Important**: Key predictor for playlist generation and music recommendation; strongly correlated with genre
- **Use Case**: Identifying party/workout music vs. relaxation music

#### 2. **Energy** (0.0 - 1.0)
- **Description**: Measures intensity and activity. High energy tracks feel fast, loud, and noisy
- **Why Important**: Distinguishes between calm classical music and energetic rock/electronic music
- **Use Case**: Mood-based music selection

#### 3. **Loudness** (typically -60 to 0 dB)
- **Description**: Overall loudness of a track in decibels (dB)
- **Why Important**: Reflects production quality and genre conventions (e.g., metal is typically louder than folk)
- **Use Case**: Normalizing audio playback, genre classification

#### 4. **Speechiness** (0.0 - 1.0)
- **Description**: Detects the presence of spoken words. Values near 1.0 indicate speech-only recordings (podcasts, audiobooks); mid-range values typically indicate a mix of music and speech, such as rap
- **Why Important**: Differentiates between instrumental music, sung music, and spoken content
- **Use Case**: Filtering out non-music content, identifying rap/hip-hop

#### 5. **Acousticness** (0.0 - 1.0)
- **Description**: Confidence measure of whether the track is acoustic (non-electric)
- **Why Important**: Distinguishes acoustic vs. electronic production styles
- **Use Case**: Finding unplugged versions, classical music, folk music

#### 6. **Instrumentalness** (0.0 - 1.0)
- **Description**: Predicts whether a track contains no vocals. High values indicate instrumental tracks
- **Why Important**: Identifies background music, classical pieces, instrumental jazz
- **Use Case**: Creating focus/study playlists, finding instrumental versions

#### 7. **Liveness** (0.0 - 1.0)
- **Description**: Detects presence of an audience in the recording. High values indicate live performances
- **Why Important**: Distinguishes studio recordings from live albums
- **Use Case**: Finding concert recordings, live albums

#### 8. **Valence** (0.0 - 1.0)
- **Description**: Musical positiveness. High valence = happy/cheerful, low valence = sad/angry
- **Why Important**: Mood classification and emotional content analysis
- **Use Case**: Creating mood-based playlists (happy, sad, angry, relaxed)

#### 9. **Tempo** (BPM)
- **Description**: Overall estimated tempo in beats per minute (BPM)
- **Why Important**: Fundamental rhythmic characteristic; varies significantly across genres
- **Use Case**: Matching songs for DJ mixing, workout intensity matching

#### 10. **Duration (ms)**
- **Description**: Track length in milliseconds
- **Why Important**: Genre conventions (pop songs ~3 mins, classical movements vary widely)
- **Use Case**: Playlist time management, genre classification

#### 11. **Track Genre**
- **Description**: Categorical label indicating the music genre
- **Why Important**: Our primary target variable for classification tasks
- **Use Case**: Genre prediction, genre-based recommendation
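
The documented ranges above double as a cheap sanity check on the loaded data. A minimal sketch, assuming the column names used in this notebook; `EXPECTED_RANGES`, `out_of_range_counts`, and the toy frame are illustrative, and the loudness bounds reflect the "typical" range stated above rather than a hard limit.

```python
import pandas as pd

# Expected (low, high) bounds per feature, taken from the descriptions above
EXPECTED_RANGES = {
    "danceability": (0.0, 1.0),
    "energy": (0.0, 1.0),
    "speechiness": (0.0, 1.0),
    "acousticness": (0.0, 1.0),
    "instrumentalness": (0.0, 1.0),
    "liveness": (0.0, 1.0),
    "valence": (0.0, 1.0),
    "loudness": (-60.0, 0.0),  # typical range only, not a guarantee
}

def out_of_range_counts(frame, ranges=EXPECTED_RANGES):
    """Count values outside the expected [low, high] interval per column."""
    result = {}
    for col, (low, high) in ranges.items():
        if col in frame.columns:
            values = frame[col].dropna()
            result[col] = int(((values < low) | (values > high)).sum())
    return pd.Series(result, dtype="int64")

# Toy stand-in data (hypothetical) with a few deliberate violations
toy = pd.DataFrame({
    "energy": [0.2, 0.9, 1.3],
    "loudness": [-5.0, 2.1, -70.0],
    "tempo": [120.0, 90.0, 0.0],
})
violations = out_of_range_counts(toy)
print(violations)
```

Columns without a documented range (such as tempo) are skipped rather than guessed at; a nonzero count is a prompt to inspect the rows, not proof of bad data.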

### Why These Features Matter for Our Analysis

These features are crucial because:

1. **Quantitative Measurements**: Unlike subjective labels, these are numeric values computed algorithmically from the audio, although perceptual features such as valence and danceability remain model estimates
2. **Genre Differentiation**: Different genres have distinct patterns (e.g., classical has high acousticness and instrumentalness, EDM has high energy and danceability)
3. **Prediction Power**: These features can predict user preferences and song characteristics
4. **Widely Used**: Spotify exposes these features through its Web API, and they are common inputs in music recommendation and analysis work
5. **Interdependence**: Features correlate in interesting ways (e.g., energy often correlates with loudness and tempo)
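
The interdependence point can be checked directly with a pairwise correlation matrix. The sketch below uses synthetic stand-in data in which loudness is constructed to track energy, so the coefficients it prints say nothing about the real dataset; only the technique carries over.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
energy = rng.uniform(0, 1, n)
toy = pd.DataFrame({
    "energy": energy,
    # loudness built to rise with energy, plus noise (synthetic by construction)
    "loudness": -30 + 25 * energy + rng.normal(0, 2, n),
    # tempo drawn independently of energy
    "tempo": rng.normal(120, 25, n),
})

corr = toy.corr(method="pearson")
print(corr.round(2))

# Keep each pair once (upper triangle) and order by absolute correlation
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(upper)
        .stack()
        .dropna()
        .sort_values(key=np.abs, ascending=False)
)
print(pairs)
```

On the real data, the same two steps applied to the audio feature columns would show which pairs are strongly related and may be redundant as model inputs.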


In [None]:
# Identify the key audio features for analysis
audio_features = [
    'danceability', 'energy', 'loudness', 'speechiness', 
    'acousticness', 'instrumentalness', 'liveness', 'valence', 
    'tempo', 'duration_ms'
]

# Check which features are available in the dataset
available_features = [f for f in audio_features if f in df_original.columns]
print(f"Audio features available in dataset: {len(available_features)}/{len(audio_features)}\n")
print("Available features:")
for i, feature in enumerate(available_features, 1):
    print(f"  {i}. {feature}")

# Check if genre column exists
genre_col = 'track_genre' if 'track_genre' in df_original.columns else 'genre'
if genre_col in df_original.columns:
    print(f"\n✓ Genre column found: '{genre_col}'")
    print(f"  Number of unique genres: {df_original[genre_col].nunique()}")
else:
    print("\n✗ Genre column not found in dataset")


---

## 3. Data Cleaning and Missingness Handling

### Overview of Data Cleaning Process

Data cleaning is crucial for ensuring the quality and reliability of our analysis. Our cleaning process includes:

1. **Handling Duplicate Records**: Removing exact duplicate tracks
2. **Missing Value Analysis**: Identifying patterns in missing data
3. **Missing Value Imputation**: Filling missing values using appropriate strategies
4. **Outlier Detection**: Identifying extreme values that may indicate data quality issues
5. **Data Type Validation**: Ensuring all features have correct data types
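
Step 4 (outlier detection) is often implemented with the 1.5 × IQR rule. The sketch below shows that one common choice and is not necessarily the approach this project will settle on; `iqr_outlier_mask` and the toy durations are illustrative.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values more than k * IQR below Q1 or above Q3."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy durations in milliseconds (hypothetical); one track is implausibly long
durations = pd.Series([180_000, 200_000, 210_000, 190_000, 205_000, 3_600_000])
mask = iqr_outlier_mask(durations)
print(durations[mask])
```

Flagged rows are candidates for inspection rather than automatic removal, since long tracks can be legitimate (for example, live sets or classical movements).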

### Imputation Strategies

We will demonstrate three common imputation techniques:

1. **Mean Imputation**: Replace missing values with the feature mean (simple but ignores relationships)
2. **Median Imputation**: Replace with median (robust to outliers)
3. **KNN Imputation**: Use K-Nearest Neighbors to impute based on similar tracks (preserves relationships)
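
Because the missing values in this notebook are introduced artificially, the true values are known, and the three strategies can be scored by their error on the masked entries. A sketch on synthetic correlated data; the variable names, the 10% masking rate, and the data itself are assumptions for illustration, and the ranking it prints need not carry over to the Spotify features.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

rng = np.random.default_rng(42)
n = 400
# Two strongly related columns plus one independent column (all synthetic)
x = rng.normal(size=n)
truth = np.column_stack([
    x,
    2 * x + rng.normal(scale=0.3, size=n),
    rng.normal(size=n),
])

# Hide 10% of the second column and remember which rows were hidden
masked = truth.copy()
hidden = rng.choice(n, size=n // 10, replace=False)
masked[hidden, 1] = np.nan

def rmse_on_hidden(imputer):
    """Root-mean-square error of imputed values against the known truth."""
    filled = imputer.fit_transform(masked)
    return float(np.sqrt(np.mean((filled[hidden, 1] - truth[hidden, 1]) ** 2)))

scores = {
    "mean": rmse_on_hidden(SimpleImputer(strategy="mean")),
    "median": rmse_on_hidden(SimpleImputer(strategy="median")),
    "knn": rmse_on_hidden(KNNImputer(n_neighbors=5)),
}
for name, score in scores.items():
    print(f"{name}: RMSE = {score:.3f}")
```

The same scoring applies to the real features by comparing each imputed frame against the original values at the indices that were set to NaN.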


In [None]:
# Create a working copy of the dataset
df = df_original.copy()

print("STEP 1: Initial Data Quality Assessment")
print("="*60)
print(f"Original dataset shape: {df.shape}\n")

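# Check for duplicate rows. The CSV's exported index column ('Unnamed: 0', if present)
# is unique per row, so it is excluded; otherwise no duplicate could ever be detected.
dup_subset = [c for c in df.columns if c != 'Unnamed: 0']
duplicates = df.duplicated(subset=dup_subset).sum()
print(f"Number of duplicate rows: {duplicates}")

if duplicates > 0:
    df = df.drop_duplicates(subset=dup_subset).reset_index(drop=True)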
    print(f"✓ Removed {duplicates} duplicate rows")
    print(f"New dataset shape: {df.shape}")
else:
    print("✓ No duplicate rows found")

print("\n" + "="*60)


In [None]:
# STEP 2: Analyze missing values
print("\nSTEP 2: Missing Value Analysis")
print("="*60)

# Calculate missing values
missing_counts = df.isnull().sum()
missing_percentages = (df.isnull().sum() / len(df)) * 100

# Create missing value summary
missing_summary = pd.DataFrame({
    'Column': missing_counts.index,
    'Missing Count': missing_counts.values,
    'Missing %': missing_percentages.values
})
missing_summary = missing_summary[missing_summary['Missing Count'] > 0].sort_values('Missing Count', ascending=False)

if len(missing_summary) > 0:
    print("\nColumns with missing values:\n")
    print(missing_summary.to_string(index=False))
else:
    print("\n✓ No missing values found in the dataset!")
    print("\nTo demonstrate data cleaning techniques, we'll artificially introduce")
    print("some missing values in the audio features.")


In [None]:
# STEP 3: Introduce missing values for demonstration
# (This cell always runs: values already missing in the data are left as-is,
# and ~5% of each selected audio feature is additionally set to NaN.)

print("\nSTEP 3: Introducing Missing Values for Demonstration")
print("="*60)

# Create a copy with artificial missing values
df_with_missing = df.copy()
np.random.seed(42)

# Introduce ~5% missing values in selected audio features
missing_rate = 0.05
features_to_modify = ['energy', 'loudness', 'acousticness', 'valence', 'tempo']

for feature in features_to_modify:
    if feature in df_with_missing.columns:
        # Randomly select indices to set as missing
        n_missing = int(len(df_with_missing) * missing_rate)
        missing_indices = np.random.choice(df_with_missing.index, size=n_missing, replace=False)
        df_with_missing.loc[missing_indices, feature] = np.nan
        print(f"✓ Introduced {n_missing} missing values in '{feature}'")

# Show updated missing value summary
print("\nMissing value summary after introduction:")
missing_counts = df_with_missing[features_to_modify].isnull().sum()
missing_percentages = (df_with_missing[features_to_modify].isnull().sum() / len(df_with_missing)) * 100

missing_summary = pd.DataFrame({
    'Feature': missing_counts.index,
    'Missing Count': missing_counts.values,
    'Missing %': missing_percentages.values
})
print(missing_summary.to_string(index=False))


In [None]:
# Visualize missing data patterns
print("\nVisualizing Missing Data Patterns:")

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Missing value counts
missing_data = df_with_missing[features_to_modify].isnull().sum().sort_values(ascending=False)
axes[0].bar(range(len(missing_data)), missing_data.values, color='coral')
axes[0].set_xticks(range(len(missing_data)))
axes[0].set_xticklabels(missing_data.index, rotation=45, ha='right')
axes[0].set_ylabel('Number of Missing Values')
axes[0].set_title('Missing Values by Feature')
axes[0].grid(True, alpha=0.3, axis='y')

# Plot 2: Missing value percentages
missing_pct = (df_with_missing[features_to_modify].isnull().sum() / len(df_with_missing) * 100).sort_values(ascending=False)
axes[1].bar(range(len(missing_pct)), missing_pct.values, color='skyblue')
axes[1].set_xticks(range(len(missing_pct)))
axes[1].set_xticklabels(missing_pct.index, rotation=45, ha='right')
axes[1].set_ylabel('Percentage of Missing Values (%)')
axes[1].set_title('Missing Values Percentage by Feature')
axes[1].grid(True, alpha=0.3, axis='y')
axes[1].axhline(y=5, color='red', linestyle='--', linewidth=1, label='5% injected rate')
axes[1].legend()

plt.tight_layout()
plt.show()


In [None]:
# STEP 4: Imputation Method 1 - Mean Imputation
print("\nSTEP 4: Imputation Method 1 - Mean Imputation")
print("="*60)
print("Strategy: Replace missing values with the mean of each feature")
print("Pros: Simple, fast, preserves overall mean")
print("Cons: Reduces variance, ignores relationships between features\n")

# Create copy for mean imputation
df_mean_imputed = df_with_missing.copy()

# Apply mean imputation
imputer_mean = SimpleImputer(strategy='mean')
df_mean_imputed[features_to_modify] = imputer_mean.fit_transform(df_mean_imputed[features_to_modify])

# Verify no missing values remain
remaining_missing = df_mean_imputed[features_to_modify].isnull().sum().sum()
print("✓ Mean imputation completed")
print(f"✓ Remaining missing values: {remaining_missing}")

# Show example of imputed values
print("\nExample: Mean values used for imputation:")
for feature in features_to_modify:
    mean_val = imputer_mean.statistics_[features_to_modify.index(feature)]
    print(f"  {feature}: {mean_val:.4f}")


In [None]:
# STEP 5: Imputation Method 2 - Median Imputation
print("\nSTEP 5: Imputation Method 2 - Median Imputation")
print("="*60)
print("Strategy: Replace missing values with the median of each feature")
print("Pros: Robust to outliers, preserves central tendency")
print("Cons: Still ignores relationships between features\n")

# Create copy for median imputation
df_median_imputed = df_with_missing.copy()

# Apply median imputation
imputer_median = SimpleImputer(strategy='median')
df_median_imputed[features_to_modify] = imputer_median.fit_transform(df_median_imputed[features_to_modify])

# Verify no missing values remain
remaining_missing = df_median_imputed[features_to_modify].isnull().sum().sum()
print("✓ Median imputation completed")
print(f"✓ Remaining missing values: {remaining_missing}")

# Show example of imputed values
print("\nExample: Median values used for imputation:")
for feature in features_to_modify:
    median_val = imputer_median.statistics_[features_to_modify.index(feature)]
    print(f"  {feature}: {median_val:.4f}")


In [None]:
# STEP 6: Imputation Method 3 - KNN Imputation
print("\nSTEP 6: Imputation Method 3 - KNN Imputation")
print("="*60)
print("Strategy: Impute using K-Nearest Neighbors based on similar tracks")
print("Pros: Preserves relationships between features, often more accurate")
print("Cons: Computationally expensive, requires more memory\n")

# Create copy for KNN imputation
df_knn_imputed = df_with_missing.copy()

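# Standardize first so large-range features (loudness, tempo) do not dominate the
# neighbor distances. StandardScaler ignores NaNs when fitting and keeps them in transform.
scaler = StandardScaler()
scaled = scaler.fit_transform(df_knn_imputed[features_to_modify])

# Apply KNN imputation (using k=5 neighbors), then map back to the original units
print("Performing KNN imputation (k=5 neighbors)...")
imputer_knn = KNNImputer(n_neighbors=5)
df_knn_imputed[features_to_modify] = scaler.inverse_transform(imputer_knn.fit_transform(scaled))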

# Verify no missing values remain
remaining_missing = df_knn_imputed[features_to_modify].isnull().sum().sum()
print("✓ KNN imputation completed")
print(f"✓ Remaining missing values: {remaining_missing}")


In [None]:
# STEP 7: Compare imputation methods
print("\nSTEP 7: Comparing Imputation Methods")
print("="*60)

# Calculate statistics for each method
comparison_data = []

for feature in features_to_modify:
    original_mean = df[feature].mean()
    original_std = df[feature].std()
    
    mean_imputed_mean = df_mean_imputed[feature].mean()
    mean_imputed_std = df_mean_imputed[feature].std()
    
    median_imputed_mean = df_median_imputed[feature].mean()
    median_imputed_std = df_median_imputed[feature].std()
    
    knn_imputed_mean = df_knn_imputed[feature].mean()
    knn_imputed_std = df_knn_imputed[feature].std()
    
    comparison_data.append({
        'Feature': feature,
        'Original Mean': original_mean,
        'Mean Imp. Mean': mean_imputed_mean,
        'Median Imp. Mean': median_imputed_mean,
        'KNN Imp. Mean': knn_imputed_mean,
        'Original Std': original_std,
        'Mean Imp. Std': mean_imputed_std,
        'Median Imp. Std': median_imputed_std,
        'KNN Imp. Std': knn_imputed_std
    })

comparison_df = pd.DataFrame(comparison_data)
print("\nComparison of Imputation Methods:")
print("\nMean Comparison:")
print(comparison_df[['Feature', 'Original Mean', 'Mean Imp. Mean', 'Median Imp. Mean', 'KNN Imp. Mean']].to_string(index=False))
print("\nStandard Deviation Comparison:")
print(comparison_df[['Feature', 'Original Std', 'Mean Imp. Std', 'Median Imp. Std', 'KNN Imp. Std']].to_string(index=False))


In [None]:
# Visualize the impact of different imputation methods
print("\nVisualizing Imputation Method Comparison:\n")

# Select one feature for detailed comparison
example_feature = 'energy'

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Original distribution
axes[0, 0].hist(df[example_feature], bins=50, alpha=0.7, color='green', edgecolor='black')
axes[0, 0].set_title(f'Original {example_feature.title()} Distribution')
axes[0, 0].set_xlabel(example_feature.title())
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].axvline(df[example_feature].mean(), color='red', linestyle='--', linewidth=2, label='Mean')
axes[0, 0].legend()

# Mean imputation
axes[0, 1].hist(df_mean_imputed[example_feature], bins=50, alpha=0.7, color='blue', edgecolor='black')
axes[0, 1].set_title(f'Mean Imputed {example_feature.title()} Distribution')
axes[0, 1].set_xlabel(example_feature.title())
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].axvline(df_mean_imputed[example_feature].mean(), color='red', linestyle='--', linewidth=2, label='Mean')
axes[0, 1].legend()

# Median imputation
axes[1, 0].hist(df_median_imputed[example_feature], bins=50, alpha=0.7, color='orange', edgecolor='black')
axes[1, 0].set_title(f'Median Imputed {example_feature.title()} Distribution')
axes[1, 0].set_xlabel(example_feature.title())
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].axvline(df_median_imputed[example_feature].mean(), color='red', linestyle='--', linewidth=2, label='Mean')
axes[1, 0].legend()

# KNN imputation
axes[1, 1].hist(df_knn_imputed[example_feature], bins=50, alpha=0.7, color='purple', edgecolor='black')
axes[1, 1].set_title(f'KNN Imputed {example_feature.title()} Distribution')
axes[1, 1].set_xlabel(example_feature.title())
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].axvline(df_knn_imputed[example_feature].mean(), color='red', linestyle='--', linewidth=2, label='Mean')
axes[1, 1].legend()

plt.tight_layout()
plt.show()

print("\nObservations to check against the plots above:")
print("- Mean/Median imputation may create artificial peaks at the mean/median value")
print("- KNN imputation tends to preserve the original distribution shape better")
print("- KNN imputation uses relationships between features, which often yields closer estimates")
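
The artificial peak from mean imputation also shows up as a predictable drop in spread: replacing a fraction p of the values with the mean of the remaining ones shrinks the sample standard deviation, relative to the observed values, by roughly sqrt(1 - p). A small numeric check on synthetic data (not the Spotify features):

```python
import numpy as np

rng = np.random.default_rng(7)
values = rng.normal(loc=0.6, scale=0.25, size=10_000)

p = 0.05  # fraction of values replaced, matching the 5% rate used above
n_missing = int(len(values) * p)
missing_idx = rng.choice(len(values), size=n_missing, replace=False)

# Values that remain observed after masking
observed = np.delete(values, missing_idx)

# Mean imputation: fill every masked position with the observed mean
imputed = values.copy()
imputed[missing_idx] = observed.mean()

ratio = imputed.std(ddof=1) / observed.std(ddof=1)
print(f"std ratio after imputation: {ratio:.4f}")
print(f"sqrt(1 - p):                {np.sqrt(1 - p):.4f}")
```

The imputed points add nothing to the sum of squared deviations while still increasing the sample count, which is why the ratio lands so close to sqrt(1 - p).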
