# Phase 2 - Step 2: Data Exploration
### Spotify Song Popularity Predictor 🎵

**Goal:**  
Explore and understand the combined Spotify dataset to identify:
- Column meanings and data types  
- Missing values and data quality issues  
- Basic distributions of key features (danceability, energy, tempo, etc.)  
- Early correlations with song popularity

We'll use visual and statistical methods to prepare for cleaning and feature engineering.


In [1]:
import os
os.chdir("..")


In [5]:
# Import Required Libraries


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from src.data.compute_vif import compute_and_print_vif_from_config, iterative_vif_reduction

import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
plt.rcParams["figure.figsize"] = (20, 6)

import logging
logging.getLogger('matplotlib.pyplot').setLevel(logging.WARNING)


from src.utils.logger import Logger
from src.utils.helper import get_config

Initialized logger...


In [None]:
# Initialize Logger and Config

logger = Logger().get_logger(__name__)
config = get_config()

logger.info("Starting Data Exploration phase...")

In [None]:
# Load the Combined Dataset

data_path = config["data"]["processed"]["combined"]
df = pd.read_csv(data_path)

df.head()

In [None]:
df.info()

In [None]:
# summary statistics

df.describe().T

## 💡 Insights

1. **Data Quality:**

   * Very few missing values ✅
   * Reasonable numeric ranges ✅
   * No extreme outliers, apart from `loudness` and a few very long tracks in `duration_ms` ✅
     → The dataset looks high-quality, so little cleaning should be needed.

2. **Feature Relevance:**

   * Useful for modeling: `danceability`, `energy`, `valence`, `tempo`, `loudness`, `acousticness`.
   * Likely drop: identifiers (`uri`, `id`, `track_href`, etc.), descriptive text columns (`track_name`, `playlist_name`).

3. **Feature Engineering Ideas (Future):**

   * Extract **release year** from `track_album_release_date`.
   * Encode categorical variables like `playlist_genre` and `playlist_subgenre`.
   * Normalize continuous variables like `duration_ms` or `tempo`.
   * Keep `mode` (0 = minor, 1 = major) as a single binary indicator; a full one-hot encoding would add a redundant, perfectly collinear column.

4. **Potential Hypotheses to Test Later:**

   * Higher **energy** and **danceability** → more popular.
   * **Acoustic** or **instrumental** songs → less popular.
   * Popularity may vary by **genre** or **tempo range**.
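
The feature engineering ideas in item 3 could be sketched roughly as follows. This is a minimal illustration on a made-up toy frame, not the project's actual pipeline: the `add_basic_features` helper and the sample rows are hypothetical, and the regex assumes release dates start with a four-digit year (year-only, year-month, or full dates).

```python
import pandas as pd

def add_basic_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper illustrating the feature ideas listed above."""
    out = df.copy()
    # Assumption: every release date starts with a 4-digit year
    out["release_year"] = (
        out["track_album_release_date"]
        .astype(str)
        .str.extract(r"^(\d{4})")[0]
        .astype(float)
    )
    # One-hot encode the genre column
    out = pd.get_dummies(out, columns=["playlist_genre"], prefix="genre")
    # Min-max scale tempo to the [0, 1] range
    tempo = out["tempo"]
    out["tempo_scaled"] = (tempo - tempo.min()) / (tempo.max() - tempo.min())
    return out

# Toy rows for illustration only (not taken from the real dataset)
toy = pd.DataFrame({
    "track_album_release_date": ["2019-06-14", "2005", "2021-01"],
    "playlist_genre": ["pop", "rock", "pop"],
    "tempo": [100.0, 140.0, 120.0],
})
features = add_basic_features(toy)
print(features[["release_year", "tempo_scaled"]])
```

Min-max scaling is only one option here; standardization (zero mean, unit variance) would be an equally reasonable choice depending on the model.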

---

Reading the `duration_ms` row of the summary statistics:

- **Mean:** about 3 min 26 s, which matches typical pop song lengths.
- **Standard deviation:** about 1.3 minutes, so the dataset has both short and long songs, but no extreme differences overall.
- **25th percentile:** 25% of songs are shorter than about 2 min 39 s.
- **Median:** half the songs are shorter than 3.25 minutes (3 min 15 s) and half are longer, again typical for mainstream tracks.
- **75th percentile:** 75% of songs are under about 3 min 53 s, so most songs fall neatly in the 2.5–4 minute window.
- **Max:** 22 min 35 s. Woah 😳 that's a long one, maybe a live recording, podcast segment, or instrumental version. Definitely an outlier.

## Summary of Duration:
The dataset mostly contains regular-length songs (2–4 minutes),
a few really short intros, and one or two marathon tracks.
If you're building a model, consider capping or normalizing the long ones:
they can distort averages and give a handful of tracks outsized influence on the model.
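
One way to cap the marathon tracks is to clip `duration_ms` at an upper quantile. The sketch below is an assumption-laden illustration: `cap_duration` is a hypothetical helper, the default quantile of 0.99 is an arbitrary choice, and the toy values are made up (the last one mimics the 22 min 35 s track).

```python
import pandas as pd

def cap_duration(durations_ms: pd.Series, upper_quantile: float = 0.99) -> pd.Series:
    """Clip durations above the chosen quantile (hypothetical helper)."""
    cap = durations_ms.quantile(upper_quantile)
    return durations_ms.clip(upper=cap)

# Toy durations in milliseconds; 1_355_000 ms is 22 min 35 s
toy_ms = pd.Series([159_000, 195_000, 206_000, 233_000, 1_355_000])

# A low quantile is used here only because the toy sample is tiny
capped = cap_duration(toy_ms, upper_quantile=0.75)
print((capped / 60_000).round(2))  # durations in minutes after capping
```

Clipping keeps the row but limits its influence; dropping the row entirely is the alternative if very long tracks are out of scope for the model.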

In [None]:
# Verify the hypotheses visually
# Energy vs Popularity
sns.scatterplot(x="energy", y="track_popularity", data=df, alpha=0.5)
plt.title("Energy vs Popularity")



In [None]:

# Danceability vs Popularity
sns.scatterplot(x="danceability", y="track_popularity", data=df, alpha=0.5)
plt.title("Danceability vs Popularity")

In [None]:

# Acousticness vs Popularity
sns.scatterplot(x="acousticness", y="track_popularity", data=df, alpha=0.5)
plt.title("Acousticness vs Popularity")

In [None]:
# Instrumentalness vs Popularity
sns.scatterplot(x="instrumentalness", y="track_popularity", data=df, alpha=0.5)
plt.title("Instrumentalness vs Popularity")

# The last two plots (acousticness, instrumentalness) support the hypothesis:
# popularity tends to be higher when the feature value is near 0.
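
Dense scatter plots are hard to judge by eye, so it may help to put a number on each hypothesis. The sketch below computes a Spearman rank correlation (as the Pearson correlation of ranks) between each feature and `track_popularity`. Spearman is a choice made for this illustration, `popularity_correlations` is a hypothetical helper, and the toy values are synthetic, so they say nothing about the real dataset.

```python
import pandas as pd

def popularity_correlations(df, features, target="track_popularity"):
    """Spearman correlation of each feature with the target (hypothetical helper).

    Spearman correlation equals the Pearson correlation of the ranked values.
    """
    ranked_features = df[features].rank()
    ranked_target = df[target].rank()
    return ranked_features.corrwith(ranked_target).sort_values()

# Synthetic toy data built to be perfectly monotone
toy = pd.DataFrame({
    "energy":           [0.2, 0.4, 0.6, 0.8, 0.9],
    "instrumentalness": [0.9, 0.5, 0.3, 0.1, 0.0],
    "track_popularity": [10, 30, 45, 70, 80],
})
correlations = popularity_correlations(toy, ["energy", "instrumentalness"])
print(correlations)
```

On the real data, the same call with `["energy", "danceability", "acousticness", "instrumentalness"]` would give one signed number per hypothesis.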

In [None]:
df.columns

In [None]:
numeric_cols = df.select_dtypes(include=np.number).columns.tolist()

# DataFrame.hist creates its own figure, so a separate plt.figure() call
# would only produce an extra empty figure.
df[numeric_cols].hist(bins=20, figsize=(14, 12), color="skyblue", edgecolor="black")
plt.suptitle("Distribution of Numeric Features", fontsize=16)
plt.show()


## 💡 Summary: Quick Insight

| Observation                                                                             | Meaning                                 | Possible Next Step                |
| --------------------------------------------------------------------------------------- | --------------------------------------- | --------------------------------- |
| Several features are **skewed** (speechiness, acousticness, liveness, instrumentalness) | They could distort some models          | Log-transform or normalize them   |
| Some features are **categorical in disguise** (mode, key, time_signature)               | Values are codes, not magnitudes        | One-hot encode instead of scaling |
| A few **outliers** exist (duration_ms, loudness)                                        | They may hurt training                  | Handle with clipping or filtering |
| `track_popularity` has good spread                                                      | Perfect for prediction target           | No change needed                  |
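
The log-transform step for skewed features might look like the sketch below. It assumes the features are non-negative (so `log1p` is defined) and uses a skewness threshold of 1.0, which is an arbitrary cut-off chosen for illustration. `log_transform_skewed` is a hypothetical helper and the toy values are made up.

```python
import numpy as np
import pandas as pd

def log_transform_skewed(df, columns, skew_threshold=1.0):
    """Apply log1p to columns whose skewness exceeds the threshold (hypothetical helper)."""
    out = df.copy()
    transformed = []
    for col in columns:
        if out[col].skew() > skew_threshold:
            out[col] = np.log1p(out[col])  # log(1 + x), safe at x = 0
            transformed.append(col)
    return out, transformed

# Made-up values: one column with a long right tail, one roughly symmetric
toy = pd.DataFrame({
    "speechiness":  [0.03, 0.04, 0.04, 0.05, 0.05, 0.06, 0.90],
    "danceability": [0.40, 0.50, 0.55, 0.60, 0.65, 0.70, 0.80],
})
transformed_df, changed = log_transform_skewed(toy, ["speechiness", "danceability"])
print("Transformed columns:", changed)
```

Only right-skewed columns are touched; symmetric ones pass through unchanged, which keeps the transform easy to audit.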


# Imbalanced dataset?
The high-popularity group is only about half the size of the low-popularity group.

In [None]:

sns.heatmap(
    df.corr(numeric_only=True),   # correlation matrix
    cmap="coolwarm",              # color style
    center=0,                     # center around 0 (for balance)
    annot=True,                   # show numbers inside cells
    fmt=".2f"                     # format numbers to 2 decimal places
)

plt.title("Feature Correlation Heatmap")
plt.show()


# Multicollinearity

`loudness` and `energy` are strongly correlated in the heatmap.

That makes sense: if a song has a lot of energy, it will usually be fairly loud too.

> You spotted true multicollinearity.

> Use VIF to confirm it, and use domain knowledge (drop loudness, keep energy) for your Linear Regression model.
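
For reference, the VIF of feature $j$ is $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing that feature on all the others; it also equals the $j$-th diagonal entry of the inverse correlation matrix. The sketch below uses that identity on synthetic data. It is not necessarily what `compute_and_print_vif_from_config` does internally (that helper lives in `src/data/compute_vif.py` and is not shown here); `vif_table` and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd

def vif_table(df: pd.DataFrame) -> pd.Series:
    """VIF per column, read off the diagonal of the inverse correlation matrix."""
    corr = df.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=df.columns, name="VIF").sort_values(ascending=False)

# Synthetic data: loudness is built to track energy, valence is independent
rng = np.random.default_rng(0)
energy = rng.uniform(0, 1, 500)
toy = pd.DataFrame({
    "energy": energy,
    "loudness": -20 + 15 * energy + rng.normal(0, 1, 500),
    "valence": rng.uniform(0, 1, 500),
})
vif = vif_table(toy)
print(vif.round(2))
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity, though the exact cut-off is a judgment call.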

In [None]:

# Step 1: Inspect VIF scores
compute_and_print_vif_from_config(df, numeric_cols)

# # Step 2: Reduce features iteratively
# df_reduced, dropped_features = iterative_vif_reduction(df, numeric_cols)

# print("Dropped features due to high multicollinearity:", dropped_features)

In [None]:
# # Check for Missing Values

# missing_values = df.isnull().sum().sort_values(ascending=False)
# missing_values[missing_values > 0]

In [None]:
# Check Duplicates

df.duplicated().sum()

In [None]:
# TODO: move these validation checks into src/data/validate_data.py

## **Phase 2 → Step 3: Data Cleaning and Preprocessing**

- Handle missing values
- Remove duplicates
- Fix data types and scaling
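
A first pass at these three steps could look like the sketch below. It is only an outline under stated assumptions: `basic_clean` is a hypothetical helper, dropping every row with a missing value is the simplest possible policy (imputation may be preferable), and the toy rows are made up.

```python
import pandas as pd

def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning pass: drop missing rows, drop duplicates, fix dtypes."""
    out = df.dropna().drop_duplicates().reset_index(drop=True)
    # Popularity is a whole-number score, so store it as an integer
    out["track_popularity"] = out["track_popularity"].astype(int)
    return out

# Toy rows: one exact duplicate and one row with a missing name
toy = pd.DataFrame({
    "track_name": ["A", "A", "B", None],
    "track_popularity": [50.0, 50.0, 70.0, 10.0],
})
cleaned = basic_clean(toy)
print(cleaned)
print(cleaned.dtypes)
```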

In [None]:
# # Remove rows with missing values
# df_cleaned = df.dropna()

# # Verify that missing values are gone
# missing_values = df_cleaned.isnull().sum().sort_values(ascending=False)
# missing_values[missing_values > 0]

In [None]:
# Data Type Conversion