# 🎧 HitSense EDA + Baseline Modeling Notebook
This notebook explores music data to predict hit potential using regression.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Set visual style
sns.set_theme(style="whitegrid")

In [None]:
os.makedirs('data', exist_ok=True)

sample_data = {
    'danceability': [0.5, 0.8, 0.6],
    'energy': [0.7, 0.6, 0.8],
    'tempo': [120, 130, 110],
    'valence': [0.4, 0.6, 0.3],
    'acousticness': [0.1, 0.05, 0.2],
    'popularity': [50, 80, 30]
}
df = pd.DataFrame(sample_data)
df.to_csv('data/hitsense_raw.csv', index=False)
df.head()

In [None]:
print("Data Info:")
df.info()
print("\nMissing Values:")
print(df.isnull().sum())
print("\nSummary Stats:")
print(df.describe())

In [None]:
plt.figure(figsize=(8, 5))
sns.histplot(df['popularity'], kde=True, bins=10)
plt.title("Popularity Distribution")
plt.xlabel("Popularity")
plt.ylabel("Count")
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
plt.title("Feature Correlation Matrix")
plt.show()

In [None]:
df['hit_score'] = df['popularity'] / 100
df['vibe_score'] = df['danceability'] * df['valence']
df.to_csv('data/hitsense_cleaned.csv', index=False)
df.head()

In [None]:
features = df[['danceability', 'energy', 'tempo', 'valence', 'acousticness', 'vibe_score']]
target = df['hit_score']

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

predictions = model.predict(X_test)

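mse = mean_squared_error(y_test, predictions)
# R² is undefined for fewer than two test samples (scikit-learn returns NaN with a warning)
r2 = r2_score(y_test, predictions) if len(y_test) >= 2 else float("nan")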

print("\nModel Evaluation:")
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"R² Score: {r2:.4f}")

## 🔍 Project Objective

This project explores how audio features (energy, tempo, valence, danceability, and acousticness) relate to a song's popularity on Spotify. The objective is to identify meaningful patterns and build a simple predictive model that estimates popularity from these features. Such a model could be useful for producers, playlist curators, and music marketers.

## 📊 Visual Explorations

We used visualizations to better understand the distributions of the audio features and the relationships between them. All features are numeric, so the plots are per-feature histograms and a correlation heatmap, each with readable labels and a descriptive title.

In [None]:
# Histogram of feature distributions
df.hist(bins=15, figsize=(10, 8), edgecolor='black')
plt.suptitle('Distribution of Audio Features')
plt.tight_layout()
plt.show()

# Correlation heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Between Audio Features and Popularity')
plt.show()

## 🤖 Modeling Strategy

We chose a **supervised regression** approach because the goal is to predict a continuous target variable: **song popularity**, a numeric score on a 0–100 scale rather than a class label.

We experimented with multiple regression models, including **Linear Regression** and **Random Forest Regressor**, to compare performance and interpretability.


In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score

# Define features and target
X = df[['danceability', 'energy', 'tempo', 'valence', 'acousticness']]
y = df['popularity']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Random Forest
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

# Evaluate
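mse_rf = mean_squared_error(y_test, y_pred_rf)
# R² needs at least two test samples; otherwise scikit-learn returns NaN with a warning
r2_rf = r2_score(y_test, y_pred_rf) if len(y_test) >= 2 else float("nan")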

print(f"Random Forest - MSE: {mse_rf:.2f}")
print(f"Random Forest - R²: {r2_rf:.2f}")

# Cross-validation: R² needs at least two samples in every test fold,
# so cap the number of folds at n_samples // 2
n_samples = len(X)
cv = min(5, n_samples // 2)

if cv >= 2:
    cv_scores = cross_val_score(rf, X, y, cv=cv, scoring='r2')
    print(f"Cross-validated R² scores: {cv_scores}")
    print(f"Average CV R²: {cv_scores.mean():.2f}")
else:
    print("Not enough samples for cross-validation.")


## 🤖 Modeling Approach

We applied a simple linear regression model to predict a song's popularity based on audio features. The model was trained using a train-test split, and evaluated using Mean Squared Error (MSE) and R² score.

These metrics help us understand how well the model fits the data and whether the features have predictive power.
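
One way to check whether the features carry predictive power is to compare the model against a predictor that ignores them. The sketch below is illustrative only: it uses synthetic data (an assumption, not the notebook's dataset) and scikit-learn's `DummyRegressor`, which always predicts the training mean. The names `X_demo`, `y_demo`, `baseline`, and `linear` are introduced here for the example.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration, not real song data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(200, 3))
y_demo = 60 * X_demo[:, 0] + 20 * X_demo[:, 1] + rng.normal(0, 5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Baseline that ignores the features and always predicts the training mean
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
linear = LinearRegression().fit(X_tr, y_tr)

mse_baseline = mean_squared_error(y_te, baseline.predict(X_te))
mse_linear = mean_squared_error(y_te, linear.predict(X_te))

print(f"Baseline MSE: {mse_baseline:.2f}")
print(f"Linear regression MSE: {mse_linear:.2f}")
```

If the fitted model's MSE is not clearly below the baseline's, the features are adding little beyond the average popularity.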

In [None]:
# Linear regression model
X = df[['danceability', 'energy', 'tempo', 'valence', 'acousticness']]
y = df['popularity']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lr = LinearRegression()
lr.fit(X_train, y_train)

y_pred = lr.predict(X_test)

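mse = mean_squared_error(y_test, y_pred)
# R² is undefined for fewer than two test samples (scikit-learn returns NaN with a warning)
r2 = r2_score(y_test, y_pred) if len(y_test) >= 2 else float("nan")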

print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.2f}")

## 📈 Findings and Business Recommendations

- In the three-row sample, **danceability and valence** are positively correlated with popularity, while **energy and acousticness** are negatively correlated. With so few rows, these signs are not reliable evidence of a real effect.
- The 80/20 split leaves a single test row, so R² is undefined there and MSE reflects one prediction only. The sample cannot show whether the linear model has predictive value.

### ✅ Recommendations:
- Treat the link between danceable production and wider reach as a hypothesis to test on a larger dataset, not as a conclusion from this sample.
- For deeper insights, collect a larger dataset and explore more complex models (e.g., Random Forest, XGBoost).
- Incorporate genre and artist metadata to strengthen predictive performance.
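
As a rough sketch of the metadata recommendation, a categorical column can be one-hot encoded alongside the numeric features. The `genre` column and every row below are hypothetical; the notebook's dataset has no such column, and the pipeline layout is one possible design, not a tested recipe.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows: `genre` and all values here are made up for illustration
demo = pd.DataFrame({
    "danceability": [0.5, 0.8, 0.6, 0.7],
    "energy": [0.7, 0.6, 0.8, 0.5],
    "genre": ["pop", "dance", "rock", "pop"],
    "popularity": [50, 80, 30, 65],
})

numeric_cols = ["danceability", "energy"]
categorical_cols = ["genre"]

# One-hot encode genre; pass the numeric columns through unchanged
preprocess = ColumnTransformer([
    ("genre", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ("numeric", "passthrough", numeric_cols),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestRegressor(n_estimators=50, random_state=42)),
])
pipeline.fit(demo[numeric_cols + categorical_cols], demo["popularity"])

# A genre unseen during training is encoded as all zeros instead of raising
new_song = pd.DataFrame(
    {"danceability": [0.65], "energy": [0.7], "genre": ["jazz"]}
)
predicted = pipeline.predict(new_song)
print(predicted)
```

Keeping the encoder inside the pipeline means the same transformation is applied at fit and predict time.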

This notebook provides a starting workflow that stakeholders can rerun on a larger dataset to study music performance metrics.

## 📌 Learning Type and Prediction Outcome

This project uses **supervised learning** with a **regression model**. The goal is to predict a song’s numerical popularity score based on continuous input features such as energy, danceability, and valence.

## 📁 Data Acquisition

The dataset used in this notebook is a three-row, hand-entered sample modeled on Spotify audio features. Each record includes danceability, energy, tempo, valence, acousticness, and a popularity score.

For a full production version, data would be gathered using the Spotify Web API or third-party music analytics services to access a larger and more representative set of songs.
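
When moving to a larger export, it may help to validate the file before modeling. The helper below is a hypothetical sketch: `load_tracks` and `REQUIRED_COLUMNS` are names introduced here, and the input is assumed to be a CSV with the same columns as the sample. It does not call the Spotify Web API.

```python
import pandas as pd

# Columns used in this notebook's sample data
REQUIRED_COLUMNS = [
    "danceability", "energy", "tempo", "valence", "acousticness", "popularity",
]


def load_tracks(path):
    """Load a track CSV and check it has the columns the notebook expects.

    Hypothetical helper: assumes a CSV export with one row per song.
    """
    tracks = pd.read_csv(path)
    missing = [col for col in REQUIRED_COLUMNS if col not in tracks.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    # Drop rows with missing values in any required column
    return tracks.dropna(subset=REQUIRED_COLUMNS).reset_index(drop=True)
```

For example, `load_tracks('data/hitsense_raw.csv')` would reload the sample file written earlier in the notebook.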

## 🧹 Data Preprocessing

Before modeling, the following preprocessing steps were applied:
- Missing values: Checked and confirmed as none in the dataset.
- Feature selection: Used `danceability`, `energy`, `tempo`, `valence`, and `acousticness` as predictors.
- Splitting: The dataset was split into a training set and a test set using an 80/20 ratio. With three rows, this yields two training rows and one test row.
- No categorical encoding was required as all features are numeric.
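
With only a handful of rows, an 80/20 split leaves very few test samples. One alternative is leave-one-out cross-validation, scored with an error metric that is still defined when each fold has a single test row. This is a sketch rather than the notebook's method: it rebuilds the three sample songs and uses only two features to keep the fit small.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Rebuild the three sample rows so the example is self-contained
sample = pd.DataFrame({
    "danceability": [0.5, 0.8, 0.6],
    "energy": [0.7, 0.6, 0.8],
    "popularity": [50, 80, 30],
})
X_small = sample[["danceability", "energy"]]
y_small = sample["popularity"]

# Each fold trains on n-1 rows and tests on the one remaining row.
# Absolute error is used because R² is undefined on a single test row.
loo_scores = cross_val_score(
    LinearRegression(), X_small, y_small,
    cv=LeaveOneOut(), scoring="neg_mean_absolute_error",
)
loo_mae = -loo_scores  # scikit-learn returns negated errors for this scorer

print(f"Per-fold absolute error: {loo_mae}")
print(f"Mean absolute error: {loo_mae.mean():.2f}")
```

Every row is used for testing exactly once, which makes better use of a tiny sample than a single split, though three rows remain far too few for a trustworthy estimate.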

## 📐 Model Evaluation

The model was evaluated using:
- **Mean Squared Error (MSE)**: Measures average squared difference between predictions and actual values.
- **R² Score**: Indicates how well the model explains variance in the data.
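
The two metrics can be written out directly from their definitions. The sketch below uses made-up values for `y_true` and `y_hat` (both names are introduced here) and compares the hand computation with scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted popularity values, for illustration only
y_true = np.array([50.0, 80.0, 30.0, 60.0])
y_hat = np.array([55.0, 70.0, 35.0, 60.0])

# MSE: mean of the squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)

# R²: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MSE: {mse_manual:.2f}")   # MSE: 37.50
print(f"R²: {r2_manual:.3f}")     # R²: 0.885

# The hand-computed values agree with scikit-learn
print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
```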

A basic linear regression was chosen as a baseline. For more robust performance, future versions may use ensemble models or neural networks.