# Music Dataset Exploration (Spotify & YouTube)

## Objective
Explore audio features and popularity metrics in the combined Spotify and YouTube music dataset: analyze feature distributions, summarize artist- and track-level patterns, and relate track characteristics to popularity.

## Table of Contents
1. Imports & Setup  
2. Data Loading & Cleaning  
3. Audio Feature Distributions  
4. Feature Correlations  
5. Artist-Level Analysis  
6. Track-Level Analysis  
7. Popularity Insights Using YouTube Views  
8. Instrumentalness vs Acousticness  
9. Track Duration vs Tempo  
10. Metadata Exploration  
11. Overall Conclusion

In [ ]:
# 1. Imports & Setup
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Optional: seaborn styling (set_theme is the current name; sns.set is an alias for it)
sns.set_theme(style="whitegrid")

## 2. Data Loading & Cleaning
- Load dataset from CSV.
- Remove rows with NaNs in crucial audio features.
- Print missing value summary before and after cleaning.
- Show final dataset size.

In [ ]:
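# Example loading and cleaning code (assumes the CSV sits next to the notebook)
df = pd.read_csv("spotify_and_youtube.csv")

critical_cols = ['Danceability', 'Energy', 'Key', 'Loudness', 'Speechiness',
                 'Acousticness', 'Instrumentalness', 'Liveness', 'Valence', 'Tempo', 'Duration_ms']

# Missing-value summary before cleaning
print("Missing values before cleaning:\n", df[critical_cols].isnull().sum())

# Drop rows missing any critical audio feature; copy so later edits do not touch df
df_clean = df.dropna(subset=critical_cols).copy()

# Missing-value summary after cleaning, then the final dataset size
print("Missing values after cleaning:\n", df_clean.isnull().sum())
print(f"Rows before cleaning: {len(df)}, after cleaning: {len(df_clean)}")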

#### Conclusion
Rows with missing audio feature data were dropped, so the critical columns contain no gaps. The printed row counts show that a high number of records was retained for analysis.

## 3. Audio Feature Distributions
- Histograms for key audio features.

In [ ]:
audio_features = ['Danceability', 'Energy', 'Valence', 'Loudness', 'Speechiness', 
                  'Acousticness', 'Instrumentalness', 'Liveness', 'Tempo', 'Duration_ms']
df_clean[audio_features].hist(bins=30, figsize=(20,10))
plt.suptitle('Distribution of Audio Features')
plt.show()

#### Conclusion
Danceability and energy are concentrated toward the higher end of their ranges, while instrumentalness is heavily skewed toward low values, which suggests the dataset consists mainly of vocal tracks.

## 4. Feature Correlations
- Heatmap showing correlation between audio features.

In [ ]:
plt.figure(figsize=(12,8))
sns.heatmap(df_clean[audio_features].corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of Audio Features')
plt.show()

#### Conclusion
Energy correlates strongly and positively with danceability and loudness, and negatively with acousticness, which is consistent with the expected contrast between acoustic and electronic tracks.
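A heatmap is easy to scan but hard to rank by eye. The following is a minimal sketch of how the strongest pairs could be pulled out of a correlation matrix. The helper `top_correlated_pairs` and the toy values are hypothetical and only illustrate the idea; in the notebook the function would be applied to `df_clean[audio_features]`.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame, n=3):
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = frame.corr()
    # Keep only the strict upper triangle so each pair appears exactly once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # Sort by magnitude so strong negative correlations rank alongside positive ones
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Toy data (made up for illustration, not taken from the dataset)
toy = pd.DataFrame({
    "Energy": [0.2, 0.4, 0.6, 0.8, 0.9],
    "Loudness": [-20.0, -14.0, -9.0, -6.0, -4.0],
    "Acousticness": [0.9, 0.7, 0.4, 0.2, 0.1],
    "Tempo": [120.0, 90.0, 140.0, 100.0, 110.0],
})

top_pairs = top_correlated_pairs(toy, n=3)
print(top_pairs)
```

Sorting by absolute value keeps the sign in the output, so a strongly negative pair such as energy and acousticness is still reported as negative.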

## 5. Artist-Level Analysis
- Identify top artists by mean energy.

In [ ]:
artist_stats = df_clean.groupby('Artist')['Energy'].mean().sort_values(ascending=False).head(10)
print("Top 10 Energetic Artists:\n", artist_stats)

#### Conclusion
Several artists consistently produce high-energy tracks, which may appeal more to certain listener groups.
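A mean computed from one or two tracks says little about consistency. As a hedged sketch, the ranking could be restricted to artists with a minimum number of tracks and reported together with the spread. The threshold `MIN_TRACKS` and the toy rows below are assumptions made for illustration, not values from the dataset.

```python
import pandas as pd

# Toy data (made up): artist B has a single, very energetic track
toy = pd.DataFrame({
    "Artist": ["A", "A", "A", "B", "C", "C", "C"],
    "Energy": [0.80, 0.85, 0.90, 0.99, 0.40, 0.50, 0.60],
})

MIN_TRACKS = 3  # hypothetical cutoff; tune to the dataset

# Mean, spread, and track count per artist
stats = toy.groupby("Artist")["Energy"].agg(
    mean_energy="mean", std_energy="std", n_tracks="count"
)

# Rank only artists with enough tracks for the mean to be meaningful
ranked = stats[stats["n_tracks"] >= MIN_TRACKS].sort_values(
    "mean_energy", ascending=False
)
print(ranked)
```

A low `std_energy` next to a high `mean_energy` is what would support a claim that an artist is consistently high-energy.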

## 6. Track-Level Analysis
- Tracks with highest valence (positivity/mood).

In [ ]:
# Group by artist and track so that different songs sharing a title are not merged
track_valence = df_clean.groupby(['Artist', 'Track'])['Valence'].mean().sort_values(ascending=False).head(10)
print("Top 10 Most Positive (Valence) Tracks:\n", track_valence)

#### Conclusion
High valence tracks can help identify music associated with positive moods, useful for playlist curation or recommendation systems.

## 7. Popularity Insights Using YouTube Views
- Compare feature distributions between all tracks and the top 10% of tracks by YouTube views.

In [ ]:
pop_thresh = df_clean['Views'].quantile(0.9)
popular = df_clean[df_clean['Views'] >= pop_thresh]

for feat in ['Energy', 'Danceability', 'Valence']:
    plt.figure(figsize=(10,5))
    sns.kdeplot(df_clean[feat], label='All Tracks')
    sns.kdeplot(popular[feat], label='Popular (Top 10% Views)')
    plt.title(f'{feat}: Popular vs All Tracks')
    plt.legend()
    plt.show()

#### Conclusion
Popular tracks tend to cluster at higher energy and danceability levels. This is an association rather than evidence of causation, but it suggests these features may be linked to listener engagement.
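The density plots can be complemented with a small numeric summary. The sketch below compares per-feature medians of the top 10% by views against all tracks, using the same quantile threshold as the cell above. The toy frame is invented for illustration: `Energy` rises with `Views` by construction, while `Danceability` simply alternates.

```python
import pandas as pd

# Toy data (made up): 20 tracks, views grow with energy
energy = [i / 20 for i in range(1, 21)]
toy = pd.DataFrame({
    "Energy": energy,
    "Danceability": [0.4 if i % 2 == 0 else 0.6 for i in range(20)],
    "Views": [1000 * e * e for e in energy],
})

features = ["Energy", "Danceability"]

# Same rule as in the notebook: popular means at or above the 90th percentile of views
threshold = toy["Views"].quantile(0.9)
is_popular = toy["Views"] >= threshold

# Median of each feature for all tracks and for the popular subset
summary = pd.DataFrame({
    "all_median": toy[features].median(),
    "popular_median": toy.loc[is_popular, features].median(),
})
summary["difference"] = summary["popular_median"] - summary["all_median"]
print(summary)
```

Medians are used here because view counts and some audio features are heavily skewed; a difference near zero, as for `Danceability` in the toy data, indicates the popular subset does not differ on that feature.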

## 8. Instrumentalness vs Acousticness
- Scatterplot to check relationship.

In [ ]:
plt.figure(figsize=(10,5))
sns.scatterplot(x='Acousticness', y='Instrumentalness', data=df_clean, alpha=0.3)
plt.title('Instrumentalness vs Acousticness')
plt.show()

#### Conclusion
There is a visible positive relationship: highly instrumental tracks tend to be more acoustic, which is characteristic of classical or unplugged styles.

## 9. Track Duration vs Tempo
- Explore correlation by scatterplot.

In [ ]:
plt.figure(figsize=(10,5))
sns.scatterplot(x='Duration_ms', y='Tempo', data=df_clean, alpha=0.3)
plt.title('Track Duration vs Tempo')
plt.xlabel('Duration (ms)')
plt.ylabel('Tempo (BPM)')
plt.show()

#### Conclusion
No strong correlation appears, but clusters indicate typical pop song durations and tempos.
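The visual impression from a scatterplot can be backed by a single number. The following sketch computes the Pearson correlation between duration and tempo on a small invented sample, after converting milliseconds to minutes for readability. The column `Duration_min` is introduced here for illustration and is not part of the dataset.

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "Duration_ms": [180_000, 200_000, 210_000, 240_000, 300_000, 195_000],
    "Tempo": [120.0, 95.0, 128.0, 100.0, 118.0, 140.0],
})

# Minutes are easier to read on an axis than milliseconds
toy["Duration_min"] = toy["Duration_ms"] / 60_000

# Pearson correlation; rescaling the duration does not change the result
r = toy["Duration_min"].corr(toy["Tempo"])
print(f"Pearson r between duration and tempo: {r:.2f}")
```

In the notebook the same call on `df_clean` would put a value on "no strong correlation"; values near zero support that reading.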

## 10. Metadata Exploration
- Album types and unique counts.

In [ ]:
print("Album types count:\n", df_clean['Album_type'].value_counts())
print("Unique tracks:", df_clean['Track'].nunique())
print("Unique artists:", df_clean['Artist'].nunique())

#### Conclusion
The dataset shows rich diversity in album types, tracks, and artists, indicating broad coverage for analysis.

## 11. Overall Conclusion
This exploratory analysis highlights links between track popularity and audio features such as energy and danceability. The dataset's diversity provides a rich base for music recommendation and trend analysis, although the absence of year and genre information limits temporal and categorical insights. Future work could enrich the dataset with release dates and genres to deepen understanding.