# June Code Pudding Analysis

In [3]:
import pandas as pd
import numpy as np
import os
import plotly.express as px
import plotly.graph_objects as go
from dash import Dash, dcc, html
import spotipy
from sklearn import preprocessing, metrics
import seaborn as sns
import matplotlib.pyplot as plt

In [5]:
dataset_path = os.path.join('data', 'spotify-2000.csv')
df = pd.read_csv(dataset_path)

df.head()



| | Index | Title | Artist | Top Genre | Year | Beats Per Minute (BPM) | Energy | Danceability | Loudness (dB) | Liveness | Valence | Length (Duration) | Acousticness | Speechiness | Popularity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Sunrise | Norah Jones | adult standards | 2004 | 157 | 30 | 53 | -14 | 11 | 68 | 201 | 94 | 3 | 71 |
| 1 | 2 | Black Night | Deep Purple | album rock | 2000 | 135 | 79 | 50 | -11 | 17 | 81 | 207 | 17 | 7 | 39 |
| 2 | 3 | Clint Eastwood | Gorillaz | alternative hip hop | 2001 | 168 | 69 | 66 | -9 | 7 | 52 | 341 | 2 | 17 | 69 |
| 3 | 4 | The Pretender | Foo Fighters | alternative metal | 2007 | 173 | 96 | 43 | -4 | 3 | 37 | 269 | 0 | 4 | 76 |
| 4 | 5 | Waitin' On A Sunny Day | Bruce Springsteen | classic rock | 2002 | 106 | 82 | 58 | -5 | 10 | 87 | 256 | 1 | 3 | 59 |


In [6]:
# Shape and basic statistics for the numerical columns
print("Dataset Shape:", df.shape)
print("\nBasic Statistics:")
df.describe()  # not displayed: a notebook cell only renders its last expression

# Check for any missing values
print("\nMissing Values:")
df.isnull().sum()

Dataset Shape: (1994, 15)

Basic Statistics:

Missing Values:


Index                     0
Title                     0
Artist                    0
Top Genre                 0
Year                      0
Beats Per Minute (BPM)    0
Energy                    0
Danceability              0
Loudness (dB)             0
Liveness                  0
Valence                   0
Length (Duration)         0
Acousticness              0
Speechiness               0
Popularity                0
dtype: int64

The initial data exploration shows a complete dataset of `1,994` songs and `15` columns with no missing values. This matters because:
1. No imputation or other handling of `missing values` is required
2. All `15` columns can be used for analysis without `data loss`
3. The dataset `size` is large enough for meaningful `statistical analysis`
4. The columns cover both `numerical` (`BPM`, `Energy`, etc.) and categorical (`Artist`, `Top Genre`) data types

Because nothing is missing, we can proceed to the analysis without imputing or dropping rows. Column types still deserve a check: `Length (Duration)` does not appear in the `describe()` output of the next cell, which suggests pandas did not read it as a numeric column.
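A missing-value count says nothing about column types, so a type check is a useful companion to it. The sketch below lists the non-numeric columns and converts a numeric-looking text column to numbers. It runs on a toy frame with made-up values rather than the real CSV, the helper name `coerce_numeric_text` is hypothetical, and the thousands separator in `"1,412"` is an assumption about why a duration column might be read as text.

```python
import pandas as pd


def coerce_numeric_text(series: pd.Series) -> pd.Series:
    """Convert strings such as '1,412' to numbers; unparseable values become NaN."""
    cleaned = series.astype(str).str.replace(",", "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")


# Toy frame with made-up values, standing in for the real dataset.
toy = pd.DataFrame(
    {
        "Title": ["a", "b", "c"],
        "Length (Duration)": ["201", "1,412", "341"],  # assumed text with a separator
        "Energy": [30, 79, 69],
    }
)

# Columns that pandas did not store with a numeric dtype.
non_numeric = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]
print("Non-numeric columns:", non_numeric)

# Convert the duration column so it can take part in describe() and correlations.
toy["Length (Duration)"] = coerce_numeric_text(toy["Length (Duration)"])
print(toy.dtypes)
```

On the real data the same check would be `[c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]`; whether any column needs conversion depends on what that returns.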

In [7]:
# Get detailed stats of numerical features
numerical_stats = df.describe()
print("\nDetailed Statistics of Numerical Features:")
numerical_stats


Detailed Statistics of Numerical Features:


| | Index | Year | Beats Per Minute (BPM) | Energy | Danceability | Loudness (dB) | Liveness | Valence | Acousticness | Speechiness | Popularity |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1994.0 | 1994.0 | 1994.0 | 1994.0 | 1994.0 | 1994.0 | 1994.0 | 1994.0 | 1994.0 | 1994.0 | 1994.0 |
| mean | 997.5 | 1992.992979 | 120.215647 | 59.679539 | 53.238215 | -9.008526 | 19.012036 | 49.408726 | 28.858074 | 4.994985 | 59.52658 |
| std | 575.762538 | 16.116048 | 28.028096 | 22.154322 | 15.351507 | 3.647876 | 16.727378 | 24.858212 | 29.011986 | 4.401566 | 14.3516 |
| min | 1.0 | 1956.0 | 37.0 | 3.0 | 10.0 | -27.0 | 2.0 | 3.0 | 0.0 | 2.0 | 11.0 |
| 25% | 499.25 | 1979.0 | 99.0 | 42.0 | 43.0 | -11.0 | 9.0 | 29.0 | 3.0 | 3.0 | 49.25 |
| 50% | 997.5 | 1993.0 | 119.0 | 61.0 | 53.0 | -8.0 | 12.0 | 47.0 | 18.0 | 4.0 | 62.0 |
| 75% | 1495.75 | 2007.0 | 136.0 | 78.0 | 64.0 | -6.0 | 23.0 | 69.75 | 50.0 | 5.0 | 71.0 |
| max | 1994.0 | 2019.0 | 206.0 | 100.0 | 96.0 | -2.0 | 99.0 | 99.0 | 99.0 | 55.0 | 100.0 |


The numerical statistics reveal key insights about the musical characteristics in our dataset:

Time Range:
- Songs span from `1956` to `2019` (`63` years)
- Median year is `1993`, with quartiles at `1979` and `2007`, showing balanced representation of older and newer music

Musical Features:
- `BPM` ranges from `37` to `206`, with median at `119` - typical for popular music
- `Energy` and `Danceability` are measured on a `0-100` scale:
  * `Energy` median: `61` (moderate energy levels)
  * `Danceability` median: `53` (songs are moderately danceable)
- `Loudness` ranges from `-27dB` to `-2dB`, median at `-8dB`
- Most songs have low to moderate `Liveness` (median: `12`) and `Speechiness` (median: `4`)
- `Popularity` ranges from `11` to `100`, with a median of `62` that sits above the mean of about `59.5`, suggesting a slight bias toward more popular songs

This distribution gives us a solid foundation for analyzing trends and patterns in popular music over time.
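The skew mentioned for `Popularity` can be checked from the summary table alone by comparing each mean with its median. The sketch below uses Pearson's second skewness coefficient, `3 * (mean - median) / std`, which is a rough indicator chosen here for illustration and not part of the original analysis. The numbers are copied from the `describe()` output above; the selection of four features is arbitrary.

```python
import pandas as pd

# mean, median (the 50% row), and std copied from the describe() output.
summary = pd.DataFrame(
    {
        "mean": [59.52658, 19.012036, 28.858074, 59.679539],
        "median": [62.0, 12.0, 18.0, 61.0],
        "std": [14.3516, 16.727378, 29.011986, 22.154322],
    },
    index=["Popularity", "Liveness", "Acousticness", "Energy"],
)

# Pearson's second skewness coefficient: negative when the median exceeds the mean.
summary["pearson_skew"] = 3 * (summary["mean"] - summary["median"]) / summary["std"]

print(summary["pearson_skew"].round(2))
```

The coefficient is negative for `Popularity` (a tail toward unpopular songs) and positive for `Liveness` and `Acousticness` (most songs score low, with a tail of high values). With the full data loaded, `df["Popularity"].skew()` would give the sample skewness directly.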

In [8]:
# Get unique counts for categorical columns
print("\nUnique Artists:", df['Artist'].nunique())
print("Unique Genres:", df['Top Genre'].nunique())
print("\nYear Range:", df['Year'].min(), "to", df['Year'].max())

# Top 10 artists by number of songs
print("\nTop 10 Artists by Number of Songs:")
print(df['Artist'].value_counts().head(10))

# Top 10 genres by number of songs
print("\nTop 10 Genres by Number of Songs:")
print(df['Top Genre'].value_counts().head(10))


Unique Artists: 731
Unique Genres: 149

Year Range: 1956 to 2019

Top 10 Artists by Number of Songs:
Artist
Queen                 37
The Beatles           36
Coldplay              27
U2                    26
The Rolling Stones    24
Bruce Springsteen     23
Michael Jackson       23
ABBA                  22
David Bowie           21
Fleetwood Mac         18
Name: count, dtype: int64

Top 10 Genres by Number of Songs:
Top Genre
album rock           413
adult standards      123
dutch pop             88
alternative rock      86
dance pop             83
dutch indie           75
alternative metal     70
dance rock            52
classic rock          51
dutch cabaret         51
Name: count, dtype: int64


The artist and genre analysis reveals important characteristics about the dataset's composition:

Artist Distribution:
- `731` unique artists across `1,994` songs indicates a good variety
- Top artists are predominantly long-established rock and pop acts (`Queen`, `The Beatles`, `U2`)
- The most represented artist (`Queen`) has only `37` songs (`1.86%` of dataset), showing balanced representation

Genre Distribution:
- `149` unique genres shows diverse musical styles
- Strong bias towards rock genres:
  * `Album rock` dominates (`413` songs, `~20.7%` of dataset)
  * Multiple rock subgenres in top `10` (`alternative rock`, `classic rock`, `dance rock`)
- Significant Dutch music influence (`dutch pop`, `dutch indie`, `dutch cabaret`; `214` songs combined, `~10.7%` of dataset)
- `Adult standards` and `dance pop` represent non-rock mainstream genres

This distribution suggests the dataset has a Western rock music focus with particular emphasis on European artists, which should be considered when drawing conclusions from further analyses.
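The percentages quoted above can be reproduced from the printed top-10 counts. The sketch below rebuilds those counts as Series and computes each share of the `1,994` songs. The `share` helper is a hypothetical name, and because only the top-10 rows are available here, the rock and Dutch shares are lower bounds on the shares across all `149` genres.

```python
import pandas as pd

TOTAL_SONGS = 1994  # from df.shape

# Counts copied from the value_counts() output above.
top_artists = pd.Series(
    {
        "Queen": 37, "The Beatles": 36, "Coldplay": 27, "U2": 26,
        "The Rolling Stones": 24, "Bruce Springsteen": 23, "Michael Jackson": 23,
        "ABBA": 22, "David Bowie": 21, "Fleetwood Mac": 18,
    }
)
top_genres = pd.Series(
    {
        "album rock": 413, "adult standards": 123, "dutch pop": 88,
        "alternative rock": 86, "dance pop": 83, "dutch indie": 75,
        "alternative metal": 70, "dance rock": 52, "classic rock": 51,
        "dutch cabaret": 51,
    }
)


def share(count, total=TOTAL_SONGS):
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100 * float(count) / total, 2)


queen_share = share(top_artists["Queen"])
top10_artist_share = share(top_artists.sum())
album_rock_share = share(top_genres["album rock"])
# Genres whose label contains "rock" or starts with "dutch", top-10 rows only.
rock_share = share(top_genres[top_genres.index.str.contains("rock")].sum())
dutch_share = share(top_genres[top_genres.index.str.startswith("dutch")].sum())

print(f"Queen: {queen_share}%  |  top 10 artists: {top10_artist_share}%")
print(f"album rock: {album_rock_share}%  |  rock labels: {rock_share}%  |  dutch labels: {dutch_share}%")
```

On the full data, `df['Top Genre'].value_counts(normalize=True)` gives the same shares for every genre without rebuilding the counts by hand.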