# Exploratory Data Analysis (EDA) on Spotify Dataset

In this project, I’ve compiled and analyzed a dataset with over 500,000 songs from Spotify, one of the world’s most popular music streaming platforms.

The aim of this analysis is to explore how music varies across genres, moods, and years — and ultimately answer key questions about what makes a song feel good, energetic, or relaxing. The analysis focuses on mood-related metrics like valence, energy, and danceability, as well as temporal and genre-based patterns.

Here are the main questions addressed in this project:
1. What are the top 10 songs with the highest “feel-good” vibes?
2. Which artist conveys the strongest overall flow, based on average mood-related metrics?
3. What are the top 3 songs in different contexts?
   - 🎉 Party
   - 💻 Work
   - 🏋️‍♂️ Exercise
   - 🧘 Relaxation
   - 🚗 Driving
4. How has the positivity index (valence) evolved year by year?
5. What does the distribution of song quantity look like per year and genre?
6. What subjective observations and musical curiosities come out of a final personal analysis?

Feel free to dive into the data and enjoy the journey through 500,000 tracks of musical insights!

# Importing Libraries and Loading the Data

In [12]:
import pandas as pd
import numpy as np
import matplotlib
import plotly.express as px
import os
import urllib.request
from zipfile import ZipFile

# Create folders if they don't exist
os.makedirs('../archive', exist_ok=True)
os.makedirs('../data', exist_ok=True)

# Define paths
zip_path = '../archive/900k-spotify.zip'
csv_path = '../data/spotify_dataset.csv'

# Download the dataset if the ZIP file doesn't exist
if not os.path.exists(zip_path):
    print("Downloading ZIP file from Kaggle...")
    os.system('kaggle datasets download -d devdope/900k-spotify -p ../archive')
else:
    print("ZIP file already exists. Skipping download.")

# Extract the dataset if the CSV file doesn't exist
if not os.path.exists(csv_path):
    print("Extracting ZIP file...")
    with ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall('../data/')
else:
    print("CSV file already exists. Skipping extraction.")


# Create DataFrame
df = pd.read_csv(csv_path)
df.head()

ZIP file already exists. Skipping download.
CSV file already exists. Skipping extraction.


Unnamed: 0,Artist(s),song,text,Length,emotion,Genre,Album,Release Date,Key,Tempo,Loudness (db),Time signature,Explicit,Popularity,Energy,Danceability,Positiveness,Speechiness,Liveness,Acousticness,Instrumentalness,Good for Party,Good for Work/Study,Good for Relaxation/Meditation,Good for Exercise,Good for Running,Good for Yoga/Stretching,Good for Driving,Good for Social Gatherings,Good for Morning Routine,Similar Artist 1,Similar Song 1,Similarity Score 1,Similar Artist 2,Similar Song 2,Similarity Score 2,Similar Artist 3,Similar Song 3,Similarity Score 3
0,!!!,Even When the Waters Cold,Friends told her she was better off at the bot...,03:47,sadness,hip hop,Thr!!!er,29th April 2013,D min,105,-6.85db,4/4,No,40,83,71,87,4,16,11,0,0,0,0,0,0,0,0,0,0,Corey Smith,If I Could Do It Again,0.986061,Toby Keith,Drinks After Work,0.983719,Space,Neighbourhood,0.983236
1,!!!,One Girl / One Boy,"Well I heard it, playing soft From a drunken b...",04:03,sadness,hip hop,Thr!!!er,29th April 2013,A# min,117,-5.75db,4/4,No,42,85,70,87,4,32,0,0,0,0,0,0,0,0,0,0,0,Hiroyuki Sawano,BRE@TH//LESS,0.995409,When In Rome,Heaven Knows,0.990905,Justice Crew,Everybody,0.984483
2,!!!,Pardon My Freedom,"Oh my god, did I just say that out loud? Shoul...",05:51,joy,hip hop,Louden Up Now,8th June 2004,A Maj,121,-6.06db,4/4,No,29,89,71,63,8,64,0,20,0,0,0,1,0,0,0,0,0,Ricky Dillard,More Abundantly Medley Live,0.993176,Juliet,Avalon,0.965147,The Jacksons,Lovely One,0.956752
3,!!!,Ooo,[Verse 1] Remember when I called you on the te...,03:44,joy,hip hop,As If,16th October 2015,A min,122,-5.42db,4/4,No,24,84,78,97,4,12,12,0,0,0,0,1,0,0,0,0,0,Eric Clapton,Man Overboard,0.992749,Roxette,Don't Believe In Accidents,0.991494,Tiwa Savage,My Darlin,0.990381
4,!!!,Freedom 15,[Verse 1] Calling me like I got something to s...,06:00,joy,hip hop,As If,16th October 2015,F min,123,-5.57db,4/4,No,30,71,77,70,7,10,4,1,0,0,0,1,0,0,0,0,0,Cibo Matto,Lint Of Love,0.98161,Barrington Levy,Better Than Gold,0.981524,Freestyle,Its Automatic,0.981415
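
One caveat about the download step in the loading cell: `os.system` does not raise when the command fails, so a missing or misconfigured Kaggle CLI would only surface later, when `ZipFile` cannot find the archive. The sketch below is one possible alternative, not the approach used in this notebook. The helper name `download_if_missing` is hypothetical; the command itself is copied from the cell above.

```python
import subprocess
from pathlib import Path


def download_if_missing(zip_path, command):
    """Run `command` only when zip_path is absent.

    Returns True if the command ran, False if the download was skipped.
    subprocess.run(..., check=True) raises CalledProcessError on a
    non-zero exit code, so a failed download stops the notebook early.
    """
    if Path(zip_path).exists():
        return False
    subprocess.run(command, check=True)
    return True


# Same Kaggle CLI call as in the loading cell, split into an argument list
KAGGLE_CMD = [
    "kaggle", "datasets", "download",
    "-d", "devdope/900k-spotify",
    "-p", "../archive",
]

# Uncomment to run (requires the Kaggle CLI and API credentials):
# download_if_missing("../archive/900k-spotify.zip", KAGGLE_CMD)
```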


# Data Cleaning and Manipulation

## Identify Duplicate or Irrelevant Data

In [13]:
df[['song', 'Artist(s)']].duplicated().value_counts()

False    498052
True      53391
Name: count, dtype: int64

## Fix Structural Errors

### Define Errors

- Convert all column names to lowercase and replace spaces with underscores to make data manipulation easier and more consistent.
- Create a length_seconds column from the duration in the length column.
- Convert the release date to datetime.
- Drop duplicate rows that share the same song and artist.

#### Convert all column names to lowercase

In [14]:
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
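
To make the effect of this step concrete, here is a small sketch applied to four of the original column names. Note that parentheses and slashes survive the cleanup, which is why later cells access `df['artist(s)']` with brackets. The second transformation (`snake`) is an optional extra I am assuming for illustration; it is not applied in this notebook.

```python
import pandas as pd

# Four column names taken from the raw dataset header
sample = pd.Index(["Artist(s)", "Loudness (db)", "Good for Work/Study", "Release Date"])

# Same chain as the cell above (regex=False makes the literal match explicit)
cleaned = sample.str.strip().str.lower().str.replace(" ", "_", regex=False)
print(list(cleaned))
# ['artist(s)', 'loudness_(db)', 'good_for_work/study', 'release_date']

# Optional, hypothetical extra step: collapse every non-alphanumeric run to "_"
snake = cleaned.str.replace(r"[^0-9a-z]+", "_", regex=True).str.strip("_")
print(list(snake))
# ['artist_s', 'loudness_db', 'good_for_work_study', 'release_date']
```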

#### Create length_seconds column based on the duration of the length column

In [15]:
df['length_seconds'] = df['length'].apply(
    lambda x: sum(int(t) * 60 ** i for i, t in enumerate(reversed(x.split(':')))) if isinstance(x, str) and ':' in x else np.nan
)
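
The one-line lambda above is compact but hard to read. As a sketch, the same conversion can be written as a named function. The name `length_to_seconds` is my own, and unlike the lambda it returns NaN (rather than raising) when a part is not numeric.

```python
import math


def length_to_seconds(value):
    """Convert 'MM:SS' or 'HH:MM:SS' to total seconds; NaN for anything else."""
    if not isinstance(value, str) or ":" not in value:
        return math.nan
    try:
        parts = [int(p) for p in value.split(":")]
    except ValueError:
        # Non-numeric piece such as 'ab:cd' (the original lambda would raise here)
        return math.nan
    seconds = 0
    for part in parts:
        # Each step shifts the running total up one unit (hours -> minutes -> seconds)
        seconds = seconds * 60 + part
    return seconds


print(length_to_seconds("03:47"))  # 227
print(length_to_seconds("59:32"))  # 3572
```

It could be applied with `df['length'].apply(length_to_seconds)`.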

#### Set release date to datetime

In [16]:
df['release_date'] = pd.to_datetime(df['release_date'], errors='coerce')
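
With `errors='coerce'`, any value that cannot be parsed silently becomes NaT, and the `.describe()` output further down shows only 350,369 non-null release dates out of 498,052 rows. The raw dates carry ordinal suffixes ("29th April 2013"), which may be one reason parsing fails; I have not verified this, so treat the following as a sketch under that assumption rather than a confirmed fix. It strips the suffix and parses with an explicit format.

```python
import pandas as pd

# Toy values in the same style as the raw 'Release Date' column
raw = pd.Series(["29th April 2013", "8th June 2004", "1st May 1999", "not a date"])

# Remove the ordinal suffix that directly follows the day number ("29th" -> "29")
stripped = raw.str.replace(r"(?<=\d)(st|nd|rd|th)\b", "", regex=True)

# Parse with an explicit day / full month name / year format
parsed = pd.to_datetime(stripped, format="%d %B %Y", errors="coerce")

# Count how many values still fail, instead of letting them disappear silently
print(parsed.isna().sum())
```

Comparing `isna().sum()` before and after a change like this would show whether the suffix is actually responsible for the missing dates.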

#### Drop duplicates by song and artist

In [17]:
df = df.drop_duplicates(subset=['song','artist(s)'])
df[['song', 'artist(s)']].duplicated().value_counts()

False    498052
Name: count, dtype: int64
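
By default, `drop_duplicates` keeps the first row of each (song, artist) pair and discards the rest, whatever the other columns contain. The sketch below uses made-up rows to show that behaviour next to one possible alternative: sorting by popularity first so the most popular copy survives. The alternative is only an illustration, not what this notebook does.

```python
import pandas as pd

# Toy rows (hypothetical, not taken from the Spotify dataset)
toy = pd.DataFrame({
    "song": ["Intro", "Intro", "Outro"],
    "artist(s)": ["A", "A", "A"],
    "popularity": [10, 40, 25],
})

# Default behaviour: the first row of each (song, artist) pair is kept
first_kept = toy.drop_duplicates(subset=["song", "artist(s)"])

# Alternative: put the most popular copy first, drop the rest, restore row order
best_kept = (
    toy.sort_values("popularity", ascending=False)
       .drop_duplicates(subset=["song", "artist(s)"])
       .sort_index()
)

print(first_kept["popularity"].tolist())  # [10, 25]
print(best_kept["popularity"].tolist())   # [40, 25]
```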

## Filter outliers 

In [18]:
pd.set_option('display.max_columns', None)
df.describe()

Unnamed: 0,release_date,tempo,popularity,energy,danceability,positiveness,speechiness,liveness,acousticness,instrumentalness,good_for_party,good_for_work/study,good_for_relaxation/meditation,good_for_exercise,good_for_running,good_for_yoga/stretching,good_for_driving,good_for_social_gatherings,good_for_morning_routine,similarity_score_1,similarity_score_2,similarity_score_3,length_seconds
count,350369,498052.0,498052.0,498052.0,498052.0,498052.0,498052.0,498052.0,498052.0,498052.0,498052.0,498052.0,498052.0,498052.0,498052.0,498052.0,498052.0,498052.0,498052.0,498052.0,498052.0,498052.0,498052.0
mean,2012-05-08 12:27:06.924756224,120.780178,30.486453,62.744027,58.285191,47.067467,11.397157,19.787725,26.056827,7.361777,0.051639,0.077363,0.031744,0.184005,0.053199,0.022034,0.054735,0.00929,0.063618,0.982887,0.977765,0.974715,224.45782
min,1900-01-17 00:00:00,31.0,0.0,0.0,6.0,0.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002656,0.002647,0.002647,5.0
25%,2009-10-09 00:00:00,97.0,19.0,48.0,46.0,28.0,4.0,10.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.976614,0.970859,0.967244,179.0
50%,2017-01-06 00:00:00,120.0,28.0,65.0,59.0,46.0,6.0,13.0,13.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.984911,0.980117,0.977254,214.0
75%,2019-12-06 00:00:00,140.0,40.0,81.0,71.0,66.0,14.0,25.0,44.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.991553,0.987161,0.984789,256.0
max,2024-08-16 00:00:00,200.0,100.0,100.0,99.0,100.0,97.0,100.0,100.0,100.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3572.0
std,,29.262887,17.189269,22.688164,17.35293,24.091678,12.28215,16.310829,29.618874,20.736406,0.221298,0.267167,0.175317,0.387489,0.224431,0.146794,0.227463,0.095937,0.244071,0.013297,0.014951,0.015961,84.679868


To assess the presence of outliers in the dataset, I examined the summary statistics of the numerical columns using the .describe() method. 

Overall, most values fall within expected ranges for audio features such as tempo, energy, danceability, and popularity. However, two columns stand out for potential outliers:
- length_seconds: While the average song duration is reasonable (around 3 minutes and 44 seconds), the minimum value is 5 seconds and the maximum is nearly an hour (3572 seconds). These extreme values likely correspond to very short audio clips or live performances. Depending on the analysis goal, it may be useful to filter out tracks that are shorter than 30 seconds or longer than 20 minutes.
- release_date: The range includes dates as early as 1900, which may not be accurate. Such early dates could represent missing or incorrectly parsed values. Filtering out tracks released before 1950 could improve the overall data quality.

In conclusion, most columns show no extreme or highly suspicious outliers, but I'll check these two columns.
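
The thresholds mentioned above (30 seconds, 20 minutes, 1950) could be applied as in the sketch below. This is a hypothetical filter on toy rows: the notebook itself only inspects the extremes, and a strict cutoff would also remove legitimate tracks such as a genuine 59-minute recording or a correctly dated pre-1950 song.

```python
import pandas as pd

# Toy rows (made-up values) standing in for the cleaned DataFrame
toy = pd.DataFrame({
    "song": ["clip", "regular", "battle", "mislabeled", "undated"],
    "length_seconds": [5, 214, 3572, 200, 180],
    "release_date": pd.to_datetime(
        ["2015-01-01", "2017-01-06", "2015-05-09", "1900-01-30", None]
    ),
})

MIN_SECONDS = 30        # shorter tracks are treated as clips
MAX_SECONDS = 20 * 60   # longer tracks are treated as live sets or compilations
MIN_YEAR = 1950         # earlier dates are treated as suspect

# between() is inclusive on both ends
length_ok = toy["length_seconds"].between(MIN_SECONDS, MAX_SECONDS)

# Keep rows with a missing date; only drop dates that are present and too early
date_ok = toy["release_date"].isna() | (toy["release_date"].dt.year >= MIN_YEAR)

filtered = toy[length_ok & date_ok]
print(filtered["song"].tolist())  # ['regular', 'undated']
```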

In [19]:
# Check the longest track; its duration matches the listing on Spotify
df[df['length_seconds'] == df['length_seconds'].max()]

Unnamed: 0,artist(s),song,text,length,emotion,genre,album,release_date,key,tempo,loudness_(db),time_signature,explicit,popularity,energy,danceability,positiveness,speechiness,liveness,acousticness,instrumentalness,good_for_party,good_for_work/study,good_for_relaxation/meditation,good_for_exercise,good_for_running,good_for_yoga/stretching,good_for_driving,good_for_social_gatherings,good_for_morning_routine,similar_artist_1,similar_song_1,similarity_score_1,similar_artist_2,similar_song_2,similarity_score_2,similar_artist_3,similar_song_3,similarity_score_3,length_seconds
513937,"Ultimate Rap League,Jaz the Rapper,O'fficial",Jaz The Rapper vs. Official,[Round 1: Jaz The Rapper] Y’all smell that? Ir...,59:32,anger,jazz,N.O.M.E. 5 (Live),2015-05-09,C# Maj,81,-14.07db,4/4,Yes,0,36,54,59,94,65,96,0,0,0,0,0,0,0,0,0,0,"Ultimate Rap League,Cassidy,Goodz",Cassidy vs. Arsonal,0.997455,"Ultimate Rap League,Arsonal,Geechi Gotti",Arsonal vs. Geechi Gotti,0.995412,"Ultimate Rap League,John John Da Don,Mr Wavy",Hitman Holla vs. John John Da Don,0.99472,3572


In [20]:
# Checking the old songs
df[df['release_date'].dt.year == 1900]

Unnamed: 0,artist(s),song,text,length,emotion,genre,album,release_date,key,tempo,loudness_(db),time_signature,explicit,popularity,energy,danceability,positiveness,speechiness,liveness,acousticness,instrumentalness,good_for_party,good_for_work/study,good_for_relaxation/meditation,good_for_exercise,good_for_running,good_for_yoga/stretching,good_for_driving,good_for_social_gatherings,good_for_morning_routine,similar_artist_1,similar_song_1,similarity_score_1,similar_artist_2,similar_song_2,similarity_score_2,similar_artist_3,similar_song_3,similarity_score_3,length_seconds
49138,Big Ben Banjo Band,Do It Big,[Intro: Ava Lily] I'mma do it I'mma do it I'mm...,02:23,sadness,hip hop,Party Packet,1900-01-30,C Maj,126,-13.13db,4/4,No,1,41,86,93,11,3,83,97,0,0,0,0,0,0,0,0,0,Sirocco,Trap Back Bumpin Freestyle,0.986525,Bossa Nova Jazz,F With U,0.978716,"Ros Sereisothea,Pen Ran",Millionaire,0.977055,143
57249,Blind Willie McTell,Southern Can Is Mine,Now looka here mama let me tell you this If yo...,03:13,anger,blues,The Early Years,1900-01-29,G# Maj,114,-25db,4/4,No,16,10,69,90,18,18,98,0,0,0,0,0,0,0,0,0,0,Victoria Wood,Pam,0.989838,Robert Johnson,Ramblin On My Mind Take 2,0.966355,Joel Samberg & Benny Bell,Shaving Cream,0.964407,193
57251,Blind Willie McTell,The Dyin Crapshooters Blues,"Little Jesse was a gambler, night and day He u...",03:09,sadness,blues,The Legendary Library Of Congress Session,1900-01-29,D min,139,-23.84db,4/4,No,1,5,69,63,17,22,99,0,0,0,0,0,0,0,0,0,0,Memphis Minnie,Dirty Mother for You,0.971451,Nat King Cole Trio,I'll Never Say Never Again,0.968802,"Frank Sinatra,The Charioteers",Jesus Is A Rock In The Weary Land,0.966888,189
63167,Bonnie Guitar,Dark Moon,"Dark moon Away up high up in the sky Oh, tell ...",02:43,sadness,hip hop,Dark Moon,1900-01-29,A Maj,86,-18db,4/4,No,23,4,58,36,3,10,98,0,0,1,1,0,0,1,0,0,0,Celine Josephina,Time Time Time Time,0.995518,Zach Bolen,I Want to Die,0.995404,John C. Reilly,Have You Heard the News / Dewey Cox Died,0.995171,163
393342,"Richard Strauss,Johann Strauss II,Wolfgang Ama...",Do It For The Gang,"[Intro: NitoNB, Sav12] Oppsdem know my NGang b...",04:30,joy,hip hop,120 Music Masterpieces,1900-01-30,C# Maj,151,-26.35db,4/4,No,0,4,14,8,4,11,98,93,0,0,0,0,0,0,0,0,0,Inspiring New Age Collection,Mind Sex,0.997182,Meditation Music Zone,6 2 1 5,0.989946,Jack Edwards,Fly On,0.984901,270
393343,"Richard Strauss,Johann Strauss II,Wolfgang Ama...",12AM,[Chorus: C A L E B] I was lookin' at the clock...,08:41,joy,hip hop,120 Music Masterpieces,1900-01-30,D Maj,133,-25.5db,4/4,No,12,9,45,35,4,26,98,84,0,0,0,0,0,0,0,0,0,"Pyotr Ilyich Tchaikovsky,Hector Berlioz,Johann...",You Are So Beautiful,0.972818,Colour Haze,Inside,0.965596,Chet Baker,Blue room,0.959847,521
393344,"Richard Strauss,Johann Strauss II,Wolfgang Ama...",Made of Fire,"Made of fire, I'm made of fire Falling, free ...",05:07,love,pop,120 Music Masterpieces,1900-01-30,G Maj,78,-21.39db,4/4,No,0,15,30,30,4,10,98,75,0,1,1,0,0,1,0,0,0,Christian Guitar,How Great Thou Art,0.98662,The Cat and Owl,Go Dung,0.985437,Geo Symphony Orchestra,Virgin,0.981822,307
420793,Shirley Temple,Animal Crackers In My Soup,"Animal Crackers In My Soup My mother said: ""My...",02:36,joy,hip hop,30 Original Recordings,1900-01-30,G min,84,-18.19db,4/4,No,26,18,65,85,5,15,100,71,0,1,1,0,0,1,0,0,0,Mississippi John Hurt,Since Ive Laid My Burden Down,0.98547,Yvonne Elliman,Hello Stranger,0.979805,Elliot Gordon,A Very Merry Christmas,0.974474,156
420794,Shirley Temple,At The Codfish Ball,Lyrics/Music S. Mitchell/L. Pollack Next Frida...,02:02,joy,hip hop,30 Original Recordings,1900-01-30,C# Maj,81,-22.27db,4/4,No,3,19,62,93,14,26,99,85,0,1,1,0,0,1,0,0,0,"Relax Radio 1,Relaxing Chill Out Music,Soft Ja...",Take Out,0.982258,"Various Composers,Billy Joel,Andrew Holdsworth",BattEm Up,0.979722,Relaxing Piano Music: Greatest Hymns: Best Lov...,In The Garden,0.967407,122
420801,Shirley Temple,Early Bird,Shirley Temple Miscellaneous Early Bird Good m...,02:02,joy,hip hop,30 Original Recordings,1900-01-30,G# Maj,135,-21db,4/4,No,13,14,53,47,6,56,99,92,0,0,0,0,0,0,0,0,0,Acid Ghost,The Lake Song,0.963733,Sufjan Stevens,Rake,0.956681,"George Frideric Handel,Academy of St Martin in...",07. Chorus: And He Shall Purify,0.953538,122


It’s funny to see Richard Strauss, Johann Strauss II, and Mozart credited with a song called “Do It For The Gang”. Clearly, the songs dated 1900 have incorrect release dates, so I’ll delete them.

In [21]:
df = df[df['release_date'].dt.year != 1900]

After this, I double-checked the oldest songs and confirmed that two from 1903 are correct: they’re by Billy Murray!
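
One detail of the filter above is worth spelling out: rows whose release date is missing (NaT) are not removed. For those rows `.dt.year` is NaN, and `NaN != 1900` evaluates to True, so they stay in the DataFrame. A minimal sketch with made-up rows:

```python
import pandas as pd

# Toy rows: one placeholder date, one genuine old date, one missing date
toy = pd.DataFrame({
    "song": ["placeholder", "genuine_1903", "undated"],
    "release_date": pd.to_datetime(["1900-01-30", "1903-06-01", None]),
})

# Same condition as the cell above
kept = toy[toy["release_date"].dt.year != 1900]
print(kept["song"].tolist())  # ['genuine_1903', 'undated']
```

This means the tracks without a parsed release date are still present and need to be handled separately when missing values are addressed.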

## Dealing with NaNs

## Validate our data

## Exporting the cleaned dataset to CSV for Looker Studio

# EDA

## 1. What are the top 10 songs with the highest “feel-good” vibes?

## 2. Which artist conveys the strongest overall flow, based on average mood-related metrics?

## 3. What are the top 3 songs in different contexts?

## 4. How has the positivity index (valence) evolved year by year?

## 5. What does the distribution of song quantity look like per year and genre?

## 6. Final personal analysis