# Data Collection (Dataset)

This project uses public Spotify datasets from Kaggle:
- Spotify Charts (Top 200 by country & date)
- Spotify Tracks Metadata (audio features, genres, etc.)

This notebook loads the raw CSV files and prepares them for analysis.


In [1]:
import os
import pandas as pd

# Ensure folders exist
os.makedirs("data/raw", exist_ok=True)
os.makedirs("data/processed", exist_ok=True)

In [2]:
# Example: Top 200 Daily Charts dataset from Kaggle
charts_path = "data/raw/spotify_charts.csv"  # put your file here

df_charts = pd.read_csv(charts_path)
print(df_charts.shape)
df_charts.head()

(26173514, 9)


Unnamed: 0,title,rank,date,artist,url,region,chart,trend,streams
0,Chantaje (feat. Maluma),1,2017-01-01,Shakira,https://open.spotify.com/track/6mICuAdrwEjh6Y6...,Argentina,top200,SAME_POSITION,253019.0
1,Vente Pa' Ca (feat. Maluma),2,2017-01-01,Ricky Martin,https://open.spotify.com/track/7DM4BPaS7uofFul...,Argentina,top200,MOVE_UP,223988.0
2,Reggaetón Lento (Bailemos),3,2017-01-01,CNCO,https://open.spotify.com/track/3AEZUABDXNtecAO...,Argentina,top200,MOVE_DOWN,210943.0
3,Safari,4,2017-01-01,"J Balvin, Pharrell Williams, BIA, Sky",https://open.spotify.com/track/6rQSrBHf7HlZjtc...,Argentina,top200,SAME_POSITION,173865.0
4,Shaky Shaky,5,2017-01-01,Daddy Yankee,https://open.spotify.com/track/58IL315gMSTD37D...,Argentina,top200,MOVE_UP,153956.0
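With roughly 26 million rows, reading every column at default dtypes can take several gigabytes of memory. The sketch below shows one possible way to reduce that by selecting columns and declaring dtypes at read time. It runs on a tiny in-memory stand-in rather than the real file; the sample rows, the `keep_cols` selection, and the dtype choices are assumptions for illustration, not part of the original notebook.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for data/raw/spotify_charts.csv. The column names
# follow the preview above; the rows are made up for illustration only.
sample_csv = io.StringIO(
    "title,rank,date,artist,url,region,chart,trend,streams\n"
    "Song A,1,2017-01-01,Artist X,https://example.com/a,Argentina,top200,SAME_POSITION,1000.0\n"
    "Song B,2,2017-01-01,Artist Y,https://example.com/b,Argentina,top200,MOVE_UP,\n"
)

# Assumed subset of columns needed downstream; adjust to your analysis.
keep_cols = ["title", "rank", "date", "artist", "region", "chart", "streams"]

df_sample = pd.read_csv(
    sample_csv,
    usecols=keep_cols,            # skip columns that are not needed (url, trend)
    dtype={
        "region": "category",     # few distinct values, so category saves memory
        "chart": "category",
        "rank": "int32",          # assumes rank is never missing
    },
    parse_dates=["date"],         # parse the date column as datetime64
)

print(df_sample.dtypes)
print(df_sample.memory_usage(deep=True).sum(), "bytes")
```

To apply the same idea to the real file, pass `charts_path` instead of `sample_csv`. If `rank` turns out to contain missing values, the `int32` dtype will raise an error and should be dropped from the `dtype` mapping.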


In [3]:
# Example: Spotify Tracks dataset (with audio features)
tracks_path = "data/raw/spotify_tracks.csv"  # put your file here

df_tracks = pd.read_csv(tracks_path, low_memory=False)
print(df_tracks.shape)
df_tracks.head()

(114000, 21)


Unnamed: 0.1,Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
0,0,5SuOikwiRyPMVoIQDJUgSV,Gen Hoshino,Comedy,Comedy,73,230666,False,0.676,0.461,...,-6.746,0,0.143,0.0322,1e-06,0.358,0.715,87.917,4,acoustic
1,1,4qPNDBW1i3p13qLCt0Ki3A,Ben Woodward,Ghost (Acoustic),Ghost - Acoustic,55,149610,False,0.42,0.166,...,-17.235,1,0.0763,0.924,6e-06,0.101,0.267,77.489,4,acoustic
2,2,1iJBSr7s7jYXzM8EGcbK5b,Ingrid Michaelson;ZAYN,To Begin Again,To Begin Again,57,210826,False,0.438,0.359,...,-9.734,1,0.0557,0.21,0.0,0.117,0.12,76.332,4,acoustic
3,3,6lfxq3CG4xtTiEg7opyCyx,Kina Grannis,Crazy Rich Asians (Original Motion Picture Sou...,Can't Help Falling In Love,71,201933,False,0.266,0.0596,...,-18.515,1,0.0363,0.905,7.1e-05,0.132,0.143,181.74,3,acoustic
4,4,5vjLSffimiIP26QG5WcN2K,Chord Overstreet,Hold On,Hold On,82,198853,False,0.618,0.443,...,-9.681,1,0.0526,0.469,0.0,0.0829,0.167,119.949,4,acoustic
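The preview above includes an `Unnamed: 0` column, which usually means an index was written into the CSV by an earlier export. A minimal sketch of removing such columns follows. It uses a made-up two-row frame (`df_demo`) so it runs on its own; whether the column is safe to drop in the real data is an assumption worth checking first.

```python
import pandas as pd

# Made-up two-row frame mimicking the leftover index column in the preview.
df_demo = pd.DataFrame(
    {
        "Unnamed: 0": [0, 1],
        "track_id": ["id_a", "id_b"],
        "track_name": ["Song A", "Song B"],
    }
)

# Collect every column whose name starts with "Unnamed" and drop them all.
unnamed_cols = [c for c in df_demo.columns if str(c).startswith("Unnamed")]
df_demo = df_demo.drop(columns=unnamed_cols)

print(df_demo.columns.tolist())
```

The same two lines can be applied to `df_tracks` after loading. An alternative is to pass `index_col=0` to `pd.read_csv`, which treats the first CSV column as the index instead of a data column.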


In [4]:
# Merge charts and tracks on track name + artist.
# The charts file names these columns "title" and "artist";
# the tracks file names them "track_name" and "artists".
df_merged = pd.merge(
    df_charts,
    df_tracks,
    left_on=["title", "artist"],
    right_on=["track_name", "artists"],
    how="inner"
)

print(df_merged.shape)
df_merged.head()
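Merging on track name and artist relies on exact string equality, and the two previews format multi-artist entries differently: the charts data joins artists with `, ` while the tracks data joins them with `;`. The sketch below shows one possible way to build normalized join keys and to inspect the match rate with `indicator=True`. The helper `normalize_artists`, the `*_key` columns, and the demo frames are hypothetical and not part of the original notebook. Note that splitting on commas will mis-handle artist names that themselves contain a comma.

```python
import re
import pandas as pd


def normalize_artists(value: str) -> str:
    """Hypothetical helper: lowercase, split on ',' or ';', sort, re-join with ';'."""
    parts = [p.strip().lower() for p in re.split(r"[;,]", str(value))]
    return ";".join(sorted(p for p in parts if p))


# Made-up demo frames that mirror the column names of the two datasets.
charts_demo = pd.DataFrame(
    {"title": ["Song A", "Song B"], "artist": ["Artist X, Artist Y", "Artist Z"]}
)
tracks_demo = pd.DataFrame(
    {"track_name": ["song a", "Song C"], "artists": ["Artist Y;Artist X", "Artist Z"]}
)

# Build comparable keys on both sides.
charts_demo["title_key"] = charts_demo["title"].str.strip().str.lower()
charts_demo["artist_key"] = charts_demo["artist"].map(normalize_artists)
tracks_demo["title_key"] = tracks_demo["track_name"].str.strip().str.lower()
tracks_demo["artist_key"] = tracks_demo["artists"].map(normalize_artists)

# A left merge with indicator=True keeps unmatched chart rows visible,
# so the share of rows that found a track can be checked before filtering.
merged_demo = charts_demo.merge(
    tracks_demo, on=["title_key", "artist_key"], how="left", indicator=True
)

print(merged_demo["_merge"].value_counts())
```

Rows marked `left_only` are chart entries with no matching track; filtering to `both` afterwards gives the equivalent of an inner merge.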

In [None]:
df_merged.to_csv("data/processed/spotify_merged.csv", index=False)

# Wrap-Up

Now we have:
- Spotify Charts (daily/weekly rankings, streams)
- Spotify Tracks (metadata + audio features)
- A merged dataset ready for cleaning and analysis.

The next notebook (`02_data_cleaning_analysis.ipynb`) will:
- Clean the dataset
- Explore top artists, songs, and genres
- Visualize music trends