# A. Introduction
Dataset Description: This dataset contains roughly 32,000 songs from Spotify, collected with the spotifyr package. It covers track information (title, artist, album), popularity, playlist attributes, and audio features (such as danceability, energy, and tempo) for six main genre categories: EDM, Latin, Pop, R&B, Rap, and Rock.

Why This Dataset Is Interesting: Music is part of my daily life, and I am a music enthusiast myself. Analyzing audio features across genres can offer insight into what makes a song popular and how the acoustic characteristics of a track distinguish one genre from another. This information is useful to both the music industry and listeners for understanding trends.

Analysis Goals:

1. How are songs distributed across genres in the dataset?

2. Which genre has the highest average popularity?

3. How are the audio features correlated with one another?

4. What are the distinguishing characteristics of each genre, based on its audio features?
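
As a rough sketch of how the four questions above might translate into pandas operations, the snippet below works on a tiny invented DataFrame. The rows are illustrative and not taken from the dataset; only the column names `playlist_genre`, `track_popularity`, `danceability`, and `energy` come from it.

```python
import pandas as pd

# Toy data for illustration only; the values are invented.
toy = pd.DataFrame({
    "playlist_genre": ["pop", "pop", "rock", "rap"],
    "track_popularity": [60, 80, 50, 40],
    "danceability": [0.7, 0.8, 0.4, 0.9],
    "energy": [0.8, 0.6, 0.9, 0.5],
})

# 1. Distribution of songs per genre
genre_counts = toy["playlist_genre"].value_counts()

# 2. Mean popularity per genre, highest first
mean_popularity = (
    toy.groupby("playlist_genre")["track_popularity"]
    .mean()
    .sort_values(ascending=False)
)

# 3. Pairwise (Pearson) correlation between audio features
feature_corr = toy[["danceability", "energy"]].corr()

# 4. Per-genre audio profile: mean of each feature by genre
genre_profile = toy.groupby("playlist_genre")[["danceability", "energy"]].mean()

print(genre_counts, mean_popularity, feature_corr, genre_profile, sep="\n\n")
```

On the real data, the same calls would presumably run on the cleaned DataFrame with the full list of audio features.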

# B. Setup & Packages

In [85]:
import pandas as pd     # Data manipulation (reading CSV, cleaning, DataFrames)
import numpy as np      # Numerical operations and array computation
import matplotlib.pyplot as plt  # Basic visualization (plots and charts)
import seaborn as sns   # Statistical visualization (heatmap, boxplot)
import os           # File system operations (paths, directories)
from pathlib import Path  # File path representation
from sklearn.preprocessing import MinMaxScaler  # Normalization of numeric features

In [89]:
current_dir = os.getcwd()
print(f"📂 Current Python working directory: {current_dir}")

📂 Current Python working directory: C:\Users\Akmal sharif Ramdan\AppData\Local\Programs\Microsoft VS Code


In [75]:
raw_path = Path("data/raw/spotify_songs.csv")
processed_path = Path("data/processed/spotify_songs_cleaned.csv")
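
Because the working directory printed above is the VS Code install folder rather than the project folder, relative paths like these would not resolve to the data files. One possible workaround, sketched here under the assumption that the project lives in a folder named `final-project-main` (the `project_root` variable is hypothetical and should point at wherever that folder actually is), is to build both paths from an explicit root:

```python
from pathlib import Path

# Hypothetical project root; replace with the folder that contains data/.
project_root = Path("final-project-main")

raw_path = project_root / "data" / "raw" / "spotify_songs.csv"
processed_path = project_root / "data" / "processed" / "spotify_songs_cleaned.csv"

# Report a missing file early, before pandas tries to read it.
if not raw_path.exists():
    print(f"Raw file not found at {raw_path.resolve()}; check project_root.")
```

With the root set correctly, the CSV could then be read through `raw_path` instead of a hardcoded absolute path.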

# C. Data Preparation

## Import Data

In [78]:
df = pd.read_csv("C:/Users/Akmal sharif Ramdan/OneDrive/Documents/final-project-main/data/raw/spotify_songs.csv")
print("Initial shape:", df.shape)
df.head()

Initial shape: (32833, 23)


| | track_id | track_name | track_artist | track_popularity | track_album_id | track_album_name | track_album_release_date | playlist_name | playlist_id | playlist_genre | ... | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6f807x0ima9a1j3VPbc7VN | I Don't Care (with Justin Bieber) - Loud Luxur... | Ed Sheeran | 66 | 2oCs0DGTsRO98Gh5ZSl2Cx | I Don't Care (with Justin Bieber) [Loud Luxury... | 2019-06-14 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | ... | 6 | -2.634 | 1 | 0.0583 | 0.102 | 0.0 | 0.0653 | 0.518 | 122.036 | 194754 |
| 1 | 0r7CVbZTWZgbTCYdfa2P31 | Memories - Dillon Francis Remix | Maroon 5 | 67 | 63rPSO264uRjW1X5E6cWv6 | Memories (Dillon Francis Remix) | 2019-12-13 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | ... | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 0.00421 | 0.357 | 0.693 | 99.972 | 162600 |
| 2 | 1z1Hg7Vb0AhHDiEmnDE79l | All the Time - Don Diablo Remix | Zara Larsson | 70 | 1HoSmj2eLcsrR0vE9gThr4 | All the Time (Don Diablo Remix) | 2019-07-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | ... | 1 | -3.432 | 0 | 0.0742 | 0.0794 | 2.3e-05 | 0.11 | 0.613 | 124.008 | 176616 |
| 3 | 75FpbthrwQmzHlBJLuGdC7 | Call You Mine - Keanu Silva Remix | The Chainsmokers | 60 | 1nqYsOef1yKKuGOVchbsk6 | Call You Mine - The Remixes | 2019-07-19 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | ... | 7 | -3.778 | 1 | 0.102 | 0.0287 | 9e-06 | 0.204 | 0.277 | 121.956 | 169093 |
| 4 | 1e8PAfcKUYoKkxPhrHqw4x | Someone You Loved - Future Humans Remix | Lewis Capaldi | 69 | 7m7vv9wlQ4i0LFuJiE2zsQ | Someone You Loved (Future Humans Remix) | 2019-03-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | ... | 1 | -4.672 | 1 | 0.0359 | 0.0803 | 0.0 | 0.0833 | 0.725 | 123.976 | 189052 |


## Data Cleaning

In [79]:
jumlah_duplikat = df.duplicated().sum()
print(f"Number of duplicate rows found: {jumlah_duplikat}")
df_clean = df.drop_duplicates()

Number of duplicate rows found: 0
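
Note that `df.duplicated()` only flags rows that are identical in every column. If a track appears in more than one playlist, its rows differ in the playlist columns and are not counted. The sketch below, on invented toy rows, shows how a check restricted to `track_id` might look; whether per-track deduplication is appropriate depends on the analysis, so this is an illustration and not a required step.

```python
import pandas as pd

# Toy rows for illustration only: track "a1" sits in two playlists.
toy = pd.DataFrame({
    "track_id": ["a1", "a1", "b2"],
    "playlist_id": ["p1", "p2", "p1"],
    "track_popularity": [66, 66, 70],
})

# Full-row check: no two rows match in every column.
full_row_dupes = toy.duplicated().sum()

# Per-track check: the second "a1" row counts as a duplicate.
per_track_dupes = toy.duplicated(subset="track_id").sum()

# Keep the first occurrence of each track.
unique_tracks = toy.drop_duplicates(subset="track_id", keep="first")

print(full_row_dupes, per_track_dupes, len(unique_tracks))
```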


In [80]:
print("\nMissing values per column before cleaning:")
print(df_clean.isnull().sum())


Missing values per column before cleaning:
track_id                    0
track_name                  5
track_artist                5
track_popularity            0
track_album_id              0
track_album_name            5
track_album_release_date    0
playlist_name               0
playlist_id                 0
playlist_genre              0
playlist_subgenre           0
danceability                0
energy                      0
key                         0
loudness                    0
mode                        0
speechiness                 0
acousticness                0
instrumentalness            0
liveness                    0
valence                     0
tempo                       0
duration_ms                 0
dtype: int64


In [81]:
df_clean = df_clean.dropna()
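
Called without arguments, `dropna()` removes a row if any column is missing. A minimal sketch on invented toy rows, comparing that behavior with a drop restricted to the text columns through the `subset` parameter:

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "track_name": ["Song A", None, "Song C"],
    "track_artist": ["Artist A", None, "Artist C"],
    "tempo": [120.0, 98.5, np.nan],
})

# Drop rows with a missing value in any column.
dropped_any = toy.dropna()

# Drop rows only when the name or artist is missing.
dropped_subset = toy.dropna(subset=["track_name", "track_artist"])

rows_removed = len(toy) - len(dropped_any)
print(rows_removed, len(dropped_subset))
```

In this notebook the two approaches should give the same result, since the only missing values reported are in `track_name`, `track_artist`, and `track_album_name`.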

In [83]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 32828 entries, 0 to 32832
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   track_id                  32828 non-null  object 
 1   track_name                32828 non-null  object 
 2   track_artist              32828 non-null  object 
 3   track_popularity          32828 non-null  int64  
 4   track_album_id            32828 non-null  object 
 5   track_album_name          32828 non-null  object 
 6   track_album_release_date  32828 non-null  object 
 7   playlist_name             32828 non-null  object 
 8   playlist_id               32828 non-null  object 
 9   playlist_genre            32828 non-null  object 
 10  playlist_subgenre         32828 non-null  object 
 11  danceability              32828 non-null  float64
 12  energy                    32828 non-null  float64
 13  key                       32828 non-null  int64  
 14  loudness   

## Standardization & Transformation

In [87]:
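# Numeric audio features to rescale
audio_features = ['danceability', 'energy', 'key', 'loudness', 'mode',
                  'speechiness', 'acousticness', 'instrumentalness',
                  'liveness', 'valence', 'tempo', 'duration_ms']

# Keep only the features that are actually present in the DataFrame
fitur_ada = [col for col in audio_features if col in df_clean.columns]

# Min-max scaling maps each feature to the [0, 1] range
scaler = MinMaxScaler()

# Work on a copy so df_clean keeps the original values
df_processed = df_clean.copy()
df_processed[fitur_ada] = scaler.fit_transform(df_processed[fitur_ada])

display(df_processed[fitur_ada].head())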

| | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.760936 | 0.915985 | 0.545455 | 0.91809 | 1.0 | 0.063508 | 0.102616 | 0.0 | 0.065562 | 0.522704 | 0.509673 | 0.371254 |
| 1 | 0.738555 | 0.814968 | 1.0 | 0.869162 | 1.0 | 0.040632 | 0.072837 | 0.004235 | 0.358434 | 0.699294 | 0.417524 | 0.308674 |
| 2 | 0.686673 | 0.930988 | 0.090909 | 0.901368 | 0.0 | 0.080828 | 0.079879 | 2.3e-05 | 0.110442 | 0.618567 | 0.517908 | 0.335953 |
| 3 | 0.730417 | 0.929988 | 0.636364 | 0.894118 | 1.0 | 0.111111 | 0.028873 | 9e-06 | 0.204819 | 0.279516 | 0.509338 | 0.321311 |
| 4 | 0.661241 | 0.832971 | 0.090909 | 0.875385 | 1.0 | 0.039107 | 0.080785 | 0.0 | 0.083635 | 0.731584 | 0.517775 | 0.360156 |
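
With its default settings, `MinMaxScaler` rescales each column as `(x - min) / (max - min)`, so the smallest value becomes 0 and the largest becomes 1. The output above is consistent with this: a `key` of 6 maps to 0.545455, which is 6/11 when the column spans 0 to 11. A minimal NumPy sketch of the same arithmetic, using invented tempo values:

```python
import numpy as np

# Invented values for illustration only.
tempo = np.array([60.0, 90.0, 120.0, 180.0])

# Min-max scaling: subtract the minimum, then divide by the range.
scaled = (tempo - tempo.min()) / (tempo.max() - tempo.min())

print(scaled)  # values 0.0, 0.25, 0.5, 1.0

# The same formula applied to key = 6 on a 0 to 11 scale.
key_scaled = (6 - 0) / (11 - 0)
print(round(key_scaled, 6))
```

Because the minimum and maximum come from the data, a single extreme value in a column such as `duration_ms` compresses every other value toward one end of the range.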
