# Notebook for processing Spotify data
This notebook loads, cleans, merges, and saves the Spotify data using the `SpotifyDataProcessor` class. Make sure the file `spotify_data_processor.py` is in the same folder.

In [1]:
# Move the cwd to the project root (where the data/ folder lives)
import os
os.chdir("..")   # assuming you are in /notebooks
print("Current working dir:", os.getcwd())

Current working dir: e:\CDD1\SongReccomender\SongReccomender
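
The relative `os.chdir("..")` above only works when the notebook is started from `notebooks/`, and running the cell a second time moves up one more level. A minimal sketch of a more defensive alternative, assuming the project root is the nearest directory at or above the current one that contains a `data/` folder (the helper name `find_project_root` is hypothetical and not part of the project):

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "data") -> Optional[Path]:
    """Return the closest directory at or above `start` that contains `marker`."""
    start = start.resolve()
    # Check the starting directory first, then walk up through its ancestors.
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    return None  # no directory on the way up contains the marker folder
```

Passing the result of `find_project_root(Path.cwd())` to `os.chdir` (after checking that it is not `None`) gives the same directory no matter how many times the cell is re-run.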


In [2]:
import sys
from pathlib import Path

# Add the src folder to sys.path (the module search path)
sys.path.insert(0, str(Path('src').resolve()))

# Import the class
from songrecommender.processors.data_processing import SpotifyDataProcessor
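
The cell above assumes that `src/` exists relative to the current working directory. If the earlier `os.chdir` did not land on the project root, a non-existent path is added silently and the import fails later with a less helpful error. A sketch of a guarded variant, assuming that failing early is preferable (the helper `add_to_sys_path` is hypothetical):

```python
import sys
from pathlib import Path


def add_to_sys_path(directory: str) -> str:
    """Prepend `directory` to sys.path once, failing early if it is missing."""
    resolved = Path(directory).resolve()
    if not resolved.is_dir():
        raise FileNotFoundError(f"Source directory not found: {resolved}")
    path_str = str(resolved)
    # Avoid stacking duplicate entries when the cell is re-run.
    if path_str not in sys.path:
        sys.path.insert(0, path_str)
    return path_str
```

In this notebook the call would be `add_to_sys_path("src")`, followed by the same `from songrecommender...` import.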

In [3]:
import os

# Make sure the 'data/raw' directory exists
raw_data_path = 'data/raw'
os.makedirs(raw_data_path, exist_ok=True)

# Instantiate the processor and run the full pipeline
processor = SpotifyDataProcessor()

# Load, clean, merge, and save
df = processor.process_all()

# Show the first rows
df.head()

🔍 Loading raw data...
Loading datasets with specific configuration...
Files in data/raw:
- dataset.csv (19646.7 KB)
- spotify200_daily.csv (815748.0 KB)
- spotify_global_streaming_data.csv (44.2 KB)

🧹 Cleaning data...

🔗 Merging datasets...
⚠️ Warning: 73 non-numeric values in 'daily_streams'
⚠️ Warning: 73 non-numeric values in 'peak_rank'
⚠️ Warning: 73 non-numeric values in 'weeks_on_chart'
⚠️ Warning: 468 non-numeric values in 'danceability'
⚠️ Warning: 468 non-numeric values in 'energy'
⚠️ Warning: 468 non-numeric values in 'valence'

💾 Saving processed data...

✅ Data saved to data\processed\spotify_processed.parquet
🎶 Unique genres: 7
genre
k-pop        5
pop          4
rock         2
reggaeton    1
dance        1
indie        1
classical    1
Name: count, dtype: int64

🔍 Final diagnostics:
Total merged records: 495
Unique artists: 15
Missing values in key columns:
popularity            0.658586
avg_danc

|   | artist_name | album_name | genre | total_streams | avg_duration_min | popularity | avg_danceability | avg_energy | avg_valence | avg_daily_streams | best_rank | max_weeks_on_chart | Country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Taylor Swift | 1989 (Taylor's Version) | k-pop | 3695.53 | 4.28 | NaN | 0.585298 | 0.623716 | 0.427222 | 1198106.0 | 1.0 | 150.0 | Germany |
| 1 | The Weeknd | After Hours | pop | 2828.16 | 3.90 | 91.0 | 0.608774 | 0.735933 | 0.483503 | 726968.8 | 1.0 | 149.0 | Brazil |
| 2 | Post Malone | Austin | reggaeton | 1425.46 | 4.03 | NaN | 0.693911 | 0.680469 | 0.432090 | 2957205.0 | 1.0 | 199.0 | United States |
| 3 | Ed Sheeran | Autumn Variations | k-pop | 2704.33 | 3.26 | NaN | 0.725783 | 0.700390 | 0.577577 | 1249782.0 | 1.0 | 290.0 | Italy |
| 4 | Ed Sheeran | Autumn Variations | r&b | 3323.25 | 4.47 | NaN | 0.725783 | 0.700390 | 0.577577 | 1249782.0 | 1.0 | 290.0 | Italy |
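
The log above warns about non-numeric values per column and ends with the share of missing values in key columns. How `SpotifyDataProcessor` computes these numbers is not shown in this notebook. The following is a minimal sketch of one common way to obtain similar figures with pandas, assuming "non-numeric" means a value that is present but that `pd.to_numeric` cannot parse (the function names and the toy frame are hypothetical):

```python
import pandas as pd


def count_non_numeric(df: pd.DataFrame, columns: list) -> dict:
    """Count values per column that are present but cannot be parsed as numbers."""
    counts = {}
    for col in columns:
        coerced = pd.to_numeric(df[col], errors="coerce")
        # NaNs that appear only after coercion are values that failed to parse.
        counts[col] = int((coerced.isna() & df[col].notna()).sum())
    return counts


def missing_fraction(df: pd.DataFrame, columns: list) -> pd.Series:
    """Fraction of missing values per column, between 0 and 1."""
    return df[columns].isna().mean()


# Toy data for illustration only; not taken from the Spotify datasets.
toy = pd.DataFrame({
    "daily_streams": ["100", "n/a", "250", None],
    "popularity": [91.0, None, None, 80.0],
})
print(count_non_numeric(toy, ["daily_streams"]))
print(missing_fraction(toy, ["popularity"]))
```

If the processor reports fractions in the same way, the `popularity 0.658586` line in the diagnostics would correspond to 326 of the 495 merged records lacking a popularity value, which matches the `NaN` entries visible in the first rows.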


In [4]:
df_en = processor.enrich_spotify_metadata(df)
df_en.to_parquet("data/processed/spotify_fully_enriched.parquet", index=False)
print(df_en.head())


    artist_name               album_name      genre  total_streams  \
0  Taylor Swift  1989 (Taylor's Version)      k-pop        3695.53   
1    The Weeknd              After Hours        pop        2828.16   
2   Post Malone                   Austin  reggaeton        1425.46   
3    Ed Sheeran        Autumn Variations      k-pop        2704.33   
4    Ed Sheeran        Autumn Variations        r&b        3323.25   

   avg_duration_min  popularity  avg_danceability  avg_energy  avg_valence  \
0              4.28         NaN          0.585298    0.623716     0.427222   
1              3.90        91.0          0.608774    0.735933     0.483503   
2              4.03         NaN          0.693911    0.680469     0.432090   
3              3.26         NaN          0.725783    0.700390     0.577577   
4              4.47         NaN          0.725783    0.700390     0.577577   

   avg_daily_streams  ...        Country                album_id  \
0       1.198106e+06  ...        Germany  