# Dataset Preprocessing

This notebook cleans the `900k Definitive Spotify Dataset`: it removes duplicates, imputes missing values, clips outliers, and normalizes the features. It is part of the *Radar Sonoro* project.

In [1]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import os


In [2]:
# Load the JSON file from the raw data folder
path = os.path.join("..", "data", "raw", "900k Definitive Spotify Dataset.json")
df = pd.read_json(path, lines=True)
df.head()


Unnamed: 0,Artist(s),song,text,Length,emotion,Genre,Album,Release Date,Key,Tempo,...,Good for Party,Good for Work/Study,Good for Relaxation/Meditation,Good for Exercise,Good for Running,Good for Yoga/Stretching,Good for Driving,Good for Social Gatherings,Good for Morning Routine,Similar Songs
0,!!!,Even When the Waters Cold,Friends told her she was better off at the bot...,03:47,sadness,hip hop,Thr!!!er,2013-04-29,D min,0.43787,...,0,0,0,0,0,0,0,0,0,"[{'Similar Artist 1': 'Corey Smith', 'Similar ..."
1,!!!,One Girl / One Boy,"Well I heard it, playing soft From a drunken b...",04:03,sadness,hip hop,Thr!!!er,2013-04-29,A# min,0.508876,...,0,0,0,0,0,0,0,0,0,"[{'Similar Artist 1': 'Hiroyuki Sawano', 'Simi..."
2,!!!,Pardon My Freedom,"Oh my god, did I just say that out loud? Shoul...",05:51,joy,hip hop,Louden Up Now,2004-06-08,A Maj,0.532544,...,0,0,0,1,0,0,0,0,0,"[{'Similar Artist 1': 'Ricky Dillard', 'Simila..."
3,!!!,Ooo,[Verse 1] Remember when I called you on the te...,03:44,joy,hip hop,As If,2015-10-16,A min,0.538462,...,0,0,0,1,0,0,0,0,0,"[{'Similar Artist 1': 'Eric Clapton', 'Similar..."
4,!!!,Freedom 15,[Verse 1] Calling me like I got something to s...,06:00,joy,hip hop,As If,2015-10-16,F min,0.544379,...,0,0,0,1,0,0,0,0,0,"[{'Similar Artist 1': 'Cibo Matto', 'Similar S..."


In [5]:
# 1) Rename columns to lowercase; 'Positiveness' becomes 'valence'
df.rename(columns={
    'Danceability':      'danceability',
    'Energy':            'energy',
    'Positiveness':      'valence',
    'Tempo':             'tempo',
    'Acousticness':      'acousticness',
    'Instrumentalness':  'instrumentalness',
    'Speechiness':       'speechiness',
    'Liveness':          'liveness',
    'Popularity':        'popularity',
    # …other renames, if desired
}, inplace=True)

# 2) Select the features we will use
features = [
    'danceability','energy','valence','tempo',
    'acousticness','instrumentalness','speechiness',
    'liveness','duration_seconds','popularity'
]

# Convert 'Length' from "mm:ss" format to seconds
def length_to_seconds(length_str):
    try:
        minutes, seconds = map(int, length_str.split(":"))
        return minutes * 60 + seconds
    except (ValueError, AttributeError):
        # Malformed strings or non-string values become missing
        return None

df['duration_seconds'] = df['Length'].apply(length_to_seconds)


# 3) Subset and sanity check
df = df[features].copy()
df.head()


Unnamed: 0,danceability,energy,valence,tempo,acousticness,instrumentalness,speechiness,liveness,duration_seconds,popularity
0,71,83,87,0.43787,11,0,4,16,227,40
1,70,85,87,0.508876,0,0,4,32,243,42
2,71,89,63,0.532544,0,20,8,64,351,29
3,78,84,97,0.538462,12,0,4,12,224,24
4,77,71,70,0.544379,4,1,7,10,360,30
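
The `mm:ss` conversion used above can be checked in isolation. The following is a minimal sketch that redefines the helper so it runs on its own; the values `03:47` and `06:00` come from the preview, while the other inputs are made up here to illustrate the failure path.

```python
def length_to_seconds(length_str):
    """Convert an "mm:ss" string to seconds; return None if it cannot be parsed."""
    try:
        minutes, seconds = map(int, length_str.split(":"))
        return minutes * 60 + seconds
    except (ValueError, AttributeError):
        return None

# The last three inputs are hypothetical: three fields, a non-string, and non-numeric text
samples = ["03:47", "06:00", "1:02:03", None, "abc"]
converted = [length_to_seconds(s) for s in samples]
print(converted)
```

Anything that does not match `mm:ss` becomes `None`, so such rows would show up as missing values in `duration_seconds` and be handled by the imputation step.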


In [7]:
# Remove duplicates
n_dup = df.duplicated().sum()
print(f"Duplicates found: {n_dup}")
df.drop_duplicates(inplace=True)


Duplicates found: 38970
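
Because the DataFrame was already reduced to the numeric features, a row counts as a duplicate whenever all ten feature values match an earlier row, even if the rows came from different songs. A small sketch on made-up data (the `toy` frame is hypothetical) shows how `duplicated` and `drop_duplicates` behave:

```python
import pandas as pd

# Hypothetical rows: the second row repeats the first in every column
toy = pd.DataFrame({
    "energy": [83, 83, 89],
    "popularity": [40, 40, 29],
})

# duplicated() flags every row that equals an earlier row in all columns
n_dup = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence; reset_index makes the index contiguous again
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_dup, len(deduped))
```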


In [8]:
# Missing values
missing = df.isna().sum()
print("Missing values per column:\n", missing)

# Median imputation
imputer = SimpleImputer(strategy='median')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=features)


Missing values per column:
 danceability        0
energy              0
valence             0
tempo               0
acousticness        0
instrumentalness    0
speechiness         0
liveness            0
duration_seconds    0
popularity          0
dtype: int64
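
Since no column has missing values here, the imputer leaves the data unchanged. To show what median imputation would do if gaps were present, here is a minimal sketch on a made-up frame (`toy` and its values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with one gap per column
toy = pd.DataFrame({
    "energy": [10.0, 20.0, np.nan, 40.0],
    "tempo": [0.2, np.nan, 0.4, 0.9],
})

# Each NaN is replaced by the median of the observed values in its own column
imputer = SimpleImputer(strategy="median")
toy_imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(toy_imputed)
```

The median is less sensitive to extreme values than the mean, which matters here because outliers are only clipped in a later step.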


In [9]:
# Clip outliers at the 1st and 99th percentiles
lower = df_imputed.quantile(0.01)
upper = df_imputed.quantile(0.99)
df_clipped = df_imputed.clip(lower=lower, upper=upper, axis=1)
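
Percentile clipping caps values instead of dropping rows: anything below a column's 1st percentile is raised to that threshold, and anything above the 99th is lowered to it. The sketch below uses made-up columns (`toy` is hypothetical) to show that each column gets its own thresholds:

```python
import pandas as pd

# Hypothetical columns on different scales, 101 evenly spaced values each
toy = pd.DataFrame({
    "popularity": [float(i) for i in range(101)],   # 0 .. 100
    "tempo": [i / 100 for i in range(101)],         # 0.00 .. 1.00
})

lower = toy.quantile(0.01)   # Series: one threshold per column
upper = toy.quantile(0.99)

# axis=1 aligns the threshold Series with the column labels
clipped = toy.clip(lower=lower, upper=upper, axis=1)
print(clipped.agg(["min", "max"]))
```

The number of rows stays the same; only the extreme values move.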


In [10]:
# Min-max scaling: rescale each feature to [0, 1]
mm_scaler = MinMaxScaler()
df_mm = pd.DataFrame(mm_scaler.fit_transform(df_clipped), columns=features)

# Standard scaling: zero mean and unit variance per feature
std_scaler = StandardScaler()
df_std = pd.DataFrame(std_scaler.fit_transform(df_clipped), columns=features)
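
The two scalers apply different per-column formulas: min-max scaling computes `(x - min) / (max - min)`, and standard scaling computes `(x - mean) / std` using the population standard deviation. As a rough check, the following sketch compares both against manual calculations on a made-up column (`toy` is hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single-feature data
toy = pd.DataFrame({"energy": [0.0, 50.0, 100.0]})

mm = MinMaxScaler().fit_transform(toy)
std = StandardScaler().fit_transform(toy)

# Manual equivalents; StandardScaler uses the population std (ddof=0)
manual_mm = (toy - toy.min()) / (toy.max() - toy.min())
manual_std = (toy - toy.mean()) / toy.std(ddof=0)

print(mm.ravel())
print(std.ravel())
```

Note that `fit_transform` returns a NumPy array, which is why the notebook wraps the result in a new DataFrame with the original column names.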


In [11]:
# Save the processed datasets
os.makedirs(os.path.join("..", "data", "processed"), exist_ok=True)
df_imputed.to_csv("../data/processed/spotify_imputed.csv", index=False)
df_mm.to_csv("../data/processed/spotify_minmax.csv", index=False)
df_std.to_csv("../data/processed/spotify_standard.csv", index=False)

print("🚀 Preprocessing complete. Files saved in data/processed/:")
print(" - spotify_imputed.csv")
print(" - spotify_minmax.csv")
print(" - spotify_standard.csv")


🚀 Preprocessing complete. Files saved in data/processed/:
 - spotify_imputed.csv
 - spotify_minmax.csv
 - spotify_standard.csv
