# Libraries:

In [25]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import math
import plotly.graph_objects as go
from plotly.subplots import make_subplots


# Import the file:

In [10]:
df = pd.read_csv(r"../data/datos.csv")

# EDA (exploratory data analysis)

## General analysis:

We have 15 columns and 450 rows:

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 450 entries, 0 to 449
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Mood              450 non-null    object 
 1   artist_name       450 non-null    object 
 2   popularity        450 non-null    int64  
 3   danceability      450 non-null    float64
 4   energy            450 non-null    float64
 5   key               450 non-null    int64  
 6   loudness          450 non-null    float64
 7   mode              450 non-null    int64  
 8   speechiness       450 non-null    float64
 9   acousticness      450 non-null    float64
 10  instrumentalness  450 non-null    float64
 11  liveness          450 non-null    float64
 12  valence           450 non-null    float64
 13  tempo             450 non-null    float64
 14  duration_ms       450 non-null    int64  
dtypes: float64(9), int64(4), object(2)
memory usage: 52.9+ KB


## Analysis of categorical variables:

What sample size do we have?

In [24]:
df['Mood'].value_counts().sum()

450

In [17]:
df['Mood'].value_counts()

Mood
Happy    100
Fear     100
Focus    100
Sad       85
Anger     65
Name: count, dtype: int64

In [20]:

mood_values = df['Mood'].value_counts()
fig = px.bar(x=mood_values.index, y=mood_values.values, template = 'ggplot2')
fig.update_layout(
    xaxis_title="Mood",
    yaxis_title="Number of tracks")
fig.show()

As the chart shows, the playlist for sadness ('Sad') gives us 85 tracks to train the model, and the one for anger ('Anger') only 65.  
If the trained model later has trouble with these moods, we will add tracks from other playlists to enlarge their samples.

A common rule of thumb is that a model needs at least ten training samples for each input variable (feature) it uses. This is known as the "10 times" rule. For example, with 10 features you would want at least 100 training samples.  

In principle, then, my analysis will focus on roughly 10 features so that the rule holds, at least approximately, for the four largest moods (100, 100, 100 and 85 tracks).

https://postindustria.com/how-much-data-is-required-for-machine-learning/#:~:text=The%20most%20common%20way%20to,parameters%20in%20your%20data%20set.
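
The rule can be checked against the class sizes with a few lines of arithmetic. This is a minimal sketch: the counts are copied from the `value_counts()` output above, while `n_features = 10` and the choice to apply the rule per mood are assumptions made for illustration.

```python
# Class sizes copied from the value_counts() output above.
mood_counts = {"Happy": 100, "Fear": 100, "Focus": 100, "Sad": 85, "Anger": 65}

n_features = 10             # assumed number of input features
required = 10 * n_features  # "10 times" rule of thumb

# Samples still missing per mood if the rule is applied per class (an assumption).
shortfall = {mood: max(0, required - n) for mood, n in mood_counts.items()}

# Largest number of features each class supports under the same rule.
max_features = {mood: n // 10 for mood, n in mood_counts.items()}

print(shortfall)
print(max_features)
```

Read this way, 'Sad' and 'Anger' fall short of 100 samples, which is why enlarging those two playlists is kept as an option.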

Do any artists appear more than once across the different moods?

In [15]:
df['artist_name'].value_counts()

artist_name
Sam Smith               6
Ed Sheeran              5
Imber Sun               5
Josef Briem             5
Far & Beyond            5
                       ..
YONAKA                  1
Bring Me The Horizon    1
Alice Merton            1
Muse                    1
Melvin Barker           1
Name: count, Length: 341, dtype: int64

Sam Smith appears most often, with 6 tracks.
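
`value_counts()` counts tracks per artist, but it does not show whether an artist spans several moods. One possible way to check is to count distinct moods per artist. The sketch below uses a small illustrative frame (`toy` is hypothetical and holds only a handful of artist names seen in the outputs of this notebook); on the real data the same `groupby` would run on `df`.

```python
import pandas as pd

# Small illustrative frame; the real notebook would use `df` instead.
toy = pd.DataFrame({
    "Mood": ["Happy", "Sad", "Sad", "Sad", "Happy"],
    "artist_name": ["Coldplay", "Coldplay", "Sam Smith", "Ed Sheeran", "Kygo"],
})

# Number of distinct moods in which each artist appears.
moods_per_artist = toy.groupby("artist_name")["Mood"].nunique()

# Keep only artists present in more than one mood.
cross_mood = moods_per_artist[moods_per_artist > 1]
print(cross_mood)
```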

## Analysis of numerical variables:

In [12]:
df.describe()

| | popularity | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 450.0 | 450.0 | 450.0 | 450.0 | 450.0 | 450.0 | 450.0 | 450.0 | 450.0 | 450.0 | 450.0 | 450.0 | 450.0 |
| mean | 62.317778 | 0.556267 | 0.487121 | 5.311111 | -10.211711 | 0.648889 | 0.061014 | 0.445754 | 0.295955 | 0.157851 | 0.363042 | 113.515611 | 196423.537778 |
| std | 17.743882 | 0.150213 | 0.28645 | 3.602908 | 6.174619 | 0.477849 | 0.055972 | 0.391267 | 0.399973 | 0.117311 | 0.254685 | 29.986572 | 56893.384509 |
| min | 14.0 | 0.132 | 0.00954 | 0.0 | -32.03 | 0.0 | 0.0243 | 2.6e-05 | 0.0 | 0.0292 | 0.0309 | 35.366 | 100467.0 |
| 25% | 52.0 | 0.46025 | 0.2125 | 2.0 | -14.07825 | 0.0 | 0.03385 | 0.0361 | 2e-06 | 0.099025 | 0.14525 | 90.09475 | 160663.75 |
| 50% | 60.5 | 0.56 | 0.485 | 5.0 | -8.193 | 1.0 | 0.0402 | 0.3725 | 0.00109 | 0.111 | 0.3035 | 113.01 | 184054.5 |
| 75% | 79.0 | 0.6635 | 0.75275 | 9.0 | -5.342 | 1.0 | 0.062725 | 0.874 | 0.836 | 0.15775 | 0.53675 | 130.01225 | 224634.0 |
| max | 95.0 | 0.954 | 0.987 | 11.0 | -1.789 | 1.0 | 0.519 | 0.994 | 0.973 | 0.755 | 0.965 | 203.639 | 518747.0 |


### Detailed analysis of the moods and their features:

**In this section we analyze each mood one by one to understand what Spotify relied on when creating these playlists.**

Since several variables lie within bounded ranges, outliers are analyzed only for the unbounded variables: 'duration_ms', 'loudness' and 'tempo'.

#### 1. Happiness / Happy

In [22]:
happy_df = df[(df.Mood == 'Happy')]
happy_df

| | Mood | artist_name | popularity | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Happy | David Guetta | 95 | 0.561 | 0.965 | 7 | -3.673 | 0 | 0.0343 | 0.00383 | 0.000007 | 0.3710 | 0.304 | 128.040 | 175238 |
| 1 | Happy | OneRepublic | 95 | 0.704 | 0.797 | 0 | -5.927 | 1 | 0.0475 | 0.08260 | 0.000745 | 0.0546 | 0.825 | 139.994 | 148486 |
| 2 | Happy | Dua Lipa | 94 | 0.671 | 0.845 | 11 | -4.930 | 0 | 0.0480 | 0.02070 | 0.000000 | 0.3290 | 0.775 | 110.056 | 176579 |
| 3 | Happy | The Weeknd | 94 | 0.514 | 0.730 | 1 | -5.934 | 1 | 0.0598 | 0.00146 | 0.000095 | 0.0897 | 0.334 | 171.005 | 200040 |
| 4 | Happy | Harry Styles | 92 | 0.548 | 0.816 | 0 | -4.209 | 1 | 0.0465 | 0.12200 | 0.000000 | 0.3350 | 0.557 | 95.390 | 174000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | Happy | Coldplay | 76 | 0.507 | 0.828 | 10 | -6.023 | 1 | 0.0449 | 0.00711 | 0.000024 | 0.2610 | 0.489 | 178.032 | 211295 |
| 96 | Happy | Justin Wellington | 76 | 0.862 | 0.753 | 5 | -5.356 | 1 | 0.0625 | 0.13100 | 0.000002 | 0.0770 | 0.827 | 105.039 | 182857 |
| 97 | Happy | Kygo | 75 | 0.750 | 0.797 | 0 | -4.826 | 1 | 0.1180 | 0.29300 | 0.000000 | 0.3920 | 0.523 | 105.949 | 215203 |
| 98 | Happy | Nathan Evans | 75 | 0.722 | 0.893 | 0 | -3.255 | 0 | 0.0475 | 0.04410 | 0.000937 | 0.0673 | 0.439 | 119.932 | 116750 |


##### 1.1 Unbounded variables (duration_ms, loudness and tempo)

Outlier and distribution analysis of 'duration_ms', 'loudness' and 'tempo' for the happiness mood: 'Happy'.

In [27]:
happy_df.columns

Index(['Mood', 'artist_name', 'popularity', 'danceability', 'energy', 'key',
       'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness',
       'liveness', 'valence', 'tempo', 'duration_ms'],
      dtype='object')

In [None]:
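# Convert ms to minutes: 1 min = 60,000 ms.
# The column keeps the name 'duration_ms' but holds minutes from here on.
# assign() returns a new DataFrame, which avoids SettingWithCopyWarning.
happy_df = happy_df.assign(duration_ms=happy_df['duration_ms'] / 60000)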

In [40]:
cols = ['duration_ms', 'loudness', 'tempo']

fig = make_subplots(rows=1, cols=3, subplot_titles=cols)

for i, col in enumerate(cols):
    fig.add_trace(go.Box(y=happy_df[col], name=col), row=1, col=i+1)

fig.update_layout(
    title="Distribution of variables:",
    height=400,
    width=900
)

fig.show()

How many outliers does each variable have, so we can decide whether to remove them?

In [52]:
# Count the outliers in each column using the 1.5 * IQR rule
cols = ['duration_ms', 'loudness', 'tempo']

def outliers1(df, cols=cols):
    for col in cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        outlierIzq = Q1 - (1.5 * IQR)  # lower fence
        outlierDer = Q3 + (1.5 * IQR)  # upper fence
        outliers = df[col][(df[col] < outlierIzq) | (df[col] > outlierDer)]
        print("Variable " + col + " has " + str(len(outliers)) + " outliers")


outliers1(happy_df)

Variable duration_ms has 5 outliers
Variable loudness has 3 outliers
Variable tempo has 8 outliers


Duration has one extreme outlier (the maximum, about 5.7 min) that could distort the analysis the most. The remaining outliers do not seem to need removal.
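
Before deciding what to drop, it can help to look at the flagged values themselves rather than only their count. The following is a minimal sketch of the same 1.5 × IQR rule that returns the outlying values. The function name `iqr_outliers` and the `durations` series are hypothetical and are not taken from the dataset.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values of `series` outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series[mask]

# Hypothetical durations in minutes, not taken from the dataset.
durations = pd.Series([2.7, 2.9, 3.0, 3.1, 3.3, 5.7], name="duration_min")
flagged = iqr_outliers(durations)
print(flagged)
```

In the notebook, something like `iqr_outliers(happy_df['duration_ms'])` would list the five flagged tracks so the extreme one can be inspected individually.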

In [44]:
happy_df[['duration_ms','loudness','tempo']].describe()

| | duration_ms | loudness | tempo |
|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 |
| mean | 3.043904 | -5.29973 | 120.25705 |
| std | 0.600838 | 1.709559 | 21.574673 |
| min | 1.829167 | -10.778 | 79.994 |
| 25% | 2.713467 | -6.39275 | 106.768 |
| 50% | 3.013933 | -4.875 | 119.949 |
| 75% | 3.305417 | -4.1095 | 127.845 |
| max | 5.710217 | -2.392 | 182.162 |


Songs in the playlist that represents happiness:  
- **Duration**: The mean duration is about 3 min, and the middle 50% of tracks last between about 2.7 and 3.3 min. A few outliers last more than 4 min.
- **Loudness**: Happy songs average about -5 dB, with outliers around -10 dB.
- **Tempo**: Happy songs average about 120 beats per minute (BPM), with outliers above roughly 160 BPM (the upper 1.5 × IQR fence computed from the quartiles above).
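
The outlier thresholds can be reproduced from the quartiles alone. The sketch below copies the 25% and 75% values from the `describe()` output for the 'Happy' playlist and applies the 1.5 × IQR rule. The helper `iqr_fences` and the key `duration_min` are hypothetical names introduced here.

```python
# Quartiles (25%, 75%) copied from the describe() output for 'Happy'.
quartiles = {
    "duration_min": (2.713467, 3.305417),
    "loudness": (-6.39275, -4.1095),
    "tempo": (106.768, 127.845),
}

def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

fences = {name: iqr_fences(q1, q3) for name, (q1, q3) in quartiles.items()}
for name, (low, high) in fences.items():
    print(f"{name}: lower={low:.2f}, upper={high:.2f}")
```

These fences follow the quartiles as pandas computes them. Plotly box plots may place the whiskers slightly differently, because their default quartile method is not the same as the one pandas uses.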

##### 1.2 Bounded variables

In [45]:
happy_df.columns

Index(['Mood', 'artist_name', 'popularity', 'danceability', 'energy', 'key',
       'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness',
       'liveness', 'valence', 'tempo', 'duration_ms'],
      dtype='object')

From the remaining variables, a selection was made and the most interesting ones are analyzed:

In [51]:
cols = ['popularity','danceability', 'energy','valence', 'speechiness','instrumentalness']

fig = make_subplots(rows=2, cols=3, subplot_titles=cols)

for i, col in enumerate(cols):
    fig.add_trace(go.Histogram(y=happy_df[col], name=col), row=(i // 3) + 1, col=(i % 3) + 1)

fig.update_layout(
    title="Histograms:",
    height=600,
    width=900
)

fig.show()

According to the playlist Spotify generated, happy songs have the following traits:  
- They are generally quite popular (roughly 80 out of 100 on average).  
- Most are danceable, since the danceability values are close to 1.  
- They are energetic.  
- Most have a cheerful valence, although some songs show very low valence. Valence indicates how cheerful a song sounds, so this is somewhat surprising.  
- They are not spoken or rapped.  
- Finally, almost all of them have vocals and are not purely instrumental.
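
Readings taken from histograms can be backed with numbers. As a minimal sketch, the frame below holds only the first five 'Happy' rows displayed earlier, which are the most popular tracks and therefore not representative. On the full data the same `agg` call would run on `happy_df[cols]`.

```python
import pandas as pd

# First five 'Happy' rows as displayed above (subset of columns).
sample = pd.DataFrame({
    "popularity": [95, 95, 94, 94, 92],
    "danceability": [0.561, 0.704, 0.671, 0.514, 0.548],
    "energy": [0.965, 0.797, 0.845, 0.730, 0.816],
    "valence": [0.304, 0.825, 0.775, 0.334, 0.557],
    "speechiness": [0.0343, 0.0475, 0.0480, 0.0598, 0.0465],
    "instrumentalness": [0.000007, 0.000745, 0.0, 0.000095, 0.0],
})

# Mean, median and range put numbers behind the visual impressions.
summary = sample.agg(["mean", "median", "min", "max"]).T
print(summary.round(3))
```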

#### 2. Sadness / Sad

In [53]:
sad_df = df[(df.Mood == 'Sad')]
sad_df

| | Mood | artist_name | popularity | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | Sad | Alec Benjamin | 90 | 0.652 | 0.557 | 1 | -5.714 | 0 | 0.0318 | 0.740 | 0.000000 | 0.1240 | 0.483 | 150.073 | 169354 |
| 101 | Sad | Coldplay | 89 | 0.557 | 0.442 | 5 | -7.224 | 1 | 0.0243 | 0.731 | 0.000015 | 0.1100 | 0.213 | 146.277 | 309600 |
| 102 | Sad | Ed Sheeran | 89 | 0.614 | 0.379 | 4 | -10.480 | 1 | 0.0476 | 0.607 | 0.000464 | 0.0986 | 0.201 | 107.989 | 258987 |
| 103 | Sad | Sam Smith | 88 | 0.681 | 0.372 | 5 | -8.237 | 1 | 0.0432 | 0.640 | 0.000000 | 0.1690 | 0.476 | 91.873 | 201000 |
| 104 | Sad | Olivia Rodrigo | 87 | 0.369 | 0.272 | 9 | -10.497 | 1 | 0.0364 | 0.866 | 0.000000 | 0.1470 | 0.218 | 172.929 | 152667 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 180 | Sad | Noah Cyrus | 60 | 0.618 | 0.428 | 7 | -8.500 | 0 | 0.0380 | 0.592 | 0.000000 | 0.1920 | 0.359 | 142.021 | 190183 |
| 181 | Sad | Babi | 60 | 0.622 | 0.432 | 1 | -6.203 | 0 | 0.0575 | 0.544 | 0.075400 | 0.2900 | 0.362 | 126.790 | 207354 |
| 182 | Sad | Tom Odell | 60 | 0.616 | 0.273 | 5 | -12.470 | 1 | 0.0307 | 0.960 | 0.001050 | 0.1650 | 0.314 | 128.040 | 182029 |
| 183 | Sad | C. Tangana | 60 | 0.612 | 0.198 | 10 | -12.909 | 1 | 0.0363 | 0.686 | 0.000526 | 0.1250 | 0.175 | 72.948 | 112440 |


In [55]:
# Convert ms to minutes: 1 min = 60,000 ms.
# The column keeps the name 'duration_ms' but holds minutes from here on.
# assign() returns a new DataFrame, which avoids SettingWithCopyWarning.
sad_df = sad_df.assign(duration_ms=sad_df['duration_ms'] / 60000)

In [56]:
cols = ['duration_ms', 'loudness', 'tempo']

fig = make_subplots(rows=1, cols=3, subplot_titles=cols)

for i, col in enumerate(cols):
    fig.add_trace(go.Box(y=sad_df[col], name=col), row=1, col=i+1)

fig.update_layout(
    title="Distribution of variables:",
    height=400,
    width=900
)

fig.show()

In [57]:
sad_df[['duration_ms','loudness','tempo']].describe()

| | duration_ms | loudness | tempo |
|---|---|---|---|
| count | 85.0 | 85.0 | 85.0 |
| mean | 3.561567 | -8.926859 | 117.962824 |
| std | 0.716339 | 3.366766 | 32.949885 |
| min | 1.874 | -23.023 | 69.754 |
| 25% | 3.036017 | -10.22 | 90.289 |
| 50% | 3.456667 | -8.295 | 114.441 |
| 75% | 4.002217 | -6.886 | 139.644 |
| max | 5.383783 | -3.966 | 199.853 |


Songs in the playlist that represents sadness:  
- **Duration**: The mean duration is about 3.6 min, and most tracks last between 3 and 4 min. There are no outliers.
- **Loudness**: Sad songs average about -9 dB, with outliers below -14 dB.
- **Tempo**: Sad songs average about 118 beats per minute (BPM), and there appear to be no outliers.
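
Placing the two playlists side by side makes the contrast easier to read. This sketch only reuses the means from the two `describe()` outputs above. The row label `duration_min` and the column `Sad_minus_Happy` are names introduced here for illustration.

```python
import pandas as pd

# Means copied from the two describe() outputs above.
means = pd.DataFrame(
    {
        "Happy": {"duration_min": 3.043904, "loudness": -5.29973, "tempo": 120.25705},
        "Sad": {"duration_min": 3.561567, "loudness": -8.926859, "tempo": 117.962824},
    }
)

# Positive values mean the 'Sad' playlist is higher on that variable.
means["Sad_minus_Happy"] = means["Sad"] - means["Happy"]
print(means.round(2))
```

On these figures, sad tracks are about half a minute longer and about 3.6 dB quieter on average, while mean tempo differs by only about 2 BPM.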

# Encoding

# Correlations