# Libraries:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import math
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Import the file:

In [2]:
df = pd.read_csv(r"../data/datos.csv")

# EDA (exploratory data analysis)

## General analysis:

The dataset has 15 columns and 450 rows:

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 450 entries, 0 to 449
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Mood              450 non-null    object 
 1   artist_name       450 non-null    object 
 2   popularity        450 non-null    int64  
 3   danceability      450 non-null    float64
 4   energy            450 non-null    float64
 5   key               450 non-null    int64  
 6   loudness          450 non-null    float64
 7   mode              450 non-null    int64  
 8   speechiness       450 non-null    float64
 9   acousticness      450 non-null    float64
 10  instrumentalness  450 non-null    float64
 11  liveness          450 non-null    float64
 12  valence           450 non-null    float64
 13  tempo             450 non-null    float64
 14  duration_ms       450 non-null    int64  
dtypes: float64(9), int64(4), object(2)
memory usage: 52.9+ KB


## Categorical variable analysis:

What sample size do we have?

In [4]:
df['Mood'].value_counts().sum()

450

In [5]:
df['Mood'].value_counts()

Mood
Happy    100
Fear     100
Focus    100
Sad       85
Anger     65
Name: count, dtype: int64

In [6]:

mood_values = df['Mood'].value_counts()
fig = px.bar(x=mood_values.index, y=mood_values.values, template = 'ggplot2')
fig.update_layout(
    xaxis_title="Mood",
    yaxis_title="Number of tracks")
fig.show()

As the chart shows, the playlist for sadness ('Sad') gives us 85 samples to train the model, and the one for anger ('Anger') only 65.  
If the trained model gives us trouble later on, we will add other, larger playlists for those moods to the sample.

A common rule of thumb says a model needs about ten training samples for every input variable (feature); this is often called the "10 times rule". For example, with 10 features you would want at least 100 training samples.  

So, in principle, my analysis will focus on roughly 10 parameters so that this rule holds, at least approximately, for the four largest moods.

https://postindustria.com/how-much-data-is-required-for-machine-learning/#:~:text=The%20most%20common%20way%20to,parameters%20in%20your%20data%20set.
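
As a rough illustration of the heuristic above, the sketch below turns each class size into a feature budget. Applying the rule per class (rather than to the whole dataset) is an assumption, and `max_features` and `SAMPLES_PER_FEATURE` are hypothetical names used only here.

```python
# Class sizes copied from the value_counts() output above.
mood_counts = {"Happy": 100, "Fear": 100, "Focus": 100, "Sad": 85, "Anger": 65}

SAMPLES_PER_FEATURE = 10  # ratio suggested by the heuristic, not a hard requirement


def max_features(n_samples: int, ratio: int = SAMPLES_PER_FEATURE) -> int:
    """Largest number of features the heuristic allows for n_samples rows."""
    return n_samples // ratio


feature_budget = {mood: max_features(n) for mood, n in mood_counts.items()}
for mood, budget in feature_budget.items():
    print(f"{mood}: {mood_counts[mood]} samples -> up to {budget} features")
```

Read this way, the three moods with 100 tracks support about 10 features, 'Sad' about 8 and 'Anger' about 6.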

Does any artist appear repeatedly across the different moods?

In [7]:
df['artist_name'].value_counts()

artist_name
Sam Smith               6
Ed Sheeran              5
Imber Sun               5
Josef Briem             5
Far & Beyond            5
                       ..
YONAKA                  1
Bring Me The Horizon    1
Alice Merton            1
Muse                    1
Melvin Barker           1
Name: count, Length: 341, dtype: int64

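Sam Smith is the most frequent artist, with 6 tracks. Note that `value_counts()` counts appearances over the whole dataset, so on its own it does not show whether those tracks belong to different moods.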
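
To answer the question about artists spanning several moods more directly, one option is to count the distinct moods per artist. The sketch below uses a small made-up frame (`toy`) with the same column names as `df`; the rows are illustrative, not taken from the dataset.

```python
import pandas as pd

# Toy data for illustration only; in the notebook this would be `df`.
toy = pd.DataFrame({
    "Mood": ["Happy", "Sad", "Sad", "Anger", "Happy"],
    "artist_name": ["Coldplay", "Coldplay", "Sam Smith", "Muse", "Kygo"],
})

# Number of distinct moods each artist appears in.
moods_per_artist = toy.groupby("artist_name")["Mood"].nunique()

# Keep only the artists present in more than one mood.
repeated = moods_per_artist[moods_per_artist > 1]
print(repeated)
```

Running the same two lines on `df` would list the artists that appear in more than one playlist.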

## Numerical variable analysis:

In [8]:
df.describe()

Unnamed: 0,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms
count,450.0,450.0,450.0,450.0,450.0,450.0,450.0,450.0,450.0,450.0,450.0,450.0,450.0
mean,62.317778,0.556267,0.487121,5.311111,-10.211711,0.648889,0.061014,0.445754,0.295955,0.157851,0.363042,113.515611,196423.537778
std,17.743882,0.150213,0.28645,3.602908,6.174619,0.477849,0.055972,0.391267,0.399973,0.117311,0.254685,29.986572,56893.384509
min,14.0,0.132,0.00954,0.0,-32.03,0.0,0.0243,2.6e-05,0.0,0.0292,0.0309,35.366,100467.0
25%,52.0,0.46025,0.2125,2.0,-14.07825,0.0,0.03385,0.0361,2e-06,0.099025,0.14525,90.09475,160663.75
50%,60.5,0.56,0.485,5.0,-8.193,1.0,0.0402,0.3725,0.00109,0.111,0.3035,113.01,184054.5
75%,79.0,0.6635,0.75275,9.0,-5.342,1.0,0.062725,0.874,0.836,0.15775,0.53675,130.01225,224634.0
max,95.0,0.954,0.987,11.0,-1.789,1.0,0.519,0.994,0.973,0.755,0.965,203.639,518747.0


### Detailed analysis of the emotions and their characteristics:

**In this section we analyse each emotion one by one to understand what Spotify relied on when building these playlists.**

Since several variables have bounded ranges, outliers are analysed only for the unbounded ones, namely 'duration_ms', 'loudness' and 'tempo'.
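
Because the same unit conversion and mood filter are repeated for every emotion below, an alternative is to convert once on the full frame and then split it. This is only a sketch on made-up rows: `duration_min` and `by_mood` are hypothetical names, not part of the original notebook.

```python
import pandas as pd

# Toy rows for illustration only; in the notebook this would be `df`.
toy = pd.DataFrame({
    "Mood": ["Happy", "Sad", "Happy"],
    "duration_ms": [180000, 240000, 150000],
})

# Store minutes in a separate column so the original milliseconds stay intact.
toy["duration_min"] = toy["duration_ms"] / 60_000

# One independent DataFrame per mood; .copy() avoids SettingWithCopyWarning later.
by_mood = {mood: group.copy() for mood, group in toy.groupby("Mood")}
print(by_mood["Happy"])
```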

#### 1. Happiness / Happy

In [9]:
happy_df = df[(df.Mood == 'Happy')]
happy_df

Unnamed: 0,Mood,artist_name,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms
0,Happy,David Guetta,95,0.561,0.965,7,-3.673,0,0.0343,0.00383,0.000007,0.3710,0.304,128.040,175238
1,Happy,OneRepublic,95,0.704,0.797,0,-5.927,1,0.0475,0.08260,0.000745,0.0546,0.825,139.994,148486
2,Happy,Dua Lipa,94,0.671,0.845,11,-4.930,0,0.0480,0.02070,0.000000,0.3290,0.775,110.056,176579
3,Happy,The Weeknd,94,0.514,0.730,1,-5.934,1,0.0598,0.00146,0.000095,0.0897,0.334,171.005,200040
4,Happy,Harry Styles,92,0.548,0.816,0,-4.209,1,0.0465,0.12200,0.000000,0.3350,0.557,95.390,174000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,Happy,Coldplay,76,0.507,0.828,10,-6.023,1,0.0449,0.00711,0.000024,0.2610,0.489,178.032,211295
96,Happy,Justin Wellington,76,0.862,0.753,5,-5.356,1,0.0625,0.13100,0.000002,0.0770,0.827,105.039,182857
97,Happy,Kygo,75,0.750,0.797,0,-4.826,1,0.1180,0.29300,0.000000,0.3920,0.523,105.949,215203
98,Happy,Nathan Evans,75,0.722,0.893,0,-3.255,0,0.0475,0.04410,0.000937,0.0673,0.439,119.932,116750


##### 1.1 Unbounded variables (duration_ms, loudness and tempo)

Outlier and distribution analysis of 'duration_ms', 'loudness' and 'tempo' for the happiness mood: 'Happy'.

In [10]:
happy_df.columns

Index(['Mood', 'artist_name', 'popularity', 'danceability', 'energy', 'key',
       'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness',
       'liveness', 'valence', 'tempo', 'duration_ms'],
      dtype='object')

In [11]:
# Convert ms to minutes (1 min = 60,000 ms). assign() returns a new DataFrame,
# which avoids the SettingWithCopyWarning raised by assigning into a slice of df.
# Note: the column keeps the name 'duration_ms' but now holds minutes.
happy_df = happy_df.assign(duration_ms=happy_df['duration_ms'] / 60000)



In [12]:
cols = ['duration_ms', 'loudness', 'tempo']

fig = make_subplots(rows=1, cols=3, subplot_titles=cols)

for i, col in enumerate(cols):
    fig.add_trace(go.Box(y=happy_df[col], name=col), row=1, col=i+1)

fig.update_layout(
    title="Distribution of variables:",
    height=400,
    width=900
)

fig.show()

How many outliers does each variable have? This helps decide whether to remove them.

In [66]:
# Count the outliers in each unbounded variable using the 1.5 * IQR rule
cols = ['duration_ms', 'loudness', 'tempo']
def outliers1(df):
    for col in cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        outlierIzq = Q1-(1.5*IQR)  # lower fence
        outlierDer = Q3+(1.5*IQR)  # upper fence
        outliers = df[col][(df[col] < outlierIzq) | (df[col] > outlierDer)]
        print(f"Variable {col}: {len(outliers)} outlier(s)")


outliers1(happy_df)

Variable duration_ms: 5 outlier(s)
Variable loudness: 3 outlier(s)
Variable tempo: 8 outlier(s)


Duration has one outlier (the maximum, about 5.7 min) that could distort the analysis the most. The remaining outliers do not seem to need removal.
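
To decide which outliers to drop, it helps to see the values themselves and not only their count. The sketch below is a variant of the same 1.5 * IQR rule that returns the offending values; `iqr_outliers` is a hypothetical helper and the durations are made up for illustration.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values of `series` outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]


# Made-up durations in minutes, for illustration only.
durations = pd.Series([2.5, 2.8, 3.0, 3.1, 3.3, 5.7])
found = iqr_outliers(durations)
print(found)
```

In the notebook, a call such as `iqr_outliers(happy_df['duration_ms'])` would show the individual tracks behind each count.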

In [14]:
happy_df[['duration_ms','loudness','tempo']].describe()

Unnamed: 0,duration_ms,loudness,tempo
count,100.0,100.0,100.0
mean,3.043904,-5.29973,120.25705
std,0.600838,1.709559,21.574673
min,1.829167,-10.778,79.994
25%,2.713467,-6.39275,106.768
50%,3.013933,-4.875,119.949
75%,3.305417,-4.1095,127.845
max,5.710217,-2.392,182.162


The songs in the playlist that represents happiness:  
- **Duration**: They last 3 min on average. Half of the tracks fall between about 2.7 and 3.3 min. A few outliers last more than 4 min.
- **Loudness**: On average, happy songs sit around -5 dB, with outliers near -10 dB.
- **Tempo**: Happy songs average 120 beats per minute (BPM), with outliers above roughly 160 BPM.

##### 1.2 Bounded variables

In [15]:
happy_df.columns

Index(['Mood', 'artist_name', 'popularity', 'danceability', 'energy', 'key',
       'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness',
       'liveness', 'valence', 'tempo', 'duration_ms'],
      dtype='object')

From the remaining variables, a selection of the most interesting ones is analysed:

In [16]:
cols = ['popularity','danceability', 'energy','valence', 'speechiness','instrumentalness']

fig = make_subplots(rows=2, cols=3, subplot_titles=cols)

for i, col in enumerate(cols):
    fig.add_trace(go.Histogram(y=happy_df[col], name=col), row=(i // 3) + 1, col=(i % 3) + 1)

fig.update_layout(
    title="Histograms:",
    height=600,
    width=900
)

fig.show()

According to the playlist Spotify has generated, happy songs:  
- Are generally quite popular (roughly 80 out of 100 on average).  
- Are mostly danceable, since the values lean towards 1.  
- Are energetic.  
- Are mostly cheerful in terms of valence, although some tracks seem to have very low valence. This is somewhat surprising, because valence measures how cheerful a song sounds.  
- Are not spoken or rapped.  
- Almost all have vocals; they are not purely instrumental.
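
The averages read off the histograms can be checked numerically with a grouped mean. The sketch below runs on a few made-up rows; the values are illustrative and are not results from the dataset.

```python
import pandas as pd

# Toy rows for illustration only; in the notebook this would be `df`.
toy = pd.DataFrame({
    "Mood": ["Happy", "Happy", "Sad", "Sad"],
    "popularity": [90, 70, 80, 60],
    "valence": [0.8, 0.4, 0.2, 0.3],
})

features = ["popularity", "valence"]

# One row per mood, one column per feature, each cell holding the group mean.
mood_means = toy.groupby("Mood")[features].mean().round(2)
print(mood_means)
```

Applied to `df` with the six bounded columns, this gives one comparable table for all moods.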

#### 2. Sadness / Sad

In [17]:
sad_df = df[(df.Mood == 'Sad')]
sad_df

Unnamed: 0,Mood,artist_name,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms
100,Sad,Alec Benjamin,90,0.652,0.557,1,-5.714,0,0.0318,0.740,0.000000,0.1240,0.483,150.073,169354
101,Sad,Coldplay,89,0.557,0.442,5,-7.224,1,0.0243,0.731,0.000015,0.1100,0.213,146.277,309600
102,Sad,Ed Sheeran,89,0.614,0.379,4,-10.480,1,0.0476,0.607,0.000464,0.0986,0.201,107.989,258987
103,Sad,Sam Smith,88,0.681,0.372,5,-8.237,1,0.0432,0.640,0.000000,0.1690,0.476,91.873,201000
104,Sad,Olivia Rodrigo,87,0.369,0.272,9,-10.497,1,0.0364,0.866,0.000000,0.1470,0.218,172.929,152667
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
180,Sad,Noah Cyrus,60,0.618,0.428,7,-8.500,0,0.0380,0.592,0.000000,0.1920,0.359,142.021,190183
181,Sad,Babi,60,0.622,0.432,1,-6.203,0,0.0575,0.544,0.075400,0.2900,0.362,126.790,207354
182,Sad,Tom Odell,60,0.616,0.273,5,-12.470,1,0.0307,0.960,0.001050,0.1650,0.314,128.040,182029
183,Sad,C. Tangana,60,0.612,0.198,10,-12.909,1,0.0363,0.686,0.000526,0.1250,0.175,72.948,112440


##### 2.1 Unbounded variables (duration_ms, loudness and tempo)

In [18]:
# Convert ms to minutes (1 min = 60,000 ms). assign() returns a new DataFrame,
# which avoids the SettingWithCopyWarning raised by assigning into a slice of df.
# Note: the column keeps the name 'duration_ms' but now holds minutes.
sad_df = sad_df.assign(duration_ms=sad_df['duration_ms'] / 60000)



In [19]:
cols = ['duration_ms', 'loudness', 'tempo']

fig = make_subplots(rows=1, cols=3, subplot_titles=cols)

for i, col in enumerate(cols):
    fig.add_trace(go.Box(y=sad_df[col], name=col), row=1, col=i+1)

fig.update_layout(
    title="Distribution of variables:",
    height=400,
    width=900
)

fig.show()

In [57]:
sad_df[['duration_ms','loudness','tempo']].describe()

Unnamed: 0,duration_ms,loudness,tempo
count,85.0,85.0,85.0
mean,3.561567,-8.926859,117.962824
std,0.716339,3.366766,32.949885
min,1.874,-23.023,69.754
25%,3.036017,-10.22,90.289
50%,3.456667,-8.295,114.441
75%,4.002217,-6.886,139.644
max,5.383783,-3.966,199.853


The songs in the playlist that represents sadness:  
- **Duration**: They last about 3.6 min on average. Half of the tracks fall between 3 and 4 min. There are no outliers.
- **Loudness**: On average, sad songs sit around -9 dB, and there are outliers below -14 dB.
- **Tempo**: Sad songs average 118 beats per minute (BPM), and half of the tracks fall between 90 and 140 BPM.

In [32]:
# Count the outliers in each unbounded variable using the 1.5 * IQR rule
cols = ['duration_ms', 'loudness', 'tempo']
def outliers1(df):
    for col in cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        outlierIzq = Q1-(1.5*IQR)  # lower fence
        outlierDer = Q3+(1.5*IQR)  # upper fence
        outliers = df[col][(df[col] < outlierIzq) | (df[col] > outlierDer)]
        print(f"Variable {col}: {len(outliers)} outlier(s)")


outliers1(sad_df)

Variable duration_ms: 0 outlier(s)
Variable loudness: 4 outlier(s)
Variable tempo: 0 outlier(s)


##### 2.2 Bounded variables

From the remaining variables, a selection of the most interesting ones is analysed:

In [21]:
cols = ['popularity','danceability', 'energy','valence', 'speechiness','instrumentalness']

fig = make_subplots(rows=2, cols=3, subplot_titles=cols)

for i, col in enumerate(cols):
    fig.add_trace(go.Histogram(y=sad_df[col], name=col), row=(i // 3) + 1, col=(i % 3) + 1)

fig.update_layout(
    title="Histograms:",
    height=600,
    width=900
)

fig.show()

According to the playlist Spotify has generated, sad songs:  
- Are generally quite popular, slightly less than the happy ones (74 out of 100 on average).  
- Are mostly not very danceable, with an average danceability of 0.5.  
- Are not energetic.  
- Have very low valence, so they are indeed sad songs.
- Are not spoken or rapped.  
- Almost all have vocals; they are not purely instrumental.

#### 3. Anger

In [28]:
anger_df = df[(df.Mood == 'Anger')]
anger_df

Unnamed: 0,Mood,artist_name,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms
185,Anger,Muse,81,0.668,0.921,7,-3.727,1,0.0439,0.049200,0.005170,0.0877,0.782,120.000,212440
186,Anger,Alice Merton,48,0.630,0.884,5,-4.292,1,0.0489,0.001030,0.000102,0.2680,0.637,134.981,186933
187,Anger,Bring Me The Horizon,65,0.589,0.797,9,-5.464,1,0.1500,0.028900,0.000004,0.3830,0.232,102.489,291813
188,Anger,YONAKA,59,0.633,0.788,5,-4.075,0,0.1270,0.011100,0.000002,0.0582,0.412,134.130,161617
189,Anger,My Chemical Romance,84,0.463,0.857,4,-3.063,1,0.0632,0.050600,0.000000,0.1840,0.856,111.647,161920
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
245,Anger,Biffy Clyro,49,0.228,0.879,5,-4.326,1,0.0701,0.000141,0.000009,0.0484,0.352,145.418,248493
246,Anger,Connie Constance,39,0.459,0.707,1,-5.382,1,0.0930,0.000026,0.491000,0.2630,0.465,165.014,137840
247,Anger,Cold War Kids,52,0.450,0.947,6,-3.608,0,0.2070,0.002770,0.000270,0.2860,0.298,118.701,171227
248,Anger,The Linda Lindas,45,0.639,0.950,2,-3.313,1,0.0609,0.000039,0.494000,0.0605,0.898,150.050,155822


##### 3.1 Unbounded variables (duration_ms, loudness and tempo)

In [29]:
# Convert ms to minutes (1 min = 60,000 ms). assign() returns a new DataFrame,
# which avoids the SettingWithCopyWarning raised by assigning into a slice of df.
# Note: the column keeps the name 'duration_ms' but now holds minutes.
anger_df = anger_df.assign(duration_ms=anger_df['duration_ms'] / 60000)



In [30]:
cols = ['duration_ms', 'loudness', 'tempo']

fig = make_subplots(rows=1, cols=3, subplot_titles=cols)

for i, col in enumerate(cols):
    fig.add_trace(go.Box(y=anger_df[col], name=col), row=1, col=i+1)

fig.update_layout(
    title="Distribution of variables:",
    height=400,
    width=900
)

fig.show()

No outliers appear.

In [33]:
anger_df[['duration_ms','loudness','tempo']].describe()

Unnamed: 0,duration_ms,loudness,tempo
count,65.0,65.0,65.0
mean,3.232688,-4.943554,125.084046
std,0.758421,1.634156,28.380283
min,1.67445,-8.665,84.051
25%,2.726,-6.061,97.907
50%,3.128283,-4.852,122.133
75%,3.73845,-3.825,146.94
max,5.012,-1.789,194.992


The songs in the playlist that represents anger:  
- **Duration**: They last about 3.2 min on average. Half of the tracks fall between about 2.7 and 3.7 min.
- **Loudness**: On average, angry songs sit around -5 dB.
- **Tempo**: Angry songs average 125 beats per minute (BPM). Half of the tracks fall between about 98 and 147 BPM.

##### 3.2 Bounded variables

In [34]:
cols = ['popularity','danceability', 'energy','valence', 'speechiness','instrumentalness']

fig = make_subplots(rows=2, cols=3, subplot_titles=cols)

for i, col in enumerate(cols):
    fig.add_trace(go.Histogram(y=anger_df[col], name=col), row=(i // 3) + 1, col=(i % 3) + 1)

fig.update_layout(
    title="Histograms:",
    height=600,
    width=900
)

fig.show()

In [68]:
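# The dataset has no 'name' column and no track with energy == 0 (the overall minimum is 0.00954),
# so look up the least energetic track by artist instead.
anger_df.loc[anger_df['energy'].idxmin(), ['artist_name', 'energy']]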

In [37]:
anger_df['valence'].mean()

0.5421230769230769

According to the playlist Spotify has generated, angry songs:  
- Are generally not very popular (56 out of 100 on average).  
- Are moderately danceable, since the values are close to 0.5.  
- Are energetic.  
- Have intermediate valence (0.54 on average), between cheerful and sad.  
- Are not spoken or rapped.  
- Almost all have vocals; they are not purely instrumental.

#### 4. Fear

In [38]:
fear_df = df[(df.Mood == 'Fear')]
fear_df

Unnamed: 0,Mood,artist_name,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms
250,Fear,Mr.Kitty,87,0.585,0.5950,8,-10.444,1,0.0328,0.0696,0.266000,0.0837,0.0390,140.037,259147
251,Fear,salvia palth,84,0.529,0.3530,7,-12.835,1,0.0292,0.7880,0.853000,0.1160,0.0601,104.557,161463
252,Fear,Marie Madeleine,67,0.714,0.8970,6,-5.469,0,0.0376,0.0126,0.901000,0.0706,0.8890,119.979,325852
253,Fear,Teen Suicide,70,0.566,0.6130,5,-10.118,1,0.0874,0.7140,0.799000,0.0848,0.2190,110.886,165672
254,Fear,sign crushes motorist,53,0.545,0.0415,2,-32.030,1,0.0367,0.9880,0.884000,0.1110,0.1380,120.089,200000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
345,Fear,Runah,22,0.540,0.5380,9,-8.097,1,0.0375,0.3490,0.000858,0.1030,0.2520,120.070,248272
346,Fear,Vanishing Twin,26,0.634,0.6760,5,-10.473,1,0.0388,0.1400,0.726000,0.0937,0.6670,166.567,172893
347,Fear,Eleanor Collides,24,0.315,0.5450,7,-8.864,1,0.0273,0.0165,0.000660,0.6550,0.3290,78.906,254810
348,Fear,Cherry Glazerr,23,0.586,0.8540,9,-6.023,1,0.0632,0.0660,0.000070,0.0836,0.3280,119.988,210653


##### 4.1 Unbounded variables (duration_ms, loudness and tempo)

In [39]:
# Convert ms to minutes (1 min = 60,000 ms). assign() returns a new DataFrame,
# which avoids the SettingWithCopyWarning raised by assigning into a slice of df.
# Note: the column keeps the name 'duration_ms' but now holds minutes.
fear_df = fear_df.assign(duration_ms=fear_df['duration_ms'] / 60000)



In [41]:
cols = ['duration_ms', 'loudness', 'tempo']

fig = make_subplots(rows=1, cols=3, subplot_titles=cols)

for i, col in enumerate(cols):
    fig.add_trace(go.Box(y=fear_df[col], name=col), row=1, col=i+1)

fig.update_layout(
    title="Distribution of variables:",
    height=400,
    width=900
)

fig.show()

In [43]:
# Count the outliers in each unbounded variable using the 1.5 * IQR rule
cols = ['duration_ms', 'loudness', 'tempo']
def outliers1(df):
    for col in cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        outlierIzq = Q1-(1.5*IQR)  # lower fence
        outlierDer = Q3+(1.5*IQR)  # upper fence
        outliers = df[col][(df[col] < outlierIzq) | (df[col] > outlierDer)]
        print(f"Variable {col}: {len(outliers)} outlier(s)")


outliers1(fear_df)

Variable duration_ms: 3 outlier(s)
Variable loudness: 3 outlier(s)
Variable tempo: 1 outlier(s)


In [42]:
fear_df[['duration_ms','loudness','tempo']].describe()

Unnamed: 0,duration_ms,loudness,tempo
count,100.0,100.0,100.0
mean,4.007197,-10.05548,115.15842
std,1.248019,3.705552,29.68981
min,1.959117,-32.03,59.859
25%,3.068504,-11.47825,92.985
50%,3.896408,-9.6585,114.9715
75%,4.712921,-7.80575,131.23975
max,8.645783,-3.218,203.639


The songs in the playlist that represents fear:  
- **Duration**: They last 4 min on average. Half of the tracks fall between about 3 and 5 min. There are 3 outliers above 7 min, the longest at about 8.6 min.
- **Loudness**: On average, the songs sit around -10 dB, with outliers below -17 dB that reach -32 dB.
- **Tempo**: The songs meant to convey fear average 115 beats per minute (BPM), with one outlier at about 204 BPM.

##### 4.2 Bounded variables

In [44]:
cols = ['popularity','danceability', 'energy','valence', 'speechiness','instrumentalness']

fig = make_subplots(rows=2, cols=3, subplot_titles=cols)

for i, col in enumerate(cols):
    fig.add_trace(go.Histogram(y=fear_df[col], name=col), row=(i // 3) + 1, col=(i % 3) + 1)

fig.update_layout(
    title="Histograms:",
    height=600,
    width=900
)

fig.show()

In [48]:
fear_df['valence'].mean()

0.27324000000000004

According to the playlist Spotify has generated, these songs:  
- Are generally not popular (40 out of 100 on average).  
- Are moderately danceable, since the values are close to 0.5. 
- Are not very energetic (0.5 on average).  
- Have low valence (0.27 on average), so most of them come across as sad.  
- Are not spoken or rapped.  
- Are mixed in instrumentalness: several tracks are highly instrumental, while others have vocals.

<span style="color:red">**Since this playlist does not convey the feeling I was looking for, I am discarding it from the analysis.  
Spotify seems to have built it from the lyrics or titles of the songs, because the tracks do not convey fear; they sound sad instead.**</span>
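
Discarding the playlist amounts to filtering its rows out before modelling. A minimal sketch on made-up rows follows; `df_model` is a hypothetical name for the filtered frame.

```python
import pandas as pd

# Toy rows for illustration only; in the notebook this would be `df`.
toy = pd.DataFrame({
    "Mood": ["Happy", "Fear", "Sad", "Fear", "Focus"],
    "tempo": [120.0, 115.0, 118.0, 140.0, 94.0],
})

# Keep every mood except 'Fear' and renumber the rows from 0.
df_model = toy[toy["Mood"] != "Fear"].reset_index(drop=True)
print(df_model["Mood"].value_counts())
```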

#### 5. Concentration / Focus

In [56]:
focus_df = df[(df.Mood == 'Focus')]
focus_df

Unnamed: 0,Mood,artist_name,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms
350,Focus,Sun Of They,66,0.577,0.0721,2,-22.973,1,0.0317,0.981,0.898,0.111,0.2420,73.998,151917
351,Focus,Imala Zir,65,0.467,0.2430,11,-22.534,1,0.0332,0.842,0.947,0.115,0.0541,75.031,158000
352,Focus,Imber Sun,64,0.442,0.1120,2,-18.687,1,0.0415,0.985,0.919,0.111,0.1110,112.206,126986
353,Focus,Far & Beyond,64,0.509,0.2120,6,-23.342,1,0.0305,0.843,0.853,0.111,0.0711,84.000,143500
354,Focus,Josef Briem,64,0.346,0.0513,7,-20.528,1,0.0432,0.954,0.895,0.105,0.0580,129.445,183500
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
445,Focus,Dreams Ahead,52,0.461,0.0611,0,-20.555,1,0.0440,0.962,0.930,0.112,0.1330,120.815,195122
446,Focus,Agnes Lundh,51,0.597,0.1270,5,-17.192,0,0.0370,0.991,0.929,0.126,0.1350,70.051,152597
447,Focus,Joy Parade,51,0.389,0.1810,10,-18.820,1,0.0297,0.874,0.905,0.116,0.1110,73.869,163232
448,Focus,Tall Towers,51,0.487,0.1430,0,-17.674,1,0.0400,0.981,0.904,0.115,0.1600,73.548,143872


##### 5.1 Unbounded variables (duration_ms, loudness and tempo)

In [57]:
# Convert ms to minutes (1 min = 60,000 ms). assign() returns a new DataFrame,
# which avoids the SettingWithCopyWarning raised by assigning into a slice of df.
# Note: the column keeps the name 'duration_ms' but now holds minutes.
focus_df = focus_df.assign(duration_ms=focus_df['duration_ms'] / 60000)



In [58]:
cols = ['duration_ms', 'loudness', 'tempo']

fig = make_subplots(rows=1, cols=3, subplot_titles=cols)

for i, col in enumerate(cols):
    fig.add_trace(go.Box(y=focus_df[col], name=col), row=1, col=i+1)

fig.update_layout(
    title="Distribution of variables:",
    height=400,
    width=900
)

fig.show()

In [62]:
# Count the outliers in each unbounded variable using the 1.5 * IQR rule
cols = ['duration_ms', 'loudness', 'tempo']
def outliers1(df):
    for col in cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        outlierIzq = Q1-(1.5*IQR)  # lower fence
        outlierDer = Q3+(1.5*IQR)  # upper fence
        outliers = df[col][(df[col] < outlierIzq) | (df[col] > outlierDer)]
        print(f"Variable {col}: {len(outliers)} outlier(s)")


outliers1(focus_df)

Variable duration_ms: 1 outlier(s)
Variable loudness: 3 outlier(s)
Variable tempo: 0 outlier(s)


In [59]:
focus_df[['duration_ms','loudness','tempo']].describe()

Unnamed: 0,duration_ms,loudness,tempo
count,100.0,100.0,100.0
mean,2.552085,-19.79635,93.83175
std,0.396267,2.843988,27.285173
min,1.719633,-27.838,35.366
25%,2.286917,-21.71375,73.96275
50%,2.521958,-19.919,81.7265
75%,2.765713,-18.005,116.29425
max,3.716533,-11.221,149.941


The songs in the playlist that represents concentration:  
- **Duration**: They last about 2.6 min on average. Half of the tracks fall between about 2.3 and 2.8 min. There is 1 outlier, at about 3.7 min.
- **Loudness**: On average, focus tracks sit around -20 dB, with 3 outliers spread over both ends of the range (the minimum is -27.8 dB and the maximum -11.2 dB).
- **Tempo**: Focus tracks average 94 beats per minute (BPM), with a median of 82 BPM, and there are no outliers.

##### 5.2 Bounded variables

In [63]:
cols = ['popularity','danceability', 'energy','valence', 'speechiness','instrumentalness']

fig = make_subplots(rows=2, cols=3, subplot_titles=cols)

for i, col in enumerate(cols):
    fig.add_trace(go.Histogram(y=focus_df[col], name=col), row=(i // 3) + 1, col=(i % 3) + 1)

fig.update_layout(
    title="Histograms:",
    height=600,
    width=900
)

fig.show()

According to the playlist Spotify has generated, and judging by the rows displayed above, focus tracks:  
- Are moderately popular (between 51 and 66 out of 100 in the rows shown).  
- Are mostly not very danceable, with danceability around 0.5.  
- Are not energetic (energy below 0.25 in the rows shown).  
- Have very low valence (below 0.25 in the rows shown).
- Are not spoken or rapped.  
- Are almost entirely instrumental (instrumentalness around 0.9 in the rows shown).

# Encoding

# Correlations