<img src="https://www.politecnicos.com.br/img/075.jpg" alt="Grupo Turing" height="420" width="420">

# Data analysis - Spotify

By: Camilla de Oliveira Fonseca

This is an analysis of the "spotify.csv" dataset, proposed as a training mini-project by Grupo Turing.

### Dataset description
Taken from [this site](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/)

**Features:**

* **song_duration_ms:** _(int)_ The duration of the track in milliseconds.

* **key:** _(int)_ The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation
> E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.

* **audio_mode:** _(int)_ Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived.
> Major is represented by 1 and minor is 0.

* **time_signature:** _(int)_ An estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).

* **acousticness:** _(float)_ A confidence measure from 0.0 to 1.0 of whether the track is acoustic.
> 1.0 represents high confidence the track is acoustic.

* **danceability:** _(float)_ Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity.
> A value of 0.0 is least danceable and 1.0 is most danceable.

* **energy:** _(float)_ Energy is a measure from *0.0 to 1.0* and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale.

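* **instrumentalness:** _(float)_ Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal".
> The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.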

* **loudness:** _(float)_ The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude).
> Values typically range between -60 and 0 dB.

* **speechiness:** _(float)_ Speechiness detects the presence of spoken words in a track.
> The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.

* **audio_valence:** _(float)_ A measure from *0.0 to 1.0* describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

* **tempo:** _(float)_ The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

* **song_popularity:** _(int)_ Song rating by the Spotify audience (0 to 100 in this dataset).

* **liveness:** _(float)_ Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live.
> A value above 0.8 provides strong likelihood that the track is live.

In [1]:
# Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


%matplotlib inline

# Read the dataset
df = pd.read_csv("spotify.csv", index_col=0)  # index_col because the dataset already has an index column

# Exploring and preparing the data


## Data type and consistency problems

In [2]:
df.head(15)

Unnamed: 0,song_name,song_popularity,song_duration_ms,acousticness,danceability,energy,instrumentalness,key,liveness,loudness,audio_mode,speechiness,tempo,time_signature,audio_valence
0,Boulevard of Broken Dreams,73,262333,0.005520000000000001kg,0.496mol/L,0.682,2.94e-05,8.0,0.0589,-4.095,1,0.0294,167.06,4,0.474
1,In The End,66,216933,0.0103kg,0.542mol/L,0.853,0.0,3.0,0.108,-6.407,0,0.0498,105.256,4,0.37
2,Seven Nation Army,76,231733,0.00817kg,0.737mol/L,0.463,0.447,0.0,0.255,-7.8279999999999985,1,0.0792,123.881,4,0.324
3,By The Way,74,216933,0.0264kg,0.451mol/L,0.97,0.00355,0.0,0.102,-4.938,1,0.107,122.444,4,0.198
4,How You Remind Me,56,223826,0.000954kg,0.447mol/L,0.7659999999999999,0.0,10.0,0.113,-5.065,1,0.0313,172.011,4,0.574
5,Bring Me To Life,80,235893,0.00895kg,0.316mol/L,0.945,1.85e-06,4.0,0.396,-3.169,0,0.124,189.931,4,0.32
6,Last Resort,81,199893,0.000504kg,0.581mol/L,0.887,0.00111,4.0,0.268,-3.659,0,0.0624,90.578,4,0.724
7,Are You Gonna Be My Girl,76,213800,0.00148kg,0.613mol/L,0.953,0.000582,2.0,0.152,-3.435,1,0.0855,105.046,4,0.537
8,Mr. Brightside,80,222586,0.00108kg,0.33mol/L,0.936,0.0,1.0,0.0926,-3.66,1,0.0917,148.112,4,0.234
9,Sex on Fire,81,203346,0.00172kg,0.542mol/L,0.905,0.0104,9.0,0.136,-5.653,1,0.054,153.398,4,0.374


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18835 entries, 0 to 18834
Data columns (total 15 columns):
song_name           18835 non-null object
song_popularity     18835 non-null object
song_duration_ms    18835 non-null object
acousticness        18835 non-null object
danceability        18835 non-null object
energy              18835 non-null object
instrumentalness    18835 non-null object
key                 18835 non-null float64
liveness            18835 non-null object
loudness            18835 non-null object
audio_mode          18835 non-null object
speechiness         18835 non-null object
tempo               18835 non-null object
time_signature      18835 non-null object
audio_valence       18834 non-null float64
dtypes: float64(2), object(13)
memory usage: 2.3+ MB


We can see that the only feature stored with the correct data type is *audio_valence*.

In addition, the values of *acousticness* and *danceability* are saved with the suffixes "kg" and "mol/L", which make no sense and must be removed before the data can be converted to *float*.

There is also 1 null value in _audio_valence_.

In [4]:
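# Remove the "kg" and "mol/L" suffixes
# (str.replace with regex=False removes the literal text; str.strip would treat it as a set of characters)
df["acousticness"] = df["acousticness"].str.replace("kg", "", regex=False)
df["danceability"] = df["danceability"].str.replace("mol/L", "", regex=False)

# Drop rows with null values
df.dropna(inplace=True)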


When trying to convert the data to *float*, I found some values marked as "nao_sei" (Portuguese for "I don't know"). Calling unique() would be one way to check for invalid values like this one, but most features have a very large number of distinct values (since they are actually continuous), which makes that impractical.

Since the dataset is large, dropping the rows that contain 'nao_sei' will probably not cause problems.
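
One way to surface such placeholders without listing every distinct value is to attempt a numeric parse and look only at what fails. The sketch below uses a small made-up series in place of a real column; `pd.to_numeric` with `errors="coerce"` is an alternative to the approach used in this notebook, not what the notebook itself does.

```python
import pandas as pd

# Toy stand-in for one object-typed column of the dataset (values are made up).
energy = pd.Series(["0.682", "nao_sei", "0.853", "0.463", "nao_sei"], name="energy")

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
parsed = pd.to_numeric(energy, errors="coerce")

# Entries that were present in the raw column but failed to parse.
bad_values = energy[parsed.isna() & energy.notna()]

# Counting them shows which placeholder strings occur, and how often.
print(bad_values.value_counts())
```

Applied column by column, this would list only the offending strings, however many distinct numeric values the column holds.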

In [5]:
ftrs_str = ["song_name","song_popularity", "song_duration_ms", "acousticness", "danceability", "energy", "instrumentalness", "liveness", "loudness", "audio_mode", "speechiness", "tempo", "time_signature"]

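for ftr in ftrs_str:
    df = df[df[ftr] != 'nao_sei']  # drop rows holding the 'nao_sei' placeholder
    # Caution: str.strip removes any of the characters n, a, o, _, s, e, i from both
    # ends, which truncates song_name (e.g. "Boulevard of Broken Dream" in later outputs)
    df[ftr] = df[ftr].str.strip("nao_sei")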
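
The `str.strip("nao_sei")` call in the loop above deserves a closer look, because `str.strip` interprets its argument as a set of characters rather than as a literal substring. The sketch below uses illustrative song titles (not read from the file) to show the effect, and shows that the boolean filter alone is enough to remove the placeholder rows.

```python
import pandas as pd

# Illustrative titles, typed in by hand for this example.
names = pd.Series(["Boulevard of Broken Dreams", "no tears left to cry", "In The End"])

# Any run of the characters n, a, o, _, s, e, i at either end is removed.
stripped = names.str.strip("nao_sei")
print(stripped.tolist())

# Filtering with a boolean mask removes the placeholder and leaves other values intact.
col = pd.Series(["0.5", "nao_sei", "0.7"])
kept = col[col != "nao_sei"]
print(kept.tolist())
```

This matches the shortened names that appear in the later outputs of this notebook, such as "Boulevard of Broken Dream" and "tears left to cry".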

In [6]:
ftrs_unq = ["key", "audio_mode", "time_signature", "song_popularity"]

for ftr in ftrs_unq:
    print(df[ftr].unique())

[ 8.  3.  0. 10.  4.  2.  1.  9.  7. 11.  5.  6.]
['1' '0']
['4' '3' '1' '5' '2800000000' '0']
['73' '66' '76' '74' '56' '80' '81' '78' '63' '75' '69' '77' '71' '62'
 '79' '13' '28' '11' '65' '70' '60' '72' '57' '64' '61' '67' '94' '98'
 '59' '87' '85' '58' '92' '83' '44' '47' '54' '49' '52' '95' '45' '38'
 '46' '53' '39' '88' '68' '37' '43' '84' '40' '41' '10' '31' '48' '24'
 '29' '51' '4' '7' '50' '42' '30' '21' '55' '14' '33' '8' '16' '34' '26'
 '15' '19' '5' '3' '22' '36' '32' '35' '9' '82' '25' '86' '12' '18' '27'
 '17' '6' '20' '0' '90' '93' '91' '89' '99' '97' '96' '1' '2' '23' '100']


Among the features that can be explored with unique(), one very high and inconsistent *time_signature* value (2800000000) stands out.

In [7]:
df[df["time_signature"] == '2800000000']

Unnamed: 0,song_name,song_popularity,song_duration_ms,acousticness,danceability,energy,instrumentalness,key,liveness,loudness,audio_mode,speechiness,tempo,time_signature,audio_valence
149,What Makes You Beautiful,69,198053,0.00761,0.7290000000000001,0.7709999999999999,0.0,4.0,0.087,-2.451,1,0.0725,125.011,2800000000,0.873


Searching online, we find that the time signature of "What Makes You Beautiful" is 4, so we can fix the value.

In [8]:
df.loc[149,"time_signature"] = '4'

df.loc[149]

song_name           What Makes You Beautiful
song_popularity                           69
song_duration_ms                      198053
acousticness                         0.00761
danceability              0.7290000000000001
energy                    0.7709999999999999
instrumentalness                         0.0
key                                        4
liveness                               0.087
loudness                              -2.451
audio_mode                                 1
speechiness                           0.0725
tempo                                125.011
time_signature                             4
audio_valence                          0.873
Name: 149, dtype: object

I was also curious about the value 0 for this feature, since I did not know whether it made sense.

In [19]:
df[df["time_signature"] == '0']

Unnamed: 0,song_name,song_popularity,song_duration_ms,acousticness,danceability,energy,instrumentalness,key,liveness,loudness,audio_mode,speechiness,tempo,time_signature,audio_valence
7119,Gina Rodriguez - Latinos Trending Intr,0,12000,0.7,0.0,0.493,0.0,7,0.457,-6.102,1,0.0,0.0,0,0.0
11171,Aur,50,102536,0.0774,0.0,0.56,0.963,11,0.589,-9.866,1,0.0,0.0,0,0.0
18120,Imma B,61,257560,0.184,0.619,0.539,0.0,0,0.288,-6.9,1,0.387,145.618,0,0.424


Searching online, I saw that the notation "0" can be used for songs with free or irregular meter, which seems to be the case for the songs above.

Now we can convert the data to the correct types.

The features *audio_mode* and *time_signature* are categorical and, although the dataset description lists them as *int*, I chose to convert them to *category*.

In [14]:
ftrs_float = ["acousticness", "danceability", "energy", "instrumentalness", "liveness", "loudness", "speechiness", "tempo"]

ftrs_int = ["song_popularity", "song_duration_ms", "key"]

ftrs_cat = ["audio_mode", "time_signature"]

# Convert to float
for ftr in ftrs_float:
    df[ftr] = df[ftr].astype('float')

# Convert to int
for ftr in ftrs_int:
    df[ftr] = df[ftr].astype(np.int64)
    # .astype('int') failed with the message: Python int too large to convert to C long

# Convert to categorical
for ftr in ftrs_cat:
    df[ftr] = df[ftr].astype('category')
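
A plausible explanation for the "too large to convert to C long" error (an assumption, since the notebook does not say which platform it ran on) is that `'int'` resolved to a 32-bit integer, as it does on Windows with NumPy versions before 2.0, while the column holds a value outside the 32-bit range. The sketch below uses three duration strings taken from the outputs in this notebook and checks them against the `int32` limits.

```python
import numpy as np
import pandas as pd

# Durations as stored in the raw column: strings, one of them far out of range.
durations = pd.Series(["262333", "216933", "-18528908788662"])

# An explicit 64-bit type has room for every value here.
as_int64 = durations.astype(np.int64)

# Limits of a 32-bit signed integer, for comparison.
int32 = np.iinfo(np.int32)
out_of_range = as_int64[(as_int64 < int32.min) | (as_int64 > int32.max)]
print(out_of_range)
```

Requesting `np.int64` explicitly, as the cell above does, sidesteps the platform-dependent default.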

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18823 entries, 0 to 18834
Data columns (total 15 columns):
song_name           18823 non-null object
song_popularity     18823 non-null int64
song_duration_ms    18823 non-null int64
acousticness        18823 non-null float64
danceability        18823 non-null float64
energy              18823 non-null float64
instrumentalness    18823 non-null float64
key                 18823 non-null int64
liveness            18823 non-null float64
loudness            18823 non-null float64
audio_mode          18823 non-null category
speechiness         18823 non-null float64
tempo               18823 non-null float64
time_signature      18823 non-null category
audio_valence       18823 non-null float64
dtypes: category(2), float64(9), int64(3), object(1)
memory usage: 2.7+ MB


In [16]:
df.head()

Unnamed: 0,song_name,song_popularity,song_duration_ms,acousticness,danceability,energy,instrumentalness,key,liveness,loudness,audio_mode,speechiness,tempo,time_signature,audio_valence
0,Boulevard of Broken Dream,73,262333,0.00552,0.496,0.682,2.9e-05,8,0.0589,-4.095,1,0.0294,167.06,4,0.474
1,In The End,66,216933,0.0103,0.542,0.853,0.0,3,0.108,-6.407,0,0.0498,105.256,4,0.37
2,Seven Nation Army,76,231733,0.00817,0.737,0.463,0.447,0,0.255,-7.828,1,0.0792,123.881,4,0.324
3,By The Way,74,216933,0.0264,0.451,0.97,0.00355,0,0.102,-4.938,1,0.107,122.444,4,0.198
4,How You Remind M,56,223826,0.000954,0.447,0.766,0.0,10,0.113,-5.065,1,0.0313,172.011,4,0.574


In [17]:
df[ftrs_cat].describe()

Unnamed: 0,audio_mode,time_signature
count,18823,18823
unique,2,5
top,1,4
freq,11825,17744


We can see that the values of *time_signature* are **heavily** concentrated in 4: its frequency is about 94% (17744 of 18823 rows).
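
The share of each category can also be read directly with `value_counts(normalize=True)`. This sketch rebuilds a series from the counts in the `describe()` output above; the `"other"` label is a hypothetical placeholder that lumps together the remaining time signatures.

```python
import pandas as pd

# 17744 of the 18823 rows have time_signature '4'; the other 1079 rows are
# grouped under a hypothetical 'other' label for illustration only.
time_signature = pd.Series(["4"] * 17744 + ["other"] * 1079)

# normalize=True returns relative frequencies instead of raw counts.
shares = time_signature.value_counts(normalize=True)
print(shares.round(4))
```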

In [18]:
df.describe()

Unnamed: 0,song_popularity,song_duration_ms,acousticness,danceability,energy,instrumentalness,key,liveness,loudness,speechiness,tempo,audio_valence
count,18823.0,18823.0,18823.0,18823.0,18823.0,18823.0,18823.0,18823.0,18823.0,18823.0,18823.0,18823.0
mean,52.990331,-984157800.0,0.258529,0.633402,0.645004,0.078008,5.288902,0.179663,-7.447085,0.102091,121.07696,0.527949
std,21.908818,135053400000.0,0.288713,0.156693,0.214075,0.221567,3.614212,0.14402,3.826882,0.104393,28.713192,0.24461
min,0.0,-18528910000000.0,1e-06,0.0,0.00107,0.0,0.0,0.0109,-38.768,0.0,0.0,0.0
25%,40.0,184339.5,0.0241,0.533,0.51,0.0,2.0,0.0929,-9.044,0.0378,98.4155,0.335
50%,56.0,211306.0,0.132,0.645,0.674,1.1e-05,5.0,0.122,-6.555,0.0555,120.013,0.527
75%,69.0,242844.0,0.424,0.748,0.815,0.00257,8.0,0.221,-4.909,0.119,139.931,0.725
max,100.0,1799346.0,0.996,0.987,0.999,0.997,11.0,0.986,1.585,0.941,242.318,0.984


We see problems in _song_duration_ms_, since a song's duration cannot be negative, and in _loudness_, whose maximum should be 0 according to the description, yet the table shows 1.585.

In [20]:
df[df["song_duration_ms"] <= 0]

Unnamed: 0,song_name,song_popularity,song_duration_ms,acousticness,danceability,energy,instrumentalness,key,liveness,loudness,audio_mode,speechiness,tempo,time_signature,audio_valence
62,White B,59,-18528908788662,0.0983,0.83,0.497,0.0,1,0.0906,-6.94,1,0.052,127.965,4,0.634


In [21]:
df[df["loudness"] > 0]

Unnamed: 0,song_name,song_popularity,song_duration_ms,acousticness,danceability,energy,instrumentalness,key,liveness,loudness,audio_mode,speechiness,tempo,time_signature,audio_valence
2503,We Are Your Friends - Radio Edit,57,164960,0.0102,0.646,0.977,0.769,9,0.201,1.585,0,0.167,123.016,4,0.606
2819,Search and Destroy - Iggy Pop Mix,55,208133,0.00353,0.235,0.977,0.00604,6,0.172,0.878,1,0.107,152.952,4,0.241
6219,Polar,55,236240,0.00679,0.795,0.994,0.904,8,0.057,0.525,0,0.0483,126.023,4,0.865
7980,Gladiator - Instrumental,14,198000,0.0107,0.78,0.972,0.0296,6,0.1,0.119,0,0.253,137.056,4,0.958
8950,Wake Up (RIOT VIP),52,292915,0.0251,0.341,0.995,7e-05,7,0.713,0.052,1,0.626,160.311,4,0.0801
13006,We Are Your Friends - Justice Vs Sim,58,262773,0.0104,0.615,0.97,0.384,9,0.178,1.342,0,0.119,122.993,4,0.507
15623,Long Story Short - Bodybangers Remix Edit,33,206718,0.0161,0.762,0.977,1.1e-05,5,0.535,0.198,0,0.0934,128.012,4,0.616


Since there are only a few such rows and correcting the values would be complicated, I chose to remove them.

In [22]:
df = df[(df["song_duration_ms"] > 0) & (df["loudness"] <= 0)]
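
The same kind of filter could be generalized to several features at once. The following is a minimal sketch, assuming a hand-written `bounds` dictionary (a hypothetical helper, with limits taken from the feature descriptions at the top of this notebook) and a few made-up rows.

```python
import pandas as pd

# Hypothetical table of allowed ranges; None means no limit on that side.
bounds = {
    "song_duration_ms": (1, None),  # duration must be positive
    "loudness": (None, 0.0),        # typical range tops out at 0 dB
    "energy": (0.0, 1.0),
}

# Toy rows (made-up values).
sample = pd.DataFrame({
    "song_duration_ms": [262333, -5, 216933],
    "loudness": [-4.095, -6.94, 1.585],
    "energy": [0.682, 0.497, 0.977],
})

# Start with every row valid and narrow the mask one bound at a time.
valid = pd.Series(True, index=sample.index)
for column, (low, high) in bounds.items():
    if low is not None:
        valid &= sample[column] >= low
    if high is not None:
        valid &= sample[column] <= high

print(sample[~valid])  # rows that violate at least one bound
clean = sample[valid]
```

Printing the rejected rows before dropping them keeps the decision visible, in the same spirit as the inspection cells above.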

In [23]:
df.describe()

Unnamed: 0,song_popularity,song_duration_ms,acousticness,danceability,energy,instrumentalness,key,liveness,loudness,speechiness,tempo,audio_valence
count,18815.0,18815.0,18815.0,18815.0,18815.0,18815.0,18815.0,18815.0,18815.0,18815.0,18815.0,18815.0
mean,52.992506,218218.8,0.258629,0.633405,0.644887,0.07793,5.28844,0.179631,-7.450132,0.102057,121.071129,0.527934
std,21.911042,59887.33,0.288733,0.156668,0.21402,0.221461,3.614559,0.143969,3.824473,0.104337,28.716576,0.244596
min,0.0,12000.0,1e-06,0.0,0.00107,0.0,0.0,0.0109,-38.768,0.0,0.0,0.0
25%,40.0,184339.5,0.0242,0.533,0.51,0.0,2.0,0.0929,-9.045,0.0378,98.3745,0.335
50%,56.0,211306.0,0.133,0.645,0.674,1.1e-05,5.0,0.122,-6.557,0.0555,120.013,0.526
75%,69.0,242844.0,0.424,0.748,0.815,0.002555,8.0,0.221,-4.9105,0.119,139.931,0.725
max,100.0,1799346.0,0.996,0.987,0.999,0.997,11.0,0.986,-0.257,0.941,242.318,0.984


## Duplicate data

In [27]:
duplicates = df.duplicated(subset = "song_name", keep = False)
df[duplicates].sort_values(by = "song_name")

Unnamed: 0,song_name,song_popularity,song_duration_ms,acousticness,danceability,energy,instrumentalness,key,liveness,loudness,audio_mode,speechiness,tempo,time_signature,audio_valence
10651,"hate u, i love u (feat. olivia o'brien)",83,251033,0.68700,0.492,0.275,0.000000,6,0.1010,-13.400,0,0.3000,92.600,4,0.180
5892,"hate u, i love u (feat. olivia o'brien)",83,251033,0.68700,0.492,0.275,0.000000,6,0.1010,-13.400,0,0.3000,92.600,4,0.180
16507,tears left to cry,92,205920,0.04000,0.699,0.713,0.000003,9,0.2940,-5.507,0,0.0594,121.993,4,0.354
4644,tears left to cry,74,205946,0.03750,0.703,0.696,0.000006,0,0.2740,-5.482,1,0.0529,121.969,4,0.366
12489,tears left to cry,74,205946,0.03750,0.703,0.696,0.000006,0,0.2740,-5.482,1,0.0529,121.969,4,0.366
11319,tears left to cry,92,205920,0.04000,0.699,0.713,0.000003,9,0.2940,-5.507,0,0.0594,121.993,4,0.354
5614,tears left to cry,92,205920,0.04000,0.699,0.713,0.000003,9,0.2940,-5.507,0,0.0594,121.993,4,0.354
17260,'Till I Collap,85,297893,0.07570,0.572,0.853,0.000000,1,0.0798,-3.203,1,0.2170,171.297,4,0.102
13859,'Till I Collap,85,297893,0.07570,0.572,0.853,0.000000,1,0.0798,-3.203,1,0.2170,171.297,4,0.102
9732,'Till I Collap,85,297893,0.07570,0.572,0.853,0.000000,1,0.0798,-3.203,1,0.2170,171.297,4,0.102
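
The table above mixes two situations: rows that are identical in every column, and rows that share a name but differ in their features (the two variants of "tears left to cry", for instance). A possible way to treat them separately is sketched below on made-up rows; it is only one option, and the notebook does not state how the duplicates were handled.

```python
import pandas as pd

# Toy rows (made-up values): two exact copies, one same-name variant, one unique song.
sample = pd.DataFrame({
    "song_name": ["Song A", "Song A", "Song A", "Song B"],
    "song_popularity": [92, 92, 74, 85],
    "key": [9, 9, 0, 1],
})

# Rows identical in every column can be collapsed into one.
deduplicated = sample.drop_duplicates()

# Rows that share only the name may be different recordings, so list them for review.
same_name = deduplicated[deduplicated.duplicated(subset="song_name", keep=False)]
print(same_name)
```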


In [26]:
df[df["song_popularity"] < 1]

Unnamed: 0,song_name,song_popularity,song_duration_ms,acousticness,danceability,energy,instrumentalness,key,liveness,loudness,audio_mode,speechiness,tempo,time_signature,audio_valence
1078,New Age Girl,0,199087,0.105000,0.632,0.646,0.000000,2,0.0454,-7.200,1,0.0519,137.933,4,0.7770
1233,Transformer (feat. Nicki Minaj),0,196333,0.001510,0.753,0.616,0.000007,2,0.2910,-7.340,1,0.1650,156.830,4,0.2870
1758,Waka Wak,0,199801,0.154000,0.726,0.751,0.000000,1,0.1170,-6.758,1,0.1740,127.034,4,0.8590
1777,I See Fir,0,291718,0.700000,0.418,0.334,0.000000,10,0.2900,-7.380,0,0.0318,76.075,4,0.2290
1791,Game of Thr,0,168824,0.147000,0.337,0.791,0.952000,0,0.0970,-7.826,0,0.0642,173.899,3,0.2280
1792,Misty Mountains (A Cappella),0,260315,0.924000,0.294,0.158,0.096400,8,0.1090,-15.491,0,0.0325,99.750,4,0.0869
3077,Bubby's Cream,0,222213,0.768000,0.750,0.470,0.532000,8,0.1110,-4.381,1,0.2970,84.897,4,0.6000
3435,Coupe (feat. Rich The Kid),0,194632,0.052000,0.843,0.643,0.000000,11,0.0973,-8.420,1,0.2170,150.033,4,0.8510
3752,Space Program,0,156480,0.869000,0.589,0.604,0.000002,4,0.6840,-18.562,0,0.9410,111.968,4,0.1970
3756,Wake Up Everybody,0,266373,0.549000,0.626,0.447,0.000000,6,0.3520,-8.151,1,0.0554,93.908,4,0.6940
