# Music Dataset Analysis

In [1]:
%matplotlib inline

import os

import IPython.display as ipd
import numpy as np
import pandas as pd

import utils

pd.set_option('display.max_columns', None)

We load the datasets:

- tracks.csv: per track metadata such as ID, title, artist, genres, tags and play counts, for all 106,574 tracks.
- genres.csv: all 163 genres with name and parent (used to infer the genre hierarchy and top-level genres).
- features.csv: common features extracted with librosa.
- echonest.csv: audio features provided by Echonest (now Spotify) for a subset of 13,129 tracks.

In [2]:
tracks = utils.load('data/fma_metadata/tracks.csv')
genres = utils.load('data/fma_metadata/genres.csv')
features = utils.load('data/fma_metadata/features.csv')
echonest = utils.load('data/fma_metadata/echonest.csv')

MemoryError: Unable to allocate 421. MiB for an array with shape (518, 106574) and data type float64
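
The size in the error message follows directly from the array shape: 518 columns times 106,574 tracks at 8 bytes per float64 value. The sketch below reproduces that arithmetic and shows, on a toy frame, how casting to float32 halves the footprint. Whether float32 precision is acceptable for the real features is an assumption, and `toy` is an illustrative stand-in, not the actual dataset.

```python
import numpy as np
import pandas as pd

# Shape reported in the MemoryError: (518, 106574) with float64 values.
n_columns, n_tracks = 518, 106574
mib_float64 = n_columns * n_tracks * 8 / 2**20
mib_float32 = n_columns * n_tracks * 4 / 2**20
print(f'float64: {mib_float64:.0f} MiB, float32: {mib_float32:.0f} MiB')

# Toy stand-in for the feature table (hypothetical data).
toy = pd.DataFrame(np.ones((1000, 10)))
toy32 = toy.astype(np.float32)

# Ratio of memory used after and before the cast.
ratio = toy32.memory_usage(index=False).sum() / toy.memory_usage(index=False).sum()
print(ratio)
```

If the load still fails, restarting the kernel to free memory or loading only the needed columns are other options to consider.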

### Tracks analysis

The tracks dataset contains:

- track_id: unique identifier of each track
- Hierarchical information about the album, the artist and the track itself

In [None]:
tracks.describe()

In [None]:
print('Track-related columns: ')
display(tracks['track'].sample(3))

print('Album-related columns: ')
display(tracks['album'].sample(3))

print('Artist-related columns: ')
display(tracks['artist'].sample(3))

We use the data in the small subset (at least for the analysis).

This subset contains tracks from the 8 main genres, and the classes are balanced.

In [None]:
small = tracks[tracks['set', 'subset'] <= 'small']
small.shape
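
The comparison `<= 'small'` only makes sense if the subset column is an ordered categorical. The sketch below illustrates that behaviour on a toy frame, assuming the order small < medium < large; the toy data and the variable names are hypothetical.

```python
import pandas as pd

# Assumed ordering of the subsets (an ordered categorical dtype).
subset_order = pd.CategoricalDtype(['small', 'medium', 'large'], ordered=True)

# Toy frame with two-level column labels, similar in spirit to tracks.
toy = pd.DataFrame({
    ('set', 'subset'): pd.Series(['small', 'large', 'medium', 'small'],
                                 dtype=subset_order),
    ('track', 'title'): ['a', 'b', 'c', 'd'],
})

# Ordered categoricals support comparisons against one of their categories.
small_toy = toy[toy['set', 'subset'] <= 'small']
medium_toy = toy[toy['set', 'subset'] <= 'medium']
print(len(small_toy), len(medium_toy))
```

With this ordering, `<= 'medium'` selects both the small and the medium rows, so each larger subset contains the smaller ones.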

### Genre analysis

Genres are stored as a hierarchy. There are 16 top-level genres, but we keep only the 8 most used ones.

In [None]:
genres.sample(3)

In [None]:
genres.shape

In [None]:
print('{} top-level genres'.format(len(genres['top_level'].unique())))
genres.loc[genres['top_level'].unique()].sort_values('#tracks', ascending=False)
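
As a rough illustration of how a top-level genre can be inferred from parent links, the sketch below walks up a toy hierarchy. It assumes that root genres have `parent == 0`; the genre IDs and the `top_level` helper are hypothetical and not part of `utils`.

```python
import pandas as pd

# Toy hierarchy: Hardcore -> Punk -> Rock, and Jazz as a second root.
toy_genres = pd.DataFrame(
    {'title': ['Rock', 'Punk', 'Hardcore', 'Jazz'],
     'parent': [0, 1, 2, 0]},
    index=pd.Index([1, 2, 3, 4], name='genre_id'),
)

def top_level(genre_id, genres):
    """Follow parent links until reaching a genre whose parent is 0."""
    while genres.at[genre_id, 'parent'] != 0:
        genre_id = genres.at[genre_id, 'parent']
    return genre_id

toy_genres['top_level'] = [top_level(g, toy_genres) for g in toy_genres.index]
print(toy_genres)
```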

### Features analysis

The features were generated with the librosa library from MP3 excerpts of each track.

The generated features are:
- mfcc: Mel-frequency cepstral coefficients (MFCCs)

- chroma_cens: Computes the chroma variant “Chroma Energy Normalized” (CENS). CENS features are robust to dynamics, timbre and articulation, thus these are commonly used in audio matching and retrieval applications.

- tonnetz: Tonal centroid features (tonnetz). This representation projects chroma features onto a 6-dimensional basis representing the perfect fifth, minor third, and major third, each as two-dimensional coordinates.

- spectral_contrast: Each frame of a spectrogram S is divided into sub-bands. For each sub-band, the energy contrast is estimated by comparing the mean energy in the top quantile (peak energy) to that of the bottom quantile (valley energy). High contrast values generally correspond to clear, narrow-band signals, while low contrast values correspond to broad-band noise. 

- spectral_centroid: Each frame of a magnitude spectrogram is normalized and treated as a distribution over frequency bins, from which the mean (centroid) is extracted per frame.

- spectral_bandwidth: Compute p’th-order spectral bandwidth.

- spectral_rolloff: The roll-off frequency is defined for each frame as the center frequency for a spectrogram bin such that at least roll_percent (0.85 by default) of the energy of the spectrum in this frame is contained in this bin and the bins below. This can be used to, e.g., approximate the maximum (or minimum) frequency by setting roll_percent to a value close to 1 (or 0).

- rmse: Compute root-mean-square (RMS) value for each frame, either from the audio samples y or from a spectrogram S.

- zcr: Zero-crossing rate of an audio time series.

For more information on each feature: [Librosa features](https://librosa.org/doc/main/feature.html#)
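
To make two of the definitions above concrete, here is a minimal NumPy sketch of a per-frame spectral centroid and a zero-crossing rate. It is a simplified illustration on hand-made inputs, not librosa's implementation (which works frame by frame on real audio and has more options); the function names and toy arrays are assumptions.

```python
import numpy as np

def spectral_centroid(magnitudes, freqs):
    """Per-frame centroid; magnitudes has shape (n_bins, n_frames)."""
    # Normalize each frame so it behaves like a distribution over bins.
    weights = magnitudes / magnitudes.sum(axis=0, keepdims=True)
    # The centroid is the mean frequency under that distribution.
    return (freqs[:, None] * weights).sum(axis=0)

def zero_crossing_rate(signal):
    """Fraction of consecutive sample pairs whose sign differs."""
    signs = np.signbit(signal)
    return np.mean(signs[1:] != signs[:-1])

freqs = np.array([0.0, 100.0, 200.0, 300.0])
magnitudes = np.array([[0.0, 1.0],
                       [1.0, 1.0],
                       [0.0, 1.0],
                       [0.0, 1.0]])

# Frame 0 has all its energy at 100 Hz; frame 1 is flat across the bins.
centroids = spectral_centroid(magnitudes, freqs)
zcr = zero_crossing_rate(np.array([1.0, -1.0, 1.0, -1.0, 1.0]))
print(centroids, zcr)
```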


For each feature the following statistics are computed:
- kurtosis
- max
- mean
- median
- min
- skew
- std
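
The sketch below shows how these seven statistics could be computed over the frames of one feature with pandas. The input is synthetic random data standing in for per-frame values, and pandas' estimators (for example the sample standard deviation and bias-corrected kurtosis) may differ slightly from the ones used to build features.csv.

```python
import numpy as np
import pandas as pd

# Synthetic per-frame values: 200 frames, 3 components (hypothetical data).
rng = np.random.default_rng(0)
frames = pd.DataFrame(rng.normal(size=(200, 3)), columns=['01', '02', '03'])

# One row per statistic, one column per component.
stats = frames.agg(['kurt', 'max', 'mean', 'median', 'min', 'skew', 'std'])
print(stats.round(2))
```

Each track then contributes one value per statistic and component, which explains the large number of columns in the features table.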


In [None]:
features.sample(3)

In [None]:
columns = ['mfcc', 'chroma_cens', 'tonnetz', 'spectral_contrast', 'spectral_centroid', 'spectral_bandwidth', 'spectral_rolloff', 'rmse', 'zcr']

for column in columns:
    print('Feature ' + column)
    display(features[column].head().style.format('{:.2f}'))

### Echonest analysis

Data extracted from the Echonest (now Spotify) API for each track_id.

The data hierarchy in this dataset is:
- metadata
- audio_features
- social_features
- ranks


In [None]:
echonest.sample(3)

In [None]:
print('Audio features')
display(echonest['echonest', 'audio_features'].sample(3))

print('Social features')
display(echonest['echonest', 'social_features'].sample(3))

print('Metadata')
display(echonest['echonest', 'metadata'].sample(3))

print('Ranks')
display(echonest['echonest', 'ranks'].sample(3))

# Data cleaning

We keep the small subset.
From the echonest, features and genres datasets we select only the rows whose tracks belong to this subset.

In [None]:
medium = tracks[tracks['set', 'subset'] <= 'medium']
medium.shape

In [None]:
clean_tracks_ids = tracks['track'].dropna(subset=['genre_top']).index

full_clean_tracks = tracks[tracks.index.isin(clean_tracks_ids)]
full_clean_tracks.shape

In [None]:
# Columns I want to keep
track_columns = ['date_created', 'duration', 'genre_top', 'title']

clean_tracks = full_clean_tracks['track'][track_columns]
clean_tracks

In [None]:
albums = full_clean_tracks['album'][['title', 'tracks']].rename(columns={'title': 'album', 'tracks': 'album_tracks'})

clean_tracks = pd.concat([clean_tracks, albums], axis=1, join='inner')
clean_tracks

In [None]:
artist = full_clean_tracks['artist'][['name', 'location']].rename(columns={'name': 'artist'})

clean_tracks = pd.concat([clean_tracks, artist], axis=1, join='inner')
clean_tracks

In [None]:
clean_tracks = pd.concat([clean_tracks, echonest['echonest', 'audio_features']], axis=1, join='inner')

clean_tracks
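
The `pd.concat(..., axis=1, join='inner')` calls above align rows on the index and keep only the track IDs present in every frame. A minimal sketch on toy data (the frames and values are hypothetical):

```python
import pandas as pd

left = pd.DataFrame({'title': ['a', 'b', 'c']},
                    index=pd.Index([2, 3, 5], name='track_id'))
right = pd.DataFrame({'tempo': [120.0, 90.0]},
                     index=pd.Index([3, 5], name='track_id'))

# Only track_id 3 and 5 appear in both frames, so track_id 2 is dropped.
joined = pd.concat([left, right], axis=1, join='inner')
print(joined)
```

This is why joining with the Echonest audio features reduces the number of tracks: only tracks that have Echonest data survive.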

In [None]:
clean_tracks.info()

In [None]:
clean_features = features[features.index.isin(clean_tracks.index)]
clean_features

In [None]:
np.testing.assert_array_equal(clean_features.index, clean_tracks.index)
assert clean_features.index.isin(clean_tracks.index).all()

In [None]:
clean_tracks.shape, clean_features.shape

In [None]:
print('{} tracks, {} genres'.format(
    len(clean_tracks), len(clean_tracks['genre_top'].unique())))
mean_duration = clean_tracks['duration'].mean()
print('track duration: {:.0f} days total, {:.0f} seconds average'.format(
    sum(clean_tracks['duration']) / 3600 / 24,
    mean_duration))

In [154]:
clean_tracks.genre_top = clean_tracks.genre_top.cat.remove_unused_categories()

In [3]:
# extract features (same lines as in the advanced classification code)


In [4]:
_features = ["chroma_cens", "chroma_cqt", "chroma_stft", "mfcc", "rmse", "spectral_bandwidth", "spectral_contrast", "spectral_rolloff", "tonnetz", "zcr"]
_fields = ["kurtosis", "mean", "std", "median", "max", "min"]

audio_features_df = clean_features
audio_features_df.head()

NameError: name 'clean_features' is not defined

In [None]:
# Flatten features
tracks_with_extra_audio_features_df = pd.DataFrame(index=audio_features_df.index)
tracks_with_extra_audio_features_df

In [None]:
## Initialize every flattened column with NaN, using the first row to discover the layout
for index, row in audio_features_df.head(1).iterrows():
    print(index)  # track id
    for feature in _features:
        for field in _fields:
            # One column per component of the feature (e.g. each MFCC coefficient)
            for i, _ in enumerate(row[feature][field], start=1):
                tracks_with_extra_audio_features_df[f'{feature}_{field[0:3]}_{i}'] = np.nan

In [None]:
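for index, row in audio_features_df.iterrows():
    for feature in _features:
        for field in _fields:
            # One value per component of the feature (e.g. each MFCC coefficient)
            for i, k in enumerate(row[feature][field], start=1):
                # Write into this track's row only; assigning to the whole column
                # would overwrite every track with the last value seen
                tracks_with_extra_audio_features_df.loc[index, f'{feature}_{field[0:3]}_{i}'] = k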
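
Iterating row by row can be slow on tens of thousands of tracks. As a possible alternative, the sketch below flattens the three-level column labels in one step by renaming them. It assumes the third level holds numeric strings such as '01', '02' that match the counter used in the loops; the toy frame is hypothetical.

```python
import pandas as pd

# Toy frame with (feature, statistics, number) column labels.
columns = pd.MultiIndex.from_tuples(
    [('mfcc', 'mean', '01'), ('mfcc', 'mean', '02'), ('zcr', 'std', '01')],
    names=['feature', 'statistics', 'number'])
toy = pd.DataFrame([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
                   index=pd.Index([2, 3], name='track_id'), columns=columns)

# Build names like 'mfcc_mea_1', following the same pattern as the loops.
flat = toy.copy()
flat.columns = [f'{feature}_{stat[:3]}_{int(number)}'
                for feature, stat, number in toy.columns]
print(flat)
```

To restrict the result to specific features or statistics, the columns can be filtered before renaming.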

In [None]:
clean_features = tracks_with_extra_audio_features_df

### Outlier removal

In [155]:
q1 = clean_tracks['duration'].quantile(0.25)
print(q1)
q2 = clean_tracks['duration'].quantile(0.5)
print(q2)
q3 = clean_tracks['duration'].quantile(0.75)
print(q3)

# Margin of 1.5 times the interquartile range beyond Q1 and Q3
margin = (q3 - q1) * 1.5

up_threshold = q3 + margin
low_threshold = q1 - margin

print(up_threshold)
print(low_threshold)


outlier_mask_up = clean_tracks['duration'] > up_threshold
outlier_mask_down = clean_tracks['duration'] < low_threshold
outlier_mask = np.logical_or(outlier_mask_up, outlier_mask_down)
not_outliers = np.logical_not(outlier_mask)


print("Data with outliers: ", clean_tracks.shape)
print("Data without outliers", clean_tracks[not_outliers].shape)
clean_tracks = clean_tracks[not_outliers]

153.0
210.0
283.0
478.0
-42.0
Data with outliers:  (9355, 16)
Data without outliers (8864, 16)
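
The same rule can be wrapped in a small helper so it can be reused for other columns. This is a sketch on toy durations; `iqr_bounds` and the sample values are hypothetical. With the quartiles printed above (153 and 283), the margin is 1.5 × 130 = 195, which gives the thresholds -42 and 478.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (low, high) thresholds at k times the IQR beyond Q1 and Q3."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    margin = k * (q3 - q1)
    return q1 - margin, q3 + margin

# Toy durations in seconds, with one obviously long track.
durations = pd.Series([100, 150, 200, 250, 300, 2000])
low, high = iqr_bounds(durations)

# between() is inclusive, so values exactly on a threshold are kept.
kept = durations[durations.between(low, high)]
print(low, high, kept.tolist())
```

Since the lower threshold is negative for durations, only the upper threshold removes tracks in practice.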


## Saving the clean datasets


In [156]:
import os

# Create the output directory if it does not exist yet
os.makedirs('clean_data', exist_ok=True)

clean_tracks.to_pickle('clean_data/track.pkl')
clean_features.to_pickle('clean_data/features.pkl')
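
One reason to prefer pickle over CSV here is that it keeps column dtypes, such as the categorical `genre_top`. The sketch below checks that with a round trip through a temporary directory; the toy frame and the path are hypothetical.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame(
    {'genre_top': pd.Categorical(['Rock', 'Pop']), 'duration': [168, 237]},
    index=pd.Index([2, 3], name='track_id'),
)

# Write to a temporary location and read the frame back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'track.pkl')
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

print(restored.dtypes)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.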

In [157]:
# Example of how to load a pickle file

unpickled_df = pd.read_pickle("clean_data/track.pkl")
unpickled_df

| track_id | date_created | duration | genre_top | title | album | album_tracks | artist | location | acousticness | danceability | energy | instrumentalness | liveness | speechiness | tempo | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 2008-11-26 01:48:12 | 168 | Hip-Hop | Food | AWOL - A Way Of Life | 7 | AWOL | New Jersey | 0.416675 | 0.675894 | 0.634476 | 1.062807e-02 | 0.177647 | 0.159310 | 165.922 | 0.576661 |
| 3 | 2008-11-26 01:48:14 | 237 | Hip-Hop | Electric Ave | AWOL - A Way Of Life | 7 | AWOL | New Jersey | 0.374408 | 0.528643 | 0.817461 | 1.851103e-03 | 0.105880 | 0.461818 | 126.957 | 0.269240 |
| 5 | 2008-11-26 01:48:20 | 206 | Hip-Hop | This World | AWOL - A Way Of Life | 7 | AWOL | New Jersey | 0.043567 | 0.745566 | 0.701470 | 6.967990e-04 | 0.373143 | 0.124595 | 100.260 | 0.621661 |
| 10 | 2008-11-25 17:49:06 | 161 | Pop | Freeway | Constant Hitmaker | 2 | Kurt Vile | | 0.951670 | 0.658179 | 0.924525 | 9.654270e-01 | 0.115474 | 0.032985 | 111.562 | 0.963590 |
| 134 | 2008-11-26 01:43:19 | 207 | Hip-Hop | Street Music | AWOL - A Way Of Life | 7 | AWOL | New Jersey | 0.452217 | 0.513238 | 0.560410 | 1.944269e-02 | 0.096567 | 0.525519 | 114.290 | 0.894072 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 124718 | 2015-09-08 20:57:58 | 202 | Hip-Hop | Rewind Feat Angelous | The Red Tape | -1 | K. Sparks | Queens, NY | 0.412194 | 0.686825 | 0.849309 | 6.000000e-10 | 0.867543 | 0.367315 | 96.104 | 0.692414 |
| 124719 | 2015-09-08 20:58:00 | 201 | Hip-Hop | Never Feat Tina Quallo | The Red Tape | -1 | K. Sparks | Queens, NY | 0.054973 | 0.617535 | 0.728567 | 7.215700e-06 | 0.131438 | 0.243130 | 96.262 | 0.399720 |
| 124720 | 2015-09-08 20:58:00 | 181 | Hip-Hop | Self Hatred | The Red Tape | -1 | K. Sparks | Queens, NY | 0.010478 | 0.652483 | 0.657498 | 7.098000e-07 | 0.701523 | 0.229174 | 94.885 | 0.432240 |
| 124721 | 2015-09-08 20:58:01 | 140 | Hip-Hop | Revenge Feat Nova | The Red Tape | -1 | K. Sparks | Queens, NY | 0.067906 | 0.432421 | 0.764508 | 1.625500e-06 | 0.104412 | 0.310553 | 171.329 | 0.580087 |
