# FMA: A Dataset For Music Analysis

Michaël Defferrard, Kirell Benzi, Pierre Vandergheynst, Xavier Bresson, EPFL LTS2.

## Analysis

All the numbers and figures that appear in the [paper], and much more.

[paper]: https://arxiv.org/abs/1612.01840

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Packages for later
import matplotlib.pyplot as plt
%matplotlib inline

import IPython.display as ipd
import seaborn as sns
from sklearn.preprocessing import MultiLabelBinarizer

sns.set_context("notebook", font_scale=1.5)
plt.rcParams['figure.figsize'] = (17, 5)

In [4]:
albums = pd.read_csv("../fma_metadata/raw_albums.csv", index_col=0)
albums.head()

Unnamed: 0_level_0,album_comments,album_date_created,album_date_released,album_engineer,album_favorites,album_handle,album_image_file,album_images,album_information,album_listens,album_producer,album_title,album_tracks,album_type,album_url,artist_name,artist_url,tags
album_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
1,0,11/26/2008 01:44:45 AM,1/05/2009,,4,AWOL_-_A_Way_Of_Life,https://freemusicarchive.org/file/images/album...,"[{'image_id': '1955', 'image_file': 'https://f...",<p></p>,6073,,AWOL - A Way Of Life,7,Album,http://freemusicarchive.org/music/AWOL/AWOL_-_...,AWOL,http://freemusicarchive.org/music/AWOL/,[]
100,0,11/26/2008 01:55:44 AM,1/09/2009,,0,On_Opaque_Things,https://freemusicarchive.org/file/images/album...,"[{'image_id': '4403', 'image_file': 'https://f...",,5613,,On Opaque Things,4,Album,http://freemusicarchive.org/music/Bird_Names/O...,Bird Names,http://freemusicarchive.org/music/Bird_Names/,[]
1000,0,12/04/2008 09:28:49 AM,10/26/2008,,0,DMBQ_Live_at_2008_Record_Fair_on_WFMU_Record_F...,https://freemusicarchive.org/file/images/album...,"[{'image_id': '31997', 'image_file': 'https://...",<p>http://blog.wfmu.org/freeform/2008/10/what-...,1092,,DMBQ Live at 2008 Record Fair on WFMU Record F...,4,Live Performance,http://freemusicarchive.org/music/DMBQ/DMBQ_Li...,DMBQ,http://freemusicarchive.org/music/DMBQ/,[]
10000,0,9/05/2011 04:42:57 PM,,,0,Live_at_CKUT_on_Montreal_Sessions_1434,https://freemusicarchive.org/file/images/album...,"[{'image_id': '12266', 'image_file': 'https://...",<p>Live Set on the Montreal Session February 2...,1001,,Live at CKUT on Montreal Sessions,1,Radio Program,http://freemusicarchive.org/music/Sundrips/Liv...,Sundrips,http://freemusicarchive.org/music/Sundrips/,[]
10001,0,9/06/2011 12:02:58 AM,1/01/2006,,0,Grounds_Dream_Cosmic_Love,https://freemusicarchive.org/file/images/album...,"[{'image_id': '24091', 'image_file': 'https://...","<p>Recorded in Linnavuori, Finland, 2005 (with...",504,,Ground's Dream Cosmic Love,1,Album,http://freemusicarchive.org/music/Uton/Grounds...,Uton,http://freemusicarchive.org/music/Uton/,[]


In [5]:
def missing_values_table(df):
    """Return the number and percentage of missing values per column."""
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * mis_val / len(df)
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    return mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})
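
As a quick sanity check, the helper can be tried on a small hand-made frame. The sketch below repeats the function so that it runs on its own; the `toy` frame and its column values are hypothetical and not taken from the dataset.

```python
import numpy as np
import pandas as pd

def missing_values_table(df):
    """Return the number and percentage of missing values per column."""
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * mis_val / len(df)
    table = pd.concat([mis_val, mis_val_percent], axis=1)
    return table.rename(columns={0: 'Missing Values', 1: '% of Total Values'})

# Hypothetical toy data: one complete column and one half-empty column.
toy = pd.DataFrame({
    'title': ['a', 'b', 'c', 'd'],
    'producer': ['x', np.nan, np.nan, 'y'],
})
table = missing_values_table(toy)
print(table)
```

Each row of the result corresponds to one column of the input, which is why the tables below list column names in their index.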

# Tracks dataset

In [6]:
tracks = pd.read_csv('../fma_metadata/tracks.csv', header=[0, 1], skipinitialspace=True, index_col=0)

In [7]:
# tracks data
print("The tracks data set is of size " + str(tracks.shape))
tracks.head()

The tracks data set is of size (106574, 52)


Unnamed: 0_level_0,album,album,album,album,album,album,album,album,album,album,...,track,track,track,track,track,track,track,track,track,track
Unnamed: 0_level_1,comments,date_created,date_released,engineer,favorites,id,information,listens,producer,tags,...,information,interest,language_code,license,listens,lyricist,number,publisher,tags,title
track_id,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
2,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],...,,4656,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1293,,3,,[],Food
3,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],...,,1470,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,514,,4,,[],Electric Ave
5,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],...,,1933,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1151,,6,,[],This World
10,0,2008-11-26 01:45:08,2008-02-06 00:00:00,,4,6,,47632,,[],...,,54881,en,Attribution-NonCommercial-NoDerivatives (aka M...,50135,,1,,[],Freeway
20,0,2008-11-26 01:45:05,2009-01-06 00:00:00,,2,4,"<p> ""spiritual songs"" from Nicky Cook</p>",2710,,[],...,,978,en,Attribution-NonCommercial-NoDerivatives (aka M...,361,,3,,[],Spiritual Level
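
A plain `pd.read_csv` leaves list-valued columns such as `tags` and `genres` as strings (for example `'[76, 103]'`), so calls like `len` on them count characters rather than items. The following is a minimal sketch of one way to parse them, assuming the cells hold Python list literals; the `raw` series is a hypothetical excerpt and track id 99 is made up.

```python
import ast
import pandas as pd

# Hypothetical excerpt: list-valued cells as they look after a plain read_csv.
raw = pd.Series(['[21]', '[76, 103]', '[]'], index=[2, 20, 99], name='genres')

# ast.literal_eval safely evaluates Python literals (here: lists of ints).
parsed = raw.map(ast.literal_eval)

# After parsing, len() counts genres instead of characters.
print(parsed.map(len).tolist())
```

`ast.literal_eval` only accepts literals, so it is a safer choice than `eval` for data read from a file.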


In [8]:
missing_values_table(tracks)

| | | Missing Values | % of Total Values |
|---|---|---|---|
| album | comments | 0 | 0.0 |
| album | date_created | 3529 | 3.311314 |
| album | date_released | 36280 | 34.042074 |
| album | engineer | 91279 | 85.64847 |
| album | favorites | 0 | 0.0 |
| album | id | 0 | 0.0 |
| album | information | 23425 | 21.980033 |
| album | listens | 0 | 0.0 |
| album | producer | 88514 | 83.054028 |
| album | tags | 0 | 0.0 |


# Genres dataset

In [9]:
genres = pd.read_csv('../fma_metadata/genres.csv', index_col=0)

In [27]:
genres.head()

| genre_id | #tracks | parent | title | top_level |
|---|---|---|---|---|
| 1 | 8693 | 38 | Avant-Garde | 38 |
| 2 | 5271 | 0 | International | 2 |
| 3 | 1752 | 0 | Blues | 3 |
| 4 | 4126 | 0 | Jazz | 4 |
| 5 | 4106 | 0 | Classical | 5 |


In [11]:
missing_values_table(genres)

| | Missing Values | % of Total Values |
|---|---|---|
| #tracks | 0 | 0.0 |
| parent | 0 | 0.0 |
| title | 0 | 0.0 |
| top_level | 0 | 0.0 |


# Features dataset

In [12]:
# features data
features = pd.read_csv('../fma_metadata/features.csv', header=[0, 1, 2], skipinitialspace=True, index_col=0)
print("The features data set is of size " + str(features.shape))
features.head()

The features data set is of size (106574, 518)


feature,chroma_cens,chroma_cens,chroma_cens,chroma_cens,chroma_cens,chroma_cens,chroma_cens,chroma_cens,chroma_cens,chroma_cens,...,tonnetz,tonnetz,tonnetz,zcr,zcr,zcr,zcr,zcr,zcr,zcr
statistics,kurtosis,kurtosis,kurtosis,kurtosis,kurtosis,kurtosis,kurtosis,kurtosis,kurtosis,kurtosis,...,std,std,std,kurtosis,max,mean,median,min,skew,std
number,01,02,03,04,05,06,07,08,09,10,...,04,05,06,01,01,01,01,01,01,01
track_id,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3,Unnamed: 11_level_3,Unnamed: 12_level_3,Unnamed: 13_level_3,Unnamed: 14_level_3,Unnamed: 15_level_3,Unnamed: 16_level_3,Unnamed: 17_level_3,Unnamed: 18_level_3,Unnamed: 19_level_3,Unnamed: 20_level_3,Unnamed: 21_level_3
2,7.180653,5.230309,0.249321,1.34762,1.482478,0.531371,1.481593,2.691455,0.866868,1.341231,...,0.054125,0.012226,0.012111,5.75889,0.459473,0.085629,0.071289,0.0,2.089872,0.061448
3,1.888963,0.760539,0.345297,2.295201,1.654031,0.067592,1.366848,1.054094,0.108103,0.619185,...,0.063831,0.014212,0.01774,2.824694,0.466309,0.084578,0.063965,0.0,1.716724,0.06933
5,0.527563,-0.077654,-0.27961,0.685883,1.93757,0.880839,-0.923192,-0.927232,0.666617,1.038546,...,0.04073,0.012691,0.014759,6.808415,0.375,0.053114,0.041504,0.0,2.193303,0.044861
10,3.702245,-0.291193,2.196742,-0.234449,1.367364,0.998411,1.770694,1.604566,0.521217,1.982386,...,0.074358,0.017952,0.013921,21.434212,0.452148,0.077515,0.071777,0.0,3.542325,0.0408
20,-0.193837,-0.198527,0.201546,0.258556,0.775204,0.084794,-0.289294,-0.81641,0.043851,-0.804761,...,0.095003,0.022492,0.021355,16.669037,0.469727,0.047225,0.040039,0.000977,3.189831,0.030993


In [13]:
missing_values_table(features)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Missing Values,% of Total Values
feature,statistics,number,Unnamed: 3_level_1,Unnamed: 4_level_1
chroma_cens,kurtosis,01,0,0.0
chroma_cens,kurtosis,02,0,0.0
chroma_cens,kurtosis,03,0,0.0
chroma_cens,kurtosis,04,0,0.0
chroma_cens,kurtosis,05,0,0.0
chroma_cens,kurtosis,06,0,0.0
chroma_cens,kurtosis,07,0,0.0
chroma_cens,kurtosis,08,0,0.0
chroma_cens,kurtosis,09,0,0.0
chroma_cens,kurtosis,10,0,0.0


In [14]:
# raw artists data
raw_artists = pd.read_csv('../fma_metadata/raw_artists.csv', index_col=0)
print("The raw artists data set is of size " + str(raw_artists.shape))
raw_artists.head()

The raw artists data set is of size (16916, 24)


Unnamed: 0_level_0,artist_active_year_begin,artist_active_year_end,artist_associated_labels,artist_bio,artist_comments,artist_contact,artist_date_created,artist_donation_url,artist_favorites,artist_flattr_name,...,artist_location,artist_longitude,artist_members,artist_name,artist_paypal_name,artist_related_projects,artist_url,artist_website,artist_wikipedia_page,tags
artist_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,2006.0,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,Brown Bum aka Choke,11/26/2008 01:42:32 AM,,9,,...,New Jersey,-74.405661,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,,The list of past projects is 2 long but every1...,http://freemusicarchive.org/music/AWOL/,http://www.AzillionRecords.blogspot.com,,['awol']
10,,,"Mistletone, Marriage Records","<p>""Lucky Dragons"" means any recorded or perfo...",3,Lukey Dargons,11/26/2008 01:43:35 AM,http://glaciersofnice.com/shop/,111,,...,"Los Angeles, CA",-118.243685,Luke Fischbeck\nSarah Rara,Lucky Dragons,,,http://freemusicarchive.org/music/Lucky_Dragons/,http://hawksandsparrows.org/,,['lucky dragons']
100,2004.0,,"Captcha Records (HBSP-2X), Pickled Egg (Europe)","<p><span style=""font-family:Verdana, Geneva, A...",1,Chris Kalis,11/26/2008 02:05:22 AM,,8,,...,"Chicago, IL",-87.629798,"Chris Kalis, Harry Brenner, Scott McGaughey, B...",Chandeliers,,"Killer Whales, \nMichael Columbia\nMandate\nMr...",http://freemusicarchive.org/music/Chandeliers/,thechandeliers.com,,['chandeliers']
1000,,,,"<p><a href=""http://marzipanmarzipan.com"">Marzi...",0,,12/04/2008 09:24:35 AM,,0,,...,,12.56738,,Marzipan Marzipan,,,http://freemusicarchive.org/music/Marzipan_Mar...,https://soundcloud.com/marzipanmarzipan,,[]
10000,,,,"<p><span style=""font-family:'Times New Roman',...",0,,1/21/2011 02:11:31 PM,,1,,...,,,Jack Hertz\nPHOBoS\nBlue Hell,"Jack Hertz, PHOBoS, Blue Hell",,,http://freemusicarchive.org/music/Jack_Hertz_P...,http://surrism.phonoethics.com/surrism-phonoet...,,['jack hertz phobos blue hell']


In [15]:
raw_artists.artist_comments

artist_id
1         0
10        3
100       1
1000      0
10000     0
10001     1
10002     0
10003     1
10004     0
10005     1
10010     1
10011     0
10012    28
10013     0
10014     0
10015     1
10016     1
10017     0
10018     1
10019     2
10020     0
10022     0
10023     0
10024     0
10026     1
10027     6
10028     0
10030     1
10031     3
10034     1
         ..
9955      0
9957      0
9958      0
9959      0
9962      0
9963      0
9965      0
9966      2
9967      0
9968      0
9973      0
9974      0
9975      0
9976      0
9977      1
9979      1
9981      0
9983      0
9984      0
9985      0
9986      1
9987      1
9988      2
9989      0
9990      1
9991      0
9993      0
9994      1
9995      0
9997      0
Name: artist_comments, Length: 16916, dtype: int64

In [16]:
missing_values_table(raw_artists)

| | Missing Values | % of Total Values |
|---|---|---|
| artist_active_year_begin | 15497 | 91.611492 |
| artist_active_year_end | 16446 | 97.221565 |
| artist_associated_labels | 15991 | 94.531804 |
| artist_bio | 10272 | 60.723575 |
| artist_comments | 0 | 0.0 |
| artist_contact | 15460 | 91.392764 |
| artist_date_created | 0 | 0.0 |
| artist_donation_url | 15332 | 90.636084 |
| artist_favorites | 0 | 0.0 |
| artist_flattr_name | 16844 | 99.574367 |


# Genres

In [23]:
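def clean_classes(x):
    """Map a comma-separated string of genre ids to one top-level genre id."""
    if "," in x:
        ids = x.split(",")  # local name chosen to avoid shadowing the `genres` DataFrame
        top_levels = set(converter[int(c)] for c in ids)
        # A track may span several top-level genres; keep an arbitrary one.
        return top_levels.pop()
    else:
        return converter[int(x)]

# Maps each genre_id to the id of its top-level genre.
converter = genres[["top_level"]].to_dict()["top_level"]
# Keep only digits and commas, so that e.g. "[76, 103]" becomes "76,103".
y = tracks[('track', 'genres')].str.replace("[^0-9,]", "", regex=True)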


track_id
2                  21
3                  21
5                  21
10                 10
20             76,103
26             76,103
30             76,103
46             76,103
48             76,103
134                21
135             45,58
136             45,58
137              1,32
138              1,32
139                17
140                17
141                17
142                17
144                 4
145                 4
146                 4
147                 4
148                 1
149                 1
150                 1
151                25
152                25
153                26
154                26
155                26
             ...     
155290    18,107,1235
155291    18,107,1235
155292    18,107,1235
155293    18,107,1235
155294    18,107,1235
155295    18,107,1235
155296    18,107,1235
155297    18,107,1235
155298         17,103
155299         17,103
155300         17,103
155301         17,103
155302         17,103
155303         17,103

In [26]:
# Tracks without any genre yield an empty string, which int() cannot parse; skip them.
y = y[y != ""].apply(clean_classes).to_frame()

In [25]:
# Run after the cell above: `y` must be a DataFrame, since a Series has no `join` method.
y.columns = ['genres']
y = y.join(genres["title"], on=["genres"])
y.head()

In [21]:
converter = genres[["top_level"]].to_dict()["top_level"]
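
The mapping from a track's genre string to top-level genres can be illustrated without pandas. The sketch below is an alternative to `clean_classes`, not the notebook's method: it returns every distinct top-level genre instead of an arbitrary one, and it treats an empty string as "no genre". The `converter` excerpt uses the five rows shown by `genres.head()`; the function name `top_level_genres` is hypothetical.

```python
# genre_id -> top-level genre_id, copied from the rows shown by genres.head().
converter = {1: 38, 2: 2, 3: 3, 4: 4, 5: 5}

def top_level_genres(cell, converter):
    """Return the sorted top-level genre ids for a cell such as '1,4' or ''."""
    ids = [int(c) for c in cell.split(',') if c]  # skip empty fragments
    return sorted({converter[i] for i in ids})

print(top_level_genres('1,4', converter))
print(top_level_genres('', converter))
```

Returning a list keeps the result deterministic, whereas `set.pop()` gives no guarantee about which element is returned.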

# Echonest data

In [29]:
echonest = pd.read_csv('../fma_metadata/echonest.csv', index_col=0, header=[0, 1, 2])

In [33]:
missing_values_table(echonest.echonest.metadata)

Unnamed: 0,Missing Values,% of Total Values
album_date,10412,79.305355
album_name,10257,78.124762
artist_latitude,3359,25.584584
artist_location,3359,25.584584
artist_longitude,3359,25.584584
artist_name,0,0.0
release,0,0.0
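
The Echonest file has three header rows, so its columns form a three-level `MultiIndex`. The sketch below shows, on a tiny hand-built frame, how a partial key selects one group of columns. The numeric values and the album name are taken from the rows displayed in this notebook; the artist names are hypothetical.

```python
import pandas as pd

columns = pd.MultiIndex.from_tuples([
    ('echonest', 'audio_features', 'tempo'),
    ('echonest', 'audio_features', 'valence'),
    ('echonest', 'metadata', 'album_name'),
    ('echonest', 'metadata', 'artist_name'),
])
toy = pd.DataFrame(
    [[165.922, 0.576661, None, 'artist A'],            # hypothetical artist names
     [111.562, 0.96359, 'Constant Hitmaker', 'artist B']],
    index=pd.Index([2, 10], name='track_id'),
    columns=columns,
)

# A partial key drops the matched levels and keeps the remaining one.
# Attribute access (toy.echonest.metadata) is a shorthand for the same selection.
metadata = toy['echonest', 'metadata']
print(metadata.isnull().sum())
```

Passing the selected sub-frame to `missing_values_table` yields one row per remaining column, which is why the metadata table shows plain column names.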


In [52]:
print("The Echonest data set is of size " + str(echonest.shape))
echonest.head()

The Echonest data set is of size (13129, 249)


Unnamed: 0_level_0,echonest,echonest,echonest,echonest,echonest,echonest,echonest,echonest,echonest,echonest,echonest,echonest,echonest,echonest,echonest,echonest,echonest,echonest,echonest,echonest,echonest
Unnamed: 0_level_1,audio_features,audio_features,audio_features,audio_features,audio_features,audio_features,audio_features,audio_features,metadata,metadata,...,temporal_features,temporal_features,temporal_features,temporal_features,temporal_features,temporal_features,temporal_features,temporal_features,temporal_features,temporal_features
Unnamed: 0_level_2,acousticness,danceability,energy,instrumentalness,liveness,speechiness,tempo,valence,album_date,album_name,...,214,215,216,217,218,219,220,221,222,223
track_id,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3,Unnamed: 11_level_3,Unnamed: 12_level_3,Unnamed: 13_level_3,Unnamed: 14_level_3,Unnamed: 15_level_3,Unnamed: 16_level_3,Unnamed: 17_level_3,Unnamed: 18_level_3,Unnamed: 19_level_3,Unnamed: 20_level_3,Unnamed: 21_level_3
2,0.416675,0.675894,0.634476,0.010628,0.177647,0.15931,165.922,0.576661,,,...,-1.992303,6.805694,0.23307,0.19288,0.027455,0.06408,3.67696,3.61288,13.31669,262.929749
3,0.374408,0.528643,0.817461,0.001851,0.10588,0.461818,126.957,0.26924,,,...,-1.582331,8.889308,0.258464,0.220905,0.081368,0.06413,6.08277,6.01864,16.673548,325.581085
5,0.043567,0.745566,0.70147,0.000697,0.373143,0.124595,100.26,0.621661,,,...,-2.288358,11.527109,0.256821,0.23782,0.060122,0.06014,5.92649,5.86635,16.013849,356.755737
10,0.95167,0.658179,0.924525,0.965427,0.115474,0.032985,111.562,0.96359,2008-03-11,Constant Hitmaker,...,-3.662988,21.508228,0.283352,0.26707,0.125704,0.08082,8.41401,8.33319,21.317064,483.403809
134,0.452217,0.513238,0.56041,0.019443,0.096567,0.525519,114.29,0.894072,,,...,-1.452696,2.356398,0.234686,0.19955,0.149332,0.0644,11.26707,11.20267,26.45418,751.147705


In [46]:
missing_values_table(echonest)

Unnamed: 0,Missing Values,% of Total Values
"(echonest, audio_features)",1,0.007616
"('echonest', 'audio_features').1",1,0.007616
"('echonest', 'audio_features').2",1,0.007616
"('echonest', 'audio_features').3",1,0.007616
"('echonest', 'audio_features').4",1,0.007616
"('echonest', 'audio_features').5",1,0.007616
"('echonest', 'audio_features').6",1,0.007616
"('echonest', 'audio_features').7",1,0.007616
"(echonest, metadata)",10413,79.300891
"('echonest', 'metadata').1",10258,78.120478


In [48]:
raw_echonest = pd.read_csv('../fma_metadata/raw_echonest.csv', header=[0, 1], skipinitialspace=True, index_col=0)

# Albums

## 1 Size

Todo:
* When are tracks mostly added.
* Which tracks got deleted.

In [None]:
print('{} tracks, {} artists, {} albums, {} genres'.format(
    len(tracks), len(tracks['artist', 'id'].unique()),
    len(tracks['album', 'id'].unique()),
    sum(genres['#tracks'] > 0)))
mean_duration = tracks['track', 'duration'].mean()
print('track duration: {:.0f} days total, {:.0f} seconds average'.format(
    sum(tracks['track', 'duration']) / 3600 / 24,
    mean_duration))

In [None]:
# Samples per track: duration [s] x sampling rate (rounded to 44 kHz) x 2 channels (stereo).
dimensionality = mean_duration * 44000 * 2
print('sample dimensionality: {:.1e}'.format(dimensionality))
print('total size, i.e. number of audio samples: {:.1e}'.format(dimensionality * len(tracks)))
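
The dimensionality above is simply duration times sampling rate times number of channels. As a worked example, the sketch below applies the same formula to a single 30-second clip, assuming a 44,100 Hz sampling rate and stereo audio; both constants and the helper `n_samples` are assumptions made for illustration.

```python
SAMPLE_RATE = 44100  # Hz, assumed CD-quality audio (the cell above rounds to 44000)
CHANNELS = 2         # assumed stereo

def n_samples(seconds, sample_rate=SAMPLE_RATE, channels=CHANNELS):
    """Number of audio samples in a clip of the given duration."""
    return int(seconds * sample_rate * channels)

clip_seconds = 30
print('{:.1e} samples per {} s clip'.format(n_samples(clip_seconds), clip_seconds))
```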

In [None]:
for subset in tracks['set', 'subset'].unique():
    indicator = tracks['set', 'subset'] <= subset
    print('{:6} {:6} tracks  {:.1f} days'.format(
        subset, sum(indicator), sum(indicator) * 30 / 3600 / 24))
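
The comparison `tracks['set', 'subset'] <= subset` is only meaningful if the column is an ordered categorical (small < medium < large). After a plain `pd.read_csv` the column holds strings, which compare alphabetically. The sketch below shows the difference on hypothetical toy data and one possible conversion.

```python
import pandas as pd

SUBSETS = ['small', 'medium', 'large']
subset = pd.Series(['large', 'small', 'medium', 'small', 'small'])  # toy data

# Strings compare alphabetically: 'large' <= 'medium' is True, 'small' <= 'medium' is False.
as_strings = int((subset <= 'medium').sum())

# An ordered categorical compares by position in SUBSETS instead.
ordered = subset.astype(pd.CategoricalDtype(categories=SUBSETS, ordered=True))
as_ordered = int((ordered <= 'medium').sum())

print(as_strings, as_ordered)
```

With the ordered dtype, `<= 'medium'` selects the small and medium tracks, which matches the nesting of the subsets.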

In [None]:
print('{} deleted tracks (largest track_id is {})'.format(tracks.index.max() - len(tracks), tracks.index.max()))
print('First track: {}'.format(tracks['track', 'date_created'].min()))

d = pd.DataFrame(tracks.index, index=tracks['track', 'date_created'].values)
d['indicator'] = 1

fig, ax1 = plt.subplots()
ax2 = ax1.twinx()

d['track_id'].plot(ax=ax1)
d['indicator'].cumsum().plot(ax=ax1)
ax1.set_ylabel('#tracks')
ax1.set_ylim(0, 160000)

(d['indicator'] * -100).plot(ax=ax2, style='r')  # needed for no apparent reason
color = sns.color_palette('deep', 3)[2]
d['indicator'].resample('2M').sum().fillna(0).plot(ax=ax2, style='--', color=color)
ax2.set_ylabel('#tracks added')
ax2.set_ylim(0, 4000)
ax2.grid(False)

lns = ax1.get_lines() + [ax2.get_lines()[1]]
ax1.legend(lns, ['largest track id', '#tracks still present', '#tracks added per 2 months'], loc='lower right')

plt.savefig('growth.pdf')
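
The number of deleted tracks is estimated as the largest `track_id` minus the number of tracks still present, which assumes that ids were handed out consecutively starting at 1. A minimal sketch on the first five ids shown by `tracks.head()`:

```python
import pandas as pd

# The first five track ids displayed by tracks.head().
track_ids = pd.Index([2, 3, 5, 10, 20], name='track_id')

# Ids 1..20 were assigned (assumption), but only 5 of them remain.
n_deleted = int(track_ids.max()) - len(track_ids)

# Running count of tracks still present, as plotted against the largest id.
present = pd.Series(1, index=track_ids).cumsum()

print(n_deleted)
print(present)
```

The gap between the two curves in the growth figure is this difference evaluated over time.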

### 1.1 Splits

In [None]:
SPLITS = ['training', 'validation', 'test']
SUBSETS = ['small', 'medium', 'large']
print('subset    #train    #val   #test  val_ratio test_ratio')
for subset in SUBSETS:
    counts = [sum((tracks['set', 'split'] == split) & (tracks['set', 'subset'] <= subset)) for split in SPLITS]
    ratios = counts[0] / np.array(counts[1:])
    print('{:8s} {:7d} {:7d} {:7d} {:8.2f} {:9.2f}'.format(subset, *counts, *ratios))
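
The two ratios are the training count divided by the validation and test counts. Dividing an integer by a Python list raises a `TypeError`, so the divisor has to be a NumPy array. A minimal sketch with hypothetical counts:

```python
import numpy as np

counts = [800, 100, 100]  # hypothetical #train, #val, #test

# Element-wise division: one ratio per held-out split.
ratios = counts[0] / np.array(counts[1:])

print('{:7d} {:7d} {:7d} {:8.2f} {:9.2f}'.format(*counts, *ratios))
```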

In [None]:
for subset in ['small', 'medium']:
    subset = tracks['set', 'subset'] <= subset

    d = genres.reset_index().set_index('title')
    d = d.loc[tracks.loc[subset, ('track', 'genre_top')].unique()]

    for split in SPLITS:
        b = tracks['set', 'split'] == split
        d['#' + split] = tracks.loc[subset & b, ('track', 'genre_top')].value_counts()

    d['val_ratio'] = d['#training'] / d['#validation']
    d['test_ratio'] = d['#training'] / d['#test']

    ipd.display(d.sort_values('#training', ascending=False))

In [None]:
d = pd.DataFrame(index=genres.index, columns=SPLITS)
for genre in genres.index:
    b = tracks['track', 'genres_all'].map(lambda genres: genre in genres)
    d.loc[genre] = tracks.loc[b, ('set', 'split')].value_counts()
d['val_ratio'] = d['training'] / d['validation']
d['test_ratio'] = d['training'] / d['test']
d.sort_values('training', ascending=False, inplace=True)
ipd.display(d.head(10))
ipd.display(d.tail(10))

## 2 Metadata

In [None]:
def isnull(column, df=tracks):
    if column[1] in ['tags', 'genres', 'genres_all']:
        return df[column].apply(lambda x: len(x) == 0)
    elif pd.api.types.is_integer_dtype(df[column]):
        return df[column] <= 0
    else:
        return df[column].isnull()

def count(series):
    col0 = series.name[0]
    df = tracks if col0 == 'track' else tracks.drop_duplicates((col0, 'id'))
    n = (~isnull(series.name, df)).sum()
    p = n / len(df) * 100
    return n, p

# Columns / metadata usage across dataset.
d = pd.DataFrame(index=tracks.columns.drop('set'), columns=['n', 'p'])
d = d.apply(count, axis=1)
d['n'] = d['n'].astype(int)
d

In [None]:
# Excerpt as example in the paper.
columns = [
    ('track', 'title'),
    ('track', 'genres_all'),
    ('track', 'genre_top'),
    ('track', 'duration'),
    ('track', 'listens'),
    ('album', 'title'),
    ('album', 'listens'),
    ('album', 'tags'),
    ('artist', 'name'),
    ('artist', 'location'),
]

non_null = ~isnull(columns[0])
for column in columns[1:]:
    non_null &= ~isnull(column)
tids = np.random.RandomState(42).permutation(tracks.index[non_null])[:8]
tracks.loc[tids, columns].head() #.to_latex()

In [None]:
tracks['track', 'license'].value_counts().head(10)

In [None]:
tracks['track', 'language_code'].value_counts().head(10)

### 2.1 Technical data

In [None]:
durations = tracks['track', 'duration']
plt.figure(figsize=(10, 4))
p = sns.distplot(durations[durations.values < 800], kde=False, rug=False, color='k', hist_kws=dict(alpha=0.4))
p.set_xlabel('duration [seconds]')
p.set_ylabel('#tracks')
p.set_xlim(0, 800)
plt.tight_layout()
plt.savefig('duration_distribution.pdf')

durations.describe()

In [None]:
# Uncommon bit rates are VBR encodings.
print('Common bit rates: {}'.format(tracks['track', 'bit_rate'].value_counts().head(5).index.tolist()))
print('Average bit rate: {:.0f} kbit/s'.format(tracks['track', 'bit_rate'].mean()/1000))
p = sns.distplot(tracks['track', 'bit_rate'], kde=False, rug=False)
p.set_xlabel('bit rate')
p.set_ylabel('#tracks');

### 2.2 User data

In [None]:
# Tags.
fig, ax1 = plt.subplots(figsize=(10, 4))
ax2 = ax1.twinx()

d = tracks['track', 'tags'].apply(len)
print(d.min(), d.max())
d = d[(d.values >= 1) & (d.values <= 16)]
color = sns.color_palette('deep', 3)[2]
sns.distplot(d, kde=False, bins=d.max()-d.min(), label='track', ax=ax1, color=color)

d = tracks.drop_duplicates(('album', 'id'))
d = d['album', 'tags'].apply(len)
print(d.min(), d.max())
d = d[(d.values >= 1) & (d.values <= 16)]
sns.distplot(d, kde=False, bins=d.max()-d.min(), label='album', ax=ax2)

d = tracks.drop_duplicates(('artist', 'id'))
d = d['artist', 'tags'].apply(len) - 1
print(d.min(), d.max())
d = d[(d.values >= 1) & (d.values <= 16)]
sns.distplot(d, kde=False, bins=d.max()-d.min(), label='artist', ax=ax2)

ax1.set_xlabel('#tags')
ax1.set_ylabel('#tracks')
ax2.set_ylabel('#artists   /   #albums')
ax1.set_xlim(d.min(), d.max())
ax1.set_ylim(0, 5000)
ax2.set_ylim(0, 500)
ax1.legend(loc='upper center')
ax2.legend(loc='upper right')
ax2.grid(False)

plt.tight_layout()
plt.savefig('tag_distribution.pdf')

In [None]:
# One artist tag is often the artist name.
col = 'artist'
d = tracks.drop_duplicates((col, 'id'))
d.loc[d[col, 'tags'].apply(len) > 0, [('artist', 'name'), (col, 'tags')]].head()

In [None]:
# Listens, favorites, comments.

def plot(col0, col1, maxval, subplot=None):
    if col0 == 'track':
        d = tracks['track']
    if col0 in ['artist', 'album']:
        d = tracks[col0].drop_duplicates('id')
    if subplot:
        plt.subplot(subplot)
    d = d[col1]
    p = sns.distplot(d[d.values < maxval], kde=False, color='k', hist_kws=dict(alpha=0.4))
    p.set_xlim(-1, maxval)
    p.set_xlabel('#' + col1)
    p.set_ylabel('#' + col0 + 's')

plt.figure(figsize=(17, 10))
plot('track', 'listens', 10e3, 221)
plot('track', 'interest', 10e3, 222)
plot('track', 'favorites', 100, 223)
plot('track', 'comments', 20, 224)

plt.figure(figsize=(17, 10))
plot('album', 'listens', 100e3, 221)
plot('album', 'favorites', 100, 223)
plot('album', 'comments', 20, 224)

plt.figure(figsize=(17, 5))
plot('artist', 'favorites', 100, 121)
plot('artist', 'comments', 20, 122)

In [None]:
# Same as above, formatted for the paper.
plt.figure(figsize=(10, 4))
plot('album', 'listens', 40e3)
plt.tight_layout()
plt.savefig('listens_distribution.pdf')

tracks['album', 'listens'].max()

In [None]:
# Most listened albums.
tracks['album'].groupby('id').first().sort_values('listens', ascending=False).head(10)
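
Album attributes are repeated on every track of the album, so they have to be reduced to one row per album before ranking or summing. The sketch below uses the album ids and listen counts from the rows shown by `tracks.head()`; the frame name `album` is chosen for illustration.

```python
import pandas as pd

# Album id and listens for the five tracks shown by tracks.head().
album = pd.DataFrame({
    'id': [1, 1, 1, 6, 4],
    'listens': [6073, 6073, 6073, 47632, 2710],
})

# One row per album, most listened first.
per_album = album.groupby('id').first().sort_values('listens', ascending=False)
print(per_album)

# Summing the per-track rows would count album 1 three times.
print(album['listens'].sum(), per_album['listens'].sum())
```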

### 2.3 Dates

In [None]:
def plot(col0, col1):
    if col0 == 'track':
        d = tracks['track']
    if col0 in ['artist', 'album']:
        d = tracks[col0].drop_duplicates('id')
    d = pd.Series(1, index=d[col1])
    d.resample('A').sum().fillna(0).plot()

plt.figure()
plot('track', 'date_recorded')
plot('album', 'date_released')

plt.figure()
plot('artist', 'active_year_begin')
plot('artist', 'active_year_end')

plt.figure()
plot('track', 'date_created')
plot('album', 'date_created')
plot('artist', 'date_created')

In [None]:
# Same as above, formatted for the paper.
plt.figure(figsize=(5, 4))
d = tracks['album'].drop_duplicates('id')
d = pd.Series(1, index=d['date_released'])
d = d.resample('A').sum().fillna(0)
b = d.index >= pd.to_datetime(1990, format='%Y')
b &= d.index <= pd.to_datetime(2017, format='%Y')
d[b].plot(color='k')
plt.xlabel('release year')
plt.ylabel('#albums')
plt.tight_layout()
plt.savefig('album_release_year.pdf')

d.index.min().year, d.index.max().year
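
Counting items per year requires the date column to hold real datetimes; after a plain `pd.read_csv` it holds strings and must be converted first. The sketch below is an alternative to the `resample` calls above, assuming ISO-formatted date strings: it groups by calendar year and fills the empty years explicitly. The four dates are release dates that appear in the tables of this notebook.

```python
import pandas as pd

released = pd.to_datetime(['2009-01-05', '2009-01-06', '2008-02-06', '2006-01-01'])
s = pd.Series(1, index=released)

# Count per calendar year, then insert the years without any release.
per_year = s.groupby(s.index.year).sum()
per_year = per_year.reindex(range(2006, 2010), fill_value=0)

print(per_year)
```

The `reindex` step plays the role of `fillna(0)` after resampling: without it, 2007 would simply be absent from the result.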

## 3 Artists & albums effect

In [None]:
for effect in ['artist', 'album']:
    d = tracks[effect, 'id'].value_counts()
    ipd.display(d.head(5))
    p = sns.distplot(d[(d.values < 50) & (d.values >= 0)], kde=False)
    p.set_xlabel('#tracks per ' + effect);
    p.set_ylabel('#' + effect + 's');

In [None]:
counts = pd.Series(index=genres.loc[genres['parent'] == 0, 'title'].values, name='#artists')
for genre in counts.index:
    counts[genre] = len(tracks.loc[tracks['track', 'genre_top'] == genre, ('artist', 'id')].unique())
counts.sort_values(ascending=False).plot.bar()
plt.ylabel('#artists');
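
The loop above counts the distinct artists per top-level genre. The same quantity can be sketched with a `groupby` followed by `nunique`; this is an alternative formulation shown on hypothetical toy data with flat column names, not the notebook's code.

```python
import pandas as pd

# Hypothetical toy data: five tracks, three artists, two genres.
toy = pd.DataFrame({
    'genre_top': ['Jazz', 'Jazz', 'Jazz', 'Blues', 'Blues'],
    'artist_id': [1, 1, 2, 3, 3],
})

# Number of distinct artists per genre, largest first.
n_artists = toy.groupby('genre_top')['artist_id'].nunique().sort_values(ascending=False)
print(n_artists)
```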

## 4 Genres

In [None]:
a = set(tracks['track', 'genre_top'].dropna().unique())
b = set(genres.loc[genres['top_level'].unique(), 'title'].values)
assert a == b

print('{} top-level genres'.format(len(a)))
genres[genres['parent'] == 0].sort_values('#tracks', ascending=False)

In [None]:
p = sns.distplot(tracks['track', 'genres'].map(len), kde=False, bins=range(32))
p = sns.distplot(tracks['track', 'genres_all'].map(len), kde=False, bins=range(32))
p.set_xlabel('#genres per track')
p.set_ylabel('#tracks');
#tracks['track', 'genres_all'].map(len).describe()
#tracks['track', 'genres_all'].map(len).value_counts()

In [None]:
# Same as above, formatted for the paper.
plt.figure(figsize=(5, 4))
d = tracks['track', 'genres_all'].map(len)
p = sns.distplot(d[d.values <= 10], kde=False, bins=range(11), color='k', hist_kws=dict(alpha=0.4))
p.set_xlabel('#genres per track')
p.set_ylabel('#tracks');
p.set_xlim(0, 10)
p.set_xticks(range(11));
plt.tight_layout()
plt.savefig('genres_per_track.pdf')

d.min(), d.max()

In [None]:
# Number of tracks per genre (full).
d = genres[genres['#tracks'] > 2000].sort_values('#tracks', ascending=False)
plt.figure(figsize=(10, 4))
p = sns.barplot(x='title', y='#tracks', data=d, color='k', alpha=0.4)
p.set_xlabel('')
p.set_ylabel('#tracks')
plt.xticks(rotation=90)
plt.tight_layout()
plt.savefig('genre_distribution.pdf')

genres.loc[genres['#tracks'] > 0, '#tracks'].min(), genres['#tracks'].max()

In [None]:
# Number of tracks per top-level genre (medium).
d = tracks[tracks['set', 'subset'] <= 'medium']
d = d['track', 'genre_top'].value_counts()
plt.figure(figsize=(10, 4))
d.plot.bar(color='k', alpha=0.4)
plt.ylabel('#tracks')
plt.tight_layout()
plt.savefig('genre_top_distribution.pdf')

d

### 4.1 Genre hierarchy

* As genres have parent genres, we can plot a tree using the [DOT] language.
* Save the full genre tree as a PDF.

Todo:
* Color nodes according to FMA genre color.
* Better looking tree.

[DOT]: https://en.wikipedia.org/wiki/DOT_(graph_description_language)

In [None]:
import utils  # helper module from the FMA repository, assumed to be importable

g = utils.Genres(genres)
graph = g.create_tree([25, 31], 1)
ipd.Image(graph.create_png())

In [None]:
graph = g.create_tree(14)
graph.write_pdf('genre_hierarchy.pdf');

roots = g.find_roots()
print('{} roots'.format(len(roots)))
graph = g.create_tree(roots)
graph.write_pdf('genre_hierarchy.pdf');
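
To make the idea concrete, the sketch below finds the root genres and writes a DOT description by hand, using the five rows shown by `genres.head()`. It is a simplified stand-in for illustration and not how `utils.Genres` is implemented; the function names `find_roots` and `to_dot` are hypothetical.

```python
# (genre_id, parent, title) rows copied from genres.head(); parent 0 marks a root.
rows = [(1, 38, 'Avant-Garde'), (2, 0, 'International'), (3, 0, 'Blues'),
        (4, 0, 'Jazz'), (5, 0, 'Classical')]

def find_roots(rows):
    """Return the ids of genres without a parent."""
    return [gid for gid, parent, _ in rows if parent == 0]

def to_dot(rows):
    """Build a DOT digraph with one labelled node per genre and parent -> child edges."""
    lines = ['digraph genres {']
    for gid, parent, title in rows:
        lines.append('  {} [label="{}"];'.format(gid, title))
        if parent != 0:
            lines.append('  {} -> {};'.format(parent, gid))
    lines.append('}')
    return '\n'.join(lines)

print(find_roots(rows))
print(to_dot(rows))
```

The resulting text can be rendered by any Graphviz front end.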

### 4.2 Cross-appearance

Todo:
* Group rows and columns for better identification of related genres.

In [None]:
enc = MultiLabelBinarizer()
genres_indicator = enc.fit_transform(tracks['track', 'genres'])
# Map the binarizer's classes (genre ids) to genre titles.
genres_names = genres.loc[enc.classes_, 'title'].values
cross_correlation = genres_indicator.T @ genres_indicator

In [None]:
np.fill_diagonal(cross_correlation, 0)

plt.figure(figsize=(28, 28))
plt.imshow(np.log(cross_correlation))
plt.yticks(range(len(genres_names)), genres_names);
plt.xticks(range(len(genres_names)), genres_names, rotation=90);

In [None]:
cross_correlation = np.tril(cross_correlation, k=-1)
sort = np.argsort(cross_correlation.flatten())

N = 20
indices = np.unravel_index(sort[:-N:-1], cross_correlation.shape)
for i, j in zip(*indices):
    print('{}: {} | {}'.format(cross_correlation[i, j], genres_names[i], genres_names[j]))
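
The matrix computed above counts, for each pair of genres, the number of tracks tagged with both. The sketch below reproduces the computation with NumPy only, building the indicator matrix by hand instead of using `MultiLabelBinarizer`; the four genre lists are a toy sample using genre combinations that appear in the output earlier in this notebook.

```python
import numpy as np

# Genre ids of four tracks (toy sample).
track_genres = [[21], [76, 103], [76, 103], [17, 103]]

classes = sorted({g for gs in track_genres for g in gs})  # [17, 21, 76, 103]

# indicator[t, c] is 1 if track t carries genre classes[c].
indicator = np.array([[int(c in gs) for c in classes] for gs in track_genres])

# Entry (i, j) counts the tracks tagged with both classes[i] and classes[j].
co_occurrence = indicator.T @ indicator
np.fill_diagonal(co_occurrence, 0)  # drop the per-genre track counts

print(classes)
print(co_occurrence)
```

The result is symmetric, which is why the notebook keeps only the lower triangle before ranking the pairs.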

## 5 Audio

Todo: e.g. audio features (echonest / librosa, spectrograms) to show diversity.

## 6 Features

Todo: understand features by listening to segments that have them, e.g. <http://musicinformationretrieval.com/feature_sonification.html>.

In [None]:
features.head(5).style.format('{:.2f}')

In [None]:
sns.pairplot(features.loc[:, ('mfcc', 'mean', slice('01','03'))]);
sns.pairplot(features.loc[:, ('mfcc', 'std', slice('01','03'))]);
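
The selection `('mfcc', 'mean', slice('01', '03'))` picks the first three mean MFCC columns from the three-level column index. The sketch below shows the same slicing on a small frame with hypothetical values; it relies on the columns being lexicographically sorted, which `from_product` with sorted inputs guarantees.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product(
    [['chroma_cens', 'mfcc'], ['mean', 'std'], ['01', '02', '03', '04']],
    names=['feature', 'statistics', 'number'])

# Hypothetical values: two tracks, sixteen feature columns.
toy = pd.DataFrame(np.arange(2 * len(columns)).reshape(2, -1), columns=columns)

# Label-based slices include both end points, so this keeps '01', '02' and '03'.
subset = toy.loc[:, ('mfcc', 'mean', slice('01', '03'))]
print(subset)
```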

## 7 Echonest features

In [None]:
print('Echonest features available for {} tracks.'.format(len(echonest)))