In [3]:
import pandas as pd
import numpy as np
import json
#%load_ext slurm_magic

# Introduction

### By cheinu-mike and bkrrobinson

In this project we analyze spectral data provided by AcousticBrainz for genre classification. We also need part of the MusicBrainz database, since it holds the genres, the votes for each genre, and the recording IDs needed to request AcousticBrainz data.

MusicBrainz and AcousticBrainz are both part of the MetaBrainz Foundation, whose goal is to make datasets freely and widely available. MusicBrainz hosts metadata on artists, recordings, albums, etc., while AcousticBrainz hosts the "low level" spectral data of recordings.

We will obtain this low level spectral data and feed it into machine learning algorithms to classify recordings by genre.

However, we can't simply download the data from the site. It is retrieved through the REST API, so we have to make the requests ourselves. Before that, we need the recording IDs from the MusicBrainz database, each of which references a recording, and we will get the genre information from there as well.

The HTTP requests will be covered in the next section.  

MusicBrainz Website:
https://musicbrainz.org/

AcousticBrainz Website:
https://acousticbrainz.org/

Our guiding question is this: **Using low level spectral data, can we classify hundreds of genres, and how well does that classification perform?**

## Obtaining and Processing MetaBrainz Data

Obtaining MusicBrainz data is easy: you can download it directly through a browser or over FTP from the link provided here:

https://musicbrainz.org/doc/MusicBrainz_Database/Download

We need both **mbdump.tar.bz2** and **mbdump-derived.tar.bz2**. Uncompress them and you'll have access to all the files. mbdump holds the metadata on recordings but not information such as tags and genres, which is why you need mbdump-derived as well.

You can also take a look at the database schema here as we will need to merge multiple tables: https://musicbrainz.org/doc/MusicBrainz_Database/Schema 
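
The archives are large, so it can help to pull out only the tables used below instead of uncompressing everything. The following is a minimal sketch of that idea using the standard library; the helper name `extract_tables` is hypothetical, and the member names in the commented usage lines (for example `mbdump/recording`) are assumptions based on the paths read later in this notebook.

```python
import shutil
import tarfile
from pathlib import Path


def extract_tables(archive_path, wanted, dest_dir):
    """Copy only the members named in `wanted` out of a .tar.bz2 archive.

    Returns the list of member names that were actually written.
    """
    dest = Path(dest_dir)
    extracted = []
    with tarfile.open(archive_path, "r:bz2") as tar:
        for member in tar:
            if member.isfile() and member.name in wanted:
                target = dest / member.name
                target.parent.mkdir(parents=True, exist_ok=True)
                # Stream the member to disk without loading it into memory
                with tar.extractfile(member) as src, open(target, "wb") as out:
                    shutil.copyfileobj(src, out)
                extracted.append(member.name)
    return extracted


# Hypothetical usage (member names are assumptions, check them against the archive):
# extract_tables("mbdump.tar.bz2", {"mbdump/recording"}, ".")
# extract_tables("mbdump-derived.tar.bz2", {"mbdump/tag", "mbdump/recording_tag"}, ".")
```

Scanning a bz2 archive is still sequential, so this saves disk space rather than time.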

### Recording Table

Let's first look at the table that has the metadata on recordings and keep only the columns we need, which are 'id', 'gid', and 'name'. 'gid' is the most important, as it is what we need to make the requests.

In [2]:
recording = pd.read_table('mbdump/recording', 
                      header=None, 
                      sep='\t', 
                      na_values=['\\N',''],
                      usecols=[0,1,2],
                      # pandas >= 1.3; older versions used error_bad_lines=False,
                      # which was removed in pandas 2.0
                      on_bad_lines='skip')
recording = recording.rename(columns={0:'id', 1:'gid', 2:'name'})

recording.head()

Unnamed: 0,id,gid,name
0,20937085,0f42ab32-22cd-4dcf-927b-a8d9a183d68b,Travelling Man
1,20937086,4dce8f93-45ee-4573-8558-8cd321256233,Live Up
2,20937087,48fabe3f-0fbd-4145-a917-83d164d6386f,Radiate
3,11,b30b9943-9100-4d84-9ad2-69859ea88fbb,Five Man Army
4,20937088,b55f1db3-c6d2-4645-b908-03e1017a99c2,Kalighata (Rain Clouds)


### Recording Tags

The Recording Tags table merges with the Recording table on 'rec_id' (Recording Tags) and 'id' (Recording). We keep the count column because it is needed later for sample weighting and for computing genre probabilities.

In [3]:
rec_tag = pd.read_table('mbdump/recording_tag', header=None, usecols=[0,1,2])
rec_tag = rec_tag.rename(columns = {0:'rec_id', 1:'tag_id', 2:'count'})
rec_tag.head()

Unnamed: 0,rec_id,tag_id,count
0,721946,560,1
1,3065760,150,1
2,1335889,609,1
3,4348115,35626,1
4,12671911,127,1


### Tags Table
This table contains the tags users have applied to recordings; these include genres but also other information. It merges with Recording Tags on 'tag_id'.

In [4]:
tag = pd.read_table('mbdump/tag', header=None, usecols=[0,1]).rename(columns = {0: 'tag_id', 1: 'tag_name'})
tag.head()

Unnamed: 0,tag_id,tag_name
0,95,finnish
1,23,slovak
2,801,iowa
3,4,groundbreaking
4,130,taiwanese


### Genre Tags
This is a list of all the genres

In [5]:
genre_tag = pd.read_table('mbdump/genre', header=None, usecols=[2]).rename(columns = {2:'genre'})
genre_tag.head()

Unnamed: 0,genre
0,acid house
1,acid jazz
2,acid techno
3,acoustic blues
4,acoustic rock


## Joining all the tables
1. First we make an inner join of the tags with the genre list, which keeps only the tags that are genres and drops genres that have no tag id
1. Then we merge the result with the recording tags
1. Then we merge with the Recording table, so every tagged song is now associated with a genre
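
To make the three steps concrete, here is a small sketch on made-up rows. The tables are toy stand-ins that only borrow the column names used in this notebook; the gids and the "Untagged" recording are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for the four MusicBrainz tables (values are made up)
recording = pd.DataFrame({"id": [11, 17, 99],
                          "gid": ["gid-a", "gid-b", "gid-c"],
                          "name": ["Five Man Army", "Lately", "Untagged"]})
rec_tag = pd.DataFrame({"rec_id": [11, 11, 17, 17],
                        "tag_id": [77, 4, 77, 12],
                        "count": [0, 1, 1, 2]})
tag = pd.DataFrame({"tag_id": [77, 12, 4],
                    "tag_name": ["house", "downtempo", "groundbreaking"]})
genre_tag = pd.DataFrame({"genre": ["house", "downtempo", "acid jazz"]})

# 1. Keep only the tags whose name is a genre ("groundbreaking" is dropped,
#    and so is "acid jazz", which has no tag id)
genre_tags = tag.merge(genre_tag, how="inner", left_on="tag_name", right_on="genre")

# 2. Attach the per-recording votes to each genre tag
votes = rec_tag.merge(genre_tags, how="right", on="tag_id")

# 3. Attach the recording metadata; recordings without a genre tag never appear
joined = recording.merge(votes, how="right", left_on="id", right_on="rec_id").dropna()

print(joined[["gid", "name", "genre", "count"]])
```

In this toy case the result has three rows: one for "Five Man Army" and two for "Lately". The non-genre tag and the untagged recording are gone.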

In [6]:
newdata = tag.merge(genre_tag, how='inner', left_on='tag_name', right_on='genre')
newdata = rec_tag.merge(newdata, how='right', left_on='tag_id', right_on='tag_id')
newdata = recording.merge(newdata, how='right', left_on='id', right_on='rec_id')
newdata = newdata.dropna()
print(len(newdata))
newdata.head(10)

768714


Unnamed: 0,id,gid,name,rec_id,tag_id,count,tag_name,genre
0,11.0,b30b9943-9100-4d84-9ad2-69859ea88fbb,Five Man Army,11.0,77,0.0,house,house
1,11.0,b30b9943-9100-4d84-9ad2-69859ea88fbb,Five Man Army,11.0,20,0.0,alternative rock,alternative rock
2,11.0,b30b9943-9100-4d84-9ad2-69859ea88fbb,Five Man Army,11.0,559,1.0,electronica,electronica
3,11.0,b30b9943-9100-4d84-9ad2-69859ea88fbb,Five Man Army,11.0,12,2.0,downtempo,downtempo
4,11.0,b30b9943-9100-4d84-9ad2-69859ea88fbb,Five Man Army,11.0,11,2.0,electronic,electronic
5,11.0,b30b9943-9100-4d84-9ad2-69859ea88fbb,Five Man Army,11.0,1498,3.0,trip hop,trip hop
6,11.0,b30b9943-9100-4d84-9ad2-69859ea88fbb,Five Man Army,11.0,1222,2.0,alternative dance,alternative dance
7,17.0,c5355127-7a0c-428a-bd39-e5b3e83250f7,Lately,17.0,77,0.0,house,house
8,17.0,c5355127-7a0c-428a-bd39-e5b3e83250f7,Lately,17.0,559,1.0,electronica,electronica
9,17.0,c5355127-7a0c-428a-bd39-e5b3e83250f7,Lately,17.0,12,1.0,downtempo,downtempo


#### Exporting the table
The following data will be saved in case we need it.   
We also drop genre tags with a count of 0 or less (users can vote a tag down into the negatives).

In [7]:
newdata = newdata[newdata['count']>0]

In [8]:
#commented out as we already have the data

#export = newdata.drop(columns=['id','name','rec_id','tag_id', 'tag_name'])
#export.to_parquet('reportfiles/basic_genre.parquet', compression='snappy')

## Processing the new Table
We now have to wrangle the new data so we can use the genres as a target. The basic_genre file just associates a song with a genre, and a song may have more than one genre.

In [4]:
newdata = pd.read_parquet('reportfiles/basic_genre.parquet')
newdata.head()

Unnamed: 0,gid,count,genre
2,b30b9943-9100-4d84-9ad2-69859ea88fbb,1.0,electronica
3,b30b9943-9100-4d84-9ad2-69859ea88fbb,2.0,downtempo
4,b30b9943-9100-4d84-9ad2-69859ea88fbb,2.0,electronic
5,b30b9943-9100-4d84-9ad2-69859ea88fbb,3.0,trip hop
6,b30b9943-9100-4d84-9ad2-69859ea88fbb,2.0,alternative dance


With the data that we have, let's convert the genres to categorical codes. I could have done this at the beginning with the genre table, which would have future-proofed the code in case songs are later tagged with genres that currently have no songs or only votes of 0, but for simplicity let's keep only the genres that have a song associated with them for now. Having fewer genres might also make the machine learning tasks easier.
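
The difference between the two options can be sketched on toy data. Here `all_genres` is a hypothetical stand-in for the full genre table, and the column names `genre_id_observed` and `genre_id_fixed` are made up for the comparison.

```python
import pandas as pd

# Toy stand-in for the full genre table (assumption: real list comes from mbdump/genre)
all_genres = ["acid house", "acid jazz", "downtempo", "house", "trip hop"]
songs = pd.DataFrame({"gid": ["gid-a", "gid-a", "gid-b"],
                      "genre": ["trip hop", "downtempo", "house"]})

# Option used in this notebook: codes come from the genres that actually occur
observed = pd.Categorical(songs["genre"])

# Future-proof option: codes come from the fixed genre list, so they do not
# shift when a previously unused genre gets its first song
fixed = pd.Categorical(songs["genre"], categories=sorted(all_genres))

songs["genre_id_observed"] = observed.codes
songs["genre_id_fixed"] = fixed.codes
genredict_fixed = dict(enumerate(fixed.categories))

print(songs)
```

With the observed categories there are only three codes, while the fixed list always yields one code per genre in the table, at the cost of a longer target vector.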

In [10]:
#we'll make a new column as well for the genre codes

newdata['genre_id'] = pd.Categorical(newdata['genre'])
newdata['genre_id'] = newdata.genre_id.cat.codes
genredict = dict(enumerate(pd.Categorical(newdata['genre']).categories))

#print(genredict)

In [11]:
newdata.head()

Unnamed: 0,gid,count,genre,genre_id
2,b30b9943-9100-4d84-9ad2-69859ea88fbb,1.0,electronica,123
3,b30b9943-9100-4d84-9ad2-69859ea88fbb,2.0,downtempo,105
4,b30b9943-9100-4d84-9ad2-69859ea88fbb,2.0,electronic,121
5,b30b9943-9100-4d84-9ad2-69859ea88fbb,3.0,trip hop,380
6,b30b9943-9100-4d84-9ad2-69859ea88fbb,2.0,alternative dance,7


Now let's export the json

In [12]:
#with open('reportfiles/final_genredict.json', 'w') as outfile:
#    json.dump(genredict, outfile)

In [13]:
with open('reportfiles/final_genredict.json') as j:
    genredict = json.load(j)
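
One thing to keep in mind when reloading the dictionary: JSON object keys are always strings, so the integer codes come back as `"0"`, `"1"`, and so on. The sketch below shows the round trip on a toy dictionary and one way to restore integer keys; the name `restored` is just for illustration.

```python
import json

# Toy genre dictionary with integer codes as keys
genredict = {0: "acid house", 1: "acid jazz", 2: "acid techno"}

# Simulate writing to and reading from a JSON file
roundtrip = json.loads(json.dumps(genredict))
print(list(roundtrip))  # the keys are now strings

# Convert the keys back to integers so lookups by genre_id work again
restored = {int(code): name for code, name in roundtrip.items()}
```

`len(genredict)` is unaffected by this, so code that only needs the number of genres works either way.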

Going forward we can do two things. We can take the genre with the most "counts" (votes) and assign it as the main genre, or we can encode the genres as a vector in which each genre's code is an index. The latter is necessary for deep learning, but I will go further than one-hot encoding: I will use the "counts" of each genre for a given song as probabilities and encode those into the vector. First, let's get the max-count genre for each song and assign that as the "main" genre.

### Get Main Genre
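Here I keep the rows whose count is the maximum for each song, although I ended up not really needing this. 
Ties are broken at random.

In [14]:
# Keep only the rows whose count equals the per-song maximum
idx = newdata.groupby(['gid'])['count'].transform('max') == newdata['count']
maingenre = newdata[idx]

# Break ties by shuffling and keeping one whole row per song, so genre and genre_id
# stay paired (agg(np.random.choice) would draw each column independently)
maingenre = maingenre.sample(frac=1).drop_duplicates('gid').sort_values('gid')
maingenre = maingenre.reset_index(drop=True)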
maingenre.head()

Unnamed: 0,gid,count,genre,genre_id
0,00000baf-9215-483a-8900-93756eaf1cfc,1.0,folk rock,142
1,000026d2-8db1-42b1-87da-e4389dcd6093,1.0,house,191
2,00007908-1fff-415d-8e87-a49722c2442b,1.0,country,80
3,00007960-9d81-4192-b548-ad33d6b0ca54,3.0,rock,325
4,00007bab-7268-41c4-9d5c-c335c3a26f7c,1.0,dance,86


Let's export this if we need it

In [15]:
#maingenre.to_parquet('reportfiles/main_genre.parquet', compression='snappy')

In [16]:
maingenre = pd.read_parquet('reportfiles/main_genre.parquet')

### Get the genres into an encoding table
To deal with songs that have multiple genres, we need to encode the genres into a list. A song with multiple genre associations gets nonzero numbers at the indices of those genres. These numbers aren't just ones: they are the "votes", and the list is normalized with respect to these counts. This gives us genre probabilities for use in categorical classification.

For example, we need a song to have the target [0, 0, 0, 0, 0, 0, 0.2, 0, 0, 0.2, 0.6, 0, 0, ...], where the numbers are the probability/proportion of user votes and the indexes are the associated genres.

The total votes for a song will be summed as well, since that can be used to weight the samples. If a song has only one vote for a genre, the tag might not be too reliable or accurate, so I assume that the more people have voted for a genre tag on a song, the more accurate the tag is.

I named the column "cross" or "cat_cross" because categorical cross entropy will be calculated from it later when I train the deep neural network.
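
As a worked example of the target described above, the sketch below builds the vector for one hypothetical song with votes on three genres. The genre indices, the vote counts, and the vector length of 12 are made up; `conf` is the same exponent used in the next cell and is kept at 1.

```python
import numpy as np

num_genre = 12                 # toy vector length (the real data has a few hundred genres)
votes = {6: 1, 9: 1, 10: 3}    # hypothetical genre_id -> vote count for one song
conf = 1                       # exponent that would inflate the most popular genre

# Put each (possibly inflated) vote count at the index of its genre
target = np.zeros(num_genre)
for genre_id, count in votes.items():
    target[genre_id] = count ** conf

# Total votes, usable later as a sample weight
sample_weight = sum(votes.values())

# Normalize so the vector sums to 1 and can be read as probabilities
target = target / target.sum()

print(target)
print(sample_weight)
```

The three nonzero entries come out as 0.2, 0.2 and 0.6, matching the example target above, and the sample weight is the 5 votes cast in total.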

In [17]:
encgenre = newdata.copy()

In [18]:
conf = 1 #This conf value will artificially inflate the significance of 
        #the most popular genre but I will just set it to one

def catcross_list(df, conf):
    crosslist = []
    num_genre = len(np.unique(df['genre_id']))
    #print(num_genre)
    for row in df.itertuples():
        empt = np.zeros(num_genre)
        #There probably is a better way to implement this but I only need to do this once
        empt[row.genre_id] = row.count**conf
        crosslist.append(empt)
    return crosslist

In [19]:
encgenre = newdata.copy()
encgenre['cross'] = catcross_list(encgenre, conf)
encgenre = encgenre.drop(columns=['genre', 'genre_id', 'count'])
encgenre.head()

Unnamed: 0,gid,cross
2,b30b9943-9100-4d84-9ad2-69859ea88fbb,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
3,b30b9943-9100-4d84-9ad2-69859ea88fbb,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
4,b30b9943-9100-4d84-9ad2-69859ea88fbb,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
5,b30b9943-9100-4d84-9ad2-69859ea88fbb,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
6,b30b9943-9100-4d84-9ad2-69859ea88fbb,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, ..."


Let's export this

In [None]:
#encgenre.to_parquet('reportfiles/enc_genre.parquet', compression='snappy')

In [20]:
encgenre = pd.read_parquet('reportfiles/enc_genre.parquet')
print(encgenre.shape)
encgenre.head()

(760223, 2)


Unnamed: 0,gid,cross
2,b30b9943-9100-4d84-9ad2-69859ea88fbb,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
3,b30b9943-9100-4d84-9ad2-69859ea88fbb,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
4,b30b9943-9100-4d84-9ad2-69859ea88fbb,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
5,b30b9943-9100-4d84-9ad2-69859ea88fbb,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
6,b30b9943-9100-4d84-9ad2-69859ea88fbb,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, ..."


Column 'cross' now contains the encodings, but we still have to collapse these lists into one per song. Collapsing will take too long on a single process, so I submitted it to a Slurm job.

In [21]:
numgenre = len(genredict)
def collapse_cross(df, num_genre):
    cross = []
    indlist = []
    ind = np.unique(df['gid'])
    for i in ind:
        merged = np.zeros(num_genre)
        for j in df[df.gid == i]['cross']:
            merged = np.add(merged, j)
        indlist.append(i)
        cross.append(merged)
    return pd.DataFrame(zip(indlist, cross))

Here's a sample of it working

In [23]:
%%time
test = encgenre[0:100]
test2 = encgenre[100:200]
test = collapse_cross(test, 395)
test2 = collapse_cross(test2, 395)

print(test.shape, test2.shape)

test3 = pd.concat([test, test2])
test3.head()

(55, 2) (62, 2)
CPU times: user 102 ms, sys: 1.82 ms, total: 104 ms
Wall time: 150 ms


Unnamed: 0,0,1
0,032c4ce0-b1fd-442d-8bf1-b7777e4832e7,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, ..."
1,03d32aab-6041-49e8-8fc7-82f091b005d5,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2,0bf3d0b7-c05b-49cb-b54f-da0972f92617,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
3,0f323349-6125-4c2d-8e89-0056a022503c,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
4,1cbff13a-63d7-48d0-b352-15a4575480ef,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


For the full table, though, we should submit it to a Slurm job

#### Slurm Job to merge the cross list
The files are located under the **reportslurms** directory

In [None]:
#!/bin/bash
#SBATCH --time=00:40:00
#SBATCH -N 2 
#SBATCH -n 4 
#SBATCH --mem=24576


. /global/software/anaconda/anaconda-3.6-5.1.0/etc/profile.d/conda.sh
module load python/anaconda-3.6-5.1.0
conda activate cheinu

python collapse_cross.py

**Below are the contents of collapse_cross.py**  
Note that rows for the same gid should be clustered together, which they already are. There is a chance that a split falls in the middle of a gid's rows, but that probability is low, so the resulting dataframe should contain each gid only once. Even with the split, the code below takes some time to run. If duplicates do appear, they should be isolated and merged individually, but I have not had this issue.

In [None]:
import os
import sys
import pandas as pd
import numpy as np 
from multiprocessing import Process, Pool, Queue, cpu_count

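mdf = pd.read_parquet('reportfiles/enc_genre.parquet')
# Length of each encoding vector (num_genre was otherwise undefined in this script)
num_genre = len(mdf['cross'].iloc[0])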

def collapse_cross(df, num_genre, out_q):
	cross = []
	indlist = []
	ind = np.unique(df['gid'])
	for i in ind:
		merged = np.zeros(num_genre)
		for j in df[df.gid == i]['cross']:
			merged = np.add(merged, j)
		indlist.append(i)
		cross.append(merged)
	out_q.put(pd.DataFrame(zip(indlist, cross)))

if __name__=="__main__":
	nproc = int(os.environ["SLURM_CPUS_ON_NODE"])
	procs =[]
	
	out_q = Queue()
	newdf = pd.DataFrame()

	newdata = np.array_split(mdf, nproc)
	
	for i in range(nproc):
		p = Process(target=collapse_cross, args=(newdata[i], num_genre, out_q))
		procs.append(p)
		p.start()
	for j in range(nproc):
		newdf = pd.concat([newdf, out_q.get()])
	for p in procs:
		p.join()

	newdf = newdf.rename(columns={0:'gid', 1:'cat_cross'})
	newdf.to_parquet('cross_gen.parquet', compression = 'snappy')
	print("process finished")
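
If a chunk boundary does fall inside a gid's rows, that gid ends up with one partial vector per chunk. The sketch below shows one way such partial rows could be detected and summed after the job finishes; the helper name `merge_split_gids` is hypothetical. The shapes printed in the next section (530607 grouped songs versus 530609 collapsed rows) differ by two, which might be a sign of this, though that is only a guess.

```python
import numpy as np
import pandas as pd


def merge_split_gids(df):
    """Sum the partial vectors of any gid that appears more than once."""
    sums = {}
    for gid, vec in zip(df["gid"], df["cat_cross"]):
        if gid in sums:
            sums[gid] = sums[gid] + vec  # new array, inputs are not modified
        else:
            sums[gid] = np.asarray(vec, dtype=float)
    return pd.DataFrame({"gid": list(sums), "cat_cross": list(sums.values())})


# Toy output of the parallel job where gid "b" was split across two chunks
parts = pd.DataFrame({"gid": ["a", "b", "b"],
                      "cat_cross": [np.array([1.0, 0.0]),
                                    np.array([0.0, 2.0]),
                                    np.array([3.0, 0.0])]})

print(parts["gid"].is_unique)   # False signals that a gid was split
fixed = merge_split_gids(parts)
print(fixed)
```

This should be run before normalizing, since summing vectors that were already normalized separately would no longer give proportions of the total votes.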

### Final Target

Once we have the collapsed encodings, let's merge them with the per-song sum of the genre counts, which will be used as sample weights.

In [5]:
count_sum_genre = pd.DataFrame(newdata.groupby('gid')['count'].sum()).reset_index()
print(count_sum_genre.shape)
count_sum_genre.head()

(530607, 2)


Unnamed: 0,gid,count
0,00000baf-9215-483a-8900-93756eaf1cfc,1.0
1,000026d2-8db1-42b1-87da-e4389dcd6093,1.0
2,00007908-1fff-415d-8e87-a49722c2442b,1.0
3,00007960-9d81-4192-b548-ad33d6b0ca54,4.0
4,00007bab-7268-41c4-9d5c-c335c3a26f7c,1.0


In [6]:
cross_gen = pd.read_parquet('reportfiles/cross_gen.parquet')
print(cross_gen.shape)
cross_gen.head()

(530609, 2)


Unnamed: 0,gid,cat_cross
0,00000baf-9215-483a-8900-93756eaf1cfc,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
1,000026d2-8db1-42b1-87da-e4389dcd6093,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2,00007960-9d81-4192-b548-ad33d6b0ca54,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
3,00007bab-7268-41c4-9d5c-c335c3a26f7c,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
4,000080ea-f4d1-41c6-a327-76280f90d39f,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


#### Normalize the cat_cross list
I could have also merged the count_sum_genre and used the sums to normalize

In [14]:
cross_gen['cat_cross'] = cross_gen['cat_cross'].apply(lambda x: x/np.sum(x))
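
A quick check on toy numbers suggests the two normalizations only agree when `conf` is 1. The vector and the vote total below are made up for illustration.

```python
import numpy as np

counts = np.array([0.0, 1.0, 0.0, 3.0])   # toy collapsed vote vector for one song
total_votes = counts.sum()                 # what count_sum_genre would hold for it

# conf = 1: dividing by the vector sum or by the summed counts is the same thing
by_vector_sum = counts / np.sum(counts)
by_count_sum = counts / total_votes

# conf = 2: the vector holds counts**2, so only the vector sum gives probabilities
inflated = counts ** 2
inflated_by_vector_sum = inflated / np.sum(inflated)
inflated_by_count_sum = inflated / total_votes

print(by_vector_sum, by_count_sum)
print(inflated_by_vector_sum.sum(), inflated_by_count_sum.sum())
```

Dividing by the vector's own sum therefore looks like the safer choice if `conf` is ever changed.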

Let's now merge the summed genre and the categorical list data

In [15]:
final_target = count_sum_genre.merge(cross_gen, how='inner', left_on='gid', right_on='gid')
print(final_target.shape)
final_target.head()

(530609, 3)


Unnamed: 0,gid,count,cat_cross
0,00000baf-9215-483a-8900-93756eaf1cfc,1.0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
1,000026d2-8db1-42b1-87da-e4389dcd6093,1.0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2,00007908-1fff-415d-8e87-a49722c2442b,1.0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
3,00007960-9d81-4192-b548-ad33d6b0ca54,4.0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
4,00007bab-7268-41c4-9d5c-c335c3a26f7c,1.0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


In [16]:
#Export
final_target.to_parquet('reportfiles/target_genre.parquet')

Let's now move on to the next part, where I will be downloading the spectral data for all the recordings.