# Preparing the MagnaTagATune Dataset for Music Genre Classification

Arun Das

Research Fellow,

Open Cloud Institute,

The University of Texas at San Antonio.

arun.das@my.utsa.edu

A little bit about me: I am a computer engineer by trade, with a research concentration in cloud computing and deep learning. I started researching deep learning only about a year and a half ago, with an emphasis on computer vision, so I am still on the learning curve when it comes to advanced DL topics in other areas. This notebook is the first step in the deep learning pipeline of an interesting music genre classification problem: the data-intensive part, where you prepare the dataset the way you want it.

The dataset used for the project is MagnaTagATune, which has more than 25K mp3 files. The aim of the project is to use a deep neural network to predict the genre of a piece of music, given the mp3 as input. This is achieved through a combination of convolutional and recurrent neural networks working together.

I used pandas to work with the annotations for each of the 25K mp3 files. These annotations contain information about the genre, file id, mp3 file location, etc. Pandas is an easy, flexible and powerful tool with many functions related to data structures for data analysis, time series analysis and statistics. After the annotations are processed, each raw mp3 file needs to be converted to a Mel-scaled power spectrogram. We use librosa to do it. You can see an example here.

Let's do it then.

In [1]:
#Import required libraries
import pandas as pd
import numpy as np
import os
import shutil
import librosa
#Set number of columns to show in the notebook
pd.set_option('display.max_columns',200)
#Set number of rows to show in the notebook
pd.set_option('display.max_rows',50)
#Import Matplotlib Package
import matplotlib.pyplot as plt

#Make the graphs a bit prettier
#(the pandas 'display.mpl_style' option has been removed from pandas; a matplotlib style sheet is used instead)
plt.style.use('ggplot')

#Display pictures within the notebook itself
%matplotlib inline


ModuleNotFoundError: No module named 'librosa'

In [2]:
#read the annotations file
newdata = pd.read_csv('annotations_final.csv',sep="\t")

FileNotFoundError: File b'annotations_final.csv' does not exist

In [None]:
#Display the top 5 rows
newdata.head(5)

In [None]:
#Get to know the data better
newdata.info()


In [None]:
#what columns are there?
newdata.columns

In [None]:
#Extract the clip_id and mp3_path
newdata[["clip_id","mp3_path"]]

In [None]:
#The previous command returned a DataFrame. We need plain arrays to do analytics on.
#Extract clip_id and mp3_path as NumPy arrays.
#(as_matrix() has been removed from recent pandas releases; .values is the equivalent.)
clip_id, mp3_path = newdata["clip_id"].values, newdata["mp3_path"].values

In [None]:
#Some of the tags in the dataset are really close to each other. Let's merge them together.
synonyms = [['beat', 'beats'],
            ['chant', 'chanting'],
            ['choir', 'choral'],
            ['classical', 'clasical', 'classic'],
            ['drum', 'drums'],
            ['electro', 'electronic', 'electronica', 'electric'],
            ['fast', 'fast beat', 'quick'],
            ['female', 'female singer', 'female singing', 'female vocals', 'female vocal', 'female voice', 'woman', 'woman singing', 'women'],
            ['flute', 'flutes'],
            ['guitar', 'guitars'],
            ['hard', 'hard rock'],
            ['harpsichord', 'harpsicord'],
            ['heavy', 'heavy metal', 'metal'],
            ['horn', 'horns'],
            ['india', 'indian'],
            ['jazz', 'jazzy'],
            ['male', 'male singer', 'male vocal', 'male vocals', 'male voice', 'man', 'man singing', 'men'],
            ['no beat', 'no drums'],
            ['no singer', 'no singing', 'no vocal','no vocals', 'no voice', 'no voices', 'instrumental'],
            ['opera', 'operatic'],
            ['orchestra', 'orchestral'],
            ['quiet', 'silence'],
            ['singer', 'singing'],
            ['space', 'spacey'],
            ['string', 'strings'],
            ['synth', 'synthesizer'],
            ['violin', 'violins'],
            ['vocal', 'vocals', 'voice', 'voices'],
            ['strange', 'weird']]

In [None]:
#Merge the synonyms and drop all columns other than the first one in each group.
"""
Example:
Merge 'beat','beats' and save it to 'beat'.
Merge 'classical','clasical','classic' and save it to 'classical'.
"""
for synonym_list in synonyms:
    newdata[synonym_list[0]] = newdata[synonym_list].max(axis=1)
    newdata.drop(synonym_list[1:],axis=1,inplace=True)
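
To see what the merge loop does, here is a minimal sketch on a made-up three-row table (the `toy` dataframe and its values are invented for illustration, not taken from MagnaTagATune). Taking the row-wise maximum of 0/1 columns acts as a logical OR, so a clip keeps the tag if any of its synonyms was set.

```python
import pandas as pd

# Hypothetical toy annotations: 0/1 tag columns, two synonym groups.
toy = pd.DataFrame({
    "clip_id":   [1, 2, 3],
    "beat":      [0, 1, 0],
    "beats":     [1, 0, 0],
    "classical": [0, 0, 0],
    "clasical":  [0, 0, 1],
})
toy_synonyms = [["beat", "beats"], ["classical", "clasical"]]

for group in toy_synonyms:
    # Row-wise max of binary columns behaves like a logical OR across the synonyms.
    toy[group[0]] = toy[group].max(axis=1)
    # Keep only the first name of each group.
    toy = toy.drop(columns=group[1:])

print(toy)
```

Clip 1 was tagged only with 'beats' and clip 3 only with 'clasical', yet both end up with the canonical tag set.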

In [None]:
#Did it Work?
newdata.info()

In [None]:
#Let's view it.
newdata.head()

In [None]:
#Drop the mp3_path column from the dataframe
newdata.drop('mp3_path',axis=1,inplace=True)
#Sum each column to count how many clips carry each tag
#(clip_id is summed as well; that entry is discarded later)
data = newdata.sum(axis=0)


In [None]:
#Find the distribution of tags
data

In [None]:
#Sort the tags by their counts (ascending).
data.sort_values(axis=0, inplace=True)

In [None]:
#Find the top tags from the dataframe
topindex, topvalues = list(data.index[84:]),data.values[84:]
#The last entry is the clip_id column (its sum is by far the largest), so remove it
del(topindex[-1])
topvalues = np.delete(topvalues, -1)
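
The slicing above relies on the counts being sorted in ascending order. A small sketch with invented counts shows the same idea; `k` and the tag names are made up, and `nlargest` is shown only as a possible alternative that avoids hard-coded positions.

```python
import pandas as pd

# Hypothetical tag counts (not the real MagnaTagATune numbers).
counts = pd.Series({"harp": 5, "guitar": 40, "rock": 25, "opera": 12, "techno": 31})

k = 3  # number of top tags to keep (assumed for this example)

# Same approach as the notebook: sort ascending, then slice from the end.
ordered = counts.sort_values()
top_by_slice = list(ordered.index[-k:])
to_remove = list(ordered.index[:-k])

# Alternative: ask pandas for the k largest directly.
top_by_nlargest = list(counts.nlargest(k).index)

print(top_by_slice)     # ascending order of count
print(top_by_nlargest)  # descending order of count
```

Both give the same set of tags; only the ordering differs.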


In [None]:
#Get the top column names
topindex

In [None]:
#Get the top column values
topvalues

In [None]:
#Get a list of columns to remove
rem_cols=data.index[:84]


In [None]:
#Cross-check: How many columns are we removing?
len(rem_cols)

In [None]:
#Drop the columns that need to be removed
newdata.drop(rem_cols,axis=1,inplace=True)

In [None]:
newdata.info()

In [None]:
#Create a backup of the dataframe (.copy() makes an independent copy rather than a second name for the same object)
backup_newdata = newdata.copy()

In [None]:
#Use this to restore the dataframe
newdata = backup_newdata.copy()
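
One caveat with backups: plain assignment only creates a second name for the same dataframe, so in-place operations show up under both names. The sketch below (toy data, invented names) contrasts that with `DataFrame.copy()`.

```python
import pandas as pd

df = pd.DataFrame({"clip_id": [1, 2], "rock": [1, 0], "jazz": [0, 1]})

alias = df            # second name for the same object
snapshot = df.copy()  # independent copy

# An in-place change made through df...
df.drop("jazz", axis=1, inplace=True)

print(list(alias.columns))     # the alias sees the change
print(list(snapshot.columns))  # the copy does not
```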

In [None]:
#Shuffle the dataframe
from sklearn.utils import shuffle
newdata = shuffle(newdata)

In [None]:
#Alternatively, this method may be used to shuffle the data.
#By setting frac=1, you'll shuffle every single row randomly.
newdata = newdata.sample(frac=1).reset_index(drop=True)
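
If you want the shuffle to be repeatable between runs, `sample` accepts a seed through `random_state`. This is a small sketch on toy data; the seed value 42 is arbitrary and not something the original run used.

```python
import pandas as pd

toy = pd.DataFrame({"clip_id": [10, 20, 30, 40, 50], "rock": [1, 0, 1, 0, 1]})

# frac=1 samples every row exactly once, i.e. a shuffle; random_state fixes the order.
shuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled_again = toy.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled)
```

With the same seed the two shuffles are identical, which makes the later train/validation/test slices reproducible too.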

In [None]:
newdata.reset_index(drop=True)

In [None]:
#One Final check
newdata.info()

In [None]:
#Let us save the final column names
final_columns_names = list(newdata.columns)
final_columns_names[0]

In [None]:
#Do it only once to delete the clip_id column
del(final_columns_names[0])

In [None]:
#Verified
final_columns_names

In [None]:
#Create the dataframe which is to be saved off (you could skip this and apply similar steps to the previous dataframe).
#Here, the binary 0's and 1's in each column are changed to False and True by using the '==' operator on the dataframe.
final_matrix=pd.concat([newdata['clip_id'],newdata[final_columns_names]==1],axis=1)
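
As a small illustration of the comparison trick (toy values, invented column names): comparing a dataframe of 0/1 integers with `== 1` yields a boolean dataframe of the same shape, which is then glued back onto the untouched `clip_id` column.

```python
import pandas as pd

toy = pd.DataFrame({"clip_id": [7, 8], "rock": [1, 0], "jazz": [0, 1]})
tag_columns = ["rock", "jazz"]

# Element-wise comparison turns 1 into True and 0 into False.
as_bool = toy[tag_columns] == 1

# Put clip_id back in front, side by side (axis=1).
toy_matrix = pd.concat([toy["clip_id"], as_bool], axis=1)

print(toy_matrix.dtypes)
```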

The following steps convert the mp3 files into their respective mel-spectrograms. This is compute intensive. If it takes a long time, copy the code into a text file and run it as a Python script so that you can forget about the Jupyter notebook session. I've run these once, so I'm not running them again.
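
Before running the conversion, it helps to see where its fixed output width comes from. The conversion code in this notebook uses a 12 kHz sample rate, a hop of 256 samples and 29.12 s per clip, and documents 1366 time frames. To my understanding, with librosa's default centred framing the number of frames is `1 + n_samples // hop_length`. The sketch below checks that arithmetic and mimics the pad-or-crop step with plain NumPy, so it runs without any audio. The helper name `fit_length` is made up for this example.

```python
import numpy as np

SR = 12000
HOP_LEN = 256
DURA = 29.12

n_sample_fit = int(round(DURA * SR))     # samples kept per clip
n_frames = 1 + n_sample_fit // HOP_LEN   # frame count under centred framing (assumption)

def fit_length(signal, target):
    """Zero-pad at the end if too short, keep the centre part if too long."""
    n = signal.shape[0]
    if n < target:
        return np.hstack((signal, np.zeros(target - n)))
    if n > target:
        start = (n - target) // 2
        return signal[start:start + target]
    return signal

short = fit_length(np.ones(10), 16)     # padded from 10 to 16 samples
long_ = fit_length(np.arange(20), 16)   # cropped from 20 to 16 samples

print(n_sample_fit, n_frames)
print(short.shape, long_.shape)
```

Because every clip is forced to the same number of samples first, every mel-spectrogram ends up with the same number of time frames, which is what lets them be stacked into one array later.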

In [None]:
#Rename all the mp3 files to their clip_id and save them into one folder named 'dataset_clip_id_mp3' in the same folder.
#Get the current working directory
root = os.getcwd()
os.mkdir(root+"/dataset_clip_id_mp3/",0o755)

#Iterate over the mp3 files, rename them to clip_id and save them to another folder.
for i in range(25863):
    #print(clip_id[i], mp3_path[i])
    src = root + "/" + mp3_path[i]
    dest = root + "/dataset_clip_id_mp3/" + str(clip_id[i])+".mp3"
    shutil.copy2(src,dest)
    #print(src,dest)
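
The copy-and-rename step can be tried safely on throwaway files first. The sketch below uses a temporary directory and two fake, empty 'mp3' files; all paths and ids are invented. It also uses `os.path.join`, which avoids missing-slash mistakes when building paths.

```python
import os
import shutil
import tempfile

root = tempfile.mkdtemp()

# Fake source layout: <root>/a/song_one.mp3 and <root>/b/song_two.mp3
toy_paths = ["a/song_one.mp3", "b/song_two.mp3"]
toy_ids = [101, 102]
for rel in toy_paths:
    full = os.path.join(root, rel)
    os.makedirs(os.path.dirname(full), exist_ok=True)
    open(full, "wb").close()  # empty placeholder file

out_dir = os.path.join(root, "dataset_clip_id_mp3")
os.makedirs(out_dir, exist_ok=True)

# Copy each file, using its clip_id as the new file name.
for cid, rel in zip(toy_ids, toy_paths):
    shutil.copy2(os.path.join(root, rel), os.path.join(out_dir, str(cid) + ".mp3"))

copied = sorted(os.listdir(out_dir))
print(copied)
```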


In [None]:
#Convert all the mp3 files into their corresponding mel-spectrograms (melgrams).

#Audio preprocessing function
def compute_melgram(audio_path):
    ''' Compute a mel-spectrogram and return it in a shape of (1,1,96,1366), where
    96 == #mel-bins and 1366 == #time frames
    parameters
    ----------
    audio_path: path for the audio file.
                Any format supported by audioread will work.
    More info: http://librosa.github.io/librosa/generated/librosa.core.load.html#librosa.core.load
    '''
    
    #Mel-spectrogram parameters
    SR = 12000
    N_FFT = 512
    N_MELS = 96
    HOP_LEN = 256
    DURA = 29.12 #To make it 1366 frames
    
    src, sr = librosa.load(audio_path, sr=SR) #Whole signal
    n_sample = src.shape[0]
    n_sample_fit = int(DURA*SR)
    
    if n_sample < n_sample_fit: #if too short, pad with zeros
        src = np.hstack((src, np.zeros((int(DURA*SR)-n_sample,))))
    elif n_sample > n_sample_fit: #if too long, keep the centre part
        src = src[(n_sample-n_sample_fit)//2:(n_sample+n_sample_fit)//2]
    #librosa.logamplitude was removed in librosa 0.6; power_to_db is its replacement
    logam = librosa.power_to_db
    melgram = librosa.feature.melspectrogram
    ret = logam(melgram(y=src, sr=SR, hop_length=HOP_LEN,
                        n_fft=N_FFT, n_mels=N_MELS)**2, ref=1.0)
    ret = ret[np.newaxis, np.newaxis, :]
    return ret

#Get the absolute path of all audio files and save it to the audio_paths list
audio_paths=[]
#Variable to save the mp3 files that don't work
files_that_dont_work=[]
os.chdir('/home/cc/notebooks/MusicProject/MagnaTagATune/')
root = os.getcwd()
os.chdir(root + '/dataset_clip_id_mp3/')
for audio_path in os.listdir('.'):
    #audio_paths.append(os.path.abspath(fname))
    #Skip clips whose melgram has already been saved
    if os.path.isfile(root + '/dataset_clip_id_melgram/' + str(os.path.splitext(audio_path)[0])+'.npy'):
        #print("exists")
        continue
    else:
        if str(os.path.splitext(audio_path)[1]) == ".mp3":
            try:
                melgram = compute_melgram(os.path.abspath(audio_path))
                dest = root + '/dataset_clip_id_melgram/' + str(os.path.splitext(audio_path)[0])
                np.save(dest, melgram)
            except EOFError:
                files_that_dont_work.append(audio_path)
                continue
    
                
"""
NOTE: I've run this and created all the mel-spectrograms and saved them off separately,
and then concatenated them into the train, test and validation sets in the ratio that I wanted.
This adds a significant overhead to the computation time when you look at the pipeline
as a whole.

For example, concatenating the corresponding files into train, test and
validation splits will in turn require more time and memory. If we decide the splits
beforehand and convert mp3 to mel-spectrogram based on those splits, it will make
life much easier (and take less time).

However, I want each of the mel-spectrograms separate as I might need to create datasets
based on different genres, numbers of files, splits etc. in the future. So this is the way
to go for me now. Please note that this requires a significant amount of system memory.
"""                
        

In [None]:
#Get a list of the clip_ids for which mp3 files and melgrams are available
mp3_available = []
melgram_available=[]
for mp3 in os.listdir('/home/cc/notebooks/MusicProject/MagnaTagATune/dataset_clip_id_mp3/'):
    mp3_available.append(int(os.path.splitext(mp3)[0]))

for melgram in os.listdir('/home/cc/notebooks/MusicProject/MagnaTagATune/dataset_clip_id_melgram/'):
    melgram_available.append(int(os.path.splitext(melgram)[0]))
        

In [None]:
#The latest clip_id
new_clip_id = final_matrix['clip_id']

In [None]:
#Let us see which files have not been converted into mel-spectrograms.
set(list(new_clip_id)).difference(melgram_available)

In [None]:
#Saw that these clips were extra: 35644, 55753, 57881. Removing them.
final_matrix = final_matrix[final_matrix['clip_id']!= 35644]
final_matrix = final_matrix[final_matrix['clip_id']!= 55753]
final_matrix = final_matrix[final_matrix['clip_id']!= 57881]
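
The three filters can also be written as one. A sketch with a toy dataframe (ids and column names invented): compute the missing ids as a set difference, then keep only the rows whose `clip_id` is not in that set.

```python
import pandas as pd

toy_matrix = pd.DataFrame({"clip_id": [1, 2, 3, 4, 5],
                           "rock": [True, False, True, False, True]})
toy_melgram_available = [1, 2, 4]  # pretend only these clips were converted

# ids present in the table but without a melgram on disk
missing = set(toy_matrix["clip_id"]).difference(toy_melgram_available)

# ~isin(...) keeps the rows whose clip_id is NOT among the missing ids
filtered = toy_matrix[~toy_matrix["clip_id"].isin(list(missing))]

print(sorted(missing))
print(filtered["clip_id"].tolist())
```

This way the list of ids to drop does not have to be typed in by hand.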

In [None]:
#Check again
final_matrix.info()

In [None]:
#Save the matrix
final_matrix.to_pickle('final_Dataframe.pkl')

In [None]:
#This is how you can load it afterwards.
final_matrix = pd.read_pickle('final_Dataframe.pkl')

In [None]:
#Separate the training, test and validation dataframes.
training_with_clip = final_matrix[:19773]

In [None]:
validation_with_clip = final_matrix[19773:21294]

In [None]:
testing_with_clip=final_matrix[21294:]
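
The split boundaries above are hard-coded row positions. As a sketch of the same contiguous slicing with the boundaries derived from proportions instead, here is a toy version; the 70/10/20 ratio is an assumption for the example, not the ratio used in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({"clip_id": range(100, 120)})  # 20 rows, pretend they are already shuffled

n = len(toy)
train_end = n * 7 // 10             # first 70% of the rows go to training
valid_end = train_end + n // 10     # next 10% go to validation

train_part = toy[:train_end]
valid_part = toy[train_end:valid_end]
test_part = toy[valid_end:]         # the remainder goes to testing

print(len(train_part), len(valid_part), len(test_part))
```

Since the slices are contiguous and share their end points, no row can land in two splits.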

In [None]:
#Quick peek
training_with_clip

In [None]:
#Quick peek
testing_with_clip

In [None]:
#Quick peek
validation_with_clip

In [None]:
#Extract the corresponding clip_ids
training_clip_id = training_with_clip['clip_id'].values
validation_clip_id = validation_with_clip['clip_id'].values
testing_clip_id = testing_with_clip['clip_id'].values

In [None]:
#Check !
training_clip_id

In [None]:
#Go to the directory you want to save the dataframe
os.chdir('/home/cc/notebooks/MusicProject/MagnaTagATune/final_dataset/')


In [None]:
#Save the 'y' values.
np.save('train_y.npy', training_with_clip[final_columns_names].values)

In [None]:
np.save('valid_y.npy',validation_with_clip[final_columns_names].values)

In [None]:
np.save('test_y.npy', testing_with_clip[final_columns_names].values)

In [None]:
# Save the 'x' clip_id's. We will make the numpy array using this.
np.savetxt('train_x_clip_id.txt', training_with_clip['clip_id'].values, fmt='%i')

In [None]:
np.savetxt('test_x_clip_id.txt', testing_with_clip['clip_id'].values, fmt='%i')

In [None]:
np.savetxt('valid_x_clip_id.txt', validation_with_clip['clip_id'].values, fmt='%i')
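
For completeness, here is a sketch of how these files could be read back later, using a temporary folder and toy arrays. `np.load` and `np.loadtxt` are the counterparts of `np.save` and `np.savetxt`; the file names mirror the ones above, but the data is invented.

```python
import os
import tempfile
import numpy as np

out_dir = tempfile.mkdtemp()

toy_y = np.array([[True, False], [False, True]])  # stand-in for the label matrix
toy_ids = np.array([101, 102])                    # stand-in for the clip_ids

np.save(os.path.join(out_dir, "train_y.npy"), toy_y)
np.savetxt(os.path.join(out_dir, "train_x_clip_id.txt"), toy_ids, fmt="%i")

# Read both files back
loaded_y = np.load(os.path.join(out_dir, "train_y.npy"))
loaded_ids = np.loadtxt(os.path.join(out_dir, "train_x_clip_id.txt"), dtype=int)

print(loaded_y.dtype, loaded_ids)
```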

This is it, the most compute-intensive part: concatenating the NumPy arrays to form the train, test and validation splits. In the training portion, I have included two different ways to create the train split: either loading and concatenating the saved NumPy arrays, or converting directly from the corresponding mp3s to mel-spectrograms.

melgram = compute_melgram(str(train_clip) + '.mp3')

OR

melgram = np.load(str(train_clip) + '.npy')

Use the one that suits you. I had a cloud instance with plenty of RAM, so I concatenated the NumPy arrays. It took about 6 hours.

In [None]:
#Now to combine the melgrams according to the clip_id.
#(Maybe in the future we can create the melgrams directly into train and validation sets according to the clip_id!)

#Variable to store melgrams.
train_x = np.zeros((0, 1, 96, 1366))
test_x = np.zeros((0, 1, 96, 1366))
valid_x = np.zeros((0, 1, 96, 1366))

root = '/home/cc/notebooks/MusicProject/MagnaTagATune/'
os.chdir(root + "/dataset_clip_id_melgram/")
for i,valid_clip in enumerate(list(validation_clip_id)):
    if os.path.isfile(str(valid_clip) + '.npy'):
        #print i,valid_clip
        melgram = np.load(str(valid_clip) + '.npy')
        valid_x = np.concatenate((valid_x, melgram), axis=0)
os.chdir('/home/cc/notebooks/MusicProject/MagnaTagATune/')
np.save('valid_x.npy', valid_x)
print("Validation file created")


root = '/home/cc/notebooks/MusicProject/MagnaTagATune/'
os.chdir(root + "/dataset_clip_id_melgram/")
for i,test_clip in enumerate(list(testing_clip_id)):
    if os.path.isfile(str(test_clip) + '.npy'):
        #print i,test_clip
        melgram = np.load(str(test_clip) + '.npy')
        test_x = np.concatenate((test_x, melgram), axis=0)
os.chdir('/home/cc/notebooks/MusicProject/MagnaTagATune/')
np.save('test_x.npy', test_x)
print("Testing file created")

root = '/home/cc/notebooks/MusicProject/MagnaTagATune/'
os.chdir(root + "/dataset_clip_id_melgram/")
for i,train_clip in enumerate(list(training_clip_id)):
    #if os.path.isfile(str(train_clip) + '.npy'):
        #print(i,train_clip)
    #The mp3 files live in dataset_clip_id_mp3, not in the melgram folder
    melgram = compute_melgram(root + "dataset_clip_id_mp3/" + str(train_clip) + '.mp3')
    #melgram = np.load(str(train_clip) + '.npy')
    train_x = np.concatenate((train_x, melgram), axis=0)
os.chdir('/home/cc/notebooks/MusicProject/MagnaTagATune/')
np.save('train_x.npy', train_x)
print("Training file created.")
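
A note on speed: calling `np.concatenate` inside the loop copies the whole accumulated array on every iteration. One possible alternative, sketched here with tiny random arrays instead of real melgrams (the shapes are shrunk and the data is random), is to allocate the final array once and fill it slot by slot.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for saved melgrams: each has shape (1, 1, 4, 6) instead of (1, 1, 96, 1366).
toy_melgrams = [rng.standard_normal((1, 1, 4, 6)) for _ in range(5)]

# Approach used above: grow the array one clip at a time.
grown = np.zeros((0, 1, 4, 6))
for m in toy_melgrams:
    grown = np.concatenate((grown, m), axis=0)

# Alternative: allocate once, then assign into each slot.
filled = np.zeros((len(toy_melgrams), 1, 4, 6))
for i, m in enumerate(toy_melgrams):
    filled[i] = m[0]

print(grown.shape, np.array_equal(grown, filled))
```

Both produce the same array. I have not timed this on the full dataset, so treat the speed-up as an expectation rather than a measurement.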