# Music Genre Recognition - Milestone 1
Darren Midkiff and Cheng-Wei Hu

## Overview
This project aims to identify the genre of music in an audio sample. Features will be extracted from the analog data, and a genre will be predicted using a KNN model, trained on labelled audio samples from the freely available FMA dataset.

## Training Data

The [FMA Dataset](https://github.com/mdeff/fma) is comprised of over 100,000 tracks from 161 genres. In order to make the problem more manageable, we will use the small version of the dataset, which includes 8,000 tracks from 8 top-level genres. The dataset also includes dozens of features -- year released, location of artist, number of listens, etc. Because this project aims to identify genre using only audio signal, all of these features are irrelevant and will be dropped.

In [1]:
import pandas as pd

#read full metadata file
metadata = pd.read_csv("./fma_metadata/tracks.csv", skiprows=[0,2], low_memory=False)

# drop all tracks that are not in fma_small dataset
metadata = metadata[metadata["subset"].eq("small")]
# add name to track_id column (missing because of stupid CSV formatting)
metadata = metadata.rename(columns={"Unnamed: 0": "track_id"})
# drop all columns that don't relate to genre
# we will not have this metadata from the audio file
metadata.drop(metadata.columns.difference(["track_id","genre_top"]),1,inplace=True)
# reset indices accounting for dropped rows
metadata = metadata.reset_index(drop=True)

#write only relevant metadata to file for use in training
metadata.to_csv("fma_small_genres.csv")

metadata.head()

Unnamed: 0,track_id,genre_top
0,2,Hip-Hop
1,5,Hip-Hop
2,10,Pop
3,140,Folk
4,141,Folk


In [2]:
# show 8 genres
print(metadata["genre_top"].unique())

['Hip-Hop' 'Pop' 'Folk' 'Experimental' 'Rock' 'International' 'Electronic'
 'Instrumental']


## Feature Extraction

In order to apply machine learning techniques to audio samples, useful features must be extracted from the signals. We will extract four features: [Zero Crossing Rate](https://en.wikipedia.org/wiki/Zero-crossing_rate), [Spectral Centroid](https://en.wikipedia.org/wiki/Spectral_centroid), [Spectral Rolloff](https://en.wikipedia.org/wiki/Spectral_slope), and [Mel-Frequency Cepstral Coefficients](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum).


In our project, we will use [librosa](https://librosa.org/doc/latest/index.html) to extract features from raw audio. Here, features are extracted from our training data and exported to a csv file.

In [18]:
%matplotlib inline
import librosa
import IPython.display as ipd
import matplotlib.pyplot as plt
import librosa.display
import sklearn
import numpy as np

In [4]:
# Get all the audio file paths from the fma_small dataset 
import os
file_names = []
for root, dirs, files in os.walk('./fma_small', topdown=False):
    for name in files:
        if name[-1] != '3':
            continue
        file_names.append(os.path.join(root, name))
# print(len(file_names))

In [33]:
# Define the function to extract the four features from  
def normalize(x, axis=0):
    return sklearn.preprocessing.minmax_scale(x, axis=axis)

def extract_feature_from_audio(audio_path, should_plot = False, should_print = False):
    # load
    x , sr = librosa.load(audio_path)
    if should_plot:
        plt.figure(figsize=(14, 5))
        librosa.display.waveplot(x, sr=sr)
    
    # zero_crossing_rate
    zero_crossing_rate = librosa.feature.zero_crossing_rate(x)
    m_zcr = np.mean(zero_crossing_rate) #mean zero-crossing rate
    v_zcr = np.var(zero_crossing_rate) #zero-crossing rate variance
    
    if should_print:
        print(zero_crossing_rate.shape)
        print(zero_crossing_rate)
        print(m_zcr)
    
    # spectral_centroids
    spectral_centroids = librosa.feature.spectral_centroid(x, sr=sr)[0]
    m_spec_cent = np.mean(spec_cent) #mean spectral centroid
    v_spec_cent = np.var(spec_cent) #spectral centroid variance
    
    if should_print:
        print(spectral_centroids.shape)
        print(spectral_centroids)
        print(m_spec_cent)

    # Computing the time variable for visualization
    # frames = range(len(spectral_centroids))
    # t = librosa.frames_to_time(frames)
    
    # spectral_rolloff
    spectral_rolloff = librosa.feature.spectral_rolloff(x, sr=sr)[0]
    m_spec_roll = np.mean(spectral_rolloff) #mean spectral rolloff
    v_spec_roll = np.var(spectral_rolloff) #spectral rolloff variance
    
    if should_print:
        print(spectral_rolloff.shape)
        print(spectral_rolloff)
        print(m_spec_roll)

    # mfccs
    mfccs = librosa.feature.mfcc(x, sr=sr)
    m_mffcs = np.mean(mffcs,axis=1) #mean mfccs
    v_mfccs = np.var(mfccs,axis=1) #mfcc variances
    
    if should_print:
        print(mfccs.shape)
        print(mfccs)
    #Displaying  the MFCCs:
    # librosa.display.specshow(mfccs, sr=sr, x_axis='time')
    
    return [m_zcr, v_zcr, m_spec_cent, v_spec_cent, m_spec_roll, v_spec_roll, m_mfccs, v_mfccs]

In [13]:
x,sr = librosa.load("./fma_small/000/000002.mp3")
zero_crossing_rate = librosa.feature.zero_crossing_rate(x)[0]
spectral_centroids = librosa.feature.spectral_centroid(x, sr=sr)[0]
spectral_rolloff = librosa.feature.spectral_rolloff(x, sr=sr)[0]
mfccs = librosa.feature.mfcc(x, sr=sr)

In [16]:
print(zero_crossing_rate.shape)
print(spectral_centroids.shape)
print(spectral_rolloff.shape)
print(mfccs.shape)

(1291,)
(1291,)
(1291,)
(20, 1291)


In [37]:
x, sr = librosa.load(file_names[0])

In [40]:
file_names[0]

'./fma_small\\000\\000002.mp3'

In [41]:
# Extract the features from audio files
# Some of the audio files are damaged. So we skipped those files.
# The damaged files are: './fma_small/099/099134.mp3', './fma_small/108/108925.mp3', './fma_small/133/133297.mp3'

train_audio_features_all = []
fail_file_names_all = []
fail_file_idx_all = []
fail_file_names_dict_all = {}

for idx, file in enumerate(file_names[:5]):
    print(idx, file)
    try:
        single_audio_features = extract_feature_from_audio(file)
#         row = []
#         row.append(file.split('/')[-1])
#         for f in single_audio_features:
#             row.append(f)
#         train_audio_features_all.append(row)
    except:
        print("Failed: ", idx, file)
        fail_file_names_all.append(file)
        fail_file_idx_all.append(file)
        fail_file_names_dict_all[file] = True

0 ./fma_small\000\000002.mp3
Failed:  0 ./fma_small\000\000002.mp3
1 ./fma_small\000\000005.mp3
Failed:  1 ./fma_small\000\000005.mp3
2 ./fma_small\000\000010.mp3
Failed:  2 ./fma_small\000\000010.mp3
3 ./fma_small\000\000140.mp3
Failed:  3 ./fma_small\000\000140.mp3
4 ./fma_small\000\000141.mp3
Failed:  4 ./fma_small\000\000141.mp3


In [3]:
len(train_audio_features_all)

NameError: name 'train_audio_features_all' is not defined

In [23]:
# Export the features into a csv files
import csv

with open('features_all.csv', mode='w') as features_file:
    features_file = csv.writer(features_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    features_names = ["file_name", "zero_crossing (zero_crossing_sum & total_frame)", "spectral_centroids", "spectral_rolloff", "mfccs"]
    features_file.writerow(features_names)
    for idx, raw_feature in enumerate(train_audio_features_all):
        row = []
        for i, f in enumerate(raw_feature):
            if i == 0:
                row.append(f)
                print(idx, f)
            elif i == 1:
                row.append([sum(f), len(f)])
            else:
                row.append(f.tolist())
        features_file.writerow(row)

0 fma_small\000\000002.mp3


TypeError: 'numpy.int32' object is not iterable

The exported csv file can be found [here](https://drive.google.com/file/d/1xiyYZ23WVl-RyWIWlf3-JZaZll4i9b1X/view?usp=sharing).

## Data Preprocessing

In order to make the model training easier, we have to preprocess the data and combine the `features_all.csv` files with `fma_small_genres.csv`, storing all calculated features with their respective track IDs. This will be extremely helpful in model training.

In [None]:
# Read features_all.csv and process the file_name columns
import pandas as pd

raw_data = pd.read_csv("./features_all.csv", low_memory=False)
raw_data = raw_data.rename(columns={"file_name": "track_id"})
raw_data['track_id'] = raw_data['track_id'].str[:-4].astype(int)

In [None]:
# Read features_all.csv and make it a dictionary
 and process the file_name columns

metadata = pd.read_csv("./fma_small_genres.csv", low_memory=False)
metadata_mapping = dict([(i,g) for i, g in zip(metadata.track_id, metadata.genre_top)])

In [None]:
# Insert a column genre_top into raw_data with each track's corresponding genre_top
raw_data["genre_top"] = raw_data["track_id"].map(metadata_mapping)
raw_data.to_csv("fma_small_train_data.csv")

The exported csv file can be found [here](https://drive.google.com/file/d/1hTANS4oluY0VlIYyS-fjamg4p1zPpsT1/view?usp=sharing).

## Model

Now that we have labeled training data and corresponding features, the only remaining step is to train a model. We are strongly considering a nearest-neighbors model, but the size of the training dataset may prove prohibitive in this pursuit.