# Apollo's Ear Classifiers: Classify audio across 13 genres

We will implement a set of three audio classification networks supporting thirteen major musical genres.

*   Target Accuracy - 80%



## System setup

- Mount Google Drive
- Set TensorFlow version to 2.x
- Enable GPU

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Set up Tensorflow 2.x
%tensorflow_version 2.x
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0


# Data Aggregation
The following are utility functions for aggregating, cleaning, and formatting audio for analysis. Note that Selenium does not work with Google Colab, so I performed the data aggregation on my laptop.

## Note on the Dataset
The famous [GTZAN](https://www.kaggle.com/andradaolteanu/gtzan-dataset-music-genre-classification?) dataset contains data for blues, classical, country, disco, hiphop, jazz, metal, pop, reggae, and rock. Coming from Africa, however, I need to be able to classify [Afrobeat](https://en.wikipedia.org/wiki/Afrobeats), [Coupe Decale](https://en.wikipedia.org/wiki/Coup%C3%A9-d%C3%A9cal%C3%A9) (which is my favorite music genre), and [Rumba](https://en.wikipedia.org/wiki/Rumba) (the favorite of my parents).
To extend the data, I relied on [Selenium](https://selenium-python.readthedocs.io/) and on [pytubeX](https://pypi.org/project/pytubeX/), a YouTube stream download package. First, I found some YouTube playlists that I thought were representative enough of the three genres. This is not scientifically rigorous, but the playlists were at least among the top three most popular results when I searched for the genre names, so it is a reasonable starting point. Then, I used Selenium to extract the video URLs, which I then tested (again using Selenium) to make sure that the videos were live. Next, I used pytubeX to get the audio of the videos. Finally, I used FFmpeg to cut a random 30-second clip from each audio file and then discarded the full file. I believe scientific exploration is fair use under copyright law, but better safe than sorry.
See [audio_data_collectors.py]() for the utility functions if you need to gather extra data. Else, you can just use the excellent [GTZAN](https://www.kaggle.com/andradaolteanu/gtzan-dataset-music-genre-classification?) dataset.
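
As a rough illustration of the clipping step described above, the sketch below picks a random 30-second window and builds a matching FFmpeg command. The helper name `build_clip_command` and the file names are hypothetical, and this is not the code from audio_data_collectors.py. The command is only assembled, not executed.

```python
import random

CLIP_SECONDS = 30  # clip length used throughout this notebook

def build_clip_command(src, dst, total_seconds, rng=random):
    """Return (start, command) for cutting a random CLIP_SECONDS window.

    Hypothetical helper: the command list is built but never run here.
    """
    if total_seconds < CLIP_SECONDS:
        raise ValueError("track is shorter than the requested clip")
    # latest start (in whole seconds) that still leaves room for a full clip
    start = rng.randint(0, int(total_seconds) - CLIP_SECONDS)
    command = [
        "ffmpeg", "-y",            # overwrite the output file if it exists
        "-ss", str(start),         # seek to the random start
        "-t", str(CLIP_SECONDS),   # keep 30 seconds
        "-i", src,
        dst,
    ]
    return start, command

start, command = build_clip_command("rumba_full.mp4", "rumba0000.wav", 245)
print(start, " ".join(command))
```

To actually cut the clip, the command list could be handed to `subprocess.run(command, check=True)`, assuming FFmpeg is installed and on the PATH.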

# Music Data Preparation
The following are a set of utility functions for feature extraction from audio data. Follow these steps after gathering all the needed data.

## Note on the features
We will be using three different types of classifiers. The first is a [Multilayer Perceptron (MLP)](https://en.wikipedia.org/wiki/Multilayer_perceptron), the second a [Convolutional Neural Network (CNN)](https://en.wikipedia.org/wiki/Convolutional_neural_network), and the third a [Recurrent Neural Network with Long Short-Term Memory (RNN-LSTM)](https://people.cs.pitt.edu/~jlee/papers/cs3750_rnn_lstm_slides.pdf). 
For the MLP we will be using a large set of features: spectral centroid, [spectral bandwidth](https://www.timbercon.com/resources/glossary/spectral-bandwidth/), [spectral rolloff](https://www.mathworks.com/help/audio/ref/spectralrolloffpoint.html), [spectral chromagram](https://en.wikipedia.org/wiki/Chroma_feature), [zero crossing rates](https://en.wikipedia.org/wiki/Zero-crossing_rate), [tempogram](https://musicinformationretrieval.com/tempo_estimation.html), and the means of [Mel-Frequency Cepstral coefficients (MFCCs)](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum). For the CNN, we will only use the logs of the [Mel Spectrograms](https://medium.com/analytics-vidhya/understanding-the-mel-spectrogram-fca2afa2ce53). Finally, for the RNN, we will use the MFCCs.
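
To give a feel for one of these features, here is a minimal pure-Python sketch of a zero crossing rate. It is a simplified whole-signal definition (the fraction of adjacent sample pairs whose signs differ), written for illustration only. librosa computes the rate frame by frame and differs in detail.

```python
import math

def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(samples) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(samples, samples[1:])
        if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)

SAMPLE_RATE = 22050
# one second of a 440 Hz sine and one second of a 4400 Hz sine
low = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
high = [math.sin(2 * math.pi * 4400 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]

# the higher-pitched tone changes sign more often, so its rate is larger
print(zero_crossing_rate(low), zero_crossing_rate(high))
```

Noisy or percussive material tends to produce higher values than smooth tonal material, which is one reason the feature can help separate genres.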

In [None]:
# modules imports
import math
import librosa # library for extracting audio data features
import json
import os
import numpy as np
import pandas as pd
import csv

In [None]:
# The structure of the extraction functions was inspired by 
# https://github.com/musikalkemist/DeepLearningForAudioWithPython/blob/master/12-%20Music%20genre%20classification:%20Preparing%20the%20dataset/code/extract_data.py

# Define constants so that we do not always need to pass in function parameters
WORKING_FOLDER = "/content/drive/My Drive/Apollo's Ear"
DATA_PATH = f"{WORKING_FOLDER}/data/genres"
TEST_DATA_PATH = f"{WORKING_FOLDER}/test"
MLP_FEATURES = f"{WORKING_FOLDER}/data/mlp_features.csv"
RNN_FEATURES = f"{WORKING_FOLDER}/data/rnn_features.json"
CNN_FEATURES = f"{WORKING_FOLDER}/data/cnn_features.json" 
GENRE_NAMES = {"afrobeat", "blues","classical", "country","coupe_decale", \
               "disco","hiphop", "jazz","metal", "pop", "reggae", "rock","rumba"}
SAMPLE_RATE = 22050 # sample rate often used in audio classification
TRACK_DURATION = 30 #seconds
SAMPLES_PER_TRACK = SAMPLE_RATE * TRACK_DURATION

def extract_features_MLP(data_path=DATA_PATH, output_path=MLP_FEATURES,
                      num_mfcc=13, num_fft=2048, hop_length=512, num_segments=5):
  """Extracts music features and writes them to a csv file for the MLP classifier
    :param data_path (str): Path to audio data
    :param output_path (str): Path for csv output
    :param num_mfcc (int): Number of MFCC coefficients
    :param num_fft (int): Number of samples per fast Fourier transform window
    :param hop_length (int): Sliding window for FFT. Measured in # of samples
    :param num_segments (int): Number of segments to divide each track into
    :return None
  """
  # Prepare data storage
  header = 'filename spectral_centroid spectral_bandwidth spectral_rolloff chroma_stft zero_crossing_rate tempogram'
  for i in range(num_mfcc):
    header += f' mfcc{i}'
  header += ' label'
  header = header.split()
  with open(output_path, 'w', newline='') as output:
    writer = csv.writer(output)
    writer.writerow(header)
  # Extract and store features 
  samples_per_segment = int(SAMPLES_PER_TRACK / num_segments)
  num_mfcc_vectors_per_segment = math.ceil(samples_per_segment / hop_length)

  for i, (dirpath, dirnames, filenames) in enumerate(os.walk(data_path)):
    
    # process every folder in data_path directory
    if dirpath != data_path and os.path.isdir(dirpath):

      # Use the sub-folder name as the genre label
      genre_label = dirpath.split("/")[-1]
      
      #Process every audio file in genre folder
      for filename in filenames:
        file_path = os.path.join(dirpath, filename)
        signal, sample_rate = None, None
        try:
          signal, sample_rate = librosa.load(file_path, sr=SAMPLE_RATE)
        except Exception:
          # start and finish are not defined yet here, so report the file path instead
          print(f'librosa.load() error -file_path {file_path}')
          continue

        # Process every audio segment in audio file
        for s in range(num_segments):
          try:
            start = samples_per_segment * s
            finish = start + samples_per_segment
                      
            #Extract spectral centroid
            spec_cent = librosa.feature.spectral_centroid(y=signal[start:finish],\
                          sr=sample_rate, n_fft=num_fft, hop_length=hop_length)                
            # #Extract spectral bandwidth
            spec_bw = librosa.feature.spectral_bandwidth(y=signal[start:finish],\
                          sr=sample_rate, n_fft=num_fft, hop_length=hop_length)
            
            # Extract spectral rolloff
            spec_ro = librosa.feature.spectral_rolloff(y=signal[start:finish],\
                          sr=sample_rate, hop_length=hop_length)
            
            # Extract chromagram
            chroma_stft = librosa.feature.chroma_stft(y=signal[start:finish],\
                          sr=sample_rate, hop_length=hop_length)
            
            #Extract zero crossing rate
            zcr = librosa.feature.zero_crossing_rate(y=signal[start:finish],\
                          hop_length=hop_length)
            
            #Extract tempogram
            tpg = librosa.feature.tempogram(y=signal[start:finish],\
                          sr=sample_rate, hop_length=hop_length)
            
            # Start row of data for line segment
            row = f'{filename} {np.mean(spec_cent)} {np.mean(spec_bw)} {np.mean(spec_ro)} {np.mean(chroma_stft)} {np.mean(zcr)} {np.mean(tpg)}'                
            
            
            # Extract mfcc
            mfcc = librosa.feature.mfcc(y=signal[start:finish], \
                                        sr=sample_rate, n_mfcc=num_mfcc \
                                        , hop_length=hop_length)
            # Append the mean of each MFCC coefficient to the row
            for c in mfcc:
              row += f' {np.mean(c)}'
            
            row += f' {genre_label}'
            row = row.split()
            with open(output_path, 'a', newline='') as output:
              writer = csv.writer(output)
              writer.writerow(row)
          except Exception:
            print(filename, 'start-finish:', start, '-', finish)




def extract_features_rnn(data_path=DATA_PATH, output_path=RNN_FEATURES,
                         num_mfcc=13, n_fft=2048, hop_length=512, num_segments=10):
  """Extracts MFCCs from music dataset and saves them into a json file along with genre labels.
        :param data_path (str): Path to dataset
        :param output_path (str): Path to json file used to save MFCCs
        :param num_mfcc (int): Number of coefficients to extract
        :param n_fft (int): Interval we consider to apply FFT. Measured in # of samples
        :param hop_length (int): Sliding window for FFT. Measured in # of samples
        :param num_segments (int): Number of segments we want to divide sample tracks into
        :return None
        """

  # dictionary to store mapping, labels, and MFCCs
  data = {
      "mapping": [],
      "labels": [],
      "mfcc": []
  }

  samples_per_segment = int(SAMPLES_PER_TRACK / num_segments)
  num_mfcc_vectors_per_segment = math.ceil(samples_per_segment / hop_length)
 
  # loop through all genre sub-folder
  for i, (dirpath, dirnames, filenames) in enumerate(os.walk(data_path)):

    
    # ensure we're processing a genre sub-folder level
    if dirpath != data_path and os.path.isdir(dirpath):
      
      # save genre label (i.e., sub-folder name) in the mapping
      semantic_label = dirpath.split("/")[-1]
      data["mapping"].append(semantic_label)
      print("\nProcessing: {}".format(semantic_label))
      
      # process all audio files in genre sub-dir
      for f in filenames:

        # load audio file
        file_path = os.path.join(dirpath, f)
        signal, sample_rate = None, None
        try:
          signal, sample_rate = librosa.load(file_path, sr=SAMPLE_RATE)
        except Exception:
          print(f'librosa.load() error -file_path {file_path}')
          continue
          
        # process all segments of audio file
        for d in range(num_segments):
          # calculate start and finish sample for current segment
          start = samples_per_segment * d
          finish = start + samples_per_segment
          try:
            
            # extract mfcc
            # pass the audio and sample rate by keyword; recent librosa versions require it
            mfcc = librosa.feature.mfcc(y=signal[start:finish], sr=sample_rate,
                                        n_mfcc=num_mfcc, n_fft=n_fft,
                                        hop_length=hop_length)
            mfcc = mfcc.T
            
            # store only mfcc feature with expected number of vectors
            if len(mfcc) == num_mfcc_vectors_per_segment:
              data["mfcc"].append(mfcc.tolist())
              data["labels"].append(i-1)
          except Exception:
            print(f, 'start-finish:', start, '-', finish)
            
  # save MFCCs to json file
  with open(output_path, "w") as fp:
    json.dump(data, fp, indent=4)


def extract_features_cnn(data_path=DATA_PATH, output_path=CNN_FEATURES,
                         num_mfcc=13, n_fft=2048, hop_length=512, num_segments=10):
  """Extracts log mel spectrograms from music dataset and saves them into a json file along with genre labels.
        :param data_path (str): Path to dataset
        :param output_path (str): Path to json file used to save the log mel spectrograms
        :param num_mfcc (int): Unused here; kept so the signature matches extract_features_rnn
        :param n_fft (int): Interval we consider to apply FFT. Measured in # of samples
        :param hop_length (int): Sliding window for FFT. Measured in # of samples
        :param num_segments (int): Number of segments we want to divide sample tracks into
        :return None
        """

  # dictionary to store mapping, labels, and log mel spectrograms
  data = {
      "mapping": [],
      "labels": [],
      "log_spec": []
  }

  # Create np array logs of melspectrograms
  spects = np.empty((0, 130, 128))
  

  samples_per_segment = int(SAMPLES_PER_TRACK / num_segments)
 
  # loop through all genre sub-folder
  for i, (dirpath, dirnames, filenames) in enumerate(os.walk(data_path)):
    semantic_label = dirpath.split("/")[-1]
    
    # ensure we're processing a genre sub-folder level
    if dirpath != data_path and os.path.isdir(dirpath) and semantic_label in GENRE_NAMES:
      
      # save genre label (i.e., sub-folder name) in the mapping
      data["mapping"].append(semantic_label)
      print("\nProcessing: {}".format(semantic_label))
      
      # process all audio files in genre sub-dir
      for f in filenames:

        # load audio file
        file_path = os.path.join(dirpath, f)
        signal, sample_rate = None, None
        try:
          signal, sample_rate = librosa.load(file_path, sr=SAMPLE_RATE)
        except Exception:
          print(f'librosa.load() error -file_path {file_path}')
          continue
          
        # process all segments of audio file
        for d in range(num_segments):
          
          # calculate start and finish sample for current segment
          start = samples_per_segment * d
          finish = start + samples_per_segment
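          try:
            # pass the audio by keyword; recent librosa versions require it
            spect = librosa.feature.melspectrogram(y=signal[start:finish],
                                                   sr=sample_rate, n_fft=n_fft,
                                                   hop_length=hop_length)
            spect = librosa.power_to_db(spect, ref=np.max)
            spect = spect.T
            spect = spect[:130, :]
            spects = np.append(spects, [spect], axis=0)
            print(spects.shape)
            # store the label only once the spectrogram was appended, so that
            # inputs and labels stay aligned; the index is the genre's position
            # in data["mapping"]
            data["labels"].append(len(data["mapping"]) - 1)
          except Exception as e:
            print(e)
            print(f, 'start-finish:', start, '-', finish)
  data["log_spec"] = spects.tolist()
  # save log_specs to json file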
  with open(output_path, "w") as fp:
    json.dump(data, fp, indent=4)
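
The segment arithmetic shared by the extraction functions above can be checked in isolation. The sketch below recomputes it from the notebook's constants. The helper name `segment_layout` is hypothetical. With ten segments it gives 130 frames per segment, which matches the 130 used for the spectrogram array in `extract_features_cnn`.

```python
import math

SAMPLE_RATE = 22050
TRACK_DURATION = 30  # seconds
SAMPLES_PER_TRACK = SAMPLE_RATE * TRACK_DURATION

def segment_layout(num_segments, hop_length=512):
    """Return (samples_per_segment, frames_per_segment, boundaries)."""
    samples_per_segment = int(SAMPLES_PER_TRACK / num_segments)
    frames_per_segment = math.ceil(samples_per_segment / hop_length)
    # (start, finish) sample indices of every segment
    boundaries = [
        (samples_per_segment * s, samples_per_segment * (s + 1))
        for s in range(num_segments)
    ]
    return samples_per_segment, frames_per_segment, boundaries

samples, frames, bounds = segment_layout(num_segments=10)
print(samples, frames, bounds[-1])  # 66150 130 (595350, 661500)
```

A track shorter than 30 seconds leaves its later slices short or empty, which is a plausible (though unverified) reason for segments being skipped and reported with their start and finish indices.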

In [None]:
# Extract features for each type of classifier into csv, json, and json files respectively
extract_features_MLP()
extract_features_rnn()
extract_features_cnn()


Processing: rumba
rumba0000.wav start-finish: 529200 - 595350
rumba0000.wav start-finish: 595350 - 661500
rumba00010.wav start-finish: 529200 - 595350
rumba00010.wav start-finish: 595350 - 661500
rumba00011.wav start-finish: 463050 - 529200
rumba00011.wav start-finish: 529200 - 595350
rumba00011.wav start-finish: 595350 - 661500
rumba00014.wav start-finish: 595350 - 661500
rumba00043.wav start-finish: 396900 - 463050
rumba00043.wav start-finish: 463050 - 529200
rumba00043.wav start-finish: 529200 - 595350
rumba00043.wav start-finish: 595350 - 661500
rumba00052.wav start-finish: 595350 - 661500
rumba00059.wav start-finish: 330750 - 396900
rumba00059.wav start-finish: 396900 - 463050
rumba00059.wav start-finish: 463050 - 529200
rumba00059.wav start-finish: 529200 - 595350
rumba00059.wav start-finish: 595350 - 661500
rumba00062.wav start-finish: 595350 - 661500
rumba00087.wav start-finish: 198450 - 264600
rumba00087.wav start-finish: 264600 - 330750
rumba00087.wav start-finish: 330750 - 

# Classifiers Utilities
The following functions help with loading feature data, transforming it into training format, and building models.

In [None]:
import keras
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
import random
import pickle
import joblib

def load_csv(data_path, label_column, print_mappings=False):
  """Loads training dataset from csv file. Used for MLP data
      :param data_path (str): Path to csv file containing data
      :param label_column (str): Name of the column to drop before training (e.g. 'filename')
      :param print_mappings (bool): Whether to print the index-to-genre mapping
      :return X (ndarray): Inputs
      :return y (ndarray): Targets
  """
  # Read data into Pandas, drop useless columns, scale  
  data = pd.read_csv(data_path)
  data = data.drop([label_column], axis=1)
  data = data.dropna()
  genre_list = data.iloc[:,-1]
  encoder = LabelEncoder()
  y = encoder.fit_transform(genre_list) 
  if print_mappings:
    # encoder.classes_ holds the genre names in encoded order
    print({i: str(name) for i, name in enumerate(encoder.classes_)})
  scaler = StandardScaler()
  X = scaler.fit_transform(np.array(data.iloc[:, :-1], dtype = float))
  return X, y

def load_json(data_path, x_label, y_label, print_mappings=False, mappings_name=None):
  """Loads training dataset from json file. Used for RNN and CNN features
      :param data_path (str): Path to json file containing data
      :param x_label (str): Key of the inputs in the json file
      :param y_label (str): Key of the targets in the json file
      :param print_mappings (bool): Whether to print the index-to-genre mapping
      :param mappings_name (str): Key of the genre mapping in the json file
      :return X (ndarray): Inputs
      :return y (ndarray): Targets
  """
  with open(data_path, "r") as fp:
      data = json.load(fp)
  X = np.array(data[x_label])
  y = np.array(data[y_label])
  if print_mappings:
    # the mapping list is ordered by label index
    print({i: name for i, name in enumerate(data[mappings_name])})
  return X, y

def save_model_to_disk(model, method='pickle', output_path=None):
  """
    Saves model to disk using pickle or joblib
    :param model ML model
    :param method (serialization method) Default is 'pickle'. You can also choose
      joblib. The function will pick pickle if parameter starts with p, and joblib
      if it starts with j. Anything else and pickle is picked
    :param output_path: path to save the model at. Creates a random name in working folder otherwise
    :returns final output_path if file successfully saved
  """
  if not output_path:
    output_path = f'{WORKING_FOLDER}/model{random.randint(0, 1000000)}.sav'
 
  # the context manager closes the file even if serialization fails
  with open(output_path, 'wb') as writer:
    if method[0] == 'j':
      joblib.dump(model, writer)
    else:
      pickle.dump(model, writer)

  return output_path

def load_model_from_disk(method='pickle', file_path=None):
  """
    Loads model from disk using pickle or joblib
    :param method Default is 'pickle'. Anything starting with j selects joblib;
      anything else selects pickle.
    :param file_path: path to load the model from 
    :return: the loaded model
  """
  if not file_path:
    raise Exception("file_path is required")
  with open(file_path, 'rb') as reader:
    if method[0] == 'j':
      return joblib.load(reader)
    else:
      return pickle.load(reader)

def prepare_dataset(data, test_size=0.25, validation_size=0.2):
  """Splits data into training, validation, and test sets
  :param data (tuple): (X, y) pair as returned by load_csv or load_json
  :param test_size (float): fraction of data to allocate to testing
  :param validation_size (float): fraction of the remaining training data to allocate to validation
  
  :return X_train (ndarray): Input training set
  :return X_validation (ndarray): Input validation set
  :return X_test (ndarray): Input test set
  :return y_train (ndarray): Target training set
  :return y_validation (ndarray): Target validation set
  :return y_test (ndarray): Target test set
  """
  X, y = data
  # create train, validation and test split
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size)
  X_train, X_validation, y_train, y_validation = \
  train_test_split(X_train, y_train, test_size=validation_size)
  return X_train, X_validation, X_test, y_train, y_validation, y_test
  
  
def build_model(model=None, layers=None):
  """ Simple Abstraction to build model. 
  :model model to build. A new keras.Sequential is created if None
  :layers list of layers ordered from first at layers[0] to output at layers[-1]
  :return model: model with appended layers
  """
  if model is None:
    model = keras.Sequential()
  for layer in (layers or []):
    model.add(layer)
  return model

def test_model(model=None, X_test=None, y_test=None, verbose=2):
  """ Simple abstraction to test a model
  :model (keras.Sequential) model to be tested. Must be compiled
  :X_test (list) X testing data
  :y_test (list) y testing data
  :verbose (int) verbosity level passed to model.evaluate
  :return test_loss, test_acc: loss and accuracy on the test data
  """
  test_loss, test_acc = model.evaluate(X_test, y_test, verbose=verbose)
  return test_loss, test_acc
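
With its defaults, `prepare_dataset` applies `validation_size` to what remains after the test split, so the overall proportions are roughly 60% training, 15% validation, and 25% test. The sketch below works through the counts. The helper `split_sizes` is hypothetical, and rounding the held-out counts up is an assumption about how scikit-learn handles fractional sizes.

```python
import math

def split_sizes(n_samples, test_size=0.25, validation_size=0.2):
    """Return (n_train, n_validation, n_test) for two successive splits."""
    n_test = math.ceil(n_samples * test_size)
    remaining = n_samples - n_test
    # the validation fraction is taken from the remaining data, not the full set
    n_validation = math.ceil(remaining * validation_size)
    n_train = remaining - n_validation
    return n_train, n_validation, n_test

print(split_sizes(10000))  # (6000, 1500, 2500)
```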

# Models
Now that we have all the data and needed utility functions, we can create and
train the models. 
*   First Model: A simple Multilayer Perceptron
*   Second Model: A Convolutional Neural Network
*   Third Model: A Recurrent Neural Network with Long Short-Term Memory


The target is 80% testing accuracy for each of the models. 

## MLP


In [None]:
# Get Mappings to go from predicted number to genre name
load_csv(MLP_FEATURES, "filename", print_mappings=True)

{0: 'afrobeat', 1: 'blues', 2: 'classical', 3: 'country', 4: 'coupe_decale', 5: 'disco', 6: 'hiphop', 7: 'jazz', 8: 'metal', 9: 'pop', 10: 'reggae', 11: 'rock', 12: 'rumba'}


(array([[-0.41151822, -0.51150616, -0.58490442, ..., -0.09453255,
         -0.58875239, -0.61669917],
        [-0.37699652, -0.47307609, -0.52076945, ..., -0.06483495,
         -0.57393999, -0.44993348],
        [-0.40711615, -0.29394659, -0.53136849, ...,  0.03502479,
         -1.06283332, -0.58127863],
        ...,
        [-1.13627439, -1.17357946, -1.11512645, ..., -0.74544044,
          0.77776533, -0.23036586],
        [-1.19165672, -1.16376193, -1.10266177, ..., -0.81761482,
         -0.15126182, -0.16570641],
        [-1.28323919, -1.23439329, -1.19824234, ..., -0.66512694,
         -0.1610514 , -0.58482315]]), array([12, 12, 12, ...,  3,  3,  3]))
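
The mapping printed above is alphabetical because `LabelEncoder` assigns indices to the sorted distinct labels. The json feature files, by contrast, number genres in the order the folders were visited, so each model needs its own mapping when turning a predicted index back into a genre name. A minimal pure-Python sketch of the alphabetical encoding (the helper name is hypothetical):

```python
GENRES = ["rumba", "disco", "rock", "classical", "metal", "afrobeat", "pop",
          "reggae", "blues", "coupe_decale", "hiphop", "jazz", "country"]

def alphabetical_encoding(labels):
    """Map each distinct label to its rank in sorted order."""
    classes = sorted(set(labels))
    return {name: index for index, name in enumerate(classes)}

encoding = alphabetical_encoding(GENRES)
print(encoding["afrobeat"], encoding["coupe_decale"], encoding["rumba"])  # 0 4 12
```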

In [None]:
### Prepare data set ###
data = load_csv(MLP_FEATURES, 'filename')
X_train, X_validation, X_test, y_train, y_validation, y_test = prepare_dataset(data)

#### Create network ####
input_shape=(X_train.shape[1],)

# Initialize MLP and add hidden layers
mlp_model = keras.Sequential()
mlp_model.add(keras.layers.Dense(512, activation='relu', input_shape=input_shape))
mlp_model.add(keras.layers.Dense(512, activation='relu'))
mlp_model.add(keras.layers.Dense(256, activation='relu'))
mlp_model.add(keras.layers.Dense(256, activation='relu'))
mlp_model.add(keras.layers.Dense(256, activation='relu'))
mlp_model.add(keras.layers.Dense(128, activation='relu'))
mlp_model.add(keras.layers.Dense(128, activation='relu'))
mlp_model.add(keras.layers.Dense(64, activation='relu'))
mlp_model.add(keras.layers.Dense(64, activation='relu'))

# Add output layer
mlp_model.add(keras.layers.Dense(13, activation='softmax'))

# Compile Model
mlp_model.compile(keras.optimizers.Adam(learning_rate=0.0001), 
              loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train Model
history = mlp_model.fit(X_train, y_train, validation_data=(X_validation, y_validation), \
                    batch_size=50, epochs=100)

# Evaluate Model
test_loss, test_acc = mlp_model.evaluate(X_test, y_test, verbose=2)

Epoch 1/100
...
Epoch 77/100
Epoch 78

In [None]:
# Save the trained MLP model
mlp_model.save(f'{WORKING_FOLDER}/saved_models/mlp_model0')

INFO:tensorflow:Assets written to: /content/drive/My Drive/Apollo's Ear/saved_models/mlp_model0/assets
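
All three models are compiled with `sparse_categorical_crossentropy`, which takes an integer class index as the target rather than a one-hot vector. The sketch below shows the per-sample computation in plain Python, as an illustration only. Keras additionally clips probabilities and averages over the batch.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(true_index, probabilities):
    """Negative log of the probability assigned to the true class."""
    return -math.log(probabilities[true_index])

probs = softmax([2.0, 1.0, 0.1])
loss_if_class_0 = sparse_categorical_crossentropy(0, probs)
loss_if_class_2 = sparse_categorical_crossentropy(2, probs)
# the loss is small when the true class received a high probability
print(round(loss_if_class_0, 3), round(loss_if_class_2, 3))
```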


## CNN

In [None]:
X, y = load_json(CNN_FEATURES, 'log_spec', 'labels')

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train, test_size=0.2)

X_train = X_train[..., np.newaxis]
X_validation = X_validation[..., np.newaxis]
X_test = X_test[..., np.newaxis]

#### Create network ####
input_shape = (X_train.shape[1], X_train.shape[2], 1)
cnn_model = keras.Sequential()

# 1st conv layer
cnn_model.add(keras.layers.Conv2D(16, (3, 3), activation='relu', input_shape=input_shape))
cnn_model.add(keras.layers.MaxPooling2D((3, 3), strides=(2, 2), padding='same'))
cnn_model.add(keras.layers.BatchNormalization())

# 2nd conv layer
cnn_model.add(keras.layers.Conv2D(32, (3, 3), activation='relu'))
cnn_model.add(keras.layers.MaxPooling2D((3, 3), strides=(2, 2), padding='same'))
cnn_model.add(keras.layers.BatchNormalization())

# 3rd conv layer
cnn_model.add(keras.layers.Conv2D(64, (2, 2), activation='relu'))
cnn_model.add(keras.layers.MaxPooling2D((4, 4), strides=(2, 2), padding='same'))
cnn_model.add(keras.layers.BatchNormalization())

# 4th conv layer
cnn_model.add(keras.layers.Conv2D(128, (2, 2), activation='relu'))
cnn_model.add(keras.layers.MaxPooling2D((4, 4), strides=(2, 2), padding='same'))
cnn_model.add(keras.layers.BatchNormalization())

# flatten output and feed it into dense layer
cnn_model.add(keras.layers.Flatten())

cnn_model.add(keras.layers.Dense(128, activation='relu'))
cnn_model.add(keras.layers.Dropout(0.3))
    
cnn_model.add(keras.layers.Dense(128, activation='relu'))
cnn_model.add(keras.layers.Dropout(0.3))

# output layer
cnn_model.add(keras.layers.Dense(13, activation='softmax'))

# Compile cnn_model
cnn_model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.0001),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

# Train cnn_model
history = cnn_model.fit(X_train, y_train, validation_data=(X_validation, y_validation), \
                    batch_size=50, epochs=100)

# Evaluate cnn_model
test_loss, test_acc = cnn_model.evaluate(X_test, y_test, verbose=2)

Epoch 1/100
...
Epoch 77/100
Epoch 78

In [None]:
cnn_model.save(f'{WORKING_FOLDER}/saved_models/cnn_model0') 

INFO:tensorflow:Assets written to: /content/drive/My Drive/Apollo's Ear/saved_models/cnn_model0/assets
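
To see how the CNN above shrinks its input, the sketch below traces the feature map size through the four convolution and pooling stages. It assumes a 130 x 128 input (the spectrogram shape used in `extract_features_cnn`), unpadded stride-1 convolutions, and pooling with `padding='same'` and stride 2. The helper names are hypothetical.

```python
import math

def conv_valid(size, kernel):
    """Output length of an unpadded, stride-1 convolution."""
    return size - kernel + 1

def pool_same(size, stride):
    """Output length of pooling with 'same' padding."""
    return math.ceil(size / stride)

def cnn_feature_map_sizes(height=130, width=128):
    conv_kernels = [(3, 3), (3, 3), (2, 2), (2, 2)]
    filters = [16, 32, 64, 128]
    shapes = []
    for (kh, kw), channels in zip(conv_kernels, filters):
        height, width = conv_valid(height, kh), conv_valid(width, kw)
        height, width = pool_same(height, 2), pool_same(width, 2)
        shapes.append((height, width, channels))
    return shapes

shapes = cnn_feature_map_sizes()
h, w, c = shapes[-1]
print(shapes, h * w * c)
# [(64, 63, 16), (31, 31, 32), (15, 15, 64), (7, 7, 128)] 6272
```

Under these assumptions the `Flatten` layer hands 6272 values to the first dense layer. Batch normalization does not change the shapes, so it is left out of the trace.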


## RNN

In [None]:
load_json(RNN_FEATURES, 'mfcc', 'labels', True, "mapping")

{0: 'rumba', 1: 'disco', 2: 'rock', 3: 'classical', 4: 'metal', 5: 'afrobeat', 6: 'pop', 7: 'reggae', 8: 'blues', 9: 'coupe_decale', 10: 'hiphop', 11: 'jazz', 12: 'country'}


(array([[[-6.74345979e+01,  8.61886238e+01, -5.63126121e+01, ...,
           9.59944085e+00,  1.32054339e+01, -9.59345652e+00],
         [-6.70991884e+01,  8.96536474e+01, -6.81074160e+01, ...,
           3.29799698e+00,  8.84585181e+00, -1.38489561e+01],
         [-9.46246495e+01,  8.70190556e+01, -7.36558894e+01, ...,
          -1.65573091e+00,  2.42323777e+00, -1.23812869e+01],
         ...,
         [-8.01297808e+01,  1.11915974e+02, -4.87778241e+01, ...,
          -7.04275847e+00, -1.21872338e+01, -1.73384663e+01],
         [-1.12380499e+02,  1.01431415e+02, -5.39105163e+01, ...,
          -5.56424746e+00, -8.17554298e+00, -8.84088402e+00],
         [-1.28120083e+02,  1.02323599e+02, -4.90730915e+01, ...,
           1.00002389e-01, -7.47547995e+00, -7.49481618e+00]],
 
        [[-1.54917014e+02,  1.11043899e+02, -4.55737087e+01, ...,
           2.84917382e+00, -1.02170351e+01, -3.02180252e+00],
         [-1.42510638e+02,  1.15178886e+02, -2.53251768e+01, ...,
          -5.37256595
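
The mapping printed above follows folder order rather than alphabetical order, so it is the one to use when decoding predictions from models trained on the json features. A minimal sketch of decoding one softmax output with it (the helper name and the example probabilities are made up for illustration):

```python
RNN_MAPPING = ["rumba", "disco", "rock", "classical", "metal", "afrobeat", "pop",
               "reggae", "blues", "coupe_decale", "hiphop", "jazz", "country"]

def decode_prediction(probabilities, mapping):
    """Return (genre name, probability) for the most likely class."""
    if len(probabilities) != len(mapping):
        raise ValueError("expected one probability per genre")
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return mapping[best], probabilities[best]

# hypothetical softmax output for a single segment
example = [0.01] * 13
example[9] = 0.88
print(decode_prediction(example, RNN_MAPPING))  # ('coupe_decale', 0.88)
```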

In [None]:
### Prepare data set ###
X, y = load_json(RNN_FEATURES, 'mfcc', 'labels')

# create train, validation and test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train, test_size=0.2)


#### Create network ####
input_shape = (X_train.shape[1], X_train.shape[2])

# Initialize RNN-LSTM and add layers
rnn_model = keras.Sequential()

# LSTM layers
rnn_model.add(keras.layers.LSTM(512, input_shape=input_shape, return_sequences=True))
rnn_model.add(keras.layers.LSTM(512, return_sequences=True))
rnn_model.add(keras.layers.LSTM(256, return_sequences=True))
rnn_model.add(keras.layers.LSTM(256, return_sequences=True))
rnn_model.add(keras.layers.LSTM(128, return_sequences=True))
rnn_model.add(keras.layers.LSTM(128, return_sequences=True))
rnn_model.add(keras.layers.LSTM(64)) 

# dense layers
rnn_model.add(keras.layers.Dense(64, activation='relu'))
rnn_model.add(keras.layers.Dropout(0.3)) 
rnn_model.add(keras.layers.Dense(64, activation='relu'))
rnn_model.add(keras.layers.Dropout(0.3)) 

# output layer
rnn_model.add(keras.layers.Dense(13, activation='softmax'))

# Compile Model
rnn_model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.0001),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

# Train Model
history = rnn_model.fit(X_train, y_train, validation_data=(X_validation, y_validation), \
                    batch_size=50, epochs=100)

# Evaluate Model
test_loss, test_acc = rnn_model.evaluate(X_test, y_test, verbose=2)

Epoch 1/100
...
Epoch 77/100
Epoch 78

In [None]:
rnn_model.save(f'{WORKING_FOLDER}/saved_models/rnn_model0')

INFO:tensorflow:Assets written to: /content/drive/My Drive/Apollo's Ear/saved_models/rnn_model0/assets
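
The stacked LSTM above is by far the largest of the three networks. As a rough size estimate, the sketch below counts the trainable weights of each LSTM layer using the standard formula for an LSTM with biases (four gates, each with input weights, recurrent weights, and a bias). It assumes 13 MFCCs per time step, the default `num_mfcc` in `extract_features_rnn`. The helper name is hypothetical.

```python
def lstm_param_count(input_dim, units):
    """Trainable weights of one standard LSTM layer (four gates, with biases)."""
    return 4 * (units * (input_dim + units) + units)

layer_units = [512, 512, 256, 256, 128, 128, 64]
input_dim = 13  # MFCC coefficients per time step (assumed default)
counts = []
for units in layer_units:
    counts.append(lstm_param_count(input_dim, units))
    input_dim = units  # each layer feeds the next one

print(counts[0])    # 1077248
print(sum(counts))  # total over the seven LSTM layers
```

The first layer alone holds over a million weights, which helps explain why this model trains more slowly than the MLP.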


# Results
We have now implemented three audio classification systems with different architectures. The MLP model reaches an accuracy of 79%, the CNN 78%, and the RNN-LSTM the highest at 84%. 
The RNN-LSTM exceeds our 80% target, while the MLP and the CNN fall short of it by only 1 to 2 percentage points.
Overall, the exercise was very successful. Along the way we gained useful knowledge about audio characteristics and feature extraction, TensorFlow, taking advantage of open-source resources, and, for those who used Selenium, browser automation.