# Analysis of arousal-valence models for Music Emotion Recognition. Computing model predictions

In this notebook, arousal and valence predictions are obtained for chunk 003 using 6 pre-trained ML models, namely:
1.   Effnet-Discogs trained on DEAM dataset
2.   Effnet-Discogs trained on EmoMusic dataset
3.   MusiCNN-MSD trained on DEAM dataset
4.   MusiCNN-MSD trained on EmoMusic dataset
5.   VGGish-AudioSet trained on DEAM dataset
6.   VGGish-AudioSet trained on EmoMusic dataset

In [21]:
# Install required dependencies
!pip install essentia-tensorflow
!pip install pandas 



In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Modify the following paths according to your specific locations.

In [3]:
data_root = "/content/drive/MyDrive/AMPLAB/M1_A1/"                               # Main path
models_root = data_root + "Emotion-AV-annotation-dataset/essentia-models/"       # Path where the models are stored
audio_root = data_root + "Emotion-AV-annotation-dataset/audio_chunks/audio.003/" # Path where the audio samples of chunk 003 are stored

In [4]:
# Basic imports
import os
import json
import pandas as pd
from essentia import Pool
from essentia.standard import (
    MonoLoader,
    TensorflowPredict,
    TensorflowPredictEffnetDiscogs,
    TensorflowPredictMusiCNN,
    TensorflowPredictVGGish,
)

In [5]:
def listdirs(rootdir):
  """Return the full paths of all files and folders inside a given directory."""
  return [os.path.join(rootdir, name) for name in os.listdir(rootdir)]

## Model predictions

First, let's identify the model files needed for each type of embedding (Effnet-Discogs, MusiCNN-MSD and VGGish-AudioSet) and for each dataset the arousal-valence models were trained on (DEAM, EmoMusic).

In [6]:
# Effnet-discogs models
effnetDiscogs_embeddings_model_path = models_root + "effnet-discogs-1.pb"
deamEffnetDiscogs_av_model_path = models_root + "deam-effnet-discogs-1/deam-effnet-discogs-1.pb"
deamEffnetDiscogs_json_model_path = models_root + "deam-effnet-discogs-1/deam-effnet-discogs-1.json"
emomusicEffnetDiscogs_av_model_path = models_root + "emomusic-effnet-discogs-1/emomusic-effnet-discogs-1.pb"
emomusicEffnetDiscogs_json_model_path = models_root + "emomusic-effnet-discogs-1/emomusic-effnet-discogs-1.json"

# MusiCNN-MSD models
musiCNN_embeddings_model_path = models_root + "msd-musicnn-1.pb"
deamMusiCNN_av_model_path = models_root + "deam-musicnn-msd-1/deam-musicnn-msd-1.pb"
deamMusiCNN_json_model_path = models_root + "deam-musicnn-msd-1/deam-musicnn-msd-1.json"
emomusicMusiCNN_av_model_path = models_root + "emomusic-musicnn-msd-1/emomusic-musicnn-msd-1.pb"
emomusicMusiCNN_json_model_path = models_root + "emomusic-musicnn-msd-1/emomusic-musicnn-msd-1.json"

# VGGish-AudioSet models
VGGish_embeddings_model_path = models_root + "audioset-vggish-3.pb"
deamVGGish_av_model_path = models_root + "deam-vggish-audioset-1/deam-vggish-audioset-1.pb"
deamVGGish_json_model_path = models_root + "deam-vggish-audioset-1/deam-vggish-audioset-1.json"
emomusicVGGish_av_model_path = models_root + "emomusic-vggish-audioset-1/emomusic-vggish-audioset-1.pb"
emomusicVGGish_json_model_path = models_root + "emomusic-vggish-audioset-1/emomusic-vggish-audioset-1.json"
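
Since all of these paths are built by string concatenation, a typo or a missing download only shows up later, when a model is instantiated. A quick existence check can catch this earlier. The following is a minimal sketch; the helper name `find_missing_files` and the placeholder paths are hypothetical and not part of the original notebook.

```python
import os


def find_missing_files(paths):
    """Return the entries of `paths` that do not point to an existing file."""
    return [p for p in paths if not os.path.isfile(p)]


# Placeholder paths for illustration only; in the notebook, pass a list of
# the *_model_path variables defined above instead.
example_paths = ["/nonexistent/model.pb", "/nonexistent/model.json"]
for path in find_missing_files(example_paths):
    print("Missing model file:", path)
```

If the loop prints nothing for the real paths, every model file was found.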

### Effnet-discogs models setting

Let's now define the parameters and model instantiations for the Effnet-Discogs embedding.

The patch_size defines the number of mel-spectrogram frames the model needs to produce one embedding vector. Effnet-Discogs needs 128 frames, and each frame is extracted with a hop size of 256 samples, so at a 16 kHz sample rate its receptive field is about 2 seconds (128 * 256 / 16000 ≈ 2.05 s). This differs from MusiCNN, which uses 187 frames (about 3 seconds).

The patch_hop_size defines the number of feature frames between successive embeddings, i.e. the time step between consecutive embedding vectors, and it can vary between models. Effnet-Discogs uses a hop of 64 frames, close to 1 second (64 * 256 / 16000 ≈ 1.02 s), whereas MusiCNN uses 93 frames, about 1.5 seconds (93 * 256 / 16000 ≈ 1.49 s).
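
The frame-to-seconds arithmetic above can be checked with a few lines of Python. This is a small sketch: the helper name `frames_to_seconds` is hypothetical, and the 16 kHz sample rate and 256-sample hop size are the values quoted in the text.

```python
SAMPLE_RATE = 16000  # Hz, the rate at which the audio is loaded
HOP_SIZE = 256       # samples between consecutive mel-spectrogram frames


def frames_to_seconds(n_frames, hop_size=HOP_SIZE, sample_rate=SAMPLE_RATE):
    """Convert a number of mel-spectrogram frames to a duration in seconds."""
    return n_frames * hop_size / sample_rate


# Receptive field (patch size) and step between embeddings (patch hop size)
effnet_patch_s = frames_to_seconds(128)   # 2.048
effnet_hop_s = frames_to_seconds(64)      # 1.024
musicnn_patch_s = frames_to_seconds(187)  # 2.992
musicnn_hop_s = frames_to_seconds(93)     # 1.488

print(effnet_patch_s, effnet_hop_s, musicnn_patch_s, musicnn_hop_s)
```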

In [7]:
patch_size_effnetDiscogs = 128
patch_hop_size_effnetDiscogs = patch_size_effnetDiscogs // 2

input_layer_effnetDiscogs = "melspectrogram"
output_layer_effnetDiscogs = "onnx_tf_prefix_BatchNormalization_496/add_1"

embeddings_model_effnetDiscogs = TensorflowPredictEffnetDiscogs(
                                  graphFilename=effnetDiscogs_embeddings_model_path,
                                  input=input_layer_effnetDiscogs,
                                  output=output_layer_effnetDiscogs,
                                  patchSize=patch_size_effnetDiscogs,
                                  patchHopSize=patch_hop_size_effnetDiscogs,
                                  )

metadata_deamEffnetDiscogs = json.load(open(deamEffnetDiscogs_json_model_path, "r"))
input_layer_deamEffnetDiscogs = metadata_deamEffnetDiscogs["schema"]["inputs"][0]["name"]
output_layer_deamEffnetDiscogs = metadata_deamEffnetDiscogs["schema"]["outputs"][0]["name"]
av_deamEffnetDiscogs_model = TensorflowPredict(
                              graphFilename=deamEffnetDiscogs_av_model_path,
                              inputs=[input_layer_deamEffnetDiscogs],
                              outputs=[output_layer_deamEffnetDiscogs],
                              )

metadata_emomusicEffnetDiscogs = json.load(open(emomusicEffnetDiscogs_json_model_path, "r"))
input_layer_emomusicEffnetDiscogs = metadata_emomusicEffnetDiscogs["schema"]["inputs"][0]["name"]
output_layer_emomusicEffnetDiscogs = metadata_emomusicEffnetDiscogs["schema"]["outputs"][0]["name"]
av_emomusicEffnetDiscogs_model = TensorflowPredict(
                                  graphFilename=emomusicEffnetDiscogs_av_model_path,
                                  inputs=[input_layer_emomusicEffnetDiscogs],
                                  outputs=[output_layer_emomusicEffnetDiscogs],
                                  )
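
The same steps (load the JSON metadata, then read the first input and output layer names) are repeated for each of the six arousal-valence models in this notebook. As a possible simplification, they could be wrapped in a small helper. The sketch below is only an illustration: the function name `read_layer_names` is hypothetical, and the layer names written to the temporary file are made-up placeholders that merely mimic the `schema` structure accessed above.

```python
import json
import os
import tempfile


def read_layer_names(json_path):
    """Return the first input and output layer names from a model metadata file."""
    with open(json_path, "r") as f:  # the file is closed automatically
        metadata = json.load(f)
    schema = metadata["schema"]
    return schema["inputs"][0]["name"], schema["outputs"][0]["name"]


# Self-contained demo with a placeholder metadata file (not a real model)
example = {
    "schema": {
        "inputs": [{"name": "example_input"}],
        "outputs": [{"name": "example_output"}],
    }
}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump(example, tmp)

input_name, output_name = read_layer_names(tmp.name)
os.remove(tmp.name)
print(input_name, output_name)
```

In the notebook, the helper would be called with one of the `*_json_model_path` variables, and the returned names passed to `TensorflowPredict` as before.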

### MusiCNN-MSD models setting

Let's now define the parameters and model instantiations for the MusiCNN-MSD embedding.

In [8]:
# Embeddings model
input_layer_musiCNN = "model/Placeholder"
output_layer_musiCNN = "model/dense/BiasAdd"

# Instantiate the embeddings model
# (we can instantiate once and then compute on many different inputs).
embeddings_model_musiCNN = TensorflowPredictMusiCNN(
                            graphFilename=musiCNN_embeddings_model_path,
                            input=input_layer_musiCNN,
                            output=output_layer_musiCNN,
                          )

metadata_deamMusiCNN = json.load(open(deamMusiCNN_json_model_path, "r"))
input_layer_deamMusiCNN = metadata_deamMusiCNN["schema"]["inputs"][0]["name"]
output_layer_deamMusiCNN = metadata_deamMusiCNN["schema"]["outputs"][0]["name"]
av_deamMusiCNN_model = TensorflowPredict(
                        graphFilename=deamMusiCNN_av_model_path,
                        inputs=[input_layer_deamMusiCNN],
                        outputs=[output_layer_deamMusiCNN],
                        )

metadata_emomusicMusiCNN = json.load(open(emomusicMusiCNN_json_model_path, "r"))
input_layer_emomusicMusiCNN = metadata_emomusicMusiCNN["schema"]["inputs"][0]["name"]
output_layer_emomusicMusiCNN = metadata_emomusicMusiCNN["schema"]["outputs"][0]["name"]
av_emomusicMusiCNN_model = TensorflowPredict(
                          graphFilename=emomusicMusiCNN_av_model_path,
                          inputs=[input_layer_emomusicMusiCNN],
                          outputs=[output_layer_emomusicMusiCNN],
                          )


### VGGish-AudioSet models setting

Finally, let's define the parameters and instantiate the models for the VGGish-AudioSet embedding. Here patch_size and patch_hop_size are not set explicitly; only the output layer name is specified.

In [9]:
output_layer_VGGish = "model/vggish/embeddings"

embeddings_model_VGGish = TensorflowPredictVGGish(
                          graphFilename=VGGish_embeddings_model_path,
                          output=output_layer_VGGish,
                          )

metadata_deamVGGish = json.load(open(deamVGGish_json_model_path, "r"))
input_layer_deamVGGish = metadata_deamVGGish["schema"]["inputs"][0]["name"]
output_layer_deamVGGish = metadata_deamVGGish["schema"]["outputs"][0]["name"]
av_deamVGGish_model = TensorflowPredict(
                      graphFilename=deamVGGish_av_model_path,
                      inputs=[input_layer_deamVGGish],
                      outputs=[output_layer_deamVGGish],
                      )

metadata_emomusicVGGish = json.load(open(emomusicVGGish_json_model_path, "r"))
input_layer_emomusicVGGish = metadata_emomusicVGGish["schema"]["inputs"][0]["name"]
output_layer_emomusicVGGish = metadata_emomusicVGGish["schema"]["outputs"][0]["name"]
av_emomusicVGGish_model = TensorflowPredict(
                          graphFilename=emomusicVGGish_av_model_path,
                          inputs=[input_layer_emomusicVGGish],
                          outputs=[output_layer_emomusicVGGish],
                          )

### Compute predictions

Let's create a Pandas DataFrame to store all the predictions of the different models for all the audio recordings. 

In [10]:
columns = ["Audio Filename", "Valence DEAM-Effnet-Discogs", "Arousal DEAM-Effnet-Discogs", 
           "Valence EmoMusic-Effnet-Discogs", "Arousal EmoMusic-Effnet-Discogs", 
           "Valence DEAM-MusiCNN", "Arousal DEAM-MusiCNN", 
           "Valence EmoMusic-MusiCNN", "Arousal EmoMusic-MusiCNN",
           "Valence DEAM-VGGish", "Arousal DEAM-VGGish",
           "Valence EmoMusic-VGGish", "Arousal EmoMusic-VGGish"]
df = pd.DataFrame(columns=columns)

Let's now loop through all the audio files, extract the 3 types of embeddings, and compute predictions with the 6 pre-trained models. To avoid loading the same audio file several times, all the models are run each time a file is loaded in the loop.

In [11]:
num_audio = 1
audio_folders = listdirs(audio_root)
for folder in audio_folders:
  audio_paths = listdirs(folder + "/")
  for audio_path in audio_paths:
    audio_filename = audio_path.split('/')[-1].split('.')[0]
    print('Processing sound with id {0} [{1}/300]'.format(audio_filename, num_audio))
    audio = MonoLoader(filename=audio_path, sampleRate=16000)()
    audio_predictions = dict.fromkeys(columns)
    audio_predictions["Audio Filename"] = audio_filename

    # Compute embeddings
    embeddings_effnetDiscogs = embeddings_model_effnetDiscogs(audio)
    embeddings_musiCNN = embeddings_model_musiCNN(audio)
    embeddings_VGGish = embeddings_model_VGGish(audio)

    # Run inference for all the models
    feature_effnetDiscogs = embeddings_effnetDiscogs.reshape(-1, 1, 1, embeddings_effnetDiscogs.shape[1])
    pool_deamEffnetDiscogs = Pool()
    pool_deamEffnetDiscogs.set(input_layer_deamEffnetDiscogs, feature_effnetDiscogs)
    predictions_deamEffnetDiscogs = av_deamEffnetDiscogs_model(pool_deamEffnetDiscogs)[output_layer_deamEffnetDiscogs].squeeze()
    audio_predictions["Valence DEAM-Effnet-Discogs"] = predictions_deamEffnetDiscogs.mean(axis=0)[0]
    audio_predictions["Arousal DEAM-Effnet-Discogs"] = predictions_deamEffnetDiscogs.mean(axis=0)[1]
    pool_emomusicEffnetDiscogs = Pool()
    pool_emomusicEffnetDiscogs.set(input_layer_emomusicEffnetDiscogs, feature_effnetDiscogs)
    predictions_emomusicEffnetDiscogs = av_emomusicEffnetDiscogs_model(pool_emomusicEffnetDiscogs)[output_layer_emomusicEffnetDiscogs].squeeze()
    audio_predictions["Valence EmoMusic-Effnet-Discogs"] = predictions_emomusicEffnetDiscogs.mean(axis=0)[0]
    audio_predictions["Arousal EmoMusic-Effnet-Discogs"] = predictions_emomusicEffnetDiscogs.mean(axis=0)[1]

    feature_musiCNN = embeddings_musiCNN.reshape(-1, 1, 1, embeddings_musiCNN.shape[1])
    pool_deamMusiCNN = Pool()
    pool_deamMusiCNN.set(input_layer_deamMusiCNN, feature_musiCNN)
    predictions_deamMusiCNN = av_deamMusiCNN_model(pool_deamMusiCNN)[output_layer_deamMusiCNN].squeeze()
    audio_predictions["Valence DEAM-MusiCNN"] = predictions_deamMusiCNN.mean(axis=0)[0]
    audio_predictions["Arousal DEAM-MusiCNN"] = predictions_deamMusiCNN.mean(axis=0)[1]
    pool_emomusicMusiCNN = Pool()
    pool_emomusicMusiCNN.set(input_layer_emomusicMusiCNN, feature_musiCNN)
    predictions_emomusicMusiCNN = av_emomusicMusiCNN_model(pool_emomusicMusiCNN)[output_layer_emomusicMusiCNN].squeeze()
    audio_predictions["Valence EmoMusic-MusiCNN"] = predictions_emomusicMusiCNN.mean(axis=0)[0]
    audio_predictions["Arousal EmoMusic-MusiCNN"] = predictions_emomusicMusiCNN.mean(axis=0)[1]

    feature_VGGish = embeddings_VGGish.reshape(-1, 1, 1, embeddings_VGGish.shape[1])
    pool_deamVGGish = Pool()
    pool_deamVGGish.set(input_layer_deamVGGish, feature_VGGish)
    predictions_deamVGGish = av_deamVGGish_model(pool_deamVGGish)[output_layer_deamVGGish].squeeze()
    audio_predictions["Valence DEAM-VGGish"] = predictions_deamVGGish.mean(axis=0)[0]
    audio_predictions["Arousal DEAM-VGGish"] = predictions_deamVGGish.mean(axis=0)[1]
    pool_emomusicVGGish = Pool()
    pool_emomusicVGGish.set(input_layer_emomusicVGGish, feature_VGGish)
    predictions_emomusicVGGish = av_emomusicVGGish_model(pool_emomusicVGGish)[output_layer_emomusicVGGish].squeeze()
    audio_predictions["Valence EmoMusic-VGGish"] = predictions_emomusicVGGish.mean(axis=0)[0]
    audio_predictions["Arousal EmoMusic-VGGish"] = predictions_emomusicVGGish.mean(axis=0)[1]

    # Append predictions of an audio recording to the Pandas DataFrame
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
    df = pd.concat([df, pd.DataFrame([audio_predictions])], ignore_index=True)
    num_audio += 1


Processing sound with id 0Y3lmI2xyEOjJXld4zEo5y [1/300]
Processing sound with id 0l8tol0rUlorDx5qpxQslV [2/300]
Processing sound with id 0lE400SRtjUmLk37qkt77q [3/300]
Processing sound with id 0lTQgId3gmoTrrCad2YjpT [4/300]
Processing sound with id 0lhDwEGQ6IDlGrko5T7Ei2 [5/300]
Processing sound with id 3BucMqBqIR5Aw7MrUkF00y [6/300]
Processing sound with id 17CHY0FoOgypamXGpk0Kqj [7/300]
Processing sound with id 17E5kYLd1yrnzndzn4Xovp [8/300]
Processing sound with id 0NYVB11fk0kBalG46SxSTR [9/300]
Processing sound with id 0NJWhm3hUwIZSy5s0TGJ8q [10/300]
Processing sound with id 0JHz6TOCIsxSqenueJUmts [11/300]
Processing sound with id 77DRSxsaUoQSaTK0UI4I0a [12/300]
Processing sound with id 77lxwJmO8GfWk2LdpirwEf [13/300]
Processing sound with id 77tBTw4wbI5RmvZuJ86q4I [14/300]
Processing sound with id 19PKMOoh2Rra8T50wrkq1X [15/300]
Processing sound with id 1A1hXivNLAWyNf6pVysqzm [16/300]
Processing sound with id 2ZrCLJz5UGbJCW2JK2OgkK [17/300]
Processing sound with id 6O3PRnABuZ8wdhO
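
In the loop above, each model returns one (valence, arousal) pair per embedding patch, and the track-level value is the mean over patches. One detail worth noting: `.squeeze()` collapses a single-patch result to shape `(2,)`, in which case `.mean(axis=0)[0]` would fail. The sketch below shows a variant that handles both cases, assuming the predictions can be viewed as an array of shape (n_patches, 2) with valence in column 0 and arousal in column 1, as in the loop. The function name `track_level_av` and the numbers are made up for illustration.

```python
import numpy as np


def track_level_av(patch_predictions):
    """Average patch-level (valence, arousal) predictions over time."""
    # reshape(-1, 2) keeps the array 2-D even when there is a single patch
    predictions = np.asarray(patch_predictions, dtype=float).reshape(-1, 2)
    valence, arousal = predictions.mean(axis=0)
    return float(valence), float(arousal)


# Illustrative values only: three patches, then a single patch
print(track_level_av([[4.0, 5.0], [5.0, 6.0], [6.0, 7.0]]))  # (5.0, 6.0)
print(track_level_av([4.5, 5.5]))                            # (4.5, 5.5)
```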

Finally, let's save the resulting DataFrame to a .csv file.

In [12]:
DATAFRAME_FILENAME = "dataframe_predictionsAV.csv"   # Modify this according to your preferred path to save the .csv file
df.to_csv(DATAFRAME_FILENAME)
print('Saved DataFrame with {0} entries! {1}'.format(len(df), DATAFRAME_FILENAME))

Saved DataFrame with 300 entries! dataframe_predictionsAV.csv
