# Detecting Kermit and Waldorf & Statler based on audio features

This notebook documents our approach to detecting Kermit and Waldorf & Statler from engineered audio features.  
We employ a Logistic Regression Classifier to predict each character's presence, using audio features chosen specifically for that character.

## Time sheet for this notebook

**Daniel Blasko:**
| Date | Task | Hours |
| --- | --- | --- |
| 27.11.2023 | Setup notebook, first experiments | 3 |
| 27.11.2023 | Implement "utils/MuppetDataset.py" that generally loads and handles the annotated video data | 1 |
| 28.11.2023 | Experiment & build feature extraction for both characters, align audio samples with frame annotations | 2 |


## Imports


In [61]:
import matplotlib.pyplot as plt
import librosa.feature as lf
import librosa
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

import sys

sys.path.append("..")
from utils.MuppetDataset import MuppetDataset


## Loading the data

Set the booleans below to `True` to extract the audio/frames from the .avi files if this has not been done already.


In [2]:
extract_audio = False
extract_frames = True


In [3]:
video_paths = [
    "../data/Muppets-02-01-01.avi",
    "../data/Muppets-02-04-04.avi",
    "../data/Muppets-03-04-03.avi",
]
annotation_paths = [
    "../data/GroundTruth_Muppets-02-01-01.csv",
    "../data/GroundTruth_Muppets-02-04-04.csv",
    "../data/GroundTruth_Muppets-03-04-03.csv",
]

dataset = MuppetDataset(video_paths, annotation_paths, extract_audio, extract_frames)


Example for handling the data for video 0:

```python
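dataset.audio_paths[0]  # path to the audio file of video 0
dataset.audios[0]  # dict with the waveform ("audio") and its sampling rate ("sr")
dataset.annotations.loc[dataset.annotations.Video == 0]  # frame-level annotations of video 0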
```


## Audio feature extraction


### Aligning audio features with video frame rate

The annotations, including those for audio, are given per video frame. We therefore need to align the audio features with the video frames.

We start by checking the frame rate of the videos and recalling our audio sampling rate:


In [4]:
%%sh
ffprobe -v error -select_streams v:0 -show_entries stream=avg_frame_rate -of default=noprint_wrappers=1:nokey=1 ../data/Muppets-02-01-01.avi


25/1


In [5]:
FRAMES_PER_SECOND = 25


In [6]:
AUDIO_SAMPLING_RATE = dataset.audios[0]["sr"]
AUDIO_SAMPLING_RATE


44100

There are 25 video frames per second and 44,100 audio samples per second.  
We therefore have $\frac{44100}{25} = 1764$ audio samples per frame and compute our audio features with a hop of 1764 samples, i.e. one feature value per video frame.
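
The arithmetic above can be sketched as follows, using the frame rate and sampling rate reported by the cells above. The helper `frame_to_sample_range` is a hypothetical name introduced for illustration and is not part of `MuppetDataset`; it describes the idealized, non-overlapping mapping from a frame index to its audio samples.

```python
FRAMES_PER_SECOND = 25
AUDIO_SAMPLING_RATE = 44100

# Number of audio samples that fall into one video frame.
SAMPLES_PER_FRAME = AUDIO_SAMPLING_RATE // FRAMES_PER_SECOND


def frame_to_sample_range(frame_idx: int) -> tuple[int, int]:
    """Return the [start, end) audio sample indices covered by a video frame."""
    start = frame_idx * SAMPLES_PER_FRAME
    return start, start + SAMPLES_PER_FRAME


print(SAMPLES_PER_FRAME)          # 1764
print(frame_to_sample_range(25))  # (44100, 45864)
```

Frame 25 starts exactly one second into the video, which is why its first sample index equals the sampling rate.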


### Features for Kermit

We base our feature choice on the observation that Kermit displays a distinct audio pattern: his interventions start with screaming and transition to mumbling as he speaks, which should correspond to a high fundamental frequency at the start.  
We therefore want features that capture pitch and loudness. In the cell below, we extract the loudness (RMS energy) and the zero crossing rate, which serves as a rough proxy for the frequency content.

**We normalize all extracted features.**
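
To make the two features more concrete, here is a minimal NumPy sketch of per-frame RMS energy, zero crossing rate, and peak normalization. It is a simplification under stated assumptions: it uses non-overlapping windows and hypothetical helper names (`frame_signal`, `rms_per_frame`, `zcr_per_frame`, `peak_normalize`), whereas the notebook relies on librosa's own implementations.

```python
import numpy as np


def frame_signal(y: np.ndarray, hop_length: int) -> np.ndarray:
    """Cut a 1D signal into non-overlapping windows of `hop_length` samples."""
    n_frames = len(y) // hop_length
    return y[: n_frames * hop_length].reshape(n_frames, hop_length)


def rms_per_frame(y: np.ndarray, hop_length: int) -> np.ndarray:
    """Root-mean-square energy of each window (a simple loudness measure)."""
    frames = frame_signal(y, hop_length)
    return np.sqrt(np.mean(frames**2, axis=1))


def zcr_per_frame(y: np.ndarray, hop_length: int) -> np.ndarray:
    """Fraction of consecutive sample pairs in each window whose sign differs."""
    frames = frame_signal(y, hop_length)
    sign_changes = np.signbit(frames[:, 1:]) != np.signbit(frames[:, :-1])
    return sign_changes.mean(axis=1)


def peak_normalize(x: np.ndarray) -> np.ndarray:
    """Scale a 1D feature so its largest absolute value is 1."""
    peak = np.max(np.abs(x))
    return x / peak if peak > 0 else x


# Toy signal: one second of a quiet, low tone followed by a loud, high tone.
sr, hop = 44100, 1764
t = np.arange(sr) / sr
quiet_low = 0.1 * np.sin(2 * np.pi * 110 * t)
loud_high = 0.8 * np.sin(2 * np.pi * 880 * t)
y = np.concatenate([quiet_low, loud_high])

loudness = peak_normalize(rms_per_frame(y, hop))
zcr = peak_normalize(zcr_per_frame(y, hop))
```

In this toy signal, both `loudness` and `zcr` are clearly higher in the second half, which is the kind of contrast we hope to see between screaming and mumbling.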


In [7]:
for idx, audio in enumerate(dataset.audios):
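    # Pad audio with silence to extract features from the last 8 frames.
    # librosa's default centered framing yields 1 + len(y) // hop_length values,
    # so the target length is (number of frames - 1) hops.
    video_length_in_frames = (
        dataset.annotations.loc[dataset.annotations.Video == idx].Frame_number.max() + 1
    ) - 1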

    video_duration_seconds = (video_length_in_frames) / FRAMES_PER_SECOND
    required_audio_length = int(video_duration_seconds * AUDIO_SAMPLING_RATE)

    audio["audio"] = np.pad(
        audio["audio"],
        (0, required_audio_length - audio["audio"].shape[0]),
        "constant",
    )
    # Loudness (through RMS energy)
    audio["loudness_rms"] = librosa.util.normalize(
        librosa.feature.rms(
            y=audio["audio"], hop_length=int(audio["sr"] / FRAMES_PER_SECOND)
        )[0]
    )
    # Zero crossing rate
    audio["zcr"] = librosa.util.normalize(
        librosa.feature.zero_crossing_rate(
            y=audio["audio"], hop_length=int(audio["sr"] / FRAMES_PER_SECOND)
        )[0]
    )


Using a `hop_length` of 1764 samples together with the padding yields feature vectors with exactly one value per video frame, which is what we wanted.
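
The relation between audio length and number of feature values can be sketched as below. It assumes librosa's default centered framing (`center=True`), under which a signal of `n` samples produces `1 + n // hop_length` values; the function names and the 8-frame shortfall are hypothetical and only serve as illustration.

```python
HOP_LENGTH = 1764  # samples per video frame at 44.1 kHz and 25 fps


def centered_frame_count(n_samples: int, hop_length: int = HOP_LENGTH) -> int:
    """Number of feature values when frames are centered on multiples of
    hop_length (assumed librosa default, center=True)."""
    return 1 + n_samples // hop_length


def padded_length(n_video_frames: int, hop_length: int = HOP_LENGTH) -> int:
    """Audio length in samples that yields exactly n_video_frames values."""
    return (n_video_frames - 1) * hop_length


n_video_frames = 38681  # frame count of video 0 according to the annotations

# Hypothetical audio track that is 8 frames too short.
short_audio = padded_length(n_video_frames) - 8 * HOP_LENGTH

print(centered_frame_count(short_audio))                    # 38673
print(centered_frame_count(padded_length(n_video_frames)))  # 38681
```

This is why the padding targets one hop less than the number of annotated frames.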


### Features for Waldorf & Statler

For Waldorf & Statler, we observe that they have low, cranky voices with a very specific overtone structure.  
We therefore decide to extract spectral and timbre features.
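
As an illustration of what one such spectral feature measures, the following NumPy sketch computes the spectral roll-off of a single frame: the frequency below which a given share of the spectral magnitude lies. The function name `spectral_rolloff_hz` is hypothetical, and the 0.85 threshold mirrors what we understand to be librosa's default `roll_percent`; the notebook itself uses `librosa.feature.spectral_rolloff`.

```python
import numpy as np


def spectral_rolloff_hz(frame: np.ndarray, sr: int, roll_percent: float = 0.85) -> float:
    """Frequency below which `roll_percent` of the spectral magnitude lies."""
    magnitudes = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    cumulative = np.cumsum(magnitudes)
    threshold = roll_percent * cumulative[-1]
    # First frequency bin at which the running total reaches the threshold.
    bin_idx = int(np.searchsorted(cumulative, threshold))
    return float(freqs[bin_idx])


# Toy frames: a low tone, and the same tone with a strong high overtone added.
sr = 8000
t = np.arange(800) / sr
low_voice = np.sin(2 * np.pi * 200 * t)
bright_voice = np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 3000 * t)

rolloff_low = spectral_rolloff_hz(low_voice, sr)
rolloff_bright = spectral_rolloff_hz(bright_voice, sr)
```

Adding the overtone moves the roll-off from the fundamental up to the overtone frequency, so the feature reacts to how much energy sits in the upper part of the spectrum.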


In [8]:
for audio in dataset.audios:
    # Spectral contrast measures the difference in amplitude between peaks and valleys in the spectrum, which can capture aspects of timbre
    audio["spectral_contrast"] = librosa.util.normalize(
        librosa.feature.spectral_contrast(
            y=audio["audio"],
            sr=audio["sr"],
            hop_length=int(audio["sr"] / FRAMES_PER_SECOND),
            n_bands=4,
        )
    )  # spectral contrast values across n_bands different frequency bands for each frame, +1 that is "overall"
    # Spectral roll-off provides insights into the shape of the spectral energy distribution, affecting the timbre
    audio["spectral_rolloff"] = librosa.util.normalize(
        librosa.feature.spectral_rolloff(
            y=audio["audio"],
            sr=audio["sr"],
            hop_length=int(audio["sr"] / FRAMES_PER_SECOND),
        )
    )[0]


In [13]:
for i in range(3):
    print(dataset.audios[i]["spectral_rolloff"].shape)
    print(
        dataset.annotations.loc[dataset.annotations.Video == i].Frame_number.max() + 1
    )
    print()


(38681,)
38681

(38706,)
38706

(38498,)
38498



### Merge into the dataframe that will be used for model training

We merge all information into a dataframe with columns `[video_idx, frame_idx, loudness_rms, zcr, spectral_contrast_1` to `spectral_contrast_5, spectral_rolloff, Kermit, Audio_StatlerWaldorf]`.
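
The merge on the `(video, frame)` key can be sketched on toy data as follows. The values are made up; the column names follow the annotation columns used in this notebook. The `validate="one_to_one"` argument is an optional safeguard added here for illustration and is not used in the actual cell.

```python
import pandas as pd

# Toy per-frame features for two videos (values are made up).
features = pd.DataFrame(
    {
        "video_idx": [0, 0, 1, 1],
        "frame_idx": [0, 1, 0, 1],
        "loudness_rms": [0.2, 0.9, 0.4, 0.1],
    }
)
# Toy annotations with the column names of the ground-truth data.
annotations = pd.DataFrame(
    {
        "Video": [0, 0, 1, 1],
        "Frame_number": [0, 1, 0, 1],
        "Kermit": [0, 1, 0, 0],
        "Audio_StatlerWaldorf": [0, 0, 1, 0],
    }
)

merged = features.merge(
    annotations,
    how="left",
    left_on=["video_idx", "frame_idx"],
    right_on=["Video", "Frame_number"],
    validate="one_to_one",  # raises if a (video, frame) key is duplicated
).drop(columns=["Video", "Frame_number"])
```

A left merge keeps one row per feature frame, so the result has the same length as `features` as long as the key is unique on both sides.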


In [47]:
audio_features = pd.DataFrame()
for video_idx, audio in enumerate(dataset.audios):
    video_length_in_frames = (
        dataset.annotations.loc[
            dataset.annotations.Video == video_idx
        ].Frame_number.max()
        + 1
    )
    audio_features = pd.concat(
        [
            audio_features,
            pd.DataFrame(
                {
                    "video_idx": np.repeat(video_idx, video_length_in_frames),
                    "frame_idx": np.arange(0, video_length_in_frames),
                    "loudness_rms": audio["loudness_rms"],
                    "zcr": audio["zcr"],
                    "spectral_contrast_1": audio["spectral_contrast"][0],
                    "spectral_contrast_2": audio["spectral_contrast"][1],
                    "spectral_contrast_3": audio["spectral_contrast"][2],
                    "spectral_contrast_4": audio["spectral_contrast"][3],
                    "spectral_contrast_5": audio["spectral_contrast"][4],
                    "spectral_rolloff": audio["spectral_rolloff"],
                }
            ),
        ],
        ignore_index=True,
    )
# Add annotations
audio_features = audio_features.merge(
    dataset.annotations[["Video", "Frame_number", "Kermit", "Audio_StatlerWaldorf"]],
    how="left",
    left_on=["video_idx", "frame_idx"],
    right_on=["Video", "Frame_number"],
)
audio_features = audio_features.drop(columns=["Frame_number", "Video"])


Sanity check:


In [60]:
assert dataset.annotations.shape[0] == audio_features.shape[0]
assert dataset.annotations["Kermit"].sum() == audio_features["Kermit"].sum()
np.testing.assert_array_equal(
    dataset.annotations["Kermit"].values, audio_features["Kermit"].values
)
np.testing.assert_array_equal(
    dataset.annotations["Audio_StatlerWaldorf"].values,
    audio_features["Audio_StatlerWaldorf"].values,
)
np.testing.assert_array_equal(
    dataset.annotations["Video"].values,
    audio_features["video_idx"].values,
)
np.testing.assert_array_equal(
    dataset.annotations["Frame_number"].values,
    audio_features["frame_idx"].values,
)


In [48]:
audio_features.sample(10)


| | video_idx | frame_idx | loudness_rms | zcr | spectral_contrast_1 | spectral_contrast_2 | spectral_contrast_3 | spectral_contrast_4 | spectral_contrast_5 | spectral_rolloff | Kermit | Audio_StatlerWaldorf |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 12930 | 0 | 12930 | 0.36036 | 0.102804 | 0.400369 | 0.363732 | 0.396932 | 0.401274 | 1.0 | 1.0 | 0 | 0 |
| 78071 | 2 | 684 | 0.662528 | 0.15721 | 0.265356 | 0.280837 | 0.223662 | 0.299639 | 1.0 | 1.0 | 0 | 0 |
| 88219 | 2 | 10832 | 0.438041 | 0.321513 | 0.35514 | 0.126272 | 0.478009 | 0.399767 | 1.0 | 1.0 | 1 | 0 |
| 9109 | 0 | 9109 | 0.155753 | 0.201869 | 0.205393 | 0.198815 | 0.169442 | 0.437081 | 1.0 | 1.0 | 0 | 0 |
| 100196 | 2 | 22809 | 0.316285 | 0.113475 | 0.207844 | 0.535076 | 0.560436 | 0.522122 | 1.0 | 1.0 | 1 | 0 |
| 22081 | 0 | 22081 | 0.238455 | 0.095327 | 0.394886 | 0.366788 | 0.213366 | 0.320101 | 1.0 | 1.0 | 0 | 0 |
| 30942 | 0 | 30942 | 0.583307 | 0.136449 | 0.171773 | 0.327684 | 0.351697 | 0.396002 | 1.0 | 1.0 | 0 | 0 |
| 31623 | 0 | 31623 | 0.166722 | 0.08972 | 0.243988 | 0.177334 | 0.227879 | 0.360443 | 1.0 | 1.0 | 1 | 0 |
| 31552 | 0 | 31552 | 0.116954 | 0.063551 | 0.388961 | 0.211111 | 0.238656 | 0.406029 | 1.0 | 1.0 | 1 | 0 |
| 39681 | 1 | 1000 | 0.35615 | 0.173345 | 0.395841 | 0.143735 | 0.311451 | 0.265271 | 1.0 | 1.0 | 1 | 0 |


In [31]:
audio_features.describe()
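# TODO: We observe that the last band of spectral contrast, and spectral rolloff, are almost always 1.
# Likely cause: librosa.util.normalize defaults to axis=0, so the 2D arrays are normalized
# per frame (across bands) instead of over time. We might change those features or the axis.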


| | video_idx | frame_idx | loudness_rms | zcr | spectral_contrast_1 | spectral_contrast_2 | spectral_contrast_3 | spectral_contrast_4 | spectral_contrast_5 | spectral_rolloff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 115885.0 | 115885.0 | 115885.0 | 115885.0 | 115885.0 | 115885.0 | 115885.0 | 115885.0 | 115885.0 | 115885.0 |
| mean | 0.998421 | 19313.777952 | 0.279385 | 0.145905 | 0.327648 | 0.255978 | 0.332886 | 0.368598 | 0.999725 | 0.994106 |
| std | 0.816088 | 11151.279978 | 0.174607 | 0.098605 | 0.143751 | 0.12178 | 0.127852 | 0.125983 | 0.008958 | 0.076545 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 1.8e-05 | 5.7e-05 | 0.000268 | 0.001093 | 0.413782 | 0.0 |
| 25% | 0.0 | 9657.0 | 0.135342 | 0.087979 | 0.227772 | 0.172121 | 0.242314 | 0.27945 | 1.0 | 1.0 |
| 50% | 1.0 | 19314.0 | 0.264782 | 0.126307 | 0.305907 | 0.233841 | 0.313442 | 0.349007 | 1.0 | 1.0 |
| 75% | 2.0 | 28971.0 | 0.405522 | 0.172577 | 0.398856 | 0.31511 | 0.400679 | 0.434464 | 1.0 | 1.0 |
| max | 2.0 | 38705.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
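
The statistics above show `spectral_contrast_5` and `spectral_rolloff` pinned at 1 for most frames. One plausible explanation is the normalization axis: to our understanding, `librosa.util.normalize` normalizes along `axis=0` by default, which for a `(n_bands, n_frames)` array means per frame rather than over time. The sketch below reproduces the effect with a hypothetical NumPy stand-in, `max_normalize`, on made-up values; it is not librosa's implementation.

```python
import numpy as np


def max_normalize(values: np.ndarray, axis: int) -> np.ndarray:
    """Scale `values` so the largest absolute value along `axis` is 1."""
    peak = np.max(np.abs(values), axis=axis, keepdims=True)
    peak[peak == 0] = 1.0  # leave all-zero slices untouched
    return values / peak


# Toy stand-in for a (1, n_frames) spectral rolloff array, in Hz.
rolloff = np.array([[2000.0, 4000.0, 8000.0, 0.0]])

per_frame = max_normalize(rolloff, axis=0)[0]  # each frame divided by itself
over_time = max_normalize(rolloff, axis=1)[0]  # each frame divided by the global peak
```

With `axis=0`, every non-zero frame becomes 1 and the feature carries no information; with `axis=1`, the relative differences between frames are preserved. For the notebook this would suggest normalizing the 2D features along `axis=1`, or indexing `[0]` before normalizing the roll-off, which would need to be verified by re-running the cells.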


We extract the `X` and `y` matrices for the model, and then split them into train and test sets with an 80/20 ratio:


In [62]:
# Features (X) and the two binary targets (y), followed by a shuffled 80/20 train/test split
X = audio_features.drop(
    columns=["Kermit", "Audio_StatlerWaldorf", "video_idx", "frame_idx"]
)
y = audio_features[["Kermit", "Audio_StatlerWaldorf"]]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True
)  # TODO: think about the actual split since these are video frames, cf tuwel forum etc
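
Regarding the TODO on the split: neighbouring video frames are highly similar, so a fully shuffled split may place near-duplicates of training frames in the test set. One possible alternative, sketched below under our own assumptions, keeps blocks of consecutive frames together. The function `blockwise_split`, the block size of 250 frames (10 seconds at 25 fps), and the toy index are all hypothetical choices, not a decision made in this notebook.

```python
import numpy as np


def blockwise_split(video_idx, frame_idx, block_size=250, test_size=0.2, seed=0):
    """Boolean test mask that keeps each block of consecutive frames together."""
    video_idx = np.asarray(video_idx)
    frame_idx = np.asarray(frame_idx)
    # One group id per (video, block of `block_size` consecutive frames).
    blocks_per_video = frame_idx.max() // block_size + 1
    groups = video_idx * blocks_per_video + frame_idx // block_size
    unique_groups = np.unique(groups)
    # Draw whole blocks for the test set instead of single frames.
    rng = np.random.default_rng(seed)
    n_test = max(1, round(test_size * len(unique_groups)))
    test_groups = rng.choice(unique_groups, size=n_test, replace=False)
    return np.isin(groups, test_groups)


# Toy index: 2 videos with 1000 frames each.
video_idx = np.repeat([0, 1], 1000)
frame_idx = np.tile(np.arange(1000), 2)
test_mask = blockwise_split(video_idx, frame_idx)
```

With the real data, the mask could be applied as `X[test_mask]` and `X[~test_mask]` using the `video_idx` and `frame_idx` columns of `audio_features`. scikit-learn's `GroupShuffleSplit` offers the same idea with a ready-made API once such group ids exist.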


## Training the audio-based model


**TODO: classifier knn or kmeans...**

**Note:** Kermit's audio is not annotated separately; only his general presence is annotated. We therefore expect lower performance for him from a purely audio-based model.


## Evaluating the audio-based model
