# Detecting Kermit and Waldorf & Statler based on audio features

This notebook documents our approach to detecting Kermit and Waldorf & Statler from engineered audio features.  
We train a k-nearest-neighbors (kNN) classifier on per-frame audio features chosen with each character's voice in mind.

## Time sheet for this notebook

**Daniel Blasko:**
| Date | Task | Hours |
| --- | --- | --- |
| 27.11.2023 | Setup notebook, first experiments | 3 |
| 27.11.2023 | Implement "utils/MuppetDataset.py" that generally loads and handles the annotated video data | 1 |
| 28.11.2023 | Experiment & build feature extraction for both characters, align audio samples with frame annotations, format the dataset for the classifier | 3 |
| 29.11.2023 | Add KNN classifier. Try other audio features due to poor performance for Waldorf & Statler, experiment with different splits | 2.5 |


## Imports


In [233]:
import matplotlib.pyplot as plt
import librosa.feature as lf
import librosa
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    precision_score,
    recall_score,
    f1_score,
    average_precision_score,
    accuracy_score,
)


import sys

sys.path.append("..")
from utils.MuppetDataset import MuppetDataset


## Loading the data

Set the booleans below to `True` to extract the audio/frames from the .avi files if this has not been done previously.


In [234]:
extract_audio = False
extract_frames = True


In [235]:
video_paths = [
    "../data/Muppets-02-01-01.avi",
    "../data/Muppets-02-04-04.avi",
    "../data/Muppets-03-04-03.avi",
]
annotation_paths = [
    "../data/GroundTruth_Muppets-02-01-01.csv",
    "../data/GroundTruth_Muppets-02-04-04.csv",
    "../data/GroundTruth_Muppets-03-04-03.csv",
]

dataset = MuppetDataset(video_paths, annotation_paths, extract_audio, extract_frames)


Example for handling the data for video 0:

```python
dataset.audio_paths[0]
dataset.audios[0]
dataset.annotations.loc[dataset.annotations.Video == 0]
```


## Audio feature extraction


### Aligning audio features with video frame rate

The annotations are given per video frame, including the audio annotations. We therefore need to align the audio features with the video frames.

We start by checking the frame rate of the videos and recalling our audio sampling rate:


In [236]:
%%sh
ffprobe -v error -select_streams v:0 -show_entries stream=avg_frame_rate -of default=noprint_wrappers=1:nokey=1 ../data/Muppets-02-01-01.avi


25/1


In [237]:
FRAMES_PER_SECOND = 25
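
ffprobe reports the frame rate as a rational string such as `25/1`. As a small sketch (the helper name `parse_frame_rate` is ours, not part of the notebook or of ffprobe), the value can be parsed rather than hard-coded, which also copes with non-integer rates such as `30000/1001`:

```python
from fractions import Fraction


def parse_frame_rate(ffprobe_output: str) -> Fraction:
    """Parse ffprobe's rational frame-rate string, e.g. "25/1"."""
    return Fraction(ffprobe_output.strip())


fps = parse_frame_rate("25/1\n")
print(float(fps))  # 25.0
```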


In [238]:
AUDIO_SAMPLING_RATE = dataset.audios[0]["sr"]
AUDIO_SAMPLING_RATE


44100

There are 25 video frames per second and 44,100 audio samples per second.  
We therefore have $\frac{44100}{25} = 1764$ audio samples per frame and compute our audio features over windows of 1764 samples.
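
To make the alignment concrete, the following sketch maps a video frame index to the range of audio samples played during that frame. The helper `frame_to_sample_range` is a hypothetical name used only for illustration; the constants are the values measured above.

```python
FRAMES_PER_SECOND = 25
AUDIO_SAMPLING_RATE = 44100

SAMPLES_PER_FRAME = AUDIO_SAMPLING_RATE // FRAMES_PER_SECOND  # 1764


def frame_to_sample_range(frame_idx: int) -> tuple[int, int]:
    """Return the half-open [start, stop) sample range covered by a video frame."""
    start = frame_idx * SAMPLES_PER_FRAME
    return start, start + SAMPLES_PER_FRAME


print(SAMPLES_PER_FRAME)          # 1764
print(frame_to_sample_range(25))  # (44100, 45864)
```

Note that librosa centers its analysis windows by default, so the window used for frame `i` is centered on that frame's first sample rather than aligned exactly with this range.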


### Features for Kermit

We base our feature choice on the observation that Kermit has a distinct vocal pattern: his interventions start with screaming and transition to mumbling as he speaks, which should correspond to a high fundamental frequency.  
We therefore set out to capture pitch and loudness. The cell below extracts loudness as RMS energy and the zero-crossing rate (ZCR) per video frame; it does not compute an explicit fundamental-frequency estimate.

**We normalize all extracted features.**


In [239]:
for idx, audio in enumerate(dataset.audios):
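    # Pad audio with silence to extract features from the last 8 frames.
    # librosa centers its windows by default, so a signal of (n_frames - 1) * hop
    # samples yields n_frames feature columns; hence the trailing `- 1`.
    video_length_in_frames = (
        dataset.annotations.loc[dataset.annotations.Video == idx].Frame_number.max() + 1
    ) - 1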

    video_duration_seconds = (video_length_in_frames) / FRAMES_PER_SECOND
    required_audio_length = int(video_duration_seconds * AUDIO_SAMPLING_RATE)

    audio["audio"] = np.pad(
        audio["audio"],
        (0, required_audio_length - audio["audio"].shape[0]),
        "constant",
    )
    # Loudness (through RMS energy)
    audio["loudness_rms"] = librosa.util.normalize(
        librosa.feature.rms(
            y=audio["audio"],
            hop_length=int(audio["sr"] / FRAMES_PER_SECOND),
            frame_length=int(audio["sr"] / FRAMES_PER_SECOND),
        )[0]
    )
    # Zero crossing rate
    audio["zcr"] = librosa.util.normalize(
        librosa.feature.zero_crossing_rate(
            y=audio["audio"],
            hop_length=int(audio["sr"] / FRAMES_PER_SECOND),
            frame_length=int(audio["sr"] / FRAMES_PER_SECOND),
        )[0]
    )


With `hop_length` set to 1764 samples, each feature vector has exactly one value per video frame, which is what we want.
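
The frame count can be checked with plain arithmetic. With centered framing (librosa's default, `center=True`), a signal of `n` samples yields `1 + n // hop_length` feature columns, which is why the audio is padded to `(n_frames - 1) * 1764` samples. The sketch below assumes that formula and uses the 38,681 columns reported for video 0 further below; `n_feature_columns` is a hypothetical helper.

```python
HOP_LENGTH = 1764  # audio samples per video frame (44100 / 25)


def n_feature_columns(n_samples: int, hop_length: int = HOP_LENGTH) -> int:
    """Number of columns produced by centered framing."""
    return 1 + n_samples // hop_length


n_video_frames = 38681  # number of MFCC columns reported for video 0
padded_length = (n_video_frames - 1) * HOP_LENGTH
assert n_feature_columns(padded_length) == n_video_frames
```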


### Features for Waldorf & Statler

As for Waldorf & Statler, we observe that they have low, cranky voices with a very specific overtone structure.  
We therefore decide to extract spectral and timbre features.


In [240]:
for audio in dataset.audios:
    # Spectral contrast measures the difference in amplitude between peaks and valleys in the spectrum, which can capture aspects of timbre
    audio["spectral_contrast"] = librosa.util.normalize(
        librosa.feature.spectral_contrast(
            y=audio["audio"],
            sr=audio["sr"],
            hop_length=int(audio["sr"] / FRAMES_PER_SECOND),
            win_length=int(audio["sr"] / FRAMES_PER_SECOND),
            n_bands=5,
        )
    )  # shape (n_bands + 1, n_frames): one contrast value per frequency band for each frame
    # Spectral roll-off provides insights into the shape of the spectral energy distribution, affecting the timbre
    audio["spectral_rolloff"] = librosa.util.normalize(
        librosa.feature.spectral_rolloff(
            y=audio["audio"],
            sr=audio["sr"],
            hop_length=int(audio["sr"] / FRAMES_PER_SECOND),
        )[0]
    )
    # Chroma features
    audio["chroma"] = librosa.util.normalize(
        librosa.feature.chroma_stft(
            y=audio["audio"],
            sr=audio["sr"],
            hop_length=int(audio["sr"] / FRAMES_PER_SECOND),
            win_length=int(audio["sr"] / FRAMES_PER_SECOND),
        )
    )
    # MFCCs
    audio["mfcc"] = librosa.util.normalize(
        librosa.feature.mfcc(
            y=audio["audio"],
            sr=audio["sr"],
            hop_length=int(audio["sr"] / FRAMES_PER_SECOND),
            win_length=int(audio["sr"] / FRAMES_PER_SECOND),
            n_mfcc=13,
        )
    )
    # TODO: try spectral centroid


In [241]:
dataset.audios[0]["mfcc"].shape


(13, 38681)

In [242]:
for i in range(3):
    assert (
        dataset.audios[i]["spectral_rolloff"].shape[0]
        == dataset.annotations.loc[dataset.annotations.Video == i].Frame_number.max()
        + 1
    )


### Merge into the dataframe that will be used for model training & prepare the model dataset

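We merge all information into a dataframe with columns `[video_idx, frame_idx, loudness_rms, zcr, mfcc1, ..., mfcc13, Kermit, Audio_StatlerWaldorf]`. The spectral contrast, chroma, and spectral roll-off columns are commented out in the cell below and can be re-enabled for experiments; the saved outputs further down still list the spectral contrast columns.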


In [243]:
audio_features = pd.DataFrame()
for video_idx, audio in enumerate(dataset.audios):
    video_length_in_frames = (
        dataset.annotations.loc[
            dataset.annotations.Video == video_idx
        ].Frame_number.max()
        + 1
    )
    audio_features = pd.concat(
        [
            audio_features,
            pd.DataFrame(
                {
                    "video_idx": np.repeat(video_idx, video_length_in_frames),
                    "frame_idx": np.arange(0, video_length_in_frames),
                    "loudness_rms": audio["loudness_rms"],
                    "zcr": audio["zcr"],
                    # "spectral_contrast_1": audio["spectral_contrast"][0],
                    # "spectral_contrast_2": audio["spectral_contrast"][1],
                    # "spectral_contrast_3": audio["spectral_contrast"][2],
                    # "spectral_contrast_4": audio["spectral_contrast"][3],
                    # "spectral_contrast_5": audio["spectral_contrast"][4],
                    # "spectral_contrast_6": audio["spectral_contrast"][5],
                    # "chroma1": audio["chroma"][0],
                    # "chroma2": audio["chroma"][1],
                    # "chroma3": audio["chroma"][2],
                    # "chroma4": audio["chroma"][3],
                    # "chroma5": audio["chroma"][4],
                    # "chroma6": audio["chroma"][5],
                    "mfcc1": audio["mfcc"][0],
                    "mfcc2": audio["mfcc"][1],
                    "mfcc3": audio["mfcc"][2],
                    "mfcc4": audio["mfcc"][3],
                    "mfcc5": audio["mfcc"][4],
                    "mfcc6": audio["mfcc"][5],
                    "mfcc7": audio["mfcc"][6],
                    "mfcc8": audio["mfcc"][7],
                    "mfcc9": audio["mfcc"][8],
                    "mfcc10": audio["mfcc"][9],
                    "mfcc11": audio["mfcc"][10],
                    "mfcc12": audio["mfcc"][11],
                    "mfcc13": audio["mfcc"][12],
                    # "spectral_rolloff": audio["spectral_rolloff"],
                }
            ),
        ],
        ignore_index=True,
    )
# Add annotations
audio_features = audio_features.merge(
    dataset.annotations[["Video", "Frame_number", "Kermit", "Audio_StatlerWaldorf"]],
    how="left",
    left_on=["video_idx", "frame_idx"],
    right_on=["Video", "Frame_number"],
)
audio_features = audio_features.drop(columns=["Frame_number", "Video"])


Sanity check:


In [244]:
assert dataset.annotations.shape[0] == audio_features.shape[0]
assert dataset.annotations["Kermit"].sum() == audio_features["Kermit"].sum()
np.testing.assert_array_equal(
    dataset.annotations["Kermit"].values, audio_features["Kermit"].values
)
np.testing.assert_array_equal(
    dataset.annotations["Audio_StatlerWaldorf"].values,
    audio_features["Audio_StatlerWaldorf"].values,
)
np.testing.assert_array_equal(
    dataset.annotations["Video"].values,
    audio_features["video_idx"].values,
)
np.testing.assert_array_equal(
    dataset.annotations["Frame_number"].values,
    audio_features["frame_idx"].values,
)
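
As a complement to the assertions above, pandas can enforce key uniqueness during the merge itself through the `validate` and `indicator` arguments of `DataFrame.merge`. The sketch below uses toy frames, not the notebook's data:

```python
import pandas as pd

features = pd.DataFrame(
    {"video_idx": [0, 0, 1], "frame_idx": [0, 1, 0], "zcr": [0.1, 0.2, 0.3]}
)
annotations = pd.DataFrame(
    {"Video": [0, 0, 1], "Frame_number": [0, 1, 0], "Kermit": [0, 1, 0]}
)

merged = features.merge(
    annotations,
    how="left",
    left_on=["video_idx", "frame_idx"],
    right_on=["Video", "Frame_number"],
    validate="one_to_one",  # raises MergeError if a key is duplicated on either side
    indicator=True,  # adds a `_merge` column: "both", "left_only" or "right_only"
)

# Every feature row should have found its annotation
assert (merged["_merge"] == "both").all()
```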


In [245]:
audio_features.sample(10)


| | video_idx | frame_idx | loudness_rms | zcr | spectral_contrast_1 | spectral_contrast_2 | spectral_contrast_3 | spectral_contrast_4 | spectral_contrast_5 | spectral_contrast_6 | ... | mfcc6 | mfcc7 | mfcc8 | mfcc9 | mfcc10 | mfcc11 | mfcc12 | mfcc13 | Kermit | Audio_StatlerWaldorf |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 29064 | 0 | 29064 | 0.190057 | 0.127349 | 0.214986 | 0.200421 | 0.461991 | 0.325197 | 0.391796 | 1.0 | ... | 0.066432 | -0.047954 | -0.06633 | 0.015741 | -0.024229 | -0.069892 | 0.007068 | 0.012084 | 0 | 0 |
| 114740 | 2 | 37353 | 0.461068 | 0.223841 | 0.258352 | 0.217427 | 0.072471 | 0.272894 | 0.230712 | 1.0 | ... | 0.127754 | -0.005572 | -0.049672 | 0.0071 | -0.063472 | 0.010297 | -0.067701 | 0.022616 | 1 | 0 |
| 62820 | 1 | 24139 | 0.327277 | 0.141404 | 0.163863 | 0.198034 | 0.321713 | 0.400806 | 0.348423 | 1.0 | ... | 0.027972 | -0.12262 | 0.025828 | 0.017081 | -0.025574 | 0.002033 | -0.048603 | 0.009243 | 0 | 0 |
| 25620 | 0 | 25620 | 0.187508 | 0.078288 | 0.285476 | 0.227883 | 0.444331 | 0.53086 | 0.611922 | 1.0 | ... | -0.005261 | -0.023979 | -0.050219 | -0.037146 | -0.004822 | -0.043805 | -0.040593 | 0.020285 | 0 | 0 |
| 40207 | 1 | 1526 | 0.149269 | 0.047813 | 0.232463 | 0.177387 | 0.280537 | 0.188667 | 0.432933 | 1.0 | ... | 0.122597 | 0.029945 | 0.012056 | 0.02149 | -0.009446 | 0.0056 | -0.005093 | -0.02774 | 0 | 0 |
| 92107 | 2 | 14720 | 0.394128 | 0.329801 | 0.230898 | 0.337443 | 0.286809 | 0.379336 | 0.394405 | 1.0 | ... | -0.039659 | -0.211499 | -0.034075 | 0.004316 | -0.036533 | -0.069758 | -0.128003 | -0.014164 | 1 | 0 |
| 2058 | 0 | 2058 | 0.574579 | 0.115866 | 0.27035 | 0.174807 | 0.299958 | 0.307534 | 0.385957 | 1.0 | ... | 0.011399 | -0.104602 | 0.020812 | 0.059509 | -0.04362 | 0.020381 | 0.00911 | 0.055935 | 1 | 0 |
| 68856 | 1 | 30175 | 0.108687 | 0.09766 | 0.300719 | 0.326166 | 0.286714 | 0.332923 | 0.292133 | 1.0 | ... | -0.009688 | -0.039565 | -0.036766 | -0.027816 | -0.048941 | -0.028453 | -0.008477 | 0.034142 | 1 | 0 |
| 91826 | 2 | 14439 | 0.528721 | 0.156291 | 0.36576 | 0.115678 | 0.253128 | 0.3764 | 0.225222 | 1.0 | ... | -0.010659 | -0.018178 | -0.054126 | 0.046724 | -0.067977 | -0.015439 | -0.010518 | -0.013792 | 0 | 0 |
| 29730 | 0 | 29730 | 0.342704 | 0.143006 | 0.453356 | 0.175716 | 0.293159 | 0.398191 | 0.386466 | 1.0 | ... | -0.034108 | -0.061264 | -0.09768 | -0.084653 | -0.07776 | 0.016073 | -0.043502 | -0.034877 | 0 | 0 |


In [246]:
audio_features.describe()
# TODO: We observe that the last band of spectral contrast, and spectral rolloff, are almost always 1. We might change those features.


| | video_idx | frame_idx | loudness_rms | zcr | spectral_contrast_1 | spectral_contrast_2 | spectral_contrast_3 | spectral_contrast_4 | spectral_contrast_5 | spectral_contrast_6 | ... | mfcc6 | mfcc7 | mfcc8 | mfcc9 | mfcc10 | mfcc11 | mfcc12 | mfcc13 | Kermit | Audio_StatlerWaldorf |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 115885.0 | 115885.0 | 115885.0 | 115885.0 | 115885.0 | 115885.0 | 115885.0 | 115885.0 | 115885.0 | 115885.0 | ... | 115885.0 | 115885.0 | 115885.0 | 115885.0 | 115885.0 | 115885.0 | 115885.0 | 115885.0 | 115885.0 | 115885.0 |
| mean | 0.998421 | 19313.777952 | 0.277597 | 0.142501 | 0.335478 | 0.270967 | 0.357891 | 0.399292 | 0.422404 | 0.999547 | ... | 0.020853 | -0.028263 | -0.037581 | -0.010294 | -0.035551 | -0.009281 | -0.011054 | -0.010243 | 0.286569 | 0.023446 |
| std | 0.816088 | 11151.279978 | 0.175298 | 0.097122 | 0.153198 | 0.131774 | 0.141103 | 0.1403 | 0.122933 | 0.009654 | ... | 0.064005 | 0.048802 | 0.050639 | 0.038997 | 0.041687 | 0.034982 | 0.033086 | 0.031954 | 0.45216 | 0.151315 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 1.7e-05 | 5.7e-05 | 0.000265 | 0.001084 | 0.00439 | 0.497423 | ... | -0.244149 | -0.452329 | -0.30507 | -0.311839 | -0.294416 | -0.342918 | -0.211301 | -0.199215 | 0.0 | 0.0 |
| 25% | 0.0 | 9657.0 | 0.132622 | 0.085595 | 0.227649 | 0.178374 | 0.255898 | 0.298568 | 0.33559 | 1.0 | ... | -0.021103 | -0.056437 | -0.069311 | -0.034644 | -0.061634 | -0.030407 | -0.031132 | -0.030002 | 0.0 | 0.0 |
| 50% | 1.0 | 19314.0 | 0.262112 | 0.123173 | 0.310534 | 0.245788 | 0.334437 | 0.375382 | 0.401688 | 1.0 | ... | 0.013232 | -0.022295 | -0.035913 | -0.010772 | -0.032805 | -0.008476 | -0.009237 | -0.010033 | 0.0 | 0.0 |
| 75% | 2.0 | 28971.0 | 0.404137 | 0.168212 | 0.412078 | 0.336552 | 0.434978 | 0.47376 | 0.486281 | 1.0 | ... | 0.056118 | 0.004748 | -0.005087 | 0.013125 | -0.007206 | 0.0121 | 0.009801 | 0.009176 | 1.0 | 0.0 |
| max | 2.0 | 38705.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 0.41602 | 0.17259 | 0.318462 | 0.174686 | 0.290624 | 0.252358 | 0.194851 | 0.188683 | 1.0 | 1.0 |
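
One possible explanation for the TODO above: `librosa.util.normalize` defaults to `norm=np.inf` and `axis=0`. For a 2-D feature matrix of shape `(n_features, n_frames)`, that rescales each frame (column) by its largest absolute value across features, rather than each feature across time, so a band that dominates every frame ends up pinned at 1. The NumPy-only sketch below mimics that max-abs scaling on made-up numbers to show the difference; `max_abs_normalize` is a hypothetical helper, and whether per-feature scaling (`axis=1`) is the better choice here is an assumption to verify.

```python
import numpy as np

# Toy matrix shaped like librosa output: rows are features (bands), columns are frames
features = np.array(
    [
        [1.0, 2.0, 4.0],
        [10.0, 40.0, 20.0],
    ]
)


def max_abs_normalize(x: np.ndarray, axis: int) -> np.ndarray:
    """Divide by the maximum absolute value along `axis` (max-abs scaling)."""
    return x / np.max(np.abs(x), axis=axis, keepdims=True)


per_frame = max_abs_normalize(features, axis=0)    # each column peaks at 1
per_feature = max_abs_normalize(features, axis=1)  # each row peaks at 1

print(per_frame[1])    # values 1.0, 1.0, 1.0: the dominant band becomes constant
print(per_feature[1])  # values 0.25, 1.0, 0.5: variation over time is preserved
```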


We extract the `X` and `y` matrices for the model:


In [247]:
# Extract the feature matrix `X` and the label matrix `y` (the train/test split follows in the next cells)
X = audio_features.drop(
    columns=["Kermit", "Audio_StatlerWaldorf", "video_idx", "frame_idx"]
)
y = audio_features[["Kermit", "Audio_StatlerWaldorf"]]


We then split into train and test sets by an 80/20 ratio.
However, we do **not** use a shuffled random split, as one often would for tabular data: neighboring frames are highly correlated, so a random split would place near-duplicate frames in both sets and contaminate the test set.

For this reason, we use the first 80% of the ordered frames as training data and the last 20% as testing data.


In [248]:
# Chronological split at a single index, so that train and test always partition the data
split_idx = int(X.shape[0] * 0.8)
X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]
y_train, y_test = y.iloc[:split_idx], y.iloc[split_idx:]

assert X.shape[0] == X_train.shape[0] + X_test.shape[0]
assert y.shape[0] == y_train.shape[0] + y_test.shape[0]
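
The cell above takes the last 20% of the concatenated frames, so the whole test set comes from the end of the last video. A possible variant, sketched below on toy data, applies the same chronological 80/20 cut inside each video, so that every video contributes to both sets while neighboring frames still stay together. The function name `chronological_split_per_video` is hypothetical.

```python
import pandas as pd


def chronological_split_per_video(df, train_fraction=0.8):
    """Split each video's frames in order: first part for training, rest for testing."""
    train_parts, test_parts = [], []
    for _, frames in df.groupby("video_idx", sort=True):
        frames = frames.sort_values("frame_idx")
        cut = int(len(frames) * train_fraction)
        train_parts.append(frames.iloc[:cut])
        test_parts.append(frames.iloc[cut:])
    return pd.concat(train_parts), pd.concat(test_parts)


toy = pd.DataFrame(
    {
        "video_idx": [0] * 5 + [1] * 10,
        "frame_idx": list(range(5)) + list(range(10)),
    }
)
train, test = chronological_split_per_video(toy)
print(len(train), len(test))  # 12 3
```

Frames at the cut point of each video remain correlated across the two sets, so a gap of a few seconds around each cut may be worth considering.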


In [249]:
# X_train, X_test, y_train, y_test = train_test_split(
#     X, y, test_size=0.2, random_state=42
# )


## Training the audio-based model


For this task, we decide to use a k-nearest-neighbors classifier.
Right now, the `y` matrix holds one binary indicator column per character (`Kermit`, `Audio_StatlerWaldorf`), but the classifier needs a single column of class labels, where:

- 0: neither Kermit nor Waldorf & Statler are present
- 1: Kermit is present
- 2: Waldorf & Statler are present
- 3: both are present
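
A compact way to obtain this encoding from the two indicator columns is `Kermit + 2 * Audio_StatlerWaldorf`. The sketch below, on the four possible indicator pairs, also shows why an `argmax`-plus-`any` formulation cannot produce class 3: `argmax` returns 0 when both columns are 1.

```python
import numpy as np

# Columns: [Kermit, Audio_StatlerWaldorf]
indicators = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])

class_ids = indicators[:, 0] + 2 * indicators[:, 1]
print(class_ids)  # [0 1 2 3]

argmax_plus_any = np.argmax(indicators, axis=1) + np.any(indicators, axis=1)
print(argmax_plus_any)  # [0 1 2 1]

# Decoding back to the indicator columns
decoded = np.column_stack([class_ids % 2, class_ids // 2])
assert (decoded == indicators).all()
```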

**Important remark**: in the ground truth, Kermit is the only character whose presence is annotated through a single column that does not distinguish between audio and vision. We therefore expect this audio-based model to score worse on Kermit's evaluation metrics.


In [250]:
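# Encode the two indicator columns as one class id: Kermit contributes 1 and
# Waldorf & Statler contribute 2, so frames with both present get class 3
# (argmax + any would map those frames to class 1 instead).
y_train = (y_train["Kermit"] + 2 * y_train["Audio_StatlerWaldorf"]).to_numpy()
y_test = (y_test["Kermit"] + 2 * y_test["Audio_StatlerWaldorf"]).to_numpy()


def convert_predictions_to_one_hot(predictions):
    """Map class ids 0..3 back to the [Kermit, Audio_StatlerWaldorf] indicator columns."""
    predictions = np.asarray(predictions)
    return np.column_stack([predictions % 2, predictions // 2])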


**Training the kNN classifier:**


In [251]:
knn = KNeighborsClassifier(n_neighbors=2)  # TODO: tune n_neighbors
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
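
For the `n_neighbors` TODO, one option is a grid search with time-ordered folds, which keeps the no-shuffling argument from the split above. The sketch below runs on synthetic data; the candidate values, the `f1_macro` scoring, and the number of folds are illustrative assumptions, not tuned choices.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the ordered per-frame features and labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

candidates = [1, 3, 5, 9]
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": candidates},
    cv=TimeSeriesSplit(n_splits=3),  # each fold trains on the past, validates on the future
    scoring="f1_macro",
)
search.fit(X_demo, y_demo)

best_k = search.best_params_["n_neighbors"]
print(best_k)
```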


## Evaluating the audio-based model


In [252]:
# Binary labels for Kermit (classes 1 and 3)
y_test_kermit = (y_test == 1) | (y_test == 3)
y_pred_kermit = (y_pred == 1) | (y_pred == 3)

# Binary labels for Waldorf & Statler (classes 2 and 3)
y_test_wald_stat = (y_test == 2) | (y_test == 3)
y_pred_wald_stat = (y_pred == 2) | (y_pred == 3)


In [253]:
# Compute metrics for Kermit
# Note: average precision is computed from hard 0/1 predictions here, not from scores
precision_kermit = precision_score(y_test_kermit, y_pred_kermit)
recall_kermit = recall_score(y_test_kermit, y_pred_kermit)
f1_kermit = f1_score(y_test_kermit, y_pred_kermit)
map_kermit = average_precision_score(y_test_kermit, y_pred_kermit)

# Compute metrics for Waldorf & Statler
precision_wald_stat = precision_score(y_test_wald_stat, y_pred_wald_stat)
recall_wald_stat = recall_score(y_test_wald_stat, y_pred_wald_stat)
f1_wald_stat = f1_score(y_test_wald_stat, y_pred_wald_stat)
map_wald_stat = average_precision_score(y_test_wald_stat, y_pred_wald_stat)

# TODO: compute metrics for the classifier as a whole
# TODO: based on results change feature eng, training, tuning, splitting... classifier?
# TODO: multi-dim extracted features -> max does an avg...
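
`average_precision_score` is designed to rank examples by a continuous score; given hard 0/1 predictions it only evaluates a single operating point. A hedged sketch of the alternative, using `predict_proba` on synthetic binary data, is shown below. In the notebook's multi-class setup, the Kermit score would be the summed probability of classes 1 and 3, which is an assumption about how one would adapt it.

```python
import numpy as np
from sklearn.metrics import average_precision_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary problem standing in for "character present / absent"
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.8, size=400) > 0.5).astype(int)
X_tr, X_te = X_demo[:300], X_demo[300:]
y_tr, y_te = y_demo[:300], y_demo[300:]

knn_demo = KNeighborsClassifier(n_neighbors=15).fit(X_tr, y_tr)

hard_predictions = knn_demo.predict(X_te)
# Probability of the positive class: the fraction of neighbors labeled 1
positive_column = list(knn_demo.classes_).index(1)
scores = knn_demo.predict_proba(X_te)[:, positive_column]

ap_from_labels = average_precision_score(y_te, hard_predictions)
ap_from_scores = average_precision_score(y_te, scores)
print(ap_from_labels, ap_from_scores)
```

With `n_neighbors=2`, as in the cell above, `predict_proba` can only take the values 0, 0.5 and 1, so a larger `k` gives a finer ranking.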


In [254]:
print(
    f"***Kermit***\n\tPrecision: {precision_kermit}\n\tRecall: {recall_kermit}\n\tF1: {f1_kermit}\n\tMAP: {map_kermit}"
)
print(
    f"***Waldorf & Statler***\n\tPrecision: {precision_wald_stat}\n\tRecall: {recall_wald_stat}\n\tF1: {f1_wald_stat}\n\tMAP: {map_wald_stat}"
)


***Kermit***
	Precision: 0.4508595523840415
	Recall: 0.13525347864162693
	F1: 0.20808383233532932
	MAP: 0.44442080262474315
***Waldorf & Statler***
	Precision: 0.11320754716981132
	Recall: 0.01948051948051948
	F1: 0.03324099722991689
	MAP: 0.015235501037544542


**_Kermit_**

- Precision: 0.6063063063063063
- Recall: 0.5106221547799696
- F1: 0.5543657331136738
- MAP: 0.44874000030080674

**_Waldorf & Statler_**

- Precision: 0.4318181818181818
- Recall: 0.14559386973180077
- F1: 0.2177650429799427
- MAP: 0.08211329536796372

In [255]:
# Global
accuracy = accuracy_score(y_test, y_pred)
accuracy


0.5313025844587307

In [256]:
# Count of the different values in y_pred
pd.Series(y_pred).value_counts()


0    20041
1     3083
2       53
Name: count, dtype: int64

In [257]:
pd.Series(y_test).value_counts()


0    12592
1    10277
2      308
Name: count, dtype: int64
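
These counts give a useful reference point: a classifier that always predicts the most frequent test class would reach the accuracy computed below. The numbers are copied from the outputs above.

```python
test_counts = {0: 12592, 1: 10277, 2: 308}  # y_test value counts from above
knn_accuracy = 0.5313025844587307  # global accuracy reported above

majority_baseline = max(test_counts.values()) / sum(test_counts.values())
print(f"majority-class baseline: {majority_baseline:.4f}")  # 0.5433
print(f"kNN accuracy:            {knn_accuracy:.4f}")  # 0.5313
```

Here the kNN accuracy sits slightly below that baseline, which is consistent with the low recall values reported for both characters.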