# ELEC-E5500 Speech Processing -- Autumn 2025
## Exercise 4: Wake Word Detection


## Overview:

**In this exercise, you will implement and train your own custom wake word detector.**

Submit the notebook by 23:59 on **29.09.24** (only one person in the group needs to submit it).

Do this together as a group, and choose a custom wake word. A wake word with at least 3 syllables often works best, as it greatly reduces the false positive rate.

## Write the names of the people in the team here! TEAM MEMBERS:

* Khanh Ha
* Enikő Palencsár
* Angelo Parravano
* Jere Tahvanainen
* Tuan Tran

Since the Jupyter environment here consistently breaks when running the training, it is much easier to train the wake word detection model on Google Colab.
The demo we are using today is this: https://colab.research.google.com/drive/1q1oe2zOyZp7UsB3jJiQ1IFn8z5YfjwEb?usp=sharing#scrollTo=1cbqBebHXjFD

Change the target_word to whatever you want.


After training the wake word detection model on Google Colab (it should take a couple of hours on the free plan), complete the following steps:

1. Upload your model (both the tflite and the onnx file) to your Jupyter folder; these files will be submitted

2. Write your target_word in the next cell (both how it was supposed to sound and the actual string you used as the target word)

3. Explain how the model is trained, what datasets are used, what model is used, and what parameters can you change?

4. Change at least 3 things in the training and see what effect they have on the final output. Here are some ideas: what happens when you increase the number of samples, what happens when you change the negative sample penalization, or what happens when you leave out some of the augmentations? (These are only ideas; you can also choose other things.) Write about the effects in the next cell.

5. Find a new dataset, that could be used for data augmentation (for example huggingface is a good place to start)

6. Write about strategies for improving the performance of the model (for example different architectures, losses, training data or real recordings; elaborate as concretely as you can)

Have each member of your group record themselves saying the wake word you chose, and upload the recordings to Jupyter.

Then run the code cell below to see what your model predicts.

7. How does the model perform with real life recordings?
8. Can you find a false positive sample, that consistently activates the wake word, without actually being the wake word? - Mention what words you tried and why.

In [None]:
# First we need to install openwakeword
!pip install openwakeword

In [None]:
# Get predictions for individual WAV files (16-bit, 16 kHz PCM)
import openwakeword
from openwakeword.model import Model

# This might be helpful, as it downloads some pretrained models: openwakeword.utils.download_models()
# The model file can be either the tflite or the onnx file. More information can be found here:
# https://github.com/dscripka/openWakeWord/blob/main/openwakeword/model.py

model = Model(wakeword_models=["path/to/model.tflite"])
model.predict_clip("path/to/wavfile")

# The predictions are made frame by frame (for example, 80 ms at a time). You can either use the average of the
# predictions or check whether certain frames are activated.
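
The two aggregation strategies mentioned in the comments above (averaging the scores, or checking whether individual frames are activated) can be sketched as a small helper. This is a minimal sketch and not part of openWakeWord: it assumes that `predict_clip` returns a list with one dictionary per frame, mapping the model name to a score between 0 and 1. The function name `summarize_predictions`, the model name `"activate_assistant"` and the threshold of 0.5 are illustrative assumptions.

```python
def summarize_predictions(frame_predictions, model_name, threshold=0.5):
    """Aggregate frame-level scores for one model (hypothetical helper).

    frame_predictions: list of dicts, one per frame, mapping model name -> score
                       (assumed shape of the predict_clip output).
    threshold:         illustrative activation threshold, not a library default.
    """
    scores = [frame[model_name] for frame in frame_predictions]
    active = [score >= threshold for score in scores]
    return {
        "mean": sum(scores) / len(scores),   # average score over the whole clip
        "max": max(scores),                  # strongest single-frame activation
        "n_active_frames": sum(active),      # how many frames crossed the threshold
        "detected": any(active),             # True if at least one frame fired
    }


# Synthetic frame scores standing in for model.predict_clip(...) output
example_frames = [
    {"activate_assistant": 0.0},
    {"activate_assistant": 0.9},
    {"activate_assistant": 0.3},
]
summary = summarize_predictions(example_frames, "activate_assistant")
```

The mean is diluted by silence before and after the wake word, so for short phrases in longer clips the maximum or the number of active frames is probably the more informative statistic.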

### 2. Write your target_word in the next cell (both how it was supposed to sound and the actual string you used as the target word)

**target:** activate assistant /ˈæk.təˌveɪt əˈsɪs.tənt/ \
**actual string used:** activate assistant

### 3. Explain how the model is trained, what datasets are used, what model is used, and what parameters can you change?

#### Model and training

The *openWakeWord* model consists of three separate parts: pre-processing, feature extraction and classification. In pre-processing, mel-spectrograms are computed from the input audio, which should be 16-bit, 16 kHz PCM data. The frequency bands of a mel-spectrogram are spaced to reflect how sensitive human hearing is to them, so higher frequencies are compressed more. The feature extraction layers then convert the mel-spectrograms into audio embeddings. This component, built from simple convolutional blocks, is a reimplementation of a TFHub module, and its strength lies in being pre-trained on a large dataset. During training of the wake word detector, these convolutional layers are frozen and only the last part of the model, the classification network, learns weights. In our case, the classifier is a fully connected feedforward neural network with ReLU activation functions.
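
To illustrate why higher frequencies are compressed more on the mel scale, the sketch below spaces band edges evenly in mel and converts them back to Hz. It uses the common formula mel = 2595 · log10(1 + f/700), which is an assumption: the exact filterbank parameters of *openWakeWord* are not described here, and the number of bands (8) is chosen only for readability. The upper limit of 8000 Hz is the Nyquist frequency of 16 kHz audio.

```python
import math


def hz_to_mel(frequency_hz):
    """Convert a frequency in Hz to mel (common 2595*log10 variant)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_bands, f_min=0.0, f_max=8000.0):
    """Return n_bands + 1 band edges in Hz that are equally spaced in mel."""
    mel_min, mel_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (mel_max - mel_min) / n_bands
    return [mel_to_hz(mel_min + i * step) for i in range(n_bands + 1)]


edges = mel_band_edges(n_bands=8)
# Width of each band in Hz: equal steps in mel become wider and wider in Hz
widths = [upper - lower for lower, upper in zip(edges, edges[1:])]
```

Each band covers the same range in mel, but `widths` grows from band to band, so a single high band summarises a much larger range of frequencies than a low band does.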

For training the model, the Adam optimizer is used. The learning rate follows a schedule that starts with a warm-up phase (1/5 of the total steps), during which it increases linearly. It is then held constant for a set number of steps (1/3 of the total steps) and finally decays toward zero along a cosine curve. Training also involves hard example mining: each training batch is filtered to focus on difficult samples, that is, negative samples whose predicted values are too high and positive samples whose predictions are too low.
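
The schedule and the batch filtering described above can be sketched in plain Python. This is an illustration under stated assumptions rather than the training code itself: the peak learning rate of `1e-4` and the thresholds `0.1` and `0.9` are made-up values, and the function names are hypothetical.

```python
import math


def lr_at_step(step, total_steps, peak_lr=1e-4, warmup_frac=1 / 5, hold_frac=1 / 3):
    """Linear warm-up, constant hold, then cosine decay toward zero."""
    warmup_steps = int(total_steps * warmup_frac)
    hold_steps = int(total_steps * hold_frac)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps          # linear increase from 0
    if step < warmup_steps + hold_steps:
        return peak_lr                                # hold at the peak value
    decay_steps = max(1, total_steps - warmup_steps - hold_steps)
    progress = min(1.0, (step - warmup_steps - hold_steps) / decay_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))  # cosine decay


def select_hard_examples(scores, labels, neg_threshold=0.1, pos_threshold=0.9):
    """Return indices of samples the model still gets wrong or is unsure about.

    Negatives (label 0) are kept if their score is too high,
    positives (label 1) are kept if their score is too low.
    """
    hard = []
    for index, (score, label) in enumerate(zip(scores, labels)):
        if label == 0 and score >= neg_threshold:
            hard.append(index)
        elif label == 1 and score <= pos_threshold:
            hard.append(index)
    return hard


# With the default of 10 000 steps: 2000 warm-up, 3333 hold, the rest cosine decay
schedule = [lr_at_step(s, 10_000) for s in range(0, 10_001, 1000)]
hard_indices = select_hard_examples([0.05, 0.6, 0.95, 0.4], [0, 0, 1, 1])
```

In the example batch, only the confidently wrong negative (score 0.6) and the weak positive (score 0.4) would contribute to the loss.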

Training runs in three separate sequences. The second and third use only 10% of the total steps but increase the weight of negative examples to reduce the false positive rate. The saved models are then merged: only those above the 90th percentile are considered, and their weights are averaged. At each training step, the current model is saved only if it meets certain performance criteria relative to the previously saved models (low false positive rate, high recall).
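
The merging step can be pictured as follows. This is a simplified sketch, not the *openWakeWord* implementation: each checkpoint is reduced to a single hypothetical quality score and a flat list of weights, and the percentile is computed with the nearest-rank method, which may differ from what the training code does.

```python
def average_top_checkpoints(checkpoints, percentile=90):
    """Average the weights of checkpoints at or above the given score percentile.

    checkpoints: list of (score, weights) pairs, where weights is a flat list
                 of floats of equal length (hypothetical representation).
    """
    scores = sorted(score for score, _ in checkpoints)
    # Nearest-rank percentile, computed with integer arithmetic (ceiling division)
    rank = max(1, (percentile * len(scores) + 99) // 100)
    threshold = scores[rank - 1]
    selected = [weights for score, weights in checkpoints if score >= threshold]
    n_params = len(selected[0])
    return [sum(w[i] for w in selected) / len(selected) for i in range(n_params)]


# Ten toy checkpoints with scores 1..10 and two parameters each
toy_checkpoints = [(i, [float(i), 2.0 * i]) for i in range(1, 11)]
merged = average_top_checkpoints(toy_checkpoints)
```

With ten checkpoints, only the two best ones (scores 9 and 10) pass the threshold, so `merged` is the element-wise mean of their weights.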

---

#### Datasets - training

The model is trained on synthetically created positive samples. These are generated with a text-to-speech tool called *Piper Sample Generator*, which uses multiple speaker voices and introduces variability into the generated speech. The negative training samples are drawn from the *ACAV100M* dataset. Their features have been pre-computed to act as general-purpose negative training data in *openWakeWord*, which speeds up the training process. The false-positive validation set contains approximately 11 hours of audio, including the *DiPCo* dataset, the *Santa Barbara Corpus of Spoken American English* and some clips from the *MUSDB Music Dataset*, reverberated using the *MIT impulse response recordings*.

**ACAV100M**

*ACAV100M* is an automatically curated dataset containing 31 years' worth of 10-second clips of speech, noise and music in multiple languages, retrieved from YouTube.

**DiPCo**

The *Dinner Party Corpus* was created with the assistance of volunteers, who simulated a dinner-party scenario in a lab.

**Santa Barbara Corpus of Spoken American English**

This dataset is a collection of naturally occurring spoken interactions gathered from across the United States. It includes speech from individuals representing diverse regional backgrounds, ages, occupations, genders and ethnicities.

**MUSDB Music Dataset**

*MUSDB* is a dataset of 150 music tracks of different genres.

---

#### Datasets - augmentation

Data augmentation is applied to the samples before training. It includes adding coloured noise, real-world background noise, music and echo, as well as equalizer, distortion, pitch shift, volume change and frequency band removal effects. Each effect is applied with a given probability.
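
The idea of applying each effect with its own probability can be sketched as a small pipeline. The two toy effects and the probabilities below are illustrative assumptions and do not reproduce the augmentation library or settings used in the notebook.

```python
import random
from functools import partial


def change_volume(samples, gain=0.5):
    """Scale the amplitude of every sample by a fixed gain."""
    return [gain * s for s in samples]


def add_white_noise(samples, rng, noise_level=0.01):
    """Add independent uniform noise to every sample."""
    return [s + rng.uniform(-noise_level, noise_level) for s in samples]


def augment(samples, augmentations, rng):
    """Apply each (transform, probability) pair independently, in order."""
    for transform, probability in augmentations:
        if rng.random() < probability:
            samples = transform(samples)
    return samples


rng = random.Random(0)  # fixed seed so the sketch is reproducible
pipeline = [
    (change_volume, 0.25),                      # illustrative probabilities
    (partial(add_white_noise, rng=rng), 0.75),
]
clip = [0.0, 0.5, -0.5, 0.25]
augmented_clip = augment(clip, pipeline, rng)
```

Because every effect is drawn independently, the same positive sample can appear in many different variants across training, which is what makes leaving out individual augmentations an interesting experiment.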

**MIT environmental impulse responses**

The dataset, comprising 271 clips, was recorded by the Computational Audition Lab at MIT. The audio files contain impulse responses measured in diverse environments. They are used to add echo to the clips in the training process.
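
Adding echo with an impulse response amounts to convolving the clip with it. The sketch below uses a naive pure-Python convolution and a made-up three-sample impulse response; real impulse responses are thousands of samples long, and an FFT-based convolution would normally be used for them.

```python
def convolve(signal, impulse_response):
    """Full discrete convolution of a signal with an impulse response."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


# Toy impulse response: direct sound plus one reflection two samples later
toy_rir = [1.0, 0.0, 0.5]
dry = [1.0, 0.0, 0.0]
wet = convolve(dry, toy_rir)  # the click is followed by a quieter copy of itself
```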

**AudioSet dataset**

This is a dataset of 10-second clips from YouTube, a subset of which is used for adding background noise to our examples. It contains clips under 527 labels such as traffic noise, wind noise, environmental noise but also yodeling, speech, whistle, purr and ocean sounds.
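
Mixing a background clip into a speech sample is usually controlled by a signal-to-noise ratio (SNR). The following is a minimal sketch assuming both signals are lists of floats at the same sampling rate and that the noise clip is not silent; the function names and the 20 dB value are illustrative, not taken from the notebook.

```python
import math


def rms(samples):
    """Root-mean-square level of a signal."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the speech-to-noise RMS ratio equals snr_db."""
    # Loop the noise if it is shorter than the speech, trim it if it is longer
    noise = [noise[i % len(noise)] for i in range(len(speech))]
    target_noise_rms = rms(speech) / (10.0 ** (snr_db / 20.0))
    scale = target_noise_rms / rms(noise)  # assumes rms(noise) > 0
    return [s + scale * n for s, n in zip(speech, noise)]


speech = [1.0, -1.0, 1.0, -1.0]
background = [0.5, 0.5, -0.5, -0.5]
mixed = mix_at_snr(speech, background, snr_db=20.0)
```

Lower SNR values make the wake word harder to hear, so the range of SNRs drawn during augmentation directly controls how noisy the training conditions are.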

**Free Music Archive**

This dataset contains 30-second music clips, of which a total of 1 hour is used in the training process to add background music to some audio samples.

---

#### Parameters

Multiple parameters of the model and the training process can be adjusted. The base model with the default parameter values (given in brackets) had a training time of about 1 hour.

A non-exhaustive list of model parameters:
1.	**model type**: can be a simple feedforward deep neural network or a recurrent neural network with an LSTM layer *(DNN)*
2.	**layer size**: only for DNN, RNN parameters are fixed *(32)*
3.	**number of hidden layers**: only for DNN, RNN parameters are fixed *(1)*

A non-exhaustive list of training parameters:
1.	**target phrase** which the model (hopefully) learns to spot *(activate assistant)*
2.	custom **negative phrases** that are easy to mistake for the target phrase *(empty [])*
    * there are negative phrases deliberately generated during the training process using phoneme overlap, but if certain similar expressions still cause the model to fail, these can be included here
3.	**number of positive samples** generated and used for training the model *(1000)*, also the number of positive samples generated for validation and early stopping checks *(max(500, number_of_examples//10) = 500)*
4.	the maximum number of **training steps** *(10 000)*
5.	target **accuracy** *(0.5)* and **recall** *(0.25)*
6.	**penalty for false activation** which limits how much influence negative samples can have during training *(1500)*
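
For reference, the default values listed above can be collected in a single dictionary. The key names are hypothetical and chosen to mirror the list; they are not necessarily the keys used by the Colab notebook or by *openWakeWord*.

```python
number_of_examples = 1000  # default number of generated positive samples

# Hypothetical key names; the values are the defaults listed above
training_config = {
    "model_type": "dnn",             # alternative: an RNN with an LSTM layer
    "layer_size": 32,                # DNN only
    "n_hidden_layers": 1,            # DNN only
    "target_phrase": "activate assistant",
    "custom_negative_phrases": [],
    "n_samples": number_of_examples,
    "n_validation_samples": max(500, number_of_examples // 10),
    "training_steps": 10_000,
    "target_accuracy": 0.5,
    "target_recall": 0.25,
    "false_activation_penalty": 1500,
}
```

Keeping all settings in one place like this makes it easier to record which three parameters were changed in each experiment.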



### 5. Find a new dataset, that could be used for data augmentation (for example huggingface is a good place to start)

#### Arni

For room impulse responses (RIRs), the current configuration uses an MIT dataset of 271 samples from 2016. However, there are many more extensive and more recent datasets, which were therefore recorded with better equipment. One such dataset was collected in the variable acoustics laboratory Arni at Aalto. It comprises 132 037 RIRs measured using 5342 configurations of 55 acoustic panels in the lab.

[Link to dataset](https://zenodo.org/records/6985104#.YwffZuzMIeY)

#### FSD50K

FSD50K is a dataset of more than 50 000 samples, amounting to more than 100 hours of audio. The clips mostly contain human sounds, sounds of things, animal sounds, natural sounds and music. All clips are provided as uncompressed PCM 16-bit 44.1 kHz mono audio files, so they must be resampled to 16 kHz before being used as background noise in the augmentation phase.

[Link to dataset](https://huggingface.co/datasets/Fhrozen/FSD50k)
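
The sampling rate conversion from 44.1 kHz to 16 kHz can be sketched with simple linear interpolation. This is a deliberately crude, dependency-free illustration: it applies no low-pass filter, so frequencies above 8 kHz would alias into the result. In practice a filtered polyphase resampler such as `scipy.signal.resample_poly(x, 160, 441)` would be the safer choice.

```python
def resample_linear(samples, src_rate=44_100, dst_rate=16_000):
    """Resample by linear interpolation (no anti-aliasing filter)."""
    if not samples:
        return []
    n_out = len(samples) * dst_rate // src_rate
    out = []
    for n in range(n_out):
        position = n * src_rate / dst_rate      # fractional index in the input
        i = int(position)
        frac = position - i
        nxt = samples[min(i + 1, len(samples) - 1)]
        out.append((1.0 - frac) * samples[i] + frac * nxt)
    return out


one_second = [float(i) for i in range(44_100)]  # a ramp, one second at 44.1 kHz
resampled = resample_linear(one_second)         # one second at 16 kHz
```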
