# ELEC-E5500 Speech Processing -- Autumn 2025
## Exercise 4: Wake Word Detection.


## Overview:

**In this exercise, you will implement and train your own custom wake word detector.**

Submit the notebook by 23:59 on **29.09.24**. (Only one member of the group needs to submit it.)

Do this together as a group, and choose a custom wake word. A phrase with at least 3 syllables often works best, as it greatly reduces the false positive rate.

## Write the name of the people in the team here! TEAM MEMBERS: 
Team Waveform Wizards
* Khanh Ha
* Enikő Palencsár
* Angelo Parravano
* Jere Tahvanainen
* Tuan Tran

Since Jupyter here consistently breaks when running the training, it is much easier to run the wake word detection model on Google Colab. 
The demo we are using today is this: https://colab.research.google.com/drive/1q1oe2zOyZp7UsB3jJiQ1IFn8z5YfjwEb?usp=sharing#scrollTo=1cbqBebHXjFD

Change the target_word to whatever you want.


After training the wakeword detection model with the google colab (it should take a couple of hours on the free plan) 

1. Upload a model of yours to your jupyter folder (the tflite and the onnx file), which will be submitted

2. Write your target_word in the next cell (both how it was supposed to sound and the actual string you used as the target word)

3. Explain how the model is trained, what datasets are used, what model is used, and what parameters you can change.

4. Change at least 3 things in the training, and see what effect they have on the final output. Here are some ideas: what happens when you increase the number of samples, when you change the negative sample penalization, or when you leave out some of the augmentations? (These are ideas; you can also choose other things.) Write about the effects in the next cell.

5. Find a new dataset that could be used for data augmentation (for example, Hugging Face is a good place to start)

6. Write about strategies for improving the performance of the model (this could mean different architectures, losses, training data or real recordings; elaborate as concretely as you can)

Have each member of your group record themselves saying the wake word you chose, and upload the recordings to jupyter.

Then run the codecell below to see what your model would predict.

7. How does the model perform with real life recordings?
8. Can you find a false positive sample, that consistently activates the wake word, without actually being the wake word? - Mention what words you tried and why.

In [None]:
# First we need to install openwakeword
!pip install openwakeword

In [None]:
# Get predictions for individual WAV files (16-bit, 16 kHz PCM)
import openwakeword
from openwakeword.model import Model

# This might be helpful; it downloads some pretrained models: openwakeword.utils.download_models()
# The file can be either the tflite or the onnx file (more information can be found at
# https://github.com/dscripka/openWakeWord/blob/main/openwakeword/model.py).

model = Model(wakeword_models=["path/to/model.tflite"])
model.predict_clip("path/to/wavfile")

# The predictions are made frame by frame (for example, 80 ms at a time); you can either use the average of the
# predictions or check whether certain frames are activated.

### 2. Write your target_word in the next cell (both, how it was supposed to sound like, and what actual string you used in the target word)

**target:** activate assistant /ˈæk.təˌveɪt əˈsɪs.tənt/ \
**actual string used:** activate assistant

### 3. Explain how the model is trained, what datasets are used, what model is used, and what parameters can you change?

#### Model and training

The *openWakeWord* model consists of three separate parts: pre-processing, feature extraction and classification. In pre-processing, mel-spectrograms are computed from the input audio, which should be 16-bit 16 kHz PCM data. In a mel-spectrogram, the frequency bands are spaced to reflect human hearing sensitivity, so higher frequencies are compressed more. The feature extraction layers then convert the mel-spectrograms into audio embeddings. This component of the model, based on simple convolutional blocks, is a reimplementation of a TFHub module. Its strength lies in being pre-trained on a large dataset. During the training of the wakeword spotter, these convolutional layers are frozen, and only the last part of the model, the classification network, learns weights. In our case, this is a fully-connected feedforward neural network with ReLU activation functions.
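
To illustrate why higher frequencies are compressed more on the mel scale, here is a minimal sketch using the common HTK-style mel formula. The number of bands (8) and the 0 to 8000 Hz range are illustrative assumptions, not the values used inside *openWakeWord*.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_bands: int, f_min: float, f_max: float) -> list[float]:
    """Band edges that are equally spaced in mels, returned in Hz."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]


# 8000 Hz is the Nyquist frequency of 16 kHz audio; 8 bands is an assumed value.
edges = mel_band_edges(n_bands=8, f_min=0.0, f_max=8000.0)
widths = [b - a for a, b in zip(edges, edges[1:])]
for low, width in zip(edges, widths):
    print(f"band starting at {low:7.1f} Hz spans {width:7.1f} Hz")
```

The printed band widths grow with frequency: each band covers the same distance in mels, but an increasingly wide range in Hz.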

For training the model, the Adam optimizer is used. The learning rate follows a specific schedule that starts with a warmup phase (1/5 of total steps), during which it increases linearly. After that, it can hold for a set number of steps (1/3 of total steps) and then decays toward zero following a cosine curve. The training also involves hard example mining, which filters each training batch to focus on difficult samples – negative samples for which the predicted values are too high and positive samples for which the predictions are too low. 
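
The schedule described above can be sketched as a small function. This is a minimal illustration built only from the fractions given in the text (1/5 warmup, 1/3 hold). The peak learning rate `max_lr` is a hypothetical value, and the actual implementation may differ in its details.

```python
import math


def learning_rate(step: int, total_steps: int, max_lr: float) -> float:
    """Linear warmup, then hold, then cosine decay toward zero."""
    warmup_steps = total_steps // 5   # 1/5 of total steps
    hold_steps = total_steps // 3     # 1/3 of total steps
    if step < warmup_steps:
        # Linear increase from 0 to max_lr
        return max_lr * step / warmup_steps
    if step < warmup_steps + hold_steps:
        # Constant at the peak value
        return max_lr
    # Cosine decay over the remaining steps
    decay_steps = total_steps - warmup_steps - hold_steps
    progress = min(1.0, (step - warmup_steps - hold_steps) / decay_steps)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))


# max_lr=1e-4 is an assumed peak value, used only for illustration.
for s in (0, 1000, 2000, 5000, 7500, 10000):
    print(s, learning_rate(s, total_steps=10_000, max_lr=1e-4))
```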

There are 3 separate sequences of training, the second and third with only 10% of the total steps but increased weights of negative examples to reduce the false positive rate. The resulting models are merged in a way that models above the 90th percentile are considered and their weights are averaged. In each training step, the resulting model is saved only if it meets certain performance criteria relative to the previously saved models (low false positive rate, high recall).
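
The merging step can be pictured as follows. This is a hedged sketch, not the library's code: the checkpoint representation (a score plus a flat list of weights), the nearest-rank percentile and the use of "at or above" the threshold are all assumptions made for illustration.

```python
import math


def percentile_threshold(scores: list[float], q: float) -> float:
    """Nearest-rank q-th percentile of the scores."""
    ordered = sorted(scores)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def average_top_checkpoints(checkpoints, q: float = 90) -> list[float]:
    """Average the weights of checkpoints scoring at or above the q-th percentile.

    `checkpoints` is a list of (score, weights) pairs, where `weights` is a
    flat list of floats (a hypothetical simplification of real model weights).
    """
    threshold = percentile_threshold([score for score, _ in checkpoints], q)
    selected = [weights for score, weights in checkpoints if score >= threshold]
    return [sum(values) / len(selected) for values in zip(*selected)]


# Ten toy checkpoints with scores 0.1 ... 1.0 and two weights each.
checkpoints = [(i / 10, [float(i), 2.0 * i]) for i in range(1, 11)]
averaged = average_top_checkpoints(checkpoints)
print(averaged)
```

In this toy example only the two best checkpoints (scores 0.9 and 1.0) pass the threshold, so the result is the element-wise mean of their weights.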

---

#### Datasets - training

The model uses synthetically created positive samples to train for wakeword detection. Sample generation is carried out using a text-to-speech tool called *Piper Sample Generator*. This also involves for example using multiple speaker voices and introducing variability into the generated speech. The negative samples for training are drawn from the *ACAV100M* dataset. Their features have been pre-computed to act as general purpose negative training data in *openWakeWord*, speeding up the training process. The false-positive validation set contains approximately 11 hours of audio including the *DiPCo* dataset, the *Santa Barbara Corpus of Spoken American English* and some clips from the *MUSDB Music Dataset*, reverberated using the *MIT impulse response recordings*.

**ACAV100M**

*ACAV100M* is an automatically curated dataset containing 31 years worth of 10-second-long clips of speech, noise and music in multiple languages, retrieved from YouTube.

**DiPCo**

The *Dinner Party Corpus* was created with the assistance of volunteers, who simulated a dinner-party scenario in a lab.

**Santa Barbara Corpus of Spoken American English**

This dataset is a collection of naturally occurring spoken interactions gathered from across the United States. It includes speech from individuals representing diverse regional backgrounds, ages, occupations, genders and ethnicities.

**MUSDB Music Dataset**

*MUSDB* is a dataset of 150 music tracks of different genres.

---

#### Datasets - augmentation

Data augmentation is applied before training. It includes adding coloured noise, real-world background noise, music and echo, as well as equalizer, distortion, pitch shift, volume change and frequency band removal effects. Each effect is applied with a given probability.
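
The idea of applying each effect "with a given probability" can be sketched in a few lines. The two toy effects and their parameter ranges below are hypothetical stand-ins for the real augmentations; only the probabilistic application pattern is the point.

```python
import random


def change_volume(samples, rng):
    """Scale the clip by a random gain (assumed range 0.5 to 1.5)."""
    gain = rng.uniform(0.5, 1.5)
    return [s * gain for s in samples]


def add_white_noise(samples, rng):
    """Add low-level Gaussian noise (assumed standard deviation 0.01)."""
    return [s + rng.gauss(0.0, 0.01) for s in samples]


def augment(samples, effects, rng):
    """Apply each (effect, probability) pair independently, in order."""
    out = list(samples)
    for effect, probability in effects:
        if rng.random() < probability:
            out = effect(out, rng)
    return out


rng = random.Random(0)
clip = [0.0, 0.1, -0.2, 0.3]
untouched = augment(clip, [(change_volume, 0.0), (add_white_noise, 0.0)], rng)
changed = augment(clip, [(change_volume, 1.0), (add_white_noise, 1.0)], rng)
print(untouched, changed)
```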

**MIT environmental impulse responses**

The dataset, comprising 271 clips, was recorded by the Computational Audition Lab at MIT. These audio files contain diverse environmental impulse response data. It’s used to add echo to the clips in the training process.

**AudioSet dataset**

This is a dataset of 10-second clips from YouTube, a subset of which is used for adding background noise to our examples. It contains clips under 527 labels such as traffic noise, wind noise, environmental noise but also yodeling, speech, whistle, purr and ocean sounds.

**Free Music Archive**

This dataset contains 30-second clips of music, of which a total of 1 hour is used in the training process to add background music to some audio samples.

---

#### Parameters

Multiple parameters of the model and the training process can be adjusted. The base model with the default parameter values (given in brackets) had a training time of about 1 hour.

A non-exhaustive list of model parameters:
1.	**model type**: can be a simple feedforward deep neural network or a recurrent neural network with an LSTM layer *(DNN)*
2.	**layer size**: only for DNN, RNN parameters are fixed *(32)*
3.	**number of hidden layers**: only for DNN, RNN parameters are fixed *(1)*

A non-exhaustive list of training parameters:
1.	**target phrase** which the model (hopefully) learns to spot *(activate assistant)*
2.	custom **negative phrases** that are easy to mistake for the target phrase *(empty [])*
    * there are negative phrases deliberately generated during the training process using phoneme overlap, but if certain similar expressions still cause the model to fail, these can be included here
3.	**number of positive samples** generated and used for training the model *(1000)*, also the number of positive samples generated for validation and early stopping checks *(max(500, number_of_examples//10) = 500)*
4.	the maximum number of **training steps** *(10 000)*
5.	target **accuracy** *(0.5)* and **recall** *(0.25)*
6.	**penalty for false activation** which limits how much influence negative samples can have during training *(1500)*
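
For orientation, the defaults listed above could be collected in a configuration dictionary like the one below. This is a sketch only: the key names marked as assumed are not confirmed by this notebook and may differ from the real training configuration, and the mapping of the false activation penalty to `max_negative_weight` is likewise an assumption.

```python
def validation_samples(n_examples: int) -> int:
    """Positive samples generated for validation and early stopping checks."""
    return max(500, n_examples // 10)


number_of_examples = 1000  # default number of positive samples

config = {
    "target_phrase": ["activate assistant"],   # assumed key name
    "custom_negative_phrases": [],             # empty by default
    "model_type": "dnn",                       # assumed key name; DNN or RNN
    "layer_size": 32,                          # DNN only
    "n_samples": number_of_examples,           # assumed key name
    "n_samples_val": validation_samples(number_of_examples),
    "steps": 10_000,                           # maximum number of training steps
    "target_accuracy": 0.5,                    # assumed key name
    "target_recall": 0.25,                     # assumed key name
    "max_negative_weight": 1500,               # assumed to hold the false activation penalty
}

print(config["n_samples_val"])
```

With the default of 1000 positive samples, the floor of 500 validation samples applies; the `// 10` rule only takes over above 5000 samples.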



### 4. Change at least 3 things in the training, and see what effect they have on the final output. Here are some ideas: what happens when you increase the number of samples, when you change the negative sample penalization, or when you leave out some of the augmentations? (These are ideas; you can also choose other things.) Write about the effects in the next cell.

To better understand how different training choices affect model performance, we trained three different versions of the wakeword model, each incorporating a specific change to the training configuration.

#### Model No. 1: Increased the number of samples, without increasing the number of steps

First, we increased the number of samples from 10 000 to 30 000 while keeping the number of training steps at its default value (10 000). As a result, the true positive rate dropped from 0.73 to 0.20. The likely cause is underfitting: although more samples were available, the model could not learn all the patterns in the data within the default number of steps. Since the model also identified all the negative recordings correctly (true negative rate = 1.0), we assume that the samples used to train the model might be biased towards the negative class. In short: increasing the number of samples alone is not enough; training steps and other parameters need to be scaled as well.

#### Model No. 2: Increased the false activation penalty

Then, we reduced the number of samples back to 10 000 and increased the false activation penalty from 1500 to 3000. This resulted in a huge drop in recall for the positive examples, to 0.07, meaning that the model became overly strict and identified almost all the recordings as not the wake word. The true negative rate also increased to 1.0, but this is not particularly informative on its own, since the model simply failed to activate. This shows that increasing the false activation penalty might help precision, but it is a delicate trade-off against recall.

#### Model No. 3: Removed augmentation
Last but not least, we trained the model with default parameters but with no data augmentation. The true negative rate was again 1.0. We assume this is because all our negative recordings are "clean", with little to no background noise, so the model classified all of them correctly. On the other hand, recall on the positive recordings (the ones containing the wake word) decreased to 0.20, likely because most of them contain background noise that the model was not trained to handle.

### 5. Find a new dataset, that could be used for data augmentation (for example huggingface is a good place to start)

#### Arni

For room impulse responses, the current configuration uses an MIT dataset of 271 samples from 2016. There are, however, more extensive and more recent datasets (which were therefore likely recorded with better equipment). One such dataset was collected in the variable acoustics laboratory Arni at Aalto. It comprises 132 037 RIRs measured using 5342 configurations of 55 acoustic panels in the lab.

[Link to dataset](https://zenodo.org/records/6985104#.YwffZuzMIeY)

#### FSD50k

FSD50k is a dataset of more than 50 000 samples, amounting to more than 100 hours of audio. The sounds in the clips are mostly human sounds, sounds of things, animal sounds, natural sounds and music. All clips are provided as uncompressed PCM 16-bit 44.1 kHz mono audio files, so they should be resampled to 16 kHz before being used as background noise in the augmentation phase.

[Link to dataset](https://huggingface.co/datasets/Fhrozen/FSD50k)
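
Because FSD50k is sampled at 44.1 kHz while the wake word model expects 16 kHz, each clip has to be resampled by a rational factor. The sketch below only works out that factor and the resulting clip length; the helper names are hypothetical. The actual filtering could then be done with a polyphase resampler such as `scipy.signal.resample_poly(x, up, down)`.

```python
from fractions import Fraction


def resample_factors(source_rate: int, target_rate: int) -> tuple[int, int]:
    """Smallest integer (up, down) pair with source_rate * up / down == target_rate."""
    ratio = Fraction(target_rate, source_rate)
    return ratio.numerator, ratio.denominator


def resampled_length(n_samples: int, source_rate: int, target_rate: int) -> int:
    """Number of output samples after resampling (ceiling of n * up / down)."""
    up, down = resample_factors(source_rate, target_rate)
    return -(-n_samples * up // down)  # integer ceiling division


up, down = resample_factors(44_100, 16_000)
print(up, down)                                   # upsample by 160, downsample by 441
print(resampled_length(441_000, 44_100, 16_000))  # length of a 10-second clip at 16 kHz
```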


### 6. Strategies for Improvement

#### Training data

The most effective way to improve the model is to address the quality and balance of the training data. At present, the positives are generated synthetically with Piper TTS. This helps bootstrap the model but creates a clear mismatch to real deployment conditions. To close this gap, at least 15–20 speakers should record the target phrase *“activate assistant”*, each providing 20–30 samples in different environments and microphones. These recordings can be included directly in the `positive` class. To reduce confusions, more near-miss negatives should be added to `custom_negative_phrases` (e.g. *“assistant activate”*, *“activate system”*, *“hey assistant”*). The dataset is also heavily skewed: the model uses `batch_n_per_class` with only 50 positives versus 1024 negatives from ACAV100M. This imbalance will depress recall. Either raise the positive batch size to 200+ or reduce the effect of negatives by lowering `max_negative_weight`. Background coverage should also be extended beyond FMA and AudioSet to include more everyday sounds (office chatter, TV, kitchen noise). Finally, augmentations should be made stronger: increase `augmentation_rounds` from 1 to 3–5 and add pitch shifting (±2 semitones), time stretching (0.9–1.1×), SNR mixes between 0 and 20 dB, and a larger set of room impulse responses. The Arni dataset (132k RIRs) is a promising replacement for the older MIT RIR set, and additional noise datasets like FSD50k (100 hours of diverse sounds) could improve robustness.
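
As an illustration of the suggested SNR mixing, the following sketch scales a noise clip so that the mixture reaches a chosen signal-to-noise ratio. The sine-wave "speech" and "noise" signals are synthetic placeholders, and the function names are hypothetical rather than part of the training code.

```python
import math


def power(samples):
    """Mean power of a clip."""
    return sum(s * s for s in samples) / len(samples)


def noise_scale_for_snr(speech, noise, target_snr_db):
    """Gain to apply to `noise` so that speech power / noise power matches the target."""
    target_noise_power = power(speech) / (10.0 ** (target_snr_db / 10.0))
    return math.sqrt(target_noise_power / power(noise))


def mix_at_snr(speech, noise, target_snr_db):
    """Add the scaled noise to the speech, sample by sample."""
    scale = noise_scale_for_snr(speech, noise, target_snr_db)
    return [s + scale * n for s, n in zip(speech, noise)]


# Synthetic placeholder signals: 0.1 s at 16 kHz.
speech = [math.sin(2 * math.pi * 440 * t / 16_000) for t in range(1600)]
noise = [0.3 * math.sin(2 * math.pi * 1000 * t / 16_000 + 1.0) for t in range(1600)]

mixture = mix_at_snr(speech, noise, target_snr_db=10.0)
scale = noise_scale_for_snr(speech, noise, 10.0)
measured = 10.0 * math.log10(power(speech) / power([scale * n for n in noise]))
print(round(measured, 6))
```

Sampling the target SNR uniformly between 0 and 20 dB for each training clip would then give the range of conditions proposed above.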

#### Training procedure

The model is currently trained for `steps=25000`, which is too low once the data is expanded. Extending to 50k–100k steps ensures it fully benefits from the richer dataset. The validation set is also too small (`n_samples_val=500`), and should be increased to a few thousand samples, with near-misses included, so that false positive rates can be tracked more reliably. A curriculum approach is also recommended: begin training with clean TTS positives and low-noise data, then gradually add noisy, reverberant, and accented recordings. This staged exposure prevents the model from overfitting early to easy cases while still ensuring robustness later.

#### Loss functions

The current setup relies on binary cross-entropy. Switching to focal loss (γ≈2, α≈0.25) would force the model to focus more on the hard negatives that dominate the dataset. An alternative is to add an embedding layer before the classifier and use triplet or contrastive loss with near-miss phrases. This explicitly separates them from the true wake word in the embedding space. In either case, class weighting should be applied so that positives have enough influence during training despite their smaller numbers.
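
To make the focal loss proposal concrete, here is a minimal per-sample sketch of binary focal loss in its usual form, $-\alpha_t (1 - p_t)^\gamma \log p_t$. It is written in plain Python for readability and is not taken from the training code; a real implementation would operate on tensors inside the training framework.

```python
import math


def binary_focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25,
                      eps: float = 1e-7) -> float:
    """Focal loss for one sample with predicted probability p and label y (0 or 1)."""
    p = min(max(p, eps), 1.0 - eps)            # avoid log(0)
    p_t = p if y == 1 else 1.0 - p             # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy_negative = binary_focal_loss(0.05, 0)  # confidently correct rejection
hard_negative = binary_focal_loss(0.90, 0)  # near-miss that almost triggers
print(easy_negative, hard_negative)
```

The easy negative contributes almost nothing, while the hard negative dominates the loss. Note that in this convention `alpha` weights the positive class, so `alpha=0.25` gives negatives the larger weight; whether that suits the class balance here would need to be checked experimentally.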

#### Architecture

The classifier is a simple feed-forward DNN with `layer_size=32`. This capacity is limited given the variability in speakers, microphones, and noise conditions. Increasing to 64–128 units per layer will improve feature representation. More importantly, the model should not remain fully feed-forward: temporal modeling should be introduced by reshaping the mel-spectrogram features into sequences and feeding them into a GRU or LSTM layer (e.g. 64 units), or by using a temporal CNN block. This allows the system to capture timing variations in how the phrase *“activate assistant”* is spoken.  
Furthermore, a recent development in wake word detection is the use of Transformer models [Vaswani et al., 2017]. In particular, the Keyword Transformer (KWT) model is a self-attention model designed specifically for keyword spotting [Berg et al., 2021]. Transformer architectures are known for their superior performance in modeling long-range dependencies compared to convolutional neural networks (CNNs) or recurrent neural networks (RNNs), which is especially beneficial in speech processing due to variable-length inputs and noisy environments. KWT further leverages self-attention to detect keyword occurrence patterns along the time axis of audio clips, improving overall wake word detection performance.

### 7. Real life performance  
We have multiple real-life recordings in various scenarios (with music, low voice, with chatter) from the 5 members of the team. We also tested with the generated audio clip from the demo. To evaluate performance at the clip level, we first aggregate the frame-level probabilities by computing both the mean and the maximum probability for each clip. This captures two perspectives: the average confidence across frames and the strongest evidence within a clip. The model performs well on clear recordings, on generated audio and on recordings made in noisy conditions. However, it cannot recognize the wakeword in the recordings of one of our members, which score clearly lower than the recordings of the others. In particular, the model achieves approximately 0.055-0.15 average probability and 0.8-0.95 maximum probability for the wakeword recordings of the other members, but only approximately 0.0006 average and 0.002 maximum for this member's recordings, which indicates that the model does not recognize the wakeword at all in these recordings. On listening to this member's recordings, it is not clear what affected the model performance. We assume that it could be due to the speaker speaking more slowly than the voices in the model's training audio.  
To measure overall performance, we use thresholding as follows. If the maximum frame probability in a clip is over 0.5, we categorize that clip as a positive sample. Otherwise, the clip is categorized as a negative sample. On 15 real wakeword recordings, the model achieves a true positive rate of 0.73 (11 out of 15 are categorized correctly). All of the recordings from the member mentioned earlier are categorized as negative, as is one recording made in a small room.
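
The clip-level evaluation described above can be sketched as follows. The code assumes that `predict_clip` returns one dictionary per frame, mapping the model name to a score. The model key `"activate_assistant"` and the frame scores are made-up values for illustration, not our measurements.

```python
def clip_scores(frame_predictions, model_name):
    """Return (mean, max) of the frame-level scores for one clip."""
    scores = [frame[model_name] for frame in frame_predictions]
    return sum(scores) / len(scores), max(scores)


def is_detected(frame_predictions, model_name, threshold=0.5):
    """A clip counts as positive if its maximum frame score exceeds the threshold."""
    _, peak = clip_scores(frame_predictions, model_name)
    return peak > threshold


def true_positive_rate(detections):
    """Fraction of wake word clips that were detected."""
    return sum(detections) / len(detections)


# Illustrative frame scores for a single clip (not real model output).
frames = [{"activate_assistant": s} for s in (0.01, 0.02, 0.9, 0.03)]
mean_score, peak_score = clip_scores(frames, "activate_assistant")
print(mean_score, peak_score, is_detected(frames, "activate_assistant"))

# 11 detected clips out of 15, as in our evaluation.
print(round(true_positive_rate([True] * 11 + [False] * 4), 2))
```

The example also shows why the mean stays low even for a detected clip: only a few frames around the end of the phrase carry a high score.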

### 8. False positive samples  
For the negative audio clips, we used the following phrases: "accident existence", "act instantly", "activate at a distance", "activate distance", "activate existence", "active eight existence", "assassin", "activate assassin", "at a distance", "existential crisis". These phrases were chosen because they sound similar to the real wakeword. We had multiple members record these phrases to avoid bias. In testing, the phrases "activate assassin" and "activate existence" obtain high scores, with averages of approximately 0.03-0.08 and frame maxima of approximately 0.9-0.94, comparable to the scores for real wakeword recordings. To improve on this, we believe that these phrases, which sound very similar to the wakeword, can be recorded and used as negative training samples for the model.  
For the overall score, we again use thresholding as in part 7. On 14 negative recordings, the model achieves a true negative rate of 0.71 (10 out of 14 are categorized correctly). The false positives are the recordings of the phrases "activate assassin" and "activate existence".
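
Combining the counts from parts 7 and 8 (11 of 15 wake word clips detected, 10 of 14 negative clips rejected) gives a small confusion matrix. The sketch below only re-derives summary metrics from those counts. With test sets this small, the resulting figures are rough indications rather than reliable estimates.

```python
def summarize(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Standard metrics derived from confusion-matrix counts."""
    return {
        "recall": tp / (tp + fn),
        "true_negative_rate": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Counts taken from the two evaluations above.
metrics = summarize(tp=11, fn=4, tn=10, fp=4)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```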

### References

- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.

- Berg, A., O’Connor, M., & Tairum Cruz, M. (2021). Keyword Transformer: A Self-Attention Model for Keyword Spotting. Interspeech 2021. arXiv:2104.00769