This repository contains code for training and evaluating the best-performing visual keyword spotting model described in the paper: Liliane Momeni, Triantafyllos Afouras, Themos Stafylakis, Samuel Albanie and Andrew Zisserman, *Seeing Wake Words: Audio-visual Keyword Spotting*, BMVC 2020. Two baseline keyword spotting models are also included.
- ffmpeg
- Torch
- NumPy
- Matplotlib
- TensorBoard
Install Python dependencies by creating a new virtual environment and then running:
```bash
pip install -r requirements.txt
```
- Download the LRW and LRS2 audio-visual datasets (LRW and LRS2 are used for training; LRS2 is used for testing).
- Save transcriptions and extract talking faces from the clips using the available metadata.
- Pre-compute features for the clips of talking faces using the pre-trained lip reading model, and save the LRS2 features at `data/lrs2/features/main` (train, val & test splits) and `data/lrs2/features/pretrain` (pre-train split), and the LRW features at `data/lrw/features/main` (train & val splits). Note that we train the rest of the network on these pre-computed features to accelerate training.
- Compute word-level timings for the LRS2 transcriptions using the Montreal Forced Aligner and save them at `data/lrs2/word_alignments/main` (train, val & test splits) and `data/lrs2/word_alignments/pretrain` (pre-train split). These annotations are used as extra supervision to improve keyword localisation, but our model can also be used without them (see the no-LOC ablation in the paper). An example alignment file and a parsing sketch are shown below.
```
# example format of word alignment file data/lrs2/word_alignments/634124261710508263700049.txt
it's 0.13 0.42
cosmetically 0.42 1.07
improved 1.07 1.57
quite 1.57 1.78
drastically 1.78 2.27
here 2.27 2.53
```
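For reference, a minimal sketch of how such an alignment file could be parsed, assuming the whitespace-separated `word start end` format shown above (the helper name is ours, not part of the repo):

```python
def load_alignment(path):
    """Read one word-alignment file into (word, start_seconds, end_seconds) tuples."""
    words = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 3:  # skip empty or malformed lines
                continue
            word, start, end = parts
            words.append((word, float(start), float(end)))
    return words

# e.g. load_alignment("data/lrs2/word_alignments/main/<clip_id>.txt")
```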
- Download the CMU phonetic dictionary to `data/vocab/cmudict.dict`.
- Build the CMU phoneme and grapheme vocabulary files `data/vocab/phoneme_field_vocab.json` and `data/vocab/grapheme_field_vocab.json`.
- Build the dataset split json files `data/lrs2/DsplitsLRS2.json` and `data/lrw/DsplitsLRW.json` using `misc/data_splits_lrs2.py` and `misc/data_splits_lrw.py` respectively. An example entry is shown below.
```
# example format of data/lrs2/DsplitsLRS2.json
{"test": [{'end_word': [14.25, 19.5, 25.75, 35.75, 54.0, 60.5], 'start_word': [3.25, 15.0, 19.5, 25.75, 36.5, 54.0], 'widx': [4011, 43989, 77147, 120898, 118167, 129664], 'fn': '6330311066473698535/00011'},
          {'end_word': [9.5, 18.5, 27.5], 'start_word': [1.25, 9.5, 18.5], 'widx': [121092, 81694, 5788], 'fn': '6330311066473698535/00018'},
          {'end_word': [8.0, 12.0, 16.5, 24.75], 'start_word': [3.5, 8.0, 12.0, 16.5], 'widx': [4011, 129931, 130533, 102579], 'fn': '6330311066473698535/00022'}],
 "val": [],
 "train": []}
```
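As a rough illustration, the split file above could be read as follows (a sketch only; the field meanings are inferred from the example, and `widx` presumably indexes the word vocabulary built above):

```python
import json

with open("data/lrs2/DsplitsLRS2.json") as f:
    splits = json.load(f)

# Each entry lists a clip path ('fn'), word indices ('widx') and per-word
# start/end timings as produced by the data_splits scripts.
for clip in splits["test"][:3]:
    for widx, start, end in zip(clip["widx"], clip["start_word"], clip["end_word"]):
        print(clip["fn"], widx, start, end)
```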
- Keyword vocabularies: for both training and evaluation, we use only keywords with more than 5 phonemes (np > 5). Moreover, as we want to evaluate on unseen keywords, we ensure that training and testing use disjoint keyword vocabularies. To that end, we use all the words appearing in the LRS2 test set with np > 5 as evaluation keywords (`data/lrs2/LRS2_test_words.json`) and remove them from the training vocabulary, i.e. those words are not used when training the keyword encoder. For example, for the LRW dataset, the 500-word training vocabulary is reduced to `data/lrw/LRW_train_words.json`. A sketch of the phoneme-count filter is shown below.
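The np > 5 criterion can be illustrated with a short sketch over the CMU dictionary (illustrative only; the actual vocabulary-building logic lives in `misc/`, and this ignores alternate pronunciations and stress markers):

```python
def load_cmudict(path="data/vocab/cmudict.dict"):
    """Map each word to its phoneme sequence, assuming lines of 'word PH1 PH2 ...'."""
    phones = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split()
            if len(parts) < 2:
                continue
            phones[parts[0]] = parts[1:]
    return phones

cmu = load_cmudict()
# Keywords with more than 5 phonemes (np > 5), as used for training/evaluation.
long_words = [w for w, p in cmu.items() if len(p) > 5]
```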
Download the pre-trained models by running:
```bash
bash misc/download_models.sh
```
We provide several pre-trained models used in the paper:
- Stafylakis & Tzimiropoulos G2P implementation: `G2P_baseline.pth`
- Stafylakis & Tzimiropoulos P2G, a variant of the above model where the grapheme-to-phoneme keyword encoder-decoder has been switched to a phoneme-to-grapheme architecture: `P2G_baseline.pth`
- KWS-Net, the novel convolutional architecture we propose: `KWS_Net.pth`
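To sanity-check a download, the checkpoints can be loaded with standard PyTorch (a sketch; the internal structure of the `.pth` files is defined by the training scripts):

```python
import torch

# Load onto CPU and peek at the top-level structure of a checkpoint.
ckpt = torch.load("misc/pretrained_models/KWS_Net.pth", map_location="cpu")
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:10])
```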
We employ a two-stage curriculum training procedure on top of the pre-computed features: (i) the model is first trained on the training set of LRW; since LRW contains clips of single words, the model is trained without word time boundaries at this stage; (ii) the model is then fine-tuned on LRS2.
For the first stage, the initial learning rate is 10^-3 and is halved every 10 epochs. We train this stage for 40 epochs.
```bash
python train_LRW.py --config=./configs/lrw/g2p/train.json     # G2P
python train_LRW.py --config=./configs/lrw/p2g/train.json     # P2G
python train_LRW.py --config=./configs/lrw/kwsnet/train.json  # KWS-Net
```
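The learning-rate schedule described above corresponds to a standard step schedule; a minimal sketch (the model and optimiser here are placeholders, not the repo's training loop):

```python
import torch

model = torch.nn.Linear(512, 2)  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Halve the learning rate every 10 epochs, starting from 1e-3.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(40):
    # ... one training epoch on LRW ...
    scheduler.step()
```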
For fine-tuning on LRS2, the initial learning rate is 10^-4 and is halved every 20 epochs. We train this second stage for 60 epochs, so the network is trained for a total of 100 epochs (pre-training plus fine-tuning).
```bash
python train_LRS.py --config=./configs/lrs2/g2p/train.json --resume=./path_where_lrw_g2p_model_saved.pth        # G2P
python train_LRS.py --config=./configs/lrs2/p2g/train.json --resume=./path_where_lrw_p2g_model_saved.pth        # P2G
python train_LRS.py --config=./configs/lrs2/kwsnet/train.json --resume=./path_where_lrw_kwsnet_model_saved.pth  # KWS-Net
```
The performance of the models is evaluated on the LRS2 test set, using as queries all words with more than 5 phonemes (`data/lrs2/LRS2_test_words.json`). We look for each query keyword in all the clips of the test set. Note that there is no balancing of positive and negative clips during evaluation: there are one or a few positive clips for a given keyword and the rest are negatives. During testing, in order to obtain fine-grained localisation, we apply the CNN classifier with a stride of one.
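The stride-one idea can be pictured with a toy 1-D convolutional detector applied over the temporal axis, so every output step corresponds to one input frame (this is only an illustration; the actual classifier architecture is defined by the repo's configs):

```python
import torch
import torch.nn as nn

feat_dim, hidden = 512, 256              # assumed feature size, not taken from the repo
features = torch.randn(1, feat_dim, 75)  # (batch, channels, time) dummy clip features

detector = nn.Sequential(
    nn.Conv1d(feat_dim, hidden, kernel_size=5, stride=1, padding=2),
    nn.ReLU(),
    nn.Conv1d(hidden, 1, kernel_size=1),
    nn.Sigmoid(),
)
per_frame_prob = detector(features).squeeze(1)  # (batch, time): keyword probability per frame
print(per_frame_prob.shape)                     # torch.Size([1, 75])
```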
Performance is evaluated using ranking metrics. For every keyword in the test vocabulary, we record the percentage of the total clips containing it that appear in the first N retrieved results, with N = 1, 5, 10; this is the 'Recall at N' (R@N). Note that, since several clips may contain a query word, the maximum R@1 is not 100%. The mean average precision (mAP) and equal error rate (EER) are also reported. For each keyword-clip pair, the match is considered correct if the keyword occurs in the clip and the maximum detection probability occurs between the ground-truth keyword boundaries.
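A small sketch of the R@N computation for a single keyword, following the description above (toy scores, not the repo's evaluation code):

```python
import numpy as np

def recall_at_n(scores, is_positive, n):
    """Fraction of all positive clips for one keyword retrieved in the top-n results.
    A clip counts as positive only if it contains the keyword and the detection
    peak falls inside the ground-truth boundaries."""
    order = np.argsort(scores)[::-1]              # rank clips by descending score
    hits = np.asarray(is_positive)[order[:n]].sum()
    return hits / max(np.sum(is_positive), 1)

scores      = [0.9, 0.2, 0.75, 0.4, 0.1, 0.05]    # detection score per clip
is_positive = [1,   0,   1,    0,   0,   1]       # ground truth per clip
print(recall_at_n(scores, is_positive, 5))        # 0.666..., i.e. 2 of the 3 positives in the top 5
```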
```bash
python test_LRS.py --config=./configs/lrs2/g2p/eval.json --resume=./misc/pretrained_models/G2P_baseline.pth     # G2P
# R@1 22.0 | R@5 47.6 | R@10 59.2 | mAP 35.6 | EER 9.3
python test_LRS.py --config=./configs/lrs2/p2g/eval.json --resume=./misc/pretrained_models/P2G_baseline.pth     # P2G
# R@1 28.0 | R@5 55.4 | R@10 65.2 | mAP 42.7 | EER 6.1
python test_LRS.py --config=./configs/lrs2/kwsnet/eval.json --resume=./misc/pretrained_models/KWS_Net.pth       # KWS-Net
# R@1 39.5 | R@5 67.1 | R@10 75.3 | mAP 54.9 | EER 5.4
```
To verify that everything works:
- Run `bash misc/download_models.sh` to get the pretrained models.
- Run a simple demo for visual-only keyword spotting: `python run_demo.py --config=./configs/demo/eval.json --resume=./misc/pretrained_models/KWS_Net.pth`
- The expected output file is created and saved at `data/demo/demo.png`, shown below (a toy plotting sketch follows this list). In blue is the output from the classifier and in green the ground-truth keyword boundaries.
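The expected figure roughly corresponds to a plot like the following toy sketch (made-up values; the repo's own plotting code produces `data/demo/demo.png`):

```python
import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0.0, 3.0, 75)            # time axis in seconds
prob = np.exp(-((t - 1.3) ** 2) / 0.02)  # fake detection curve peaking near 1.3 s
gt_start, gt_end = 1.07, 1.57            # e.g. "improved" in the alignment example above

plt.plot(t, prob, color="blue", label="classifier output")
plt.axvspan(gt_start, gt_end, color="green", alpha=0.3, label="ground-truth keyword boundaries")
plt.xlabel("time (s)")
plt.ylabel("keyword probability")
plt.legend()
plt.savefig("demo_sketch.png")
```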
We would like to emphasise that this research represents work in progress towards automatic visual keyword spotting, and as such has a number of limitations that we are aware of (and likely many that we are not aware of). Key limitations include the ability to deal with:
- Homophemes - for example, the words "may", "pay", "bay" cannot be distinguished without audio as the visemes "m", "p", "b" visually look the same.
- Accents, speed of speech and mumbling, which modify lip movements.
- Variable imaging conditions, such as lighting, motion and resolution, which modify the appearance of the lips.
- Shorter keywords, which are harder to spot visually.
If you use this code, please cite the following:
```bibtex
@misc{momeni2020seeing,
  title={Seeing wake words: Audio-visual Keyword Spotting},
  author={Liliane Momeni and Triantafyllos Afouras and Themos Stafylakis and Samuel Albanie and Andrew Zisserman},
  year={2020},
  eprint={2009.01225},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```