# Introduction

This notebook demonstrates how to train custom openWakeWord models using pre-defined datasets and an automated process for dataset generation and training. Although this approach is not guaranteed to produce the best-performing model, it often yields baseline models with relatively strong performance.

Manual data preparation and model training (e.g., see the [training models](training_models.ipynb) notebook) remain an option when full control over the model development process is needed.

# Environment Setup

To begin, we'll need to install the requirements for training custom models: in particular, a relatively recent version of PyTorch and a custom fork of the [piper-sample-generator](https://github.com/dscripka/piper-sample-generator) library, which generates the synthetic examples for the custom model.

In [None]:
## Environment setup

# install piper-sample-generator
!git clone https://github.com/dscripka/piper-sample-generator
!wget -O piper-sample-generator/models/en-us-libritts-high.pt 'https://github.com/rhasspy/piper-sample-generator/releases/download/v1.0.0/en-us-libritts-high.pt'

# install openwakeword (full installation to support training)
!pip install openwakeword[full]


In [17]:
# Imports

import os
import sys
import uuid
from pathlib import Path

import datasets
import scipy.io.wavfile
import torch
import yaml
from tqdm import tqdm

from openwakeword.data import mmap_batch_generator, generate_adversarial_texts
from openwakeword.utils import compute_features_from_generator

# Make the locally installed piper-sample-generator importable
# (adjust this path to match where the repository was cloned)
sys.path.insert(0, "../../piper-sample-generator/")
from generate_samples import generate_samples


# Download Data

When training new openWakeWord models using the automated procedure, five specific types of data are required:

1) Synthetic examples of the target word/phrase generated with text-to-speech models

2) Synthetic examples of adversarial words/phrases generated with text-to-speech models

3) Room impulse responses and noise/background audio data to augment the synthetic examples and make them more realistic

4) Generic "negative" audio data that is very unlikely to contain examples of the target word/phrase in the context where the model should detect it. This data can be the original audio data, or precomputed openWakeWord features ready for model training.

5) Validation data to use for early-stopping when training the model.

For the purposes of this notebook, all five of these sources can be obtained from HuggingFace, thanks to their excellent `datasets` library and extremely generous hosting policy. Note that only a portion of some datasets is downloaded here; for the best possible performance, you are encouraged to download the entire dataset and keep a local copy for future training runs.
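To make item 3 concrete, the sketch below shows one common way this kind of data is used for augmentation: a dry clip is convolved with a room impulse response and then mixed with background noise at a chosen signal-to-noise ratio. It uses synthetic NumPy arrays only, and the helper names `apply_rir` and `mix_at_snr` are hypothetical; they are not part of openWakeWord, whose own augmentation pipeline may differ.

```python
import numpy as np


def apply_rir(clip, rir):
    """Convolve a dry clip with a room impulse response, keeping the original length."""
    wet = np.convolve(clip, rir, mode="full")[: len(clip)]
    peak = np.max(np.abs(wet))
    return wet / peak if peak > 0 else wet


def mix_at_snr(clip, noise, snr_db):
    """Add noise to a clip, scaled so the clip-to-noise power ratio equals snr_db."""
    noise = np.resize(noise, clip.shape)  # repeat or truncate the noise to match the clip
    clip_power = np.mean(clip ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(clip_power / (noise_power * 10 ** (snr_db / 10)))
    return clip + scale * noise


rng = np.random.default_rng(0)
sr = 16000

# Toy stand-ins for a synthetic clip, an impulse response, and background noise
clip = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
rir = np.exp(-np.arange(2000) / 300.0) * rng.standard_normal(2000)
noise = rng.standard_normal(sr // 2)

reverberant = apply_rir(clip, rir)
augmented = mix_at_snr(reverberant, noise, snr_db=10)
```

Lower `snr_db` values produce harder, noisier examples, so in practice the value is usually drawn at random from a range rather than fixed.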

In [15]:
# Download room impulse responses

output_dir = "./mit_rirs"
os.makedirs(output_dir, exist_ok=True)
rir_dataset = datasets.load_dataset("davidscripka/MIT_environmental_impulse_responses", split="train", streaming=True)

# Save every impulse response in the split as a 16 kHz WAV file
for row in tqdm(rir_dataset):
    name = row['audio']['path'].split('/')[-1]
    scipy.io.wavfile.write(os.path.join(output_dir, name), 16000, row['audio']['array'])


In [None]:
## Download noise and background audio

# FSD50k Noise Dataset (warning, this can take 5 minutes to prepare when streaming)
# https://zenodo.org/record/4060432
output_dir = "./fsd50k"
os.makedirs(output_dir, exist_ok=True)
fsd50k_dataset = datasets.load_dataset("Fhrozen/FSD50k", split="validation", streaming=True)  # ~40,000 files in this split
fsd50k_dataset = iter(fsd50k_dataset.cast_column("audio", datasets.Audio(sampling_rate=16000)))

n_total = 500  # use only 500 clips for this example notebook; increasing this is recommended for full-scale training
for _ in tqdm(range(n_total)):
    row = next(fsd50k_dataset)
    name = row['audio']['path'].split('/')[-1]
    scipy.io.wavfile.write(os.path.join(output_dir, name), 16000, row['audio']['array'])

# Free Music Archive dataset
# https://github.com/mdeff/fma

output_dir = "./fma"
os.makedirs(output_dir, exist_ok=True)
fma_dataset = datasets.load_dataset("rudraml/fma", name="small", split="train", streaming=True)
fma_dataset = iter(fma_dataset.cast_column("audio", datasets.Audio(sampling_rate=16000)))

n_hours = 1
n_clips = n_hours * 3600 // 30  # this works because the FMA dataset is all 30 second clips
for _ in tqdm(range(n_clips)):
    row = next(fma_dataset)
    # FMA audio is distributed as MP3, so give the saved WAV data a .wav extension
    name = os.path.splitext(row['audio']['path'].split('/')[-1])[0] + ".wav"
    scipy.io.wavfile.write(os.path.join(output_dir, name), 16000, row['audio']['array'])
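To check how much background audio has been collected, the durations of the saved clips can be summed. The sketch below is a hypothetical helper (`total_hours` is not part of openWakeWord), demonstrated on two generated clips in a temporary directory. It assumes the clips are stored with a `.wav` extension and reads each file in full, which is fine for small collections.

```python
import os
import tempfile

import numpy as np
import scipy.io.wavfile


def total_hours(directory):
    """Sum the duration, in hours, of all .wav files in a directory."""
    seconds = 0.0
    for name in os.listdir(directory):
        if name.lower().endswith(".wav"):
            rate, data = scipy.io.wavfile.read(os.path.join(directory, name))
            seconds += data.shape[0] / rate
    return seconds / 3600


# Demo: two 30-second silent clips at 16 kHz stand in for downloaded audio
demo_dir = tempfile.mkdtemp()
for idx in range(2):
    silence = np.zeros(16000 * 30, dtype=np.float32)
    scipy.io.wavfile.write(os.path.join(demo_dir, f"clip_{idx}.wav"), 16000, silence)

print(f"{total_hours(demo_dir) * 60:.1f} minutes of audio")
```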


In [None]:
# Download pre-computed openWakeWord features for training and validation

# training set (~2,000 hours)
!wget https://huggingface.co/datasets/davidscripka/openwakeword_features/resolve/main/openwakeword_features_ACAV100M_2000_hrs_16bit.npy

# validation set (~11 hours)
!wget https://huggingface.co/datasets/davidscripka/openwakeword_features/resolve/main/validation_set_features.npy
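
Feature files of this size are usually read lazily rather than loaded into memory in full. The sketch below demonstrates NumPy memory-mapping on a small stand-in file; the file name `dummy_features.npy` and the `(N, 16, 96)` shape are assumptions for illustration, so inspect the real downloads before relying on a particular layout.

```python
import os
import tempfile

import numpy as np

# Stand-in for a downloaded feature file (assumed layout: clips x frames x feature dims)
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "dummy_features.npy")
np.save(path, np.zeros((100, 16, 96), dtype=np.float16))

# mmap_mode="r" maps the file read-only; data is read from disk only when sliced
features = np.load(path, mmap_mode="r")
print(features.shape, features.dtype)

# Copy one small batch into memory as a regular array
batch = np.array(features[:8])
print(batch.shape)
```

Pointing `path` at one of the downloaded `.npy` files gives a quick way to confirm that the download is a valid NumPy array before training starts.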

# Define Training Configuration

For automated model training, openWakeWord uses a specially designed training script and a [YAML](https://yaml.org/) configuration file that defines all of the information required to train a new wake word/phrase detection model.

It is strongly recommended that you review [the example config file](../examples/custom_model.yml), as each value is fully documented there. For the purposes of this notebook, we'll read in the YAML file to modify certain configuration parameters before saving a new YAML file for training our example model. Specifically:

- We'll train a detection model for the phrase "hey sebastian"
- We'll only generate 5,000 positive and negative examples (to save on time for this example)
- We'll only generate 1,000 validation positive and negative examples for early stopping (again to save time)
- The model will be trained for 30,000 steps (larger datasets will benefit from longer training)


In [20]:
# Load YAML file
with open("../examples/custom_model.yml", "r") as f:
    config = yaml.safe_load(f)
config

{'model_name': 'my_model',
 'target_phrase': ['hey jarvis'],
 'total_length': 32000,
 'custom_negative_phrases': [],
 'n_samples': 10000,
 'n_samples_val': 2000,
 'tts_batch_size': 50,
 'augmentation_batch_size': 16,
 'piper_sample_generator_path': './piper-sample-generator',
 'output_dir': './generated_data',
 'rir_paths': ['./mit_rirs'],
 'background_paths': ['./background_clips'],
 'false_positive_validation_data_path': './validation_set_features.npy',
 'augmentation_rounds': 1,
 'feature_data_files': {'ACAV100M_sample': './openwakeword_features_ACAV100M_2000_hrs_16bit.npy'},
 'batch_n_per_class': {'ACAV100M_sample': 1024,
  'adversarial_negative': 50,
  'positive': 50},
 'model_type': 'dnn',
 'layer_size': 32,
 'steps': 100000,
 'max_negative_weight': 1500,
 'target_accuracy': 0.7,
 'target_recall': 0.5,
 'target_false_positives_per_hour': 0.2}

In [21]:
# Modify values in the config and save a new version

config["target_phrase"] = ["hey sebastian"]
config["n_samples"] = 5000
config["n_samples_val"] = 1000
config["steps"] = 30000

# Point the background audio paths at the directories created in the download cells
config["background_paths"] = ["./fsd50k", "./fma"]

with open('my_model.yaml', 'w') as file:
    yaml.dump(config, file)
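
Before launching a long training run, it can help to confirm that every path in the configuration points at something that exists. The sketch below is a hypothetical helper (`find_missing_paths` is not part of openWakeWord) that checks the path-valued keys seen in the config printed above; the example dictionary is trimmed to those keys.

```python
import os


def find_missing_paths(config):
    """Return the configured paths that do not exist on disk."""
    candidates = [
        config.get("piper_sample_generator_path"),
        config.get("false_positive_validation_data_path"),
    ]
    candidates.extend(config.get("rir_paths", []))
    candidates.extend(config.get("background_paths", []))
    candidates.extend(config.get("feature_data_files", {}).values())
    # Skip keys that were absent (None) and keep only paths that are not found
    return [path for path in candidates if path and not os.path.exists(path)]


example_config = {
    "piper_sample_generator_path": "./piper-sample-generator",
    "rir_paths": ["./mit_rirs"],
    "background_paths": ["./background_clips"],
    "false_positive_validation_data_path": "./validation_set_features.npy",
    "feature_data_files": {
        "ACAV100M_sample": "./openwakeword_features_ACAV100M_2000_hrs_16bit.npy"
    },
}

for path in find_missing_paths(example_config):
    print("missing:", path)
```

Running the same check on the `config` dictionary from this notebook should print nothing once all downloads have finished and the paths match the download directories.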

# Start Model Training

With the data downloaded and training configuration set, we can now start training the model. We'll do this in parts to better illustrate the sequence, but you can also execute every step sequentially for a fully automated process.

In [None]:
# Step 1: Generate synthetic clips

