# Introduction

This notebook demonstrates the process of training a new openWakeWord model, using synthetic speech generated with open-source TTS models, and negative data representing music, noise, and speech. While the process here is complete, only small samples of the datasets are used so that a new model can be trained on CPUs. In practice, much larger volumes of data (both positive and negative examples) are needed to produce robust models. See the [documentation](https://github.com/dscripka/openWakeWord/tree/main/docs/models) for the pre-trained openWakeWord models for more information about how these models were trained.

To start, we'll need to install the requirements needed to train new openWakeWord models.

In [None]:
# Install requirements (it's recommended that you do this in a new virtual environment)

# !pip install openwakeword
# !pip install speechbrain
# !pip install datasets
# !pip install scipy matplotlib
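
After installing, it can help to confirm that the packages are visible to the current kernel before running the imports. The following is a minimal sketch; the `REQUIRED` list and the `missing_packages` helper are illustrative names rather than part of openWakeWord, and the list simply mirrors the packages installed above plus `torch` and `tqdm`, which the later imports also use.

```python
import importlib.util

# Hypothetical list of top-level packages this notebook imports
REQUIRED = ["openwakeword", "speechbrain", "datasets", "scipy", "matplotlib", "torch", "tqdm"]

def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

print("Missing packages:", missing_packages(REQUIRED) or "none")
```

Any name reported as missing can then be installed with `pip` in the same virtual environment.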

In [1]:
# Imports

import os
import collections
import numpy as np
from numpy.lib.format import open_memmap
from pathlib import Path
from tqdm import tqdm
import openwakeword
import openwakeword.data
import openwakeword.utils
import openwakeword.metrics

import scipy
import scipy.io.wavfile  # imported explicitly so scipy.io.wavfile.write is available below
import datasets
import matplotlib.pyplot as plt
import torch
from torch import nn
import IPython.display as ipd

# Data Preparation

## Download Data

Next we'll load the data used for training. For the purposes of this demonstration, we'll use a small set of positive and negative examples.

For the positive data, there are ~3400 synthetic examples of the phrase "turn on the office lights" that were produced with the text-to-speech models documented in a [separate repo](https://github.com/dscripka/synthetic_speech_dataset_generation).

These positive examples can be downloaded [here](https://f002.backblazeb2.com/file/openwakeword-resources/data/turn_on_the_office_lights.tar.gz).

For negative data, we'll use small, already prepared samples of the [fma-large dataset](https://github.com/mdeff/fma) for music, the [FSD50k dataset](https://zenodo.org/record/4060432#.Y-hA2BzMJhE) for noise, and the [Common Voice 11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) dataset for speech.

The fma-large sample can be downloaded [here](https://f002.backblazeb2.com/file/openwakeword-resources/data/fma_sample.zip), and then extracted into the working directory.

The FSD50k sample can be downloaded [here](https://f002.backblazeb2.com/file/openwakeword-resources/data/fsd50k_sample.zip), and then extracted into the working directory.
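
If you would rather script these steps than download the files by hand, something along the following lines should work. This is only a sketch using the Python standard library: the `fetch_and_extract` helper and the `SAMPLE_URLS` list are illustrative names, the URLs are the three download links given above, and the layout of the extracted folders is not assumed here.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse

# The three sample archives linked above
SAMPLE_URLS = [
    "https://f002.backblazeb2.com/file/openwakeword-resources/data/turn_on_the_office_lights.tar.gz",
    "https://f002.backblazeb2.com/file/openwakeword-resources/data/fma_sample.zip",
    "https://f002.backblazeb2.com/file/openwakeword-resources/data/fsd50k_sample.zip",
]

def fetch_and_extract(url, dest_dir="."):
    """Download an archive (unless it is already present) and unpack it into dest_dir."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    archive = dest_dir / Path(urlparse(url).path).name
    if not archive.exists():
        urllib.request.urlretrieve(url, str(archive))
    # The archive format (.zip, .tar.gz) is inferred from the file extension
    shutil.unpack_archive(str(archive), str(dest_dir))
    return archive

# Uncomment to download and extract all three samples into the working directory
# for url in SAMPLE_URLS:
#     fetch_and_extract(url)
```

The download is skipped when the archive already exists, so re-running the cell only repeats the extraction.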

And we'll use the HuggingFace Datasets library to get a portion of the test split of the Common Voice 11 (CV11) corpus.

Note that the data provided here is intended for non-commercial applications only; you will need to verify the license status of this (and other) data if you intend to use it for commercial purposes.

In [381]:
# Download CV11 test split from HuggingFace, and convert the audio into 16 kHz, 16-bit wav files

cv_11 = datasets.load_dataset("mozilla-foundation/common_voice_11_0", "en", split="test", streaming=True)
cv_11 = cv_11.cast_column("audio", datasets.Audio(sampling_rate=16000, mono=True)) # resample to 16 kHz, mono
cv_11 = iter(cv_11)

# Convert and save clips (only first 5000)
limit = 5000
for i in tqdm(range(limit)):
    example = next(cv_11)
    # Swap the original file extension for ".wav"
    output = os.path.join("cv11_test_clips", os.path.splitext(example["path"])[0] + ".wav")
    os.makedirs(os.path.dirname(output), exist_ok=True)

    # Convert to 16-bit PCM format, clipping first so out-of-range samples cannot overflow
    wav_data = (np.clip(example["audio"]["array"], -1.0, 1.0)*32767).astype(np.int16)
    scipy.io.wavfile.write(output, 16000, wav_data)


Reading metadata...: 16354it [00:00, 26183.62it/s]
100%|██████████| 5000/5000 [00:44<00:00, 112.28it/s]
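
As a quick sanity check on the conversion step, the float-to-PCM scaling and the resulting file format can be verified on a synthetic signal. The sketch below is an illustration only: `float_to_int16` and `check_wav` are hypothetical helper names, and the 440 Hz tone stands in for a real clip so that no downloaded data is assumed.

```python
import os
import tempfile

import numpy as np
import scipy.io.wavfile

def float_to_int16(audio):
    """Scale float audio in [-1, 1] to 16-bit PCM, clipping out-of-range samples."""
    audio = np.clip(np.asarray(audio, dtype=np.float32), -1.0, 1.0)
    return (audio * 32767).astype(np.int16)

def check_wav(path, expected_rate=16000):
    """Return True if the file is mono, 16-bit PCM at the expected sample rate."""
    rate, data = scipy.io.wavfile.read(path)
    return rate == expected_rate and data.dtype == np.int16 and data.ndim == 1

# Round-trip a synthetic one-second 440 Hz tone through a temporary wav file
t = np.arange(16000) / 16000
tone = 0.5 * np.sin(2 * np.pi * 440 * t)
demo_path = os.path.join(tempfile.mkdtemp(), "tone.wav")
scipy.io.wavfile.write(demo_path, 16000, float_to_int16(tone))
print(check_wav(demo_path))
```

The same `check_wav` call can be pointed at any file under `cv11_test_clips` to confirm that it was written as 16 kHz, 16-bit mono audio.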
