# Preparing the Voxforge database

This notebook demonstrates how to prepare the free [Voxforge](http://www.voxforge.org/) database for training. It is a small-to-medium-sized speech database available online for free under the GPL license. A much more common database in research is [TIMIT](https://catalog.ldc.upenn.edu/LDC93S1), but that costs $250 and also isn't very large (although it is much more professionally developed than Voxforge). The best alternative today is the [Librispeech](http://www.openslr.org/12/) database, but that has a few dozen GB of data and wouldn't be sensible for a simple demo. So Voxforge it is...

The first thing to do is to understand what a speech corpus actually is: in its simplest form it is a collection of audio files (preferably containing only speech) together with a set of transcripts of that speech. There are a few extensions to this that are worth noting:
  * phonemes - transcripts are usually presented as a list of words - although not a rule, it is often easier to start the recognition process with phonemes and go from there. Voxforge defines a list of 39 phonemes (+ silence) and contains a lexicon mapping the words into phonemes (more about that below)
  * aligned speech - the transcripts are usually just a sequence of words/phonemes, but they don't denote which word/phoneme occurs when. There are models that can learn from such unaligned transcripts (seq2seq learning), but having alignments is usually a big plus. TIMIT was hand-aligned by a group of professionals (which is why it's a popular resource for research), but Voxforge wasn't. Fortunately, we can use one of the many available tools to do this automatically (with a margin of error - more on that below)
  * meta-data - each recording session in the Voxforge database contains a readme file with useful information about the speaker and the environment that the recording took place in. When making a serious speech recognizer, this information can be very useful (e.g. for speaker adaptation - taking into account the speaker id, gender, age, etc...)
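
To make the pieces listed above concrete, here is a minimal sketch of how a single utterance with its optional extras could be represented. The `Utterance` class and its field names are hypothetical (they are not part of the voxforge module used in this notebook), and the sketch is written for Python 3, unlike the notebook's own Python 2 cells.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Utterance:
    """One recording plus the optional extras (hypothetical layout)."""
    samples: List[int]                            # raw audio samples
    words: List[str]                              # word-level transcript
    phonemes: Optional[List[str]] = None          # filled in from a lexicon, if available
    phoneme_lengths: Optional[List[int]] = None   # alignment: duration of each phoneme
    meta: Dict[str, str] = field(default_factory=dict)  # speaker/environment info

    def is_aligned(self) -> bool:
        # Aligned means: every phoneme has a matching duration.
        return (self.phonemes is not None
                and self.phoneme_lengths is not None
                and len(self.phonemes) == len(self.phoneme_lengths))


utt = Utterance(samples=[278, 313, 139],
                words=['ANYWAY', 'NO', 'ONE'],
                meta={'Gender': 'Male'})
print(utt.is_aligned())
```

An utterance with only audio and words is the bare minimum; the phonemes, alignment and meta-data are filled in as they become available.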
  
## Downloading the corpus

To start working with the corpus, it needs to be downloaded first. All the files can be found in the download section of the Voxforge website under this URL:

http://www.repository.voxforge1.org/downloads/SpeechCorpus/Trunk/Audio/Main/16kHz_16bit/

There are 2 versions of the main corpus: sampled at 16 kHz and 8 kHz. The 16 kHz one is of better quality and is known as "desktop quality speech". While the original recordings were made at an even higher quality (44.1 kHz), 16 kHz is completely sufficient for recognizing speech (higher quality doesn't help much). 8 kHz is known as telephony quality and is the standard value for the old (uncompressed, aka T0) digital telephone signal. If you are making a recognizer that has to work in a telephony environment, you should use this data instead.
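
To put these sampling rates in perspective, the small sketch below computes two quantities for each rate: the highest frequency that can be represented (half the sampling rate, by the sampling theorem) and the raw data rate for 16-bit mono audio. The helper names are made up for this illustration.

```python
def nyquist_hz(sample_rate_hz):
    # Highest frequency representable at the given sampling rate.
    return sample_rate_hz / 2.0


def bytes_per_second(sample_rate_hz, bits_per_sample=16, channels=1):
    # Size of one second of uncompressed PCM audio.
    return sample_rate_hz * (bits_per_sample // 8) * channels


for rate in (8000, 16000, 44100):
    print(rate, nyquist_hz(rate), bytes_per_second(rate))
```

So the 16 kHz version keeps frequencies up to 8 kHz at 32000 bytes per second, while the 8 kHz version keeps only frequencies up to 4 kHz at half that size.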

To download the whole dataset, a small Python program is included in this demo. Be warned, this can take a long time (I think Voxforge is throttling the speed to save on costs) and restarts may be necessary. The Python method checks for failed downloads (by comparing file sizes) and restarts whatever wasn't downloaded completely, so you can run the method 2-3 times to make sure everything is ok.
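
The size-check-and-retry idea can be sketched roughly as follows. This is not the code from the voxforge module: `needs_download`, `download_all` and the `fetch` callback are hypothetical names, and the actual transfer (e.g. a thin wrapper around urllib) is deliberately left to the caller.

```python
import os


def needs_download(local_path, expected_size):
    """Return True if the file is missing or its size differs from expected_size."""
    if not os.path.isfile(local_path):
        return True
    return os.path.getsize(local_path) != expected_size


def download_all(files, fetch, max_passes=3):
    """Retry incomplete files for up to max_passes passes.

    files: dict mapping local path -> expected size in bytes
    fetch: callable(local_path) that (re)downloads one file (hypothetical)
    Returns the list of paths that are still incomplete afterwards.
    """
    pending = list(files)
    for _ in range(max_passes):
        # Keep only the files that are missing or truncated.
        pending = [p for p in pending if needs_download(p, files[p])]
        if not pending:
            break
        for path in pending:
            try:
                fetch(path)
            except IOError:
                pass  # leave the file for the next pass
    return [p for p in pending if needs_download(p, files[p])]
```

Running several passes mirrors the advice above: each pass only touches files whose size does not match, so repeating it is cheap once most of the data is in place.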

Alternatively, you can use a program like wget and enter this command (where "audio" is the directory to save the data to):

    wget -P audio -l 1 -N -nd -c -e robots=off -A tgz -r -np http://www.repository.voxforge1.org/downloads/SpeechCorpus/Trunk/Audio/Main/16kHz_16bit
 
First, let's import all the voxforge methods from the python directory. These need the following libraries installed on your system:
  * numpy - for working with data
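  * random, urllib, os, tarfile, gzip, re, pickle, shutil - these are part of the Python standard library, so everyone should have them
  * lxml - a third-party library (not part of the standard library) that you can install using pip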
  * scikits.audiolab - to load the audio files from the database (WAV and FLAC files)
  * tqdm - a [simple library for progressbars](https://github.com/noamraph/tqdm) that you can install using pip

In [1]:
import sys

sys.path.append('../python')

from voxforge import *



Ignore any warnings above (I couldn't be bothered to compile audiolab with ALSA). Below you will find the method to download the Voxforge database. You only need to do this once, so you can run it either here or from a console, or use wget. Be warned that it takes a long time (as mentioned earlier), so it's a good idea to leave it running overnight. I already did it once, so I'll skip it here:

In [None]:
downloadVoxforgeData('../audio')

## Loading the corpus

Once the data is downloaded and stored in the 'audio' subdirectory of the main project directory, we can start loading the data into a Python data structure. There are several methods that can be used for that. The following method will load a file and display its contents:

In [5]:
f=loadFile('../audio/Joel-20080716-qoz.tgz')
print f.props
print f.prompts
print f.data

%xdel f

{'Pronunciation dialect': 'American English', 'File type': 'wav', 'Age Range': 'Adult', 'Speaker Characteristics': '', 'Language': 'EN', 'File Info': '', 'Gender': 'Male', 'Audio Recording Software': 'VoxForge Speech Submission Application', 'Audio card type': 'unknown', 'User Name': 'Joel', 'Sample rate format': '16', 'Number of channels': '1', 'O/S': '', 'Microphone make': 'n/a', 'Path': 'Joel-20080716-qoz', 'Sampling Rate': '48000', 'Microphone type': 'USB Headset mic', 'Recording Information': '', 'Audio card make': 'unknown'}
{'b0076': ['BEFORE', 'PHILIP', 'COULD', 'RECOVER', 'HIMSELF', "JEANNE'S", 'STARTLED', 'GUARDS', 'WERE', 'UPON', 'HIM'], 'b0077': ['IT', 'IS', 'THE', 'NEAREST', 'REFUGE'], 'b0074': ['YET', 'BEHIND', 'THEM', 'THERE', 'WAS', 'ANOTHER', 'AND', 'MORE', 'POWERFUL', 'MOTIVE'], 'b0075': ['IN', 'THAT', 'CASE', 'HE', 'COULD', 'NOT', 'MISS', 'THEM', 'IF', 'HE', 'USED', 'CAUTION'], 'b0081': ['YOU', 'WERE', 'GOING', 'TO', 'LEAVE', 'AFTER', 'YOU', 'SAW', 'ME', 'ON', 'THE',
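
For readers curious how the prompts dictionary above might be obtained from an archive, here is a rough sketch using only the standard `tarfile` module. It assumes that each session archive contains a member whose name ends in `/PROMPTS`, with one utterance per line in the form of a path-like id followed by the words; both the layout and the function names are assumptions, not the actual `loadFile` implementation.

```python
import tarfile


def parse_prompts(text):
    """Parse 'some/path/id WORD WORD ...' lines into {id: [words]}."""
    prompts = {}
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) < 2:
            continue  # skip blank or malformed lines
        utt_id = tokens[0].split('/')[-1]  # keep only the last path component
        prompts[utt_id] = tokens[1:]
    return prompts


def read_prompts(archive_path):
    """Find the PROMPTS member in a .tgz session archive and parse it."""
    with tarfile.open(archive_path, 'r:gz') as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith('/PROMPTS'):
                raw = tar.extractfile(member).read()
                return parse_prompts(raw.decode('utf-8', errors='replace'))
    return {}  # no PROMPTS file found
```

Reading directly from the compressed archive avoids unpacking thousands of small files onto the disk.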

The loadBySpeaker method will load the whole folder and organize its contents by speakers (as a dictionary). Each utterance contains only the data and the prompts. For this demo, only 30 files are read - as this isn't a method we are going to ultimately use.

In [2]:
corp=loadBySpeaker('../audio', limit=30)
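
As a rough illustration of organizing sessions by speaker, the sketch below groups archive names by the part before the first hyphen (e.g. 'Joel-20080716-qoz' gives 'Joel'). This naming convention is an assumption, the function names are hypothetical, and the real `loadBySpeaker` may work differently (for example, it appears to number anonymous speakers separately).

```python
from collections import defaultdict


def speaker_of(archive_name):
    # Assumes names of the form '<speaker>-<date>-<code>'.
    return archive_name.split('-')[0]


def group_by_speaker(archive_names):
    """Map each speaker to the list of their session archives."""
    groups = defaultdict(list)
    for name in archive_names:
        groups[speaker_of(name)].append(name)
    return dict(groups)


sessions = ['Joel-20080716-qoz', 'Aaron-20080318-ngh', 'Joel-20080717-abc']
print(group_by_speaker(sessions))
```

Note that a user name which itself contains a hyphen would be cut short by this simple split.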



The corpus can also be extended by the phonetic transcription of the utterances using a lexicon file. Voxforge does provide such a file on its website and it is downloaded automatically (if it doesn't already exist).

Note that a single word can have several transcriptions. In the lexicon, these alternatives have sequential number suffixes added to the word (word, word2, word3, etc.), but this particular function does nothing about that. Choosing the right pronunciation variant has to be done either manually or by using a more sophisticated program (a pre-trained ASR system) to choose the right version automatically.
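
The variant-suffix convention can be illustrated with a small sketch. It assumes a simplified lexicon in which each line is a word followed by its phonemes (the real Voxforge file may contain extra columns), and all function names and sample entries here are hypothetical.

```python
import re
from collections import defaultdict

# A base word followed by a numeric variant suffix, e.g. 'THE2'.
_VARIANT = re.compile(r'^(.*?\D)(\d+)$')


def base_word(entry):
    """Strip a trailing variant number: 'THE2' -> 'THE'."""
    m = _VARIANT.match(entry)
    return m.group(1) if m else entry


def load_lexicon(lines):
    """Return {word: [variant1_phonemes, variant2_phonemes, ...]}."""
    lexicon = defaultdict(list)
    for line in lines:
        tokens = line.split()
        if len(tokens) < 2:
            continue  # skip blank or malformed lines
        lexicon[base_word(tokens[0])].append(tokens[1:])
    return dict(lexicon)


def transcribe(words, lexicon):
    """Naive transcription: always pick the first variant of each word."""
    phones = []
    for w in words:
        phones.extend(lexicon[w][0])
    return phones


lex = load_lexicon(['THE dh ah', 'THE2 dh iy', 'AT ae t'])
print(transcribe(['AT', 'THE'], lex))
```

Always taking the first variant is exactly the kind of shortcut the paragraph above warns about; words that legitimately end in a digit would also be mishandled by this simple suffix rule.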

In [8]:
addPhonemesSpk(corp,'../data/lex.tgz')

In [10]:
print corp.keys()

spk=corp.keys()[0]

print corp[spk]

%xdel corp

['Apple_Eater', 'ryanjyoder', 'Perygryne', 'apdsqueaky', 'sharrington', 'yoyology', 'camdixon', 'Krellis', 'rocketman768', 'anonymous_9', 'anonymous_8', 'anonymous_5', 'anonymous_4', 'anonymous_7', 'anonymous_6', 'anonymous_1', 'anonymous_3', 'anonymous_2', 'ductapeguy', 'bhuvan', 'Primus', 'Q', 'adgar', 'thepinkcat', 'farmerjack', 'Steltek', 'TimS', 'pcsnpny']
{'a0060': [array([ 278,  313,  139, ..., -443, -376, -179], dtype=int16), ['ANYWAY', 'NO', 'ONE', 'SAW', 'HER', 'LIKE', 'THAT'], ['eh', 'n', 'iy', 'w', 'ey', 'n', 'ow', 'w', 'ah', 'n', 's', 'ao', 'hh', 'er', 'l', 'ay', 'k', 'dh', 'ae', 't']], 'a0061': [array([-216, -258, -146, ..., -612, -567, -225], dtype=int16), ['PHILIP', 'SNATCHED', 'AT', 'THE', 'LETTER', 'WHICH', 'GREGSON', 'HELD', 'OUT', 'TO', 'HIM'], ['f', 'ih', 'l', 'ah', 'p', 's', 'n', 'ae', 'ch', 't', 'ae', 't', 'dh', 'ah', 'l', 'eh', 't', 'er', 'w', 'ih', 'ch', 'g', 'r', 'eh', 'g', 's', 'ah', 'n', 'hh', 'eh', 'l', 'd', 'aw', 't', 't', 'uw', 'hh', 'ih', 'm']], 'a0059':

## Aligned corpus

As mentioned earlier, this sort of corpus has its downsides. For one, we don't know when each phoneme occurs, so we cannot train the system discriminatively. While training without alignments is still possible, it would be nice if we could start with a simpler example. Another problem is choosing the right pronunciation variant, as mentioned above.

To solve these issues, an automatic alignment was created using a different ASR system called [Kaldi](http://kaldi-asr.org). Kaldi is a very good ASR solution that implements various types of models. It also contains simple out-of-the-box scripts for training on Voxforge data.

To create the alignments using Kaldi, a working system had to be trained first and, interestingly, the same Voxforge data was used to train it. How was this done? Well, Kaldi uses (among other things) a classic Gaussian Mixture Model and trains it using the EM algorithm. Initially the alignment is assumed to be uniform throughout the file, but as the system is trained iteratively, the model gets better and thus the alignment gets more accurate. The system is trained with gradually better models to achieve even more accurate results, and the solution provided here was generated using the "tri3b" model, as described in the scripts.
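
The "even" starting point can be illustrated with a few lines of code. This is only a sketch of the idea of a uniform initial alignment; Kaldi's actual procedure works on HMM states inside its own tools, and `uniform_alignment` is a name made up for this example.

```python
def uniform_alignment(num_frames, num_units):
    """Split num_frames as evenly as possible among num_units.

    Returns a list with the number of frames assigned to each unit.
    """
    if num_units <= 0:
        raise ValueError('need at least one unit to align')
    base, extra = divmod(num_frames, num_units)
    # The first `extra` units get one additional frame each.
    return [base + 1 if i < extra else base for i in range(num_units)]


print(uniform_alignment(10, 3))
```

Each training iteration then re-estimates the model from the current alignment and re-aligns the data with the improved model, moving the boundaries away from this crude initial guess.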

The alignments in Kaldi are stored in special binary files, but there are simple tools that help convert them into something easier to use. The type of file chosen for this example is the CTM file, a text file in which each line describes a single word or phoneme. Each line has 5 columns: encoded file name, unused id (always 1), segment start, segment length and segment text (i.e. the word or phoneme name/value). This file was generated using Kaldi, compressed using gzip and stored in 'ali.ctm.gz' in the 'data' directory of this project.
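
A minimal sketch of parsing the 5-column format described above is shown below. The `Segment` tuple, the function names and the sample lines are hypothetical (the sample values are not taken from the real file), and this is not the `convertCTMToAli` implementation.

```python
from collections import namedtuple

# One CTM line: file name, unused id, start, length, word/phoneme.
Segment = namedtuple('Segment', 'utt channel start length text')


def parse_ctm_line(line):
    utt, channel, start, length, text = line.split()
    return Segment(utt, channel, float(start), float(length), text)


def parse_ctm(lines):
    """Group the segments by file name, keeping their order."""
    by_utt = {}
    for line in lines:
        if not line.strip():
            continue  # skip empty lines
        seg = parse_ctm_line(line)
        by_utt.setdefault(seg.utt, []).append(seg)
    return by_utt


sample = ['utt1 1 0.00 0.48 sil',
          'utt1 1 0.48 0.07 f',
          'utt2 1 0.00 0.10 sil']
print(parse_ctm(sample)['utt1'])
```

In Python 3, the lines of a gzipped CTM file can be fed to `parse_ctm` directly by opening it with `gzip.open(path, 'rt')`.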

Please note that the number of files in this aligned set is smaller than the actual count in the whole Voxforge dataset. This is because there is a small percentage of errors in the database (around 100 files or so) and some recordings are of such poor quality that Kaldi couldn't generate a reasonable alignment for them. We can simply ignore them here. This, however, doesn't mean that all the alignments present in the CTM are 100% accurate. There can still be mistakes there, but hopefully they are rare enough not to cause any issues.

While this file contains everything that we need, it'd be useful to convert it into a data structure that can be easily used in Python. The convertCTMToAli method is used for that:

In [2]:
convertCTMToAli('../data/ali.ctm.gz','../data/phones.list','../audio','../data/ali.pklz')

Reading...
Writing...
Done


We store the generated data structure in a gzipped and pickled file, so we don't need to perform this step more than once. This file is already included in the repository, so you can skip the step above.

We can read the file like this:

In [4]:
import gzip
import pickle
with gzip.open('../data/ali.pklz') as f:
    ali=pickle.load(f)
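
For completeness, the write side of such a gzipped pickle can be sketched as follows. The helper names and the demo dictionary are hypothetical, and the file is written to a temporary directory so that nothing in the project is overwritten.

```python
import gzip
import os
import pickle
import tempfile


def save_pklz(obj, path):
    # Pickle the object straight into a gzip-compressed file.
    with gzip.open(path, 'wb') as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_pklz(path):
    with gzip.open(path, 'rb') as f:
        return pickle.load(f)


path = os.path.join(tempfile.mkdtemp(), 'demo.pklz')
save_pklz({'phones': [0, 10, 11], 'ph_lens': [480, 70, 70]}, path)
print(load_pklz(path))
```

Keep in mind that unpickling custom objects requires the defining module to be importable, and that a pickle written under Python 2 may need an explicit `encoding` argument to `pickle.load` when read under Python 3.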

In [14]:
print len(ali)

print ali[100].spk
print ali[100].phones
print ali[100].ph_lens
print ali[100].archive
print ali[100].audiofile
print ali[100].data

%xdel ali

59274
Aaron
[0, 10, 11, 28, 38, 23, 1, 31, 3, 23, 6, 25, 31, 3, 3, 35, 31, 28, 34, 32, 17, 23, 17, 31, 0]
[480, 70, 70, 50, 70, 70, 90, 90, 30, 60, 170, 90, 90, 90, 80, 50, 120, 60, 80, 100, 40, 60, 80, 140, 720]
Aaron-20080318-ngh
b0346
None
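
The `phones` and `ph_lens` lists are parallel: each phoneme id has a matching length. A small sketch of turning the lengths into start and end offsets is given below, using the first four values from the output above. The unit of `ph_lens` is not stated here, so the offsets are simply in that same unit, and `segment_bounds` is a hypothetical helper rather than part of the voxforge module.

```python
def segment_bounds(ph_lens):
    """Return (start, end) offsets for each segment, in the unit of ph_lens."""
    bounds = []
    start = 0
    for length in ph_lens:
        bounds.append((start, start + length))
        start += length
    return bounds


phones = [0, 10, 11, 28]
ph_lens = [480, 70, 70, 50]
for ph, (s, e) in zip(phones, segment_bounds(ph_lens)):
    print(ph, s, e)
```

The phoneme ids are presumably indices into the phones list passed to convertCTMToAli, which is how they could be mapped back to phoneme names.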


Please note that the audio data is not yet loaded at this step (it's set to None). To do this, we use the loadAlignedCorpus method. It loads the alignment and the appropriate audio datafile for each utterance. This step can take over half an hour to complete:

In [None]:
corp=loadAlignedCorpus('../data/ali.pklz','../audio')

 64%|██████▍   | 2531/3963 [33:14<03:20,  7.14it/s]