# Preparing the TIMIT database

If you are lucky enough to own the TIMIT database, or are willing to buy it from [here](https://catalog.ldc.upenn.edu/LDC93S1), you can use this simple script to prepare the HDF5 files, similarly to how it was done with the Voxforge dataset. For a detailed explanation of all the steps, refer to the VoxforgeDataPrep notebook.

To keep things brief, many of the functions were implemented as a library and stored in the *timit.py* script in the *python* directory. We import them here.

In [1]:
import sys

sys.path.append('../python')

from timit import *

We begin by loading a list of files and their alignments. These were generously provided by Microsoft as part of their deep learning library, CNTK. The files in question are located on their [GitHub](https://github.com/Microsoft/CNTK) page in the *CNTK/Examples/Speech/Miscellaneous/TIMIT/lib/mlf/* folder. I copied them into my *data* folder and loaded the required datasets using the methods below.

In [2]:
train_mlf=load_mlf('../data/mlf/TIMIT.train.align_cistate.mlf.cntk')
dev_mlf=load_mlf('../data/mlf/TIMIT.dev.align_cistate.mlf.cntk')
test_mlf=load_mlf('../data/mlf/TIMIT.core.align_cistate.mlf.cntk')
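
For orientation, the sketch below shows how an HTK-style master label file (MLF) could be parsed. It is not the `load_mlf` implementation from *timit.py*: the function name `parse_mlf`, the returned layout and the sample text (including the state labels) are assumptions for illustration, based on the standard HTK layout of a `#!MLF!#` header, a quoted label-file name per utterance, `start end label` lines and a lone `.` as terminator.

```python
import os


def parse_mlf(text):
    """Parse HTK-style MLF text into {utterance_name: [(start, end, label), ...]}."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0] != '#!MLF!#':
        raise ValueError('missing #!MLF!# header')
    mlf = {}
    current = None
    for ln in lines[1:]:
        if ln.startswith('"'):
            # A quoted label-file name opens a new utterance; drop path and extension.
            name = os.path.splitext(os.path.basename(ln.strip('"')))[0]
            current = mlf.setdefault(name, [])
        elif ln == '.':
            # A lone dot closes the current utterance.
            current = None
        elif current is not None:
            # Keep only the first three fields; any extra columns are ignored.
            fields = ln.split()
            current.append((int(fields[0]), int(fields[1]), fields[2]))
    return mlf


# Made-up sample, not taken from the CNTK files.
sample = '''#!MLF!#
"*/mfrm0_si1155.lab"
0 300000 aa_s2
300000 500000 aa_s3
.
'''
segments = parse_mlf(sample)
print(segments['mfrm0_si1155'])
```

Keying the result by the bare utterance name is a design choice that makes it easy to pair each alignment with a `.wav` file of the same name.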

TIMIT was originally split into several parts. The largest is the training portion, with 3696 utterances spoken by 462 different people. The test set contains 1344 utterances, but most papers use a smaller portion of it, known as the "core test set", which has 192 files. Finally, there is a portion in which many different speakers read the same two sentences, known as the "SA" dataset. This last one is of little use for studying ASR, but may be interesting for research on topics such as speaker variability.
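
These counts follow from the corpus layout: each TIMIT speaker reads 10 sentences, 2 of which are the shared SA sentences, leaving 8 per speaker. The speaker counts for the full test set (168) and the core test set (24) used below come from the TIMIT documentation rather than from this notebook.

```python
SENTENCES_PER_SPEAKER = 10
SA_PER_SPEAKER = 2
usable = SENTENCES_PER_SPEAKER - SA_PER_SPEAKER  # 8 non-SA sentences per speaker

# Number of speakers in each portion of the corpus.
speakers = {'train': 462, 'test': 168, 'core_test': 24}

# Non-SA utterances per portion.
utterances = {name: count * usable for name, count in speakers.items()}
for name in ('train', 'test', 'core_test'):
    print('{}: {}'.format(name, utterances[name]))
```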

The Microsoft people use the standard training set of 3696 utterances, as do most other researchers presenting results on TIMIT. They also use the core test set of 192 utterances, like everyone else. For the dev data, they use a collection of 400 sentences from the test set that aren't in the core set. Here we will use the same:

In [3]:
# Parenthesized single-argument print works under both Python 2 and Python 3
print('Train utterance num: {}'.format(len(train_mlf)))
print('Dev utterance num: {}'.format(len(dev_mlf)))
print('Test utterance num: {}'.format(len(test_mlf)))

Train utterance num: 3696
Dev utterance num: 400
Test utterance num: 192


After many years of use, I no longer have TIMIT in its original layout, so my organization may differ from what other people have. I split all the data into train, test, core_test and sa subfolders and put all the files from each set directly into its folder. I also changed the naming scheme to *{speaker}_{utterance}.wav*, so the file names look like this:

In [4]:
# Take one utterance name; works under both Python 2 and Python 3
print(next(iter(train_mlf)) + '.wav')

mfrm0_si1155.wav
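
Flat names of this kind could be produced from the original distribution with a small helper. The sketch assumes the original LDC directory layout of `<SPLIT>/<DIALECT>/<SPEAKER>/<UTTERANCE>.WAV`; `flat_name` is a hypothetical name, not a function from *timit.py*.

```python
import posixpath


def flat_name(original_path):
    """Map a path like 'TRAIN/DR1/FCJF0/SA1.WAV' to 'fcjf0_sa1.wav'."""
    # Normalize separators so the helper behaves the same on every platform.
    parts = original_path.replace('\\', '/').split('/')
    speaker = parts[-2]  # the speaker ID is the parent directory
    utterance = posixpath.splitext(parts[-1])[0]  # file name without extension
    return '{}_{}.wav'.format(speaker, utterance).lower()


print(flat_name('TRAIN/DR1/FCJF0/SA1.WAV'))
```

The dialect-region folder is dropped because the speaker ID and sentence ID together already identify the recording.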


The *prepare_corp* method will load all the audio and prepare the phoneme (actually *senone*, i.e. HMM state) lists from the MLF in the same format as presented in the VoxforgeDataPrep notebook: a list of utterance objects, each containing the audio data, a list of phonemes and a list of their lengths. Such a data structure can be processed by the *extract_features* and *normalize* methods below.

In [5]:
train_corp=prepare_corp(train_mlf,'../data/mlf/TIMIT.statelist','../TIMIT/train')
dev_corp=prepare_corp(dev_mlf,'../data/mlf/TIMIT.statelist','../TIMIT/test')
test_corp=prepare_corp(test_mlf,'../data/mlf/TIMIT.statelist','../TIMIT/core_test')
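
To make the shape of one utterance record concrete, here is a minimal sketch. The `Utterance` type, its field names and the 10 ms frame shift are assumptions for illustration; the actual objects built by *prepare_corp* are defined in *timit.py* and may differ. HTK-style alignments give times in units of 100 ns, so a 10 ms frame corresponds to 100000 units.

```python
from collections import namedtuple

# Hypothetical record type: one entry per utterance.
Utterance = namedtuple('Utterance', ['name', 'audio', 'phonemes', 'lengths'])

FRAME_SHIFT = 100000  # 10 ms in HTK time units of 100 ns (assumed frame shift)


def to_utterance(name, audio, segments):
    """Build a record from a list of (start, end, label) alignment segments."""
    phonemes = [label for _, _, label in segments]
    # Length of each segment, in frames.
    lengths = [(end - start) // FRAME_SHIFT for start, end, _ in segments]
    return Utterance(name, audio, phonemes, lengths)


segments = [(0, 300000, 'aa_s2'), (300000, 500000, 'aa_s3')]  # made-up alignment
utt = to_utterance('mfrm0_si1155', [0] * 800, segments)  # dummy audio samples
print(utt.phonemes, utt.lengths)
```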



Here we extract a simple 39-dimensional MFCC feature set. This is the same method as in the VoxforgeDataPrep notebook:

In [6]:
extract_features(train_corp, '../data/TIMIT_train.hdf5')
extract_features(dev_corp, '../data/TIMIT_dev.hdf5')
extract_features(test_corp, '../data/TIMIT_test.hdf5')
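
A 39-dimensional MFCC vector is conventionally built from 13 static coefficients plus their first and second time derivatives (deltas and delta-deltas). Whether *extract_features* composes its 39 dimensions exactly this way is an assumption here. The sketch below only illustrates the stacking, using the common regression formula for deltas and plain Python lists of dummy values in place of real MFCCs; `deltas` and `add_deltas` are hypothetical names.

```python
def deltas(frames, window=2):
    """Regression deltas over +/- `window` frames, repeating the edge frames."""
    norm = 2.0 * sum(n * n for n in range(1, window + 1))
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        row = []
        for d in range(len(frames[t])):
            acc = 0.0
            for n in range(1, window + 1):
                ahead = frames[min(t + n, last)][d]  # clamp at the last frame
                behind = frames[max(t - n, 0)][d]    # clamp at the first frame
                acc += n * (ahead - behind)
            row.append(acc / norm)
        out.append(row)
    return out


def add_deltas(static):
    """Append deltas and delta-deltas to every static frame."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# Dummy "MFCC" input: 5 frames of 13 coefficients rising linearly in time.
static = [[float(t)] * 13 for t in range(5)]
features = add_deltas(static)
print(len(features), len(features[0]))
```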



Here we normalize the data, same as with Voxforge:

In [7]:
normalize('../data/TIMIT_train.hdf5')
normalize('../data/TIMIT_dev.hdf5')
normalize('../data/TIMIT_test.hdf5')
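
A common choice for this step is mean and variance normalization per feature dimension. The sketch below illustrates that idea on plain Python lists; it is an assumption about what *normalize* does, not a copy of it, and `mean_variance_normalize` is a hypothetical name. The real method works on the HDF5 file whose path it is given.

```python
import math


def mean_variance_normalize(frames):
    """Scale each feature dimension to zero mean and unit variance."""
    count = float(len(frames))
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / count for d in range(dims)]
    stds = []
    for d in range(dims):
        var = sum((f[d] - means[d]) ** 2 for f in frames) / count
        # Fall back to 1.0 so constant dimensions do not divide by zero.
        stds.append(math.sqrt(var) or 1.0)
    return [[(f[d] - means[d]) / stds[d] for d in range(dims)] for f in frames]


frames = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]  # toy data: 3 frames, 2 dimensions
normalized = mean_variance_normalize(frames)
print(normalized)
```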

