
Extracting with Kaldi

Andy T. Liu edited this page Jun 17, 2020 · 3 revisions


Setup Kaldi

  • Install Kaldi
  • As suggested during the installation, do not forget to add the paths of the Kaldi binaries to $HOME/.bashrc. For instance, make sure that .bashrc contains the following:
export KALDI_ROOT=/home/mirco/kaldi
PATH=$PATH:$KALDI_ROOT/tools/openfst
PATH=$PATH:$KALDI_ROOT/src/featbin
PATH=$PATH:$KALDI_ROOT/src/gmmbin
PATH=$PATH:$KALDI_ROOT/src/bin
PATH=$PATH:$KALDI_ROOT/src/nnetbin
export PATH
  • Then reload .bashrc with the command: source ~/.bashrc
  • Remember to change the KALDI_ROOT variable to your own path. To find it, cd into the Kaldi directory and run: pwd.
  • As a first test of the installation, open a bash shell, run copy-feats or hmm-info, and make sure no errors appear.
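The installation check above can be scripted; a minimal sketch that only assumes the two Kaldi tools named above (copy-feats and hmm-info) should be reachable on PATH:

```shell
# Report whether each expected Kaldi binary is reachable on PATH.
# If Kaldi is not installed yet, every tool is reported as missing.
for tool in copy-feats hmm-info; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING -- check KALDI_ROOT and PATH in ~/.bashrc"
  fi
done
```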

Preprocessing LibriSpeech

  1. Copy the scripts in src/kaldi_egs_librispeech_s5/ into your $KALDI_ROOT/egs/librispeech/s5/.
  2. If running on a single machine, edit $KALDI_ROOT/egs/librispeech/s5/cmd.sh and replace queue.pl with run.pl.
    • Change the lines to:
export train_cmd="run.pl --mem 2G"
export decode_cmd="run.pl --mem 4G"
export mkgraph_cmd="run.pl --mem 8G"
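The same edit can be done with sed; a sketch shown on an inline string rather than the real cmd.sh, so it is safe to run anywhere:

```shell
# Rewrite every queue.pl invocation to run.pl, leaving the --mem
# options untouched. Applied to the real file this would be:
#   sed -i 's/queue\.pl/run.pl/g' cmd.sh
printf 'export train_cmd="queue.pl --mem 2G"\n' |
  sed 's/queue\.pl/run.pl/g'
```

This prints `export train_cmd="run.pl --mem 2G"`.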
  3. Change the data path in run.sh to your LibriSpeech data path; the directory LibriSpeech/ should be located under that path. For example:
data=/media/andi611/1TBSSD
  4. Make sure that flac is installed if you are using a Linux machine:
sudo apt-get install flac
  5. Run the Kaldi LibriSpeech recipe run.sh at least through Stage 13 (inclusive):
./run.sh
  6. Copy the exp/tri4b/trans.* files into exp/tri4b/decode_tgsmall_train_clean_*/:
mkdir exp/tri4b/decode_tgsmall_train_clean_100 && cp exp/tri4b/trans.* exp/tri4b/decode_tgsmall_train_clean_100/
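The wildcard above suggests more than one decode directory, while the command shows only train_clean_100; a hedged dry-run sketch covering both train_clean subsets (train_clean_360 is an assumption based on the wildcard — drop the echo to actually execute):

```shell
# Dry-run: print the mkdir/cp commands for each decode directory.
for sub in 100 360; do
  dir="exp/tri4b/decode_tgsmall_train_clean_${sub}"
  echo "mkdir -p $dir"
  echo "cp exp/tri4b/trans.* $dir/"
done
```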
  7. Compute the fMLLR features by running the following script:
./compute_fmllr.sh
  8. Compute alignments using:
# alignments on dev_clean, test_clean, and the three training subsets
steps/align_fmllr.sh --nj 10 data/dev_clean data/lang exp/tri4b exp/tri4b_ali_dev_clean
steps/align_fmllr.sh --nj 10 data/test_clean data/lang exp/tri4b exp/tri4b_ali_test_clean
steps/align_fmllr.sh --nj 30 data/train_clean_100 data/lang exp/tri4b exp/tri4b_ali_clean_100
steps/align_fmllr.sh --nj 30 data/train_clean_360 data/lang exp/tri4b exp/tri4b_ali_clean_360
steps/align_fmllr.sh --nj 30 data/train_other_500 data/lang exp/tri4b exp/tri4b_ali_other_500
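The five calls above differ only in the data subset and job count, so they can also be written as a loop; a dry-run sketch (echo prints each command instead of executing it):

```shell
# Print each align_fmllr.sh invocation; nj is 10 for dev/test and 30
# for the training subsets. The output directory drops the train_
# prefix, matching the commands above.
for entry in dev_clean:10 test_clean:10 train_clean_100:30 \
             train_clean_360:30 train_other_500:30; do
  name=${entry%%:*}
  nj=${entry##*:}
  echo "steps/align_fmllr.sh --nj $nj data/$name data/lang exp/tri4b exp/tri4b_ali_${name#train_}"
done
```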

Dump Kaldi LibriSpeech Data to .npy

To pre-train models with the Kaldi fMLLR features generated above, follow these steps:

  1. Apply CMVN and dump the fMLLR, MFCC, and fbank features to new .ark files:
./dump_fmllr_cmvn.sh
./dump_mfcc_cmvn.sh
./dump_fbank_cmvn.sh # this requires a second run of stage 6 in `/run.sh`, see the comments in stage 6.
  2. Use the Python script to convert the Kaldi-generated .ark features to .npy for our S3PRL dataloader. Modify the path and settings in the script, then run:
cd Self-Supervised-Speech-Pretraining-and-Representation-Learning/preprocess/
python3 ark2libri.py # DATA_TYPE = 'fmllr' by default, this can be either 'mfcc', 'fbank',  or 'fmllr'
  3. In order to pre-train on fMLLR features, change the data_path argument in the config files config/*.yaml to the following:
data_path: 'data/libri_fmllr_cmvn' # or 'data/libri_mfcc_cmvn', 'data/libri_fbank_cmvn'
  4. Modify the train_set argument in the config files config/*.yaml to train on different subsets:
train_set: ['train-other-500', 'train-clean-360', 'train-clean-100'] 
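The config edit in step 3 can also be scripted with sed; a sketch against a throwaway example.yaml (a stand-in file created here, not a real S3PRL config):

```shell
# Create a stand-in config, then point data_path at the fMLLR features.
cat > example.yaml <<'EOF'
data_path: 'data/libri_mfcc_cmvn'
train_set: ['train-clean-100']
EOF
sed -i "s|^data_path:.*|data_path: 'data/libri_fmllr_cmvn'|" example.yaml
cat example.yaml
```

Note that `sed -i` as written assumes GNU sed (Linux); on BSD/macOS it needs `sed -i ''`.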

Preprocessing Timit

  1. Download the TIMIT dataset from the LDC website.
  2. Copy the scripts in src/kaldi_egs_timit_s5/ into your $KALDI_ROOT/egs/timit/s5/.
  3. Run the Kaldi s5 baseline of TIMIT (Remember to modify the paths in run.sh to your own):
cd kaldi/egs/timit/s5
./run.sh
./local/nnet/run_dnn.sh
  4. Compute the alignments (i.e., the phone-state labels) with the following commands:
steps/nnet/align.sh --nj 4 data-fmllr-tri3/train data/lang exp/dnn4_pretrain-dbn_dnn exp/dnn4_pretrain-dbn_dnn_ali
steps/nnet/align.sh --nj 4 data-fmllr-tri3/dev data/lang exp/dnn4_pretrain-dbn_dnn exp/dnn4_pretrain-dbn_dnn_ali_dev
steps/nnet/align.sh --nj 4 data-fmllr-tri3/test data/lang exp/dnn4_pretrain-dbn_dnn exp/dnn4_pretrain-dbn_dnn_ali_test
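The three align.sh calls can be expressed as a loop over the splits; a dry-run sketch (echo prints each command instead of executing it):

```shell
# Print each align.sh invocation. The train split writes to the ali
# directory with no suffix; dev and test append _dev / _test, matching
# the commands above.
for split in train dev test; do
  if [ "$split" = train ]; then suffix=""; else suffix="_$split"; fi
  echo "steps/nnet/align.sh --nj 4 data-fmllr-tri3/$split data/lang exp/dnn4_pretrain-dbn_dnn exp/dnn4_pretrain-dbn_dnn_ali$suffix"
done
```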

Dump Kaldi Timit Data to .npy

To pre-train models with the Kaldi fMLLR features generated above, follow these steps:

  1. Compute CMVN on the fMLLR features by running the following script:
./dump_fmllr_cmvn.sh
  2. Use the provided Python script to convert the Kaldi-generated .ark features to .npy for our S3PRL dataloader. Change the path settings in the script, then run:
cd Self-Supervised-Speech-Pretraining-and-Representation-Learning/preprocess/
python ark2timit.py
  3. In order to pre-train on fMLLR features, change the data_path argument in the config files config/*.yaml to the following:
data_path: 'data/timit_fmllr_cmvn'
  4. Modify the train_set argument in the config files config/*.yaml to train with TIMIT:
train_set: ['train']