<img src='https://hammondm.github.io/hltlogo1.png' style="float:right">

Linguistics 578<br>
Fall 2023<br>
Hammond

<h3 align=center>
Traditional recognition
</h3>

### Overview

The general topic of this notebook is *traditional speech recognition*.

Topics to come:

1. General logic of automatic speech recognition
1. Improving `tr8.py` (from the book)
1. How to compose an HMM-GMM with transducers
1. State tying
1. Kaldi

Imports:

In [None]:
import pomegranate as p
import numpy as np
import librosa,re,os,time,graphviz
from scipy.io import wavfile
import matplotlib.pyplot as plt

### Overview of ASR

The general logic of traditional ASR is that we model it as a composition of transducers.

The first transducer is the *acoustic model*, where we map from acoustic frames to phonetic units like segments. (This notebook focuses on this.)

This is composed with at least two more weighted automata. One maps from phonetic elements to word tokens, effectively winnowing the sequences of phonetic elements to actual words in the language.

The bottommost weighted transducer is an identity transducer that chooses among different word sequences, effectively winnowing word sequences to the most likely sequence.

We call these latter steps the *language model*.

![full model](asr.png)

We've already played with parts of the language model and so we focus on the acoustic model here.
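
To make the cascade concrete, here is a toy sketch of how the three score sources combine in the log domain. Everything in it is made up for illustration: the phone sequences, the words, and all the probabilities are hypothetical, and a real system composes weighted transducers rather than looping over dictionaries.

```python
import math

# Hypothetical acoustic-model output: log P(frames | phone sequence)
acoustic = {
    ('t', 'u'): math.log(0.6),
    ('t', 'i'): math.log(0.4),
}
# Hypothetical lexicon: phone sequence -> candidate words with log weights
lexicon = {
    ('t', 'u'): [('two', math.log(0.5)), ('too', math.log(0.5))],
    ('t', 'i'): [('tea', math.log(1.0))],
}
# Hypothetical unigram language model: log P(word)
lm = {'two': math.log(0.7), 'too': math.log(0.2), 'tea': math.log(0.1)}

def decode(acoustic, lexicon, lm):
    """Return the best word and its total log score."""
    best = None
    for phones, ascore in acoustic.items():
        for word, lscore in lexicon.get(phones, []):
            # weights along a path add in the log domain
            total = ascore + lscore + lm[word]
            if best is None or total > best[1]:
                best = (word, total)
    return best

best_word, best_score = decode(acoustic, lexicon, lm)
print(best_word, round(best_score, 3))
```

The point is only that each stage contributes a weight and the best path is the one with the highest combined weight.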

We use the *speech commands* dataset which you can download from [here](https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html).

### Improving the code from the book

Here's the code from `tr8.py` in the book, which implements the acoustic model. This one uses:

- Linear HMMs
- LPC
- Multivariate Gaussians

I've tweaked the code in a couple of ways.

1. I tweaked how it prints updates.
1. I added bits at the beginning and end to time the whole thing.
1. I tweaked the parameters for the `fit()` method. Limiting the number of iterations gets *much* better performance, indicating that the models can overfit.

Let's run it and then go through the code.

In [None]:
#LPC version tweaked from book code
start_time = time.time()

order = 8
wlength = 300
numtrain = 20

digits = [
    'zero','one','two','three','four',
    'five','six','seven','eight','nine'
]

where = '/mhdata/commands/'

#create stored digits
allscores = []
filelist = []
for digit in digits:
    digitset = []
    files = os.listdir(where+digit)
    filelist.append(files)
    for f in files:
        try:
            fs,w = wavfile.read(where + digit + '/' + f)
            w = w.astype(float)
            cur = 0
            res = []
            while cur+wlength <= len(w):
                lpc = librosa.lpc(
                    y=w[cur:cur+wlength],
                    order=order
                )
                res.append(lpc)
                cur += wlength
            res = np.array(res)
            digitset.append(res)
        except Exception:
            #skip files that can't be read or analyzed
            pass
        if len(digitset) == numtrain+10: break
    allscores.append(digitset)

#make linear HMMs
print('creating HMMs...')
segments = np.array([4,3,2,3,3,4,4,5,3,4,3])
lengths = segments*3 + 2
hmms = []
for i in range(10):
    states = lengths[i]
    m = p.HiddenMarkovModel('d' + str(i))
    #states
    statelist = []
    for s in range(states):
        d = p.MultivariateGaussianDistribution(
            np.arange(order),
            np.eye(order)
        )
        s = p.State(d,name='s' + str(s))
        statelist.append(s)
    m.add_states(statelist)
    #start prob
    m.add_transition(m.start,statelist[0],1.0)
    #final state
    #m.add_transition(m.end,statelist[-1],0.5)  
    m.add_transition(statelist[-1],m.end,0.5)
    
    #loop transitions
    for state in statelist:
        m.add_transition(state,state,0.5)
    #sequential transitions
    for i in range(len(statelist)-1):
        m.add_transition(
            statelist[i],
            statelist[i+1],
            0.5
        )
    m.bake()
    hmms.append(m)

#train HMMs
print('training...')
for i in range(10):
    trainset = allscores[i][:numtrain]
    hmm = hmms[i]
    hmm.fit(
        trainset,
        n_jobs=-1,
        max_iterations=5
    )
    print(f'\t{i}')

#test HMMs
print('testing...')
total = 0
for i in range(10):
    testset = allscores[i][numtrain:numtrain+10]
    for testitem in testset:
        allres = []
        for hmm in hmms:
            res = hmm.probability(testitem)
            allres.append(res)
        allres = np.array(allres)
        idx = allres.argmax()
        if idx == i: total += 1

print(f'Correct: {total}/100')

print(f'({time.time() - start_time:.4} seconds)')
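
To see what the probability of a linear HMM amounts to, here is a minimal `numpy` sketch of the forward algorithm for the same topology (start in the first state, 0.5 loop and 0.5 advance on every state, exit from the last state). This is not how *pomegranate* is implemented. The function names are invented for this sketch, and it assumes diagonal-covariance Gaussians for simplicity.

```python
import numpy as np

def linear_transitions(n_states, stay=0.5):
    """Transition matrix for a left-to-right HMM with self-loops.

    Row i gives the probabilities of leaving state i; the final
    column is an extra 'end' state.
    """
    A = np.zeros((n_states, n_states + 1))
    for i in range(n_states):
        A[i, i] = stay          # loop transition
        A[i, i + 1] = 1 - stay  # sequential (or final) transition
    return A

def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at each frame of x."""
    return -0.5 * np.sum(
        np.log(2 * np.pi * var) + (x - mean) ** 2 / var, axis=-1
    )

def forward_logprob(frames, means, variances, stay=0.5):
    """Log probability of frames under a linear HMM (forward algorithm)."""
    n = len(means)
    with np.errstate(divide='ignore'):
        logA = np.log(linear_transitions(n, stay))
    # emission log densities: one row per frame, one column per state
    logB = np.stack(
        [log_gauss(frames, means[s], variances[s]) for s in range(n)],
        axis=1
    )
    # the model must start in the first state
    alpha = np.full(n, -np.inf)
    alpha[0] = logB[0, 0]
    for t in range(1, len(frames)):
        # log-sum-exp over predecessor states for each current state
        scores = alpha[:, None] + logA[:, :n]
        alpha = np.logaddexp.reduce(scores, axis=0) + logB[t]
    # the model must exit from the last state
    return alpha[-1] + logA[-1, n]

# a sequence that moves from around 0 to around 5
frames = np.array([[0.0], [0.1], [5.0], [4.9]])
up = forward_logprob(frames, np.array([[0.0], [5.0]]), np.ones((2, 1)))
down = forward_logprob(frames, np.array([[5.0], [0.0]]), np.ones((2, 1)))
print(up > down)
```

The model whose states are ordered like the input scores higher. The testing loop above makes the same comparison across the ten digit models.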

The book notes that several things remain to be done to bring this up to the state of the art for HMM-based recognition:

1. use MFCCs
1. use CMVN
1. use Gaussian mixture models
1. use state tying

Let's implement some of these here.

Let's start with using MFCCs instead of LPC. Rather than implement this from scratch, we'll use the implementation from `librosa`.

Let's first take a look at the MFCC steps. First we read in a file.

In [None]:
files = os.listdir(where+'zero')
fs,w = wavfile.read(where + 'zero/' + files[0])

Let's plot it to make sure there are no surprises.

In [None]:
plt.plot(w)
plt.show()

Now we compute the MFCCs. There are several things to note here:

1. We specify 26 initial MFCC coefficients.
1. We specify 25 msec windows and 10 msec hops (400 and 160 samples at the 16 kHz sampling rate of these files).
1. We then extract the 2nd through 13th coefficients.
1. We have to calculate deltas and delta-deltas separately and then tack them on.

In [None]:
w = w.astype(float)
mfcc = librosa.feature.mfcc(
    y=w,
    sr=fs,
    n_mfcc=26,
    win_length=400,
    hop_length=160
)
mfcc = mfcc[1:13,:]
delta = librosa.feature.delta(mfcc)
deltadelta = librosa.feature.delta(mfcc,order=2)
res = np.vstack([mfcc,delta,deltadelta])
plt.plot(res.T)
plt.show()
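
The numbers in that cell follow from some simple arithmetic, sketched here. The helper names are invented for this sketch. It assumes a 16 kHz sampling rate and a one-second clip, and the frame count assumes `librosa`'s default centered framing.

```python
def frame_params(fs, win_ms=25, hop_ms=10):
    """Convert window and hop durations in msec to sample counts."""
    return fs * win_ms // 1000, fs * hop_ms // 1000

def feature_dim(n_static=12, n_streams=3):
    """Static coefficients plus deltas and delta-deltas."""
    return n_static * n_streams

win_length, hop_length = frame_params(16000)
# one frame per hop, plus one, for a one-second clip with centered framing
n_frames = 1 + 16000 // hop_length
print(win_length, hop_length, n_frames, feature_dim())
```

So each file becomes a matrix with one 36-dimensional column per frame, which is why the distributions later in the notebook are built from 36 components.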

Below is a revision of the code above that substitutes MFCCs for LPC. Let's first do it with a minimal HMM to make sure we have the MFCC parts right. The HMMs below have only a single state each.

Working this out revealed a couple of important things.

1. The Gaussian part actually requires the syntax below, rather than what we have above. Note that this is still multivariate, but not a mixture.
1. Probability values are *extremely* small, so we switch to `log_probability()` here.

With these limits, the system does *not* perform well.

In [None]:
#MFCCs, single-node HMMs, no mixtures
start_time = time.time()

numtrain = 100

#create stored digits
allscores = []
filelist = []
for digit in digits:
    digitset = []
    files = os.listdir(where+digit)
    filelist.append(files)
    for f in files:
        try:
            fs,w = wavfile.read(where + digit + '/' + f)
            w = w.astype(float)
            mfcc = librosa.feature.mfcc(
                y=w,
                sr=fs,
                n_mfcc=26,
                win_length=400,
                hop_length=160
            )
            mfcc = mfcc[1:13,:]
            delta = librosa.feature.delta(mfcc)
            deltadelta = librosa.feature.delta(mfcc,order=2)
            res = np.vstack([mfcc,delta,deltadelta])
            res = res.T
            
            digitset.append(res)
        except Exception:
            #skip files that can't be read or analyzed
            pass
        if len(digitset) == numtrain+10: break
    allscores.append(digitset)

#make HMMs
print('creating HMMs...')
hmms = []
for i in range(10):
    states = lengths[i]
    m = p.HiddenMarkovModel('d' + str(i))
    #states
    statelist = []
    ds = []
    for i in range(36):
        ds.append(p.NormalDistribution(1,1))
    d = p.IndependentComponentsDistribution(ds)
        
    s = p.State(d,name='s')
    statelist.append(s)
    m.add_states(statelist)
    #start prob
    m.add_transition(m.start,statelist[0],1.0)
    m.add_transition(statelist[0],statelist[0],0.5)
    
    #sequential transitions
    for i in range(len(statelist)-1):
        m.add_transition(
            statelist[i],
            statelist[i+1],
            0.5
        )
    
    #final state
    m.add_transition(statelist[-1],m.end,0.5)
    
    m.bake()
    hmms.append(m)

#train HMMs
print('training...')
for i in range(10):
    trainset = allscores[i][:numtrain]
    hmm = hmms[i]
    hmm.fit(
        trainset,
        n_jobs=-1,
        max_iterations=5
    )
    print(f'\t{i}')

#test HMMs
print('testing...')
total = 0
for i in range(10):
    testset = allscores[i][numtrain:numtrain+10]
    for testitem in testset:
        allres = []
        for hmm in hmms:
            res = hmm.log_probability(testitem)
            allres.append(res)
        allres = np.array(allres)
        idx = allres.argmax()
        if idx == i: total += 1

print(f'Correct: {total}/100')

print(f'({time.time() - start_time:.4} seconds)')
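
The switch to log probabilities is worth a quick illustration. The per-frame likelihoods below are made-up values, not output from the models above. With around a hundred frames per file, the product of the likelihoods underflows to zero for both hypothetical models, so `argmax` has nothing to compare, while the sums of logs stay finite and distinct.

```python
import numpy as np

# hypothetical per-frame likelihoods for two competing models
model_a = np.full(100, 1e-5)
model_b = np.full(100, 2e-5)

# multiplying raw probabilities underflows to 0.0 for both models
prod_a, prod_b = np.prod(model_a), np.prod(model_b)

# summing logs keeps the two models apart
log_a, log_b = np.sum(np.log(model_a)), np.sum(np.log(model_b))

print(prod_a == prod_b, log_b > log_a)
```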

Let's now convert to mixture models. The following code does this. Note that we can specify how many mixtures here.

CMVN for files is actually quite simple, so we add that code along with a flag to turn it on and off.

These do *not* improve things immediately. In fact, CMVN here decreases performance.

In [None]:
#MFCCs, single-node HMMs, mixtures, CMVN

mixtures = 2

cmvn = False

numtrain = 100

start_time = time.time()

#create stored digits
allscores = []
filelist = []
for digit in digits:
    digitset = []
    files = os.listdir(where+digit)
    filelist.append(files)
    for f in files:
        try:
            fs,w = wavfile.read(where + digit + '/' + f)
            w = w.astype(float)
            mfcc = librosa.feature.mfcc(
                y=w,
                sr=fs,
                n_mfcc=26,
                win_length=400,
                hop_length=160
            )
            mfcc = mfcc[1:13,:]
            delta = librosa.feature.delta(mfcc)
            deltadelta = librosa.feature.delta(mfcc,order=2)
            res = np.vstack([mfcc,delta,deltadelta])
            res = res.T
            
            #CMVN bits
            if cmvn:
                mean = np.mean(res,axis=0)
                std = np.std(res,axis=0)
                res = (res - mean)/std
            
            digitset.append(res)
        except Exception:
            #skip files that can't be read or analyzed
            pass
        if len(digitset) == numtrain+10: break
    allscores.append(digitset)

#make HMMs
print('creating HMMs...')
hmms = []
for i in range(10):
    #states = lengths[i]
    m = p.HiddenMarkovModel('d' + str(i))
    #states
    statelist = []
    
    mix = []
    for j in range(mixtures):
        ds = []
        for i in range(36):
            ds.append(p.NormalDistribution(1,1))
        mix.append(p.IndependentComponentsDistribution(ds))
    weights = np.ones(mixtures)
    weights *= 1/mixtures
    d = p.GeneralMixtureModel(mix,weights=weights)
    
    s = p.State(d,name='s')
    statelist.append(s)
    m.add_states(statelist)
    #start prob
    m.add_transition(m.start,statelist[0],1.0)
    m.add_transition(statelist[0],statelist[0],0.5)
    
    #final state
    #m.add_transition(m.end,statelist[-1],0.5)
    m.add_transition(statelist[-1],m.end,0.5)

    m.bake()
    hmms.append(m)

#train HMMs
print('training...')
for i in range(10):
    trainset = allscores[i][:numtrain]
    hmm = hmms[i]
    hmm.fit(
        trainset,
        n_jobs=-1,
        max_iterations=5
    )
    print(f'\t{i}')

#test HMMs
print('testing...')
total = 0
for i in range(10):
    testset = allscores[i][numtrain:numtrain+10]
    for testitem in testset:
        allres = []
        for hmm in hmms:
            res = hmm.log_probability(testitem)
            allres.append(res)
        allres = np.array(allres)
        idx = allres.argmax()
        if idx == i: total += 1

print(f'Correct: {total}/100')

print(f'({time.time() - start_time:.4} seconds)')
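
The CMVN step can also be pulled out into a function, as in this sketch. The function name and the `eps` guard are additions here, not part of the code above. The guard just avoids dividing by zero if some coefficient happens to be constant across a file.

```python
import numpy as np

def cmvn(features, eps=1e-8):
    """Per-file cepstral mean and variance normalization.

    features: array of shape (frames, coefficients).
    eps: hypothetical floor on the standard deviation.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / np.maximum(std, eps)

# random stand-in for one file's 36-dimensional features
rng = np.random.default_rng(0)
feats = rng.normal(loc=3.0, scale=2.0, size=(101, 36))
normed = cmvn(feats)
print(normed.mean(axis=0).round(6)[:3], normed.std(axis=0).round(6)[:3])
```

After normalization, every coefficient has mean 0 and standard deviation 1 within the file.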

Now we go to linear HMMs. These HMMs are bigger, so we need to train them more. This does slightly better. CMVN doesn't affect this.

In [None]:
#MFCCs, linear HMMs, mixtures, CMVN

mixtures = 2

cmvn = False

numtrain = 50

start_time = time.time()

#create stored digits
allscores = []
filelist = []
for digit in digits:
    digitset = []
    files = os.listdir(where+digit)
    filelist.append(files)
    for f in files:
        try:
            fs,w = wavfile.read(where + digit + '/' + f)
            w = w.astype(float)
            mfcc = librosa.feature.mfcc(
                y=w,
                sr=fs,
                n_mfcc=26,
                win_length=400,
                hop_length=160
            )
            mfcc = mfcc[1:13,:]
            delta = librosa.feature.delta(mfcc)
            deltadelta = librosa.feature.delta(mfcc,order=2)
            res = np.vstack([mfcc,delta,deltadelta])
            res = res.T
            
            #CMVN bits
            if cmvn:
                mean = np.mean(res,axis=0)
                std = np.std(res,axis=0)
                res = (res - mean)/std
            
            digitset.append(res)
        except Exception:
            #skip files that can't be read or analyzed
            pass
        if len(digitset) == numtrain+10: break
    allscores.append(digitset)

#make linear HMMs
print('creating HMMs...')
segments = np.array([4,3,2,3,3,4,4,5,3,4,3])
lengths = segments*3 + 2
hmms = []
for i in range(10):
    states = lengths[i]
    m = p.HiddenMarkovModel('d' + str(i))
    #states
    statelist = []

    for s in range(states):
        mix = []
        for j in range(mixtures):
            ds = []
            for i in range(36):
                ds.append(p.NormalDistribution(1,1))
            mix.append(p.IndependentComponentsDistribution(ds))
        weights = np.ones(mixtures)
        weights *= 1/mixtures
        d = p.GeneralMixtureModel(mix,weights=weights)

        statelist.append(
            p.State(d,name='sk' + str(s))
        )

    m.add_states(statelist)
    #start prob
    m.add_transition(m.start,statelist[0],1.0)

    #final state
    m.add_transition(statelist[-1],m.end,0.5)
    
    #loop transitions
    for state in statelist:
        m.add_transition(state,state,0.5)
    #sequential transitions
    for i in range(len(statelist)-1):
        m.add_transition(
            statelist[i],
            statelist[i+1],
            0.5
        )
        
    m.bake()
    hmms.append(m)

#train HMMs
print('training...')
for i in range(10):
    trainset = allscores[i][:numtrain]
    hmm = hmms[i]
    hmm.fit(
        trainset,
        n_jobs=-1,
        max_iterations=30
    )
    print(f'\t{i}')

#test HMMs
print('testing...')
total = 0
for i in range(10):
    testset = allscores[i][numtrain:numtrain+10]
    for testitem in testset:
        allres = []
        for hmm in hmms:
            res = hmm.log_probability(testitem)
            allres.append(res)
        allres = np.array(allres)
        idx = allres.argmax()
        if idx == i: total += 1

print(f'Correct: {total}/100')

print(f'({time.time() - start_time:.4} seconds)')

In [None]:
#convert a run time of about 1849 seconds to minutes
1.849e+03/60

The only thing left here is state tying.

### Note: composition with an HMM-GMM

In an earlier notebook, we showed one way to convert an HMM into a WFSA which can then be composed with other WFSAs/WFSTs.

What about an HMM-GMM?

The problem is that an HMM-GMM allows us to compute probabilities over an infinite number of possible distinct acoustic frames, but to compose this with other transducers, we need something with a finite number of categories.

One solution to this problem is to feed the input into the HMM-GMM *before* composition with the other transducers. We then have a finite number of distinct acoustic frames, namely the frames in the input. We can use those to convert the GMMs to a distribution over a fixed set of frames. This new HMM can then be converted to a WFSA and composed as we described earlier.

The cost of this is that we have to do composition for each input. In fact, this is not an issue because we've already seen that, for a system like OpenFST, inputs are handled with composition anyway.
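
Here is one possible reading of that conversion as a `numpy` sketch. To keep it short, each state has a single diagonal Gaussian rather than a mixture, and the function names and toy values are invented for illustration. For each state, we score every frame of the input and normalize over those frames, which gives a discrete emission distribution over a finite set of symbols.

```python
import numpy as np

def log_gauss(frames, mean, var):
    """Log density of a diagonal Gaussian at each frame."""
    return -0.5 * np.sum(
        np.log(2 * np.pi * var) + (frames - mean) ** 2 / var, axis=1
    )

def discretize_emissions(frames, means, variances):
    """Turn per-state Gaussian emissions into distributions over frames.

    Returns an array of shape (states, frames) whose rows sum to 1.
    """
    logB = np.stack([
        log_gauss(frames, m, v) for m, v in zip(means, variances)
    ])
    # subtract each row's maximum before exponentiating, for stability
    logB = logB - logB.max(axis=1, keepdims=True)
    B = np.exp(logB)
    return B / B.sum(axis=1, keepdims=True)

frames = np.array([[0.0], [0.2], [4.8], [5.0]])   # the input utterance
means = np.array([[0.0], [5.0]])                  # two hypothetical states
variances = np.ones((2, 1))
B = discretize_emissions(frames, means, variances)
print(B.round(3))
```

Each row of `B` is now an ordinary discrete emission distribution, so the HMM can be converted to a WFSA in the way described earlier.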

## State tying

The next missing piece is state tying. The logic of this step is that we treat each segment as a linear HMM of three states. These are combined to make words. For example, a word like *hat* might be represented as something like this:

In [None]:
dot = graphviz.Digraph(
    graph_attr={'rankdir':'LR'}
)
dot.node('q0',label='#',shape='circle')
dot.node('q1',label='h',shape='circle')
dot.node('q2',label='h',shape='circle')
dot.node('q3',label='h',shape='circle')
dot.node('q4',label='æ',shape='circle')
dot.node('q5',label='æ',shape='circle')
dot.node('q6',label='æ',shape='circle')
dot.node('q7',label='t',shape='circle')
dot.node('q8',label='t',shape='circle')
dot.node('q9',label='t',shape='circle')
dot.node('q10',label='#',shape='doublecircle')
dot.node('s',style='invisible')
dot.edge('s','q0')
dot.edge('q0','q0')
dot.edge('q0','q1')
dot.edge('q1','q1')
dot.edge('q1','q2')
dot.edge('q2','q2')
dot.edge('q2','q3')
dot.edge('q3','q3')
dot.edge('q3','q4')
dot.edge('q4','q4')
dot.edge('q4','q5')
dot.edge('q5','q5')
dot.edge('q5','q6')
dot.edge('q6','q6')
dot.edge('q6','q7')
dot.edge('q7','q7')
dot.edge('q7','q8')
dot.edge('q8','q8')
dot.edge('q8','q9')
dot.edge('q9','q9')
dot.edge('q9','q10')
dot.edge('q10','q10')
dot

A word like *ask* would then be represented like this:

In [None]:
dot = graphviz.Digraph(
    graph_attr={'rankdir':'LR'}
)
dot.node('q0',label='#',shape='circle')
dot.node('q1',label='æ',shape='circle')
dot.node('q2',label='æ',shape='circle')
dot.node('q3',label='æ',shape='circle')
dot.node('q4',label='s',shape='circle')
dot.node('q5',label='s',shape='circle')
dot.node('q6',label='s',shape='circle')
dot.node('q7',label='k',shape='circle')
dot.node('q8',label='k',shape='circle')
dot.node('q9',label='k',shape='circle')
dot.node('q10',label='#',shape='doublecircle')
dot.node('s',style='invisible')
dot.edge('s','q0')
dot.edge('q0','q0')
dot.edge('q0','q1')
dot.edge('q1','q1')
dot.edge('q1','q2')
dot.edge('q2','q2')
dot.edge('q2','q3')
dot.edge('q3','q3')
dot.edge('q3','q4')
dot.edge('q4','q4')
dot.edge('q4','q5')
dot.edge('q5','q5')
dot.edge('q5','q6')
dot.edge('q6','q6')
dot.edge('q6','q7')
dot.edge('q7','q7')
dot.edge('q7','q8')
dot.edge('q8','q8')
dot.edge('q8','q9')
dot.edge('q9','q9')
dot.edge('q9','q10')
dot.edge('q10','q10')
dot

The basic step is to tie together the distributions of states that share a label. In other words, when we train the HMM for *hat*, that will update its transitions and emissions. Because the states are tied, the emissions of the [æ] in *ask* get updated too. Likewise, when we train *ask*, the [æ] of *hat* also gets updated.

What this means is that we need an iterated training procedure. We train all our HMMs in multiple passes so that the effect of state tying can settle.

What we need then is a library of distribution triples that we can drop into different HMMs.

This is actually something we can do with *pomegranate*, but let's instead switch to a full HMM-based system to try it out.
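
Before we do, here is a plain-Python sketch of what such a library might look like. It is not *pomegranate* code. The class and function names are invented, and the "distribution" is reduced to a running mean. The key point is that both words hold references to the *same* objects.

```python
import numpy as np

class TiedState:
    """One emission distribution (here just a running mean), shared by label."""
    def __init__(self, label, dim):
        self.label = label
        self.total = np.zeros(dim)
        self.count = 0

    def accumulate(self, frames):
        """Add the frames aligned to this state to its statistics."""
        frames = np.atleast_2d(frames)
        self.total += frames.sum(axis=0)
        self.count += len(frames)

    @property
    def mean(self):
        return self.total / max(self.count, 1)

def make_library(segments, dim, per_segment=3):
    """Three tied states per segment, stored by segment label."""
    return {
        seg: [TiedState(f'{seg}{k}', dim) for k in range(per_segment)]
        for seg in segments
    }

def word_states(library, segments):
    """Build a word's state sequence from shared segment triples."""
    states = []
    for seg in segments:
        states.extend(library[seg])
    return states

library = make_library(['h', 'æ', 't', 's', 'k'], dim=2)
hat = word_states(library, ['h', 'æ', 't'])
ask = word_states(library, ['æ', 's', 'k'])

hat[3].accumulate(np.array([[1.0, 1.0]]))   # first [æ] state, seen in hat
ask[0].accumulate(np.array([[3.0, 3.0]]))   # the same object, seen in ask
print(hat[3] is ask[0], ask[0].mean)
```

Training on *hat* changes what *ask* sees for [æ], and vice versa, which is why training has to proceed in multiple passes.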

## Installing and running *Kaldi*

You can install *Kaldi* on your own system, but it's *huge* with lots of moving parts and the easiest way to use it is via *docker*.

On my own linux machine, I build the container like this:

```bash
docker run -it --gpus all --name kaldi \
  -v /data/:/mhdata/ \
  -v /home/hammond/speechtechbook/:/mhbook/ \
  -v /home/hammond/classesF23/:/mhclasses/ \
  kaldiasr/kaldi:gpu-latest bash
```

As usual, you need to change that for your own system:

1. If you don't have a usable GPU, remove `--gpus all` and replace `gpu-latest` with `latest`.
1. Change the `-v` lines as appropriate for your own system. The stuff on the left side of each colon is the location on your host system. The stuff on the right side is where you'll find that *within* the container.
1. Make sure you include the location where you have the speech commands data.
1. Make sure you include the location where you have the program files for the course text.
1. Once the container is running for the first time, check that your mounted directories are actually there. In other words, if you mounted a host directory at `/xyzdata`, make sure to type `ls /xyzdata` and confirm that you see your files.

The official *Kaldi* image is missing some stuff, so you should execute the following commands within the container.

```bash
apt update
apt install flac
apt install gawk
```

*Kaldi* needs the *SRILM* toolkit and there is an install script in the `/opt/kaldi/tools` directory. Unfortunately, it doesn't work right, so here are the steps to make it work.

1. `cd /opt/kaldi/tools`
1. `./install_srilm.sh "Mike Hammond" "University of Arizona" "hammond@u.arizona.edu"`. Change the name and email as appropriate. *This will terminate with an error*.
1. Remove the empty file `srilm.tar.gz`.
1. Download SRILM from here: http://www.speech.sri.com/projects/srilm/download.html.
1. Move that file into the `tools` directory and rename to `srilm.tar.gz`.
1. Do the install script again: `./install_srilm.sh "Mike Hammond" "University of Arizona" "hammond@u.arizona.edu"` (again changing name and email as appropriate).

## The *very* recipe

The code files for the book include a directory called *very* with several obscure files in it. These files let you run the speech commands digits data with *Kaldi*.

Copy that directory into `/opt/kaldi/egs/` and switch into it.

There are a number of files there, but the important one is `r2.sh`. That file is a shell script and includes a definition near the top for the variable `datadir`, which specifies where the speech commands data are.

In my own case, that directory is called `commands` and it lives in `/data/`. When I built the container, I mounted `/data/` in the container as `/mhdata/`. *Therefore*, I define `datadir` in `r2.sh` as `/mhdata/commands/`. Edit the `r2.sh` file so it specifies the correct directory for your system.

The other necessary change is the line five lines below the definition for `datadir` where `nj` is defined. This variable specifies the number of parallel jobs to run and you should set it according to the limits of your own system. In my own case, I have 8 dual-core processors, so I set it to 15, one less than the maximum. On my old mac, I have to set it to 3.
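
If Python is available on the machine, a quick way to get a starting value is to count the logical CPUs and hold one back, as in this sketch. The function name is invented, and the "one less than the maximum" rule is just the rule of thumb from the paragraph above.

```python
import os

def suggest_nj(reserve=1):
    """Suggest a value for nj: all logical CPUs but `reserve`, at least 1."""
    cpus = os.cpu_count() or 1
    return max(1, cpus - reserve)

print(suggest_nj())
```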

Once you've made these changes, you can run the experiment by typing `./r2.sh`. On my linux machine, this takes 2-3 minutes; on my mac, it's more like half an hour. With more limited resources, be prepared for a longer wait.

Let's now look at the `r2.sh` script bit by bit.

First, we define a bunch of variables:

```bash
#!/bin/bash

#kaldi run script for speech commands dataset
#mike hammond, u. of arizona, 8/2021

#define a bunch of variables
datadir=/mhdata/commands/
mfccdir=mfcc
train_cmd="utils/run.pl"
decode_cmd="utils/run.pl"
nj=15
lm_order=1
trainnum=1000
testnum=100

#variable for digits to translate file names
declare -a arr=("zero" "one" "two" "three" "four"
   "five" "six" "seven" "eight" "nine")
```

We call two other shell scripts that set paths and specify whether a GPU is used.

We then create a bunch of files and directories where various files are placed as the body of the script runs.

```bash
#specify where programs are and how they interact
. ./path.sh || exit 1
. ./cmd.sh || exit 1

#make data directory
mkdir data
#where to put the wave files
mkdir data/wavefiles
#make test directory
mkdir data/test
#make train directory
mkdir data/train

#create wav.scp (specify location of wav files)
touch data/test/wav.scp
touch data/train/wav.scp

#rename and copy wave files
echo copying wave files
echo creating wav.scp, text, utt2spk files

#make the text files (what's in each wav file)
touch data/test/text
touch data/train/text

#make utt2spk files (specify speaker for each file)
touch data/test/utt2spk
touch data/train/utt2spk
```

Now we populate those directories. These files indicate where sound files are, who the speakers are, what files are for training and what for testing, etc.

The last step is to create a *corpus* file which, in our case, is simply the list of digits.

```bash
#loop through the files doing all that
for q in "${arr[@]}"
do
   filenames=`ls $datadir$q/*0.wav`
   filenames=($filenames)
   #training files
   for filename in ${filenames[@]:0:${trainnum}}
   do
      speaker=`echo $filename | sed 's/.*\///'`
      speaker=`echo $speaker | sed 's/_.*//'`
      pfx="${speaker}_${q}"
      cp $filename data/wavefiles/${pfx}.wav
      echo ${pfx} data/wavefiles/${pfx}.wav >> data/train/wav.scp
      echo ${pfx} ${q} >> data/train/text
      echo -e "${pfx} ${speaker}" >> data/train/utt2spk
   done
   #testing files
   for filename in ${filenames[@]:${trainnum}:${testnum}}
   do
      speaker=`echo $filename | sed 's/.*\///'`
      speaker=`echo $speaker | sed 's/_.*//'`
      pfx="${speaker}_${q}"
      cp $filename data/wavefiles/${pfx}.wav
      echo ${pfx} data/wavefiles/${pfx}.wav >> data/test/wav.scp
      echo ${pfx} ${q} >> data/test/text
      echo -e "${pfx} ${speaker}" >> data/test/utt2spk
   done
done

#create corpus file (list all word tokens in wav files)
echo Creating corpus file
mkdir data/local
touch data/local/corpus.txt

for i in "${arr[@]}"
do
   echo $i >> data/local/corpus.txt
done
```
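
To see what those loops write, here is a Python sketch of the same string manipulation for a single file. The function name and the example file name are made up for illustration; the file name just follows the `speaker_nohash_n.wav` pattern of the dataset.

```python
import os

def kaldi_entries(path, digit, wavdir='data/wavefiles'):
    """Build the wav.scp, text, and utt2spk lines for one recording."""
    base = os.path.basename(path)     # like sed 's/.*\///'
    speaker = base.split('_')[0]      # like sed 's/_.*//'
    utt = f'{speaker}_{digit}'        # the utterance id (pfx in the script)
    return {
        'wav.scp': f'{utt} {wavdir}/{utt}.wav',
        'text': f'{utt} {digit}',
        'utt2spk': f'{utt} {speaker}',
    }

entries = kaldi_entries(
    '/mhdata/commands/zero/0a7c2a8d_nohash_0.wav', 'zero'
)
for name, line in entries.items():
    print(f'{name}: {line}')
```

Each line starts with the utterance id, which is how *Kaldi* links the audio, the transcript, and the speaker.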

Next we run various scripts that check and fix the file structure if there are problems.

We also copy our own files from the `verylang` directory into the new directories.

```bash
#check/fix data directories
echo Fixing, validating, sorting
cp -r ../wsj/s5/utils .
./utils/validate_data_dir.sh data/test
./utils/fix_data_dir.sh data/test
./utils/validate_data_dir.sh data/train
./utils/fix_data_dir.sh data/train

#copy trivial language model files
echo Moving language model files
mkdir data/local/dict
cp verylang/*.txt data/local/dict
```

We next create the MFCC representations and compute the statistics for cepstral mean and variance normalization. These require scripts from other experiments in the `egs` directory, so we just copy these over.

```bash
#making MFCCs
echo Creating mfccs
cp -r ../wsj/s5/steps .
cp -r ../an4/s5/conf .

#make MFCC log directories
mkdir exp
mkdir exp/make_mfcc
mkdir exp/make_mfcc/train
mkdir exp/make_mfcc/test

#make the mfccs themselves
steps/make_mfcc.sh --nj $nj --cmd "$train_cmd" \
   data/train exp/make_mfcc/train $mfccdir
steps/make_mfcc.sh --nj $nj --cmd "$train_cmd" \
   data/test exp/make_mfcc/test $mfccdir

#cepstral mean/variance statistics per speaker (cmvn)
steps/compute_cmvn_stats.sh data/train \
   exp/make_mfcc/train $mfccdir
steps/compute_cmvn_stats.sh data/test \
   exp/make_mfcc/test $mfccdir
```

Next we have a bunch of scripts that create the language model and massage it into the correct form. In our case, we have only words in isolation, so this is fairly trivial.

```bash
#prepare language data (trivial here)
echo Preparing language data
utils/prepare_lang.sh data/local/dict "<UNK>" \
   data/local/lang data/lang

#build language model
echo Building language model

local=data/local

#make arpa/binary version of LM
mkdir $local/tmp
../../tools/srilm/bin/i686-m64/ngram-count -order $lm_order \
   -write-vocab $local/tmp/vocab-full.txt -wbdiscount -text \
   $local/corpus.txt -lm $local/tmp/lm.arpa

#make G.fst file (binary version for language model)
lang=data/lang
../../src/lmbin/arpa2fst --disambig-symbol=#0 \
   --read-symbol-table=$lang/words.txt $local/tmp/lm.arpa \
   $lang/G.fst
```

Next are the guts of the script. We do training based on concatenated segments. We then align and use these alignments to train again with context-dependent segments. Basically, we construct a version of each segment for every context it occurs in. There can be thousands of these, so the "decision tree" step winnows these down to just what we need.

```bash
#monophone/letter unigram training (HMM-GMMs)
steps/train_mono.sh --nj $nj --cmd "$train_cmd" data/train \
   data/lang exp/mono || exit 1

#mono decoding/testing

#copy scripts
cp -r ../an4/s5/local .

#do decision trees
utils/mkgraph.sh --mono data/lang exp/mono \
   exp/mono/graph || exit 1

#score
steps/decode.sh --config conf/decode.config --nj $nj --cmd \
   "$decode_cmd" exp/mono/graph data/test exp/mono/decode

#mono alignment (for building triphones)
steps/align_si.sh --nj $nj --cmd "$train_cmd" data/train \
   data/lang exp/mono exp/mono_ali || exit 1

#triphone training (collect segment HMMs into triphones)
steps/train_deltas.sh --cmd "$train_cmd" 2000 11000 data/train \
   data/lang exp/mono_ali exp/tri1 || exit 1

#triphone decoding/testing

#make decision trees
utils/mkgraph.sh data/lang exp/tri1 exp/tri1/graph || exit 1

#score
steps/decode.sh --config conf/decode.config --nj $nj --cmd \
   "$decode_cmd" exp/tri1/graph data/test exp/tri1/decode

echo '
All done!'
```
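
To make the context-dependent step concrete, here is a small Python sketch that expands segments into triphones. It reuses the *hat* and *ask* examples from earlier rather than the actual lexicon of the recipe. The `left-center+right` notation and the function name are choices made for this sketch.

```python
def triphones(phones, boundary='#'):
    """Expand a segment sequence into context-dependent units."""
    padded = [boundary] + list(phones) + [boundary]
    return [
        f'{padded[i - 1]}-{padded[i]}+{padded[i + 1]}'
        for i in range(1, len(padded) - 1)
    ]

# toy lexicon, for illustration only
lexicon = {
    'hat': ['h', 'æ', 't'],
    'ask': ['æ', 's', 'k'],
}

mono = {seg for segs in lexicon.values() for seg in segs}
tri = {t for segs in lexicon.values() for t in triphones(segs)}

print(triphones(lexicon['hat']))
print(len(mono), 'monophones,', len(tri), 'triphones')
```

The two instances of [æ] are one monophone but two different triphones. With a realistic lexicon the number of contexts grows quickly, which is what the clustering step is there to rein in.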

We can examine the results with the following commands. The first one shows the word error rate (WER) for the initial single-segment model under each scoring setting (*Kaldi* writes one `wer_*` file per language model weight). The second shows how this (generally) improves with context-dependent segments.

```bash
grep WER exp/mono/decode/wer*

grep WER exp/tri1/decode/wer*
```
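
As a reminder of what the WER figure means, here is a minimal Python sketch: the number of substitutions, deletions, and insertions needed to turn each hypothesis into its reference, divided by the number of reference words. The function names and the toy data are invented here, and this is not *Kaldi*'s scoring code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h)   # substitution or match
            ))
        prev = cur
    return prev[-1]

def wer(refs, hyps):
    """Word error rate (in percent) over paired transcripts."""
    errors = sum(
        edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps)
    )
    words = sum(len(r.split()) for r in refs)
    return 100 * errors / words

# hypothetical references and recognizer output
refs = ['zero', 'one', 'two', 'three']
hyps = ['zero', 'nine', 'two', 'three']
print(f'{wer(refs, hyps):.1f}% WER')
```

With isolated words, every hypothesis is a single word, so the WER is simply 100 minus the percentage correct.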

The system is better than 90% accurate, depending on how much data you use for training, so this is a substantial improvement over the systems we've looked at so far.