<a href="https://colab.research.google.com/github/danijel3/CroatianSpeech/blob/main/KaldiAlign.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Speech alignment using Kaldi

This notebook demonstrates how to use Kaldi models to align text to speech.

The first step is to install Kaldi under Colab. We will use a package I prepared a while ago that works here. In practice, you should install Kaldi by following its official installation instructions.

In [1]:
!wget https://github.com/danijel3/ASRforNLP/releases/download/v1.0/kaldi.tar.xz

!tar xvf kaldi.tar.xz -C / > /dev/null
%rm kaldi.tar.xz

# Link the shared libraries; the glob is quoted so the shell does not expand it before find sees it
!for f in $(find /opt/kaldi -name '*.so*') ; do ln -sf $f /usr/local/lib/$(basename $f) ; done
# Link the Kaldi and tools executables into the PATH
!for f in $(find /opt/kaldi/src -not -name '*.so*' -type f -executable) ; do ln -s $f /usr/local/bin/$(basename $f) ; done
!for f in $(find /opt/kaldi/tools -not -name '*.so*' -type f -executable) ; do ln -s $f /usr/local/bin/$(basename $f) ; done

!ldconfig


Next we will download a small script I wrote to automatically generate the pronunciation lexicon for Croatian. We will also need a couple of Python libraries to help with the process:

In [2]:
!pip install phonetisaurus openfst-python
!wget https://github.com/danijel3/CroatianSpeech/raw/main/lexicon.py


Finally, we will download the model (which includes the Croatian acoustic model as well as the G2P model for generating the lexicon mentioned above) and some sample data.

We will use the sample file from the previous example, plus its transcript, which we found earlier in the large text file.

We will store everything in the `data` directory. We need 3 files there:
- `data/text` contains the transcript
- `data/wav.scp` contains the path to the audio file
- `data/spk2utt` contains the mapping from speakers to utterances (audio files)

This may seem a bit silly for a single file, but the same script can process multiple files at once. In that case `text`, `wav.scp` and `spk2utt` each contain multiple entries.
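To make the multi-file case concrete, here is a minimal sketch of how the three files could be written from Python. The function name `write_data_dir`, its arguments and the choice to assign every utterance to a single speaker are my assumptions for illustration; they are not part of the scripts used in this notebook.

```python
from pathlib import Path


def write_data_dir(data_dir, utterances, speaker='sample'):
    """Write Kaldi-style text, wav.scp and spk2utt files.

    `utterances` maps an utterance id to a (wav_path, transcript) pair.
    All utterances are assigned to one speaker id, mirroring the
    single-file setup (an assumption made for this sketch).
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    # Kaldi tools generally expect entries sorted by utterance id.
    utt_ids = sorted(utterances)
    with open(data_dir / 'wav.scp', 'w', encoding='utf-8') as f:
        for utt in utt_ids:
            f.write(f'{utt} {utterances[utt][0]}\n')
    with open(data_dir / 'text', 'w', encoding='utf-8') as f:
        for utt in utt_ids:
            f.write(f'{utt} {utterances[utt][1]}\n')
    # One line per speaker: the speaker id followed by all of its utterances.
    with open(data_dir / 'spk2utt', 'w', encoding='utf-8') as f:
        f.write(f"{speaker} {' '.join(utt_ids)}\n")
```

Calling it with a dictionary of two or three utterances produces one line per utterance in `text` and `wav.scp`, and a single `spk2utt` line listing all of them.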

In [3]:
!wget https://github.com/danijel3/CroatianSpeech/releases/download/am/models.tar.xz
!wget https://github.com/danijel3/CroatianSpeech/releases/download/data/sample.wav
!wget https://github.com/danijel3/CroatianSpeech/raw/main/sample.txt

!tar xvf models.tar.xz
%mkdir data
%mv sample.wav data/sample.wav
%mv sample.txt data/text
!echo 'sample data/sample.wav' > data/wav.scp
!echo 'sample sample' > data/spk2utt


## Alignment

Finally we run the alignment procedure. It goes as follows:
1. generate a list of words in the transcript
2. use `lexicon.py` to generate a lexicon FST, which creates these files:
   - `words.txt` - list of words
   - `phones.txt` - list of phonemes
   - `disambig.int` - extra disambiguation symbols (needed by some programs)
   - `word_boundary.int` - information on which phonemes mark word boundaries (needed for aligning words)
   - `L.fst` - the FST that maps phonemes into words
3. we compute MFCC features from the audio and store them in `mfcc.ark`
4. we also compute ivectors and store them in `ivec.ark` (our acoustic model uses both MFCCs and ivectors)
5. we generate graphs that represent the transcript (this will determine the word order that is going to be analyzed by the decoder)
6. we run the `nnet3-latgen-faster` program which is the main decoding step in this whole procedure - it combines everything we prepared thus far and generates the output in the form of a lattice
7. finally, we process the lattice by making sure the states are aligned with words (we can also output phonemes) and convert it to the CTM format
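As an illustration of step 1 only, the word list can also be built in Python. This is a sketch, not what the notebook runs: the helper name `word_list` is hypothetical, and it assumes each line of `text` starts with the utterance id followed by the words. Unlike a plain split on single spaces, `str.split()` also tolerates repeated whitespace.

```python
def word_list(text_lines):
    """Return the sorted unique words, skipping the utterance id (first field)."""
    words = set()
    for line in text_lines:
        words.update(line.split()[1:])
    return sorted(words)


# Two made-up transcript lines in the `text` format.
print(word_list(['sample ovo je još jedan', 'sample2 još jedan zakon']))
```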

In [4]:
%%bash

wavscp=data/wav.scp
spk2utt=data/spk2utt
text=data/text
tmpdir=output

models=models
kaldi=/opt/kaldi

export LC_ALL=C

mkdir -p $tmpdir

# 1. list of unique words in the transcript (the first field is the utterance id)
cut -f2- -d' ' $text | tr ' ' '\n' | sort -u > $tmpdir/wlist
# 2. lexicon FST and symbol tables
python lexicon.py $tmpdir/wlist $models/phonetisaurus-hr/model.fst $tmpdir
# 3. MFCC features
$kaldi/src/featbin/compute-mfcc-feats --config=$models/nnet3/conf/mfcc.conf scp:$wavscp ark:$tmpdir/mfcc.ark
# 4. ivectors
$kaldi/src/online2bin/ivector-extract-online2 --config=$models/nnet3/conf/ivector.conf ark:$spk2utt ark:$tmpdir/mfcc.ark ark:$tmpdir/ivec.ark
# 5. transcript as word ids, then one graph per utterance
$kaldi/egs/wsj/s5/utils/sym2int.pl -f 2- $tmpdir/words.txt $text > $tmpdir/text.int
$kaldi/src/bin/compile-train-graphs $models/nnet3/tdnn1a_sp/tree $models/nnet3/tdnn1a_sp/final.mdl $tmpdir/L.fst ark:$tmpdir/text.int ark:$tmpdir/graphs.fsts
# 6. decoding constrained to the transcript graphs; the output is a lattice
$kaldi/src/nnet3bin/nnet3-latgen-faster --online-ivectors=ark:$tmpdir/ivec.ark --online-ivector-period=10 $models/nnet3/tdnn1a_sp/final.mdl ark:$tmpdir/graphs.fsts ark:$tmpdir/mfcc.ark ark:$tmpdir/ali.lat
# 7. word-align the lattice and convert it to CTM with word symbols
$kaldi/src/latbin/lattice-align-words $tmpdir/word_boundary.int $models/nnet3/tdnn1a_sp/final.mdl ark:$tmpdir/ali.lat ark:- | $kaldi/src/latbin/lattice-to-ctm-conf ark:- - | $kaldi/egs/wsj/s5/utils/int2sym.pl -f 5 $tmpdir/words.txt - > $tmpdir/ali.ctm

Wrote output/phones.txt...
Wrote output/disambig.int...
Wrote output/words.txt...
Wrote output/word_boundary.int...
Wrote output/L.fst...


/opt/kaldi/src/featbin/compute-mfcc-feats --config=models/nnet3/conf/mfcc.conf scp:data/wav.scp ark:output/mfcc.ark 
LOG (compute-mfcc-feats[5.5.971~1-07043]:main():compute-mfcc-feats.cc:185)  Done 1 out of 1 utterances.
/opt/kaldi/src/online2bin/ivector-extract-online2 --config=models/nnet3/conf/ivector.conf ark:data/spk2utt ark:output/mfcc.ark ark:output/ivec.ark 
LOG (ivector-extract-online2[5.5.971~1-07043]:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (ivector-extract-online2[5.5.971~1-07043]:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG (ivector-extract-online2[5.5.971~1-07043]:main():ivector-extract-online2.cc:189) Estimated iVectors for 1 files, 0 with errors.
LOG (ivector-extract-online2[5.5.971~1-07043]:main():ivector-extract-online2.cc:191) Average objective-function improvement was 5.98035 per frame, over 75498 frames (weighted).
LOG (ivector-extract-online2[5.5.971~1-07043]:main():ivector-extract-online2.cc:

The CTM format is really convenient to use because it's easy to parse. Each line contains the following fields, separated by spaces:
- utterance id (the same one we use in `text`, `wav.scp` and `spk2utt`)
- channel id (e.g. for stereo - here it's always 1)
- start time of a word (in seconds)
- duration (in seconds)
- word
- confidence (here always 1.0, but for real ASR it can be any value between 0 and 1)
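The field list above maps directly onto a small record type. The following is a sketch under my own naming: `CtmEntry` and `parse_ctm_line` are hypothetical helpers, not part of Kaldi or of this notebook's scripts, and the sample line is one of the lines from this notebook's `ali.ctm` output.

```python
from typing import NamedTuple


class CtmEntry(NamedTuple):
    utt: str
    channel: str
    start: float       # seconds
    duration: float    # seconds
    word: str
    confidence: float

    @property
    def end(self):
        """End time of the word in seconds."""
        return self.start + self.duration


def parse_ctm_line(line):
    """Split one whitespace-separated CTM line into its six fields."""
    utt, channel, start, duration, word, conf = line.split()[:6]
    return CtmEntry(utt, channel, float(start), float(duration), word, float(conf))


entry = parse_ctm_line('sample 1 2.54 1.03 potpredsjedniče 1.00')
print(entry.word, entry.start, round(entry.end, 2))
```

Keeping the channel as a string is a deliberate choice here, since CTM files from other sources sometimes use letters such as `A` for it.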

In [5]:
!head -n 15 output/ali.ctm

sample 1 2.54 1.03 potpredsjedniče 1.00 
sample 1 5.24 0.41 poštovane 1.00 
sample 1 5.65 0.50 kolegice 1.00 
sample 1 6.15 0.05 i 1.00 
sample 1 6.20 0.56 kolege 1.00 
sample 1 10.58 0.31 ovo 1.00 
sample 1 10.89 0.48 je 1.00 
sample 1 11.37 0.24 još 1.00 
sample 1 11.61 0.34 jedan 1.00 
sample 1 11.95 0.12 od 1.00 
sample 1 12.07 0.76 zakona 1.00 
sample 1 14.55 0.20 koji 1.00 
sample 1 14.75 0.69 raspravljamo 1.00 
sample 1 15.44 0.22 ovih 1.00 
sample 1 15.66 0.33 dana 1.00 


Finally, I made a little widget to demonstrate the alignment. It uses the `wavio` library to load the audio:

In [6]:
!wget https://raw.githubusercontent.com/danijel3/CroatianSpeech/main/alignment.py
!pip install wavio

--2022-02-09 20:34:05--  https://raw.githubusercontent.com/danijel3/CroatianSpeech/main/alignment.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3208 (3.1K) [text/plain]
Saving to: ‘alignment.py’


2022-02-09 20:34:05 (39.3 MB/s) - ‘alignment.py’ saved [3208/3208]

Collecting wavio
  Downloading wavio-0.0.4-py2.py3-none-any.whl (9.0 kB)
Installing collected packages: wavio
Successfully installed wavio-0.0.4


Here we load the requisite modules:

In [7]:
from pathlib import Path

from IPython.display import HTML

from alignment import visualize

We load the segments into a list that contains triples (`word`,`start`,`end`).

Next we run the `visualize` function. It has an optional argument `sub`, which lets us zoom in on a particular time slice (in seconds). Otherwise it would show the whole file, which wouldn't be very readable.
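To show what restricting the segments to a time slice means, here is a minimal sketch. I have not confirmed that `visualize` works this way internally; `clip_segments` is a hypothetical helper that keeps the (`word`, `start`, `end`) triples overlapping the window and trims them to its edges.

```python
def clip_segments(segments, sub):
    """Keep segments that overlap the (lo, hi) window and clip them to it."""
    lo, hi = sub
    clipped = []
    for word, start, end in segments:
        if end <= lo or start >= hi:
            continue  # entirely outside the window
        clipped.append((word, max(start, lo), min(end, hi)))
    return clipped


# Three words with times taken from the CTM output above.
example = [('ovo', 10.58, 10.89), ('je', 10.89, 11.37), ('zakona', 12.07, 12.83)]
print(clip_segments(example, (10.8, 12.5)))
```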

In [8]:
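# Parse the CTM file into (word, start, end) triples.
segments = []
with open('output/ali.ctm', encoding='utf-8') as f:
    for line in f:
        tok = line.split()
        if len(tok) < 5:
            continue  # skip blank or malformed lines
        start, dur = float(tok[2]), float(tok[3])
        segments.append((tok[4], start, start + dur))
audio = Path('data/sample.wav')

# Show only the slice between 20 and 30 seconds.
HTML(visualize(audio, segments, sub=(20, 30)))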

This widget can still use some work, but I hope it demonstrates just how good this alignment is.

A better option would be to use something like Praat, ELAN or my favorite, EMU-webApp. I just wanted you to be able to look through it without having to leave the notebook.