# Kaldi neural-network based alignment

This notebook shows how to generate an alignment from a collection of files using a typical neural-network model from the Kaldi toolkit.

## Download and installation

First we will install the [pympi](https://github.com/dopefishh/pympi) library that allows us to save TextGrid files:

In [2]:
!pip install pympi-ling

Collecting pympi-ling
  Downloading pympi_ling-1.70.2-py2.py3-none-any.whl.metadata (3.4 kB)
Downloading pympi_ling-1.70.2-py2.py3-none-any.whl (24 kB)
Installing collected packages: pympi-ling
Successfully installed pympi-ling-1.70.2


Next we will download the following archives from the internet:
* `kaldi.tar.xz` - most of the Kaldi toolkit (libraries and programs)
* `phonetisaurus.tar.xz` - a program for automatic grapheme-to-phoneme conversion
* `studio.tar.xz` - a neural-network model for Polish trained on "clean" studio-quality data
* `SES0001.tar.xz` - a sample of audio files from the studio corpus

We then extract the files and delete the archives. Finally, we find all the programs and libraries and link them into `/usr/local/bin` and `/usr/local/lib` respectively, so they can be found from anywhere in the system. We also copy two Perl scripts (`int2sym.pl` and `sym2int.pl`) that convert between the numeric and textual representations of words and phonemes (the actual computations use the numeric representations, for efficiency).

In [3]:
%%bash

wget -q https://github.com/danijel3/PhonemeAlignement/releases/download/v0.1-kaldi-data/kaldi.tar.xz
wget -q https://github.com/danijel3/PhonemeAlignement/releases/download/v0.1-kaldi-data/phonetisaurus.tar.xz
wget -q https://github.com/danijel3/PhonemeAlignement/releases/download/v0.1-kaldi-data/studio.tar.xz
wget -q https://github.com/danijel3/PhonemeAlignement/releases/download/v0.1-kaldi-data/SES0001.tar.xz
tar xf kaldi.tar.xz
tar xf phonetisaurus.tar.xz
tar xf studio.tar.xz
tar xf SES0001.tar.xz
rm *.tar.xz
for f in $(find kaldi -name '*.so*') ; do ln -sf "$(realpath "$f")" "/usr/local/lib/$(basename "$f")" ; done
for f in $(find kaldi/src -not -name '*.so*' -type f -executable) ; do ln -sf "$(realpath "$f")" "/usr/local/bin/$(basename "$f")" ; done
for f in $(find kaldi/tools -not -name '*.so*' -type f -executable) ; do ln -sf "$(realpath "$f")" "/usr/local/bin/$(basename "$f")" ; done
for f in $(find phonetisaurus/lib -name '*.so*') ; do ln -sf "$(realpath "$f")" "/usr/local/lib/$(basename "$f")" ; done
for f in $(find phonetisaurus/bin -not -name '*.so*' -type f -executable) ; do ln -sf "$(realpath "$f")" "/usr/local/bin/$(basename "$f")" ; done
ldconfig

cp /content/kaldi/egs/wsj/s5/utils/sym2int.pl /usr/local/bin
cp /content/kaldi/egs/wsj/s5/utils/int2sym.pl /usr/local/bin

/sbin/ldconfig.real: /usr/local/lib/libtcm.so.1 is not a symbolic link

/sbin/ldconfig.real: /usr/local/lib/libtcm_debug.so.1 is not a symbolic link

/sbin/ldconfig.real: /usr/local/lib/libtbbbind.so.3 is not a symbolic link

/sbin/ldconfig.real: /usr/local/lib/libur_adapter_level_zero.so.0 is not a symbolic link

/sbin/ldconfig.real: /usr/local/lib/libtbbmalloc_proxy.so.2 is not a symbolic link

/sbin/ldconfig.real: /usr/local/lib/libtbbmalloc.so.2 is not a symbolic link

/sbin/ldconfig.real: /usr/local/lib/libhwloc.so.15 is not a symbolic link

/sbin/ldconfig.real: /usr/local/lib/libur_adapter_opencl.so.0 is not a symbolic link

/sbin/ldconfig.real: /usr/local/lib/libtbbbind_2_5.so.3 is not a symbolic link

/sbin/ldconfig.real: /usr/local/lib/libtbb.so.12 is not a symbolic link

/sbin/ldconfig.real: /usr/local/lib/libur_loader.so.0 is not a symbolic link

/sbin/ldconfig.real: /usr/local/lib/libumf.so.0 is not a symbolic link

/sbin/ldconfig.real: /usr/local/lib/libtbbbind_2_0.so.3 is
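
As a quick sanity check after linking, a sketch like the following can confirm that the programs used later in this notebook are visible on the `PATH`. The helper name `missing_tools` is hypothetical, and the list of names is simply collected from the commands that appear in the later cells:

```python
import shutil

# Programs invoked later in this notebook; adjust the list if you change the pipeline.
REQUIRED_TOOLS = [
    "compute-mfcc-feats",
    "ivector-extract-online2",
    "phonetisaurus-g2pfst",
    "compile-train-graphs-without-lexicon",
    "nnet3-align-compiled",
    "linear-to-nbest",
    "lattice-align-words",
    "lattice-to-phone-lattice",
    "nbest-to-ctm",
    "sym2int.pl",
    "int2sym.pl",
]

def missing_tools(names):
    """Return the names that cannot be found as executables on the current PATH."""
    return [name for name in names if shutil.which(name) is None]

missing = missing_tools(REQUIRED_TOOLS)
print("Missing tools:", missing if missing else "none")
```

If anything is reported as missing, the linking loops above are the first place to look.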

## Data pre-processing

These are the contents of our sample archive, a collection of WAV and matching TXT files:

In [4]:
%ls SES0001

rich001.txt  sent005.wav  sent011.txt  sent016.wav  sent022.txt  sent027.wav
rich001.wav  sent006.txt  sent011.wav  sent017.txt  sent022.wav  sent028.txt
sent001.txt  sent006.wav  sent012.txt  sent017.wav  sent023.txt  sent028.wav
sent001.wav  sent007.txt  sent012.wav  sent018.txt  sent023.wav  sent029.txt
sent002.txt  sent007.wav  sent013.txt  sent018.wav  sent024.txt  sent029.wav
sent002.wav  sent008.txt  sent013.wav  sent019.txt  sent024.wav  sent030.txt
sent003.txt  sent008.wav  sent014.txt  sent019.wav  sent025.txt  sent030.wav
sent003.wav  sent009.txt  sent014.wav  sent020.txt  sent025.wav  spk.txt
sent004.txt  sent009.wav  sent015.txt  sent020.wav  sent026.txt
sent004.wav  sent010.txt  sent015.wav  sent021.txt  sent026.wav
sent005.txt  sent010.wav  sent016.txt  sent021.wav  sent027.txt


A single file contains a short sentence, for example:

In [5]:
%cat SES0001/sent001.txt

stwierdzam że senator stanisław huskowski złożył ślubowanie panie senatorze gratuluję oklaski są mile widziane tym bardziej że do naszego grona senatorskiego dołącza osoba o dużym doświadczeniu samorządowym


The audio is encoded as an uncompressed 16 kHz, 16-bit, mono, signed little-endian PCM WAV file:

In [6]:
!ffprobe -hide_banner SES0001/sent001.wav

Input #0, wav, from 'SES0001/sent001.wav':
  Duration: 00:00:14.78, bitrate: 256 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, 1 channels, s16, 256 kb/s


If this weren't the case, the following command could be used to convert any other file (including formats other than WAV) into the format this model expects:

```
!ffmpeg -i input.wav -ac 1 -ar 16k -acodec pcm_s16le output.wav
```
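
To check many files at once without reading the `ffprobe` output by eye, the header can also be inspected from Python. The following is a minimal sketch using the standard `wave` module; the function name `check_wav_format` is hypothetical, and `wave` only opens PCM WAV files, so other formats raise `wave.Error` instead of being reported:

```python
import glob
import wave

def check_wav_format(path, rate=16000, channels=1, sample_width_bytes=2):
    """Return a list of problems found in the WAV header; an empty list means it matches."""
    problems = []
    with wave.open(path, "rb") as w:
        if w.getframerate() != rate:
            problems.append(f"sample rate is {w.getframerate()} Hz, expected {rate}")
        if w.getnchannels() != channels:
            problems.append(f"{w.getnchannels()} channels, expected {channels}")
        if w.getsampwidth() != sample_width_bytes:
            problems.append(f"{8 * w.getsampwidth()}-bit samples, expected {8 * sample_width_bytes}")
    return problems

# Report only the files that would need converting (prints nothing if all match).
for path in sorted(glob.glob("SES0001/*.wav")):
    issues = check_wav_format(path)
    if issues:
        print(path, issues)
```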

We can listen to the audio:

In [7]:
from IPython.display import Audio

Audio('SES0001/sent001.wav')

Now let's create a folder called `data` to store intermediate files for the alignment procedure. We start with some basic information about the files:

In [43]:
%%bash
mkdir -p data

for f in SES0001/*.wav ; do echo $(basename $f .wav) $(readlink -f $f) ; done > data/wav.scp
for f in SES0001/*.wav ; do echo $(basename $f .wav) $(head -n 1 ${f%wav}txt) ; done > data/text
for f in SES0001/*.wav ; do echo $(basename $f .wav) $(basename $f .wav) ; done > data/spk2utt

`data/wav.scp` holds the paths to the audio files, for example:

In [9]:
!head data/wav.scp

rich001 /content/SES0001/rich001.wav
sent001 /content/SES0001/sent001.wav
sent002 /content/SES0001/sent002.wav
sent003 /content/SES0001/sent003.wav
sent004 /content/SES0001/sent004.wav
sent005 /content/SES0001/sent005.wav
sent006 /content/SES0001/sent006.wav
sent007 /content/SES0001/sent007.wav
sent008 /content/SES0001/sent008.wav
sent009 /content/SES0001/sent009.wav


`data/text` holds the contents of all the TXT files in one file:

In [10]:
!head data/text

rich001 drożdże dżip gwożdżenie ozimina wędzarz rdzeń wędzonka ingerować kładzenie jutrzenka
sent001 stwierdzam że senator stanisław huskowski złożył ślubowanie panie senatorze gratuluję oklaski są mile widziane tym bardziej że do naszego grona senatorskiego dołącza osoba o dużym doświadczeniu samorządowym
sent002 zrezygnował z prawniczej kariery na rzecz dziennikarstwa jest twórcą i redaktorem merytorycznym magazynu meteoryt
sent003 sięgnąłem po koniak właściwie niechętnie miałem już dość alkoholu na dzisiejszy wieczór i wtedy usłyszałem jej głos
sent004 wreszcie zgasiła go w popielniczce i zawahała się przez moment może zatańczymy
sent005 w końcu jestem specjalistą od tych spraw w każdej chwili mogłem zacząć pracować a propozycji było aż nadto
sent006 czas i doświadczenie pozwoliły mi się przyzwyczaić do powodzenia u kobiet zaskoczony byłem raczej tym że ani pierwsza ani druga ani nawet dziesiąta randka nie skończyły się w łóżku
sent007 pewnego wieczoru siedzieliśmy z kieliszkami bia

`data/spk2utt` holds a list of recordings for each speaker. This is often used to help with normalization of recordings, but here we will assume each file is spoken by a different speaker. To keep it simple, each speaker is given the same name as the file:

In [44]:
!head data/spk2utt

rich001 rich001
sent001 sent001
sent002 sent002
sent003 sent003
sent004 sent004
sent005 sent005
sent006 sent006
sent007 sent007
sent008 sent008
sent009 sent009
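
If the speaker of each recording is actually known, the same file can list several recordings per speaker, one speaker per line followed by that speaker's recordings. The sketch below illustrates the format only; the function name `make_spk2utt` and the speaker labels `spkA` and `spkB` are made up for the example and do not come from the sample data:

```python
from collections import defaultdict

def make_spk2utt(utt2spk):
    """Turn {recording: speaker} into sorted 'speaker rec1 rec2 ...' lines."""
    by_speaker = defaultdict(list)
    for utt, spk in utt2spk.items():
        by_speaker[spk].append(utt)
    return [f"{spk} {' '.join(sorted(utts))}" for spk, utts in sorted(by_speaker.items())]

# Hypothetical assignment: two recordings by one speaker, one by another.
example = {"sent001": "spkA", "sent002": "spkA", "sent003": "spkB"}
for line in make_spk2utt(example):
    print(line)
```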


Now we can compute the acoustic features (i.e. mel-frequency cepstral coefficients) required by the neural-network model:

In [11]:
!compute-mfcc-feats --config=models.studio/conf/mfcc_hires.conf scp:data/wav.scp ark:data/mfcc

compute-mfcc-feats --config=models.studio/conf/mfcc_hires.conf scp:data/wav.scp ark:data/mfcc 
LOG (compute-mfcc-feats[5.5.1168~1-01aad]:main():compute-mfcc-feats.cc:181) Processed 10 utterances
LOG (compute-mfcc-feats[5.5.1168~1-01aad]:main():compute-mfcc-feats.cc:181) Processed 20 utterances
LOG (compute-mfcc-feats[5.5.1168~1-01aad]:main():compute-mfcc-feats.cc:181) Processed 30 utterances
LOG (compute-mfcc-feats[5.5.1168~1-01aad]:main():compute-mfcc-feats.cc:185)  Done 31 out of 31 utterances.


This particular model also uses i-Vector features to describe the speakers in the files - this helps the model adapt better to speaker changes:

In [45]:
%%bash
ext_dir=models.studio/extractor

ivector-extract-online2 --diag-ubm=$ext_dir/final.dubm \
                        --global-cmvn-stats=$ext_dir/global_cmvn.stats \
                        --ivector-extractor=$ext_dir/final.ie \
                        --lda-matrix=$ext_dir/final.mat \
                        --splice-config=$ext_dir/splice.conf \
                        --cmvn-config=$ext_dir/online_cmvn.conf \
                        ark:data/spk2utt ark:data/mfcc ark:data/ivectors

ivector-extract-online2 --diag-ubm=models.studio/extractor/final.dubm --global-cmvn-stats=models.studio/extractor/global_cmvn.stats --ivector-extractor=models.studio/extractor/final.ie --lda-matrix=models.studio/extractor/final.mat --splice-config=models.studio/extractor/splice.conf --cmvn-config=models.studio/extractor/online_cmvn.conf ark:data/spk2utt ark:data/mfcc ark:data/ivectors 
LOG (ivector-extract-online2[5.5.1168~1-01aad]:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (ivector-extract-online2[5.5.1168~1-01aad]:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG (ivector-extract-online2[5.5.1168~1-01aad]:main():ivector-extract-online2.cc:188) Estimated iVectors for 31 files, 0 with errors.
LOG (ivector-extract-online2[5.5.1168~1-01aad]:main():ivector-extract-online2.cc:190) Average objective-function improvement was 14.6027 per frame, over 30660 frames (weighted).
LOG (ivector-extract-online2[5.5.1168~1-01aad]:main():i

## Transcription

We start by making a list of all the words in all the transcriptions:

In [46]:
!cut -f2- -d' ' data/text | tr ' ' '\n' | sort -u > data/word.list

There are 500 unique words in all the transcripts of this sample set:

In [47]:
!wc -l data/word.list

500 data/word.list
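
For readers who prefer to stay in Python, the word list can be built in a similar way. This is only a sketch with a hypothetical function name, `unique_words`. Python sorts by code point while `sort -u` follows the shell locale, so the order (and therefore the word numbers assigned later) may differ, even though the set of words is the same:

```python
def unique_words(text_lines):
    """Collect the sorted set of words, skipping the recording name in the first column."""
    words = set()
    for line in text_lines:
        words.update(line.split()[1:])
    return sorted(words)

# Two shortened example lines in the format of data/text.
sample = ["sent004 wreszcie zgasiła go w popielniczce", "sent005 w końcu jestem"]
print(unique_words(sample))
```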


We then create a pronunciation lexicon with a single pronunciation for each word. We do this using a statistical grapheme-to-phoneme converter that was trained beforehand on a sample lexicon:

In [48]:
!phonetisaurus-g2pfst --model=models.studio/g2p/model.fst --wordlist=data/word.list > data/lexicon.txt

GitRevision: kaldi


This is what the start of the lexicon looks like: the first column is the word (in orthographic form), the second is a score from the g2p model (generally not relevant here), and the rest is the space-delimited phonetic transcription of the word. This model was trained to use the Polish SAMPA script:

In [49]:
!head data/lexicon.txt

a	10.4529	a
aberonowi	14.8625	a b e r o n o v i
agent	11.3559	a g e n t
ale	10.6668	a l e
alkoholu	12.1258	a l k o x o l u
andrzeja	12.8172	a n d Z e j a
ani	11.1928	a n' i
antykoncepcji	11.857	a n t I k o n ts e p ts i
asfalt	12.2906	a s f a l t
atmosferze	12.2989	a t m o s f e Z e


We then use this lexicon to create a pronunciation for each audio file. Here we simply copy the automatic pronunciation of each word in sequence:

In [50]:
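# Load the lexicon: word -> list of phonemes (the second column, the g2p score, is skipped).
lex={}
with open("data/lexicon.txt") as f:
  for l in f:
    tok = l.strip().split()
    lex[tok[0]] = tok[2:]

def add_suffix(phones):
  """Append word-position suffixes: _S (singleton), _B (begin), _I (internal), _E (end)."""
  if len(phones)==1:
    return [phones[0]+"_S"]
  ret=[phones[0]+"_B"]
  for ph in phones[1:-1]:
    ret.append(ph+"_I")
  ret.append(phones[-1]+"_E")
  return ret

# Write one line per word, keyed as "<recording>.<word index>".
with open("data/text-phones","w") as ph:
  with open("data/text") as f:
    for l in f:
      tok = l.strip().split()
      uttid = tok[0]
      for i,w in enumerate(tok[1:]):
        ph.write(f"{uttid}.{i} {' '.join(add_suffix(lex[w]))}\n")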

This creates a file where each line is prefixed by the recording name followed by a period and then the number of the word in sequence:

In [51]:
!head data/text-phones

rich001.0 d_B r_I o_I Z_I dZ_I e_E
rich001.1 dZ_B i_I p_E
rich001.2 g_B v_I o_I Z_I dZ_I e_I n'_I e_E
rich001.3 o_B z'_I i_I m_I i_I n_I a_E
rich001.4 v_B e_I n_I dz_I a_I S_E
rich001.5 r_B dz_I e_I n'_E
rich001.6 v_B e_I n_I dz_I o_I n_I k_I a_E
rich001.7 i_B n_I g_I e_I r_I o_I v_I a_I ts'_E
rich001.8 k_B w_I a_I dz_I e_I n'_I e_E
rich001.9 j_B u_I t_I S_I e_I n_I k_I a_E


Compared to the orthographic transcription it looks like this:

In [52]:
!grep sent001 data/text | cut -f2- -d' '
!grep sent001 data/text-phones | cut -f2- -d' ' | tr '\n' '|'

stwierdzam że senator stanisław huskowski złożył ślubowanie panie senatorze gratuluję oklaski są mile widziane tym bardziej że do naszego grona senatorskiego dołącza osoba o dużym doświadczeniu samorządowym
s_B t_I f_I j_I e_I r_I dz_I a_I m_E|Z_B e_E|s_B e_I n_I a_I t_I o_I r_E|s_B t_I a_I n'_I i_I s_I w_I a_I v_E|x_B u_I s_I k_I o_I f_I s_I k_I i_E|z_B w_I o_I Z_I I_I w_E|s'_B l_I u_I b_I o_I v_I a_I n'_I e_E|p_B a_I n'_I j_I e_E|s_B e_I n_I a_I t_I o_I Z_I e_E|g_B r_I a_I t_I u_I l_I u_I j_I e_E|o_B k_I l_I a_I s_I k_I i_E|s_B o~_E|m_B i_I l_I e_E|v_B i_I dz'_I a_I n_I e_E|t_B I_I m_E|b_B a_I r_I dz'_I e_I j_E|Z_B e_E|d_B o_E|n_B a_I S_I e_I g_I o_E|g_B r_I o_I n_I a_E|s_B e_I n_I a_I t_I o_I r_I s_I k_I j_I e_I g_I o_E|d_B o_I w_I o_I n_I tS_I a_E|o_B s_I o_I b_I a_E|o_S|d_B u_I Z_I I_I m_E|d_B o_I s'_I f_I j_I a_I d_I tS_I e_I n'_I u_E|s_B a_I m_I o_I Z_I o_I n_I d_I o_I v_I I_I m_E|

This can obviously be manually modified to suit special needs. You only need to edit the `text-phones` file. The only rule is that the number of words in `text` and `text-phones` match for each recording.
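
After editing `text-phones` by hand, it may be worth verifying that rule before going further. The following is a minimal sketch of such a check; the function name `word_count_mismatches` is hypothetical, and it assumes the `<recording>.<word index>` key format used above:

```python
import os
from collections import Counter

def word_count_mismatches(text_lines, phone_lines):
    """Return {recording: (words in text, lines in text-phones)} where the two disagree."""
    expected = {}
    for line in text_lines:
        tok = line.split()
        if tok:
            expected[tok[0]] = len(tok) - 1
    found = Counter()
    for line in phone_lines:
        tok = line.split()
        if tok:
            # Keys look like "sent001.12": recording name, period, word index.
            found[tok[0].rsplit(".", 1)[0]] += 1
    return {utt: (n, found[utt]) for utt, n in expected.items() if found[utt] != n}

# Run the check only if both files are present.
if os.path.exists("data/text") and os.path.exists("data/text-phones"):
    with open("data/text") as t, open("data/text-phones") as p:
        print(word_count_mismatches(t, p) or "word counts match")
```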

The transcription uses phonemes from the Polish SAMPA script as listed in the `models.studio/phones.list` file. The suffixes `_B`, `_E`, `_S` and `_I` denote beginning, end, singleton and internal phonemes. Begin and end are self-explanatory, internal is for everything between them, and singleton is for words that consist of a single phoneme (i.e. a phoneme that is both begin and end). These extra annotations are needed because of how graphs are created and processed internally by the alignment procedure.

Once we have all the words and their pronunciations, we assign a number to each word and create two files, `words.txt` and `phones.txt`, which map each word/phoneme to its unique number. This allows the toolkit to process the data using numbers instead of (potentially) long sequences of characters, which makes the whole procedure much faster and more memory-efficient:

In [53]:
!echo "<eps> 0" > data/words.txt
!awk '{print($0,NR)}' < data/word.list >> data/words.txt
!awk '{print($0,NR-1)}' < models.studio/phones.list > data/phones.txt

It looks like this:

In [54]:
!echo "Words:"
!head data/words.txt
!echo "..."
!echo "Phonemes:"
!head data/phones.txt
!echo "..."

Words:
<eps> 0
a 1
aberonowi 2
agent 3
ale 4
alkoholu 5
andrzeja 6
ani 7
antykoncepcji 8
asfalt 9
...
Phonemes:
<eps> 0
sil 1
sil_B 2
sil_E 3
sil_I 4
sil_S 5
spn 6
spn_B 7
spn_E 8
spn_I 9
...
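
To make the role of these tables concrete, the sketch below mimics roughly what `sym2int.pl -f 2-` does in the later cells: it keeps the first field of each line and replaces every following symbol by its number. This is an illustration with hypothetical function names, not a reimplementation of the Perl script, which has additional options not covered here:

```python
def load_symbols(lines):
    """Parse 'symbol number' lines into a dict."""
    table = {}
    for line in lines:
        tok = line.split()
        if len(tok) == 2:
            table[tok[0]] = int(tok[1])
    return table

def symbols_to_ints(line, table):
    """Keep the first field (the ID) and replace every later field by its number."""
    tok = line.split()
    return " ".join([tok[0]] + [str(table[sym]) for sym in tok[1:]])

# The first few entries of words.txt shown above; "utt1" is a made-up recording name.
words = load_symbols(["<eps> 0", "a 1", "aberonowi 2", "agent 3", "ale 4"])
print(symbols_to_ints("utt1 ale a agent", words))  # utt1 4 1 3
```

A symbol that is missing from the table raises a `KeyError` in this sketch, which is a convenient way to spot words or phonemes that were never added to the lists.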


We also have to add one extra "disambiguation" phoneme. Its number is one more than the highest phoneme number in `phones.txt` above. This extra phoneme is not too important for our problem, but it is required by the `compile-train-graphs-without-lexicon` program below:

In [55]:
!echo 160 > data/disambig.int
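
Rather than hard-coding the value, it could be derived from the phoneme table, which guards against a mismatch if a different `phones.list` is used. A minimal sketch, assuming the `symbol number` format of `data/phones.txt` shown above (the function name `next_free_id` is hypothetical):

```python
import os

def next_free_id(symbol_lines):
    """Return one more than the largest number used in a 'symbol number' table."""
    ids = [int(tok[1]) for tok in (line.split() for line in symbol_lines) if len(tok) == 2]
    return max(ids) + 1

print(next_free_id(["<eps> 0", "sil 1", "sil_B 2"]))  # 3

# With the real table present, this prints the value to put in data/disambig.int.
if os.path.exists("data/phones.txt"):
    with open("data/phones.txt") as f:
        print(next_free_id(f))
```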

Now we will generate a special FST graph for each recording that describes both the phonetic and the orthographic sequence related to it. A graph is used because of its flexibility and expressiveness: for example, we could potentially create multiple pronunciations for certain words using a more complex structure than the simple linear sequence used here.

We use the program named "without-lexicon" (even though others exist) because it lets us manually define each phoneme in the sequence. Normally, we would use the `compile-train-graphs` program, which uses the `lexicon.txt` file above to create the phonetic graph, but that wouldn't let us assign a different pronunciation to the same word in different files or in different places in the same file:

In [56]:
%%bash

compile-train-graphs-without-lexicon --read-disambig-syms=data/disambig.int \
                                    models.studio/nnet3/tree \
                                    models.studio/nnet3/final.mdl \
                                    'ark:sym2int.pl -f 2- data/words.txt data/text|' \
                                    'ark:sym2int.pl -f 2- data/phones.txt data/text-phones|' \
                                    ark:data/graphs

compile-train-graphs-without-lexicon --read-disambig-syms=data/disambig.int models.studio/nnet3/tree models.studio/nnet3/final.mdl 'ark:sym2int.pl -f 2- data/words.txt data/text|' 'ark:sym2int.pl -f 2- data/phones.txt data/text-phones|' ark:data/graphs 
LOG (compile-train-graphs-without-lexicon[5.5.1168~1-01aad]:main():compile-train-graphs-without-lexicon.cc:196) compile-train-graphs: succeeded for 31 graphs, failed for 0


## Alignment

The actual alignment process takes the input acoustic features (`ivectors` and `mfcc`), the neural-network model (`final.mdl`) and the graphs (`graphs`, created above), and combines them to produce the alignment, which is saved in the `ali` file. This is the main and most time-consuming step. It can use a GPU to speed up the computation, although the log below shows this run executing on the CPU (`--use-gpu=no`):



In [61]:
!nnet3-align-compiled --online-ivectors=ark:data/ivectors \
                      --online-ivector-period=10 \
                      models.studio/nnet3/final.mdl \
                      ark:data/graphs \
                      ark:data/mfcc \
                      ark,t:data/ali

nnet3-align-compiled --online-ivectors=ark:data/ivectors --online-ivector-period=10 --use-gpu=no models.studio/nnet3/final.mdl ark:data/graphs ark:data/mfcc ark,t:data/ali 
LOG (nnet3-align-compiled[5.5.1168~1-01aad]:SelectGpuId():cu-device.cc:175) Manually selected to compute on CPU.
LOG (nnet3-align-compiled[5.5.1168~1-01aad]:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 1 orphan nodes.
LOG (nnet3-align-compiled[5.5.1168~1-01aad]:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 2 orphan components.
LOG (nnet3-align-compiled[5.5.1168~1-01aad]:Collapse():nnet-utils.cc:1488) Added 1 components, removed 2
LOG (nnet3-align-compiled[5.5.1168~1-01aad]:main():nnet3-align-compiled.cc:198) Overall log-likelihood per frame is 0.160185 over 30660 frames.
LOG (nnet3-align-compiled[5.5.1168~1-01aad]:main():nnet3-align-compiled.cc:201) Retried 0 out of 31 utterances.
LOG (nnet3-align-compiled[5.5.1168~1-01aad]:main():nnet3-align-compiled.cc:203) Done 31, errors on 0
LOG (nnet3-align-compiled[5.5

The alignment in `ali` is a sequence of transition IDs, one per frame, and is therefore difficult to read:

In [62]:
!head data/ali

rich001 4 1 1 1 1 1 1 16 18 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 3030 3029 3029 3029 3029 3029 3029 3074 3094 3093 3093 12114 12113 12113 12156 12155 12155 12214 12213 10754 10866 10865 10865 10956 10955 10955 10955 1480 1479 1479 1496 1495 1495 1495 1495 1495 1495 1495 1524 3346 3348 3350 3349 3349 3349 3956 3955 4080 4079 4079 4200 4199 4199 4199 4199 4199 4199 4199 4 16 15 18 3324 3323 3323 3323 3323 3323 3323 3328 3327 3327 3327 3327 3332 3331 3331 3331 6234 6233 6233 6314 6362 6361 6361 11508 11507 11507 11540 11539 11539 11562 11561 11561 11561 11561 11561 11561 11561 11561 11561 11561 11561 4 16 1
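
One way to make such a line a little easier to inspect is to collapse runs of repeated numbers into (ID, repeat count) pairs. The sketch below does only that; it does not interpret the IDs, and the function name `run_lengths` is hypothetical:

```python
from itertools import groupby

def run_lengths(ali_line):
    """Split an alignment line into the recording name and (id, repeat_count) pairs."""
    utt, *ids = ali_line.split()
    return utt, [(int(key), len(list(group))) for key, group in groupby(ids)]

# The first few entries of the line shown above.
utt, runs = run_lengths("rich001 4 1 1 1 1 1 1 16 18 17 17 17")
print(utt, runs)
```

Long runs of a single ID, like the one at the start of `rich001`, then show up as one pair with a large count.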

A much cleaner format is the so-called CTM file, which is kind of like a CSV for alignments. Converting the `ali` file to a CTM takes several steps:

In [63]:
%%bash
linear-to-nbest ark,t:data/ali 'ark:sym2int.pl -f 2- data/words.txt data/text|' '' '' ark,t:- | \
    lattice-align-words models.studio/word_boundary.int models.studio/nnet3/final.mdl ark:- ark:- | \
    nbest-to-ctm ark:- - | \
    int2sym.pl -f 5 data/words.txt > data/words.ctm

nbest-to-ctm ark:- - 
lattice-align-words models.studio/word_boundary.int models.studio/nnet3/final.mdl ark:- ark:- 
linear-to-nbest ark,t:data/ali 'ark:sym2int.pl -f 2- data/words.txt data/text|' '' '' ark,t:- 
LOG (linear-to-nbest[5.5.1168~1-01aad]:main():linear-to-nbest.cc:130) Done 31 n-best entries ,0 had errors.
LOG (lattice-align-words[5.5.1168~1-01aad]:main():lattice-align-words.cc:126) Successfully aligned 31 lattices; 0 had errors.
LOG (nbest-to-ctm[5.5.1168~1-01aad]:main():nbest-to-ctm.cc:119) Converted 31 linear lattices to ctm format; 0 had errors.


This is what the file looks like: it contains the recording in the first column, the audio channel in the second (here it's always 1), the start of the segment in seconds, its duration in seconds, and finally the word in orthographic form:

In [64]:
!head data/words.ctm

rich001 1 1.510 0.580 drożdże 
rich001 1 2.130 0.410 dżip 
rich001 1 2.580 0.890 gwożdżenie 
rich001 1 3.710 0.770 ozimina 
rich001 1 4.760 0.730 wędzarz 
rich001 1 5.830 0.590 rdzeń 
rich001 1 6.610 0.690 wędzonka 
rich001 1 7.520 0.750 ingerować 
rich001 1 8.410 0.620 kładzenie 
rich001 1 9.160 0.610 jutrzenka 


We can also convert the word alignment to phone alignment and then convert that to CTMs as well:

In [66]:
%%bash
linear-to-nbest ark,t:data/ali 'ark:sym2int.pl -f 2- data/words.txt data/text|' '' '' ark,t:- | \
      lattice-to-phone-lattice models.studio/nnet3/final.mdl ark:- ark:- | \
      nbest-to-ctm ark:- - | \
      int2sym.pl -f 5 data/phones.txt > data/phones.ctm

nbest-to-ctm ark:- - 
linear-to-nbest ark,t:data/ali 'ark:sym2int.pl -f 2- data/words.txt data/text|' '' '' ark,t:- 
lattice-to-phone-lattice models.studio/nnet3/final.mdl ark:- ark:- 
LOG (linear-to-nbest[5.5.1168~1-01aad]:main():linear-to-nbest.cc:130) Done 31 n-best entries ,0 had errors.
LOG (lattice-to-phone-lattice[5.5.1168~1-01aad]:main():lattice-to-phone-lattice.cc:94) Done converting 31 lattices.
LOG (nbest-to-ctm[5.5.1168~1-01aad]:main():nbest-to-ctm.cc:119) Converted 31 linear lattices to ctm format; 0 had errors.


That looks like this:

In [67]:
!head data/phones.ctm

rich001 1 0.000 1.510 sil 
rich001 1 1.510 0.110 d_B 
rich001 1 1.620 0.080 r_I 
rich001 1 1.700 0.080 o_I 
rich001 1 1.780 0.120 Z_I 
rich001 1 1.900 0.060 dZ_I 
rich001 1 1.960 0.130 e_E 
rich001 1 2.090 0.040 sil 
rich001 1 2.130 0.160 dZ_B 
rich001 1 2.290 0.070 i_I 


Finally, we will create a folder to store the TextGrid files in:

In [68]:
%mkdir output

And then convert the 2 CTMs to a collection of TextGrids using the `pympi` library:

In [69]:
from collections import defaultdict
from typing import List
from dataclasses import dataclass, field
from pympi import TextGrid

@dataclass
class Segment:
  start: float
  end: float
  text: str

@dataclass
class Alignment:
  words: List[Segment] = field(default_factory=list)
  phones: List[Segment] = field(default_factory=list)

# One Alignment per recording, created on first access.
alignments=defaultdict(Alignment)

def read_ctm(path):
  """Yield (recording, Segment) for each CTM line: recording, channel, start, duration, label."""
  with open(path) as f:
    for l in f:
      tok=l.strip().split()
      start=float(tok[2])
      yield tok[0],Segment(start,start+float(tok[3]),tok[4])

for utt,seg in read_ctm('data/words.ctm'):
  alignments[utt].words.append(seg)

for utt,seg in read_ctm('data/phones.ctm'):
  alignments[utt].phones.append(seg)

for utt,ali in alignments.items():
  # The phone tier may end in silence after the last word, so use the later end time.
  tg=TextGrid(xmax=round(max(ali.words[-1].end,ali.phones[-1].end),2))
  tw=tg.add_tier('words')
  for w in ali.words:
    tw.add_interval(round(w.start,2),round(w.end,2),w.text)
  tp=tg.add_tier('phones')
  for p in ali.phones:
    tp.add_interval(round(p.start,2),round(p.end,2),p.text)
  tg.to_file(f'output/{utt}.TextGrid')

The TextGrids can then be saved locally and viewed in either Praat or [EMU-webApp](https://ips-lmu.github.io/EMU-webApp/). Simply open the menu on the left, expand the `output` directory and right-click each file to download it.

If you have too many files to download individually, simply use a command to compress all of them into an archive, e.g.:
```
!zip -r TextGrid.zip output
```
and download the whole archive instead.
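
If the `zip` program is not available, the same archive can be produced from Python with the standard `shutil` module. This is a small sketch with a hypothetical helper name, `zip_directory`; like `zip -r`, it keeps the directory name as a prefix inside the archive:

```python
import os
import shutil

def zip_directory(directory, archive_stem):
    """Create '<archive_stem>.zip' containing `directory` and return the archive path."""
    directory = os.path.abspath(directory)
    return shutil.make_archive(
        archive_stem,
        "zip",
        root_dir=os.path.dirname(directory),   # archive paths are relative to the parent
        base_dir=os.path.basename(directory),  # so entries start with the directory name
    )

# Only attempt this if the output folder exists.
if os.path.isdir("output"):
    print(zip_directory("output", "TextGrid"))
```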