# Explanation of the alignment procedure
### for aligning very long audio sequences to very long text sequences

This notebook describes the code for aligning long audio files to long text sequences using the Hugging Face Transformers toolkit.

More precisely, we have the following inputs:
* a long audio file (this can be anything from a few minutes, to a few hours, to a whole day of audio)
* a text that contains the transcripts of the audio file

On output we expect to get:
* word-level timings - timecodes that map each portion of the text to the matching portion of the audio

Keeping in mind that:
* audio can contain speech and events that aren't in the transcription
* transcripts can contain text that isn't in the audio

The sources of these discrepancies can be:
* the transcript can be much larger than the speech contained in audio - in other words, we can have one large file with transcripts of many audio files and this is acceptable for this project - we only use what we need
* the audio doesn't have to be transcribed with 100% recall or precision - transcribers make mistakes, not everything gets transcribed - this is all fine and we take it into account
* the alignment process can produce errors - we have a procedure that allows for reviewing results and deciding on mitigation procedures after all the automation is complete


In [1]:
import json
from pathlib import Path

from src.align import align, fix_times, save_ali_to_textgrid, convert_ali_to_segments
from src.data_loaders import load_reco, load_audio, extract_audio
from src.matching import Matcher
from src.recognize import recognize

First we define a couple of paths to local resources:
* `audio_file` - the audio file containing the speech we wish to align to
* `text_file` - the file containing all the transcripts - needs to contain text only and ideally should be normalized
* `asr_model` - Hugging Face Hub name or local path of the speech recognition model used to process the files

These files are created throughout the procedure:
* `reco_file`  - this file is the result of the initial speech recognition process
* `textgrid_file`  - this file contains the final alignment openable in Praat and similar programs
* `segs_file`  - this file contains the aligned segments useful for debugging and analysis

In [2]:
audio_file = Path('data/1. posiedzenie Sejmu IX kadencji - dzień drugi [ZAPIS TRANSMISJI] [bM7-TmS16HY].wav')
text_file = Path('data/ParlaMint-PL_2019-11-13-sejm-01-2.norm.txt')
asr_model = 'asr_model'
reco_file = Path('data/reco.json')
textgrid_file = Path('data/out.TextGrid')
segs_file = Path('data/segs.json')

## Step 1 - Speech recognition

The first step is to get any kind of speech recognition output that we can then use to align to actual transcripts.

Since this process can be expensive, we will save the result immediately to a file. If the file exists, we don't have to run it again.

It's also a good idea to restart the notebook kernel after running this cell in order to release memory on the GPU. Hugging Face Transformers doesn't offer a convenient way of releasing reserved resources, so terminating the process is the simplest way to achieve this.

In [3]:
if not reco_file.exists():
    reco = recognize(audio_file, asr_model, batch_size=4)
    with open(reco_file, 'w') as f:
        json.dump(reco, f)
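
The compute-once-then-reuse pattern from the cell above can be wrapped in a small helper. This is only a minimal sketch: `cached_json` is a hypothetical name and is not part of the `src` package.

```python
import json
from pathlib import Path


def cached_json(path, compute):
    """Return the JSON stored at `path`, computing and saving it on first use.

    Hypothetical helper for illustration; not part of the `src` package.
    """
    path = Path(path)
    if path.exists():
        # The expensive step already ran once: reuse its saved result.
        with open(path, encoding='utf-8') as f:
            return json.load(f)
    result = compute()  # the expensive step, e.g. speech recognition
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(result, f)
    return result
```

With such a helper, the cell above could be written as `cached_json(reco_file, lambda: recognize(audio_file, asr_model, batch_size=4))`.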

Here we reload the file saved above. In this demo it contains the words of only one recording, but you could recognize many recordings at once and the file would then contain all of them.

In [4]:
words = load_reco(reco_file)

print('Number of words per recording:')
print({x: len(y) for x, y in words.items()})

Number of words per recording:
{'1. posiedzenie Sejmu IX kadencji - dzień drugi [ZAPIS TRANSMISJI] [bM7-TmS16HY]': 2701}


Here we first get the utterance id of the single file in question and then extract 100 words from it. The idea of the procedure is to work in chunks to limit the resources required for processing. We will first demonstrate the whole procedure on a single chunk and later show how to process the whole file.

In [5]:
utt_id = list(words.keys())[0]
words_chunk = words[utt_id][300:400]

print(f'Chunk len: {words_chunk[-1].end - words_chunk[0].start}s')
text = ' '.join([x.text for x in words_chunk])
print(f'Chunk text:\n{text}')

Chunk len: 54.260000000000105s
Chunk text:
legitymację jeszcze moment dobrze to jeszcze chwileczkę po czekamy ale proszę u widzę że wszyscy mamy kartę do głosowania w związku z tym poddam ten wniosek pod głosowanie kto z pań i panów posłów jest za ogłoszeniem przerwy w obradach proszę o podniesienie ręki naciśnięcie przycisku kto jest przeciw kto się wstrzymał os państwa jest za zero to jest przeciw kto się wstrzymał głosowało czterystu czterdziestu dwóch posłów za było stu osiemdziesięciu dziewięciu przeciw dwustu czterdziestu siedmiu sześciu się wstrzymało wniosek przepadły nie i panowie posłowie prezydium sejmu przedłożyło projekt uchwały w sprawie ustalenia liczby członków komisji do spraw służb


## Step 2 - Matching recognized text to the transcription

We begin by creating a `Matcher` object. This object requires the text from the transcript file on initialization and can later be used to match various audio file speech recognition results to it.

In [6]:
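word_seq = []
# The transcript contains Polish text; it is assumed here to be UTF-8 encoded.
with open(text_file, encoding='utf-8') as f:
    for line in f:
        # split() without arguments also discards the trailing newline
        word_seq.extend(line.split())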

matcher = Matcher(word_seq)

We begin by matching the 100-word text above to the whole transcript in a rough manner. This is done by making a histogram of the recognized word sequence and comparing it to the histogram of a sliding window of the same length across the whole transcript - akin to a bag-of-words (BOW) model. A list of sufficiently similar candidate locations is returned. How similar a candidate has to be can be controlled by the optional `threshold` argument.

In [7]:
locs = matcher._initial_match(text)
print(f'Found {len(locs)} potential locations for the above text.')

Found 33 potential locations for the above text.
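
The rough matching step described above can be sketched as follows. This is an illustration and not the actual code of `Matcher._initial_match`: the function name, the similarity measure (shared word occurrences divided by window length) and the default `threshold` of 0.5 are all assumptions.

```python
from collections import Counter


def histogram_candidates(hyp_words, corpus_words, threshold=0.5):
    """Return start offsets of corpus windows whose word histogram resembles hyp_words.

    Similarity = shared word occurrences / window length (an assumed measure).
    """
    n = len(hyp_words)
    if n == 0 or n > len(corpus_words):
        return []
    hyp_hist = Counter(hyp_words)
    win_hist = Counter(corpus_words[:n])
    # Size of the multiset intersection of the two histograms.
    overlap = sum((hyp_hist & win_hist).values())
    candidates = []
    for start in range(len(corpus_words) - n + 1):
        if start > 0:
            # Slide the window by one word and update the overlap incrementally.
            out_w = corpus_words[start - 1]
            in_w = corpus_words[start + n - 1]
            if win_hist[out_w] <= hyp_hist[out_w]:
                overlap -= 1
            win_hist[out_w] -= 1
            win_hist[in_w] += 1
            if win_hist[in_w] <= hyp_hist[in_w]:
                overlap += 1
        if overlap / n >= threshold:
            candidates.append(start)
    return candidates
```

Because only word counts are compared, word order inside the window is ignored. That makes the step cheap, but it is also why it can return many candidates that still need a more exact comparison.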


All the potential candidates are then compared to the original text using the Levenshtein distance. This gives an exact count of the differences between the two texts. This method is much more accurate, but also considerably more expensive, than the one used in the initial match above. On output, we get both the best location and the distance between the two texts. A distance of 0 means the texts are identical. The higher the value, the worse the match.

In [8]:
min_i, min_d = matcher._find_min_diff(locs, text)
print(f'The location with minimum Levenshtein distance is {min_i} and that distance is {min_d}.')

The location with minimum Levenshtein distance is 175 and that distance is 207.
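
For reference, the Levenshtein distance and the selection of the best candidate can be sketched like this. The function names are hypothetical, and whether `Matcher` compares characters or whole words is not stated in this notebook; the sketch works on any pair of sequences, so it covers both.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (x != y),  # substitution (or match, at no cost)
            ))
        prev = curr
    return prev[-1]


def best_location(hyp, corpus, candidates):
    """Return (start, distance) of the candidate window closest to hyp."""
    n = len(hyp)
    scored = [(levenshtein(hyp, corpus[s:s + n]), s) for s in candidates]
    dist, start = min(scored)
    return start, dist
```

The cost is proportional to the product of the two sequence lengths for every candidate, which is why the cheap histogram filter is applied first.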


Now, the above method only gives us the starting location of the match, but given the nature of the speech recognition output and the transcript, the two texts don't have to match in length. That is why we look for identically matching words to align the texts and reject any insertions/deletions at the beginning and the end of the sequence. The result is a pair of tuples containing the beginning and end of the matching portion in the speech recognition output and in the transcription, respectively.

In [9]:
(hb, he), (rb, re) = matcher._find_matching_seq(text, min_i)

print(f'The best matching sequence is between {hb} and {he} in the recognized text '
      f'and {min_i + rb} and {min_i + re} in the reference corpus.')

The best matching sequence is between 1 and 100 in the recognized text and 181 and 275 in the reference corpus.
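
One way to picture this trimming step is with `difflib.SequenceMatcher` from the standard library. This is a swapped-in technique for illustration, not necessarily how `Matcher._find_matching_seq` works; `trim_to_matching` is a hypothetical name.

```python
from difflib import SequenceMatcher


def trim_to_matching(hyp_words, ref_words):
    """Trim both word lists to the span between their first and last identical words.

    Returns ((hyp_begin, hyp_end), (ref_begin, ref_end)) as half-open index
    ranges, or None if the two lists share no words.
    """
    sm = SequenceMatcher(a=hyp_words, b=ref_words, autojunk=False)
    # The last block returned is always a zero-length sentinel, so drop empty blocks.
    blocks = [b for b in sm.get_matching_blocks() if b.size > 0]
    if not blocks:
        return None
    first, last = blocks[0], blocks[-1]
    return (first.a, last.a + last.size), (first.b, last.b + last.size)
```

Anything before the first or after the last identical word is treated as an insertion or deletion at the edge of the chunk and is dropped.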


Having all that, we can now match the transcription text to the ASR text, which also gives us the time location of the transcription chunk within the audio file:

In [10]:
hyp_text = ' '.join(text.split()[hb:he])
hyp_tb = words_chunk[hb].start
hyp_te = words_chunk[he - 1].end
ref_text = matcher.get_corpus_chunk(min_i + rb, min_i + re)

print(f'Position of the text in audio file {audio_file} is between seconds {hyp_tb} and {hyp_te}.')
print(f'Recognized text:\n{hyp_text}')
print(f'Reference text:\n{ref_text}')


Position of the text in audio file data/1. posiedzenie Sejmu IX kadencji - dzień drugi [ZAPIS TRANSMISJI] [bM7-TmS16HY].wav is between seconds 659.848 and 711.816.
Recognized text:
jeszcze moment dobrze to jeszcze chwileczkę po czekamy ale proszę u widzę że wszyscy mamy kartę do głosowania w związku z tym poddam ten wniosek pod głosowanie kto z pań i panów posłów jest za ogłoszeniem przerwy w obradach proszę o podniesienie ręki naciśnięcie przycisku kto jest przeciw kto się wstrzymał os państwa jest za zero to jest przeciw kto się wstrzymał głosowało czterystu czterdziestu dwóch posłów za było stu osiemdziesięciu dziewięciu przeciw dwustu czterdziestu siedmiu sześciu się wstrzymało wniosek przepadły nie i panowie posłowie prezydium sejmu przedłożyło projekt uchwały w sprawie ustalenia liczby członków komisji do spraw służb
Reference text:
jeszcze moment jeszcze moment dobrze to jeszcze chwileczkę poczekamy ale proszę widzę że wszyscy mamy już karty do głosowania w związku z tym poddam 

We can then extract the audio within this segment. Now, this audio can also contain lots of silence which affects the performance of the alignment below. That is why we extract only the portions of the audio that match the words recognized by the ASR. This should contain all the audio that contains any speech within this segment.

We also get the bit-mask matching the extracted segments. This will allow us to recreate the times in the original file, before removing the silences.

In [11]:
audio = load_audio(audio_file)
seg, mask = extract_audio(audio['input_values'], words_chunk[hb:he], audio['samp_freq'])

print(f'The length of the extracted segment is {len(seg) / audio["samp_freq"]}s.')

The length of the extracted segment is 35.7s.
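
The idea behind the extraction and its mask can be sketched as below. It assumes the mask holds one boolean per sample of the original signal, which is not confirmed by this notebook, and `extract_speech` is a hypothetical name. Plain lists keep the sketch dependency-free; real code would more likely use NumPy arrays.

```python
def extract_speech(samples, word_times, samp_freq):
    """Keep only the samples that fall inside a recognized word.

    Returns the shortened signal and a per-sample boolean mask over `samples`.
    """
    mask = [False] * len(samples)
    for start, end in word_times:
        b = max(0, int(round(start * samp_freq)))
        e = min(len(samples), int(round(end * samp_freq)))
        # Mark every sample covered by this word as speech.
        mask[b:e] = [True] * max(0, e - b)
    segment = [s for s, keep in zip(samples, mask) if keep]
    return segment, mask
```

The number of `True` entries in the mask equals the length of the extracted segment, which is what later makes it possible to map times back to the original file.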


## Step 3 - Re-alignment

The above text matching procedure only gave us the beginning and end of the whole segment. If we want to get a more accurate alignment we need to re-align all the words from the reference transcription to the audio. Fortunately, this procedure is quite robust given the limited number of words and length of audio extracted above.

It is worth noting, however, that any errors and inaccuracies in the transcript will cause unexpected behavior in the final output.

Note that this method accepts a whole sequence of audio/text pairs. Here we only have one, so we put each of them in a single-element list.

In [12]:
ali = align(asr_model, [seg], [ref_text])

07/21/2023 22:59:00 [INFO] - Loading models...
07/21/2023 22:59:13 [INFO] - Loading data...
07/21/2023 22:59:13 [INFO] - Loaded 1 files!
07/21/2023 22:59:13 [INFO] - Processing labels...


Map:   0%|          | 0/1 [00:00<?, ? examples/s]

07/21/2023 22:59:13 [INFO] - Total audio length: 35.70s == 0.60min == 0.01h
07/21/2023 22:59:13 [INFO] - Splitting data into chunks...
07/21/2023 22:59:13 [INFO] - Divided into 8 chunks!
07/21/2023 22:59:13 [INFO] - Processing chunks using the W2V2 model...


Map:   0%|          | 0/8 [00:00<?, ? examples/s]

07/21/2023 22:59:24 [INFO] - Merging chunks back into files...


100%|██████████| 8/8 [00:00<00:00, 15.74it/s]

07/21/2023 22:59:25 [INFO] - Performing forced alignment...





Map:   0%|          | 0/1 [00:00<?, ? examples/s]

07/21/2023 22:59:25 [INFO] - Saving output...
07/21/2023 22:59:25 [INFO] - Done!
07/21/2023 22:59:25 [INFO] - Took 12.44s == 0.21min == 0.00h


This is what the result looks like. It's a list of words, with a begin/end timestamp for each one. Note that the times provided are within the extracted segment only. To get the times that match the original file, we need to make some corrections.

In [13]:
ali

{'audio_0000': [{'text': 'jeszcze',
   'timestamp': [0.019999999552965164, 0.1599999964237213]},
  {'text': 'moment', 'timestamp': [0.18000000715255737, 0.30000001192092896]},
  {'text': 'jeszcze', 'timestamp': [0.3199999928474426, 0.47999998927116394]},
  {'text': 'moment', 'timestamp': [0.5199999809265137, 0.7599999904632568]},
  {'text': 'dobrze', 'timestamp': [0.800000011920929, 1.2200000286102295]},
  {'text': 'to', 'timestamp': [1.2599999904632568, 1.3200000524520874]},
  {'text': 'jeszcze', 'timestamp': [1.340000033378601, 1.5]},
  {'text': 'chwileczkę',
   'timestamp': [1.5199999809265137, 1.8799999952316284]},
  {'text': 'poczekamy', 'timestamp': [1.940000057220459, 2.559999942779541]},
  {'text': 'ale', 'timestamp': [2.5999999046325684, 2.740000009536743]},
  {'text': 'proszę', 'timestamp': [2.819999933242798, 3.4600000381469727]},
  {'text': 'widzę', 'timestamp': [3.740000009536743, 3.9800000190734863]},
  {'text': 'że', 'timestamp': [4.039999961853027, 4.139999866485596]},


To correct the times, we need the mask used for extracting the segment as well as the sampling frequency (to be able to convert sample positions in the mask to times in seconds).

In [14]:
ali_words = list(ali.values())[0]

ali_fixed = fix_times(ali_words, mask, audio['samp_freq'])
ali_fixed

[{'text': 'jeszcze', 'timestamp': [659.6179375, 659.7579374968708]},
 {'text': 'moment', 'timestamp': [659.778, 659.8980000047684]},
 {'text': 'jeszcze', 'timestamp': [659.9179375, 660.0779374964237]},
 {'text': 'moment', 'timestamp': [660.1179375, 660.3579375095368]},
 {'text': 'dobrze', 'timestamp': [660.398, 660.8180000166893]},
 {'text': 'to', 'timestamp': [660.8579375, 660.9179375619889]},
 {'text': 'jeszcze', 'timestamp': [660.938, 661.0979999666214]},
 {'text': 'chwileczkę', 'timestamp': [661.1179375, 661.4779375143052]},
 {'text': 'poczekamy', 'timestamp': [661.538, 662.1579998855591]},
 {'text': 'ale', 'timestamp': [662.1979375, 662.3379376049041]},
 {'text': 'proszę', 'timestamp': [662.4179375, 663.0579376049042]},
 {'text': 'widzę', 'timestamp': [667.982, 668.2220000095367]},
 {'text': 'że', 'timestamp': [668.2819375, 668.3819374046326]},
 {'text': 'wszyscy', 'timestamp': [668.4219375, 668.8019376144409]},
 {'text': 'mamy', 'timestamp': [668.8419375, 669.0819377479553]},
 {'
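
The time correction can be sketched as follows, again assuming a per-sample boolean mask. `restore_times` is a hypothetical stand-in and not the actual `fix_times` implementation; it snaps times to sample positions.

```python
import math


def restore_times(words, mask, samp_freq):
    """Map word times measured in the extracted segment back to the original file."""
    # Original sample index of every sample kept in the extracted segment.
    kept = [i for i, keep in enumerate(mask) if keep]
    last = len(kept) - 1

    def start_time(t):
        pos = round(t * samp_freq, 6)  # guard against floating-point noise
        idx = min(max(int(pos), 0), last)
        return kept[idx] / samp_freq

    def end_time(t):
        pos = round(t * samp_freq, 6)
        # Map the last sample covered by the word, then step one sample past it,
        # so an end on a cut boundary does not jump across the removed silence.
        idx = min(max(math.ceil(pos) - 1, 0), last)
        return (kept[idx] + 1) / samp_freq

    return [
        {**w, 'timestamp': [start_time(w['timestamp'][0]), end_time(w['timestamp'][1])]}
        for w in words
    ]
```

A jump such as the one between `proszę` and `widzę` in the output above is what this mapping produces when silence was removed between two words.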

## Putting it all together

Now let's repeat all of the above on the whole file. We have a single convenience method in the `Matcher` class that divides the file into chunks and computes the alignment of each chunk in sequence. The method requires the speech recognition word sequence, the audio file and speech recognition model (to do the alignment) as arguments.

You can also optionally change the chunk length and stride. The stride allows for an overlap between chunks. After the chunks are aligned, any overlapping words are removed: the words are sorted by time and, where words overlap, the one with the earlier time is kept while the others are removed.
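
The chunking and the overlap removal described above can be sketched as follows. The function names, the default values and the reading of the stride as "number of words shared by consecutive chunks" are assumptions made for illustration; the actual `Matcher.run` may differ.

```python
def make_chunks(words, chunk_len=100, stride=10):
    """Split a word list into chunks of `chunk_len` words that overlap by `stride` words."""
    if not 0 <= stride < chunk_len:
        raise ValueError('stride must be non-negative and smaller than chunk_len')
    if not words:
        return []
    step = chunk_len - stride
    # Stop once a chunk has reached the end of the list.
    last_start = max(len(words) - stride, 1)
    return [words[i:i + chunk_len] for i in range(0, last_start, step)]


def remove_overlaps(aligned):
    """Sort aligned words by start time and drop any word overlapping an earlier kept one."""
    ordered = sorted(aligned, key=lambda w: w['timestamp'][0])
    kept = []
    for w in ordered:
        if kept and w['timestamp'][0] < kept[-1]['timestamp'][1]:
            continue  # overlaps the previously kept (earlier) word
        kept.append(w)
    return kept
```

The overlap gives words near a chunk boundary a second chance to be aligned with more context, and the clean-up afterwards ensures each stretch of audio ends up with a single word.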

In [15]:
ali_all = matcher.run(words[utt_id], audio_file, asr_model)

07/21/2023 22:59:25 [INFO] - Loading audio...
07/21/2023 22:59:25 [INFO] - Making chunks...


100%|██████████| 14/14 [00:00<00:00, 60.65it/s]

07/21/2023 22:59:25 [INFO] - Aligning reference to audio...
07/21/2023 22:59:25 [INFO] - Loading models...





07/21/2023 22:59:37 [INFO] - Loading data...
07/21/2023 22:59:37 [INFO] - Loaded 13 files!
07/21/2023 22:59:37 [INFO] - Processing labels...


Map:   0%|          | 0/13 [00:00<?, ? examples/s]

07/21/2023 22:59:37 [INFO] - Total audio length: 1071.30s == 17.86min == 0.30h
07/21/2023 22:59:37 [INFO] - Splitting data into chunks...
Downloading and preparing dataset generator/default to /home/guest/.cache/huggingface/datasets/generator/default-94ee03569a4a2e6b/0.0.0...


Generating train split: 0 examples [00:00, ? examples/s]

Dataset generator downloaded and prepared to /home/guest/.cache/huggingface/datasets/generator/default-94ee03569a4a2e6b/0.0.0. Subsequent calls will reuse this data.
07/21/2023 22:59:38 [INFO] - Divided into 255 chunks!
07/21/2023 22:59:38 [INFO] - Processing chunks using the W2V2 model...


Map:   0%|          | 0/255 [00:00<?, ? examples/s]

07/21/2023 23:01:38 [INFO] - Merging chunks back into files...


100%|██████████| 255/255 [00:16<00:00, 15.52it/s]

07/21/2023 23:01:54 [INFO] - Performing forced alignment...





Map:   0%|          | 0/13 [00:00<?, ? examples/s]

07/21/2023 23:01:57 [INFO] - Saving output...
07/21/2023 23:01:57 [INFO] - Done!
07/21/2023 23:01:57 [INFO] - Took 139.53s == 2.33min == 0.04h
07/21/2023 23:01:57 [INFO] - Fixing times...
07/21/2023 23:01:58 [INFO] - Removing overlapping words...


This will give us the alignment of the whole file.

In [16]:
print(f'Alignment created {len(ali_all)} aligned words.')
print(f'Sample of aligned words:\n{ali_all[:20]}')

Alignment created 2478 aligned words.
Sample of aligned words:
[{'text': 'komendanta', 'timestamp': [599.8059375, 600.4059374791384], 'ref_pos': 81}, {'text': 'głównego', 'timestamp': [600.4659375, 600.9059375572205], 'ref_pos': 82}, {'text': 'policji', 'timestamp': [601.006, 601.8459999141693], 'ref_pos': 83}, {'text': 'zawierającą', 'timestamp': [603.0659375, 603.2859375286102], 'ref_pos': 84}, {'text': 'odpowiedź', 'timestamp': [603.3059375, 603.7459375572205], 'ref_pos': 85}, {'text': 'na', 'timestamp': [603.7659375, 603.845937423706], 'ref_pos': 86}, {'text': 'pytanie', 'timestamp': [603.926, 604.3259998569489], 'ref_pos': 87}, {'text': 'czy', 'timestamp': [604.426, 605.0260001430512], 'ref_pos': 88}, {'text': 'miasto', 'timestamp': [605.206, 605.5259996948242], 'ref_pos': 89}, {'text': 'wrocław', 'timestamp': [605.5659375, 606.1659374046326], 'ref_pos': 90}, {'text': 'należy', 'timestamp': [606.206, 606.5659996566773], 'ref_pos': 91}, {'text': 'jeszcze', 'timestamp': [606.6059375

We can now save the alignment to a TextGrid file. This is a popular format introduced by [Praat](https://www.praat.org), a program used for speech segmentation. It is supported by many other tools, for example [Elan](https://archive.mpi.nl/tla/elan) and [EMU-webApp](https://ips-lmu.github.io/EMU-webApp/). Note that these tools may not work well with very long files. Praat supports loading long audio files, but its interface is quite dated. EMU is more modern, but it runs in the browser and could crash when loading a large audio file.

In [17]:
save_ali_to_textgrid(textgrid_file, ali_all)
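
To give an idea of what ends up in the file, here is a minimal sketch that renders word timings as a long-format TextGrid with a single interval tier. It is not the actual `save_ali_to_textgrid` implementation; `to_textgrid` and the tier name `words` are hypothetical, and the input is assumed to be sorted and free of overlaps.

```python
def to_textgrid(words, tier_name='words'):
    """Render sorted, non-overlapping word timings as a long-format TextGrid string."""
    intervals, cursor = [], 0.0
    for w in words:
        begin, end = w['timestamp']
        if begin > cursor:
            # Interval tiers are contiguous, so gaps become empty intervals.
            intervals.append((cursor, begin, ''))
        intervals.append((begin, end, w['text']))
        cursor = end
    xmax = cursor
    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        '',
        'xmin = 0',
        f'xmax = {xmax}',
        'tiers? <exists>',
        'size = 1',
        'item []:',
        '    item [1]:',
        '        class = "IntervalTier"',
        f'        name = "{tier_name}"',
        '        xmin = 0',
        f'        xmax = {xmax}',
        f'        intervals: size = {len(intervals)}',
    ]
    for i, (begin, end, text) in enumerate(intervals, start=1):
        text = text.replace('"', '""')  # quotes are escaped by doubling them
        lines += [
            f'        intervals [{i}]:',
            f'            xmin = {begin}',
            f'            xmax = {end}',
            f'            text = "{text}"',
        ]
    return '\n'.join(lines) + '\n'
```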

For debugging, we convert the word segmentation into a kind of utterance segmentation by combining all the words that are close to each other into one utterance. We use the optional `sil_gap` argument to control the length of silence needed to start a new utterance. In some cases you may also want to use the `max_len` argument to further split long utterances into smaller chunks - this is done in such a way that the sub-utterances are roughly the same length, but none is longer than `max_len`.

In [18]:
segs = convert_ali_to_segments(ali_all, words[utt_id])

with open(segs_file, 'w') as f:
    json.dump(segs, f)
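
The grouping logic can be sketched as follows. This is not the actual `convert_ali_to_segments` code: the function names, the default `sil_gap` of one second and the output layout (`text`, `start`, `end`) are assumptions, and the second argument of the real function is not modelled here.

```python
import math


def split_long(utt, max_len):
    """Split one utterance into roughly equal parts, each at most `max_len` seconds."""
    start, end = utt[0]['timestamp'][0], utt[-1]['timestamp'][1]
    n_parts = math.ceil((end - start) / max_len)
    if n_parts <= 1:
        return [utt]
    target = (end - start) / n_parts  # desired duration of each part
    parts = [[]]
    for w in utt:
        # Start a new part when adding this word would exceed the target duration.
        if parts[-1] and w['timestamp'][1] - parts[-1][0]['timestamp'][0] > target:
            parts.append([])
        parts[-1].append(w)
    return parts


def words_to_segments(words, sil_gap=1.0, max_len=None):
    """Merge words separated by at most `sil_gap` seconds into utterance segments."""
    utterances = []
    for w in words:
        if utterances and w['timestamp'][0] - utterances[-1][-1]['timestamp'][1] <= sil_gap:
            utterances[-1].append(w)
        else:
            utterances.append([w])  # long enough silence: start a new utterance
    segments = []
    for utt in utterances:
        for part in (split_long(utt, max_len) if max_len else [utt]):
            segments.append({
                'text': ' '.join(w['text'] for w in part),
                'start': part[0]['timestamp'][0],
                'end': part[-1]['timestamp'][1],
            })
    return segments
```

In this sketch a single word longer than the target duration still forms its own part, so `max_len` is only guaranteed when individual words are shorter than that.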

Once these segments are saved to a JSON file, the provided HTML page can be used to view the segmentation in a browser. The page will look for `audio.mp3` and `segs.json`. For the page to work, it has to be served by an HTTP server (opening the file directly from disk won't work for security reasons). Furthermore, it is recommended to use a proper web server rather than something like the `http.server` module in Python, because this allows seeking in long audio files.

The simplest solution is to use Docker. Run the following command in your terminal:

```bash
docker run -it --rm -p 8080:80 --name web -v "${PAGE_PATH}":/usr/share/nginx/html nginx
```

Replace `${PAGE_PATH}` with the absolute path of the directory containing the HTML page, the audio file and the utterance JSON. This starts a server that you can access in your browser at [http://localhost:8080/viewer.html](http://localhost:8080/viewer.html).