# Explanation of the alignment procedure
### for aligning very long audio sequences to very long text sequences

This notebook describes the code for aligning long audio files to long text sequences using the Hugging Face Transformers toolkit.

More precisely, we have the following inputs:
* a long audio file (this can be anything from a few minutes, to a few hours, to a whole day of audio)
* a text that contains the transcripts of the audio file

On output we expect to get:
* word-level timings that map portions of the text to portions of the audio

Keeping in mind that:
* audio can contain speech and events that aren't in the transcription
* transcripts can contain text that isn't in the audio

The sources of these discrepancies can be:
* the transcript can be much larger than the speech contained in audio - in other words, we can have one large file with transcripts of many audio files and this is acceptable for this project - we only use what we need
* the audio doesn't have to be transcribed with 100% recall or precision - transcribers make mistakes, not everything gets transcribed - this is all fine and we take it into account
* the alignment process can produce errors - we have a procedure that allows for reviewing results and deciding on mitigation procedures after all the automation is complete


In [1]:
import json
from pathlib import Path

from src.align import align, fix_times, save_ali_to_textgrid, convert_ali_to_segments
from src.data_loaders import load_reco, load_audio, extract_audio
from src.matching import Matcher
from src.recognize import recognize

First we define a couple of paths to local resources:
* `audio_file` - the audio file containing the speech we wish to align to
* `text_file` - the file containing all the transcripts - needs to contain text only and ideally should be normalized
* `asr_model` - Hugging Face Hub name or path to the speech recognition model used to process the files

These files are created throughout the procedure:
* `reco_file`  - this file is the result of the initial speech recognition process
* `textgrid_file`  - this file contains the final alignment openable in Praat and similar programs
* `segs_file`  - this file contains the aligned segments useful for debugging and analysis

In [2]:
audio_file = Path('data/0XpZGGwDkN8.wav')
text_file = Path('data/201923-sjm-ppxxx-00010-02_norm.txt')
asr_model = 'asr_model'
reco_file = Path('data/reco.json')
textgrid_file = Path('data/out.TextGrid')
segs_file = Path('website/segs.json')

## Step 1 - Speech recognition

The first step is to get any kind of speech recognition output that we can then use to align to actual transcripts.

Since this process can be expensive, we will save the result immediately to a file. If the file exists, we don't have to run it again.

It's also a good idea to restart the notebook kernel after running this cell in order to release memory on the GPU. Hugging Face Transformers doesn't have a convenient method of releasing reserved resources, so terminating the process is the simplest way to achieve this.

In [3]:
if not reco_file.exists():
    reco = recognize(audio_file, asr_model, batch_size=4)
    with open(reco_file, 'w') as f:
        json.dump(reco, f)

03/19/2023 19:34:57 [INFO] - Loading data...
03/19/2023 19:34:57 [INFO] - Loaded 1 files!
03/19/2023 19:34:57 [INFO] - Obtained 75 segments!
03/19/2023 19:34:57 [INFO] - Loading ASR model...
03/19/2023 19:35:11 [INFO] - Starting recognition...


100%|██████████| 75/75 [00:46<00:00,  1.61it/s]


Here we re-load the above saved file. In the case of this demo, this will contain words belonging to only one recording, but theoretically you could recognize many recordings at once and the file would contain all of them.

In [4]:
words = load_reco(reco_file)

print('Number of words per recording:')
print({x: len(y) for x, y in words.items()})

Number of words per recording:
{'0XpZGGwDkN8': 1653}


Here we will first get the utterance id of the single file in question and then extract the first 100 words in the file. The idea of the procedure is to work in chunks to optimize the resources required for processing. We will first demonstrate the whole procedure on a single chunk and later show how to process the whole file.

In [5]:
utt_id = list(words.keys())[0]
words_chunk = words[utt_id][0:100]

print(f'Chunk len: {words_chunk[-1].end - words_chunk[0].start}s')
text = ' '.join([x.text for x in words_chunk])
print(f'Chunk text:\n{text}')

Chunk len: 361.44399999999996s
Chunk text:
wznawiam obrady i przystąpimy do stwierdzenia kworum proszę zatem o naciśnięcie jakiegokolwiek przycisku celu potwierdzenia obecności a posiedzeniu dziękuję o obecnej es obecnych jest czterystu trzydziestu dziewięciu posłów stwierdzam kworum informuję że senat przyjął bez poprawek ustawy o szczególnych rozwiązaniach wspierających realizację programów operacyjnych w związku z wystąpieniem kowid dziewiętnaście w dwa tysiące dwudziestym roku w związku z tym planowany punkt dwunasty porządku dziennego stały się bez przedmiotowy z wnioskiem formalnym pan poseł grzegożbrałn konfederacj proszę bardzo minutę wieczór pani marszałek wysoka izbo no ktoś to jednak musi powiedzieć że że moglibyście państwo robić to inaczej mogłaby pani marszałek teraz


## Step 2 - Matching reco text to transcription

We begin by creating a `Matcher` object. This object requires the transcript file on initialization and can later be used to match various audio file speech recognition results to it.

In [6]:
matcher = Matcher(text_file)

We begin by roughly matching the 100-word text above against the whole transcript. This is done by computing a histogram of the recognized word sequence and comparing it to the histogram of a sliding window of the same length moved across the whole transcript - akin to the bag-of-words (BOW) model. A list of sufficiently similar candidate locations is returned on output. The required similarity can be controlled with the optional `threshold` argument.

In [7]:
locs = matcher._initial_match(text)
print(f'Found {len(locs)} potential locations for the above text.')

Found 24 potential locations for the above text.
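The rough matching described above can be sketched as follows. This is only an illustration of the sliding-window histogram idea, not the code behind `Matcher._initial_match`: the function names, the overlap-based similarity measure and the default `threshold` value are all assumptions.

```python
from collections import Counter


def histogram_overlap(a: Counter, b: Counter) -> int:
    """Count the word occurrences shared by two histograms."""
    return sum((a & b).values())


def initial_match(query: list[str], corpus: list[str], threshold: float = 0.5) -> list[int]:
    """Return corpus offsets whose window shares at least `threshold` of the query words.

    Hypothetical helper: the similarity measure and threshold are assumptions.
    """
    n = len(query)
    if n == 0 or len(corpus) < n:
        return []
    query_hist = Counter(query)
    window_hist = Counter(corpus[:n])
    hits = []
    for start in range(len(corpus) - n + 1):
        if start > 0:
            # Slide the window by one word: drop the word that left, add the one that entered.
            window_hist[corpus[start - 1]] -= 1
            window_hist[corpus[start + n - 1]] += 1
        if histogram_overlap(query_hist, window_hist) / n >= threshold:
            hits.append(start)
    return hits
```

Because the histogram ignores word order, this check is cheap but permissive, which is why it typically returns several candidate locations that need a more precise comparison afterwards.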


All the potential candidates are then compared to the original text using the Levenshtein distance. This gives an exact score of the differences between the two texts. This method is much more accurate, but also considerably more expensive than the one used in the initial match above. On output, we get both the best location and the distance between the two texts at that location. A distance of 0 means the texts are identical. The higher the value, the worse the match.

In [8]:
min_i, min_d = matcher._find_min_diff(locs, text)
print(f'The location with minimum Levenshtein distance is {min_i} and that distance is {min_d}.')

The location with minimum Levenshtein distance is 22261 and that distance is 105.
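For readers unfamiliar with the Levenshtein distance, here is a minimal sketch of the classic dynamic-programming formulation together with a candidate-selection loop. It is an assumption that the candidates are compared window by window like this; `find_min_diff` below is a hypothetical stand-in, not the body of `Matcher._find_min_diff`. The function works on any sequences, so it can compare strings character by character or lists word by word.

```python
def levenshtein(a, b) -> int:
    """Edit distance between two sequences (strings or lists of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute x with y, or match
            ))
        prev = curr
    return prev[-1]


def find_min_diff(candidates, query, corpus):
    """Return (offset, distance) of the candidate window closest to the query.

    Assumes `candidates` is non-empty and holds offsets into `corpus`.
    """
    n = len(query)
    scored = [(levenshtein(query, corpus[i:i + n]), i) for i in candidates]
    dist, best = min(scored)
    return best, dist
```

The cost is proportional to the product of the two sequence lengths, which is why it pays off to run it only on the short list of candidates from the rough match.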


Now, the above method only gives us the starting location of the match, but given the nature of the speech recognition output and the transcript, the two texts don't have to match in length. That is why we look for identically matching words to align the texts and reject any insertions/deletions at the beginning and the end of the sequence. The result is a pair of tuples containing the beginning and end of the matching portion in the speech recognition output and in the transcription, respectively.

In [9]:
(hb, he), (rb, re) = matcher._find_matching_seq(text, min_i)

print(f'The best matching sequence is between {hb} and {he} in the recognized text '
      f'and {min_i + rb} and {min_i + re} in the reference corpus.')

The best matching sequence is between 0 and 100 in the recognized text and 22261 and 22357 in the reference corpus.
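One way to picture the trimming step is with the standard library's `difflib.SequenceMatcher`, which reports runs of identical items in two sequences. This is a sketch under the assumption that the span is bounded by the first and last identical word; the actual `_find_matching_seq` may use a different alignment algorithm, and `find_matching_span` is a hypothetical name.

```python
from difflib import SequenceMatcher


def find_matching_span(hyp: list[str], ref: list[str]):
    """Return ((hyp_begin, hyp_end), (ref_begin, ref_end)) bounded by identical words.

    Returns None when the two sequences share no word at all.
    """
    matcher = SequenceMatcher(a=hyp, b=ref, autojunk=False)
    # The last block returned by get_matching_blocks() is a zero-length sentinel.
    blocks = [b for b in matcher.get_matching_blocks() if b.size > 0]
    if not blocks:
        return None
    first, last = blocks[0], blocks[-1]
    return (first.a, last.a + last.size), (first.b, last.b + last.size)
```

Words before the first matching block and after the last one are treated as insertions or deletions and left out of the span, while mismatches inside the span are kept.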


With all of that, we can now match the transcription text to the ASR text, which also gives us the time location of the transcription chunk within the audio file:

In [10]:
hyp_text = ' '.join(text.split()[hb:he])
hyp_tb = words_chunk[hb].start
hyp_te = words_chunk[he - 1].end
ref_text = matcher.get_corpus_chunk(min_i + rb, min_i + re)

print(f'Position of the text in audio file {audio_file} is between seconds {hyp_tb} and {hyp_te}.')
print(f'Recognized text:\n{hyp_text}')
print(f'Reference text:\n{ref_text}')


Position of the text in audio file data/0XpZGGwDkN8.wav is between seconds 1009.192 and 1370.636.
Recognized text:
wznawiam obrady i przystąpimy do stwierdzenia kworum proszę zatem o naciśnięcie jakiegokolwiek przycisku celu potwierdzenia obecności a posiedzeniu dziękuję o obecnej es obecnych jest czterystu trzydziestu dziewięciu posłów stwierdzam kworum informuję że senat przyjął bez poprawek ustawy o szczególnych rozwiązaniach wspierających realizację programów operacyjnych w związku z wystąpieniem kowid dziewiętnaście w dwa tysiące dwudziestym roku w związku z tym planowany punkt dwunasty porządku dziennego stały się bez przedmiotowy z wnioskiem formalnym pan poseł grzegożbrałn konfederacj proszę bardzo minutę wieczór pani marszałek wysoka izbo no ktoś to jednak musi powiedzieć że że moglibyście państwo robić to inaczej mogłaby pani marszałek teraz
Reference text:
wznawiam obrady przystąpimy do stwierdzenia kworum proszę zatem o naciśnięcie jakiegokolwiek przycisku w celu potwierdze

We can then extract the audio within this segment. Now, this audio can also contain lots of silence which affects the performance of the alignment below. That is why we extract only the portions of the audio that match the words recognized by the ASR. This should contain all the audio that contains any speech within this segment.

We also get the bit-mask matching the extracted segments. This will allow us to recreate the times in the original file, before removing the silences.

In [11]:
audio = load_audio(audio_file)
seg, mask = extract_audio(audio['input_values'], words_chunk[hb:he], audio['samp_freq'])

print(f'The length of the extracted segment is {len(seg) / audio["samp_freq"]}s.')

The length of the extracted segment is 43.9800625s.
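The idea of keeping only the speech samples and remembering where they came from can be sketched like this. It is a simplified illustration using plain Python lists and `(start, end)` tuples in seconds; the real `extract_audio` most likely works on arrays and word objects, and `extract_speech` is a hypothetical name.

```python
def extract_speech(samples, word_times, samp_freq):
    """Keep only the samples that fall inside word intervals.

    Returns (segment, mask), where mask[i] is True if samples[i] was kept.
    `word_times` is assumed to be a list of (start, end) pairs in seconds.
    """
    mask = [False] * len(samples)
    for start, end in word_times:
        first = max(0, int(round(start * samp_freq)))
        last = min(len(samples), int(round(end * samp_freq)))
        for i in range(first, last):
            mask[i] = True
    segment = [s for s, keep in zip(samples, mask) if keep]
    return segment, mask
```

The mask has the same length as the original audio, so the position of every kept sample in the original file can be recovered later.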


## Step 3 - Re-alignment

The above text matching procedure only gave us the beginning and end of the whole segment. If we want to get a more accurate alignment we need to re-align all the words from the reference transcription to the audio. Fortunately, this procedure is quite robust given the limited number of words and length of audio extracted above.

It is worth noting, however, that any errors and inaccuracies in the transcript will cause unexpected behavior in the final output.

Note that this method accepts a whole sequence of audio/text pairs. Here we only have one, so we will put both in a single element list each.

In [12]:
ali = align(asr_model, [seg], [ref_text])

  0%|          | 0/1 [00:00<?, ?ex/s]



  0%|          | 0/3 [00:00<?, ?ba/s]

100%|██████████| 10/10 [00:00<00:00, 16.96it/s]


  0%|          | 0/1 [00:00<?, ?ex/s]

This is what the result looks like. It's a list of words, with a begin/end timestamp for each one. Note that the times provided are within the extracted segment only. To get the times that match the original file, we need to make some corrections.

In [13]:
ali

{'audio_0000': [{'text': 'wznawiam',
   'timestamp': [0.23999999463558197, 0.6800000071525574]},
  {'text': 'obrady', 'timestamp': [0.699999988079071, 1.4800000190734863]},
  {'text': 'przystąpimy',
   'timestamp': [1.6200000047683716, 2.180000066757202]},
  {'text': 'do', 'timestamp': [2.2200000286102295, 2.2799999713897705]},
  {'text': 'stwierdzenia',
   'timestamp': [2.319999933242798, 2.940000057220459]},
  {'text': 'kworum', 'timestamp': [3.0, 3.4000000953674316]},
  {'text': 'proszę', 'timestamp': [3.440000057220459, 3.700000047683716]},
  {'text': 'zatem', 'timestamp': [3.759999990463257, 4.059999942779541]},
  {'text': 'o', 'timestamp': [4.099999904632568, 4.159999847412109]},
  {'text': 'naciśnięcie', 'timestamp': [4.199999809265137, 4.78000020980835]},
  {'text': 'jakiegokolwiek',
   'timestamp': [4.800000190734863, 5.400000095367432]},
  {'text': 'przycisku', 'timestamp': [5.460000038146973, 6.28000020980835]},
  {'text': 'w', 'timestamp': [6.300000190734863, 6.340000152587

To correct the times, we will need the mask used for extracting the segment as well as sampling frequency (to be able to match the samples from the mask to the time in seconds).

In [14]:
ali_words = list(ali.values())[0]

ali_fixed = fix_times(ali_words, mask, audio['samp_freq'])
ali_fixed

[{'text': 'wznawiam', 'timestamp': [1009.1819375, 1009.621937512517]},
 {'text': 'obrady', 'timestamp': [1009.6419375, 1010.4219375309945]},
 {'text': 'przystąpimy', 'timestamp': [1010.562, 1011.1220000619888]},
 {'text': 'do', 'timestamp': [1011.162, 1011.2219999427796]},
 {'text': 'stwierdzenia', 'timestamp': [1011.2619375, 1011.8819376239777]},
 {'text': 'kworum', 'timestamp': [1011.9419375, 1012.3419375953674]},
 {'text': 'proszę', 'timestamp': [1012.382, 1012.6419999904632]},
 {'text': 'zatem', 'timestamp': [1012.7019375, 1013.0019374523163]},
 {'text': 'o', 'timestamp': [1013.0419375, 1013.1019374427796]},
 {'text': 'naciśnięcie', 'timestamp': [1013.1419375, 1013.7219379005433]},
 {'text': 'jakiegokolwiek', 'timestamp': [1013.742, 1014.3419999046325]},
 {'text': 'przycisku', 'timestamp': [1014.402, 1015.2220001716614]},
 {'text': 'w', 'timestamp': [1015.742, 1015.781999961853]},
 {'text': 'celu', 'timestamp': [1015.842, 1016.0219998283386]},
 {'text': 'potwierdzenia', 'timestamp'
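A minimal sketch of such a correction is shown below. In the output above, each corrected end time appears to equal the corrected start time plus the word's original duration, so the sketch maps only the start through the mask and keeps the duration unchanged. That behavior is inferred from the numbers, not from the source of `fix_times`, and `restore_times` is a hypothetical name.

```python
def restore_times(words, mask, samp_freq):
    """Map word timestamps from the silence-free segment back to the original audio.

    `mask[i]` is True if sample i of the original audio was kept in the segment.
    """
    # Original sample index of every sample that survived the extraction.
    kept = [i for i, keep in enumerate(mask) if keep]
    if not kept:
        raise ValueError('mask does not select any samples')
    fixed = []
    for word in words:
        start, end = word['timestamp']
        # Position of the word start within the extracted segment, clamped to its range.
        k = min(max(int(round(start * samp_freq)), 0), len(kept) - 1)
        new_start = kept[k] / samp_freq
        fixed.append({'text': word['text'],
                      'timestamp': [new_start, new_start + (end - start)]})
    return fixed
```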

## Putting it all together

Now let's repeat all of the above on the whole file. We have a single convenience method in the `Matcher` class that divides the file into chunks and computes the alignment of each chunk in sequence. The method requires the speech recognition word sequence, the audio file and speech recognition model (to do the alignment) as arguments.

You can also optionally change the chunk length and stride. The stride allows for an overlap between chunks. After the chunks are aligned, any overlapping words are removed. The words are first sorted and the word with lower time is kept, while others are removed.
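The chunking and overlap removal can be illustrated with the sketch below. Several things here are assumptions: the helper names, the default values, and the interpretation of the stride as the step between consecutive chunk starts (called `step` here to make that explicit; the actual parameter may instead denote the overlap length).

```python
def make_chunks(words, chunk_len=100, step=80):
    """Split a word list into windows of `chunk_len` words starting every `step` words."""
    if not 0 < step <= chunk_len:
        raise ValueError('step must be positive and not larger than chunk_len')
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_len])
        if start + chunk_len >= len(words):
            break  # the last window already reaches the end of the list
    return chunks


def merge_alignments(aligned_chunks):
    """Flatten aligned chunks, sort words by start time and drop overlapping repeats."""
    words = sorted((w for chunk in aligned_chunks for w in chunk),
                   key=lambda w: w['timestamp'][0])
    merged = []
    for w in words:
        if merged and w['timestamp'][0] < merged[-1]['timestamp'][1]:
            continue  # overlaps the previously kept (earlier) word
        merged.append(w)
    return merged
```

With `step` smaller than `chunk_len`, words near a chunk boundary are aligned twice, once in each neighboring chunk, and the merge step keeps only the earlier copy.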

In [15]:
ali_all = matcher.run(words[utt_id], audio_file, asr_model)

  0%|          | 0/9 [00:00<?, ?ex/s]



  0%|          | 0/41 [00:00<?, ?ba/s]

100%|██████████| 164/164 [00:09<00:00, 16.89it/s]


  0%|          | 0/9 [00:00<?, ?ex/s]

This will give us the alignment of the whole file.

In [16]:
print(f'Alignment created {len(ali_all)} aligned words.')
print(f'Sample of aligned words:\n{ali_all[:20]}')

Alignment created 1622 aligned words.
Sample of aligned words:
[{'text': 'wznawiam', 'timestamp': [1009.1819375, 1009.621937512517]}, {'text': 'obrady', 'timestamp': [1009.6419375, 1010.4219375309945]}, {'text': 'przystąpimy', 'timestamp': [1010.562, 1011.1220000619888]}, {'text': 'do', 'timestamp': [1011.162, 1011.2219999427796]}, {'text': 'stwierdzenia', 'timestamp': [1011.2619375, 1011.8819376239777]}, {'text': 'kworum', 'timestamp': [1011.9419375, 1012.3419375953674]}, {'text': 'proszę', 'timestamp': [1012.382, 1012.6419999904632]}, {'text': 'zatem', 'timestamp': [1012.7019375, 1013.0019374523163]}, {'text': 'o', 'timestamp': [1013.0419375, 1013.1019374427796]}, {'text': 'naciśnięcie', 'timestamp': [1013.1419375, 1013.7219379005433]}, {'text': 'jakiegokolwiek', 'timestamp': [1013.742, 1014.3419999046325]}, {'text': 'przycisku', 'timestamp': [1014.402, 1015.2220001716614]}, {'text': 'w', 'timestamp': [1015.742, 1015.781999961853]}, {'text': 'celu', 'timestamp': [1015.842, 1016.02199

We can now save the alignment to a TextGrid file. This is a popular format introduced by [Praat](https://www.praat.org), a program used for speech segmentation. It is supported by many other tools, for example [Elan](https://archive.mpi.nl/tla/elan) and [EMU-webApp](https://ips-lmu.github.io/EMU-webApp/). Note that these tools may not work well with very long files. Praat supports loading long audio files, but its interface is quite dated. EMU is more modern, but it runs in the browser and may crash when loading a large audio file.

In [17]:
save_ali_to_textgrid(textgrid_file, ali_all)
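To give an idea of what ends up in the file, here is a rough sketch that renders a list of timestamped words as a single-tier TextGrid in Praat's long text format. It is not the implementation of `save_ali_to_textgrid`; the function name and tier name are assumptions, and the words are assumed to be sorted and non-overlapping. Gaps between words are filled with empty intervals, because an interval tier has to cover the whole time range.

```python
def to_textgrid(words, tier_name='words'):
    """Render timestamped words as a one-tier Praat TextGrid (long text format)."""
    intervals, cursor = [], 0.0
    for w in words:
        start, end = w['timestamp']
        if start > cursor:
            intervals.append((cursor, start, ''))  # silence before this word
        intervals.append((start, end, w['text']))
        cursor = end
    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        '',
        'xmin = 0',
        f'xmax = {cursor}',
        'tiers? <exists>',
        'size = 1',
        'item []:',
        '    item [1]:',
        '        class = "IntervalTier"',
        f'        name = "{tier_name}"',
        '        xmin = 0',
        f'        xmax = {cursor}',
        f'        intervals: size = {len(intervals)}',
    ]
    for i, (start, end, text) in enumerate(intervals, start=1):
        escaped = text.replace('"', '""')  # Praat doubles quotes inside strings
        lines += [
            f'        intervals [{i}]:',
            f'            xmin = {start}',
            f'            xmax = {end}',
            f'            text = "{escaped}"',
        ]
    return '\n'.join(lines) + '\n'
```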

For actual debugging, we convert the word segmentation into a kind of utterance segmentation by combining words that are close to each other into one utterance. We use the optional `sil_gap` argument to control the length of silence needed to start a new utterance. In some cases you may also want to use the `max_len` argument to further split long utterances into smaller chunks - this is done in such a way that each sub-utterance is roughly the same length, but none is longer than `max_len`.

In [18]:
segs = convert_ali_to_segments(ali_all, words[utt_id])

with open(segs_file, 'w') as f:
    json.dump(segs, f)
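The grouping by silence can be sketched as follows. This simplified version only covers the `sil_gap` logic and ignores both `max_len` and the second argument of `convert_ali_to_segments`; the function name, the default gap and the output keys (`text`, `start`, `end`) are assumptions made for illustration.

```python
def words_to_segments(words, sil_gap=1.0):
    """Group timestamped words into utterances separated by at least `sil_gap` seconds."""
    groups = []
    for w in words:
        # Silence between the end of the previous word and the start of this one.
        if groups and w['timestamp'][0] - groups[-1][-1]['timestamp'][1] < sil_gap:
            groups[-1].append(w)
        else:
            groups.append([w])
    return [
        {
            'text': ' '.join(w['text'] for w in group),
            'start': group[0]['timestamp'][0],
            'end': group[-1]['timestamp'][1],
        }
        for group in groups
    ]
```

A smaller `sil_gap` produces more, shorter utterances, which are usually easier to review one at a time.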

Once these segments are saved to a JSON file, we can use the provided HTML page to view the segmentation in a browser. The page will look for `audio.mp3` and `segs.json`. For the page to work, it should be hosted by an HTTP server (opening the file directly from disk won't work for security reasons). Furthermore, it is recommended to use a proper web server, rather than something like the `http.server` module in Python, because this will allow seeking in long audio files.

The simplest solution for me is to use Docker. Simply run the following command in your terminal:

```bash
docker run -it --rm -p 8080:80 --name web -v ${PAGE_PATH}:/usr/share/nginx/html nginx
```

Replace `${PAGE_PATH}` with the absolute path of the directory containing the HTML page, the audio file and the utterance JSON. This starts a server that you can access in your browser at [http://localhost:8080/viewer.html](http://localhost:8080/viewer.html).