## Transcription pipeline
1) Utterance transcription: AWS, WhisperX
2) Human correction: RAs correct the utterance text, but not the timestamps
3) Forced alignment: Ensemble of (forced) alignment models
4) Merge: Sync the output with Praat & Datavyu for coding/analysis

# Step 0: Installations

MFA: https://montreal-forced-aligner.readthedocs.io/en/stable/installation.html

In [None]:
# It's been a while since I installed MFA, but I think this is what I did.
# Run these commands in your terminal (without the leading "!"). They create a conda environment called 'aligner'.
# Then come back to this notebook and select that environment as your kernel.

!conda create -n aligner python=3.11
!conda activate aligner
!conda install montreal-forced-aligner pytorch pytorch-cuda=12.1 -c conda-forge -c pytorch -c nvidia
!pip install ipykernel

WhisperX: https://github.com/m-bain/whisperX

In [None]:
# The normal way:
!pip install whisperx

# But that doesn't work for me, so I use this:
!TMPDIR=~/bigtmp pip install whisperx

In [None]:
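# If you have any caching issues during any of that, try this.
# Careful: the last command deletes everything in /tmp, including temp files that belong to other programs.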
!pip cache purge
!conda clean -a
!rm -rf /tmp/*

Other libraries:

In [2]:
!pip install dotenv lingua-language-detector textgrid


Collecting dotenv
  Downloading dotenv-0.9.9-py2.py3-none-any.whl.metadata (279 bytes)
Collecting lingua-language-detector
  Downloading lingua_language_detector-2.1.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (32 kB)
Collecting textgrid
  Downloading TextGrid-1.6.1.tar.gz (9.4 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting python-dotenv (from dotenv)
  Downloading python_dotenv-1.2.1-py3-none-any.whl.metadata (25 kB)
Downloading dotenv-0.9.9-py2.py3-none-any.whl (1.9 kB)
Downloading lingua_language_detector-2.1.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (96.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 96.2/96.2 MB 11.2 MB/s 0:00:08
Downloading python_dotenv-1.2.1-py3-none-any.whl (21 kB)
Building wheels for collected packages: textgr

In [2]:
from dotenv import load_dotenv
import os

load_dotenv()
HF_TOKEN = os.getenv("HF_KEY")

In [3]:
# If you want to use a GPU for faster transcriptions (e.g. if you have Colab kernel connected)
import torch

if torch.cuda.is_available():
    print("CUDA is available! Using GPU.")
    device = str(torch.device("cuda"))
    # You can also get more information about the GPU
    print(f"Number of GPUs: {torch.cuda.device_count()}")
    for i in range(torch.cuda.device_count()):
        print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
else:
    print("CUDA is not available. Falling back to CPU.")
    device = "cpu"

print(f"Current device: {device}")

CUDA is available! Using GPU.
Number of GPUs: 1
GPU 0: NVIDIA GeForce RTX 5060 Ti
Current device: cuda


# Step 1: Transcription

Option 1 (AWS): This is how we've traditionally done it in the lab

In [None]:
# TODO: Katie needs to add me to the AWS account so that I can know how it works
# 1) Transcribe audio
# 2) Import the audio file here

Option 2 (WhisperX): AWS provides more precise utterance-level transcriptions, but here's a pipeline that I developed for WhisperX anyway.

In [None]:
# Change me!
audio_file = '../input/DINA1_PS1_IDS1.wav'

In [None]:
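# Had weird problems with my GPU. Workaround: force torch.load to always use weights_only=False.
# Only do this for checkpoints you trust, since weights_only=False lets torch.load unpickle arbitrary objects.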
import torch
_original_load = torch.load
def _patched_load(*args, **kwargs):
    kwargs["weights_only"] = False
    return _original_load(*args, **kwargs)
torch.load = _patched_load

# WhisperX
import whisperx
import gc

# Initialize parameters
batch_size = 1 # Smaller batch size because I don't wanna blast my GPU's memory. Also produces longer utterances
compute_type = "float32"
model_name = "medium"

# 1. Transcribe with original whisper (batched). Took 13 mins on CPU
model = whisperx.load_model(whisper_arch=model_name, device=device, compute_type=compute_type)
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size, language='es') # NOTE: For some reason on code-switching audio it's best to set it to the less dominant language

# # Optional: Save to disk
# import json
# import os
# basename = os.path.basename(audio_file).split('.')[0]
# with open(f"../output/{basename}_WhisperX1.json", "w") as f:
#     json.dump(result, f, indent=4)

print(result["segments"]) # before alignment

In [7]:
# 2. Align whisper output. Took 4 mins on CPU
raw_segments = result["segments"]
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

# # Optional: Save to disk
# import json
# import os
# basename = os.path.basename(audio_file).split('.')[0]
# with open(f"../output/{basename}_WhisperX2.json", "w") as f:
#     json.dump(result, f, indent=4)

print(result["segments"]) # after alignment

[{'start': 4.283, 'end': 6.886, 'text': ' Okay, so grab it to myself.', 'words': [{'word': 'Okay,', 'start': np.float64(4.283), 'end': np.float64(5.344), 'score': np.float64(0.283)}, {'word': 'so', 'start': np.float64(5.364), 'end': np.float64(5.464), 'score': np.float64(0.297)}, {'word': 'grab', 'start': np.float64(5.485), 'end': np.float64(5.665), 'score': np.float64(0.421)}, {'word': 'it', 'start': np.float64(6.486), 'end': np.float64(6.606), 'score': np.float64(0.266)}, {'word': 'to', 'start': np.float64(6.626), 'end': np.float64(6.666), 'score': np.float64(0.576)}, {'word': 'myself.', 'start': np.float64(6.726), 'end': np.float64(6.886), 'score': np.float64(0.231)}]}, {'start': 6.906, 'end': 12.894, 'text': 'Oh my gosh, I feel like a vlogger.', 'words': [{'word': 'Oh', 'start': np.float64(6.906), 'end': np.float64(6.946), 'score': np.float64(0.004)}, {'word': 'my', 'start': np.float64(6.966), 'end': np.float64(7.006), 'score': np.float64(0.015)}, {'word': 'gosh,', 'start': np.floa

In [8]:
# 3. Assign speaker labels. Took 52 mins on CPU
diarize_model = whisperx.diarize.DiarizationPipeline(use_auth_token=HF_TOKEN, device=device)

# Run the diarizer once. If you know the min/max number of speakers, use the commented-out call instead
# (calling both runs diarization twice and throws the second result away).
diarize_segments = diarize_model(audio)
# diarize_segments = diarize_model(audio, min_speakers=1, max_speakers=4)
result = whisperx.assign_word_speakers(diarize_segments, result)

# # Optional: Save to disk
# import json
# import os
# basename = os.path.basename(audio_file).split('.')[0]
# with open(f"../output/{basename}_WhisperX3.json", "w") as f:
#     json.dump(result, f, indent=4)


print(diarize_segments)
print(result["segments"]) # segments are now assigned speaker IDs

2026-01-20 00:25:24 - whisperx.diarize - INFO - Loading diarization model: pyannote/speaker-diarization-3.1


  torchaudio.list_audio_backends()
  std = sequences.std(dim=-1, correction=1)


                               segment label     speaker        start  \
0    [ 00:00:04.283 -->  00:00:05.700]     A  SPEAKER_01     4.283469   
1    [ 00:00:06.224 -->  00:00:08.316]     B  SPEAKER_01     6.224094   
2    [ 00:00:12.940 -->  00:00:15.488]     C  SPEAKER_01    12.940344   
3    [ 00:00:20.804 -->  00:00:21.057]     D  SPEAKER_01    20.804094   
4    [ 00:00:23.200 -->  00:00:23.976]     E  SPEAKER_01    23.200344   
..                                 ...   ...         ...          ...   
365  [ 00:21:06.584 -->  00:21:06.719]    NB  SPEAKER_02  1266.584094   
366  [ 00:21:06.719 -->  00:21:12.017]    NC  SPEAKER_01  1266.719094   
367  [ 00:21:08.254 -->  00:21:08.676]    ND  SPEAKER_00  1268.254719   
368  [ 00:21:10.110 -->  00:21:10.769]    NE  SPEAKER_00  1270.110969   
369  [ 00:21:13.739 -->  00:21:17.434]    NF  SPEAKER_01  1273.739094   

             end  intersection        union  
0       5.700969  -1271.147031  1272.867531  
1       8.316594  -1268.531406 

In [14]:
# Optional: Save to disk
import json
import os
basename = os.path.basename(audio_file).split('.')[0]
with open(f"../output/{basename}_WhisperXFinal.json", "w") as f:
    json.dump(result, f, indent=4)

# Step 2: Human correction

In [None]:
# TODO: Make a UI that allows the user to listen to the audio & make textual corrections (no need to edit time)
# Front end html file?

In [None]:
# TODO: Make any adjustments (automatic typo detection / tokenization steps) before feeding transcript file into Forced Alignment
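
Until a proper UI exists, one low-tech option is a CSV round trip: export one row per utterance, let RAs edit only a text column in a spreadsheet, then copy the edits back without touching the timestamps. This is only a sketch under that assumption; `export_for_correction`, `apply_corrections`, and the `corrected_text` column are hypothetical names, not part of WhisperX or the lab's existing workflow.

```python
import csv

def export_for_correction(segments, fh):
    """Write one row per utterance; RAs should edit only 'corrected_text'."""
    writer = csv.writer(fh)
    writer.writerow(["index", "start", "end", "text", "corrected_text"])
    for i, seg in enumerate(segments):
        text = seg["text"].strip()
        writer.writerow([i, seg["start"], seg["end"], text, text])

def apply_corrections(segments, fh):
    """Copy corrected text back onto the segments and return how many changed.

    Timestamps are deliberately left alone; Step 3 re-aligns them.
    """
    changed = 0
    for row in csv.DictReader(fh):
        seg = segments[int(row["index"])]
        corrected = row["corrected_text"].strip()
        if corrected != seg["text"].strip():
            seg["text"] = corrected
            changed += 1
    return changed
```

With the notebook's variables, this would be called as `export_for_correction(result["segments"], f)` on a file opened with `newline=""`, and later `apply_corrections(result["segments"], f)` on the edited file. The word-level entries from WhisperX would then be stale for any changed utterance, which is acceptable here because MFA regenerates them.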

# Step 3: Forced Alignment

ASR timestamps are usually not that precise; forced alignment can help.

In [72]:
# NOTE Development: just doing this to avoid having to run step 1 all over again
import json
import os
audio_file = '../input/DINA1_PS1_IDS1.wav'
basename = os.path.basename(audio_file).split('.')[0]
with open(f"../output/{basename}_WhisperXFinal.json", "r") as f:
    result = json.load(f)

In [None]:
# 1. Split utterances by language

# From another script I made: praat/detect_language.py
import os
import sys
import pandas as pd
from lingua import Language, LanguageDetectorBuilder
def detect_language(detector, text):
    # Detect the language of the given text through punctuation or through a model
    if any(char in text for char in "¡¿áéíóúñüÁÉÍÓÚÑÜ"):
        return "Spanish"
    result = detector.detect_language_of(text)
    if result == Language.ENGLISH:
        return "English"
    elif result == Language.SPANISH:
        return "Spanish"
    return "Unknown"


# Initialize the language detector for English and Spanish
languages = [Language.ENGLISH, Language.SPANISH]
detector = LanguageDetectorBuilder.from_languages(*languages).build()

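# Iterate and detect (segments with empty text get "Unknown" so that every segment has a 'language' key)
for segment in result['segments']:
    if segment['text'].strip():
        lang = detect_language(detector, segment['text'].lower())
    else:
        lang = "Unknown"
    segment['language'] = lang
interval_groups = result['segments']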

# # Optional: Save to disk
# import json
# import os
# basename = os.path.basename(audio_file).split('.')[0]
# with open(f"../output/{basename}_WhisperX3.json", "w") as f:
#     json.dump(result, f, indent=4)

print(result["segments"]) # segments are now assigned languages


[{'start': 4.283, 'end': 6.886, 'text': ' Okay, so grab it to myself.', 'words': [{'word': 'Okay,', 'start': 4.283, 'end': 5.344, 'score': 0.283, 'speaker': 'SPEAKER_01'}, {'word': 'so', 'start': 5.364, 'end': 5.464, 'score': 0.297, 'speaker': 'SPEAKER_01'}, {'word': 'grab', 'start': 5.485, 'end': 5.665, 'score': 0.421, 'speaker': 'SPEAKER_01'}, {'word': 'it', 'start': 6.486, 'end': 6.606, 'score': 0.266, 'speaker': 'SPEAKER_01'}, {'word': 'to', 'start': 6.626, 'end': 6.666, 'score': 0.576, 'speaker': 'SPEAKER_01'}, {'word': 'myself.', 'start': 6.726, 'end': 6.886, 'score': 0.231, 'speaker': 'SPEAKER_01'}], 'speaker': 'SPEAKER_01', 'language': 'English'}, {'start': 6.906, 'end': 12.894, 'text': 'Oh my gosh, I feel like a vlogger.', 'words': [{'word': 'Oh', 'start': 6.906, 'end': 6.946, 'score': 0.004, 'speaker': 'SPEAKER_01'}, {'word': 'my', 'start': 6.966, 'end': 7.006, 'score': 0.015, 'speaker': 'SPEAKER_01'}, {'word': 'gosh,', 'start': 7.026, 'end': 7.787, 'score': 0.451, 'speaker':
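
A quick sanity check on the language labels is to total the speech time per language before splitting the corpus for MFA. The helper below is a small sketch; `seconds_per_language` is a hypothetical name, and it assumes each segment is a dict with `start`, `end`, and (optionally) `language`, as in the output above.

```python
from collections import defaultdict

def seconds_per_language(segments):
    """Sum utterance durations (in seconds) for each detected language."""
    totals = defaultdict(float)
    for seg in segments:
        # Segments that never received a label are counted as "Unknown"
        totals[seg.get("language", "Unknown")] += seg["end"] - seg["start"]
    return {lang: round(total, 3) for lang, total in totals.items()}
```

Calling `seconds_per_language(result["segments"])` gives a rough picture of how the recording divides between English and Spanish, and a large "Unknown" total would suggest the detector needs attention before alignment.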

In [None]:
# 2. Setting up for MFA

# NOTE: Before running MFA, make sure its cache is cleared by deleting ~/Documents/MFA
!rm -rf ~/Documents/MFA
print("Cleared cache")

# # Install alignment models
!mfa model download --ignore_cache acoustic english_us_arpa
!mfa model download --ignore_cache dictionary english_us_arpa
!mfa model download --ignore_cache acoustic spanish_mfa
!mfa model download --ignore_cache dictionary spanish_mfa
print("Models downloaded")

# Create a textgrid with tier 0 being utterances, tier 1 being language
from textgrid import TextGrid, IntervalTier
tg = TextGrid()
utterances_tier = IntervalTier(name="WhisperX - Utterances", minTime=tg.minTime, maxTime=tg.maxTime)
languages_tier = IntervalTier(name="Lingua - Language", minTime=tg.minTime, maxTime=tg.maxTime)
for segment in interval_groups:
  utterances_tier.add(segment['start'], segment['end'], segment['text'])
  languages_tier.add(segment['start'], segment['end'], segment['language'])
tg.append(utterances_tier)
tg.append(languages_tier)
print("Created utterances and language textgrid")

# Split that TextGrid into an English and Spanish one separately
languages = set([interval.mark for interval in languages_tier.intervals]) - set([''])
languages2tier = {}
for language in languages:
  tier_name = f"{language} Utterances"
  new_tier = IntervalTier(name=tier_name, minTime=tg.minTime, maxTime=tg.maxTime)
  for utterance_interval, language_interval in zip(utterances_tier.intervals, languages_tier.intervals):
    text = utterance_interval.mark if language_interval.mark == language else None
    if text:
      new_tier.add(utterance_interval.minTime, utterance_interval.maxTime, text)
  tg.append(new_tier)
  languages2tier[tier_name] = new_tier
print("Finished splitting languages into separate tiers")

# Output these files
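english_path = '../chengyi-mfa/input/english/'
spanish_path = '../chengyi-mfa/input/spanish/'
os.makedirs(english_path, exist_ok=True) # Make sure the corpus folders exist before writing into them
os.makedirs(spanish_path, exist_ok=True)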
for key in languages2tier.keys(): # Assuming you just have English & Spanish
  new_tg = TextGrid()
  new_tg.append(languages2tier[key])
  if key == "English Utterances":
    new_tg.write(os.path.join(english_path, os.path.basename(audio_file).replace('.wav', '.TextGrid')))
  elif key == "Spanish Utterances":
    new_tg.write(os.path.join(spanish_path, os.path.basename(audio_file).replace('.wav', '.TextGrid')))
print("Output files created")

# Make copies of the audio file into those directories
!cp "$audio_file" "$english_path"
!cp "$audio_file" "$spanish_path"
print("Copied audio files, script done")

Cleared cache
Models downloaded
Created utterances and language textgrid
Finished splitting languages into separate tiers
Output files created
Copied audio files, script done


In [28]:
# 2.5 Validate to ensure correct format (each file took me around 3 minutes)
!mfa validate "$english_path" english_us_arpa english_us_arpa
!mfa validate "$spanish_path" spanish_mfa spanish_mfa

 INFO     Setting up corpus information...
 INFO     Found 1 speaker across 1 file, average number of utterances per
          speaker: 25.0
 INFO     Initializing multiprocessing jobs...
          MFA will only use 1 jobs. Use the --single_speaker flag if you would
          like to split utterances across jobs regardless of their speaker.
 INFO     Text already normalized.
 INFO     Features already generated.
 INFO     Creating corpus split...
  96% ━━━━━━━━━━━━━━━━━━━━━━━

In [29]:
# 3. Run MFA on each of those language utterances separately
output_path = "../chengyi-mfa/output/"
!mfa align "$english_path" english_us_arpa english_us_arpa "$output_path""english/"
!mfa align "$spanish_path" spanish_mfa spanish_mfa "$output_path""spanish/"

 INFO     Setting up corpus information...
 INFO     Loading corpus from source files...
   1% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/100  [ 0:00:01 < -:--:-- , ? it/s ]
 INFO     Found 1 speaker across 1 file, average number of utterances per
          speaker: 39.0
 INFO     Initializing multiprocessing jobs...
          MFA will only use 1 jobs. Use the --single_speaker flag if you would
          like to split utterances across jobs regardless of their speaker.
 INFO     Normalizing text...

In [74]:
from textgrid import TextGrid, IntervalTier
import pandas as pd

# 1. Read textgrids
english_tg = TextGrid()
spanish_tg = TextGrid()
english_tg.read("../chengyi-mfa/output/english/DINA1_PS1_IDS1.TextGrid")
spanish_tg.read("../chengyi-mfa/output/spanish/DINA1_PS1_IDS1.TextGrid")

# 2. Grab intervals
english_intervals = [interval for interval in english_tg[0]]
spanish_intervals = [interval for interval in spanish_tg[0]]

# 3. Convert these intervals into a dataframe for easier manipulation
intervals_df = []
for interval in english_intervals + spanish_intervals:
  intervals_df.append({
    "start": interval.minTime,
    "end": interval.maxTime,
    "text": interval.mark,
    "type": "word (mfa)",
  })

# 3.5 Add WhisperX utterances too for order
for segment in result['segments']:
  intervals_df.append({
    "start": segment["start"],
    "end": segment["end"],
    "text": segment['text'],
    "type": "utterance (whisper)"
  })

# 3.75 Create dataframe
intervals_df = pd.DataFrame(intervals_df)
intervals_df = intervals_df.sort_values("start")
intervals_df = intervals_df.reset_index()
intervals_df = intervals_df.drop(columns="index")
intervals_df = intervals_df[intervals_df["text"] != ""]
intervals_df.head(10)

    start     end  text                                type
2   4.283   6.886  Okay, so grab it to myself.         utterance (whisper)
3   4.353   4.593  okay                                word (mfa)
4   4.593   4.813  so                                  word (mfa)
5   4.813   5.693  grab                                word (mfa)
7   6.193   6.223  it                                  word (mfa)
8   6.223   6.283  to                                  word (mfa)
9   6.283   6.833  myself                              word (mfa)
11  6.906  12.894  Oh my gosh, I feel like a vlogger.  utterance (whisper)
12  6.986   7.326  oh                                  word (mfa)
14  7.356   7.606  my                                  word (mfa)


In [75]:
# 3.875 Put words under the utterance_intervals
utterance_intervals = []
current_utt = None
for _, row in intervals_df.iterrows():
    if row["type"] == "utterance (whisper)":
        # Start a new utterance group
        current_utt = {
            "start": row["start"],
            "end": row["end"],
            "text": row["text"],
            "words": []
        }
        utterance_intervals.append(current_utt)
    elif row["type"] == "word (mfa)" and current_utt is not None:
        # Attach word to the most recent utterance
        current_utt["words"].append({
            "start": row["start"],
            "end": row["end"],
            "word": row["text"]
        })

# 3.9375 Get MFA utterance times
for utterance_interval in utterance_intervals:
    if len(utterance_interval['words']) != 0:
        utterance_interval['start'] = utterance_interval['words'][0]['start']
        utterance_interval['end'] = utterance_interval['words'][-1]['end']

In [76]:
# 4. Adjust WhisperX with MFA output

# Check
print(len(utterance_intervals), len(result['segments']), "These should be equal")

# TODO: The current approach loses information like OOVs, confidence scores, and speaker
# OOVs: (e.g. Whisper transcribes "vlogger" but MFA's model doesn't know how to transcribe that)
# Confidence scores: How confident the Whisper model is in transcribing that word
# Speaker: WhisperX's diarization model

# Store old results
import copy
result['whisperx_segments'] = copy.deepcopy(result['segments'])
result['whisperx_word_segments'] = copy.deepcopy(result['word_segments'])

# Anyway, we can still naively replace Whisper's trash word-level transcriptions with MFA's
for utterance_interval, result_interval in zip(utterance_intervals, result['segments']):
  result_interval['start'] = utterance_interval['start']
  result_interval['end'] = utterance_interval['end']
  result_interval['words'] = utterance_interval['words']
print("Done changing utterance segments")
result["word_segments"] = []
for result_interval in result['segments']:
  for word_interval in result_interval['words']:
    result["word_segments"].append(word_interval)
print("Done changing word segments")

307 307 These should be equal
Done changing utterance segments
Done changing word segments


`'segments'` and `'word_segments'` now have a list of segments that were first transcribed by WhisperX and then adjusted with MFA.
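
To see how much MFA actually moved the utterance boundaries, the stored WhisperX segments can be compared against the adjusted ones. This is a minimal sketch; `boundary_shifts` and `mean_abs_shift` are hypothetical helper names, and the pairing assumes both lists are still the same length and in the same order.

```python
def boundary_shifts(old_segments, new_segments):
    """Return (start_shift, end_shift) in seconds for each paired utterance.

    Positive values mean the new boundary is later than the old one.
    """
    shifts = []
    for old, new in zip(old_segments, new_segments):
        shifts.append((round(new["start"] - old["start"], 3),
                       round(new["end"] - old["end"], 3)))
    return shifts

def mean_abs_shift(shifts):
    """Average absolute boundary movement across all starts and ends."""
    values = [abs(v) for pair in shifts for v in pair]
    return sum(values) / len(values) if values else 0.0
```

For example, `boundary_shifts(result['whisperx_segments'], result['segments'])` on the first utterance shown earlier (WhisperX 4.283 to 6.886, MFA words spanning 4.353 to 6.833) would report a start that moved later and an end that moved earlier.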

In [68]:
# Visualize format
print(result.keys())
for key in result.keys():
  print(result[key][0].keys())

dict_keys(['segments', 'word_segments', 'raw_segments', 'whisperx_segments', 'whisperx_word_segments'])
dict_keys(['start', 'end', 'text', 'words', 'speaker'])
dict_keys(['start', 'end', 'word'])
dict_keys(['text', 'start', 'end'])
dict_keys(['start', 'end', 'text', 'words', 'speaker'])
dict_keys(['word', 'start', 'end', 'score', 'speaker'])


In [78]:
import string

def remove_punctuation(text: str) -> str:
    punctuation = set(string.punctuation)
    punctuation.update(["¿", "¡"])
    table = str.maketrans("", "", "".join(punctuation))
    return text.translate(table)


# Check for overlaps (utterances)
# Walk with an explicit index: deleting inside a for/enumerate loop would skip the segment after each deletion
index = 1
while index < len(result['segments']):
  previous_segment = result['segments'][index-1]
  current_segment = result['segments'][index]
  overlap = previous_segment['end'] > current_segment['start']
  if overlap:
    last_word = previous_segment['words'][-1] if previous_segment['words'] else None
    current_tokens = current_segment['text'].split()

    # Case: In this particular file, it seems that we don't lose any information if we just remove the current one
    if last_word and current_tokens and remove_punctuation(last_word['word'].lower()) == remove_punctuation(current_tokens[0].lower()):
      del result['segments'][index]
      continue # Re-check the segment that just moved into this index

    # Case: OOV item (e.g. MFA detects <unk> but whisper detects "Chekayo")
    if last_word and last_word['word'] == "<unk>" and len(current_tokens) == 1:
      previous_segment['end'] = current_segment['end']
      last_word['end'] = current_segment['end']
      last_word['word'] = remove_punctuation(current_tokens[0].lower())
      del result['segments'][index]
      continue

    # Debug
    print("Overlap detected")
    print("Previous:", previous_segment)
    print("Current:", current_segment)
    print()

  index += 1

In [79]:
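# Check for overlaps (words)
# Rebuild the flat word list first so it reflects any utterances merged or removed above
result['word_segments'] = [word for segment in result['segments'] for word in segment['words']]
prev = 0
for index, word in enumerate(result['word_segments']):
  overlap = prev > word['start']
  if overlap:
    print("Overlap detected")
    print(result['word_segments'][index-1])
    print(result['word_segments'][index])
    print()
  prev = word['end']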

# Step 4: Export

Praat: Usually you code and transcribe on the same file

In [None]:
from textgrid import TextGrid, IntervalTier

# Initialize
output_tg_path = "../output/WhisperX_MFA_Demo.TextGrid"
final_tg = TextGrid()
languages_tier = IntervalTier(name="Languages", minTime=final_tg.minTime, maxTime=final_tg.maxTime)
whisperx_utterances = IntervalTier(name="WhisperX - Utterances", minTime=final_tg.minTime, maxTime=final_tg.maxTime)
whisperx_words = IntervalTier(name="WhisperX - Words", minTime=final_tg.minTime, maxTime=final_tg.maxTime)
mfa_utterances = IntervalTier(name="MFA - Utterances", minTime=final_tg.minTime, maxTime=final_tg.maxTime)
mfa_words = IntervalTier(name="MFA - Words", minTime=final_tg.minTime, maxTime=final_tg.maxTime)

# Append onto tiers
nudge = 0.001 # For overlaps
for mfa_segment, whisperx_segment in zip(result['segments'], result['whisperx_segments']):

  try:
    whisperx_utterances.add(whisperx_segment['start'], whisperx_segment['end'], whisperx_segment['text'])
    mfa_utterances.add(mfa_segment['start'], mfa_segment['end'], mfa_segment['text'])
    languages_tier.add(mfa_segment['start'], mfa_segment['end'], mfa_segment['language'])
  except ValueError:
    whisperx_utterances.add(whisperx_segment['start'] + nudge, whisperx_segment['end'], whisperx_segment['text'])
    mfa_utterances.add(mfa_segment['start'] + nudge, mfa_segment['end'], mfa_segment['text'])
    languages_tier.add(mfa_segment['start'] + nudge, mfa_segment['end'], mfa_segment['language'])

for word in result['word_segments']:
  try:
    mfa_words.add(word['start'], word['end'], word['word'])
  except ValueError:
    mfa_words.add(word['start'] + nudge, word['end'], word['word'])

for word in result['whisperx_word_segments']:
  try:
    whisperx_words.add(word['start'], word['end'], word['word'])
  except ValueError:
    whisperx_words.add(word['start'] + nudge, word['end'], word['word'])

# Write
final_tg.append(mfa_utterances)
final_tg.append(whisperx_utterances)
final_tg.append(languages_tier)
final_tg.append(mfa_words)
final_tg.append(whisperx_words)
final_tg.write(output_tg_path)
print("Done")
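
The repeated try/except pattern above could be folded into one helper that handles a single tier at a time. This is a sketch rather than a drop-in replacement: `add_with_nudge` is a hypothetical name, and `tier` is assumed to be any object with an `add(minTime, maxTime, mark)` method that raises `ValueError` on a conflicting interval, which is how `textgrid.IntervalTier` is used in this notebook.

```python
def add_with_nudge(tier, start, end, text, nudge=0.001):
    """Add an interval; if the tier rejects it, retry once with the start nudged forward."""
    try:
        tier.add(start, end, text)
    except ValueError:
        # Most often the start collides with the previous interval's end
        tier.add(start + nudge, end, text)
```

Usage would look like `add_with_nudge(mfa_words, word['start'], word['end'], word['word'])`. Handling each tier separately may also avoid a subtle problem with grouping three `add` calls in one `try`: if the first tier accepts the interval and a later one rejects it, the fallback re-adds a nearly identical interval to the first tier.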


Datavyu: Sometimes you code with the video instead of the audio file, meaning they're not necessarily aligned

In [None]:
# TODO: Align 2 audio files
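
One common way to line up two recordings of the same session is to cross-correlate them and take the lag with the highest score as the offset. The sketch below is a brute-force, pure-Python version meant only to illustrate the idea; `estimate_offset` is a hypothetical name, and it assumes both inputs are short lists of non-negative amplitude values (for example, downsampled envelopes) at the same sample rate. Full-length recordings would need an FFT-based correlation from a numerical library instead.

```python
def estimate_offset(reference, other, max_lag):
    """Return the lag (in samples) at which `other` best matches `reference`.

    A positive lag means the shared content occurs `lag` samples later
    in `other` than in `reference`.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, value in enumerate(reference):
            j = i + lag
            if 0 <= j < len(other):
                score += value * other[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def lag_to_seconds(lag, sample_rate):
    """Convert a lag in samples to seconds."""
    return lag / sample_rate
```

The resulting offset in seconds could then be added to every `start` and `end` before exporting for Datavyu, assuming the two recordings run at the same speed and differ only by a constant shift.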

# Future directions

Action items:
* Bulk transcriptions.
* Keep the extra information that the MFA merge currently drops (OOVs, confidence scores, speaker labels).
* An interface for checking the timestamps & adjusting them (Praat).
* A version of Whisper fine-tuned on code-switched data may work better, since it could handle code-switching within utterances (for example, it would have the vocabulary to do so).
* Multiple passes of the WhisperX -> MFA cycle might produce better results.
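
For the bulk transcription item, a first step could be a helper that lists which recordings still need processing, so an interrupted batch can resume where it stopped. This is a sketch under assumptions: `pending_audio_files` is a hypothetical name, and it relies on the `{basename}_WhisperXFinal.json` naming used in Step 1.

```python
from pathlib import Path

def pending_audio_files(input_dir, output_dir, suffix="_WhisperXFinal.json"):
    """List .wav files in input_dir that have no matching JSON in output_dir yet."""
    output_dir = Path(output_dir)
    pending = []
    for wav in sorted(Path(input_dir).glob("*.wav")):
        # wav.stem drops only the final extension; Step 1 splits on the first dot,
        # so the two agree as long as file names contain a single dot
        if not (output_dir / f"{wav.stem}{suffix}").exists():
            pending.append(wav)
    return pending

# Intended use (not executed here):
# for audio_file in pending_audio_files("../input", "../output"):
#     ...  # run the Step 1 cells with this audio_file
```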