Exploring forced alignment models: [Tradition or Innovation: A Comparison of Modern ASR Methods for Forced Alignment](https://arxiv.org/html/2406.19363v1)
- Montreal Forced Aligner (MFA)
- The Massively Multilingual Speech Model (MMS)
- WhisperX

Other models people are using online:
- Kaldi GMM-HMM
- Gentle Forced Aligner
- Aeneas
- Speech Phonetization Alignment and Syllabification (SPPAS)
- CTC Segmentation
- NVIDIA NeMo Forced Aligner

New experimental models:
- Wav2TextGrid

Older models:
- Triphone alignment Kaldi
- Prosodylab-Aligner
- Penn Phonetics Lab Forced Aligner

# Setup:

I'd recommend running the conda commands in a terminal rather than here, since `conda activate` inside a `!` cell does not carry over to later cells.

In [None]:
# Create a new environment (the name must match in both commands)
!conda create -y -n forced_alignment python=3.11
!conda activate forced_alignment

In [None]:
# Install MFA (distributed through the conda-forge channel)
!conda install -y -c conda-forge montreal-forced-aligner
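
Because each `!` command runs in its own subshell, it's worth checking which interpreter the kernel is actually using and whether the command-line tools are visible from it. Here's a minimal sketch using only the standard library. The helper name `check_environment` is my own, and the default command list is just an assumption about what this notebook needs.

```python
import shutil
import sys


def check_environment(commands=("mfa", "ffmpeg")):
    """Report which interpreter the kernel uses and where each command lives."""
    report = {"python": sys.executable}
    for command in commands:
        # shutil.which returns None when the command is not on PATH
        report[command] = shutil.which(command)
    return report


for name, location in check_environment().items():
    print(f"{name}: {location or 'NOT FOUND'}")
```

If `mfa` shows up as NOT FOUND, the kernel is probably not running inside the environment where MFA was installed.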

In [7]:
# Install python packages
# python-dotenv provides `from dotenv import load_dotenv`, and lingua-language-detector provides `from lingua import ...`.
# The separate PyPI packages named `dotenv` and `lingua` are not needed.
!pip install ipykernel pandas python-dotenv lingua-language-detector textgrid torch transformers ipywidgets "datasets[audio]" torchaudio librosa soundfile

In [10]:
# Install MMS dependencies
!pip install torch transformers "datasets[audio]" torchaudio
!apt install -y ffmpeg

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
ffmpeg is already the newest version (7:4.4.2-0ubuntu0.22.04.1).
0 upgraded, 0 newly installed, 0 to remove and 41 not upgraded.


In [None]:
# Install gentle
# TODO: I don't even know... I can't get it to work.

INFO:root:gentle 0.10.1
INFO:root:listening at 0.0.0.0:8765

INFO:root:SERVE 8765, 0.0.0.0, 1
INFO:root:about to listen
INFO:root:listening
^C


In [None]:
# I'm going to keep the output of whatever transcription service we use in a dictionary called result,
# where result['segments'] is a list of utterances, each with the keys 'text', 'start', and 'end'.
import json
import os
import pandas as pd

# CHANGE ME!
audio_file = '../input/DINA1_PS1_IDS1.wav'
basename = os.path.splitext(os.path.basename(audio_file))[0]
transcription_file = f"../input/{basename}_AWS.txt"

# Assumes you have the transcription (from AWS, WhisperX, or hand-coded) and the corresponding audio file

# # Read utterance file (WhisperX.json)
# with open(transcription_file, "r") as f:
#     result = json.load(f)

# Read utterance file (exported from Praat as a UTF-16, tab-separated table)
df = pd.read_csv(transcription_file, encoding='utf-16', delimiter='\t')
result = {'segments': []}
for _, row in df.iterrows():
  result['segments'].append({
    'text': row['text'],
    'start': row['tmin'],
    'end': row['tmax'],
    'words': []  # filled in later with word-level alignments
  })
result

{'segments': [{'text': 'Ok, so clip it to myself. Oh my gosh. I feel like a vlogger',
   'start': 3.95,
   'end': 8.449,
   'words': []},
  {'text': 'sil', 'start': 8.449, 'end': 12.77, 'words': []},
  {'text': 'and then just keep it to the side. It should be fine. Ok, cool.',
   'start': 12.77,
   'end': 15.75,
   'words': []},
  {'text': 'sil', 'start': 15.75, 'end': 20.709, 'words': []},
  {'text': 'no.', 'start': 20.709, 'end': 21.03, 'words': []},
  {'text': 'sil', 'start': 21.03, 'end': 22.979, 'words': []},
  {'text': 'Qué es eso?', 'start': 22.979, 'end': 24.139, 'words': []},
  {'text': 'sil', 'start': 24.139, 'end': 25.889, 'words': []},
  {'text': 'te tiene nuevos Toys. Oh, oh!',
   'start': 25.889,
   'end': 28.889,
   'words': []},
  {'text': "Ok. So I'll go ahead for a 20 minute timer after that then um I'll just pop up again. Ok. Ok, cool. Got it.",
   'start': 28.889,
   'end': 37.849,
   'words': []},
  {'text': 'sil', 'start': 37.849, 'end': 40.74, 'words': []},
  {'t
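
Before handing `result['segments']` to an aligner, a quick sanity check on its shape can save some debugging later. This is a minimal sketch with made-up toy data. `check_segments` is a hypothetical helper of mine, not part of any library used here.

```python
def check_segments(segments):
    """Return a list of problems found in a list of {'text', 'start', 'end'} dicts."""
    problems = []
    previous_end = None
    for i, segment in enumerate(segments):
        missing = {"text", "start", "end"} - segment.keys()
        if missing:
            problems.append(f"segment {i}: missing keys {sorted(missing)}")
            continue
        if segment["end"] <= segment["start"]:
            problems.append(f"segment {i}: end is not after start")
        if previous_end is not None and segment["start"] < previous_end:
            problems.append(f"segment {i}: overlaps the previous segment")
        previous_end = segment["end"]
    return problems


# Toy data in the same shape as result['segments'] (values are made up)
example = [
    {"text": "hello", "start": 0.0, "end": 1.0, "words": []},
    {"text": "sil", "start": 1.0, "end": 2.0, "words": []},
    {"text": "hola", "start": 1.5, "end": 3.0, "words": []},
]
print(check_segments(example))
```

On the real data this would be called as `check_segments(result['segments'])`; an empty list means nothing suspicious was found.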

# MFA

In this particular snippet, I'm setting up to use MFA for an English-Spanish audio file.

In [None]:
# 1. Split utterances by language

# From another script I made: praat/detect_language.py
import os
import sys
import pandas as pd
from lingua import Language, LanguageDetectorBuilder
def detect_language(detector, text):
    # Detect the language of the given text through punctuation or through a model
    if any(char in text for char in "¡¿áéíóúñüÁÉÍÓÚÑÜ"):
        return "Spanish"
    result = detector.detect_language_of(text)
    if result == Language.ENGLISH:
        return "English"
    elif result == Language.SPANISH:
        return "Spanish"
    return "Unknown"


# Initialize the language detector for English and Spanish
languages = [Language.ENGLISH, Language.SPANISH]
detector = LanguageDetectorBuilder.from_languages(*languages).build()

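# Iterate and detect
for segment in result['segments']:
    # Default to "Unknown" so every segment has a 'language' key for the TextGrid step
    segment['language'] = "Unknown"
    if segment['text']:
        segment['language'] = detect_language(detector, segment['text'].lower())
interval_groups = result['segments']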

# # Optional: Save to disk
# import json
# import os
# basename = os.path.basename(audio_file).split('.')[0]
# with open(f"../output/{basename}_WhisperX3.json", "w") as f:
#     json.dump(result, f, indent=4)

print(result["segments"]) # segments are now assigned languages
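
Once every segment has a language, a per-language tally is a cheap way to spot mislabeled utterances before splitting the corpus. This is just a sketch on toy data: `summarize_languages` is my own helper name, and it assumes each segment carries `start`, `end`, and (optionally) `language`.

```python
from collections import defaultdict


def summarize_languages(segments):
    """Return {language: (segment_count, total_seconds)}."""
    counts = defaultdict(int)
    seconds = defaultdict(float)
    for segment in segments:
        # Segments without a label are counted as "Unknown"
        language = segment.get("language", "Unknown")
        counts[language] += 1
        seconds[language] += segment["end"] - segment["start"]
    return {lang: (counts[lang], round(seconds[lang], 3)) for lang in counts}


# Toy segments (values are made up)
toy_segments = [
    {"language": "English", "start": 0.0, "end": 1.0},
    {"language": "Spanish", "start": 1.0, "end": 2.5},
    {"language": "English", "start": 2.5, "end": 3.0},
]
print(summarize_languages(toy_segments))
```

On the real data the call would be `summarize_languages(result['segments'])`.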


In [None]:
# 2. Setting up for MFA

# NOTE: Before running MFA, clear its cache by deleting ~/Documents/MFA
!rm -rf ~/Documents/MFA
print("Cleared cache")

# Install alignment models
!mfa model download --ignore_cache acoustic english_us_arpa
!mfa model download --ignore_cache dictionary english_us_arpa
!mfa model download --ignore_cache acoustic spanish_mfa
!mfa model download --ignore_cache dictionary spanish_mfa
print("Models downloaded")

# Create a textgrid with tier 0 being utterances, tier 1 being language
from textgrid import TextGrid, IntervalTier
tg = TextGrid()
utterances_tier = IntervalTier(name="WhisperX - Utterances", minTime=tg.minTime, maxTime=tg.maxTime)
languages_tier = IntervalTier(name="Lingua - Language", minTime=tg.minTime, maxTime=tg.maxTime)
for segment in interval_groups:
  utterances_tier.add(segment['start'], segment['end'], segment['text'])
  languages_tier.add(segment['start'], segment['end'], segment['language'])
tg.append(utterances_tier)
tg.append(languages_tier)
print("Created utterances and language textgrid")

# Split that TextGrid into an English and Spanish one separately
languages = set([interval.mark for interval in languages_tier.intervals]) - set([''])
languages2tier = {}
for language in languages:
  tier_name = f"{language} Utterances"
  new_tier = IntervalTier(name=tier_name, minTime=tg.minTime, maxTime=tg.maxTime)
  for utterance_interval, language_interval in zip(utterances_tier.intervals, languages_tier.intervals):
    text = utterance_interval.mark if language_interval.mark == language else None
    if text:
      new_tier.add(utterance_interval.minTime, utterance_interval.maxTime, text)
  tg.append(new_tier)
  languages2tier[tier_name] = new_tier
print("Finished splitting languages into separate tiers")

# Output these files (one corpus directory per language)
english_path = '../chengyi-mfa/input/english/'
spanish_path = '../chengyi-mfa/input/spanish/'
tier2path = {"English Utterances": english_path, "Spanish Utterances": spanish_path}
for path in tier2path.values():
  os.makedirs(path, exist_ok=True)  # tg.write fails if the directory is missing
textgrid_name = os.path.basename(audio_file).replace('.wav', '.TextGrid')
for key, tier in languages2tier.items(): # Assuming you just have English & Spanish
  if key not in tier2path:
    continue  # any other tier (e.g. an unknown language) is not written out
  new_tg = TextGrid()
  new_tg.append(tier)
  new_tg.write(os.path.join(tier2path[key], textgrid_name))
print("Output files created")

# Make copies of the audio file into those directories
!cp "$audio_file" "$english_path"
!cp "$audio_file" "$spanish_path"
print("Copied audio files, script done")

Cleared cache
Models downloaded
Created utterances and language textgrid
Finished splitting languages into separate tiers
Output files created
Copied audio files, script done
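
MFA pairs each audio file with the transcript that shares its basename, so each corpus directory should hold a `.wav` and a `.TextGrid` with matching names. Here's a small sketch to check that, demonstrated on a throwaway directory. `unpaired_files` is a hypothetical helper, and the demo file names are made up.

```python
import tempfile
from pathlib import Path


def unpaired_files(corpus_dir):
    """Return basenames that lack either a .wav or a .TextGrid in corpus_dir."""
    corpus = Path(corpus_dir)
    wavs = {p.stem for p in corpus.glob("*.wav")}
    grids = {p.stem for p in corpus.glob("*.TextGrid")}
    # Symmetric difference: names present in one set but not the other
    return sorted(wavs ^ grids)


# Demo on a temporary directory with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.wav", "a.TextGrid", "b.wav"):
        (Path(tmp) / name).touch()
    print(unpaired_files(tmp))
```

For the directories in this notebook, the calls would be `unpaired_files(english_path)` and `unpaired_files(spanish_path)`; an empty list means every file has its partner.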


In [None]:
# 2.5 Validate to ensure correct form (each file took me around 3 minutes)
!mfa validate "$english_path" english_us_arpa english_us_arpa
!mfa validate "$spanish_path" spanish_mfa spanish_mfa

INFO     Setting up corpus information...
INFO     Found 1 speaker across 1 file, average number of utterances per speaker: 25.0
INFO     Initializing multiprocessing jobs...
         MFA will only use 1 jobs. Use the --single_speaker flag if you would like to split utterances across jobs regardless of their speaker.
INFO     Text already normalized.
INFO     Features already generated.
INFO     Creating corpus split...

In [None]:
# 3. Run MFA on each of those language utterances separately
output_path = "../chengyi-mfa/output/"
!mfa align "$english_path" english_us_arpa english_us_arpa "$output_path""english/"
!mfa align "$spanish_path" spanish_mfa spanish_mfa "$output_path""spanish/"

INFO     Setting up corpus information...
INFO     Loading corpus from source files...
INFO     Found 1 speaker across 1 file, average number of utterances per speaker: 39.0
INFO     Initializing multiprocessing jobs...
         MFA will only use 1 jobs. Use the --single_speaker flag if you would like to split utterances across jobs regardless of their speaker.
INFO     Normalizing text...

In [None]:
from textgrid import TextGrid, IntervalTier
import pandas as pd

# 1. Read textgrids
english_tg = TextGrid()
spanish_tg = TextGrid()
english_tg.read("../chengyi-mfa/output/english/DINA1_PS1_IDS1.TextGrid")
spanish_tg.read("../chengyi-mfa/output/spanish/DINA1_PS1_IDS1.TextGrid")

# 2. Grab intervals
english_intervals = [interval for interval in english_tg[0]]
spanish_intervals = [interval for interval in spanish_tg[0]]

# 3. Convert these intervals into a dataframe for easier manipulation
intervals_df = []
for interval in english_intervals + spanish_intervals:
  intervals_df.append({
    "start": interval.minTime,
    "end": interval.maxTime,
    "text": interval.mark,
    "type": "word (mfa)",
  })

# 3.5 Add WhisperX utterances too for order
for segment in result['segments']:
  intervals_df.append({
    "start": segment["start"],
    "end": segment["end"],
    "text": segment['text'],
    "type": "utterance (whisper)"
  })

# 3.75 Create dataframe
intervals_df = pd.DataFrame(intervals_df)
intervals_df = intervals_df.sort_values("start")
intervals_df = intervals_df.reset_index(drop=True)
intervals_df = intervals_df[intervals_df["text"] != ""]
intervals_df.head(10)

    start     end  text                                type
2   4.283   6.886  Okay, so grab it to myself.         utterance (whisper)
3   4.353   4.593  okay                                word (mfa)
4   4.593   4.813  so                                  word (mfa)
5   4.813   5.693  grab                                word (mfa)
7   6.193   6.223  it                                  word (mfa)
8   6.223   6.283  to                                  word (mfa)
9   6.283   6.833  myself                              word (mfa)
11  6.906  12.894  Oh my gosh, I feel like a vlogger.  utterance (whisper)
12  6.986   7.326  oh                                  word (mfa)
14  7.356   7.606  my                                  word (mfa)
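
The `words` list in each segment is still empty at this point. One way to fill it is to attach every aligned word to the utterance whose time span contains the word's start. This is only a sketch: `attach_words` is a hypothetical helper, and it assumes words are dicts with `start`, `end`, and `text`. The toy values are taken from the first rows of the table above.

```python
def attach_words(utterances, words):
    """Fill each utterance's 'words' list with the words that start inside it."""
    ordered = sorted(words, key=lambda w: w["start"])
    for utterance in utterances:
        utterance["words"] = [
            w for w in ordered
            if utterance["start"] <= w["start"] < utterance["end"]
        ]
    return utterances


# Toy data based on the first rows of the table above
toy_utterances = [
    {"text": "Okay, so grab it to myself.", "start": 4.283, "end": 6.886, "words": []},
]
toy_words = [
    {"text": "so", "start": 4.593, "end": 4.813},
    {"text": "okay", "start": 4.353, "end": 4.593},
    {"text": "oh", "start": 6.986, "end": 7.326},
]
attach_words(toy_utterances, toy_words)
print([w["text"] for w in toy_utterances[0]["words"]])
```

Matching on the word's start time is a design choice; a word that straddles two utterances goes to the one where it begins.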


# MMS (WIP)

In [1]:
# Change me!
audio_file = 'DINA1_PS1_IDS1.wav'

In [2]:
# Log in to huggingface
from dotenv import load_dotenv
import os

load_dotenv()
HF_TOKEN = os.getenv("HF_KEY")

In [2]:
from transformers import Wav2Vec2ForCTC, AutoProcessor

# Download the model
model_id = "facebook/mms-1b-all"
processor = AutoProcessor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id, ignore_mismatched_sizes=True)
import torch
model.to('cuda' if torch.cuda.is_available() else 'cpu')  # fall back to CPU when no GPU is available

Wav2Vec2ForCTC(
  (wav2vec2): Wav2Vec2Model(
    (feature_extractor): Wav2Vec2FeatureEncoder(
      (conv_layers): ModuleList(
        (0): Wav2Vec2LayerNormConvLayer(
          (conv): Conv1d(1, 512, kernel_size=(10,), stride=(5,))
          (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (activation): GELUActivation()
        )
        (1-4): 4 x Wav2Vec2LayerNormConvLayer(
          (conv): Conv1d(512, 512, kernel_size=(3,), stride=(2,))
          (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (activation): GELUActivation()
        )
        (5-6): 2 x Wav2Vec2LayerNormConvLayer(
          (conv): Conv1d(512, 512, kernel_size=(2,), stride=(2,))
          (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (activation): GELUActivation()
        )
      )
    )
    (feature_projection): Wav2Vec2FeatureProjection(
      (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (projec

In [4]:
from datasets import Dataset, Audio
import pandas as pd
import librosa

# Using librosa.load as a workaround for the 'libtorchcodec' error.
# sr=16000 automatically handles resampling during the load process.
target_sr = 16000
wf, sr = librosa.load(audio_file, sr=target_sr)

# wf is returned as a 1D numpy array
data = {
  "waveform" : wf,
  "audio_file" : audio_file,
  "sample_rate" : sr,
}
data

{'waveform': array([ 3.0144734e-11, -1.9064900e-10, -2.7262337e-10, ...,
         2.7643139e-02,  2.7752437e-02,  0.0000000e+00], dtype=float32),
 'audio_file': 'DINA1_PS1_IDS1.wav',
 'sample_rate': 16000}

In [None]:
import torch

# Transcribe
inputs = processor(data['waveform'], sampling_rate=data['sample_rate'], return_tensors="pt")
inputs = inputs.to(model.device)  # keep the inputs on the same device as the model
with torch.no_grad():
    outputs = model(**inputs).logits
ids = torch.argmax(outputs, dim=-1)[0]
transcription = processor.decode(ids)
print(transcription)

# Output:
# q rpqp sass  v pner ssot gtt pnr ptrsst q sr pqdnrqu saara   vpqtra vpi s o s tqsvgr svasaroaquq ao sabs   a gar o st rq sia pstaqpqaqsragss a rsata rsa a sa oer a satrrttra tbpqsos ien qu sto so s  par inv aaana omits frsaso ossrasossq papq  v rpqqqsa pa  v a poner aq   patp qen  patssa i s tenirun arritos como sto no ari pao sirtga a lars llavsqudaro v prner las baasts e arr o   a pqugrpt r spas  pio qs tasaass o tres en t assqspaqraqgo  rparg s aatvapaemb  rri r  a rti qtr qustarnos par jgarqgrqaro usts ontr biqqtrq p  on t man  t av  ga t  gar con st va  ser msica vsias astspra pis pml  r   q squn t vasast ass otrr contropaat qaar paps   n s u  aqu teneos uchs sasa garpoquios tt banas irstaa cico qu dnd sts qu porqu no fuists trabaarqustnavvrtrrqqs pqtp cb tab q  craq

Yeah, this transcription is unusable for this recording. Let's not use this model again...
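
For reference, the argmax-then-decode step in the cell above is greedy CTC decoding: take the most likely token per frame, collapse consecutive repeats, then drop the blank token. Here's a minimal standard-library sketch of that collapse. The four-symbol vocabulary and the blank id of 0 are made up for illustration and are not the MMS vocabulary.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated frame predictions, then drop blanks."""
    # groupby merges runs of identical consecutive ids into one
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(id_to_char[i] for i in collapsed if i != blank_id)


# Hypothetical vocabulary; id 0 plays the role of the CTC blank
vocab = {0: "<pad>", 1: "h", 2: "i", 3: " "}
frames = [0, 1, 1, 0, 0, 2, 2, 2, 0, 3, 1, 0, 2]
print(ctc_greedy_decode(frames, vocab))  # prints "hi hi"
```

The blank is what lets a doubled letter survive: `[1, 0, 1]` decodes to two separate `h` characters, while `[1, 1]` collapses to one.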

# NFA

In [None]:
result # Ensure this is initialized (see set up)

{'segments': [{'text': 'Ok, so clip it to myself. Oh my gosh. I feel like a vlogger',
   'start': 3.95,
   'end': 8.449,
   'words': []},
  {'text': 'sil', 'start': 8.449, 'end': 12.77, 'words': []},
  {'text': 'and then just keep it to the side. It should be fine. Ok, cool.',
   'start': 12.77,
   'end': 15.75,
   'words': []},
  {'text': 'sil', 'start': 15.75, 'end': 20.709, 'words': []},
  {'text': 'no.', 'start': 20.709, 'end': 21.03, 'words': []},
  {'text': 'sil', 'start': 21.03, 'end': 22.979, 'words': []},
  {'text': 'Qué es eso?', 'start': 22.979, 'end': 24.139, 'words': []},
  {'text': 'sil', 'start': 24.139, 'end': 25.889, 'words': []},
  {'text': 'te tiene nuevos Toys. Oh, oh!',
   'start': 25.889,
   'end': 28.889,
   'words': []},
  {'text': "Ok. So I'll go ahead for a 20 minute timer after that then um I'll just pop up again. Ok. Ok, cool. Got it.",
   'start': 28.889,
   'end': 37.849,
   'words': []},
  {'text': 'sil', 'start': 37.849, 'end': 40.74, 'words': []},
  {'t

The following code needs to run on a GPU, so I've just copied and pasted my code cells here without actually running them in this notebook.

In [None]:
# Clone NeMo and install it from the same repository and branch
NEMO_DIR_PATH = "NeMo"
BRANCH = 'main'
!git clone -b $BRANCH https://github.com/NVIDIA-NeMo/NeMo.git $NEMO_DIR_PATH
!python -m pip install "git+https://github.com/NVIDIA-NeMo/NeMo.git@$BRANCH#egg=nemo_toolkit[all]"


In [None]:
# Combine all utterances into 1 giant text. Future direction: Would it be better to use smaller snippets?
text = ""
for segment in result['segments']:
  if segment['text'] != 'sil':
    text += segment['text'] + ' '
text

In [None]:
# Create a manifest (audio_file, transcript)
import json
import os

# Make a working directory (no error if it already exists)
WORK_DIR = "WORK_DIR"
os.makedirs(WORK_DIR, exist_ok=True)

# Create manifest file
manifest_filepath = f"{WORK_DIR}/manifest.json"
manifest_data = {
    "audio_filepath": f"{WORK_DIR}/DINA1_PS1_IDS1.wav",
    "text": text
}
with open(manifest_filepath, 'w') as f:
  line = json.dumps(manifest_data, ensure_ascii=False)
  f.write(line + "\n")
print(line)
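
The manifest is a JSON-lines file: one JSON object per line, each with `audio_filepath` and `text`. Reading it back is an easy way to confirm it was written correctly before launching the aligner. This is a sketch of such a check with a made-up entry; `read_manifest` is a hypothetical helper, not a NeMo function.

```python
import json
import tempfile
from pathlib import Path


def read_manifest(path):
    """Parse a JSON-lines manifest and check each entry has the expected keys."""
    entries = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            entry = json.loads(line)
            missing = {"audio_filepath", "text"} - entry.keys()
            if missing:
                raise ValueError(f"line {line_number}: missing {sorted(missing)}")
            entries.append(entry)
    return entries


# Demo with a made-up entry written the same way as in the cell above
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "manifest.json"
    demo_entry = {"audio_filepath": "clip.wav", "text": "Qué es eso?"}
    demo_path.write_text(json.dumps(demo_entry, ensure_ascii=False) + "\n", encoding="utf-8")
    print(read_manifest(demo_path))
```

On the real file the call would be `read_manifest(manifest_filepath)`.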

In [None]:
# Initialize audio (make sure it's mono and 16k)
import librosa
import numpy as np
import soundfile as sf

TARGET_SR = 16000
audio_path = f'{WORK_DIR}/DINA1_PS1_IDS1.wav'
audio, sr = sf.read(audio_path)

# If stereo convert to mono
if audio.ndim == 2:
    audio = np.mean(audio, axis=1)

# Downsample to 16K (uses librosa, which the MMS section also relies on)
if sr != TARGET_SR:
    audio = librosa.resample(audio, orig_sr=sr, target_sr=TARGET_SR)
    sr = TARGET_SR

sf.write(audio_path, audio, sr)

In [None]:
# Call NeMo for forced alignment
!python $NEMO_DIR_PATH/tools/nemo_forced_aligner/align.py \
  pretrained_name="stt_en_fastconformer_hybrid_large_pc" \
  manifest_filepath=$manifest_filepath \
  output_dir=$WORK_DIR/nfa_output/ \
  additional_segment_grouping_separator='[".","?","!","..."]' \
  ass_file_config.vertical_alignment="bottom" \
  ass_file_config.text_already_spoken_rgb=[66,245,212] \
  ass_file_config.text_being_spoken_rgb=[242,222,44] \
  ass_file_config.text_not_yet_spoken_rgb=[223,242,239]

In [None]:
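# View
!head $WORK_DIR/nfa_output/ctm/*/*.ctm

# Download the CTM files as txt files
import shutil
from google.colab import files

# Map each CTM output to the txt name read in the cells below
# (nfa_tokens.txt is a placeholder name; nothing below reads it)
ctm_to_txt = {
    # 'WORK_DIR/nfa_output/ass/tokens/DINA1_PS1_IDS1.ass',
    # 'WORK_DIR/nfa_output/ass/words/DINA1_PS1_IDS1.ass',
    'WORK_DIR/nfa_output/ctm/segments/DINA1_PS1_IDS1.ctm': 'nfa_utterances.txt',
    'WORK_DIR/nfa_output/ctm/tokens/DINA1_PS1_IDS1.ctm': 'nfa_tokens.txt',
    'WORK_DIR/nfa_output/ctm/words/DINA1_PS1_IDS1.ctm': 'nfa_words.txt',
}

# Copy each CTM file under its own txt name, then download that copy
for file_path, txt_name in ctm_to_txt.items():
  shutil.copy(file_path, txt_name)
  files.download(txt_name)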

Now these can be run in VSCode

In [None]:
import pandas as pd

df = pd.read_csv('../output/nfa_utterances.txt', delimiter=' ', header=None, usecols=[0, 1, 2, 3,4])
df=df.rename({
  0:'filename',
  1:'id',
  2:'start',
  3:'duration',
  4:'transcription'
}, axis=1)
df['transcription'].astype('string')
print(df.head(5))
print(df.info())

         filename  id  start  duration  \
0  DINA1_PS1_IDS1   1   4.16      1.76   
1  DINA1_PS1_IDS1   1   5.92      0.72   
2  DINA1_PS1_IDS1   1   6.64      7.28   
3  DINA1_PS1_IDS1   1  13.92      0.88   
4  DINA1_PS1_IDS1   1  14.88      6.88   

                                       transcription  
0  Ok,<space>so<space>clip<space>it<space>to<spac...  
1                            Oh<space>my<space>gosh.  
2  I<space>feel<space>like<space>a<space>vlogger<...  
3               It<space>should<space>be<space>fine.  
4                                    Ok,<space>cool.  
<class 'pandas.DataFrame'>
RangeIndex: 323 entries, 0 to 322
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   filename       323 non-null    str    
 1   id             323 non-null    int64  
 2   start          323 non-null    float64
 3   duration       323 non-null    float64
 4   transcription  323 non-null    str    
dtypes: float64(
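
The segment-level text comes back with a literal `<space>` between words, as the output above shows. Here's a small sketch of parsing one line by hand and restoring the spaces. It keeps only the first five whitespace-separated fields, mirroring `usecols=[0, 1, 2, 3, 4]`, and reuses the column names I chose above; `parse_ctm_line` itself is a hypothetical helper.

```python
def parse_ctm_line(line):
    """Split one CTM line into a dict, keeping only the first five fields."""
    fields = line.split()
    filename, channel, start, duration, token = fields[:5]
    return {
        "filename": filename,
        "id": int(channel),
        "start": float(start),
        "duration": float(duration),
        # Spaces inside a segment appear as the literal "<space>"
        "transcription": token.replace("<space>", " "),
    }


# Example line rebuilt from the second row of the output above
row = parse_ctm_line("DINA1_PS1_IDS1 1 5.92 0.72 Oh<space>my<space>gosh.")
print(row["transcription"])  # prints "Oh my gosh."
```

With pandas, the same cleanup could be applied to the whole column using `df['transcription'].str.replace('<space>', ' ')`.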

In [None]:
# Convert into Praat
from textgrid import TextGrid, IntervalTier
import pandas as pd

last_offset = -1
tg = TextGrid()
utterances_tier = IntervalTier(name="NFA - Utterances", minTime=tg.minTime, maxTime=tg.maxTime)
for index,segment in df.iterrows():
  curr_offset = segment['start'] + segment['duration']
  if last_offset >= segment['start']:
    segment['start'] += 0.01
  utterances_tier.add(segment['start'], curr_offset, segment['transcription'])
  last_offset = curr_offset
tg.append(utterances_tier)
print("Created utterances textgrid")

Created utterances textgrid
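
The `+= 0.01` nudge above is there because `start + duration` can land a hair past the next interval's start (floating-point rounding is one way that happens), and the tier rejects overlapping intervals. An alternative sketch is to clamp each start to the previous end instead of shifting by a fixed amount. `remove_overlaps` is a hypothetical helper, and the tuples below are made-up toy data.

```python
def remove_overlaps(intervals):
    """Clamp each (start, end, text) so it begins no earlier than the previous end."""
    cleaned = []
    previous_end = 0.0
    for start, end, text in sorted(intervals):
        start = max(start, previous_end)
        if end <= start:
            continue  # fully covered by the previous interval, so drop it
        cleaned.append((start, end, text))
        previous_end = end
    return cleaned


# Toy intervals: "b" overlaps "a" slightly, "c" sits entirely inside "b"
toy_intervals = [(0.0, 1.0, "a"), (0.9, 1.5, "b"), (1.2, 1.4, "c")]
print(remove_overlaps(toy_intervals))
```

Clamping handles overlaps larger than 0.01 seconds, which the fixed nudge would not, at the cost of dropping intervals that end up empty.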


That's just the utterances. Let's see how the word-level alignments compare.

In [None]:
import pandas as pd

df = pd.read_csv('../output/nfa_words.txt', delimiter=' ', header=None, usecols=[0, 1, 2, 3,4])
df=df.rename({
  0:'filename',
  1:'id',
  2:'start',
  3:'duration',
  4:'transcription'
}, axis=1)
df['transcription'].astype('string')
print(df.head(5))
print(df.info())

         filename  id  start  duration transcription
0  DINA1_PS1_IDS1   1   4.16      0.48           Ok,
1  DINA1_PS1_IDS1   1   4.64      0.08            so
2  DINA1_PS1_IDS1   1   4.88      0.24          clip
3  DINA1_PS1_IDS1   1   5.12      0.16            it
4  DINA1_PS1_IDS1   1   5.28      0.08            to
<class 'pandas.DataFrame'>
RangeIndex: 1327 entries, 0 to 1326
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   filename       1327 non-null   str    
 1   id             1327 non-null   int64  
 2   start          1327 non-null   float64
 3   duration       1327 non-null   float64
 4   transcription  1327 non-null   str    
dtypes: float64(2), int64(1), str(2)
memory usage: 75.0 KB
None


In [None]:
# Convert into Praat
from textgrid import TextGrid, IntervalTier
import pandas as pd

last_offset = -1
words_tier = IntervalTier(name="NFA - Words", minTime=tg.minTime, maxTime=tg.maxTime)
for index,segment in df.iterrows():
  curr_offset = segment['start'] + segment['duration']
  if last_offset >= segment['start']:
    segment['start'] += 0.01
  words_tier.add(segment['start'], curr_offset, segment['transcription'])
  last_offset = curr_offset
tg.append(words_tier)
print("Created words textgrid")

Created words textgrid


In [None]:
# Output these files
output_path = "../output/nemo_transcription.TextGrid"
tg.write(output_path)
print("Output file created")

Output file created
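
To put a number on how two aligners differ, one rough option is to pair their word lists by position and measure the gap between start times wherever the words match. This is only a sketch under strong assumptions: both lists come from the same transcript in the same order, and words are dicts with `text` and `start`. `boundary_differences` is a hypothetical helper and the values are made up.

```python
def boundary_differences(words_a, words_b):
    """Pair words by position and return |start_a - start_b| where the text matches."""
    def normalize(text):
        # Ignore case and punctuation so "Ok," matches "ok"
        return "".join(ch for ch in text.lower() if ch.isalnum())

    differences = []
    for a, b in zip(words_a, words_b):
        if normalize(a["text"]) == normalize(b["text"]):
            differences.append(abs(a["start"] - b["start"]))
    return differences


# Toy word lists from two hypothetical aligners (values are made up)
aligner_a = [{"text": "so", "start": 4.5}, {"text": "clip", "start": 5.0}]
aligner_b = [{"text": "So,", "start": 4.75}, {"text": "grab", "start": 5.0}]
print(boundary_differences(aligner_a, aligner_b))  # prints [0.25]
```

Pairing by position drifts as soon as the two transcripts differ in word count, so for real comparisons a proper sequence alignment of the word lists would be safer.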


# Gentle (WIP)

In [None]:
result # Make sure this is initialized (see Setup)

{'segments': [{'text': 'Ok, so clip it to myself. Oh my gosh. I feel like a vlogger',
   'start': 3.95,
   'end': 8.449,
   'words': []},
  {'text': 'sil', 'start': 8.449, 'end': 12.77, 'words': []},
  {'text': 'and then just keep it to the side. It should be fine. Ok, cool.',
   'start': 12.77,
   'end': 15.75,
   'words': []},
  {'text': 'sil', 'start': 15.75, 'end': 20.709, 'words': []},
  {'text': 'no.', 'start': 20.709, 'end': 21.03, 'words': []},
  {'text': 'sil', 'start': 21.03, 'end': 22.979, 'words': []},
  {'text': 'Qué es eso?', 'start': 22.979, 'end': 24.139, 'words': []},
  {'text': 'sil', 'start': 24.139, 'end': 25.889, 'words': []},
  {'text': 'te tiene nuevos Toys. Oh, oh!',
   'start': 25.889,
   'end': 28.889,
   'words': []},
  {'text': "Ok. So I'll go ahead for a 20 minute timer after that then um I'll just pop up again. Ok. Ok, cool. Got it.",
   'start': 28.889,
   'end': 37.849,
   'words': []},
  {'text': 'sil', 'start': 37.849, 'end': 40.74, 'words': []},
  {'t

In [None]:
# Format a text file containing words
words = " ".join(segment['text'] for segment in result['segments'] if segment['text'] != 'sil')
with open('words.txt', 'w') as f:
  f.write(words)

In [None]:
# TODO: I couldn't install Gentle on my computer (tried both git cloning the source code and pulling from docker)