# Forced Alignment with Vosk

Originally, we evaluated the tagger's F1-score by simply using indices, which may be too penalising. To evaluate the performance of our PII identification pipeline properly, we need to perform *forced alignment*, which maps each token of the transcript to its corresponding timestamps in the audio file.

For this, we shall use *Vosk*, an offline speech recognition toolkit built on Kaldi whose recognizer can return word-level timestamps.

## Change Directory to Root Project

In [1]:
import os 

os.chdir('..')

In [2]:
os.getcwd()

'/Users/farhan/Desktop/Research'

## Helper functions

Helper function for sorting the audio file names by ID

In [3]:
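def retrieve_key(file: str) -> int:
    """Return the numeric ID from a file name such as 'id7.wav' or 'id150.wav'."""
    try:
        # 3-digit ID, e.g. 'id150.wav'
        key = int(file[2:5])
    except ValueError:
        if file[3] == '.':
            # 1-digit ID, e.g. 'id7.wav'
            key = int(file[2])
        else:
            # 2-digit ID, e.g. 'id42.wav'
            key = int(file[2:4])
    return key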
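
To see why a numeric sort key is needed, the sketch below contrasts plain lexicographic sorting with sorting by numeric ID. It assumes the `idN.wav` naming used by the audio files in this notebook; `numeric_id` is a hypothetical regex-based stand-in for `retrieve_key`, not the helper itself.

```python
import re

def numeric_id(file_name: str) -> int:
    """Hypothetical stand-in for retrieve_key: parse the digits after 'id'."""
    match = re.match(r'id(\d+)', file_name)
    if match is None:
        raise ValueError(f"unexpected file name: {file_name!r}")
    return int(match.group(1))

files = ['id10.wav', 'id2.wav', 'id150.wav', 'id1.wav']

lexicographic = sorted(files)             # string order: 'id10' sorts before 'id2'
numeric = sorted(files, key=numeric_id)   # ID order: 1, 2, 10, 150

print(lexicographic)
print(numeric)
```

Sorting by ID keeps the audio files in the same order as the rows of the reference transcripts.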

Helper function to embed the entities within the transcripts (given a dataframe)

In [4]:
import pandas as pd

def insert_entity_tags_to_df(df: pd.DataFrame) -> pd.DataFrame:
    """
    Inserts entity boundary tags into the 'text' column based on 'entities',
    and adds a new column 'tagged_text' with the result.

    Args:
        df (pd.DataFrame): Must contain 'text' and 'entities' columns.

    Returns:
        pd.DataFrame: Same DataFrame with an additional 'tagged_text' column.
    """
    def insert_tags(row):
        text = row["text"]
        entities = row["entities"]

        # Sort entities in reverse order of start index to avoid offset issues
        entities_sorted = sorted(entities, key=lambda x: x[0], reverse=True)

        for start, end, label in entities_sorted:
            tag_start = f"[{label}_START]"
            tag_end = f"[{label}_END]"
            text = text[:end] + tag_end + text[end:]
            text = text[:start] + tag_start + text[start:]
        
        return text

    df = df.copy()
    df["tagged_text"] = df.apply(insert_tags, axis=1)
    return df
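
The core of the tagging step can be checked on a single string without pandas. The sketch below is a minimal restatement of the inner `insert_tags` logic; the `EMAIL` span follows the batch 2 entity format, while the `PERSON` span is made up purely to show that several entities can be tagged in one pass.

```python
def insert_tags(text: str, entities: list) -> str:
    """Wrap each (start, end, label) span of `text` in boundary tags."""
    # Work from the rightmost span so that earlier offsets stay valid
    for start, end, label in sorted(entities, key=lambda e: e[0], reverse=True):
        text = text[:end] + f"[{label}_END]" + text[end:]
        text = text[:start] + f"[{label}_START]" + text[start:]
    return text

text = "Bobby Tan write his email bobby.tan@gmail.com"
entities = [[0, 9, "PERSON"], [26, 45, "EMAIL"]]  # PERSON span is illustrative only

tagged = insert_tags(text, entities)
print(tagged)
```

Processing spans from right to left matters: inserting a tag shifts every character after it, so tagging the leftmost span first would invalidate the indices of the spans that follow.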

Helper function to unify whitespace

In [5]:
import re

def unify_whitespace(s):
    """
    - Unify multiple spaces into one space across the text.
    - Ensure exactly one space after [XXX_START] and before [XXX_END].
    - Ensure one space after [XXX_END] if missing.
    - Ensure one space before [XXX_START] if missing.
    """
    if not isinstance(s, str):
        return s

    # Step 1: unify all whitespace
    s = re.sub(r'\s+', ' ', s.strip())

    # Step 2: ensure space after [XXX_START] and before [XXX_END]
    s = re.sub(r'(\[\w+_START\])(\S)', r'\1 \2', s)  # Add space after [START] if missing
    s = re.sub(r'(\S)(\[\w+_END\])', r'\1 \2', s)    # Add space before [END] if missing

    # Step 3: ensure space after [XXX_END] if missing
    s = re.sub(r'(\[\w+_END\])(\S)', r'\1 \2', s)

    # Step 4: ensure space before [XXX_START] if missing
    s = re.sub(r'(\S)(\[\w+_START\])', r'\1 \2', s)  # Add space before [START] if missing

    return s
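
A quick check of the spacing rules on one string. The function is restated in condensed form (without the non-string guard) so that the snippet runs on its own; the input string is a made-up example.

```python
import re

def unify_whitespace(s: str) -> str:
    """Condensed restatement of the helper above, so this snippet runs alone."""
    s = re.sub(r'\s+', ' ', s.strip())                # collapse whitespace runs
    s = re.sub(r'(\[\w+_START\])(\S)', r'\1 \2', s)   # space after START
    s = re.sub(r'(\S)(\[\w+_END\])', r'\1 \2', s)     # space before END
    s = re.sub(r'(\[\w+_END\])(\S)', r'\1 \2', s)     # space after END
    s = re.sub(r'(\S)(\[\w+_START\])', r'\1 \2', s)   # space before START
    return s

raw = "his  email[EMAIL_START]bobby.tan@gmail.com[EMAIL_END], okay"
clean = unify_whitespace(raw)
print(clean)
```

After this step every tag is a whitespace-delimited token, which is the form the combined dataset is stored in.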

Helper function to run Vosk forced-alignment model

In [6]:
from vosk import Model, KaldiRecognizer
import soundfile as sf
import json

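def run_vosk(audio_path: str, vosk_model: Model) -> dict:
    """Run Vosk on one audio file and return its final result as a dict.

    With SetWords(True), the 'result' key holds the word-level entries.
    """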
    # Load audio
    audio_data, sample_rate = sf.read(audio_path)
    
    # Prepare recognizer
    rec = KaldiRecognizer(vosk_model, sample_rate)
    rec.SetWords(True)
    
    if audio_data.ndim > 1:
        audio_data = audio_data.mean(axis=1)  # Stereo to mono

    pcm_data = (audio_data * 32767).astype("int16").tobytes()

    rec.AcceptWaveform(pcm_data)
    result = json.loads(rec.FinalResult())
    
    return result

### How `tokenize_reference()` works

The `tokenize_reference()` function processes the input text and outputs a list of cleaned tokens for alignment. Here's how it works:

1. **Scan the input text using regex (`TOKEN_RE`)**
   - Splits the text into meaningful parts: tags, words, digits, and allowed symbols (`@`, `.`, `_`, `-`).
   - Example: 
     ```
     BEFORE: Farhan's credit card number is [CREDIT_CARD_START]1234-5678 [CREDIT_CARD_END]
     AFTER: ['Farhan','s','credit','card','number','is','[CREDIT_CARD_START]','1234','-','5678','[CREDIT_CARD_END]']
     ```

2. **Go through each part and apply logic step by step:**

   - **If it's a tag like `[CREDIT_CARD_START]` or `[PHONE_END]`:**
     - Keep the tag as-is.
     - If it's a START tag, remember the tag type (e.g., `CREDIT_CARD`) so we know we're inside a sensitive block.
     - If it's an END tag, exit the sensitive block.

   - **If it's a symbol (`@`, `.`, `_`):**
     - Keep it as-is.

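   - **If it's inside a sensitive block (credit card, phone or bank account), remove all hyphens from it.**
     - Because `TOKEN_RE` emits `-` as its own token, the hyphen becomes an empty string at this step and is dropped.
     - Example: `'1234', '-', '5678'` -> `'1234', '5678'`

   - **If it contains digits:**
     - Break it down character by character:
       - Digits are converted to words using a map (`'1' -> 'one'`, etc.).
       - Letters are converted to lowercase and kept.
       - Example: `"1234"` -> `'one', 'two', 'three', 'four'` (one token per character)

   - **Otherwise, it's a normal word:**
     - Lowercase it and strip punctuation. A standalone hyphen outside a sensitive block becomes empty here and is dropped as well.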

3. **Return the final list of processed tokens**

This makes the reference string easier to compare to ASR outputs while handling PIIs and formatting consistently.
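
The scan step and the digit expansion can be tried in isolation. This is a small sketch using the same `TOKEN_RE` pattern as the function; `DIGIT_WORDS` is a list-based equivalent of the digit map, introduced here only for brevity.

```python
import re

# Tags, runs of word characters, or one of the allowed symbols
TOKEN_RE = re.compile(r'\[[^\]]+]|[\w]+|[@._-]')

# Index i holds the spoken form of digit i (illustrative equivalent of the digit map)
DIGIT_WORDS = ['zero', 'one', 'two', 'three', 'four',
               'five', 'six', 'seven', 'eight', 'nine']

text = "Farhan's credit card number is [CREDIT_CARD_START]1234-5678 [CREDIT_CARD_END]"

parts = TOKEN_RE.findall(text)                      # apostrophes and spaces are skipped
spoken = [DIGIT_WORDS[int(ch)] for ch in "1234"]    # one token per digit

print(parts)
print(spoken)
```

Expanding digits one by one keeps the reference close to how identifiers such as card or passport numbers are read out.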


In [425]:
import re, string

digit_map = {
    '0': 'zero', '1': 'one', '2': 'two', '3': 'three', '4': 'four',
    '5': 'five', '6': 'six', '7': 'seven', '8': 'eight', '9': 'nine'
}

# Match tags (case-insensitive, underscore tolerant)
TAG_RE   = re.compile(r'\[[A-Z0-9_]+?_(?:START|END)\]', re.I)
TOKEN_RE = re.compile(r'\[[^\]]+]|[\w]+|[@._-]')

# These PII types will have hyphens removed inside the block
PII_TAGS_REMOVE_HYPHEN = {'CREDIT_CARD', 'PHONE', 'BANK_ACCOUNT'}

def tokenize_reference(text: str):
    tokens = []
    inside_pii_block = None  # None or tag name (e.g., 'PHONE')

    for part in TOKEN_RE.findall(text):
        # ---------- 1. Exact tag: check START/END ----------
        if TAG_RE.fullmatch(part):
            tokens.append(part)
            if part.upper().endswith('_START]'):
                inside_pii_block = part[1:-1].replace('_START', '').upper()
            elif part.upper().endswith('_END]'):
                inside_pii_block = None
            continue

        # ---------- 2. Symbols ----------
        if part in {'@', '.', '_'}:
            tokens.append(part)
            continue

        # ---------- 3. Remove hyphens if inside selected PII ----------
        if inside_pii_block in PII_TAGS_REMOVE_HYPHEN:
            part = part.replace('-', '')

        # ---------- 4. Contains digits: explode ----------
        if re.search(r'\d', part):
            for ch in part:
                if ch.isdigit():
                    tokens.append(digit_map[ch])
                elif ch.isalpha():
                    tokens.append(ch.lower())
            continue

        # ---------- 5. Normal word ----------
        clean = part.lower().strip(string.punctuation)
        if clean:
            tokens.append(clean)

    return tokens


### How `align_transcript_with_vosk()` works

This function aligns a reference transcript (with tagged PII) to the word-level timestamps produced by Vosk ASR output. The final output is a list of words (including tags), each annotated with their `start` and `end` timestamps if they align to Vosk output.

---

### Step-by-step Breakdown

1. **Tokenize the transcript**
   - Uses the `tokenize_reference()` function to break the transcript into a list of tokens that include words and tags like `[CREDIT_CARD_START]`.
   - Tags are preserved for reinsertion later.

2. **Store tag positions**
   - Extract the positions and values of all tags from the reference tokens.
   - This allows tags to be reinserted later in the correct order after alignment.

3. **Clean reference tokens**
   - Remove tags from the reference.
   - Normalize the remaining tokens by lowercasing and stripping punctuation (via `clean_token()`).

4. **Clean Vosk tokens**
   - Extract the `word` field from each Vosk word-level dict.
   - Normalize them using the same `clean_token()` function for fair comparison.

5. **Align using `SequenceMatcher`**
   - Uses `difflib.SequenceMatcher` to find matching subsequences between cleaned reference tokens and Vosk tokens.
   - This returns a sequence of **opcodes** describing alignment actions:
     - `equal`: tokens match exactly
     - `replace`: mismatch; estimate start/end for only outer tokens
     - `delete`: token appears in reference but not in Vosk; mark with null timestamps
     - `insert`: token appears in the Vosk output but not in the reference; ignore

6. **Build aligned list**
   - For `equal`:
     - Assign exact start and end timestamps from Vosk to the reference token.
   - For `replace`:
     - Only first and last tokens are assigned estimated timestamps based on available Vosk positions.
     - Middle tokens (if any) have `None` timestamps.
   - For `delete`:
     - All tokens are kept with `start = None` and `end = None`.
   - For `insert`:
     - Tokens that are present in the Vosk output but not in the reference are ignored.

7. **Reinsert tags**
   - Tags previously removed are reinserted into the `aligned` list at their original positions.
   - Tags always have `start = None`, `end = None`.
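
Steps 2 and 7 rely on a small invariant that is easy to miss: the aligned list contains exactly one entry per non-tag reference token (`equal`, `replace` and `delete` each emit entries, `insert` emits none), so inserting the tags back at their original indices, in ascending order, restores the original sequence. A minimal sketch on a toy token list:

```python
def is_tag(token: str) -> bool:
    return token.startswith("[") and token.endswith("]")

ref_tokens = ['number', 'is', '[PASSPORT_NUM_START]', 'k', 'seven',
              '[PASSPORT_NUM_END]', 'and']

# Step 2: remember where each tag sat in the full token list
tag_positions = [(i, t) for i, t in enumerate(ref_tokens) if is_tag(t)]

# Step 3: the tag-free list is what gets aligned against Vosk
words_only = [t for t in ref_tokens if not is_tag(t)]

# Step 7: reinsert in ascending order of original position
restored = list(words_only)
for pos, tag in tag_positions:
    restored.insert(pos, tag)

print(restored == ref_tokens)
```

Reinserting in descending order would not work with these indices, because each position refers to the full list that includes all earlier tags.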

---

### Output

A list of dictionaries like:

```python
[
  {'word': 'credit', 'start': 1.23, 'end': 1.56},
  {'word': '[CREDIT_CARD_START]', 'start': None, 'end': None},
  {'word': 'four', 'start': 1.57, 'end': 1.73},
  ...
]
```

This structure ensures tags are preserved and aligned words have timestamps if matched to the Vosk output.

---

### Helper Functions Used

- `clean_token(token)`
  - Lowercases and strips punctuation from a token unless it's a tag (enclosed in `[]`).
- `is_tag(token)`
  - Returns True if token is a tag (starts with `[` and ends with `]`).
- `is_start_tag(token)` / `is_end_tag(token)`
  - Detect whether a tag is a START or END marker.


In [None]:
from difflib import SequenceMatcher
import string
import re

# ────────────────────────────────────────────────────────────
# Helpers
# ────────────────────────────────────────────────────────────
def clean_token(token):
    """Lowercase and strip punctuation from token unless it's a tag."""
    if token.startswith("[") and token.endswith("]"):
        return token
    return token.lower().translate(str.maketrans('', '', string.punctuation))

def is_tag(token):
    return token.startswith("[") and token.endswith("]")

def is_start_tag(token):
    return token.endswith("_START]")

def is_end_tag(token):
    return token.endswith("_END]")

# ────────────────────────────────────────────────────────────
# Main Alignment Function
# ────────────────────────────────────────────────────────────
def align_transcript_with_vosk(vosk_words, transcript):
    # Tokenize reference into tokens including tags
    ref_tokens = tokenize_reference(transcript)

    # Save positions of tags
    tag_positions = [(i, t) for i, t in enumerate(ref_tokens) if is_tag(t)]

    # Remove tags from reference tokens
    ref_tokens_clean = [t for t in ref_tokens if not is_tag(t)]
    ref_tokens_clean_norm = [clean_token(t) for t in ref_tokens_clean]

    vosk_tokens = [w['word'] for w in vosk_words]
    vosk_tokens_clean = [clean_token(t) for t in vosk_tokens]

    # Align with SequenceMatcher
    matcher = SequenceMatcher(None, ref_tokens_clean_norm, vosk_tokens_clean, autojunk=False)

    aligned = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Exact match: copy the Vosk timestamps onto the reference token
            for i_r, i_v in zip(range(i1, i2), range(j1, j2)):
                aligned.append({
                    'word': ref_tokens_clean[i_r],
                    'start': vosk_words[i_v]['start'],
                    'end': vosk_words[i_v]['end']
                })
        elif tag == "replace":
            # Mismatch: only the outer tokens inherit the span boundaries
            start = vosk_words[j1]['start']
            end = vosk_words[j2 - 1]['end']
            for i_r in range(i1, i2):
                aligned.append({
                    'word': ref_tokens_clean[i_r],
                    'start': start if i_r == i1 else None,
                    'end': end if i_r == i2 - 1 else None
                })
        elif tag == "delete":
            # In the reference but not in Vosk: keep with null timestamps
            for i_r in range(i1, i2):
                aligned.append({
                    'word': ref_tokens_clean[i_r],
                    'start': None,
                    'end': None
                })
        # "insert" (tokens only in the Vosk output) is ignored

    # Reinsert tags with null times
    for pos, tag in tag_positions:
        aligned.insert(pos, {'word': tag, 'start': None, 'end': None})

    return aligned

### How `extract_pii_tuples()` works

This function extracts time-stamped PII spans from an aligned transcript and stores them in a new column of a DataFrame. It detects pairs of `[TAG_START]` and `[TAG_END]` tokens, and assigns each a start and end time using fallback logic.

---

### Step-by-step Breakdown

#### 1. **Regex Setup**
- `_START_RE` and `_END_RE` are case-insensitive regex patterns to detect tags like `[CREDIT_CARD_START]` and `[PHONE_END]`, even if written with extra underscores or inconsistent casing.

#### 2. **Helper Functions**

- `_to_tokens(cell)`  
  Converts the cell (either a list or a JSON-stringified list) into `list[dict]` form.
  - Tries `json.loads()` first.
  - Falls back to `ast.literal_eval()` if needed.

- `_fwd_time(tokens, idx, field)`  
  Finds the **first non-null** value of `field` (`start` or `end`) **after** the current index.

- `_back_time(tokens, idx, field)`  
  Finds the **last non-null** value of `field` **before** the current index.

---

### 3. **Main Logic: `extract_pii_tuples()`**

#### Parameters:
- `df`: The input DataFrame.
- `align_col`: Column name containing the aligned transcript (default: `"aligned_transcript"`).
- `out_col`: Column to store the output tuples (default: `"pii_tuples"`).

#### Output:
- Populates each row of `df[out_col]` with a list of `(start_time, end_time, TAG)` tuples representing detected PII spans.

#### Process:
For each row in the DataFrame:

1. **Parse the aligned tokens**
   - Use `_to_tokens()` to ensure the token list is in `list[dict]` format.

2. **Iterate over tokens**
   - Maintain two variables:
     - `open_tag`: The current open tag type (e.g., `"CREDIT_CARD"`).
     - `open_start`: The estimated start time of the tag.

3. **If `[TAG_START]` is found:**
   - Extract the tag name from the token.
   - Determine the start time using the following fallbacks (the tag token itself always has `None` timestamps):
     1. The `start` of the first following token that has one.
     2. The `start` of the immediately preceding token.

4. **If `[TAG_END]` is found and matches the currently open tag:**
   - Determine the end time using the following fallbacks:
     1. The `end` of the immediately preceding token.
     2. The `end` of the first following token that has one.

5. **Append a tuple**
   - If both `[TAG_START]` and `[TAG_END]` are successfully paired, append:
     ```python
     (open_start, end_time, open_tag)
     ```
   - Reset `open_tag` and `open_start`.

6. **Store result**
   - After processing the row, append the list of extracted tuples to `out_rows`.

---

### Example Output

After processing, each row in `df['pii_tuples']` will contain something like:

```python
[(1.23, 2.48, 'CREDIT_CARD'), (5.12, 6.01, 'PHONE')]
```

This means:
- A credit card PII segment starts at `1.23s` and ends at `2.48s`
- A phone PII segment starts at `5.12s` and ends at `6.01s`

---

### Summary of Fallback Order

**For `[TAG_START]` timestamps:**
1. Use the **next** token’s `start`
2. Use the **previous** token’s `start`

**For `[TAG_END]` timestamps:**
1. Use the **previous** token’s `end`
2. Use the **next** token’s `end`


In [None]:
import json, ast, re, pandas as pd

# ────────────────────────────────────────────────────────────
# regexes: allow extra underscores, ignore case, tolerate punctuation
_START_RE = re.compile(r'\[([A-Z_]+?)_START\]', re.I)
_END_RE   = re.compile(r'\[([A-Z_]+?)_END\]'  , re.I)

# ---------- helper: load cell → list[dict] ---------------------
def _to_tokens(cell):
    if isinstance(cell, list):
        return cell
    if isinstance(cell, str):
        try:
            return json.loads(cell)          # JSON
        except json.JSONDecodeError:
            return ast.literal_eval(cell)    # python literal
    raise ValueError("aligned_transcript must be list or JSON-string")

# ---------- helper: first non-null AFTER idx -------------------
def _fwd_time(tokens, idx, field):
    for t in tokens[idx + 1:]:
        if t[field] is not None:
            return t[field]
    return None

# ---------- helper: last non-null BEFORE idx -------------------
def _back_time(tokens, idx, field):
    for t in reversed(tokens[:idx]):
        if t[field] is not None:
            return t[field]
    return None

# ────────────────────────────────────────────────────────────
# main extractor
# ────────────────────────────────────────────────────────────
def extract_pii_tuples(df,
                       align_col="aligned_transcript",
                       out_col="pii_tuples"):
    """
    Populate df[out_col] with [(start_time, end_time, TAG), …]

    Fallback order
    --------------
    • [TAG_START] start-time
        1. next token .start
        2. previous token .start

    • [TAG_END] end-time
        1. previous token .end
        2. next token .end
    """
    out_rows = []

    for cell in df[align_col]:
        tokens      = _to_tokens(cell)
        tuples_row  = []
        open_tag    = None
        open_start  = None

        for idx, tok in enumerate(tokens):
            word = tok["word"]

            # ---------- opening tag ----------
            m_open = _START_RE.search(word)
            if m_open:
                open_tag   = m_open.group(1)

                # 1. next token .start (recomputed for every opening tag,
                #    so a stale value from an unclosed tag is never reused)
                open_start = _fwd_time(tokens, idx, "start")
                # 2. previous token .start
                if open_start is None and idx > 0:
                    open_start = tokens[idx - 1]["start"]
                continue

            # ---------- closing tag ----------
            m_close = _END_RE.search(word)
            if m_close and open_tag == m_close.group(1):
                # Initialise per closing tag; otherwise end_t is unbound on the
                # first tag and stale on later ones
                end_t = None
                # 1. previous token .end
                if idx > 0:
                    end_t = tokens[idx - 1]["end"]
                # 2. next token .end
                if end_t is None:
                    end_t = _fwd_time(tokens, idx, "end")

                tuples_row.append((open_start, end_t, open_tag))
                open_tag, open_start = None, None
                continue

        out_rows.append(tuples_row)

    df[out_col] = out_rows
    return df
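
To check the fallback order on a single row without a DataFrame, the following condensed sketch applies the same logic to a toy token list. `pii_spans` and `first_time` are hypothetical names used only here, and the timestamps are illustrative values.

```python
import re

START_RE = re.compile(r'\[([A-Z_]+?)_START\]', re.I)
END_RE = re.compile(r'\[([A-Z_]+?)_END\]', re.I)

def first_time(tokens, field):
    """Return the first non-null `field` value in `tokens`, or None."""
    return next((t[field] for t in tokens if t[field] is not None), None)

def pii_spans(tokens):
    """Pair START/END tags and return [(start_time, end_time, TAG), ...]."""
    spans, open_tag, open_start = [], None, None
    for idx, tok in enumerate(tokens):
        m_open = START_RE.fullmatch(tok["word"])
        m_close = END_RE.fullmatch(tok["word"])
        if m_open:
            open_tag = m_open.group(1)
            # 1. first following timed token, 2. immediately preceding token
            open_start = first_time(tokens[idx + 1:], "start")
            if open_start is None and idx > 0:
                open_start = tokens[idx - 1]["start"]
        elif m_close and m_close.group(1) == open_tag:
            # 1. immediately preceding token, 2. first following timed token
            end_t = tokens[idx - 1]["end"] if idx > 0 else None
            if end_t is None:
                end_t = first_time(tokens[idx + 1:], "end")
            spans.append((open_start, end_t, open_tag))
            open_tag, open_start = None, None
    return spans

tokens = [
    {'word': 'is', 'start': 5.10, 'end': 5.46},
    {'word': '[PASSPORT_NUM_START]', 'start': None, 'end': None},
    {'word': 'k', 'start': 5.79, 'end': 6.15},
    {'word': 'seven', 'start': 6.18, 'end': 6.75},
    {'word': '[PASSPORT_NUM_END]', 'start': None, 'end': None},
]

spans = pii_spans(tokens)
print(spans)
```

The span starts at the first word inside the tag pair and ends at the last one, so the surrounding words (`is` here) do not widen it.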


Helper function to extract the ID from a file name

In [9]:
def extract_id_number(filename):
    match = re.search(r'id(\d+)\.wav', filename)
    if match:
        return int(match.group(1))
    return None

## Load the dataset

In [None]:
import os

os.chdir('..')

Load batch 1 (150 samples)

In [820]:
import pandas as pd

batch_one_ref = pd.read_json('data/true_data_150.jsonl', lines=True)
batch_one_ref.head()

Unnamed: 0,text
0,"The day before [DATE_START] yesterday, [DATE_E..."
1,um my date of birth is uh second [DATE_START] ...
2,"she handed over a crumpled piece of paper, the..."
3,aglio olio and err uh [CARDINAL_START] three t...
4,[PERSON_START] Hong's [PERSON_END] email is [E...


In [821]:
import os

batch_one_files = sorted(os.listdir("data/Audio_Files_for_testing"), key=retrieve_key)
batch_one_files  = [f'data/Audio_Files_for_testing/{file}' for file in batch_one_files]

import pandas as pd

batch_one_df = pd.DataFrame(data=batch_one_files, columns=['file_name'])
batch_one_df.head()

Unnamed: 0,file_name
0,data/Audio_Files_for_testing/id1.wav
1,data/Audio_Files_for_testing/id2.wav
2,data/Audio_Files_for_testing/id3.wav
3,data/Audio_Files_for_testing/id4.wav
4,data/Audio_Files_for_testing/id5.wav


Load batch 2 (350 samples)

In [822]:
import pandas as pd

batch_two_ref = pd.read_json('data/newtest_151_500_updated_TTS.jsonl', lines=True)
batch_two_ref.head()

Unnamed: 0,id,text,entities
0,151,"456 729103 8 is Kaifu Lee's DBS bank account, ...","[[0, 12, BANK_ACCOUNT]]"
1,152,"Jacob's OCBC bank account is 192-58462-3, and ...","[[29, 40, BANK_ACCOUNT]]"
2,153,"788 305194 2 is Zheng Qi's POSB bank account, ...","[[0, 12, BANK_ACCOUNT]]"
3,154,"Geetha's UOB bank account is 341-92741-9, and ...","[[29, 40, BANK_ACCOUNT]]"
4,155,"623 481057 6 is Ah Seng's Maybank account, and...","[[0, 12, BANK_ACCOUNT]]"


In [823]:
import os

batch_two_files = sorted(os.listdir("data/newtest_151_500_updated_TTS"), key=retrieve_key)
batch_two_files  = [f'data/newtest_151_500_updated_TTS/{file}' for file in batch_two_files]

import pandas as pd

batch_two_df = pd.DataFrame(data=batch_two_files, columns=['file_name'])
batch_two_df.head()

Unnamed: 0,file_name
0,data/newtest_151_500_updated_TTS/id151.wav
1,data/newtest_151_500_updated_TTS/id152.wav
2,data/newtest_151_500_updated_TTS/id153.wav
3,data/newtest_151_500_updated_TTS/id154.wav
4,data/newtest_151_500_updated_TTS/id155.wav


As you can see, for batch 2 the entity tags are not embedded in the reference transcripts; the entities are listed separately as index spans. Since we are given the indices, we can use the helper function defined earlier to insert the tags into the transcripts.

Preprocess transcripts

In [824]:
batch_two_ref = insert_entity_tags_to_df(batch_two_ref)

In [825]:
batch_two_ref.head()

Unnamed: 0,id,text,entities,tagged_text
0,151,"456 729103 8 is Kaifu Lee's DBS bank account, ...","[[0, 12, BANK_ACCOUNT]]",[BANK_ACCOUNT_START]456 729103 8[BANK_ACCOUNT_...
1,152,"Jacob's OCBC bank account is 192-58462-3, and ...","[[29, 40, BANK_ACCOUNT]]",Jacob's OCBC bank account is [BANK_ACCOUNT_STA...
2,153,"788 305194 2 is Zheng Qi's POSB bank account, ...","[[0, 12, BANK_ACCOUNT]]",[BANK_ACCOUNT_START]788 305194 2[BANK_ACCOUNT_...
3,154,"Geetha's UOB bank account is 341-92741-9, and ...","[[29, 40, BANK_ACCOUNT]]",Geetha's UOB bank account is [BANK_ACCOUNT_STA...
4,155,"623 481057 6 is Ah Seng's Maybank account, and...","[[0, 12, BANK_ACCOUNT]]",[BANK_ACCOUNT_START]623 481057 6[BANK_ACCOUNT_...


In [826]:
batch_two_ref.tail()

Unnamed: 0,id,text,entities,tagged_text
346,496,Patrick Loh boasting about his email patrick.l...,"[[37, 60, EMAIL]]",Patrick Loh boasting about his email [EMAIL_ST...
347,497,Jasmine Yeo got sian when someone spell her em...,"[[50, 69, EMAIL]]",Jasmine Yeo got sian when someone spell her em...
348,498,Bobby Tan write his email bobby.tan@gmail.com ...,"[[26, 45, EMAIL]]",Bobby Tan write his email [EMAIL_START]bobby.t...
349,499,Kamala Singh telling the IT guy her email kama...,"[[42, 60, EMAIL]]",Kamala Singh telling the IT guy her email [EMA...
350,500,Raymond Koh say his email raymond.k@singnet.co...,"[[26, 50, EMAIL]]",Raymond Koh say his email [EMAIL_START]raymond...


In [827]:
batch_two_ref = batch_two_ref.drop(columns=['entities', 'text', 'id'])
batch_two_ref.rename(columns={'tagged_text': 'text'}, inplace=True)
batch_two_ref.head()

Unnamed: 0,text
0,[BANK_ACCOUNT_START]456 729103 8[BANK_ACCOUNT_...
1,Jacob's OCBC bank account is [BANK_ACCOUNT_STA...
2,[BANK_ACCOUNT_START]788 305194 2[BANK_ACCOUNT_...
3,Geetha's UOB bank account is [BANK_ACCOUNT_STA...
4,[BANK_ACCOUNT_START]623 481057 6[BANK_ACCOUNT_...


Combine the datasets [Run this when dataset not combined yet]

In [828]:
test_set_ref = pd.concat([batch_one_ref, batch_two_ref], ignore_index=True)

In [829]:
test_set_ref.head()

Unnamed: 0,text
0,"The day before [DATE_START] yesterday, [DATE_E..."
1,um my date of birth is uh second [DATE_START] ...
2,"she handed over a crumpled piece of paper, the..."
3,aglio olio and err uh [CARDINAL_START] three t...
4,[PERSON_START] Hong's [PERSON_END] email is [E...


In [830]:
test_set_ref.tail()

Unnamed: 0,text
496,Patrick Loh boasting about his email [EMAIL_ST...
497,Jasmine Yeo got sian when someone spell her em...
498,Bobby Tan write his email [EMAIL_START]bobby.t...
499,Kamala Singh telling the IT guy her email [EMA...
500,Raymond Koh say his email [EMAIL_START]raymond...


In [831]:
test_set_ref['text'] = test_set_ref['text'].apply(unify_whitespace)

In [None]:
test_set_ref.to_json('data/test_set_ref_combined.jsonl', lines=True, orient='records')

Load the combined processed dataset [When already combined]

In [10]:
import pandas as pd

test_set_ref = pd.read_json('data/test_set_ref_all.jsonl', lines=True)

In [11]:
test_set_ref.head()

Unnamed: 0,text
0,"The day before [DATE_START] yesterday, [DATE_E..."
1,um my date of birth is uh second [DATE_START] ...
2,"she handed over a crumpled piece of paper, the..."
3,aglio olio and err uh [CARDINAL_START] three t...
4,[PERSON_START] Hong's [PERSON_END] email is [E...


In [12]:
test_set_ref.tail()

Unnamed: 0,text
495,Patrick Loh boasting about his email [EMAIL_ST...
496,Jasmine Yeo got sian when someone spell her em...
497,Bobby Tan write his email [EMAIL_START] bobby....
498,Kamala Singh telling the IT guy her email [EMA...
499,Raymond Koh say his email [EMAIL_START] raymon...


## Load the model (Vosk)

Unfortunately, there are currently no models tuned for Singaporean English (Singlish). As such, we shall use the `vosk-model-en-us-0.42-gigaspeech` model.

In [13]:
from vosk import Model, KaldiRecognizer
import soundfile as sf
import json

# Load model (replace path with your model directory)
model_path = "models/vosk-model-en-us-0.42-gigaspeech"
model = Model(model_path)

LOG (VoskAPI:ReadDataFiles():model.cc:213) Decoding params beam=13 max-active=7000 lattice-beam=8
LOG (VoskAPI:ReadDataFiles():model.cc:216) Silence phones 1:2:3:4:5:6:7:8:9:10
LOG (VoskAPI:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 0 orphan nodes.
LOG (VoskAPI:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 0 orphan components.
LOG (VoskAPI:ReadDataFiles():model.cc:248) Loading i-vector extractor from models/vosk-model-en-us-0.42-gigaspeech/ivector/final.ie
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG (VoskAPI:ReadDataFiles():model.cc:279) Loading HCLG from models/vosk-model-en-us-0.42-gigaspeech/graph/HCLG.fst
LOG (VoskAPI:ReadDataFiles():model.cc:294) Loading words from models/vosk-model-en-us-0.42-gigaspeech/graph/words.txt
LOG (VoskAPI:ReadDataFiles():model.cc:303) Loading winfo models/vosk-model-en-us-0.42-gigaspeech/graph/phones/word_bound

## Load and Read Audio (With just one sample)

Several things happen here:

1. We create a `KaldiRecognizer` instance and call `.SetWords(True)`, so that the result includes word-level timestamps.
2. The `.AcceptWaveform()` method processes the waveform.
3. The `.FinalResult()` method is then called to retrieve the transcription together with its word-level timestamps. The transcription contains some inaccuracies, since the model is not tuned for Singaporean English.

Test on one audio sample (simple example with passport-number PIIs)

In [387]:
audio_path = "data/Audio_Files_for_testing/id85.wav"
audio_data, sample_rate = sf.read(audio_path)

In [388]:
sample = run_vosk(audio_path, model)

In [389]:
sample

{'result': [{'conf': 1.0, 'end': 0.63, 'start': 0.45, 'word': 'i'},
  {'conf': 1.0, 'end': 1.02, 'start': 0.63, 'word': 'lost'},
  {'conf': 1.0, 'end': 1.2, 'start': 1.02, 'word': 'my'},
  {'conf': 1.0, 'end': 1.77, 'start': 1.23, 'word': 'passport'},
  {'conf': 1.0, 'end': 2.25, 'start': 1.98, 'word': 'and'},
  {'conf': 1.0, 'end': 2.49, 'start': 2.31, 'word': 'it'},
  {'conf': 1.0, 'end': 2.728719, 'start': 2.49, 'word': 'is'},
  {'conf': 1.0, 'end': 3.09, 'start': 2.76, 'word': 'new'},
  {'conf': 1.0, 'end': 3.39, 'start': 3.09, 'word': 'one'},
  {'conf': 1.0, 'end': 5.07, 'start': 4.59, 'word': 'number'},
  {'conf': 1.0, 'end': 5.46, 'start': 5.1, 'word': 'is'},
  {'conf': 0.70072, 'end': 6.15, 'start': 5.79, 'word': 'gay'},
  {'conf': 1.0, 'end': 6.75, 'start': 6.18, 'word': 'seven'},
  {'conf': 1.0, 'end': 7.29, 'start': 6.84, 'word': 'six'},
  {'conf': 1.0, 'end': 7.83, 'start': 7.29, 'word': 'zero'},
  {'conf': 1.0, 'end': 8.227815, 'start': 7.98, 'word': 'one'},
  {'conf': 0.6

In [390]:
len(sample['result'])

30

In [391]:
vosk_output = []

for item in sample['result']:
    vosk_output.append(item['word'])

print(vosk_output)

['i', 'lost', 'my', 'passport', 'and', 'it', 'is', 'new', 'one', 'number', 'is', 'gay', 'seven', 'six', 'zero', 'one', 'eight', 'four', 'two', 't', 'and', 'k', 'seven', 'six', 'zero', 'one', 'eight', 'four', 'two', 'd']


Reference Transcript

In [392]:
test_set_ref['text'].iloc[84]

'I lost my passport and it new [CARDINAL_START] one [CARDINAL_END] number is [PASSPORT_NUM_START] K 7601842 T [PASSPORT_NUM_END] and [PASSPORT_NUM_START] K 7601842 T [PASSPORT_NUM_END]'

In [393]:
tokenized_reference = tokenize_reference(test_set_ref['text'].iloc[84])

print("Number of tokens:", len(tokenized_reference))
print(tokenized_reference)

Number of tokens: 35
['i', 'lost', 'my', 'passport', 'and', 'it', 'new', '[CARDINAL_START]', 'one', '[CARDINAL_END]', 'number', 'is', '[PASSPORT_NUM_START]', 'k', 'seven', 'six', 'zero', 'one', 'eight', 'four', 'two', 't', '[PASSPORT_NUM_END]', 'and', '[PASSPORT_NUM_START]', 'k', 'seven', 'six', 'zero', 'one', 'eight', 'four', 'two', 't', '[PASSPORT_NUM_END]']


#### Using `SequenceMatcher` from `difflib` to perform edit-distance based heuristics

**Limitations of Greedy-alignment**

As we observed, greedy approaches are not feasible, both because of the variability of the dataset and because the output format of the Vosk model differs from that of the reference text. For example, the reference text includes fillers such as "uh", whereas the Vosk model seems to omit them. Another example is the handling of contractions: the Vosk model outputs a contraction as-is (one token), whereas the reference text separates the words (e.g., they're -> they are, two tokens). Lastly, the Vosk model is not tuned for our data and often outputs mistranscriptions (e.g., "seventy" instead of "seven three"), which can severely degrade alignment quality. We therefore decided that greedy alignment is not sufficient for our use case.

**Edit Distance Heuristics with `SequenceMatcher`**

Edit-distance-style alignment, implemented here with `SequenceMatcher`, offers a more robust alternative to greedy heuristics for aligning reference transcripts with the Vosk output. (Strictly speaking, `SequenceMatcher` works by finding longest matching blocks and does not guarantee a minimal edit sequence, but its opcodes describe the same kinds of operations.) Unlike greedy methods, which proceed sequentially and easily desynchronize after a mismatch, this approach considers both sequences as a whole and tolerates insertions, deletions, and substitutions. For example, the phrase "they are" may appear as two tokens in the reference transcript while Vosk outputs "they're" as a single token. This would break a greedy matcher, but `SequenceMatcher` registers it as a "replace" operation and continues aligning the rest correctly. Similarly, the reference text may contain filler words like "uh" that do not exist in the Vosk output; these are treated as deletions while the surrounding alignment stays intact. Tags such as `[EMAIL_START]` never appear in the Vosk output either, so they are removed before matching and reinserted afterwards. Lastly, in numerical sequences, Vosk may transcribe "seven three" as "seventy", which would cause a greedy approach to misalign all following tokens. Here such errors stay localized and do not compromise the rest of the alignment.
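
The three situations described above can be reproduced on a toy pair of token lists. This is a small sketch with made-up tokens, not data from the test set:

```python
from difflib import SequenceMatcher

# Reference splits the contraction, keeps the filler and spells digits one by one
ref = ['they', 'are', 'going', 'uh', 'to', 'seven', 'three', 'street']
# Hypothetical recognizer output for the same utterance
vosk = ["they're", 'going', 'to', 'seventy', 'street']

matcher = SequenceMatcher(None, ref, vosk, autojunk=False)
opcodes = matcher.get_opcodes()

for tag, i1, i2, j1, j2 in opcodes:
    print(f"{tag:8} ref={ref[i1:i2]} vosk={vosk[j1:j2]}")
```

Each mismatch is confined to its own `replace` or `delete` block, and the words in between (`going`, `to`, `street`) still align exactly, which is the property a greedy matcher lacks.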

In [398]:
from difflib import SequenceMatcher

def is_tag(tok):
    return tok.startswith('[') and tok.endswith(']')

def prepare_alignment_tokens(ref_tokens):
    """Strip out tags but record their positions for reinsertion later."""
    ref_clean = []
    tag_insertions = []  # list of (index, tag)
    for tok in ref_tokens:
        if is_tag(tok):
            tag_insertions.append((len(ref_clean), tok))
        else:
            ref_clean.append(tok)
    return ref_clean, tag_insertions

# 1. Original reference tokens
ref_tokens = tokenize_reference(test_set_ref['text'].iloc[84])

# 2. Cleaned tokens for alignment (no tags)
ref_clean, tag_insertions = prepare_alignment_tokens(ref_tokens)

# 3. Vosk tokens
vosk = ['i', 'lost', 'my', 'passport', 'and', 'it', 'is', 'new', 'one', 'number', 'is',
        'gay', 'seven', 'six', 'zero', 'one', 'eight', 'four', 'two', 't', 'and',
        'k', 'seven', 'six', 'zero', 'one', 'eight', 'four', 'two', 'd']

# 4. Alignment
matcher = SequenceMatcher(None, ref_clean, vosk, autojunk=False)

# 5. Print match result
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    print(f"{tag:10} ref[{i1}:{i2}] -> vosk[{j1}:{j2}]")
    print(f"  REF : {ref_clean[i1:i2]}")
    print(f"  VOSK: {vosk[j1:j2]}")

# 6. (Optional) If you want to reconstruct the full aligned output with tags:
# Reinsert tags into ref_clean if needed:
for idx, tag in reversed(tag_insertions):
    ref_clean.insert(idx, tag)

print(ref_clean)

equal      ref[0:6] -> vosk[0:6]
  REF : ['i', 'lost', 'my', 'passport', 'and', 'it']
  VOSK: ['i', 'lost', 'my', 'passport', 'and', 'it']
insert     ref[6:6] -> vosk[6:7]
  REF : []
  VOSK: ['is']
equal      ref[6:10] -> vosk[7:11]
  REF : ['new', 'one', 'number', 'is']
  VOSK: ['new', 'one', 'number', 'is']
replace    ref[10:11] -> vosk[11:12]
  REF : ['k']
  VOSK: ['gay']
equal      ref[11:28] -> vosk[12:29]
  REF : ['seven', 'six', 'zero', 'one', 'eight', 'four', 'two', 't', 'and', 'k', 'seven', 'six', 'zero', 'one', 'eight', 'four', 'two']
  VOSK: ['seven', 'six', 'zero', 'one', 'eight', 'four', 'two', 't', 'and', 'k', 'seven', 'six', 'zero', 'one', 'eight', 'four', 'two']
replace    ref[28:29] -> vosk[29:30]
  REF : ['t']
  VOSK: ['d']
['i', 'lost', 'my', 'passport', 'and', 'it', 'new', '[CARDINAL_START]', 'one', '[CARDINAL_END]', 'number', 'is', '[PASSPORT_NUM_START]', 'k', 'seven', 'six', 'zero', 'one', 'eight', 'four', 'two', 't', '[PASSPORT_NUM_END]', 'and', '[PASSPORT_NUM_ST

Align vosk output with reference (edit-distance based approach)
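As a rough illustration of how opcodes can be turned into timestamps, the sketch below copies times one-to-one when a block has equal length on both sides and otherwise keeps only the outer boundaries of the block. This is a simplified sketch under those assumptions, not the notebook's `align_transcript_with_vosk`. The name `transfer_timestamps` is hypothetical, and the reference tokens are assumed to be lower-cased with tags already stripped.

```python
from difflib import SequenceMatcher

def transfer_timestamps(ref_tokens, vosk_words):
    """Copy Vosk word times onto reference tokens using SequenceMatcher opcodes."""
    hyp = [w["word"] for w in vosk_words]
    aligned = [{"word": t, "start": None, "end": None} for t in ref_tokens]
    matcher = SequenceMatcher(None, ref_tokens, hyp, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "insert"):
            continue  # no counterpart on one side, so nothing to copy
        if i2 - i1 == j2 - j1:
            # 'equal' or same-length 'replace': copy times word by word
            for i, j in zip(range(i1, i2), range(j1, j2)):
                aligned[i]["start"] = vosk_words[j]["start"]
                aligned[i]["end"] = vosk_words[j]["end"]
        else:
            # uneven 'replace': only the outer boundaries of the block are known
            aligned[i1]["start"] = vosk_words[j1]["start"]
            aligned[i2 - 1]["end"] = vosk_words[j2 - 1]["end"]
    return aligned

ref_tokens = ["ppo", "hotel", "holidays", "ppl", "and"]
vosk_words = [
    {"word": "mortality", "start": 0.87, "end": 1.35},
    {"word": "qualities", "start": 1.3512, "end": 2.13},
    {"word": "and", "start": 2.22, "end": 2.43},
]
aligned = transfer_timestamps(ref_tokens, vosk_words)
for item in aligned:
    print(item)
```

With this policy, tokens inside an uneven block keep `None` for the times that cannot be determined, so downstream code has to tolerate missing values.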

In [400]:
align_transcript_with_vosk(sample['result'], test_set_ref['text'].iloc[84])

[{'word': 'i', 'start': 0.45, 'end': 0.63},
 {'word': 'lost', 'start': 0.63, 'end': 1.02},
 {'word': 'my', 'start': 1.02, 'end': 1.2},
 {'word': 'passport', 'start': 1.23, 'end': 1.77},
 {'word': 'and', 'start': 1.98, 'end': 2.25},
 {'word': 'it', 'start': 2.31, 'end': 2.49},
 {'word': 'new', 'start': 2.76, 'end': 3.09},
 {'word': '[CARDINAL_START]', 'start': None, 'end': None},
 {'word': 'one', 'start': 3.09, 'end': 3.39},
 {'word': '[CARDINAL_END]', 'start': None, 'end': None},
 {'word': 'number', 'start': 4.59, 'end': 5.07},
 {'word': 'is', 'start': 5.1, 'end': 5.46},
 {'word': '[PASSPORT_NUM_START]', 'start': None, 'end': None},
 {'word': 'k', 'start': 5.79, 'end': 6.15},
 {'word': 'seven', 'start': 6.18, 'end': 6.75},
 {'word': 'six', 'start': 6.84, 'end': 7.29},
 {'word': 'zero', 'start': 7.29, 'end': 7.83},
 {'word': 'one', 'start': 7.98, 'end': 8.227815},
 {'word': 'eight', 'start': 8.25, 'end': 8.49},
 {'word': 'four', 'start': 8.85, 'end': 9.21},
 {'word': 'two', 'start': 9.2

Test on one audio sample (Numerical PIIs)

In [432]:
import soundfile as sf

audio_path = "data/Audio_Files_for_testing/id20.wav"
audio_data, sample_rate = sf.read(audio_path)

In [433]:
sample2 = run_vosk(audio_path, model)

In [434]:
sample2

{'result': [{'conf': 0.248183,
   'end': 1.35,
   'start': 0.87,
   'word': 'mortality'},
  {'conf': 0.573568, 'end': 2.13, 'start': 1.3512, 'word': 'qualities'},
  {'conf': 1.0, 'end': 2.43, 'start': 2.22, 'word': 'and'},
  {'conf': 1.0, 'end': 2.94, 'start': 2.43, 'word': 'transferred'},
  {'conf': 1.0, 'end': 3.09, 'start': 2.94, 'word': 'to'},
  {'conf': 1.0, 'end': 3.57, 'start': 3.12, 'word': 'dvds'},
  {'conf': 1.0, 'end': 3.81, 'start': 3.57, 'word': 'bank'},
  {'conf': 1.0, 'end': 4.14, 'start': 3.81, 'word': 'account'},
  {'conf': 1.0, 'end': 4.41, 'start': 4.23, 'word': 'and'},
  {'conf': 1.0, 'end': 4.89, 'start': 4.41, 'word': 'contact'},
  {'conf': 1.0, 'end': 4.98, 'start': 4.89, 'word': 'the'},
  {'conf': 1.0, 'end': 5.4, 'start': 4.98, 'word': 'same'},
  {'conf': 1.0, 'end': 5.94, 'start': 5.43, 'word': 'six'},
  {'conf': 1.0, 'end': 6.24, 'start': 5.97, 'word': 'eight'},
  {'conf': 1.0, 'end': 6.54, 'start': 6.27, 'word': 'eight'},
  {'conf': 1.0, 'end': 6.84, 'start'

Reference

In [435]:
test_set_ref['text'].iloc[19]

'ppo hotel holidays ppl and transfer to [ORG_START] DBS [ORG_END] bank account and contact is same [PHONE_START] 6889-5748 [PHONE_END] or [PHONE_START] 6889-5748 [PHONE_END]'

Force align

Numerical PII entity types: `credit_card`, `car_plate`, `bank_account`, `nric`, `phone`, `passport_num`

In [436]:
align_transcript_with_vosk(sample2['result'], test_set_ref['text'].iloc[19])

[{'word': 'ppo', 'start': 0.87, 'end': None},
 {'word': 'hotel', 'start': None, 'end': None},
 {'word': 'holidays', 'start': None, 'end': None},
 {'word': 'ppl', 'start': None, 'end': 2.13},
 {'word': 'and', 'start': 2.22, 'end': 2.43},
 {'word': 'transfer', 'start': 2.43, 'end': 2.94},
 {'word': 'to', 'start': 2.94, 'end': 3.09},
 {'word': '[ORG_START]', 'start': None, 'end': None},
 {'word': 'dbs', 'start': 3.12, 'end': 3.57},
 {'word': '[ORG_END]', 'start': None, 'end': None},
 {'word': 'bank', 'start': 3.57, 'end': 3.81},
 {'word': 'account', 'start': 3.81, 'end': 4.14},
 {'word': 'and', 'start': 4.23, 'end': 4.41},
 {'word': 'contact', 'start': 4.41, 'end': 4.89},
 {'word': 'is', 'start': 4.89, 'end': 4.98},
 {'word': 'same', 'start': 4.98, 'end': 5.4},
 {'word': '[PHONE_START]', 'start': None, 'end': None},
 {'word': 'six', 'start': 5.43, 'end': 5.94},
 {'word': 'eight', 'start': 5.97, 'end': 6.24},
 {'word': 'eight', 'start': 6.27, 'end': 6.54},
 {'word': 'nine', 'start': 6.54, 

Another example

In [57]:
sample3 = run_vosk("data/newtest_151_500_updated_TTS/id500.wav", model)

In [58]:
sample3

{'result': [{'conf': 1.0, 'end': 0.48, 'start': 0.09, 'word': 'raymond'},
  {'conf': 0.717032, 'end': 0.81, 'start': 0.48, 'word': 'co'},
  {'conf': 1.0, 'end': 1.08, 'start': 0.81, 'word': 'say'},
  {'conf': 1.0, 'end': 1.32, 'start': 1.08, 'word': 'his'},
  {'conf': 1.0, 'end': 1.83, 'start': 1.35, 'word': 'email'},
  {'conf': 1.0, 'end': 2.37, 'start': 1.86, 'word': 'raymond'},
  {'conf': 1.0, 'end': 2.64, 'start': 2.37, 'word': 'dot'},
  {'conf': 0.886485, 'end': 3.0, 'start': 2.64, 'word': 'k'},
  {'conf': 1.0, 'end': 3.18, 'start': 3.03, 'word': 'at'},
  {'conf': 0.414042, 'end': 3.51, 'start': 3.18, 'word': 'sing'},
  {'conf': 1.0, 'end': 3.72, 'start': 3.51, 'word': 'that'},
  {'conf': 1.0, 'end': 4.02, 'start': 3.72, 'word': 'dot'},
  {'conf': 1.0, 'end': 4.41, 'start': 4.02, 'word': 'com'},
  {'conf': 1.0, 'end': 4.68, 'start': 4.41, 'word': 'dot'},
  {'conf': 0.913401, 'end': 4.95, 'start': 4.71, 'word': 's'},
  {'conf': 0.936325, 'end': 5.31, 'start': 4.95, 'word': 'g'},
  

In [65]:
test_set_ref['text'].iloc[499]

'Raymond Koh say his email [EMAIL_START] raymond.k@singnet.com.s g [EMAIL_END] got so many spam, like free lobang for advertisers sia'

In [66]:
align_transcript_with_vosk(sample3['result'], test_set_ref['text'].iloc[499])

[{'word': 'raymond', 'start': 0.09, 'end': 0.48},
 {'word': 'koh', 'start': 0.48, 'end': 0.81},
 {'word': 'say', 'start': 0.81, 'end': 1.08},
 {'word': 'his', 'start': 1.08, 'end': 1.32},
 {'word': 'email', 'start': 1.35, 'end': 1.83},
 {'word': '[EMAIL_START]', 'start': None, 'end': None},
 {'word': 'raymond', 'start': 1.86, 'end': 2.37},
 {'word': '.', 'start': 2.37, 'end': 2.64},
 {'word': 'k', 'start': 2.64, 'end': 3.0},
 {'word': '@', 'start': 3.03, 'end': None},
 {'word': 'singnet', 'start': None, 'end': None},
 {'word': '.', 'start': None, 'end': 4.02},
 {'word': 'com', 'start': 4.02, 'end': 4.41},
 {'word': '.', 'start': 4.41, 'end': 4.68},
 {'word': 's', 'start': 4.71, 'end': 4.95},
 {'word': 'g', 'start': 4.95, 'end': 5.31},
 {'word': '[EMAIL_END]', 'start': 5.55, 'end': None},
 {'word': 'got', 'start': None, 'end': 5.79},
 {'word': 'so', 'start': 5.79, 'end': 6.06},
 {'word': 'many', 'start': 6.06, 'end': 6.33},
 {'word': 'spam', 'start': 6.33, 'end': 6.81},
 {'word': 'like'

## Load and Read Audio (All 500 samples)

In [437]:
import os

audio_paths = sorted(os.listdir("data/Audio_Files_for_testing") + os.listdir("data/newtest_151_500_updated_TTS"), key=retrieve_key)
audio_paths_with_parent = [f'data/Audio_Files_for_testing/{file}' for file in audio_paths if file.endswith('.wav') and retrieve_key(file) < 151]
audio_paths_with_parent += [f'data/newtest_151_500_updated_TTS/{file}' for file in audio_paths if file.endswith('.wav') and retrieve_key(file) >= 151]

In [438]:
audio_paths_with_parent

['data/Audio_Files_for_testing/id1.wav',
 'data/Audio_Files_for_testing/id2.wav',
 'data/Audio_Files_for_testing/id3.wav',
 'data/Audio_Files_for_testing/id4.wav',
 'data/Audio_Files_for_testing/id5.wav',
 'data/Audio_Files_for_testing/id6.wav',
 'data/Audio_Files_for_testing/id7.wav',
 'data/Audio_Files_for_testing/id8.wav',
 'data/Audio_Files_for_testing/id9.wav',
 'data/Audio_Files_for_testing/id10.wav',
 'data/Audio_Files_for_testing/id11.wav',
 'data/Audio_Files_for_testing/id12.wav',
 'data/Audio_Files_for_testing/id13.wav',
 'data/Audio_Files_for_testing/id14.wav',
 'data/Audio_Files_for_testing/id15.wav',
 'data/Audio_Files_for_testing/id16.wav',
 'data/Audio_Files_for_testing/id17.wav',
 'data/Audio_Files_for_testing/id18.wav',
 'data/Audio_Files_for_testing/id19.wav',
 'data/Audio_Files_for_testing/id20.wav',
 'data/Audio_Files_for_testing/id21.wav',
 'data/Audio_Files_for_testing/id22.wav',
 'data/Audio_Files_for_testing/id23.wav',
 'data/Audio_Files_for_testing/id24.wav',
 

In [439]:
len(audio_paths_with_parent)

500

In [440]:
aligned_transcripts = []

In [441]:
from tqdm import tqdm

for audio in tqdm(audio_paths_with_parent, desc="Running forced alignment algorithm", unit="files", total=len(audio_paths_with_parent)):
    try:
        sample = run_vosk(audio, model)
        vosk_output = sample['result']
        # Look up the reference text by the numeric ID in the file name (e.g., id48 -> row 47)
        idx = int(extract_id_number(audio))
        ref_text = test_set_ref['text'].iloc[idx - 1]
        ref_text_aligned = align_transcript_with_vosk(vosk_output, ref_text)
        aligned_transcripts.append({
            'file_name': audio,
            'vosk_output': vosk_output,
            'ref_text': ref_text,
            'aligned_transcript': ref_text_aligned
        })
        # print(ref_text_aligned)
    except Exception as e:
        print(f"Error processing {audio}: {e}")
        aligned_transcripts.append({
            'file_name': audio,
            'vosk_output': None,
            'ref_text': None,
            'aligned_transcript': None
        })
        continue        

Running forced alignment algorithm:  48%|████▊     | 242/500 [06:35<02:36,  1.65files/s]

Error processing data/newtest_151_500_updated_TTS/id243.wav: 'result'


Running forced alignment algorithm: 100%|██████████| 500/500 [12:39<00:00,  1.52s/files]


In [459]:
processed_df = pd.DataFrame(aligned_transcripts)
processed_df.head()

| | file_name | vosk_output | ref_text | aligned_transcript |
|---|---|---|---|---|
| 0 | data/Audio_Files_for_testing/id1.wav | [{'conf': 1.0, 'end': 0.69, 'start': 0.51, 'wo... | The day before [DATE_START] yesterday, [DATE_E... | [{'word': 'the', 'start': 0.51, 'end': 0.69}, ... |
| 1 | data/Audio_Files_for_testing/id2.wav | [{'conf': 0.681104, 'end': 1.14, 'start': 0.69... | um my date of birth is uh second [DATE_START] ... | [{'word': 'um', 'start': 0.69, 'end': 1.14}, {... |
| 2 | data/Audio_Files_for_testing/id3.wav | [{'conf': 1.0, 'end': 0.84, 'start': 0.48, 'wo... | she handed over a crumpled piece of paper, the... | [{'word': 'she', 'start': 0.48, 'end': 0.84}, ... |
| 3 | data/Audio_Files_for_testing/id4.wav | [{'conf': 1.0, 'end': 1.5, 'start': 1.11, 'wor... | aglio olio and err uh [CARDINAL_START] three t... | [{'word': 'aglio', 'start': None, 'end': None}... |
| 4 | data/Audio_Files_for_testing/id5.wav | [{'conf': 1.0, 'end': 1.65, 'start': 1.14, 'wo... | [PERSON_START] Hong's [PERSON_END] email is [E... | [{'word': '[PERSON_START]', 'start': None, 'en... |


In [443]:
processed_df['vosk_output'].iloc[3]

[{'conf': 1.0, 'end': 1.5, 'start': 1.11, 'word': 'and'},
 {'conf': 0.855331, 'end': 2.22, 'start': 1.62, 'word': 'ah'},
 {'conf': 1.0, 'end': 2.94, 'start': 2.261498, 'word': 'ah'},
 {'conf': 1.0, 'end': 3.39, 'start': 3.09, 'word': 'three'},
 {'conf': 1.0, 'end': 3.78, 'start': 3.39, 'word': 'three'},
 {'conf': 1.0, 'end': 3.99, 'start': 3.81, 'word': 'of'},
 {'conf': 1.0, 'end': 4.11, 'start': 3.99, 'word': 'the'},
 {'conf': 1.0, 'end': 4.5, 'start': 4.23, 'word': 'other'},
 {'conf': 1.0, 'end': 4.8, 'start': 4.5, 'word': 'one'},
 {'conf': 1.0, 'end': 5.13, 'start': 4.83, 'word': 'yeah'}]

In [444]:
processed_df.tail()

| | file_name | vosk_output | ref_text | aligned_transcript |
|---|---|---|---|---|
| 495 | data/newtest_151_500_updated_TTS/id496.wav | [{'conf': 1.0, 'end': 0.45, 'start': 0.06, 'wo... | Patrick Loh boasting about his email [EMAIL_ST... | [{'word': 'patrick', 'start': 0.06, 'end': 0.4... |
| 496 | data/newtest_151_500_updated_TTS/id497.wav | [{'conf': 1.0, 'end': 0.45, 'start': 0.03, 'wo... | Jasmine Yeo got sian when someone spell her em... | [{'word': 'jasmine', 'start': 0.03, 'end': 0.4... |
| 497 | data/newtest_151_500_updated_TTS/id498.wav | [{'conf': 1.0, 'end': 0.45, 'start': 0.03, 'wo... | Bobby Tan write his email [EMAIL_START] bobby.... | [{'word': 'bobby', 'start': 0.03, 'end': 0.45}... |
| 498 | data/newtest_151_500_updated_TTS/id499.wav | [{'conf': 1.0, 'end': 0.48, 'start': 0.06, 'wo... | Kamala Singh telling the IT guy her email [EMA... | [{'word': 'kamala', 'start': 0.06, 'end': 0.48... |
| 499 | data/newtest_151_500_updated_TTS/id500.wav | [{'conf': 1.0, 'end': 0.51, 'start': 0.09, 'wo... | Raymond Koh say his email [EMAIL_START] raymon... | [{'word': 'raymond', 'start': 0.09, 'end': 0.5... |


In [445]:
processed_df.to_json('data/processed_test_set_aligned.json', lines=True, orient='records')

## Extract Triplets

Each triplet has the form `(start_time, end_time, entity)`: the start and end timestamps (in seconds) of an entity span in the audio, followed by the entity label.
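A minimal sketch of how such triplets could be derived from one aligned transcript is shown below. It assumes that each entity is delimited by `[LABEL_START]` and `[LABEL_END]` items and that the span runs from the first known start time to the last known end time between them. The helper name `triplets_from_aligned` is hypothetical, and this is not necessarily how `extract_pii_tuples` works.

```python
import re

# Matches tags such as [PHONE_START] or [PASSPORT_NUM_END].
TAG_RE = re.compile(r"^\[(?P<label>[A-Z_]+)_(?P<kind>START|END)\]$")

def triplets_from_aligned(aligned):
    """Return (start_time, end_time, label) for each tagged span with known times."""
    triplets = []
    label, start, end = None, None, None
    for item in aligned:
        m = TAG_RE.match(item["word"])
        if m and m.group("kind") == "START":
            label, start, end = m.group("label"), None, None
        elif m and m.group("kind") == "END":
            # Skip spans whose boundaries could not be timed.
            if label is not None and start is not None and end is not None:
                triplets.append((start, end, label))
            label = None
        elif label is not None:
            if start is None and item["start"] is not None:
                start = item["start"]  # first timed word inside the span
            if item["end"] is not None:
                end = item["end"]      # last timed word seen so far
    return triplets

aligned = [
    {"word": "new", "start": 2.76, "end": 3.09},
    {"word": "[CARDINAL_START]", "start": None, "end": None},
    {"word": "one", "start": 3.09, "end": 3.39},
    {"word": "[CARDINAL_END]", "start": None, "end": None},
    {"word": "number", "start": 4.59, "end": 5.07},
]
triplets = triplets_from_aligned(aligned)
print(triplets)
```

Spans in which no word received a timestamp are dropped in this sketch. Whether to drop, flag, or interpolate them is a design choice that affects the reference set used for evaluation.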

In [446]:
processed_df = pd.read_json('data/processed_test_set_aligned.json', lines=True)

In [447]:
processed_df.head()

| | file_name | vosk_output | ref_text | aligned_transcript |
|---|---|---|---|---|
| 0 | data/Audio_Files_for_testing/id1.wav | [{'conf': 1.0, 'end': 0.6900000000000001, 'sta... | The day before [DATE_START] yesterday, [DATE_E... | [{'word': 'the', 'start': 0.51, 'end': 0.69000... |
| 1 | data/Audio_Files_for_testing/id2.wav | [{'conf': 0.6811039999999999, 'end': 1.14, 'st... | um my date of birth is uh second [DATE_START] ... | [{'word': 'um', 'start': 0.6900000000000001, '... |
| 2 | data/Audio_Files_for_testing/id3.wav | [{'conf': 1.0, 'end': 0.84, 'start': 0.48, 'wo... | she handed over a crumpled piece of paper, the... | [{'word': 'she', 'start': 0.48, 'end': 0.84}, ... |
| 3 | data/Audio_Files_for_testing/id4.wav | [{'conf': 1.0, 'end': 1.5, 'start': 1.11, 'wor... | aglio olio and err uh [CARDINAL_START] three t... | [{'word': 'aglio', 'start': None, 'end': None}... |
| 4 | data/Audio_Files_for_testing/id5.wav | [{'conf': 1.0, 'end': 1.65, 'start': 1.14, 'wo... | [PERSON_START] Hong's [PERSON_END] email is [E... | [{'word': '[PERSON_START]', 'start': None, 'en... |


In [456]:
processed_df['aligned_transcript'].iloc[305]

[{'word': 'desmond', 'start': 0.03, 'end': 0.6000000000000001},
 {'word': 'tan', 'start': 0.6000000000000001, 'end': 1.35},
 {'word': 'his', 'start': None, 'end': None},
 {'word': 'i', 'start': None, 'end': None},
 {'word': 'c', 'start': None, 'end': None},
 {'word': '[NRIC_START]', 'start': None, 'end': None},
 {'word': 's', 'start': None, 'end': None},
 {'word': 'six', 'start': None, 'end': None},
 {'word': 'seven', 'start': None, 'end': None},
 {'word': 'eight', 'start': None, 'end': None},
 {'word': 'nine', 'start': None, 'end': None},
 {'word': 'zero', 'start': None, 'end': None},
 {'word': 'one', 'start': None, 'end': None},
 {'word': 'two', 'start': None, 'end': None},
 {'word': 'f', 'start': None, 'end': None},
 {'word': '[NRIC_END]', 'start': None, 'end': None},
 {'word': 'he', 'start': None, 'end': None},
 {'word': 'say', 'start': None, 'end': None},
 {'word': 'the', 'start': None, 'end': None},
 {'word': 'last', 'start': None, 'end': None},
 {'word': 'letter', 'start': None,

In [457]:
processed_df['ref_text'].iloc[305]

"Desmond Tan, his I C [NRIC_START] S6789012F [NRIC_END], he say the last letter F stands for 'fantastic' sia."

In [458]:
processed_df['vosk_output'].iloc[305]

[{'conf': 1.0, 'end': 0.6000000000000001, 'start': 0.03, 'word': 'desmond'},
 {'conf': 1.0, 'end': 1.35, 'start': 0.6000000000000001, 'word': 'tan'}]
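In the sample above, Vosk returned only two words, so most reference tokens have no timestamps. One simple way to flag such rows is to measure the fraction of non-tag tokens that received both a start and an end time. The helper name `timestamp_coverage` and the example data are assumptions made for illustration.

```python
def is_tag(word):
    """True for boundary markers such as [NRIC_START]."""
    return word.startswith("[") and word.endswith("]")

def timestamp_coverage(aligned):
    """Fraction of non-tag tokens that have both a start and an end time."""
    words = [w for w in aligned if not is_tag(w["word"])]
    if not words:
        return 0.0
    timed = sum(1 for w in words if w["start"] is not None and w["end"] is not None)
    return timed / len(words)

aligned = [
    {"word": "desmond", "start": 0.03, "end": 0.6},
    {"word": "tan", "start": 0.6, "end": 1.35},
    {"word": "his", "start": None, "end": None},
    {"word": "[NRIC_START]", "start": None, "end": None},
    {"word": "s", "start": None, "end": None},
]
coverage = timestamp_coverage(aligned)  # 2 of the 4 non-tag tokens are timed
print(coverage)
```

Rows with low coverage could be reviewed manually or excluded before computing time-based metrics.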

In [451]:
processed_df['aligned_transcript'].iloc[84]

[{'word': 'i', 'start': 0.45, 'end': 0.63},
 {'word': 'lost', 'start': 0.63, 'end': 1.02},
 {'word': 'my', 'start': 1.02, 'end': 1.2},
 {'word': 'passport', 'start': 1.23, 'end': 1.77},
 {'word': 'and', 'start': 1.98, 'end': 2.25},
 {'word': 'it', 'start': 2.31, 'end': 2.49},
 {'word': 'new', 'start': 2.7600000000000002, 'end': 3.09},
 {'word': '[CARDINAL_START]', 'start': None, 'end': None},
 {'word': 'one', 'start': 3.09, 'end': 3.39},
 {'word': '[CARDINAL_END]', 'start': None, 'end': None},
 {'word': 'number', 'start': 4.59, 'end': 5.07},
 {'word': 'is', 'start': 5.1, 'end': 5.46},
 {'word': '[PASSPORT_NUM_START]', 'start': None, 'end': None},
 {'word': 'k', 'start': 5.79, 'end': 6.15},
 {'word': 'seven', 'start': 6.18, 'end': 6.75},
 {'word': 'six', 'start': 6.84, 'end': 7.29},
 {'word': 'zero', 'start': 7.29, 'end': 7.83},
 {'word': 'one', 'start': 7.98, 'end': 8.227933},
 {'word': 'eight', 'start': 8.25, 'end': 8.49},
 {'word': 'four', 'start': 8.85, 'end': 9.21},
 {'word': 'two'

In [452]:
df_with_triplets = extract_pii_tuples(processed_df)

In [453]:
df_with_triplets['pii_tuples'].head(20)

0     [(1.2, 1.68, DATE), (1.77, 2.04, PERSON), (3.6...
1                                  [(3.36, 3.93, DATE)]
2                             [(4.56, 6.39, CAR_PLATE)]
3        [(3.09, 3.78, CARDINAL), (4.5, 4.8, CARDINAL)]
4           [(1.14, 1.65, PERSON), (2.49, 6.63, EMAIL)]
5          [(2.01, 2.67, ORG), (3.39, 5.73, CAR_PLATE)]
6     [(0.99, 3.93, EMAIL), (4.32, 5.22, ORG), (5.49...
7     [(0.96, 5.04, EMAIL), (5.43, 5.82, ORG), (7.14...
8     [(5.1, 5.34, CARDINAL), (6.39, 6.571295, CARDI...
9           [(0.75, 1.17, PERSON), (2.55, 4.83, PHONE)]
10       [(1.8, 2.1, CARDINAL), (3.66, 3.84, CARDINAL)]
11              [(2.28, 2.73, DATE), (2.73, 3.9, DATE)]
12                             [(0.96, 3.3, CAR_PLATE)]
13    [(0.5700000000000001, 5.52, BANK_ACCOUNT), (6....
14        [(0.99, 2.1, ORG), (3.45, 6.3, BANK_ACCOUNT)]
15                         [(6.78, 9.81, BANK_ACCOUNT)]
16    [(0.75, 5.1, CREDIT_CARD), (5.34, 6.21, ORG), ...
17             [(3.84, 4.65, DATE), (5.64, 6.12,

In [454]:
df_with_triplets['pii_tuples'].tail(20)

480    [(2.22, 5.31, EMAIL), (5.82, 7.74, EMAIL)]
481                         [(3.66, 5.94, EMAIL)]
482                         [(1.47, 3.69, EMAIL)]
483                         [(2.16, 4.74, EMAIL)]
484                         [(2.01, 5.07, EMAIL)]
485                           [(2.1, 3.9, EMAIL)]
486                         [(2.13, 4.05, EMAIL)]
487                         [(2.19, 4.32, EMAIL)]
488                         [(2.82, 5.73, EMAIL)]
489                         [(2.61, 5.88, EMAIL)]
490                         [(1.68, 3.87, EMAIL)]
491                         [(2.19, 4.17, EMAIL)]
492                         [(1.98, 4.11, EMAIL)]
493                         [(1.98, 3.99, EMAIL)]
494                     [(2.410002, 5.67, EMAIL)]
495                         [(2.22, 4.14, EMAIL)]
496                         [(2.61, 4.17, EMAIL)]
497                         [(2.16, 4.47, EMAIL)]
498                         [(2.79, 5.13, EMAIL)]
499                         [(1.86, 5.31, EMAIL)]


In [455]:
df_with_triplets['pii_tuples'].to_csv('data/ref_triplets_500.csv', index=False)

## [Archived]

### Heuristics Description (Greedy-based Alignment):

Forced-alignment heuristics are necessary because:
- The **Vosk model** may tokenize words differently compared to the reference transcript, especially for structured data like emails, phone numbers, and other PIIs (Personally Identifiable Information).
- **PII structures vary greatly** (e.g., "rendy.tan@hotmail.com" vs "rendy . tan at hotmail dot com"), and simple word-to-word alignment would fail.
- To achieve robust alignment and accurate timestamp mapping, **manual control** over token splitting and flattening is required based on the entity type.

These heuristics ensure that:
- Common free-text is aligned naturally,
- Structured PII is broken down appropriately for correct timestamp boundary matching.

#### 1. Outside of Entity Boundaries (General Case)

- Tokens are aligned **as-is** with Vosk words.
- No special splitting is done.
- Regular cleaning (punctuation removal except for `"."`) is applied when matching.
- **Example**:
  - Input Transcript: `"reach me at"`
  - Tokens: `["reach", "me", "at"]`
  - Aligned directly without splitting.

#### 2. Inside Entity Boundaries (e.g., [EMAIL_START], [PHONE_START], etc.)

- Special handling is done based on the entity type.

##### (A) EMAIL Entity (`current_entity == 'EMAIL'`)

- Split tokens based on `"."` and `"@"` separators.
- Words like `"at"` are **left intact**.
- **Example**:
  - Input: `"rendy.tan@hotmail.com"`
  - Split into: `["rendy", ".", "tan", "@", "hotmail", ".", "com"]`
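The email rule can be sketched with a single regular expression that keeps the separators as tokens. The function name `split_email_token` is hypothetical.

```python
import re

def split_email_token(token):
    """Split on '.' and '@', keeping both separators as their own tokens."""
    return [part for part in re.split(r"([.@])", token) if part]

email_parts = split_email_token("rendy.tan@hotmail.com")
print(email_parts)
print(split_email_token("at"))  # words without separators are left intact
```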

##### (B) Other Entity Types (`CREDIT_CARD`, `CAR_PLATE`, `BANK_ACCOUNT`, `NRIC`, `PHONE`, `PASSPORT_NUM`)

- **If the token is a spelled-out number** (checked against a dictionary):
  - **Do not split**; keep the word as a single token.
  - **Example**:
    - Input: `"eight"`
    - Output: `["eight"]`

- **If the token is pure digits** (e.g., numbers like `"98005331"`):
  - **Split** into **individual characters**.
  - **Example**:
    - Input: `"98005331"`
    - Output: `["9", "8", "0", "0", "5", "3", "3", "1"]`

- **If the token is a mix of letters and numbers** (e.g., `"AB1234X"`):
  - **Split** into **individual characters** as well.
  - **Example**:
    - Input: `"AB1234X"`
    - Output: `["A", "B", "1", "2", "3", "4", "X"]`
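The three cases for the other entity types can be sketched as follows. `SPELLED_OUT` is a hypothetical subset of the spelled-out-number dictionary, and `split_entity_token` is an illustrative name.

```python
# Hypothetical subset of the spelled-out-number dictionary.
SPELLED_OUT = {"zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"}

def split_entity_token(token):
    """Keep spelled-out numbers whole; split everything else into characters."""
    if token.lower() in SPELLED_OUT:
        return [token]
    return list(token)

print(split_entity_token("eight"))
print(split_entity_token("98005331"))
print(split_entity_token("AB1234X"))
```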

#### 3. When Flattening Entity Tokens (Before Final Alignment)

- After all splitting:
  - **Spelled-out numbers** (like `"eight"`) and **email parts** (like `"hotmail"`) are **kept whole**.
  - Other tokens (numbers, single characters) appear **character-by-character**.

- **Flattening Examples**:
  - Tokens: `["eight", "5", "0"]`
    - Final output: `"eight 5 0"`
  
  - Tokens: `["hotmail", ".", "com"]`
    - Final output: `"hotmail . com"`
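The flattening step can be sketched as a join over the entity's tokens. The names `flatten_entity_tokens` and `SPELLED_OUT` are assumptions, and the dictionary shown is only a subset.

```python
# Hypothetical subset of the spelled-out-number dictionary.
SPELLED_OUT = {"zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"}

def flatten_entity_tokens(tokens, entity_type):
    """Join entity tokens with spaces, splitting non-email tokens that are not number words."""
    flat = []
    for tok in tokens:
        if entity_type == "EMAIL" or tok.lower() in SPELLED_OUT:
            flat.append(tok)   # kept whole
        else:
            flat.extend(tok)   # one item per character
    return " ".join(flat)

phone_flat = flatten_entity_tokens(["eight", "5", "0"], "PHONE")
email_flat = flatten_entity_tokens(["hotmail", ".", "com"], "EMAIL")
print(phone_flat)
print(email_flat)
```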

In [None]:
import json
import os
import tempfile

from pydub import AudioSegment
from vosk import Model, KaldiRecognizer

model_path = "vosk-model-en-us-0.42-gigaspeech"  # model_new
model = Model(model_path)


def align_audio_with_text(audio_path, transcription):
    """Run Vosk on a WAV file and return word-level timestamps.

    Note: `transcription` is accepted but not used; the words come from
    Vosk's own recognition output.
    """
    audio = AudioSegment.from_wav(audio_path)
    recognizer = KaldiRecognizer(model, audio.frame_rate)

    # Re-export to a temporary WAV file so the raw bytes can be streamed to Vosk.
    with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as temp_wav:
        temp_wav_path = temp_wav.name
        audio.export(temp_wav_path, format="wav")

    results = []
    try:
        with open(temp_wav_path, "rb") as wf:
            wf.read(44)  # skip the 44-byte header of a canonical PCM WAV file
            recognizer.SetWords(True)  # request per-word start/end times
            while True:
                data = wf.read(4000)
                if len(data) == 0:
                    break
                if recognizer.AcceptWaveform(data):
                    results.append(json.loads(recognizer.Result()))
            results.append(json.loads(recognizer.FinalResult()))
    finally:
        if os.path.exists(temp_wav_path):
            os.remove(temp_wav_path)

    # Keep only timing and text; chunks without recognised words have no 'result' key.
    return [
        {"start": word["start"], "end": word["end"], "word": word["word"]}
        for result in results
        for word in result.get("result", [])
    ]


audio_dir = "/content/drive/MyDrive/Share/Research/speechNER/finetune/Audio_Files_for_testing"
transcription_file = "/content/drive/MyDrive/Share/Research/speechNER/Alignement_data/Text_with_ids_temp_preprocessed.jsonl"
output_file = "/content/drive/MyDrive/Share/Research/speechNER/Alignement_data/tr_aligned_data_new.jsonl"

# Each line of the JSONL file is one record with at least 'id' and 'text'.
with open(transcription_file, 'r') as f:
    transcriptions = [json.loads(line) for line in f]

# Run Vosk on the audio file that belongs to each transcript (id<N>.wav).
aligned_data = []
for item in transcriptions:
    audio_path = f"{audio_dir}/id{item['id']}.wav"
    aligned_transcription = align_audio_with_text(audio_path, item['text'])
    aligned_data.append({
        "id": item['id'],
        "text": item['text'],
        "align": aligned_transcription
    })

# Write one JSON record per line.
with open(output_file, 'w') as f:
    for item in aligned_data:
        f.write(json.dumps(item) + '\n')

Old Vosk alignment function with greedy matching (Archived)

In [682]:
import re
import string

spelled_out_numbers = {
    'zero': '0', 'one': '1', 'two': '2', 'three': '3', 'four': '4', 'five': '5',
    'six': '6', 'seven': '7', 'eight': '8', 'nine': '9', 'ten': '10',
    'eleven': '11', 'twelve': '12', 'thirteen': '13', 'fourteen': '14',
    'fifteen': '15', 'sixteen': '16', 'seventeen': '17', 'eighteen': '18',
    'nineteen': '19', 'twenty': '20', 'thirty': '30', 'forty': '40', 'fifty': '50',
    'sixty': '60', 'seventy': '70', 'eighty': '80', 'ninety': '90',
    'hundred': '100', 'thousand': '1000'
}

def clean_token(token):
    """Remove punctuation except '.' and lowercase."""
    allowed = '.'
    punctuation_to_remove = ''.join(c for c in string.punctuation if c not in allowed)
    return token.lower().translate(str.maketrans('', '', punctuation_to_remove))

def process_entity_tokens(entity_tokens, char_tokens):
    """Append tokens from char_tokens that are not already in entity_tokens.

    Note: this also drops legitimate repeats (e.g. the second '0' in '98005331').
    """
    for token in char_tokens:
        if token not in entity_tokens:
            entity_tokens.append(token)

def align_transcript_with_vosk(vosk_words, transcript):
    """
    Aligns a reference transcript with Vosk timestamps.
    Handles [XXX_START]... [XXX_END] entities properly.
    """
    tokens = re.findall(r'\[.*?\]|\S+', transcript)  # Tokenize the transcript
    aligned = []
    vosk_idx = 0
    current_entity = None
    entity_tokens = []
    entity_start_time = None
    entity_end_time = None

    entity_types_to_split = ['CREDIT_CARD', 'CAR_PLATE', 'BANK_ACCOUNT', 'NRIC', 'PHONE', 'PASSPORT_NUM']
    
    # Special case for emails: split on '.' and '@' but leave the word 'at' as-is
    def split_email(token):
        # Split on both separators in one pass and keep them as tokens, so that
        # "rendy.tan@hotmail.com" -> ["rendy", ".", "tan", "@", "hotmail", ".", "com"].
        # Tokens without '.' or '@' (e.g. "at") come back unchanged.
        return [p for p in re.split(r'([@.])', token) if p != '']

    i = 0  # Index to keep track of the current token in the list

    while i < len(tokens):
        token = tokens[i]

        # print(f"Current token: {token}")

        if token.endswith('_START]'):
            # Start a new entity
            current_entity = token.replace('[', '').replace(']', '').replace('_START', '')
            entity_tokens = []
            entity_start_time = None
            entity_end_time = None
            i += 1
            continue

        if token.endswith('_END]'):
            # End the current entity
            if current_entity:
                # Flatten the entity and align with timestamps
                flattened_entity = []
                # print(f"New entity tokens: {entity_tokens}")
                
                for t in entity_tokens:
                    # Clean the token
                    clean_token_with_no_symbols = clean_token(t)
                    
                    # Check if the token is a spelled-out number
                    if clean_token_with_no_symbols.lower() in spelled_out_numbers or current_entity == 'EMAIL':
                        # If it's a spelled-out number, don't split it into characters
                        flattened_entity.append(clean_token_with_no_symbols)
                    else:
                        # Otherwise, split the token into characters
                        flattened_entity.extend(list(clean_token_with_no_symbols))
                    
                # Join the characters and align timestamps
                aligned.append({
                    "word": f"[{current_entity}_START] {' '.join(flattened_entity)} [{current_entity}_END]",
                    "start": entity_start_time,
                    "end": entity_end_time
                })
            current_entity = None
            entity_tokens = []
            entity_start_time = None
            entity_end_time = None
            i += 1
            continue

        clean_ref_word = clean_token(token)

        if current_entity:
            # Inside an entity, split the token into characters and modify the tokens list
            if vosk_idx < len(vosk_words):
                vosk_word = vosk_words[vosk_idx]['word']
                if not entity_tokens:
                    entity_start_time = vosk_words[vosk_idx]['start']
                entity_end_time = vosk_words[vosk_idx]['end']

                # Special handling for emails: split valid email format
                if current_entity == 'EMAIL':
                    char_tokens = split_email(token)  # Split email into parts
                    tokens[i:i+1] = char_tokens  # Replace the current token with the split characters

                    # Prevent duplicate tokens and extend entity tokens list
                    process_entity_tokens(entity_tokens, char_tokens)
                    print(f"Entity tokens after email split: {entity_tokens}")
                # Inside the loop where you handle the token splitting:
                elif current_entity in entity_types_to_split:
                    clean_token_with_no_symbols = clean_token(token)  # Clean token

                    # Check if the token is a spelled-out number
                    if clean_token_with_no_symbols.lower() in spelled_out_numbers.keys():
                        # If it's a spelled-out number, don't split it
                        char_tokens = [clean_token_with_no_symbols]  # Keep the token as is
                    else:
                        # If it's not a spelled-out number, split it into characters
                        char_tokens = list(clean_token_with_no_symbols)

                    # Modify the tokens list in place by extending with the character tokens
                    tokens[i:i+1] = char_tokens

                    # Prevent duplicates and extend entity tokens list
                    process_entity_tokens(entity_tokens, char_tokens)

                    print(f"Entity tokens after split: {entity_tokens}")

                vosk_idx += 1
            else:
                # No more Vosk words left (shouldn't happen usually)
                entity_tokens.append(token)

        else:
            # Outside entity: pair the token with the next Vosk word, if any remain
            if vosk_idx < len(vosk_words):
                aligned.append({
                    "word": token,
                    "start": vosk_words[vosk_idx]['start'],
                    "end": vosk_words[vosk_idx]['end']
                })
                vosk_idx += 1

        i += 1  # Move to the next token
        # print(tokens)

    return aligned