### Normalization consists of the following line-by-line process:

##### 1. Remove all lines containing nan or non-English characters
- For best results, this should happen after the transcript is shortened to match `words`
- Small complication: before the later cleaning steps run, some words don't look like English yet. Chicken or the egg?

##### 2. Collapse 3+ consecutive occurrences of the same letter to 2 letters, e.g. moooooo -> moo (timestamps unchanged)
- Some words, like hmm, moo, bzz, need two consecutive letters
- Most words, like fuuck, do not
- So after reducing 3+ occurrences to 2, is a dictionary/wordfreq check good enough to decide whether one more letter should be deleted?
- The most principled way to do this would be to check how such words are tokenized by the Whisper model
- Cases to consider: woo, oh, no, ah, eyow, go, hm | hmm
- Leave lalala alone
- Change yey to yay
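
The dictionary-check idea above can be sketched roughly as follows. This is only an illustration: `KNOWN_WORDS` is a hypothetical stand-in for a real dictionary or wordfreq lookup, and `normalize_repeats` is not a function from the notebook.

```python
import re

# Hypothetical vocabulary; a dictionary or word-frequency lookup would
# play this role in practice.
KNOWN_WORDS = {"moo", "hmm", "bzz", "woo", "no", "oh", "yes", "go"}

def collapse_to_two(word):
    """Collapse any run of 3+ identical characters down to exactly 2."""
    return re.sub(r"(.)\1{2,}", r"\1\1", word)

def normalize_repeats(word, known_words=KNOWN_WORDS):
    """Collapse long runs, then drop the extra letter only if that gives a known word."""
    collapsed = collapse_to_two(word)
    if collapsed in known_words:
        return collapsed          # e.g. "moo", "hmm" keep their double letter
    single = re.sub(r"(.)\1", r"\1", collapsed)
    if single in known_words:
        return single             # e.g. "noo" becomes "no"
    return collapsed              # unknown either way: leave it for later steps

for raw in ["moooooo", "nooo", "hmmmm", "yesss", "lalala"]:
    print(raw, "->", normalize_repeats(raw))
```

Checking the doubled form first matters: if the vocabulary contains both the doubled and the single-letter spelling, the doubled one wins, so a legitimate double letter is not removed.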

##### 3. Check every word to see if it's a word
- Might be better to use known dictionary on first pass to get dictionary-standard words before dealing with slang/spelling variants
- A backup word test could be whether the word passes a zipf_frequency threshold
- If not, check whether it should be combined with one or more nearby word fragments to create an actual word, e.g. "ci ty" -> "city", or whether a double letter should be changed to a single letter, e.g. "yees" -> "yes"
- Deal with misspellings by combining spell-checker and phonetics-checker
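
A minimal sketch of the fragment-combining idea, assuming a plain set of known words. `VOCAB`, `merge_fragments`, and `max_span` are hypothetical names used only for illustration; in the notebook the word test would come from the dictionary check instead.

```python
# Hypothetical vocabulary standing in for a real dictionary check.
VOCAB = {"i", "turn", "away", "from", "the", "wall", "city"}

def merge_fragments(tokens, vocab=VOCAB, max_span=3):
    """Join a non-word token with the tokens that follow it when the result is a word."""
    merged = []
    i = 0
    while i < len(tokens):
        if tokens[i] in vocab:
            merged.append(tokens[i])
            i += 1
            continue
        # Try the shortest join first so as little as possible is merged.
        for span in range(2, max_span + 1):
            candidate = "".join(tokens[i:i + span])
            if i + span <= len(tokens) and candidate in vocab:
                merged.append(candidate)
                i += span
                break
        else:
            # No join produced a word: keep the fragment unchanged.
            merged.append(tokens[i])
            i += 1
    return merged

print(merge_fragments("i turn away from the wa ll".split()))
print(merge_fragments(["ci", "ty"]))
```

Timestamps are not handled here; a merged word would presumably take the start of its first fragment and the end of its last.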


### To-Do:

##### 1. Combine word chunks and separate illegal compound words
- Separated words work pretty well, but results are worsened by missing apostrophes, as in "we've"
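
One possible way to handle the apostrophe problem is a small lookup applied before any splitting. The table below is a hypothetical, hand-picked sample, not a list from this project, and ambiguous forms such as "well" or "ill" are deliberately left out.

```python
# Hypothetical sample of apostrophe-less contractions and their restored forms.
CONTRACTIONS = {
    "weve": "we've",
    "dont": "don't",
    "youre": "you're",
    "ive": "i've",
}

def restore_apostrophes(tokens, table=CONTRACTIONS):
    """Replace known apostrophe-less contractions; leave every other token alone."""
    return [table.get(token, token) for token in tokens]

print(restore_apostrophes("weve got it dont worry".split()))
```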

##### 2. Cut off transcripts at beginning and end if they don't match with words
- This should also involve changing audio chunks; will need to write down new start and end times of each chunk
- Shortening audio chunks should be automated (with quality/file type preserved)
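
As a rough sketch of the bookkeeping involved, the new start and end times could be read off the first and last word that survive the cut. `trimmed_bounds` is a hypothetical helper and the timestamps are made up; actually cutting the audio is not covered here.

```python
def trimmed_bounds(words, first_kept, last_kept):
    """Return (start, end) in seconds covering words[first_kept..last_kept]."""
    if not 0 <= first_kept <= last_kept < len(words):
        raise ValueError("kept range must lie inside the word list")
    return words[first_kept]["start"], words[last_kept]["end"]

# Timestamps below are made up for illustration.
words = [
    {"word": "you", "start": 0.0, "end": 0.219},
    {"word": "away", "start": 0.30, "end": 0.71},
    {"word": "with", "start": 0.75, "end": 0.98},
    {"word": "me", "start": 1.02, "end": 1.40},
]

# Keep only "away with": the chunk would now run from 0.30 s to 0.98 s.
new_start, new_end = trimmed_bounds(words, 1, 2)
print(new_start, new_end)
```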

In [2]:
import pandas as pd
import ast
from wordfreq import zipf_frequency
import jiwer
import re
import numpy as np
import enchant
from spellchecker import SpellChecker
from itertools import combinations
from transformers import WhisperProcessor
import fuzzy

In [3]:
# First and last names used in checking if lyrics are valid words
df_male = pd.read_csv('./data/male.txt',header=None,names=["name"])
df_female = pd.read_csv('./data/female.txt',header=None,names=["name"])
df_last = pd.read_csv('./data/Names_2010Census.csv',usecols = ['name'])

df_names = pd.concat([df_male, df_female,df_last], ignore_index=True)

# British spellings used to correct spelling later
brit_to_us = pd.read_json('./data/british_to_american_sp.json', orient='index')
brit_to_us.reset_index(inplace=True)
brit_to_us.columns = ['british', 'american']
brit_to_us = brit_to_us[~brit_to_us['british'].isin(['aeroplane', 'aluminium', 'buses', 'axe', 'aesthetic'])]

In [4]:
# Clean metadata-full-lines.csv to have the same format as metadata-lines.csv
# (the rename below swaps the 'words' and 'transcript' column names)
df = pd.read_csv("../../data/metadata-full-lines.csv", usecols=["filename", "words", "transcript"])
df = df.rename(columns={"words": "transcript", "transcript": "words"})
df = df[["filename", "words", "transcript"]]

In [5]:
# Remove bad songs (non-English, transcription/lyrics issues, etc.)
df_bad = pd.read_csv('./data/bad_songs.csv')
df_bad['filename'] = df_bad['filename'].str.replace(r'^\d+\.\s*', '', regex=True)   # remove number + period + spaces at the start
df_bad['filename'] = df_bad['filename'].str.replace(r'\s+', '', regex=True) # delete whitespace

# Build a tuple of bad prefixes
bad_prefixes = tuple(df_bad['filename'].values)

# Filter rows where 'filename' starts with any bad prefix
mask = df['filename'].str.startswith(bad_prefixes)

df = df[~mask]

In [6]:
# Correctly format all lines (including nan formatting)
def parse_and_check_for_nan(val):
    if isinstance(val, str):
        nan_count = len(re.findall(r"'word': nan", val))
        if nan_count > 0:
            print(f"'word': nan appears {nan_count} times")

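        # Swap the bare nan for None so the string is a valid Python literal,
        # then parse it with ast.literal_eval instead of executing it with eval
        val_fixed = re.sub(r"'word': nan", "'word': None", val)
        try:
            parsed = ast.literal_eval(val_fixed)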
            if any(pd.isna(item.get('word')) for item in parsed):
                return None
            return parsed
        except Exception:
            return None
    return None

# Format all lines and remove those containing nan
df['words'] = df['words'].apply(parse_and_check_for_nan)
df = df.dropna(subset=['words']).reset_index(drop=True)

In [7]:
df[['filename','transcript','words']].iloc[150]

filename                 00d5c65d644c4a549e7d501d65397b7b-3.wav
transcript                                     you away with me
words         [{'word': 'you', 'start': 0.0, 'end': 0.219}, ...
Name: 150, dtype: object

In [8]:
# Clean text: delete punctuation (keep apostrophes), collapse multiple spaces
df['transcript_no_punct'] = (
    df['transcript']
    .str.replace(r'(?<=\S)-(?=\S)', ' ', regex=True)  # Hyphen between two non-space characters becomes a space
    .str.replace(r'\s*-\s*', '', regex=True) # Any remaining hyphen is removed together with adjacent whitespace
    .str.replace(r"[^\w\s'-]", '', regex=True)  # Remove unwanted punctuation
    .str.replace(r"_", "", regex=True)          # Remove underscores
    .str.replace(r"\s+", ' ', regex=True)       # Collapse multiple spaces
    .str.strip()
    .str.replace(r"^'+(.*?)'+$", r"\1", regex=True) # Happens twice to remove outer apostrophes for things like ''you''
    .str.replace(r"^'+(.*?)'+$", r"\1", regex=True) # See above
)

# Clean 'words' by deleting duplicate timestamps
def remove_duplicate_dicts(lst):
    seen = set()
    result = []
    for d in lst:
        key = tuple(sorted(d.items()))
        if key not in seen:
            seen.add(key)
            result.append(d)
    return result

# Clean 'words' by deleting timestamps which end before they start
def remove_invalid_time_dicts(lst):
    return [d for d in lst if d.get('end', 0) >= d.get('start', 0)]

# Get rid of duplicate and invalid timestamps
df['words'] = df['words'].apply(lambda lst: remove_invalid_time_dicts(remove_duplicate_dicts(lst)) if isinstance(lst, list) else lst)

# Create column without timestamps
df['words_no_time'] = df['words'].apply(lambda lst: " ".join(item['word'] for item in lst) if isinstance(lst, list) else "")

In [9]:
# Keep each transcript word whose letters appear, in order, somewhere in the concatenated aligned words
def extract_aligned_words(row):
    transcript_words = row['transcript_no_punct'].split()
    words_no_time = row['words_no_time'].replace(" ", "").replace("'", "")

    aligned_words = []
    for word in transcript_words:
        clean_word = word.replace("'", "")
        pointer = 0
        for char in clean_word:
            pointer = words_no_time.find(char, pointer)
            if pointer == -1:
                break
            pointer += 1
        else:
            aligned_words.append(word)

    return " ".join(aligned_words)

df['transcript_words_align'] = df.apply(extract_aligned_words, axis=1)

# Same check, but matches must advance left to right through the aligned words (running pointer)
def intersect_words_with_transcript_align(row):
    transcript_words = row['transcript_words_align'].split()
    words_no_time_clean = row['words_no_time'].replace(" ", "").replace("'", "")
    
    result_words = []
    pointer = 0

    for word in transcript_words:
        clean_word = word.replace("'", "")
        temp_pointer = pointer  # Start checking from current position

        for char in clean_word:
            temp_pointer = words_no_time_clean.find(char, temp_pointer)
            if temp_pointer == -1:
                break
            temp_pointer += 1
        else:
            # If the entire word matched, update the main pointer and keep the word
            pointer = temp_pointer
            result_words.append(word)

    return " ".join(result_words)


df['transcript_words_intersection'] = df.apply(intersect_words_with_transcript_align, axis=1)
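
The running-pointer matching used above can be illustrated on single strings, without a DataFrame. This is a standalone sketch with a hypothetical function name, `keep_words_in_order`, applied to two rows from this dataset.

```python
def keep_words_in_order(transcript, aligned_text):
    """Keep transcript words whose letters can be found, in order, in aligned_text."""
    haystack = aligned_text.replace(" ", "").replace("'", "")
    kept = []
    pointer = 0  # position in haystack after the last accepted word
    for word in transcript.split():
        probe = pointer
        for char in word.replace("'", ""):
            probe = haystack.find(char, probe)
            if probe == -1:
                break  # this letter is missing: drop the word
            probe += 1
        else:
            # Every letter was found: accept the word and advance the pointer.
            pointer = probe
            kept.append(word)
    return " ".join(kept)

print(keep_words_in_order("when the dream is gone", "thedream is gone"))
print(keep_words_in_order("the road is narrow and long", "road is narrow and"))
```

Note that the letters only have to appear in order, not contiguously, so a short word can match letters scattered across several aligned words. That looseness is one likely source of the remaining disagreements.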


In [10]:
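# If the aligned words (ignoring spaces/apostrophes) occur as one contiguous run in the
# transcript, return that run with the transcript's original spacing and apostrophes
def restore_from_reference_row(row):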
    compressed = row['words_no_time']
    reference = row['transcript_no_punct']

    compressed_clean = compressed.replace(" ", "").replace("'", "")
    reference_clean = reference.replace(" ", "").replace("'", "")

    start_index = reference_clean.find(compressed_clean)
    if start_index == -1:
        return ""

    result = []
    ref_char_pos = 0
    matched_chars = 0

    for char in reference:
        if char not in {" ", "'"}:
            if ref_char_pos >= start_index and matched_chars < len(compressed_clean):
                result.append(char)
                matched_chars += 1
            elif matched_chars > 0 and matched_chars < len(compressed_clean):
                result.append(char)
            ref_char_pos += 1
        elif matched_chars > 0 and matched_chars < len(compressed_clean):
            result.append(char)

        if matched_chars == len(compressed_clean):
            break

    return ''.join(result).strip()

df['transcript_words_restored'] = df.apply(restore_from_reference_row, axis=1)

In [11]:
# Returns True when the two columns differ after ignoring whitespace and apostrophes
def check_non_whitespace_apostrophe_match(row):
    a = re.sub(r"[ '\t\n\r\f\v]", "", row['words_no_time'])
    b = re.sub(r"[ '\t\n\r\f\v]", "", row['transcript_words_restored'])

    return a != b

# Apply across the DataFrame
disagreements = df.apply(check_non_whitespace_apostrophe_match, axis=1)
disagreement_count = disagreements.sum()
mismatches = df[disagreements]

print(f"Number of disagreeing rows (ignoring whitespace/apostrophes): {disagreement_count}")


Number of disagreeing rows (ignoring whitespace/apostrophes): 162


In [12]:
# This is not currently needed
# Used to inspect the differences between lines and words
#df[['filename','transcript_no_punct','words_no_time','transcript_words_restored']].iloc[29763]
#pd.set_option("display.max_rows", None) 
#pd.set_option("display.max_columns", None)
#pd.set_option("display.width", 0)  # Automatically fit to content width
#pd.set_option("display.max_colwidth", None)
#pd.set_option("display.expand_frame_repr", False)  # Disable line wrapping for wide frames
#mismatches = df[disagreements]
#print(mismatches[['filename','transcript_no_punct','words_no_time']])

In [13]:
pd.set_option('display.max_colwidth', None)
pd.set_option("display.max_rows", None) 
pd.set_option("display.width", 0)  
df[['transcript_no_punct','words_no_time','transcript_words_restored']].head(40)

| | transcript_no_punct | words_no_time | transcript_words_restored |
|---|---|---|---|
| 0 | life is a moment in space | life is a moment in space | life is a moment in space |
| 1 | when the dream is gone | thedream is gone | the dream is gone |
| 2 | it's a lonelier place | alonelier | a lonelier |
| 3 | i kiss the morning goodbye | i kiss the morning goodbye | i kiss the morning goodbye |
| 4 | butdown inside | inside | inside |
| 5 | you know we never know why | you know we never know why | you know we never know why |
| 6 | the road is narrow and long | road is narrow and | road is narrow and |
| 7 | when eyes meet eyes | eyes meet | eyes meet |
| 8 | and the feeling is strong | thefeeling is strong | the feeling is strong |
| 9 | i turn away from the wall | i turn away from the wa ll | i turn away from the wall |


In [14]:
# Below is a collection of functions to further clean the dataset later

# Check if a string contains non-English characters
def is_non_english(line,freq_threshold=3.0, ratio_threshold=0.5):
    # Returns True if line has characters outside basic English alphabet and punctuation
    if bool(re.search(r"[^a-zA-Z0-9\s.,?!'\"-]", line)):
        return True
    words = line.split()
    if not words:
        return False  # Don't flag empty or punctuation-only lines

    englishish = [zipf_frequency(word, 'en') >= freq_threshold for word in words]
    english_ratio = sum(englishish) / len(englishish)

    return english_ratio < ratio_threshold

# Check if a string contains the same character 3 or more times in a row
def three_or_more_repeats(text):
    return bool(re.search(r"(.)\1{2,}", text))

# Collapse 3+ letters to 2
def collapse_repeats(text):
    return re.sub(r'(.)\1{2,}', r'\1\1', text)

# Further collapse certain words from a doubled letter to a single letter
def collapse_known_repeats(text):
    known_patterns = {"noo", "whoo", "ohh", "yess", "goo", "aah", "woahh", "laa", "poww", "hii", "heyy", "ayy", "okay", "byee"}

    def collapse(word):
        if word in known_patterns:
            # Collapse all double letters in the word down to one occurrence
            word = re.sub(r'(.)\1+', r'\1', word)
        return word

    return ' '.join(collapse(word) for word in text.split())


d = enchant.Dict("en_US")
spell = SpellChecker()

def is_dictionary_word(word):
    if word in {'a','i'}:
        return True
    if len(word)>1:
        if d.check(word):
            return True
        # spell allows for "words" like "ni", "th", etc...
        #if word in spell:
            #return True
    return False

def is_name(word):
    if (df_names['name'].str.lower() == word).any():
        #print('is a name')
        return True
    return False

# For a given nonword, break it into 2+ pieces to see if those pieces are words
def split_and_check(word):
    length = len(word)

    for num_pieces in range(2, 6): # range(2, 4)
        # Generate all possible split positions for num_pieces
        for split_points in combinations(range(1, length), num_pieces - 1):
            indices = (0,) + split_points + (length,)
            pieces = [word[indices[i]:indices[i + 1]] for i in range(len(indices) - 1)]

            if all(is_dictionary_word(piece) for piece in pieces):
                if num_pieces >= 4:
                    print('found a',num_pieces, 'parter:',word,pieces)
                return pieces

    return False


def maybe_is_a_word(word):
    pass
    # make use of zipf_frequency here

# Used for output formatting
def add_quotes(word):
    return f'"{word}"'
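
The still-empty `maybe_is_a_word` could plausibly become a two-stage check: dictionary first, frequency as a fallback. The sketch below takes both checks as parameters so it runs on its own; `classify_word`, the toy dictionary, and the toy frequencies are all hypothetical, and the threshold of 3.0 is only an assumption.

```python
def classify_word(word, in_dictionary, frequency_fn, threshold=3.0):
    """Label a word as 'dictionary', 'frequent' (passes the fallback), or 'unknown'."""
    if in_dictionary(word):
        return "dictionary"
    if frequency_fn(word) >= threshold:
        return "frequent"
    return "unknown"

# Toy stand-ins; the frequency values are made up for illustration.
TOY_DICTIONARY = {"darling", "city"}
TOY_FREQUENCIES = {"gonna": 5.0, "im": 4.0, "dyou": 1.0}

def toy_in_dictionary(word):
    return word in TOY_DICTIONARY

def toy_frequency(word):
    # Unknown words get 0.0, which is also what zipf_frequency returns for them.
    return TOY_FREQUENCIES.get(word, 0.0)

for w in ["darling", "gonna", "dyou"]:
    print(w, classify_word(w, toy_in_dictionary, toy_frequency))
```

In the notebook the two callables would be `is_dictionary_word` and something like `lambda w: zipf_frequency(w, 'en')`.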

In [15]:
# Correct misspellings based on spell checker and phonetics checker
dmeta = fuzzy.DMetaphone()

def correct_misspelling(misspelled_word):
    goal_phonetics = dmeta(misspelled_word)
    matches = []
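    # candidates() can return None in newer pyspellchecker releases, so fall back to an empty list
    for word in spell.candidates(misspelled_word) or []: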
        if dmeta(word) == goal_phonetics:
            matches.append(word)
    if len(matches) == 0:
        print(misspelled_word,"is likely misspelled, but couldn't find a match!")
    if len(matches) == 1:
        print("Found one match:",add_quotes(misspelled_word),"corrects to",add_quotes(matches[0]))
    if len(matches) > 1:
        print("Found more than one match for",add_quotes(misspelled_word),"so no action taken.")

correct_misspelling("yay")


Found more than one match for "yay" so no action taken.
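
For the function to be usable in the cleaning loop it would eventually need to return the correction instead of printing it. A minimal sketch of that selection rule follows, with the candidate list and the phonetic key passed in. `crude_phonetic_key` is a deliberately simplistic, hypothetical stand-in for Double Metaphone, used only so the example runs without extra packages.

```python
def crude_phonetic_key(word):
    """Very rough stand-in for a phonetic code: first letter plus later consonants."""
    head, tail = word[:1], word[1:]
    return head + "".join(c for c in tail if c not in "aeiouy")

def pick_phonetic_match(word, candidates, phonetic_key=crude_phonetic_key):
    """Return the single candidate that sounds like `word`, or None if zero or several do."""
    target = phonetic_key(word)
    matches = sorted(c for c in candidates if phonetic_key(c) == target)
    return matches[0] if len(matches) == 1 else None

print(pick_phonetic_match("luv", {"love", "lug", "lux"}))   # one phonetic match
print(pick_phonetic_match("yey", {"yay", "yes", "yea"}))    # ambiguous, so None
```

Returning `None` in the ambiguous case mirrors the "no action taken" branch above; the caller can then leave the word untouched.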


In [16]:
# Perform all 3 word checks to see which the word passes
word = 'll'
print(d.check(word), word in spell, (df_names['name'].str.lower() == word).any())

True False False


In [17]:
# Check if something is a name, and if so, find its index
word = 'ant'
print(is_name(word))

matches = df_names['name'].str.lower() == word.lower()

if matches.any():
    index = matches.idxmax()  # Returns the first True index
    print(f"Found at index: {index}")
else:
    print("Not found.")

False
Not found.


In [18]:
# Instantiate the tokenizer
model_name = "openai/whisper-base"
language = "english" # Change to your dataset's language
task = "transcribe" # Use "translate" if you're translating to English

processor = WhisperProcessor.from_pretrained(model_name, language=language, task=task)
tokenizer = processor.tokenizer

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [19]:
# What does zipf_frequency do?
zipf_frequency("im", lang="en")

def is_known_word(word, threshold=4.0):
    return zipf_frequency(word, 'en') >= threshold

In [20]:
# Test how a word is tokenized here
word = 'dyou'
tokens = tokenizer.tokenize(word)
print(tokens)

['dy', 'ou']


In [54]:
# Test if things are words here
word = "darling"
print(is_dictionary_word(word))

name = 'francisco'
print(is_name(name))

True
True


In [61]:
# count WER score of interest
count=0
# count the number of occurrences of 3+ consecutive letters
transcript_count=0
# count the number of non-English character occurrences
nonenglish_count=0

for i in range(len(df.words)):  #len(df.words)

    result = []
    collapsed_transcript_list = []
    split_transcript_list = []
    tokenized_transcript = []
    tokenized_result = []

    # pull the original line transcript and get rid of junk repeat letters like woooooo
    transcript = df.transcript_no_punct[i]
    transcript = collapse_repeats(transcript)
    transcript = collapse_known_repeats(transcript)

    # collapse repeat letters in the transcript
    transcript_list = transcript.split(" ")
    for word in transcript_list:
        collapsed_transcript_list.append(collapse_repeats(word))

    transcript_list = collapsed_transcript_list

    # break up incorrect compound words in the transcript
    for word in transcript_list:
        if is_dictionary_word(word) or is_name(word):
            # keep valid words and names unchanged (previously they were dropped)
            split_transcript_list.append(word)
            continue
        split_attempt = split_and_check(word)
        if split_attempt is False:
            split_transcript_list.append(word)
        else:
            #print('we found a split! originally:',word)
            split_transcript_list.append(" ".join(split_attempt))

    transcript_list = split_transcript_list

    # tokenize the transcript
    for word in transcript_list:
        token = tokenizer.tokenize(word)
        tokenized_transcript.append(token)

    transcript = " ".join(transcript_list)
    # check if transcript contains non-English words and chars
    status = is_non_english(transcript)
    if status:
        #print(transcript, df.filename[i])
        nonenglish_count += status

    # loop over words to append them one at a time to result
    for obj in df.words[i]:                
        result.append(obj['word'])
        token = tokenizer.tokenize(obj['word'])
        tokenized_result.append(token)
    line = " ".join(result)
    
    # compute WER to assess quality of each line
    wer_score = jiwer.wer(line, transcript)
    if wer_score>.1:
        #print(line+',',transcript+',', wer_score)
        count += 1

    #if status:
        #print("orig. concat:",line+',',"token concat:", tokenized_result,transcript+',',"token concat:",tokenized_transcript, "WER score:", wer_score)

found a 4 parter: endeavour ['en', 'de', 'av', 'our']
found a 4 parter: lalalala ['la', 'la', 'la', 'la']
found a 4 parter: dadadadada ['dad', 'ad', 'a', 'dada']
found a 4 parter: dadadadada ['dad', 'ad', 'a', 'dada']
found a 4 parter: dadadadada ['dad', 'ad', 'a', 'dada']
found a 4 parter: dadadadadoo ['dada', 'dad', 'ad', 'oo']
found a 4 parter: dadadadada ['dad', 'ad', 'a', 'dada']
found a 4 parter: dadadadada ['dad', 'ad', 'a', 'dada']
found a 4 parter: dadadadada ['dad', 'ad', 'a', 'dada']
found a 4 parter: dadadadadoo ['dada', 'dad', 'ad', 'oo']
found a 4 parter: dadadadada ['dad', 'ad', 'a', 'dada']
found a 4 parter: dadadadada ['dad', 'ad', 'a', 'dada']
found a 4 parter: dadadadada ['dad', 'ad', 'a', 'dada']
found a 4 parter: dadadadadoo ['dada', 'dad', 'ad', 'oo']
found a 4 parter: siberian ['sib', 'er', 'i', 'an']
found a 5 parter: transiberian ['trans', 'ib', 'er', 'i', 'an']
found a 4 parter: dadadadada ['dad', 'ad', 'a', 'dada']
found a 4 parter: dadadadada ['dad', 'ad', '
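
For reference, the quantity thresholded at 0.1 in the loop above is the word error rate. A plain-Python sketch of the standard definition (word-level edit distance divided by the number of reference words) is given below; it is meant only to show what is being measured, not to replace `jiwer`, and `word_error_rate` is a hypothetical name.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

print(word_error_rate("the dream is gone", "when the dream is gone"))
```

Because the denominator is the reference length, the score is not symmetric: which of the aligned words and the transcript is passed as the reference changes the result.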

In [21]:
change_count = 0

for _, row in brit_to_us.iterrows():
    british_word = row['british']
    american_word = row['american']
    pattern = r'\b' + re.escape(british_word) + r'\b'  # Match whole word

    for i in range(len(df)):
        original_text = df.loc[i, 'transcript']
        updated_text = re.sub(pattern, american_word, original_text)

        if original_text != updated_text:
            print(f"Original: {original_text}")
            print(f"Updated:  {updated_text}\n")
            df.loc[i, 'transcript'] = updated_text
            change_count += 1

print(f"Total replacements made: {change_count}")

Original: follow the aeon path
Updated:  follow the eon path

Original: follow the aeon path
Updated:  follow the eon path

Original: follow the aeon path
Updated:  follow the eon path

Original: now let's not analyse --
Updated:  now let's not analyze --

Original: don't analyse don't analyse
Updated:  don't analyze don't analyze

Original: don't analyse don't analyse
Updated:  don't analyze don't analyze

Original: i won't apologise
Updated:  i won't apologize

Original: and now apologise
Updated:  and now apologize

Original: i might never be your knight in shining armour
Updated:  i might never be your knight in shining armor

Original: and you know you where their armour
Updated:  and you know you where their armor

Original: 'cos i've cancelled the milk
Updated:  'cos i've canceled the milk

Original: from a-mail-order catalogue,
Updated:  from a-mail-order catalog,

Original: without the centre of my life
Updated:  without the center of my life

Original: colour in a rainbow
Upd