### Normalization consists of the following line-by-line process:

##### 1. Remove all lines containing nan or non-English characters
- For best results, this should happen after transcript is shortened to match with words
- Small complication here: before subsequent cleaning, some words don't look like English. Chicken or the egg?

##### 2. Collapse 3+ consecutive occurrences of same letter to 2 letters, e.g. moooooo -> moo (timestamps unchanged)
- Some words, like hmm, moo, bzz need two consecutive letters
- Most words, like fuuck, do not
- So after reducing 3+ occurrences to 2, is a dictionary/wordfreq check good enough to say whether an additional letter should be deleted?
- Most legit way to do this would be to check how such words are tokenized in Whisper model
- woo, oh, no, ah, eyow, go, hm | hmm, 
- leave lalala alone
- change yey to yay

##### 3. Check every word to see if it's a word
- Might be better to use known dictionary on first pass to get dictionary-standard words before dealing with slang/spelling variants
- Backup word test could be passes zipf_frequency test
- If not, check if it should be combined with a nearby word fragment(s) to create an actual word, e.g. ci ty -> city. Or a double letter should be changed to a single letter, e.g. "yees" -> "yes"
- Deal with misspellings by combining spell-checker and phonetics-checker


### To-Do:

##### 1. Combine word chunks and separate illegal compound words
- Separated words work pretty good, but they're worsened by the absence of apostrophes like in "we've"

##### 2. Cut off transcripts at beginning and end if they don't match with words
- This should also involve changing audio chunks; will need to write down new start and end times of each chunk
- Shortening audio chunks should be automated (with quality/file type preserved)

In [54]:
import pandas as pd
import ast
from wordfreq import zipf_frequency
import jiwer
import re
import numpy as np
import enchant
from spellchecker import SpellChecker
from itertools import combinations
from transformers import WhisperProcessor
import fuzzy

In [55]:

#df = pd.read_csv('../../data/metadata-lines.csv')

# First and last names used in checking if lyrics are valid words
df_male = pd.read_csv('./Names/male.txt',header=None,names=["name"])
df_female = pd.read_csv('./Names/female.txt',header=None,names=["name"])
df_last = pd.read_csv('./Names/Names_2010Census.csv',usecols = ['name'])

df_names = pd.concat([df_male, df_female,df_last], ignore_index=True)

In [None]:
# Clean metadata-full-lines.csv to have same format as metadata-lines.csv
df = pd.read_csv("../../data/metadata-full-lines.csv", usecols=["filename", "words", "transcript"])
df = df.rename(columns={"words": "transcript", "transcript": "words"})
df = df[["filename", "words", "transcript"]]

In [57]:
# Clean text: delete puctuation (keep apostrophes), collapse multiple spaces
df['transcript_no_punct'] = (
    df['transcript']
    .str.replace(r'(?<=\S)-(?=\S)', ' ', regex=True)  # Hyphen surrounded by non-space
    .str.replace(r'\s*-\s*', '', regex=True) # Hyphen with space on either side
    .str.replace(r"[^\w\s'-]", '', regex=True)  # Remove unwanted punctuation
    .str.replace(r"\s+", ' ', regex=True)       # Collapse multiple spaces
    .str.strip()
)

In [80]:
pd.set_option('display.max_colwidth', None)
pd.set_option("display.max_rows", None) 
df[['transcript','transcript_no_punct']].head(5)

Unnamed: 0,transcript,transcript_no_punct
0,life is a moment in space.,life is a moment in space
1,"when the-dream is gone,",when the dream is gone
2,it's a-lonelier place.,it's a lonelier place
3,"i kiss the morning goodbye,",i kiss the morning goodbye
4,butdown inside,butdown inside


In [None]:
# Correctly format all lines (including nan formatting)
def parse_and_check_for_nan(val):
    if isinstance(val, str):
        nan_count = len(re.findall(r"'word': nan", val))
        if nan_count > 0:
            print(f"'word': nan appears {nan_count} times")

        val_fixed = re.sub(r"'word': nan", "'word': np.nan", val)
        try:
            parsed = eval(val_fixed, {"np": np})
            if any(pd.isna(item.get('word')) for item in parsed):
                return None
            return parsed
        except Exception:
            return None
    return None

# Format all lines and remove those containing nan
df['words'] = df['words'].apply(parse_and_check_for_nan)
df = df.dropna(subset=['words']).reset_index(drop=True)

"[{'word': 'life', 'start': 0.0, 'end': 0.177}, {'word': 'is', 'start': 0.353, 'end': 0.53}, {'word': 'a', 'start': 0.707, 'end': 0.884}, {'word': 'moment', 'start': 1.06, 'end': 1.767}, {'word': 'in', 'start': 1.767, 'end': 1.944}, {'word': 'space', 'start': 2.121, 'end': 3.005}]"

In [70]:
# Below is a collection of functions to further clean the dataset later

# Check if a string contains non-English characters
def is_non_english(line,freq_threshold=3.0, ratio_threshold=0.5):
    # Returns True if line has characters outside basic English alphabet and punctuation
    if bool(re.search(r"[^a-zA-Z0-9\s.,?!'\"-]", line)):
        return True
    words = [word for word in line.split()]
    if not words:
        return False  # Don't flag empty or punctuation-only lines

    englishish = [zipf_frequency(word, 'en') >= freq_threshold for word in words]
    english_ratio = sum(englishish) / len(englishish)

    return english_ratio < ratio_threshold

# Check if a string contains the same character 3 or more times in a row
def three_or_more_repeats(text):
    return bool(re.search(r"(.)\1{2,}", text))

# Collapse 3+ letters to 2
def collapse_repeats(text):
    return re.sub(r'(.)\1{2,}', r'\1\1', text)

# Further collapse certain words from 1 repeat to 0 repeats
def collapse_known_repeats(text):
    known_patterns = {"noo", "whoo", "ohh", "yess", "goo", "aah", "woahh", "laa", "poww", "hii", "heyy", "ayy", "okay", "byee"}    
    
    def collapse(word):
            if word in known_patterns:
                # Collapse all double letters in the word down to one occurrence
                word = re.sub(r'(.)\1+', r'\1', word)
            return word

    return ' '.join(collapse(word) for word in text.split())


d = enchant.Dict("en_US")
spell = SpellChecker()

def is_dictionary_word(word):
    if word in {'a','i'}:
        return True
    if len(word)>1:
        if d.check(word):
            return True
        # spell allows for "words" like "ni", "th", etc...
        #if word in spell:
            #return True
    return False

def is_name(word):
    if (df_names['name'].str.lower() == word).any():
        #print('is a name')
        return True
    return False

# For a given nonword, break it into 2+ pieces to see if those pieces are words
def split_and_check(word):
    length = len(word)

    for num_pieces in range(2, 4):
        # Generate all possible split positions for num_pieces
        for split_points in combinations(range(1, length), num_pieces - 1):
            indices = (0,) + split_points + (length,)
            pieces = [word[indices[i]:indices[i + 1]] for i in range(len(indices) - 1)]

            if all(is_dictionary_word(piece) for piece in pieces):
                return pieces

    return False


def maybe_is_a_word(word):
    pass
    # make use of zipf_frequency here

# Used for output formatting
def add_quotes(word):
    return f'"{word}"'

In [71]:
# Correct misspellings based on spell checker and phonetics checker
dmeta = fuzzy.DMetaphone()

def correct_misspelling(misspelled_word):
    goal_phonetics = dmeta(misspelled_word)
    matches = []
    for word in spell.candidates(misspelled_word):
        if dmeta(word) == goal_phonetics:
            matches.append(word)
    if len(matches) == 0:
        print(misspelled_word,"is likely misspelled, but couldn't find a match!")
    if len(matches) == 1:
        print("Found one match:",add_quotes(misspelled_word),"corrects to",add_quotes(matches[0]))
    if len(matches) > 1:
        print("Found more than one match for",add_quotes(misspelled_word),"so no action taken.")

correct_misspelling("yay")


Found more than one match for "yay" so no action taken.


In [72]:
# Perform all 3 word checks to see which the word passes
word = 'll'
print(d.check(word), word in spell, (df_names['name'].str.lower() == word).any())

True False False


In [73]:
# Check if something is a name, and if so, find its index
word = 'ant'
print(is_name('ant'))

matches = df_names['name'].str.lower() == word.lower()

if matches.any():
    index = matches.idxmax()  # Returns the first True index
    print(f"Found at index: {index}")
else:
    print("Not found.")

False
Not found.


In [74]:
# Instantiate the tokenizer
model_name = "openai/whisper-base"
language = "english" # Change to your dataset's language
task = "transcribe" # Use "translate" if you're translating to English

processor = WhisperProcessor.from_pretrained(model_name, language=language, task=task)
tokenizer = processor.tokenizer

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [75]:
# What does zipf_frequency do?
zipf_frequency("im", lang="en")

def is_known_word(word, threshold=4.0):
    return zipf_frequency(word, 'en') >= threshold

In [76]:
# Test how a word is tokenized here
word = 'dyou'
tokens = tokenizer.tokenize(word)
print(tokens)

['dy', 'ou']


In [77]:
# Test if things are words here
word = "darling"
print(is_dictionary_word(word))

True


In [78]:
# Test word splitting of non-words here
split_and_check("youll")

['you', 'll']

In [81]:
# count WER score of interest
count=0
# count the number of occurrences of 3+ consecutive letters
transcript_count=0
# count the number of non-English character occurences
nonenglish_count=0

for i in range(len(df.words)):  #len(df.words)

    result = []
    collapsed_transcript_list = []
    split_transcript_list = []
    tokenized_transcript = []
    tokenized_result = []

    # pull the original line transcript and get rid of junk repeat letters like woooooo
    transcript = df.transcript_no_punct[i]
    transcript = collapse_repeats(transcript)
    transcript = collapse_known_repeats(transcript)

    # collapse repeat letters in the transcript
    transcript_list = transcript.split(" ")
    for word in transcript_list:
        collapsed_transcript_list.append(collapse_repeats(word))

    transcript_list = collapsed_transcript_list

    # break up incorrect compound words in the transcript
    for word in transcript_list:
        if not is_dictionary_word(word):
            split_attempt = split_and_check(word)
            if split_attempt is False:
                split_transcript_list.append(word)
            else:
                print('we found a split! originally:',word)
                split_transcript_list.append(" ".join(split_attempt))

    transcript_list = split_transcript_list

    # tokenize the transcript
    for word in transcript_list:
        token = tokenizer.tokenize(word)
        tokenized_transcript.append(token)

    transcript = " ".join(transcript_list)
    # check if transcript contains non-English words and chars
    status = is_non_english(transcript)
    if status:
        print(transcript, df.filename[i])
        nonenglish_count += status

    # loop over words to append them one at a time to result
    for obj in df.words[i]:                
        result.append(obj['word'])
        token = tokenizer.tokenize(obj['word'])
        tokenized_result.append(token)
    line = " ".join(result)
    
    # compute WER to assess quality of each line
    wer_score = jiwer.wer(line, transcript)
    if wer_score>.1:
        #print(line+',',transcript+',', wer_score)
        count += 1

    #if status:
        #print("orig. concat:",line+',',"token concat:", tokenized_result,transcript+',',"token concat:",tokenized_transcript, "WER score:", wer_score)

we found a split! originally: butdown
we found a split! originally: overand
we found a split! originally: overagain
we found a split! originally: nomeasure
we found a split! originally: thatyou
we found a split! originally: overand
we found a split! originally: overagain
we found a split! originally: youiknow
we found a split! originally: overand
we found a split! originally: overagain
we found a split! originally: overand
we found a split! originally: overagain
chasin' 007c0152242340008ff45781a9b08546-1.wav
we found a split! originally: onlygot
we found a split! originally: hundredyears
we found a split! originally: timegoes
we found a split! originally: aneye
we found a split! originally: isgone
we found a split! originally: isget
we found a split! originally: timefor
we found a split! originally: anothermo
fif 0091064bdc72469ca7096d3a0db74562-34.wav
we found a split! originally: you'reon
we found a split! originally: there'sstill
we found a split! originally: timefor
we found a spli

KeyboardInterrupt: 