## Obtaining Cambridge Pronunciations

This notebook can be run to obtain the cambridge pronunciations, which are output to `cambridge_ipas.csv`.

In [1]:
# Imports
import cambridge_parser as parser
import pandas as pd
import re

First, we trim the cmu dictionary based on several criteria. We will then make calls to the Cambridge dictionary for all remaning words to obtain their pronunciations. This cell takes a significant amount of time to run.

In [2]:
# This cell takes a long time to run!
wps = {} # str : List[str]

with open('cmudict-0.7b-2024-4-6.txt') as file:
    # Import the SUBTLEXUS csv to a pandas dataframe.
    subtlexus = pd.read_csv('SUBTLEXusExcel2007.csv')

    # Convert all words to lowercase
    subtlexus['Word'] = subtlexus['Word'].str.lower()

    # Regex for finding alternate pronunciations of words (which are structured as
    # "word(int)")
    ralt = r"\s" # TODO finish implementing this 

    # Regex for finding unwanted punctuation in words (essentially any non-word)
    rpunc = r".*(\W|\d).*"

    # Regex for three-peated characters (any word with three or more of the same
    # letter in a row should be omitted, as none are valid English words for the
    # purposes of the toolkit)
    rpeat = r".*(.)\1\1.*"

    l = 56

    # Skip the first 56 lines as these contain text we are not interested in
    for line in file.readlines()[56:]:
        s = line.strip()
        i = s.find(" ", 0)
        word = s[:i].lower()
        word = word.lower()

        # We need to do a check to see if it is an alternate pronunciation.
        # If it is, remove the alt. pronunciation tag, and add this to our list
        # of words to keep.
        

        # Ensure the word doesn't contain punctuation and is present in the SUBTLEXUS
        if re.match(rpunc, word) is None \
            and re.match(rpeat, word) is None \
            and word in subtlexus['Word'].values:
            # Add the word to the dict 
            wps[word] = ""

        l += 1

In [3]:
# Show the number of words whose pronunciations will be obtained from the Cambridge dictionary
print(len(wps))

48353


Next, using `cambridge_parser.py` we obtain the pronunciations for all words in the trimmed list.


Note that this next cell take an incredibly long time (hours) to run.

In [5]:
# For each word in the dict, get its corresponding pronunciation from Cambridge;
# if it is not in Cambridge, remove it from the dict 

wps_keys = list(wps.keys())
wps_words = str(len(wps_keys))
# Four steps of iteration to prevent errors
step = int(len(wps_keys) / 4) # TODO turn the step into a variable
stepn = 1
wordnum = 1

# Iterate through trimmed word list 
# This is done in several steps to reduce capacity for errors
for i in range(0, len(wps_keys), step):
    # Temporary dict in which to store words from this iteration
    temp_words = {} # [str : [str, int, int]]
    words_to_remove = []

    # todo dont hardcode 4
    print(f"Stepping through words {wps_keys[i]} to {wps_keys[i + step - 1]} in step {str(stepn)} of 4...")

    for word in wps_keys[i:min(len(wps_keys), i + step)]:
        print(f"Obtaining pronunciations for word {word}; {str(wordnum)} of {wps_words} words...", end="\r")

        # Grab cambridge information
        cword = parser.define(word)

        # List of pronunciations for the word in US_IPA in space-separated string
        ps = "" # If this remains empty by the end then ofc we didnt find the word in the dict so no pronuns.
        # Is this the conjugate of some root word? (0 is no, 1 is yes)
        # If this is 1 it means that we searched for something like "abandoning" and "abandon" was returned
        # In such a case we still want to keep the pronunciations
        root_ret = 0
        missing = 0 # 0 if we got something back from cambridge, 1 otherwise (meaning the word is not present at all)

        # Iterate through all definitions
        for defnum in range(len(cword)):
            try:
                # Obtain all pronunciations and format them correctly
                # NOTE: here, if word is a conjugate of some other word (for example "abandoning" whose
                # root word is "abandon") the returned data from the parser will have "abandon" in the
                # word position below.
                # But, we want to keep words that gave us the root word but not the full pronunciation
                # in the resultant.
                # So check if it is not none if there is an exception and append this instead?
                pslist = cword[defnum][word][0]['data']['US_IPA']
                for p in pslist:
                    # Append the formatted pronunciation to space separated string of pronuns
                    ps = ps + " " + p[0].replace(".","").replace("/","")
            # if something times out we can grab it later by hand 
            except (RuntimeError, KeyError, IndexError, TimeoutError):
                # In this case we might have obtained the root word of the word we tried to search for
                # Make sure something was actually returned 
                if len(cword) != 0:
                    # The pronunciations will be available in the first nonempty IPAs set
                    for k, v in cword[defnum].items():
                        pslist = cword[defnum][k][0]['data']['US_IPA']
                        for p in pslist:
                            ps = ps + " " + p[0].replace(".","").replace("/","") # Get all of the possible pronuns from things that are returned
                    root_ret = 1
                else:
                    # In this case the word was not there at all 
                    # We will append these separately later 
                    missing = 1
                

        # If there are pronunciations for this word, add them to the dict
        temp_words[word] = [ps, root_ret, missing]
        wordnum += 1

    # Output the words and pronunciations we have accumulated
    print("")
    print("Finished obtaining pronunciations for this iteration.")

    word_pronunciation_pairs = []

    # Iterate through the dictionary and convert it into a list of tuples
    for word, data in temp_words.items():
        # Break data into pronunciations and whether or not the item was missing (conjugate or entirely)
        word_pronunciation_pairs.append((word, data[0], data[1], data[2]))

    # Create a DataFrame from the list of tuples
    df = pd.DataFrame(word_pronunciation_pairs, columns=['Word', 'Pronunciation', 'Root Word Returned', 'Missing'])

    # Output the DataFrame to a CSV file
    df.to_csv(f"temp/cambridge_ipas_step{str(stepn)}.csv", index=False)

    print(f"Output csv for step {str(stepn)} to cambridge_ipas_step{str(stepn)}.csv")
    
    stepn += 1

Stepping through words a to dilatory in step 1 of 4...
Obtaining pronunciations for word aa; 2 of 48353 words...

IndexError: list index out of range

In [None]:
# TODO: This will likely not be the case in the future. Need to 1) adjust above for loop to work in general
# case for steps and 2) remodify the cambridge dictionary input list because some things are being trimmed that
# might not necessarily want to be (see notes) 
# The above loop missed the very last word ("zygote") so we grab that manually
# This also serves as a direct example of how the words are obtained from the dictionary
zy = parser.define("zygote")

ps = ""
root_ret = 0
missing = 0 # 0 if we got something back from cambridge, 1 otherwise (meaning the word is not present at all)

for defnum in range(len(cword)):
    try:
        # Obtain all pronunciations and format them correctly
        # NOTE: here, if word is a conjugate of some other word (for example "abandoning" whose
        # root word is "abandon") the returned data from the parser will have "abandon" in the
        # word position below.
        # But, we want to keep words that gave us the root word but not the full pronunciation
        # in the resultant.
        # So check if it is not none if there is an exception and append this instead?
        pslist = cword[defnum][word][0]['data']['US_IPA']
        for p in pslist:
            # Append the formatted pronunciation to space separated string of pronuns
            ps = ps + " " + p[0].replace(".","").replace("/","")
    # if something times out we can grab it later by hand 
    except (RuntimeError, KeyError, IndexError, TimeoutError):
        # In this case we might have obtained the root word of the word we tried to search for
        # Make sure something was actually returned 
        if len(cword) != 0:
            # The pronunciations will be available in the first nonempty IPAs set
            for k, v in cword[defnum]:
                pslist = cword[defnum][k][0]['data']['US_IPA']
                for p in pslist:
                    ps = ps + " " + p[0].replace(".","").replace("/","") # Get all of the possible pronuns from things that are returned
            root_ret = 1
        else:
            # In this case the word was not there at all 
            # We will append these separately later 
            missing = 1

zygote_pronunciation_tuple = [("zygote", ps, root_ret, missing)]

# Output this to a data frame and then a csv 
df = pd.DataFrame(zygote_pronunciation_tuple, columns = ['Word', 'Pronunciation', 'Root Word Returned', 'Missing'])
df.to_csv("temp/cambridge_ipas_step5.csv", index = False)

We did the above with separate data frames to minimize the opportunity for error (and thus work lost). Now, re-read the data frames into a dict. We will iterate through this dict to make some final adjustments to the pronunciations and then output the completed result.

In [None]:
# Initialize an empty dictionary to hold the combined data
all_cambridge_ipas = {}

# Iterate over the CSV files
for i in range(1, 6):
    # Load the data frame from the CSV file
    df = pd.read_csv(f'temp/cambridge_ipas_step{i}.csv')
    
    # Convert the data frame to a dictionary and update the combined dictionary
    all_cambridge_ipas.update(dict(zip(df['Word'], [df['Pronunciation'], df['Root Word Returned'], df['Missing']])))

# Verify the length
len(all_cambridge_ipas.keys())

# Note this length is very different because even many words in the trimmed dictionary are not present in the Cambridge dictionary! We will get a file of these below.
# Once done with updates actually this length should be the same 

25455

By inspection it also turns out there are several other characters present in many of the pronunciations that are not a part of the actual pronunciation. We remove all of these as well.

In [None]:
# Iterate through all words in the resultant dictionary.
# Remove the following characters: ˈ · ː - ˌ with empty string
# Also notice for some reason Cambridge dictionary does not have proper "r" as "ɹ" so we
# replace this too
# and "t̬" should just be "t"
# finally, any instance of "e" should be replaced with "ɛ" so long as it is not followed by an "ɪ"
# as that corresponds to a different phoneme
for w, data in all_cambridge_ipas.items():
    all_cambridge_ipas[w] = re.sub(r'e(?!ɪ)', 'ɛ', data[0].replace("ˈ","").replace("·","").replace("ː","").replace("-","").replace("ˌ","").replace(",","").replace("r","ɹ").replace("t̬","t"))

In [None]:
# We need to artificially change the pronunciation for "a" by inspection and also add the pronunciation for "i"
all_cambridge_ipas['a'] = "ɛɪ"
all_cambridge_ipas['i'] = "aɪ"

In [None]:
# Finally, output the result to a csv
res = []

for w, data in all_cambridge_ipas.items():
    res.append((w,data[0], data[1], data[2]))

df = pd.DataFrame(res, columns=['Word', 'Pronunciation', 'Root Word Returned', 'Missing'])
df.to_csv("cambridge_ipas.csv", index = False)