## Obtaining Cambridge Pronunciations

Running this notebook fetches pronunciations from the Cambridge Dictionary and writes them to `cambridge_ipas.csv`.

In [2]:
# Imports
import cambridge_parser as parser
import pandas as pd
import re
import math
import time

First, we trim the CMU dictionary based on several criteria. We then query the Cambridge Dictionary for every remaining word to obtain its pronunciations. This cell can take some time to run.

In [2]:
# The SUBTLEXus membership checks dominate the runtime of this cell, so they use a set
wps = {} # str : str (pronunciations are filled in later)

with open('cmudict-0.7b-2024-4-6.txt') as file:
    # Import the SUBTLEXUS csv to a pandas dataframe.
    subtlexus = pd.read_csv('SUBTLEXusExcel2007.csv')

    # Convert all words to lowercase and collect them in a set for fast lookups
    subtlexus['Word'] = subtlexus['Word'].str.lower()
    subtlexus_words = set(subtlexus['Word'])

    # Regex for finding alternate pronunciations of words (which are structured as
    # "word(int)")
    ralt = r"(\w+)\(\d+\)"

    # Regex for finding unwanted characters in words (any non-word character or digit)
    rpunc = r".*(\W|\d).*"

    # Regex for three-peated characters (any word with three or more of the same
    # letter in a row should be omitted, as none are valid English words for the
    # purposes of the toolkit)
    rpeat = r".*(.)\1\1.*"

    # Skip the first 56 lines as these contain text we are not interested in
    for line in file.readlines()[56:]:
        s = line.strip()
        i = s.find(" ", 0)
        word = s[:i].lower()

        # If this is an alt pronunciation ("word(1)"), do the checks on the root
        # word instead (for robustness)
        alt = re.match(ralt, word)
        if alt is not None:
            word = alt.group(1)

        # Keep the word only if it passes all of the general criteria
        if re.match(rpeat, word) is None \
            and re.match(rpunc, word) is None \
            and word in subtlexus_words:
            wps[word] = ""
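
The three filters are easier to follow on a handful of inputs. The sketch below reuses the same regular expressions; the helper name `keep_word` and the tiny `vocab` set are hypothetical stand-ins (the real vocabulary comes from the SUBTLEXus file), so treat it as an illustration rather than the notebook's actual code path.

```python
import re

# Same patterns as in the trimming cell
RALT = re.compile(r"(\w+)\(\d+\)")   # alternate pronunciation entry, e.g. "read(1)"
RPUNC = re.compile(r".*(\W|\d).*")   # any non-word character or digit
RPEAT = re.compile(r".*(.)\1\1.*")   # the same character three times in a row


def keep_word(entry, vocabulary):
    """Return the lowercased headword if it passes every filter, else None."""
    word = entry.lower()
    alt = RALT.match(word)
    if alt is not None:
        # Alternate pronunciations are checked via their root word
        word = alt.group(1)
    if RPEAT.match(word) or RPUNC.match(word) or word not in vocabulary:
        return None
    return word


# Hypothetical vocabulary standing in for the SUBTLEXus word list
vocab = {"read", "zoo", "aaah", "dont"}

for entry in ["READ", "READ(1)", "AAAH", "DON'T", "3D", "ZOO"]:
    print(entry, "->", keep_word(entry, vocab))
```

Here `READ` and `READ(1)` both map to `read`, `ZOO` is kept, and the remaining entries are rejected by the repeated-character, punctuation, or digit checks.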

In [3]:
# Show the number of words whose pronunciations will be obtained from the Cambridge dictionary
print(len(wps))

48353


Next, using `cambridge_parser.py` we obtain the pronunciations for all words in the trimmed list.


Note that the next cell takes a very long time (hours) to run.

In [6]:
# Obtains cambridge information for words based on the current step.
# Handles connection errors.
def do_step(step, maxstep, wps_keys, wps_words):
    # Slice bounds for this step: words [start, end) of wps_keys
    start = int((step - 1) * len(wps_keys) / maxstep)
    end = min(len(wps_keys), int(step * len(wps_keys) / maxstep))
    wordnum = start + 1
    # Temporary dict in which to store words from this iteration
    temp_words = {} # str : [str, int, int]

    print(f"Stepping through words {wps_keys[start]} to {wps_keys[end - 1]} in step {step} of {maxstep}...")

    for word in wps_keys[start:end]:
        print(f"Obtaining pronunciations for word {word}; {wordnum} of {wps_words} words...", end="\r")

        # Grab cambridge information
        cword = parser.define(word)
        if cword is None:
            # Likely because we made too many requests. Delay a short time and retry the current step
            print(f"Errored due to too many calls at step {step}. Retrying...")
            time.sleep(5)
            do_step(step, maxstep, wps_keys, wps_words)
            return
            
        # Space-separated string of US IPA pronunciations for the word.
        # If this is still empty at the end, the word was not found in the dictionary.
        ps = ""
        # Is this the conjugate of some root word? (0 is no, 1 is yes)
        # If this is 1 it means that we searched for something like "abandoning" and "abandon" was returned
        # In such a case we still want to keep the pronunciations
        root_ret = 0
        missing = 0 # 0 if we got something back from cambridge, 1 otherwise (meaning the word is not present at all)

        if len(cword) == 0:
            missing = 1

        # Iterate through all definitions
        for defnum in range(len(cword)):
            try:
                # Obtain all pronunciations and format them correctly
                # NOTE: here, if word is a conjugate of some other word (for example "abandoning" whose
                # root word is "abandon") the returned data from the parser will have "abandon" in the
                # word position below.
                # But, we want to keep words that gave us the root word but not the full pronunciation
                # in the resultant.
                # So check if it is not none if there is an exception and append this instead?
                pslist = cword[defnum][word][0]['data']['US_IPA']
                for p in pslist:
                    # Append the formatted pronunciation to space separated string of pronuns
                    ps = ps + " " + p[0].replace(".","").replace("/","")
            # if something times out we can grab it later by hand 
            except (RuntimeError, KeyError, IndexError):
                # In this case we might have obtained the root word of the word we tried to search for
                # Make sure something was actually returned 
                if len(cword) != 0:
                    if len(list(cword[0].keys())[0].split()) > 1:
                        missing = 1
                    else:
                        # The pronunciations will be available in the first nonempty IPAs set
                        # This will help us determine if it is missing or not
                        # or could redo here and see results
                        for k, v in cword[defnum].items():
                            pslist = cword[defnum][k][0]['data']['US_IPA']
                            for p in pslist:
                                # Have to add this check because sometimes it is just empty
                                if len(p) > 0:
                                    ps = ps + " " + p[0].replace(".","").replace("/","") # Get all of the possible pronuns from things that are returned
                        root_ret = 1
                else:
                    # In this case the word was not there at all 
                    # We will append these separately later 
                    missing = 1

        # If there are pronunciations for this word, add them to the dict
        temp_words[word] = [ps, root_ret, missing]
        wordnum += 1

        # time.sleep(1) maybe?
    
    # Output the words and pronunciations we have accumulated
    print("")
    print("Finished obtaining pronunciations for this iteration.")

    word_pronunciation_pairs = []

    # Iterate through the dictionary and convert it into a list of tuples
    for word, data in temp_words.items():
        # Break data into pronunciations and whether or not the item was missing (conjugate or entirely)
        word_pronunciation_pairs.append((word, data[0], data[1], data[2]))

    # Create a DataFrame from the list of tuples
    df = pd.DataFrame(word_pronunciation_pairs, columns=['Word', 'Pronunciation', 'Root Word Returned', 'Missing'])

    # Output the DataFrame to a CSV file
    df.to_csv(f"temp/cambridge_ipas_step{str(step)}.csv", index=False)

    print(f"Output csv for step {str(step)} to cambridge_ipas_step{str(step)}.csv")

    time.sleep(10) # reduce error-proneness
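
When a lookup returns `None`, the function above retries by calling itself, which restarts the whole step and discards the words already collected in it. One possible alternative, sketched here under assumptions, is to retry only the failing word a bounded number of times. The names `fetch_with_retry`, `flaky_fetch`, `max_attempts` and `base_delay` are hypothetical, and `flaky_fetch` merely simulates a lookup that fails twice; it does not contact the Cambridge dictionary.

```python
import time


def fetch_with_retry(fetch, word, max_attempts=5, base_delay=5.0, sleep=time.sleep):
    """Call fetch(word) until it returns something other than None."""
    for attempt in range(1, max_attempts + 1):
        result = fetch(word)
        if result is not None:
            return result
        if attempt < max_attempts:
            # Wait a little longer after each failed attempt
            sleep(base_delay * attempt)
    raise RuntimeError(f"no response for {word!r} after {max_attempts} attempts")


# Simulated lookup that returns None on its first two calls
calls = {"n": 0}


def flaky_fetch(word):
    calls["n"] += 1
    return None if calls["n"] < 3 else [{word: []}]


# Record the delays instead of actually sleeping, so the demo runs instantly
delays = []
result = fetch_with_retry(flaky_fetch, "abandon", sleep=delays.append)
print(result, delays)
```

In the notebook, `fetch` would be `parser.define`, and the default `sleep` would pause for real between attempts.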



In [7]:
wps_keys = list(wps.keys())
wps_words = str(len(wps_keys))
stepn = 1
maxstep = 100

# Split into steps to reduce error-proneness
for step in range(1, maxstep + 1):
    do_step(step, maxstep, wps_keys, wps_words)

Stepping through words a to adamantly in step 1 of 100...
Obtaining pronunciations for word adamantly; 483 of 48353 words......s...
Finished obtaining pronunciations for this iteration.
Output csv for step 1 to cambridge_ipas_step1.csv
Stepping through words adams to airfare in step 2 of 100...
Obtaining pronunciations for word airfare; 967 of 48353 words.......s...
Finished obtaining pronunciations for this iteration.
Output csv for step 2 to cambridge_ipas_step2.csv
Stepping through words airfares to amperage in step 3 of 100...
Obtaining pronunciations for word amperage; 1450 of 48353 words..........
Finished obtaining pronunciations for this iteration.
Output csv for step 3 to cambridge_ipas_step3.csv
Stepping through words amperes to appendectomy in step 4 of 100...
Obtaining pronunciations for word appendectomy; 1934 of 48353 words.....48353 words...
Finished obtaining pronunciations for this iteration.
Output csv for step 4 to cambridge_ipas_step4.csv
Stepping through words appe
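
The word counts in the output above (483, 967, ...) follow from how each step slices the key list. The sketch below reproduces only that arithmetic; `step_bounds` is a hypothetical helper name, and the total of 48353 is the word count printed earlier in the notebook.

```python
def step_bounds(step, maxstep, n):
    """Return the half-open index range [start, end) handled by a 1-based step."""
    start = int((step - 1) * n / maxstep)
    end = min(n, int(step * n / maxstep))
    return start, end


n, maxstep = 48353, 100
bounds = [step_bounds(s, maxstep, n) for s in range(1, maxstep + 1)]

print(bounds[0], bounds[1], bounds[-1])
# (0, 483) (483, 967) (47869, 48353)
```

Each step starts where the previous one ended and the last step ends at `n`, so every word is visited exactly once.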

In [None]:
# # NOTE: this is all commented out because I remodified the old loop but we keep this
# # as an example here.
# # The above loop missed the very last word ("zygote") so we grab that manually
# # This also serves as a direct example of how the words are obtained from the dictionary
# word = "zygote"
# cword = parser.define(word)

# ps = ""
# root_ret = 0
# missing = 0 # 0 if we got something back from cambridge, 1 otherwise (meaning the word is not present at all)

# for defnum in range(len(cword)):
#     try:
#         # Obtain all pronunciations and format them correctly
#         # NOTE: here, if word is a conjugate of some other word (for example "abandoning" whose
#         # root word is "abandon") the returned data from the parser will have "abandon" in the
#         # word position below.
#         # But, we want to keep words that gave us the root word but not the full pronunciation
#         # in the resultant.
#         # So check if it is not none if there is an exception and append this instead?
#         pslist = cword[defnum][word][0]['data']['US_IPA']
#         for p in pslist:
#             # Append the formatted pronunciation to space separated string of pronuns
#             ps = ps + " " + p[0].replace(".","").replace("/","")
#     # if something times out we can grab it later by hand 
#     except (RuntimeError, KeyError, IndexError, TimeoutError):
#         # In this case we might have obtained the root word of the word we tried to search for
#         # Make sure something was actually returned 
#         if len(cword) != 0:
#             # The pronunciations will be available in the first nonempty IPAs set
#             for k, v in cword[defnum].items():
#                 pslist = cword[defnum][k][0]['data']['US_IPA']
#                 for p in pslist:
#                     ps = ps + " " + p[0].replace(".","").replace("/","") # Get all of the possible pronuns from things that are returned
#             root_ret = 1
#         else:
#             # In this case the word was not there at all 
#             # We will append these separately later 
#             missing = 1

# zygote_pronunciation_tuple = [("zygote", ps, root_ret, missing)]

# # Output this to a data frame and then a csv 
# df = pd.DataFrame(zygote_pronunciation_tuple, columns = ['Word', 'Pronunciation', 'Root Word Returned', 'Missing'])
# df.to_csv("temp/cambridge_ipas_step5.csv", index = False)
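
As a runnable counterpart to the commented-out example, the sketch below extracts US IPA strings from a nested structure. The shape of `sample` is inferred from the indexing used in this notebook (`cword[defnum][word][0]['data']['US_IPA']`); it is hand-built for illustration and is not real output from `cambridge_parser`. The helper name `collect_us_ipa` is likewise hypothetical.

```python
def collect_us_ipa(cword, word):
    """Join every US IPA string found for `word` into one space-separated string."""
    found = []
    for definition in cword:
        entries = definition.get(word, [])
        if not entries:
            continue
        for p in entries[0]["data"]["US_IPA"]:
            # Some pronunciation lists are empty, so guard before indexing
            if len(p) > 0:
                found.append(p[0].replace(".", "").replace("/", ""))
    return " ".join(found)


# Hand-built sample in the assumed shape of the parser's return value
sample = [{"abandon": [{"data": {"US_IPA": [["/əˈbæn.dən/"], []]}}]}]

print(collect_us_ipa(sample, "abandon"))
print(repr(collect_us_ipa(sample, "abandoning")))
```

Unlike the notebook's string concatenation, this version does not leave a leading space, and it returns an empty string when the searched word is not a key in any definition.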

We wrote each step to a separate CSV file to limit the work lost if an error occurred. Now we read those files back into a single data frame, make some final adjustments to the pronunciations, and output the completed result.

In [3]:
# Load in all of the temporary data frames (one per step, 100 in total) and concatenate them.
frames = []

# Iterate over the CSV files and collect the data frames
for i in range(1, 101):
    # Load the data frame from the CSV file
    frames.append(pd.read_csv(f'temp/cambridge_ipas_step{i}.csv'))

# Concatenate once at the end rather than growing a data frame inside the loop
all_cambridge_ipas = pd.concat(frames, ignore_index=True)

# Verify the length of the combined DataFrame
print(len(all_cambridge_ipas))

48353


In [5]:
all_cambridge_ipas.head()

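|   | Word | Pronunciation | Root Word Returned | Missing |
|---|------|---------------|--------------------|---------|
| 0 | a | wɛak ə wɛak ə wɛak ə wɛak ə wɛak ə wɛak ə wɛa... | 0 | 0 |
| 1 | aa | eɪeɪ eɪeɪ eɪeɪ eɪeɪ | 1 | 0 |
| 2 | aah | ɑ | 0 | 0 |
| 3 | aardvark | ɑɹdvɑɹk | 0 | 0 |
| 4 | aargh |  | 0 | 1 |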


Inspection also shows that many pronunciations contain characters that are not part of the pronunciation itself, such as stress marks, length marks, and separators. We remove all of these as well.

In [6]:
# Iterate through all pronunciation column entries.
# Remove the following characters: ˈ · ː - ˌ
# The Cambridge dictionary writes the English r sound as "r" rather than "ɹ", so we
# replace this too,
# and "t̬" should just be "t".
# Finally, any instance of "e" is replaced with "ɛ" unless it is followed by an "ɪ",
# as "eɪ" corresponds to a different phoneme.
for index, row in all_cambridge_ipas.iterrows():
    pronunciation = row['Pronunciation']
    if pd.notna(pronunciation):
        pronunciation = pronunciation.replace("ˈ", "").replace("·", "").replace("ː", "").replace("-", "").replace("ˌ", "")
        pronunciation = pronunciation.replace("r", "ɹ").replace("t̬", "t")
        pronunciation = re.sub(r'e(?!ɪ)', 'ɛ', pronunciation)
        all_cambridge_ipas.at[index, 'Pronunciation'] = pronunciation
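
The same clean-up rules can be tried on individual strings. The sketch below wraps them in a hypothetical helper, `normalize_ipa`; the sample inputs are made up for illustration and were not fetched from the Cambridge dictionary. The voiced "t" is written as `t` plus the combining mark U+032C, which is an assumption about how that character is encoded in the data.

```python
import re


def normalize_ipa(pronunciation):
    """Apply the notebook's clean-up rules to one pronunciation string."""
    # Drop stress marks, the middle dot, the length mark, and hyphens
    for mark in ("ˈ", "·", "ː", "-", "ˌ"):
        pronunciation = pronunciation.replace(mark, "")
    # Use "ɹ" for the r sound and a plain "t" for the voiced "t" (t + U+032C)
    pronunciation = pronunciation.replace("r", "ɹ").replace("t\u032c", "t")
    # Replace "e" with "ɛ" unless it begins the diphthong "eɪ"
    return re.sub(r"e(?!ɪ)", "ɛ", pronunciation)


print(normalize_ipa("ˈred"))            # ɹɛd
print(normalize_ipa("deɪ"))             # deɪ
print(normalize_ipa("ˈwɔːt\u032cɚ"))    # wɔtɚ
```

The negative lookahead `(?!ɪ)` is what leaves "eɪ" untouched while every other "e" becomes "ɛ".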

In [7]:
# We need to manually override the pronunciation for "a" (chosen by inspection) and also set the pronunciation for "i"
all_cambridge_ipas.loc[all_cambridge_ipas['Word'] == 'a', 'Pronunciation'] = "ɛɪ"
all_cambridge_ipas.loc[all_cambridge_ipas['Word'] == 'i', 'Pronunciation'] = "aɪ"

In [9]:
# Finally output to a single csv
all_cambridge_ipas.to_csv("cambridge_ipas.csv", index = False)