## Obtaining Cambridge Pronunciations

Run this notebook to obtain the Cambridge pronunciations, which are written to `cambridge_ipas.csv`.

In [2]:
# Imports
import cambridge_parser as parser
import pandas as pd
import re

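First, we trim the CMU dictionary based on several criteria: a word is kept only if it contains no punctuation or digits, has no character repeated three or more times in a row, and appears in SUBTLEXUS. We will then make calls to the Cambridge dictionary for all remaining words to obtain their pronunciations. This cell takes a significant amount of time to run.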

In [5]:
# This cell takes a long time to run!
wps = {} # str : List[str]

with open('cmudict-0.7b-2024-4-6.txt') as file:
    # Import the SUBTLEXUS csv to a pandas dataframe.
    subtlexus = pd.read_csv('SUBTLEXusExcel2007.csv')

    # Convert all words to lowercase
    subtlexus['Word'] = subtlexus['Word'].str.lower()

    # Regex for finding unwanted characters in words (any non-word character or digit)
    rpunc = r".*(\W|\d).*"

    # Regex for three-peated characters (any word with three or more of the same
    # letter in a row should be omitted, as none are valid English words for the
    # purposes of the toolkit)
    rpeat = r".*(.)\1\1.*"

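    # Build a set of SUBTLEXUS words once so that each membership test is a
    # constant-time lookup rather than a scan over the whole column
    subtlexus_words = set(subtlexus['Word'].dropna())

    # Skip the first 56 lines as these contain text we are not interested in
    for line in file.readlines()[56:]:
        # The word is everything before the first space on the line
        word = line.strip().split(" ", 1)[0].lower()

        # Ensure the word doesn't contain punctuation or digits, has no
        # three-peated characters, and is present in the SUBTLEXUS
        if re.match(rpunc, word) is None \
            and re.match(rpeat, word) is None \
            and word in subtlexus_words:
            # Add the word to the dict
            wps[word] = ""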
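
To make the filtering criteria concrete, the following small sketch applies the same two regexes to a handful of sample strings. The helper name `keep` and the sample words are made up for illustration; the SUBTLEXUS membership check is left out.

```python
import re

# Same patterns as in the cell above
rpunc = r".*(\W|\d).*"   # any non-word character or digit
rpeat = r".*(.)\1\1.*"   # any character three or more times in a row

def keep(word: str) -> bool:
    """Hypothetical helper: True if the word passes both regex filters."""
    return re.match(rpunc, word) is None and re.match(rpeat, word) is None

# Made-up sample strings, not taken from the real dictionary run
samples = ["zygote", "a(1)", "don't", "aaargh", "3d", "hello"]
kept = [w for w in samples if keep(w)]
print(kept)
```

Only `zygote` and `hello` survive here: the others contain punctuation, a digit, or a three-peated letter.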

In [10]:
# Show the number of words whose pronunciations will be obtained from the Cambridge dictionary
print(len(wps))

48353


Next, using `cambridge_parser.py` we obtain the pronunciations for all words in the trimmed list.


Note that these cells take an incredibly long time to run.

In [4]:
# For each word in the dict, get its corresponding pronunciation from Cambridge;
# words for which no US IPA pronunciation is found are left out of the output

wps_keys = list(wps.keys())
wps_words = str(len(wps_keys))
# Four steps of iteration to prevent errors
step = int(len(wps_keys) / 4) # TODO turn the step into a variable
stepn = 1
wordnum = 1

# Iterate through trimmed word list 
# This is done in several steps to reduce capacity for errors
for i in range(0, len(wps_keys), step):
    # Temporary dict in which to store words from this iteration
    temp_words = {}

    # todo dont hardcode 4
    print(f"Stepping through words {wps_keys[i]} to {wps_keys[i + step - 1]} in step {str(stepn)} of 4...")

    for word in wps_keys[i:min(len(wps_keys), i + step)]:
        print(f"Obtaining pronunciations for word {word}; {str(wordnum)} of {wps_words} words...", end="\r")

        # Grab cambridge information
        cword = parser.define(word)

        # List of pronunciations for the word in US_IPA in space-separated string
        ps = ""

        # Iterate through all definitions
        for defnum in range(len(cword)):
            try:
                # Obtain all pronunciations and format them correctly
                pslist = cword[defnum][word][0]['data']['US_IPA']
                for p in pslist:
                    # Append the formatted pronunciation to space separated string of pronuns
                    ps = ps + " " + p[0].replace(".","").replace("/","")
            # if something times out we can grab it later by hand 
            except (RuntimeError, KeyError, IndexError, TimeoutError):
                # Continue iterating if error occurs; these can be cleaned up manually later
                continue 

        # If there are pronunciations for this word, add them to the dict
        if len(ps) != 0:
            temp_words[word] = ps

        wordnum += 1
    # Output the words and pronunciations we have accumulated
    print("")
    print("Finished obtaining pronunciations for this iteration.")

    # Create a DataFrame from the (word, pronunciation) pairs in the dict
    df = pd.DataFrame(list(temp_words.items()), columns=['Word', 'Pronunciation'])

    # Output the DataFrame to a CSV file
    df.to_csv(f"cambridge_ipas_step{str(stepn)}.csv", index=False)

    print(f"Output csv for step {str(stepn)} to cambridge_ipas_step{str(stepn)}.csv")
    
    stepn += 1

Stepping through words a to dildo in step 1 of 4...
Obtaining pronunciations for word abductees; 45 of 48353 words.......

KeyboardInterrupt: 
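
The loop above computes its step size with `int(len(wps_keys) / 4)`, which truncates. With 48353 words this gives 12088, and 4 × 12088 = 48352, so one word is left over for a fifth pass. A minimal sketch of an alternative split, assuming ceiling division is acceptable, is shown below; `chunk_bounds` is a hypothetical helper, not part of the notebook.

```python
import math
from typing import List, Tuple

def chunk_bounds(n_items: int, n_steps: int) -> List[Tuple[int, int]]:
    """Hypothetical helper: (start, end) slice bounds covering n_items
    in at most n_steps chunks, with no leftover items."""
    if n_items == 0:
        return []
    # Round the step size up so the last chunk absorbs the remainder
    step = math.ceil(n_items / n_steps)
    return [(start, min(start + step, n_items))
            for start in range(0, n_items, step)]

bounds = chunk_bounds(48353, 4)
print(bounds)
```

Each pair can be used directly as `wps_keys[start:end]`, and `len(bounds)` gives the number of steps for the progress message.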

In [13]:
# The above loop missed the very last word ("zygote") so we grab that manually
# This also serves as a direct example of how the words are obtained from the dictionary
zy = parser.define("zygote")

ps = ""

for defnum in range(len(zy)):
    try:
        # Obtain all pronunciations and format them correctly
        pslist = zy[defnum]['zygote'][0]['data']['US_IPA']
        for p in pslist:
            # Append the formatted pronunciation to space separated string of pronuns
            ps = ps + " " + p[0].replace(".","").replace("/","")
    # if something times out we can grab it later by hand 
    except (RuntimeError, KeyError, IndexError, TimeoutError):
        # Continue iterating if error occurs; these can be cleaned up manually later
        continue 

zygote_pronunciation_pair = [("zygote", ps)]

# Output this to a data frame and then a csv 
df = pd.DataFrame(zygote_pronunciation_pair, columns = ['Word', 'Pronunciation'])
df.to_csv("cambridge_ipas_step5.csv", index = False)
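
The extraction logic used in both cells can be sketched as a single function. The nesting (`result[i][word][0]['data']['US_IPA']`, a list whose items hold the pronunciation string at index 0) is taken from the cells above; the function name `extract_us_ipa` and the sample data are made up for illustration and do not come from a real Cambridge lookup. Unlike the cells above, this sketch joins pronunciations with single spaces and produces no leading space.

```python
def extract_us_ipa(result, word):
    """Hypothetical helper: collect US IPA pronunciations for `word`
    from a parser result shaped like the one used in this notebook."""
    parts = []
    for entry in result:
        try:
            pslist = entry[word][0]['data']['US_IPA']
        except (KeyError, IndexError, TypeError):
            # Skip definitions that carry no US IPA data
            continue
        for p in pslist:
            # Strip syllable dots and the surrounding slashes
            parts.append(p[0].replace(".", "").replace("/", ""))
    return " ".join(parts)

# Made-up result with the same shape; the second entry has no US_IPA key
fake_result = [
    {"zygote": [{"data": {"US_IPA": [["/ˈzaɪ.ɡoʊt/"]]}}]},
    {"zygote": [{"data": {}}]},
]
ipa = extract_us_ipa(fake_result, "zygote")
print(ipa)
```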

We wrote each step to its own CSV file to minimize the opportunity for error (and thus work lost). Now we read those files back into a single dict, make some final adjustments to the pronunciations, and output the completed result.

In [3]:
# Initialize an empty dictionary to hold the combined data
all_cambridge_ipas = {}

# Iterate over the CSV files
for i in range(1, 6):
    # Load the data frame from the CSV file
    df = pd.read_csv(f'cambridge_ipas_step{i}.csv')
    
    # Convert the data frame to a dictionary and update the combined dictionary
    all_cambridge_ipas.update(dict(zip(df['Word'], df['Pronunciation'])))

# Verify the length
len(all_cambridge_ipas.keys())

# Note this length is much smaller than the 48353 words we started with, because many words in the trimmed dictionary are not present in the Cambridge dictionary! We will get a file of these below.

25455
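
The words that did not come back from Cambridge are the difference between the trimmed word list and the merged results. A minimal sketch of one way to compute them, using small made-up stand-ins for `wps` and `all_cambridge_ipas` (the entries below are illustrative only and say nothing about which words are really missing):

```python
# Made-up stand-ins for the real dicts built earlier in the notebook
wps = {"abacus": "", "abductees": "", "zygote": ""}
all_cambridge_ipas = {"abacus": " ˈæbəkəs", "zygote": " ˈzaɪɡoʊt"}

# Words in the trimmed list with no pronunciation in the merged results
missing = sorted(set(wps) - set(all_cambridge_ipas))
print(missing)
```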

On inspection, many of the pronunciations also contain characters that are not part of the actual pronunciation. We remove all of these as well.

In [4]:
# Iterate through all words in the resultant dictionary.
# Remove the following characters: ˈ · ː - ˌ ,
for w, p in all_cambridge_ipas.items():
    all_cambridge_ipas[w] = p.replace("ˈ","").replace("·","").replace("ː","").replace("-","").replace("ˌ","").replace(",","")

print(all_cambridge_ipas)
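
The same removal can be expressed with a translation table instead of chained `replace` calls, which keeps the set of removed characters in one place. This is a sketch only; the names `REMOVE`, `TABLE`, and `clean` are hypothetical, and the sample pronunciations are made up to resemble the notebook's data.

```python
# Characters removed in the cell above
REMOVE = "ˈ·ː-ˌ,"
# Translation table that deletes every character in REMOVE
TABLE = str.maketrans("", "", REMOVE)

def clean(pronunciation: str) -> str:
    """Hypothetical helper: delete the unwanted characters."""
    return pronunciation.translate(TABLE)

# Made-up sample entries, including the leading space the earlier cells produce
sample = {"zygote": " ˈzaɪɡoʊt", "abdomen": " ˈæbdəmən, æbˈdoʊmən"}
cleaned = {w: clean(p) for w, p in sample.items()}
print(cleaned)
```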



In [9]:
# We manually override the pronunciation for "a" (chosen by inspection) and add the pronunciation for "i"
all_cambridge_ipas['a'] = "eɪ"
all_cambridge_ipas['i'] = "aɪ"

In [12]:
# Finally, output the result to a csv
df = pd.DataFrame(list(all_cambridge_ipas.items()), columns=['Word', 'Pronunciation'])
df.to_csv("cambridge_ipas.csv", index = False)