## Input Prep

This notebook prepares the CMU Dictionary for input into the Sublexical Toolkit for analysis.

Author: Caleb Solomon

In [10]:
# Imports
import pandas as pd
import cambridge_parser as parser
import re

### Task 1: Initialize the CMU Dictionary and trim it.

The first many lines of the dictionary file are useless, containing simple text. There are also a significant number of words containing numbers, parentheses, or other features that are unnecessary for input to the sublexical toolkit. Furthermore, we want to keep only words whose 

In [11]:
# Display the first 50 or so lines for reference to above.
fcmu = open('cmudict-0.7b-2024-4-6.txt')
for line in fcmu.readlines()[:20]:
    print(line.strip())
fcmu.close()

;;; # CMUdict  --  Major Version: 0.07
;;;
;;; # $HeadURL$
;;; # $Date::                                                   $:
;;; # $Id::                                                     $:
;;; # $Rev::                                                    $:
;;; # $Author::                                                 $:
;;;
;;; #
;;; # Copyright (C) 1993-2015 Carnegie Mellon University. All rights reserved.
;;; #
;;; # Redistribution and use in source and binary forms, with or without
;;; # modification, are permitted provided that the following conditions
;;; # are met:
;;; #
;;; # 1. Redistributions of source code must retain the above copyright
;;; #    notice, this list of conditions and the following disclaimer.
;;; #    The contents of this file are deemed to be source code.
;;; #


In [12]:
# This block takes ~1 min to run.
# Create a dictionary of words to pronunciations.
# Dict {str : str}
cmu_dict = {}

# Import the SUBTLEXUS csv to a pandas dataframe.
subtlexus = pd.read_csv('SUBTLEXusExcel2007.csv')

# Convert all words to lowercase
subtlexus['Word'] = subtlexus['Word'].str.lower()

# Regex for finding unwanted punctuation in words (essentially any non-word)
rpunc = r".*(\W|\d).*"
# Regex for three-peated characters (any word with three or more of the same
# letter in a row should be omitted, as none are valid English words for the
# purposes of the toolkit)
rpeat = r".*(.)\1\1.*"

# Iterate through the lines of the dictionary. Add only such words containing
# no parentheses and with a corresponding entry in the SUBTLEXUS to the
# dictionary of cmu words that will be kept for analysis.
with open('cmudict-0.7b-2024-4-6.txt') as file:
    # Skip the first 56 lines as these contain text we are not interested in
    for line in file.readlines()[56:]:
        word, pronunciation = line.strip().split(maxsplit=1)
        word = word.lower()
        # Ensure the word doesn't contain punctuation and is present in the
        # SUBTLEXUS
        if re.match(rpunc, word) is None \
            and re.match(rpeat, word) is None \
            and word in subtlexus['Word'].values:
            cmu_dict[word] = pronunciation

print(cmu_dict)



In [13]:
# Display the final number of words in the dataset
len(cmu_dict)

48353

### Task 2: Cross-reference CMU Dictionary Pronunciations with Cambridge Prounciations

First, the CMU dictionary pronunciations will need to be converted to reflect the Cambridge dictionary pronunciation format. The transcriptions csv aids in these conversions.

In [16]:
# Generates a list of all possible transcriptions of a cmu word in IPA
# form.
def possible_transcriptions(cmu_word, replacements):
    def helper(index, current_transcription):
        if index == len(cmu_word):
            transcriptions.append(current_transcription)
            return

        phoneme = cmu_word[index]
        if phoneme in replacements:
            for option in replacements[phoneme]:
                helper(index + 1, current_transcription + option)
        else:
            helper(index + 1, current_transcription + phoneme)

    # Call for the word
    transcriptions = []
    helper(0, "")

    # Remove all whitespace and numbers from the resultant transcription
    for t in transcriptions:
        t = t.replace(" ", "")
        t = re.sub(r"\d", "", t)

    return transcriptions

In [19]:
# Load the transcriptions csv
transcriptions = pd.read_csv('transcriptions.csv')

# Convert the cmu_dict dictionary to a pandas dataframe
cmu_df = pd.DataFrame(list(cmu_dict.items()), columns=['Word', 'Pronunciation'])

# Iterate through the transcriptions and generate a dict of transcriptions
# There are two special cases: ER and AA, where each have two different
# representation possibilities. These cases need to be handled separately.
# Furthermore, sometimes "AA" is followed by a number of the format "AAn". In
# such cases we ignore the number and just replace as AA. To do so after we 
# apply all pronunciation transcriptions we just remove the remaining numbers
# from the transcription. This is done below.
replacements = {}  # Dict{str : [str]}
special_replacements_ER = ["ɝ", "ɚ"]
special_replacements_AA = ["ɑ", "ɒ"]

for index, row in transcriptions.iterrows():
    cmu_p = row['CMU']
    ipa_p = row['IPA']

    # Skip the special cases where the CMU pronunciation is "ER" or "AA"
    if cmu_p == "ER" or cmu_p == "AA":
        continue
    
    # Add the IPA representation transcription to the dictionary
    replacements[cmu_p] = [ipa_p]

replacements["ER"] = special_replacements_ER
replacements["AA"] = special_replacements_AA

# Iterate through the cmu_dict dictionary and replace all CMU pronunciations
# with a list of all possible corresponding pronunciation transcriptions in
# IPA.
for index, row in cmu_df.iterrows():
    ts = possible_transcriptions(row['Pronunciation'], replacements)

    cmu_df.at[index, 'Pronunciation'] = ts

print(cmu_df)

           Word        Pronunciation
0             a                [AH0]
1            aa            [Ej2 Ej1]
2           aah                [AA1]
3      aardvark  [AA1 ɹ d v AA2 ɹ k]
4         aargh            [AA1 ɹ ɡ]
...         ...                  ...
48348     zulus      [z Uw1 l Uw0 z]
48349      zuni        [z Uw1 n Ij2]
48350    zurich      [z UH1 ɹ IH0 k]
48351    zydeco  [z Aj1 d AH0 k Ow2]
48352    zygote      [z Aj1 ɡ Ow0 t]

[48353 rows x 2 columns]


In [None]:
# Now that all of the CMU dictionary pronunciations have been updated to IPA
# representation, we go through all words in the Cambridge dictionary, and
# obtain their IPA representations to see if there are any discrepancies in the
# CMU data.
# In the case of a discrepancy the Cambridge dictionary is treated as correct
# i.e. with precedence.
# While we could simply take all pronunciations from Cambridge and use those,
# it is interesting to see where the Cambridge differs from the CMU.