##### Katherine Kairis,  kak275@pitt.edu,  9/4/2017

Data set info:
The Carnegie Mellon Pronouncing Dictionary can be downloaded at http://www.nltk.org/nltk_data/ and is 3.8 MB. It carries no specific license: its use is completely unrestricted for any research or commercial purpose. The pronouncing dictionary is a large text file that contains the pronunciations of more than 120,000 English words. For some words, multiple pronunciations are provided. Each line in the file represents one entry and contains the written word, a counter (to keep track of multiple pronunciations), and the word's phonetic transcription.
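
To make the entry format concrete, here is a minimal sketch of how one line could be split into its three parts. The helper name `parse_entry` is hypothetical (it is not used elsewhere in this notebook), and the sample line is one of the entries printed further down.

```python
def parse_entry(line):
    """Split one dictionary line into (word, counter, phonemes).

    Assumes whitespace-separated fields: the word, an integer counter,
    and then one token per phoneme.
    """
    fields = line.split()
    word = fields[0]
    counter = int(fields[1])
    phonemes = fields[2:]
    return word, counter, phonemes


sample = "ACKNOWLEDGE 2 IH0 K N AA1 L IH0 JH"
word, counter, phonemes = parse_entry(sample)
print(word, counter, phonemes)
```

The digits attached to vowels (as in `AA1`) are the stress markers; consonants carry no digit.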

Summary: The code iterates through all of the entries in the pronouncing dictionary. As the iteration takes place, information about each entry is added to two dictionaries ('sound_counts' and 'word_dict') and one list ('pronunciations'). The 'sound_counts' dictionary keeps track of the frequency of each phoneme (the keys are single phonemes and the values are the phonemes' frequencies); the 'word_dict' dictionary keeps track of the number of different pronunciations for each word (the keys are the dictionary words and the values are counts of the words' pronunciations); and the 'pronunciations' list contains all the pronunciations/phonetic transcriptions found in the CMU dictionary. After the program iterates through all of the file's entries, I use these dictionaries and the list to make a few discoveries. I use the sound_counts dictionary to find the five phonemes that occur most frequently in the dictionary's words. I use word_dict to find the number of unique words that have pronunciations listed in the dictionary, and to compare this value to the total number of entries. NumPy and the values of word_dict are also used to find the average number of pronunciations/entries per word. Finally, I use the 'pronunciations' list to determine the average number of sounds per word, averaged over all entries.
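
As a rough illustration of the three data structures, the sketch below builds them from a three-line sample instead of the full file. It swaps in `collections.Counter` for the hand-written if/else counting used in the notebook's main loop; that is an alternative, not what the code below the summary actually does. The `sample_lines` list is made up of entries that appear in the output further down.

```python
import re
from collections import Counter

# Small stand-in for the full dictionary file (assumption: same line format).
sample_lines = [
    "ACKNOWLEDGE 1 AE0 K N AA1 L IH0 JH",
    "ACKNOWLEDGE 2 IH0 K N AA1 L IH0 JH",
    "BIWEEKLY 1 B AY0 W IY1 K L IY0",
]

sound_counts = Counter()   # phoneme (stress removed) -> frequency
word_dict = Counter()      # word -> number of pronunciations
pronunciations = []        # one list of phonemes per entry

for line in sample_lines:
    fields = line.split()
    word, sounds = fields[0], fields[2:]
    pronunciations.append(sounds)
    word_dict[word] += 1
    # Strip stress digits so that AA1 and AA0 count as the same phoneme.
    sound_counts.update(re.sub(r"[0-9]", "", s) for s in sounds)

print(word_dict["ACKNOWLEDGE"], sound_counts["IH"], len(pronunciations))
```

A `Counter` returns 0 for missing keys, so no membership test is needed before incrementing.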

Future wish: I think it would be interesting to investigate the syllables in the pronouncing dictionary. Specifically, I would be interested in finding the longest syllable, the most common syllables, etc. The Carnegie Mellon Pronouncing Dictionary does contain stress markers. I did not consider them this time, but they could be helpful when looking into syllable structures. I am not yet sure exactly how to approach these problems.
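
One possible starting point, offered only as a rough sketch: since stress digits are attached to vowel phonemes, counting the phonemes that end in a digit gives the number of syllables in a transcription, and reading the digits in order gives its stress pattern. The function names `count_syllables` and `stress_pattern` are hypothetical. Finding actual syllable boundaries (which consonants belong to which syllable) would need more than this.

```python
def count_syllables(phonemes):
    """Count syllables as the number of phonemes that carry a stress digit."""
    return sum(1 for p in phonemes if p[-1].isdigit())


def stress_pattern(phonemes):
    """Return the stress digits in order, e.g. '010' (1 = primary stress)."""
    return "".join(p[-1] for p in phonemes if p[-1].isdigit())


enlargement = "IH0 N L AA1 R JH M AH0 N T".split()
print(count_syllables(enlargement), stress_pattern(enlargement))
```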

In [1]:
import re
import numpy as np

In [2]:
#Open the dictionary file and read all of its lines.
#The 'with' block closes the file automatically, even if reading fails.
with open('data/cmudict/cmudict') as f:
    lines = f.readlines()

### Snippets of the data

In [3]:
#Print a few entries from the dictionary file
print("Here are some entries from the dictionary:")
print('\t' + lines[731].replace('\n', ''))
print('\t' + lines[732].replace('\n', ''))
print('\t' + lines[11856].replace('\n', ''))
print('\t' + lines[24856].replace('\n', ''))
print('\t' + lines[37856].replace('\n', ''))

Here are some entries from the dictionary:
	ACKNOWLEDGE 1 AE0 K N AA1 L IH0 JH
	ACKNOWLEDGE 2 IH0 K N AA1 L IH0 JH
	BIWEEKLY 1 B AY0 W IY1 K L IY0
	COOPERATED 2 K W AA1 P ER0 EY2 T AH0 D
	ENLARGEMENT 1 IH0 N L AA1 R JH M AH0 N T


In [4]:
longest = []
pronunciations = []
sound_counts = {}
word_dict = {}

for pronunciation in lines:
    entry = pronunciation.replace('\n', '')

    #Split the current line/entry to get the word and the word's phonetic transcription
    entry = entry.split()
    word = entry[0]
    sounds = entry[2:]

    #Add the phonetic transcription to the pronunciations list
    pronunciations.append(sounds)

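    #If the current word is the longest word (in terms of the number of sounds) found so far,
    #store the entry in 'longest'. Every entry has the same two leading fields (word and
    #counter), so comparing entry lengths is equivalent to comparing sound counts.
    if len(entry) > len(longest):
        longest = entry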

    #If the word is already in word_dict, increase its count by 1. Otherwise, add the word to
    #word_dict and set its count to 1.
    if word in word_dict:
        word_dict[word] += 1
    else:
        word_dict[word] = 1

    #Iterate through all of the phonemes in the word's transcription
    for s in sounds:
        #Remove any stress markers from the current phoneme.
        s = re.sub("[0-9]", '', s)
        
        #If the phoneme is already in sound_counts, increase its count by 1. Otherwise,
        #add the phoneme to sound_counts, and set its count to 1.
        if s in sound_counts:
            sound_counts[s] += 1
        else:
            sound_counts[s] = 1


### Basic Stats

In [5]:
#Number of pronunciations
print("There are", len(lines), "entries/pronunciations in the dictionary.")

#Number of words
print("The dictionary has pronunciations listed for", len(word_dict), "different words.")


There are 133737 entries/pronunciations in the dictionary.
The dictionary has pronunciations listed for 123455 different words.


### Some Discoveries

In [6]:
#Longest word
print("The longest word in the dictionary is", longest[0])
print("It has", len(longest[2:]), "sounds:", longest[2:])

The longest word in the dictionary is SUPERCALIFRAGILISTICEXPEALIDOSHUS
It has 32 sounds: ['S', 'UW2', 'P', 'ER0', 'K', 'AE2', 'L', 'AH0', 'F', 'R', 'AE1', 'JH', 'AH0', 'L', 'IH2', 'S', 'T', 'IH0', 'K', 'EH2', 'K', 'S', 'P', 'IY0', 'AE2', 'L', 'AH0', 'D', 'OW1', 'SH', 'AH0', 'S']


In [7]:
#Word(s) with greatest number of different pronunciations
print("Word(s) with most pronunciations:")
previous = 0
for s in sorted(word_dict, key=word_dict.get, reverse = True)[:5]:
    if word_dict[s] < previous:
        break
    print('\t' + s + ': ' + str(word_dict[s]) + ' different pronunciations.')
    previous = word_dict[s]

Word(s) with most pronunciations:
	FEBRUARY: 5 different pronunciations.
	FEBRUARY'S: 5 different pronunciations.


In [8]:
#Most frequent sounds
print("Most frequent sounds:")
for s in sorted(sound_counts, key=sound_counts.get, reverse = True)[:5]:
    print('\t' + s + ': ' + str(sound_counts[s]) + ' occurrences')

Most frequent sounds:
	AH: 71410 occurrences
	N: 60564 occurrences
	S: 50427 occurrences
	IH: 50093 occurrences
	L: 49479 occurrences


In [9]:
#Average number of different pronunciations per word
num_prons = list(word_dict.values())
array = np.asarray(num_prons)
print("On average, each word has", np.mean(array), "different pronunciations.")


On average, each word has 1.08328540764 different pronunciations.
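
As a sanity check on this value: every entry adds exactly 1 to exactly one word's count, so the mean of the counts should equal the total number of entries divided by the number of distinct words. The sketch below redoes that arithmetic with the two totals reported under Basic Stats.

```python
# Totals copied from the "Basic Stats" output above.
total_entries = 133737
distinct_words = 123455

# Mean pronunciations per word = sum of all counts / number of words.
mean_pronunciations = total_entries / distinct_words
print(mean_pronunciations)
```

The result agrees with the NumPy mean printed above, and the difference between the two totals (10,282) is the number of entries that are additional pronunciations of a word already listed.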


In [10]:
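#Average number of sounds per word. The mean is taken over all entries, so a word
#with several pronunciations is counted once per pronunciation.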
num_sounds = [len(w) for w in pronunciations]
array = np.asarray(num_sounds)
print("The average number of sounds per word is", np.mean(array))

The average number of sounds per word is 6.38505424826
