# Step 2: Let's count syllables!

## Counting Syllables

Listen to the various candidates talk, and it is immediately apparent that they have different speaking styles and use different words.

`nltk` comes with the Carnegie Mellon pronunciation dictionary, which has pretty good English coverage (≈134k words). It lets us translate a lexical item (e.g. "monkey") into ARPAbet (e.g. "M AH1 NG K IY0"). In ARPAbet, the digit at the end of each vowel phoneme marks its stress: 0 for unstressed, 1 for primary stress, 2 for secondary stress. Consonant phonemes carry no digit. In reality, pronunciation will vary according to accent, etc., but this is a "good enough" place to start.
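To make the stress digits concrete, here is a minimal sketch that pulls the stress pattern out of a pronunciation. It does not load the real dictionary: `SAMPLE` is a hand-copied stand-in for the word-to-pronunciations mapping that `cmudict` provides, and `stress_pattern` is a hypothetical helper name, not part of `nltk`.

```python
# Hand-copied stand-in for the dictionary: each word maps to a list of
# pronunciation variants, and each variant is a list of ARPAbet phonemes.
SAMPLE = {"monkey": [["M", "AH1", "NG", "K", "IY0"]]}

def stress_pattern(phonemes):
    """Return the stress digit of each vowel phoneme, in order.

    Only vowel phonemes end in a digit, so consonants are skipped.
    """
    return [int(p[-1]) for p in phonemes if p[-1].isdigit()]

pattern = stress_pattern(SAMPLE["monkey"][0])
print(pattern)  # [1, 0]: primary stress on the first vowel, none on the second
```

The length of that list is the number of vowel phonemes in the word.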

We can use the CMU pronunciation dictionary (_cmudict_) to get a quick-and-dirty count of how many syllables a word has by simply counting vowel phonemes:

In [1]:
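import nltk
import re
# requires the cmudict corpus; if it is missing, run nltk.download("cmudict") once
cmu = nltk.corpus.cmudict.dict() # maps each word to a list of pronunciation variants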

In [2]:
def syllable_count(w):
    # vowel phonemes are the only ones that end in a stress digit (e.g. "AH1", "IY0")
    vowel_pattern = re.compile(r".+\d$")
    try:
        phonemes = cmu[w.lower()][0] # we only care about the first pronunciation variant
    except KeyError: # if w is not in the CMU dictionary
        # there are several possible ways to deal with this situation; returning -1 is a bit ugly,
        # but has the advantage of being easy to filter out later.
        return -1
    vowel_phonemes = [p for p in phonemes if vowel_pattern.match(p)]
    return len(vowel_phonemes)

In [3]:
syllable_count("monkey")

2

This is, without a doubt, an imperfect method, but it will be good enough for starters.
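Since unknown words come back as -1, they need to be filtered out before any averaging. The sketch below shows one way to do that. It is self-contained, so it uses a tiny hand-written dictionary (`TOY_DICT`) and a stand-in counter (`toy_syllable_count`) instead of the real _cmudict_. Both names, and `mean_syllables`, are hypothetical, and the made-up word "covfefe" simply plays the role of an out-of-dictionary token.

```python
import re

VOWEL = re.compile(r".+\d$")  # vowel phonemes end in a stress digit

# Toy stand-in for the pronunciation dictionary (assumption: same shape,
# word -> list of pronunciation variants, each a list of phonemes).
TOY_DICT = {
    "monkey": [["M", "AH1", "NG", "K", "IY0"]],
    "see": [["S", "IY1"]],
}

def toy_syllable_count(w, pron=TOY_DICT):
    """Count vowel phonemes in the first variant; -1 if the word is unknown."""
    try:
        phonemes = pron[w.lower()][0]
    except KeyError:
        return -1
    return sum(1 for p in phonemes if VOWEL.match(p))

def mean_syllables(words, count_fn=toy_syllable_count):
    """Return (mean syllables over known words, share of words that were known)."""
    counts = [count_fn(w) for w in words]
    known = [c for c in counts if c != -1]  # drop the -1 sentinel
    coverage = len(known) / len(counts) if counts else 0.0
    mean = sum(known) / len(known) if known else None
    return mean, coverage

mean, coverage = mean_syllables(["Monkey", "see", "covfefe"])
print(mean, coverage)
```

Reporting the coverage alongside the mean makes it visible how many words were skipped, which matters if some speakers use more out-of-dictionary words than others.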