# **Poetry Rhyme Scheme Calculator**
### Intro to Computational Linguistics Final Project
#### By: Cade Edney

##### **Summary:**
While a rhyme scheme is often easy to spot by reading a poem, it can also be determined with code. This project takes popular poems from the Gutenberg Corpus and uses Carnegie Mellon University's Pronunciation Dictionary to determine the rhyme scheme of the chosen poem. To do so:
1. The poem chosen is broken up stanza-by-stanza. 
2. Each stanza is then broken up line-by-line and we look at the final syllable in the line. 
3. Final syllables that match within a stanza are assigned the same letter. 
4. Once the code is done determining each stanza's rhyme scheme in this manner, it determines the most used rhyme scheme using a Counter object.

## Table of Contents
* [Constants and Imports](#constants-and-imports)
* [ARPABET and CMU Functions](#arpabet-and-cmu-functions)
* [Preprocessing Functions](#preprocessing-functions)
* [Poem Splitting Functions](#poem-splitting-functions)
* [Poem Data Functions](#poem-data-functions)
* [Rhyme Scheme Functions](#rhyme-scheme-functions)
* [Printing Results Functions](#printing-results-functions)
* [Print Options Functions](#print-options-functions)
* [User Input](#user-input)

## Constants and Imports
This section details the constant values and imports needed for the program. They are:
* Imports
    - **re:**&ensp;for regex matching
    - **nltk:**&ensp;for CMU Pronunciation Dictionary
    - **Counter:**&ensp;to find most used rhyme schemes
    - **requests:**&ensp;for UberDuck API calls, to get ARPABET translation
* Constants
    - **cmu:**&ensp;the local copy of CMU Pronunciation Dictionary
    - **rhyme_alphabet:**&ensp;the list of letters used for the rhyme schemes, starting at A and continuing through the English alphabet
    - **vowel_features:**&ensp;a dictionary mapping ARPABET vowels to a tuple containing their backness, height, roundness, and vowel type (monophthong or diphthong)
    - **poem_paths:**&ensp;a list of the local paths of the poems available to the user
    - **poem_names:**&ensp;a list of the names of the poems available to the user
    - **poets:**&ensp;a list of the names of the poets for each of the above poems


In [1]:
# Imports
import re
import nltk
from collections import Counter
import requests

# Constants
cmu = nltk.corpus.cmudict.dict()
"""Carnegie Mellon Pronunciation Dictionary, contains ARPABET values for English words"""

rhyme_alphabet = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P']
"""Letters used to dictate rhyme scheme. Functions will start at A and continue on through letters"""

vowel_features = {
    'AA': ('back', 'low', 'unround', 'mono'),
    'AE': ('front', 'low', 'unround', 'mono'),
    'AH': ('central', 'mid', 'unround', 'mono'),
    'AO': ('back', 'low', 'unround', 'mono'),
    'AW': ('back', 'mid', 'round', 'di'),
    'AY': ('front', 'high', 'unround', 'di'),
    'EH': ('front', 'mid', 'unround', 'mono'),
    'ER': ('central', 'mid', 'unround', 'mono'),
    'EY': ('front', 'mid', 'unround', 'di'),
    'IH': ('front', 'high', 'unround', 'mono'),
    'IY': ('front', 'high', 'unround', 'mono'),
    'OW': ('back', 'mid', 'round', 'di'),
    'OY': ('front', 'high', 'round', 'di'),
    'UH': ('back', 'high', 'unround', 'mono'),
    'UW': ('back', 'high', 'round', 'mono')
}
"""A dictionary mapping the ARPABET vowels to their respective vowel features (backness, height, roundness, monophthong/diphthong)"""

poem_paths = ['./poems/ColorintheWheat.txt', './poems/Excelsior.txt', './poems/Home.txt', './poems/JamieDouglas.txt', './poems/TheBrokenPinion.txt', './poems/TheEnsignBearer.txt', './poems/TheHeightoftheRidiculous.txt', './poems/TheHouseWithNobodyInIt.txt', './poems/ThePolishBoy.txt', './poems/TheRealRiches.txt']
"""List of the local file paths of the poems available to the user"""

poem_names = ['Color in the Wheat', 'Excelsior', 'Home', 'Jamie Douglas', 'The Broken Pinion', 'The Ensign Bearer', 'The Height of the Ridiculous', 'The House With Nobody In It', 'The Polish Boy', 'The Real Riches']
"""List of the names of the poems available to the user"""

poets = ['Hamlin Garland', 'Henry W. Longfellow', 'Edgar A. Guest', 'Anonymous', 'Hezekiah Butterworth', 'Anonymous', 'Oliver Wendell Holmes', 'Joyce Kilmer', 'Ann S. Stephens', 'John G. Saxe']
"""List of the authors of the poems available to the user"""



## ARPABET and CMU Functions
The following cell contains functions that directly relate to ARPABET and the CMU Pronunciation Dictionary. They are as follows:
* **is_vow(arpa):**&ensp;returns whether the given ARPABET character is a vowel or not

* **strip_vow(arpa):**&ensp;returns the ARPABET character without numbers, to get solely the vowel part

* **get_arpabet(word):**&ensp;uses UberDuck API to get the ARPABET sequence from the given word

* **detect_slant_rhyme(syll1, syll2, threshold=2):**&ensp;takes two syllables and returns True, indicating a slant rhyme, if their vowels have at least **threshold** features in common and their remaining sounds match; returns False otherwise
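
As a rough illustration of the feature comparison behind slant rhymes, the sketch below counts shared vowel features and then checks that the sounds after the vowel match. The `FEATURES` table is a trimmed-down, hypothetical stand-in for **vowel_features**, and the names `shared_features` and `is_slant` are made up for this example; it is not the notebook's implementation.

```python
# Hypothetical, trimmed-down feature table: (backness, height, roundness).
FEATURES = {
    "IY": ("front", "high", "unround"),
    "IH": ("front", "high", "unround"),
    "UW": ("back", "high", "round"),
    "AA": ("back", "low", "unround"),
}


def shared_features(v1, v2):
    """Count the positions at which two vowels carry the same feature value."""
    return sum(a == b for a, b in zip(FEATURES[v1], FEATURES[v2]))


def is_slant(syll1, syll2, threshold=2):
    """Syllables are lists such as ['IY', 'T']: the vowel first, then the coda."""
    if not syll1 or len(syll1) != len(syll2):
        return False
    if syll1[1:] != syll2[1:]:  # everything after the vowel must match exactly
        return False
    return shared_features(syll1[0], syll2[0]) >= threshold


print(is_slant(["IY", "T"], ["IH", "T"]))  # vowels share all three features
print(is_slant(["IY", "T"], ["AA", "T"]))  # vowels share only roundness
print(is_slant(["IY", "T"], ["IY", "D"]))  # codas differ
```

Raising `threshold` makes the test stricter; with three features per vowel in this toy table, a threshold of 3 only accepts vowels with identical features.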

In [2]:
def is_vow(arpa):
    """
    Function that takes in an ARPABET symbol and detects whether it is a vowel
        
            Parameters:
                arpa (str): An ARPABET character
                
            Returns:
                (bool): True if arpa is a vowel, false otherwise
    """
    
    return bool(re.match(r'\w{2,}\d', arpa))

def strip_vow(arpa):
    """
    Function that takes in an ARPABET symbol and, if it's a vowel, returns only the alphabetic part
        
            Parameters:
                arpa (str): An ARPABET character
                
            Returns:
                (str): The ARPABET character, without numbers
    """
    
    if is_vow(arpa):
        arpa = re.sub("[0-9]", "", arpa)
    return arpa

def get_arpabet(word):
    """
    Function that takes in a word and calls the UberDuck API to get the ARPABET sequence
        
            Parameters:
                word (str): A word
                
            Returns:
                (list(list(str))): A one-element list holding the ARPABET sequence for the word, matching the
                \t\t\t   list-of-pronunciations format used by the CMU dictionary
    """
    
    url = "https://api.uberduck.ai/g2p"
    headers = {"accept": "application/json"}
    # Pass the word as a query parameter so that requests URL-encodes it
    response = requests.get(url, params={"text": word}, headers=headers)
    response.raise_for_status()
    arpabet_str = response.json()['arpabet']
    # The API wraps the sequence in braces; strip them before splitting
    arpabet_str = re.sub(r'[{}]', '', arpabet_str)
    return [arpabet_str.split(" ")]

def detect_slant_rhyme(syll1, syll2, threshold=2):
    """
    Function that takes in two syllables and returns whether they are a slant rhyme or not
        
            Parameters:
                syll1 (list(str)): A list of ARPABET characters
                syll2 (list(str)): Another list of ARPABET characters
                threshold (int): The number of features needed to be the same for the syllables to be considered a 
                \t\t\t  slant rhyme
                
            Returns:
                (bool): True if the vowels share at least threshold features and the remaining sounds match, false otherwise
    """
    
    if len(syll1) != len(syll2):
        return False
    v1 = syll1[0]
    v2 = syll2[0]
    features1 = vowel_features.get(v1)
    features2 = vowel_features.get(v2)
    if not (features1 and features2):
        return False
    similarity = 0
    for f1, f2 in zip(features1, features2):
        if f1 == f2:
            similarity += 1
    if similarity < threshold:
        return False
    for i in range(1, len(syll1)):
        if syll1[i] != syll2[i]:
            return False
    return True


## Preprocessing Functions
The next cell contains functions that help preprocess text, which are:
* **preprocess_line(line):**&ensp;cleans the given line by:
    - removing the em dash&ensp;'—'
    - replacing hyphens&ensp;'-'&ensp;with spaces
    - removing new lines&ensp;'\n'

In [3]:
def preprocess_line(line):
    """
    Function that takes in a line and removes certain characters
        
            Parameters:
                line (str): A string
                
            Returns:
                line (str): The modified line, with em dashes '—' and new lines '\\n' removed and
                \t\t    hyphens '-' replaced by spaces
    """
    
    line = re.sub(r"—", "", line)
    line = re.sub(r"-", " ", line)
    line = re.sub(r'\n', '', line)
    return line

## Poem Splitting Functions
This section of functions is what splits a poem into different pieces, such as stanzas and lines:
* **get_stanzas(poem_path):**&ensp;divides the given poem into stanzas, where each stanza is a list of lines

* **get_lines(poem_path):**&ensp;reads the poem from the given file path, separating the input into lines
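
The stanza-splitting idea can be sketched without touching the file system. The example below works on an in-memory string rather than a poem path, and `split_stanzas` and the sample text are hypothetical names chosen for illustration; it assumes stanzas are separated by one or more blank lines.

```python
def split_stanzas(text):
    """Group non-blank lines into stanzas; blank lines act as separators."""
    stanzas, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            # A blank line closes the stanza; extra blank lines are ignored
            stanzas.append(current)
            current = []
    if current:  # the text may not end with a blank line
        stanzas.append(current)
    return stanzas


sample = "Roses are red,\nViolets are blue.\n\n\nSugar is sweet,\nAnd so are you."
for number, stanza in enumerate(split_stanzas(sample), start=1):
    print(number, stanza)
```

Flushing the last stanza after the loop matters because poem files often end without a trailing blank line.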

In [4]:
def get_stanzas(poem_path):
    """
    Function that takes in the path to a poem and splits the poem into stanzas
        
            Parameters:
                poem_path (str): The local file path of a poem
                
            Returns:
                stanzas (list(list(str))): The poem as a list of lists of strings, where each string is a line, each list of
                \t\t\t\t       strings is a stanza, and the list of lists of strings is the poem itself
    """
    
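    lines = get_lines(poem_path)
    stanzas = []
    curr_stanza = []
    for line in lines:
        if line.strip():
            curr_stanza.append(line)
        elif curr_stanza:
            # A blank line closes the current stanza; repeated blank lines are ignored
            stanzas.append(curr_stanza)
            curr_stanza = []
    # Keep the final stanza when the file does not end with a blank line
    if curr_stanza:
        stanzas.append(curr_stanza)
    return stanzas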
        

def get_lines(poem_path):
    """
    Function that takes in the path to a poem and splits the poem into lines
        
            Parameters:
                poem_path (str): The local file path of a poem
                
            Returns:
                (list(str)): The poem as a list of strings, where each string is a line and the list of strings is the poem 
                \t\t itself
    """
    
    with open(poem_path) as f:
        return f.readlines()

## Poem Data Functions
The following cell contains functions that get data directly from the poem, such as:
* **get_longest_line(poem):**&ensp;gets the length of the longest line, to be used for formatting the eventual output

* **get_last_syllable(line):**&ensp;returns the final syllable of the given line, starting at the last vowel, in ARPABET form

* **get_last_syllables(stanza):**&ensp;returns a list of the last syllables for each line in the given stanza
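
To make the "final syllable, starting at the last vowel" idea concrete, here is a minimal sketch that works on hand-typed ARPABET pronunciations instead of a dictionary lookup. The function name `last_syllable` is hypothetical, and the sketch assumes the CMU convention that vowels, and only vowels, end in a stress digit.

```python
import re


def last_syllable(pron):
    """Return the phones from the last vowel onward, with stress digits removed."""
    tail = []
    for phone in reversed(pron):
        tail.insert(0, re.sub(r"\d", "", phone))
        if re.fullmatch(r"[A-Z]+\d", phone):  # vowels carry a stress digit
            break
    return tail


# Hand-typed pronunciations for "night" and "height"
print(last_syllable(["N", "AY1", "T"]))   # ['AY', 'T']
print(last_syllable(["HH", "AY1", "T"]))  # ['AY', 'T']
```

Because both words reduce to the same tail, they would receive the same rhyme letter.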

In [5]:
def get_longest_line(poem):
    """
    Function that takes in the stanzas of a poem and gets the length of the longest line amongst them
        
            Parameters:
                poem (list(list(str))): The poem as a list (stanzas) of lists (each stanza) of strings (each line)
                
            Returns:
                longest (int): The length (in number of characters) of the longest line in the poem
    """
    # Named longest rather than max to avoid shadowing the built-in max()
    longest = 0
    for stanza in poem:
        for line in stanza:
            if len(line) > longest:
                longest = len(line)
    return longest

def get_last_syllable(line):
    """
    Function that finds the last syllable of a line
            
            Parameters:
                line (str): A line of a poem
                
            Returns:
                syll (list(str)): The ARPABET sequence of the final syllable (starting at the last vowel) of the given line
    """
    
    line = preprocess_line(line)
    words = nltk.tokenize.word_tokenize(line)
    words = [word.lower() for word in words if word.isalpha()]
    final_word = words[-1]
    final_prons = cmu.get(final_word)
    if not final_prons:
        # Word is missing from the CMU dictionary: ask the API and cache the result
        final_prons = get_arpabet(final_word)
        cmu[final_word] = final_prons
    final_pron = final_prons[0]
    # Walk backwards from the end of the word, stopping at the last vowel
    syll = []
    for arpa in reversed(final_pron):
        syll = [strip_vow(arpa)] + syll
        if is_vow(arpa):
            return syll
    return syll

def get_last_syllables(stanza):
    """
    Function that takes in a stanza of a poem and gets the last syllables of each line in the stanza
        
            Parameters:
                stanza (list(str)): A stanza, which is a list of lines (strings)
                
            Returns:
                sylls (list(tuple(str))): The list of ARPABET sequences of the final syllables of each line in the stanza
    """
    
    sylls = []
    for line in stanza:
        if line != '':
            sylls.append(tuple(get_last_syllable(line)))
    return sylls

## Rhyme Scheme Functions
This next cell is where the meat of the analysis is happening. Using many of the above functions, these will calculate the rhyme scheme for a stanza, and the whole poem:
* **calculate_scheme(stanza):**&ensp;given a stanza (list of lines), will return the rhyme scheme in the form of a list of letters taken from the constant **rhyme_alphabet** defined above

* **calculate_full_scheme(poem_path):**&ensp;uses the poem path provided to get the poem, and then calculate the rhyme scheme of each stanza. This function then returns that list of lists of letters
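
The letter-assignment step can be sketched in isolation. The example below only handles exact matches between line endings (slant rhymes are left out), uses `string.ascii_uppercase` in place of **rhyme_alphabet**, and assumes a stanza has at most 26 distinct endings; `label_endings` and the sample data are hypothetical.

```python
from string import ascii_uppercase


def label_endings(endings):
    """Give each distinct ending the next unused letter, in order of first appearance."""
    seen = {}
    letters = []
    for ending in endings:
        if ending not in seen:
            seen[ending] = ascii_uppercase[len(seen)]
        letters.append(seen[ending])
    return letters


# Made-up final syllables for a five-line, limerick-shaped stanza
limerick = [("IY",), ("IY",), ("AW", "T"), ("AW", "T"), ("IY",)]
print("".join(label_endings(limerick)))  # AABBA
```

Using `len(seen)` as the index means a letter is only looked up when a new ending actually appears.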

In [6]:
def calculate_scheme(stanza):
    """
    Function that calculates the rhyme scheme of the stanza
    
    Looking at each line in the stanza, this function matches the syllables that rhyme or slant rhyme, depending on the final
    vowel and consonants of each line.
        
            Parameters:
                stanza (list(str)): A stanza of a poem
                
            Returns:
                letters (list(str)): The rhyme scheme in terms of the letters in rhyme_alphabet. For example, if given a
                \t\t\t     limerick, the return value would be ['A', 'A', 'B', 'B', 'A']
    """
    
    sylls = get_last_syllables(stanza)
    letters = []
    rhymes = {}
    for syll in sylls:
        if syll in rhymes:
            letters.append(rhymes[syll])
            continue
        # No exact match: look for a slant rhyme among the endings seen so far
        for key in rhymes:
            if detect_slant_rhyme(key, syll):
                letters.append(rhymes[key])
                break
        else:
            # New ending: assign the next unused letter. Looking it up only when needed avoids
            # indexing past the end of rhyme_alphabet once its last letter has been handed out
            new_letter = rhyme_alphabet[len(rhymes)]
            letters.append(new_letter)
            rhymes[syll] = new_letter
    return letters

def calculate_full_scheme(poem_path):
    """
    Function that calculates the rhyme scheme of the entire poem by going through each stanza in the poem and compiling the
    rhyme scheme for each.
        
            Parameters:
                poem_path (str): The local file path of the poem
                
            Returns:
                schemes (list(list(str))): The rhyme schemes in terms of letters in rhyme_alphabet for each stanza
    """
    
    stanzas = get_stanzas(poem_path)
    schemes = []
    for stanza in stanzas:
        schemes.append(calculate_scheme(stanza))
    return schemes

## Printing Results Functions
These functions will compile the work done by other functions above and print out the results.
* **print_poem_with_scheme(poem_path):**&ensp;given the poem path, gets the poem and calculates the rhyme scheme of the full poem. Then prints out a side-by-side display of the poem and rhyme scheme, with each line corresponding to the correct letter in the calculated rhyme scheme. It then returns a list of tuples of letters so that future functions can use it

* **print_most_common_scheme(tuple_scheme):**&ensp;receives the rhyme scheme as a list of tuples of letters and puts it into a Counter. It then prints the most common rhyme scheme, or every scheme tied for most common if there is a tie
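
Tie handling with a Counter can be sketched as follows. The helper name `most_common_schemes` and the sample schemes are hypothetical; the sketch returns the tied schemes instead of printing them so that the result is easy to inspect.

```python
from collections import Counter


def most_common_schemes(schemes):
    """Return every scheme whose count equals the highest count."""
    counts = Counter(schemes)
    if not counts:
        return []
    top = max(counts.values())
    return [scheme for scheme, count in counts.items() if count == top]


# Made-up per-stanza schemes: two schemes tie with two stanzas each
schemes = [
    ("A", "B", "A", "B"),
    ("A", "A", "B", "B"),
    ("A", "B", "A", "B"),
    ("A", "A", "B", "B"),
    ("A", "B", "C", "B"),
]
for scheme in most_common_schemes(schemes):
    print(", ".join(scheme))
```

Schemes are stored as tuples because Counter keys must be hashable, which lists are not.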

In [7]:
def print_poem_with_scheme(poem_path):
    """
    Function that prints the poem side-by-side with the rhyme scheme of each stanza and returns the full rhyme
    scheme for later use 
        
            Parameters:
                poem_path (str): The local file path of the poem
                
            Returns:
                tuple_scheme (list(tuple(str))): The rhyme schemes of the poem, translated to tuples for easy 
                \t\t\t\t\t\t     counting
    """
    
    stanzas = get_stanzas(poem_path)
    scheme = calculate_full_scheme(poem_path)
    max_line = get_longest_line(stanzas)
    # Pad every line to the longest line's width so the letters line up in one column
    spaces = "{:" + str(max_line) + "}"
    for (stanza, seq) in zip(stanzas, scheme):
        for (line, letter) in zip(stanza, seq):
            prep_line = preprocess_line(line)
            print(spaces.format(prep_line) + "\t" + letter)
        print("")
    tuple_scheme = [tuple(seq) for seq in scheme]
    return tuple_scheme

def print_most_common_scheme(tuple_scheme):
    """
    Function that prints the most common rhyme scheme, or the most common rhyme schemes if multiple 
    stanzas have the same rhyme scheme with the same count
        
            Parameters:
                tuple_scheme (list(tuple(str))): The rhyme schemes of the poem, in the form of tuples for easy 
                \t\t\t\t\t\t     counting
    """
    
    counter = Counter(tuple_scheme)
    common_scheme_tuples = counter.most_common()
    max_count = common_scheme_tuples[0][1]
    schemes = []
    for scheme in common_scheme_tuples:
        if scheme[1] == max_count:
            schemes.append(scheme[0])
        else:
            break

    if len(schemes) > 1:
        print("The most common rhyme schemes are:")
    else:
        print("The most common rhyme scheme is:")
    for scheme in schemes:
        print("\t" + ", ".join(scheme))

## Print Options Functions
The next cell has functions that relate to user input and how the options are presented to the user:
* **print_poem_options():**&ensp;prints the poem options and their respective authors from the above **poem_names** and **poets** lists

In [8]:
def print_poem_options():
    """
    Function that prints the options available to the user for choosing
    """
    
    # Derive the upper bound from the list so the prompt matches the number of poems
    print(f"Pick a poem from this list (enter 1 - {len(poem_names)}):")
    for (i, (_, poem, author)) in enumerate(zip(poem_paths, poem_names, poets)):
        print(f" {i + 1}: {poem} - {author}")
    print()

## User Input
This is where the actual program runs and the user supplies input. The user is shown the list of available poems and then types a number to view that poem and its rhyme scheme, or 'quit' to stop the program.

In [9]:
print_poem_options()
response = input("Choose a poem or type 'quit'!")
while not re.search(r"(?i)^quit$", response):
    # Accept only whole numbers that index an available poem
    if re.fullmatch(r'[0-9]+', response) and 1 <= int(response) <= len(poem_paths):
        tuple_scheme = print_poem_with_scheme(poem_paths[int(response) - 1])
        print_most_common_scheme(tuple_scheme)
        print()
        response = input("Choose another poem or type 'quit'!")
    else:
        response = input("Please input a number for the poem!")
    

Pick a poem from this list (enter 1 - 5):
 1: Color in the Wheat - Hamlin Garland
 2: Excelsior - Henry W. Longfellow
 3: Home - Edgar A. Guest
 4: Jamie Douglas - Anonymous
 5: The Broken Pinion - Hezekiah Butterworth
 6: The Ensign Bearer - Anonymous
 7: The Height of the Ridiculous - Oliver Wendell Holmes
 8: The House With Nobody In It - Joyce Kilmer
 9: The Polish Boy - Ann S. Stephens
 10: The Real Riches - John G. Saxe

Like liquid gold the wheat field lies,            	A
  A marvel of yellow and russet and green,        	B
That ripples and runs, that floats and flies,     	A
  With the subtle shadows, the change, the sheen, 	B
    That play in the golden hair of a girl,       	C
      A ripple of amber  a flare                  	D
    Of light sweeping after  a curl               	C
    In the hollows like swirling feet             	E
      Of fairy waltzers, the colors run           	F
      To the western sun                          	F
    Through the deeps of the ripening w

In [None]:
print(cmu['height'])
print(cmu['night'])