# LINGUAMORPH

The software in this notebook is intended to take in text and output variations in sound and stress.

Potential applications include:

  - homophone phrases (variations on a phrase that sound the same or similar)
  - extreme alliteration
  - spoonerism-style scrambles
  - homophone anagrams
  - homophone loops
  - shrinking and expanding text
  - two different stories that sound the same
  - text that morphs to text with meaningful intermediates
  - stress-rhythm-seeded text
  - faux/foe translations
  - rhyming/rapping

## Phonemes and the Carnegie Mellon Pronouncing Dictionary

(See https://github.com/cmusphinx/cmudict/tree/4c6a365cea2c34340ffc218d5af7a38920fa7e37)

From https://www.nltk.org/_modules/nltk/corpus/reader/cmudict.html:

The Carnegie Mellon Pronouncing Dictionary [cmudict.0.6]
Copyright 1998 Carnegie Mellon University

File Format: Each line consists of an uppercased word, a counter
(for alternative pronunciations), and a transcription.  Vowels are
marked for stress (1=primary, 2=secondary, 0=no stress).  E.g.:
NATURAL 1 N AE1 CH ER0 AH0 L

The dictionary contains 127069 entries.  Of these, 119400 words are assigned
a unique pronunciation, 6830 words have two pronunciations, and 839 words have
three or more pronunciations.  Many of these are fast-speech variants.

Phonemes: There are 39 phonemes (more at https://en.wikipedia.org/wiki/ARPABET):

    Phoneme Example Translation    Phoneme Example Translation
    ------- ------- -----------    ------- ------- -----------
    AA      odd     AA D           AE      at      AE T
    AH      hut     HH AH T        AO      ought   AO T
    AW      cow     K AW           AY      hide    HH AY D
    B       be      B IY           CH      cheese  CH IY Z
    D       dee     D IY           DH      thee    DH IY
    EH      Ed      EH D           ER      hurt    HH ER T    (changed in this notebook to: HH ER R T)
    EY      ate     EY T           F       fee     F IY
    G       green   G R IY N       HH      he      HH IY
    IH      it      IH T           IY      eat     IY T
    JH      gee     JH IY          K       key     K IY
    L       lee     L IY           M       me      M IY
    N       knee    N IY           NG      ping    P IH NG
    OW      oat     OW T           OY      toy     T OY
    P       pee     P IY           R       read    R IY D
    S       sea     S IY           SH      she     SH IY
    T       tea     T IY           TH      theta   TH EY T AH
    UH      hood    HH UH D        UW      two     T UW
    V       vee     V IY           W       we      W IY
    Y       yield   Y IY L D       Z       zee     Z IY
    ZH      seizure S IY ZH ER
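
As a minimal sketch of the file format quoted above, the snippet below splits one dictionary line into its word, counter, phonemes, and stress digits. The function name `parse_cmudict_line` is made up for this illustration and is not part of NLTK or of this notebook.

```python
def parse_cmudict_line(line):
    """Split 'WORD COUNTER PHONEME...' (cmudict.0.6 layout) into its parts."""
    word, counter, *phones = line.split()
    # Vowels end in a stress digit (0, 1, or 2); consonants carry no digit.
    stresses = [p[-1] for p in phones if p[-1].isdigit()]
    bare_phones = [p.rstrip("012") for p in phones]
    return word.lower(), int(counter), bare_phones, stresses


word, counter, phones, stresses = parse_cmudict_line("NATURAL 1 N AE1 CH ER0 AH0 L")
print(word, counter, phones, stresses)
```

The number of stress digits equals the number of vowels, which is also a reasonable proxy for the number of syllables.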
    
From https://www.pythonstudio.us/language-processing/a-pronouncing-dictionary.html:

For each word, this lexicon provides a list of phonetic codes—distinct labels for each contrastive sound—known as phones. Observe that fire has two pronunciations (in U.S. English): the one-syllable F AY1 R, and the two-syllable F AY1 ER0. The symbols in the CMU Pronouncing Dictionary are from the Arpabet, described in more detail at http://en.wikipedia.org/wiki/Arpabet.
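
To illustrate the "fire" example, here is a small sketch that counts syllables by counting stress-marked vowels in each pronunciation. The two phoneme lists are typed in by hand from the passage above, and `syllables_from_phones` is a hypothetical helper name.

```python
# The two pronunciations of "fire" quoted above.
fire_pronunciations = [["F", "AY1", "R"], ["F", "AY1", "ER0"]]


def syllables_from_phones(phones):
    """Count vowels, i.e. phonemes that end in a stress digit."""
    return sum(1 for p in phones if p[-1].isdigit())


syllable_counts = [syllables_from_phones(p) for p in fire_pronunciations]
print(syllable_counts)  # [1, 2]
```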

## Code to check grammar
- ChatGPT API (best, but slow)
- Caribe (better than LanguageTool in my tests): https://pypi.org/project/Caribe/
- LanguageTool (very basic checks): https://github.com/jxmorris12/language_tool_python
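
Because the most accurate checker is also the slowest, one option is to run a fast checker first and send only the sentences that pass it to the slow one. The sketch below shows that ordering with two stand-in checkers: `fast_check` and `slow_check` are hypothetical placeholders, not calls to Caribe, LanguageTool, or the ChatGPT API.

```python
calls = {"slow": 0}


def fast_check(sentence):
    # Hypothetical cheap test: starts with a capital letter and ends with a period.
    return sentence[:1].isupper() and sentence.endswith(".")


def slow_check(sentence):
    # Hypothetical stand-in for an expensive API call; counts how often it runs.
    calls["slow"] += 1
    return " toys " not in sentence


def filter_sentences(sentences, checkers):
    # all() short-circuits, so later checkers only see sentences that passed earlier ones.
    return [s for s in sentences if all(check(s) for check in checkers)]


sentences = ["languages continue to evolve",
             "Languages continue to evolve.",
             "Language continue toys evolve."]
kept = filter_sentences(sentences, [fast_check, slow_check])
print(kept, calls["slow"])
```

Here the slow checker runs for only two of the three sentences, because the first sentence is rejected by the fast checker.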

In [1]:
import string 

# Import the grammar checker:
grammar_tool = "caribe"
#grammar_tool = "language_tool"
#grammar_tool = "chatgpt4"
#grammar_tool = "chatgpt3"
# Some tools are too permissive as grammar checkers, so optionally 
# check a sentence's grammar again with a second tool if it passes the first test:
do_check_grammar_again = True  
grammar_tool2 = "chatgpt4"

if grammar_tool == "chatgpt4" or (do_check_grammar_again and grammar_tool2=="chatgpt4"):
    import os
    import openai
    openai.api_key = os.getenv("OPENAI4_API_KEY")
    chatgpt_model = "gpt-4"
if grammar_tool == "chatgpt3" or (do_check_grammar_again and grammar_tool2=="chatgpt3"):
    import os
    import openai
    openai.api_key = os.getenv("OPENAI3_API_KEY")
    chatgpt_model = "gpt-3.5-turbo" #"text-davinci-003"
if grammar_tool == "caribe" or (do_check_grammar_again and grammar_tool2=="caribe"):
    import Caribe as cb    
if grammar_tool == "language_tool" or (do_check_grammar_again and grammar_tool2=="language_tool"):
    import language_tool_python
    language_tool_object = language_tool_python.LanguageTool('en-US')
    ignore_rules = ['UPPERCASE_SENTENCE_START','I_LOWERCASE'] #'EN_COMPOUNDS','CD_NN']

In [2]:
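def generate_chatgpt_response(prompt, model):
    '''
    Send a single-message prompt to an OpenAI chat model and return the text of its reply.
    Note: openai.ChatCompletion is the interface of the openai package before version 1.0.
    '''
    completion = openai.ChatCompletion.create(model=model, messages=[{"role": "user", "content": prompt}])
    response = completion.choices[0].message.content

    return response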


def check_grammar(input_sentence, grammar_tool="caribe", cap_and_punc=True, verbose=False):
    '''
    Check whether an input sentence has correct grammar according to a given grammar tool.
    
    >>> input_sentence = "Language continues to evolve."
    >>> check_grammar(input_sentence, grammar_tool="caribe", cap_and_punc=True, verbose=False)
    True    
    '''
    is_correct = False

    # Capitalize the first word and add a period if no punctuation at the end of the input text string.
    if cap_and_punc:
        input_sentence = input_sentence.capitalize()
        if input_sentence[-1] not in string.punctuation: 
            input_sentence += '.'
        

    # Check grammar with ChatGPT
    if grammar_tool == "chatgpt4" or grammar_tool == "chatgpt3":
        prompt = "Return just the number 1 if the following sentence is grammatically correct, or just the number 0 if it is not: '{0}'".format(input_sentence)
        response = generate_chatgpt_response(prompt=prompt, model=chatgpt_model)
        if response.strip() == '1':  # ignore any whitespace around the reply
            is_correct = True
            if verbose:
                print("CORRECT (ChatGPT): {0}".format(input_sentence))
        elif verbose:
            print("      X (ChatGPT): {0}".format(input_sentence))

    # Check grammar with Caribe
    elif grammar_tool == "caribe":
        output_sentence = cb.caribe_corrector(input_sentence)
        if cap_and_punc:
            output_sentence = output_sentence.capitalize()
            if output_sentence[-1] not in string.punctuation: 
                output_sentence += '.'
        if output_sentence == input_sentence:
            is_correct = True
            if verbose:
                print("CORRECT (Caribe): {0}".format(input_sentence))
        elif verbose:
            print("      X (Caribe): {0}".format(input_sentence))

    # Check grammar with LanguageTool
    elif grammar_tool == "language_tool":
        is_correct = True
        for rule in language_tool_object.check(input_sentence):
            if rule.ruleId not in ignore_rules:
                is_correct = False
                if verbose:
                    print(rule, end='\n\n')
        if is_correct:
            if verbose:
                print("CORRECT (LanguageTool): {0}".format(input_sentence))
        elif verbose:
            print("      X (LanguageTool): {0}".format(input_sentence))

    return is_correct


def check_grammar_twice(list_of_sentences, grammar_tool1="caribe", grammar_tool2="chatgpt4", 
                        cap_and_punc=True, do_check_grammar_again=False, verbose=False):
    '''
    Return the sentences in a list that have correct grammar according to one grammar tool
    and, if do_check_grammar_again is True, according to a second grammar tool as well.
    
    >>> list_of_sentences = ['language continue toys evolve', 'languages continue to evolve']
    >>> check_grammar_twice(list_of_sentences, "caribe", "chatgpt4", True, True, True)
          X (Caribe): Language continue toys evolve.
    CORRECT (Caribe): Languages continue to evolve.
    CORRECT (ChatGPT): Languages continue to evolve.
    ['languages continue to evolve']    
    '''
    is_correct = False

    new_list = []
    for sentence in list_of_sentences:
        #if verbose:
        #    print("     Input sentence: {0}".format(sentence))
        is_correct = check_grammar(sentence, grammar_tool1, cap_and_punc, verbose)
        if is_correct and do_check_grammar_again:
            is_correct = check_grammar(sentence, grammar_tool2, cap_and_punc, verbose)
        if is_correct:
            new_list.append(sentence)

    return new_list

## Code to prepare CMU words
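
The next cell enumerates every cluster of two to four consonants in which no consonant is immediately repeated. As a rough check on scale, the sketch below rebuilds the same labels with `itertools.product` and stores them in a set. The names `cluster_labels` and `cluster_set` are introduced here for illustration only, and the set is a suggestion rather than something the notebook does.

```python
from itertools import product

# The 24 single consonants (same symbols as single_consonants in the next cell).
consonants = ['B','D','G','JH','L','N','P','S','T','V','Y','ZH','CH',
              'DH','F','HH','K','M','NG','R','SH','TH','W','Z']


def cluster_labels(n):
    """All '+'-joined clusters of n consonants with no consonant repeated back to back."""
    return ['+'.join(c) for c in product(consonants, repeat=n)
            if all(a != b for a, b in zip(c, c[1:]))]


clusters = [label for n in (2, 3, 4) for label in cluster_labels(n)]
cluster_set = set(clusters)  # hashing makes `x in cluster_set` fast

print(len(clusters))         # 24*23 + 24*23**2 + 24*23**3 = 305256
print('L+D' in cluster_set)  # True
print('L+L' in cluster_set)  # False
```

With roughly 300,000 labels, a membership test against a list scans the list each time, so a set may be worth considering wherever the labels are only used for `in` checks.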

In [3]:
# Note: the vowel 'ER' is later split into the vowel 'ER' followed by the consonant 'R' (see separatER)
phoneme_list = ['AA','AH','AW','B','D','EH','EY','G','IH','JH','L','N','OW','P','S','T','UH','V','Y','ZH',
                'AE','AO','AY','CH','DH','ER','F','HH','IY','K','M','NG','OY','R','SH','TH','UW','W','Z']
#vowel_list = ['AA','AH','AW','EH','EY','IH','OW','UH','AE','AO','AY','IY','OY','UW','ER']
single_consonants = ['B','D','G','JH','L','N','P','S','T','V','Y','ZH','CH',
                     'DH','F','HH','K','M','NG','R','SH','TH','W','Z']
multiple_consonants = []
for c1 in single_consonants:
    for c2 in single_consonants:
        if c2 != c1:
            multiple_consonants.append(c1 + '+' + c2)
            for c3 in single_consonants:
                if c3 != c2:
                    multiple_consonants.append(c1 + '+' + c2 + '+' + c3)
                    for c4 in single_consonants:
                        if c4 != c3:
                            multiple_consonants.append(c1 + '+' + c2 + '+' + c3 + '+' + c4)
                            #for c5 in single_consonants:
                            #    if c5 != c4:
                            #        multiple_consonants.append(c1 + '+' + c2 + '+' + c3 + '+' + c4 + '+' + c5)

consonant_list = single_consonants + multiple_consonants

In [4]:
def combine_consonants(phonemes, max_consonants=4):
    '''
    Join each run of adjacent consonants into a single '+'-delimited cluster.
    A cluster holds at most max_consonants consonants; a longer run starts a new cluster.

    >>> combine_consonants(['Y','IY','L','D'])
    ['Y', 'IY', 'L+D']    
    '''
    phonemes_with_combined_consonants = []
    cluster = []
    for phoneme in phonemes:
        if phoneme in single_consonants and len(cluster) < max_consonants:
            cluster.append(phoneme)
            continue
        # A non-consonant (or a full cluster) ends the current run of consonants
        if cluster:
            phonemes_with_combined_consonants.append('+'.join(cluster))
        if phoneme in single_consonants:
            cluster = [phoneme]
        else:
            cluster = []
            phonemes_with_combined_consonants.append(phoneme)
    if cluster:
        phonemes_with_combined_consonants.append('+'.join(cluster))
    
    return phonemes_with_combined_consonants


def separate_consonants(combined_consonants):
    '''
    Split combined consonants back into single consonants, so that word boundaries
    can fall inside a consonant cluster ("I scream" => "ice cream").

    English words by number of syllables: 
        https://en.wiktionary.org/wiki/Category:English_words_by_number_of_syllables
        Ex: "Honolulu": 4 syllables, 8 phonemes (with or without combined consonants)
        Ex: "constructs": 2 syllables, 10 phonemes or 5 combined (K, AA, N+S+T+R, UH, K+T+S)

    >>> separate_consonants(['Y', 'IY', 'L+D'])
    ['Y', 'IY', 'L', 'D']
    '''
    if any(['+' in x for x in combined_consonants]):
        separated_consonants = []
        for c in combined_consonants:
            if '+' in c:
                separated_consonants.extend(c.split('+'))
            else:
                separated_consonants.append(c)
    else:
        separated_consonants = combined_consonants
        
    return separated_consonants

                                        
def separatER(phonemes):
    '''
    Insert the consonant 'R' after each 'ER' vowel (any stress digit stays on the vowel).

    >>> separatER(['AE1', 'F', 'T', 'ER0'])  # 'after'
    ['AE1', 'F', 'T', 'ER0', 'R']
    '''
    if any(['ER' in x for x in phonemes]):
        new_pronunciation = []
        for p in phonemes:
            new_pronunciation.append(p)
            if 'ER' in p:
                new_pronunciation.append('R')
    else:
        new_pronunciation = phonemes
        
    return new_pronunciation


## Prepare / load dictionary
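
The filtering step in this section keeps a CMU word only if a second dictionary also recognizes it, it is not on a list of words to exclude, it contains none of the excluded strings, and it has not already been kept. The sketch below shows that logic on toy data with plain Python sets. `toy_filter`, `lexicon`, and `blocklist` are illustrative names and values, not the notebook's functions or data.

```python
def toy_filter(words, lexicon, blocklist, banned_substrings):
    """Split words into (kept, removed) using the four tests described above."""
    kept, removed, seen = [], [], set()
    for word in words:
        is_ok = (word in lexicon
                 and word not in blocklist
                 and word not in seen
                 and not any(s in word for s in banned_substrings))
        if is_ok:
            kept.append(word)
            seen.add(word)
        else:
            removed.append(word)
    return kept, removed


words = ['okay', 'k', 'e.', 'okay', 'bitter', 'bott']
lexicon = {'okay', 'bitter', 'e.', 'bott'}   # stand-in for the second dictionary
blocklist = {'bott'}                         # stand-in for filter_words.txt
kept, removed = toy_filter(words, lexicon, blocklist, banned_substrings=['.', ','])
print(kept)
print(removed)
```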

In [5]:
import pickle

dictionary_folder = "data/dictionaries/"
filter_dictionary = 'LanguageTool'  #'english_words_py'  #'pyenchant'

do_prepare_dictionary = False
do_test_dictionaries = False

if do_prepare_dictionary or do_test_dictionaries:
    import nltk
    import re
    filter_words_file = "data/dictionaries/filter_words.txt"
    filter_strings = ['.',',']
    import language_tool_python
    language_tool_object = language_tool_python.LanguageTool('en-US')


def save_object(obj, pickle_file):
    try:
        with open(pickle_file, "wb") as f:
            pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
    except Exception as ex:
        print("Error during pickling object:", ex)

        
def load_object(pickle_file):
    try:
        with open(pickle_file, "rb") as f:
            return pickle.load(f)
    except Exception as ex:
        print("Error during unpickling object:", ex)


def filter_dictionary_words(words, consonants, pronunciations, stresses, 
                            filter_words, filter_strings, filter_dictionary='LanguageTool', 
                            verbose=False):
    '''
    Test filtering (with fake consonants, pronunciations, and stresses).
    Make sure to import the necessary dictionaries below (see "do_prepare_dictionary").

    >>> words = ['okay', 'k']
    >>> filter_strings = ['.',',']
    >>> fread_filter = open(filter_words_file, "r")
    >>> filter_words = [x.strip() for x in fread_filter.readlines()]
    >>> filter_dictionary_words(words, consonants=words, pronunciations=words, stresses=words, 
    ...                         filter_words=filter_words, filter_strings=filter_strings, 
    ...                         filter_dictionary='LanguageTool', verbose=False)
    (['okay'], ['okay'], ['okay'], ['okay'], ['k'])
    '''
    filtered_words = []
    filtered_consonants = []
    filtered_pronunciations = []
    filtered_stresses = []
    removed_words = []
    for iword, word in enumerate(words):

        # Check whether the selected second dictionary recognizes the word
        if filter_dictionary == 'LanguageTool':
            is_known = not any([x for x in language_tool_object.check(word) 
                                if x.ruleId == 'MORFOLOGIK_RULE_EN_US'])
        elif filter_dictionary == 'pyenchant':
            is_known = enchant_dict.check(word) or enchant_dict.check(word.capitalize())
        elif filter_dictionary == 'english_words_py':
            # Accept the word, its capitalized form, or either form minus a final 's'
            is_known = any([w in english_words_set or 
                            (w[:-1] in english_words_set and w[-1] == 's')
                            for w in (word, word.capitalize())])
        else:
            is_known = False

        # Keep known words that are not excluded and not already kept; record all others as removed
        if is_known and \
            (word not in filter_words) and \
            (word not in filtered_words) and \
            all([x not in word for x in filter_strings]):
                filtered_words.append(word)
                filtered_consonants.append(consonants[iword])
                filtered_pronunciations.append(pronunciations[iword])
                filtered_stresses.append(stresses[iword])
        else:
            removed_words.append(word)
    
    if verbose and removed_words != []:
        print('{0} retained words, {1} removed words'.format(len(filtered_words),len(removed_words)))

    return filtered_words, filtered_consonants, filtered_pronunciations, filtered_stresses, removed_words


def prepare_dictionary(filter_dictionary, filter_words_file, filter_strings, dictionary_folder, consonant_list):
    '''
    Filter the words and pronunciations of the CMU Pronouncing Dictionary
    using a second dictionary of common words (example below from LanguageTool),
    and save the filtered words, consonants, pronunciations, and stresses as pickle files.
    Make sure to import the necessary dictionaries below (see "do_prepare_dictionary").

    >>> index=48792
    >>> print(all_words[index], all_consonants[index])
    thirty ['TH','ER','R+D','IY'] 
    
    Check dictionary entry:
    >>> inword = 'ill-behaved' #'thirty'
    >>> inpron = ['TH','ER','R+D','IY']
    >>> print('{0}'.format(inword in all_words))
    >>> print('{0}'.format(inpron in all_pronunciations))
    >>> all_pronunciations[all_words.index('thirty')]
    '''
    cmu_entries = nltk.corpus.cmudict.entries()
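    cmu_words = []
    seen_words = set()  # a set keeps the duplicate check fast for >100,000 entries
    cmu_consonants = []
    cmu_pronunciations = []
    cmu_pronunciations_stress = []
    cmu_stresses = []
    for cmu_word, cmu_pronunciation_stress in cmu_entries:
        # Keep only the first pronunciation listed for each word
        cmu_word = cmu_word.strip()
        if cmu_word not in seen_words:
            seen_words.add(cmu_word)
            cmu_words.append(cmu_word)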
            if any(['ER' in x for x in cmu_pronunciation_stress]):
                cmu_pronunciation_stress = separatER(cmu_pronunciation_stress)
            cmu_pronunciation_stress = combine_consonants(cmu_pronunciation_stress)
            cmu_pronunciation_no_stress = [re.sub(r'\d+', '', x) for x in cmu_pronunciation_stress]
            cmu_pronunciations.append(cmu_pronunciation_no_stress)
            cmu_consonants.append([x for x in cmu_pronunciation_no_stress if x in consonant_list])
            cmu_stresses.append([re.sub(r'[A-Za-z\+]', '', x) for x in cmu_pronunciation_stress])         
            #print(cmu_word, cmu_pronunciation_no_stress)

    print('Filter the CMU dictionary...')

    # Load filter words
    with open(filter_words_file, "r") as fread_filter:
        filter_words = [x.strip() for x in fread_filter.readlines()]

    all_words, all_consonants, all_pronunciations, all_stresses, nonwords = filter_dictionary_words(cmu_words, 
        cmu_consonants, cmu_pronunciations, cmu_stresses, 
        filter_words, filter_strings, filter_dictionary, verbose=False)    

    save_object(all_words, dictionary_folder + 'words_{0}.pkl'.format(filter_dictionary))
    save_object(all_consonants, dictionary_folder + 'consonants_{0}.pkl'.format(filter_dictionary))
    save_object(all_pronunciations, dictionary_folder + 'pronunciations_{0}.pkl'.format(filter_dictionary))
    save_object(all_stresses, dictionary_folder + 'stresses_{0}.pkl'.format(filter_dictionary))


# Prepare or load dictionary
if do_prepare_dictionary:
    if filter_dictionary == 'pyenchant':
        import enchant
        enchant_dict = enchant.Dict("en_US")
    elif filter_dictionary == 'english_words_py':
        from english_words import english_words_set
    prepare_dictionary(filter_dictionary, filter_words_file, filter_strings, dictionary_folder, consonant_list)    

# Load the dictionary (prepare_dictionary saves its results to disk rather than returning them)
all_words = load_object(dictionary_folder + 'words_{0}.pkl'.format(filter_dictionary))
all_consonants = load_object(dictionary_folder + 'consonants_{0}.pkl'.format(filter_dictionary))
all_pronunciations = load_object(dictionary_folder + 'pronunciations_{0}.pkl'.format(filter_dictionary))
all_stresses = load_object(dictionary_folder + 'stresses_{0}.pkl'.format(filter_dictionary))
    

# Test different dictionaries
if do_test_dictionaries:

    cmu_entries = nltk.corpus.cmudict.entries()

    import enchant
    enchant_dict = enchant.Dict("en_US")
    #pip install cmudict
    #nltk.download('cmudict')
    #pip install pyenchant

    # english-words-py (https://pypi.org/project/english-words/)
    # "Contains sets of English words from svnweb.freebsd.org/csrg/share/dict/. 
    # This is up to date with revision 61569 of their words list."
    from english_words import english_words_set

    # Most Common English Words (https://github.com/dolph/dictionary)
    # "enable1.txt (172,819), the more verbose version of the Official Scrabble Player's Dictionary 
    # (which is limited to words of 8 letters or less)"
    # "popular.txt (25,322) represents the common subset of words found in both enable1.txt and Wiktionary's 
    # word frequency lists, which are in turn compiled by statistically analyzing a sample of 29 million 
    # words used in English TV and movie scripts."
    enable1 = [line.rstrip() for line in open('data/dictionaries/enable1.txt')]
    popular = [line.rstrip() for line in open('data/dictionaries/popular.txt')]

    # NLTK words corpus:
    #nltk.download('words')
    from nltk.corpus import words
    nltk_wordset = set(words.words())

    # Wiktionary Word Frequency_lists (https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists#English)
    #https://gist.github.com/h3xx/1976236
    
    print('nltk_wordset:      {0}'.format(len(nltk_wordset)))
    print('enable1:           {0}'.format(len(enable1)))
    print('pyenchant:         {0}'.format('?')) #len(enchant_dict.values())))
    print('english-words-py:  {0}'.format(len(english_words_set)))
    print('popular:           {0}'.format(len(popular)))
    print('LanguageTool (LT): {0}'.format('?'))
    print('-------------------------------------------------------')
    print('CMU:               {0}'.format(len(cmu_entries)))
    print('CMU and LT:        {0}'.format(len(all_words)))
    print()
    test_words = ["can't", 'geese', 'shelves', 'Thai', 'thai', 'e.', 'bott', 'bitter', 'used']
    for test_word in test_words:
        print(test_word)
        print('        NLTK words corpus:  {0}'.format(test_word in nltk_wordset))
        print('        enable frequency:   {0}'.format(test_word in enable1))
        print('        pyenchant spelling: {0}'.format(enchant_dict.check(test_word)))
        print('        english-words-py:   {0}'.format(test_word in english_words_set))
        print('        popular frequency:  {0}'.format(test_word in popular))
        print('        LanguageTool:       {0}'.format(not any([x for x in language_tool_object.check(test_word) if x.ruleId == 'MORFOLOGIK_RULE_EN_US'])))
        print('-------------------------------------------------------')
        print('        CMU:                {0}'.format(any([x for x,y in cmu_entries if test_word == x])))
        print('        CMU and LT:         {0}'.format(test_word in all_words))
        print()

    nltk_wordset:      235892
    enable1:           172823
    pyenchant:         ?
    english-words-py:  25487
    popular:           25322
    LanguageTool (LT): ?
    -------------------------------------------------------
    CMU:               133737
    CMU and LT:        54117
    
                                can't    geese    shelves  Thai     thai     e.       bott     bitter   used
            NLTK words corpus:  False    False    False    True     False    False    True     True     True
            enable frequency:   False    True     True     False    False    False    True     True     True
            pyenchant spelling: True     True     True     True     False    True     True     True     True
            english-words-py:   True     True     False    True     False    False    False    False    False
            popular frequency:  False    True     True     False    False    False    False    True     True
            LanguageTool:       True     True     True     True     False    True     False    True     True
            ------------------------------------------------------------------------------------------------
            CMU:                True     True     True     False    True     True     True     True     True
            CMU & LanguageTool: True     True     True     False    False    False    False    True     True
            

## Code to convert text to phonemes (and stresses and number of syllables)
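
The syllable counter in this section is a spelling-based heuristic: it counts runs of vowel letters, subtracts matches of "exception" patterns (such as a silent trailing e), and adds matches of patterns that hide an extra syllable. The sketch below uses the same three regular expressions as the next cell but returns each count separately, which may help when debugging a miscounted word. `syllable_breakdown` is a hypothetical helper name.

```python
import re

# Same patterns as in the next cell.
syllable_vowel_runs = re.compile("[aeiouy]+", flags=re.I)
syllable_exceptions = re.compile("[^aeiou]e[sd]?$|[^e]ely$", flags=re.I)
additional_syllables = re.compile(
    "[^aeioulr][lr]e[sd]?$|[csgz]es$|[td]ed$|"
    ".y[aeiou]|ia(?!n$)|eo|ism$|[^aeiou]ire$|[^gq]ua",
    flags=re.I,
)


def syllable_breakdown(word):
    """Return (vowel runs, exceptions, additions, estimated syllables) for a word."""
    runs = len(syllable_vowel_runs.findall(word))
    exceptions = len(syllable_exceptions.findall(word))
    additions = len(additional_syllables.findall(word))
    return runs, exceptions, additions, max(1, runs - exceptions + additions)


for word in ["lovely", "smite", "piano"]:
    print(word, syllable_breakdown(word))
```

For "lovely", the three vowel runs (o, e, y) minus one exception ("vely" matches the adverb pattern) give two syllables.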

In [6]:
import re
from g2p_en import G2p  # pip install g2p_en
word_to_phonemes = G2p()

# Code to count syllables
# https://datascience.stackexchange.com/questions/23376/how-to-get-the-number-of-syllables-in-a-word
syllable_vowel_runs = re.compile("[aeiouy]+", flags=re.I)
syllable_exceptions = re.compile(
    # fixes trailing e issues:
    # smite, scared
    "[^aeiou]e[sd]?$|"
    # fixes adverbs:
    # nicely
    + "[^e]ely$",
    flags=re.I
)
additional_syllables = re.compile(
    # fixes incorrect subtractions from exceptions:
    # smile, scarred, raises, fated
    "[^aeioulr][lr]e[sd]?$|[csgz]es$|[td]ed$|"
    # fixes miscellaneous issues:
    # flying, piano, video, prism, fire, evaluate
    + ".y[aeiou]|ia(?!n$)|eo|ism$|[^aeiou]ire$|[^gq]ua",
    flags=re.I
)

In [7]:
def count_syllables(word, syllable_vowel_runs=syllable_vowel_runs, syllable_exceptions=syllable_exceptions, 
                    additional_syllables=additional_syllables):
    '''
    Count the number of syllables in a word.

    >>> word = 'lovely'
    >>> syllable_vowel_runs = re.compile('[aeiouy]+', re.IGNORECASE)
    >>> syllable_exceptions = re.compile('[^aeiou]e[sd]?$|[^e]ely$', re.IGNORECASE) 
    >>> additional_syllables = re.compile('[^aeioulr][lr]e[sd]?$|[csgz]es$|[td]ed$|.y[aeiou]|ia(?!n$)|eo|ism$|[^aeiou]ire$|[^gq]ua', re.IGNORECASE)
    >>> count_syllables(word)
    2
    '''
    vowel_runs = len(syllable_vowel_runs.findall(word))
    exceptions = len(syllable_exceptions.findall(word))
    additional = len(additional_syllables.findall(word))

    return max(1, vowel_runs - exceptions + additional)


def words_to_sounds(words, phoneme_list, consonant_list):
    '''
    From a list of words, return their phonemes, consonants, and stresses (from g2p_en),
    and an estimate of their total number of syllables (from count_syllables).
    
    >>> sentence1 = 'A witch is itself conscious or without agency.'
    >>> sentence2 = 'Uh, which is it, self-conscious or without agency?'
    >>> words = sentence1.split('-')
    >>> words_to_sounds(words, phoneme_list, consonant_list)
    (['AH','W','IH','CH','IH','Z','IH','T+S','EH','L+F',
      'K','AA','N+SH','AH','S','AO','R','W','IH','TH','AW','T',
      'EY','JH','AH','N+S','IY'],
     ['W','CH','Z','T+S','L+F','K','N+SH','S',
      'R','W','TH','T','JH','N+S'],
     ['0','','1','','1','','0','+','1','+','','1','+',
      '0','','1','','','0','','1','','1','','0','+','0'],
     13)    
    >>> words = sentence2.split('-')
    >>> words_to_sounds(words, phoneme_list, consonant_list)
    (['AH','W','IH','CH','IH','Z','IH','T','S','EH','L+F',
      'K','AA','N+SH','AH','S','AO','R','W','IH','TH','AW','T',
      'EY','JH','AH','N+S','IY'],
     ['W','CH','Z','T','S','L+F','K','N+SH','S',
      'R','W','TH','T','JH','N+S'],
     ['1','','1','','1','','1','','','1','+','','1','+',
      '0','','1','','','0','','1','','1','','0','+','0'],
     13)    
    '''

    phonemes = []
    stresses = []
    syllables = 0
    for word in words:

        # Extract phonemes per word (g2p_en returns a single pronunciation for each word)
        #     :: for multiple pronunciations, see pronouncing.phones_for_word(word) 
        phonemes_and_stresses_for_word = word_to_phonemes(word)
        if any(['ER' in x for x in phonemes_and_stresses_for_word]):
            phonemes_and_stresses_for_word = separatER(phonemes_and_stresses_for_word)
        phonemes_and_stresses_for_word = combine_consonants(phonemes_and_stresses_for_word)
        phonemes_for_word = [re.sub(r'\d+', '', x) for x in phonemes_and_stresses_for_word 
                             if re.sub(r'\d+', '', x) in phoneme_list or x in consonant_list]
        stresses_for_word = [re.sub(r"(?:[A-Z])",'', x) for x in phonemes_and_stresses_for_word
                             if re.sub(r'\d+', '', x) in phoneme_list or x in consonant_list]

        phonemes += phonemes_for_word  
        stresses += stresses_for_word
        syllables += count_syllables(word)

    consonants = [x for x in phonemes if x in consonant_list] 

    return phonemes, consonants, stresses, syllables

## Code to convert phonemes to text
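
The search in this section looks up every contiguous run of phonemes in the pronunciation dictionary and records each matching word with its start and stop indices. As a sketch of the same idea on toy data, the snippet below first builds an index from pronunciation to words, so that each lookup is a dictionary access rather than a scan of the full list. The word list, the pronunciations, and the names `build_index` and `words_in_stream` are assumptions made for this illustration; the notebook's own function may differ in detail.

```python
from collections import defaultdict

# Toy stand-ins for all_words and all_pronunciations.
toy_words = ['a', 'which', 'witch', 'itch', 'is']
toy_pronunciations = [['AH'], ['W', 'IH', 'CH'], ['W', 'IH', 'CH'], ['IH', 'CH'], ['IH', 'Z']]


def build_index(words, pronunciations):
    """Map each pronunciation (as a tuple) to the words that share it."""
    index = defaultdict(list)
    for word, pronunciation in zip(words, pronunciations):
        index[tuple(pronunciation)].append(word)
    return index


def words_in_stream(phonemes, index, max_phonemes_per_word):
    """Return [word, start, stop] for every dictionary word found in the stream."""
    hits = []
    for start in range(len(phonemes)):
        last_stop = min(len(phonemes), start + max_phonemes_per_word)
        for stop in range(start + 1, last_stop + 1):
            for word in index.get(tuple(phonemes[start:stop]), []):
                hits.append([word, start, stop - 1])  # stop index is inclusive
    return hits


index = build_index(toy_words, toy_pronunciations)
hits = words_in_stream(['AH', 'W', 'IH', 'CH', 'IH', 'Z'], index, max_phonemes_per_word=3)
print(hits)
```

Homophones such as "which" and "witch" share one key in the index, so they come back together with the same start and stop indices.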

In [8]:
def get_unique_numbers(numbers):
    unique = []
    for number in numbers:
        if number not in unique:
            unique.append(number)
    return unique


def phonemes_to_words(phonemes, all_words, all_pronunciations, all_consonants, consonant_list, 
                      just_consonants=False, max_phonemes_per_word=25, ignore_words=None):
    '''
    Find all words that sound like each sequence of phonemes from a phoneme list.
    
    Generate a list of words (with start and stop indices) from a sequential list of phonemes
    by searching the filtered CMU Pronouncing Dictionary for each contiguous subsequence of phonemes.
    
    Rule of thumb for setting max_phonemes_per_word:
    n-syllable words without combined consonants: <= 5n phonemes
    n-syllable words with combined consonants: <= 2n + 1 phonemes
    5-syllable words: <= 25 phonemes, or <= 11 with combined consonants
    
    >>> # input_sentence = 'Uh, which is it, self-conscious or without agency?'
    >>> # words = input_sentence.split('-')  #words = [re.sub(r'[^\'A-Za-z\-]', '', x).lower() for x in words]
    >>> # phonemes, consonants, stresses, nsyllables = words_to_sounds(words, phoneme_list, consonant_list)
    >>> phonemes = ['AH','W','IH','CH','IH','Z','IH','T','S','EH','L+F',
                    'K','AA','N+SH','AH','S','AO','R','W','IH','TH','AW','T',
                    'EY','JH','AH','N+S','IY']    
    >>> just_consonants = False
    >>> phonemes_to_words(phonemes, all_words, all_pronunciations, all_consonants, consonant_list, 
                          just_consonants, max_phonemes_per_word=25, ignore_words=None)

    [['a', 0, 0],['uh', 0, 0],['uhh', 0, 0],['which', 1, 3],['witch', 1, 3],["which's", 1, 5],
    ["witch's", 1, 5],['itch', 2, 3],['is', 4, 5],['it', 6, 7],["it's", 6, 8],
    ['its', 6, 8],['itself', 6, 10],['self', 8, 10],['eh', 9, 9],['elf', 9, 10],
    ['conscious', 11, 15],['ah', 12, 12],['ahh', 12, 12],['awe', 12, 12],
    ['uh', 14, 14],['uhh', 14, 14],['us', 14, 15],['saw', 15, 16],['soar', 15, 17],
    ['sore', 15, 17],['aw', 16, 16],['oar', 16, 17],['or', 16, 17],['ore', 16, 17],
    ['withe', 18, 20],['without', 18, 22],['out', 21, 22],['age', 23, 24],
    ['agency', 23, 27],['uh', 25, 25],['uhh', 25, 25]]
     
    >>> just_consonants = True
    >>> phonemes_to_words(phonemes, all_words, all_pronunciations, all_consonants, consonant_list, 
                          just_consonants, max_phonemes_per_word=25, ignore_words=None)

    [['away', 0, 1],['way', 0, 1],['we', 0, 1],['wee', 0, 1],['weigh', 0, 1],...
     ['which', 0, 3],['witch', 0, 3],['watches', 0, 5],["which's", 0, 5],...
     ['conscious', 11, 15],['ac', 14, 15],['ace', 14, 15],['ass', 14, 15],...
     ['agency', 23, 26],['age', 24, 24],['edge', 24, 24],['edgy', 24, 24],
     ...
    ]
    '''

    words_starts_stops = []
    start = 0
    nphonemes = len(phonemes)
    while start < nphonemes:

        # For each subsequence of phonemes
        consonant_subsets = []
        max_stop = min(nphonemes + 1, start + max_phonemes_per_word + 2) 
        for stop in range(start + 1, max_stop):

            phoneme_subset = phonemes[start:stop]
            phoneme_subset = combine_consonants(phoneme_subset)

            # Find words with matching consonants:
            if just_consonants:
                consonant_subset = [x for x in phoneme_subset if x in consonant_list]
                # Look up each distinct, non-empty consonant sequence only once per start index
                if consonant_subset and consonant_subset not in consonant_subsets:
                    consonant_subsets.append(consonant_subset)
                    for index, x in enumerate(all_consonants):
                        if x == consonant_subset:
                            words_starts_stops.append([all_words[index], start, stop - 1])

            # Find words with fully matching pronunciations:
            else:
                for index, x in enumerate(all_pronunciations):
                    if x == phoneme_subset:
                        words_starts_stops.append([all_words[index], start, stop - 1])

        start += 1
                
    if ignore_words:
        words_starts_stops = [x for x in words_starts_stops if x[0] not in ignore_words]

    return words_starts_stops

## Code to construct word sequences with matching phoneme stop and start indices
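The chaining idea in this section can be sketched on a small example before reading the full code. This is a simplified illustration under stated assumptions, not the notebook's implementation: `chain_words` and the `toy` data (including its indices) are hypothetical. A word may follow another only if it starts at the previous word's stop index plus one, and a sequence is complete once it reaches `max_stop`.

```python
def chain_words(words_starts_stops, max_stop):
    """Enumerate word sequences that tile the phoneme indices 0..max_stop."""
    # Group words by the phoneme index at which they start.
    by_start = {}
    for word, start, stop in words_starts_stops:
        by_start.setdefault(start, []).append((word, stop))

    complete = []
    partial = [([], -1)]  # (words so far, stop index of the last word)
    while partial:
        words, last_stop = partial.pop(0)
        # The next word must begin right after the previous word ends.
        for word, stop in by_start.get(last_stop + 1, []):
            if stop == max_stop:
                complete.append(' '.join(words + [word]))
            else:
                partial.append((words + [word], stop))
    return complete


# Hypothetical [word, start, stop] entries with inclusive indices.
toy = [['which', 0, 2], ['witch', 0, 2], ['itch', 1, 2], ['is', 3, 4],
       ['it', 5, 6], ["it's", 5, 7], ['self', 7, 9], ['elf', 8, 9]]
toy_lines = chain_words(toy, max_stop=9)
print(toy_lines)
# ['which is it self', "which is it's elf", 'witch is it self', "witch is it's elf"]
```

Note that 'itch' never appears in the output: it starts at index 1, and no word in the toy data stops at index 0, so no chain can reach it.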

In [9]:
def flatten_list(nested_list):
    '''
    Flatten so that there are no tuples or lists within the list.
    
    >>> nested_list = [('e1d1', ('e1d2'), ['e2d1']), 'e3d0', [], ['e5d1']]
    >>> flatten_list(nested_list)
    ['e1d1', 'e1d2', 'e2d1', 'e3d0', 'e5d1']
    '''
    result=[]
    if nested_list != []:
        for element in nested_list:
            if isinstance(element, list) or isinstance(element, tuple):
                result.extend(flatten_list(element))
            else:
                result.append(element)

    return result

            
def flatten_to_sublists_of_strings(nested_list):
    '''
    Flatten list to strings and sublists of strings.

    >>> nested_list = [[[], '0', ('1',11,12), ('2',21,22), ['3',31,32]], [['4',41,42]]]
    >>> flatten_to_sublists_of_strings(nested_list)
    [[], '0', ['1', 11, 12], ['2', 21, 22], ['3', 31, 32], ['4', 41, 42]]
    
    >>> # input_sentence = 'Uh, which is it, self-conscious or without agency?'
    >>> # words = input_sentence.split('-')  #words = [re.sub(r'[^\'A-Za-z\-]', '', x).lower() for x in words]
    >>> # phonemes, consonants, stresses, nsyllables = words_to_sounds(words, phoneme_list, consonant_list)
    >>> # words_starts_stops = phonemes_to_words(phonemes, all_words, all_pronunciations, all_consonants, 
        #                                        consonant_list, just_consonants, max_phonemes_per_word, 
        #                                        ignore_words)
    >>> # words_starts_stops = flatten_to_sublists_of_strings(words_starts_stops) 
    >>> words_starts_stops = [['a', 0, 0],['uh', 0, 0],['uhh', 0, 0],['which', 1, 3],['witch', 1, 3],["which's", 1, 5],
     ["witch's", 1, 5],['itch', 2, 3],['is', 4, 5],['it', 6, 7],["it's", 6, 8],
     ['its', 6, 8],['itself', 6, 10],['self', 8, 10],['eh', 9, 9],['elf', 9, 10],
     ['conscious', 11, 15],['ah', 12, 12],['ahh', 12, 12],['awe', 12, 12],
     ['uh', 14, 14],['uhh', 14, 14],['us', 14, 15],['saw', 15, 16],['soar', 15, 17],
     ['sore', 15, 17],['aw', 16, 16],['oar', 16, 17],['or', 16, 17],['ore', 16, 17],
     ['withe', 18, 20],['without', 18, 22],['out', 21, 22],['age', 23, 24],
     ['agency', 23, 27],['uh', 25, 25],['uhh', 25, 25]]

    >>> flatten_to_sublists_of_strings(words_starts_stops)

    [['a', 0, 0],['uh', 0, 0],['uhh', 0, 0],['which', 1, 3],['witch', 1, 3],
     ["which's", 1, 5],["witch's", 1, 5],['itch', 2, 3],['is', 4, 5],['it', 6, 7],["it's", 6, 8],
     ['its', 6, 8],['itself', 6, 10],['self', 8, 10],['eh', 9, 9],['elf', 9, 10],
     ['conscious', 11, 15],['ah', 12, 12],['ahh', 12, 12],['awe', 12, 12],
     ['uh', 14, 14],['uhh', 14, 14],['us', 14, 15],['saw', 15, 16],['soar', 15, 17],
     ['sore', 15, 17],['aw', 16, 16],['oar', 16, 17],['or', 16, 17],['ore', 16, 17],
     ['withe', 18, 20],['without', 18, 22],['out', 21, 22],['age', 23, 24],
     ['agency', 23, 27],['uh', 25, 25],['uhh', 25, 25]]    
     '''
    
    result=[]
    if nested_list == []:
        result.extend([[]])
    else:
        if not any([isinstance(x, list)  for x in nested_list]) and \
           not any([isinstance(x, tuple) for x in nested_list]):
            y=[]
            for x in nested_list:
                y.append(x)
            result.append(y)
        else:
            for element in nested_list:
                if isinstance(element, str):
                    # Append the whole string; extend() would split it into characters
                    result.append(element)
                elif isinstance(element, list) or isinstance(element, tuple):
                    if element == []:
                        result.extend([[]])
                    else:
                        result.extend(flatten_to_sublists_of_strings(element)) 
    
    return result

            
def find_words_with_start_index(word_start_stop_list, start_index):
    '''
    Find words in a word list with start and stop indices that start at index start_index.
    '''
    start_words = []
    starts = []
    stops = []
    for word, start, stop in word_start_stop_list:
        if start == start_index and start != []:
            start_words.append(word)
            starts.append(start)
            stops.append(stop)
            
    return start_words, starts, stops


def organize_words_by_start(words_list):
    '''
    Organize a list of words (and corresponding start and stop indices) by start index.
    Output: words_by_start, stops, unique_starts, unique_stops, max_start, max_stop
    
    >>> # input_sentence = 'Uh, which is it, self-conscious or without agency?'
    >>> # words = input_sentence.split('-')  #words = [re.sub(r'[^\'A-Za-z\-]', '', x).lower() for x in words]
    >>> # phonemes, consonants, stresses, nsyllables = words_to_sounds(words, phoneme_list, consonant_list)
    >>> # words_starts_stops = phonemes_to_words(phonemes, all_words, all_pronunciations, all_consonants, 
        #                                        consonant_list, just_consonants, max_phonemes_per_word, 
        #                                        ignore_words)
    >>> # words_starts_stops = flatten_to_sublists_of_strings(words_starts_stops) 

    >>> words_starts_stops = [['a', 0, 0],['uh', 0, 0],['uhh', 0, 0],['which', 1, 3],['witch', 1, 3],
     ["which's", 1, 5],["witch's", 1, 5],['itch', 2, 3],['is', 4, 5],['it', 6, 7],["it's", 6, 8],
     ['its', 6, 8],['itself', 6, 10],['self', 8, 10],['eh', 9, 9],['elf', 9, 10],
     ['conscious', 11, 15],['ah', 12, 12],['ahh', 12, 12],['awe', 12, 12],
     ['uh', 14, 14],['uhh', 14, 14],['us', 14, 15],['saw', 15, 16],['soar', 15, 17],
     ['sore', 15, 17],['aw', 16, 16],['oar', 16, 17],['or', 16, 17],['ore', 16, 17],
     ['withe', 18, 20],['without', 18, 22],['out', 21, 22],['age', 23, 24],
     ['agency', 23, 27],['uh', 25, 25],['uhh', 25, 25]]    

    >>> organize_words_by_start(words_starts_stops)

    ([['a', 'uh', 'uhh'],['which', 'witch', "which's", "witch's"],['itch'],[],['is'],[],
      ['it', "it's", 'its', 'itself'],[],['self'],['eh', 'elf'],[],['conscious'],
      ['ah', 'ahh', 'awe'],[],['uh', 'uhh', 'us'],['saw', 'soar', 'sore'],
      ['aw', 'oar', 'or', 'ore'],[],['withe', 'without'],[],[],['out'],[],
      ['age', 'agency'],[],['uh', 'uhh']],
     [[0, 0, 0],[3, 3, 5, 5],[3],[],[5],[],[7, 8, 8, 10],[],[10],[9, 10],[],
      [15],[12, 12, 12],[],[14, 14, 15],[16, 17, 17],[16, 17, 17, 17],[],
      [20, 22],[],[],[22],[],[24, 27],[],[25, 25]],
     [0, 1, 2, 4, 6, 8, 9, 11, 12, 14, 15, 16, 18, 21, 23, 25],
     [0, 3, 5, 7, 8, 10, 9, 15, 12, 14, 16, 17, 20, 22, 24, 27, 25],
     25,27)    
    
    '''

    if not isinstance(words_list[0], list) and not isinstance(words_list[0], tuple):
        words_list = [words_list]
        
    # Get unique starts and stops, and max start and stop
    starts2 = [start for word, start, stop in words_list]
    stops2 = [stop for word, start, stop in words_list]
    unique_starts = get_unique_numbers(starts2)
    unique_stops = get_unique_numbers(stops2)
    max_start = max(unique_starts)
    max_stop = max(unique_stops)

    # Words organized by start index
    words_by_start = []
    stops = []
    for start_index in range(max_start + 1):
        start_words, istarts, istops = find_words_with_start_index(words_list, start_index)
        words_by_start.append(start_words)
        stops.append(istops)        

    return words_by_start, stops, unique_starts, unique_stops, max_start, max_stop


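def concatenate_lists(list_of_lists1, list_of_lists2):
    '''
    Pair up corresponding items of two lists.

    A string paired with a list yields one (string, element) tuple per list element;
    a list or tuple paired with a list yields a single concatenated list.
    Any other pairing is skipped.
    '''
    result = []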
    for item1, item2 in zip(list_of_lists1, list_of_lists2):
        if isinstance(item1, str) and isinstance(item2, list):
            for element in item2:
                result.append((item1, element))
        elif isinstance(item1, list) and isinstance(item2, list):
            result.append((item1 + item2))
        elif isinstance(item1, tuple) and isinstance(item2, list):
            result.append((list(item1) + item2))
    
    return result


def concatenate_word_pairs(prev_words, prev_stops, words_by_start, stops_by_start, unique_starts):
    '''
    Concatenate word pairs where the second word starts immediately after the first word stops
    (start index of the second word == stop index of the first word + 1).

    >>> # input_sentence = 'Uh, which is it, self-conscious or without agency?'
    >>> # words = input_sentence.split('-')  #words = [re.sub(r'[^\'A-Za-z\-]', '', x).lower() for x in words]
    >>> # phonemes, consonants, stresses, nsyllables = words_to_sounds(words, phoneme_list, consonant_list)
    >>> # words_starts_stops = phonemes_to_words(phonemes, all_words, all_pronunciations, all_consonants, 
        #                                        consonant_list, just_consonants, max_phonemes_per_word, 
        #                                        ignore_words)
    >>> # words_starts_stops = flatten_to_sublists_of_strings(words_starts_stops) 
    >>> # words_by_start, stops_by_start, unique_starts, unique_stops, max_start, max_stop = organize_words_by_start(words_starts_stops)
    >>> prev_words = ['a', 'uh', 'uhh']
    >>> prev_stops = [0, 0, 0]
    >>> words_by_start = [['a', 'uh', 'uhh'],['which', 'witch', "which's", "witch's"],['itch'],[],['is'],[],
      ['it', "it's", 'its', 'itself'],[],['self'],['eh', 'elf'],[],['conscious'],
      ['ah', 'ahh', 'awe'],[],['uh', 'uhh', 'us'],['saw', 'soar', 'sore'],
      ['aw', 'oar', 'or', 'ore'],[],['withe', 'without'],[],[],['out'],[],
      ['age', 'agency'],[],['uh', 'uhh']]
    >>> stops_by_start = [[0, 0, 0],[3, 3, 5, 5],[3],[],[5],[],[7, 8, 8, 10],[],[10],[9, 10],[],[15],[12, 12, 12],[],[14, 14, 15],[16, 17, 17],
                          [16, 17, 17, 17],[],[20, 22],[],[],[22],[],[24, 27],[],[25, 25]]
    >>> unique_starts = [0, 1, 2, 4, 6, 8, 9, 11, 12, 14, 15, 16, 18, 21, 23, 25]
    >>> unique_stops = [0, 3, 5, 7, 8, 10, 9, 15, 12, 14, 16, 17, 20, 22, 24, 27, 25]
    >>> concatenate_word_pairs(prev_words, prev_stops, words_by_start, stops_by_start, unique_starts)

    ([['a', 'which'],['a', 'witch'],['a', "which's"],['a', "witch's"],
      ['uh', 'which'],['uh', 'witch'],['uh', "which's"],['uh', "witch's"],
      ['uhh', 'which'],['uhh', 'witch'],['uhh', "which's"],['uhh', "witch's"]],
     [3, 3, 5, 5, 3, 3, 5, 5, 3, 3, 5, 5])

    '''
    
    # Initialize / format words
    new_words = []
    new_stops = []
    words1 = prev_words
    stops1 = prev_stops
 
    # For each word that starts at a given index
    for iword1, word1 in enumerate(words1):

        # Find words that start after that word stops
        word1_stop = stops1[iword1]
        word2_start = word1_stop + 1
        if word2_start in unique_starts:
            words2 = words_by_start[word2_start]
            stops2 = stops_by_start[word2_start]

            # Concatenate the first word with each of the second set of words
            word1_copies = [word1] * len(words2)  # n copies of word1 to pair with n words2
            words2_list = [[x] for x in words2]
            new_words.append(concatenate_lists(word1_copies, words2_list))
            new_stops.append(stops2)

    new_words = flatten_to_sublists_of_strings(new_words)
    new_stops = flatten_list(new_stops)
        
    return new_words, new_stops


def words_stop_to_start(words_by_start, stops_by_start, unique_starts, max_stop, max_count, 
                        consonants=None, do_swap_consonants=False):
    '''
    Create a list of text strings from sequences of words using the words' start and stop indices.

    Output: output_lines

    >>> # input_sentence = 'Uh, which is it, self-conscious or without agency?'
    >>> # words = input_sentence.split('-')  #words = [re.sub(r'[^\'A-Za-z\-]', '', x).lower() for x in words]
    >>> # phonemes, consonants, stresses, nsyllables = words_to_sounds(words, phoneme_list, consonant_list)
    >>> # words_starts_stops = phonemes_to_words(phonemes, all_words, all_pronunciations, all_consonants, 
        #                                        consonant_list, just_consonants, max_phonemes_per_word, 
        #                                        ignore_words)
    >>> # words_starts_stops = flatten_to_sublists_of_strings(words_starts_stops) 
    >>> # words_by_start, stops_by_start, unique_starts, unique_stops, max_start, max_stop = organize_words_by_start(words_starts_stops)
    >>> words_by_start = [['a', 'uh', 'uhh'],['which', 'witch', "which's", "witch's"],['itch'],[],['is'],[],
      ['it', "it's", 'its', 'itself'],[],['self'],['eh', 'elf'],[],['conscious'],
      ['ah', 'ahh', 'awe'],[],['uh', 'uhh', 'us'],['saw', 'soar', 'sore'],
      ['aw', 'oar', 'or', 'ore'],[],['withe', 'without'],[],[],['out'],[],
      ['age', 'agency'],[],['uh', 'uhh']]
    >>> stops_by_start = [[0, 0, 0],[3, 3, 5, 5],[3],[],[5],[],[7, 8, 8, 10],[],[10],[9, 10],[],[15],[12, 12, 12],[],[14, 14, 15],[16, 17, 17],
                          [16, 17, 17, 17],[],[20, 22],[],[],[22],[],[24, 27],[],[25, 25]]
    >>> unique_starts = [0, 1, 2, 4, 6, 8, 9, 11, 12, 14, 15, 16, 18, 21, 23, 25]
    >>> max_stop = 27
    >>> max_count = 26
    >>> consonants = None
    >>> do_swap_consonants = False
    >>> words_stop_to_start(words_by_start, stops_by_start, unique_starts, max_stop, max_count, 
                            consonants, do_swap_consonants)
                            
    ["a which's itself conscious oar without agency",
     "a which's itself conscious or without agency",
     "a which's itself conscious ore without agency",
     "a witch's itself conscious oar without agency",
     "a witch's itself conscious or without agency",
     "a witch's itself conscious ore without agency",
     "uh which's itself conscious oar without agency",
     "uh which's itself conscious or without agency",
     "uh which's itself conscious ore without agency",
     "uh witch's itself conscious oar without agency",
     "uh witch's itself conscious or without agency",
     "uh witch's itself conscious ore without agency",
     "uhh which's itself conscious oar without agency",
     "uhh which's itself conscious or without agency",
     "uhh which's itself conscious ore without agency",
     "uhh witch's itself conscious oar without agency",
     "uhh witch's itself conscious or without agency",
     "uhh witch's itself conscious ore without agency",
     'a which is itself conscious oar without agency',
     'a which is itself conscious or without agency',
     'a which is itself conscious ore without agency',
     'a witch is itself conscious oar without agency',
     'a witch is itself conscious or without agency',
     'a witch is itself conscious ore without agency',
     "a which's it self conscious oar without agency",
     "a which's it self conscious or without agency",
     "a which's it self conscious ore without agency",
     "a which's it's elf conscious oar without agency",
     "a which's it's elf conscious or without agency",
     "a which's it's elf conscious ore without agency",
     "a which's its elf conscious oar without agency",
     "a which's its elf conscious or without agency",
     "a which's its elf conscious ore without agency",
     "a which's itself conscious oar withe out agency",
     "a which's itself conscious or withe out agency",
     "a which's itself conscious ore withe out agency",
     "a witch's it self conscious oar without agency",
     "a witch's it self conscious or without agency",
     "a witch's it self conscious ore without agency",
     "a witch's it's elf conscious oar without agency",
     "a witch's it's elf conscious or without agency",
     "a witch's it's elf conscious ore without agency",
     "a witch's its elf conscious oar without agency",
     "a witch's its elf conscious or without agency",
     "a witch's its elf conscious ore without agency",
     "a witch's itself conscious oar withe out agency",
     "a witch's itself conscious or withe out agency",
     "a witch's itself conscious ore withe out agency",
     'uh which is itself conscious oar without agency',
     'uh which is itself conscious or without agency',
     'uh which is itself conscious ore without agency',
     'uh witch is itself conscious oar without agency',
     'uh witch is itself conscious or without agency',
     'uh witch is itself conscious ore without agency',
     "uh which's it self conscious oar without agency",
     "uh which's it self conscious or without agency",
     "uh which's it self conscious ore without agency",
     "uh which's it's elf conscious oar without agency",
     "uh which's it's elf conscious or without agency",
     "uh which's it's elf conscious ore without agency",
     "uh which's its elf conscious oar without agency",
     "uh which's its elf conscious or without agency",
     "uh which's its elf conscious ore without agency",
     "uh which's itself conscious oar withe out agency",
     "uh which's itself conscious or withe out agency",
     "uh which's itself conscious ore withe out agency",
     "uh witch's it self conscious oar without agency",
     "uh witch's it self conscious or without agency",
     "uh witch's it self conscious ore without agency",
     "uh witch's it's elf conscious oar without agency",
     "uh witch's it's elf conscious or without agency",
     "uh witch's it's elf conscious ore without agency",
     "uh witch's its elf conscious oar without agency",
     "uh witch's its elf conscious or without agency",
     "uh witch's its elf conscious ore without agency",
     "uh witch's itself conscious oar withe out agency",
     "uh witch's itself conscious or withe out agency",
     "uh witch's itself conscious ore withe out agency",
     'uhh which is itself conscious oar without agency',
     'uhh which is itself conscious or without agency',
     'uhh which is itself conscious ore without agency',
     'uhh witch is itself conscious oar without agency',
     'uhh witch is itself conscious or without agency',
     'uhh witch is itself conscious ore without agency',
     "uhh which's it self conscious oar without agency",
     "uhh which's it self conscious or without agency",
     "uhh which's it self conscious ore without agency",
     "uhh which's it's elf conscious oar without agency",
     "uhh which's it's elf conscious or without agency",
     "uhh which's it's elf conscious ore without agency",
     "uhh which's its elf conscious oar without agency",
     "uhh which's its elf conscious or without agency",
     "uhh which's its elf conscious ore without agency",
     "uhh which's itself conscious oar withe out agency",
     "uhh which's itself conscious or withe out agency",
     "uhh which's itself conscious ore withe out agency",
     "uhh witch's it self conscious oar without agency",
     "uhh witch's it self conscious or without agency",
     "uhh witch's it self conscious ore without agency",
     "uhh witch's it's elf conscious oar without agency",
     "uhh witch's it's elf conscious or without agency",
     "uhh witch's it's elf conscious ore without agency",
     "uhh witch's its elf conscious oar without agency",
     "uhh witch's its elf conscious or without agency",
     "uhh witch's its elf conscious ore without agency",
     "uhh witch's itself conscious oar withe out agency",
     "uhh witch's itself conscious or withe out agency",
     "uhh witch's itself conscious ore withe out agency",
     'a which is it self conscious oar without agency',
     'a which is it self conscious or without agency',
     'a which is it self conscious ore without agency',
     "a which is it's elf conscious oar without agency",
     "a which is it's elf conscious or without agency",
     "a which is it's elf conscious ore without agency",
     'a which is its elf conscious oar without agency',
     'a which is its elf conscious or without agency',
     'a which is its elf conscious ore without agency',
     'a which is itself conscious oar withe out agency',
     'a which is itself conscious or withe out agency',
     'a which is itself conscious ore withe out agency',
     'a witch is it self conscious oar without agency',
     'a witch is it self conscious or without agency',
     'a witch is it self conscious ore without agency',
     "a witch is it's elf conscious oar without agency",
     "a witch is it's elf conscious or without agency",
     "a witch is it's elf conscious ore without agency",
     'a witch is its elf conscious oar without agency',
     'a witch is its elf conscious or without agency',
     'a witch is its elf conscious ore without agency',
     'a witch is itself conscious oar withe out agency',
     'a witch is itself conscious or withe out agency',
     'a witch is itself conscious ore withe out agency',
     "a which's it self conscious oar withe out agency",
     "a which's it self conscious or withe out agency",
     "a which's it self conscious ore withe out agency",
     "a which's it's elf conscious oar withe out agency",
     "a which's it's elf conscious or withe out agency",
     "a which's it's elf conscious ore withe out agency",
     "a which's its elf conscious oar withe out agency",
     "a which's its elf conscious or withe out agency",
     "a which's its elf conscious ore withe out agency",
     "a witch's it self conscious oar withe out agency",
     "a witch's it self conscious or withe out agency",
     "a witch's it self conscious ore withe out agency",
     "a witch's it's elf conscious oar withe out agency",
     "a witch's it's elf conscious or withe out agency",
     "a witch's it's elf conscious ore withe out agency",
     "a witch's its elf conscious oar withe out agency",
     "a witch's its elf conscious or withe out agency",
     "a witch's its elf conscious ore withe out agency",
     'uh which is it self conscious oar without agency',
     'uh which is it self conscious or without agency',
     'uh which is it self conscious ore without agency',
     "uh which is it's elf conscious oar without agency",
     "uh which is it's elf conscious or without agency",
     "uh which is it's elf conscious ore without agency",
     'uh which is its elf conscious oar without agency',
     'uh which is its elf conscious or without agency',
     'uh which is its elf conscious ore without agency',
     'uh which is itself conscious oar withe out agency',
     'uh which is itself conscious or withe out agency',
     'uh which is itself conscious ore withe out agency',
     'uh witch is it self conscious oar without agency',
     'uh witch is it self conscious or without agency',
     'uh witch is it self conscious ore without agency',
     "uh witch is it's elf conscious oar without agency",
     "uh witch is it's elf conscious or without agency",
     "uh witch is it's elf conscious ore without agency",
     'uh witch is its elf conscious oar without agency',
     'uh witch is its elf conscious or without agency',
     'uh witch is its elf conscious ore without agency',
     'uh witch is itself conscious oar withe out agency',
     'uh witch is itself conscious or withe out agency',
     'uh witch is itself conscious ore withe out agency',
     "uh which's it self conscious oar withe out agency",
     "uh which's it self conscious or withe out agency",
     "uh which's it self conscious ore withe out agency",
     "uh which's it's elf conscious oar withe out agency",
     "uh which's it's elf conscious or withe out agency",
     "uh which's it's elf conscious ore withe out agency",
     "uh which's its elf conscious oar withe out agency",
     "uh which's its elf conscious or withe out agency",
     "uh which's its elf conscious ore withe out agency",
     "uh witch's it self conscious oar withe out agency",
     "uh witch's it self conscious or withe out agency",
     "uh witch's it self conscious ore withe out agency",
     "uh witch's it's elf conscious oar withe out agency",
     "uh witch's it's elf conscious or withe out agency",
     "uh witch's it's elf conscious ore withe out agency",
     "uh witch's its elf conscious oar withe out agency",
     "uh witch's its elf conscious or withe out agency",
     "uh witch's its elf conscious ore withe out agency",
     'uhh which is it self conscious oar without agency',
     'uhh which is it self conscious or without agency',
     'uhh which is it self conscious ore without agency',
     "uhh which is it's elf conscious oar without agency",
     "uhh which is it's elf conscious or without agency",
     "uhh which is it's elf conscious ore without agency",
     'uhh which is its elf conscious oar without agency',
     'uhh which is its elf conscious or without agency',
     'uhh which is its elf conscious ore without agency',
     'uhh which is itself conscious oar withe out agency',
     'uhh which is itself conscious or withe out agency',
     'uhh which is itself conscious ore withe out agency',
     'uhh witch is it self conscious oar without agency',
     'uhh witch is it self conscious or without agency',
     'uhh witch is it self conscious ore without agency',
     "uhh witch is it's elf conscious oar without agency",
     "uhh witch is it's elf conscious or without agency",
     "uhh witch is it's elf conscious ore without agency",
     'uhh witch is its elf conscious oar without agency',
     'uhh witch is its elf conscious or without agency',
     'uhh witch is its elf conscious ore without agency',
     'uhh witch is itself conscious oar withe out agency',
     'uhh witch is itself conscious or withe out agency',
     'uhh witch is itself conscious ore withe out agency',
     "uhh which's it self conscious oar withe out agency",
     "uhh which's it self conscious or withe out agency",
     "uhh which's it self conscious ore withe out agency",
     "uhh which's it's elf conscious oar withe out agency",
     "uhh which's it's elf conscious or withe out agency",
     "uhh which's it's elf conscious ore withe out agency",
     "uhh which's its elf conscious oar withe out agency",
     "uhh which's its elf conscious or withe out agency",
     "uhh which's its elf conscious ore withe out agency",
     "uhh witch's it self conscious oar withe out agency",
     "uhh witch's it self conscious or withe out agency",
     "uhh witch's it self conscious ore withe out agency",
     "uhh witch's it's elf conscious oar withe out agency",
     "uhh witch's it's elf conscious or withe out agency",
     "uhh witch's it's elf conscious ore withe out agency",
     "uhh witch's its elf conscious oar withe out agency",
     "uhh witch's its elf conscious or withe out agency",
     "uhh witch's its elf conscious ore withe out agency",
     'a which is it self conscious oar withe out agency',
     'a which is it self conscious or withe out agency',
     'a which is it self conscious ore withe out agency',
     "a which is it's elf conscious oar withe out agency",
     "a which is it's elf conscious or withe out agency",
     "a which is it's elf conscious ore withe out agency",
     'a which is its elf conscious oar withe out agency',
     'a which is its elf conscious or withe out agency',
     'a which is its elf conscious ore withe out agency',
     'a witch is it self conscious oar withe out agency',
     'a witch is it self conscious or withe out agency',
     'a witch is it self conscious ore withe out agency',
     "a witch is it's elf conscious oar withe out agency",
     "a witch is it's elf conscious or withe out agency",
     "a witch is it's elf conscious ore withe out agency",
     'a witch is its elf conscious oar withe out agency',
     'a witch is its elf conscious or withe out agency',
     'a witch is its elf conscious ore withe out agency',
     'uh which is it self conscious oar withe out agency',
     'uh which is it self conscious or withe out agency',
     'uh which is it self conscious ore withe out agency',
     "uh which is it's elf conscious oar withe out agency",
     "uh which is it's elf conscious or withe out agency",
     "uh which is it's elf conscious ore withe out agency",
     'uh which is its elf conscious oar withe out agency',
     'uh which is its elf conscious or withe out agency',
     'uh which is its elf conscious ore withe out agency',
     'uh witch is it self conscious oar withe out agency',
     'uh witch is it self conscious or withe out agency',
     'uh witch is it self conscious ore withe out agency',
     "uh witch is it's elf conscious oar withe out agency",
     "uh witch is it's elf conscious or withe out agency",
     "uh witch is it's elf conscious ore withe out agency",
     'uh witch is its elf conscious oar withe out agency',
     'uh witch is its elf conscious or withe out agency',
     'uh witch is its elf conscious ore withe out agency',
     'uhh which is it self conscious oar withe out agency',
     'uhh which is it self conscious or withe out agency',
     'uhh which is it self conscious ore withe out agency',
     "uhh which is it's elf conscious oar withe out agency",
     "uhh which is it's elf conscious or withe out agency",
     "uhh which is it's elf conscious ore withe out agency",
     'uhh which is its elf conscious oar withe out agency',
     'uhh which is its elf conscious or withe out agency',
     'uhh which is its elf conscious ore withe out agency',
     'uhh witch is it self conscious oar withe out agency',
     'uhh witch is it self conscious or withe out agency',
     'uhh witch is it self conscious ore withe out agency',
     "uhh witch is it's elf conscious oar withe out agency",
     "uhh witch is it's elf conscious or withe out agency",
     "uhh witch is it's elf conscious ore withe out agency",
     'uhh witch is its elf conscious oar withe out agency',
     'uhh witch is its elf conscious or withe out agency',
     'uhh witch is its elf conscious ore withe out agency']

     '''    
    
    # Initialize a list of output text strings 
    output_lines = []

    # Initialize loop to build a text string from the first words to the final words in a word list
    max_words = []
    max_stops = []
    prev_words = words_by_start[0]
    prev_stops = stops_by_start[0]
    count = 1
    run = True
    while(run):
        count += 1

        # Concatenate new words to previous words
        new_words, new_stops = concatenate_word_pairs(prev_words, prev_stops, 
                                                      words_by_start, stops_by_start, unique_starts)
        
        # Store the words that have reached the maximum stop
        # and prepare to concatenate more words in the next loop
        prev_words = []
        prev_stops = []
        for i,x in enumerate(new_stops):
            if x == max_stop:
                max_stops.append(x)
                max_words.append(new_words[i])
            else:
                prev_stops.append(x)
                prev_words.append(new_words[i])
               
        # Halt when all stops equal max_stop or after max_count loops
        if all(x == max_stop for x in new_stops) or count >= max_count:
            run = False

    # Convert each word list to a text string
    output_lines.extend(' '.join(x) for x in max_words)

    return output_lines
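
The loop above repeatedly joins word candidates whose start index matches the stop index of the sequence built so far, and sets aside the sequences that reach the final stop. A minimal, self-contained sketch of that chaining idea follows. The helper `chain_spans`, the half-open `(word, start, stop)` tuples, and the toy indices are all hypothetical; they are assumptions for illustration and do not reproduce `concatenate_word_pairs` or the notebook's data structures.

```python
def chain_spans(candidates, max_stop, max_count=10):
    """Join (word, start, stop) candidates into phrases covering 0..max_stop."""
    # Group candidate words by the phoneme index where they start
    by_start = {}
    for word, start, stop in candidates:
        by_start.setdefault(start, []).append((word, stop))

    finished = []
    # Partial sequences: (list of words, stop index reached so far)
    partial = [([word], stop) for word, stop in by_start.get(0, [])]
    for _ in range(max_count):
        if not partial:
            break
        next_partial = []
        for words, stop in partial:
            if stop == max_stop:
                # This sequence covers every phoneme: keep it as a phrase
                finished.append(' '.join(words))
            else:
                # Extend with every word that starts where this sequence stops
                for word, new_stop in by_start.get(stop, []):
                    next_partial.append((words + [word], new_stop))
        partial = next_partial
    return finished


# Toy candidates over 5 phoneme positions (indices are illustrative only)
candidates = [('ice', 0, 2), ('I', 0, 1), ('cream', 2, 5), ('scream', 1, 5)]
phrases = chain_spans(candidates, max_stop=5)
print(phrases)
```

As in the notebook's loop, `max_count` caps the number of passes so that a long phoneme sequence with many candidates cannot run unbounded.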


## Homophone generator (same sounds, different words)

In [14]:
def generate_homophones(phonemes, max_count, max_phonemes_per_word, just_consonants=False, 
                        do_permute_consonants=False, ignore_words=None, verbose=False):
    '''
    Generate homophone words or sentences (same sounds, different words) from a sequence of phonemes.
    
    If just_consonants is set to True, then generate words or sentences from a sequence of consonants 
    (same consonants and consonant neighbors, different vowels).

    >>> input_sentence = 'Uh, which is it, self-conscious or without agency?'
    >>> words = input_sentence.split('-')  #words = [re.sub(r'[^\'A-Za-z\-]', '', x).lower() for x in words]
    >>> phonemes, consonants, stresses, nsyllables = words_to_sounds(words, phoneme_list, consonant_list)
    >>> phonemes = separate_consonants(phonemes)
    >>> max_count = 26
    >>> max_phonemes_per_word = 25
    >>> just_consonants = False
    >>> do_permute_consonants = False
    >>> ignore_words = None
    >>> verbose = False
    >>> generate_homophones(phonemes, max_count, max_phonemes_per_word, just_consonants, do_permute_consonants,
    ...                     ignore_words, verbose)
                        
    ["a which's itself conscious oar without agency",
     ...
     'a which is itself conscious or without agency',
     ...
     "uhh which is itself conscious oar without age 'n sci",
     ...]
 
    >>> just_consonants = True
    >>> generate_homophones(phonemes, max_count, max_phonemes_per_word, just_consonants, do_permute_consonants,
    ...                     ignore_words, verbose)
    ['languages continue to evolve',
     'languages continue too evolve',
     'languages continue two evolve']
    '''
    
    # Permute consonants (different sequence of same consonants, different vowels)
    if do_permute_consonants and just_consonants:
        consonants = [x for x in phonemes if x in consonant_list]
        consonant_permutations = permutations(consonants)  # lazy iterator: avoids storing all n! orderings

        # Loop through every permutation of consonants
        words_starts_stops = []
        for consonant_permutation in consonant_permutations:
            phoneme_permutation = []
            count = 0
            for phoneme in phonemes:
                if phoneme in consonant_list:
                    phoneme_permutation.append(consonant_permutation[count])
                    count += 1
                else:
                    phoneme_permutation.append(phoneme)

            words_starts_stops_permutation = phonemes_to_words(phoneme_permutation, all_words, all_pronunciations, 
                                                               all_consonants, consonant_list, True, 
                                                               max_phonemes_per_word, ignore_words)
            words_starts_stops.extend(words_starts_stops_permutation)
 
    else:
        words_starts_stops = phonemes_to_words(phonemes, all_words, all_pronunciations, all_consonants, 
                                               consonant_list, just_consonants, max_phonemes_per_word, 
                                               ignore_words)

    if words_starts_stops:
        #consonants1 = separate_consonants(consonants)
        #phonemes2, consonants2, stresses2, syllables2 = words_to_sounds(new_words[istop], 
        #consonants2 = separate_consonants(consonants2)
        #if len(consonants1) != len(consonants2):
        #    go = False

        if verbose:
            print('Words: {0}'.format(', '.join([x[0] for x in words_starts_stops])), end='\n')
        words_starts_stops = flatten_to_sublists_of_strings(words_starts_stops) 
        words_by_start, stops, unique_starts, unique_stops, max_start, max_stop = organize_words_by_start(words_starts_stops)
        #print('Words sorted by start index:  {0}'.format(words_by_start), end='\n')
        homophones = words_stop_to_start(words_by_start, stops, unique_starts, max_stop, max_count, None, False)
    else:
        homophones = None
    
    return homophones
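
The consonant-permutation branch of `generate_homophones` keeps each vowel in place and substitutes a reordered consonant into each consonant slot. A minimal sketch of that step in isolation follows. The `CONSONANTS` set is a hypothetical subset of ARPABET, `permute_consonants` is an illustrative name, and the `set()` de-duplication is an addition that the notebook code does not perform.

```python
from itertools import permutations

# Hypothetical subset of ARPABET consonants, for illustration only
CONSONANTS = {'B', 'D', 'K', 'T', 'S'}


def permute_consonants(phonemes, consonants=CONSONANTS):
    """Yield phoneme sequences with consonants reordered and vowels left in place."""
    found = [p for p in phonemes if p in consonants]
    # set() drops identical orderings that arise when a consonant repeats
    for ordering in set(permutations(found)):
        replacement = iter(ordering)
        # Fill each consonant slot with the next consonant of this ordering
        yield [next(replacement) if p in consonants else p for p in phonemes]


results = sorted(permute_consonants(['K', 'AE', 'T']))
print(results)
```

The number of orderings grows factorially with the number of consonants, so this option is only practical for short inputs.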

In [None]:
just_consonants = True
generate_homophones(phonemes, max_count, max_phonemes_per_word, just_consonants, do_permute_consonants,
                    ignore_words, verbose)


Words: away, way, we, wee, weigh, whew, whey, whoa, why, woe, woo, wow, wowie, watch, which, witch, watches, which's, witches, witch's, away, way, we, wee, weigh, whew, whey, whoa, why, woe, woo, wow, wowie, watch, which, witch, watches, which's, witches, witch's, chai, chew, chewy, chia, chow, ciao, each, etch, itch, itchy, ouch, achoa's, cheese, cheesy, che's, chews, choose, choosy, chose, chows, etches, h's, itches, occhoa's, chai, chew, chewy, chia, chow, ciao, each, etch, itch, itchy, ouch, achoa's, cheese, cheesy, che's, chews, choose, choosy, chose, chows, etches, h's, itches, occhoa's, ais, as, a's, ayes, ease, easy, e's, eye's, eyes, eyes', i's, is, oh's, ohs, oohs, ooze, o's, owes, oz, zoo, zeta, ais, as, a's, ayes, ease, easy, e's, eye's, eyes, eyes', i's, is, oh's, ohs, oohs, ooze, o's, owes, oz, zoo, zeta, at, ate, auto, eat, eight, eighty, iota, it, oat, ought, out, outta, ta, tai, tea, tee, ti, tie, to, toe, too, tow, toy, two, eats, eight's, eights, it's, its, oats, out

## Process input/output file

In [None]:
def remove_duplicate_lines(infile, outfile):
    '''
    Remove duplicate lines in a file.
    '''
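    # Read all lines, then write back one copy of each (a set does not preserve line order)
    with open(infile) as fread:
        unique_lines = set(fread.readlines())
    with open(outfile, 'w') as fwrite:
        fwrite.writelines(unique_lines)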
    

def display_save_output(input_list, input_type, input_source, 
                        do_save_files=False, filename_base=None, verbose=False):
    '''
    Display output and/or write it to a text file.
    '''

    # Write output to a text file
    if input_list:  # also handles None, which generate_homophones returns when nothing matches
        if verbose:
            print('')
            print('==============================================================================================')
            print('{0} {1} for "{2}"'.format(len(input_list), input_type, input_source.strip()), end='\n')
            print('==============================================================================================')
            #for line in flatten_list(input_list):
            #   print('    {0}'.format(line), end='\n')
        if do_save_files:
            outfile = filename_base + input_type.upper() + '.txt'
            # Opening in "w" mode truncates any existing file before writing
            with open(outfile, "w") as fwrite:
                for new_line in input_list:
                    fwrite.write(new_line + '\n')
            if verbose:
                print('{0} written to {1}'.format(input_type.capitalize(), outfile), end='\n')
    else:
        if verbose:
            print('\n0 {0} for "{1}"'.format(input_type, input_source.strip()), end='\n')
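
If the order of the generated lines matters, de-duplication can keep the first occurrence of each line instead of going through an unordered collection. The following is a minimal sketch under that assumption; `dedupe_lines` is a hypothetical helper, not part of the notebook, and the temporary file stands in for a saved output file.

```python
import os
import tempfile


def dedupe_lines(lines):
    """Return lines with later repeats dropped, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(lines))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'HOMOPHONES.txt')
    with open(path, 'w') as f:
        f.writelines(['ice cream\n', 'I scream\n', 'ice cream\n'])
    with open(path) as f:
        unique = dedupe_lines(f.readlines())

print(unique)
```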


## Run all code

In [None]:
do_generate_homophones = True  # Generate homophones (same sounds, different words)
do_generate_consonances = False  # Generate consonances (same consonants and consonant neighbors, different vowels)
do_permute_consonants = False  # Permute consonants (different sequence of same consonants, different vowels)

do_save_files = True
verbose = True
verbose2 = True

# Parameters for each function
do_check_grammar = True
do_check_grammar_again = False
cap_and_punc = True
ignore_inputs = []  # Don't use any of these words
do_ignore_inputs = False  # Don't use any of the original words
do_split_up_consonants = True  # Split up consonant compounds ("I scream" => "ice cream")?
if do_split_up_consonants:
    max_phonemes_per_word = 25
else:
    max_phonemes_per_word = 11      


# Load each line of text
input_file = "input.txt"
with open(input_file, "r") as fread:
    lines = fread.readlines()
filename_base = None        
for line in lines:
    words = line.split('-')  #words = [re.sub(r'[^\'A-Za-z\-]', '', x).lower() for x in words]
    nwords = len(''.join(words))

    # Convert words to phonemes
    phonemes, consonants, stresses, nsyllables = words_to_sounds(words, phoneme_list, consonant_list)

    # Split up consonant compounds ("I scream" => "ice cream")
    if do_split_up_consonants:
        phonemes = separate_consonants(phonemes)

    # Set parameters
    if do_ignore_inputs:
        ignore_inputs = words
    if do_save_files:
        filename_base = 'output/' + '_'.join(words)
    
    print("\nGenerate linguamorphs for each line of {0}:".format(input_file))

    # Homophones
    if do_generate_homophones:
        # generate_homophones takes (phonemes, max_count, max_phonemes_per_word, ...); nsyllables is not a parameter
        homophones = generate_homophones(phonemes, nwords, max_phonemes_per_word, False, False, ignore_inputs, verbose2)
        homophones = check_grammar_twice(homophones, grammar_tool, grammar_tool2, cap_and_punc, do_check_grammar_again, verbose2)
        display_save_output(homophones, 'homophones', line, do_save_files, filename_base, verbose)

    # Consonances
    if do_generate_consonances:
        # generate_homophones takes (phonemes, max_count, max_phonemes_per_word, ...); nsyllables is not a parameter
        consonances = generate_homophones(phonemes, nwords, max_phonemes_per_word, True, False, ignore_inputs, verbose2)
        consonances = check_grammar_twice(consonances, grammar_tool, grammar_tool2, cap_and_punc, do_check_grammar_again, verbose2)
        display_save_output(consonances, 'consonances', line, do_save_files, filename_base, verbose)

    # Swap-consonances
    if do_permute_consonants:
        # generate_homophones takes (phonemes, max_count, max_phonemes_per_word, ...); nsyllables is not a parameter
        swap_consonances = generate_homophones(phonemes, nwords, max_phonemes_per_word, True, True, ignore_inputs, verbose2)
        swap_consonances = check_grammar_twice(swap_consonances, grammar_tool, grammar_tool2, cap_and_punc, do_check_grammar_again, verbose2)
        display_save_output(swap_consonances, 'swap_consonances', line, do_save_files, filename_base, verbose)

    print('\nDone!')


