# LINGUAMORPH

The software in this notebook is intended to take in text and output variations in sound and stress.

Potential applications include:

  - homophone phrases (variations on a phrase that sound the same or similar)
  - extreme alliteration
  - spoonerism-style scrambles
  - homophone anagrams
  - homophone loops
  - shrinking and expanding text
  - two different stories that sound the same
  - text that morphs to text with meaningful intermediates
  - stress-rhythm-seeded text
  - faux/foe translations
  - rhyming/rapping

## Phonemes and Carnegie Mellon Pronouncing Dictionary

(See https://github.com/cmusphinx/cmudict/tree/4c6a365cea2c34340ffc218d5af7a38920fa7e37)

From https://www.nltk.org/_modules/nltk/corpus/reader/cmudict.html:

The Carnegie Mellon Pronouncing Dictionary [cmudict.0.6]
Copyright 1998 Carnegie Mellon University

File Format: Each line consists of an uppercased word, a counter
(for alternative pronunciations), and a transcription.  Vowels are
marked for stress (1=primary, 2=secondary, 0=no stress).  E.g.:
NATURAL 1 N AE1 CH ER0 AH0 L

The dictionary contains 127069 entries.  Of these, 119400 words are assigned
a unique pronunciation, 6830 words have two pronunciations, and 839 words have
three or more pronunciations.  Many of these are fast-speech variants.

Phonemes: There are 39 phonemes, as shown below:

    Phoneme Example Translation    Phoneme Example Translation
    ------- ------- -----------    ------- ------- -----------
    AA      odd     AA D           AE      at      AE T
    AH      hut     HH AH T        AO      ought   AO T
    AW      cow     K AW           AY      hide    HH AY D
    B       be      B IY           CH      cheese  CH IY Z
    D       dee     D IY           DH      thee    DH IY
    EH      Ed      EH D           ER      hurt    HH ER T
    EY      ate     EY T           F       fee     F IY
    G       green   G R IY N       HH      he      HH IY
    IH      it      IH T           IY      eat     IY T
    JH      gee     JH IY          K       key     K IY
    L       lee     L IY           M       me      M IY
    N       knee    N IY           NG      ping    P IH NG
    OW      oat     OW T           OY      toy     T OY
    P       pee     P IY           R       read    R IY D
    S       sea     S IY           SH      she     SH IY
    T       tea     T IY           TH      theta   TH EY T AH
    UH      hood    HH UH D        UW      two     T UW
    V       vee     V IY           W       we      W IY
    Y       yield   Y IY L D       Z       zee     Z IY
    ZH      seizure S IY ZH ER
    
From https://www.pythonstudio.us/language-processing/a-pronouncing-dictionary.html:

For each word, this lexicon provides a list of phonetic codes—distinct labels for each contrastive sound—known as phones. Observe that fire has two pronunciations (in U.S. English): the one-syllable F AY1 R, and the two-syllable F AY1 ER0. The symbols in the CMU Pronouncing Dictionary are from the Arpabet, described in more detail at http://en.wikipedia.org/wiki/Arpabet.

## Code to prepare and filter CMU words

In [1]:
import nltk
import re
import pickle


phoneme_list = ['AA','AH','AW','B','D','EH','EY','G','IH','JH','L','N','OW','P','S','T','UH','V','Y','ZH',
                'AE','AO','AY','CH','DH','ER','F','HH','IY','K','M','NG','OY','R','SH','TH','UW','W','Z']
vowel_list = ['AA','AH','AW','EH','EY','IH','OW','UH','AE','AO','AY','ER','IY','OY','UW']
single_consonants = ['B','D','G','JH','L','N','P','S','T','V','Y','ZH','CH',
                     'DH','F','HH','K','M','NG','R','SH','TH','W','Z']
multiple_consonants = []
for c1 in single_consonants:
    for c2 in single_consonants:
        if c2 != c1:
            multiple_consonants.append(c1 + '+' + c2)
            for c3 in single_consonants:
                if c3 != c2:
                    multiple_consonants.append(c1 + '+' + c2 + '+' + c3)
                    for c4 in single_consonants:
                        if c4 != c3:
                            multiple_consonants.append(c1 + '+' + c2 + '+' + c3 + '+' + c4)
                            #for c5 in single_consonants:
                            #    if c5 != c4:
                            #        multiple_consonants.append(c1 + '+' + c2 + '+' + c3 + '+' + c4 + '+' + c5)

consonant_list = single_consonants + multiple_consonants
#print(len(consonant_list))
#print(consonant_list)


def combine_consonants(phonemes):
    '''
    >>> combine_consonants(['Y','IY','L','D'])
    ['Y', 'IY', 'L+D']    
    '''
    phonemes_with_combined_consonants = []
    P = len(phonemes)
    i = 0
    while i < P:
        loop = True
        while loop:
            p1 = phonemes[i]
            i += 1
            if p1 in single_consonants:
                if P > i:
                    p2 = phonemes[i]
                    i += 1
                    if p2 in single_consonants:
                        if P > i:
                            p3 = phonemes[i]
                            i += 1
                            if p3 in single_consonants:
                                if P > i:
                                    p4 = phonemes[i]
                                    i += 1
                                    if p4 in single_consonants:
                                        phonemes_with_combined_consonants.append(p1 + '+' + p2 + '+' + p3 + '+' + p4)
                                        #if P > i:
                                        #    p5 = phonemes[i]
                                        #    i += 1
                                        #    if p5 in single_consonants:
                                        #        phonemes_with_combined_consonants.append(p1 + '+' + p2 + '+' + p3 + '+' + p4 + '+' + p5)
                                        #    else:
                                        #        phonemes_with_combined_consonants.append(p1 + '+' + p2 + '+' + p3 + '+' + p4)
                                        #        phonemes_with_combined_consonants.append(p5)
                                        #    break                                                
                                        #else:
                                        #    phonemes_with_combined_consonants.append(p1 + '+' + p2 + '+' + p3 + '+' + p4)
                                        #    break
                                    else:
                                        phonemes_with_combined_consonants.append(p1 + '+' + p2 + '+' + p3)
                                        phonemes_with_combined_consonants.append(p4)
                                    break
                                else:
                                    phonemes_with_combined_consonants.append(p1 + '+' + p2 + '+' + p3)
                                    break
                            else:
                                phonemes_with_combined_consonants.append(p1 + '+' + p2)
                                phonemes_with_combined_consonants.append(p3)
                                break
                        else:
                            phonemes_with_combined_consonants.append(p1 + '+' + p2)
                            break
                    else:
                        phonemes_with_combined_consonants.append(p1)
                        phonemes_with_combined_consonants.append(p2)
                        break
                else:
                    phonemes_with_combined_consonants.append(p1)
                    break
            else:
                phonemes_with_combined_consonants.append(p1)
                break
    
    return phonemes_with_combined_consonants


def filter_dictionary_words(words, consonants, pronunciations, stresses, 
                            filter_words, filter_strings, filter_dictionary='english_words_py', 
                            verbose=False):
    filtered_words = []
    filtered_consonants = []
    filtered_pronunciations = []
    filtered_stresses = []
    removed_words = []
    for iword, word in enumerate(words): 
        if filter_dictionary == 'pyenchant':
            if (enchant_dict.check(word) or enchant_dict.check(word.capitalize())) and \
                (word not in filter_words) and \
                (word not in filtered_words) and \
                all([x not in word for x in filter_strings]):
                    filtered_words.append(word)
                    filtered_consonants.append(consonants[iword])
                    filtered_pronunciations.append(pronunciations[iword])
                    filtered_stresses.append(stresses[iword])
        elif filter_dictionary == 'english_words_py':
            if (word in english_words_set or \
                    (word[:-1] in english_words_set and word[-1] == 's')) or \
                (word.capitalize() in english_words_set or \
                    (word.capitalize()[:-1] in english_words_set and word.capitalize()[-1] == 's')) and \
                (word not in filter_words) and \
                (word not in filtered_words) and \
                all([x not in word for x in filter_strings]):
                    filtered_words.append(word)
                    filtered_consonants.append(consonants[iword])
                    filtered_pronunciations.append(pronunciations[iword])
                    filtered_stresses.append(stresses[iword])
        else:
            removed_words.append(word)
    
    if verbose and removed_words != []:
        print('{0} retained words, {1} removed words'.format(len(filtered_words),len(removed_words)))

    return filtered_words, filtered_consonants, filtered_pronunciations, filtered_stresses, removed_words


def save_object(obj, pickle_file):
    try:
        with open(pickle_file, "wb") as f:
            pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
    except Exception as ex:
        print("Error during pickling object:", ex)

        
def load_object(pickle_file):
    try:
        with open(pickle_file, "rb") as f:
            return pickle.load(f)
    except Exception as ex:
        print("Error during unpickling object:", ex)


def prepare_dictionary(filter_dictionary, filter_words_file, filter_strings, dictionary_folder):
    '''
    Filter CMU Pronunciation dictionary words and pronunciations.
    Use a second dictionary of common words.
    >>>index=10000
    >>>print(all_words[index], all_consonants[index])
    executive ['G+Z', 'K+Y', 'T', 'V'] 
    '''
    cmu_entries = nltk.corpus.cmudict.entries()
    cmu_words = []
    cmu_consonants = []
    cmu_pronunciations = []
    cmu_pronunciations_stress = []
    cmu_stresses = []
    for cmu_word, cmu_pronunciation_stress in cmu_entries:
        #print(cmu_word)
        cmu_words.append(cmu_word.strip())
        cmu_pronunciation_stress = combine_consonants(cmu_pronunciation_stress)
        cmu_pronunciation_no_stress = [re.sub(r'\d+', '', x) for x in cmu_pronunciation_stress]
        cmu_pronunciations.append(cmu_pronunciation_no_stress)
        cmu_consonants.append([x for x in cmu_pronunciation_no_stress if x in consonant_list])
        cmu_stresses.append([re.sub(r'[A-Za-z\+]', '', x) for x in cmu_pronunciation_stress])         
    #for i in range(len(cmu_words)): 
    #    print(cmu_words[i], cmu_consonants[i], cmu_pronunciations[i], cmu_stresses[i])

    print('Filter the CMU dictionary...')

    # Load filter words
    fread_filter = open(filter_words_file, "r")
    filter_words = [x.strip() for x in fread_filter.readlines()]

    all_words, all_consonants, all_pronunciations, all_stresses, nonwords = filter_dictionary_words(cmu_words, 
        cmu_consonants, cmu_pronunciations, cmu_stresses, 
        filter_words, filter_strings, filter_dictionary, verbose=False)    

    save_object(all_words, dictionary_folder + 'words_{0}.pkl'.format(filter_dictionary))
    save_object(all_consonants, dictionary_folder + 'consonants_{0}.pkl'.format(filter_dictionary))
    save_object(all_pronunciations, dictionary_folder + 'pronunciations_{0}.pkl'.format(filter_dictionary))
    save_object(all_stresses, dictionary_folder + 'stresses_{0}.pkl'.format(filter_dictionary))

## Test different dictionaries

In [2]:
test_dictionaries = False
if test_dictionaries:

    import enchant
    enchant_dict = enchant.Dict("en_US")
    #pip install cmudict
    #nltk.download('cmudict')
    #pip install pyenchant

    # english-words-py (https://pypi.org/project/english-words/)
    # "Contains sets of English words from svnweb.freebsd.org/csrg/share/dict/. 
    # This is up to date with revision 61569 of their words list."
    from english_words import english_words_set

    # Most Common English Words (https://github.com/dolph/dictionary)
    # "enable1.txt (172,819), the more verbose version of the Official Scrabble Player's Dictionary 
    # (which is limited to words of 8 letters or less)"
    # "popular.txt (25,322) represents the common subset of words found in both enable1.txt and Wiktionary's 
    # word frequency lists, which are in turn compiled by statistically analyzing a sample of 29 million 
    # words used in English TV and movie scripts."
    enable1 = [line.rstrip() for line in open('data/dictionaries/enable1.txt')]
    popular = [line.rstrip() for line in open('data/dictionaries/popular.txt')]

    # NLTK words corpus:
    #nltk.download('words')
    from nltk.corpus import words
    nltk_wordset = set(words.words())

    # Wiktionary Word Frequency_lists (https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists#English)
    #https://gist.github.com/h3xx/1976236
    
    print('nltk_wordset:      {0}'.format(len(nltk_wordset)))
    print('enable1:           {0}'.format(len(enable1)))
    print('pyenchant:         {0}'.format('?')) #len(enchant_dict.values())))
    print('filtered CMU:      {0}'.format(len(all_words)))
    print('english-words-py:  {0}'.format(len(english_words_set)))
    print('popular:           {0}'.format(len(popular)))
    print()
    test_words = ["can't", 'geese', 'shelves', 'Thai', 'thai', 'e.', 'eure', 'bott', 'bitter', 'used']
    for test_word in test_words:
        print(test_word)
        print('        NLTK words corpus:  {0}'.format(test_word in nltk_wordset))
        print('        enable frequency:   {0}'.format(test_word in enable1))
        print('        pyenchant spelling: {0}'.format(enchant_dict.check(test_word)))
        print('        filtered CMU:       {0}'.format(test_word in all_words))
        print('        english-words-py:   {0}'.format(test_word in english_words_set))
        print('        popular frequency:  {0}'.format(test_word in popular))
        print()

    nltk_wordset:      235892
    enable1:           172823
    pyenchant:         ?
    filtered CMU:      59539
    english-words-py:  25487
    popular:           25322

    can't
            NLTK words corpus:  False
            enable frequency:   False
            pyenchant spelling: True
            filtered CMU:       True
            english-words-py:   True
            popular frequency:  False

    geese
            NLTK words corpus:  False
            enable frequency:   True
            pyenchant spelling: True
            filtered CMU:       True
            english-words-py:   True
            popular frequency:  True

    shelves
            NLTK words corpus:  False
            enable frequency:   True
            pyenchant spelling: True
            filtered CMU:       True
            english-words-py:   False
            popular frequency:  True

    Thai
            NLTK words corpus:  True
            enable frequency:   False
            pyenchant spelling: True
            filtered CMU:       False
            english-words-py:   True
            popular frequency:  False

    thai
            NLTK words corpus:  False
            enable frequency:   False
            pyenchant spelling: False
            filtered CMU:       True
            english-words-py:   False
            popular frequency:  False

    e.
            NLTK words corpus:  False
            enable frequency:   False
            pyenchant spelling: True
            filtered CMU:       False
            english-words-py:   False
            popular frequency:  False

    eure
            NLTK words corpus:  False
            enable frequency:   False
            pyenchant spelling: False
            filtered CMU:       True
            english-words-py:   False
            popular frequency:  False

    bott
            NLTK words corpus:  True
            enable frequency:   True
            pyenchant spelling: True
            filtered CMU:       True
            english-words-py:   False
            popular frequency:  False

    bitter
            NLTK words corpus:  True
            enable frequency:   True
            pyenchant spelling: True
            filtered CMU:       True
            english-words-py:   False
            popular frequency:  True

    used
            NLTK words corpus:  True
            enable frequency:   True
            pyenchant spelling: True
            filtered CMU:       True
            english-words-py:   False
            popular frequency:  True

## Code to convert text to phonemes (and stresses and number of syllables)

In [3]:
from g2p_en import G2p  # pip install g2p_en
word_to_phonemes = G2p()


# Code to count syllables
# https://datascience.stackexchange.com/questions/23376/how-to-get-the-number-of-syllables-in-a-word
VOWEL_RUNS = re.compile("[aeiouy]+", flags=re.I)
EXCEPTIONS = re.compile(
    # fixes trailing e issues:
    # smite, scared
    "[^aeiou]e[sd]?$|"
    # fixes adverbs:
    # nicely
    + "[^e]ely$",
    flags=re.I
)
ADDITIONAL = re.compile(
    # fixes incorrect subtractions from exceptions:
    # smile, scarred, raises, fated
    "[^aeioulr][lr]e[sd]?$|[csgz]es$|[td]ed$|"
    # fixes miscellaneous issues:
    # flying, piano, video, prism, fire, evaluate
    + ".y[aeiou]|ia(?!n$)|eo|ism$|[^aeiou]ire$|[^gq]ua",
    flags=re.I
)
def count_syllables(word):
    vowel_runs = len(VOWEL_RUNS.findall(word))
    exceptions = len(EXCEPTIONS.findall(word))
    additional = len(ADDITIONAL.findall(word))
    return max(1, vowel_runs - exceptions + additional)


def words_to_sounds(words):
    
    phonemes = []
    stresses = []
    syllables = 0
    for word in words:

        # Extract phonemes per word (choose the first version of the phoneme)
        #     :: multiple pronunciations: pronouncing.phones_for_word(word) 
        phonemes_and_stresses_for_word = word_to_phonemes(word)
        phonemes_and_stresses_for_word = combine_consonants(phonemes_and_stresses_for_word)
        phonemes_for_word = [re.sub(r'\d+', '', x) for x in phonemes_and_stresses_for_word]
        stresses_for_word = [re.sub(r"(?:[A-Z])",'', x) for x in phonemes_and_stresses_for_word]
        phonemes_for_word = [x for x in phonemes_for_word if x in phoneme_list or x in consonant_list]                  
        #print(word, phonemes_and_stresses_for_word, phonemes_for_word)

        phonemes += phonemes_for_word  
        stresses += stresses_for_word
        syllables += count_syllables(word)

    consonants = [x for x in phonemes if x in consonant_list] 

    return phonemes, consonants, stresses, syllables

## Code to convert phonemes to candidate words

In [4]:
def get_unique_numbers(numbers):
    unique = []
    for number in numbers:
        if number not in unique:
            unique.append(number)
    return unique


def phonemes_to_candidate_words(phonemes, all_words, all_pronunciations, all_consonants, start=0,
                                just_consonants=False): 
    '''
    Generate a list of words from a list of phonemes,
    by concatenating sequences of the phonemes 
    and searching in CMU's Pronunciation Dictionary.
    
    >>> phonemes_to_candidate_words(['HH', 'AY', 'D'], all_words, all_pronunciations, 0, False)
    '''
    words_from_phonemes = []
    
    # For each subsequence of phonemes
    for stop in range(start + 1, len(phonemes) + 1):
        
        # Remove stresses from the subsequence of phonemes
        phoneme_subset = [re.sub(r'\d+', '', p) for p in phonemes[start:stop]]

        
        
        
        
        
        print(phoneme_subset)
        
        # Find words with matching consonants:
        if just_consonants:
            consonant_subset = [x for x in phoneme_subset if x in consonant_list]
            if consonant_subset != []:
                try:
                    indices = [i for i,x in enumerate(all_consonants) if x == consonant_subset]
                    for index in indices:
                        
                        
                        
                        
                        
                        
                        print(consonant_subset, all_consonants[index], all_words[index])
                        words_from_phonemes.append([all_words[index], start, stop - 1])
                except: pass
                
        # Find words with fully matching pronunciations:
        else:
            try:
                indices = [i for i,x in enumerate(all_pronunciations) if x == phoneme_subset]
                for index in indices:
                    words_from_phonemes.append([all_words[index], start, stop - 1])
            except: pass
            
    unique_stops = get_unique_numbers([i2 for x,i1,i2 in words_from_phonemes])

    return words_from_phonemes, unique_stops


# Code to find all words that sound like each segment of each phoneme list
def phoneme_subsets_to_words(phonemes, all_words, all_pronunciations, all_consonants, 
                             just_consonants=False, ignore_words=None):

    phoneme_words_original = []    
    start = 0
    unique_stops = [-1]
    while start < len(phonemes):
        if len(unique_stops) == 0:
            unique_stops = [start + 1]
        for stop in unique_stops:
            start = stop + 1
            if start < len(phonemes):
                words_from_phonemes, unique_stops = phonemes_to_candidate_words(phonemes, 
                                                                                all_words, 
                                                                                all_pronunciations, 
                                                                                all_consonants, 
                                                                                start, 
                                                                                just_consonants)
                phoneme_words_original += words_from_phonemes

    if ignore_words:
        phoneme_words = [] 
        for phoneme_word in phoneme_words_original:
            if phoneme_word[0] not in ignore_words:
                phoneme_words.append(phoneme_word)
    else:
        phoneme_words = phoneme_words_original
    

    return phoneme_words

## Code to construct word sequences with matching phoneme stop and start indices

In [5]:
def copy_list(list_to_copy, ncopies):
    list_copies = []
    for i in range(ncopies):
        list_copies.extend(list_to_copy)
    return list_copies


def flatten_list(nested_list):
    '''
    Flatten so that there are no tuples or lists within the list.
    
    >>> nested_list = [('e1d1', ('e1d2'), ['e2d1']), 'e3d0', [], ['e5d1']]
    >>> flatten_list(nested_list)
    ['e1d1', 'e1d2', 'e2d1', 'e3d0', 'e5d1']
    '''
    result=[]
    if nested_list != []:
        for element in nested_list:
            if isinstance(element, list) or isinstance(element, tuple):
                result.extend(flatten_list(element))
            else:
                result.append(element)
    return result

            
def flatten_to_sublists_of_strings(nested_list):
    '''
    Flatten list to strings and sublists of strings.

    >>> nested_list = [[[], '0', ('1',11,12), ('2',21,22), ['3',31,32]], [['4',41,42]]]
    >>> flatten_to_sublists_of_strings(nested_list)
    [[], '0', ['1', 11, 12], ['2', 21, 22], ['3', 31, 32], ['4', 41, 42]]
    '''
    result=[]
    if nested_list == []:
        result.extend([[]])
    else:
        if not any([isinstance(x, list)  for x in nested_list]) and \
           not any([isinstance(x, tuple) for x in nested_list]):
            y=[]
            for x in nested_list:
                y.append(x)
            result.append(y)
        else:
            for element in nested_list:
                if isinstance(element, str):
                    result.extend(element)
                elif isinstance(element, list) or isinstance(element, tuple):
                    if element == []:
                        result.extend([[]])
                    else:
                        result.extend(flatten_to_sublists_of_strings(element))          
    return result

            
def find_words_with_start_index(word_start_stop_list, start_index):
    # store words that start at start_index
    start_words = []
    starts = []
    stops = []
    for word, start, stop in word_start_stop_list:
        if start == start_index and start != []:
            start_words.append(word)
            starts.append(start)
            stops.append(stop)
            
    return start_words, starts, stops


def organize_words_by_start(words_list):

    if not isinstance(words_list[0], list) and not isinstance(words_list[0], tuple):
        words_list = [words_list]
        
    # Get unique starts and stops, and max start and stop
    words2 = []
    starts2 = []
    stops2 = []
    for word, start, stop in words_list:
        words2.append(word)
        starts2.append(start)
        stops2.append(stop)
    unique_starts = get_unique_numbers(starts2)
    unique_stops = get_unique_numbers(stops2)
    max_start = max(get_unique_numbers(starts2))
    max_stop = max(get_unique_numbers(stops2))

    # Words organized by start index
    words_by_start = []
    stops = []
    for start_index in range(max_start + 1):
        start_words, istarts, istops = find_words_with_start_index(words_list, start_index)
        words_by_start.append(start_words)
        stops.append(istops)        

    return words_by_start, stops, unique_starts, unique_stops, max_start, max_stop


def concatenate_lists(list_of_lists1, list_of_lists2):
    result = []
    for item1, item2 in zip(list_of_lists1, list_of_lists2):
        if isinstance(item1, str) and isinstance(item2, list):
            for element in item2:
                result.append((item1, element))
        elif isinstance(item1, list) and isinstance(item2, list):
            result.append((item1 + item2))
        elif isinstance(item1, tuple) and isinstance(item2, list):
            result.append((list(item1) + item2))
    return result


def concatenate_words(prev_words, prev_stops, words_by_start, stops_by_start, unique_starts):
    '''
    Concatenate words where the stop index of one matches the start index of the next.
    '''
    # Initialize / format words
    new_words = []
    new_stops = []
    words1 = prev_words
    stops1 = prev_stops
 
    # For each word that starts at a given index
    for iword1, word1 in enumerate(words1):

        # Find words that start after that word stops
        word1_stop = stops1[iword1]
        word2_start = word1_stop + 1
        if word2_start in unique_starts:
            words2 = words_by_start[word2_start]
            stops2 = stops_by_start[word2_start]

            # Concatenate the first word with each of the second set of words
            if len(words2) > 0:
                word1_copies = copy_list([word1], len(words2))
                words2_list = [[x] for x in words2]
                new_words.append(concatenate_lists(word1_copies, words2_list))
                new_stops.append(stops2)

    new_words = flatten_to_sublists_of_strings(new_words)
    new_stops = flatten_list(new_stops)
        
    return new_words, new_stops


def remove_duplicates(infile, outfile):
    unique_lines = set(open(infile).readlines())
    out = open(outfile, 'w').writelines(unique_lines)


def words_stop_to_start(words_by_start, stops_by_start, unique_starts, max_stop, max_count, outfile=None):

    # Initialize write to text file or to list
    if outfile:
        write_file = True
        fwrite = open(outfile, "w")
        fwrite.write('')
        fwrite.close()
        fwrite = open(outfile, "a")
    else:
        write_file = False
    output_lines = []

    # Initialize loop
    prev_words = words_by_start[0]
    prev_stops = stops_by_start[0]
    count = 1
    run = True
    while(run):
        count += 1

        new_words, new_stops = concatenate_words(prev_words, prev_stops, 
                                                 words_by_start, stops_by_start, unique_starts)
        
        # Stop when all stops equal max_stop
        if all([x == max_stop for x in new_stops]) or count == max_count:
            run = False
        
        # Write to text file or to list
        for istop, stop in enumerate(new_stops):
            if stop == max_stop:
                new_line = ' '.join(new_words[istop])
                if write_file:
                    fwrite.write(new_line + '\n')
                else:
                    if new_line not in output_lines:            
                        output_lines.append(new_line)

        prev_words = new_words
        prev_stops = new_stops

    # After exiting the while loop, finalize
    if write_file:
        fwrite.close()
        remove_duplicates(outfile, outfile)
                                    
    return output_lines

## Homophone generator (same sounds, different words)

In [6]:
def generate_homophones(phonemes, max_count, ignore_words=None, filename_base=None, 
                        verbose=True, verbose2=False):
    
    if filename_base:
        filename = 'HOMOPHONES_' + filename_base + '.txt'

    if verbose:
        print('')
        print('Input line:  {0}'.format(line.strip()), end='\n')
        print('Phonemes:    {0}'.format(', '.join(phonemes)), end='\n')

    phoneme_words = phoneme_subsets_to_words(phonemes, all_words, all_pronunciations, all_consonants, 
                                             just_consonants=False, ignore_words=ignore_words)

    if verbose2:
        print('Phoneme words:  {0}'.format(', '.join([x[0] for x in phoneme_words])), end='\n')

    if phoneme_words:
        phoneme_words = flatten_to_sublists_of_strings(phoneme_words)
        words_by_start, stops, unique_starts, x, y, max_stop = organize_words_by_start(phoneme_words)
        
        if verbose2:
            words_by_start = flatten_to_sublists_of_strings(words_by_start)
            print('Phoneme-generated words sorted by start index:  {0}'.format(words_by_start), end='\n')

        homophones = words_stop_to_start(words_by_start, stops, unique_starts, max_stop, max_count, filename)

        if verbose and filename_base:
            print('Phoneme-generated homophones written to {0}'.format(filename), end='\n')
        elif verbose and homophones != []:
            print('Phoneme-generated homophones:', end='\n')
            for homophone in flatten_list(homophones):
               print('    {0}'.format(homophone), end='\n')
    else:
        homophones = None
    
    return homophones

## Constonant generator (same consonants and consonant neighbors, different vowels)

In [7]:
def generate_constonants(phonemes, max_count, ignore_words=None, filename_base=None, 
                         verbose=True, verbose2=False):
    
    if filename_base:
        filename = 'CONSTONANTS_' + filename_base + '.txt'

    if verbose:
        print('')
        print('Input line:  {0}'.format(line.strip()), end='\n')
        print('Consonants:  {0}'.format(', '.join(consonants)), end='\n')
    
    consonant_words = phoneme_subsets_to_words(phonemes, all_words, all_pronunciations, all_consonants, 
                                               just_consonants=True, ignore_words=ignore_words)

    if verbose2:
        print('Consonant words:  {0}'.format(', '.join([x[0] for x in consonant_words])), end='\n')

    if consonant_words:
        consonant_words = flatten_to_sublists_of_strings(consonant_words)
        words_by_start, stops, uniq_starts, x, y, max_stop = organize_words_by_start(consonant_words)

        if verbose2:
            words_by_start = flatten_to_sublists_of_strings(words_by_start)
            print('Consonant-generated words sorted by start index:  {0}'.format(words_by_start), end='\n')

        constonants = words_stop_to_start(words_by_start, stops, uniq_starts, max_stop, max_count, filename)

        if verbose and filename_base:
            print('Consonant-generated constonants written to {0}'.format(filename), end='\n')
        elif verbose and constonants != []:
            print('Consonant-generated constonants:', end='\n')
            for constonant in flatten_list(constonants):
               print('    {0}'.format(constonant), end='\n')
    else:
        constonants = None
    
    return constonants

## Prepare / load dictionary

In [8]:
do_prepare_dictionary = False

dictionary_folder = "data/dictionaries/"
filter_words_file = "data/dictionaries/filter_words.txt"
filter_strings = ['.',',']
filter_dictionary = 'english_words_py'  # 'pyenchant'

if filter_dictionary == 'pyenchant':
    import enchant
    enchant_dict = enchant.Dict("en_US")
elif filter_dictionary == 'english_words_py':
    from english_words import english_words_set

if do_prepare_dictionary:
    prepare_dictionary(filter_dictionary, filter_words_file, filter_strings, dictionary_folder)    
else:
    all_words = load_object(dictionary_folder + 'words_{0}.pkl'.format(filter_dictionary))
    all_consonants = load_object(dictionary_folder + 'consonants_{0}.pkl'.format(filter_dictionary))
    all_pronunciations = load_object(dictionary_folder + 'pronunciations_{0}.pkl'.format(filter_dictionary))
    all_stresses = load_object(dictionary_folder + 'stresses_{0}.pkl'.format(filter_dictionary))

## Run all code

In [10]:
# Choose morphing operations:
do_generate_homophones = 0#True
do_ignore_inputs1 = False

do_generate_constonants = True
do_ignore_inputs2 = False


#homophonify_anagram = True
#homophonify_loop = True
#alliterate = True
#swap_sounds = True
#beat = True
#translate = True
#rhyme = False

verbose = True
verbose2 = True #False

# Load text
fread = open("demo.txt", "r")
lines = fread.readlines()
for line in lines:        
    words = line.split()
    words_lower = [re.sub(r'[^\'A-Za-z]', '', x).lower() for x in words]
    nwords = len(''.join(words))
    filename_base = '_'.join(words)
    if do_ignore_inputs1:
        ignore_inputs1 = words_lower
    else:
        ignore_inputs1 = []
    if do_ignore_inputs2:
        ignore_inputs2 = words_lower
    else:
        ignore_inputs2 = []
        
    phonemes, consonants, stresses, syllables = words_to_sounds(words_lower)
    
    if do_generate_homophones:
        homophones = generate_homophones(phonemes, nwords, ignore_inputs1, filename_base, verbose, verbose2)

    if do_generate_constonants:
        constonants = generate_constonants(phonemes, nwords, ignore_inputs2, filename_base, verbose, verbose2)



Input line:  The more things change...
Consonants:  DH, M, R, TH, NG+Z, CH, N+JH
['DH']
['DH'] ['DH'] either
['DH'] ['DH'] either
['DH'] ['DH'] other
['DH'] ['DH'] the
['DH'] ['DH'] the
['DH'] ['DH'] the
['DH'] ['DH'] thee
['DH'] ['DH'] they
['DH'] ['DH'] thou
['DH'] ['DH'] though
['DH'] ['DH'] thy
['DH', 'AH']
['DH'] ['DH'] either
['DH'] ['DH'] either
['DH'] ['DH'] other
['DH'] ['DH'] the
['DH'] ['DH'] the
['DH'] ['DH'] the
['DH'] ['DH'] thee
['DH'] ['DH'] they
['DH'] ['DH'] thou
['DH'] ['DH'] though
['DH'] ['DH'] thy
['DH', 'AH', 'M']
['DH', 'M'] ['DH', 'M'] them
['DH', 'M'] ['DH', 'M'] them
['DH', 'AH', 'M', 'AO']
['DH', 'M'] ['DH', 'M'] them
['DH', 'M'] ['DH', 'M'] them
['DH', 'AH', 'M', 'AO', 'R']
['DH', 'AH', 'M', 'AO', 'R', 'TH']
['DH', 'AH', 'M', 'AO', 'R', 'TH', 'IH']
['DH', 'AH', 'M', 'AO', 'R', 'TH', 'IH', 'NG+Z']
['DH', 'AH', 'M', 'AO', 'R', 'TH', 'IH', 'NG+Z', 'CH']
['DH', 'AH', 'M', 'AO', 'R', 'TH', 'IH', 'NG+Z', 'CH', 'EY']
['DH', 'AH', 'M', 'AO', 'R', 'TH', 'IH', 'NG+Z

['TH', 'IH', 'NG+Z', 'CH', 'EY', 'N+JH']
['IH']
['IH', 'NG+Z']
['IH', 'NG+Z', 'CH']
['IH', 'NG+Z', 'CH', 'EY']
['IH', 'NG+Z', 'CH', 'EY', 'N+JH']
['NG+Z']
['NG+Z', 'CH']
['NG+Z', 'CH', 'EY']
['NG+Z', 'CH', 'EY', 'N+JH']
['EY']
['EY', 'N+JH']
['N+JH'] ['N+JH'] angie
['N+JH'] ['N+JH'] arrange
['N+JH'] ['N+JH'] injure
['N+JH'] ['N+JH'] injury
Consonant words:  either, either, other, the, the, the, thee, they, thou, though, thy, either, either, other, the, the, the, thee, they, thou, though, thy, them, them, them, them, aim, am, am, ami, ammo, amy, aroma, em, emery, emma, emory, him, i'm, irma, m, ma, mae, maier, mao, maria, marie, masts, maw, mawr, may, maya, mayer, mayo, mayor, me, meier, meow, meyer, mi, midday, mire, moe, moo, mow, moyer, mu, murray, my, myrrh, ohm, aim, am, am, ami, ammo, amy, aroma, em, emery, emma, emory, him, i'm, irma, m, ma, mae, maier, mao, maria, marie, masts, maw, mawr, may, maya, mayer, mayo, mayor, me, meier, meow, meyer, mi, midday, mire, moe, moo, mow, moy