# Final Project
# CS156

Anna Pauxberger

22 December 2018

# Welcome to Poetry Class

This Markov chain model generates poems from input texts by different songwriters and authors. Poems can be of a chosen length, inspired by a chosen text, and have different levels of 'creativity', which depend on the Markov order, i.e. how many preceding words/characters are taken into account when predicting the next word/character.

Note that for the character based model, characters are lowercased and special characters (punctuation, line breaks and tabs) are stripped in order to arrive at meaningful output.
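
The preprocessing step can be sketched in isolation. This is a minimal illustration rather than the project's own code: `clean_characters` is a hypothetical helper that works on a string instead of a file. Because line breaks are dropped while spaces are kept, the words on either side of a line break end up joined.

```python
from string import punctuation

# Characters dropped before training: punctuation, line breaks and tabs.
SPECIAL = set(punctuation) | {"\n", "\t"}

def clean_characters(raw):
    """Hypothetical helper: lowercase `raw` and drop all special characters."""
    return [c.lower() for c in raw if c not in SPECIAL]

# The comma, exclamation mark and line break disappear; the space survives.
print("".join(clean_characters("Hello, World!\nGoodbye")))
```

This joining effect is one reason why character based output can contain run-together strings that are not real words.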

In [None]:
import random
from string import punctuation

In [117]:
class Poem(object):
    '''
    Generates poems word by word or character by character with a Markov chain
    '''
    def __init__ (self, order, character, file):
        '''
        Creates an instance of a poem. Define order (how far to look back) and 
        whether the poem should generate output character or word based via 
        character = True or False.
        '''
        
        self.order = order
        self.text = None
        self.dict = {}
        self.character = character # create poem by character or by word
        self.file = open(file)
        self.orig_file = file
        self.current_poem = []
   

    def train(self):
        '''
        Trains a data set by creating a dictionary with all unique words/chars compositions of
        length = order as keys, and the following word/char as value. All occurrences are
        appended, which later ensures a probabilistic representation by randomly choosing
        a value.
        '''
        
        if not self.character:
            with self.file as f: # close the file once it has been read
                self.text = f.read().split()
        else: # keep lowercased characters; skip punctuation, line breaks and tabs
            with self.file as f:
                self.text = [c.lower() for c in self.get_next_character(f)
                             if not self.is_special(c)]
            
            
        self.text = self.text + self.text[:self.order] # append beginning to the end
        for i in range(len(self.text) - self.order):   # include the final wrap-around transition
            key = tuple(self.text[i : i + self.order]) # generate keys
            value = self.text[i + self.order]          # generate values
            if key in self.dict:                       # append values
                self.dict[key].append(value)
            else:
                self.dict[key] = [value]
        return

    
    def generate(self,N=100):
        '''
        Generates a poem of length N. Picks a random starting position, takes the
        `order` units from there as initial state (=key) and randomly chooses the
        next word/char from the values saved for that key in the dictionary. The
        probability of each choice equals its relative frequency, as all occurrences
        are stored in the dictionary in train().
        '''
        index = random.randint(0, len(self.text) - self.order) 
        result = self.text[index : index + self.order]       # list of words/chars
        for i in range(N):
            state = tuple(result[len(result) - self.order:]) # make tuple out of list => key
            next_word = random.choice(self.dict[state])      # choose value => prediction
            result.append(next_word)
            
        separator = "" if self.character else " "               # chars are joined directly, words by spaces
        self.current_poem = separator.join(result[self.order:]) # disregard the random initial state
        return self.current_poem
        
        
    def get_next_character(self,f):
        '''
        Helper function for train() for characters.
        Reads character by character.
        Source: https://rosettacode.org/wiki/Read_a_file_character_by_character/UTF8
        '''
        
        c = f.read(1)
        while c: # an empty string signals the end of the file
            yield c
            c = f.read(1)
  

    def is_special(self,c):
        '''
        Helper function for train() for characters. Checks if c is a special character.
        '''
        
        punctuation_set = set(punctuation)
        if c == '\n' or c == '\t':
            return True
        else:
            return (c in punctuation_set)

        
    def sample_words(self):
        '''
        After generating words from characters, this ensures to only include words
        that also appeared in the original text.
        '''
        
        with open(self.orig_file) as file:
            # a set gives fast membership checks and the file is closed after reading
            unique_words_in_original = {w.lower() for w in file.read().split()}
        words_in_text = self.current_poem.split(" ")
        
        result = []
        for w in words_in_text:
            if w in unique_words_in_original:
                result.append(w)
        self.current_poem = " ".join(result)
        return self.current_poem
        
    
    def add_structure(self, N=5):
        '''Breaks the poem into lines of N words each. Trailing words that do not fill a line are cut off.'''
        
        generated_text = self.current_poem.split(" ")
        generated_text = [' '.join(generated_text[N*i:N*i+N]) 
                          for i in range(0, int(len(generated_text) / N))]
        self.current_poem = "\n".join(generated_text)
        
        print(self.current_poem)
    
    def simple_gen(self):
        '''Generates simple poem with defaults:
            if character based: N=200, samples dict words and structures 5 words
            if word based: N=20, structure 5 words
        '''
        if self.character == True:
            self.train()
            self.generate(N=200)
            self.sample_words()
            self.add_structure(5)
        
        else:
            self.train()
            self.generate(N=20)
            self.add_structure(5)

# Example: Simple Poem

In [140]:
poem = Poem(order=5, character=True, file='poetry/leonard-cohen.txt')
poem.simple_gen()

from me was a new
will be i see you
ah the in to take
ones like to the night
a partner lover loved you


# Example: User Specified Poem
- Order: 5
- Length: 100
- Character based
- Style: Justin Bieber

In [141]:
poem = Poem(order=5, character=True, file='poetry/bieber.txt')
print('>> Poem Training')
poem.train()
print('')
print('>> Generation of Poem')
print(poem.generate(N=100))
print('')
print('>> Only keep words from original dictionary')
print(poem.sample_words())
print('')
print('>> Add structure to poem')
poem.add_structure(5)

>> Poem Training

>> Generation of Poem
de outta time you you you knowso press the was nobody to lose my favorite girl mhm uh huh i lovemadl

>> Only keep words from original dictionary
de outta time you you you the was nobody to lose my favorite girl uh huh i

>> Add structure to poem
de outta time you you
you the was nobody to
lose my favorite girl uh


# Analyzing word vs character based poems for different levels of order
The order of a Markov chain is the number of preceding words or characters considered when predicting the next word or character.

For **character based poems**, results of low order are very short, as words not occurring in the vocabulary of the original text get rejected. The words that do get accepted have little connection to each other and the poems don't make much sense, e.g. 'i do of goo day'. Poems of higher order immediately show more accepted words, since the model can 'look back' further and generate the next letter based on several preceding letters.


E.g. 'I love you all the more'. Ignoring spaces, if the model has order one and the current state is 'l', the predicted character will be 'o', 'l' or 't' with equal probability. With order two and a current state of 'al', the model will know to predict 'l' next and generate the word 'all'.
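
The example sentence can be checked with a small sketch. `build_transitions` is a hypothetical stand-alone helper, not part of the `Poem` class. It assumes that spaces are removed from the sentence, and it leaves out the wrap-around that `train()` applies.

```python
from collections import defaultdict

def build_transitions(tokens, order):
    """Map every run of `order` tokens to the list of tokens that follow it."""
    table = defaultdict(list)
    for i in range(len(tokens) - order):
        table[tuple(tokens[i:i + order])].append(tokens[i + order])
    return dict(table)

# Assumption: spaces are dropped, so the sentence is one stream of letters.
chars = list("i love you all the more".replace(" ", ""))

order_1 = build_transitions(chars, 1)
order_2 = build_transitions(chars, 2)

print(order_1[("l",)])      # all successors of 'l'
print(order_2[("a", "l")])  # all successors of 'al'
```

With order one the state 'l' has three different successors, while with order two the state 'al' has only one.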


For **word based poems**, even low orders already produce plausible poems. However, poems of high order overfit to the training data if the training data is not large enough: a very specific sequence of multiple words might occur only once in the entire text, so the model quotes from the original rather than generating new text.
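
One rough way to make this overfitting visible is to count how many states leave the model no choice at all. The sketch below is only an illustration under assumptions: `single_successor_share` is a hypothetical helper, the corpus is a made-up toy sentence, and the wrap-around from `train()` is left out.

```python
from collections import defaultdict

def single_successor_share(tokens, order):
    """Share of states (runs of `order` tokens) followed by only one distinct token."""
    successors = defaultdict(set)
    for i in range(len(tokens) - order):
        successors[tuple(tokens[i:i + order])].add(tokens[i + order])
    if not successors:  # text is too short for this order
        return 0.0
    forced = sum(1 for nxt in successors.values() if len(nxt) == 1)
    return forced / len(successors)

# Made-up toy corpus, for illustration only.
words = "love me love you".split()
for order in (1, 2):
    print(order, single_successor_share(words, order))
```

A share close to 1 means that almost every state has exactly one continuation, so the generator can do little more than replay the original text.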


The ideal order depends on the dataset and model. For example, if the corpus has very long words, the order for character based poems should be close to the typical word length, in order to generate more words that also appear in the original text. However, if the corpus is not very large, setting the order very high prevents the word based generator from coming up with novel creations, as there is little variability in the options for the next word.

In [132]:
print('Character Based Poems')
print('')

for i in range(1,6):
    print(f'>> Order {i}')
    poem = Poem(order=i, character=True, file='poetry/beatles.txt')
    poem.simple_gen()
    print('')

Character Based Poems

>> Order 1
it e f f the

>> Order 2
she the i all and
in i will say know
i do of goo day

>> Order 3
of a was she live
you can let buy me
on any you long he
fish i say to la
long we hu a love

>> Order 4
in the say help down
boy named somewhere waves in
sits been i feel always
a sun is hey max
to black the love you
know how long to go

>> Order 5
love all these me more
who puts i i could
be gonna change penny lane
is to dance your every
day to listen to just
won the dirty old me
big man i had broken



In [134]:
print('Word Based Poems')
print('')

for i in range(1,6):
    print(f'>> Order {i}')
    poem = Poem(order=i, character=False, file='poetry/beatles.txt')
    poem.simple_gen()
    print('')

Word Based Poems

>> Order 1
To get back in any
more? Will she done me
do what I want somebody
Help, you know why you

>> Order 2
back again. I have missed
things And it's worth it
just to hear you say
Hey you've got to be

>> Order 3
to the love you make,
Ah Love, love, love, love,
love. All you need is
love. (All together now). All

>> Order 4
I'm tired, servicible villain Set
you down father, rest you)
I look at you all
see the love there that's

>> Order 5
like to be under the
sea In an octopus' garden
with you. One two three
four Can I have a



# Comparing Markov Chains to Recurrent Neural Networks
We train the textgenrnn package on the largest file available to us, Shakespeare's *King Henry IV*, and compare the neural net's output to the one generated by the Markov model specified above.

The **recurrent neural net** is able to iterate and learn from its outputs, and shows tremendous improvements over 20 epochs. The sentences are easily readable and make (mostly) grammatical sense. It is able to generate structured text, including indents for character announcements, and it even models the exits and entrances of characters. (outcome attached below)

The **Markov chain** model also generates sentences that make abstract sense and are somewhat grammatically correct. However, it does not recognize and model the structure of the text.

A more rigorous distinction between the two models is attached below.

In [137]:
poem = Poem(order=2, character=False, file='shakespeare/1kinghenryiv.txt')
poem.train()
poem.generate(300)

"fool Art thou not ashamed? FALSTAFF Dost thou hear, Hal? never call a true woman, holland of eight shillings an ell. You owe money here besides, Sir John, methinks they are directed. If you will deny the sheriff, so; if not, honour comes unlooked for, and there's an end. [Exit FALSTAFF] PRINCE HENRY Peace, ye fat-guts! lie down; lay thine ear close to the sepulchre of Christ, Whose soldier now, under whose countenance we steal. PRINCE HENRY Five year! by'r lady, a long hour by Shrewsbury clock. If I travel but four foot by the chance of war; to prove that true Needs no more with vanity. I would it had been here. The quality and hair of our enterprise; 'Tis catching hither, even to our great enterprise, Than if the devil understands Welsh; And 'tis no matter; honour pricks me on. Yea, but I think thou hadst truly borne Betwixt our armies true intelligence. EARL OF DOUGLAS Know then, my name is Falstaff: if that man my friend Whose tongue shall ask me when thou sit'st alone? Why hast th