<a href="https://colab.research.google.com/github/trista-paul/DS-Unit-4-Sprint-4-Deep-Learning/blob/master/Trista_RNN_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment

![Monkey at a typewriter](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3c/Chimpanzee_seated_at_typewriter.jpg/603px-Chimpanzee_seated_at_typewriter.jpg)

It is said that [infinite monkeys typing for an infinite amount of time](https://en.wikipedia.org/wiki/Infinite_monkey_theorem) will eventually type, among other things, the complete works of William Shakespeare. Let's see if we can get there a bit faster, with the power of Recurrent Neural Networks and LSTM.

This text file contains the complete works of Shakespeare: https://www.gutenberg.org/files/100/100-0.txt

Use it as training data for an RNN - you can keep it simple and train at the character level, which is the suggested initial approach.

Then, use that trained RNN to generate Shakespearean-ish text. Your goal - a function that takes, as an argument, the size of text (e.g. number of characters or lines) to generate, and returns generated text of that size.

Note - Shakespeare wrote an awful lot. It's OK, especially initially, to sample/use smaller data and parameters, so you can have a tighter feedback loop when you're trying to get things running. Then, once you've got a proof of concept - start pushing it more!

In [0]:
import urllib.request

url = 'https://www.gutenberg.org/files/100/100-0.txt'

# read only the first 10500 bytes (read() on the response counts bytes, not characters)
# reading just 10000 cut sonnet 11 short
# all sonnets (my final set): 100637
text = urllib.request.urlopen(url).read(10500)

In [0]:
text = str(text)  # str() on a bytes object returns its repr, "b'...'"
text  # very messy: \r, \n and multi-byte UTF-8 characters show up as literal escape sequences

"b'\\xef\\xbb\\xbf\\r\\nProject Gutenberg\\xe2\\x80\\x99s The Complete Works of William Shakespeare, by William\\r\\nShakespeare\\r\\n\\r\\nThis eBook is for the use of anyone anywhere in the United States and\\r\\nmost other parts of the world at no cost and with almost no restrictions\\r\\nwhatsoever.  You may copy it, give it away or re-use it under the terms\\r\\nof the Project Gutenberg License included with this eBook or online at\\r\\nwww.gutenberg.org.  If you are not located in the United States, you\\xe2\\x80\\x99ll\\r\\nhave to check the laws of the country where you are located before using\\r\\nthis ebook.\\r\\n\\r\\nSee at the end of this file: * CONTENT NOTE (added in 2017) *\\r\\n\\r\\n\\r\\nTitle: The Complete Works of William Shakespeare\\r\\n\\r\\nAuthor: William Shakespeare\\r\\n\\r\\nRelease Date: January 1994 [EBook #100]\\r\\nLast Updated: May 7, 2019\\r\\n\\r\\nLanguage: English\\r\\n\\r\\nCharacter set encoding: UTF-8\\r\\n\\r\\n*** START OF THIS PROJECT GUTENB

In [0]:
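# strip the b' prefix left over from the bytes repr
text = text.replace(r"b'", "")

# remove the literal \r sequences (the \n sequences are handled later, when splitting)
text = text.replace(r"\r", "")

# remove the escaped UTF-8 bytes of the curly single quotes (U+2019 and U+2018);
# this also drops apostrophes, so "beauty's" ends up as "beautys"
text = text.replace(r"\xe2\x80\x99", "")
text = text.replace(r"\xe2\x80\x98", "")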
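The escape sequences cleaned up above appear because `str()` on a bytes object returns its printable repr. As a hedged alternative, the sketch below decodes the bytes instead, which keeps apostrophes and yields real newlines. The `raw` sample and the `decode_gutenberg` helper are hypothetical stand-ins, not part of the notebook, and the later cells would need real `"\n"` splits rather than `r"\n"` to work with this output.

```python
# Hypothetical sample standing in for the first bytes of the downloaded file.
raw = b"\xef\xbb\xbf\r\nProject Gutenberg\xe2\x80\x99s The Complete Works\r\nof William Shakespeare\r\n"


def decode_gutenberg(raw_bytes):
    """Decode UTF-8 bytes, dropping the BOM and normalizing line endings."""
    # "utf-8-sig" strips the leading byte-order mark (\xef\xbb\xbf).
    # errors="ignore" drops a multi-byte character cut in half by a partial read.
    decoded = raw_bytes.decode("utf-8-sig", errors="ignore")
    return decoded.replace("\r\n", "\n")


clean = decode_gutenberg(raw)
lines = clean.split("\n")
print(lines)
```

With decoded text, the curly apostrophes survive as ordinary characters, so words such as "beauty’s" stay intact in the training data.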

In [0]:
# split on the literal \n sequences, skip the first 139 lines (header and directory),
# and strip excess whitespace from each remaining line
split_text = [line.strip() for line in text.split(r"\n")[139:]]

split_text

['1',
 '',
 'From fairest creatures we desire increase,',
 'That thereby beautys rose might never die,',
 'But as the riper should by time decease,',
 'His tender heir might bear his memory:',
 'But thou contracted to thine own bright eyes,',
 'Feedst thy lights flame with self-substantial fuel,',
 'Making a famine where abundance lies,',
 'Thy self thy foe, to thy sweet self too cruel:',
 'Thou that art now the worlds fresh ornament,',
 'And only herald to the gaudy spring,',
 'Within thine own bud buriest thy content,',
 'And, tender churl, makst waste in niggarding:',
 'Pity the world, or else this glutton be,',
 'To eat the worlds due, by the grave and thee.',
 '',
 '',
 '2',
 '',
 'When forty winters shall besiege thy brow,',
 'And dig deep trenches in thy beautys field,',
 'Thy youths proud livery so gazed on now,',
 'Will be a tattered weed of small worth held:',
 'Then being asked, where all thy beauty lies,',
 'Where all the treasure of thy lusty days;',
 'To say, within thine

In [0]:
split_text = split_text[:196]

In [0]:
# drop the empty strings, then join the remaining lines into one space-separated string
split_text = filter(None, split_text)
text = " ".join(split_text)
text

'1 From fairest creatures we desire increase, That thereby beautys rose might never die, But as the riper should by time decease, His tender heir might bear his memory: But thou contracted to thine own bright eyes, Feedst thy lights flame with self-substantial fuel, Making a famine where abundance lies, Thy self thy foe, to thy sweet self too cruel: Thou that art now the worlds fresh ornament, And only herald to the gaudy spring, Within thine own bud buriest thy content, And, tender churl, makst waste in niggarding: Pity the world, or else this glutton be, To eat the worlds due, by the grave and thee. 2 When forty winters shall besiege thy brow, And dig deep trenches in thy beautys field, Thy youths proud livery so gazed on now, Will be a tattered weed of small worth held: Then being asked, where all thy beauty lies, Where all the treasure of thy lusty days; To say, within thine own deep sunken eyes, Were an all-eating shame, and thriftless praise. How much more praise deservd thy beau

In [0]:
characters = list(set(text))

num_chars = len(characters)
text_length = len(text)

print("unique characters : ", num_chars)
print("txt_data_size : ", text_length)

unique characters :  64
txt_data_size :  6778


In [0]:
# build lookup tables between characters and integer indices
char_to_int = {c: i for i, c in enumerate(characters)}  # enumerate() yields (index, value) pairs
int_to_char = {i: c for i, c in enumerate(characters)}
print(char_to_int)
print("----------------------------------------------------")
print(int_to_char)
print("----------------------------------------------------")
# integer-encode the input data
integer_encoded = [char_to_int[ch] for ch in text]  # the whole text as a list of character indices
print(integer_encoded)
print("----------------------------------------------------")
print("data length : ", len(integer_encoded))

{'Y': 0, ')': 1, 'b': 2, 'p': 3, 'o': 4, ',': 5, 'S': 6, 'q': 7, 'n': 8, 'L': 9, 'P': 10, 'H': 11, 'G': 12, '4': 13, 'i': 14, 't': 15, 'v': 16, 'm': 17, '7': 18, '?': 19, 'l': 20, '-': 21, 'B': 22, 'A': 23, '6': 24, 'e': 25, 'w': 26, 'u': 27, 'N': 28, 'r': 29, 's': 30, 'z': 31, 'C': 32, ';': 33, 'f': 34, 'c': 35, 'M': 36, '.': 37, '9': 38, '(': 39, 'j': 40, 'D': 41, '8': 42, '1': 43, 'x': 44, 'g': 45, 'U': 46, 'a': 47, '0': 48, 'O': 49, 'h': 50, ' ': 51, 'y': 52, '5': 53, '3': 54, '2': 55, 'T': 56, 'k': 57, 'R': 58, ':': 59, 'I': 60, 'F': 61, 'W': 62, 'd': 63}
----------------------------------------------------
{0: 'Y', 1: ')', 2: 'b', 3: 'p', 4: 'o', 5: ',', 6: 'S', 7: 'q', 8: 'n', 9: 'L', 10: 'P', 11: 'H', 12: 'G', 13: '4', 14: 'i', 15: 't', 16: 'v', 17: 'm', 18: '7', 19: '?', 20: 'l', 21: '-', 22: 'B', 23: 'A', 24: '6', 25: 'e', 26: 'w', 27: 'u', 28: 'N', 29: 'r', 30: 's', 31: 'z', 32: 'C', 33: ';', 34: 'f', 35: 'c', 36: 'M', 37: '.', 38: '9', 39: '(', 40: 'j', 41: 'D', 42: '8', 43
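Because the vocabulary comes from a `set`, the character-to-index order can differ from one Python session to the next, so weights saved in one run may not match the mapping in another. A minimal sketch, assuming a short hypothetical `sample` string in place of `text`, that sorts the vocabulary for a stable order and shows the one-hot column vector the network consumes (the `one_hot` helper is illustrative only):

```python
import numpy as np

sample = "From fairest creatures"  # hypothetical stand-in for `text`

vocab = sorted(set(sample))  # sorting gives the same order on every run
char_to_int = {c: i for i, c in enumerate(vocab)}
int_to_char = {i: c for i, c in enumerate(vocab)}


def one_hot(index, size):
    """Return a (size, 1) column vector with a single 1 at `index`."""
    vec = np.zeros((size, 1))
    vec[index] = 1
    return vec


encoded = [char_to_int[c] for c in sample]           # characters -> indices
decoded = "".join(int_to_char[i] for i in encoded)   # indices -> characters
x0 = one_hot(encoded[0], len(vocab))                 # input vector for the first character
print(decoded == sample, x0.shape)
```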

In [0]:
import math

import numpy as np

# hyperparameters

iteration = 10
sequence_length = 40
batch_size = math.ceil(text_length / sequence_length)  # number of sequences in one pass over the text
hidden_size = 500  # size of hidden layer of neurons
learning_rate = 1e-1


# model parameters

W_xh = np.random.randn(hidden_size, num_chars)*0.01     # weight input -> hidden. 
W_hh = np.random.randn(hidden_size, hidden_size)*0.01   # weight hidden -> hidden
W_hy = np.random.randn(num_chars, hidden_size)*0.01     # weight hidden -> output

b_h = np.zeros((hidden_size, 1)) # hidden bias
b_y = np.zeros((num_chars, 1)) # output bias

h_prev = np.zeros((hidden_size,1)) # h_(t-1)
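The number of sequences per pass is a ceiling division of the text length by the sequence length. A small sketch of why `math.ceil` is the safer way to compute it: Python's `round()` rounds ties to the nearest even integer, so the `round(x + 0.5)` idiom can over-count by one when the division is exact. The `num_sequences` helper and the 6840 example are hypothetical, chosen only to show the tie case.

```python
import math


def num_sequences(text_length, sequence_length):
    """Number of chunks needed to cover the text, counting a final partial chunk."""
    return math.ceil(text_length / sequence_length)


print(num_sequences(6778, 40))   # 170, the notebook's text length
print(round(6840 / 40 + 0.5))    # 172: 171.5 is a tie and rounds to the even neighbour
print(num_sequences(6840, 40))   # 171, the intended ceiling
```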

In [0]:
def forwardprop(inputs, targets, h_prev):
        
    # The weights stay fixed while a sequence is processed; they are only updated after backprop.
    xs, hs, ys, ps = {}, {}, {}, {}  # per-time-step inputs, hidden states, logits, probabilities
    hs[-1] = np.copy(h_prev)  # store the previous hidden state under key -1
    loss = 0  # loss initialization

    for t in range(len(inputs)):  # t is the time step and doubles as the dictionary key

        xs[t] = np.zeros((num_chars, 1))  # one-hot encoding of the input character
        xs[t][inputs[t]] = 1
        hs[t] = np.tanh(np.dot(W_xh, xs[t]) + np.dot(W_hh, hs[t-1]) + b_h)  # hidden state
        ys[t] = np.dot(W_hy, hs[t]) + b_y  # unnormalized log probabilities for next chars
        ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t]))  # softmax: probabilities for next chars

        # Dividing by the sum alone would make the entries sum to 1; exp() also makes every entry positive.
        loss += -np.log(ps[t][targets[t], 0])  # cross-entropy: negative log-probability of the correct next char

#         y_class = np.zeros((num_chars, 1)) 
#         y_class[targets[t]] =1
#         loss += np.sum(y_class*(-np.log(ps[t]))) # softmax (cross-entropy loss)        

    return loss, ps, hs, xs
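The softmax in `forwardprop` exponentiates the raw scores directly, which can overflow if the scores grow large during training. A common variant, sketched here as an option rather than as the notebook's method, subtracts the maximum score first. The result is mathematically identical because the shift cancels in the ratio. The `stable_softmax` name is hypothetical.

```python
import numpy as np


def stable_softmax(y):
    """Softmax over a column vector, shifted by max(y) to avoid overflow in exp()."""
    shifted = y - np.max(y)  # largest entry becomes 0, so exp() never exceeds 1
    exp_y = np.exp(shifted)
    return exp_y / np.sum(exp_y)


small = np.array([[1.0], [2.0], [3.0]])
naive = np.exp(small) / np.sum(np.exp(small))     # the notebook's formulation
large = np.array([[1000.0], [1001.0], [1002.0]])  # the naive formulation would overflow here

p_small = stable_softmax(small)
p_large = stable_softmax(large)
print(p_small.ravel(), p_large.ravel())
```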

In [0]:
def backprop(ps, inputs, hs, xs, targets):

    dWxh, dWhh, dWhy = np.zeros_like(W_xh), np.zeros_like(W_hh), np.zeros_like(W_hy) # make all zero matrices.
    dbh, dby = np.zeros_like(b_h), np.zeros_like(b_y)
    dhnext = np.zeros_like(hs[0]) # (hidden_size,1) 

    # walk backward through time (backpropagation through time)
    for t in reversed(range(len(inputs))):
        dy = np.copy(ps[t])  # shape (num_chars, 1); "dy" means dloss/dy
        dy[targets[t]] -= 1  # backprop into y: the softmax + cross-entropy gradient is p minus the one-hot target
        dWhy += np.dot(dy, hs[t].T)
        dby += dy 
        dh = np.dot(W_hy.T, dy) + dhnext # backprop into h. 
        dhraw = (1 - hs[t] * hs[t]) * dh # backprop through tanh nonlinearity #tanh'(x) = 1-tanh^2(x)
        dbh += dhraw
        dWxh += np.dot(dhraw, xs[t].T)
        dWhh += np.dot(dhraw, hs[t-1].T)
        dhnext = np.dot(W_hh.T, dhraw)
    for dparam in [dWxh, dWhh, dWhy, dbh, dby]: 
        np.clip(dparam, -5, 5, out=dparam) # clip to mitigate exploding gradients.  
    
    return dWxh, dWhh, dWhy, dbh, dby
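The first step of `backprop` relies on the fact that, for softmax followed by cross-entropy, the gradient of the loss with respect to the scores is the probability vector with 1 subtracted at the target index. A minimal sketch that checks this numerically with central differences; the vector size, target index, and helper names are arbitrary choices for illustration.

```python
import numpy as np


def softmax(y):
    e = np.exp(y - np.max(y))
    return e / np.sum(e)


def cross_entropy(y, target):
    """Negative log-probability of `target` under softmax(y)."""
    return -np.log(softmax(y)[target, 0])


rng = np.random.default_rng(0)
y = rng.standard_normal((5, 1))  # arbitrary scores for a 5-character vocabulary
target = 2

# Analytic gradient, as in backprop: p - one_hot(target)
analytic = softmax(y)
analytic[target] -= 1

# Numerical gradient by central differences
eps = 1e-5
numeric = np.zeros_like(y)
for k in range(y.shape[0]):
    y_plus, y_minus = y.copy(), y.copy()
    y_plus[k] += eps
    y_minus[k] -= eps
    numeric[k] = (cross_entropy(y_plus, target) - cross_entropy(y_minus, target)) / (2 * eps)

max_error = float(np.max(np.abs(analytic - numeric)))
print(max_error)
```

The same finite-difference idea can be applied to individual weights to sanity-check the full backward pass, at the cost of many forward passes.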

In [0]:
iteration = 100
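The training loop slices the text into consecutive chunks of `sequence_length` characters, with each target string shifted one character ahead of its input, and pads the final target with a space when the text runs out. A small sketch of that slicing on a toy string; `make_chunks` is a hypothetical helper that mirrors the loop's logic, not code from the notebook.

```python
def make_chunks(text, sequence_length, pad_char=" "):
    """Split text into (inputs, targets) pairs; targets are inputs shifted by one character."""
    chunks = []
    for start in range(0, len(text), sequence_length):
        inputs = text[start:start + sequence_length]
        targets = text[start + 1:start + sequence_length + 1]
        if len(targets) < len(inputs):
            targets += pad_char  # the last chunk has no "next character", so pad with a space
        chunks.append((inputs, targets))
    return chunks


chunks = make_chunks("To be, or not to be", 8)
for inputs, targets in chunks:
    print(repr(inputs), "->", repr(targets))
```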

In [0]:
%%time

data_pointer = 0

# memory variables for Adagrad
mWxh, mWhh, mWhy = np.zeros_like(W_xh), np.zeros_like(W_hh), np.zeros_like(W_hy)
mbh, mby = np.zeros_like(b_h), np.zeros_like(b_y) 

for i in range(iteration):
    h_prev = np.zeros((hidden_size,1)) # reset RNN memory
    data_pointer = 0 # go from start of data
    
    for b in range(batch_size):
        
        inputs = [char_to_int[ch] for ch in text[data_pointer:data_pointer+sequence_length]]
        targets = [char_to_int[ch] for ch in text[data_pointer+1:data_pointer+sequence_length+1]] # t+1        
            
        if (data_pointer+sequence_length+1 >= len(text) and b == batch_size-1): # processing of the last part of the input data.
            targets.append(char_to_int[" "])   # When the data doesn't fit, add space(" ") to the back.


        # forward
        loss, ps, hs, xs = forwardprop(inputs, targets, h_prev)
    
        # backward
        dWxh, dWhh, dWhy, dbh, dby = backprop(ps, inputs, hs, xs, targets) 
        
        
        # perform parameter update with Adagrad
        for param, dparam, mem in zip([W_xh, W_hh, W_hy, b_h, b_y], 
                                    [dWxh, dWhh, dWhy, dbh, dby], 
                                    [mWxh, mWhh, mWhy, mbh, mby]):
            mem += dparam * dparam # elementwise
            param += -learning_rate * dparam / np.sqrt(mem + 1e-8) # adagrad update      
    
        data_pointer += sequence_length # move data pointer
        
    # Progress report: `loss` is the loss of the last sequence only.
    # With iteration = 100 this prints once, at i = 0.
    if i % 100 == 0:
        print('iter %d, loss: %f' % (i, loss))

iter 0, loss: 182.664375
CPU times: user 22min 45s, sys: 12min 14s, total: 34min 59s
Wall time: 17min 35s
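The parameter update in the loop is Adagrad: each parameter keeps a running sum of its squared gradients, and its step is divided by the square root of that sum, so frequently updated parameters take smaller steps over time. A minimal sketch on a toy quadratic, assuming the same learning rate and epsilon as the notebook; `adagrad_step` and the toy objective are illustrative only.

```python
import numpy as np


def adagrad_step(param, grad, mem, learning_rate=1e-1, eps=1e-8):
    """Update `param` and `mem` in place, mirroring the notebook's update rule."""
    mem += grad * grad                                   # accumulate squared gradients
    param += -learning_rate * grad / np.sqrt(mem + eps)  # per-element scaled step


w = np.array([5.0, -3.0])  # toy parameters
mem = np.zeros_like(w)     # Adagrad memory, one entry per parameter

for _ in range(500):
    grad = 2 * w           # gradient of f(w) = sum(w ** 2)
    adagrad_step(w, grad, mem)

print(w)
```

The in-place `+=` matters: it is what lets the notebook's `zip` loop modify `W_xh`, `W_hh` and the other arrays without reassigning them.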


In [0]:
def predict(test_char, length):
    x = np.zeros((num_chars, 1)) 
    x[char_to_int[test_char]] = 1
    ixes = []
    h = np.zeros((hidden_size,1))

    for t in range(length):
        h = np.tanh(np.dot(W_xh, x) + np.dot(W_hh, h) + b_h) 
        y = np.dot(W_hy, h) + b_y
        p = np.exp(y) / np.sum(np.exp(y)) 
        ix = np.random.choice(range(num_chars), p=p.ravel())  # ravel() flattens p to 1-D
        # "ix" is a single character index sampled from the softmax probabilities.
        x = np.zeros((num_chars, 1)) # init
        x[ix] = 1 
        ixes.append(ix) # list
    txt = test_char + ''.join(int_to_char[i] for i in ixes)
    print ('----\n %s \n----' % (txt, ))
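`predict` prints its sample, whereas the assignment asks for a function that returns text of a requested size. A hedged sketch of such a variant follows: it takes the weights as arguments instead of reading globals, returns exactly `length` characters including the seed, and uses the max-shifted softmax. The `generate` name, its signature, and the tiny untrained demo weights are all assumptions made for illustration.

```python
import numpy as np


def generate(seed_char, length, params, char_to_int, int_to_char, rng=None):
    """Return `length` characters sampled from the RNN, starting with `seed_char`."""
    if length <= 0:
        return ""
    if rng is None:
        rng = np.random.default_rng()
    W_xh, W_hh, W_hy, b_h, b_y = params
    vocab_size = W_xh.shape[1]
    h = np.zeros((W_hh.shape[0], 1))
    x = np.zeros((vocab_size, 1))
    x[char_to_int[seed_char]] = 1
    chars = [seed_char]
    for _ in range(length - 1):
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        y = W_hy @ h + b_y
        p = np.exp(y - np.max(y))
        p /= np.sum(p)
        ix = int(rng.choice(vocab_size, p=p.ravel()))  # sample one character index
        x = np.zeros((vocab_size, 1))
        x[ix] = 1
        chars.append(int_to_char[ix])
    return "".join(chars)


# Demo with untrained random weights on a hypothetical 4-character vocabulary.
vocab = sorted(set("abc "))
char_to_int = {c: i for i, c in enumerate(vocab)}
int_to_char = {i: c for i, c in enumerate(vocab)}
hidden = 8
rng = np.random.default_rng(0)
params = (
    rng.standard_normal((hidden, len(vocab))) * 0.01,  # W_xh
    rng.standard_normal((hidden, hidden)) * 0.01,      # W_hh
    rng.standard_normal((len(vocab), hidden)) * 0.01,  # W_hy
    np.zeros((hidden, 1)),                             # b_h
    np.zeros((len(vocab), 1)),                         # b_y
)
sample = generate("a", 20, params, char_to_int, int_to_char, rng)
print(repr(sample))
```

With the notebook's trained arrays, the call would pass `(W_xh, W_hh, W_hy, b_h, b_y)` as `params` along with the existing `char_to_int` and `int_to_char` tables.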

In [0]:
predict('T', 621)

----
 Tha ebo  fpt ,o oaeBsla ,  caod:oetmoblnostneldho er ldslau  hkye   a tq tt seh e  atbh dt tsefr mwtdaf oh,doaheswtiio hotlw  :rhdr,r hns.c   eh.mclhhddps te,ulnad ,n1ohyd aomhh ,mnoecmydlh doecp:npttsusahe nhrLnr,ic lsn moeoeocnhr ; oitahu e tWseWecrye bmr Atsde rom.mghI, oslo r s ftkee etofth e hptecymbpwAdrw  ehe  d o teltdnot oese tm rtwt dmmen ta ut,o hn h af  H l sro  dtb .rcsyy nhShgoi t,ed:  le ni hoem eo  hisi  re otplo    r tbhyioe dnher   ttc ia   e  i ,,no s,n Masd r m hn t l uh,io d ntbeuh fdsrira   eatsa  nru   nhhet t hheihuhnc o e:wlmdeprhe s d ,ot ioyL t  uduehhretnf w,ketytoca om h, Lttrme yngudt  
----


In [0]:
predict('D', 621)

----
 D tfmenraiitf y ur:e  oeoy nia uttlainreAye,tt ro  nu o,fvsio ad runfdhga  sW rosoriy orPt lo lt tseiu eeheoeatbatdn esr hseethvhrtntmdut  hl dtittmh t  ufaare ed cboum n lIbas Ahat,  sc : tree nhsAotatot,ton tty,yri ua hthr Tgrn  sais d utteloorom ynmdtnhtgt,t,t ui:loaero n b,risdoee,etaalhrusoWnse  t ahO,dtt  ohwnll  haaseb,oW mtts ot ofaythwr enr sd blecns oe monhuy o pshfeor f rd  t,wmhimyuhioTdvsnari hdrethhttcolt tedthlTef 2  aouhwi uafdoslmseyaoedrclesbt herl  cbcohSLg rneybta  a  .detehs hle nirshtsd n ro uo eu,oaaeTuhods tiatu hd ob,   .   uohutuouWrtd n em  i a,t thdoj rt  iumo ehdoeu h nueT Ir ddr tsI e  
----


In [0]:
predict('A', 100)

----
 Anbho tdus  nSe uldheacbrh  t d,l e ld  rto :rardydntic rtkr  e eor citr oer  n sortnr ,up ehdioywhoe 
----
