## HW 6, Part 2: Recurrent Neural Networks (RNN) -- Extra Credit (40 points)

When modeling sequential data such as sentences, documents, and videos, a standard approach is to use a recurrent neural network (RNN). At each timestep, an RNN takes one element (such as a word) as input, combines it with past information encoded as a vector (for example, everything in the sentence before this timestep), generates a new vector encoding both the current input and the past information, and passes that vector on to the next timestep.
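The recurrence described above can be sketched in a few lines of NumPy. This is a minimal vanilla RNN step, not the LSTM used later in this notebook; the weight names (`W_xh`, `W_hh`, `b_h`), the layer sizes, and the random initialization are assumptions made only for illustration.

```python
import numpy as np


def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN step: mix the current input with the previous state."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)


def run_rnn(inputs, hidden_size, seed=0):
    """Run the step over a sequence and return the state after each timestep."""
    rng = np.random.default_rng(seed)
    input_size = inputs.shape[1]
    # Small random weights, chosen only for illustration (nothing is trained here).
    W_xh = rng.normal(scale=0.1, size=(input_size, hidden_size))
    W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
    b_h = np.zeros(hidden_size)

    h = np.zeros(hidden_size)  # no past information before the first element
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)  # the new state goes to the next timestep
        states.append(h)
    return np.stack(states)


# Five timesteps of one-hot inputs (3 symbols), encoded into 4-dimensional states.
sequence = np.eye(3)[[0, 2, 1, 1, 0]]
states = run_rnn(sequence, hidden_size=4)
print(states.shape)  # (5, 4)
```

The first and last timesteps receive the same input symbol, yet their states differ, because the last state also carries information from the earlier inputs.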

For more details about the LSTM (a very popular RNN variant), please refer to http://colah.github.io/posts/2015-08-Understanding-LSTMs/. A very good video explaining RNNs is available at https://www.youtube.com/watch?v=WCUNPb-5EYI.

### Generating text with Long Short-Term Memory Networks

RNNs can be used to generate text. For more information, please read: https://karpathy.github.io/2015/05/21/rnn-effectiveness/.

The following is an example script to generate text from Nietzsche's writings.

Note: 
- At least 10 epochs are required before the generated text
starts sounding coherent, but more is better.

- It is recommended to run this script on GPU, as recurrent
networks are quite computationally intensive.

- If you try this script on new data, make sure your corpus
has at least ~100k characters. ~1M is better.

In [1]:
# Import necessary libraries
import io
import random
import sys

import numpy as np
# Import everything from tensorflow.keras instead of mixing it with standalone keras
from tensorflow.keras.callbacks import LambdaCallback
from tensorflow.keras.layers import Activation, Dense, LSTM
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.utils import get_file

In [2]:
#Get the data - available from amazon
path = get_file('nietzsche.txt', origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')
with io.open(path, encoding='utf-8') as f:
    text = f.read().lower() # make it all lowercase 
print('corpus length:', len(text))

chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

corpus length: 600893
total chars: 57


In [3]:
# Cut the text into semi-redundant sequences of maxlen characters
## The text is cut into a series of overlapping windows
## Each window is 40 characters long (maxlen)
## The window moves 5 characters forward at each step (step)

maxlen = 40
step = 5
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

# Turn these sentences into one-hot encoded vectors
## For each character position in a sentence, the index of that character is set to one; every other index stays zero

print('Vectorization...')
# The np.bool alias is deprecated since NumPy 1.20; the builtin bool is equivalent
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
print('Done!')

nb sequences: 120171
Vectorization...




Done!
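To make the windowing and one-hot encoding from the cell above concrete, here is the same procedure on a toy string. The `toy_` names and the tiny window length and step are made up for illustration; the notebook itself uses `maxlen = 40` and `step = 5`.

```python
import numpy as np

toy_text = "hello world"
toy_maxlen, toy_step = 4, 3  # toy values, much smaller than in the notebook

toy_chars = sorted(set(toy_text))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}

# Each window of toy_maxlen characters is paired with the character that follows it.
windows, targets = [], []
for i in range(0, len(toy_text) - toy_maxlen, toy_step):
    windows.append(toy_text[i:i + toy_maxlen])
    targets.append(toy_text[i + toy_maxlen])

# One-hot encode: x_toy[i, t, k] is True when window i has character k at position t.
x_toy = np.zeros((len(windows), toy_maxlen, len(toy_chars)), dtype=bool)
y_toy = np.zeros((len(windows), len(toy_chars)), dtype=bool)
for i, window in enumerate(windows):
    for t, char in enumerate(window):
        x_toy[i, t, toy_char_indices[char]] = True
    y_toy[i, toy_char_indices[targets[i]]] = True

print(windows)                     # ['hell', 'lo w', 'worl']
print(targets)                     # ['o', 'o', 'd']
print(x_toy.shape, y_toy.shape)    # (3, 4, 8) (3, 8)
```

Every position in `x_toy` has exactly one `True` entry, which is what "one-hot" means here: one slot per character in the vocabulary, and only the slot of the observed character is set.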


Now we have data to feed a model for text generation. Next, we build an LSTM model to fit the data. Using Keras, this takes only a few lines of code!

In [4]:
# build the model: a single LSTM
print('Build model...')
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))

optimizer = RMSprop(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
print('Done!')

Build model...
Done!
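As a sanity check on the size of this model, the number of trainable parameters can be worked out by hand. The sketch below assumes the standard LSTM layout with bias terms (four gates, each with input weights, recurrent weights, and a bias vector); the helper names are hypothetical, and calling `model.summary()` is the way to confirm the numbers on your own setup.

```python
def lstm_param_count(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias vector.
    return 4 * (units * (input_dim + units) + units)


def dense_param_count(input_dim, units):
    # One weight per input-output pair plus one bias per output.
    return input_dim * units + units


vocab_size = 57   # "total chars" printed for the Nietzsche corpus above
lstm_units = 128  # size of the LSTM layer in the cell above

lstm_params = lstm_param_count(vocab_size, lstm_units)
dense_params = dense_param_count(lstm_units, vocab_size)
print(lstm_params, dense_params, lstm_params + dense_params)  # 95232 7353 102585
```

Almost all of the parameters sit in the LSTM layer, which is one reason training is slow on a CPU.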


In [5]:
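def sample(preds, temperature=1.0):
    # Helper function to sample an index from a probability array.
    # Lower temperatures sharpen the distribution; higher ones flatten it.
    preds = np.asarray(preds).astype('float64')
    # Clip before taking the log so that zero probabilities stay (almost) zero.
    # Filling masked logs with 0 would give those entries a weight of exp(0) = 1.
    preds = np.log(np.clip(preds, 1e-12, None)) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)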


def on_epoch_end(epoch, logs):
    # clear_output()
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % (epoch+1))

    start_index = random.randint(0, len(text) - maxlen - 1)
    for diversity in [0.2, 0.4, 0.5, 1.0]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()
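The `diversity` values passed to `sample` act as a temperature: each probability is raised to the power `1 / temperature` and the result is renormalized. The sketch below shows the effect on a made-up distribution over three characters; `apply_temperature` is a hypothetical helper that only does the rescaling, without the random draw.

```python
import numpy as np


def apply_temperature(probs, temperature):
    """Rescale a probability vector: p_i ** (1 / T), renormalized to sum to 1."""
    logits = np.log(np.asarray(probs, dtype="float64")) / temperature
    weights = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return weights / weights.sum()


probs = [0.6, 0.3, 0.1]  # made-up next-character probabilities
for temperature in [0.2, 0.5, 1.0]:
    print(temperature, np.round(apply_temperature(probs, temperature), 3))
```

At a temperature of 1.0 the distribution is unchanged. Lower temperatures push more and more of the probability mass onto the most likely character, which is why low-diversity samples tend to be repetitive.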

## Training
-  Each epoch takes up to a minute or so on a CPU (an epoch took 30 seconds on my PC).
-  Recall that at least 10 epochs of training are needed before the results become intelligible.
-  So you will have to let this run for a while (if the ETA per epoch is more than 5 minutes on your machine, you can reduce the number of epochs).

In [6]:
from numpy import *

## Load pre-trained model
Since training this LSTM model on a CPU for more epochs is time-consuming, we provide a pre-trained model that was trained on a GPU for 100 epochs. Use the following code to check how coherent the model's output is.

Loading the model requires the h5py package; please install it before running the following code.
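A quick way to check whether h5py is available before loading the model is sketched below. `has_package` is a hypothetical helper, and the suggested `pip` command assumes you manage packages with pip.

```python
import importlib.util


def has_package(name):
    """Return True if `name` can be imported in the current environment."""
    return importlib.util.find_spec(name) is not None


if not has_package("h5py"):
    print("h5py is missing; install it with: pip install h5py")
```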

In [7]:
# load the pre-trained model (a single LSTM) from disk
print('Load pre-trained model...')
from keras.models import load_model
model = load_model('shakespear200.h5')


def lstm_generate(seed, model):
    orig_seed = seed
    for diversity in [0.2, 0.3, 0.5, 1.0]:
        print('----- diversity:', diversity)
        seed = orig_seed
        generated = ''
        generated += seed
        print('----- Generating with seed: "' + seed + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(seed):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            seed = seed[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()


seed = "from an anguish with which no other is t"
# seed = "thou art"
lstm_generate(seed, model)


Load pre-trained model...
----- diversity: 0.2
----- Generating with seed: "from an anguish with which no other is t"
from an anguish with which no other is the more and such a really the subjection of the subjection of the fact the sense of the fact the present the subjection and all the subjection of the sense of the state to the serve the conscious and such a particularly the subjection of the more and such a man is a man and the capacity, and in the subjection of the present the consciousness, and a subjection of the possible who have not all the p
----- diversity: 0.3
----- Generating with seed: "from an anguish with which no other is t"
from an anguish with which no other is the things of the so the consequ: at the preserves of the seecless, which has been a science, to be perceive the rest in the fact the fact, and we cannot be called the fact the self--and when the subjection that the constitute of the such the subjection of the reality, and the particularly the morality and th

### Exercise: use an LSTM to generate names
-  The following data set contains 8000 last names. You can download and process the name data set as follows:

```python
name_path = get_file('names.txt', origin='http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/names/other/names.txt')
with io.open(name_path, encoding='utf-8') as f:
    text = f.read()  # read the raw text; capitalization is kept
    
text = text.split()
text = ', '.join(text)
```

Using the last name data set, answer the following questions:

1. (30 points) Train an LSTM to generate the names. How long does it take to train? How coherent do the generated names sound?
2. (10 points) Train the LSTM again, but for every epoch, shuffle the order of the names before calling model.fit(). How long does it take to train? Does it improve the coherence?



In [8]:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Fri Nov  4 11:27:27 2022

@author: gavinkoma
"""
# Import necessary libraries
import io
import random
import sys

import numpy as np
# Import everything from tensorflow.keras instead of mixing it with standalone keras
from tensorflow.keras.callbacks import LambdaCallback
from tensorflow.keras.layers import Activation, Dense, LSTM
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.utils import get_file

name_path = get_file('names.txt', origin='http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/names/other/names.txt')
with io.open(name_path, encoding='utf-8') as f:
    text = f.read()  # read the raw text; capitalization is kept

text = text.split()
text = ', '.join(text)

#alright lets train an lstm to generate names
#how long does it take to train? 

chars = sorted(list(set(text)))
print('total chars: ' + str(len(chars)))
char_indices = dict((c,i) for i,c in enumerate(chars))
indices_char = dict((i,c) for i,c in enumerate(chars))

#cut the text into semi-redundant sequences of maxlen characters
#cut the text into a series of overlapping windows
#each window is maxlen (40) characters long
#the window moves step (5) characters forward each time

maxlen = 40
step = 5
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

# Turn these sentences into one-hot encoded vectors
## For each character position in a sentence, the index of that character is set to one; every other index stays zero

print('Vectorization...')
# The np.bool alias is deprecated since NumPy 1.20; the builtin bool is equivalent
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
print('Done!')

# build the model: a single LSTM
print('Build model...')
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))

optimizer = RMSprop(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
print('Done!')

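def sample(preds, temperature=1.0):
    # Helper function to sample an index from a probability array.
    # Lower temperatures sharpen the distribution; higher ones flatten it.
    preds = np.asarray(preds).astype('float64')
    # Clip before taking the log so that zero probabilities stay (almost) zero.
    # Filling masked logs with 0 would give those entries a weight of exp(0) = 1.
    preds = np.log(np.clip(preds, 1e-12, None)) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)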


def on_epoch_end(epoch, logs):
    # clear_output()
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % (epoch+1))

    start_index = random.randint(0, len(text) - maxlen - 1)
    for diversity in [0.2, 0.4, 0.5, 1.0]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

epoch_val = [*range(1,10)]

model.fit(x, y,
          batch_size=128,
          epochs=10,
          callbacks=[print_callback])




total chars: 58
nb sequences: 100350
Vectorization...




Done!
Build model...
Done!
Epoch 1/10
----- Generating text after Epoch: 1
----- diversity: 0.2
----- Generating with seed: "Montag, Montagna, Montagne, Montagnino, "
Montag, Montagna, Montagne, Montagnino, Montane, Montaro, Montar, Montar, Montaro, Montaro, Montalli, Montartt, Montart, Montaro, Montarto, Montalla, Montaro, Montarto, Montart, Montarto, Montaro, Montarico, Montaro, Montarte, Montan, Montan, Montarie, Montarte, Montarico, Montari, Montarte, Montarta, Montarton, Montarta, Montarie, Montardo, Montart, Montaria, Montaro, Montari, Montaro, Montaro, Montarte, Montart, Montari, Montarton, Mo
----- diversity: 0.4
----- Generating with seed: "Montag, Montagna, Montagne, Montagnino, "
Montag, Montagna, Montagne, Montagnino, Montalle, Montarciner, Montarina, Montans, Montarto, Montero, Montan, Montzer, Montana, Montarico, Montalla, Montan, Monthauss, Montaring, Montana, Montara, Montarie, Montara, Montarda, Montaria, Montant, Montzer, Montarci, Montane, Montaron, Monter, Montan, M

ton, Hambley, Hamblin, Hambly, Hambrick, Hamberger, Hambert, Hamberry, Hamberger, Hamberg, Hamberg, Hamberg, Hamberry, Hambergh, Hamberger, Hamberger, Hamberrick, Hamberger, Hambers, Hambert, Hambert, Hamber, Hambert, Hamberg, Hamberg, Hamberner, Hamberg, Hamberg, Hamberg, Hamberger, Hamberg, Hambert, Hambert, Hamber, Hamberr, Hamberg, Hamberg, Hamberg, Hamberger, Hambert, Hambert, Hamberg, Hamberg, Hamberger, Hamberg, Hamberger, Hamber
----- diversity: 1.0
----- Generating with seed: "ton, Hambley, Hamblin, Hambly, Hambrick,"
ton, Hambley, Hamblin, Hambly, Hambrick, Hamberg, Hamberry, Haldin, Haldin, Halding, Haldin, Haldh, Haldemun, Halder, Halenear, Halenke, Halent, Halen, Halen, Halen, Halened, Halent, Hales, Hales, Halese, Halesz, Hall, Hall, Halla, Halleabia, Hallenburw, Hallena, Hallener, Hallenill, Hallen, Hallenzell, Hallens, Hallesing, Hallis, Hallison, Hallmier, Hallmes, Hallpon, Hallone, Halmon, Halmo, Halmman, Halmo, Halmura, Halomson, Halmour, H
Epoch 5/10
----- Generatin

t, Voigts, Voiles, Voisin, Voisine, Voit, Voith, Voith, Voith, Voither, Voither, Voither, Vother, Votherson, Votherson, Votherson, Vothert, Vothhers, Vothmer, Vothming, Vothols, Vothole, Vothole, Vothols, Vothole, Vothole, Vothole, Vothole, Vothole, Vothold, Vothold, Vothold, Vothon, Vottier, Vottinger, Votting, Vottinger, Vottinge, Vottinger, Vottinge, Vottinger, Vottinski, Vottin, Vottinsco, Vottier, Vottinger, Vottinge, Votting, Vott
----- diversity: 0.4
----- Generating with seed: "t, Voigts, Voiles, Voisin, Voisine, Voit"
t, Voigts, Voiles, Voisin, Voisine, Voit, Voith, Voith, Voigh, Voigh, Voigh, Voighton, Voight, Voight, Voight, Voighton, Voighton, Voighton, Voighton, Voigson, Voits, Voit, Voitzel, Voig, Voines, Voinette, Voineston, Voinetti, Vointer, Vointer, Vointhan, Vointine, Vointine, Vointing, Vointing, Vointinger, Vointz, Vointzen, Vointz, Vointzeau, Vointz, Vointzen, Vointzen, Vointz, Vointze, Vointz, Vointzen, Vointzen, Vointze, Vointzer, Voin
----- diversity: 0.5
-----

<keras.callbacks.History at 0x7fe7ee455340>

Each epoch takes about 30 seconds to complete, so training for 10 epochs takes about five minutes in total. The names sound coherent, but they stay very close to the seed: the generated names mostly repeat the opening letters of the names in the seed (a seed of "Mont..." names yields more "Mont..." names), since neighboring names in the corpus share a prefix. Shuffling should take care of this.
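Question 2 is not implemented in the cell above. One possible way to set up the per-epoch shuffle is sketched below, using a tiny stand-in list instead of the real data. The helper names `shuffled_corpus` and `make_windows` are hypothetical, and the one-hot encoding and the `model.fit` call are only indicated by a comment, so this shows the data handling rather than a full solution.

```python
import random


def shuffled_corpus(names, rng):
    """Return the names joined in a new random order, in the same ', ' format."""
    order = list(names)
    rng.shuffle(order)
    return ', '.join(order)


def make_windows(corpus, maxlen, step):
    """Cut the corpus into (window, next character) training pairs."""
    windows, targets = [], []
    for i in range(0, len(corpus) - maxlen, step):
        windows.append(corpus[i:i + maxlen])
        targets.append(corpus[i + maxlen])
    return windows, targets


names = ["Montag", "Hambley", "Voigt", "Hall", "Montana"]  # stand-in for text.split()
rng = random.Random(0)  # fixed seed so the shuffles are reproducible

for epoch in range(3):
    corpus = shuffled_corpus(names, rng)
    windows, targets = make_windows(corpus, maxlen=10, step=5)
    # Here the windows would be one-hot encoded and passed to model.fit(..., epochs=1).
    print(epoch, corpus, len(windows))
```

Because the windows are rebuilt from a freshly shuffled corpus in every epoch, a window no longer always contains alphabetical neighbors. Whether this actually improves the generated names is what question 2 asks you to find out.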