## HW 6, Part 2: Recurrent Neural Networks (RNN) -- Extra Credit (40 points)

When it comes to modeling sequential data such as sentences, documents, and videos, a standard approach is to use a recurrent neural network (RNN). At each timestep, an RNN takes an element (such as a word) as input, combines it with past information encoded as a vector (for example, everything in the sentence before this timestep), generates a new vector encoding both the current input and the past information, and then passes that vector on to the next timestep.
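
The per-timestep update described above can be sketched in a few lines of NumPy. This is only an illustration of a plain (non-LSTM) RNN cell; the sizes, the random weights, and the names `rnn_step`, `W_x`, `W_h`, and `b` are assumptions made for the example and are not part of the homework code.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One step of a vanilla RNN: mix the current input with the previous state."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3  # illustrative sizes (assumption)
W_x = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

# A toy sequence of three one-hot inputs.
sequence = np.eye(input_dim)[[0, 2, 1]]

h = np.zeros(hidden_dim)  # the state starts empty
for x_t in sequence:
    # The new state encodes both the current input and everything seen so far.
    h = rnn_step(x_t, h, W_x, W_h, b)

print(h.shape)  # (3,)
```

The same state vector `h` is fed back in at every step, which is what lets the network carry information from earlier elements of the sequence forward.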

For more details about LSTM (a very popular variant of RNN), please refer to http://colah.github.io/posts/2015-08-Understanding-LSTMs/. A very good video explaining RNNs is available at https://www.youtube.com/watch?v=WCUNPb-5EYI.

### Generating text with Long Short-Term Memory Networks

RNNs can be used to generate text. For more information, please read: https://karpathy.github.io/2015/05/21/rnn-effectiveness/.

The following is an example script to generate text from Nietzsche's writings.

Note:
- At least 20 epochs are required before the generated text starts sounding coherent, but more is better.
- It is recommended to run this script on a GPU, as recurrent networks are quite computationally intensive.
- If you try this script on new data, make sure your corpus has at least ~100k characters. ~1M is better.

In [1]:
#Import necessary libraries 
from __future__ import print_function
from keras.callbacks import LambdaCallback
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM
from tensorflow.keras.optimizers import RMSprop
from keras.utils import get_file
import numpy as np
import random
import sys
import io

In [2]:
#Get the data - available from amazon
path = get_file('nietzsche.txt', origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')
with io.open(path, encoding='utf-8') as f:
    text = f.read().lower() # make it all lowercase 
print('corpus length:', len(text))

chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

Downloading data from https://s3.amazonaws.com/text-datasets/nietzsche.txt
corpus length: 600893
total chars: 57


In [3]:
# Cut the text into semi-redundant sequences of maxlen characters
## Cut the text into a series of windows.
## Each window is 40 characters long.
## The window moves 5 characters forward at each step.

maxlen = 40
step = 5
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

# Turn these sentences into one-hot encoded vectors
## For each character in a sentence, the vector has a one at that character's index and zeros everywhere else

print('Vectorization...')
# Use the builtin bool: the np.bool alias is deprecated since NumPy 1.20
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
print('Done!')

nb sequences: 120171
Vectorization...
Done!
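
To make the windowing step concrete, here is a small sketch that applies the same loop to a toy string. The helper name `make_windows`, the toy text, and the small `maxlen` and `step` values are assumptions chosen only so that the result is easy to read by eye.

```python
def make_windows(text, maxlen, step):
    """Slide a window of maxlen characters over text, moving step characters each time."""
    windows, targets = [], []
    for i in range(0, len(text) - maxlen, step):
        windows.append(text[i:i + maxlen])  # the input window
        targets.append(text[i + maxlen])    # the character that follows it
    return windows, targets

toy_windows, toy_targets = make_windows("hello world", maxlen=4, step=3)
for window, target in zip(toy_windows, toy_targets):
    print(repr(window), '->', repr(target))
# 'hell' -> 'o'
# 'lo w' -> 'o'
# 'worl' -> 'd'
```

Each window becomes one row of `x` and the character after it becomes the matching row of `y`, so the model learns to predict the next character from the previous `maxlen` characters.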


Now we have data to feed a model for text generation. Next, we build an LSTM model to fit the data. Using Keras, this takes only a few lines of code!

In [4]:
# build the model: a single LSTM
print('Build model...')
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))

optimizer = RMSprop(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
print('Done!')

Build model...
Done!
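
As a rough size check, the number of trainable parameters in this model can be worked out by hand. The sketch below assumes the standard LSTM parameterization (four gates, each with input weights, recurrent weights, and one bias vector) and the 57-character vocabulary reported above; the helper names are made up for this example. If those assumptions hold, the total should agree with what `model.summary()` reports.

```python
def lstm_param_count(units, input_dim):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(inputs, outputs):
    # One weight per input-output pair plus one bias per output.
    return inputs * outputs + outputs

n_chars = 57  # 'total chars' printed earlier
n_lstm = lstm_param_count(128, n_chars)
n_dense = dense_param_count(128, n_chars)
print(n_lstm, n_dense, n_lstm + n_dense)  # 95232 7353 102585
```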


In [5]:
def sample(preds, temperature=1.0):
    # Helper function to sample an index from a probability array.
    preds = np.asarray(preds).astype('float64')
    # Clip before taking the log so that zero probabilities stay negligible
    # (filling masked logs with 0 would give them weight exp(0) = 1).
    preds = np.log(np.clip(preds, 1e-12, None)) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    # Draw one sample from the re-weighted distribution.
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)


def on_epoch_end(epoch, logs):
    # clear_output()
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % (epoch+1))

    start_index = random.randint(0, len(text) - maxlen - 1)
    for diversity in [0.2, 0.4, 0.5, 1.0]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()
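
The `diversity` value passed to `sample` acts as a temperature. The sketch below shows the re-weighting on a toy three-way distribution; `apply_temperature` and `toy_probs` are illustrative names and values, not part of the script above.

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Re-weight a probability vector: divide the log-probabilities by the temperature and renormalize."""
    logits = np.log(np.asarray(probs, dtype='float64')) / temperature
    scaled = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return scaled / scaled.sum()

toy_probs = [0.6, 0.3, 0.1]
for temperature in [0.2, 1.0, 2.0]:
    print(temperature, np.round(apply_temperature(toy_probs, temperature), 3))
```

A temperature below 1 sharpens the distribution toward the most likely character, which tends to produce repetitive but safe text. A temperature of 1 leaves the distribution unchanged, and values above 1 flatten it, giving more varied but more error-prone output.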

## Training
-  Each epoch takes up to a minute or so on a CPU (an epoch took 30 seconds on my PC).
-  Recall that at least 20 epochs of training are needed to get intelligible results.
-  So you will have to let this run for a while (if the ETA per epoch is longer than 5 minutes on your machine, you can reduce the number of epochs).

In [6]:
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

model.fit(x, y,
          batch_size=128,
          epochs=10,
          callbacks=[print_callback])

Epoch 1/10
----- Generating text after Epoch: 1
----- diversity: 0.2
----- Generating with seed: " looks and emaciated bodies are eloquent"
 looks and emaciated bodies are eloquent of the self-some and the self in the self the self the self-self the self the self the self the self
the self the self the self-so the self-some the self-some the cousting the self the self-so the coust the self the self and and the self-so the self the self the self-so the self-sende and the self the self-some the sould the self-so shere the self the self the self the self the self the self the 
----- diversity: 0.4
----- Generating with seed: " looks and emaciated bodies are eloquent"
 looks and emaciated bodies are eloquent and the read and the sentions and self the self the man perhoph of the prole and every will man will the self in the most be all a sentive of ling in the self-some the fore of the coure the manter of the its and self-and the sont the self-some the dost in the simally as in the very man

mythologies, and any longer to be and strong to with allotions.






=all there is there is a more of the sunder of the more in the collect to the pressity of it is for the extent and so the will to the work of the such presunt to a suscection of the reason of the its as at a such of the human extent of a thiloly and being to there is a soll of the ausolety we longer to the soul of the strange and soul and s
----- diversity: 1.0
----- Generating with seed: " the nations invented their
mythologies,"
 the nations invented their
mythologies, tavided a clay, dee, to during to unegosses
the most intley. become to obtles, is prescribation, depentally nomanaiss, we very rided
its in the meast the anviratives, too been wayk withhy good and noted bit feelings. the same. there is had belief more bedeed the
dorent", everything; with these
aves to to delial been more beotherngs,
=aphe
yourgeetly belood and curtual only, as the trage of notrom
Epoch 5/10
----- Generating text after Epoch: 5
----- 

hortness of human life leads to many error of the sense of the spirit and the sense of the soul-and the stronger of the same time of the soul--the soul-and to the soul-and the souls of the soul-and also a sort of the same time of the soul-dong the suspection of the soul-and the sentimes of the such a soul of the same time the continue to the spirit and be the soul-and also a sort of the sense the sentimes and also the soul--the possible
----- diversity: 0.4
----- Generating with seed: "hortness of human life leads to many err"
hortness of human life leads to many error of the said the future of the more personal and fechard as a serve and the spirit especially and sense of the consorable the personal and so master. the continue to the soul--as reverity of the been has to the sireling of the same time of the spirits, and the same time the sense of the personess and the sentimes of the spirit and sense of the more poecess of the possible recalled and the feelin
----- diversity: 0.5
-----

<keras.callbacks.History at 0x7fbc8473ba30>

In [7]:
from numpy import *

## Load pre-trained model
Since training this LSTM model on a CPU for more epochs is time consuming, we provide a pre-trained model that was trained on a GPU for 100 epochs. Use the following code to check how coherent its output is.

Loading the model requires the h5py package; please install it before running the following code.

In [8]:
# load the pre-trained model
print('Load pre-trained model...')
from keras.models import load_model
model = load_model('shakespear200.h5')


def lstm_generate(seed, model):
    orig_seed = seed
    for diversity in [0.2, 0.3, 0.5, 1.0]:
        print('----- diversity:', diversity)
        seed = orig_seed
        generated = ''
        generated += seed
        print('----- Generating with seed: "' + seed + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(seed):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            seed = seed[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()


seed = "from an anguish with which no other is t"
# seed = "thou art"
lstm_generate(seed, model)


Load pre-trained model...
----- diversity: 0.2
----- Generating with seed: "from an anguish with which no other is t"
from an anguish with which no other is the consciousness, the more the subjection of the state of the more expedient and the sense, and the states of the present the present of the sense of the present the sense of such a present in the serve the subjection of the subjection of the something the subjection of the sense of the state of the present is the consideration of the consideration of the subjection of man and superficial and some
----- diversity: 0.3
----- Generating with seed: "from an anguish with which no other is t"
from an anguish with which no other is the sense of the sense of the state to the subjection of the probably the present in the sort, and for the subjection of the finer the present to the consequ? the serve the reality of the fact the concesses of the state the contradictic the subjection of the standards of the world to the present the sake it i

### Exercise: use LSTM to generate baby names
-  The following data set contains 8000 last names. You can download and process the name data set as follows:

```python
name_path = get_file('names.txt', origin='http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/names/other/names.txt')
with io.open(name_path, encoding='utf-8') as f:
    text = f.read()  # read the whole file (case is left unchanged here)
    
text = text.split()
text = ', '.join(text)
```

Using the last name data set, answer the following questions:

1. (30 points) Train an LSTM to generate names. How long does it take to train? How coherent do the generated names sound?
2. (10 points) Can you train the LSTM so that, for every epoch, the order of the names is shuffled before calling model.fit()? How long does it take to train? Does it improve the coherence?

