# Text Generation Using Multi-Layered RNN
- [Introduction](#intro)
- [Part 1: Data Preprocessing](#part1)
- [Part 2: RNN Building](#part2)
- [Part 3: RNN Training](#part3)
- [Part 4: Text Generation](#part4)

<a id='intro'></a>
## Introduction

In this project, I'll show how to build and train an RNN with TensorFlow to generate text. The RNN is trained on a dataset consisting of the scripts from the first 8 seasons of the TV show Friends.

The notebook consists of 4 parts. In Part 1, I'll preprocess the data. In Part 2, I'll build the network, which is then trained in Part 3 and used to generate some text in Part 4.

In [1]:
import numpy as np
import os
import pickle
import warnings
# Note: tf.contrib only exists in TensorFlow 1.x, so this notebook requires a 1.x release
import tensorflow as tf
from tensorflow.contrib import seq2seq

<a id='part1'></a>
## Part 1: Data Preprocessing

The data preprocessing consists of replacing punctuation with tokens, converting the text to lowercase, splitting it into words, and mapping each word to an integer code.

In [44]:
# Load and read text data
input_file = 'friends_script_season1-8.txt'
with open(input_file, "r") as f:
    text = f.read()

# Define punctuation for tokenization
punctuation_dictionary = {'.': '||Period||', 
                            ',': '||Comma||',
                            '"': '||Quotation_mark||',
                            ';': '||Semicolon||',
                            '!': '||Exclamation_mark||',
                            '?': '||Question_mark||',
                            '(': '||Left_parenthesis||',
                            ')': '||Right_parenthesis||',
                            '--': '||Dash||',
                            '\n': '||Return||'}

# Tokenize punctuation
for key, token in punctuation_dictionary.items():
    text = text.replace(key, ' {} '.format(token))

# Convert text to lower case
text = text.lower()

# Split text into words
text = text.split()

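# Assign codes to words (int_to_vocab is built as the exact inverse of vocab_to_int,
# so the two mappings are guaranteed to agree)
vocab_to_int = {word: num for num, word in enumerate(set(text))}
int_to_vocab = {num: word for word, num in vocab_to_int.items()}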

# Convert full text into codes
int_text = [vocab_to_int[word] for word in text]

# Save the preprocessed data for Part 4
with open('preprocess.p', 'wb') as f:
    pickle.dump((int_text, vocab_to_int, int_to_vocab, punctuation_dictionary), f)
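
As a quick sanity check of the steps above, here is a minimal sketch that runs the same tokenize, lowercase and encode pipeline on a made-up two-line string. The `tokenize` helper and all `toy_*` names are illustrative and not part of the notebook. The toy vocabulary is sorted only to make the integer codes reproducible.

```python
def tokenize(text, punctuation):
    """Replace punctuation by tokens, lowercase the text and split it into words."""
    for key, token in punctuation.items():
        text = text.replace(key, ' {} '.format(token))
    return text.lower().split()

# Hypothetical, reduced punctuation dictionary for the toy example
toy_punctuation = {'.': '||Period||',
                   '!': '||Exclamation_mark||',
                   ',': '||Comma||',
                   '\n': '||Return||'}

toy_text = "Joey: Hey!\nRoss: Hi, Joey."
toy_words = tokenize(toy_text, toy_punctuation)

# Build both mappings from the same vocabulary so that they are exact inverses
toy_vocab_to_int = {word: num for num, word in enumerate(sorted(set(toy_words)))}
toy_int_to_vocab = {num: word for word, num in toy_vocab_to_int.items()}
toy_int_text = [toy_vocab_to_int[word] for word in toy_words]

print(toy_words)
print(toy_int_text)
```

Note that `joey:` (the speaker label) and `joey` (the word) end up as two distinct vocabulary entries, because the colon is not in the punctuation dictionary.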

<a id='part2'></a>
## Part 2: RNN Building

We will build the RNN from 2 stacked layers of LSTM cells. The words are embedded before being fed to the network.
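
Embedding a word amounts to looking up one row of a trainable table, indexed by the word's integer code. The sketch below illustrates that lookup in plain Python on a tiny table. The `embedding_lookup` helper and the `toy_*` names are illustrative assumptions. It shows the idea only and does not reproduce TensorFlow's implementation.

```python
import random

random.seed(0)
toy_vocab_size, toy_embed_dim = 5, 3

# One row of random values between -1 and 1 per word id
toy_embedding = [[random.uniform(-1, 1) for _ in range(toy_embed_dim)]
                 for _ in range(toy_vocab_size)]

def embedding_lookup(embedding, batch_ids):
    """Replace every word id in a batch of sequences by its embedding row."""
    return [[embedding[word_id] for word_id in sequence] for sequence in batch_ids]

# Batch of 2 sequences with 3 word ids each
toy_batch = [[0, 3, 3], [4, 1, 2]]
toy_embedded = embedding_lookup(toy_embedding, toy_batch)

# Resulting shape: (batch size, sequence length, embedding dimension)
print(len(toy_embedded), len(toy_embedded[0]), len(toy_embedded[0][0]))
```

A batch of shape (batch size, sequence length) therefore becomes a tensor of shape (batch size, sequence length, embedding dimension) before it reaches the LSTM layers.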

In [47]:
# Set hyperparameters:

# Number of epochs
num_epochs = 400
# Batch size
batch_size = 512
# RNN size
rnn_size = 128
# Number of layers in RNN
lstm_layers = 2
# Keep probability during dropout
keep_prob = 0.6
# Embedding dimension size
embed_dim = 300
# Sequence length for input and target
seq_length = 50
# Learning rate
learning_rate = 0.01
# Show stats for every n number of epochs
show_every_n_epochs = 10

# Set saving directory
save_dir = './save'

# Save current parameters
with open('params.p', 'wb') as f:
    pickle.dump((seq_length, save_dir), f)

# Reset any current graph
tf.reset_default_graph()

# Define graph structure
train_graph = tf.Graph()
with train_graph.as_default():
    
    # Define placeholders for input, target and learning rate
    input_text = tf.placeholder(tf.int32, [None, None], name='input')
    targets = tf.placeholder(tf.int32, [None, None], name='targets')
    lr = tf.placeholder(tf.float32, name='learning_rate')
    
    input_data_shape = tf.shape(input_text)
    
    # Embed text for input to RNN
    vocab_size = len(int_to_vocab)
    embedding = tf.Variable(tf.random_uniform((vocab_size, embed_dim), -1, 1))
    embed = tf.nn.embedding_lookup(embedding, input_text)

    # Define RNN cell
    # (building the LSTM inside build_cell is necessary since TF 1.1)
    def build_cell(rnn_size, keep_prob):
        lstm = tf.contrib.rnn.BasicLSTMCell(rnn_size)

        # Add dropout to the cell
        drop = tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
        return drop

    # Stack up multiple LSTM layers
    cell = tf.contrib.rnn.MultiRNNCell([build_cell(rnn_size, keep_prob) for _ in range(lstm_layers)])

    # Set initial state to 0s
    initial_state = cell.zero_state(input_data_shape[0], tf.float32)

    # Attribute name to initial state
    initial_state = tf.identity(initial_state, name='initial_state')

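    # Run RNN
    # Note: initial_state is not passed to dynamic_rnn here, so the RNN starts every run
    # from a zero state and values fed to initial_state do not influence the outputs
    outputs, final_state = tf.nn.dynamic_rnn(cell, embed, dtype=tf.float32)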

    # Attribute name to final state
    final_state = tf.identity(final_state, name='final_state')

    # Define logits
    logits = tf.contrib.layers.fully_connected(outputs, vocab_size, activation_fn=None)

    # Pass logits through softmax
    probs = tf.nn.softmax(logits, name='probs')

    # Define loss function: seq2seq.sequence_loss computes the cross-entropy loss
    # between the logits and the target sequences, with every position weighted equally
    cost = seq2seq.sequence_loss(
        logits,
        targets,
        tf.ones([input_data_shape[0], input_data_shape[1]]))

    # Optimizer
    optimizer = tf.train.AdamOptimizer(lr)

    # Clip gradients to prevent explosion
    gradients = optimizer.compute_gradients(cost)
    capped_gradients = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gradients if grad is not None]
    train_op = optimizer.apply_gradients(capped_gradients)
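
To make the loss more concrete: with all-ones weights, and as far as I understand its default averaging settings, `seq2seq.sequence_loss` reduces to the mean softmax cross-entropy over every word position in the batch. The following plain-Python sketch computes that quantity. The helper names and toy values are illustrative assumptions, not TensorFlow code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mean_cross_entropy(batch_logits, batch_targets):
    """Average of -log p(target) over all positions of all sequences."""
    losses = []
    for seq_logits, seq_targets in zip(batch_logits, batch_targets):
        for logits, target in zip(seq_logits, seq_targets):
            losses.append(-math.log(softmax(logits)[target]))
    return sum(losses) / len(losses)

# Toy batch: 2 sequences of 3 positions, vocabulary of 4 words, all logits equal
toy_vocab_size = 4
uniform_logits = [[[0.0] * toy_vocab_size for _ in range(3)] for _ in range(2)]
toy_targets = [[0, 1, 2], [3, 0, 1]]

loss = mean_cross_entropy(uniform_logits, toy_targets)
print(loss, math.log(toy_vocab_size))
```

A network that predicts every word with equal probability has a loss of ln(vocabulary size), which is a useful reference point when reading the first training losses.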


<a id='part3'></a>
## Part 3: RNN Training

Here we will define our batches of input and target text and train the network.
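
The batching scheme used in this part can be illustrated on a toy sequence. The sketch below is a plain-Python re-implementation of the same layout as `get_batches`, written for illustration only. `toy_get_batches` is a hypothetical name, and it returns nested lists rather than a NumPy array. Each target sequence is the input sequence shifted by one word, and the very last target wraps around to the first input word.

```python
def toy_get_batches(int_text, batch_size, seq_length):
    """Return a list of (input, target) pairs, each of shape (batch_size, seq_length)."""
    words_per_batch = batch_size * seq_length
    n_batches = len(int_text) // words_per_batch

    # Drop trailing words that do not fill a complete batch
    x_flat = int_text[:n_batches * words_per_batch]
    # Targets are the inputs shifted by one word, wrapping around at the end
    y_flat = x_flat[1:] + x_flat[:1]

    # Split the text into batch_size contiguous rows
    row_length = n_batches * seq_length
    x_rows = [x_flat[r * row_length:(r + 1) * row_length] for r in range(batch_size)]
    y_rows = [y_flat[r * row_length:(r + 1) * row_length] for r in range(batch_size)]

    # Slide a window of seq_length words along the rows
    batches = []
    for start in range(0, row_length, seq_length):
        x = [row[start:start + seq_length] for row in x_rows]
        y = [row[start:start + seq_length] for row in y_rows]
        batches.append((x, y))
    return batches

toy_batches = toy_get_batches(list(range(20)), batch_size=2, seq_length=3)
for x, y in toy_batches:
    print(x, y)
```

Because each row is a contiguous stretch of the text, row `i` of one batch continues exactly where row `i` of the previous batch stopped.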

In [49]:
# Get batches of input and target text data
def get_batches(int_text, batch_size, seq_length):
    """
    Return batches of input and target. The returned batch variable is a np array of shape: 
    (total number of batches, 2 (i.e., input and target sequences), number of sequences per batch, 
    length of sequence). 
    
    :param int_text: text with the words replaced by their ids
    :param batch_size: number of sequences per batch
    :param seq_length: length of each sequence (same for input and target)
    :return: Numpy array of batches
    """
    int_text = np.array(int_text)
    words_per_batch = batch_size * seq_length
    n_batches = len(int_text) // words_per_batch
    # Drop trailing words that do not fill a complete batch
    int_text = int_text[:n_batches * words_per_batch]
    x_ = int_text[:]
    # Target y_ is the input x_ offset by 1; by convention, its last element
    # corresponds to the first element of x_
    y_ = np.append(int_text[1:], x_[0:1])
    x_ = x_.reshape((batch_size, -1))
    y_ = y_.reshape((batch_size, -1))
    counter = 0
    batch = np.zeros((n_batches, 2, batch_size, seq_length), dtype=int)
    for n in range(0, x_.shape[1], seq_length):
        x = np.array(x_[:, n:n+seq_length])
        y = np.array(y_[:, n:n+seq_length])
        batch[counter, 0, :, :] = x
        batch[counter, 1, :, :] = y
        counter += 1
    return batch

batches = get_batches(int_text, batch_size, seq_length)

# Initialize session with defined graph
with tf.Session(graph=train_graph) as sess:
    sess.run(tf.global_variables_initializer())

    for epoch_i in range(num_epochs):
        
        # Initialize state with first batch
        state = sess.run(initial_state, {input_text: batches[0][0]})

        # Train
        for batch_i, (x, y) in enumerate(batches):
            feed = {
                input_text: x,
                targets: y,
                initial_state: state,
                lr: learning_rate}
            train_loss, state, _ = sess.run([cost, final_state, train_op], feed)

        # Show training loss every <show_every_n_epochs> epochs
        if epoch_i % show_every_n_epochs == 0:
            print('Epoch {:>3}  train_loss = {:.3f}'.format(
                epoch_i,
                train_loss))

    # Save Model
    saver = tf.train.Saver()
    saver.save(sess, save_dir)
    print('Model Trained and Saved')

Epoch   0  train_loss = 6.071
Epoch  10  train_loss = 4.866
Epoch  20  train_loss = 4.492
Epoch  30  train_loss = 4.285
Epoch  40  train_loss = 4.059
Epoch  50  train_loss = 3.878
Epoch  60  train_loss = 3.785
Epoch  70  train_loss = 3.721
Epoch  80  train_loss = 3.668
Epoch  90  train_loss = 3.634
Epoch 100  train_loss = 3.607
Epoch 110  train_loss = 3.577
Epoch 120  train_loss = 3.564
Epoch 130  train_loss = 3.549
Epoch 140  train_loss = 3.538
Epoch 150  train_loss = 3.519
Epoch 160  train_loss = 3.517
Epoch 170  train_loss = 3.504
Epoch 180  train_loss = 3.490
Epoch 190  train_loss = 3.485
Epoch 200  train_loss = 3.472
Epoch 210  train_loss = 3.462
Epoch 220  train_loss = 3.475
Epoch 230  train_loss = 3.468
Epoch 240  train_loss = 3.456
Epoch 250  train_loss = 3.453
Epoch 260  train_loss = 3.449
Epoch 270  train_loss = 3.455
Epoch 280  train_loss = 3.443
Epoch 290  train_loss = 3.461
Epoch 300  train_loss = 3.448
Epoch 310  train_loss = 3.447
Epoch 320  train_loss = 3.445
Epoch 330 
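
One way to read these numbers is to convert them to perplexity, the exponential of the cross-entropy loss. This assumes the loss is measured in nats, which is the case for TensorFlow's softmax cross-entropy. The sketch below applies that conversion to a few of the losses logged above. The `perplexity` helper is an illustrative name, not part of the notebook.

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity of a model whose mean per-word cross-entropy (in nats) is given."""
    return math.exp(cross_entropy_loss)

# Losses copied from the training log above (epoch -> train_loss)
logged_losses = {0: 6.071, 100: 3.607, 320: 3.445}

for epoch, loss in logged_losses.items():
    print('Epoch {:>3}  perplexity ~ {:.1f}'.format(epoch, perplexity(loss)))
```

Roughly speaking, a perplexity of N means the model is as uncertain about the next word as if it were choosing uniformly among N words.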

<a id='part4'></a>
## Part 4: Text Generation

Finally, let's generate some text based on a prime input word:

In [50]:
# Load previously saved parameters
with open('preprocess.p', mode='rb') as f:
    _, vocab_to_int, int_to_vocab, punctuation_dictionary = pickle.load(f)
with open('params.p', mode='rb') as f:
    seq_length, load_dir = pickle.load(f)

# Define words count of text to be generated
gen_length = 1000
# Define input word to the RNN
prime_word = 'joey'

loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # Load saved model
    loader = tf.train.import_meta_graph(load_dir + '.meta')
    loader.restore(sess, load_dir)

    # Get tensors from loaded model
    input_text = loaded_graph.get_tensor_by_name(name='input:0')
    initial_state = loaded_graph.get_tensor_by_name(name='initial_state:0')
    final_state = loaded_graph.get_tensor_by_name(name='final_state:0')
    probs = loaded_graph.get_tensor_by_name(name='probs:0')

    # Sentences generation setup
    gen_sentences = [prime_word + ':']
    prev_state = sess.run(initial_state, {input_text: np.array([[1]])})

    # Generate sentences
    for n in range(gen_length):
        # Dynamic Input
        dyn_input = [[vocab_to_int[word] for word in gen_sentences[-seq_length:]]]
        dyn_seq_length = len(dyn_input[0])

        # Get prediction probabilities
        probabilities, prev_state = sess.run(
            [probs, final_state],
            {input_text: dyn_input, initial_state: prev_state})
                
        # Choose predicted word according to probabilities
        # (probs has shape [batch, sequence, vocab]: use the last step of the single sequence)
        word_idx = np.random.choice(range(len(int_to_vocab)), p=probabilities[0][dyn_seq_length-1])
        pred_word = int_to_vocab[word_idx]

        gen_sentences.append(pred_word)
    
    # Replace tokens by punctuation
    tv_script = ' '.join(gen_sentences)
    for key, token in punctuation_dictionary.items():
        tv_script = tv_script.replace(' ' + token.lower(), key)
    tv_script = tv_script.replace('\n ', '\n')
    tv_script = tv_script.replace('( ', '(')
        
    print(tv_script)

INFO:tensorflow:Restoring parameters from ./save
joey: hey you told me!
joey: oh yeah my mom!, who is when they’re happening doing that? all my brother.
phoebe: how come the call as become a moment, why was my name?
joey: because your? rules kinda been listening, and that beautiful is no heat, doesn’t up.
opening credits
[scene: central perk, phoebe, chandler, and chandler are sitting by joey to go and their coffee. ]
chandler: hey!(she tilts the a treat.)
all: no-no-no!! hey, all a beautiful word i have to find me. audition, yes. i mean i will just have a very fine reason about her and ann pretty.(they both start in of chandler's legs and other birds, monica still and both herself.) and it's gone.
joey: correct, it's really funny. what should you?(to they and central boy and only enter. ]
[scene: the theatre, monica and ross's joey, phoebe is there, rachel is with to use the newspapers. ]
all: damn you!
monica: yeah laid up fail
, joey returns in the waiting entrance to the players. ]
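
The generation loop above draws each word at random according to the predicted probabilities instead of always taking the most likely word. The sketch below contrasts the two strategies on a toy distribution in plain Python. The `sample_word` and `greedy_word` helpers and the toy values are illustrative assumptions, not part of the notebook.

```python
import random

def sample_word(probabilities, int_to_vocab, rng=random):
    """Draw one word at random, weighted by its predicted probability."""
    ids = list(int_to_vocab.keys())
    weights = [probabilities[word_id] for word_id in ids]
    return int_to_vocab[rng.choices(ids, weights=weights, k=1)[0]]

def greedy_word(probabilities, int_to_vocab):
    """Always return the single most probable word."""
    best_id = max(int_to_vocab, key=lambda word_id: probabilities[word_id])
    return int_to_vocab[best_id]

toy_int_to_vocab = {0: 'joey:', 1: 'hey', 2: '||exclamation_mark||'}
toy_probs = [0.1, 0.7, 0.2]

rng = random.Random(0)  # seeded generator so the toy run is repeatable
samples = [sample_word(toy_probs, toy_int_to_vocab, rng) for _ in range(1000)]

print(greedy_word(toy_probs, toy_int_to_vocab))
print({word: samples.count(word) for word in toy_int_to_vocab.values()})
```

Sampling keeps the generated script varied, at the price of occasionally picking unlikely words. Greedy selection is deterministic and tends to repeat itself.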