# TV Script Generation
In this notebook, I train a recurrent neural network with long short-term memory (LSTM) cells to automatically generate TV scripts for [The Simpsons](https://en.wikipedia.org/wiki/The_Simpsons). This project was completed in January 2018 as part of Udacity’s deep learning nanodegree.

## The data

The data we use for this project is the list of Simpsons script lines provided on Kaggle. It can be downloaded [here](https://www.kaggle.com/wcukierski/the-simpsons-by-the-data). Let's have a look at our data first:

In [24]:
import csv

# Location of data file
data_dir = './data/simpsons/simpsons_script_lines.csv'

# Print out some lines of the script
display_lines = 30

with open(data_dir, 'rt', encoding="utf8") as csvfile:
    reader = csv.reader(csvfile)

    for row in reader:
        line = row[3]
        # A line starting with '(' is a scene description: print a blank line before it
        if line.startswith('('):
            print()
        print(line)
        display_lines -= 1
        if display_lines <= 0:
            break

raw_text
Miss Hoover: No, actually, it was a little of both. Sometimes when a disease is in all the magazines and all the news shows, it's only natural that you think you have it.
Lisa Simpson: (NEAR TEARS) Where's Mr. Bergstrom?
Miss Hoover: I don't know. Although I'd sure like to talk to him. He didn't touch my lesson plan. What did he teach you?
Lisa Simpson: That life is worth living.
Edna Krabappel-Flanders: The polls will be open from now until the end of recess. Now, (SOUR) just in case any of you have decided to put any thought into this, we'll have our final statements. Martin?
Martin Prince: (HOARSE WHISPER) I don't think there's anything left to say.
Edna Krabappel-Flanders: Bart?
Bart Simpson: Victory party under the slide!

(Apartment Building: Ext. apartment building - day)
Lisa Simpson: (CALLING) Mr. Bergstrom! Mr. Bergstrom!
Landlady: Hey, hey, he Moved out this morning. He must have a new job -- he took his Copernicus costume.
Lisa Simpson: Do you know where I could fi

## Preprocess the data
We preprocess the data by converting each word in the dataset into a distinct integer (this representation is what the embedding layer needs later, when the model learns its own word embeddings).

### Lookup Table
The `create_lookup_tables()` function creates:
- A dictionary that maps each word to an id, called `vocab_to_int`
- A dictionary that maps each id back to its word, called `int_to_vocab`

It returns them as the tuple `(vocab_to_int, int_to_vocab)`.

In [25]:
import numpy as np

def create_lookup_tables(text):
    """
    Create lookup tables for vocabulary
    :param text: The text of tv scripts split into words
    :return: A tuple of dicts (vocab_to_int, int_to_vocab)
    """
    vocab = set(text)
    vocab_to_int = {word: idx for idx, word in enumerate(vocab)}
    int_to_vocab = dict(enumerate(vocab))
    return vocab_to_int, int_to_vocab
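
As a quick sanity check, here is a minimal sketch of the round trip on a toy word list. The sentence is a made-up input rather than a line from the dataset, and the function is repeated so the snippet runs on its own. The ids themselves depend on the iteration order of the set, so only the round trip is meaningful, not the particular numbers.

```python
def create_lookup_tables(text):
    """Map each distinct word to an integer id and back."""
    vocab = set(text)
    vocab_to_int = {word: idx for idx, word in enumerate(vocab)}
    int_to_vocab = dict(enumerate(vocab))
    return vocab_to_int, int_to_vocab

# Hypothetical toy input, not taken from the dataset
words = "the polls will be open until the end of recess".split()
vocab_to_int, int_to_vocab = create_lookup_tables(words)

ids = [vocab_to_int[w] for w in words]        # encode
decoded = [int_to_vocab[i] for i in ids]      # decode

print(len(vocab_to_int))   # 9 distinct words ("the" appears twice)
print(decoded == words)    # True: decoding inverts encoding
```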

### Tokenize Punctuation
We split the script into an array of words using spaces as delimiters. However, punctuation such as periods and exclamation marks makes it hard for the neural network to see that "bye" and "bye!" are the same word.

The function `token_lookup` returns a dict that is used to turn symbols like "!" into tokens like "||Exclamation_Mark||". The symbol is the key and the token is the value, for each of the following symbols:
- Period ( . )
- Comma ( , )
- Quotation Mark ( " )
- Semicolon ( ; )
- Exclamation mark ( ! )
- Question mark ( ? )
- Left Parentheses ( ( )
- Right Parentheses ( ) )
- Dash ( -- )
- Return ( \n )

This dictionary is used to replace each symbol with its token and to add a delimiter (space) around it. Each symbol then becomes its own word, which makes it easier for the neural network to predict the next word. The tokens must not be confusable with real words: instead of the token "dash", use something like "||dash||".

In [26]:
def token_lookup():
    """
    Generate a dict to turn punctuation into a token.
    :return: Tokenize dictionary where the key is the punctuation and the value is the token
    """
    return {
        '.': "||Period||",
        ',': "||Comma||",
        '"': "||Quotation_Mark||",
        ';': "||Semicolon||",
        '!': "||Exclamation_Mark||",
        '?': "||Question_Mark||",
        '(': "||Left_Parentheses||",
        ')': "||Right_Parentheses||",
        '--': "||Dash||",
        '\n': "||Return||"
           }
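
To make the effect concrete, here is a small sketch that applies a subset of the dictionary to one made-up line. The helper name `tokenize` is my own for this illustration. It follows the same replace, lower-case, split order as the preprocessing cell further down.

```python
# Subset of the token dictionary above, enough for this example
token_dict = {
    '.': "||Period||",
    '!': "||Exclamation_Mark||",
    '--': "||Dash||",
}

def tokenize(line, token_dict):
    """Pad each punctuation token with spaces, lower-case, then split on whitespace."""
    for key, token in token_dict.items():
        line = line.replace(key, ' {} '.format(token))
    return line.lower().split()

# Hypothetical input line
tokens = tokenize("Bye! He moved out -- this morning.", token_dict)
print(tokens)
# ['bye', '||exclamation_mark||', 'he', 'moved', 'out', '||dash||', 'this', 'morning', '||period||']
```

Note that "bye" now appears as the same word whether or not it was followed by an exclamation mark.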

## Preprocess all the data and save it
Running the code cell below preprocesses the data and saves it to file. Since I'm on my laptop right now (which doesn't have a fast GPU), we only import the first 30,000 lines.

In [27]:
from tqdm import tqdm
import pickle
from IPython.display import clear_output

# Create array for storing all the data
text = []

current_line = 1
total_num_lines = 30000

token_dict = token_lookup()
    

with open(data_dir, 'rt', encoding="utf8") as file:
    reader = csv.reader(file)
    for row in reader:
        if current_line%100 == 0:
            print("Processed {} lines.".format(current_line), end="\r")
        current_line += 1
        
        line = row[3]
        # Replace punctuation with tokens
        for key, token in token_dict.items():
            line = line.replace(key, ' {} '.format(token))
        
        # Convert to lower case
        line = line.lower()
        line = line.split()
        
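        # Insert a blank line before scene descriptions, which start with '('.
        # The `line and` check avoids an IndexError on an empty script line.
        if line and line[0] == token_dict['('].lower():
            text.append([token_dict['\n'].lower()])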

        text.append(line + [token_dict['\n'].lower()])

        if current_line > total_num_lines:
            break

# Concatenate into single list of words and tokens
text = np.concatenate(text)

# Remove junk at the start of file
text = text[2:]

vocab_to_int, int_to_vocab = create_lookup_tables(text)
int_text = [vocab_to_int[word] for word in text]

# Save processed data for later
with open('preprocess.p', 'wb') as f:
    pickle.dump((int_text, vocab_to_int, int_to_vocab, token_dict), f)

Processed 30000 lines.

## Checkpoint

Now let's reload the data we just saved and make sure we can recover the text from the list of integers:

In [28]:
recovered_text = ""

int_text, vocab_to_int, int_to_vocab, token_dict = pickle.load(open('preprocess.p', 'rb'))

for i in range(200):
    word = int_to_vocab[int_text[i]]
    pre = " "
    # Undo tokenisation
    for key, token in token_dict.items():
        if word == token.lower():
            word = key
            pre = ""
    recovered_text = recovered_text + pre + word

# Remove unwanted spaces after punctuation
recovered_text = recovered_text.replace('\n ', '\n')
recovered_text = recovered_text.replace('( ', '(')

# Remove space at the beginning
recovered_text = recovered_text[1:]
            

print(recovered_text)

miss hoover: no, actually, it was a little of both. sometimes when a disease is in all the magazines and all the news shows, it's only natural that you think you have it.
lisa simpson:(near tears) where's mr. bergstrom?
miss hoover: i don't know. although i'd sure like to talk to him. he didn't touch my lesson plan. what did he teach you?
lisa simpson: that life is worth living.
edna krabappel-flanders: the polls will be open from now until the end of recess. now,(sour) just in case any of you have decided to put any thought into this, we'll have our final statements. martin?
martin prince:(hoarse whisper) i don't think there's anything left to say.
edna krabappel-flanders: bart?
bart simpson: victory party under the slide!

(apartment building: ext. apartment building - day)
lisa simpson:(calling) mr. bergstrom! mr. bergstrom!
landlady: hey, hey, he moved out this morning.


## Building the Neural Network

In [29]:
import tensorflow as tf

def get_inputs():
    """
    Create TF Placeholders for input, targets, and learning rate.
    :return: Tuple (input, targets, learning rate)
    """
    inputs = tf.placeholder(tf.int32, [None, None], name="input")
    targets = tf.placeholder(tf.int32, [None, None], name="targets")
    learning_rate = tf.placeholder(tf.float32)
    return inputs, targets, learning_rate

### Build RNN Cell and Initialize

In [30]:
def get_init_cell(batch_size, rnn_size):
    """
    Create an RNN Cell and initialize it.
    :param batch_size: Size of batches
    :param rnn_size: Size of RNNs
    :return: Tuple (cell, initialize state)
    """

    lstm_layers = 2 #TODO: make passable parameter
    
    # Stack up multiple LSTM layers, for deep learning
    cell = tf.contrib.rnn.MultiRNNCell([tf.contrib.rnn.BasicLSTMCell(rnn_size) for _ in range(lstm_layers)])
    
    # Getting an initial state of all zeros
    initial_state = tf.identity(cell.zero_state(batch_size, tf.float32), name = "initial_state")

    return cell, initial_state

### Word Embedding
Apply embedding to `input_data` using TensorFlow.  Return the embedded sequence.

In [31]:
def get_embed(input_data, vocab_size, embed_dim):
    """
    Create embedding for <input_data>.
    :param input_data: TF placeholder for text input.
    :param vocab_size: Number of words in vocabulary.
    :param embed_dim: Number of embedding dimensions
    :return: Embedded input.
    """
    # TODO: Experiment with different initialization
    embedding = tf.Variable(tf.random_uniform([vocab_size, embed_dim], -1, 1))
    embed = tf.nn.embedding_lookup(embedding, input_data)
    return embed
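
Conceptually, an embedding lookup is just row indexing into a `[vocab_size, embed_dim]` matrix. The sketch below shows this in plain Python rather than TensorFlow, with toy sizes that I picked for illustration. It is not how `tf.nn.embedding_lookup` is implemented, only what it computes.

```python
import random

random.seed(0)  # only to make the sketch repeatable

vocab_size, embed_dim = 5, 3  # toy sizes, not the notebook's

# One row of embed_dim values per word id, initialised uniformly in [-1, 1]
embedding = [[random.uniform(-1, 1) for _ in range(embed_dim)]
             for _ in range(vocab_size)]

# A batch of 2 sequences, each 4 word ids long
input_data = [[0, 3, 3, 1],
              [4, 2, 0, 0]]

# Lookup = replace every id by its row of the embedding matrix
embed = [[embedding[word_id] for word_id in sequence] for sequence in input_data]

print(len(embed), len(embed[0]), len(embed[0][0]))  # 2 4 3
print(embed[0][1] == embed[0][2])                   # True: same id, same vector
```

The output shape is `[batch_size, seq_length, embed_dim]`, which is the input shape the RNN expects.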

### Build RNN

In [32]:
def build_rnn(cell, inputs):
    """
    Create a RNN using a RNN Cell
    :param cell: RNN Cell
    :param inputs: Input text data
    :return: Tuple (Outputs, Final State)
    """
    Outputs, FinalState = tf.nn.dynamic_rnn(cell, inputs, dtype = tf.float32)
    return Outputs, tf.identity(FinalState, name = "final_state")

### Build the Neural Network
Use the functions implemented above to:
- Apply embedding to `input_data` using your `get_embed(input_data, vocab_size, embed_dim)` function.
- Build RNN using `cell` and your `build_rnn(cell, inputs)` function.
- Apply a fully connected layer with a linear activation and `vocab_size` as the number of outputs.

Return the logits and final state as the tuple `(Logits, FinalState)`.

In [33]:
def build_nn(cell, rnn_size, input_data, vocab_size, embed_dim):
    """
    Build part of the neural network
    :param cell: RNN cell
    :param rnn_size: Size of rnns
    :param input_data: Input data
    :param vocab_size: Vocabulary size
    :param embed_dim: Number of embedding dimensions
    :return: Tuple (Logits, FinalState)
    """
    embed = get_embed(input_data, vocab_size, embed_dim)
    rnn_outputs, final_state = build_rnn(cell, embed)
    logits = tf.contrib.layers.fully_connected(inputs = rnn_outputs, num_outputs = vocab_size, activation_fn=None)
    
    return logits, final_state
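
The fully connected layer has no activation because its outputs are logits: one unnormalised score per vocabulary word. They only become probabilities once a softmax is applied, which happens later in the graph (`tf.nn.softmax(logits, name='probs')`). A minimal plain-Python sketch of that step, with made-up logits for a four-word vocabulary:

```python
import math

def softmax(logits):
    """Turn a list of unnormalised scores into probabilities that sum to 1."""
    shift = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 4-word vocabulary at one time step
logits = [2.0, 1.0, 0.1, -1.0]
probs = softmax(logits)

print(abs(sum(probs) - 1.0) < 1e-9)           # True
print(probs.index(max(probs)))                # 0: the largest logit gets the largest probability
```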

### Batch the Data

Since we are trying to predict the next word, the targets are the inputs shifted by one position. The target for the very last input wraps around to the first word of the text.

In [34]:
def get_batches(int_text, batch_size, seq_length):
    """
    Return batches of input and target
    :param int_text: Text with the words replaced by their ids
    :param batch_size: The size of batch
    :param seq_length: The length of sequence
    :return: Batches as a Numpy array
    """
    characters_per_batch = batch_size * seq_length
    num_batches = len(int_text) // characters_per_batch
    int_text = int_text[:characters_per_batch * num_batches]
    
    # Create offset target array
    int_text_targets = np.zeros_like(int_text)
    int_text_targets[:-1], int_text_targets[-1] = int_text[1:], int_text[0]
    
    # Reshape into batch_size rows
    int_text = np.array(int_text).reshape([batch_size, -1])
    int_text_targets = np.array(int_text_targets).reshape([batch_size, -1])
    
    batches = []
    
    for i in range(0, int_text.shape[1], seq_length):
        x = np.array(int_text[:, i:i+seq_length])
        y = np.array(int_text_targets[:, i:i+seq_length])
        batches.append([x, y]) 
    return np.array(batches)
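
The layout is easier to see on a toy example. The sketch below is a pure-Python mirror of `get_batches()` using nested lists instead of NumPy arrays. `toy_batches` is a name I made up for this illustration, and the input is simply the ids 1 to 13.

```python
def toy_batches(ids, batch_size, seq_length):
    """Pure-Python mirror of the get_batches() layout, for illustration only."""
    words_per_batch = batch_size * seq_length
    num_batches = len(ids) // words_per_batch
    ids = ids[:words_per_batch * num_batches]     # drop the incomplete tail

    targets = ids[1:] + ids[:1]                   # shift by one, wrap the last target

    row_len = len(ids) // batch_size              # "reshape" into batch_size rows
    in_rows = [ids[r * row_len:(r + 1) * row_len] for r in range(batch_size)]
    tg_rows = [targets[r * row_len:(r + 1) * row_len] for r in range(batch_size)]

    batches = []
    for i in range(0, row_len, seq_length):
        x = [row[i:i + seq_length] for row in in_rows]
        y = [row[i:i + seq_length] for row in tg_rows]
        batches.append((x, y))
    return batches

batches = toy_batches(list(range(1, 14)), batch_size=2, seq_length=3)
for x, y in batches:
    print(x, y)
# [[1, 2, 3], [7, 8, 9]] [[2, 3, 4], [8, 9, 10]]
# [[4, 5, 6], [10, 11, 12]] [[5, 6, 7], [11, 12, 1]]
```

Row `r` of each batch continues where row `r` of the previous batch stopped, which is what makes it reasonable to carry the LSTM state from one batch to the next during training.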

## Hyperparameters

In [35]:
# Number of Epochs
num_epochs = 80
# Batch Size
batch_size = 100
# RNN Size
rnn_size = 256
# Embedding Dimension Size
embed_dim = 256
# Sequence Length
seq_length = 100
# Learning Rate
learning_rate = 0.01
# Show stats for every n number of batches
show_every_n_batches = 1

save_dir = './save'

### Build the Graph

In [36]:
from tensorflow.contrib import seq2seq

tf.reset_default_graph()

train_graph = tf.Graph()
with train_graph.as_default():
    vocab_size = len(int_to_vocab)
    input_text, targets, lr = get_inputs()
    input_data_shape = tf.shape(input_text)
    cell, initial_state = get_init_cell(input_data_shape[0], rnn_size)
    logits, final_state = build_nn(cell, rnn_size, input_text, vocab_size, embed_dim)

    # Probabilities for generating words
    probs = tf.nn.softmax(logits, name='probs')

    # Loss function
    cost = seq2seq.sequence_loss(
        logits,
        targets,
        tf.ones([input_data_shape[0], input_data_shape[1]]))

    # Optimizer
    optimizer = tf.train.AdamOptimizer(lr)

    # Gradient Clipping
    gradients = optimizer.compute_gradients(cost)
    capped_gradients = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gradients if grad is not None]
    train_op = optimizer.apply_gradients(capped_gradients)
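
The gradient clipping step clamps every gradient element independently into the range [-1, 1]. A minimal sketch of what that does to a single gradient, in plain Python with made-up values:

```python
def clip_by_value(values, low, high):
    """Clamp every element of a flat list into [low, high]."""
    return [min(max(v, low), high) for v in values]

# Hypothetical gradient values for one variable
grad = [0.3, -4.2, 1.7, -0.9, 12.0]
capped = clip_by_value(grad, -1.0, 1.0)

print(capped)  # [0.3, -1.0, 1.0, -0.9, 1.0]
```

Because each element is clamped on its own, the direction of the gradient can change. Clipping by norm (for example `tf.clip_by_global_norm`) rescales the whole gradient instead. I have not compared the two on this model.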

## Train
Train the neural network on the preprocessed data.  If you have a hard time getting a good loss, check the [forums](https://discussions.udacity.com/) to see if anyone is having the same problem.

In [18]:
batches = get_batches(int_text, batch_size, seq_length)

with tf.Session(graph=train_graph) as sess:
    sess.run(tf.global_variables_initializer())

    for epoch_i in range(num_epochs):
        state = sess.run(initial_state, {input_text: batches[0][0]})

        for batch_i, (x, y) in enumerate(batches):
            feed = {
                input_text: x,
                targets: y,
                initial_state: state,
                lr: learning_rate}
            train_loss, state, _ = sess.run([cost, final_state, train_op], feed)

            # Show every <show_every_n_batches> batches
            if (epoch_i * len(batches) + batch_i) % show_every_n_batches == 0:
                print('Epoch {}, Batch {}/{}, train_loss = {:.3f}'.format(
                    epoch_i,
                    batch_i,
                    len(batches),
                    train_loss), end="\r")
        print()

    # Save Model
    saver = tf.train.Saver()
    saver.save(sess, save_dir)
    print('Model Trained and Saved')

Epoch 0, Batch 48/49, train_loss = 6.489
Epoch 1, Batch 48/49, train_loss = 6.342
Epoch 2, Batch 48/49, train_loss = 6.214
Epoch 3, Batch 48/49, train_loss = 6.082
Epoch 4, Batch 48/49, train_loss = 5.949
Epoch 5, Batch 48/49, train_loss = 5.686
Epoch 6, Batch 48/49, train_loss = 5.508
Epoch 7, Batch 48/49, train_loss = 5.363
Epoch 8, Batch 48/49, train_loss = 5.221
Epoch 9, Batch 48/49, train_loss = 5.048
Epoch 10, Batch 48/49, train_loss = 4.855
Epoch 11, Batch 48/49, train_loss = 4.681
Epoch 12, Batch 48/49, train_loss = 4.510
Epoch 13, Batch 48/49, train_loss = 4.369
Epoch 14, Batch 48/49, train_loss = 4.284
Epoch 15, Batch 48/49, train_loss = 4.147
Epoch 16, Batch 48/49, train_loss = 4.041
Epoch 17, Batch 48/49, train_loss = 3.961
Epoch 18, Batch 48/49, train_loss = 3.884
Epoch 19, Batch 48/49, train_loss = 3.826
Epoch 20, Batch 48/49, train_loss = 3.788
Epoch 21, Batch 48/49, train_loss = 3.708
Epoch 22, Batch 48/49, train_loss = 3.643
Epoch 23, Batch 48/49, train_loss = 3.584
Ep
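
The `Batch 48/49` counter in the log can be tied back to the hyperparameters. Assuming this run used the `batch_size` and `seq_length` listed in the hyperparameters cell (the cell numbers suggest it came from an earlier session, so this is an assumption), the arithmetic in `get_batches()` bounds the number of tokens in the preprocessed text:

```python
batch_size = 100
seq_length = 100
batches_per_epoch = 49          # read off the "Batch 48/49" counter (batch index is 0-based)

# get_batches() computes num_batches = len(int_text) // (batch_size * seq_length)
words_per_batch = batch_size * seq_length
min_tokens = batches_per_epoch * words_per_batch
max_tokens = (batches_per_epoch + 1) * words_per_batch - 1

print(words_per_batch)          # 10000
print(min_tokens, max_tokens)   # 490000 499999
```

So under that assumption the 30,000 script lines produced somewhere between 490,000 and 499,999 words and punctuation tokens, and anything beyond the first 490,000 was dropped.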

## Save Parameters
Save `seq_length` and `save_dir` for generating a new TV script.

In [37]:
# Save parameters for checkpoint
pickle.dump((seq_length, save_dir), open('params.p', 'wb'))

# Checkpoint

In [38]:
import pickle
import tensorflow as tf
import numpy as np

_, vocab_to_int, int_to_vocab, token_dict = pickle.load(open("preprocess.p", mode='rb'))
seq_length, load_dir = pickle.load(open('params.p', mode='rb'))

### Get Tensors

In [39]:
def get_tensors(loaded_graph):
    """
    Get input, initial state, final state, and probabilities tensor from <loaded_graph>
    :param loaded_graph: TensorFlow graph loaded from file
    :return: Tuple (InputTensor, InitialStateTensor, FinalStateTensor, ProbsTensor)
    """
    InputTensor = loaded_graph.get_tensor_by_name(name = "input:0")
    InitialStateTensor = loaded_graph.get_tensor_by_name("initial_state:0")
    FinalStateTensor = loaded_graph.get_tensor_by_name("final_state:0")
    ProbsTensor = loaded_graph.get_tensor_by_name("probs:0")
    
    return InputTensor, InitialStateTensor, FinalStateTensor, ProbsTensor

### Choose Word
The `pick_word()` function samples a word according to the distribution specified in the `probabilities` vector.

In [40]:
import random

def pick_word(probabilities, int_to_vocab):
    """
    Pick the next word in the generated text
    :param probabilities: Probabilities of the next word
    :param int_to_vocab: Dictionary of word ids as the keys and words as the values
    :return: String of the predicted word
    """
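    probs_vec = np.squeeze(probabilities)
    # Inverse-CDF sampling: subtract probabilities from a uniform draw until it is used up
    r = random.random()
    word_idx = 0
    r -= probs_vec[0]
    # The bound guards against float rounding when the probabilities sum to slightly less than 1
    while r > 0 and word_idx < len(probs_vec) - 1:
        word_idx += 1
        r -= probs_vec[word_idx]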
    
    return int_to_vocab[word_idx]
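
The loop in `pick_word()` is a form of inverse-CDF sampling. The standalone sketch below shows the same idea written with a running cumulative sum, and checks empirically that the sampled frequencies approach the given probabilities. `sample_index` and the three-word distribution are made up for this illustration.

```python
import random

def sample_index(probs, rng=random):
    """Inverse-CDF sampling: walk the cumulative sum until it passes a uniform draw."""
    r = rng.random()
    cumulative = 0.0
    for idx, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return idx
    return len(probs) - 1        # guard against rounding when sum(probs) is slightly below 1

rng = random.Random(0)           # seeded so the sketch is repeatable
probs = [0.1, 0.6, 0.3]          # hypothetical next-word distribution

counts = [0, 0, 0]
for _ in range(10000):
    counts[sample_index(probs, rng)] += 1

print([c / 10000 for c in counts])   # roughly [0.1, 0.6, 0.3]
```

Sampling instead of always taking the most probable word keeps the generated script from falling into the same loop of words. The standard library's `random.choices(population, weights)` would be a shorter way to draw the same kind of sample.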

## Generate TV Script
This will generate the TV script for you.  Set `gen_length` to the length of TV script you want to generate.

In [43]:
gen_length = 1000

prime_word = 'miss hoover:'



loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # Load saved model
    loader = tf.train.import_meta_graph(load_dir + '.meta')
    loader.restore(sess, load_dir)
    print()

    # Get Tensors from loaded model
    input_text, initial_state, final_state, probs = get_tensors(loaded_graph)

    # Sentences generation setup
    gen_sentences = prime_word.split()
    prev_state = sess.run(initial_state, {input_text: np.array([[1]])})

    # Generate sentences
    for n in range(gen_length):
        # Dynamic Input
        dyn_input = [[vocab_to_int[word] for word in gen_sentences[-seq_length:]]]
        dyn_seq_length = len(dyn_input[0])

        # Get Prediction
        probabilities, prev_state = sess.run(
            [probs, final_state],
            {input_text: dyn_input, initial_state: prev_state})
        
        
        pred_word = pick_word(probabilities[0][dyn_seq_length-1], int_to_vocab)

        gen_sentences.append(pred_word)
    
    # Remove tokens
    tv_script = ' '.join(gen_sentences)
    for key, token in token_dict.items():
        tv_script = tv_script.replace(' ' + token.lower(), key)

        
    tv_script = tv_script.replace('\n ', '\n')
    tv_script = tv_script.replace('( ', '(')
        
    print(tv_script)

INFO:tensorflow:Restoring parameters from ./save

miss hoover: no! all four the magazines, use love more than you now. it's clearly even worth my cat now. but if you kept the end kirk and will take him to take a few dollars.
lionel hutz:(pointed) in being on this school?
lisa simpson:(gasps) they suffers out of the smart dream?
ralph wiggum: to the simpson three back?
lisa simpson: nothing.
groundskeeper willie: more man in yer film.(to everyone) what are you ralph?
seymour skinner: oh, you came, and it is... and to widow's children.
lionel hutz: hmm. i'll be honest. so,?
allison: how to switch-- you can finally have a rope," 'tis can still.
chief wiggum:(making a little hard) by health four. hold a...
bart simpson: daaad!

(gas station: int. desolate systems man:(depressed) hey thank you, baby. this is hot here for times one?
homer simpson: this bye?

(indian puppy, suddenly nervous) lousy little school simpson!
springfield:(heads) ooh, the spoon, doctor" live."(squeaky teenage voice)

# The TV Script is Nonsensical
Our script doesn't make an awful lot of sense, but the model has still learned some things! Character names are accurate, and each line of dialogue comes after the character's name followed by a colon. The model has also learned that a location should be specified at the beginning of a scene.

I expect to get better results if I train a deeper network for more time on more data, which I plan to do when I get home from my holidays (so that I'll have access to my desktop computer).