# TV Script Generation
This project generates [Simpsons](https://en.wikipedia.org/wiki/The_Simpsons) TV scripts using RNNs.  It uses part of the [Simpsons dataset](https://www.kaggle.com/wcukierski/the-simpsons-by-the-data), which contains scripts from 27 seasons.  The neural network will generate a new TV script for a scene at [Moe's Tavern](https://simpsonswiki.com/wiki/Moe's_Tavern).
## Get the Data

In [1]:
from preprocess import preprocess

data_dir = './data/simpsons/moes_tavern_lines.txt'
text = preprocess.load_data(data_dir)
# Ignore notice, since we don't use it for analysing the data
text = text[81:]

## Explore the Data


In [2]:
view_sentence_range = (0, 10)

import numpy as np

print('Dataset Stats')
print('Roughly the number of unique words: {}'.format(len({word: None for word in text.split()})))
scenes = text.split('\n\n')
print('Number of scenes: {}'.format(len(scenes)))
sentence_count_scene = [scene.count('\n') for scene in scenes]
print('Average number of sentences in each scene: {}'.format(np.average(sentence_count_scene)))

sentences = [sentence for scene in scenes for sentence in scene.split('\n')]
print('Number of lines: {}'.format(len(sentences)))
word_count_sentence = [len(sentence.split()) for sentence in sentences]
print('Average number of words in each line: {}'.format(np.average(word_count_sentence)))

print()
print('The sentences {} to {}:'.format(*view_sentence_range))
print('\n'.join(text.split('\n')[view_sentence_range[0]:view_sentence_range[1]]))

Dataset Stats
Roughly the number of unique words: 11492
Number of scenes: 262
Average number of sentences in each scene: 15.2519083969
Number of lines: 4258
Average number of words in each line: 11.5016439643
()
The sentences 0 to 10:

Moe_Szyslak: (INTO PHONE) Moe's Tavern. Where the elite meet to drink.
Bart_Simpson: Eh, yeah, hello, is Mike there? Last name, Rotch.
Moe_Szyslak: (INTO PHONE) Hold on, I'll check. (TO BARFLIES) Mike Rotch. Mike Rotch. Hey, has anybody seen Mike Rotch, lately?
Moe_Szyslak: (INTO PHONE) Listen you little puke. One of these days I'm gonna catch you, and I'm gonna carve my name on your back with an ice pick.
Moe_Szyslak: What's the matter Homer? You're not your normal effervescent self.
Homer_Simpson: I got my problems, Moe. Give me another one.
Moe_Szyslak: Homer, hey, you should not drink to forget your problems.
Barney_Gumble: Yeah, you should only drink to enhance your social skills.



## Implement Preprocessing Functions

### Lookup Table

In [3]:
import numpy as np

def create_lookup_tables(text):
    """
    Create lookup tables for vocabulary
    :param text: The text of tv scripts split into words
    :return: A tuple of dicts (vocab_to_int, int_to_vocab)
    """
    vocab_to_int = {word: i for i, word in enumerate(set(text))}
    # Invert the first table so both mappings are guaranteed to agree
    int_to_vocab = {i: word for word, i in vocab_to_int.items()}
    return vocab_to_int, int_to_vocab

Tests Passed
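
As a quick sanity check of the idea, the sketch below builds both tables for a tiny word list and confirms that encoding followed by decoding returns the original words. The helper name `build_lookup_tables` and the sample words are illustrative assumptions, not part of the project code; `sorted()` is used here only to make the id assignment reproducible.

```python
def build_lookup_tables(words):
    """Hypothetical stand-in for create_lookup_tables, for illustration only."""
    # sorted() makes the id assignment reproducible across runs
    vocab_to_int = {word: i for i, word in enumerate(sorted(set(words)))}
    int_to_vocab = {i: word for word, i in vocab_to_int.items()}
    return vocab_to_int, int_to_vocab


words = ['moe', 'homer', 'moe', 'barney']
vocab_to_int, int_to_vocab = build_lookup_tables(words)

# Encode the words as ids, then decode them back
encoded = [vocab_to_int[w] for w in words]
decoded = [int_to_vocab[i] for i in encoded]
print(encoded)
print(decoded)
```

Repeated words share one id, so the vocabulary here has three entries even though the input has four words.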


### Tokenize Punctuation

In [4]:
def token_lookup():
    """
    Generate a dict to turn punctuation into a token.
    :return: Tokenize dictionary where the key is the punctuation and the value is the token
    """
    token_dict = {}
    token_dict['.'] = '||Period||'
    token_dict[','] = '||Comma||'
    token_dict['"'] = '||Quotation_Mark||'
    token_dict[';'] = '||Semicolon||'
    token_dict['?'] = '||Question_Mark||'
    token_dict['!'] = '||Exclamation_Mark||'
    token_dict['('] = '||Left_Parentheses||'
    token_dict[')'] = '||Right_Parentheses||'
    token_dict['--'] = '||Dash||'
    token_dict['\n'] = '||Return||'
    return token_dict

Tests Passed
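
The preprocessing helper itself is not shown in this notebook, so the following is only a sketch of how such a dictionary is typically applied. It assumes that each punctuation mark is replaced by its token padded with spaces and that the text is lowercased before splitting, which would be consistent with the generation code later searching for `token.lower()`. The function name `tokenize` and the reduced dictionary are hypothetical.

```python
def tokenize(text, token_dict):
    """Replace punctuation with tokens, then lowercase and split on whitespace."""
    for symbol, token in token_dict.items():
        # Pad with spaces so the token becomes its own word
        text = text.replace(symbol, ' {} '.format(token))
    return text.lower().split()


# A reduced, illustrative version of the dictionary built by token_lookup()
sample_tokens = {',': '||Comma||', '!': '||Exclamation_Mark||'}
words = tokenize('Hello, Moe!', sample_tokens)
print(words)
```

Treating punctuation as separate words keeps `moe` and `moe!` from becoming two different vocabulary entries.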


## Preprocess all the data and save it
Running the code cell below will preprocess all the data and save it to file.

In [5]:
preprocess.preprocess_and_save_data(data_dir, token_lookup, create_lookup_tables)

# Checkpoint

In [6]:
from preprocess import preprocess
import numpy as np
import problem_unittests as tests

int_text, vocab_to_int, int_to_vocab, token_dict = preprocess.load_preprocess()

## Build the Neural Network

### Check the Version of TensorFlow and Access to GPU

In [7]:
from distutils.version import LooseVersion
import warnings
import tensorflow as tf

# Check TensorFlow Version
assert LooseVersion(tf.__version__) >= LooseVersion('1.0'), 'Please use TensorFlow version 1.0 or newer'
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
if not tf.test.gpu_device_name():
    warnings.warn('No GPU found. Please use a GPU to train your neural network.')
else:
    print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.0.1


  


### Input

In [8]:
def get_inputs():
    """
    Create TF Placeholders for input, targets, and learning rate.
    :return: Tuple (input, targets, learning rate)
    """
    Input = tf.placeholder(tf.int32, [None, None], name = 'input')
    Targets = tf.placeholder(tf.int32, [None, None], name = 'targets')
    LearningRate = tf.placeholder(tf.float32, name = 'learning_rate')
    return Input, Targets, LearningRate

Tests Passed


### Build RNN Cell and Initialize

In [9]:
def get_init_cell(batch_size, rnn_size):
    """
    Create an RNN Cell and initialize it.
    :param batch_size: Size of batches
    :param rnn_size: Size of RNNs
    :return: Tuple (cell, initialize state)
    """
    lstm = tf.contrib.rnn.BasicLSTMCell(rnn_size)
    # Single-layer LSTM stack; no dropout is applied, so the same graph can be reused for generation
    cell = tf.contrib.rnn.MultiRNNCell([lstm])
    initial_state = tf.identity(cell.zero_state(batch_size = batch_size, dtype = tf.float32),
                                name = 'initial_state')
    return cell, initial_state

Tests Passed


### Word Embedding
Apply embedding to `input_data` using TensorFlow.  Return the embedded sequence.

In [10]:
def get_embed(input_data, vocab_size, embed_dim):
    """
    Create embedding for <input_data>.
    :param input_data: TF placeholder for text input.
    :param vocab_size: Number of words in vocabulary.
    :param embed_dim: Number of embedding dimensions
    :return: Embedded input.
    """
    embedding = tf.Variable(tf.random_uniform((vocab_size, embed_dim), -1, 1))
    embed_input = tf.nn.embedding_lookup(embedding, input_data)
    return embed_input

Tests Passed
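
To make the lookup step concrete, here is a plain-Python sketch of what an embedding lookup does conceptually: every word id in a `[batch size, sequence length]` input is replaced by the matching row of the embedding matrix. The matrix values and the helper `embedding_lookup` are made up for illustration and do not reproduce TensorFlow's implementation.

```python
# A toy embedding matrix: 4 words in the vocabulary, 3 dimensions each
embedding = [
    [0.1, 0.2, 0.3],   # word id 0
    [0.4, 0.5, 0.6],   # word id 1
    [0.7, 0.8, 0.9],   # word id 2
    [1.0, 1.1, 1.2],   # word id 3
]


def embedding_lookup(embedding, input_ids):
    """Replace every id in a [batch, sequence] list with its embedding row."""
    return [[embedding[word_id] for word_id in sequence] for sequence in input_ids]


batch = [[0, 2], [3, 3]]          # shape [batch_size=2, seq_length=2]
embedded = embedding_lookup(embedding, batch)
print(embedded[0][1])             # row for word id 2
```

The result has one extra dimension compared with the input: `[batch size, sequence length, embedding dimension]`.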


### Build RNN

In [11]:
def build_rnn(cell, inputs):
    """
    Create a RNN using a RNN Cell
    :param cell: RNN Cell
    :param inputs: Input text data
    :return: Tuple (Outputs, Final State)
    """
    outputs, state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)
    final_state = tf.identity(state, name='final_state')
    return outputs, final_state

Tests Passed


### Build the Neural Network

In [12]:
def build_nn(cell, rnn_size, input_data, vocab_size, embed_dim):
    """
    Build part of the neural network
    :param cell: RNN cell
    :param rnn_size: Size of rnns
    :param input_data: Input data
    :param vocab_size: Vocabulary size
    :param embed_dim: Number of embedding dimensions
    :return: Tuple (Logits, FinalState)
    """
    embed_input = get_embed(input_data, vocab_size, embed_dim)
    outputs, final_state = build_rnn(cell, embed_input)
    logits = tf.contrib.layers.fully_connected(activation_fn=None,
                                               num_outputs=vocab_size, 
                                               inputs = outputs)
    return logits, final_state

Tests Passed


### Batches
Implement `get_batches` to create batches of input and targets using `int_text`.  The batches should be a Numpy array with the shape `(number of batches, 2, batch size, sequence length)`. Each batch contains two elements:
- The first element is a single batch of **input** with the shape `[batch size, sequence length]`
- The second element is a single batch of **targets** with the shape `[batch size, sequence length]`

If the last batch can't be filled with enough data, drop the last batch.

For example, `get_batches([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], 3, 2)` would return a NumPy array of the following:
```
[
  # First Batch
  [
    # Batch of Input
    [[ 1  2], [ 7  8], [13 14]]
    # Batch of targets
    [[ 2  3], [ 8  9], [14 15]]
  ]

  # Second Batch
  [
    # Batch of Input
    [[ 3  4], [ 9 10], [15 16]]
    # Batch of targets
    [[ 4  5], [10 11], [16 17]]
  ]

  # Third Batch
  [
    # Batch of Input
    [[ 5  6], [11 12], [17 18]]
    # Batch of targets
    [[ 6  7], [12 13], [18  1]]
  ]
]
```

Notice that the last target value in the last batch is the first input value of the first batch. In this case, `1`. This is a common technique used when creating sequence batches, although it is rather unintuitive.
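
The relationship between inputs and targets in the example above can be checked directly. The sketch below copies the expected result as nested lists, stitches each row back together across the three batches, and confirms that the targets are the inputs rotated left by one position. The helper `flatten` is a hypothetical name used only for this check.

```python
# The example result from the text: batches[i] = [inputs, targets]
batches = [
    [[[1, 2], [7, 8], [13, 14]], [[2, 3], [8, 9], [14, 15]]],
    [[[3, 4], [9, 10], [15, 16]], [[4, 5], [10, 11], [16, 17]]],
    [[[5, 6], [11, 12], [17, 18]], [[6, 7], [12, 13], [18, 1]]],
]


def flatten(batches, part):
    """Rebuild the original word order from the inputs (part=0) or targets (part=1)."""
    n_rows = len(batches[0][part])
    flat = []
    for row in range(n_rows):
        # Each row continues across consecutive batches
        for batch in batches:
            flat.extend(batch[part][row])
    return flat


inputs = flatten(batches, 0)
targets = flatten(batches, 1)
# Targets are the inputs rotated left by one position
rotated = inputs[1:] + inputs[:1]
print(targets == rotated)
```

Note that only 18 of the 20 input values appear: the last two cannot fill a complete batch and are dropped.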

In [13]:
def get_batches(int_text, batch_size, seq_length):
    """
    Return batches of input and target
    :param int_text: Text with the words replaced by their ids
    :param batch_size: The size of batch
    :param seq_length: The length of sequence
    :return: Batches as a Numpy array
    """
    words_per_batch = seq_length * batch_size
    n_batches = len(int_text) // words_per_batch

    # Keep only enough words to make full batches
    x_data = np.array(int_text[:n_batches * words_per_batch])
    # Targets are the inputs shifted by one word; the final target wraps around to the first input
    y_data = np.append(x_data[1:], x_data[0])

    # Reshape into batch_size rows
    x_data = x_data.reshape((batch_size, -1))
    y_data = y_data.reshape((batch_size, -1))

    batches = []
    for n in range(0, x_data.shape[1], seq_length):
        # The features and their matching targets for this batch
        x = x_data[:, n:n + seq_length]
        y = y_data[:, n:n + seq_length]
        batches.append([x, y])
    return np.array(batches)

Tests Passed


## Neural Network Training
### Hyperparameters
Tune the following parameters:

- Set `num_epochs` to the number of epochs.
- Set `batch_size` to the batch size.
- Set `rnn_size` to the size of the RNNs.
- Set `embed_dim` to the size of the embedding.
- Set `seq_length` to the length of sequence.
- Set `learning_rate` to the learning rate.
- Set `show_every_n_batches` to the number of batches the neural network should print progress.

In [14]:
# Number of Epochs
num_epochs = 200
# Batch Size
batch_size = 128
# RNN Size
rnn_size = 256
# Embedding Dimension Size
embed_dim = 25
# Sequence Length
seq_length = 25
# Learning Rate
learning_rate = 0.003
# Show stats for every n number of batches
show_every_n_batches = 50

save_dir = './save'

### Build the Graph
Build the graph using the neural network you implemented.

In [15]:
from tensorflow.contrib import seq2seq

train_graph = tf.Graph()
with train_graph.as_default():
    vocab_size = len(int_to_vocab)
    input_text, targets, lr = get_inputs()
    input_data_shape = tf.shape(input_text)
    cell, initial_state = get_init_cell(input_data_shape[0], rnn_size)
    logits, final_state = build_nn(cell, rnn_size, input_text, vocab_size, embed_dim)

    # Probabilities for generating words
    probs = tf.nn.softmax(logits, name='probs')

    # Loss function
    cost = seq2seq.sequence_loss(
        logits,
        targets,
        tf.ones([input_data_shape[0], input_data_shape[1]]))

    # Optimizer
    optimizer = tf.train.AdamOptimizer(lr)

    # Gradient Clipping
    gradients = optimizer.compute_gradients(cost)
    capped_gradients = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gradients if grad is not None]
    train_op = optimizer.apply_gradients(capped_gradients)
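
The gradient clipping step above clamps every gradient element to the range `[-1, 1]` before the optimizer applies it. As a plain-Python illustration of that element-wise operation (the helper and the gradient values are made up, and this is not TensorFlow's implementation):

```python
def clip_by_value(values, low, high):
    """Clamp every element of a flat list to the range [low, high]."""
    return [max(low, min(high, v)) for v in values]


gradient = [-3.5, -0.2, 0.0, 0.7, 12.0]   # made-up gradient values
clipped = clip_by_value(gradient, -1.0, 1.0)
print(clipped)
```

Values already inside the range pass through unchanged; only the outliers are limited, which helps keep a single large gradient from destabilizing training.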

## Train
Train the neural network on the preprocessed data.  If you have a hard time getting a good loss, check the [forums](https://discussions.udacity.com/) to see if anyone is having the same problem.

In [16]:
batches = get_batches(int_text, batch_size, seq_length)

with tf.Session(graph=train_graph) as sess:
    sess.run(tf.global_variables_initializer())

    for epoch_i in range(num_epochs):
        state = sess.run(initial_state, {input_text: batches[0][0]})

        for batch_i, (x, y) in enumerate(batches):
            feed = {
                input_text: x,
                targets: y,
                initial_state: state,
                lr: learning_rate}
            train_loss, state, _ = sess.run([cost, final_state, train_op], feed)

            # Show every <show_every_n_batches> batches
            if (epoch_i * len(batches) + batch_i) % show_every_n_batches == 0:
                print('Epoch {:>3} Batch {:>4}/{}   train_loss = {:.3f}'.format(
                    epoch_i,
                    batch_i,
                    len(batches),
                    train_loss))

    # Save Model
    saver = tf.train.Saver()
    saver.save(sess, save_dir)
    print('Model Trained and Saved')

Epoch   0 Batch    0/21   train_loss = 8.822
Epoch   2 Batch    8/21   train_loss = 6.251
Epoch   4 Batch   16/21   train_loss = 5.897
Epoch   7 Batch    3/21   train_loss = 5.509
Epoch   9 Batch   11/21   train_loss = 5.122
Epoch  11 Batch   19/21   train_loss = 4.869
Epoch  14 Batch    6/21   train_loss = 4.633
Epoch  16 Batch   14/21   train_loss = 4.343
Epoch  19 Batch    1/21   train_loss = 4.235
Epoch  21 Batch    9/21   train_loss = 4.118
Epoch  23 Batch   17/21   train_loss = 3.890
Epoch  26 Batch    4/21   train_loss = 3.807
Epoch  28 Batch   12/21   train_loss = 3.677
Epoch  30 Batch   20/21   train_loss = 3.484
Epoch  33 Batch    7/21   train_loss = 3.397
Epoch  35 Batch   15/21   train_loss = 3.289
Epoch  38 Batch    2/21   train_loss = 3.185
Epoch  40 Batch   10/21   train_loss = 3.042
Epoch  42 Batch   18/21   train_loss = 2.985
Epoch  45 Batch    5/21   train_loss = 2.935
Epoch  47 Batch   13/21   train_loss = 2.848
Epoch  50 Batch    0/21   train_loss = 2.741
Epoch  52 
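
The irregular-looking epoch and batch numbers in the log follow from the print condition `(epoch_i * len(batches) + batch_i) % show_every_n_batches == 0`, which counts batches across epochs. The sketch below replays that condition without any training; the function name `logged_steps` is hypothetical, and the value of 21 batches per epoch is read off the log above.

```python
def logged_steps(num_epochs, n_batches, show_every_n_batches):
    """Return the (epoch, batch) pairs at which the training loop prints."""
    steps = []
    for epoch_i in range(num_epochs):
        for batch_i in range(n_batches):
            if (epoch_i * n_batches + batch_i) % show_every_n_batches == 0:
                steps.append((epoch_i, batch_i))
    return steps


# 21 batches per epoch is read off the log above; 50 is show_every_n_batches
steps = logged_steps(num_epochs=10, n_batches=21, show_every_n_batches=50)
print(steps)
```

The first few pairs match the first log lines: epoch 0 batch 0, epoch 2 batch 8, epoch 4 batch 16, and so on.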

## Save Parameters
Save `seq_length` and `save_dir` for generating a new TV script.

In [23]:
# Save parameters for checkpoint
preprocess.save_params((seq_length, save_dir))

# Checkpoint

In [18]:
import tensorflow as tf
import numpy as np
from preprocess import preprocess
import problem_unittests as tests

_, vocab_to_int, int_to_vocab, token_dict = preprocess.load_preprocess()
seq_length, load_dir = preprocess.load_params()

## Implement Generate Functions
### Get Tensors

In [19]:
def get_tensors(loaded_graph):
    """
    Get input, initial state, final state, and probabilities tensor from <loaded_graph>
    :param loaded_graph: TensorFlow graph loaded from file
    :return: Tuple (InputTensor, InitialStateTensor, FinalStateTensor, ProbsTensor)
    """
    InputTensor = loaded_graph.get_tensor_by_name('input:0')
    InitialStateTensor = loaded_graph.get_tensor_by_name('initial_state:0')
    FinalStateTensor = loaded_graph.get_tensor_by_name('final_state:0')
    ProbsTensor = loaded_graph.get_tensor_by_name('probs:0')
    return InputTensor, InitialStateTensor, FinalStateTensor, ProbsTensor

Tests Passed


### Choose Word

In [20]:
def pick_word(probabilities, int_to_vocab):
    """
    Pick the next word in the generated text
    :param probabilities: Probabilities of the next word
    :param int_to_vocab: Dictionary of word ids as the keys and words as the values
    :return: String of the predicted word
    """
    return int_to_vocab[np.argmax(probabilities)]

Tests Passed
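
`pick_word` above always returns the single most probable word, which makes generation deterministic and may cause the script to fall into repeated phrases. One common alternative is to sample the next word in proportion to its probability. The sketch below is illustrative only: `pick_word_sampled` is a hypothetical name, the tiny vocabulary is made up, and it relies on the standard library's `random.choices` rather than anything from the project.

```python
import random


def pick_word_sampled(probabilities, int_to_vocab):
    """Sample the next word id in proportion to its predicted probability."""
    word_ids = list(range(len(probabilities)))
    word_id = random.choices(word_ids, weights=list(probabilities), k=1)[0]
    return int_to_vocab[word_id]


int_to_vocab = {0: 'moe', 1: 'homer', 2: 'barney'}
print(pick_word_sampled([0.1, 0.7, 0.2], int_to_vocab))
```

With sampling, likely words are still chosen most often, but less likely words can appear too, so repeated runs produce different scripts.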


## Generate TV Script
Set `gen_length` to the length of TV script you want to generate.

In [22]:
gen_length = 200
# homer_simpson, moe_szyslak, or barney_gumble
prime_word = 'moe_szyslak'


loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # Load saved model
    loader = tf.train.import_meta_graph(load_dir + '.meta')
    loader.restore(sess, load_dir)

    # Get Tensors from loaded model
    input_text, initial_state, final_state, probs = get_tensors(loaded_graph)

    # Sentences generation setup
    gen_sentences = [prime_word + ':']
    prev_state = sess.run(initial_state, {input_text: np.array([[1]])})

    # Generate sentences
    for n in range(gen_length):
        # Dynamic Input
        dyn_input = [[vocab_to_int[word] for word in gen_sentences[-seq_length:]]]
        dyn_seq_length = len(dyn_input[0])

        # Get Prediction
        probabilities, prev_state = sess.run(
            [probs, final_state],
            {input_text: dyn_input, initial_state: prev_state})
        
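        # Flatten any leading batch dimension of size 1, then take the distribution for the last input word
        step_probs = np.reshape(probabilities, (-1, np.shape(probabilities)[-1]))
        pred_word = pick_word(step_probs[dyn_seq_length-1], int_to_vocab)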

        gen_sentences.append(pred_word)
    
    # Remove tokens
    tv_script = ' '.join(gen_sentences)
    for key, token in token_dict.items():
        tv_script = tv_script.replace(' ' + token.lower(), key)
    tv_script = tv_script.replace('\n ', '\n')
    tv_script = tv_script.replace('( ', '(')
        
    print(tv_script)

moe_szyslak:(sighs) what's the point?... same ol' stinkin' world...(to moe) he seems nice.
lisa_simpson: how'd mindless tester's off my this friend is gonna go tsk, tsk, tsk, tsk, tsk, tsk, tsk till broom.
carl_carlson: i thought you said chug-monkeys. what beverage, brewed since ancient times, is made from hops and grains?
scary him miss_lois_pennycandy:.
lenny_leonard: plus his wife was madonna.
ned_flanders: what're blessing.(small sob)
moe_szyslak: eh, sam: call for the musical in the pledge of allegiance. bugging me.


moe_szyslak: hates me, sam: laugh at the musical in the fridge, the man who spews harmony) pope's beers, you just gotta warn you, they must be the ugliest beer and a wad of bills.


moe_szyslak: hey, hey, hey, hey! freaky.


moe_szyslak:(dumbest glass, sam:) dad?
scary him on, sam: lee is walther hotenhoffer and i'm in the pharmaceutical


# The TV Script is Nonsensical
Luckily, there's more data!  As mentioned at the beginning of this project, this is a subset of the [dataset](https://www.kaggle.com/wcukierski/the-simpsons-by-the-data).  The project didn't train on all the data because that would take too long.  However, you are free to train the neural network on all the data.