# TV Script Generation

In this project, you'll generate your own [Seinfeld](https://en.wikipedia.org/wiki/Seinfeld) TV scripts using RNNs.  You'll be using part of the [Seinfeld dataset](https://www.kaggle.com/thec03u5/seinfeld-chronicles#scripts.csv) of scripts from 9 seasons.  The neural network you'll build will generate a new, "fake" TV script based on patterns it recognizes in this training data.

## Get the Data

The data is already provided for you in `./data/Seinfeld_Scripts.txt`, and you're encouraged to open that file and look at the text.
* As a first step, we'll load in this data and look at some samples.
* Then, you'll be tasked with defining and training an RNN to generate a new script!

In [1]:
"""
DON'T MODIFY ANYTHING IN THIS CELL
"""
# load in data
import helper
data_dir = './data/Seinfeld_Scripts.txt'
text = helper.load_data(data_dir)

## Explore the Data
Play around with `view_line_range` to view different parts of the data. This will give you a sense of the data you'll be working with. You can see, for example, that it is all lowercase text, and each new line of dialogue is separated by a newline character `\n`.

In [2]:
view_line_range = (0, 10)

"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
import numpy as np

print('Dataset Stats')
print('Roughly the number of unique words: {}'.format(len({word: None for word in text.split()})))

lines = text.split('\n')
print('Number of lines: {}'.format(len(lines)))
word_count_line = [len(line.split()) for line in lines]
print('Average number of words in each line: {}'.format(np.average(word_count_line)))

print()
print('The lines {} to {}:'.format(*view_line_range))
print('\n'.join(text.split('\n')[view_line_range[0]:view_line_range[1]]))

Dataset Stats
Roughly the number of unique words: 46367
Number of lines: 109233
Average number of words in each line: 5.544240293684143

The lines 0 to 10:
jerry: do you know what this is all about? do you know, why were here? to be out, this is out...and out is one of the single most enjoyable experiences of life. people...did you ever hear people talking about we should go out? this is what theyre talking about...this whole thing, were all out now, no one is home. not one person here is home, were all out! there are people trying to find us, they dont know where we are. (on an imaginary phone) did you ring?, i cant find him. where did he go? he didnt tell me where he was going. he must have gone out. you wanna go out you get ready, you pick out the clothes, right? you take the shower, you get all ready, get the cash, get your friends, the car, the spot, the reservation...then youre standing around, what do you do? you go we gotta be getting back. once youre out, you wanna get back! y

---
## Implement Pre-processing Functions
The first thing to do to any dataset is pre-processing.  Implement the following pre-processing functions below:
- Lookup Table
- Tokenize Punctuation

### Lookup Table
To create a word embedding, you first need to transform the words to ids.  In this function, create two dictionaries:
- Dictionary to go from the words to an id, we'll call `vocab_to_int`
- Dictionary to go from the id to word, we'll call `int_to_vocab`

Return these dictionaries in the following **tuple** `(vocab_to_int, int_to_vocab)`

In [3]:
import problem_unittests as tests
import re
from collections import Counter

def create_lookup_tables(text):
    """
    Create lookup tables for vocabulary
    :param text: The text of tv scripts split into words
    :return: A tuple of dicts (vocab_to_int, int_to_vocab)
    """
    # TODO: Implement Function
    # Takes a list of words and returns two dictionaries that map
    # from the vocabulary to integer ids and back.
    # Counter gives the word counts; sorting by count orders the
    # vocabulary from most to least frequent. Ids are then assigned in
    # that order, so the most frequent word gets 0, the next most
    # frequent gets 1, and so on.

    text_counts = Counter(text)
    # sorting the words from most to least frequent in text occurrence
    sorted_vocab = sorted(text_counts, key=text_counts.get, reverse=True)
    # create int_to_vocab dictionaries
    int_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
    #print('int_to_vocab', int_to_vocab) #DEBUG
    vocab_to_int = {word: ii for ii, word in int_to_vocab.items()}
    #print('vocab_to_int', vocab_to_int) #DEBUG
    # return tuple
    return (vocab_to_int, int_to_vocab)


"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
tests.test_create_lookup_tables(create_lookup_tables)

Tests Passed
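
As a quick illustration of the two dictionaries, here is a minimal, self-contained sketch on a toy word list. The name `build_lookup_tables` and the toy sentence are assumptions made for this example; the function follows the same frequency-ordering idea as `create_lookup_tables` above but is not the tested implementation.

```python
from collections import Counter


def build_lookup_tables(words):
    """Map each unique word to an integer id, most frequent word first."""
    counts = Counter(words)
    # Sort by descending count; ties keep their first-seen order
    vocab = sorted(counts, key=counts.get, reverse=True)
    int_to_vocab = dict(enumerate(vocab))
    vocab_to_int = {word: idx for idx, word in int_to_vocab.items()}
    return vocab_to_int, int_to_vocab


# Hypothetical toy input, already split into words
toy_words = "jerry : hello . george : hello jerry . hello".split()
toy_vocab_to_int, toy_int_to_vocab = build_lookup_tables(toy_words)

# Encode the words as ids, then decode them back
encoded = [toy_vocab_to_int[w] for w in toy_words]
decoded = [toy_int_to_vocab[i] for i in encoded]

print(toy_vocab_to_int)  # "hello" is the most frequent word, so it gets id 0
print(encoded)
```

Encoding and then decoding should give back the original word list, which is a cheap sanity check to run on your own implementation.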


### Tokenize Punctuation
We'll be splitting the script into a word array using spaces as delimiters.  However, punctuation like periods and exclamation marks can create multiple ids for the same word. For example, "bye" and "bye!" would generate two different word ids.

Implement the function `token_lookup` to return a dict that will be used to tokenize symbols like "!" into "||Exclamation_Mark||".  Create a dictionary for the following symbols where the symbol is the key and value is the token:
- Period ( **.** )
- Comma ( **,** )
- Quotation Mark ( **"** )
- Semicolon ( **;** )
- Exclamation mark ( **!** )
- Question mark ( **?** )
- Left Parentheses ( **(** )
- Right Parentheses ( **)** )
- Dash ( **-** )
- Return ( **\n** )

This dictionary will be used to tokenize the symbols and add the delimiter (space) around each one.  This separates each symbol as its own word, making it easier for the neural network to predict the next word. Make sure you don't use a value that could be confused with a word; for example, instead of using the value "dash", try using something like "||dash||".

In [4]:
def token_lookup():
    """
    Generate a dict to turn punctuation into a token.
    :return: Tokenized dictionary where the key is the punctuation and the value is the token
    """
    # TODO: Implement Function
        
    return {
        '.': '||Period||',
        ',': '||Comma||',
        '"': '||Quotation_Mark||',
        ';': '||Semicolon||',
        '!': '||Exclamation_mark||',
        '?': '||Question_mark||',
        '(': '||Left_Parentheses||',
        ')': '||Right_Parentheses||',
        '-': '||Dash||',
        "\n": '||Return||'
    }

"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
tests.test_tokenize(token_lookup)

Tests Passed
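
To see why the tokens matter, the sketch below applies a reduced token dictionary to a short string. The `tokenize` function and `toy_tokens` dictionary are hypothetical names for this example; the actual preprocessing in the helper module may differ in its details.

```python
def tokenize(text, token_dict):
    """Replace each symbol with its token, padded by spaces, then split on whitespace."""
    for symbol, token in token_dict.items():
        text = text.replace(symbol, ' {} '.format(token))
    return text.split()


# A reduced, hypothetical dictionary; the full one covers all ten symbols
toy_tokens = {
    '!': '||Exclamation_mark||',
    ',': '||Comma||',
    '\n': '||Return||',
}

words = tokenize("bye, jerry!\nbye!", toy_tokens)
print(words)
# ['bye', '||Comma||', 'jerry', '||Exclamation_mark||', '||Return||',
#  'bye', '||Exclamation_mark||']
```

Both occurrences of "bye" now map to the same word, and each punctuation mark becomes its own entry in the vocabulary.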


## Pre-process all the data and save it

Running the code cell below will pre-process all the data and save it to file. You're encouraged to look at the code for `preprocess_and_save_data` in the `helper.py` file to see what it's doing in detail, but you do not need to change this code.

In [5]:
"""
DON'T MODIFY ANYTHING IN THIS CELL
"""
# pre-process training data
helper.preprocess_and_save_data(data_dir, token_lookup, create_lookup_tables)

# Check Point
This is your first checkpoint. If you ever decide to come back to this notebook or have to restart the notebook, you can start from here. The preprocessed data has been saved to disk.

In [6]:
"""
DON'T MODIFY ANYTHING IN THIS CELL
"""
import helper
import problem_unittests as tests

int_text, vocab_to_int, int_to_vocab, token_dict = helper.load_preprocess()

## Build the Neural Network
In this section, you'll build the components necessary to build an RNN by implementing the RNN Module and forward and backpropagation functions.

### Check Access to GPU

In [7]:
"""
DON'T MODIFY ANYTHING IN THIS CELL
"""
import torch

# Check for a GPU
train_on_gpu = torch.cuda.is_available()
if not train_on_gpu:
    print('No GPU found. Please use a GPU to train your neural network.')

## Input
Let's start with the preprocessed input data. We'll use [TensorDataset](http://pytorch.org/docs/master/data.html#torch.utils.data.TensorDataset) to provide a known format to our dataset; in combination with [DataLoader](http://pytorch.org/docs/master/data.html#torch.utils.data.DataLoader), it will handle batching, shuffling, and other dataset iteration functions.

You can create data with TensorDataset by passing in feature and target tensors. Then create a DataLoader as usual.
```
data = TensorDataset(feature_tensors, target_tensors)
data_loader = torch.utils.data.DataLoader(data, 
                                          batch_size=batch_size)
```

### Batching
Implement the `batch_data` function to batch `words` data into chunks of size `batch_size` using the `TensorDataset` and `DataLoader` classes.

>You can batch words using the DataLoader, but it will be up to you to create `feature_tensors` and `target_tensors` of the correct size and content for a given `sequence_length`.

For example, say we have these as input:
```
words = [1, 2, 3, 4, 5, 6, 7]
sequence_length = 4
```

Your first `feature_tensor` should contain the values:
```
[1, 2, 3, 4]
```
And the corresponding `target_tensor` should just be the next "word"/tokenized word value:
```
5
```
This should continue with the second `feature_tensor`, `target_tensor` being:
```
[2, 3, 4, 5]  # features
6             # target
```
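
The sliding window described above can be sketched in plain Python, before any tensors are involved. The function name `make_windows` is an assumption for this example.

```python
def make_windows(words, sequence_length):
    """Return (features, targets): each feature is a window of word ids,
    and its target is the id that immediately follows the window."""
    features, targets = [], []
    for start in range(len(words) - sequence_length):
        features.append(words[start:start + sequence_length])
        targets.append(words[start + sequence_length])
    return features, targets


features, targets = make_windows([1, 2, 3, 4, 5, 6, 7], sequence_length=4)
print(features)  # [[1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 6]]
print(targets)   # [5, 6, 7]
```

With 7 words and a sequence length of 4 there are only 3 complete windows, because the last window must still have a following word to serve as its target.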

In [8]:
from torch.utils.data import TensorDataset, DataLoader


def batch_data(words, sequence_length, batch_size):
    """
    Batch the neural network data using DataLoader
    :param words: The word ids of the TV scripts
    :param sequence_length: The sequence length of each batch
    :param batch_size: The size of each batch; the number of sequences in a batch
    :return: DataLoader with batched data
    """
    # TODO: Implement function
    # Number of word ids in one full batch: batch_size sequences of
    # sequence_length words each.
    batch_size_total = batch_size * sequence_length

    # Number of completely full batches; integer division (//) rounds
    # down and drops any remainder.
    n_batches = len(words) // batch_size_total

    # Keep only enough words to make full batches. A few trailing words
    # may be lost, which is negligible for a dataset of this size.
    words = words[:n_batches * batch_size_total]

    # Build feature and target lists with a sliding window: each feature
    # is sequence_length consecutive word ids, and its target is the
    # word id that immediately follows.
    features, targets = [], []
    for idx in range(0, len(words) - sequence_length):
        features.append(words[idx : idx + sequence_length])
        targets.append(words[idx + sequence_length])

    # Wrap the tensors in a TensorDataset and batch them with a DataLoader.
    # torch.from_numpy converts the NumPy arrays into tensors.
    data = TensorDataset(torch.from_numpy(np.asarray(features)),
                         torch.from_numpy(np.asarray(targets)))
    data_loader = DataLoader(data, shuffle=False, batch_size=batch_size)

    return data_loader

# there is no test for this function, but you are encouraged to create
# print statements and tests of your own

#test_text = range(50) #DEBUG
#t_loader = batch_data(test_text, sequence_length=5, batch_size=10) #DEBUG


### Test your dataloader 

You'll have to modify this code to test a batching function, but it should look fairly similar.

Below, we're generating some test text data and defining a dataloader using the function you defined, above. Then, we are getting some sample batch of inputs `sample_x` and targets `sample_y` from our dataloader.

Your code should return something like the following (likely in a different order, if you shuffled your data):

```
torch.Size([10, 5])
tensor([[ 28,  29,  30,  31,  32],
        [ 21,  22,  23,  24,  25],
        [ 17,  18,  19,  20,  21],
        [ 34,  35,  36,  37,  38],
        [ 11,  12,  13,  14,  15],
        [ 23,  24,  25,  26,  27],
        [  6,   7,   8,   9,  10],
        [ 38,  39,  40,  41,  42],
        [ 25,  26,  27,  28,  29],
        [  7,   8,   9,  10,  11]])

torch.Size([10])
tensor([ 33,  26,  22,  39,  16,  28,  11,  43,  30,  12])
```

### Sizes
Your sample_x should be of size `(batch_size, sequence_length)` or (10, 5) in this case and sample_y should just have one dimension: batch_size (10). 

### Values

You should also notice that the targets, sample_y, are the *next* value in the ordered test_text data. So, for an input sequence `[ 28,  29,  30,  31,  32]` that ends with the value `32`, the corresponding output should be `33`.

In [9]:
# test dataloader

test_text = range(50)
t_loader = batch_data(test_text, sequence_length=5, batch_size=10)

data_iter = iter(t_loader)
sample_x, sample_y = next(data_iter)

print(sample_x.shape)
print(sample_x)
print()
print(sample_y.shape)
print(sample_y)

torch.Size([10, 5])
tensor([[  0,   1,   2,   3,   4],
        [  1,   2,   3,   4,   5],
        [  2,   3,   4,   5,   6],
        [  3,   4,   5,   6,   7],
        [  4,   5,   6,   7,   8],
        [  5,   6,   7,   8,   9],
        [  6,   7,   8,   9,  10],
        [  7,   8,   9,  10,  11],
        [  8,   9,  10,  11,  12],
        [  9,  10,  11,  12,  13]])

torch.Size([10])
tensor([  5,   6,   7,   8,   9,  10,  11,  12,  13,  14])


---
## Build the Neural Network
Implement an RNN using PyTorch's [Module class](http://pytorch.org/docs/master/nn.html#torch.nn.Module). You may choose to use a GRU or an LSTM. To complete the RNN, you'll have to implement the following functions for the class:
 - `__init__` - The initialize function. 
 - `init_hidden` - The initialization function for an LSTM/GRU hidden state
 - `forward` - Forward propagation function.
 
The initialize function should create the layers of the neural network and save them to the class. The forward propagation function will use these layers to run forward propagation and generate an output and a hidden state.

**The output of this model should be the *last* batch of word scores** after a complete sequence has been processed. That is, for each input sequence of words, we only want to output the word scores for a single, most likely, next word.

### Hints

1. Make sure to stack the outputs of the lstm to pass to your fully-connected layer, you can do this with `lstm_output = lstm_output.contiguous().view(-1, self.hidden_dim)`
2. You can get the last batch of word scores by shaping the output of the final, fully-connected layer like so:

```
# reshape into (batch_size, seq_length, output_size)
output = output.view(batch_size, -1, self.output_size)
# get last batch
out = output[:, -1]
```
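
The reshaping in these hints can be checked on small arrays. The sketch below uses NumPy instead of PyTorch so it runs without a model; the dimensions and the random `weight` matrix (standing in for the fully-connected layer) are assumptions chosen for illustration, and `reshape` plays the role of `view`.

```python
import numpy as np

batch_size, seq_length, hidden_dim, output_size = 2, 3, 4, 5
rng = np.random.default_rng(0)

# Stand-in for the LSTM output with batch_first=True
lstm_out = rng.normal(size=(batch_size, seq_length, hidden_dim))
# Stand-in for the weights of the fully-connected layer (bias omitted)
weight = rng.normal(size=(hidden_dim, output_size))

# Hint 1: stack all time steps into rows of size hidden_dim
stacked = lstm_out.reshape(-1, hidden_dim)            # (6, 4)
scores = stacked @ weight                             # (6, 5)

# Hint 2: reshape to (batch_size, seq_length, output_size), keep the last step
scores = scores.reshape(batch_size, -1, output_size)  # (2, 3, 5)
last_scores = scores[:, -1]                           # (2, 5)

# Same result as applying the layer to the last time step only
direct = lstm_out[:, -1] @ weight
print(last_scores.shape, np.allclose(last_scores, direct))
```

This only works because the batch dimension comes first; with a sequence-first layout the same reshape would mix time steps from different sequences.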

In [10]:
import torch.nn as nn

class RNN(nn.Module):
    
    # An embedding layer, a recurrent (LSTM) layer, and a final fully-connected
    # layer are defined in __init__, according to the passed-in parameters.
    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, dropout=0.5):
        """
        Initialize the PyTorch RNN Module
        :param vocab_size: The number of input dimensions of the neural network (the size of the vocabulary)
        :param output_size: The number of output dimensions of the neural network
        :param embedding_dim: The size of embeddings, should you choose to use them        
        :param hidden_dim: The size of the hidden layer outputs
        :param n_layers: The number of stacked LSTM/GRU layers
        :param dropout: dropout to add in between LSTM/GRU layers
        """ 
        super(RNN, self).__init__()
        # TODO: Implement function
        
        # set class variables
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.dropout = dropout
        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        
        # define model layers
        
        #First, embedding layer, which should take in the size of the 
        #vocabulary (our number of integer tokens) and produce an embedding of 
        #embedding_dim size. So, as this model trains, this is going to create 
        #an embedding lookup table that has as many rows as we have word integers,
        #and as many columns as the embedding dimension
        #embedding and LSTM layers
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        
        #Then, an LSTM layer, which takes in inputs of embedding_dim size.
        #So, it's accepting embeddings as inputs, and producing an output and
        #hidden state of a hidden size. Also specifying a number of layers,
        #and a dropout value, and finally, setting batch_first to True
        #because DataLoaders are used to batch the data.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, 
                            dropout=dropout, batch_first=True)
        
        # The LSTM outputs go to a fully-connected, linear layer that
        # produces output_size word scores. An extra dropout layer
        # between the two is left disabled.
        #self.dropout = nn.Dropout(dropout)
        
        # linear layer
        self.fc = nn.Linear(hidden_dim, output_size)
        
        # Sigmoid layer; defined but not applied in forward(), because
        # nn.CrossEntropyLoss expects raw, unnormalized word scores.
        self.sig = nn.Sigmoid()
    
    # forward takes an input nn_input and a hidden state, and passes the
    # input through the layers in sequence.
    def forward(self, nn_input, hidden):
        """
        Forward propagation of the neural network
        :param nn_input: The input to the neural network
        :param hidden: The hidden state        
        :return: Two Tensors, the output of the neural network and the latest hidden state
        """
        # TODO: Implement function 
        
        # batch size of the input, used later to reshape the output
        batch_size = nn_input.size(0)

        # Pass the word ids through the embedding layer, then pass the
        # embeddings and the hidden state to the LSTM, which returns
        # its outputs and a new hidden state.
        nn_input = nn_input.long()
        embed_output = self.embedding(nn_input)
        lstm_out, hidden = self.lstm(embed_output, hidden)
    
        # stack up lstm outputs to pass to the linear layer
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
        
        # fully-connected layer producing word scores
        # (no dropout layer and no sigmoid are applied here)
        output = self.fc(lstm_out)
        
        # reshape to be batch_size first: (batch_size, seq_length, output_size)
        output = output.view(batch_size, -1, self.output_size)
        # keep only the scores for the last time step of each sequence
        output = output[:, -1]

        # return one batch of output word scores and the hidden state
        return output, hidden
    
    
    # The hidden and cell states of an LSTM are returned as a tuple, and each
    # has size (n_layers, batch_size, hidden_dim). Both are initialized to
    # zeros and moved to the GPU if one is available.
    def init_hidden(self, batch_size):
        '''
        Initialize the hidden state of an LSTM/GRU
        :param batch_size: The batch_size of the hidden state
        :return: hidden state of dims (n_layers, batch_size, hidden_dim)
        '''
        # Create two new tensors of size n_layers x batch_size x hidden_dim,
        # initialized to zero, for the hidden state and cell state of the
        # LSTM. They are returned together as a tuple and moved to the GPU
        # if one is available.
        weight = next(self.parameters()).data
        
        if (train_on_gpu):
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
        
        return hidden


"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
tests.test_rnn(RNN, train_on_gpu)

Tests Passed


### Define forward and backpropagation

Use the RNN class you implemented to apply forward and back propagation. This function will be called, iteratively, in the training loop as follows:
```
loss, hidden = forward_back_prop(rnn, optimizer, criterion, inp, target, hidden)
```

And it should return the average loss over a batch and the hidden state returned by a call to `RNN(inp, hidden)`. Recall that you can get this loss by computing it, as usual, and calling `loss.item()`.

**If a GPU is available, you should move your data to that GPU device, here.**

In [11]:
def forward_back_prop(rnn, optimizer, criterion, inp, target, hidden):
    """
    Forward and backward propagation on the neural network
    :param rnn: The PyTorch Module that holds the neural network
    :param optimizer: The PyTorch optimizer for the neural network
    :param criterion: The PyTorch loss function
    :param inp: A batch of input to the neural network
    :param target: The target output for the batch of input
    :param hidden: The hidden state from the previous batch
    :return: The loss and the latest hidden state Tensor
    """
    
    # move model to GPU, if available
    if(train_on_gpu):
        rnn.cuda()
        
    # Detach the hidden state from its history by creating new variables
    # for it; otherwise we'd backprop through the entire training history.
    # The hidden state of an LSTM is a tuple (hidden state, cell state).
    hidden = tuple([each.data for each in hidden])
    
    # clear accumulated gradients
    rnn.zero_grad()
    
    # move data to GPU, if available
    if(train_on_gpu):
        inp, target = inp.cuda(), target.cuda()
        
    # forward pass: get the word scores and the new hidden state
    prediction, hidden = rnn(inp, hidden)
            
    # calculate the loss from the predicted scores and backpropagate
    loss = criterion(prediction, target)
    loss.backward()
    
    # Gradients in RNNs / LSTMs can explode. `clip_grad_norm_` guards
    # against this: if the total gradient norm exceeds the threshold
    # (5 here), all gradients are scaled down to bring the norm back
    # to that threshold.
    nn.utils.clip_grad_norm_(rnn.parameters(), 5)
    
    # optimization step: update the weights of the network
    optimizer.step()
    # return the loss over a batch and the hidden state produced by our model
    return loss.item(), hidden


# Note that these tests aren't completely extensive.
# they are here to act as general checks on the expected outputs of your functions
"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
tests.test_forward_back_prop(RNN, forward_back_prop, train_on_gpu)

Tests Passed
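
The gradient clipping step in `forward_back_prop` rescales gradients by their combined norm instead of clamping each value individually. The sketch below shows that idea on plain Python lists; `clip_by_total_norm` is a hypothetical name, and this is a simplified illustration of the technique, not PyTorch's implementation.

```python
import math


def clip_by_total_norm(grads, max_norm):
    """Scale all gradients by one common factor so that their combined
    L2 norm does not exceed max_norm. Returns (clipped_grads, norm_before)."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Small epsilon avoids division by zero; the factor is capped at 1
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [[g * scale for g in vec] for vec in grads], total_norm


# Two hypothetical parameter gradients with combined norm sqrt(9 + 16 + 144) = 13
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_total_norm(grads, max_norm=5.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))

print(norm_before)  # 13.0
print(norm_after)   # very close to 5.0
```

Because every gradient is multiplied by the same factor, the direction of the update is preserved and only its size shrinks. Gradients whose norm is already below the threshold pass through unchanged.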


## Neural Network Training

With the structure of the network complete and data ready to be fed in the neural network, it's time to train it.

### Train Loop

The training loop is implemented for you in the `train_rnn` function. This function will train the network over all the batches for the number of epochs given. The model's progress is printed once every `show_every_n_batches` batches. You'll set this parameter along with other parameters in the next section.

In [12]:
"""
DON'T MODIFY ANYTHING IN THIS CELL
"""

def train_rnn(rnn, batch_size, optimizer, criterion, n_epochs, show_every_n_batches=100):
    batch_losses = []
    
    rnn.train()

    print("Training for %d epoch(s)..." % n_epochs)
    for epoch_i in range(1, n_epochs + 1):
        
        # initialize hidden state
        hidden = rnn.init_hidden(batch_size)
        
        for batch_i, (inputs, labels) in enumerate(train_loader, 1):
            
            # make sure you iterate over completely full batches, only
            n_batches = len(train_loader.dataset)//batch_size
            if(batch_i > n_batches):
                break
            
            # forward, back prop
            loss, hidden = forward_back_prop(rnn, optimizer, criterion, inputs, labels, hidden)          
            # record loss
            batch_losses.append(loss)

            # printing loss stats
            if batch_i % show_every_n_batches == 0:
                print('Epoch: {:>4}/{:<4}  Loss: {}\n'.format(
                    epoch_i, n_epochs, np.average(batch_losses)))
                batch_losses = []

    # returns a trained rnn
    return rnn

### Hyperparameters

Set and train the neural network with the following parameters:
- Set `sequence_length` to the length of a sequence.
- Set `batch_size` to the batch size.
- Set `num_epochs` to the number of epochs to train for.
- Set `learning_rate` to the learning rate for an Adam optimizer.
- Set `vocab_size` to the number of unique tokens in our vocabulary.
- Set `output_size` to the desired size of the output.
- Set `embedding_dim` to the embedding dimension; smaller than the vocab_size.
- Set `hidden_dim` to the hidden dimension of your RNN.
- Set `n_layers` to the number of layers/cells in your RNN.
- Set `show_every_n_batches` to the number of batches at which the neural network should print progress.

If the network isn't getting the desired results, tweak these parameters and/or the layers in the `RNN` class.
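
When choosing `sequence_length`, `batch_size`, and `show_every_n_batches`, it helps to know roughly how many batches one epoch contains. The sketch below reproduces the arithmetic used by `batch_data` and the full-batch check in `train_rnn`; the function name `count_training_batches` and the token count of 1,000,000 are assumptions for illustration, not figures from this dataset.

```python
def count_training_batches(n_tokens, sequence_length, batch_size):
    """Estimate the number of full batches per epoch."""
    tokens_per_batch = batch_size * sequence_length
    # batch_data keeps only a whole multiple of batch_size * sequence_length tokens
    kept_tokens = (n_tokens // tokens_per_batch) * tokens_per_batch
    # one (feature, target) pair per sliding-window position
    n_sequences = max(kept_tokens - sequence_length, 0)
    # train_rnn only iterates over completely full batches
    return n_sequences // batch_size


# The small test case from earlier: 50 tokens, windows of 5, batches of 10
print(count_training_batches(50, sequence_length=5, batch_size=10))          # 4

# A hypothetical corpus of one million tokens
print(count_training_batches(1_000_000, sequence_length=10, batch_size=128)) # 7809
```

If `show_every_n_batches` is larger than this count, no progress line is printed within an epoch.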

In [13]:
# Data params
# Sequence Length
sequence_length =  10 # of words in a sequence
# Batch Size
batch_size = 128

# data loader - do not change
train_loader = batch_data(int_text, sequence_length, batch_size)

In [14]:
# Training parameters
# Number of Epochs
num_epochs = 10
# Learning Rate
learning_rate = .001

# Model parameters
# Vocab size
vocab_size = len(vocab_to_int)
# Output size
output_size = vocab_size
# Embedding Dimension
embedding_dim = 256
# Hidden Dimension
hidden_dim = 256
# Number of RNN Layers
n_layers = 2

# Show stats for every n number of batches
show_every_n_batches = 2000

#print('vocab_size', vocab_size)
#print('output_size', output_size)

### Train
In the next cell, you'll train the neural network on the pre-processed data.  If you have a hard time getting a good loss, you may consider changing your hyperparameters. In general, you may get better results with larger `hidden_dim` and `n_layers` values, but larger models take longer to train.
> **You should aim for a loss less than 3.5.** 

You should also experiment with different sequence lengths, which determine the size of the long range dependencies that a model can learn.
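
To put the target loss in context: `nn.CrossEntropyLoss` reports the mean negative log-likelihood in natural-log units, so its exponential can be read as a perplexity, roughly the number of equally likely words the model is choosing between. The sketch below works through that arithmetic; the function names and the vocabulary size of 20,000 are assumptions for illustration.

```python
import math


def perplexity(cross_entropy_loss):
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(cross_entropy_loss)


def uniform_guess_loss(vocab_size):
    """Loss of a model that assigns equal probability to every token."""
    return math.log(vocab_size)


target = perplexity(3.5)
baseline = uniform_guess_loss(20000)  # hypothetical vocabulary size

print(round(target, 1))    # 33.1
print(round(baseline, 2))  # 9.9
```

Under these assumptions, a loss of 3.5 means the model is about as uncertain as a choice between 33 words, compared with all 20,000 for an untrained uniform guess.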

In [15]:
import signal

from contextlib import contextmanager

import requests


DELAY = INTERVAL = 4 * 60  # interval time in seconds
MIN_DELAY = MIN_INTERVAL = 2 * 60
KEEPALIVE_URL = "https://nebula.udacity.com/api/v1/remote/keep-alive"
TOKEN_URL = "http://metadata.google.internal/computeMetadata/v1/instance/attributes/keep_alive_token"
TOKEN_HEADERS = {"Metadata-Flavor":"Google"}


def _request_handler(headers):
    def _handler(signum, frame):
        requests.request("POST", KEEPALIVE_URL, headers=headers)
    return _handler


@contextmanager
def active_session(delay=DELAY, interval=INTERVAL):
    """
    Example:

    from workspace_utils import active_session

    with active_session():
        # do long-running work here
    """
    token = requests.request("GET", TOKEN_URL, headers=TOKEN_HEADERS).text
    headers = {'Authorization': "STAR " + token}
    delay = max(delay, MIN_DELAY)
    interval = max(interval, MIN_INTERVAL)
    original_handler = signal.getsignal(signal.SIGALRM)
    try:
        signal.signal(signal.SIGALRM, _request_handler(headers))
        signal.setitimer(signal.ITIMER_REAL, delay, interval)
        yield
    finally:
        signal.signal(signal.SIGALRM, original_handler)
        signal.setitimer(signal.ITIMER_REAL, 0)


def keep_awake(iterable, delay=DELAY, interval=INTERVAL):
    """
    Example:

    from workspace_utils import keep_awake

    for i in keep_awake(range(5)):
        # do iteration with lots of work here
    """
    with active_session(delay, interval): yield from iterable

In [16]:
"""
DON'T MODIFY ANYTHING IN THIS CELL
"""

#from workspace_utils import active_session

# create model and move to gpu if available
rnn = RNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers, dropout=0.5)
if train_on_gpu:
    rnn.cuda()

# defining loss and optimization functions for training
optimizer = torch.optim.Adam(rnn.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss()


with active_session():
    # training the model
    trained_rnn = train_rnn(rnn, batch_size, optimizer, criterion, num_epochs, show_every_n_batches)

# saving the trained model
helper.save_model('./save/trained_rnn', trained_rnn)
print('Model Trained and Saved')

Training for 10 epoch(s)...
Epoch:    1/10    Loss: 4.9304132409095764

Epoch:    1/10    Loss: 4.503665405154228

Epoch:    1/10    Loss: 4.3687747141122815

Epoch:    2/10    Loss: 4.124601575159148

Epoch:    2/10    Loss: 3.9403717595338823

Epoch:    2/10    Loss: 3.8982175043821337

Epoch:    3/10    Loss: 3.8125579905856415

Epoch:    3/10    Loss: 3.711456812977791

Epoch:    3/10    Loss: 3.692099633693695

Epoch:    4/10    Loss: 3.640902316985238

Epoch:    4/10    Loss: 3.5669717919826507

Epoch:    4/10    Loss: 3.561549678683281

Epoch:    5/10    Loss: 3.518501907395688

Epoch:    5/10    Loss: 3.4592853020429613

Epoch:    5/10    Loss: 3.4562812983989715

Epoch:    6/10    Loss: 3.431078700690545

Epoch:    6/10    Loss: 3.3882167429924013

Epoch:    6/10    Loss: 3.3848298746347427

Epoch:    7/10    Loss: 3.363972353024755

Epoch:    7/10    Loss: 3.3278395879268645

Epoch:    7/10    Loss: 3.3236441373825074

Epoch:    8/10    Loss: 3.3110214215994445

Epoch:    8/1


Model Trained and Saved


### Question: How did you decide on your model hyperparameters? 
For example, did you try different sequence_lengths and find that one size made the model converge faster? What about your hidden_dim and n_layers; how did you decide on those?

**Answer:**

The model's hyperparameters were selected based on the class lectures, which provided starting values and intuitions for each parameter, and on the specification above of a loss below 3.5. The lectures split hyperparameters into two categories. The first is optimizer hyperparameters, which relate more to the optimization and training process than to the model itself: the learning rate, the mini-batch size, and the number of training iterations or epochs. The second is model hyperparameters, which relate to the structure of the model: the number of layers and hidden units, and model-specific hyperparameters for architectures like RNNs.

Initially, no matter how the hyperparameters were changed, the loss would not go below 9.0 over 20 epochs. After the sigmoid activation function was removed from the forward function, the loss started at 5.0. Next, no matter how the hyperparameters were changed, the loss would not go below 3.7. After dropout was removed from the init and forward functions, the loss specification was achieved almost immediately, in fewer than 20 epochs. So for me the primary lesson here has been more about model architecture than about hyperparameters.

Experimenting with sequence lengths from 5 to 20 led to the following conclusion: the lower the sequence length, the faster the 3.5 loss specification is reached. However, the higher the sequence length, the less precise the model turns out to be.

Experimenting with batch sizes from 2 to 256 led to the following conclusion: anything above 128 did not converge to the specified loss, whereas 8 worked. As the lectures pointed out, a larger mini-batch size allows computational boosts from matrix multiplication in the training calculations, but at the expense of more memory for the training process and, generally, more computational resources. In practice, small mini-batch sizes have more noise in their error calculations, and this noise often helps prevent training from stopping at a local minimum on the error curve rather than the global minimum that gives the best model. So while the computational boost is an incentive to increase the mini-batch size, this practical algorithmic benefit is an incentive to make it smaller.

Experimenting with hidden dimensions from 128 to 512, I did not observe much difference. The lectures pointed out that for a neural network to learn to approximate a function or a prediction task, it needs enough "capacity" to learn the function. The more complex the function, the more learning capacity the model needs. The number and architecture of the hidden units is the main measure of a model's learning capacity. If we give the model too much capacity, however, it may tend to overfit and simply memorize the training set. If you find your model overfitting, meaning that the training accuracy is much better than the validation accuracy, you might try decreasing the number of hidden units. You could also use regularization techniques such as dropout or L2 regularization. So, as far as the number of hidden units is concerned, more is generally better: a little larger than the ideal number is not a problem, but a much larger value can often lead to overfitting. If your model is not training, add more hidden units and track the validation error, and keep adding hidden units until the validation error starts getting worse. Another heuristic, for the first hidden layer, is that setting it to a number larger than the number of inputs has been observed to be beneficial in a number of tests.

Experimenting with learning rates from 0.0001 to 0.1 resulted in sticking with a value within the usual range, so that training neither overshoots possible convergence (learning rate too high) nor gets stuck in a local minimum (learning rate too low). The lectures outlined that the learning rate is the multiplier we use to push the weight in the right direction. Calculating the gradient tells us which direction to go to decrease the error, meaning whether we should increase or decrease the current value of the weight. If we had made a miraculously correct choice of learning rate, we would land on the best weight after only one training step. If the learning rate we chose is smaller than the ideal rate, the model can still keep learning until it finds a good value for the weight: each training step takes it closer until it lands on the best weight. If, however, the learning rate is much too small, the training error decreases, but very slowly, and we might go through hundreds or thousands of training steps without reaching the best value for our model. In cases like this, we need to increase the learning rate.

If instead we choose a learning rate larger than the ideal rate, the updated value overshoots the ideal weight, and on the next update it overshoots the other way. It keeps getting closer, though, and will probably converge to a reasonable value. This becomes problematic when the learning rate is much larger than the ideal rate, more than twice as large. In that case the weights take a large step that not only overshoots the ideal weight but gets farther and farther from the best achievable error at every step. A contributor to this divergence is the gradient, which supplies not only a direction but also a magnitude: the slope of the line tangent to the curve at that point. The higher the point is on the curve, the steeper the slope and the larger the gradient, which makes the problem of a large learning rate even worse. So if the training error is actually increasing, we might try decreasing the learning rate and see what happens.
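
The four learning-rate regimes described above can be seen on a toy problem. The following is a minimal sketch, not part of the project code: it assumes a one-dimensional quadratic error `f(w) = 0.5 * curvature * w**2`, for which the one-step "ideal" rate is `1 / curvature`. The function name `gradient_descent` and all values are illustrative assumptions.

```python
def gradient_descent(lr, w0=1.0, curvature=2.0, steps=20):
    """Minimize f(w) = 0.5 * curvature * w**2; return the weight after each step."""
    w = w0
    history = [w]
    for _ in range(steps):
        grad = curvature * w   # slope of f at the current weight
        w = w - lr * grad      # plain gradient descent update
        history.append(w)
    return history


# For this toy quadratic (an assumption), 1 / curvature reaches the minimum in one step.
ideal_lr = 1.0 / 2.0

for label, lr in [
    ("much too small", 0.01),
    ("ideal", ideal_lr),
    ("larger than ideal", 0.8),
    ("more than twice ideal", 1.1),
]:
    final_w = gradient_descent(lr)[-1]
    print(f"{label:>22} (lr={lr}): |w| after 20 steps = {abs(final_w):.4g}")
```

On this toy error surface, a rate between the ideal and twice the ideal flips the sign of `w` at every step while still shrinking it, and a rate above twice the ideal makes `|w|` grow at every step. Real loss surfaces are not quadratic, so the exact thresholds should be read as an illustration only.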

The selection of embedding dimensions and hidden dimensions at 256 was based on the course lectures. The lectures cited experimental results reported in a paper titled "How to Generate a Good Word Embedding", which show that the performance of some tasks improves the larger we make the embedding, at least up to a size of 200. In other tests, however, only marginal improvements are realized beyond a size of 50.

Experimenting with the number of RNN layers, from 2 to 3, was also based on the course lectures. The lectures showed that in practice a three-layer neural net will often outperform a two-layer net, but going even deeper rarely helps much more. The exception is convolutional neural networks, where the deeper they are, the better they perform. The lectures also pointed out that results for character-level language modeling show a depth of at least two to be beneficial, while increasing it to 3 shows mixed results in some cases. With that said, I decided to stick with 2.

The number of epochs was selected based on successfully reaching the loss specification above. The lectures suggested that the metric typically used to choose the right number of iterations or epochs for training is the validation error.
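
Choosing the number of epochs from the validation error is commonly done with early stopping. The sketch below is a hypothetical illustration: this notebook tracks only the training loss, so the `val_losses` values are made up, and the helper name `best_epoch` and its `patience` parameter are assumptions rather than anything provided by the project.

```python
def best_epoch(val_losses, patience=2):
    """Return (epoch_index, loss) of the best validation loss seen before
    `patience` consecutive epochs pass without improvement."""
    best_loss = float("inf")
    best_idx = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # new best: remember it and reset the patience counter
            best_loss, best_idx = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation error stopped improving; stop training here
    return best_idx, best_loss


# Made-up validation losses, one per epoch, for illustration only.
val_losses = [4.2, 3.9, 3.7, 3.6, 3.65, 3.7, 3.5]
print(best_epoch(val_losses))
```

With `patience=2`, the loop stops after two epochs in a row fail to beat the best loss, so a later improvement (the final `3.5` here) is never seen. A larger `patience` trades extra training time for a lower chance of stopping too early.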

---
# Checkpoint

After running the above training cell, your model will be saved by name, `trained_rnn`, and if you save your notebook progress, **you can pause here and come back to this code at another time**. You can resume your progress by running the next cell, which will load in our word:id dictionaries _and_ load in your saved model by name!

In [21]:
"""
DON'T MODIFY ANYTHING IN THIS CELL
"""
import torch
import helper
import problem_unittests as tests

_, vocab_to_int, int_to_vocab, token_dict = helper.load_preprocess()
trained_rnn = helper.load_model('./save/trained_rnn')

## Generate TV Script
With the network trained and saved, you'll use it to generate a new, "fake" Seinfeld TV script in this section.

### Generate Text
To generate the text, the network needs to start with a single word and repeat its predictions until it reaches a set length. You'll be using the `generate` function to do this. It takes a word id to start with, `prime_id`, and generates a set length of text, `predict_len`. Also note that it uses top-k sampling to introduce some randomness in choosing the most likely next word, given an output set of word scores!

In [22]:
"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
import torch.nn.functional as F

def generate(rnn, prime_id, int_to_vocab, token_dict, pad_value, predict_len=100):
    """
    Generate text using the neural network
    :param rnn: The PyTorch Module that holds the trained neural network
    :param prime_id: The word id to start the first prediction
    :param int_to_vocab: Dict of word id keys to word values
    :param token_dict: Dict of punctuation keys to token values
    :param pad_value: The value used to pad a sequence
    :param predict_len: The length of text to generate
    :return: The generated text
    """
    rnn.eval()
    
    # create a sequence (batch_size=1) with the prime_id
    current_seq = np.full((1, sequence_length), pad_value)
    current_seq[-1][-1] = prime_id
    predicted = [int_to_vocab[prime_id]]
    
    for _ in range(predict_len):
        if train_on_gpu:
            current_seq = torch.LongTensor(current_seq).cuda()
        else:
            current_seq = torch.LongTensor(current_seq)
        
        # initialize the hidden state
        hidden = rnn.init_hidden(current_seq.size(0))
        
        # get the output of the rnn
        output, _ = rnn(current_seq, hidden)
        
        # get the next word probabilities
        p = F.softmax(output, dim=1).data
        if(train_on_gpu):
            p = p.cpu() # move to cpu
         
        # use top_k sampling to get the index of the next word
        top_k = 5
        p, top_i = p.topk(top_k)
        top_i = top_i.numpy().squeeze()
        
        # select the likely next word index with some element of randomness
        p = p.numpy().squeeze()
        word_i = np.random.choice(top_i, p=p/p.sum())
        
        # retrieve that word from the dictionary
        word = int_to_vocab[word_i]
        predicted.append(word)     
        
        # the generated word becomes the next "current sequence" and the cycle can continue
        # (move the tensor back to the CPU first so NumPy can roll it)
        current_seq = np.roll(current_seq.cpu().numpy(), -1, 1)
        current_seq[-1][-1] = word_i
    
    gen_sentences = ' '.join(predicted)
    
    # Replace punctuation tokens
    for key, token in token_dict.items():
        ending = ' ' if key in ['\n', '(', '"'] else ''
        gen_sentences = gen_sentences.replace(' ' + token.lower(), key)
    gen_sentences = gen_sentences.replace('\n ', '\n')
    gen_sentences = gen_sentences.replace('( ', '(')
    
    # return all the sentences
    return gen_sentences
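
The top-k sampling step inside `generate` can be looked at in isolation. The following is a minimal standard-library sketch of the same idea (softmax over the scores, keep the `k` most probable indices, sample among them in proportion to their probabilities). The function name `sample_top_k` and the example scores are assumptions for illustration; the notebook itself does this with `torch` and `numpy`.

```python
import math
import random


def sample_top_k(scores, k=5, rng=random):
    """Sample an index from the k highest-scoring entries of a list of scores."""
    # softmax, shifted by the max score for numerical stability
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]

    # indices of the k most probable entries
    top_i = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    top_p = [probs[i] for i in top_i]

    # random.choices normalizes the weights, so only relative size matters
    return rng.choices(top_i, weights=top_p, k=1)[0]


# Made-up word scores; with k=3 only indices 3, 0 and 5 can ever be chosen.
scores = [2.0, 0.5, -1.0, 3.0, 0.0, 1.5, -2.0]
picks = {sample_top_k(scores, k=3, rng=random.Random(seed)) for seed in range(50)}
print(sorted(picks))
```

Restricting the choice to the top `k` words keeps unlikely words out of the script, while sampling (rather than always taking the single best word) is what makes repeated runs produce different scripts.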

### Generate a New Script
It's time to generate the text. Set `gen_length` to the length of TV script you want to generate and set `prime_word` to one of the following to start the prediction:
- "jerry"
- "elaine"
- "george"
- "kramer"

You can set the prime word to _any word_ in our dictionary, but it's best to start with a name for generating a TV script. (You can also start with any other names you find in the original text file!)

In [23]:
# run the cell multiple times to get different results!
gen_length = 400 # modify the length to your preference
prime_word = 'jerry' # name for starting the script

"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
pad_word = helper.SPECIAL_WORDS['PADDING']
generated_script = generate(trained_rnn, vocab_to_int[prime_word + ':'], int_to_vocab, token_dict, vocab_to_int[pad_word], gen_length)
print(generated_script)



jerry: rude, and then, uh, i guess we should have seen you.

hoyt: so, you were going on the phone.

hoyt: oh, no. no one-

stu: no.

elaine: oh, no no.

stu: yes, yes, i got it!

estelle: i guess that was a little problem.

chiles: oh, no. i don't think so. i was a misprint.

george: oh, yeah.

elaine: hey!

estelle: hello.

jerry: hey!

george: i can't believe it, i can't believe that i am.

hoyt: yes, i think i was a little effeminate in a hotel time for nbc?

elaine: yeah, yeah, well, i'm sorry.

george: oh, you know, we could be able to do something.

hoyt: oh, i forgot to get a reverse mein.

hoyt: and the judge is the lowest getaway, and i have to be going.

hoyt: oh.

george: hey, jerry, this is unbelievable.

hoyt: oh, that's not true.

hoyt: you know, i don't know what to do.

jerry: so what is that?

jerry: i know i was in this city.

george: i was wondering if you can go to paris.

chiles:(whistles) oh, that's a good day of honor, but i was a misprint, and the shark charact

#### Save your favorite scripts

Once you have a script that you like (or find interesting), save it to a text file!

In [24]:
# save script to a text file
with open("generated_script_1.txt", "w") as f:
    f.write(generated_script)

# The TV Script is Not Perfect
It's ok if the TV script doesn't make perfect sense. It should look like alternating lines of dialogue; here is an example of a few generated lines.

### Example generated script

>jerry: what about me?
>
>jerry: i don't have to wait.
>
>kramer:(to the sales table)
>
>elaine:(to jerry) hey, look at this, i'm a good doctor.
>
>newman:(to elaine) you think i have no idea of this...
>
>elaine: oh, you better take the phone, and he was a little nervous.
>
>kramer:(to the phone) hey, hey, jerry, i don't want to be a little bit.(to kramer and jerry) you can't.
>
>jerry: oh, yeah. i don't even know, i know.
>
>jerry:(to the phone) oh, i know.
>
>kramer:(laughing) you know...(to jerry) you don't know.

You can see that there are multiple characters that say (somewhat) complete sentences, but it doesn't have to be perfect! It takes quite a while to get good results, and often, you'll have to use a smaller vocabulary (and discard uncommon words), or get more data.  The Seinfeld dataset is about 3.4 MB, which is big enough for our purposes; for script generation you'll want more than 1 MB of text, generally. 
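
One way to use a smaller vocabulary is to replace rare words with a single placeholder token before building the word:id dictionaries. This is a minimal sketch under stated assumptions: the function name `trim_vocab`, the `min_count` threshold, and the `<unk>` token are hypothetical and not part of `helper.py`.

```python
from collections import Counter


def trim_vocab(words, min_count=2, unk_token="<unk>"):
    """Replace every word that occurs fewer than min_count times with unk_token."""
    counts = Counter(words)
    return [w if counts[w] >= min_count else unk_token for w in words]


# Tiny made-up token list, for illustration only.
words = "jerry : hello newman . jerry : hello elaine .".split()
trimmed = trim_vocab(words)
print(trimmed)
print(len(set(words)), "->", len(set(trimmed)), "unique tokens")
```

A smaller vocabulary shrinks the embedding and output layers, at the cost of the model occasionally emitting the placeholder token in generated text.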

# Submitting This Project
When submitting this project, make sure to run all the cells before saving the notebook. Save the notebook file as "dlnd_tv_script_generation.ipynb" and save another copy as an HTML file by clicking "File" -> "Download as.." -> "html". Include the "helper.py" and "problem_unittests.py" files in your submission. Once you download these files, compress them into one zip file for submission.