# TV Script Generation

In this project, you'll use an RNN to generate your own *Seinfeld* TV script. You'll train on the Seinfeld scripts provided in `./data/Seinfeld_Scripts.txt`, and the neural network you build will generate a new, "fake" TV script based on the patterns it recognizes in that training data.

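## Get the Data
The data is already provided for you in `./data/Seinfeld_Scripts.txt`. We encourage you to open that file and look at the text.

>* As a first step, we'll load in this data and look at some samples.
>* Then, you'll be tasked with defining and training an RNN to generate a new script!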

In [1]:
"""
DON'T MODIFY ANYTHING IN THIS CELL
"""
# load in data
import helper
data_dir = './data/Seinfeld_Scripts.txt'
text = helper.load_data(data_dir)

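## Explore the Data
Use `view_line_range` to view different parts of the data. This will give you a basic sense of the data as a whole. You'll notice that the text is all lowercase and that each new line of dialogue is separated by a newline character `\n`.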

In [2]:
view_line_range = (0, 10)

"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
import numpy as np

print('Dataset Stats')
print('Roughly the number of unique words: {}'.format(len({word: None for word in text.split()})))

lines = text.split('\n')
print('Number of lines: {}'.format(len(lines)))
word_count_line = [len(line.split()) for line in lines]
print('Average number of words in each line: {}'.format(np.average(word_count_line)))

print()
print('The lines {} to {}:'.format(*view_line_range))
print('\n'.join(text.split('\n')[view_line_range[0]:view_line_range[1]]))

Dataset Stats
Roughly the number of unique words: 46367
Number of lines: 109233
Average number of words in each line: 5.544240293684143

The lines 0 to 10:
jerry: do you know what this is all about? do you know, why were here? to be out, this is out...and out is one of the single most enjoyable experiences of life. people...did you ever hear people talking about we should go out? this is what theyre talking about...this whole thing, were all out now, no one is home. not one person here is home, were all out! there are people trying to find us, they dont know where we are. (on an imaginary phone) did you ring?, i cant find him. where did he go? he didnt tell me where he was going. he must have gone out. you wanna go out you get ready, you pick out the clothes, right? you take the shower, you get all ready, get the cash, get your friends, the car, the spot, the reservation...then youre standing around, what do you do? you go we gotta be getting back. once youre out, you wanna get back! y

---
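## Implement Pre-processing Functions
The first thing to do to any dataset is pre-processing. Implement the following two pre-processing functions:

- Lookup Table
- Tokenize Punctuation

### Lookup Table
To create a word embedding, you first need to transform the words to ids. In this function, create two dictionaries:

- A dictionary to go from the words to an id, which we'll call `vocab_to_int`
- A dictionary to go from the id to the word, which we'll call `int_to_vocab`

Return these dictionaries in the following tuple: `(vocab_to_int, int_to_vocab)`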

In [3]:
import problem_unittests as tests
import numpy as np
import helper
from collections import Counter

def create_lookup_tables(text):
    """
    Create lookup tables for vocabulary
    :param text: The text of tv scripts split into words
    :return: A tuple of dicts (vocab_to_int, int_to_vocab)
    """
    # TODO: Implement Function
    word_counts = Counter(text)
    # sorting the words from most to least frequent in text occurrence
    sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
    # create int_to_vocab dictionaries
    int_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
    vocab_to_int = {word: ii for ii, word in int_to_vocab.items()}

    # return tuple
    return (vocab_to_int, int_to_vocab)


"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
tests.test_create_lookup_tables(create_lookup_tables)

Tests Passed
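
As a quick illustration of what the two lookup tables contain, here is a minimal sketch on a toy word list. The helper name `build_lookup` and the sample words are made up for this example; it mirrors the frequency-sorted approach used in the cell above.

```python
from collections import Counter

def build_lookup(words):
    """Toy version of the lookup-table idea (hypothetical helper, for illustration only)."""
    counts = Counter(words)
    # the most frequent word gets the smallest id
    vocab = sorted(counts, key=counts.get, reverse=True)
    vocab_to_int = {word: idx for idx, word in enumerate(vocab)}
    int_to_vocab = {idx: word for word, idx in vocab_to_int.items()}
    return vocab_to_int, int_to_vocab

sample_words = "jerry hello jerry elaine jerry hello".split()
vocab_to_int, int_to_vocab = build_lookup(sample_words)

# encode the words as ids, then decode them back
encoded = [vocab_to_int[w] for w in sample_words]
decoded = [int_to_vocab[i] for i in encoded]
print(encoded)
print(decoded == sample_words)
```

Because the two dictionaries are exact inverses of each other, encoding and then decoding returns the original word list.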


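### Tokenize Punctuation
We'll be splitting the script into a word array using spaces as delimiters. However, punctuation such as periods and exclamation marks makes it hard for the neural network to tell that "bye" and "bye!" are the same word.

Implement the function `token_lookup` to return a dict that will be used to tokenize symbols like "!" into "||Exclamation_Mark||". Create a dictionary for the following symbols, where the symbol is the key and the token is the value: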

- period ( . )
- comma ( , )
- quotation mark ( " )
- semicolon ( ; )
- exclamation mark ( ! )
- question mark ( ? )
- left parenthesis ( ( )
- right parenthesis ( ) )
- dash ( - )
- return ( \n )

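This dictionary will be used to tokenize the symbols and add a delimiter (space) around them. This separates each symbol into its own word, making it easier for the neural network to predict the next word. Make sure you don't use a value that could be confused with a word; for example, instead of using the value "dash", try using something like "||dash||".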

In [4]:
def token_lookup():
    """
    Generate a dict to turn punctuation into a token.
    :return: Tokenized dictionary where the key is the punctuation and the value is the token
    """
    # TODO: Implement Function
    token_lookup_dict = {}
    token_lookup_dict['.'] = '||PERIOD||'
    token_lookup_dict[','] = '||COMMA||'
    token_lookup_dict['"'] = '||QUOTATION_MARK||'
    token_lookup_dict[';'] = '||SEMICOLON||'
    token_lookup_dict['!'] = '||EXCLAMATION_MARK||'
    token_lookup_dict['?'] = '||QUESTION_MARK||'
    token_lookup_dict['('] = '||LEFT_PARENTHESES||'
    token_lookup_dict[')'] = '||RIGHT_PARENTHESES||'
    token_lookup_dict['-'] = '||DASH||'
    token_lookup_dict['\n'] = '||RETURN||'

    return token_lookup_dict

"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
tests.test_tokenize(token_lookup)

Tests Passed
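
To see how such a dictionary is meant to be used, the sketch below pads each punctuation symbol with spaces and swaps it for its token before splitting on whitespace. This is an assumption about the general approach; the actual logic lives in `helper.preprocess_and_save_data`, and the names `tokenize` and `toy_tokens` are hypothetical.

```python
def tokenize(text, token_dict):
    """Replace each punctuation symbol with ' token ' and split the text into words."""
    for symbol, token in token_dict.items():
        text = text.replace(symbol, ' {} '.format(token))
    # lowercasing also lowercases the tokens themselves
    return text.lower().split()

# a reduced token dictionary, just for this example
toy_tokens = {
    '.': '||PERIOD||',
    '!': '||EXCLAMATION_MARK||',
    '?': '||QUESTION_MARK||',
}

words = tokenize("bye! see you. bye?", toy_tokens)
print(words)
```

After tokenization, both occurrences of "bye" map to the same word, and each punctuation mark becomes its own vocabulary entry.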


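## Pre-process All the Data and Save It
Running the code cell below will pre-process all the data and save it to file. You're encouraged to look at the code for `preprocess_and_save_data` in the `helper.py` file to see what it's doing in detail, but you do not need to change that code.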

In [5]:
"""
DON'T MODIFY ANYTHING IN THIS CELL
"""
# pre-process training data
helper.preprocess_and_save_data(data_dir, token_lookup, create_lookup_tables)

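# Checkpoint
This is your first checkpoint. If you ever decide to come back to this notebook or have to restart the notebook, you can start from here. The pre-processed data has been saved to disk.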

In [6]:
"""
DON'T MODIFY ANYTHING IN THIS CELL
"""
import helper
import problem_unittests as tests

int_text, vocab_to_int, int_to_vocab, token_dict = helper.load_preprocess()

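## Build the Neural Network
In this section, you'll build the components necessary for an RNN by implementing the RNN Module along with the forward and backpropagation functions.

### Check Access to GPU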

In [7]:
"""
DON'T MODIFY ANYTHING IN THIS CELL
"""
import torch

# Check for a GPU
train_on_gpu = torch.cuda.is_available()
if not train_on_gpu:
    print('No GPU found. Please use a GPU to train your neural network.')

No GPU found. Please use a GPU to train your neural network.


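## Input
Let's start with the pre-processed input data. We'll use [TensorDataset](http://pytorch.org/docs/master/data.html#torch.utils.data.TensorDataset) to provide a known format for our dataset; in combination with [DataLoader](http://pytorch.org/docs/master/data.html#torch.utils.data.DataLoader), it will handle batching, shuffling, and other dataset iteration functions.

You can create data with TensorDataset by passing in feature and target tensors. Then create a DataLoader as usual.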
```
data = TensorDataset(feature_tensors, target_tensors)
data_loader = torch.utils.data.DataLoader(data, 
                                          batch_size=batch_size)
```

### Batching
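Implement the `batch_data` function to batch `words` data into chunks of size `batch_size` using the `TensorDataset` and `DataLoader` classes.

>You can batch words using the DataLoader, but it is up to you to create `feature_tensors` and `target_tensors` of the correct size and content for a given `sequence_length`.

For example, say we have these as input: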
```
words = [1, 2, 3, 4, 5, 6, 7]
sequence_length = 4
```

Your first `feature_tensor` should contain the values:
```
[1, 2, 3, 4]
```
And the corresponding `target_tensor` should just be the next word id:
```
5
```
The second `feature_tensor`, `target_tensor` pair should then look like this:
```
[2, 3, 4, 5]  # features
6             # target
```
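
The sliding-window scheme described above can be sketched in plain Python, without tensors or a DataLoader. The function name `sliding_windows` is hypothetical and only serves this illustration.

```python
def sliding_windows(words, sequence_length):
    """Return (features, targets): each feature is a window of ids, each target the id that follows it."""
    features, targets = [], []
    for start in range(len(words) - sequence_length):
        end = start + sequence_length
        features.append(words[start:end])
        targets.append(words[end])
    return features, targets

words = [1, 2, 3, 4, 5, 6, 7]
features, targets = sliding_windows(words, sequence_length=4)
for x, y in zip(features, targets):
    print(x, '->', y)
```

With 7 words and a sequence length of 4, only 3 complete feature/target pairs exist, since every window needs one following word to serve as its target.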

In [8]:
from torch.utils.data import TensorDataset, DataLoader


def batch_data(words, sequence_length, batch_size):
    """
    Batch the neural network data using DataLoader
    :param words: The word ids of the TV scripts
    :param sequence_length: The sequence length of each batch
    :param batch_size: The size of each batch; the number of sequences in a batch
    :return: DataLoader with batched data
    """
    # TODO: Implement function
    n_batches = len(words) // batch_size
    # only full batches
    words = words[:n_batches * batch_size]
    y_len = len(words) - sequence_length
    x, y = [], []
    for idx in range(0, y_len):
        idx_end = sequence_length + idx
        x_batch = words[idx:idx_end]
        x.append(x_batch)
        # print("Feature: ",x_batch)
        y_batch = words[idx_end]
        # print("Target: ", y_batch)
        y.append(y_batch)

    # create Tensor datasets, with word ids stored as int64 (long) tensors
    data = TensorDataset(torch.from_numpy(np.asarray(x)).long(), torch.from_numpy(np.asarray(y)).long())
    # make sure to SHUFFLE your training data
    data_loader = DataLoader(data, shuffle=True, batch_size=batch_size)

    # return a dataloader
    return data_loader

# there is no test for this function, but you are encouraged to create
# print statements and tests of your own


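### Test your dataloader

You'll have to modify this code to test a batching function, but it should look fairly similar.

Below, we generate some test text data and define a dataloader using the function you wrote above. Then we get one sample batch of inputs `sample_x` and targets `sample_y` from the dataloader.

Your code should return something like the following (likely in a different order, if you shuffled your data):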

```
torch.Size([10, 5])
tensor([[ 28,  29,  30,  31,  32],
        [ 21,  22,  23,  24,  25],
        [ 17,  18,  19,  20,  21],
        [ 34,  35,  36,  37,  38],
        [ 11,  12,  13,  14,  15],
        [ 23,  24,  25,  26,  27],
        [  6,   7,   8,   9,  10],
        [ 38,  39,  40,  41,  42],
        [ 25,  26,  27,  28,  29],
        [  7,   8,   9,  10,  11]])

torch.Size([10])
tensor([ 33,  26,  22,  39,  16,  28,  11,  43,  30,  12])
```

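### Sizes
Your sample_x should be of size `(batch_size, sequence_length)`, or (10, 5) in this case, and sample_y should have just one dimension: batch_size (10).

### Values

You should also notice that the targets, sample_y, are the *next* value in the ordered test_text data. So, for an input sequence `[ 28,  29,  30,  31,  32]` that ends with the value `32`, the corresponding output should be `33`.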

In [9]:
# test dataloader

test_text = range(50)
t_loader = batch_data(test_text, sequence_length=5, batch_size=10)

data_iter = iter(t_loader)
sample_x, sample_y = next(data_iter)  # the built-in next() works regardless of PyTorch version

print(sample_x.shape)
print(sample_x)
print()
print(sample_y.shape)
print(sample_y)

torch.Size([10, 5])
tensor([[ 38,  39,  40,  41,  42],
        [ 13,  14,  15,  16,  17],
        [  7,   8,   9,  10,  11],
        [ 17,  18,  19,  20,  21],
        [  0,   1,   2,   3,   4],
        [ 20,  21,  22,  23,  24],
        [ 44,  45,  46,  47,  48],
        [  9,  10,  11,  12,  13],
        [ 15,  16,  17,  18,  19],
        [ 30,  31,  32,  33,  34]])

torch.Size([10])
tensor([ 43,  18,  12,  22,   5,  25,  49,  14,  20,  35])


---
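## Build the Neural Network
Implement an RNN using PyTorch's [Module class](http://pytorch.org/docs/master/nn.html#torch.nn.Module). You may choose to use a GRU or an LSTM. To complete the RNN, you'll have to implement the following functions for the class:
 - `__init__` - The initialize function.
 - `init_hidden` - The initialization function for an LSTM/GRU hidden state
 - `forward` - Forward propagation function.
 
The initialize function should create the layers of the neural network and save them to the class. The forward propagation function will use these layers to run forward propagation and generate an output and a hidden state.

**The output of this model should be the *last* batch of word scores** after a complete sequence has been processed. That is, for each input sequence of words, we only want to output the word scores for a single next word.

### Hints

1. Make sure to stack the outputs of the lstm to pass to your fully-connected layer; you can do this with `lstm_output = lstm_output.contiguous().view(-1, self.hidden_dim)`
2. You can get the last batch of word scores by reshaping the output of the final, fully-connected layer like so: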

```
# reshape into (batch_size, seq_length, output_size)
output = output.view(batch_size, -1, self.output_size)
# get last batch
out = output[:, -1]
```
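
The reshape in hint 2 works because flattening a `(batch_size, seq_length, ...)` tensor keeps all time steps of one sequence next to each other (this relies on a batch-first layout). The sketch below imitates the three steps with plain Python lists instead of tensors; it is an illustrative stand-in, not PyTorch code, and the labels such as `'b0t2'` are made up so that each position stays traceable.

```python
batch_size, seq_length = 2, 3

# one label per (batch, time step) position, standing in for a row of word scores
scores = [['b{}t{}'.format(b, t) for t in range(seq_length)]
          for b in range(batch_size)]

# like .view(-1, hidden_dim): collapse batch and time into one leading axis
flat = [row for seq in scores for row in seq]

# like .view(batch_size, -1, output_size): regroup the rows into sequences
regrouped = [flat[b * seq_length:(b + 1) * seq_length] for b in range(batch_size)]

# like output[:, -1]: keep only the last time step of every sequence
last_step = [seq[-1] for seq in regrouped]
print(last_step)
```

Each entry of `last_step` belongs to the final time step of one sequence in the batch, which is the only position whose scores are needed to predict the next word.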

In [10]:
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, dropout=0.5):
        """
        Initialize the PyTorch RNN Module
        :param vocab_size: The number of input dimensions of the neural network (the size of the vocabulary)
        :param output_size: The number of output dimensions of the neural network
        :param embedding_dim: The size of embeddings, should you choose to use them
        :param hidden_dim: The size of the hidden layer outputs
        :param n_layers: The number of stacked LSTM/GRU layers
        :param dropout: dropout to add in between LSTM/GRU layers
        """
        super(RNN, self).__init__()
        # TODO: Implement function

        # set class variables
        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim

        # define model layers
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=dropout, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_size)

    def forward(self, nn_input, hidden):
        """
        Forward propagation of the neural network
        :param nn_input: The input to the neural network
        :param hidden: The hidden state
        :return: Two Tensors, the output of the neural network and the latest hidden state
        """
        # TODO: Implement function
        # print(nn_input.type())  LongTensor
        embedding_output = self.embedding(nn_input)
        # print(embedding_output.type())  torch.FloatTensor

        # print(hidden[0].type())  torch.FloatTensor
        lstm_output, hidden = self.lstm(embedding_output, hidden)

        # copy from above
        lstm_output = lstm_output.contiguous().view(-1, self.hidden_dim)
        out = self.fc(lstm_output)

        batch_size = nn_input.size(0)
        out = out.view(batch_size, -1, self.output_size)
        out = out[:, -1]

        # return one batch of output word scores and the hidden state
        return out, hidden

    def init_hidden(self, batch_size):
        '''
        Initialize the hidden state of an LSTM/GRU
        :param batch_size: The batch_size of the hidden state
        :return: hidden state of dims (n_layers, batch_size, hidden_dim)
        '''
        # Implement function
        weight = next(self.parameters()).data

        if (train_on_gpu):
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())

        # initialize hidden state with zero weights, and move to GPU if available
        return hidden

"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
tests.test_rnn(RNN, train_on_gpu)

Tests Passed


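### Define forward and backpropagation

Use the RNN class you implemented to apply forward and back propagation. This function will be called, iteratively, in the training loop as follows: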
```
loss, hidden = forward_back_prop(rnn, optimizer, criterion, inputs, labels, hidden)
```

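It should return the average loss over a batch and the hidden state returned by a call to `RNN(inp, hidden)`. Recall that you can get this loss by computing it, as usual, and calling `loss.item()`.

**If a GPU is available, you should move your data to that GPU device, here.**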

In [11]:
def forward_back_prop(network, optimizer, criterion, inputs, targets, hidden):
    """
    Forward and backward propagation on the neural network
    :param network: The PyTorch Module that holds the neural network
    :param optimizer: The PyTorch optimizer for the neural network
    :param criterion: The PyTorch loss function
    :param inputs: A batch of input to the neural network
    :param targets: The target output for the batch of input
    :param hidden: The hidden state carried over from the previous batch
    :return: The loss and the latest hidden state Tensor
    """

    # TODO: Implement Function
    clip = 5

    # move data to GPU, if available
    if (train_on_gpu):
        network.cuda()
        inputs = inputs.cuda()
        targets = targets.cuda()

    # perform backpropagation and optimization
    # detach the hidden state from its history so gradients don't flow back through earlier batches
    h = tuple([each.data for each in hidden])
    network.zero_grad()  # reset the gradients of the model parameters to zero
    output, h = network(inputs, h)
    loss = criterion(output, targets)  # compute the loss
    loss.backward()  # backpropagate to compute the gradients
    # clip the gradient norm to guard against exploding gradients
    nn.utils.clip_grad_norm_(network.parameters(), clip)
    optimizer.step()  # update all parameters

    # return the loss over a batch and the hidden state produced by our model
    return loss.item(), h

# Note that these tests aren't completely extensive.
# they are here to act as general checks on the expected outputs of your functions
"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
tests.test_forward_back_prop(RNN, forward_back_prop, train_on_gpu)

Tests Passed
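
`forward_back_prop` above clips gradients with `nn.utils.clip_grad_norm_` before the optimizer step. As a rough sketch of the idea (plain Python lists instead of tensors; `clip_by_global_norm` is a hypothetical helper, not a PyTorch function), global-norm clipping rescales all gradients together whenever their combined L2 norm exceeds the threshold:

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale every gradient by the same factor so the overall L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = max_norm / (total_norm + 1e-6)
    if scale < 1.0:
        grads = [[g * scale for g in grad] for grad in grads]
    return grads, total_norm

# two "parameters" whose gradients have a combined norm of sqrt(36 + 64) = 10
grads = [[6.0, 0.0], [0.0, 8.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=5)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))
print(norm_before, round(norm_after, 3))
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length shrinks.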


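## Neural Network Training

With the structure of the network complete and the data ready to be fed in, it's time to train it.

### Train Loop

The training loop is implemented for you in the `train_rnn` function. This function trains the network over all the batches for the given number of epochs. Training progress is printed every fixed number of batches; that number is set with the `show_every_n_batches` parameter. You'll set this parameter along with the other parameters in the next section.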

In [12]:
"""
DON'T MODIFY ANYTHING IN THIS CELL
"""
from workspace_utils import keep_awake

def train_rnn(rnn, batch_size, optimizer, criterion, n_epochs, show_every_n_batches=100):
    batch_losses = []
    
    rnn.train()

    print("Training for %d epoch(s)..." % n_epochs)
#     for epoch_i in range(1, n_epochs + 1):
    for epoch_i in keep_awake(range(1, n_epochs + 1)):
        
        # initialize hidden state
        hidden = rnn.init_hidden(batch_size)
        
#         for batch_i, (inputs, labels) in enumerate(train_loader, 1):
        for batch_i, (inputs, labels)  in keep_awake(enumerate(train_loader, 1)):
            
            # make sure you iterate over completely full batches, only
            n_batches = len(train_loader.dataset)//batch_size
            if(batch_i > n_batches):
                break
            
            # forward, back prop
            loss, hidden = forward_back_prop(rnn, optimizer, criterion, inputs, labels, hidden)          
            # record loss
            batch_losses.append(loss)

            # printing loss stats
            if batch_i % show_every_n_batches == 0:
                print('Epoch: {:>4}/{:<4} Batch: {:>4}/{:<4} Loss: {}'.format(
                    epoch_i, n_epochs,batch_i,len(train_loader), np.average(batch_losses)))
                batch_losses = []

    # returns a trained rnn
    return rnn

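### Hyperparameters

Set and train the neural network with the following parameters:
-  Set `sequence_length` to the length of a sequence.
-  Set `batch_size` to the batch size.
-  Set `num_epochs` to the number of epochs to train for.
-  Set `learning_rate` to the learning rate for an Adam optimizer.
-  Set `vocab_size` to the number of unique tokens in our vocabulary.
-  Set `output_size` to the desired size of the output.
-  Set `embedding_dim` to the embedding dimension; smaller than the vocab_size.
-  Set `hidden_dim` to the hidden dimension of your RNN.
-  Set `n_layers` to the number of layers in your RNN.
-  Set `show_every_n_batches` to the number of batches at which the neural network should print progress.

If the network isn't getting the desired results, tweak the parameters above or the layers in the `RNN` class.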

In [13]:
%%time
# Data params
# Sequence Length
# set the hyperparameters
sequence_length = 131        # number of words in a sequence; total words: 892,110: factors are 30, 131, 227
batch_size = 128
train_loader = batch_data(int_text, sequence_length, batch_size)

CPU times: user 13.5 s, sys: 1.2 s, total: 14.7 s
Wall time: 14.8 s
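
The batch counts that appear in the training log follow directly from these settings. A short worked calculation, assuming the 892,110 total word ids noted in the comment above and the trimming logic in `batch_data`:

```python
import math

total_words = 892110      # taken from the comment in the cell above
sequence_length = 131
batch_size = 128

# batch_data first trims the word list to a multiple of batch_size
trimmed = (total_words // batch_size) * batch_size
# each sample needs sequence_length input words plus one target word
n_samples = trimmed - sequence_length

# DataLoader keeps the final partial batch by default, so its length rounds up
loader_len = math.ceil(n_samples / batch_size)
# the training loop only uses completely full batches
full_batches = n_samples // batch_size
print(trimmed, n_samples, loader_len, full_batches)
```

The rounded-up loader length is the denominator printed in each log line, while the training loop stops one batch earlier so that every batch matches the size of the hidden state.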


In [14]:
%%time
# Training parameters
# set the training parameters
num_epochs = 5
learning_rate = 0.0015       # 0.01 is worse

# set the model parameters
vocab_size = len(vocab_to_int)
output_size = vocab_size
embedding_dim = 200        # 128 is worse
hidden_dim = 300
n_layers = 2

# show stats for every n number of batches
show_every_n_batches = 50

CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 8.34 µs


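### Train
In the next cell, you'll train the neural network on the pre-processed data. If you have a hard time getting a good loss, consider changing your hyperparameters. In general, you may get better results with a larger `hidden_dim` and more layers (`n_layers`), but larger models take longer to train.
> **You should aim for a loss less than 3.5.**

You should also experiment with different sequence lengths, which determine the size of the long-range dependencies that the model can learn.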

In [15]:
%%time
"""
DON'T MODIFY ANYTHING IN THIS CELL
"""

# create model and move to gpu if available
rnn = RNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers, dropout=0.0)
if train_on_gpu:
    rnn.cuda()

# defining loss and optimization functions for training
optimizer = torch.optim.Adam(rnn.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss()
# criterion = nn.BCELoss()

# training the model
from workspace_utils import active_session
with active_session():
    trained_rnn = train_rnn(rnn, batch_size, optimizer, criterion, num_epochs, show_every_n_batches)

# saving the trained model
helper.save_model('./save/trained_rnn', trained_rnn)
print('Model Trained and Saved')

Training for 5 epoch(s)...
Epoch:    1/5    Batch:   50/6968 Loss: 6.770396203994751
Epoch:    1/5    Batch:  100/6968 Loss: 6.059014739990235
Epoch:    1/5    Batch:  150/6968 Loss: 5.743045816421509
Epoch:    1/5    Batch:  200/6968 Loss: 5.467192850112915
Epoch:    1/5    Batch:  250/6968 Loss: 5.416350860595703
Epoch:    1/5    Batch:  300/6968 Loss: 5.181690368652344
Epoch:    1/5    Batch:  350/6968 Loss: 5.152898540496826
Epoch:    1/5    Batch:  400/6968 Loss: 5.006557598114013
Epoch:    1/5    Batch:  450/6968 Loss: 4.979735336303711
Epoch:    1/5    Batch:  500/6968 Loss: 4.969635744094848
Epoch:    1/5    Batch:  550/6968 Loss: 4.948164777755737
Epoch:    1/5    Batch:  600/6968 Loss: 4.905765752792359
Epoch:    1/5    Batch:  650/6968 Loss: 4.85688157081604
Epoch:    1/5    Batch:  700/6968 Loss: 4.78151798248291
Epoch:    1/5    Batch:  750/6968 Loss: 4.75894642829895
Epoch:    1/5    Batch:  800/6968 Loss: 4.771385827064514
Epoch:    1/5    Batch:  850/6968 Loss: 4.772740

Epoch:    2/5    Batch:  150/6968 Loss: 3.828703536987305
Epoch:    2/5    Batch:  200/6968 Loss: 3.8435431480407716
Epoch:    2/5    Batch:  250/6968 Loss: 3.838767156600952
Epoch:    2/5    Batch:  300/6968 Loss: 3.842117109298706
Epoch:    2/5    Batch:  350/6968 Loss: 3.8312032651901244
Epoch:    2/5    Batch:  400/6968 Loss: 3.8184406042098997
Epoch:    2/5    Batch:  450/6968 Loss: 3.8819620037078857
Epoch:    2/5    Batch:  500/6968 Loss: 3.780607290267944
Epoch:    2/5    Batch:  550/6968 Loss: 3.767456088066101
Epoch:    2/5    Batch:  650/6968 Loss: 3.865486989021301
Epoch:    2/5    Batch:  700/6968 Loss: 3.8399445247650146
Epoch:    2/5    Batch:  750/6968 Loss: 3.7438284397125243
Epoch:    2/5    Batch:  800/6968 Loss: 3.809678544998169
Epoch:    2/5    Batch:  850/6968 Loss: 3.826849718093872
Epoch:    2/5    Batch:  900/6968 Loss: 3.7980453681945803
Epoch:    2/5    Batch:  950/6968 Loss: 3.8657334661483764
Epoch:    2/5    Batch: 1000/6968 Loss: 3.7809427261352537
Epoch

Epoch:    3/5    Batch:  300/6968 Loss: 3.5459644746780397
Epoch:    3/5    Batch:  350/6968 Loss: 3.5757032871246337
Epoch:    3/5    Batch:  400/6968 Loss: 3.5421405744552614
Epoch:    3/5    Batch:  450/6968 Loss: 3.5473378133773803
Epoch:    3/5    Batch:  500/6968 Loss: 3.541634497642517
Epoch:    3/5    Batch:  550/6968 Loss: 3.542183532714844
Epoch:    3/5    Batch:  600/6968 Loss: 3.5846012449264526
Epoch:    3/5    Batch:  650/6968 Loss: 3.5421532726287843
Epoch:    3/5    Batch:  700/6968 Loss: 3.5208300590515136
Epoch:    3/5    Batch:  750/6968 Loss: 3.494650206565857
Epoch:    3/5    Batch:  800/6968 Loss: 3.603822875022888
Epoch:    3/5    Batch:  850/6968 Loss: 3.5291358470916747
Epoch:    3/5    Batch:  900/6968 Loss: 3.562524509429932
Epoch:    3/5    Batch:  950/6968 Loss: 3.5284580421447753
Epoch:    3/5    Batch: 1000/6968 Loss: 3.5546377801895144
Epoch:    3/5    Batch: 1050/6968 Loss: 3.5352625274658203
Epoch:    3/5    Batch: 1100/6968 Loss: 3.567787356376648
Epo

Epoch:    4/5    Batch:  350/6968 Loss: 3.3428549861907957
Epoch:    4/5    Batch:  400/6968 Loss: 3.326801495552063
Epoch:    4/5    Batch:  450/6968 Loss: 3.3466720628738402
Epoch:    4/5    Batch:  500/6968 Loss: 3.337438201904297
Epoch:    4/5    Batch:  550/6968 Loss: 3.273859796524048
Epoch:    4/5    Batch:  600/6968 Loss: 3.392932024002075
Epoch:    4/5    Batch:  650/6968 Loss: 3.360913109779358
Epoch:    4/5    Batch:  700/6968 Loss: 3.3094553470611574
Epoch:    4/5    Batch:  750/6968 Loss: 3.3519128513336183
Epoch:    4/5    Batch:  800/6968 Loss: 3.35781286239624
Epoch:    4/5    Batch:  850/6968 Loss: 3.371531901359558
Epoch:    4/5    Batch:  900/6968 Loss: 3.3097916889190673
Epoch:    4/5    Batch:  950/6968 Loss: 3.3659841632843017
Epoch:    4/5    Batch: 1000/6968 Loss: 3.373643116950989
Epoch:    4/5    Batch: 1050/6968 Loss: 3.3329677104949953
Epoch:    4/5    Batch: 1100/6968 Loss: 3.3693420362472533
Epoch:    4/5    Batch: 1150/6968 Loss: 3.3326178121566774
Epoch:

Epoch:    5/5    Batch:  400/6968 Loss: 3.206629333496094
Epoch:    5/5    Batch:  450/6968 Loss: 3.146385383605957
Epoch:    5/5    Batch:  500/6968 Loss: 3.207815523147583
Epoch:    5/5    Batch:  550/6968 Loss: 3.1660525512695314
Epoch:    5/5    Batch:  600/6968 Loss: 3.171525731086731
Epoch:    5/5    Batch:  650/6968 Loss: 3.171633882522583
Epoch:    5/5    Batch:  700/6968 Loss: 3.175555648803711
Epoch:    5/5    Batch:  750/6968 Loss: 3.185321192741394
Epoch:    5/5    Batch:  800/6968 Loss: 3.1240369081497192
Epoch:    5/5    Batch:  850/6968 Loss: 3.215079312324524
Epoch:    5/5    Batch:  900/6968 Loss: 3.207778687477112
Epoch:    5/5    Batch:  950/6968 Loss: 3.181842427253723
Epoch:    5/5    Batch: 1000/6968 Loss: 3.172011947631836
Epoch:    5/5    Batch: 1050/6968 Loss: 3.203216691017151
Epoch:    5/5    Batch: 1100/6968 Loss: 3.130688509941101
Epoch:    5/5    Batch: 1150/6968 Loss: 3.128520426750183
Epoch:    5/5    Batch: 1200/6968 Loss: 3.1569515466690063
Epoch:    5

  "type " + obj.__name__ + ". It won't be checked "


Model Trained and Saved
CPU times: user 3h 18min 4s, sys: 1h 34min 4s, total: 4h 52min 8s
Wall time: 4h 52min 45s
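
To put the loss values in the log into perspective: `nn.CrossEntropyLoss` reports the average negative log-likelihood (in nats) of the correct next word, so `exp(loss)` can be read as a perplexity, roughly the number of equally likely words the model is choosing between. A small calculation, using the vocabulary size of 21,388 that the printout of the trained model reports for its embedding layer:

```python
import math

vocab_size = 21388   # size of the embedding table reported for the trained model

# loss of a model that guesses uniformly over the whole vocabulary
uniform_loss = math.log(vocab_size)

# perplexity corresponding to the target loss of 3.5
target_perplexity = math.exp(3.5)

print('uniform-guess loss: {:.2f}'.format(uniform_loss))
print('perplexity at loss 3.5: {:.1f}'.format(target_perplexity))
```

Seen this way, reaching the target loss means narrowing the next-word choice from the full vocabulary down to a few dozen plausible candidates.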


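### Question: How did you decide on your model hyperparameters?
For example, did you try different sequence_lengths and find that one size made the model converge faster? What about your hidden_dim and n_layers; how did you decide on those?

**Answer:** (Write answer, here)    
<font color=red>1. sequence_length was set to 131: the total number of words is 892,110, whose factors are 30, 131 and 227, and 131 was chosen from among them;</font>[How to Generate Music using a LSTM Neural Network in Keras](https://towardsdatascience.com/how-to-generate-music-using-a-lstm-neural-network-in-keras-68786834d4c5)   
<font color=red>2. The layer sizes were reduced from ~~300~~ to 200 and from ~~512~~ to 300 (the values used above for `embedding_dim` and `hidden_dim`), chosen based on the recommended paper together with the running time</font>[A Long Short-Term Memory Model for Answer Sentence Selection in Question Answering](http://www.aclweb.org/anthology/P15-2116)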

---
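# Checkpoint

After running the training cell above, your model has been saved under the name `trained_rnn`. If you save your notebook progress, **you can come back to your code and results at any later time**. The cell below reloads your results!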

In [15]:
"""
DON'T MODIFY ANYTHING IN THIS CELL
"""
import torch
import helper
import problem_unittests as tests

_, vocab_to_int, int_to_vocab, token_dict = helper.load_preprocess()
trained_rnn = helper.load_model('./save/trained_rnn')

In [16]:
trained_rnn

RNN(
  (embedding): Embedding(21388, 200)
  (lstm): LSTM(200, 300, num_layers=2, batch_first=True)
  (fc): Linear(in_features=300, out_features=21388, bias=True)
)
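
As a sanity check on the size of this model, the parameter count can be worked out by hand from the printed layer shapes. The sketch assumes PyTorch's standard LSTM parameterization (four gates, each with input weights, recurrent weights and two bias vectors per layer); `lstm_layer_params` is a hypothetical helper.

```python
def lstm_layer_params(input_size, hidden_size):
    """Parameters of one LSTM layer: 4 gates x (input weights + recurrent weights + 2 biases)."""
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)

vocab_size, embedding_dim, hidden_dim = 21388, 200, 300

embedding = vocab_size * embedding_dim
# the first layer reads embeddings, the second reads the first layer's hidden state
lstm = lstm_layer_params(embedding_dim, hidden_dim) + lstm_layer_params(hidden_dim, hidden_dim)
fc = hidden_dim * vocab_size + vocab_size   # weights + bias

total = embedding + lstm + fc
print('embedding: {:,}  lstm: {:,}  fc: {:,}  total: {:,}'.format(embedding, lstm, fc, total))
```

Most of the parameters sit in the embedding and output layers, both of which scale with the vocabulary size rather than with the recurrent part of the network.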

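## Generate TV Script
You can now generate your own "fake" TV script!

### Generate Text
To generate the text, the neural network starts with a single word and repeats its predictions until the script reaches the length you asked for. You'll use the `generate` function to do this. It takes a word id to start with, `prime_id`, and generates a set length of text, `predict_len`. It also uses top-k sampling to introduce some randomness into the choice of the next word!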

In [17]:
"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
import torch.nn.functional as F

def generate(rnn, prime_id, int_to_vocab, token_dict, pad_value, predict_len=100):
    """
    Generate text using the neural network
    :param rnn: The PyTorch Module that holds the trained neural network
    :param prime_id: The word id to start the first prediction
    :param int_to_vocab: Dict of word id keys to word values
    :param token_dict: Dict of punctuation symbol keys to token values
    :param pad_value: The value used to pad a sequence
    :param predict_len: The length of text to generate
    :return: The generated text
    """
    rnn.eval()
    
    # create a sequence (batch_size=1) with the prime_id
    current_seq = np.full((1, sequence_length), pad_value)
    current_seq[-1][-1] = prime_id
    predicted = [int_to_vocab[prime_id]]
    
    for _ in range(predict_len):
        if train_on_gpu:
            current_seq = torch.LongTensor(current_seq).cuda()
        else:
            current_seq = torch.LongTensor(current_seq)
        
        # initialize the hidden state
        hidden = rnn.init_hidden(current_seq.size(0))
        
        # get the output of the rnn
        output, _ = rnn(current_seq, hidden)
        
        # get the next word probabilities
        p = F.softmax(output, dim=1).data
        if(train_on_gpu):
            p = p.cpu() # move to cpu
         
        # use top_k sampling to get the index of the next word
        top_k = 5
        p, top_i = p.topk(top_k)
        top_i = top_i.numpy().squeeze()
        
        # select the likely next word index with some element of randomness
        p = p.numpy().squeeze()
        word_i = np.random.choice(top_i, p=p/p.sum())
        
        # retrieve that word from the dictionary
        word = int_to_vocab[word_i]
        predicted.append(word)     
        
        # the generated word becomes the next "current sequence" and the cycle can continue
        # (move the tensor to the CPU first, since np.roll cannot operate on a CUDA tensor)
        current_seq = np.roll(current_seq.cpu(), -1, 1)
        current_seq[-1][-1] = word_i
    
    gen_sentences = ' '.join(predicted)
    
    # Replace punctuation tokens
    for key, token in token_dict.items():
        ending = ' ' if key in ['\n', '(', '"'] else ''
        gen_sentences = gen_sentences.replace(' ' + token.lower(), key)
    gen_sentences = gen_sentences.replace('\n ', '\n')
    gen_sentences = gen_sentences.replace('( ', '(')
    
    # return all the sentences
    return gen_sentences
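
The top-k sampling step inside `generate` can be isolated into a small dependency-free sketch. It uses Python's `random` and `math` modules instead of PyTorch and NumPy; `sample_top_k` and the example scores are hypothetical and only illustrate the technique.

```python
import math
import random

def sample_top_k(scores, k=5, rng=random):
    """Pick an index among the k highest scores, weighted by their softmax probabilities."""
    # softmax over all scores (subtract the max for numerical stability)
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]

    # keep the k most probable indices
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [probs[i] for i in top]

    # random.choices renormalizes the weights, like p / p.sum() in generate
    return rng.choices(top, weights=weights, k=1)[0]

random.seed(0)
scores = [0.1, 2.0, -1.0, 3.0, 0.5, 1.5, -2.0]
picks = [sample_top_k(scores, k=3) for _ in range(1000)]
print(sorted(set(picks)))
```

Restricting the draw to the top k candidates keeps unlikely words out of the script, while the weighted random choice prevents the model from repeating the single most likely word every time.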

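### Generate a New Script
It's time to generate the text. Set `gen_length` to the length of TV script you want to generate and set `prime_word` to one of the following to start the prediction: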
- "jerry"
- "elaine"
- "george"
- "kramer"

You can set the prime word to _any_ word, but it works best to start with a name (any other character name is fine, too!)

In [18]:
%%time
# run the cell multiple times to get different results!
gen_length = 400 # modify the length to your preference
prime_word = 'jerry' # name for starting the script

"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
pad_word = helper.SPECIAL_WORDS['PADDING']
generated_script = generate(trained_rnn, vocab_to_int[prime_word + ':'], int_to_vocab, token_dict, vocab_to_int[pad_word], gen_length)
print(generated_script)

jerry: and he was gonna be careful.

jerry: oh, you don't have any kind of chemistry with the friars.

george: you know, i don't think we were just curious.

jerry: well, i was wondering i was in the back.

kramer:(pointing) no, i don't want to talk to him for you, but...(points to the phone)

elaine: what?(elaine is shocked)

elaine: i can't do this if i can get this.(george enters and goes to the bathroom, and knocks on the phone)

george:(to george) hey, you know what? i'll see you later.

kramer: yeah, you can just tell me about a thing, but...

jerry: i can't.

george: i think it's a good time.

jerry:(to george) you know what i mean?

elaine: yeah, yeah. well, i'm a little tired.

jerry: i don't know..

kramer:(looking at the closed door) yeah....

kramer: yeah.....

jerry: you can't.

kramer: hey, you know, i got a big deal.

jerry:(pause) i can't believe i had to get it.

george:(pointing at georges finger) hey, you gotta go.

jerry: i can't believe you, i don't want to know wh

In [19]:
real_text = ""
for word in generated_script.split():
    if ':' in word:
        real_text += '\n'+word+"\t"
    else:
        real_text += word+' '
        
print(real_text)


jerry:	and he was gonna be careful. 
jerry:	oh, you don't have any kind of chemistry with the friars. 
george:	you know, i don't think we were just curious. 
jerry:	well, i was wondering i was in the back. 
kramer:(pointing)	no, i don't want to talk to him for you, but...(points to the phone) 
elaine:	what?(elaine is shocked) 
elaine:	i can't do this if i can get this.(george enters and goes to the bathroom, and knocks on the phone) 
george:(to	george) hey, you know what? i'll see you later. 
kramer:	yeah, you can just tell me about a thing, but... 
jerry:	i can't. 
george:	i think it's a good time. 
jerry:(to	george) you know what i mean? 
elaine:	yeah, yeah. well, i'm a little tired. 
jerry:	i don't know.. 
kramer:(looking	at the closed door) yeah.... 
kramer:	yeah..... 
jerry:	you can't. 
kramer:	hey, you know, i got a big deal. 
jerry:(pause)	i can't believe i had to get it. 
george:(pointing	at georges finger) hey, you gotta go. 
jerry:	i can't believe you, i don't want to know w

#### Save your favorite scripts

Once you have a script that you like (or find interesting), save it to a text file!

In [20]:
# save script to a text file
with open("generated_script_1.txt", "w") as f:
    f.write(real_text)

# The TV Script is Not Perfect
It's ok if the TV script doesn't make perfect sense. Here is one example.

### Example generated script

>jerry: what about me?
>
>jerry: i don't have to wait.
>
>kramer:(to the sales table)
>
>elaine:(to jerry) hey, look at this, i'm a good doctor.
>
>newman:(to elaine) you think i have no idea of this...
>
>elaine: oh, you better take the phone, and he was a little nervous.
>
>kramer:(to the phone) hey, hey, jerry, i don't want to be a little bit.(to kramer and jerry) you can't.
>
>jerry: oh, yeah. i don't even know, i know.
>
>jerry:(to the phone) oh, i know.
>
>kramer:(laughing) you know...(to jerry) you don't know.


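It's fine if the TV script doesn't make any sense. To get better results, you generally need a smaller vocabulary or more data. If you want to experiment further, more script data is available on Kaggle, for example [The Simpsons by the Data](https://www.kaggle.com/wcukierski/the-simpsons-by-the-data). Training on a larger dataset takes considerably longer, so feel free to try it after you have completed this project.

# Submitting This Project
When submitting this project, make sure to run all the cells before saving the notebook. Save the notebook file as "dlnd_tv_script_generation.ipynb" and also save it as an HTML file under "File" -> "Download as". Include the "helper.py" and "problem_unittests.py" files in your submission, packaged together as a zip file.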
