## Recurrent Neural Network Projects 循环神经网络项目

Welcome to the Recurrent Neural Network Project in the Artificial Intelligence Nanodegree! In this notebook, some template code has already been provided for you, and you will need to implement additional functionality to successfully complete this project. You will not need to modify the included code beyond what is requested. Sections that begin with **'Implementation'** in the header indicate that the following block of code will require additional functionality which you must provide. Instructions will be provided for each section and the specifics of the implementation are marked in the code block with a 'TODO' statement. Please be sure to read the instructions carefully!

欢迎来到人工智能纳米程序中的循环神经网络项目！在本笔记本中，已经为您提供了一些模板代码，您将需要实现其他功能来成功完成此项目。您不需要修改包含的代码超出所要求的内容。在标题中以 **'Implementation'** 开头的部分表示以下代码块将需要您实现附加功能。将为每个部分提供说明，并在代码块中使用“TODO”语句标记实现的具体细节。请务必仔细阅读说明！

In addition to implementing code, there will be questions that you must answer which relate to the project and your implementation. Each section where you will answer a question is preceded by a **'Question X'** header. Carefully read each question and provide thorough answers in the following text boxes that begin with **'Answer:'**. Your project submission will be evaluated based on your answers to each of the questions and the implementation you provide.  

除了执行代码之外，还有一些问题，您必须回答与项目和实施相关的问题。您将回答问题的每个部分之前都有一个**'Question X'** 标题。仔细阅读每个问题，并在以下的文本框中提供彻底的答案：以**'Answer:'**开头。您的项目提交将根据您对每个问题的答案和您提供的实施情况进行评估。

>**Note:** Code and Markdown cells can be executed using the **Shift + Enter** keyboard shortcut. In addition, Markdown cells can typically be edited by double-clicking the cell to enter edit mode.

>**Note:** 可以使用 **Shift + Enter** 键盘快捷键执行代码和Markdown单元格。此外，通过双击单元格进入编辑模式，可以编辑Markdown单元格。

### Implementation TODOs in this notebook 实现笔记本中的 TODOs

This notebook contains two problems, cut into a variety of TODOs.  Make sure to complete each section containing a TODO marker throughout the notebook.  For convenience we provide links to each of these sections below.

该笔记本包含两个问题，切入各种TODOs。 确保在整个笔记本中完成包含TODO标记的每个部分。为方便起见，我们提供下列各部分的链接。

[TODO #1: Implement a function to window time series](#TODO_1) 实现一个函数到窗口时间序列

[TODO #2: Create a simple RNN model using keras to perform regression](#TODO_2) 使用keras创建一个简单的RNN模型来执行回归

[TODO #3: Finish cleaning a large text corpus](#TODO_3) 完成清理大文本语料库

[TODO #4: Implement a function to window a large text corpus](#TODO_4) 实现一个函数来窗口大文本语料库

[TODO #5: Create a simple RNN model using keras to perform multiclass classification](#TODO_5) 使用keras创建一个简单的RNN模型来执行多类分类

[TODO #6: Generate text using a fully trained RNN model and a variety of input sequences](#TODO_6) 使用经过充分训练的RNN模型和各种输入序列生成文本


# Problem 1: Perform time series prediction  问题1执行时间序列预测

In this project you will perform time series prediction using a Recurrent Neural Network regressor.  In particular you will re-create the figure shown in the notes - where the stock price of Apple was forecasted (or predicted) 7 days in advance.  In completing this exercise you will learn how to construct RNNs using Keras, which will also aid in completing the second project in this notebook.

在这个项目中，您将使用循环神经网络回归器来执行时间序列预测。 特别是，您将重新创建笔记中显示的数字 - 苹果的股价预测（或预测）提前7天。 在完成本练习时，您将学习如何使用Keras构建RNNs，这也有助于完成本笔记本中的第二个项目。

The particular network architecture we will employ for our RNN is known as [Long Short-Term Memory (LSTM)](https://en.wikipedia.org/wiki/Long_short-term_memory), which significantly mitigates optimization problems that affect plain RNNs, such as vanishing gradients.

我们将为我们的RNN采用的特定网络架构称为 [长短期记忆 (Long Short-Term Memory, LSTM)](https://en.wikipedia.org/wiki/Long_short-term_memory)，它能显著缓解普通RNN在优化时遇到的技术问题（例如梯度消失）。

## 1.1 Getting started 入门

First we must load in our time series - a history of around 140 days of Apple's stock price.  Then we need to perform a number of pre-processing steps to prepare it for use with an RNN model.  First off, it is good practice to normalize a time series by rescaling its range.  This helps us avoid serious numerical issues associated with how common activation functions (like tanh) transform very large (positive or negative) numbers, and it helps us avoid related issues when computing derivatives.

首先，我们必须加载我们的时间序列 - 苹果股价约140天的历史。 然后，我们需要执行一些预处理步骤，为RNN模型的使用做好准备。 首先，将时间序列归一化（即对其取值范围进行缩放）是一个很好的做法。 这有助于我们避免常见激活函数（如tanh）在变换非常大的（正或负）数字时产生的严重数值问题，并帮助我们在计算导数时避免相关问题。
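To make the point about tanh concrete, here is a minimal NumPy sketch. The input values are made up for illustration, and `tanh_derivative` is a hypothetical helper, not part of the project code. It shows that for large inputs tanh saturates and its derivative is numerically zero, which is the kind of issue normalization is meant to avoid.

```python
import numpy as np

def tanh_derivative(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - np.tanh(x) ** 2

# Hypothetical inputs: values in a normalized range versus raw, price-sized values
small_inputs = np.array([0.0, 0.5, 1.0])
large_inputs = np.array([20.0, 100.0, 500.0])

print(np.tanh(small_inputs))          # outputs stay well inside (-1, 1)
print(tanh_derivative(small_inputs))  # gradients are clearly non-zero
print(np.tanh(large_inputs))          # outputs saturate at (numerically) 1
print(tanh_derivative(large_inputs))  # gradients are (numerically) 0
```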

Here we normalize the series to lie in the range [0,1] [using this scikit function](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html), but it is also commonplace to normalize by the series' standard deviation.

这里我们[使用这个scikit函数](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)将序列归一化到[0,1]范围内，但按序列的标准差进行归一化也很常见。
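The two normalization options mentioned above can be sketched in plain NumPy. This is only an illustration under stated assumptions: the helper names `min_max_normalize` and `standardize` are hypothetical, the prices are made up, and the provided dataset file already contains normalized values, so none of this is needed to complete the project.

```python
import numpy as np

def min_max_normalize(series):
    """Rescale a 1-D series linearly so its values lie in [0, 1]."""
    series = np.asarray(series, dtype=float)
    lo, hi = series.min(), series.max()
    return (series - lo) / (hi - lo)   # assumes the series is not constant

def standardize(series):
    """Alternative: subtract the mean and divide by the standard deviation."""
    series = np.asarray(series, dtype=float)
    return (series - series.mean()) / series.std()

# Hypothetical raw prices, made up for illustration only
raw_prices = np.array([110.0, 112.5, 108.0, 115.0, 120.0])
scaled = min_max_normalize(raw_prices)
standardized = standardize(raw_prices)
print(scaled)
print(standardized)
```

Min-max scaling pins the smallest value to 0 and the largest to 1, while standardization gives the series zero mean and unit standard deviation.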

In [None]:
### Load in necessary libraries for data input and normalization
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

%load_ext autoreload
%autoreload 2

from my_answers import *

### load in and normalize the dataset
dataset = np.loadtxt('datasets/normalized_apple_prices.csv')

Let's take a quick look at the (normalized) time series we'll be performing predictions on.

让我们快速看看我们将执行预测的(标准化)时间序列。

In [None]:
# let's take a look at our time series
plt.plot(dataset)
plt.xlabel('time period')
plt.ylabel('normalized series value')

## 1.2  Cutting our time series into sequences 将我们的时间序列切成序列

Remember, our time series is a sequence of numbers that we can represent in general mathematically as 

记住，我们的时间序列是一系列数字，我们可以在数学上代表数字

$$s_{1},s_{2},s_{3},\ldots,s_{P}$$

where $s_{p}$ is the numerical value of the time series at time period $p$ and where $P$ is the total length of the series.  In order to apply our RNN we treat the time series prediction problem as a regression problem, and so need to use a sliding window to construct a set of associated input/output pairs to regress on.  This process is animated in the gif below.

其中 $s_{p}$ 是时间序列在时间段 $p$ 的数值，$P$ 是序列的总长度。为了应用我们的RNN，我们将时间序列预测问题视为回归问题，因此需要使用滑动窗口来构建一组相关的输入/输出对来进行回归。这个过程在下面的gif中以动画形式展示。

<img src="images/timeseries_windowing_training.gif" width=600 height=600/>

For example - using a window of size T = 5 (as illustrated in the gif above) we produce a set of input/output pairs like those shown in the table below

例如 - 使用大小为T = 5的窗口（如上面的gif所示），我们产生一组输入/输出对，如下表所示

$$\begin{array}{c|c}
\text{Input} & \text{Output}\\
\hline \color{CornflowerBlue} {\langle s_{1},s_{2},s_{3},s_{4},s_{5}\rangle} & \color{Goldenrod}{ s_{6}} \\
\ \color{CornflowerBlue} {\langle s_{2},s_{3},s_{4},s_{5},s_{6} \rangle } & \color{Goldenrod} {s_{7} } \\
\color{CornflowerBlue}  {\vdots} & \color{Goldenrod} {\vdots}\\
\color{CornflowerBlue} { \langle s_{P-5},s_{P-4},s_{P-3},s_{P-2},s_{P-1} \rangle } & \color{Goldenrod} {s_{P}}
\end{array}$$

Notice here that each input is a sequence (or vector) of length 5 (and in general has length equal to the window size T) while each corresponding output is a scalar value.  Notice also how given a time series of length P and window size T = 5 as shown above, we created P - 5  input/output pairs.  More generally, for a window size T we create P - T such pairs.

请注意，每个输入是长度为5的序列(或向量)(通常具有等于窗口大小T的长度)，而每个对应的输出是标量值。 还要注意，如上所示，如何给出长度P和窗口大小 T=5 的时间序列，我们创建了 P - 5 输入/输出对。 更一般地，对于窗口大小T，我们创建P-T这样的对。
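The windowing described above can be sketched on a toy series as follows. This is a minimal illustration only: `make_windows` is a hypothetical helper, not the project's reference solution for **window_transform_series**, and the toy series is made up.

```python
import numpy as np

def make_windows(series, window_size):
    """Pair each length-`window_size` window with the value that follows it.

    Hypothetical helper for illustration; not the project's reference solution.
    """
    num_pairs = len(series) - window_size           # P - T pairs
    X = np.array([series[i:i + window_size] for i in range(num_pairs)])
    y = np.array([series[i + window_size] for i in range(num_pairs)])
    return X, y

toy_series = np.arange(1, 9)        # s_1, ..., s_8, so P = 8
X_toy, y_toy = make_windows(toy_series, window_size=5)
print(X_toy.shape)   # (3, 5): P - T = 8 - 5 = 3 inputs of length 5
print(y_toy)         # [6 7 8]
```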

Now it's time for you to window the input time series as described above!  

现在它的时间让你按照上述的方式窗口输入时间序列！

<a id='TODO_1'></a>

**TODO:** Implement the function called **window_transform_series** in my_answers.py so that it runs a sliding window along the input series and creates associated input/output pairs.    Note that this function should input a) the series and b) the window length, and return the input/output subsequences.  Make sure to format returned input/output as generally shown in table above (where window_size = 5), and make sure your returned input is a numpy array.

**TODO:** 在my_answers.py中实现名为 **window_transform_series** 的函数，以便它沿着输入序列运行一个滑动窗口，并创建相关的输入/输出对。 请注意，此功能应输入 a）系列和 b）窗口长度，并返回输入/输出子序列。 确保格式化返回的输入/输出，如上表（window_size = 5）所示，并确保返回的输入是一个numpy数组。

-----

You can test your function on the list of odd numbers given below

In [None]:
odd_nums = np.array([1,3,5,7,9,11,13])

Here is a hard-coded solution for odd_nums.  You can compare its results with what you get from your **window_transform_series** implementation.

这是一个用于odd_nums的硬编码解决方案。 您可以将其结果与您从 **window_transform_series** 实现中获得的结果进行比较。

In [None]:
# run a window of size 2 over the odd number sequence and display the results
window_size = 2

X = []
X.append(odd_nums[0:2])
X.append(odd_nums[1:3])
X.append(odd_nums[2:4])
X.append(odd_nums[3:5])
X.append(odd_nums[4:6])

y = odd_nums[2:]

X = np.asarray(X)
y = np.asarray(y)
y = np.reshape(y, (len(y),1)) #optional

assert(type(X).__name__ == 'ndarray')
assert(type(y).__name__ == 'ndarray')
assert(X.shape == (5,2))
assert(y.shape in [(5,1), (5,)])

# print out input/output pairs --> here input = X, corresponding output = y
print ('--- the input X will look like ----')
print (X)

print ('--- the associated output y will look like ----')
print (y)

Again - you can check that your completed **window_transform_series** function works correctly by trying it on the odd_nums sequence - you should get the above output.

再次 - 你可以检查你完成的 **window_transform_series** 函数是否正确地通过在odd_nums序列上尝试 - 你应该得到上面的输出。

In [None]:
### TODO: implement the function window_transform_series in the file my_answers.py
from my_answers import window_transform_series

With this function in place apply it to the series in the Python cell below.  We use a window_size = 7 for these experiments.

将此功能适用于下面的Python单元格中的系列。 我们对这些实验使用一个window_size = 7。

In [None]:
# window the data using your windowing function
window_size = 7
X,y = window_transform_series(series = dataset,window_size = window_size)

## 1.3  Splitting into training and testing sets 拆分训练集和测试集

In order to perform proper testing on our dataset we will lop off the last 1/3 of it for validation (or testing).  This way, once we train our model, we have something to test it on (as with any regression problem!).  This splitting into training/testing sets is done in the cell below.

为了对我们的数据集执行正确的测试，我们将在最后1/3的时间内进行验证(或测试)。 这就是一旦我们训练我们的模型，我们有一些东西可以测试(就像任何回归问题！)。 这种在训练/测试集中的分裂是在下面的单元格中完成的。

Note how here we are **not** splitting the dataset *randomly* as one typically would do when validating a regression model.  This is because our input/output pairs *are related temporally*.   We don't want to validate our model by training on a random subset of the series and then testing on another random subset, as this simulates the scenario that we receive new points *within the timeframe of our training set*.  

注意我们这里的 **不是** 分解数据集 *随机* ，如通常在验证回归模型时会做的。 这是因为我们的输入/输出对*在时间上相关*。 我们不想通过对该系列的随机子集进行训练，然后在另一个随机子集上进行测试来验证我们的模型，因为这模拟了我们在训练集*的时间范围内接收到新点*的情景。

We want to train on one solid chunk of the series (in our case, the first full 2/3 of it), and validate on a later chunk (the last 1/3) as this simulates how we would predict *future* values of a time series.

我们想训练一个系列的一个实体（在我们的例子中，它是第一个完整的2/3），并在稍后的一个块（最后1/3）上验证，因为它模拟了我们将如何预测*未来*值 的时间序列。

In [None]:
# split our dataset into training / testing sets
train_test_split = int(np.ceil(2*len(y)/float(3)))   # set the split point

# partition the training set
X_train = X[:train_test_split,:]
y_train = y[:train_test_split]

# keep the last chunk for testing
X_test = X[train_test_split:,:]
y_test = y[train_test_split:]

# NOTE: to use keras's RNN LSTM module our input must be reshaped to [samples, window size, number of features] (one feature per time step here)
X_train = np.asarray(np.reshape(X_train, (X_train.shape[0], window_size, 1)))
X_test = np.asarray(np.reshape(X_test, (X_test.shape[0], window_size, 1)))

<a id='TODO_2'></a>

## 1.4  Build and run an RNN regression model 构建并运行RNN回归模型

Having created input/output pairs out of our time series and cut this into training/testing sets, we can now begin setting up our RNN.  We use Keras to quickly build a two hidden layer RNN of the following specifications

- layer 1 uses an LSTM module with 5 hidden units (note here the input_shape = (window_size,1))
- layer 2 uses a fully connected module with one unit
- the 'mean_squared_error' loss should be used (remember: we are performing regression here)

在我们的时间序列中创建了输入/输出对，并将其剪切到训练/测试集中，我们现在可以开始设置我们的RNN。我们使用Keras快速构建以下规格的两个隐藏层RNN

- 层1使用具有5个隐藏单元的LSTM模块（请注意，input_shape =（window_size，1））
- 第2层使用一个完整连接的模块与一个单元
- 应该使用'mean_squared_error'损失（记住：我们在这里执行回归）

This can be constructed using just a few lines - see e.g., the [general Keras documentation](https://keras.io/getting-started/sequential-model-guide/) and the [LSTM documentation in particular](https://keras.io/layers/recurrent/) for examples of how to quickly use Keras to build neural network models.  Make sure you initialize your optimizer using the [keras-recommended approach for RNNs](https://keras.io/optimizers/) (given in the cell below).  (Remember to copy your completed function into the function titled *build_part1_RNN* in the script *my_answers.py* before submitting your project.)

这可以只用几行代码来构建 - 例如，请参阅[通用Keras文档](https://keras.io/getting-started/sequential-model-guide/) 和[LSTM文档](https://keras.io/layers/recurrent/)，了解如何快速使用Keras构建神经网络模型的例子。请确保按照 [keras推荐的RNN方法](https://keras.io/optimizers/)（在下面的单元格中给出）初始化优化器。（在提交项目之前，请记得将完成的函数复制到脚本 *my_answers.py* 中名为 *build_part1_RNN* 的函数里。）
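As a rough sanity check on the two-layer specification above, the following sketch counts the trainable parameters such a network would have. The helper names are hypothetical, and the formulas assume the standard LSTM layout (four gates, each with input weights, recurrent weights, and a bias) and default layer settings; it is an arithmetic illustration, not the model itself.

```python
def lstm_param_count(units, input_dim):
    """Parameters of a standard LSTM layer: 4 gates x (input weights + recurrent weights + bias)."""
    return 4 * (units * input_dim + units * units + units)

def dense_param_count(units, input_dim):
    """Parameters of a fully connected layer: weights plus one bias per unit."""
    return units * input_dim + units

lstm_params = lstm_param_count(units=5, input_dim=1)    # layer 1: 5 hidden units, 1 feature per step
dense_params = dense_param_count(units=1, input_dim=5)  # layer 2: 1 output unit fed by 5 LSTM outputs
print(lstm_params, dense_params, lstm_params + dense_params)  # 140 6 146
```

Note that the window size does not enter the count: the same LSTM weights are reused at every time step.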

In [None]:
### TODO: create required RNN model
# import keras network libraries
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
import keras

# given - fix random seed - so we can all reproduce the same results on our default time series
np.random.seed(0)


# TODO: implement build_part1_RNN in my_answers.py
from my_answers import build_part1_RNN
model = build_part1_RNN(window_size)

# build model using keras documentation recommended optimizer initialization
optimizer = keras.optimizers.RMSprop(lr=0.001, rho=0.9, epsilon=1e-08, decay=0.0)

# compile the model
model.compile(loss='mean_squared_error', optimizer=optimizer)

With your model built you can now fit the model by activating the cell below!  Note: the number of epochs (epochs) and the batch_size are preset (so we can all produce the same results).  You can toggle the verbose parameter - which gives you regular updates on the progress of the algorithm - on and off by setting it to 1 or 0 respectively.

模型建立后，您现在可以通过激活下面的单元格来拟合模型！ 注意：训练轮数(epochs)和batch_size是预设的（所以我们都可以产生相同的结果）。 您可以通过将verbose参数分别设置为1或0来打开或关闭它，打开后会定期输出算法的进度。

In [None]:
# run your model!
model.fit(X_train, y_train, epochs=1000, batch_size=50, verbose=0)

## 1.5  Checking model performance 检查模型性能

With your model fit we can now make predictions on both our training and testing sets. 凭借您的模型，我们现在可以对我们的训练集和测试集进行预测。

In [None]:
# generate predictions for the training and testing sets
train_predict = model.predict(X_train)
test_predict = model.predict(X_test)

In the next cell we compute training and testing errors using our trained model - you should be able to achieve at least

在下一个单元格中，我们使用我们训练有素的模型计算训练和测试错误 - 您应该能够至少达到

*training_error* < 0.02

and 

*testing_error* < 0.02

with your fully trained model.  

与您的训练有素的模型。

If either or both of your errors are larger than 0.02, re-train your model - increasing the number of epochs you take (a maximum of around 1,000 should do the job) and/or adjusting your batch_size.

如果您的任一误差或两个误差都大于0.02，请重新训练您的模型 - 增加训练轮数（最多约1,000轮应该就够了），和/或调整batch_size。
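Since the model is compiled with the 'mean_squared_error' loss and no extra metrics, the errors discussed above should correspond to the mean squared difference between targets and predictions. A minimal NumPy sketch of that quantity follows. The helper name and the numbers are made up for illustration and are not real model output.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return np.mean((y_true - y_pred) ** 2)

# Hypothetical normalized targets and predictions (illustration only)
y_true = np.array([0.10, 0.20, 0.30, 0.40])
y_pred = np.array([0.12, 0.18, 0.33, 0.35])
error = mean_squared_error(y_true, y_pred)
print(error)
print(error < 0.02)    # True
```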

In [None]:
# print out training and testing errors
training_error = model.evaluate(X_train, y_train, verbose=0)
print('training error = ' + str(training_error))

testing_error = model.evaluate(X_test, y_test, verbose=0)
print('testing error = ' + str(testing_error))

Activating the next cell plots the original data, as well as the predictions on both the training and testing sets. 

激活下一个单元格绘制原始数据，以及对训练和测试集的两个预测。

In [None]:
### Plot everything - the original series as well as predictions on training and testing sets
import matplotlib.pyplot as plt
%matplotlib inline

# plot original series
plt.plot(dataset,color = 'k')

# plot training set prediction
split_pt = train_test_split + window_size 
plt.plot(np.arange(window_size,split_pt,1),train_predict,color = 'b')

# plot testing set prediction
plt.plot(np.arange(split_pt,split_pt + len(test_predict),1),test_predict,color = 'r')

# pretty up graph
plt.xlabel('day')
plt.ylabel('(normalized) price of Apple stock')
plt.legend(['original series','training fit','testing fit'],loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()

**Note:** you can try out any time series for this exercise!  If you would like to try another, see e.g., [this site containing thousands of time series](https://datamarket.com/data/list/?q=provider%3Atsdl) and pick one!

**Note:** 你可以尝试这个练习的任何时间系列！ 如果你想尝试另一个，例如，[这个网站包含数千个时间序列](https://datamarket.com/data/list/?q=provider%3Atsdl) 并选择另一个！

# Problem 2: Create a sequence generator 创建序列生成器

## 2.1  Getting started 入门

In this project you will implement a popular Recurrent Neural Network (RNN) architecture to create an English language sequence generator capable of building semi-coherent English sentences from scratch by building them up character-by-character.  This will require a substantial amount of parameter tuning on a large training corpus (at least 100,000 characters long).  In particular for this project we will be using a complete version of Sir Arthur Conan Doyle's classic book The Adventures of Sherlock Holmes.

在这个项目中，您将实施流行的循环神经网络（RNN）体系结构，创建一个英文语言序列生成器，能够通过逐个构建从头开始构建半连贯的英语句子。这将需要在大型训练语料库（至少100,000个字符长）上进行大量的参数调整。特别是对于这个项目，我们将使用完整版的Arthur Conan Doyle爵士的经典书“福尔摩斯冒险之旅”。

How can we train a machine learning model to generate text automatically, character-by-character?  *By showing the model many training examples so it can learn a pattern between input and output.*  With this type of text generation each input is a string of valid characters like this one

我们如何训练机器学习模型来自动生成文字，逐个字符？ *通过显示模型许多训练示例，以便它可以学习输入和输出之间的模式。*通过这种类型的文本生成，每个输入都是一串这样的有效字符

*dogs are grea*

while the corresponding output is the next character in the sentence - which here is 't' (since the complete sentence is 'dogs are great').  We need to show a model many such examples in order for it to make reasonable predictions.

而相应的输出是句子中的下一个字符 - 这里是“t”（因为完整的句子是 'dogs are great'）。我们需要显示一个模型许多这样的例子，以便做出合理的预测。

**Fun note:** For those interested in how text generation is being used check out some of the following fun resources:

**Fun note:**对于如何使用文本生成感兴趣的人，请查看以下有趣的资源：

- [Generate wacky sentences](http://www.cs.toronto.edu/~ilya/rnn.html) with this academic RNN text generator

- [生成古怪的句子](http://www.cs.toronto.edu/~ilya/rnn.html) 与这个学术性的RNN文本生成器

- Various twitter bots that tweet automatically generated text like [this one](http://tweet-generator-alex.herokuapp.com/).

- 各种自动发布生成文本的twitter bots，如[这个](http://tweet-generator-alex.herokuapp.com/)。

- the [NaNoGenMo](https://github.com/NaNoGenMo/2016) annual contest to automatically produce a 50,000+ word novel

- [NaNoGenMo](https://github.com/NaNoGenMo/2016) 年度大赛，自动生成一本50,000多单词小说

- [Robot Shakespeare](https://github.com/genekogan/RobotShakespeare) a text generator that automatically produces Shakespeare-esque sentences

- [Robot Shakespeare](https://github.com/genekogan/RobotShakespeare) 一个自动生成莎士比亚风格句子的文本生成器

## 2.2  Preprocessing a text dataset 预处理文本数据集

Our first task is to get a large text corpus for use in training, and on it we perform several light pre-processing tasks.  The default corpus we will use is the classic book Sherlock Holmes, but you can use a variety of others as well - so long as they are fairly large (around 100,000 characters or more).  

我们的第一个任务是获得一个用于训练的大型文本语料库，并且我们执行了几个轻量级的预处理任务。 我们将使用的默认语料库是经典书Sherlock Holmes，但您也可以使用各种其他语言，只要它们相当大（约100,000个字符或更多）。

In [None]:
# read in the text, transforming everything to lower case
text = open('datasets/holmes.txt').read().lower()
print('our original text has ' + str(len(text)) + ' characters')

Next, let's examine a bit of the raw text.  Because we are interested in creating sentences of English words automatically by building up each word character-by-character, we only want to train on valid English words.  In other words - we need to remove all of the other characters that are not part of English words.

接下来，让我们检查一下原始文本。 因为我们有兴趣通过逐字建立每个单词自动创建英文单词的句子，我们只想训练有效的英语单词。 换句话说 - 我们需要删除不属于英文单词的所有其他字符。

In [None]:
### print out the first 2000 characters of the raw text to get a sense of what we need to throw out
text[:2000]

Wow - there's a lot of junk here (i.e., weird uncommon character combinations - as this first character chunk contains the title and author page, as well as table of contents)!  To keep things simple, we want to train our RNN on a large chunk of more typical English sentences - we don't want it to start thinking non-English words or strange characters are valid! - so let's clean up the data a bit.

哇 - 这里有很多垃圾（即，奇怪的罕见的字符组合 - 因为这个第一个字符块包含标题和作者页面，以及目录）！ 为了保持简单，我们想训练我们的RNN在一大批更典型的英语句子 - 我们不希望它开始思考非英语单词或奇怪的人物是有效的！ - 所以让我们清理数据。

First, since the dataset is so large and the first few hundred characters contain a lot of junk, let's cut it out.  Let's also find-and-replace those newline characters with empty spaces.

首先，由于数据集很大，前几百个字符含有很多垃圾，所以可以把它剪掉。 也可以用空格找到并替换那些换行符。

In [None]:
### cut off the leading junk, then find and replace '\n' and '\r' symbols - replacing them with spaces
text = text[1302:]
text = text.replace('\n',' ')    # replacing '\n' with '' instead would glue neighboring words together
text = text.replace('\r',' ')

Let's see how the first 1000 characters of our text look now!

让我们看看我们的文本的前1000个字符如何看起来！

In [None]:
### print out the first 1000 characters of the trimmed text to see what still needs cleaning
text[:1000]

<a id='TODO_3'></a>

#### TODO: finish cleaning the text 完成清理文本

Let's make sure we haven't left any other atypical characters (commas, periods, etc., are ok) lurking around in the depths of the text.  You can do this by enumerating all the text's unique characters, examining them, and then replacing any unwanted characters with empty spaces!  Once we find all of the text's unique characters, we can remove all of the atypical ones in the next cell.  Note: don't remove the punctuation marks given in my_answers.py.

让我们确保我们还没有留下任何其他非典型的字符（逗号，句点等），在文本的深处潜伏着。 您可以通过枚举所有文本的唯一字符，检查它们，然后用空格替换任何不需要的字符来执行此操作！ 一旦我们找到所有文本的唯一字符，我们可以删除下一个单元格中的所有非典型字符。 注意：不要删除my_answers.py中给出的标点符号。

In [None]:
### TODO: implement cleaned_text in my_answers.py
from my_answers import cleaned_text

text = cleaned_text(text)

# shorten any extra dead space created above
text = text.replace('  ',' ')

With your chosen characters removed, print out the first 2000 characters again just to double check that everything looks good.

删除您选择的字符后，再次打印出前2000个字符，仔细检查一切是否正常。

In [None]:
### print out the first 2000 characters of the cleaned text to double check the result
text[:2000]

Now that we have thrown out a good number of non-English characters/character sequences, let's print out some statistics about the dataset - including number of total characters and number of unique characters.

现在我们已经抛出了很多非英文字符/字符序列，可以打印出关于数据集的一些统计信息 - 包括总字符数和唯一字符数。

In [None]:
# count the number of unique characters in the text
chars = sorted(list(set(text)))

# print some of the text, as well as statistics
print ("this corpus has " +  str(len(text)) + " total number of characters")
print ("this corpus has " +  str(len(chars)) + " unique characters")

## 2.3  Cutting data into input/output pairs 将数据切割成输入/输出对

Now that we have our text all cleaned up, how can we use it to train a model to generate sentences automatically?  First we need to train a machine learning model - and in order to do that we need a set of input/output pairs for a model to train on.  How can we create a set of input/output pairs from our text to train on?

现在我们的文本全部清理了，我们如何用它来训练模型来自动生成句子？ 首先，我们需要训练一个机器学习模型 - 为了做到这一点，我们需要一组输入/输出对来训练模型。 我们如何从我们的文本中创建一组输入/输出对来训练？

Remember in part 1 of this notebook how we used a sliding window to extract input/output pairs from a time series?  We do the same thing here!  We slide a window of length $T$ along our giant text corpus - everything in the window becomes one input while the character following becomes its corresponding output.  This process of extracting input/output pairs is illustrated in the gif below on a small example text using a window size of T = 5.

记住笔记本的第1部分我们如何使用滑动窗口从时间序列中提取输入/输出对？ 我们在这里做同样的事情！ 我们沿着我们的巨型文本语料库滑动长度为 $T$ 的窗口 - 窗口中的所有内容都将成为一个输入，而后面的字符将成为其相应的输出。 提取输入/输出对的这个过程在下面的gif中使用窗口大小为 T = 5 的小示例文本进行了说明。

<img src="images/text_windowing_training.gif" width=400 height=400/>

Notice one difference from the time series windowing shown in part 1 of the notebook - we do not need to slide the window along one character at a time but can move it by a fixed step size $M$ greater than 1 (although the gif itself uses $M = 1$).  This is done with large input texts (like ours, which has over 500,000 characters!) because sliding the window along one character at a time would create far too many input/output pairs to reasonably compute with.

请注意这里与笔记本第1部分的时间序列滑窗的一个不同之处 - 我们不需要每次只将窗口滑动一个字符，而是可以按大于1的固定步长 $M$ 移动（不过gif中使用的是 $M = 1$）。 对于大型输入文本（像我们这样有超过500,000个字符！）通常会这样做，因为如果每次只滑动一个字符，我们会创建太多的输入/输出对，以至于无法合理地进行计算。

More formally, let's denote our text corpus - which is one long string of characters - as follows

更正式地表示我们的文本语料库 - 这是一长串字符，如下所示

$$s_{1},s_{2},s_{3},\ldots,s_{P}$$

where $P$ is the length of the text (again, for our text $P \approx 500{,}000$!).  Sliding a window of size T = 5 with a step length of M = 1 (these are the parameters shown in the gif above) over this sequence produces the following list of input/output pairs

其中 $P$ 是文本的长度（同样，对于我们的文本 $P \approx 500{,}000$！）。 以步长 M = 1（这些是上述gif中显示的参数）在该序列上滑动大小为 T = 5 的窗口，产生以下输入/输出对列表


$$\begin{array}{c|c}
\text{Input} & \text{Output}\\
\hline \color{CornflowerBlue} {\langle s_{1},s_{2},s_{3},s_{4},s_{5}\rangle} & \color{Goldenrod}{ s_{6}} \\
\ \color{CornflowerBlue} {\langle s_{2},s_{3},s_{4},s_{5},s_{6} \rangle } & \color{Goldenrod} {s_{7} } \\
\color{CornflowerBlue}  {\vdots} & \color{Goldenrod} {\vdots}\\
\color{CornflowerBlue} { \langle s_{P-5},s_{P-4},s_{P-3},s_{P-2},s_{P-1} \rangle } & \color{Goldenrod} {s_{P}}
\end{array}$$

Notice here that each input is a sequence (or vector) of 5 characters (and in general has length equal to the window size T) while each corresponding output is a single character.  We created around P input/output pairs in total (P - T, to be exact); for a general step size M we create around ceil((P - T)/M) pairs.

请注意，每个输入是5个字符的序列（或向量）（通常具有等于窗口大小T的长度），而每个相应的输出是单个字符。 我们总共创建了大约P个输入/输出对（确切地说是 P - T 个）；对于一般步长M，我们创建大约 ceil((P - T)/M) 个对。
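The effect of the step size can be sketched on a tiny example text. This is a minimal illustration under stated assumptions: `sliding_text_windows` is a hypothetical helper, not the project's reference solution for *window_transform_text*, and the toy sentence stands in for the full corpus.

```python
import math

def sliding_text_windows(text, window_size, step_size):
    """Slide a window over `text`, moving `step_size` characters at a time.

    Hypothetical helper for illustration; not the project's reference solution.
    """
    starts = range(0, len(text) - window_size, step_size)
    inputs = [text[i:i + window_size] for i in starts]     # window contents
    outputs = [text[i + window_size] for i in starts]      # character right after each window
    return inputs, outputs

toy_text = 'dogs are great'                                 # P = 14 characters
toy_inputs, toy_outputs = sliding_text_windows(toy_text, window_size=5, step_size=3)
print(toy_inputs)    # ['dogs ', 's are', 're gr']
print(toy_outputs)   # ['a', ' ', 'e']
print(len(toy_inputs) == math.ceil((len(toy_text) - 5) / 3))   # True
```

With a step size of 3 the window start positions are 0, 3, and 6, so only a third as many pairs are produced as with a step size of 1.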

<a id='TODO_4'></a>

Now it's time for you to window the input text as described above! 

现在该由您按照上述方式对输入文本进行滑窗处理了！

**TODO:** Create a function that runs a sliding window along the input text and creates associated input/output pairs.  A skeleton function has been provided for you.  Note that this function should input a) the text  b) the window size and c) the step size, and return the input/output sequences.  Note: the return items should be *lists* - not numpy arrays.

**TODO:** 创建一个沿输入文本运行滑动窗口并创建相关输入/输出对的函数。 为您提供了一个骨架功能。 请注意，此函数应输入 a）文本 b）窗口大小，c）步长，并返回输入/输出序列。 注意：返回项应为 *lists* - 不是numpy数组。

(Remember to copy your completed function into the function titled *window_transform_text* in the script *my_answers.py* before submitting your project.)

(在提交项目之前，请记住将已完成的函数复制到脚本 *my_answers.py* 函数中，名为 *window_transform_text*)

In [None]:
### TODO: implement window_transform_text in my_answers.py
from my_answers import window_transform_text

With our function complete we can now use it to produce input/output pairs!  We employ the function in the next cell, where the window_size = 100 and step_size = 5.

随着我们的功能完成，我们现在可以使用它来生成输入/输出对！ 我们在下一个单元格中使用该函数，其中 window_size = 100 和 step_size = 5。

In [None]:
# run your text window-ing function 
window_size = 100
step_size = 5
inputs, outputs = window_transform_text(text,window_size,step_size)

Let's print out a few input/output pairs to verify that we have made the right sort of stuff!

让我们打印出一些输入/输出对，以验证我们已经做出正确的排序！

In [None]:
# print out a few of the input/output pairs to verify that we've made the right kind of stuff to learn from
print('input = ' + inputs[2])
print('output = ' + outputs[2])
print('--------------')
print('input = ' + inputs[100])
print('output = ' + outputs[100])

Looks good! 看起来很好

## 2.4  Wait, what kind of problem is text generation again? 等等，文本生成到底是哪一类问题？

In part 1 of this notebook we used the same pre-processing technique - the sliding window - to produce a set of training input/output pairs to tackle the problem of time series prediction *by treating the problem as one of regression*.  So what sort of problem do we have here now, with text generation?  Well, the time series prediction was a regression problem because the output (one value of the time series) was a continuous value.  Here - for character-by-character text generation - each output is a *single character*.  This isn't a continuous value - but a distinct class - therefore **character-by-character text generation is a classification problem**.  

在本笔记本的第1部分中，我们使用相同的预处理技术 - 滑动窗口来生成一组训练输入/输出对，以通过将问题解决为回归来解决时间序列预测的问题。 那么现在我们有什么问题呢？ 那么时间序列预测是一个回归问题，因为输出（时间序列的一个值）是连续的值。 这里 - 对于逐字符文本生成 - 每个输出都是单个字符。 这不是一个连续的值 - 而是一个独特的类 - 因此 **逐字符文本生成是一个分类问题**。

How many classes are there in the data?  Well, the number of classes is equal to the number of unique characters we have to predict!  How many of those were there in our dataset again?  Let's print out the value again.

数据中有多少类？ 那么类的数量就等于我们预测的唯一字符数！ 我们的数据集中有多少在那里？ 让我们再次打印出这个值。

In [None]:
# print out the number of unique characters in the dataset
chars = sorted(list(set(text)))
print ("this corpus has " +  str(len(chars)) + " unique characters")
print ('and these characters are ')
print (chars)

Rockin' - so we have a multiclass classification problem on our hands!

Rockin' - 所以我们手上有一个多类分类问题！

## 2.5  One-hot encoding characters 独热编码字符

The last issue we have to deal with is representing our text data as numerical data so that we can use it as an input to a neural network. One of the conceptually simplest ways of doing this is via a 'one-hot encoding' scheme.  Here's how it works.

我们必须处理的最后一个问题是将我们的文本数据表示为数值数据，以便我们可以将其用作神经网络的输入。 这样做的概念上最简单的方法之一是通过“独热编码”方案。 这是它的工作原理。

We transform each character in our inputs/outputs into a vector with length equal to the number of unique characters in our text.  This vector is all zeros except one location where we place a 1 - and this location is unique to each character type.  For example, we transform 'a', 'b', and 'c' as follows

我们将输入/输出中的每个字符转换为长度等于我们文本中唯一字符数的向量。 除了我们放置1的一个位置，此向量全为零 - 此位置对于每个字符类型都是唯一的。 例如，我们如下转换'a'，'b'和'c'

$$a\longleftarrow\left[\begin{array}{c}
1\\
0\\
0\\
\vdots\\
0\\
0
\end{array}\right]\,\,\,\,\,\,\,b\longleftarrow\left[\begin{array}{c}
0\\
1\\
0\\
\vdots\\
0\\
0
\end{array}\right]\,\,\,\,\,c\longleftarrow\left[\begin{array}{c}
0\\
0\\
1\\
\vdots\\
0\\
0 
\end{array}\right]\cdots$$

where each vector has 32 entries (or in general: number of entries = number of unique characters in text).

其中每个向量有32个条目（或一般来说：条目数=文本中的唯一字符数）。
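The encoding above can be sketched for a tiny three-character alphabet. The names `toy_chars`, `toy_char_to_index`, `toy_index_to_char`, and `one_hot` are hypothetical and used only for this illustration; the real notebook works with all unique characters of the corpus.

```python
import numpy as np

# Hypothetical three-character alphabet, for illustration only
toy_chars = sorted(set('cab'))                               # ['a', 'b', 'c']
toy_char_to_index = {c: i for i, c in enumerate(toy_chars)}
toy_index_to_char = {i: c for i, c in enumerate(toy_chars)}

def one_hot(char):
    """Return a vector of zeros with a single 1 at the character's index."""
    vec = np.zeros(len(toy_chars), dtype=int)
    vec[toy_char_to_index[char]] = 1
    return vec

print(one_hot('a'))   # [1 0 0]
print(one_hot('b'))   # [0 1 0]
print(one_hot('c'))   # [0 0 1]

# decoding: the position of the 1 recovers the original character
decoded = toy_index_to_char[int(np.argmax(one_hot('b')))]
print(decoded)        # b
```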

The first practical step towards doing this one-hot encoding is to form a dictionary mapping each unique character to a unique integer, and one dictionary to do the reverse mapping.  We can then use these dictionaries to quickly make our one-hot encodings, as well as re-translate (from integers to characters) the results of our trained RNN classification model.

进行这种单一热编码的第一个实际步骤是形成将每个唯一字符映射到唯一整数的字典，以及一个字典进行反向映射。 然后，我们可以使用这些字典快速制作我们的一次编码，以及重新翻译（从整数到字符）我们训练有素的RNN分类模型的结果。

In [None]:
# this dictionary is a function mapping each unique character to a unique integer
chars_to_indices = dict((c, i) for i, c in enumerate(chars))  # map each unique character to unique integer

# this dictionary is a function mapping each unique integer back to a unique character
indices_to_chars = dict((i, c) for i, c in enumerate(chars))  # map each unique integer back to unique character

Now we can transform our input/output pairs - consisting of characters - to equivalent input/output pairs made up of one-hot encoded vectors.  In the next cell we provide a function for doing just this: it takes in the raw character inputs/outputs and returns their numerical versions.  In particular, the numerical input is given as $\bf{X}$ and the numerical output is given as $\bf{y}$.

现在，我们可以将由字符组成的输入/输出对转换成由单热编码向量组成的等效输入/输出对。 在下一个单元格中，我们提供了一个功能：只需要原始字符输入/输出并返回其数字版本。 特别是数值输入给出为 $\bf{X}$，数值输出为 $\bf{y}$

In [None]:
# transform character-based input/output into equivalent numerical versions
def encode_io_pairs(text,window_size,step_size):
    # number of unique chars
    chars = sorted(list(set(text)))
    num_chars = len(chars)
    
    # cut up text into character input/output pairs
    inputs, outputs = window_transform_text(text,window_size,step_size)
    
    # create empty vessels for one-hot encoded input/output
    # use the builtin bool: the np.bool alias was removed in NumPy 1.24
    X = np.zeros((len(inputs), window_size, num_chars), dtype=bool)
    y = np.zeros((len(inputs), num_chars), dtype=bool)
    
    # loop over inputs/outputs and transform and store in X/y
    for i, sentence in enumerate(inputs):
        for t, char in enumerate(sentence):
            X[i, t, chars_to_indices[char]] = 1
        y[i, chars_to_indices[outputs[i]]] = 1
        
    return X,y

Now run the one-hot encoding function by activating the cell below and transform our input/output pairs!

现在，通过激活下面的单元格并转换我们的输入/输出对，运行独热编码功能！

In [None]:
# use your function
window_size = 100
step_size = 5
X,y = encode_io_pairs(text,window_size,step_size)
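
To make the shapes of the encoded data concrete, here is a small sketch on a toy string. The helper `toy_window_transform` is a hypothetical stand-in that assumes `window_transform_text` returns windows of `window_size` characters together with the character following each window. Nested lists stand in for the NumPy arrays.

```python
def toy_window_transform(text, window_size, step_size):
    # Assumed behavior of the notebook's window_transform_text: each input is a
    # window of `window_size` characters, each output is the character after it.
    starts = range(0, len(text) - window_size, step_size)
    inputs = [text[i:i + window_size] for i in starts]
    outputs = [text[i + window_size] for i in starts]
    return inputs, outputs

toy_text = 'abcabcabc'
toy_chars = sorted(set(toy_text))
toy_index = {c: i for i, c in enumerate(toy_chars)}

inputs, outputs = toy_window_transform(toy_text, window_size=4, step_size=2)

# X_toy[i][t] is the one-hot vector of character t in window i;
# y_toy[i] is the one-hot vector of the character that follows window i.
num_chars = len(toy_chars)
X_toy = [[[1 if toy_index[ch] == k else 0 for k in range(num_chars)]
          for ch in window] for window in inputs]
y_toy = [[1 if toy_index[out] == k else 0 for k in range(num_chars)]
         for out in outputs]

print(inputs, outputs)
print(len(X_toy), len(X_toy[0]), len(X_toy[0][0]))  # (pairs, window_size, num_chars)
```

The three printed sizes correspond to the three axes of `X` in the notebook: number of input/output pairs, window size, and number of unique characters.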

<a id='TODO_5'></a>

## 2.6 Setting up our RNN 建立我们的RNN

With our dataset loaded and the input/output pairs extracted / transformed we can now begin setting up our RNN for training.  Again we will use Keras to quickly build a single hidden layer RNN - where our hidden layer consists of LSTM modules.

随着我们的数据集加载和输入/输出对提取/转换，我们现在可以开始设置我们的RNN进行培训。再次，我们将使用Keras快速构建单个隐藏层RNN，其中我们的隐藏层由LSTM模块组成。

Time to get to work: build a 3 layer RNN model with the following specification:

- layer 1 should be an LSTM module with 200 hidden units --> note this should have input_shape = (window_size,len(chars)) where len(chars) = number of unique characters in your cleaned text
- layer 2 should be a linear module, fully connected, with len(chars) hidden units --> where len(chars) = number of unique characters in your cleaned text
- layer 3 should be a softmax activation (since we are solving a *multiclass classification* problem)
- Use the **categorical_crossentropy** loss 

干活时间：建立以下3层RNN模型

- 层1应该是一个具有200个隐藏单元的LSTM模块 - >注意这应该有input_shape =（window_size，len（chars））其中len（chars）=清除文本中的唯一字符数
- 层2应该是一个线性模块，完全连接，用len(chars)隐藏单位 - >其中len(chars)=清除文本中的唯一字符数
- 第3层应该是softmax激活（因为我们正在解决*multiclass classification*）
- 使用**categorical_crossentropy**损失
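
To see what the softmax layer and the categorical cross-entropy loss compute, here is a minimal pure-Python sketch for a single example. The scores are made-up values standing in for the output of the fully connected layer, and the function names are illustrative. Keras performs the equivalent computation on whole batches.

```python
import math

def softmax(scores):
    """Turn raw layer-2 scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot_target, probs):
    """Negative log-probability that the model assigns to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs))

scores = [2.0, 1.0, 0.1]        # hypothetical outputs of the fully connected layer
probs = softmax(scores)
target = [1, 0, 0]              # one-hot encoding of the true next character
loss = categorical_crossentropy(target, probs)
print(probs, loss)
```

Because the target is one-hot, the loss reduces to the negative log of the probability assigned to the correct character, so training pushes that probability toward 1.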

This network can be constructed using just a few lines - as with the RNN network you made in part 1 of this notebook.  See e.g., the [general Keras documentation](https://keras.io/getting-started/sequential-model-guide/) and the [LSTM documentation in particular](https://keras.io/layers/recurrent/) for examples of how to quickly use Keras to build neural network models.

这个网络可以使用几行来构建 - 就像这款笔记本第一部分中的RNN网络一样。例如，[通用Keras文档](https://keras.io/getting-started/sequential-model-guide/) 和 [特别是LSTM文档](https://keras.io/layers/recurrent/) 举例说明如何快速使用Keras构建神经网络模型。

In [None]:
### necessary functions from the keras library
from keras.models import Sequential
from keras.layers import Dense, Activation, LSTM
from keras.optimizers import RMSprop
from keras.utils.data_utils import get_file
import keras
import random

# TODO implement build_part2_RNN in my_answers.py
from my_answers import build_part2_RNN

model = build_part2_RNN(window_size, len(chars))

# initialize optimizer
optimizer = keras.optimizers.RMSprop(lr=0.001, rho=0.9, epsilon=1e-08, decay=0.0)

# compile model --> make sure the optimizer initialized above is used
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

## 2.7  Training our RNN model for text generation 训练我们的RNN模型进行文本生成

With our RNN set up we can now train it!  Let's begin by trying it out on a small subset of the full dataset.  In the next cell we take the first 10,000 input/output pairs from our training data to learn on.

通过我们的RNN设置，我们现在可以训练它！ 让我们从较大版本的一小部分上尝试一下。 在下一个单元格中，我们从我们的培训数据库中获取前10,000个输入/输出对，以学习。

In [None]:
# a small subset of our input/output pairs
Xsmall = X[:10000,:,:]
ysmall = y[:10000,:]

Now let's fit our model! 现在让我们fit出模型

In [None]:
# train the model
model.fit(Xsmall, ysmall, batch_size=500, epochs=40,verbose = 1)

# save weights
model.save_weights('model_weights/best_RNN_small_textdata_weights.hdf5')

How do we make a given number of predictions (characters) based on this fitted model?   

基于这个适合的模型，我们如何做出一定数量的预测（字符）？

First we predict the next character that follows a chunk of characters from the text whose length equals our chosen window size.  Then we remove the first character in our input sequence and tack our prediction onto the end.  This gives us a slightly changed input sequence that still has length equal to the size of our window.  We then feed this updated input sequence into the model to predict another character.  Together we then have two predicted characters following our original input sequence.  Repeating this process N times gives us N predicted characters.

首先我们在长度相等于我们选择的窗口大小的文本中跟随任何大块字符后，我们预测下一个字符。 然后我们删除我们的输入序列中的第一个字符，并把我们的预测结束。 这给我们一个略有改变的输入序列，其长度仍等于我们窗口的大小。 然后，我们将这个更新的输入序列输入到模型中以预测另一个字符。 在一起，我们有两个预测字符遵循我们的原始输入序列。 重复这个过程N次给我们N个预测字符。
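
The window update described above can be sketched without a trained network. Here `stub_predict` is a hypothetical stand-in for the RNN (it simply returns the next letter of the alphabet), so that the bookkeeping is easy to follow.

```python
def stub_predict(window):
    # Hypothetical stand-in for the trained RNN: "predicts" the letter that
    # follows the window's last character in the alphabet, wrapping z -> a.
    last = window[-1]
    return chr((ord(last) - ord('a') + 1) % 26 + ord('a'))

def generate(window, num_to_predict):
    predicted = ''
    for _ in range(num_to_predict):
        next_char = stub_predict(window)
        predicted += next_char
        # drop the first character and append the prediction,
        # so the window keeps its original length
        window = window[1:] + next_char
    return predicted, window

predicted, final_window = generate('abcd', 3)
print(predicted, final_window)
```

After three rounds the window has slid forward by three characters while its length is unchanged, which is exactly what the model's fixed `input_shape` requires.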

In the next Python cell we provide you with a completed function that does just this - it makes predictions when given a) a trained RNN model, b) a subset of (window_size) characters from the text, and c) a number of characters to predict (to follow our input subset).

在下一个Python单元格中，我们为您提供一个完成的函数，只要这样做 - 它给定 a）经过训练的RNN模型，b）文本中的（window_size）个字符的子集，以及 c）许多字符到 预测（跟随我们的输入子集）。

In [None]:
# function that uses trained model to predict a desired number of future characters
def predict_next_chars(model,input_chars,num_to_predict):     
    # create output
    predicted_chars = ''
    for i in range(num_to_predict):
        # convert this round's predicted characters to numerical input    
        x_test = np.zeros((1, window_size, len(chars)))
        for t, char in enumerate(input_chars):
            x_test[0, t, chars_to_indices[char]] = 1.

        # make this round's prediction
        test_predict = model.predict(x_test,verbose = 0)[0]

        # translate numerical prediction back to characters
        r = np.argmax(test_predict)                           # predict class of each test input
        d = indices_to_chars[r] 

        # update predicted_chars and input
        predicted_chars+=d
        input_chars+=d
        input_chars = input_chars[1:]
    return predicted_chars

<a id='TODO_6'></a>

With your trained model try a few subsets of the complete text as input - note the length of each must be exactly equal to the window size.  For each subset use the function above to predict the next 100 characters that follow each input.

使用训练有素的模型，尝试将完整文本的几个子集作为输入 - 注意每个子集的长度必须与窗口大小完全相同。 对于每个子集，使用上面的函数预测每个输入后面的接下来的100个字符。
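
One way to pick starting positions whose windows are guaranteed to have the right length is sketched below. The helper name `choose_start_inds`, the fixed seed, and the placeholder text are assumptions for illustration. Any hand-picked list of valid indices works just as well.

```python
import random

def choose_start_inds(text_length, window_size, num_starts, seed=0):
    """Pick start positions whose windows fit entirely inside the text."""
    last_valid_start = text_length - window_size
    if last_valid_start < 0:
        raise ValueError('text is shorter than the window size')
    rng = random.Random(seed)   # fixed seed so the choice is reproducible
    return [rng.randint(0, last_valid_start) for _ in range(num_starts)]

toy_text = 'x' * 500            # placeholder for the notebook's cleaned text
toy_window_size = 100
toy_starts = choose_start_inds(len(toy_text), toy_window_size, num_starts=3)

# every slice taken from a valid start has exactly window_size characters
toy_windows = [toy_text[s:s + toy_window_size] for s in toy_starts]
print(toy_starts, [len(w) for w in toy_windows])
```

A start index too close to the end of the text would yield a shorter slice, and the prediction function would then encode fewer than `window_size` characters.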

In [None]:
# TODO: choose an input sequence and use the prediction function in the previous Python cell to predict 100 characters following it
# get an appropriately sized chunk of characters from the text
start_inds = []

# load in weights
model.load_weights('model_weights/best_RNN_small_textdata_weights.hdf5')
for s in start_inds:
    start_index = s
    input_chars = text[start_index: start_index + window_size]

    # use the prediction function
    predict_input = predict_next_chars(model,input_chars,num_to_predict = 100)

    # print out input characters
    print('------------------')
    input_line = 'input chars = ' + '\n' +  input_chars + '"' + '\n'
    print(input_line)

    # print out predicted characters
    line = 'predicted chars = ' + '\n' +  predict_input + '"' + '\n'
    print(line)

This looks ok, but not great.  Now let's try the same experiment with a larger chunk of the data - the first 100,000 input/output pairs.

这看起来不错，但不是很好。 现在让我们尝试使用更大块数据的同样的实验 - 前10万个输入/输出对。

Tuning RNNs for a typical character dataset like the one we use here is computationally intensive and thus time-consuming on a typical CPU.  Using a reasonably sized cloud-based GPU can speed up training by roughly a factor of 10.  Because of the long training time, it is highly recommended that you write the output of each step of your process to file.  That way your results are saved even if you close the web browser you're working in: the processes will continue running in the background, but variables and output in the notebook will not update when you open it again.

对于像我们这里使用的典型字符数据集，调整RNN是一个计算密集的工作，因此在典型的CPU上是及时的。 使用合理大小的基于云的GPU可以将训练加速10倍。同时由于训练时间长，强烈建议您仔细地将过程的每个步骤的输出写入文件。 这样即使您关闭正在处理的Web浏览器，您的所有结果都将保存，因为进程将在后台继续处理，但是当您再次打开笔记本系统时，变量/输出将不会更新。

In the next cell we show you how to create a text file in Python and record data to it.  This sort of setup can be used to record your final predictions.

在下一个单元格中，我们向您展示如何在Python中创建文本文件并将数据记录到该文本文件中。 这种设置可用于记录您的最终预测。

In [None]:
### A simple way to write output to file
f = open('my_test_output.txt', 'w')              # create an output file to write to
f.write('this is only a test ' + '\n')           # write some output text
x = 2
f.write('the value of x is ' + str(x) + '\n')    # record a variable value
f.close()

# print out the contents of my_test_output.txt
f = open('my_test_output.txt', 'r')              # reopen the file for reading
print(f.read())
f.close()

With this recording setup we can now more safely perform experiments on larger portions of the text.  In the next cell we will use the first 100,000 input/output pairs to train our RNN model.

使用这种记录设备，我们现在可以更安全地在文本的较大部分执行实验。 在下一个单元格中，我们将使用前100,000个输入/输出对来训练我们的RNN模型。

First we fit our model to the dataset, then generate text with the trained model using precisely the same generation method applied before on the small dataset.

首先，我们将我们的模型适合数据集，然后使用训练模型在小数据集上应用的完全相同的生成方法中生成文本。

**Note:** your generated words should be - by and large - more realistic than with the small dataset, but you won't be able to generate perfect English sentences even with this amount of data.  A rule of thumb: your model is working well if you generate sentences that largely contain real English words.

**Note:** 您生成的单词应该大体上比小数据集更现实，但即使使用这些数据，您也不能生成完美的英语句子。 经验法则：如果您生成大量包含真实英语单词的句子，您的模型工作良好。

In [None]:
# a larger subset of our input/output pairs
Xlarge = X[:100000,:,:]
ylarge = y[:100000,:]

# TODO: fit to our larger dataset
model.fit(Xlarge, ylarge, batch_size=500, epochs=30, verbose=1)

# save weights
model.save_weights('model_weights/best_RNN_large_textdata_weights.hdf5')

In [None]:
# TODO: choose an input sequence and use the prediction function in the previous Python cell to predict 100 characters following it
# get an appropriately sized chunk of characters from the text
start_inds = []

# save output
f = open('text_gen_output/RNN_large_textdata_output.txt', 'w')  # create an output file to write to

# load weights
model.load_weights('model_weights/best_RNN_large_textdata_weights.hdf5')
for s in start_inds:
    start_index = s
    input_chars = text[start_index: start_index + window_size]

    # use the prediction function
    predict_input = predict_next_chars(model,input_chars,num_to_predict = 100)

    # print out input characters
    line = '-------------------' + '\n'
    print(line)
    f.write(line)

    input_line = 'input chars = ' + '\n' +  input_chars + '"' + '\n'
    print(input_line)
    f.write(input_line)

    # print out predicted characters
    predict_line = 'predicted chars = ' + '\n' +  predict_input + '"' + '\n'
    print(predict_line)
    f.write(predict_line)
f.close()