## Generating text with Char RNN
In the chapter on recurrent neural networks (RNNs), we saw that they are well suited to sequence problems. Text is also a sequence, because every sentence is made up of words or characters in order, so an RNN can process it too. But how can an RNN generate text? The principle is quite simple, and the model that does it is called Char RNN.

### Training process
Earlier we saw that an RNN can relate inputs and outputs in several ways, such as one-to-many and many-to-many, and that each arrangement suits different applications; many-to-many, for example, can be used for machine translation. The Char RNN discussed here is trained as a many-to-many network with equal lengths: it takes a sequence as input and outputs a sequence of the same length.

The training process looks like this:

<img src=https://ws1.sinaimg.cn/large/006tNc79gy1fob5kq3r8jj30mt09dq2r.jpg width=700>

As the figure shows, the input is the sequence "床前明月光" ("moonlight before the bed") and the output is the sequence "前明月光床", which has the same length. Look closely and you will see that the output at every step is the input of the next step. Is this a coincidence?

No, this is exactly the design idea of Char RNN. Any sentence, such as "我喜欢小猫" ("I like kittens"), can be turned into a training sample: the input is "我喜欢小猫", a sequence of length 5, and the target output at each step is the next character, giving "喜欢小猫我". The last character of a sequence has no successor, so there are several options for its target. We can use the first character of the sequence, so that the target for "光" is "床", or the character itself, so that the target for "光" is "光".
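The shift-by-one construction can be sketched in a few lines of plain Python. The helper name `make_training_pair` and its `last_target` parameter are hypothetical and only illustrate the two choices for the final character described above.

```python
def make_training_pair(sequence, last_target="first"):
    """Return (input, target) where target is the input shifted left by one."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    if last_target == "first":
        tail = sequence[0]    # wrap around to the first character
    elif last_target == "self":
        tail = sequence[-1]   # the last character predicts itself
    else:
        raise ValueError("last_target must be 'first' or 'self'")
    return sequence, sequence[1:] + tail


x, y = make_training_pair("床前明月光")
for source_char, target_char in zip(x, y):
    # Each input character is paired with the character that follows it
    print(source_char, "->", target_char)
```

With the default setting the last line printed pairs "光" with "床"; passing `last_target="self"` pairs "光" with itself instead.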

What is the benefit of this design? Training is just ordinary supervised learning, so the point is not obvious there; it becomes clear when we generate text.

### Generate text
Walking through the generation process directly also explains, intuitively, why the network is trained this way.

First, feed an initial sequence into the network to warm it up. The outputs produced during warm-up are not needed; the goal is only to build up a hidden state that carries memory. Keeping that hidden state, we then start generating text, one character after another. This process can loop indefinitely or stop once the output reaches the requested length, as illustrated below:

<img src=https://ws2.sinaimg.cn/large/006tNc79gy1fob5z06w1uj30qh09m0sl.jpg width=800>

As the example shows, generation simply feeds each output character back into the network as the next input and repeats until the text reaches the desired length.
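The warm-up and feedback loop can be sketched independently of any neural network. In the sketch below, `generate` and `toy_step` are hypothetical names: `toy_step` is only a stand-in for a trained model and simply cycles through the letters "abc", while its "hidden state" is a step counter.

```python
def generate(step, prime, length, state=None):
    """Warm up on `prime`, then feed each prediction back in as the next input.

    `step(char, state)` must return `(next_char, new_state)`.
    """
    if not prime:
        raise ValueError("prime must contain at least one character")
    # Warm-up: the predictions are discarded, only the state is kept
    for char in prime[:-1]:
        _, state = step(char, state)
    generated = list(prime)
    current = prime[-1]
    for _ in range(length):
        # The character just produced becomes the next input
        current, state = step(current, state)
        generated.append(current)
    return "".join(generated)


def toy_step(char, state):
    """Hypothetical stand-in for a trained network: cycles through 'abc'."""
    count = (state or 0) + 1   # the "hidden state" just counts steps
    cycle = "abc"
    return cycle[(cycle.index(char) + 1) % len(cycle)], count


print(generate(toy_step, prime="ab", length=4))   # abcabc
```

A real Char RNN replaces `toy_step` with a forward pass of the network, and the state with the recurrent hidden state.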

Below we implement this in PyTorch.


We use classical Chinese poetry as the example. Let's read the data and see what it looks like.


In [1]:
with open('./dataset/poetry.txt', 'r') as f:
    poetry_corpus = f.read()

In [2]:
poetry_corpus[:100]

'寒随穷律变，春逐鸟声开。\n初风飘带柳，晚雪间花梅。\n碧林青旧竹，绿沼翠新苔。\n芝田初雁去，绮树巧莺来。\n晚霞聊自怡，初晴弥可喜。\n日晃百花色，风动千林翠。\n池鱼跃不同，园鸟声还异。\n寄言博通者，知予物'

In [3]:
# Look at the number of characters
len(poetry_corpus)

942681

To make the text easier to read, we replace line breaks and punctuation with spaces.


In [4]:
poetry_corpus = poetry_corpus.replace('\n', ' ').replace('\r', ' ').replace('，', ' ').replace('。', ' ')
poetry_corpus[:100]

'寒随穷律变 春逐鸟声开  初风飘带柳 晚雪间花梅  碧林青旧竹 绿沼翠新苔  芝田初雁去 绮树巧莺来  晚霞聊自怡 初晴弥可喜  日晃百花色 风动千林翠  池鱼跃不同 园鸟声还异  寄言博通者 知予物'

### Text numeric representation
A computer cannot work with raw text directly, so the text must first be converted to numbers. A simple scheme is to assign every distinct character an index, starting from 0.

To reduce memory use, characters that occur rarely can also be removed from the vocabulary.
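As a small illustration of this indexing scheme, the sketch below builds a vocabulary from a toy string. The function name `build_vocab` is hypothetical, and mapping every dropped character to one shared index is an assumption of this sketch.

```python
from collections import Counter


def build_vocab(text, max_vocab):
    """Map the `max_vocab` most frequent characters to indices 0..max_vocab-1."""
    most_common = [char for char, _ in Counter(text).most_common(max_vocab)]
    char_to_index = {char: index for index, char in enumerate(most_common)}
    unknown_index = len(most_common)   # shared index for every dropped character
    return char_to_index, unknown_index


char_to_index, unknown_index = build_vocab("aaabbc", max_vocab=2)
encoded = [char_to_index.get(char, unknown_index) for char in "abcz"]
print(char_to_index)   # {'a': 0, 'b': 1}
print(encoded)         # [0, 1, 2, 2]
```

Here "c" is dropped because it is the least frequent character, so both "c" and the unseen "z" are encoded with the shared unknown index.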


In [5]:
import numpy as np

class TextConverter(object):
    def __init__(self, text_path, max_vocab=5000):
        """Create a character index converter.

        Args:
            text_path: path to the text file
            max_vocab: maximum number of characters kept in the vocabulary
        """

        with open(text_path, 'r') as f:
            text = f.read()
        text = text.replace('\n', ' ').replace('\r', ' ').replace('，', ' ').replace('。', ' ')
        # Collect the distinct characters
        vocab = set(text)

        # Count how often each character occurs
        vocab_count = {}
        for word in vocab:
            vocab_count[word] = 0
        for word in text:
            vocab_count[word] += 1

        # Sort the characters by frequency, most frequent first
        vocab_count_list = []
        for word in vocab_count:
            vocab_count_list.append((word, vocab_count[word]))
        vocab_count_list.sort(key=lambda x: x[1], reverse=True)

        # If the vocabulary is too large, keep only the most frequent characters
        if len(vocab_count_list) > max_vocab:
            vocab_count_list = vocab_count_list[:max_vocab]
        vocab = [x[0] for x in vocab_count_list]
        self.vocab = vocab

        self.word_to_int_table = {c: i for i, c in enumerate(self.vocab)}
        self.int_to_word_table = dict(enumerate(self.vocab))

    @property
    def vocab_size(self):
        return len(self.vocab) + 1

    def word_to_int(self, word):
        if word in self.word_to_int_table:
            return self.word_to_int_table[word]
        else:
            return len(self.vocab)

    def int_to_word(self, index):
        if index == len(self.vocab):
            return '<unk>'
        elif index < len(self.vocab):
            return self.int_to_word_table[index]
        else:
            raise Exception('Unknown index!')

    def text_to_arr(self, text):
        arr = []
        for word in text:
            arr.append(self.word_to_int(word))
        return np.array(arr)

    def arr_to_text(self, arr):
        words = []
        for index in arr:
            words.append(self.int_to_word(index))
        return "".join(words)

In [6]:
convert = TextConverter('./dataset/poetry.txt', max_vocab=10000)

We can check what the characters look like once they are converted to numbers:


In [7]:
# Original characters
txt_char = poetry_corpus[:11]
print(txt_char)

# Convert to numbers
print(convert.text_to_arr(txt_char))

寒随穷律变 春逐鸟声开
[ 40 166 358 935 565   0  10 367 108  63  78]


### Constructing time series sample data
To train the recurrent network, we need to build time-series samples. RNNs struggle with long-term dependencies, so we cannot feed the entire text into the network as one sequence. Instead, we split the text into many shorter sequences and feed them to the network in batches. Once the length of each sequence is fixed, the number of sequences is determined as well.
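The splitting rule can be illustrated on a toy string before it is applied to the corpus. The helper `split_into_sequences` is a hypothetical name used only for this sketch.

```python
def split_into_sequences(text, n_step):
    """Cut `text` into consecutive chunks of exactly `n_step` characters."""
    num_seq = len(text) // n_step   # an incomplete tail is discarded
    return [text[i * n_step:(i + 1) * n_step] for i in range(num_seq)]


chunks = split_into_sequences("abcdefghij", n_step=3)
print(chunks)   # ['abc', 'def', 'ghi']
```

The trailing "j" does not fill a complete sequence and is dropped. By the same arithmetic, the 942681 characters of the corpus with a sequence length of 20 give 47134 sequences, with one character left over.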


In [8]:
n_step = 20

# total number of sequences
num_seq = int(len(poetry_corpus) / n_step)

# Drop the trailing characters that do not fill a complete sequence
text = poetry_corpus[:num_seq*n_step]

print(num_seq)

47134


Then we convert all the characters to their numeric representation and reshape the result into a matrix of shape (num_seq, n_step).


In [9]:
import torch

In [10]:
arr = convert.text_to_arr(text)
arr = arr.reshape((num_seq, -1))
arr = torch.from_numpy(arr)

print(arr.shape)
print(arr[0, :])

torch.Size([47134, 20])

  40
 166
 358
 935
 565
   0
  10
 367
 108
  63
  78
   0
   0
 150
   4
 443
 284
 182
   0
 131
[torch.LongTensor of size 20]



With this we can build a PyTorch dataset for training the network. Here we set the label of the last character to the first character of the input, so the label for the input "床前明月光" is "前明月光床".


In [11]:
class TextDataset(object):
    def __init__(self, arr):
        self.arr = arr
        
    def __getitem__(self, item):
        x = self.arr[item, :]
        
        # Build the label
        y = torch.zeros(x.shape)
        # Shift the input left by one; the last label is the first input character
        y[:-1], y[-1] = x[1:], x[0]
        return x, y
    
    def __len__(self):
        return self.arr.shape[0]

In [12]:
train_set = TextDataset(arr)

We can take out one of the data sets and see if it is what we described.


In [13]:
x, y = train_set[0]
print(convert.arr_to_text(x.numpy()))
print(convert.arr_to_text(y.numpy()))

寒随穷律变 春逐鸟声开  初风飘带柳 晚
随穷律变 春逐鸟声开  初风飘带柳 晚寒


### Modeling
The model can be defined with three simple layers: the first is a character embedding, the second is the recurrent layer (a GRU here), and, because predicting the next character is a classification problem, the third is a linear layer that outputs a score for each character.


In [14]:
from torch import nn
from torch.autograd import Variable

use_gpu = True

class CharRNN(nn.Module):
    def __init__(self, num_classes, embed_dim, hidden_size, 
                 num_layers, dropout):
        super().__init__()
        self.num_layers = num_layers
        self.hidden_size = hidden_size

        self.word_to_vec = nn.Embedding(num_classes, embed_dim)
        # Pass dropout by keyword: the fourth positional argument of nn.GRU is `bias`
        self.rnn = nn.GRU(embed_dim, hidden_size, num_layers, dropout=dropout)
        self.project = nn.Linear(hidden_size, num_classes)

    def forward(self, x, hs=None):
        batch = x.shape[0]
        if hs is None:
            hs = Variable(
                torch.zeros(self.num_layers, batch, self.hidden_size))
            if use_gpu:
                hs = hs.cuda()
        word_embed = self.word_to_vec(x)  # (batch, len, embed)
        word_embed = word_embed.permute(1, 0, 2)  # (len, batch, embed)
        out, h0 = self.rnn(word_embed, hs)  # (len, batch, hidden)
        le, mb, hd = out.shape
        out = out.view(le * mb, hd)
        out = self.project(out)
        out = out.view(le, mb, -1)
        out = out.permute(1, 0, 2).contiguous()  # (batch, len, num_classes)
        return out.view(-1, out.shape[2]), h0

### Training Model
Predicting the next character is a classification problem, so cross entropy can be used as the loss function. Language models are usually evaluated with another metric called perplexity, which can be thought of simply as the exponential of the cross entropy. Its range is therefore $[1, \infty)$, and smaller is better.
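The relationship between cross entropy and perplexity can be checked with a few numbers. The sketch below assumes natural logarithms and takes as input the probability the model assigned to the correct character at each step; the function name `perplexity` is chosen here for illustration.

```python
import math


def perplexity(target_probabilities):
    """Exponential of the mean cross entropy (natural log) over all steps.

    `target_probabilities` holds the probability the model assigned to the
    correct character at each step.
    """
    log_sum = sum(math.log(p) for p in target_probabilities)
    cross_entropy = -log_sum / len(target_probabilities)
    return math.exp(cross_entropy)


print(perplexity([1.0, 1.0, 1.0]))      # 1.0, a perfect model
print(perplexity([0.25, 0.25, 0.25]))   # about 4
```

A perplexity of about 4 means the model is, on average, as uncertain as if it were choosing uniformly among 4 characters, which is why lower values are better and 1 is the lower bound.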

We also mentioned earlier that RNNs suffer from exploding gradients, so we apply gradient clipping, which PyTorch provides as `torch.nn.utils.clip_grad_norm` (newer PyTorch releases name it `torch.nn.utils.clip_grad_norm_`).
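The idea behind clipping by norm can be sketched on a plain list of numbers. This is a simplified illustration, not PyTorch's implementation: it treats all gradients as one flat list, and the name `clip_by_global_norm` is hypothetical.

```python
import math


def clip_by_global_norm(gradients, max_norm):
    """Rescale `gradients` so their combined L2 norm is at most `max_norm`."""
    total_norm = math.sqrt(sum(g * g for g in gradients))
    if total_norm <= max_norm:
        # Small gradients are left untouched
        return list(gradients), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in gradients], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before)   # 5.0
print(clipped)       # approximately [0.6, 0.8]
```

Clipping keeps the direction of the gradient and only shrinks its length, so a single exploding step cannot throw the parameters far away.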


In [15]:
from torch.utils.data import DataLoader

batch_size = 128
train_data = DataLoader(train_set, batch_size, True, num_workers=4)

In [16]:
from mxtorch.trainer import ScheduledOptim

model = CharRNN(convert.vocab_size, 512, 512, 2, 0.5)
if use_gpu:
    model = model.cuda()
criterion = nn.CrossEntropyLoss()

basic_optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
optimizer = ScheduledOptim(basic_optimizer)

In [17]:
epochs = 20
for e in range(epochs):
    train_loss = 0
    for data in train_data:
        x, y = data
        y = y.long()
        if use_gpu:
            x = x.cuda()
            y = y.cuda()
        x, y = Variable(x), Variable(y)

        # Forward.
        score, _ = model(x)
        loss = criterion(score, y.view(-1))

        # Backward.
        optimizer.zero_grad()
        loss.backward()
        # Clip the gradient norm to at most 5.
        nn.utils.clip_grad_norm_(model.parameters(), 5)
        optimizer.step()

        # Accumulate the scalar loss value.
        train_loss += loss.item()
    print('epoch: {}, perplexity is: {:.3f}, lr:{:.1e}'.format(e+1, np.exp(train_loss / len(train_data)), optimizer.lr))

epoch: 1, perplexity is: 290.865, lr:1.0e-03
epoch: 2, perplexity is: 190.468, lr:1.0e-03
epoch: 3, perplexity is: 124.909, lr:1.0e-03
epoch: 4, perplexity is: 88.715, lr:1.0e-03
epoch: 5, perplexity is: 67.819, lr:1.0e-03
epoch: 6, perplexity is: 53.798, lr:1.0e-03
epoch: 7, perplexity is: 43.619, lr:1.0e-03
epoch: 8, perplexity is: 36.032, lr:1.0e-03
epoch: 9, perplexity is: 30.195, lr:1.0e-03
epoch: 10, perplexity is: 25.569, lr:1.0e-03
epoch: 11, perplexity is: 21.868, lr:1.0e-03
epoch: 12, perplexity is: 18.918, lr:1.0e-03
epoch: 13, perplexity is: 16.482, lr:1.0e-03
epoch: 14, perplexity is: 14.505, lr:1.0e-03
epoch: 15, perplexity is: 12.870, lr:1.0e-03
epoch: 16, perplexity is: 11.489, lr:1.0e-03
epoch: 17, perplexity is: 10.358, lr:1.0e-03
epoch: 18, perplexity is: 9.416, lr:1.0e-03
epoch: 19, perplexity is: 8.619, lr:1.0e-03
epoch: 20, perplexity is: 7.905, lr:1.0e-03


After training, the model reaches a perplexity of about 7.9, and we can now start generating text.

### Generate text
Generating text is simple, as described earlier: given some starting characters, the model generates one character after another, and each generated character is fed back into the network as the next input.

To add randomness, we do not always take the most likely character. Instead, we sample from the five most likely predictions in proportion to their probabilities.


In [18]:
def pick_top_n(preds, top_n=5):
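    # Turn the raw scores into probabilities before sampling
    probs = torch.softmax(preds, dim=1)
    top_pred_prob, top_pred_label = torch.topk(probs, top_n, 1)
    # Renormalize so that the top-n probabilities sum to 1
    top_pred_prob /= torch.sum(top_pred_prob)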
    top_pred_prob = top_pred_prob.squeeze(0).cpu().numpy()
    top_pred_label = top_pred_label.squeeze(0).cpu().numpy()
    c = np.random.choice(top_pred_label, size=1, p=top_pred_prob)
    return c

In [19]:
begin = '天青色等烟雨'
text_len = 30

model = model.eval()
samples = [convert.word_to_int(c) for c in begin]
input_txt = torch.LongTensor(samples)[None]
if use_gpu:
    input_txt = input_txt.cuda()
input_txt = Variable(input_txt)
_, init_state = model(input_txt)
result = samples
model_input = input_txt[:, -1][:, None]
for i in range(text_len):
    out, init_state = model(model_input, init_state)
    pred = pick_top_n(out.data)
    model_input = Variable(torch.LongTensor(pred))[None]
    if use_gpu:
        model_input = model_input.cuda()
    result.append(pred[0])
text = convert.arr_to_text(result)
print('Generate text is: {}'.format(text))

Generate text is: 天青色等烟雨 片帆天际波中象璧 不似到仙林何在 新春山月低心出 波透兔中


Finally, you can see that the generated text already looks like lines of poetry.
