# Tutorial - Generative Recurrent Neural Networks

Last time we discussed using recurrent neural networks to make predictions about sequences. In particular, we treated tweets as a **sequence** of words. Since tweets can have a variable number of words, we needed an architecture that can take variable-sized sequences as input.

This time, we will use recurrent neural networks to **generate** sequences.
Generating sequences is more involved than making predictions about
sequences. However, it is a very interesting task, and many students choose
sequence-generation tasks for their projects.

Much of today's content is an adaptation of the "Practical PyTorch" GitHub
repository [1].

[1] https://github.com/spro/practical-pytorch/blob/master/char-rnn-generation/char-rnn-generation.ipynb

## Review

In recurrent neural networks, the input sequence is broken down into tokens. We could choose whether to tokenize based on words, or based on characters. The representation of each token (GloVe or one-hot) is processed by the RNN one step at a time to update the hidden (or context) state.

In a predictive RNN, the value of the hidden state is a representation of **all the text that was processed thus far**. Similarly, in a generative RNN, the value of the hidden state will be a representation of **all the text that still needs to be generated**. We will use this hidden state to produce the sequence, one token at a time.

Similar to the last tutorial, we will break up the problem of generating text
into generating one token at a time.

We will do so with the help of two functions:

1. We need to be able to generate the *next* token, given the current
   hidden state. In practice, we get a probability distribution over
   the next token, and sample from that probability distribution.
2. We need to be able to update the hidden state somehow. To do so,
   we need two pieces of information: the old hidden state, and the actual
   token that was generated in the previous step. The actual token generated
   will inform the subsequent tokens.

We will repeat both functions until a special "END OF SEQUENCE" token is
generated.
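
To make this loop concrete, here is a minimal sketch of the control flow in plain Python. The functions `next_token_distribution` and `update_hidden`, the three-token vocabulary, and the dictionary used as a "hidden state" are all made up for illustration; in the rest of this tutorial, an RNN plays both roles.

```python
import random

# Hypothetical toy vocabulary (not the one used later in this tutorial).
VOCAB = ["a", "b", "<EOS>"]

def next_token_distribution(hidden):
    """Function 1: a probability for each token in VOCAB, given the hidden state."""
    # Toy rule: the longer we have been generating, the likelier <EOS> becomes.
    p_eos = min(1.0, 0.1 * hidden["steps"])
    p_other = (1.0 - p_eos) / 2
    return [p_other, p_other, p_eos]

def update_hidden(hidden, token):
    """Function 2: the new hidden state, from the old state and the sampled token."""
    return {"steps": hidden["steps"] + 1, "last": token}

def generate(max_len=50, seed=0):
    rng = random.Random(seed)
    hidden = {"steps": 0, "last": "<BOS>"}
    generated = []
    for _ in range(max_len):
        probs = next_token_distribution(hidden)
        # Sample one token according to the predicted probabilities.
        token = rng.choices(VOCAB, weights=probs, k=1)[0]
        if token == "<EOS>":
            break  # the special token tells us the sequence is complete
        generated.append(token)
        hidden = update_hidden(hidden, token)
    return "".join(generated)

print(generate())
```

The `max_len` cap is a safety net: a real model is not guaranteed to ever produce the end-of-sequence token.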

Note that there are several tricky things that we will have to figure out.
For example, how do we sample a token from the probability
distribution over tokens? What do we do during training, and how might
that differ from testing/evaluation? We will answer those
questions as we implement the RNN.

For now, let's start with our training data.

## Data: Donald Trump's Tweets from 2018

The training set we use is a collection of Donald Trump's tweets from 2018.
We will only use tweets that are 140 characters or shorter, and tweets
that contain more than just a URL.
Since tweets often contain creative spelling and numbers, and upper vs. lower
case characters are read very differently, we will use a character-level RNN.

To start, let us load the trump.csv file to Google Colab and provide access to the drive. The file can be obtained from Quercus.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pip install -U torchtext==0.6



In [3]:
import csv

# file location (make sure to use your file location)
file_dir = '/content/drive/MyDrive/APS360/tut6/'

tweets = list(line[0] for line in csv.reader(open(file_dir + 'trump.csv')))
len(tweets)

22402

There are over 20000 tweets in this collection.
Let's look at a few of them, just to get a sense of the kind of text
we're dealing with:

In [4]:
print(tweets[100])
print(tweets[1000])
print(tweets[10000])

God Bless the people of Venezuela!
It was my honor. THANK YOU! https://t.co/1LvqbRQ1bi
Nobody but Donald Trump will save Israel. You are wasting your time with these politicians and political clowns. Best! #SheldonAdelson


## Generating One Tweet

Normally, when we build a new machine learning model, we want to make sure
that our model can overfit. To that end, we will first build a neural network
that can generate _one_ tweet really well. We can choose any tweet (or any other text) we want. Let's choose to build an RNN that generates `tweets[100]`.

In [5]:
tweet = tweets[100]
print(tweet)
print(len(tweet))

God Bless the people of Venezuela!
34


First, we will need to encode this tweet using a one-hot encoding.
We'll build dictionary mappings
from the character to the index of that character (a unique integer identifier),
and from the index to the character. We'll use the same naming scheme that `torchtext`
uses (`stoi` and `itos`).

For simplicity, we'll work with a limited vocabulary containing
just the characters in `tweets[100]`, plus two special tokens:

- `<EOS>` represents "End of String", which we'll append to the end of our tweet.
  Since tweets are variable-length, this is a way for the RNN to signal
  that the entire sequence has been generated.
- `<BOS>` represents "Beginning of String", which we'll prepend to the beginning of
  our tweet. This is the first token that we will feed into the RNN.

The way we use these special tokens will become more clear as we build the model.

In [6]:
vocab = list(set(tweet)) + ["<BOS>", "<EOS>"]
vocab_stoi = {s: i for i, s in enumerate(vocab)} # String to index
vocab_itos = {i: s for i, s in enumerate(vocab)} # Index to string
vocab_size = len(vocab)

In [7]:
print("Vocab")
print(vocab)
print("STOI")
print(vocab_stoi)
print("ITOS")
print(vocab_itos)
print("Size")
print(vocab_size)


Vocab
['d', 'V', 'l', '!', 's', 'a', 'o', 'e', 'z', 'h', 'f', 'G', 'u', 't', 'n', ' ', 'p', 'B', '<BOS>', '<EOS>']
STOI
{'d': 0, 'V': 1, 'l': 2, '!': 3, 's': 4, 'a': 5, 'o': 6, 'e': 7, 'z': 8, 'h': 9, 'f': 10, 'G': 11, 'u': 12, 't': 13, 'n': 14, ' ': 15, 'p': 16, 'B': 17, '<BOS>': 18, '<EOS>': 19}
ITOS
{0: 'd', 1: 'V', 2: 'l', 3: '!', 4: 's', 5: 'a', 6: 'o', 7: 'e', 8: 'z', 9: 'h', 10: 'f', 11: 'G', 12: 'u', 13: 't', 14: 'n', 15: ' ', 16: 'p', 17: 'B', 18: '<BOS>', 19: '<EOS>'}
Size
20


In [8]:
# Example of string -> index
print(vocab_stoi["s"])
# Example of index -> string
print(vocab_itos[17])

4
B
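
To see how the special tokens and the two mappings fit together, here is a small sketch on a different string. The example text `"hello"` and the helper `one_hot` are made up for illustration. Unlike the cell above, this sketch sorts the characters, because the iteration order of a Python `set` of strings can change from run to run, and with it the index assigned to each character.

```python
text = "hello"  # hypothetical example text

# Vocabulary: the distinct characters plus the two special tokens.
# sorted() is used only to make the index assignment reproducible.
vocab = sorted(set(text)) + ["<BOS>", "<EOS>"]
stoi = {s: i for i, s in enumerate(vocab)}  # string to index
itos = {i: s for i, s in enumerate(vocab)}  # index to string

# Wrap the text with the special tokens, then map each token to its index.
tokens = ["<BOS>"] + list(text) + ["<EOS>"]
indices = [stoi[t] for t in tokens]

def one_hot(idx, size):
    """One-hot vector for an index: row `idx` of the size-by-size identity matrix."""
    return [1 if j == idx else 0 for j in range(size)]

# Decoding drops the special tokens and joins the remaining characters.
decoded = "".join(itos[i] for i in indices if itos[i] not in ("<BOS>", "<EOS>"))

print(vocab)                            # ['e', 'h', 'l', 'o', '<BOS>', '<EOS>']
print(indices)                          # [4, 1, 0, 2, 2, 3, 5]
print(one_hot(indices[0], len(vocab)))  # [0, 0, 0, 0, 1, 0]
print(decoded)                          # hello
```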


Now that we have our vocabulary, we can build the PyTorch model
for this problem.
The model itself is not as complex as you might think: we have
already learned about all the components that we need. (Using and training
the model is the hard part.)

In [9]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [10]:
class TextGenerator(nn.Module):
    def __init__(self, vocab_size, hidden_size, n_layers=1):
        super(TextGenerator, self).__init__()

        # identity matrix for generating one-hot vectors
        self.ident = torch.eye(vocab_size)

        # recurrent neural network (GRU: Gated Recurrent Unit)
        self.rnn = nn.GRU(vocab_size, hidden_size, n_layers, batch_first=True)

        # a fully-connected layer that outputs a distribution over
        # the next token, given the RNN output
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def forward(self, inp, hidden=None):
        inp = self.ident[inp]                  # generate one-hot vectors of input
        output, hidden = self.rnn(inp, hidden) # get the next output and hidden state
        output = self.decoder(output)          # predict distribution over next tokens
        return output, hidden

model = TextGenerator(vocab_size, hidden_size=64)

## Training with Teacher Forcing

At a very high level, we want our RNN model to have a high probability
of generating the tweet. An RNN model generates text
one character at a time based on the hidden state value.
At each time step, we will check whether the model generated the
correct character. That is, at each time step,
we are trying to select the correct next character out of all the
characters in our vocabulary. Recall that this problem is a multi-class
classification problem, and we can use Cross-Entropy loss to train our
network to become better at this type of problem.
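
As a reminder of what the cross-entropy loss measures for a single time step, here is a plain-Python sketch: the loss is the negative log of the softmax probability that the model assigns to the correct token. The logit values below are made up. The sketch also shows that a model with no preference among 20 tokens scores a loss of log(20), which is a useful baseline for what an untrained model with our 20-token vocabulary should produce.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability; the result is unchanged.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    # Negative log-probability assigned to the correct class.
    return -math.log(softmax(logits)[target_index])

vocab_size = 20

# All logits equal: every token gets probability 1/20.
uniform_logits = [0.0] * vocab_size
loss_uniform = cross_entropy(uniform_logits, 11)
print(loss_uniform)  # log(20), about 2.9957

# A large logit on the correct token gives a much smaller loss.
confident_logits = [0.0] * vocab_size
confident_logits[11] = 5.0
loss_confident = cross_entropy(confident_logits, 11)
print(loss_confident)
```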

In [11]:
criterion = nn.CrossEntropyLoss()

However, we don't just have a single multi-class classification problem.
Instead, we have **one classification problem per time-step** (per token)!
So, how do we predict the first token in the sequence?
How do we predict the second token in the sequence?

To help you understand what happens during RNN training, we'll start with
inefficient training code that shows you what happens step-by-step. We'll
start with computing the loss for the first token generated, then the second token,
and so on.
Later on, we'll switch to a simpler and more performant version of the code.

So, let's start with the first classification problem: the problem of generating
the **first** token (`tweet[0]`).

To generate the first token, we'll feed the RNN (with an initial, empty
hidden state) the "<BOS>" token. Then, the output is the model's prediction
(as unnormalized scores) over the first token.

In [12]:
# The first input is the "<BOS>" token
bos_input = torch.Tensor([vocab_stoi["<BOS>"]])
print(bos_input)

tensor([18.])


In [13]:
print(bos_input.shape, type(bos_input))
bos_input = bos_input.long()
print(bos_input.shape, type(bos_input))
bos_input = bos_input.unsqueeze(0)
print(bos_input.shape, type(bos_input))


torch.Size([1]) <class 'torch.Tensor'>
torch.Size([1]) <class 'torch.Tensor'>
torch.Size([1, 1]) <class 'torch.Tensor'>


In [14]:
output, hidden = model(bos_input, hidden=None)
print("Output for first token - Hidden state 0")
print(output) # distribution over the first token
print()
print("Hidden state:")
print(hidden)
print(hidden.shape)
# It is not by chance that the output is 20 dimensional - same length as the vocabulary

Output for first token - Hidden state 0
tensor([[[ 0.0642, -0.1402, -0.0681,  0.0219, -0.1105,  0.0041,  0.0655,
          -0.0715, -0.0222, -0.1096,  0.0453,  0.1179,  0.0894, -0.0363,
          -0.0339, -0.0599,  0.0808, -0.0526, -0.1597,  0.0347]]],
       grad_fn=<ViewBackward0>)

Hidden state:
tensor([[[ 3.1157e-03, -5.1094e-02, -2.1531e-02,  1.7843e-02, -5.8037e-02,
          -2.0454e-03,  2.2216e-02, -5.6880e-02, -3.5258e-04, -9.3137e-04,
           3.8888e-03,  8.5791e-02, -3.2170e-02, -7.7363e-02, -3.0693e-02,
           2.4126e-02,  6.0290e-02, -4.9565e-02,  1.0500e-03,  1.7272e-02,
          -2.2079e-02,  3.4342e-02, -2.4378e-02, -5.2401e-02, -2.6204e-02,
           4.2237e-02,  1.1823e-02,  3.9661e-02, -1.4478e-02,  3.0761e-02,
           2.5878e-02, -4.9325e-02, -4.1011e-02,  4.3148e-02, -7.0195e-02,
          -8.9750e-02,  1.1353e-01,  6.6081e-02, -1.0613e-01, -4.0260e-02,
           7.0838e-02,  2.3377e-02, -1.1878e-03, -1.7344e-02, -4.3672e-02,
          -9.5209e-02, -1

In [15]:
bos_input

tensor([[18]])

In [16]:
tweet

'God Bless the people of Venezuela!'

In [17]:
tweet[0]

'G'

We can compute the loss using `criterion`. Since the model is untrained,
the loss is expected to be high. (For now, we won't do anything
with this loss, and omit the backward pass.)

In [18]:
target = torch.Tensor([vocab_stoi[tweet[0]]]).long().unsqueeze(0)
criterion(output.reshape(-1, vocab_size), # reshape to 2D tensor
          target.reshape(-1))             # reshape to 1D tensor

tensor(2.8639, grad_fn=<NllLossBackward0>)

In [19]:
if True:
  print(target)
  print(vocab_itos[int(target[0][0])])
  print(output)
  print(output.reshape(-1, vocab_size))
  print(target.reshape(-1))

tensor([[11]])
G
tensor([[[ 0.0642, -0.1402, -0.0681,  0.0219, -0.1105,  0.0041,  0.0655,
          -0.0715, -0.0222, -0.1096,  0.0453,  0.1179,  0.0894, -0.0363,
          -0.0339, -0.0599,  0.0808, -0.0526, -0.1597,  0.0347]]],
       grad_fn=<ViewBackward0>)
tensor([[ 0.0642, -0.1402, -0.0681,  0.0219, -0.1105,  0.0041,  0.0655, -0.0715,
         -0.0222, -0.1096,  0.0453,  0.1179,  0.0894, -0.0363, -0.0339, -0.0599,
          0.0808, -0.0526, -0.1597,  0.0347]], grad_fn=<ViewBackward0>)
tensor([11])


In [20]:
int(target[0][0])

11

Now, we need to update the hidden state and generate a prediction
for the next token. To do so, **we need to provide the current token to
the RNN**. We already said that during test time, we'll need to sample
from the predicted probability distribution over tokens that the neural
network just generated.

Right now, we can do something better: we can **use the ground-truth,
actual target token**. This technique is called **teacher-forcing**,
and generally speeds up training. The reason is that right now,
since our model does not perform well, the predicted probability
distribution is pretty far from the ground truth. So, it is very,
very difficult for the neural network to get back on track given bad
input data.

In [21]:
# Use teacher-forcing: we pass in the ground truth `target`,
# rather than using the NN predicted distribution
output, hidden = model(target, hidden)
output # distribution over the second token

tensor([[[ 0.0870, -0.1596, -0.0459,  0.0446, -0.0911,  0.0282,  0.0876,
          -0.0826, -0.0315, -0.1079,  0.0296,  0.0712,  0.1149,  0.0295,
          -0.0699, -0.0842,  0.1033, -0.0481, -0.1743,  0.0039]]],
       grad_fn=<ViewBackward0>)

Similar to the first step, we can compute the loss, quantifying the
difference between the predicted distribution and the actual next
token. This loss can be used to adjust the weights of the neural
network (which we are not doing yet).

In [22]:
target = torch.Tensor([vocab_stoi[tweet[1]]]).long().unsqueeze(0)
criterion(output.reshape(-1, vocab_size), # reshape to 2D tensor
          target.reshape(-1))             # reshape to 1D tensor

if True:
  print(target)
  print(vocab_itos[int(target[0][0])])

tensor([[6]])
o


We can continue this process of:

- feeding the previous ground-truth token to the RNN,
- obtaining the prediction distribution over the next token, and
- computing the loss,

for as many steps as there are tokens in the ground-truth tweet.

In [23]:
for i in range(2, len(tweet)):
    output, hidden = model(target, hidden) ## target for teacher forcing
    target = torch.Tensor([vocab_stoi[tweet[i]]]).long().unsqueeze(0)
    loss = criterion(output.reshape(-1, vocab_size), # reshape to 2D tensor
                     target.reshape(-1))             # reshape to 1D tensor
    print(i, output, loss)
    if True:
      print('*')
      print(vocab_itos[int(target[0][0])])
      print('*')


2 tensor([[[ 0.0931, -0.1537, -0.0544,  0.0621, -0.0726,  0.0507,  0.0596,
          -0.0764, -0.0258, -0.1170,  0.0753,  0.0371,  0.1558, -0.0104,
          -0.0096, -0.1044,  0.1224, -0.0504, -0.2011,  0.0384]]],
       grad_fn=<ViewBackward0>) tensor(2.8978, grad_fn=<NllLossBackward0>)
*
d
*
3 tensor([[[ 0.1269, -0.1355, -0.0324,  0.0390, -0.0339,  0.0387,  0.0541,
          -0.0940, -0.0017, -0.1407,  0.0455,  0.0679,  0.1304,  0.0324,
           0.0135, -0.1132,  0.0932, -0.0212, -0.1425,  0.0393]]],
       grad_fn=<ViewBackward0>) tensor(3.1107, grad_fn=<NllLossBackward0>)
*
 
*
4 tensor([[[ 0.1453, -0.1669, -0.0721,  0.0492, -0.0684,  0.0393,  0.0062,
          -0.0983,  0.0016, -0.1102,  0.0291,  0.0607,  0.1533,  0.0197,
           0.0010, -0.1070,  0.1172, -0.0509, -0.1947,  0.0259]]],
       grad_fn=<ViewBackward0>) tensor(3.0400, grad_fn=<NllLossBackward0>)
*
B
*
5 tensor([[[ 0.1413, -0.1700, -0.0483,  0.0295, -0.0634,  0.0611,  0.0235,
          -0.1235, -0.0066, -0.1226, 

Finally, after feeding in the last character of the tweet, we want the model
to output the "<EOS>" token, so that our RNN learns when to stop generating
characters.

In [24]:
output, hidden = model(target, hidden)
target = torch.Tensor([vocab_stoi["<EOS>"]]).long().unsqueeze(0)
loss = criterion(output.reshape(-1, vocab_size), # reshape to 2D tensor
                 target.reshape(-1))             # reshape to 1D tensor
print(i, output, loss)

33 tensor([[[ 0.1341, -0.1527, -0.0270,  0.0259, -0.0832,  0.0709,  0.0623,
          -0.1233, -0.0607, -0.0471, -0.0062,  0.0258,  0.1584, -0.0066,
          -0.0076, -0.0887,  0.1256, -0.0601, -0.2030,  0.0040]]],
       grad_fn=<ViewBackward0>) tensor(2.9831, grad_fn=<NllLossBackward0>)


In practice, we don't really need a loop. Recall that in a predictive RNN,
PyTorch's recurrent modules (`nn.RNN`, and likewise the `nn.GRU` we use here)
can take an entire sequence as input. We can do the same thing here:

In [25]:
tweet_ch = ["<BOS>"] + list(tweet) + ["<EOS>"]
tweet_indices = [vocab_stoi[ch] for ch in tweet_ch]
tweet_tensor = torch.Tensor(tweet_indices).long().unsqueeze(0)

print(tweet_tensor.shape)
print("Input tensor")
print(tweet_tensor)


output, hidden = model(tweet_tensor[:,:-1]) # <EOS> is never an input token
target = tweet_tensor[:,1:]                 # <BOS> is never a target token
loss = criterion(output.reshape(-1, vocab_size), # reshape to 2D tensor
                 target.reshape(-1))             # reshape to 1D tensor
print("Target tensor")
print(target)

torch.Size([1, 36])
Input tensor
tensor([[18, 11,  6,  0, 15, 17,  2,  7,  4,  4, 15, 13,  9,  7, 15, 16,  7,  6,
         16,  2,  7, 15,  6, 10, 15,  1,  7, 14,  7,  8, 12,  7,  2,  5,  3, 19]])
Target tensor
tensor([[11,  6,  0, 15, 17,  2,  7,  4,  4, 15, 13,  9,  7, 15, 16,  7,  6, 16,
          2,  7, 15,  6, 10, 15,  1,  7, 14,  7,  8, 12,  7,  2,  5,  3, 19]])


Here, the input to our neural network model is the *entire*
sequence of input tokens (everything from "<BOS>" to the
last character of the tweet). The neural network generates a predicted distribution
over the next token at each step. We can compare each of these with the
corresponding entry of the ground-truth `target`.
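
The shift between inputs and targets is easy to get wrong, so here is a small sketch of the alignment on a shortened string. The three-character text is chosen only to keep the printout short.

```python
tokens = ["<BOS>"] + list("God") + ["<EOS>"]

inputs = tokens[:-1]   # everything except the last token: <EOS> is never an input
targets = tokens[1:]   # everything except the first token: <BOS> is never a target

# At each step, the model sees `inp` (and the hidden state) and should predict `tgt`.
for step, (inp, tgt) in enumerate(zip(inputs, targets)):
    print(f"step {step}: input={inp!r:8} target={tgt!r}")
```

The two lists have the same length, and the target at each step is the input of the following step. Reusing the targets as the next inputs in this way is teacher forcing applied to the whole sequence at once.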


Our training loop (for learning to generate the single `tweet`) will therefore
look something like this:

In [26]:
print(tweet_tensor[:,:-1])
print(target)

tensor([[18, 11,  6,  0, 15, 17,  2,  7,  4,  4, 15, 13,  9,  7, 15, 16,  7,  6,
         16,  2,  7, 15,  6, 10, 15,  1,  7, 14,  7,  8, 12,  7,  2,  5,  3]])
tensor([[11,  6,  0, 15, 17,  2,  7,  4,  4, 15, 13,  9,  7, 15, 16,  7,  6, 16,
          2,  7, 15,  6, 10, 15,  1,  7, 14,  7,  8, 12,  7,  2,  5,  3, 19]])


In [27]:
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()
for it in range(1000):
    optimizer.zero_grad()
    output, _ = model(tweet_tensor[:,:-1])
    loss = criterion(output.reshape(-1, vocab_size),
                 target.reshape(-1))
    loss.backward()
    optimizer.step()

    if (it+1) % 100 == 0:
        print("[Iter %d] Loss %f" % (it+1, float(loss)))

[Iter 100] Loss 1.794040
[Iter 200] Loss 0.138833
[Iter 300] Loss 0.029274
[Iter 400] Loss 0.013449
[Iter 500] Loss 0.008067
[Iter 600] Loss 0.005485
[Iter 700] Loss 0.004013
[Iter 800] Loss 0.003080
[Iter 900] Loss 0.002447
[Iter 1000] Loss 0.001994


The training loss is decreasing with training, which is what we expect.

## Generating a Token

At this point, we want to see whether our model is actually learning
something. So, we need to talk about how to
actually use the RNN model to generate text. If we can
generate text, we can make a qualitative assessment of how well
our RNN is performing.

The main difference between training and test-time (generation time)
is that we don't have the ground-truth tokens to feed as inputs
to the RNN. Instead, we need to actually **sample** a token based
on the neural network's prediction distribution.

But how can we sample a token from a distribution?

On one extreme, we can always take
the token with the largest probability (argmax). This has been our
go-to technique in other classification tasks. However, this idea
will fail here. The reason is that in practice,
**we want to be able to generate a variety of different sequences from
the same model**. An RNN that can only generate a single new Trump Tweet
is fairly useless.

In short, we want some randomness. We can do so by using the logit
outputs from our model to construct a multinomial distribution over
the tokens and then sample a random token from that multinomial distribution.
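
The difference between the two strategies can be sketched with the standard library alone. The three tokens and their probabilities below are made up for illustration.

```python
import random
from collections import Counter

tokens = ["a", "b", "c"]
probs = [0.7, 0.2, 0.1]  # hypothetical predicted distribution

# Argmax: always returns the same, most probable token.
greedy = tokens[max(range(len(probs)), key=probs.__getitem__)]

# Sampling: each token is drawn in proportion to its probability.
rng = random.Random(0)
samples = rng.choices(tokens, weights=probs, k=1000)
counts = Counter(samples)

print(greedy)   # a
print(counts)   # "a" is the most frequent, but "b" and "c" also occur
```

With argmax, repeated generation from the same starting state always yields the same sequence. With sampling, less likely tokens are chosen some of the time, which is where the variety comes from.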

One natural multinomial distribution we can choose is the
distribution we get after applying the softmax on the outputs.
However, we will do one more thing: we will add a **temperature**
parameter to manipulate the softmax outputs. We can set a
**higher temperature** to make the probability of each token
**more even** (more random), or a **lower temperature** to assign
more probability to the tokens with a higher logit (output).
A **higher temperature** means that we will get a more diverse sample,
with potentially more mistakes. A **lower temperature** means that we
may see repetitions of the same high probability sequence.
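
The effect of the temperature can be checked numerically. The sketch below divides a set of made-up logits by the temperature before applying the softmax; the function name `softmax_with_temperature` is introduced here for illustration only.

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide the logits by the temperature, then apply the usual softmax.
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical model outputs for three tokens
for t in (0.5, 1.0, 2.0, 5.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature grows, the probabilities move toward the uniform value of 1/3; as it shrinks, almost all of the probability goes to the token with the largest logit.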

In [28]:
def sample_sequence(model, max_len=100, temperature=0.8):
    generated_sequence = ""

    inp = torch.Tensor([vocab_stoi["<BOS>"]]).long()
    hidden = None
    for p in range(max_len):
        output, hidden = model(inp.unsqueeze(0), hidden)
        # Sample from the network as a multinomial distribution
        output_dist = output.data.view(-1).div(temperature).exp()
        top_i = int(torch.multinomial(output_dist, 1)[0])
        # Add predicted character to string and use as next input
        predicted_char = vocab_itos[top_i]

        if predicted_char == "<EOS>":
            break
        generated_sequence += predicted_char
        inp = torch.Tensor([top_i]).long()
    return generated_sequence

print(sample_sequence(model, temperature=0.8))
print(sample_sequence(model, temperature=1.0))
print(sample_sequence(model, temperature=1.5))
print(sample_sequence(model, temperature=2.0))
print(sample_sequence(model, temperature=3.0))

God Bless the people of Venezuela!
God Bless the people of Venezuela!
God Bless the people of VenezVeoa!
GddBlless thespeoplepof Venepuela!
popBBlfssttue peBple ofnee!


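Since we only trained the model on a single sequence, the samples at
lower temperatures simply reproduce that tweet. Only at higher temperatures
does the added randomness become visible, in the form of wrong characters.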

For now, the output of the calls to the `sample_sequence` function
assures us that our training code looks reasonable, and we can
proceed to training on our full dataset!

## Training the Trump Tweet Generator

For the actual training, let's use `torchtext` so that we can use
the `BucketIterator` to make batches. Like in Lab 5, we'll create a
`torchtext.data.Field` (this is its location in the `torchtext` 0.6 release
that we installed above) to read the CSV file and convert
characters into indices. The object has convenient parameters to specify
the BOS and EOS tokens.

In [29]:
import torchtext

text_field = torchtext.data.Field(sequential=True, # text sequence
                                  tokenize=lambda x: x, # because we are building a character-RNN
                                  include_lengths=True, # to track the length of sequences, for batching
                                  batch_first=True,
                                  use_vocab=True,       # to turn each character into an integer index
                                  init_token="<BOS>",   # BOS token
                                  eos_token="<EOS>")    # EOS token

fields = [('text', text_field), ('created_at', None), ('id_str', None)]
trump_tweets = torchtext.data.TabularDataset(file_dir + "trump.csv", "csv", fields)
len(trump_tweets) # should be >20,000 like before

22402

In [30]:
text_field.build_vocab(trump_tweets)
vocab_stoi = text_field.vocab.stoi # so we don't have to rewrite sample_sequence
vocab_itos = text_field.vocab.itos # so we don't have to rewrite sample_sequence
vocab_size = len(text_field.vocab.itos)
vocab_size

253

Let's verify that the `BucketIterator` works as expected, starting with a `batch_size` of 10.

In [31]:
data_iter = torchtext.data.BucketIterator(trump_tweets,
                                          batch_size=10,
                                          sort_key=lambda x: len(x.text),
                                          sort_within_batch=True)
for (tweet, lengths), label in data_iter:
    print(label)   # should be None
    print(lengths) # contains the length of the tweet(s) in batch
    print(tweet.shape) # should be [10, max(lengths)]
    break

None
tensor([138, 138, 138, 138, 138, 138, 138, 138, 138, 138])
torch.Size([10, 138])


To account for batching, our actual training code will change, but just a little bit.
In fact, our training code from before will work with a batch size larger than one!

In [32]:
def train(model, data, batch_size=1, num_epochs=1, lr=0.001, print_every=100):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    it = 0

    data_iter = torchtext.data.BucketIterator(data,
                                              batch_size=batch_size,
                                              sort_key=lambda x: len(x.text),
                                              sort_within_batch=True)
    for e in range(num_epochs):
        # get training set
        avg_loss = 0
        for (tweet, lengths), label in data_iter:
            target = tweet[:, 1:] # Exclude BOS
            inp = tweet[:, :-1] # Exclude EOS
            # cleanup
            optimizer.zero_grad()
            # forward pass
            output, _ = model(inp)
            loss = criterion(output.reshape(-1, vocab_size), target.reshape(-1))
            # backward pass
            loss.backward()
            optimizer.step()

            avg_loss += float(loss) # accumulate a plain number, not a tensor with autograd history
            it += 1 # increment iteration count
            if it % print_every == 0:
                print("[Iter %d] Loss %f" % (it+1, float(avg_loss/print_every)))
                print("    " + sample_sequence(model, 140, 0.8))
                avg_loss = 0
            if it>2000:
              break

model = TextGenerator(vocab_size, 64)
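
One detail to be aware of: when `batch_size` is larger than one, shorter tweets in a batch are padded, and the `train` function above includes those padded positions in the loss. A possible refinement, sketched below, is to pass `ignore_index` to `nn.CrossEntropyLoss` so that padded targets are skipped. The value `PAD_IDX = 1` and the random logits are assumptions made for this sketch; with the `text_field` above, the real index would presumably be looked up as `text_field.vocab.stoi[text_field.pad_token]`, which you should verify in your own setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
PAD_IDX = 1      # assumption: index of the padding token in the vocabulary
vocab_size = 6   # small made-up vocabulary for this sketch

# Two target sequences of length 4; the second is padded after two real tokens.
target = torch.tensor([[2, 3, 4, 5],
                       [2, 5, PAD_IDX, PAD_IDX]])
logits = torch.randn(2, 4, vocab_size)  # stand-in for the model output

flat_logits = logits.reshape(-1, vocab_size)
flat_target = target.reshape(-1)

# Default loss: averages over all 8 positions, padding included.
loss_all = nn.CrossEntropyLoss()(flat_logits, flat_target)
# Masked loss: positions whose target is PAD_IDX are left out of the average.
loss_masked = nn.CrossEntropyLoss(ignore_index=PAD_IDX)(flat_logits, flat_target)

# Check: the masked loss equals the default loss on the non-padding positions.
keep = flat_target != PAD_IDX
loss_manual = nn.CrossEntropyLoss()(flat_logits[keep], flat_target[keep])

print(float(loss_all), float(loss_masked), float(loss_manual))
```

This is an optional variation, not what the training runs in this tutorial use; the losses printed in the outputs here were computed with padding included.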

In [33]:
train(model, trump_tweets, batch_size=1, num_epochs=10, lr=0.004, print_every=100)
print(sample_sequence(model, temperature=0.8))
print(sample_sequence(model, temperature=0.8))
print(sample_sequence(model, temperature=1.0))
print(sample_sequence(model, temperature=1.0))
print(sample_sequence(model, temperature=1.5))
print(sample_sequence(model, temperature=1.5))
print(sample_sequence(model, temperature=2.0))
print(sample_sequence(model, temperature=2.0))
print(sample_sequence(model, temperature=5.0))
print(sample_sequence(model, temperature=5.0))

[Iter 101] Loss 3.680947
    👗on
[Iter 201] Loss 3.288491
    JrgNp!
[Iter 301] Loss 3.043342
    @Allig/mpececone yete toreons t alton oKekintiny lales @or bam en forilut @@lDot wougcae i1 oory MS fer fhindere tpas Ionde @le a putps paT 
[Iter 401] Loss 2.890282
    @Wau listet fous cte Wens en ber omelDondente as al ing bengthalacomeroc. @veld toull The Ton Cpid Iww bs inicold.coup!
[Iter 501] Loss 2.783324
    Dorl 
[Iter 601] Loss 2.652318
    You! Wex  GRwTruid ha the stod ly ine tornvind fam if Trump://t.ce/tCcald Tr0ad py.
[Iter 701] Loss 2.529679
    @fubad Mang on the jele you to nes wam of grUacdTrumpl"
[Iter 801] Loss 2.509182
    @ayforsoote a lang Oid the hico be th to jast #ctuspes ane!2016: Wot the thing the an the deke Nes oups to apd to Wore an the icn in she tht
[Iter 901] Loss 2.433627
    @revillareat @realDonaldTrump in @realDonillo to um Sons an of is anl o dorearareg I an O  Ind enande in thing stos on Vewateon chands: “lec
[Iter 1001] Loss 2.363587
    @adalloyo

In [34]:
len(trump_tweets)

22402

In [35]:
train(model, trump_tweets, batch_size=32, num_epochs=1, lr=0.004, print_every=100)
print(sample_sequence(model, temperature=0.8))
print(sample_sequence(model, temperature=1.0))
print(sample_sequence(model, temperature=1.5))
print(sample_sequence(model, temperature=2.0))
print(sample_sequence(model, temperature=5.0))

[Iter 101] Loss 2.104971
    @realDonaldTrump will erviin was be @realDonaldTrump ener make you!
[Iter 201] Loss 2.037815
    So onting allant @peasherger @Coudsuttion so is for https://t.co/HaBj9c4Q7UY
[Iter 301] Loss 1.992187
    .@ThenicaGreakeryelace @vento and me in wi las us at ahaue mo the @relenduntareer every. Weth sut the great reccul everist!
[Iter 401] Loss 1.949339
    The @HeratwaryonDeat. Irank on @Coxamondact—are into deviews will for are degest vite in Donald reain agan the Great the meding it shey best
[Iter 501] Loss 1.924959
    @ailfuch our you realling &amp; have romisher you ware the sogir deening congah and do now of can time to need &amp; to than the renus the U
[Iter 601] Loss 1.902433
    Make @Wuttucto Americano of Flednore @foxandatia into of - had at &amp; On the camers In onay to dayo.
[Iter 701] Loss 1.889245
    @postremesple @realDonaldTrump http://t.co/Pk5alvk9jian
@ForgSan6 Bus a hespunst on @Channeding Lorder Semoran Morizing be now int - dis going 

## Generative RNN using GPU
Training a generative RNN can be a slow process. Here's a sample GPU implementation to speed up the training. The changes required to enable GPU are provided in the comments below.

In [36]:
# Generative Recurrent Neural Network Implementation with GPU

def sample_sequence_cuda(model, max_len=100, temperature=0.8):
    generated_sequence = ""

    inp = torch.Tensor([vocab_stoi["<BOS>"]]).long().cuda()    # <----- GPU
    hidden = None
    for p in range(max_len):
        output, hidden = model(inp.unsqueeze(0), hidden)
        # Sample from the network as a multinomial distribution
        output_dist = output.data.view(-1).div(temperature).exp().cpu()
        top_i = int(torch.multinomial(output_dist, 1)[0])
        # Add predicted character to string and use as next input
        predicted_char = vocab_itos[top_i]

        if predicted_char == "<EOS>":
            break
        generated_sequence += predicted_char
        inp = torch.Tensor([top_i]).long().cuda()    # <----- GPU
    return generated_sequence


def train_cuda(model, data, batch_size=1, num_epochs=1, lr=0.001, print_every=100):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    it = 0
    data_iter = torchtext.data.BucketIterator(data,
                                              batch_size=batch_size,
                                              sort_key=lambda x: len(x.text),
                                              sort_within_batch=True)
    for e in range(num_epochs):
        # get training set
        avg_loss = 0
        for (tweet, lengths), label in data_iter:
            target = tweet[:, 1:].cuda()              # <------- GPU
            inp = tweet[:, :-1].cuda()                # <------- GPU
            # cleanup
            optimizer.zero_grad()
            # forward pass
            output, _ = model(inp)
            loss = criterion(output.reshape(-1, vocab_size), target.reshape(-1))
            # backward pass
            loss.backward()
            optimizer.step()

            avg_loss += float(loss) # accumulate a plain number, not a tensor with autograd history
            it += 1 # increment iteration count
            if it % print_every == 0:
                print("[Iter %d] Loss %f" % (it+1, float(avg_loss/print_every)))
                print("    " + sample_sequence_cuda(model, 140, 0.8))
                avg_loss = 0
            if it==10000:
              break

model = TextGenerator(vocab_size, 64)
model = model.cuda()
model.ident = model.ident.cuda()
train_cuda(model, trump_tweets, batch_size=32, num_epochs=10, lr=0.004, print_every=100)

[Iter 101] Loss 3.619126
    %ialalefo/e<pad>l i heintprsl itt/punv!In tipld r s he 9N 
[Iter 201] Loss 2.974206
    #ACe Nond ustpps2txt so<pad>t itto//tht an dond the foret atere wes s  pre Trhinte Grinlomp Cag das test to. th tong! fop hicisi/////////. @re B
[Iter 301] Loss 2.660617
    27U7YEr Withert ond pros! wereme as in batid by. her alt old he bu then CJI Gnevint.. ann!
[Iter 401] Loss 2.472301
    @cerull ont inca the she she nepyrite have toild stepes so in do beicores thver bondiphmarrines lotpruts: #forgpiMalidest at Inateaving you 
[Iter 501] Loss 2.344373
    @lampels://t...." oum freath 'se non thesting than gar ise for nabe craply is of inky to bun ond Ghamp the Could act congittts://t.co/aacowe
[Iter 601] Loss 2.257957
    @stilcerti: @realDongu: @realDonaldTrump in to Migat an Cowsty he hecan Hertrint yor courtidend it fielthel nottedint for in bealw the tirs 
[Iter 701] Loss 2.191346
    @MAckeralldors: @rialDonaldTrump womarst whake the imay wore DonaldTrump thd Co
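
The training loop above pairs every input character with the character that follows it (`tweet[:, :-1]` versus `tweet[:, 1:]`), then flattens the batch so that each position becomes one classification example for the cross-entropy loss. The following is a minimal pure-Python sketch of that idea, not the PyTorch implementation: the token ids, the helper names `shift_batch` and `mean_cross_entropy`, and the uniform scores are all made up for illustration.

```python
import math

def shift_batch(batch):
    """Split each sequence into model inputs and next-token targets."""
    inputs = [seq[:-1] for seq in batch]   # mirrors tweet[:, :-1]
    targets = [seq[1:] for seq in batch]   # mirrors tweet[:, 1:]
    return inputs, targets

def mean_cross_entropy(logits, targets):
    """Average -log softmax(scores)[target] over every position in the batch."""
    losses = []
    for seq_logits, seq_targets in zip(logits, targets):
        for scores, target in zip(seq_logits, seq_targets):
            # log-sum-exp with the max subtracted for numerical stability
            peak = max(scores)
            log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
            losses.append(log_norm - scores[target])
    return sum(losses) / len(losses)

# Hypothetical ids: 0 = <BOS>, 1 = <EOS>, 2 = "h", 3 = "i"
batch = [[0, 2, 3, 1], [0, 3, 2, 1]]
inputs, targets = shift_batch(batch)
print(inputs)   # [[0, 2, 3], [0, 3, 2]]
print(targets)  # [[2, 3, 1], [3, 2, 1]]

vocab_size = 4
# Uniform scores stand in for a model output of shape (batch, time, vocab_size).
uniform_logits = [[[0.0] * vocab_size for _ in seq] for seq in inputs]
# With uniform scores the loss equals math.log(vocab_size).
print(mean_cross_entropy(uniform_logits, targets))
```

A model that has learned nothing assigns equal probability to every character, so its loss is the logarithm of the vocabulary size; training pushes the loss below that baseline.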

train_cuda(model, trump_tweets, batch_size=32, num_epochs=10, lr=0.004, print_every=500)

[Iter 501] Loss 1.687595
    I will have making cotty been that hoggam give and to rally pollion have police inductions. Earter of the fairing sand would support head as
[Iter 1001] Loss 1.004453
    @ifcottellow: Wend JOBS a mann in 2016 Personon him on my help a thanks can #ExNews to he our get incispies.
[Iter 1501] Loss 0.328217
    Join in thountingt why story tnat with the Abmitic. You's that leavenes hame bial of New Hampshire. We was prosters have all and fire in PIA
[Iter 2001] Loss 1.673794
    Senty book without in very paylinnsing to @realDonaldTrump #Mex66🇺🇸
[Iter 2501] Loss 1.323943
    .@CNNNBO a licked on all only and Coore To bank to 66% will some for the @JANCerFarlAmow is not forget in a toome it and live A being of wit
[Iter 3001] Loss 0.651706
    @Pabbary87 Great in New Hampshire for the hard south of rice and a for looked ellanked country fally of the comment the times and eary https
[Iter 3501] Loss 1.665778
    The Elect will be to @BCanGoutor Thank @realDonald

train_cuda(model, trump_tweets, batch_size=32, num_epochs=10, lr=0.0001, print_every=500)

[Iter 501] Loss 1.628715
    We has my Peacher by Enjoy—to the you debtring to hand of during pooded the place and you we whine. It for me destroys for President - your 
[Iter 1001] Loss 0.969579
    @mamimigbinnzas: @realDonaldTrump The can't trond action in Want working fighting and incoluted are all a we only for are out the even Ball.
[Iter 1501] Loss 0.317234
    It's states to be on the Brean talkion in 16 in the really forward at eadracking the Watling of the far the poll to real peopmates.
[Iter 2001] Loss 1.621639
    Jefficlue are there times honigged of the her terribary with and goal histaguziny the discods proppity fonirouse who dishupport the American
[Iter 2501] Loss 1.285280
    @StanTicker  @TrumpTramp &amp; very in the doing that cour exciting the pay by @chaineightroming 317 ints of a the trump to about meass and 
[Iter 3001] Loss 0.634229
    I only the Brigge News debate is the Finst You wafice my incsmate a fand doesn any doony was my forwars Iowa will forcerning 

train_cuda(model, trump_tweets, batch_size=32, num_epochs=10, lr=0.0001, print_every=500)

[Iter 501] Loss 1.618482
    We have all run is oppress good and very and cared get Hillary Aballeman New Hellinge and for the U.S. so worken forgets &amp; great some al
[Iter 1001] Loss 0.966697
    MAKE AMERICA GREAT AGAIN! https://t.co/spr3fHdgG6
[Iter 1501] Loss 0.316514
    Got Trump deally and the great our really short of the can great the best thearsed and hishup statelation of make in the General man of!
[Iter 2001] Loss 1.618269
    “Heal getned to and (and for a work
[Iter 2501] Loss 1.282945
    @sonyBeonery Trump and he dealslutines out a stop of Creet Donald in Martand has much in they were to very the can sraition that wanteds I h
[Iter 3001] Loss 0.633193
    I will be day a that is the national you nice itwerting to be the is trump don’l making has many be to terrorate Alabilly Hillary Virgin!
[Iter 3501] Loss 1.617531
    Hillary Crooked Nets amazing oined busive one of has been the USena and next American your Wanding Penuerwe - will be mo. I way swoy time se
[Iter 4

Let's sample tweets from the trained model at several temperatures.

for i in range(5):
  print(sample_sequence_cuda(model, 140, 0.2))

@Allalesoneen: @realDonaldTrump @realDonaldTrump is the protect in the great and a be to were to deal with the U.S.
@Theaman: @realDonaldTrump @realDonaldTrump with the best the politics!
@mantheram: @realDonaldTrump @realDonaldTrump with the state to the president and states the president!
@Allloonanee: @realDonaldTrump @realDonaldTrump is the great the president.
@CanySchride: @realDonaldTrump @realDonaldTrump http://t.co/G5d0gfugJe


for i in range(5):
  print(sample_sequence_cuda(model, 140, 0.6))

I watch show the enemys on the the so mistary please longer honor serain fount we will be great will have to and and amnests a see the every
Thank you on Ald Trump
@messonactharlane: @realDonaldTrump I am was me sides out to be poll. http://t.co/wfyJWrkERl
@wantallecher: @realDonaldTrump @Breatoly Macy &amp; Grandives hear to support the hit becoming politicians and with the done. Great started
#CelebApprentice @GHays @realDonaldTrump @realDonaldTrump thank you!


for i in range(5):
  print(sample_sequence_cuda(model, 140, 0.8))

@kyvasinjemmegav Great Helling @FoxNews @realDonaldTrump forgey to fuls &amp; today. Rounty
Cal Looking Inaytons respices for it people when huct of our great protect when itech pelfating at the border pay to go of and border. He wa
@cortiganer: @dxellardroon @foxandfriends from morling and crowd being down more! #NetUYESPURAN https://t.co/SAt4yIJIba
@GaltolENewor: MAKE AMERICA GREAT AGAIN.genity he winns righters Thank you!
#Trump2016 - I was wonderful bill the be forcemne sindly for the day. #CelebApprentice is guing me.


for i in range(5):
  print(sample_sequence_cuda(model, 140, 1))

@ra_bAwack8: @realDonaldTrump wilidain Masquaie of Leadis!
.@realDonaldTrump anion. Sening gone #MakeAmericaGreatAgain @FLGANGOP is . Perily welco interviewed. Warring bit camen!
I will friends a thoughed it is the duntor inting!!!  -- So hethers will bad mons!
Wather the clinnived in @DiggerneldTry. Thank you. I'm anasting on @Whendic_Renatuell you - again for in the fmer help
Cordas Prina iv come should passing big chuct's end We are lettisome! #TrumpJokShum (neet)"


for i in range(5):
  print(sample_sequence_cuda(model, 140, 1.5))

@Cosacpz676 playgen 1 on 2:30pMR eceint oppoo tinut25 missid hupsfure Dollbral Appoy:… https://t.ps!
We hance Delaken. http://t.co/jQQ2OC4kpw https://t.co/1JK8OrCkuDp
Than. Briwali 2 keyusio experidminott rement.
.@Miac_Dolloo NationUp How guy 4%;bectobard herd nichousle!
Http ERESTM3 y've dummbumetration-genermefur bad "NOK0 Uny's (en).”
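
The samples above become less repetitive and more error-prone as the temperature rises: at 0.2 nearly every tweet repeats `@realDonaldTrump`, while at 1.5 most words are noise. A rough sketch of why, assuming the sampler divides the model's scores by the temperature and exponentiates them before drawing from the resulting distribution (the scores and the helper names `temperature_distribution` and `sample_index` below are hypothetical):

```python
import math
import random

def temperature_distribution(logits, temperature):
    """Turn raw scores into sampling probabilities at a given temperature."""
    scaled = [score / temperature for score in logits]
    # Subtract the max before exponentiating for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def sample_index(probs, rng):
    """Draw one index according to probs, like a single multinomial sample."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical scores for a 4-character vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]
rng = random.Random(0)
for temperature in (0.2, 0.8, 1.5):
    probs = temperature_distribution(logits, temperature)
    draws = [sample_index(probs, rng) for _ in range(10)]
    print(temperature, [round(p, 3) for p in probs], draws)
```

At a temperature of 0.2 almost all of the probability mass sits on the highest-scoring character, so the model keeps choosing its safest continuation. At 1.5 the distribution flattens and unlikely characters are drawn much more often.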
