<strong><center>Copyright 2019. Created by Jose Marcial Portilla.</center></strong>

# Apply an RNN to Text
Given their ability to remember past patterns, RNNs are good at predicting what character should follow a given sequence of characters, or what word might complete a given phrase.

## Perform standard imports

In [2]:
import torch
import torch.nn as nn

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
train_on_gpu = torch.cuda.is_available()
if(train_on_gpu):
    print('Training on GPU!')
else: 
    print('No GPU available, training on CPU; consider making n_epochs very small.')

Training on GPU!


## Load data
For this exercise we'll use <em>The Adventures of Tom Sawyer</em> by Mark Twain, available from Project Gutenberg at http://www.gutenberg.org/ebooks/74.<br>
We've removed most of the front matter (Table of Contents, List of Illustrations, etc.) and retained just the chapter headings and the text of the novel.

In [4]:
with open('../Data/TomSawyer.txt', 'r', encoding='utf8') as f:
    text = f.read()

In [5]:
len(text)

406270

In [6]:
text[:1000]

'THE ADVENTURES OF TOM SAWYER\n\nBy Mark Twain\n(Samuel Langhorne Clemens)\n\nPREFACE\n\nMost of the adventures recorded in this book really occurred; one or two\nwere experiences of my own, the rest those of boys who were schoolmates\nof mine. Huck Finn is drawn from life; Tom Sawyer also, but not from an\nindividual--he is a combination of the characteristics of three boys whom\nI knew, and therefore belongs to the composite order of architecture.\n\nThe odd superstitions touched upon were all prevalent among children and\nslaves in the West at the period of this story--that is to say, thirty or\nforty years ago.\n\nAlthough my book is intended mainly for the entertainment of boys and\ngirls, I hope it will not be shunned by men and women on that account,\nfor part of my plan has been to try to pleasantly remind adults of what\nthey once were themselves, and of how they felt and thought and talked,\nand what queer enterprises they sometimes engaged in.\n\nTHE AUTHOR.\n\nHARTFORD, 187

Note that <tt><strong>\n</strong></tt> is a Python <a href='https://docs.python.org/3/reference/lexical_analysis.html'>escape sequence</a> that represents a line feed, and it counts as one character.

In [7]:
# Display four connected characters:
text[27:31]

'R\n\nB'

## Encode the characters
First we want to identify all the unique characters, including digits and punctuation, contained in the text. We can do this quickly and efficiently by converting the text to a set. Then we want to assign an integer to each character.

In [8]:
chars = set(text)

In [9]:
print(chars)

{'B', 'v', 'r', 's', 'R', 'M', 'g', 'n', '3', '8', 'L', 'u', '$', 'Y', 'D', 'W', '@', 'i', '\n', 'P', 'l', 'H', 'z', 'p', 'q', 'K', 'o', 'V', '(', '_', ']', 'b', ':', '.', '4', '&', '*', 'I', 'y', '2', '7', 'j', 'E', '“', '%', 'N', 'h', 'F', "'", 'A', 'd', 'O', '[', '?', ';', 'J', 'a', '1', 'Q', '0', 'e', 'f', 't', '5', ' ', 'w', 'T', 'X', 'x', '”', 'G', 'S', 'C', '!', '9', '/', 'm', 'c', 'k', '6', ')', ',', '-', 'U'}


In [10]:
len(chars)

84

In [11]:
char_list = sorted(list(chars))
print(char_list)

['\n', ' ', '!', '$', '%', '&', "'", '(', ')', '*', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '“', '”']


Now to assign a unique integer to each character:

In [12]:
encoder = {}
for i,x in enumerate(char_list):
    encoder[x] = i
print(encoder)

{'\n': 0, ' ': 1, '!': 2, '$': 3, '%': 4, '&': 5, "'": 6, '(': 7, ')': 8, '*': 9, ',': 10, '-': 11, '.': 12, '/': 13, '0': 14, '1': 15, '2': 16, '3': 17, '4': 18, '5': 19, '6': 20, '7': 21, '8': 22, '9': 23, ':': 24, ';': 25, '?': 26, '@': 27, 'A': 28, 'B': 29, 'C': 30, 'D': 31, 'E': 32, 'F': 33, 'G': 34, 'H': 35, 'I': 36, 'J': 37, 'K': 38, 'L': 39, 'M': 40, 'N': 41, 'O': 42, 'P': 43, 'Q': 44, 'R': 45, 'S': 46, 'T': 47, 'U': 48, 'V': 49, 'W': 50, 'X': 51, 'Y': 52, '[': 53, ']': 54, '_': 55, 'a': 56, 'b': 57, 'c': 58, 'd': 59, 'e': 60, 'f': 61, 'g': 62, 'h': 63, 'i': 64, 'j': 65, 'k': 66, 'l': 67, 'm': 68, 'n': 69, 'o': 70, 'p': 71, 'q': 72, 'r': 73, 's': 74, 't': 75, 'u': 76, 'v': 77, 'w': 78, 'x': 79, 'y': 80, 'z': 81, '“': 82, '”': 83}


Once we have an encoder, we can encode the entire corpus.

In [13]:
encoded_text = [encoder[x] for x in text]
encoded_text = torch.LongTensor(encoded_text)
len(encoded_text)

406270

In [14]:
print(text[:28])
print(encoded_text[:28])

THE ADVENTURES OF TOM SAWYER
tensor([47, 35, 32,  1, 28, 31, 49, 32, 41, 47, 48, 45, 32, 46,  1, 42, 33,  1,
        47, 42, 40,  1, 46, 28, 50, 52, 32, 45])
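A quick way to check an encoder like this is to decode the integers again and confirm we get the original text back. The sketch below does this on a tiny stand-in string; the names `sample`, `sample_encoder`, and `sample_decoder` are made up for illustration and are not used elsewhere in this notebook.

```python
# Toy corpus standing in for the novel (an assumption for illustration only).
sample = "TOM SAWYER\nTOM"

# Sorting the unique characters makes the integer assignment reproducible.
sample_chars = sorted(set(sample))
sample_encoder = {ch: i for i, ch in enumerate(sample_chars)}
sample_decoder = {i: ch for ch, i in sample_encoder.items()}

# Encode every character, then map the integers back to characters.
sample_codes = [sample_encoder[ch] for ch in sample]
restored = "".join(sample_decoder[i] for i in sample_codes)

print(sample_chars)
print(sample_codes)
print(restored == sample)
```

Because every character maps to exactly one integer and back, the round trip is lossless.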


## Set up training data
We want to use the entire corpus for training. Since we're using an LSTM, we want to create a window of integer-encoded characters followed by an integer label. In this way the argmax of the prediction should match the label.

In [15]:
# X_train = encoded_text[:-1].view(len(encoded_text)-1,1).type(torch.float)
# y_train = encoded_text[1:].type(torch.int64)

In [16]:
def input_data(seq,ws):  # ws is the window size
    out = []
    L = len(seq)
    for i in range(L-ws):
        window = seq[i:i+ws]
        label = seq[i+ws:i+ws+1]
        out.append((window,label))
    return out

In [17]:
train_data = input_data(encoded_text,20)

In [18]:
train_data[0]

(tensor([47, 35, 32,  1, 28, 31, 49, 32, 41, 47, 48, 45, 32, 46,  1, 42, 33,  1,
         47, 42]), tensor([40]))

In [19]:
len(train_data)

406250
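To see the windowing on a small scale, here is a plain-Python sketch of the same idea with a window size of 3. The helper `make_windows` is a hypothetical stand-in for `input_data`; unlike `input_data`, it returns each label as a bare integer rather than a one-element tensor.

```python
def make_windows(seq, ws):
    """Pair each run of ws items with the single item that follows it."""
    return [(seq[i:i + ws], seq[i + ws]) for i in range(len(seq) - ws)]

toy = [47, 35, 32, 1, 28, 31]   # the first six codes of the corpus, "THE AD"
pairs = make_windows(toy, 3)
for window, label in pairs:
    print(window, '->', label)

# A sequence of length L yields L - ws pairs, which is why
# the 406270-character corpus gives 406250 windows of size 20.
print(len(pairs), 406270 - 20)
```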

## Define a model
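Our input size is going to be 1, since each time step feeds in a single encoded character. The hidden size (the number of units in the LSTM's hidden state) is arbitrary; we'll use 128. The output size is 84, one score per unique character, and we'll use CrossEntropyLoss as our loss function.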

In [20]:
class LSTM(nn.Module):
    def __init__(self, input_size=1, hidden_size=128, out_size=84):
        super().__init__()
        self.hidden_size = hidden_size
        
        # Add an LSTM layer:
        self.lstm = nn.LSTM(input_size,hidden_size)
        
        # Add a fully-connected layer:
        self.linear = nn.Linear(hidden_size,out_size)
        
        # Initialize h0 and c0:
        self.hidden = (torch.zeros(1,1,hidden_size),torch.zeros(1,1,hidden_size))
    
    def forward(self,seq):
        lstm_out, self.hidden = self.lstm(seq.view(len(seq), 1, -1), self.hidden)
        pred = self.linear(lstm_out.view(len(seq),-1))
        return pred[-1]   # we only care about the last prediction

## Instantiate the model, define loss & optimization functions
Since we're running a classification, we'll use <a href='https://pytorch.org/docs/stable/nn.html#crossentropyloss'><tt><strong>torch.nn.CrossEntropyLoss</strong></tt></a>.<br>Also, we've found that <a href='https://pytorch.org/docs/stable/optim.html#torch.optim.SGD'><tt><strong>torch.optim.SGD</strong></tt></a> converges faster for this application than <a href='https://pytorch.org/docs/stable/optim.html#torch.optim.Adam'><tt><strong>torch.optim.Adam</strong></tt></a>.

In [21]:
torch.manual_seed(101)
model = LSTM()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

model

LSTM(
  (lstm): LSTM(1, 128)
  (linear): Linear(in_features=128, out_features=84, bias=True)
)
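As a rough illustration of what the loss measures: for a single sample, CrossEntropyLoss takes the raw output scores, applies a softmax, and returns the negative log of the probability given to the correct label. The pure-Python sketch below mirrors that calculation; the helper name `cross_entropy` and the toy scores are ours, not part of PyTorch.

```python
import math

def cross_entropy(logits, label):
    """Negative log of the softmax probability assigned to `label`."""
    m = max(logits)   # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

scores = [2.0, 0.5, -1.0]        # made-up scores for three classes
print(cross_entropy(scores, 0))  # small loss: class 0 has the top score
print(cross_entropy(scores, 2))  # larger loss: class 2 has the lowest score

# With 84 equally likely characters the loss equals ln(84):
uniform_loss = cross_entropy([0.0] * 84, 40)
print(uniform_loss, math.log(84))
```

The uniform case gives a useful baseline: a model that has learned nothing about the text should score about ln(84) per character, so any training loss well below that indicates the model has picked up some structure.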

In [22]:
def count_parameters(model):
    params = [p.numel() for p in model.parameters() if p.requires_grad]
    for item in params:
        print(f'{item:>6}')
    print(f'______\n{sum(params):>6}')
    
count_parameters(model)

   512
 65536
   512
   512
 10752
    84
______
 77908
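The six numbers above can be reproduced by hand. An LSTM layer has four gates (input, forget, cell, and output), and each gate has its own input weights, hidden weights, and two bias vectors. The sketch below redoes the arithmetic; the comments name the PyTorch parameter each line corresponds to.

```python
input_size, hidden_size, out_size = 1, 128, 84
gates = 4  # input, forget, cell, output

expected = [
    gates * hidden_size * input_size,   # lstm.weight_ih_l0
    gates * hidden_size * hidden_size,  # lstm.weight_hh_l0
    gates * hidden_size,                # lstm.bias_ih_l0
    gates * hidden_size,                # lstm.bias_hh_l0
    out_size * hidden_size,             # linear.weight
    out_size,                           # linear.bias
]

for n in expected:
    print(f'{n:>6}')
print(f'______\n{sum(expected):>6}')
```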


## OPTIONAL: LOAD OUR MODEL

**Training this takes a very, very long time: AT LEAST an hour on a fast computer! You may just want to load our provided model here and skip the training.**

In [19]:
saved_model = LSTM()
# map_location='cpu' lets the saved weights load on a machine without a GPU
saved_model.load_state_dict(torch.load('MarkTwainModel2.pt', map_location='cpu'));
saved_model.eval()

LSTM(
  (lstm): LSTM(1, 128)
  (linear): Linear(in_features=128, out_features=84, bias=True)
)

## Train the model

In [22]:
epochs = 1
batch = 0

import time
start_time = time.time()

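# Keep the model and all tensors on the same device (CPU if no GPU is available)
device = torch.device('cuda' if train_on_gpu else 'cpu')
model.to(device)

for i in range(epochs):
    
    # tuple-unpack each (window, label) pair in train_data[178330:]
    for seq, y_train in train_data[178330:]:  
        seq = seq.type(torch.float).view(-1,1).to(device)
        y_train = y_train.view(-1).to(device)
        
        # reset the gradients and the hidden states
        optimizer.zero_grad()
        model.hidden = (torch.zeros(1,1,model.hidden_size,device=device),
                        torch.zeros(1,1,model.hidden_size,device=device))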
        
        y_pred = model(seq).view(1,-1) # this wants to be 2D
        
        loss = criterion(y_pred,y_train)
        
        loss.backward()
        optimizer.step()
        batch+=1
        if batch%10000 == 0:
            print(batch)
        
    # print training result
    print(f'Epoch: {i+1:2} Loss: {loss.item():10.8f}')
    
print(f'\nDuration: {time.time() - start_time:.0f} seconds')

10000
20000
30000
40000
50000
60000
70000
80000
90000
100000
110000
120000
130000
140000
150000
160000
170000
180000
190000
200000
210000
220000
Epoch:  1 Loss: 1.92200708

Duration: 3395 seconds


In [23]:
# windows processed in about 1 hr (out of 406250 in total):
batch

227920

## Save the model
We'll save this in a file called "MarkTwainModel_overwrite.pt".

In [24]:
torch.save(model.state_dict(), 'MarkTwainModel_overwrite.pt')

In [25]:
train_data[-100]

(tensor([76, 69, 59, 56, 75, 64, 70, 69, 10,  1, 63, 70, 78,  1, 75, 70,  1, 63,
         60, 67]), tensor([71]))

-----
-----
# Generating New Text
----
----

## Step 1: Create Seed Text

In [52]:
seed_text = "part of my plan has been to try"[:20]

In [53]:
len(seed_text) == 20

True

## Step 2: Encode Seed Text

In [99]:
def encode_text(seed_text):
    encoded_char_list = []
    for character in seed_text:
        encoded_char_list.append(encoder[character])
    
    return torch.Tensor(np.array(encoded_char_list)).view(-1,1)

In [100]:
encoded_text_tensor = encode_text(seed_text)

In [101]:
encoded_text_tensor

tensor([[71.],
        [56.],
        [73.],
        [75.],
        [ 1.],
        [70.],
        [61.],
        [ 1.],
        [68.],
        [80.],
        [ 1.],
        [71.],
        [67.],
        [56.],
        [69.],
        [ 1.],
        [63.],
        [56.],
        [74.],
        [ 1.]])

## Step 3: Generate Next Predicted Character

In [84]:
decoder = {v: k for k, v in encoder.items()}

In [94]:
def predict_next_char(seq,model):
    # No gradients are needed for inference
    with torch.no_grad():
        pred_tensor = model(seq)
    # Greedy choice: take the index of the highest score and decode it
    pred_char = pred_tensor.argmax().item()
    return decoder[pred_char]

In [102]:
predict_next_char(encoded_text_tensor,saved_model)

'o'

## Step 4: Loop for N predicted characters

In [103]:
my_full_text = seed_text

In [104]:
my_full_text

'part of my plan has '

In [107]:
n = 20
for i in range(n):
    
    last_20_char = encode_text(my_full_text[-20:])
    pred_char = predict_next_char(last_20_char,saved_model) 
    my_full_text += pred_char
    
print(my_full_text)

part of my plan has ooo ooo ooo ooo ooo ooo ooo ooo ooo ooo 
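The repeating output is at least partly a side effect of always taking the argmax: the same window of characters always produces the same next character, so the text can fall into a loop. One common alternative, which this notebook does not use, is to sample the next character from the softmax of the scores, optionally scaled by a temperature. The sketch below shows the idea in plain Python on made-up scores; `sample_index`, `toy_logits`, and the temperature values are all assumptions for illustration.

```python
import math
import random

def sample_index(logits, temperature=1.0, rng=random):
    """Draw an index with probability proportional to softmax(logits / temperature)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)   # subtract the max for numerical stability
    weights = [math.exp(z - m) for z in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

rng = random.Random(101)
toy_logits = [2.0, 1.5, 0.1]   # made-up scores for three characters

# At temperature 1.0 every index can be drawn, in proportion to its probability.
draws = [sample_index(toy_logits, temperature=1.0, rng=rng) for _ in range(1000)]
print([draws.count(i) for i in range(3)])

# A very low temperature behaves almost like argmax.
cold = [sample_index(toy_logits, temperature=0.01, rng=rng) for _ in range(20)]
print(cold)
```

To try this with the model, the scores returned by `model(seq)` would be converted to a list and passed to a function like this in place of the `argmax` call.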


https://towardsdatascience.com/writing-like-shakespeare-with-machine-learning-in-pytorch-d77f851d910c