## LSTM Based Text Generation using PyTorch
This notebook is similar to TensorFlow's text generation tutorial, but my focus is on implementing it in PyTorch to see how well it works in that framework. Moreover, the tutorial uses a GRU, while I'm working with an LSTM. <br><br>
Link to the aforementioned documentation: <br>
https://colab.research.google.com/github/tensorflow/text/blob/master/docs/tutorials/text_generation.ipynb

### Import libraries

In [2]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
from torchsummary import summary
import numpy as np
import time

In [3]:
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))

True
NVIDIA GeForce RTX 4050 Laptop GPU


### Data setup

In [4]:
path = 'shakespeare.txt'
text = open(path, 'rb').read().decode(encoding = 'utf-8')
text = text.replace('\r', '')

In [5]:
char_count = len(text)
print('Number of characters:', char_count)
print()
print(text[0:250])

Number of characters: 1115393

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.



In [6]:
# character vocabulary for prediction
vocab = sorted(set(text))
vocab_count = len(vocab)
print(vocab)

['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


### Text processing
1. Tokenize text at character level
2. Create a text encoder and a decoder that reverses the encoding
3. Join the decoded output

In [7]:
# tokenizer
def tokenize(text):
    return np.array(list(text)) # string to list of character tokens

In [8]:
# text encoder & decoder 
s2i = {ch:i for i, ch in enumerate(vocab)} # {symbol : idx}
i2s = {i:ch for i, ch in enumerate(vocab)} # {idx : symbol}

encode = lambda s: [s2i[c] for c in s]
decode = lambda l: ''.join([i2s[i] for i in l]) # join the output

# testing
sample = "Hello World!"
enc = encode(sample)
dec = decode(enc)
print(f'Sample input:\t{sample}\nEncoded:\t{enc}\nDecoded:\t{dec}')

Sample input:	Hello World!
Encoded:	[20, 43, 50, 50, 53, 1, 35, 53, 56, 50, 42, 2]
Decoded:	Hello World!


### Setup for prediction task

In [9]:
# tokenize the text...
tokens = tokenize(text)
tokens[:20]

array(['F', 'i', 'r', 's', 't', ' ', 'C', 'i', 't', 'i', 'z', 'e', 'n',
       ':', '\n', 'B', 'e', 'f', 'o', 'r'], dtype='<U1')

In [10]:
# ... and encode them
all_ids = encode(text)
all_ids[:20]

[18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 14, 43, 44, 53, 56]

In [11]:
# Build input sequences of seq_len characters; the target for each sequence is the
# single character that immediately follows it in the text.
seq_len = 100
input_text = []
target_text = []
for i in range(0, char_count - seq_len): # subtract seq_len to prevent going out of range
    seq_in = text[i:i + seq_len] # text[i + seq_len] is excluded here...
    char_out = text[i + seq_len] # ...but used here as the target.
    input_text.append([s2i[char] for char in seq_in])
    target_text.append(s2i[char_out])
patterns = len(input_text)
print("Total patterns:", patterns)

Total patterns: 1115293
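
The same windowing can be tried on a short string to make the input/target pairs easier to see. This is a minimal sketch; `toy_text`, `toy_seq_len`, and `pairs` are made-up names for illustration, and the window is much shorter than the `seq_len = 100` used in this notebook.

```python
# Toy version of the windowing loop, on a short string so the pairs are easy to read.
toy_text = "hello world"
toy_seq_len = 4  # hypothetical small window; the notebook uses seq_len = 100

pairs = []
for i in range(len(toy_text) - toy_seq_len):
    window = toy_text[i:i + toy_seq_len]  # input: toy_seq_len characters starting at i
    target = toy_text[i + toy_seq_len]    # target: the character right after the window
    pairs.append((window, target))

for window, target in pairs[:3]:
    print(repr(window), "->", repr(target))

print("Total pairs:", len(pairs))  # equals len(toy_text) - toy_seq_len
```

Each step moves the window by one character, which is why the number of patterns is the character count minus the sequence length.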


#### A few things to note:
1. PyTorch's LSTM expects 3D input tensors; with `batch_first = True` the shape is (samples, time steps, features).
2. The data must be converted to floating-point tensors.
3. Normalizing the data (here, dividing each index by the vocabulary size) helps training.

In [12]:
# data reshaping
X = torch.tensor(input_text, dtype = torch.float32).reshape(patterns, seq_len, 1)
X = X / float(vocab_count) # normalizing
y = torch.tensor(target_text)
print('X dimension:', X.shape)
print('y dimension:', y.shape)

X dimension: torch.Size([1115293, 100, 1])
y dimension: torch.Size([1115293])
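
The shape requirements can be checked on a tiny batch before building the full dataset. The sketch below is only illustrative: `toy_ids` holds two made-up encoded sequences, and the small LSTM (8 hidden units, 2 layers) is an assumption chosen to keep the output readable, not the model trained in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

toy_vocab_count = 65          # same vocabulary size as this notebook
toy_ids = [[18, 47, 56, 57],  # two hypothetical encoded sequences of length 4
           [58, 1, 15, 47]]

# (samples, time steps) -> (samples, time steps, features), one feature per step
x = torch.tensor(toy_ids, dtype=torch.float32).reshape(2, 4, 1)
x = x / float(toy_vocab_count)  # scale the indices into [0, 1)

lstm = nn.LSTM(input_size=1, hidden_size=8, num_layers=2, batch_first=True)
output, (h_n, c_n) = lstm(x)

print(x.shape)       # torch.Size([2, 4, 1])
print(output.shape)  # torch.Size([2, 4, 8]): one hidden vector per time step
print(h_n.shape)     # torch.Size([2, 2, 8]): (num_layers, samples, hidden_size)

# For a unidirectional LSTM, the last time step of `output` is the
# final hidden state of the top layer.
print(torch.allclose(output[:, -1, :], h_n[-1]))
```

The last check is why taking only the final time step of the LSTM output is enough for predicting a single next character.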


### Model defining & training

**Hidden states** are a function of the current and previous inputs. They evolve as new information arrives at each time step.<br>
**Logits** are the raw outputs of a network: the unscaled scores for each class in a classification task.<br>
**Cross-entropy loss** is used because we are predicting a single class (the next character) out of 65 classes.
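
To make the link between logits and cross-entropy concrete, the sketch below computes the loss by hand and compares it with `nn.CrossEntropyLoss`. The logits and targets are random, made-up values, not outputs of the model in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

logits = torch.randn(3, 65)          # hypothetical raw scores: 3 samples, 65 classes
targets = torch.tensor([0, 20, 64])  # hypothetical correct class index per sample

# Cross-entropy is the negative log-probability assigned to the correct class.
log_probs = torch.log_softmax(logits, dim=-1)
manual = -log_probs[torch.arange(3), targets]

loss_sum = nn.CrossEntropyLoss(reduction='sum')(logits, targets)
loss_mean = nn.CrossEntropyLoss(reduction='mean')(logits, targets)

print(torch.allclose(manual.sum(), loss_sum))    # 'sum' adds the per-sample losses
print(torch.allclose(manual.mean(), loss_mean))  # 'mean' averages them
```

Note that the loss takes logits directly; no softmax layer is needed at the end of the model.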

In [13]:
# model definition
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size = 1,
                            hidden_size = 256,
                            num_layers = 4,
                            batch_first = True,
                            dropout = 0.225) # 4 lstm layers with 256 hidden units each
        self.dropout = nn.Dropout(0.225)
        self.linear = nn.Linear(256, vocab_count) # dense layer
    def forward(self, x):
        x, _ = self.lstm(x) # will look into 'hidden' later
        x = x[:, -1, :] # keep only the last time step, whose hidden state summarizes the whole sequence
        x = self.linear(self.dropout(x)) # produce output
        return x

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 
model = Model().to(device)
epochs = 40
batch_size = 128

optimizer = optim.Adam(model.parameters())
loss_function = nn.CrossEntropyLoss(reduction = 'sum')
loader = data.DataLoader(data.TensorDataset(X, y),
                         shuffle = True,
                         batch_size = batch_size) 

# model summary
print(summary(model, verbose = False))

Layer (type:depth-idx)                   Param #
├─LSTM: 1-1                              1,844,224
├─Dropout: 1-2                           --
├─Linear: 1-3                            16,705
Total params: 1,860,929
Trainable params: 1,860,929
Non-trainable params: 0
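
The parameter counts in the summary can be reproduced by hand. The sketch below assumes PyTorch's standard LSTM parameterization (four gates per layer, each with input weights, recurrent weights, and two bias vectors); the helper name `lstm_layer_params` is made up for illustration.

```python
# Parameter count for nn.LSTM(input_size=1, hidden_size=256, num_layers=4)
# followed by nn.Linear(256, 65), computed by hand.
input_size, hidden_size, num_layers, n_classes = 1, 256, 4, 65

def lstm_layer_params(in_features, hidden):
    # 4 gates, each with input weights, recurrent weights, and two bias vectors
    return 4 * hidden * (in_features + hidden) + 2 * 4 * hidden

# The first layer sees the raw input; deeper layers see the previous layer's hidden states.
lstm_params = lstm_layer_params(input_size, hidden_size)
lstm_params += (num_layers - 1) * lstm_layer_params(hidden_size, hidden_size)

linear_params = hidden_size * n_classes + n_classes  # weights + bias

print(lstm_params)                  # 1844224
print(linear_params)                # 16705
print(lstm_params + linear_params)  # 1860929
```

Almost all of the parameters sit in the recurrent layers; the output layer is small by comparison.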


In [54]:
# now we train the model!
best_model = None
best_loss = np.inf

start = time.time()
for epoch in range(epochs):
    model.train()
    for X_batch, y_batch in loader:
        # forward pass
        y_pred = model(X_batch.to(device)) 
        # loss computation
        loss = loss_function(y_pred, y_batch.to(device))
        # backpropagation and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
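    # evaluation (note: this reuses the training data, so it is not a held-out validation set)
    model.eval()
    loss = 0
    with torch.no_grad(): # gradients not required for evaluation
        for X_batch, y_batch in loader: 
            y_pred = model(X_batch.to(device)) 
            loss += loss_function(y_pred, y_batch.to(device))
        if loss < best_loss:
            best_loss = loss
            # clone the tensors: state_dict() returns references that later training steps overwrite
            best_model = {k: v.clone() for k, v in model.state_dict().items()}
        print("Epoch %d: Cross-entropy: %.4f" % (epoch + 1, loss))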
end = time.time()

torch.save([best_model, s2i], "single-char.pth") # saving the mappings (s2i) as well for later use

elapsed_time = end - start
print()
print("Elapsed time: {:.2f} minutes".format(elapsed_time / 60)) # just curious...

Epoch 1: Cross-entropy: 2421611.5000
Epoch 2: Cross-entropy: 2113608.2500
Epoch 3: Cross-entropy: 1964630.8750
Epoch 4: Cross-entropy: 1867911.8750
Epoch 5: Cross-entropy: 1786178.3750
Epoch 6: Cross-entropy: 1740098.3750
Epoch 7: Cross-entropy: 1710207.8750
Epoch 8: Cross-entropy: 1680417.5000
Epoch 9: Cross-entropy: 1644015.3750
Epoch 10: Cross-entropy: 1622073.1250
Epoch 11: Cross-entropy: 1643736.5000
Epoch 12: Cross-entropy: 1579415.5000
Epoch 13: Cross-entropy: 1569533.2500
Epoch 14: Cross-entropy: 1556315.3750
Epoch 15: Cross-entropy: 1540297.5000
Epoch 16: Cross-entropy: 1536158.7500
Epoch 17: Cross-entropy: 1516412.6250
Epoch 18: Cross-entropy: 1516249.1250
Epoch 19: Cross-entropy: 1503287.2500
Epoch 20: Cross-entropy: 1495808.6250
Epoch 21: Cross-entropy: 1495943.6250
Epoch 22: Cross-entropy: 1484764.7500
Epoch 23: Cross-entropy: 1483875.7500
Epoch 24: Cross-entropy: 1471096.0000
Epoch 25: Cross-entropy: 1471430.5000
Epoch 26: Cross-entropy: 1465601.7500
Epoch 27: Cross-entro
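
Because the loss uses `reduction = 'sum'` and is accumulated over every batch, the logged values are totals over all 1,115,293 patterns. Dividing by the number of patterns gives a per-character loss that is easier to interpret. The sketch below does this for three values copied from the log; `summed_losses` and `per_char` are names introduced here for illustration.

```python
import math

patterns = 1115293  # number of training sequences ("Total patterns" in the data setup)

# Summed cross-entropy values copied from the training log
summed_losses = {1: 2421611.5, 10: 1622073.125, 26: 1465601.75}

per_char = {}
for epoch, total in summed_losses.items():
    per_char[epoch] = total / patterns      # average loss per predicted character (in nats)
    perplexity = math.exp(per_char[epoch])  # effective number of equally likely choices
    print(f"Epoch {epoch}: {per_char[epoch]:.4f} nats/char, perplexity {perplexity:.2f}")
```

For comparison, a uniform guess over 65 characters would score ln(65), roughly 4.17 nats per character. Keep in mind that these numbers are measured on the training data.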

### Generating text
Now that we have a trained model, let's generate some text!

In [14]:
# loading the trained model
best_model, s2i = torch.load("single-char.pth", weights_only = True)
vocab_count = len(s2i)
i2s = dict((i, s) for s, i in s2i.items()) # {idx : symbol}

In [15]:
# reload the model
model = Model().to(device)
model.load_state_dict(best_model)

<All keys matched successfully>

In [20]:
# I chose a random section from the text as a prompt
prompt = '''ROMEO:
Tut, I have lost myself; I am not here;
This is not Romeo, he's some other where.

BENVOLIO:
Tell me in sadness, who is that you love.

ROMEO:
What, shall I groan and tell thee?'''

prompt_enc = encode(prompt)

In [22]:
model.eval()
print('Prompt:', prompt)
print()
print('--------------------------------------------------')
print()
print('Generation:')
batch_size = 1 
with torch.no_grad():
    for i in range(1000):
        # reshape the prompt into the (1, sequence length, 1) tensor the model expects
        x = np.reshape(prompt_enc, (1, len(prompt_enc), 1)) / float(vocab_count)
        x = torch.tensor(x, dtype = torch.float32).to(device)
        prediction = model(x) # logits
        temperature = 1.1 # adjust the softmax distribution (> 1: more creative, < 1: more deterministic)
        probabilities = torch.softmax(prediction / temperature, dim = -1)
        samples = [torch.multinomial(probabilities, num_samples = 1).item() for _ in range(80)]
        idx = max(set(samples), key = samples.count) # takes the most frequent sample from 80 samples
        # idx = torch.multinomial(probabilities, num_samples = 1).item()
        result = decode([idx])
        # print(result)
        print(result, end = "")
        # append new character to prompt for next iteration
        prompt_enc.append(idx)
        prompt_enc = prompt_enc[1:] # drop the oldest character so the window length stays constant
print('\n\nDone!')

Prompt: ROMEO:
Tut, I have lost myself; I am not here;
This is not Romeo, he's some other where.

BENVOLIO:
Tell me in sadness, who is that you love.

ROMEO:
What, shall I groan and tell thee?

--------------------------------------------------

Generation:
all be heard the world of the company.

KING RICHARD III:
And he is not the soul of the world to hear,
And so make her that have the common hand,
That thou shalt thought the body of the common showers
That shall be so a man to the man of soul.

LING RICHARD III:
The care and the state of the death of thee,
The sea and hand of his son might be there,
The strength of soldiers with the prince the life
That should be so dead to the common courtesy,
And he hath been a thousand things to thee,
And then the sun of such a shame and life,
I will not show the house of Rome, and then
And then I have been so beheld the prince,
That shall be so the house of her of her
That be the soul of some of the warlike son.

BUCKINGHAM:
The world of the fat
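
The effect of the temperature used during generation can be seen on a small set of scores. This is a pure-Python sketch with made-up logits; `softmax_with_temperature` and `toy_logits` are hypothetical names, and the generation loop itself uses `torch.softmax`.

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide the logits by the temperature, then apply a numerically stable softmax.
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate characters

for t in (0.5, 1.0, 1.1, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring character, while higher temperatures flatten the distribution. Picking the most frequent of 80 samples, as the generation loop does, will usually return the most likely character, so the output probably behaves much like greedy decoding. That may explain the repeated phrases in the generated text.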

### Conclusion
Even though the results aren't perfect, it's truly fascinating to see how a model trained to generate text one character at a time can produce words, and sometimes even sentences, that make sense. We can dive deep into the mathematics and algorithms behind these networks, but the real question remains: how do these simple numbers achieve such complex results?<br>

