## INF265 - Weeks 14-15: LSTM-based character level language models
## By Hans Martin Aannestad

*1. Download a text document that you like. Project Gutenberg is a
good place to obtain interesting books legally!

COMMENT: To see whether the model can reproduce a specific writing style, I follow tradition and use the complete works of Shakespeare, downloaded from Gutenberg.org.

*2. Load and clean up the text. You will remove all accents, force all letters
to be lower case, remove all occurrences of ”\n”, ”\t” and ”\r” and make
sure every two consecutive words in the text are separated by a single
space ” ”. You can also remove part (or all) of the punctuation.


In [1]:
import numpy as np
import torch as t
from torch import nn
import torch.nn.functional as F
import unicodedata
import string
import os

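# Read the corpus, lower-case it and split on any whitespace (this drops \n, \t and \r)
with open("shakespeare.txt", encoding="utf8") as f:
    text = f.read().strip().lower().split()

# Characters kept after normalization
allowed = string.ascii_letters + " .,;'"

# Normalize the first 100000 words: strip accents (combining marks, category 'Mn')
# and drop every character outside the allowed set
words = []
for word in text[:100000]:
    words.append(''.join(
        char for char in unicodedata.normalize('NFD', word)
        if unicodedata.category(char) != 'Mn' and char in allowed))

# Note: a word that becomes empty after filtering leaves a double space in the joined text
text = " ".join(words)
# Remove part of the punctuation (':' is already excluded by the filter above)
text = text.replace(';','').replace(',','')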

In [2]:
# Inspect some Shakespeare
text[10000:10500]

'll pointing to each his thunder rain and wind or say with princes if it shall go well by oft predict that i in heaven find. but from thine eyes my knowledge i derive and constant stars in them i read such art as truth and beauty shall together thrive if from thy self to store thou wouldst convert or else of thee this i prognosticate thy end is truths and beautys doom and date.  when i consider every thing that grows holds in perfection but a little moment. that this huge stage presenteth nought '
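
The cleaning steps asked for in question 2 can also be packed into one reusable function. The following is a minimal sketch, not the code used for the results in this notebook: the function name `clean_text` and the exact set of kept characters (`ALLOWED`) are assumptions chosen for illustration. Unlike the cell above, it collapses repeated spaces after filtering.

```python
import re
import string
import unicodedata

# Characters kept after cleaning; this particular set is an assumption.
ALLOWED = set(string.ascii_lowercase + " .'")

def clean_text(raw: str) -> str:
    """Lower-case, strip accents, drop unwanted characters, collapse whitespace."""
    # NFD splits an accented letter into base letter + combining mark (category 'Mn').
    decomposed = unicodedata.normalize("NFD", raw.lower())
    no_accents = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    # Turn every run of whitespace (including \n, \t, \r) into a single space.
    spaced = re.sub(r"\s+", " ", no_accents)
    # Drop every character outside the allowed set.
    kept = "".join(c for c in spaced if c in ALLOWED)
    # Removing characters can leave repeated spaces, so collapse them again.
    return re.sub(r" +", " ", kept).strip()

print(clean_text("Café\n\tau  lait, s'il vous plaît!"))
```

Filtering before the final collapse guarantees that two consecutive words are separated by exactly one space, even when a whole token (such as a number) is removed.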

*3. Create a character-based vocabulary from the preprocessed text: every appearing character will have a unique integer ID. You can use a dictionary structure to store the vocabulary.

In [3]:
chars = tuple(sorted(set(text)))                 # vocabulary: every distinct character, sorted
n_chars = len(chars)
char_int = {c: i for i, c in enumerate(chars)}   # character -> integer ID
int_char = dict(enumerate(chars))                # integer ID -> character (inverse mapping)
int_text = [char_int[char] for char in text]     # whole text as integer IDs

# To demonstrate strategy: encode a single one-hot converted text
L = 25
Y = t.zeros(1)
i = 0

label = np.array([[c] for c in int_text[i:i+L]])
label_t = t.LongTensor(label)
x_one_hot = t.zeros(L, n_chars).scatter_(1,label_t,1)
Y[0] = int_text[i+L]
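
As an alternative to `scatter_`, the same one-hot encoding can be obtained with `torch.nn.functional.one_hot`. The sketch below runs on a toy sentence rather than the Shakespeare corpus (the `sample` string is an assumption for illustration) and checks that decoding with `argmax` recovers the original text.

```python
import torch as t
import torch.nn.functional as F

sample = "to be or not to be"          # toy text, not the Shakespeare corpus
chars = sorted(set(sample))            # vocabulary
char_int = {c: i for i, c in enumerate(chars)}
int_char = {i: c for c, i in char_int.items()}

ids = t.tensor([char_int[c] for c in sample])             # shape: (len(sample),)
one_hot = F.one_hot(ids, num_classes=len(chars)).float()  # shape: (len(sample), n_chars)

# Decoding: argmax over the vocabulary axis recovers the integer IDs.
decoded = "".join(int_char[i] for i in one_hot.argmax(dim=1).tolist())
print(one_hot.shape, decoded == sample)
```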

*4. Generate the data (X, y) for a character-level language model:
(a) Decide on a sequence length L.

In [4]:
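N = len(text)

# One row per window start. Only the first N-L rows are filled by the loop below;
# the last L rows stay all-zero with label 0.
X = t.zeros(N,L,n_chars)
Y = t.zeros(N,dtype=t.int64)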

for i in range(N-L):
    label = np.array([[c] for c in int_text[i:i+L]])
    label_t = t.LongTensor(label)
    X[i] = t.zeros(L, n_chars).scatter_(1,label_t,1)
    Y[i] = int_text[i+L]
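
The sliding-window construction above can also be written without a Python loop. The sketch below is one possible vectorized variant on a toy ID sequence (the values of `int_text`, `L` and `n_chars` are assumptions for illustration): `Tensor.unfold` cuts the sequence into overlapping windows of length L+1, whose first L entries form the input and whose last entry is the target. It produces exactly N-L samples.

```python
import torch as t
import torch.nn.functional as F

int_text = [0, 1, 2, 3, 4, 5, 6, 7]   # toy ID sequence
L, n_chars = 3, 8

ids = t.tensor(int_text)
windows = ids.unfold(0, L + 1, 1)     # shape: (len(int_text) - L, L + 1)
X_ids = windows[:, :L].contiguous()   # first L characters of each window
Y = windows[:, L]                     # the character that follows each window
X = F.one_hot(X_ids, num_classes=n_chars).float()  # shape: (N - L, L, n_chars)
print(X.shape, Y.tolist())
```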

*5. Shuffle then split the data between train and validation sets. You probably
want to keep most of the data available for training.

In [5]:
from torch.utils.data import TensorDataset, DataLoader, random_split
ds = TensorDataset(X, Y)

train_len = int(len(ds)*0.9)
ds_train, ds_test = random_split(ds, [train_len, len(ds)-train_len])

batch_size = 2048

train_loader = DataLoader(ds_train, batch_size=batch_size)
#test_loader = DataLoader(ds_test, batch_size=1)
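
`random_split` shuffles the indices before splitting, so the split changes from run to run unless a seeded generator is passed. A minimal sketch on a toy dataset (the helper `split` and the seed value are assumptions for illustration):

```python
import torch as t
from torch.utils.data import TensorDataset, random_split

ds = TensorDataset(t.arange(100).unsqueeze(1), t.arange(100))  # toy dataset
train_len = int(len(ds) * 0.9)

def split(seed):
    """Split ds 90/10 using a generator seeded with `seed`."""
    g = t.Generator().manual_seed(seed)
    return random_split(ds, [train_len, len(ds) - train_len], generator=g)

a_train, a_val = split(0)
b_train, b_val = split(0)
# The same seed gives the same validation indices on every run.
print(a_val.indices == b_val.indices)
```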

In [6]:
# Hyperparameters

input_size = n_chars
seq_length = 25
num_layers = 1

hidden_size = 256 # via rule of thumb << N/(10*(inputs+outputs))
num_classes = n_chars
num_epochs = 10 # Due to time constraints
learning_rate = 0.001 # Default learning rate of torch.optim.Adam

*6. Implement an LSTM classifier designed to predict the next character from
a sequence of L consecutive characters. A good starting point would be to
stack a ’torch.nn.LSTM()’ module followed by a ’torch.nn.Linear()’ layer.
Hint: For this task, we need a ”many-to-one” type of LSTM.

In [7]:
class Net(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_classes):
        super(Net, self).__init__()
        self.num_layers = num_layers
        self.hidden_size = hidden_size

        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first = True)

        # x -> batch_size, seq, input_size

        self.fc = nn.Linear(hidden_size, num_classes) # fully connected

    def forward(self, x):
        # initial hidden state (not used)
        #h0 = t.zeros(self.num_layers, x.size(0), self.hidden_size)

        # initial cell state (not used)
        #c0 = t.zeros(self.num_layers, x.size(0), self.hidden_size)

        out, (h,c) = self.lstm(x) #, (h0, c0))

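        # out shape: batch_size, seq_length, hidden_size
        # h shape:   num_layers, batch_size, hidden_size

        out = self.fc(h[-1]) # many-to-one: feed the last layer's final hidden state to the linear layer
        return out.squeeze() # note: this also drops the batch dimension when batch_size is 1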
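
For a single-layer, unidirectional LSTM, the final hidden state `h` is the same tensor as the last time step of `out`, which is why either can feed the linear layer in a many-to-one setup. The following sketch checks this on random input; the sizes are arbitrary assumptions, not the values used for training.

```python
import torch as t
from torch import nn

t.manual_seed(0)
batch, L, n_chars, hidden = 4, 25, 30, 16
lstm = nn.LSTM(n_chars, hidden, num_layers=1, batch_first=True)
x = t.randn(batch, L, n_chars)

out, (h, c) = lstm(x)
# out: (batch, L, hidden)          hidden state at every time step
# h:   (num_layers, batch, hidden) final hidden state of each layer
same = t.allclose(out[:, -1, :], h[-1])
print(out.shape, h.shape, same)
```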

*7. Implement the training loop. You will choose carefully the loss function
and a metric that are suitable for the task. Your code will print the loss
and the chosen metric at the end of every epoch, both for the train and
the validation sets.

In [8]:
# instantiate model
model = Net(input_size, hidden_size, num_layers, num_classes)

# loss function
criterion = nn.CrossEntropyLoss()

# Optimizer
optimizer = t.optim.Adam(model.parameters(), lr=learning_rate)

# Training loop

n_steps = len(train_loader)            # number of batches per epoch
print("num steps:",n_steps)

for epoch in range(num_epochs):
    epoch_loss = 0
    for i, (samples, labels) in enumerate(train_loader):
        #print(sample[0])

        # forward
        outputs = model(samples)           # 1: predict
        loss = criterion(outputs, labels)  # 2: calculate loss
        epoch_loss += loss.item()

        # backward
        optimizer.zero_grad()   # 1: empty vals in gradient
        loss.backward()         # 2: backpropagation
        optimizer.step()        # 3: update parameters

        # print progress info
        #if (i+1) % 100 == 0:
    print (f'Epoch [{epoch+1}/{num_epochs}], Step [{i+1}/{n_steps}], Loss: {epoch_loss/n_steps:.4f}')

# Save, load, run
#t.save(model,"trained")
#m1=t.load("trained")
#examples = iter(train_loader)
#samples, labels = next(examples)
#print(m1(samples))

num steps: 232
Epoch [1/10], Step [232/232], Loss: 2.6230
Epoch [2/10], Step [232/232], Loss: 2.1562
Epoch [3/10], Step [232/232], Loss: 2.0116
Epoch [4/10], Step [232/232], Loss: 1.9034
Epoch [5/10], Step [232/232], Loss: 1.8123
Epoch [6/10], Step [232/232], Loss: 1.7360
Epoch [7/10], Step [232/232], Loss: 1.6727
Epoch [8/10], Step [232/232], Loss: 1.6207
Epoch [9/10], Step [232/232], Loss: 1.5772
Epoch [10/10], Step [232/232], Loss: 1.5402
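
Question 7 asks for the loss and a metric on both the train and the validation set, while the loop above reports the training loss only. One way to add this is a helper that is called at the end of each epoch on each loader. The sketch below is an assumption about how this could look, not the code that produced the numbers above: the name `evaluate` is hypothetical, accuracy is used as the metric, and a linear model with random data stands in for the LSTM. It expects the model to return logits of shape (batch, num_classes).

```python
import torch as t
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate(model, loader, criterion):
    """Return (mean loss per sample, accuracy) over a DataLoader."""
    model.eval()
    total_loss, n_correct, n_total = 0.0, 0, 0
    with t.no_grad():
        for samples, labels in loader:
            logits = model(samples)                      # (batch, num_classes)
            # criterion averages over the batch, so weight by the batch size
            total_loss += criterion(logits, labels).item() * labels.size(0)
            n_correct += (logits.argmax(dim=1) == labels).sum().item()
            n_total += labels.size(0)
    model.train()
    return total_loss / n_total, n_correct / n_total

# Toy stand-ins (assumptions): a linear model and random data instead of the LSTM.
t.manual_seed(0)
toy_model = nn.Linear(8, 3)
toy_loader = DataLoader(TensorDataset(t.randn(50, 8), t.randint(0, 3, (50,))), batch_size=16)
loss, acc = evaluate(toy_model, toy_loader, nn.CrossEntropyLoss())
print(f"loss={loss:.4f}, acc={acc:.4f}")
```

Weighting each batch loss by its size keeps the average correct when the last batch is smaller than the others.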


*8. Finetune your model (for simplicity, no model selection pipeline required
in this exercise): you will play with the hyperparameters of your model,
including (non-exclusively) the sequence length, the hidden size of the
LSTM module, the number of layers in the LSTM module, the batch size
and the weight decay.

ANSWER: The batch size, sequence length and number of epochs used in the final version are the ones stated above.
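
Two of the hyperparameters named in the question, the number of LSTM layers and the weight decay, map directly to arguments of `nn.LSTM` and `torch.optim.Adam`. The values in the sketch below are illustrative assumptions and were not tuned or used in this notebook.

```python
import torch as t
from torch import nn

n_chars = 30  # assumed vocabulary size, for illustration only

# Two stacked LSTM layers; dropout is applied between layers (needs num_layers > 1).
lstm = nn.LSTM(n_chars, 128, num_layers=2, dropout=0.2, batch_first=True)
fc = nn.Linear(128, n_chars)

# weight_decay adds an L2 penalty on the parameters.
params = list(lstm.parameters()) + list(fc.parameters())
optimizer = t.optim.Adam(params, lr=1e-3, weight_decay=1e-5)
print(optimizer.param_groups[0]["weight_decay"])
```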

*9. Train your model and analyze its performance

In [10]:
# Validate

test_loader = DataLoader(ds_test, batch_size=1) # no batching
n_samples = len(test_loader)

n_correct = 0

with t.no_grad():
    for sample, label in test_loader:
        #print(sample)
        #print(sample.shape)
  
        output = model(sample)
        _, pred = t.max(output, dim = 0)
        n_correct += int(pred == label)

    acc = n_correct / n_samples
    print(f'Correct predicted / total = {n_correct} / {n_samples}')
    print(f'Prediction accuracy = {acc:.4f}')

Correct predicted / total = 27839 / 52671
Prediction accuracy = 0.5285
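
To put an accuracy figure into context, it helps to compare it with a trivial baseline that always predicts the most frequent character. The sketch below computes this baseline on a toy label list; the function name `majority_baseline` is hypothetical, and for the real data the labels would be the validation targets.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    symbol, freq = counts.most_common(1)[0]
    return symbol, freq / len(labels)

toy_labels = list("to be or not to be")   # toy stand-in for the validation targets
symbol, acc = majority_baseline(toy_labels)
print(repr(symbol), round(acc, 4))
```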


*10. Use your trained model to generate new text:
(a) Choose a ”seed sequence” consisting of L characters of your choice
(among the characters in the vocabulary).

In [13]:
seed_seq = "his thunder rain and wind"  # L=25
print("Seed sequence: " + "his thunder rain and wind")

Y = t.zeros(1)
int_seed_seq = [char_int[char] for char in seed_seq] # Convert text to int ID
label = np.array([[c] for c in int_seed_seq ])
label_t = t.LongTensor(label)
x_one_hot = t.zeros(L, n_chars).scatter_(1,label_t,1).unsqueeze(dim=0)
Y[0] = char_int[' ']  # next character intentionally left blank

gen_text = seed_seq

for i in range(200):
    
    #(c) Pass your encoded seed sequence through your trained model.
    y_out = model(x_one_hot) #  (DataLoader(TensorDataset(X, Y), batch_size=1))

    #(d) Predict the next character as the argmax of the softmax activation on the output.
    pred_int = int(t.argmax(t.softmax(y_out,dim=0)).detach())
    gen_text += int_char[pred_int]

    #(e) One-hot encode the predicted character.
    pred_hot = t.zeros(1,n_chars)
    pred_hot[0][pred_int] = 1

    #(f) Update the seed sequence: remove the encoded character in first position of the encoded seed sequence, then add the encoded predicted character in last position of the encoded seed sequence.
    x_one_hot=t.cat([x_one_hot[0][1:],pred_hot]).unsqueeze(dim=0)

print(gen_text)


Seed sequence: his thunder rain and wind
his thunder rain and wind that the sent the beart the world and the true that i have so the part of the world and the true that i have so the part of the world and the true that i have so the part of the world and the true th


*11. What problem seems to occur with the previous procedure ?

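ANSWER: The generated text starts out plausibly, but argmax decoding is deterministic: the same input window always yields the same next character. Once a window repeats, the generation is stuck in a loop, here repeating "the part of the world and the true that i have so" with no further variation.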
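
The looping behaviour is a property of greedy decoding in general, not of this particular model. The toy sketch below makes the same point with a much simpler predictor, which is an assumption for illustration: a table that always returns the most frequent follower of the last character. Because the next character is a deterministic function of a finite state, the output must eventually repeat.

```python
from collections import Counter, defaultdict

corpus = "the part of the world and the true that i have so the part"

# Count which character follows which in the toy corpus.
follow = defaultdict(Counter)
for a, b in zip(corpus, corpus[1:]):
    follow[a][b] += 1

def greedy_next(c):
    """Deterministic 'argmax' prediction: the most frequent follower of c."""
    return follow[c].most_common(1)[0][0]

text = "t"
for _ in range(40):
    text += greedy_next(text[-1])
print(text)
```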

*12. Modify the procedure described in Question 10: rather than predicting
the next character as the argmax of the softmax activation on the output,
you will instead sample it from the probability distribution given by the
softmax activation on the output.

In [20]:
seed_seq = "his thunder rain and wind"  # L=25
print("Seed sequence: " + "his thunder rain and wind\n")

Y = t.zeros(1)
int_seed_seq = [char_int[char] for char in seed_seq] # Convert text to int ID
label = np.array([[c] for c in int_seed_seq ])
label_t = t.LongTensor(label)
x_one_hot = t.zeros(L, n_chars).scatter_(1,label_t,1).unsqueeze(dim=0)
Y[0] = char_int[' ']  # next character intentionally left blank

gen_text = seed_seq

for i in range(1000):
    
    #(c) Pass your encoded seed sequence through your trained model.
    y_out = model(x_one_hot) #  (DataLoader(TensorDataset(X, Y), batch_size=1))

    #(d)* Sample the next character from the probability distribution given by the softmax activation on the output.
    s_max = t.softmax(y_out,dim=0).detach().numpy().astype('float64')
    # Draw one sample from the distribution; argmax of the one-hot draw is the sampled character ID
    pred_int = np.argmax(np.random.multinomial(1,s_max/sum(s_max),1))

    gen_text += int_char[pred_int]
    
    #(e) One-hot encode the predicted character.
    pred_hot = t.zeros(1,n_chars)
    pred_hot[0][pred_int] = 1

    #(f) Update the seed sequence: remove the encoded character in first position of the encoded seed sequence, then add the encoded predicted character in last position of the encoded seed sequence.
    x_one_hot=t.cat([x_one_hot[0][1:],pred_hot]).unsqueeze(dim=0)
    
print(gen_text)

Seed sequence: his thunder rain and wind

his thunder rain and wind tell it no homoured caesar shapparcowen dauch as their godaty see all or coed know my count dornof o to your wasters tadee. i sheesbry death till what shall noternessold gonce ill bains men lide it. wist hath benied you and held at dead and heart spike the combong non that i lutter i know my playe here to the bear i will i besere but dack and vessell and at add upenst foor dingratem delive and that and chayse at mine that kiend lay whure the brotwied me. thenefilewacly morrilled his diade and weld he than celfarity. wistrat mine early you shall be natles beine your bugwara well thou arstone with way nater have the stain and make hores grawer that i heagh and the camp. anowan oft thet himpekend and love i hate your sirve and repaie the gropid on the fachilar's happinast. his will truth hore to that thou beace yount. jurge a read me day nought make may creak layte in the shound antony. tongued upon the known ows ring cow
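
Sampling can also be done directly in PyTorch with `torch.multinomial`, and a temperature parameter gives control over how adventurous the text is. Temperature is not part of the assignment; it is added here as an assumption, as is the function name `sample_next`. Dividing the logits by a temperature below 1 sharpens the distribution towards the argmax, while a temperature above 1 flattens it.

```python
import torch as t

def sample_next(logits, temperature=1.0, generator=None):
    """Sample a class index from softmax(logits / temperature)."""
    probs = t.softmax(logits / temperature, dim=-1)
    return int(t.multinomial(probs, num_samples=1, generator=generator))

g = t.Generator().manual_seed(0)
logits = t.tensor([2.0, 1.0, 0.1, -1.0])   # toy model output for a 4-character vocabulary

draws = [sample_next(logits, temperature=1.0, generator=g) for _ in range(1000)]
cold = [sample_next(logits, temperature=0.01, generator=g) for _ in range(20)]
print(sorted(set(draws)), set(cold))
```

With a very low temperature the sampler behaves like the argmax procedure of question 10, so the repetition problem returns; a moderate value trades some of the misspellings seen above against variety.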

*13. Entertain us by generating amusing text !