**Homework 25**

In this assignment you will train an RNN to predict characters of *Alice in Wonderland* from strings of consecutive characters.

We begin as usual with the imports you will need for this assignment.

In [None]:
import numpy as np
import torch
from torch import nn

In [None]:
device=('cuda' if torch.cuda.is_available()
        else 'cpu')

device

'cuda'

Run the following code block to read *Alice in Wonderland* from the web, convert it to lower case, remove punctuation and newlines, and store the result in the variable `text` as a list of characters.

In [None]:
import string
from urllib.request import urlopen
url='https://gist.githubusercontent.com/phillipj/4944029/raw/75ba2243dd5ec2875f629bf5d79f6c1e4b5a8b46/alice_in_wonderland.txt'
text = urlopen(url).read().decode('utf-8')
text=text.lower()
text=[c for c in text if (c not in string.punctuation) and (c!='\n')]

Write a class `Tokenizer` with the following methods:


*   `__init__`, a method that builds a dictionary `tokens` whose keys are the set of unique characters in some input `text`, and values are integers.
*   `encode`, a method that takes in a corpus of text, converts each character to an integer according to the dictionary built by the `__init__` method, and outputs a list of those integers.
*   `decode`, a method that takes a single integer (a value from the dictionary), and returns the corresponding character key.



In [None]:
class Tokenizer():
  def __init__(self,text):
    tokens={}
    n=0
    for c in text:
      if c not in tokens:
        tokens[c]=n
        n+=1
    self.tokens=tokens

  def encode(self,text):
    out=[]
    for c in text:
      out+=[self.tokens[c]]
    return out

  def decode(self,n):
    for c in self.tokens:
      if n==self.tokens[c]:
        return c



Now, create an object called `tok` of your `Tokenizer` class, and use it to encode `text` as a list of integers, `text_indices`.

In [None]:
tok=Tokenizer(text)
text_indices=tok.encode(text)
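
As a quick sanity check, encoding followed by decoding should reproduce the input. The sketch below is self-contained: it uses a hypothetical `build_vocab` helper and a toy string rather than the `Tokenizer` class above. It also builds an inverse dictionary (an addition for illustration, not part of the assignment), so that decoding is a single lookup instead of a scan over all keys.

```python
def build_vocab(chars):
    """Map each distinct character to an integer, in order of first appearance."""
    tokens = {}
    for c in chars:
        if c not in tokens:
            tokens[c] = len(tokens)
    return tokens

sample = list("alice was beginning")          # toy stand-in for `text`
tokens = build_vocab(sample)
inverse = {i: c for c, i in tokens.items()}   # index -> character (hypothetical addition)

encoded = [tokens[c] for c in sample]
decoded = [inverse[i] for i in encoded]

print(len(tokens))         # number of distinct characters in the toy string
print(decoded == sample)   # the round trip should reproduce the input
```

The same check could be run on the real data by comparing `[tok.decode(i) for i in text_indices[:100]]` with `text[:100]`.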

For convenience, we'll define `vocab_size=len(tok.tokens)` to be the length of your tokenizer dictionary:

In [None]:
vocab_size=len(tok.tokens)
vocab_size

29

The next task is to create feature sequences and targets. From `text_indices`, create a list-of-lists `X`. Each sublist of `X` should correspond to 50 consecutive elements of `text_indices`. At the same time, create a list `y` which contains the indices of the characters that follow each sublist of `X`. For example, `X[0]` should be a list containing the first 50 elements of `text_indices`: `text_indices[0]` through `text_indices[49]`. `y[0]` should be the 51st element, `text_indices[50]`.

To keep the size of the feature and target vectors manageable, consecutive lists in `X` should be shifted by 3, so the overlap is 47 elements. Hence, `X[1]` should be a list containing the integers `text_indices[3]` through `text_indices[52]`, and `y[1]` should be the integer `text_indices[53]`.

In [None]:
seq_len=50
X=[]
y=[]
for i in range(0,len(text_indices)-seq_len-1,3):
  X.append(text_indices[i:i+seq_len])
  y.append(text_indices[i+seq_len])
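
To see the windowing on a small scale, the sketch below applies the same loop to a toy list of ten integers with a window length of 4 and a stride of 3. The function name `make_windows` and the toy values are assumptions chosen for illustration; the loop bound mirrors the cell above.

```python
def make_windows(indices, seq_len, stride):
    """Return (X, y): windows of length seq_len and the element that follows each."""
    X, y = [], []
    # Same loop bound as the cell above: len(indices) - seq_len - 1
    for i in range(0, len(indices) - seq_len - 1, stride):
        X.append(indices[i:i + seq_len])
        y.append(indices[i + seq_len])
    return X, y

toy = list(range(10))   # stand-in for text_indices
X_toy, y_toy = make_windows(toy, seq_len=4, stride=3)
print(X_toy)   # [[0, 1, 2, 3], [3, 4, 5, 6]]
print(y_toy)   # [4, 7]
```

Consecutive windows here overlap by `seq_len - stride = 1` element, just as the real windows overlap by 50 - 3 = 47.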

Convert `X` and `y` to PyTorch tensors with the same names, move them to `device`, and check their shapes. If done correctly, the shape of `X` should be (45539, 50) and the shape of `y` should be (45539, ):

In [None]:
X=torch.tensor(X).to(device)
y=torch.tensor(y).to(device)
X.shape, y.shape

(torch.Size([45539, 50]), torch.Size([45539]))

Use the `one_hot` function from `torch.nn.functional` to convert `X` to one-hot encoded vectors of 0's and 1's, stored as `OneHotX`, and check its shape. You should now have shape (45539,50,29). In other words, `OneHotX` contains 45,539 sequences of length 50, and each element of each sequence is a 29-dimensional vector of 28 zeros and a single one in the entry corresponding to some character in the text. The targets `y` stay as integer class indices with shape (45539, ), which is the form `nn.CrossEntropyLoss` expects.

In [None]:
import torch.nn.functional as F

In [None]:
OneHotX=F.one_hot(X,vocab_size).float()

In [None]:
OneHotX.shape

torch.Size([45539, 50, 29])
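
A minimal sketch of what `F.one_hot` does, using a toy index tensor with 3 classes (the tensor values are made up for illustration). Each integer becomes a vector with a single 1, and taking the `argmax` over the last axis recovers the original indices.

```python
import torch
import torch.nn.functional as F

idx = torch.tensor([[0, 2], [1, 1]])   # 2 toy sequences of length 2
one_hot = F.one_hot(idx, num_classes=3).float()

print(one_hot.shape)   # torch.Size([2, 2, 3])
print(one_hot[0, 1])   # tensor([0., 0., 1.])
# argmax over the last axis recovers the original indices
print(torch.equal(one_hot.argmax(dim=-1), idx))   # True
```

The `.float()` call matters because `F.one_hot` returns integer tensors, while the recurrent layer expects floating-point input.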

In [None]:
'''
OneHotX=torch.zeros((X.shape[0],X.shape[1],vocab_size)).to(device)
for i in range(X.shape[0]):
  for j in range(X.shape[1]):
    OneHotX[i,j,X[i,j]]=1
'''

You're now ready to create your model, which will consist of two separate one-layer PyTorch modules. The first is a recurrent layer that takes in sequences of 29-dimensional vectors and has a 128-dimensional hidden state. The second is a linear layer that takes the last hidden state and produces a 29-dimensional vector.

In [None]:
rnn=nn.RNN(vocab_size,128,batch_first=True).to(device)
fc=nn.Linear(128,vocab_size).to(device)
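
It can help to check the tensor shapes that flow through these two layers before training. The sketch below uses separate demo layers (`rnn_demo`, `fc_demo`) and a random toy batch of 4 sequences, all of which are assumptions for illustration. With `batch_first=True`, `nn.RNN` returns the per-step outputs with shape (batch, steps, hidden) and the final hidden state with shape (num_layers, batch, hidden); for a single-layer, one-directional RNN the last time step of the output equals that final hidden state.

```python
import torch
from torch import nn

torch.manual_seed(0)
vocab, hidden, batch, steps = 29, 128, 4, 50   # toy batch size of 4
rnn_demo = nn.RNN(vocab, hidden, batch_first=True)
fc_demo = nn.Linear(hidden, vocab)

x = torch.randn(batch, steps, vocab)   # random stand-in for a one-hot batch
output, h_n = rnn_demo(x)
print(output.shape)   # torch.Size([4, 50, 128])
print(h_n.shape)      # torch.Size([1, 4, 128])
print(torch.allclose(output[:, -1], h_n[0]))   # True

logits = fc_demo(h_n.squeeze(0))   # drop the num_layers axis before the linear layer
print(logits.shape)   # torch.Size([4, 29])
```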

Set up training using the `Adam` optimizer and an appropriately chosen loss function.

In [None]:
# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(list(rnn.parameters()) + list(fc.parameters()), lr=0.001)
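
`nn.CrossEntropyLoss` takes raw scores (logits) and integer class indices, so the output of `fc` needs no softmax and `y` needs no one-hot encoding. The sketch below, with made-up logits and targets, checks that the loss equals the mean negative log-softmax probability of the correct class.

```python
import torch
from torch import nn

torch.manual_seed(0)
logits = torch.randn(4, 29)             # raw scores for 4 toy examples
targets = torch.tensor([0, 5, 28, 3])   # integer class indices, not one-hot

loss = nn.CrossEntropyLoss()(logits, targets)

# Manual version: pick the log-probability of each correct class and average
log_probs = torch.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(4), targets].mean()
print(torch.allclose(loss, manual))     # True
```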

Fit your model to `OneHotX` and `y`. The cell below trains for 75 epochs with a batch size of 32. Each epoch can take about 95 seconds, so you'll want to leave your computer for an hour or more for this to complete.

In [None]:
n_epochs=75
N = OneHotX.shape[0]  # total number of observations in training data
batch_size=32

rnn.train()
for epoch in range(n_epochs):
  epoch_loss = 0.0

  # Shuffle the indices
  indices = torch.randperm(N,device=device)

  # Create mini-batches
  for i in range(0, N, batch_size):
    batch_indices = indices[i:i+batch_size]
    batch_X = OneHotX[batch_indices]
    batch_y = y[batch_indices]

    optimizer.zero_grad()
    last_hidden=rnn(batch_X)[1].squeeze(0)  # final hidden state: (1, batch, 128) -> (batch, 128)
    out=fc(last_hidden)
    loss=criterion(out,batch_y)
    loss.backward()
    optimizer.step()

    epoch_loss += loss.item()*batch_X.shape[0]  # weight by actual batch size (last batch may be smaller)

  if epoch%2==0:
      avg_loss = epoch_loss / len(y)
      print(f"epoch: {epoch}, avg_loss: {avg_loss}")

epoch: 0, avg_loss: 2.362585588155856
epoch: 2, avg_loss: 1.9816782456769542
epoch: 4, avg_loss: 1.8501564488192839
epoch: 6, avg_loss: 1.7504945752198664
epoch: 8, avg_loss: 1.678781470363666
epoch: 10, avg_loss: 1.615750982548875
epoch: 12, avg_loss: 1.566183029687798
epoch: 14, avg_loss: 1.5211048758666834
epoch: 16, avg_loss: 1.4840404801004428
epoch: 18, avg_loss: 1.4528725123844712
epoch: 20, avg_loss: 1.4242201379805195
epoch: 22, avg_loss: 1.3963262983774616
epoch: 24, avg_loss: 1.3751589620680638
epoch: 26, avg_loss: 1.3546202259888414
epoch: 28, avg_loss: 1.3412804920742112
epoch: 30, avg_loss: 1.325974803026487
epoch: 32, avg_loss: 1.3119116378774462
epoch: 34, avg_loss: 1.2990341318457563
epoch: 36, avg_loss: 1.2888927964417185
epoch: 38, avg_loss: 1.2765215783830284
epoch: 40, avg_loss: 1.2700999458226876
epoch: 42, avg_loss: 1.2654386535798998
epoch: 44, avg_loss: 1.255694712916943
epoch: 46, avg_loss: 1.2511304133771717
epoch: 48, avg_loss: 1.2472890148432154
epoch: 50, 

We will now use your trained model to generate text, one character at a time. Run the following code block to do this. (It will take a minute or two to complete.) It's interesting that although the model generates one character at a time, you'll see very word-like strings in the final text.

In [None]:
rnn.eval()
next_seq=OneHotX[:1]

newtext=''
with torch.no_grad():
  for i in range(500):
    seq=next_seq
    pred=fc(rnn(seq)[1].squeeze()) #predictions of your model
    pred_probs=torch.softmax(pred,dim=0).detach().cpu().numpy() #predictions->probs
    index_pred=np.random.choice(vocab_size,1,p=pred_probs)[0] #choose one
    newtext+=tok.decode(index_pred) #corresponding character

    next_vec=torch.zeros(vocab_size).to(device)
    next_vec[index_pred]=1  #one-hot encode chosen letter index
    next_seq=torch.zeros(1,seq_len,vocab_size).to(device)
    next_seq[0,:seq_len-1]=seq[0,1:] #new sequence is last 49 of old sequence
    next_seq[0,seq_len-1]=next_vec  #plus new vector

newtext #display generated text

'ce and taremint  its greeth this with a cut lersterk and wan ive too bolk an orfond the bohind  a worde him  the hatter when with ened quite and the dayture amout that be puzzer ther she who very suppoce horde then i perpan said the duchess mad feer arimbelly mouse  ho hees andshe doumor to be a fildipes and mofy drep on fent so she spoke tive      hememboor alice she found at there and of hard the mock turtlist groind iow  i dorfive you said the mus a sing  pokies heast idaly fan of the queen p'

**COPY AND PASTE THIS TEXT INTO THE SUBMISSION WINDOW ON GRADESCOPE**