<a href="https://colab.research.google.com/github/archyyu/GPT-from-MLP-to-RNN-to-Transformer/blob/main/GPT_by_RNN_version_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [12]:
import requests
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from torch.nn import functional as F
import matplotlib.pyplot as plt # for making figures
%matplotlib inline

# Set random seed for reproducibility
torch.manual_seed(42)

<torch._C.Generator at 0x7df46ef943d0>

In [13]:
# Data I/O
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
#url = "https://raw.githubusercontent.com/archyyu/publicResource/main/google.dev.en"
#url = "https://raw.githubusercontent.com/torvalds/linux/master/mm/madvise.c"
response = requests.get(url)
data = response.text

chars = sorted(set(data))  # sorted so the character-to-index mapping is stable across runs
data_size, vocab_size = len(data), len(chars)
print(f'data has {data_size} characters, {vocab_size} unique.')

char_to_ix = {ch: i for i, ch in enumerate(chars)}
ix_to_char = {i: ch for i, ch in enumerate(chars)}

n = int(0.9 * len(data))  # first 90% for training, the rest for validation
training_data = data[:n]
val_data = data[n:]

data has 1115394 characters, 65 unique.


In [14]:
# Hyperparameters
hidden_size = 100
embedding_dim = 20
seq_length = 25
learning_rate = 1e-1
batch_size = 20
eval_iters = 200

In [26]:
class ManillaRNN(nn.Module):
  def __init__(self, vocab_size, embedding_dim, hidden_size):
    super(ManillaRNN, self).__init__()
    self.hidden_size = hidden_size
    self.embedding = nn.Embedding(vocab_size, embedding_dim)
    self.i2h = nn.Linear(embedding_dim, hidden_size)
    self.h2h = nn.Linear(hidden_size, hidden_size)
    self.h2o = nn.Linear(hidden_size, vocab_size)
    self.hb2 = nn.Parameter(torch.zeros(1, hidden_size))
    self.ob = nn.Parameter(torch.zeros(1, vocab_size))

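  def forward(self, x, targets=None):
    # x: (B, T) character indices; h starts at zero for every sequence in the batch
    h = torch.zeros(x.shape[0], self.hidden_size, device=x.device)
    y_list = []
    for i in range(x.shape[1]):
      t = self.embedding(x[:,i])  # (B, embedding_dim)
      h = torch.tanh(self.i2h(t) + self.h2h(h) + self.hb2)
      y = self.h2o(h) + self.ob  # (B, vocab_size) logits for step i
      y_list.append(y)
    predicts = torch.stack(y_list, dim=1)  # (B, T, vocab_size)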

    if targets is None:
      loss = None
    else:
      B,T = targets.shape
      loss = F.cross_entropy(predicts.view(B*T, -1), targets.view(B*T))
    return predicts, loss


criterion = nn.CrossEntropyLoss()

model = ManillaRNN(vocab_size, embedding_dim, hidden_size)
optimizer = optim.Adagrad(model.parameters(), lr=learning_rate)

This ManillaRNN differs from the VanillaRNN.
It iterates over all the time steps, drops the intermediate outputs, and only outputs the final one.

But I am not going to rewrite the training function to retrain the new model,
because I think the VanillaRNN is better and more controllable than this one.


In [25]:
def getBatch(split):
  batch_inputs = []
  batch_targets = []

  data = training_data if split == 'train' else val_data
  start_idx = torch.randint(len(data) - batch_size - seq_length - 2,[1]).item()

  # Generate examples for the current minibatch
  for i in range(batch_size):
    p = start_idx + i
    inputs = torch.tensor([char_to_ix[ch] for ch in data[p:p + seq_length]], dtype=torch.long).view(1, -1)
    targets = torch.tensor([char_to_ix[ch] for ch in data[p + 1:p + seq_length + 1]], dtype=torch.long).view(-1)

    batch_inputs.append(inputs)
    batch_targets.append(targets)

  # Convert lists to tensors
  minibatch_inputs = torch.cat(batch_inputs, dim=0)
  minibatch_targets = torch.stack(batch_targets)
  return minibatch_inputs, minibatch_targets

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = getBatch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

In [32]:
# Training loop
num_iterations = 20000
for iteration in range(num_iterations):

  inputs, targets = getBatch('train')
  predicts, loss = model(inputs, None)
  optimizer.zero_grad()
  B,T = targets.shape

  totalloss = criterion(predicts.view(B*T, -1), targets.view(B*T))
  totalloss.backward()
  optimizer.step()

  if iteration % 1000 == 0:
    losses = estimate_loss()
    print(f"step {iteration}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")


step 0: train loss 1.8895, val loss 1.9674
step 1000: train loss 1.8635, val loss 1.9978
step 2000: train loss 1.8343, val loss 2.0014
step 3000: train loss 1.8512, val loss 1.9489
step 4000: train loss 1.8565, val loss 1.9407
step 5000: train loss 1.7965, val loss 1.9693
step 6000: train loss 1.8494, val loss 1.9797
step 7000: train loss 1.8748, val loss 1.9767
step 8000: train loss 1.8604, val loss 1.9742
step 9000: train loss 1.8603, val loss 1.9759
step 10000: train loss 1.8498, val loss 1.9463
step 11000: train loss 1.7943, val loss 1.9614
step 12000: train loss 1.8332, val loss 1.9412
step 13000: train loss 1.8238, val loss 1.9847
step 14000: train loss 1.8174, val loss 1.9862
step 15000: train loss 1.8433, val loss 1.9471
step 16000: train loss 1.7991, val loss 1.9472
step 17000: train loss 1.8069, val loss 1.9311
step 18000: train loss 1.8141, val loss 1.9344
step 19000: train loss 1.7987, val loss 1.9529


In [34]:
with torch.no_grad():

  start = ['F']
  result = ['F']

  for i in range(2000):
    start = start[-seq_length:]
    ilist = torch.tensor([char_to_ix[ch] for ch in start])  # ch avoids shadowing the loop variable i
    ilist = ilist.reshape(1, -1)
    outputs, loss = model(ilist, None)

    p = nn.functional.softmax(outputs[:,-1,:], dim=-1).detach().numpy().ravel()
    ix = np.random.choice(range(vocab_size), p=p)
    start.append(ix_to_char[ix])
    result.append(ix_to_char[ix])

  print(''.join(result))


Fon him Lond the mustion:
Is nevicand fait shay,
To dreainst scrriend',
I in tain.

WICH:
Mand sead, sure:
And tat.

TONCONKEDS LORS:
My Wooke in aclecs?

Feet livawe?

SICHARD:
In herh VION MAUTIUC
ERTY:
Was rewy!

FLORIZARE:
My from should A and quick'd,
The wither my lagman! Beg cans.

NISIA:
Than dound will draot cone say.

Nather in's than curd:
Where your here plutise sham brem, I am arwenckired
Mit!

PRINCENT:
Gyim, then very shan have it batimies the she his in't misiniup
Yis itwards-
Thus I will with them loverilesly soad in then at An neven. Kinster you, from the
Deoming some
prined MICHAPET:
NoRm.
 the word ben may this follintn nike thence are go
Hest shall trut
That so.
I came.

RUEEONENCE:
When-us,
Pof him noter.
Lfright,
And
be is reat my qunolbst surnics sorte; lives!
His of dost beep of this senzen the whithe soolue, to enother:
When the liruter's will alming.
That.

CORIOLANUS:
Well hay wiffult
And think, signing-Lay-pineds al liver, those;
This, though death
Dy brole

RNNs are a natural fit for seq-to-seq tasks, but they also work well for seq-to-one tasks.
The example above is effectively seq-to-one:
I use only the output at the last time step as the result of the RNN.
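
A minimal sketch of that last-step readout, assuming logits shaped (batch, time, vocab) like the ones the model returns. The random logits and the sizes are placeholders, and `torch.multinomial` is swapped in here for the `np.random.choice` call used in the generation cell.

```python
import torch

torch.manual_seed(0)

# Placeholder logits standing in for the model output: (batch=1, time=25, vocab=65)
logits = torch.randn(1, 25, 65)

# Seq-to-one readout: keep only the logits at the last time step
last_logits = logits[:, -1, :]               # shape (1, 65)

# Turn the logits into a probability distribution over the vocabulary
probs = torch.softmax(last_logits, dim=-1)   # each row sums to 1

# Sample one character index from that distribution
next_ix = torch.multinomial(probs, num_samples=1).item()
print(last_logits.shape, next_ix)
```

Sampling directly in torch avoids the round trip through NumPy, but either approach draws from the same distribution.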

I am not sure whether stacking the outputs of all the time steps and squeezing them together into a final result would work better. I will continue to explore this.
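
One possible reading of "squeezing them together" is averaging the hidden states over time before the output projection. The sketch below only compares shapes for the two readouts. The sizes, the random hidden states, and the `h2o` layer are illustrative assumptions, not the trained model, and it says nothing about which option predicts better.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

B, T, H, V = 4, 25, 100, 65   # batch, time steps, hidden size, vocab size (illustrative)

# Placeholder hidden states, one per time step, as an RNN loop would collect them
hidden_states = torch.randn(B, T, H)

h2o = nn.Linear(H, V)         # hypothetical output projection

# Option A: use only the last hidden state
logits_last = h2o(hidden_states[:, -1, :])     # (B, V)

# Option B: average all hidden states over time, then project
logits_mean = h2o(hidden_states.mean(dim=1))   # (B, V)

print(logits_last.shape, logits_mean.shape)
```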

I think there is another way to optimize the model above.
When I train the model, I prepare the data like this: the input is a list of characters, and the expected output is a single character.
But we could also prepare the data like this: the input is a list of characters, and the expected output is also a list of characters, shifted by one step.
When generating text, I could then use just the last character as the result.

I will try the second approach later.
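
The two ways of preparing the data can be illustrated on a toy string. This is a sketch with a made-up `text` and a small `seq_length`; the shifted form mirrors the `data[p + 1:p + seq_length + 1]` slice in `getBatch`.

```python
text = "hello world"   # toy data, not the Shakespeare corpus
seq_length = 5

# Seq-to-one: a window of characters predicts the single next character
x_one = text[0:seq_length]
y_one = text[seq_length]

# Seq-to-seq: the target is the same window shifted by one step
x_seq = text[0:seq_length]
y_seq = text[1:seq_length + 1]

print(repr(x_one), "->", repr(y_one))
print(repr(x_seq), "->", repr(y_seq))
```

The last character of the shifted target is the same character the seq-to-one setup predicts, which is why generation can keep using only the final position.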

The new idea is this.

I always use the last time step as the final output of the RNN, that is, as the predicted character for the preceding input. We could go further and add attention at the final step, so that the model weighs the previous tokens differently when deciding what to predict. Let's do that.
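
A rough sketch of what attention at the final step could look like, assuming the hidden states of all time steps are kept. It uses plain scaled dot-product attention with the last hidden state as the query and no learned projections; the sizes and the random tensors are placeholders, and this is not necessarily the design the next version adopts.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)

B, T, H = 2, 25, 100   # batch, time steps, hidden size (illustrative)

# Placeholder hidden states, one per time step
hidden_states = torch.randn(B, T, H)

# Query: the last hidden state; keys and values: all hidden states
query = hidden_states[:, -1:, :]                                 # (B, 1, H)

# Similarity of the last step to every step, scaled by sqrt(H)
scores = query @ hidden_states.transpose(1, 2) / math.sqrt(H)    # (B, 1, T)

# Attention weights over the T previous positions
weights = F.softmax(scores, dim=-1)                              # rows sum to 1

# Weighted sum of hidden states, used in place of the last state alone
context = (weights @ hidden_states).squeeze(1)                   # (B, H)

print(weights.shape, context.shape)
```

The `context` vector would then go through the output layer instead of the raw last hidden state.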

In this new version, I split the data into training data and validation data.
After some iterations, we can compare the training loss with the validation loss to see whether the model has overfitted.
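
As a small illustration of that comparison, the snippet below takes three entries copied from the training log above and computes the gap between validation and training loss. Treating a widening gap as a hint of overfitting is a common rule of thumb, not a strict test.

```python
# (step, train loss, val loss) copied from the training log above
log = [
    (0, 1.8895, 1.9674),
    (10000, 1.8498, 1.9463),
    (19000, 1.7987, 1.9529),
]

# Gap between validation and training loss at each logged step
gaps = [val_loss - train_loss for _, train_loss, val_loss in log]

for (step, _, _), gap in zip(log, gaps):
    print(f"step {step}: val - train = {gap:.4f}")
```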