# Learning RNNs

This notebook mirrors the content of Chapter 12 ("A Language Model from Scratch") of [*Deep Learning for Coders with fastai & PyTorch*](https://github.com/fastai/fastbook) by Jeremy Howard & Sylvain Gugger, as well as Jeremy's ["Practical Deep Learning"](https://course.fast.ai/) course.

To make sure I understand what is going on under the hood, I'll avoid the conveniences provided by fastai. That said, I am not implementing everything from scratch, and I consider anything in PyTorch fair game.

I'll also make it my own by trying to use different datasets and adding anything else that occurs to me that might make the models perform better. We'll see what actually works.

## Prep

In [None]:
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    project_path = "/content/drive/MyDrive/Projects/code/LearningDeepLearning"
    !pip install datasets
    !pip install transformers
else:
    project_path = "."

## Imports

In [None]:
import os
from pathlib import Path
from collections import defaultdict
from psutil import virtual_memory
from functools import partial
import time
import pandas as pd
import pickle

import torch
import torch.nn as nn
from torch.nn import functional as F
from torch import tensor
from torch.utils.data import DataLoader

from datasets import Dataset, DatasetDict, load_dataset, load_from_disk
from transformers import GPT2Tokenizer

# Data Prep: TinyStories

TinyStories is a dataset developed by Ronen Eldan and Yuanzhi Li and described in their paper ["TinyStories: How Small Can Language Models Be and Still Speak Coherent English?"](https://arxiv.org/abs/2305.07759). From the paper, TinyStories is "a synthetic dataset of short stories that only contain words that a typical 3 to 4-year-olds usually understand, generated by GPT-3.5 and GPT-4." The paper goes on to show that they can train "small" language models (<10 million parameters) that nevertheless "produce fluent and consistent stories with several paragraphs that are diverse and have almost perfect grammar, and demonstrate reasoning capabilities."

We're starting with very rudimentary models with much simpler architecture and even fewer parameters, but let's see how good we can make this.

In [None]:
token_limit = 64e4 # limiting amount of data used to speed things up, can relax later

## Load and inspect dataset

In [None]:
project_dir = Path(project_path)
data_dir = Path(project_dir/'data')
data_dir.mkdir(exist_ok=True)

In [None]:
train_fn = "TinyStoriesV2-GPT4-train.txt"
train_url = "https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-train.txt"
test_fn = "TinyStoriesV2-GPT4-valid.txt"
test_url = "https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-valid.txt"

In [None]:
# download each file only if it is not already present in data_dir
if not os.path.exists(data_dir/train_fn):
    os.system(f'wget {train_url} -O {data_dir/train_fn} --progress=dot:mega')
if not os.path.exists(data_dir/test_fn):
    os.system(f'wget {test_url} -O {data_dir/test_fn} --progress=dot:mega')

In [None]:
# note: if last doc doesn't end in endoftext then it is omitted
def tinystories_generator(file_path, skip_first=False):
    current_doc = ""
    is_first = True
    with open(file_path, "r") as f:
        for line in f:
            if line.startswith("<|endoftext|>"):
                if not skip_first or not is_first:
                    yield {"text": current_doc.strip() + "<|endoftext|>"}
                is_first = False
                current_doc = ""
            else:
                current_doc += line
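
To see what the generator does with the `<|endoftext|>` separators, here is a minimal sketch that applies the same line-by-line logic to a few in-memory lines instead of the real file. The helper name `split_stories` and the sample text are made up for illustration; only the splitting logic is taken from the cell above.

```python
EOT = "<|endoftext|>"

def split_stories(lines, skip_first=False):
    """Same splitting logic as tinystories_generator, over any iterable of lines."""
    current_doc = ""
    is_first = True
    for line in lines:
        if line.startswith(EOT):
            # a separator line closes the current story
            if not skip_first or not is_first:
                yield current_doc.strip() + EOT
            is_first = False
            current_doc = ""
        else:
            current_doc += line

# hypothetical sample: two terminated stories and one unterminated story
sample_lines = (
    "Story one, line one.\n"
    "Story one, line two.\n"
    "<|endoftext|>\n"
    "Story two.\n"
    "<|endoftext|>\n"
    "Story three has no terminator.\n"
).splitlines(keepends=True)

print(list(split_stories(sample_lines)))
print(list(split_stories(sample_lines, skip_first=True)))
```

The unterminated third story never appears in the output, which is the behavior the comment at the top of the cell warns about, and `skip_first=True` drops whatever precedes the first separator.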

In [None]:
train_ds = Dataset.from_generator(partial(tinystories_generator, file_path=data_dir/train_fn))
test_ds = Dataset.from_generator(partial(tinystories_generator, file_path=data_dir/test_fn, skip_first=True))

ds = DatasetDict({
    'train': train_ds,
    'test': test_ds
})

In [None]:
ds

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 2717699
    })
    test: Dataset({
        features: ['text'],
        num_rows: 27629
    })
})

In [None]:
ds['train'][0:2]['text']

["Once upon a time there was a little boy named Ben. Ben loved to explore the world around him. He saw many amazing things, like beautiful vases that were on display in a store. One day, Ben was walking through the store when he came across a very special vase. When Ben saw it he was amazed!  \nHe said, “Wow, that is a really amazing vase! Can I buy it?” \nThe shopkeeper smiled and said, “Of course you can. You can take it home and show all your friends how amazing it is!”\nSo Ben took the vase home and he was so proud of it! He called his friends over and showed them the amazing vase. All his friends thought the vase was beautiful and couldn't believe how lucky Ben was. \nAnd that's how Ben found an amazing vase in the store!<|endoftext|>",
 'Once upon a time, there was a reliable otter named Ollie. He lived in a river with his family. They all loved to play and swim together.\nOne day, Ollie\'s mom said, "Ollie, hurry and get some fish for dinner!" Ollie swam fast to catch fish. He s

In [None]:
ds['test'][0:2]

{'text': ['Once upon a time, in a warm and sunny place, there was a big pit. A little boy named Tom liked to play near the pit. One day, Tom lost his red ball. He was very sad.\nTom asked his friend, Sam, to help him search for the ball. They looked high and low, but they could not find the ball. Tom said, "I think my ball fell into the pit."\nSam and Tom went close to the pit. They were scared, but they wanted to find the red ball. They looked into the pit, but it was too dark to see. Tom said, "We must go in and search for my ball."\nThey went into the pit to search. It was dark and scary. They could not find the ball. They tried to get out, but the pit was too deep. Tom and Sam were stuck in the pit. They called for help, but no one could hear them. They were sad and scared, and they never got out of the pit.<|endoftext|>',
  'Tom and Lily were playing with their toys in the living room. They liked to build towers and bridges with their blocks and cars. Tom was very proud of his tal

In [None]:
ndocs_small = int(token_limit//50)
ds_small = DatasetDict({
    'train': train_ds.select(range(ndocs_small)),
    'test': test_ds.select(range(ndocs_small))
})
ds_small

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 12800
    })
    test: Dataset({
        features: ['text'],
        num_rows: 12800
    })
})

## Tokenize and inspect tokens

In [None]:
# Load the pretrained GPT-Neo tokenizer as is used in the TinyStories paper
tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")

In [None]:
tokenizer.vocab_size

50257

In [None]:
tds_filename = 'tinystories_tokenized_gpt-neo-1.3B'
if os.path.exists(data_dir/tds_filename):
    tds = load_from_disk(data_dir/tds_filename)
else:
    tds = ds.map(lambda e: tokenizer(e['text']), batched=True, remove_columns="text")
    tds.save_to_disk(data_dir/tds_filename)

In [None]:
tds

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 2717699
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 27629
    })
})

In [None]:
tf_fn = 'tokenfreq_tinystories_tokenized_gpt-neo-1.3B.pkl'
if os.path.exists(data_dir/tf_fn):
    tf = pd.read_pickle(data_dir/tf_fn)
else:
    # Initialize a defaultdict to keep track of the frequency of each token
    token_freq = defaultdict(int)
    # Define a function to update the token frequencies
    def update_freqs(batch):
        for token_list in batch['input_ids']:
            for token_id in token_list:
                token_freq[token_id] += 1
        return {}
    # Apply the function to the dataset
    tds['train'].map(update_freqs, batched=True, batch_size=10000)
    # Put it in a dataframe and compute cdf
    tf = pd.DataFrame(dict(token_freq).items(), columns=('token', 'freq'))
    tf['token'] = tf['token'].astype('category')
    tf['token_str'] = tf['token'].apply(tokenizer.decode)
    tf.sort_values('freq', ascending=False, inplace=True)
    tf.reset_index(inplace=True, drop=True)
    tf['cdf'] = tf['freq'].cumsum() / tf['freq'].sum()
    # Save to disk
    tf.to_pickle(data_dir/tf_fn)
tf

| | token | freq | token_str | cdf |
|---|---|---|---|---|
| 0 | 13 | 41825583 | . | 0.077245 |
| 1 | 11 | 23298942 | , | 0.120274 |
| 2 | 262 | 20828658 | the | 0.158741 |
| 3 | 290 | 19476061 | and | 0.194709 |
| 4 | 257 | 15074432 | a | 0.222549 |
| ... | ... | ... | ... | ... |
| 27989 | 35172 | 1 | Elves | 1.000000 |
| 27990 | 16976 | 1 | specialized | 1.000000 |
| 27991 | 21942 | 1 | injustice | 1.000000 |
| 27992 | 38868 | 1 | adier | 1.000000 |


In [None]:
# ~28k distinct tokens appear in the training set vs. ~50k in the tokenizer vocab
len(tf), tokenizer.vocab_size

(27994, 50257)

In [None]:
# top ten tokens by frequency, accounting for 30% of all tokens
tf.head(10)

| | token | freq | token_str | cdf |
|---|---|---|---|---|
| 0 | 13 | 41825583 | . | 0.077245 |
| 1 | 11 | 23298942 | , | 0.120274 |
| 2 | 262 | 20828658 | the | 0.158741 |
| 3 | 290 | 19476061 | and | 0.194709 |
| 4 | 257 | 15074432 | a | 0.222549 |
| 5 | 284 | 14906882 | to | 0.25008 |
| 6 | 373 | 10594487 | was | 0.269646 |
| 7 | 198 | 9119907 | \n | 0.286489 |
| 8 | 1119 | 5226509 | They | 0.296141 |
| 9 | 340 | 5141200 | it | 0.305636 |


In [None]:
# the top 45 tokens (rows 0 through 44) account for 50% of all tokens
tf.iloc[(0.5 - tf['cdf']).abs().argmin()]

token           20037
freq          1995606
token_str        Lily
cdf          0.501679
Name: 44, dtype: object

In [None]:
# the top ~800 account for 90% of all tokens
tf.iloc[(0.9 - tf['cdf']).abs().argmin()]

token           1382
freq           58844
token_str      build
cdf          0.89995
Name: 793, dtype: object

In [None]:
# the top ~3700 account for 99% of all tokens
tf.iloc[(0.99 - tf['cdf']).abs().argmin()]

token           10291
freq             3663
token_str     wanting
cdf          0.990001
Name: 3709, dtype: object

In [None]:
# the top 10k tokens account for 99.93% of tokens, tokens past this appear <120 times in the training set
tf.iloc[10000]

token             3033
freq               119
token_str     features
cdf           0.999304
Name: 10000, dtype: object

In [None]:
# 3k tokens only appear once, 10k appear 10 or fewer times
(tf['freq'] == 1).sum(), (tf['freq'] <= 10).sum()

(3175, 10009)
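
The coverage figures above all come from a cumulative sum over frequency-sorted counts. As a rough illustration of that idea, here is a standard-library sketch on a toy token list. The function name `tokens_needed_for_coverage` and the toy counts are assumptions for the example, and it returns the first rank that reaches the target, rather than the row closest to the target as the `argmin` cells above do.

```python
from collections import Counter
from itertools import accumulate

def tokens_needed_for_coverage(token_ids, target):
    """How many of the most frequent tokens are needed to cover
    at least `target` (a fraction in (0, 1]) of all token occurrences."""
    freqs = sorted(Counter(token_ids).values(), reverse=True)
    total = sum(freqs)
    # accumulate gives the running (cumulative) count over the sorted frequencies
    for rank, running in enumerate(accumulate(freqs), start=1):
        if running / total >= target:
            return rank
    return len(freqs)

# toy data: five distinct token ids with counts 50, 25, 15, 7 and 3
toy_ids = [13] * 50 + [11] * 25 + [262] * 15 + [290] * 7 + [257] * 3

for target in (0.6, 0.95, 0.99):
    print(f"{target:.0%} coverage needs {tokens_needed_for_coverage(toy_ids, target)} tokens")
```

On the real data, the same heavy skew is what lets a few thousand tokens cover 99% of the training text.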

# Training

In [None]:
bs = 64
lr = 3e-3
epochs = 1
seq_len = 3

In [None]:
vocab_size = tokenizer.vocab_size
vocab_size

50257

In [None]:
if torch.backends.mps.is_available():
    def_device = device_name = 'mps'
elif torch.cuda.is_available():
    def_device = 'cuda'
    device_name = torch.cuda.get_device_name(0)
else:
    def_device = device_name = 'cpu'
def_device, device_name

('cuda', 'NVIDIA A100-SXM4-40GB')

In [None]:
virtual_memory().total / 1e9

89.636769792

In [None]:
def get_sequences_from_doc(doc, seq_len=3):
    return [(doc[i:i+seq_len], doc[i+seq_len]) for i in range(0,len(doc)-seq_len-1,seq_len)]
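
A quick way to check what this function produces is to run the same comprehension on a toy document. The sketch below uses a plain list instead of a tensor; the helper name `windows` and its `notebook_bound` flag are hypothetical. It also shows that the `len(doc)-seq_len-1` bound leaves out the final full window for some document lengths, whereas `len(doc)-seq_len` would keep it.

```python
def windows(doc, seq_len=3, notebook_bound=True):
    """Non-overlapping (context, next_token) pairs.

    notebook_bound=True uses the same range bound as get_sequences_from_doc;
    notebook_bound=False uses len(doc) - seq_len, which keeps every full window.
    """
    stop = len(doc) - seq_len - 1 if notebook_bound else len(doc) - seq_len
    return [(doc[i:i + seq_len], doc[i + seq_len]) for i in range(0, stop, seq_len)]

doc = list(range(10))  # toy "token ids" 0..9

print(windows(doc))
# → [([0, 1, 2], 3), ([3, 4, 5], 6)]
print(windows(doc, notebook_bound=False))
# → [([0, 1, 2], 3), ([3, 4, 5], 6), ([6, 7, 8], 9)]
```

The difference is at most one example per story, so it barely matters here, but it is worth knowing when counting examples by hand.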

In [None]:
def get_sequences_from_ds(ds, token_limit):
    # collects (context, target) pairs doc by doc until token_limit pairs are reached
    # note: relies on the global seq_len defined above
    seqs = []
    for i, doc in enumerate(ds):
        seqs.extend(get_sequences_from_doc(tensor(doc['input_ids']), seq_len))
        if i % 1000 == 0:
            print(f'{len(seqs)}/{token_limit} done')
        if len(seqs) >= token_limit:
            break
    return seqs

In [None]:
print('train')
train_seqs = get_sequences_from_ds(tds['train'], token_limit)
print('test')
test_seqs = get_sequences_from_ds(tds['test'], token_limit)

train
62/640000.0 done
65885/640000.0 done
132398/640000.0 done
198478/640000.0 done
265368/640000.0 done
331051/640000.0 done
396882/640000.0 done
462737/640000.0 done
529767/640000.0 done
596213/640000.0 done
test
69/640000.0 done
64681/640000.0 done
129571/640000.0 done
194245/640000.0 done
259498/640000.0 done
324271/640000.0 done
389586/640000.0 done
453914/640000.0 done
519785/640000.0 done
585284/640000.0 done


In [None]:
def group_chunks(ds, bs):
    # reorder so that row j of each successive batch continues the j-th of bs contiguous chunks
    m = len(ds)//bs
    new_ds = []
    for i in range(m): new_ds += ds[i:m*bs:m]
    return new_ds
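
Because `LMModel3` carries its hidden state from one batch to the next, row `j` of batch `k+1` needs to continue the text from row `j` of batch `k`. The toy run below, a sketch on a list of 12 integers standing in for consecutive sequences, shows that `group_chunks` produces exactly that ordering when the result is batched without shuffling.

```python
def group_chunks(ds, bs):
    # same logic as the cell above
    m = len(ds) // bs
    new_ds = []
    for i in range(m):
        new_ds += ds[i:m * bs:m]
    return new_ds

stream = list(range(12))  # 12 consecutive "sequences"
bs = 3

reordered = group_chunks(stream, bs)
# emulate DataLoader(batch_size=bs, shuffle=False)
batches = [reordered[i:i + bs] for i in range(0, len(reordered), bs)]
for b in batches:
    print(b)
# → [0, 4, 8]
# → [1, 5, 9]
# → [2, 6, 10]
# → [3, 7, 11]

# reading one row position across batches recovers a contiguous chunk
rows = [[b[j] for b in batches] for j in range(bs)]
print(rows)
# → [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
```

This is also why the data loaders use `shuffle=False` and `drop_last=True`: shuffling or a ragged final batch would break the row-to-row continuity.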

In [None]:
train_dl = DataLoader(group_chunks(train_seqs, bs), batch_size=bs, shuffle=False, drop_last=True)
test_dl = DataLoader(group_chunks(test_seqs, bs), batch_size=bs, shuffle=False, drop_last=True)

In [None]:
len(train_dl), len(test_dl)

(10000, 10000)

In [None]:
class LMModel3(nn.Module):
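    def __init__(self, vocab_sz, n_hidden):
        super().__init__()
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  # input token -> hidden
        self.h_h = nn.Linear(n_hidden, n_hidden)     # hidden -> hidden
        self.h_o = nn.Linear(n_hidden,vocab_sz)      # hidden -> output logits
        self.h = 0                                   # hidden state carried across batches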

    def forward(self, x):
        for i in range(3):
            self.h = self.h + self.i_h(x[:,i])
            self.h = F.relu(self.h_h(self.h))
        out = self.h_o(self.h)
        self.h = self.h.detach()
        return out

    def reset(self): self.h = 0

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions based on 3 most recent tokens
            logits = self(idx[:,-3:]) # (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

In [None]:
model = LMModel3(vocab_size, 64).to(def_device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr = lr, momentum = 0.9)
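
A useful sanity check for the loss values that follow: `nn.CrossEntropyLoss` uses the natural log and averages over the batch by default, so a model that spreads its probability evenly over the whole vocabulary scores `ln(vocab_size)`. The sketch below just does that arithmetic with the standard library; the `perplexity` helper is an illustrative name, not something used elsewhere in this notebook.

```python
import math

vocab_size = 50257  # tokenizer.vocab_size from above

# cross-entropy of a uniform prediction over the vocabulary
uniform_loss = math.log(vocab_size)
print(f"uniform-guess loss: {uniform_loss:.3f}")

def perplexity(cross_entropy):
    """Perplexity is the exponential of the (natural-log) cross-entropy."""
    return math.exp(cross_entropy)

# a loss of ln(V) corresponds to being "confused" among all V tokens
print(f"uniform-guess perplexity: {perplexity(uniform_loss):.0f}")
```

A freshly initialized model should start out in the neighborhood of this value, and the first running average printed during training (10.526) is consistent with that.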

In [None]:
model

LMModel3(
  (i_h): Embedding(50257, 64)
  (h_h): Linear(in_features=64, out_features=64, bias=True)
  (h_o): Linear(in_features=64, out_features=50257, bias=True)
)

In [None]:
def get_params(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
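
The count that `get_params` returns can be checked by hand from the layer shapes printed above: an embedding has `vocab_sz * n_hidden` weights, and a linear layer has `in * out` weights plus `out` biases. The helper names below are made up for this sketch.

```python
def embedding_params(num_embeddings, dim):
    # one dim-sized vector per vocabulary entry, no bias
    return num_embeddings * dim

def linear_params(in_features, out_features):
    # weight matrix plus one bias per output feature
    return in_features * out_features + out_features

vocab_sz, n_hidden = 50257, 64

i_h = embedding_params(vocab_sz, n_hidden)  # 3,216,448
h_h = linear_params(n_hidden, n_hidden)     # 4,160
h_o = linear_params(n_hidden, vocab_sz)     # 3,266,705

total = i_h + h_h + h_o
print(total)
# → 6487313
```

Almost all of the parameters sit in the embedding and output layers, so with a 50k vocabulary the model size is dominated by `vocab_sz`, not by the recurrent part.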

In [None]:
%%time
start_time = time.time()
for epoch in range(epochs):
    running_loss = 0.0
    for i, data in enumerate(train_dl, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data
        # zero the parameter gradients
        optimizer.zero_grad()
        # forward (calc predictions & loss) + backward (calc gradients) +
        # optimize (step the weights)
        outputs = model(inputs.to(def_device))
        loss = criterion(outputs, labels.to(def_device))
        loss.backward()
        optimizer.step()
        # tally loss
        running_loss += loss.item()
        if i % 200 == 199:    # print every 200 mini-batches
            # i is zero-based, so i + 1 batches have been seen so far
            print(f'\t[{epoch + 1}, {i + 1:5d}] loss: {running_loss / (i + 1):.3f}')
    # reset hidden state before validation
    model.reset()
    # calc validation loss and accuracy
    running_valid_loss = 0.0
    correct = 0
    total = 0
    with torch.no_grad():
        for valid_i, valid_data in enumerate(test_dl, 0):
            inputs, labels = valid_data
            outputs = model(inputs.to(def_device))
            valid_loss = criterion(outputs, labels.to(def_device))
            running_valid_loss += valid_loss.item()
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels.to(def_device)).sum().item()
    # print stats
    print(f'[{epoch + 1}, {i + 1:5d}] train loss: {running_loss / (i + 1):.3f} valid loss: {running_valid_loss / (valid_i + 1):.3f} valid i: {valid_i} accuracy: {correct / total:.3f}')
    # reset hidden state before next epoch
    model.reset()
print('Finished Training')
end_time = time.time()
print(f'{end_time - start_time} seconds')

	[1,   200] loss: 10.526
	[1,   400] loss: 9.558
	[1,   600] loss: 8.828
	[1,   800] loss: 8.240
	[1,  1000] loss: 7.852
	[1,  1200] loss: 7.569
	[1,  1400] loss: 7.329
	[1,  1600] loss: 7.133
	[1,  1800] loss: 6.979
	[1,  2000] loss: 6.843
	[1,  2200] loss: 6.718
	[1,  2400] loss: 6.611
	[1,  2600] loss: 6.512
	[1,  2800] loss: 6.430
	[1,  3000] loss: 6.352
	[1,  3200] loss: 6.283
	[1,  3400] loss: 6.223
	[1,  3600] loss: 6.165
	[1,  3800] loss: 6.109
	[1,  4000] loss: 6.063
	[1,  4200] loss: 6.012
	[1,  4400] loss: 5.971
	[1,  4600] loss: 5.929
	[1,  4800] loss: 5.894
	[1,  5000] loss: 5.856
	[1,  5200] loss: 5.820
	[1,  5400] loss: 5.788
	[1,  5600] loss: 5.757
	[1,  5800] loss: 5.728
	[1,  6000] loss: 5.700
	[1,  6200] loss: 5.671
	[1,  6400] loss: 5.645
	[1,  6600] loss: 5.621
	[1,  6800] loss: 5.599
	[1,  7000] loss: 5.578
	[1,  7200] loss: 5.555
	[1,  7400] loss: 5.535
	[1,  7600] loss: 5.513
	[1,  7800] loss: 5.493
	[1,  8000] loss: 5.474
	[1,  8200] loss: 5.457
	[1,  8400] los

In [None]:
# do some generation
prompt = tensor(tds['train'][0]['input_ids'][:3])
gen_txt = model.generate(idx = prompt.view(1,3).to(def_device), max_new_tokens=20)[0].tolist()
print(tokenizer.decode(gen_txt))

## Log results

In [None]:
log_dir = Path(project_dir/'logs')
log_dir.mkdir(exist_ok=True)

In [None]:
log_path = log_dir/'log.pkl'
if os.path.exists(log_path):
    log = pd.read_pickle(log_path)
else:
    log = pd.DataFrame(columns=['model', 'params', 'device_name', 'vocab_size', 'train_tokens', 'test_tokens', 'batch_size',
                                'epochs', 'train_time', 'train_loss', 'test_loss', 'accuracy', 'sample'])

In [None]:
# append this run as a new row; losses are averaged over the number of batches (index + 1)
new_row = pd.DataFrame({'model': model.__class__.__name__, 'params': get_params(model), 'device_name': device_name,
                        'vocab_size': tokenizer.vocab_size, 'train_tokens': len(train_dl)*bs, 'test_tokens': len(test_dl)*bs,
                        'batch_size': bs, 'epochs': epochs, 'train_time': end_time-start_time,
                        'train_loss': running_loss / (i + 1), 'test_loss': running_valid_loss / (valid_i + 1),
                        'accuracy': correct / total, 'sample': tokenizer.decode(gen_txt)}, index=[len(log)])
log = pd.concat([log, new_row])

In [None]:
log

| | model | params | device_name | vocab_size | train_tokens | test_tokens | batch_size | epochs | train_time | train_loss | test_loss | accuracy | sample |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LMModel3 | 6487313 | mps | 50257 | 640000 | 640000 | 64 | 1 | 296.992819 | 5.323405 | 4.635509 | 0.23078 | Once upon a time again, wings into should. Whe... |
| 1 | LMModel3 | 6487313 | Tesla T4 | 50257 | 640000 | 640000 | 64 | 1 | 53.839692 | 5.314124 | 4.639222 | 0.232855 | Once upon a timeubuntu Buddy the best, Amy set... |
| 2 | LMModel3 | 6487313 | Tesla V100-SXM2-16GB | 50257 | 640000 | 640000 | 64 | 1 | 39.803033 | 5.283554 | 4.619744 | 0.228327 | Once upon a time house Nato the small by and s... |
| 3 | LMModel3 | 6487313 | NVIDIA A100-SXM4-40GB | 50257 | 640000 | 640000 | 64 | 1 | 33.560122 | 5.319119 | 4.627001 | 0.233545 | Once upon a time wheni when Mia somethingMobil... |


In [None]:
pd.to_pickle(log, log_path)

## Notes

The first, very simple model (LMModel3), trained for one epoch on only 640k examples (the `token_limit` of 64e4), took about 5 minutes on my MacBook (MPS) and less than a minute on Colab with a dedicated GPU. The A100 was only about 1.6x as fast as the T4 (roughly 34 s vs. 54 s), which suggests we are nowhere near the point where we need the power of the A100.
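
For reference, the relative timings can be computed directly from the `train_time` column of the log. This is a small sketch with the four logged values copied into a dictionary by hand; the variable names are just for illustration.

```python
# train_time in seconds, copied from the log table above
train_time = {
    "mps": 296.992819,
    "Tesla T4": 53.839692,
    "Tesla V100-SXM2-16GB": 39.803033,
    "NVIDIA A100-SXM4-40GB": 33.560122,
}

fastest = min(train_time.values())
# how many times longer each device took compared with the fastest run
slowdown = {device: seconds / fastest for device, seconds in train_time.items()}

for device, ratio in slowdown.items():
    print(f"{device:>22}: {train_time[device]:6.1f} s  ({ratio:.2f}x the fastest)")
```

These are single runs of a tiny model, so the ratios are only a rough guide; they mostly show that this workload is too small to keep the larger GPUs busy.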