### Note about the dataset
Start by running the data preprocessing notebook in the GitHub repo (`data/preprocessing/get_data.ipynb`), or simply clone the repo, to obtain a copy of `limericks.json`. This file is then used to fine-tune the GPT-2 model.

In [1]:
# Start by installing required libraries (mainly Transformers)
# !pip install transformers==4.17.0
# !pip install scikit-learn


In [2]:
# Only needed when running in colab
# from google.colab import drive
# drive.mount("/content/drive/", force_remount=True)

Mounted at /content/drive/


In [3]:
import glob
import json
import math
import numpy as np
import os
import random
import shutil
import string
import torch
import torch.optim as optim
import tqdm.notebook as tqdm

from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader
from transformers import AutoModelForCausalLM
from transformers import DataCollatorForLanguageModeling
from transformers import GPT2LMHeadModel
from transformers import GPT2Model
from transformers import GPT2Tokenizer
from transformers import AdamW, get_scheduler

In [4]:
# Change these paths if needed
data_dir = "/content/drive/MyDrive/11-785-final/data/"
ckpt_dir = "/content/drive/MyDrive/11-785-final/ckpt/"

os.makedirs(ckpt_dir, exist_ok=True)

In [5]:
# Use a context manager so the file handle is closed after loading
with open(f"{data_dir}/limericks.json") as f:
    data = json.load(f)
limericks = []

for _, limerick in data['limericks'].items():
    lines = limerick['lines']
    flag = True

    # Remove the final punctuation of each line
    # (we'll use a special separator instead)
    for idx, line in enumerate(lines):
        if len(line) == 0:
            flag = False
            break
        if line[-1] in string.punctuation:
            lines[idx] = line[:-1]
    
    if flag:
        limericks.append(lines)

In [6]:
print(f"# of limericks before clean-up: {len(data['limericks'])}")
print(f"# of limericks after clean-up: {len(limericks)}")

# of limericks before clean-up: 72432
# of limericks after clean-up: 72431
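
The clean-up rule used above (drop a limerick if any of its lines is empty, otherwise strip one trailing punctuation character per line) can be tried on toy data. This is only a sketch: `clean_limerick` and the sample lines are hypothetical, and unlike the notebook's loop this version does not modify its input.

```python
import string


def clean_limerick(lines):
    """Return the lines without their final punctuation mark,
    or None if any line is empty."""
    cleaned = []
    for line in lines:
        if len(line) == 0:
            return None  # an empty line disqualifies the whole limerick
        if line[-1] in string.punctuation:
            line = line[:-1]  # strip exactly one trailing punctuation character
        cleaned.append(line)
    return cleaned


toy = ["there once was a cat,", "who sat on a mat.", "it purred"]
print(clean_limerick(toy))
print(clean_limerick(["first line", ""]))
```

Only a single trailing character is removed, so a line ending in `...` would keep two of its dots.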


In [7]:
# We use a new special token <LINE> as the separator between lines.
# We also add a pad_token for padding; padded positions should be
# masked out (i.e. made ineffective) via attention_mask throughout training.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens({
    "sep_token": "<LINE>",
    "pad_token": "<PAD>"})
print(f"New sep_token: {tokenizer.sep_token} ({tokenizer.sep_token_id})")
print(f"New pad_token: {tokenizer.pad_token} ({tokenizer.pad_token_id})")

New sep_token: <LINE> (50257)
New pad_token: <PAD> (50258)
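
One caveat worth keeping in mind: `attention_mask` controls which positions the model attends to, but it does not by itself remove padded positions from the language-modeling loss. In Hugging Face causal language models, label positions set to `-100` are skipped by the cross-entropy loss. The following is a minimal pure-Python sketch of that masking step on toy token IDs; `mask_pad_labels` is a hypothetical helper and not part of this notebook.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_pad_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]


# Toy batch: 50257 stands for <LINE> and 50258 for <PAD> (the IDs printed above).
input_ids = [[259, 262, 50257, 50258, 50258], [1169, 661, 547, 1290, 50257]]
attention_mask = [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
print(mask_pad_labels(input_ids, attention_mask))
```

With tensors, the same effect is usually obtained by boolean indexing on the attention mask before passing `labels` to the model.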


In [8]:
# We can construct a training sample of limericks by merging the lines
# with the separator attached at the end of each line
def merge_lines(lines):
    # Named `text` to avoid shadowing the imported `string` module
    text = ' <LINE> '.join(lines) + ' <LINE>'
    return text

In [9]:
sample = random.sample(limericks, 1)[0]
# Use `text` rather than `string` so the imported `string` module
# is not overwritten at notebook scope
text = merge_lines(sample)
print(f"Lines with separator: {text}")
input_ids = tokenizer(text)['input_ids']
print(f"Tokens: {input_ids}")
decoded_string = tokenizer.decode(input_ids)
print(f"Decoding result: {decoded_string}")

Lines with separator: in the civilisation cycladic <LINE> the people were far from nomadic <LINE> for a thousand years plus <LINE> they remained, without fuss <LINE> in a style that was hardly sporadic <LINE>
Tokens: [259, 262, 45605, 11700, 23876, 50257, 1169, 661, 547, 1290, 422, 4515, 23876, 50257, 1640, 257, 7319, 812, 5556, 50257, 9930, 6150, 11, 1231, 34297, 50257, 259, 257, 3918, 326, 373, 8941, 48172, 50257]
Decoding result: in the civilisation cycladic <LINE> the people were far from nomadic <LINE> for a thousand years plus <LINE> they remained, without fuss <LINE> in a style that was hardly sporadic <LINE>


In [10]:
train_data, val_data = train_test_split(limericks, train_size=0.9)
print(f"# of training samples: {len(train_data)}")
print(f"# of validation samples: {len(val_data)}")

# of training samples: 65187
# of validation samples: 7244


In [11]:
class LimerickDataset(Dataset):
    def __init__(self, data):
        self.data = [merge_lines(limerick) for limerick in data]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

In [12]:
def reverse_line(input_ids):
    new_input_ids = np.zeros_like(input_ids)
    start = 0
    for end in range(1, len(input_ids)):
        if input_ids[end] == tokenizer.sep_token_id:
            new_input_ids[start: end] = input_ids[start: end][::-1]
            new_input_ids[end] = tokenizer.sep_token_id
            start = end + 1
    new_input_ids[start:] = input_ids[start:]
    return new_input_ids

def gen_collate_fn(tokenizer, reverse=False):
    def collate_fn(batch):
        if not reverse:
            batch = tokenizer(batch, padding="longest", return_tensors="pt")
        else:
            batch = tokenizer(batch, padding="longest", return_tensors="np")
            for i, input_ids in enumerate(batch['input_ids']):
                batch['input_ids'][i] = reverse_line(batch['input_ids'][i])
            batch['input_ids'] = torch.tensor(batch['input_ids'])
            batch['attention_mask'] = torch.tensor(batch['attention_mask'])
        batch['labels'] = torch.clone(batch['input_ids']).detach()

        for key, value in batch.items():
            batch[key] = value.cuda()
        return batch

    return collate_fn
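
With `reverse=True`, the tokens inside each line are reversed while the `<LINE>` separators stay in place, so each line's final tokens (where the rhyme sits) come first in the sequence. The following pure-Python sketch mirrors the idea of `reverse_line` on a plain list; `reverse_segments` and the toy IDs are hypothetical, with 99 standing in for the separator ID and 0 for padding.

```python
def reverse_segments(ids, sep_id):
    """Reverse the tokens inside each separator-terminated segment.

    Tokens after the last separator (e.g. padding) are left untouched.
    """
    out = list(ids)
    start = 0
    for end, tok in enumerate(ids):
        if tok == sep_id:
            out[start:end] = ids[start:end][::-1]
            start = end + 1
    return out


SEP, PAD = 99, 0
ids = [1, 2, 3, SEP, 4, 5, SEP, PAD, PAD]
flipped = reverse_segments(ids, SEP)
print(flipped)                         # [3, 2, 1, 99, 5, 4, 99, 0, 0]
print(reverse_segments(flipped, SEP))  # [1, 2, 3, 99, 4, 5, 99, 0, 0]
```

Applying the function twice restores the original order, which is why the same routine can be used both to prepare inputs and to undo the reversal on generated output.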

In [13]:
# optimizer
learning_rate = 5e-5
weight_decay = 0.0
# scheduler
scheduler_type = "linear"
num_warmup_steps = 0
# training loop
epochs = 20
batch_size = 32
gradient_accumulation_steps = 1
# data
reverse = True
# ckpt
exp_name = "reverse-gpt2"
debug = False

In [14]:
exp_dir = f"{ckpt_dir}/{exp_name}"
os.makedirs(exp_dir, exist_ok=True)
log_file = f"{exp_dir}/log.txt"

In [None]:
if not debug:
    train_dataset = LimerickDataset(train_data)
    val_dataset = LimerickDataset(val_data)
else:
    train_dataset = LimerickDataset(train_data[:batch_size * 8])
    val_dataset = LimerickDataset(val_data[:batch_size * 2])

train_loader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    drop_last=True,
    shuffle=True,
    collate_fn=gen_collate_fn(tokenizer, reverse=reverse))
val_loader = DataLoader(
    val_dataset,
    batch_size=batch_size,
    drop_last=False,
    shuffle=False,
    collate_fn=gen_collate_fn(tokenizer, reverse=reverse))

In [20]:
# initialize the model, also resize the embeddings for new tokens
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))
model = model.cuda()


In [None]:
# Reference: https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_clm_no_trainer.py
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
    {
        "params": [
            p for n, p in model.named_parameters()
            if not any(nd in n for nd in no_decay)],
        "weight_decay": weight_decay,
    },
    {
        "params": [
            p for n, p in model.named_parameters()
            if any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
]
optimizer = optim.AdamW(optimizer_grouped_parameters, lr=learning_rate)

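# Number of optimizer steps per epoch: use true division before rounding up
# (floor division would make the ceil a no-op) and keep the result an int
T_epoch = math.ceil(len(train_loader) / gradient_accumulation_steps)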
scheduler = get_scheduler(
    name=scheduler_type,
    optimizer=optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=epochs * T_epoch)
scaler = torch.cuda.amp.GradScaler()
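
To make the schedule concrete: with 65187 training samples, `batch_size = 32`, and `drop_last=True`, one epoch has 65187 // 32 = 2037 optimizer steps (for `gradient_accumulation_steps = 1`), so 20 epochs give 40740 steps. The sketch below computes the learning-rate multiplier of a linear schedule with optional warmup in plain Python. `linear_lr_factor` is a hypothetical helper written to mirror the usual linear-warmup, linear-decay rule; it is not the library's code.

```python
def linear_lr_factor(step, num_training_steps, num_warmup_steps=0):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)  # linear warmup from 0 to 1
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


steps_per_epoch = 65187 // 32  # drop_last=True discards the partial batch
total_steps = 20 * steps_per_epoch
base_lr = 5e-5

for step in (0, total_steps // 2, total_steps):
    print(step, base_lr * linear_lr_factor(step, total_steps))
```

With `num_warmup_steps = 0`, the learning rate starts at its full value and decays linearly to zero at the last step.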

In [None]:
files = glob.glob(f"{exp_dir}/epoch-*.ckpt")
if len(files) != 0:
    files = sorted(files, key=lambda x: int(os.path.basename(x)[6:-5]))
    states = torch.load(files[-1])
    
    model.load_state_dict(states['model_state_dict'])
    optimizer.load_state_dict(states['optimizer_state_dict'])
    scheduler.load_state_dict(states['scheduler_state_dict'])
    scaler.load_state_dict(states['scaler_state_dict'])
    start_epoch = states['epoch'] + 1
    # Restore the best score so far, not the last epoch's score
    best_perplexity = states['best_perplexity']
else:
    start_epoch = 0
    best_perplexity = 1e30

if start_epoch == 0:
    print("Start training from scratch")
else:
    print(f"Resume training from epoch {start_epoch + 1}")

Start training from scratch
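
The numeric sort key matters when picking the latest checkpoint, because a plain string sort would place `epoch-10.ckpt` before `epoch-2.ckpt`. A small sketch with hypothetical file names (`epoch_of` is an illustrative helper):

```python
import os


def epoch_of(path):
    """Extract N from a path whose file name is 'epoch-N.ckpt'."""
    name = os.path.basename(path)
    return int(name[len("epoch-"):-len(".ckpt")])


files = ["ckpt/epoch-10.ckpt", "ckpt/epoch-2.ckpt", "ckpt/epoch-1.ckpt"]
print(sorted(files))                # lexicographic: epoch-10 sorts before epoch-2
print(sorted(files, key=epoch_of))  # numeric: epoch-1, epoch-2, epoch-10
```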


In [None]:
# Reference: https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_clm_no_trainer.py
def train_epoch(model, train_loader, optimizer, scheduler, scaler):
    model.train()
    optimizer.zero_grad()

    bar = tqdm.tqdm(train_loader, leave=False)
    loss_total = 0.

    for step, batch in enumerate(bar):
        outputs = model(**batch)
        loss = outputs.loss
        loss_total += loss.item()
        loss = loss / gradient_accumulation_steps
        scaler.scale(loss).backward()
  
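        # Step after every `gradient_accumulation_steps` micro-batches,
        # and once more at the end of the epoch for any remainder
        if (
                (step + 1) % gradient_accumulation_steps == 0 or
                step == len(train_loader) - 1
        ):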
            scaler.step(optimizer)
            scaler.update()
            scheduler.step()
            optimizer.zero_grad()

        bar.set_postfix({"Loss": f"{loss_total / (step + 1):.4f}"})

    return loss_total / len(train_loader)

In [None]:
# Reference: https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_clm_no_trainer.py
def validation(model, val_loader):
    model.eval()

    bar = tqdm.tqdm(val_loader, leave=False)
    losses = []

    for step, batch in enumerate(bar):
        with torch.no_grad():
            outputs = model(**batch)

        batch_size = batch['input_ids'].shape[0]
        loss = outputs.loss.item()
        losses.extend([loss for _ in range(batch_size)])

    # Compute perplexity once, after all batches have been processed
    try:
        perplexity = math.exp(np.mean(losses))
    except OverflowError:
        perplexity = float('inf')

    return perplexity
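
Perplexity here is the exponential of the mean validation loss, with each batch's loss counted once per sample so that a smaller final batch carries less weight. A minimal sketch with made-up loss values; `perplexity_from_batches` is a hypothetical helper:

```python
import math


def perplexity_from_batches(batch_losses, batch_sizes):
    """exp of the sample-weighted mean of per-batch losses."""
    total = sum(loss * n for loss, n in zip(batch_losses, batch_sizes))
    count = sum(batch_sizes)
    try:
        return math.exp(total / count)
    except OverflowError:
        return float("inf")  # loss too large to exponentiate


# Two full batches of 32 and a final partial batch of 12 samples
print(perplexity_from_batches([3.2, 3.0, 3.4], [32, 32, 12]))
```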

In [None]:
# Reference: https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_clm_no_trainer.py
epoch_bar = tqdm.trange(start_epoch, epochs, leave=False)

for epoch in epoch_bar:
    loss = train_epoch(model, train_loader, optimizer, scheduler, scaler)
    perplexity = validation(model, val_loader)

    log = f"Epoch {epoch+1} Loss: {loss:.4f} Perplexity {perplexity:.4f}"
    epoch_bar.write(log)
    with open(log_file, 'a') as file:
        file.write(f"{log}\n")

    flag = False
    if perplexity < best_perplexity:
        best_perplexity = perplexity
        flag = True

    epoch_bar.write(f"Save model at epoch {epoch+1}")
    torch.save({
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'scheduler_state_dict': 
            scheduler.state_dict()
            if scheduler is not None else None,
        'scaler_state_dict': scaler.state_dict(),
        'epoch': epoch,
        'perplexity': perplexity,
        'best_perplexity': best_perplexity
    }, f"{exp_dir}/epoch-{epoch+1}.ckpt")

    if flag:
        # best_perplexity was already updated above
        print(f"Save best model at epoch {epoch+1}")
        shutil.copyfile(
            f"{exp_dir}/epoch-{epoch+1}.ckpt",
            f"{exp_dir}/best-model.ckpt")

In [57]:
tmp_dir = "/content/test"

states = torch.load(f"{exp_dir}/best-model.ckpt")
model.load_state_dict(states['model_state_dict'])

model.save_pretrained(tmp_dir)
new_model = AutoModelForCausalLM.from_pretrained(tmp_dir)

device = 'cuda'
new_model = new_model.to(device)

In [74]:
def generate_limericks(prompts, num_generation=10):
    # Collect the generations for every prompt (not only the first one)
    all_limericks = []

    for prompt in prompts:
        prompt = prompt.strip()
        if prompt[-6:] != "<LINE>":
            prompt += " <LINE>"

        if reverse is True:
            # Reverse the tokens within each line to match the training format
            input_ids = reverse_line(
                tokenizer(prompt, return_tensors="np").input_ids[0])
            input_ids = torch.tensor(input_ids).reshape(1, -1)
        else:
            # The tokenizer returns a BatchEncoding; take the ID tensor
            input_ids = tokenizer(prompt, return_tensors="pt").input_ids
        input_ids = input_ids.repeat(num_generation, 1)
        input_ids = input_ids.to(device=device)

        outputs = new_model.generate(
            input_ids, max_length=100, do_sample=True)

        if reverse is True:
            # Undo the per-line reversal before decoding
            restored = []
            for output in outputs:
                output = torch.tensor(
                    reverse_line(output.cpu().numpy())).reshape(-1)
                restored.append(output)
            outputs = torch.stack(restored)

        outputs = tokenizer.batch_decode(
            outputs.cpu(),
            skip_special_tokens=False)

        # Drop whatever follows the last separator
        for output in outputs:
            all_limericks.append(output.strip().split(" <LINE> ")[:-1])

    return all_limericks
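
The string handling around generation can be sketched without a model: the prompt gets a trailing `<LINE>` if it lacks one, and the decoded text is split on the separator, discarding whatever follows the last separator (an unfinished line or end-of-text tokens). `normalize_prompt`, `split_generated`, and the sample string are hypothetical and only illustrate the idea.

```python
SEP = "<LINE>"


def normalize_prompt(prompt):
    """Make sure the prompt ends with the line separator."""
    prompt = prompt.strip()
    if not prompt.endswith(SEP):
        prompt += f" {SEP}"
    return prompt


def split_generated(decoded):
    """Split decoded text into lines and drop the piece after the last separator."""
    return decoded.strip().split(f" {SEP} ")[:-1]


print(normalize_prompt("there once was a cat "))
decoded = "first line <LINE> second line <LINE> third li"
print(split_generated(decoded))
```

Because the trailing piece is always dropped, a generation cut off by `max_length` in the middle of a line loses that partial line.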

In [61]:
results = generate_limericks(["I'd rather watch the clouds in the sky"])

for result in results:
    print("\n".join(result))
    print()

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


I'd rather watch the clouds in the sky
though don't climb to a field, and i'll try
make the sky always clear
because nobody's here
you'll say i'll not fly off the eye

I'd rather watch the clouds in the sky
but ignore it, take care if you'd die
if you fly on a fly
you'd be seen in the sky
or a comerfly, put your way high

I'd rather watch the clouds in the sky
and clouds, if i'd take on a try
of the cloud, and, say
that the clouds had held sway..
being such an observant was i

I'd rather watch the clouds in the sky
that were conically shaped like a pie
and they'd fall in the night
simply fall, not just right
and to fall, they could fall way up high

I'd rather watch the clouds in the sky
for his eyes with a scientist's eye
to observe and observe
to observe, observe, observe
are a change from my mind, smile and sigh

I'd rather watch the clouds in the sky
ae and stars that are bigger than i
far from sea to up high
from a view in the sky
help me up. so thanks to heaven, i'm high

I'd rat

In [62]:
results = generate_limericks(["I haven't switched on my TV for years"])
for result in results:
    print("\n".join(result))
    print()

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


I haven't switched on my TV for years
and i could have shed all the tears
do you love you and me
to those days i set free
i'm afraid you're no more effeteers

I haven't switched on my TV for years
on tv; i'm the most of my fears
it's me wasting away
yes, i sit night and day
i have given myself to the tears

I haven't switched on my TV for years
with some newspaper ads, it appears
so i come to a show
or as bright, in that glow
since my cellular phone's got a cheers

I haven't switched on my TV for years
i worked all day long. are my fears
you can catch this tv
from all you can see
dip me off. you can see with no cheers

I haven't switched on my TV for years
some actors sing praises and cheers
i called an old bando
of acronycello
i'm singing them out with the cheers

I haven't switched on my TV for years
a theater or stage, generate cheers
and people don't know
that original show
unproduced by elievo. those cheers

I haven't switched on my TV for years
we have people like me, and my fear

In [63]:
results = generate_limericks(["if you're using a subsurface map"])
for result in results:
    print("\n".join(result))
    print()

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


if you're using a subsurface map
go to write it; you may want to nap
if the text is all there
(no matter how rare
as if text is, you might use it to nap)

if you're using a subsurface map
when two faces refer to the map
just those lines with two dots
you don't match up those spots
but what's better if that's where you'll dip

if you're using a subsurface map
in which lexically beats all the rap
or define as a source
or the language whose splendor
and all you would use like a cop

if you're using a subsurface map
on the surface it's used as a base map
from your map, it's the base
of the surface of which face
and might thus define an aprace map

if you're using a subsurface map
it's the nest of trips to your shop
you should have a wide space space
the chartometer's place
and without it, you'd no place to stop

if you're using a subsurface map
use the best for a map, fit for a map
there's a page on the place
what a kind of disgrace
and it's urcopy: its space is no top map

if you're using