# Fine-tuning GPT-2 on a jokes dataset in PyTorch
- For `CNIT 519 - Natural Language Technologies`
- Let's see if the model can learn to crack some jokes!

For this experiment, I will use the pre-trained GPT-2 medium model from the Hugging Face [transformers repository](https://github.com/huggingface/transformers).

In [1]:
!mkdir jokes_data
# Add shortjokes.csv to ./jokes_data
!pip install --quiet transformers


In [16]:
!nvidia-smi

Mon Dec  5 02:20:48 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.65       Driver Version: 527.37       CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   50C    P8    12W /  N/A |      0MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import numpy as np

import logging
logging.getLogger().setLevel(logging.CRITICAL)

import warnings
warnings.filterwarnings('ignore')

device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'

In [3]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')
model = GPT2LMHeadModel.from_pretrained('gpt2-medium')
model = model.to(device)


In [5]:
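def choose_from_top(probs, n=5):
    """Sample a token id from the n most probable entries of `probs`."""
    ind = np.argpartition(probs, -n)[-n:] # Indices of the n largest probabilities (unordered)
    top_prob = probs[ind]
    top_prob = top_prob / np.sum(top_prob) # Normalize
    choice = np.random.choice(n, 1, p = top_prob)
    token_id = ind[choice][0]
    return int(token_id)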
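
To make the top-n idea in `choose_from_top` easier to poke at, here is a minimal dependency-free sketch of the same technique. The name `sample_from_top` and the toy probabilities are made up for illustration; this is not the function the notebook uses.

```python
import random

def sample_from_top(probs, n=5, rng=random):
    """Pick an index among the n largest probabilities, weighted by their values."""
    # Indices of the n most likely tokens, most likely first
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:n]
    weights = [probs[i] for i in top]
    # random.choices normalizes the weights internally
    return rng.choices(top, weights=weights, k=1)[0]

# Toy distribution over a 6-token vocabulary (made-up numbers)
probs = [0.02, 0.40, 0.03, 0.30, 0.05, 0.20]
picks = [sample_from_top(probs, n=3) for _ in range(1000)]
greedy = sample_from_top(probs, n=1)
```

With `n=3` only indices 1, 3 and 5 can ever be chosen, and `n=1` reduces to greedy decoding.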

### PyTorch Dataset module for Short jokes dataset

For fine-tuning the GPT-2 model, we will use the [Short Jokes dataset](https://www.kaggle.com/abhinavmoudgil95/short-jokes) published on Kaggle. After each joke, we append `<|endoftext|>`, which GPT-2 recognizes as its end-of-text marker (like `EOS`). The marker lets us concatenate many jokes into a single input sequence.
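
A small string-level sketch of that formatting, assuming the same `JOKE:` prefix the dataset class uses. The helper name `format_joke` and the placeholder jokes are hypothetical.

```python
END_OF_TEXT = "<|endoftext|>"  # GPT-2's end-of-text marker

def format_joke(text):
    """Wrap one joke the way the dataset class does: prefix + text + marker."""
    return f"JOKE:{text}{END_OF_TEXT}"

# Placeholder strings standing in for rows of shortjokes.csv
jokes = ["Placeholder joke one.", "Placeholder joke two."]

# Several jokes concatenated into one training string
packed = "".join(format_joke(j) for j in jokes)

# The marker is what lets us find the joke boundaries again
recovered = [chunk[len("JOKE:"):] for chunk in packed.split(END_OF_TEXT) if chunk]
```

Splitting on the marker gives back the original jokes, which is the property the packing step relies on.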

In [6]:
from torch.utils.data import Dataset, DataLoader
import json
import csv
import os

class JokesDataset(Dataset):
    def __init__(self, jokes_dataset_path = 'jokes_data/'):
        super().__init__()

        short_jokes_path = os.path.join(jokes_dataset_path, 'shortjokes.csv')

        self.joke_list = []
        self.end_of_text_token = "<|endoftext|>"
        
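        # newline='' is what the csv module expects; utf-8 is an assumption about the file
        with open(short_jokes_path, newline='', encoding='utf-8') as csv_file:
            csv_reader = csv.reader(csv_file, delimiter=',')

            # NOTE: a header row, if present, is read as a sample too;
            # call next(csv_reader, None) here to skip it.
            for row in csv_reader:
                # Column 1 holds the joke text
                joke_str = f"JOKE:{row[1]}{self.end_of_text_token}"
                self.joke_list.append(joke_str)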
        
    def __len__(self):
        return len(self.joke_list)

    def __getitem__(self, item):
        return self.joke_list[item]


In [8]:
dataset = JokesDataset()
joke_loader = DataLoader(dataset, batch_size=1, shuffle=True)

### Hyperparameters

I tried more than five hyperparameter sets before settling on the one that worked best. I mostly tuned **`BATCH_SIZE`** (here, the number of forward-backward passes between optimization steps, i.e. gradient accumulation), **`EPOCHS`**, and **`LEARNING_RATE`**.
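
Since `BATCH_SIZE` here means gradient accumulation rather than a DataLoader batch, a toy sketch may help. It uses a one-parameter model in plain Python so it runs without a GPU; the data, `ACCUM_STEPS` and `LR` are made up and unrelated to the values used for GPT-2.

```python
ACCUM_STEPS = 4   # plays the role of BATCH_SIZE in the notebook
LR = 0.01

# Toy data for y = 3 * x, repeated so there are 200 samples
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)] * 50

w = 0.0      # the single trainable parameter
grad = 0.0   # accumulated gradient (what repeated .backward() calls build up)
seen = 0     # forward-backward passes since the last optimizer step
steps = 0    # optimizer steps taken

for x, y in data:
    # Gradient of the squared error (w * x - y) ** 2 with respect to w
    grad += 2.0 * (w * x - y) * x
    seen += 1
    if seen == ACCUM_STEPS:
        # Step on the averaged gradient, then reset the accumulator
        w -= LR * grad / ACCUM_STEPS
        grad = 0.0
        seen = 0
        steps += 1
```

One design difference to keep in mind: this sketch averages the accumulated gradient, while the training loop in this notebook calls `loss.backward()` without dividing the loss, so its gradients are summed over the accumulated passes.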

In [7]:
BATCH_SIZE = 16
EPOCHS = 5 # Epochs are numbered 0 .. EPOCHS - 1
LEARNING_RATE = 3e-5
WARMUP_STEPS = 5000
MAX_SEQ_LEN = 400

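# Note: transformers.AdamW is deprecated in newer transformers releases;
# torch.optim.AdamW is the suggested replacement (its default weight_decay differs).
from transformers import AdamW
from transformers import get_linear_schedule_with_warmup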

device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'

### Model training

- The model is trained and its weights are saved after each epoch. We then generate jokes with each saved version to see which performs best (or is funniest?).
- Joke length varies a lot in the dataset, and many sequences are short. To keep the number of tokens per optimization step more consistent, the training loop packs as many jokes as possible into each sequence of at most `MAX_SEQ_LEN` tokens.
- Once you've begun training, take a coffee break (or even a couple). On a 3070 set to turbo, each epoch takes at least an hour.
- If you're continuing from a previous epoch, change `START`.
---
- `Note to self`: Make sure you're at least using WSL. Shift to Ubuntu if you can.
---
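
The "fit as many jokes as possible into `MAX_SEQ_LEN`" idea can be sketched on plain Python lists. The function name `pack_sequences` and the token ids are made up; the real loop works on tensors and carries the leftover joke into the next sequence instead of flushing it at the end.

```python
def pack_sequences(token_lists, max_len):
    """Greedily concatenate token lists into chunks of at most max_len tokens."""
    packed, current = [], []
    for tokens in token_lists:
        if len(tokens) > max_len:
            continue  # too long on its own: skip it, as the training loop does
        if len(current) + len(tokens) > max_len:
            packed.append(current)  # next joke does not fit: emit the chunk
            current = []
        current = current + tokens
    if current:
        packed.append(current)  # flush the final partial chunk
    return packed

# Made-up token ids standing in for tokenizer.encode(...) output
samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10] * 12, [11]]
chunks = pack_sequences(samples, max_len=6)
```

Here the 12-token sample is dropped and the remaining four are packed into two chunks of five tokens each.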

In [10]:
from tqdm.auto import tqdm

In [11]:
START = 1 # If set to 1, it'll load Epoch 0 and continue from there
models_folder = "/home/kk/Projects/NLT_Group_Project/trained_models"

In [None]:
model = model.to(device)

if START > 0:
    continue_path = os.path.join(models_folder,
                                 f"gpt2_medium_joker_{START - 1}.pt")
    # map_location loads the checkpoint onto whichever device is active,
    # but training on a CPU might rip your system apart lol
    model.load_state_dict(torch.load(continue_path, map_location=device))
    print(f"Loaded weights from `{ continue_path }.`")
    print(f"Starting from Epoch { START }....")

model.train()
optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
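# NOTE: with num_training_steps=-1 the schedule only ramps up; the LR multiplier
# drops to 0 once WARMUP_STEPS optimizer steps have passed. Pass the real number
# of optimizer steps if training is expected to run past the warmup.
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps=-1)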
proc_seq_count = 0
sum_loss = 0.0
batch_count = 0

tmp_jokes_tens = None
# makedirs also creates missing parent folders and is a no-op if the folder exists
os.makedirs(models_folder, exist_ok=True)

for epoch in tqdm(range(START, EPOCHS), position=0, leave=True):

    print(f"\nEPOCH {epoch} started\n{'=' * 30}\n")
    
    for idx, joke in tqdm(enumerate(joke_loader), total=len(joke_loader), position=0, leave=True):
        
        #### "Fit as many joke sequences into MAX_SEQ_LEN sequence as possible" logic start ####
        joke_tens = torch.tensor(tokenizer.encode(joke[0])).unsqueeze(0).to(device)
        # Skip sample from dataset if it is longer than MAX_SEQ_LEN
        if joke_tens.size()[1] > MAX_SEQ_LEN:
            continue
        
        # The first joke sequence in the sequence
        if not torch.is_tensor(tmp_jokes_tens):
            tmp_jokes_tens = joke_tens
            continue
        else:
            # The next joke does not fit in so we process the sequence and leave the last joke 
            # as the start for next sequence 
            if tmp_jokes_tens.size()[1] + joke_tens.size()[1] > MAX_SEQ_LEN:
                work_jokes_tens = tmp_jokes_tens
                tmp_jokes_tens = joke_tens
            else:
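                # Add the joke to the sequence, continue and try to add more.
                # GPT2Tokenizer.encode() adds no leading special token, so the whole
                # joke is kept (slicing off token 0 would drop the start of "JOKE:").
                tmp_jokes_tens = torch.cat([tmp_jokes_tens, joke_tens], dim=1)
                continue

        ########## Sequence ready, process it through the model ################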
            
        outputs = model(work_jokes_tens, labels=work_jokes_tens)
        loss, logits = outputs[:2]
        loss.backward()  # Gradients accumulate until optimizer.step()
        sum_loss = sum_loss + loss.item()
                       
        proc_seq_count = proc_seq_count + 1
        if proc_seq_count == BATCH_SIZE:
            proc_seq_count = 0    
            batch_count += 1
            optimizer.step()
            scheduler.step() 
            optimizer.zero_grad()
            model.zero_grad()

        if batch_count == 100:
            print(f"Sum Loss\t→ {sum_loss}")
            batch_count = 0
            sum_loss = 0.0
    
    # Store the model after each epoch so the checkpoints can be compared
    # Funny story: I initially had this block in the wrong place relative to the inner loop and suffered the consequences.
    curr_model_path = f"gpt2_medium_joker_{epoch}.pt"
    torch.save(model.state_dict(), os.path.join(models_folder, curr_model_path))
    print(f"Epoch {epoch} state saved to {curr_model_path}")
            

Loaded weights from `/home/kk/Projects/NLT_Group_Project/trained_models/gpt2_medium_joker_0.pt.`
Starting from Epoch 1....


  0%|          | 0/4 [00:00<?, ?it/s]


EPOCH 1 started



  0%|          | 0/231658 [00:00<?, ?it/s]

Sum Loss	→ 5124.86083984375
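
The training cell passes `num_training_steps=-1` to `get_linear_schedule_with_warmup`, so it is worth looking at what the learning-rate multiplier does in that case. The function below is my own re-implementation of the schedule's shape as I understand it, not the library code; `linear_warmup_multiplier` is a made-up name.

```python
def linear_warmup_multiplier(step, warmup_steps, training_steps):
    """LR multiplier: linear ramp up to 1.0, then linear decay, clamped at 0.0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (training_steps - step) / max(1, training_steps - warmup_steps))

WARMUP_STEPS = 5000

# During warmup the multiplier grows linearly, whatever training_steps is
for step in (0, 2500, 4999):
    print(step, linear_warmup_multiplier(step, WARMUP_STEPS, training_steps=-1))

# Past warmup, a negative training_steps clamps the multiplier to 0.0
print(5000, linear_warmup_multiplier(5000, WARMUP_STEPS, training_steps=-1))

# With a real step count the multiplier decays linearly instead
print(7500, linear_warmup_multiplier(7500, WARMUP_STEPS, training_steps=10000))
```

Under this reading, a run that takes fewer than `WARMUP_STEPS` optimizer steps in total never leaves the ramp-up phase, and a longer run would stop learning once the warmup ends.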


### Generating the jokes

In [None]:
MODEL_EPOCH = 4 # Must match a saved checkpoint; with EPOCHS = 5 they are numbered 0-4

models_folder = "trained_models"

model_path = os.path.join(models_folder, f"gpt2_medium_joker_{MODEL_EPOCH}.pt")
model.load_state_dict(torch.load(model_path, map_location=device))

jokes_output_file_path = f'generated_{MODEL_EPOCH}.jokes'

model.eval()
if os.path.exists(jokes_output_file_path):
    os.remove(jokes_output_file_path)
    
joke_num = 0
eos_token_id = tokenizer.eos_token_id # Id of '<|endoftext|>'
with torch.no_grad():

    for joke_idx in range(1000):

        joke_finished = False

        cur_ids = torch.tensor(tokenizer.encode("JOKE:")).unsqueeze(0).to(device)

        for i in range(100):
            outputs = model(cur_ids, labels=cur_ids)
            loss, logits = outputs[:2]
            # Take the first (and only) batch entry and the logits at the last position
            softmax_logits = torch.softmax(logits[0, -1], dim=0)
            # Sample broadly for the first 3 tokens, then narrow to the top 3
            if i < 3:
                n = 20
            else:
                n = 3
            # Randomly select the next token from the top-n probability distribution
            next_token_id = choose_from_top(softmax_logits.to('cpu').numpy(), n=n)
            # Append the chosen token to the running sequence
            cur_ids = torch.cat([cur_ids, torch.ones((1, 1)).long().to(device) * next_token_id], dim=1)

            if next_token_id == eos_token_id:
                joke_finished = True
                break

        if joke_finished:

            joke_num = joke_num + 1

            output_list = list(cur_ids.squeeze().to('cpu').numpy())
            output_text = tokenizer.decode(output_list)

            with open(jokes_output_file_path, 'a') as f:
                f.write(f"{output_text} \n\n")
                    
      

## (My) Inference
- We need to save and use the best-performing weights across epochs. More epochs are not always better.
- I have not implemented sound similarity in the system yet.