# Fine-tuning GPT-2 on a jokes dataset in PyTorch
[![GitHub](https://img.shields.io/badge/GitHub%20(Krishna)-bearlike-orange?style=flat&logo=github)](https://github.com/bearlike)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/bearlike/Discord-GPT2-Joke-Bot/blob/main/519_GPT2_Joker.ipynb)

> For Discord Integration, check out https://github.com/bearlike/Joke-and-Pun-Bot

- For `CNIT 519 - Natural Language Technologies`
- Let's see if the model can learn to crack some jokes!
- [All versions of the trained models are available here](https://drive.google.com/drive/folders/1cYMczEnNVBPM_Su_QeyZ6EJug5V5oB1g?usp=share_link). This notebook can help you retrain or optimize them.
- I initially ran this notebook on my personal machine, since Colab instances are subject to sudden unavailability. The notebook was later modified to work on Google Colab. If you're training on the free tier, check that your instance is still running after each epoch; it will likely be interrupted after each one.
- You need [`jokes_data/bad_words_en.txt`](https://github.com/bearlike/Joke-and-Pun-Bot/blob/main/jokes_data/bad_words_en.txt) during inference and [`jokes_data/shortjokes.csv`](https://github.com/bearlike/Joke-and-Pun-Bot/blob/main/jokes_data/shortjokes.csv) during training. Both come from [`bearlike/Joke-and-Pun-Bot`](https://github.com/bearlike/Joke-and-Pun-Bot).

---

For this experiment, I will use the pre-trained GPT-2 medium model (`gpt2-medium`) from the Hugging Face [transformers repository](https://github.com/huggingface/transformers).

In [1]:
!mkdir jokes_data
# Upload shortjokes.csv to ./jokes_data
!pip install --quiet wheel transformers torch torchsummary numpy tqdm

A subdirectory or file jokes_data already exists.


In [2]:
!nvidia-smi

Mon Dec  5 14:38:41 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.65       Driver Version: 527.37       CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   47C    P0    25W /  N/A |      0MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import numpy as np

import logging
logging.getLogger().setLevel(logging.CRITICAL)

import warnings
warnings.filterwarnings('ignore')

device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'
print(f"Running on { device }")

Running on cuda


In [5]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')
model = GPT2LMHeadModel.from_pretrained('gpt2-medium')
model = model.to(device)

In [6]:
def choose_from_top(probs, n=5):
    ind = np.argpartition(probs, -n)[-n:]
    top_prob = probs[ind]
    top_prob = top_prob / np.sum(top_prob) # Normalize
    choice = np.random.choice(n, 1, p = top_prob)
    token_id = ind[choice][0]
    return int(token_id)
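
`choose_from_top` implements top-n sampling: keep the `n` most probable tokens, renormalize their probabilities, and draw one of them at random. The block below is a minimal NumPy-free sketch of the same idea, assuming a plain list of probabilities. `choose_from_top_py` and its `rng` parameter are hypothetical names used only for illustration; they are not part of the notebook.

```python
import random

def choose_from_top_py(probs, n=5, rng=random):
    """Sample an index from the n largest entries of probs (top-n sampling)."""
    # Indices of the n highest-probability entries.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:n]
    # random.choices normalizes the weights itself, so no explicit division is needed.
    weights = [probs[i] for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

probs = [0.05, 0.40, 0.02, 0.30, 0.03, 0.20]
picked = choose_from_top_py(probs, n=3, rng=random.Random(0))
print(picked)  # always one of the indices 1, 3, or 5
```

A small `n` keeps the output close to the most likely continuation, while a larger `n` trades coherence for variety.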

### PyTorch Dataset module for Short jokes dataset

For fine-tuning the GPT-2 model, we will use this [Short Jokes dataset](https://www.kaggle.com/abhinavmoudgil95/short-jokes) published on Kaggle. Each joke is prefixed with `JOKE:` and followed by `<|endoftext|>`, which GPT-2 recognizes as its end-of-text marker (like `EOS`). The marker lets us concatenate many jokes in a single input sequence.

In [7]:
from torch.utils.data import Dataset, DataLoader
import json
import csv
import os

class JokesDataset(Dataset):
    def __init__(self, jokes_dataset_path = 'jokes_data/'):
        super().__init__()

        short_jokes_path = os.path.join(jokes_dataset_path, 'shortjokes.csv')

        self.joke_list = []
        self.end_of_text_token = "<|endoftext|>"
        
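        with open(short_jokes_path, encoding='utf-8') as csv_file:
            csv_reader = csv.reader(csv_file, delimiter=',')
            # Skip the header row (assumes the Kaggle layout with an "ID,Joke" header)
            next(csv_reader, None)
            for row in csv_reader:
                # Column 1 holds the joke text; add the prefix and the end-of-text marker
                joke_str = f"JOKE:{row[1]}{self.end_of_text_token}"
                self.joke_list.append(joke_str)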
        
    def __len__(self):
        return len(self.joke_list)

    def __getitem__(self, item):
        return self.joke_list[item]


In [8]:
dataset = JokesDataset()
joke_loader = DataLoader(dataset, batch_size=1, shuffle=True)

### Hyperparameters

I tested more than five hyperparameter sets before settling on the one that worked best. I mostly tuned **`BATCH_SIZE`** (here, the number of forward-backward passes accumulated between optimization steps), **`EPOCHS`**, and **`LEARNING_RATE`**.
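
Because `BATCH_SIZE` counts forward-backward passes per optimizer step, the training loop is effectively doing gradient accumulation. The toy below sketches that pattern in plain Python on a one-parameter model, so it runs without a GPU. `ACCUM_STEPS` stands in for `BATCH_SIZE`, and the data, learning rate, and variable names are made up for illustration. Like the notebook, it sums gradients rather than averaging them.

```python
# Toy gradient accumulation: fit y = 3x with a single scalar weight.
ACCUM_STEPS = 4        # plays the role of BATCH_SIZE in the notebook
LEARNING_RATE = 0.01   # illustrative value, unrelated to the notebook's 3e-5

samples = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)] * 50  # 200 samples

w = 0.0
grad_sum = 0.0         # accumulated gradient, like .grad after repeated backward()
seen = 0
optimizer_steps = 0

for x, y in samples:
    # "Forward + backward" for one sample: d/dw (w*x - y)^2 = 2*(w*x - y)*x
    grad_sum += 2.0 * (w * x - y) * x
    seen += 1
    if seen == ACCUM_STEPS:
        w -= LEARNING_RATE * grad_sum   # optimizer.step()
        grad_sum = 0.0                  # optimizer.zero_grad()
        seen = 0
        optimizer_steps += 1

print(optimizer_steps)  # 50
print(round(w, 3))
```

Since the gradients are summed, a larger accumulation count makes each update larger, so `BATCH_SIZE` and `LEARNING_RATE` interact when tuning.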

In [9]:
BATCH_SIZE = 16
EPOCHS = 5 # Training runs over range(START, EPOCHS), so the last epoch index is EPOCHS - 1
LEARNING_RATE = 3e-5
WARMUP_STEPS = 5000
MAX_SEQ_LEN = 400

# Note: newer transformers releases deprecate this AdamW in favor of torch.optim.AdamW
from transformers import AdamW
from transformers import get_linear_schedule_with_warmup

device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'
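
To make the role of `WARMUP_STEPS` concrete, here is a sketch of the learning-rate multiplier as I understand `get_linear_schedule_with_warmup` to compute it. `linear_warmup_multiplier` is a hypothetical helper written for this illustration, and `total_steps=10000` is an assumed value, not one taken from the notebook.

```python
def linear_warmup_multiplier(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Linear ramp from 0 to 1 during warmup.
        return step / max(1, warmup_steps)
    # Linear decay from 1 to 0 between warmup_steps and total_steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

LEARNING_RATE = 3e-5
WARMUP_STEPS = 5000

for step in (0, 2500, 5000, 7500, 10000):
    lr = LEARNING_RATE * linear_warmup_multiplier(step, WARMUP_STEPS, total_steps=10000)
    print(step, lr)
```

Under this reading, passing `total_steps=-1` (as the training cell does with `num_training_steps=-1`) yields a multiplier of 0 for every step at or past the end of warmup, so it only behaves sensibly while training stays within the warmup phase.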

### Model training

- The weights are saved after each epoch. Afterwards, we generate jokes with each saved version to see which performs best (or is funniest?).
- The length of jokes varies a lot in the dataset, and there are many short sequences. To keep the number of tokens per optimization step more consistent, the training loop packs as many jokes as possible into each sequence of at most `MAX_SEQ_LEN` tokens.
- Once you have begun training, take a coffee break (or even a couple). On a 3070 set to turbo, each epoch takes at least an hour.
- If you're continuing from an earlier run, set `START` to the epoch to resume from; the weights saved at epoch `START - 1` are loaded first.
---
- `Note to self`: Make sure you're at least using WSL. Shift to Ubuntu if you can.
---
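
The packing idea described above can be sketched without PyTorch. The function below greedily concatenates already-tokenized jokes until the next one would exceed the limit. `pack_sequences` is a hypothetical helper, the integers are stand-ins for token ids, and unlike the notebook's loop it also flushes the final partially filled sequence.

```python
def pack_sequences(token_lists, max_len):
    """Greedily concatenate token lists into sequences of at most max_len tokens."""
    packed, current = [], []
    for tokens in token_lists:
        if len(tokens) > max_len:
            continue  # skip samples that cannot fit at all
        if current and len(current) + len(tokens) > max_len:
            packed.append(current)  # current sequence is full, emit it
            current = []
        current = current + tokens
    if current:
        packed.append(current)  # flush the final, partially filled sequence
    return packed

# Toy "tokenized jokes": integers stand in for token ids.
jokes = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10] * 12, [11, 12]]
print(pack_sequences(jokes, max_len=6))
# [[1, 2, 3, 4, 5], [6, 7, 8, 9, 11, 12]]
```

The 12-token sample is dropped because it exceeds `max_len` on its own, mirroring the "skip sample if it is longer than `MAX_SEQ_LEN`" check in the training loop.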

In [10]:
from tqdm.auto import tqdm

In [14]:
START = 5 # If set to 1, it'll load Epoch 0 and continue from there
models_folder = "/content/drive/MyDrive/trained_models"

In [12]:
model = model.to(device)

if START > 0:
  continue_path = os.path.join(models_folder, 
                               f"gpt2_medium_joker_{START - 1}.pt")
  # map_location loads the checkpoint onto whichever device is in use
  # (training on a CPU is possible, but it might rip your system apart lol)
  model.load_state_dict(torch.load(continue_path, map_location=device))
  print(f"Loaded weights from `{ continue_path }.`")
  print(f"Starting from Epoch { START }....")

model.train()
optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
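# Caveat: with num_training_steps=-1 the linear decay factor is clamped to 0 once
# warmup ends, so pass the real number of optimizer steps if you train past WARMUP_STEPS
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps=-1)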
proc_seq_count = 0
sum_loss = 0.0
batch_count = 0

tmp_jokes_tens = None
# Create the checkpoint folder (and any missing parents) if it does not exist yet
os.makedirs(models_folder, exist_ok=True)

for epoch in tqdm(range(START, EPOCHS), position=0, leave=True):

    print(f"\nEPOCH {epoch} started\n{'=' * 30}\n")
    
    for idx, joke in tqdm(enumerate(joke_loader), total=len(joke_loader), position=0, leave=True):
        
        #### "Fit as many joke sequences into MAX_SEQ_LEN sequence as possible" logic start ####
        joke_tens = torch.tensor(tokenizer.encode(joke[0])).unsqueeze(0).to(device)
        # Skip sample from dataset if it is longer than MAX_SEQ_LEN
        if joke_tens.size()[1] > MAX_SEQ_LEN:
            continue
        
        # The first joke sequence in the sequence
        if not torch.is_tensor(tmp_jokes_tens):
            tmp_jokes_tens = joke_tens
            continue
        else:
            # The next joke does not fit in so we process the sequence and leave the last joke 
            # as the start for next sequence 
            if tmp_jokes_tens.size()[1] + joke_tens.size()[1] > MAX_SEQ_LEN:
                work_jokes_tens = tmp_jokes_tens
                tmp_jokes_tens = joke_tens
            else:
                # Add the joke to sequence, continue and try to add more.
                # GPT2Tokenizer.encode adds no leading special token, so append the whole joke
                tmp_jokes_tens = torch.cat([tmp_jokes_tens, joke_tens], dim=1)
                continue

        ########## Sequence ready, process it through the model ################
            
        outputs = model(work_jokes_tens, labels=work_jokes_tens)
        loss, logits = outputs[:2]                        
        loss.backward()
        sum_loss = sum_loss + loss.item()  # accumulate as a plain float for logging
                       
        proc_seq_count = proc_seq_count + 1
        if proc_seq_count == BATCH_SIZE:
            proc_seq_count = 0    
            batch_count += 1
            optimizer.step()
            scheduler.step() 
            optimizer.zero_grad()
            model.zero_grad()

        if batch_count == 100:
            print(f"Sum Loss\t→ {sum_loss}")
            batch_count = 0
            sum_loss = 0.0
    
    # Store the model after each epoch to compare the performance of them
    # Funny story: I initially had this save block at the wrong loop level and suffered the consequences.
    curr_model_path = f"gpt2_medium_joker_{epoch}.pt"
    torch.save(model.state_dict(), os.path.join(models_folder, curr_model_path))
    print(f"Epoch {epoch} state saved to {curr_model_path}")
            

Loaded weights from `/content/drive/MyDrive/trained_models/gpt2_medium_joker_4.pt.`
Starting from Epoch 5....


0it [00:00, ?it/s]

### Generating the jokes

In [24]:
from tqdm.auto import tqdm
tqdm._instances.clear()

In [29]:
from time import sleep
for _ in tqdm(range(6*60)):
  sleep(1)

  0%|          | 0/360 [00:00<?, ?it/s]

In [30]:
MODEL_EPOCH = 2
models_folder = "/content/drive/MyDrive/trained_models"
model_path = os.path.join(models_folder, f"gpt2_medium_joker_{MODEL_EPOCH}.pt")
model.load_state_dict(torch.load(model_path, map_location=device))

jokes_output_file_path = f'{ models_folder }/generated_{MODEL_EPOCH}.jokes'

model.eval()
if os.path.exists(jokes_output_file_path):
    os.remove(jokes_output_file_path)
    
joke_num = 0
with torch.no_grad():
    for joke_idx in tqdm(range(1000), position=0, leave=True):
        joke_finished = False
        cur_ids = torch.tensor(tokenizer.encode("JOKE:")).unsqueeze(0).to(device)

        for i in range(100):
            outputs = model(cur_ids, labels=cur_ids)
            loss, logits = outputs[:2]
            # Take the first (and only) batch item and the logits at the last position
            softmax_logits = torch.softmax(logits[0, -1], dim=0)
            # Sample from the top 20 tokens for the first three steps, then from the top 3
            n = 20 if i < 3 else 3
            # Randomly select the next token from the top-n probability distribution
            next_token_id = choose_from_top(softmax_logits.to('cpu').numpy(), n=n)
            # Append the sampled token to the running sequence
            cur_ids = torch.cat([cur_ids, torch.ones((1, 1)).long().to(device) * next_token_id], dim=1)

            # Stop once the model emits '<|endoftext|>'
            if next_token_id == tokenizer.eos_token_id:
                joke_finished = True
                break

        # Only jokes that terminated within 100 tokens are written out
        if joke_finished:
            joke_num = joke_num + 1
            output_list = list(cur_ids.squeeze().to('cpu').numpy())
            output_text = tokenizer.decode(output_list)

            with open(jokes_output_file_path, 'a') as f:
                f.write(f"{output_text} \n\n")

  0%|          | 0/1000 [00:00<?, ?it/s]

## Generate one Joke without offensive words

In [25]:
def load_bad_words(path=os.path.join("jokes_data", "bad_words_en.txt")):
    # Assumes the file lists one word per line
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

bad_words = load_bad_words()

def generate_joke(begin="", epoch=2):
    MODEL_EPOCH = epoch
    models_folder = "trained_models"  # adjust to wherever your checkpoints are stored
    model_path = os.path.join(models_folder, f"gpt2_medium_joker_{MODEL_EPOCH}.pt")
    model.load_state_dict(torch.load(model_path, map_location=device))
    model.eval()

    with torch.no_grad():
        joke_finished = False
        cur_ids = torch.tensor(tokenizer.encode(f"JOKE:{begin}")).unsqueeze(0).to(device)
        for i in range(100):
            outputs = model(cur_ids, labels=cur_ids)
            loss, logits = outputs[:2]
            # Take the first (and only) batch item and the logits at the last position
            softmax_logits = torch.softmax(logits[0, -1], dim=0)
            n = 20 if i < 3 else 3

            # Resample until the chosen token is not a "bad word"
            contains_bad_word = True
            while contains_bad_word:
                # Randomly select the next token from the top-n probability distribution
                next_token_id = choose_from_top(softmax_logits.to('cpu').numpy(), n=n)
                # Bad word check on the decoded token, ignoring case and surrounding spaces
                if tokenizer.decode([next_token_id]).strip().lower() not in bad_words:
                    contains_bad_word = False

            # Append the sampled token to the running sequence
            cur_ids = torch.cat([cur_ids, torch.ones((1, 1)).long().to(device) * next_token_id], dim=1)
            if next_token_id == tokenizer.eos_token_id:
                joke_finished = True
                break

        if joke_finished:
            output_list = list(cur_ids.squeeze().to('cpu').numpy())
            output_text = tokenizer.decode(output_list)
            output_text = output_text.replace("JOKE:", "").replace('<|endoftext|>', "")
            return (True, output_text)
    return (False, "I don't know a joke with that")

In [26]:
generate_joke(begin="", epoch=2)[1]

'Do you remember the first time you heard the word "cute?" It was a girl\'s name, but she was so cute, you could have been a girl\'s sister.'
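
The rejection loop in `generate_joke` keeps resampling until it draws a token that is not on the bad-word list, which could spin for a long time if every top-n candidate is blocked. The sketch below shows one possible bounded variant in isolation. `sample_clean_token`, `max_tries`, and the toy word list are assumptions for illustration, not part of the bot.

```python
import random

def sample_clean_token(candidates, weights, bad_words, max_tries=20, rng=random):
    """Resample until the candidate is not in bad_words, giving up after max_tries."""
    for _ in range(max_tries):
        token = rng.choices(candidates, weights=weights, k=1)[0]
        # Compare case-insensitively and without the leading space GPT-2 tokens often carry.
        if token.strip().lower() not in bad_words:
            return token
    return None  # every attempt hit a blocked word

bad_words = {"darn"}
token = sample_clean_token([" darn", " cat", " dog"], [0.6, 0.3, 0.1], bad_words,
                           rng=random.Random(1))
print(token)  # " cat" or " dog" (None only if every try is blocked)
```

A caller could treat `None` as a signal to widen `n` or to end the joke early.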

## (My) Inference
- We need to save/use the best performing weights from different epochs. More epochs might not always be better.

## Experimenting for topic modelling
- Grab the hidden states for the tokens in each sentence and reduce them to a single vector: either take the state of the last token or pool over all tokens (average or max).
- You'll be left with one vector per sentence. For `gpt2-medium` its size is 1024 (768 for the base `gpt2` model).
- Choose a clustering method and cluster the sentence vectors. I've had good results with HDBSCAN.
- Inspect the sentences within each cluster to see if you found any meaningful topics.
- You can also use PCA, UMAP, or t-SNE to reduce the labelled sentence vectors to 2 or 3 dimensions, then inspect how tightly the labelled sentences are packed together and whether there is significant distance between clusters.
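
As a small illustration of the pooling step, the sketch below reduces per-token vectors to one sentence vector using plain Python lists. The numbers are toy values rather than real model output, and `mean_pool` and `max_pool` are hypothetical helpers. In practice you would apply the same reduction to the model's hidden-state tensors.

```python
def mean_pool(token_vectors):
    """Average per-token vectors into one sentence vector."""
    n = len(token_vectors)
    return [sum(col) / n for col in zip(*token_vectors)]

def max_pool(token_vectors):
    """Take the element-wise maximum over per-token vectors."""
    return [max(col) for col in zip(*token_vectors)]

# Three tokens with 4-dimensional hidden states (toy numbers, not model output).
hidden_states = [
    [1.0, 0.0, 2.0, -1.0],
    [3.0, 4.0, 0.0, -3.0],
    [2.0, 2.0, 1.0, -2.0],
]
print(mean_pool(hidden_states))  # [2.0, 2.0, 1.0, -2.0]
print(max_pool(hidden_states))   # [3.0, 4.0, 2.0, -1.0]
```

The resulting sentence vectors are what you would hand to the clustering and dimensionality-reduction steps.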