# GPT2 Base model for story generation

Problem Statement: Automatic story generation (ASG) is the task of generating coherent and engaging stories. It has many potential applications, from creative writing to chatbots and virtual assistants, and has become an active research area in recent years.

Dataset: For this project, we use the WritingPrompts dataset released by Facebook Research, which contains over 300,000 short stories. Owing to the limited computing resources available, we use only part of this dataset, split into training, validation, and test sets for training and evaluating our models.

This colab contains the code for calculating the perplexity of the pretrained GPT-2 model imported from the Transformers library.
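
For reference, perplexity is the exponential of the average negative log-likelihood per token. The snippet below is a minimal sketch of that definition only; the function name `perplexity` and the log-probabilities are made up for illustration and do not come from this notebook's data.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Hypothetical sequence: the model assigns probability 0.25 to each of 4 tokens.
example_log_probs = [math.log(0.25)] * 4
print(perplexity(example_log_probs))
```

A model that spreads its probability uniformly over four choices at every step has a perplexity of about 4, so lower values mean the model is less "surprised" by the text.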


In [1]:
import numpy as np 
import pandas as pd
import torch
import logging
from tqdm import tqdm
import math
import argparse
import os


Install the transformers library from source, then import GPT2Tokenizer and GPT2LMHeadModel.

In [2]:
!git clone https://github.com/huggingface/transformers
!pip install transformers/
from transformers import GPT2Tokenizer, GPT2LMHeadModel

Cloning into 'transformers'...
remote: Enumerating objects: 141318, done.
remote: Counting objects: 100% (1332/1332), done.
remote: Compressing objects: 100% (628/628), done.
remote: Total 141318 (delta 799), reused 1028 (delta 579), pack-reused 139986
Receiving objects: 100% (141318/141318), 139.61 MiB | 15.51 MiB/s, done.
Resolving deltas: 100% (105732/105732), done.
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Processing ./transformers
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers==4.30.0.dev0)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
     224.5/224.5 kB 6.1 MB/s eta 0:00:00
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers

# Define Hyperparameters
We define the following hyperparameters.


In [3]:
parser = argparse.ArgumentParser()
parser.add_argument("--max_seq_length", default=512, type=int)
parser.add_argument("--train_batch_size", default=4, type=int)
parser.add_argument("--valid_batch_size", default=4, type=int)
args, _ = parser.parse_known_args()
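
A likely reason for using `parse_known_args` rather than `parse_args` is that a notebook kernel is launched with its own command-line arguments, which `parse_args` would reject. The sketch below illustrates the behaviour with an explicit argument list; the `-f kernel.json` pair is a hypothetical stand-in for whatever the kernel actually passes.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--max_seq_length", default=512, type=int)
parser.add_argument("--train_batch_size", default=4, type=int)

# Hypothetical argv: one known option plus arguments the parser does not define.
argv = ["--max_seq_length", "256", "-f", "kernel.json"]
args, unknown = parser.parse_known_args(argv)

print(args.max_seq_length)    # overridden value
print(args.train_batch_size)  # default value
print(unknown)                # arguments the parser did not recognise
```

Unrecognised arguments are returned in the second element instead of raising an error, so the defaults can still be overridden when the same code is run as a script.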

# Load the data

We load the data from Google Drive. The same folder will also be uploaded to GitHub.

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Preprocess, pickle

In [5]:
import pickle
DATAPATH= "/content/drive/MyDrive/Colab Notebooks/DL Final Project/Pickles-finalpartition/"

def combinetext(prompt, story):
    with open(DATAPATH+prompt, 'rb') as f:
        prompts = pickle.load(f)

    with open(DATAPATH+story, 'rb') as f:
        stories = pickle.load(f)
    assert len(prompts)==len(stories)
    combine=[]
    for i in range(len(prompts)):
        combine.append(' '.join(prompts[i])+' <sep> '+" ".join(stories[i][:300]))
    return combine

# Light text cleanup: re-attach punctuation and contractions that tokenization split off
def cleanpunctuation(s):
    for p in '!,.:;?':
        s=s.replace(' '+p,p)
    s=s.replace(' '+'n\'t','n\'t')
    s=s.replace(' '+'\'s','\'s')
    s=s.replace(' '+'\'re','\'re')
    s=s.replace(' '+'\'ve','\'ve')
    s=s.replace(' '+'\'ll','\'ll')
    s=s.replace(' '+'\'am','\'am')
    s=s.replace(' '+'\'m','\'m')
    s=s.replace(' '+'\' m','\'m')
    s=s.replace(' '+'\'m','\'m')
    s=s.replace(' '+'\' ve','\'ve')
    s=s.replace(' '+'\' s','\'s')
    # Only the unspaced form '<newline>' is matched; a spaced '< newline >' is left unchanged
    s=s.replace('<newline>','\n')
    return s   

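# Naming note: train_text holds the validation split (val_* pickles) and
# valid_text holds the test split (test_* pickles). Nothing is trained in this
# colab; both splits are only used to evaluate the pretrained model.
train_text=combinetext('val_src_tokenized.pickle', 'val_tgt_tokenized.pickle')
train_text=list(map(cleanpunctuation,train_text))
valid_text=combinetext('test_src_tokenized.pickle', 'test_tgt_tokenized.pickle')
valid_text=list(map(cleanpunctuation,valid_text))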

In [6]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token=tokenizer.eos_token

inputs_train = tokenizer(train_text, padding=True,truncation=True,max_length=args.max_seq_length)
inputs_valid=tokenizer(valid_text, padding=True,truncation=True,max_length=args.max_seq_length)


In [7]:
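def create_labels(inputs):
    """Add a 'labels' entry that copies input_ids but masks padded positions."""
    labels=[]
    for ids,attention_mask in zip(inputs['input_ids'],inputs['attention_mask']):
        label=ids.copy()
        real_len=sum(attention_mask)  # number of real (non-padding) tokens
        padding_len=len(attention_mask)-real_len
        # -100 is the index ignored by the loss, so padding does not contribute to it
        label[:]=label[:real_len]+[-100]*padding_len
        labels.append(label)
    inputs['labels']=labels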
    
create_labels(inputs_train)
create_labels(inputs_valid)
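
Because the pad token is set to the end-of-text token, unmasked padding would otherwise be scored as if the model had to predict a run of end-of-text tokens. The sketch below shows the masking on a single toy sequence; `mask_padding_labels` is a hypothetical helper written for this illustration (it masks by position rather than assuming right padding), and the token ids are made up.

```python
def mask_padding_labels(input_ids, attention_mask, ignore_index=-100):
    """Copy input_ids, replacing every padded position with ignore_index."""
    return [tok if keep == 1 else ignore_index
            for tok, keep in zip(input_ids, attention_mask)]

# Toy example: three real tokens followed by two padding tokens (id 50256).
toy_ids = [10, 11, 12, 50256, 50256]
toy_mask = [1, 1, 1, 0, 0]
print(mask_padding_labels(toy_ids, toy_mask))
```

For right-padded inputs this produces the same labels as `create_labels` does for each sequence.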

Define a custom dataset class that returns the input ids, attention mask, and labels for each example.

In [8]:
class StoryDataset:
    def __init__(self, inputs):
        self.ids = inputs['input_ids']
        self.attention_mask = inputs['attention_mask']
        self.labels=inputs['labels']

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, item):

        return [torch.tensor(self.ids[item], dtype=torch.long),
                torch.tensor(self.attention_mask[item], dtype=torch.long),
                torch.tensor(self.labels[item], dtype=torch.long)]
            

Create the data loaders, then load the pretrained GPT-2 model

In [9]:
train_batch_size=args.train_batch_size
valid_batch_size=args.valid_batch_size
traindata=StoryDataset(inputs_train)
train_dataloader = torch.utils.data.DataLoader(
    traindata,
    shuffle=False,
    batch_size=train_batch_size)

validdata=StoryDataset(inputs_valid)
valid_dataloader = torch.utils.data.DataLoader(
    validdata,
    shuffle=False,
    batch_size=valid_batch_size)

In [10]:
model = GPT2LMHeadModel.from_pretrained('gpt2')


In [12]:
model.to('cuda')
model.eval()
eval_loss=[]
for inputs in tqdm(valid_dataloader, desc="eval"):
    d1,d2,d3=inputs
    d1=d1.to('cuda')        
    d2=d2.to('cuda')
    d3=d3.to('cuda')

    with torch.no_grad():
        output = model(input_ids=d1, attention_mask=d2,labels=d3)
        batch_loss=output[0]
    eval_loss+=[batch_loss.cpu().item()]
    del batch_loss
eval_loss=np.mean(eval_loss)
perplexity=math.exp(eval_loss)
print(f'\nThe average perplexity for test dataset {perplexity}') 

eval: 100%|██████████| 625/625 [02:04<00:00,  5.04it/s]


The average perplexity for test dataset 93.97279204446644
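
Note that the loop above averages the per-batch mean losses, which gives every batch equal weight even though batches can contain different numbers of non-padded tokens. A possible refinement, sketched below under that assumption, weights each batch loss by its token count before exponentiating. The function name and the numbers are hypothetical and only illustrate the arithmetic.

```python
import math

def token_weighted_perplexity(batch_losses, batch_token_counts):
    """exp of the total negative log-likelihood divided by the total token count."""
    total_nll = sum(loss * n for loss, n in zip(batch_losses, batch_token_counts))
    total_tokens = sum(batch_token_counts)
    return math.exp(total_nll / total_tokens)

# Hypothetical values: two batches with different numbers of scored tokens.
losses = [2.0, 4.0]
token_counts = [300, 100]

unweighted = math.exp(sum(losses) / len(losses))            # mean of batch means
weighted = token_weighted_perplexity(losses, token_counts)  # per-token mean
print(unweighted, weighted)
```

In the evaluation loop, the per-batch count could be taken as `(d3[:, 1:] != -100).sum().item()`, assuming the model shifts the labels by one position internally and averages the loss over the non-ignored positions. The two estimates coincide when every batch scores the same number of tokens.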





In [13]:
model.to('cuda')
model.eval()
eval_loss=[]
# train_dataloader wraps the val_* pickles, so this loop scores the validation split
for inputs in tqdm(train_dataloader, desc="eval"):
    d1,d2,d3=inputs
    d1=d1.to('cuda')        
    d2=d2.to('cuda')
    d3=d3.to('cuda')

    with torch.no_grad():
        output = model(input_ids=d1, attention_mask=d2,labels=d3)
        batch_loss=output[0]
    eval_loss+=[batch_loss.cpu().item()]
    del batch_loss
eval_loss=np.mean(eval_loss)
perplexity=math.exp(eval_loss)
print(f'\nThe average perplexity for valid dataset {perplexity}') 

eval: 100%|██████████| 625/625 [02:01<00:00,  5.15it/s]


The average perplexity for valid dataset 94.10435559084917





Now we take one sample from the test split (index 22) and generate output for its prompt, so that it can be compared with the output of the fine-tuned models.

In [None]:
prompt=valid_text[22][:valid_text[22].find('<sep>')]
target=valid_text[22][valid_text[22].find('<sep>')+5:]

def generate_story(prompt,target,k=0,p=0.9,output_length=300,temperature=1,num_return_sequences=2,repetition_penalty=1.0):
    print("\nPrompt: ", prompt)
    print('\nTarget: ', target)
    encoded_prompt = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
    model.to('cpu')
    model.eval()
    output_sequences = model.generate(
        input_ids=encoded_prompt,
        max_length=output_length,
        temperature=temperature,
        top_k=k,
        top_p=p,
        repetition_penalty=repetition_penalty,
        do_sample=True,
        num_return_sequences=num_return_sequences
    )
    if len(output_sequences.shape) > 2:
        output_sequences.squeeze_()
    for generated_sequence_idx, generated_sequence in enumerate(output_sequences):
        print("\nGPT-2 GENERATED SEQUENCE {} ".format(generated_sequence_idx + 1))
        generated_sequence = generated_sequence.tolist()
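        text = tokenizer.decode(generated_sequence, clean_up_tokenization_spaces=True)
        # Cut at the first end-of-text token, if present; str.find returns -1 when
        # it is absent, and slicing with -1 would drop the last character
        eos_pos = text.find(tokenizer.eos_token)
        if eos_pos != -1:
            text = text[:eos_pos]
        print(text)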

generate_story(prompt,target)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Prompt:  WP You live world where light helps you retain and regain memory while darkness makes you forget everything One day 

Target:   Jennifer woke up dark room no light She rubbed her eyes as if that would help her to see ` ` Oh god what hell She threw herself bed again too exhausted to move < newline > < newline > Not that she remembers but waking up dark room with no memory happens to her regular bases If she has bad day she turns off light before she goes to sleep so she doesn't have to think about anything and possibly avoid nightmares < newline > < newline > But what hell happened last night that made her this tired must have been something horrible Last time she this tired to wake up when her dog died She stayed her dark room for 2 days straight < newline > < newline > ` ` I should get something to eat Jennifer woke herself up and opened door From distance she saw dim light from kitchen's curtain memory came back < newline > < newline > ` ` Lucy She yelled as she remembered 
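
`generate_story` samples with `top_p=0.9` and `top_k=0`, i.e. nucleus (top-p) sampling with top-k filtering turned off. Conceptually, nucleus sampling keeps the smallest set of most probable tokens whose cumulative probability reaches `p` and samples only from that set. The sketch below illustrates the idea in plain Python; it is not the library's implementation, and the function name and probabilities are made up.

```python
def nucleus_filter(probs, p=0.9):
    """Return indices of the smallest high-probability set whose mass reaches p."""
    # Rank token indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept

# Hypothetical next-token distribution over a 4-token vocabulary.
toy_probs = [0.5, 0.3, 0.15, 0.05]
print(nucleus_filter(toy_probs, p=0.9))
```

Here the least likely token is excluded because the first three already cover at least 90% of the probability mass. A lower `p` makes generation more conservative, a higher `p` more diverse.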

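Conclusion:

Without any fine-tuning, the pretrained GPT-2 base model reaches a perplexity of about 94 on both evaluation splits (93.97 on the test set and 94.10 on the validation set). This serves as our baseline. We fine-tune the model in other colabs to improve on it.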