#Lab 11 - Story Generation From a Given Prompt Via Transformers
# Ateeb Ahmad
# 334030

##Uploading the Kaggle JSON File to Use the Kaggle Dataset Download API
You can create your own token at Kaggle -> Profile -> Account -> API -> Create New Token.

In [1]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"ateebahmad","key":"f5df25d3e20074fa692b59d153b472a6"}'}

##Moving the JSON File to the Kaggle Folder

In [2]:
!rm -r ~/.kaggle # Removing any existing Kaggle directory
!mkdir ~/.kaggle # Creating a new directory
!mv ./kaggle.json ~/.kaggle/ # Moving the JSON file to the Kaggle directory
!chmod 600 ~/.kaggle/kaggle.json # Restricting permissions so only the owner can read and write the file

rm: cannot remove '/root/.kaggle': No such file or directory
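
The same setup can be written in Python so that it behaves identically whether or not the Kaggle directory already exists. This is a minimal sketch, not part of the Kaggle API: `install_kaggle_token` is a hypothetical helper, and the demonstration uses a temporary directory and a placeholder token instead of `~/.kaggle`.

```python
import shutil
import stat
import tempfile
from pathlib import Path


def install_kaggle_token(token_path, kaggle_dir):
    """Move a Kaggle API token into kaggle_dir and restrict it to the owner."""
    token_path = Path(token_path)
    kaggle_dir = Path(kaggle_dir)

    # exist_ok=True makes this safe to re-run, unlike a bare `mkdir`
    kaggle_dir.mkdir(parents=True, exist_ok=True)

    destination = kaggle_dir / "kaggle.json"
    shutil.move(str(token_path), str(destination))

    # Equivalent of `chmod 600`: read and write for the owner only
    destination.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return destination


# Demonstration in a temporary directory with a placeholder token
with tempfile.TemporaryDirectory() as tmp:
    uploaded = Path(tmp) / "kaggle.json"
    uploaded.write_text('{"username": "example", "key": "placeholder"}')
    installed = install_kaggle_token(uploaded, Path(tmp) / ".kaggle")
    mode = stat.S_IMODE(installed.stat().st_mode)
    print(oct(mode))
```

In Colab, the call would presumably be `install_kaggle_token("kaggle.json", Path.home() / ".kaggle")`.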


##Downloading the Dataset from Kaggle and Unzipping it

In [3]:
!kaggle datasets download -d ratthachat/writing-prompts

Downloading writing-prompts.zip to /content
100% 369M/370M [00:10<00:00, 42.4MB/s]
100% 370M/370M [00:10<00:00, 38.4MB/s]


In [4]:
%%capture
!unzip /content/writing-prompts.zip;

##Importing Necessary Libraries

In [2]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm
from torch.cuda.amp import autocast, GradScaler
import math

device = "cuda" if torch.cuda.is_available() else "cpu"

##Using the Pretrained DistilGPT-2 Model & Tokenizer

In [3]:
# GPT-2 defines no padding token, so one is supplied here for padded batches
tokenizer = GPT2Tokenizer.from_pretrained('distilgpt2', pad_token='pad')
model = GPT2LMHeadModel.from_pretrained('distilgpt2')

model.to(device)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-5): 6 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

##Creating Dataset

In [4]:
class WritingPromptsDataset(Dataset):
  def __init__(self, prompt, story):

    # Reading the source (prompt) and target (story) files line by line;
    # line i of one file pairs with line i of the other
    with open(prompt, encoding="utf-8") as f:
      self.prompt = f.readlines()
    with open(story, encoding="utf-8") as f:
      self.stories = f.readlines()

  def __len__(self):
    return len(self.prompt)

  def __getitem__(self, index):

    # Returning the prompt and its respective story
    return self.prompt[index].strip(), self.stories[index].strip()

In [5]:
train_set = WritingPromptsDataset("/content/writingPrompts/train.wp_source", "/content/writingPrompts/train.wp_target")
val_set = WritingPromptsDataset("/content/writingPrompts/valid.wp_source", "/content/writingPrompts/valid.wp_target")
test_set = WritingPromptsDataset("/content/writingPrompts/test.wp_source", "/content/writingPrompts/test.wp_target")

##Showing Data

In [41]:
for i,j in val_set:
  print("prompt: ", i)
  print("story: ", j)
  break

prompt:  [ WP ] Every person in the world undergoes a `` goodness '' test . It 's designed to give a score from 1 to 200 , where 1 is pure evil , and 200 is an angel in human body . Then the world is divided into 200 zones , where people can live among their own kind .
story:  Clancy Marguerian , 154 , private first class of the 150+ army , sits in his foxhole . Tired cold , wet and hungry , the only thing preventing him from laying down his rifle and walking towards the enemy lines in surrender is the knowledge that however bad he has it here , life as a 50-100 POW is surely much worse . He 's fighting to keep his eyes open and his rifle ready when the mortar shells start landing near him . <newline> <newline> He hunkers lower . <newline> <newline> After a few minutes under the barrage , Marguerian hears hurried footsteps , a grunt , and a thud as a soldier leaps into the foxhole . The man 's uniform is tan , he must be a 50-100 . <newline> <newline> The two men snarl and grab at each
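
The stories mark paragraph breaks with a literal `<newline>` token. For display, that token can be mapped back to real line breaks. A small sketch; `restore_newlines` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import re


def restore_newlines(story):
    """Replace the dataset's literal <newline> markers with real line breaks."""
    # Swallow the spaces around each marker so lines do not start or end with one
    return re.sub(r"\s*<newline>\s*", "\n", story).strip()


sample = "He hunkers lower . <newline> <newline> After a few minutes under the barrage"
print(restore_newlines(sample))
```

The training text itself is left untouched here; this only affects how a story is printed.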

##Creating DataLoaders

In [6]:
batchSize = 6

trainLoader = DataLoader(
    train_set,
    batch_size = batchSize,
    shuffle = True,
)

valLoader = DataLoader(
    val_set,
    batch_size = batchSize,
    shuffle = True
)

testLoader = DataLoader(
    test_set,
    batch_size = 4,
    shuffle = False
)
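
With `DataLoader`'s default `drop_last=False`, one epoch yields `ceil(N / batch_size)` batches, which is the total that a `tqdm` progress bar over the loader counts. A quick sketch of that arithmetic; `batches_per_epoch` is a hypothetical helper and the dataset size of 100 is a made-up example, not a figure from this dataset.

```python
import math


def batches_per_epoch(num_examples, batch_size, drop_last=False):
    """Number of batches a DataLoader yields per epoch."""
    if drop_last:
        return num_examples // batch_size  # incomplete final batch is discarded
    return math.ceil(num_examples / batch_size)  # final batch may be smaller


print(batches_per_epoch(100, 6))                  # 16 full batches plus one batch of 4
print(batches_per_epoch(100, 6, drop_last=True))  # only the 16 full batches
```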

##Defining Optimizer and Gradient Scaler

In [7]:
# Optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)

# Gradient scaler for mixed-precision training
scaler = GradScaler()

##Defining Training & Testing Function

In [8]:
def train_or_val(model, optimizer, dataloader, train):

  if train:
    model.train()
  else:
    model.eval()

  running_loss = 0

  for prompt, stories in tqdm(dataloader):

    # Tokenizing prompts (inputs) and stories (labels), each padded or truncated to 1000 tokens
    input_ids = tokenizer(prompt, return_tensors='pt', truncation=True, padding='max_length', max_length = 1000)['input_ids']
    labels = tokenizer(stories, return_tensors='pt', truncation=True, padding='max_length',  max_length = 1000)['input_ids']

    # Moving the tensors to CUDA or CPU depending on availability
    input_ids = input_ids.to(device)
    labels = labels.to(device)

    # Forward pass with autocasting; gradients are tracked only when training
    with torch.set_grad_enabled(train), autocast():
      yhat = model(input_ids = input_ids, labels = labels)

    # The model returns the cross-entropy loss when labels are given
    loss = yhat.loss
    running_loss += loss.item()

    # Backpropagation
    if train:
      optimizer.zero_grad() # Resetting gradients to avoid accumulation across batches
      scaler.scale(loss).backward() # Calculating gradients of the scaled loss
      scaler.step(optimizer) # Unscaling gradients and updating weights and biases
      scaler.update() # Updating the scale factor for the next iteration

  # Averaging the loss over all batches
  loss = running_loss/len(dataloader)

  return loss
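
`GPT2LMHeadModel` shifts `labels` by one position internally and ignores positions whose label is `-100`. A common alternative to separate prompt and story tensors is therefore to feed one concatenated sequence and mask the positions that should not contribute to the loss. The sketch below shows that label construction on plain Python lists. It is an alternative setup, not what `train_or_val` does, and `build_example` and the toy token ids are hypothetical.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def build_example(prompt_ids, story_ids, max_length, pad_id):
    """Concatenate prompt and story, then mask prompt and padding in the labels."""
    input_ids = (prompt_ids + story_ids)[:max_length]
    # Only story tokens are scored; prompt tokens act as context
    labels = ([IGNORE_INDEX] * len(prompt_ids) + story_ids)[:max_length]

    # Right-pad both sequences to max_length
    padding = max_length - len(input_ids)
    attention_mask = [1] * len(input_ids) + [0] * padding
    input_ids = input_ids + [pad_id] * padding
    labels = labels + [IGNORE_INDEX] * padding
    return input_ids, attention_mask, labels


# Toy ids: a 3-token prompt and a 2-token story, padded to length 8
input_ids, attention_mask, labels = build_example([11, 12, 13], [21, 22], 8, pad_id=0)
print(input_ids)
print(attention_mask)
print(labels)
```

Converted to tensors, these three lists would be passed as `input_ids`, `attention_mask`, and `labels` in a single forward call.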

##Mounting Google Drive for Ease of Downloading & Uploading Checkpoints

In [9]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


##Defining Paths for Checkpointing

In [10]:
checkpoint_path = '/content/drive/MyDrive/DL_Lab11/checkpoint.pth'

##Starting Training

In [None]:
Epochs = 10
train_losses = []
val_losses = []

for i in range(Epochs):

  print(f"Epoch {i + 1}: Train")
  # Note: valLoader is passed here, so the model is fine-tuned on the validation split
  t_loss = train_or_val(model, optimizer, valLoader, True)
  train_losses.append(t_loss)

  print(f"\nTrain Loss: {t_loss:>0.3f}")
  print(f"Perplexity Score: {math.exp(t_loss)}\n")

  checkpoint = {
      "model": model.state_dict(),
      "optimizer": optimizer.state_dict()
      }

  # Saving the Checkpoint after each Epoch
  torch.save(checkpoint, checkpoint_path)

Epoch 1: Train


100%|██████████| 2604/2604 [24:21<00:00,  1.78it/s]



Train Loss: 4.856
Perplexity Score: 128.51598851799886

Epoch 2: Train


100%|██████████| 2604/2604 [24:25<00:00,  1.78it/s]



Train Loss: 4.757
Perplexity Score: 116.3992772716706

Epoch 3: Train


100%|██████████| 2604/2604 [24:15<00:00,  1.79it/s]



Train Loss: 4.723
Perplexity Score: 112.5125796693891

Epoch 4: Train


100%|██████████| 2604/2604 [24:19<00:00,  1.78it/s]



Train Loss: 4.696
Perplexity Score: 109.5618079215732

Epoch 5: Train


100%|██████████| 2604/2604 [24:12<00:00,  1.79it/s]



Train Loss: 4.673
Perplexity Score: 107.03646963227852

Epoch 6: Train


100%|██████████| 2604/2604 [24:13<00:00,  1.79it/s]



Train Loss: 4.653
Perplexity Score: 104.92211479934501

Epoch 7: Train


 77%|███████▋  | 2013/2604 [18:38<05:25,  1.82it/s]

##Testing

In [11]:
# Loading Checkpoint
checkpoint = torch.load(checkpoint_path, map_location=torch.device(device))

# Loading the state of model
model.load_state_dict(checkpoint['model'])

<All keys matched successfully>

In [12]:
loss = train_or_val(model, optimizer, testLoader, False)
print(f"\nTest Loss: {loss:>0.3f}")
print(f"Perplexity Score: {math.exp(loss)}\n")

100%|██████████| 3785/3785 [09:51<00:00,  6.40it/s]


Test Loss: 4.907
Perplexity Score: 135.20822263553882
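
The perplexity score is the exponential of the mean cross-entropy loss (in nats), so a small change in loss is a multiplicative change in perplexity. A short sketch of the conversion, using the rounded test loss printed above; `perplexity` is a hypothetical helper wrapping the same `math.exp` call.

```python
import math


def perplexity(mean_loss):
    """Perplexity for a mean cross-entropy loss measured in nats."""
    return math.exp(mean_loss)


# The loss is rounded to three decimals here, so the result differs slightly
# from the value computed from the unrounded loss.
print(f"{perplexity(4.907):.1f}")
```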






##Running Inference

In [None]:
ids = tokenizer.encode("You 've finally managed to discover the secret to immortality . Suddenly , Death appears before you , hands you a business card , and says , `` When you realize living forever sucks , call this number , I 've got a job offer for you .",
                      return_tensors='pt').to(device)

story = model.generate(
    ids,
    do_sample=True,
    max_new_tokens=150,
    top_k=0,
    num_return_sequences=1,
)

print(tokenizer.decode(story[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
prompt :
You've finally managed to discover the secret to immortality. Suddenly, Death appears before you, hands you a business card, and says, `` When you realize living forever sucks, call this number, I've got a job offer for you. ''
story:
`` Hello? Hello? '' <newline> <newline> Jenny whispered she could sense his name coming from nowhere. A thin, sad looking dude in what appeared to be a complete punk charade was playing on him whilst he secured a massive candy bar with his initials. All around him were carnivals. In other words, Kim Kardashian and Veronica Reynolds at the ends of the job in Canada. Every one of them was right there. Half done. He had actually been invited back to a popular house party. A home party for a few years. One that made him happy enough to not be killed. He had done his best to regain his consciousness and own this carnival. This didn't make sense
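
With `do_sample=True` and `top_k=0`, top-k filtering is switched off and each token is drawn from the model's full output distribution, which tends to produce more varied but less coherent text. The sketch below illustrates the idea of top-k filtering on a toy distribution in plain Python. It is not the `transformers` implementation, and `top_k_probs` and the toy logits are hypothetical.

```python
import math


def top_k_probs(logits, k):
    """Softmax over the k largest logits; k=0 keeps every token."""
    if k > 0:
        cutoff = sorted(logits, reverse=True)[min(k, len(logits)) - 1]
        # Tokens below the k-th largest logit get zero probability
        logits = [x if x >= cutoff else float("-inf") for x in logits]

    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]  # exp(-inf) == 0.0
    total = sum(weights)
    return [w / total for w in weights]


toy_logits = [2.0, 1.0, 0.5, -1.0]
print(top_k_probs(toy_logits, 0))  # all four tokens can be sampled
print(top_k_probs(toy_logits, 2))  # only the two most likely tokens remain
```

Passing a positive `top_k` to `model.generate` applies this kind of cutoff at every generation step.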

In [None]:
ids = tokenizer.encode('Through Iron And Flame',
                      return_tensors='pt').to(device)

story = model.generate(
    ids,
    do_sample=True,
    max_new_tokens=150,
    top_k=0,
    num_return_sequences=1,
)

print(tokenizer.decode(story[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
prompt :
Through Iron And Flame
story:
The river Styx was thin, but still had the stark blue hue of pink and orange, burning with the sounds of the river Styx. The year was 1500, and everyone had accepted the charade of time from high above. Many stared bemused at the beast, and the gladty of the herd. Tycho Novgorville, the charge at the tracks was different, in many respects. <newline> <newline> Every centurion had come looking for the river Styx upon its skin, sealing their wounds for days. All around them were men in bays and ribbons, were weapons of traditional weaponry, and each assumed eye own. A gigantic castle was on its hind legs, projected to soar above the courtyard
