This notebook is based largely on @mf1024's "Teaching GPT-2 a sense of humor".
I've gratefully borrowed much of the fine-tuning and generation logic from his implementation.
Check out his notebook here: https://github.com/mf1024/Transformers/blob/master/Teaching%20GPT-2%20a%20sense%20of%20humor.ipynb

In [None]:
!pip install transformers

In [21]:
import torch
from torch.utils.data import Dataset, DataLoader
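# AdamW comes from torch here; the copy exported by transformers is deprecated in recent releases.
# Note that torch's AdamW defaults to weight_decay=0.01 rather than 0.0.
from torch.optim import AdamW
from transformers import GPT2Tokenizer, GPT2LMHeadModel, get_linear_schedule_with_warmup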
import numpy as np
import os
import random

In [2]:
device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'

In [3]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')
model = GPT2LMHeadModel.from_pretrained('gpt2-medium')
model = model.to(device)

In [4]:
FILE_PATH = os.path.join("storage","data", "film_text.txt")

In [5]:
from language_modelling import ScriptData
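
The `language_modelling` module is local to this project and is not shown in the notebook. As a rough guide only, here is a minimal sketch of what a dataset like `ScriptData` might do: tokenize the text and cut the token ids into fixed-length blocks. The class name `ScriptBlocks`, the `block_size` parameter, and the toy whitespace tokenizer are all assumptions made for illustration; the real class presumably uses the GPT-2 tokenizer and returns `torch` tensors.

```python
from typing import Callable, List


class ScriptBlocks:
    """Hypothetical stand-in for ScriptData: fixed-length blocks of token ids."""

    def __init__(self, text: str, encode: Callable[[str], List[int]], block_size: int = 8):
        ids = encode(text)
        # Drop the ragged tail so every example has exactly block_size tokens.
        usable = len(ids) - len(ids) % block_size
        self.examples = [ids[i:i + block_size] for i in range(0, usable, block_size)]

    def __len__(self) -> int:
        return len(self.examples)

    def __getitem__(self, idx: int) -> List[int]:
        # A real implementation would likely wrap this in torch.tensor(...).
        return self.examples[idx]


def toy_encode(text: str) -> List[int]:
    """Toy tokenizer (assumption): one id per distinct whitespace-separated word."""
    vocab = {}
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


blocks = ScriptBlocks(
    "INT. CAR - NIGHT the man looks up and the man smiles",
    toy_encode,
    block_size=4,
)
print(len(blocks), blocks[0])
```

With tensors in place of lists, a `DataLoader` with `batch_size=1` would yield one block per step, which is how `script` is consumed in the training loop.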

In [15]:
dataset = ScriptData(tokenizer=tokenizer, file_path=FILE_PATH)
script_loader = DataLoader(dataset, batch_size=1, shuffle=True)

In [22]:
BATCH_SIZE = 1
EPOCHS = 3
LEARNING_RATE = 1e-5
WARMUP_STEPS = 10000


In [23]:
model = model.to(device)
model.train()
optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
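# Total number of optimizer steps; the schedule decays the learning rate to zero at this point.
total_steps = EPOCHS * len(script_loader) // BATCH_SIZE
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps=total_steps)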
script_count = 0
sum_loss = 0.0
batch_count = 0
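
To see what the scheduler does to the learning rate, here is a minimal pure-Python sketch of a linear warmup followed by a linear decay, which is the shape `get_linear_schedule_with_warmup` is documented to produce. The function name `linear_warmup_multiplier` and the step counts below are illustrative assumptions, not the library's code.

```python
def linear_warmup_multiplier(step: int, warmup_steps: int, total_steps: int) -> float:
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Ramp linearly from 0 up to 1 over the warmup phase.
        return step / max(1, warmup_steps)
    # Then decay linearly, reaching 0 at total_steps (never negative).
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 1e-5  # same value as LEARNING_RATE above
for step in (0, 50, 100, 550, 1000):
    factor = linear_warmup_multiplier(step, warmup_steps=100, total_steps=1000)
    print(step, base_lr * factor)
```

Under this shape the warmup length should sit well below the total number of optimizer steps. Once the current step reaches or passes the total, the factor is clamped to zero and the weights stop updating.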

In [24]:
for epoch in range(EPOCHS):
    print(f"EPOCH {epoch} started" + '=' * 30)
    for idx, script in enumerate(script_loader):
        # Move the batch to the device once; the inputs double as the labels.
        script = script.to(device)
        outputs = model(script, labels=script)

        loss = outputs[0]
        loss.backward()
        sum_loss += loss.item()

        # Accumulate gradients over BATCH_SIZE scripts before each optimizer step.
        script_count += 1
        if script_count == BATCH_SIZE:
            script_count = 0
            batch_count += 1
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()

        # Every 200 optimizer steps, report the loss and print a few samples.
        if batch_count == 200:
            print(f"sum loss {sum_loss}")
            model.eval()  # turn off dropout while sampling
            sample_outputs = model.generate(
                bos_token_id=random.randint(1, 30000),  # start from a random token
                do_sample=True,
                top_k=50,
                max_length=500,
                top_p=0.95,
                num_return_sequences=3,
            )
            model.train()

            print("Output:\n" + 100 * '-')
            for i, sample_output in enumerate(sample_outputs):
                print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

            batch_count = 0
            sum_loss = 0.0

sum loss 296.13543701171875
Output:
----------------------------------------------------------------------------------------------------
0:  cited.

When you hear the word "pump" and "pump",

You see a deep puddle.

You know that
The air was full of a heavy stench.

You were tired, so when you're in the morning

I can see that you have been going on
About the same.

This is a place where you go to get something.

There's a place where the water
The man and woman walk down
Together.

There's a place where men sit
And drink coffee.
It's not even that far away.

When I used to go
There'd be a girl or a man
Just as they'd walk,

But they're the same.
You don't know
The girl or the man,
It's just that they've never met.
A woman, a man.
I used to love it when I used to go
There'd be a girl or a man
And just walk the same way,
Just a little bit,
A little bit.
A little bit!
I always wanted to
Go down there
It was beautiful,
It was so beautiful.
When I used to go down
There'd be a girl or a man

KeyboardInterrupt: 

In [26]:
from transformers import WEIGHTS_NAME, CONFIG_NAME
output_dir = "./storage/models/"
os.makedirs(output_dir, exist_ok=True)  # torch.save fails if the directory is missing
output_model_file = os.path.join(output_dir, WEIGHTS_NAME)
output_config_file = os.path.join(output_dir, CONFIG_NAME)

torch.save(model.state_dict(), output_model_file)
model.config.to_json_file(output_config_file)
tokenizer.save_vocabulary(output_dir)

('./storage/models/vocab.json', './storage/models/merges.txt')

In [27]:
model = GPT2LMHeadModel.from_pretrained(output_dir)
tokenizer = GPT2Tokenizer.from_pretrained(output_dir)

In [35]:
input_ids = tokenizer.encode('The door opens and the man creeps around the corner.', return_tensors='pt')

In [41]:
sample_outputs = model.generate(
    input_ids=input_ids,
    do_sample=True,  # needed for top_k/top_p to take effect and to get distinct sequences
    top_k=50,
    max_length=200,
    top_p=0.95,
    num_return_sequences=3,
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
0: The door opens and the man creeps around the corner.

He is, in fact Bentley.

Bentley is in the driver's seat, making eye contact. His breath is coming in
a gurgle, which is all he can hear.


INT. CAR - NIGHT




BENTLEY
                          (CONVINCING)
                            That's Bentley.

He stops to watch the rear-view mirror, just out of earshot from the
driver.


EXT. SIDEWALK - EVENING


BENTLEY
                             
1: The door opens and the man creeps around the corner. He stops short and turns.
          The man who opened the door looks up and then back.  He stares 
          back for a beat.  He looks up and smiles.


                         MAN
           Is it me or did your sister 
          call you from prison?
           The man nods, smiles, looks up to a few people on the 
          ground, and nods.  He walks over to the little boy 
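
The `top_k=50` and `top_p=0.95` arguments restrict which tokens can be sampled at each step. The following is a minimal pure-Python sketch of that filtering on a toy distribution, assuming top-k is applied first and nucleus (top-p) filtering second. The function names and the probabilities are made up for illustration, and the library's own implementation works on logits rather than on a dict.

```python
import random
from typing import Dict, List, Tuple


def normalise(items: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Rescale probabilities so they sum to 1."""
    total = sum(p for _, p in items)
    return [(token, p / total) for token, p in items]


def top_k_top_p_filter(probs: Dict[str, float], top_k: int, top_p: float) -> Dict[str, float]:
    """Keep the top_k likeliest tokens, then the smallest prefix whose mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = normalise(ranked[:top_k])
    nucleus, mass = [], 0.0
    for token, p in kept:
        nucleus.append((token, p))
        mass += p
        if mass >= top_p:
            break  # the token that crosses the threshold is still kept
    return dict(normalise(nucleus))


# Toy next-token distribution (assumption, not model output).
toy_probs = {"the": 0.5, "a": 0.25, "man": 0.125, "door": 0.0625, "car": 0.0625}
filtered = top_k_top_p_filter(toy_probs, top_k=4, top_p=0.9)

# Sample the next token from the filtered, renormalised distribution.
next_token = random.choices(list(filtered), weights=list(filtered.values()), k=1)[0]
print(filtered, next_token)
```

These filters only matter when `do_sample=True` is passed to `generate`; with greedy decoding they have no effect.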

In [None]:
model.generate()