### Fine-tune GPT2
This notebook fine-tunes the GPT2 model for text generation based on the ~20k articles I pulled from the media stack API for recent U.S. news in English.
I chose to use GPT2 since it is a popular generative model that is still small enough to be able to train on my GPU.

#### Methodology
1. Load the dataset
2. Set aside a few articles to test on at the end
2. Concatenate the title and descriptions of the articles
3. Tokenize the text and get encoded input
4. Load GPT2 and initialize with the pretrained weights to take advantage of the massive amount of training this model already went through
5. Fine-tune the model by training on the article text for a few epochs
6. Generate the descriptions based on the article title for the test set

##### Training & Generating
A language model is a function that inputs a sequence and outputs a probability distribution for the next token after the input sequence. So we will iterate through our new dataset using the previous sequence (up to a maximum) to predict the next token. Since we have all of the text in the training data, we have the true target value of that next token, so that will be used to calculate the loss and update the model weights. 

While generating new text we will provide the model with the article title as the initial sequence and then continue predicting the next token in the sequence until we reach a specific token that represents the end of the article, this special token (<|endoftext|>) was added to the end of each article's description before encoding the text, and the <|startoftext|> token was added to the beginning of the title.

In [1]:
import os
import pandas as pd
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import GPT2Tokenizer, GPT2LMHeadModel, get_linear_schedule_with_warmup
from torch.optim import AdamW
from tqdm import trange
import torch.nn.functional as F

### load data and drop any rows with null values
## For this we are only going to use the title and description, so drop the rest of the columns
data = pd.read_csv('data/english_news.csv')
data = data.dropna(subset=['title','description'])
data = data.drop(columns=["category","author","url","image","language","country","published_at", "source"])

# Drop the rows that have too much text since we're using GPT2 it has a fairly limited number of tokens (compared to GPT3/4)
data = data[data['description'].apply(lambda x: len(x.split(' ')) < 150)]
data = data[data['title'].apply(lambda x: len(x.split(' ')) < 50)]


# Create a small test set to compare generated text with the reality
test_set = data.sample(n = 200)
df = data.loc[~data.index.isin(test_set.index)]

#Reset the indexes
test_set = test_set.reset_index()
df = df.reset_index()

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
## This custom dataset class will help integrate with the dataloader to load batches of the encoded text data
class NewsArticles(Dataset):  
    def __init__(self, df, tokenizer, control_code="startoftext", truncate=True, gpt2_type="gpt2", max_length=1024):

        self.tokenizer = GPT2Tokenizer.from_pretrained(gpt2_type)
        self.articles = []

        for idx in range(len(df)):
            title = df.loc[idx, "title"]
            description = df.loc[idx, "description"]
            self.articles.append(torch.tensor(
                self.tokenizer.encode(f"<|{control_code}|>{title + ' - ' + description}<|endoftext|>", max_length=max_length, truncation=truncate)
            ))               

        self.article_count = len(self.articles)
        
    def __len__(self):
        return self.article_count

    def __getitem__(self, item):
        return self.articles[item]
    
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')  
dataset = NewsArticles(df, tokenizer, truncate=True, gpt2_type="gpt2") 

In [3]:
#Accumulated batch size
def pack_tensor(new_tensor, packed_tensor, max_seq_len):
    if packed_tensor is None:
        return new_tensor, True, None
    if new_tensor.size()[1] + packed_tensor.size()[1] > max_seq_len:
        return packed_tensor, False, new_tensor
    else:
        packed_tensor = torch.cat([new_tensor, packed_tensor[:, 1:]], dim=1)
        return packed_tensor, True, None

In [4]:
def train(
    dataset, model,
    batch_size=16, epochs=10, lr=2e-5, warmup_steps=200,
    output_dir=".", output_prefix="mediastack", max_seq_len = 768,
    save_model_on_epoch=False,
):
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.cuda()
    model.train()

    optimizer = AdamW(model.parameters(), lr=lr)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=warmup_steps, num_training_steps=-1
    )

    train_dataloader = DataLoader(dataset, batch_size=1, shuffle=True)
    loss=0
    accumulating_batch_count = 0
    input_tensor = None

    for epoch in range(epochs):

        print(f"Training epoch {epoch}")
        print(loss)
        for idx, entry in enumerate(train_dataloader):
            (input_tensor, carry_on, remainder) = pack_tensor(entry, input_tensor, max_seq_len)

            if carry_on and idx != len(train_dataloader) - 1:
                continue

            input_tensor = input_tensor.to(device)
            outputs = model(input_tensor, labels=input_tensor)
            loss = outputs[0]
            loss.backward()

            if (accumulating_batch_count % batch_size) == 0:
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()
                model.zero_grad()

            accumulating_batch_count += 1
            input_tensor = None
        if save_model_on_epoch:
            torch.save(
                model.state_dict(),
                os.path.join(output_dir, f"{output_prefix}-{epoch}.pt"),
            )
    return model

In [6]:
## Here we train the model on the data
model = train(dataset, model, save_model_on_epoch=True)

Training epoch 0
0
Training epoch 1
tensor(3.0816, device='cuda:0', grad_fn=<NllLossBackward0>)
Training epoch 2
tensor(3.3601, device='cuda:0', grad_fn=<NllLossBackward0>)
Training epoch 3
tensor(2.9124, device='cuda:0', grad_fn=<NllLossBackward0>)
Training epoch 4
tensor(2.5391, device='cuda:0', grad_fn=<NllLossBackward0>)
Training epoch 5
tensor(2.9778, device='cuda:0', grad_fn=<NllLossBackward0>)
Training epoch 6
tensor(2.9307, device='cuda:0', grad_fn=<NllLossBackward0>)
Training epoch 7
tensor(2.7018, device='cuda:0', grad_fn=<NllLossBackward0>)
Training epoch 8
tensor(2.8351, device='cuda:0', grad_fn=<NllLossBackward0>)
Training epoch 9
tensor(2.9748, device='cuda:0', grad_fn=<NllLossBackward0>)


Based on the training error seen above, it looks like our training is unstable (it is going up and down). This means I should try the following options to stabalize training: 
1. Get more training data
2. Adjust the parameters (lowering the learning rate may help)
3. Only unfreeze the last few layers so that only those are updated. This would ensure that the model keeps most of its general knowledge about language

In the interest of time, I won't be able to try these different methods since I only have 48 hours to complete this assessment and training takes over an hour on my computer

In [8]:
## The model outputs a probability distribution of what the model thinks the next token is, so in order to generate text
###  We sample from that distribution and then add that next token to the input sequence to continue predicting the next word
###  We continue predicting the next word until we get the <|endoftext|> token, or reach the max length
def generate(
    model,
    tokenizer,
    prompt,
    entry_count=1, #How many different options do we want
    entry_length=25, #maximum number of words
    top_p=0.8, # only keep the most likely tokens (up to this cumulative probability)
    temperature=1.,
):
    model.eval()
    generated_num = 0
    generated_list = []

    filter_value = -float("Inf")
    device = torch.device( "cuda" if torch.cuda.is_available() else "cpu")

    with torch.no_grad():

        for entry_idx in range(entry_count):

            entry_finished = False
            generated = torch.tensor(tokenizer.encode(prompt)).unsqueeze(0).to(device)

            for i in range(entry_length):
                outputs = model(generated, labels=generated)
                loss, logits = outputs[:2]
                logits = logits[:, -1, :] / (temperature if temperature > 0 else 1.0)

                sorted_logits, sorted_indices = torch.sort(logits, descending=True)
                cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)

                sorted_indices_to_remove = cumulative_probs > top_p
                sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[
                    ..., :-1
                ].clone()
                sorted_indices_to_remove[..., 0] = 0

                indices_to_remove = sorted_indices[sorted_indices_to_remove]
                logits[:, indices_to_remove] = filter_value

                next_token = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)
                generated = torch.cat((generated, next_token), dim=1)

                if next_token in tokenizer.encode("<|endoftext|>"):
                    entry_finished = True

                if entry_finished:

                    generated_num = generated_num + 1

                    output_list = list(generated.squeeze().cpu().numpy())
                    output_text = tokenizer.decode(output_list)
                    generated_list.append(output_text)
                    break
            
            if not entry_finished:
              output_list = list(generated.squeeze().cpu().numpy())
              output_text = f"{tokenizer.decode(output_list)}<|endoftext|>" 
              generated_list.append(output_text)
                
    return generated_list

#Function to generate multiple sentences. Test data should be a dataframe
def text_generation(test_data):
  device = torch.device( "cuda" if torch.cuda.is_available() else "cpu" )
  generated_articles = []
  for i in range(len(test_data)):
    x = generate(model.to(device), tokenizer, "<|startoftext|>" + test_data['title'][i], entry_count=1)
    generated_articles.append(x)
  return generated_articles

#Run the functions to generate the descriptions
generated_articles = text_generation(test_set)

In [10]:
## Here we can just generate text based on a different starting sequence
device = torch.device( "cuda" if torch.cuda.is_available() else "cpu" )
prompt = "city"
number_of_generations = 2 # generates multiple options from the input prompt
responses = generate(model.to(device), tokenizer, f"<|startoftext|>{prompt}", number_of_generations)
[response.replace("<|startoftext|>", "").replace("<|endoftext|>","") for response in responses]

['city police officer indicted over getting into trouble with man on bike in New York — and killed in a car crash.\n\nCopyright',
 'city trader blasts condo tower and alleges CEO is denying toxic bets – A resident of New York City is suing City trader Jerrold']

In [11]:
## Set up the articles set aside for testing
test_articles = test_set.apply(lambda x: x['title'] + " - " + x["description"], axis =1).values

In [16]:
## generate from the original GPT-2 model
from transformers import pipeline, set_seed
non_tuned_gpt = pipeline('text-generation', model='gpt2')
def get_non_tuned_generated(text):
    return non_tuned_gpt(text, max_length=75, num_return_sequences=1)

In [17]:
### Here we are going to use the test data to compare the fine-tuned model to the generic GPT-2 model
from IPython.display import Markdown as md
results = []
for pred_article, true_article in zip(generated_articles[7:20], test_articles[7:20]):
    pred_description = pred_article[0].replace("<|startoftext|>", "").replace("<|endoftext|>","...")
    if len(pred_description.split(" - ")) != 2:
        continue
    print(pred_description)
    pred_description = pred_description.split(" - ")[-1]
    orig_title = true_article.split(" - ")[0]
    non_tuned_prediction = get_non_tuned_generated(orig_title + " -- ")
    non_tuned_prediction = non_tuned_prediction[0]["generated_text"].split(" -- ")[-1]
    orig_desc = true_article.split(" - ")[-1]
    result_card = f"##### {orig_title}\nOriginal Description: {orig_desc}\n\nGenerated Description: {pred_description}\n\nNonTuned Generated Text: {non_tuned_prediction}\n\n_________________________________________________________"
    results.append(result_card)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Greywind, From Chef Behind Loring Place, Opens Near Hudson Yards - Greywind, From Chef Behind Loring Place, Opens Near Hudson Yards...


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sacramento taqueria allegedly used priest to get confessions of workplace ‘sins’ - Sacramento police received two complaints of mass abuses during a hiring process for an Atlanta taqueria after it alleged that two priests...


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Forchelli Deegan Terrana forms new practice group, adds management skills - Forchelli Deegan is still up on the guidance she was given by GM Gary Sanchez when she...


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Muhammad Ali’s Daughter Giving ‘Lifeline’ to Police Athletes Receives High Praises From Fans For Protesting their Female Body’s Reaction to Her Mother's Behaviour - “Muhammad Ali’s...


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Wayward cow spotted in east Montgomery Township pond - Police say a Wayne County deer was spotted in east Montgomery Township pond by a neighbor, and police said they got...


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Fabrizio Romano claims Arsenal and Chelsea target's new contract extension 'doesn't change the situation for the summer' - Fabrizio Romano claims Arsenal and Chelsea target's new contract extension 'doesn't change the situation for the summer'...


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Avril Lavigne and Tyga Split, But Still Friends From AVENGERS 2 - (Vincent Malardi/Getty Images) Avril Lavigne and Tyga Split, But...


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Keystone Wealth Services LLC Has $789,000 Position in The Walt Disney Company (NYSE:DIS) - Keystone Wealth Services LLC has $789,000 position in The Walt Disney Company (NYSE:DIS)...


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Color Of Change Selects Thoughtworks to Build Social Change Campaign Management Tool For US Students - Color of Change Selects Thoughtworks to Build Social Change Campaign Management Tool For US Students...


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Trump Says He Will Be Arrested on Tuesday as Indictment Looms After Crowd of Protesters Involved in Protest [Video] - President Donald Trump told reporters that he will be arrested as part of...


We can look at the results below and compare the original description, generated description and the generated text from a non-tuned GPT2 model.

Looking through some of these results, there is definitely some areas of improvement for the fine-tuned model. A longer training time, adding more data, or changing the task to use the article description to generate a title might work better since there is more information contained in the description than the title. It looks like the non-tuned GPT2 model is more likely to hallucinate and alter the context of the description to include text that might be more common in its original training data (such as talking about GM in 2014 when the article title is about Ford recovering auto sales, not GM) wheras the tuned model has a habit of repeating the article title as the description, this might be due to some articles in the training data repeating its article as the description (such as the Financial Advocates article below).


In [18]:
md("\n\n".join(results))

##### Greywind, From Chef Behind Loring Place, Opens Near Hudson Yards
Original Description: A spot offering coffee by day and noodles, dumplings and more by night in Long Island City, a lobster-based spinoff from the Seamore’s team, and a return for a fine-dining personality.

Generated Description: Greywind, From Chef Behind Loring Place, Opens Near Hudson Yards...

NonTuned Generated Text: __________________

Post Extras:


Quote:

turtle wrote:

Hey all,


In your thread I've got the new version which I'm running on my ps3. If anyone is interested, do let me know if you have any issues.



_________________________________________________________

##### Sacramento taqueria allegedly used priest to get confessions of workplace ‘sins’
Original Description: SACRAMENTO, Calif.&#160;-&#160;The owners of&#160;multiple Sacramento taquerias&#160;will have to shell out thousands in damages to former employees and fines after they were busted hiring an alleged priest to spy on workers and extract confessions of workplace "sins." "Federal wage and hour investigators have seen corrupt employers try all kinds of scams to shortchange workers and toThe post Sacramento taqueria allegedly used priest to get confessions of workplace &#8216;sins&#8217; appeared first on KION546.

Generated Description: Sacramento police received two complaints of mass abuses during a hiring process for an Atlanta taqueria after it alleged that two priests...

NonTuned Generated Text: 1 and 5 of 17 people arrested. The same report says that the man is being detained for six months.

‍‍‍‍‍‍ ‪ ‪ … … ‹

_________________________________________________________

##### Forchelli Deegan Terrana forms new practice group
Original Description: Co-chairing the firm's Securities Litigation & Regulation practice group are two new partners

Generated Description: Forchelli Deegan is still up on the guidance she was given by GM Gary Sanchez when she...

NonTuned Generated Text:  (left),  (right) during the Spring and fall of 2014 to discuss her new teaching methods, and she is currently the President of the American College of Obstetricians and Gynecologists.  She blogs at   www.thewoman.com. Dr. Delaney is

_________________________________________________________

##### Muhammad Ali’s Daughter Giving ‘Lifeline’ to Police Athletes Receives High Praises From Fans
Original Description: Widely regarded as the Greatest boxer of all time, Muhammad Ali spent his entire life helping people. His moments for social justice and the rights of the people facing discrimination made him a worldwide icon. The late legend made the service of society his foremost priority. He even put his boxing career at stake for&#8230;The post Muhammad Ali’s Daughter Giving ‘Lifeline’ to Police Athletes Receives High Praises From Fans appeared first on EssentiallySports.

Generated Description: “Muhammad Ali’s...

NonTuned Generated Text: 

ADVERTISEMENT

"As a father, I find comfort in understanding my daughter, her feelings... and she made a mistake in speaking about her father's feelings, and our daughter's family can't say that,"

_________________________________________________________

##### Wayward cow spotted in east Montgomery
Original Description: Folks in east Montgomery got a surprise Monday morning when a cow was seen wandering near EastChase.

Generated Description: Police say a Wayne County deer was spotted in east Montgomery Township pond by a neighbor, and police said they got...

NonTuned Generated Text: !!!!!!!! https://t.co/1Y6xjUO2eQ — Jason Foltz (@jason_foltz) May 29, 2017

The animal did not respond to a request for comment Wednesday evening.

A Montgomery County Sheriff's Office spokeswoman said the animal was a coyote

_________________________________________________________

##### Fabrizio Romano claims Arsenal and Chelsea target's new contract extension 'doesn't change the situation for the summer'
Original Description: Fabrizio Romano claims Arsenal and Chelsea target's new contract extension 'doesn't change the situation for the summer'

Generated Description: Fabrizio Romano claims Arsenal and Chelsea target's new contract extension 'doesn't change the situation for the summer'...

NonTuned Generated Text: ???

There is a huge difference of opinion on the matter. We are talking about a contract extension which means that Arsenal's only contract at the end of the season runs through to August 2015.

It's not like Arsene is

_________________________________________________________

##### Avril Lavigne and Tyga Split, But Still Friends
Original Description: Tyga and Avril Lavigne aren't a couple anymore, 'cause they've broken up -- but they're still on good terms ... TMZ has learned. Sources tell us the two musicians -- who went public with their relationship in March -- recently went their separate&hellip;

Generated Description: (Vincent Malardi/Getty Images) Avril Lavigne and Tyga Split, But...

NonTuned Generated Text:  Kendrick Lamar & Kanye West
I'd Like to Give You A Real Love Song This Christmas (I'm So So Lucky I Know  (I'm So Lucky I Know, This Christmas)]
The Night A Man Bodies First And Gets His Pants Back (I'm So So Lucky

_________________________________________________________

##### Keystone Wealth Services LLC Has $789,000 Position in The Walt Disney Company (NYSE:DIS)
Original Description: Keystone Wealth Services LLC Has $789,000 Position in The Walt Disney Company (NYSE:DIS)

Generated Description: Keystone Wealth Services LLC has $789,000 position in The Walt Disney Company (NYSE:DIS)...

NonTuned Generated Text: ‍‍‍‍‍‍‍‍‍‍�

_________________________________________________________

##### Color Of Change Selects Thoughtworks to Build Social Change Campaign Management Tool
Original Description: (marketscreener.com) Thoughtworks , a global technology consultancy that integrates strategy, design, and engineering, today announced that it will collaborate with the Color Of Change, a 501 organization and the nation's largest online racial justice organization, to build a tool that helps teams comprised of diverse staff and volunteers collaborate and manage...https://www.marketscreener.com/quote/stock/THOUGHTWORKS-HOLDING-INC-126986329/news/Color-Of-Change-Selects-Thoughtworks-to-Build-Social-Change-Campaign-Management-Tool-44149464/?utm_medium=RSS&utm_content=20230620

Generated Description: Color of Change Selects Thoughtworks to Build Social Change Campaign Management Tool For US Students...

NonTuned Generated Text:  to help the organization build and maintain social trust.   In addition to those "digital" aspects of building trust, there are those activities I'm interested in exploring now:     Social Media Management Tool
   Community Engagement Tool
In order to engage social media with an organization I know

_________________________________________________________

##### Trump Says He Will Be Arrested on Tuesday as Indictment Looms
Original Description: His indictment by a Manhattan grand jury is expected, but its timing is unclear.

Generated Description: President Donald Trump told reporters that he will be arrested as part of...

NonTuned Generated Text:  It's Time to Move on from #RapePolice. The Problem IS They Don't Stop Telling It! Read More 
The Trump Administration Is Not A Good Thing and So What is it? I was curious at first what's the Trump administration all about. The President talks

_________________________________________________________