<a href="https://colab.research.google.com/github/dinesh-umkc/kdm/blob/main/ICP_12_GPT_2_Fine_Tuning_Text_Generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GPT-2 Fine-Tuning Tutorial with PyTorch & Huggingface in Colab




# Objective

* Use GPT-2 for text generation
* Explore top-K sampling
* Fine-tune GPT-2 to generate fake news headlines



# Setup

In [1]:
!pip install transformers
!pip install datasets


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.11.1 tokenizers-0.13.2 transformers-4.24.0
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.7.1-py3-none-any.whl (451 kB)

In [2]:
import torch, os, re, pandas as pd, json
from sklearn.model_selection import train_test_split
from transformers import DataCollatorForLanguageModeling, DataCollatorWithPadding, GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments, AutoConfig
from datasets import Dataset


In [3]:
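def pretty_print(text, max_len_line=100):
    """Print `text`, wrapping lines at roughly `max_len_line` characters."""
    words = text.split(' ')
    line = ''
    for w in words:
        # A standalone newline token forces a line break
        if w == '\n':
            print(line)
            line = ''
            continue
        # Flush the current line before it grows past the limit
        if (len(line) + len(w)) > max_len_line:
            print(line)
            line = ''
        line += ' ' + w
    print(line)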


In [4]:
# We load the model
base_model = GPT2LMHeadModel.from_pretrained('gpt2-medium')
# options: ['gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl']



In [5]:
# We load the tokenizer
base_tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')



In [6]:
text = "Hi, I'm Dinesh and I work as a Software Architect"
base_tokenizer.tokenize(text)


['Hi',
 ',',
 'ĠI',
 "'m",
 'ĠD',
 'ines',
 'h',
 'Ġand',
 'ĠI',
 'Ġwork',
 'Ġas',
 'Ġa',
 'ĠSoftware',
 'ĠArchitect']

In [7]:
text_ids = base_tokenizer.encode(text, return_tensors = 'pt')
text_ids

# tensorflow
#text_ids = base_tokenizer.encode(text, return_tensors = 'tf')


tensor([[17250,    11,   314,  1101,   360,  1127,    71,   290,   314,   670,
           355,   257, 10442, 17340]])

In [8]:
generated_text_samples = base_model.generate(
    text_ids
)
generated_text_samples


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


tensor([[17250,    11,   314,  1101,   360,  1127,    71,   290,   314,   670,
           355,   257, 10442, 17340,   379,   257,  1588,  3788,  1664,    13]])

In [9]:
for i, beam in enumerate(generated_text_samples):
    print(f"{i}: {base_tokenizer.decode(beam, skip_special_tokens=True)}")
    print()


0: Hi, I'm Dinesh and I work as a Software Architect at a large software company.



In [10]:
generated_text_samples = base_model.generate(
    text_ids,
    max_length= 100,
)

for i, beam in enumerate(generated_text_samples):
    print(f"{i}: {base_tokenizer.decode(beam, skip_special_tokens=True)}")
    print()



0: Hi, I'm Dinesh and I work as a Software Architect at a large software company. I'm also a software engineer and I'm currently working on a project called "The Future of Software".

I'm also a big fan of the Internet and I'm a big fan of open source software.

I'm also a big fan of the Internet and I'm a big fan of open source software.

I'm also a big fan of the Internet and I'm a
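
The loop in the output above is typical of greedy decoding, which `generate` uses when neither sampling nor beam search is requested: at every step it appends the single most probable next token. The sketch below illustrates the effect with a hand-written next-token table. `NEXT_TOKEN_PROBS` and `greedy_decode` are hypothetical names and the probabilities are made up for illustration; they are not GPT-2's.

```python
# Toy bigram "language model": maps a token to the probabilities of the next token.
# The table is invented for illustration and is not derived from GPT-2.
NEXT_TOKEN_PROBS = {
    "I'm": {"also": 0.6, "a": 0.4},
    "also": {"a": 0.9, "happy": 0.1},
    "a": {"fan": 0.7, "developer": 0.3},
    "fan": {".": 0.8, "too": 0.2},
    ".": {"I'm": 0.9, "<eos>": 0.1},
}

def greedy_decode(start, max_length):
    """Always pick the most probable next token until max_length or <eos>."""
    tokens = [start]
    while len(tokens) < max_length:
        candidates = NEXT_TOKEN_PROBS.get(tokens[-1])
        if not candidates:
            break  # no continuation known for this token
        best = max(candidates, key=candidates.get)
        if best == "<eos>":
            break
        tokens.append(best)
    return tokens

print(" ".join(greedy_decode("I'm", max_length=12)))
```

Because the most probable path leads back to a token that was already visited, the toy model cycles through the same sentence, much like the repeated "I'm also a big fan..." lines above. GPT-2 conditions on the whole context rather than a single token, so this is only an analogy for why greedy output tends to loop.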



# Remove repeated text with beam search

In [11]:
# text generation example
generated_text_samples = base_model.generate(
    text_ids,
    max_length= 50,  
    num_beams=5,
    no_repeat_ngram_size=2,
    num_return_sequences= 5,
    early_stopping=True 
)

for i, beam in enumerate(generated_text_samples):
  print(f"{i}: {base_tokenizer.decode(beam, skip_special_tokens=True)}")
  print()




0: Hi, I'm Dinesh and I work as a Software Architect at Google.

I've been a software engineer for over 10 years. I've worked on a wide variety of projects, from web applications to mobile apps, web services,

1: Hi, I'm Dinesh and I work as a Software Architect at Google.

I've been a software engineer for over 10 years. I've worked on a wide variety of projects, from web applications to mobile apps, and have been

2: Hi, I'm Dinesh and I work as a Software Architect at Google.

I've been a software engineer for over 10 years. I've worked on a wide variety of projects, from web applications to mobile apps, web services to

3: Hi, I'm Dinesh and I work as a Software Architect at Google.

I've been a software engineer for over 10 years. I've worked on a wide variety of projects, from web applications to mobile apps, web services and

4: Hi, I'm Dinesh and I work as a Software Architect at Google.

I've been a software engineer for over 10 years. I've worked on a wide variety of projects
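
The `no_repeat_ngram_size=2` argument used above prevents any 2-gram from appearing twice in a generated sequence. The following is a minimal sketch of that idea in plain Python, not the library's implementation: given the tokens generated so far, it lists the next tokens that would complete an n-gram that already occurred. The function name `banned_next_tokens` is hypothetical.

```python
def banned_next_tokens(tokens, ngram_size):
    """Return the tokens that would complete an n-gram already present in `tokens`."""
    if ngram_size < 1 or len(tokens) + 1 < ngram_size:
        return set()
    # The last n-1 tokens form the prefix of the n-gram about to be completed
    prefix = tuple(tokens[len(tokens) - (ngram_size - 1):])
    banned = set()
    for i in range(len(tokens) - ngram_size + 1):
        # If this earlier n-gram starts with the same prefix, ban its final token
        if tuple(tokens[i:i + ngram_size - 1]) == prefix:
            banned.add(tokens[i + ngram_size - 1])
    return banned

history = ["I", "am", "a", "fan", "of", "open", "source", "and", "a"]
print(banned_next_tokens(history, ngram_size=2))
```

Here the sequence ends in "a", and "a fan" has already been produced, so "fan" is the one token ruled out for the next step. A decoder would set the scores of banned tokens to negative infinity before choosing the next token.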

# Sampling

In [12]:
# text generation example
generated_text_samples = base_model.generate(
    text_ids,
    max_length= 50,  
    do_sample=True,  
    top_k=0,
    num_return_sequences= 5
)

for i, beam in enumerate(generated_text_samples):
  pretty_print(f"{i}: {base_tokenizer.decode(beam, skip_special_tokens=True)}")
  print()




 0: Hi, I'm Dinesh and I work as a Software Architect for a blockchain than let's say small startup,
 but I have a life which can be scarifying if I make mistakes. So I thought why not build something
 from scratch, right

 1: Hi, I'm Dinesh and I work as a Software Architect at Yinpress. I work with life's everyday
 company problems. More specifically, I care deeply about new ideas.

I have started my career as a
 Search Engine Optimizer

 2: Hi, I'm Dinesh and I work as a Software Architect with AWS. I see my job there is to ensure
 Amazon EC2 is running locally and health of the server can be monitored over the Web. We get to
 match queries and logs

 3: Hi, I'm Dinesh and I work as a Software Architect and plan and create a website ringtail pattern
 for the head metal penny.I have over 25 years of experience by which I can say that I am definitely
 enamoured of this

 4: Hi, I'm Dinesh and I work as a Software Architect for a small manufacturing firm here in
 Bengaluru. I have a Ph

# Use the temperature parameter

In [13]:
# text generation example
generated_text_samples = base_model.generate(
    text_ids,
    max_length= 50,  
    do_sample=True,  
    top_k=0,
    temperature=0.9,
    num_return_sequences= 5
)

for i, beam in enumerate(generated_text_samples):
  pretty_print(f"{i}: {base_tokenizer.decode(beam, skip_special_tokens=True)}")
  print()




 0: Hi, I'm Dinesh and I work as a Software Architect for a large New Orleans accounting firm. Most
 of the customers' business is related to online banking and I am very interested in how to put
 together sophisticated management tools that help make a

 1: Hi, I'm Dinesh and I work as a Software Architect at Voicemail Beta. I look a lot like Dinesh but
 like…different. In my case, the plan isn't necessarily "to be like Dinesh"

 2: Hi, I'm Dinesh and I work as a Software Architect at Microsoft, and I'm a mobile adoption
 advocate for the nonprofit organization, in- memory mobile healthcare organization, and we've been
 serving tiny offices and larger companies with technologies that can

 3: Hi, I'm Dinesh and I work as a Software Architect at IFrame {liddesktop}. I also worked as a
 developer at Thomas Digital in Zurich and we deliver software engineering solutions for clients
 around the world, including desktop applications,

 4: Hi, I'm Dinesh and I work as a Software Architect a
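
The `temperature` parameter divides the model's logits before the softmax, so values below 1 sharpen the distribution toward the most likely tokens and values above 1 flatten it. The sketch below shows this on three made-up logits; `softmax_with_temperature` is a hypothetical helper, not a function from `transformers`.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by `temperature`."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
for t in (1.0, 0.9, 0.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

With `temperature=0.9`, as in the cell above, the shift is small, which may explain why the samples are still fairly diverse.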

# Top-K sampling

In [14]:
# text generation example
generated_text_samples = base_model.generate(
    text_ids,
    max_length= 50,  
    do_sample=True,  
    top_k=25,
    num_return_sequences= 5
)

for i, beam in enumerate(generated_text_samples):
  pretty_print(f"{i}: {base_tokenizer.decode(beam, skip_special_tokens=True)}")
  print()




 0: Hi, I'm Dinesh and I work as a Software Architect at a healthcare tech company. I've been doing
 this in my spare time for 8 months so far.

When I was 16 I started a job at a tech startup in

 1: Hi, I'm Dinesh and I work as a Software Architect at a small software company, but I'm interested
 in a wider range of areas. My interest is in how teams are working together, and I really want to be
 able to talk

 2: Hi, I'm Dinesh and I work as a Software Architect for Google. My background is Computer Science,
 but it is very hard for me to leave my current job because I love the project and the challenge. I
 would love your help to

 3: Hi, I'm Dinesh and I work as a Software Architect for a very large IT infrastructure company. My
 job involves many components including database administration, backup, load balancing, and web
 development/web hosting to name a few.

My

 4: Hi, I'm Dinesh and I work as a Software Architect on Microsoft's SQL Server. This post will
 explore what is h
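
Top-K sampling restricts each step to the K highest-scoring tokens and samples only among those, which cuts off the long tail of unlikely tokens that produced some of the odd phrases in the unrestricted samples earlier. The following is a minimal sketch of one sampling step over a list of made-up logits; `top_k_sample` is a hypothetical helper, and `k <= 0` is treated as "no filtering" to mirror how `top_k=0` was used in the earlier cells.

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    """Sample a token index from the k highest-scoring entries of `logits`."""
    if k <= 0 or k > len(logits):
        k = len(logits)  # no filtering
    # Indices of the k largest logits
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    # Unnormalized softmax weights over the surviving tokens
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

logits = [0.1, 3.0, 2.5, -1.0]  # hypothetical scores for a 4-token vocabulary
rng = random.Random(0)
print([top_k_sample(logits, k=2, rng=rng) for _ in range(10)])
```

With `k=2`, only indices 1 and 2 can ever be drawn, no matter how many samples are taken.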

# Fine-tuning: generate fake news

In [15]:
# Mount Google Drive to access the dataset
from google.colab import drive
drive.mount("/content/gdrive")

Mounted at /content/gdrive


In [16]:
filepath= '/content/gdrive/MyDrive/Projects/Data/archive/articles1.csv'
df = pd.read_csv(filepath, encoding = 'utf-8', usecols=['title', 'publication'])\
                    .rename(columns={'title': 'text'})

pd.set_option("display.max_colwidth", None)
df.head(5)

| | text | publication |
|---|---|---|
| 0 | House Republicans Fret About Winning Their Health Care Suit - The New York Times | New York Times |
| 1 | Rift Between Officers and Residents as Killings Persist in South Bronx - The New York Times | New York Times |
| 2 | Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial Bias, Dies at 106 - The New York Times | New York Times |
| 3 | Among Deaths in 2016, a Heavy Toll in Pop Music - The New York Times | New York Times |
| 4 | Kim Jong-un Says North Korea Is Preparing to Test Long-Range Missile - The New York Times | New York Times |


In [17]:
def remove_publication_headline(headline, publication):
    # The publication column doesn't always match the newspaper name in the title,
    # so the suffix is only stripped when the name appears in the headline
    if str(publication) in str(headline):
        headline = headline.split(' - ')[0]
    return headline

def process_headlines(df, text_colname):

    # Remove empty and null rows (copy to avoid pandas' SettingWithCopyWarning below)
    empty_title = (df[text_colname].str.len() == 0) | df[text_colname].isna()
    df = df[~empty_title].copy()

    # Remove publication name from title
    df[text_colname] = df.apply(lambda row: remove_publication_headline(row[text_colname], row['publication']), axis = 1)

    # Remove headlines with fewer than 8 words
    title_len_ge8 = (df[text_colname].str.split().apply(len) >= 8)
    df = df[title_len_ge8]

    # Drop duplicates
    text_df = df.drop_duplicates(subset = [text_colname])\
                [[text_colname]]

    return text_df
    
df = process_headlines(df, 'text')
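
To see what the suffix removal does to a single row, the sketch below applies `remove_publication_headline` to one headline from the table above and to one made-up headline that contains ` - ` twice. It also shows `strip_publication_suffix`, a hypothetical variant (not part of the notebook) that cuts only at the last ` - `.

```python
def remove_publication_headline(headline, publication):
    # Same logic as in the notebook: cut at the first ' - ' when the publication name appears
    if str(publication) in str(headline):
        headline = headline.split(' - ')[0]
    return headline

def strip_publication_suffix(headline, publication):
    # Hypothetical variant: drop only the text after the LAST ' - '
    if str(publication) in str(headline):
        headline = headline.rsplit(' - ', 1)[0]
    return headline

real = "Among Deaths in 2016, a Heavy Toll in Pop Music - The New York Times"
made_up = "Stocks Rally - and Then Slide - The New York Times"  # invented example

print(remove_publication_headline(real, "New York Times"))
print(remove_publication_headline(made_up, "New York Times"))
print(strip_publication_suffix(made_up, "New York Times"))
```

The notebook's version keeps only the text before the first ` - `, so a headline with an internal dash is shortened more than intended. Whether that matters depends on how often such headlines occur in the dataset, which is not checked here.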


In [20]:
# Define the bos, eos and pad tokens
bos = '<|endoftext|>'
eos = '<|EOS|>'
pad = '<|pad|>'

special_tokens_dict = {'eos_token': eos, 'bos_token': bos, 'pad_token': pad}

# Add the special tokens to the tokenizer (tokens already in the vocabulary are not added again)
num_added_toks = base_tokenizer.add_special_tokens(special_tokens_dict)

# the model config to which we add the special tokens
config = AutoConfig.from_pretrained('gpt2-medium', 
                                    bos_token_id=base_tokenizer.bos_token_id,
                                    eos_token_id=base_tokenizer.eos_token_id,
                                    pad_token_id=base_tokenizer.pad_token_id,
                                    output_hidden_states=False)

# the pre-trained model is loaded with the custom configuration
base_model = GPT2LMHeadModel.from_pretrained('gpt2-medium', config=config)

# the model embedding is resized
base_model.resize_token_embeddings(len(base_tokenizer))


Embedding(50259, 1024)
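
The embedding size of 50259 follows from simple arithmetic: the stock GPT-2 vocabulary has 50,257 tokens, `<|endoftext|>` is already one of them (ID 50256), so only `<|EOS|>` and `<|pad|>` are new. The sketch below reproduces that count with a plain dictionary standing in for the tokenizer's vocabulary; `add_special_tokens` here is a simplified stand-in written for illustration, not the `transformers` method.

```python
GPT2_VOCAB_SIZE = 50257  # stock GPT-2 vocabulary; '<|endoftext|>' has ID 50256

def add_special_tokens(vocab, new_tokens):
    """Append tokens that are not yet in `vocab`; return how many were added."""
    added = 0
    for tok in new_tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)  # next free ID
            added += 1
    return added

# Placeholder entries stand in for the real subword tokens
vocab = {f"tok{i}": i for i in range(GPT2_VOCAB_SIZE - 1)}
vocab["<|endoftext|>"] = GPT2_VOCAB_SIZE - 1

num_added = add_special_tokens(vocab, ["<|endoftext|>", "<|EOS|>", "<|pad|>"])
print(num_added, len(vocab), vocab["<|EOS|>"], vocab["<|pad|>"])
```

The resulting IDs (50257 for `<|EOS|>`, 50258 for `<|pad|>`) agree with the `eos_token_id` and `pad_token_id` values in the configuration printed when the fine-tuned model is loaded later in the notebook.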

In [21]:
df['text'] = bos + ' ' + df['text'] + ' ' + eos

df_train, df_val = train_test_split(df, train_size = 0.9, random_state = 77)
print(f'There are {len(df_train)} headlines for training and {len(df_val)} for validation')


There are 36380 headlines for training and 4043 for validation


In [22]:
# we load the datasets directly from a pandas df
train_dataset = Dataset.from_pandas(df_train[['text']])
val_dataset = Dataset.from_pandas(df_val[['text']])


In [23]:
def tokenize_function(examples):
    return base_tokenizer(examples['text'], padding=True)


tokenized_train_dataset = train_dataset.map(
    tokenize_function,
    batched=True,
    num_proc=5,
    remove_columns=['text'],
)
tokenized_val_dataset = val_dataset.map(
    tokenize_function,
    batched=True,
    num_proc=5,
    remove_columns=['text'],
)


      


In [24]:
# Example of the result of the tokenization process with padding
base_tokenizer.decode(tokenized_train_dataset['input_ids'][0])


'<|endoftext|> Donald Trump: Hillary Clinton ’Opened the Pandora’s Box of Radical Islam’ <|EOS|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|>'

# Training

In [25]:
model_headlines_path = '/content/gdrive/MyDrive/Projects/Data/archive/model_headlines_news'

training_args = TrainingArguments(
    output_dir=model_headlines_path,          # output directory
    num_train_epochs=6,              # total # of training epochs
    per_device_train_batch_size=32,  # batch size per device during training
    per_device_eval_batch_size=16,   # batch size for evaluation
    warmup_steps=200,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir=model_headlines_path,            # directory for storing logs
    prediction_loss_only=True,
    save_steps=10000 
)
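
The `warmup_steps=200` setting controls how the learning rate ramps up at the start of training. The cell above sets neither a learning rate nor a scheduler, so the sketch below assumes the `Trainer` defaults of a peak learning rate of `5e-5` and a linear schedule; both are assumptions about the installed version, and `linear_schedule_lr` is a hypothetical helper. The total of 6822 steps is taken from the training log further down.

```python
def linear_schedule_lr(step, peak_lr=5e-5, warmup_steps=200, total_steps=6822):
    """Linear warmup from 0 to peak_lr, then linear decay back to 0 (assumed defaults)."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 100, 200, 3511, 6822):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the rate reaches its peak at step 200 and falls to zero at the final step, so most of training runs at a gradually shrinking rate.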


In [26]:
data_collator = DataCollatorForLanguageModeling(
        tokenizer=base_tokenizer,
        mlm=False
    )
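
With `mlm=False`, the collator prepares batches for causal language modeling: the labels are a copy of `input_ids` in which padding positions are replaced by `-100`, the index that the cross-entropy loss ignores. The sketch below mimics that step on a plain list. `make_causal_lm_labels` is a hypothetical helper, the word token IDs are invented, and the special token IDs are the ones from this notebook's configuration.

```python
IGNORE_INDEX = -100   # label value skipped by the loss
PAD_ID = 50258        # '<|pad|>' in this notebook's tokenizer

def make_causal_lm_labels(input_ids, pad_token_id=PAD_ID):
    """Copy input_ids, masking padding positions so they do not contribute to the loss."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]

# '<|endoftext|>' (50256), three made-up word IDs, '<|EOS|>' (50257), two pads
example = [50256, 1101, 2202, 3303, 50257, 50258, 50258]
labels = make_causal_lm_labels(example)
print(labels)
```

Because `<|pad|>` is a different token from `<|EOS|>`, the end-of-headline token keeps its label and the model can learn where a headline stops. If padding reused the EOS token, those positions would be masked as well.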


In [27]:
trainer = Trainer(
    model=base_model,                         # the instantiated  Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    data_collator=data_collator,
    train_dataset=tokenized_train_dataset,         # training dataset
    eval_dataset=tokenized_val_dataset            # evaluation dataset
)
trainer.train()


The following columns in the training set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: __index_level_0__. If __index_level_0__ are not expected by `GPT2LMHeadModel.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 36380
  Num Epochs = 6
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 6822
  Number of trainable parameters = 354825216


| Step | Training Loss |
|---|---|
| 500 | 7.1396 |
| 1000 | 3.5415 |
| 1500 | 3.2549 |
| 2000 | 3.1776 |
| 2500 | 3.024 |
| 3000 | 2.8788 |
| 3500 | 2.8436 |
| 4000 | 2.6467 |
| 4500 | 2.6741 |
| 5000 | 2.5088 |



Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=6822, training_loss=3.1197480777951845, metrics={'train_runtime': 8271.6335, 'train_samples_per_second': 26.389, 'train_steps_per_second': 0.825, 'total_flos': 2.0489669481775104e+16, 'train_loss': 3.1197480777951845, 'epoch': 6.0})
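
The "Total optimization steps = 6822" line in the log can be reproduced from the dataset size, the batch size and the number of epochs. The sketch below does the arithmetic for a single device; `total_optimization_steps` is a hypothetical helper written for this check.

```python
import math

def total_optimization_steps(num_examples, batch_size, epochs, grad_accum_steps=1):
    """Number of optimizer updates for a single-device run."""
    batches_per_epoch = math.ceil(num_examples / batch_size)  # last batch may be partial
    steps_per_epoch = max(1, batches_per_epoch // grad_accum_steps)
    return steps_per_epoch * epochs

steps = total_optimization_steps(num_examples=36380, batch_size=32, epochs=6)
print(steps)  # 1137 batches per epoch * 6 epochs = 6822, matching the log
```

Doubling the batch size or halving the number of epochs would roughly halve the number of steps, which is a quick way to estimate runtime before starting a run.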

In [28]:
trainer.save_model()
base_tokenizer.save_pretrained(model_headlines_path)


Saving model checkpoint to /content/gdrive/MyDrive/Projects/Data/archive/model_headlines_news
Configuration saved in /content/gdrive/MyDrive/Projects/Data/archive/model_headlines_news/config.json
Model weights saved in /content/gdrive/MyDrive/Projects/Data/archive/model_headlines_news/pytorch_model.bin
tokenizer config file saved in /content/gdrive/MyDrive/Projects/Data/archive/model_headlines_news/tokenizer_config.json
Special tokens file saved in /content/gdrive/MyDrive/Projects/Data/archive/model_headlines_news/special_tokens_map.json
added tokens file saved in /content/gdrive/MyDrive/Projects/Data/archive/model_headlines_news/added_tokens.json


('/content/gdrive/MyDrive/Projects/Data/archive/model_headlines_news/tokenizer_config.json',
 '/content/gdrive/MyDrive/Projects/Data/archive/model_headlines_news/special_tokens_map.json',
 '/content/gdrive/MyDrive/Projects/Data/archive/model_headlines_news/vocab.json',
 '/content/gdrive/MyDrive/Projects/Data/archive/model_headlines_news/merges.txt',
 '/content/gdrive/MyDrive/Projects/Data/archive/model_headlines_news/added_tokens.json')

In [29]:
trainer.evaluate()


The following columns in the evaluation set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: __index_level_0__. If __index_level_0__ are not expected by `GPT2LMHeadModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 4043
  Batch size = 16


{'eval_loss': 3.5279788970947266,
 'eval_runtime': 42.617,
 'eval_samples_per_second': 94.868,
 'eval_steps_per_second': 5.937,
 'epoch': 6.0}
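
A cross-entropy loss is easier to interpret as a perplexity, which is the exponential of the mean per-token loss. The sketch below converts the two losses reported above, assuming both are mean per-token cross-entropy values in nats; `perplexity` is a hypothetical helper.

```python
import math

def perplexity(mean_cross_entropy_loss):
    """Perplexity is exp(loss) when the loss is the mean per-token cross-entropy in nats."""
    return math.exp(mean_cross_entropy_loss)

eval_loss = 3.5279788970947266    # from trainer.evaluate()
train_loss = 3.1197480777951845   # average over the whole training run

print(f"validation perplexity: {perplexity(eval_loss):.2f}")
print(f"training perplexity:   {perplexity(train_loss):.2f}")
```

The validation perplexity is higher than the training figure. Note that the reported training loss is averaged over all 6 epochs, including the high-loss early steps, so the gap to the final training loss is likely larger than these two numbers suggest.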

# Headline generation

In [37]:
def generate_n_text_samples(model, tokenizer, input_text, device, n_samples = 5):
    text_ids = tokenizer.encode(input_text, return_tensors = 'pt')
    text_ids = text_ids.to(device)
    model = model.to(device)

    generated_text_samples = model.generate(
        text_ids, 
        max_length= 100,  
        num_return_sequences= n_samples,
        no_repeat_ngram_size= 2,
        repetition_penalty= 1.5,
        top_p= 0.92,
        temperature= .85,
        do_sample= True,
        top_k= 125,
        early_stopping= True
    )
    gen_text = []
    for t in generated_text_samples:
        text = tokenizer.decode(t, skip_special_tokens=True)
        gen_text.append(text)
    return gen_text
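
The function above combines `top_k=125` with `top_p=0.92` (nucleus sampling). Top-p keeps the smallest set of most probable tokens whose cumulative probability reaches the threshold, so the candidate set shrinks when the model is confident and grows when it is not. The following is a minimal sketch of that filtering step on a made-up distribution; `top_p_filter` is a hypothetical helper, not the library's implementation.

```python
def top_p_filter(probs, top_p):
    """Return indices of the smallest set of tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break  # the nucleus is complete
    return kept

probs = [0.5, 0.3, 0.15, 0.04, 0.01]  # hypothetical next-token probabilities
print(top_p_filter(probs, top_p=0.92))
```

For this distribution the first three tokens already cover 95% of the probability mass, so the two least likely tokens are dropped before sampling.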


In [34]:
# trained model loading

headlines_model = GPT2LMHeadModel.from_pretrained(model_headlines_path)
headlines_tokenizer = GPT2Tokenizer.from_pretrained(model_headlines_path)

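# Use the GPU when available, otherwise fall back to the CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"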

input_text = headlines_tokenizer.bos_token




loading configuration file /content/gdrive/MyDrive/Projects/Data/archive/model_headlines_news/config.json
Model config GPT2Config {
  "_name_or_path": "gpt2-medium",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50257,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 1024,
  "n_head": 16,
  "n_inner": null,
  "n_layer": 24,
  "n_positions": 1024,
  "n_special": 0,
  "pad_token_id": 50258,
  "predict_special_tokens": true,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 5

In [38]:
headlines = generate_n_text_samples(headlines_model, headlines_tokenizer, 
                                    input_text, device, n_samples = 10)
print(len(headlines))
for h in headlines:
    print(h)
    print()

10
 Trump is calling for a ’major investigation‘ into Hillary Clinton and the FBI

 WATCH: Migrant Sex Attacks On German Woman On New Year’s Eve

 Facebook’s Mark Zuckerberg is in New York to discuss the future of his $38 billion company

 Donald Trump’s Campaign Manager on Why Hillary Clinton May Have Lost the Election

 ’This Is the Beginning’: Trump Fires up Working-Class Voters at Pennsylvania Rally

 Breitbart’s Klein: ‘Clinton Cash in the Gutter, It Is Not a Movie Anymore”

 Trump Campaign Manager: Hillary Clinton Has ‘The Will to Run the Country’

 Facebook, Google and Twitter are all facing a new threat to their dominance

 Hillary Clinton: I’m Not Raising Money For My Book Because It Has ‘Anti-Semitic Overtones”

 The Hill: Ted Cruz Bests Donald Trump in Iowa With 6.5 Point Swing



# Assignment:

* Fine-tune the model on another dataset (Wikipedia, IMDB, News, Yelp, Stories, ...) from HuggingFace's datasets library or another external source (Reddit, Twitter, web scraping, ...).
* Increase the dataset size beyond 1,000 samples (training may take longer)
* Report generation results with different parameters (`top_k, do_sample, top_p, temperature`) - Hint: https://huggingface.co/blog/how-to-generate
* Use another version of GPT-2 if possible


