## Finetuning OOO GPT2 for text classification

Based on the implementation at: https://www.kaggle.com/code/baekseungyun/gpt-2-with-huggingface-pytorch

Additional references:

- http://mohitmayank.com/a_lazy_data_science_guide/natural_language_processing/GPTs/
- https://towardsdatascience.com/guide-to-fine-tuning-text-generation-models-gpt-2-gpt-neo-and-t5-dc5de6b3bc5e

#### Prerequisites

In [None]:
%%capture 

!pip install transformers
!pip install scikit-learn

#### Imports 

In [2]:
from sklearn.model_selection import train_test_split
from transformers import TrainingArguments
from transformers import GPT2LMHeadModel
from torch.utils.data import DataLoader
from transformers import GPT2Tokenizer
from torch.utils.data import Dataset
from sklearn.metrics import f1_score
from transformers import GPT2Config
from transformers import set_seed
from transformers import Trainer
from tqdm import tqdm
import pandas as pd
import transformers
import logging
import sklearn
import pandas
import random
import pickle
import torch
import re
import os

In [3]:
logger = logging.getLogger('sagemaker')
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler())

In [4]:
logger.info(f'[Using transformers version: {transformers.__version__}]')
logger.info(f'[Using sklearn version: {sklearn.__version__}]')
logger.info(f'[Using torch version: {torch.__version__}]')
logger.info(f'[Using pandas version: {pandas.__version__}]')

[Using transformers version: 4.18.0]
[Using sklearn version: 0.24.2]
[Using torch version: 1.8.1]
[Using pandas version: 1.1.5]


#### 1. Setup GPT2 model & tokenizer   

In [5]:
set_seed(123)

##### Setup GPT2 model

In [6]:
model = GPT2LMHeadModel.from_pretrained('gpt2').cuda()

##### Setup GPT2 tokenizer 

In [7]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2', 
                                          bos_token='<|startoftext|>',
                                          eos_token='<|endoftext|>', 
                                          pad_token='<|pad|>')
tokenizer.padding_side = 'left'
tokenizer.model_max_length = 512
tokenizer

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


PreTrainedTokenizer(name_or_path='gpt2', vocab_size=50257, model_max_len=512, is_fast=False, padding_side='left', truncation_side='right', special_tokens={'bos_token': AddedToken("<|startoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'pad_token': '<|pad|>'})

In [8]:
len(tokenizer)

50259

In [9]:
# resizes the token embeddings of the model to match the number of tokens in the tokenizer
model.resize_token_embeddings(len(tokenizer))

Embedding(50259, 768)
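
The jump from 50257 to 50259 entries follows from the three special tokens passed to the tokenizer: `<|endoftext|>` is already part of the GPT-2 vocabulary, so only `<|startoftext|>` and `<|pad|>` are new. A minimal sketch of that bookkeeping in plain Python, using a stand-in vocabulary rather than the real one (the helper `count_new_tokens` is hypothetical and not part of transformers):

```python
def count_new_tokens(vocab, special_tokens):
    """Return the special tokens that are not yet in the vocabulary."""
    return [tok for tok in special_tokens if tok not in vocab]


# Stand-in for the GPT-2 vocabulary: 50256 placeholder entries plus the
# end-of-text token, which GPT-2 stores at id 50256.
vocab = {f"tok_{i}": i for i in range(50256)}
vocab["<|endoftext|>"] = 50256

specials = ["<|startoftext|>", "<|endoftext|>", "<|pad|>"]
new_tokens = count_new_tokens(vocab, specials)
new_vocab_size = len(vocab) + len(new_tokens)
print(new_tokens, new_vocab_size)
```

The embedding matrix has to grow by the same amount, which is what `resize_token_embeddings(len(tokenizer))` does.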

#### 2. Setup dataset

In [10]:
class CustomDataset(Dataset):
    def __init__(self, txt_list, label_list, tokenizer, max_length):          
        self.input_ids = []
        self.attention_mask = []
        self.labels = []

        # Define label map
        label_map = {0 : 'business', 1 : 'esg', 2 : 'general', 3 : 'science', 4 : 'tech'}
        
        # Iterate through the dataset
        for txt, label in zip(txt_list, label_list):
            # Prepare the text
            prep_txt = f'<|startoftext|>tweet: {txt}<|pad|>sentiment: {label_map[label]}<|endoftext|>'
            # Tokenize the text
            encodings_dict = tokenizer(prep_txt, 
                                       truncation=True,
                                       max_length=max_length, 
                                       padding='max_length')
            # Append the tokenized text to the list
            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attention_mask.append(torch.tensor(encodings_dict['attention_mask']))
            self.labels.append(label_map[label])

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.attention_mask[idx], self.labels[idx]
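
To make the tokenizer settings used by `CustomDataset` concrete, here is a small sketch in plain Python of what `truncation=True`, `padding='max_length'` and `padding_side='left'` do to one example. The token ids and the pad id are hypothetical, and `truncate_and_left_pad` is an illustrative helper, not the tokenizer's own code:

```python
def build_prompt(text, label_name):
    """Format one training example the way CustomDataset does."""
    return f"<|startoftext|>tweet: {text}<|pad|>sentiment: {label_name}<|endoftext|>"


def truncate_and_left_pad(token_ids, max_length, pad_id):
    """Mimic right-side truncation followed by left-side padding."""
    ids = token_ids[:max_length]            # truncation drops tokens on the right
    n_pad = max_length - len(ids)
    input_ids = [pad_id] * n_pad + ids      # padding is added on the left
    attention_mask = [0] * n_pad + [1] * len(ids)
    return input_ids, attention_mask


# Hypothetical token ids and pad id, for illustration only.
ids, mask = truncate_and_left_pad([11, 12, 13], max_length=5, pad_id=0)
print(ids)   # [0, 0, 11, 12, 13]
print(mask)  # [0, 0, 1, 1, 1]
```

Note that an example longer than `max_length` loses its tail, which is where the label and `<|endoftext|>` sit in this prompt format.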

In [11]:
file_path = './covid_articles_clf_data.csv'
df = pd.read_csv(file_path, names=['text', 'label'])
#df = df.sample(50000, random_state=1)
df.head() 

| | text | label |
|---|---|---|
| 0 | mysterious respiratory virus strikes 44 people... | 2 |
| 1 | coronavirus impact on tech supply chains minim... | 4 |
| 2 | hackers imitating cdc, who with coronavirus ph... | 4 |
| 3 | new virus identified as likely cause of myster... | 3 |
| 4 | new sars related virus, wuhan pneumonia, ideni... | 2 |


In [12]:
X_train, X_test, y_train, y_test = train_test_split(df['text'].tolist(), 
                                                    df['label'].tolist(), 
                                                    shuffle=True, 
                                                    test_size=0.05, 
                                                    random_state=123)

In [13]:
train_dataset = CustomDataset(X_train, y_train, tokenizer, max_length=512)

In [14]:
training_args = TrainingArguments(output_dir='./output', 
                                  num_train_epochs=1,  
                                  optim='adamw_torch', 
                                  save_strategy='epoch', 
                                  #evaluation_strategy='epoch', 
                                  per_device_train_batch_size=8, 
                                  #per_device_eval_batch_size=8, 
                                  warmup_steps=100, 
                                  weight_decay=0.01, 
                                  logging_dir='logs')

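# start training
# For causal language modelling the labels are the input ids themselves;
# the model shifts them internally to predict the next token.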
Trainer(model=model, 
        args=training_args, 
        train_dataset=train_dataset, 
        #eval_dataset=test_dataset,
        data_collator=lambda data: {'input_ids': torch.stack([f[0] for f in data]),
                                    'attention_mask': torch.stack([f[1] for f in data]),
                                    'labels': torch.stack([f[0] for f in data])}).train()

***** Running training *****
  Num examples = 133308
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 4166
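
The figures in this log are consistent with each other. The sketch below reproduces the step count from the logged numbers; the device count is inferred from the batch sizes and is an assumption, since the log does not state it directly:

```python
import math

num_examples = 133308          # "Num examples" in the log
per_device_batch_size = 8      # per_device_train_batch_size
total_batch_size = 32          # "Total train batch size" in the log
grad_accum_steps = 1           # "Gradient Accumulation steps"
num_epochs = 1

# Number of devices implied by the batch sizes (inferred, not logged).
num_devices = total_batch_size // (per_device_batch_size * grad_accum_steps)

# The last, partially filled batch still counts as one step.
steps_per_epoch = math.ceil(num_examples / total_batch_size)
total_steps = steps_per_epoch * num_epochs
print(num_devices, total_steps)
```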


Step,Training Loss
500,0.394
1000,0.1499
1500,0.1386
2000,0.1343
2500,0.1317
3000,0.1301
3500,0.1296
4000,0.1269


Saving model checkpoint to ./output/checkpoint-4166
Configuration saved in ./output/checkpoint-4166/config.json
Model weights saved in ./output/checkpoint-4166/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=4166, training_loss=0.16531857652155985, metrics={'train_runtime': 3755.1506, 'train_samples_per_second': 35.5, 'train_steps_per_second': 1.109, 'total_flos': 3.4832318201856e+16, 'train_loss': 0.16531857652155985, 'epoch': 1.0})
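
The collator above reuses `input_ids` as `labels`, so the left-padding positions also contribute to the loss. A common variation, not used in this notebook, sets the labels at padded positions to -100, the index that PyTorch's cross-entropy loss ignores by default. A sketch on plain Python lists (`mask_padding_labels` is a hypothetical helper and the token ids are made up):

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_padding_labels(input_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Copy input_ids to labels, ignoring positions where attention_mask is 0."""
    return [tok if keep == 1 else ignore_index
            for tok, keep in zip(input_ids, attention_mask)]


# Hypothetical left-padded example.
input_ids = [0, 0, 11, 12, 13]
attention_mask = [0, 0, 1, 1, 1]
labels = mask_padding_labels(input_ids, attention_mask)
print(labels)  # [-100, -100, 11, 12, 13]
```

Because the mask is driven by `attention_mask`, the `<|pad|>` token that the prompt uses as a separator between text and label would stay in the labels; only the padding added by the tokenizer is ignored.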

In [15]:
torch.cuda.empty_cache()

#### Test for inference

In [16]:
import transformers

# Load the fine-tuned GPT-2 model
model = transformers.AutoModelForCausalLM.from_pretrained('./output/checkpoint-4166')

loading configuration file ./output/checkpoint-4166/config.json
Model config GPT2Config {
  "_name_or_path": "./output/checkpoint-4166",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "torch_dtype": "float32",
  "transformers_version": "4.18.0",
  "use_cache": true,
  "vocab_

In [17]:

text = "Tweet: beijing reports 21 new covid-19 cases in city as of june 17\nSentiment:"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
output[0]

tensor([[[-1.4510e+02, -1.3709e+02, -1.4375e+02,  ..., -1.4558e+02,
          -5.6403e+00,  1.0422e+01],
         [-1.5525e+01, -1.5560e+01, -1.6763e+01,  ..., -1.1477e+01,
          -1.7757e+00,  9.8000e-02],
         [-1.6649e+01, -1.5544e+01, -1.8423e+01,  ..., -1.7433e+01,
          -3.2185e+00, -3.9600e-01],
         ...,
         [-3.0732e+01, -3.3927e+01, -3.6025e+01,  ..., -2.7694e+01,
          -2.2712e+00, -6.4070e-01],
         [ 1.0196e+01,  8.9675e+00,  8.0908e+00,  ...,  1.5406e+01,
          -1.8195e+00,  2.0543e+00],
         [-1.0244e+01, -1.0119e+01, -1.1309e+01,  ..., -1.1721e+00,
          -1.5378e+00,  2.6751e-01]]], grad_fn=<UnsafeViewBackward>)
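
`output[0]` holds the logits with shape (batch, sequence length, vocabulary size); the prediction for the next token is read from the last sequence position. As a rough illustration of how such logits could be turned into a label choice, the sketch below applies a softmax to made-up logits for the five label names. The numbers are hypothetical and are not taken from the tensor above:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical last-position logits for the first token of each label.
candidate_logits = {"business": 4.0, "esg": 1.0, "general": 3.0,
                    "science": 0.5, "tech": 2.0}

probs = dict(zip(candidate_logits, softmax(list(candidate_logits.values()))))
best_label = max(probs, key=probs.get)
print(best_label)
```

Restricting the comparison to the label tokens avoids free-form generation, at the cost of having to know how each label is tokenized.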

In [18]:
from transformers import pipeline, set_seed
generator = pipeline('text-generation', model='./output/checkpoint-4166', tokenizer='gpt2')
set_seed(42)
generator("tweet: beijing reports 21 new covid-19 cases in city as of june 17\nsentiment:", max_length=128, num_return_sequences=25)


[{'generated_text': 'tweet: beijing reports 21 new covid-19 cases in city as of june 17\nsentiment: business'},
 {'generated_text': 'tweet: beijing reports 21 new covid-19 cases in city as of june 17\nsentiment: business'},
 {'generated_text': 'tweet: beijing reports 21 new covid-19 cases in city as of june 17\nsentiment: business'},
 {'generated_text': 'tweet: beijing reports 21 new covid-19 cases in city as of june 17\nsentiment: business'},
 {'generated_text': 'tweet: beijing reports 21 new covid-19 cases in city as of june 17\nsentiment: business'},
 {'generated_text': 'tweet: beijing reports 21 new covid-19 cases in city as of june 17\nsentiment: business'},
 {'generated_text': 'tweet: beijing reports 21 new covid-19 cases in city as of june 17\nsentiment: business'},
 {'generated_text': 'tweet: beijing reports 21 new covid-19 cases in city as of june 17\nsentiment: business'},
 {'generated_text': 'tweet: beijing reports 21 new covid-19 cases in city as of june 17\nsentiment: busi
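
With `num_return_sequences=25` the pipeline returns 25 sampled continuations for the same prompt. One possible way to reduce them to a single prediction is a majority vote over the extracted labels. This is an assumption about how the samples could be used, not something the notebook does; `majority_label` and the sample data are hypothetical:

```python
import re
from collections import Counter


def majority_label(generations):
    """Return the most common label found after 'sentiment:' and its vote share."""
    labels = []
    for item in generations:
        match = re.search(r"\nsentiment: (\w+)", item["generated_text"])
        if match:
            labels.append(match.group(1))
    if not labels:
        return None, 0.0
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Hypothetical pipeline output in the same shape as the list above.
samples = [
    {"generated_text": "tweet: example headline\nsentiment: business"},
    {"generated_text": "tweet: example headline\nsentiment: business"},
    {"generated_text": "tweet: example headline\nsentiment: tech"},
]
label, share = majority_label(samples)
print(label, share)
```

The vote share gives a crude confidence signal: a value close to 1.0 means the sampled generations agree.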

In [19]:
# set the model to eval mode
_ = model.eval()

# run model inference on all test data
original_label, predicted_label, original_text, predicted_text = [], [], [], []
label_map = {0 : 'business', 1 : 'esg', 2 : 'general', 3 : 'science', 4 : 'tech'}
# iter over all of the test data
i = 0
for text, label in zip(X_test, y_test):
    # keep only the first 512 characters of the text
    text = text[0:512]
    # create prompt (note: training separated text and label with '<|pad|>'; a newline is used here)
    prompt = f'<|startoftext|>tweet: {text}\nsentiment:'
    # tokenize the prompt
    generated = tokenizer(prompt, return_tensors="pt").input_ids
    print(generated.shape)
    # perform prediction
    sample_outputs = model.generate(generated, 
                                    do_sample=True, 
                                    top_k=50, 
                                    max_length=512, 
                                    top_p=0.90, 
                                    temperature=0.01)

    
    # decode the predicted tokens into texts
    pred_text  = tokenizer.decode(sample_outputs[0], skip_special_tokens=True)
    print(pred_text)
    # extract the predicted sentiment
    try:
        pred_sentiment = re.findall("\nsentiment: (.*)", pred_text)[-1]
    except IndexError:
        # no 'sentiment:' line was generated
        pred_sentiment = "None"
    # append results
    original_label.append(label_map[label])
    predicted_label.append(pred_sentiment)
    original_text.append(text)
    predicted_text.append(pred_text)
    i += 1
    if i == 5:
        break

# transform result into dataframe
df = pd.DataFrame({'original_text': original_text, 'predicted_label': predicted_label, 
                    'original_label': original_label, 'predicted_text': predicted_text})

# compute the macro-averaged F1 score
print(f1_score(original_label, predicted_label, average='macro'))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


torch.Size([1, 27])
tweet: after coronavirus: california liberals say returning to normal won’ t be enough
sentiment: business
torch.Size([1, 40])


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


tweet: visitors to shanghai disneyland become ‘ envy of the whole world’ after park reopens at one-third capacity following coronavirus shutdown
sentiment: business
torch.Size([1, 18])
tweet: zoom video stock surges as coronavirus fears deepen
sentiment: business
torch.Size([1, 31])


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


tweet: detroit to be first to deploy abbott labs’ 5-minute covid-19 test, mayor says
sentiment: business
torch.Size([1, 40])
tweet: le test salivaire covid-19 d'intelligent fingerprinting reçoit le marquage ce et devient disponible à la vente
sentiment: general
1.0
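
The score of 1.0 is computed on only five examples covering two of the five classes, so it says little about overall quality. To show what `average='macro'` measures, here is a from-scratch sketch in plain Python (an illustration of the definition, not scikit-learn's implementation), applied to the five labels above and to a variant where the single `general` example is missed:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = ["business", "business", "business", "business", "general"]
y_pred = ["business", "business", "business", "business", "general"]
print(macro_f1(y_true, y_pred))  # 1.0

# Same data, but the single 'general' example is predicted as 'business'.
print(macro_f1(y_true, ["business"] * 5))
```

Because every class counts equally, one mistake on a rare class lowers the macro score far more than it would lower plain accuracy.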


In [20]:
df

| | original_text | predicted_label | original_label | predicted_text |
|---|---|---|---|---|
| 0 | after coronavirus: california liberals say ret... | business | business | tweet: after coronavirus: california liberals ... |
| 1 | visitors to shanghai disneyland become ‘ envy ... | business | business | tweet: visitors to shanghai disneyland become ... |
| 2 | zoom video stock surges as coronavirus fears d... | business | business | tweet: zoom video stock surges as coronavirus ... |
| 3 | detroit to be first to deploy abbott labs’ 5-m... | business | business | tweet: detroit to be first to deploy abbott la... |
| 4 | le test salivaire covid-19 d'intelligent finge... | general | general | tweet: le test salivaire covid-19 d'intelligen... |


In [21]:
for text in df['original_text']:
    print(text)

after coronavirus: california liberals say returning to normal won’ t be enough
visitors to shanghai disneyland become ‘ envy of the whole world’ after park reopens at one-third capacity following coronavirus shutdown
zoom video stock surges as coronavirus fears deepen
detroit to be first to deploy abbott labs’ 5-minute covid-19 test, mayor says
le test salivaire covid-19 d'intelligent fingerprinting reçoit le marquage ce et devient disponible à la vente


In [22]:
for text in df['predicted_text']:
    print(text)

tweet: after coronavirus: california liberals say returning to normal won’ t be enough
sentiment: business
tweet: visitors to shanghai disneyland become ‘ envy of the whole world’ after park reopens at one-third capacity following coronavirus shutdown
sentiment: business
tweet: zoom video stock surges as coronavirus fears deepen
sentiment: business
tweet: detroit to be first to deploy abbott labs’ 5-minute covid-19 test, mayor says
sentiment: business
tweet: le test salivaire covid-19 d'intelligent fingerprinting reçoit le marquage ce et devient disponible à la vente
sentiment: general
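
The generations printed above follow the `tweet: ...` / `sentiment: ...` layout, so the label can be read back with a regular expression. A slightly stricter sketch than the one in the loop is shown below: it also rejects anything that is not one of the five known labels. `extract_label` is a hypothetical helper, and the `"None"` fallback mirrors the loop above:

```python
import re

LABELS = {"business", "esg", "general", "science", "tech"}


def extract_label(generated_text, fallback="None"):
    """Return the last 'sentiment:' value if it is a known label, else fallback."""
    matches = re.findall(r"\nsentiment: (.*)", generated_text)
    if not matches:
        return fallback
    candidate = matches[-1].strip()
    return candidate if candidate in LABELS else fallback


print(extract_label("tweet: zoom video stock surges as coronavirus fears deepen\nsentiment: business"))
print(extract_label("tweet: no label here"))
```

Mapping unknown strings to the fallback keeps stray generations from showing up as extra classes when the F1 score is computed.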
