# Fine-tuning GPT-2 with Hugging Face Transformers

In [3]:
%pip install datasets transformers==4.28.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [4]:
from google.colab import drive
import os
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
# No backslash needed before the space inside a Python string (it is only needed in shell commands)
directory = './drive/MyDrive/Colab Notebooks/Reginald'

In [6]:
!ls ./drive/MyDrive/Colab\ Notebooks/Reginald/

data-wiki  gpt2-finetune-REGindald.ipynb  gpt-jt-test.ipynb  REG-RAG.ipynb


## Load dataset

Using November's version of the REG wiki, Hugging Face `datasets` loads the text files as training samples. The data come directly from a download of the GitHub wiki. The library loads each file and treats every line, including empty ones, as a separate data sample.

In [7]:
from datasets import load_dataset

dataset = load_dataset("text", data_dir='./drive/MyDrive/data/wiki-reg/')

Resolving data files:   0%|          | 0/44 [00:00<?, ?it/s]

Downloading and preparing dataset text/default to /root/.cache/huggingface/datasets/text/default-0be3b60379854c69/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2...


Dataset text downloaded and prepared to /root/.cache/huggingface/datasets/text/default-0be3b60379854c69/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2. Subsequent calls will reuse this data.



Let's examine the data: 

In [8]:
dataset['train'][1:10]

{'text': ['',
  '## Setting up Azure',
  '',
  'You can access Azure through the [Azure portal](https://portal.azure.com), using your Office 365 credentials.',
  '',
  'To experiment with Azure you should request a trial account with $300 credit using this [TopDesk Link](https://turingcomplete.topdesk.net/tas/public/ssp/content/serviceflow?unid=ac51b39d8bfc46f9bf41132ef8601b5e&from=7edfe644-ac0d-4895-af98-acd425ee0b19&openedFromService=true). When using Azure for projects you can use the same form to request a project-specific subscription with its own budget.',
  '',
  '## Quick links',
  '']}

In [9]:
# The number of samples should equal the total number of lines across all files

len(dataset['train'])

1956
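
One way to sanity-check this number is to count the lines directly on disk. A minimal sketch, assuming UTF-8 text files sitting directly inside the data directory; `count_lines` is a hypothetical helper, not part of `datasets`, and the loader's handling of edge cases may differ slightly:

```python
from pathlib import Path

def count_lines(data_dir):
    """Count lines across all regular files directly inside data_dir."""
    total = 0
    for path in sorted(Path(data_dir).iterdir()):
        if path.is_file():
            # splitlines() keeps empty lines as separate entries
            total += len(path.read_text(encoding="utf-8").splitlines())
    return total
```

Calling `count_lines('./drive/MyDrive/data/wiki-reg/')` should then give a figure to compare against `len(dataset['train'])`.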

## Loading a pre-trained model

Pre-trained models are available from the [Hugging Face model repository](https://huggingface.co/models).

I'm using a small, distilled version of GPT-2 called [distilgpt2](https://huggingface.co/distilgpt2).

In [10]:
model_checkpoint = 'distilgpt2'

### 1. Tokenizer
Each model comes with the tokenizer that was used when the model was originally trained.

In [11]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)


In [12]:
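def tokenize_function(examples):
    # Tokenize a batch of text lines
    return tokenizer(examples['text'])

# Tokenize the whole dataset in parallel and drop the raw text column
tokenized_dataset = dataset.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])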


In [13]:
# Let's explore what the tokenized text looks like

tokenized_dataset['train'][4]

{'input_ids': [1639, 460, 1895, 22134, 832, 262, 685, 26903, 495, 17898, 16151, 5450, 1378, 634, 282, 13, 1031, 495, 13, 785, 828, 1262, 534, 4452, 21268, 18031, 13],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
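
The output holds two parallel lists: `input_ids` has one integer per token, and `attention_mask` marks which positions are real tokens (all ones here, since nothing was padded). A toy sketch that produces the same shape, assuming a whitespace tokenizer and a made-up vocabulary; GPT-2's real tokenizer uses byte-pair encoding, which this does not reproduce:

```python
def toy_tokenize(text, vocab):
    """Map whitespace-separated words to ids, growing the vocabulary as needed."""
    input_ids = []
    for word in text.split():
        if word not in vocab:
            vocab[word] = len(vocab)  # assign the next free id
        input_ids.append(vocab[word])
    # No padding is applied, so every position is a real token
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}

vocab = {}
encoded = toy_tokenize("You can access Azure through the portal", vocab)
print(encoded)
```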

### 2. Reformat training data

Reshape the tokenized data into fixed-length blocks that the model can be trained on.


In [14]:
# Maximum input length supported by the model
block_size = tokenizer.model_max_length
print(block_size)

# The maximum seems to be a bit too big for the free Colab GPU RAM, so use shorter blocks
block_size = 512

1024
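
As rough intuition for why a shorter block helps: in standard self-attention each head builds a score matrix with one entry per pair of positions, so that part of the memory grows with the square of the block length. A back-of-the-envelope sketch (it ignores weights, optimizer state and other activations, which do not shrink in the same way):

```python
def attention_score_entries(seq_len):
    """Entries in one head's attention score matrix for a single sequence."""
    return seq_len * seq_len

full = attention_score_entries(1024)
reduced = attention_score_entries(512)
print(full // reduced)  # 4
```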


In [15]:
# Reformat the training data to enable effective training - concatenate all 
# the text and then split it into chunks of block_size length

def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the small remainder. We could pad instead of dropping if the model supported it.
    total_length = (total_length // block_size) * block_size
    # Split into chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
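
To see what this grouping does, here is a self-contained toy run with a block size of 4 instead of 512. `group_texts_toy` is an illustrative copy that takes `block_size` as an argument (the notebook's function reads it from a global), and the token ids are made up:

```python
from itertools import chain

def group_texts_toy(examples, block_size):
    """Concatenate each column, drop the remainder, and cut into fixed-size blocks."""
    concatenated = {k: list(chain.from_iterable(v)) for k, v in examples.items()}
    total_length = len(concatenated["input_ids"])
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

# Four "lines" of different lengths, including an empty one
batch = {
    "input_ids": [[1, 2, 3], [], [4, 5, 6, 7], [8, 9]],
    "attention_mask": [[1, 1, 1], [], [1, 1, 1, 1], [1, 1]],
}
grouped = group_texts_toy(batch, block_size=4)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

The nine tokens become two blocks of four, and the last token is dropped. The labels are a plain copy of the inputs because causal language models in `transformers` shift the labels internally when computing the loss.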



In [16]:
lm_datasets = tokenized_dataset.map(
    group_texts,
    batched=True,
    batch_size=10,
    num_proc=4,
)


In [17]:

print(len(lm_datasets["train"]))  # number of training blocks (samples)
print(len(lm_datasets["train"][0]["input_ids"]))  # number of tokens in each block

11
512
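
Only 11 blocks from 1956 lines is fewer than one might expect. A likely reason, judging from the code above: `group_texts` drops the remainder of every mapped batch, and with `batch_size=10` each batch holds only ten lines, so any batch with fewer than 512 tokens contributes nothing. A sketch of that arithmetic with made-up line lengths (the numbers are illustrative, not measured from the wiki):

```python
def count_blocks(tokens_per_line, batch_size, block_size):
    """Blocks produced when the remainder of each batch of lines is dropped."""
    blocks = 0
    for start in range(0, len(tokens_per_line), batch_size):
        batch_tokens = sum(tokens_per_line[start : start + batch_size])
        blocks += batch_tokens // block_size
    return blocks

# Hypothetical corpus: 1000 lines of 20 tokens each (20,000 tokens in total)
lengths = [20] * 1000
print(count_blocks(lengths, batch_size=10, block_size=512))    # 0
print(count_blocks(lengths, batch_size=1000, block_size=512))  # 39
```

Raising `batch_size` in the `map` call (its default is 1000) would therefore keep more of the text.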


In [18]:
# How do the data look now?
# We can use decode from the tokenizer

tokenizer.decode(lm_datasets["train"][0]["input_ids"])

"| Date | Project   | Presenter(s)  | Link(s) || :----- | :---------------- | :------------------ | :----------------------- ||  | VirtualBox/VMs | Luke | [HackMD](https://hackmd.io/G5GwYHZ8QCWG78uG56xrUA) || 30th Nov | Active Learning for new class discovery | Camilla B | [Slides](https://thealanturininstitute.sharepoint.com/:p:/s/ResearchEngineering/ESW2BJQBFFZOjErOX-eQ6kABEeuZI37G37VySembqM05hQ?e=8gmc0t) ||  | Vehicle Grid Integration | Louise | [Slides](https://thealanturininstitute.sharepoint.com/:p:/r/sites/ResearchEngineering/Shared%20Documents/Corporate_Duties/Events/Project-Lightning/20211012_VGI.pptx?d=wc591e22ed92042c4af507f25754b64dd&csf=1&web=1&e=nYaCJA) ||  | WAYS (What Aren't You Seeing) | Ed | || 12th Oct | Living With Machines: A flagship Digital Humanities project leveraging the power of cloud computing | Christina | [Slides](https://drive.google.com/file/d/1iIc411qN5xRJaJfdWgUuv1tn_8hfBx-o/view?usp=sharing)|| 7th Sept | Towards an Open Global Air Quality Monitoring P

### 3. Fine-tuning the model

In [19]:
from transformers import AutoModelForCausalLM

# Load the same model we used for the tokenizer above
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)


In [20]:

from transformers import Trainer, TrainingArguments

model_name = model_checkpoint.split("/")[-1]
print(model_name)

distilgpt2


In [21]:
num_epochs = 10

# training parameters
training_args = TrainingArguments(
    f"{model_name}-finetuned-regwiki-{num_epochs}",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=False,
    num_train_epochs=num_epochs
)


In [22]:

# create a trainer class
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["train"],    # Caveat: this reuses the training set, so the validation loss is not a held-out estimate
)
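
Because the evaluation set here is the training set, the reported validation loss does not measure generalization. A minimal sketch of holding out some blocks instead, using only the standard library; `split_indices` is a hypothetical helper, and the 10% fraction and the seed are arbitrary choices:

```python
import random

def split_indices(n_samples, eval_fraction=0.1, seed=0):
    """Shuffle sample indices and split them into train and eval index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_eval = max(1, round(n_samples * eval_fraction))  # keep at least one eval sample
    return indices[n_eval:], indices[:n_eval]

train_idx, eval_idx = split_indices(11)
print(len(train_idx), len(eval_idx))  # 10 1
```

The index lists could then be passed to `lm_datasets["train"].select(...)`; `datasets` also offers `Dataset.train_test_split` for the same purpose. With only 11 blocks, any held-out estimate will be noisy.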



In [23]:
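import math

# Fine-tune the model
trainer.train()

# Evaluate and report perplexity (the exponential of the evaluation loss)
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

# Save the fine-tuned model to Google Drive
trainer.save_model(output_dir=f'./drive/MyDrive/REGindald-trained/reginald-{model_name}-{num_epochs}/')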




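| Epoch | Training Loss | Validation Loss |
| ----: | :------------ | --------------: |
| 1 | No log | 3.329434 |
| 2 | No log | 3.24901 |
| 3 | No log | 3.212142 |
| 4 | No log | 3.179071 |
| 5 | No log | 3.151037 |
| 6 | No log | 3.132263 |
| 7 | No log | 3.11742 |
| 8 | No log | 3.104218 |
| 9 | No log | 3.095852 |
| 10 | No log | 3.092273 |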
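
The `No log` entries in the training-loss column are most likely a logging-interval effect rather than a problem with training. Assuming a single GPU and the `TrainingArguments` defaults (`per_device_train_batch_size=8`, `logging_steps=500`), the run is far too short to reach the first logging step:

```python
import math

n_blocks = 11          # len(lm_datasets["train"])
batch_size = 8         # assumed default per-device batch size, single device
num_epochs = 10
logging_steps = 500    # assumed default logging interval

steps_per_epoch = math.ceil(n_blocks / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps, total_steps >= logging_steps)  # 2 20 False
```

Passing a smaller `logging_steps` value to `TrainingArguments` should make the training loss appear in the table.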


Perplexity: 22.03
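
Perplexity is the exponential of the mean cross-entropy loss, so the figure above can be reproduced by hand from the validation loss reported for epoch 10:

```python
import math

final_eval_loss = 3.092273  # validation loss after epoch 10
perplexity = math.exp(final_eval_loss)
print(f"Perplexity: {perplexity:.2f}")  # Perplexity: 22.03
```

Since the evaluation data are the training blocks themselves, this value is probably optimistic compared with perplexity on unseen text.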


In [24]:
!ls ./drive/MyDrive/REGindald-trained/reginald-distilgpt2-10/

config.json  generation_config.json  pytorch_model.bin	training_args.bin
