<a href="https://colab.research.google.com/github/andjoer/llm_poetry_generation/blob/main/colabs/finetune_GPT2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finetuning of the GPT2 Language Model

In [None]:
!pip install transformers
!pip install datasets

In [None]:
!nvidia-smi

Thu Jul 21 18:19:57 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Read the text files from the Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')



Mounted at /content/drive


In [None]:
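text_path = '/content/drive/MyDrive/LLM_training/gpt2_training_gutenberg.txt'

In [None]:
# Read the whole corpus; UTF-8 is assumed because the text contains German umlauts
with open(text_path, 'r', encoding='utf-8') as f:
    text = f.read()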

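Divide the text into smaller blocks. Running prose (Kant) is cut into chunks of `stepsize` words; poems are split at the separator token so that different poems are never mixed within one block.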

In [None]:
separate_text = False         # True for Kant, False for gutenberg
sep_token = '<|endoftext|>'
stepsize = 500                # words per block when separate_text is True

if separate_text:
    # Prose: cut the text into consecutive blocks of `stepsize` words.
    # Iterate over the word list, not over the characters of `text`.
    text_chunks = text.split()
    blocks = []
    for step in range(0, len(text_chunks), stepsize):
        blocks.append(' '.join(text_chunks[step:step + stepsize]))
else:
    # Poems: one block per poem, split at the separator token
    blocks = text.split(sep_token)


In [None]:
blocks[1000]

'\nTitel: Gedicht\n \nDas Herz ist deiner Liebe Königszelt,\nDas Auge dir zum Spiegel aufgestellt.\n\nDer Bürde deiner Gnaden beuget sich\nDies Haupt, das sich nicht beugt vor aller Welt.\nDer Paradiesbaum jenem, mir dein Wuchs!\nDa jeder Sinn sein eignes Maß enthält.\nDoch was soll ich in diesem Heiligtum,\nWo nur mit Scheu der Ost den Vorhang hält!\nWas ists auch, wenn ich der Befleckte bin?\nDenn deine Reinheit strahlt vor aller Welt.\nEinst war Madschnun, nun bin ich an der Reih,\nUnd jeder steht hier seinen Tag im Feld.\nDer Liebe Königsmacht, der Freuden Schatz,\nDurch Deine Huld ist all dies mir bestellt.\nHeil dir, und meinen Zweck hab ich erreicht,\nWenn Herz und Leben dir zum Opfer fällt.\nNie sei von deinem Bild mein Auge leer!\nNur ihm zum Wohngemach ist es erhellt.\nDie junge Ros im Garten duftet nur,\nWeil ihrem Odem war dein Haupt gesellt.\nSieh nicht Hafisens äußre Armut an!\nSein Innres birgt der Liebe gutes Geld.\n'
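The two ways of building blocks described above can be tried on toy strings. This is a minimal sketch, not the notebook's own code: `chunk_words`, `split_poems`, and the sample strings are illustrative names, and dropping whitespace-only blocks is an extra step the notebook does not perform.

```python
def chunk_words(text, stepsize):
    """Split text into consecutive blocks of at most `stepsize` words."""
    words = text.split()
    return [' '.join(words[i:i + stepsize]) for i in range(0, len(words), stepsize)]


def split_poems(text, sep_token='<|endoftext|>'):
    """Split at the separator token and drop whitespace-only blocks (an added step)."""
    return [block for block in text.split(sep_token) if block.strip()]


sample_prose = 'one two three four five'
prose_blocks = chunk_words(sample_prose, stepsize=2)

sample_poems = 'Poem A<|endoftext|>Poem B<|endoftext|>'
poem_blocks = split_poems(sample_poems)

print(prose_blocks)  # ['one two', 'three four', 'five']
print(poem_blocks)   # ['Poem A', 'Poem B']
```

Splitting at the separator keeps each poem intact, whereas word-level chunking may cut a text at an arbitrary position.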

In [None]:
from transformers import AutoTokenizer
from datasets import Dataset

ds = Dataset.from_dict({'text': blocks})
#ds_test = Dataset.from_dict({'text': test})
ds = ds.train_test_split(test_size=0.15) 

In [None]:
ds['test'][200]

{'text': '\nTitel: Orientalisches Potpourri\n \nGestern Nachmittag, meine braune Geliebte,\ndie du nach Ruhm begehrst vor allen Frauen\ndeines Volkes, saß ich in einem Treibhaus,\nund von allen Palmen und andern Gewächsen\nflogen mir neue Gedichte zu.\n\nHier ist eins von einem Agavenwildling:\nMeine Geliebte!\nGrau in staubiger Wüste\nstand mein dorniges Blattwerk\njahrlang mit durstig schwellendem Fleisch.\nPlötzlich schoß über Nacht\nein steiler Schaft, knospengekrönt,\naus dem staubgrauen Schoß\nin die feurige Morgenluft.\nSchick mir zu Mittag, Geliebte,\ndeine tausend durstigen braunen Bienen:\nviertausend goldgelbe Blütenglöckchen\nhaben sich aufgetan und triefen,\ntriefen, triefen von Honigsaft.\nOder eins von einer verschulten Musa:\nMeine Geliebte!\nWen mit deinen üppig langen\nBlättern willst du denn umfangen,\ndie du überreichlich treibst?\nFühlst du nicht den Abend glühen?\nWenn du ohne Blüte bleibst,\nSchönste, kannst du nie verblühen,\nÄrmste, nie mit Früchten prangen.\nO

Tokenize the training data

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('benjamin/gerpt2-large')



In [None]:
from transformers import DataCollatorForLanguageModeling

block_size = 128            # maximum number of tokens per training row
min_len = 30                # rows with fewer tokens are dropped
remove_too_short = True

def load_dataset(dataset, tokenizer):
    # Split every text into rows of at most block_size tokens.
    # No padding here; it would be necessary on TPU.
    tokenized_dataset = dataset.map(
        lambda batch: tokenizer(
            batch['text'],
            max_length=block_size,
            truncation=True,
            return_overflowing_tokens=True,
        ),
        batched=True,
        num_proc=2,
        remove_columns=['text'],
    )
    print(tokenized_dataset[0])

    if remove_too_short:
        # No padding, just ignore rows that are too short
        tokenized_dataset = tokenized_dataset.filter(
            lambda row: len(row['input_ids']) >= min_len
        )

    return tokenized_dataset

train_dataset = load_dataset(ds['train'], tokenizer)
test_dataset = load_dataset(ds['test'], tokenizer)

# mlm=False selects causal (next-token) language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False,
)



   


{'input_ids': [199, 14814, 26, 1954, 832, 11364, 1267, 61, 199, 221, 199, 832, 11364, 1, 4091, 4620, 352, 828, 405, 8189, 27, 199, 4945, 1318, 384, 367, 310, 838, 39027, 75, 1131, 199, 5331, 1144, 82, 358, 15973, 27, 1649, 355, 21619, 18876, 221, 199, 28662, 10227, 291, 2007, 221, 479, 364, 1568, 548, 37323, 6108, 1, 199, 5248, 5255, 332, 6472, 352, 45281, 283, 19007, 75, 14870, 199, 5838, 30239, 3042, 3243, 14, 356, 1568, 548, 45344, 6226, 2356, 12, 199, 13775, 1105, 332, 10932, 410, 11293, 3100, 793, 199, 509, 48265, 398, 9869, 358, 328, 7921, 283, 428, 303, 6108, 14, 199, 26948, 665, 446, 405, 3109, 352, 828, 598, 937, 4325, 1, 199, 4945, 12044, 375, 3389, 1103, 5885, 4550, 12, 199, 1670, 332, 345, 9886], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,


{'input_ids': [199, 14814, 26, 11885, 84, 513, 40190, 199, 221, 199, 1974, 9471, 371, 12, 373, 9471, 371, 199, 1670, 9471, 730, 1827, 12, 199, 10127, 384, 373, 9134, 376, 9471, 1941, 14, 199, 44302, 497, 310, 3195, 283, 4438, 12, 199, 1340, 9143, 641, 4603, 12, 199, 43400, 641, 308, 17615, 401, 3584, 532, 4180, 14, 199, 199, 509, 12880, 345, 524, 26604, 1519, 12, 199, 3991, 3802, 345, 548, 375, 4275, 12, 199, 41381, 345, 524, 4900, 283, 8510, 496, 4267, 14, 199, 18907, 283, 48489, 4254, 371, 26, 199, 1630, 1448, 464, 548, 350, 1, 199, 1283, 499, 376, 355, 548, 523, 4254, 487, 1, 199, 1974, 278, 1172, 371, 12, 373, 278, 1172, 371, 199, 1670, 278, 1172, 730, 1827, 199, 3345, 25262, 401, 11885], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
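To make the row-building step concrete, the sketch below mimics on a plain list of integers what truncation with overflowing tokens followed by the minimum-length filter does. It is an illustration only: `split_into_rows` is a hypothetical helper, it assumes the tokenizer's stride is left at 0 (so rows do not overlap), and it ignores special tokens.

```python
def split_into_rows(token_ids, block_size, min_len):
    """Cut a token list into consecutive rows of at most block_size tokens,
    then drop rows shorter than min_len (mirrors the filtering step)."""
    rows = [token_ids[i:i + block_size] for i in range(0, len(token_ids), block_size)]
    return [row for row in rows if len(row) >= min_len]


token_ids = list(range(300))    # stand-in for one tokenized poem of 300 tokens
rows = split_into_rows(token_ids, block_size=128, min_len=30)
row_lengths = [len(row) for row in rows]
print(row_lengths)              # [128, 128, 44]

# With 270 tokens the last row has only 14 tokens and is dropped
short_tail = split_into_rows(list(range(270)), block_size=128, min_len=30)
print([len(row) for row in short_tail])  # [128, 128]
```

Dropping short tails avoids padding at the cost of discarding the last few tokens of some texts.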

Finetune the model

In [None]:
from transformers import Trainer, TrainingArguments, AutoModelForCausalLM

# AutoModelForCausalLM replaces the deprecated AutoModelWithLMHead
model = AutoModelForCausalLM.from_pretrained('benjamin/gerpt2-large')

repo_name = 'new-model'            # enter the name the final model should have (avoid spaces)
training_args = TrainingArguments(
    output_dir="./" + repo_name,       # the output directory
    overwrite_output_dir=True,         # overwrite the content of the output directory
    num_train_epochs=2,                # number of training epochs
    per_device_train_batch_size=2,     # batch size for training
    per_device_eval_batch_size=2,      # batch size for evaluation
    evaluation_strategy='epoch',       # evaluate at the end of every epoch
    load_best_model_at_end=False,
    warmup_steps=500,                  # number of warmup steps for learning rate scheduler
    prediction_loss_only=True,
    save_strategy='epoch',             # save a checkpoint at the end of every epoch
    #save_steps = 50
    #push_to_hub = True,         # could be done if you have a hub_token
    #hub_token = ''
    )


trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

loading configuration file https://huggingface.co/benjamin/gerpt2-large/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/f83b5b4fe2eb33c339c5ca4db6a67914df29da75325ac16bb883a63c2e3a46fe.6fc424aaf1bc0da1e10d2f49429646384b10cf78fb72792792e513f1bc9fecfc
Model config GPT2Config {
  "_name_or_path": "benjamin/gerpt2-large",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 1280,
  "n_head": 20,
  "n_inner": null,
  "n_layer": 36,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "su

In [None]:
trainer.train()
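During training, `warmup_steps=500` controls how the learning rate ramps up. The following sketch shows the shape of a linear warmup followed by linear decay. It assumes the linear scheduler and a base learning rate of 5e-5, which are taken here to be the `TrainingArguments` defaults since the notebook sets neither; `linear_schedule_lr` and `total_steps` are hypothetical names and values for illustration.

```python
def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


base_lr = 5e-5         # assumed default; not set explicitly in the notebook
total_steps = 10_000   # hypothetical; depends on dataset size, batch size, and epochs

for step in (0, 250, 500, 5_250, 10_000):
    print(step, linear_schedule_lr(step, base_lr, 500, total_steps))
```

With a small dataset, 500 warmup steps can be a large share of the whole run, so it may be worth comparing this value against the total step count that the trainer reports.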

Push the model to the Hugging Face Hub (a hub token is required)

In [None]:
trainer.push_to_hub()

Evaluate the final state of the model on the evaluation dataset

In [None]:
trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: overflow_to_sample_mapping. If overflow_to_sample_mapping are not expected by `GPT2LMHeadModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 4024
  Batch size = 2


{'epoch': 8.0,
 'eval_loss': 3.4149293899536133,
 'eval_runtime': 148.9224,
 'eval_samples_per_second': 27.021,
 'eval_steps_per_second': 13.51}
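The reported `eval_loss` can be turned into a perplexity, which is often easier to compare between language models. This is a minimal sketch that assumes `eval_loss` is the mean per-token cross-entropy measured with the natural logarithm, as is usual for causal language modeling with PyTorch.

```python
import math

eval_loss = 3.4149293899536133   # value reported by trainer.evaluate() above
perplexity = math.exp(eval_loss)
print(f'perplexity: {perplexity:.2f}')
```

A lower perplexity means the model assigns higher probability to the held-out text; values are only comparable between models that share the same tokenizer.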