Useful resources:
- https://www.philschmid.de/fine-tune-a-non-english-gpt-2-model-with-huggingface
- https://colab.research.google.com/drive/13dZVYEOMhXhkXWfvSMVM1TTtUDrT6Aeh?usp=sharing#scrollTo=gFsCTp_mporB

In [None]:
!pip install datasets
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# PyTorch
import torch

# HuggingFace
from datasets import load_dataset
from transformers import (
    AutoModelWithLMHead,
    DataCollatorForLanguageModeling,
    GPT2LMHeadModel,
    GPT2Tokenizer,
    Trainer,
    TrainingArguments,
)

In [None]:
# load dataset from huggingface
midjourney = load_dataset('succinctly/midjourney-prompts')




In [None]:
# load the GPT-2 tokenizer and register the special tokens
# (all GPT-2 checkpoints share one vocabulary, so this also matches the 'gpt2' model loaded below)
tokenizer = GPT2Tokenizer.from_pretrained(
    'gpt2-medium',
    bos_token='<|startoftext|>',
    eos_token='<|endoftext|>',
    pad_token='<|pad|>',
)

# tokenize the 'text' column in batches; truncate to the model's maximum length
# and pad each batch of examples to its longest sequence
def tokenize_func(item):
    return tokenizer(item['text'], truncation=True, padding=True)

tokenized_midjourney = midjourney.map(tokenize_func, batched=True)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



In [None]:
tokenized_midjourney

DatasetDict({
    validation: Dataset({
        features: ['text', 'input_ids', 'attention_mask'],
        num_rows: 12318
    })
    train: Dataset({
        features: ['text', 'input_ids', 'attention_mask'],
        num_rows: 221743
    })
    test: Dataset({
        features: ['text', 'input_ids', 'attention_mask'],
        num_rows: 12320
    })
})

In [None]:
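# load the pretrained GPT-2 (small) model with its language-modeling head
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = GPT2LMHeadModel.from_pretrained("gpt2")
# grow the embedding matrix to cover the special tokens added to the tokenizer
model.resize_token_embeddings(len(tokenizer))
model = model.to(device)
model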

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--gpt2/snapshots/75e09b43581151bd1d9ef6700faa605df408979f/config.json
Model config GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_vers

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50259, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dro
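
The `Embedding(50259, 768)` shape in the printout reflects the resize: GPT-2's original 50257 tokens plus the two newly added special tokens (`<|startoftext|>` and `<|pad|>`; `<|endoftext|>` is already in the vocabulary). Below is a minimal pure-Python sketch of what resizing does conceptually, assuming existing rows are kept and new rows are freshly initialized. The helper `resize_embeddings` is hypothetical, and the 0.02 standard deviation is borrowed from `initializer_range` in the config above; how the library actually initializes new rows may differ by version.

```python
import random

def resize_embeddings(weights, new_size, std=0.02, seed=0):
    """Return a copy of `weights` with rows added (or dropped) to reach `new_size`.

    Hypothetical helper for illustration only; not the transformers implementation.
    """
    rng = random.Random(seed)
    dim = len(weights[0])
    # keep the existing (pretrained) rows unchanged
    resized = [row[:] for row in weights[:new_size]]
    while len(resized) < new_size:
        # assumption: new rows start from small random values and are learned during fine-tuning
        resized.append([rng.gauss(0.0, std) for _ in range(dim)])
    return resized

old = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # toy 3-token vocabulary, 2-dim embeddings
new = resize_embeddings(old, new_size=5)    # two added special tokens
print(len(new), new[:3] == old)             # 5 True
```

Because the added rows carry no pretrained information, the tokenizer warning above asks that they be trained, which the fine-tuning run does.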

In [None]:
# create training arguments for the trainer
training_args = TrainingArguments(
    output_dir="./gpt_midjourney",  # directory for checkpoints and outputs
    overwrite_output_dir=True,  # overwrite existing contents of the output directory
    num_train_epochs=1,  # number of training epochs
    per_device_train_batch_size=16,  # training batch size per device
    per_device_eval_batch_size=16,  # evaluation batch size per device
    eval_steps=400,  # update steps between evaluations; only used when the evaluation strategy is "steps" (not set here, so no evaluation runs during training)
    save_steps=800,  # save a checkpoint every 800 update steps
    warmup_steps=500,  # number of warmup steps for the learning rate scheduler
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
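
With `warmup_steps=500` and the Trainer's default linear scheduler, the learning rate should ramp from 0 to its peak over the first 500 update steps and then decay linearly to 0 by the final step. The sketch below illustrates that shape, assuming the default peak learning rate of 5e-5 and the 13859 total optimization steps this run reports when training starts. `linear_warmup_lr` is a hypothetical helper, not a library function.

```python
def linear_warmup_lr(step, peak_lr=5e-5, warmup_steps=500, total_steps=13859):
    """Learning rate at a given update step under linear warmup, then linear decay."""
    if step < warmup_steps:
        # warmup phase: scale linearly from 0 up to peak_lr
        return peak_lr * step / max(1, warmup_steps)
    # decay phase: scale linearly from peak_lr down to 0 at total_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 250, 500, 7000, 13859):
    print(step, linear_warmup_lr(step))
```

The first loss value in the log is reported at step 500, which is also where the warmup ends under these settings.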


In [None]:
# set up the Trainer for causal language modeling (mlm=False)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
    train_dataset=tokenized_midjourney['train'],
    eval_dataset=tokenized_midjourney['validation']
)
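
With `mlm=False`, the collator prepares batches for causal language modeling: it pads each batch to a common length and uses a copy of `input_ids` as `labels`, with padding positions replaced by -100 so the loss ignores them (the model shifts the labels internally). Here is a rough pure-Python sketch of that behavior; `collate_causal_lm` and the toy token IDs are illustrative assumptions, not the library's implementation.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips

def collate_causal_lm(examples, pad_token_id):
    """Pad token-ID lists on the right and build labels for causal LM training."""
    max_len = max(len(ids) for ids in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ids in examples:
        n_pad = max_len - len(ids)
        padded = ids + [pad_token_id] * n_pad
        batch["input_ids"].append(padded)
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
        # labels mirror the inputs, except that pad tokens are masked out
        batch["labels"].append(
            [IGNORE_INDEX if tok == pad_token_id else tok for tok in padded]
        )
    return batch

batch = collate_causal_lm([[11, 12, 13], [21]], pad_token_id=0)
print(batch["labels"])  # [[11, 12, 13], [21, -100, -100]]
```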

In [None]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: text. If text are not expected by `GPT2LMHeadModel.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 221743
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 13859
  Number of trainable parameters = 124441344
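
The 13859 optimization steps follow from the dataset and batch sizes: one epoch over 221743 examples at 16 examples per step, with a final partial batch. A quick check, assuming a single device and no gradient accumulation as the log states:

```python
import math

num_examples = 221743
batch_size = 16
grad_accum_steps = 1
epochs = 1

# batches per epoch, rounding up for the last partial batch
batches_per_epoch = math.ceil(num_examples / batch_size)
total_steps = math.ceil(batches_per_epoch / grad_accum_steps) * epochs
print(total_steps)  # 13859

# with save_steps=800, checkpoints land on multiples of 800 up to total_steps
print(total_steps // 800)  # 17
```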


Step    Training Loss
 500    4.7066
1000    3.6784
1500    3.4482
2000    3.2776
2500    3.154
3000    2.9911
3500    2.9218
4000    2.8528
4500    2.8242
5000    2.7523
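
Assuming the reported training loss is the mean per-token cross-entropy in nats (the usual convention for this kind of language-modeling loss), it can be read as a perplexity by exponentiating it. A small sketch using a few of the logged values above; `perplexity` is an illustrative helper:

```python
import math

# a few (step, training loss) pairs copied from the log above
logged_losses = {500: 4.7066, 1000: 3.6784, 2500: 3.154, 5000: 2.7523}

def perplexity(loss):
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(loss)

for step, loss in logged_losses.items():
    print(f"step {step}: loss {loss:.4f} -> perplexity {perplexity(loss):.1f}")
```

These are training-set figures; no validation loss is logged during this run.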


Saving model checkpoint to ./gpt_midjourney/checkpoint-800
Configuration saved in ./gpt_midjourney/checkpoint-800/config.json
Model weights saved in ./gpt_midjourney/checkpoint-800/pytorch_model.bin
Saving model checkpoint to ./gpt_midjourney/checkpoint-1600
Configuration saved in ./gpt_midjourney/checkpoint-1600/config.json
Model weights saved in ./gpt_midjourney/checkpoint-1600/pytorch_model.bin
Saving model checkpoint to ./gpt_midjourney/checkpoint-2400
Configuration saved in ./gpt_midjourney/checkpoint-2400/config.json
Model weights saved in ./gpt_midjourney/checkpoint-2400/pytorch_model.bin
Saving model checkpoint to ./gpt_midjourney/checkpoint-3200
Configuration saved in ./gpt_midjourney/checkpoint-3200/config.json
Model weights saved in ./gpt_midjourney/checkpoint-3200/pytorch_model.bin
Saving model checkpoint to ./gpt_midjourney/checkpoint-4000
Configuration saved in ./gpt_midjourney/checkpoint-4000/config.json
Model weights saved in ./gpt_midjourney/checkpoint-4000/pytorch_mod

In [None]:
!nvidia-smi