# GPT2 Language Model Fine-tuning with Texts from Shakespeare
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/fcakyon/gpt2-shakespeare/blob/main/gpt2-shakespeare.ipynb)

## 0. Install requirements

In [71]:
!pip install -U transformers datasets torch sentencepiece pyyaml

Requirement already up-to-date: transformers in /usr/local/lib/python3.7/dist-packages (4.7.0)
Requirement already up-to-date: datasets in /usr/local/lib/python3.7/dist-packages (1.8.0)
Requirement already up-to-date: torch in /usr/local/lib/python3.7/dist-packages (1.9.0+cu102)
Requirement already up-to-date: sentencepiece in /usr/local/lib/python3.7/dist-packages (0.1.96)


## 1. Initialize Model and Tokenizer

- Import required modules:

In [72]:
import torch
import math
from transformers import GPT2Tokenizer, GPT2LMHeadModel, HfArgumentParser, TrainingArguments, Trainer, default_data_collator
from datasets import load_dataset

- Initialize a GPT2 model with a language modelling head:

In [73]:
model = GPT2LMHeadModel.from_pretrained('gpt2')

- Initialize GPT2 tokenizer:

In [74]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

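- Resize the model's token embeddings to match the tokenizer vocabulary (a no-op here, since no tokens were added):

In [75]: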
model.resize_token_embeddings(len(tokenizer))

Embedding(50257, 768)

## 2. Initialize Dataset

- Download the Shakespeare dataset from the [Hugging Face datasets hub](https://github.com/huggingface/datasets/blob/master/datasets/tiny_shakespeare/tiny_shakespeare.py):

In [76]:
# download and load the dataset from the hub
dataset_name = "tiny_shakespeare"
cache_dir = "lm_dataset/"
datasets = load_dataset(dataset_name, cache_dir=cache_dir)

Using custom data configuration default
Reusing dataset tiny_shakespeare (lm_dataset/tiny_shakespeare/default/1.0.0/b5b13969f09fe8707337f6cb296314fbe06960bd9a868dca39e713e163d27b5e)


- Tokenize all the texts:

In [77]:
column_names = datasets["train"].column_names
text_column_name = "text" if "text" in column_names else column_names[0]

def tokenize_function(examples):
    # tokenize the raw text; nothing is truncated here, splitting into
    # blocks of the model's max input length happens in the next step
    output = tokenizer(examples[text_column_name])
    return output

# tokenize dataset
tokenized_datasets = datasets.map(
    tokenize_function,
    batched=True,
    remove_columns=column_names,
    desc="Running tokenizer on dataset",
)

Loading cached processed dataset at lm_dataset/tiny_shakespeare/default/1.0.0/b5b13969f09fe8707337f6cb296314fbe06960bd9a868dca39e713e163d27b5e/cache-86ddf4b956c3d870.arrow
Loading cached processed dataset at lm_dataset/tiny_shakespeare/default/1.0.0/b5b13969f09fe8707337f6cb296314fbe06960bd9a868dca39e713e163d27b5e/cache-8f8c3b07ee3d7fd9.arrow
Loading cached processed dataset at lm_dataset/tiny_shakespeare/default/1.0.0/b5b13969f09fe8707337f6cb296314fbe06960bd9a868dca39e713e163d27b5e/cache-b018bddd97924ae8.arrow


- Concatenate the tokenized texts and split them into fixed-size blocks:

In [78]:
# get block size (max input length of the model)
block_size = tokenizer.model_max_length
if block_size > 1024:
    block_size = 1024
    
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the small remainder. Padding could be added instead if the model supported it;
    # customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

# split total dataset into smaller sets of length block_size
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    desc=f"Grouping texts in chunks of {block_size}",
)

Loading cached processed dataset at lm_dataset/tiny_shakespeare/default/1.0.0/b5b13969f09fe8707337f6cb296314fbe06960bd9a868dca39e713e163d27b5e/cache-6991fd94a3bf1e9f.arrow
Loading cached processed dataset at lm_dataset/tiny_shakespeare/default/1.0.0/b5b13969f09fe8707337f6cb296314fbe06960bd9a868dca39e713e163d27b5e/cache-b046afe72f1f5889.arrow
Loading cached processed dataset at lm_dataset/tiny_shakespeare/default/1.0.0/b5b13969f09fe8707337f6cb296314fbe06960bd9a868dca39e713e163d27b5e/cache-ddf2d61014c59483.arrow
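
The concatenate-then-chunk logic is easier to follow on a toy batch. The sketch below is a standalone illustration, not the cell above: `group_texts_toy`, the token ids, and `block_size=4` are made up for the example. It shows that texts are joined across example boundaries and that the tail that does not fill a whole block is dropped.

```python
# Toy version of the grouping step (hypothetical helper, illustrative values).
def group_texts_toy(examples, block_size):
    # Concatenate every column's lists into one long list.
    concatenated = {k: sum(v, []) for k, v in examples.items()}
    total_length = len(concatenated["input_ids"])
    # Drop the remainder that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal language modelling, the labels are a copy of the inputs.
    result["labels"] = [block[:] for block in result["input_ids"]]
    return result


toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}
grouped = group_texts_toy(toy_batch, block_size=4)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Tokens 9 and 10 are discarded because they do not fill a third block. The labels can be an unshifted copy of `input_ids` because `GPT2LMHeadModel` shifts them by one position internally when it computes the loss.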


In [79]:
train_dataset = lm_datasets["train"]
eval_dataset = lm_datasets["validation"]

## 3. Initialize Trainer

In [80]:
training_args = TrainingArguments(
    output_dir="output/",
    per_device_train_batch_size=1,
    num_train_epochs=30,
    save_total_limit=1,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    # Data collator will default to DataCollatorWithPadding, so we change it.
    data_collator=default_data_collator,
)
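
To get a feel for how long training will run with these arguments, the number of optimizer steps can be estimated up front. The helper below is a rough sketch that mirrors how the step count is usually derived for a single device; `total_optimization_steps` and the example sizes are hypothetical and not taken from this notebook.

```python
import math


def total_optimization_steps(num_examples, batch_size, num_epochs, grad_accum_steps=1):
    """Estimate optimizer steps for single-device training (illustrative helper)."""
    # Batches per epoch, keeping the last partial batch.
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # One optimizer step every `grad_accum_steps` batches, at least one per epoch.
    steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return math.ceil(num_epochs * steps_per_epoch)


# Hypothetical dataset of 200 blocks, batch size 1, 30 epochs.
print(total_optimization_steps(200, 1, 30))  # 6000
```

With `per_device_train_batch_size=1`, every block is its own optimizer step, so the step count grows linearly with both the number of blocks and `num_train_epochs`.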

## 4. Perform Training

In [81]:
# perform training
train_result = trainer.train()

# save the model (the tokenizer is saved too, since it was passed to the Trainer)
trainer.save_model()

# save training metrics
metrics = train_result.metrics
trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)

# save training state
trainer.save_state()

| Step | Training Loss |
|------|---------------|
| 500  | 3.4939 |
| 1000 | 3.1553 |
| 1500 | 2.9578 |
| 2000 | 2.7512 |
| 2500 | 2.5805 |
| 3000 | 2.4314 |
| 3500 | 2.2775 |
| 4000 | 2.1438 |
| 4500 | 2.0281 |
| 5000 | 1.942 |


## 5. Evaluate Model

In [82]:
# perform evaluation over validation data
metrics = trainer.evaluate()

# calculate perplexity
try:
    perplexity = math.exp(metrics["eval_loss"])
except OverflowError:
    perplexity = float("inf")
    
# save perplexity
metrics["perplexity"] = perplexity
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)
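
Perplexity is the exponential of the mean per-token negative log-likelihood, which is what `eval_loss` reports for a causal language model. The standalone sketch below works through that relationship on made-up numbers; `perplexity_from_loss` and `token_probs` are hypothetical names and values chosen for illustration.

```python
import math


def perplexity_from_loss(mean_nll):
    """Return exp(mean_nll), falling back to infinity if exp overflows."""
    try:
        return math.exp(mean_nll)
    except OverflowError:
        return float("inf")


# Hypothetical probabilities the model assigned to each correct next token.
token_probs = [0.25, 0.5, 0.125, 0.25]
mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)

# The geometric mean of these probabilities is 1/4, so perplexity is about 4.
print(perplexity_from_loss(mean_nll))
# A very large loss overflows math.exp and is reported as infinity.
print(perplexity_from_loss(1000.0))
```

A perplexity of about 4 means the model is, on average, as uncertain as if it were choosing uniformly among four tokens at each position.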

## 6. Generate Samples

In [87]:
# fix seed
import torch
torch.manual_seed(2)

# tokenize start of a sentence and move it to the model's device
ids = tokenizer.encode('One does not simply walk into',
                       return_tensors='pt').to(model.device)

# generate a sample with top-p (nucleus) sampling; top_k=0 disables top-k filtering
sample_output = model.generate(
    ids, 
    do_sample=True, 
    max_length=100, 
    top_p=0.92, 
    top_k=0
)

# print generated texts
print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
One does not simply walk into a room,
And with his or her eyes fix fixed on nothing;
Consenting nothing but his or her images and thoughts;
And, with himself or nobleness, nothing but himself or
his or her images: so he, nothing but himself,
His image flatters himself: just as two men
Are one flat, one flat, one good, one good for another;
For if he did not wish to be but one image,
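
Top-p (nucleus) sampling, used in the cell above via `top_p=0.92`, restricts each sampling step to the smallest set of most likely tokens whose cumulative probability reaches the threshold. The sketch below illustrates the idea in plain Python on a toy distribution. It is not the library's implementation, and `top_p_filter`, `sample_top_p`, and `toy_probs` are hypothetical names introduced for this example.

```python
import random


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative mass reaches top_p, then renormalize."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for token_id, p in ranked:
        kept.append((token_id, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token_id: p / total for token_id, p in kept}


def sample_top_p(probs, top_p, rng):
    """Draw one token id from the renormalized nucleus."""
    filtered = top_p_filter(probs, top_p)
    token_ids = list(filtered)
    weights = [filtered[t] for t in token_ids]
    return rng.choices(token_ids, weights=weights, k=1)[0]


# Toy next-token distribution over a four-token vocabulary.
toy_probs = [0.5, 0.3, 0.15, 0.05]
filtered = top_p_filter(toy_probs, top_p=0.92)
print(sorted(filtered))  # [0, 1, 2]

rng = random.Random(2)  # seeded for repeatability
print(sample_top_p(toy_probs, 0.92, rng))
```

The least likely token (id 3) is never sampled because the first three tokens already cover 95% of the probability mass. A lower `top_p` shrinks the nucleus and makes the output more conservative; a value close to 1 approaches unrestricted sampling.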


## 7. Push to Hub

In [None]:
kwargs = {"finetuned_from": "gpt2", "tasks": "text-generation"}
kwargs["dataset_tags"] = "tiny_shakespeare"
kwargs["dataset_args"] = "default"
kwargs["dataset"] = "tiny_shakespeare default"

trainer.push_to_hub(**kwargs)